Drill · Solutions Hidden

SQL Problems Worked Out

11 SQL problems that cover the patterns interviewers actually test. Multiple approaches each. Click to reveal only after you've tried.

🎯 11 problems 🐘 Postgres / Snowflake / BigQuery dialects ⏱ 15-25 min each 🔁 Progress saved locally

1 · Latest row per group

Window · ROW_NUMBER · The single most common SQL interview pattern

Prompt: Given an orders table with customer_id, order_id, ordered_at, status, return one row per customer showing their most recent order.

schema
orders(customer_id, order_id, ordered_at, status)
Approach 1 — ROW_NUMBER (most readable)
latest_per_customer.sql
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY ordered_at DESC, order_id DESC  -- tiebreaker!
         ) AS rn
  FROM orders
)
SELECT customer_id, order_id, ordered_at, status
FROM ranked
WHERE rn = 1;

Time: O(n log n) for the sort. Tiebreaker: always include a second column in ORDER BY for stability — without it, ties are non-deterministic.

Approach 2 — QUALIFY (Snowflake / BigQuery / Databricks)
with QUALIFY
SELECT customer_id, order_id, ordered_at, status
FROM orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY ordered_at DESC, order_id DESC
) = 1;

Cleaner. Note Postgres doesn't support QUALIFY — know both forms.

Approach 3 — correlated subquery (works everywhere, slowest)
correlated
SELECT *
FROM orders o
WHERE ordered_at = (
  SELECT MAX(ordered_at)
  FROM orders o2
  WHERE o2.customer_id = o.customer_id
);

Returns multiple rows per customer when ordered_at ties. Don't lead with it, but mention that it exists.

What to say at the end

"ROW_NUMBER with a tiebreaker is the canonical answer. QUALIFY makes it cleaner on warehouses that support it. The thing I'd verify with the team: what should happen on ties — earliest order_id wins, latest wins, or is it undefined?"

2 · Top-N per group

Window · Variation on #1 — keep K rows per partition, not just 1

Prompt: Return the top 3 highest-revenue orders per customer.

Solution — ROW_NUMBER ≤ N
top_n_per_group.sql
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY revenue DESC, order_id DESC
         ) AS rn
  FROM orders
)
SELECT *
FROM ranked
WHERE rn <= 3;
Tie handling — RANK vs DENSE_RANK vs ROW_NUMBER

If two orders tie at $1000 revenue, who's in the "top 3"?

  • ROW_NUMBER: returns exactly 3 rows, ties broken arbitrarily by the tiebreaker.
  • RANK: can return more than 3 rows when there's a tie at rank 3 (ranks 1, 2, 3, 3 mean four rows pass rank <= 3).
  • DENSE_RANK: returns "top 3 distinct revenue levels" — can be many rows.

Always clarify in the interview which the business wants.
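
If the answer is "include ties", swapping in RANK is a one-line change. A sketch, using the same columns as the solution above:

top_n_with_ties.sql
WITH ranked AS (
  SELECT *,
         RANK() OVER (
           PARTITION BY customer_id
           ORDER BY revenue DESC        -- no tiebreaker: tied rows share a rank
         ) AS rk
  FROM orders
)
SELECT *
FROM ranked
WHERE rk <= 3;   -- may return more than 3 rows per customer when ties exist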

3 · Running totals

Window · Cumulative

Prompt: Given daily_revenue(date, revenue), return cumulative revenue as of each date.

Solution — SUM OVER with explicit frame
running_total.sql
SELECT
  date,
  revenue,
  SUM(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM daily_revenue
ORDER BY date;

Always specify the frame explicitly. The default with ORDER BY is RANGE-based and can behave unexpectedly with ties.
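
To see the difference side by side, a quick sketch; the two columns diverge only if daily_revenue ever has duplicate dates (hypothetical here), because the RANGE frame includes all peer rows with the same ORDER BY value:

range_vs_rows.sql
SELECT
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date) AS cumulative_range,  -- default RANGE frame: peers included
  SUM(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_rows                                     -- strictly one row at a time
FROM daily_revenue
ORDER BY date;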

Variation — cumulative per group
running_total_per_customer.sql
SELECT
  customer_id,
  ordered_at,
  revenue,
  SUM(revenue) OVER (
    PARTITION BY customer_id
    ORDER BY ordered_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS lifetime_revenue
FROM orders;
Variation — 7-day moving average
moving_avg_7d.sql
SELECT
  date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS revenue_7d_avg
FROM daily_revenue;

Note: this is row-based, so it assumes one row per day. If days are missing, you'd left-join to a date dim first.
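
A Postgres sketch of that fill step, with hypothetical 2024 bounds standing in for a real calendar table:

moving_avg_7d_filled.sql
WITH date_spine AS (
  -- one row per calendar day; the bounds here are assumptions for the sketch
  SELECT generate_series('2024-01-01'::date, '2024-12-31'::date, '1 day')::date AS date
),
filled AS (
  SELECT
    ds.date,
    COALESCE(dr.revenue, 0) AS revenue   -- missing days count as zero revenue
  FROM date_spine ds
  LEFT JOIN daily_revenue dr ON dr.date = ds.date
)
SELECT
  date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS revenue_7d_avg
FROM filled;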

4 · Gaps and islands

Window · Classic · The classic SQL puzzle

Prompt: Given a subscriptions table with one row per (customer, day) the customer was subscribed, identify continuous "subscription periods" (islands). Return one row per period with customer_id, start_date, end_date.

schema & expected
INPUT: subscriptions(customer_id, date)
  101, 2024-01-01
  101, 2024-01-02
  101, 2024-01-03
  101, 2024-01-05  -- gap on Jan 4
  101, 2024-01-06
  202, 2024-01-01

OUTPUT:
  101, 2024-01-01, 2024-01-03
  101, 2024-01-05, 2024-01-06
  202, 2024-01-01, 2024-01-01
The insight — group by "date minus row number"

For consecutive dates, date - ROW_NUMBER() is constant. When there's a gap, the value jumps. Use that as the group key.
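
Worked through on customer 101 from the sample above:

island_key worked example
  date         rn   date - rn days
  2024-01-01    1    2023-12-31   -- island 1
  2024-01-02    2    2023-12-31   -- island 1
  2024-01-03    3    2023-12-31   -- island 1
  2024-01-05    4    2024-01-01   -- island 2: the Jan 4 gap makes the key jump
  2024-01-06    5    2024-01-01   -- island 2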

Solution — the classic gaps-and-islands
gaps_and_islands.sql
WITH numbered AS (
  SELECT
    customer_id,
    date,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) AS rn
  FROM subscriptions
),
grouped AS (
  SELECT
    customer_id,
    date,
    date - INTERVAL '1 day' * rn AS island_key  -- groups consecutive dates
  FROM numbered
)
SELECT
  customer_id,
  MIN(date) AS start_date,
  MAX(date) AS end_date
FROM grouped
GROUP BY customer_id, island_key
ORDER BY customer_id, start_date;

Time: O(n log n) from the window sort. The trick: the date - INTERVAL '1 day' * rn stays constant within an island and jumps at gaps.

Variation — using LAG for gap detection

Alternative approach: detect gaps with LAG, mark a "new island" flag, cumulative-sum it.

lag_approach.sql
WITH with_gap AS (
  SELECT
    customer_id,
    date,
    CASE
      WHEN LAG(date) OVER (PARTITION BY customer_id ORDER BY date)
           = date - INTERVAL '1 day' THEN 0
      ELSE 1
    END AS is_new_island
  FROM subscriptions
),
with_island_id AS (
  SELECT
    customer_id,
    date,
    SUM(is_new_island) OVER (
      PARTITION BY customer_id ORDER BY date
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS island_id
  FROM with_gap
)
SELECT customer_id, MIN(date) AS start_date, MAX(date) AS end_date
FROM with_island_id
GROUP BY customer_id, island_id;

More explicit. Some interviewers prefer this because the logic reads top-to-bottom.

5 · Sessionization

Window · Cousin of gaps-and-islands, applied to event streams

Prompt: Given events(user_id, event_ts), identify "sessions" where consecutive events for the same user are within 30 minutes. Return session_id, user_id, session_start, session_end, event_count.

Solution — LAG + cumulative-sum island pattern
sessionize.sql
WITH with_gap AS (
  SELECT
    user_id,
    event_ts,
    CASE
      WHEN LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts)
           >= event_ts - INTERVAL '30 minutes' THEN 0
      ELSE 1
    END AS is_new_session
  FROM events
),
with_session_id AS (
  SELECT
    user_id,
    event_ts,
    SUM(is_new_session) OVER (
      PARTITION BY user_id ORDER BY event_ts
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS session_num
  FROM with_gap
)
SELECT
  user_id || '_' || session_num AS session_id,
  user_id,
  MIN(event_ts) AS session_start,
  MAX(event_ts) AS session_end,
  COUNT(*)      AS event_count
FROM with_session_id
GROUP BY user_id, session_num
ORDER BY user_id, session_start;

Same trick as gaps-and-islands. The first event of each user gets is_new_session = 1 (because LAG is NULL). The cumulative sum becomes the session sequence.
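
A common follow-up is "how long are sessions?". A Postgres sketch, assuming the result above is saved as a CTE or view named sessions:

session_stats.sql
SELECT
  AVG(EXTRACT(EPOCH FROM (session_end - session_start)) / 60.0) AS avg_session_minutes,
  AVG(event_count)                                              AS avg_events_per_session
FROM sessions;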

6 · Dedup with tiebreaker

Window · The "we accidentally have duplicate rows" problem

Prompt: A users table has duplicate rows for the same email from imperfect deduplication. Keep one row per email: the most recently updated one, and among those, the one with non-null phone if any.

Solution — multi-key ORDER BY in ROW_NUMBER
dedup_users.sql
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY LOWER(TRIM(email))
           ORDER BY
             updated_at DESC,
             (phone IS NOT NULL) DESC,  -- prefer non-null phones
             user_id DESC                -- final tiebreaker for determinism
         ) AS rn
  FROM users
)
SELECT *
FROM ranked
WHERE rn = 1;

Notes:

  • LOWER(TRIM(email)) for the partition catches case + whitespace variations.
  • (phone IS NOT NULL) DESC exploits boolean ordering (TRUE > FALSE) for preference.
  • Always include a final tiebreaker (e.g. user_id) — otherwise ties are non-deterministic.
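
If the follow-up is to physically delete the losers rather than select the survivors, the same ranking drives a DELETE. A Postgres sketch, assuming user_id uniquely identifies rows:

dedup_delete.sql
WITH ranked AS (
  SELECT user_id,
         ROW_NUMBER() OVER (
           PARTITION BY LOWER(TRIM(email))
           ORDER BY updated_at DESC, (phone IS NOT NULL) DESC, user_id DESC
         ) AS rn
  FROM users
)
DELETE FROM users
WHERE user_id IN (SELECT user_id FROM ranked WHERE rn > 1);  -- keep rn = 1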

7 · Cohort retention

Window · Business · The metric every product company asks for

Prompt: Given users(user_id, signup_date) and events(user_id, event_date), build a cohort retention table: rows are signup-month cohorts, columns are months-since-signup, values are % of cohort that had at least one event that month.

The shape of the answer
cohort_month | months_since_signup | active_users | cohort_size | retention_pct
2024-01      | 0                   | 1000         | 1000        | 100.0%
2024-01      | 1                   | 620          | 1000        | 62.0%
2024-01      | 2                   | 410          | 1000        | 41.0%
2024-02      | 0                   | 800          | 800         | 100.0%
...
Solution — cohort table
cohort_retention.sql
WITH cohorts AS (
  SELECT user_id,
         DATE_TRUNC('month', signup_date) AS cohort_month
  FROM users
),
user_activity AS (
  SELECT DISTINCT
         user_id,
         DATE_TRUNC('month', event_date) AS activity_month
  FROM events
),
joined AS (
  SELECT
    c.cohort_month,
    c.user_id,
    ua.activity_month,
    DATEDIFF('month', c.cohort_month, ua.activity_month) AS months_since_signup  -- Snowflake-style DATEDIFF; adjust for your dialect
  FROM cohorts c
  LEFT JOIN user_activity ua ON ua.user_id = c.user_id
),
cohort_sizes AS (
  SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
  FROM cohorts
  GROUP BY 1
)
SELECT
  j.cohort_month,
  j.months_since_signup,
  COUNT(DISTINCT j.user_id) AS active_users,
  cs.cohort_size,
  ROUND(100.0 * COUNT(DISTINCT j.user_id) / cs.cohort_size, 2) AS retention_pct
FROM joined j
JOIN cohort_sizes cs USING (cohort_month)
WHERE j.months_since_signup IS NOT NULL
  AND j.months_since_signup >= 0
GROUP BY j.cohort_month, j.months_since_signup, cs.cohort_size
ORDER BY j.cohort_month, j.months_since_signup;

For a pivoted view (cohorts as rows, M0 / M1 / M2 as columns), wrap in a final SUM(CASE WHEN ... ) pivot.
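
A sketch of that wrapper, assuming the query above is saved as a CTE named retention and showing only the first three months:

cohort_retention_pivoted.sql
SELECT
  cohort_month,
  -- MAX or SUM both work: there is one row per (cohort_month, months_since_signup) cell
  MAX(CASE WHEN months_since_signup = 0 THEN retention_pct END) AS m0,
  MAX(CASE WHEN months_since_signup = 1 THEN retention_pct END) AS m1,
  MAX(CASE WHEN months_since_signup = 2 THEN retention_pct END) AS m2
FROM retention
GROUP BY cohort_month
ORDER BY cohort_month;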

8 · Median in SQL

Window · The classic "no MEDIAN built-in" problem

Prompt: Compute the median latency_ms for each model_id from inference_requests.

Approach 1 — PERCENTILE_CONT (modern warehouses)
percentile_cont.sql
-- Postgres / Snowflake: aggregate form with WITHIN GROUP
-- (BigQuery exposes PERCENTILE_CONT as a window function instead of WITHIN GROUP)
SELECT
  model_id,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY latency_ms) AS median_latency
FROM inference_requests
GROUP BY model_id;

-- Snowflake also has APPROX_PERCENTILE for huge data (BigQuery's analogue is APPROX_QUANTILES)
SELECT
  model_id,
  APPROX_PERCENTILE(latency_ms, 0.5) AS approx_median
FROM inference_requests
GROUP BY model_id;
Approach 2 — manual median (if PERCENTILE_CONT not available)
manual_median.sql
WITH ranked AS (
  SELECT
    model_id,
    latency_ms,
    ROW_NUMBER() OVER (PARTITION BY model_id ORDER BY latency_ms) AS rn,
    COUNT(*)    OVER (PARTITION BY model_id) AS total
  FROM inference_requests
)
SELECT
  model_id,
  AVG(latency_ms::FLOAT) AS median_latency  -- AVG handles even-count case (avg of two middle values)
FROM ranked
WHERE rn IN (
  (total + 1) / 2,      -- middle when odd
  (total + 2) / 2       -- second middle when even
)
GROUP BY model_id;

Trick: the two indices (total+1)/2 and (total+2)/2 are equal when total is odd (just the middle) and differ by 1 when even (the two middle values, which we average).

9 · Period-over-period change

Window · LAG

Prompt: Given monthly_revenue(month, revenue), compute month-over-month % change. Handle the first month (no prior) and division by zero.

Solution — LAG + NULLIF
mom_change.sql
SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
  revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change,
  ROUND(
    100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
          / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
    2
  ) AS mom_pct_change
FROM monthly_revenue
ORDER BY month;

Key details:

  • First row: LAG returns NULL → mom_change and mom_pct_change are NULL. Correct behavior.
  • NULLIF(LAG(...), 0) guards division by zero — without it, prior-month zero throws.
  • You can DRY this up by computing prev_revenue in a CTE first (sketched below); in interview code, the inline form is fine.
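
The CTE form, for reference:

mom_change_cte.sql
WITH with_prev AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue
  FROM monthly_revenue
)
SELECT
  month,
  revenue,
  prev_revenue,
  revenue - prev_revenue AS mom_change,
  ROUND(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 2) AS mom_pct_change
FROM with_prev
ORDER BY month;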

10 · Funnel analysis

Window · Conditional · The product-analytics classic

Prompt: Given events(user_id, event_type, occurred_at) with event types signup, visited_pricing, started_checkout, completed_purchase, build a funnel: count users who hit each stage in order.

Solution — MIN-per-event with order check
funnel.sql
WITH user_funnel AS (
  SELECT
    user_id,
    MIN(CASE WHEN event_type = 'signup'             THEN occurred_at END) AS signup_at,
    MIN(CASE WHEN event_type = 'visited_pricing'    THEN occurred_at END) AS pricing_at,
    MIN(CASE WHEN event_type = 'started_checkout'   THEN occurred_at END) AS checkout_at,
    MIN(CASE WHEN event_type = 'completed_purchase' THEN occurred_at END) AS purchase_at
  FROM events
  GROUP BY user_id
)
SELECT
  COUNT(*)                                                  AS users_signed_up,
  COUNT(*) FILTER (WHERE pricing_at  >= signup_at)          AS reached_pricing,
  COUNT(*) FILTER (WHERE checkout_at >= pricing_at)         AS started_checkout,
  COUNT(*) FILTER (WHERE purchase_at >= checkout_at)        AS completed_purchase
FROM user_funnel
WHERE signup_at IS NOT NULL;

Why MIN per event: first occurrence of each event type per user.

Why >=: enforces order. A user who visited pricing before signing up shouldn't count for that stage. If you want strict ordering, use > instead.

FILTER (WHERE ...) is Postgres syntax; Snowflake and BigQuery don't support it (they offer COUNT_IF / COUNTIF instead). The fully portable form is COUNT(CASE WHEN ... THEN 1 END), sketched below.
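
The portable rewrite of the stage counts, reusing the user_funnel CTE from above:

funnel_portable.sql
SELECT
  COUNT(*)                                                AS users_signed_up,
  COUNT(CASE WHEN pricing_at  >= signup_at   THEN 1 END)  AS reached_pricing,
  COUNT(CASE WHEN checkout_at >= pricing_at  THEN 1 END)  AS started_checkout,
  COUNT(CASE WHEN purchase_at >= checkout_at THEN 1 END)  AS completed_purchase
FROM user_funnel
WHERE signup_at IS NOT NULL;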

Follow-up: conversion rate between stages
funnel_rates.sql
WITH counts AS (
  -- (above query, but selecting just the four counts)
)
SELECT
  users_signed_up,
  reached_pricing,
  ROUND(100.0 * reached_pricing / NULLIF(users_signed_up, 0), 2)  AS signup_to_pricing_pct,
  started_checkout,
  ROUND(100.0 * started_checkout / NULLIF(reached_pricing, 0), 2) AS pricing_to_checkout_pct,
  completed_purchase,
  ROUND(100.0 * completed_purchase / NULLIF(started_checkout, 0), 2) AS checkout_to_purchase_pct
FROM counts;

11 · Pivot dynamically

Pivot · Aggregate

Prompt: Given orders(customer_id, quarter, revenue), return one row per customer with columns Q1, Q2, Q3, Q4 showing revenue per quarter.

Approach 1 — CASE WHEN (works everywhere)
pivot_case_when.sql
SELECT
  customer_id,
  SUM(CASE WHEN quarter = 'Q1' THEN revenue ELSE 0 END) AS q1_revenue,
  SUM(CASE WHEN quarter = 'Q2' THEN revenue ELSE 0 END) AS q2_revenue,
  SUM(CASE WHEN quarter = 'Q3' THEN revenue ELSE 0 END) AS q3_revenue,
  SUM(CASE WHEN quarter = 'Q4' THEN revenue ELSE 0 END) AS q4_revenue
FROM orders
GROUP BY customer_id;

Use ELSE 0 to avoid NULLs in the sum; omit the ELSE if you want NULL to signal "no data."

Approach 2 — PIVOT operator (Snowflake / BigQuery / Databricks)
pivot_native.sql
-- Snowflake / BigQuery
SELECT *
FROM orders
PIVOT (
  SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);

Less portable but cleaner when available.

What if the quarters are dynamic?

SQL pivots require the column values to be known when the query is written. For truly dynamic pivots (quarters unknown at query time), you have three options:

  1. Two-step: query the distinct quarter values first, then construct the pivot SQL dynamically in application code or a macro.
  2. In dbt: the dbt_utils.pivot macro generates the CASE expressions at compile time (rough sketch below).
  3. Just don't pivot in SQL: return long-format data and let the BI tool pivot. Often the right answer.
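
A rough dbt sketch of option 2; the parameter names follow dbt_utils conventions, so check the package docs for the version you have installed:

revenue_by_quarter.sql (dbt model)
SELECT
  customer_id,
  -- dbt_utils.get_column_values reads the distinct quarters at compile time,
  -- and dbt_utils.pivot expands them into SUM(CASE WHEN ...) columns
  {{ dbt_utils.pivot(
      'quarter',
      dbt_utils.get_column_values(ref('orders'), 'quarter'),
      agg='sum',
      then_value='revenue'
  ) }}
FROM {{ ref('orders') }}
GROUP BY customer_id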

Drill protocol

  1. Toggle Drill mode at the top to hide solutions.
  2. Set a timer — 20-25 min for first pass per problem, 10 min for re-solves.
  3. Write the SQL on paper or in a real warehouse, not just in your head. Watching the query execute teaches you something the text doesn't.
  4. State complexity and edge cases out loud as you finish.
  5. Reveal. Note structural differences.
  6. Mark practiced. Re-solve the next day from memory.
Where these come from

These 11 cover roughly 80% of the patterns you'll see in SQL interviews for data analytics engineering roles. Each one maps to a concept in 04-sql-deep-dive. If a problem feels hard, the deep-dive section on that pattern is your remediation.