SQL Problems Worked Out
11 SQL problems that cover the patterns interviewers actually test. Multiple approaches each. Click to reveal only after you've tried.
1. Latest row per group
Window · ROW_NUMBER · The single most common SQL interview pattern
Prompt: Given an orders table with customer_id, order_id, ordered_at, status, return one row per customer showing their most recent order.
orders(customer_id, order_id, ordered_at, status)
Approach 1 — ROW_NUMBER (most readable)
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY ordered_at DESC, order_id DESC -- tiebreaker!
) AS rn
FROM orders
)
SELECT customer_id, order_id, ordered_at, status
FROM ranked
WHERE rn = 1;
Time: O(n log n) for the sort. Tiebreaker: always include a second column in ORDER BY for stability — without it, ties are non-deterministic.
Approach 2 — QUALIFY (Snowflake / BigQuery / Databricks)
SELECT customer_id, order_id, ordered_at, status
FROM orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY ordered_at DESC, order_id DESC
) = 1;
Cleaner. Note Postgres doesn't support QUALIFY — know both forms.
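On Postgres specifically, DISTINCT ON is the idiomatic equivalent — a sketch against the same orders table:

```sql
-- Postgres-only: keeps the first row per customer_id according to the
-- ORDER BY (latest ordered_at, then latest order_id as the tiebreaker).
-- The DISTINCT ON expressions must lead the ORDER BY.
SELECT DISTINCT ON (customer_id)
    customer_id, order_id, ordered_at, status
FROM orders
ORDER BY customer_id, ordered_at DESC, order_id DESC;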
Approach 3 — correlated subquery (works everywhere, slowest)
SELECT *
FROM orders o
WHERE ordered_at = (
SELECT MAX(ordered_at)
FROM orders o2
WHERE o2.customer_id = o.customer_id
);
Breaks on ties — a customer with two orders at the same max timestamp returns both rows. Don't lead with this; mention it to show you know it exists.
What to say at the end
"ROW_NUMBER with a tiebreaker is the canonical answer. QUALIFY makes it cleaner on warehouses that support it. The thing I'd verify with the team: what should happen on ties — earliest order_id wins, latest wins, or is it undefined?"
2. Top-N per group
Window · Variation on #1 — keep K rows per partition, not just 1
Prompt: Return the top 3 highest-revenue orders per customer.
Solution — ROW_NUMBER ≤ N
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY revenue DESC, order_id DESC
) AS rn
FROM orders
)
SELECT *
FROM ranked
WHERE rn <= 3;
Tie handling — RANK vs DENSE_RANK vs ROW_NUMBER
If two orders tie at $1000 revenue, who's in the "top 3"?
- ROW_NUMBER: returns exactly 3 rows, ties broken arbitrarily by the tiebreaker.
- RANK: returns more than 3 rows if ties straddle the cutoff (e.g., ranks 1, 2, 3, 3 — four rows pass the <= 3 filter).
- DENSE_RANK: returns "top 3 distinct revenue levels" — can be many rows.
Always clarify in the interview which the business wants.
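If the answer is "ties at the cutoff should be included," swapping in RANK is the usual move — a sketch against the same orders table:

```sql
WITH ranked AS (
    SELECT *,
        RANK() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC   -- no tiebreaker: tied revenues share a rank
        ) AS rnk
    FROM orders
)
SELECT *
FROM ranked
WHERE rnk <= 3;   -- can return more than 3 rows per customer when ties straddle the cutoff
```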
3. Running totals
Window · Cumulative
Prompt: Given daily_revenue(date, revenue), return cumulative revenue as of each date.
Solution — SUM OVER with explicit frame
SELECT
date,
revenue,
SUM(revenue) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue
FROM daily_revenue
ORDER BY date;
Always specify the frame explicitly. The default with ORDER BY is RANGE-based and can behave unexpectedly with ties.
Variation — cumulative per group
SELECT
customer_id,
ordered_at,
revenue,
SUM(revenue) OVER (
PARTITION BY customer_id
ORDER BY ordered_at
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS lifetime_revenue
FROM orders;
Variation — 7-day moving average
SELECT
date,
revenue,
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS revenue_7d_avg
FROM daily_revenue;
Note: this is row-based, so it assumes one row per day. If days are missing, you'd left-join to a date dim first.
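A sketch of that fix, assuming Postgres's generate_series and a known date range (the range bounds here are placeholders):

```sql
-- Build a gapless date spine, then left-join revenue onto it so the
-- 7-row frame really means 7 calendar days.
WITH date_spine AS (
    SELECT generate_series(
        '2024-01-01'::date, '2024-12-31'::date, INTERVAL '1 day'
    )::date AS date
)
SELECT
    d.date,
    COALESCE(r.revenue, 0) AS revenue,   -- missing days count as 0
    AVG(COALESCE(r.revenue, 0)) OVER (
        ORDER BY d.date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d_avg
FROM date_spine d
LEFT JOIN daily_revenue r ON r.date = d.date;
```

Whether a missing day should count as 0 or be excluded from the average is a business question worth asking out loud.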
4. Gaps and islands
Window · Classic · The classic SQL puzzle
Prompt: Given a subscriptions table with one row per (customer, day) the customer was subscribed, identify continuous "subscription periods" (islands). Return one row per period with customer_id, start_date, end_date.
INPUT: subscriptions(customer_id, date)
101, 2024-01-01
101, 2024-01-02
101, 2024-01-03
101, 2024-01-05 -- gap on Jan 4
101, 2024-01-06
202, 2024-01-01
OUTPUT:
101, 2024-01-01, 2024-01-03
101, 2024-01-05, 2024-01-06
202, 2024-01-01, 2024-01-01
The insight — group by "date minus row number"
For consecutive dates, date - ROW_NUMBER() is constant. When there's a gap, the value jumps. Use that as the group key.
Solution — the classic gaps-and-islands
WITH numbered AS (
SELECT
customer_id,
date,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) AS rn
FROM subscriptions
),
grouped AS (
SELECT
customer_id,
date,
date - INTERVAL '1 day' * rn AS island_key -- groups consecutive dates
FROM numbered
)
SELECT
customer_id,
MIN(date) AS start_date,
MAX(date) AS end_date
FROM grouped
GROUP BY customer_id, island_key
ORDER BY customer_id, start_date;
Time: O(n log n) from the window sort. The trick: the date - INTERVAL '1 day' * rn stays constant within an island and jumps at gaps.
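Tracing the sample input for customer 101 makes the key's behavior concrete:

```sql
-- date        rn   date - rn days   (island_key)
-- 2024-01-01   1   2023-12-31       island A
-- 2024-01-02   2   2023-12-31       island A
-- 2024-01-03   3   2023-12-31       island A
-- 2024-01-05   4   2024-01-01       island B  (gap on Jan 4 shifted the key)
-- 2024-01-06   5   2024-01-01       island B
```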
Variation — using LAG for gap detection
Alternative approach: detect gaps with LAG, mark a "new island" flag, cumulative-sum it.
WITH with_gap AS (
SELECT
customer_id,
date,
CASE
WHEN LAG(date) OVER (PARTITION BY customer_id ORDER BY date)
= date - INTERVAL '1 day' THEN 0
ELSE 1
END AS is_new_island
FROM subscriptions
),
with_island_id AS (
SELECT
customer_id,
date,
SUM(is_new_island) OVER (
PARTITION BY customer_id ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS island_id
FROM with_gap
)
SELECT customer_id, MIN(date) AS start_date, MAX(date) AS end_date
FROM with_island_id
GROUP BY customer_id, island_id;
More explicit. Some interviewers prefer this because the logic reads top-to-bottom.
5. Sessionization
Window · Cousin of gaps-and-islands, applied to event streams
Prompt: Given events(user_id, event_ts), identify "sessions" where consecutive events for the same user are within 30 minutes. Return session_id, user_id, session_start, session_end, event_count.
Solution — LAG + cumulative-sum island pattern
WITH with_gap AS (
SELECT
user_id,
event_ts,
CASE
WHEN LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts)
>= event_ts - INTERVAL '30 minutes' THEN 0
ELSE 1
END AS is_new_session
FROM events
),
with_session_id AS (
SELECT
user_id,
event_ts,
SUM(is_new_session) OVER (
PARTITION BY user_id ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_num
FROM with_gap
)
SELECT
user_id || '_' || session_num AS session_id,
user_id,
MIN(event_ts) AS session_start,
MAX(event_ts) AS session_end,
COUNT(*) AS event_count
FROM with_session_id
GROUP BY user_id, session_num
ORDER BY user_id, session_start;
Same trick as gaps-and-islands. The first event of each user gets is_new_session = 1 (because LAG is NULL). The cumulative sum becomes the session sequence.
6. Dedup with tiebreaker
Window · The "we accidentally have duplicate rows" problem
Prompt: A users table has duplicate rows for the same email from imperfect deduplication. Keep one row per email: the most recently updated one, and among those, the one with non-null phone if any.
Solution — multi-key ORDER BY in ROW_NUMBER
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY LOWER(TRIM(email))
ORDER BY
updated_at DESC,
(phone IS NOT NULL) DESC, -- prefer non-null phones
user_id DESC -- final tiebreaker for determinism
) AS rn
FROM users
)
SELECT *
FROM ranked
WHERE rn = 1;
Notes:
- LOWER(TRIM(email)) for the partition catches case + whitespace variations.
- (phone IS NOT NULL) DESC exploits boolean ordering (TRUE > FALSE) for preference.
- Always include a final tiebreaker (e.g. user_id) — otherwise ties are non-deterministic.
7. Cohort retention
Window · Business · The metric every product company asks for
Prompt: Given users(user_id, signup_date) and events(user_id, event_date), build a cohort retention table: rows are signup-month cohorts, columns are months-since-signup, values are % of cohort that had at least one event that month.
The shape of the answer
cohort_month | months_since_signup | active_users | cohort_size | retention_pct
2024-01      | 0                   | 1000         | 1000        | 100.0%
2024-01      | 1                   | 620          | 1000        | 62.0%
2024-01      | 2                   | 410          | 1000        | 41.0%
2024-02      | 0                   | 800          | 800         | 100.0%
...
Solution — cohort table
WITH cohorts AS (
SELECT user_id,
DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
),
user_activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_date) AS activity_month
FROM events
),
joined AS (
SELECT
c.cohort_month,
c.user_id,
ua.activity_month,
DATEDIFF('month', c.cohort_month, ua.activity_month) AS months_since_signup
FROM cohorts c
LEFT JOIN user_activity ua ON ua.user_id = c.user_id
),
cohort_sizes AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM cohorts
GROUP BY 1
)
SELECT
j.cohort_month,
j.months_since_signup,
COUNT(DISTINCT j.user_id) AS active_users,
cs.cohort_size,
ROUND(100.0 * COUNT(DISTINCT j.user_id) / cs.cohort_size, 2) AS retention_pct
FROM joined j
JOIN cohort_sizes cs USING (cohort_month)
WHERE j.months_since_signup IS NOT NULL
AND j.months_since_signup >= 0
GROUP BY j.cohort_month, j.months_since_signup, cs.cohort_size
ORDER BY j.cohort_month, j.months_since_signup;
For a pivoted view (cohorts as rows, M0 / M1 / M2 as columns), wrap in a final SUM(CASE WHEN ... ) pivot.
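A sketch of that pivot, assuming the query above is available as retention(cohort_month, months_since_signup, retention_pct) — e.g. as a CTE or a materialized model (the name is a placeholder):

```sql
SELECT
    cohort_month,
    SUM(CASE WHEN months_since_signup = 0 THEN retention_pct END) AS m0,
    SUM(CASE WHEN months_since_signup = 1 THEN retention_pct END) AS m1,
    SUM(CASE WHEN months_since_signup = 2 THEN retention_pct END) AS m2
FROM retention
GROUP BY cohort_month
ORDER BY cohort_month;
```

Since there is exactly one row per (cohort_month, months_since_signup), SUM and MAX are interchangeable here; SUM just matches the conventional pivot idiom.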
8. Median in SQL
Window · The classic "no MEDIAN built-in" problem
Prompt: Compute the median latency_ms for each model_id from inference_requests.
Approach 1 — PERCENTILE_CONT (modern warehouses)
-- Snowflake / Postgres (ordered-set aggregate with WITHIN GROUP);
-- BigQuery instead offers PERCENTILE_CONT(x, 0.5) as a window function
SELECT
model_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY latency_ms) AS median_latency
FROM inference_requests
GROUP BY model_id;
-- Snowflake also supports APPROX_PERCENTILE for huge data
-- (BigQuery's analog is APPROX_QUANTILES)
SELECT
model_id,
APPROX_PERCENTILE(latency_ms, 0.5) AS approx_median
FROM inference_requests
GROUP BY model_id;
Approach 2 — manual median (if PERCENTILE_CONT not available)
WITH ranked AS (
SELECT
model_id,
latency_ms,
ROW_NUMBER() OVER (PARTITION BY model_id ORDER BY latency_ms) AS rn,
COUNT(*) OVER (PARTITION BY model_id) AS total
FROM inference_requests
)
SELECT
model_id,
AVG(latency_ms::FLOAT) AS median_latency -- AVG handles even-count case (avg of two middle values)
FROM ranked
WHERE rn IN (
(total + 1) / 2, -- middle when odd
(total + 2) / 2 -- second middle when even
)
GROUP BY model_id;
Trick: the two indices (total+1)/2 and (total+2)/2 are equal when total is odd (just the middle) and differ by 1 when even (the two middle values, which we average).
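Tracing the arithmetic (this relies on integer division, as with Postgres int / int):

```sql
-- total = 5 (odd):  (5+1)/2 = 3, (5+2)/2 = 3  → row 3 only (the middle)
-- total = 4 (even): (4+1)/2 = 2, (4+2)/2 = 3  → rows 2 and 3, averaged
```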
9. Period-over-period change
Window · LAG
Prompt: Given monthly_revenue(month, revenue), compute month-over-month % change. Handle the first month (no prior) and division by zero.
Solution — LAG + NULLIF
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change,
ROUND(
100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
2
) AS mom_pct_change
FROM monthly_revenue
ORDER BY month;
Key details:
- First row: LAG returns NULL → mom_change and mom_pct_change are NULL. Correct behavior.
- NULLIF(LAG(...), 0) guards division by zero — without it, a prior-month zero throws.
- You can DRY this up by computing prev_revenue in a CTE first; in interview code, the inline form is fine.
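The CTE form, for reference — same result, one LAG instead of four:

```sql
WITH with_prev AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
SELECT
    month,
    revenue,
    prev_revenue,
    revenue - prev_revenue AS mom_change,
    ROUND(100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 2) AS mom_pct_change
FROM with_prev
ORDER BY month;
```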
10. Funnel analysis
Window · Conditional · The product-analytics classic
Prompt: Given events(user_id, event_type, occurred_at) with event types signup, visited_pricing, started_checkout, completed_purchase, build a funnel: count users who hit each stage in order.
Solution — MIN-per-event with order check
WITH user_funnel AS (
SELECT
user_id,
MIN(CASE WHEN event_type = 'signup' THEN occurred_at END) AS signup_at,
MIN(CASE WHEN event_type = 'visited_pricing' THEN occurred_at END) AS pricing_at,
MIN(CASE WHEN event_type = 'started_checkout' THEN occurred_at END) AS checkout_at,
MIN(CASE WHEN event_type = 'completed_purchase' THEN occurred_at END) AS purchase_at
FROM events
GROUP BY user_id
)
SELECT
COUNT(*) AS users_signed_up,
COUNT(*) FILTER (WHERE pricing_at >= signup_at) AS reached_pricing,
COUNT(*) FILTER (WHERE checkout_at >= pricing_at) AS started_checkout,
COUNT(*) FILTER (WHERE purchase_at >= checkout_at) AS completed_purchase
FROM user_funnel
WHERE signup_at IS NOT NULL;
Why MIN per event: first occurrence of each event type per user.
Why >=: enforces order. A user who visited pricing before signing up shouldn't count for that stage. If you want strict ordering, use > instead.
FILTER (WHERE ...) is Postgres / Snowflake / BigQuery syntax. On older dialects use COUNT(CASE WHEN ... THEN user_id END).
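For reference, the portable rewrite of one stage (assuming the user_funnel CTE from above is in scope) — COUNT skips NULLs, so the CASE with no ELSE does the filtering:

```sql
SELECT
    COUNT(*) AS users_signed_up,
    COUNT(CASE WHEN pricing_at >= signup_at THEN 1 END) AS reached_pricing
FROM user_funnel
WHERE signup_at IS NOT NULL;
```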
Follow-up: conversion rate between stages
WITH counts AS (
-- (above query, but selecting just the four counts)
)
SELECT
users_signed_up,
reached_pricing,
ROUND(100.0 * reached_pricing / NULLIF(users_signed_up, 0), 2) AS signup_to_pricing_pct,
started_checkout,
ROUND(100.0 * started_checkout / NULLIF(reached_pricing, 0), 2) AS pricing_to_checkout_pct,
completed_purchase,
ROUND(100.0 * completed_purchase / NULLIF(started_checkout, 0), 2) AS checkout_to_purchase_pct
FROM counts;
11. Pivot dynamically
Pivot · Aggregate
Prompt: Given orders(customer_id, quarter, revenue), return one row per customer with columns Q1, Q2, Q3, Q4 showing revenue per quarter.
Approach 1 — CASE WHEN (works everywhere)
SELECT
customer_id,
SUM(CASE WHEN quarter = 'Q1' THEN revenue ELSE 0 END) AS q1_revenue,
SUM(CASE WHEN quarter = 'Q2' THEN revenue ELSE 0 END) AS q2_revenue,
SUM(CASE WHEN quarter = 'Q3' THEN revenue ELSE 0 END) AS q3_revenue,
SUM(CASE WHEN quarter = 'Q4' THEN revenue ELSE 0 END) AS q4_revenue
FROM orders
GROUP BY customer_id;
ELSE 0 makes quarters with no orders show 0; drop the ELSE if you'd rather get NULL to mean "no data."
Approach 2 — PIVOT operator (Snowflake / BigQuery / Databricks)
-- Snowflake / BigQuery
SELECT *
FROM orders
PIVOT (
SUM(revenue) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);
Less portable but cleaner when available.
What if the quarters are dynamic?
SQL pivot requires known column values. For truly dynamic pivots (unknown quarters at query time), you have two options:
- Two-step: query the distinct quarter values first, then dynamically construct the pivot SQL in application code or a macro.
- In dbt: the dbt_utils.pivot macro generates the SQL at compile time.
- Just don't pivot in SQL: return long-format data and let the BI tool pivot. Often the right answer.
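A sketch of the dbt route, assuming dbt_utils is installed and orders exists as a dbt model — check the dbt_utils docs for the exact macro arguments before relying on this:

```sql
-- models/revenue_by_quarter.sql (Jinja compiled to SUM(CASE WHEN ...) at build time)
SELECT
    customer_id,
    {{ dbt_utils.pivot(
        'quarter',
        dbt_utils.get_column_values(ref('orders'), 'quarter'),
        agg='sum',
        then_value='revenue'
    ) }}
FROM {{ ref('orders') }}
GROUP BY customer_id
```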
Drill protocol
- Toggle Drill mode at the top to hide solutions.
- Set a timer — 20-25 min for first pass per problem, 10 min for re-solves.
- Write the SQL on paper or in a real warehouse, not just in your head. Watching the query execute teaches you something the text doesn't.
- State complexity and edge cases out loud as you finish.
- Reveal. Note structural differences.
- Mark practiced. Re-solve the next day from memory.
These 11 cover roughly 80% of the patterns you'll see in SQL interviews for data analytics engineering roles. Each one maps to a concept in 04-sql-deep-dive. If a problem feels hard, the deep-dive section on that pattern is your remediation.