Section B · Critical

SQL Deep Dive

CTEs, window functions, query plans, anti-patterns. The single most-tested skill in analytics-engineering loops — fluency under pressure separates strong candidates from the rest.

Reality check

SQL is graded on fluency, not cleverness. You should write correct CTEs and window functions without hesitating. Plan to write SQL out loud on a timer — record yourself if you're alone — until each pattern feels automatic. Then drill 11-sql-problems.

Logical query order — know this cold

SQL is written one way and executed another. The logical order:

SQL logical order
1. FROM        (and JOINs)
2. WHERE       (filters rows BEFORE grouping)
3. GROUP BY
4. HAVING      (filters groups AFTER grouping)
5. SELECT      (expressions evaluated here — that's why aliases aren't available in WHERE)
6. DISTINCT
7. ORDER BY    (aliases ARE available here)
8. LIMIT / OFFSET

This explains common errors interviewers love to catch:

  • "Why can't I use a SELECT alias in WHERE?" — because WHERE runs before SELECT.
  • "Why is COUNT(*) including NULL?" — because COUNT(*) counts rows, not values. COUNT(col) skips NULLs.
  • "Why does HAVING work where WHERE didn't?" — because HAVING runs after grouping; WHERE runs before.

Joins — really know them

Five join types you should be able to draw on a whiteboard and explain in one sentence:

Join            | What it returns                              | Common pitfall
INNER JOIN      | Only rows that match on both sides           | Silent data loss when keys are NULL or unmatched
LEFT JOIN       | All left rows; right side NULL if unmatched  | Row explosion when right side has many matches
RIGHT JOIN      | All right rows; left side NULL if unmatched  | Rarely the right choice — rewrite as LEFT JOIN by swapping sides
FULL OUTER JOIN | All rows from both, NULLs where unmatched    | Use when you genuinely need both sides — rare
CROSS JOIN      | Cartesian product                            | Massive row counts; usually a bug unless intentional
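
The RIGHT JOIN rewrite from the table, spelled out on hypothetical customers/orders tables; both queries keep every customer:

right_vs_left_join.sql
SELECT c.customer_id, o.order_id
FROM orders o
RIGHT JOIN customers c ON c.customer_id = o.customer_id;

-- Same result, reads left-to-right
SELECT c.customer_id, o.order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id;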

The fanout problem

The most common SQL bug in analytics engineering. Joining at the wrong grain inflates your numbers:

fanout bug example
-- Wrong: revenue per customer, but order_items duplicates orders
SELECT c.customer_id,
       SUM(o.revenue) AS total_revenue          -- inflated!
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
LEFT JOIN order_items oi ON oi.order_id = o.order_id  -- adds rows per item
GROUP BY 1;

-- Right: aggregate first, then join
WITH order_totals AS (
  SELECT order_id, SUM(item_revenue) AS revenue
  FROM order_items
  GROUP BY 1
)
SELECT c.customer_id,
       SUM(ot.revenue) AS total_revenue
FROM customers c
LEFT JOIN orders o   ON o.customer_id = c.customer_id
LEFT JOIN order_totals ot ON ot.order_id = o.order_id
GROUP BY 1;
Mention "grain" unprompted

When discussing joins, say: "the grain of this CTE is one row per order; if I join to order_items I'll fan out to one row per item, so I'd aggregate first." That sentence signals senior-level thinking.

NULL handling — the gotcha that trips everyone

NULL is not a value. It's absence of value. Three behaviors that catch candidates:

  • NULL = NULL is NULL, not TRUE. Use IS NULL / IS NOT NULL.
  • NULL <> 'x' is NULL, not TRUE. So WHERE col <> 'x' silently excludes NULL rows.
  • Arithmetic with NULL is NULL. NULL + 1 = NULL. SUM skips NULLs; AVG skips NULLs (and divides by non-null count); COUNT(*) includes them, COUNT(col) skips them.

Safe patterns:

null-safe SQL
-- Coalesce to a default before comparing
WHERE COALESCE(status, 'unknown') <> 'active'

-- Or be explicit
WHERE status <> 'active' OR status IS NULL

-- Null-safe comparison (Postgres / Snowflake)
WHERE col IS DISTINCT FROM other_col       -- null-safe "not equal": two NULLs compare as equal
WHERE col IS NOT DISTINCT FROM other_col   -- null-safe "equal"

GROUP BY semantics

Every non-aggregated column in SELECT must appear in GROUP BY (with rare exceptions like Postgres allowing functionally-dependent columns). Patterns:

  • GROUP BY 1, 2, 3 — refer to SELECT columns by position. Common in interview code; some teams ban it for clarity.
  • GROUP BY ROLLUP / CUBE / GROUPING SETS — multi-level aggregation in one query (see the sketch after this list).
  • HAVING vs WHERE — HAVING filters after grouping; WHERE filters before. Use WHERE whenever possible (smaller rowsets to group).
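
A sketch of the GROUPING SETS form (Postgres / Snowflake / recent BigQuery syntax), on a hypothetical orders table; one query produces per-region-and-product rows, per-region subtotals, and a grand total:

grouping_sets.sql
SELECT region,
       product,
       SUM(revenue) AS revenue
FROM orders
GROUP BY GROUPING SETS (
  (region, product),   -- one row per region x product
  (region),            -- subtotal per region
  ()                   -- grand total; region and product come back NULL here
)
ORDER BY region, product;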

CTEs and structuring complex queries

CTEs (Common Table Expressions) are the single biggest readability lever in SQL. Use them liberally.

CTE pattern
WITH
  active_customers AS (
    SELECT customer_id
    FROM customers
    WHERE is_active AND created_at >= '2024-01-01'
  ),
  recent_orders AS (
    SELECT customer_id,
           SUM(revenue) AS revenue_30d,
           COUNT(*)     AS orders_30d
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
  )
SELECT ac.customer_id,
       COALESCE(ro.revenue_30d, 0) AS revenue_30d,
       COALESCE(ro.orders_30d, 0)  AS orders_30d
FROM active_customers ac
LEFT JOIN recent_orders ro USING (customer_id)
ORDER BY revenue_30d DESC;

Pattern: one CTE per concept

Name each CTE for what it represents: active_customers, monthly_revenue, first_purchase_per_customer. Reviewers should be able to read CTE names and understand the query structure without looking inside.

Performance note

In older Postgres (< 12), CTEs were "optimization fences" — the planner couldn't see through them. Modern warehouses (Snowflake, BigQuery, modern Postgres) inline CTEs by default. You can use CTEs liberally without performance penalty in the warehouses you're likely to interview on.
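
If you do want the old fence behavior in modern Postgres, you can ask for it per CTE; a sketch on a hypothetical events table:

CTE materialization (Postgres 12+)
-- Compute the CTE once and keep it as-is (the pre-12 behavior)
WITH purchases AS MATERIALIZED (
  SELECT * FROM events WHERE event_type = 'purchase'
)
SELECT COUNT(*) FROM purchases;

-- Or explicitly allow the planner to inline it (the default for single-use CTEs)
WITH purchases AS NOT MATERIALIZED (
  SELECT * FROM events WHERE event_type = 'purchase'
)
SELECT COUNT(*) FROM purchases;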

Recursive CTEs

For hierarchies (org charts, category trees), use WITH RECURSIVE:

recursive CTE
WITH RECURSIVE org_tree AS (
  -- anchor: the CEO
  SELECT employee_id, manager_id, name, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- recursive step: anyone whose manager is already in the tree
  SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT * FROM org_tree ORDER BY depth, name;

Know they exist. Know the shape (anchor + recursive step + UNION ALL). Practical use is rare but they come up.

Window functions — the analytics engineer's superpower

If you can't write window functions fluently, you will fail SQL interviews. They're the tool for "do something per group without collapsing rows."

The shape

window function syntax
function_name(...) OVER (
  PARTITION BY ...     -- like GROUP BY, but doesn't collapse rows
  ORDER BY ...         -- ordering within partition (required for some functions)
  ROWS BETWEEN ... AND ...  -- frame: which rows the window covers
)

Functions you must know

Function                    | What it does                                  | Typical use
ROW_NUMBER()                | 1, 2, 3, ... per partition                    | Dedup ("keep latest"), top-N per group
RANK()                      | Ties get same rank, gap follows (1, 2, 2, 4)  | Leaderboards, ranking
DENSE_RANK()                | Ties get same rank, no gap (1, 2, 2, 3)       | Same as RANK but compact
LAG(col, n)                 | Value from n rows earlier in partition        | Period-over-period, sessionization
LEAD(col, n)                | Value from n rows later                       | Same kind of analysis, looking forward
FIRST_VALUE / LAST_VALUE    | First / last value in window frame            | First purchase, latest status
SUM/AVG/MIN/MAX OVER(...)   | Aggregate as window                           | Running totals, moving averages
NTILE(n)                    | Divide partition into n buckets               | Quartiles, deciles
PERCENT_RANK / CUME_DIST    | Percentile within partition                   | Distribution analysis

Canonical pattern: latest row per group

latest_per_group.sql
-- Get the most recent order per customer
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY customer_id
           ORDER BY ordered_at DESC
         ) AS rn
  FROM orders
)
SELECT * FROM ranked WHERE rn = 1;

Canonical pattern: running total

running_total.sql
SELECT
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM daily_revenue
ORDER BY order_date;

Canonical pattern: 7-day moving average

moving_avg.sql
SELECT
  date,
  daily_value,
  AVG(daily_value) OVER (
    ORDER BY date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM daily_metrics;

Canonical pattern: period-over-period change

pop_change.sql
SELECT
  month,
  revenue,
  LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
  revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change,
  (revenue - LAG(revenue, 1) OVER (ORDER BY month))
    / NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0) AS mom_pct_change
FROM monthly_revenue
ORDER BY month;
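
The table above lists sessionization as a LAG use case. A common sketch, assuming a hypothetical events table with user_id and event_ts, a 30-minute inactivity cutoff, and Postgres-style interval arithmetic (swap in the dialect's date-diff function elsewhere):

sessionization.sql
WITH with_gaps AS (
  SELECT user_id,
         event_ts,
         LAG(event_ts) OVER (
           PARTITION BY user_id
           ORDER BY event_ts
         ) AS prev_ts
  FROM events
),
flagged AS (
  SELECT *,
         CASE
           WHEN prev_ts IS NULL
             OR event_ts - prev_ts > INTERVAL '30 minutes' THEN 1
           ELSE 0
         END AS is_new_session
  FROM with_gaps
)
SELECT user_id,
       event_ts,
       SUM(is_new_session) OVER (
         PARTITION BY user_id
         ORDER BY event_ts
       ) AS session_number
FROM flagged;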

Frame defaults bite

When you use SUM() OVER (ORDER BY ...) without an explicit ROWS clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY alone and no PARTITION BY, that's a running total. Without ORDER BY, the default frame is the entire partition. Always specify the frame explicitly for clarity.
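
A sketch of the difference, assuming a hypothetical payments table where two rows share the same paid_date: the default RANGE frame treats those ties as peers and gives them the same running total, while an explicit ROWS frame accumulates strictly one row at a time.

rows_vs_range.sql
SELECT
  paid_date,
  amount,
  SUM(amount) OVER (ORDER BY paid_date) AS running_range,   -- default frame: tied dates share a value
  SUM(amount) OVER (
    ORDER BY paid_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_rows                                          -- row-by-row accumulation
FROM payments;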

QUALIFY — filter on window functions

BigQuery, Snowflake, Databricks, DuckDB all support QUALIFY. It's like HAVING but for window functions. Cleans up the latest-per-group pattern:

QUALIFY
-- Instead of wrapping in a CTE:
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY ordered_at DESC
) = 1;

Postgres doesn't support QUALIFY (yet). Know both forms.

Pivot & unpivot

Pivoting rows to columns is common in reports. The portable way:

pivot_with_case.sql
-- Revenue by quarter, one row per customer
SELECT
  customer_id,
  SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS q1_revenue,
  SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS q2_revenue,
  SUM(CASE WHEN quarter = 'Q3' THEN revenue END) AS q3_revenue,
  SUM(CASE WHEN quarter = 'Q4' THEN revenue END) AS q4_revenue
FROM quarterly_revenue
GROUP BY customer_id;

Snowflake and BigQuery also have PIVOT / UNPIVOT operators. The CASE WHEN form works everywhere and is what most interviewers expect.
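
Unpivot goes the other way, columns back to rows. The portable form is UNION ALL; a sketch assuming a hypothetical customer_quarters table holding the q1..q4 columns produced above:

unpivot_with_union.sql
SELECT customer_id, 'Q1' AS quarter, q1_revenue AS revenue FROM customer_quarters
UNION ALL
SELECT customer_id, 'Q2', q2_revenue FROM customer_quarters
UNION ALL
SELECT customer_id, 'Q3', q3_revenue FROM customer_quarters
UNION ALL
SELECT customer_id, 'Q4', q4_revenue FROM customer_quarters;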

JSON / semi-structured data

Modern warehouses handle JSON natively. Syntax varies; the operations are the same:

JSON access by dialect
-- Snowflake
SELECT raw:user.email::STRING AS email FROM events;

-- BigQuery
SELECT JSON_VALUE(raw, '$.user.email') AS email FROM events;

-- Postgres
SELECT raw -> 'user' ->> 'email' AS email FROM events;     -- -> keeps JSON, ->> returns text
SELECT raw #>> '{user,email}' AS email FROM events;  -- path operator

-- Snowflake / BigQuery: array unnesting
SELECT id, item AS product
FROM orders, UNNEST(items) AS item;  -- BigQuery (each array element is aliased as item)
SELECT o.id, item.value::STRING AS product
FROM orders o, LATERAL FLATTEN(o.items) item;  -- Snowflake

Know how to extract scalar values, unnest arrays, and cast to strong types. Inference logs are usually JSON; you'll touch this every day at an AI infra company.

Date / time — the source of most bugs

Date arithmetic is dialect-specific and time-zone-sensitive. Patterns:

date snippets
-- Snowflake / BigQuery / Postgres-ish
DATE_TRUNC('month', ts)     -- truncate to start of month
DATE_TRUNC('week', ts)      -- start of week (varies by dialect — Monday vs Sunday)
ts + INTERVAL '1 day'
ts - INTERVAL '7 days'
EXTRACT(YEAR FROM ts)
EXTRACT(DOW FROM ts)        -- day of week
DATEDIFF('day', start_ts, end_ts)
DATEADD(month, 1, ts)

-- Generate a date series (BigQuery)
SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-12-31')) AS dt;

-- Snowflake
SELECT DATEADD(day, seq4(), '2024-01-01') AS dt
FROM TABLE(GENERATOR(ROWCOUNT => 365));

Time-zone gotchas

Always know whether your timestamp is UTC, local, or naïve. Date-truncate after converting to the relevant timezone, or you'll get rows in the wrong day. Common bug: "why is the daily revenue chart off by ~3% between US and EU views?" — TZ mishandling.
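
What "truncate after converting" looks like in the three dialects above; the event_ts column and the target timezone are assumptions:

truncate in local time
-- Snowflake (event_ts stored as UTC)
DATE_TRUNC('day', CONVERT_TIMEZONE('UTC', 'America/New_York', event_ts))

-- BigQuery
DATE(event_ts, 'America/New_York')
TIMESTAMP_TRUNC(event_ts, DAY, 'America/New_York')

-- Postgres (event_ts is timestamptz)
DATE_TRUNC('day', event_ts AT TIME ZONE 'America/New_York')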

Performance — what makes warehouse queries slow

You won't be running EXPLAIN ANALYZE in a 45-minute interview, but you should know conceptually what makes queries slow:

  • Scanning too much data — no partition / cluster filter. Snowflake clusters by columns you specify; BigQuery partitions by date and clusters within. Always filter on the partition key first.
  • Skewed joins — one join key dominates volume (e.g. a huge share of rows sharing one customer_id, or carrying NULL keys). Causes one worker to do all the work.
  • Cartesian explosions — accidental CROSS JOIN from missing ON clause.
  • Spilling — query exceeds memory; data spills to disk. Often happens with window functions over huge unfiltered datasets.
  • SELECT * — columnar warehouses bill you for column scans. Pulling only what you need is meaningfully cheaper.
  • Repeated subqueries — running the same subquery in multiple SELECT branches. Refactor to a CTE.
  • Missing predicate pushdown — filters applied too late. Filter early, aggregate after: apply WHERE in the deepest CTE that can use it (see the sketch after this list).
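
A sketch of the partition-filter and filter-early points, assuming a BigQuery-style events table partitioned on event_date and a hypothetical users table:

filter_early.sql
-- The date filter lives in the deepest CTE, so only the last 7 days of
-- partitions are scanned before the join and aggregation
WITH recent_events AS (
  SELECT user_id, event_date
  FROM events
  WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)   -- partition filter first
)
SELECT u.country,
       COUNT(*) AS event_count
FROM recent_events e
JOIN users u ON u.user_id = e.user_id
GROUP BY u.country;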

Query plan reading — the shape

Every modern warehouse shows you a plan (a few ways to pull one up are sketched after this list). Look for:

  • Number of rows scanned at each step — if the bottom of the plan reads billions of rows for a query that should return thousands, you're missing a filter.
  • Join order and method — hash joins vs merge joins; broadcast vs shuffle in Spark/Databricks.
  • Bytes processed / partitions read — BigQuery shows this directly; it's your cost.
  • Spill warnings — Snowflake shows local/remote spill; Databricks shows shuffle write/read.
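
A few ways to pull up a plan, sketched per engine; the query itself is a placeholder:

getting a plan
-- Postgres: estimated plan, or actual row counts and timing with ANALYZE
EXPLAIN SELECT customer_id, COUNT(*) FROM orders GROUP BY 1;
EXPLAIN ANALYZE SELECT customer_id, COUNT(*) FROM orders GROUP BY 1;

-- Snowflake: estimated plan (the Query Profile in the UI shows the executed one)
EXPLAIN SELECT customer_id, COUNT(*) FROM orders GROUP BY 1;

-- BigQuery: a dry run reports bytes that would be scanned without running the query
-- bq CLI: bq query --nouse_legacy_sql --dry_run 'SELECT customer_id, COUNT(*) FROM orders GROUP BY 1'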

SQL anti-patterns to call out

  • SELECT * in production — breaks when upstream adds a column, scans extra bytes.
  • DISTINCT as a band-aid for duplicates — fix the grain, don't paper over it.
  • NOT IN (subquery) when subquery may have NULLs — returns no rows if the subquery returns any NULL. Use NOT EXISTS or a LEFT JOIN anti-join where the right-side key IS NULL (see the snippet after this list).
  • Functions in WHERE that break indexes — WHERE LOWER(email) = 'x' can't use an index on email. Less of an issue in MPP warehouses (no indexes anyway) but matters in Postgres.
  • Implicit type coercion — comparing string to int. May work, may not, often slow.
  • Reusing aliases in WHERE — doesn't work, because WHERE runs before SELECT.
  • HAVING used as a filter when WHERE would do — slower because grouping happens first.
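
The NOT IN trap and its fixes, sketched on hypothetical customers/orders tables:

not_in_vs_not_exists.sql
-- Trap: if any orders.customer_id is NULL, this returns zero rows
SELECT c.*
FROM customers c
WHERE c.customer_id NOT IN (SELECT customer_id FROM orders);

-- Fix 1: NOT EXISTS, which is NULL-safe
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
);

-- Fix 2: LEFT JOIN anti-join
SELECT c.*
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL;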

Dialect differences worth knowing

You'll be asked "have you used X?" Have one-line answers ready.

Dialect        | Notable features                                                         | What's missing
Snowflake      | QUALIFY, LATERAL FLATTEN, time travel, zero-copy clones, dynamic tables  | Some standard niceties (FULL OUTER with USING)
BigQuery       | QUALIFY, UNNEST, scripting/procedures, JS UDFs, partitions + clustering  | No row-level updates without WHERE on key
Redshift       | Sort keys, dist keys, vacuum/analyze                                     | Less ergonomic JSON than Snowflake; older standard support
Databricks SQL | Delta Lake operations, MERGE INTO, photon engine, table constraints      | QUALIFY is recent; some warehouse niceties newer
Postgres       | Best standard compliance, JSON ops, extensions (pgvector, etc.)          | No QUALIFY, slower at warehouse scale
DuckDB         | OLAP-on-laptop, fast, modern SQL, QUALIFY, list/array types              | Single-node (mostly)
ClickHouse     | Extreme scan speed for analytics, materialized views, dictionaries       | Quirky SQL dialect, weak join performance historically

How to approach a SQL live coding round

The framework that makes you look senior:

  1. Restate the schema. "So we have a customers table at one row per customer, and an orders table at one row per order. Is there an order_items table too?"
  2. Clarify grain & constraints. "What's the grain of the output? One row per customer per month?" "Are timestamps UTC?" "Can a customer be deleted?"
  3. Walk through a tiny example before coding. "For customer A with orders on Jan 1 and Jan 15, the answer would be 2. Right?"
  4. State the approach. "I'll filter to active customers, then aggregate orders by month per customer, then join."
  5. Code in CTEs. One concept per CTE. Name them well. Resist the urge to write one giant nested query.
  6. Talk while you type. Silence reads as stuck. Narration reads as competent.
  7. Test against the example. Walk your own code line by line against the input.
  8. Mention edge cases. NULLs, ties in window functions, the date-boundary issue. Even if you don't fix them, calling them out signals seniority.
  9. Discuss performance. "At scale I'd add a filter on partition_date here. And I'd worry about the grain after this join."

The line that wins rooms

"What's the grain of this CTE?" Say it out loud at every major step. It's the senior data engineer's tic and interviewers love hearing it.