Practice Interview Questions
30 questions across 8 categories. Read each, give your version out loud on a ~90 second timer, then reveal the strong answer.
Section A · Background / motivation
Q1. "Walk me through your data engineering experience."
Show strong answer
"Background is [generalist / analyst / SWE — whatever]. The last [N months/years] I've focused on the modern data stack — SQL, dbt, warehouse work, and the analytics-engineer mindset. Concretely I've [shipped X / done a side project / completed Y course]. The piece I'd be ramping on if I joined is [be honest — production-scale incremental, specific warehouse, your specific orchestration stack], and I have a concrete plan for how to do that. The reason this role caught my attention is the data is unusually rich — high-volume telemetry, multi-tenant economics, real-time-ish demands — which makes the modeling work genuinely interesting."
Q2. "Why this company / role?"
Show strong answer
"Two reasons. One: the data is interesting. AI infra companies have unit economics that depend on getting the data right — GPU utilization, cost-per-token, customer LTV, multi-tenant margin. That's analytics-engineering work where the work actually matters to the business, not 'pretty dashboards for executives.' Two: I'd rather work somewhere the data team has real leverage — finance, product, ops all depend on the warehouse being right. That's how I want to spend my time."
Q3. "What's the gap between your background and the role?"
Show strong answer
"Two gaps I'd flag. First, I haven't operated at [their scale — petabytes / billions of events / X concurrent users] in production. I understand the patterns — incremental models, partitioning, clustering, the cost levers — but I haven't been the on-call for a warehouse that size. Second, [their specific stack piece — Iceberg / Databricks / streaming]. What I do bring: strong SQL, modeling discipline, dbt fluency, and the instinct to ask 'what's the grain' before writing anything. The ramp plan for the gaps would be [specific learning move]."
Q4. "Why analytics engineering vs data engineering vs analyst?"
Show strong answer
"I like that analytics engineering sits exactly where I find the work most useful — close enough to the business to know why the metric matters, technical enough to model it well and own the testing. Pure data engineering, the work is infrastructure-first; pure analyst work, the work is question-first but you depend on others to build the data. Analytics engineering is the bridge: I own the layer that turns raw data into the metrics the business runs on. The transformation layer is where most of the value gets created in a modern data team."
Section B · SQL
Q5. "What's the difference between RANK, DENSE_RANK, and ROW_NUMBER?"
Show strong answer
"All three rank rows within a partition. ROW_NUMBER gives a unique integer 1, 2, 3 to every row, breaking ties arbitrarily when the ORDER BY doesn't fully determine the order. RANK gives tied rows the same rank but skips numbers — so values 100, 100, 90 become ranks 1, 1, 3. DENSE_RANK is the same as RANK but doesn't skip — 1, 1, 2. I default to ROW_NUMBER for 'one row per group' patterns and RANK or DENSE_RANK when ties should genuinely tie."
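The three functions can be seen side by side in a minimal sketch using Python's built-in `sqlite3` (table and values are hypothetical; window functions need the bundled SQLite 3.25+, which ships with Python 3.8+):

```python
import sqlite3  # window functions require SQLite >= 3.25 (bundled with Python 3.8+)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, score INT)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("a", 100), ("b", 100), ("c", 90)])

rows = con.execute("""
    SELECT name, score,
           ROW_NUMBER() OVER w AS rn,     -- 1, 2, 3: unique, tie broken arbitrarily
           RANK()       OVER w AS rnk,    -- 1, 1, 3: ties share a rank, numbers skipped
           DENSE_RANK() OVER w AS drnk    -- 1, 1, 2: ties share a rank, no gaps
    FROM scores
    WINDOW w AS (ORDER BY score DESC)
    ORDER BY rn
""").fetchall()
for r in rows:
    print(r)
```

The two 100s get ranks (1, 1) from RANK and DENSE_RANK but distinct row numbers; the 90 lands at rank 3 under RANK and rank 2 under DENSE_RANK.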
Q6. "When do you reach for a window function vs a GROUP BY?"
Show strong answer
"GROUP BY collapses rows — you get one row per group with aggregated values. Window functions don't collapse — you get every original row, with an aggregate computed across the partition attached to it. So I reach for window functions when I need 'per-group calculation but keep the rows' — running totals, ranking, latest-per-group, period-over-period change. GROUP BY when I want the aggregated view itself. A common pattern is to compute window functions in a CTE, then filter / aggregate in the outer query."
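The "window in a CTE, filter in the outer query" pattern looks like this in a small `sqlite3` sketch (the `orders` table and its values are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, order_id INT, placed_at TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("acme",   1, "2024-01-01"),
    ("acme",   2, "2024-02-01"),
    ("globex", 3, "2024-01-15"),
])

# 'Latest order per customer': rank inside a CTE, then filter in the outer query.
# A GROUP BY would collapse the rows; the window keeps them so we can pick one.
latest = con.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY customer ORDER BY placed_at DESC) AS rn
        FROM orders
    )
    SELECT customer, order_id FROM ranked WHERE rn = 1 ORDER BY customer
""").fetchall()
print(latest)  # [('acme', 2), ('globex', 3)]
```

The CTE is needed because most engines don't allow filtering on a window function directly in the same SELECT's WHERE clause (warehouses with QUALIFY are the exception).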
Q7. "Why does WHERE col != 'x' miss rows where col is NULL?"
Show strong answer
"Because NULL is 'unknown,' not a value. Comparing NULL to anything — including with the inequality — yields UNKNOWN, not TRUE. That's SQL's three-valued logic: TRUE, FALSE, UNKNOWN. WHERE only keeps rows where the predicate is TRUE, so an UNKNOWN predicate means 'exclude.' The fix is explicit: WHERE col != 'x' OR col IS NULL, or COALESCE(col, '') != 'x'. This is the #1 silent-data-loss bug in SQL."
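The silent row loss is easy to reproduce in an `sqlite3` sketch (table and values hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("x",), ("y",), (None,)])

# Naive filter silently drops the NULL row: NULL != 'x' evaluates to UNKNOWN,
# and WHERE only keeps rows where the predicate is TRUE.
naive = con.execute("SELECT COUNT(*) FROM t WHERE col != 'x'").fetchone()[0]

# Explicit fix: state what should happen to NULLs.
fixed = con.execute(
    "SELECT COUNT(*) FROM t WHERE col != 'x' OR col IS NULL").fetchone()[0]
print(naive, fixed)  # 1 2
```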
Q8. "How would you find duplicate rows?"
Show strong answer
"GROUP BY the candidate-key columns and HAVING COUNT(*) > 1. That gives me the duplicate key values. To return the actual rows, I'd use ROW_NUMBER OVER (PARTITION BY ... ORDER BY ...) and filter rn > 1. The harder question is: what should the unique key be? That's a modeling conversation. And what tiebreaker rule keeps the 'right' duplicate? Often it's most-recent-updated, but sometimes it's 'the one with the non-null email' or 'the one from the authoritative source.' I'd nail that down with the business before dedup'ing."
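Both steps — finding the duplicated keys, then flagging the losing rows under a most-recent-updated tiebreaker — in a minimal `sqlite3` sketch (table, columns, and tiebreaker are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (email TEXT, updated_at TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [
    ("a@x.com", "2024-01-01"),
    ("a@x.com", "2024-03-01"),
    ("b@x.com", "2024-02-01"),
])

# Step 1: which candidate-key values are duplicated?
dupes = con.execute("""
    SELECT email, COUNT(*) AS n FROM users GROUP BY email HAVING COUNT(*) > 1
""").fetchall()

# Step 2: keep the most-recently-updated row per key; rn > 1 are the extras
extras = con.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC) AS rn
        FROM users
    )
    SELECT email, updated_at FROM ranked WHERE rn > 1
""").fetchall()
print(dupes)   # [('a@x.com', 2)]
print(extras)  # [('a@x.com', '2024-01-01')]
```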
Q9. "A query is slow. How do you debug?"
Show strong answer
"Start with the query plan — every warehouse has one. Look for the biggest row-count step; that's usually the culprit. Common causes: missing partition or cluster filter scanning the whole table, a join with an unexpected fanout (multiplies rows), data skew (one key dominates volume), spilling to disk, or a non-sargable predicate. Common fixes in order: filter earlier (predicate pushdown), aggregate before joining if the right side fans out, ensure the partition / cluster key is in the WHERE, avoid SELECT *. Then if the query itself is correct and still slow, materialize a precomputed table — incremental or even just a daily-rebuilt mart."
Section C · dbt
Q10. "How would you organize a new dbt project?"
Show strong answer
"Staging, intermediate, marts — the dbt Labs convention. Staging models are 1:1 with sources; rename columns, cast types, no business logic. Intermediate is reusable building blocks for complex joins, optional layer. Marts are the analytics-facing tables organized by business domain — fact and dimension tables. Folder-level config defaults staging to views, marts to tables. Naming: stg_, int_, fct_, dim_. Sources declared explicitly with freshness checks. The convention's main value is making PR review and onboarding faster — anyone joining knows where to look."
Q11. "When do you choose incremental over table materialization?"
Show strong answer
"Default to table. Incremental only when full rebuild is too slow or too expensive — usually event-grain tables at the billions-of-rows scale. When I go incremental, three things to nail down: unique key for the merge strategy, lookback window for late-arriving data (don't just filter 'greater than max timestamp' — you'll silently drop late events), and the on_schema_change behavior. Plus a documented 'full refresh' procedure for when business logic changes."
Q12. "How do you handle late-arriving data in an incremental model?"
Show strong answer
"Two parts. First, the filter — don't use 'where occurred_at > max(occurred_at)'; use a lookback like 'where occurred_at > max(occurred_at) - interval 24 hours' so late events get reprocessed. Second, the strategy — incremental_strategy='merge' with the right unique_key, so late arrivals upsert correctly rather than duplicating. For very late data (days+), I'd have a scheduled full-refresh window for the prior period, or a separate backfill job. The general principle: any 'incremental' design has to answer 'what about late data' or it's silently lossy."
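The lookback-plus-merge pattern can be sketched outside dbt with Python's `sqlite3`, using SQLite's ON CONFLICT upsert as a stand-in for the MERGE a warehouse (or dbt's merge strategy) would run. Table names, the 2-day lookback, and the late-correction scenario are all hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (event_id TEXT PRIMARY KEY, occurred_at TEXT, amount INT)")
con.execute("CREATE TABLE source (event_id TEXT, occurred_at TEXT, amount INT)")
con.executemany("INSERT INTO source VALUES (?, ?, ?)", [
    ("e1", "2024-06-01", 10),
    ("e2", "2024-06-02", 20),   # a late correction will arrive for this one
])

def incremental_run():
    # Lookback: reprocess everything within 2 days of the target's high-water mark,
    # then upsert on the unique key so reprocessed rows update instead of duplicating.
    con.execute("""
        INSERT INTO target
        SELECT event_id, occurred_at, amount FROM source
        WHERE occurred_at > COALESCE(
            (SELECT date(MAX(occurred_at), '-2 days') FROM target), '1970-01-01')
        ON CONFLICT(event_id) DO UPDATE SET
            occurred_at = excluded.occurred_at,
            amount = excluded.amount
    """)

incremental_run()
# A late-arriving correction lands inside the lookback window:
con.execute("UPDATE source SET amount = 25 WHERE event_id = 'e2'")
incremental_run()
final = con.execute("SELECT event_id, amount FROM target ORDER BY event_id").fetchall()
print(final)  # [('e1', 10), ('e2', 25)] — no duplicates, correction picked up
```

A plain `> MAX(occurred_at)` filter would have skipped the corrected e2 row entirely; the lookback plus the unique-key merge makes re-running the window safe.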
Q13. "How do you ensure tests actually run in production?"
Show strong answer
"dbt build instead of dbt run — build interleaves tests with model runs and fails fast. In CI on PRs, Slim CI runs build only on changed models (state:modified+) against a staging schema. In production, the scheduler runs build and alerts on test failures. Severity-warn for tests that are signal-not-blocker; severity-error for true data contracts. And I'd add a source-freshness step before transformations run — if the upstream load failed, abort early."
Section D · Data modeling
Q14. "Design a data model for an AI inference platform."
Show strong answer
"First — business questions. What does the team need to answer? Revenue per customer, GPU utilization, model performance, unit economics. Then entities and events. Entities are dims: customers, models, GPUs, providers, regions. Events are facts: every inference request, every GPU-hour, every billing event."
"For facts I'd build three. fct_inference_requests — one row per API call — customer_id, model_id, gpu_id, latency_ms, input_tokens, output_tokens, cost_usd. fct_gpu_hourly_utilization — one row per GPU per hour — utilization_pct, memory_used_gb, revenue_generated. fct_billing_events for invoicing. Dims: dim_customers SCD2 because plan and region matter point-in-time, dim_models SCD1, dim_gpus SCD2, dim_providers, dim_regions, dim_date."
"That model answers most of the canonical questions. For very high-volume joins I'd build an OBT-style wide table on top for self-serve BI. Edge cases I'd want to nail down: time zones for billing, definition of 'request' (do retries count?), how partial-failures are recorded, multi-tenant GPU assignment if shared."
Q15. "When do you SCD2 a dimension?"
Show strong answer
"Default to SCD1 — overwrite — because most BI questions are 'what does the world look like now.' I reach for SCD2 when the business genuinely needs point-in-time accuracy. Clearest cases: financial reporting, where you need 'what plan was this customer on when they placed this order,' not what plan they're on today. Also pricing, regional assignments, anything regulators might ask about historically. The cost of SCD2 is more complex point-in-time joins and more storage; the cost of not SCD2-ing something that needed it is silent wrong numbers in historical reports."
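The point-in-time join that SCD2 enables can be sketched with `sqlite3` (table shapes, the `valid_from`/`valid_to` convention, and the plan-upgrade scenario are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customers_scd2 (
    customer_id TEXT, plan TEXT, valid_from TEXT, valid_to TEXT)""")
con.executemany("INSERT INTO dim_customers_scd2 VALUES (?, ?, ?, ?)", [
    ("c1", "basic", "2024-01-01", "2024-06-01"),
    ("c1", "pro",   "2024-06-01", "9999-12-31"),   # current row: open-ended valid_to
])
con.execute("CREATE TABLE fct_orders (order_id INT, customer_id TEXT, ordered_at TEXT)")
con.executemany("INSERT INTO fct_orders VALUES (?, ?, ?)", [
    (1, "c1", "2024-03-15"),   # placed while on basic
    (2, "c1", "2024-07-01"),   # placed after the upgrade
])

# Point-in-time join: pick the dimension row that was valid when the order happened
rows = con.execute("""
    SELECT o.order_id, d.plan
    FROM fct_orders o
    JOIN dim_customers_scd2 d
      ON o.customer_id = d.customer_id
     AND o.ordered_at >= d.valid_from
     AND o.ordered_at <  d.valid_to
    ORDER BY o.order_id
""").fetchall()
print(rows)  # [(1, 'basic'), (2, 'pro')]
```

An SCD1 dimension would report both orders as 'pro' — exactly the silent wrong number in historical reports.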
Q16. "What's the grain of fct_orders? How do you decide?"
Show strong answer
"Almost always one row per order. But if orders have line items I might keep a separate fct_order_items at the atomic line-item grain, and roll up to order-grain via a derived table or view. The principle is to model at the most atomic grain you can; you can always aggregate up, you can't easily decompose down. Grain is the first conversation I have before writing any model — if you can't state the grain in one sentence, the model isn't right yet."
Q17. "Star schema or OBT?"
Show strong answer
"Both. Star schema as the canonical source-of-truth layer — that's where modeling discipline, testing, and documentation live. OBT-style wide tables built on top for specific consumption use cases: self-serve BI where users can't join, ML feature stores, performance-sensitive dashboards. The wide tables are derived from the star, not the source. This way you get modeling discipline and consumption ergonomics. Pure OBT collapses under maintenance pressure; pure star schema sometimes makes self-serve hard."
Section E · Pipelines & operations
Q18. "ETL vs ELT?"
Show strong answer
"ELT — extract, load raw to the warehouse, transform in the warehouse — is the modern default. Reasons: cloud warehouses make compute cheap relative to storage, having raw data lets you re-derive when business logic changes, and dbt makes the in-warehouse transform layer ergonomic. ETL — transforming before loading — was the older pattern when warehouses were expensive. ETL still makes sense in specific cases: when data volume is huge and most of it isn't useful (filter early), or when PII has to be stripped before landing for compliance reasons."
Q19. "How do you handle backfills?"
Show strong answer
"The pipeline has to be idempotent — running it twice for the same window has to produce the same result. Then backfill is just 'run for these specific windows.' For dbt incrementals, that means parameterizing the model by a date variable and being able to re-run for any window without duplicating side effects. For Airflow-style pipelines, partitioned DAGs with idempotent tasks. The wrong pattern: a backfill script that's special-cased, separate from the prod path — it always diverges. Backfill should use the production code with different parameters."
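Idempotency is the whole trick, and it is easy to demonstrate with a delete+insert-per-window sketch in `sqlite3` (tables and the daily window are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (event_day TEXT, amount INT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("2024-06-01", 10), ("2024-06-01", 5), ("2024-06-02", 7)])
con.execute("CREATE TABLE daily_totals (event_day TEXT, total INT)")

def run_for_window(day: str):
    # Idempotent delete+insert: re-running the same window replaces, never duplicates
    con.execute("DELETE FROM daily_totals WHERE event_day = ?", (day,))
    con.execute("""
        INSERT INTO daily_totals
        SELECT event_day, SUM(amount) FROM raw_events
        WHERE event_day = ? GROUP BY event_day
    """, (day,))

run_for_window("2024-06-01")
run_for_window("2024-06-01")   # backfill / retry: same parameters, same result
totals = con.execute("SELECT * FROM daily_totals").fetchall()
print(totals)  # [('2024-06-01', 15)]
```

With this shape, "backfill" is nothing special: loop `run_for_window` over the dates you need, using the exact production path.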
Q20. "Schema evolves upstream. How do you handle it?"
Show strong answer
"Three layers. First, source freshness + schema tests detect the change — column added, column removed, column type changed. Second, the contract approach: declare the columns I depend on in staging, fail loud if they're missing. Third, dbt's on_schema_change config — append_new_columns or sync_all_columns to handle additions automatically. For breaking changes — type or name changes — there's no automatic answer; it's a coordination question. The principle: detect early, fail loud, version when needed."
Section F · Data quality & observability
Q21. "Walk me through your data quality strategy."
Show strong answer
"Three layers. One — dbt tests on every model: unique, not_null, relationships, accepted_values at minimum, plus custom singular tests for business rules. Two — source freshness checks alert when upstream stops loading. Three — a data observability layer (Elementary, Monte Carlo, or similar) for what tests can't catch: anomalies, distribution drift, volume drops. I severity-warn the noisy-but-useful tests and severity-error the contracts. Critical: tests run as part of dbt build, not as an afterthought, and the alert path goes to someone who actually owns the model."
Q22. "A dashboard's revenue number is off. How do you debug?"
Show strong answer
"Walk back through the lineage. Start with the dashboard query — what model is it pulling from? Then the model — what does its SQL do, what's the grain, does the test suite pass? Common bugs at each layer: a fanout from a join at the wrong grain, a NULL filter excluding rows silently, a time-zone confusion, a deduplication that didn't deduplicate (or did and shouldn't have). I'd compare totals at each upstream layer to find where the number diverges from expectation. Once found, fix the model, add a test that would have caught it, and document the gotcha in the model's yaml."
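The most common of those bugs — a fanout from joining at the wrong grain — reproduces in a few lines of `sqlite3` (tables and the split-shipment scenario are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, revenue INT)")
con.execute("INSERT INTO orders VALUES (1, 100)")
con.execute("CREATE TABLE shipments (order_id INT, carrier TEXT)")
con.executemany("INSERT INTO shipments VALUES (?, ?)",
                [(1, "ups"), (1, "fedex")])   # the order shipped in two parcels

# Fanout bug: joining order-grain revenue to shipment-grain rows doubles it
bad = con.execute("""
    SELECT SUM(o.revenue) FROM orders o JOIN shipments s ON o.order_id = s.order_id
""").fetchone()[0]

# Fix: collapse the many-side to order grain before joining
good = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o
    JOIN (SELECT DISTINCT order_id FROM shipments) s ON o.order_id = s.order_id
""").fetchone()[0]
print(bad, good)  # 200 100
```

Comparing SUM(revenue) at each layer of the lineage is exactly how you spot where the 100 became 200.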
Q23. "What's a data contract?"
Show strong answer
"An explicit, versioned schema and behavior agreement between the producer of a dataset and its consumers. The producer commits to: these columns, these types, this freshness, these uniqueness guarantees. The consumer commits to: relying only on what's in the contract, not on undocumented behavior. In practice: dbt's data contracts feature, explicit schema definitions, CI that fails when contract is violated. The shift from 'data team gets blindsided when an engineer drops a column' to 'producer can't ship that change without breaking the build.' Cultural change at least as much as technical."
Section G · Domain (AI compute / GPU platform)
Q24. "How would you model GPU utilization?"
Show strong answer
"Periodic snapshot fact table — one row per GPU per time period. Choice of period depends on use case: hourly is the common middle ground (good for cost analysis and ops dashboards), per-minute or per-second is needed for real-time alerting. Columns: gpu_id, hour_ts, utilization_pct (compute), memory_used_gb, requests_served, revenue_generated, currently_assigned_customer_id, status. Foreign keys to dim_gpus (SCD2, location can change), dim_providers, dim_regions, dim_date. Be careful with additivity: utilization_pct is not additive — summing percentages across GPUs or across hours produces meaningless numbers over 100. Average it, or store the underlying busy-seconds and divide at query time."
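The additivity trap in two queries, sketched with `sqlite3` (table and values hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_gpu_hourly (gpu_id TEXT, hour_ts TEXT, utilization_pct REAL)")
con.executemany("INSERT INTO fct_gpu_hourly VALUES (?, ?, ?)", [
    ("g1", "00:00", 80.0), ("g1", "01:00", 40.0),
    ("g2", "00:00", 100.0), ("g2", "01:00", 0.0),
])

# Wrong: summing a percentage across hours yields a meaningless 120
wrong = con.execute("""
    SELECT gpu_id, SUM(utilization_pct) FROM fct_gpu_hourly
    GROUP BY gpu_id ORDER BY gpu_id
""").fetchall()

# Right: average over the time dimension (or sum busy-seconds, then divide)
right = con.execute("""
    SELECT gpu_id, AVG(utilization_pct) FROM fct_gpu_hourly
    GROUP BY gpu_id ORDER BY gpu_id
""").fetchall()
print(wrong)  # [('g1', 120.0), ('g2', 100.0)]
print(right)  # [('g1', 60.0), ('g2', 50.0)]
```

g2's "100.0" under SUM looks plausible and is still wrong — that's what makes this bug dangerous in a BI tool that defaults to SUM.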
Q25. "Define unit economics for an inference platform."
Show strong answer
"At the per-request level: revenue per request minus cost per request. Cost per request is the hard part — it's GPU-time-attributable plus overhead. For a GPU rented at $X/hour serving N requests/hour, cost-per-request = X/N. So utilization rate directly drives unit margin. At the customer level: revenue minus attributed cost, where 'attributed' means GPU-time consumed by that customer's requests. For a marketplace, also subtract provider payout. For a self-hosted cloud, subtract amortized GPU cost. Key metric I'd build first: gross margin per customer per month. Second: requests-per-GPU-hour. Third: revenue-per-GPU-hour. If the platform also serves multiple models, segment all of those by model."
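The arithmetic, made concrete with entirely hypothetical numbers:

```python
# Hypothetical inputs, just to make the cost-per-request arithmetic concrete
gpu_cost_per_hour = 2.50        # $/hour for the rented GPU
requests_per_hour = 1_000       # throughput at current utilization
revenue_per_request = 0.004     # $ the customer pays per call

cost_per_request = gpu_cost_per_hour / requests_per_hour     # X/N = $0.0025
margin_per_request = revenue_per_request - cost_per_request  # ~$0.0015
gross_margin_pct = margin_per_request / revenue_per_request  # ~37.5%

# Utilization drives everything: double the throughput, halve the cost per request
cost_if_2x = gpu_cost_per_hour / (2 * requests_per_hour)     # $0.00125
print(cost_per_request, round(margin_per_request, 6), round(gross_margin_pct, 3))
```

Nothing about the revenue side changed in that last line; the margin improvement comes purely from spreading the same GPU-hour over more requests, which is why requests-per-GPU-hour is a first-build metric.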
Q26. "Inference logs are 10 billion rows/month. How do you serve analytics on them?"
Show strong answer
"Don't serve them raw — pre-aggregate. The raw logs live in cheap object storage, partitioned by date and customer. Build incremental dbt models that aggregate to the grains analysts actually need: hourly per (customer, model, region), daily per customer, etc. Most queries hit aggregates, not raw. For ad-hoc deep dives, query the raw with a strict partition filter — Snowflake or BigQuery handle this well if partitioning is right. For sub-second latency on customer-facing dashboards, materialize to a fast OLAP store — ClickHouse or warehouse materialized views. Two anti-patterns to avoid: querying raw events without partition filter (bankrupting query), and over-aggregating so analysts can't answer new questions without re-running heavy jobs."
Section H · Curveballs
Q27. "Tell me about a time you found bad data in production."
Show strong answer
Use a real story from your background. The shape: what was wrong → how you found it → root cause → what you fixed → what you added to prevent it.
If you don't have a vivid one, flip honestly:
"Truthfully, my biggest example was [whatever scale — small but real]. The pattern that matters: I treat 'bad data in prod' as an incident, not a bug fix. Find it, fix the data, add a test that would have caught it, document the gotcha so the next person learns. The thing I'd want to avoid: silently fixing the SQL without leaving a paper trail, because then the next person hits the same trap."
Q28. "Finance and product disagree about what 'active customer' means. What do you do?"
Show strong answer
"I don't pick — they do. My job is to make the definitions explicit and the tradeoffs visible. Step one: get both definitions on paper. 'Active means logged in in the last 28 days' vs 'Active means generated revenue in the last 28 days' — those are different things, and that's fine. Step two: build both, name them distinctly (active_login_28d, active_revenue_28d), document who uses which and why. Step three: if there's a 'company-wide official' active metric needed, that's a leadership decision; I'd surface the options with the numerical impact of each. Long-term: a metrics layer where definitions are versioned and explicit prevents the 'three different numbers in three dashboards' problem from recurring."
Q29. "What would you do in your first 30 days?"
Show strong answer
"Listen first. Read the existing dbt project end-to-end. Sit with the people who consume the data — finance, product, ops — and ask 'what's the question you can't easily answer today?' Read the on-call runbooks if any exist. Inventory: what's the warehouse cost trend, what models take longest to run, where does the test suite fail, which numbers do people not trust. Identify the highest-leverage first build — usually it's a metric definition the team has been working around, or a model that's silently slow. Avoid shipping anything in the first month I haven't seen the failure modes of."
Q30. "What's something the data engineering field gets wrong?"
Show strong answer
"Two. One: over-investing in tooling, under-investing in data modeling discipline. The number of teams with a state-of-the-art stack and a wedge of legacy SQL no one wants to touch is staggering. Tools are levers, not solutions. Two: treating data quality as a dashboard project instead of a software engineering practice. Tests on every model, CI gates, ownership, incident reviews — that's how trust gets built, not by buying a quality vendor and hoping. I'd rather work somewhere with a smaller stack and better discipline than the reverse."
How to drill
- Toggle Drill mode at the top (it's on by default — answers hidden).
- Pick a question. Cover the answer button. Give your version out loud on a ~90 second timer.
- Reveal. Compare. Note structural differences.
- Re-try, integrating anything you missed. Mark practiced once you hit it cleanly twice in a row.
- Don't memorize verbatim — memorize the structure and the concrete examples.
Spread the drill over multiple sessions. Same-day repetition has diminishing returns; doing 5 questions today + 5 different ones tomorrow + revisiting all 10 the day after beats drilling 15 in one sitting.