The Modern Data Stack
How warehouse + dbt + orchestrator + BI fit together, what each layer is good at, and how to choose when standing up a stack for a new team.
The stack in one diagram
Ingestion → Storage → Transformation (dbt) → Orchestration → BI / metrics layer, with quality checks and reverse ETL alongside.
Ingestion (the "EL" of ELT)
Get raw data from sources into the warehouse with minimal transformation. Options:
- Managed connectors — Fivetran, Airbyte, Stitch. Pay-per-row or per-connector. Hands-off but pricey at scale.
- Open source / DIY — Airbyte (self-hosted), Singer, Meltano, custom Python. More work, cheaper.
- Streaming — Kafka + Snowpipe Streaming / BigQuery Streaming Inserts. For event firehoses.
- CDC (Change Data Capture) — Debezium, AWS DMS, Google Cloud Datastream. Replicate source-database changes in near real time.
- SaaS-native — most SaaS tools have official connectors (Stripe, Salesforce, etc.).
For interview questions: "How would you ingest data from X?" The answer depends on freshness, volume, cost tolerance. Default: Fivetran for SaaS, custom Python for in-house DBs, Kafka for high-volume events.
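The "custom Python for in-house DBs" default usually reduces to one pattern: pull rows newer than a stored high-watermark, land them raw, advance the watermark. A minimal sketch, assuming an in-memory iterable stands in for the real DB cursor or API page iterator (`Row` and `extract_incremental` are illustrative names, not a real library):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Row:
    id: int
    updated_at: datetime
    payload: dict = field(default_factory=dict)

def extract_incremental(source_rows, watermark):
    """Return rows updated after `watermark`, plus the new watermark.

    In a real pipeline `source_rows` would be a DB cursor or a paginated
    API response; here it is any iterable of Row, to keep the sketch
    self-contained.
    """
    new_rows = [r for r in source_rows if r.updated_at > watermark]
    # If nothing changed, keep the old watermark so the next run is a no-op.
    new_watermark = max((r.updated_at for r in new_rows), default=watermark)
    return new_rows, new_watermark

# Usage: each run picks up only what changed since the last run.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)
rows = [Row(1, t0), Row(2, t1)]
batch, wm = extract_incremental(rows, t0)
```

The same shape underlies managed connectors too; what Fivetran sells is mostly this loop plus schema handling, retries, and hundreds of prebuilt sources.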
Storage — warehouse vs lakehouse
The split matters less than it used to. Modern offerings blur the line:
| Class | Examples | Strengths |
|---|---|---|
| Cloud warehouse | Snowflake, BigQuery, Redshift | SQL-first, ergonomic, separation of storage/compute, fast |
| Lakehouse | Databricks (Delta), Snowflake (Iceberg), BigQuery (Iceberg) | Open table formats, ML/data-science friendly, multi-engine |
| OLAP DB | ClickHouse, Druid, Pinot, StarRocks | Sub-second analytics, real-time ingest, customer-facing dashboards |
| Streaming SQL | Materialize, RisingWave, Flink SQL | Incremental view maintenance, true real-time |
See 09-warehouses for deeper coverage of each.
Transformation
dbt is the dominant tool. SQLMesh and Dataform are notable alternatives.
- dbt Core — open source, run anywhere.
- dbt Cloud — managed, includes scheduler + IDE + observability.
- SQLMesh — newer, Python-native, virtual data environments.
- Dataform — Google-acquired, integrates closely with BigQuery.
If asked: "Why dbt?" — "It made the transformation layer a software engineering practice. Version control, tests, lineage, documentation as code. The category exists because of dbt; alternatives are still catching up on ecosystem."
See 05-dbt-deep-dive.
Orchestration
Run things in order, retry, schedule, handle failures, observability.
- Airflow — the default. Python DAGs, huge ecosystem, mature, but clunky for data-specific patterns.
- Dagster — modern, asset-oriented (you declare what you want to exist, not what to run). Better dev ergonomics.
- Prefect — Python-native, lighter weight, hybrid cloud/OSS.
- dbt Cloud scheduler — limited but sufficient for dbt-only workflows.
- GitHub Actions / cloud-native — for simple cron-style jobs.
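"Run things in order, retry" is the core of every tool above; Airflow and Dagster add scheduling, persisted state, and a UI on top. A toy stdlib sketch of just those two behaviors, to make the concept concrete (all names here are illustrative, not any orchestrator's API):

```python
import time
from graphlib import TopologicalSorter

def run_dag(tasks, deps, retries=2, delay=0.0):
    """Run `tasks` (name -> callable) in dependency order with retries.

    `deps` maps task name -> set of upstream task names. A task that
    still fails after `retries` extra attempts raises, failing the run.
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise
                time.sleep(delay)  # back off before retrying
    return results

# Usage: a transform that depends on two extracts, mirroring EL -> T.
order = []
tasks = {
    "extract_a": lambda: order.append("extract_a"),
    "extract_b": lambda: order.append("extract_b"),
    "transform": lambda: order.append("transform"),
}
deps = {
    "extract_a": set(),
    "extract_b": set(),
    "transform": {"extract_a", "extract_b"},
}
run_dag(tasks, deps)
```

Dagster's asset orientation inverts this: instead of declaring the run order, you declare the assets and their upstreams, and the scheduler derives the equivalent of `deps` for you.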
See 13-orchestration.
BI / consumption
- Looker — semantic layer (LookML), strong governance, exec-friendly. Now Google.
- Tableau — most powerful viz, weaker semantic story. Now Salesforce.
- Power BI — dominant in Microsoft shops.
- Hex / Mode / Deepnote — notebook + SQL, popular with data teams.
- Metabase / Preset / Superset — open-source-friendly.
- Sigma — spreadsheet-style on warehouse.
- Embedded BI — Cube, ClickHouse + custom UI, for customer-facing analytics.
Metrics layer / semantic layer
The "we have three different numbers for active users" problem.
- dbt Semantic Layer (powered by MetricFlow) — define metrics in dbt, query via a unified API. Most aligned with a dbt-centric stack.
- Cube — open-source semantic + caching layer.
- LookML — Looker's built-in semantic model.
- Lightdash — open-source dbt-native BI with built-in metrics.
Have an opinion: "For any company with more than 5 analysts asking the same questions, the metrics layer is no longer optional — it's the only way to avoid metric drift."
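The drift starts the moment two dashboards each hand-write their own `active_users` SQL. Mechanically, what any semantic layer does is keep one metric definition and compile a query from it per consumer. A toy Python sketch of that idea, assuming made-up metric and table names (this is not the dbt Semantic Layer API):

```python
# One place metrics are defined; every consumer compiles SQL from it
# instead of hand-writing its own aggregation.
METRICS = {
    "active_users": {
        "expr": "count(distinct user_id)",
        "table": "fct_events",          # hypothetical fact table
        "filter": "event_type = 'login'",
    },
}

def compile_metric(name, group_by=None):
    """Compile a metric definition into SQL, optionally grouped."""
    m = METRICS[name]
    dims = f"{group_by}, " if group_by else ""
    sql = (
        f"select {dims}{m['expr']} as {name} "
        f"from {m['table']} where {m['filter']}"
    )
    if group_by:
        sql += f" group by {group_by}"
    return sql

sql = compile_metric("active_users", group_by="date_day")
```

Whether the BI tool, a notebook, or an API client asks, the aggregation logic is identical, which is exactly the property that prevents three different numbers for active users.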
Quality & observability
- dbt tests — the first line. Built-in (unique, not_null, relationships, accepted_values) + dbt-utils + dbt-expectations.
- Great Expectations — Python-native, more flexible than dbt tests.
- Monte Carlo, Bigeye, Anomalo — commercial data observability (anomaly detection, freshness, schema drift).
- Elementary — open-source dbt-native observability.
- OpenLineage / DataHub / Atlan — lineage and catalog.
See 08-data-quality and 14-observability.
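The built-in dbt tests listed above are declared in YAML alongside the model. A representative `schema.yml` fragment showing all four built-ins (model and column names are made up for illustration):

```yaml
version: 2

models:
  - name: dim_customers            # hypothetical model
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'churned', 'trial']
      - name: account_id
        tests:
          - relationships:
              to: ref('dim_accounts')
              field: account_id
```

Each entry compiles to a SQL query that returns failing rows; `dbt test` fails the run if any come back, which is what makes these the cheap first line before heavier tools like Great Expectations.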
Reverse ETL
Push warehouse data back to operational tools — sync customer segments to marketing tools, sync user attributes to product analytics, etc.
- Hightouch
- Census
- RudderStack (also doubles as a CDP)
The "data activation" layer. Increasingly common; not always part of the analytics-engineering scope, but worth knowing exists.
An opinionated default stack
If asked "what would you set up from scratch?" — have a defensible answer:
"For a mid-size team starting today: Fivetran for SaaS ingestion, custom Python for in-house DBs, Snowflake or BigQuery as the warehouse — pick by team familiarity. dbt Core in version control, run by Airflow or Dagster, with Slim CI on PRs. dbt tests + Elementary for quality, Looker or Hex for BI, dbt Semantic Layer for metrics. Hightouch for any reverse ETL. The whole thing fits in one diagram and has clear ownership at each layer. I'd avoid streaming until there's a specific latency requirement; default to hourly or daily batch."