Section B · Foundation

The Modern Data Stack

How warehouse + dbt + orchestrator + BI fit together. What each layer is good at. How to choose when standing up a new team.

The stack in one diagram

[Operational databases]  [SaaS apps]  [Event streams]  [Files]
         │                   │              │             │
         └───────────────────┴──────┬───────┴─────────────┘
                                    │
                            INGESTION / EL
               (Fivetran / Airbyte / custom / streaming)
                                    │
                                    ▼
                        ┌──────────────────────┐
                        │   WAREHOUSE / LAKE   │
                        │ Snowflake / BigQuery │
                        │ Redshift / Databricks│
                        └──────────┬───────────┘
                                   │
                          TRANSFORMATION (T)
                       (dbt — the dominant tool)
                                   │
                                   ▼
                        ┌──────────────────────┐
                        │   ANALYTICS-READY    │
                        │  MARTS (fct_, dim_)  │
                        └─────┬────────┬───────┘
                              │        │
                     ┌────────┘        └────────┐
                     ▼                          ▼
             BI / consumption              Reverse ETL
        (Looker / Tableau / Hex)      (Hightouch / Census)
                                                │
                                                ▼
                                      Operational tools
                                  (CRM / Marketing / Slack)

        ┌─────────────────────────────────────┐
        │ Cross-cutting:                      │
        │  - Orchestration (Airflow / Dagster)│
        │  - Quality (dbt tests / Monte Carlo)│
        │  - Lineage (DataHub / OpenLineage)  │
        │  - Semantic layer (dbt SL / Cube)   │
        └─────────────────────────────────────┘

Ingestion (the "EL" of ELT)

Get raw data from sources into the warehouse with minimal transformation. Options:

  • Managed connectors — Fivetran, Airbyte, Stitch. Pay-per-row or per-connector. Hands-off but pricey at scale.
  • Open source / DIY — Airbyte (self-hosted), Singer, Meltano, custom Python. More work, cheaper.
  • Streaming — Kafka + Snowpipe Streaming / BigQuery Streaming Inserts. For event firehoses.
  • CDC (Change Data Capture) — Debezium, Snowflake Streams, Google Cloud Datastream. Replicate database changes in near-real-time.
  • SaaS-native — most SaaS tools have official connectors (Stripe, Salesforce, etc.).

A common interview question: "How would you ingest data from X?" The answer depends on freshness requirements, volume, and cost tolerance. Defaults: Fivetran for SaaS, custom Python for in-house DBs, Kafka for high-volume events.
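
The "custom Python for in-house DBs" default can be very small: extract pages from the source, land them raw, and let dbt do all transformation downstream. A minimal sketch — `fetch_page` is a hypothetical source client (in practice an HTTP or DB call), and the staging table is faked with an in-memory list:

```python
import json
from datetime import datetime, timezone

# Hypothetical paginated source (stand-in for a real API or DB cursor).
def fetch_page(page, page_size=2):
    records = [{"id": i, "email": f"user{i}@example.com"} for i in range(5)]
    start = page * page_size
    return records[start:start + page_size]

def extract():
    """Pull pages from the source until exhausted."""
    page = 0
    while True:
        batch = fetch_page(page)
        if not batch:
            return
        yield from batch
        page += 1

def load_raw(rows, target_table):
    """EL style: land records as-is, plus load metadata.
    No transformation here — that is dbt's job downstream."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    for row in rows:
        target_table.append({
            "raw_payload": json.dumps(row),  # keep the source record verbatim
            "_loaded_at": loaded_at,
        })

raw_users = []  # stand-in for a warehouse staging table
load_raw(extract(), raw_users)
```

The key design choice is keeping the payload verbatim: if a transformation is wrong, you re-run dbt, not the extraction.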

Storage — warehouse vs lakehouse

The split matters less than it used to; modern offerings blur the line:

Class           | Examples                                                    | Strengths
Cloud warehouse | Snowflake, BigQuery, Redshift                               | SQL-first, ergonomic, separation of storage/compute, fast
Lakehouse       | Databricks (Delta), Snowflake (Iceberg), BigQuery (Iceberg) | Open table formats, ML/data-science friendly, multi-engine
OLAP DB         | ClickHouse, Druid, Pinot, StarRocks                         | Sub-second analytics, real-time ingest, customer-facing dashboards
Streaming SQL   | Materialize, RisingWave, Flink SQL                          | Incremental view maintenance, true real-time

See 09-warehouses for deeper coverage of each.

Transformation

dbt is the dominant tool. SQLMesh and Dataform are notable alternatives.

  • dbt Core — open source, run anywhere.
  • dbt Cloud — managed, includes scheduler + IDE + observability.
  • SQLMesh — newer, Python-native, virtual data environments.
  • Dataform — Google-acquired, integrates closely with BigQuery.

If asked: "Why dbt?" — "It made the transformation layer a software engineering practice. Version control, tests, lineage, documentation as code. The category exists because of dbt; alternatives are still catching up on ecosystem."
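
The "lineage as code" part of that answer has a simple mechanical core: dbt parses `ref()` calls out of each model's SQL, builds a DAG, and runs models in dependency order. A toy illustration of that idea (not dbt's actual implementation):

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Toy models: SQL with dbt-style ref() calls (illustrative only).
models = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "fct_orders": "select * from {{ ref('stg_orders') }} "
                  "join {{ ref('stg_customers') }} using (customer_id)",
}

def parse_refs(sql):
    """Extract upstream model names from ref() calls."""
    return set(re.findall(r"ref\('([^']+)'\)", sql))

# Map each model to its upstream dependencies, then derive a valid run order.
graph = {name: parse_refs(sql) for name, sql in models.items()}
run_order = list(TopologicalSorter(graph).static_order())
```

Because dependencies are declared in the SQL itself, lineage, docs, and build order all fall out of the same source of truth.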

See 05-dbt-deep-dive.

Orchestration

The job: run tasks in order, retry and handle failures, schedule, provide observability.

  • Airflow — the default. Python DAGs, huge ecosystem, mature. Also clunky for data-specific patterns.
  • Dagster — modern, asset-oriented (you declare what you want to exist, not what to run). Better dev ergonomics.
  • Prefect — Python-native, lighter weight, hybrid cloud/OSS.
  • dbt Cloud scheduler — limited but sufficient for dbt-only workflows.
  • GitHub Actions / cloud-native — for simple cron-style jobs.
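
Stripped to the bone, every tool on this list provides the same contract: run tasks in dependency order and retry on transient failure. A minimal sketch of that contract (illustrative only — real orchestrators add scheduling, persisted state, and observability):

```python
import time
from graphlib import TopologicalSorter  # Python 3.9+

def run_dag(tasks, deps, retries=2, backoff=0.01):
    """Run callables in dependency order, retrying each up to `retries` times."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                results[name] = tasks[name]()
                break
            except Exception:
                if attempt == retries:
                    raise  # retries exhausted: fail the run
                time.sleep(backoff * 2 ** attempt)  # exponential backoff

    return results

# A flaky task that succeeds on the second attempt.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient source error")
    return "raw rows"

tasks = {"extract": flaky_extract, "transform": lambda: "marts built"}
deps = {"extract": set(), "transform": {"extract"}}  # transform waits on extract
results = run_dag(tasks, deps)
```

Dagster's asset orientation inverts this: you declare the tables you want to exist and it derives the task graph, but the execution core is the same.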

See 13-orchestration.

BI / consumption

  • Looker — semantic layer (LookML), strong governance, exec-friendly. Now Google.
  • Tableau — most powerful viz, weaker semantic story. Now Salesforce.
  • Power BI — dominant in Microsoft shops.
  • Hex / Mode / Deepnote — notebook + SQL, popular with data teams.
  • Metabase / Preset / Superset — open-source-friendly.
  • Sigma — spreadsheet-style on warehouse.
  • Embedded BI — Cube, ClickHouse + custom UI, for customer-facing analytics.

Metrics layer / semantic layer

The "we have three different numbers for active users" problem.

  • dbt Semantic Layer (built on MetricFlow) — define metrics in dbt, query via a unified API. Most-aligned with the dbt-centric stack.
  • Cube — open-source semantic + caching layer.
  • LookML — Looker's built-in semantic model.
  • Lightdash — open-source dbt-native BI with built-in metrics.

Have an opinion: "For any company with more than 5 analysts asking the same questions, the metrics layer is no longer optional — it's the only way to avoid metric drift."
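
The mechanics behind every semantic layer are the same: define each metric once, centrally, and compile every consumer's query from that single definition. A toy sketch of the idea (real semantic layers like MetricFlow compile to warehouse-specific SQL; the metric definition here is hypothetical):

```python
# One central metric definition instead of N ad-hoc SQL snippets.
METRICS = {
    "active_users": {
        "table": "fct_events",
        "expr": "count(distinct user_id)",
        "filter": "event_type = 'session_start'",
    },
}

def compile_metric(name, grain="day"):
    """Every dashboard and notebook calls this, so the number can't drift."""
    m = METRICS[name]
    return (
        f"select date_trunc('{grain}', event_at) as {grain}, "
        f"{m['expr']} as {name} "
        f"from {m['table']} where {m['filter']} group by 1"
    )

sql = compile_metric("active_users", grain="week")
```

Three teams asking for weekly active users now get three copies of the same query rather than three competing definitions.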

Quality & observability

  • dbt tests — the first line. Built-in (unique, not_null, relationships, accepted_values) + dbt-utils + dbt-expectations.
  • Great Expectations — Python-native, more flexible than dbt tests.
  • Monte Carlo, Bigeye, Anomalo — commercial data observability (anomaly detection, freshness, schema drift).
  • Elementary — open-source dbt-native observability.
  • OpenLineage / DataHub / Atlan — lineage and catalog.

See 08-data-quality and 14-observability.

Reverse ETL

Push warehouse data back to operational tools — sync customer segments to marketing tools, sync user attributes to product analytics, etc.

  • Hightouch
  • Census
  • RudderStack (also doubles as a CDP)

The "data activation" layer. Increasingly common; not always part of the analytics-engineering scope, but worth knowing exists.
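
Mechanically, reverse ETL is a diff-and-push loop: compare the warehouse's current view of each record against what the destination last saw, then send only creates and changes. A hedged sketch (the CRM state shown is a hypothetical stand-in for a destination API):

```python
def plan_sync(warehouse_rows, destination_state, key="email"):
    """Compute which records to create or update in the destination tool.
    destination_state maps key -> last-synced record."""
    creates, updates = [], []
    for row in warehouse_rows:
        prev = destination_state.get(row[key])
        if prev is None:
            creates.append(row)          # never seen by the destination
        elif prev != row:
            updates.append(row)          # changed since last sync
    return creates, updates

warehouse = [
    {"email": "a@x.com", "segment": "enterprise"},
    {"email": "b@x.com", "segment": "self-serve"},
]
crm_state = {"a@x.com": {"email": "a@x.com", "segment": "smb"}}  # stale
creates, updates = plan_sync(warehouse, crm_state)
```

Hightouch and Census sell exactly this loop plus the hard parts: destination API auth, rate limits, and tracking the last-synced state.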

An opinionated default stack

If asked "what would you set up from scratch?" — have a defensible answer:

"For a mid-size team starting today: Fivetran for SaaS ingestion, custom Python for in-house DBs, Snowflake or BigQuery as the warehouse — pick by team familiarity. dbt Core in version control, run by Airflow or Dagster, with Slim CI on PRs. dbt tests + Elementary for quality, Looker or Hex for BI, dbt Semantic Layer for metrics. Hightouch for any reverse ETL. The whole thing fits in one diagram and has clear ownership at each layer. I'd avoid streaming until there's a specific latency requirement; default to hourly or daily batch."