Section B · Critical

dbt Deep Dive

Models, sources, tests, snapshots, macros, incremental loads. The defining tool of analytics engineering — fluency here separates senior from mid-level.

What dbt actually is

dbt (data build tool) is — at its core — a SQL compiler with conventions. You write SELECT statements; dbt wraps them in CREATE TABLE / CREATE VIEW / incremental MERGE logic and runs them against your warehouse in the right order.

The pieces that make it more than that:

  • Dependency graph — uses {{ ref('model_name') }} to determine build order. The DAG is the entire reason dbt works.
  • Materialization strategies — view, table, incremental, ephemeral, snapshot. Same SQL, different runtime behavior.
  • Tests — first-class SQL-based assertions on data. Run alongside builds.
  • Jinja templating — Python-like macros over SQL.
  • Documentation as code — yaml descriptions + auto-generated docs site with lineage graph.
  • Environments + targets — same code, different warehouses/schemas (dev / staging / prod).
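
To make the "SQL compiler" framing concrete, here's a minimal sketch of what you write versus roughly what dbt runs for a view materialization (the schema and relation names are illustrative, and the exact DDL varies by adapter):

what dbt compiles (sketch)
-- models/staging/stg_orders.sql (what you write)
SELECT * FROM {{ source('raw_app', 'orders') }}

-- roughly what dbt executes when materialized='view' (illustrative)
CREATE OR REPLACE VIEW analytics.dbt_dev.stg_orders AS (
    SELECT * FROM raw.app_prod.orders
);
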
The pitch in one sentence

"dbt is the T in ELT — it takes raw loaded data in the warehouse and turns it into analytics-ready, tested, documented models, using version-controlled SQL and a dependency graph."

Project layout — know it

typical dbt project
my_dbt_project/
├── dbt_project.yml          # project config — name, profile, paths, configs
├── packages.yml             # external dbt packages
├── models/
│   ├── staging/             # 1:1 with source tables, light cleanup
│   │   ├── _sources.yml     # declares raw schemas
│   │   ├── stg_users.sql
│   │   └── stg_orders.sql
│   ├── intermediate/        # complex joins, reusable building blocks
│   │   └── int_user_lifetime_value.sql
│   └── marts/               # business-facing tables (fct_, dim_)
│       ├── core/
│       │   ├── fct_orders.sql
│       │   └── dim_users.sql
│       └── finance/
│           └── fct_daily_revenue.sql
├── tests/                   # singular (custom) tests
├── macros/                  # reusable Jinja macros
├── snapshots/               # SCD type 2 tracking
├── seeds/                   # CSVs that get loaded as tables
└── analyses/                # one-off SQL files (NOT materialized)

The staging → intermediate → marts pattern

  • Staging (stg_*): one model per source table. Rename columns, cast types, light filtering. No business logic.
  • Intermediate (int_*): complex joins and computations that get reused. Optional layer.
  • Marts (fct_*, dim_*): the analytics-facing layer. Star schema tables. This is what BI tools query.

This convention comes from dbt Labs' style guide. Most companies follow it. If asked "how do you organize a dbt project?" — name these three layers.

Models & materializations

A model is just a SELECT in a .sql file. dbt wraps it based on the materialization config:

Materialization | What dbt does | When to use
view | CREATE OR REPLACE VIEW | Cheap, always fresh; OK for small datasets or the staging layer
table | CREATE OR REPLACE TABLE AS SELECT | Full rebuild every run; default for most marts
incremental | Initial CREATE; subsequent runs MERGE/INSERT only new rows | Large fact tables where a full rebuild is too slow or expensive
ephemeral | Inlined as a CTE in downstream models; no table created | Reusable transformations you don't want to materialize
materialized_view / dynamic_table | Warehouse-native incremental refresh (Snowflake Dynamic Tables, BigQuery Materialized Views) | Near-real-time freshness with low operational overhead

Config — three places to set it

materialization config
-- 1. In the model file (highest precedence for that model):
{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}

SELECT ...

# 2. In a YAML file alongside the model:
# models/marts/_marts.yml
models:
  - name: fct_orders
    config:
      materialized: incremental

# 3. In dbt_project.yml (lowest precedence — folder-level defaults):
models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table

Most teams set folder-level defaults in dbt_project.yml (staging = view, marts = table) and override at the model level when needed.

Sources — declare your raw data

Sources declare which raw tables in your warehouse dbt depends on. Two reasons they matter:

  1. Lineage — dbt's DAG starts at source tables.
  2. Freshness checks — running dbt source freshness alerts you when source data is stale.
models/staging/_sources.yml
version: 2

sources:
  - name: raw_app
    database: raw
    schema: app_prod
    loaded_at_field: _ingested_at
    freshness:
      warn_after: {count: 2, period: hour}
      error_after: {count: 6, period: hour}

    tables:
      - name: users
        description: "User accounts from the app database"
        columns:
          - name: id
            tests:
              - unique
              - not_null
      - name: orders
        loaded_at_field: created_at  # override per-table
models/staging/stg_users.sql
SELECT
  id           AS user_id,
  email,
  created_at,
  updated_at,
  is_active
FROM {{ source('raw_app', 'users') }}

ref() and source() — the magic

These are the two Jinja functions that make dbt's DAG work:

  • {{ source('source_name', 'table_name') }} — references a raw source table; the first argument is the source name declared in your sources yaml, not the schema.
  • {{ ref('model_name') }} — references another dbt model. dbt builds it first, then yours.

Never hard-code a table name inside a model. Always use ref() (or source() for raw tables). This:

  • Builds the dependency graph.
  • Handles environment-specific schemas automatically (dev vs prod).
  • Lets dbt resolve cross-database / cross-project references.
marts/fct_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
order_items AS (
    SELECT * FROM {{ ref('stg_order_items') }}
),
users AS (
    SELECT * FROM {{ ref('stg_users') }}
)
SELECT ...

Tests — first-class data quality

dbt has two kinds of tests:

1. Generic tests (declarative, in yaml)

Built in: unique, not_null, accepted_values, relationships.

models/marts/_marts.yml
version: 2

models:
  - name: dim_users
    description: "One row per user"
    columns:
      - name: user_id
        description: "Primary key"
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'inactive', 'churned']
      - name: plan_id
        tests:
          - relationships:
              to: ref('dim_plans')
              field: plan_id

2. Singular tests (custom SQL in tests/)

A singular test is just a SQL file in tests/ that returns the failing rows. Empty result = test passes.

tests/revenue_never_negative.sql
SELECT *
FROM {{ ref('fct_orders') }}
WHERE revenue < 0

Powerful packages

dbt-utils and dbt-expectations add dozens of tests: expression_is_true, recency, at_least_one, distribution tests, etc. Mention these in interviews to show ecosystem awareness.
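
A hedged sketch of what those package tests look like in model YAML (assuming the packages are installed via packages.yml; the model, column, and expression names are illustrative):

package tests (sketch)
version: 2

models:
  - name: fct_orders
    tests:
      - dbt_utils.expression_is_true:
          expression: "net_revenue = gross_revenue - discounts"
      - dbt_utils.recency:
          datepart: hour
          field: ordered_at
          interval: 24
    columns:
      - name: order_total
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0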

"Severity: warn"

Tests can be configured with severity: warn (vs default error). Warn-level tests show in logs but don't fail the run. Useful for "I want to know about this" without paging the on-call. Mention this pattern.
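
A minimal sketch of the syntax (the column and values are illustrative):

warn-level test
columns:
  - name: discount_pct
    tests:
      - accepted_values:
          values: [0, 5, 10, 15]
          config:
            severity: warn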

Snapshots — SCD Type 2 for free

A snapshot tracks how a row's columns change over time. Each version gets dbt_valid_from and dbt_valid_to columns. This is the canonical way to implement Slowly Changing Dimension Type 2 in dbt.

snapshots/users_snapshot.sql
{% snapshot users_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='check',
    check_cols=['plan_id', 'status', 'email']
  )
}}

SELECT * FROM {{ source('raw_app', 'users') }}

{% endsnapshot %}

Strategies:

  • check: dbt compares the columns you list; a new version is written when any of them change.
  • timestamp: dbt watches an updated_at column; a new version is written when it advances. More efficient when a reliable timestamp exists.

Run with dbt snapshot on a schedule (usually nightly).
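
For comparison, a timestamp-strategy snapshot is a minimal sketch like this, assuming the source has a reliable updated_at column:

timestamp strategy (sketch)
{% snapshot users_snapshot %}

{{
  config(
    target_schema='snapshots',
    unique_key='user_id',
    strategy='timestamp',
    updated_at='updated_at'
  )
}}

SELECT * FROM {{ source('raw_app', 'users') }}

{% endsnapshot %}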

Seeds — small CSVs as tables

Drop a CSV into seeds/, run dbt seed, get a table. Good for: country code mappings, manually-curated business rules, fixture data for tests. Not for: anything large or that changes often.
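
A quick illustrative sketch (hypothetical seed file and columns; seeds are referenced with ref() just like models):

seeds/country_codes.csv
country_code,country_name
US,United States
DE,Germany

joining a seed
SELECT
  u.*,
  c.country_name
FROM {{ ref('stg_users') }} u
LEFT JOIN {{ ref('country_codes') }} c
  ON u.country_code = c.country_code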

Macros & Jinja

Macros are reusable Jinja functions over SQL. Useful for DRYing up repeated logic.

macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name, precision=2) %}
  ROUND(({{ column_name }} / 100.0), {{ precision }})
{% endmacro %}
using the macro
SELECT
  order_id,
  {{ cents_to_dollars('amount_cents') }} AS amount_usd
FROM {{ ref('stg_orders') }}

Common Jinja patterns

  • {% if target.name == 'dev' %} ... {% endif %} — environment-conditional SQL.
  • {% for col in adapter.get_columns_in_relation(ref('x')) %} — iterate over a model's columns at compile time.
  • {{ var('start_date') }} — parameterize via project variables.
  • {{ env_var('SOMETHING') }} — read from environment.
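
A hedged sketch combining two of these patterns, a project variable with a default and a dev-only row limit (the variable name and limit are illustrative):

environment-conditional model (sketch)
SELECT *
FROM {{ ref('stg_events') }}
WHERE occurred_at >= '{{ var("start_date", "2020-01-01") }}'
{% if target.name == 'dev' %}
  -- keep dev builds fast and cheap
  LIMIT 1000
{% endif %}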

Incremental models — the most-asked-about topic

When a fact table has billions of rows and you can't afford to rebuild it nightly, you go incremental.

models/marts/fct_events.sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    on_schema_change='append_new_columns'
) }}

SELECT
  event_id,
  user_id,
  event_type,
  occurred_at,
  payload
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only process new events on subsequent runs
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}

Incremental strategies

Strategy | What it does | When
append | Just INSERT new rows | Immutable event logs
merge | UPSERT by unique_key (warehouse MERGE INTO) | Most common; when rows can update
delete+insert | Delete matching rows, then insert | When MERGE isn't supported or you want partition-replace semantics
insert_overwrite | Replace whole partitions (BigQuery, Spark) | Partition-based reprocessing

Common bugs

1. Late-arriving data falls off the window — if your filter is WHERE occurred_at > MAX(occurred_at), an event that happened 4 hours ago but only lands in the warehouse now is skipped, because newer events have already pushed MAX(occurred_at) past it. Use a wider lookback (e.g., > MAX(occurred_at) - INTERVAL '24 hours') combined with merge so reprocessed rows are deduplicated (see the sketch after this list).

2. Schema drift breaks the model — when upstream adds a column, the existing table has the old shape. Use on_schema_change='append_new_columns' or 'sync_all_columns'.

3. Full refresh forgotten — when business logic changes, you need dbt run --full-refresh -s fct_events. Document this.
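
For bug #1, a hedged sketch of the wider-lookback filter (the 3-day window is an arbitrary choice and DATEADD is Snowflake-flavored; pair this with incremental_strategy='merge' and a unique_key so reprocessed rows are deduplicated):

lookback window (sketch)
{% if is_incremental() %}
  -- reprocess a trailing window so late-arriving events are picked up;
  -- the merge on event_id prevents duplicates for rows already loaded
  WHERE occurred_at > (
    SELECT DATEADD('day', -3, MAX(occurred_at)) FROM {{ this }}
  )
{% endif %}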

Packages — don't reinvent

dbt has a Hub. Common packages:

  • dbt-utils — generic tests, macros (generate_surrogate_key, pivot, star, group_by).
  • dbt-expectations — Great Expectations-style tests ported to dbt.
  • codegen — generates yaml from your warehouse schema.
  • elementary — data observability layer over dbt artifacts.
  • dbt_audit_helper — compare two model outputs (for refactors / migrations).

Exposures & the semantic layer

Exposures declare the downstream consumers of your models — dashboards, ML models, applications. They appear in the lineage graph.

models/_exposures.yml
version: 2
exposures:
  - name: gpu_utilization_dashboard
    type: dashboard
    maturity: high
    owner:
      name: Data Team
      email: data@example.com
    depends_on:
      - ref('fct_gpu_hours')
      - ref('dim_providers')

Semantic Layer (dbt's metrics framework, powered by MetricFlow): centralizes metric definitions. Instead of every dashboard re-defining "MRR," you define it once in YAML, and BI tools query it through a unified API.
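
A hedged sketch of what a definition can look like in MetricFlow-style YAML (the semantic model, measure, and metric names are illustrative, and the exact spec depends on your dbt version):

semantic model + metric (sketch)
semantic_models:
  - name: orders
    model: ref('fct_orders')
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_revenue
        agg: sum
        expr: revenue

metrics:
  - name: revenue
    label: Revenue
    type: simple
    type_params:
      measure: order_revenue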

For senior interviews

Have an opinion on the metrics layer. "I'd reach for dbt Semantic Layer when there's any real risk of metric drift across teams — finance, product, and ops disagreeing on what 'active customer' means is the #1 problem the semantic layer solves."

CLI essentials

dbt CLI cheat sheet
# Build everything (run + test)
dbt build

# Just compile templates — no warehouse calls
dbt compile

# Run a specific model and its descendants
dbt run -s fct_orders+

# Run a model and its ancestors
dbt run -s +fct_orders

# Run a folder, models tagged "daily", or by source
dbt run -s marts.finance
dbt run -s tag:daily
dbt run -s source:raw_app+

# Run only models changed in this PR (state-based)
dbt run -s state:modified+ --state ./prod-manifest

# Run tests
dbt test
dbt test -s fct_orders

# Full refresh an incremental model
dbt run --full-refresh -s fct_events

# Build docs site
dbt docs generate
dbt docs serve

The + graph syntax is essential. +model+ = ancestors and descendants. Know it cold.

CI / CD with dbt

Modern dbt teams treat models like code:

  • PR opens → CI runs dbt build -s state:modified+ against a staging schema (Slim CI).
  • Tests must pass before merge.
  • Slim CI uses the production manifest as state — only rebuilds what changed.
  • Production scheduler (dbt Cloud, Airflow, Dagster) runs dbt build on a cron + after upstream loads complete.
  • Audit packages (dbt_audit_helper) used during PR review to diff before/after of model output.
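
A hedged sketch of the Slim CI step itself (CI-tool boilerplate omitted; the artifacts path and target name are illustrative):

Slim CI (sketch)
# fetch production artifacts (manifest.json) into ./prod-artifacts first, then:
dbt build -s state:modified+ --defer --state ./prod-artifacts --target ci

# --defer lets unmodified parents resolve to their production relations
# instead of being rebuilt in the CI schema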

Performance & cost

  • Materialize wisely — view for staging, table for marts, incremental for big facts.
  • Cluster / partition — Snowflake: cluster_by in config; BigQuery: partition_by + cluster_by; Databricks: partition_by + Z-ordering (see the config sketch after this list).
  • Limit ephemeral chains — many ephemerals in a chain produce monstrous compiled SQL.
  • Tag and schedule — not every model needs to run every hour. Tag by SLA, schedule accordingly.
  • Look at dbt_artifacts / Elementary — track run times, see which models are getting slow.
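
As an example of the cluster/partition point above, a hedged BigQuery-flavored config sketch (field names are illustrative; Snowflake would use cluster_by alone, and other adapters have their own keys):

partitioned incremental model (BigQuery sketch)
{{ config(
    materialized='incremental',
    unique_key='event_id',
    partition_by={'field': 'occurred_at', 'data_type': 'timestamp', 'granularity': 'day'},
    cluster_by=['user_id', 'event_type']
) }}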

Interview talking points

"How do you organize a dbt project?"

"Staging, intermediate, marts. Staging models are 1:1 with sources — rename columns, cast types, no business logic. Intermediate models are reusable building blocks for complex joins. Marts are the analytics-facing layer organized by business domain — fact tables for events, dimension tables for entities. The convention's not magic but it makes onboarding easier and PR review faster. I'd default staging to views, marts to tables, and reach for incremental only when scale demands it."

"How do you decide on materialization?"

"Default to view for staging — cheap, always fresh, no real cost. For marts, table is the default. I reach for incremental when full rebuild is too slow or too expensive — usually event tables or transactional logs at the billions-of-rows scale. For incremental, I think about: what's the unique key for merge, what's the lookback window for late-arriving data, and what happens on schema change. ephemeral for small reusable transformations I don't want to materialize."

"How do you handle data quality?"

"Three layers. One, dbt tests on every model — unique, not_null, relationships, accepted_values at minimum, plus custom singular tests for business rules. Two, source freshness checks that alert when upstream stops loading. Three, a data observability layer — Elementary or Monte Carlo — for things tests can't catch: anomalies, distribution shifts, volume drops. I severity-warn the tests that are signal-not-blocker and severity-error the ones that should fail the build."