Section B · Pipelines

Data Pipelines

ETL vs ELT, batch vs streaming, idempotency, backfills, late-arriving data, schema evolution.

ETL vs ELT

  • ETL (Extract, Transform, Load): transform before loading. The older pattern, from an era when warehouse storage and compute were expensive, so data was filtered and shaped before it landed.
  • ELT (Extract, Load, Transform): land raw to warehouse, transform inside. The modern default.

Why ELT won:

  • Cloud warehouses made compute cheap relative to storage.
  • Having raw data lets you re-derive when business logic changes.
  • dbt makes warehouse-side transformation ergonomic.
  • Multiple consumers can derive different views from the same raw data.

ETL still makes sense for specific cases: very high volume where most data is uninteresting (filter early), or PII that must be stripped before landing.

Batch vs streaming

Approach            | Freshness | Complexity | Use when
Daily batch         | 24h       | Lowest     | Most reporting; finance closes
Hourly batch        | 1h        | Low        | Operational dashboards
Micro-batch (5 min) | 5-10 min  | Medium     | "Near real-time" dashboards, ops alerting
True streaming      | seconds   | High       | Real-time fraud detection, live monitoring, customer-facing live data
The rule of thumb

Most "we need real-time" actually means "we need fresher." Ask: "What's the actual latency requirement?" Often the answer is "users are fine with 5-15 minutes," which makes micro-batch the right pick. True streaming is hard — exactly-once is painful, debugging is harder, observability is harder. Don't pay the streaming cost without a specific latency justification.

Delivery semantics

Guarantee     | What it means                                 | Use
At-most-once  | Each event delivered 0 or 1 times             | Telemetry where loss is OK
At-least-once | Each event delivered ≥1 times; may duplicate  | Most common — combined with an idempotent consumer
Exactly-once  | Each event delivered exactly 1 time           | Hard. Usually approximated with at-least-once + idempotency.

Idempotency — the most-asked-about pattern

A pipeline is idempotent when running it twice for the same window produces the same result. Non-negotiable for production data pipelines.

Pattern: idempotency keys

Every event carries a unique ID. Consumers track seen IDs and skip duplicates.
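A minimal sketch of the consumer side (names are illustrative, and a production system would keep the seen-ID store durable — a keyed table or Redis, not an in-memory set):

```python
# Hypothetical sketch: dedup on an idempotency key carried by each event.

def process_events(events, handler, seen_ids=None):
    """Apply handler to each event at most once, keyed on event['id']."""
    seen_ids = set() if seen_ids is None else seen_ids
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate delivery — skip
        handler(event)
        seen_ids.add(event["id"])
    return seen_ids

results = []
events = [{"id": "a", "v": 1}, {"id": "b", "v": 2}, {"id": "a", "v": 1}]  # "a" delivered twice
process_events(events, lambda e: results.append(e["v"]))
# results == [1, 2] — the duplicate of "a" was skipped
```

This is how at-least-once delivery is turned into effectively exactly-once processing.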

Pattern: window-replaceable

A pipeline run produces output for a specific window (e.g., a date partition). Re-running replaces the output for that window rather than appending.

idempotent partition replacement (BigQuery)
-- delete + insert into one partition
DELETE FROM fct_orders WHERE date_key = @run_date;

INSERT INTO fct_orders
SELECT ... FROM source WHERE date(occurred_at) = @run_date;

Run a thousand times for the same @run_date → same output. Idempotent.

Pattern: MERGE / upsert

dbt's incremental models with incremental_strategy='merge'. The unique_key determines which rows are updated vs. inserted; re-running on overlapping data updates rather than duplicates.
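The merge semantics can be sketched in plain Python (an illustration of the idea, not dbt's implementation): the target is keyed by unique_key, so re-applying an overlapping batch overwrites rather than appends.

```python
def merge(target, batch, unique_key="order_id"):
    """Upsert batch rows into target, a dict keyed by unique_key."""
    for row in batch:
        target[row[unique_key]] = row  # insert if new, update if existing
    return target

target = {}
batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]
merge(target, batch)
merge(target, batch)     # re-run on the same data
# len(target) == 2 — no duplicates; the re-run is idempotent
```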

Backfills

A backfill is "re-run the pipeline for historical windows." It happens when:

  • Business logic changed and you need to apply it to history.
  • The pipeline had a bug and produced wrong output.
  • You're adding a new metric and need to fill in history.
  • You're bootstrapping a new model.

The principle

Backfill should use the production code, not a special-cased script. Special-cased backfill scripts diverge from prod and break.

For dbt: parameterize the model by date variable, then run with the variable set to each window:

parameterized backfill
for d in $(seq 0 30); do
  date_to_backfill=$(date -d "$((30 - d)) days ago" +%Y-%m-%d)   # GNU date
  dbt run --select fct_orders --vars "{run_date: '$date_to_backfill'}"
done

For Airflow: partitioned DAG runs. Each run is for a specific logical date; backfilling = scheduling the missing logical dates.

Late-arriving data — the silent killer

Data that should have arrived in window N actually arrives in window N+1 (or N+5). If your pipeline filters "only process new" by current timestamp, late data falls off the edge.

Pattern: lookback window

Always look back further than "since last run."

incremental with lookback
{% if is_incremental() %}
  -- 24-hour lookback, merged on unique_key
  -- (interval syntax varies by warehouse; BigQuery uses TIMESTAMP_SUB)
  WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }}) - INTERVAL '24 hours'
{% endif %}

Combined with incremental_strategy='merge' and a unique_key, late events upsert correctly within the lookback window.

Pattern: watermarking (streaming)

Streaming systems use watermarks — "we're now processing data up to this event time; anything later is late." Late events go to a side channel for special handling.
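A simplified sketch of the idea (a real engine such as Flink or Beam tracks watermarks per partition and advances them from source progress, not from a single running max):

```python
# Hypothetical watermark router: events more than allowed_lateness_s behind
# the max observed event time are diverted to a late-data side channel.

def route_events(events, allowed_lateness_s=60):
    """Split events into on-time and late, based on a max-event-time watermark."""
    watermark = float("-inf")
    on_time, late = [], []
    for e in events:
        watermark = max(watermark, e["event_time"])
        if e["event_time"] < watermark - allowed_lateness_s:
            late.append(e)       # behind the watermark — side channel
        else:
            on_time.append(e)
    return on_time, late

on_time, late = route_events([
    {"id": 1, "event_time": 1000},
    {"id": 2, "event_time": 1100},
    {"id": 3, "event_time": 1010},  # 90s behind the watermark (1100) — late
])
# late contains only event 3
```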

Pattern: reprocessing window

For very late data (days or weeks late), schedule a periodic full-rerun of the prior week. Catches stragglers, costs more, but reliable.

Schema evolution

Upstream schemas change. Strategies for handling it:

Detect

  • Source-freshness alerts on missing columns.
  • Schema tests that fail when expected columns are missing.
  • Schema registry (Confluent, Glue) for streaming events.

Adapt

  • dbt on_schema_change='append_new_columns' — auto-add new columns.
  • on_schema_change='sync_all_columns' — also drop columns that disappear. Use with caution.
  • Explicit staging contract — staging models list the columns they depend on; missing columns fail loudly.
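The staging-contract idea can be sketched as a simple column check (names are hypothetical): the model declares the columns it depends on, and a missing column fails loudly instead of silently producing NULLs downstream.

```python
# Hypothetical contract for a staging model: the columns it depends on.
EXPECTED_COLUMNS = {"order_id", "customer_id", "occurred_at", "amount"}

def check_contract(source_columns, expected=EXPECTED_COLUMNS):
    """Raise if any column the staging model depends on is missing upstream."""
    missing = expected - set(source_columns)
    if missing:
        raise ValueError(f"Upstream schema drift: missing columns {sorted(missing)}")

# Extra upstream columns are fine; only missing ones fail.
check_contract(["order_id", "customer_id", "occurred_at", "amount", "new_col"])
```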

Coordinate

For breaking changes — column renames, type changes, semantic shifts — there's no auto-magic. This is a coordination conversation. The pattern: data contracts. Producers commit to "this schema for this version," consumers commit to "we only rely on documented columns." Schema changes are versioned and announced.

CDC (Change Data Capture)

Replicate database changes in near-real-time without full table dumps. Tools:

  • Debezium — open-source Kafka-based CDC.
  • Fivetran HVR, Airbyte, Stitch — managed CDC for major databases.
  • Snowflake Streams — track changes within Snowflake.
  • Datastream (Google Cloud) — managed CDC into BigQuery.

CDC produces a stream of (op_type, before, after, ts) per row change. The warehouse-side challenge: collapse the stream to the current state of each row, while preserving history for SCD-style queries.
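The collapse step can be sketched like this (field names loosely mirror Debezium-style change events — "c"/"u"/"d" op codes — but the shape here is an assumption for illustration):

```python
# Sketch: reduce a CDC stream of (op, before, after, ts) events to the
# current state of each row, keyed on its primary key.

def collapse(changes, pk="id"):
    """Apply change events in event-time order; return current row states."""
    state = {}
    for c in sorted(changes, key=lambda c: c["ts"]):
        if c["op"] == "d":
            state.pop(c["before"][pk], None)   # delete removes the row
        else:                                  # "c" (create) or "u" (update)
            state[c["after"][pk]] = c["after"]
    return state

stream = [
    {"op": "c", "before": None, "after": {"id": 1, "qty": 5}, "ts": 1},
    {"op": "u", "before": {"id": 1, "qty": 5}, "after": {"id": 1, "qty": 7}, "ts": 2},
    {"op": "c", "before": None, "after": {"id": 2, "qty": 1}, "ts": 3},
    {"op": "d", "before": {"id": 2, "qty": 1}, "after": None, "ts": 4},
]
current = collapse(stream)
# current == {1: {"id": 1, "qty": 7}} — row 1 at its latest state, row 2 deleted
```

Preserving history for SCD-style queries means keeping the full event stream (or snapshots) alongside this collapsed view, not discarding it.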

Talking points

"How would you design this pipeline?"

"Four layers. Ingestion: source → land raw in the warehouse with retention. Staging: 1:1 with source, light cleanup, no business logic. Transformation: dbt models with tests and lineage. Serving: marts that BI consumes. Orchestration with Airflow or Dagster running on a schedule. For incrementality I'd ask about volume — if it's manageable for full rebuild, table materialization is simpler; if not, incremental with a lookback window and merge strategy. Failure modes I'd plan for: source schema drift (tests fail loud), late-arriving data (lookback window), pipeline failures (orchestrator retries + idempotent steps), backfills (parameterized models)."