Data Modeling
Kimball star schema, slowly-changing dimensions, OBT, grain. The conceptual heart of analytics engineering — get this right and most other questions get easier.
Grain — the first thing you ask, always
Before designing anything, before writing any SQL, before any modeling discussion: what is the grain of this table?
Grain = what does one row represent. "One row per order." "One row per user per day." "One row per GPU per minute." "One row per inference request."
If you can't state the grain of a table in one sentence, the model isn't right yet.
Saying "what's the grain of this table?" out loud in any modeling discussion. Interviewers physically lean in.
Atomic grain vs aggregated grain
Always model facts at the atomic grain (the lowest level available), then aggregate on top. "One row per order_item" is more atomic than "one row per order". You can't recover atomic grain from an aggregate; going the other way is just a GROUP BY.
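Concretely, a sketch with assumed column names: rolling the atomic grain up is a single GROUP BY, and no query can reconstruct item-level rows from the order-level aggregate.
-- Order grain, derived on top of the atomic order_item grain.
SELECT
    order_id,
    SUM(item_revenue) AS order_revenue,
    COUNT(*)          AS item_count
FROM fct_order_items
GROUP BY order_id;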
Facts and dimensions — the Kimball vocabulary
Fact tables (fct_*)
- Hold measures — things you sum/count/average (revenue, count, duration).
- Tall and narrow — many rows, relatively few columns.
- Time-anchored — almost always have an event timestamp.
- Foreign keys to dimensions — user_id, product_id, date_key.
- One row per atomic event — order_placed, inference_request, gpu_minute.
Dimension tables (dim_*)
- Hold attributes / descriptions — name, status, country, tier.
- Short and wide — fewer rows, many columns.
- One row per entity — one row per user, one row per product, one row per date.
- Updated when entities change (slowly — hence "slowly changing dimensions").
Three types of fact tables
| Type | What it captures | Example |
|---|---|---|
| Transaction fact | One row per atomic event | fct_orders, fct_inference_requests |
| Periodic snapshot fact | One row per entity per period | fct_daily_account_balance, fct_gpu_hourly_utilization |
| Accumulating snapshot fact | One row per process, columns for each milestone | fct_order_lifecycle (ordered, paid, shipped, delivered) |
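The accumulating snapshot is the least intuitive of the three, so a sketch (milestone column names assumed; DATEDIFF syntax varies by warehouse). Each row is updated in place as its process hits milestones, and lag metrics fall out of the milestone columns:
-- fct_order_lifecycle: one row per order, milestone timestamps filled in over time
-- (ordered_at, paid_at, shipped_at, delivered_at are NULL until the milestone happens).
SELECT
    order_id,
    DATEDIFF('day', ordered_at, shipped_at) AS days_to_ship
FROM fct_order_lifecycle
WHERE shipped_at IS NOT NULL;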
Measure types in facts
- Additive — sum across any dimension makes sense. Revenue, units sold. Easiest to model.
- Semi-additive — sums across some dimensions but not time. Account balance — summable across customers, not across time.
- Non-additive — ratios, percentages. Conversion rate. Margin %. Store the numerator and denominator separately; compute the ratio at query time.
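The numerator/denominator point in SQL, as a sketch with an assumed fct_daily_traffic table. The trap is averaging a stored rate, which weights every row equally regardless of volume:
-- Store sessions and conversions; derive the rate at query time.
SELECT
    date_key,
    SUM(conversions)                       AS conversions,
    SUM(sessions)                          AS sessions,
    SUM(conversions) * 1.0 / SUM(sessions) AS conversion_rate  -- never AVG(stored_rate)
FROM fct_daily_traffic
GROUP BY date_key;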
Star schema — the default
A central fact table surrounded by dimension tables, each connected by a foreign key. Visually it looks like a star.
Why star schema wins
- Query-friendly — joins are simple and consistent (one hop from fact to dim).
- Performance — modern warehouses optimize star schema joins explicitly.
- BI-tool-friendly — Looker, Tableau, Power BI all model around it.
- Easy to extend — add a new fact or dim without breaking existing ones.
- Self-documenting — the structure tells you what the business cares about.
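What "one hop" looks like, sketched against the inference-platform model worked through at the end of this section (assuming date_key joins to dim_date.date_day):
SELECT
    d.year,
    d.month,
    m.model_name,
    SUM(f.cost_usd) AS revenue
FROM fct_inference_requests f
JOIN dim_date   d ON f.date_key = d.date_day
JOIN dim_models m ON f.model_id = m.model_id
GROUP BY 1, 2, 3;
Every dimension is exactly one join from the fact: no chains, no ambiguity about the join path.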
Surrogate keys
Each dimension has a synthetic primary key (a hash or sequence) called a surrogate key. The fact references that, not the natural key.
Why surrogate keys:
- SCD Type 2 versions need their own keys (a customer can have multiple rows in dim_customer across time).
- Natural keys can change. Surrogate keys don't.
- Joins are faster on small integer/hash keys than on long strings.
In dbt: {{ dbt_utils.generate_surrogate_key(['col1', 'col2']) }} generates a hash.
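In a model it typically looks like this (a sketch; stg_customers and the key columns are illustrative). Hash the natural key plus whatever makes a version unique:
-- models/dim_customers.sql
SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'valid_from']) }} AS customer_key,
    customer_id,
    plan,
    region,
    valid_from,
    valid_to
FROM {{ ref('stg_customers') }}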
Snowflake schema (the technique, not the warehouse)
Same as star schema, but dimensions are normalized — a dimension can reference other dimension tables.
Example: dim_product → dim_category → dim_department. Three tables instead of one wide product dim.
Trade-off:
- Pro: saves storage, reduces update anomalies.
- Con: more joins, harder for BI users, slower queries.
Modern best practice: denormalize dimensions (single wide dim_product table) unless you have a specific reason to snowflake. Storage is cheap; query simplicity is valuable.
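Flattening a snowflaked hierarchy back into one wide dim is a single model. A sketch against the product example above, with assumed stg_* sources and join keys:
-- dim_product, denormalized: one wide table instead of three.
SELECT
    p.product_id,
    p.product_name,
    c.category_name,
    d.department_name
FROM stg_products    p
JOIN stg_categories  c ON p.category_id   = c.category_id
JOIN stg_departments d ON c.department_id = d.department_id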
One Big Table (OBT) — the modern alternative
Instead of fact + dimensions joined at query time, materialize a single wide table that pre-joins everything. Each fact row already contains all dimension attributes.
Pros
- No joins for analysts — point and click in BI.
- Faster query latency on flat data.
- Simpler mental model.
Cons
- Storage cost — denormalized data balloons.
- Maintenance — when a dim attribute changes, every fact row that references it has to be rebuilt.
- SCD handling is messier.
- Less flexible — you've baked in specific joins.
Most mature teams use both. Star schema as the canonical model layer; OBT-style "wide" tables built on top for specific BI use cases or for ML feature serving. The wide tables are derived from the star; they're not the source of truth.
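"Derived from the star" in concrete terms: a sketch of a wide table built as one more model on top of the canonical layer (names follow the worked example below; dim_gpus is assumed to carry an is_current flag):
-- obt_inference_requests: the fact plus pre-joined dim attributes.
SELECT
    f.*,
    c.plan   AS customer_plan,
    c.region AS customer_region,
    m.model_name,
    g.gpu_type
FROM fct_inference_requests f
LEFT JOIN dim_customers c
    ON f.customer_id = c.customer_id
   AND f.ts >= c.valid_from AND f.ts < c.valid_to  -- point-in-time SCD2 join, baked in once
LEFT JOIN dim_models m ON f.model_id = m.model_id
LEFT JOIN dim_gpus   g ON f.gpu_id   = g.gpu_id AND g.is_current
When a dim attribute changes, you rebuild this table from the star; the star stays the source of truth.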
Slowly Changing Dimensions — the time-travel problem
Dimensions change. A customer's plan changes from Basic to Pro. A product's price changes. A GPU provider's location changes. How do you model this?
The question is: when looking at a historical fact, do you want to see the dimension attributes as they were then, or as they are now?
SCD types — know all of them by name
| Type | What it does | Use when |
|---|---|---|
| Type 0 | Never changes. Original values preserved forever. | Birth date, original signup date. |
| Type 1 | Overwrite. Current value only; history lost. | Don't care about history. Email updates, typos fixed. |
| Type 2 | Add a new row per change. Versioned with valid_from / valid_to. | Most common. When you need point-in-time accuracy. |
| Type 3 | Add a column for "previous value." Only keeps one version of history. | Rare. When you specifically need "current and previous." |
| Type 4 | Maintain a separate history table. Main dim is current-only. | Hybrid approach when history is queried rarely. |
| Type 6 | 1+2+3 combined. Type 2 versioning + a "current value" column on each row. | You want point-in-time joins and easy "current state" filtering. |
SCD Type 2 — the shape
user_id | user_key | plan       | valid_from | valid_to   | is_current
────────┼──────────┼────────────┼────────────┼────────────┼───────────
123     | abc      | basic      | 2024-01-01 | 2024-06-15 | false
123     | def      | pro        | 2024-06-15 | 2025-02-01 | false
123     | ghi      | enterprise | 2025-02-01 | 9999-12-31 | true
456     | jkl      | basic      | 2024-03-10 | 9999-12-31 | true
Joining a fact to an SCD Type 2 dim
This is the interview gotcha. Don't join on user_id alone — you'll fan out.
SELECT
o.order_id,
o.ordered_at,
o.revenue,
u.plan AS plan_at_order_time
FROM fct_orders o
JOIN dim_users u
ON o.user_id = u.user_id
AND o.ordered_at >= u.valid_from
AND o.ordered_at < u.valid_to
This gives the plan the user was on when they placed the order. That's the whole point of SCD2.
For "what plan are they on now?" — filter on is_current and skip the time-range join.
The valid_to sentinel
Use 9999-12-31 or a similar far-future sentinel for the current row's valid_to, never NULL. Comparisons against NULL return NULL, so range joins (BETWEEN or >=/<) silently exclude current rows. This is in every "Kimball-style SCD" gotcha question.
The date dimension
Every warehouse should have a dim_date table. One row per date, with columns for year, quarter, month, week, day-of-week, fiscal-year, holiday flags, business-day flags.
Why bother
- BI tools can pivot/filter on rich date attributes (fiscal quarter, week-over-week).
- Avoids re-deriving date attributes in every query.
- Holiday and business-day logic lives in one place.
- Outer-joining dim_date to fct_daily_x guarantees a row for every day (handles gaps).
Typical columns:
date_day, year, quarter, month, month_name, week, day_of_week, day_name,
fiscal_year, fiscal_quarter, is_weekend, is_business_day, is_holiday,
days_in_month, is_first_of_month, is_last_of_month, ...
dbt's dbt_utils.date_spine macro is the standard way to generate it.
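A sketch of the spine model (the date range is illustrative; the CTE wrap and the EXTRACT syntax vary slightly by warehouse):
-- models/dim_date.sql
WITH spine AS (
    {{ dbt_utils.date_spine(
        datepart="day",
        start_date="cast('2020-01-01' as date)",
        end_date="cast('2030-01-01' as date)"
    ) }}
)
SELECT
    date_day,
    EXTRACT(year  FROM date_day)           AS year,
    EXTRACT(month FROM date_day)           AS month,
    EXTRACT(dow   FROM date_day) IN (0, 6) AS is_weekend
FROM spine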
Periodic snapshot fact tables
Some questions don't have a natural transaction grain. "What's the GPU utilization right now?" — there's no single event. You take a snapshot at regular intervals.
Example: fct_gpu_hourly_utilization — one row per GPU per hour. Captures utilization_pct, memory_used_gb, requests_served, customer_id_currently_assigned, etc.
Different from a transaction fact (one row per event) — same shape, different semantics. You can sum measures across GPUs but be careful summing across hours (semi-additive).
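The semi-additive handling in SQL, sketched with memory_used_gb: sum it across GPUs within an hour, then average (not sum) those hourly totals across time:
-- Total memory in use per hour: summing across GPUs is fine.
WITH hourly AS (
    SELECT hour_ts, SUM(memory_used_gb) AS total_memory_gb
    FROM fct_gpu_hourly_utilization
    GROUP BY hour_ts
)
-- Across hours, average the totals; summing "memory in use" over a day is meaningless.
SELECT CAST(hour_ts AS DATE) AS date_day, AVG(total_memory_gb) AS avg_memory_gb
FROM hourly
GROUP BY CAST(hour_ts AS DATE);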
Bridge tables — many-to-many
When a fact has a many-to-many relationship to a dim, you can't join directly. Example: an order can have multiple coupons applied; a coupon can apply to many orders.
Bridge table: fct_order_coupons with (order_id, coupon_id). The bridge sits between the fact and the dim.
Warning: joining a fact through a bridge to a dim causes fanout: each fact row repeats once per bridge row. Aggregate the bridge to the fact's grain first, or use weighting factors / COUNT(DISTINCT), so measures aren't double-counted.
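The safe pattern, sketched: collapse the bridge to the fact's grain before joining, so revenue is counted once per order rather than once per coupon.
WITH coupons_per_order AS (
    SELECT order_id, COUNT(*) AS coupon_count
    FROM fct_order_coupons
    GROUP BY order_id
)
SELECT
    SUM(o.revenue)                   AS revenue,  -- no fanout: one row per order survives the join
    SUM(COALESCE(c.coupon_count, 0)) AS coupons_applied
FROM fct_orders o
LEFT JOIN coupons_per_order c ON o.order_id = c.order_id;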
Degenerate dimensions
A dimension that has no attributes — just an ID. Example: order_id on a fct_order_items table. You'd never look up attributes about an order_id from a dim table; the ID itself is the dim.
Leave them in the fact table; don't create a one-column dim_order for them.
Conformed dimensions
The same dim used by multiple fact tables: fct_orders, fct_logins, and fct_support_tickets all join to the same dim_users.
The principle: a dim is conformed when its meaning and grain are identical across every use. This is what enables drill-across analysis: "Show me revenue (from orders) and ticket count (from support) by user segment."
Lack of conformed dimensions = "we have three different definitions of customer and the numbers don't tie." Common mid-size company problem.
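Drill-across, sketched (a segment column on dim_users is assumed): aggregate each fact to the shared dimension separately, then join the results. Joining the two facts directly would fan out.
WITH revenue AS (
    SELECT u.segment, SUM(o.revenue) AS revenue
    FROM fct_orders o
    JOIN dim_users u ON o.user_id = u.user_id AND u.is_current  -- current segment
    GROUP BY u.segment
),
tickets AS (
    SELECT u.segment, COUNT(*) AS ticket_count
    FROM fct_support_tickets t
    JOIN dim_users u ON t.user_id = u.user_id AND u.is_current
    GROUP BY u.segment
)
SELECT r.segment, r.revenue, t.ticket_count
FROM revenue r
JOIN tickets t ON r.segment = t.segment;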
How to answer "design a data model for X"
The framework that makes you look senior. They'll give you a business — design the warehouse for them.
- Ask about the business questions. "Before I model — what are the top three questions the business needs to answer? Revenue by segment? Retention? Unit economics?"
- Identify the core entities. Users, products, orders, GPUs, inference requests — whatever's central. These become dimensions.
- Identify the core events. What happens in the world that we want to count or measure? Each event becomes a fact table.
- State the grain of each fact. "fct_orders is one row per order. fct_inference_requests is one row per inference request."
- Identify measures vs attributes. Measures go in facts (numeric, additive where possible). Attributes go in dims (descriptive).
- Decide on SCD strategy for each dim. Most are SCD1 (overwrite) by default; SCD2 where point-in-time accuracy matters.
- Sketch the diagram. Draw it. Star schema. Foreign keys. Talk through one query path.
- Mention edge cases. Late-arriving facts. Backfills. Time zones. Soft deletes. NULL handling.
- Discuss the layer strategy. Staging → marts. Where dbt fits.
Worked example: design for an AI inference platform
FACTS
─────
fct_inference_requests — one row per API request
request_id, customer_id, model_id, gpu_id, region_id, date_key,
ts, latency_ms, input_tokens, output_tokens, cost_usd, status
fct_gpu_hourly_utilization — one row per GPU per hour
gpu_id, hour_ts, utilization_pct, memory_used_gb, requests_served,
revenue_generated_usd, current_customer_id
fct_billing_events — one row per billing event
event_id, customer_id, date_key, ts, event_type, amount_usd
DIMENSIONS
──────────
dim_customers (SCD2) — one row per customer per attribute version
customer_key, customer_id, plan, region, signup_date, valid_from, valid_to
dim_models — one row per model
model_id, model_name, family, parameter_count, hosted_since
dim_gpus (SCD2) — one row per GPU per location change
gpu_key, gpu_id, gpu_type, provider_id, region_id, valid_from, valid_to
dim_providers — one row per GPU provider
dim_regions — one row per region
dim_date — one row per calendar day
With this model you can answer:
- "Revenue per customer per month" — sum cost_usd from fct_inference_requests, point-in-time joined to dim_customers.
- "GPU utilization by region" — fct_gpu_hourly_utilization joined to dim_gpus (current version) joined to dim_regions.
- "Provider payout this month" — fct_gpu_hourly_utilization × pricing logic per provider.
- "Average latency per model" — fct_inference_requests grouped by model_id.
Interview talking points
"With the business questions, not the source data. I ask the team what the top three to five things they need to answer are, and what 'good' looks like. Then I identify the entities — the dimensions — and the events — the facts. I state the grain of each fact out loud and confirm it with stakeholders before writing SQL. The schema falls out of that conversation, not the other way around. Source data is a constraint; the business question is the goal."
"Default to SCD1 — overwrite — for most dimensions, because most BI questions ask 'what does the world look like now.' I reach for SCD2 when the business genuinely needs point-in-time accuracy. The clearest case: financial reporting, where you need to know 'what plan was this customer on when they placed this order' — not what plan they're on today. Also: pricing, organizational changes (region/segment), anything regulators might ask about. The cost of SCD2 is more complex joins and more storage; the cost of not SCD2-ing something that needed it is silent wrong numbers in historical reports."
"Star schema as the canonical layer — that's where source-of-truth lives. OBT-style wide tables built on top for specific consumption use cases: self-serve BI, ML feature serving, dashboards where join cost matters. The wide tables are derived from the star; they're not what we test or document — the star is. The hybrid gives you the maintainability of normalized modeling and the query speed of denormalized consumption."