Warehouses & Lakehouses
Snowflake, BigQuery, Redshift, Databricks, ClickHouse, and the rest of the ecosystem — what each is good at, how they bill you, when to pick which.
Snowflake
Strengths:
- Best-in-class SQL ergonomics — QUALIFY, LATERAL FLATTEN, MATCH_RECOGNIZE (sketch after this list).
- Separation of storage and compute (multiple "warehouses" sized independently).
- Time travel — query data as of any point in the retention window. Zero-copy clones.
- Strong JSON / VARIANT handling.
- Dynamic Tables for declarative incremental materialization.
- Snowpark for Python/Scala compute on the warehouse.
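A minimal sketch of what those ergonomics look like in practice (table and column names are hypothetical):

```sql
-- Latest event per user: QUALIFY filters on a window function, no subquery
SELECT user_id, event_ts, payload
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1;

-- Unnest a JSON array held in a VARIANT column
SELECT e.user_id, item.value:sku::STRING AS sku
FROM events e,
     LATERAL FLATTEN(input => e.payload:items) item;

-- Time travel: the table as it looked an hour ago
SELECT COUNT(*) FROM events AT(OFFSET => -3600);

-- Zero-copy clone for a safe experiment
CREATE TABLE events_scratch CLONE events;
```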
Pricing: per-second warehouse compute with a 60-second minimum per resume (Standard / Enterprise / Business Critical tiers) + storage. Compute dominates cost — pause unused warehouses.
Partitioning: automatic (micro-partitions). User-controlled clustering via CLUSTER BY for high-cardinality, skewed access patterns.
Watch out for: cost runaway from over-provisioned warehouses. Use auto-suspend and resource monitors.
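The standard guardrails, sketched (warehouse and monitor names are made up; quota numbers are placeholders):

```sql
-- Suspend after 60s idle, wake on the next query
CREATE WAREHOUSE transform_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Hard cap on monthly spend via a resource monitor
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 100
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE transform_wh SET RESOURCE_MONITOR = monthly_cap;

-- Opt-in clustering for a skewed access pattern
ALTER TABLE events CLUSTER BY (tenant_id, event_date);
```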
BigQuery
Strengths:
- Serverless — no warehouse sizing. You query, Google scales compute.
- Strong UNNEST + nested/repeated fields — great for event data (example after this list).
- Native integration with GCP services (Pub/Sub, Dataflow, Vertex AI).
- BigQuery ML for in-warehouse ML.
- Materialized views with automatic incremental refresh.
- BigQuery Omni — query across clouds.
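Sketch of the nested/repeated pattern, assuming a GA4-style events schema (project and dataset names are placeholders):

```sql
-- One output row per (event, matching param) pair
SELECT
  event_name,
  param.value.string_value AS page
FROM `my-project.analytics.events`,
     UNNEST(event_params) AS param
WHERE param.key = 'page_location';
```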
Pricing: two models. On-demand: $5-7/TB of data scanned. Editions: capacity-based, billed per slot-hour (commitments lower the rate). On-demand bills by bytes scanned, and only the columns you reference count — strong partition/cluster filters matter a lot for cost.
Partitioning: one partition column per table — time-unit (date/timestamp) or integer range. Clustering: up to 4 columns within a partition.
Watch out for: on-demand cost from unfiltered SELECT *. Use INFORMATION_SCHEMA to monitor heavy queries.
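A sketch of both levers: partition/cluster DDL plus the INFORMATION_SCHEMA jobs view for finding expensive queries (dataset and table names are hypothetical):

```sql
-- Partition by day, cluster by the columns you filter on most
CREATE TABLE mydataset.events (
  event_ts   TIMESTAMP,
  user_id    STRING,
  event_name STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_name;

-- Who scanned the most bytes this week?
SELECT user_email, query, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 20;
```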
Redshift
Strengths:
- Original cloud warehouse — mature, AWS-native.
- Redshift Spectrum — query S3 data without loading.
- RA3 nodes separate storage from compute.
- Strong with AWS-stack integration.
Pricing: per-node-hour for provisioned, or serverless. Cheaper than Snowflake at sustained high utilization; more operational overhead.
Partitioning: distribution keys determine how rows shard across nodes. Sort keys determine row order within a node. Both materially affect performance — get them wrong and queries crawl.
Watch out for: VACUUM and ANALYZE — manual maintenance that Snowflake/BigQuery handle automatically. Newer Redshift Serverless reduces this.
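A sketch of the two knobs plus the maintenance chores (table layout is illustrative, not prescriptive):

```sql
-- Co-locate the join key across nodes; sort by the common range filter
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  event_name VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_date);

-- The maintenance Snowflake/BigQuery do for you
VACUUM events;
ANALYZE events;
```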
Databricks
Strengths:
- Native lakehouse — Delta Lake as the table format, queried with Spark SQL or the Photon engine.
- Strong for ML workflows — MLflow, Unity Catalog, model serving in one place.
- Notebooks-first dev experience.
- Multi-language (Python / SQL / Scala / R).
- Open data — Delta tables are usable from outside Databricks.
Pricing: DBUs (Databricks Units), which scale with cluster size and runtime, times a per-DBU rate, plus the underlying cloud VMs. Compute-heavy.
Partitioning: partition columns + Z-order (correlated clustering across columns).
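Sketch in Databricks SQL (names hypothetical): a date-partitioned Delta table, then OPTIMIZE with Z-ordering on the high-cardinality filter column.

```sql
-- Delta table, coarse partitioning by date
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Compact small files and Z-order by the column queries filter on
OPTIMIZE events ZORDER BY (user_id);
```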
Watch out for: can be more complex operationally than pure SQL warehouses. Best-fit for teams doing both analytics and ML.
ClickHouse
OLAP database optimized for fast analytics on huge tables. Common for:
- Customer-facing dashboards (sub-second response on billions of rows).
- Observability platforms (logs, metrics).
- Inference logs / event analytics at AI infra companies.
Strengths: extreme scan speed, real-time ingest, materialized views for incremental refresh. Weaknesses: a once-quirky SQL dialect (now much closer to standard) and historically weak joins (improving fast).
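A minimal MergeTree sketch (schema is illustrative): PARTITION BY keeps partitions coarse, ORDER BY doubles as the primary index.

```sql
CREATE TABLE events (
  ts         DateTime,
  user_id    UInt64,
  event_name LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts);
```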
DuckDB
OLAP on your laptop. Single-process, columnar, full modern SQL (CTEs, window functions, QUALIFY, list/struct types). Fast.
Common uses: local development against warehouse-style data, embedded analytics, "smaller-than-Snowflake" workloads. dbt-duckdb is real.
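For flavor (file path hypothetical): Parquet straight off disk, window functions and QUALIFY included.

```sql
-- Latest event per user, directly over Parquet files
SELECT user_id, event_ts
FROM read_parquet('events/*.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1;
```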
Mention as ecosystem awareness, not as a primary warehouse choice.
Postgres as warehouse
Works at small/medium scale. Add Citus for sharding, or use a managed offering. Useful when:
- You're early-stage and don't want a separate warehouse yet.
- Data volume is GBs, not TBs.
- You want operational and analytical workloads in one place (with care).
Limits: row-oriented storage is slow for warehouse-style queries past medium scale. Migrate to a real warehouse before the pain hits.
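Two cheap tricks that stretch the "with care" part, sketched with hypothetical tables:

```sql
-- BRIN index: tiny, effective for append-mostly time-series filters
CREATE INDEX ON orders USING BRIN (created_at);

-- Precompute the heavy aggregate, refresh on a schedule
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       SUM(amount)                   AS revenue
FROM orders
GROUP BY 1;

REFRESH MATERIALIZED VIEW daily_revenue;  -- CONCURRENTLY needs a unique index
```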
Iceberg / Delta / Hudi — table formats
Open table formats over object storage. The "lakehouse" foundation.
| Format | Origin | Notable |
|---|---|---|
| Apache Iceberg | Netflix → Apache | Most adopted, multi-engine (Snowflake, BigQuery, Databricks, Trino all read it). Partition evolution. |
| Delta Lake | Databricks → Linux Foundation | Databricks-native, strong ACID, Z-order clustering. |
| Apache Hudi | Uber → Apache | Strong for streaming + upserts. Less ubiquitous. |
What they all give you:
- ACID transactions on object storage.
- Time travel (query past versions).
- Schema evolution.
- Partition evolution (Iceberg specifically).
- Multi-engine readability — write once, query from many tools.
Be ready to talk about Iceberg specifically. It's becoming the default open format, and major warehouses now read/write it.
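Sketch in Spark SQL against an Iceberg table (assumes the Iceberg SQL extensions are enabled; catalog and table names are placeholders):

```sql
-- Time travel: query a past version
SELECT COUNT(*) FROM prod.db.events TIMESTAMP AS OF '2024-06-01 00:00:00';

-- Partition evolution: new writes use the new spec, old files stay valid
ALTER TABLE prod.db.events ADD PARTITION FIELD days(event_ts);
```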
Partitioning & clustering — the universal pattern
Every modern warehouse uses some form of partitioning + clustering for performance. The pattern:
- Partition by a column with low-cardinality "natural" buckets — usually date.
- Cluster / sort / Z-order by higher-cardinality columns frequently used in filters or joins.
- Always filter on the partition key in your WHERE clause — that's the biggest cost lever.
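The pattern in query form (generic SQL; hypothetical table partitioned by event_date and clustered by user_id):

```sql
-- Prunes partitions (event_date) and clustered blocks (user_id): cheap
SELECT user_id, event_name
FROM events
WHERE event_date BETWEEN '2024-06-01' AND '2024-06-07'
  AND user_id = 12345;

-- No partition filter, all columns: full scan, full price
SELECT * FROM events;
```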
Cost models — know which is which
- Compute-time (Snowflake, Databricks) — pay for warehouse-seconds. Lever: pause unused warehouses; right-size.
- Data-scanned (BigQuery on-demand) — pay per TB queried. Lever: partition filters, SELECT only needed columns.
- Slots / capacity (BigQuery Editions, Redshift provisioned) — pay for reserved capacity. Lever: workload management to balance concurrent queries.
- Storage — typically cheap, but watch for "infinite retention" + zero-copy clones piling up.
For interview answers: "My approach to cost: monitor query history, identify the top expensive queries, fix those first. Often a handful of queries account for most of the bill. Then it's partition filters, query rewrites, and right-sizing compute."
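One way to start that monitoring on Snowflake, sketched against the ACCOUNT_USAGE.QUERY_HISTORY view (elapsed time used as a rough cost proxy):

```sql
-- Top time-consumers over the last 7 days
SELECT query_text,
       warehouse_name,
       COUNT(*)                       AS runs,
       SUM(total_elapsed_time) / 1000 AS total_seconds
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY total_seconds DESC
LIMIT 20;
```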
How to pick (interview-grade answer)
"Most of the time: pick the warehouse the team already knows. Switching cost is real. New project, no preference: Snowflake for SQL ergonomics and operational simplicity if budget allows; BigQuery if you're GCP-native or have a Google billing arrangement; Databricks if the team does heavy ML alongside analytics. Redshift is fine if AWS-heavy and cost-sensitive at high sustained volume. For customer-facing analytics that need sub-second on huge data, ClickHouse or a similar OLAP DB alongside the main warehouse. For lakehouse: Iceberg is the safe default open format."