Warehouses & Lakehouses
Snowflake, BigQuery, Redshift, Databricks, ClickHouse, and the rest of the ecosystem — what each is good at, how they bill you, when to pick which.
Snowflake
Strengths:
- Best-in-class SQL ergonomics — QUALIFY, LATERAL FLATTEN, MATCH_RECOGNIZE (sketch after this list).
- Separation of storage and compute (multiple "warehouses" sized independently).
- Time travel — query data as of any point in the retention window. Zero-copy clones.
- Strong JSON / VARIANT handling.
- Dynamic Tables for declarative incremental materialization.
- Snowpark for Python/Scala compute on the warehouse.
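A minimal sketch of what those ergonomics look like in practice (table and column names are hypothetical):

```sql
-- Latest event per user: QUALIFY filters on a window function, no subquery
SELECT user_id, event_ts, payload
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1;

-- Unnest a JSON array held in a VARIANT column
SELECT e.user_id, item.value:sku::STRING AS sku
FROM events e,
     LATERAL FLATTEN(input => e.payload:items) item;

-- Time travel: the table as it looked an hour ago
SELECT COUNT(*) FROM events AT(OFFSET => -3600);

-- Zero-copy clone for a safe experiment
CREATE TABLE events_scratch CLONE events;
```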
Pricing: per-second warehouse compute with a 60-second minimum per resume (Standard / Enterprise / Business Critical tiers) + storage. Compute dominates cost — pause unused warehouses.
Partitioning: automatic (micro-partitions). User-controlled clustering via CLUSTER BY for high-cardinality, skewed access patterns.
Watch out for: cost runaway from over-provisioned warehouses. Use auto-suspend and resource monitors.
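The standard guardrails, sketched (warehouse and monitor names are made up; quota numbers are placeholders):

```sql
-- Suspend after 60s idle, wake on the next query
CREATE WAREHOUSE transform_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Hard cap on monthly spend via a resource monitor
CREATE RESOURCE MONITOR monthly_cap
  WITH CREDIT_QUOTA = 100
  TRIGGERS ON 90 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;
ALTER WAREHOUSE transform_wh SET RESOURCE_MONITOR = monthly_cap;

-- Opt-in clustering for a skewed access pattern
ALTER TABLE events CLUSTER BY (tenant_id, event_date);
```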
BigQuery
Strengths:
- Serverless — no warehouse sizing. You query, Google scales compute.
- Strong UNNEST + nested/repeated fields — great for event data (example after this list).
- Native integration with GCP services (Pub/Sub, Dataflow, Vertex AI).
- BigQuery ML for in-warehouse ML.
- Materialized views with automatic incremental refresh.
- BigQuery Omni — query across clouds.
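Sketch of the nested/repeated pattern, assuming a GA4-style events schema (project and dataset names are placeholders):

```sql
-- One output row per (event, matching param) pair
SELECT
  event_name,
  param.value.string_value AS page
FROM `my-project.analytics.events`,
     UNNEST(event_params) AS param
WHERE param.key = 'page_location';
```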
Pricing: two models. On-demand: $5-7/TB of data scanned. Editions: capacity-based, billed per slot-hour (commitments lower the rate). On-demand bills by bytes scanned, and only the columns you reference count — strong partition/cluster filters matter a lot for cost.
Partitioning: one partition column per table — time-unit (date/timestamp) or integer range. Clustering: up to 4 columns within a partition.
Watch out for: on-demand cost from unfiltered SELECT *. Use INFORMATION_SCHEMA to monitor heavy queries.
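A sketch of both levers: partition/cluster DDL plus the INFORMATION_SCHEMA jobs view for finding expensive queries (dataset and table names are hypothetical):

```sql
-- Partition by day, cluster by the columns you filter on most
CREATE TABLE mydataset.events (
  event_ts   TIMESTAMP,
  user_id    STRING,
  event_name STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id, event_name;

-- Who scanned the most bytes this week?
SELECT user_email, query, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_billed DESC
LIMIT 20;
```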
Redshift
Strengths:
- Original cloud warehouse — mature, AWS-native.
- Redshift Spectrum — query S3 data without loading.
- RA3 nodes separate storage from compute.
- Strong with AWS-stack integration.
Pricing: per-node-hour for provisioned, or serverless. Cheaper than Snowflake at sustained high utilization; more operational overhead.
Partitioning: distribution keys determine how rows shard across nodes. Sort keys determine row order within a node. Both materially affect performance — get them wrong and queries crawl.
Watch out for: VACUUM and ANALYZE — manual maintenance that Snowflake/BigQuery handle automatically. Newer Redshift Serverless reduces this.
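A sketch of the two knobs plus the maintenance chores (table layout is illustrative, not prescriptive):

```sql
-- Co-locate the join key across nodes; sort by the common range filter
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  event_name VARCHAR(64)
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (event_date);

-- The maintenance Snowflake/BigQuery do for you
VACUUM events;
ANALYZE events;
```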
Databricks
Strengths:
- Native lakehouse — Delta Lake as the table format, queried with Spark SQL or the Photon engine.
- Strong for ML workflows — MLflow, Unity Catalog, model serving in one place.
- Notebooks-first dev experience.
- Multi-language (Python / SQL / Scala / R).
- Open data — Delta tables are usable from outside Databricks.
Pricing: DBUs (Databricks Units), which scale with cluster size and runtime, times a per-DBU rate, plus the underlying cloud VMs. Compute-heavy.
Partitioning: partition columns + Z-order (correlated clustering across columns).
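Sketch in Databricks SQL (names hypothetical): a date-partitioned Delta table, then OPTIMIZE with Z-ordering on the high-cardinality filter column.

```sql
-- Delta table, coarse partitioning by date
CREATE TABLE events (
  event_date DATE,
  user_id    BIGINT,
  payload    STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Compact small files and Z-order by the column queries filter on
OPTIMIZE events ZORDER BY (user_id);
```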
Watch out for: can be more complex operationally than pure SQL warehouses. Best-fit for teams doing both analytics and ML.
ClickHouse
OLAP database optimized for fast analytics on huge tables. Common for:
- Customer-facing dashboards (sub-second response on billions of rows).
- Observability platforms (logs, metrics).
- Inference logs / event analytics at AI infra companies.
Strengths: extreme scan speed, real-time ingest, materialized views for incremental refresh. Weaknesses: a once-quirky SQL dialect (now much closer to standard) and historically weak joins (improving fast).
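A minimal MergeTree sketch (schema is illustrative): PARTITION BY keeps partitions coarse, ORDER BY doubles as the primary index.

```sql
CREATE TABLE events (
  ts         DateTime,
  user_id    UInt64,
  event_name LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(ts)
ORDER BY (user_id, ts);
```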
DuckDB
OLAP on your laptop. Single-process, columnar, full modern SQL (CTEs, window functions, QUALIFY, list/struct types). Fast.
Common uses: local development against warehouse-style data, embedded analytics, "smaller-than-Snowflake" workloads. dbt-duckdb is real.
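For flavor (file path hypothetical): Parquet straight off disk, window functions and QUALIFY included.

```sql
-- Latest event per user, directly over Parquet files
SELECT user_id, event_ts
FROM read_parquet('events/*.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) = 1;
```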
Mention as ecosystem awareness, not as a primary warehouse choice.
Postgres as warehouse
Works at small/medium scale. Add Citus for sharding, or use a managed offering. Useful when:
- You're early-stage and don't want a separate warehouse yet.
- Data volume is GBs, not TBs.
- You want operational and analytical workloads in one place (with care).
Limits: row-oriented storage is slow for warehouse-style queries past medium scale. Migrate to a real warehouse before the pain hits.
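Two cheap tricks that stretch the "with care" part, sketched with hypothetical tables:

```sql
-- BRIN index: tiny, effective for append-mostly time-series filters
CREATE INDEX ON orders USING BRIN (created_at);

-- Precompute the heavy aggregate, refresh on a schedule
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT date_trunc('day', created_at) AS day,
       SUM(amount)                   AS revenue
FROM orders
GROUP BY 1;

REFRESH MATERIALIZED VIEW daily_revenue;  -- CONCURRENTLY needs a unique index
```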
Iceberg / Delta / Hudi — table formats
Open table formats over object storage. The "lakehouse" foundation.
| Format | Origin | Notable |
|---|---|---|
| Apache Iceberg | Netflix → Apache | Most adopted, multi-engine (Snowflake, BigQuery, Databricks, Trino all read it). Partition evolution. |
| Delta Lake | Databricks → Linux Foundation | Databricks-native, strong ACID, Z-order clustering. |
| Apache Hudi | Uber → Apache | Strong for streaming + upserts. Less ubiquitous. |
What they all give you:
- ACID transactions on object storage.
- Time travel (query past versions).
- Schema evolution.
- Partition evolution (Iceberg specifically).
- Multi-engine readability — write once, query from many tools.
Be ready to talk about Iceberg specifically. It's becoming the default open format, and major warehouses now read/write it.
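Sketch in Spark SQL against an Iceberg table (assumes the Iceberg SQL extensions are enabled; catalog and table names are placeholders):

```sql
-- Time travel: query a past version
SELECT COUNT(*) FROM prod.db.events TIMESTAMP AS OF '2024-06-01 00:00:00';

-- Partition evolution: new writes use the new spec, old files stay valid
ALTER TABLE prod.db.events ADD PARTITION FIELD days(event_ts);
```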
Partitioning & clustering — the universal pattern
Every modern warehouse uses some form of partitioning + clustering for performance. The pattern:
- Partition by a column with low-cardinality "natural" buckets — usually date.
- Cluster / sort / Z-order by higher-cardinality columns frequently used in filters or joins.
- Always filter on the partition key in your WHERE clause — that's the biggest cost lever.
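The pattern in query form (generic SQL; hypothetical table partitioned by event_date and clustered by user_id):

```sql
-- Prunes partitions (event_date) and clustered blocks (user_id): cheap
SELECT user_id, event_name
FROM events
WHERE event_date BETWEEN '2024-06-01' AND '2024-06-07'
  AND user_id = 12345;

-- No partition filter, all columns: full scan, full price
SELECT * FROM events;
```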
Cost models — know which is which
- Compute-time (Snowflake, Databricks) — pay for warehouse-seconds. Lever: pause unused warehouses; right-size.
- Data-scanned (BigQuery on-demand) — pay per TB queried. Lever: partition filters, SELECT only needed columns.
- Slots / capacity (BigQuery Editions, Redshift provisioned) — pay for reserved capacity. Lever: workload management to balance concurrent queries.
- Storage — typically cheap, but watch for "infinite retention" + zero-copy clones piling up.
For interview answers: "My approach to cost: monitor query history, identify the top expensive queries, fix those first. Often a handful of queries account for most of the bill. Then it's partition filters, query rewrites, and right-sizing compute."
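One way to start that monitoring on Snowflake, sketched against the ACCOUNT_USAGE.QUERY_HISTORY view (elapsed time used as a rough cost proxy):

```sql
-- Top time-consumers over the last 7 days
SELECT query_text,
       warehouse_name,
       COUNT(*)                       AS runs,
       SUM(total_elapsed_time) / 1000 AS total_seconds
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY total_seconds DESC
LIMIT 20;
```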
How to pick (interview-grade answer)
"Most of the time: pick the warehouse the team already knows. Switching cost is real. New project, no preference: Snowflake for SQL ergonomics and operational simplicity if budget allows; BigQuery if you're GCP-native or have a Google billing arrangement; Databricks if the team does heavy ML alongside analytics. Redshift is fine if AWS-heavy and cost-sensitive at high sustained volume. For customer-facing analytics that need sub-second on huge data, ClickHouse or a similar OLAP DB alongside the main warehouse. For lakehouse: Iceberg is the safe default open format."