The Role, Decoded
What "Senior Data Analytics Engineer" actually means in practice, how it differs from data engineer and analyst roles, and what's specifically different at AI infrastructure companies.
The three-role triangle
Three adjacent titles confuse candidates and hiring managers alike. They overlap, but the centers of gravity differ:
| | Data Engineer | Analytics Engineer | Data Analyst |
|---|---|---|---|
| Center of gravity | Infrastructure, ingestion, scaling | Transformation, modeling, data products | Business questions, dashboards, ad-hoc |
| Primary languages | Python, SQL, Java/Scala, sometimes Rust | SQL, dbt, Python (secondary) | SQL, spreadsheets, BI tools |
| Lives in | Airflow / Spark / Kafka, infra-as-code | dbt + warehouse + Git | Looker / Tableau / Mode |
| Owns | Ingestion + serving infra | Models in the warehouse + tests + docs | Reports + insights + stakeholder relationships |
| Failure mode | Pipeline outages | Wrong numbers, slow models | Bad recommendations |
Analytics engineer is the hybrid title that became standard around 2022, and its center of gravity is the transformation layer. Most days you're writing dbt models in SQL, designing star schemas, writing tests, building lineage. You touch data engineering when you need to (Airflow DAGs, warehouse tuning) and you touch analyst work when you need to (stakeholder conversations, metric definitions). You're the bridge that makes the warehouse actually useful.
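To make that concrete, here is a minimal sketch of a transformation-layer model. The model and column names (`stg_orders`, `stg_customers`, `fct_orders`) are hypothetical, but the shape is the everyday pattern: staging refs joined into a fact table with an explicit grain.

```sql
-- models/marts/fct_orders.sql: hypothetical fact model (grain: one row per order)
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
)

select
    orders.order_id,
    orders.ordered_at,
    customers.customer_id,
    customers.region,
    orders.amount_usd
from orders
left join customers
    on orders.customer_id = customers.customer_id
```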
What the work actually looks like
A typical week:
- Build / extend data models in dbt — new dimensions, new metrics, refactoring legacy SQL into modular models with tests.
- Define metrics — work with finance / product / ops to nail down "what does active user mean here?" Then encode it in the metrics layer (see the first sketch after this list).
- Debug data quality issues — a dashboard's revenue number looks off; trace through lineage, find the broken upstream join.
- Write tests — unique, not_null, relationships, accepted_values, custom singular tests (see the second sketch after this list). Make the suite a gate, not a courtesy.
- Performance work — a model takes 40 minutes; figure out why (skewed joins, unnecessary CTEs, lack of clustering) and fix it (see the third sketch after this list).
- Documentation — yaml descriptions, exposures, lineage that makes the warehouse navigable for non-experts.
- Stakeholder conversations — translate "what's our GPU margin by region?" into a defensible model.
- Reviewing PRs — yes, dbt PRs get reviewed. CI runs the test suite against staging.
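First, metric definitions. Once "active user" is agreed, it gets one canonical encoding. A sketch, assuming a Snowflake-style warehouse; the 28-day window, the event list, and the `dim_date` / `fct_events` models are all illustrative choices, not standards:

```sql
-- models/marts/metric_active_users.sql: hypothetical rolling 28-day active users
with days as (
    select date_day from {{ ref('dim_date') }}  -- date spine, one row per day
),

events as (
    select user_id, cast(event_at as date) as event_date
    from {{ ref('fct_events') }}
    where event_name in ('query_run', 'dashboard_view')  -- the negotiated "counts as active" list
)

select
    days.date_day,
    count(distinct events.user_id) as active_users_28d
from days
left join events
    on events.event_date >  dateadd('day', -28, days.date_day)
   and events.event_date <= days.date_day
group by days.date_day
```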
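Second, tests as a gate. Generic tests (unique, not_null) are declared in YAML; a custom singular test is just a SQL file under `tests/` that passes only when it returns zero rows. A hypothetical reconciliation check, with made-up model names and a made-up tolerance:

```sql
-- tests/assert_revenue_reconciles.sql: fail if modeled revenue drifts
-- more than $1/day from the billing system (tolerance is illustrative)
with billed as (
    select invoice_date, sum(amount_usd) as billed_usd
    from {{ ref('stg_invoices') }}
    group by invoice_date
)

select b.invoice_date, b.billed_usd, m.revenue_usd
from billed as b
join {{ ref('fct_revenue_daily') }} as m
    on m.date_day = b.invoice_date
where abs(b.billed_usd - m.revenue_usd) > 1
```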
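Third, performance. Two of the most common mechanical fixes, sketched in Snowflake syntax with hypothetical table names (BigQuery's equivalents are partitioned and clustered tables):

```sql
-- Fix 1: cluster the big table on the columns slow models filter and join on,
-- so scans prune micro-partitions instead of reading everything.
alter table analytics.fct_events cluster by (event_date, customer_id);

-- Fix 2: bound the scan before the join rather than joining everything
-- and filtering afterwards.
with recent_events as (
    select customer_id, amount_usd
    from analytics.fct_events
    where event_date >= dateadd('day', -30, current_date)  -- prune on the cluster key
)

select c.region, sum(e.amount_usd) as revenue_30d
from recent_events as e
join analytics.dim_customers as c
    on c.customer_id = e.customer_id
group by c.region;
```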
The canonical modern data stack
You're not going to be quizzed on every tool, but you should be fluent in the overall shape of the stack and know the major players:
| Layer | What it does | Common tools |
|---|---|---|
| Ingestion / EL | Move raw data from sources into the warehouse | Fivetran, Airbyte, Stitch, custom Python, Kafka Connect, Snowpipe |
| Storage | Warehouse or lakehouse where data lives | Snowflake, BigQuery, Redshift, Databricks, ClickHouse |
| Transformation / T | Turn raw → analytics-ready | dbt (dominant), SQLMesh, Dataform |
| Orchestration | Run things in order, retry, schedule | Airflow, Dagster, Prefect, dbt Cloud's scheduler |
| BI / consumption | Dashboards, exploration | Looker, Tableau, Mode, Hex, Metabase, Sigma, Preset |
| Metrics layer | Single source of truth for metric definitions | dbt Semantic Layer, Cube, MetricFlow, LookML |
| Data quality / observability | Monitor freshness, schema, anomalies | Monte Carlo, Elementary, Great Expectations, dbt tests |
| Lineage / catalog | Answers "where does this column come from?" | DataHub, OpenLineage, Atlan, dbt docs |
| Reverse ETL | Push warehouse data back to operational tools | Hightouch, Census |
You absolutely need fluency with SQL + dbt + a warehouse (Snowflake or BigQuery). Everything else is "I know what this is for and which one I'd pick when."
At AI / GPU infrastructure companies — what's specific
For roles at AI compute, GPU marketplace, or inference platform companies, the data you'll be modeling has unusual characteristics:
- High-volume, high-frequency telemetry — GPU utilization metrics every few seconds from thousands of nodes. Time-series flavor. ClickHouse, Druid, or warehouse-native time-series tables get involved.
- Inference logs — every API request has latency, token counts, model version, customer ID, cost. Petabyte-scale potential. Sampling and aggregation strategies matter.
- Multi-tenant economics — you're often computing unit economics per customer, per model, per region. "What's our gross margin on Llama-70B for customers in Europe?" is a typical question.
- Real-time-ish requirements — billing accuracy demands fresh data. Engineering may want hourly cost dashboards. Streaming or micro-batch matters more than at a classic SaaS.
- Provider-side data (for marketplaces) — if GPUs come from third-party providers, you're tracking supplier utilization, payouts, reliability. Two-sided marketplace metrics.
- Model performance + cost tradeoffs — analysts asking "if we switched these customers from Opus to Sonnet, what's the quality cost vs the margin gain?" You'll be building those analyses.
See 17-ai-compute-domain for a deeper walkthrough of the data model and key metrics.
Never modeled GPU telemetry or inference logs before? That's fine. Most candidates haven't. What you can prepare: read the company's pricing page closely (it tells you their unit economics), think about what metrics their finance team needs, and have a few opinions about how you'd model GPU-hours and request logs. That preparation alone differentiates you.
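For instance, here is one way you might model GPU-hour unit economics. Everything here is hypothetical (the source models, the pricing grain, the column names), but it shows the core move: join metered usage to price and cost at a common grain, and margin falls out.

```sql
-- models/marts/fct_gpu_margin_daily.sql: hypothetical daily unit economics
-- Grain: one row per customer / gpu_type / region / day.
with usage as (
    select
        customer_id,
        gpu_type,
        region,
        cast(started_at as date) as usage_date,
        sum(duration_seconds) / 3600.0 as gpu_hours
    from {{ ref('stg_gpu_usage_events') }}
    group by 1, 2, 3, 4
)

select
    u.customer_id,
    u.gpu_type,
    u.region,
    u.usage_date,
    u.gpu_hours,
    u.gpu_hours * p.price_per_gpu_hour as revenue_usd,
    u.gpu_hours * c.cost_per_gpu_hour  as cost_usd,
    u.gpu_hours * (p.price_per_gpu_hour - c.cost_per_gpu_hour) as gross_margin_usd
from usage as u
join {{ ref('dim_customer_pricing') }} as p
    on p.customer_id = u.customer_id
   and p.gpu_type = u.gpu_type
join {{ ref('dim_provider_costs') }} as c
    on c.gpu_type = u.gpu_type
   and c.region = u.region
```

From a model like this, "what's our gross margin on Llama-70B for customers in Europe?" becomes a filter and a sum.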
JD signals to watch for
Job descriptions for analytics-engineering roles tend to have signature phrases. Decode them:
- "Build the analytics layer" → dbt-heavy transformation work. Models, tests, semantic layer.
- "Partner with finance / product / ops" → stakeholder management, metric definitions. Practice translating business → SQL.
- "Data quality / trust" → testing, alerts, observability. Have a story about how you'd onboard quality.
- "End-to-end ownership" → expect to design and implement and debug. Generalist mindset.
- "Self-serve analytics" → enabling analysts and PMs to query without needing you. Implies docs, modeling for accessibility, BI tooling.
- "Real-time / streaming" → most analytics-engineer roles are batch-first. If this is highlighted, the stack probably includes Kafka + Flink/Spark Streaming, or warehouse-native streaming (Snowflake Dynamic Tables, BigQuery Continuous Queries).
- "Performance optimization" → warehouse tuning, query plans, clustering, partitioning. Be ready to talk about why a query is slow, not just how to make it faster.
- "SDK / API for data" → less common, but at infra companies they may want a data product (e.g. usage dashboard, billing API) as much as internal analytics.
What to ask them
Strong, role-fit questions to have ready:
- "What's the state of the warehouse today — clean, partially modernized, or do you have a wedge of legacy SQL that needs untangling?"
- "How is the metrics layer organized? dbt Semantic Layer, LookML, MetricFlow, or convention-based?"
- "Who owns data quality alerts when they fire? Is there a rotation?"
- "How do data and engineering collaborate on schema changes? Contract or convention?"
- "What's the one model in production right now that everyone's afraid to touch, and why?" (Reveals a lot about tech debt.)
- "What's the team's biggest analytics question that's still hard to answer with the current setup?" (Reveals where you'd have leverage.)
- "How does the team handle ad-hoc analyst requests vs structured modeling work? Where's that line?"
These show you're thinking about the role as a builder and a partner, not just a SQL gun-for-hire.