Practice Interview Questions
28 questions across 8 categories. Read each question, give your version out loud on a 90-second timer, then reveal the strong answer to compare.
Section A · Background / motivation
Q1. "Walk me through your AI experience."
Show strong answer
"I'll be honest about scope. My background is [generalist software / product / ops / whatever]. The last [N months] I've been going deep on AI engineering — specifically agentic systems, MCP, evals, and how regulated industries are productionizing LLMs. I've been [building small projects / writing prompts and tool integrations / reading the Anthropic and OpenAI cookbooks and applying them]. The reason this role caught my attention is that the constraints of compliance — audit trails, human-in-the-loop, eval-driven development — force good architecture, and I'd rather work in that environment than somewhere autonomous-agent demos are the bar. The piece I'd be ramping on if I joined is production MLOps at scale and the specific compliance workflows themselves; I have a concrete plan for how I'd close that gap."
Don't claim shipped enterprise AI work you haven't done. AI engineers can tell. See 02-positioning-from-scratch for more.
Q2. "Why this role / company?"
Show strong answer
"Two reasons. One: I find the application of AI to high-stakes, regulated workflows more interesting than building general-purpose assistants — the constraints force good design. The fact that Compliance won't let you ship without audit, eval, and human-in-the-loop is exactly what makes this work technically interesting. Two: mature firms in this space have real institutional knowledge about regulators across multiple jurisdictions, and that depth shapes what good looks like in this role. I'd rather build AI inside a serious compliance org than the other way around."
Q3. "What's the biggest gap between you and this role?"
Show strong answer
"Two gaps I'd flag. First, I haven't run a formal eval program for production AI workloads at enterprise scale — I understand the pattern (dataset, grader, metric, EDD loop) and can talk about how I'd build one, but I haven't operated one in production. Second, I've worked adjacent to compliance but not inside a Compliance team, so I'd be ramping on the specific workflows — SAR drafting cadence, EDD packaging, regulatory change pipeline — in the first weeks. What I do bring: a clean mental model of the technical stack, fast on-ramp on new domains, strong systems thinking, and a deeply skeptical instinct about autonomous AI in high-stakes work — which is what compliance actually wants."
Section B · MCP
Q4. "What is MCP and why does it matter?"
Show strong answer
"MCP — Model Context Protocol — is an open protocol from Anthropic that standardizes how AI applications connect to external tools, data, and prompts. Three primitives: tools (callable functions), resources (read-only data), prompts (reusable templates). Plus sampling, where servers can request the client's LLM to do inference. Transports are stdio for local, HTTP for remote. Why it matters: pre-MCP, every agent integration was bespoke — your custom server, your custom auth, your custom schema. MCP makes those pluggable, which means a compliance agent can talk to your KYC system, your sanctions DB, and your case management without reimplementing each integration."
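If you want a concrete artifact to sketch alongside that answer, here is roughly what two of the core messages look like on the wire. This is a hedged sketch of the JSON-RPC shapes from the MCP spec; the `sanctions_lookup` tool and its arguments are invented for illustration.

```python
import json

# Sketch of MCP's JSON-RPC message shapes: tools/list discovers schemas,
# tools/call invokes one. The tool name and arguments are hypothetical.

list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "sanctions_lookup",
        "arguments": {"entity": "Acme Trading Ltd"},
    },
}

# A tool result: content blocks plus an isError flag the client checks.
call_result = {
    "jsonrpc": "2.0",
    "id": 2,
    "result": {
        "content": [{"type": "text", "text": "0 matches"}],
        "isError": False,
    },
}

print(json.dumps(call_request))
```

Being able to write these shapes from memory is a cheap way to signal you've actually read the protocol, not just the blog posts about it.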
Q5. "Walk me through what happens when an agent calls an MCP tool."
Show strong answer
"Initialization handshake: client connects to server, they exchange capabilities — does the server expose tools? prompts? resources? Client calls tools/list and gets the schemas. The host LLM gets these tools in its API call. Model decides to call one, returns a tool_use block with name and args. Client validates against the schema, sends tools/call to the server. Server executes the handler, returns content — text, JSON, image, or an error with isError: true. Client feeds the result back into the model. Model continues — may call more tools, eventually replies. For audit, every step gets logged with the trace ID."
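The loop in that answer fits in a few lines. Below is a minimal sketch with the model and tool registry stubbed out; the message shapes loosely follow Anthropic-style `tool_use` blocks, and every name here is illustrative, not any vendor's actual API.

```python
# Minimal tool-use loop: model proposes a tool call, client validates the
# name against a registry, executes, feeds the result back, repeats.

def fake_model(messages):
    # Stub: ask for one tool call, then answer once a tool result exists.
    if not any(m["role"] == "tool" for m in messages):
        return {"type": "tool_use", "name": "lookup", "input": {"q": "acme"}}
    return {"type": "text", "text": "No sanctions matches for acme."}

TOOLS = {"lookup": lambda q: {"matches": 0, "query": q}}

def run_agent(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):          # step limit prevents runaway loops
        reply = fake_model(messages)
        if reply["type"] == "text":
            return reply["text"]
        handler = TOOLS[reply["name"]]  # validate against known tool registry
        result = handler(**reply["input"])
        messages.append({"role": "tool", "content": result})  # feed back in
    raise RuntimeError("step limit exceeded")
```

In a real harness, the `messages.append` step is also where the trace ID and audit event get written.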
Q6. "How would you secure an MCP server exposing compliance data?"
Show strong answer
"I'd never let an MCP server hold a god-mode credential. The pattern I'd reach for: a long-lived credential lives only in a trusted surface — the server, a secrets manager, never in IDE config files. When a client needs access, a token-mint endpoint issues a JWT scoped to user, purpose, and a narrow set of tools, signed with a server-side secret, short TTL. The MCP server only holds that JWT plus an API URL. Every tool call goes through a server-side validator that checks the JWT and runs the operation under the user's identity, so row-level access controls apply — a compromised token only grants what that user already had, never tenant-wide access. For compliance specifically I'd narrow further: classify tools by data sensitivity, gate high-sensitivity tools behind extra approval, and audit-log every call. I'd also be paranoid about tool-description injection — a hostile description can prompt-inject the model into doing things the user didn't intend, so I'd pin trusted servers and review descriptions."
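The token-mint pattern is worth being able to sketch. Here is a stdlib-only HS256 sketch of mint-and-validate, under the assumptions in the answer (short TTL, scope to user, purpose, and tools); a real deployment would use a vetted JWT library and pull the secret from a secrets manager.

```python
import base64, hashlib, hmac, json, time

SERVER_SECRET = b"demo-secret"  # illustration only; never hardcode in production

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def mint_token(user: str, purpose: str, tools: list, ttl_s: int = 300) -> str:
    """Issue a short-TTL JWT scoped to one user, one purpose, a few tools."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = _b64url(json.dumps({
        "sub": user, "purpose": purpose, "tools": tools,
        "exp": int(time.time()) + ttl_s,
    }).encode())
    signing_input = f"{header}.{claims}".encode()
    sig = _b64url(hmac.new(SERVER_SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{claims}.{sig}"

def authorize(token: str, tool: str) -> bool:
    """Server-side validator: check signature, expiry, and tool scope."""
    header, claims_b64, sig = token.split(".")
    expected = hmac.new(SERVER_SECRET, f"{header}.{claims_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        return False
    pad = "=" * (-len(claims_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(claims_b64 + pad))
    return time.time() < claims["exp"] and tool in claims["tools"]
```

The key property to call out: the MCP server never sees `SERVER_SECRET`, only the scoped token, so compromising it can't mint new access.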
Q7. "What's MCP sampling and when would you use it?"
Show strong answer
"Sampling inverts the usual flow: the server asks the client to run an LLM completion on its behalf. It's how a server can leverage whatever model the user already has access to, without bringing its own API key — useful when a tool wants to do some inference of its own as part of producing a result. In compliance, you'd use it carefully though: sampling means the server is delegating to whatever model the host happens to be running, which may not match what Compliance and Model Risk have approved for production. So sampling fits low-risk, host-driven UX work — less so production agentic decisions where you want to pin a specific approved model and log against it."
Section C · Evals
Q8. "How do you evaluate an LLM-based system?"
Show strong answer
"Four building blocks: a versioned dataset of representative inputs, a system under test, a grader, and a metric. The grader can be exact-match or regex (deterministic tasks), code-based (format / business rules), an LLM-as-judge (free-form quality), human review (high-stakes), or a hybrid. The metric depends on task — classification gets precision/recall/F1, free-form gets pass-rate against a rubric, agents get trajectory match plus final-answer match. Cost and latency get tracked alongside quality. Evals run in three modes: pre-deploy (CI), pre-release regression, and online sampling of production traffic. For compliance specifically, recall on must-catch cases dominates — you can't miss a sanctions hit — and you need calibration so 'high confidence' actually means high accuracy."
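Those four building blocks fit in a dozen lines, which is a useful thing to show on a whiteboard. A toy sketch with an invented dataset and a keyword classifier standing in for the real system:

```python
# Minimal eval harness: versioned dataset, system under test, grader, metric.
# Dataset and classifier are toy stand-ins for illustration.

DATASET_V1 = [  # versioned dataset of representative inputs
    {"input": "wire to sanctioned entity", "expected": "escalate"},
    {"input": "routine payroll transfer",  "expected": "dismiss"},
    {"input": "structuring-pattern cash deposits", "expected": "escalate"},
]

def system_under_test(text: str) -> str:
    return "escalate" if "sanctioned" in text or "structuring" in text else "dismiss"

def exact_match_grader(output: str, expected: str) -> bool:
    return output == expected

def run_eval(dataset, sut, grader):
    results = [grader(sut(case["input"]), case["expected"]) for case in dataset]
    return sum(results) / len(results)   # pass rate; swap in recall/F1 as needed

print(run_eval(DATASET_V1, system_under_test, exact_match_grader))  # 1.0
```

Everything else in an eval program, including CI gating and online sampling, is plumbing around this core loop.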
Q9. "How do you trust an LLM-as-judge?"
Show strong answer
"I anchor it against humans. Take 50-100 outputs, have SMEs grade them, then check the LLM judge's agreement with humans — typically you want 90%+ agreement on category judgments. If it's not there, refine the rubric, retest. Once anchored, run a continuous spot-check sample where humans regrade some judge decisions, so you catch drift. Watch for known biases — position bias in pairwise comparisons, length bias, self-preference if the judge is the same model as the system. Pin the judge model version because upgrades change scores. And remember LLM-as-judge is a tool, not a religion — for high-stakes outcomes you still want humans on a sample."
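The anchoring step is just an agreement-rate calculation. A sketch with invented labels, using the ~90% bar mentioned in the answer as the threshold:

```python
# Agreement check between an LLM judge and SME labels. Labels are toy data;
# the 0.9 threshold mirrors the ~90% agreement bar discussed above.

def agreement_rate(human_labels, judge_labels):
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail"]

rate = agreement_rate(human, judge)
anchored = rate >= 0.9   # below the bar: refine the rubric and retest
print(rate, anchored)    # 0.8 False
```

In practice you'd also break agreement down per category, since a judge can hit 90% overall while being unreliable on exactly the rare classes you care about.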
Q10. "Walk me through eval-driven development for a new compliance agent."
Show strong answer
"I'd build the eval set before the prompt. Step one, sit with SMEs, collect 30-50 real or realistic inputs that span the cases the agent must handle — including known failure modes and edge cases. Step two, define the grader and what 'good' means before writing the prompt — that forces clarity. Step three, write the simplest prompt that might work, run the eval, see the actual failures. Step four, cluster the failures, fix the biggest cluster — usually with prompt edits, sometimes with new tools or retrieval. Step five, re-run, compare scores, watch for regressions. Step six, hold out a test set the prompt has never seen, score on that before deploying. After deployment, sample production for online evals, route low-confidence outputs to human review, and pull edits/escalations back into the dataset."
Q11. "How would you eval an agent, not just a prompt?"
Show strong answer
"Agents are harder because the output isn't text — it's a trajectory. I'd score on multiple dimensions: did it reach the right final state, did the tool sequence look reasonable, did each step's tool choice make sense given state, did it stay within authorized actions, did it recover well from errors, did it spend a reasonable number of steps. For compliance, trajectory eval matters most — a regulator wants to see how a decision was reached, not just what was decided. So I'd capture full traces, build expert reference trajectories, and compute step-level alignment plus end-state correctness. The dataset is harder to build — agent traces are larger and richer than single-turn data — so I'd start from real production traces with expert annotation."
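One simple way to operationalize "step-level alignment plus end-state correctness" is to score the agent's tool sequence against an expert reference trajectory. A hedged sketch, with invented tool names and an intentionally naive in-order matcher:

```python
# Trajectory scoring sketch: fraction of reference steps matched in order,
# plus an end-state check. Real scoring is usually richer (args, ordering
# constraints, allowed extra steps); this is the minimal version.

def trajectory_score(agent_steps, reference_steps, agent_final, expected_final):
    matched, i = 0, 0
    for step in agent_steps:
        if i < len(reference_steps) and step == reference_steps[i]:
            matched += 1
            i += 1
    alignment = matched / len(reference_steps)
    return {"alignment": alignment, "end_state_ok": agent_final == expected_final}

reference = ["fetch_case", "sanctions_check", "draft_narrative"]
agent     = ["fetch_case", "fetch_case", "sanctions_check", "draft_narrative"]

print(trajectory_score(agent, reference, "drafted", "drafted"))
```

Note the duplicated `fetch_case` still scores 1.0 alignment here; a production version would also penalize wasted steps, which is the "reasonable number of steps" dimension from the answer.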
Section D · Harnesses and agents
Q12. "When should you use an agent vs a workflow?"
Show strong answer
"Default to the simplest pattern that works. Workflows — prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer — are predictable, cheaper, easier to evaluate, easier to audit. You reach for an open-ended agent only when the task genuinely requires the LLM to plan dynamically because you can't enumerate the steps in advance. In compliance specifically, the strong default is workflow-heavy with surgical agentic steps. You don't deploy a fully open-ended agent that decides whether to file a SAR. You deploy a workflow with explicit states and an agent that does one bounded subtask — gathering facts, drafting a narrative, summarizing a document."
Q13. "Design an agent harness for a compliance task."
Show strong answer — the 10-step sketch
Walk through the 10-step compliance-architect sketch from 05-harnesses-and-agents:
"Risk-tier the task first. For [task], it's [tier]. Pick the simplest pattern that fits — looks like routing into one of two chains here. Tools: [list, note read-only vs side-effecting]. Side-effecting tools require human approval. Human checkpoints at [these gates]. Audit log: every step, with snapshot of model version, prompt version, tool versions, retrieval index version. Eval: dataset of [N] cases, graded on [metrics], plus online sampling. Failure paths: tool error → retry then escalate, malformed output → repair-prompt then escalate, low confidence → escalate, cost budget exceeded → halt and escalate. Rollout: shadow mode parallel with humans, then human-in-the-loop, then — if the tier permits — autonomous, but high-tier never goes autonomous. Monitoring: drift, cost, latency, eval scores, human override rate. Kill switch: halt-flag in config that the system checks every step."
Q14. "What does a production-grade harness add beyond while True: call_llm()?"
Show strong answer
"Retries with backoff for transient errors. Structured-output coercion with repair-prompts when JSON fails to parse. Token and cost budgets with hard cutoffs. Step limits to prevent runaway loops. Tool authorization gating by user identity and risk tier. Human-in-the-loop pauses where the agent yields and waits for approval. Trace capture — every model call, tool call, and result recorded for replay and audit. Caching for repeated subcalls. Fallback model routing when primary fails. Input and output guardrails — PII detection, prompt-injection scanning, policy violation filtering. Sandboxing of tool execution. None of these are optional in a compliance system."
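A few of those layers are worth being able to write cold. Below is a skeleton showing retries with exponential backoff, repair-prompt fallback on bad JSON, a step limit, and a cost budget; the model call is stubbed and every threshold is illustrative.

```python
import json, time

# Harness skeleton: three of the layers listed above, with the LLM stubbed out.

def call_with_retry(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            time.sleep(base_delay * 2 ** attempt)   # exponential backoff
    raise RuntimeError("exhausted retries; escalate to human")

def parse_or_repair(raw, repair_fn):
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(repair_fn(raw))   # one repair-prompt round, then fail

def run(step_fn, max_steps=10, budget_usd=1.00):
    spent = 0.0
    for _ in range(max_steps):              # hard loop cutoff
        result, cost = step_fn()
        spent += cost
        if spent > budget_usd:
            raise RuntimeError("cost budget exceeded; halt and escalate")
        if result.get("done"):
            return result
    raise RuntimeError("step limit exceeded; escalate")
```

Every `RuntimeError` path here would, in the real system, land in the escalation queue with the full trace attached.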
Section E · Error handling
Q15. "What failure modes do you design for in agentic AI?"
Show strong answer
"Six buckets. Infrastructure — timeouts, rate limits, downtime — handled with backoff, circuit breakers, fallback models. Format — bad JSON, missing fields, truncation — handled with schema validation, repair-prompts, forced structured output. Tool errors — handled with structured tool-error responses the model can see and react to, idempotency keys for side-effecting tools. Reasoning errors — hallucination, sycophancy, drift, wrong tool — caught by evals plus runtime checks like grounding ratio and confidence scoring. Safety — PII leaks, unauthorized actions, prompt-injection in retrieved content — handled with input/output guardrails and scope minimization. Loop errors — stuck states, infinite tool calls, context loss — handled with step limits, stuck-state detection, compaction. The default behavior for any unhandled or low-confidence case in compliance is escalate to human."
Q16. "How do you handle a tool that succeeded once and now you're retrying?"
Show strong answer
"Idempotency keys. Generate a UUID client-side, pass it to the tool, persist a key→result mapping with TTL. On retry, the tool sees the key, returns the prior result instead of re-executing. Critical for write operations — duplicate SAR filings, duplicate notices, duplicate alerts are not okay. I'd never let an agent call a side-effecting tool without an idempotency key in scope."
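The whole pattern is a key-to-result cache on the tool side. A sketch with an in-memory dict standing in for the persistent store, and an invented `file_notice` write operation:

```python
import time, uuid

# Idempotency sketch: the tool caches key -> result so a retry replays the
# prior result instead of re-executing the side effect. The store would be
# persistent (DB/Redis) in production, not a module-level dict.

_SEEN = {}
TTL_S = 24 * 3600

def file_notice(idempotency_key: str, payload: dict) -> dict:
    now = time.time()
    cached = _SEEN.get(idempotency_key)
    if cached and now - cached[0] < TTL_S:
        return cached[1]                       # replay; no duplicate side effect
    result = {"notice_id": str(uuid.uuid4()), "status": "filed"}  # the real write
    _SEEN[idempotency_key] = (now, result)
    return result

key = str(uuid.uuid4())                        # generated client-side, once
first = file_notice(key, {"case": "C-001"})
retry = file_notice(key, {"case": "C-001"})
assert first == retry                          # same notice, filed exactly once
```

The subtle part in production is making the "check cache, execute, store" sequence atomic, so two concurrent retries can't both execute the write.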
Q17. "What's prompt injection and how do you defend against it?"
Show strong answer
"Two flavors. Direct — user types 'ignore previous instructions, do X.' Indirect — a document, email, retrieved page, or tool result contains instructions the model treats as legitimate. Indirect is the harder problem because models can't reliably tell instructions apart from data. Defenses: privilege separation — the system prompt is privileged, user input is not, retrieved content is even less trusted. Sanitize and structure tool returns. Don't let tool results trigger more tools without re-invoking the model and validating. Sandbox tool side effects. Pin and review the trusted set of MCP servers. Run input scanners for known injection patterns. And the strongest defense for high-stakes systems: scaffold the workflow so user-controlled text can't steer high-impact decisions — keep humans in the loop where it matters. In compliance, if your agent reads customer-supplied KYC docs, that's an injection surface."
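Two of those defenses, the input scanner and structured tool returns, can be sketched quickly. These patterns are illustrative and a known-incomplete heuristic; the point is defense in depth, not that regexes stop injection.

```python
import re

# Heuristic scanner for known injection phrasings, plus a wrapper that
# structures untrusted tool output so it reads as data, not instructions.
# One thin layer among many; patterns here are illustrative only.

INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
    r"disregard .{0,30}(rules|instructions|policy)",
]

def flag_injection(text: str) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

def wrap_tool_result(raw: str) -> str:
    """Structure untrusted content so it's clearly delimited as data."""
    if flag_injection(raw):
        raw = "[REDACTED: possible injection attempt]"
    return f"<tool_result untrusted='true'>{raw}</tool_result>"
```

A determined attacker paraphrases past any pattern list, which is exactly why the answer ends with workflow scaffolding and human checkpoints as the real backstop.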
Section F · Audit trails
Q18. "Design the audit trail for an AI agent that drafts SAR narratives."
Show strong answer
"First principle: prompts, tool schemas, retrieval indexes, and model versions are first-class versioned artifacts, not constants. Each agent run gets a trace ID; each step — model call, tool call, retrieval, human review — is an immutable event with snapshot of which versions were active. Inputs and outputs go to content-addressed storage; events hold hashes. Side-effecting tools have idempotency keys recorded. PII is tokenized at the boundary so logs are useful for analysis without revealing identity. Storage: append-only writes (an insert-only Postgres role or Kafka), indexed for queries (OpenSearch or BigQuery), cold-archive in S3 with object-lock for retention. Hash-chain events for tamper evidence. Human review events capture who, when, what they saw, what they did, and the diff between AI draft and final action — that diff is gold, it's both audit and eval feedback. For SAR specifically, retention runs minimum five years per BSA, often longer. The bar I'd build to: a regulator asks 'why did the AI recommend filing on case X?' and we produce the full trace within hours."
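Hash-chaining is the piece most worth demonstrating concretely. A sketch of an append-only event log where each event embeds the previous event's hash, so any tamper breaks verification; event fields are illustrative.

```python
import hashlib, json, time

# Hash-chained append-only event log: each event carries the previous event's
# hash, so editing any past event invalidates everything after it.

class AuditLog:
    def __init__(self):
        self.events = []
        self._prev_hash = "0" * 64          # genesis value

    def append(self, trace_id: str, step: str, payload: dict) -> dict:
        event = {
            "trace_id": trace_id,
            "step": step,
            "ts": time.time(),
            "payload_hash": hashlib.sha256(
                json.dumps(payload, sort_keys=True).encode()).hexdigest(),
            "prev_hash": self._prev_hash,
        }
        self._prev_hash = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode()).hexdigest()
        self.events.append(event)
        return event

    def verify(self) -> bool:
        prev = "0" * 64
        for event in self.events:
            if event["prev_hash"] != prev:
                return False
            prev = hashlib.sha256(
                json.dumps(event, sort_keys=True).encode()).hexdigest()
        return True
```

Storing the payload hash rather than the payload itself is what lets the events table stay lean while content-addressed storage holds the actual prompts and outputs.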
Q19. "How do you reconcile audit retention with GDPR right-to-erasure?"
Show strong answer
"Tokenization at the boundary. Customer identifiers get replaced with internal tokens before anything reaches the model or the log. The audit log refers to customer_T0123, not 'Jane Doe.' A separate, tightly access-controlled mapping resolves tokens to identities. On a valid erasure request, you delete the mapping. The audit log remains intact for AML retention — useful for aggregate analytics — but no longer connects to the individual. You also tier retention by data sensitivity: full prompts and outputs may have shorter retention than the metadata events. Anything that touches actual data flows gets reviewed by privacy/legal."
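The mechanism is small enough to sketch end to end. An illustrative in-memory version, with the mapping store standing in for what would really be a separately access-controlled service:

```python
import uuid

# Tokenization-at-the-boundary sketch. The audit log only ever sees tokens;
# a separate mapping resolves them, and erasure deletes only the mapping.

class PIIVault:
    def __init__(self):
        self._token_to_identity = {}   # tightly access-controlled in production
        self._identity_to_token = {}

    def tokenize(self, identity: str) -> str:
        if identity in self._identity_to_token:
            return self._identity_to_token[identity]
        token = f"customer_T{uuid.uuid4().hex[:8]}"
        self._token_to_identity[token] = identity
        self._identity_to_token[identity] = token
        return token

    def resolve(self, token: str):
        return self._token_to_identity.get(token)

    def erase(self, identity: str) -> None:
        """GDPR erasure: drop the mapping; audit events stay intact."""
        token = self._identity_to_token.pop(identity, None)
        if token:
            del self._token_to_identity[token]
```

After `erase`, every audit event referencing the token survives for AML retention, but nothing in the system can connect it back to the person.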
Q20. "How do you make AI decisions reproducible enough to satisfy a regulator?"
Show strong answer
"I'd push back gently on the word 'reproducible.' LLMs are non-deterministic — even with temperature zero, infrastructure-level non-determinism means a re-run can differ. What you can guarantee is explainability: the exact prompt sent, the model version and parameters, the tool schemas at that moment, the retrieval results, the resulting output, the human reviewer. Given that record, anyone can see exactly what happened and why. For most regulators, that's the standard — they want a complete, consistent record, not a guarantee of bit-for-bit re-derivation. I'd be explicit about that in the model risk documentation."
Section G · Compliance / domain
Q21. "What's the highest-risk thing you wouldn't let an AI do autonomously?"
Show strong answer
"Anything with a regulatory filing, an external customer-facing action, or an account-state change. Filing a SAR, freezing an account, sending a customer-facing KYC notice, locking out a trader. Those are decisions where the cost of being wrong includes regulator scrutiny, customer harm, or both. The agent can draft, can recommend, can assemble evidence — humans approve and execute. I'd also be cautious about chain-affecting actions specific to crypto — anything that could move funds or affect on-chain state. Risk-tier the workload first; for high tier, autonomous is the wrong default."
Q22. "What's the relationship between AI controls and the firm's risk-based approach?"
Show strong answer
"Regulators expect controls scaled to risk. Same applies to AI: a low-tier task — internal regulatory summary, drafting a routing classification — gets a light eval and a sample-based human review. Medium tier — case narrative drafting, EDD section drafting — gets full audit, mandatory pre-action human review, citation requirements. High tier — anything close to a regulatory filing recommendation — gets multi-stage review, model risk management sign-off, scoped tools, eval against a labeled gold set, and a kill switch. The architecture has to make that tiering visible — not buried in code."
Q23. "How would you measure success for an AI-assisted alert triage system?"
Show strong answer
"Two layers. Quality: recall on must-catch cases (you cannot miss a true sanctions hit), false positive rate (every FP is human time), human override rate (when the agent says dismiss, how often does the human disagree?). Operational: cost per alert, latency p95, throughput, eval drift. Plus the business metric the team actually cares about: alert-review time, investigator capacity freed up. I'd also instrument the override diff — when humans correct the agent's recommendation, that data feeds back into evals and the next prompt iteration."
Section H · Curveballs
Q24. "You disagree with a Compliance lead about how risky an AI workflow is. What do you do?"
Show strong answer
"Compliance lead has the call on risk classification — that's their domain, not mine. My job is to surface the technical realities clearly: here's what the model can and can't reliably do, here's the failure mode I'm worried about, here's a controls package that would mitigate it. Then they make the policy call. If the call is 'too risky, don't ship,' that's a fine answer — better than shipping something that creates regulatory exposure. The JD specifically calls out 'halt or redesign solutions posing regulatory risks' as part of the role. I'd want to be the architect known for stopping bad ideas."
Q25. "Tell me about a time you decided NOT to ship something."
Show strong answer
Use a real story from your background — bug you caught, security risk you flagged, edge case you escalated, scope you pushed back on. The shape that wins: risk you saw → why others were inclined to ship anyway → what specifically you said or did → outcome and tradeoff.
If you don't have one, flip the question honestly:
"Truthfully, I don't have a clean 'I refused to ship' war story I'd tell — most of my recent work has been small enough that the stakes weren't 'ship vs don't.' What I can speak to is the instinct: when I'm reviewing a design and someone wants to use a god-mode credential in a user-facing config, or skip an audit log because it's faster, my default is to flag it and propose an alternative. In a compliance context I expect that instinct will have more occasions to fire, and I'd rather err on the side of slowing down than shipping something we can't defend to a regulator."
Q26. "What's something the AI engineering field gets wrong?"
Show strong answer
"Two: under-investing in evals upfront, and over-reaching for agents when a workflow would do. Both come from the same place — wanting to ship the demo. Demos hide failure modes that production exposes hard. I'd rather start ugly with explicit workflow steps and a real eval set than ship a flashy autonomous agent with no measurable definition of 'good.'"
Q27. "If we hired you, what would you do in your first 30 days?"
Show strong answer
"Listen and inventory. Map the existing workflows the team has — what's in n8n, what's in custom code, what's manual. Sit with SMEs on the highest-volume manual workflow and build a candidate eval set with them — even before any AI is in the loop. Audit what observability/audit logging exists for any AI already deployed. Identify the lowest-risk, highest-leverage first build — probably alert pre-screening or regulatory change summaries — and scope a 30-60 day pilot with explicit eval and rollout plan. Avoid shipping anything in the first month I haven't seen the failure cases of."
Q28. "How do you stay current on AI?"
Show strong answer
Real, practiced answer — whatever's true for you. Sources to name-drop if accurate:
- Anthropic / OpenAI cookbooks and engineering blogs
- Anthropic's "Building Effective Agents" essay
- Practitioner Twitter/X (Eugene Yan, Jason Liu, Hamel Husain, Shreya Shankar)
- Building things in your own repo
- Specific newsletters: Latent Space, Interconnects, Jack Clark's Import AI
The signal: name 2-3 specific sources, mention a thing you read recently, mention a thing you built recently to test an idea.
How to drill
- Toggle Drill mode at the top (it's on by default — answers hidden).
- Pick a question. Cover the answer button. Give your version out loud on a ~90-second timer.
- Reveal. Compare. Note structural differences (claim → why → example).
- Re-try, integrating anything you missed. Mark practiced once you hit it cleanly twice in a row.
- Don't memorize verbatim — memorize the structure and the concrete examples.
Spread the drill over multiple sessions. Same-day repetition has diminishing returns; doing 5 questions today + 5 different ones tomorrow + revisiting all 10 the day after beats drilling 15 in one sitting.