Production RAG · Concept & drill

RAG Applied

From naïve top-K to hybrid + reranking + agentic. The failure modes that show up only at scale, and the patterns that distinguish a serious answer from a textbook one.


Why RAG is the right default for compliance

Compliance is fundamentally about grounded reasoning over authoritative documents. The model must:

  • Answer based on the current regulation, not training-time recollection
  • Cite the source so a reviewer can verify
  • Refuse to answer when the source doesn't cover the question
  • Surface the version / effective date of what it's referencing

RAG vs long context — when to pick which

Long-context models are tempting ("just stuff the whole regulation in"), but RAG usually wins: your KB is bigger than any one context window, you want to select the relevant context rather than pay tokens to skim everything, you get the citation trail RAG produces naturally, and you can update the KB without touching the agent.

The naïve RAG pipeline (start here)

[Question]
   ↓
[Embed question]
   ↓
[Top-K cosine-similarity search in vector store]
   ↓
[Insert top-K chunks into prompt]
   ↓
[Generate answer]
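
A minimal sketch of this pipeline, assuming sentence-transformers for the embeddings and a caller-supplied `generate` callable standing in for the chat model; the model name, prompt wording, and in-memory search are illustrative only:

```python
# Naive RAG in a few lines. Assumes sentence-transformers for embeddings and a
# caller-supplied `generate` callable for the chat model; everything here is a
# stand-in, not a recommendation.
from typing import Callable

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model; name is illustrative

def naive_rag(question: str, chunks: list[str], generate: Callable[[str], str], k: int = 5) -> str:
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)    # in practice, embed offline and index
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]  # embed the question
    scores = chunk_vecs @ q_vec                                        # cosine similarity (vectors normalized)
    top = np.argsort(scores)[::-1][:k]                                 # top-K chunk indices
    context = "\n\n".join(chunks[i] for i in top)
    prompt = (
        "Answer using only the context below. If it doesn't cover the question, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                                            # any chat-model call
```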

This works for demos. It fails in production. Below is a tour of what breaks and how to fix it.

Failure mode 1 · Bad retrieval (most common)

Symptom: model answers without grounding, hallucinates, or contradicts the source.

Causes: question doesn't lexically match source chunks; chunks lack context to be retrievable alone; top-K are similar to each other (redundant); vector similarity isn't aligned with relevance for the domain.

Fixes

  • Hybrid search — combine BM25 (keyword) and vector. BM25 catches exact-term matches embeddings miss; embeddings catch semantic matches BM25 misses. Combine via Reciprocal Rank Fusion (RRF; see the merge sketch after this list).
  • Reranking — retrieve the top 50 with a cheap method, then rerank with a cross-encoder (Cohere Rerank, BGE-Reranker, Voyage Rerank) and keep the top 5-10. Slower but far more accurate.
  • Query expansion / rewriting — an LLM rewrites the question into 2-3 variants before retrieval. HyDE (Hypothetical Document Embeddings), which embeds a hypothetical answer instead of the raw question, is a related technique.
  • Multi-hop retrieval — agent retrieves, reads, decides what to retrieve next.
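
A minimal RRF merge in Python, assuming each retriever returns a ranked list of document IDs; the constant k=60 is the commonly used default:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs from multiple retrievers (e.g. BM25 + vector).
    Each doc scores sum(1 / (k + rank)) over the lists that contain it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)  # best first

# fused = reciprocal_rank_fusion([bm25_top50_ids, vector_top50_ids])[:50]
# ...then hand the fused top 50 to a cross-encoder reranker and keep 5-10.
```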

Failure mode 2 · Bad chunks

Symptom: retrieved chunks carry the wrong surrounding context, or the information needed for an answer is split across chunk boundaries.

Fixes

  • Structural chunking — respect document hierarchy (headings, sections). Non-negotiable for regulatory documents.
  • Chunk size — sweet spot 500-1500 tokens with 10-20% overlap. Tune per corpus.
  • Contextual retrieval (Anthropic) — prepend a 50-100 token context paragraph to each chunk before embedding ("This chunk is from FATF Recommendation 10's section on ongoing CDD..."). Use a small, cheap model such as Claude Haiku to generate it (see the sketch after this list).
  • Parent-child / small-to-big — index small chunks for retrieval precision; return parent chunk for generation context.
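
A minimal sketch of the contextual-retrieval step, assuming the anthropic Python SDK; the model name and prompt wording are illustrative, not Anthropic's exact recipe:

```python
# Contextual retrieval sketch: generate a short situating paragraph per chunk and
# prepend it before embedding. Assumes the anthropic SDK; model name and prompt
# are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize(chunk: str, full_document: str) -> str:
    prompt = (
        f"<document>\n{full_document}\n</document>\n\n"
        f"<chunk>\n{chunk}\n</chunk>\n\n"
        "Write 1-2 sentences situating this chunk within the document "
        "(which regulation, section, and topic) so it can be retrieved on its own."
    )
    response = client.messages.create(
        model="claude-3-5-haiku-latest",   # any small, cheap model; name is illustrative
        max_tokens=150,
        messages=[{"role": "user", "content": prompt}],
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"          # embed this combined text; keep the original chunk too
```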

Failure mode 3 · Hallucination on retrieved context

Symptom: model says things not in the retrieved chunks; fabricates citations.

Fixes

  • System prompt discipline — explicit instructions to only answer from provided context, refuse if context is insufficient.
  • Citation-required output — every claim is tagged with a chunk ID. Post-process to verify (see the validator sketch after this list).
  • Grounding verification — separate validation step asks "is this claim supported by these chunks?" (Ragas faithfulness metric).
  • Refuse-when-unsure prompt — explicit "answer 'I don't have enough information' if context doesn't cover the question."
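
A minimal post-generation citation check, assuming answers cite chunk IDs inline as [chunk:ID]; the tag format and helper names are conventions invented for this sketch:

```python
import re

CITE_PATTERN = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")  # e.g. "...must verify identity [chunk:fatf-r10-s3]"

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, list[str]]:
    """Reject answers that cite chunk IDs not present in the retrieved set,
    or that make claims without citing anything at all."""
    cited = CITE_PATTERN.findall(answer)
    unknown = [cid for cid in cited if cid not in retrieved_ids]
    ok = bool(cited) and not unknown
    return ok, unknown

# ok, bad_ids = validate_citations(answer, {c.id for c in retrieved_chunks})
# if not ok: route to refusal / human review instead of returning the answer
```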

Failure mode 4 · Stale data

Symptom: model cites outdated regulation or policy.

Fixes

  • Index versioning — each chunk carries effective_date and superseded_date metadata. Filter retrieval by date (see the sketch after this list).
  • Refresh pipeline — scheduled re-ingestion when source changes; change-detection at document level.
  • Source-of-truth references — each chunk points back to a canonical document store; the index is a derived artifact you can always rebuild from that store.
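
A minimal sketch of effective-date filtering in application code; the metadata field names are illustrative, and most vector stores can apply an equivalent filter server-side at query time:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Chunk:
    id: str
    text: str
    effective_date: date
    superseded_date: Optional[date]  # None = still in force

def in_force(chunk: Chunk, as_of: date) -> bool:
    """True if the chunk's source document was effective on `as_of`."""
    started = chunk.effective_date <= as_of
    not_superseded = chunk.superseded_date is None or chunk.superseded_date > as_of
    return started and not_superseded

# current = [c for c in candidates if in_force(c, date.today())]
# Superseded chunks stay in the index; pass a historical `as_of` to query them deliberately.
```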

Failure mode 5 · Access control bleed

Symptom: agent retrieves a document the user shouldn't see.

Fixes

  • Per-tenant indexes or metadata-filtered retrieval — at query time, restrict to chunks the user has permission to read (see the sketch after this list).
  • Authorization at retrieval, not at generation — don't rely on the model to "not mention" something it shouldn't have seen.
  • Audit log of every retrieval — who, what, when.
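
A sketch of enforcing authorization at retrieval time plus an audit record, using in-memory stand-ins; the ACL shape (tenant_id, allowed_groups) is an assumption for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ChunkHit:
    id: str
    text: str
    tenant_id: str
    allowed_groups: set[str]

def authorize(hits: list[ChunkHit], tenant_id: str, user_groups: set[str]) -> list[ChunkHit]:
    """Drop anything the user may not read before it ever reaches the prompt.
    In production the same predicate runs server-side as a retrieval filter."""
    return [h for h in hits if h.tenant_id == tenant_id and h.allowed_groups & user_groups]

def audit(log: list[dict], user_id: str, query: str, hits: list[ChunkHit]) -> None:
    """Append-only record of who retrieved what, when."""
    log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "query": query,
        "chunk_ids": [h.id for h in hits],
    })
```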

RAG architectures — the spectrum

1 · Naïve / single-shot

Embed query → retrieve top-K → generate. Baseline.

2 · Hybrid

BM25 + vector + reranking. Minimum bar for production.

3 · Agentic

Agent has retrieval as a tool. Decides when and how to retrieve, can iterate. Modern default for complex queries.

4 · Graph RAG

Entities and relationships extracted into a knowledge graph; retrieval traverses graph + retrieves text.

5 · Self-RAG / CRAG

Model assesses retrieved context, decides whether to retrieve more, regenerate, or refuse. Research-flavored but practical.

6 · Router RAG

Different KBs for different queries. Small classifier or LLM routes the query to the right index.

Agentic RAG pattern — compliance fit

The agentic pattern fits compliance well: investigators iterate; the agent should too. Agent decides to call lookup_regulation(query="EU AML transaction threshold") → reads → decides it also needs lookup_regulation(query="Wirecard precedent") → synthesizes draft with citations.
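
A sketch of that loop with retrieval exposed as a tool, using the Anthropic Messages API's tool-use interface; the tool schema, model name, and lookup_regulation body are illustrative assumptions:

```python
# Agentic retrieval sketch using Anthropic tool use. The tool schema, model name,
# and lookup_regulation body are illustrative assumptions, not a fixed design.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "lookup_regulation",
    "description": "Hybrid search over the regulatory KB; returns chunks with IDs and effective dates.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def lookup_regulation(query: str) -> str:
    # Placeholder: call the hybrid retriever + reranker here and format chunks with IDs.
    return f"[chunk:demo-1] ...results for '{query}'..."

def run_agent(question: str, max_rounds: int = 10) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_rounds):                       # cap retrieval rounds
        response = client.messages.create(
            model="claude-sonnet-4-5",                # illustrative model name
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":        # agent is done; return the text
            return "".join(b.text for b in response.content if b.type == "text")
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []                             # run every tool call, feed results back
        for block in response.content:
            if block.type == "tool_use":
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": lookup_regulation(**block.input),
                })
        messages.append({"role": "user", "content": tool_results})
    return "Stopped after too many retrieval rounds."
```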

Embeddings — what to know

  • Model choice matters: OpenAI text-embedding-3-large, Voyage AI, Cohere, open-source (BGE, E5, mxbai). Benchmark on your own data.
  • Dimension: 1024-1536 common; 3072+ for highest quality. Tradeoff vs storage/latency.
  • Domain adaptation: fine-tuning embeddings on your domain data lifts retrieval significantly when quality plateaus.
  • Multi-vector / late interaction: ColBERT and similar represent each token with its own vector. Higher quality, more storage.

Vector databases — by use case

  • pgvector · You already have Postgres. Easiest operationally.
  • OpenSearch / Elasticsearch · Hybrid search (BM25 + vector) in one place, mature ops.
  • Pinecone · Managed, fast, simple API. Cost grows with scale.
  • Weaviate · Rich filtering, hybrid, open source.
  • Qdrant, Chroma, Milvus, LanceDB · Each has a niche; all credible.
  • Bedrock KB / Vertex AI Vector Search · Cloud-managed, integrate cleanly with the rest of the stack.

Evaluating RAG specifically

RAG has its own metrics; Ragas is the well-known framework (a usage sketch follows the list below).

  • Faithfulness — are the answer's claims supported by the retrieved context? (LLM-as-judge.)
  • Answer relevance — does the answer address the question?
  • Context precision — are the retrieved chunks relevant?
  • Context recall — do the retrieved chunks contain everything needed to answer?
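
A minimal Ragas run over a tiny eval set, assuming the classic ragas evaluate API; imports and column names have shifted slightly across ragas versions, so treat this as a sketch:

```python
# Ragas evaluation sketch. Column names and imports follow the classic ragas API;
# check your installed version, since the interface has changed across releases.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_set = Dataset.from_dict({
    "question":     ["What is the EU AML cash transaction reporting threshold?"],
    "contexts":     [["<retrieved chunk 1>", "<retrieved chunk 2>"]],
    "answer":       ["<generated answer with citations>"],
    "ground_truth": ["<gold answer written by a compliance SME>"],
})

scores = evaluate(
    eval_set,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric aggregates; drill into per-row scores for failure analysis
```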

Beyond Ragas: citation correctness (most important for compliance — is the cited regulation actually the source?), refusal accuracy (does the system refuse when it should?), end-to-end answer correctness (human or LLM-judge against gold answers).

Production RAG — operational concerns

  • Refresh cadence — daily at minimum for regulatory KBs; on-demand on publication.
  • Index size and shard strategy — billion-vector indexes are doable but design matters.
  • Cache layer — identical/near-identical queries hit a cache instead of re-retrieving (see the sketch after this list).
  • Cost monitoring — embedding + vector reads + LLM calls add up. Per-query and per-tenant budgets.
  • Privacy at retrieval — tokenize PII before it touches anything third-party.
  • Disaster recovery — can you rebuild the index from source of truth in a known time?
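
A minimal exact-match query cache keyed on a normalized query and the current index version; a production version would add a TTL and possibly a semantic-similarity tier. Names are illustrative:

```python
import hashlib

class QueryCache:
    """Exact-match cache: normalize the query, hash it, reuse the stored answer.
    Invalidate by bumping index_version whenever the KB is re-ingested."""

    def __init__(self, index_version: str):
        self.index_version = index_version
        self._store: dict[str, str] = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())  # collapse case and whitespace
        return hashlib.sha256(f"{self.index_version}:{normalized}".encode()).hexdigest()

    def get(self, query: str) -> str | None:
        return self._store.get(self._key(query))

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = answer
```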

Common interview probes — with answers

Q1: "Why use RAG over long context?"

"Three reasons. Cost — RAG retrieves only the few chunks relevant to a query; long context pays for the whole corpus on every call. Selectivity — long context suffers from 'lost in the middle,' where the model misses details buried mid-prompt; RAG puts only the relevant material in view. Operational decoupling — a RAG index updates independently of the agent, so adding a new regulation doesn't require touching the prompt. Long context is the right call when the task genuinely needs whole-document reasoning — say, drafting an impact analysis on a single 200-page regulation. RAG is the default when the corpus is bigger than any one task, when freshness matters, and when you need citations. In a compliance KB you almost always want RAG; you might use long context as a downstream step after retrieval narrows the relevant document."

Q2: "How do you choose chunk size?"

"It's the tradeoff between retrieval precision and answer context. Smaller chunks — 200-500 tokens — give precise retrieval but lose surrounding context, so the model sees a sentence stripped of its section heading. Larger chunks — 1000-1500 — preserve context but dilute the embedding signal and waste tokens on irrelevant content. The pragmatic default I'd start with is 500-800 tokens with 10-20% overlap, then tune based on retrieval evals. Better than guessing size: respect document structure. For regulatory documents that means recursive chunking by chapter then section then paragraph — chunks become natural units, not arbitrary slices. And the parent-child trick is often the right answer: index small chunks for retrieval precision, but return the parent paragraph or section for generation context."

Q3: "How do you handle queries the KB can't answer?"

"Make refusal a first-class output, not a failure. Three layers. One, system prompt discipline — explicit instruction that the model must answer 'I don't have enough information' rather than guess when retrieved context doesn't cover the question. Two, grounding validation — a post-generation check that every claim in the answer maps back to a cited chunk; ungrounded claims fail the answer. Three, retrieval-quality signal — if top-K scores are below a threshold, the system surfaces 'no good match' and routes to human review rather than generating. For compliance specifically, the refusal itself is valuable signal — it tells the team where the KB has gaps. I'd capture refusals into the eval dataset and use them to prioritize content gaps."

Q4: "How do you keep retrievals fresh when the source updates?"

"Source-of-truth lives outside the index. The index is a derived artifact rebuilt from canonical sources. Change-detection runs on the source — document version, hash, last-modified — and triggers re-ingestion only for what changed. You don't re-embed the whole corpus nightly; you re-embed deltas. Each chunk carries metadata for effective_date and superseded_date. Retrieval filters by date so a query about today's rule doesn't surface last year's superseded version — but the superseded version stays in the index because a regulator may still ask about historical decisions. For regulations published on a known cadence, schedule the refresh; for ad-hoc updates, webhook the publisher or poll. And version the index itself so audit logs can pin which index version answered which query."

Q5: "How do you eval whether retrieval is good vs whether the answer is good?"

"Separate the two — they're different failure modes. Retrieval quality uses Ragas-style metrics: context precision asks 'are the retrieved chunks relevant?', context recall asks 'do the chunks contain everything needed to answer?'. Those require a labeled set where you know the right answer and the right supporting chunks. Answer quality is the next layer: faithfulness — are the answer's claims supported by the cited chunks — plus answer relevance and citation correctness. The diagnostic move: when an answer is wrong, check retrieval first. If the right chunks weren't retrieved, the retriever is the bug — fix chunking, reranking, or query expansion. If the right chunks were retrieved but the answer is still wrong, the generator is the bug — fix the prompt, grounding instructions, or model choice. That split tells you where to invest."

Q6: "Walk me through how you'd add hybrid search to a vector-only system."

"Goal: combine BM25 keyword matching with vector similarity so you catch both lexical exact matches and semantic matches. Two implementation paths. One, if the vector store supports hybrid natively — OpenSearch, Weaviate, recent versions of pgvector with full-text — enable the BM25 side and use reciprocal rank fusion or a weighted score to combine. Two, if not, run two retrievers in parallel: vector search returning top-50 with scores, BM25 search returning top-50 with scores, then merge with RRF — for each doc, score = sum of 1/(k + rank_in_each_retriever). RRF is simple and surprisingly robust because it sidesteps the score-normalization problem between BM25 and cosine similarity. Then layer a cross-encoder reranker on the merged top-50 to get final top-5-to-10. Eval the change against a held-out set before swapping it in — hybrid usually wins, but the weighting needs tuning per corpus."

Q7: "An investigator says the AI cited a regulation that doesn't exist. How do you debug?"

"Pull the trace first — every agent run should have one. Find the exact retrieved chunks for that query and the model output. Then triage:

  1. Was the cited regulation in the retrieved chunks? If no, the model fabricated the citation — that's a generator failure. Tighten the system prompt's citation discipline, add a post-generation validator that every cited ID exists in the retrieved set, and add a runtime guardrail: every cited reg ID must resolve in your citation registry or the answer is rejected.
  2. Was it in the retrieved chunks but the cite is wrong? That's an ingestion failure — likely chunk metadata got mis-tagged during indexing. Check the source-of-truth, fix the metadata, re-index that document, add a data-quality check to the ingestion pipeline.
  3. Either way, add this case to the eval set so the regression test catches it forever.
  4. Report back to the investigator with what happened, what was fixed, and how we'll prevent it. That last step matters for trust — compliance teams need to see the AI's failure mode being treated as a real incident, not waved away."

The "design a RAG for compliance" talking point

~90 seconds out loud

"Start with the corpus shape — regulatory documents are highly structured, so I'd respect the structure with recursive structural chunking by chapter and section, with metadata for jurisdiction, document type, effective date. Embedding model: a strong production embedder, ideally domain-fine-tuned if I have labeled data. Index: hybrid — BM25 plus vector — because regulatory queries often have specific terminology that pure embedding misses. I'd add contextual retrieval — that Anthropic technique of prepending a Haiku-generated context paragraph before embedding — because regulatory chunks lose meaning without the surrounding section. At query time: query rewriting to expand variants, retrieve top 30-50, rerank to top 5-10 with a cross-encoder. The agent calls retrieval as a tool so it can re-retrieve based on what it learned mid-task. Generation has explicit prompt instructions to cite chunk IDs and refuse when context is insufficient. Validation pass post-generation checks that every claim is grounded in the cited chunks. Eval suite uses Ragas-style metrics — faithfulness, context precision, citation correctness — plus a human-graded sample. Refresh: change-detection on the source store; only re-embed what changed; effective-date metadata so superseded versions don't surface as current."

Memorize the structure: corpus → chunking → embedding → index → query-time → agent → eval → operations. Substitute as needed.