Error Handling for AI Systems
It's its own discipline — different from traditional software because the failures are different.
Why AI error handling is different
Traditional services fail clearly: 500 errors, exceptions, timeouts. You handle them with retries, fallbacks, alerts.
AI systems also fail quietly:
- The model returns a confident-sounding but factually wrong answer.
- The model partially follows instructions.
- The model invents a citation.
- The model picks a tool that's syntactically right but semantically wrong for the task.
- The model produces malformed JSON 1% of the time.
- A tool result is corrupted; the model carries on as if it's fine.
- A prompt-injected document silently steers the agent.
Each of these can ship without anyone noticing. You can't catch them with try/except. You catch them with a layered strategy.
Failure mode taxonomy — speak this fluently
When asked "what failure modes do you design for?" walk through these seven categories:
1. Infrastructure errors
- API timeouts, 5xx errors, connection drops
- Rate limits (429), quota exhaustion
- Network partitions
- Tool service downtime
Handling: retries with exponential backoff + jitter, circuit breakers, timeouts, fallback models, queueing, graceful degradation.
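As a sketch of the circuit-breaker piece: open after N consecutive failures, fail fast during a cooldown, then allow a single trial call. The class and thresholds here are illustrative, not tied to any particular SDK:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; fail fast until a cooldown passes."""
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success resets the count
        return result
```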
2. Format / parsing errors
- Malformed JSON
- Missing required fields
- Wrong types
- Truncated output (hit max_tokens mid-structure)
Handling: schema validation (Pydantic, Zod), repair-prompts, forced tool-call structured output, max_tokens budget that anticipates output length.
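A minimal sketch of the schema-validation step using Pydantic; the TriageResult fields are invented for illustration:

```python
from pydantic import BaseModel, ValidationError

class TriageResult(BaseModel):
    risk_level: str        # e.g. "low" | "medium" | "high"
    rationale: str
    requires_review: bool

def parse_or_raise(raw_text: str) -> TriageResult:
    # model_validate_json fails loudly on malformed or truncated JSON,
    # missing fields, and wrong types -- exactly the errors in this class
    return TriageResult.model_validate_json(raw_text)
```

On ValidationError, repair rather than blind-retry: see the retry pattern below.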
3. Tool errors
- Tool raises an exception
- Tool returns invalid data
- Tool times out
- Tool side-effect partially succeeded
Handling: structured tool-error response (isError: true in MCP), let the model see and react, idempotency keys for side-effecting tools, compensating actions, transactional boundaries when possible.
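A sketch of the structured tool-error shape, in the spirit of MCP's isError: true convention (the wrapper and message text are illustrative). The model sees the failure as data and can retry, pick another tool, or escalate:

```python
def run_tool(tool, args):
    try:
        return {"isError": False, "content": tool(**args)}
    except TimeoutError:
        return {"isError": True,
                "content": "tool timed out; result unknown -- do not assume success"}
    except Exception as e:
        # surface the failure to the model instead of crashing the agent loop
        return {"isError": True, "content": f"{type(e).__name__}: {e}"}
```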
4. Reasoning / quality errors
- Hallucination (factually wrong)
- Sycophancy (agrees with user even when user is wrong)
- Drift (response degrades over conversation)
- Incomplete response (skips required sections)
- Wrong tool chosen
- Wrong arguments to right tool
Handling: this is eval territory — you only catch these by evaluating. Plus runtime checks: post-output validators (does the narrative cover all required sections?), grounding checks (is each claim backed by retrieval?), confidence scoring.
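A sketch of a grounding check, assuming hypothetical extract_claims and supports helpers (sentence splitting plus an entailment or LLM-as-judge check in practice):

```python
def grounding_report(answer: str, retrieved_passages: list[str]):
    claims = extract_claims(answer)  # placeholder: sentence-split or LLM-extracted
    unsupported = [c for c in claims
                   if not any(supports(p, c) for p in retrieved_passages)]
    ratio = 1 - len(unsupported) / max(len(claims), 1)
    return ratio, unsupported  # low ratio -> route to review, don't auto-ship
```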
5. Safety / policy errors
- Output contains PII that shouldn't be there
- Output makes a decision the agent isn't authorized to make
- Output triggers a refusal
- Output contains a prompt-injection payload
Handling: output guardrails (regex / classifier / LLM-as-judge), policy gates, refusal-handling fallbacks, human escalation.
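The cheapest guardrail tier is regex screening for obvious PII before anything ships; these patterns are illustrative and US-centric, and real deployments layer a classifier or LLM judge on top:

```python
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def pii_violations(text: str) -> list[str]:
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```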
6. Agent-loop errors
- Infinite tool-call loops
- Stuck states (model keeps calling same tool)
- Token / cost blow-up
- Lost context (information falls out of conversation window)
Handling: max-step limits, cost budgets, stuck-state detection (same tool 3 times in a row?), context compaction, checkpoint-and-resume.
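A sketch of the loop guards, with illustrative thresholds; the harness calls check before executing each tool call:

```python
class LoopGuard:
    def __init__(self, max_steps=20, max_cost_usd=2.00, stuck_threshold=3):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.stuck_threshold = stuck_threshold
        self.steps = 0
        self.cost = 0.0
        self.recent_calls = []

    def check(self, tool_name, tool_args, step_cost):
        self.steps += 1
        self.cost += step_cost
        self.recent_calls.append((tool_name, repr(tool_args)))
        if self.steps > self.max_steps:
            raise RuntimeError("max steps exceeded: halt and escalate")
        if self.cost > self.max_cost_usd:
            raise RuntimeError("cost budget exceeded: halt and escalate")
        last = self.recent_calls[-self.stuck_threshold:]
        # same tool with same args N times in a row = stuck state
        if len(last) == self.stuck_threshold and len(set(last)) == 1:
            raise RuntimeError("stuck state: identical tool call repeated; escalate")
```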
7. Adversarial errors
- Prompt injection (direct or indirect)
- Jailbreak attempts
- Data poisoning of retrieval corpus
- Tool description manipulation
Handling: input scanning, retrieval-source trust tiering, sandboxing, scope minimization, human review on consequential actions.
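One sketch of trust tiering: tag retrieved content by source tier and wrap low-trust text so prompts and guardrails can treat any instructions inside it as data. Tier names and the wrapper format are assumptions:

```python
from enum import Enum

class TrustTier(Enum):
    INTERNAL = 3  # curated internal corpus
    PARTNER = 2   # vetted third-party feeds
    PUBLIC = 1    # open web, user uploads

def render_context(doc, tier: TrustTier) -> str:
    # tag lower-trust content so the model treats embedded instructions
    # as data, never as directives
    if tier is TrustTier.PUBLIC:
        return f"<untrusted source='{doc.source}'>\n{doc.text}\n</untrusted>"
    return doc.text
```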
The retry pattern — done right
Naive retries make AI failures worse (you keep paying for the same hallucination).
```python
import random
import time

def exponential_backoff(attempt, base=1.0, cap=30.0):
    # full jitter: sleep a random duration up to base * 2^attempt, capped
    return random.uniform(0, min(cap, base * 2 ** attempt))

def llm_call_with_retry(prompt, schema, max_attempts=3):
    # ValidationError comes from your validator (e.g. Pydantic);
    # RateLimitError / APIError from your provider SDK
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = llm.complete(prompt)
            validated = schema.parse(response.text)
            return validated
        except ValidationError as e:
            last_error = e
            # repair-prompt: include the bad output and the validation error
            prompt = build_repair_prompt(prompt, response.text, str(e))
            continue
        except RateLimitError:
            time.sleep(exponential_backoff(attempt))
            continue
        except APIError as e:
            if e.is_transient:
                time.sleep(exponential_backoff(attempt))
                continue
            raise  # permanent errors (auth, bad request) should not be retried
    raise MaxRetriesExceeded(last_error)
```
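A possible shape for the build_repair_prompt helper used above; the exact wording is yours to tune:

```python
def build_repair_prompt(original_prompt, bad_output, validation_error):
    return (
        f"{original_prompt}\n\n"
        "Your previous response failed validation.\n"
        f"Previous response:\n{bad_output}\n\n"
        f"Validation error:\n{validation_error}\n\n"
        "Return ONLY corrected JSON that satisfies the schema."
    )
```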
Key principles:
- Distinguish transient vs permanent. Don't retry on 4xx auth errors.
- Repair, don't blind retry. When a schema fails, show the model what was wrong.
- Cap attempts. 3 is normal. After that, escalate.
- Cap cost. Total cost budget per task — abort if exceeded.
- Different error → different strategy. Rate limit = backoff. Schema error = repair. Tool error = let model decide.
Idempotency for side-effecting tools
The hardest class of AI errors: the agent calls a side-effecting tool, gets a timeout, retries, and now there are two records.
For any tool that mutates state:
- Generate idempotency key client-side (UUID), pass to the tool.
- Tool checks: have I seen this key? If yes, return the prior result.
- Persist the (idempotency_key → result) mapping with a TTL.
Duplicate SAR filings, duplicate notices, duplicate alerts: not okay. Make every write operation idempotent.
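A minimal server-side sketch, with an in-memory dict standing in for a persistent TTL store such as Redis:

```python
import time
import uuid

_results: dict[str, tuple[float, object]] = {}  # key -> (expires_at, result)
TTL_SECONDS = 24 * 3600

def idempotent(key: str, action, *args, **kwargs):
    entry = _results.get(key)
    if entry and entry[0] > time.time():
        return entry[1]  # seen this key before: return prior result, no re-execution
    result = action(*args, **kwargs)
    _results[key] = (time.time() + TTL_SECONDS, result)
    return result

# client side: generate the key once, reuse it across every retry
key = str(uuid.uuid4())
```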
Human-in-the-loop as error handling
The big one for compliance. Build a system where:
- Low-confidence outputs route to human review.
- High-stakes decisions always require human confirmation, even on high-confidence outputs.
- Edge cases (novel patterns, ambiguous inputs, conflicting evidence) are detected and escalated.
- Humans can correct, and corrections feed back into evals.
The agent should have an explicit escalate_to_human tool. It's not a failure to use it — it's the right behavior when the agent isn't sure.
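A sketch of that tool's definition in the common JSON-schema tool format; the field names are illustrative:

```python
escalate_to_human = {
    "name": "escalate_to_human",
    "description": (
        "Escalate the current task to a human reviewer. Use when confidence "
        "is low, evidence conflicts, or the action exceeds your authority."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "reason": {"type": "string"},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
            "context_summary": {"type": "string"},
        },
        "required": ["reason", "severity"],
    },
}
```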
Confidence and abstention
Models say things confidently regardless of whether they know. To make confidence meaningful:
- Self-rated confidence: ask the model to score its own confidence (calibrated against eval data — most models are overconfident; you map raw scores to actual probabilities).
- Multi-sample agreement: run the same query 3-5 times; if outputs disagree, treat as low confidence. (Self-consistency.)
- Grounding ratio: % of claims with retrieval support. Lower = riskier.
- Refusal as data: when the model declines or says "I don't have enough information" — that's useful. Surface it, don't suppress it.
- Abstention loop: agent has an "I can't answer this" path. Common in QA systems.
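A sketch of the multi-sample agreement check from the list above, assuming hypothetical llm_answer and normalize helpers:

```python
from collections import Counter

def self_consistency(query: str, n: int = 5, threshold: float = 0.6):
    answers = [normalize(llm_answer(query)) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    if agreement < threshold:
        return None, agreement  # disagreement: abstain / escalate
    return best, agreement
```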
Validation layers — defense in depth
Production AI systems usually have 3-4 validation layers:
1. Schema / format validation: the output parses and has the required fields and types.
2. Business-rule validation: values are legal in the domain (allowed enums, dates in range, valid IDs).
3. Grounding / quality checks: claims are backed by retrieval; required sections are present.
4. Safety guardrails: no PII leakage, no unauthorized decisions, no injection payloads.
Don't skip layers. Each catches a different failure class.
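As a sketch of how the layers chain, reusing the earlier sketches (parse_or_raise, grounding_report, pii_violations); check_business_rules, the exception types, and the 0.9 threshold are placeholders:

```python
class LowGroundingError(Exception): ...
class SafetyViolation(Exception): ...

def validate_output(raw_text: str, retrieved_passages: list[str]) -> TriageResult:
    parsed = parse_or_raise(raw_text)            # layer 1: schema / format
    check_business_rules(parsed)                 # layer 2: domain rules (placeholder)
    ratio, unsupported = grounding_report(parsed.rationale, retrieved_passages)
    if ratio < 0.9:                              # layer 3: grounding (threshold illustrative)
        raise LowGroundingError(unsupported)
    if pii_violations(raw_text):                 # layer 4: safety guardrail
        raise SafetyViolation("PII pattern in output")
    return parsed
```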
Logging errors for AI specifically
In addition to traditional logs, capture:
- Full input prompt (or hash + reference) at the time of the call
- Model version and parameters
- Raw output (not just parsed)
- Validation errors if any (with the malformed output)
- Tool calls (name, args, result, error)
- Eval / grader scores when available
- User feedback (thumbs, edits, escalations)
This is your debugging substrate AND your audit trail. Without raw inputs/outputs, you can't reproduce or explain what happened.
Redact or tokenize sensitive fields before logging. Either redact at write time or store full content in a tightly-controlled, retention-managed store.
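A sketch of the per-call record as one structured event; field names are illustrative and store_sensitive stands in for your controlled store:

```python
import hashlib
import json
import time

def log_llm_call(prompt, response, model, params, tool_calls, validation_error=None):
    record = {
        "ts": time.time(),
        "model": model,
        "params": params,                       # temperature, max_tokens, ...
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt_ref": store_sensitive(prompt),  # pointer into the controlled store
        "raw_output": response,                 # raw output, not just the parsed result
        "validation_error": validation_error,   # include the malformed output's error
        "tool_calls": tool_calls,               # [{name, args, result, error}, ...]
    }
    print(json.dumps(record, default=str))      # stand-in for your log sink
```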
Observability — what to monitor
| Metric | Why |
|---|---|
| Error rate per error class | Catch infra vs format vs reasoning failures separately |
| Output schema validation failure rate | Format drift, model regression |
| Tool error rate | Tool reliability, integration health |
| Average tool calls per task | Stuck-loop detection |
| Cost per task | Budget overruns |
| Latency p50/p95/p99 | UX, capacity |
| Eval scores (online sample) | Quality drift |
| Human override / edit rate | Real-world quality signal |
| Escalation rate | Where the agent doesn't know |
| Input distribution metrics | Drift detection |
Set alerts on rate-of-change, not just absolute thresholds. Sudden drop in tool success rate = investigate.
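A sketch of a rate-of-change alert on tool success rate; the 30% relative drop is an illustrative threshold:

```python
def rate_of_change_alert(current_window, baseline_window, max_relative_drop=0.30):
    """Alert when success rate falls sharply versus the trailing baseline."""
    current = sum(current_window) / len(current_window)    # 1 = success, 0 = failure
    baseline = sum(baseline_window) / len(baseline_window)
    if baseline > 0 and (baseline - current) / baseline > max_relative_drop:
        return f"ALERT: success rate {current:.0%} vs baseline {baseline:.0%}"
    return None
```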
Failing safe in compliance contexts
The default failure mode in compliance must be more human review, never less.
- Tool error → don't proceed with partial info; escalate.
- Low confidence → escalate.
- Validation failure → escalate, log, do not auto-recover.
- Cost / step budget exceeded → halt, escalate.
- Drift / eval regression detected → revert to prior model/prompt, page on-call.
- Unfamiliar input pattern → escalate.
When in doubt: pause. No regulator has ever penalized a firm for asking a human to double-check.
Chaos engineering for AI
Worth mentioning if you want to sound senior:
- Inject malformed tool results to verify the agent recovers.
- Force timeouts on a tool — does the harness fall back?
- Replay traces with corrupted prompts — does validation catch it?
- Inject prompt-injection payloads in test docs — does the guardrail catch it?
- Model fallback drills — primary model down, do you switch?
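The first drill as a pytest-style sketch, with inject_tool_result and run_agent standing in for your test harness:

```python
def test_agent_recovers_from_malformed_tool_result():
    # inject a corrupted payload where the tool normally returns clean JSON
    with inject_tool_result("search_records", '{"truncated": '):
        outcome = run_agent("summarize account activity for case 123")
    # the agent should escalate or retry -- never present corrupted data as fact
    assert outcome.status in ("escalated", "retried_and_succeeded")
    assert "truncated" not in outcome.final_answer
```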
Talking-point: "How do you handle errors in agentic systems?"
"I think about errors in three buckets: infrastructure (timeouts, rate limits, tool service errors), format (schema failures, malformed JSON), and reasoning (hallucination, wrong tool, drift). Each gets a different strategy — backoff and circuit-breakers for infra, repair-prompts and structured-output enforcement for format, evals and human-in-the-loop for reasoning. On top of that, I idempotency-key any side-effecting tool, cap step and cost budgets to prevent runaway loops, and instrument every layer so when something goes wrong I can replay the trace. In compliance, the strong default is fail-safe: if the agent isn't confident or the inputs look unusual, it escalates rather than guesses."