Section B · Critical

Error Handling for AI Systems

It's its own discipline — different from traditional software because the failures are different.

Why AI error handling is different

Traditional services fail clearly: 500 errors, exceptions, timeouts. You handle them with retries, fallbacks, alerts.

AI systems also fail quietly:

  • The model returns a confident-sounding but factually wrong answer.
  • The model partially follows instructions.
  • The model invents a citation.
  • The model picks a tool that's syntactically right but semantically wrong for the task.
  • The model produces malformed JSON 1% of the time.
  • A tool result is corrupted; the model carries on as if it's fine.
  • A prompt-injected document silently steers the agent.

The hard part

Each of these can ship without anyone noticing. You can't catch them with try/except. You catch them with a layered strategy.

Failure mode taxonomy — speak this fluently

When asked "what failure modes do you design for?" walk through these seven categories:

1. Infrastructure errors
  • API timeouts, 5xx errors, connection drops
  • Rate limits (429), quota exhaustion
  • Network partitions
  • Tool service downtime

Handling: retries with exponential backoff + jitter, circuit breakers, timeouts, fallback models, queueing, graceful degradation.
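
Backoff with jitter shows up in the retry pattern later in this section; here is a minimal circuit-breaker sketch to pair with it. The thresholds are illustrative and the class is a generic wrapper, not tied to any particular SDK.

import time

class CircuitBreaker:
    # After repeated failures, stop calling the dependency for a cooldown period,
    # then allow a single trial call ("half-open") before fully closing again.
    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast instead of calling the dependency")
            self.opened_at = None  # cooldown elapsed: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the breaker
        return result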

2. Format / parsing errors
  • Malformed JSON
  • Missing required fields
  • Wrong types
  • Truncated output (hit max_tokens mid-structure)

Handling: schema validation (Pydantic, Zod), repair-prompts, forced tool-call structured output, max_tokens budget that anticipates output length.
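
A minimal sketch of the schema-validation half, assuming Pydantic v2; the AlertTriage fields and the raw output string are illustrative, and the repair prompt simply echoes the bad output plus the validator's error back to the model.

from pydantic import BaseModel, ValidationError

class AlertTriage(BaseModel):
    risk_level: str        # hypothetical fields, for illustration only
    rationale: str
    requires_review: bool

raw_output = '{"risk_level": "high", "rationale": "structuring pattern"}'  # missing a field

try:
    triage = AlertTriage.model_validate_json(raw_output)
except ValidationError as err:
    # Repair-prompt: show the model exactly what was wrong with its last attempt
    repair_prompt = (
        "Your previous output failed schema validation.\n"
        f"Output: {raw_output}\n"
        f"Errors: {err}\n"
        "Return corrected JSON that matches the schema exactly."
    )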

3. Tool errors
  • Tool raises an exception
  • Tool returns invalid data
  • Tool times out
  • Tool side-effect partially succeeded

Handling: structured tool-error response (isError: true in MCP), let the model see and react, idempotency keys for side-effecting tools, compensating actions, transactional boundaries when possible.
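
A sketch of surfacing tool failures as data the model can react to, loosely following the MCP result shape (isError plus a content list); run_tool and the message wording are illustrative.

def run_tool(tool_fn, **args):
    # Never let a tool exception kill the agent loop; return a structured error instead
    try:
        result = tool_fn(**args)
        return {"isError": False, "content": [{"type": "text", "text": str(result)}]}
    except TimeoutError:
        return {"isError": True, "content": [{"type": "text",
                "text": "Tool timed out; the side effect may or may not have happened. Do not assume success."}]}
    except Exception as exc:
        return {"isError": True, "content": [{"type": "text", "text": f"Tool failed: {exc}"}]}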

4. Reasoning / quality errors
  • Hallucination (factually wrong)
  • Sycophancy (agrees with user even when user is wrong)
  • Drift (response degrades over conversation)
  • Incomplete response (skips required sections)
  • Wrong tool chosen
  • Wrong arguments to right tool

Handling: this is eval territory — you only catch these by evaluating. Plus runtime checks: post-output validators (does the narrative cover all required sections?), grounding checks (is each claim backed by retrieval?), confidence scoring.
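
A sketch of a runtime post-output validator of the kind just described; the required section names are illustrative, and the grounding check here is the crudest possible proxy (token overlap), shown only to make the shape concrete.

REQUIRED_SECTIONS = ["Summary", "Evidence", "Recommendation"]   # illustrative

def validate_narrative(narrative: str, retrieved_docs: list[str]) -> dict:
    missing = [s for s in REQUIRED_SECTIONS if s.lower() not in narrative.lower()]
    sentences = [s.strip() for s in narrative.split(".") if s.strip()]
    corpus = " ".join(retrieved_docs).lower()
    # Naive grounding proxy: a sentence counts as grounded if its opening words appear in retrieval
    grounded = sum(1 for s in sentences if any(w in corpus for w in s.lower().split()[:5]))
    return {
        "missing_sections": missing,
        "grounding_ratio": grounded / max(len(sentences), 1),
    }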

5. Safety / policy errors
  • Output contains PII that shouldn't be there
  • Output makes a decision the agent isn't authorized to make
  • Output triggers a refusal
  • Output contains a prompt-injection payload

Handling: output guardrails (regex / classifier / LLM-as-judge), policy gates, refusal-handling fallbacks, human escalation.
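
A minimal regex-based output guardrail; the two patterns are illustrative only, and in practice this layer sits alongside a classifier or LLM-as-judge rather than replacing them.

import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_output(text: str) -> list[str]:
    # Return the PII categories detected in a candidate output; any hit blocks or escalates
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(scan_output("Reach John at john.doe@example.com, SSN 123-45-6789"))   # -> ['ssn', 'email']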

6. Agent-loop errors
  • Infinite tool-call loops
  • Stuck states (model keeps calling same tool)
  • Token / cost blow-up
  • Lost context (information falls out of conversation window)

Handling: max-step limits, cost budgets, stuck-state detection (same tool 3 times in a row?), context compaction, checkpoint-and-resume.
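
A sketch of the loop guards above; the step limit, cost budget, and repeat threshold are illustrative numbers, and history is assumed to be the list of tool calls (dicts with "tool" and "args") recorded by the harness.

def should_halt(history: list[dict], cost_so_far: float,
                max_steps: int = 20, max_cost: float = 2.00, repeat_limit: int = 3):
    # Return a reason to stop the agent loop, or None to keep going
    if len(history) >= max_steps:
        return "max steps reached"
    if cost_so_far >= max_cost:
        return "cost budget exceeded"
    recent = history[-repeat_limit:]
    if len(recent) == repeat_limit and len({(c["tool"], str(c["args"])) for c in recent}) == 1:
        return "stuck: same tool called with the same args repeatedly"
    return None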

7. Adversarial errors
  • Prompt injection (direct or indirect)
  • Jailbreak attempts
  • Data poisoning of retrieval corpus
  • Tool description manipulation

Handling: input scanning, retrieval-source trust tiering, sandboxing, scope minimization, human review on consequential actions.

The retry pattern — done right

Naive retries make AI failures worse (you keep paying for the same hallucination).

Smart retry policy (llm, schema, build_repair_prompt, and the exception types are placeholders for your client/SDK)

import random
import time

def exponential_backoff(attempt, base=1.0, cap=30.0):
    # Exponential backoff with full jitter: 1s, 2s, 4s, ... capped and randomized
    return random.uniform(0, min(cap, base * 2 ** attempt))

def llm_call_with_retry(prompt, schema, max_attempts=3):
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = llm.complete(prompt)
            validated = schema.parse(response.text)
            return validated
        except ValidationError as e:
            # Repair, don't blind-retry: show the model its bad output and the validation error
            last_error = e
            prompt = build_repair_prompt(prompt, response.text, str(e))
            continue
        except RateLimitError as e:
            # Transient: back off and try again
            last_error = e
            time.sleep(exponential_backoff(attempt))
            continue
        except APIError as e:
            last_error = e
            if e.is_transient:
                time.sleep(exponential_backoff(attempt))
                continue
            raise  # permanent (auth, bad request): don't retry
    raise MaxRetriesExceeded(last_error)

Key principles:

  • Distinguish transient vs permanent. Don't retry on 4xx auth errors.
  • Repair, don't blind retry. When a schema fails, show the model what was wrong.
  • Cap attempts. 3 is normal. After that, escalate.
  • Cap cost. Total cost budget per task — abort if exceeded.
  • Different error → different strategy. Rate limit = backoff. Schema error = repair. Tool error = let model decide.

Idempotency for side-effecting tools

The hardest class of AI errors: the agent calls a side-effecting tool, gets a timeout, retries, and now there are two records.

For any tool that mutates state:

  • Generate idempotency key client-side (UUID), pass to the tool.
  • Tool checks: have I seen this key? If yes, return the prior result.
  • Persist (idempotency_key → result) mapping with a TTL.

Critical in compliance

Duplicate SAR filings, duplicate notices, and duplicate alerts are not acceptable. Every write operation must be idempotent.
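
A minimal sketch of the pattern above: an in-memory key → result map (in production this lives in a persistent, TTL-managed store), with tool_fn standing in for any side-effecting tool.

import uuid

_results: dict[str, object] = {}   # idempotency_key -> prior result; persist with a TTL in practice

def call_write_tool(tool_fn, args: dict, idempotency_key: str):
    # Execute a side-effecting tool at most once per key
    if idempotency_key in _results:
        return _results[idempotency_key]   # retried after a timeout: return prior result, no duplicate write
    result = tool_fn(**args)
    _results[idempotency_key] = result
    return result

# The caller generates the key once per logical action and reuses it across retries
filing_key = str(uuid.uuid4())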

Human-in-the-loop as error handling

The big one for compliance. Build a system where:

  • Low-confidence outputs route to human review.
  • High-stakes decisions always require human confirmation, even on high-confidence outputs.
  • Edge cases (novel patterns, ambiguous inputs, conflicting evidence) are detected and escalated.
  • Humans can correct, and corrections feed back into evals.

The agent should have an explicit escalate_to_human tool. It's not a failure to use it — it's the right behavior when the agent isn't sure.

Confidence and abstention

Models say things confidently regardless of whether they know. To make confidence meaningful:

  • Self-rated confidence: ask the model to score its own confidence (calibrated against eval data — most models are overconfident; you map raw scores to actual probabilities).
  • Multi-sample agreement: run the same query 3-5 times; if outputs disagree, treat as low confidence. (Self-consistency; sketched after this list.)
  • Grounding ratio: % of claims with retrieval support. Lower = riskier.
  • Refusal as data: when the model declines or says "I don't have enough information" — that's useful. Surface it, don't suppress it.
  • Abstention loop: agent has an "I can't answer this" path. Common in QA systems.
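
A sketch of the multi-sample agreement check; complete_fn stands in for your model call, and exact-match comparison of answers is the crudest possible agreement measure, used here only to show the pattern.

from collections import Counter

def self_consistency(complete_fn, prompt: str, n: int = 5, threshold: float = 0.6) -> dict:
    # Sample the same query n times; low agreement across samples = low confidence
    answers = [complete_fn(prompt).strip() for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n
    return {
        "answer": top_answer,
        "agreement": agreement,
        "low_confidence": agreement < threshold,   # route to abstention or escalation
    }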

Validation layers — defense in depth

Production AI systems usually stack several validation layers:

[input]
  → input guardrails (PII redaction, injection scan)
  → LLM call (with structured output / forced tool)
  → schema validation (Pydantic)
  → semantic validation (does the output meet business rules?)
  → output guardrails (PII leak check, policy)
  → grounding check (claims in retrieved docs?)
[output]

Don't skip layers. Each catches a different failure class.

Logging errors for AI specifically

In addition to traditional logs, capture:

  • Full input prompt (or hash + reference) at the time of the call
  • Model version and parameters
  • Raw output (not just parsed)
  • Validation errors if any (with the malformed output)
  • Tool calls (name, args, result, error)
  • Eval / grader scores when available
  • User feedback (thumbs, edits, escalations)

This is your debugging substrate AND your audit trail. Without raw inputs/outputs, you can't reproduce or explain what happened.

PII caveat

Redact or tokenize sensitive fields before logging. Either redact at write time or store full content in a tightly-controlled, retention-managed store.
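
A sketch of what a single call record might hold, reflecting the list above and the redaction caveat; the dataclass shape and field names are illustrative, not a standard.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LLMCallRecord:
    prompt_ref: str                    # hash or pointer to the (redacted) prompt store
    model: str
    params: dict
    raw_output: str                    # raw text, not just the parsed result
    validation_errors: list = field(default_factory=list)   # with the malformed output attached
    tool_calls: list = field(default_factory=list)          # name, args, result, error per call
    grader_scores: dict = field(default_factory=dict)
    user_feedback: str | None = None   # thumbs, edits, escalations
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())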

Observability — what to monitor

Metric → why it matters

  • Error rate per error class → catch infra vs format vs reasoning failures separately
  • Output schema validation failure rate → format drift, model regression
  • Tool error rate → tool reliability, integration health
  • Average tool calls per task → stuck-loop detection
  • Cost per task → budget overruns
  • Latency p50/p95/p99 → UX, capacity
  • Eval scores (online sample) → quality drift
  • Human override / edit rate → real-world quality signal
  • Escalation rate → where the agent doesn't know
  • Input distribution metrics → drift detection

Set alerts on rate-of-change, not just absolute thresholds. Sudden drop in tool success rate = investigate.
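
A sketch of a rate-of-change alert on tool success rate; each window is a list of per-call success booleans, and the 20% drop threshold is an illustrative number.

def success_rate_drop_alert(current: list[bool], baseline: list[bool], max_drop: float = 0.20) -> bool:
    # Alert when the recent success rate falls sharply relative to the baseline window
    if not current or not baseline:
        return False
    current_rate = sum(current) / len(current)
    baseline_rate = sum(baseline) / len(baseline)
    return (baseline_rate - current_rate) > max_drop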

Failing safe in compliance contexts

The default

The default failure mode in compliance must be more human review, never less.

  • Tool error → don't proceed with partial info; escalate.
  • Low confidence → escalate.
  • Validation failure → escalate, log, do not auto-recover.
  • Cost / step budget exceeded → halt, escalate.
  • Drift / eval regression detected → revert to prior model/prompt, page on-call.
  • Unfamiliar input pattern → escalate.

When in doubt: pause. No regulator has ever penalized a firm for asking a human to double-check.

Chaos engineering for AI

Worth mentioning if you want to sound senior:

  • Inject malformed tool results to verify the agent recovers.
  • Force timeouts on a tool — does the harness fall back?
  • Replay traces with corrupted prompts — does validation catch it?
  • Inject prompt-injection payloads in test docs — does the guardrail catch it?
  • Model fallback drills — primary model down, do you switch?

Talking-point: "How do you handle errors in agentic systems?"

Senior-level answer — memorize the structure

"I think about errors in three buckets: infrastructure (timeouts, rate limits, tool service errors), format (schema failures, malformed JSON), and reasoning (hallucination, wrong tool, drift). Each gets a different strategy — backoff and circuit-breakers for infra, repair-prompts and structured-output enforcement for format, evals and human-in-the-loop for reasoning. On top of that, I idempotency-key any side-effecting tool, cap step and cost budgets to prevent runaway loops, and instrument every layer so when something goes wrong I can replay the trace. In compliance, the strong default is fail-safe: if the agent isn't confident or the inputs look unusual, it escalates rather than guesses."