Section B · Concept & build pattern

Harnesses & Agents

The vocabulary AI engineers use for the loop, the patterns, and the wrappers. Plus a 10-step sketch for any "design an agent" interview question.

📚 Reference 🧩 Patterns library 📐 10-step design template

What "harness" means — disambiguate first

"Harness" has three overlapping uses in AI interviews. Listen for context:

  • Eval harness — runner that loads a dataset, calls your system, scores outputs. (See 07-evals.)
  • Agent harness — the loop that drives an LLM through tool-use cycles: prompt → output → parse → call tools → feed results → repeat.
  • LLM harness — the wrapper around individual model calls: retries, parsing, structured-output coercion, caching, fallbacks, logging.
Ask

If a question is ambiguous, ask: "When you say harness, do you mean the eval runner, the agent loop, or the wrapper around the LLM call?" The question itself signals fluency.

The agent loop — every framework boils down to this

Internalize this shape. Be able to draw it on a whiteboard.

agent_loop.py (canonical)
class MaxStepsExceeded(Exception):
    """Raised when the agent runs out of steps before finishing."""

def agent_loop(user_request, tools, max_steps=20):
    # llm and execute_tool are placeholders for your model client and tool dispatcher.
    messages = [{"role": "user", "content": user_request}]

    for step in range(max_steps):
        response = llm.complete(messages, tools=tools)

        if response.stop_reason == "end_turn":
            return response.text  # done: the model answered without requesting a tool

        if response.stop_reason == "tool_use":
            tool_results = []
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.args)
                tool_results.append({"tool_use_id": tool_call.id, "content": result})

            # Append the assistant turn, then the tool results as the next user turn,
            # so the model sees what its tools returned on the next iteration.
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue

    raise MaxStepsExceeded()

Every modern framework — Claude Agent SDK, LangChain agents, OpenAI Assistants, n8n's AI agent node — is a variation of this.

What a production-grade harness adds on top

A naïve while True: call llm() loop is not a harness. A real harness adds (a sketch of a few of these follows the list):

  • Retries with backoff — for transient API errors, rate limits, timeouts
  • Structured-output coercion — re-prompt or repair when the model returns malformed JSON
  • Token / cost budgets — bail when a step blows past a configured cost
  • Step / time limits — hard cap on agent steps to prevent runaway loops
  • Tool authorization — gate tool calls by user identity / risk tier
  • Human-in-the-loop pauses — agent yields, asks for approval, resumes
  • Trace capture — every step (input, output, tool call, result) recorded
  • Caching — repeated identical sub-calls don't re-incur cost
  • Fallback routing — primary model fails → retry on secondary → degrade gracefully
  • Guardrails — pre-filter (PII, injection); post-filter (policy, leaks)
  • Sandboxing — tool execution in an isolated environment
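
A minimal sketch of that LLM-harness layer, in the same placeholder style as the loop above: llm stands in for your client, and TransientAPIError, estimate_cost, and the repair prompt are illustrative names, not a real SDK.

llm_harness.py (sketch)
import json
import time

class BudgetExceeded(Exception): ...
class MaxRetriesExceeded(Exception): ...

def harnessed_call(messages, tools=None, max_retries=3, max_cost_usd=0.50):
    """One model call wrapped with retries, a cost budget, and JSON repair."""
    for attempt in range(max_retries):
        try:
            response = llm.complete(messages, tools=tools)  # same placeholder client as above
        except TransientAPIError:                           # rate limit / timeout / 5xx
            time.sleep(2 ** attempt)                        # exponential backoff: 1s, 2s, 4s
            continue

        if estimate_cost(response) > max_cost_usd:          # token / cost budget
            raise BudgetExceeded()

        try:
            return json.loads(response.text)                # structured-output coercion
        except json.JSONDecodeError:
            # Repair path: show the model its bad output and ask again on the next attempt.
            messages = messages + [
                {"role": "assistant", "content": response.text},
                {"role": "user", "content": "Return valid JSON only."},
            ]

    raise MaxRetriesExceeded()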

Agent design patterns — Anthropic's "Building Effective Agents"

Memorize these names. The essay is required reading.

Workflows (deterministic paths)

1 · Prompt chaining

Fixed sequence of LLM calls; each step's output feeds the next. Great when the task decomposes cleanly.
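
A prompt-chaining sketch on the same placeholder llm client; the alert-summary steps and prompt wording are illustrative.

prompt_chain.py (sketch)
def summarize_alert_chain(alert_text):
    """Fixed sequence: extract facts, assess risk, draft the write-up."""
    facts = llm.complete([{"role": "user",
        "content": f"Extract the key facts from this alert:\n{alert_text}"}]).text

    assessment = llm.complete([{"role": "user",
        "content": f"Assess the risk suggested by these facts and flag gaps:\n{facts}"}]).text

    return llm.complete([{"role": "user",
        "content": f"Write an analyst summary.\nFacts:\n{facts}\nAssessment:\n{assessment}"}]).text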

2 · Routing

First call classifies the request; downstream steps differ by class. "Is this alert about sanctions, KYC, or TM?" → different chains.
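
A routing sketch; the three handlers and escalate_to_human are hypothetical downstream chains.

routing.py (sketch)
def route_alert(alert_text):
    """Classify first; the downstream chain differs by class."""
    label = llm.complete([{"role": "user",
        "content": f"Classify this alert as exactly one of: sanctions, kyc, tm.\n{alert_text}"}]).text
    label = label.strip().lower()

    handlers = {
        "sanctions": handle_sanctions_alert,   # each handler is its own chain
        "kyc": handle_kyc_alert,
        "tm": handle_tm_alert,
    }
    return handlers.get(label, escalate_to_human)(alert_text)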

3 · Parallelization

Run independent subtasks concurrently. Sectioning (split the task into parts) or voting (run the same task multiple times and take the majority).
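
A voting sketch with a thread pool; n_votes and the classification prompt are illustrative, and the placeholder llm client is assumed safe to call concurrently.

voting.py (sketch)
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def classify_with_voting(alert_text, n_votes=5):
    """Voting: run the same classification n times, take the majority label."""
    prompt = [{"role": "user", "content": f"Classify as sanctions, kyc, or tm:\n{alert_text}"}]
    with ThreadPoolExecutor(max_workers=n_votes) as pool:
        labels = list(pool.map(lambda _: llm.complete(prompt).text.strip().lower(), range(n_votes)))
    return Counter(labels).most_common(1)[0][0]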

4 · Orchestrator-workers

Planner LLM decomposes; worker LLMs execute leaf nodes; orchestrator synthesizes.
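
An orchestrator-workers sketch, again on the placeholder client; a real version would run workers in parallel and use a structured plan rather than one subtask per line.

orchestrator.py (sketch)
def orchestrate(task):
    """Planner decomposes, workers execute the leaf subtasks, orchestrator synthesizes."""
    plan = llm.complete([{"role": "user",
        "content": f"Break this task into independent subtasks, one per line:\n{task}"}]).text

    results = [llm.complete([{"role": "user", "content": subtask}]).text
               for subtask in plan.splitlines() if subtask.strip()]

    return llm.complete([{"role": "user",
        "content": "Synthesize these results into one answer:\n" + "\n---\n".join(results)}]).text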

5 · Evaluator-optimizer

Generator produces a draft, evaluator critiques, generator revises. Loop until accepted.
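
An evaluator-optimizer sketch; the ACCEPT convention and the round cap are illustrative choices.

evaluator_optimizer.py (sketch)
def draft_with_review(task, max_rounds=3):
    """Generator drafts, evaluator critiques, generator revises until accepted."""
    draft = llm.complete([{"role": "user", "content": task}]).text

    for _ in range(max_rounds):
        critique = llm.complete([{"role": "user",
            "content": f"Critique this draft against the task. Reply ACCEPT if it needs no changes.\n"
                       f"Task: {task}\nDraft:\n{draft}"}]).text
        if critique.strip().upper().startswith("ACCEPT"):
            return draft
        draft = llm.complete([{"role": "user",
            "content": f"Revise the draft to address the critique.\nCritique:\n{critique}\nDraft:\n{draft}"}]).text

    return draft  # best effort after max_rounds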

Agents (open-ended)

Agent — the LLM has tools and decides on its own when to stop. Use only when the task genuinely requires open-endedness; deterministic workflows are cheaper, more predictable, easier to evaluate.

The big takeaway

Don't reach for an agent when a workflow will do. Agents are powerful but expensive, slower, and harder to evaluate. In compliance, you'll often want a workflow with a small agentic step in the middle.

Scaffold vs agent — the tradeoff

More scaffold (workflow) vs. more agent (open-ended):

  • Predictable, faster, cheaper vs. flexible, handles novel cases
  • Easier to evaluate vs. harder to evaluate (trajectories)
  • Easier to audit (clear states) vs. auditing requires deep tracing
  • Brittle on novel tasks vs. adaptive
  • Lower model dependency vs. hostage to model quality

For high-stakes compliance, the strong default is scaffold-heavy with surgical agentic steps. You won't deploy a fully open-ended agent that decides whether to file a SAR. You'll deploy a workflow with explicit states, where one step uses an agent to do a bounded subtask.
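
One way to sketch that shape, reusing agent_loop from above; extract_facts, the two read-only tools, draft_narrative, and queue_for_human_review are hypothetical steps, not a prescribed design.

bounded_agent_step.py (sketch)
def alert_review_workflow(alert):
    """Scaffold-heavy workflow: only the evidence-gathering step is agentic."""
    facts = extract_facts(alert)                      # deterministic step

    # Bounded agentic step: small read-only toolset, tight step cap, no side effects.
    evidence = agent_loop(
        user_request=f"Gather supporting evidence for these facts:\n{facts}",
        tools=[search_case_history, lookup_watchlist],
        max_steps=5,
    )

    draft = draft_narrative(facts, evidence)          # deterministic step
    return queue_for_human_review(draft)              # a human decides what gets filed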

The major harnesses / frameworks

Claude Agent SDK

First-party Anthropic primitives. Tight MCP integration, strong permission-gating, hooks, sub-agents. Most direct path for production Claude + MCP agents.

LangChain / LangGraph

Big ecosystem. LangGraph's state-machine model is great for auditable workflows. Some teams find the abstractions heavy.

n8n + AI nodes

Workflow-as-canvas + native AI step. Visual representation is itself audit-friendly. Strong fit for compliance work.

OpenAI Assistants / Agents SDK

First-party OpenAI primitives. Ecosystem-specific. Mention you know they exist; this role is Claude-first.

CrewAI / AutoGen

Multi-agent collaboration patterns. Useful for specific structured-team scenarios.

DSPy

Programming framework that compiles prompt/agent code, optimizes prompts via evaluation.

Pydantic AI

Typed agents in Python, tight schema validation.

n8n-specific concepts to know

  • Workflow — a graph of nodes executed top-down with branching
  • Node — a step (HTTP, transform, AI agent, conditional, wait)
  • Trigger node — entry point (cron, webhook, queue)
  • Sub-workflows — reusable workflows called from others
  • Wait node — pause until human approval, signal, or time
  • AI Agent node — LangChain-based LLM with tools
  • Memory nodes — chat history, buffer, vector retrieval
  • Execution history — every run logged with inputs/outputs at each node — your audit trail substrate

Memory and context management

Real agents accumulate context across steps; production harnesses manage it actively.

  • Short-term memory — the conversation/trace itself. The harness decides what to keep in the prompt.
  • Long-term memory — vector store + retrieval, explicit memory tools, summarization.
  • Working memory / scratchpad — dedicated tool the agent uses to write notes mid-task.
  • Compaction / summarization — long traces summarized to fit context windows.
  • Sliding window — drop oldest messages when context is full (both sketched below)
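
A sketch combining the last two items, on the same placeholder client: keep recent turns verbatim, summarize the rest; the thresholds are illustrative.

compaction.py (sketch)
def compact_messages(messages, max_messages=40, keep_recent=10):
    """Sliding window plus compaction: summarize old turns, keep recent ones verbatim."""
    if len(messages) <= max_messages:
        return messages

    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = llm.complete([{"role": "user",
        "content": "Summarize these earlier steps, keeping decisions and open questions:\n" + str(old)}]).text

    return [{"role": "user", "content": f"Summary of earlier steps:\n{summary}"}] + recent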

10-step sketch — answer any "design an agent" question

Memorize this. You can apply it to any design question.

  1. Risk-tier the task. Low (drafts a summary) vs high (recommends a SAR). Different gates.
  2. Pick the simplest pattern that fits. Routing → chain → orchestrator-workers → agent.
  3. List the tools with descriptions. Mark side-effecting ones. Those need approval (sketched after this list).
  4. Define the human checkpoints. Where does the workflow pause for sign-off?
  5. Specify the audit log. What fields at each step? Where does it persist?
  6. Plan the eval. Dataset, grader, what "good" means, offline regression coverage.
  7. Plan the failure paths. Tool error, model timeout, ambiguous input, contradictory data.
  8. Plan the rollout. Shadow → HITL → autonomous (only for low tier).
  9. Plan ongoing monitoring. Online evals, drift, cost/latency dashboards.
  10. Plan the kill switch. How to stop the system fast if it's misbehaving.
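
Steps 1, 3, and 4 can be made concrete in the tool specs themselves; the field names, tools, and request_human_approval hook below are illustrative.

tool_gating.py (sketch)
TOOLS = [
    {"name": "search_case_history", "description": "Read-only search of prior cases.",
     "side_effecting": False, "risk_tier": "low"},
    {"name": "file_sar_draft", "description": "Creates a draft SAR in the case system.",
     "side_effecting": True, "risk_tier": "high"},
]

def execute_tool_gated(name, args, user):
    """Human checkpoint: side-effecting or high-risk tools pause for approval."""
    spec = next(t for t in TOOLS if t["name"] == name)
    if spec["side_effecting"] or spec["risk_tier"] == "high":
        if not request_human_approval(user, name, args):   # placeholder approval hook
            return "Tool call rejected by reviewer."
    return execute_tool(name, args)                         # same dispatcher as the agent loop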

Talking-point answer: "What's a harness?"

A few sentences, clean vocabulary

"Depends on context. The eval harness is the runner that scores outputs against a dataset. The agent harness is the loop that drives an LLM through tool-use until completion. The LLM harness is the wrapper around individual model calls — retries, schema validation, caching, logging. In a real system you have all three: the agent harness orchestrates the agent loop, the LLM harness wraps each model call inside it, the eval harness exercises the whole thing offline and online."