Harnesses & Agents
The vocabulary AI engineers use for the loop, the patterns, and the wrappers. Plus a 10-step sketch for any "design an agent" interview question.
What "harness" means — disambiguate first
"Harness" has three overlapping uses in AI interviews. Listen for context:
| Term | What it means |
|---|---|
| Eval harness | Runner that loads a dataset, calls your system, scores outputs. (See 07-evals.) |
| Agent harness | The loop that drives an LLM through tool-use cycles: prompt → output → parse → call tools → feed results → repeat. |
| LLM harness | The wrapper around individual model calls: retries, parsing, structured-output coercion, caching, fallbacks, logging. |
If a question is ambiguous, ask: "When you say harness, do you mean the eval runner, the agent loop, or the wrapper around the LLM call?" The question itself signals fluency.
The agent loop — every framework boils down to this
Internalize this shape. Be able to draw it on a whiteboard.
```python
def agent_loop(user_request, tools, max_steps=20):
    messages = [{"role": "user", "content": user_request}]
    for step in range(max_steps):
        response = llm.complete(messages, tools=tools)
        if response.stop_reason == "end_turn":
            return response.text  # done
        if response.stop_reason == "tool_use":
            tool_results = []
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.args)
                tool_results.append({"tool_use_id": tool_call.id, "content": result})
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
    raise MaxStepsExceeded()
```
Every modern framework — Claude Agent SDK, LangChain agents, OpenAI Assistants, n8n's AI agent node — is a variation of this loop.
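The loop above assumes an `execute_tool` helper. A minimal sketch of one, with a hypothetical tool registry (the tool names and the `{"ok": ...}` result shape are illustrative, not from any SDK) — note that errors are returned as results the model can read and recover from, rather than raised:

```python
# Hypothetical tool registry: name -> callable. In a real harness,
# side-effecting tools would also pass an authorization check here.
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
    "lookup_customer": lambda args: {"id": args["id"], "risk": "low"},
}

def execute_tool(name, args):
    """Dispatch a tool call; failures become results the model can see."""
    fn = TOOLS.get(name)
    if fn is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": fn(args)}
    except Exception as e:
        # Bad arguments, downstream failures, etc. -- surface, don't crash.
        return {"ok": False, "error": str(e)}
```

Feeding errors back as tool results lets the model retry with corrected arguments instead of killing the whole loop.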
What a production-grade harness adds on top
A naïve `while True: llm()` loop is not a harness. A real harness adds:
- Retries with backoff — for transient API errors, rate limits, timeouts.
- Structured-output coercion — re-prompt or repair when the model returns malformed JSON.
- Token / cost budgets — bail when a step blows past a configured cost.
- Step / time limits — hard cap on agent steps to prevent runaway loops.
- Tool authorization — gate tool calls by user identity / risk tier.
- Human-in-the-loop pauses — agent yields, asks for approval, resumes.
- Trace capture — every step (input, output, tool call, result) recorded.
- Caching — repeated identical sub-calls don't re-incur cost.
- Fallback routing — primary model fails → retry on secondary → degrade gracefully.
- Guardrails — pre-filter (PII, injection); post-filter (policy, leaks).
- Sandboxing — tool execution in an isolated environment.
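The first two items can be sketched in a few lines. This is a minimal illustration, not any SDK's API: `TransientError` stands in for whatever retryable-error class your client raises, and `repair_fn` would be a re-prompt of the model in practice.

```python
import json
import random
import time

class TransientError(Exception):
    """Rate limit / timeout / 5xx -- safe to retry."""

def call_with_retries(call, max_attempts=3, base_delay=0.01):
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted; let the caller decide
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

def coerce_json(text, repair_fn):
    """Parse model output as JSON; on failure, ask for a repair once."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(repair_fn(text))
```

The second `json.loads` deliberately raises if the repair also fails — at that point you escalate rather than loop forever.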
Agent design patterns — Anthropic's "Building Effective Agents"
Memorize these names. The essay is required reading.
Workflows (deterministic paths)
1 · Prompt chaining
Fixed sequence of LLM calls; each step's output feeds the next. Great when the task decomposes cleanly.
2 · Routing
First call classifies the request; downstream steps differ by class. "Is this alert about sanctions, KYC, or TM?" → different chains.
3 · Parallelization
Run independent subtasks concurrently: sectioning (split the task into parts) or voting (run the same task multiple times and take the majority answer).
4 · Orchestrator-workers
Planner LLM decomposes; worker LLMs execute leaf nodes; orchestrator synthesizes.
5 · Evaluator-optimizer
Generator produces a draft, evaluator critiques, generator revises. Loop until accepted.
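The evaluator-optimizer pattern (#5) reduces to a short loop. A sketch under stated assumptions: `generate` and `evaluate` are hypothetical LLM-call wrappers, and the `{"accepted": ..., "feedback": ...}` verdict shape is illustrative.

```python
def evaluator_optimizer(task, generate, evaluate, max_rounds=3):
    """Draft -> critique -> revise, until accepted or the budget is spent."""
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        verdict = evaluate(task, draft)  # e.g. {"accepted": bool, "feedback": str}
        if verdict["accepted"]:
            return draft
        draft = generate(task, feedback=verdict["feedback"])
    return draft  # best effort after the round budget is exhausted
```

The `max_rounds` cap matters: without it, a generator and evaluator that disagree will ping-pong forever — the same runaway-loop risk the harness section guards against.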
Agents (open-ended)
Agent — the LLM has tools and decides on its own when to stop. Use only when the task genuinely requires open-endedness; deterministic workflows are cheaper, more predictable, easier to evaluate.
Don't reach for an agent when a workflow will do. Agents are powerful but expensive, slower, and harder to evaluate. In compliance, you'll often want a workflow with a small agentic step in the middle.
Scaffold vs agent — the tradeoff
| More scaffold (workflow) | More agent (open-ended) |
|---|---|
| Predictable, faster, cheaper | Flexible, handles novel cases |
| Easier to evaluate | Harder to evaluate (trajectories) |
| Easier to audit (clear states) | Auditing requires deep tracing |
| Brittle on novel tasks | Adaptive |
| Lower model dependency | Hostage to model quality |
For high-stakes compliance, the strong default is scaffold-heavy with surgical agentic steps. You won't deploy a fully open-ended agent that decides whether to file a SAR. You'll deploy a workflow with explicit states, where one step uses an agent to do a bounded subtask.
The major harnesses / frameworks
Claude Agent SDK
First-party Anthropic primitives. Tight MCP integration, strong permission-gating, hooks, sub-agents. Most direct path for production Claude + MCP agents.
LangChain / LangGraph
Big ecosystem. LangGraph's state-machine model is great for auditable workflows. Some teams find the abstractions heavy.
n8n + AI nodes
Workflow-as-canvas + native AI step. Visual representation is itself audit-friendly. Strong fit for compliance work.
OpenAI Assistants / Agents SDK
First-party OpenAI primitives. Ecosystem-specific. Mention you know they exist; this role is Claude-first.
CrewAI / AutoGen
Multi-agent collaboration patterns. Useful for specific structured-team scenarios.
DSPy
Declarative framework for programming LLM pipelines; compiles them and optimizes the prompts automatically against an eval metric instead of hand-tuning.
Pydantic AI
Typed agents in Python, tight schema validation.
n8n-specific concepts to know
- Workflow — a graph of nodes executed top-down with branching
- Node — a step (HTTP, transform, AI agent, conditional, wait)
- Trigger node — entry point (cron, webhook, queue)
- Sub-workflows — reusable workflows called from others
- Wait node — pause until human approval, signal, or time
- AI Agent node — a LangChain-based LLM agent with tool access
- Memory nodes — chat history, buffer, vector retrieval
- Execution history — every run logged with inputs/outputs at each node — your audit trail substrate
Memory and context management
Real agents accumulate context across steps; production harnesses manage it actively.
- Short-term memory — the conversation/trace itself. The harness decides what to keep in the prompt.
- Long-term memory — vector store + retrieval, explicit memory tools, summarization.
- Working memory / scratchpad — dedicated tool the agent uses to write notes mid-task.
- Compaction / summarization — long traces summarized to fit context windows.
- Sliding window — drop oldest messages when context is full.
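Sliding window and compaction combine naturally: instead of silently dropping the oldest messages, summarize them. A sketch with hypothetical `count_tokens` and `summarize` callables (a real harness would use the model's tokenizer and an LLM summarization call):

```python
def fit_context(messages, max_tokens, count_tokens, summarize):
    """Keep recent messages verbatim; compact everything older into a summary."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= max_tokens:
        return messages  # fits as-is; nothing to do
    # Keep the most recent messages that fit in half the budget...
    kept, used = [], 0
    for m in reversed(messages):
        t = count_tokens(m["content"])
        if used + t > max_tokens // 2:
            break
        kept.append(m)
        used += t
    kept.reverse()
    # ...and compact everything older into a single summary message,
    # leaving the other half of the budget for the summary and new turns.
    older = messages[: len(messages) - len(kept)]
    summary = {"role": "user", "content": summarize(older)}
    return [summary] + kept
```

Reserving only half the budget for verbatim history is an arbitrary but common-sense split; the point is that the harness, not the model, decides what survives in the prompt.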
10-step sketch — answer any "design an agent" question
Memorize this. You can apply it to any design question.
- Risk-tier the task. Low (drafts a summary) vs high (recommends a SAR). Different gates.
- Pick the simplest pattern that fits. Routing → chain → orchestrator-workers → agent.
- List the tools with descriptions. Mark side-effecting ones. Those need approval.
- Define the human checkpoints. Where does the workflow pause for sign-off?
- Specify the audit log. What fields at each step? Where does it persist?
- Plan the eval. Dataset, grader, what "good" means, offline regression coverage.
- Plan the failure paths. Tool error, model timeout, ambiguous input, contradictory data.
- Plan the rollout. Shadow → HITL → autonomous (only for low tier).
- Plan ongoing monitoring. Online evals, drift, cost/latency dashboards.
- Plan the kill switch. How to stop the system fast if it's misbehaving.
Talking-point answer: "What's a harness?"
"Depends on context. The eval harness is the runner that scores outputs against a dataset. The agent harness is the loop that drives an LLM through tool-use until completion. The LLM harness is the wrapper around individual model calls — retries, schema validation, caching, logging. In a real system you have all three: the agent harness orchestrates the agent loop, the LLM harness wraps each model call inside it, the eval harness exercises the whole thing offline and online."