Model Deployment — AWS & GCP
You don't need to be a cloud architect, but you should be able to speak fluently about the components, name the right service for each role, and reason about the tradeoffs.
What "deployment" means for this role
In agentic-LLM work, you typically don't train and host large models — you consume them via API (Claude API directly, or via Bedrock/Vertex). What you do deploy:
- The orchestrator (n8n, custom Python, Airflow) — the workflow layer.
- MCP servers (custom Python or TypeScript) exposing tools to agents.
- Inference glue (Lambda / Cloud Run / containers) that wraps API calls with retries, caching, validation (a minimal sketch follows this list).
- Vector DBs / retrieval indexes.
- Audit / observability stacks.
- Evaluation harnesses (often as scheduled jobs).
- Maybe: fine-tuned smaller models (open-source, 7-70B) for narrow tasks where API isn't right.
That last bullet is where SageMaker / Vertex AI / Bedrock get interesting.
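To make the "inference glue" bullet concrete, here is a minimal sketch of a retry wrapper around a Claude call, assuming the official `anthropic` Python SDK (which also has retries built in; caching and output validation are omitted for brevity). The model id is an example; pin whatever versioned id you actually use.

```python
import time
import anthropic  # assumption: the official Anthropic Python SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_model(prompt: str, model: str = "claude-sonnet-4-20250514",
               max_attempts: int = 4) -> str:
    """Call the model with exponential backoff on transient failures."""
    for attempt in range(max_attempts):
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.content[0].text
        except (anthropic.RateLimitError, anthropic.InternalServerError,
                anthropic.APIConnectionError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between retries
    raise RuntimeError("unreachable")
```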
AWS — Foundation models / hosted LLMs
- Amazon Bedrock: managed access to Anthropic, Meta, Mistral, Cohere, and Amazon's own models via a single API. The most relevant service for this role: Claude on Bedrock is the same Anthropic models behind AWS auth, billing, and governance. Compliance teams in AWS shops often prefer Bedrock for SOC/PCI/HIPAA inheritance. (Invocation sketch after this list.)
- Bedrock Agents: AWS's first-party agent framework. Tool use, action groups, knowledge bases. Lighter-weight than rolling your own.
- Bedrock Knowledge Bases: managed RAG — point at S3, get a vector store, get retrieval.
- Bedrock Guardrails: pre/post filtering for PII, profanity, off-topic.
- Amazon SageMaker: end-to-end ML platform. Train, fine-tune, host. Less central for API-consumed Claude; central if you're hosting your own models.
- SageMaker JumpStart: deploy a pre-trained open-source model in a few clicks.
- SageMaker Endpoints: hosted inference for your models.
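A hedged sketch of what a Bedrock Claude call looks like through `boto3`'s model-agnostic Converse API; the region and model id are placeholders for whatever your account has enabled:

```python
import boto3

# Assumes credentials come from the environment or an attached IAM role.
brt = boto3.client("bedrock-runtime", region_name="eu-west-1")

resp = brt.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model id
    messages=[{"role": "user", "content": [{"text": "Summarise this KYC alert."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.0},
)
print(resp["output"]["message"]["content"][0]["text"])
```

Same models as the direct Anthropic API; the payload shape and the auth plane are what change.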
AWS — Compute for orchestration / inference glue
- Lambda: serverless functions. Right for: short-lived MCP servers, webhook handlers, post-processing. Cold starts matter for low-latency UX.
- AWS Fargate / ECS: containers without EC2 management. Right for: longer-running services, MCP HTTP servers, n8n self-hosted.
- EKS: Kubernetes when scale or complexity justifies. Often overkill for this role.
- App Runner: simple managed-container service. Easiest path for "just run this image as an HTTP service."
- Step Functions: workflow orchestration with built-in state, retries, fan-out. Compelling for compliance flows where state and audit-friendliness matter.
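To show why Step Functions is audit-friendly, a minimal Amazon States Language definition, written as a Python dict ready for `json.dumps`. State names, ARNs, and field values are hypothetical; the built-in Retry block and the direct DynamoDB service integration are the point.

```python
# Hypothetical two-state machine: screen an alert, then write an audit record.
state_machine = {
    "StartAt": "ScreenAlert",
    "States": {
        "ScreenAlert": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:screen-alert",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 3,
            }],
            "Next": "WriteAuditRecord",
        },
        "WriteAuditRecord": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "audit-events",
                "Item": {"pk": {"S.$": "$.alert_id"}, "verdict": {"S.$": "$.verdict"}},
            },
            "End": True,
        },
    },
}
```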
AWS — Storage / data & vector search
- S3: object storage. Foundational. Includes Object Lock for WORM compliance.
- DynamoDB: serverless NoSQL. Good for session state, idempotency keys, light metadata (idempotency sketch after this list).
- Aurora / RDS PostgreSQL: relational. Aurora supports `pgvector` for vector search.
- OpenSearch: full-text + vector search.
- Kinesis Data Streams / MSK (managed Kafka): streaming.
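The idempotency-key pattern mentioned above, as a hedged sketch: a DynamoDB conditional write claims each key exactly once, so a retried webhook or replayed event doesn't double-process. Table and attribute names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("idempotency-keys")  # hypothetical table

def claim(key: str) -> bool:
    """True if this key is fresh; False if the work was already done."""
    try:
        table.put_item(
            Item={"pk": key},
            ConditionExpression="attribute_not_exists(pk)",  # fail if key exists
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```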
Vector / search
- OpenSearch with vector engine: hybrid (BM25 + vector) at managed scale.
- Aurora pgvector: pragmatic, a single store for app data + vectors (query sketch after this list).
- Bedrock Knowledge Bases: managed end-to-end RAG.
- Third-party options (Pinecone, Weaviate) integrate easily.
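A hedged sketch of the pgvector option, assuming `psycopg` and a hypothetical `doc_chunks` table with an `embedding vector` column:

```python
import psycopg

def top_k_chunks(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    """Nearest-neighbour retrieval; <=> is pgvector's cosine-distance operator."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector literal
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_id, content
            FROM doc_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```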
AWS — Observability / governance
- CloudWatch: logs, metrics, alarms.
- CloudTrail: API audit log. Critical for compliance — every AWS API call is logged.
- AWS Config: configuration history and drift detection.
- Secrets Manager / Parameter Store: credential management.
- IAM: identity, scoped roles. Get this right or nothing else matters.
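"Get this right or nothing else matters" is easier to say with an example. A hedged sketch of a least-privilege policy (a Python dict ready for `json.dumps`) that lets a harness role invoke exactly one Bedrock model:

```python
bedrock_invoke_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "bedrock:InvokeModel",
            "bedrock:InvokeModelWithResponseStream",
        ],
        # Foundation-model ARNs carry no account id; the model id is a placeholder.
        "Resource": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
    }],
}
```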
Compliance / certification
AWS publishes attestations: SOC 1/2/3, PCI DSS, HIPAA, FedRAMP, ISO 27001, etc. Inheriting these via AWS infrastructure is part of why regulated firms use it. For a multi-jurisdictional firm, region selection (e.g. eu-west-1 for EU customer data) is part of the compliance design.
GCP — the parallel services
Foundation models
- Vertex AI Model Garden: hosts Anthropic Claude, Google Gemini, Llama, Mistral, and others behind a single API. The parallel to Bedrock. (Invocation sketch after this list.)
- Vertex AI Agent Builder: agent framework with tool use and knowledge integration.
- Vertex AI Search / Vertex AI Vector Search: managed retrieval / RAG.
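A hedged sketch of calling Claude through Vertex, assuming the `AnthropicVertex` client from the `anthropic` SDK; project, region, and model id are placeholders (Vertex Claude ids use an `@` date suffix):

```python
from anthropic import AnthropicVertex

client = AnthropicVertex(project_id="my-gcp-project", region="europe-west1")

resp = client.messages.create(
    model="claude-sonnet-4@20250514",  # placeholder Vertex model id
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise this alert."}],
)
print(resp.content[0].text)
```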
Compute
- Cloud Functions: serverless functions. Lambda parallel.
- Cloud Run: managed containers, scales to zero. Often the sweet spot for HTTP services in a GCP-heavy org.
- GKE: Kubernetes.
- Workflows: managed orchestration (Step Functions parallel).
Storage / data
- GCS: object storage with retention policies.
- Firestore: managed NoSQL.
- AlloyDB, Cloud SQL: PostgreSQL with pgvector support.
- BigQuery: warehouse + ML extensions; central piece in GCP.
- Pub/Sub: streaming.
Observability / governance
- Cloud Logging, Cloud Monitoring: logs and metrics.
- Cloud Audit Logs: equivalent of CloudTrail.
- Secret Manager: credentials.
- IAM: identity.
GCP's compliance attestations are similar to AWS's. Both clouds are credible bases for regulated workloads.
Reference 1: Compliance agent on AWS
The shape for an alert-pre-screening or case-narrative-drafting agent: Bedrock for inference, MCP servers on ECS, DynamoDB + S3 for the audit trail.
Key choices to articulate
- Bedrock over direct Anthropic API: easier compliance inheritance, single billing, AWS-native auth.
- ECS over Lambda for MCP servers: longer-lived, state-friendlier, no cold-start hit.
- DynamoDB + S3 for audit: events small/queryable, content immutable/cheap.
- Object Lock on S3: regulatory retention enforcement at infra layer.
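The Object Lock bullet in code form, as a hedged sketch; the bucket name is hypothetical and the bucket must have been created with Object Lock enabled. COMPLIANCE mode means not even the root account can shorten retention.

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
s3.put_object(
    Bucket="audit-artifacts",  # hypothetical; Object Lock enabled at creation
    Key="cases/2025/case-123/narrative-v1.json",
    Body=b'{"decision": "escalate"}',
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=5 * 365),
)
```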
Reference 2: Compliance agent on GCP
The GCP version swaps in the parallel services: Vertex AI for Bedrock, Cloud Run for ECS/Fargate, Firestore + GCS retention policies for DynamoDB + S3 Object Lock, Cloud Audit Logs for CloudTrail. Both architectures are reasonable. Pick the cloud the team's already on; don't introduce a second cloud unless required.
Self-hosted vs managed — the tradeoff
| When to self-host | When to use API/managed |
|---|---|
| Strict data-residency requirements that API providers can't meet | The default for general work |
| Latency-critical paths (low-millisecond, very local) | Most workloads tolerate API latency |
| Very high-volume, narrow tasks where a fine-tune beats a foundation model | Foundation models keep getting better |
| You have in-house ML platform expertise | Most teams don't, and shouldn't pretend to |
For this kind of role, the default is managed (Bedrock / Vertex / direct Anthropic). Self-hosting has a high TCO; justify it only when a compliance constraint forces it.
CI / CD for AI systems
Deployment isn't only "running services" — it's also the lifecycle around them.
- Model versioning: pin specific versions (`claude-opus-4-7`, not aliases). Promote between dev/staging/prod via config.
- Prompt versioning: prompts in a registry, semver-tagged, deployed alongside code.
- Eval gating: a PR can't merge unless the offline eval suite passes. CI runs evals against a canonical dataset (sketch after this list).
- Shadow / canary: new prompt or model runs alongside existing in a small % of traffic, compared automatically.
- Rollback: any change must have an immediate rollback path. Config flag, prompt-pinning, model alias.
- Infra-as-code: Terraform / Pulumi / CDK / Cloud Deployment Manager for the deployment itself.
- Secrets rotation: API keys, JWTs, service-account credentials all rotate on schedule. No human reads them.
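A hedged sketch of eval gating in pytest form; `run_eval` and the dataset path are hypothetical stand-ins for your harness, and CI simply fails the build when the assertion fails:

```python
import json
from my_evals import run_eval  # hypothetical eval harness

PASS_RATE_FLOOR = 0.95  # assumption: tune to your suite's noise level

def test_eval_suite_passes():
    with open("evals/canonical_dataset.json") as f:
        cases = json.load(f)
    results = [run_eval(case) for case in cases]
    pass_rate = sum(r.passed for r in results) / len(results)
    assert pass_rate >= PASS_RATE_FLOOR, (
        f"eval pass rate {pass_rate:.1%} below floor {PASS_RATE_FLOOR:.0%}"
    )
```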
Containerization
You should be able to write a basic Dockerfile and a Compose or Kubernetes manifest.
```dockerfile
FROM python:3.12-slim
WORKDIR /app
# Copy the dependency manifest first so the install layer caches across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```
Have notes ready on:
- Multi-stage builds: smaller production images.
- Distroless / minimal base: smaller attack surface.
- Health checks: liveness and readiness probes.
- Resource limits: CPU and memory caps prevent runaway containers.
- Non-root user: don't run as root inside the container.
Network / security at the deployment layer
- VPC / Private endpoints: keep traffic off the public internet. AWS PrivateLink, GCP Private Service Connect.
- WAF: web application firewall in front of public endpoints.
- mTLS between internal services.
- KMS / Cloud KMS: encryption keys managed centrally (envelope-encryption sketch after this list).
- Customer-managed keys (CMK): required by some compliance regimes for the most sensitive data.
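A hedged sketch of the envelope-encryption pattern with AWS KMS: generate a data key under a centrally managed key, encrypt locally, persist only the encrypted copy of the data key. The key alias is a placeholder.

```python
import boto3

kms = boto3.client("kms")
key = kms.generate_data_key(KeyId="alias/sensitive-docs", KeySpec="AES_256")

plaintext_key = key["Plaintext"]       # use for local AES-GCM, then discard
encrypted_key = key["CiphertextBlob"]  # store next to the ciphertext
# Decrypt later with kms.decrypt(CiphertextBlob=encrypted_key).
```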
Observability stack — the practical shape
For an LLM-heavy system, you need traces that include LLM call details, not just HTTP latencies.
| Layer | Tool |
|---|---|
| Application logs | CloudWatch / Cloud Logging, structured JSON |
| Metrics | CloudWatch / Cloud Monitoring + Prometheus if needed |
| Distributed tracing | OpenTelemetry → AWS X-Ray / Cloud Trace / Datadog |
| LLM-specific tracing | Langfuse / Phoenix / Braintrust |
| Audit log (compliance-grade) | Append-only DynamoDB / BigQuery + S3/GCS WORM |
| Dashboards | CloudWatch Dashboards / Looker / Grafana |
| Alerts | CloudWatch Alarms / Cloud Monitoring + PagerDuty/Opsgenie |
The LLM-specific layer is the new piece many candidates miss. Mentioning OpenTelemetry GenAI semantic conventions or a tool like Langfuse signals current awareness.
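A hedged sketch of what that looks like in code: an OpenTelemetry span around a model call, using GenAI semantic-convention attribute names (current at time of writing). `call_model` is the hypothetical glue function sketched earlier; exporter setup is omitted.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def traced_model_call(prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        # Attribute names follow the OTel GenAI semantic conventions.
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", "claude-sonnet-4-20250514")
        text = call_model(prompt)  # hypothetical glue function from earlier
        # In a real harness, also record token usage from the API response,
        # e.g. gen_ai.usage.input_tokens / gen_ai.usage.output_tokens.
        return text
```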
Common deployment design questions
- "How would you deploy a multi-tenant MCP server on AWS such that tenants are isolated?"
- "Walk me through the deployment of a RAG-backed agent on GCP."
- "How would you scale your inference glue layer when traffic 10×'s?"
- "What's your rollout strategy for a new prompt that's been A/B-tested in eval?"
- "How would you keep a regulatory document index fresh without re-embedding everything daily?"
- "What does the failure path look like when Bedrock returns a 5xx?"
For each, the answer structure: components → data flow → scale assumptions → failure modes → cost/security tradeoffs. Even if you don't know the exact service name, the structure makes you sound architect-shaped.
Talking-point: "How would you deploy this on AWS?"
"Default to managed services where possible — less ops surface, more compliance attestations to inherit. The agent harness runs on ECS Fargate (or Lambda if it's short-running and stateless). Inference goes through Bedrock for Claude — same model, AWS-native auth, easier compliance story. Tools are exposed via MCP servers running on Fargate, behind an internal ALB, accessed by the harness via JWT-scoped auth. Vector retrieval uses Bedrock Knowledge Bases or OpenSearch with the vector engine, depending on team familiarity. Audit log writes go to DynamoDB for the event records and S3 with Object Lock for content. CloudTrail captures the AWS API plane for free. Everything's IaC via Terraform. CI runs offline evals on PRs and gates merges. Secrets in Secrets Manager, rotated on schedule. Region selection respects data residency — eu-west-1 for EU customer data."
Memorize the structure; substitute GCP services if asked.