Model Deployment — AWS & GCP
You don't need to be a cloud architect, but you should be able to speak fluently about the components, name the right service for each role, and reason about the tradeoffs.
What "deployment" means for this role
In agentic-LLM work, you typically don't train and host large models — you consume them via API (Claude API directly, or via Bedrock/Vertex). What you do deploy:
- The orchestrator (n8n, custom Python, Airflow) — the workflow layer.
- MCP servers (custom Python or TypeScript) exposing tools to agents.
- Inference glue (Lambda / Cloud Run / containers) that wraps API calls with retries, caching, validation (a minimal sketch follows this list).
- Vector DBs / retrieval indexes.
- Audit / observability stacks.
- Evaluation harnesses (often as scheduled jobs).
- Maybe: fine-tuned smaller models (open-source, 7-70B) for narrow tasks where API isn't right.
That last bullet is where SageMaker / Vertex AI / Bedrock get interesting.
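To make the "inference glue" bullet concrete, here is a minimal sketch of a retry wrapper around a Claude call, assuming the official `anthropic` Python SDK (which also has retries built in; caching and output validation are omitted for brevity). The model id is an example; pin whatever versioned id you actually use.

```python
import time
import anthropic  # assumption: the official Anthropic Python SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_model(prompt: str, model: str = "claude-sonnet-4-20250514",
               max_attempts: int = 4) -> str:
    """Call the model with exponential backoff on transient failures."""
    for attempt in range(max_attempts):
        try:
            resp = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.content[0].text
        except (anthropic.RateLimitError, anthropic.InternalServerError,
                anthropic.APIConnectionError):
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s between retries
    raise RuntimeError("unreachable")
```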
AWS — Foundation models / hosted LLMs
- Amazon Bedrock: managed access to Anthropic, Meta, Mistral, Cohere, and Amazon's own models via a single API. The most relevant service for this role: Claude on Bedrock is the same Anthropic models behind AWS auth, billing, and governance. Compliance teams in AWS shops often prefer Bedrock for SOC/PCI/HIPAA inheritance. (Invocation sketch after this list.)
- Bedrock Agents: AWS's first-party agent framework. Tool use, action groups, knowledge bases. Lighter-weight than rolling your own.
- Bedrock Knowledge Bases: managed RAG — point at S3, get a vector store, get retrieval.
- Bedrock Guardrails: pre/post filtering for PII, profanity, off-topic.
- Amazon SageMaker: end-to-end ML platform. Train, fine-tune, host. Less central for API-consumed Claude; central if you're hosting your own models.
- SageMaker JumpStart: deploy a pre-trained open-source model in a few clicks.
- SageMaker Endpoints: hosted inference for your models.
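A hedged sketch of what a Bedrock Claude call looks like through `boto3`'s model-agnostic Converse API; the region and model id are placeholders for whatever your account has enabled:

```python
import boto3

# Assumes credentials come from the environment or an attached IAM role.
brt = boto3.client("bedrock-runtime", region_name="eu-west-1")

resp = brt.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model id
    messages=[{"role": "user", "content": [{"text": "Summarise this KYC alert."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.0},
)
print(resp["output"]["message"]["content"][0]["text"])
```

Same models as the direct Anthropic API; the payload shape and the auth plane are what change.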
AWS — Compute for orchestration / inference glue
- Lambda: serverless functions. Right for: short-lived MCP servers, webhook handlers, post-processing. Cold starts matter for low-latency UX.
- AWS Fargate / ECS: containers without EC2 management. Right for: longer-running services, MCP HTTP servers, n8n self-hosted.
- EKS: Kubernetes when scale or complexity justifies. Often overkill for this role.
- App Runner: simple managed-container service. Easiest path for "just run this image as an HTTP service."
- Step Functions: workflow orchestration with built-in state, retries, fan-out. Compelling for compliance flows where state and audit-friendliness matter.
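To show why Step Functions is audit-friendly, a minimal Amazon States Language definition, written as a Python dict ready for `json.dumps`. State names, ARNs, and field values are hypothetical; the built-in Retry block and the direct DynamoDB service integration are the point.

```python
# Hypothetical two-state machine: screen an alert, then write an audit record.
state_machine = {
    "StartAt": "ScreenAlert",
    "States": {
        "ScreenAlert": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:screen-alert",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2,
                "BackoffRate": 2.0,
                "MaxAttempts": 3,
            }],
            "Next": "WriteAuditRecord",
        },
        "WriteAuditRecord": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "audit-events",
                "Item": {"pk": {"S.$": "$.alert_id"}, "verdict": {"S.$": "$.verdict"}},
            },
            "End": True,
        },
    },
}
```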
AWS — Storage / data & vector search
- S3: object storage. Foundational. Includes Object Lock for WORM compliance.
- DynamoDB: serverless NoSQL. Good for session state, idempotency keys, light metadata (idempotency sketch after this list).
- Aurora / RDS PostgreSQL: relational. Aurora supports `pgvector` for vector search.
- OpenSearch: full-text + vector search.
- Kinesis Data Streams / MSK (managed Kafka): streaming.
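The idempotency-key pattern mentioned above, as a hedged sketch: a DynamoDB conditional write claims each key exactly once, so a retried webhook or replayed event doesn't double-process. Table and attribute names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("idempotency-keys")  # hypothetical table

def claim(key: str) -> bool:
    """True if this key is fresh; False if the work was already done."""
    try:
        table.put_item(
            Item={"pk": key},
            ConditionExpression="attribute_not_exists(pk)",  # fail if key exists
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```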
Vector / search
- OpenSearch with vector engine: hybrid (BM25 + vector) at managed scale.
- Aurora pgvector: pragmatic, a single store for app data + vectors (query sketch after this list).
- Bedrock Knowledge Bases: managed end-to-end RAG.
- Third-party options (Pinecone, Weaviate) integrate easily.
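A hedged sketch of the pgvector option, assuming `psycopg` and a hypothetical `doc_chunks` table with an `embedding vector` column:

```python
import psycopg

def top_k_chunks(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
    """Nearest-neighbour retrieval; <=> is pgvector's cosine-distance operator."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"  # pgvector literal
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT chunk_id, content
            FROM doc_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec, k),
        )
        return cur.fetchall()
```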
AWS — Observability / governance
- CloudWatch: logs, metrics, alarms.
- CloudTrail: API audit log. Critical for compliance — every AWS API call is logged.
- AWS Config: configuration history and drift detection.
- Secrets Manager / Parameter Store: credential management.
- IAM: identity, scoped roles. Get this right or nothing else matters.
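"Get this right or nothing else matters" is easier to say with an example. A hedged sketch of a least-privilege policy (a Python dict ready for `json.dumps`) that lets a harness role invoke exactly one Bedrock model:

```python
bedrock_invoke_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "bedrock:InvokeModel",
            "bedrock:InvokeModelWithResponseStream",
        ],
        # Foundation-model ARNs carry no account id; the model id is a placeholder.
        "Resource": "arn:aws:bedrock:eu-west-1::foundation-model/anthropic.claude-3-5-sonnet-20240620-v1:0",
    }],
}
```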
Compliance / certification
AWS publishes attestations: SOC 1/2/3, PCI DSS, HIPAA, FedRAMP, ISO 27001, etc. Inheriting these via AWS infrastructure is part of why regulated firms use it. For a multi-jurisdictional firm, region selection (e.g. eu-west-1 for EU customer data) is part of the compliance design.
GCP — the parallel services
Foundation models
- Vertex AI Model Garden: hosts Anthropic Claude, Google Gemini, Llama, Mistral, and others behind a single API. The parallel to Bedrock. (Invocation sketch after this list.)
- Vertex AI Agent Builder: agent framework with tool use and knowledge integration.
- Vertex AI Search / Vertex AI Vector Search: managed retrieval / RAG.
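A hedged sketch of calling Claude through Vertex, assuming the `AnthropicVertex` client from the `anthropic` SDK; project, region, and model id are placeholders (Vertex Claude ids use an `@` date suffix):

```python
from anthropic import AnthropicVertex

client = AnthropicVertex(project_id="my-gcp-project", region="europe-west1")

resp = client.messages.create(
    model="claude-sonnet-4@20250514",  # placeholder Vertex model id
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise this alert."}],
)
print(resp.content[0].text)
```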
Compute
- Cloud Functions: serverless functions. Lambda parallel.
- Cloud Run: managed containers, scales to zero. Often the sweet spot for HTTP services in a GCP-heavy org.
- GKE: Kubernetes.
- Workflows: managed orchestration (Step Functions parallel).
Storage / data
- GCS: object storage with retention policies.
- Firestore: managed NoSQL.
- AlloyDB, Cloud SQL: PostgreSQL with pgvector support.
- BigQuery: warehouse + ML extensions; central piece in GCP.
- Pub/Sub: streaming.
Observability / governance
- Cloud Logging, Cloud Monitoring: logs and metrics.
- Cloud Audit Logs: equivalent of CloudTrail.
- Secret Manager: credentials.
- IAM: identity.
GCP's compliance attestations are similar to AWS's. Both clouds are credible bases for regulated workloads.
Reference 1: Compliance agent on AWS
The shape for an alert-pre-screening or case-narrative-drafting agent: Bedrock for inference, MCP servers on ECS, DynamoDB + S3 for the audit trail.
Key choices to articulate
- Bedrock over direct Anthropic API: easier compliance inheritance, single billing, AWS-native auth.
- ECS over Lambda for MCP servers: longer-lived, state-friendlier, no cold-start hit.
- DynamoDB + S3 for audit: events small/queryable, content immutable/cheap.
- Object Lock on S3: regulatory retention enforcement at infra layer.
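The Object Lock bullet in code form, as a hedged sketch; the bucket name is hypothetical and the bucket must have been created with Object Lock enabled. COMPLIANCE mode means not even the root account can shorten retention.

```python
import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")
s3.put_object(
    Bucket="audit-artifacts",  # hypothetical; Object Lock enabled at creation
    Key="cases/2025/case-123/narrative-v1.json",
    Body=b'{"decision": "escalate"}',
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=5 * 365),
)
```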
Reference 2: Compliance agent on GCP
The GCP version swaps in the parallel services: Vertex AI for Bedrock, Cloud Run for ECS/Fargate, Firestore + GCS retention policies for DynamoDB + S3 Object Lock, Cloud Audit Logs for CloudTrail. Both architectures are reasonable. Pick the cloud the team's already on; don't introduce a second cloud unless required.
Self-hosted vs managed — the tradeoff
| When to self-host | When to use API/managed |
|---|---|
| Strict data-residency requirements that API providers can't meet | The default for general work |
| Latency-critical paths (low-millisecond, very local) | Most workloads tolerate API latency |
| Very high-volume, narrow tasks where a fine-tune beats a foundation model | Foundation models keep getting better |
| You have in-house ML platform expertise | Most teams don't, and shouldn't pretend to |
For this kind of role, the default is managed (Bedrock / Vertex / direct Anthropic). Self-hosting has a high TCO; justify it only when a compliance constraint forces it.
CI / CD for AI systems
Deployment isn't only "running services" — it's also the lifecycle around them.
- Model versioning: pin specific versions (`claude-opus-4-7`, not aliases). Promote between dev/staging/prod via config.
- Prompt versioning: prompts in a registry, semver-tagged, deployed alongside code.
- Eval gating: a PR can't merge unless the offline eval suite passes. CI runs evals against a canonical dataset (sketch after this list).
- Shadow / canary: new prompt or model runs alongside existing in a small % of traffic, compared automatically.
- Rollback: any change must have an immediate rollback path. Config flag, prompt-pinning, model alias.
- Infra-as-code: Terraform / Pulumi / CDK / Cloud Deployment Manager for the deployment itself.
- Secrets rotation: API keys, JWTs, service-account credentials all rotate on schedule. No human reads them.
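A hedged sketch of eval gating in pytest form; `run_eval` and the dataset path are hypothetical stand-ins for your harness, and CI simply fails the build when the assertion fails:

```python
import json
from my_evals import run_eval  # hypothetical eval harness

PASS_RATE_FLOOR = 0.95  # assumption: tune to your suite's noise level

def test_eval_suite_passes():
    with open("evals/canonical_dataset.json") as f:
        cases = json.load(f)
    results = [run_eval(case) for case in cases]
    pass_rate = sum(r.passed for r in results) / len(results)
    assert pass_rate >= PASS_RATE_FLOOR, (
        f"eval pass rate {pass_rate:.1%} below floor {PASS_RATE_FLOOR:.0%}"
    )
```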
Containerization
You should be able to write a basic Dockerfile and a Compose or Kubernetes manifest.
```dockerfile
FROM python:3.12-slim
WORKDIR /app
# Copy the dependency manifest first so the install layer caches across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
```
Have notes ready on:
- Multi-stage builds: smaller production images.
- Distroless / minimal base: smaller attack surface.
- Health checks: liveness and readiness probes.
- Resource limits: CPU and memory caps prevent runaway containers.
- Non-root user: don't run as root inside the container.
Network / security at the deployment layer
- VPC / Private endpoints: keep traffic off the public internet. AWS PrivateLink, GCP Private Service Connect.
- WAF: web application firewall in front of public endpoints.
- mTLS between internal services.
- KMS / Cloud KMS: encryption keys managed centrally (envelope-encryption sketch after this list).
- Customer-managed keys (CMK): required by some compliance regimes for the most sensitive data.
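A hedged sketch of the envelope-encryption pattern with AWS KMS: generate a data key under a centrally managed key, encrypt locally, persist only the encrypted copy of the data key. The key alias is a placeholder.

```python
import boto3

kms = boto3.client("kms")
key = kms.generate_data_key(KeyId="alias/sensitive-docs", KeySpec="AES_256")

plaintext_key = key["Plaintext"]       # use for local AES-GCM, then discard
encrypted_key = key["CiphertextBlob"]  # store next to the ciphertext
# Decrypt later with kms.decrypt(CiphertextBlob=encrypted_key).
```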
Observability stack — the practical shape
For an LLM-heavy system, you need traces that include LLM call details, not just HTTP latencies.
| Layer | Tool |
|---|---|
| Application logs | CloudWatch / Cloud Logging, structured JSON |
| Metrics | CloudWatch / Cloud Monitoring + Prometheus if needed |
| Distributed tracing | OpenTelemetry → AWS X-Ray / Cloud Trace / Datadog |
| LLM-specific tracing | Langfuse / Phoenix / Braintrust |
| Audit log (compliance-grade) | Append-only DynamoDB / BigQuery + S3/GCS WORM |
| Dashboards | CloudWatch Dashboards / Looker / Grafana |
| Alerts | CloudWatch Alarms / Cloud Monitoring + PagerDuty/Opsgenie |
The LLM-specific layer is the new piece many candidates miss. Mentioning OpenTelemetry GenAI semantic conventions or a tool like Langfuse signals current awareness.
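A hedged sketch of what that looks like in code: an OpenTelemetry span around a model call, using GenAI semantic-convention attribute names (current at time of writing). `call_model` is the hypothetical glue function sketched earlier; exporter setup is omitted.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

def traced_model_call(prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        # Attribute names follow the OTel GenAI semantic conventions.
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", "claude-sonnet-4-20250514")
        text = call_model(prompt)  # hypothetical glue function from earlier
        # In a real harness, also record token usage from the API response,
        # e.g. gen_ai.usage.input_tokens / gen_ai.usage.output_tokens.
        return text
```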
Common deployment design questions
- "How would you deploy a multi-tenant MCP server on AWS such that tenants are isolated?"
- "Walk me through the deployment of a RAG-backed agent on GCP."
- "How would you scale your inference glue layer when traffic 10×'s?"
- "What's your rollout strategy for a new prompt that's been A/B-tested in eval?"
- "How would you keep a regulatory document index fresh without re-embedding everything daily?"
- "What does the failure path look like when Bedrock returns a 5xx?"
For each, the answer structure: components → data flow → scale assumptions → failure modes → cost/security tradeoffs. Even if you don't know the exact service name, the structure makes you sound architect-shaped.
Talking-point: "How would you deploy this on AWS?"
"Default to managed services where possible — less ops surface, more compliance attestations to inherit. The agent harness runs on ECS Fargate (or Lambda if it's short-running and stateless). Inference goes through Bedrock for Claude — same model, AWS-native auth, easier compliance story. Tools are exposed via MCP servers running on Fargate, behind an internal ALB, accessed by the harness via JWT-scoped auth. Vector retrieval uses Bedrock Knowledge Bases or OpenSearch with the vector engine, depending on team familiarity. Audit log writes go to DynamoDB for the event records and S3 with Object Lock for content. CloudTrail captures the AWS API plane for free. Everything's IaC via Terraform. CI runs offline evals on PRs and gates merges. Secrets in Secrets Manager, rotated on schedule. Region selection respects data residency — eu-west-1 for EU customer data."
Memorize the structure; substitute GCP services if asked.