AWS Data Stack
The exact stack SentiLink names in the JD — Python 3, PostgreSQL, AWS (EC2, S3, RDS, Redshift) — decoded. What each piece does, how they connect, and how to discuss them credibly.
Why this matters for SentiLink
SentiLink names its stack in one line of the JD: "Python 3, PostgreSQL, and AWS infrastructure (EC2, S3, RDS, Redshift, etc.)." That tells you a lot:
- EC2 + RDS rather than serverless / managed-platform-everything signals they run their own services — they want operational fluency.
- Redshift for analytics says they have a separate analytics tier from production transactional data. You'll want to know the difference between OLTP and OLAP.
- PostgreSQL is the system of record. Strong Postgres SQL is in scope for interviews.
You don't need AWS certifications. You need to describe each component's purpose, the data flow between them, and the typical operational concerns.
EC2 — compute
Elastic Compute Cloud — virtual servers in AWS. For SentiLink: where the real-time scoring services run. The relevant concepts:
- Instance types: `m` for general purpose, `c` for compute-heavy, `r` for memory-heavy, `g` for GPU. Pick by workload — ML inference is usually compute-bound on CPU instances (c-family) unless using large neural models.
- Auto Scaling Groups: scale instances up/down based on load. Important for real-time inference services that have bursty traffic.
- Spot vs On-Demand: spot is ~70% cheaper but can be interrupted. Use for batch training jobs, not real-time inference.
- Containers: most teams run on ECS or EKS rather than raw EC2, but the underlying compute is still EC2.
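The instance-family and pricing bullets above can be condensed into a small helper. This is an illustrative sketch — the mapping and function are invented for this guide, not an AWS API:

```python
# Illustrative mapping from workload profile to EC2 instance family,
# following the bullets above. Names are assumptions, not an AWS API.
FAMILY_BY_WORKLOAD = {
    "general": "m",          # general purpose
    "cpu_inference": "c",    # compute-heavy: CPU-bound ML inference
    "memory_heavy": "r",     # memory-heavy: large in-memory feature sets
    "gpu_training": "g",     # GPU: large neural models
}

def pick_instance(workload: str, interruptible: bool) -> tuple:
    """Return (family, pricing). Spot is ~70% cheaper but can be
    interrupted, so it suits batch training, not real-time scoring."""
    family = FAMILY_BY_WORKLOAD[workload]
    pricing = "spot" if interruptible else "on-demand"
    return family, pricing
```

The point of the sketch is the decision shape: workload drives the family, interruptibility tolerance drives the pricing model.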
S3 — object storage
Simple Storage Service. Cheap, durable, infinite-scale blob storage. In a fraud stack:
- Training datasets in Parquet (much smaller than CSV, columnar reads).
- Model artifacts: serialized models, versioned by path or by S3 versioning.
- Audit logs: a copy of every prediction with features and timestamps, partitioned by date.
- Backups of RDS / Redshift.
Operational details that show competence:
- S3 partitioning: structure paths as `s3://bucket/dataset/date=2026-03-15/` so Athena/Spark can prune partitions.
- Storage classes: hot data in Standard; archived training data in Glacier (much cheaper, slow retrieval).
- Encryption: SSE-KMS at rest, TLS in transit — table stakes for fintech.
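A minimal key builder puts the partitioning convention into code. Bucket, dataset, and file naming here are made up for illustration:

```python
from datetime import date

def audit_log_key(bucket: str, dataset: str, d: date, part: int) -> str:
    """Build a Hive-style partitioned S3 key (date=YYYY-MM-DD) so
    engines like Athena, Spark, or Redshift Spectrum can prune
    partitions on a date filter."""
    return f"s3://{bucket}/{dataset}/date={d:%Y-%m-%d}/part-{part:05d}.parquet"
```

Every writer in the pipeline using the same key convention is what makes partition pruning reliable.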
RDS & PostgreSQL
Relational Database Service — managed Postgres (or MySQL / Aurora). For SentiLink: the system of record for applications, decisions, and historical predictions.
Postgres essentials for DS
- Window functions: identical to those in Guide A · SQL. Postgres has no `DATE_DIFF`; subtracting two timestamps with `(a - b)` returns an interval.
- JSON operations: `->`, `->>`, `jsonb_path_query`. SentiLink stores semi-structured application data; navigating JSONB columns is in scope.
- Indexes: B-tree for range scans, GIN for full-text and JSONB, BRIN for time-sorted append-only tables. Know which to suggest.
- EXPLAIN ANALYZE: the diagnostic tool when a query is slow. Read the plan from leaves up; spot the wrong join order or missing index.
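The JSONB operators and interval arithmetic above can be exercised in one query. Table and column names here are invented, not SentiLink's schema; this is the kind of SQL a Python service might hand to psycopg2:

```python
# Illustrative Postgres SQL exercising the features above; the
# applications table and its columns are assumptions for illustration.
JSONB_QUERY = """
SELECT
    application_id,
    payload ->> 'channel'          AS channel,     -- ->> returns text
    payload -> 'device' ->> 'ip'   AS device_ip,   -- -> returns jsonb
    applied_at - first_seen_at     AS account_age  -- timestamp subtraction yields an interval
FROM applications
WHERE payload @> '{"channel": "online"}';          -- containment, servable by a GIN index
"""
```

The `@>` containment predicate is the one a GIN index on `payload` can accelerate; `->>` filters generally cannot use it directly.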
Postgres gotchas worth naming
- Long-running transactions block VACUUM and accumulate bloat. In an analytics-heavy fraud stack, this matters.
- Sequential scans on large tables are rarely good. Index appropriately or shift the workload to a columnar store (Redshift).
- Replica lag: RDS read replicas use asynchronous replication, so analytics queries on replicas can be slightly stale.
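The VACUUM gotcha is observable in production. A query over the standard `pg_stat_activity` view surfaces long-running transactions, shown here as a string constant a monitoring script might run:

```python
# Standard pg_stat_activity columns; finds transactions old enough to
# block VACUUM from reclaiming dead tuples.
LONG_TXN_QUERY = """
SELECT pid, now() - xact_start AS txn_age, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 10;
"""
```

Anything with a multi-hour `txn_age` on a busy table is a bloat suspect.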
Redshift — analytics
AWS's columnar data warehouse. Separate from RDS (which is OLTP / Postgres). For SentiLink: where bulk historical analysis runs.
What Redshift is good at
- Scanning columns over millions to billions of rows.
- Aggregations with GROUP BY at scale.
- Read-heavy analytical workloads.
What it's not for
- Real-time row-level lookups (use RDS or a key-value store).
- High-concurrency OLTP (a Redshift cluster handles tens to low-hundreds of concurrent queries, not thousands).
- Tiny tables (overhead per query is high; under ~100k rows just use Postgres).
Concepts worth naming
- Sort keys: queries that filter on the sort key skip irrelevant blocks. Pick the most common filter column.
- Distribution styles: how rows are distributed across nodes. `KEY` for joins, `EVEN` for general use, `ALL` for small dimension tables.
- RA3 vs DC2: RA3 separates compute and storage, DC2 doesn't. Modern Redshift is RA3.
- Spectrum: query S3 data directly from Redshift via external tables. Useful for cold historical data.
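Sort and distribution keys show up concretely in table DDL. The table below is invented for illustration; the pattern is what matters:

```python
# Illustrative Redshift DDL applying the concepts above: DISTKEY on
# the common join column, SORTKEY on the common filter column.
# Table and columns are assumptions, not a real schema.
PREDICTIONS_DDL = """
CREATE TABLE predictions (
    application_id BIGINT,
    customer_id    BIGINT,
    scored_at      TIMESTAMP,
    score          DOUBLE PRECISION
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locates rows for joins on customer_id
SORTKEY (scored_at);    -- date-filtered scans skip irrelevant blocks
"""
```

Picking `scored_at` as the sort key assumes most analyst queries filter by time window, which is typical for fraud retrospectives.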
The shape of a fraud-detection stack
Walking through what likely happens at SentiLink (inferred, not from internal sources):
- An application arrives at a customer's API; the customer calls SentiLink's API.
- The request hits an EC2-hosted scoring service, which reads the application data.
- Feature lookup: a mix of (a) precomputed features from a feature store / cache (sub-ms), (b) live SQL queries to RDS for app data (~10ms), (c) third-party API calls for bureau data (~100ms).
- Model scores the feature vector. Sub-10ms inference.
- Decision returned to the customer with a score and supporting signals.
- Prediction logged to S3 (audit) + RDS (operational).
- Periodically, training data is built from RDS + S3 audit logs into Redshift; analysts and modelers query Redshift; new model versions are trained and deployed back to the EC2 scoring service.
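The per-step latencies in the flow above add up quickly. A schematic tally (rough figures from the steps, not measurements) shows why the slow bureau call gets cached:

```python
# Rough per-step latencies from the walkthrough above, in milliseconds
# (schematic figures, not measured ones).
PIPELINE_MS = [
    ("feature_store_lookup", 1),    # precomputed features, sub-ms cache
    ("rds_app_query",        10),   # live SQL to Postgres
    ("bureau_api_call",      100),  # third-party bureau data, uncached
    ("model_inference",      10),   # in-memory model scoring
]

def total_latency_ms(steps):
    """Sum the step latencies; an uncached bureau call alone blows
    a 100ms p99 budget, which is why it is cached aggressively."""
    return sum(ms for _, ms in steps)
```

On a bureau cache hit the third step drops to roughly cache-lookup latency, and the total comfortably fits the budget.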
Cost & ops awareness
The senior signal is awareness that AWS infrastructure costs money and has tradeoffs:
- S3 storage costs differ 10× across tiers. Don't keep cold training data in Standard.
- NAT Gateway data transfer is expensive. VPC endpoints for S3 access save real money.
- Cross-AZ traffic costs money. Inference services calling cross-AZ DBs add up.
- Redshift concurrency scaling is paid extra. Tune queries before throwing concurrency at the problem.
- RDS multi-AZ doubles your DB cost. Worth it for production; not for dev.
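The 10× storage-tier spread is easy to make concrete. The prices below are illustrative placeholders, not current AWS rates; check the pricing page before quoting numbers:

```python
# Illustrative per-GB-month prices showing the order-of-magnitude
# spread across S3 tiers. NOT current AWS rates.
PRICE_PER_GB_MONTH = {
    "standard": 0.023,
    "infrequent_access": 0.0125,
    "glacier_flexible": 0.0036,
}

def monthly_cost_usd(gb: float, tier: str) -> float:
    """Monthly storage cost for a given size and tier."""
    return gb * PRICE_PER_GB_MONTH[tier]
```

At these placeholder rates, a terabyte of cold training data left in Standard costs several times what a Glacier tier would.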
Interview probes
Probe 1: "When would you use Redshift vs Postgres?"
Postgres for OLTP — row-level reads/writes, real-time application data, high concurrency. Redshift for OLAP — large scans, aggregations, historical analytics. The dividing line is roughly "do queries touch many rows" (Redshift) or "do queries touch specific rows" (Postgres). Don't run interactive dashboards directly on production Postgres at scale; replicate analytics tables to Redshift.
Probe 2: "Why partition S3 data by date?"
Query engines that read S3 (Athena, Spark, Redshift Spectrum) can prune partitions — they skip reading files that don't match a date filter. Without partitioning, every query reads the entire dataset. With date=YYYY-MM-DD partitioning, a query for last week reads ~7 days of files instead of years.
Probe 3: "How do you decide on indexes in Postgres for a fraud table?"
Index the columns you filter on (application_id, customer_id, applied_at), and the columns you join on. For range scans over time, B-tree on the timestamp. For JSONB application data you query into, GIN with the right operator class. Don't over-index — every write must update every index, which slows insertion. Profile with EXPLAIN ANALYZE; add indexes only where the plan shows sequential scans hurting you.
Probe 4: "Sketch the architecture for sub-100ms fraud scoring."
EC2-hosted scoring service (or ECS/EKS), behind a load balancer. Model loaded once at startup, kept in memory. Features: precomputed entity-velocity features in a fast key-value store (Redis or DynamoDB) — sub-ms lookup. Recent application data in RDS Postgres — ~10ms with proper indexes. Third-party data (bureaus) cached aggressively because the API is slow. Async logging to S3 and RDS — never block the response on logging. p99 budget of 100ms forces feature precomputation; you can't compute cross-entity aggregates synchronously per request.
Probe 5: "What's the difference between RDS and Aurora?"
RDS is managed standard database engines (Postgres, MySQL, etc.) on EBS-backed instances. Aurora is AWS's cloud-native rewrite with better throughput, faster failover, separation of compute and storage, and read replicas that serve reads from the shared storage layer rather than replaying changes into a per-replica copy. More expensive per hour but more elastic. For fraud at scale, Aurora is usually the right call; for small teams or non-critical paths, RDS is enough.