Section B · Core DS

Data Pipelines, Applied

The operational glue around modeling — acquisition decisions, labeling workflows, iterative preprocessing, feature pipelines, training-data versioning, and point-in-time correctness.

Why pipelines matter at SentiLink & Archetype

Both JDs put pipeline thinking front and center. SentiLink: "the full model development lifespan: from data acquisition decisions through featurization, focusing labeling resources, model training, experimentation, productionalization, and monitoring." Archetype: "Iterative preprocessing cycles with evaluation and refinement," "Support labeling/annotation workflows end-to-end."

For full-stack roles, the modeler IS the pipeline owner. There's no separate "data engineer who handles ingestion." If you can't keep the data flowing, the model is stuck.

Data acquisition

The decision that most shapes model quality. Three flavors:

Internal data

What your own product generates. The question is usually "do we log the right thing?" If you need a feature the system doesn't emit, you're negotiating with engineering to add the event.

Third-party data

Bureau data, sanctions feeds, open-source datasets, sensor benchmarks. The decisions: cost, coverage, license, refresh cadence, latency to access.

Custom-collected data

Survey design, customer beta programs, sensor deployments. Highest cost, sometimes only path to a needed label.

The SentiLink-flavored thinking

"Should we acquire this dataset?" is a real DS decision at SentiLink, not just a procurement one. The DS asks: what's the lift on existing models? What's the operational cost (latency, license, recurring fee)? What's the failure mode if the source becomes unavailable? You're expected to make that call, not just consume what's bought.

Labeling workflows

"Focus labeling resources" is half of SentiLink's lifecycle. Archetype names it explicitly. The components:

What to label first

Active learning: prioritize examples the current model is uncertain about, examples where the model and a separate ensemble disagree, or examples in regions of feature space with sparse labeled coverage. Random labeling is wasteful past a point.
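
A minimal sketch of uncertainty sampling, assuming a fitted scikit-learn-style binary classifier and an unlabeled pool as a NumPy array (the function name and budget are illustrative):

```python
import numpy as np

def pick_labels_to_buy(model, unlabeled_pool, budget=500):
    """Rank unlabeled examples by model uncertainty; return the top `budget` indices.

    Assumes a binary classifier with predict_proba. Uncertainty is distance
    from 0.5, so the least-confident examples are sent to labelers first.
    """
    proba = model.predict_proba(unlabeled_pool)[:, 1]   # P(positive class)
    uncertainty = np.abs(proba - 0.5)                   # 0 = maximally uncertain
    ranked = np.argsort(uncertainty)                    # most uncertain first
    return ranked[:budget]                              # indices to send for labeling
```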

Inter-rater reliability

Have multiple labelers agree on a sample. If agreement is low (Cohen's kappa < 0.7), the labels are noisy and the task definition needs sharpening. Don't model on labels you don't trust.
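
A quick agreement check using scikit-learn's cohen_kappa_score on a doubly-labeled sample; the 0.7 threshold is the heuristic above and the data here is illustrative:

```python
from sklearn.metrics import cohen_kappa_score

# Two labelers' verdicts on the same examples (illustrative data)
rater_a = [1, 0, 1, 1, 0, 0, 1, 0]
rater_b = [1, 0, 1, 0, 0, 1, 1, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
if kappa < 0.7:
    print(f"kappa={kappa:.2f}: agreement too low; tighten the labeling rubric")
else:
    print(f"kappa={kappa:.2f}: labels look consistent enough to model on")
```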

Tools

  • Encord — Archetype's bonus mention. Used for video / image annotation.
  • Labelbox, Scale, SuperAnnotate — common alternatives.
  • Internal review tools — often the right call for fraud at SentiLink because the labels are inherently expert-judgment-driven and the rubric evolves.

Cost economics

Labeling is often the single most expensive line item in ML projects. A senior DS thinks about budget: if labels cost $5 each and you need 10k, that's $50k. Active learning that gets you to model quality with 3k labels saves $35k — worth doing carefully.
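
The arithmetic from above, kept as a helper so the budget case is explicit (per-label cost and volumes are illustrative):

```python
def labeling_budget(n_labels, cost_per_label=5.0):
    """Total labeling spend at a flat per-label rate."""
    return n_labels * cost_per_label

baseline = labeling_budget(10_000)              # $50,000 with random sampling
with_active_learning = labeling_budget(3_000)   # $15,000 to reach the same model quality
print(f"savings: ${baseline - with_active_learning:,.0f}")   # $35,000
```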

Iterative preprocessing

Archetype's JD: "Iterative data preprocessing cycle execution with baseline comparison." What does that actually mean operationally?

  1. Define the success criterion (eval set + metric) before preprocessing.
  2. Build a baseline preprocessing pipeline. Score on eval.
  3. Make one preprocessing change. Re-score.
  4. Decide: keep the change if it improved aggregate AND didn't regress important segments.
  5. Document what you tried and why; archive intermediate datasets so you can A/B test preprocessing variants.
  6. Repeat.

What changes count as "one"

Resolution change, filter cutoff change, label cleaning rule, n-shot example swap, lens parameter. Don't change two things simultaneously — you won't know which one helped.
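
A sketch of the loop above as code, assuming an eval harness that scores a processed dataset and returns both an aggregate metric and per-segment scores; the variant registry and function names are illustrative:

```python
def run_preprocessing_experiments(raw_data, variants, evaluate):
    """Try one preprocessing change at a time against a fixed eval harness.

    `variants` maps a name to a preprocessing function and must include a
    "baseline" entry; `evaluate` returns (aggregate_score, per_segment_scores)
    on the frozen eval set.
    """
    baseline_score, baseline_segments = evaluate(variants["baseline"](raw_data))
    results = {"baseline": baseline_score}

    for name, preprocess in variants.items():
        if name == "baseline":
            continue
        score, segments = evaluate(preprocess(raw_data))    # exactly one change vs baseline
        regressed = [s for s in segments
                     if segments[s] < baseline_segments.get(s, float("-inf"))]
        keep = score > baseline_score and not regressed     # aggregate up, no segment down
        results[name] = {"score": score, "keep": keep, "regressed_segments": regressed}
    return results
```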

Feature pipelines

The two-system problem

Training computes features one way (Python over a snapshot). Serving computes them another way (a real-time service over live data). If those diverge, the model in production sees different features than it was trained on, and the failure is silent.

Mitigations

  • Single source-of-truth for feature definitions. Feature store, dbt models, or shared Python module — pick one and use it for both training and serving.
  • End-to-end tests on a known input. "Given this raw event, the feature service should produce X." Run on every deploy.
  • Logged features in production. Log the actual features used for each prediction. Compare a sample to features recomputed from raw — diffs are bugs.
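
A sketch of the third mitigation: compare features logged at prediction time against features recomputed from raw events with the shared training-time code. The column naming convention and the compute_features function are assumptions, and numeric features are assumed for the comparison:

```python
import pandas as pd

def feature_parity_report(logged: pd.DataFrame, raw_events: pd.DataFrame,
                          compute_features, key="prediction_id", atol=1e-6):
    """Compare served features (logged) to features recomputed from raw events.

    Any row-level difference beyond `atol` is a training/serving skew bug.
    """
    recomputed = compute_features(raw_events)              # same code path as training
    joined = logged.merge(recomputed, on=key, suffixes=("_served", "_recomputed"))
    feature_cols = [c.removesuffix("_served") for c in joined if c.endswith("_served")]
    diffs = {}
    for col in feature_cols:
        mismatch = (joined[f"{col}_served"] - joined[f"{col}_recomputed"]).abs() > atol
        if mismatch.any():
            diffs[col] = int(mismatch.sum())
    return diffs    # empty dict means parity; anything else is a bug to chase
```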

Training-data versioning

The model you ship is a function of the data you trained on. If the data changes silently, the model's behavior changes silently. Treat training-data snapshots as artifacts:

  • Save a hash and a row count for every dataset used to train a model, stored alongside the model artifact.
  • Save the exact label-cleaning rules and preprocessing functions as code, versioned with the model.
  • For very large datasets, save a stratified sample so the snapshot is reconstructable for debugging without paying the storage cost of the full set.

Tools: DVC, lakeFS, Pachyderm, or just S3 + structured naming with a registry.
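
A minimal sketch of the first two bullets at the "S3 + structured naming with a registry" end of the tooling spectrum; the manifest format and paths are illustrative:

```python
import hashlib
import pandas as pd

def snapshot_manifest(df: pd.DataFrame, preprocessing_version: str) -> dict:
    """Fingerprint a training dataset so the exact inputs can be verified later."""
    content = pd.util.hash_pandas_object(df, index=False).values.tobytes()
    return {
        "row_count": len(df),
        "data_sha256": hashlib.sha256(content).hexdigest(),
        "preprocessing_version": preprocessing_version,   # git tag of the cleaning code
    }

# Written next to the model artifact, e.g. model_v12/training_data_manifest.json
# json.dump(snapshot_manifest(train_df, "preproc-v3.1"), open("manifest.json", "w"))
```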

Point-in-time correctness

The hardest engineering problem in fraud/risk ML. For any prediction made at time T, every feature must be computable from data available strictly before T. Otherwise the model trains on data it won't have at serving.

Common violations

  • Features that include the label or a downstream-of-label signal.
  • Features that include future events ("number of transactions ever for this user" — at training, "ever" includes after-prediction transactions).
  • Features that use a static snapshot of a dimension table that changed over time (a customer's tier today, not their tier when the prediction was made).

Recipe

  1. Every feature query is parameterized by an "as-of" timestamp.
  2. Joins to dimension tables use temporal lookups (the row valid at that timestamp), not the current row.
  3. Validate by re-running historical training data with current pipelines and confirming reproducibility.
  4. Periodically simulate a serving call from raw event timestamps and confirm the features computed match what was actually served.
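
A sketch of steps 1 and 2 using pandas.merge_asof for the temporal dimension lookup: each prediction row joins to the most recent dimension row valid at or before its as-of timestamp, never a later one. Table and column names are illustrative:

```python
import pandas as pd

def point_in_time_features(predictions: pd.DataFrame,
                           customer_tiers: pd.DataFrame) -> pd.DataFrame:
    """Attach the customer tier that was valid at prediction time, not today's tier.

    `predictions` has [customer_id, as_of_ts]; `customer_tiers` has
    [customer_id, valid_from_ts, tier] with one row per tier change.
    """
    preds = predictions.sort_values("as_of_ts")
    tiers = customer_tiers.sort_values("valid_from_ts")
    return pd.merge_asof(
        preds, tiers,
        left_on="as_of_ts", right_on="valid_from_ts",
        by="customer_id",
        direction="backward",   # only tier rows valid at or before as_of_ts
    )
```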

Interview probes

Show probe 1: "How would you decide which labels to acquire next?"

Active learning. (1) Score the unlabeled pool with the current model. (2) Pick examples where the model is uncertain (near 0.5 for binary), or where the model and a separate ensemble disagree, or that fall in feature regions with sparse labeled coverage. (3) Send those to labeling first. The wrong answer is "label more of the same" — you've already learned what those teach.

Show probe 2: "What's point-in-time correctness and why does it matter?"

For any prediction made at time T, features must be computable from data available before T. If your training features include any post-T data (downstream labels, future events, current dimension values), your model trains on information it won't have at serving — and silently fails in production. Mitigations: parameterize feature queries by an as-of timestamp; use temporal joins to slowly-changing dimensions; validate by reproducing serving features from raw timestamps.

Show probe 3: "How do you keep training-time and serving-time features identical?"

Single source of truth for feature logic — feature store, shared Python module, or dbt models used by both pipelines. End-to-end tests that compute features from a known raw input and assert expected output, run on every deploy. Log features in production and compare a sample against features recomputed from raw — diffs are bugs.

Show probe 4: "Walk me through your iterative preprocessing loop."

Build the eval set first. Baseline preprocessing scored on eval. One change at a time — resolution, filter cutoff, n-shot example. Re-score. Keep if aggregate improved without segment regression. Document attempts and their effects. Archive intermediate datasets so I can A/B test preprocessing variants later. Senior signal: defending the discipline of one-change-at-a-time, and naming the segment-regression check.

Show probe 5: "What's worth versioning, what isn't?"

Version: training data snapshots (hash + row count), preprocessing code (in Git, tagged with the model), label-cleaning rules, prompts and lens configs. Don't try to version: every intermediate notebook output, exhaustive raw logs. Versioning is for the things you'd need to reproduce a result that's now in production.