ML Fundamentals
The supervised-ML knowledge a full-stack DS needs to command cold — model selection by problem shape, evaluation under skew, calibration, and the staff-bar instincts around tradeoffs.
The staff bar
Full-stack DS interviews don't test whether you can train an XGBoost — assume you can. They test:
- Can you pick the right model in under 30 seconds for a given problem shape, and articulate why?
- Do you reason about calibration and threshold tuning, not just AUC?
- Do you know when a simpler model is the right answer despite a small AUC loss?
- Can you spot leakage before it bites?
- Do you talk about evaluation under realistic class imbalance, not balanced toy datasets?
SentiLink puts it directly: "Strong practical ML / Stats knowledge, i.e. can easily employ the suite of standard ML / stats tools to quickly scope out solutions, and double down where needed." The bar is fluent application, not novel research.
Choosing a model
The decision factors, in order of impact:
- Data shape: tabular, text, image, sequence, mixed.
- Latency budget: sub-100ms, sub-second, batch.
- Interpretability requirements: regulators reading the model? Customer-facing explanations?
- Label availability: clean, noisy, delayed, partial.
- Sample size: hundreds, thousands, millions.
- Calibration vs ranking: do downstream consumers use the probability, or just the order?
Tabular default
For tabular fraud, propensity, and similar problems, gradient boosting (LightGBM, XGBoost, or CatBoost) is the default. It handles missing values, mixed types, non-linear interactions, and skewed distributions. Logistic regression remains useful when:
- You need a model regulators (or your own analytics team) can stare at coefficient-by-coefficient.
- The relationship is genuinely linear / additive after good features.
- Sample size is small (gradient boosting overfits with a few hundred rows; logistic regression with L2 is more robust).
Text default
For modern text classification, sentence embeddings + a linear/boosted classifier on top is usually within a couple of points of fine-tuning, much faster, and easier to ship. Fine-tune (or use LLM-based zero/few-shot) only when the embedding-based baseline doesn't clear the bar.
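A minimal sketch of that baseline, assuming the sentence-transformers library and hypothetical train_texts / train_labels; the embedding model name is an example choice, not prescribed here:

from sentence_transformers import SentenceTransformer  # assumed available
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer('all-MiniLM-L6-v2')  # example embedding model
X_emb = encoder.encode(train_texts)                # train_texts: list of strings (assumed)
clf = LogisticRegression(max_iter=1000).fit(X_emb, train_labels)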
Image / sensor default
Pretrained backbone (CLIP-style, ResNet, EfficientNet, or modality-specific) + light head fine-tuning. Don't train CNNs from scratch unless you have a labeled dataset measured in the millions.
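A sketch of "light head fine-tuning" with torchvision, assuming a hypothetical num_classes; the backbone choice is illustrative:

import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # pretrained
for p in backbone.parameters():
    p.requires_grad = False                        # freeze the backbone
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # only this head trains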
Tabular models in depth
Logistic regression
The "interpretable baseline." Coefficients map to log-odds. Pair with L2 (Ridge) or L1 (Lasso) regularization. Coefficients are interpretable only after standardizing features and respecting collinearity.
Gradient boosting (LightGBM is the modern default)
from lightgbm import LGBMClassifier
from sklearn.model_selection import TimeSeriesSplit

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1, num_leaves=63,      # -1: depth unbounded, capacity set by num_leaves
    min_child_samples=50,             # regularization: minimum samples per leaf
    subsample=0.8, subsample_freq=1,  # row bagging; LightGBM ignores subsample unless subsample_freq > 0
    colsample_bytree=0.8,             # column bagging
    reg_alpha=0.1, reg_lambda=0.1,    # L1/L2 on leaf weights
    objective='binary',
    class_weight=None,                # set this for imbalanced — see ch. 5
    random_state=42,
)
cv = TimeSeriesSplit(n_splits=5)
# train, eval, calibrate — see later sections
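A usage sketch under those splits, with early stopping on each fold's validation slice (assumes X, y are time-ordered numpy arrays and a recent LightGBM with the callbacks API):

import lightgbm as lgb

for train_idx, val_idx in cv.split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    model.fit(X_tr, y_tr,
              eval_set=[(X_val, y_val)],
              callbacks=[lgb.early_stopping(stopping_rounds=50)])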
Key hyperparameters to know:
- n_estimators + learning_rate: the trade. More trees + lower LR = better fit, slower training.
- num_leaves / max_depth: tree capacity. Larger = more overfit risk.
- min_child_samples: minimum samples per leaf. Larger = more regularization.
- subsample / colsample_bytree: row/column bagging. Reduces overfit.
- reg_alpha / reg_lambda: L1/L2 on leaf weights.
Random forest
Mostly displaced by gradient boosting in practice, but still relevant when you want lower variance with less hyperparameter tuning, or when you want OOB estimates without a separate validation set.
Neural networks for tabular
Don't reach here first. Modern entries (TabNet, NODE, TabPFN) are competitive in narrow regimes but require more care than gradient boosting and rarely beat it by enough to justify the operational overhead. Mention them if asked, but defaulting to gradient boosting is a sane staff answer.
Evaluation
For classification under imbalance
Accuracy is misleading. The right metric depends on the decision the model informs:
- Precision-recall curve / AUC-PR — when positives are rare, this captures behavior in the actionable region. Always show PR curves alongside ROC for fraud-style problems.
- Lift at top K% — most decisions are "act on the top K% scored." Report lift (and recall) at top K directly: "at top 1%, recall is 60% — if we act on the 1% scored highest, we catch 60% of fraud." See the sketch after this list.
- Recall at fixed precision (or vice versa) — pin the constraint the business cares about ("we tolerate a 5% false-positive rate") and report the other number.
- Cost-weighted metric — when you can quantify the cost of FP and FN, weighted accuracy or expected loss is the most honest scoreboard.
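A minimal sketch of recall and lift at top K% (topk_metrics is a hypothetical helper; y_true and scores are assumed numpy arrays):

import numpy as np

def topk_metrics(y_true, scores, k_frac=0.01):
    # act on the top k_frac highest-scored rows
    n_act = max(1, int(len(scores) * k_frac))
    top_idx = np.argsort(scores)[::-1][:n_act]
    tp = y_true[top_idx].sum()
    recall = tp / y_true.sum()           # share of all positives caught
    lift = (tp / n_act) / y_true.mean()  # precision in the acted-on slice vs base rate
    return recall, lift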
For regression
- RMSE penalizes large errors — use when those are catastrophic.
- MAE is more robust to outliers — use when occasional large errors are tolerable.
- MAPE for "percentage off" intuition, but it blows up when actuals are near zero.
- Quantile loss for forecasts where the interval matters more than the point estimate.
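Quantile (pinball) loss, as a sketch — the asymmetric penalty that makes a model target the tau-th quantile rather than the mean:

import numpy as np

def pinball_loss(y_true, y_pred, tau=0.9):
    # under-predictions cost tau per unit, over-predictions (1 - tau)
    diff = y_true - y_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))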
For ranking
NDCG, MAP, MRR — depending on whether the top-k ranking matters vs the position of the first relevant item. Most fraud and propensity problems are technically ranking problems; reporting NDCG@K alongside lift is a common staff move.
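NDCG@K via scikit-learn, as a sketch (y_true and scores assumed; ndcg_score expects one row per query, so a single ranking is wrapped in a 2D array):

import numpy as np
from sklearn.metrics import ndcg_score

# binary labels double as graded relevance; k limits the metric to the top 100 ranks
print(ndcg_score(np.asarray([y_true]), np.asarray([scores]), k=100))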
Calibration
Calibration is whether your model's predicted 0.3 actually corresponds to a 30% positive rate. Gradient boosting models are often miscalibrated — the raw scores often don't track true frequencies, and class reweighting for imbalance shifts them further. Fix with:
Platt scaling
Fit a logistic (sigmoid) function on the model's scores. Cheap, but it can only correct sigmoid-shaped miscalibration — the mapping is monotonic and parametric.
Isotonic regression
Fit a piecewise-constant monotonic function on the scores. More flexible than Platt scaling, but it needs more held-out data or it overfits the calibration set. Industry default.
from sklearn.calibration import CalibratedClassifierCV
from lightgbm import LGBMClassifier

base = LGBMClassifier(n_estimators=300)
# each fold fits the base model, then fits an isotonic calibrator
# on that fold's held-out predictions
calibrated = CalibratedClassifierCV(base, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]
Diagnostic: a reliability diagram. Bucket predictions into deciles, plot mean predicted vs actual positive rate. Should fall on the diagonal.
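A sketch of that diagnostic with scikit-learn and matplotlib (the 'quantile' strategy gives the decile buckets described above):

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10, strategy='quantile')
plt.plot(mean_pred, frac_pos, marker='o')  # the model
plt.plot([0, 1], [0, 1], linestyle='--')   # perfect calibration
plt.xlabel('Mean predicted probability')
plt.ylabel('Observed positive rate')
plt.show()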
When does calibration matter? Any time downstream consumers use the probability for math — combining with a cost, thresholding for action, comparing across models. If consumers just use rank, calibration is a nice-to-have. If they use the probability, it's table stakes.
Regularization & overfitting
Three layers of defense:
- Architectural: pick a model that can't overfit easily (logistic regression, shallow trees).
- Regularization: L1/L2 penalties, early stopping, dropout (NN), min_child_samples (boosting).
- Data: more training data, augmentation, removing leaky features.
Spotting overfit
- Train metric >> held-out metric.
- Held-out performance gets worse as training continues (use early stopping).
- Performance varies wildly across CV folds.
- Performance is great in development but worse in production — usually means temporal leakage or distribution drift.
Model selection & cross-validation
Use time-based splits for temporal data
For fraud, finance, or any problem where today's data depends on yesterday's, random k-fold CV leaks information from the future into the past. Use TimeSeriesSplit or roll-forward.
Group-based splits when units repeat
If the same user appears in many rows, ensure all of a user's rows land in the same fold. Otherwise the model memorizes users, not patterns. GroupKFold is the standard tool.
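A sketch, assuming a user_ids array aligned with the rows of X:

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=user_ids):
    pass  # every row for a given user lands on exactly one side of the split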
Stratify when classes are imbalanced
So each fold has the same positive rate; StratifiedKFold is the standard tool.
Nested CV when hyperparameter tuning matters
Inner CV picks hyperparameters, outer CV estimates true generalization. The "default" CV-then-eval pipeline over-estimates performance because the held-out set leaks into hyperparameter choice. Nested CV is the staff bar when the model selection itself is the question.
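A sketch of nested CV with time-based splits at both levels (the parameter grid is a hypothetical example):

from sklearn.model_selection import GridSearchCV, cross_val_score, TimeSeriesSplit
from lightgbm import LGBMClassifier

# inner CV picks hyperparameters; outer CV scores the tuned model
inner = GridSearchCV(
    LGBMClassifier(),
    {'num_leaves': [31, 63], 'learning_rate': [0.05, 0.1]},
    cv=TimeSeriesSplit(n_splits=3), scoring='average_precision',
)
outer_scores = cross_val_score(inner, X, y,
                               cv=TimeSeriesSplit(n_splits=5),
                               scoring='average_precision')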
Interview probes
Probe 1: "When would you pick logistic regression over gradient boosting?"
Four cases. (1) The model has to be interpretable coefficient-by-coefficient for regulators or stakeholders. (2) Sample size is small — boosting overfits with hundreds of rows; LR with L2 doesn't. (3) The relationship is genuinely linear and additive after good features. (4) Real-time latency is tight enough that a tree-ensemble inference is too slow (rare, but happens). Otherwise default to gradient boosting on tabular.
Probe 2: "Why calibrate? Isn't AUC enough?"
AUC measures ranking only. It tells you whether positives score higher than negatives on average. It tells you nothing about whether 0.3 means 30% chance. If downstream consumers use the probability for math — combining with cost, applying a threshold, comparing models — uncalibrated scores produce silently wrong decisions. Isotonic on a holdout set is the standard fix.
Probe 3: "Random k-fold CV on a fraud dataset — what could go wrong?"
Two failure modes. (1) Temporal leakage — fraud patterns evolve; random k-fold leaks future patterns into the training fold and overstates performance. Fix: TimeSeriesSplit. (2) Identity leakage — if the same identity appears in many rows, random k-fold lets the model memorize identities rather than learn fraud patterns. Fix: GroupKFold on identity. Combine both — group-aware time-based splits — for fraud-style data.
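One simple variant of a group-aware time-based split, as a sketch (df, time_col, group_col, and cutoff are hypothetical):

def group_aware_time_split(df, time_col, group_col, cutoff):
    # split on time, then drop validation rows whose identity was seen in training
    train = df[df[time_col] < cutoff]
    val = df[df[time_col] >= cutoff]
    val = val[~val[group_col].isin(train[group_col])]
    return train, val

Dropping seen identities keeps the evaluation honest about generalizing to new identities, at the cost of a smaller validation set.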
Probe 4: "How do you pick a threshold for a binary classifier?"
It depends on the cost asymmetry. If you can quantify the cost of false positives and false negatives, compute expected cost across thresholds and pick the minimum. If you can't quantify costs but the business has a constraint ('we'll review at most 1% of applications'), use the threshold that matches the operational capacity. If neither, report PR curve and let the consumer pick. The wrong answer is 0.5 — it has no operational meaning.
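A sketch of the expected-cost scan (cost_fp and cost_fn are hypothetical per-error costs; y_true and probs are assumed numpy arrays):

import numpy as np

def best_threshold(y_true, probs, cost_fp=1.0, cost_fn=10.0):
    thresholds = np.linspace(0.01, 0.99, 99)
    # total cost at each threshold: false positives plus missed positives
    costs = [cost_fp * ((probs >= t) & (y_true == 0)).sum() +
             cost_fn * ((probs < t) & (y_true == 1)).sum()
             for t in thresholds]
    return thresholds[int(np.argmin(costs))]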
Probe 5: "What's a sign you're overfitting?"
Train metric meaningfully better than held-out, held-out performance worsening as training continues, large variance across folds, or — most diagnostic — production performance noticeably below validation. The last one is usually temporal leakage or distribution drift, not overfit in the classical sense, and the fix is different.