Section B · Core DS

Experimentation Foundations

Designing, sizing, running, and analyzing an A/B test — including the failure modes interviewers will probe. Both target JDs call experimentation out by name.

Why this is the load-bearing skill

Both JDs put experimentation at the center. HeyGen lists "design and execute A/B tests" in the responsibilities. Cohere says "design and lead experimentation programs including A/B tests, multi-armed bandits, causal inference studies, that directly map to product and go-to-market decisions." The senior-IC version of this role is judged by the experimentation programs they ran, not the analyses they wrote up.

The expectation in a loop is that you can:

  1. Take a vague product prompt ("we want to launch X — how should we measure success?") and produce a defensible experimental design in 10 minutes.
  2. Compute power and sample size, including the pivot from "this needs 4 weeks at our traffic" → "what's the smallest effect we could actually detect in 2 weeks, and is that worth running?"
  3. Name three failure modes that would invalidate your result, and the guardrails that would catch each.

Anatomy of an experiment

The pieces every interviewer expects you to name, in order, and what each one commits to:

  • Hypothesis: What change you're making, why it might work, and what you'd expect to see.
  • Unit of randomization: User, account, session, server, country. Determines what kind of bias your test is vulnerable to.
  • Primary metric: One. The decision metric. Drive your power calc off this.
  • Secondary metrics: What you'd report alongside, with appropriate skepticism about multiple-comparison adjustments.
  • Guardrails: Metrics you'd halt on if they moved badly (latency, crash rate, revenue, customer support volume).
  • Power & sample size: What effect size you can detect with what α and β at this traffic level.
  • Duration: Driven by sample size and by novelty effect — at least a full week to cover the weekly cycle, usually two.
  • Decision rule: What outcomes lead to ship / kill / iterate, written down before the test runs.

The decision rule trap

If your team can't agree on what outcome ships the feature before the test runs, the test won't settle the question — people will rationalize whatever shipped. Writing the decision rule down up front is one of the highest-leverage process moves a senior DS makes.

Hypothesis and metric

A useful hypothesis has three parts:

  1. Change: "We'll move the upgrade CTA from the settings page to the header."
  2. Why it might work: "Settings is a low-traffic surface; users who'd upgrade aren't seeing the prompt."
  3. Expected effect: "We'd expect a 5–10% lift in weekly upgrade conversion among free users, with no impact on retention."

The "why" line forces specificity. If you can't articulate the mechanism, the test is a fishing expedition.

Metric selection: pick a metric that's sensitive enough to move in the experiment window (revenue per user often isn't, conversion-to-paid usually is) and tied directly to the decision (don't optimize page views when you care about upgrades).

Don't optimize a proxy when the goal is downstream

"Clicks on the upgrade CTA" is a proxy for "users upgraded." If you ship based on clicks, you can move clicks without moving upgrades. Always carry the downstream metric — and call clicks a secondary metric, not the primary.

Power and sample size

The simple version, internalized: for a two-sample test of proportions at α=0.05, β=0.20 (80% power), to detect a relative lift of r on a baseline rate p, you need roughly n ≈ 16 · p(1 − p) / (p · r)² per arm. The function below computes the standard normal-approximation version:

back-of-envelope sample size per arm
import math
from scipy.stats import norm

def sample_size_per_arm(p_baseline: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Two-sample proportion test, two-sided, equal arms."""
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = norm.ppf(power)           # quantile corresponding to the target power
    pbar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Example: 5% baseline, want to detect 10% relative lift
print(sample_size_per_arm(0.05, 0.10))  # → 31,234 per arm

The intuition every senior DS carries:

  • Smaller relative lifts need quadratically more sample. Detecting 1% lift costs 100× the sample of detecting 10% lift.
  • Lower baselines need more sample. Conversion at 1% is much harder to A/B test than conversion at 20%.
  • Variance reduction matters. CUPED (covered in 05-advanced-experimentation) can cut required sample by 30–50% for free.

For continuous metrics

The Cohen's d framing: required n per arm is roughly n ≈ 16 / d² at 80% power, where d = (μ1 − μ2) / σ. For session length in minutes (μ=10, σ=15) to detect a 1-minute lift: d=0.067, n ≈ 3,600 per arm.
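
A minimal sketch of that rule of thumb, using the session-length numbers above (the helper name is mine):

continuous-metric n per arm via Cohen's d
def n_per_arm_continuous(mean_diff: float, sd: float) -> float:
    """n ≈ 16 / d² at 80% power and α=0.05, where d = mean_diff / sd."""
    d = mean_diff / sd
    return 16 / d ** 2

# session length: σ = 15 minutes, detect a 1-minute lift
print(n_per_arm_continuous(1, 15))  # → 3,600 per arm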

The "what's the smallest lift we can call?" move

When traffic is tight, invert the question: "given two weeks of traffic, what's the minimum detectable effect (MDE) at 80% power?" If MDE = 12% and the team thinks the feature is worth 5%, the test won't settle it. Better to know up front than to run it.
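
A sketch of that inversion using the same back-of-envelope rule as above; the function name and the traffic numbers are illustrative, not from any specific platform:

minimum detectable effect from a fixed sample size
import math

def mde_relative(p_baseline: float, n_per_arm: int) -> float:
    """Invert n ≈ 16·p(1−p)/(p·r)² to get the smallest relative lift r
    detectable at 80% power and α=0.05 with n_per_arm users per arm."""
    abs_mde = math.sqrt(16 * p_baseline * (1 - p_baseline) / n_per_arm)
    return abs_mde / p_baseline

# e.g. two weeks of traffic buys 40k users per arm at a 3% baseline
print(mde_relative(0.03, 40_000))  # → ~0.11 relative lift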

The tests

Two-proportion z-test

For conversion-style binary outcomes. Equivalent to chi-squared. Most experimentation platforms use this.

two-proportion z-test
from statsmodels.stats.proportion import proportions_ztest

# 1200/30000 in control, 1380/30000 in treatment
stat, p_value = proportions_ztest(
    count=[1200, 1380],
    nobs=[30000, 30000],
    alternative='two-sided'
)
print(stat, p_value)

Welch's t-test

For continuous metrics with potentially unequal variances. Don't use Student's t unless you've verified equal variance — and you usually haven't.
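
A minimal sketch with scipy (the simulated session-length arrays are illustrative):

Welch's t-test
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
treatment = rng.normal(10.5, 15, size=5_000)  # per-user session length, minutes
control = rng.normal(10.0, 15, size=5_000)

# equal_var=False is what makes this Welch's rather than Student's
stat, p_value = ttest_ind(treatment, control, equal_var=False)
print(stat, p_value)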

Mann–Whitney U (rank-sum)

When your continuous metric is heavily skewed (revenue, session length). Tests whether values from one arm tend to be larger than values from the other, rather than comparing means. Don't trust t-tests on revenue data with whales — Mann–Whitney is more honest about what's actually different.
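
A minimal sketch with scipy (the skewed, revenue-like arrays are simulated):

Mann–Whitney U test
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
control = rng.exponential(20, size=5_000)    # heavy-tailed, revenue-like
treatment = rng.exponential(22, size=5_000)

stat, p_value = mannwhitneyu(treatment, control, alternative='two-sided')
print(stat, p_value)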

Bootstrap confidence intervals

When in doubt, bootstrap. Resample with replacement, compute the metric, repeat 10k times, take percentiles. Works for any metric definition, including ratios.

bootstrap CI for a treatment-vs-control difference
import numpy as np

def bootstrap_diff(treatment, control, n_iter=10000, agg=np.mean):
    """95% percentile CI for agg(treatment) - agg(control); agg can be any metric."""
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        # resample each arm with replacement and recompute the metric on the resample
        t = np.random.choice(treatment, size=len(treatment), replace=True)
        c = np.random.choice(control,   size=len(control),   replace=True)
        diffs[i] = agg(t) - agg(c)
    return np.percentile(diffs, [2.5, 97.5])
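
An illustrative call, continuing from the function above (the simulated revenue arrays are made up):

using bootstrap_diff on skewed per-user revenue
rng = np.random.default_rng(2)
control = rng.lognormal(mean=1.0, sigma=1.5, size=5_000)    # heavy-tailed per-user revenue
treatment = rng.lognormal(mean=1.1, sigma=1.5, size=5_000)

print(bootstrap_diff(treatment, control))  # 95% CI for the difference in mean revenue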

Analysis time

You ran the test. Now what.

  1. Sample Ratio Mismatch check first. Your randomizer was supposed to split 50/50. Did it? Run a chi-squared goodness-of-fit test of the assignment counts against the expected ratio (sketched after this list). If it fails (p < 0.001), stop. SRM is the loudest possible signal that something is wrong with the test infrastructure, and any result downstream is suspect.
  2. Pre-experiment balance check. Are the two arms similar on pre-period metrics? If not, your randomization didn't work cleanly — could be a bot in one bucket, a CDN cache, an SDK bug.
  3. Primary metric. Point estimate, confidence interval, p-value. Report all three.
  4. Guardrails. Did any move adversely? If so, weigh the primary lift against the guardrail decline.
  5. Secondary metrics. Adjust for multiple comparisons (Bonferroni or Benjamini-Hochberg) before claiming any of them moved.
  6. Heterogeneous effects. Slice by platform, country, plan type — but only if you'd already committed to those slices before running, otherwise you're p-hacking.
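
A sketch of the SRM check from step 1, with made-up assignment counts:

sample ratio mismatch check
from scipy.stats import chisquare

control_n, treatment_n = 50_480, 49_320     # observed assignment counts (illustrative)
total = control_n + treatment_n
expected = [total / 2, total / 2]           # intended 50/50 split

stat, p_value = chisquare(f_obs=[control_n, treatment_n], f_exp=expected)
if p_value < 0.001:
    print(f"SRM detected (p={p_value:.2e}): stop and debug the assignment pipeline")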

The pitfalls

The named failures every loop will ask about:

Peeking

Looking at the test repeatedly and stopping when it crosses significance inflates Type I error dramatically. Fix: either pre-commit to a duration and don't peek, or use a sequential test (mSPRT, group sequential, Bayesian) that's valid under continuous monitoring.
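
A quick A/A simulation (arbitrary sizes and seed) showing roughly how much the false-positive rate inflates:

Type I error inflation from peeking
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
n_sims, n_looks, batch = 2_000, 10, 1_000
false_positives = 0

for _ in range(n_sims):
    # A/A test: both arms draw from the same distribution, so the null is true
    t = rng.normal(0, 1, size=n_looks * batch)
    c = rng.normal(0, 1, size=n_looks * batch)
    for look in range(1, n_looks + 1):
        n = look * batch
        z = (t[:n].mean() - c[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > norm.ppf(0.975):   # "significant" at this peek → stop and ship
            false_positives += 1
            break

print(false_positives / n_sims)  # ~0.19 with 10 looks, not the nominal 0.05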

Sample Ratio Mismatch (SRM)

Assignment ratio drifts from expected. Indicates a bug in the randomizer, in event logging, or in your filter on the downstream table. Cited as the single most common reason experiment platforms produce wrong results.

Network effects / SUTVA violations

"Stable Unit Treatment Value Assumption" — one user's outcome shouldn't depend on others' assignment. Violated by: marketplaces (a treatment seller competes with a control seller), social features (treatment user posts content seen by control user), supply-side caps. Fix: randomize at the cluster level (city, marketplace, day).

Novelty / primacy effects

Users react to any change initially, then revert to baseline. A 1-week test with a strong first-day signal might be measuring novelty, not steady-state. Fix: run longer, or fit a curve and report the asymptote.

Selection bias in opt-in

If you can only run the test on users who landed on a specific surface, the population isn't representative. Be explicit about that in the writeup.

p-hacking on secondary slices

"The primary didn't move, but in Tier-2 cities the effect was huge" — only legitimate if you committed to that slice before the test. Otherwise multiple-comparison adjustment is mandatory.

Designing from a fuzzy product prompt

Common interview prompt: "We're considering moving the upgrade CTA to the header. How would you measure success?" The right answer follows a script:

  1. Restate the decision. "We want to know whether moving the CTA increases upgrades, without harming X." Get the interviewer's nod.
  2. Pick the primary metric and defend why. "Weekly conversion to paid, per free user. It's downstream, sensitive on this timescale, ties to revenue." Avoid clicks as primary — they're a proxy.
  3. Pick guardrails. "Crash rate, latency, free-user retention, customer support tickets." Name them; don't wave your hands.
  4. Unit of randomization. "User. Sticky assignment via a hash on user_id." (One way to implement this is sketched after this list.)
  5. Power. "Free-to-paid is ~3% weekly. To detect a 10% relative lift at 80% power, we need roughly 50k users per arm — call it a week or two at our scale."
  6. Decision rule. "If primary lifts ≥5% with p < 0.05 and no guardrail regresses meaningfully, ship. If primary is flat and a guardrail is fine, kill. If the primary is positive but a guardrail moved badly, escalate to product."
  7. Risks. "Novelty effect — we'll commit to two weeks regardless of when the test crosses significance. SRM — we'll check assignment ratios daily."
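
One common way to implement the sticky assignment from step 4 of the script; the hashing scheme here is a generic sketch, not any specific platform's:

sticky assignment via a hash on user_id
import hashlib

def assign(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministic bucketing: the same user always lands in the same arm,
    and salting with the experiment name decorrelates assignments across tests."""
    key = f"{experiment}:{user_id}".encode()
    bucket = int(hashlib.md5(key).hexdigest(), 16) % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

print(assign("user_42", "header_cta_test"))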

Practice this script with three prompts: a CTA move, a price change, an algorithmic recommendation change. The shape repeats.

Interview probes

Probe 1: "What does p < 0.05 mean?"

If the null hypothesis is true (no real effect), there's a less-than-5% chance of seeing a test statistic at least as extreme as what we observed. It's not "5% chance the null is true" — that's a common mis-statement. It's a long-run frequency property of the test under the null.

Probe 2: "Why is peeking bad?"

Repeated significance tests on the same accumulating data inflate the Type I error rate. At 10 looks, your effective false-positive rate is closer to 20% than 5%. Either commit to a fixed sample size and don't peek, or use a sequential test (mSPRT, group sequential boundaries) that's valid under continuous monitoring.

Probe 3: "When would you not run an A/B test?"

Five cases: (1) the change is reversible and the downside is bounded — just ship it; (2) you don't have the traffic for adequate power on the primary metric; (3) the unit you'd need to randomize on is a region/market and you'd contaminate (use a switchback or quasi-experiment instead); (4) the change is required for compliance or contract — there's no "kill" option; (5) the metric you care about is too long-horizon to measure in a test window — use a leading indicator and triangulate.

Probe 4: "Your test ran for two weeks, primary moved +3%, p = 0.04. Ship?"

"Probably, but I'd check four things first. (1) SRM — was the split clean? (2) Pre-experiment balance — were the arms similar on pre-period? (3) Did any guardrail regress meaningfully? (4) Was the +3% concentrated in a sub-segment or distributed? If all clean, I'd ship and continue monitoring the primary in the post-period for novelty fade." Notice: I didn't just say "yes." The reasoning is the answer.

Probe 5: "What's a SUTVA violation? Give an example."

The Stable Unit Treatment Value Assumption: user A's outcome doesn't depend on whether user B is in treatment or control. Violations: (a) marketplace tests where a treatment seller competes with a control seller for the same buyer; (b) social-feed changes where treatment posts are seen by control users; (c) supply-side caps where the treatment arm consumes inventory the control arm was supposed to have. Fix: cluster-randomize at the level that contains the spillover (city, market, day).

Probe 6: "What's CUPED and when do you use it?"

Controlled-experiment Using Pre-Experiment Data — variance reduction via covariate adjustment with a pre-period covariate (typically the same metric measured pre-experiment). Reduces required sample by 30–50% if the pre-period covariate is correlated with the outcome. Standard at scale; covered in 05-advanced-experimentation.
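
A bare-bones sketch of the adjustment (in practice θ is estimated on data pooled across both arms; the simulated numbers here are illustrative):

CUPED adjustment
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """y: in-experiment metric, x_pre: the same metric measured in the pre-period.
    Returns the variance-reduced metric; compare arm means of the output."""
    theta = np.cov(y, x_pre, ddof=1)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(4)
x_pre = rng.gamma(2, 5, size=50_000)             # pre-period engagement
y = 0.8 * x_pre + rng.normal(0, 5, size=50_000)  # correlated in-experiment metric
print(np.var(y), np.var(cuped_adjust(y, x_pre)))  # adjusted variance is much smaller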