Section B · Core DS

Causal Inference

When you can't randomize — the five quasi-experimental methods behind the causal-inference work Cohere's JD calls out by name, plus the framework for picking the right one. Critical for senior DS loops; useful for any role that touches GTM analytics.

Why this chapter exists

The Cohere JD says, verbatim: "design and lead experimentation programs including A/B tests, multi-armed bandits, causal inference studies." Calling causal inference out by name is a strong signal — there will be a round, or part of a round, where you're handed a "you can't randomize, what would you do" question.

The right preparation isn't memorizing Pearl's do-calculus. It's being able to:

  1. Name the standard quasi-experimental methods and what assumptions each one rests on.
  2. Pick one for a given business question and defend the choice.
  3. State the threats to inference (confounders, selection, attrition) honestly.

The potential outcomes frame

Every causal question is, at its core: "what would have happened to this unit under the other treatment?" That counterfactual is unobservable for any given unit — you only see one of the two potential outcomes. Causal inference is the discipline of estimating the difference between them on average, from data that never shows you both.

The fundamental quantities:

  • ATE (Average Treatment Effect): E[Y(1) − Y(0)] over the whole population.
  • ATT (Treatment on the Treated): E[Y(1) − Y(0) | treated] — what was the effect on those who actually got treated?
  • LATE (Local Average Treatment Effect): the effect on "compliers" — those whose treatment status was affected by the instrument.

Picking the right estimand matters. A marketing analysis usually wants ATT ("did our campaign actually help the people who saw it?"), not ATE.
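
To make the estimand distinction concrete, here is a hypothetical simulation: it generates both potential outcomes per unit, which is exactly what real data never gives you. The effect is built to be larger for units more likely to be treated, so ATT exceeds ATE.

ATE vs ATT on simulated potential outcomes
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(size=n)                           # a confounder
y0 = x + rng.normal(size=n)                      # potential outcome without treatment
y1 = y0 + 1.0 + 0.5 * x                          # potential outcome with treatment
treated = rng.random(n) < 1 / (1 + np.exp(-x))   # higher x -> more likely treated

ate = (y1 - y0).mean()            # effect averaged over everyone
att = (y1 - y0)[treated].mean()   # effect averaged over the treated only
print(f'ATE = {ate:.2f}, ATT = {att:.2f}')  # ATT > ATE: treated units skew to high x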

Difference-in-differences (DiD)

The workhorse. Use it when treatment is applied to one group and not another, and you have pre-period data on both.

The setup

Two groups (treated, control), two periods (pre, post). The DiD estimator is the difference between (treated post − treated pre) and (control post − control pre). It nets out time trends and group-level differences simultaneously — assuming the two would have moved in parallel absent the treatment.

DiD with two-way fixed effects
import pandas as pd
import statsmodels.formula.api as smf

# df: columns [unit_id, period, treated, outcome]
# 'treated' = 1 for treated units in post-period, 0 otherwise
model = smf.ols(
    formula='outcome ~ treated + C(unit_id) + C(period)',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['unit_id']})
print(model.summary())

The critical assumption

Parallel trends: the treated and control groups would have followed the same trajectory absent the treatment. You can't fully prove this, but you can support it by:

  • Plotting pre-period trends — they should look parallel.
  • "Placebo" tests: pretend a pre-period event was the treatment, run DiD, expect no effect.
  • Event-study plots: estimate effects period-by-period and check for pre-treatment effects, which would invalidate parallel trends (sketched below).
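
A sketch of the event-study version, reusing the two-way fixed-effects machinery from the DiD block above. The rel_period column is an assumption: periods relative to treatment for treated units, coded at the omitted reference of -1 for never-treated units.

Event-study regression
import statsmodels.formula.api as smf

# df: columns [unit_id, period, outcome, rel_period]
# rel_period (assumed): period minus first treated period for treated units,
# set to -1 (the omitted reference level) for never-treated units
es = smf.ols(
    formula='outcome ~ C(rel_period, Treatment(reference=-1)) '
            '+ C(unit_id) + C(period)',
    data=df
).fit(cov_type='cluster', cov_kwds={'groups': df['unit_id']})

# Coefficients on negative rel_period values are the pre-trend check:
# they should be statistically indistinguishable from zero
print(es.params.filter(like='rel_period'))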

When DiD fits

  • Geographic rollouts (feature launched in country X first).
  • Policy changes (a regulation applied to one product line).
  • Natural experiments (a competitor exited a market).

Propensity scoring

When treatment is non-random and you have observable confounders, propensity scoring approximates randomization by matching or weighting on the probability of treatment given confounders.

The flow

  1. Fit a model (typically logistic regression or gradient boosting) predicting treatment from observable confounders.
  2. Get propensity scores — predicted P(treated | X) — for every unit.
  3. Use the scores via matching, stratification, or inverse-probability-of-treatment weighting (IPTW).
  4. Check overlap (the distributions of propensity scores for treated and untreated should substantially overlap — if they don't, you have units with no comparable counterfactual).
  5. Check balance (after weighting/matching, the covariates should be balanced between groups; a check is sketched after the IPTW block below).

IPTW for ATE
from sklearn.linear_model import LogisticRegression
import numpy as np

# X: covariates, T: treatment indicator (0/1), Y: outcome
ps_model = LogisticRegression(max_iter=1000).fit(X, T)
ps = ps_model.predict_proba(X)[:, 1]

# Stabilized IPTW (in practice, clip ps away from 0 and 1 to tame extreme weights)
p_treat = T.mean()
weights = np.where(T == 1, p_treat / ps, (1 - p_treat) / (1 - ps))

# Weighted mean difference = IPTW ATE estimate
ate = np.average(Y[T == 1], weights=weights[T == 1]) - \
      np.average(Y[T == 0], weights=weights[T == 0])
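
Steps 4 and 5 of the flow deserve code too. A sketch of the balance check via standardized mean differences, assuming X is a pandas DataFrame and reusing T and weights from the block above; a common rule of thumb flags any covariate with |SMD| above 0.1.

Covariate balance via standardized mean differences
import numpy as np
import pandas as pd

def smd(X, T, w=None):
    """Standardized mean difference per covariate, optionally weighted."""
    w = np.ones(len(T)) if w is None else np.asarray(w)
    t, c = (T == 1), (T == 0)
    mean_t = np.average(X[t], axis=0, weights=w[t])
    mean_c = np.average(X[c], axis=0, weights=w[c])
    pooled_sd = np.sqrt((X[t].var(axis=0) + X[c].var(axis=0)) / 2)
    return (mean_t - mean_c) / pooled_sd

balance = pd.DataFrame({'raw': smd(X, T), 'weighted': smd(X, T, weights)})
print(balance)  # |SMD| > 0.1 after weighting suggests the propensity model missed something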

The critical assumption

No unobserved confounding (a.k.a. selection on observables, ignorability). Anything affecting both treatment and outcome must be in your propensity model. If there's a confounder you can't measure, no amount of propensity wizardry recovers the causal effect.

Propensity is not a get-out-of-confounding-free card

Propensity scoring shifts the assumption from "you randomized" to "you measured every relevant confounder." The latter is rarely true. Always discuss the confounders you're worried you missed.

Instrumental variables (IV)

When you have unobserved confounding and you can find an "instrument" — a variable that affects treatment but not the outcome except through treatment — you can recover a causal effect (specifically, the LATE on compliers).

The three conditions

  1. Relevance: the instrument actually affects treatment uptake. Check the first-stage F-statistic; the usual rule of thumb is F ≥ 10.
  2. Exclusion: the instrument only affects the outcome through treatment. Untestable; defended on substantive grounds.
  3. Monotonicity: the instrument moves treatment in the same direction for everyone (no "defiers").

Canonical examples

  • "Distance to college" as an instrument for "years of education" — affects who goes to college, only affects earnings through education.
  • Random encouragement designs — emailing a randomly chosen subset to nudge them into a feature. The email is randomized, the feature use isn't. The email is the instrument.
  • Lottery-based access to limited services.

Two-stage least squares (2SLS)

2SLS via linearmodels
from linearmodels.iv import IV2SLS
# Y: outcome, T: treatment, Z: instrument, X: exogenous controls
# (X should include a constant column; linearmodels does not add one)
res = IV2SLS(Y, X, T, Z).fit(cov_type='robust')
print(res.summary)

Weak instruments

If the first-stage F is below 10, the IV estimates are biased toward the OLS estimate and have understated standard errors. Report the F. If it's weak, don't pretend it isn't.
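
A sketch of that first-stage check with statsmodels, reusing T, Z, and X from the 2SLS block (treated here as numeric arrays); the relevant F tests whether the instrument's coefficient is zero.

First-stage F-statistic
import numpy as np
import statsmodels.api as sm

# First stage: regress treatment on the instrument plus exogenous controls
W = sm.add_constant(np.column_stack([np.asarray(Z), np.asarray(X)]))
first = sm.OLS(np.asarray(T), W).fit(cov_type='HC1')

# With this column ordering, 'x1' is the instrument's coefficient
f_res = first.f_test('x1 = 0')
print(f'first-stage F = {float(f_res.fvalue):.1f}')  # rule of thumb: want F >= 10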

Regression discontinuity (RDD)

When treatment is assigned based on a continuous variable crossing a threshold — credit-score cutoffs, age cutoffs, eligibility scores — you can compare units just above and just below the threshold. The assumption is that units near the threshold are otherwise similar, so the discontinuity in the outcome at the threshold reveals the causal effect.

The flow

  1. Pick a bandwidth around the threshold.
  2. Fit a local regression on each side of the threshold.
  3. Estimate the discontinuity: the jump in fitted values at the threshold (sketched after this list).
  4. Robustness: check different bandwidths, check for manipulation of the running variable around the threshold (McCrary test).
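
A local-linear sketch of that flow; the cutoff, bandwidth, and column names are all assumptions here. In practice a dedicated package such as rdrobust chooses the bandwidth and builds robust confidence intervals, but the core logic is two regressions meeting at the threshold.

Local-linear RDD
import statsmodels.formula.api as smf

# df: columns [running, outcome]; cutoff and bandwidth are hand-picked here
c, h = 620, 40   # e.g. a credit-score cutoff with a 40-point window
d = df[(df['running'] - c).abs() <= h].copy()
d['centered'] = d['running'] - c
d['above'] = (d['centered'] >= 0).astype(int)

# Separate slopes on each side; the coefficient on 'above' is the jump at the cutoff
rdd = smf.ols('outcome ~ above * centered', data=d).fit(cov_type='HC1')
print(rdd.params['above'])  # estimated discontinuity
# Robustness: re-run with h/2 and 2h; the estimate shouldn't swing wildly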

Where it fits

  • Free-trial eligibility based on a usage threshold.
  • Loan approval based on a credit score cutoff (SentiLink-adjacent — this is the canonical fintech RDD setup).
  • Enterprise discount tiers triggered at spend thresholds.

Synthetic controls

When you have one treated unit (a single city, a single product line) and many possible controls, the synthetic control method constructs a weighted average of the controls that best matches the treated unit's pre-period trajectory. The "synthetic" version is your counterfactual.
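
A minimal sketch of the weight-finding step under the classic constraints (non-negative weights that sum to one), assuming y_pre, donors_pre, and donors_post are NumPy arrays of pre- and post-period outcomes. Full implementations also match on covariates, but matching the pre-period trajectory is the core idea.

Synthetic control weights via constrained least squares
import numpy as np
from scipy.optimize import minimize

# y_pre: (T_pre,) treated unit's pre-period outcomes
# donors_pre: (T_pre, J) pre-period outcomes for J donor units
def fit_weights(y_pre, donors_pre):
    J = donors_pre.shape[1]
    loss = lambda w: np.sum((y_pre - donors_pre @ w) ** 2)
    res = minimize(
        loss,
        x0=np.full(J, 1 / J),
        bounds=[(0, 1)] * J,                                       # w_j >= 0
        constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1},  # weights sum to 1
        method='SLSQP',
    )
    return res.x

w = fit_weights(y_pre, donors_pre)
synthetic = donors_post @ w  # counterfactual path; gap vs the actual path = estimated effect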

Where it fits

  • A new feature launched in one country.
  • A pricing change applied to one product line.
  • An ad campaign launched in one DMA.

The challenges

  • Inference is hard — placebo-based permutation tests are standard but small-sample.
  • The pre-period fit needs to be good; if your synthetic control doesn't match the treated unit's pre-period, you don't have a credible counterfactual.

Choosing a method

What loops are really testing is whether you can pick the right method and narrate why. The decision tree:

Setup → Method

  • Can randomize at unit level → A/B test — the standard
  • Can't randomize, but have a clean instrument → Instrumental variables
  • Treatment assigned by a threshold on a continuous variable → Regression discontinuity
  • Treatment applied to one group and not another, with pre-period data on both → Difference-in-differences
  • Treatment in one unit only, with many candidate controls and pre-period data → Synthetic control
  • Treatment status is observational, confounders are observed → Propensity scoring (matching / IPTW)
  • None of the above → Be honest — you can do exploratory analysis but you can't claim causation

Interview probes

Probe 1: "When would you choose DiD over a randomized A/B test?"

I wouldn't, if I could randomize. DiD is for cases where randomization is infeasible (regulatory rollouts, geographic launches, policy changes) and I have pre-period data on both the treated and untreated groups. The win is that DiD nets out group-level baseline differences and time trends that would bias a simple pre/post comparison.

Probe 2: "What's the key assumption behind propensity scoring?"

Ignorability — that conditional on observed covariates, treatment is as good as random. Equivalently: there are no unobserved confounders. This is a strong, often-violated assumption. Propensity scoring is best when treatment assignment is well-understood and you've captured the drivers (clinician decisions based on labs and history, marketing campaigns based on user attributes). It's weakest when treatment status reflects something the analyst can't observe (motivation, sophistication, hidden context).

Probe 3: "What's a 'weak instrument' and why does it matter?"

An instrument with low correlation to treatment (first-stage F-statistic below ~10). With a weak instrument, the IV estimate is biased toward the OLS (confounded) estimate, and the standard errors are understated, making spurious significance more likely. Always report the first-stage F. If it's weak, the IV estimate isn't trustworthy.

Probe 4: "Walk me through how you'd evaluate whether a marketing campaign caused a lift, with no randomization."

Step one: find a natural comparison. Did some markets get the campaign and others not? If so, DiD with parallel-trends checks. Did exposure depend on a threshold (impression count, frequency cap)? RDD around the threshold. Did some users get nudged via random emails? That's an instrument — IV. If none of those exist and we just have campaign vs no-campaign with no clean structure, propensity scoring on observable confounders is the best we can do, but I'd state explicitly that residual confounding can't be ruled out and we should describe the work as suggestive, not causal.

Probe 5: "Synthetic control sounds magical. What's the catch?"

Three catches: (1) the pre-period fit needs to be visibly good — if your synthetic doesn't track the treated unit pre-treatment, you don't have a credible counterfactual; (2) inference is hard because you have one treated unit — the standard approach is permutation tests over placebo treatments, which are small-sample and conservative; (3) the result is only causal if the pool of donor units wasn't itself affected by the treatment. Geographic spillovers are a real threat.