Predictive Modeling for Business
The four model types Cohere calls out — forecasting, segmentation, propensity scoring, opportunity sizing — and how to frame each as a decision-support artifact, not an ML experiment.
The four business-modeling jobs
The Cohere JD lists them explicitly: "develop and deploy models for forecasting, segmentation, propensity scoring, and opportunity sizing." These are not academic ML problems — they're tools for the business to make better decisions, and they're judged by whether decisions changed, not by AUC.
The framing that distinguishes a senior business DS from an ML engineer:
- The output is a recommendation, not a model card. The model exists to inform a decision — what allocation, what segment, what target list.
- Calibration over rank when probabilities matter. A model that ranks well but is poorly calibrated will fool stakeholders into thinking probabilities are real.
- Stability over accuracy when the model gets re-run. A forecast that swings 30% between months is unusable, even if monthly accuracy is good.
- Explainability at the level the consumer needs. The exec doesn't need SHAP; they need "the model says X because of Y, Z."
Forecasting
The most common business-modeling job. Forecast revenue, usage, headcount, costs. The right method depends on the shape of the series.
Method ladder
| Method | Use when |
|---|---|
| Naïve / last value / seasonal naïve | Baseline. Always show this — if your fancy model doesn't beat it, you don't have a model. |
| Exponential smoothing (ETS) | Smooth series with trend & seasonality, short forecasting horizons. |
| ARIMA / SARIMA | Stationary or differenceable series, clear autocorrelation structure. |
| Prophet | Multiple seasonalities (weekly + yearly), holiday effects, missing data. Forgiving and easy to defend. |
| Gradient boosting on lag features | Many parallel series (per-account, per-product), exogenous features matter, large data. |
| Bayesian structural time series | You need uncertainty intervals stakeholders trust, and decomposition into trend + seasonal + regression components. |
The three things to get right
- Validation by time-based holdouts, not random k-fold. Walk-forward validation: train on weeks 1–8, predict weeks 9–12; train on weeks 1–12, predict weeks 13–16; and so on (see the sketch after this list).
- Stable retraining: don't reactively fit to the latest spike. Define a retraining cadence (weekly, monthly) and a stability rule (if the new fit differs from the old by more than X, investigate before updating).
- Communicate intervals, not point estimates. The expected value is one number; the "we'd be surprised if it's outside [A, B]" interval is what stakeholders actually need to make decisions.
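A minimal walk-forward loop against the naive baseline, assuming a weekly pandas Series `y` with no zero values (MAPE divides by actuals); window sizes mirror the example above and are illustrative:

```python
import numpy as np
import pandas as pd

def walk_forward_mape(y: pd.Series, initial=8, horizon=4):
    """Expanding-window MAPE of the last-value naive baseline."""
    errs = []
    # Train on weeks 1..8, predict 9..12; train on 1..12, predict 13..16; ...
    for end in range(initial, len(y) - horizon + 1, horizon):
        actual = y.iloc[end:end + horizon].to_numpy()
        pred = y.iloc[end - 1]               # naive: carry the last value forward
        errs.append(np.mean(np.abs((actual - pred) / actual)))
    return float(np.mean(errs))

# Fit any candidate model on y.iloc[:end] inside the same loop and compare
# MAPEs; if it doesn't beat the naive number, you don't have a model.
```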
Stakeholders pattern-match "deep learning forecaster" to "better." For business forecasts at most companies, Prophet plus a seasonal-naïve baseline comes within 5% of any fancy model and is debuggable when it goes wrong. Pick the simplest method that meets the bar, and document the comparison.
Segmentation
"Group our customers into useful types." Two approaches, and the choice matters:
Behavioral clustering (unsupervised)
K-means or hierarchical clustering on usage features. The risk: clusters that are statistically distinct but useless for the business — distinguishing two segments that get the same treatment doesn't help.
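A minimal sketch of this route, assuming a pandas DataFrame `usage` of per-account features (the column names are illustrative, not a prescribed schema):

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-account usage features; scale them so no single
# feature dominates the distance metric.
features = usage[["api_calls_30d", "seats_active", "days_since_signup"]]
X = StandardScaler().fit_transform(features)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
usage["segment"] = km.labels_

# The clustering is the easy part; the test that matters is whether
# sales/marketing would treat these four groups differently.
```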
Decision-driven segmentation
Define segments by the decision they imply. "Accounts likely to upgrade in next 30 days" vs "accounts at risk of churn" vs "accounts plateauing." Then build models per segment.
If sales/marketing can't tell you what they'd do differently with each segment, the segmentation isn't valuable yet. Start with the decision ("how do you allocate AE time?") and design the segments to inform it, not the other way around.
Validating a segmentation
- Stability: do users stay in the same segment month-over-month, or do they hop? (See the sketch after this list.)
- Distinctiveness: do segments differ meaningfully on the outcome variables you care about?
- Actionability: can a sales rep tell, from the dashboard, which segment a customer is in and what to do?
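A minimal month-over-month stability check, assuming `seg_jan` and `seg_feb` are hypothetical pandas Series of segment labels indexed by account_id:

```python
import pandas as pd

# Fraction of accounts that kept their segment month-over-month.
both = pd.concat({"jan": seg_jan, "feb": seg_feb}, axis=1).dropna()
stay_rate = (both["jan"] == both["feb"]).mean()
print(f"{stay_rate:.0%} of accounts stayed in their segment")

# What counts as "stable" is a judgment call for your cadence; if a large
# share of accounts hop segments every month, reps can't act on the labels.
```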
Propensity scoring
"Probability that account X does action Y in the next N days." Used to score leads, prioritize sales outreach, flag at-risk customers.
Standard recipe
- Define the outcome unambiguously: "account purchased in next 30 days" with a clean snapshot date.
- Pull features as of the snapshot — usage stats, account age, plan, history. Avoid leakage (no feature that uses post-snapshot data).
- Train a classifier — logistic regression, gradient boosting (LightGBM, XGBoost). Cross-validation with time-based folds.
- Calibrate. Either isotonic or Platt scaling on a held-out set. Calibration is what makes the model usable for stakeholder decisions ("this score means there's roughly a 30% chance").
- Evaluate not just AUC but business utility — "if we acted on the top 20% by score, what's the lift over random?"
```python
from lightgbm import LGBMClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import TimeSeriesSplit

# Base ranker: gradient boosting on snapshot features.
base = LGBMClassifier(n_estimators=300, max_depth=6, learning_rate=0.05)

# Isotonic calibration with time-ordered folds, so the calibrator is never
# fit on data from after the period it scores.
clf = CalibratedClassifierCV(base, method='isotonic', cv=TimeSeriesSplit(n_splits=5))
clf.fit(X_train, y_train)  # X_train must be sorted by snapshot date

ps = clf.predict_proba(X_holdout)[:, 1]  # calibrated P(action within N days)
```
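And a sketch of the business-utility check from the last step, reusing `ps` and assuming matching holdout labels `y_holdout` (`lift_at_top` is an illustrative helper, not a library function):

```python
import numpy as np

def lift_at_top(scores, labels, frac=0.20):
    """Conversion rate among the top `frac` by score vs. the base rate."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    k = max(1, int(len(scores) * frac))
    top = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return labels[top].mean() / labels.mean()

# lift_at_top(ps, y_holdout) == 2.4 would mean the top 20% convert at
# 2.4x the rate of random targeting -- the number the business cares about.
```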
Opportunity sizing
"How big is the prize if we ship X, or enter market Y?" Senior DS spends a lot of time on this — it's the analytical input to GTM investment decisions.
The framework
- Define the addressable universe: which customers (or potential customers) could the change affect?
- Estimate the per-unit effect: based on test data, an analog product, or a structural argument.
- Roll up to a total: multiply, sum, integrate.
- Bound the uncertainty: low/mid/high estimates, named drivers of each.
- Identify the sensitive levers: which assumption, if wrong by 20%, would change the recommendation?
The trap is single-number opportunity estimates with hidden assumptions. The right artifact is a one-pager with explicit assumptions, a range, and a sensitivity table.
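A minimal sketch of that one-pager's arithmetic; every number below is a hypothetical placeholder, not a recommendation:

```python
# Hypothetical inputs -- each is an assumption to be challenged in review.
addressable = 1_200                    # accounts the change could affect
rev_per_account = 8_000                # per-unit effect, from an analog product
adoption = {"low": 0.05, "mid": 0.10, "high": 0.15}  # year-one scenarios

for case, rate in adoption.items():
    print(f"{case}: ${addressable * rate * rev_per_account:,.0f}")

# Sensitivity: this roll-up is linear, so a 20% miss on any single driver
# moves the total 20%. Flag any driver whose 20% swing would cross the
# go/no-go threshold -- those are the assumptions to pressure-test first.
```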
At Cohere or HeyGen, you'll be sizing AI features. Trap: estimating revenue from "users who use the feature" without accounting for cannibalization (they might have paid anyway) or substitution (this feature replaces a more profitable behavior). Always model the counterfactual world, not just the world-with-feature.
The framing that wins
Interviewers can tell the difference between "I built a propensity model" and "I built a propensity model to support a decision and here's what changed." The decision-first framing:
- What decision is this model going to inform? (e.g., "which 200 accounts get a touch from a CSM this quarter")
- What's the unit of decision? (account)
- What's the target? (defined relative to the decision)
- What threshold or ranking does the decision-maker apply? (top 200 by score)
- What's the eval metric the business cares about? (lift at top 20%, not raw AUC)
- How does this get refreshed and acted on? (weekly retrain, push to Salesforce as a field)
Shipping a business model
Models that don't ship don't help. The shipping checklist:
- Output integration: where does the score live so the user (sales rep, marketer, exec) sees it? Salesforce field, BI dashboard, weekly email?
- Refresh cadence: how often is it rerun? Monthly is usually fine for opportunity sizing; weekly for propensity; daily only if real-time matters.
- Monitoring: input drift (are feature distributions stable?), output drift (is the score distribution stable?), outcome drift (when labels arrive, is calibration holding?); see the sketch after this checklist.
- Stakeholder feedback loop: how do you hear when the model's wrong? A senior AE who says "this account is hot and the model says cold" is signal — capture it.
- Sunset clause: what would make us turn this model off?
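A minimal sketch of the output-drift check, assuming you keep last cycle's scores as a baseline; the thresholds in the comment are a common rule of thumb, not a law:

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two score samples."""
    baseline, current = np.asarray(baseline), np.asarray(current)
    edges = np.unique(np.quantile(baseline, np.linspace(0, 1, bins + 1)))
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 watch, > 0.2 investigate before
# letting this cycle's scores drive decisions.
```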
Interview probes
Probe 1: "How would you forecast Cohere's monthly API revenue?"
"Start with a baseline — seasonal-naïve at the customer-segment level (enterprise tier × industry). Then add an exogenous-features model — gradient boosting on lag features plus pipeline (newly signed accounts), product mix shifts, and seasonality. Walk-forward validation. Compare both to seasonal-naïve. Report point estimate and interval, decomposed into 'committed' (existing accounts) and 'new pipeline' (signed but not yet ramped) — because those are different levers and have different uncertainty profiles."
Probe 2: "Why calibrate a classifier? Isn't AUC enough?"
AUC measures ranking — whether positives score higher than negatives on average. It doesn't say anything about whether a 0.3 score means 30% chance. Business stakeholders interpret scores as probabilities, and decisions ("we'll target everyone above 0.5") implicitly assume calibration. Without it, the threshold is uninterpretable and the model can't be combined with cost/value math. Isotonic or Platt scaling on a held-out set is the standard fix.
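To see the gap concretely: a quick reliability check on held-out data, reusing the `ps` scores from the recipe above and assuming matching labels `y_holdout` (a sketch, not a full reliability diagram):

```python
from sklearn.calibration import calibration_curve

# Bin holdout scores; compare mean predicted score vs. observed event rate.
prob_true, prob_pred = calibration_curve(y_holdout, ps, n_bins=10)
for pred, true in zip(prob_pred, prob_true):
    print(f"scores near {pred:.2f} -> observed rate {true:.2f}")

# Well-calibrated: the observed rate tracks the score. If accounts scored
# ~0.30 convert at 0.12, anyone reading scores as probabilities is misled.
```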
Probe 3: "When would you not segment customers?"
When the decision-maker would treat every segment identically. Segmentation has a real cost — analysts maintain the model, stakeholders learn the segments, dashboards split. If no downstream action differs by segment, you've added complexity for nothing. Start by asking the stakeholder: 'if we had three segments, how would your work change for each?' If they can't answer concretely, you're not ready to segment yet.
Probe 4: "Walk me through sizing a new product line opportunity."
"Define addressable universe: who could buy this? Estimate per-customer revenue: from analog products, comparable customers, or test pricing. Multiply for total. Apply realistic adoption curve — typically 5–15% TAM in year one for B2B. Bound: low (current customers only), mid (current + 50% of warm prospects), high (mid + 30% cold reach). Sensitivity: which input, if 20% off, would flip the recommendation? Usually it's pricing assumption or adoption rate — call those out as the loss-bearing assumptions. Deliver as a one-pager with explicit assumptions and ranges, not a single number."