Methodology
How the MLB expected runs model works — data, features, modeling, and evaluation
This project builds a machine learning pipeline that predicts expected runs per team per MLB game and evaluates those predictions against sportsbook-implied probabilities. Sports betting markets are used as a calibration benchmark, not as a gambling application. Sharp market participants push lines to near-true probabilities quickly, making them a higher-quality probability signal than most independently constructed models.
The metrics here are drawn from quantitative finance and decision theory: expected value, Kelly criterion, and probability calibration. Kelly criterion is a portfolio optimization formula from information theory. Expected value is standard in rational decision-making under uncertainty. Probability calibration is a core model evaluation technique. The domain is baseball; the methods are from applied statistics and quantitative analysis.
The project covers the full stack: automated data ingestion from multiple APIs, feature engineering from pitch-level Statcast data, probabilistic inference with calibrated outputs, daily evaluation against actual results, and a production frontend updated each morning of the season.
Prediction target
Expected runs per team per game
Model
XGBoost regressor, 12 features
Win probability
Negative binomial distribution (r = 6)
Calibration
Isotonic regression on OOF predictions
Market comparison
Quarter-Kelly criterion (portfolio theory)
Pipeline cadence
Daily cron at 6 AM PST via GitHub Actions
The pipeline runs as a GitHub Actions cron job at 14:00 UTC (6 AM PST), before first pitch of the day's earliest games. Each step is timed and logged; failures are captured with full tracebacks but do not abort downstream steps.
Schedule & Scores
Fetch 3-day window of schedules, upsert games, finalize scores
Statcast Stats
Compute pitcher, bullpen, and batting stats from pitch-level data
Park Factors
Load ballpark run environment factors (cached per season)
Odds
Fetch moneyline, run line, and totals; match to games by team + time
Model
Train XGBoost, predict xR per team, write to model_outputs
Evaluation
Score predictions vs actuals, update accuracy + calibration tables
Data Sources
FanGraphs was the original data source for advanced pitching stats, but it blocks automated requests via Cloudflare. All features are computed directly from Statcast pitch-level data using pybaseball, which gives direct formula control and removes a scraping dependency.
Batting split features use a 60/40 handedness blend: 60% weight on the team's splits vs the starting pitcher's hand, 40% on bullpen (assumed 60% RHP league-wide). This approximates real plate appearance distribution across a full game. Early-season fallbacks use league-average values when fewer than 10 games of data exist.
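The blend above can be sketched as a small helper. This is an illustrative reconstruction of the weighting described in the text, not the project's actual code; the function name and arguments are hypothetical.

```python
def blend_batting_split(vs_starter_hand: float, vs_rhp: float, vs_lhp: float) -> float:
    """Blend a team batting stat for one game.

    60% weight on the split vs the starting pitcher's hand, 40% on an
    assumed bullpen mix of 60% RHP / 40% LHP league-wide.
    """
    bullpen_component = 0.6 * vs_rhp + 0.4 * vs_lhp
    return 0.6 * vs_starter_hand + 0.4 * bullpen_component
```

With fewer than 10 games of split data, the same call would be made with league-average inputs instead of team splits.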
| Feature | Category | Description | Source |
|---|---|---|---|
| xfip | Pitching | Starter expected FIP, adjusted for park effects | Statcast |
| starter_whip | Pitching | Starter walks + hits per inning pitched | Statcast |
| xfip_bullpen | Pitching | IP-weighted bullpen xFIP | Statcast |
| bullpen_k_9 | Pitching | IP-weighted bullpen strikeouts per 9 innings | Statcast |
| batting_ops | Batting Splits | Team OPS vs opponent pitcher handedness (60/40 blend) | Statcast |
| batting_iso | Batting Splits | Team isolated power (SLG − AVG), a measure of extra-base hit rate | Statcast |
| batting_k_pct | Batting Splits | Team strikeout percentage vs pitcher handedness | Statcast |
| avg_last5 | Rolling Performance | 5-game rolling average runs scored | DB |
| avg_last10 | Rolling Performance | 10-game rolling average runs scored | DB |
| std_last5 | Rolling Performance | 5-game rolling standard deviation of runs scored | DB |
| park_factor | Context | Venue run-scoring environment (100 = neutral) | Savant |
| is_home | Context | Home-field advantage indicator (0/1) | MLB API |
The model predicts expected runs as a regression target (reg:squarederror) rather than a classification. This lets downstream logic derive win probabilities via distributional assumptions rather than encoding them directly into the model.
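A minimal sketch of that configuration, with the feature list from the table above. The hyperparameter values shown are illustrative mid-grid picks, not the tuned result of any particular run.

```python
# Feature columns as listed in the table above.
FEATURES = [
    "xfip", "starter_whip", "xfip_bullpen", "bullpen_k_9",
    "batting_ops", "batting_iso", "batting_k_pct",
    "avg_last5", "avg_last10", "std_last5",
    "park_factor", "is_home",
]

# Illustrative parameters (values drawn from the search grid below).
XGB_PARAMS = {
    "objective": "reg:squarederror",  # expected runs as a continuous target
    "n_estimators": 200,
    "max_depth": 4,
    "learning_rate": 0.05,
    "min_child_weight": 3,
}
# model = xgboost.XGBRegressor(**XGB_PARAMS)
# model.fit(train_df[FEATURES], train_df["runs_scored"])
```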
Hyperparameter Search Space
| Parameter | Values |
|---|---|
| n_estimators | 100, 200, 300 |
| max_depth | 3, 4, 5 |
| learning_rate | 0.05, 0.10 |
| min_child_weight | 3, 5 |
36 combinations evaluated per run
Cross-Validation
TimeSeriesSplit with 5 folds, games sorted chronologically. Training windows only see past data, with no forward-looking information. Minimum 60 samples required; folds are reduced dynamically for early-season sparsity.
Optimization Metric
Negative mean absolute error (MAE). Chosen over RMSE because run-scoring outliers (blowout games) should not dominate gradient updates.
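The search and fold logic above can be sketched without ML dependencies — `grid_points` and `time_series_folds` are hypothetical stand-ins for scikit-learn's `ParameterGrid` and `TimeSeriesSplit`:

```python
from itertools import product

# The 3 × 3 × 2 × 2 grid from the table above = 36 combinations.
GRID = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.05, 0.10],
    "min_child_weight": [3, 5],
}

def grid_points(grid):
    """Yield every hyperparameter combination as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

def time_series_folds(n_samples, n_splits=5):
    """Expanding-window folds over chronologically sorted games:
    each test block strictly follows its training block."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, i * fold))
        test_idx = list(range(i * fold, (i + 1) * fold))
        yield train_idx, test_idx
```

Each grid point is scored by mean absolute error averaged across folds, and the best point is refit on all available history.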
Given expected runs λ for each team, win probability is derived from a negative binomial distribution rather than the simpler Poisson. Baseball run-scoring exhibits overdispersion: the variance in runs scored exceeds the mean across games. Poisson forces variance equal to the mean, producing probabilities that are systematically overconfident. The negative binomial adds a dispersion parameter r that relaxes this constraint.
P(X = k) = C(k + r − 1, k) · p^r · (1 − p)^k, where p = r / (r + λ) and r = 6 (calibrated to MLB run distributions). This parameterization has mean λ and variance λ(1 + λ/r) > λ, capturing the overdispersion.
A joint probability matrix is computed for all combinations of home/away scores from 0 to 25 runs. Three probabilities are derived from this matrix:
- Win probability: P(home > away) + P(tie) × λ_home / (λ_home + λ_away)
- Cover probability: P(home margin > spread), pushes split 50/50
- Over/under probability: P(total > line), pushes split 50/50
All outputs are clipped to [0.05, 0.95] to prevent degenerate Kelly fractions from extreme probability estimates.
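A compact sketch of the matrix computation, assuming SciPy's `nbinom` parameterization (`n = r`, `p = r / (r + λ)`, which yields mean λ). Function names and the total-line handling are illustrative, not the project's actual code:

```python
import numpy as np
from scipy.stats import nbinom

R, MAX_RUNS = 6, 25  # dispersion parameter and score-grid cap

def run_pmf(lam: float) -> np.ndarray:
    """Negative binomial PMF over 0..MAX_RUNS runs with mean lam."""
    p = R / (R + lam)
    pmf = nbinom.pmf(np.arange(MAX_RUNS + 1), R, p)
    return pmf / pmf.sum()  # renormalize the truncated tail

def game_probs(lam_home: float, lam_away: float, total_line: float):
    """Win and over probabilities from the joint home/away run matrix."""
    joint = np.outer(run_pmf(lam_home), run_pmf(lam_away))
    home, away = np.indices(joint.shape)
    p_win = joint[home > away].sum()
    p_tie = joint[home == away].sum()
    p_win += p_tie * lam_home / (lam_home + lam_away)  # tie-break by xR share
    p_over = joint[(home + away) > total_line].sum()
    p_over += 0.5 * joint[(home + away) == total_line].sum()  # pushes 50/50
    return (float(np.clip(p_win, 0.05, 0.95)),
            float(np.clip(p_over, 0.05, 0.95)))
```

With equal expected runs, the win probability comes out to exactly 0.5 by symmetry, which is a useful sanity check on the matrix logic.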
Raw model win probabilities are calibrated using isotonic regression, a non-parametric, monotone-constrained method that maps predicted probabilities to empirical win rates without assuming any functional form.
The calibrator is fit only on out-of-fold (OOF) predictions from the TimeSeriesSplit CV folds. The calibration mapping therefore never sees the same data the model trained on, ensuring calibrated outputs are not optimistically biased.
Method
Isotonic regression
Training data
OOF predictions only
Threshold
≥ 400 outcomes to activate
After calibration, complementary probabilities are renormalized so that P(home win) + P(away win) = 1 per game. The calibration curve is tracked live on the Performance page.
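The calibration step can be sketched with scikit-learn's `IsotonicRegression`. The helper names and array shapes are hypothetical; `oof_probs` and `outcomes` stand for the out-of-fold predictions and 0/1 results described above:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(oof_probs, outcomes, min_outcomes=400):
    """Fit isotonic calibration on out-of-fold predictions only."""
    if len(outcomes) < min_outcomes:
        return None  # below the activation threshold: use raw probabilities
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(oof_probs, outcomes)
    return iso

def calibrate_game(iso, p_home_raw, p_away_raw):
    """Calibrate both sides, then renormalize so they sum to 1."""
    p_home, p_away = iso.predict([p_home_raw, p_away_raw])
    total = p_home + p_away
    return p_home / total, p_away / total
```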
When the model's implied probability diverges from the sportsbook's implied probability, that gap is the edge: a measure of how much the model disagrees with market consensus. Tracking edge and its realized accuracy over time is how model quality is evaluated in practice. The relevant question is not just whether the prediction was close, but whether the model identified cases where the market was systematically wrong.
Position sizing uses the Kelly criterion, a formula from information theory and portfolio optimization that determines the theoretically optimal allocation to maximize long-run logarithmic growth. It is widely used in quantitative finance (Thorp, Shannon) and applied here as a framework for weighting predictions by confidence:
f* = (bp − q) / b, where p = model win probability, q = 1 − p, b = decimal odds − 1
Full Kelly is mathematically optimal but practically aggressive; short losing streaks can draw down a bankroll significantly. The implementation uses quarter-Kelly (f* × 0.25), a fractional adjustment that reduces variance at the cost of some long-run growth rate.
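The sizing rule reduces to a few lines; this sketch adds one guard not stated above (a negative-edge stake is floored at zero rather than shorted):

```python
def quarter_kelly(p: float, decimal_odds: float, fraction: float = 0.25) -> float:
    """Fractional Kelly stake as a share of bankroll.

    p: model (calibrated) win probability; decimal_odds: sportsbook price.
    """
    b = decimal_odds - 1.0           # net winnings per unit staked
    q = 1.0 - p
    f_star = (b * p - q) / b         # full-Kelly fraction
    return max(0.0, f_star) * fraction  # never stake on a negative edge
```

For example, p = 0.55 at decimal odds of 2.00 gives f* = 0.10, so quarter-Kelly stakes 2.5% of bankroll.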
Evaluation Metrics
Historical validation uses a walk-forward backtest over full MLB seasons: the model trains on all games before a 7-day window, predicts that window, advances, and repeats, mirroring live deployment conditions. Results are visible on the Performance dashboard.
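A skeleton of that walk-forward loop, with the training and prediction calls elided; the window generator below is an illustrative sketch of the advance-and-repeat scheme, not the project's actual backtest code:

```python
from datetime import date, timedelta

def walk_forward_windows(start: date, end: date, window_days: int = 7):
    """Yield (window_start, window_end) pairs covering [start, end).

    For each window: train on all games before window_start,
    predict games in [window_start, window_end), then advance.
    """
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=window_days), end)
        yield cursor, window_end
        cursor = window_end
```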
ML / Data
Database
Frontend
Infrastructure