Methodology

How the MLB expected runs model works — data, features, modeling, and evaluation

Project Overview
End-to-end ML system for probabilistic run prediction and model evaluation against efficient markets

This project builds a machine learning pipeline that predicts expected runs per team per MLB game and evaluates those predictions against sportsbook-implied probabilities. Sports betting markets are used as a calibration benchmark, not as a gambling application. Sharp market participants push lines to near-true probabilities quickly, making them a higher-quality probability signal than most independently constructed models.

The metrics here are drawn from quantitative finance and decision theory: expected value, Kelly criterion, and probability calibration. Kelly criterion is a portfolio optimization formula from information theory. Expected value is standard in rational decision-making under uncertainty. Probability calibration is a core model evaluation technique. The domain is baseball; the methods are from applied statistics and quantitative analysis.

The project covers the full stack: automated data ingestion from multiple APIs, feature engineering from pitch-level Statcast data, probabilistic inference with calibrated outputs, daily evaluation against actual results, and a production frontend updated each morning of the season.

Prediction target

Expected runs per team per game

Model

XGBoost regressor, 12 features

Win probability

Negative binomial distribution (r = 6)

Calibration

Isotonic regression on OOF predictions

Market comparison

Quarter-Kelly criterion (portfolio theory)

Pipeline cadence

Daily cron at 6 AM PST via GitHub Actions

Daily Pipeline
Six-step automated pipeline running every morning before the first pitch

The pipeline runs as a GitHub Actions cron job at 14:00 UTC (6 AM PST), before the day's first pitch. Each step is timed and logged; failures are captured with full tracebacks but do not abort downstream steps.

01

Schedule & Scores

Fetch 3-day window of schedules, upsert games, finalize scores

02

Statcast Stats

Compute pitcher, bullpen, and batting stats from pitch-level data

03

Park Factors

Load ballpark run environment factors (cached per season)

04

Odds

Fetch moneyline, run line, and totals; match to games by team + time

05

Model

Train XGBoost, predict xR per team, write to model_outputs

06

Evaluation

Score predictions vs actuals, update accuracy + calibration tables
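The timed-and-logged, non-aborting behavior described above can be sketched as a small step runner; the step names and function bodies here are illustrative, not the project's actual code.

```python
import logging
import time
import traceback

def run_pipeline(steps):
    """Run (name, fn) steps in order; time each step, log failures with
    full tracebacks, and keep going so downstream steps still run."""
    results = {}
    for name, fn in steps:
        start = time.time()
        try:
            fn()
            status = "ok"
        except Exception:
            # Capture the full traceback, but do not abort the pipeline
            logging.error("step %s failed:\n%s", name, traceback.format_exc())
            status = "failed"
        results[name] = (status, time.time() - start)
    return results
```

A failing step (e.g. the odds fetch timing out) is recorded as failed while the model and evaluation steps still execute.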

Data Sources

MLB Stats API · Statcast / pybaseball · Baseball Savant · The Odds API
Feature Engineering
12 features across four categories, all computed from raw pitch-level Statcast data

FanGraphs was the original data source for advanced pitching stats, but it blocks automated requests via Cloudflare. All features are computed directly from Statcast pitch-level data using pybaseball, which gives direct formula control and removes a scraping dependency.

Batting split features use a 60/40 handedness blend: 60% weight on the team's splits vs the starting pitcher's hand, 40% on bullpen (assumed 60% RHP league-wide). This approximates real plate appearance distribution across a full game. Early-season fallbacks use league-average values when fewer than 10 games of data exist.
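The 60/40 blend above reduces to a nested weighted average. In this sketch the function name is hypothetical, and composing the bullpen term from the stated 60% RHP league-wide assumption is my reading of the description:

```python
def blend_vs_pitching(vs_starter_hand: float, vs_rhp: float, vs_lhp: float) -> float:
    """60/40 handedness blend: 60% weight on the team's split vs the
    starter's hand, 40% on a bullpen assumed to be 60% RHP / 40% LHP."""
    bullpen = 0.6 * vs_rhp + 0.4 * vs_lhp
    return 0.6 * vs_starter_hand + 0.4 * bullpen

# e.g. team OPS of .800 vs the starter's hand, .750 vs RHP, .700 vs LHP
blended_ops = blend_vs_pitching(0.800, 0.750, 0.700)
```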

Feature          Description
xfip             Starter expected FIP, adjusted for park effects
starter_whip     Starter walks + hits per inning pitched
xfip_bullpen     IP-weighted bullpen xFIP
bullpen_k_9      IP-weighted bullpen strikeouts per 9 innings
batting_ops      Team OPS vs opponent pitcher handedness (60/40 blend)
batting_iso      Team isolated power (SLG − AVG), a measure of extra-base hit rate
batting_k_pct    Team strikeout percentage vs pitcher handedness
avg_last5        5-game rolling average runs scored
avg_last10       10-game rolling average runs scored
std_last5        5-game rolling standard deviation of runs scored
park_factor      Venue run-scoring environment (100 = neutral)
is_home          Home-field advantage indicator (0/1)
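The three rolling-form features can be computed with pandas. The `shift(1)` guard, which keeps a game's features from seeing that game's own result, is an assumption consistent with the no-forward-looking-information rule described under Model Architecture; the run values are toy data.

```python
import pandas as pd

# One team's runs scored, in chronological order (toy values)
runs = pd.Series([3, 7, 2, 5, 4, 6, 1, 8, 0, 5, 6])

# shift(1) excludes the current game, so each row only sees prior games
avg_last5 = runs.shift(1).rolling(5).mean()    # 5-game rolling average
avg_last10 = runs.shift(1).rolling(10).mean()  # 10-game rolling average
std_last5 = runs.shift(1).rolling(5).std()     # 5-game rolling std dev
```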
Model Architecture
XGBoost regressor trained with temporally aware cross-validation

The model predicts expected runs as a regression target (reg:squarederror) rather than a classification. This lets downstream logic derive win probabilities via distributional assumptions rather than encoding them directly into the model.

Hyperparameter Search Space

Parameter          Values
n_estimators       100, 200, 300
max_depth          3, 4, 5
learning_rate      0.05, 0.10
min_child_weight   3, 5

36 combinations evaluated per run

Cross-Validation

TimeSeriesSplit with 5 folds, games sorted chronologically. Training windows only see past data, with no forward-looking information. Minimum 60 samples required; folds are reduced dynamically for early-season sparsity.

Optimization Metric

Negative mean absolute error (MAE). Chosen over RMSE because run-scoring outliers (blowout games) should not dominate gradient updates.

Win Probability
Negative binomial joint distribution over all possible final scores

Given expected runs λ for each team, win probability is derived from a negative binomial distribution rather than the simpler Poisson. Baseball run-scoring exhibits overdispersion: the variance in runs scored exceeds the mean across games. Poisson forces variance equal to the mean, producing probabilities that are systematically overconfident. The negative binomial adds a dispersion parameter r that relaxes this constraint.

P(X = k) = C(k+r−1, k) · p^r · (1−p)^k
where p = r / (r + λ), r = 6 (calibrated to MLB run distributions)

A joint probability matrix is computed for all combinations of home/away scores from 0 to 25 runs. Three probabilities are derived from this matrix:

  • Win probability: P(home > away) + P(tie) × λ_home / (λ_home + λ_away)
  • Cover probability: P(home margin > spread), with pushes split 50/50
  • Over/under probability: P(total > line), with pushes split 50/50

All outputs are clipped to [0.05, 0.95] to prevent degenerate Kelly fractions from extreme probability estimates.
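The derivation above can be sketched with scipy's negative binomial, whose (n, p) parameterization matches the formula with n = r and p = r / (r + λ); the function names are illustrative.

```python
import numpy as np
from scipy.stats import nbinom

R = 6          # dispersion parameter, calibrated to MLB run distributions
MAX_RUNS = 25  # the joint matrix covers final scores 0..25

def run_pmf(lam: float) -> np.ndarray:
    """P(runs = k) for k = 0..MAX_RUNS; mean lam, variance lam * (1 + lam / R)."""
    p = R / (R + lam)
    return nbinom.pmf(np.arange(MAX_RUNS + 1), R, p)

def home_win_prob(lam_home: float, lam_away: float) -> float:
    joint = np.outer(run_pmf(lam_home), run_pmf(lam_away))  # [home, away]
    p_win = np.tril(joint, -1).sum()  # home score strictly greater
    p_tie = np.trace(joint)           # equal scores, split by expected runs
    p = p_win + p_tie * lam_home / (lam_home + lam_away)
    return float(np.clip(p, 0.05, 0.95))  # guard against degenerate Kelly sizes
```

Because baseball has no ties, the tie mass on the matrix diagonal stands in for extra-inning games and is apportioned by relative expected runs.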

Probability Calibration
Isotonic regression on out-of-fold predictions eliminates in-sample leakage

Raw model win probabilities are calibrated using isotonic regression, a non-parametric, monotone-constrained method that maps predicted probabilities to empirical win rates without assuming any functional form.

The calibrator is fit only on out-of-fold (OOF) predictions from the TimeSeriesSplit CV folds. The calibration mapping therefore never sees the same data the model trained on, ensuring calibrated outputs are not optimistically biased.

Method

Isotonic regression

Training data

OOF predictions only

Threshold

≥ 400 outcomes to activate

After calibration, complementary probabilities are renormalized so that P(home win) + P(away win) = 1 per game. The calibration curve is tracked live on the Performance page.
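The calibrate-then-renormalize flow might be sketched as follows with scikit-learn's IsotonicRegression; the synthetic OOF data and variable names are illustrative only.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)
oof_probs = rng.uniform(0.05, 0.95, 500)                # OOF win probabilities
outcomes = (rng.random(500) < oof_probs).astype(float)  # simulated results

# 500 >= 400 outcomes available, so the calibrator activates
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(oof_probs, outcomes)

# Calibrate both sides of one game, then renormalize so they sum to 1
p_home = calibrator.predict([0.62])[0]
p_away = calibrator.predict([0.38])[0]
total = p_home + p_away
p_home, p_away = p_home / total, p_away / total
```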

Quantitative Evaluation & Market Comparison
Kelly criterion as a portfolio theory benchmark; walk-forward backtesting on historical seasons

When the model's implied probability diverges from the sportsbook's implied probability, that gap is the edge: a measure of how much the model disagrees with market consensus. Tracking edge and its realized accuracy over time is how model quality is evaluated in practice. The relevant question is not just whether the prediction was close, but whether the model identified cases where the market was systematically wrong.
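A minimal sketch of that edge computation, assuming two-way moneyline odds and simple overround normalization to strip the vig (the project's exact vig treatment isn't specified here):

```python
def market_prob(home_odds: float, away_odds: float) -> float:
    """Home win probability implied by a two-way moneyline, with the
    bookmaker's overround removed by normalizing the raw implied probs."""
    raw_home, raw_away = 1.0 / home_odds, 1.0 / away_odds
    return raw_home / (raw_home + raw_away)

def edge(model_p: float, home_odds: float, away_odds: float) -> float:
    """Model probability minus vig-free market probability."""
    return model_p - market_prob(home_odds, away_odds)

# Model says 55% at a 1.91 / 1.91 line (a 50/50 game after the vig)
home_edge = edge(0.55, 1.91, 1.91)  # roughly +0.05 over market consensus
```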

Position sizing uses the Kelly criterion, a formula from information theory and portfolio optimization that determines the theoretically optimal allocation to maximize long-run logarithmic growth. It is widely used in quantitative finance (Thorp, Shannon) and applied here as a framework for weighting predictions by confidence:

f* = (p · b − q) / b
where p = model win prob, q = 1 − p, b = decimal odds − 1

Full Kelly is mathematically optimal but practically aggressive; short losing streaks can draw down a bankroll significantly. The implementation uses quarter-Kelly (f* × 0.25), a fractional adjustment that reduces variance at the cost of some long-run growth rate.
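The fractional sizing above reduces to a few lines; the function name is illustrative, and zero-flooring negative-edge stakes is an assumption (a negative f* simply means no position).

```python
def quarter_kelly(p: float, decimal_odds: float, fraction: float = 0.25) -> float:
    """Fraction of bankroll to stake under fractional Kelly; 0 if no edge."""
    b = decimal_odds - 1.0      # net payout per unit staked
    q = 1.0 - p
    f_star = (p * b - q) / b    # full Kelly: f* = (p*b - q) / b
    return max(0.0, f_star * fraction)

# 55% model probability at decimal odds 2.0 (b = 1):
stake = quarter_kelly(0.55, 2.0)  # full Kelly 0.10, quarter Kelly 0.025
```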

Evaluation Metrics

·Win accuracy (ML, RL, O/U)
·Mean absolute error (xR vs actual)
·Brier score
·Log-loss
·ROI by edge bucket
·Equity curve (unit P&L)
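The probability-quality and error metrics in the list can be computed with scikit-learn; the outcome and prediction values below are toy data, not project results.

```python
from sklearn.metrics import brier_score_loss, log_loss, mean_absolute_error

y_true = [1, 0, 1, 1, 0]            # actual win/loss outcomes
y_prob = [0.7, 0.2, 0.6, 0.8, 0.4]  # calibrated win probabilities

brier = brier_score_loss(y_true, y_prob)  # mean squared probability error
ll = log_loss(y_true, y_prob)             # penalizes confident misses hardest
xr_mae = mean_absolute_error([4.2, 5.1], [4, 6])  # xR vs actual runs
```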

Historical validation uses a walk-forward backtest over full MLB seasons: the model trains on all games before a 7-day window, predicts that window, advances, and repeats, mirroring live deployment conditions. Results are visible on the Performance dashboard.
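The window arithmetic for that backtest can be sketched as a generator; the 7-day step comes from the description above, while the function name and date handling are illustrative.

```python
from datetime import date, timedelta

def walk_forward_windows(season_start: date, season_end: date, step_days: int = 7):
    """Yield (window_start, window_end) pairs: train on all games before
    window_start, predict games in [window_start, window_end), advance."""
    cur = season_start
    while cur < season_end:
        yield cur, min(cur + timedelta(days=step_days), season_end)
        cur += timedelta(days=step_days)
```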

Tech Stack
Production tools across data science, backend, and frontend

ML / Data

Python 3.13 · XGBoost · scikit-learn · pybaseball · pandas · NumPy · SciPy

Database

Supabase (PostgreSQL) · SQLAlchemy

Frontend

Next.js 16 · TypeScript · Tailwind CSS · shadcn/ui · Recharts

Infrastructure

GitHub Actions (daily cron) · MLB Stats API · The Odds API · Baseball Savant
Note on data sourcing: as described under Feature Engineering, all advanced stats (xFIP, WHIP, K/9, batting splits) are computed directly from Statcast pitch-level data via pybaseball rather than scraped from FanGraphs. Prior-season stats are cached to cache/ on first run (~30 min) and reused on subsequent days.