Methodology

How the model is built, trained, gated, and bet, including the components we turned off on purpose.

What we measure

The first-order metric is closing line value (CLV) — the difference between the price we got and the closing price the market settled on, weighted by stake. CLV is the cleanest signal that the model is finding real edge rather than absorbing variance, because the closing line is the market's most informed estimate after sharps have moved it. ROI is the second-order metric. It will be noisy over any window short of a full season and will diverge from CLV when the sample is small.

Backtest CLV across the 4,444 bets in the 2024–2025 walk-forward sample is +2.89% mean, with 60.8% of bets positive. That is the durable number. Whatever ROI you see on the track record is what that CLV happened to convert to under a particular price stream and bet selection rule.
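A minimal sketch of a stake-weighted CLV calculation, assuming decimal odds and a per-bet CLV defined as the price taken divided by the closing price, minus one. The exact formula behind the published figure is not stated here, so treat that definition, the function name, and the sample bets as illustrative assumptions.

```python
def stake_weighted_clv(bets):
    """Return (stake-weighted mean CLV, share of bets with positive CLV).

    Each bet is (price_taken, closing_price, stake) in decimal odds.
    Assumption: per-bet CLV = price_taken / closing_price - 1.
    """
    total_stake = sum(stake for _, _, stake in bets)
    weighted = sum((taken / close - 1.0) * stake for taken, close, stake in bets)
    positive = sum(1 for taken, close, _ in bets if taken > close)
    return weighted / total_stake, positive / len(bets)


# Illustrative bets only, not drawn from the track record.
sample = [
    (2.10, 2.00, 100.0),  # beat the close
    (1.91, 1.95, 100.0),  # the close was better than the price taken
    (2.50, 2.40, 100.0),  # beat the close
]
mean_clv, share_positive = stake_weighted_clv(sample)
print(f"mean CLV {mean_clv:+.2%}, positive share {share_positive:.1%}")
```

With flat stakes the weighting drops out and the result is the plain mean across bets.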

The model: two layers

The forecast is a stack of two models. The first is a Negative Binomial team-runs model. We estimate each team's expected runs (λ) as a product of factors: lineup quality (16 inputs), starting pitcher (16), bullpen weighted by usage and fatigue (12), recent form (14), situational context (7, including travel and rest), and environmental adjustments for ballpark, weather, and home plate umpire (5). A linear calibration step (slope and intercept) corrects league-mean drift. The two λ estimates feed independent NegBin draws (dispersion r ≈ 3.78, fit from data), and we run enough Monte Carlo simulations to read off win probabilities, run-line probabilities, total over/under at any line, and team totals.

NegBin matters because runs are over-dispersed counts — Poisson would systematically under-price tail outcomes (10+ run games, shutouts). The fitted dispersion was eyeballed against ten years of game-level totals before we locked it.
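The simulation step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the production code: the λ values, the 8.5 total line, the simulation count, and the even split of tied simulations (a stand-in for extra innings) are all hypothetical. Only the dispersion r ≈ 3.78 comes from the text. The draw uses the gamma-Poisson mixture, which gives variance λ + λ²/r, compared with λ for Poisson.

```python
import math
import random


def negbin_draw(rng, lam, r):
    """Draw runs from a Negative Binomial with mean lam and dispersion r.

    Gamma-Poisson mixture: rate ~ Gamma(shape=r, scale=lam / r),
    then runs ~ Poisson(rate). Variance is lam + lam**2 / r.
    """
    rate = rng.gammavariate(r, lam / r)
    # Knuth's multiplication method for a Poisson draw; fine for small rates.
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_game(lam_home, lam_away, r=3.78, n_sims=20000, seed=0):
    """Estimate home win probability and P(total > 8.5) by simulation."""
    rng = random.Random(seed)
    home_wins = ties = overs = 0
    for _ in range(n_sims):
        home = negbin_draw(rng, lam_home, r)
        away = negbin_draw(rng, lam_away, r)
        home_wins += home > away
        ties += home == away
        overs += (home + away) > 8.5
    # Assumption: tied simulations are split evenly between the two teams.
    return (home_wins + 0.5 * ties) / n_sims, overs / n_sims


p_home, p_over = simulate_game(lam_home=4.8, lam_away=4.2)
print(f"home win {p_home:.3f}, over 8.5 {p_over:.3f}")
```

Run-line and team-total probabilities fall out of the same loop by counting different events on the simulated scores.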

The second layer is a LightGBM residual model. It does not replace the baseline; it learns the residual:

λ_final = λ_baseline + α × Δλ_predicted

The residual model is trained on 67 walk-forward-safe features — same raw inputs as the baseline plus interaction terms (pitcher-vs-handedness splits, recent bullpen leverage usage, park-weather interaction) the linear baseline cannot express. α is a per-game scaling factor controlled by a third model — see the next section.
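The correction formula above reduces to a one-line function. In this sketch the [0, 1] range for α and the positive floor on expected runs are assumptions added for illustration; neither is specified in the text.

```python
def final_lambda(lam_baseline, delta_predicted, alpha, floor=0.1):
    """Apply lambda_final = lambda_baseline + alpha * delta_predicted.

    Assumptions: alpha lies in [0, 1], and a small floor keeps
    expected runs positive after a large negative correction.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return max(floor, lam_baseline + alpha * delta_predicted)


# Hypothetical numbers: the same predicted residual at three trust levels.
for alpha in (1.0, 0.25, 0.0):
    print(alpha, round(final_lambda(4.5, 0.4, alpha), 3))
```

At α = 0 the forecast falls back to the baseline, which is why scaling α down is a safe response to a low-trust game.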

Reliability gating

The residual model is useful on most games and harmful on some. A LogisticRegression classifier (the reliability model) estimates the probability that the LightGBM correction will improve the forecast on a given game. α gets scaled down on low-reliability games — when one or both teams are missing data, when lineup announcements are late, when starting pitcher data is thin, or when context features fall outside the training distribution.
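One way to picture the gate is a logistic score over data-quality flags that shrinks α in proportion to the estimated reliability. Everything specific in this sketch is hypothetical: the flag names paraphrase the conditions listed above, the coefficients are invented, and the proportional mapping from reliability to α is an assumption, since the actual mapping is not described.

```python
import math

# Hypothetical coefficients, invented for illustration only; a real
# reliability model would learn them from past games.
INTERCEPT = 2.0
WEIGHTS = {
    "missing_team_data": -1.5,
    "late_lineup": -1.0,
    "thin_pitcher_history": -1.2,
    "out_of_distribution": -2.0,
}


def reliability(flags):
    """Logistic estimate of P(residual correction improves the forecast)."""
    score = INTERCEPT + sum(WEIGHTS[name] for name, on in flags.items() if on)
    return 1.0 / (1.0 + math.exp(-score))


def gated_alpha(flags, alpha_max=1.0):
    """Assumed mapping: alpha shrinks in proportion to estimated reliability."""
    return alpha_max * reliability(flags)


clean = dict.fromkeys(WEIGHTS, False)
shaky = {**clean, "late_lineup": True, "thin_pitcher_history": True}
print(f"clean {gated_alpha(clean):.2f}, shaky {gated_alpha(shaky):.2f}")
```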

The point isn't to add another layer for its own sake. It's that "model says X" is the wrong unit. The unit is "model says X, and we trust the model this much on this game." Bet sizing reads both.

Walk-forward training

Every feature, every training row, every backtest score is computed using only data available before the game it predicts. The split is time-ordered, not random: earlier seasons train the model, 2024 validates it, and 2025 tests it.

This rules out the most common backtest leak: using season-end aggregate stats to predict mid-season games. It is also why the 2025 test ROI, which is only marginally lower than the 2024 validation ROI, can be read as out-of-sample: we did not fit anything to 2025.
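A time-ordered split can be sketched as a filter on season. The 2024 validation and 2025 test years follow the text; the game records, the field names, and the use of calendar year as a proxy for season are assumptions for illustration.

```python
from datetime import date


def time_ordered_split(games, validation_year, test_year):
    """Split games by season so no later game informs an earlier stage."""
    train = [g for g in games if g["date"].year < validation_year]
    validation = [g for g in games if g["date"].year == validation_year]
    test = [g for g in games if g["date"].year == test_year]
    return train, validation, test


# Hypothetical game records.
games = [
    {"id": 1, "date": date(2022, 6, 1)},
    {"id": 2, "date": date(2023, 7, 4)},
    {"id": 3, "date": date(2024, 5, 10)},
    {"id": 4, "date": date(2025, 4, 20)},
]
train, validation, test = time_ordered_split(games, validation_year=2024, test_year=2025)
print(len(train), len(validation), len(test))
```

A random split would scatter 2025 games into the training set, which is exactly the contamination the time ordering prevents.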

The walk-forward discipline was not free. An earlier iteration of the pipeline shipped a leak: pitcher and batter aggregate CSVs were season-end snapshots, which meant April predictions had implicit knowledge of September performance. We caught it, quantified the inflation, rebuilt the feature builder to derive stats from game logs as-of each game's date, and re-ran the full training and backtest cycle. The numbers on the track record are post-fix.
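The as-of rule can be illustrated with a single pitcher statistic. This sketch uses ERA (9 × earned runs ÷ innings pitched) as a stand-in feature; the log layout, field names, and pitcher ID are hypothetical, and the real feature builder is not shown here.

```python
from datetime import date


def era_as_of(game_log, pitcher, as_of):
    """Earned run average from appearances strictly before `as_of`.

    Returns None when the pitcher has no prior innings, so callers
    can treat the feature as missing instead of peeking ahead.
    """
    prior = [g for g in game_log if g["pitcher"] == pitcher and g["date"] < as_of]
    innings = sum(g["innings"] for g in prior)
    if innings == 0:
        return None
    return 9.0 * sum(g["earned_runs"] for g in prior) / innings


# Hypothetical game log.
log = [
    {"pitcher": "P1", "date": date(2024, 4, 2), "innings": 6.0, "earned_runs": 2},
    {"pitcher": "P1", "date": date(2024, 4, 8), "innings": 5.0, "earned_runs": 5},
    {"pitcher": "P1", "date": date(2024, 9, 20), "innings": 7.0, "earned_runs": 0},
]
print(era_as_of(log, "P1", date(2024, 4, 8)))   # only the April 2 start counts
print(era_as_of(log, "P1", date(2024, 10, 1)))  # season-end value, unavailable in April
```

The strict `<` comparison matters: a game's own box score must not feed the features used to predict it.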

The calibrator we turned off

Conventional ML pipelines end with a calibration step (Platt, isotonic, or temperature scaling) to align predicted probabilities with empirical frequencies. We trained ours. It improved Brier score. We disabled it anyway, after six independent rounds of evaluation said the same thing: the calibrator helps the average prediction and hurts the bets we actually place.

The mechanism: calibration is fit against the full population of model probabilities (all games, all markets, every prediction). The bet selection rule places stakes on a small, biased subset — only when the model's edge over the market exceeds the 5% floor. That subset is where the model is most confident and where the market has priced things most aggressively. The shape of the miscalibration on the betting subset is different from the shape on the full population, and the calibrator pulls in the wrong direction on the bets that matter.

Across six evaluation rounds, applying the calibrator under the 5% edge betting rule reduced ROI by 1.2 to 9 percentage points relative to raw probabilities. The model now runs without it — raw probabilities into the edge calculation, no post-hoc smoothing. We will revisit if a new failure mode shows up. We have not seen one in production.
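The mechanism can be shown on a toy example. The rows below are constructed by hand to illustrate the effect and are not evidence from the backtest; edge is assumed to be model probability minus market probability, and all names are hypothetical. The calibration gap is the mean predicted probability minus the observed win rate, so a positive gap means overconfidence.

```python
def calibration_gap(rows):
    """Mean predicted probability minus observed win rate (positive = overconfident)."""
    mean_pred = sum(p for p, _, _ in rows) / len(rows)
    win_rate = sum(won for _, _, won in rows) / len(rows)
    return mean_pred - win_rate


def split_gaps(rows, edge_floor=0.05):
    """Return the gap on all predictions and on the subset clearing the edge floor."""
    bets = [row for row in rows if row[0] - row[1] >= edge_floor]
    return calibration_gap(rows), calibration_gap(bets)


# Toy rows: (model_prob, market_prob, won). Hand-built for illustration.
rows = [
    (0.60, 0.52, 1), (0.58, 0.50, 1), (0.62, 0.55, 1), (0.57, 0.50, 0),
    (0.55, 0.54, 0), (0.52, 0.53, 0), (0.65, 0.64, 0), (0.50, 0.50, 1),
    (0.60, 0.60, 0), (0.45, 0.47, 0),
]
full_gap, bet_gap = split_gaps(rows)
print(f"all predictions {full_gap:+.3f}, betting subset {bet_gap:+.3f}")
```

In this toy data the full population looks overconfident while the betting subset looks underconfident, so a calibrator fit on the full population would shrink exactly the probabilities that should have been left alone or raised.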

Bet selection rules

Forecasts become bets through a fixed rule set, not discretion:

None of these parameters are tuned to maximize backtest ROI. They are tuned for survival across drawdowns plausible at our sample size.
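A fixed selection rule of this kind can be sketched as a pure function of model probability and market price. The 5% floor and flat $100 stake follow figures quoted elsewhere on this page; defining edge as model probability minus the vig-inclusive implied probability is an assumption, as are the function names and the sample slate.

```python
def implied_probability(decimal_odds):
    """Market-implied win probability, vig included."""
    return 1.0 / decimal_odds


def select_bets(candidates, edge_floor=0.05, stake=100.0):
    """Return (game_id, stake) for candidates whose edge clears the floor.

    Assumption: edge = model probability - implied probability.
    """
    picks = []
    for game_id, model_prob, decimal_odds in candidates:
        edge = model_prob - implied_probability(decimal_odds)
        if edge >= edge_floor:
            picks.append((game_id, stake))
    return picks


# Hypothetical slate: (game_id, model_prob, decimal_odds).
slate = [("A", 0.58, 1.95), ("B", 0.52, 1.91), ("C", 0.45, 2.60)]
print(select_bets(slate))
```

Because the function takes no input other than the forecast and the price, the same slate always produces the same bets, which is the point of a rule set over discretion.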

What v1 covers, what it doesn't

Sports: MLB only. NBA and NFL are in the v2 candidate set if v1 reaches scale, but baseball is the sport where the model — long-horizon prior, daily batch, public pricing — has structural advantages. We will not add sports to fill out a product matrix.

Markets: moneyline and spread (run line). The current backtest shows +16.8% on moneyline and +12.3% on spread across 4,444 bets, blended +14.6% at 5% vig and flat $100 stake. Totals and team totals are modeled but excluded from the v1 bet selection — they show smaller edge in backtest and we do not yet trust the calibration on derivative-of-runs markets enough to bet them at scale.

Cadence: daily batch. Predictions and stakes are computed once per day against the morning price snapshot. We do not chase line moves intraday and we do not ship in-game adjustments. Real-time pricing is a different product and a different infrastructure problem; we are not building it.

Reliability gate: when the model's confidence is low (data missing, lineups unannounced, weather TBD), the day's bet count drops. Some days the gate produces zero bets. That is the system working.

Honest limitations

The numbers on the track record come from a walk-forward backtest at 5% vig and flat $100 stake. The vig assumption matches the platform we use; sharper books with lower vig will produce better results, recreational books with higher vig worse. The flat-stake assumption strips out Kelly compounding, so the displayed ROI is a clean read of average bet edge rather than a bankroll outcome.
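Flat-stake ROI is total profit divided by total amount staked. The sketch below assumes decimal odds and uses made-up results; it is meant to show the arithmetic, not to reproduce the track record.

```python
def flat_stake_roi(results, stake=100.0):
    """ROI = total profit / total staked, with the same stake on every bet.

    Each result is (decimal_odds, won). A win returns stake * (odds - 1);
    a loss costs the stake.
    """
    profit = sum(stake * (odds - 1.0) if won else -stake for odds, won in results)
    return profit / (stake * len(results))


# Hypothetical results.
results = [(1.95, True), (1.95, False), (2.10, True), (1.87, True), (2.00, False)]
print(f"{flat_stake_roi(results):+.1%}")
```

The stake cancels out of the ratio, so the figure does not depend on the $100 unit, only on prices and outcomes.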

Live performance in the current season is on a small sample and is currently negative. We publish it next to the backtest deliberately. Two reasons. First, hiding it would be dishonest, and the audience this product is for can read through that anyway. Second, the gap between backtest and live is exactly the question worth obsessing over — it is where regime shifts, data drift, and model mis-specification show up first. CLV across the same window is positive, which is what would be expected if the bets are sound and ROI is just unlucky over a few hundred games.
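A rough calculation shows how much ROI can wander over a few hundred bets. The sketch treats bets as independent and identical and uses a normal approximation; the 55% win probability, 1.95 decimal odds, and 300-bet sample are hypothetical inputs chosen only to make the scale concrete.

```python
import math


def roi_standard_error(win_prob, decimal_odds, n_bets):
    """Standard error of flat-stake ROI over n independent, identical bets."""
    per_bet_sd = decimal_odds * math.sqrt(win_prob * (1.0 - win_prob))
    return per_bet_sd / math.sqrt(n_bets)


def prob_negative_roi(win_prob, decimal_odds, n_bets):
    """Normal approximation to P(observed ROI < 0) given the true edge."""
    true_roi = win_prob * decimal_odds - 1.0
    z = -true_roi / roi_standard_error(win_prob, decimal_odds, n_bets)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


se = roi_standard_error(0.55, 1.95, 300)
p_neg = prob_negative_roi(0.55, 1.95, 300)
print(f"standard error {se:.3f}, chance of negative ROI {p_neg:.3f}")
```

Under these inputs one standard error is between five and six percentage points of ROI, so a negative run over a sample of this size is compatible with a genuinely positive edge.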

Two things the model does not handle well: (1) extreme weather games where late lineup scratches change projected runs by more than the model's residual range, and (2) the first three weeks of a season, when rolling stats have not converged. The reliability gate catches most of (1) but not all. We start placing bets two weeks into the regular season.

What changes between versions

The model is not static. Each season we re-train the residual layer, refit the linear calibration, and re-evaluate the reliability classifier on the prior year's out-of-sample data. Feature additions go through the same walk-forward gauntlet: train through prior years, validate on the most recent full season, test on the current season's held-out portion. Anything that fails to improve held-out Brier or fails to produce a positive ROI lift across an honest sample gets removed.
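The two quantities that the acceptance check reads can be computed as follows. The Brier score is the mean squared difference between predicted probability and the 0/1 outcome; the helper names, sample probabilities, and ROI figures are hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def held_out_deltas(baseline_probs, candidate_probs, outcomes, baseline_roi, candidate_roi):
    """Return (Brier improvement, ROI lift); positive values favor the candidate."""
    brier_gain = brier_score(baseline_probs, outcomes) - brier_score(candidate_probs, outcomes)
    return brier_gain, candidate_roi - baseline_roi


# Hypothetical held-out predictions with and without a new feature.
outcomes = [1, 0, 1, 1]
baseline = [0.60, 0.50, 0.55, 0.50]
candidate = [0.70, 0.40, 0.60, 0.60]
brier_gain, roi_lift = held_out_deltas(baseline, candidate, outcomes, 0.04, 0.05)
print(f"Brier gain {brier_gain:.4f}, ROI lift {roi_lift:+.2%}")
```

Lower Brier is better, so the gain is baseline minus candidate. The two deltas can disagree, as they did for the disabled calibrator, which is why both are reported.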

We publish change notes when a model component changes meaningfully — the calibrator disable is one example. The history of disabled, re-enabled, and re-disabled components is itself a useful artifact: it shows which kinds of improvements survive contact with real betting and which do not.

All backtest numbers on this page: 2024–2025 walk-forward, 5% vig, flat $100 stake, ML + spread markets. Source data regenerated nightly from the underlying bet log. See the track record for the equity curve, reliability diagrams, and edge-bucket breakdown.