How to Backtest Live Win-Probability Signals Against Closing Lines at Major Sportsbooks

May 11, 2026 · 12 min read

Backtesting an in-play win-probability model against historical sportsbook closing lines is the most rigorous way to know whether your model has edge. It is also one of the easiest things to do wrong. This post walks through the methodology we use, including the pitfalls that produce backtest results 30-50% higher than your live performance will turn out to be.

Why closing lines are the right benchmark

The closing line at a sharp sportsbook (Pinnacle, Circa) is the market's most-informed estimate of the true probability before the event begins. It incorporates all sharp money, all known information, and all market participants' best forecasts.

If your model produces a probability that beats the closing line on average — meaning your model and the closing line both observed the same outcomes, but your model assigned higher probability to the actual winners — you have edge. The full conceptual case is in our CLV deep dive.

Beating the closing line is a stronger test than beating individual game outcomes. Outcome variance is huge — a single weekend can swing a win-rate measure by 15 percentage points. Closing-line value converges much faster because every game contributes a continuous-valued data point rather than a binary win or loss.

Data you need

Three datasets, joined on game and timestamp:

  1. Historical game state. Score, time, period, possession at each minute (or finer) for every game. ESPN's archived JSON, or paid services like Sportradar.
  2. Historical sportsbook lines at multiple timestamps. Pinnacle's closing line is the gold standard. Their API provides every line change. The Odds API has multi-book coverage if you need consensus.
  3. Final outcomes. Trivial to source — every sport has free outcome feeds.

The first dataset is what your model consumes. The second is what you compare against. The third is the ground truth.
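
A minimal loading-and-joining sketch, assuming pandas and hypothetical file paths and column names (game_id, ts, home_won, and so on); the timestamp alignment against the odds feed is covered in the next section:

import pandas as pd

# Hypothetical schemas; adjust to whatever your feeds actually provide.
game_state = pd.read_parquet("game_state.parquet")  # game_id, ts, score, period, possession
odds = pd.read_parquet("odds.parquet")              # game_id, ts, home_implied, away_implied
outcomes = pd.read_parquet("outcomes.parquet")      # game_id, home_won (1 or 0)

# Ground truth joins on game_id alone; the odds join needs timestamp
# alignment, covered below.
game_state = game_state.merge(outcomes, on="game_id", how="left")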

Time alignment is the first place backtests go wrong

Your model produces a fair probability at every time t. The sportsbook line you compare against also has a timestamp. If those timestamps are not properly aligned, you are comparing apples to oranges and the result is meaningless.

Two specific time-alignment errors we have made:

Comparing your in-play probability to a closing line that was set after the entire game ended. The closing line incorporates information your in-play model did not have at time t. The comparison is reversed — the closing line should be a benchmark for your final pre-game probability, not your minute-15 probability.

Using a delayed sportsbook feed without correcting for the delay. Some historical odds feeds lag the actual market by 30-120 seconds. If your model "beats" a feed that is 90 seconds behind reality, you are not beating the market — you are beating yesterday's market.

The fix: match timestamps exactly. Benchmark your pre-game probability against the pre-game closing line, and benchmark your in-play probability at time t against the market's line at that same time t, after correcting for any feed delay. The closing line remains the cleanest single benchmark because it is the market's final pre-game answer with all known information incorporated.
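
A minimal alignment sketch, assuming the frames loaded in the data section above and a measured 90-second feed lag (the figure is illustrative, not a constant to hard-code):

import pandas as pd

FEED_LAG = pd.Timedelta(seconds=90)  # measured lag of this historical feed; measure yours

# Shift feed timestamps back to when the market actually moved, then attach,
# for each model snapshot at time t, the latest line that truly existed at t.
odds["ts"] = odds["ts"] - FEED_LAG
aligned = pd.merge_asof(
    game_state.sort_values("ts"),
    odds.sort_values("ts"),
    on="ts",
    by="game_id",
    direction="backward",
)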

Vig must be removed before comparison

Sportsbook lines include the bookmaker's vig — typically 4-8% on a two-way market. The implied probabilities of a -110/-110 line sum to about 104.8%. The fair probability per side, after vig removal, is 50%/50%.

If you compare your model's probability to the raw vig-included implied probability, your model will look better than it is. The correction is straightforward but mandatory:

def remove_vig(implied_a, implied_b):
    # Normalize so the two sides sum to 1, stripping the bookmaker's margin.
    total = implied_a + implied_b
    return implied_a / total, implied_b / total
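
A quick check against the -110/-110 example above; the american_to_implied helper is just for illustration:

def american_to_implied(odds):
    # Implied probability including vig, e.g. -110 -> 110 / 210 ~= 0.524.
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)

p_home = american_to_implied(-110)
p_away = american_to_implied(-110)
print(p_home + p_away)              # ~1.048, the 104.8% overround
print(remove_vig(p_home, p_away))   # (0.5, 0.5) once the vig is stripped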

Always remove vig before comparison. Always. Our odds converter demonstrates this interactively.

Choose the right metric

For comparing your model to the market, two metrics dominate:

Brier score against the actual outcome. If your model's probability is closer to the realized outcome (1 if home wins, 0 if away) than the closing line's probability is, your model has edge in that comparison. Average across many games and you get a robust per-sport edge measure.
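
A minimal per-game sketch of that comparison; the probabilities in the example are illustrative, and the closing-line probability is assumed to be vig-free:

import numpy as np

def brier(prob, outcome):
    # Squared error of a probability against the 0/1 outcome; lower is better.
    return (np.asarray(prob) - np.asarray(outcome)) ** 2

def brier_edge(model_prob, close_prob, home_won):
    # Positive: the model sat closer to the realized outcome than the close did.
    return brier(close_prob, home_won) - brier(model_prob, home_won)

# One game where the home side won: the model (0.62) beats the close (0.55).
print(brier_edge([0.62], [0.55], [1]).mean())   # 0.2025 - 0.1444 = 0.0581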

Closing line value in cents. If your model would have entered at price X cents and the line closed at price Y cents, your CLV is Y - X. Positive means your entry beat the close; negative means it lost. This is what we monitor in production.
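
In code this is a one-liner; the 52/55 figures below are purely illustrative:

def clv_cents(entry_cents, close_cents):
    # Positive: the line closed above your entry, so your entry beat the close.
    return close_cents - entry_cents

print(clv_cents(52, 55))   # entered at 52 cents, closed at 55 -> +3 cents of CLV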

Win rate is a poor metric. It is too noisy and too dependent on outcome variance. Use it as a sanity check, not a primary measure.

Walk-forward, not random splits

A random 80/20 split of historical games will overstate your model's performance because it allows information to leak from future games into your training set. Use walk-forward cross-validation instead: train on games before time t, test on games after time t, advance t, repeat.

For a multi-season backtest, expanding-window walk-forward (train on all prior data, test on the next month) is usually the right choice. For a single-season backtest, four-fold expanding-window CV (5 equal time slices, train on 1, test on 2; train on 1+2, test on 3; etc.) is reasonable.
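
A minimal expanding-window sketch, assuming a time-ordered games frame with a datetime game_date column; fit_model and evaluate are placeholders for your own training and scoring steps:

import pandas as pd

games = games.sort_values("game_date")        # one row per game, oldest first
period = games["game_date"].dt.to_period("M")
months = sorted(period.unique())

fold_scores = []
for i in range(1, len(months)):
    train = games[period < months[i]]          # everything strictly before the test month
    test = games[period == months[i]]          # the next month, never seen in training
    model = fit_model(train)                   # placeholder: your training step
    fold_scores.append(evaluate(model, test))  # placeholder: CLV / Brier edge on this fold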

The full case for walk-forward is at our walk-forward methodology post. The short version: it produces lower numbers than random splits, and the lower numbers are correct.

The backtest-to-live gap

Even with perfect methodology, your live performance will be lower than your backtest indicates. The gap comes from real-world frictions that the backtest cannot fully model.

A reasonable rule of thumb: discount your backtest CLV by 30-50% when projecting live performance. If the backtest shows positive 5 cents per trade, expect 2.5-3.5 cents live. If the backtest shows positive 10 cents per trade, expect 5-7 cents live. Anyone whose live numbers match their backtest numbers either got lucky or is measuring something wrong.

What a clean backtest output looks like

For each sport, edge bucket, and time-period bucket, report the mean CLV across walk-forward folds and the fold-to-fold standard deviation.

The fold standard deviation matters more than the mean. A bucket with mean CLV 4 cents and fold std 1 cent is real edge. A bucket with mean CLV 4 cents and fold std 6 cents is noise that happened to average positive across folds.
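
Assembling that report is a groupby over per-fold results — a sketch assuming a fold_results frame with sport, edge_bucket, period_bucket, fold, and clv_cents columns (all illustrative names), and an illustrative 2x-noise cutoff:

per_fold = (
    fold_results
    .groupby(["sport", "edge_bucket", "period_bucket", "fold"])["clv_cents"]
    .mean()
)
report = (
    per_fold
    .groupby(["sport", "edge_bucket", "period_bucket"])
    .agg(mean_clv="mean", fold_std="std", n_folds="count")
    .reset_index()
)

# Illustrative cutoff: only trust buckets whose mean clears the fold-to-fold noise.
report["looks_real"] = report["mean_clv"] > 2 * report["fold_std"]
print(report.sort_values("mean_clv", ascending=False))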

The bottom line

A backtest done right is the most honest signal you have about whether your model has edge. A backtest done wrong is a confidence trap that produces a great-looking report and live losses. The difference is in the methodology — time alignment, vig removal, walk-forward CV, and honest discounting of the backtest-to-live gap.

Calibrated probabilities you can backtest yourself

ZenHodl publishes historical predictions and backtest harnesses for 11 sports. Validate the methodology on your own data.

Try ZenHodl free