We Backtested Buying the Favorite on a Calibrated Prediction Market — Here's Why It Loses

Here is a tempting idea. On a prediction market, teams priced around 87% to win actually win about 86% of the time. The market is well-calibrated — its prices are honest probabilities. So just buy the favorites. You'll win the overwhelming majority of your bets. How could a strategy that wins 86% of the time lose money?

We tested exactly this on real, outcome-labeled data, and the answer is that it loses — not because of bad luck, but because of a structural fact that's worth internalizing once and never forgetting: a well-calibrated market is a fairly-priced market, and a fairly-priced market is the anti-signal for a directional bet. The better the calibration, the more certain the loss.

The data

We used a cross-venue "matched book": the same MLB games captured live on both Polymarket and Kalshi, stitched onto one timeline, with the eventual winner labeled on every row. The honest sample is small and we'll say so up front: 41 settled game-sides across about 30 games over 3 days. (The raw file has 57,757 rows, but those are the same books re-snapshotted every ~31 seconds. Quoting the row count as your sample size is an easy way for a prediction-market analysis to fool itself, because snapshots of the same game are not independent observations. The real unit is the game.)
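As a rough illustration of collapsing re-snapshotted rows down to the real unit of analysis, here is a minimal Python sketch. The field names (game_id, side, ts, ask, won) and the rows themselves are hypothetical, not the actual schema or data of the matched book.

```python
def collapse_to_game_sides(rows):
    """Keep only the latest snapshot for each (game_id, side) pair.

    rows: iterable of dicts with hypothetical keys
          'game_id', 'side', 'ts' (sortable timestamp), 'ask', 'won'.
    """
    latest = {}
    for row in rows:
        key = (row["game_id"], row["side"])
        if key not in latest or row["ts"] > latest[key]["ts"]:
            latest[key] = row
    return list(latest.values())


# Made-up rows for illustration only: two game-sides, five snapshots.
snapshots = [
    {"game_id": "G1", "side": "home", "ts": 0, "ask": 0.60, "won": 1},
    {"game_id": "G1", "side": "home", "ts": 31, "ask": 0.62, "won": 1},
    {"game_id": "G1", "side": "home", "ts": 62, "ask": 0.65, "won": 1},
    {"game_id": "G2", "side": "away", "ts": 0, "ask": 0.45, "won": 0},
    {"game_id": "G2", "side": "away", "ts": 31, "ask": 0.44, "won": 0},
]

units = collapse_to_game_sides(snapshots)
print(len(snapshots), "rows ->", len(units), "game-sides")
```

Whether you keep the last snapshot, the first, or a fixed time before first pitch is a design choice; the point is that the count that matters is the number of game-sides, not the number of rows.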

First we confirmed the premise. The closing prices are well-calibrated: the Brier score is about 0.136 on Polymarket and 0.133 on Kalshi — comfortably below the 0.25 you'd get from a coin flip — and favorites priced ~87% won ~86%. So the premise of the strategy is true. Now watch it not matter.
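For readers who want to check the calibration premise on their own rows, the Brier score is simply the mean squared difference between the quoted probability and the 0/1 outcome. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# A forecaster who always says 50% scores 0.25 regardless of outcomes.
coin_flip = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
print(coin_flip)

# Made-up closing prices and results, for illustration only.
example = brier_score([0.87, 0.60, 0.30, 0.95], [1, 1, 0, 1])
print(round(example, 4))
```

Lower is better: 0 is a perfect forecast and 0.25 is the constant-50% baseline mentioned above.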

The one-line proof

Let p be the price (in probability units) and assume the market is perfectly calibrated, so the true win probability is also p. You buy one contract at the ask, hold to settlement, and receive $1 if the team wins, $0 if it loses. Your expected profit is:

E[profit] = (prob win)·($1 − ask) + (prob lose)·(−ask)
          = p·(1 − ask) − (1 − p)·ask
          = p − ask
          = −(half-spread)        ← because a calibrated ask ≈ p + half-spread

The p·ask terms cancel. At every price, whether favorite, underdog, or coin flip, the calibrated taker's expected edge is approximately minus the half-spread they crossed, before a cent of platform friction. There is no p where this flips positive. Win rate and expected value have decoupled. The 86% win rate is real, but you already paid for all of it in the price: at roughly 87¢ plus the half-spread, you're buying at most 13¢ of upside, and 14% of the time you lose the whole stake.
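The cancellation is easy to verify numerically. The sketch below evaluates the expected-profit formula at several prices, assuming a perfectly calibrated market and a 1¢ half-spread (the half-spread value is an illustrative assumption, not a measured one).

```python
def expected_profit(p_win, ask):
    """Expected profit of buying one $1 contract at `ask` and holding to settlement."""
    return p_win * (1.0 - ask) + (1.0 - p_win) * (-ask)


HALF_SPREAD = 0.01  # assumed for illustration

for p in (0.13, 0.50, 0.87, 0.98):
    ask = p + HALF_SPREAD  # calibrated market: true probability sits at the mid
    ev = expected_profit(p, ask)
    print(f"p={p:.2f}  ask={ask:.2f}  EV={ev:+.4f}")
```

Every row prints an expected value of minus the half-spread (up to floating-point rounding), regardless of whether the contract is a longshot or a heavy favorite.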

The backtest agrees

Across all 41 game-sides, buying the tracked side at the ask returned a gross +1.8¢ per contract — which looks like a faint pulse until you put error bars on it. The game-level bootstrap 95% confidence interval runs from −9.5¢ to +13.3¢ (a permutation test gives p = 0.76 — pure noise). And that's before trading costs. After a realistic round-trip friction on top of the spread, it's negative either way:

Cost on top of the spread-inclusive ask	Net EV per contract
+3¢ (friendly)	−1.2¢
+5¢ (realistic on thin books)	−3.2¢
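To show what "game-level bootstrap" means mechanically, here is a minimal percentile-bootstrap sketch that resamples whole games rather than snapshot rows. The P&L values are made up, the function names are hypothetical, and this is not the exact pipeline behind the interval quoted above; the permutation test is not reproduced here.

```python
import random
from statistics import mean


def bootstrap_ci(game_pnl, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean P&L, resampling games with replacement."""
    rng = random.Random(seed)
    n = len(game_pnl)
    means = sorted(mean(rng.choices(game_pnl, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def net_ev(gross_ev, extra_cost):
    """Subtract a per-contract cost from a gross edge (both in dollars)."""
    return gross_ev - extra_cost


# Made-up per-game P&L in dollars per contract, for illustration only.
game_pnl = [0.13, 0.13, -0.87, 0.40, 0.02, -0.55,
            0.35, 0.13, -0.60, 0.45, 0.08, 0.30]

lo, hi = bootstrap_ci(game_pnl)
print(f"mean={mean(game_pnl):+.4f}  95% CI=({lo:+.4f}, {hi:+.4f})")

# The cost table above is plain subtraction from the +1.8 cent gross figure.
for cost in (0.03, 0.05):
    print(f"cost={cost:.2f}  net={net_ev(0.018, cost):+.3f}")
```

With only a few dozen independent units, an interval this wide is the expected result, which is why a small positive point estimate carries so little information.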

"But the heavy favorites won every single time!" They did: ten-of-ten in the most extreme bucket. Their ask was about 0.98. You pay roughly 98¢ to win a dollar, so the gross gain is only about 2¢ per contract, and after 5¢ of friction that "perfect" bucket nets −2.62¢ (a figure that corresponds to an average ask near 97.6¢). A 100% hit rate at a 98¢ price is not an edge; it's the market doing its job.

Where the "winning" buckets came from

Every positive-looking slice we found dissolved under scrutiny, and it's worth seeing how, because these are the exact traps that make a backtest lie to you:

One lucky day carried the whole positive sign. Remove a single day and the tracked-side edge falls to −3.79¢. An "edge" that depends on which 3 days you sampled is not an edge.
Three longshots accounted for 295% of total profit. The rest of the book was a net loss; a handful of upset winners masked it.
A "perfect streak" bucket faked a positive confidence interval. One price band went 15-for-15. A bootstrap that resamples a set with zero losers can never draw a loss — so its lower bound is spuriously positive. That's a sample-size pathology, not a discovery.
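Each of the three traps above has a cheap diagnostic. The sketch below shows one possible version of each, using made-up P&L numbers and hypothetical function names; it illustrates the checks, not the exact code behind the figures quoted.

```python
import random
from statistics import mean


def leave_one_day_out(pnl_by_day):
    """Mean P&L per bet with each day removed in turn."""
    result = {}
    for held_out in pnl_by_day:
        rest = [x for day, xs in pnl_by_day.items() if day != held_out for x in xs]
        result[held_out] = mean(rest)
    return result


def top_k_share(pnl, k):
    """Share of total profit from the k largest winners (needs total > 0)."""
    total = sum(pnl)
    if total <= 0:
        raise ValueError("share is only meaningful when total profit is positive")
    return sum(sorted(pnl, reverse=True)[:k]) / total


def bootstrap_lower_bound(pnl, n_boot=2000, alpha=0.05, seed=0):
    """Lower end of a percentile bootstrap CI for the mean."""
    rng = random.Random(seed)
    n = len(pnl)
    means = sorted(mean(rng.choices(pnl, k=n)) for _ in range(n_boot))
    return means[int((alpha / 2) * n_boot)]


# Trap 1: one day carries the sign (made-up data).
pnl_by_day = {
    "day1": [0.60, 0.55, -0.40],
    "day2": [-0.30, 0.10, -0.20],
    "day3": [0.05, -0.15, 0.08],
}
print(leave_one_day_out(pnl_by_day))

# Trap 2: a few longshots exceed 100% of total profit (made-up data).
pnl = [0.70, 0.65, 0.60, -0.45, -0.40, -0.35, -0.25]
print(f"top-3 share of profit: {top_k_share(pnl, 3):.0%}")

# Trap 3: a bucket with no losers can never resample a loss.
perfect_bucket = [0.08] * 15  # 15-for-15 at an assumed 92 cent ask
print(bootstrap_lower_bound(perfect_bucket))
```

In the third case every resample is identical, so the interval collapses to a single positive number. That is a property of the sample having no losers in it, not evidence about the bucket's true loss rate.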

The production proof

We don't have to argue this in theory, because we ran the strategy with real money. Our own live trading bot takes directional positions on these markets, and on its last 115 settled trades it posted a closing-line value of −6.7¢, beat the close only 42% of the time, and ran breakeven-to-negative on ROI. That's the theorem realized in production: directional taking on a calibrated market bleeds the spread. The most useful thing our bot ever taught us is that it can't beat the market it trades.
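For concreteness, closing-line value on a long position can be computed as the closing price minus your entry price, so a negative number means you paid more than the market's final price. The sketch below assumes that sign convention and uses made-up trades; which closing quote to use (mid, ask, or last trade) is a choice this example does not settle.

```python
def clv(entry_price, closing_price):
    """CLV for a long position: positive when you bought below the close."""
    return closing_price - entry_price


def clv_summary(trades):
    """Return (average CLV, fraction of trades that beat the close).

    trades: list of (entry_price, closing_price) tuples in dollars.
    """
    values = [clv(entry, close) for entry, close in trades]
    average = sum(values) / len(values)
    beat_rate = sum(v > 0 for v in values) / len(values)
    return average, beat_rate


# Made-up trades for illustration only.
trades = [(0.60, 0.55), (0.70, 0.72), (0.45, 0.40), (0.80, 0.78)]
average, beat_rate = clv_summary(trades)
print(f"average CLV = {average * 100:+.1f} cents, beat the close {beat_rate:.0%} of the time")
```

Unlike win rate, this number does not depend on how the games turned out, which is what makes it usable on a sample of around a hundred trades.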

So where does a real edge live?

If the level is right, the only places left for an edge are narrow, and we can see them closing on our own data:

Speed. One venue does move first: on the same games, Polymarket tends to lead Kalshi. But the median cross-venue gap is zero cents, and the gap sits inside the spread ~96% of the time, so the lead is real and uneconomic at once. You can't profitably trade a 1¢ gap when the round trip costs 3–5¢.
An un-priced conditional pocket — a specific situation the market systematically misjudges. Maybe one exists, but 41 games can't find it, and the pockets people usually chase (extreme favorites, "due" underdogs) are exactly the ones the price has already absorbed.
Stop taking, start making. The half-spread that bleeds the taker is income to the market-maker. Earning the spread instead of paying it is a different game — one that needs queue priority and depth, not a win-rate hunch.
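The speed point comes down to comparing the cross-venue gap against what it costs to trade it. A minimal sketch of that comparison follows, with prices in integer cents to avoid floating-point noise; the tuple layout, function name, and numbers are all hypothetical.

```python
from statistics import median


def gap_stats(snapshots, round_trip_cost):
    """Summarize cross-venue gaps.

    snapshots: list of (venue_a_price, venue_b_price, spread) in cents.
    Returns (median gap, share of gaps inside the spread,
             share of gaps larger than the round-trip cost).
    """
    gaps = [abs(a - b) for a, b, _ in snapshots]
    inside = sum(abs(a - b) <= s for a, b, s in snapshots) / len(snapshots)
    tradable = sum(g > round_trip_cost for g in gaps) / len(gaps)
    return median(gaps), inside, tradable


# Made-up matched snapshots, for illustration only.
snapshots = [(62, 62, 2), (62, 62, 2), (63, 62, 2), (62, 62, 2), (66, 62, 2)]

med, inside, tradable = gap_stats(snapshots, round_trip_cost=3)
print(f"median gap = {med} cents, inside spread = {inside:.0%}, tradable = {tradable:.0%}")
```

A lead-lag relationship only becomes a trade when the last of those three numbers is meaningfully above zero and the gap persists long enough to get filled on both sides.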

The takeaway

A high win rate on a well-calibrated market is the price, not the edge. If anything, discovering that a market is beautifully calibrated is the strongest possible evidence that there's no directional trade in it. The signal that actually separates skill from luck isn't your win rate — it's closing-line value: did you get a better price than the market's final word? Win rate flatters; CLV tells the truth.

See the calibration analysis and the raw data

ZenHodl publishes the full write-up plus a free, outcome-labeled Polymarket × Kalshi sample you can re-run these tests on yourself — no marketing mockups, real settled rows.
