Most prediction-market write-ups focus on a single sport. We have spent the last fourteen months running automated trading across eleven of them, including basketball, hockey, baseball, college football, the NFL, soccer, tennis, CS2, and LoL, and the cross-sport perspective taught us things that a single-sport view never could. This post is the consolidated set of lessons.
What generalizes across sports
About two-thirds of the playbook is the same regardless of sport. The infrastructure is identical. The risk-management primitives are identical. The CLV-versus-win-rate distinction applies everywhere. Kelly-criterion sizing applies everywhere. The "cap your maximum edge" lesson applies everywhere.
The first time we ported a working bot to a new sport, we expected to rebuild most of it. We rebuilt almost none. The model swapped, the feature pipeline swapped, the live data source swapped — but the trade lifecycle (signal → size → fire → settle → log) and the monitoring stack (CLV, drift alerts, circuit breakers) did not change.
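To make that concrete, here is a minimal sketch of the shared lifecycle, with the sport-specific model reduced to a single calibrated probability. The Signal fields, the quarter-Kelly multiplier, and the fire callback are illustrative assumptions, not our production interfaces:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Signal:
    market_id: str
    model_prob: float    # calibrated probability from the sport-specific model
    market_price: float  # contract price in (0, 1)

def kelly_fraction(p: float, price: float) -> float:
    """Full-Kelly fraction for a binary contract bought at `price`."""
    b = (1.0 - price) / price              # net odds received per unit staked
    return max(0.0, (p * b - (1.0 - p)) / b)

def run_lifecycle(signal: Signal, bankroll: float,
                  fire: Callable[[str, float], str],
                  trade_log: List[dict]) -> None:
    # signal -> size -> fire -> log; settlement is tracked by a separate job
    stake = 0.25 * kelly_fraction(signal.model_prob, signal.market_price) * bankroll
    if stake <= 0:
        return
    order_id = fire(signal.market_id, stake)          # exchange call
    trade_log.append({"order": order_id,
                      "entry": signal.market_price,
                      "model_prob": signal.model_prob})
```

Swapping sports means swapping what produces the Signal; nothing below that line changes.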
This is the strongest argument for a unified codebase: most prediction-market trading is the same problem with different inputs. Specializing too early (building an NBA-specific bot, an NHL-specific bot, a CS2-specific bot) makes the cross-sport infrastructure lessons invisible.
What does not generalize
The sport-specific layer is where the work concentrates. A few patterns:
Clock semantics. Basketball has quarters with stoppage. Hockey has periods with overtime. Baseball has innings with no clock. Soccer has elapsed minutes plus stoppage time displayed as "45'+3'". Tennis has sets and games and points and tiebreaks. Each clock requires its own state model. The unifying abstraction has to support all of them.
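A sketch of one such abstraction, under the assumption that every sport's clock can be reduced to segments plus optional time remaining (the field names are illustrative, not our production schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GameClock:
    segment_kind: str                  # "quarter", "period", "inning", "half", "set"
    segment_number: int                # 1-based index of the current segment
    segments_total: int                # 4 quarters, 3 periods, 9 innings, ...
    seconds_remaining: Optional[float] = None  # None for untimed sports
    segment_seconds: Optional[float] = None    # regulation length of one segment
    stoppage_seconds: float = 0.0              # carries soccer's "45'+3'" added time

def fraction_complete(c: GameClock) -> float:
    """Approximate regulation progress in [0, 1]; overtime clamps to 1."""
    done = c.segment_number - 1
    if c.seconds_remaining is None or c.segment_seconds is None:
        return min(1.0, done / c.segments_total)   # count-based sports
    played = c.segment_seconds - c.seconds_remaining
    return min(1.0, (done * c.segment_seconds + played)
                    / (c.segments_total * c.segment_seconds))
```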
Outcome cardinality. Most sports are binary (home wins or away wins). Soccer is ternary (home / draw / away). Some bots and dashboards assume binary throughout. Adding soccer correctly means refactoring those assumptions. Better to design for n-ary outcomes from day one.
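A minimal sketch of what day-one n-ary design can look like: the market carries a list of outcomes rather than a home/away pair, so the binary and ternary code paths are identical (names are illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Market:
    outcomes: List[str]   # ["home", "away"] or ["home", "draw", "away"]
    prices: List[float]   # one contract price per outcome

def edges(market: Market, model_probs: List[float]) -> Dict[str, float]:
    """Model edge per outcome; the same code path for binary and ternary."""
    assert len(model_probs) == len(market.outcomes) == len(market.prices)
    return {o: p - q for o, p, q
            in zip(market.outcomes, model_probs, market.prices)}
```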
Score grid behavior. Basketball scores in 2s and 3s; hockey scores in 1s; baseball runs are essentially Poisson; soccer goals are very low Poisson; tennis points are Bernoulli. The right model family differs sport by sport. There is no master architecture that works for all of them — XGBoost is great for basketball and football, terrible for soccer (you need the draw outcome), and useless for tennis (the hierarchical structure dominates).
Where each sport surprised us
NCAAMB was easier than NBA. More games, less attention from sharp money, larger and more persistent edges. The 5,345-game NCAAMB backtest produced 4.39% Expected Calibration Error — the best of any sport in our coverage. Our biggest dollar profits come from this league.
Tennis was harder than expected. The hierarchical structure (point-game-set-match) makes the model genuinely complex, and the high-edge band is plagued by stale data more than in any other sport. ATP edges over 20 cents have a worse win rate than a coin flip in our 30-day data. We cap maximum edge at 20 cents.
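One way to operationalize that cap, assuming it means refusing trades whose apparent edge exceeds the threshold rather than clamping the size (the 20-cent ATP figure is from our data; the other constants are illustrative):

```python
# Only the 20-cent ATP cap comes from measured data; the rest are placeholders.
MAX_EDGE_CENTS = {"tennis_atp": 20.0}
DEFAULT_CAP = 25.0   # assumed cap for sports without a measured threshold
MIN_EDGE = 8.0       # assumed floor below which the edge is not worth trading

def tradable(sport: str, edge_cents: float) -> bool:
    """Apparent edges above the cap are treated as likely stale-data artifacts."""
    cap = MAX_EDGE_CENTS.get(sport, DEFAULT_CAP)
    return MIN_EDGE <= edge_cents <= cap
```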
Soccer required rethinking the entire outcome model. Three-way markets, Poisson scoring, league-strength normalization for international competitions: none of that exists in basketball or hockey. We built a separate Poisson process model for soccer, and it could not borrow much from the other bots.
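The core of such a model, sketched under the simplifying assumption of independent Poisson goal rates (production versions add score correlation and the league-strength adjustments mentioned above):

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    return lam ** k * exp(-lam) / factorial(k)

def three_way_probs(lam_home: float, lam_away: float, max_goals: int = 10):
    """P(home win), P(draw), P(away win) under independent Poisson goal rates."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away
```

Nothing like the draw accumulation exists in a binary-outcome bot, which is why so little code carried over.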
CS2 data was a nightmare for months. Round-level data is gated behind paid APIs and scraped sources of varying reliability, and model accuracy depends heavily on which fallback tier is feeding the bot. We eventually built a four-tier fallback chain (bo3.gg + combined → map ML → series ML → Elo binomial). The chain is what makes CS2 trading viable.
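A sketch of the fallback pattern, with hypothetical fetcher names standing in for the real sources; the essential detail is that the chain records which tier fed the model, so accuracy can be tracked per tier:

```python
from typing import Callable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[dict]]

def fetch_with_fallback(match_id: str,
                        tiers: List[Tuple[str, Fetcher]]
                        ) -> Tuple[Optional[dict], str]:
    """Try each (tier_name, fetcher) in order; return data plus the tier name."""
    for name, fetch in tiers:
        try:
            data = fetch(match_id)
        except Exception:
            data = None   # treat a failed source the same as a missing one
        if data is not None:
            return data, name
    return None, "none"

# The tier list might look like (fetcher names are hypothetical):
# [("round_level", fetch_bo3gg), ("map_ml", fetch_map_ml),
#  ("series_ml", fetch_series_ml), ("elo_binomial", elo_prior)]
```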
CFB we killed. It was negative across every edge bucket. The model is the issue: feature coverage is patchy across teams, and home-field advantage is much larger and more team-specific than in other sports. It stays disabled until we rebuild the feature set.
Operational lessons
Settlement reliability matters more than price latency. Missing a settlement (because the bot was offline when the contract resolved) costs more in CLV data quality than any single late entry. We invested heavily in restart-safe position tracking. Worth every hour.
Drift alerts are necessary, not optional. Models drift. Players retire. Strategies adapt. A model that was well-calibrated in February will not be well-calibrated in May without intervention. Our recalibration runs nightly, and a model-health alert fires if Expected Calibration Error rises past a threshold. Without this, we would have shipped degraded models for weeks before noticing.
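A minimal sketch of the health check, using standard binned ECE; the bin count and alert threshold here are illustrative, not our production values:

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, outcomes: np.ndarray,
                               n_bins: int = 10) -> float:
    """Binned ECE: occupancy-weighted gap between predicted and realized rates."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap    # weight each bin by its occupancy
    return ece

def model_health_alert(probs, outcomes, threshold: float = 0.06) -> bool:
    # threshold is a placeholder; tune it against each sport's baseline ECE
    return expected_calibration_error(np.asarray(probs, dtype=float),
                                      np.asarray(outcomes, dtype=float)) > threshold
```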
Per-sport circuit breakers save you from your own mistakes. When a sport's rolling 30-day ROI drops below -5%, the bot disables that sport automatically. It re-enables when the rolling ROI recovers above 0%. Self-healing. The breaker has saved us from at least three multi-week bleeds where we would have manually rationalized continuing to trade.
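The breaker itself is a few lines of hysteresis. This sketch uses the -5%/0% thresholds from above, with state handling simplified for illustration; the gap between trip and reset is what prevents flapping at the boundary:

```python
TRIP_ROI, RESET_ROI = -0.05, 0.0   # trip below -5%, re-enable above 0%

class SportBreaker:
    def __init__(self) -> None:
        self.enabled = True

    def update(self, rolling_30d_roi: float) -> bool:
        if self.enabled and rolling_30d_roi < TRIP_ROI:
            self.enabled = False       # disable the sport automatically
        elif not self.enabled and rolling_30d_roi > RESET_ROI:
            self.enabled = True        # self-healing re-enable
        return self.enabled
```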
The non-trading work eats more time than the trading. Building the bot is the small project. The long tail is monitoring it, alerting on drift, managing data feeds, handling exchange API changes, and patching bugs. Budget for it.
Strategic lessons
Edge is everywhere, and it is small. The biggest edges (20+ cents) are usually model errors. The middle of the distribution (8-18 cents) is where the actual money is. We learned this in tennis, then re-learned it in NHL, then re-learned it in MLB. It is the most universal lesson in the book.
Cross-sport scaling beats single-sport optimization. Spending another month tuning the NBA bot to squeeze 0.3 more cents per trade was always worse than spending the same month adding a new sport with greenfield edge. Diversification across uncorrelated sports also smooths the equity curve dramatically.
The market gets more efficient over time. Edges that existed 12 months ago may not exist today. Continuous retraining and continuous edge monitoring are the only defense. Models that stop being updated stop making money.
What we still have not figured out
A few open questions:
- Optimal exit timing. We currently hold all positions to settlement; we have not found an exit rule that beats hold-to-settlement on backtests, but we suspect one exists for very large in-game swings.
- Cross-sport correlation in drawdowns. When the model is having a bad day on NBA, is it also more likely to be having a bad day on NHL? We see correlations in the data but have not turned them into a portfolio-level risk control.
- Optimal model retraining cadence. We retrain weekly. We do not know if daily would be better — the retraining itself introduces variance, and there is a tradeoff we have not fully characterized.
The bottom line
Running a multi-sport book is mostly the same job done eleven times with different inputs. The shared infrastructure is the moat. The sport-specific models are the entry tickets. The discipline of measuring edge per sport per bucket, killing the losing buckets, and resisting the urge to "fix" working bots is what separates the books that compound from the books that drift.
11-sport calibrated probability API
ZenHodl runs the production bots described in this post and publishes the per-sport calibration metrics. Seven-day free trial.