Live · Model

F1 Race Predictor

Three forecasters, one honest scorecard. The deliverable is the evaluation, not the prediction.

What this shows: I can test an AI system honestly, with no peeking at future data, probabilities that mean what they say, and the result published even when the AI loses.

Outcome

A calibrated probability for every driver, every race, graded against the real finishing order once it exists.

Proof

77 races tested. When the model says 70%, it lands about 70% of the time. Claude scored 24% worse than the baseline, and that result is kept in, not hidden.

Takeaway

Most of F1’s predictable signal is just where you start. The scorecard mattered more than the model.

Predicted order vs actual

Monaco GP · projected order vs actual finish · LIVE

Monaco 2026, as it crossed the line

real result

At Monaco, overtaking is nearly impossible, so grid position usually decides the finish; the model leaned on that and called Antonelli's pole-to-win at 76%. Six cars retired, three of them our own podium picks (orange): externalities no pre-race data can see.

Legend: ANT · Mercedes | HAM · Ferrari | HAD · Red Bull | dnf · externality we can't predict | rest of the field

Next race

projection · pre-qualifying

Round 8 · 28 Jun 2026

Austrian Grand Prix

Red Bull Ring, Spielberg

PROJECTED PODIUM

Hamilton

Ferrari

47% to win

Antonelli

Mercedes

50% podium

Russell

Mercedes

46% podium

The percentages are the model’s track record over 72 of the 77 past races (the earliest races only train it, so they aren’t scored): how often its pre-qualifying P1 pick went on to win, and its P2 and P3 picks reached the podium.

Why these three

Hamilton won the last round at Barcelona, and the form-only model reads recent form and team pace, so it now makes him the pick for the win, even though Antonelli still leads the championship on 143 points (Hamilton 104, Russell 85). All three have run at the front all season.

What changes Saturday

Grid position is the model's strongest feature, and qualifying has not happened yet. Once the real grid exists, all three forecasters rerun and the win and podium probabilities update before lights out. The form model likes Hamilton, but qualifying pace has belonged to Mercedes all year.

Qualifying call

On this season's one-lap form, the front of the grid is Mercedes: Antonelli (four poles from seven) and Russell (the other three, including Barcelona). Mercedes has taken every pole in 2026, so Hamilton's win case rests on race pace rather than starting position.

ON THE HORIZON

R9 · Britain · 5 Jul · Hamilton (form pick)

R10 · Belgium · 19 Jul · Hamilton (form pick)

The season so far

stats model · graded

Every 2026 round, the model’s predicted podium against what actually happened. Green means a podium pick landed. Locked in after qualifying, before the race.

Round	Our podium	Actual podium	Winner called
R1 Australia	RUS ANT LEC	RUS ANT LEC	✓
R2 China	RUS ANT LEC	ANT RUS HAM	×
R3 Japan	ANT RUS LEC	ANT PIA LEC	✓
R4 Miami	ANT RUS LEC	ANT NOR PIA	✓
R5 Canada	ANT RUS NOR	ANT HAM VER	✓
R6 Monaco	ANT HAM VER	ANT HAM GAS	✓
R7 Barcelona	ANT HAM RUS	HAM RUS NOR	×

Called the winner in 5 of 7 · landed 13 of 21 podium picks. The wins are easy; the third step is where it’s hard.
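The tally above can be reproduced directly from the table. This is a minimal sketch, assuming a podium pick "lands" when the driver finishes anywhere on the podium, not necessarily on the predicted step; the names `ROUNDS` and `tally` are illustrative and not taken from the project's code.

```python
# Podium picks and actual podiums copied from the season table.
# Each entry: (round, predicted podium P1-P3, actual podium P1-P3).
ROUNDS = [
    ("R1 Australia", ["RUS", "ANT", "LEC"], ["RUS", "ANT", "LEC"]),
    ("R2 China",     ["RUS", "ANT", "LEC"], ["ANT", "RUS", "HAM"]),
    ("R3 Japan",     ["ANT", "RUS", "LEC"], ["ANT", "PIA", "LEC"]),
    ("R4 Miami",     ["ANT", "RUS", "LEC"], ["ANT", "NOR", "PIA"]),
    ("R5 Canada",    ["ANT", "RUS", "NOR"], ["ANT", "HAM", "VER"]),
    ("R6 Monaco",    ["ANT", "HAM", "VER"], ["ANT", "HAM", "GAS"]),
    ("R7 Barcelona", ["ANT", "HAM", "RUS"], ["HAM", "RUS", "NOR"]),
]


def tally(rounds):
    """Return (winners called, podium picks landed, podium picks made)."""
    # The winner is called when the predicted P1 matches the actual P1.
    winners = sum(pred[0] == actual[0] for _, pred, actual in rounds)
    # Assumption: a pick lands if the driver is on the podium in any position.
    hits = sum(len(set(pred) & set(actual)) for _, pred, actual in rounds)
    return winners, hits, 3 * len(rounds)


winners, hits, picks = tally(ROUNDS)
print(f"Called the winner in {winners} of {len(ROUNDS)}; "
      f"landed {hits} of {picks} podium picks")
```

Scoring by set overlap rather than exact step is what makes China count as two hits even though the winner call was wrong.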

At a glance

Races tested (2023-26): 77
Forecasters compared: 3
Claude's odds vs. just using the grid order: 24% worse
Places off per driver, on average: 2.78

How it works

1. Gather
Race and qualifying results for 2023 to 2026 become nine pre-race clues per driver, including grid slot, quali gap, recent form, team pace, and track history. Strictly nothing from the race being predicted.
2. Predict
Three forecasters fill in the same form: a naive baseline (you finish where you start), a statistical model trained only on past races, and Claude reasoning over a written pre-race brief.
3. Grade
Proper scoring rules (Brier score, log loss, skill vs the baseline) plus calibration curves: when it says 70%, does that happen 70% of the time?
4. Track
Every forecaster is graded race by race across the season, and the next race is always called before lights out, so the prediction is locked in before the result exists.
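For readers unfamiliar with the scoring rules named in the Grade step, here is a minimal sketch of the textbook definitions for a binary event such as "driver wins" or "driver reaches the podium". The function names are illustrative, and the skill formula (one minus the ratio of scores) is a common convention that I am assuming here; the project's exact definition is not stated on this page.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared gap between forecast probability and the 0/1 outcome.

    Lower is better; 0.0 is a perfect forecast.
    """
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def log_loss(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the observed 0/1 outcomes.

    Probabilities are clipped to [eps, 1 - eps] so a confident miss is
    punished heavily but never produces an infinite score.
    """
    total = 0.0
    for p, o in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= o * math.log(p) + (1 - o) * math.log(1.0 - p)
    return total / len(probs)


def skill(score, baseline_score):
    """Skill relative to a baseline (assumed convention: 1 - score/baseline).

    Positive means better than the baseline, negative means worse.
    """
    return 1.0 - score / baseline_score


# Toy numbers, not project data: three drivers, the first one wins.
toy_probs = [0.7, 0.2, 0.1]
toy_outcomes = [1, 0, 0]
print(brier_score(toy_probs, toy_outcomes))
print(log_loss(toy_probs, toy_outcomes))
```

Under that assumed convention, a forecaster whose score is 0.88 times the baseline's has a skill of about +0.12, and one whose score is 1.24 times the baseline's has a skill of about -0.24.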

Honest about it

A prediction is only worth the eval behind it. So I keep score: three forecasters, every race, graded against what actually happened.

How often do we call the winner?

2026 · 7 rounds

naive baseline · our model

Places off our winner pick, race by race. The winner is usually the pole-sitter, so both call it five times in seven. Both miss Barcelona, where Hamilton won from P2; our model also slips in China, the baseline in Canada, where its pick led then retired. Calling the winner is the easy part.

real output

Problem

Prediction posts are easy to fake after the fact, and LLMs make it worse: past seasons sit in their training data, so a strong backtest proves memory, not skill. I wanted calls put on the record before each race, and an evaluation I could actually trust.

Approach

Three forecasters emit the same output, so they compete like for like. The statistical model only sees earlier races, automated tests prove no future data leaks in, and Claude is graded only on races after its training cutoff. The rest of 2026 is the live test: every pick is locked in after qualifying, before the race.
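One way an automated leak test can work is to tamper with the race being predicted and everything after it, then check that the features do not change. The sketch below assumes a simplified `Result` record and a single hypothetical feature, `recent_form`; the real project uses nine features and its own test suite, neither of which is shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    race_index: int  # chronological position of the race
    driver: str
    grid: int
    finish: int


def recent_form(history, driver, race_index, window=3):
    """Mean finish over the driver's last `window` races strictly before
    `race_index`. Returns None if the driver has no earlier races."""
    past = [
        r.finish
        for r in sorted(history, key=lambda r: r.race_index)
        if r.driver == driver and r.race_index < race_index
    ]
    recent = past[-window:]
    return sum(recent) / len(recent) if recent else None


def test_no_future_leak():
    history = [Result(i, "ANT", grid=1, finish=i + 1) for i in range(5)]
    before = recent_form(history, "ANT", race_index=3)

    # Corrupt the race being predicted and every race after it.
    tampered = [
        Result(r.race_index, r.driver, r.grid, 99) if r.race_index >= 3 else r
        for r in history
    ]
    after = recent_form(tampered, "ANT", race_index=3)

    # If the feature only reads earlier races, tampering cannot move it.
    assert before == after


test_no_future_leak()
```

The same tamper-and-compare pattern extends to any feature: if changing future rows changes the output, the feature leaks.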

Eval results

Two honest findings. The baseline just predicts the starting grid order. The stats model beat it by about 12% on win probability, while Claude scored 24% worse than the baseline, the kind of negative result most write-ups quietly drop. Almost all the signal is one thing: where you start. Remove grid position and podium error jumps about 20%; remove anything else and nothing moves. On ranking, the stats model lands within 2.78 places of each driver's real finish on average, under three spots off. The plot below is the part I trust most: when the model says 70%, it happens about 70% of the time.
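The "places off" figure is a mean absolute position error. A minimal sketch of that metric follows, assuming both orders list the same classified drivers; how the project treats retirements is not stated here, so this version simply skips any driver missing from the actual order. The function name and the four-driver example are illustrative, not project data.

```python
def mean_places_off(predicted_order, actual_order):
    """Average absolute gap between predicted and actual finishing position.

    Both arguments are lists of driver codes, best position first.
    Assumption: drivers absent from `actual_order` are skipped.
    """
    actual_pos = {driver: pos for pos, driver in enumerate(actual_order, start=1)}
    errors = [
        abs(pos - actual_pos[driver])
        for pos, driver in enumerate(predicted_order, start=1)
        if driver in actual_pos
    ]
    return sum(errors) / len(errors)


# Toy example: the predicted winner finishes last of four,
# and everyone else moves up one place.
predicted = ["ANT", "HAM", "RUS", "NOR"]
actual = ["HAM", "RUS", "NOR", "ANT"]
print(mean_places_off(predicted, actual))  # (3 + 1 + 1 + 1) / 4 = 1.5
```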

Can you trust the probabilities?

every race, 2023-26

spot on · close · missed

When the model says a driver has a 70% shot at the podium, it lands there about 70% of the time. Points sitting on the line mean the numbers mean what they say.
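A calibration plot like this one is typically built by grouping forecasts into probability bins and comparing each bin's average forecast with how often the event actually happened. The sketch below assumes ten equal-width bins; the bin count and the function name are my choices, not details taken from the project.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins.

    Returns one (mean forecast, observed hit rate, count) row per
    non-empty bin. A calibrated model has mean forecast close to the
    observed hit rate in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, o))

    rows = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        hit_rate = sum(o for _, o in members) / len(members)
        rows.append((mean_p, hit_rate, len(members)))
    return rows


# Toy data: ten forecasts of 70%, seven of which came true.
for mean_p, hit_rate, count in calibration_table([0.7] * 10, [1] * 7 + [0] * 3):
    print(f"forecast {mean_p:.2f} -> observed {hit_rate:.2f} (n={count})")
```

Plotting mean forecast against observed hit rate for each row gives the points on the chart; the diagonal is perfect calibration.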

What broke

A free data API silently returned four empty races after rate-limiting; validation caught it, because the API raised no error. Grid position 0 means a pit-lane start, which a model reads as better than pole. And the LLM sometimes returns duplicate finishing positions, so the schema rejects loudly and a deterministic repair re-ranks. The lesson that stuck: the eval design mattered more than the model, and most of the work was keeping the test fair.
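The three failures above each map to a small guard. This is a hedged sketch of what such guards could look like, not the project's code: the function names, the choice to place a pit-lane start one slot behind the last grid position, and the alphabetical tie-break are all assumptions made for illustration.

```python
def find_empty_races(races):
    """Return the ids of races whose result list came back empty.

    `races` maps a race id to its list of result rows. An empty list is
    treated as a failed fetch, since the API may not raise an error.
    """
    return [race_id for race_id, rows in races.items() if not rows]


def normalize_grid(grid, field_size):
    """Map the pit-lane code 0 to a slot behind the whole grid.

    Assumption: a pit-lane start is treated as position field_size + 1,
    so the model no longer reads 0 as better than pole (1).
    """
    return field_size + 1 if grid == 0 else grid


def has_duplicates(positions):
    """True if two drivers share a predicted finishing position."""
    values = list(positions.values())
    return len(values) != len(set(values))


def repair_ranking(positions):
    """Turn possibly duplicated positions into a strict 1..N ranking.

    Drivers keep their relative order; ties are broken by driver code
    (an assumed rule) so the repair is deterministic.
    """
    ordered = sorted(positions, key=lambda d: (positions[d], d))
    return {driver: rank for rank, driver in enumerate(ordered, start=1)}


# Toy LLM output with HAM and RUS both predicted P2.
raw = {"ANT": 1, "HAM": 2, "RUS": 2, "NOR": 4}
if has_duplicates(raw):
    print("schema rejected duplicate positions, repairing")
    raw = repair_ranking(raw)
print(raw)
```

Keeping the rejection loud and the repair separate means the duplicate is still logged as a model failure even though the forecast remains usable.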