Benchmarks & reproductions
The studio’s core promise: anything you can train here that’s labelled as paper-backed will reproduce the paper’s published number. Not “close to” — match it, within reasonable tolerance for stochastic training.
The /benchmarks page is the live leaderboard for that claim.
What /benchmarks shows
A table, one row per paper-backed prior in the catalog. Each row carries:
- Study — the prior’s display name + the paper citation (e.g. Müller 2022, Adriaensen 2023)
- Metric · paper number — for every metric the paper reports, the number the paper claims, with a citation back to the table or figure it came from
- Try the reproduction → — opens the wizard with the paper-pinned route pre-armed for that study
Pick a row, hit Try the reproduction, train. The trained brain’s page shows a 📄 Paper reproduction card that:
- Reads the metrics produced by the training run
- Pairs each metric with the number the paper reported
- Emits a verdict per row — ✅ pass / ❌ fail / ⏳ pending / ⊘ skipped
- Rolls up an overall verdict across all rows
That verdict is the falsification gate. If the studio’s training produces a number the paper didn’t, the card shows ❌ and you know the reproduction didn’t hold.
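The card's logic described above can be sketched as follows. This is a minimal illustration, not the studio's actual implementation: the names `MetricRow`, `row_verdict`, and `overall_verdict`, the 5% relative tolerance, the symmetric comparison, and the roll-up precedence (fail, then pending, then pass) are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    PENDING = "pending"
    SKIPPED = "skipped"


@dataclass
class MetricRow:
    name: str
    reported: Optional[float]  # number the paper reports; None if not authored yet
    measured: Optional[float]  # number from the training run; None if not available yet
    rel_tol: float = 0.05      # hypothetical tolerance for stochastic training
    skipped: bool = False      # hypothetical flag for metrics that do not apply to a run


def row_verdict(row: MetricRow) -> Verdict:
    """Compare one measured metric against the paper's reported number."""
    if row.skipped:
        return Verdict.SKIPPED
    if row.reported is None or row.measured is None:
        return Verdict.PENDING
    allowed = row.rel_tol * abs(row.reported)
    if abs(row.measured - row.reported) <= allowed:
        return Verdict.PASS
    return Verdict.FAIL


def overall_verdict(rows: list[MetricRow]) -> Verdict:
    """Roll up per-row verdicts: any fail wins, then pending, then pass."""
    verdicts = [row_verdict(r) for r in rows]
    if Verdict.FAIL in verdicts:
        return Verdict.FAIL
    if Verdict.PENDING in verdicts:
        return Verdict.PENDING
    if Verdict.PASS in verdicts:
        return Verdict.PASS
    return Verdict.SKIPPED
```

Under these assumptions, a single failing row is enough to fail the whole card, which matches the idea of the verdict as a falsification gate.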
Why this works
Three pieces make the claim provable:
- Paper-pinned route — when you pick a single paper-backed prior, the wizard locks the model architecture, hyperparameters, prior overrides, and eval suite to the paper’s own setup. The wizard’s recipe sliders are ignored. See Paper reproductions.
- Paper numbers in the catalog — every paper-backed listing ships with the paper’s published numbers attached, each one citing the table or figure it came from.
- Auto-comparison — the brain page’s reproduction card pairs measured against reported, per metric.
So the claim “this studio reproduces published research” isn’t a marketing promise — it’s a button you can press.
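The locking rule for the paper-pinned route can be illustrated with a small sketch. Everything here is hypothetical: the `Prior` class, the `pinned_setup` field, the `route` key, and the fallback to slider values when more than one prior is selected are assumptions for illustration, not the wizard's real data model.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Prior:
    name: str
    # Hypothetical: the paper's own setup (architecture, hyperparameters,
    # prior overrides, eval suite). None means the prior is not paper-backed.
    pinned_setup: Optional[dict[str, Any]] = None

    @property
    def paper_backed(self) -> bool:
        return self.pinned_setup is not None


def resolve_config(selected: list[Prior], sliders: dict[str, Any]) -> dict[str, Any]:
    """Return the training config for the current wizard selection."""
    if len(selected) == 1 and selected[0].paper_backed:
        # Paper-pinned route: the recipe sliders are ignored entirely.
        return {"route": "paper-pinned", **selected[0].pinned_setup}
    # Assumed fallback: any other selection trains from the slider values.
    return {"route": "custom", **sliders}
```

The point of the sketch is that slider values never leak into a pinned run, so the measured numbers stay comparable to the paper's.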
“Reproducible study” badge
Wherever a paper-backed prior shows up in the studio, you’ll see a green ✓ Reproducible study chip:
- On marketplace tiles
- On the speciality picker’s prior sub-grid (when applicable)
- Inside the benchmarks table itself
The badge’s tooltip cites the backing paper. The badge sits on the listing and says only that the claim is reproducible; the per-run 📄 Paper reproduction card on /brain/:id shows the actual verdict for a specific training run.
What the badge does not claim
- It doesn’t say “we re-ran this paper today and it passed.” That requires the per-brain card on /brain/:id.
- It doesn’t say “this is the only way to train this PFN.” You can fork, edit, and train however you like; the badge only applies to the paper-pinned route.
- It doesn’t compare against unpublished or proprietary baselines. Only what the paper itself reports.
When a row’s number is blank
Some baselines are computed at runtime from the eval dataset rather than read from the paper’s table (e.g. a seasonal-naive forecast computed against the held-out test split). Those rows show “placeholder — fills at runtime” on the benchmarks page; the per-brain reproduction card surfaces the real number after training completes.
Every paper-backed listing eventually carries either a literal number from the paper or a runtime-computed baseline. A blank cell never means a number can’t be verified — only that it hasn’t been authored or trained yet.
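As an illustration of a runtime-computed baseline, here is a minimal sketch of a seasonal-naive forecast scored against a held-out test split. The function names, the choice of mean absolute error as the metric, and the toy data are assumptions for the example; the studio's own eval suite may compute this differently.

```python
from typing import Sequence


def seasonal_naive_forecast(train: Sequence[float], horizon: int, season: int) -> list[float]:
    """Forecast each future step with the value from one season earlier."""
    if season <= 0 or len(train) < season:
        raise ValueError("need at least one full season of training data")
    last_season = list(train[-season:])
    # Step h (0-based) repeats the matching position in the last observed season.
    return [last_season[h % season] for h in range(horizon)]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(actual) == 0 or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy data with a season length of 4 (illustrative values only).
train = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
test = [1.0, 2.0, 3.0, 5.0, 1.0, 3.0]

forecast = seasonal_naive_forecast(train, horizon=len(test), season=4)
baseline_mae = mean_absolute_error(test, forecast)
```

Because the baseline depends on the test split itself, it cannot be read from the paper's table, which is why such a row stays a placeholder until a run completes.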
Upcoming paper-backed priors
The most impactful additions on our list:
| Paper | What it adds |
|---|---|
| TabPFN (Hollmann 2022 / 2025 Nature) | Flips Sort into categories live |
| ForecastPFN (Dooley 2023) | Strengthens Forecast forward with a second paper-backed option |
| Drift-Resilient TabPFN (Helli 2024) | Opens a new Handle drift capability |
| MotherNet | Broadens Predict a number and Sort into categories |
See the roadmap for the full picture.
Related
- Paper reproductions — the wizard’s paper-pinned route in detail
- The wizard — how capabilities map to paper-backed priors
- Capabilities reference — which capabilities are paper-backed today