Benchmarks & reproductions

The studio’s core promise: anything you can train here that’s labelled as paper-backed will reproduce the paper’s published numbers. Not “close to”: the run has to match them, within a reasonable tolerance for stochastic training.

The /benchmarks page is the live leaderboard for that claim.

What `/benchmarks` shows

A table, one row per paper-backed prior in the catalog. Each row carries:

Study — the prior’s display name + the paper citation (e.g. Müller 2022, Adriaensen 2023)
Metric · paper number — for every metric the paper reports, the number the paper claims, with a citation back to the table or figure it came from
Try the reproduction → — opens the wizard with the paper-pinned route pre-armed for that study

Pick a row, hit Try the reproduction, train. The trained brain’s page shows a 📄 Paper reproduction card that:

Reads the metrics produced by the training run
Pairs each metric with the number the paper reported
Emits a verdict per row — ✅ pass / ❌ fail / ⏳ pending / ⊘ skipped
Rolls up an overall verdict across all rows

That verdict is the falsification gate. If a training run produces a number that does not match the one the paper reported, the card shows ❌ and you know the reproduction didn’t hold.
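The card’s per-row and roll-up logic can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `MetricRow`, `row_verdict`, and `overall_verdict`, the 5% relative tolerance, and the roll-up rule (any fail means fail, otherwise any pending means pending) are hypothetical and do not describe the studio’s actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    PENDING = "pending"
    SKIPPED = "skipped"


@dataclass
class MetricRow:
    name: str
    reported: Optional[float]  # number the paper claims; None if not filled in yet
    measured: Optional[float]  # number from the training run; None if not produced yet
    skipped: bool = False      # True when the metric was deliberately not evaluated


def row_verdict(row: MetricRow, rel_tol: float = 0.05) -> Verdict:
    """Compare one measured metric against the paper's number.

    rel_tol is a hypothetical tolerance, not a documented studio setting.
    """
    if row.skipped:
        return Verdict.SKIPPED
    if row.reported is None or row.measured is None:
        return Verdict.PENDING
    allowed = rel_tol * abs(row.reported)
    if abs(row.measured - row.reported) <= allowed:
        return Verdict.PASS
    return Verdict.FAIL


def overall_verdict(rows: List[MetricRow], rel_tol: float = 0.05) -> Verdict:
    """Roll per-row verdicts up into one verdict (assumed rule, see lead-in)."""
    scored = [v for v in (row_verdict(r, rel_tol) for r in rows)
              if v is not Verdict.SKIPPED]
    if not scored:
        return Verdict.SKIPPED
    if Verdict.FAIL in scored:
        return Verdict.FAIL
    if Verdict.PENDING in scored:
        return Verdict.PENDING
    return Verdict.PASS


# Illustrative values only; these are not numbers from any paper.
example_rows = [
    MetricRow("accuracy", reported=0.90, measured=0.91),
    MetricRow("mae", reported=1.00, measured=1.20),
]
example_overall = overall_verdict(example_rows)
```

With these made-up values the first row falls inside the assumed tolerance and the second does not, so the roll-up is a fail: one metric that misses is enough to say the reproduction didn’t hold.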

Why this works

Three pieces make the claim provable:

Paper-pinned route — when you pick a single paper-backed prior, the wizard locks the model architecture, hyperparameters, prior overrides, and eval suite to the paper’s own setup. The wizard’s recipe sliders are ignored. See Paper reproductions.
Paper numbers in the catalog — every paper-backed listing ships with the paper’s published numbers attached, each one citing the table or figure it came from.
Auto-comparison — the brain page’s reproduction card pairs measured against reported, per metric.

So the claim “this studio reproduces published research” isn’t a marketing promise — it’s a button you can press.
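As a rough illustration of what “locks” means in the paper-pinned route, the sketch below resolves the recipe a run would train with. It is written under assumptions: `resolve_recipe`, the field names in `PINNED_FIELDS`, and all values are hypothetical stand-ins for the four things the wizard pins, not the studio’s real data model.

```python
from typing import Any, Dict, Optional

# The four things the paper-pinned route locks (field names are hypothetical).
PINNED_FIELDS = ("architecture", "hyperparameters", "prior_overrides", "eval_suite")


def resolve_recipe(slider_recipe: Dict[str, Any],
                   pinned: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the recipe a training run would use.

    Without a pinned setup, the slider values are used as-is.
    With a pinned setup, slider values are ignored entirely.
    """
    if pinned is None:
        return dict(slider_recipe)  # shallow copy of the user's choices
    missing = [f for f in PINNED_FIELDS if f not in pinned]
    if missing:
        raise ValueError(f"pinned setup is incomplete, missing: {missing}")
    return {f: pinned[f] for f in PINNED_FIELDS}


# Placeholder values for illustration only.
sliders = {"architecture": "custom", "hyperparameters": {"learning_rate": 0.01}}
paper_setup = {
    "architecture": "paper-architecture",
    "hyperparameters": {"learning_rate": 0.0001},
    "prior_overrides": {},
    "eval_suite": "paper-eval-suite",
}
pinned_recipe = resolve_recipe(sliders, paper_setup)
free_recipe = resolve_recipe(sliders, None)
```

Here `pinned_recipe` carries only the paper’s setup, whatever the sliders say, while `free_recipe` keeps the slider values. Rejecting an incomplete pinned setup is a design choice of this sketch: a partly pinned run could not honestly be compared against the paper.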

“Reproducible study” badge

Wherever a paper-backed prior shows up in the studio, you’ll see a green ✓ Reproducible study chip:

On marketplace tiles
On the speciality picker’s prior sub-grid (when applicable)
Inside the benchmarks table itself

The badge’s tooltip cites the backing paper. The badge appears on the listing (the claim is reproducible); the per-run 📄 Paper reproduction card on /brain/:id shows the actual verdict for a specific training run.

What the badge does not claim

It doesn’t say “we re-ran this paper today and it passed.” For that, check the per-brain card on /brain/:id.
It doesn’t say “this is the only way to train this PFN.” You can fork, edit, and train however you like — the badge only applies to the paper-pinned route.
It doesn’t compare against unpublished or proprietary baselines. Only what the paper itself reports.

When a row’s number is blank

Some baselines are computed at runtime from the eval dataset rather than read from the paper’s table (e.g. a seasonal-naive forecast computed against the held-out test split). Those rows show “placeholder — fills at runtime” on the benchmarks page; the per-brain reproduction card surfaces the real number after training completes.

Every paper-backed listing eventually carries either a literal number from the paper or a runtime-computed baseline. A blank cell never means a number can’t be verified — only that it hasn’t been authored or trained yet.
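To make the runtime-computed case concrete, here is a minimal sketch of a seasonal-naive baseline scored against a held-out test split. The function names, the choice of mean absolute error as the metric, and the toy series are assumptions for illustration; the text above does not say which metric or season length the studio uses.

```python
from typing import List, Sequence


def seasonal_naive_forecast(history: Sequence[float], horizon: int,
                            season_length: int) -> List[float]:
    """Forecast by repeating the last full season observed in the history."""
    if season_length <= 0:
        raise ValueError("season_length must be positive")
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = list(history[-season_length:])
    return [last_season[i % season_length] for i in range(horizon)]


def mean_absolute_error(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy data for illustration only: two seasons of history, one season held out.
train_split = [10, 20, 30, 40, 12, 22, 32, 42]
test_split = [14, 24, 30, 40]

baseline_forecast = seasonal_naive_forecast(train_split, horizon=len(test_split),
                                            season_length=4)
baseline_mae = mean_absolute_error(test_split, baseline_forecast)
print(baseline_forecast, baseline_mae)  # prints [12, 22, 32, 42] 2.0
```

Because the baseline depends on the actual test split, it can only be known once the eval data is loaded, which is why such a cell cannot be filled in from the paper ahead of time.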

Upcoming paper-backed priors

The most impactful additions on our list:

Paper	What it adds
TabPFN (Hollmann 2022 / 2025 Nature)	Takes the Sort into categories capability live
ForecastPFN (Dooley 2023)	Strengthens Forecast forward with a second paper-backed option
Drift-Resilient TabPFN (Helli 2024)	Opens a new Handle drift capability
MotherNet	Broadens Predict a number and Sort into categories

See the roadmap for the full picture.

Paper reproductions — the wizard’s paper-pinned route in detail
The wizard — how capabilities map to paper-backed priors
Capabilities reference — which capabilities are paper-backed today