# Three models, one lab, twelve points
Added three OpenAI models to the Whet Benchmark — legacy mini, new nano, current flagship. The gap between worst and 4th place is 12 points. Inside the same lab. Along the way, Opus 4.7 nearly tied with Jamba and Cohere landed in the middle.
The previous post closed with a promise: compare Opus 4.6 vs 4.7 same-input and see if Anthropic's upgrade moves the needle on meta-prompt-following specifically. That comparison is here. But on the way to running it, three OpenAI models joined the cohort — and a finding about what "family" means inside a benchmark changed what I was planning to write.
This post delivers the old promise and accommodates the new surprise.
## The leaderboard today (12 models, 50 prompts)
| # | model | lab, tier | Δ | after | note |
|---|---|---|---|---|---|
| 1 | Jamba Large 1.7 | AI21 Labs (Israel), trial | +49.8 | 94.3 | |
| 2 | Claude Opus 4.7 (CLI) | Anthropic (USA), paid | +49.7 | 94.3 | +1.6 over 4.6 |
| 3 | Claude Sonnet (CLI) | Anthropic (USA), paid | +49.0 | 93.6 | |
| 4 | GPT-5.4 (OpenAI) | OpenAI (USA), paid | +46.4 | 91.0 | new |
| 5 | Mistral Small | Mistral AI (France), free | +45.0 | 89.5 | |
| 6 | Llama 3.3 70B (Groq) | Meta (USA) via Groq, free | +44.8 | 89.4 | |
| 7 | DeepSeek R1 | DeepSeek (China), free | +44.7 | 89.3 | |
| 8 | DeepSeek V3 | DeepSeek (China), free | +44.6 | 89.2 | |
| 9 | Command A (Cohere) | Cohere (Canada), trial | +41.0 | 85.5 | new |
| 10 | Gemini 2.5 Flash | Google (USA), free | +40.0 | 84.5 | |
| 11 | GPT-5 nano (OpenAI) | OpenAI (USA), paid | +37.0 | 81.6 | new |
| 12 | GPT-4o mini (OpenAI) | OpenAI (USA), paid | +34.2 | 78.7 | new |
Note: mean `scoreBefore` is still 44.5 for every model (same 50 prompts). Items flagged "new" entered after the last snapshot.
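The relationship between the table's columns can be sketched in a few lines. This is illustrative only — the field names and ranking logic are my reconstruction, not the real runner's schema; the numbers are rows from the table above.

```python
# Minimal sketch of how the leaderboard columns relate (illustrative
# field names, not the actual Whet runner schema).
SCORE_BEFORE = 44.5  # mean score of the raw prompts, shared by all models

rows = [
    ("Jamba Large 1.7", 94.3),
    ("Mistral Small", 89.5),
    ("Command A (Cohere)", 85.5),
    ("Gemini 2.5 Flash", 84.5),
    ("GPT-4o mini (OpenAI)", 78.7),
]

def delta(score_after: float, score_before: float = SCORE_BEFORE) -> float:
    """The delta column: improvement over the shared baseline."""
    return round(score_after - score_before, 1)

# Rank by the post-rewrite score, highest first.
leaderboard = sorted(rows, key=lambda r: r[1], reverse=True)
for rank, (model, after) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: +{delta(after)} -> {after}")
```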
## The finding that stacked: 12 points inside "OpenAI"
I integrated three models from the same company: GPT-4o mini (legacy, retired from ChatGPT in February but still live on the API), GPT-5 nano (reasoning tier budget model from the new generation), and GPT-5.4 (current flagship, released March 5). Same company, same training base, same declared alignment philosophy.
| model | tier | generation | Δ | overall rank |
|---|---|---|---|---|
| GPT-5.4 | flagship | GPT-5 | +46.4 | 4th / 12 |
| GPT-5 nano | nano (reasoning) | GPT-5 | +37.0 | 11th / 12 |
| GPT-4o mini | mini | GPT-4 | +34.2 | 12th / 12 |
Twelve points separate the top and bottom of that family, and since those two models are also 4th and last on the overall leaderboard, the family's internal spread spans most of the board. Inside the same lab.
The tempting initial hypothesis when gpt-4o-mini finished last: "OpenAI is systemically conservative at rewriting — their more intense RLHF produces models that over-preserve the original text instead of applying the requested adjustments." That hypothesis was only partly true. True for the cheap tier. True for the legacy generation. But the new flagship broke the pattern entirely — climbed to the top 5, shoulder-to-shoulder with Mistral, Llama, and DeepSeek.
And the new nano? Only 2.8 points above the legacy mini. The generational improvement landed at the flagship but didn't filter down uniformly to the smaller tiers. It's a pattern that's familiar from release discourse ("the benefit concentrates at the top tier") but striking to see so cleanly in a narrow benchmark.
## Why it matters for public benchmarks
Benchmarks that aggregate by provider ("OpenAI: X points") hide the variance users actually encounter. A team reading "OpenAI is good at instruction following" will pick the mini to save money and discover that the number they read was measured on the flagship. Granularity matters, and benchmarks that don't separate family from model are delivering the wrong number for the wrong decision.
In the Whet Benchmark the default is one model per provider; onboarding three OpenAI models is deliberate, precisely to surface this variance.
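The arithmetic of what a provider-level number hides is trivial but worth making explicit. Using the deltas from the family table above (the grouping code itself is illustrative):

```python
from statistics import mean

# Deltas from the OpenAI family table; the aggregation is the point,
# the code is just a sketch.
openai_deltas = {
    "GPT-5.4": 46.4,
    "GPT-5 nano": 37.0,
    "GPT-4o mini": 34.2,
}

# The single number a provider-level leaderboard would report...
provider_score = mean(openai_deltas.values())
# ...and the intra-family spread it erases.
family_spread = max(openai_deltas.values()) - min(openai_deltas.values())

print(f"'OpenAI' as one number: +{provider_score:.1f}")
print(f"what it hides: a {family_spread:.1f}-point spread inside the family")
```

A team that reads the aggregate and buys the cheapest model gets the bottom of that spread, not the mean.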
## The previous post's promise: Opus 4.6 → 4.7
Anthropic shipped Opus 4.7 and the Claude Code CLI now uses it by default. The previous post had Claude Opus at 3rd place with Δ +48.1. Rerunning the same 50 prompts now under `claude --model opus` (which resolves to `claude-opus-4-7`):
- Opus 4.6 (before): +48.1, 3rd place
- Opus 4.7 (now): +49.7, 2nd place — technical tie with Jamba (+49.8)
- Net gain: +1.6 points
A real improvement, but modest. For context: +1.6 is smaller than the run-to-run variance some providers show across consecutive runs. The biggest practical consequence of the upgrade wasn't the delta; it was pushing Opus above Sonnet and into a tie with Jamba, flipping an internal ranking.
This lines up with Anthropic's release pattern: upgrades tend to concentrate gains on axes like long-form reasoning, tool use, coding — not on the discrete instruction-following that Whet measures. Viewed that way, Opus 4.7 delivered on its promise, just on a different axis from the one I test here.
## Cohere: the RAG school landed in the middle
Cohere also joined the cohort — Command A, their flagship, built with explicit emphasis on RAG, grounded generation, and tool use. Three things stood out:
- Score: +41.0, 9th place. Above Gemini Flash, below the DeepSeek pair. Middle of the pack.
- Reliability was perfect: 50/50 calls without error. No 429s, no bad parsing, no empty responses.
- Latency was wild: 5s to 88s on the same run, for the same kind of prompt. That's the most interesting signal — every other stable provider shows 2-3× dispersion, not 17×. Could be Cohere infrastructure or capacity allocation; worth tracking on the "Live" axis.
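The "17×" figure is just the max/min latency ratio within a single run. A sketch with illustrative values (not the actual Command A measurements):

```python
# Dispersion = max/min latency within one run. Sample values are
# illustrative, shaped like the observations above, not real data.
def dispersion(latencies_s: list[float]) -> float:
    return max(latencies_s) / min(latencies_s)

stable_provider = [4.1, 6.0, 9.8]    # the usual ~2-3x shape
command_a_like = [5.0, 22.0, 88.0]   # 5 s to 88 s on the same run

print(f"stable provider: {dispersion(stable_provider):.1f}x")
print(f"Command A-like:  {dispersion(command_a_like):.1f}x")
```

A ratio is a crude measure (a single outlier dominates it), but it is enough to separate "normal queueing jitter" from "something structural in capacity allocation."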
The hypothesis going in was that alignment for RAG/grounded generation would make Command A follow meta-instructions either better or worse than models aligned for open-ended chat. It landed in the middle. Doesn't destroy intent (zero structural errors), but also doesn't apply adjustments as aggressively as Jamba or Claude — consistent with RLHF that prioritizes preserving source content.
## Honest caveats
- Corpus is still 50 prompts. The OpenAI family comparison runs over the same set. Generalization to prompts outside the corpus is an open question.
- GPT-5 nano ran with `reasoning_effort: "low"` because the default consumes the whole budget on reasoning tokens before emitting output. Retrying with `"medium"` might shift its position; the trade-off is ~2× more reasoning tokens billed per call.
- GPT-4o mini is deprecated. Retired from ChatGPT on Feb 13, 2026 and likely removed from the API next cycle. Treat its result as a historical artifact, not current state-of-the-art OpenAI.
- Opus 4.7 data is fresher than everyone else's. The dedup rule (prompt × provider, most recent wins) means Opus 4.7 is cleanly "4.7" data, while other providers have more temporal drift. A full cohort rerun would even that out.
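The dedup rule from the last caveat can be sketched in a few lines. The record fields and values here are illustrative, not the runner's actual storage format:

```python
from datetime import date

# Sketch of the dedup rule: key on (prompt, provider), most recent
# run wins. Fields and dates are made up for illustration.
runs = [
    {"prompt": "p01", "provider": "anthropic", "score": 92.7, "run_on": date(2026, 1, 10)},
    {"prompt": "p01", "provider": "anthropic", "score": 94.3, "run_on": date(2026, 3, 20)},
    {"prompt": "p01", "provider": "openai", "score": 91.0, "run_on": date(2026, 3, 1)},
]

latest: dict[tuple[str, str], dict] = {}
for run in sorted(runs, key=lambda r: r["run_on"]):
    # Iterating in chronological order means later runs overwrite earlier ones.
    latest[(run["prompt"], run["provider"])] = run

print(len(latest))                              # surviving records per key
print(latest[("p01", "anthropic")]["run_on"])   # the most recent run wins
```

This is also why the temporal-drift caveat exists: the rule keeps whatever is newest per key, so a provider rerun today silently carries a fresher date than everyone else's rows.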
## What's next
Three open tracks, ordered by editorial interest:
- GPT-5 nano ablation: rerun with `reasoning_effort: "medium"` and see if the gain justifies the extra tokens. If its position climbs significantly, it's a sign the nano tier is capable but under-utilized by default.
- Grok, when I unblock it. The provider is implemented but deregistered from the runner; xAI requires a top-up plus a Data Sharing opt-in for free credit, a decision I've deferred. Running it adds another family to the map.
- Fresh corpus. The 50 prompts covered today were crafted over the last few weeks. Worth crafting a new set, running all 12 models, and seeing whether the ranking holds shape or migrates — the only way to validate generalization outside the current set.
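The nano ablation from the first track is a one-knob change. A sketch of the request shape — the `reasoning_effort` parameter name follows OpenAI's Chat Completions API, but the model id and prompt here are placeholders, not the benchmark's real inputs:

```python
# Sketch of the planned ablation: identical prompts, only the effort
# knob toggled. Model id and prompt text are placeholders.
def build_request(prompt: str, effort: str) -> dict:
    return {
        "model": "gpt-5-nano",        # placeholder model id
        "reasoning_effort": effort,   # "low" today, "medium" in the ablation
        "messages": [{"role": "user", "content": prompt}],
    }

baseline = build_request("example prompt", "low")
ablation = build_request("example prompt", "medium")
assert baseline["messages"] == ablation["messages"]  # only the effort differs
```

Keeping everything else byte-identical is the point: any movement in the score is then attributable to the reasoning budget, not to prompt drift.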
The Whet Benchmark has a rotating corpus, open runner, and results committed to GitHub. If you represent a lab or model and want to see your provider running here, reach me at hello@trywhet.com.