Sonnet on top. Opus in third.
New Whet Benchmark round: 12 prompts, 12 models, 144 calls. Sonnet leads with Δ +56.1, Jamba 2nd (+54.0), Opus 3rd (+53.7). Gemini got 3 samples because the API went down. Run details and two readings of the result.
A new Whet Benchmark round went out today: 12 prompts, 12 models, 144 calls. Sonnet 1st, Jamba 2nd, Opus 3rd. Below, the full table and a few cuts from the run that stood out.
The leaderboard (12 prompts, 12 models)
| # | model | lab | Δ | after | N | note |
|---|---|---|---|---|---|---|
| 1 | Claude Sonnet 4.6 (CLI) | Anthropic (USA), paid | +56.1 | 94.9 | 12 | +7.1 over last run |
| 2 | Jamba Large 1.7 | AI21 Labs (Israel), trial | +54.0 | 92.8 | 12 | |
| 3 | Claude Opus 4.7 (CLI) | Anthropic (USA), paid | +53.7 | 92.6 | 12 | |
| 4 | Gemini 2.5 Flash | Google (USA), free | +52.7 | 59.3 | 3 | incomplete sample |
| 5 | DeepSeek R1 | DeepSeek (China), paid | +50.8 | 89.7 | 12 | |
| 6 | Mistral Small | Mistral AI (France), free | +49.6 | 88.4 | 12 | |
| 7 | GPT-5.4 | OpenAI (USA), paid | +49.1 | 87.9 | 12 | |
| 8 | DeepSeek V3 | DeepSeek (China), paid | +47.3 | 86.2 | 12 | |
| 9 | Llama 3.3 70B (Groq) | Meta (USA) via Groq, free | +45.0 | 83.8 | 12 | |
| 10 | Command A (Cohere) | Cohere (Canada), trial | +44.2 | 83.0 | 12 | |
| 11 | GPT-5 nano | OpenAI (USA), paid | +42.3 | 81.2 | 12 | |
| 12 | GPT-4o mini | OpenAI (USA), paid | +39.7 | 78.5 | 12 | |
Mean score before: 38.8. Gemini has 3 samples instead of 12 because the API went down; see below.
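For the full-sample providers, the table is easy to sanity-check if Δ is read as the after-score minus the run's mean before-score of 38.8 (a sketch with values copied from the table; Gemini's row won't reconcile this way, presumably because its 3-sample subset has its own baseline):

```python
# Sanity-check: for N=12 providers, delta ~= after - mean(before).
MEAN_BEFORE = 38.8

rows = {
    "Claude Sonnet 4.6": (56.1, 94.9),
    "Jamba Large 1.7":   (54.0, 92.8),
    "GPT-4o mini":       (39.7, 78.5),
}

for model, (delta, after) in rows.items():
    # Allow 0.1 of slack, since both columns are rounded to one decimal.
    assert abs((after - MEAN_BEFORE) - delta) <= 0.1, model
```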
Sonnet ahead of the two bigger tiers
Sonnet +56.1, Jamba +54.0, Opus +53.7. Same run, same 12 prompts, and Sonnet 4.6 lands ahead of Opus 4.7 — the "smaller" Anthropic tier delivers a bigger delta than the "bigger" one on this task. Jamba, from AI21, sits between them and continues to be one of the most consistent results in the benchmark.
Two possible readings. First: the composition of the corpus. Prompts with internal contradictions and inappropriate tone in sensitive domains test a kind of instruction-following that Sonnet seems to handle better, and that's exactly where it separates from the pack.
Second reading, about Opus 4.7 itself: Anthropic optimizes Opus for long-form reasoning and tool use, not prompt rewriting. In narrow instruction-following, Sonnet 4.6 historically beats Opus — even though Opus is the "bigger" tier. Today's result reinforces that pattern.
The truth is probably a mix. Future runs will solidify or dissolve the pattern — and that's what gives real signal, not today's absolute number.
The run composition
Every run covers 4 anti-pattern profiles × 3 languages (pt, en, es):
- A — imperative + defaults + density: rigid corporate system prompts, full of ALWAYS/NEVER/MUST and instructions that repeat default model behavior.
- B — vagueness + repetition + bare commands: "well-intentioned generic" prompts, full of "follow best practices", "be professional", "deliver value", and variations of the same idea.
- C — contradiction + sensitive domain: instructions that pull in opposite directions within the same prompt (formal vs casual, concise vs thorough) in a regulated domain — health, legal, finance.
- D — toxic motivational + external refs: "you are the best in the world, if you fail it's your fault, follow the manual I sent you earlier" — the combo of inflated credentials, conditional threats, conditional rewards, and references to documents that aren't in the context.
The composition guarantees that all 12 linter rules fire in at least one prompt per run. Domains rotate across runs; this round ran through corporate law, health, finance, digital marketing, design/UX, journalism, pharmacy, nutrition, mental health, civil engineering, film, and agriculture.
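The 4 × 3 grid with rotating domains can be sketched as a slot generator. The profile letters, languages, and domain names come from the post; the function name, the seeding scheme, and the slot shape are illustrative assumptions:

```python
import itertools
import random

PROFILES = ["A", "B", "C", "D"]   # anti-pattern profiles described above
LANGUAGES = ["pt", "en", "es"]
DOMAINS = ["corporate law", "health", "finance", "digital marketing",
           "design/UX", "journalism", "pharmacy", "nutrition",
           "mental health", "civil engineering", "film", "agriculture"]

def compose_run(seed: int) -> list[dict]:
    """One run: every profile x language pair, each with a distinct domain."""
    rng = random.Random(seed)             # seeded so a run is reproducible
    domains = rng.sample(DOMAINS, k=12)   # rotate domains across runs via the seed
    return [
        {"profile": p, "language": lang, "domain": dom}
        for (p, lang), dom in zip(itertools.product(PROFILES, LANGUAGES), domains)
    ]

slots = compose_run(seed=7)
assert len(slots) == 12                   # 4 profiles x 3 languages
```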
How each prompt is crafted
Each slot goes to a blind sub-agent: fresh session, no access to the rules' code, the historical corpus, or any answer key. The brief the agent receives is short and in natural language: language, domain, anti-pattern profile described in prose, target score, size, minimum rules. The agent writes the prompt the way it finds natural for that context.
The 12 dispatches run in parallel. When they return, each prompt goes through analyze() to check score, size, and target rules. If a criterion fails, another blind agent is dispatched with natural-language feedback — never exposing the regex. The method and criteria are documented in whorl/benchmark/README.md.
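The validate-and-redispatch loop looks roughly like this. analyze() is the entry point named above; the stubs, the brief fields, and the scoring tolerance are hypothetical stand-ins, not the runner's actual code:

```python
# Hypothetical stubs standing in for the real agent dispatch and linter.
def dispatch_blind_agent(brief, feedback=None):
    # A real implementation would open a fresh model session with the brief
    # (plus prior feedback, if any) and return the prompt text the agent wrote.
    return "ALWAYS be professional. NEVER fail. Follow the attached template."

def analyze(prompt):
    # The real analyze() returns the linter's verdict for a prompt.
    return {"score": 40, "fired_rules": {"imperative-density", "unresolved-reference"}}

def craft_prompt(brief, max_attempts=3):
    """Dispatch blind agents until a candidate meets the brief's criteria."""
    feedback = None
    for _ in range(max_attempts):
        candidate = dispatch_blind_agent(brief, feedback)
        report = analyze(candidate)
        failures = []
        if abs(report["score"] - brief["target_score"]) > 5:
            failures.append("score is off target")
        if not brief["min_rules"] <= report["fired_rules"]:
            failures.append("some target rules did not fire")
        if not failures:
            return candidate
        # Feedback stays in natural language; the regexes are never exposed.
        feedback = "; ".join(failures)
    raise RuntimeError("no candidate passed after retries")

brief = {"target_score": 40, "min_rules": {"unresolved-reference"}}
prompt = craft_prompt(brief)
```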
Gemini had a blackout
The run went in with 12 prompts per provider. Gemini 2.5 Flash completed just 1; the other 11 returned HTTP 503 ("model currently overloaded"). I tried a retry: 2 more passed, the rest stayed on 503. A third retry an hour later got 0 successes; the free tier had turned into HTTP 429 ("daily quota exceeded"). The API burned the day's quota on attempts that didn't even deliver a result.
On normal days, Gemini has 0-1 failures per session. Today's blackout is anomalous. Gemini's score above (+52.7) is based on 3 samples and shouldn't be read as stable signal — the provider stays pending retry until the API normalizes.
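One takeaway for any runner hitting this: a 503 is worth retrying with backoff, but a 429 is not, since every further attempt against an exhausted quota is wasted. A sketch of that policy (the call signature and error shapes are assumptions, not the Gemini client's API):

```python
import time

RETRYABLE = {503}   # transient overload: retry with exponential backoff
FATAL = {429}       # daily quota exceeded: stop, resume after the reset

def call_with_backoff(call, max_retries=3, base_delay=2.0):
    """Retry transient overload errors; bail out immediately on quota errors.

    `call` returns an (http_status, body) tuple in this sketch.
    """
    for attempt in range(max_retries + 1):
        status, body = call()
        if status == 200:
            return body
        if status in FATAL or status not in RETRYABLE:
            raise RuntimeError(f"giving up: HTTP {status}")
        if attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)   # 2s, 4s, 8s, ...
    raise RuntimeError("exhausted retries on HTTP 503")
```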
Cuts from the middle of the table
- DeepSeek R1 in 5th (Δ +50.8). Reasoning pays off on profile D prompts (toxic motivational + external refs), which demand more structure from the rewrite. R1 also had two timeouts in the main run that only resolved on isolated retry; running R1 and V3 in parallel carries contention risk.
- OpenAI family spans 10 delta points. GPT-5.4 in 7th (+49.1), GPT-5 nano in 11th (+42.3), GPT-4o mini in 12th (+39.7). The flagship sits in the pack but not at the top; the gap to the mini tier is large and consistent.
- Cohere Command A in 10th (+44.2). Predictable middle — consistent with the thesis that RAG-oriented alignment prioritizes content preservation over applying adjustments.
Honest limitations
- One run is signal, not truth. Sonnet ahead of Opus may be round-to-round variance. Repeated rounds reduce that risk.
- Partial coverage. 135/144 samples. The 9 Gemini holes fill when the daily quota resets; for the ranking above, treat Gemini as outside the interpretation.
- The unresolved-reference rule requires very specific phrasing to fire ("follow the attached template" matches; "see the style guide I emailed earlier" doesn't). A linter gap, queued for broadening.
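The gap is easy to reproduce with a narrow pattern. This regex is a hypothetical stand-in for the rule, written only to show why natural phrasings slip through; it is not the linter's actual implementation:

```python
import re

# Hypothetical stand-in for the unresolved-reference rule: it only fires on
# a fixed set of attachment-style nouns directly after "the attached".
UNRESOLVED_REF = re.compile(
    r"\b(?:follow|use|see)\s+the\s+attached\s+(?:template|manual|guide)\b",
    re.IGNORECASE,
)

assert UNRESOLVED_REF.search("Follow the attached template.")
# A natural phrasing slips through, which is the documented gap:
assert not UNRESOLVED_REF.search("See the style guide I emailed earlier.")
```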
What's next
- Next round with fresh corpus and rotated domains — confirms or dilutes the Sonnet > Opus pattern on this axis.
- Gemini retry when the daily quota resets — fills the 9 pending samples and the provider returns to the ranking with N=12.
- Broadening the unresolved-reference rule with more natural phrasing patterns.
The Whet Benchmark has a rotating corpus, open runner, and results committed to GitHub. If you represent a lab or model and want to see your provider running here, reach me at hello@trywhet.com.