April 20, 2026·6 min read

Sonnet on top. Opus in third.

New Whet Benchmark round: 12 prompts, 12 models, 144 calls. Sonnet leads with Δ +56.1, Jamba 2nd (+54.0), Opus 3rd (+53.7). Gemini got 3 samples because the API went down. Run details and two readings of the result.

A new Whet Benchmark round went out today: 12 prompts, 12 models, 144 calls. Sonnet 1st, Jamba 2nd, Opus 3rd. Below, the full table and a few cuts from the run that caught attention.

The leaderboard (12 prompts, 12 models)

| # | model | lab | tier | Δ | after | N | note |
|---|-------|-----|------|---|-------|---|------|
| 1 | Claude Sonnet 4.6 (CLI) | Anthropic (USA) | paid | +56.1 | 94.9 | 12 | +7.1 over last run |
| 2 | Jamba Large 1.7 | AI21 Labs (Israel) | trial | +54.0 | 92.8 | 12 | |
| 3 | Claude Opus 4.7 (CLI) | Anthropic (USA) | paid | +53.7 | 92.6 | 12 | |
| 4 | Gemini 2.5 Flash | Google (USA) | free | +52.7 | 59.3 | 3 | incomplete sample |
| 5 | DeepSeek R1 | DeepSeek (China) | paid | +50.8 | 89.7 | 12 | |
| 6 | Mistral Small | Mistral AI (France) | free | +49.6 | 88.4 | 12 | |
| 7 | GPT-5.4 | OpenAI (USA) | paid | +49.1 | 87.9 | 12 | |
| 8 | DeepSeek V3 | DeepSeek (China) | paid | +47.3 | 86.2 | 12 | |
| 9 | Llama 3.3 70B (Groq) | Meta (USA) via Groq | free | +45.0 | 83.8 | 12 | |
| 10 | Command A (Cohere) | Cohere (Canada) | trial | +44.2 | 83.0 | 12 | |
| 11 | GPT-5 nano | OpenAI (USA) | paid | +42.3 | 81.2 | 12 | |
| 12 | GPT-4o mini | OpenAI (USA) | paid | +39.7 | 78.5 | 12 | |

Mean score before: 38.8. Gemini has 3 samples instead of 12 because the API went down — see below.
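As a sanity check, the Δ column is consistent with each model's "after" mean minus the run's "before" mean (a few rows drift by ±0.1, which is what you'd expect when the two means are rounded independently). A minimal sketch, assuming a single shared before-mean:

```python
# Illustrative reconstruction of the Δ column: Δ = after − before,
# where "before" is the mean linter score of the raw prompts this
# round and "after" is the mean score of a model's rewrites.

BEFORE_MEAN = 38.8  # mean score of the raw corpus, per the footnote above

def delta(after_mean: float, before_mean: float = BEFORE_MEAN) -> float:
    """Improvement a model's rewrites add over the raw prompts."""
    return round(after_mean - before_mean, 1)

print(delta(94.9))  # Sonnet: 56.1
print(delta(92.8))  # Jamba: 54.0
```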

Sonnet ahead of the two bigger tiers

Sonnet +56.1, Jamba +54.0, Opus +53.7. Same run, same 12 prompts, and Sonnet 4.6 lands ahead of Opus 4.7 — the "smaller" Anthropic tier delivers a bigger delta than the "bigger" one on this task. Jamba, from AI21, sits between them and continues to be one of the most consistent results in the benchmark.

Two possible readings. First: the composition of the corpus — prompts with internal contradictions and inappropriate tone in sensitive domains test a kind of instruction-following that Sonnet seems to do better. That's exactly where it separates from the pack.

Second reading, about Opus 4.7 itself: Anthropic optimizes Opus for long-form reasoning and tool use, not prompt rewriting. In narrow instruction-following, Sonnet 4.6 historically beats Opus — even though Opus is the "bigger" tier. Today's result reinforces that pattern.

The truth is probably a mix. Future runs will solidify or dissolve the pattern — and that's what gives real signal, not today's absolute number.

The run composition

Every run covers 4 anti-pattern profiles × 3 languages (pt, en, es):

  • A — imperative + defaults + density: rigid corporate system prompts, full of ALWAYS/NEVER/MUST and instructions that repeat default model behavior.
  • B — vagueness + repetition + bare commands: "well-intentioned generic" prompts, full of "follow best practices", "be professional", "deliver value", and variations of the same idea.
  • C — contradiction + sensitive domain: instructions that pull in opposite directions within the same prompt (formal vs casual, concise vs thorough) in a regulated domain — health, legal, finance.
  • D — toxic motivational + external refs: "you are the best in the world, if you fail it's your fault, follow the manual I sent you earlier" — the combo of inflated credentials, conditional threats, conditional rewards, and references to documents that aren't in the context.

The composition guarantees that all 12 linter rules fire in at least one prompt per run. Domains rotate across runs; this round ran through corporate law, health, finance, digital marketing, design/UX, journalism, pharmacy, nutrition, mental health, civil engineering, film, and agriculture.
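The 4 × 3 matrix above can be sketched in a few lines (the identifiers are illustrative, not the runner's actual names):

```python
from itertools import product

# Illustrative slot matrix: 4 anti-pattern profiles x 3 languages,
# with one of this round's 12 domains assigned per slot.
PROFILES = ["A", "B", "C", "D"]
LANGUAGES = ["pt", "en", "es"]
DOMAINS = [
    "corporate law", "health", "finance", "digital marketing",
    "design/UX", "journalism", "pharmacy", "nutrition",
    "mental health", "civil engineering", "film", "agriculture",
]

slots = [
    {"profile": p, "lang": lang, "domain": d}
    for (p, lang), d in zip(product(PROFILES, LANGUAGES), DOMAINS)
]
print(len(slots))  # 12 prompts per model per run
```

Domains rotate between runs, so only the profile × language grid is fixed.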

How each prompt is crafted

Each slot goes to a blind sub-agent: fresh session, no access to the rules' code, the historical corpus, or any answer key. The brief the agent receives is short and in natural language — language, domain, anti-pattern profile described in prose, target score, size, minimum rules. The agent writes the prompt the way it finds natural for that context.

The 12 dispatches run in parallel. When they return, each prompt goes through analyze() to check score, size, and target rules. If a criterion fails, another blind agent is dispatched with natural-language feedback — never exposing the regex. The method and criteria are documented in whorl/benchmark/README.md.
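The dispatch-and-validate loop can be sketched as follows; `dispatch_blind_agent` and the shape of `analyze()`'s report are assumptions standing in for the runner's internals, which the post doesn't show:

```python
# Hypothetical sketch of the crafting loop: a fresh blind agent writes
# the prompt, analyze() validates it, and failures come back to a new
# agent as natural-language feedback -- never the regex itself.

def craft_prompt(brief, dispatch_blind_agent, analyze, max_attempts=3):
    """Return a prompt that passes analyze(), re-dispatching on failure."""
    feedback = None
    for _ in range(max_attempts):
        prompt = dispatch_blind_agent(brief, feedback)
        report = analyze(prompt)  # assumed to check score, size, target rules
        if report["ok"]:
            return prompt
        feedback = report["summary"]  # prose summary, no rule internals
    raise RuntimeError("prompt failed validation after retries")
```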

Gemini had a blackout

The run went in with 12 prompts per provider. Gemini 2.5 Flash completed just 1; the other 11 returned HTTP 503 ("model currently overloaded"). A first retry pushed 2 more through, the rest stayed on 503. A second retry an hour later got 0 successes: the free tier had flipped to HTTP 429 ("daily quota exceeded"). The API burned the day's quota on attempts that never delivered a result.

On normal days, Gemini has 0-1 failures per session. Today's blackout is anomalous. Gemini's score above (+52.7) is based on 3 samples and shouldn't be read as stable signal — the provider stays pending retry until the API normalizes.
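The operational lesson generalizes: a 503 is worth retrying, a 429 means every further attempt only burns quota. A minimal sketch of that policy (the `ApiError` type and call shape are assumptions, not the runner's actual client):

```python
import time

class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status code."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def call_with_retry(call, max_retries=3, backoff_s=60.0):
    """Retry 503s with growing backoff; stop immediately on anything else."""
    for attempt in range(max_retries):
        try:
            return call()
        except ApiError as e:
            if e.status != 503:  # 429 etc.: give up, don't burn quota
                raise
            time.sleep(backoff_s * (attempt + 1))  # overloaded: wait, retry
    raise ApiError(503)  # still overloaded after all retries
```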

Cuts from the middle of the table

  • DeepSeek R1 in 5th (Δ +50.8). Reasoning pays off on profile D prompts (toxic motivational + external refs), which demand more structure from the rewrite. R1 also had two timeouts in the main run that only resolved on isolated retry; running R1 and V3 in parallel carries contention risk.
  • OpenAI family spans 10 delta points. GPT-5.4 in 7th (+49.1), GPT-5 nano in 11th (+42.3), GPT-4o mini in 12th (+39.7). The flagship sits in the pack but not at the top; the gap to the mini tier is large and consistent.
  • Cohere Command A in 10th (+44.2). Predictable middle — consistent with the thesis that RAG-oriented alignment prioritizes content preservation over applying adjustments.

Honest limitations

  • One run is signal, not truth. Sonnet ahead of Opus may be round-to-round variance. Repeated rounds reduce that risk.
  • Partial coverage. 135/144 samples. The 9 missing Gemini samples will be filled when the daily quota resets; for the ranking above, treat Gemini as excluded from interpretation.
  • The unresolved-reference rule requires very specific phrasing to fire ("follow the attached template" matches, "see the style guide I emailed earlier" doesn't). A linter gap, queued for broadening.
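The unresolved-reference gap is easy to illustrate. These patterns are illustrative only (the linter's actual regex isn't published), but they show the shape of the problem: a narrow pattern keyed to "attached" misses indirect phrasings, while a broadened one catches them.

```python
import re

# Illustrative, NOT the linter's actual rule: narrow pattern keyed to
# "the attached <noun>" misses indirect external references.
NARROW = re.compile(r"\b(follow|use|see) the attached \w+", re.I)

# Broadened sketch: also catch "sent/emailed/shared earlier" phrasings.
BROAD = re.compile(
    r"\b(follow|use|see)\b.*\b(attached|sent|emailed|shared)\b", re.I
)

print(bool(NARROW.search("follow the attached template")))        # True
print(bool(NARROW.search("see the style guide I emailed earlier")))  # False
print(bool(BROAD.search("see the style guide I emailed earlier")))   # True
```

Broadening trades precision for recall, so new phrasings would need checking against the historical corpus for false positives before shipping.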

What's next

  1. Next round with fresh corpus and rotated domains — confirms or dilutes the Sonnet > Opus pattern on this axis.
  2. Gemini retry when the daily quota resets — fills the 9 pending samples and the provider returns to the ranking with N=12.
  3. Broadening the unresolved-reference rule with more natural phrasing patterns.

The Whet Benchmark has a rotating corpus, open runner, and results committed to GitHub. If you represent a lab or model and want to see your provider running here, reach me at hello@trywhet.com.