When sharpening breaks it
Whet's Round 3: 12 fresh prompts across 14 models. The top (Jamba, Sonnet, Opus) sat still. But on the floor of the round, Gemini 2.5 Flash did something rare — shipped a rewrite worse than the original prompt. On pediatric intake, two axes, and what it says about a ceiling no one shakes.
Round 3 of the Whet Benchmark is live: 12 fresh prompts, crafted in parallel by 12 blind sub-agents, run across the cohort's 14 models. 168 calls, zero errors, ~28 minutes of API back-and-forth. The cumulative leaderboard's top didn't budge — Jamba, Sonnet, and Opus stay glued above Δ +50, exactly where they sat before this round started.
But on the floor of Round 3, something rare happened: a model shipped a worse version of the original prompt. It didn't stall, didn't refuse, didn't fail silently — it spent 76 seconds processing and returned a rewrite the linter scored one point below what the user submitted. It was Gemini 2.5 Flash, on the round's only prompt involving clinical intake for a child.
The prompt that broke the cohort
The slot came from the corpus method's C profile: sensitive domain with mismatched tone and internal contradictions. Concretely, it was "Sunny", a fictional pediatric dental clinic chatbot, written by a blind sub-agent as a system prompt a real clinic owner would write in good faith trying to "humanize" the experience. The result has five stacked tensions: playful tone with formal SOAP-note charting, "snappy replies" with "exhaustive, never skip a follow-up", English-only with whatever-language-the-family-prefers, "never speculate about diagnosis" with "give parents a clear, confident answer", "cite the AAPD guideline" with "skip references that overwhelm worried parents". All in 881 characters.
Fourteen models attacked the same prompt. Here is who managed to sharpen it, ranked by Δ against the original's linter score of 38:
| pos | model | after | Δ |
|---|---|---|---|
| #1 | Jamba Large 1.7 (AI21) | 93 | +55 |
| #2 | Claude Sonnet 4.7 | 90 | +52 |
| #3 | Mistral Small | 83 | +45 |
| #4 | Grok 4.20 Reasoning | 83 | +45 |
| #5 | Llama 3.3 70B (Groq) | 79 | +41 |
| #6 | DeepSeek R1 | 76 | +38 |
| #7 | Claude Opus 4.7 | 75 | +37 |
| #8 | GPT-4o mini | 75 | +37 |
| #9 | GPT-5.5 | 72 | +34 |
| #10 | Command A (Cohere) | 68 | +30 |
| #11 | GPT-5.4 | 68 | +30 |
| #12 | GPT-5 nano | 45 | +7 |
| #13 | DeepSeek V3 | 43 | +5 |
| #14 | Gemini 2.5 Flash | 37 | -1 |
Notice the cliff at the bottom: GPT-5.4 and Cohere close out the "decent" band at +30. Then a hole opens: GPT-5 nano at +7, DeepSeek V3 at +5, and Gemini at Δ -1. That's not noise between adjacent models; it's an entire tail of the round losing composure on the same prompt. Two of the three models that tripped, Gemini 2.5 Flash and GPT-5 nano, sit in the bottom three of the cumulative ranking. DeepSeek V3, tenth overall, is the surprise, while GPT-4o mini, last overall, cleared this prompt at +37. The prompt acted like a sieve.
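The cliff can be checked straight from the table. A minimal Python sketch, using only the numbers above (the variable names are mine, not Whet's): it recovers the original prompt's score as after minus Δ, then measures the overall spread and the largest drop between adjacent ranks.

```python
# Round 3, "Sunny" prompt: (model, score after rewrite, delta), copied from the table above.
rows = [
    ("Jamba Large 1.7", 93, 55),
    ("Claude Sonnet 4.7", 90, 52),
    ("Mistral Small", 83, 45),
    ("Grok 4.20 Reasoning", 83, 45),
    ("Llama 3.3 70B", 79, 41),
    ("DeepSeek R1", 76, 38),
    ("Claude Opus 4.7", 75, 37),
    ("GPT-4o mini", 75, 37),
    ("GPT-5.5", 72, 34),
    ("Command A", 68, 30),
    ("GPT-5.4", 68, 30),
    ("GPT-5 nano", 45, 7),
    ("DeepSeek V3", 43, 5),
    ("Gemini 2.5 Flash", 37, -1),
]

# Every model received the same prompt, so every row should imply the same original score.
baselines = {after - delta for _, after, delta in rows}

deltas = [delta for _, _, delta in rows]
spread = max(deltas) - min(deltas)

# Largest drop between adjacent ranks (rows are already sorted by delta, descending).
gaps = [
    (deltas[i] - deltas[i + 1], rows[i][0], rows[i + 1][0])
    for i in range(len(rows) - 1)
]
biggest_gap = max(gaps)

print(baselines)    # {38}
print(spread)       # 56
print(biggest_gap)  # (23, 'GPT-5.4', 'GPT-5 nano')
```

The 23-point drop between ranks #11 and #12 is more than three times the next-largest adjacent gap in the table (7 points, between #2 and #3).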
What Gemini actually shipped
Gemini's rewrite isn't absurd; it's competent. The model kept the structure, preserved the intake intent, and organized the sections. The problem is that it also preserved the contradictions. "Be warm but clinical" became "balance warmth and clinical accuracy". "Brief but exhaustive" became "concise yet thorough". The linter counts every default-behavior instruction as noise, and Gemini just swapped vocabulary without dissolving the tensions. It also added a few "always" and "ensure" to round things off, which piled more imperative-overload on top of what was already there. Result: the engine flagged more diagnostics on the rewrite than on the original.
Jamba and Sonnet, on the same prompt, picked a side per tension. That is exactly what the rewrite meta-prompt asks for explicitly: "When contradictions appear, declaring the tiebreaker is more useful than keeping both instructions." The gap between +55 and -1 isn't raw capability; it's willingness to cut rather than soften.
Meanwhile, the ceiling holds
Folding Round 3 into history (74 distinct prompts covered by all 14 models, deduplicated by prompt × provider pair), the public ranking looks like this:
| pos | model | after (mean) | mean Δ |
|---|---|---|---|
| #1 | Jamba Large 1.7 (AI21) | 94.0 | +51.4 |
| #2 | Claude Sonnet 4.7 | 93.6 | +51.0 |
| #3 | Claude Opus 4.7 | 93.1 | +50.4 |
| #4 | Grok 4.20 Reasoning | 91.0 | +48.3 |
| #5 | GPT-5.4 (OpenAI) | 89.8 | +47.1 |
| #6 | GPT-5.5 (OpenAI) | 89.6 | +47.0 |
| #7 | DeepSeek R1 | 88.9 | +46.3 |
| #8 | Mistral Small | 88.9 | +46.2 |
| #9 | Llama 3.3 70B (Groq) | 87.9 | +45.2 |
| #10 | DeepSeek V3 | 87.0 | +44.3 |
| #11 | Command A (Cohere) | 83.9 | +41.3 |
| #12 | Gemini 2.5 Flash | 81.7 | +39.0 |
| #13 | GPT-5 nano (OpenAI) | 80.4 | +37.8 |
| #14 | GPT-4o mini (OpenAI) | 78.5 | +35.9 |
Jamba leads for the third round in a row. The gap to Sonnet narrowed to 0.4 points, and that, yes, is noise. What matters is that the top trio stays nailed above +50 and Grok holds #4 at +48.3, consistent with the two-axes reading: even on fresh prompts where every model gets the same input, it sits ahead of both GPT-5 flagships.
And GPT-5.5 vs 5.4 stays glued, this time 0.1 points apart in cumulative mean Δ. The forecast from the previous post, "reasoning-by-default doesn't pay in meta-prompt-following", survives 12 more samples per side.
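For readers who want to reproduce the cumulative table, the aggregation could be sketched roughly as below. This is an illustration under stated assumptions, not Whet's actual code: the record keys (`prompt_id`, `provider`, `before`, `after`) are hypothetical and may not match the real layout of results.json, and keeping the latest record per prompt × provider pair is my guess at the deduplication policy.

```python
from collections import defaultdict


def cumulative_ranking(records):
    """Rank providers by mean delta after deduplicating on (prompt_id, provider).

    `records` is an iterable of dicts with the hypothetical keys
    prompt_id, provider, before, after. A later record overwrites an
    earlier one for the same pair (an assumed policy).
    """
    latest = {}
    for rec in records:
        latest[(rec["prompt_id"], rec["provider"])] = rec

    deltas = defaultdict(list)
    for (_, provider), rec in latest.items():
        deltas[provider].append(rec["after"] - rec["before"])

    means = {p: sum(d) / len(d) for p, d in deltas.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only (not taken from results.json).
sample = [
    {"prompt_id": "p1", "provider": "model-a", "before": 40, "after": 90},
    {"prompt_id": "p1", "provider": "model-a", "before": 40, "after": 92},  # rerun, replaces the record above
    {"prompt_id": "p2", "provider": "model-a", "before": 38, "after": 86},
    {"prompt_id": "p1", "provider": "model-b", "before": 40, "after": 70},
    {"prompt_id": "p2", "provider": "model-b", "before": 38, "after": 37},
]

ranking = cumulative_ranking(sample)
print(ranking)  # [('model-a', 50.0), ('model-b', 14.5)]
```

Deduplicating before averaging keeps a rerun prompt from counting twice toward a provider's mean.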
Why this matters
There's an easy narrative about LLMs as prompt-improvement tools: plug it in, pick any decent model, the output is good. Round 3 is empirical evidence that the shortcut is false. On adversarial prompts — sensitive domain + wrong tone + contradictions, the exact terrain where AI rewriting would be most useful — the spread between models hits 56 points on the same input. And the cheapest model on the menu (Gemini 2.5 Flash, generous free tier) is the one that managed to make it worse.
For people building products: what Whet has been showing for three rounds is that there's a ceiling, and it's expensive. For people picking a provider for free-tier rotation: cumulative averages hide the fact that a model's worst prompts matter more than its mean. For people writing prompts: contradiction can feel like "thorough coverage" to the author, but it's exactly where weaker models ship something worse than if you'd asked for nothing at all.
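The tail-versus-mean point can be made concrete with the two tables in this post. A small Python sketch (names are mine) that subtracts each model's Δ on the "Sunny" prompt from its cumulative mean Δ. One caveat: the cumulative mean already includes this prompt, so the shortfall is slightly understated.

```python
# (model, cumulative mean delta, delta on the "Sunny" prompt), both copied from the tables in this post.
rows = [
    ("Jamba Large 1.7", 51.4, 55),
    ("Claude Sonnet 4.7", 51.0, 52),
    ("Claude Opus 4.7", 50.4, 37),
    ("Grok 4.20 Reasoning", 48.3, 45),
    ("GPT-5.4", 47.1, 30),
    ("GPT-5.5", 47.0, 34),
    ("DeepSeek R1", 46.3, 38),
    ("Mistral Small", 46.2, 45),
    ("Llama 3.3 70B", 45.2, 41),
    ("DeepSeek V3", 44.3, 5),
    ("Command A", 41.3, 30),
    ("Gemini 2.5 Flash", 39.0, -1),
    ("GPT-5 nano", 37.8, 7),
    ("GPT-4o mini", 35.9, 37),
]

# Shortfall = how far the hard prompt fell below the model's own cumulative mean.
shortfall = sorted(
    ((round(mean - sunny, 1), model) for model, mean, sunny in rows),
    reverse=True,
)

for drop, model in shortfall[:3]:
    print(f"{model}: {drop:.1f} points below its cumulative mean")
```

The three largest shortfalls belong to Gemini 2.5 Flash (40.0), DeepSeek V3 (39.3), and GPT-5 nano (30.8). DeepSeek V3 looks mid-table on the mean yet falls almost as far as Gemini, while GPT-4o mini, last on the mean, actually beat its own average on this prompt.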
Honest caveats
Twelve prompts don't exhaust a model. Gemini 2.5 Flash has had whole rounds with mean Δ above +40 — this Δ -1 isn't a verdict on the model; it's a case where a specific combination of defects broke its rewrite. The honest read is "this prompt exposed a fragility", not "Gemini can't sharpen prompts".
Another caveat: the "after (mean)" column in the cumulative ranking is averaged across 74 prompts of very different origins (12 fresh Round 3 profiles + the pre-cutoff history). Cumulative ranking is useful for comparing providers at scale, not for a verdict on a specific prompt. The per-round detail is at /whet-benchmark — worth clicking "Round 3" to see the fresh result before aggregation hides what each model actually did.
The 168 pairs of numbers are in results.json. The 12 Round 3 prompts, with the exact text each model received, are in corpus.json. The blind-crafting method is documented in the benchmark README.