Blog

Notes on Whet, the benchmark, and whatever comes up. No fixed cadence.

May 10, 2026·7 min read
12 AIs defended both sides. Two didn't.
Whet Political is live: 14 models, 11 politically charged prompts, with Claude Opus 4.7 as judge. Round 1's rawest finding isn't in the average-direction leaderboard; it's in the abortion pair. When asked to defend pro-choice and then pro-life with conviction, 12 models did both. Sonnet refused one. GPT-5.4 refused the other. That differential refusal is the cleanest signal of alignment bias.
May 8, 2026·6 min read
When sharpening breaks it
Whet's Round 3: 12 fresh prompts across 14 models. The top three (Jamba, Sonnet, Opus) held their positions. But at the bottom of the round, Gemini 2.5 Flash did something rare: it shipped a rewrite that scored worse than the original prompt. A post on pediatric intake, two axes, and what the result says about a ceiling no model gets past.
May 5, 2026·5 min read·pulse
Grok beats GPT-5, loses to Claude
I put $5 into xAI, ran the 62-prompt backfill on Grok 4.20 Reasoning, and it landed straight at #4 of 14, ahead of GPT-5.5 and GPT-5.4. But Claude and Jamba stay untouched at the top, a sign that two axes are in play, not one.
Apr 25, 2026·5 min read·pulse
GPT-5.5 doesn't repair bad prompts
Sonar tested GPT-5.5 on real code and reported it follows bad instructions literally instead of fixing them. OpenAI wrote in the official guide: 'treat as new model family, not a drop-in for 5.4'. Two angles, same observation — and the Whet Benchmark measured the effect.
Apr 24, 2026·4 min read·pulse
GPT-5.5: same score, 60% slower, double the price
OpenAI shipped GPT-5.5 yesterday. I ran the 62-prompt backfill — same coverage as 5.4. The delta moved 0.4 points, inside the noise. Latency went up 60% and the price doubled. Reasoning is now the flagship default, and for meta-prompt-following that trade doesn't pay.
Apr 23, 2026·3 min read·pulse
Gemini, from 3 to 12
Three days ago Gemini ended Round 2 with 3 samples because the API went down. Today I ran the missing ones. With N=12, its delta dropped 14 points — which says more about small samples than about the model.
Apr 20, 2026·6 min read
Sonnet on top. Opus in third.
New Whet Benchmark round: 12 prompts, 12 models, 144 calls. Sonnet leads with Δ +56.1, Jamba 2nd (+54.0), Opus 3rd (+53.7). Gemini got 3 samples because the API went down. Run details and two readings of the result.
Apr 19, 2026·6 min read
Three models, one lab, twelve points
Added three OpenAI models to the Whet Benchmark — legacy mini, new nano, current flagship. The gap between worst and 4th place is 12 points. Inside the same lab. Along the way, Opus 4.7 nearly tied with Jamba and Cohere landed in the middle.
Apr 18, 2026·4 min read
Whet left the browser
Published @trywhet/cli on npm today. The post is less about shipping a CLI than about what changes when prompt quality stops being optional education and becomes part of the workflow. Along the way, an unexpected lesson about npm in 2026.
Apr 17, 2026·8 min read
8 AIs, 50 prompts, 19 runs: the first leaderboard snapshot
Jamba leads, reasoning doesn't help, Sonnet edges Opus, and everyone is better at Portuguese. What five days of benchmarking told us about the current cohort.

Benchmark dispatch (soon)

Get the findings, no noise.

When a model joins or leaves the ranking, and when the quarterly report ships.

Free quarterly report — PDF, with the ranking commented.
Occasional notes when a launch changes the standings.

Subscriptions open with the first report.