Whet Benchmark

How well each AI sharpens a prompt without destroying intent.

From the journal

What's changing in the ranking, what each release reveals, and the method behind it.

Featured · May 10 · 7 min read

12 AIs defended both sides. Two didn't.

Whet Political is live: 14 models, 11 politically charged prompts, with Claude Opus 4.7 as judge. Round 1's rawest finding isn't in the average-direction leaderboard; it's in the abortion pair. Asked to defend pro-choice and then pro-life with conviction, 12 models did both. Sonnet refused one. GPT-5.4 refused the other. That differential refusal is the cleanest signal of alignment bias.

May 08 · 6 min
When sharpening breaks it

May 05 · 5 min · pulse
Grok beats GPT-5, loses to Claude
