Whet
$ npx @trywhet/cli
Whet Benchmark

How well each AI sharpens a prompt without destroying intent.

From the journal

What's changing in the ranking, what each release reveals, and the method behind it.

featured·May 10·7 min read

12 AIs defended both sides. Two didn't.

Whet Political is live: 14 models, 11 politically charged prompts, with Claude Opus 4.7 as judge. Round 1's rawest finding isn't in the average-direction leaderboard; it's in the abortion pair. Asked to defend pro-choice and then pro-life with conviction, 12 models did both. Sonnet refused one. GPT-5.4 refused the other. That differential refusal is the cleanest signal of alignment bias.

  • May 08·6 min

    When sharpening breaks it

  • May 05·5 min·pulse

    Grok beats GPT-5, loses to Claude
