Whet Benchmark

How much can an LLM sharpen a prompt without destroying its intent?

Meta-prompt-following under pressure to preserve purpose, reported as a before/after score delta for each model. Results are measured with Whet's internal prompts and criteria; they do not constitute an academic benchmark or an official comparison. Treat them as a reference for real-world behavior, not as a definitive verdict.

A gap MMLU, HumanEval and HLE don't cover.
Open methodology · open data · open code.
Before/after delta. No stunts, no cherry-picking.

Two benchmarks in Whet

A technical axis (prompt sharpening) and a political axis (ideological positioning). Same open methodology, two distinct cuts.

Whet Benchmark — Technical

Measures how well each LLM sharpens a poorly written prompt without destroying its original intent.

14 models · 74 prompts · 3 languages

Whet Benchmark — Political

Measures how each LLM positions itself when forced out of neutrality on politically charged questions — direction, commitment, and asymmetry.

11 prompts · declared judge · co-evaluation

Not enough samples yet.

The live ranking is fed by real calls to /api/rewrite. Every time a user pastes a prompt on the landing page and clicks "Rewrite with AI", the delta for the provider that responded is added to the aggregate shown here. Come back once there is real usage, or paste a prompt on the landing page and watch the first sample appear.
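
To make the aggregation idea concrete, here is a minimal sketch of how per-provider before/after deltas could be averaged, with providers hidden until they have enough samples. Everything in it is an assumption for illustration: the `Sample` fields, the provider labels, the score scale, and the `min_samples` threshold are hypothetical, and this is not Whet's actual scoring or ranking code.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    provider: str        # hypothetical provider label
    score_before: float  # score of the original prompt (assumed scale)
    score_after: float   # score of the rewritten prompt (assumed scale)


def aggregate_deltas(samples, min_samples=3):
    """Return {provider: mean delta} for providers with enough samples.

    min_samples is an assumed threshold; providers below it are omitted,
    mirroring a "not enough samples yet" state.
    """
    deltas = defaultdict(list)
    for s in samples:
        deltas[s.provider].append(s.score_after - s.score_before)
    return {p: mean(d) for p, d in deltas.items() if len(d) >= min_samples}


# Illustrative data only, not real benchmark results.
samples = [
    Sample("provider-a", 40.0, 70.0),
    Sample("provider-a", 50.0, 60.0),
    Sample("provider-a", 30.0, 50.0),
    Sample("provider-b", 45.0, 55.0),
]
ranking = aggregate_deltas(samples)
```

With this made-up data, only provider-a reaches the assumed threshold, so provider-b stays out of the ranking until more samples arrive.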

Represent an AI model or provider and want to see how it performs here?