Whet Benchmark

How much can an LLM sharpen a prompt without destroying its intent?

Tests meta-prompt following under pressure to preserve purpose, reported as a before/after score delta for each model. Results are measured with Whet's internal prompts and criteria; they do not constitute an academic benchmark or an official comparison. Treat them as a reference for real behavior, not as a definitive verdict.
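
As a rough illustration of the before/after delta, the sketch below assumes each prompt gets one numeric score before the rewrite and one after, and that the reported figure is the mean of the differences. The function names and the example scores are hypothetical and are not taken from Whet's code or data.

```python
from statistics import mean


def score_delta(before: float, after: float) -> float:
    """Return the after-minus-before change in score for one rewrite."""
    return after - before


def average_delta(pairs: list[tuple[float, float]]) -> float:
    """Return the mean delta over a list of (before, after) score pairs."""
    return mean(score_delta(before, after) for before, after in pairs)


# Hypothetical scores for three prompts; not taken from any Whet round.
example_pairs = [(40.0, 85.0), (35.0, 80.0), (50.0, 98.0)]

# Format with an explicit sign, matching the "+45.3" style of the rounds table.
print(f"{average_delta(example_pairs):+.1f}")  # prints "+46.0"
```

A positive value means the rewritten prompts scored higher on average than the originals; how Whet actually scores a prompt is not specified here.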

A gap MMLU, HumanEval and HLE don't cover.
Open methodology · open data · open code.
Before-after delta. No stunts, no cherry-picking.

Two benchmarks in Whet

A technical axis (prompt sharpening) and a political axis (ideological positioning). Same open methodology, two distinct cuts.

Whet Benchmark — Technical

Measures how well each LLM sharpens a poorly-written prompt without destroying its original intent.

14 models · 74 prompts · 3 languages

Whet Benchmark — Political

Measures how each LLM positions itself when forced out of neutrality on politically charged questions, along three dimensions: direction, commitment, and asymmetry.

11 prompts · declared judge · co-evaluation

Available rounds

Each round is a full benchmark execution against all active providers with a fresh corpus. Each round's detail page lists the round ranking, the prompts used, and the per-provider rewrites.

round	date	providers	prompts	samples	errors	avg Δ
#5	May 08, 2026	14	12	168/168	-	+45.3
#4	May 05, 2026	1	62	62/62	-	+48.6
#3	Apr 24, 2026	1	62	62/62	-	+47.3
#2	Apr 20, 2026	12	12	144/144	-	+47.5
#1 (aggregate)	Apr 12, 2026 → Apr 19, 2026	12	50	600/600	-	+43.9
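
The sample counts in the table are consistent with one sample per provider per prompt, although the page does not say so explicitly. The sketch below checks that reading against the published rows; the `Round` class and helper names are hypothetical, and only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Round:
    label: str
    providers: int
    prompts: int
    samples_done: int
    samples_total: int


# Values copied from the rounds table.
ROUNDS = [
    Round("#5", 14, 12, 168, 168),
    Round("#4", 1, 62, 62, 62),
    Round("#3", 1, 62, 62, 62),
    Round("#2", 12, 12, 144, 144),
    Round("#1 (aggregate)", 12, 50, 600, 600),
]


def expected_samples(r: Round) -> int:
    """Assumption: every provider rewrites every prompt exactly once."""
    return r.providers * r.prompts


def is_complete(r: Round) -> bool:
    """True when all expected samples were collected."""
    return r.samples_done == r.samples_total == expected_samples(r)


for r in ROUNDS:
    print(r.label, expected_samples(r), is_complete(r))
```

Under that assumption every listed round is complete, which agrees with the empty errors column.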

Represent an AI model or provider and want to see how it performs here?