Whet Benchmark

How much can an LLM sharpen a prompt without destroying its intent?

Whet measures meta-prompt-following under pressure to preserve purpose, reported as a before/after score delta for each model. Results are measured using Whet's internal prompts and criteria; they do not constitute an academic benchmark or an official comparison. Treat them as a reference for real behavior, not as a definitive verdict.
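
As a rough illustration of what a before/after delta means, the sketch below averages the score change over a handful of samples. The function names, the score scale, and the sample values are all hypothetical; Whet's actual scoring criteria are internal and are not reproduced here.

```python
from statistics import mean


def score_delta(before: float, after: float) -> float:
    """Change in score from the original prompt to the rewritten one."""
    return after - before


def average_delta(pairs: list[tuple[float, float]]) -> float:
    """Mean delta over (before, after) score pairs."""
    return mean(score_delta(before, after) for before, after in pairs)


# Made-up (before, after) scores, for illustration only.
samples = [(40.0, 85.0), (52.0, 97.0), (30.0, 80.0)]
print(f"avg delta: {average_delta(samples):+.1f}")
```

A positive average means the rewritten prompts scored higher than the originals on whatever criteria are applied; a negative one means the rewrite made things worse.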

  • A gap MMLU, HumanEval and HLE don't cover.
  • Open methodology · open data · open code.
  • Before-after delta. No stunts, no cherry-picking.

Two benchmarks in Whet

A technical axis (prompt sharpening) and a political axis (ideological positioning). Same open methodology, two distinct cuts.

Available rounds

Each round is a full benchmark execution against all active providers with a fresh corpus. Click to open the detail with round ranking, prompts used, and per-provider rewrites.

round         date                         providers  prompts  samples  errors  avg Δ
#5            May 08, 2026                 14         12       168/168          +45.3
#4            May 05, 2026                 1          62       62/62            +48.6
#3            Apr 24, 2026                 1          62       62/62            +47.3
#2            Apr 20, 2026                 12         12       144/144          +47.5
#1 aggregate  Apr 12, 2026 → Apr 19, 2026  12         50       600/600          +43.9

Do you represent an AI model or provider and want to see how it performs here?