Meta-prompt-following under pressure to preserve purpose: before/after score delta for each model. Results are measured using Whet's internal prompts and criteria; they do not constitute an academic benchmark or an official comparison. Treat them as a reference for real behavior, not as a definitive verdict.
Two axes: a technical one (prompt sharpening) and a political one (ideological positioning). Same open methodology, two distinct cuts.
Prompt sharpening measures how well each LLM sharpens a poorly written prompt without destroying its original intent.
Ideological positioning measures how each LLM positions itself when forced out of neutrality on politically charged questions: direction, commitment, and asymmetry.
Each round is a full benchmark execution against all active providers with a fresh corpus. The detail view for each round shows the round ranking, the prompts used, and the per-provider rewrites.
| round | date | providers | prompts | samples | errors | avg Δ |
|---|---|---|---|---|---|---|
| #5 | May 08, 2026 | 14 | 12 | 168/168 | — | +45.3 |
| #4 | May 05, 2026 | 1 | 62 | 62/62 | — | +48.6 |
| #3 | Apr 24, 2026 | 1 | 62 | 62/62 | — | +47.3 |
| #2 | Apr 20, 2026 | 12 | 12 | 144/144 | — | +47.5 |
| #1 (aggregate) | Apr 12, 2026 → Apr 19, 2026 | 12 | 50 | 600/600 | — | +43.9 |
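The figures in the table above can be cross-checked with a short sketch. It assumes, without confirmation from Whet, that the samples column reads as completed/expected and that each round collects one sample per (provider, prompt) pair. It also assumes that avg Δ is the plain mean of after-minus-before scores. The function names and the example scores are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Rounds copied from the table: (round, providers, prompts, expected samples).
ROUNDS = [
    ("#5", 14, 12, 168),
    ("#4", 1, 62, 62),
    ("#3", 1, 62, 62),
    ("#2", 12, 12, 144),
    ("#1 aggregate", 12, 50, 600),
]


def expected_samples(providers: int, prompts: int) -> int:
    """Assumed rule: one sample per (provider, prompt) pair."""
    return providers * prompts


def average_delta(pairs):
    """Assumed definition of avg delta: mean of (after - before) per sample."""
    return mean(after - before for before, after in pairs)


# Every row of the table is consistent with providers * prompts.
for name, providers, prompts, samples in ROUNDS:
    assert expected_samples(providers, prompts) == samples, name

# Hypothetical (before, after) scores, not taken from any round.
example_pairs = [(30.0, 70.0), (45.0, 95.0), (20.0, 50.0)]
print(average_delta(example_pairs))  # prints 40.0
```

Under these assumptions all five rows are internally consistent, which is a cheap sanity check to run before comparing avg Δ across rounds that use different provider and prompt counts.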