Meta-prompt-following under pressure to preserve purpose: the before/after score delta for each model. Results are measured with Whet's internal prompts and criteria; they do not constitute an academic benchmark or an official comparison. Use them as a reference for real-world behavior, not as a definitive verdict.
A technical axis (prompt sharpening) and a political axis (ideological positioning). Same open methodology, two distinct cuts.
Measures how well each LLM sharpens a poorly written prompt without destroying its original intent.
Measures how each LLM positions itself when forced out of neutrality on politically charged questions — direction, commitment, and asymmetry.
Not enough samples yet.
The live ranking is fed by real calls to /api/rewrite. Every time a user pastes a prompt on the landing page and clicks "Rewrite with AI", the delta for the provider that responded is added to the aggregate shown here. Come back once there is real usage, or paste a prompt on the landing page and watch the first sample appear.
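As a rough illustration of the aggregation described above, here is a minimal sketch in Python. It is not Whet's actual implementation: the `Sample` fields, the `rank_providers` helper, the use of a plain mean, and the `MIN_SAMPLES` threshold are all assumptions made for the example, since the page does not say how scores are computed or how many samples a provider needs before it is ranked.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical threshold: the page only says "Not enough samples yet".
MIN_SAMPLES = 5


@dataclass
class Sample:
    """One rewrite call, scored before and after (field names are assumed)."""
    provider: str
    score_before: float
    score_after: float

    @property
    def delta(self) -> float:
        # Positive delta means the rewritten prompt scored higher.
        return self.score_after - self.score_before


def rank_providers(samples, min_samples=MIN_SAMPLES):
    """Return (provider, mean_delta, sample_count) rows, best mean delta first.

    Providers with fewer than `min_samples` samples are left out.
    """
    deltas = defaultdict(list)
    for sample in samples:
        deltas[sample.provider].append(sample.delta)

    ranking = [
        (provider, sum(values) / len(values), len(values))
        for provider, values in deltas.items()
        if len(values) >= min_samples
    ]
    ranking.sort(key=lambda row: row[1], reverse=True)
    return ranking
```

With no samples recorded, `rank_providers([])` returns an empty list, which corresponds to the empty state shown on the page.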
Represent an AI model or provider and want to see how it performs here?