Whet Benchmark

How much can an LLM sharpen a prompt without destroying its intent?

Whet tests meta-prompt-following under pressure to preserve purpose and reports the before/after score delta for each model. Results are measured with Whet's internal prompts and criteria; they don't constitute an academic benchmark or an official comparison. Treat them as a reference for real behavior, not as a definitive verdict.

  • A gap MMLU, HumanEval and HLE don't cover.
  • Open methodology · open data · open code.
  • Before-after delta. No stunts, no cherry-picking.

Two benchmarks in Whet

A technical axis (prompt sharpening) and a political axis (ideological positioning). Same open methodology, two distinct cuts.

Not enough samples yet.

The live ranking is fed by real calls to /api/rewrite. Every time a user pastes a prompt on the landing page and clicks "Rewrite with AI", the delta for the provider that responds is aggregated here. Come back once there's real use, or paste a prompt on the landing page and watch the first sample appear.
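
As a rough illustration of the aggregation described above, the sketch below groups before/after deltas by provider and ranks providers by mean delta. It is not Whet's actual implementation: the `Sample` fields, the score scale, the use of a plain mean, and the `min_samples` threshold are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One rewrite call. All field names are hypothetical."""
    provider: str        # label of the provider that answered the call
    score_before: float  # score of the original prompt (assumed scale)
    score_after: float   # score of the rewritten prompt (same scale)


def delta(sample: Sample) -> float:
    """Before/after delta: positive means the rewrite scored higher."""
    return sample.score_after - sample.score_before


def rank_providers(samples, min_samples=1):
    """Return (provider, mean_delta, sample_count) rows, best first.

    Providers with fewer than `min_samples` samples are left out.
    The threshold is an assumption, not a documented Whet rule.
    """
    by_provider = defaultdict(list)
    for sample in samples:
        by_provider[sample.provider].append(delta(sample))

    ranking = [
        (provider, mean(deltas), len(deltas))
        for provider, deltas in by_provider.items()
        if len(deltas) >= min_samples
    ]
    # Sort by mean delta, largest improvement first.
    return sorted(ranking, key=lambda row: row[1], reverse=True)
```

With no samples, `rank_providers([])` returns an empty list, which is one plausible way a page could end up showing a "not enough samples" state.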

Represent an AI model or provider and want to see how it performs here?