April 17, 2026 · 8 min read

8 AIs, 50 prompts, 19 runs: the first leaderboard snapshot

Jamba leads, reasoning doesn't help, Sonnet edges Opus, and everyone is better at Portuguese. What five days of benchmarking told us about the current cohort.

The Whet Benchmark asks a narrow, under-explored question: how much can an LLM sharpen a poorly-written prompt without destroying the original intent? It's not MMLU, not HumanEval, not a reasoning comparison — it's instruction-following under meta-pressure to preserve purpose. After five days of a rotating corpus, 19 runs, and 50 prompts per provider, there's enough material for a first snapshot. That's what this post is.

Today's leaderboard

Ordered by cumulative mean delta (scoreAfter − scoreBefore), deduplicated by prompt × provider with the most recent result winning. All providers ran against the same 50 unique prompts (9 fresh prompts per run, no repetition between runs).

| # | model | lab | tier | Δ | after | time | errors |
|---|-------|-----|------|---|-------|------|--------|
| 1 | Jamba Large 1.7 | AI21 Labs (Israel) | trial | +49.8 | 94.3 | 4.3s | 0% |
| 2 | Claude Sonnet (via CLI) | Anthropic (USA) | paid | +49.0 | 93.6 | 18.6s | 0% |
| 3 | Claude Opus (via CLI) | Anthropic (USA) | paid | +48.1 | 92.7 | 14.3s | 0% |
| 4 | Mistral Small | Mistral AI (France) | free | +45.0 | 89.5 | 2.3s | 0% |
| 5 | Llama 3.3 70B (Groq) | Meta (USA) via Groq | free | +44.8 | 89.4 | 3.3s | 2.5% |
| 6 | DeepSeek R1 | DeepSeek (China) | free | +44.7 | 89.3 | 40.2s | 0% |
| 7 | DeepSeek V3 | DeepSeek (China) | free | +44.6 | 89.2 | 7.7s | 0% |
| 8 | Gemini 2.5 Flash | Google (USA) | free | +40.0 | 84.5 | 8.9s | 6.6% |

Note: scoreBefore is identical for everyone (44.5), because the prompts are the same. The benchmark is same-input within each run.
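For concreteness, here is a minimal sketch of that aggregation: deduplicate by prompt × provider keeping the most recent result, then average the per-prompt deltas per provider. The record fields (prompt_id, provider, score_before, score_after, timestamp) are illustrative assumptions, not Whet's actual schema.

```python
from collections import defaultdict

def leaderboard(results):
    """Aggregate raw run records into the cumulative leaderboard.

    `results` is assumed to be a list of dicts like:
      {"prompt_id": "role-inflation-en", "provider": "jamba-large-1.7",
       "score_before": 44.5, "score_after": 91.0, "timestamp": 1776384000}
    Field names are illustrative, not Whet's actual schema.
    """
    # Deduplicate by prompt x provider, keeping only the most recent result.
    latest = {}
    for r in results:
        key = (r["prompt_id"], r["provider"])
        if key not in latest or r["timestamp"] > latest[key]["timestamp"]:
            latest[key] = r

    # Mean delta (score_after - score_before) per provider, best first.
    deltas = defaultdict(list)
    for r in latest.values():
        deltas[r["provider"]].append(r["score_after"] - r["score_before"])
    board = [(provider, sum(d) / len(d)) for provider, d in deltas.items()]
    board.sort(key=lambda row: row[1], reverse=True)
    return board
```

Everything below (the language split, the hardest-prompt list) is the same deduplicated records grouped a different way.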

Three surprising findings

1. Jamba leads. Less famous ≠ less capable.

AI21 Labs (Israel) is much less discussed than Anthropic, Google, or OpenAI in public LLM discourse — yet Jamba Large 1.7 tops the leaderboard, ahead of both Claude Sonnet and Opus. Jamba's hybrid Mamba+Transformer architecture is atypical; perhaps that structural mix is particularly good at following discrete meta-instructions without drifting. Or perhaps AI21 just trained with different weight on instruction-following. Hard to know without ablation, but the signal is real: the model sits consistently at the top across PT, EN, and ES.

2. Explicit reasoning doesn't help here.

DeepSeek R1 (a reasoner with chain-of-thought) has a virtually identical delta to DeepSeek V3 (+44.7 vs +44.6), but is 5× slower (40.2s vs 7.7s). Reasoning is an advantage for problems that require decomposition — math, code, logic. Meta-prompt-following-with-intent-preservation is a task of instruction discipline, not deep reasoning. R1 spends time thinking where there's nothing to think about. For this category specifically, V3 dominates on cost-benefit.

3. Sonnet edges Opus (slightly).

Claude Sonnet (+49.0) edges Opus (+48.1) — a small but consistent delta. There's a bigger pattern here: for well-defined instruction-following tasks, Sonnet often matches or exceeds Opus. Opus's advantage shows up in open-ended problems, long reasoning, structured creativity. Our task has surgical scope ("change only this, preserve the rest"), which is exactly where Sonnet is sharp enough that Opus's depth isn't needed.

The linguistic asymmetry: why is everyone better at Portuguese?

The most consistent finding — and the most suspicious — is that all 8 providers sharpen better in Portuguese than in English, and worst in Spanish. No exceptions.

| model | Δ PT | Δ EN | Δ ES | gap PT−ES |
|-------|------|------|------|-----------|
| Jamba Large 1.7 | +56.3 | +47.7 | +43.3 | +13.0 |
| Claude Sonnet | +55.3 | +48.4 | +41.4 | +13.9 |
| Claude Opus | +54.9 | +47.7 | +39.5 | +15.4 |
| DeepSeek V3 | +50.6 | +43.4 | +37.9 | +12.7 |
| Mistral Small | +49.6 | +44.8 | +38.9 | +10.7 |
| Llama 3.3 70B | +49.5 | +48.8 | +34.6 | +14.9 |
| DeepSeek R1 | +48.8 | +46.7 | +37.4 | +11.4 |
| Gemini 2.5 Flash | +46.6 | +39.5 | +31.7 | +14.9 |

The most honest reading: this is probably corpus bias, not a model phenomenon. The PT prompts are the first I write in each run; as a native speaker, I craft prompts that are "cleanly dirty," with sharper patterns for Whet to catch. The EN and ES prompts may trigger fewer of Whet's rules per prompt, leaving less delta headroom. It's a methodology caveat, not a finding about the models.

One plausible alternative: Whet itself may have better calibration in PT because it was developed in PT first. The analysis detects the same patterns in 3 languages, but there may be subtleties that reduce sensitivity in EN/ES. Worth investigating in a future post.
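For anyone who wants to reproduce the gap column: it's the same deduplicated records, grouped by the prompt's language suffix. A minimal sketch, assuming prompt IDs end in "-pt", "-en", or "-es" (as the prompt names later in this post suggest) and reusing the illustrative record fields from the leaderboard sketch above:

```python
from collections import defaultdict

LANGS = ("pt", "en", "es")

def language_breakdown(latest_records):
    """Per-provider mean delta split by prompt language, plus the PT-ES gap.

    Takes the deduplicated records from leaderboard() above. Assumes each
    prompt_id ends with a language suffix ("-pt", "-en", "-es"), e.g.
    "tone-domain-mismatch-pt"; that convention is inferred from the prompt
    names in this post, not a documented Whet contract.
    """
    per_provider = defaultdict(lambda: defaultdict(list))
    for r in latest_records:
        lang = r["prompt_id"].rsplit("-", 1)[-1]
        if lang in LANGS:
            per_provider[r["provider"]][lang].append(
                r["score_after"] - r["score_before"]
            )

    rows = {}
    for provider, by_lang in per_provider.items():
        means = {lang: sum(d) / len(d) for lang, d in by_lang.items()}
        means["gap_pt_es"] = means.get("pt", 0.0) - means.get("es", 0.0)
        rows[provider] = means
    return rows
```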

The speed × quality map

If the choice is practical cost-benefit (which provider to use to rewrite a prompt right now), the ranking shifts (a small Pareto sketch follows this list):

  • Best free cost-benefit: Mistral Small — +45.0 in 2.3s. Only 4.8 points behind the leader, but 8× faster than Claude and 17× faster than R1.
  • Best quality when time doesn't matter: Jamba. Top delta, and at 4.3s it's still fast in practice.
  • Fastest free tier: Mistral (2.3s), followed by Groq/Llama (3.3s).
  • Claude CLI is the slowest of the top tier: Sonnet (18.6s) and Opus (14.3s). CLI subprocess overhead contributes, but the output is still very high quality.
  • R1 only makes sense where V3 isn't an option: it's 5× slower for a near-identical delta. For this task, pure loss.
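One way to make "the ranking shifts" concrete is a Pareto filter over the two axes that matter here, delta and latency: keep only providers that no other provider beats on both at once. A small sketch using the numbers from today's table; the function is a generic illustration, not part of the Whet runner:

```python
def pareto_frontier(models):
    """Keep only (name, delta, seconds) entries that are not dominated:
    no other entry is at least as good on delta AND at least as fast
    while being strictly better on one of the two."""
    frontier = []
    for name, delta, secs in models:
        dominated = any(
            od >= delta and osecs <= secs and (od, osecs) != (delta, secs)
            for _, od, osecs in models
        )
        if not dominated:
            frontier.append((name, delta, secs))
    return sorted(frontier, key=lambda m: m[2])  # fastest first

cohort = [
    ("Jamba Large 1.7", 49.8, 4.3), ("Claude Sonnet", 49.0, 18.6),
    ("Claude Opus", 48.1, 14.3), ("Mistral Small", 45.0, 2.3),
    ("Llama 3.3 70B", 44.8, 3.3), ("DeepSeek R1", 44.7, 40.2),
    ("DeepSeek V3", 44.6, 7.7), ("Gemini 2.5 Flash", 40.0, 8.9),
]
print(pareto_frontier(cohort))  # only Mistral Small and Jamba Large 1.7 survive
```

Every other provider in the cohort is beaten on both axes by one of those two.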

Where they all fail together (the noise floor)

Some prompts consistently defeat the entire cohort. The five hardest (mean across all providers):

  1. tone-domain-mismatch-pt — Δ−3.9 (worsened, on average)
  2. role-inflation-es — Δ+0.5
  3. threat-framing-pt — Δ+2.6
  4. role-inflation-en — Δ+4.4
  5. tone-domain-mismatch-en — Δ+5.4

Two classes dominate: tone-domain-mismatch (a prompt mixing rigid corporate tone with a domain that calls for looseness, or vice versa) and role-inflation (instructions with inflated titles like "you are the world's most experienced consultant in X"). These are patterns Whet detects clearly, but that models apparently don't know how to correct without collapsing the intent — or they collapse the intent and lose even more points.

tone-domain-mismatch-pt has a negative delta — models, on average, made the prompt worse. That's the clearest signal the benchmark can generate: a collective blind spot of the state-of-the-art.
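The hardest-prompt ranking is the same aggregation as the leaderboard, just grouped by prompt instead of provider and sorted ascending. A minimal sketch over the same illustrative records:

```python
from statistics import mean

def hardest_prompts(latest_records, top_n=5):
    """Rank prompts by mean delta across the whole cohort, lowest first.

    Uses the deduplicated records from leaderboard() above. A negative mean
    means the cohort made that prompt worse on average.
    """
    by_prompt = {}
    for r in latest_records:
        by_prompt.setdefault(r["prompt_id"], []).append(
            r["score_after"] - r["score_before"]
        )
    ranked = sorted(
        ((pid, mean(deltas)) for pid, deltas in by_prompt.items()),
        key=lambda row: row[1],
    )
    return ranked[:top_n]
```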

Error rate: the silent indicator

Six providers deliver 0% failures across all 50 calls (Claude Opus, Claude Sonnet, Jamba, Mistral, DeepSeek V3, and DeepSeek R1). Two providers fail occasionally: Groq/Llama at 2.5% (likely peak-hour rate limiting on the free tier) and Gemini at 6.6% (429s after ~18 daily calls on the free tier, the most aggressive limit in the cohort).

The correlation is telling: the least reliable providers are also at the bottom of the leaderboard. Part of that is an artifact rather than causation, since a free-tier failure means the rotator prioritizes that provider less often, leaving less data on it. But causality is there too: a model that fails 6% of calls will never be a serious production candidate, regardless of potential.

Honest caveats

  • 19 runs in 5 days is a small sample. The entire cohort is still warming up. Temporal variability hasn't been measured systematically.
  • Whet score ≠ absolute quality. The scorer and diagnostics share the same rules — a model that "games" with superficial reformulation scores well without truly sharpening. Cross-validation depends on blind agents running in the rule-evaluation workflow separately.
  • Claude runs via CLI subprocess, not the direct API. That means the model carries the Claude Code harness's system prompt; it isn't a "pure model" measurement. Migrating to the direct API is on the backlog; the likely effect is Claude rising a bit.
  • The corpus rotates. The 50 prompts covered today are not the 50 of a month from now. Longitudinal comparisons need care around overlap.
  • Free tier ≠ production. The free models here may have quantization or throttling that don't apply to the paid API version. The leaderboard reflects the current state of the free tier, not the model's raw capability.

What's next

Anthropic just released Claude Opus 4.7. Today's benchmark reflects 4.6 (the CLI default until now). The next post compares Opus 4.6 vs 4.7 same-input: it runs 4.7 over exactly the prompts 4.6 faced and measures whether the upgrade moves the needle on meta-prompt-following specifically, or whether the release's improvements are concentrated on other axes.
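The comparison itself is easy to state: restrict to prompts both versions saw, then look at the paired per-prompt changes rather than two independent means. A sketch of that shape, with hypothetical function and field names rather than the actual runner:

```python
from statistics import mean

def paired_upgrade_comparison(old_deltas, new_deltas):
    """Same-input comparison sketch for the planned Opus 4.6 vs 4.7 post.

    Both arguments map prompt_id -> delta for one model version. Only
    prompts present in both are compared, so the measurement stays
    prompt-matched. Illustrative code, not the actual Whet runner.
    """
    shared = sorted(set(old_deltas) & set(new_deltas))
    paired = [(pid, new_deltas[pid] - old_deltas[pid]) for pid in shared]
    return {
        "n_prompts": len(shared),
        "mean_change": mean(change for _, change in paired) if paired else 0.0,
        "regressions": [pid for pid, change in paired if change < 0],
    }
```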

Other posts under consideration: analysis of why tone-domain-mismatch defeats everyone; onboarding new providers (PROVIDERS-BACKLOG.md has candidates mapped); and a direct investigation of the suspected PT corpus bias.

If you represent an AI model or provider and want to see how it performs here, the benchmark has a rotating corpus and open runner — just reach out at hello@trywhet.com.