Gemini, from 3 to 12
Three days ago Gemini ended Round 2 with 3 samples because the API went down. Today I ran the missing ones. With N=12, its delta dropped 14 points — which says more about small samples than about the model.
Three days ago, I posted that Gemini 2.5 Flash had ended Round 2 of the Whet Benchmark with 3 samples instead of 12. The API went down mid-run, and the free-tier daily quota burned away on retries that came back HTTP 503 ("overloaded") and then HTTP 429 ("quota exceeded"). I noted it on that day's ranking: incomplete sample, provider stays pending retry until the API normalizes.
Today I ran the missing ones: 8 passed, 0 errors. Gemini now has N=12 in Round 2, like the other 11 models.
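The rerun's quota policy was the obvious one: back off on 503, stop dead on 429, since retrying a blown quota is what burned the original run. A minimal sketch of that policy in Python; the endpoint URL, payload shape, and function name are all hypothetical, not the benchmark's actual harness:

```python
import time

import requests  # any HTTP client would do; requests keeps the sketch short

API_URL = "https://example.invalid/v1/generate"  # hypothetical endpoint

def call_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    """Retry HTTP 503 with exponential backoff; give up immediately on HTTP 429."""
    delay = 2.0
    for _ in range(max_retries):
        resp = requests.post(API_URL, json=payload, timeout=60)
        if resp.status_code == 503:   # overloaded: wait, then try again
            time.sleep(delay)
            delay *= 2
            continue
        if resp.status_code == 429:   # quota exceeded: retrying only burns more quota
            raise RuntimeError("daily quota exhausted; resume when it resets")
        resp.raise_for_status()       # surface any other error
        return resp.json()            # success
    raise RuntimeError(f"still overloaded after {max_retries} attempts")
```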
The number that moved
Gemini's mean delta in Round 2 went from +52.7 (N=3) to +38.4 (N=12). Fourteen points.
On the April 20 leaderboard, Gemini sat in 4th place. With the full sample it drops to 12th of 12, below GPT-4o mini (+39.7). That's the biggest reshuffle a sample completion has caused in the Whet Benchmark so far.
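The arithmetic behind that move also pins down what the nine late samples must have averaged. Nothing below is new data, just the two published means rearranged:

```python
mean_n3, mean_n12 = 52.7, 38.4

# Sum over all 12 samples minus the sum over the original 3,
# spread across the 9 samples that arrived later:
late_mean = (mean_n12 * 12 - mean_n3 * 3) / 9
print(round(late_mean, 1))  # 33.6, nineteen points under the N=3 mean
```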
The 8 retry prompts
| prompt | domain | before | after | Δ |
|---|---|---|---|---|
| ux-research-assistant-en | design / UX | 25 | 100 | +75 |
| cooperative-agronomist-es | agronomy | 66 | 100 | +34 |
| newsroom-writer-es | journalism | 43 | 76 | +33 |
| nutripal-weight-loss-en | nutrition | 48 | 79 | +31 |
| virtual-pharmacist-pt | pharmacy | 65 | 86 | +21 |
| mental-health-wellness-es | mental health | 58 | 79 | +21 |
| senior-structural-engineer-pt | civil engineering | 77 | 93 | +16 |
| streaming-script-editor-en | film | 64 | 79 | +15 |
| mean of 8 | | 55.8 | 86.5 | +30.8 |
Technically there were 9 holes; one of them (personal-investment-advisor-es) got filled by a silent partial retry hours after the post went out, so it never made it into the public narrative. The Round 2 post says N=3, results.json already showed N=4, and today it reached N=12.
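For the record, the merge itself is plain bookkeeping: an errored or missing entry for a (provider, prompt) pair gets replaced by the retry's record, then the provider's N is recounted. A sketch under an assumed schema, a flat samples list inside results.json; the real file's layout may well differ:

```python
import json

def merge_retries(results_path: str, retries: list[dict],
                  provider: str = "gemini-2.5-flash") -> int:
    """Fold retry records into results.json, one record per (provider, prompt)."""
    with open(results_path) as f:
        data = json.load(f)

    # Index existing samples so a retry replaces a hole instead of duplicating it.
    by_key = {(s["provider"], s["prompt"]): s for s in data["samples"]}
    for r in retries:
        by_key[(provider, r["prompt"])] = {**r, "provider": provider}

    data["samples"] = list(by_key.values())
    with open(results_path, "w") as f:
        json.dump(data, f, indent=2)

    # The provider's new N, which should now read 12.
    return sum(1 for s in data["samples"] if s["provider"] == provider)
```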
What N=3 does
This post isn't about Gemini. It's about what a small sample does to what you think you're looking at.
Gemini's 3 original samples came from prompts where the model did well: +58, +44, +56, mean +52.7. A model at that level competes with Sonnet and Opus. The full sample exposed what the truncation was hiding: Gemini swings hard, +75 on a UX prompt and +15 on a film prompt, +34 on agronomy and +16 on engineering. The real mean only emerges at large N, and for this model the reality is an inconsistency that three lucky samples can't show you.
This applies to all 12 models, not just Gemini. The benchmark runs N=12 per provider per round for exactly this reason, and Round 2 showed that a sample degraded to N=3 can misplace a model's mean by around 14 points. Here the error was an overestimate, but an unlucky draw would skew it just as far the other way.
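You can put numbers on that luck. Taking the eleven per-prompt deltas published across the two posts (the silently filled twelfth sample has no published per-prompt value, so it stays out), every possible N=3 draw lands somewhere in a wide band:

```python
from itertools import combinations
from statistics import mean

# 3 original deltas plus the 8 retry deltas from the table above.
deltas = [58, 44, 56] + [75, 34, 33, 31, 21, 21, 16, 15]

sub_means = [mean(c) for c in combinations(deltas, 3)]
print(round(min(sub_means), 1), round(max(sub_means), 1))
# 17.3 63.0: the three kept samples (mean +52.7) sat near the top of that band
```

An equally plausible unlucky draw would have shown a model in the teens; neither number is the model.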
Two things I'm not going to do
- Edit the Round 2 post with the new numbers. That post stays frozen at the state of the day, with the table still marking Gemini as N=3, incomplete sample. Retroactive correction is revisionism. Anyone who wants the updated number goes to /whet-benchmark, which always reflects the current results.json, and the +38.4 is already there.
- Declare that "Gemini is the worst of the cohort" became truth. Today's +38.4 is the official Round 2 number going forward: it counts in the cumulative ranking, it shows up at /whet-benchmark, I'm not dismissing anything. What a single run doesn't prove is a stable pattern; only Rounds 3, 4, and 5 will tell whether Gemini actually stays at the bottom or whether this was simply a bad round for it. That applies to all 12 models, Sonnet at the top included.
What happened today is on the record in results.json: 8 retries merged into run 2026-04-20. Round 2 now has 144 samples (12 × 12). Next round comes soon — and this time with Gemini's quota watched from the start.