Grok was on standby in Whet's provider backlog since April, waiting on a budget decision. I put $5 into xAI today (the console manual minimum), uncommented the provider in the runner, and ran the backfill on the same 62 prompts GPT-5.4 and 5.5 had been tested on. Equal footing — nobody dropped out to make room for Grok.

Direct result: mean Δ +48.6, 62/62 with no errors. Final position #4 out of 14 on the cumulative leaderboard. Ahead of GPT-5.5 (+47.3) and GPT-5.4 (+47.0); behind Sonnet, Opus, and Jamba — the three above +50. Real backfill cost: $0.28 (remaining balance $4.72) — well within an experimental run, and the balance covers dozens of new runs before a top-up is needed.

Where Grok landed

sample · 62 prompts (same set across all)

pos	model	after (mean)	mean Δ
top of ranking
#1	Jamba Large 1.7	94.0	+50.6
#2	Claude Opus 4.7	93.9	+50.5
#3	Claude Sonnet 4.7	93.8	+50.4
Grok enters here
#4	Grok 4.20 Reasoning	92.0	+48.6
OpenAI flagships (passed)
#5	GPT-5.5	90.8	+47.3
#6	GPT-5.4	90.4	+47.0

Grok 4.20 Reasoning lands in fourth with Δ +48.6, within striking distance of the top trio (Jamba, Claude Opus, Claude Sonnet — all above +50) and 1.3 points ahead of GPT-5.5. For context: 1.3 points isn't noise when the N is the same and the set is identical. 5.5 and 5.4 sat three tenths of a point apart on the same backfill — a margin I called noise. 1.3 is four times that gap.

The reading: two axes, not one

The hypothesis logged before the test was: "less restrictive alignment → better preservation of the rewrite meta-instruction (less RLHF-driven instinct to 'soften' the original prompt)". Grok is the cleanest case of this hypothesis in the cohort — xAI sells the product explicitly as less restricted.

The result partially confirms. It confirms because Grok beat the entire OpenAI family, including the current reasoning flagship. It doesn't confirm because the ceiling stayed intact: Claude Opus, Claude Sonnet, and Jamba all sit above.

The reading this suggests is that there are two distinct axes in what Whet measures:

Alignment looseness. Less reluctance to actually rewrite the user's prompt, instead of returning a "clean but equivalent" version. Grok climbed this axis — past the entire OpenAI family.
Base capacity to follow meta-instruction. Understanding what the rewrite instruction is asking and executing with depth. Claude and Jamba dominate this axis — and Grok didn't reach it.

The two axes can move independently. You can have a model open to rewriting (axis 1) that still doesn't interpret the instruction with depth (low axis 2). Or the opposite — a model that interprets perfectly, but resists rewriting on alignment principle. Whet's cumulative ranking, read through this lens, stops being a single scale and becomes a 2D map — with top 5 sitting in different regions of the map for different reasons.

Honest caveats

Same posture as the previous posts: one round doesn't prove a stable pattern. Grok is at N=62 backfilled — same set as GPT-5.4/5.5, but different from Round 2 which caught Claude/Jamba on fresh prompts. Mixed test regimes. The next round (Round 3) will pull everyone into the same 12 fresh same-input prompts, and that's when we'll be able to look with stronger ground.

Another Grok-specific caveat: it runs as a reasoning model with mean latency around 17s per call. If the real application needs sub-second response, the Δ gain doesn't pay off. Whet measures how much it sharpens, not how much it sharpens per second.

Sources

Factual claims about Grok (model, API parameters, alignment positioning) were cross-checked against official sources. Benchmark numbers are ours — in the results.json linked below.

xAI · Models and Pricing — official documentation of API models, including grok-4.20-reasoning
xAI · Manage Billing — console top-up rules ($5 manual minimum, $25 auto top-up minimum)
xAI · Responses API reference — Responses API shape ({ model, input } instead of OpenAI-compatible chat completions) — required by the provider
xAI · Alignment positioning — xAI's public editorial line framing Grok as 'less restricted' — source of the tested hypothesis

The 62 pairs of numbers are in results.json. The updated ranking is at /whet-benchmark. Provider, runner, and editorial backlog updated in the same PR.