GPT-5.5: same score, 60% slower, double the price
OpenAI shipped GPT-5.5 yesterday. I ran the 62-prompt backfill, the same coverage as 5.4. The delta moved 0.3 points, inside the noise. Latency went up 60% and the price doubled. Reasoning is now the flagship default, and for meta-prompt-following that trade doesn't pay.
OpenAI shipped GPT-5.5 yesterday (April 23, 2026) as the flagship successor to GPT-5.4. I ran the Whet Benchmark on it via the backfill procedure — the same 62 prompts GPT-5.4 had already covered. Equal footing, 5.4 stays in the ranking.
Before the score, a scare
The very first request came back with HTTP 400: "Unsupported value: 'temperature' does not support 0.3". All 62 prompts failed with the same error. The provider file had been written to mirror GPT-5.4's: regular chat, with temperature: 0.3 and max_completion_tokens: 4096.
GPT-5.5 isn't chat. It's reasoning-by-default — like GPT-5 nano, or like the o1 family. There's a technical reason behind the temperature refusal: reasoning models run internal rounds of drafting, verifying, and selecting before emitting output, and exposing an external sampling knob would destabilize the whole calibration. The knob integrators get instead is reasoning_effort — how much internal thinking the model spends before replying.
I deleted the 62-error run from results.json, reconfigured the provider for reasoning, and ran again. Zero errors on the second pass. The 5.5 default is reasoning_effort: "medium"; I used "low" to stay consistent with the GPT-5 nano provider (also reasoning, also run at low in the benchmark). At medium, the latency number below would be higher — and that itself is a data point.
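The fix boils down to swapping one knob for another. A minimal sketch of the before/after provider config, assuming the provider file reduces to Chat Completions kwargs (the dict shapes and the build_request helper are mine, not the benchmark's real internals):

```python
# Old provider config, mirroring GPT-5.4: a conventional chat model
# with an external sampling knob.
chat_config = {
    "model": "gpt-5.4",
    "temperature": 0.3,
    "max_completion_tokens": 4096,
}

# Reasoning models reject temperature outright, so it goes away and
# reasoning_effort takes its place ("low", matching the GPT-5 nano provider).
reasoning_config = {
    "model": "gpt-5.5",
    "reasoning_effort": "low",
    "max_completion_tokens": 4096,
}

def build_request(config: dict, system_prompt: str, user_prompt: str) -> dict:
    """Assemble the kwargs passed to client.chat.completions.create(**kwargs)."""
    return {
        **config,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
```

The point of keeping both configs side by side is that nothing else changes: same messages, same token cap, one sampling knob exchanged for one effort knob.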
The backdrop
Three release facts that anchor the delta discussion:
- Reasoning moved from opt-in to flagship default. In the previous generation, anyone wanting internal chain-of-thought picked gpt-5-nano or o1; 5.4 stayed conventional chat. On 5.5, everyone pays the internal reasoning budget, wanted or not.
- API price doubled. GPT-5.4 was $2.50 per million input tokens and $15 per million output tokens. GPT-5.5 is $5 and $30, exactly 2× on each column.
- Context window moved to 1M tokens. That's relevant for long-horizon tasks, but it's orthogonal to what the Whet Benchmark measures: system prompts of 500 to 1000 characters fit in any reasonable context window.
What moved on the score
| | GPT-5.4 | GPT-5.5 |
|---|---|---|
| before (mean) | 43.4 | 43.4 |
| after (mean) | 90.4 | 90.8 |
| mean Δ | +47.0 | +47.3 |
| mean latency | 4.7s | 7.4s |
| head-to-head (5.5 wins / ties / 5.4 wins) | 24 / 18 / 20 | |
Essentially nothing. Mean delta went from +47.0 to +47.3. Three tenths of a point, on a scale that swings 14 points between N=3 and N=12 samples of the same model. Noise, not signal.
Prompt by prompt, 5.5 wins 24, ties 18, loses 20. That flat a distribution means neither model has evidence of meta-prompt-following superiority at this sample size. For ranking purposes, GPT-5.5 enters statistically tied with 5.4.
What moved on the cost
Mean time per prompt went from 4.7s on 5.4 to 7.4s on 5.5. About 60% slower. Consistent with the chat-to-reasoning switch: the model burns a reasoning-token budget before emitting output, even with reasoning_effort: "low". At the model default (medium) that number would climb further.
Adding price to the ledger: paying double per token and waiting 60% longer in exchange for 0.3 points on a metric that swings 14 points at small sample sizes is a hard trade to defend — at least for this task. For sharpening prompts, GPT-5.4 remains the sensible pick in this family.
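To make the ledger concrete, a back-of-the-envelope cost for one full 62-prompt pass. The per-million prices are the published ones; the per-prompt token counts are assumptions for illustration, not measured figures:

```python
def run_cost(n_prompts: int, in_tok: int, out_tok: int,
             price_in: float, price_out: float) -> float:
    """Dollar cost of one full pass; prices are $ per million tokens."""
    return n_prompts * (in_tok * price_in + out_tok * price_out) / 1_000_000

# Assumed per-prompt token counts (illustrative only):
IN_TOK, OUT_TOK = 1_500, 1_000

cost_54 = run_cost(62, IN_TOK, OUT_TOK, 2.50, 15.00)  # GPT-5.4 pricing
cost_55 = run_cost(62, IN_TOK, OUT_TOK, 5.00, 30.00)  # GPT-5.5 pricing
```

With identical token counts the ratio is exactly the 2× price bump; in practice reasoning models also bill their internal reasoning tokens as output, so the real ratio for 5.5 lands above 2×.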
What this says
One of the field's quiet bets in 2025/2026 is that making reasoning the default improves "anything" an LLM does. The delta here suggests that, for meta-prompt-following on system prompts, that bet doesn't pay off. Following a meta-instruction that asks for a rewrite preserving intent isn't a task that responds well to thinking harder — it's a task that responds well to following the received instruction cleanly. Different things.
Usual caveat: one run doesn't prove a stable pattern. 5.4 and 5.5 are at N=62 each, covering the same prompts but run in different time windows. Round 3 will put both on the same 12 fresh prompts with identical inputs, and we'll look again. For now: equal-footing entry done, ranking updated, and an empirical confirmation that reasoning-by-default isn't free.
Factual claims about GPT-5.5 (release date, API parameters, pricing, context window) were cross-checked against public sources. The benchmark numbers are ours — in the results.json linked below.
- OpenAI · Introducing GPT-5.5 — official announcement: release date (April 23, 2026), model positioning, API availability
- OpenAI · API docs changelog — API parameters, reasoning_effort defaults, chat completions compatibility
- OpenAI Developer Community · Temperature in GPT-5 models — technical discussion on why reasoning models reject temperature (internal drafting/verification loop)
- The Decoder · GPT-5.5 at double the API price — secondary coverage on the 2× pricing vs. 5.4 — used only to cross-check against official pricing
The 62 pairs of numbers are in results.json. The updated ranking is at /whet-benchmark. GPT-5.4 stays where it was — nobody got dropped to make room for 5.5.