April 25, 2026 · 5 min read

GPT-5.5 doesn't repair bad prompts

Sonar tested GPT-5.5 on real code and reported that it follows bad instructions literally instead of fixing them. OpenAI's official migration guide says to 'treat as new model family, not a drop-in for 5.4'. Two angles, same observation, and the Whet Benchmark put a number on the effect.

Sonar published an evaluation of GPT-5.5 on production code this week. The line that stuck: the model followed instructions "too literally" when the prompt was poorly structured, lacked detail, or rested on weak underlying concepts, and in those cases it didn't repair the direction on its own. That's their phrasing. The observation isn't new to anyone who works with prompts daily, but it's the first time it shows up in an independent benchmark of a freshly launched flagship.

OpenAI published the official migration guide the same week. Recommendation number one: treat 5.5 as a new family, not a drop-in replacement for 5.4. Start from a clean baseline and tune verbosity, format, and reasoning effort; don't carry the old prompt over wholesale.
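For concreteness, here's a minimal sketch of what re-baselining looks like with the OpenAI Python SDK. The reasoning-effort and verbosity knobs exist in the Responses API as of the GPT-5 generation; whether 5.5 keeps the same parameters, and the model string itself, are assumptions on my part.

```python
from openai import OpenAI

client = OpenAI()

# Start from a minimal prompt and set the knobs explicitly instead of
# porting the 5.4 prompt wholesale. Model name is assumed, not confirmed.
response = client.responses.create(
    model="gpt-5.5",
    input="Summarize the attached contract in five bullet points.",
    reasoning={"effort": "medium"},  # tune per task, not per habit
    text={"verbosity": "low"},       # 5.x-generation verbosity control
)
print(response.output_text)
```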

Two texts, two angles, same observation. New models are almost always sold as more robust to bad prompts: less prompt engineering, less babysitting. 5.5 inverted the promise. It's more sensitive to prompt quality, not less.

What Whet measured

The backfill I ran on 5.5 right after launch (covered in the April 24 post) gives an adjacent number: 62 deliberately bad prompts spanning role inflation, too many imperatives, vague instructions, and internal contradictions. The same 62 used for 5.4. 5.4's mean delta was +47.0; 5.5's was +47.3. A 0.3-point gap, well inside the noise.
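For the record, this is roughly how that mean delta falls out of the run data. The results.json path is the one the benchmark page keeps; the record fields (model, score_before, score_after) are my assumptions about the schema, not Whet's actual format.

```python
import json
from statistics import mean

# Compute the per-prompt delta and its mean per model from the run file.
# Field names below are assumed for illustration, not the real schema.
with open("results.json") as f:
    runs = json.load(f)

for model in ("gpt-5.4", "gpt-5.5"):
    deltas = [
        r["score_after"] - r["score_before"]
        for r in runs
        if r["model"] == model
    ]
    print(f"{model}: n={len(deltas)} mean delta={mean(deltas):+.1f}")
```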

At first glance that looks fine; 5.5 delivered, after all. But 5.5 is reasoning by default. It ruminates before answering, and it was reasonable to expect that rumination to do something when the input was bad: catch the contradiction between "be concise" and "be exhaustive" in the same prompt, drop the role inflation in "the world's best lawyer with 25 years of experience", pick one of the two opposing instructions instead of obeying both. It didn't. 5.5 sharpened the 62 prompts with the same effect as 5.4, and 5.4 had no reasoning at all.
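To make the defect categories concrete, here are hand-written examples in the spirit of the test set. These are illustrative, not the actual Whet items.

```python
# Illustrative examples of the four defect categories in the test set.
# Hand-written for this post; not the actual Whet benchmark prompts.
BAD_PROMPTS = {
    "internal_contradiction": (
        "Be concise. Cover every edge case exhaustively. "
        "Summarize the contract below."
    ),
    "role_inflation": (
        "You are the world's best lawyer with 25 years of experience. "
        "Summarize the contract below."
    ),
    "imperative_overload": (
        "Read carefully. Think step by step. Never skip anything. "
        "Always double-check. Do not guess. Summarize the contract below."
    ),
    "vague_instruction": "Make the contract summary better.",
}
```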

Hypothesis, not verdict

The hypothesis that ties the two angles together goes roughly like this: reasoning amplifies the prompt's structure rather than questioning it. The model spends its reasoning budget figuring out how to comply with the instruction, not whether the instruction makes sense. That's defensible as design; you don't want a model that ignores what was asked. But the side effect is exactly this: good prompts get better, bad prompts stay the same or get worse.

A single data point doesn't prove a hypothesis. Whet has one run with 5.5; Sonar has theirs, on a different axis (code rather than prompt rewriting). But two independent readings converging on the same point deserve attention. The next benchmark rounds will give more N for this question, especially a reasoning-versus-chat comparison across the other providers in the cohort, which hasn't been a direct focus until now.

What it changes

For people running GPT-5.5 in production: invest more in prompt cleanup before calling the model; don't expect reasoning to fix things on the way. OpenAI's guide says it technically ("treat as new family"), Sonar says it operationally (be more specific about constraints and success criteria), and Whet delivers the missing number between the two.
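What "cleanup before calling" can look like in practice: a cheap lint pass that flags the defect categories before the prompt ever reaches the model. The patterns below are illustrative starting points, not a production ruleset.

```python
import re

# Pre-call lint: flag known prompt defects instead of trusting the model
# to repair them. Patterns are illustrative, not a real ruleset.
CONTRADICTION_PAIRS = [("concise", "exhaustive"), ("brief", "detailed")]
ROLE_INFLATION = re.compile(r"world'?s (best|leading)|\d+ years of experience", re.I)
IMPERATIVES = re.compile(r"\b(always|never|must|do not)\b")

def lint_prompt(prompt: str) -> list[str]:
    warnings = []
    lowered = prompt.lower()
    for a, b in CONTRADICTION_PAIRS:
        if a in lowered and b in lowered:
            warnings.append(f"possible contradiction: '{a}' vs '{b}'")
    if ROLE_INFLATION.search(prompt):
        warnings.append("role inflation: trim the persona, keep the task")
    if len(IMPERATIVES.findall(lowered)) > 4:
        warnings.append("imperative overload: consolidate constraints")
    return warnings

# Surface the warnings in CI or logs; fix the prompt, then call the model.
for w in lint_prompt("You are the world's best lawyer. Be concise but exhaustive."):
    print("WARN:", w)
```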

For people measuring models: meta-prompt-following should show up in more benchmarks. MMLU measures what the model knows; HumanEval measures whether it writes code that passes tests; Tau-bench measures agentic behavior. None of them measures what happens when the instruction itself is broken, and that's the real interface, day to day, for almost everyone. The Whet Benchmark measures only that slice, and it will probably remain one of the few that does until someone bigger steps in.

The next round comes soon. The question will be how much of this is specific to 5.5 and how much belongs to the reasoning family as a whole: o1, GPT-5 nano, and the reasoning modes of other providers, all on the same test with deliberately bad prompts. That will let us say with more confidence whether reasoning amplifies defects or 5.5 was an isolated case. Current numbers live in /whet-benchmark; the results.json keeps the full run. Sonar's and OpenAI's pieces are linked above.