corpus v3
| # | model | lab | before | after | Δ | latency | coverage |
|---|---|---|---|---|---|---|---|
| 1 | Jamba Large 1.7 (AI21) `jamba-large-1.7` | AI21 Labs (Israel), trial | 38.6 | 94.0 | +55.4 | 4.9s | 12/12 |
| 2 | Claude Sonnet (via CLI) `claude-sonnet-4-6` | Anthropic (USA), paid | 38.6 | 92.5 | +53.9 | 27.4s | 12/12 |
| 3 | Claude Opus (via CLI) `claude-opus-4-7` | Anthropic (USA), paid | 38.6 | 88.8 | +50.2 | 15.6s | 12/12 |
| 4 | DeepSeek R1 `deepseek-reasoner` | DeepSeek (China), free | 38.6 | 86.8 | +48.2 | 18.8s | 12/12 |
| 5 | Mistral Small `mistral-small-latest` | Mistral AI (France), free | 38.6 | 86.6 | +48.0 | 2.5s | 12/12 |
| 6 | GPT-5.4 (OpenAI) `gpt-5.4` | OpenAI (USA), paid | 38.6 | 86.6 | +48.0 | 6.6s | 12/12 |
| 7 | Llama 3.3 70B (Groq) `llama-3.3-70b-versatile` | Meta (USA) via Groq, free | 38.6 | 85.6 | +47.0 | 4.2s | 12/12 |
| 8 | Grok 4.20 Reasoning (xAI) `grok-4.20-reasoning` | xAI (USA), data-sharing | 38.6 | 85.5 | +46.9 | 26.6s | 12/12 |
| 9 | GPT-5.5 (OpenAI) `gpt-5.5` | OpenAI (USA), paid | 38.6 | 83.7 | +45.1 | 10.1s | 12/12 |
| 10 | DeepSeek V3 `deepseek-chat` | DeepSeek (China), free | 38.6 | 78.7 | +40.1 | 5.1s | 12/12 |
| 11 | Command A (Cohere) `command-a-03-2025` | Cohere (Canada), trial | 38.6 | 78.2 | +39.6 | 4.8s | 12/12 |
| 12 | GPT-4o mini (OpenAI) `gpt-4o-mini` | OpenAI (USA), paid | 38.6 | 77.8 | +39.2 | 5.7s | 12/12 |
| 13 | GPT-5 nano (OpenAI) `gpt-5-nano` | OpenAI (USA), paid | 38.6 | 74.8 | +36.2 | 6.6s | 12/12 |
| 14 | Gemini 2.5 Flash `gemini-2.5-flash` | Google (USA), free | 38.6 | 74.3 | +35.7 | 8.1s | 12/12 |
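
In every row of the table above, the Δ value equals the after score minus the before score. The following is a minimal sketch that recomputes it for three rows; the names `rows` and `delta` are illustrative and not part of the benchmark's own tooling.

```python
# Scores copied from three rows of the table; `rows` and `delta` are
# hypothetical names chosen for this illustration.
rows = [
    ("jamba-large-1.7", 38.6, 94.0),
    ("mistral-small-latest", 38.6, 86.6),
    ("gemini-2.5-flash", 38.6, 74.3),
]


def delta(before: float, after: float) -> float:
    """Return after minus before, rounded to one decimal like the table."""
    return round(after - before, 1)


deltas = {model: delta(before, after) for model, before, after in rows}

for model, d in deltas.items():
    # The "+" format flag prints an explicit sign, matching the Δ column.
    print(f"{model}: {d:+.1f}")
```

Rounding to one decimal avoids floating-point noise when comparing the result against the published column.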
Failed calls are usually caused by transient API instability or quota exhaustion and can be recovered via retry + merge-retry.
No errors in this round.
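
The page does not define "retry + merge-retry", so the sketch below rests on an assumption: a failed call is retried a few times with exponential backoff, and any successful retry result is then merged back into the first-pass results. The functions `call_with_retry` and `merge_retry`, and the prompt keys, are hypothetical and do not come from the benchmark's code.

```python
import time
from typing import Callable, Dict, Optional


def call_with_retry(fn: Callable[[], str], attempts: int = 3,
                    base_delay: float = 0.0) -> Optional[str]:
    """Call fn up to `attempts` times; return None if every attempt fails."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Back off exponentially before the next attempt (no sleep after the last).
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return None


def merge_retry(results: Dict[str, Optional[str]],
                retried: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Overlay successful retry results onto the first-pass results."""
    merged = dict(results)
    for key, value in retried.items():
        if value is not None:  # keep the original entry if the retry also failed
            merged[key] = value
    return merged


# Simulated provider call that fails once, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")
    return "ok"


first_pass = {"prompt-1": "answer", "prompt-2": None}  # None marks a failed call
retried = {"prompt-2": call_with_retry(flaky)}
merged = merge_retry(first_pass, retried)
print(merged)
```

`base_delay` defaults to zero here only to keep the example fast; a real run against rate-limited APIs would likely use a non-zero delay.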