corpus v3
| # | model | model id | lab | access | before | after | Δ | latency | coverage |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Sonnet (via CLI) | `claude-sonnet-4-6` | Anthropic (USA) | paid | 38.8 | 94.9 | +56.1 | 23.1s | 12/12 |
| 2 | Jamba Large 1.7 (AI21) | `jamba-large-1.7` | AI21 Labs (Israel) | trial | 38.8 | 92.8 | +54.0 | 5.0s | 12/12 |
| 3 | Claude Opus (via CLI) | `claude-opus-4-7` | Anthropic (USA) | paid | 38.8 | 92.6 | +53.7 | 19.3s | 12/12 |
| 4 | DeepSeek R1 | `deepseek-reasoner` | DeepSeek (China) | free | 38.8 | 89.7 | +50.8 | 59.8s | 12/12 |
| 5 | Mistral Small | `mistral-small-latest` | Mistral AI (France) | free | 38.8 | 88.4 | +49.6 | 2.6s | 12/12 |
| 6 | GPT-5.4 (OpenAI) | `gpt-5.4` | OpenAI (USA) | paid | 38.8 | 87.9 | +49.1 | 7.5s | 12/12 |
| 7 | DeepSeek V3 | `deepseek-chat` | DeepSeek (China) | free | 38.8 | 86.2 | +47.3 | 11.5s | 12/12 |
| 8 | Llama 3.3 70B (Groq) | `llama-3.3-70b-versatile` | Meta (USA) via Groq | free | 38.8 | 83.8 | +45.0 | 6.5s | 12/12 |
| 9 | Command A (Cohere) | `command-a-03-2025` | Cohere (Canada) | trial | 38.8 | 83.0 | +44.2 | 11.7s | 12/12 |
| 10 | GPT-5 nano (OpenAI) | `gpt-5-nano` | OpenAI (USA) | paid | 38.8 | 81.2 | +42.3 | 5.4s | 12/12 |
| 11 | GPT-4o mini (OpenAI) | `gpt-4o-mini` | OpenAI (USA) | paid | 38.8 | 78.5 | +39.7 | 4.8s | 12/12 |
| 12 | Gemini 2.5 Flash | `gemini-2.5-flash` | Google (USA) | free | 38.8 | 77.3 | +38.4 | 8.1s | 12/12 |
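
The Δ column appears to be `after - before`. Some rows differ from that subtraction by 0.1 (for example 92.6 - 38.8 = 53.8 against a reported +53.7). That would be consistent with Δ being computed from unrounded scores, though the source does not say so. A minimal sketch of that check, using the values from the table; the function name and the tolerance are assumptions made for illustration:

```python
# (after, reported_delta) pairs copied from the table above.
# "before" is 38.8 in every row.
BEFORE = 38.8
ROWS = [
    (94.9, 56.1), (92.8, 54.0), (92.6, 53.7), (89.7, 50.8),
    (88.4, 49.6), (87.9, 49.1), (86.2, 47.3), (83.8, 45.0),
    (83.0, 44.2), (81.2, 42.3), (78.5, 39.7), (77.3, 38.4),
]


def delta_mismatches(rows, before=BEFORE, tol=0.15):
    """Return rows whose reported delta is more than `tol` away from after - before.

    `tol` is an assumed allowance for one-decimal rounding, not a value from the source.
    """
    return [
        (after, reported)
        for after, reported in rows
        if abs((after - before) - reported) > tol
    ]


if __name__ == "__main__":
    print(delta_mismatches(ROWS))
```

With this tolerance all twelve rows pass, so the 0.1 gaps look like rounding rather than data errors.
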
Failed calls: usually caused by transient API instability or quota exhaustion, and recoverable via retry + merge-retry.
No errors in this round.
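
The source does not show how retry + merge-retry works. As a rough sketch of the general idea only, failed calls could be re-run with exponential backoff and the recovered responses merged back into the round's results. Every name here (`retry_failed`, `merge_retry`, the string keys, the `call` function) is hypothetical and not taken from the benchmark's tooling:

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def retry_failed(
    failed_keys: Iterable[str],
    call: Callable[[str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Re-run each failed call up to `attempts` times with exponential backoff.

    Returns (recovered, still_failed): recovered maps key -> response,
    still_failed maps key -> last error message.
    """
    recovered: Dict[str, str] = {}
    still_failed: Dict[str, str] = {}
    for key in failed_keys:
        for attempt in range(attempts):
            try:
                recovered[key] = call(key)
                still_failed.pop(key, None)  # clear any error from earlier attempts
                break
            except Exception as exc:  # assumption: real code would catch the client's specific errors
                still_failed[key] = str(exc)
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return recovered, still_failed


def merge_retry(original: Dict[str, str], recovered: Dict[str, str]) -> Dict[str, str]:
    """Fill gaps in the original round with recovered responses.

    Responses already present in `original` are kept unchanged.
    """
    return {**recovered, **original}
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.
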