corpus v3
Round 1 aggregates the work done before the method was formalized (blind sub-agent, 4-profile × 3-language composition). The prompts were crafted over several weeks through partial runs with varying methods. The round is kept in the history as a reference: the result shown for each prompt × provider pair is the most recent one available for that pair.
| # | model | lab | before | after | Δ | latency | coverage |
|---|---|---|---|---|---|---|---|
| 1 | Jamba Large 1.7 (AI21) `jamba-large-1.7` | AI21 Labs (Israel), trial | 44.5 | 94.3 | +49.8 | 4.3s | 50/50 |
| 2 | Claude Opus (via CLI) `claude-opus-4-7` | Anthropic (USA), paid | 44.5 | 94.3 | +49.7 | 12.9s | 50/50 |
| 3 | Claude Sonnet (via CLI) `claude-sonnet-4-6` | Anthropic (USA), paid | 44.5 | 93.6 | +49.0 | 18.6s | 50/50 |
| 4 | GPT-5.4 (OpenAI) `gpt-5.4` | OpenAI (USA), paid | 44.5 | 91.0 | +46.4 | 4.0s | 50/50 |
| 5 | Mistral Small `mistral-small-latest` | Mistral AI (France), free | 44.5 | 89.5 | +45.0 | 2.3s | 50/50 |
| 6 | Llama 3.3 70B (Groq) `llama-3.3-70b-versatile` | Meta (USA) via Groq, free | 44.5 | 89.4 | +44.8 | 3.3s | 50/50 |
| 7 | DeepSeek R1 `deepseek-reasoner` | DeepSeek (China), free | 44.5 | 89.3 | +44.7 | 40.2s | 50/50 |
| 8 | DeepSeek V3 `deepseek-chat` | DeepSeek (China), free | 44.5 | 89.2 | +44.6 | 7.7s | 50/50 |
| 9 | Command A (Cohere) `command-a-03-2025` | Cohere (Canada), trial | 44.5 | 85.5 | +41.0 | 10.1s | 50/50 |
| 10 | Gemini 2.5 Flash `gemini-2.5-flash` | Google (USA), free | 44.5 | 84.5 | +40.0 | 8.9s | 50/50 |
| 11 | GPT-5 nano (OpenAI) `gpt-5-nano` | OpenAI (USA), paid | 44.5 | 81.6 | +37.0 | 5.6s | 50/50 |
| 12 | GPT-4o mini (OpenAI) `gpt-4o-mini` | OpenAI (USA), paid | 44.5 | 78.7 | +34.2 | 3.1s | 50/50 |
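
The Δ column can be cross-checked against the before and after columns. The sketch below copies the displayed values from the table, recomputes after − before with exact decimal arithmetic, and lists the rows where the result differs from the reported Δ. The names `ROWS` and `delta_mismatches` are illustrative and not part of the benchmark tooling. A uniform 0.1 gap would be consistent with Δ having been computed from unrounded scores before display, but the source does not state how Δ was derived, so treat that reading as an assumption.

```python
from decimal import Decimal

# Displayed "before" score, identical for every row in the table.
BEFORE = Decimal("44.5")

# (rank, displayed "after", reported delta), copied from the table.
ROWS = [
    (1, "94.3", "49.8"), (2, "94.3", "49.7"), (3, "93.6", "49.0"),
    (4, "91.0", "46.4"), (5, "89.5", "45.0"), (6, "89.4", "44.8"),
    (7, "89.3", "44.7"), (8, "89.2", "44.6"), (9, "85.5", "41.0"),
    (10, "84.5", "40.0"), (11, "81.6", "37.0"), (12, "78.7", "34.2"),
]


def delta_mismatches(rows, before=BEFORE):
    """Return (rank, recomputed delta, absolute gap) for rows whose
    reported delta differs from displayed after - displayed before."""
    out = []
    for rank, after, reported in rows:
        recomputed = Decimal(after) - before
        gap = abs(recomputed - Decimal(reported))
        if gap != 0:
            out.append((rank, recomputed, gap))
    return out


mismatches = delta_mismatches(ROWS)
for rank, recomputed, gap in mismatches:
    print(f"row {rank}: recomputed {recomputed}, off by {gap}")
```

On the values above this flags rows 2, 3, 4, 6, 7, 8 and 11, each off by exactly 0.1. The remaining rows match.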
Prompts tested in this round.
Failed calls are usually caused by transient API instability or quota exhaustion and are recoverable via retry + merge-retry.
No errors in this round.
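
The source does not show how retry + merge-retry is implemented, so the following is only a minimal sketch of the general idea: retry a failed call with exponential backoff, then fold the successful retries back into the original result set without overwriting results that already succeeded. The function names `call_with_retry` and `merge_retry`, the `Results` shape, and the default parameters are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical result shape: (prompt_id, provider) -> response text,
# with None marking a call that failed.
Results = Dict[Tuple[str, str], Optional[str]]


def call_with_retry(
    call: Callable[[], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Run `call`, retrying with exponential backoff.

    Returns the response, or None if every attempt raised.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            # Back off before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


def merge_retry(original: Results, retried: Results) -> Results:
    """Fill failed (None or missing) entries of `original` with
    successful retried results; existing successes are kept as-is."""
    merged = dict(original)
    for key, value in retried.items():
        if merged.get(key) is None and value is not None:
            merged[key] = value
    return merged
```

Keeping the merge separate from the retry loop means a retry pass can run later, for example after a quota resets, and still be combined with the earlier run.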