Round 2

finished at Apr 20, 2026

corpus v3

models: 12
prompts: 12
samples: 144/144
errors: 0
Avg Δ: +47.5

Round ranking

sorted by delta ↓
#   model                    model ID                     lab                  tier   before  after  Δ      latency  coverage
1   Claude Sonnet (via CLI)  claude-sonnet-4-6 (via CLI)  Anthropic (USA)      paid   38.8    94.9   +56.1  23.1s    12/12
2   Jamba Large 1.7 (AI21)   jamba-large-1.7              AI21 Labs (Israel)   trial  38.8    92.8   +54.0  5.0s     12/12
3   Claude Opus (via CLI)    claude-opus-4-7 (via CLI)    Anthropic (USA)      paid   38.8    92.6   +53.7  19.3s    12/12
4   DeepSeek R1              deepseek-reasoner            DeepSeek (China)     free   38.8    89.7   +50.8  59.8s    12/12
5   Mistral Small            mistral-small-latest         Mistral AI (France)  free   38.8    88.4   +49.6  2.6s     12/12
6   GPT-5.4 (OpenAI)         gpt-5.4                      OpenAI (USA)         paid   38.8    87.9   +49.1  7.5s     12/12
7   DeepSeek V3              deepseek-chat                DeepSeek (China)     free   38.8    86.2   +47.3  11.5s    12/12
8   Llama 3.3 70B (Groq)     llama-3.3-70b-versatile      Meta (USA) via Groq  free   38.8    83.8   +45.0  6.5s     12/12
9   Command A (Cohere)       command-a-03-2025            Cohere (Canada)      trial  38.8    83.0   +44.2  11.7s    12/12
10  GPT-5 nano (OpenAI)      gpt-5-nano                   OpenAI (USA)         paid   38.8    81.2   +42.3  5.4s     12/12
11  GPT-4o mini (OpenAI)     gpt-4o-mini                  OpenAI (USA)         paid   38.8    78.5   +39.7  4.8s     12/12
12  Gemini 2.5 Flash         gemini-2.5-flash             Google (USA)         free   38.8    77.3   +38.4  8.1s     12/12
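
The summary figure Avg Δ can be checked against the per-model rows. The sketch below assumes that Avg Δ is the plain mean of the twelve displayed Δ values and that Δ means after minus before; the page does not state either definition, and the function names are illustrative only. Recomputing after minus before from the displayed columns differs from the displayed Δ by 0.1 in a few rows, presumably because Δ was computed from unrounded scores.

```python
from statistics import mean

# (model ID, before, after, displayed delta), copied from the ranking table.
ROWS = [
    ("claude-sonnet-4-6", 38.8, 94.9, 56.1),
    ("jamba-large-1.7", 38.8, 92.8, 54.0),
    ("claude-opus-4-7", 38.8, 92.6, 53.7),
    ("deepseek-reasoner", 38.8, 89.7, 50.8),
    ("mistral-small-latest", 38.8, 88.4, 49.6),
    ("gpt-5.4", 38.8, 87.9, 49.1),
    ("deepseek-chat", 38.8, 86.2, 47.3),
    ("llama-3.3-70b-versatile", 38.8, 83.8, 45.0),
    ("command-a-03-2025", 38.8, 83.0, 44.2),
    ("gpt-5-nano", 38.8, 81.2, 42.3),
    ("gpt-4o-mini", 38.8, 78.5, 39.7),
    ("gemini-2.5-flash", 38.8, 77.3, 38.4),
]


def average_delta(rows):
    """Mean of the displayed delta column, rounded to one decimal (assumed definition)."""
    return round(mean(delta for _, _, _, delta in rows), 1)


def max_rounding_gap(rows):
    """Largest gap between (after - before) and the displayed delta."""
    return max(abs((after - before) - delta) for _, before, after, delta in rows)


print(average_delta(ROWS))     # matches the Avg Δ shown in the round summary
print(max_rounding_gap(ROWS))  # about 0.1, consistent with display rounding
```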

Errors in this round

Failed calls are usually caused by transient API instability or quota exhaustion. They can be recovered via retry + merge-retry.

No errors in this round.
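
The page does not define "retry" or "merge-retry". As a rough sketch only, assume that retry re-runs each failed (model, prompt) call with exponential backoff, and that merge-retry folds the recovered samples back into the round's results. Every name below (`retry_failed`, `merge_retry`, the `call` callback) is hypothetical and not taken from the benchmark tool.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

Key = Tuple[str, str]  # (model ID, prompt ID); hypothetical sample key


def retry_failed(
    failed: Iterable[Key],
    call: Callable[[str, str], str],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[Key, str]:
    """Re-run failed calls, backing off exponentially between attempts."""
    recovered: Dict[Key, str] = {}
    for key in failed:
        for attempt in range(attempts):
            try:
                recovered[key] = call(*key)
                break  # success: stop retrying this sample
            except Exception:
                # Wait before the next attempt, but not after the last one.
                if attempt < attempts - 1:
                    sleep(base_delay * 2 ** attempt)
    return recovered


def merge_retry(results: Dict[Key, str], recovered: Dict[Key, str]) -> Dict[Key, str]:
    """Return a copy of the round results with recovered samples merged in."""
    merged = dict(results)
    merged.update(recovered)
    return merged
```

Samples that still fail after the last attempt are simply left out of the returned mapping, so they would remain counted as errors for the round.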