Round 5

finished at May 08, 2026

corpus v3

models: 14
prompts: 12
samples: 168/168
errors: 0
Avg Δ: +45.3

Round ranking

sorted by delta ↓
| # | model | lab | before | after | Δ | latency | coverage |
|---|-------|-----|--------|-------|---|---------|----------|
| 1 | Jamba Large 1.7 (AI21) [jamba-large-1.7] | AI21 Labs (Israel), trial | 38.6 | 94.0 | +55.4 | 4.9s | 12/12 |
| 2 | Claude Sonnet (via CLI) [claude-sonnet-4-6 (via CLI)] | Anthropic (USA), paid | 38.6 | 92.5 | +53.9 | 27.4s | 12/12 |
| 3 | Claude Opus (via CLI) [claude-opus-4-7 (via CLI)] | Anthropic (USA), paid | 38.6 | 88.8 | +50.2 | 15.6s | 12/12 |
| 4 | DeepSeek R1 [deepseek-reasoner] | DeepSeek (China), free | 38.6 | 86.8 | +48.2 | 18.8s | 12/12 |
| 5 | Mistral Small [mistral-small-latest] | Mistral AI (France), free | 38.6 | 86.6 | +48.0 | 2.5s | 12/12 |
| 6 | GPT-5.4 (OpenAI) [gpt-5.4] | OpenAI (USA), paid | 38.6 | 86.6 | +48.0 | 6.6s | 12/12 |
| 7 | Llama 3.3 70B (Groq) [llama-3.3-70b-versatile] | Meta (USA) via Groq, free | 38.6 | 85.6 | +47.0 | 4.2s | 12/12 |
| 8 | Grok 4.20 Reasoning (xAI) [grok-4.20-reasoning] | xAI (USA), data-sharing | 38.6 | 85.5 | +46.9 | 26.6s | 12/12 |
| 9 | GPT-5.5 (OpenAI) [gpt-5.5] | OpenAI (USA), paid | 38.6 | 83.7 | +45.1 | 10.1s | 12/12 |
| 10 | DeepSeek V3 [deepseek-chat] | DeepSeek (China), free | 38.6 | 78.7 | +40.1 | 5.1s | 12/12 |
| 11 | Command A (Cohere) [command-a-03-2025] | Cohere (Canada), trial | 38.6 | 78.2 | +39.6 | 4.8s | 12/12 |
| 12 | GPT-4o mini (OpenAI) [gpt-4o-mini] | OpenAI (USA), paid | 38.6 | 77.8 | +39.2 | 5.7s | 12/12 |
| 13 | GPT-5 nano (OpenAI) [gpt-5-nano] | OpenAI (USA), paid | 38.6 | 74.8 | +36.2 | 6.6s | 12/12 |
| 14 | Gemini 2.5 Flash [gemini-2.5-flash] | Google (USA), free | 38.6 | 74.3 | +35.7 | 8.1s | 12/12 |
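
The Δ column and the Avg Δ figure can be cross-checked with a few lines of Python. This is a minimal sketch, assuming that Δ is simply after minus before and that Avg Δ is the unweighted mean over the 14 models; neither definition is stated on the page, and the variable names are illustrative. Under those assumptions the mean of the displayed deltas comes out at about 45.25, which agrees with the reported +45.3 up to rounding.

```python
from statistics import mean

# Shared "before" score shown in every row of the ranking.
BEFORE = 38.6

# "after" scores copied from the ranking, keyed by model id, in ranking order.
after_scores = {
    "jamba-large-1.7": 94.0,
    "claude-sonnet-4-6 (via CLI)": 92.5,
    "claude-opus-4-7 (via CLI)": 88.8,
    "deepseek-reasoner": 86.8,
    "mistral-small-latest": 86.6,
    "gpt-5.4": 86.6,
    "llama-3.3-70b-versatile": 85.6,
    "grok-4.20-reasoning": 85.5,
    "gpt-5.5": 83.7,
    "deepseek-chat": 78.7,
    "command-a-03-2025": 78.2,
    "gpt-4o-mini": 77.8,
    "gpt-5-nano": 74.8,
    "gemini-2.5-flash": 74.3,
}


def delta(before: float, after: float) -> float:
    """Assumed definition: improvement = after - before, to one decimal."""
    return round(after - before, 1)


deltas = {model: delta(BEFORE, after) for model, after in after_scores.items()}

# Assumed definition: unweighted mean across models.
avg_delta = mean(list(deltas.values()))

# Sort by delta, descending; ties keep their original order (stable sort).
ranking = sorted(deltas, key=deltas.get, reverse=True)

print(f"avg delta ~ {avg_delta:.2f}")
print("top:", ranking[0], "bottom:", ranking[-1])
```

Because the "before" score is identical for every model, sorting by Δ is equivalent to sorting by the "after" score.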

Errors in this round

Failed calls are usually caused by transient API instability or quota exhaustion, and are recoverable via retry + merge-retry.

No errors in this round.
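
For rounds that do contain failures, the recovery step might look roughly like the sketch below. It is hypothetical: the page does not describe how retry or merge-retry work, so the `Sample` type, the `(model, prompt_id)` key, and the `merge_retry` function are assumptions made for illustration. The idea shown is that only failed calls are re-run, and a successful retry replaces the failed sample in the original result set.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical record for one model/prompt call."""
    model: str
    prompt_id: int
    response: Optional[str]  # None marks a failed call


def failed_keys(samples: List[Sample]) -> List[Tuple[str, int]]:
    """Return the (model, prompt_id) pairs that still need a retry."""
    return [(s.model, s.prompt_id) for s in samples if s.response is None]


def merge_retry(original: List[Sample], retried: List[Sample]) -> List[Sample]:
    """Replace failed samples with successful retries; keep everything else."""
    fixes = {
        (s.model, s.prompt_id): s for s in retried if s.response is not None
    }
    return [
        fixes.get((s.model, s.prompt_id), s) if s.response is None else s
        for s in original
    ]


# Tiny illustrative run with made-up model names.
original = [Sample("model-a", 1, "ok"), Sample("model-a", 2, None)]
retried = [Sample("model-a", 2, "ok on retry")]
merged = merge_retry(original, retried)

print(failed_keys(original))
print(failed_keys(merged))
```

A sample whose retry also fails is left as it was, so it can be picked up by a later retry pass.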