Round 5

finished at May 08, 2026

corpus v3

models: 14
prompts: 12
samples: 168/168
errors: 0
Avg Δ: +45.3

Round ranking

sorted by delta ↓

#	model	model id	lab	access	before	after	Δ	latency	coverage
1	Jamba Large 1.7 (AI21)	jamba-large-1.7	AI21 Labs (Israel)	trial	38.6	94.0	+55.4	4.9s	12/12
2	Claude Sonnet (via CLI)	claude-sonnet-4-6 (via CLI)	Anthropic (USA)	paid	38.6	92.5	+53.9	27.4s	12/12
3	Claude Opus (via CLI)	claude-opus-4-7 (via CLI)	Anthropic (USA)	paid	38.6	88.8	+50.2	15.6s	12/12
4	DeepSeek R1	deepseek-reasoner	DeepSeek (China)	free	38.6	86.8	+48.2	18.8s	12/12
5	Mistral Small	mistral-small-latest	Mistral AI (France)	free	38.6	86.6	+48.0	2.5s	12/12
6	GPT-5.4 (OpenAI)	gpt-5.4	OpenAI (USA)	paid	38.6	86.6	+48.0	6.6s	12/12
7	Llama 3.3 70B (Groq)	llama-3.3-70b-versatile	Meta (USA) via Groq	free	38.6	85.6	+47.0	4.2s	12/12
8	Grok 4.20 Reasoning (xAI)	grok-4.20-reasoning	xAI (USA)	data-sharing	38.6	85.5	+46.9	26.6s	12/12
9	GPT-5.5 (OpenAI)	gpt-5.5	OpenAI (USA)	paid	38.6	83.7	+45.1	10.1s	12/12
10	DeepSeek V3	deepseek-chat	DeepSeek (China)	free	38.6	78.7	+40.1	5.1s	12/12
11	Command A (Cohere)	command-a-03-2025	Cohere (Canada)	trial	38.6	78.2	+39.6	4.8s	12/12
12	GPT-4o mini (OpenAI)	gpt-4o-mini	OpenAI (USA)	paid	38.6	77.8	+39.2	5.7s	12/12
13	GPT-5 nano (OpenAI)	gpt-5-nano	OpenAI (USA)	paid	38.6	74.8	+36.2	6.6s	12/12
14	Gemini 2.5 Flash	gemini-2.5-flash	Google (USA)	free	38.6	74.3	+35.7	8.1s	12/12
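
The Δ column and the Avg Δ figure can be cross-checked from the table itself. The sketch below assumes Δ is simply after minus before, and that Avg Δ is the unweighted mean of the per-model deltas; the page does not state either rule, so treat both as assumptions. The variable names are illustrative only.

```python
# "after" scores copied from the ranking table, keyed by model id.
BEFORE = 38.6  # every model in this round shares the same "before" score

after_scores = {
    "jamba-large-1.7": 94.0,
    "claude-sonnet-4-6": 92.5,
    "claude-opus-4-7": 88.8,
    "deepseek-reasoner": 86.8,
    "mistral-small-latest": 86.6,
    "gpt-5.4": 86.6,
    "llama-3.3-70b-versatile": 85.6,
    "grok-4.20-reasoning": 85.5,
    "gpt-5.5": 83.7,
    "deepseek-chat": 78.7,
    "command-a-03-2025": 78.2,
    "gpt-4o-mini": 77.8,
    "gpt-5-nano": 74.8,
    "gemini-2.5-flash": 74.3,
}

# Assumption: delta = after - before, rounded to one decimal like the table.
deltas = {model: round(after - BEFORE, 1) for model, after in after_scores.items()}

# Assumption: Avg Δ is the plain mean over all models.
avg_delta = sum(deltas.values()) / len(deltas)

# Rank by delta, largest first (ties keep their table order).
ranking = sorted(deltas, key=deltas.get, reverse=True)

for position, model in enumerate(ranking, start=1):
    print(f"{position:>2}  {model:<26} {deltas[model]:+.1f}")
print(f"average delta: {avg_delta:+.2f}")
```

Under these assumptions the mean comes out at about +45.25, which is in line with the +45.3 reported above.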

Failed calls are usually caused by transient API instability or quota exhaustion and can be recovered via retry + merge-retry.

No errors in this round.