Round 2

finished at Apr 20, 2026

corpus v3

models: 12

prompts: 12

samples: 144/144

errors: 0

Avg Δ: +47.5

Round ranking

sorted by delta ↓

#	model	model id	lab	tier	before	after	Δ	latency	coverage
1	Claude Sonnet (via CLI)	claude-sonnet-4-6	Anthropic (USA)	paid	38.8	94.9	+56.1	23.1s	12/12
2	Jamba Large 1.7 (AI21)	jamba-large-1.7	AI21 Labs (Israel)	trial	38.8	92.8	+54.0	5.0s	12/12
3	Claude Opus (via CLI)	claude-opus-4-7	Anthropic (USA)	paid	38.8	92.6	+53.7	19.3s	12/12
4	DeepSeek R1	deepseek-reasoner	DeepSeek (China)	free	38.8	89.7	+50.8	59.8s	12/12
5	Mistral Small	mistral-small-latest	Mistral AI (France)	free	38.8	88.4	+49.6	2.6s	12/12
6	GPT-5.4 (OpenAI)	gpt-5.4	OpenAI (USA)	paid	38.8	87.9	+49.1	7.5s	12/12
7	DeepSeek V3	deepseek-chat	DeepSeek (China)	free	38.8	86.2	+47.3	11.5s	12/12
8	Llama 3.3 70B (Groq)	llama-3.3-70b-versatile	Meta (USA) via Groq	free	38.8	83.8	+45.0	6.5s	12/12
9	Command A (Cohere)	command-a-03-2025	Cohere (Canada)	trial	38.8	83.0	+44.2	11.7s	12/12
10	GPT-5 nano (OpenAI)	gpt-5-nano	OpenAI (USA)	paid	38.8	81.2	+42.3	5.4s	12/12
11	GPT-4o mini (OpenAI)	gpt-4o-mini	OpenAI (USA)	paid	38.8	78.5	+39.7	4.8s	12/12
12	Gemini 2.5 Flash	gemini-2.5-flash	Google (USA)	free	38.8	77.3	+38.4	8.1s	12/12
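
The summary figures can be cross-checked against the ranking rows. The sketch below assumes that Avg Δ is the unweighted mean of the per-model Δ values and that the sample count is models × prompts; the page does not state either formula, so treat both as assumptions. The Δ values are copied as displayed, since recomputing after − before from the rounded scores can differ by 0.1 (for example 92.6 − 38.8).

```python
# Displayed Δ values copied from the ranking table, in rank order.
deltas = [56.1, 54.0, 53.7, 50.8, 49.6, 49.1,
          47.3, 45.0, 44.2, 42.3, 39.7, 38.4]

models = len(deltas)        # one row per model
prompts_per_model = 12      # taken from the 12/12 coverage column

# Assumption: samples = models x prompts, Avg Δ = unweighted mean of Δ.
samples = models * prompts_per_model
avg_delta = sum(deltas) / models

print(f"samples={samples}, avg delta=+{avg_delta:.1f}")
# prints: samples=144, avg delta=+47.5
```

Under these assumptions both values match the reported 144/144 samples and Avg Δ of +47.5.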

Failed calls are usually caused by transient API instability or quota exhaustion, and are recoverable via retry + merge-retry.

No errors in this round.