Round 4

finished at May 05, 2026

corpus v99

models: 1
prompts: 62
samples: 62/62
errors: 0
Avg Δ: +48.6

Round ranking

sorted by delta ↓
#  Model                                            Lab                      Before  After  Δ      Latency  Coverage
1  Grok 4.20 Reasoning (xAI) (grok-4.20-reasoning)  xAI (USA), data-sharing  43.4    92.0   +48.6  17.2s    62/62
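
The figures in the ranking row are consistent with Δ being the after score minus the before score (92.0 - 43.4 = 48.6), and with Avg Δ being the mean of Δ over all ranked models. The page does not state the formula, so treat this as an assumption. A minimal sketch under that assumption, with hypothetical field and function names:

```python
# Hypothetical record mirroring the single row of the ranking table.
# The field names are illustrative, not taken from the page.
rows = [
    {"model": "grok-4.20-reasoning", "before": 43.4, "after": 92.0},
]


def delta(row):
    """Assumed definition: after score minus before score, to one decimal."""
    return round(row["after"] - row["before"], 1)


def average_delta(rows):
    """Mean of the per-model deltas, to one decimal."""
    return round(sum(delta(r) for r in rows) / len(rows), 1)


print(f"{average_delta(rows):+.1f}")  # prints +48.6
```

With only one model in the round, the average equals that model's own Δ, which matches the Avg Δ of +48.6 shown in the summary.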

Prompts used

Prompts tested in this round.

Errors in this round

Calls that failed, usually because of transient API instability or quota exhaustion. Recoverable via retry + merge-retry.

No errors in this round.