Round 4

finished at May 05, 2026

corpus v99


models

prompts

samples: 62/62

errors

Avg Δ: +48.6

Round ranking

sorted by delta ↓

#	model	lab	before	after	Δ	latency	coverage
1	Grok 4.20 Reasoning (xAI) grok-4.20-reasoning	xAI (USA), data-sharing	43.4	92.0	+48.6	17.2s	62/62
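
The Δ and Avg Δ figures above appear to follow from simple arithmetic on the before and after scores. The sketch below is a minimal illustration under two assumptions that the page does not state outright: that Δ is `after - before`, and that Avg Δ is the mean of per-model deltas. The field names and helper functions are hypothetical; the numbers are taken from the table.

```python
# Row data copied from the ranking table; the dict keys are assumed names.
rows = [
    {"model": "grok-4.20-reasoning", "before": 43.4, "after": 92.0},
]


def delta(row):
    """Score change for one model, assuming delta = after - before."""
    return round(row["after"] - row["before"], 1)


def average_delta(rows):
    """Mean of the per-model deltas, assuming that is how Avg Δ is defined."""
    return round(sum(delta(r) for r in rows) / len(rows), 1)


# "sorted by delta ↓": largest improvement first.
ranking = sorted(rows, key=delta, reverse=True)

print(f"{average_delta(rows):+.1f}")  # prints "+48.6"
```

With a single model in the round, the average equals that model's own delta, which matches the +48.6 shown in both places.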


Calls that failed, usually because of transient API instability or quota exhaustion. These are recoverable via retry + merge-retry.

No errors in this round.