Whet Benchmark

How much can an LLM sharpen a prompt without destroying its intent?

Meta-prompt-following under pressure to preserve purpose: a capability that no public benchmark evaluates directly. Shown here: the before/after score delta for each model on the corpus.

Models tested: 8
Prompts evaluated: 50 (20 concepts · cumulative)
Mean delta: +45.8
Last run: Apr 17, 2026 (corpus v1)
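The mean delta can be cross-checked against the leaderboard: each model's delta is its after score minus its before score, and the summary figure averages those per-model deltas. The following is a minimal Python sketch, assuming the inputs are the already-rounded deltas shown on the leaderboard and that the page rounds half up; the names `leaderboard_deltas` and `mean_delta` are illustrative and not taken from the benchmark's own code.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-model deltas as displayed on the leaderboard (already rounded to one decimal).
leaderboard_deltas = {
    "jamba-large-1.7": Decimal("49.8"),
    "claude-sonnet-4-6": Decimal("49.0"),
    "claude-opus-4-6": Decimal("48.1"),
    "mistral-small-latest": Decimal("45.0"),
    "llama-3.3-70b-versatile": Decimal("44.8"),
    "deepseek-reasoner": Decimal("44.7"),
    "deepseek-chat": Decimal("44.6"),
    "gemini-2.5-flash": Decimal("40.0"),
}


def mean_delta(deltas):
    """Average the per-model deltas and round to one decimal place (half up, an assumption)."""
    values = list(deltas.values())
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


print(f"Mean delta: {mean_delta(leaderboard_deltas):+}")
```

Decimal arithmetic is used here only so that the rounding step is exact; the result agrees with the +45.8 reported above.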

Leaderboard

Sorted by delta ↓

| # | Model | Model ID | Lab | Tier | Before | After | Δ | Latency | Prompts |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Jamba Large 1.7 (AI21) | jamba-large-1.7 | AI21 Labs (Israel) | trial | 44.5 | 94.3 | +49.8 | 4.3 s | 50 |
| 2 | Claude Sonnet (via CLI) | claude-sonnet-4-6 (via CLI) | Anthropic (USA) | paid | 44.5 | 93.6 | +49.0 | 18.6 s | 50 |
| 3 | Claude (via CLI) | claude-opus-4-6 (via CLI) | Anthropic (USA) | paid | 44.5 | 92.7 | +48.1 | 14.3 s | 50 |
| 4 | Mistral Small | mistral-small-latest | Mistral AI (France) | free | 44.5 | 89.5 | +45.0 | 2.3 s | 50 |
| 5 | Llama 3.3 70B (Groq) | llama-3.3-70b-versatile | Meta (USA) via Groq | free | 44.5 | 89.4 | +44.8 | 3.3 s | 50 |
| 6 | DeepSeek R1 | deepseek-reasoner | DeepSeek (China) | free | 44.5 | 89.3 | +44.7 | 40.2 s | 50 |
| 7 | DeepSeek V3 | deepseek-chat | DeepSeek (China) | free | 44.5 | 89.2 | +44.6 | 7.7 s | 50 |
| 8 | Gemini 2.5 Flash | gemini-2.5-flash | Google (USA) | free | 44.5 | 84.5 | +40.0 | 8.9 s | 50 |

Notes per model:
1. Jamba Large 1.7 (AI21): hybrid Mamba-Transformer architecture; the only model in the benchmark that is not a pure transformer.
2. Claude Sonnet (via CLI): Claude Sonnet via CLI; a faster, lighter version than Opus.
3. Claude (via CLI): Claude Opus via CLI subscription, with no separate API key.
4. Mistral Small: European open-weight model focused on efficiency.
5. Llama 3.3 70B (Groq): open-source Llama with inference on Groq's LPU hardware.
6. DeepSeek R1: reasoner version of DeepSeek with explicit chain-of-thought.
7. DeepSeek V3: high-performance Chinese model trained with a focus on code and reasoning.
8. Gemini 2.5 Flash: Google's reference multimodal model.

Per-prompt detail

Each provider is listed separately, with its result on every prompt in the corpus. This is useful for seeing where each model sharpens better or worse.
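One way to see where a model sharpens worse is to flag the prompts whose score dropped after the rewrite. The following is a minimal Python sketch using three rows copied from the Claude Sonnet results; the `PromptResult` type and the `regressions` helper are hypothetical names introduced for illustration, not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptResult:
    prompt: str
    before: int
    after: int

    @property
    def delta(self) -> int:
        # Delta is the after score minus the before score.
        return self.after - self.before


def regressions(results):
    """Return the results whose score fell after sharpening, worst first."""
    return sorted((r for r in results if r.delta < 0), key=lambda r: r.delta)


# Three rows from the Claude Sonnet (via CLI) results.
sonnet_sample = [
    PromptResult("medical-assistant-pt", 0, 94),
    PromptResult("role-inflation-en", 94, 90),
    PromptResult("tone-domain-mismatch-pt", 93, 79),
]

for r in regressions(sonnet_sample):
    print(f"{r.prompt}: {r.before} -> {r.after} ({r.delta:+d})")
```

In this sample the regressions fall on prompts that already scored above 90 before sharpening, which leaves little headroom for improvement.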

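Jamba Large 1.7 (AI21): 50 prompts · mean Δ +49.8

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 100 | +100 |
| rigid-code-agent-en | ? | 1 | 100 | +99 |
| rigid-code-agent-pt | ? | 1 | 97 | +96 |
| agente-codigo-rigido | ? | 1 | 97 | +96 |
| legal-advisor-pt | ? | 0 | 93 | +93 |
| consultor-juridico | ? | 0 | 90 | +90 |
| medical-assistant-pt | ? | 0 | 87 | +87 |
| legal-advisor-es | ? | 14 | 100 | +86 |
| medical-assistant-en | ? | 0 | 86 | +86 |
| disguised-repetitions-pt | ? | 13 | 97 | +84 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| assistente-medico | ? | 0 | 80 | +80 |
| rigid-code-agent-es | ? | 22 | 97 | +75 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| marketing-strategist-pt | ? | 28 | 100 | +72 |
| estrategista-marketing | ? | 28 | 100 | +72 |
| medical-assistant-es | ? | 24 | 93 | +69 |
| disguised-repetitions-es | ? | 28 | 97 | +69 |
| marketing-strategist-en | ? | 43 | 100 | +57 |
| Supply chain optimizer (en) | en | 47 | 100 | +53 |
| financial-analyst-en | ? | 37 | 90 | +53 |
| Tax advisor (en) | en | 48 | 100 | +52 |
| Tutor de secundaria (es) | es | 49 | 100 | +51 |
| Consultor fiscal (es) | es | 51 | 100 | +49 |
| financial-analyst-pt | ? | 31 | 80 | +49 |
| Tutor de ensino médio (pt) | pt | 49 | 97 | +48 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| Otimizador logístico (pt) | pt | 55 | 100 | +45 |
| High school tutor (en) | en | 53 | 97 | +44 |
| Consultor tributário (pt) | pt | 51 | 93 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 90 | +38 |
| Optimizador logístico (es) | es | 63 | 100 | +37 |
| marketing-strategist-es | ? | 64 | 100 | +36 |
| task-style-no-roleplay-en | ? | 48 | 83 | +35 |
| financial-analyst-es | ? | 56 | 90 | +34 |
| task-style-no-roleplay-es | ? | 65 | 97 | +32 |
| contradiction-es | ? | 61 | 86 | +25 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| contradiction-en | ? | 61 | 79 | +18 |
| threat-framing-en | ? | 83 | 100 | +17 |
| unresolved-reference-pt | ? | 69 | 86 | +17 |
| threat-framing-es | ? | 79 | 93 | +14 |
| unresolved-reference-en | ? | 76 | 90 | +14 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| contradiction-pt | ? | 61 | 72 | +11 |
| tone-domain-mismatch-pt | ? | 93 | 100 | +7 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-es | ? | 97 | 100 | +3 |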
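
Claude Sonnet (via CLI): 50 prompts · mean Δ +49.0

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| medical-assistant-pt | ? | 0 | 94 | +94 |
| legal-advisor-en | ? | 0 | 93 | +93 |
| consultor-juridico | ? | 0 | 90 | +90 |
| agente-codigo-rigido | ? | 1 | 90 | +89 |
| disguised-repetitions-pt | ? | 13 | 100 | +87 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| medical-assistant-en | ? | 0 | 83 | +83 |
| assistente-medico | ? | 0 | 83 | +83 |
| rigid-code-agent-pt | ? | 1 | 81 | +80 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| legal-advisor-es | ? | 14 | 79 | +65 |
| rigid-code-agent-es | ? | 22 | 87 | +65 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| marketing-strategist-pt | ? | 28 | 86 | +58 |
| marketing-strategist-en | ? | 43 | 97 | +54 |
| Supply chain optimizer (en) | en | 47 | 100 | +53 |
| financial-analyst-pt | ? | 31 | 83 | +52 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| Tax advisor (en) | en | 48 | 100 | +52 |
| Tutor de ensino médio (pt) | pt | 49 | 100 | +51 |
| Tutor de secundaria (es) | es | 49 | 100 | +51 |
| Consultor fiscal (es) | es | 51 | 100 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| High school tutor (en) | en | 53 | 100 | +47 |
| financial-analyst-en | ? | 37 | 83 | +46 |
| Consultor tributário (pt) | pt | 51 | 93 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| contradiction-en | ? | 61 | 100 | +39 |
| Otimizador logístico (pt) | pt | 55 | 93 | +38 |
| financial-analyst-es | ? | 56 | 93 | +37 |
| Optimizador logístico (es) | es | 63 | 100 | +37 |
| contradiction-pt | ? | 61 | 97 | +36 |
| contradiction-es | ? | 61 | 97 | +36 |
| marketing-strategist-es | ? | 64 | 97 | +33 |
| unresolved-reference-pt | ? | 69 | 100 | +31 |
| task-style-no-roleplay-es | ? | 65 | 93 | +28 |
| unresolved-reference-en | ? | 76 | 100 | +24 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| threat-framing-en | ? | 83 | 100 | +17 |
| threat-framing-es | ? | 79 | 93 | +14 |
| threat-framing-pt | ? | 90 | 100 | +10 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-es | ? | 97 | 100 | +3 |
| role-inflation-pt | ? | 88 | 87 | -1 |
| role-inflation-en | ? | 94 | 90 | -4 |
| tone-domain-mismatch-pt | ? | 93 | 79 | -14 |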
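
Claude (via CLI): 50 prompts · mean Δ +48.1

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| rigid-code-agent-pt | ? | 1 | 97 | +96 |
| consultor-juridico | ? | 0 | 90 | +90 |
| legal-advisor-pt | ? | 0 | 90 | +90 |
| rigid-code-agent-en | ? | 1 | 90 | +89 |
| medical-assistant-en | ? | 0 | 87 | +87 |
| disguised-repetitions-pt | ? | 13 | 100 | +87 |
| assistente-medico | ? | 0 | 86 | +86 |
| legal-advisor-es | ? | 14 | 100 | +86 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| legal-advisor-en | ? | 0 | 83 | +83 |
| medical-assistant-pt | ? | 0 | 83 | +83 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| agente-codigo-rigido | ? | 1 | 77 | +76 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| rigid-code-agent-es | ? | 22 | 87 | +65 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| marketing-strategist-pt | ? | 28 | 86 | +58 |
| financial-analyst-pt | ? | 31 | 86 | +55 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| Tax advisor (en) | en | 48 | 100 | +52 |
| Tutor de ensino médio (pt) | pt | 49 | 100 | +51 |
| estrategista-marketing | ? | 28 | 78 | +50 |
| financial-analyst-en | ? | 37 | 86 | +49 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| High school tutor (en) | en | 53 | 100 | +47 |
| Supply chain optimizer (en) | en | 47 | 93 | +46 |
| Consultor fiscal (es) | es | 51 | 97 | +46 |
| marketing-strategist-en | ? | 43 | 87 | +44 |
| Tutor de secundaria (es) | es | 49 | 93 | +44 |
| Consultor tributário (pt) | pt | 51 | 93 | +42 |
| contradiction-pt | ? | 61 | 100 | +39 |
| contradiction-en | ? | 61 | 100 | +39 |
| contradiction-es | ? | 61 | 100 | +39 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| tone-domain-mismatch-es | ? | 45 | 79 | +34 |
| Otimizador logístico (pt) | pt | 55 | 86 | +31 |
| Optimizador logístico (es) | es | 63 | 93 | +30 |
| unresolved-reference-en | ? | 76 | 100 | +24 |
| marketing-strategist-es | ? | 64 | 86 | +22 |
| threat-framing-es | ? | 79 | 100 | +21 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| threat-framing-en | ? | 83 | 100 | +17 |
| unresolved-reference-pt | ? | 69 | 86 | +17 |
| financial-analyst-es | ? | 56 | 71 | +15 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| threat-framing-pt | ? | 90 | 100 | +10 |
| role-inflation-en | ? | 94 | 100 | +6 |
| tone-domain-mismatch-en | ? | 93 | 97 | +4 |
| role-inflation-es | ? | 97 | 97 | 0 |
| tone-domain-mismatch-pt | ? | 93 | 86 | -7 |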
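
Mistral Small: 50 prompts · mean Δ +45.0

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 97 | +97 |
| medical-assistant-en | ? | 0 | 93 | +93 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| rigid-code-agent-en | ? | 1 | 84 | +83 |
| legal-advisor-pt | ? | 0 | 79 | +79 |
| disguised-repetitions-pt | ? | 13 | 90 | +77 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| agente-codigo-rigido | ? | 1 | 77 | +76 |
| medical-assistant-pt | ? | 0 | 76 | +76 |
| rigid-code-agent-pt | ? | 1 | 77 | +76 |
| legal-advisor-es | ? | 14 | 86 | +72 |
| consultor-juridico | ? | 0 | 69 | +69 |
| disguised-repetitions-es | ? | 28 | 97 | +69 |
| estrategista-marketing | ? | 28 | 94 | +66 |
| assistente-medico | ? | 0 | 65 | +65 |
| marketing-strategist-pt | ? | 28 | 91 | +63 |
| financial-analyst-pt | ? | 31 | 90 | +59 |
| rigid-code-agent-es | ? | 22 | 80 | +58 |
| medical-assistant-es | ? | 24 | 79 | +55 |
| Tutor de secundaria (es) | es | 49 | 100 | +51 |
| Supply chain optimizer (en) | en | 47 | 97 | +50 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| High school tutor (en) | en | 53 | 100 | +47 |
| financial-analyst-en | ? | 37 | 83 | +46 |
| task-style-no-roleplay-en | ? | 48 | 93 | +45 |
| Tax advisor (en) | en | 48 | 93 | +45 |
| Tutor de ensino médio (pt) | pt | 49 | 93 | +44 |
| marketing-strategist-en | ? | 43 | 86 | +43 |
| Consultor tributário (pt) | pt | 51 | 90 | +39 |
| Optimizador logístico (es) | es | 63 | 100 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| Consultor fiscal (es) | es | 51 | 86 | +35 |
| marketing-strategist-es | ? | 64 | 94 | +30 |
| contradiction-es | ? | 61 | 90 | +29 |
| Otimizador logístico (pt) | pt | 55 | 83 | +28 |
| financial-analyst-es | ? | 56 | 83 | +27 |
| contradiction-pt | ? | 61 | 83 | +22 |
| threat-framing-es | ? | 79 | 100 | +21 |
| unresolved-reference-en | ? | 76 | 93 | +17 |
| unresolved-reference-es | ? | 79 | 93 | +14 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| contradiction-en | ? | 61 | 73 | +12 |
| unresolved-reference-pt | ? | 69 | 79 | +10 |
| tone-domain-mismatch-pt | ? | 93 | 100 | +7 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-en | ? | 83 | 87 | +4 |
| role-inflation-es | ? | 97 | 100 | +3 |
| threat-framing-pt | ? | 90 | 79 | -11 |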
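
Llama 3.3 70B (Groq): 50 prompts · mean Δ +44.8

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 91 | +91 |
| medical-assistant-en | ? | 0 | 90 | +90 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| disguised-repetitions-pt | ? | 13 | 93 | +80 |
| consultor-juridico | ? | 0 | 79 | +79 |
| assistente-medico | ? | 0 | 79 | +79 |
| legal-advisor-pt | ? | 0 | 79 | +79 |
| medical-assistant-pt | ? | 0 | 79 | +79 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| rigid-code-agent-pt | ? | 1 | 65 | +64 |
| financial-analyst-en | ? | 37 | 100 | +63 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| marketing-strategist-pt | ? | 28 | 90 | +62 |
| financial-analyst-pt | ? | 31 | 93 | +62 |
| legal-advisor-es | ? | 14 | 72 | +58 |
| disguised-repetitions-es | ? | 28 | 86 | +58 |
| agente-codigo-rigido | ? | 1 | 58 | +57 |
| rigid-code-agent-es | ? | 22 | 79 | +57 |
| marketing-strategist-en | ? | 43 | 97 | +54 |
| Supply chain optimizer (en) | en | 47 | 100 | +53 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| Tax advisor (en) | en | 48 | 100 | +52 |
| Consultor fiscal (es) | es | 51 | 100 | +49 |
| High school tutor (en) | en | 53 | 97 | +44 |
| Consultor tributário (pt) | pt | 51 | 93 | +42 |
| tone-domain-mismatch-es | ? | 45 | 86 | +41 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| contradiction-en | ? | 61 | 100 | +39 |
| Tutor de ensino médio (pt) | pt | 49 | 86 | +37 |
| Tutor de secundaria (es) | es | 49 | 86 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| contradiction-es | ? | 61 | 93 | +32 |
| Otimizador logístico (pt) | pt | 55 | 83 | +28 |
| contradiction-pt | ? | 61 | 86 | +25 |
| Optimizador logístico (es) | es | 63 | 86 | +23 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| financial-analyst-es | ? | 56 | 76 | +20 |
| marketing-strategist-es | ? | 64 | 80 | +16 |
| unresolved-reference-en | ? | 76 | 90 | +14 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| threat-framing-en | ? | 83 | 90 | +7 |
| threat-framing-es | ? | 79 | 86 | +7 |
| tone-domain-mismatch-pt | ? | 93 | 100 | +7 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-es | ? | 97 | 100 | +3 |
| unresolved-reference-pt | ? | 69 | 72 | +3 |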
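
DeepSeek R1: 50 prompts · mean Δ +44.7

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| rigid-code-agent-en | ? | 1 | 97 | +96 |
| legal-advisor-en | ? | 0 | 93 | +93 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| agente-codigo-rigido | ? | 1 | 86 | +85 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| consultor-juridico | ? | 0 | 83 | +83 |
| disguised-repetitions-pt | ? | 13 | 94 | +81 |
| assistente-medico | ? | 0 | 79 | +79 |
| medical-assistant-en | ? | 0 | 77 | +77 |
| legal-advisor-es | ? | 14 | 90 | +76 |
| medical-assistant-pt | ? | 0 | 76 | +76 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| rigid-code-agent-pt | ? | 1 | 74 | +73 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| financial-analyst-pt | ? | 31 | 94 | +63 |
| rigid-code-agent-es | ? | 22 | 84 | +62 |
| medical-assistant-es | ? | 24 | 83 | +59 |
| Supply chain optimizer (en) | en | 47 | 100 | +53 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| financial-analyst-en | ? | 37 | 87 | +50 |
| Tax advisor (en) | en | 48 | 97 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| Tutor de secundaria (es) | es | 49 | 97 | +48 |
| High school tutor (en) | en | 53 | 100 | +47 |
| marketing-strategist-en | ? | 43 | 87 | +44 |
| Consultor tributário (pt) | pt | 51 | 93 | +42 |
| Consultor fiscal (es) | es | 51 | 93 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| marketing-strategist-pt | ? | 28 | 66 | +38 |
| financial-analyst-es | ? | 56 | 94 | +38 |
| Tutor de ensino médio (pt) | pt | 49 | 86 | +37 |
| Otimizador logístico (pt) | pt | 55 | 86 | +31 |
| Optimizador logístico (es) | es | 63 | 93 | +30 |
| marketing-strategist-es | ? | 64 | 93 | +29 |
| contradiction-pt | ? | 61 | 86 | +25 |
| contradiction-en | ? | 61 | 86 | +25 |
| contradiction-es | ? | 61 | 86 | +25 |
| task-style-no-roleplay-es | ? | 65 | 87 | +22 |
| unresolved-reference-en | ? | 76 | 93 | +17 |
| threat-framing-es | ? | 79 | 93 | +14 |
| threat-framing-en | ? | 83 | 93 | +10 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| role-inflation-pt | ? | 88 | 93 | +5 |
| unresolved-reference-pt | ? | 69 | 72 | +3 |
| unresolved-reference-es | ? | 79 | 79 | 0 |
| role-inflation-es | ? | 97 | 93 | -4 |
| threat-framing-pt | ? | 90 | 83 | -7 |
| tone-domain-mismatch-pt | ? | 93 | 79 | -14 |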
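
DeepSeek V3: 50 prompts · mean Δ +44.6

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 97 | +97 |
| agente-codigo-rigido | ? | 1 | 90 | +89 |
| disguised-repetitions-pt | ? | 13 | 100 | +87 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| consultor-juridico | ? | 0 | 83 | +83 |
| rigid-code-agent-pt | ? | 1 | 83 | +82 |
| medical-assistant-pt | ? | 0 | 79 | +79 |
| medical-assistant-en | ? | 0 | 79 | +79 |
| assistente-medico | ? | 0 | 79 | +79 |
| rigid-code-agent-es | ? | 22 | 100 | +78 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| legal-advisor-es | ? | 14 | 79 | +65 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| financial-analyst-pt | ? | 31 | 93 | +62 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| Tax advisor (en) | en | 48 | 100 | +52 |
| Tutor de ensino médio (pt) | pt | 49 | 100 | +51 |
| Tutor de secundaria (es) | es | 49 | 100 | +51 |
| Supply chain optimizer (en) | en | 47 | 97 | +50 |
| financial-analyst-en | ? | 37 | 86 | +49 |
| Consultor tributário (pt) | pt | 51 | 100 | +49 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| estrategista-marketing | ? | 28 | 76 | +48 |
| High school tutor (en) | en | 53 | 100 | +47 |
| marketing-strategist-pt | ? | 28 | 72 | +44 |
| Consultor fiscal (es) | es | 51 | 93 | +42 |
| tone-domain-mismatch-es | ? | 45 | 86 | +41 |
| Otimizador logístico (pt) | pt | 55 | 93 | +38 |
| Optimizador logístico (es) | es | 63 | 100 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| marketing-strategist-es | ? | 64 | 93 | +29 |
| contradiction-es | ? | 61 | 86 | +25 |
| marketing-strategist-en | ? | 43 | 65 | +22 |
| contradiction-pt | ? | 61 | 79 | +18 |
| financial-analyst-es | ? | 56 | 73 | +17 |
| contradiction-en | ? | 61 | 77 | +16 |
| threat-framing-es | ? | 79 | 93 | +14 |
| unresolved-reference-en | ? | 76 | 86 | +10 |
| threat-framing-en | ? | 83 | 90 | +7 |
| unresolved-reference-es | ? | 79 | 86 | +7 |
| tone-domain-mismatch-en | ? | 93 | 97 | +4 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-en | ? | 94 | 97 | +3 |
| role-inflation-pt | ? | 88 | 86 | -2 |
| role-inflation-es | ? | 97 | 90 | -7 |
| tone-domain-mismatch-pt | ? | 93 | 86 | -7 |
| unresolved-reference-pt | ? | 69 | 58 | -11 |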
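
Gemini 2.5 Flash: 50 prompts · mean Δ +40.0

| Prompt | Lang | Before | After | Δ |
|---|---|---|---|---|
| consultor-juridico | ? | 0 | 90 | +90 |
| agente-codigo-rigido | ? | 1 | 87 | +86 |
| rigid-code-agent-pt | ? | 1 | 86 | +85 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| disguised-repetitions-pt | ? | 13 | 93 | +80 |
| legal-advisor-en | ? | 0 | 79 | +79 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| rigid-code-agent-en | ? | 1 | 75 | +74 |
| medical-assistant-pt | ? | 0 | 72 | +72 |
| assistente-medico | ? | 0 | 69 | +69 |
| estrategista-marketing | ? | 28 | 97 | +69 |
| disguised-repetitions-es | ? | 28 | 93 | +65 |
| rigid-code-agent-es | ? | 22 | 86 | +64 |
| legal-advisor-pt | ? | 0 | 58 | +58 |
| legal-advisor-es | ? | 14 | 72 | +58 |
| medical-assistant-en | ? | 0 | 58 | +58 |
| medical-assistant-es | ? | 24 | 79 | +55 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| High school tutor (en) | en | 53 | 100 | +47 |
| Supply chain optimizer (en) | en | 47 | 93 | +46 |
| task-style-no-roleplay-en | ? | 48 | 93 | +45 |
| Otimizador logístico (pt) | pt | 55 | 100 | +45 |
| Tutor de ensino médio (pt) | pt | 49 | 93 | +44 |
| Consultor tributário (pt) | pt | 51 | 93 | +42 |
| Tax advisor (en) | en | 48 | 90 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| marketing-strategist-en | ? | 43 | 81 | +38 |
| Consultor fiscal (es) | es | 51 | 86 | +35 |
| Tutor de secundaria (es) | es | 49 | 83 | +34 |
| financial-analyst-pt | ? | 31 | 63 | +32 |
| financial-analyst-en | ? | 37 | 68 | +31 |
| Optimizador logístico (es) | es | 63 | 93 | +30 |
| marketing-strategist-pt | ? | 28 | 55 | +27 |
| contradiction-pt | ? | 61 | 86 | +25 |
| contradiction-en | ? | 61 | 86 | +25 |
| task-style-no-roleplay-es | ? | 65 | 86 | +21 |
| contradiction-es | ? | 61 | 79 | +18 |
| financial-analyst-es | ? | 56 | 71 | +15 |
| threat-framing-en | ? | 83 | 97 | +14 |
| threat-framing-es | ? | 79 | 93 | +14 |
| threat-framing-pt | ? | 90 | 100 | +10 |
| unresolved-reference-en | ? | 76 | 86 | +10 |
| marketing-strategist-es | ? | 64 | 72 | +8 |
| unresolved-reference-es | ? | 79 | 86 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| role-inflation-es | ? | 97 | 100 | +3 |
| tone-domain-mismatch-en | ? | 93 | 93 | 0 |
| role-inflation-pt | ? | 88 | 83 | -5 |
| tone-domain-mismatch-pt | ? | 93 | 83 | -10 |
| unresolved-reference-pt | ? | 69 | 58 | -11 |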