Meta-prompt-following under pressure to preserve purpose — before/after score delta for each model. Results are measured using Whet's internal prompts and criteria: they don't constitute an academic benchmark or official comparison. Useful as a reference for real behavior; not as a definitive verdict.
The benchmark has two axes, a technical one (prompt sharpening) and a political one (ideological positioning), which share the same open methodology but cut the results differently.
The technical axis measures how well each LLM sharpens a poorly written prompt without destroying its original intent.
The political axis measures how each LLM positions itself when forced out of neutrality on politically charged questions: direction, commitment, and asymmetry.
| # | model | model id | lab | access | notes | before | after | Δ | latency | prompts |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Jamba Large 1.7 (AI21) | jamba-large-1.7 | AI21 Labs (Israel) | trial | Hybrid Mamba-Transformer architecture; the only model in the benchmark that is not a pure transformer | 42.6 | 94.0 | +51.4 | 4.5s | 74 |
| 2 | Claude Sonnet (via CLI) | claude-sonnet-4-6 (via CLI) | Anthropic (USA) | paid | Claude Sonnet via CLI; faster and lighter than Opus | 42.6 | 93.6 | +51.0 | 20.8s | 74 |
| 3 | Claude Opus (via CLI) | claude-opus-4-7 (via CLI) | Anthropic (USA) | paid | Claude Opus via CLI subscription; no separate API key | 42.6 | 93.1 | +50.4 | 14.4s | 74 |
| 4 | Grok 4.20 Reasoning (xAI) | grok-4.20-reasoning | xAI (USA) | data-sharing | Self-described less-restrictive alignment; tests meta-prompt-following under lower resistance to reformulating | 42.6 | 91.0 | +48.3 | 18.7s | 74 |
| 5 | GPT-5.4 (OpenAI) | gpt-5.4 | OpenAI (USA) | paid | OpenAI's current flagship (released March 2026); tests whether the new generation breaks the conservative pattern of gpt-4o-mini | 42.6 | 89.8 | +47.1 | 5.0s | 74 |
| 6 | GPT-5.5 (OpenAI) | gpt-5.5 | OpenAI (USA) | paid | Flagship reasoning iteration that succeeds gpt-5.4; measures whether the move to reasoning by default in the 5.5 line keeps the meta-prompt-following gain | 42.6 | 89.6 | +47.0 | 7.9s | 74 |
| 7 | DeepSeek R1 | deepseek-reasoner | DeepSeek (China) | free | Reasoner version of DeepSeek with explicit chain-of-thought | 42.6 | 88.9 | +46.3 | 39.9s | 74 |
| 8 | Mistral Small | mistral-small-latest | Mistral AI (France) | free | European open-weight model focused on efficiency | 42.6 | 88.9 | +46.2 | 2.4s | 74 |
| 9 | Llama 3.3 70B (Groq) | llama-3.3-70b-versatile | Meta (USA) via Groq | free | Open-source Llama with inference on Groq's LPU hardware | 42.6 | 87.9 | +45.2 | 4.0s | 74 |
| 10 | DeepSeek V3 | deepseek-chat | DeepSeek (China) | free | High-performance Chinese model trained with a focus on code and reasoning | 42.6 | 87.0 | +44.3 | 7.9s | 74 |
| 11 | Command A (Cohere) | command-a-03-2025 | Cohere (Canada) | trial | Alignment philosophy focused on RAG, grounded generation, and tool use; tests meta-prompt-following under a distinct bias | 42.6 | 83.9 | +41.3 | 9.5s | 74 |
| 12 | Gemini 2.5 Flash | gemini-2.5-flash | Google (USA) | free | Google's reference multimodal model | 42.6 | 81.7 | +39.0 | 8.6s | 74 |
| 13 | GPT-5 nano (OpenAI) | gpt-5-nano | OpenAI (USA) | paid | Nano tier of the GPT-5 generation; cheap and recent, a direct contrast to the legacy gpt-4o-mini | 42.6 | 80.4 | +37.8 | 5.8s | 74 |
| 14 | GPT-4o mini (OpenAI) | gpt-4o-mini | OpenAI (USA) | paid | OpenAI's cost-efficient flagship; represents the field's pioneering RLHF philosophy | 42.6 | 78.5 | +35.9 | 3.8s | 74 |
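
As a rough illustration of how the before, after, and Δ figures relate, here is a minimal Python sketch. It assumes the leaderboard values are plain means over per-prompt scores, which the source does not state; the helper names `signed` and `summarize` are hypothetical, and the sample rows are copied from the per-prompt tables further down.

```python
from statistics import mean


def signed(delta: float, digits: int = 0) -> str:
    """Format a delta with exactly one sign, e.g. '+7' or '-14'."""
    return f"{delta:+.{digits}f}"


def summarize(rows):
    """Aggregate (prompt, before, after) rows into mean scores and their delta.

    Assumption: the aggregate is a simple mean over prompts.
    """
    before = mean(b for _, b, _ in rows)
    after = mean(a for _, _, a in rows)
    return {"before": before, "after": after, "delta": after - before}


# A handful of per-prompt rows taken from the tables in this document.
rows = [
    ("legal-advisor-en", 0, 93),
    ("tone-domain-mismatch-en", 93, 100),
    ("role-inflation-pt", 88, 87),
    ("tone-domain-mismatch-pt", 93, 79),
]

# Per-prompt delta is simply after minus before.
per_prompt = {name: signed(after - before) for name, before, after in rows}
summary = summarize(rows)

for name, delta in per_prompt.items():
    print(f"{name}: {delta}")
print(f"mean before={summary['before']:.1f} after={summary['after']:.1f}")
```

Using the `+` format flag yields a single sign for both gains and regressions, so a prompt whose score drops after sharpening is shown with a plain minus sign.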
Each provider has an expanded list with the result for each corpus prompt, reproduced in the tables below. These lists are useful for understanding where each model sharpens better or worse.
| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 100 | +100 |
| rigid-code-agent-en | ? | 1 | 100 | +99 |
| rigid-code-agent-pt | ? | 1 | 97 | +96 |
| agente-codigo-rigido | ? | 1 | 97 | +96 |
| legal-advisor-pt | ? | 0 | 93 | +93 |
| corporate-counsel-pt | ? | 0 | 93 | +93 |
| clinical-nurse-intake-en | ? | 0 | 93 | +93 |
| personal-investment-advisor-es | ? | 0 | 93 | +93 |
| consultor-juridico | ? | 0 | 90 | +90 |
| medical-assistant-pt | ? | 0 | 87 | +87 |
| legal-advisor-es | ? | 14 | 100 | +86 |
| medical-assistant-en | ? | 0 | 86 | +86 |
| disguised-repetitions-pt | ? | 13 | 97 | +84 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| assistente-medico | ? | 0 | 80 | +80 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 100 | +76 |
| rigid-code-agent-es | ? | 22 | 97 | +75 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| marketing-strategist-pt | ? | 28 | 100 | +72 |
| estrategista-marketing | ? | 28 | 100 | +72 |
| ux-research-assistant-en | ? | 25 | 97 | +72 |
| social-media-agency-pt | ? | 20 | 90 | +70 |
| medical-assistant-es | ? | 24 | 93 | +69 |
| disguised-repetitions-es | ? | 28 | 97 | +69 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 93 | +64 |
| Satellite ground-station CI reviewer (en) | en | 38 | 100 | +62 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 76 | +60 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 100 | +59 |
| marketing-strategist-en | ? | 43 | 100 | +57 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 100 | +56 |
| Coordinador last-mile en Lima (es) | es | 42 | 97 | +55 |
| Pediatric dental clinic intake (en) | en | 38 | 93 | +55 |
| newsroom-writer-es | ? | 43 | 97 | +54 |
| logistics-optimizer-en | ? | 47 | 100 | +53 |
| financial-analyst-en | ? | 37 | 90 | +53 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 93 | +53 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| Tangier waterfront pier competition (en) | en | 34 | 86 | +52 |
| education-tutor-es | ? | 49 | 100 | +51 |
| tax-consultant-es | ? | 51 | 100 | +49 |
| financial-analyst-pt | ? | 31 | 80 | +49 |
| education-tutor-pt | ? | 49 | 97 | +48 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| logistics-optimizer-pt | ? | 55 | 100 | +45 |
| education-tutor-en | ? | 53 | 97 | +44 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 90 | +38 |
| nutripal-weight-loss-en | ? | 48 | 86 | +38 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| logistics-optimizer-es | ? | 63 | 100 | +37 |
| marketing-strategist-es | ? | 64 | 100 | +36 |
| task-style-no-roleplay-en | ? | 48 | 83 | +35 |
| mental-health-wellness-es | ? | 58 | 93 | +35 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 90 | +35 |
| financial-analyst-es | ? | 56 | 90 | +34 |
| task-style-no-roleplay-es | ? | 65 | 97 | +32 |
| virtual-pharmacist-pt | ? | 65 | 93 | +28 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| contradiction-es | ? | 61 | 86 | +25 |
| senior-structural-engineer-pt | ? | 77 | 100 | +23 |
| streaming-script-editor-en | ? | 64 | 86 | +22 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| contradiction-en | ? | 61 | 79 | +18 |
| threat-framing-en | ? | 83 | 100 | +17 |
| unresolved-reference-pt | ? | 69 | 86 | +17 |
| threat-framing-es | ? | 79 | 93 | +14 |
| unresolved-reference-en | ? | 76 | 90 | +14 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| contradiction-pt | ? | 61 | 72 | +11 |
| tone-domain-mismatch-pt | ? | 93 | 100 | +7 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-es | ? | 97 | 100 | +3 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| medical-assistant-pt | ? | 0 | 94 | +94 |
| legal-advisor-en | ? | 0 | 93 | +93 |
| personal-investment-advisor-es | ? | 0 | 93 | +93 |
| consultor-juridico | ? | 0 | 90 | +90 |
| agente-codigo-rigido | ? | 1 | 90 | +89 |
| disguised-repetitions-pt | ? | 13 | 100 | +87 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| corporate-counsel-pt | ? | 0 | 86 | +86 |
| medical-assistant-en | ? | 0 | 83 | +83 |
| assistente-medico | ? | 0 | 83 | +83 |
| clinical-nurse-intake-en | ? | 0 | 83 | +83 |
| rigid-code-agent-pt | ? | 1 | 81 | +80 |
| social-media-agency-pt | ? | 20 | 100 | +80 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 100 | +71 |
| legal-advisor-es | ? | 14 | 79 | +65 |
| rigid-code-agent-es | ? | 22 | 87 | +65 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 86 | +62 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 75 | +59 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 100 | +59 |
| Tangier waterfront pier competition (en) | en | 34 | 93 | +59 |
| marketing-strategist-pt | ? | 28 | 86 | +58 |
| Coordinador last-mile en Lima (es) | es | 42 | 100 | +58 |
| newsroom-writer-es | ? | 43 | 100 | +57 |
| marketing-strategist-en | ? | 43 | 97 | +54 |
| logistics-optimizer-en | ? | 47 | 100 | +53 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 97 | +53 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 93 | +53 |
| financial-analyst-pt | ? | 31 | 83 | +52 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| nutripal-weight-loss-en | ? | 48 | 100 | +52 |
| Satellite ground-station CI reviewer (en) | en | 38 | 90 | +52 |
| Pediatric dental clinic intake (en) | en | 38 | 90 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| education-tutor-es | ? | 49 | 100 | +51 |
| tax-consultant-es | ? | 51 | 100 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| financial-analyst-en | ? | 37 | 83 | +46 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| mental-health-wellness-es | ? | 58 | 100 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| contradiction-en | ? | 61 | 100 | +39 |
| logistics-optimizer-pt | ? | 55 | 93 | +38 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 93 | +38 |
| financial-analyst-es | ? | 56 | 93 | +37 |
| logistics-optimizer-es | ? | 63 | 100 | +37 |
| contradiction-pt | ? | 61 | 97 | +36 |
| contradiction-es | ? | 61 | 97 | +36 |
| marketing-strategist-es | ? | 64 | 97 | +33 |
| virtual-pharmacist-pt | ? | 65 | 97 | +32 |
| unresolved-reference-pt | ? | 69 | 100 | +31 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 93 | +31 |
| task-style-no-roleplay-es | ? | 65 | 93 | +28 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| streaming-script-editor-en | ? | 64 | 90 | +26 |
| unresolved-reference-en | ? | 76 | 100 | +24 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| senior-structural-engineer-pt | ? | 77 | 97 | +20 |
| threat-framing-en | ? | 83 | 100 | +17 |
| threat-framing-es | ? | 79 | 93 | +14 |
| threat-framing-pt | ? | 90 | 100 | +10 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-es | ? | 97 | 100 | +3 |
| role-inflation-pt | ? | 88 | 87 | -1 |
| role-inflation-en | ? | 94 | 90 | -4 |
| tone-domain-mismatch-pt | ? | 93 | 79 | -14 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| consultor-juridico | ? | 0 | 100 | +100 |
| legal-advisor-en | ? | 0 | 100 | +100 |
| legal-advisor-pt | ? | 0 | 100 | +100 |
| assistente-medico | ? | 0 | 93 | +93 |
| medical-assistant-en | ? | 0 | 93 | +93 |
| medical-assistant-pt | ? | 0 | 90 | +90 |
| corporate-counsel-pt | ? | 0 | 90 | +90 |
| clinical-nurse-intake-en | ? | 0 | 90 | +90 |
| rigid-code-agent-pt | ? | 1 | 90 | +89 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| agente-codigo-rigido | ? | 1 | 87 | +86 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| personal-investment-advisor-es | ? | 0 | 86 | +86 |
| disguised-repetitions-pt | ? | 13 | 97 | +84 |
| social-media-agency-pt | ? | 20 | 100 | +80 |
| legal-advisor-es | ? | 14 | 93 | +79 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| medical-assistant-es | ? | 24 | 100 | +76 |
| disguised-repetitions-es | ? | 28 | 97 | +69 |
| ux-research-assistant-en | ? | 25 | 93 | +68 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 83 | +67 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| rigid-code-agent-es | ? | 22 | 84 | +62 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 100 | +59 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 83 | +59 |
| marketing-strategist-pt | ? | 28 | 86 | +58 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 86 | +57 |
| financial-analyst-en | ? | 37 | 93 | +56 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| Satellite ground-station CI reviewer (en) | en | 38 | 90 | +52 |
| Tangier waterfront pier competition (en) | en | 34 | 86 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| education-tutor-es | ? | 49 | 100 | +51 |
| Coordinador last-mile en Lima (es) | es | 42 | 93 | +51 |
| marketing-strategist-en | ? | 43 | 93 | +50 |
| newsroom-writer-es | ? | 43 | 93 | +50 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 90 | +50 |
| tax-consultant-es | ? | 51 | 100 | +49 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 93 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| logistics-optimizer-en | ? | 47 | 93 | +46 |
| logistics-optimizer-pt | ? | 55 | 100 | +45 |
| nutripal-weight-loss-en | ? | 48 | 93 | +45 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| mental-health-wellness-es | ? | 58 | 100 | +42 |
| financial-analyst-pt | ? | 31 | 72 | +41 |
| contradiction-pt | ? | 61 | 100 | +39 |
| contradiction-es | ? | 61 | 100 | +39 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| Pediatric dental clinic intake (en) | en | 38 | 75 | +37 |
| contradiction-en | ? | 61 | 93 | +32 |
| unresolved-reference-pt | ? | 69 | 100 | +31 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 86 | +31 |
| task-style-no-roleplay-es | ? | 65 | 93 | +28 |
| financial-analyst-es | ? | 56 | 83 | +27 |
| logistics-optimizer-es | ? | 63 | 90 | +27 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| unresolved-reference-en | ? | 76 | 100 | +24 |
| senior-structural-engineer-pt | ? | 77 | 100 | +23 |
| streaming-script-editor-en | ? | 64 | 87 | +23 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| virtual-pharmacist-pt | ? | 65 | 86 | +21 |
| marketing-strategist-es | ? | 64 | 79 | +15 |
| threat-framing-es | ? | 79 | 93 | +14 |
| threat-framing-en | ? | 83 | 93 | +10 |
| role-inflation-en | ? | 94 | 100 | +6 |
| role-inflation-pt | ? | 88 | 93 | +5 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| tone-domain-mismatch-pt | ? | 93 | 93 | +0 |
| tone-domain-mismatch-en | ? | 93 | 90 | -3 |
| role-inflation-es | ? | 97 | 93 | -4 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 100 | +100 |
| legal-advisor-pt | ? | 0 | 93 | +93 |
| medical-assistant-en | ? | 0 | 93 | +93 |
| consultor-juridico | ? | 0 | 93 | +93 |
| disguised-repetitions-pt | ? | 13 | 100 | +87 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| legal-advisor-es | ? | 14 | 100 | +86 |
| assistente-medico | ? | 0 | 86 | +86 |
| rigid-code-agent-pt | ? | 1 | 84 | +83 |
| agente-codigo-rigido | ? | 1 | 84 | +83 |
| medical-assistant-pt | ? | 0 | 79 | +79 |
| corporate-counsel-pt | ? | 0 | 79 | +79 |
| clinical-nurse-intake-en | ? | 0 | 79 | +79 |
| rigid-code-agent-en | ? | 1 | 76 | +75 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| social-media-agency-pt | ? | 20 | 93 | +73 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| personal-investment-advisor-es | ? | 0 | 72 | +72 |
| rigid-code-agent-es | ? | 22 | 90 | +68 |
| Tangier waterfront pier competition (en) | en | 34 | 100 | +66 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| estrategista-marketing | ? | 28 | 90 | +62 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 86 | +62 |
| financial-analyst-en | ? | 37 | 97 | +60 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 100 | +59 |
| marketing-strategist-en | ? | 43 | 100 | +57 |
| tone-domain-mismatch-es | ? | 45 | 100 | +55 |
| Satellite ground-station CI reviewer (en) | en | 38 | 93 | +55 |
| logistics-optimizer-en | ? | 47 | 100 | +53 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| nutripal-weight-loss-en | ? | 48 | 100 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| Coordinador last-mile en Lima (es) | es | 42 | 93 | +51 |
| tax-consultant-es | ? | 51 | 100 | +49 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 93 | +49 |
| marketing-strategist-pt | ? | 28 | 76 | +48 |
| financial-analyst-pt | ? | 31 | 79 | +48 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| newsroom-writer-es | ? | 43 | 90 | +47 |
| Pediatric dental clinic intake (en) | en | 38 | 83 | +45 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| mental-health-wellness-es | ? | 58 | 100 | +42 |
| contradiction-pt | ? | 61 | 100 | +39 |
| logistics-optimizer-pt | ? | 55 | 93 | +38 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 54 | +38 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| education-tutor-es | ? | 49 | 86 | +37 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 66 | +37 |
| marketing-strategist-es | ? | 64 | 100 | +36 |
| contradiction-en | ? | 61 | 93 | +32 |
| contradiction-es | ? | 61 | 93 | +32 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 72 | +32 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 86 | +31 |
| logistics-optimizer-es | ? | 63 | 93 | +30 |
| financial-analyst-es | ? | 56 | 86 | +30 |
| virtual-pharmacist-pt | ? | 65 | 93 | +28 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| streaming-script-editor-en | ? | 64 | 90 | +26 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| threat-framing-en | ? | 83 | 100 | +17 |
| unresolved-reference-pt | ? | 69 | 86 | +17 |
| unresolved-reference-en | ? | 76 | 93 | +17 |
| task-style-no-roleplay-es | ? | 65 | 79 | +14 |
| senior-structural-engineer-pt | ? | 77 | 90 | +13 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| role-inflation-es | ? | 97 | 100 | +3 |
| threat-framing-es | ? | 79 | 79 | +0 |
| threat-framing-pt | ? | 90 | 86 | -4 |
| tone-domain-mismatch-pt | ? | 93 | 65 | -28 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| medical-assistant-en | ? | 0 | 90 | +90 |
| clinical-nurse-intake-en | ? | 0 | 90 | +90 |
| medical-assistant-pt | ? | 0 | 86 | +86 |
| assistente-medico | ? | 0 | 86 | +86 |
| agente-codigo-rigido | ? | 1 | 87 | +86 |
| legal-advisor-en | ? | 0 | 84 | +84 |
| legal-advisor-pt | ? | 0 | 83 | +83 |
| rigid-code-agent-en | ? | 1 | 84 | +83 |
| disguised-repetitions-pt | ? | 13 | 94 | +81 |
| repeticoes-disfarcadas | ? | 13 | 94 | +81 |
| corporate-counsel-pt | ? | 0 | 80 | +80 |
| personal-investment-advisor-es | ? | 0 | 80 | +80 |
| rigid-code-agent-pt | ? | 1 | 80 | +79 |
| consultor-juridico | ? | 0 | 79 | +79 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 90 | +74 |
| legal-advisor-es | ? | 14 | 86 | +72 |
| rigid-code-agent-es | ? | 22 | 93 | +71 |
| social-media-agency-pt | ? | 20 | 90 | +70 |
| medical-assistant-es | ? | 24 | 93 | +69 |
| disguised-repetitions-es | ? | 28 | 97 | +69 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 93 | +69 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 86 | +57 |
| financial-analyst-pt | ? | 31 | 86 | +55 |
| estrategista-marketing | ? | 28 | 83 | +55 |
| financial-analyst-en | ? | 37 | 90 | +53 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| Satellite ground-station CI reviewer (en) | en | 38 | 90 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| education-tutor-es | ? | 49 | 100 | +51 |
| Coordinador last-mile en Lima (es) | es | 42 | 93 | +51 |
| newsroom-writer-es | ? | 43 | 93 | +50 |
| task-style-no-roleplay-en | ? | 48 | 97 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 90 | +46 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 87 | +46 |
| logistics-optimizer-en | ? | 47 | 90 | +43 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| tax-consultant-es | ? | 51 | 93 | +42 |
| mental-health-wellness-es | ? | 58 | 100 | +42 |
| Tangier waterfront pier competition (en) | en | 34 | 76 | +42 |
| marketing-strategist-pt | ? | 28 | 69 | +41 |
| contradiction-en | ? | 61 | 100 | +39 |
| contradiction-es | ? | 61 | 100 | +39 |
| nutripal-weight-loss-en | ? | 48 | 86 | +38 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 93 | +38 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| financial-analyst-es | ? | 56 | 93 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| marketing-strategist-en | ? | 43 | 76 | +33 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 73 | +33 |
| contradiction-pt | ? | 61 | 93 | +32 |
| logistics-optimizer-pt | ? | 55 | 86 | +31 |
| logistics-optimizer-es | ? | 63 | 93 | +30 |
| Pediatric dental clinic intake (en) | en | 38 | 68 | +30 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| threat-framing-es | ? | 79 | 100 | +21 |
| marketing-strategist-es | ? | 64 | 83 | +19 |
| unresolved-reference-es | ? | 79 | 97 | +18 |
| virtual-pharmacist-pt | ? | 65 | 83 | +18 |
| unresolved-reference-pt | ? | 69 | 86 | +17 |
| threat-framing-en | ? | 83 | 97 | +14 |
| senior-structural-engineer-pt | ? | 77 | 90 | +13 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| threat-framing-pt | ? | 90 | 97 | +7 |
| unresolved-reference-en | ? | 76 | 83 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| streaming-script-editor-en | ? | 64 | 70 | +6 |
| role-inflation-es | ? | 97 | 100 | +3 |
| tone-domain-mismatch-en | ? | 93 | 83 | -10 |
| tone-domain-mismatch-pt | ? | 93 | 72 | -21 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 90 | +90 |
| medical-assistant-pt | ? | 0 | 90 | +90 |
| consultor-juridico | ? | 0 | 90 | +90 |
| assistente-medico | ? | 0 | 90 | +90 |
| corporate-counsel-pt | ? | 0 | 90 | +90 |
| legal-advisor-pt | ? | 0 | 87 | +87 |
| medical-assistant-en | ? | 0 | 87 | +87 |
| personal-investment-advisor-es | ? | 0 | 87 | +87 |
| rigid-code-agent-pt | ? | 1 | 87 | +86 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| disguised-repetitions-pt | ? | 13 | 97 | +84 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| agente-codigo-rigido | ? | 1 | 84 | +83 |
| clinical-nurse-intake-en | ? | 0 | 79 | +79 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| social-media-agency-pt | ? | 20 | 90 | +70 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 93 | +69 |
| legal-advisor-es | ? | 14 | 82 | +68 |
| rigid-code-agent-es | ? | 22 | 90 | +68 |
| ux-research-assistant-en | ? | 25 | 90 | +65 |
| estrategista-marketing | ? | 28 | 90 | +62 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 77 | +61 |
| medical-assistant-es | ? | 24 | 83 | +59 |
| Tangier waterfront pier competition (en) | en | 34 | 93 | +59 |
| disguised-repetitions-es | ? | 28 | 84 | +56 |
| Satellite ground-station CI reviewer (en) | en | 38 | 93 | +55 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| education-tutor-es | ? | 49 | 100 | +51 |
| financial-analyst-pt | ? | 31 | 82 | +51 |
| newsroom-writer-es | ? | 43 | 93 | +50 |
| tax-consultant-es | ? | 51 | 100 | +49 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| financial-analyst-en | ? | 37 | 82 | +45 |
| logistics-optimizer-en | ? | 47 | 90 | +43 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 83 | +43 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 86 | +42 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 71 | +42 |
| tone-domain-mismatch-es | ? | 45 | 86 | +41 |
| marketing-strategist-en | ? | 43 | 82 | +39 |
| contradiction-en | ? | 61 | 100 | +39 |
| nutripal-weight-loss-en | ? | 48 | 86 | +38 |
| Coordinador last-mile en Lima (es) | es | 42 | 80 | +38 |
| marketing-strategist-pt | ? | 28 | 65 | +37 |
| contradiction-es | ? | 61 | 97 | +36 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 90 | +35 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 97 | +35 |
| cooperative-agronomist-es | ? | 66 | 100 | +34 |
| Pediatric dental clinic intake (en) | en | 38 | 72 | +34 |
| logistics-optimizer-pt | ? | 55 | 86 | +31 |
| unresolved-reference-pt | ? | 69 | 100 | +31 |
| logistics-optimizer-es | ? | 63 | 93 | +30 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 69 | +28 |
| marketing-strategist-es | ? | 64 | 90 | +26 |
| financial-analyst-es | ? | 56 | 82 | +26 |
| contradiction-pt | ? | 61 | 87 | +26 |
| senior-structural-engineer-pt | ? | 77 | 100 | +23 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| streaming-script-editor-en | ? | 64 | 84 | +20 |
| threat-framing-es | ? | 79 | 97 | +18 |
| unresolved-reference-en | ? | 76 | 90 | +14 |
| mental-health-wellness-es | ? | 58 | 72 | +14 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| threat-framing-en | ? | 83 | 94 | +11 |
| virtual-pharmacist-pt | ? | 65 | 76 | +11 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-pt | ? | 90 | 94 | +4 |
| role-inflation-es | ? | 97 | 100 | +3 |
| tone-domain-mismatch-pt | ? | 93 | 76 | -17 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| clinical-nurse-intake-en | ? | 0 | 100 | +100 |
| rigid-code-agent-en | ? | 1 | 97 | +96 |
| legal-advisor-en | ? | 0 | 93 | +93 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| personal-investment-advisor-es | ? | 0 | 86 | +86 |
| agente-codigo-rigido | ? | 1 | 86 | +85 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| consultor-juridico | ? | 0 | 83 | +83 |
| disguised-repetitions-pt | ? | 13 | 94 | +81 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 97 | +81 |
| assistente-medico | ? | 0 | 79 | +79 |
| medical-assistant-en | ? | 0 | 77 | +77 |
| legal-advisor-es | ? | 14 | 90 | +76 |
| medical-assistant-pt | ? | 0 | 76 | +76 |
| corporate-counsel-pt | ? | 0 | 76 | +76 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 100 | +76 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| rigid-code-agent-pt | ? | 1 | 74 | +73 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| ux-research-assistant-en | ? | 25 | 97 | +72 |
| Tangier waterfront pier competition (en) | en | 34 | 100 | +66 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| financial-analyst-pt | ? | 31 | 94 | +63 |
| rigid-code-agent-es | ? | 22 | 84 | +62 |
| medical-assistant-es | ? | 24 | 83 | +59 |
| social-media-agency-pt | ? | 20 | 79 | +59 |
| logistics-optimizer-en | ? | 47 | 100 | +53 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 93 | +52 |
| financial-analyst-en | ? | 37 | 87 | +50 |
| newsroom-writer-es | ? | 43 | 93 | +50 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 79 | +50 |
| tax-consultant-en | ? | 48 | 97 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| education-tutor-es | ? | 49 | 97 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 90 | +46 |
| marketing-strategist-en | ? | 43 | 87 | +44 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| tax-consultant-es | ? | 51 | 93 | +42 |
| mental-health-wellness-es | ? | 58 | 100 | +42 |
| Satellite ground-station CI reviewer (en) | en | 38 | 80 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| marketing-strategist-pt | ? | 28 | 66 | +38 |
| financial-analyst-es | ? | 56 | 94 | +38 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 93 | +38 |
| Pediatric dental clinic intake (en) | en | 38 | 76 | +38 |
| education-tutor-pt | ? | 49 | 86 | +37 |
| Coordinador last-mile en Lima (es) | es | 42 | 79 | +37 |
| nutripal-weight-loss-en | ? | 48 | 83 | +35 |
| cooperative-agronomist-es | ? | 66 | 100 | +34 |
| logistics-optimizer-pt | ? | 55 | 86 | +31 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 93 | +31 |
| logistics-optimizer-es | ? | 63 | 93 | +30 |
| marketing-strategist-es | ? | 64 | 93 | +29 |
| streaming-script-editor-en | ? | 64 | 93 | +29 |
| contradiction-pt | ? | 61 | 86 | +25 |
| contradiction-en | ? | 61 | 86 | +25 |
| contradiction-es | ? | 61 | 86 | +25 |
| task-style-no-roleplay-es | ? | 65 | 87 | +22 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 62 | +22 |
| unresolved-reference-en | ? | 76 | 93 | +17 |
| threat-framing-es | ? | 79 | 93 | +14 |
| virtual-pharmacist-pt | ? | 65 | 79 | +14 |
| senior-structural-engineer-pt | ? | 77 | 90 | +13 |
| threat-framing-en | ? | 83 | 93 | +10 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| role-inflation-pt | ? | 88 | 93 | +5 |
| unresolved-reference-pt | ? | 69 | 72 | +3 |
| unresolved-reference-es | ? | 79 | 79 | +0 |
| role-inflation-es | ? | 97 | 93 | -4 |
| threat-framing-pt | ? | 90 | 83 | -7 |
| tone-domain-mismatch-pt | ? | 93 | 79 | -14 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 97 | +97 |
| medical-assistant-en | ? | 0 | 93 | +93 |
| corporate-counsel-pt | ? | 0 | 93 | +93 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| clinical-nurse-intake-en | ? | 0 | 86 | +86 |
| rigid-code-agent-en | ? | 1 | 84 | +83 |
| legal-advisor-pt | ? | 0 | 79 | +79 |
| disguised-repetitions-pt | ? | 13 | 90 | +77 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| agente-codigo-rigido | ? | 1 | 77 | +76 |
| medical-assistant-pt | ? | 0 | 76 | +76 |
| rigid-code-agent-pt | ? | 1 | 77 | +76 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 100 | +76 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| legal-advisor-es | ? | 14 | 86 | +72 |
| social-media-agency-pt | ? | 20 | 90 | +70 |
| consultor-juridico | ? | 0 | 69 | +69 |
| disguised-repetitions-es | ? | 28 | 97 | +69 |
| estrategista-marketing | ? | 28 | 94 | +66 |
| assistente-medico | ? | 0 | 65 | +65 |
| personal-investment-advisor-es | ? | 0 | 65 | +65 |
| marketing-strategist-pt | ? | 28 | 91 | +63 |
| financial-analyst-pt | ? | 31 | 90 | +59 |
| rigid-code-agent-es | ? | 22 | 80 | +58 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 86 | +57 |
| medical-assistant-es | ? | 24 | 79 | +55 |
| Satellite ground-station CI reviewer (en) | en | 38 | 93 | +55 |
| Tangier waterfront pier competition (en) | en | 34 | 86 | +52 |
| education-tutor-es | ? | 49 | 100 | +51 |
| logistics-optimizer-en | ? | 47 | 97 | +50 |
| newsroom-writer-es | ? | 43 | 93 | +50 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 93 | +49 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 90 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| financial-analyst-en | ? | 37 | 83 | +46 |
| task-style-no-roleplay-en | ? | 48 | 93 | +45 |
| tax-consultant-en | ? | 48 | 93 | +45 |
| Pediatric dental clinic intake (en) | en | 38 | 83 | +45 |
| education-tutor-pt | ? | 49 | 93 | +44 |
| Coordinador last-mile en Lima (es) | es | 42 | 86 | +44 |
| marketing-strategist-en | ? | 43 | 86 | +43 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 57 | +41 |
| tax-consultant-pt | ? | 51 | 90 | +39 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 79 | +39 |
| nutripal-weight-loss-en | ? | 48 | 86 | +38 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| logistics-optimizer-es | ? | 63 | 100 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| tax-consultant-es | ? | 51 | 86 | +35 |
| streaming-script-editor-en | ? | 64 | 97 | +33 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 86 | +31 |
| marketing-strategist-es | ? | 64 | 94 | +30 |
| contradiction-es | ? | 61 | 90 | +29 |
| logistics-optimizer-pt | ? | 55 | 83 | +28 |
| mental-health-wellness-es | ? | 58 | 86 | +28 |
| financial-analyst-es | ? | 56 | 83 | +27 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| contradiction-pt | ? | 61 | 83 | +22 |
| threat-framing-es | ? | 79 | 100 | +21 |
| unresolved-reference-en | ? | 76 | 93 | +17 |
| senior-structural-engineer-pt | ? | 77 | 93 | +16 |
| unresolved-reference-es | ? | 79 | 93 | +14 |
| virtual-pharmacist-pt | ? | 65 | 79 | +14 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| contradiction-en | ? | 61 | 73 | +12 |
| unresolved-reference-pt | ? | 69 | 79 | +10 |
| tone-domain-mismatch-pt | ? | 93 | 100 | +7 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-en | ? | 83 | 87 | +4 |
| role-inflation-es | ? | 97 | 100 | +3 |
| threat-framing-pt | ? | 90 | 79 | -11 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 91 | +91 |
| medical-assistant-en | ? | 0 | 90 | +90 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| disguised-repetitions-pt | ? | 13 | 93 | +80 |
| consultor-juridico | ? | 0 | 79 | +79 |
| assistente-medico | ? | 0 | 79 | +79 |
| legal-advisor-pt | ? | 0 | 79 | +79 |
| medical-assistant-pt | ? | 0 | 79 | +79 |
| clinical-nurse-intake-en | ? | 0 | 76 | +76 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| social-media-agency-pt | ? | 20 | 93 | +73 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 93 | +69 |
| estrategista-marketing | ? | 28 | 93 | +65 |
| rigid-code-agent-pt | ? | 1 | 65 | +64 |
| financial-analyst-en | ? | 37 | 100 | +63 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 79 | +63 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| marketing-strategist-pt | ? | 28 | 90 | +62 |
| financial-analyst-pt | ? | 31 | 93 | +62 |
| personal-investment-advisor-es | ? | 0 | 62 | +62 |
| Satellite ground-station CI reviewer (en) | en | 38 | 97 | +59 |
| legal-advisor-es | ? | 14 | 72 | +58 |
| disguised-repetitions-es | ? | 28 | 86 | +58 |
| agente-codigo-rigido | ? | 1 | 58 | +57 |
| rigid-code-agent-es | ? | 22 | 79 | +57 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 86 | +57 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 97 | +56 |
| marketing-strategist-en | ? | 43 | 97 | +54 |
| logistics-optimizer-en | ? | 47 | 100 | +53 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| corporate-counsel-pt | ? | 0 | 51 | +51 |
| newsroom-writer-es | ? | 43 | 93 | +50 |
| tax-consultant-es | ? | 51 | 100 | +49 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 93 | +49 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 86 | +46 |
| education-tutor-en | ? | 53 | 97 | +44 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| nutripal-weight-loss-en | ? | 48 | 90 | +42 |
| Tangier waterfront pier competition (en) | en | 34 | 76 | +42 |
| tone-domain-mismatch-es | ? | 45 | 86 | +41 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| Pediatric dental clinic intake (en) | en | 38 | 79 | +41 |
| contradiction-en | ? | 61 | 100 | +39 |
| education-tutor-pt | ? | 49 | 86 | +37 |
| education-tutor-es | ? | 49 | 86 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| Coordinador last-mile en Lima (es) | es | 42 | 76 | +34 |
| contradiction-es | ? | 61 | 93 | +32 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 93 | +31 |
| logistics-optimizer-pt | ? | 55 | 83 | +28 |
| virtual-pharmacist-pt | ? | 65 | 93 | +28 |
| mental-health-wellness-es | ? | 58 | 86 | +28 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| contradiction-pt | ? | 61 | 86 | +25 |
| logistics-optimizer-es | ? | 63 | 86 | +23 |
| unresolved-reference-es | ? | 79 | 100 | +21 |
| financial-analyst-es | ? | 56 | 76 | +20 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 72 | +17 |
| marketing-strategist-es | ? | 64 | 80 | +16 |
| streaming-script-editor-en | ? | 64 | 79 | +15 |
| unresolved-reference-en | ? | 76 | 90 | +14 |
| senior-structural-engineer-pt | ? | 77 | 90 | +13 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| threat-framing-en | ? | 83 | 90 | +7 |
| threat-framing-es | ? | 79 | 86 | +7 |
| tone-domain-mismatch-pt | ? | 93 | 100 | +7 |
| tone-domain-mismatch-en | ? | 93 | 100 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-es | ? | 97 | 100 | +3 |
| unresolved-reference-pt | ? | 69 | 72 | +3 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 97 | +97 |
| clinical-nurse-intake-en | ? | 0 | 90 | +90 |
| agente-codigo-rigido | ? | 1 | 90 | +89 |
| disguised-repetitions-pt | ? | 13 | 100 | +87 |
| repeticoes-disfarcadas | ? | 13 | 100 | +87 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| rigid-code-agent-en | ? | 1 | 87 | +86 |
| consultor-juridico | ? | 0 | 83 | +83 |
| rigid-code-agent-pt | ? | 1 | 83 | +82 |
| medical-assistant-pt | ? | 0 | 79 | +79 |
| medical-assistant-en | ? | 0 | 79 | +79 |
| assistente-medico | ? | 0 | 79 | +79 |
| personal-investment-advisor-es | ? | 0 | 79 | +79 |
| rigid-code-agent-es | ? | 22 | 100 | +78 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| disguised-repetitions-es | ? | 28 | 100 | +72 |
| corporate-counsel-pt | ? | 0 | 72 | +72 |
| legal-advisor-es | ? | 14 | 79 | +65 |
| medical-assistant-es | ? | 24 | 86 | +62 |
| financial-analyst-pt | ? | 31 | 93 | +62 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 86 | +62 |
| Tangier waterfront pier competition (en) | en | 34 | 93 | +59 |
| Coordinador last-mile en Lima (es) | es | 42 | 100 | +58 |
| newsroom-writer-es | ? | 43 | 100 | +57 |
| social-media-agency-pt | ? | 20 | 76 | +56 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| education-tutor-es | ? | 49 | 100 | +51 |
| logistics-optimizer-en | ? | 47 | 97 | +50 |
| financial-analyst-en | ? | 37 | 86 | +49 |
| tax-consultant-pt | ? | 51 | 100 | +49 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 65 | +49 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| estrategista-marketing | ? | 28 | 76 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| marketing-strategist-pt | ? | 28 | 72 | +44 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 72 | +43 |
| tax-consultant-es | ? | 51 | 93 | +42 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 86 | +42 |
| tone-domain-mismatch-es | ? | 45 | 86 | +41 |
| logistics-optimizer-pt | ? | 55 | 93 | +38 |
| Satellite ground-station CI reviewer (en) | en | 38 | 76 | +38 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| logistics-optimizer-es | ? | 63 | 100 | +37 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| cooperative-agronomist-es | ? | 66 | 100 | +34 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 75 | +34 |
| marketing-strategist-es | ? | 64 | 93 | +29 |
| nutripal-weight-loss-en | ? | 48 | 76 | +28 |
| mental-health-wellness-es | ? | 58 | 86 | +28 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 83 | +28 |
| contradiction-es | ? | 61 | 86 | +25 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 65 | +25 |
| marketing-strategist-en | ? | 43 | 65 | +22 |
| streaming-script-editor-en | ? | 64 | 86 | +22 |
| contradiction-pt | ? | 61 | 79 | +18 |
| financial-analyst-es | ? | 56 | 73 | +17 |
| contradiction-en | ? | 61 | 77 | +16 |
| threat-framing-es | ? | 79 | 93 | +14 |
| virtual-pharmacist-pt | ? | 65 | 79 | +14 |
| senior-structural-engineer-pt | ? | 77 | 90 | +13 |
| unresolved-reference-en | ? | 76 | 86 | +10 |
| threat-framing-en | ? | 83 | 90 | +7 |
| unresolved-reference-es | ? | 79 | 86 | +7 |
| Pediatric dental clinic intake (en) | en | 38 | 43 | +5 |
| tone-domain-mismatch-en | ? | 93 | 97 | +4 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-en | ? | 94 | 97 | +3 |
| role-inflation-pt | ? | 88 | 86 | -2 |
| role-inflation-es | ? | 97 | 90 | -7 |
| tone-domain-mismatch-pt | ? | 93 | 86 | -7 |
| unresolved-reference-pt | ? | 69 | 58 | -11 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| consultor-juridico | ? | 0 | 93 | +93 |
| clinical-nurse-intake-en | ? | 0 | 90 | +90 |
| legal-advisor-en | ? | 0 | 87 | +87 |
| legal-advisor-pt | ? | 0 | 86 | +86 |
| personal-investment-advisor-es | ? | 0 | 86 | +86 |
| disguised-repetitions-pt | ? | 13 | 90 | +77 |
| repeticoes-disfarcadas | ? | 13 | 90 | +77 |
| medical-assistant-pt | ? | 0 | 76 | +76 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| disguised-repetitions-en | ? | 23 | 97 | +74 |
| legal-advisor-es | ? | 14 | 86 | +72 |
| rigid-code-agent-pt | ? | 1 | 73 | +72 |
| rigid-code-agent-en | ? | 1 | 73 | +72 |
| assistente-medico | ? | 0 | 72 | +72 |
| agente-codigo-rigido | ? | 1 | 73 | +72 |
| corporate-counsel-pt | ? | 0 | 71 | +71 |
| marketing-strategist-pt | ? | 28 | 94 | +66 |
| rigid-code-agent-es | ? | 22 | 84 | +62 |
| Coordinador last-mile en Lima (es) | es | 42 | 100 | +58 |
| estrategista-marketing | ? | 28 | 84 | +56 |
| medical-assistant-es | ? | 24 | 79 | +55 |
| disguised-repetitions-es | ? | 28 | 83 | +55 |
| tone-domain-mismatch-es | ? | 45 | 100 | +55 |
| Satellite ground-station CI reviewer (en) | en | 38 | 93 | +55 |
| medical-assistant-en | ? | 0 | 54 | +54 |
| social-media-agency-pt | ? | 20 | 73 | +53 |
| education-tutor-es | ? | 49 | 100 | +51 |
| logistics-optimizer-en | ? | 47 | 97 | +50 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 76 | +47 |
| task-style-no-roleplay-en | ? | 48 | 93 | +45 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 86 | +45 |
| Tangier waterfront pier competition (en) | en | 34 | 79 | +45 |
| tax-consultant-es | ? | 51 | 93 | +42 |
| education-tutor-pt | ? | 49 | 90 | +41 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 65 | +41 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 55 | +39 |
| tax-consultant-en | ? | 48 | 86 | +38 |
| financial-analyst-en | ? | 37 | 75 | +38 |
| nutripal-weight-loss-en | ? | 48 | 86 | +38 |
| education-tutor-en | ? | 53 | 90 | +37 |
| tax-consultant-pt | ? | 51 | 86 | +35 |
| financial-analyst-pt | ? | 31 | 66 | +35 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 79 | +35 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 72 | +32 |
| financial-analyst-es | ? | 56 | 87 | +31 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 93 | +31 |
| Pediatric dental clinic intake (en) | en | 38 | 68 | +30 |
| marketing-strategist-en | ? | 43 | 72 | +29 |
| logistics-optimizer-pt | ? | 55 | 83 | +28 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| contradiction-pt | ? | 61 | 86 | +25 |
| contradiction-es | ? | 61 | 86 | +25 |
| logistics-optimizer-es | ? | 63 | 87 | +24 |
| newsroom-writer-es | ? | 43 | 66 | +23 |
| senior-structural-engineer-pt | ? | 77 | 100 | +23 |
| mental-health-wellness-es | ? | 58 | 79 | +21 |
| marketing-strategist-es | ? | 64 | 83 | +19 |
| threat-framing-en | ? | 83 | 100 | +17 |
| unresolved-reference-en | ? | 76 | 93 | +17 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 72 | +17 |
| streaming-script-editor-en | ? | 64 | 76 | +12 |
| contradiction-en | ? | 61 | 72 | +11 |
| virtual-pharmacist-pt | ? | 65 | 76 | +11 |
| threat-framing-es | ? | 79 | 86 | +7 |
| role-inflation-pt | ? | 88 | 93 | +5 |
| role-inflation-en | ? | 94 | 97 | +3 |
| role-inflation-es | ? | 97 | 100 | +3 |
| tone-domain-mismatch-pt | ? | 93 | 93 | +0 |
| unresolved-reference-es | ? | 79 | 79 | +0 |
| threat-framing-pt | ? | 90 | 86 | -4 |
| tone-domain-mismatch-en | ? | 93 | 86 | -7 |
| unresolved-reference-pt | ? | 69 | 58 | -11 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| consultor-juridico | ? | 0 | 90 | +90 |
| agente-codigo-rigido | ? | 1 | 87 | +86 |
| rigid-code-agent-pt | ? | 1 | 86 | +85 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| disguised-repetitions-pt | ? | 13 | 93 | +80 |
| legal-advisor-en | ? | 0 | 79 | +79 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| ux-research-assistant-en | ? | 25 | 100 | +75 |
| rigid-code-agent-en | ? | 1 | 75 | +74 |
| medical-assistant-pt | ? | 0 | 72 | +72 |
| assistente-medico | ? | 0 | 69 | +69 |
| estrategista-marketing | ? | 28 | 97 | +69 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 93 | +69 |
| disguised-repetitions-es | ? | 28 | 93 | +65 |
| rigid-code-agent-es | ? | 22 | 86 | +64 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 100 | +59 |
| legal-advisor-pt | ? | 0 | 58 | +58 |
| legal-advisor-es | ? | 14 | 72 | +58 |
| medical-assistant-en | ? | 0 | 58 | +58 |
| corporate-counsel-pt | ? | 0 | 58 | +58 |
| personal-investment-advisor-es | ? | 0 | 57 | +57 |
| social-media-agency-pt | ? | 20 | 76 | +56 |
| medical-assistant-es | ? | 24 | 79 | +55 |
| Satellite ground-station CI reviewer (en) | en | 38 | 87 | +49 |
| tone-domain-mismatch-es | ? | 45 | 93 | +48 |
| Coordinador last-mile en Lima (es) | es | 42 | 90 | +48 |
| education-tutor-en | ? | 53 | 100 | +47 |
| logistics-optimizer-en | ? | 47 | 93 | +46 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 62 | +46 |
| task-style-no-roleplay-en | ? | 48 | 93 | +45 |
| logistics-optimizer-pt | ? | 55 | 100 | +45 |
| education-tutor-pt | ? | 49 | 93 | +44 |
| clinical-nurse-intake-en | ? | 0 | 44 | +44 |
| tax-consultant-pt | ? | 51 | 93 | +42 |
| tax-consultant-en | ? | 48 | 90 | +42 |
| task-style-no-roleplay-pt | ? | 52 | 93 | +41 |
| marketing-strategist-en | ? | 43 | 81 | +38 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 100 | +38 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 65 | +36 |
| tax-consultant-es | ? | 51 | 86 | +35 |
| education-tutor-es | ? | 49 | 83 | +34 |
| cooperative-agronomist-es | ? | 66 | 100 | +34 |
| newsroom-writer-es | ? | 43 | 76 | +33 |
| financial-analyst-pt | ? | 31 | 63 | +32 |
| financial-analyst-en | ? | 37 | 68 | +31 |
| nutripal-weight-loss-en | ? | 48 | 79 | +31 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 86 | +31 |
| logistics-optimizer-es | ? | 63 | 93 | +30 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 72 | +28 |
| marketing-strategist-pt | ? | 28 | 55 | +27 |
| contradiction-pt | ? | 61 | 86 | +25 |
| contradiction-en | ? | 61 | 86 | +25 |
| task-style-no-roleplay-es | ? | 65 | 86 | +21 |
| virtual-pharmacist-pt | ? | 65 | 86 | +21 |
| mental-health-wellness-es | ? | 58 | 79 | +21 |
| contradiction-es | ? | 61 | 79 | +18 |
| senior-structural-engineer-pt | ? | 77 | 93 | +16 |
| financial-analyst-es | ? | 56 | 71 | +15 |
| streaming-script-editor-en | ? | 64 | 79 | +15 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 55 | +15 |
| threat-framing-en | ? | 83 | 97 | +14 |
| threat-framing-es | ? | 79 | 93 | +14 |
| threat-framing-pt | ? | 90 | 100 | +10 |
| unresolved-reference-en | ? | 76 | 86 | +10 |
| Tangier waterfront pier competition (en) | en | 34 | 44 | +10 |
| marketing-strategist-es | ? | 64 | 72 | +8 |
| unresolved-reference-es | ? | 79 | 86 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| role-inflation-es | ? | 97 | 100 | +3 |
| tone-domain-mismatch-en | ? | 93 | 93 | +0 |
| Pediatric dental clinic intake (en) | en | 38 | 37 | -1 |
| role-inflation-pt | ? | 88 | 83 | -5 |
| tone-domain-mismatch-pt | ? | 93 | 83 | -10 |
| unresolved-reference-pt | ? | 69 | 58 | -11 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 93 | +93 |
| rigid-code-agent-pt | ? | 1 | 93 | +92 |
| consultor-juridico | ? | 0 | 87 | +87 |
| rigid-code-agent-en | ? | 1 | 81 | +80 |
| medical-assistant-pt | ? | 0 | 79 | +79 |
| personal-investment-advisor-es | ? | 0 | 79 | +79 |
| disguised-repetitions-en | ? | 23 | 100 | +77 |
| clinical-nurse-intake-en | ? | 0 | 73 | +73 |
| agente-codigo-rigido | ? | 1 | 71 | +70 |
| repeticoes-disfarcadas | ? | 13 | 83 | +70 |
| marketing-strategist-pt | ? | 28 | 97 | +69 |
| estrategista-marketing | ? | 28 | 97 | +69 |
| ux-research-assistant-en | ? | 25 | 93 | +68 |
| medical-assistant-en | ? | 0 | 67 | +67 |
| financial-analyst-en | ? | 37 | 97 | +60 |
| disguised-repetitions-es | ? | 28 | 86 | +58 |
| assistente-medico | ? | 0 | 58 | +58 |
| disguised-repetitions-pt | ? | 13 | 70 | +57 |
| marketing-strategist-en | ? | 43 | 97 | +54 |
| social-media-agency-pt | ? | 20 | 74 | +54 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 83 | +54 |
| tax-consultant-en | ? | 48 | 100 | +52 |
| education-tutor-pt | ? | 49 | 100 | +51 |
| logistics-optimizer-en | ? | 47 | 97 | +50 |
| task-style-no-roleplay-en | ? | 48 | 97 | +49 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 73 | +49 |
| corporate-counsel-pt | ? | 0 | 47 | +47 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 63 | +47 |
| Satellite ground-station CI reviewer (en) | en | 38 | 84 | +46 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 90 | +46 |
| task-style-no-roleplay-pt | ? | 52 | 97 | +45 |
| legal-advisor-es | ? | 14 | 58 | +44 |
| newsroom-writer-es | ? | 43 | 86 | +43 |
| logistics-optimizer-pt | ? | 55 | 97 | +42 |
| Coordinador last-mile en Lima (es) | es | 42 | 83 | +41 |
| education-tutor-es | ? | 49 | 87 | +38 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 79 | +38 |
| logistics-optimizer-es | ? | 63 | 100 | +37 |
| financial-analyst-es | ? | 56 | 93 | +37 |
| legal-advisor-pt | ? | 0 | 36 | +36 |
| rigid-code-agent-es | ? | 22 | 57 | +35 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| mental-health-wellness-es | ? | 58 | 93 | +35 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 90 | +35 |
| education-tutor-en | ? | 53 | 86 | +33 |
| streaming-script-editor-en | ? | 64 | 97 | +33 |
| tax-consultant-es | ? | 51 | 83 | +32 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 93 | +31 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 69 | +29 |
| cooperative-agronomist-es | ? | 66 | 94 | +28 |
| contradiction-es | ? | 61 | 86 | +25 |
| virtual-pharmacist-pt | ? | 65 | 83 | +18 |
| marketing-strategist-es | ? | 64 | 81 | +17 |
| financial-analyst-pt | ? | 31 | 48 | +17 |
| nutripal-weight-loss-en | ? | 48 | 65 | +17 |
| threat-framing-en | ? | 83 | 97 | +14 |
| unresolved-reference-es | ? | 79 | 93 | +14 |
| medical-assistant-es | ? | 24 | 37 | +13 |
| senior-structural-engineer-pt | ? | 77 | 90 | +13 |
| Tangier waterfront pier competition (en) | en | 34 | 46 | +12 |
| tax-consultant-pt | ? | 51 | 62 | +11 |
| tone-domain-mismatch-es | ? | 45 | 56 | +11 |
| unresolved-reference-en | ? | 76 | 83 | +7 |
| Pediatric dental clinic intake (en) | en | 38 | 45 | +7 |
| contradiction-pt | ? | 61 | 66 | +5 |
| contradiction-en | ? | 61 | 65 | +4 |
| role-inflation-pt | ? | 88 | 90 | +2 |
| threat-framing-pt | ? | 90 | 90 | +0 |
| threat-framing-es | ? | 79 | 79 | +0 |
| role-inflation-es | ? | 97 | 97 | +0 |
| tone-domain-mismatch-en | ? | 93 | 93 | +0 |
| role-inflation-en | ? | 94 | 87 | -7 |
| unresolved-reference-pt | ? | 69 | 62 | -7 |
| tone-domain-mismatch-pt | ? | 93 | 62 | -31 |

| prompt | lang | before | after | Δ |
|---|---|---|---|---|
| legal-advisor-en | ? | 0 | 85 | +85 |
| repeticoes-disfarcadas | ? | 13 | 97 | +84 |
| clinical-nurse-intake-en | ? | 0 | 76 | +76 |
| rigid-code-agent-en | ? | 1 | 76 | +75 |
| disguised-repetitions-en | ? | 23 | 94 | +71 |
| disguised-repetitions-pt | ? | 13 | 83 | +70 |
| agente-codigo-rigido | ? | 1 | 71 | +70 |
| legal-advisor-pt | ? | 0 | 69 | +69 |
| personal-investment-advisor-es | ? | 0 | 69 | +69 |
| rigid-code-agent-pt | ? | 1 | 68 | +67 |
| disguised-repetitions-es | ? | 28 | 93 | +65 |
| ux-research-assistant-en | ? | 25 | 87 | +62 |
| medical-assistant-pt | ? | 0 | 59 | +59 |
| Mediación familiar / uniones de hecho (es) | es | 29 | 86 | +57 |
| assistente-medico | ? | 0 | 56 | +56 |
| UX lead pre-IPO Avenida Pay (es) | es | 24 | 80 | +56 |
| medical-assistant-en | ? | 0 | 55 | +55 |
| Assistente de leilão judicial de imóveis (pt) | pt | 16 | 69 | +53 |
| tone-domain-mismatch-es | ? | 45 | 97 | +52 |
| task-style-no-roleplay-en | ? | 48 | 100 | +52 |
| Satellite ground-station CI reviewer (en) | en | 38 | 90 | +52 |
| marketing-strategist-pt | ? | 28 | 78 | +50 |
| corporate-counsel-pt | ? | 0 | 50 | +50 |
| task-style-no-roleplay-pt | ? | 52 | 100 | +48 |
| Coordinador last-mile en Lima (es) | es | 42 | 90 | +48 |
| social-media-agency-pt | ? | 20 | 67 | +47 |
| logistics-optimizer-en | ? | 47 | 93 | +46 |
| tax-consultant-en | ? | 48 | 94 | +46 |
| nutripal-weight-loss-en | ? | 48 | 93 | +45 |
| education-tutor-es | ? | 49 | 90 | +41 |
| legal-advisor-es | ? | 14 | 55 | +41 |
| consultor-juridico | ? | 0 | 41 | +41 |
| Cybersec bootcamp instructional designer (en) | en | 41 | 82 | +41 |
| education-tutor-en | ? | 53 | 93 | +40 |
| newsroom-writer-es | ? | 43 | 83 | +40 |
| financial-analyst-en | ? | 37 | 75 | +38 |
| Pediatric dental clinic intake (en) | en | 38 | 75 | +37 |
| logistics-optimizer-pt | ? | 55 | 90 | +35 |
| task-style-no-roleplay-es | ? | 65 | 100 | +35 |
| tax-consultant-pt | ? | 51 | 83 | +32 |
| tax-consultant-es | ? | 51 | 83 | +32 |
| rigid-code-agent-es | ? | 22 | 54 | +32 |
| PMO de hortas hidropônicas em escolas (pt) | pt | 44 | 76 | +32 |
| education-tutor-pt | ? | 49 | 79 | +30 |
| logistics-optimizer-es | ? | 63 | 93 | +30 |
| Integridade de risers em águas ultraprofundas (pt) | pt | 62 | 90 | +28 |
| cooperative-agronomist-es | ? | 66 | 93 | +27 |
| Contadora tributária para usinas de etanol (pt) | pt | 40 | 65 | +25 |
| medical-assistant-es | ? | 24 | 48 | +24 |
| Tangier waterfront pier competition (en) | en | 34 | 58 | +24 |
| contradiction-es | ? | 61 | 83 | +22 |
| virtual-pharmacist-pt | ? | 65 | 86 | +21 |
| mental-health-wellness-es | ? | 58 | 79 | +21 |
| financial-analyst-pt | ? | 31 | 51 | +20 |
| Cierre y remediación de pasivos mineros (es) | es | 55 | 72 | +17 |
| marketing-strategist-en | ? | 43 | 58 | +15 |
| role-inflation-pt | ? | 88 | 100 | +12 |
| streaming-script-editor-en | ? | 64 | 76 | +12 |
| unresolved-reference-en | ? | 76 | 86 | +10 |
| estrategista-marketing | ? | 28 | 37 | +9 |
| contradiction-pt | ? | 61 | 69 | +8 |
| threat-framing-en | ? | 83 | 90 | +7 |
| threat-framing-es | ? | 79 | 86 | +7 |
| role-inflation-en | ? | 94 | 100 | +6 |
| senior-structural-engineer-pt | ? | 77 | 83 | +6 |
| financial-analyst-es | ? | 56 | 61 | +5 |
| contradiction-en | ? | 61 | 66 | +5 |
| marketing-strategist-es | ? | 64 | 67 | +3 |
| threat-framing-pt | ? | 90 | 93 | +3 |
| role-inflation-es | ? | 97 | 100 | +3 |
| unresolved-reference-pt | ? | 69 | 72 | +3 |
| tone-domain-mismatch-pt | ? | 93 | 93 | +0 |
| tone-domain-mismatch-en | ? | 93 | 93 | +0 |
| unresolved-reference-es | ? | 79 | 79 | +0 |
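
In every row of the tables above, the Δ column equals the after score minus the before score. The following is a minimal sketch of that arithmetic, assuming integer scores; `format_delta` is a hypothetical helper name for illustration and not part of Whet's tooling. The sample rows are copied from the tables.

```python
def format_delta(before: int, after: int) -> str:
    """Return after - before as a signed string, e.g. '+76', '+0' or '-11'."""
    delta = after - before
    # The '+' format flag prints a sign for every value, so a regression
    # renders as '-11' instead of a doubled '+-11'.
    return f"{delta:+d}"


# Sample rows taken from the tables above: (prompt, before, after).
rows = [
    ("medical-assistant-pt", 0, 76),
    ("tone-domain-mismatch-pt", 93, 93),
    ("threat-framing-pt", 90, 79),
]

for prompt, before, after in rows:
    print(f"| {prompt} | {before} | {after} | {format_delta(before, after)} |")
```

A negative Δ means the rewritten prompt scored lower than the original, as in the last sample row.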