MMLU
Massive Multitask Language Understanding
Why It Matters
MMLU is the benchmark that headlines most model launches. GPT-4 scored 86.4%, Claude 3.5 Sonnet hit 88.7%, and Gemini Ultra reached 90.0% (the last using a chain-of-thought setup, so the figures are not strictly comparable). These numbers drive enterprise adoption decisions. When evaluating models for a project, MMLU scores provide the broadest single-number capability comparison.
How It Works
MMLU (Massive Multitask Language Understanding) tests a model's knowledge across 57 academic subjects, from elementary mathematics to professional medicine and law. It contains 15,908 four-choice questions spanning STEM, humanities, social sciences, and professional domains. The standard protocol is 5-shot: five worked examples from a small dev split precede each test question, and the model is scored on whether it produces the correct answer letter.
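To make the format concrete, here is a minimal sketch of how a 5-shot MMLU prompt is typically assembled. It assumes the Hugging Face `cais/mmlu` mirror of the dataset, whose rows carry `question`, `choices`, and an integer `answer` index (the field names and subject config are assumptions about that mirror; harnesses also vary in the exact header and layout):

```python
# Minimal sketch of the 5-shot MMLU prompt format, assuming the
# Hugging Face "cais/mmlu" mirror (fields: question, choices, answer).
from datasets import load_dataset

LETTERS = "ABCD"

def format_question(row, include_answer=False):
    """Render one dataset row in the conventional lettered multiple-choice layout."""
    lines = [row["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(row["choices"])]
    answer = f" {LETTERS[row['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)

def five_shot_prompt(subject, dev_rows, test_row):
    """Prepend five solved dev-split examples, per the standard 5-shot protocol."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    shots = "\n\n".join(format_question(r, include_answer=True) for r in dev_rows[:5])
    return header + shots + "\n\n" + format_question(test_row)

mmlu = load_dataset("cais/mmlu", "anatomy")  # one of the 57 subject configs
print(five_shot_prompt("anatomy", list(mmlu["dev"]), mmlu["test"][0]))
```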
MMLU became the de facto standard for measuring broad AI knowledge because it covers such diverse domains. A score around 90% puts a model at roughly the level the benchmark's authors estimated for human domain experts (about 89.8%); unspecialized human test-takers scored near 34.5% in the original paper. Top models now sit at or near that expert line, with GPT-4o and Claude 3.5 Sonnet in the 88-90% range.
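The headline score itself is nothing more than exact-match accuracy over the predicted answer letter. A short sketch (the predictions here are made up; in practice they would come from running prompts like the one above through the model under test):

```python
# MMLU's headline number is plain exact-match accuracy over answer letters.
LETTERS = "ABCD"

def mmlu_accuracy(predictions, gold):
    """Exact-match accuracy: predicted letter vs. the dataset's answer index."""
    assert len(predictions) == len(gold)
    hits = sum(p.strip().upper() == LETTERS[g] for p, g in zip(predictions, gold))
    return hits / len(gold)

# Toy illustration with made-up letters, not real model output.
print(mmlu_accuracy(["A", "C", "B", "D"], [0, 2, 1, 0]))  # 0.75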
However, MMLU has known issues: some questions have multiple valid answers, some are ambiguous, and performance on MMLU doesn't necessarily correlate with performance on practical tasks. MMLU-Pro addresses some of these issues with harder questions and 10 answer choices instead of 4. ARC and HellaSwag complement MMLU for reasoning and commonsense evaluation.
Common Mistakes
Common mistake: Using MMLU scores to compare models within a few percentage points
Small MMLU differences (1-2 points) are usually within the noise introduced by the evaluation setup: prompt template, few-shot example selection, and answer extraction can each shift a score by a point or more. A model scoring 88% versus one scoring 87% is not meaningfully different unless both were run under the same harness. Only large gaps (5+ points) indicate clear capability differences. The sketch below shows why the spread comes from setup rather than sampling.
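On the full ~14,000-question test split, pure sampling error is small: the binomial 95% confidence interval around a score near 88% is only about half a point, so the rest of the run-to-run spread reflects prompting and extraction choices. A quick calculation:

```python
# Sampling error alone on MMLU's full test split is about +/- 0.5 points,
# so 1-2 point gaps between differently-run evaluations are mostly setup noise.
import math

def binomial_ci95_halfwidth(accuracy, n):
    """Half-width of the normal-approximation 95% CI for accuracy on n items."""
    return 1.96 * math.sqrt(accuracy * (1 - accuracy) / n)

n_test = 14_042  # test-split size in the Hugging Face cais/mmlu mirror
print(f"+/- {binomial_ci95_halfwidth(0.88, n_test):.2%}")  # about +/- 0.54%
```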
Common mistake: Assuming high MMLU scores mean a model is good at everything
MMLU tests academic knowledge, not practical skills like writing quality, code debugging, or multi-turn conversation. Supplement with task-specific evaluations.
Career Relevance
MMLU literacy is important for anyone evaluating AI models or reading AI research papers. It's the most commonly cited benchmark in model comparisons and product announcements. Understanding what it does and doesn't measure prevents poor model selection decisions.