
MMLU

Massive Multitask Language Understanding

Quick Answer: A benchmark that tests AI models across 57 academic subjects including math, history, law, medicine, and computer science.
MMLU (Massive Multitask Language Understanding) evaluates models with multiple-choice questions across 57 academic subjects, including math, history, law, medicine, and computer science, at difficulty levels ranging from elementary to professional. That breadth has made it the most widely cited general-knowledge benchmark for LLMs.

Example

An MMLU question from professional medicine: 'A 45-year-old man presents with chest pain radiating to the left arm. Which of the following is the most appropriate initial diagnostic test? (A) CT scan (B) ECG (C) Chest X-ray (D) Blood culture.' The model must select the correct answer across thousands of such questions.
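A minimal sketch of how an item like this might be formatted and graded, assuming a placeholder ask_model function that returns the model's raw text reply (the function name and prompt template are illustrative, not MMLU's official harness):

```python
# Minimal sketch: format one MMLU-style item and grade the model's reply.
# `ask_model` is a placeholder for whatever API call you use; the prompt
# template below is illustrative, not the official evaluation harness.

def format_question(question: str, choices: list[str]) -> str:
    letters = "ABCD"
    lines = [question]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def grade_item(ask_model, question: str, choices: list[str], answer: str) -> bool:
    prompt = format_question(question, choices)
    reply = ask_model(prompt).strip().upper()
    return reply.startswith(answer)  # answer is "A", "B", "C", or "D"

# Usage with the professional-medicine item above:
item = {
    "question": ("A 45-year-old man presents with chest pain radiating to the "
                 "left arm. Which of the following is the most appropriate "
                 "initial diagnostic test?"),
    "choices": ["CT scan", "ECG", "Chest X-ray", "Blood culture"],
    "answer": "B",
}
# correct = grade_item(ask_model, item["question"], item["choices"], item["answer"])
```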

Why It Matters

MMLU is the benchmark that headlines most model launches. GPT-4 scored 86.4%, Claude 3.5 Sonnet hit 88.7%, and Gemini Ultra reached 90.0%. These numbers drive enterprise adoption decisions. When evaluating models for a project, MMLU scores provide the broadest capability comparison.

How It Works

MMLU (Massive Multitask Language Understanding) tests a model's knowledge across 57 academic subjects, from elementary mathematics to professional medicine and law. It contains 15,908 multiple-choice questions spanning STEM, humanities, social sciences, and professional domains.
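As a sketch, per-subject and overall accuracy can be tallied from graded items; the record layout below (a subject field plus a boolean correct flag) is an assumption for illustration, with grading of individual items left to a harness like the one sketched above:

```python
from collections import defaultdict

# Sketch: aggregate per-subject and overall accuracy from graded items.
# Each record is assumed to look like {"subject": ..., "correct": bool};
# how each item gets graded is up to your evaluation harness.

def summarize(results: list[dict]) -> dict:
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for record in results:
        stats = per_subject[record["subject"]]
        stats[0] += int(record["correct"])
        stats[1] += 1
    report = {subject: c / n for subject, (c, n) in per_subject.items()}
    report["overall"] = (sum(c for c, _ in per_subject.values())
                         / sum(n for _, n in per_subject.values()))
    return report

# Example: summarize([{"subject": "professional_medicine", "correct": True}, ...])
```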

MMLU became the de facto standard for measuring broad AI knowledge because it covers such diverse domains. A model scoring around 90% approaches the accuracy the benchmark's authors estimated for human domain experts. The strongest models now score at or near 90%, with GPT-4o and Claude 3.5 Sonnet in the 88-90% range.

However, MMLU has known issues: some questions have multiple valid answers, some are ambiguous, and performance on MMLU doesn't necessarily correlate with performance on practical tasks. MMLU-Pro addresses some of these issues with harder questions and 10 answer choices instead of 4. ARC and HellaSwag complement MMLU for reasoning and commonsense evaluation.

Common Mistakes

Common mistake: Using MMLU scores to compare models within a few percentage points

Small MMLU differences (1-2 percentage points) fall within the noise produced by prompt formatting, few-shot example selection, and answer-extraction details. A model scoring 88% vs. 87% is not meaningfully different on MMLU. Only large gaps (5+ points) indicate clear capability differences. A rough sketch of the sampling-error component alone appears below.
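The sketch uses a normal-approximation confidence interval for accuracy; the ~14,000-question test-split size is an assumption for illustration, and prompt-format and few-shot sensitivity typically add further run-to-run variation on top of this:

```python
import math

# Rough sketch: 95% confidence interval on a benchmark accuracy score,
# using the normal approximation for a binomial proportion. This captures
# sampling error only; prompt formatting and few-shot choices add more noise.

def accuracy_ci(accuracy: float, num_questions: int, z: float = 1.96):
    stderr = math.sqrt(accuracy * (1.0 - accuracy) / num_questions)
    return accuracy - z * stderr, accuracy + z * stderr

low, high = accuracy_ci(0.88, 14_000)  # assumed size of the MMLU test split
print(f"88.0% score -> 95% CI roughly {low:.1%} to {high:.1%}")
# Prints roughly 87.5% to 88.5%, so ~1-point gaps sit near the noise floor
# even before accounting for prompt sensitivity.
```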

Common mistake: Assuming high MMLU scores mean a model is good at everything

MMLU tests academic knowledge, not practical skills like writing quality, code debugging, or multi-turn conversation. Supplement with task-specific evaluations.

Career Relevance

MMLU literacy is important for anyone evaluating AI models or reading AI research papers. It's the most commonly cited benchmark in model comparisons and product announcements. Understanding what it does and doesn't measure prevents poor model selection decisions.
