Precision and Recall
Why It Matters
Accuracy alone is misleading when classes are imbalanced. A cancer screening test that says 'no cancer' for everyone gets 99% accuracy if only 1% of patients have cancer, but it's completely useless. Precision and recall give you the full picture of how your model handles the class you care about.
How It Works
Precision = True Positives / (True Positives + False Positives). It answers: 'When the model says yes, how often is it right?' High precision means few false alarms.
Recall = True Positives / (True Positives + False Negatives). It answers: 'Of all the actual positives, how many did the model catch?' High recall means few missed cases.
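The two formulas translate directly into code. A minimal sketch, using made-up confusion-matrix counts for illustration:

```python
def precision(tp, fp):
    # "When the model says yes, how often is it right?"
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # "Of all the actual positives, how many did the model catch?"
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 80 true positives, 20 false alarms, 40 missed cases
p = precision(80, 20)  # 0.8  -> few false alarms
r = recall(80, 40)     # ~0.667 -> a third of real positives were missed
```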
The precision-recall tradeoff is fundamental: making the model more selective (higher threshold) increases precision but decreases recall. Making it more inclusive (lower threshold) increases recall but decreases precision. The right balance depends on your application.
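The tradeoff is easy to see by sweeping a decision threshold over a toy set of (score, label) pairs — the scores below are invented for illustration, not from any real model:

```python
# Toy classifier outputs: (predicted probability, true label)
scores = [(0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.60, 0),
          (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]

def precision_recall_at(threshold, scores):
    # Everything at or above the threshold counts as a positive prediction
    tp = sum(1 for s, y in scores if s >= threshold and y == 1)
    fp = sum(1 for s, y in scores if s >= threshold and y == 0)
    fn = sum(1 for s, y in scores if s < threshold and y == 1)
    prec = tp / (tp + fp) if (tp + fp) else 1.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return prec, rec

for t in (0.25, 0.50, 0.75):
    p, r = precision_recall_at(t, scores)
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold from 0.25 to 0.75 here moves precision up (0.625 → 0.75) while recall falls (1.0 → 0.6) — the same model, traded along the curve.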
High-precision use cases: spam filtering (users hate losing real emails), content recommendation (bad recommendations erode trust). High-recall use cases: cancer screening (missing a case is far worse than a false alarm), fraud detection (better to investigate false positives than miss real fraud).
F1 score is the harmonic mean of precision and recall, providing a single number that balances both. F-beta score lets you weight one more heavily: F2 weights recall higher, F0.5 weights precision higher.
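Both F1 and F-beta come from one formula; F1 is just the beta = 1 case. A sketch:

```python
def f_beta(precision, recall, beta=1.0):
    # beta > 1 weights recall more heavily; beta < 1 weights precision more
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.5
f1  = f_beta(p, r)             # harmonic mean, ~0.615
f2  = f_beta(p, r, beta=2.0)   # recall-weighted: pulled toward 0.5
f05 = f_beta(p, r, beta=0.5)   # precision-weighted: pulled toward 0.8
```

Note that F2 < F1 < F0.5 here, because recall (0.5) is the weaker of the two numbers.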
Precision-recall curves plot the tradeoff across all possible thresholds and are more informative than ROC curves for imbalanced datasets. Average precision (area under the PR curve) summarizes overall performance.
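Average precision can be computed directly from a ranked list: walk down the predictions in score order and add precision-at-rank weighted by each step up in recall. A sketch on invented scores (in practice you would use a library routine such as scikit-learn's `average_precision_score`):

```python
def average_precision(scores):
    # scores: list of (model_score, true_label); AP = sum over hits of
    # precision-at-that-rank * (1 / total_positives), i.e. area under the PR curve
    ranked = sorted(scores, key=lambda x: -x[0])
    total_pos = sum(y for _, y in ranked)
    tp = fp = 0
    ap = 0.0
    for _, y in ranked:
        if y == 1:
            tp += 1
            ap += (tp / (tp + fp)) * (1 / total_pos)
        else:
            fp += 1
    return ap

toy = [(0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.60, 0),
       (0.55, 1), (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0)]
ap = average_precision(toy)  # ~0.808 on this toy ranking
```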
For multi-class problems, precision and recall are computed per class and can be aggregated as macro-average (unweighted mean across classes) or micro-average (computed globally across all predictions).
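The two aggregation schemes can give quite different numbers on the same predictions. A sketch with made-up labels:

```python
def per_class_precision(y_true, y_pred, cls):
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    return tp / predicted if predicted else 0.0

y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]

classes = sorted(set(y_true))
# Macro: unweighted mean of per-class precision -> small classes count equally
macro = sum(per_class_precision(y_true, y_pred, c) for c in classes) / len(classes)
# Micro: computed globally; for single-label multi-class this equals accuracy
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here macro precision is 2/3 (the two small classes drag it down) while micro is 5/7, because micro-averaging lets the large class dominate.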
Precision and recall are increasingly relevant in LLM and prompt engineering contexts, not just traditional ML classification.
In RAG systems, retrieval precision measures what percentage of retrieved documents are actually relevant to the query. Retrieval recall measures what percentage of all relevant documents in your corpus were successfully retrieved. A RAG system with high precision but low recall gives accurate but incomplete answers. A system with high recall but low precision wastes context window tokens on irrelevant documents, which can actually degrade generation quality.
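Retrieval precision and recall reduce to set overlap between what the retriever returned and what was actually relevant. A minimal sketch — the document IDs are hypothetical placeholders:

```python
def retrieval_metrics(retrieved_ids, relevant_ids):
    # retrieved_ids: doc ids the retriever returned for one query
    # relevant_ids:  all doc ids in the corpus that are relevant to that query
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 docs retrieved, 3 relevant docs exist, 2 overlap
p, r = retrieval_metrics(["d1", "d2", "d3", "d4"], ["d2", "d4", "d9"])
# p = 0.5  -> half the context window is spent on irrelevant docs
# r = 2/3  -> one relevant document was never retrieved
```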
For LLM-based classification (using prompts to categorize text), precision and recall apply directly. If you prompt Claude to identify customer complaints in support tickets, precision tells you how many of the flagged tickets were actual complaints (vs. false alarms), and recall tells you how many real complaints the model caught. Adjusting your prompt instructions (more restrictive vs. more inclusive criteria) directly trades precision for recall.
In content moderation, the precision-recall tradeoff has real business consequences. A content filter with 99% recall catches nearly all harmful content but may block legitimate posts (low precision), frustrating users. A filter with 99% precision rarely blocks legitimate content but may miss harmful posts (low recall), creating safety risks. Most platforms prioritize recall for severe violations (safety-critical) and precision for borderline content (user experience).
Practical tip for prompt engineers: when building classification or detection prompts, always measure both precision and recall on a test set before deploying. A prompt that feels accurate on a few examples may have terrible recall (missing half the cases) or terrible precision (flagging everything). The only way to know is systematic evaluation with labeled data.
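A minimal evaluation harness for that systematic check might look like the following. The ticket labels and model flags below are made-up placeholders standing in for a real labeled test set and real LLM outputs:

```python
def score_detector(ground_truth, model_flags):
    # ground_truth / model_flags: parallel lists of booleans over the test set
    tp = sum(g and m for g, m in zip(ground_truth, model_flags))
    fp = sum(m and not g for g, m in zip(ground_truth, model_flags))
    fn = sum(g and not m for g, m in zip(ground_truth, model_flags))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical run: 10 tickets, 4 real complaints, the prompt flagged 5
truth = [True, True, True, True, False, False, False, False, False, False]
flags = [True, True, False, False, True, True, True, False, False, False]
p, r = score_detector(truth, flags)
# p = 0.4, r = 0.5 -- a prompt that "felt accurate" on spot checks,
# exposed as flagging mostly noise AND missing half the real complaints
```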
Common Mistakes
Common mistake: Using accuracy as the primary metric for imbalanced classification problems
Use precision, recall, and F1 score. On imbalanced datasets, a model that predicts only the majority class gets high accuracy but zero recall for the minority class.
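The failure mode is easy to reproduce on synthetic data mirroring the screening example above:

```python
# A 1%-positive dataset and a "model" that always predicts the majority class
labels = [1] + [0] * 99
preds = [0] * 100

accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)  # 0.99
tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
fn = sum(p == 0 and t == 1 for p, t in zip(preds, labels))
minority_recall = tp / (tp + fn)  # 0.0 -- the one real case is missed
```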
Common mistake: Optimizing for F1 without considering which type of error is more costly for your application
Choose your primary metric based on business impact. If false negatives are dangerous (medical, security), prioritize recall. If false positives are costly (spam, content moderation), prioritize precision.
Common mistake: Evaluating RAG retrieval only by checking if the answer is correct, without measuring retrieval precision and recall separately
Measure retrieval quality independently from generation quality. A correct final answer might mask poor retrieval that will fail on harder queries. Check which documents were retrieved and whether the right ones were included.
Career Relevance
Precision and recall are interview staples and daily-use metrics for data scientists, ML engineers, and anyone evaluating AI systems. Product managers and prompt engineers working with classifiers need to understand these metrics to set appropriate thresholds and make deployment decisions. In the LLM era, these metrics are essential for evaluating RAG retrieval quality, LLM-based classification prompts, and content moderation systems.