Core Concepts

Precision and Recall

Quick Answer: Two complementary metrics for evaluating classification models.
Precision and recall are two complementary metrics for evaluating classification models. Precision measures how many of the model's positive predictions were actually correct (quality of positives). Recall measures how many of the actual positives the model successfully found (completeness of detection).

Example

A spam filter flags 100 emails as spam. 90 actually are spam and 10 are legitimate (false positives), so precision = 90/100 = 90%. The inbox contained 120 spam emails in total, and the filter caught 90 of them, so recall = 90/120 = 75%. You caught most spam but missed 30, and 10 good emails got trashed.

Why It Matters

Accuracy alone is misleading when classes are imbalanced. A cancer screening test that says 'no cancer' for everyone gets 99% accuracy if only 1% of patients have cancer, but it's completely useless. Precision and recall give you the full picture of how your model handles the class you care about.

How It Works

Precision = True Positives / (True Positives + False Positives). It answers: 'When the model says yes, how often is it right?' High precision means few false alarms.

Recall = True Positives / (True Positives + False Negatives). It answers: 'Of all the actual positives, how many did the model catch?' High recall means few missed cases.
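The two formulas above can be written as a couple of one-line helpers and checked against the spam-filter example (90 true positives, 10 false positives, 30 missed spam emails):

```python
def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives the model caught."""
    return tp / (tp + fn)

# The spam-filter example: 90 correctly flagged, 10 false alarms,
# 30 spam emails missed.
print(precision(tp=90, fp=10))  # 0.9
print(recall(tp=90, fn=30))     # 0.75
```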

The precision-recall tradeoff is fundamental: making the model more selective (higher threshold) increases precision but decreases recall. Making it more inclusive (lower threshold) increases recall but decreases precision. The right balance depends on your application.
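A small sketch makes the tradeoff concrete. The scores and labels below are invented for illustration; moving the decision threshold up or down shifts the balance between the two metrics:

```python
# Invented prediction scores (model confidence) and true labels.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1,    1,   1,   0,   1,   0,   0,   0]

def precision_recall_at(threshold):
    """Compute (precision, recall) when predicting positive
    for every score at or above the threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    return tp / (tp + fp), tp / (tp + fn)

# Strict threshold: no false alarms, but half the positives missed.
print(precision_recall_at(0.85))  # (1.0, 0.5)
# Loose threshold: every positive caught, but precision drops.
print(precision_recall_at(0.35))  # (0.666..., 1.0)
```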

High-precision use cases: spam filtering (users hate losing real emails), content recommendation (bad recommendations erode trust). High-recall use cases: cancer screening (missing a case is far worse than a false alarm), fraud detection (better to investigate false positives than miss real fraud).

F1 score is the harmonic mean of precision and recall, providing a single number that balances both. F-beta score lets you weight one more heavily: F2 weights recall higher, F0.5 weights precision higher.
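Both F1 and F-beta come from one formula, with beta controlling the weighting. A minimal sketch:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors
    precision. beta = 1 gives the F1 score (harmonic mean)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Using the spam-filter numbers: precision 0.9, recall 0.75.
p, r = 0.9, 0.75
print(f_beta(p, r))            # F1  ≈ 0.818
print(f_beta(p, r, beta=2))    # F2 pulls toward recall,   ≈ 0.776
print(f_beta(p, r, beta=0.5))  # F0.5 pulls toward precision, ≈ 0.865
```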

Precision-recall curves plot the tradeoff across all possible thresholds and are more informative than ROC curves for imbalanced datasets. Average precision (area under the PR curve) summarizes overall performance.
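Average precision can be computed by walking the predictions in descending score order and accumulating precision weighted by each step in recall. The sketch below follows the stepwise definition scikit-learn uses for `average_precision_score`, simplified to assume distinct scores (no tie handling):

```python
def average_precision(scores, labels):
    """Stepwise area under the PR curve: sum of precision at each
    true positive, weighted by the resulting increase in recall.
    Assumes all scores are distinct (no tie handling)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, y in pairs:
        tp += y
        fp += 1 - y
        recall = tp / total_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# Toy data: the second-ranked prediction is a false positive.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1]))  # ≈ 0.806
```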

For multi-class problems, precision and recall are computed per class and can be aggregated as macro-average (unweighted mean across classes) or micro-average (computed globally across all predictions).
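The difference between the two averaging schemes shows up clearly in code. A sketch for precision (recall is analogous), using invented labels:

```python
def class_counts(y_true, y_pred, cls):
    """True positives, false positives, and false negatives
    for one class, treated one-vs-rest."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def macro_precision(y_true, y_pred, classes):
    # Unweighted mean of per-class precision: every class counts
    # equally, regardless of its frequency.
    vals = []
    for c in classes:
        tp, fp, _ = class_counts(y_true, y_pred, c)
        vals.append(tp / (tp + fp) if tp + fp else 0.0)
    return sum(vals) / len(classes)

def micro_precision(y_true, y_pred, classes):
    # Pool TP and FP across all classes before dividing: frequent
    # classes dominate.
    tp = fp = 0
    for c in classes:
        t, f, _ = class_counts(y_true, y_pred, c)
        tp, fp = tp + t, fp + f
    return tp / (tp + fp)

y_true = ["a", "a", "b", "c"]
y_pred = ["a", "b", "b", "b"]
print(macro_precision(y_true, y_pred, ["a", "b", "c"]))  # ≈ 0.444
print(micro_precision(y_true, y_pred, ["a", "b", "c"]))  # 0.5
```

Note that for single-label multi-class problems, micro-averaged precision equals overall accuracy, since every false positive for one class is a false negative for another.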

Common Mistakes

Common mistake: Using accuracy as the primary metric for imbalanced classification problems

Use precision, recall, and F1 score. On imbalanced datasets, a model that predicts only the majority class gets high accuracy but zero recall for the minority class.

Common mistake: Optimizing for F1 without considering which type of error is more costly for your application

Choose your primary metric based on business impact. If false negatives are dangerous (medical, security), prioritize recall. If false positives are costly (spam, content moderation), prioritize precision.

Career Relevance

Precision and recall are interview staples and daily-use metrics for data scientists, ML engineers, and anyone evaluating AI systems. Product managers and prompt engineers working with classifiers need to understand these metrics to set appropriate thresholds and make deployment decisions.
