Cosine Similarity
Why It Matters
Cosine similarity is the backbone of semantic search, RAG systems, and recommendation engines. Every time you build a system that retrieves relevant documents or finds similar content, you're relying on cosine similarity under the hood.
How It Works
Cosine similarity measures the cosine of the angle between two vectors, producing a score from -1 (opposite) to 1 (identical), with 0 meaning the vectors are orthogonal and share no directional relationship. Unlike Euclidean distance, cosine similarity ignores vector magnitude and focuses purely on direction, which makes it ideal for comparing embeddings of different-length texts.
The formula is straightforward: divide the dot product of two vectors by the product of their magnitudes. In practice, you rarely compute this yourself. Libraries like NumPy, scikit-learn, and vector databases handle the math. What matters is understanding what the scores mean and how to set thresholds.
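The formula above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up 3-dimensional vectors, not production code; real embeddings have hundreds or thousands of dimensions, and libraries like scikit-learn provide batched, optimized versions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, larger magnitude
c = np.array([-1.0, -2.0, -3.0]) # opposite direction to a

print(cosine_similarity(a, b))  # ≈ 1.0: magnitude is ignored, only direction counts
print(cosine_similarity(a, c))  # ≈ -1.0: opposite vectors
```

Note that `a` and `b` score as identical even though `b` is twice as long, which is exactly the magnitude-invariance described above.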
Setting the right similarity threshold is more art than science. A 0.85 cosine similarity between two sentence embeddings usually indicates strong semantic similarity, but the exact threshold depends on your embedding model, domain, and use case. For search, you might accept anything above 0.7. For deduplication, you'd want 0.95 or higher. Always calibrate thresholds on your specific data rather than using generic cutoffs.
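As a concrete illustration of how different use cases call for different cutoffs, here is a small sketch that filters one set of hypothetical retrieval scores with both a search threshold and a deduplication threshold. The scores and document names are invented for the example; the threshold values mirror the rough guidance above and would need calibration in practice.

```python
# Hypothetical (score, document) pairs from a retrieval step.
candidates = [(0.91, "doc-a"), (0.74, "doc-b"), (0.63, "doc-c"), (0.97, "doc-d")]

SEARCH_THRESHOLD = 0.70  # looser cutoff: recall matters more for search
DEDUP_THRESHOLD = 0.95   # stricter cutoff: only flag near-duplicates

search_hits = [doc for score, doc in candidates if score >= SEARCH_THRESHOLD]
duplicates = [doc for score, doc in candidates if score >= DEDUP_THRESHOLD]

print(search_hits)  # ['doc-a', 'doc-b', 'doc-d']
print(duplicates)   # ['doc-d']
```

The same scores yield very different result sets depending on the cutoff, which is why a threshold tuned for search will almost always be wrong for deduplication.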
Common Mistakes
Using a fixed similarity threshold without calibrating on your data. Test different thresholds on labeled examples from your domain; what counts as "similar" varies significantly by embedding model and content type.

Comparing embeddings from different models. Embeddings from different models exist in different vector spaces, so similarity scores between them are meaningless. Only compare embeddings generated by the same model.

Assuming high cosine similarity always means semantic equivalence. Two sentences can have high similarity scores while meaning different things. Always validate with human review, especially for high-stakes applications.
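The calibration advice above can be sketched as a simple threshold sweep over labeled pairs. The scores and labels here are invented for illustration, and plain accuracy is used as the selection metric for brevity; on imbalanced data you would typically optimize precision/recall or F1 instead.

```python
# Hypothetical labeled pairs: (cosine score, human judgment of similarity).
labeled = [(0.92, True), (0.88, True), (0.81, False), (0.76, True),
           (0.70, False), (0.66, False), (0.58, False), (0.95, True)]

def accuracy(threshold: float) -> float:
    """Fraction of pairs where (score >= threshold) matches the human label."""
    correct = sum((score >= threshold) == label for score, label in labeled)
    return correct / len(labeled)

# Sweep candidate thresholds from 0.50 to 0.99 and keep the best-scoring one.
candidates = [round(t * 0.01, 2) for t in range(50, 100)]
best = max(candidates, key=accuracy)
print(best, accuracy(best))
```

Even in this toy data the optimum is not a generic cutoff like 0.7 or 0.85, which is the point: the right threshold falls out of your labeled examples, not a rule of thumb.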
Career Relevance
Cosine similarity comes up constantly in AI engineering interviews and practical work. Any role involving RAG, search, or recommendation systems requires a solid understanding of similarity metrics and how to tune them.