Text Similarity Measures
Text similarity measures quantify how alike two pieces of text are. These metrics are fundamental to information retrieval, document clustering, plagiarism detection, and semantic search.
- Cosine Similarity
- Jaccard Similarity
- Euclidean Distance
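For reference, the three measures named above are commonly defined as follows. The notation is an assumption of this sketch rather than something the text fixes: A and B are term-weight vectors with components a_i and b_i over a shared vocabulary of n terms, and S and T are the sets of distinct tokens in each text.

```latex
\begin{align*}
\text{cosine}(A, B) &= \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} \\
J(S, T) &= \frac{\lvert S \cap T \rvert}{\lvert S \cup T \rvert} \\
d(A, B) &= \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
\end{align*}
```

Cosine and Jaccard are similarities (larger means more alike), while the Euclidean measure is a distance (smaller means more alike).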
Cosine Similarity Implementation
```python
from collections import Counter

def cosine_similarity(text1, text2):
    """Cosine similarity between two texts using raw term counts."""
    # Count each word once instead of calling list.count per vocabulary term
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Build vocabulary
    vocab = sorted(set(counts1) | set(counts2))

    # Create vectors (a Counter returns 0 for words it has not seen)
    vec1 = [counts1[w] for w in vocab]
    vec2 = [counts2[w] for w in vocab]

    # Compute cosine
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    magnitude1 = sum(a ** 2 for a in vec1) ** 0.5
    magnitude2 = sum(b ** 2 for b in vec2) ** 0.5

    # An empty text yields a zero vector; treat its similarity as 0
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)

text1 = "the cat sat on the mat"
text2 = "the cat sat on the rug"
print(f"Similarity: {cosine_similarity(text1, text2):.3f}")
# Similarity: 0.875  (dot product 7, both magnitudes sqrt(8))
```
Jaccard Similarity Implementation
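```python
def jaccard_similarity(text1, text2):
    """Jaccard index of the two texts' word sets."""
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())
    intersection = set1 & set2
    union = set1 | set2
    # Two empty texts give an empty union; return 0.0 here instead of dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)

text1 = "natural language processing"
text2 = "language processing machine"
print(f"Jaccard: {jaccard_similarity(text1, text2):.3f}")
# Jaccard: 0.500
```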
Using scikit-learn
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

documents = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog played in the park"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Cosine similarity matrix
cos_sim = cosine_similarity(tfidf_matrix)
print("Cosine Similarity:")
print(cos_sim.round(3))

# Euclidean distance matrix
euc_dist = euclidean_distances(tfidf_matrix)
print("\nEuclidean Distance:")
print(euc_dist.round(3))
```
Comparison of Similarity Metrics
| Metric | Range | Insensitive to Document Length | Use Case |
|---|---|---|---|
| Cosine | -1 to 1 (0 to 1 for non-negative vectors such as term counts or TF-IDF) | Yes (normalized) | Document similarity, search |
| Jaccard | 0 to 1 | Partially | Set-based comparison |
| Euclidean | 0 to infinity | No | Clustering, nearest neighbor |
| Manhattan | 0 to infinity | No | Sparse data |
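The length-handling column above can be checked with a small experiment: compare a text against a copy of itself repeated twice. The sketch below uses raw term counts and only the standard library; the helper names (`term_vectors`, `cosine`, `euclidean`, `manhattan`) are illustrative and not taken from any library.

```python
from collections import Counter
from math import sqrt

def term_vectors(text_a, text_b):
    """Return aligned term-count vectors over the shared vocabulary."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return [counts_a[w] for w in vocab], [counts_b[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

short = "the cat sat on the mat"
doubled = short + " " + short  # same wording, twice as long

u, v = term_vectors(short, doubled)
print(f"cosine:    {cosine(u, v):.3f}")     # 1.000
print(f"euclidean: {euclidean(u, v):.3f}")  # 2.828
print(f"manhattan: {manhattan(u, v)}")      # 6
```

Doubling the text scales every count by two, so the direction of the vector is unchanged and cosine similarity stays at 1, while both distance measures grow with the extra length.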
Similarity vs Distance
| Similarity Metric | Relationship to Distance |
|---|---|
| Cosine similarity | distance = 1 - cosine_similarity |
| Jaccard similarity | distance = 1 - Jaccard_index |
| Gaussian kernel | similarity = exp(-gamma * distance²) |
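The conversions in the table can be written as small helper functions. This is a minimal sketch; the function names are illustrative. The last helper, `euclidean_from_cosine`, is not in the table: it relies on the standard identity d² = 2(1 - cos), which holds only when both vectors have unit length (for example, L2-normalized TF-IDF rows).

```python
from math import exp, sqrt

def cosine_distance(cos_sim):
    """Cosine distance from cosine similarity."""
    return 1.0 - cos_sim

def jaccard_distance(jaccard_index):
    """Jaccard distance from the Jaccard index."""
    return 1.0 - jaccard_index

def gaussian_similarity(distance, gamma=1.0):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return exp(-gamma * distance ** 2)

def euclidean_from_cosine(cos_sim):
    """Euclidean distance between two unit-length vectors (assumption: both are L2-normalized)."""
    return sqrt(2.0 * (1.0 - cos_sim))

print(cosine_distance(0.75))                          # 0.25
print(jaccard_distance(0.5))                          # 0.5
print(gaussian_similarity(0.0))                       # 1.0
print(round(gaussian_similarity(2.0, gamma=0.5), 3))  # 0.135
print(euclidean_from_cosine(0.5))                     # 1.0
```

A larger `gamma` makes the Gaussian kernel fall off faster, so only very close points keep a similarity near 1.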
Practical Example: Document Retrieval
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Document collection
corpus = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text and speech",
    "Computer vision analyzes images and videos",
    "Reinforcement learning trains agents through rewards"
]

# Query
query = "neural networks for deep learning"

# Vectorize: fit the vocabulary and IDF weights on the corpus only, then
# project the query into the same space (query terms unseen in the corpus are ignored)
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Compute similarities between the query and every document
similarities = cosine_similarity(query_vector, doc_vectors)[0]

# Rank results from most to least similar
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
for idx, score in ranked:
    print(f"{score:.3f} - {corpus[idx]}")
```
Handling Large-Scale Similarity
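For large document collections, exact similarity computation is expensive: comparing every pair of documents scales quadratically with the size of the collection. Approximate methods and libraries include:
- Locality-Sensitive Hashing (LSH): hashes similar items into the same buckets with high probability, so only items that share a bucket need to be compared
- MinHash: estimates Jaccard similarity from compact signatures instead of full sets
- FAISS: a similarity search library for dense vectors from Facebook AI Research (now Meta AI)
- Annoy: Approximate Nearest Neighbors Oh Yeah, Spotify's library for approximate nearest-neighbor search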
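To make the MinHash idea concrete, here is a minimal standard-library sketch. It assumes non-empty token sets and simulates a family of hash functions by salting `hashlib.blake2b` with a seed; the helper names (`seeded_hash`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and production systems would normally use a dedicated library rather than this code.

```python
import hashlib

def seeded_hash(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(tokens, num_hashes=256):
    """For each hash function, keep the minimum hash value over the token set."""
    return [min(seeded_hash(t, seed) for t in tokens) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard index."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

set_a = set("natural language processing with python".split())
set_b = set("language processing with machine learning".split())

exact = len(set_a & set_b) / len(set_a | set_b)  # 3 shared words out of 7 distinct
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))

print(f"exact Jaccard:    {exact:.3f}")
print(f"MinHash estimate: {estimate:.3f}")
```

The estimate works because, for a single hash function, the probability that two sets share the same minimum hash equals their Jaccard index. More hash functions reduce the variance of the estimate at the cost of longer signatures.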