Text Similarity Measures

Text similarity measures quantify how alike two pieces of text are. These metrics are fundamental to information retrieval, document clustering, plagiarism detection, and semantic search.

Cosine Similarity

$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Jaccard Similarity

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Euclidean Distance

$$d(A, B) = \sqrt{\sum_{i=1}^{n}(A_i - B_i)^2}$$
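The implementation sections that follow cover cosine and Jaccard, so here is a minimal sketch of the Euclidean formula applied to raw word-count vectors. The helper name `euclidean_distance` and the lowercase, whitespace-split tokenization are assumptions chosen to match the other examples in this lesson, not a prescribed method.

```python
from collections import Counter
from math import sqrt

def euclidean_distance(text1, text2):
    """Euclidean distance between raw word-count vectors (illustrative helper)."""
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Sum squared count differences over every word seen in either text;
    # a Counter returns 0 for words it has not seen.
    vocab = set(counts1) | set(counts2)
    return sqrt(sum((counts1[w] - counts2[w]) ** 2 for w in vocab))

dist = euclidean_distance("the cat sat on the mat", "the cat sat on the rug")
print(f"Distance: {dist:.3f}")
# Distance: 1.414
```

Only "mat" and "rug" differ, each by a count of one, so the distance is the square root of 2.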

Cosine Similarity Implementation

from collections import Counter
from math import sqrt

def cosine_similarity(text1, text2):
    # Count term frequencies once per text (lowercased, whitespace-split)
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Build the shared vocabulary
    vocab = sorted(set(counts1) | set(counts2))

    # Create count vectors; a Counter returns 0 for unseen words
    vec1 = [counts1[w] for w in vocab]
    vec2 = [counts2[w] for w in vocab]

    # Compute cosine
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    magnitude1 = sqrt(sum(a ** 2 for a in vec1))
    magnitude2 = sqrt(sum(b ** 2 for b in vec2))

    # An empty text has a zero vector; define its similarity as 0.0
    if magnitude1 == 0 or magnitude2 == 0:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)

text1 = "the cat sat on the mat"
text2 = "the cat sat on the rug"
print(f"Similarity: {cosine_similarity(text1, text2):.3f}")
# Similarity: 0.875

Jaccard Similarity Implementation

def jaccard_similarity(text1, text2):
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())

    intersection = set1 & set2
    union = set1 | set2

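    # Two empty texts have an empty union; return 0.0 here instead of dividing by zero
    if not union:
        return 0.0
    return len(intersection) / len(union)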

text1 = "natural language processing"
text2 = "language processing machine"
print(f"Jaccard: {jaccard_similarity(text1, text2):.3f}")
# Jaccard: 0.500

Using scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

documents = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the dog played in the park"
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Cosine similarity matrix
cos_sim = cosine_similarity(tfidf_matrix)
print("Cosine Similarity:")
print(cos_sim.round(3))

# Euclidean distance matrix
euc_dist = euclidean_distances(tfidf_matrix)
print("\nEuclidean Distance:")
print(euc_dist.round(3))
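TfidfVectorizer L2-normalizes each row by default (norm='l2'), so the two matrices above carry the same ranking information: for unit-length vectors, squared Euclidean distance equals 2 - 2 × cosine similarity. The sketch below checks that identity in plain Python. The vectors `u` and `v` are made-up stand-ins for two TF-IDF rows, and the helper names are illustrative.

```python
from math import sqrt

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical weight vectors standing in for two TF-IDF rows
u = l2_normalize([2.0, 1.0, 0.0, 1.0])
v = l2_normalize([1.0, 1.0, 1.0, 0.0])

cos = dot(u, v)        # cosine similarity of unit vectors is just the dot product
dist = euclidean(u, v)

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(abs(dist ** 2 - (2 - 2 * cos)) < 1e-12)
# True
```

In practice this means that, on L2-normalized TF-IDF vectors, sorting by ascending Euclidean distance and sorting by descending cosine similarity give the same order.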

Comparison of Similarity Metrics

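| Metric | Range | Handles Length | Use Case |
|---|---|---|---|
| Cosine | 0 to 1 for non-negative vectors such as word counts or TF-IDF (-1 to 1 in general) | Yes (normalized) | Document similarity, search |
| Jaccard | 0 to 1 | Partially | Set-based comparison |
| Euclidean | 0 to infinity | No | Clustering, nearest neighbor |
| Manhattan | 0 to infinity | No | Sparse data |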
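The "Handles Length" column can be seen directly by comparing a text with the same text repeated three times. The sketch below uses raw word counts; the helper names (`count_vectors`, `cosine`, `euclidean`, `manhattan`) are illustrative assumptions, and Manhattan distance is taken to be the sum of absolute coordinate differences.

```python
from collections import Counter
from math import sqrt

def count_vectors(text1, text2):
    """Word-count vectors for both texts over their shared vocabulary."""
    c1 = Counter(text1.lower().split())
    c2 = Counter(text2.lower().split())
    vocab = sorted(set(c1) | set(c2))
    return [c1[w] for w in vocab], [c2[w] for w in vocab]

def cosine(a, b):
    denom = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

short = "the cat sat on the mat"
tripled = " ".join([short] * 3)  # same wording, three times the length

a, b = count_vectors(short, tripled)
print(round(cosine(a, b), 3))     # 1.0   (direction unchanged)
print(round(euclidean(a, b), 3))  # 5.657 (grows with length)
print(manhattan(a, b))            # 12    (grows with length)
```

Cosine sees the two texts as identical because repetition only scales the vector, while both distances grow with the length difference.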

Similarity vs Distance

| Similarity Metric | Relation to Distance |
|---|---|
| Cosine similarity | distance = 1 - cosine_similarity |
| Jaccard similarity | distance = 1 - Jaccard_index |
| Gaussian kernel | similarity = exp(-gamma * distance²) |
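These conversions are one-liners. The sketch below writes them out with illustrative function names; the default `gamma=1.0` is an arbitrary assumption, and in practice gamma is tuned to the scale of the distances.

```python
from math import exp

def cosine_distance(cos_sim):
    """Turn a cosine similarity into a distance."""
    return 1.0 - cos_sim

def jaccard_distance(jaccard_index):
    """Turn a Jaccard index into a distance."""
    return 1.0 - jaccard_index

def gaussian_similarity(distance, gamma=1.0):
    """Map a distance to a similarity in (0, 1]; larger gamma decays faster."""
    return exp(-gamma * distance ** 2)

print(cosine_distance(0.75))                         # 0.25
print(jaccard_distance(0.5))                         # 0.5
print(gaussian_similarity(0.0))                      # 1.0
print(f"{gaussian_similarity(2.0, gamma=0.5):.3f}")  # 0.135
```

Note the direction of the last row: the Gaussian kernel takes a distance as input and returns a similarity, so identical items (distance 0) score 1.0 and the score decays toward 0 as distance grows.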

Practical Example: Document Retrieval

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Document collection
corpus = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text and speech",
    "Computer vision analyzes images and videos",
    "Reinforcement learning trains agents through rewards"
]

# Query
query = "neural networks for deep learning"

# Vectorize: learn the vocabulary and IDF weights from the corpus only,
# then project the query into the same vector space
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
query_vector = vectorizer.transform([query])

# Compute similarities
similarities = cosine_similarity(query_vector, doc_vectors)[0]

# Rank results
ranked = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)
for idx, score in ranked:
    print(f"{score:.3f} - {corpus[idx]}")

Handling Large-Scale Similarity

For large document collections, exact similarity computation is expensive: comparing every pair among n documents takes on the order of n² similarity computations. Approximate methods include:

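  • Locality-Sensitive Hashing (LSH): Hashes similar items into the same buckets with high probability, so only items sharing a bucket need to be compared
  • MinHash: Efficient estimation of Jaccard similarity from short signatures
  • FAISS: Facebook AI Similarity Search, a library for similarity search over dense vectors
  • Annoy: Approximate Nearest Neighbors library by Spotify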
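As an illustration of the MinHash idea, the sketch below estimates Jaccard similarity from fixed-length signatures instead of full sets. It is a minimal teaching version under stated assumptions: the helper names are hypothetical, each "hash function" is a salted BLAKE2b hash from Python's hashlib, 128 hash functions is an arbitrary choice, and inputs are assumed to be non-empty. Production libraries use faster hashing and add LSH banding on top.

```python
import hashlib

def _hash(token, seed):
    """64-bit hash of a token; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(tokens, num_hashes=128):
    """Keep the minimum hash value per hash function (tokens must be non-empty)."""
    unique = set(tokens)
    return [min(_hash(t, seed) for t in unique) for seed in range(num_hashes)]

def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig1, sig2))
    return matches / len(sig1)

a = "natural language processing".split()
b = "language processing machine".split()

estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"Estimated Jaccard: {estimate:.3f} (exact value: 0.500)")
```

The estimate works because, for a single hash function, the probability that two sets share the same minimum hash equals their Jaccard similarity; averaging over more hash functions tightens the estimate at the cost of longer signatures.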
