TF-IDF Weighting

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that scores how important a word is to one document relative to a collection of documents (a corpus). The score rises with the word's frequency in the document and falls as the word appears in more documents of the collection, so common words such as "the" receive low weights.

TF-IDF Formula

$$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$$

Term Frequency

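$$TF(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, so the denominator is the total number of terms in $d$.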

Inverse Document Frequency

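$$IDF(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

Here $N = |D|$ is the number of documents in the collection, and the denominator is the number of documents that contain term $t$.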
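A small worked example may make the two formulas concrete. The sketch below applies them directly to a toy corpus; the corpus and the helper names `tf` and `idf` are assumptions made for illustration, and the natural logarithm is assumed for the log.

```python
import math

# Toy corpus (an assumption for illustration): three tokenized documents.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "purred"],
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(N / df), where df is the number of documents containing the term.

    Raises ZeroDivisionError for a term that occurs in no document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

idf_the = idf("the", corpus)        # in 3 of 3 documents: log(3/3) = 0
idf_cat = idf("cat", corpus)        # in 2 of 3 documents: log(1.5)
idf_purred = idf("purred", corpus)  # in 1 of 3 documents: log(3)

# TF-IDF of "purred" in the third document: (1/3) * log(3)
score = tf("purred", corpus[2]) * idf_purred

print(idf_the, round(idf_cat, 3), round(idf_purred, 3), round(score, 3))
# prints: 0.0 0.405 1.099 0.366
```

A word that occurs in every document gets an IDF of zero, so its TF-IDF score is zero no matter how often it appears in a single document.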

Using scikit-learn

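from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are friends"
]

# With default settings the vectorizer lowercases and tokenizes the text,
# applies smoothed IDF, and L2-normalizes each row, so the values differ
# from the plain TF x IDF formula above.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # sparse matrix, one row per document

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray().round(2))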
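The numbers printed by `TfidfVectorizer` will not match the plain formula, because its documented defaults (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`) use raw counts for TF, the smoothed IDF ln((1 + N) / (1 + df)) + 1, and L2 normalization of each row. The standard-library sketch below approximates that weighting so the steps are visible; the function name `sklearn_style_tfidf` is hypothetical, and the regular expression is a simplified stand-in for the library's default tokenizer (lowercase, tokens of two or more word characters).

```python
import math
import re
from collections import Counter

def sklearn_style_tfidf(texts):
    """Approximate TfidfVectorizer defaults: raw-count TF, smoothed IDF, L2 norm."""
    # Lowercase and keep tokens of two or more word characters.
    tokenized = [re.findall(r"\b\w\w+\b", text.lower()) for text in texts]
    n_docs = len(tokenized)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize the row; guard against an all-zero row.
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are friends",
]
vocab, idf, rows = sklearn_style_tfidf(documents)
print(vocab)
print([round(v, 2) for v in rows[0]])
```

Because of the "+ 1" in the smoothed IDF, a word that appears in every document still receives a non-zero weight, unlike in the plain formula.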

Manual Implementation

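import math
from collections import Counter

def compute_tf(doc):
    """Relative frequency of each word in one tokenized document."""
    word_counts = Counter(doc)
    total = len(doc)
    return {word: count / total for word, count in word_counts.items()}

def compute_idf(documents):
    """log(N / df) for every word, where df counts the documents containing it."""
    N = len(documents)
    idf = {}
    doc_sets = [set(doc) for doc in documents]  # sets give fast membership tests
    all_words = set().union(*doc_sets)
    for word in all_words:
        containing = sum(1 for doc_set in doc_sets if word in doc_set)
        idf[word] = math.log(N / containing)  # containing >= 1 for every vocabulary word
    return idf

def tfidf(documents):
    """Return one {word: TF * IDF} dictionary per document."""
    idf = compute_idf(documents)
    tfidf_vectors = []
    for doc in documents:
        tf = compute_tf(doc)
        tfidf_vec = {word: tf_val * idf.get(word, 0) for word, tf_val in tf.items()}
        tfidf_vectors.append(tfidf_vec)
    return tfidf_vectors

# "cat", "sat", and "dog" each occur in 2 of 3 documents, so they score
# lower than "mat", "log", and "friends", which occur in only one.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
for vec in tfidf(docs):
    print(vec)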

TF-IDF Variants

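| Variant | Formula Change | Benefit |
| --- | --- | --- |
| Smoothed IDF | log((1 + N) / (1 + df)) + 1 | Avoids division by zero when df = 0 and keeps every weight positive |
| Sublinear TF | 1 + log(tf) for tf > 0 | Reduces the impact of very frequent terms |
| BM25 | Saturating TF × IDF with document-length normalization | Strong, widely used ranking baseline for IR |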
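The variants in the table can be sketched in a few lines each. This is an illustration under stated assumptions rather than a reference implementation: the smoothed IDF uses the form ln((1 + N) / (1 + df)) + 1 that scikit-learn documents, the BM25 function follows the common Okapi formulation with the frequently quoted defaults `k1=1.5` and `b=0.75`, and its IDF uses the non-negative variant with "+ 1" inside the logarithm. All function names are hypothetical.

```python
import math
from collections import Counter

def sublinear_tf(count):
    """1 + log(tf) for tf > 0, else 0: dampens very frequent terms."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def smoothed_idf(n_docs, df):
    """ln((1 + N) / (1 + df)) + 1: finite even when df is 0."""
    return math.log((1 + n_docs) / (1 + df)) + 1.0

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    counts = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        # Non-negative IDF variant (assumption): "+ 1" inside the log.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        f = counts[term]
        # Length normalization: longer-than-average documents are penalized.
        denom = f + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * f * (k1 + 1.0) / denom
    return score

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
scores = [bm25_score(["cat", "mat"], d, docs) for d in docs]
print([round(s, 3) for s in scores])
print(sublinear_tf(1), round(sublinear_tf(100), 2))  # prints: 1.0 5.61
```

For the query `["cat", "mat"]`, the first document matches both terms, the third matches only "cat", and the second matches neither, so the scores rank them in that order.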

Applications

  • Information Retrieval: Ranking documents by relevance
  • Text Classification: Feature extraction for classifiers
  • Keyword Extraction: High TF-IDF words are key terms
  • Document Similarity: Cosine similarity on TF-IDF vectors
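
Two of these applications, document similarity and keyword extraction, can be sketched with sparse dictionary vectors. The example below is a minimal illustration using the plain TF × IDF weighting; the helper names `tfidf_vectors` and `cosine` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF x IDF weights, one {term: weight} dictionary per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
vecs = tfidf_vectors(docs)

# Document similarity: pairwise cosine similarity.
sims = {(i, j): cosine(vecs[i], vecs[j]) for i in range(3) for j in range(i + 1, 3)}
print({pair: round(s, 3) for pair, s in sims.items()})

# Keyword extraction: the highest-weighted term in each document.
top_terms = [max(v, key=v.get) for v in vecs]
print(top_terms)  # prints: ['mat', 'log', 'friends']
```

Each pair of documents shares exactly one low-weight term, so the similarities are small and equal, while the term unique to each document comes out as its keyword.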