TF-IDF Weighting

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document within a collection (corpus). A word's weight grows with its frequency in the document and shrinks with the number of documents that contain it, so words that appear almost everywhere, such as "the", receive low scores.

TF-IDF Formula

TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)

Term Frequency

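TF(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

where f_{t,d} is the number of times term t occurs in document d, so TF is the term's share of all tokens in d.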

Inverse Document Frequency

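IDF(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where N = |D| is the number of documents in the collection and the denominator is the number of documents that contain t. A term that occurs in every document gets IDF = \log 1 = 0. The base of the logarithm only rescales the weights.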
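
As a worked example of the formulas above, the sketch below computes the TF-IDF weight of one term by hand. The toy corpus and variable names are made up for illustration, and the natural logarithm is assumed.

```python
import math

# Hypothetical toy corpus: three tokenized documents.
corpus = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "dog", "friends"],
]
term, doc = "cat", corpus[0]

# TF: occurrences of the term divided by the document length.
tf = doc.count(term) / len(doc)

# IDF: natural log of (number of documents / documents containing the term).
df = sum(1 for d in corpus if term in d)
idf = math.log(len(corpus) / df)

score = tf * idf
print(f"TF={tf:.3f}  IDF={idf:.3f}  TF-IDF={score:.3f}")
# prints: TF=0.333  IDF=0.405  TF-IDF=0.135
```

Here "cat" occurs once among three tokens (TF = 1/3) and in two of three documents (IDF = ln(3/2)).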

Using scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are friends"
]

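# With default settings the vectorizer lowercases text, uses a smoothed IDF,
# and L2-normalizes each row, so values differ from the plain formulas above.
vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights, then returns a sparse matrix.
tfidf_matrix = vectorizer.fit_transform(documents)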

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray().round(2))
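
The numbers printed by TfidfVectorizer do not follow the plain formulas directly. As the scikit-learn documentation describes its defaults, it uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document row. The pure-Python sketch below mimics that weighting on pre-tokenized input. The function name sklearn_style_tfidf is hypothetical, and tokenization is simplified to lowercase whitespace splitting, so treat it as an approximation and not as the library's implementation.

```python
import math
from collections import Counter

def sklearn_style_tfidf(tokenized_docs):
    """Approximate TfidfVectorizer defaults: raw-count TF, smoothed IDF, L2 norm."""
    n = len(tokenized_docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so no term gets weight zero.
    idf = {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {word: c * idf[word] for word, c in counts.items()}
        # Scale each row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({word: w / norm for word, w in weights.items()} if norm else {})
    return rows

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are friends".split(),
]
for row in sklearn_style_tfidf(docs):
    print({word: round(weight, 2) for word, weight in sorted(row.items())})
```

Because of the "+ 1" term, a word such as "the" keeps a non-zero weight even though it is common, and its repeated occurrences make it the heaviest term in the first two rows.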

Manual Implementation

import math
from collections import Counter

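def compute_tf(doc):
    """Return {word: relative frequency} for one tokenized document."""
    word_counts = Counter(doc)
    total = len(doc)
    return {word: count / total for word, count in word_counts.items()}

def compute_idf(documents):
    """Return {word: ln(N / df)} over a list of tokenized documents."""
    N = len(documents)
    # Convert each document to a set once so membership tests are fast.
    doc_sets = [set(doc) for doc in documents]
    all_words = set().union(*doc_sets)
    idf = {}
    for word in all_words:
        containing = sum(1 for words in doc_sets if word in words)
        idf[word] = math.log(N / containing)
    return idf

def tfidf(documents):
    """Return one {word: TF-IDF weight} dict per document."""
    idf = compute_idf(documents)
    tfidf_vectors = []
    for doc in documents:
        tf = compute_tf(doc)
        # A word found in every document has IDF 0 and therefore weight 0.
        tfidf_vec = {word: tf_val * idf.get(word, 0) for word, tf_val in tf.items()}
        tfidf_vectors.append(tfidf_vec)
    return tfidf_vectors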

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
for vec in tfidf(docs):
    print(vec)

TF-IDF Variants

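Variant	Formula Change	Benefit
Smoothed IDF	log((1 + N) / (1 + df)) + 1	Avoids division by zero for unseen terms and zero weights for terms in every document
Sublinear TF	1 + log(tf) for tf > 0	Dampens the effect of a term repeated many times in one document
BM25	Saturating TF × IDF with document-length normalization	Strong, widely used ranking baseline in IR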
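
To make the last two rows of the table concrete, here is a minimal sketch of sublinear TF scaling and of one common BM25 formulation. The function names, the parameter values k1 = 1.5 and b = 0.75, and the Lucene-style IDF are assumptions chosen for illustration; other BM25 variants differ in the IDF term and defaults.

```python
import math
from collections import Counter

def sublinear_tf(count):
    """Sublinear TF scaling: 1 + ln(count) for count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

def bm25_scores(query, tokenized_docs, k1=1.5, b=0.75):
    """Score each (non-empty list of) document against the query terms."""
    n = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            f = counts[term]
            if f == 0:
                continue  # terms absent from the document contribute nothing
            # Lucene-style IDF, which stays non-negative for very common terms.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # TF saturates as f grows; b controls document-length normalization.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores

docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
print(bm25_scores(["cat", "mat"], docs))
print(sublinear_tf(1), sublinear_tf(10))
```

For the query ["cat", "mat"], the first document matches both terms and ranks highest, the third matches only "cat", and the second scores zero.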

Applications

Information Retrieval: Ranking documents by relevance to a query
Text Classification: Feature extraction for classifiers
Keyword Extraction: The highest-weighted TF-IDF terms in a document are candidate keywords
Document Similarity: Cosine similarity between TF-IDF vectors
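
The last two applications can be sketched in a few lines, assuming each document is represented as a {term: weight} dict like the ones produced by the manual implementation. The helper names and the example weights are hypothetical.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def top_keywords(vec, k=3):
    """Return the k highest-weighted terms of one document vector."""
    return sorted(vec, key=vec.get, reverse=True)[:k]

# Hypothetical TF-IDF vectors, made up for illustration.
doc_a = {"cat": 0.14, "sat": 0.14, "mat": 0.37}
doc_b = {"dog": 0.14, "sat": 0.14, "log": 0.37}

print(round(cosine_similarity(doc_a, doc_b), 3))
print(top_keywords(doc_a, k=1))
```

Only shared terms contribute to the dot product, so two documents with no vocabulary in common have similarity 0.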