TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects how important a word is to a document within a collection of documents (a corpus). It weights a word by how often it appears in a document and discounts words that appear in many documents, so a word that is frequent in one document but rare elsewhere scores highest.
TF-IDF Formula
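The Manual Implementation section multiplies a per-document term frequency by a corpus-wide inverse document frequency. Assuming that unsmoothed definition (libraries such as scikit-learn apply smoothed and normalized variants), the score of a term $t$ in a document $d$ from a corpus $D$ can be written as follows. The symbols $t$, $d$, and $D$ are notation introduced here for illustration.

```latex
\[
\operatorname{tfidf}(t, d, D) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t, D)
\]
```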
Term Frequency
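One common definition, and the one `compute_tf` uses in the Manual Implementation section, is the count of a term divided by the total number of terms in the document. Here $f_{t,d}$ is assumed to denote the raw count of term $t$ in document $d$; other definitions (raw counts, log-scaled counts) are also in use.

```latex
\[
\operatorname{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\]
```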
Inverse Document Frequency
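The unsmoothed form, matching `compute_idf` in the Manual Implementation section, takes the natural logarithm of the corpus size over the number of documents that contain the term. Here $N = |D|$ is assumed to denote the number of documents in the corpus $D$.

```latex
\[
\operatorname{idf}(t, D) = \ln \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\]
```

Under this definition a term that appears in every document gets an IDF of $\ln 1 = 0$, and the ratio is undefined for a term that appears in no document.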
Using scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are friends",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF Matrix:")
print(tfidf_matrix.toarray().round(2))
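The values printed by the example above will not match the plain TF × log(N/df) formula, because `TfidfVectorizer` with default settings uses raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row. The following is a rough pure-Python sketch of that weighting, not scikit-learn's actual code. The helper names `tokenize` and `sklearn_style_tfidf` are hypothetical, and the tokenizer only approximates the library's default token pattern.

```python
import math
import re
from collections import Counter

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are friends",
]

def tokenize(text):
    # Assumption: lowercase, keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def sklearn_style_tfidf(texts):
    tokenized = [tokenize(text) for text in texts]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = sklearn_style_tfidf(documents)
for row in rows:
    print({term: round(w, 2) for term, w in sorted(row.items())})
```

Because of the normalization, every non-empty row has unit Euclidean length, which is why cosine similarity between rows reduces to a dot product. Note also that "cat" and "cats" are separate vocabulary entries, since no stemming is applied.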
Manual Implementation
import math
from collections import Counter

def compute_tf(doc):
    """Relative frequency of each word within one tokenized document."""
    word_counts = Counter(doc)
    total = len(doc)
    return {word: count / total for word, count in word_counts.items()}

def compute_idf(documents):
    """Unsmoothed IDF, log(N / df), for every word in the corpus."""
    N = len(documents)
    idf = {}
    all_words = set(word for doc in documents for word in doc)
    for word in all_words:
        # Document frequency: number of documents that contain the word
        containing = sum(1 for doc in documents if word in doc)
        idf[word] = math.log(N / containing)
    return idf

def tfidf(documents):
    """Return one {word: tf * idf} dictionary per document."""
    idf = compute_idf(documents)
    tfidf_vectors = []
    for doc in documents:
        tf = compute_tf(doc)
        tfidf_vec = {word: tf_val * idf.get(word, 0) for word, tf_val in tf.items()}
        tfidf_vectors.append(tfidf_vec)
    return tfidf_vectors

# Each document is a pre-tokenized list of words
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog", "friends"]]
for vec in tfidf(docs):
    print(vec)
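One property of the unsmoothed log(N / df) weighting is worth seeing in isolation: a word that occurs in every document gets an IDF of exactly zero, so it vanishes from every vector. The sketch below uses a small hypothetical corpus and a standalone `idf` helper (both invented here for illustration) to show this.

```python
import math

# Hypothetical corpus: "the" appears in all three documents
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]

def idf(term, documents):
    # Unsmoothed IDF: log(N / number of documents containing the term)
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

idf_the = idf("the", docs)  # in 3 of 3 documents
idf_cat = idf("cat", docs)  # in 2 of 3 documents
idf_dog = idf("dog", docs)  # in 1 of 3 documents
print(idf_the, idf_cat, idf_dog)
```

The same helper would raise `ZeroDivisionError` for a term that occurs in no document, which is one reason smoothed variants exist.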
TF-IDF Variants
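| Variant | Formula Change | Benefit |
|---|---|---|
| Smoothed IDF | log((1 + N) / (1 + df)) + 1 (scikit-learn's default) | Avoids division by zero when df = 0 and keeps terms found in every document from scoring zero |
| Sublinear TF | 1 + log(tf) for tf > 0 | Dampens the effect of very frequent terms |
| BM25 | Saturating TF with document-length normalization, multiplied by IDF | Strong, widely used ranking baseline for IR |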
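The variants can be sketched as small functions. This is an illustrative sketch under stated assumptions: the smoothed IDF uses the scikit-learn form log((1 + N) / (1 + df)) + 1, and the BM25 term weight uses the common Okapi formulation with a non-negative IDF and parameter values `k1=1.5` and `b=0.75`, which are typical defaults rather than values given in this article. All function names are hypothetical.

```python
import math

def smoothed_idf(n_docs, df):
    # Defined even when df == 0; never drops to zero
    return math.log((1 + n_docs) / (1 + df)) + 1

def sublinear_tf(count):
    # Logarithmic damping of raw counts; zero counts stay zero
    return 1 + math.log(count) if count > 0 else 0.0

def bm25_term_weight(count, doc_len, avg_doc_len, n_docs, df, k1=1.5, b=0.75):
    # Assumed IDF variant that stays non-negative for very common terms
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Documents longer than average are penalized; b controls how strongly
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # The TF component saturates: each extra occurrence adds less
    return idf * count * (k1 + 1) / (count + k1 * length_norm)

print(smoothed_idf(3, 0), smoothed_idf(3, 3))
print(sublinear_tf(1), sublinear_tf(100))
print(bm25_term_weight(5, 100, 100, 1000, 10),
      bm25_term_weight(10, 100, 100, 1000, 10))
```

Doubling a term's count from 5 to 10 raises the BM25 weight by far less than a factor of two, which is the saturation behavior that distinguishes it from plain TF-IDF.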
Applications
- Information Retrieval: Ranking documents by relevance
- Text Classification: Feature extraction for classifiers
- Keyword Extraction: High TF-IDF words are key terms
- Document Similarity: Cosine similarity on TF-IDF vectors
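Two of the applications listed above, document similarity and keyword extraction, can be sketched on sparse `{word: weight}` vectors like those produced by the manual implementation. The function names and the rounded weights below are illustrative assumptions, not output copied from a real run.

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Only terms present in both vectors contribute to the dot product
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[term] * vec_b[term] for term in shared)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def top_keywords(vec, k=1):
    # Highest-weighted terms first
    return sorted(vec, key=vec.get, reverse=True)[:k]

# Illustrative TF-IDF vectors with rounded weights
doc_a = {"cat": 0.14, "sat": 0.14, "mat": 0.37}
doc_b = {"dog": 0.14, "sat": 0.14, "log": 0.37}
doc_c = {"cat": 0.14, "dog": 0.14, "friends": 0.37}

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
print(top_keywords(doc_a))
```

Documents that share only low-weight words end up with a small cosine similarity, while the single distinctive word in each document surfaces as its top keyword.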