Bag of Words (BoW)
Bag of Words is one of the simplest and most fundamental text representation methods. It represents a text as an unordered collection (a "bag") of its words: grammar and word order are discarded, but the number of times each word occurs is kept.
How BoW Works
- Collect all unique words across the corpus to build a vocabulary
- For each document, count the occurrences of each vocabulary word
- Represent each document as a vector of word counts
from sklearn.feature_extraction.text import CountVectorizer
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print("BoW Matrix:")
print(bow_matrix.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
# [0, 0, 1, 1, 0, 1, 1, 2],
# [1, 1, 1, 0, 0, 0, 0, 2]]
Binary BoW
Binary BoW only tracks presence (1) or absence (0) of a word, ignoring frequency.
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(documents)
print(binary_matrix.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 1],
# [0, 0, 1, 1, 0, 1, 1, 1],
# [1, 1, 1, 0, 0, 0, 0, 1]]
Manual Implementation
from collections import Counter

def bag_of_words(documents):
    """Return (vocab, vectors) using lowercase whitespace tokenization."""
    # Tokenize once; unlike CountVectorizer, split() does not strip punctuation
    tokenized = [doc.lower().split() for doc in documents]
    # Build a sorted vocabulary so every vector uses the same column order
    vocab = sorted({word for tokens in tokenized for word in tokens})
    # Create one count vector per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the dog sat", "the cat dog"]
vocab, vectors = bag_of_words(docs)
print("Vocab:", vocab)      # Vocab: ['cat', 'dog', 'sat', 'the']
print("Vectors:", vectors)  # Vectors: [[1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 1]]
BoW Variations
| Variant | Description | Use Case |
|---|---|---|
| Count | Raw word counts | General classification |
| Binary | Presence/absence | Document similarity |
| Normalized | Count / total words in the document | Comparing documents of different lengths |
| TF-IDF weighted | Term frequency × inverse document frequency | Information retrieval |
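The Normalized and TF-IDF rows of the table can be sketched as follows. This is a minimal illustration, assuming the same three toy documents used earlier. Note that `TfidfVectorizer` with its default settings applies a smoothed IDF and L2-normalizes each row, so its numbers differ from a plain count × IDF product.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]

# Raw counts as a dense array (fine for a toy corpus; keep it sparse for real data)
counts = CountVectorizer().fit_transform(documents).toarray()

# Normalized BoW: divide each row by that document's total token count
normalized = counts / counts.sum(axis=1, keepdims=True)

# TF-IDF: counts reweighted so that words occurring in many documents count less
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(documents).toarray()

print("Normalized:")
print(np.round(normalized, 2))
print("TF-IDF:")
print(np.round(tfidf, 2))
```

Each normalized row sums to 1, which makes documents of different lengths comparable. In the TF-IDF matrix, "the" appears in every document and therefore receives the lowest IDF weight of any vocabulary word.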
Limitations of BoW
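- Loss of word order: "dog bites man" and "man bites dog" produce identical vectors
- No semantic meaning: "good" and "excellent" are treated as unrelated dimensions
- High dimensionality: one dimension per vocabulary word, and real vocabularies can exceed 50K words
- Sparse vectors: each document uses only a small fraction of the vocabulary, so most entries are zero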
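Three of these limitations are easy to observe directly. The sketch below is illustrative only: the helper `bow` and the three-sentence `corpus` are made up for this example and are not part of any library.

```python
from sklearn.feature_extraction.text import CountVectorizer

def bow(docs):
    """Hypothetical helper: dense count matrix for a list of documents."""
    return CountVectorizer().fit_transform(docs).toarray()

# 1. Word order is lost: both sentences map to the same count vector
order = bow(["dog bites man", "man bites dog"])
order_lost = bool((order[0] == order[1]).all())
print("Identical vectors:", order_lost)  # Identical vectors: True

# 2. No semantics: synonyms occupy different columns, so their vectors share nothing
synonyms = bow(["good", "excellent"])
overlap = int(synonyms[0] @ synonyms[1])
print("Dot product:", overlap)  # Dot product: 0

# 3. Sparsity: documents on different topics share few words
corpus = [
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "the recipe needs two fresh eggs",
]
sparse_matrix = CountVectorizer().fit_transform(corpus)
n_docs, n_terms = sparse_matrix.shape
zero_fraction = 1 - sparse_matrix.nnz / (n_docs * n_terms)
print(f"Vocabulary size: {n_terms}, zero entries: {zero_fraction:.0%}")
```

Even with only three short documents, well over half of the matrix is zero. The fraction grows as the vocabulary grows, which is why scikit-learn returns a sparse matrix by default.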
Practical Example: Spam Classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Training data
train_docs = [
    "Win free money now",
    "Click here for prize",
    "Meeting tomorrow at noon",
    "Project deadline Friday",
]
train_labels = [1, 1, 0, 0]  # 1=spam, 0=ham
# Vectorize
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)
# Predict
test_docs = ["Free money prize", "Meeting at noon"]
X_test = vectorizer.transform(test_docs)
predictions = classifier.predict(X_test)
print(predictions) # [1, 0]