Bag of Words Representation

Bag of Words (BoW)

Bag of Words is one of the simplest and most fundamental text representation methods. It represents text as a collection (bag) of its words, disregarding grammar and word order but keeping track of frequency.

How BoW Works

1. Collect all unique words across the corpus to build a vocabulary
2. For each document, count the occurrences of each vocabulary word
3. Represent each document as a vector of word counts

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog"
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']

print("BoW Matrix:")
print(bow_matrix.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]

Binary BoW

Binary BoW records only whether a word is present (1) or absent (0) in a document, ignoring how often it occurs.

binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(documents)

print(binary_matrix.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 1],
#  [0, 0, 1, 1, 0, 1, 1, 1],
#  [1, 1, 1, 0, 0, 0, 0, 1]]

Manual Implementation

from collections import Counter

def bag_of_words(documents):
    """Return (vocab, vectors) using lowercase, whitespace-split tokens."""
    tokenized = [doc.lower().split() for doc in documents]

    # Build a sorted vocabulary of every unique token in the corpus
    vocab = sorted({word for tokens in tokenized for word in tokens})

    # Create one count vector per document, aligned with vocab order
    vectors = []
    for tokens in tokenized:
        counter = Counter(tokens)  # a Counter returns 0 for missing words
        vectors.append([counter[word] for word in vocab])

    return vocab, vectors

docs = ["the cat sat", "the dog sat", "the cat dog"]
vocab, vectors = bag_of_words(docs)
print("Vocab:", vocab)
# Vocab: ['cat', 'dog', 'sat', 'the']
print("Vectors:", vectors)
# Vectors: [[1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 1]]

BoW Variations

Variant	Description	Use Case
Count	Raw word counts	General classification
Binary	Presence/absence	Document similarity
Normalized	Count / total words	Comparing documents of different lengths
TF-IDF weighted	Term frequency * inverse document frequency	Information retrieval
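
The Normalized and TF-IDF weighted rows of the table have no code elsewhere in this section, so here is a minimal sketch of both, reusing the three example sentences from earlier. It assumes scikit-learn's TfidfVectorizer with its default settings (smoothed IDF and L2 row normalization), which is only one of several TF-IDF formulations; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]

# Count variant: raw word counts per document
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(documents).toarray()
vocab = count_vectorizer.vocabulary_  # maps word -> column index

# Normalized variant: divide each row by that document's total word count
normalized = counts / counts.sum(axis=1, keepdims=True)
print(f"Share of 'the' in document 1: {normalized[0, vocab['the']]:.2f}")  # 0.33

# TF-IDF variant: words that appear in fewer documents get a larger weight
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(documents).toarray()
tfidf_vocab = tfidf_vectorizer.vocabulary_

# "mat" and "cat" both occur once in document 1, but "mat" is rarer in the corpus
print("mat weight:", tfidf[0, tfidf_vocab["mat"]])
print("cat weight:", tfidf[0, tfidf_vocab["cat"]])
```

With normalization, every row sums to 1, so a long document and a short one become directly comparable. With TF-IDF, "mat" should receive a higher weight than "cat" in the first document because it appears in only one document of the corpus.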

Limitations of BoW

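Loss of word order: "dog bites man" and "man bites dog" produce identical vectors
No semantic meaning: "good" and "excellent" are treated as unrelated words
High dimensionality: each vector has one dimension per vocabulary word, and vocabularies can exceed 50K words
Sparse vectors: each document uses only a small fraction of the vocabulary, so most entries are zero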
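
The limitations above can be checked directly. The following is a small sketch using CountVectorizer with default settings; the variable names and the tiny corpora are chosen purely for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Loss of word order: both sentences contain the same words, once each
order_docs = ["dog bites man", "man bites dog"]
order_vectors = CountVectorizer().fit_transform(order_docs).toarray()
same_vector = bool(np.array_equal(order_vectors[0], order_vectors[1]))
print("Identical vectors:", same_vector)  # Identical vectors: True

# No semantic meaning: synonyms occupy separate dimensions with no overlap
synonym_vectors = CountVectorizer().fit_transform(["good", "excellent"]).toarray()
overlap = int(synonym_vectors[0] @ synonym_vectors[1])
print("Dot product:", overlap)  # Dot product: 0

# Dimensionality and sparsity: one column per vocabulary word
corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]
matrix = CountVectorizer().fit_transform(corpus)  # SciPy sparse matrix
n_docs, n_words = matrix.shape
density = matrix.nnz / (n_docs * n_words)
print(f"{n_words} dimensions, nonzero share: {density:.2f}")
```

On a three-sentence corpus the matrix is still fairly dense. The nonzero share shrinks as the vocabulary grows, because each document uses only a small part of it, which is why scikit-learn returns a sparse matrix rather than a dense array.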

Practical Example: Spam Classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
train_docs = [
    "Win free money now",
    "Click here for prize",
    "Meeting tomorrow at noon",
    "Project deadline Friday"
]
train_labels = [1, 1, 0, 0]  # 1=spam, 0=ham

# Vectorize
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Predict
test_docs = ["Free money prize", "Meeting at noon"]
X_test = vectorizer.transform(test_docs)
predictions = classifier.predict(X_test)
print(predictions)  # [1, 0]