Bag of Words Representation

Bag of Words (BoW)

Bag of Words is one of the simplest and most fundamental text representation methods. It represents text as a collection (bag) of its words, disregarding grammar and word order but keeping track of frequency.

How BoW Works

  1. Collect all unique words across the corpus to build a vocabulary
  2. For each document, count the occurrences of each vocabulary word
  3. Represent each document as a vector of word counts

With scikit-learn, CountVectorizer performs all three steps:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog"
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
# ['cat', 'chased', 'dog', 'log', 'mat', 'on', 'sat', 'the']

print("BoW Matrix:")
print(bow_matrix.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 2],
#  [0, 0, 1, 1, 0, 1, 1, 2],
#  [1, 1, 1, 0, 0, 0, 0, 2]]

Binary BoW

Binary BoW records only whether a word is present (1) or absent (0) in a document, ignoring how often it occurs.

binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(documents)

print(binary_matrix.toarray())
# [[1, 0, 0, 0, 1, 1, 1, 1],
#  [0, 0, 1, 1, 0, 1, 1, 1],
#  [1, 1, 1, 0, 0, 0, 0, 1]]

Manual Implementation

from collections import Counter

def bag_of_words(documents):
    # Tokenize once: lowercase, then split on whitespace
    tokenized = [doc.lower().split() for doc in documents]

    # Build a sorted vocabulary of the unique words across all documents
    vocab = sorted({word for tokens in tokenized for word in tokens})

    # Create one count vector per document, ordered like the vocabulary
    vectors = []
    for tokens in tokenized:
        counter = Counter(tokens)
        vectors.append([counter[word] for word in vocab])

    return vocab, vectors

docs = ["the cat sat", "the dog sat", "the cat dog"]
vocab, vectors = bag_of_words(docs)
print("Vocab:", vocab)
# Vocab: ['cat', 'dog', 'sat', 'the']
print("Vectors:", vectors)
# Vectors: [[1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 0, 1]]
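
The manual version above splits on whitespace, so punctuation stays attached to words ("mat." and "mat" become separate entries) and one-letter tokens such as "a" are kept. By default, CountVectorizer instead extracts tokens with the regular expression (?u)\b\w\w+\b, that is, runs of two or more word characters. The sketch below mimics that default in plain Python; tokenize and bag_of_words_regex are hypothetical helper names used for illustration, not part of scikit-learn.

```python
import re
from collections import Counter

# Same idea as CountVectorizer's default token pattern:
# a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and return its regex tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words_regex(documents):
    """Build a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocab = sorted({tok for tokens in tokenized for tok in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

docs = ["The cat sat on a mat.", "A dog sat on the mat!"]
vocab, vectors = bag_of_words_regex(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 1], [0, 1, 1, 1, 1, 1]]
```

Punctuation and the one-letter word "a" no longer appear in the vocabulary, which is why the scikit-learn output and a naive split() can disagree on real text.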

BoW Variations

Variant         | Description                        | Use Case
Count           | Raw word counts                    | General classification
Binary          | Presence/absence                   | Document similarity
Normalized      | Count / total words                | Comparing documents of different lengths
TF-IDF weighted | Count × inverse document frequency | Information retrieval
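
The normalized and TF-IDF variants can be sketched in plain Python as follows. This is a minimal illustration, not a library implementation: it assumes the textbook IDF formula log(N / df), whereas libraries such as scikit-learn use smoothed variants, so the exact numbers will differ. The helper name count_vector is hypothetical.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for tokens in tokenized for tok in tokens})

def count_vector(tokens):
    """Raw counts for one document, ordered like vocab (hypothetical helper)."""
    counts = Counter(tokens)
    return [counts[tok] for tok in vocab]

count_vectors = [count_vector(tokens) for tokens in tokenized]

# Normalized BoW: divide each count by the document's total word count.
normalized = [[c / sum(row) for c in row] for row in count_vectors]

# IDF (assumption: textbook form log(N / df), no smoothing).
n_docs = len(docs)
doc_freq = [sum(1 for row in count_vectors if row[j] > 0)
            for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]

# TF-IDF weighted BoW: raw count times the word's IDF.
tfidf = [[c * w for c, w in zip(row, idf)] for row in count_vectors]

print(vocab)                            # ['cat', 'chased', 'dog', 'sat', 'the']
print(normalized[2])                    # [0.2, 0.2, 0.2, 0.0, 0.4]
print([round(x, 3) for x in tfidf[2]])  # [0.405, 1.099, 0.405, 0.0, 0.0]
```

Under this formula "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing, while the rarer "chased" gets the largest weight.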

Limitations of BoW

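  • Loss of word order: "dog bites man" and "man bites dog" produce identical vectors
  • No semantic meaning: "good" and "excellent" are treated as unrelated words
  • High dimensionality: each vocabulary word is one dimension, and a real vocabulary can exceed 50K words
  • Sparse vectors: a document uses only a small part of the vocabulary, so most entries are zero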
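
These limitations are easy to reproduce with a toy vocabulary. The sketch below is illustrative only: bow_vector is a hypothetical helper, and the 50,000-word vocabulary is padded with made-up filler entries ("word0", "word1", ...) purely to show the effect on sparsity.

```python
from collections import Counter

def bow_vector(text, vocab):
    """Count vector for text over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

vocab = ["bites", "dog", "excellent", "good", "man"]

# Loss of word order: both sentences map to the same vector.
a = bow_vector("dog bites man", vocab)
b = bow_vector("man bites dog", vocab)
print(a == b)  # True

# No semantic meaning: synonyms occupy different dimensions,
# so their vectors have a dot product of zero.
good = bow_vector("good", vocab)
excellent = bow_vector("excellent", vocab)
overlap = sum(x * y for x, y in zip(good, excellent))
print(overlap)  # 0

# High dimensionality and sparsity: pad the vocabulary to 50,000 entries
# with made-up filler words; almost every entry of the vector is zero.
large_vocab = vocab + [f"word{i}" for i in range(49_995)]
vector = bow_vector("dog bites man", large_vocab)
zero_fraction = vector.count(0) / len(vector)
print(len(vector), zero_fraction)  # 50000 0.99994
```

This is why libraries store BoW matrices in sparse formats, as CountVectorizer does until toarray() is called.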

Practical Example: Spam Classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data
train_docs = [
    "Win free money now",
    "Click here for prize",
    "Meeting tomorrow at noon",
    "Project deadline Friday"
]
train_labels = [1, 1, 0, 0]  # 1=spam, 0=ham

# Vectorize
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Train classifier
classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

# Predict
test_docs = ["Free money prize", "Meeting at noon"]
X_test = vectorizer.transform(test_docs)
predictions = classifier.predict(X_test)
print(predictions)  # [1 0] -> first message is spam, second is ham
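
To see why the word counts lead to these predictions, the scores can be recomputed by hand. The sketch below is a plain-Python approximation under stated assumptions: it mimics CountVectorizer's default tokenization (lowercase, tokens of two or more word characters) and uses add-one (Laplace) smoothing with class priors estimated from the training labels, which matches MultinomialNB's default settings (alpha=1.0). The names tokenize, log_score, and manual_predictions are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep tokens of 2+ word characters (hypothetical helper)."""
    return re.findall(r"\b\w\w+\b", text.lower())

train_docs = [
    "Win free money now",
    "Click here for prize",
    "Meeting tomorrow at noon",
    "Project deadline Friday",
]
train_labels = [1, 1, 0, 0]  # 1=spam, 0=ham

vocab = {tok for doc in train_docs for tok in tokenize(doc)}

# Bag-of-words counts accumulated per class.
class_counts = {0: Counter(), 1: Counter()}
for doc, label in zip(train_docs, train_labels):
    class_counts[label].update(tokenize(doc))

def log_score(text, label, alpha=1.0):
    """Log prior plus smoothed log likelihood of the in-vocabulary tokens."""
    prior = train_labels.count(label) / len(train_labels)
    total = sum(class_counts[label].values())
    score = math.log(prior)
    for tok in tokenize(text):
        if tok in vocab:  # words unseen in training are ignored
            p = (class_counts[label][tok] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
    return score

test_docs = ["Free money prize", "Meeting at noon"]
manual_predictions = []
for text in test_docs:
    scores = {label: log_score(text, label) for label in (0, 1)}
    manual_predictions.append(max(scores, key=scores.get))

print(manual_predictions)  # [1, 0]
```

Each test message consists entirely of words seen in only one class, so that class gets the higher smoothed likelihood for every word.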