Bayes' Theorem in Action — Fast, Simple, Surprisingly Powerful
Naive Bayes applies Bayes' theorem with a "naive" independence assumption. Despite its simplicity, it often outperforms more complex algorithms on text classification.
- Bayes' Theorem — The probabilistic foundation of classification
- Conditional Independence — The simplifying assumption that makes it tractable
- Gaussian / Multinomial / Bernoulli — Three variants for different data types
"Simplicity is the ultimate sophistication." — Leonardo da Vinci
Naive Bayes — Complete Guide
Naive Bayes applies Bayes' theorem with the "naive" assumption that features are conditionally independent given the class. Despite this simplification, it works remarkably well in practice.
Bayes' Theorem
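Theorem: Bayes' Theorem
$$P(C_k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C_k)\, P(C_k)}{P(\mathbf{x})}$$
where:
- $P(C_k \mid \mathbf{x})$ is the posterior probability of class $C_k$
- $P(\mathbf{x} \mid C_k)$ is the class-conditional likelihood
- $P(C_k)$ is the prior probability
- $P(\mathbf{x})$ is the evidence (normalizing constant)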
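Example: Spam Detection
$$P(\text{spam} \mid \text{words}) = \frac{P(\text{words} \mid \text{spam})\, P(\text{spam})}{P(\text{words})}$$
With the naive assumption, for words $w_1, \dots, w_n$:
$$P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})$$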
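To make the spam example concrete, here is a minimal sketch in plain Python. The prior and per-word likelihoods are made-up illustrative numbers, not estimates from any dataset, and the function name `naive_posterior` is hypothetical.

```python
import math

# Hypothetical, hand-picked probabilities for illustration only.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.30, "money": 0.20, "meeting": 0.01},
    "ham": {"free": 0.02, "money": 0.03, "meeting": 0.25},
}

def naive_posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Unnormalized score: prior times the product of per-word likelihoods.
    scores = {
        c: prior[c] * math.prod(likelihood[c][w] for w in words)
        for c in prior
    }
    evidence = sum(scores.values())  # P(words), the normalizing constant
    return {c: s / evidence for c, s in scores.items()}

posterior = naive_posterior(["free", "money"])
print(posterior)
```

With these numbers the spam score is 0.4 × 0.30 × 0.20 = 0.024 and the non-spam score is 0.6 × 0.02 × 0.03 = 0.00036, so nearly all of the posterior mass goes to spam.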
The Naive Independence Assumption
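Definition: Conditional Independence Assumption
Given class $C_k$, features $x_1, \dots, x_n$ are assumed independent:
$$P(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)$$
For $n$ binary features, this reduces the number of parameters per class from $O(2^n)$ (joint) to $O(n)$ (independent), making estimation tractable even with limited data.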
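As a rough illustration of that reduction, the sketch below counts likelihood parameters for binary features. It assumes every feature is binary and ignores the class priors; the helper names `joint_params` and `naive_params` are hypothetical.

```python
def joint_params(n_features, n_classes):
    # A full joint table over n binary features has 2**n cells per class,
    # minus one because the probabilities in each table must sum to 1.
    return n_classes * (2 ** n_features - 1)

def naive_params(n_features, n_classes):
    # Under conditional independence: one Bernoulli parameter
    # per feature per class.
    return n_classes * n_features

for n in (5, 10, 20):
    print(n, joint_params(n, 2), naive_params(n, 2))
```

With 20 binary features and 2 classes, the joint model needs over two million parameters while the naive model needs 40.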
Why 'Naive' But Effective?
The independence assumption is often violated (e.g., "free" and "money" are correlated in spam). However, Naive Bayes still works well because:
- Classification only needs the correct ranking of posteriors, not exact probabilities
- The errors from incorrect independence assumptions often cancel out
- With limited training data, the bias from the assumption reduces variance
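The first point can be illustrated with a small sketch. It treats a perfectly correlated copy of one feature as if it were independent, using made-up likelihoods and equal priors. The predicted class stays the same, but the posterior becomes more extreme. The function name `posterior_spam` is hypothetical.

```python
def posterior_spam(p_spam, p_ham, copies):
    """Posterior for spam when one feature is counted `copies` times."""
    # Equal priors of 0.5 are assumed, so they cancel out of the ratio.
    s = p_spam ** copies
    h = p_ham ** copies
    return s / (s + h)

# Hypothetical likelihoods: P(word | spam) = 0.3, P(word | ham) = 0.1
single = posterior_spam(0.3, 0.1, copies=1)   # feature counted once
doubled = posterior_spam(0.3, 0.1, copies=2)  # duplicate treated as independent
print(single, doubled)
```

Both posteriors are above 0.5, so the ranking of classes is unchanged, but the doubled version (about 0.9 versus about 0.75) is overconfident. This is one way violated independence can distort probabilities without changing the decision.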
Variants
Gaussian Naive Bayes
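Definition: Gaussian Naive Bayes
For continuous features $x_i$, assume $x_i \mid C_k \sim \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$:
$$P(x_i \mid C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\!\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$$
Parameters $\mu_{ik}$ and $\sigma_{ik}^2$ are estimated from training data per class.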
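A minimal from-scratch sketch of this variant, using only the standard library, might look as follows. The toy data is made up, the function names are hypothetical, and the small variance floor is an added assumption to avoid division by zero. It is not how any particular library implements it.

```python
import math
from statistics import fmean, pvariance

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    params = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        columns = list(zip(*rows))
        params[c] = {
            "prior": len(rows) / len(X),
            "mean": [fmean(col) for col in columns],
            # Small floor (an assumption) guards against zero variance.
            "var": [pvariance(col) + 1e-9 for col in columns],
        }
    return params

def log_gaussian(x, mean, var):
    # Log of the normal density, used to avoid underflow when summing.
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(params, x):
    def score(c):
        p = params[c]
        return math.log(p["prior"]) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, p["mean"], p["var"])
        )
    # Pick the class with the highest log posterior (up to a constant).
    return max(params, key=score)

# Toy data: two features, two well-separated classes (made up).
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))
```

Working in log space turns the product of densities into a sum, which keeps the scores numerically stable when there are many features.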
Multinomial Naive Bayes
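Definition: Multinomial Naive Bayes
For discrete count features (e.g., word counts), with Laplace smoothing $\alpha$:
$$P(x_i \mid C_k) = \frac{N_{ik} + \alpha}{N_k + \alpha n}$$
where $N_{ik}$ = count of feature $i$ in class $C_k$, $N_k$ = total count in class $C_k$, $n$ = number of features.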
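The smoothed estimate can be sketched in a few lines of plain Python. The documents and vocabulary below are made up, and `smoothed_likelihoods` is a hypothetical helper name.

```python
from collections import Counter

def smoothed_likelihoods(docs, vocab, alpha=1.0):
    """P(word | class) for one class, given its tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts[w] for w in vocab)   # total count in the class
    denom = total + alpha * len(vocab)      # add alpha once per feature
    return {w: (counts[w] + alpha) / denom for w in vocab}

# Made-up tokenized spam documents and a five-word vocabulary.
spam_docs = [["win", "free", "money"], ["free", "prize"]]
vocab = ["win", "free", "money", "prize", "meeting"]
probs = smoothed_likelihoods(spam_docs, vocab)
print(probs)
```

Here "meeting" never appears in the spam documents, yet it receives probability (0 + 1) / (5 + 5) = 0.1 rather than zero, so a single unseen word cannot wipe out the whole product.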
Bernoulli Naive Bayes
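Definition: Bernoulli Naive Bayes
For binary features $x_i \in \{0, 1\}$ (present/absent), with $p_{ik} = P(x_i = 1 \mid C_k)$:
$$P(\mathbf{x} \mid C_k) = \prod_{i=1}^{n} p_{ik}^{x_i} (1 - p_{ik})^{1 - x_i}$$
Suitable for binary text data (word present/absent) and binary feature sets.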
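One practical difference from the multinomial variant is that absent features also contribute to the likelihood. The sketch below shows this with hypothetical presence probabilities; `bernoulli_log_likelihood` is an illustrative helper, not a library function.

```python
import math

def bernoulli_log_likelihood(x, p):
    """log P(x | class) for binary vector x and per-feature probabilities p."""
    return sum(
        # Present features contribute log p, absent ones log (1 - p).
        math.log(pi) if xi else math.log(1.0 - pi)
        for xi, pi in zip(x, p)
    )

# Hypothetical P(word present | class) for the words ["free", "meeting", "prize"].
p_spam = [0.8, 0.1, 0.6]
p_ham = [0.1, 0.7, 0.05]

x = [1, 0, 0]  # "free" present, "meeting" and "prize" absent
ll_spam = bernoulli_log_likelihood(x, p_spam)
ll_ham = bernoulli_log_likelihood(x, p_ham)
print(ll_spam, ll_ham)
```

For this input the spam likelihood is 0.8 × 0.9 × 0.4 = 0.288 and the non-spam likelihood is 0.1 × 0.3 × 0.95 = 0.0285, so the absence of "meeting" counts as evidence for spam.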
Text Classification Example
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny toy corpus: four spam and four non-spam messages
emails = [
    "Win free money now", "Claim your prize", "Meeting tomorrow at 10",
    "Project update", "Buy discount pills", "Special offer just for you",
    "Team lunch Friday", "Quarterly report attached",
]
labels = [1, 1, 0, 0, 1, 1, 0, 0]  # 1=spam, 0=not spam

# Convert each email into a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB(alpha=1.0)  # Laplace smoothing
model.fit(X, labels)

# Predict on new email
new_emails = ["Free prize waiting for you"]
X_new = vectorizer.transform(new_emails)
print(f"Prediction: {model.predict(X_new)[0]}")  # 1 (spam)
print(f"P(spam|words) = {model.predict_proba(X_new)[0][1]:.4f}")
```
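To see what the classifier computes for this prediction, the following sketch redoes the calculation by hand in plain Python on the same eight emails. It assumes the tokenizer matches `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters) and that words outside the training vocabulary are ignored; `tokenize` and `class_log_scores` are hypothetical helper names.

```python
import math
import re
from collections import Counter

emails = [
    "Win free money now", "Claim your prize", "Meeting tomorrow at 10",
    "Project update", "Buy discount pills", "Special offer just for you",
    "Team lunch Friday", "Quarterly report attached",
]
labels = [1, 1, 0, 0, 1, 1, 0, 0]  # 1=spam, 0=not spam
alpha = 1.0

def tokenize(text):
    # Assumed to mirror CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

vocab = sorted({w for e in emails for w in tokenize(e)})

def class_log_scores(text):
    """Log prior plus summed smoothed log likelihoods, per class."""
    tokens = [w for w in tokenize(text) if w in vocab]  # drop unseen words
    scores = {}
    for c in set(labels):
        docs = [e for e, label in zip(emails, labels) if label == c]
        counts = Counter(w for d in docs for w in tokenize(d))
        total = sum(counts.values())
        scores[c] = math.log(len(docs) / len(emails)) + sum(
            math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
            for w in tokens
        )
    return scores

scores = class_log_scores("Free prize waiting for you")
# Normalize with log-sum-exp to turn log scores into posteriors.
m = max(scores.values())
norm = m + math.log(sum(math.exp(s - m) for s in scores.values()))
p_spam = math.exp(scores[1] - norm)
print(f"P(spam|words) = {p_spam:.4f}")
```

Each of the four known words ("free", "prize", "for", "you") appears once in spam and never in non-spam, which gives a spam posterior of roughly 0.92. Under the stated assumptions this should agree with `predict_proba` up to floating-point error.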
When to Use Naive Bayes
Excellent for:
- Text classification (spam, sentiment, topic)
- High-dimensional sparse data
- Small training sets (low sample complexity)
- Multi-class problems (naturally handles K classes)
- When speed matters (linear training time)
Poor for:
- Strongly correlated features
- Continuous features with complex distributions
- When calibrated probability estimates matter
- When feature interactions are important
Key Takeaways
Summary: Naive Bayes
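- Applies Bayes' theorem with conditional independence: $P(C_k \mid \mathbf{x}) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$
- Reduces parameters from $O(2^n)$ to $O(n)$ for $n$ binary features, from exponential to linear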
- Gaussian for continuous, Multinomial for counts, Bernoulli for binary features
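- Laplace smoothing ($\alpha = 1$) handles unseen features: $P(x_i \mid C_k) = \frac{N_{ik} + 1}{N_k + n}$, so a zero count never yields a zero probability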
- Fast training and prediction — among the fastest classifiers
- Works surprisingly well despite the naive assumption
- Probabilistic output enables threshold tuning for precision-recall tradeoff
- Great baseline — always worth trying before complex models
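The threshold-tuning point can be sketched as follows. The probabilities and labels are made up for illustration, and `precision_recall` is a hypothetical helper rather than a library call.

```python
def precision_recall(probs, labels, threshold):
    """Precision and recall when predicting spam for probs >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    # Conventions for empty denominators are a choice made for this sketch.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predicted spam probabilities and true labels.
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.3, 0.5, 0.7):
    print(t, precision_recall(probs, labels, t))
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision from 0.6 to 1.0 while recall drops from 1.0 to 2/3, which is the tradeoff the threshold controls.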
What to Learn Next
-> Logistic Regression: Classification with probability, from linear to sigmoid.
-> SVM: Finding the optimal boundary with maximum margin classification.
-> NLP Fundamentals: Natural language processing, covering tokenization, embeddings, and transformers.