Bayes' Theorem in Action — Fast, Simple, Surprisingly Powerful

Naive Bayes applies Bayes' theorem with a "naive" independence assumption. Despite its simplicity, it often outperforms more complex algorithms on text classification.

Bayes' Theorem — The probabilistic foundation of classification
Conditional Independence — The simplifying assumption that makes it tractable
Gaussian / Multinomial / Bernoulli — Three variants for different data types

"Simplicity is the ultimate sophistication." — Leonardo da Vinci

Naive Bayes — Complete Guide

Naive Bayes applies Bayes' theorem with the "naive" assumption that features are conditionally independent given the class. Despite this simplification, it works remarkably well in practice.

Bayes' Theorem

Theorem: Bayes' Theorem

$$
P(C_k | \mathbf{x}) = \frac{P(\mathbf{x} | C_k) P(C_k)}{P(\mathbf{x})} = \frac{P(\mathbf{x} | C_k) P(C_k)}{\sum_{j=1}^{K} P(\mathbf{x} | C_j) P(C_j)}
$$

where:

$P(C_k | \mathbf{x})$ is the posterior probability of class $C_k$
$P(\mathbf{x} | C_k)$ is the class-conditional likelihood
$P(C_k)$ is the prior probability
$P(\mathbf{x})$ is the evidence (normalizing constant)

Example: Spam Detection

$$
P(\text{Spam}|\text{"free money"}) = \frac{P(\text{"free money"}|\text{Spam}) \times P(\text{Spam})}{P(\text{"free money"})}
$$

With the naive assumption:

$$
P(\text{"free","money"}|\text{Spam}) \approx P(\text{"free"}|\text{Spam}) \times P(\text{"money"}|\text{Spam})
$$
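To make the spam example concrete, here is a minimal sketch that plugs numbers into the formula. All priors and word likelihoods below are hypothetical values chosen only for illustration; they do not come from any dataset.

```python
# Hypothetical priors and per-word likelihoods (illustrative only).
p_spam, p_ham = 0.4, 0.6
p_word_given_spam = {"free": 0.30, "money": 0.20}
p_word_given_ham = {"free": 0.03, "money": 0.02}

def joint(words, likelihoods, prior):
    """Prior times the product of per-word likelihoods (naive assumption)."""
    score = prior
    for w in words:
        score *= likelihoods[w]
    return score

words = ["free", "money"]
spam_score = joint(words, p_word_given_spam, p_spam)  # 0.4 * 0.30 * 0.20
ham_score = joint(words, p_word_given_ham, p_ham)     # 0.6 * 0.03 * 0.02

# The evidence P(x) is the sum of the class scores, so it just normalizes.
posterior_spam = spam_score / (spam_score + ham_score)
print(f"P(Spam | 'free', 'money') = {posterior_spam:.4f}")
```

With these made-up numbers the posterior comes out close to 0.985: the words are ten times more likely under Spam, which outweighs the lower prior.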

The Naive Independence Assumption

Definition: Conditional Independence Assumption

Given class $C_k$, features $x_1, x_2, \ldots, x_n$ are assumed independent:

$$
P(\mathbf{x}|C_k) = \prod_{i=1}^{n} P(x_i|C_k)
$$

If each of the $n$ features can take $d$ values, this reduces the number of parameters from $O(K \cdot d^n)$ for the full joint distribution to $O(K \cdot n \cdot d)$ under independence, making estimation tractable even with limited data.
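The size of that reduction is easy to check numerically. The sketch below assumes every one of the $n$ features is discrete with $d$ possible values (the helper names `joint_params` and `naive_params` are made up for this illustration) and counts free parameters per model.

```python
def joint_params(K, n, d):
    """Full joint table: d**n cells per class, minus 1 for normalization."""
    return K * (d ** n - 1)

def naive_params(K, n, d):
    """One d-valued distribution per feature per class, minus 1 each."""
    return K * n * (d - 1)

K, d = 2, 2  # two classes, binary features
for n in (5, 10, 20):
    print(f"n={n:2d}  joint={joint_params(K, n, d):>9,}  naive={naive_params(K, n, d)}")
```

For 20 binary features and two classes, the joint model needs about two million parameters while the naive model needs 40.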

Why 'Naive' But Effective?

The independence assumption is often violated (e.g., "free" and "money" are correlated in spam). However, Naive Bayes still works well because:

Classification only needs the correct ranking of posteriors, not exact probabilities
The errors from incorrect independence assumptions often cancel out
With limited training data, the assumption adds bias but lowers variance, which tends to help generalization
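The first point can be illustrated with a toy calculation. Suppose one piece of evidence is fed to the model as several perfectly correlated copies of the same feature, the worst case for the independence assumption. The likelihood values here are hypothetical, and the sketch covers only the ranking argument, not the other two points.

```python
def posterior_spam(lik_spam, lik_ham, copies, prior_spam=0.5):
    """Posterior when the same evidence is counted `copies` times."""
    s = prior_spam * lik_spam ** copies
    h = (1 - prior_spam) * lik_ham ** copies
    return s / (s + h)

# Hypothetical likelihoods: P(word|spam) = 0.3, P(word|ham) = 0.1
for copies in (1, 2, 3):
    print(copies, round(posterior_spam(0.3, 0.1, copies), 4))
```

Double counting pushes the posterior from 0.75 toward 1, so the probability becomes overconfident, but Spam stays ahead of Ham in every case and the predicted class does not change.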

Variants

Gaussian Naive Bayes

Definition: Gaussian Naive Bayes

For continuous features $x_i$, assume $P(x_i|C_k) = \mathcal{N}(\mu_{ik}, \sigma_{ik}^2)$:

$$
P(x_i|C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)
$$

Parameters $\mu_{ik}$ and $\sigma_{ik}^2$ are estimated from training data per class.
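As a rough from-scratch sketch of this variant, the code below estimates a mean and variance per feature per class and scores new points in log space. The tiny dataset, the function names, and the `1e-9` variance floor are assumptions made for this illustration; a library implementation may differ in details such as variance smoothing.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Return {label: (prior, means, variances)} estimated per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n_features = len(rows[0])
        means = [sum(r[i] for r in rows) / len(rows) for i in range(n_features)]
        # Maximum-likelihood variance plus a small floor to avoid dividing by zero
        variances = [
            sum((r[i] - means[i]) ** 2 for r in rows) / len(rows) + 1e-9
            for i in range(n_features)
        ]
        model[label] = (len(rows) / len(X), means, variances)
    return model

def log_gaussian(x, mu, var):
    """Log of the Gaussian density N(x; mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, x):
    """Class with the highest log prior + summed log likelihoods."""
    scores = {
        label: math.log(prior)
        + sum(log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-feature dataset with two well-separated classes
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.2], [3.2, 3.9], [2.9, 4.1]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 4.0]))
```

Summing log densities instead of multiplying densities avoids numerical underflow when there are many features.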

Multinomial Naive Bayes

Definition: Multinomial Naive Bayes

For discrete count features (e.g., word counts), with Laplace smoothing $\alpha$:

$$
P(x_i|C_k) = \frac{N_{ik} + \alpha}{N_k + \alpha n}
$$

where $N_{ik}$ = count of feature $i$ in class $k$, $N_k$ = total count in class $k$, $n$ = number of features.
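The smoothed estimate can be computed by hand on a tiny corpus. In this sketch the documents are hypothetical, and the vocabulary is taken over all documents (spam and non-spam) so that $n$ counts every known feature.

```python
from collections import Counter

# Hypothetical corpus: two spam documents and one non-spam document
spam_docs = ["win free money now", "free prize"]
other_docs = ["meeting tomorrow"]

vocab = sorted({w for doc in spam_docs + other_docs for w in doc.split()})
counts = Counter(w for doc in spam_docs for w in doc.split())  # N_ik for spam

N_k = sum(counts.values())  # total word count in the spam class
n = len(vocab)              # number of features (vocabulary size)
alpha = 1.0                 # Laplace smoothing

def smoothed(word):
    """(N_ik + alpha) / (N_k + alpha * n) for the spam class."""
    return (counts[word] + alpha) / (N_k + alpha * n)

print(f"P(free|spam)    = {smoothed('free'):.4f}")
print(f"P(meeting|spam) = {smoothed('meeting'):.4f}")
print(f"sum over vocab  = {sum(smoothed(w) for w in vocab):.4f}")
```

"meeting" never occurs in spam, yet it gets a small nonzero probability (1/13 here) instead of zero, and the smoothed estimates still sum to 1 over the vocabulary.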

Bernoulli Naive Bayes

Definition: Bernoulli Naive Bayes

For binary features (present/absent):

$$
P(x_i=1|C_k) = \frac{\text{count}(x_i=1 \land C_k)}{\text{count}(C_k)}
$$

Suitable for binary text data (word present/absent) and binary feature sets.
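A small sketch of the Bernoulli estimate on hypothetical spam documents, each represented as a set of words. It also shows the standard Bernoulli document likelihood, in which absent words contribute a factor of $1 - P(x_i=1|C_k)$; that detail is an addition for illustration and is not spelled out in the definition above. No smoothing is applied here, to match the formula as given.

```python
# Hypothetical spam documents as word sets (presence/absence only)
docs = [{"free", "money"}, {"free", "prize"}, {"money"}, {"offer", "free"}]
vocab = ["free", "money", "prize", "offer"]

# P(x_i = 1 | spam): fraction of spam documents that contain the word
p = {w: sum(w in d for d in docs) / len(docs) for w in vocab}

def likelihood(doc_words):
    """Product over the whole vocabulary, including absent words."""
    result = 1.0
    for w in vocab:
        result *= p[w] if w in doc_words else (1 - p[w])
    return result

print(p)
print(f"P({{'free'}} | spam) = {likelihood({'free'}):.4f}")
```

Unlike the multinomial variant, the absence of a word counts as evidence: a document containing only "free" is penalized for missing "money".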

Text Classification Example

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

emails = ["Win free money now", "Claim your prize", "Meeting tomorrow at 10",
          "Project update", "Buy discount pills", "Special offer just for you",
          "Team lunch Friday", "Quarterly report attached"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]  # 1=spam, 0=not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB(alpha=1.0)  # Laplace smoothing
model.fit(X, labels)

# Predict on new email
new_emails = ["Free prize waiting for you"]
X_new = vectorizer.transform(new_emails)
print(f"Prediction: {model.predict(X_new)[0]}")  # 1 (spam)
print(f"P(spam|words) = {model.predict_proba(X_new)[0][1]:.4f}")
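To see what the classifier does internally, the following standard-library sketch recomputes the same posterior by hand using smoothed word probabilities and log-space scoring. The `tokenize` helper is an assumption meant to mimic the vectorizer's default behavior (lowercasing, tokens of two or more word characters); the result should agree with the library's output up to floating-point error if that assumption holds.

```python
import math
import re
from collections import Counter

emails = ["Win free money now", "Claim your prize", "Meeting tomorrow at 10",
          "Project update", "Buy discount pills", "Special offer just for you",
          "Team lunch Friday", "Quarterly report attached"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]  # 1=spam, 0=not spam

def tokenize(text):
    # Assumed approximation of the default tokenizer: lowercase, 2+ word chars
    return re.findall(r"\b\w\w+\b", text.lower())

vocab = {w for e in emails for w in tokenize(e)}
alpha = 1.0
class_counts = {c: Counter() for c in set(labels)}
for e, c in zip(emails, labels):
    class_counts[c].update(tokenize(e))

def log_score(words, c):
    """log P(c) + sum of log smoothed word probabilities; unseen words skipped."""
    total = sum(class_counts[c].values())
    log_prior = math.log(labels.count(c) / len(labels))
    return log_prior + sum(
        math.log((class_counts[c][w] + alpha) / (total + alpha * len(vocab)))
        for w in words if w in vocab
    )

words = tokenize("Free prize waiting for you")
scores = {c: log_score(words, c) for c in class_counts}

# Normalize in a numerically stable way (subtract the max before exponentiating)
top = max(scores.values())
exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
p_spam = exp_scores[1] / sum(exp_scores.values())
print(f"P(spam|words) = {p_spam:.4f}")
```

"waiting" is not in the training vocabulary, so it is ignored; the remaining four words each appear once in spam and never in non-spam, which drives the posterior to roughly 0.92.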

When to Use Naive Bayes

Excellent for:

Text classification (spam, sentiment, topic)
High-dimensional sparse data
Small training sets (low sample complexity)
Multi-class problems (naturally handles K classes)
When speed matters (linear training time)

Poor for:

Strongly correlated features
Continuous features with complex distributions
When calibrated probability estimates matter
When feature interactions are important

Key Takeaways

Summary: Naive Bayes

Applies Bayes' theorem with conditional independence: $P(\mathbf{x}|C_k) = \prod_i P(x_i|C_k)$
Reduces parameters from $O(K \cdot d^n)$ to $O(K \cdot n \cdot d)$ for $n$ features with $d$ values each: from exponential to linear in $n$
Gaussian for continuous, Multinomial for counts, Bernoulli for binary features
Laplace smoothing ($\alpha$) handles unseen features: $P(x_i|C_k) = \frac{N_{ik}+\alpha}{N_k+\alpha n}$
Fast training $O(Nn)$ and prediction $O(Kn)$ — among the fastest classifiers
Works surprisingly well despite the naive assumption
Probabilistic output enables threshold tuning for precision-recall tradeoff
Great baseline — always worth trying before complex models
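The threshold-tuning point in the summary can be sketched in a few lines. The predicted probabilities and true labels below are hypothetical, standing in for the output of a fitted classifier's `predict_proba` on a validation set.

```python
# Hypothetical predicted P(spam) values and true labels (1 = spam)
probs = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
truth = [1, 1, 0, 1, 0, 0]

def precision_recall(threshold):
    """Precision and recall when predicting spam for probs >= threshold."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, truth))
    fp = sum(p == 1 and t == 0 for p, t in zip(preds, truth))
    fn = sum(p == 0 and t == 1 for p, t in zip(preds, truth))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.35, 0.5, 0.7):
    precision, recall = precision_recall(threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Lowering the threshold catches every spam message at the cost of a false positive; raising it removes the false positive but misses one spam message. Because Naive Bayes probabilities tend to be poorly calibrated, the threshold is best chosen on held-out data rather than fixed at 0.5.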

What to Learn Next

-> Logistic Regression: classification with probability, from linear to sigmoid.

-> SVM: finding the optimal boundary with maximum margin classification.

-> NLP Fundamentals: natural language processing, covering tokenization, embeddings, and transformers.