Text Classification

Text classification is the task of assigning predefined categories to text documents. It's one of the most common NLP applications, used for spam detection, topic labeling, intent recognition, and content moderation.

Classical ML Approaches

Naive Bayes

Naive Bayes applies Bayes' theorem with the "naive" assumption that features are conditionally independent of one another given the class.

Naive Bayes Classification

P(\text{class} \mid \text{features}) = \frac{P(\text{features} \mid \text{class}) \times P(\text{class})}{P(\text{features})}
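To make the formula concrete, here is a minimal sketch that applies it by hand to a single feature, the word "meeting". The priors and likelihoods are made-up illustrative values, and the helper name `posterior` is hypothetical.

```python
# Bayes' rule for one feature, computed by hand.
# All probabilities below are illustrative assumptions, not corpus estimates.

def posterior(priors, likelihoods):
    """Return P(class | feature) for every class."""
    # Numerator per class: P(feature | class) * P(class)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Denominator P(feature): the numerators summed over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

priors = {"spam": 0.5, "ham": 0.5}          # P(class)
likelihoods = {"spam": 0.01, "ham": 0.04}   # P("meeting" | class)

post = posterior(priors, likelihoods)
print(post)
```

With these numbers the posterior comes out at roughly 0.2 for spam and 0.8 for ham: the word is four times as likely under ham, and the equal priors do not shift that ratio.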

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# Sample data
train_texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended"
]
train_labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

# Build pipeline
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB(alpha=1.0))
])

nb_pipeline.fit(train_texts, train_labels)

# Predict
test = ["Free money click here", "Meeting at noon"]
print(nb_pipeline.predict(test))  # [1, 0]
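The `alpha=1.0` argument above controls additive (Laplace) smoothing. The sketch below shows why it matters, using the additive-smoothing estimate described in the scikit-learn documentation for `MultinomialNB`. The helper name `smoothed_likelihood` and the counts are hypothetical, chosen only for illustration.

```python
def smoothed_likelihood(count, class_total, vocab_size, alpha=1.0):
    """Additive-smoothing estimate of P(term | class)."""
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts: the term never occurs in the spam class
count_in_spam = 0
spam_total = 21   # total term occurrences in spam (assumed)
vocab_size = 37   # number of distinct terms (assumed)

unsmoothed = smoothed_likelihood(count_in_spam, spam_total, vocab_size, alpha=0.0)
smoothed = smoothed_likelihood(count_in_spam, spam_total, vocab_size, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing, a single unseen term gives the class a likelihood of zero and wipes out the evidence from every other term in the document. With `alpha=1.0` the estimate stays small but non-zero.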

Support Vector Machine (SVM)

SVMs find the optimal hyperplane that maximizes the margin between classes.

from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

svm_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2),
        max_features=10000,
        sublinear_tf=True
    )),
    ('classifier', LinearSVC(C=1.0))
])

svm_pipeline.fit(train_texts, train_labels)
print(svm_pipeline.predict(test))
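To see the margin idea in action, the following sketch prints the signed score behind each prediction. It rebuilds the same toy data so it runs on its own; the variable names are illustrative. Note that `decision_function` returns the raw score w·x + b, which is proportional to, but not the same as, the geometric distance to the hyperplane.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended",
]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

svm = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("classifier", LinearSVC(C=1.0)),
])
svm.fit(texts, labels)

new_texts = ["Free money click here", "Meeting at noon"]
scores = svm.decision_function(new_texts)  # signed score per document
preds = svm.predict(new_texts)

for text, score, pred in zip(new_texts, scores, preds):
    # Positive scores fall on the spam side of the hyperplane
    print(f"{score:+.3f} -> {pred}  {text}")
```

Scores close to zero mark documents near the decision boundary, which are the ones worth reviewing by hand first.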

Logistic Regression

Logistic regression is a linear model for binary and multiclass classification that outputs a probability for each class.

from sklearn.linear_model import LogisticRegression

lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LogisticRegression(
        C=1.0,
        max_iter=1000,
        class_weight='balanced'
    ))
])

lr_pipeline.fit(train_texts, train_labels)
print(lr_pipeline.predict(test))

Model Comparison

Model	Speed	Accuracy	Interpretability	Best For
Naive Bayes	Very Fast	Good	High	Baseline, small data
SVM	Fast	Very Good	Medium	High-dimensional text
Logistic Regression	Fast	Good	High	Probabilistic outputs
Random Forest	Medium	Good	Medium	Non-linear boundaries
Neural Networks	Slow	Excellent	Low	Large datasets
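The ratings in the table are rules of thumb, so it is worth measuring on your own data. A minimal sketch of such a comparison with cross-validation follows. The six-sentence toy set is far too small for the scores to mean anything; it only demonstrates the mechanics, and the `models` and `results` names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended",
]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

models = {
    "Naive Bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "Linear SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
    "Logistic Regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in models.items():
    # 3-fold stratified CV: each fold holds out one spam and one ham message
    scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the vectorizer inside the pipeline matters here: it is refit on each training fold, so vocabulary from the held-out fold never leaks into training.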

Evaluation Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, classification_report, confusion_matrix
)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class

F1 Score

F_1 = 2 \cdot \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
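As a worked check of the formula, the sketch below derives precision, recall, and F1 from the confusion-matrix counts for the same `y_true` and `y_pred` used in the metrics example, then compares the result with `f1_score`.

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / 5
recall = tp / (tp + fn)     # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
print(f"f1_score agrees: {abs(f1 - f1_score(y_true, y_pred)) < 1e-12}")
```

Here precision is 0.600 and recall is 0.750, giving an F1 of 0.667. Because F1 is a harmonic mean, it sits closer to the lower of the two values than their arithmetic mean (0.675) does.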

Multi-Class Classification

from sklearn.datasets import fetch_20newsgroups

# Load dataset
categories = ['sci.space', 'rec.sport.baseball', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# Train classifier
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000))
])

pipeline.fit(train.data, train.target)
accuracy = pipeline.score(test.data, test.target)
print(f"Accuracy: {accuracy:.3f}")
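With more than two classes, a single accuracy figure can hide weak performance on rarer classes. The sketch below uses a small set of hypothetical labels (not output from the newsgroups model) to contrast accuracy with per-class and macro-averaged F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels for a 3-class problem in which class 0 dominates
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

acc = accuracy_score(y_true, y_pred)
per_class = f1_score(y_true, y_pred, average=None)   # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")    # unweighted mean of the above
micro = f1_score(y_true, y_pred, average="micro")    # pooled over all predictions

print(f"Accuracy: {acc:.3f}")
print(f"Per-class F1: {per_class.round(3)}")
print(f"Macro F1: {macro:.3f}")
print(f"Micro F1: {micro:.3f}")
```

For single-label problems micro-averaged F1 equals accuracy, while macro F1 weights every class equally and so drops when the smaller classes are misclassified.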
