Text Classification
Text classification is the task of assigning predefined categories to text documents. It's one of the most common NLP applications, used for spam detection, topic labeling, intent recognition, and content moderation.
Classical ML Approaches
Naive Bayes
Naive Bayes applies Bayes' theorem under the "naive" assumption that features (here, word and n-gram counts) are conditionally independent given the class.
Naive Bayes Classification
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Sample data
train_texts = [
"Win free iPhone now!!!",
"Click here for prize money",
"Meeting agenda for Monday",
"Quarterly report attached",
"Buy cheap medications",
"Project deadline extended"
]
train_labels = [1, 1, 0, 0, 1, 0] # 1=spam, 0=ham
# Build pipeline
nb_pipeline = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1, 2))),
('classifier', MultinomialNB(alpha=1.0))
])
nb_pipeline.fit(train_texts, train_labels)
# Predict
test = ["Free money click here", "Meeting at noon"]
print(nb_pipeline.predict(test)) # [1, 0]
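To make the independence assumption concrete, here is a minimal sketch of multinomial Naive Bayes scoring with Laplace smoothing in plain Python. It is a simplification rather than a reimplementation of scikit-learn: it uses unigram counts only, and the names `tokenize`, `train_nb`, and `predict_nb` are illustrative.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase, keep alphabetic runs only
    cleaned = "".join(c if c.isalpha() else " " for c in text.lower())
    return cleaned.split()

def train_nb(texts, labels, alpha=1.0):
    class_docs = Counter(labels)            # documents per class (for priors)
    word_counts = defaultdict(Counter)      # word counts per class
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab, alpha

def predict_nb(model, text):
    class_docs, word_counts, vocab, alpha = model
    n_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        # log prior + sum of per-word log likelihoods (the independence assumption)
        score = math.log(class_docs[label] / n_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)

spam_texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended",
]
spam_labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

model = train_nb(spam_texts, spam_labels)
print(predict_nb(model, "Free money click here"))  # 1
print(predict_nb(model, "Meeting at noon"))        # 0
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and `alpha` plays the same role as the `alpha=1.0` smoothing parameter passed to `MultinomialNB`.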
Support Vector Machine (SVM)
A linear SVM finds the hyperplane that maximizes the margin between classes, which suits the high-dimensional, sparse feature vectors that text produces.
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
svm_pipeline = Pipeline([
('tfidf', TfidfVectorizer(
ngram_range=(1, 2),
max_features=10000,
sublinear_tf=True
)),
('classifier', LinearSVC(C=1.0))
])
svm_pipeline.fit(train_texts, train_labels)
print(svm_pipeline.predict(test))
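The idea of a separating hyperplane can be inspected directly. The sketch below, which assumes the same toy spam data and a default `TfidfVectorizer`, prints the signed score that `decision_function` returns for each sample. The sign determines the predicted class, and the magnitude grows with distance from the hyperplane.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended",
]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

svm = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LinearSVC(C=1.0)),
])
svm.fit(texts, labels)

samples = ["Free money click here", "Meeting at noon"]
margins = svm.decision_function(samples)  # signed score w.x + b per sample
predictions = svm.predict(samples)

for text, margin, pred in zip(samples, margins, predictions):
    # A positive score maps to class 1, a non-positive score to class 0
    print(f"{text!r}: score={margin:+.3f}, predicted={pred}")
```

Scores close to zero indicate samples near the decision boundary, which is a rough signal of low confidence even though `LinearSVC` does not output probabilities.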
Logistic Regression
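Logistic regression is a linear model for binary and multiclass classification that outputs class probabilities. Setting `class_weight='balanced'` weights each class inversely to its frequency, which helps when the classes are imbalanced.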
from sklearn.linear_model import LogisticRegression
lr_pipeline = Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
('classifier', LogisticRegression(
C=1.0,
max_iter=1000,
class_weight='balanced'
))
])
lr_pipeline.fit(train_texts, train_labels)
print(lr_pipeline.predict(test))
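Because logistic regression is probabilistic, it can report how confident it is rather than only a label. The following is a small sketch on the same toy spam data, assuming default TF-IDF settings, that reads those probabilities with `predict_proba`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended",
]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

lr = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
lr.fit(texts, labels)

samples = ["Free money click here", "Meeting at noon"]
probabilities = lr.predict_proba(samples)  # one row per sample, one column per class

for text, row in zip(samples, probabilities):
    # Column order follows lr.classes_, here [0, 1] = [ham, spam]
    print(f"{text!r}: P(ham)={row[0]:.3f}, P(spam)={row[1]:.3f}")
```

With probabilities available, the decision threshold can be moved away from 0.5, for example to trade recall for precision in a spam filter.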
Model Comparison
| Model | Speed | Accuracy | Interpretability | Best For |
|---|---|---|---|---|
| Naive Bayes | Very Fast | Good | High | Baseline, small data |
| SVM | Fast | Very Good | Medium | High-dimensional text |
| Logistic Regression | Fast | Good | High | Probabilistic outputs |
| Random Forest | Medium | Good | Medium | Non-linear boundaries |
| Neural Networks | Slow | Excellent | Low | Large datasets |
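The ratings in the table are qualitative, so it is worth measuring on your own data. One way to do that is cross-validation, sketched below with `cross_val_score` for the three linear models covered above. The six-example dataset is far too small for meaningful numbers and only demonstrates the mechanics. The `candidates` dictionary is an illustrative structure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win free iPhone now!!!",
    "Click here for prize money",
    "Meeting agenda for Monday",
    "Quarterly report attached",
    "Buy cheap medications",
    "Project deadline extended",
]
labels = [1, 1, 0, 0, 1, 0]  # 1=spam, 0=ham

candidates = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, clf in candidates.items():
    # The vectorizer sits inside the pipeline so it is refit on each training fold
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} over {len(scores)} folds")
```

Keeping the vectorizer inside the pipeline matters: fitting it on the full dataset before splitting would leak vocabulary and IDF statistics from the validation folds into training.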
Evaluation Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, classification_report, confusion_matrix
)
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall: {recall_score(y_true, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred))
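All of these metrics derive from the four cells of the confusion matrix. As a sketch using the same `y_true` and `y_pred` values, the counts can be unpacked and the precision and recall recomputed by hand:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

# For binary labels [0, 1], ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=2 FP=2 FN=1 TP=3

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(f"Precision: {precision:.3f}")  # Precision: 0.600
print(f"Recall: {recall:.3f}")        # Recall: 0.750
```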
F1 Score
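F1 is the harmonic mean of precision and recall, so it is high only when both are high. As a worked check, the `y_true` and `y_pred` lists from the evaluation example contain 3 true positives, 2 false positives, and 1 false negative:

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{3}{3 + 2} = 0.600 \\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{3}{3 + 1} = 0.750 \\
F_1 &= 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2 \cdot 0.600 \cdot 0.750}{0.600 + 0.750} \approx 0.667
\end{aligned}
```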
Multi-Class Classification
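from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Load dataset (downloaded and cached on first call)
categories = ['sci.space', 'rec.sport.baseball', 'comp.graphics']
news_train = fetch_20newsgroups(subset='train', categories=categories)
news_test = fetch_20newsgroups(subset='test', categories=categories)
# Train classifier; LogisticRegression handles the three classes directly
news_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000))
])
news_pipeline.fit(news_train.data, news_train.target)
# Pipeline.score reports mean accuracy on the held-out test split
accuracy = news_pipeline.score(news_test.data, news_test.target)
print(f"Accuracy: {accuracy:.3f}")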
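Accuracy alone can hide weak classes in a multi-class problem. With more than two classes, `precision_score`, `recall_score`, and `f1_score` need an explicit `average` argument. The sketch below uses small hand-written label lists (illustrative values, not newsgroup results) to show how the macro, micro, and weighted averages differ.

```python
from sklearn.metrics import f1_score

# Illustrative three-class labels; class 0 is the majority class
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)      # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
micro = f1_score(y_true, y_pred, average="micro")        # computed from global TP/FP/FN counts
weighted = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support

print("Per-class F1:", [round(float(v), 3) for v in per_class])  # [0.75, 0.8, 0.667]
print(f"Macro F1: {macro:.3f}")        # Macro F1: 0.739
print(f"Micro F1: {micro:.3f}")        # Micro F1: 0.750
print(f"Weighted F1: {weighted:.3f}")  # Weighted F1: 0.742
```

Macro averaging treats every class equally and is the usual choice when minority classes matter. For single-label problems, micro-averaged F1 equals accuracy.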