Text Mining and Statistical NLP
Advanced Statistical Methods
Extracting Knowledge From Unstructured Text
Statistical text mining transforms raw text into quantitative representations for analysis. TF-IDF, topic models, and word embeddings enable automated understanding of document collections at scale.
- Sentiment analysis: Track public opinion trends from social media and product reviews
- Healthcare: Extract clinical information from medical records and research papers
- Legal discovery: Automatically classify and prioritize relevant documents in litigation
Text mining turns the world's unstructured text into analyzable, quantifiable data.
Text mining extracts structured statistical information from unstructured text. The central challenge is converting variable-length text sequences into fixed-dimensional numerical representations amenable to statistical analysis. This lesson develops the mathematical foundations of text representation, topic modeling, sentiment analysis, word embeddings, and document classification.
Bag-of-Words Representation
Definition: Bag-of-Words (BoW)
A document-term matrix $X \in \mathbb{R}^{N \times V}$ represents $N$ documents over a vocabulary of $V$ terms. Each entry $X_{dt}$ counts the occurrences of term $t$ in document $d$. The bag-of-words model discards word order, treating each document as a multiset of terms.
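As a minimal sketch of the definition above, the following builds a document-term count matrix by hand. The two-sentence corpus and the whitespace tokenizer are illustrative assumptions, not part of the lesson's data.

```python
from collections import Counter

import numpy as np

# Hypothetical toy corpus, used only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Whitespace tokenization is an assumption; real pipelines normalize more carefully.
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique terms, each mapped to a column index.
vocab = sorted({term for tokens in tokenized for term in tokens})
col = {term: j for j, term in enumerate(vocab)}

# X[i, j] = number of occurrences of term j in document i.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, tokens in enumerate(tokenized):
    for term, count in Counter(tokens).items():
        X[i, col[term]] = count

print(vocab)
print(X)
```

Each row sums to the length of its document, and any reordering of the words within a document yields the same row, which is exactly the information the bag-of-words model keeps.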
Term Frequency
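The raw term frequency is $\mathrm{tf}(t, d) = f_{t,d}$, where $f_{t,d}$ is the count of term $t$ in document $d$. Common transformations include:
- Log-scaled: $\mathrm{tf}(t, d) = 1 + \log f_{t,d}$ for $f_{t,d} > 0$, and $0$ otherwise
- Augmented: $\mathrm{tf}(t, d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{t'} f_{t',d}}$
The augmented form divides by the count of the document's most frequent term, which prevents over-weighting of frequent terms within a document.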
Inverse Document Frequency
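Inverse document frequency (IDF) measures term specificity across the corpus:
$$\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}$$
where $N$ is the total number of documents and $\mathrm{df}(t)$ is the document frequency, i.e. the number of documents containing term $t$. Variants include smoothed IDF, for example $\mathrm{idf}(t) = \log \frac{1 + N}{1 + \mathrm{df}(t)} + 1$.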
TF-IDF Weighting
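The TF-IDF score combines term frequency and inverse document frequency:
$$\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$
TF-IDF downweights common terms (high $\mathrm{df}$, low $\mathrm{idf}$) and upweights discriminative terms (high $\mathrm{tf}$, high $\mathrm{idf}$). The TF-IDF matrix is the input to many text mining algorithms. Cosine similarity between TF-IDF vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ measures document similarity:
$$\cos(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\|_2 \, \|\mathbf{x}_j\|_2}$$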
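A small numerical sketch may make the weighting concrete. The count matrix below is hypothetical, and the code uses raw term frequency with unsmoothed IDF (natural log of documents over document frequency), which is one of several conventions.

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) by 4 terms (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 0, 3, 0],
    [1, 0, 0, 2],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)      # document frequency per term
idf = np.log(n_docs / df)          # unsmoothed IDF; zero for terms in every document
tfidf = counts * idf               # raw term frequency times IDF


def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


print(np.round(tfidf, 3))
print(cosine_similarity(tfidf[0], tfidf[1]))
```

The first term occurs in every document, so its IDF is zero and it contributes nothing to similarity. The first two documents share only that term, so their TF-IDF vectors are orthogonal even though their raw count vectors are not.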
TF-IDF Normalization
- L2 normalization projects document vectors onto the unit hypersphere, making cosine similarity equivalent to dot product
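- Sublinear TF scaling ($1 + \log \mathrm{tf}(t, d)$) reduces the impact of very frequent terms
- BM25 (Okapi) extends TF-IDF with document length normalization: $\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{f_{t,d} \, (k_1 + 1)}{f_{t,d} + k_1 \left(1 - b + b \, |d| / \mathrm{avgdl}\right)}$, where $|d|$ is the length of document $d$, $\mathrm{avgdl}$ is the average document length, and $k_1$ and $b$ are tuning parameters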
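The BM25 scoring rule can be sketched in a few lines. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration; several BM25 variants exist.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed, non-negative IDF variant (an assumption; others exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length-normalized, saturating term-frequency component.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "bond yields decline as investors seek safe assets".split(),
    "new vaccine shows promising results in clinical trials".split(),
    "investors buy bond funds as bond yields rise".split(),
]
scores = bm25_scores(["bond", "yields"], docs)
print(scores)
```

All three documents have the same length here, so the ranking is driven by term frequency alone: the third document mentions "bond" twice and scores highest, while the second matches no query term and scores zero.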
Topic Models
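Definition: Latent Dirichlet Allocation (LDA)
LDA (Blei et al., 2003) is a generative probabilistic model for document collections. Each document is modeled as a mixture over $K$ latent topics, and each topic is a distribution over the vocabulary:
$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad \phi_k \sim \mathrm{Dirichlet}(\beta), \quad z_{dn} \sim \mathrm{Categorical}(\theta_d), \quad w_{dn} \sim \mathrm{Categorical}(\phi_{z_{dn}})$$
where $\theta_d$ is the topic proportions for document $d$, $z_{dn}$ are topic assignments, $w_{dn}$ is the $n$-th word, and $\phi_k$ is the word distribution for topic $k$.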
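To make the generative story concrete, the sketch below samples a small synthetic corpus from it. All sizes and the symmetric Dirichlet hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes and hyperparameters below are illustrative assumptions.
n_topics, vocab_size, n_docs, doc_len = 3, 10, 5, 20
alpha = np.full(n_topics, 0.5)    # Dirichlet prior on topic proportions
beta = np.full(vocab_size, 0.1)   # Dirichlet prior on topic-word distributions

# One word distribution per topic, shape (n_topics, vocab_size).
phi = rng.dirichlet(beta, size=n_topics)

corpus = []
for _ in range(n_docs):
    theta = rng.dirichlet(alpha)                        # topic proportions for this document
    z = rng.choice(n_topics, size=doc_len, p=theta)     # topic assignment of each token
    words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])  # word id of each token
    corpus.append(words)

print(corpus[0])
```

Inference runs this process in reverse: only the word ids in `corpus` are observed, and the topic proportions, assignments, and word distributions must be estimated.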
LDA Joint Distribution
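The joint distribution of words, topics, and topic proportions for a document $d$ with $N_d$ words is:
$$p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \phi) = p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid \phi_{z_{dn}})$$
The marginal likelihood (evidence) requires integrating out latent variables:
$$p(\mathbf{w}_d \mid \alpha, \phi) = \int p(\theta_d \mid \alpha) \prod_{n=1}^{N_d} \sum_{z_{dn}=1}^{K} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid \phi_{z_{dn}}) \, d\theta_d$$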
This integral is intractable, necessitating approximate inference.
Inference Algorithms
- Variational inference (original LDA): Approximates the posterior with a factorized distribution $q(\theta_d, \mathbf{z}_d) = q(\theta_d) \prod_n q(z_{dn})$, optimized via coordinate ascent on the ELBO
- Collapsed Gibbs sampling (Griffiths & Steyvers, 2004): Integrates out $\theta$ and $\phi$, sampling each topic assignment $z_{dn}$ conditioned on all other assignments. Converges to the exact posterior as the number of samples increases
- Online LDA (Hoffman et al., 2010): Mini-batch variational inference for streaming or large-scale corpora
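Of the algorithms listed, collapsed Gibbs sampling is the shortest to write down. The following is a minimal sketch, assuming symmetric priors `alpha` and `beta` and documents given as lists of integer word ids; the function name and defaults are hypothetical, and the loop is written for clarity rather than speed.

```python
import numpy as np


def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # tokens in doc d assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                 # total tokens assigned to topic k

    # Random initial topic assignment for every token.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current token from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # p(z = k | all other assignments), up to a normalizing constant.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment and restore the counts.
                z[d][n] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Smoothed estimates from the final sample of assignments.
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + vocab_size * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    return theta, phi


docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1, 0], [3, 4, 5, 3, 4], [5, 4, 3, 3, 5]]
theta, phi = collapsed_gibbs_lda(docs, n_topics=2, vocab_size=6)
print(np.round(theta, 2))
```

The estimates here come from a single final sample; averaging over several well-separated samples after burn-in is a common refinement.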
Perplexity
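Model quality is measured by perplexity on held-out documents:
$$\mathrm{perplexity}(D_{\text{test}}) = \exp\left( - \frac{\sum_{d} \log p(\mathbf{w}_d)}{\sum_{d} N_d} \right)$$
where $N_d$ is the number of words in held-out document $d$.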
Lower perplexity indicates better generalization. In practice, compute the approximate predictive probability via variational inference or held-out likelihood estimation.
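Given per-document log-likelihoods from any model, perplexity is the exponential of the negative average log-likelihood per token. The sketch below assumes those log-likelihoods are already available; the helper name `perplexity` is hypothetical.

```python
import numpy as np


def perplexity(doc_log_likelihoods, doc_lengths):
    """Exponential of the negative average per-token log-likelihood."""
    return float(np.exp(-np.sum(doc_log_likelihoods) / np.sum(doc_lengths)))


# Sanity check: a uniform model over V words assigns log(1 / V) to every token,
# so its perplexity should equal V.
V = 50
lengths = np.array([10, 20, 5])
log_liks = lengths * np.log(1.0 / V)
print(perplexity(log_liks, lengths))
```

This gives a useful reading of the number: a perplexity of $V$ means the model is as uncertain as a uniform guess over $V$ words, so a trained model should score well below the vocabulary size.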
Sentiment Analysis
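Definition: Sentiment Classification
Sentiment analysis assigns a polarity (positive/negative/neutral) or continuous sentiment score to text. Formally, it is a mapping $f: \mathcal{D} \to \mathcal{Y}$ where $\mathcal{D}$ is the space of text documents and $\mathcal{Y} = \{\text{negative}, \text{neutral}, \text{positive}\}$ (categorical) or $\mathcal{Y} \subseteq \mathbb{R}$ (continuous).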
Lexicon-Based Approach
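A lexicon-based approach assigns sentiment scores using a pre-built dictionary:
$$S(d) = \frac{1}{|d|} \sum_{t \in d} s(t)$$
where $s(t)$ is the sentiment score for term $t$ (zero for terms not in the lexicon) and $|d|$ is the number of tokens in document $d$. This approach is interpretable but limited by vocabulary coverage and context insensitivity.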
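A lexicon scorer is only a dictionary lookup and an average. The four-word lexicon and its scores below are hypothetical stand-ins for a real sentiment dictionary, and averaging over all tokens is one of several possible normalizations.

```python
# Hypothetical mini-lexicon; real lexicons contain thousands of scored entries.
LEXICON = {"excellent": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}


def lexicon_score(text, lexicon=LEXICON):
    """Mean sentiment over all tokens; out-of-lexicon tokens contribute 0."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)


print(lexicon_score("The food was excellent"))
print(lexicon_score("The service was not good"))  # negation is ignored
```

The second sentence receives a positive score even though it is negative, which illustrates the context insensitivity mentioned above: the scorer sees "good" but has no notion of "not".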
Logistic Regression for Sentiment
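A supervised approach learns a linear decision boundary in TF-IDF space:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + \exp(-(\mathbf{w}^\top \mathbf{x} + b))}$$
The L2-regularized objective minimizes:
$$\min_{\mathbf{w}, b} \; -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \qquad p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$$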
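The objective can be minimized with plain gradient descent, as in the sketch below. It assumes the loss is the mean cross-entropy plus `(lam / 2) * ||w||^2` with an unregularized bias; the function name, learning rate, and toy data are illustrative, and production solvers such as those in scikit-learn use more sophisticated optimizers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_l2(X, y, lam=0.1, lr=0.5, n_iters=500):
    """Minimize mean cross-entropy + (lam / 2) * ||w||^2 by gradient descent."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iters):
        prob = sigmoid(X @ w + b)
        grad_w = X.T @ (prob - y) / n + lam * w   # the bias is not regularized
        grad_b = np.mean(prob - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical 2-feature data standing in for TF-IDF rows.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = fit_logistic_l2(X, y)
pred = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(w, b, pred)
```

The learned weights are directly interpretable: a positive weight marks a feature as evidence for the positive class, and the penalty `lam` shrinks all weights toward zero.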
Aspect-Based Sentiment
Beyond document-level polarity, aspect-based sentiment analysis (ABSA) extracts sentiment toward specific entities or attributes. For example, "The food was excellent but the service was terrible" contains positive sentiment toward food and negative toward service. Methods include:
- Attention-based neural models that learn aspect-document interactions
- Dependency-based feature extraction for local context
- Structured prediction models that jointly extract aspects and sentiments
Word Embeddings
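Definition: Distributed Representation
Word embeddings map each word $w$ to a dense vector $\mathbf{v}_w \in \mathbb{R}^k$ (typically with $k$ in the low hundreds) such that semantic similarity is captured by geometric proximity. Unlike sparse one-hot vectors, embeddings capture analogical relationships: $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$.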
Word2Vec (Skip-Gram)
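The Skip-Gram model (Mikolov et al., 2013) learns embeddings by predicting context words from a target word. For a target word $w_t$ and context window of size $c$, it maximizes:
$$J = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
where the softmax is:
$$p(w_O \mid w_I) = \frac{\exp(\mathbf{u}_{w_O}^\top \mathbf{v}_{w_I})}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^\top \mathbf{v}_{w_I})}$$
Here $\mathbf{v}_w$ and $\mathbf{u}_w$ are the input (target) and output (context) vectors of word $w$. Since the denominator sums over the entire vocabulary, negative sampling approximates it:
$$\log \sigma(\mathbf{u}_{w_O}^\top \mathbf{v}_{w_I}) + \sum_{i=1}^{m} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{u}_{w_i}^\top \mathbf{v}_{w_I}) \right]$$
where $\sigma$ is the sigmoid function, $m$ is the number of negative samples per pair, and $P_n(w)$ is the unigram distribution raised to the 3/4 power.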
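The negative-sampling objective for a single (target, context) pair can be sketched as follows. The function names, dimensions, and randomly initialized vectors are assumptions for illustration; a real trainer would also compute gradients and iterate over a corpus.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_loss(v_target, u_context, u_negatives):
    """Negative-sampling loss for one (target, context) pair.

    v_target:    (k,) input vector of the target word
    u_context:   (k,) output vector of the observed context word
    u_negatives: (m, k) output vectors of m sampled noise words
    """
    positive = np.log(sigmoid(u_context @ v_target))
    negative = np.sum(np.log(sigmoid(-u_negatives @ v_target)))
    return float(-(positive + negative))


def noise_distribution(unigram_counts, power=0.75):
    """Unigram distribution raised to the 3/4 power, renormalized."""
    weights = np.asarray(unigram_counts, dtype=float) ** power
    return weights / weights.sum()


rng = np.random.default_rng(0)
k, vocab_size, m = 8, 20, 5
V_in = rng.normal(scale=0.1, size=(vocab_size, k))    # input (target) embeddings
U_out = rng.normal(scale=0.1, size=(vocab_size, k))   # output (context) embeddings
p_noise = noise_distribution(rng.integers(1, 100, size=vocab_size))
neg_ids = rng.choice(vocab_size, size=m, p=p_noise)
print(sgns_loss(V_in[3], U_out[7], U_out[neg_ids]))
```

Each update touches only `1 + m` output vectors instead of all `vocab_size` of them, which is the source of the speedup over the full softmax. Raising counts to the 3/4 power flattens the noise distribution, so rare words are sampled somewhat more often than their raw frequency would suggest.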
GloVe (Global Vectors)
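GloVe (Pennington et al., 2014) learns embeddings from the global co-occurrence matrix. Let $X_{ij}$ be the count of word $j$ in the context of word $i$. The objective is:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $f$ is a weighting function that downweights very frequent co-occurrences. GloVe combines the advantages of global matrix factorization (LSA) and local context window methods (Word2Vec).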
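The weighting function and the weighted least-squares loss can be sketched as below. The defaults `x_max=100` and `alpha=0.75` follow the values reported in the GloVe paper; the function names and the tiny co-occurrence matrix are illustrative assumptions.

```python
import numpy as np


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): grows as (x / x_max)^alpha, then is capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted least-squares GloVe objective over nonzero co-occurrence counts."""
    i, j = np.nonzero(X)                       # zero counts get weight f(0) = 0
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float(np.sum(glove_weight(X[i, j]) * err ** 2))


# Hypothetical 3-word co-occurrence matrix and random parameters.
rng = np.random.default_rng(0)
X = np.array([[0.0, 12.0, 3.0], [12.0, 0.0, 150.0], [3.0, 150.0, 0.0]])
k = 4
W, W_ctx = rng.normal(size=(3, k)), rng.normal(size=(3, k))
b, b_ctx = np.zeros(3), np.zeros(3)
print(glove_loss(X, W, W_ctx, b, b_ctx))
```

Because $f(0) = 0$, pairs that never co-occur drop out of the sum, which sidesteps $\log 0$ and keeps the cost proportional to the number of nonzero entries.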
Embedding Properties
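- Analogy reasoning: $\arg\max_{w} \cos(\mathbf{v}_w, \mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c)$ tests linear relationships of the form "$a$ is to $b$ as $c$ is to $w$"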
- Clustering: K-means on embeddings reveals semantic clusters
- Compositionality: Word embeddings can be averaged or summed for sentence representations (though this loses word order)
- Subword information: FastText extends Word2Vec with character n-gram embeddings, handling rare and out-of-vocabulary words
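The analogy test from the list above can be sketched with hand-built vectors. The two-dimensional embeddings below are constructed so that the analogy holds exactly and are not learned from data; with real embeddings the match is only approximate.

```python
import numpy as np

# Hand-built 2-D toy vectors (axis 0 ~ royalty, axis 1 ~ gender); purely illustrative.
emb = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def analogy(a, b, c, emb):
    """Return the word whose vector is most cosine-similar to v_b - v_a + v_c."""
    query = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):          # exclude the query words, as is standard
            continue
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = word, sim
    return best


print(analogy("man", "king", "woman", emb))
```

Excluding the three query words matters in practice: without it, the nearest neighbor of the query vector is frequently one of the inputs.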
Document Classification with Naive Bayes
Multinomial Naive Bayes
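The Multinomial Naive Bayes classifier models documents as bags of words generated from a mixture of class-conditional multinomial distributions:
$$P(d \mid c) \propto \prod_{t=1}^{V} \theta_{c,t}^{\, f_{t,d}}$$
where $\theta_{c,t}$ is the probability of term $t$ in class $c$. Classification uses Bayes' rule:
$$\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{t=1}^{V} f_{t,d} \log \theta_{c,t} \right]$$
The maximum likelihood estimate $\hat{\theta}_{c,t} = N_{c,t} / \sum_{t'} N_{c,t'}$, where $N_{c,t}$ is the total count of term $t$ in training documents of class $c$, is smoothed with Laplace (add-1) smoothing to avoid zero probabilities: $\hat{\theta}_{c,t} = \frac{N_{c,t} + 1}{\sum_{t'} N_{c,t'} + V}$.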
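Training and prediction both reduce to counting and a matrix product, as the sketch below shows. It assumes a dense count matrix and uses hypothetical function names and toy data; scikit-learn's `MultinomialNB` is the practical choice.

```python
import numpy as np


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log term probabilities per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    term_counts = np.array([X[y == c].sum(axis=0) for c in classes])  # shape (C, V)
    smoothed = term_counts + alpha
    log_theta = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_theta


def predict_multinomial_nb(X, classes, log_prior, log_theta):
    """Pick the class maximizing log P(c) + sum_t f_{t,d} * log theta_{c,t}."""
    scores = X @ log_theta.T + log_prior
    return classes[np.argmax(scores, axis=1)]


# Hypothetical count matrix: columns = ["rates", "bond", "vaccine", "trial"].
X = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]])
y = np.array([0, 0, 1, 1])
model = fit_multinomial_nb(X, y)
X_new = np.array([[1, 1, 0, 0], [0, 0, 0, 3]])
print(predict_multinomial_nb(X_new, *model))
```

Working in log space turns the product over terms into a sum and avoids numerical underflow on long documents. With `alpha=1.0`, a term never seen in a class still receives a small nonzero probability instead of vetoing that class.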
Why Naive Bayes Works for Text
Despite the strong independence assumption (conditional independence of words given the class), Naive Bayes performs surprisingly well for text classification because:
- The classification rule only requires correct ranking of class posteriors, not accurate probability estimates
- High-dimensional text data often has sufficient class-separating signal even with independence violations
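- The model is extremely fast to train ($O(N \bar{L})$, a single counting pass over $N$ training documents of average length $\bar{L}$) and predict ($O(C \, L_d)$ per document, for $C$ classes and a document of $L_d$ tokens)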
- It works well with small training sets due to strong inductive bias
Evaluation Metrics
Classification Metrics
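For binary classification with true positives (TP), false positives (FP), true negatives (TN), false negatives (FN):
$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \quad \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
Macro-averaging computes precision/recall/F1 per class and averages; micro-averaging aggregates TP/FP/FN across classes before computing metrics. For single-label classification, micro-averaged precision, recall, and F1 all equal accuracy.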
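The difference between the two averaging schemes is easiest to see on a small multi-class example. The labels below are hypothetical, and the helper names are illustrative; `sklearn.metrics` provides equivalent functions.

```python
import numpy as np


def per_class_counts(y_true, y_pred, n_classes):
    """Return arrays of TP, FP, FN for each class, treated one-vs-rest."""
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)


y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])
tp, fp, fn = per_class_counts(y_true, y_pred, n_classes=3)

macro_f1 = float(np.mean(f1(tp, fp, fn)))            # average of per-class F1
micro_f1 = float(f1(tp.sum(), fp.sum(), fn.sum()))   # pool counts, then compute F1
accuracy = float(np.mean(y_true == y_pred))
print(macro_f1, micro_f1, accuracy)
```

Micro-F1 coincides with accuracy because every misclassification is counted once as a false positive and once as a false negative. Macro-F1 is lower here because the smallest class is predicted worst and each class carries equal weight.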
Area Under the ROC Curve
The ROC curve plots true positive rate vs false positive rate at varying thresholds. The AUC is:
$$\mathrm{AUC} = \int_0^1 \mathrm{TPR} \; d(\mathrm{FPR})$$
AUC is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. An AUC of 0.5 indicates random performance; 1.0 is perfect.
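The ranking interpretation gives a direct way to compute AUC without tracing the curve: compare every positive score with every negative score. The sketch below assumes binary labels in {0, 1} and counts ties as one half; it is quadratic in the number of examples, so it suits small samples only.

```python
import numpy as np


def auc_pairwise(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count one half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive-negative pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_pairwise(y_true, scores))
```

Of the four positive-negative pairs in this example, three are ordered correctly (only 0.35 versus 0.4 is not), so the AUC is 3/4.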
Metric Choice for Imbalanced Text
- Accuracy is misleading when classes are imbalanced (e.g., 95% negative reviews)
- F1 is appropriate when both precision and recall matter equally
- $F_\beta$ weights recall more heavily when $\beta > 1$ (missed positives are costly)
- AUC is threshold-independent and useful for comparing classifiers
- Precision@k measures the fraction of relevant documents in the top-k results (information retrieval)
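Two of the metrics in this list are short enough to sketch directly. The function names are hypothetical, and `f_beta` uses the standard definition $F_\beta = (1 + \beta^2) \cdot PR / (\beta^2 P + R)$.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant (1) rather than not (0)."""
    return sum(ranked_relevance[:k]) / k


# A classifier with low precision but perfect recall.
print(f_beta(0.5, 1.0, beta=1.0), f_beta(0.5, 1.0, beta=2.0))

# Relevance flags of results in ranked order (hypothetical).
print(precision_at_k([1, 0, 1, 1, 0], k=3))
```

With precision 0.5 and recall 1.0, moving from $\beta = 1$ to $\beta = 2$ raises the score from 2/3 to 5/6, since the metric now rewards the high recall more than it penalizes the low precision.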
Python Implementation
```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Sample corpus ---
corpus = [
    "The stock market rallied on strong earnings reports",
    "New vaccine shows promising results in clinical trials",
    "Federal Reserve raises interest rates to combat inflation",
    "Tech companies report record profits amid AI boom",
    "Hospital implements new patient safety protocols",
    "Bond yields decline as investors seek safe assets",
    "Study finds link between exercise and mental health",
    "GDP growth exceeds expectations in third quarter",
]
labels = np.array([0, 1, 0, 0, 1, 0, 1, 0])  # 0=finance, 1=health

# --- TF-IDF ---
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF shape:", X_tfidf.shape)
print("Top features:", tfidf.get_feature_names_out()[:10])

# --- LDA Topic Model (fit on raw counts, not TF-IDF) ---
count_vec = CountVectorizer(max_features=1000, stop_words='english')
X_counts = count_vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=42, max_iter=50)
doc_topics = lda.fit_transform(X_counts)
print("\n=== LDA Topics ===")
feature_names = count_vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    # Indices of the 8 highest-weight words, in descending order
    top_words = [feature_names[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")

# --- Naive Bayes Classification ---
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.25, random_state=42, stratify=labels
)
nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
y_prob_nb = nb.predict_proba(X_test)[:, 1]
print("\n=== Naive Bayes ===")
print(classification_report(y_test, y_pred_nb, target_names=['Finance', 'Health']))

# --- Logistic Regression ---
lr = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
y_prob_lr = lr.predict_proba(X_test)[:, 1]
print("=== Logistic Regression ===")
print(classification_report(y_test, y_pred_lr, target_names=['Finance', 'Health']))

# --- Word Embeddings ---
# Random synthetic vectors stand in for pretrained embeddings (demonstration only)
vocab = tfidf.get_feature_names_out()
k = 50  # embedding dimension
embeddings = np.random.randn(len(vocab), k) * 0.1


# Document embedding = TF-IDF weighted sum of word embeddings, L2-normalized
def doc_embedding(tfidf_matrix, embeddings):
    doc_embs = tfidf_matrix @ embeddings
    norms = np.linalg.norm(doc_embs, axis=1, keepdims=True)
    norms[norms == 0] = 1  # avoid division by zero for empty documents
    return doc_embs / norms


X_emb = doc_embedding(X_tfidf, embeddings)
print(f"\nDocument embedding shape: {X_emb.shape}")
# Rows are unit-length, so the dot product is the cosine similarity
print("Similarity matrix (cosine):")
print(np.round(X_emb @ X_emb.T, 3))
```
Summary
Key Takeaways: Text Mining and Statistical NLP
- TF-IDF weighting balances term frequency with discriminative power; cosine similarity in TF-IDF space measures document similarity. BM25 extends TF-IDF with document length normalization.
- LDA models documents as mixtures of latent topics with Dirichlet priors. Variational inference and collapsed Gibbs sampling each approximate the intractable posterior. Perplexity measures generalization.
- Sentiment analysis ranges from lexicon-based scoring to supervised classification. Logistic regression and SVMs in TF-IDF space are strong baselines; aspect-based methods capture entity-level polarity.
- Word embeddings (Word2Vec, GloVe) map words to dense vectors capturing semantic relationships. Skip-Gram with negative sampling scales to large corpora; GloVe leverages global co-occurrence statistics.
- Evaluation: Precision, recall, F1, and AUC are appropriate depending on class balance and cost structure. Macro vs micro averaging matters for multi-class problems. Always report multiple metrics.