Text Mining and Statistical NLP

Advanced Statistical Methods

Extracting Knowledge From Unstructured Text

Statistical text mining transforms raw text into quantitative representations for analysis. TF-IDF, topic models, and word embeddings enable automated understanding of document collections at scale.

Sentiment analysis — Track public opinion trends from social media and product reviews
Healthcare — Extract clinical information from medical records and research papers
Legal discovery — Automatically classify and prioritize relevant documents in litigation

Text mining turns the world's unstructured text into analyzable, quantifiable data.

Text mining extracts structured statistical information from unstructured text. The central challenge is converting variable-length text sequences into fixed-dimensional numerical representations amenable to statistical analysis. This lesson develops the mathematical foundations of text representation, topic modeling, sentiment analysis, word embeddings, and document classification.

Bag-of-Words Representation

Definition: Bag-of-Words (BoW)

A document-term matrix $\mathbf{X} \in \mathbb{N}^{d \times V}$ represents $d$ documents over a vocabulary of $V$ terms. Each entry $X_{ij}$ counts the occurrences of term $j$ in document $i$ . The bag-of-words model discards word order, treating each document as a multiset of terms.
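As a minimal sketch of the definition above, the following builds a document-term count matrix by hand. The two toy sentences and the whitespace tokenizer are assumptions made for illustration; a real pipeline would also normalize case and punctuation.

```python
from collections import Counter
import numpy as np

# Toy corpus (hypothetical sentences, not part of the lesson's data)
docs = ["the cat sat on the mat", "the dog chased the cat"]

# Whitespace tokenization; word order is discarded from here on
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique terms, each mapped to a column index
vocab = sorted({term for tokens in tokenized for term in tokens})
term_index = {term: j for j, term in enumerate(vocab)}

# X[i, j] = number of occurrences of term j in document i
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, tokens in enumerate(tokenized):
    for term, count in Counter(tokens).items():
        X[i, term_index[term]] = count

print(vocab)
print(X)
```

Each row sums to the length of its document, and any two sentences containing the same words in a different order would map to the same row.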

Term Frequency

The raw term frequency is $tf(t, d) = n_{t,d}$ where $n_{t,d}$ is the count of term $t$ in document $d$ . Common transformations include:

tf_{\text{log}}(t, d) = \log(1 + n_{t,d}), \quad tf_{\text{aug}}(t, d) = 0.5 + 0.5 \cdot \frac{n_{t,d}}{\max_{t'} n_{t',d}}

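The augmented form rescales counts by the document's most frequent term, which reduces the bias toward long documents, while the logarithmic form dampens the effect of a term repeating many times within a document.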
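The transformations above can be compared on a single count vector. This is a small sketch; the counts are made up for illustration.

```python
import numpy as np

# Raw counts n_{t,d} for one hypothetical document over a 4-term vocabulary
counts = np.array([4.0, 1.0, 0.0, 2.0])

tf_raw = counts
tf_log = np.log1p(counts)                    # log(1 + n)
tf_aug = 0.5 + 0.5 * counts / counts.max()   # 0.5 + 0.5 * n / max_t' n

# Note: applied literally, the augmented formula gives 0.5 to absent terms,
# so implementations typically apply it only to terms with n > 0.
print(tf_log)
print(tf_aug)
```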

Inverse Document Frequency

Inverse document frequency (IDF) measures term specificity across the corpus:

idf(t) = \log \frac{d}{|\{d' : t \in d'\}|}

where $d$ is the total number of documents and $|\{d' : t \in d'\}|$ is the document frequency. Variants include smoothed IDF: $idf(t) = \log\frac{1 + d}{1 + df(t)} + 1$ .
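A short sketch of both IDF variants on a toy count matrix (the matrix is an illustrative assumption). A term that occurs in every document receives a plain IDF of zero, while the smoothed variant keeps it at 1.

```python
import numpy as np

# Hypothetical count matrix: 4 documents (rows) by 3 terms (columns)
X = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [3, 0, 0],
    [1, 0, 2],
])
n_docs = X.shape[0]
df = (X > 0).sum(axis=0)                           # document frequency per term

idf_plain = np.log(n_docs / df)                    # log(d / df(t))
idf_smooth = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed variant

print(df, idf_plain, idf_smooth)
```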

TF-IDF Weighting

The TF-IDF score combines term frequency and inverse document frequency:

\text{tf-idf}(t, d) = tf(t, d) \cdot idf(t)

TF-IDF downweights common terms (high $tf$ , low $idf$ ) and upweights discriminative terms (high $tf$ , high $idf$ ). The TF-IDF matrix $\mathbf{X}_{\text{tfidf}}$ is the input to many text mining algorithms. Cosine similarity between TF-IDF vectors measures document similarity:

\cos(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i^\top \mathbf{x}_j}{\|\mathbf{x}_i\| \|\mathbf{x}_j\|}
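To make the weighting and the similarity measure concrete, here is a small sketch using raw term frequency and the unsmoothed IDF on a toy matrix (an illustrative assumption). Returning 0 for an all-zero vector is a convention chosen here, not something the formula defines.

```python
import numpy as np

# Hypothetical count matrix: 4 documents by 3 terms
X = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [3, 0, 0],
    [1, 0, 2],
], dtype=float)

idf = np.log(X.shape[0] / (X > 0).sum(axis=0))
X_tfidf = X * idf   # tf(t, d) * idf(t), broadcast across rows

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

sim_03 = cosine_similarity(X_tfidf[0], X_tfidf[3])  # share only term 2
sim_01 = cosine_similarity(X_tfidf[0], X_tfidf[1])  # share only term 0
sim_02 = cosine_similarity(X_tfidf[0], X_tfidf[2])  # document 2 has only term 0
print(sim_03, sim_01, sim_02)
```

Term 0 appears in every document, so its weight is zero and it contributes nothing to any similarity. Documents 0 and 3 differ in length but point in the same direction, so their cosine similarity is 1.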

TF-IDF Normalization

L2 normalization projects document vectors onto the unit hypersphere, making cosine similarity equivalent to dot product
Sublinear TF scaling ( $1 + \log tf$ ) reduces the impact of very frequent terms
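BM25 (Okapi) extends TF-IDF with term-frequency saturation and document length normalization: $tf_{\text{BM25}} = \frac{tf \cdot (k_1 + 1)}{tf + k_1(1 - b + b \cdot |d|/\bar{d})}$ , where $|d|$ is the document length, $\bar{d}$ is the average document length in the corpus, and $k_1$ and $b$ are tuning parameters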
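The BM25 term-frequency component can be sketched as a single function. The defaults `k1=1.2` and `b=0.75` are commonly used values assumed here for illustration; they are not specified in this lesson.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating, length-normalized term-frequency component of BM25."""
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)

# The same raw count is worth more in a short document than in a long one
short_doc = bm25_tf(tf=3, doc_len=50, avg_doc_len=100)
long_doc = bm25_tf(tf=3, doc_len=200, avg_doc_len=100)

# The score saturates: it approaches k1 + 1 as tf grows but never reaches it
saturated = bm25_tf(tf=1000, doc_len=100, avg_doc_len=100)
print(short_doc, long_doc, saturated)
```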

Topic Models

Definition: Latent Dirichlet Allocation (LDA)

LDA (Blei et al., 2003) is a generative probabilistic model for document collections. Each document $d$ is modeled as a mixture over $K$ latent topics, and each topic is a distribution over the vocabulary:

\begin{aligned} \boldsymbol{\theta}_d &\sim \text{Dirichlet}(\boldsymbol{\alpha}) \\ z_{dn} &\sim \text{Multinomial}(\boldsymbol{\theta}_d) \quad \text{for each word position } n \\ w_{dn} &\sim \text{Multinomial}(\boldsymbol{\phi}_{z_{dn}}) \end{aligned}

where $\boldsymbol{\theta}_d$ is the topic proportions for document $d$ , $\mathbf{z}_d$ are topic assignments, $w_{dn}$ is the $n$ -th word, and $\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\beta})$ is the word distribution for topic $k$ .
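The generative process can be simulated directly. This is a minimal sketch: the sizes, the symmetric prior values, and the fixed document length are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, D, N_d = 3, 8, 5, 20      # topics, vocabulary size, documents, words per document
alpha = np.full(K, 0.5)         # symmetric Dirichlet prior on topic proportions
beta = np.full(V, 0.5)          # symmetric Dirichlet prior on topic-word distributions

phi = rng.dirichlet(beta, size=K)   # phi[k] ~ Dirichlet(beta): word distribution of topic k

theta = np.zeros((D, K))
corpus = []
for d in range(D):
    theta[d] = rng.dirichlet(alpha)                      # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=N_d, p=theta[d])              # z_dn ~ Multinomial(theta_d)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])   # w_dn ~ Multinomial(phi_{z_dn})
    corpus.append(w)

print(theta.round(2))
print(corpus[0])
```

Inference runs this process in reverse: given only the word ids in `corpus`, it estimates `theta` and `phi`.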

LDA Joint Distribution

The joint distribution of words, topics, and topic proportions is:

P(\mathbf{w}, \mathbf{z}, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{k=1}^{K} P(\boldsymbol{\phi}_k \mid \boldsymbol{\beta}) \prod_{d=1}^{D} P(\boldsymbol{\theta}_d \mid \boldsymbol{\alpha}) \prod_{n=1}^{N_d} P(z_{dn} \mid \boldsymbol{\theta}_d) P(w_{dn} \mid \boldsymbol{\phi}_{z_{dn}})

The marginal likelihood (evidence) requires integrating out latent variables:

P(\mathbf{w} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = \int \int P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) P(\boldsymbol{\phi} \mid \boldsymbol{\beta}) \prod_{d,n} \sum_{z_{dn}=1}^{K} P(z_{dn} \mid \boldsymbol{\theta}_d) P(w_{dn} \mid \boldsymbol{\phi}_{z_{dn}}) \, d\boldsymbol{\theta} \, d\boldsymbol{\phi}

This integral is intractable, necessitating approximate inference.

Inference Algorithms

Variational inference (original LDA): Approximates the posterior with a factorized distribution $q(\mathbf{z}, \boldsymbol{\theta}, \boldsymbol{\phi}) = \prod_d q(\boldsymbol{\theta}_d) \prod_{dn} q(z_{dn}) \prod_k q(\boldsymbol{\phi}_k)$ , optimized via coordinate ascent on the ELBO
Collapsed Gibbs sampling (Griffiths & Steyvers, 2004): Integrates out $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ analytically and samples each topic assignment $z_{dn}$ conditioned on all other assignments. The sampler targets the exact posterior, so estimates converge as the number of samples grows
Online LDA (Hoffman et al., 2010): Mini-batch variational inference for streaming or large-scale corpora
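Of the algorithms listed above, collapsed Gibbs sampling is the shortest to write down. The sketch below assumes symmetric scalar priors `alpha` and `beta` and uses a tiny hand-made corpus of word ids; it is an unoptimized illustration, not a production sampler.

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """docs: list of lists of word ids. Returns (doc-topic counts, topic-word counts)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # words in document d assigned to topic k
    n_kw = np.zeros((K, V))           # times word w is assigned to topic k
    n_k = np.zeros(K)                 # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # random initial assignments
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from all counts
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional of z_dn given every other assignment
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    return n_dk, n_kw

# Hypothetical corpus: word ids 0-2 and 3-5 tend to appear in different documents
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
n_dk, n_kw = collapsed_gibbs_lda(docs, V=6, K=2)

# Point estimates of theta and phi from the final counts
theta_hat = (n_dk + 0.1) / (n_dk + 0.1).sum(axis=1, keepdims=True)
phi_hat = (n_kw + 0.01) / (n_kw + 0.01).sum(axis=1, keepdims=True)
print(theta_hat.round(2))
```

A single final sample is used here for simplicity; averaging over several samples after burn-in is the usual way to reduce variance.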

Perplexity

Model quality is measured by perplexity on held-out documents:

\text{Perplexity} = \exp\left(-\frac{\sum_{d=1}^{D}\sum_{n=1}^{N_d} \log P(w_{dn} \mid \mathbf{w}_{-dn})}{\sum_{d=1}^{D} N_d}\right)

Lower perplexity indicates better generalization. In practice, compute the approximate predictive probability via variational inference or held-out likelihood estimation.
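Given per-token log-probabilities from any model, the perplexity formula reduces to a few lines. In this sketch the log-probabilities are supplied directly rather than computed by a topic model, which is a simplifying assumption.

```python
import numpy as np

def perplexity(token_log_probs):
    """token_log_probs: one array of per-token log-probabilities per held-out document."""
    total_log_prob = sum(float(np.sum(lp)) for lp in token_log_probs)
    total_tokens = sum(len(lp) for lp in token_log_probs)
    return float(np.exp(-total_log_prob / total_tokens))

# Sanity check: a model assigning uniform probability 1/V to every token has perplexity V
V = 50
uniform_docs = [np.full(10, np.log(1.0 / V)), np.full(7, np.log(1.0 / V))]
ppl_uniform = perplexity(uniform_docs)
print(ppl_uniform)
```

This gives a useful reading of the number: a perplexity of 50 means the model is, on average, as uncertain as a uniform choice among 50 words.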

Sentiment Analysis

Definition: Sentiment Classification

Sentiment analysis assigns a polarity (positive/negative/neutral) or continuous sentiment score to text. Formally, it is a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$ where $\mathcal{X}$ is the space of text documents and $\mathcal{Y} = \{-1, 0, +1\}$ (categorical) or $\mathcal{Y} = [-1, +1]$ (continuous).

Lexicon-Based Approach

A lexicon-based approach assigns sentiment scores using a pre-built dictionary:

\text{Sentiment}(d) = \frac{\sum_{t \in d} s(t) \cdot \text{tf-idf}(t, d)}{\sum_{t \in d} \text{tf-idf}(t, d)}

where $s(t) \in [-1, +1]$ is the sentiment score for term $t$ . This approach is interpretable but limited by vocabulary coverage and context insensitivity.
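The weighted average above can be sketched as follows. The tiny lexicon and the TF-IDF weights are invented for illustration, and treating out-of-lexicon terms as neutral ($s(t) = 0$) is an assumption of this sketch.

```python
def lexicon_sentiment(weights, lexicon):
    """weights: term -> tf-idf weight for one document; lexicon: term -> score in [-1, 1]."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    # Terms missing from the lexicon are treated as neutral (score 0)
    return sum(lexicon.get(term, 0.0) * w for term, w in weights.items()) / total

# Hypothetical lexicon and document weights
lexicon = {"excellent": 0.9, "terrible": -0.8}
weights = {"excellent": 0.6, "food": 0.3, "terrible": 0.1}

score = lexicon_sentiment(weights, lexicon)
print(round(score, 2))
```

The neutral term "food" still contributes to the denominator, which pulls the score toward zero. This is one concrete form of the vocabulary-coverage limitation.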

Logistic Regression for Sentiment

A supervised approach learns a linear decision boundary in TF-IDF space:

P(y = +1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}

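With labels recoded as $y_i \in \{0, 1\}$ and $\hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i + b)$ , the L2-regularized objective minimizes: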

\mathcal{L}(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i) \right] + \lambda \|\mathbf{w}\|_2^2
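A compact sketch of this objective and plain gradient descent on it. Labels are coded 0/1 so the cross-entropy terms apply directly; the toy features, the learning rate, and the value of `lam` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, b, X, y, lam):
    """Cross-entropy plus L2 penalty on w; y must be coded 0/1."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)  # avoid log(0)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(nll + lam * np.sum(w ** 2))

def gradients(w, b, X, y, lam):
    p = sigmoid(X @ w + b)
    return X.T @ (p - y) + 2 * lam * w, float(np.sum(p - y))

# Hypothetical 2-feature data standing in for TF-IDF vectors
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
y = np.array([1, 1, 0, 0])
lam, lr = 0.01, 0.1

w, b = np.zeros(2), 0.0
initial_loss = logistic_loss(w, b, X, y, lam)
for _ in range(200):
    grad_w, grad_b = gradients(w, b, X, y, lam)
    w -= lr * grad_w
    b -= lr * grad_b
final_loss = logistic_loss(w, b, X, y, lam)
print(initial_loss, final_loss, w)
```

In practice a library solver such as scikit-learn's `LogisticRegression` would be used; note that its `C` parameter is an inverse regularization strength.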

Aspect-Based Sentiment

Beyond document-level polarity, aspect-based sentiment analysis (ABSA) extracts sentiment toward specific entities or attributes. For example, "The food was excellent but the service was terrible" contains positive sentiment toward food and negative toward service. Methods include:

Attention-based neural models that learn aspect-document interactions
Dependency-based feature extraction for local context
Structured prediction models that jointly extract aspects and sentiments

Word Embeddings

Definition: Distributed Representation

Word embeddings map each word $w$ to a dense vector $\mathbf{v}_w \in \mathbb{R}^k$ (typically $k = 100$ – $300$ ) such that semantic similarity is captured by geometric proximity. Unlike sparse one-hot vectors, embeddings capture analogical relationships: $\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$ .

Word2Vec (Skip-Gram)

The Skip-Gram model (Mikolov et al., 2013) learns embeddings by predicting context words from a target word. For a target word $w_t$ and context window of size $c$ :

\mathcal{L}(\theta) = -\sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} \mid w_t)

where the softmax is:

P(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I})}{\sum_{w \in V} \exp(\mathbf{v}_w' \cdot \mathbf{v}_{w_I})}

Since the denominator sums over the entire vocabulary, it is expensive to evaluate for every training pair; negative sampling replaces it with a cheaper surrogate:

\log P(w_O \mid w_I) \approx \log \sigma(\mathbf{v}_{w_O}' \cdot \mathbf{v}_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n} [\log \sigma(-\mathbf{v}_{w_i}' \cdot \mathbf{v}_{w_I})]

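where $\sigma$ is the sigmoid function, $k$ is the number of negative samples per positive pair (distinct from the embedding dimension $k$ above), and $P_n(w) \propto f(w)^{3/4}$ is the noise distribution: the unigram frequency $f(w)$ raised to the 3/4 power and renormalized.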
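The negative-sampling objective for one (target, context) pair can be sketched as below, with the expectation replaced by a handful of sampled noise words. The unigram counts, embedding size, and random initial vectors are assumptions; no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unigram counts for a 5-word vocabulary
counts = np.array([100.0, 50.0, 10.0, 5.0, 1.0])
noise = counts ** 0.75
noise /= noise.sum()            # P_n(w) proportional to f(w)^(3/4)

dim, n_neg = 8, 3               # embedding dimension, negatives per positive pair
V_in = rng.normal(scale=0.1, size=(len(counts), dim))    # input vectors v_w
V_out = rng.normal(scale=0.1, size=(len(counts), dim))   # output vectors v'_w

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)   # numerically stable log(sigmoid(x))

def neg_sampling_objective(target, context):
    """Monte Carlo estimate of the negative-sampling objective for one pair."""
    negatives = rng.choice(len(counts), size=n_neg, p=noise)
    positive_term = log_sigmoid(V_out[context] @ V_in[target])
    negative_term = log_sigmoid(-(V_out[negatives] @ V_in[target])).sum()
    return float(positive_term + negative_term)

value = neg_sampling_objective(target=0, context=1)
print(noise.round(3), value)
```

Raising counts to the 3/4 power flattens the distribution: the most frequent word gets less than its raw share of the probability mass, so rarer words are drawn as negatives more often than under raw unigram sampling.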

GloVe (Global Vectors)

GloVe (Pennington et al., 2014) learns embeddings from the global co-occurrence matrix. Let $X_{ij}$ be the count of word $j$ in the context of word $i$ . The objective is:

\mathcal{L} = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

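where $f(x) = \min((x/x_{\max})^{0.75}, 1)$ is a weighting function that caps the influence of very frequent co-occurrences at 1 and downweights rare, noisy ones. Pairs with $X_{ij} = 0$ receive zero weight, so the sum effectively runs over observed co-occurrences only. GloVe combines the advantages of global matrix factorization (LSA) and local context window methods (Word2Vec).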
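The weighting function and the objective can be written out directly. This sketch evaluates the loss for given parameters only, with no optimizer; `x_max=100` follows the value reported by Pennington et al., and the tiny co-occurrence matrix is made up.

```python
import numpy as np

def glove_weight(x, x_max=100.0, power=0.75):
    """f(x) = min((x / x_max)^power, 1)."""
    return np.minimum((x / x_max) ** power, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares loss over the nonzero entries of X."""
    i, j = np.nonzero(X)   # f(0) = 0, so zero counts contribute nothing
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]
    return float(np.sum(glove_weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2))

weights = glove_weight(np.array([0.0, 10.0, 100.0, 1000.0]))
print(weights)

# Hypothetical 2-word co-occurrence matrix, with biases chosen to fit log X exactly
X = np.array([[0.0, 2.0], [3.0, 0.0]])
W = np.zeros((2, 4))
W_tilde = np.zeros((2, 4))
b = np.zeros(2)
b_tilde = np.array([np.log(3.0), np.log(2.0)])
perfect_fit = glove_loss(X, W, W_tilde, b, b_tilde)
```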

Embedding Properties

Analogy reasoning: $\mathbf{v}_b - \mathbf{v}_a + \mathbf{v}_c \approx \mathbf{v}_d$ tests linear relationships
Clustering: K-means on embeddings reveals semantic clusters
Compositionality: Word embeddings can be averaged or summed for sentence representations (though this loses word order)
Subword information: FastText extends Word2Vec with character n-gram embeddings, handling rare and out-of-vocabulary words
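The analogy test in the list above can be sketched with hand-built vectors. These 2-D vectors are constructed so that the relationship holds exactly; they are not trained embeddings, and real embeddings satisfy such analogies only approximately.

```python
import numpy as np

# Hand-built toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative only)
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def analogy(a, b, c):
    """Return the word d maximizing cos(v_b - v_a + v_c, v_d), excluding a, b, c."""
    query = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

answer = analogy("man", "king", "woman")   # man : king :: woman : ?
print(answer)
```

Excluding the three query words from the candidates is standard practice in analogy benchmarks, since the nearest neighbor of the query vector is often one of the inputs.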

Document Classification with Naive Bayes

Multinomial Naive Bayes

The Multinomial Naive Bayes classifier models documents as bags of words generated from a mixture of class-conditional multinomial distributions:

P(\mathbf{x} \mid y = c) = \frac{(\sum_j x_j)!}{\prod_j x_j!} \prod_{j=1}^{V} \theta_{cj}^{x_j}

where $\theta_{cj}$ is the probability of term $j$ in class $c$ . Classification uses Bayes' rule:

\hat{y} = \arg\max_c \left[ \log P(y = c) + \sum_{j=1}^{V} x_j \log \theta_{cj} \right]

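The maximum likelihood estimate $\hat{\theta}_{cj} = \frac{n_{cj}}{\sum_{j'} n_{cj'}}$ , where $n_{cj}$ is the total count of term $j$ in training documents of class $c$ , assigns zero probability to any term unseen in a class. Laplace (add-1) smoothing avoids this: $\hat{\theta}_{cj} = \frac{n_{cj} + 1}{\sum_{j'} n_{cj'} + V}$ .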
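The classifier is short enough to write from scratch. The sketch below implements the decision rule above with Laplace smoothing on a made-up count matrix; the multinomial coefficient is dropped because it does not depend on the class.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """X: (n_docs, V) term counts, y: class labels. alpha=1 is Laplace smoothing."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.array([X[y == c].sum(axis=0) for c in classes])   # n_cj
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    return classes, log_prior, np.log(theta)

def predict_multinomial_nb(X, classes, log_prior, log_theta):
    # score[i, c] = log P(y = c) + sum_j x_ij * log theta_cj
    scores = log_prior + X @ log_theta.T
    return classes[np.argmax(scores, axis=1)]

# Hypothetical training counts over a 3-term vocabulary
X_train = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 1], [0, 2, 2]])
y_train = np.array([0, 0, 1, 1])
classes, log_prior, log_theta = fit_multinomial_nb(X_train, y_train)

X_new = np.array([[2, 0, 0], [0, 2, 1]])
preds = predict_multinomial_nb(X_new, classes, log_prior, log_theta)
print(preds)
```

Term 0 never occurs in class 1, yet its smoothed probability is 1/11 rather than 0, so a document containing it can still receive a finite score for that class.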

Why Naive Bayes Works for Text

Despite the strong independence assumption (conditional independence of words given the class), Naive Bayes performs surprisingly well for text classification because:

The classification rule only requires correct ranking of class posteriors, not accurate probability estimates
High-dimensional text data often has sufficient class-separating signal even with independence violations
The model is extremely fast to train ( $O(nV)$ ) and predict ( $O(V)$ per document)
It works well with small training sets due to strong inductive bias

Evaluation Metrics

Classification Metrics

For binary classification with true positives (TP), false positives (FP), true negatives (TN), false negatives (FN):

\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}

\text{F}_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Macro-averaging computes precision/recall/F1 per class and averages them with equal weight per class; micro-averaging aggregates TP/FP/FN across classes before computing metrics. For single-label multi-class classification, micro-averaged precision, recall, and F1 all equal accuracy.
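The difference between the two averaging schemes is easiest to see on a small example. The labels and predictions below are invented for illustration.

```python
import numpy as np

def per_class_counts(y_true, y_pred, n_classes):
    """One-vs-rest TP, FP, FN for each class."""
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    return tp, fp, fn

def f1_score(tp, fp, fn):
    # Equivalent to 2PR / (P + R), written in terms of counts
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Hypothetical 3-class predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])

tp, fp, fn = per_class_counts(y_true, y_pred, n_classes=3)
macro_f1 = float(np.mean([f1_score(t, p, n) for t, p, n in zip(tp, fp, fn)]))
micro_f1 = float(f1_score(tp.sum(), fp.sum(), fn.sum()))
accuracy = float(np.mean(y_true == y_pred))
print(macro_f1, micro_f1, accuracy)
```

Here the micro-averaged F1 coincides with accuracy (0.75), while the macro average is lower because the small class 1, with F1 of 0.5, counts as much as the larger class 0.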

Area Under the ROC Curve

The ROC curve plots true positive rate vs false positive rate at varying thresholds. The AUC is:

\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(t)) \, dt = P(\hat{f}(x^+) > \hat{f}(x^-))

AUC is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. An AUC of 0.5 indicates random performance; 1.0 is perfect.
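The probabilistic interpretation suggests a direct, if quadratic-time, way to compute AUC: compare every positive score with every negative score. Counting ties as one half is a common convention assumed here; the scores are made up.

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-negative score differences
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Hypothetical classifier scores: 3 positives, 2 negatives
y_true = np.array([1, 1, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.1])

auc = auc_pairwise(y_true, scores)
print(auc)
```

Of the six positive-negative pairs, five are ordered correctly (only the positive scored 0.4 falls below the negative scored 0.5), giving an AUC of 5/6.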

Metric Choice for Imbalanced Text

Accuracy is misleading when classes are imbalanced (e.g., 95% negative reviews)
F1 is appropriate when both precision and recall matter equally
$F_\beta$ weights recall more heavily when $\beta > 1$ (missed positives are costly)
AUC is threshold-independent and useful for comparing classifiers
Precision@k measures the fraction of relevant documents in the top-k results (information retrieval)
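Two of the metrics in the list above are simple enough to sketch directly. The $F_\beta$ function uses the standard definition $F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}$ , which is not written out in this lesson; the example inputs are made up.

```python
def f_beta(precision, recall, beta):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def precision_at_k(ranked_relevance, k):
    """ranked_relevance: 0/1 relevance flags in ranked order, best first."""
    return sum(ranked_relevance[:k]) / k

# A recall-heavy classifier scores higher under F2 than under F1
f1 = f_beta(precision=0.5, recall=1.0, beta=1)
f2 = f_beta(precision=0.5, recall=1.0, beta=2)

# Two of the top three retrieved documents are relevant
p_at_3 = precision_at_k([1, 0, 1, 1, 0], k=3)
print(f1, f2, p_at_3)
```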

Python Implementation

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

np.random.seed(42)

# --- Sample corpus ---
corpus = [
    "The stock market rallied on strong earnings reports",
    "New vaccine shows promising results in clinical trials",
    "Federal Reserve raises interest rates to combat inflation",
    "Tech companies report record profits amid AI boom",
    "Hospital implements new patient safety protocols",
    "Bond yields decline as investors seek safe assets",
    "Study finds link between exercise and mental health",
    "GDP growth exceeds expectations in third quarter",
]
labels = np.array([0, 1, 0, 0, 1, 0, 1, 0])  # 0=finance, 1=health

# --- TF-IDF ---
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
X_tfidf = tfidf.fit_transform(corpus)
print("TF-IDF shape:", X_tfidf.shape)
print("Top features:", tfidf.get_feature_names_out()[:10])

# --- LDA Topic Model ---
count_vec = CountVectorizer(max_features=1000, stop_words='english')
X_counts = count_vec.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=42, max_iter=50)
doc_topics = lda.fit_transform(X_counts)

print("\n=== LDA Topics ===")
for idx, topic in enumerate(lda.components_):
    top_words = [count_vec.get_feature_names_out()[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")

# --- Naive Bayes Classification ---
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, labels, test_size=0.25, random_state=42, stratify=labels
)

nb = MultinomialNB(alpha=1.0)
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
y_prob_nb = nb.predict_proba(X_test)[:, 1]

print("\n=== Naive Bayes ===")
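# labels/zero_division keep the report well-defined on this tiny test split
print(classification_report(y_test, y_pred_nb, labels=[0, 1],
                            target_names=['Finance', 'Health'], zero_division=0))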

# --- Logistic Regression ---
lr = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
y_prob_lr = lr.predict_proba(X_test)[:, 1]

print("=== Logistic Regression ===")
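# labels/zero_division keep the report well-defined on this tiny test split
print(classification_report(y_test, y_pred_lr, labels=[0, 1],
                            target_names=['Finance', 'Health'], zero_division=0))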

# --- Word Embeddings ---
# Random (synthetic) vectors stand in for pretrained embeddings in this demonstration
vocab = tfidf.get_feature_names_out()
k = 50  # embedding dimension
embeddings = np.random.randn(len(vocab), k) * 0.1

# Document embedding = TF-IDF weighted average of word embeddings
def doc_embedding(tfidf_matrix, embeddings):
    doc_embs = tfidf_matrix @ embeddings
    norms = np.linalg.norm(doc_embs, axis=1, keepdims=True)
    norms[norms == 0] = 1
    return doc_embs / norms

X_emb = doc_embedding(X_tfidf, embeddings)
print(f"\nDocument embedding shape: {X_emb.shape}")
# An ndarray cannot take the ':.3f' format spec, so round the matrix before printing
print(f"Similarity matrix (cosine):\n{np.round(X_emb @ X_emb.T, 3)}")

Summary

Key Takeaways: Text Mining and Statistical NLP

TF-IDF weighting balances term frequency with discriminative power; cosine similarity in TF-IDF space measures document similarity. BM25 extends TF-IDF with document length normalization.
LDA models documents as mixtures of latent topics with Dirichlet priors. Variational inference or collapsed Gibbs sampling approximate the intractable posterior. Perplexity measures generalization.
Sentiment analysis ranges from lexicon-based scoring to supervised classification. Logistic regression and SVMs in TF-IDF space are strong baselines; aspect-based methods capture entity-level polarity.
Word embeddings (Word2Vec, GloVe) map words to dense vectors capturing semantic relationships. Skip-Gram with negative sampling scales to large corpora; GloVe leverages global co-occurrence statistics.
Precision, recall, F1, and AUC each suit different class balances and cost structures. Macro versus micro averaging matters for multi-class problems, so report several metrics rather than one.
