Hidden Markov Models (HMM)
Advanced Statistical Methods
Modeling Sequences With Hidden States
Hidden Markov Models capture systems that transition between unobservable states while generating observable outputs. The forward-backward, Viterbi, and Baum-Welch algorithms enable inference and learning in these doubly stochastic processes.
- Speech recognition — Model phoneme sequences where acoustic observations depend on hidden articulatory states
- Bioinformatics — Identify gene regions using hidden states representing functional annotations
- Finance — Detect regime changes in market volatility using hidden bull/bear states
HMMs let you decode the hidden story behind observable sequences.
DfHidden Markov Model
A Hidden Markov Model (HMM) is a doubly stochastic process consisting of an underlying Markov chain of unobservable (hidden) states and an observable emission process. Formally, an HMM λ = (S, V, A, B, π) is defined by:
- S = {s₁, s₂, ..., s_N}: a finite set of N hidden states
- V = {v₁, v₂, ..., v_M}: a finite set of M observation symbols
- A = [a_ij]: an N × N state transition probability matrix where a_ij = P(q_{t+1} = s_j | q_t = s_i)
- B = [b_j(k)]: an N × M emission probability matrix where b_j(k) = P(o_t = v_k | q_t = s_j)
- π = {π_i}: an initial state distribution where π_i = P(q_1 = s_i)
Key Probability Distributions
The joint probability of a state sequence Q = (q_1, q_2, ..., q_T) and observation sequence O = (o_1, o_2, ..., o_T) is:
The likelihood of observations is obtained by marginalizing over all possible state sequences:
The Forward-Backward Algorithm
ThForward Algorithm
The forward variable α_t(i) = P(o_1, o_2, ..., o_T, q_t = s_i | λ) can be computed recursively:
Initialization:
Induction:
Termination:
The computational complexity is O(N²T), a dramatic improvement over the O(N^T) brute-force approach.
Backward Variables
The backward variable β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = s_i, λ) is computed similarly:
Initialization:
Induction:
DfPosterior State Probabilities
The posterior probability of being in state s_i at time t given the entire observation sequence is:
The joint posterior of two consecutive states is:
The Viterbi Algorithm
ThViterbi Decoding
The Viterbi algorithm finds the most likely state sequence Q* = argmax_Q P(Q | O, λ) using dynamic programming:
Define:
Recursion:
Backtracking:
The optimal state sequence is recovered by tracing back from q_T* = argmax_i δ_T(i).
import numpy as np
class HiddenMarkovModel:
def __init__(self, n_states, n_observations):
self.N = n_states
self.M = n_observations
self.A = np.ones((n_states, n_states)) / n_states
self.B = np.ones((n_states, n_observations)) / n_observations
self.pi = np.ones(n_states) / n_states
def forward(self, obs):
T = len(obs)
alpha = np.zeros((T, self.N))
alpha[0] = self.pi * self.B[:, obs[0]]
for t in range(1, T):
for j in range(self.N):
alpha[t, j] = np.sum(alpha[t-1] * self.A[:, j]) * self.B[j, obs[t]]
return alpha
def backward(self, obs):
T = len(obs)
beta = np.zeros((T, self.N))
beta[T-1] = 1.0
for t in range(T-2, -1, -1):
for i in range(self.N):
beta[t, i] = np.sum(self.A[i] * self.B[:, obs[t+1]] * beta[t+1])
return beta
def viterbi(self, obs):
T = len(obs)
delta = np.zeros((T, self.N))
psi = np.zeros((T, self.N), dtype=int)
delta[0] = self.pi * self.B[:, obs[0]]
for t in range(1, T):
for j in range(self.N):
candidates = delta[t-1] * self.A[:, j]
psi[t, j] = np.argmax(candidates)
delta[t, j] = candidates[psi[t, j]] * self.B[j, obs[t]]
states = np.zeros(T, dtype=int)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
states[t] = psi[t+1, states[t+1]]
return states, delta[T-1, states[T-1]]
Baum-Welch (EM for HMMs)
ThBaum-Welch Re-estimation
The Baum-Welch algorithm is the EM algorithm applied to HMMs. Given the current model parameters λ, the re-estimated parameters are:
Transition probabilities:
Emission probabilities:
Initial state distribution:
The likelihood P(O | λ) is non-decreasing under each iteration, guaranteeing convergence to a local maximum.
Computational Complexity
All three core algorithms share the same time complexity structure:
| Algorithm | Time Complexity | Space Complexity |
|---|---|---|
| Forward | O(N²T) | O(NT) |
| Backward | O(N²T) | O(NT) |
| Viterbi | O(N²T) | O(NT) |
| Baum-Welch | O(KN²T) | O(NT) |
where K is the number of EM iterations. For numerical stability, log-space implementations are essential for long sequences.
Log-Likelihood and Model Selection
The log-likelihood under an HMM is:
For model selection (choosing N), information criteria are employed:
AIC: -2 log P(O|λ̂) + 2p
BIC: -2 log P(O|λ̂) + p log T
where p = N² + NM - 1 is the number of free parameters.
Continuous Observation HMMs
DfGaussian HMM
For continuous observations, the emission distribution is typically modeled as a Gaussian mixture:
where c_{jm} are mixture weights, and the state-dependent mean vectors μ_{jm} and covariance matrices Σ_{jm} are estimated via a modified Baum-Welch algorithm. The E-step computes sufficient statistics using the forward-backward variables, and the M-step updates Gaussian parameters via weighted maximum likelihood.
Applications in Bioinformatics
HMMs have become foundational in computational biology:
- Gene finding: Prokaryotic and eukaryotic gene prediction using state architectures with coding/non-coding states and splice site detection
- Sequence alignment: Profile HMMs for protein family classification (HMMER, Pfam)
- Protein structure: Secondary structure prediction with helix, sheet, and coil states
- Phylogenetics: Phylo-HMMs for detecting selection pressure across evolutionary lineages
The key advantage is that HMMs naturally handle insertions, deletions, and substitutions as stochastic processes, providing probabilistic frameworks for sequence analysis where deterministic methods fail.
Speech Recognition Application
In continuous speech recognition, HMMs model acoustic units (phonemes) with:
- States representing sub-phonetic segments (onset, nucleus, coda)
- Observations as Mel-frequency cepstral coefficients (MFCCs)
- Transitions capturing temporal dynamics
- Emissions modeled as Gaussian mixtures (typically 8-32 components per state)
Modern ASR systems use context-dependent triphone HMMs with thousands of states, trained on hundreds of hours of speech. The Viterbi algorithm provides the alignment between acoustic observations and phonetic labels, while the forward algorithm computes likelihoods for language model integration.