Hidden Markov Models (HMM)

Advanced Statistical Methods

Modeling Sequences With Hidden States

Hidden Markov Models capture systems that transition between unobservable states while generating observable outputs. The forward-backward, Viterbi, and Baum-Welch algorithms enable inference and learning in these doubly stochastic processes.

Speech recognition — Model phoneme sequences where acoustic observations depend on hidden articulatory states
Bioinformatics — Identify gene regions using hidden states representing functional annotations
Finance — Detect regime changes in market volatility using hidden bull/bear states

HMMs let you decode the hidden story behind observable sequences.

DfHidden Markov Model

A Hidden Markov Model (HMM) is a doubly stochastic process consisting of an underlying Markov chain of unobservable (hidden) states and an observable emission process. Formally, an HMM λ = (S, V, A, B, π) is defined by:

S = {s₁, s₂, ..., s_N}: a finite set of N hidden states
V = {v₁, v₂, ..., v_M}: a finite set of M observation symbols
A = [a_ij]: an N × N state transition probability matrix where a_ij = P(q_{t+1} = s_j | q_t = s_i)
B = [b_j(k)]: an N × M emission probability matrix where b_j(k) = P(o_t = v_k | q_t = s_j)
π = {π_i}: an initial state distribution where π_i = P(q_1 = s_i)

Key Probability Distributions

The joint probability of a state sequence Q = (q_1, q_2, ..., q_T) and observation sequence O = (o_1, o_2, ..., o_T) is:

P(Q, O | \lambda) = \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)

The likelihood of observations is obtained by marginalizing over all possible state sequences:

P(O | \lambda) = \sum_{Q} \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)

The Forward-Backward Algorithm

ThForward Algorithm

The forward variable α_t(i) = P(o_1, o_2, ..., o_T, q_t = s_i | λ) can be computed recursively:

Initialization:

\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \leq i \leq N

Induction:

\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i) a_{ij}\right] b_j(o_t), \quad 2 \leq t \leq T, \; 1 \leq j \leq N

Termination:

P(O | \lambda) = \sum_{i=1}^{N} \alpha_T(i)

The computational complexity is O(N²T), a dramatic improvement over the O(N^T) brute-force approach.

Backward Variables

The backward variable β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = s_i, λ) is computed similarly:

Initialization:

\beta_T(i) = 1, \quad 1 \leq i \leq N

Induction:

\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j), \quad T-1 \geq t \geq 1

DfPosterior State Probabilities

The posterior probability of being in state s_i at time t given the entire observation sequence is:

\gamma_t(i) = P(q_t = s_i | O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O|\lambda)}

The joint posterior of two consecutive states is:

\xi_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(O|\lambda)}

The Viterbi Algorithm

ThViterbi Decoding

The Viterbi algorithm finds the most likely state sequence Q* = argmax_Q P(Q | O, λ) using dynamic programming:

Define:

\delta_t(i) = \max_{q_1,...,q_{t-1}} P(q_1,...,q_{t-1}, q_t = s_i, o_1,...,o_t | \lambda)

Recursion:

\delta_t(j) = \max_{1 \leq i \leq N} [\delta_{t-1}(i) \cdot a_{ij}] \cdot b_j(o_t)

Backtracking:

\psi_t(j) = \arg\max_{1 \leq i \leq N} [\delta_{t-1}(i) \cdot a_{ij}]

The optimal state sequence is recovered by tracing back from q_T* = argmax_i δ_T(i).

import numpy as np

class HiddenMarkovModel:
    def __init__(self, n_states, n_observations):
        self.N = n_states
        self.M = n_observations
        self.A = np.ones((n_states, n_states)) / n_states
        self.B = np.ones((n_states, n_observations)) / n_observations
        self.pi = np.ones(n_states) / n_states

    def forward(self, obs):
        T = len(obs)
        alpha = np.zeros((T, self.N))
        alpha[0] = self.pi * self.B[:, obs[0]]
        for t in range(1, T):
            for j in range(self.N):
                alpha[t, j] = np.sum(alpha[t-1] * self.A[:, j]) * self.B[j, obs[t]]
        return alpha

    def backward(self, obs):
        T = len(obs)
        beta = np.zeros((T, self.N))
        beta[T-1] = 1.0
        for t in range(T-2, -1, -1):
            for i in range(self.N):
                beta[t, i] = np.sum(self.A[i] * self.B[:, obs[t+1]] * beta[t+1])
        return beta

    def viterbi(self, obs):
        T = len(obs)
        delta = np.zeros((T, self.N))
        psi = np.zeros((T, self.N), dtype=int)
        delta[0] = self.pi * self.B[:, obs[0]]
        for t in range(1, T):
            for j in range(self.N):
                candidates = delta[t-1] * self.A[:, j]
                psi[t, j] = np.argmax(candidates)
                delta[t, j] = candidates[psi[t, j]] * self.B[j, obs[t]]
        states = np.zeros(T, dtype=int)
        states[T-1] = np.argmax(delta[T-1])
        for t in range(T-2, -1, -1):
            states[t] = psi[t+1, states[t+1]]
        return states, delta[T-1, states[T-1]]

Baum-Welch (EM for HMMs)

ThBaum-Welch Re-estimation

The Baum-Welch algorithm is the EM algorithm applied to HMMs. Given the current model parameters λ, the re-estimated parameters are:

Transition probabilities:

\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

Emission probabilities:

\hat{b}_j(k) = \frac{\sum_{t=1, o_t=v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

Initial state distribution:

\hat{\pi}_i = \gamma_1(i)

The likelihood P(O | λ) is non-decreasing under each iteration, guaranteeing convergence to a local maximum.

Computational Complexity

All three core algorithms share the same time complexity structure:

Algorithm	Time Complexity	Space Complexity
Forward	O(N²T)	O(NT)
Backward	O(N²T)	O(NT)
Viterbi	O(N²T)	O(NT)
Baum-Welch	O(KN²T)	O(NT)

where K is the number of EM iterations. For numerical stability, log-space implementations are essential for long sequences.

Log-Likelihood and Model Selection

The log-likelihood under an HMM is:

\log P(O|\lambda) = \log \left[\sum_{Q} \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)\right]

For model selection (choosing N), information criteria are employed:

AIC: -2 log P(O|λ̂) + 2p
BIC: -2 log P(O|λ̂) + p log T

where p = N² + NM - 1 is the number of free parameters.

Continuous Observation HMMs

DfGaussian HMM

For continuous observations, the emission distribution is typically modeled as a Gaussian mixture:

b_j(o) = \sum_{m=1}^{M_j} c_{jm} \mathcal{N}(o | \mu_{jm}, \Sigma_{jm})

where c_{jm} are mixture weights, and the state-dependent mean vectors μ_{jm} and covariance matrices Σ_{jm} are estimated via a modified Baum-Welch algorithm. The E-step computes sufficient statistics using the forward-backward variables, and the M-step updates Gaussian parameters via weighted maximum likelihood.

Applications in Bioinformatics

HMMs have become foundational in computational biology:

Gene finding: Prokaryotic and eukaryotic gene prediction using state architectures with coding/non-coding states and splice site detection
Sequence alignment: Profile HMMs for protein family classification (HMMER, Pfam)
Protein structure: Secondary structure prediction with helix, sheet, and coil states
Phylogenetics: Phylo-HMMs for detecting selection pressure across evolutionary lineages

The key advantage is that HMMs naturally handle insertions, deletions, and substitutions as stochastic processes, providing probabilistic frameworks for sequence analysis where deterministic methods fail.

Speech Recognition Application

In continuous speech recognition, HMMs model acoustic units (phonemes) with:

States representing sub-phonetic segments (onset, nucleus, coda)
Observations as Mel-frequency cepstral coefficients (MFCCs)
Transitions capturing temporal dynamics
Emissions modeled as Gaussian mixtures (typically 8-32 components per state)

Modern ASR systems use context-dependent triphone HMMs with thousands of states, trained on hundreds of hours of speech. The Viterbi algorithm provides the alignment between acoustic observations and phonetic labels, while the forward algorithm computes likelihoods for language model integration.

Hidden Markov Models (HMM)

Hidden Markov Models (HMM)

Modeling Sequences With Hidden States

DfHidden Markov Model

Key Probability Distributions

The Forward-Backward Algorithm

ThForward Algorithm

DfPosterior State Probabilities

The Viterbi Algorithm

ThViterbi Decoding

Baum-Welch (EM for HMMs)

ThBaum-Welch Re-estimation

Log-Likelihood and Model Selection

Continuous Observation HMMs

DfGaussian HMM

Applications in Bioinformatics

Speech Recognition Application

Premium Content

Need Expert Statistics Help?