🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Hidden Markov Models (HMM)

Advanced Statistical MethodsSequential Data🟢 Free Lesson

Advertisement

Hidden Markov Models (HMM)

Advanced Statistical Methods

Modeling Sequences With Hidden States

Hidden Markov Models capture systems that transition between unobservable states while generating observable outputs. The forward-backward, Viterbi, and Baum-Welch algorithms enable inference and learning in these doubly stochastic processes.

  • Speech recognition — Model phoneme sequences where acoustic observations depend on hidden articulatory states
  • Bioinformatics — Identify gene regions using hidden states representing functional annotations
  • Finance — Detect regime changes in market volatility using hidden bull/bear states

HMMs let you decode the hidden story behind observable sequences.


DfHidden Markov Model

A Hidden Markov Model (HMM) is a doubly stochastic process consisting of an underlying Markov chain of unobservable (hidden) states and an observable emission process. Formally, an HMM λ = (S, V, A, B, π) is defined by:

  • S = {s₁, s₂, ..., s_N}: a finite set of N hidden states
  • V = {v₁, v₂, ..., v_M}: a finite set of M observation symbols
  • A = [a_ij]: an N × N state transition probability matrix where a_ij = P(q_{t+1} = s_j | q_t = s_i)
  • B = [b_j(k)]: an N × M emission probability matrix where b_j(k) = P(o_t = v_k | q_t = s_j)
  • π = {π_i}: an initial state distribution where π_i = P(q_1 = s_i)

Key Probability Distributions

The joint probability of a state sequence Q = (q_1, q_2, ..., q_T) and observation sequence O = (o_1, o_2, ..., o_T) is:

P(Q,Oλ)=πq1t=1T1aqtqt+1t=1Tbqt(ot)P(Q, O | \lambda) = \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)

The likelihood of observations is obtained by marginalizing over all possible state sequences:

P(Oλ)=Qπq1t=1T1aqtqt+1t=1Tbqt(ot)P(O | \lambda) = \sum_{Q} \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)

The Forward-Backward Algorithm

ThForward Algorithm

The forward variable α_t(i) = P(o_1, o_2, ..., o_T, q_t = s_i | λ) can be computed recursively:

Initialization:

α1(i)=πibi(o1),1iN\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \leq i \leq N

Induction:

αt(j)=[i=1Nαt1(i)aij]bj(ot),2tT,  1jN\alpha_t(j) = \left[\sum_{i=1}^{N} \alpha_{t-1}(i) a_{ij}\right] b_j(o_t), \quad 2 \leq t \leq T, \; 1 \leq j \leq N

Termination:

P(Oλ)=i=1NαT(i)P(O | \lambda) = \sum_{i=1}^{N} \alpha_T(i)

The computational complexity is O(N²T), a dramatic improvement over the O(N^T) brute-force approach.

Backward Variables

The backward variable β_t(i) = P(o_{t+1}, o_{t+2}, ..., o_T | q_t = s_i, λ) is computed similarly:

Initialization:

βT(i)=1,1iN\beta_T(i) = 1, \quad 1 \leq i \leq N

Induction:

βt(i)=j=1Naijbj(ot+1)βt+1(j),T1t1\beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1}) \beta_{t+1}(j), \quad T-1 \geq t \geq 1

DfPosterior State Probabilities

The posterior probability of being in state s_i at time t given the entire observation sequence is:

γt(i)=P(qt=siO,λ)=αt(i)βt(i)P(Oλ)\gamma_t(i) = P(q_t = s_i | O, \lambda) = \frac{\alpha_t(i) \beta_t(i)}{P(O|\lambda)}

The joint posterior of two consecutive states is:

ξt(i,j)=P(qt=si,qt+1=sjO,λ)=αt(i)aijbj(ot+1)βt+1(j)P(Oλ)\xi_t(i,j) = P(q_t = s_i, q_{t+1} = s_j | O, \lambda) = \frac{\alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)}{P(O|\lambda)}

The Viterbi Algorithm

ThViterbi Decoding

The Viterbi algorithm finds the most likely state sequence Q* = argmax_Q P(Q | O, λ) using dynamic programming:

Define:

δt(i)=maxq1,...,qt1P(q1,...,qt1,qt=si,o1,...,otλ)\delta_t(i) = \max_{q_1,...,q_{t-1}} P(q_1,...,q_{t-1}, q_t = s_i, o_1,...,o_t | \lambda)

Recursion:

δt(j)=max1iN[δt1(i)aij]bj(ot)\delta_t(j) = \max_{1 \leq i \leq N} [\delta_{t-1}(i) \cdot a_{ij}] \cdot b_j(o_t)

Backtracking:

ψt(j)=argmax1iN[δt1(i)aij]\psi_t(j) = \arg\max_{1 \leq i \leq N} [\delta_{t-1}(i) \cdot a_{ij}]

The optimal state sequence is recovered by tracing back from q_T* = argmax_i δ_T(i).

import numpy as np

class HiddenMarkovModel:
    def __init__(self, n_states, n_observations):
        self.N = n_states
        self.M = n_observations
        self.A = np.ones((n_states, n_states)) / n_states
        self.B = np.ones((n_states, n_observations)) / n_observations
        self.pi = np.ones(n_states) / n_states

    def forward(self, obs):
        T = len(obs)
        alpha = np.zeros((T, self.N))
        alpha[0] = self.pi * self.B[:, obs[0]]
        for t in range(1, T):
            for j in range(self.N):
                alpha[t, j] = np.sum(alpha[t-1] * self.A[:, j]) * self.B[j, obs[t]]
        return alpha

    def backward(self, obs):
        T = len(obs)
        beta = np.zeros((T, self.N))
        beta[T-1] = 1.0
        for t in range(T-2, -1, -1):
            for i in range(self.N):
                beta[t, i] = np.sum(self.A[i] * self.B[:, obs[t+1]] * beta[t+1])
        return beta

    def viterbi(self, obs):
        T = len(obs)
        delta = np.zeros((T, self.N))
        psi = np.zeros((T, self.N), dtype=int)
        delta[0] = self.pi * self.B[:, obs[0]]
        for t in range(1, T):
            for j in range(self.N):
                candidates = delta[t-1] * self.A[:, j]
                psi[t, j] = np.argmax(candidates)
                delta[t, j] = candidates[psi[t, j]] * self.B[j, obs[t]]
        states = np.zeros(T, dtype=int)
        states[T-1] = np.argmax(delta[T-1])
        for t in range(T-2, -1, -1):
            states[t] = psi[t+1, states[t+1]]
        return states, delta[T-1, states[T-1]]

Baum-Welch (EM for HMMs)

ThBaum-Welch Re-estimation

The Baum-Welch algorithm is the EM algorithm applied to HMMs. Given the current model parameters λ, the re-estimated parameters are:

Transition probabilities:

a^ij=t=1T1ξt(i,j)t=1T1γt(i)\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}

Emission probabilities:

b^j(k)=t=1,ot=vkTγt(j)t=1Tγt(j)\hat{b}_j(k) = \frac{\sum_{t=1, o_t=v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}

Initial state distribution:

π^i=γ1(i)\hat{\pi}_i = \gamma_1(i)

The likelihood P(O | λ) is non-decreasing under each iteration, guaranteeing convergence to a local maximum.

Computational Complexity

All three core algorithms share the same time complexity structure:

AlgorithmTime ComplexitySpace Complexity
ForwardO(N²T)O(NT)
BackwardO(N²T)O(NT)
ViterbiO(N²T)O(NT)
Baum-WelchO(KN²T)O(NT)

where K is the number of EM iterations. For numerical stability, log-space implementations are essential for long sequences.

Log-Likelihood and Model Selection

The log-likelihood under an HMM is:

logP(Oλ)=log[Qπq1t=1T1aqtqt+1t=1Tbqt(ot)]\log P(O|\lambda) = \log \left[\sum_{Q} \pi_{q_1} \prod_{t=1}^{T-1} a_{q_t q_{t+1}} \prod_{t=1}^{T} b_{q_t}(o_t)\right]

For model selection (choosing N), information criteria are employed:

AIC: -2 log P(O|λ̂) + 2p
BIC: -2 log P(O|λ̂) + p log T

where p = N² + NM - 1 is the number of free parameters.

Continuous Observation HMMs

DfGaussian HMM

For continuous observations, the emission distribution is typically modeled as a Gaussian mixture:

bj(o)=m=1MjcjmN(oμjm,Σjm)b_j(o) = \sum_{m=1}^{M_j} c_{jm} \mathcal{N}(o | \mu_{jm}, \Sigma_{jm})

where c_{jm} are mixture weights, and the state-dependent mean vectors μ_{jm} and covariance matrices Σ_{jm} are estimated via a modified Baum-Welch algorithm. The E-step computes sufficient statistics using the forward-backward variables, and the M-step updates Gaussian parameters via weighted maximum likelihood.

Applications in Bioinformatics

HMMs have become foundational in computational biology:

  1. Gene finding: Prokaryotic and eukaryotic gene prediction using state architectures with coding/non-coding states and splice site detection
  2. Sequence alignment: Profile HMMs for protein family classification (HMMER, Pfam)
  3. Protein structure: Secondary structure prediction with helix, sheet, and coil states
  4. Phylogenetics: Phylo-HMMs for detecting selection pressure across evolutionary lineages

The key advantage is that HMMs naturally handle insertions, deletions, and substitutions as stochastic processes, providing probabilistic frameworks for sequence analysis where deterministic methods fail.

Speech Recognition Application

In continuous speech recognition, HMMs model acoustic units (phonemes) with:

  • States representing sub-phonetic segments (onset, nucleus, coda)
  • Observations as Mel-frequency cepstral coefficients (MFCCs)
  • Transitions capturing temporal dynamics
  • Emissions modeled as Gaussian mixtures (typically 8-32 components per state)

Modern ASR systems use context-dependent triphone HMMs with thousands of states, trained on hundreds of hours of speech. The Viterbi algorithm provides the alignment between acoustic observations and phonetic labels, while the forward algorithm computes likelihoods for language model integration.

Premium Content

Hidden Markov Models (HMM)

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement