Bayesian Networks
Advanced Statistical Methods
Reasoning Under Uncertainty With Graphical Models
Bayesian networks compactly represent joint probability distributions using directed acyclic graphs, encoding conditional independencies that enable efficient inference. D-separation provides the graphical criterion for independence.
- Medical diagnosis β Combine symptom observations to infer disease probabilities
- Risk assessment β Model cascading failures in complex engineering systems
- Fraud detection β Combine transaction patterns to flag suspicious activities with explainable reasoning
Bayesian networks turn complex dependency structures into tractable probabilistic reasoning.
DfBayesian Network
A Bayesian Network (BN) is a directed acyclic graph (DAG) G = (V, E) together with a set of conditional probability distributions that compactly represent a joint probability distribution over random variables Xβ, Xβ, ..., X_n. Each node i β V represents a random variable X_i, and each directed edge (i, j) β E encodes the dependency of X_j on its parent set Pa(X_j).
The joint distribution factorizes as:
This factorization is valid for any topological ordering of the variables and exploits conditional independencies encoded in the graph structure.
Conditional Independence and Factorization
A Bayesian network encodes two types of independence:
- Local independence: Each variable is conditionally independent of its non-descendants given its parents:
- Global independence: The joint distribution satisfies the Markov condition:
The number of parameters required is:
which is exponentially smaller than the |Xβ| Γ |Xβ| Γ ... Γ |X_n| parameters for the full joint distribution.
D-Separation and Conditional Independence
Thd-Separation Criterion
A set of nodes Z d-separates nodes X from nodes Y in a DAG G if every undirected path between any node in X and any node in Y is "blocked" by Z. A path is blocked by Z if:
- The path contains a chain a β b β c or a β b β c where b β Z
- The path contains a fork a β b β c where b β Z
- The path contains an collider a β b β c where b β Z and no descendant of b is in Z
The Bayes-Ball algorithm provides an efficient O(|E|) procedure for testing d-separation: given Z, a node is reachable from X by a path not blocked by Z if and only if X and Y are not d-separated.
Faithfulness and Equivalence
Two DAGs are Markov equivalent if they encode the same set of conditional independence statements. The Completed Partially Directed Acyclic Graph (CPDAG) uniquely characterizes an equivalence class. Key results:
- DAGs Gβ and Gβ are Markov equivalent iff they have the same skeleton and the same v-structures
- The skeleton is the underlying undirected graph
- A v-structure is a configuration X β Z β Y where X and Y are not adjacent
Faithfulness assumes all and only conditional independencies implied by the d-separation criterion hold in the true distribution. Violations occur with measure zero under continuous distributions but can arise in discrete models.
Parameter Estimation
ThMaximum Likelihood Estimation
Given a DAG G and dataset D = {dβ, ..., d_N} of i.i.d. observations, the log-likelihood decomposes according to the network structure:
where N_{ijk} is the number of instances where X_i takes value k and Pa(X_i) takes configuration z.
The MLE for each parameter is:
with the constraint that βk ΞΈ{ijk} = 1 for each (i, z) combination. This is a multinomial MLE computed independently for each node-parent configuration.
Bayesian Parameter Estimation
With a Dirichlet prior Dir(Ξ±_{ijk}) for each set of parameters {ΞΈ_{ij1}, ..., ΞΈ_{ij|X_i|}}, the posterior is also Dirichlet:
The posterior mean (Bayesian estimate) is:
The BDeu (Bayesian Dirichlet equivalent uniform) prior with Ξ±_{ijk} = Ξ± / (|X_i| Γ β_{Pa} |X_j|) provides a score equivalent to the BIC approximation.
Structure Learning
ThScore-Based Structure Learning
Structure learning is NP-hard in general. Score-based methods search over the space of DAGs and optimize a scoring function:
BIC Score:
BDe Score (Bayesian):
where P(D | G) is the marginal likelihood obtained by integrating over parameters. The BDe score is decomposable over nodes and can be computed in closed form with the Dirichlet prior.
Constraint-Based Learning
The PC algorithm (named after Peter and Clark) learns structure via conditional independence tests:
- Start with a complete undirected graph
- For each pair (X, Y), test X β₯ Y | S for increasing subsets S
- Remove edges where independence is found
- Direct edges using v-structure detection (collider orientation)
- Apply Meek's rules for orientation propagation
The PC algorithm requires O(nβ΄ |D|) conditional independence tests in the worst case. The Max-Min Hill Climbing (MMHC) algorithm combines constraint and score-based approaches for improved performance.
Inference in Bayesian Networks
DfExact Inference β Variable Elimination
Exact inference computes P(Q | E = e) for query variables Q given evidence E. Variable elimination exploits the factored representation:
Variables are eliminated one at a time by summing out. The order of elimination determines efficiency. With treewidth w, the complexity is O(n Β· |X|^{w+1}), where w is the treewidth of the moralized triangulated graph.
Junction Tree Algorithm
The junction tree algorithm converts a DAG into a tree of cliques:
- Moralize: Add edges between co-parents, drop directions
- Triangulate: Add fill-in edges to eliminate induced cycles of length > 3
- Form cliques: Identify maximal cliques Cβ, ..., C_k
- Build junction tree: Find maximum spanning tree over cliques with intersection sizes as weights
Belief propagation on the junction tree computes exact marginals in O(k Β· |X|^{w+1}) time, where w is the treewidth.
Approximate Inference β Sampling Methods
When exact inference is intractable (treewidth too large), approximate methods are used:
- Forward sampling: Generate ancestors before descendants; simple but can suffer from low acceptance
- Rejection sampling: Discard samples inconsistent with evidence
- Likelihood weighting: Fix evidence variables, weight samples by likelihood
- Gibbs sampling: MCMC method that samples each variable conditioned on its Markov blanket
- Variational inference: Optimize a tractable approximation q(Q) to minimize KL(q || p)
Variational methods provide deterministic bounds and scale to large networks, while MCMC methods are asymptotically exact but may mix slowly.
import numpy as np
from itertools import product
class BayesianNetwork:
def __init__(self, n_vars, card, parents):
self.n = n_vars
self.card = card
self.parents = parents
self.cpd = [None] * n_vars
def randomεε§ε(self, seed=42):
rng = np.random.RandomState(seed)
for i in range(self.n):
parent_card = np.prod([self.card[p] for p in self.parents[i]]) if self.parents[i] else 1
self.cpd[i] = rng.dirichlet(np.ones(self.card[i]), size=parent_card)
def sample(self, n_samples=1, rng=None):
if rng is None:
rng = np.random.RandomState()
data = np.zeros((n_samples, self.n), dtype=int)
for i in range(self.n):
if not self.parents[i]:
data[:, i] = rng.choice(self.card[i], size=n_samples, p=self.cpd[i])
else:
for idx in range(n_samples):
parent_config = tuple(data[idx, p] for p in self.parents[i])
flat_idx = np.ravel_multi_index(parent_config, [self.card[p] for p in self.parents[i]])
data[idx, i] = rng.choice(self.card[i], p=self.cpd[i][flat_idx])
return data
def log_likelihood(self, data):
ll = 0.0
for i in range(self.n):
for row in data:
if self.parents[i]:
parent_config = tuple(row[p] for p in self.parents[i])
flat_idx = np.ravel_multi_index(parent_config, [self.card[p] for p in self.parents[i]])
else:
flat_idx = 0
ll += np.log(self.cpd[i][flat_idx, row[i]] + 1e-10)
return ll
def variable_elimination(self, evidence, query):
factors = [self.cpd[i].copy() for i in range(self.n)]
observed = set(evidence.keys())
hidden = [i for i in range(self.n) if i not in observed and i not in query]
for var in hidden:
product_factor = np.ones([self.card[v] if v in [var] + self.parents[v] else 1
for v in range(self.n)])
# Simplified VE: sum out variable
factors = [f for f in factors] # placeholder
return factors
Naive Bayes as a Special Case
The Naive Bayes classifier is a Bayesian network with a single parent (the class variable C) and conditionally independent features:
Despite its strong independence assumption, Naive Bayes performs surprisingly well in practice, particularly for text classification. The assumption is "good enough" because:
- It provides a biased but low-variance estimate
- The decision boundary is robust to correlation between features
- Regularization via the prior helps with small samples
Extensions like Tree-Augmented Naive Bayes (TAN) relax the independence assumption by allowing a tree structure among features, adding O(d) parameters while maintaining tractability.
Medical Diagnosis Network
A classic Bayesian network for medical diagnosis includes variables like:
- Disease (D): {Flu, Cold, Healthy}
- Fever (F): {High, Low, None}
- Cough (C): {Severe, Mild, None}
- Headache (H): {Yes, No}
The structure encodes domain knowledge:
- D β F, D β C, D β H (diseases cause symptoms)
- F β₯ C | D (symptoms are conditionally independent given disease)
With CPTs (conditional probability tables):
- P(D = Flu) = 0.1, P(D = Cold) = 0.3, P(D = Healthy) = 0.6
- P(F = High | D = Flu) = 0.8, P(F = High | D = Cold) = 0.2
Posterior inference: P(D | F = High, C = Severe, H = Yes) computes the diagnostic probability using Bayes' theorem with the network's factorization.