Bayesian Networks

Advanced Statistical Methods

Reasoning Under Uncertainty With Graphical Models

Bayesian networks compactly represent joint probability distributions using directed acyclic graphs, encoding conditional independencies that enable efficient inference. D-separation provides the graphical criterion for independence.

Medical diagnosis — Combine symptom observations to infer disease probabilities
Risk assessment — Model cascading failures in complex engineering systems
Fraud detection — Combine transaction patterns to flag suspicious activities with explainable reasoning

Bayesian networks turn complex dependency structures into tractable probabilistic reasoning.

DfBayesian Network

A Bayesian Network (BN) is a directed acyclic graph (DAG) G = (V, E) together with a set of conditional probability distributions that compactly represent a joint probability distribution over random variables X₁, X₂, ..., X_n. Each node i ∈ V represents a random variable X_i, and each directed edge (i, j) ∈ E encodes the dependency of X_j on its parent set Pa(X_j).

The joint distribution factorizes as:

P(X_1, X_2, ..., X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))

This factorization is valid for any topological ordering of the variables and exploits conditional independencies encoded in the graph structure.

Conditional Independence and Factorization

A Bayesian network encodes two types of independence:

Local independence: Each variable is conditionally independent of its non-descendants given its parents:

X_i \perp\!\!\!\perp (\text{NonDesc}(X_i) \setminus \text{Pa}(X_i)) \mid \text{Pa}(X_i)

Global independence: The joint distribution satisfies the Markov condition:

P(X_1,...,X_n) = \prod_i P(X_i | \text{Pa}(X_i))

The number of parameters required is:

\sum_{i=1}^{n} \left(|X_i| - 1\right) \prod_{X_j \in \text{Pa}(X_i)} |X_j|

which is exponentially smaller than the |X₁| × |X₂| × ... × |X_n| parameters for the full joint distribution.

D-Separation and Conditional Independence

Thd-Separation Criterion

A set of nodes Z d-separates nodes X from nodes Y in a DAG G if every undirected path between any node in X and any node in Y is "blocked" by Z. A path is blocked by Z if:

The path contains a chain a → b → c or a ← b ← c where b ∈ Z
The path contains a fork a ← b → c where b ∈ Z
The path contains an collider a → b ← c where b ∉ Z and no descendant of b is in Z

The Bayes-Ball algorithm provides an efficient O(|E|) procedure for testing d-separation: given Z, a node is reachable from X by a path not blocked by Z if and only if X and Y are not d-separated.

Faithfulness and Equivalence

Two DAGs are Markov equivalent if they encode the same set of conditional independence statements. The Completed Partially Directed Acyclic Graph (CPDAG) uniquely characterizes an equivalence class. Key results:

DAGs G₁ and G₂ are Markov equivalent iff they have the same skeleton and the same v-structures
The skeleton is the underlying undirected graph
A v-structure is a configuration X → Z ← Y where X and Y are not adjacent

Faithfulness assumes all and only conditional independencies implied by the d-separation criterion hold in the true distribution. Violations occur with measure zero under continuous distributions but can arise in discrete models.

Parameter Estimation

ThMaximum Likelihood Estimation

Given a DAG G and dataset D = {d₁, ..., d_N} of i.i.d. observations, the log-likelihood decomposes according to the network structure:

\ell(\theta | D, G) = \sum_{i=1}^{n} \sum_{j=1}^{|X_i|} \sum_{\mathbf{z} \in \text{Pa}(X_i)} N_{ijk} \log \theta_{ijk}

where N_{ijk} is the number of instances where X_i takes value k and Pa(X_i) takes configuration z.

The MLE for each parameter is:

\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k'} N_{ijk'}}

with the constraint that ∑k θ{ijk} = 1 for each (i, z) combination. This is a multinomial MLE computed independently for each node-parent configuration.

Bayesian Parameter Estimation

With a Dirichlet prior Dir(α_{ijk}) for each set of parameters {θ_{ij1}, ..., θ_{ij|X_i|}}, the posterior is also Dirichlet:

P(\theta | D, G) = \prod_{i=1}^{n} \prod_{j} \text{Dir}(\theta_{ij} | \alpha_{ij} + N_{ij\cdot})

The posterior mean (Bayesian estimate) is:

\tilde{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{\sum_{k'} (N_{ijk'} + \alpha_{ijk'})}

The BDeu (Bayesian Dirichlet equivalent uniform) prior with α_{ijk} = α / (|X_i| × ∏_{Pa} |X_j|) provides a score equivalent to the BIC approximation.

Structure Learning

ThScore-Based Structure Learning

Structure learning is NP-hard in general. Score-based methods search over the space of DAGs and optimize a scoring function:

BIC Score:

\text{BIC}(G | D) = \sum_{i=1}^{n} \sum_{j,k} N_{ijk} \log \hat{\theta}_{ijk} - \frac{\log N}{2} \sum_{i=1}^{n} (|X_i| - 1) \prod_{X_j \in \text{Pa}(X_i)} |X_j|

BDe Score (Bayesian):

\text{BDe}(G | D) = \log P(D | G) + \log P(G)

where P(D | G) is the marginal likelihood obtained by integrating over parameters. The BDe score is decomposable over nodes and can be computed in closed form with the Dirichlet prior.

Constraint-Based Learning

The PC algorithm (named after Peter and Clark) learns structure via conditional independence tests:

Start with a complete undirected graph
For each pair (X, Y), test X ⊥ Y | S for increasing subsets S
Remove edges where independence is found
Direct edges using v-structure detection (collider orientation)
Apply Meek's rules for orientation propagation

The PC algorithm requires O(n⁴ |D|) conditional independence tests in the worst case. The Max-Min Hill Climbing (MMHC) algorithm combines constraint and score-based approaches for improved performance.

Inference in Bayesian Networks

DfExact Inference — Variable Elimination

Exact inference computes P(Q | E = e) for query variables Q given evidence E. Variable elimination exploits the factored representation:

P(Q, E=e) = \sum_{H} \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i)) \Big|_{E=e}

Variables are eliminated one at a time by summing out. The order of elimination determines efficiency. With treewidth w, the complexity is O(n · |X|^{w+1}), where w is the treewidth of the moralized triangulated graph.

Junction Tree Algorithm

The junction tree algorithm converts a DAG into a tree of cliques:

Moralize: Add edges between co-parents, drop directions
Triangulate: Add fill-in edges to eliminate induced cycles of length > 3
Form cliques: Identify maximal cliques C₁, ..., C_k
Build junction tree: Find maximum spanning tree over cliques with intersection sizes as weights

Belief propagation on the junction tree computes exact marginals in O(k · |X|^{w+1}) time, where w is the treewidth.

Approximate Inference — Sampling Methods

When exact inference is intractable (treewidth too large), approximate methods are used:

Forward sampling: Generate ancestors before descendants; simple but can suffer from low acceptance
Rejection sampling: Discard samples inconsistent with evidence
Likelihood weighting: Fix evidence variables, weight samples by likelihood
Gibbs sampling: MCMC method that samples each variable conditioned on its Markov blanket
Variational inference: Optimize a tractable approximation q(Q) to minimize KL(q || p)

Variational methods provide deterministic bounds and scale to large networks, while MCMC methods are asymptotically exact but may mix slowly.

import numpy as np
from itertools import product

class BayesianNetwork:
    def __init__(self, n_vars, card, parents):
        self.n = n_vars
        self.card = card
        self.parents = parents
        self.cpd = [None] * n_vars

    def random初始化(self, seed=42):
        rng = np.random.RandomState(seed)
        for i in range(self.n):
            parent_card = np.prod([self.card[p] for p in self.parents[i]]) if self.parents[i] else 1
            self.cpd[i] = rng.dirichlet(np.ones(self.card[i]), size=parent_card)

    def sample(self, n_samples=1, rng=None):
        if rng is None:
            rng = np.random.RandomState()
        data = np.zeros((n_samples, self.n), dtype=int)
        for i in range(self.n):
            if not self.parents[i]:
                data[:, i] = rng.choice(self.card[i], size=n_samples, p=self.cpd[i])
            else:
                for idx in range(n_samples):
                    parent_config = tuple(data[idx, p] for p in self.parents[i])
                    flat_idx = np.ravel_multi_index(parent_config, [self.card[p] for p in self.parents[i]])
                    data[idx, i] = rng.choice(self.card[i], p=self.cpd[i][flat_idx])
        return data

    def log_likelihood(self, data):
        ll = 0.0
        for i in range(self.n):
            for row in data:
                if self.parents[i]:
                    parent_config = tuple(row[p] for p in self.parents[i])
                    flat_idx = np.ravel_multi_index(parent_config, [self.card[p] for p in self.parents[i]])
                else:
                    flat_idx = 0
                ll += np.log(self.cpd[i][flat_idx, row[i]] + 1e-10)
        return ll

    def variable_elimination(self, evidence, query):
        factors = [self.cpd[i].copy() for i in range(self.n)]
        observed = set(evidence.keys())
        hidden = [i for i in range(self.n) if i not in observed and i not in query]
        for var in hidden:
            product_factor = np.ones([self.card[v] if v in [var] + self.parents[v] else 1
                                      for v in range(self.n)])
            # Simplified VE: sum out variable
            factors = [f for f in factors]  # placeholder
        return factors

Naive Bayes as a Special Case

The Naive Bayes classifier is a Bayesian network with a single parent (the class variable C) and conditionally independent features:

P(C, X_1,...,X_d) = P(C) \prod_{j=1}^{d} P(X_j | C)

Despite its strong independence assumption, Naive Bayes performs surprisingly well in practice, particularly for text classification. The assumption is "good enough" because:

It provides a biased but low-variance estimate
The decision boundary is robust to correlation between features
Regularization via the prior helps with small samples

Extensions like Tree-Augmented Naive Bayes (TAN) relax the independence assumption by allowing a tree structure among features, adding O(d) parameters while maintaining tractability.

Medical Diagnosis Network

A classic Bayesian network for medical diagnosis includes variables like:

Disease (D): {Flu, Cold, Healthy}
Fever (F): {High, Low, None}
Cough (C): {Severe, Mild, None}
Headache (H): {Yes, No}

The structure encodes domain knowledge:

D → F, D → C, D → H (diseases cause symptoms)
F ⊥ C | D (symptoms are conditionally independent given disease)

With CPTs (conditional probability tables):

P(D = Flu) = 0.1, P(D = Cold) = 0.3, P(D = Healthy) = 0.6
P(F = High | D = Flu) = 0.8, P(F = High | D = Cold) = 0.2

Posterior inference: P(D | F = High, C = Severe, H = Yes) computes the diagnostic probability using Bayes' theorem with the network's factorization.

Bayesian Networks

Bayesian Networks

Reasoning Under Uncertainty With Graphical Models

DfBayesian Network

Conditional Independence and Factorization

D-Separation and Conditional Independence

Thd-Separation Criterion

Parameter Estimation

ThMaximum Likelihood Estimation

Bayesian Parameter Estimation

Structure Learning

ThScore-Based Structure Learning

Inference in Bayesian Networks

DfExact Inference — Variable Elimination

Junction Tree Algorithm

Naive Bayes as a Special Case

Medical Diagnosis Network

Premium Content

Need Expert Statistics Help?