πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Bayesian Networks

Advanced Statistical MethodsGraphical Models🟒 Free Lesson

Advertisement

Bayesian Networks

Advanced Statistical Methods

Reasoning Under Uncertainty With Graphical Models

Bayesian networks compactly represent joint probability distributions using directed acyclic graphs, encoding conditional independencies that enable efficient inference. D-separation provides the graphical criterion for independence.

  • Medical diagnosis β€” Combine symptom observations to infer disease probabilities
  • Risk assessment β€” Model cascading failures in complex engineering systems
  • Fraud detection β€” Combine transaction patterns to flag suspicious activities with explainable reasoning

Bayesian networks turn complex dependency structures into tractable probabilistic reasoning.


DfBayesian Network

A Bayesian Network (BN) is a directed acyclic graph (DAG) G = (V, E) together with a set of conditional probability distributions that compactly represent a joint probability distribution over random variables X₁, Xβ‚‚, ..., X_n. Each node i ∈ V represents a random variable X_i, and each directed edge (i, j) ∈ E encodes the dependency of X_j on its parent set Pa(X_j).

The joint distribution factorizes as:

P(X1,X2,...,Xn)=∏i=1nP(Xi∣Pa(Xi))P(X_1, X_2, ..., X_n) = \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i))

This factorization is valid for any topological ordering of the variables and exploits conditional independencies encoded in the graph structure.

Conditional Independence and Factorization

A Bayesian network encodes two types of independence:

  1. Local independence: Each variable is conditionally independent of its non-descendants given its parents:
XiβŠ₯ ⁣ ⁣ ⁣βŠ₯(NonDesc(Xi)βˆ–Pa(Xi))∣Pa(Xi)X_i \perp\!\!\!\perp (\text{NonDesc}(X_i) \setminus \text{Pa}(X_i)) \mid \text{Pa}(X_i)
  1. Global independence: The joint distribution satisfies the Markov condition:
P(X1,...,Xn)=∏iP(Xi∣Pa(Xi))P(X_1,...,X_n) = \prod_i P(X_i | \text{Pa}(X_i))

The number of parameters required is:

βˆ‘i=1n(∣Xiβˆ£βˆ’1)∏Xj∈Pa(Xi)∣Xj∣\sum_{i=1}^{n} \left(|X_i| - 1\right) \prod_{X_j \in \text{Pa}(X_i)} |X_j|

which is exponentially smaller than the |X₁| Γ— |Xβ‚‚| Γ— ... Γ— |X_n| parameters for the full joint distribution.

D-Separation and Conditional Independence

Thd-Separation Criterion

A set of nodes Z d-separates nodes X from nodes Y in a DAG G if every undirected path between any node in X and any node in Y is "blocked" by Z. A path is blocked by Z if:

  1. The path contains a chain a β†’ b β†’ c or a ← b ← c where b ∈ Z
  2. The path contains a fork a ← b β†’ c where b ∈ Z
  3. The path contains an collider a β†’ b ← c where b βˆ‰ Z and no descendant of b is in Z

The Bayes-Ball algorithm provides an efficient O(|E|) procedure for testing d-separation: given Z, a node is reachable from X by a path not blocked by Z if and only if X and Y are not d-separated.

Faithfulness and Equivalence

Two DAGs are Markov equivalent if they encode the same set of conditional independence statements. The Completed Partially Directed Acyclic Graph (CPDAG) uniquely characterizes an equivalence class. Key results:

  • DAGs G₁ and Gβ‚‚ are Markov equivalent iff they have the same skeleton and the same v-structures
  • The skeleton is the underlying undirected graph
  • A v-structure is a configuration X β†’ Z ← Y where X and Y are not adjacent

Faithfulness assumes all and only conditional independencies implied by the d-separation criterion hold in the true distribution. Violations occur with measure zero under continuous distributions but can arise in discrete models.

Parameter Estimation

ThMaximum Likelihood Estimation

Given a DAG G and dataset D = {d₁, ..., d_N} of i.i.d. observations, the log-likelihood decomposes according to the network structure:

β„“(θ∣D,G)=βˆ‘i=1nβˆ‘j=1∣Xiβˆ£βˆ‘z∈Pa(Xi)Nijklog⁑θijk\ell(\theta | D, G) = \sum_{i=1}^{n} \sum_{j=1}^{|X_i|} \sum_{\mathbf{z} \in \text{Pa}(X_i)} N_{ijk} \log \theta_{ijk}

where N_{ijk} is the number of instances where X_i takes value k and Pa(X_i) takes configuration z.

The MLE for each parameter is:

ΞΈ^ijk=Nijkβˆ‘kβ€²Nijkβ€²\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k'} N_{ijk'}}

with the constraint that βˆ‘k ΞΈ{ijk} = 1 for each (i, z) combination. This is a multinomial MLE computed independently for each node-parent configuration.

Bayesian Parameter Estimation

With a Dirichlet prior Dir(Ξ±_{ijk}) for each set of parameters {ΞΈ_{ij1}, ..., ΞΈ_{ij|X_i|}}, the posterior is also Dirichlet:

P(θ∣D,G)=∏i=1n∏jDir(ΞΈij∣αij+Nijβ‹…)P(\theta | D, G) = \prod_{i=1}^{n} \prod_{j} \text{Dir}(\theta_{ij} | \alpha_{ij} + N_{ij\cdot})

The posterior mean (Bayesian estimate) is:

ΞΈ~ijk=Nijk+Ξ±ijkβˆ‘kβ€²(Nijkβ€²+Ξ±ijkβ€²)\tilde{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{\sum_{k'} (N_{ijk'} + \alpha_{ijk'})}

The BDeu (Bayesian Dirichlet equivalent uniform) prior with Ξ±_{ijk} = Ξ± / (|X_i| Γ— ∏_{Pa} |X_j|) provides a score equivalent to the BIC approximation.

Structure Learning

ThScore-Based Structure Learning

Structure learning is NP-hard in general. Score-based methods search over the space of DAGs and optimize a scoring function:

BIC Score:

BIC(G∣D)=βˆ‘i=1nβˆ‘j,kNijklog⁑θ^ijkβˆ’log⁑N2βˆ‘i=1n(∣Xiβˆ£βˆ’1)∏Xj∈Pa(Xi)∣Xj∣\text{BIC}(G | D) = \sum_{i=1}^{n} \sum_{j,k} N_{ijk} \log \hat{\theta}_{ijk} - \frac{\log N}{2} \sum_{i=1}^{n} (|X_i| - 1) \prod_{X_j \in \text{Pa}(X_i)} |X_j|

BDe Score (Bayesian):

BDe(G∣D)=log⁑P(D∣G)+log⁑P(G)\text{BDe}(G | D) = \log P(D | G) + \log P(G)

where P(D | G) is the marginal likelihood obtained by integrating over parameters. The BDe score is decomposable over nodes and can be computed in closed form with the Dirichlet prior.

Constraint-Based Learning

The PC algorithm (named after Peter and Clark) learns structure via conditional independence tests:

  1. Start with a complete undirected graph
  2. For each pair (X, Y), test X βŠ₯ Y | S for increasing subsets S
  3. Remove edges where independence is found
  4. Direct edges using v-structure detection (collider orientation)
  5. Apply Meek's rules for orientation propagation

The PC algorithm requires O(n⁴ |D|) conditional independence tests in the worst case. The Max-Min Hill Climbing (MMHC) algorithm combines constraint and score-based approaches for improved performance.

Inference in Bayesian Networks

DfExact Inference β€” Variable Elimination

Exact inference computes P(Q | E = e) for query variables Q given evidence E. Variable elimination exploits the factored representation:

P(Q,E=e)=βˆ‘H∏i=1nP(Xi∣Pa(Xi))∣E=eP(Q, E=e) = \sum_{H} \prod_{i=1}^{n} P(X_i | \text{Pa}(X_i)) \Big|_{E=e}

Variables are eliminated one at a time by summing out. The order of elimination determines efficiency. With treewidth w, the complexity is O(n Β· |X|^{w+1}), where w is the treewidth of the moralized triangulated graph.

Junction Tree Algorithm

The junction tree algorithm converts a DAG into a tree of cliques:

  1. Moralize: Add edges between co-parents, drop directions
  2. Triangulate: Add fill-in edges to eliminate induced cycles of length > 3
  3. Form cliques: Identify maximal cliques C₁, ..., C_k
  4. Build junction tree: Find maximum spanning tree over cliques with intersection sizes as weights

Belief propagation on the junction tree computes exact marginals in O(k Β· |X|^{w+1}) time, where w is the treewidth.

Approximate Inference β€” Sampling Methods

When exact inference is intractable (treewidth too large), approximate methods are used:

  • Forward sampling: Generate ancestors before descendants; simple but can suffer from low acceptance
  • Rejection sampling: Discard samples inconsistent with evidence
  • Likelihood weighting: Fix evidence variables, weight samples by likelihood
  • Gibbs sampling: MCMC method that samples each variable conditioned on its Markov blanket
  • Variational inference: Optimize a tractable approximation q(Q) to minimize KL(q || p)

Variational methods provide deterministic bounds and scale to large networks, while MCMC methods are asymptotically exact but may mix slowly.

import numpy as np
from itertools import product

class BayesianNetwork:
    def __init__(self, n_vars, card, parents):
        self.n = n_vars
        self.card = card
        self.parents = parents
        self.cpd = [None] * n_vars

    def randomεˆε§‹εŒ–(self, seed=42):
        rng = np.random.RandomState(seed)
        for i in range(self.n):
            parent_card = np.prod([self.card[p] for p in self.parents[i]]) if self.parents[i] else 1
            self.cpd[i] = rng.dirichlet(np.ones(self.card[i]), size=parent_card)

    def sample(self, n_samples=1, rng=None):
        if rng is None:
            rng = np.random.RandomState()
        data = np.zeros((n_samples, self.n), dtype=int)
        for i in range(self.n):
            if not self.parents[i]:
                data[:, i] = rng.choice(self.card[i], size=n_samples, p=self.cpd[i])
            else:
                for idx in range(n_samples):
                    parent_config = tuple(data[idx, p] for p in self.parents[i])
                    flat_idx = np.ravel_multi_index(parent_config, [self.card[p] for p in self.parents[i]])
                    data[idx, i] = rng.choice(self.card[i], p=self.cpd[i][flat_idx])
        return data

    def log_likelihood(self, data):
        ll = 0.0
        for i in range(self.n):
            for row in data:
                if self.parents[i]:
                    parent_config = tuple(row[p] for p in self.parents[i])
                    flat_idx = np.ravel_multi_index(parent_config, [self.card[p] for p in self.parents[i]])
                else:
                    flat_idx = 0
                ll += np.log(self.cpd[i][flat_idx, row[i]] + 1e-10)
        return ll

    def variable_elimination(self, evidence, query):
        factors = [self.cpd[i].copy() for i in range(self.n)]
        observed = set(evidence.keys())
        hidden = [i for i in range(self.n) if i not in observed and i not in query]
        for var in hidden:
            product_factor = np.ones([self.card[v] if v in [var] + self.parents[v] else 1
                                      for v in range(self.n)])
            # Simplified VE: sum out variable
            factors = [f for f in factors]  # placeholder
        return factors

Naive Bayes as a Special Case

The Naive Bayes classifier is a Bayesian network with a single parent (the class variable C) and conditionally independent features:

P(C,X1,...,Xd)=P(C)∏j=1dP(Xj∣C)P(C, X_1,...,X_d) = P(C) \prod_{j=1}^{d} P(X_j | C)

Despite its strong independence assumption, Naive Bayes performs surprisingly well in practice, particularly for text classification. The assumption is "good enough" because:

  • It provides a biased but low-variance estimate
  • The decision boundary is robust to correlation between features
  • Regularization via the prior helps with small samples

Extensions like Tree-Augmented Naive Bayes (TAN) relax the independence assumption by allowing a tree structure among features, adding O(d) parameters while maintaining tractability.

Medical Diagnosis Network

A classic Bayesian network for medical diagnosis includes variables like:

  • Disease (D): {Flu, Cold, Healthy}
  • Fever (F): {High, Low, None}
  • Cough (C): {Severe, Mild, None}
  • Headache (H): {Yes, No}

The structure encodes domain knowledge:

  • D β†’ F, D β†’ C, D β†’ H (diseases cause symptoms)
  • F βŠ₯ C | D (symptoms are conditionally independent given disease)

With CPTs (conditional probability tables):

  • P(D = Flu) = 0.1, P(D = Cold) = 0.3, P(D = Healthy) = 0.6
  • P(F = High | D = Flu) = 0.8, P(F = High | D = Cold) = 0.2

Posterior inference: P(D | F = High, C = Severe, H = Yes) computes the diagnostic probability using Bayes' theorem with the network's factorization.

⭐

Premium Content

Bayesian Networks

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement