Network Analysis and Graph Statistics

Advanced Statistical Methods

Understanding Complex Systems Through Their Connections

Network analysis uses graph theory and statistical models to study the structure and dynamics of complex interconnected systems. Centrality measures, community detection, and generative models reveal hidden patterns.

Social networks — Identify influential individuals and detect community structure in online platforms
Epidemiology — Model disease spread through contact networks for targeted intervention
Biology — Analyze protein interaction networks to discover functional modules and drug targets

Network analysis reveals how the pattern of connections shapes system behavior.

Network science provides the mathematical language for studying complex systems of interacting entities. From social networks and biological pathways to communication infrastructure and financial systems, the statistical analysis of networks reveals structural patterns, identifies influential nodes, and detects latent community organization. This lesson develops the theoretical foundations and computational methods for rigorous network analysis.

Graph Fundamentals

DfGraph (Network)

A graph $G = (V, E)$ consists of a finite set of vertices (nodes) $V = \{v_1, v_2, \dots, v_n\}$ and a set of edges (ties) $E \subseteq V \times V$ . The order of the graph is $|V| = n$ and its size is $|E| = m$ .

DfAdjacency Matrix

The adjacency matrix $\mathbf{A} \in \{0, 1\}^{n \times n}$ of a simple undirected graph encodes edge presence:

A_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}

For weighted graphs, $A_{ij} = w_{ij}$ where $w_{ij} \geq 0$ . For directed graphs, $A_{ij} = 1$ if there is an edge from $i$ to $j$ , and $A_{ij} \neq A_{ji}$ in general.

Degree

The degree of vertex $v_i$ is the number of edges incident to it:

k_i = \sum_{j=1}^{n} A_{ij}

In a directed graph, the in-degree is $k_i^{\text{in}} = \sum_j A_{ji}$ and the out-degree is $k_i^{\text{out}} = \sum_j A_{ij}$ . The degree distribution $P(k)$ gives the fraction of vertices with degree $k$ , which is a fundamental descriptor of network topology.

DfPath, Distance, Diameter

A path of length $\ell$ from $v_i$ to $v_j$ is a sequence of distinct vertices $v_i = v_{i_0}, v_{i_1}, \dots, v_{i_\ell} = v_j$ where each consecutive pair is connected by an edge. The shortest path length $d(i,j)$ is the geodesic distance. The diameter of a connected graph is $\text{diam}(G) = \max_{i,j} d(i,j)$ .

Key Graph Properties

Connected graph: Every pair of vertices is joined by a path
Bipartite vertices: Can be partitioned into two sets with edges only between sets
Clustering coefficient: $C_i = \frac{2e_i}{k_i(k_i - 1)}$ where $e_i$ is the number of edges among $i$ 's neighbors
Average path length: $\bar{d} = \frac{1}{n(n-1)}\sum_{i \neq j} d(i,j)$

Centrality Measures

Centrality quantifies the relative importance of nodes within a network. Different measures capture different notions of "importance" — popularity, brokerage, proximity, or influence.

Degree Centrality

The simplest centrality measure normalizes degree by the maximum possible:

C_D(v_i) = \frac{k_i}{n - 1}

This measures local connectivity and is appropriate for identifying nodes with many direct connections (e.g., popular individuals in social networks).

Betweenness Centrality

Betweenness centrality captures the role of a node as a bridge along shortest paths:

C_B(v_i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}

where $\sigma_{st}$ is the total number of shortest paths from $s$ to $t$ , and $\sigma_{st}(i)$ is the number of those passing through $i$ . Normalized betweenness is:

C_B'(v_i) = \frac{2C_B(v_i)}{(n-1)(n-2)}

Betweenness is $O(nm)$ to compute exactly, though approximation algorithms exist for large networks.

Closeness Centrality

Closeness centrality measures how close a node is to all other nodes:

C_C(v_i) = \frac{n - 1}{\sum_{j \neq i} d(i, j)}

A node with high closeness can spread information efficiently to the entire network. This is undefined for disconnected graphs unless restricted to the reachable component.

Eigenvector Centrality

Eigenvector centrality assigns importance based on the importance of neighbors:

C_E(v_i) = \frac{1}{\lambda} \sum_{j=1}^{n} A_{ij} C_E(v_j)

In matrix form, $\mathbf{A}\mathbf{x} = \lambda \mathbf{x}$ , so the centrality vector is the principal eigenvector of the adjacency matrix. By the Perron-Frobenius theorem, for a connected non-negative matrix, this eigenvector is unique and has all positive entries. Google's PageRank is a modified eigenvector centrality.

Centrality Comparison

Measure	Captures	Complexity	Sensitive to
Degree	Local connectivity	$O(n)$	Hub detection
Betweenness	Bridge/ brokerage role	$O(nm)$	Bottlenecks
Closeness	Proximity to all nodes	$O(nm)$	Central positioning
Eigenvector	Influence in network	$O(n^2)$ per iteration	Global structure

Community Detection

Communities (modules, clusters) are groups of nodes that are more densely connected internally than with the rest of the network. Quantifying community structure requires a modularity function.

Modularity (Newman-Girvan)

The modularity $Q$ of a partition of the network into communities $\{C_1, \dots, C_K\}$ is:

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)

where $\delta(C_i, C_j) = 1$ if nodes $i$ and $j$ belong to the same community, and $m = \frac{1}{2}\sum_i k_i$ is the total number of edges. The term $\frac{k_i k_j}{2m}$ is the expected number of edges under the configuration model (random graphs with the same degree sequence). $Q$ ranges from $-1/2$ to $1$ , with values above $0.3$ indicating significant community structure.

Louvain Algorithm

The Louvain method (Blondel et al., 2008) greedily optimizes modularity in two phases:

Local moving: Each node $i$ is moved to the community that yields the maximum modularity gain $\Delta Q$ :

\Delta Q = \left[ \frac{\sum_{\text{in}} + 2k_{i,\text{in}}}{2m} - \left(\frac{\sum_{\text{tot}} + k_i}{2m}\right)^2 \right] - \left[ \frac{\sum_{\text{in}}}{2m} - \left(\frac{\sum_{\text{tot}}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2 \right]

Aggregation: A new network is built where each community becomes a single node, with edge weights summed. Repeat until no further improvement.

Louvain runs in $O(n \log n)$ time and handles networks with millions of nodes.

Resolution Limit

The Louvain algorithm suffers from a resolution limit: communities smaller than a scale $\sim \sqrt{2m}$ may not be detected. This is because modularity optimization with a single resolution parameter cannot detect small communities in large networks. The Leiden algorithm (Traag et al., 2019) improves upon Louvain by guaranteeing well-connected communities and avoiding poorly connected sub-communities.

Network Generative Models

Generative models specify probability distributions over graphs, enabling hypothesis testing and simulation.

Erdos-Renyi Model

The Erdos-Renyi $G(n, p)$ model generates a random graph on $n$ nodes where each edge appears independently with probability $p$ . The degree distribution is $\text{Binomial}(n-1, p)$ , approaching Poisson( $\lambda = (n-1)p$ ) for large $n$ :

P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k} \approx \frac{\lambda^k e^{-\lambda}}{k!}

The giant component emerges when $p > 1/n$ , with a phase transition at the critical threshold.

Barabasi-Albert Model

The Barabasi-Albert model generates scale-free networks via preferential attachment. Starting from a small complete graph, each new node attaches to $m$ existing nodes with probability proportional to their degree:

\Pi(i) = \frac{k_i}{\sum_j k_j}

The resulting degree distribution follows a power law: $P(k) \sim k^{-\gamma}$ with $\gamma = 3$ . This models many real-world networks (Internet, citation networks, biological networks) where a few "hubs" have extremely high degree.

Watts-Strogatz Model

The Watts-Strogatz model generates small-world networks with high clustering and short average path length. Start with a ring lattice of $n$ nodes each connected to $k$ nearest neighbors. Each edge is rewired with probability $p$ :

At $p = 0$ : regular lattice (high clustering, long paths)
At $p = 1$ : random graph (low clustering, short paths)
Intermediate $p$ : small-world regime — short paths AND high clustering

The characteristic path length scales as $L \sim \frac{n}{2k}$ while clustering $C \sim \frac{3k}{3k - 1}$ for small $p$ .

Model Comparison

Model	Degree Distribution	Clustering	Path Length	Motivation
Erdos-Renyi	Poisson	Low ( $\sim p$ )	Short ( $\sim \log n$ )	Random baseline
Barabasi-Albert	Power law ( $k^{-3}$ )	Moderate	Short	Preferential attachment
Watts-Strogatz	Narrow (peaked)	High	Short	Small-world phenomenon

Exponential Random Graph Models

ERGMs provide a principled statistical framework for modeling network structure as a function of network statistics.

DfExponential Random Graph Model (ERGM)

The ERGM defines a probability distribution over the space of all graphs $\mathcal{G}$ on $n$ nodes:

P(G = g \mid \boldsymbol{\theta}) = \frac{\exp\left(\boldsymbol{\theta}^\top \mathbf{s}(g)\right)}{\kappa(\boldsymbol{\theta})}

where $\mathbf{s}(g)$ is a vector of network statistics (e.g., edge count, triangles, degree terms) and $\kappa(\boldsymbol{\theta}) = \sum_{g' \in \mathcal{G}} \exp(\boldsymbol{\theta}^\top \mathbf{s}(g'))$ is the normalizing constant. The parameter $\theta_j$ measures the contribution of statistic $s_j$ to the log-odds of edge presence.

Common ERGM Statistics

Typical network statistics include:

Edges: $\sum_{i<j} A_{ij}$ — controls overall density
Mutual: $\sum_{i<j} A_{ij}A_{ji}$ — reciprocity in directed networks
Triangles: $\sum_{i<j<k} A_{ij}A_{jk}A_{ik}$ — transitivity
k-stars: $\sum_i \binom{k_i}{k}$ — degree-based terms
Geometrically weighted edges (GWE): $e^{\alpha}\sum_{i=1}^{n}(1 - (1-e^{-\alpha})^{k_i})$ — smooth degree distribution penalty

Degeneracy Problem

Many naive ERGM specifications (e.g., edges + triangles) suffer from degeneracy: the probability mass concentrates on either the empty or complete graph, with no realistic intermediate structures. Solutions include:

Geometrically weighted terms (Snijders et al., 2006)
Maximum pseudolikelihood estimation as a starting point
Curved ERGM specifications with carefully chosen statistics

Python Implementation

import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Generate Network Models ---
n = 500

# Erdos-Renyi
G_er = nx.erdos_renyi_graph(n, p=0.01, seed=42)
degrees_er = [d for _, d in G_er.degree()]

# Barabasi-Albert
G_ba = nx.barabasi_albert_graph(n, m=3, seed=42)
degrees_ba = [d for _, d in G_ba.degree()]

# Watts-Strogatz
G_ws = nx.watts_strogatz_graph(n, k=6, p=0.1, seed=42)
degrees_ws = [d for _, d in G_ws.degree()]

# --- Centrality Measures ---
G = G_ba  # Use BA model for centrality analysis
degree_cent = nx.degree_centrality(G)
between_cent = nx.betweenness_centrality(G, k=100)  # sampled
closeness_cent = nx.closeness_centrality(G)
eigen_cent = nx.eigenvector_centrality(G, max_iter=1000)

# Top nodes by different centrality measures
top_degree = sorted(degree_cent.items(), key=lambda x: -x[1])[:5]
top_between = sorted(between_cent.items(), key=lambda x: -x[1])[:5]
top_eigen = sorted(eigen_cent.items(), key=lambda x: -x[1])[:5]

print("=== Top 5 Nodes by Centrality ===")
print(f"{'Node':<8} {'Degree':<12} {'Betweenness':<14} {'Eigenvector':<12}")
print("-" * 46)
for i in range(5):
    print(f"{top_degree[i][0]:<8} {top_degree[i][1]:<12.4f} "
          f"{top_between[i][1]:<14.4f} {top_eigen[i][1]:<12.4f}")

# --- Community Detection ---
communities = louvain_communities(G, seed=42)
Q = modularity(G, communities)
print(f"\n=== Community Detection (Louvain) ===")
print(f"Number of communities: {len(communities)}")
print(f"Modularity: {Q:.4f}")
print(f"Community sizes: {[len(c) for c in sorted(communities, key=len, reverse=True)]}")

# --- Degree Distribution Analysis ---
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, degs, title in zip(axes, 
                            [degrees_er, degrees_ba, degrees_ws],
                            ['Erdos-Renyi', 'Barabasi-Albert', 'Watts-Strogatz']):
    ax.hist(degs, bins=max(degs) - min(degs) + 1, density=True, 
            alpha=0.7, edgecolor='black', linewidth=0.5)
    ax.set_xlabel('Degree k')
    ax.set_ylabel('P(k)')
    ax.set_title(f'{title} (n={n})')

plt.tight_layout()
plt.savefig("network_degree_distributions.png", dpi=150, bbox_inches="tight")
plt.show()

# --- Network Statistics ---
print(f"\n=== Network Statistics ===")
for name, G_model in [('ER', G_er), ('BA', G_ba), ('WS', G_ws)]:
    print(f"\n{name}:")
    print(f"  Edges: {G_model.number_of_edges()}")
    print(f"  Avg degree: {2 * G_model.number_of_edges() / n:.2f}")
    print(f"  Clustering coeff: {nx.average_clustering(G_model):.4f}")
    print(f"  Avg path length: {nx.average_shortest_path_length(G_model) if nx.is_connected(G_model) else 'N/A (disconnected)'}")
    print(f"  Diameter: {nx.diameter(G_model) if nx.is_connected(G_model) else 'N/A (disconnected)'}")

Summary

Key Takeaways: Network Analysis

Graph fundamentals — Adjacency matrices encode network structure; degree distribution $P(k)$ is the primary topological descriptor. Connectedness, clustering, and path lengths characterize global structure.
Centrality measures — Degree (local), betweenness (brokerage), closeness (proximity), and eigenvector (influence) capture different facets of node importance. Each has distinct computational complexity and appropriate use cases.
Community detection — Modularity $Q$ quantifies community structure relative to a null model. Louvain and Leiden algorithms efficiently optimize modularity, but beware the resolution limit.
Network models — Erdos-Renyi (Poisson degree), Barabasi-Albert (power-law), and Watts-Strogatz (small-world) capture distinct generative mechanisms. Real networks often exhibit scale-free degree distributions with high clustering.
ERGMs — Provide a statistical framework for modeling networks as functions of network statistics. Careful specification (geometrically weighted terms) avoids degeneracy. Parameter estimation via MCMLE or MCMC enables hypothesis testing about network structure.

Network Analysis and Graph Statistics

Network Analysis and Graph Statistics

Understanding Complex Systems Through Their Connections

Graph Fundamentals

DfGraph (Network)

DfAdjacency Matrix

Degree

DfPath, Distance, Diameter

Centrality Measures

Degree Centrality

Betweenness Centrality

Closeness Centrality

Eigenvector Centrality

Community Detection

Modularity (Newman-Girvan)

Louvain Algorithm

Network Generative Models

Erdos-Renyi Model

Barabasi-Albert Model

Watts-Strogatz Model

Exponential Random Graph Models

DfExponential Random Graph Model (ERGM)

Common ERGM Statistics

Python Implementation

Summary

Key Takeaways: Network Analysis

Premium Content

Need Expert Statistics Help?