Network Analysis and Graph Statistics
Advanced Statistical Methods
Understanding Complex Systems Through Their Connections
Network analysis uses graph theory and statistical models to study the structure and dynamics of complex interconnected systems. Centrality measures, community detection, and generative models reveal hidden patterns.
- Social networks β Identify influential individuals and detect community structure in online platforms
- Epidemiology β Model disease spread through contact networks for targeted intervention
- Biology β Analyze protein interaction networks to discover functional modules and drug targets
Network analysis reveals how the pattern of connections shapes system behavior.
Network science provides the mathematical language for studying complex systems of interacting entities. From social networks and biological pathways to communication infrastructure and financial systems, the statistical analysis of networks reveals structural patterns, identifies influential nodes, and detects latent community organization. This lesson develops the theoretical foundations and computational methods for rigorous network analysis.
Graph Fundamentals
DfGraph (Network)
A graph consists of a finite set of vertices (nodes) and a set of edges (ties) . The order of the graph is and its size is .
DfAdjacency Matrix
The adjacency matrix of a simple undirected graph encodes edge presence:
For weighted graphs, where . For directed graphs, if there is an edge from to , and in general.
Degree
The degree of vertex is the number of edges incident to it:
In a directed graph, the in-degree is and the out-degree is . The degree distribution gives the fraction of vertices with degree , which is a fundamental descriptor of network topology.
DfPath, Distance, Diameter
A path of length from to is a sequence of distinct vertices where each consecutive pair is connected by an edge. The shortest path length is the geodesic distance. The diameter of a connected graph is .
Key Graph Properties
- Connected graph: Every pair of vertices is joined by a path
- Bipartite vertices: Can be partitioned into two sets with edges only between sets
- Clustering coefficient: where is the number of edges among 's neighbors
- Average path length:
Centrality Measures
Centrality quantifies the relative importance of nodes within a network. Different measures capture different notions of "importance" β popularity, brokerage, proximity, or influence.
Degree Centrality
The simplest centrality measure normalizes degree by the maximum possible:
This measures local connectivity and is appropriate for identifying nodes with many direct connections (e.g., popular individuals in social networks).
Betweenness Centrality
Betweenness centrality captures the role of a node as a bridge along shortest paths:
where is the total number of shortest paths from to , and is the number of those passing through . Normalized betweenness is:
Betweenness is to compute exactly, though approximation algorithms exist for large networks.
Closeness Centrality
Closeness centrality measures how close a node is to all other nodes:
A node with high closeness can spread information efficiently to the entire network. This is undefined for disconnected graphs unless restricted to the reachable component.
Eigenvector Centrality
Eigenvector centrality assigns importance based on the importance of neighbors:
In matrix form, , so the centrality vector is the principal eigenvector of the adjacency matrix. By the Perron-Frobenius theorem, for a connected non-negative matrix, this eigenvector is unique and has all positive entries. Google's PageRank is a modified eigenvector centrality.
Centrality Comparison
| Measure | Captures | Complexity | Sensitive to |
|---|---|---|---|
| Degree | Local connectivity | Hub detection | |
| Betweenness | Bridge/ brokerage role | Bottlenecks | |
| Closeness | Proximity to all nodes | Central positioning | |
| Eigenvector | Influence in network | per iteration | Global structure |
Community Detection
Communities (modules, clusters) are groups of nodes that are more densely connected internally than with the rest of the network. Quantifying community structure requires a modularity function.
Modularity (Newman-Girvan)
The modularity of a partition of the network into communities is:
where if nodes and belong to the same community, and is the total number of edges. The term is the expected number of edges under the configuration model (random graphs with the same degree sequence). ranges from to , with values above indicating significant community structure.
Louvain Algorithm
The Louvain method (Blondel et al., 2008) greedily optimizes modularity in two phases:
- Local moving: Each node is moved to the community that yields the maximum modularity gain :
- Aggregation: A new network is built where each community becomes a single node, with edge weights summed. Repeat until no further improvement.
Louvain runs in time and handles networks with millions of nodes.
Resolution Limit
The Louvain algorithm suffers from a resolution limit: communities smaller than a scale may not be detected. This is because modularity optimization with a single resolution parameter cannot detect small communities in large networks. The Leiden algorithm (Traag et al., 2019) improves upon Louvain by guaranteeing well-connected communities and avoiding poorly connected sub-communities.
Network Generative Models
Generative models specify probability distributions over graphs, enabling hypothesis testing and simulation.
Erdos-Renyi Model
The Erdos-Renyi model generates a random graph on nodes where each edge appears independently with probability . The degree distribution is , approaching Poisson() for large :
The giant component emerges when , with a phase transition at the critical threshold.
Barabasi-Albert Model
The Barabasi-Albert model generates scale-free networks via preferential attachment. Starting from a small complete graph, each new node attaches to existing nodes with probability proportional to their degree:
The resulting degree distribution follows a power law: with . This models many real-world networks (Internet, citation networks, biological networks) where a few "hubs" have extremely high degree.
Watts-Strogatz Model
The Watts-Strogatz model generates small-world networks with high clustering and short average path length. Start with a ring lattice of nodes each connected to nearest neighbors. Each edge is rewired with probability :
- At : regular lattice (high clustering, long paths)
- At : random graph (low clustering, short paths)
- Intermediate : small-world regime β short paths AND high clustering
The characteristic path length scales as while clustering for small .
Model Comparison
| Model | Degree Distribution | Clustering | Path Length | Motivation |
|---|---|---|---|---|
| Erdos-Renyi | Poisson | Low () | Short () | Random baseline |
| Barabasi-Albert | Power law () | Moderate | Short | Preferential attachment |
| Watts-Strogatz | Narrow (peaked) | High | Short | Small-world phenomenon |
Exponential Random Graph Models
ERGMs provide a principled statistical framework for modeling network structure as a function of network statistics.
DfExponential Random Graph Model (ERGM)
The ERGM defines a probability distribution over the space of all graphs on nodes:
where is a vector of network statistics (e.g., edge count, triangles, degree terms) and is the normalizing constant. The parameter measures the contribution of statistic to the log-odds of edge presence.
Common ERGM Statistics
Typical network statistics include:
- Edges: β controls overall density
- Mutual: β reciprocity in directed networks
- Triangles: β transitivity
- k-stars: β degree-based terms
- Geometrically weighted edges (GWE): β smooth degree distribution penalty
Degeneracy Problem
Many naive ERGM specifications (e.g., edges + triangles) suffer from degeneracy: the probability mass concentrates on either the empty or complete graph, with no realistic intermediate structures. Solutions include:
- Geometrically weighted terms (Snijders et al., 2006)
- Maximum pseudolikelihood estimation as a starting point
- Curved ERGM specifications with carefully chosen statistics
Python Implementation
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Generate Network Models ---
n = 500
# Erdos-Renyi
G_er = nx.erdos_renyi_graph(n, p=0.01, seed=42)
degrees_er = [d for _, d in G_er.degree()]
# Barabasi-Albert
G_ba = nx.barabasi_albert_graph(n, m=3, seed=42)
degrees_ba = [d for _, d in G_ba.degree()]
# Watts-Strogatz
G_ws = nx.watts_strogatz_graph(n, k=6, p=0.1, seed=42)
degrees_ws = [d for _, d in G_ws.degree()]
# --- Centrality Measures ---
G = G_ba # Use BA model for centrality analysis
degree_cent = nx.degree_centrality(G)
between_cent = nx.betweenness_centrality(G, k=100) # sampled
closeness_cent = nx.closeness_centrality(G)
eigen_cent = nx.eigenvector_centrality(G, max_iter=1000)
# Top nodes by different centrality measures
top_degree = sorted(degree_cent.items(), key=lambda x: -x[1])[:5]
top_between = sorted(between_cent.items(), key=lambda x: -x[1])[:5]
top_eigen = sorted(eigen_cent.items(), key=lambda x: -x[1])[:5]
print("=== Top 5 Nodes by Centrality ===")
print(f"{'Node':<8} {'Degree':<12} {'Betweenness':<14} {'Eigenvector':<12}")
print("-" * 46)
for i in range(5):
print(f"{top_degree[i][0]:<8} {top_degree[i][1]:<12.4f} "
f"{top_between[i][1]:<14.4f} {top_eigen[i][1]:<12.4f}")
# --- Community Detection ---
communities = louvain_communities(G, seed=42)
Q = modularity(G, communities)
print(f"\n=== Community Detection (Louvain) ===")
print(f"Number of communities: {len(communities)}")
print(f"Modularity: {Q:.4f}")
print(f"Community sizes: {[len(c) for c in sorted(communities, key=len, reverse=True)]}")
# --- Degree Distribution Analysis ---
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, degs, title in zip(axes,
[degrees_er, degrees_ba, degrees_ws],
['Erdos-Renyi', 'Barabasi-Albert', 'Watts-Strogatz']):
ax.hist(degs, bins=max(degs) - min(degs) + 1, density=True,
alpha=0.7, edgecolor='black', linewidth=0.5)
ax.set_xlabel('Degree k')
ax.set_ylabel('P(k)')
ax.set_title(f'{title} (n={n})')
plt.tight_layout()
plt.savefig("network_degree_distributions.png", dpi=150, bbox_inches="tight")
plt.show()
# --- Network Statistics ---
print(f"\n=== Network Statistics ===")
for name, G_model in [('ER', G_er), ('BA', G_ba), ('WS', G_ws)]:
print(f"\n{name}:")
print(f" Edges: {G_model.number_of_edges()}")
print(f" Avg degree: {2 * G_model.number_of_edges() / n:.2f}")
print(f" Clustering coeff: {nx.average_clustering(G_model):.4f}")
print(f" Avg path length: {nx.average_shortest_path_length(G_model) if nx.is_connected(G_model) else 'N/A (disconnected)'}")
print(f" Diameter: {nx.diameter(G_model) if nx.is_connected(G_model) else 'N/A (disconnected)'}")
Summary
Key Takeaways: Network Analysis
- Graph fundamentals β Adjacency matrices encode network structure; degree distribution is the primary topological descriptor. Connectedness, clustering, and path lengths characterize global structure.
- Centrality measures β Degree (local), betweenness (brokerage), closeness (proximity), and eigenvector (influence) capture different facets of node importance. Each has distinct computational complexity and appropriate use cases.
- Community detection β Modularity quantifies community structure relative to a null model. Louvain and Leiden algorithms efficiently optimize modularity, but beware the resolution limit.
- Network models β Erdos-Renyi (Poisson degree), Barabasi-Albert (power-law), and Watts-Strogatz (small-world) capture distinct generative mechanisms. Real networks often exhibit scale-free degree distributions with high clustering.
- ERGMs β Provide a statistical framework for modeling networks as functions of network statistics. Careful specification (geometrically weighted terms) avoids degeneracy. Parameter estimation via MCMLE or MCMC enables hypothesis testing about network structure.