πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Network Analysis and Graph Statistics

Advanced Statistical MethodsNetwork Science🟒 Free Lesson

Advertisement

Network Analysis and Graph Statistics

Advanced Statistical Methods

Understanding Complex Systems Through Their Connections

Network analysis uses graph theory and statistical models to study the structure and dynamics of complex interconnected systems. Centrality measures, community detection, and generative models reveal hidden patterns.

  • Social networks β€” Identify influential individuals and detect community structure in online platforms
  • Epidemiology β€” Model disease spread through contact networks for targeted intervention
  • Biology β€” Analyze protein interaction networks to discover functional modules and drug targets

Network analysis reveals how the pattern of connections shapes system behavior.


Network science provides the mathematical language for studying complex systems of interacting entities. From social networks and biological pathways to communication infrastructure and financial systems, the statistical analysis of networks reveals structural patterns, identifies influential nodes, and detects latent community organization. This lesson develops the theoretical foundations and computational methods for rigorous network analysis.


Graph Fundamentals

DfGraph (Network)

A graph G=(V,E)G = (V, E) consists of a finite set of vertices (nodes) V={v1,v2,…,vn}V = \{v_1, v_2, \dots, v_n\} and a set of edges (ties) EβŠ†VΓ—VE \subseteq V \times V. The order of the graph is ∣V∣=n|V| = n and its size is ∣E∣=m|E| = m.

DfAdjacency Matrix

The adjacency matrix A∈{0,1}nΓ—n\mathbf{A} \in \{0, 1\}^{n \times n} of a simple undirected graph encodes edge presence:

Aij={1if (vi,vj)∈E0otherwiseA_{ij} = \begin{cases} 1 & \text{if } (v_i, v_j) \in E \\ 0 & \text{otherwise} \end{cases}

For weighted graphs, Aij=wijA_{ij} = w_{ij} where wijβ‰₯0w_{ij} \geq 0. For directed graphs, Aij=1A_{ij} = 1 if there is an edge from ii to jj, and Aijβ‰ AjiA_{ij} \neq A_{ji} in general.

Degree

The degree of vertex viv_i is the number of edges incident to it:

ki=βˆ‘j=1nAijk_i = \sum_{j=1}^{n} A_{ij}

In a directed graph, the in-degree is kiin=βˆ‘jAjik_i^{\text{in}} = \sum_j A_{ji} and the out-degree is kiout=βˆ‘jAijk_i^{\text{out}} = \sum_j A_{ij}. The degree distribution P(k)P(k) gives the fraction of vertices with degree kk, which is a fundamental descriptor of network topology.

DfPath, Distance, Diameter

A path of length β„“\ell from viv_i to vjv_j is a sequence of distinct vertices vi=vi0,vi1,…,viβ„“=vjv_i = v_{i_0}, v_{i_1}, \dots, v_{i_\ell} = v_j where each consecutive pair is connected by an edge. The shortest path length d(i,j)d(i,j) is the geodesic distance. The diameter of a connected graph is diam(G)=max⁑i,jd(i,j)\text{diam}(G) = \max_{i,j} d(i,j).

Key Graph Properties

  • Connected graph: Every pair of vertices is joined by a path
  • Bipartite vertices: Can be partitioned into two sets with edges only between sets
  • Clustering coefficient: Ci=2eiki(kiβˆ’1)C_i = \frac{2e_i}{k_i(k_i - 1)} where eie_i is the number of edges among ii's neighbors
  • Average path length: dΛ‰=1n(nβˆ’1)βˆ‘iβ‰ jd(i,j)\bar{d} = \frac{1}{n(n-1)}\sum_{i \neq j} d(i,j)

Centrality Measures

Centrality quantifies the relative importance of nodes within a network. Different measures capture different notions of "importance" β€” popularity, brokerage, proximity, or influence.

Degree Centrality

The simplest centrality measure normalizes degree by the maximum possible:

CD(vi)=kinβˆ’1C_D(v_i) = \frac{k_i}{n - 1}

This measures local connectivity and is appropriate for identifying nodes with many direct connections (e.g., popular individuals in social networks).

Betweenness Centrality

Betweenness centrality captures the role of a node as a bridge along shortest paths:

CB(vi)=βˆ‘sβ‰ iβ‰ tΟƒst(i)ΟƒstC_B(v_i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}

where Οƒst\sigma_{st} is the total number of shortest paths from ss to tt, and Οƒst(i)\sigma_{st}(i) is the number of those passing through ii. Normalized betweenness is:

CBβ€²(vi)=2CB(vi)(nβˆ’1)(nβˆ’2)C_B'(v_i) = \frac{2C_B(v_i)}{(n-1)(n-2)}

Betweenness is O(nm)O(nm) to compute exactly, though approximation algorithms exist for large networks.

Closeness Centrality

Closeness centrality measures how close a node is to all other nodes:

CC(vi)=nβˆ’1βˆ‘jβ‰ id(i,j)C_C(v_i) = \frac{n - 1}{\sum_{j \neq i} d(i, j)}

A node with high closeness can spread information efficiently to the entire network. This is undefined for disconnected graphs unless restricted to the reachable component.

Eigenvector Centrality

Eigenvector centrality assigns importance based on the importance of neighbors:

CE(vi)=1Ξ»βˆ‘j=1nAijCE(vj)C_E(v_i) = \frac{1}{\lambda} \sum_{j=1}^{n} A_{ij} C_E(v_j)

In matrix form, Ax=Ξ»x\mathbf{A}\mathbf{x} = \lambda \mathbf{x}, so the centrality vector is the principal eigenvector of the adjacency matrix. By the Perron-Frobenius theorem, for a connected non-negative matrix, this eigenvector is unique and has all positive entries. Google's PageRank is a modified eigenvector centrality.

Centrality Comparison

MeasureCapturesComplexitySensitive to
DegreeLocal connectivityO(n)O(n)Hub detection
BetweennessBridge/ brokerage roleO(nm)O(nm)Bottlenecks
ClosenessProximity to all nodesO(nm)O(nm)Central positioning
EigenvectorInfluence in networkO(n2)O(n^2) per iterationGlobal structure

Community Detection

Communities (modules, clusters) are groups of nodes that are more densely connected internally than with the rest of the network. Quantifying community structure requires a modularity function.

Modularity (Newman-Girvan)

The modularity QQ of a partition of the network into communities {C1,…,CK}\{C_1, \dots, C_K\} is:

Q=12mβˆ‘ij[Aijβˆ’kikj2m]Ξ΄(Ci,Cj)Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C_i, C_j)

where Ξ΄(Ci,Cj)=1\delta(C_i, C_j) = 1 if nodes ii and jj belong to the same community, and m=12βˆ‘ikim = \frac{1}{2}\sum_i k_i is the total number of edges. The term kikj2m\frac{k_i k_j}{2m} is the expected number of edges under the configuration model (random graphs with the same degree sequence). QQ ranges from βˆ’1/2-1/2 to 11, with values above 0.30.3 indicating significant community structure.

Louvain Algorithm

The Louvain method (Blondel et al., 2008) greedily optimizes modularity in two phases:

  1. Local moving: Each node ii is moved to the community that yields the maximum modularity gain Ξ”Q\Delta Q:
Ξ”Q=[βˆ‘in+2ki,in2mβˆ’(βˆ‘tot+ki2m)2]βˆ’[βˆ‘in2mβˆ’(βˆ‘tot2m)2βˆ’(ki2m)2]\Delta Q = \left[ \frac{\sum_{\text{in}} + 2k_{i,\text{in}}}{2m} - \left(\frac{\sum_{\text{tot}} + k_i}{2m}\right)^2 \right] - \left[ \frac{\sum_{\text{in}}}{2m} - \left(\frac{\sum_{\text{tot}}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2 \right]
  1. Aggregation: A new network is built where each community becomes a single node, with edge weights summed. Repeat until no further improvement.

Louvain runs in O(nlog⁑n)O(n \log n) time and handles networks with millions of nodes.

Resolution Limit

The Louvain algorithm suffers from a resolution limit: communities smaller than a scale ∼2m\sim \sqrt{2m} may not be detected. This is because modularity optimization with a single resolution parameter cannot detect small communities in large networks. The Leiden algorithm (Traag et al., 2019) improves upon Louvain by guaranteeing well-connected communities and avoiding poorly connected sub-communities.


Network Generative Models

Generative models specify probability distributions over graphs, enabling hypothesis testing and simulation.

Erdos-Renyi Model

The Erdos-Renyi G(n,p)G(n, p) model generates a random graph on nn nodes where each edge appears independently with probability pp. The degree distribution is Binomial(nβˆ’1,p)\text{Binomial}(n-1, p), approaching Poisson(Ξ»=(nβˆ’1)p\lambda = (n-1)p) for large nn:

P(k)=(nβˆ’1k)pk(1βˆ’p)nβˆ’1βˆ’kβ‰ˆΞ»keβˆ’Ξ»k!P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k} \approx \frac{\lambda^k e^{-\lambda}}{k!}

The giant component emerges when p>1/np > 1/n, with a phase transition at the critical threshold.

Barabasi-Albert Model

The Barabasi-Albert model generates scale-free networks via preferential attachment. Starting from a small complete graph, each new node attaches to mm existing nodes with probability proportional to their degree:

Ξ (i)=kiβˆ‘jkj\Pi(i) = \frac{k_i}{\sum_j k_j}

The resulting degree distribution follows a power law: P(k)∼kβˆ’Ξ³P(k) \sim k^{-\gamma} with Ξ³=3\gamma = 3. This models many real-world networks (Internet, citation networks, biological networks) where a few "hubs" have extremely high degree.

Watts-Strogatz Model

The Watts-Strogatz model generates small-world networks with high clustering and short average path length. Start with a ring lattice of nn nodes each connected to kk nearest neighbors. Each edge is rewired with probability pp:

  • At p=0p = 0: regular lattice (high clustering, long paths)
  • At p=1p = 1: random graph (low clustering, short paths)
  • Intermediate pp: small-world regime β€” short paths AND high clustering

The characteristic path length scales as L∼n2kL \sim \frac{n}{2k} while clustering C∼3k3kβˆ’1C \sim \frac{3k}{3k - 1} for small pp.

Model Comparison

ModelDegree DistributionClusteringPath LengthMotivation
Erdos-RenyiPoissonLow (∼p\sim p)Short (∼log⁑n\sim \log n)Random baseline
Barabasi-AlbertPower law (kβˆ’3k^{-3})ModerateShortPreferential attachment
Watts-StrogatzNarrow (peaked)HighShortSmall-world phenomenon

Exponential Random Graph Models

ERGMs provide a principled statistical framework for modeling network structure as a function of network statistics.

DfExponential Random Graph Model (ERGM)

The ERGM defines a probability distribution over the space of all graphs G\mathcal{G} on nn nodes:

P(G=g∣θ)=exp⁑(θ⊀s(g))κ(θ)P(G = g \mid \boldsymbol{\theta}) = \frac{\exp\left(\boldsymbol{\theta}^\top \mathbf{s}(g)\right)}{\kappa(\boldsymbol{\theta})}

where s(g)\mathbf{s}(g) is a vector of network statistics (e.g., edge count, triangles, degree terms) and ΞΊ(ΞΈ)=βˆ‘gβ€²βˆˆGexp⁑(θ⊀s(gβ€²))\kappa(\boldsymbol{\theta}) = \sum_{g' \in \mathcal{G}} \exp(\boldsymbol{\theta}^\top \mathbf{s}(g')) is the normalizing constant. The parameter ΞΈj\theta_j measures the contribution of statistic sjs_j to the log-odds of edge presence.

Common ERGM Statistics

Typical network statistics include:

  • Edges: βˆ‘i<jAij\sum_{i<j} A_{ij} β€” controls overall density
  • Mutual: βˆ‘i<jAijAji\sum_{i<j} A_{ij}A_{ji} β€” reciprocity in directed networks
  • Triangles: βˆ‘i<j<kAijAjkAik\sum_{i<j<k} A_{ij}A_{jk}A_{ik} β€” transitivity
  • k-stars: βˆ‘i(kik)\sum_i \binom{k_i}{k} β€” degree-based terms
  • Geometrically weighted edges (GWE): eΞ±βˆ‘i=1n(1βˆ’(1βˆ’eβˆ’Ξ±)ki)e^{\alpha}\sum_{i=1}^{n}(1 - (1-e^{-\alpha})^{k_i}) β€” smooth degree distribution penalty

Degeneracy Problem

Many naive ERGM specifications (e.g., edges + triangles) suffer from degeneracy: the probability mass concentrates on either the empty or complete graph, with no realistic intermediate structures. Solutions include:

  • Geometrically weighted terms (Snijders et al., 2006)
  • Maximum pseudolikelihood estimation as a starting point
  • Curved ERGM specifications with carefully chosen statistics

Python Implementation

import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Generate Network Models ---
n = 500

# Erdos-Renyi
G_er = nx.erdos_renyi_graph(n, p=0.01, seed=42)
degrees_er = [d for _, d in G_er.degree()]

# Barabasi-Albert
G_ba = nx.barabasi_albert_graph(n, m=3, seed=42)
degrees_ba = [d for _, d in G_ba.degree()]

# Watts-Strogatz
G_ws = nx.watts_strogatz_graph(n, k=6, p=0.1, seed=42)
degrees_ws = [d for _, d in G_ws.degree()]

# --- Centrality Measures ---
G = G_ba  # Use BA model for centrality analysis
degree_cent = nx.degree_centrality(G)
between_cent = nx.betweenness_centrality(G, k=100)  # sampled
closeness_cent = nx.closeness_centrality(G)
eigen_cent = nx.eigenvector_centrality(G, max_iter=1000)

# Top nodes by different centrality measures
top_degree = sorted(degree_cent.items(), key=lambda x: -x[1])[:5]
top_between = sorted(between_cent.items(), key=lambda x: -x[1])[:5]
top_eigen = sorted(eigen_cent.items(), key=lambda x: -x[1])[:5]

print("=== Top 5 Nodes by Centrality ===")
print(f"{'Node':<8} {'Degree':<12} {'Betweenness':<14} {'Eigenvector':<12}")
print("-" * 46)
for i in range(5):
    print(f"{top_degree[i][0]:<8} {top_degree[i][1]:<12.4f} "
          f"{top_between[i][1]:<14.4f} {top_eigen[i][1]:<12.4f}")

# --- Community Detection ---
communities = louvain_communities(G, seed=42)
Q = modularity(G, communities)
print(f"\n=== Community Detection (Louvain) ===")
print(f"Number of communities: {len(communities)}")
print(f"Modularity: {Q:.4f}")
print(f"Community sizes: {[len(c) for c in sorted(communities, key=len, reverse=True)]}")

# --- Degree Distribution Analysis ---
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, degs, title in zip(axes, 
                            [degrees_er, degrees_ba, degrees_ws],
                            ['Erdos-Renyi', 'Barabasi-Albert', 'Watts-Strogatz']):
    ax.hist(degs, bins=max(degs) - min(degs) + 1, density=True, 
            alpha=0.7, edgecolor='black', linewidth=0.5)
    ax.set_xlabel('Degree k')
    ax.set_ylabel('P(k)')
    ax.set_title(f'{title} (n={n})')

plt.tight_layout()
plt.savefig("network_degree_distributions.png", dpi=150, bbox_inches="tight")
plt.show()

# --- Network Statistics ---
print(f"\n=== Network Statistics ===")
for name, G_model in [('ER', G_er), ('BA', G_ba), ('WS', G_ws)]:
    print(f"\n{name}:")
    print(f"  Edges: {G_model.number_of_edges()}")
    print(f"  Avg degree: {2 * G_model.number_of_edges() / n:.2f}")
    print(f"  Clustering coeff: {nx.average_clustering(G_model):.4f}")
    print(f"  Avg path length: {nx.average_shortest_path_length(G_model) if nx.is_connected(G_model) else 'N/A (disconnected)'}")
    print(f"  Diameter: {nx.diameter(G_model) if nx.is_connected(G_model) else 'N/A (disconnected)'}")

Summary

Key Takeaways: Network Analysis

  1. Graph fundamentals β€” Adjacency matrices encode network structure; degree distribution P(k)P(k) is the primary topological descriptor. Connectedness, clustering, and path lengths characterize global structure.
  2. Centrality measures β€” Degree (local), betweenness (brokerage), closeness (proximity), and eigenvector (influence) capture different facets of node importance. Each has distinct computational complexity and appropriate use cases.
  3. Community detection β€” Modularity QQ quantifies community structure relative to a null model. Louvain and Leiden algorithms efficiently optimize modularity, but beware the resolution limit.
  4. Network models β€” Erdos-Renyi (Poisson degree), Barabasi-Albert (power-law), and Watts-Strogatz (small-world) capture distinct generative mechanisms. Real networks often exhibit scale-free degree distributions with high clustering.
  5. ERGMs β€” Provide a statistical framework for modeling networks as functions of network statistics. Careful specification (geometrically weighted terms) avoids degeneracy. Parameter estimation via MCMLE or MCMC enables hypothesis testing about network structure.
⭐

Premium Content

Network Analysis and Graph Statistics

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement