Big Data and Statistics
Advanced Statistical Methods
Scaling Statistical Methods to Massive Datasets
Big data statistics adapts classical methods for massive datasets using distributed computing, scalable algorithms, and approximate inference. Spark, streaming methods, and sketching techniques make analysis feasible at scale.
- Tech companies β Analyze billions of user interactions for recommendation systems and ad targeting
- Internet of Things β Process continuous streams of sensor data for predictive maintenance
- Healthcare β Mine electronic health records across millions of patients for population insights
Big data statistics ensures that having more data actually leads to better decisions, not just bigger files.
Big data fundamentally challenges classical statistical methods. When datasets exceed memory, arrive as streams, or contain billions of observations, traditional algorithms that assume random access and finite storage become computationally infeasible.
DfBig Data
Big data is characterized by the V's: Volume (terabytes to petabytes), Velocity (real-time streams), Variety (structured, unstructured, semi-structured), Veracity (noise, uncertainty), and Value (actionable insights).
Statistical Challenges of Big Data
Core Statistical Challenges
- Computational infeasibility β O(nΒ³) algorithms fail at n = 10βΉ
- Memory constraints β data doesn't fit in RAM on a single machine
- Streaming data β observations arrive continuously, cannot be stored
- High dimensionality β p >> n or p β n creates new statistical regimes
- Heterogeneity β distributional shifts across subpopulations
- Spurious correlations β with many variables, false positives increase
Distributed Computing: Apache Spark
DfMapReduce Framework
MapReduce is a programming model for distributed computation: (1) Map β transform data in parallel across nodes, (2) Shuffle β redistribute data by key, (3) Reduce β aggregate results across nodes.
Spark RDD Transformation Pipeline
Here,
- =Resilient Distributed Dataset β Spark's core abstraction
- =Parallel transformation function
- =Parallel aggregation by key
Scalable Algorithms
DfDivide-and-Conquer Estimation
For a statistic T that is combineable β meaning T(D) can be computed from partial results on subsets β we can distribute estimation:
Divide-and-Conquer Mean Estimation
Here,
- =Sample mean of the k-th partition
- =Size of the k-th partition
- =Number of partitions
Combinable Statistics
- β Mean β partition means combine exactly
- β Variance β Welford's algorithm for streaming variance
- β Quantiles β t-digest or KLL sketch for approximate quantiles
- β Median β not directly combinable; requires sorting or sketching
- β Mode β not combinable without full data
- β Regression coefficients β combine via sufficient statistics (X'X, X'y)
Streaming Statistics
When data arrives as a stream (network packets, IoT sensors, social media), we cannot store all observations. Streaming algorithms use sublinear memory.
DfWelford's Online Algorithm
Welford's algorithm computes the mean and variance in a single pass with O(1) memory:
Welford's Online Variance
Here,
- =Difference between new observation and current mean
- =Running mean after n observations
- =Running sum of squared deviations
The sample variance is then: sΒ² = Mβ,n / (n β 1).
Sketching Methods
Sketches are randomized data structures that provide approximate answers with provable error bounds using sublinear memory.
DfCount-Min Sketch
The Count-Min Sketch (Cormode & Muthukrishnan, 2005) estimates frequency counts in a data stream. It uses d hash functions and a w Γ d table of counters:
Count-Min Sketch
Here,
- =Estimated frequency of element x
- =Number of hash functions (depth)
- =Width of the sketch table
- =i-th hash function
Count-Min Sketch Properties
- Overestimation only β fΜ(x) β₯ f(x) always (never underestimates)
- Error bound β P(fΜ(x) β f(x) > Ρ·N) β€ Ξ΄, where w = βe/Ξ΅β, d = βln(1/Ξ΄)β
- Space β O((1/Ξ΅) Β· ln(1/Ξ΄)) independent of stream length N
- Applications β network traffic monitoring, database query optimization, NLP frequency estimation
Python Implementation: Streaming Statistics
Streaming Algorithms from Scratch
import numpy as np
import hashlib
import mmh3 # MurmurHash3 for Count-Min Sketch
class WelfordStreamingStats:
"""Online computation of mean and variance (Welford's algorithm)."""
def __init__(self):
self.n = 0
self.mean = 0.0
self.M2 = 0.0
def update(self, x):
self.n += 1
delta = x - self.mean
self.mean += delta / self.n
delta2 = x - self.mean
self.M2 += delta * delta2
@property
def variance(self):
return self.M2 / (self.n - 1) if self.n > 1 else 0.0
@property
def std(self):
return np.sqrt(self.variance)
def __repr__(self):
return f"n={self.n}, mean={self.mean:.4f}, var={self.variance:.4f}"
class CountMinSketch:
"""Count-Min Sketch for frequency estimation in data streams."""
def __init__(self, width=1000, depth=5):
self.width = width
self.depth = depth
self.table = np.zeros((depth, width), dtype=np.int64)
self.seeds = list(range(depth))
def _get_positions(self, item):
"""Hash item to positions in each row."""
item_str = str(item)
positions = []
for i in range(self.depth):
pos = mmh3.hash(item_str, self.seeds[i]) % self.width
positions.append(pos)
return positions
def add(self, item, count=1):
"""Add item to the sketch."""
positions = self._get_positions(item)
for i, pos in enumerate(positions):
self.table[i, pos] += count
def estimate(self, item):
"""Estimate frequency of item (overestimates)."""
positions = self._get_positions(item)
return min(self.table[i, pos] for i, pos in enumerate(positions))
def get_top_k(self, items, k=10):
"""Get estimated top-k frequent items."""
estimates = {item: self.estimate(item) for item in items}
return sorted(estimates.items(), key=lambda x: x[1], reverse=True)[:k]
class HyperLogLog:
"""HyperLogLog for cardinality estimation."""
def __init__(self, precision=10):
self.p = precision
self.m = 1 << precision # 2^p registers
self.registers = np.zeros(self.m, dtype=np.int8)
def _hash(self, item):
h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
return h
def add(self, item):
h = self._hash(item)
idx = h & (self.m - 1) # first p bits
remaining = h >> self.p
# Count leading zeros + 1
leading_zeros = 1
while remaining & 1 == 0 and leading_zeros < 64:
leading_zeros += 1
remaining >>= 1
self.registers[idx] = max(self.registers[idx], leading_zeros)
def estimate_cardinality(self):
alpha = 0.7213 / (1 + 1.079 / self.m)
raw = alpha * self.m ** 2 / sum(2.0 ** (-r) for r in self.registers)
if raw <= 2.5 * self.m:
zeros = np.sum(self.registers == 0)
if zeros > 0:
return self.m * np.log(self.m / zeros)
return raw
# === Demonstration ===
np.random.seed(42)
# 1. Streaming variance with Welford's algorithm
print("=== Welford's Online Algorithm ===")
stream = WelfordStreamingStats()
data = np.random.normal(100, 15, 10000)
for x in data:
stream.update(x)
print(f"Streaming result: {stream}")
print(f"NumPy result: n={len(data)}, mean={np.mean(data):.4f}, var={np.var(data, ddof=1):.4f}")
# 2. Count-Min Sketch frequency estimation
print("\n=== Count-Min Sketch ===")
sketch = CountMinSketch(width=10000, depth=5)
np.random.seed(42)
elements = np.random.choice([f"item_{i}" for i in range(1000)], size=100000)
for elem in elements:
sketch.add(elem)
true_freq = {elem: np.sum(elements == elem) for elem in np.unique(elements)[:5]}
print("Element | True Freq | Estimated | Overestimate")
for elem, true_f in true_freq.items():
est_f = sketch.estimate(elem)
print(f"{elem:>8} | {true_f:>9} | {est_f:>9} | {est_f - true_f:>10}")
# 3. HyperLogLog cardinality estimation
print("\n=== HyperLogLog Cardinality ===")
hll = HyperLogLog(precision=12)
n_unique = 100000
unique_items = range(n_unique * 10) # universe of items
sample = np.random.choice(unique_items, size=500000, replace=True)
for item in sample:
hll.add(item)
estimated = hll.estimate_cardinality()
true_count = len(set(sample))
print(f"True unique count: {true_count}")
print(f"HLL estimate: {estimated:.0f}")
print(f"Relative error: {abs(estimated - true_count) / true_count * 100:.2f}%")
Dimension Reduction for Big Data
DfRandom Projections
Random projections reduce dimensionality while preserving pairwise distances (Johnson-Lindenstrauss lemma). For an n Γ p matrix X, projecting onto k dimensions using a random matrix R β β^{pΓk}:
Johnson-Lindenstrauss Projection
Here,
- =n Γ k projected matrix
- =Target dimensionality (k << p)
- =Random projection matrix
Johnson-Lindenstrauss Guarantee
For any n points in βα΅, projecting onto k β₯ O(log(n) / Ρ²) dimensions preserves all pairwise distances within a (1 Β± Ξ΅) factor with high probability. This is data-independent β the bound holds for any dataset.
Approximate Inference for Big Data
DfStochastic Variational Inference (SVI)
SVI scales Bayesian inference to massive datasets by subsampling data at each iteration. The evidence lower bound (ELBO) is estimated using a mini-batch:
Stochastic Variational ELBO
Here,
- =Variational parameters
- =Total dataset size
- =Mini-batch size
- =Variational approximation
Key Takeaways
Summary: Big Data and Statistics
- Big data requires algorithms with sublinear memory, streaming capability, or distributed computation
- Spark implements MapReduce for distributed statistical computation across clusters
- Welford's algorithm computes streaming mean and variance in O(1) memory
- Count-Min Sketch estimates frequencies with O(1/Ξ΅ Β· ln 1/Ξ΄) space, overestimating only
- HyperLogLog estimates cardinality (unique counts) with ~2% error using kilobytes of memory
- Random projections reduce dimensionality while provably preserving distances (JL lemma)
- Stochastic variational inference scales Bayesian methods to massive datasets via mini-batches
- Divide-and-conquer works for combinable statistics (mean, variance, regression) but not all (median, mode)