Big Data and Statistics

Advanced Statistical Methods

Scaling Statistical Methods to Massive Datasets

Big data statistics adapts classical methods for massive datasets using distributed computing, scalable algorithms, and approximate inference. Spark, streaming methods, and sketching techniques make analysis feasible at scale.

Tech companies — Analyze billions of user interactions for recommendation systems and ad targeting
Internet of Things — Process continuous streams of sensor data for predictive maintenance
Healthcare — Mine electronic health records across millions of patients for population insights

Big data statistics ensures that having more data actually leads to better decisions, not just bigger files.

Big data fundamentally challenges classical statistical methods. When datasets exceed memory, arrive as streams, or contain billions of observations, traditional algorithms that assume random access and finite storage become computationally infeasible.

DfBig Data

Big data is characterized by the V's: Volume (terabytes to petabytes), Velocity (real-time streams), Variety (structured, unstructured, semi-structured), Veracity (noise, uncertainty), and Value (actionable insights).

Statistical Challenges of Big Data

Core Statistical Challenges

Computational infeasibility — O(n³) algorithms fail at n = 10⁹
Memory constraints — data doesn't fit in RAM on a single machine
Streaming data — observations arrive continuously, cannot be stored
High dimensionality — p >> n or p ≈ n creates new statistical regimes
Heterogeneity — distributional shifts across subpopulations
Spurious correlations — with many variables, false positives increase

Distributed Computing: Apache Spark

DfMapReduce Framework

MapReduce is a programming model for distributed computation: (1) Map — transform data in parallel across nodes, (2) Shuffle — redistribute data by key, (3) Reduce — aggregate results across nodes.

Spark RDD Transformation Pipeline

\text{RDD}_{\text{raw}} \xrightarrow{\text{map}} \text{RDD}_{\text{parsed}} \xrightarrow{\text{filter}} \text{RDD}_{\text{clean}} \xrightarrow{\text{reduceByKey}} \text{RDD}_{\text{aggregated}}

Here,

$RDD$ =Resilient Distributed Dataset — Spark's core abstraction
$map$ =Parallel transformation function
$reduceByKey$ =Parallel aggregation by key

Scalable Algorithms

DfDivide-and-Conquer Estimation

For a statistic T that is combineable — meaning T(D) can be computed from partial results on subsets — we can distribute estimation:

Divide-and-Conquer Mean Estimation

\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{K}\sum_{k=1}^{K} \bar{X}_k \cdot \frac{n_k}{n}

Here,

$X̄_k$ =Sample mean of the k-th partition
$n_k$ =Size of the k-th partition
$K$ =Number of partitions

Combinable Statistics

✅ Mean — partition means combine exactly
✅ Variance — Welford's algorithm for streaming variance
✅ Quantiles — t-digest or KLL sketch for approximate quantiles
❌ Median — not directly combinable; requires sorting or sketching
❌ Mode — not combinable without full data
✅ Regression coefficients — combine via sufficient statistics (X'X, X'y)

Streaming Statistics

When data arrives as a stream (network packets, IoT sensors, social media), we cannot store all observations. Streaming algorithms use sublinear memory.

DfWelford's Online Algorithm

Welford's algorithm computes the mean and variance in a single pass with O(1) memory:

Welford's Online Variance

\delta = x_n - \bar{x}_{n-1},\quad \bar{x}_n = \bar{x}_{n-1} + \frac{\delta}{n},\quad M_{2,n} = M_{2,n-1} + \delta(x_n - \bar{x}_n)

Here,

$δ$ =Difference between new observation and current mean
$x̄_n$ =Running mean after n observations
$M₂,n$ =Running sum of squared deviations

The sample variance is then: s² = M₂,n / (n − 1).

Sketching Methods

Sketches are randomized data structures that provide approximate answers with provable error bounds using sublinear memory.

DfCount-Min Sketch

The Count-Min Sketch (Cormode & Muthukrishnan, 2005) estimates frequency counts in a data stream. It uses d hash functions and a w × d table of counters:

Count-Min Sketch

\hat{f}(x) = \min_{i=1}^{d} \text{Table}[i, h_i(x)]

Here,

$f̂(x)$ =Estimated frequency of element x
$d$ =Number of hash functions (depth)
$w$ =Width of the sketch table
$h_i$ =i-th hash function

Count-Min Sketch Properties

Overestimation only — f̂(x) ≥ f(x) always (never underestimates)
Error bound — P(f̂(x) − f(x) > ε·N) ≤ δ, where w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉
Space — O((1/ε) · ln(1/δ)) independent of stream length N
Applications — network traffic monitoring, database query optimization, NLP frequency estimation

Python Implementation: Streaming Statistics

Streaming Algorithms from Scratch

import numpy as np
import hashlib
import mmh3  # MurmurHash3 for Count-Min Sketch

class WelfordStreamingStats:
    """Online computation of mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2

    @property
    def variance(self):
        return self.M2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return np.sqrt(self.variance)

    def __repr__(self):
        return f"n={self.n}, mean={self.mean:.4f}, var={self.variance:.4f}"


class CountMinSketch:
    """Count-Min Sketch for frequency estimation in data streams."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = list(range(depth))

    def _get_positions(self, item):
        """Hash item to positions in each row."""
        item_str = str(item)
        positions = []
        for i in range(self.depth):
            pos = mmh3.hash(item_str, self.seeds[i]) % self.width
            positions.append(pos)
        return positions

    def add(self, item, count=1):
        """Add item to the sketch."""
        positions = self._get_positions(item)
        for i, pos in enumerate(positions):
            self.table[i, pos] += count

    def estimate(self, item):
        """Estimate frequency of item (overestimates)."""
        positions = self._get_positions(item)
        return min(self.table[i, pos] for i, pos in enumerate(positions))

    def get_top_k(self, items, k=10):
        """Get estimated top-k frequent items."""
        estimates = {item: self.estimate(item) for item in items}
        return sorted(estimates.items(), key=lambda x: x[1], reverse=True)[:k]


class HyperLogLog:
    """HyperLogLog for cardinality estimation."""

    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision  # 2^p registers
        self.registers = np.zeros(self.m, dtype=np.int8)

    def _hash(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        return h

    def add(self, item):
        h = self._hash(item)
        idx = h & (self.m - 1)  # first p bits
        remaining = h >> self.p
        # Count leading zeros + 1
        leading_zeros = 1
        while remaining & 1 == 0 and leading_zeros < 64:
            leading_zeros += 1
            remaining >>= 1
        self.registers[idx] = max(self.registers[idx], leading_zeros)

    def estimate_cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** (-r) for r in self.registers)

        if raw <= 2.5 * self.m:
            zeros = np.sum(self.registers == 0)
            if zeros > 0:
                return self.m * np.log(self.m / zeros)
        return raw


# === Demonstration ===
np.random.seed(42)

# 1. Streaming variance with Welford's algorithm
print("=== Welford's Online Algorithm ===")
stream = WelfordStreamingStats()
data = np.random.normal(100, 15, 10000)

for x in data:
    stream.update(x)

print(f"Streaming result: {stream}")
print(f"NumPy result:     n={len(data)}, mean={np.mean(data):.4f}, var={np.var(data, ddof=1):.4f}")

# 2. Count-Min Sketch frequency estimation
print("\n=== Count-Min Sketch ===")
sketch = CountMinSketch(width=10000, depth=5)
np.random.seed(42)
elements = np.random.choice([f"item_{i}" for i in range(1000)], size=100000)

for elem in elements:
    sketch.add(elem)

true_freq = {elem: np.sum(elements == elem) for elem in np.unique(elements)[:5]}
print("Element | True Freq | Estimated | Overestimate")
for elem, true_f in true_freq.items():
    est_f = sketch.estimate(elem)
    print(f"{elem:>8} | {true_f:>9} | {est_f:>9} | {est_f - true_f:>10}")

# 3. HyperLogLog cardinality estimation
print("\n=== HyperLogLog Cardinality ===")
hll = HyperLogLog(precision=12)
n_unique = 100000
unique_items = range(n_unique * 10)  # universe of items
sample = np.random.choice(unique_items, size=500000, replace=True)

for item in sample:
    hll.add(item)

estimated = hll.estimate_cardinality()
true_count = len(set(sample))
print(f"True unique count: {true_count}")
print(f"HLL estimate:      {estimated:.0f}")
print(f"Relative error:    {abs(estimated - true_count) / true_count * 100:.2f}%")

Dimension Reduction for Big Data

DfRandom Projections

Random projections reduce dimensionality while preserving pairwise distances (Johnson-Lindenstrauss lemma). For an n × p matrix X, projecting onto k dimensions using a random matrix R ∈ ℝ^{p×k}:

Johnson-Lindenstrauss Projection

X_{\text{proj}} = \frac{1}{\sqrt{k}} X R,\quad \text{where } R_{ij} \sim \mathcal{N}(0, 1)

Here,

$X_proj$ =n × k projected matrix
$k$ =Target dimensionality (k << p)
$R$ =Random projection matrix

Johnson-Lindenstrauss Guarantee

For any n points in ℝᵖ, projecting onto k ≥ O(log(n) / ε²) dimensions preserves all pairwise distances within a (1 ± ε) factor with high probability. This is data-independent — the bound holds for any dataset.

Approximate Inference for Big Data

DfStochastic Variational Inference (SVI)

SVI scales Bayesian inference to massive datasets by subsampling data at each iteration. The evidence lower bound (ELBO) is estimated using a mini-batch:

Stochastic Variational ELBO

\tilde{L}(\phi) = \frac{N}{M} \sum_{i=1}^{M} \mathbb{E}_{q}\left[\log p(x_i, z) - \log q(z)\right] + \mathbb{E}_q\left[\log p(\theta) - \log q(\theta)\right]

Here,

$φ$ =Variational parameters
$N$ =Total dataset size
$M$ =Mini-batch size
$q$ =Variational approximation

Key Takeaways

Summary: Big Data and Statistics

Big data requires algorithms with sublinear memory, streaming capability, or distributed computation
Spark implements MapReduce for distributed statistical computation across clusters
Welford's algorithm computes streaming mean and variance in O(1) memory
Count-Min Sketch estimates frequencies with O(1/ε · ln 1/δ) space, overestimating only
HyperLogLog estimates cardinality (unique counts) with ~2% error using kilobytes of memory
Random projections reduce dimensionality while provably preserving distances (JL lemma)
Stochastic variational inference scales Bayesian methods to massive datasets via mini-batches
Divide-and-conquer works for combinable statistics (mean, variance, regression) but not all (median, mode)

Big Data and Statistics

Big Data and Statistics

Scaling Statistical Methods to Massive Datasets

DfBig Data

Statistical Challenges of Big Data

Distributed Computing: Apache Spark

DfMapReduce Framework

Spark RDD Transformation Pipeline

Scalable Algorithms

DfDivide-and-Conquer Estimation

Divide-and-Conquer Mean Estimation

Streaming Statistics

DfWelford's Online Algorithm

Welford's Online Variance

Sketching Methods

DfCount-Min Sketch

Count-Min Sketch

Python Implementation: Streaming Statistics

Streaming Algorithms from Scratch

Dimension Reduction for Big Data

DfRandom Projections

Johnson-Lindenstrauss Projection

Approximate Inference for Big Data

DfStochastic Variational Inference (SVI)

Stochastic Variational ELBO

Key Takeaways

Summary: Big Data and Statistics

Premium Content

Need Expert Statistics Help?