πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Big Data and Statistics

Advanced Statistical MethodsModern Methods🟒 Free Lesson

Advertisement

Big Data and Statistics

Advanced Statistical Methods

Scaling Statistical Methods to Massive Datasets

Big data statistics adapts classical methods for massive datasets using distributed computing, scalable algorithms, and approximate inference. Spark, streaming methods, and sketching techniques make analysis feasible at scale.

  • Tech companies β€” Analyze billions of user interactions for recommendation systems and ad targeting
  • Internet of Things β€” Process continuous streams of sensor data for predictive maintenance
  • Healthcare β€” Mine electronic health records across millions of patients for population insights

Big data statistics ensures that having more data actually leads to better decisions, not just bigger files.


Big data fundamentally challenges classical statistical methods. When datasets exceed memory, arrive as streams, or contain billions of observations, traditional algorithms that assume random access and finite storage become computationally infeasible.

DfBig Data

Big data is characterized by the V's: Volume (terabytes to petabytes), Velocity (real-time streams), Variety (structured, unstructured, semi-structured), Veracity (noise, uncertainty), and Value (actionable insights).


Statistical Challenges of Big Data

Core Statistical Challenges

  1. Computational infeasibility β€” O(nΒ³) algorithms fail at n = 10⁹
  2. Memory constraints β€” data doesn't fit in RAM on a single machine
  3. Streaming data β€” observations arrive continuously, cannot be stored
  4. High dimensionality β€” p >> n or p β‰ˆ n creates new statistical regimes
  5. Heterogeneity β€” distributional shifts across subpopulations
  6. Spurious correlations β€” with many variables, false positives increase

Distributed Computing: Apache Spark

DfMapReduce Framework

MapReduce is a programming model for distributed computation: (1) Map β€” transform data in parallel across nodes, (2) Shuffle β€” redistribute data by key, (3) Reduce β€” aggregate results across nodes.

Spark RDD Transformation Pipeline

RDDraw→mapRDDparsed→filterRDDclean→reduceByKeyRDDaggregated\text{RDD}_{\text{raw}} \xrightarrow{\text{map}} \text{RDD}_{\text{parsed}} \xrightarrow{\text{filter}} \text{RDD}_{\text{clean}} \xrightarrow{\text{reduceByKey}} \text{RDD}_{\text{aggregated}}

Here,

  • RDDRDD=Resilient Distributed Dataset β€” Spark's core abstraction
  • mapmap=Parallel transformation function
  • reduceByKeyreduceByKey=Parallel aggregation by key

Scalable Algorithms

DfDivide-and-Conquer Estimation

For a statistic T that is combineable β€” meaning T(D) can be computed from partial results on subsets β€” we can distribute estimation:

Divide-and-Conquer Mean Estimation

XΛ‰=1nβˆ‘i=1nXi=1Kβˆ‘k=1KXΛ‰kβ‹…nkn\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{K}\sum_{k=1}^{K} \bar{X}_k \cdot \frac{n_k}{n}

Here,

  • XΛ‰kXΜ„_k=Sample mean of the k-th partition
  • nkn_k=Size of the k-th partition
  • KK=Number of partitions

Combinable Statistics

  • βœ… Mean β€” partition means combine exactly
  • βœ… Variance β€” Welford's algorithm for streaming variance
  • βœ… Quantiles β€” t-digest or KLL sketch for approximate quantiles
  • ❌ Median β€” not directly combinable; requires sorting or sketching
  • ❌ Mode β€” not combinable without full data
  • βœ… Regression coefficients β€” combine via sufficient statistics (X'X, X'y)

Streaming Statistics

When data arrives as a stream (network packets, IoT sensors, social media), we cannot store all observations. Streaming algorithms use sublinear memory.

DfWelford's Online Algorithm

Welford's algorithm computes the mean and variance in a single pass with O(1) memory:

Welford's Online Variance

Ξ΄=xnβˆ’xΛ‰nβˆ’1,xΛ‰n=xΛ‰nβˆ’1+Ξ΄n,M2,n=M2,nβˆ’1+Ξ΄(xnβˆ’xΛ‰n)\delta = x_n - \bar{x}_{n-1},\quad \bar{x}_n = \bar{x}_{n-1} + \frac{\delta}{n},\quad M_{2,n} = M_{2,n-1} + \delta(x_n - \bar{x}_n)

Here,

  • δδ=Difference between new observation and current mean
  • xΛ‰nxΜ„_n=Running mean after n observations
  • M2,nMβ‚‚,n=Running sum of squared deviations

The sample variance is then: sΒ² = Mβ‚‚,n / (n βˆ’ 1).


Sketching Methods

Sketches are randomized data structures that provide approximate answers with provable error bounds using sublinear memory.

DfCount-Min Sketch

The Count-Min Sketch (Cormode & Muthukrishnan, 2005) estimates frequency counts in a data stream. It uses d hash functions and a w Γ— d table of counters:

Count-Min Sketch

f^(x)=min⁑i=1dTable[i,hi(x)]\hat{f}(x) = \min_{i=1}^{d} \text{Table}[i, h_i(x)]

Here,

  • f^(x)fΜ‚(x)=Estimated frequency of element x
  • dd=Number of hash functions (depth)
  • ww=Width of the sketch table
  • hih_i=i-th hash function

Count-Min Sketch Properties

  • Overestimation only β€” fΜ‚(x) β‰₯ f(x) always (never underestimates)
  • Error bound β€” P(fΜ‚(x) βˆ’ f(x) > Ρ·N) ≀ Ξ΄, where w = ⌈e/Ξ΅βŒ‰, d = ⌈ln(1/Ξ΄)βŒ‰
  • Space β€” O((1/Ξ΅) Β· ln(1/Ξ΄)) independent of stream length N
  • Applications β€” network traffic monitoring, database query optimization, NLP frequency estimation

Python Implementation: Streaming Statistics

Streaming Algorithms from Scratch

import numpy as np
import hashlib
import mmh3  # MurmurHash3 for Count-Min Sketch

class WelfordStreamingStats:
    """Online computation of mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2

    @property
    def variance(self):
        return self.M2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return np.sqrt(self.variance)

    def __repr__(self):
        return f"n={self.n}, mean={self.mean:.4f}, var={self.variance:.4f}"


class CountMinSketch:
    """Count-Min Sketch for frequency estimation in data streams."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.seeds = list(range(depth))

    def _get_positions(self, item):
        """Hash item to positions in each row."""
        item_str = str(item)
        positions = []
        for i in range(self.depth):
            pos = mmh3.hash(item_str, self.seeds[i]) % self.width
            positions.append(pos)
        return positions

    def add(self, item, count=1):
        """Add item to the sketch."""
        positions = self._get_positions(item)
        for i, pos in enumerate(positions):
            self.table[i, pos] += count

    def estimate(self, item):
        """Estimate frequency of item (overestimates)."""
        positions = self._get_positions(item)
        return min(self.table[i, pos] for i, pos in enumerate(positions))

    def get_top_k(self, items, k=10):
        """Get estimated top-k frequent items."""
        estimates = {item: self.estimate(item) for item in items}
        return sorted(estimates.items(), key=lambda x: x[1], reverse=True)[:k]


class HyperLogLog:
    """HyperLogLog for cardinality estimation."""

    def __init__(self, precision=10):
        self.p = precision
        self.m = 1 << precision  # 2^p registers
        self.registers = np.zeros(self.m, dtype=np.int8)

    def _hash(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        return h

    def add(self, item):
        h = self._hash(item)
        idx = h & (self.m - 1)  # first p bits
        remaining = h >> self.p
        # Count leading zeros + 1
        leading_zeros = 1
        while remaining & 1 == 0 and leading_zeros < 64:
            leading_zeros += 1
            remaining >>= 1
        self.registers[idx] = max(self.registers[idx], leading_zeros)

    def estimate_cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** (-r) for r in self.registers)

        if raw <= 2.5 * self.m:
            zeros = np.sum(self.registers == 0)
            if zeros > 0:
                return self.m * np.log(self.m / zeros)
        return raw


# === Demonstration ===
np.random.seed(42)

# 1. Streaming variance with Welford's algorithm
print("=== Welford's Online Algorithm ===")
stream = WelfordStreamingStats()
data = np.random.normal(100, 15, 10000)

for x in data:
    stream.update(x)

print(f"Streaming result: {stream}")
print(f"NumPy result:     n={len(data)}, mean={np.mean(data):.4f}, var={np.var(data, ddof=1):.4f}")

# 2. Count-Min Sketch frequency estimation
print("\n=== Count-Min Sketch ===")
sketch = CountMinSketch(width=10000, depth=5)
np.random.seed(42)
elements = np.random.choice([f"item_{i}" for i in range(1000)], size=100000)

for elem in elements:
    sketch.add(elem)

true_freq = {elem: np.sum(elements == elem) for elem in np.unique(elements)[:5]}
print("Element | True Freq | Estimated | Overestimate")
for elem, true_f in true_freq.items():
    est_f = sketch.estimate(elem)
    print(f"{elem:>8} | {true_f:>9} | {est_f:>9} | {est_f - true_f:>10}")

# 3. HyperLogLog cardinality estimation
print("\n=== HyperLogLog Cardinality ===")
hll = HyperLogLog(precision=12)
n_unique = 100000
unique_items = range(n_unique * 10)  # universe of items
sample = np.random.choice(unique_items, size=500000, replace=True)

for item in sample:
    hll.add(item)

estimated = hll.estimate_cardinality()
true_count = len(set(sample))
print(f"True unique count: {true_count}")
print(f"HLL estimate:      {estimated:.0f}")
print(f"Relative error:    {abs(estimated - true_count) / true_count * 100:.2f}%")

Dimension Reduction for Big Data

DfRandom Projections

Random projections reduce dimensionality while preserving pairwise distances (Johnson-Lindenstrauss lemma). For an n Γ— p matrix X, projecting onto k dimensions using a random matrix R ∈ ℝ^{pΓ—k}:

Johnson-Lindenstrauss Projection

Xproj=1kXR,where Rij∼N(0,1)X_{\text{proj}} = \frac{1}{\sqrt{k}} X R,\quad \text{where } R_{ij} \sim \mathcal{N}(0, 1)

Here,

  • XprojX_proj=n Γ— k projected matrix
  • kk=Target dimensionality (k << p)
  • RR=Random projection matrix

Johnson-Lindenstrauss Guarantee

For any n points in ℝᡖ, projecting onto k β‰₯ O(log(n) / Ρ²) dimensions preserves all pairwise distances within a (1 Β± Ξ΅) factor with high probability. This is data-independent β€” the bound holds for any dataset.


Approximate Inference for Big Data

DfStochastic Variational Inference (SVI)

SVI scales Bayesian inference to massive datasets by subsampling data at each iteration. The evidence lower bound (ELBO) is estimated using a mini-batch:

Stochastic Variational ELBO

L~(Ο•)=NMβˆ‘i=1MEq[log⁑p(xi,z)βˆ’log⁑q(z)]+Eq[log⁑p(ΞΈ)βˆ’log⁑q(ΞΈ)]\tilde{L}(\phi) = \frac{N}{M} \sum_{i=1}^{M} \mathbb{E}_{q}\left[\log p(x_i, z) - \log q(z)\right] + \mathbb{E}_q\left[\log p(\theta) - \log q(\theta)\right]

Here,

  • φφ=Variational parameters
  • NN=Total dataset size
  • MM=Mini-batch size
  • qq=Variational approximation

Key Takeaways

Summary: Big Data and Statistics

  1. Big data requires algorithms with sublinear memory, streaming capability, or distributed computation
  2. Spark implements MapReduce for distributed statistical computation across clusters
  3. Welford's algorithm computes streaming mean and variance in O(1) memory
  4. Count-Min Sketch estimates frequencies with O(1/Ξ΅ Β· ln 1/Ξ΄) space, overestimating only
  5. HyperLogLog estimates cardinality (unique counts) with ~2% error using kilobytes of memory
  6. Random projections reduce dimensionality while provably preserving distances (JL lemma)
  7. Stochastic variational inference scales Bayesian methods to massive datasets via mini-batches
  8. Divide-and-conquer works for combinable statistics (mean, variance, regression) but not all (median, mode)
⭐

Premium Content

Big Data and Statistics

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement