Streaming Statistics and Online Learning

Advanced Statistical Methods

Analyzing Data One Observation at a Time

Streaming statistics and online learning algorithms process data sequentially without storing it all, using fixed memory. Welford's algorithm, Count-Min sketch, HyperLogLog, and multi-armed bandits enable real-time analysis.

Network monitoring — Detect DDoS attacks and anomalies in real-time traffic streams
A/B testing — Use multi-armed bandits to dynamically allocate traffic to better-performing variants
Financial trading — Update risk metrics continuously as new market data arrives

Streaming statistics let you keep pace with data that never stops flowing.

DfStreaming Model

In the streaming model, data arrives sequentially one element at a time, and the algorithm has access to only a single pass over the data. The goal is to compute statistical quantities using $O(\text{polylog}(n))$ space, where $n$ is the total number of elements observed.

"In a world drowning in data, the ability to compute statistics in a single pass with bounded memory is not a luxury — it is a necessity." — Cormode & Muthukrishnan, 2005

Why Streaming Matters

Traditional batch algorithms require the entire dataset to be stored in memory. In modern applications:

Network traffic analysis: Billions of packets per day; storing all flow records is infeasible
Sensor networks: Continuous streams from thousands of IoT devices with limited battery and memory
Financial tick data: Real-time computation of portfolio statistics as trades execute
Social media analytics: Rolling computation of engagement metrics across millions of posts
Database query processing: Approximate aggregate queries over massive tables

The streaming model forces us to rethink fundamental statistical operations: how do we compute a mean, a variance, or a quantile without storing the data?

Welford's Online Algorithm

DfStreaming Mean

Given a stream $x_1, x_2, \ldots, x_n$ , the streaming mean is updated iteratively:

\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}

This is algebraically equivalent to the batch formula but avoids accumulating $n$ raw values.

Streaming Variance (Welford's Method)

DfWelford's Algorithm

Welford's algorithm (1962) computes the mean and variance in a single pass using a recurrence relation. Define the second moment $M_{2,n}$ :

M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)

The sample variance is then:

s_n^2 = \frac{M_{2,n}}{n - 1}

ThNumerical Stability of Welford's Algorithm

Welford's algorithm computes the exact same result as the batch formula $\sum(x_i - \bar{x})^2$ but with dramatically better numerical stability. The key insight is that each term $(x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)$ is always non-negative, eliminating catastrophic cancellation that plagues the naive two-pass algorithm.

Why Naive Algorithms Fail

The naive variance formula $s^2 = \frac{1}{n-1}\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right)$ suffers from catastrophic cancellation when all $x_i$ are nearly identical. If $x_i \approx 10^8$ , the difference $\sum x_i^2 - (\sum x_i)^2/n$ can lose all significant digits. Welford's method avoids this entirely.

Parallel Streaming: Chan's Algorithm

DfChan's Parallel Algorithm

To merge statistics from two disjoint streams of sizes $n_A$ and $n_B$ with means $\bar{x}_A, \bar{x}_B$ and second moments $M_{2A}, M_{2B}$ :

n = n_A + n_B

\bar{x} = \bar{x}_A + \frac{n_B}{n}(\bar{x}_B - \bar{x}_A)

M_2 = M_{2A} + M_{2B} + \frac{n_A n_B}{n}(\bar{x}_B - \bar{x}_A)^2

Count-Min Sketch

DfCount-Min Sketch

A Count-Min sketch (Cormode & Muthukrishnan, 2003) is a probabilistic data structure that estimates the frequency of items in a stream. It uses a 2D array of width $w$ and depth $d$ :

\text{sketch}[i][j] = 0 \quad \forall \, i \in [1,d], \, j \in [1,w]

For each element $x$ in the stream, we update:

\text{sketch}[i][h_i(x)] \mathrel{+}= 1 \quad \forall \, i \in [1,d]

where $h_1, \ldots, h_d$ are independent hash functions mapping to $\{1, \ldots, w\}$ .

Query Estimation

Count-Min Query

The estimated frequency of item $x$ is:

\hat{f}(x) = \min_{i=1}^{d} \text{sketch}[i][h_i(x)]

ThCount-Min Error Bound

With probability at least $1 - \delta$ , the estimated frequency satisfies:

f(x) \leq \hat{f}(x) \leq f(x) + \frac{\varepsilon}{w} \cdot N

where $N = \sum_i f_i$ is the total stream length, if we set $w = \lceil e/\varepsilon \rceil$ and $d = \lceil \ln(1/\delta) \rceil$ .

Space Complexity

Parameter	Formula	Typical Value
Width $w$	$\lceil e/\varepsilon \rceil$	272 for $\varepsilon = 0.01$
Depth $d$	$\lceil \ln(1/\delta) \rceil$	5 for $\delta = 0.01$
Total space	$O(w \cdot d) = O\left(\frac{\ln(1/\delta)}{\varepsilon}\right)$	~1,360 counters
Update time	$O(d)$	Very fast

HyperLogLog

DfHyperLogLog

HyperLogLog (Flajolet et al., 2007) estimates the cardinality (number of distinct elements) of a multiset using the probabilistic observation that a random hash function $h: S \to \{0,1\}^b$ will produce a hash with $P(\text{leading zeros} \geq k) = 2^{-k}$ . The maximum number of leading zeros among $n$ elements follows $\text{Geometric}(1/2)$ .

Algorithm

HyperLogLog Estimation

Divide the stream into $m = 2^p$ buckets using the first $p$ bits of each hash. For bucket $j$ , track:

M_j = \max_{i: \text{bucket}(h(x_i)) = j} \text{LeadingZeros}(h(x_i)) + 1

The cardinality estimate is:

\hat{n} = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M_j}\right)^{-1}

where $\alpha_m$ is a bias-correction constant.

HyperLogLog Accuracy

HyperLogLog achieves a standard error of approximately $\frac{1.04}{\sqrt{m}}$ . With $m = 16384$ (2 KB memory), the standard error is about 0.81%, sufficient for most large-scale applications.

Bloom Filters

DfBloom Filter

A Bloom filter (Bloom, 1970) is a space-efficient probabilistic data structure for set membership testing. It supports only two operations: add(x) and might_contain(x). The filter is a bit array of $m$ bits initialized to zero, with $k$ independent hash functions $h_1, \ldots, h_k$ .

Add: Set bits at positions $h_1(x), h_2(x), \ldots, h_k(x)$ to 1.

Query: Return true if all bits at positions $h_1(x), \ldots, h_k(x)$ are 1.

ThBloom Filter False Positive Rate

For a Bloom filter of size $m$ bits with $k$ hash functions containing $n$ elements, the false positive probability is approximately:

p \approx \left(1 - e^{-kn/m}\right)^k

The optimal number of hash functions is $k = \frac{m}{n} \ln 2$ .

Parameters	$n = 10^6$ elements, $p = 1\%$	$n = 10^9$ elements, $p = 0.1\%$
Memory $m$	~1.2 MB	~175 MB
Hash functions $k$	7	12
Bits per element	~9.6	~175

Online Gradient Descent

DfOnline Gradient Descent (OGD)

In the online convex optimization framework, at each round $t$ :

The learner picks a point $w_t$ from a convex set $K$ .
The adversary reveals a convex loss function $f_t$ .
The learner incurs loss $f_t(w_t)$ .

OGD update rule:

w_{t+1} = \Pi_K\left(w_t - \eta_t \nabla f_t(w_t)\right)

where $\Pi_K$ is the projection onto $K$ and $\eta_t$ is the learning rate.

ThRegret Bound for OGD

The regret of OGD is bounded by:

R_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in K} \sum_{t=1}^{T} f_t(w) \leq \frac{\|w_1 - w^*\|^2}{2\eta_T} + \frac{\eta_T}{2} \sum_{t=1}^{T} \|\nabla f_t(w_t)\|^2

With step size $\eta_t = \frac{D}{G\sqrt{t}}$ (where $D = \text{diam}(K)$ , $G = \max_t \|\nabla f_t\|$ ), the regret is $O(GD\sqrt{T})$ .

Diminishing Step Sizes

Unlike batch gradient descent, OGD requires diminishing step sizes to achieve sublinear regret. Constant step sizes yield linear regret in adversarial settings, which is suboptimal.

Multi-Armed Bandits

DfMulti-Armed Bandit (MAB)

A multi-armed bandit consists of $K$ arms, each yielding i.i.d. rewards from an unknown distribution $\nu_k$ . At each round $t$ , the learner selects an arm $A_t$ and observes reward $X_t \sim \nu_{A_t}$ . The goal is to maximize cumulative reward while balancing exploration and exploitation.

Upper Confidence Bound (UCB1)

UCB1 Index

The UCB1 algorithm selects arm $a$ maximizing:

X_a + \sqrt{\frac{2 \ln t}{N_a(t)}}

where $X_a$ is the empirical mean, $N_a(t)$ is the number of times arm $a$ has been pulled, and $t$ is the current round.

ThUCB1 Regret Bound

The UCB1 algorithm achieves expected regret:

\mathbb{E}[R_T] \leq \sum_{k: \Delta_k > 0} \left(\frac{8 \ln T}{\Delta_k} + \frac{\pi^2}{3\Delta_k}\right)

where $\Delta_k = \mu^* - \mu_k$ is the suboptimality gap. This is asymptotically optimal up to constant factors.

Thompson Sampling

DfThompson Sampling

Thompson sampling maintains a posterior distribution over each arm's reward distribution. At each round:

Sample $\tilde{\theta}_k \sim P(\theta_k | \text{data})$ for each arm $k$ .
Select $A_t = \arg\max_k \tilde{\theta}_k$ .
Observe reward and update posterior.

For Bernoulli arms with Beta prior $\text{Beta}(\alpha, \beta)$ , the posterior after $s$ successes and $f$ failures is $\text{Beta}(\alpha + s, \beta + f)$ .

Thompson Sampling vs UCB

Thompson sampling often outperforms UCB1 in practice due to its natural exploration mechanism. It implicitly balances exploitation (high posterior means) with uncertainty (high posterior variance) through the sampling step.

Python Implementation

import numpy as np
from collections import defaultdict
import hashlib

# --- Welford's Online Algorithm ---
class WelfordOnlineStatistics:
    """Track mean and variance using Welford's algorithm in one pass."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2

    @property
    def variance(self):
        return self.M2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return np.sqrt(self.variance)

    def __repr__(self):
        return f"Welford(n={self.n}, mean={self.mean:.4f}, var={self.variance:.4f})"

# Demonstration
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=10000)

w = WelfordOnlineStatistics()
for x in data:
    w.update(x)

print(f"Streaming: {w}")
print(f"Batch mean: {data.mean():.4f}, Batch var: {data.var():.4f}")
# Output: Streaming mean/var matches batch exactly

# --- Count-Min Sketch ---
class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = np.zeros((depth, width), dtype=int)
        self.hashes = [self._make_hash(i) for i in range(depth)]

    def _make_hash(self, seed):
        def h(x):
            return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % self.width
        return h

    def add(self, item, count=1):
        for i in range(self.depth):
            self.table[i][self.hashes[i](item)] += count

    def estimate(self, item):
        return min(self.table[i][self.hashes[i](item)] for i in range(self.depth))

# Demonstration
cms = CountMinSketch(width=1000, depth=5)
items = np.random.choice(1000, size=100000, replace=True)
for item in items:
    cms.add(str(item))

print(f"\nCount-Min Sketch estimates for first 5 items:")
for i in range(5):
    print(f"  Item '{i}': true={np.sum(items==i)}, est={cms.estimate(str(i))}")

# --- HyperLogLog (simplified) ---
class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.buckets = [0] * self.m

    def _hash(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        return h

    def _leading_zeros(self, h, max_bits=64):
        if h == 0:
            return max_bits
        count = 0
        while (h & 1) == 0:
            count += 1
            h >>= 1
        return count

    def add(self, item):
        h = self._hash(item)
        bucket = h & (self.m - 1)
        w = h >> self.p
        lz = self._leading_zeros(w)
        self.buckets[bucket] = max(self.buckets[bucket], lz)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** (-b) for b in self.buckets)
        return raw

# Demonstration
hll = HyperLogLog(p=14)
true_n = 500000
for i in range(true_n):
    hll.add(i)

print(f"\nHyperLogLog:")
print(f"  True cardinality: {true_n}")
print(f"  Estimated: {hll.cardinality():.0f}")
print(f"  Error: {abs(hll.cardinality() - true_n) / true_n * 100:.2f}%")

# --- Bloom Filter ---
class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [False] * size
        self.hashes = [self._make_hash(i) for i in range(num_hashes)]

    def _make_hash(self, seed):
        def h(item):
            return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) % self.size
        return h

    def add(self, item):
        for h in self.hashes:
            self.bit_array[h(item)] = True

    def __contains__(self, item):
        return all(self.bit_array[h(item)] for h in self.hashes)

# Demonstration
bf = BloomFilter(size=10000, num_hashes=7)
for i in range(1000):
    bf.add(f"item_{i}")

false_positives = sum(f"new_{i}" in bf for i in range(10000))
print(f"\nBloom Filter: {false_positives}/10000 false positives "
      f"(expected ~1%)")

# --- Multi-Armed Bandit (UCB1) ---
class UCB1:
    def __init__(self, k):
        self.k = k
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
        self.total = 0

    def select_arm(self):
        # Pull each arm once first
        for i in range(self.k):
            if self.counts[i] == 0:
                return i
        ucb_values = self.values + np.sqrt(2 * np.log(self.total) / self.counts)
        return np.argmax(ucb_values)

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
        self.total += 1

# 10-armed Gaussian bandit test
np.random.seed(42)
true_means = np.random.normal(0, 1, 10)
optimal_arm = np.argmax(true_means)

agent = UCB1(10)
cumulative_regret = []

for t in range(1, 2001):
    arm = agent.select_arm()
    reward = np.random.normal(true_means[arm], 1.0)
    agent.update(arm, reward)
    regret = true_means[optimal_arm] - true_means[arm]
    cumulative_regret.append(regret)

print(f"\nUCB1 Bandit: final avg reward = {agent.values[optimal_arm]:.3f} "
      f"(true optimal = {true_means[optimal_arm]:.3f})")
print(f"Cumulative regret after 2000 rounds: {sum(cumulative_regret):.1f}")

Key Takeaways

Summary: Streaming Statistics and Online Learning

Welford's algorithm computes exact mean and variance in a single pass with $O(1)$ memory, avoiding catastrophic cancellation.
Count-Min sketch estimates item frequencies with controllable error $\varepsilon$ and probability $\delta$ using only $O(\frac{1}{\varepsilon} \log \frac{1}{\delta})$ counters.
HyperLogLog estimates set cardinality with ~1% error using only ~2 KB memory, independent of the true cardinality.
Bloom filters test set membership with zero false negatives and tunable false positive rates in $O(1)$ query time.
Online gradient descent achieves $O(\sqrt{T})$ regret in convex settings, enabling learning against adversaries.
Multi-armed bandits (UCB1, Thompson sampling) balance exploration and exploitation with provable regret bounds.
All streaming algorithms trade a small, controlled probability of error for dramatic reductions in space and time complexity.

Streaming Statistics and Online Learning

Streaming Statistics and Online Learning

Analyzing Data One Observation at a Time

DfStreaming Model

Why Streaming Matters

Welford's Online Algorithm

DfStreaming Mean

Streaming Variance (Welford's Method)

DfWelford's Algorithm

ThNumerical Stability of Welford's Algorithm

Parallel Streaming: Chan's Algorithm

DfChan's Parallel Algorithm

Count-Min Sketch

DfCount-Min Sketch

Query Estimation

Count-Min Query

ThCount-Min Error Bound

Space Complexity

HyperLogLog

DfHyperLogLog

Algorithm

HyperLogLog Estimation

Bloom Filters

DfBloom Filter

ThBloom Filter False Positive Rate

Online Gradient Descent

DfOnline Gradient Descent (OGD)

ThRegret Bound for OGD

Multi-Armed Bandits

DfMulti-Armed Bandit (MAB)

Upper Confidence Bound (UCB1)

UCB1 Index

ThUCB1 Regret Bound

Thompson Sampling

DfThompson Sampling

Python Implementation

Key Takeaways

Summary: Streaming Statistics and Online Learning

Next Steps

Premium Content

Need Expert Statistics Help?