Streaming Statistics and Online Learning
Advanced Statistical Methods
Analyzing Data One Observation at a Time
Streaming statistics and online learning algorithms process data sequentially without storing it all, using fixed memory. Welford's algorithm, Count-Min sketch, HyperLogLog, and multi-armed bandits enable real-time analysis.
- Network monitoring β Detect DDoS attacks and anomalies in real-time traffic streams
- A/B testing β Use multi-armed bandits to dynamically allocate traffic to better-performing variants
- Financial trading β Update risk metrics continuously as new market data arrives
Streaming statistics let you keep pace with data that never stops flowing.
DfStreaming Model
In the streaming model, data arrives sequentially one element at a time, and the algorithm has access to only a single pass over the data. The goal is to compute statistical quantities using space, where is the total number of elements observed.
"In a world drowning in data, the ability to compute statistics in a single pass with bounded memory is not a luxury β it is a necessity." β Cormode & Muthukrishnan, 2005
Why Streaming Matters
Traditional batch algorithms require the entire dataset to be stored in memory. In modern applications:
- Network traffic analysis: Billions of packets per day; storing all flow records is infeasible
- Sensor networks: Continuous streams from thousands of IoT devices with limited battery and memory
- Financial tick data: Real-time computation of portfolio statistics as trades execute
- Social media analytics: Rolling computation of engagement metrics across millions of posts
- Database query processing: Approximate aggregate queries over massive tables
The streaming model forces us to rethink fundamental statistical operations: how do we compute a mean, a variance, or a quantile without storing the data?
Welford's Online Algorithm
DfStreaming Mean
Given a stream , the streaming mean is updated iteratively:
This is algebraically equivalent to the batch formula but avoids accumulating raw values.
Streaming Variance (Welford's Method)
DfWelford's Algorithm
Welford's algorithm (1962) computes the mean and variance in a single pass using a recurrence relation. Define the second moment :
The sample variance is then:
ThNumerical Stability of Welford's Algorithm
Welford's algorithm computes the exact same result as the batch formula but with dramatically better numerical stability. The key insight is that each term is always non-negative, eliminating catastrophic cancellation that plagues the naive two-pass algorithm.
Why Naive Algorithms Fail
The naive variance formula suffers from catastrophic cancellation when all are nearly identical. If , the difference can lose all significant digits. Welford's method avoids this entirely.
Parallel Streaming: Chan's Algorithm
DfChan's Parallel Algorithm
To merge statistics from two disjoint streams of sizes and with means and second moments :
Count-Min Sketch
DfCount-Min Sketch
A Count-Min sketch (Cormode & Muthukrishnan, 2003) is a probabilistic data structure that estimates the frequency of items in a stream. It uses a 2D array of width and depth :
For each element in the stream, we update:
where are independent hash functions mapping to .
Query Estimation
Count-Min Query
The estimated frequency of item is:
ThCount-Min Error Bound
With probability at least , the estimated frequency satisfies:
where is the total stream length, if we set and .
Space Complexity
| Parameter | Formula | Typical Value |
|---|---|---|
| Width | 272 for | |
| Depth | 5 for | |
| Total space | ~1,360 counters | |
| Update time | Very fast |
HyperLogLog
DfHyperLogLog
HyperLogLog (Flajolet et al., 2007) estimates the cardinality (number of distinct elements) of a multiset using the probabilistic observation that a random hash function will produce a hash with . The maximum number of leading zeros among elements follows .
Algorithm
HyperLogLog Estimation
Divide the stream into buckets using the first bits of each hash. For bucket , track:
The cardinality estimate is:
where is a bias-correction constant.
HyperLogLog Accuracy
HyperLogLog achieves a standard error of approximately . With (2 KB memory), the standard error is about 0.81%, sufficient for most large-scale applications.
Bloom Filters
DfBloom Filter
A Bloom filter (Bloom, 1970) is a space-efficient probabilistic data structure for set membership testing. It supports only two operations: add(x) and might_contain(x). The filter is a bit array of bits initialized to zero, with independent hash functions .
Add: Set bits at positions to 1.
Query: Return true if all bits at positions are 1.
ThBloom Filter False Positive Rate
For a Bloom filter of size bits with hash functions containing elements, the false positive probability is approximately:
The optimal number of hash functions is .
| Parameters | elements, | elements, |
|---|---|---|
| Memory | ~1.2 MB | ~175 MB |
| Hash functions | 7 | 12 |
| Bits per element | ~9.6 | ~175 |
Online Gradient Descent
DfOnline Gradient Descent (OGD)
In the online convex optimization framework, at each round :
- The learner picks a point from a convex set .
- The adversary reveals a convex loss function .
- The learner incurs loss .
OGD update rule:
where is the projection onto and is the learning rate.
ThRegret Bound for OGD
The regret of OGD is bounded by:
With step size (where , ), the regret is .
Diminishing Step Sizes
Unlike batch gradient descent, OGD requires diminishing step sizes to achieve sublinear regret. Constant step sizes yield linear regret in adversarial settings, which is suboptimal.
Multi-Armed Bandits
DfMulti-Armed Bandit (MAB)
A multi-armed bandit consists of arms, each yielding i.i.d. rewards from an unknown distribution . At each round , the learner selects an arm and observes reward . The goal is to maximize cumulative reward while balancing exploration and exploitation.
Upper Confidence Bound (UCB1)
UCB1 Index
The UCB1 algorithm selects arm maximizing:
where is the empirical mean, is the number of times arm has been pulled, and is the current round.
ThUCB1 Regret Bound
The UCB1 algorithm achieves expected regret:
where is the suboptimality gap. This is asymptotically optimal up to constant factors.
Thompson Sampling
DfThompson Sampling
Thompson sampling maintains a posterior distribution over each arm's reward distribution. At each round:
- Sample for each arm .
- Select .
- Observe reward and update posterior.
For Bernoulli arms with Beta prior , the posterior after successes and failures is .
Thompson Sampling vs UCB
Thompson sampling often outperforms UCB1 in practice due to its natural exploration mechanism. It implicitly balances exploitation (high posterior means) with uncertainty (high posterior variance) through the sampling step.
Python Implementation
import numpy as np
from collections import defaultdict
import hashlib
# --- Welford's Online Algorithm ---
class WelfordOnlineStatistics:
"""Track mean and variance using Welford's algorithm in one pass."""
def __init__(self):
self.n = 0
self.mean = 0.0
self.M2 = 0.0
def update(self, x):
self.n += 1
delta = x - self.mean
self.mean += delta / self.n
delta2 = x - self.mean
self.M2 += delta * delta2
@property
def variance(self):
return self.M2 / (self.n - 1) if self.n > 1 else 0.0
@property
def std(self):
return np.sqrt(self.variance)
def __repr__(self):
return f"Welford(n={self.n}, mean={self.mean:.4f}, var={self.variance:.4f})"
# Demonstration
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=10000)
w = WelfordOnlineStatistics()
for x in data:
w.update(x)
print(f"Streaming: {w}")
print(f"Batch mean: {data.mean():.4f}, Batch var: {data.var():.4f}")
# Output: Streaming mean/var matches batch exactly
# --- Count-Min Sketch ---
class CountMinSketch:
def __init__(self, width, depth):
self.width = width
self.depth = depth
self.table = np.zeros((depth, width), dtype=int)
self.hashes = [self._make_hash(i) for i in range(depth)]
def _make_hash(self, seed):
def h(x):
return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % self.width
return h
def add(self, item, count=1):
for i in range(self.depth):
self.table[i][self.hashes[i](item)] += count
def estimate(self, item):
return min(self.table[i][self.hashes[i](item)] for i in range(self.depth))
# Demonstration
cms = CountMinSketch(width=1000, depth=5)
items = np.random.choice(1000, size=100000, replace=True)
for item in items:
cms.add(str(item))
print(f"\nCount-Min Sketch estimates for first 5 items:")
for i in range(5):
print(f" Item '{i}': true={np.sum(items==i)}, est={cms.estimate(str(i))}")
# --- HyperLogLog (simplified) ---
class HyperLogLog:
def __init__(self, p=14):
self.p = p
self.m = 1 << p
self.buckets = [0] * self.m
def _hash(self, item):
h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
return h
def _leading_zeros(self, h, max_bits=64):
if h == 0:
return max_bits
count = 0
while (h & 1) == 0:
count += 1
h >>= 1
return count
def add(self, item):
h = self._hash(item)
bucket = h & (self.m - 1)
w = h >> self.p
lz = self._leading_zeros(w)
self.buckets[bucket] = max(self.buckets[bucket], lz)
def cardinality(self):
alpha = 0.7213 / (1 + 1.079 / self.m)
raw = alpha * self.m ** 2 / sum(2.0 ** (-b) for b in self.buckets)
return raw
# Demonstration
hll = HyperLogLog(p=14)
true_n = 500000
for i in range(true_n):
hll.add(i)
print(f"\nHyperLogLog:")
print(f" True cardinality: {true_n}")
print(f" Estimated: {hll.cardinality():.0f}")
print(f" Error: {abs(hll.cardinality() - true_n) / true_n * 100:.2f}%")
# --- Bloom Filter ---
class BloomFilter:
def __init__(self, size, num_hashes):
self.size = size
self.num_hashes = num_hashes
self.bit_array = [False] * size
self.hashes = [self._make_hash(i) for i in range(num_hashes)]
def _make_hash(self, seed):
def h(item):
return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) % self.size
return h
def add(self, item):
for h in self.hashes:
self.bit_array[h(item)] = True
def __contains__(self, item):
return all(self.bit_array[h(item)] for h in self.hashes)
# Demonstration
bf = BloomFilter(size=10000, num_hashes=7)
for i in range(1000):
bf.add(f"item_{i}")
false_positives = sum(f"new_{i}" in bf for i in range(10000))
print(f"\nBloom Filter: {false_positives}/10000 false positives "
f"(expected ~1%)")
# --- Multi-Armed Bandit (UCB1) ---
class UCB1:
def __init__(self, k):
self.k = k
self.counts = np.zeros(k)
self.values = np.zeros(k)
self.total = 0
def select_arm(self):
# Pull each arm once first
for i in range(self.k):
if self.counts[i] == 0:
return i
ucb_values = self.values + np.sqrt(2 * np.log(self.total) / self.counts)
return np.argmax(ucb_values)
def update(self, arm, reward):
self.counts[arm] += 1
n = self.counts[arm]
self.values[arm] += (reward - self.values[arm]) / n
self.total += 1
# 10-armed Gaussian bandit test
np.random.seed(42)
true_means = np.random.normal(0, 1, 10)
optimal_arm = np.argmax(true_means)
agent = UCB1(10)
cumulative_regret = []
for t in range(1, 2001):
arm = agent.select_arm()
reward = np.random.normal(true_means[arm], 1.0)
agent.update(arm, reward)
regret = true_means[optimal_arm] - true_means[arm]
cumulative_regret.append(regret)
print(f"\nUCB1 Bandit: final avg reward = {agent.values[optimal_arm]:.3f} "
f"(true optimal = {true_means[optimal_arm]:.3f})")
print(f"Cumulative regret after 2000 rounds: {sum(cumulative_regret):.1f}")
Key Takeaways
Summary: Streaming Statistics and Online Learning
- Welford's algorithm computes exact mean and variance in a single pass with memory, avoiding catastrophic cancellation.
- Count-Min sketch estimates item frequencies with controllable error and probability using only counters.
- HyperLogLog estimates set cardinality with ~1% error using only ~2 KB memory, independent of the true cardinality.
- Bloom filters test set membership with zero false negatives and tunable false positive rates in query time.
- Online gradient descent achieves regret in convex settings, enabling learning against adversaries.
- Multi-armed bandits (UCB1, Thompson sampling) balance exploration and exploitation with provable regret bounds.
- All streaming algorithms trade a small, controlled probability of error for dramatic reductions in space and time complexity.