πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Streaming Statistics and Online Learning

Advanced Statistical MethodsModern Methods🟒 Free Lesson

Advertisement

Streaming Statistics and Online Learning

Advanced Statistical Methods

Analyzing Data One Observation at a Time

Streaming statistics and online learning algorithms process data sequentially without storing it all, using fixed memory. Welford's algorithm, Count-Min sketch, HyperLogLog, and multi-armed bandits enable real-time analysis.

  • Network monitoring β€” Detect DDoS attacks and anomalies in real-time traffic streams
  • A/B testing β€” Use multi-armed bandits to dynamically allocate traffic to better-performing variants
  • Financial trading β€” Update risk metrics continuously as new market data arrives

Streaming statistics let you keep pace with data that never stops flowing.


DfStreaming Model

In the streaming model, data arrives sequentially one element at a time, and the algorithm has access to only a single pass over the data. The goal is to compute statistical quantities using O(polylog(n))O(\text{polylog}(n)) space, where nn is the total number of elements observed.

"In a world drowning in data, the ability to compute statistics in a single pass with bounded memory is not a luxury β€” it is a necessity." β€” Cormode & Muthukrishnan, 2005


Why Streaming Matters

Traditional batch algorithms require the entire dataset to be stored in memory. In modern applications:

  • Network traffic analysis: Billions of packets per day; storing all flow records is infeasible
  • Sensor networks: Continuous streams from thousands of IoT devices with limited battery and memory
  • Financial tick data: Real-time computation of portfolio statistics as trades execute
  • Social media analytics: Rolling computation of engagement metrics across millions of posts
  • Database query processing: Approximate aggregate queries over massive tables

The streaming model forces us to rethink fundamental statistical operations: how do we compute a mean, a variance, or a quantile without storing the data?


Welford's Online Algorithm

DfStreaming Mean

Given a stream x1,x2,…,xnx_1, x_2, \ldots, x_n, the streaming mean is updated iteratively:

xΛ‰n=xΛ‰nβˆ’1+xnβˆ’xΛ‰nβˆ’1n\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - \bar{x}_{n-1}}{n}

This is algebraically equivalent to the batch formula but avoids accumulating nn raw values.

Streaming Variance (Welford's Method)

DfWelford's Algorithm

Welford's algorithm (1962) computes the mean and variance in a single pass using a recurrence relation. Define the second moment M2,nM_{2,n}:

M2,n=M2,nβˆ’1+(xnβˆ’xΛ‰nβˆ’1)(xnβˆ’xΛ‰n)M_{2,n} = M_{2,n-1} + (x_n - \bar{x}_{n-1})(x_n - \bar{x}_n)

The sample variance is then:

sn2=M2,nnβˆ’1s_n^2 = \frac{M_{2,n}}{n - 1}

ThNumerical Stability of Welford's Algorithm

Welford's algorithm computes the exact same result as the batch formula βˆ‘(xiβˆ’xΛ‰)2\sum(x_i - \bar{x})^2 but with dramatically better numerical stability. The key insight is that each term (xnβˆ’xΛ‰nβˆ’1)(xnβˆ’xΛ‰n)(x_n - \bar{x}_{n-1})(x_n - \bar{x}_n) is always non-negative, eliminating catastrophic cancellation that plagues the naive two-pass algorithm.

Why Naive Algorithms Fail

The naive variance formula s2=1nβˆ’1(βˆ‘xi2βˆ’(βˆ‘xi)2n)s^2 = \frac{1}{n-1}\left(\sum x_i^2 - \frac{(\sum x_i)^2}{n}\right) suffers from catastrophic cancellation when all xix_i are nearly identical. If xiβ‰ˆ108x_i \approx 10^8, the difference βˆ‘xi2βˆ’(βˆ‘xi)2/n\sum x_i^2 - (\sum x_i)^2/n can lose all significant digits. Welford's method avoids this entirely.

Parallel Streaming: Chan's Algorithm

DfChan's Parallel Algorithm

To merge statistics from two disjoint streams of sizes nAn_A and nBn_B with means xˉA,xˉB\bar{x}_A, \bar{x}_B and second moments M2A,M2BM_{2A}, M_{2B}:

n=nA+nBn = n_A + n_B
xΛ‰=xΛ‰A+nBn(xΛ‰Bβˆ’xΛ‰A)\bar{x} = \bar{x}_A + \frac{n_B}{n}(\bar{x}_B - \bar{x}_A)
M2=M2A+M2B+nAnBn(xΛ‰Bβˆ’xΛ‰A)2M_2 = M_{2A} + M_{2B} + \frac{n_A n_B}{n}(\bar{x}_B - \bar{x}_A)^2

Count-Min Sketch

DfCount-Min Sketch

A Count-Min sketch (Cormode & Muthukrishnan, 2003) is a probabilistic data structure that estimates the frequency of items in a stream. It uses a 2D array of width ww and depth dd:

sketch[i][j]=0βˆ€β€‰i∈[1,d], j∈[1,w]\text{sketch}[i][j] = 0 \quad \forall \, i \in [1,d], \, j \in [1,w]

For each element xx in the stream, we update:

sketch[i][hi(x)]+=1βˆ€β€‰i∈[1,d]\text{sketch}[i][h_i(x)] \mathrel{+}= 1 \quad \forall \, i \in [1,d]

where h1,…,hdh_1, \ldots, h_d are independent hash functions mapping to {1,…,w}\{1, \ldots, w\}.

Query Estimation

Count-Min Query

The estimated frequency of item xx is:

f^(x)=min⁑i=1dsketch[i][hi(x)]\hat{f}(x) = \min_{i=1}^{d} \text{sketch}[i][h_i(x)]

ThCount-Min Error Bound

With probability at least 1βˆ’Ξ΄1 - \delta, the estimated frequency satisfies:

f(x)≀f^(x)≀f(x)+Ξ΅wβ‹…Nf(x) \leq \hat{f}(x) \leq f(x) + \frac{\varepsilon}{w} \cdot N

where N=βˆ‘ifiN = \sum_i f_i is the total stream length, if we set w=⌈e/Ξ΅βŒ‰w = \lceil e/\varepsilon \rceil and d=⌈ln⁑(1/Ξ΄)βŒ‰d = \lceil \ln(1/\delta) \rceil.

Space Complexity

ParameterFormulaTypical Value
Width ww⌈e/Ξ΅βŒ‰\lceil e/\varepsilon \rceil272 for Ξ΅=0.01\varepsilon = 0.01
Depth dd⌈ln⁑(1/Ξ΄)βŒ‰\lceil \ln(1/\delta) \rceil5 for Ξ΄=0.01\delta = 0.01
Total spaceO(wβ‹…d)=O(ln⁑(1/Ξ΄)Ξ΅)O(w \cdot d) = O\left(\frac{\ln(1/\delta)}{\varepsilon}\right)~1,360 counters
Update timeO(d)O(d)Very fast

HyperLogLog

DfHyperLogLog

HyperLogLog (Flajolet et al., 2007) estimates the cardinality (number of distinct elements) of a multiset using the probabilistic observation that a random hash function h:Sβ†’{0,1}bh: S \to \{0,1\}^b will produce a hash with P(leadingΒ zerosβ‰₯k)=2βˆ’kP(\text{leading zeros} \geq k) = 2^{-k}. The maximum number of leading zeros among nn elements follows Geometric(1/2)\text{Geometric}(1/2).

Algorithm

HyperLogLog Estimation

Divide the stream into m=2pm = 2^p buckets using the first pp bits of each hash. For bucket jj, track:

Mj=max⁑i:bucket(h(xi))=jLeadingZeros(h(xi))+1M_j = \max_{i: \text{bucket}(h(x_i)) = j} \text{LeadingZeros}(h(x_i)) + 1

The cardinality estimate is:

n^=Ξ±mβ‹…m2β‹…(βˆ‘j=1m2βˆ’Mj)βˆ’1\hat{n} = \alpha_m \cdot m^2 \cdot \left(\sum_{j=1}^{m} 2^{-M_j}\right)^{-1}

where Ξ±m\alpha_m is a bias-correction constant.

HyperLogLog Accuracy

HyperLogLog achieves a standard error of approximately 1.04m\frac{1.04}{\sqrt{m}}. With m=16384m = 16384 (2 KB memory), the standard error is about 0.81%, sufficient for most large-scale applications.


Bloom Filters

DfBloom Filter

A Bloom filter (Bloom, 1970) is a space-efficient probabilistic data structure for set membership testing. It supports only two operations: add(x) and might_contain(x). The filter is a bit array of mm bits initialized to zero, with kk independent hash functions h1,…,hkh_1, \ldots, h_k.

Add: Set bits at positions h1(x),h2(x),…,hk(x)h_1(x), h_2(x), \ldots, h_k(x) to 1.

Query: Return true if all bits at positions h1(x),…,hk(x)h_1(x), \ldots, h_k(x) are 1.

ThBloom Filter False Positive Rate

For a Bloom filter of size mm bits with kk hash functions containing nn elements, the false positive probability is approximately:

pβ‰ˆ(1βˆ’eβˆ’kn/m)kp \approx \left(1 - e^{-kn/m}\right)^k

The optimal number of hash functions is k=mnln⁑2k = \frac{m}{n} \ln 2.

Parametersn=106n = 10^6 elements, p=1%p = 1\%n=109n = 10^9 elements, p=0.1%p = 0.1\%
Memory mm~1.2 MB~175 MB
Hash functions kk712
Bits per element~9.6~175

Online Gradient Descent

DfOnline Gradient Descent (OGD)

In the online convex optimization framework, at each round tt:

  1. The learner picks a point wtw_t from a convex set KK.
  2. The adversary reveals a convex loss function ftf_t.
  3. The learner incurs loss ft(wt)f_t(w_t).

OGD update rule:

wt+1=Ξ K(wtβˆ’Ξ·tβˆ‡ft(wt))w_{t+1} = \Pi_K\left(w_t - \eta_t \nabla f_t(w_t)\right)

where Ξ K\Pi_K is the projection onto KK and Ξ·t\eta_t is the learning rate.

ThRegret Bound for OGD

The regret of OGD is bounded by:

RT=βˆ‘t=1Tft(wt)βˆ’min⁑w∈Kβˆ‘t=1Tft(w)≀βˆ₯w1βˆ’wβˆ—βˆ₯22Ξ·T+Ξ·T2βˆ‘t=1Tβˆ₯βˆ‡ft(wt)βˆ₯2R_T = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in K} \sum_{t=1}^{T} f_t(w) \leq \frac{\|w_1 - w^*\|^2}{2\eta_T} + \frac{\eta_T}{2} \sum_{t=1}^{T} \|\nabla f_t(w_t)\|^2

With step size Ξ·t=DGt\eta_t = \frac{D}{G\sqrt{t}} (where D=diam(K)D = \text{diam}(K), G=max⁑tβˆ₯βˆ‡ftβˆ₯G = \max_t \|\nabla f_t\|), the regret is O(GDT)O(GD\sqrt{T}).

Diminishing Step Sizes

Unlike batch gradient descent, OGD requires diminishing step sizes to achieve sublinear regret. Constant step sizes yield linear regret in adversarial settings, which is suboptimal.


Multi-Armed Bandits

DfMulti-Armed Bandit (MAB)

A multi-armed bandit consists of KK arms, each yielding i.i.d. rewards from an unknown distribution νk\nu_k. At each round tt, the learner selects an arm AtA_t and observes reward Xt∼νAtX_t \sim \nu_{A_t}. The goal is to maximize cumulative reward while balancing exploration and exploitation.

Upper Confidence Bound (UCB1)

UCB1 Index

The UCB1 algorithm selects arm aa maximizing:

Xa+2ln⁑tNa(t)X_a + \sqrt{\frac{2 \ln t}{N_a(t)}}

where XaX_a is the empirical mean, Na(t)N_a(t) is the number of times arm aa has been pulled, and tt is the current round.

ThUCB1 Regret Bound

The UCB1 algorithm achieves expected regret:

E[RT]β‰€βˆ‘k:Ξ”k>0(8ln⁑TΞ”k+Ο€23Ξ”k)\mathbb{E}[R_T] \leq \sum_{k: \Delta_k > 0} \left(\frac{8 \ln T}{\Delta_k} + \frac{\pi^2}{3\Delta_k}\right)

where Ξ”k=ΞΌβˆ—βˆ’ΞΌk\Delta_k = \mu^* - \mu_k is the suboptimality gap. This is asymptotically optimal up to constant factors.

Thompson Sampling

DfThompson Sampling

Thompson sampling maintains a posterior distribution over each arm's reward distribution. At each round:

  1. Sample θ~k∼P(θk∣data)\tilde{\theta}_k \sim P(\theta_k | \text{data}) for each arm kk.
  2. Select At=arg⁑max⁑kθ~kA_t = \arg\max_k \tilde{\theta}_k.
  3. Observe reward and update posterior.

For Bernoulli arms with Beta prior Beta(Ξ±,Ξ²)\text{Beta}(\alpha, \beta), the posterior after ss successes and ff failures is Beta(Ξ±+s,Ξ²+f)\text{Beta}(\alpha + s, \beta + f).

Thompson Sampling vs UCB

Thompson sampling often outperforms UCB1 in practice due to its natural exploration mechanism. It implicitly balances exploitation (high posterior means) with uncertainty (high posterior variance) through the sampling step.


Python Implementation

import numpy as np
from collections import defaultdict
import hashlib

# --- Welford's Online Algorithm ---
class WelfordOnlineStatistics:
    """Track mean and variance using Welford's algorithm in one pass."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        delta2 = x - self.mean
        self.M2 += delta * delta2

    @property
    def variance(self):
        return self.M2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return np.sqrt(self.variance)

    def __repr__(self):
        return f"Welford(n={self.n}, mean={self.mean:.4f}, var={self.variance:.4f})"

# Demonstration
np.random.seed(42)
data = np.random.normal(loc=5.0, scale=2.0, size=10000)

w = WelfordOnlineStatistics()
for x in data:
    w.update(x)

print(f"Streaming: {w}")
print(f"Batch mean: {data.mean():.4f}, Batch var: {data.var():.4f}")
# Output: Streaming mean/var matches batch exactly

# --- Count-Min Sketch ---
class CountMinSketch:
    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = np.zeros((depth, width), dtype=int)
        self.hashes = [self._make_hash(i) for i in range(depth)]

    def _make_hash(self, seed):
        def h(x):
            return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16) % self.width
        return h

    def add(self, item, count=1):
        for i in range(self.depth):
            self.table[i][self.hashes[i](item)] += count

    def estimate(self, item):
        return min(self.table[i][self.hashes[i](item)] for i in range(self.depth))

# Demonstration
cms = CountMinSketch(width=1000, depth=5)
items = np.random.choice(1000, size=100000, replace=True)
for item in items:
    cms.add(str(item))

print(f"\nCount-Min Sketch estimates for first 5 items:")
for i in range(5):
    print(f"  Item '{i}': true={np.sum(items==i)}, est={cms.estimate(str(i))}")

# --- HyperLogLog (simplified) ---
class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.buckets = [0] * self.m

    def _hash(self, item):
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        return h

    def _leading_zeros(self, h, max_bits=64):
        if h == 0:
            return max_bits
        count = 0
        while (h & 1) == 0:
            count += 1
            h >>= 1
        return count

    def add(self, item):
        h = self._hash(item)
        bucket = h & (self.m - 1)
        w = h >> self.p
        lz = self._leading_zeros(w)
        self.buckets[bucket] = max(self.buckets[bucket], lz)

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** (-b) for b in self.buckets)
        return raw

# Demonstration
hll = HyperLogLog(p=14)
true_n = 500000
for i in range(true_n):
    hll.add(i)

print(f"\nHyperLogLog:")
print(f"  True cardinality: {true_n}")
print(f"  Estimated: {hll.cardinality():.0f}")
print(f"  Error: {abs(hll.cardinality() - true_n) / true_n * 100:.2f}%")

# --- Bloom Filter ---
class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [False] * size
        self.hashes = [self._make_hash(i) for i in range(num_hashes)]

    def _make_hash(self, seed):
        def h(item):
            return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16) % self.size
        return h

    def add(self, item):
        for h in self.hashes:
            self.bit_array[h(item)] = True

    def __contains__(self, item):
        return all(self.bit_array[h(item)] for h in self.hashes)

# Demonstration
bf = BloomFilter(size=10000, num_hashes=7)
for i in range(1000):
    bf.add(f"item_{i}")

false_positives = sum(f"new_{i}" in bf for i in range(10000))
print(f"\nBloom Filter: {false_positives}/10000 false positives "
      f"(expected ~1%)")

# --- Multi-Armed Bandit (UCB1) ---
class UCB1:
    def __init__(self, k):
        self.k = k
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
        self.total = 0

    def select_arm(self):
        # Pull each arm once first
        for i in range(self.k):
            if self.counts[i] == 0:
                return i
        ucb_values = self.values + np.sqrt(2 * np.log(self.total) / self.counts)
        return np.argmax(ucb_values)

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
        self.total += 1

# 10-armed Gaussian bandit test
np.random.seed(42)
true_means = np.random.normal(0, 1, 10)
optimal_arm = np.argmax(true_means)

agent = UCB1(10)
cumulative_regret = []

for t in range(1, 2001):
    arm = agent.select_arm()
    reward = np.random.normal(true_means[arm], 1.0)
    agent.update(arm, reward)
    regret = true_means[optimal_arm] - true_means[arm]
    cumulative_regret.append(regret)

print(f"\nUCB1 Bandit: final avg reward = {agent.values[optimal_arm]:.3f} "
      f"(true optimal = {true_means[optimal_arm]:.3f})")
print(f"Cumulative regret after 2000 rounds: {sum(cumulative_regret):.1f}")

Key Takeaways

Summary: Streaming Statistics and Online Learning

  1. Welford's algorithm computes exact mean and variance in a single pass with O(1)O(1) memory, avoiding catastrophic cancellation.
  2. Count-Min sketch estimates item frequencies with controllable error Ρ\varepsilon and probability δ\delta using only O(1Ρlog⁑1δ)O(\frac{1}{\varepsilon} \log \frac{1}{\delta}) counters.
  3. HyperLogLog estimates set cardinality with ~1% error using only ~2 KB memory, independent of the true cardinality.
  4. Bloom filters test set membership with zero false negatives and tunable false positive rates in O(1)O(1) query time.
  5. Online gradient descent achieves O(T)O(\sqrt{T}) regret in convex settings, enabling learning against adversaries.
  6. Multi-armed bandits (UCB1, Thompson sampling) balance exploration and exploitation with provable regret bounds.
  7. All streaming algorithms trade a small, controlled probability of error for dramatic reductions in space and time complexity.

Next Steps

⭐

Premium Content

Streaming Statistics and Online Learning

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement