πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design a Real-time Analytics System

System Design ProblemsStream Processing🟒 Free Lesson

Advertisement

System Design Problems

Design a Real-time Analytics System

A real-time analytics system ingests high-volume event streams and provides sub-second query responses for dashboards, A/B testing, and business intelligence. Systems like Apache Kafka, Flink, and ClickPower enable real-time data analysis at massive scale.

  • Stream Processing β€” Process millions of events per second in real-time
  • Sub-second Queries β€” Dashboard responses in < 1 second
  • Time-series Optimization β€” Data partitioned by time for efficient range queries

The challenge is balancing write throughput (ingest millions of events/sec) with read latency (query results in < 1 second) across time-partitioned data.

Requirements

Functional Requirements

  • Ingest events from multiple sources (clickstream, IoT, logs)
  • Store events with timestamps for time-range queries
  • Support real-time aggregations (counts, sums, averages, percentiles)
  • Dashboard queries with < 1 second latency
  • Support for ad-hoc queries on recent data (last 24 hours)
  • Historical data available for batch queries (last 90 days)

Non-Functional Requirements

  • Write Throughput: 1M events/second
  • Query Latency: < 1 second for dashboard queries
  • Data Retention: 90 days hot, 1 year cold
  • Accuracy: Exact for counts, approximate for percentiles (HyperLogLog)
  • Availability: 99.99%

Real-time analytics systems are write-heavy with relatively few reads. The key insight is to pre-aggregate data at write time (materialized views) rather than computing aggregations at query time.

Back-of-the-Envelope Estimation

Analytics System Capacity

  • 1M events/sec Γ— 86400 sec = 86.4B events/day
  • Average event size: 500 bytes
  • Daily storage: 86.4B Γ— 500 bytes = 43.2 TB/day
  • 90-day hot storage: ~3.9 PB
  • Query QPS: 10K (dashboard + ad-hoc)
  • Dashboard refresh rate: 10 sec intervals

High-Level Architecture

EventsClickstreamIoT SensorsApplication LogsMobile SDKKafkaEvent Stream(Partitioned)Stream ProcessorFlink/SparkAggregationBatch ProcessorSpark/HiveETL PipelineClickHouseReal-time OLAPColumn-orientedData Lake(S3/HDFS)Real-time Analytics Architecture (Lambda)

Detailed Design

Lambda Architecture

The Lambda architecture combines batch and stream processing:

DfLambda Architecture

The Lambda architecture processes data through two paths: a batch layer that processes all historical data for completeness, and a speed layer that processes recent data for real-time results. A serving layer merges both outputs.

LayerPurposeTechnology
Batch LayerComplete, accurate results over all dataSpark, Hive
Speed LayerLow-latency results over recent dataFlink, Kafka Streams
Serving LayerMerged view for queriesClickHouse, Druid

The Lambda architecture is being replaced by the Kappa architecture in many modern systems. Kappa uses only stream processing (no batch), simplifying the architecture.

Pre-aggregation Strategy

Pre-aggregate data at write time to enable fast queries:

DfPre-aggregation

Instead of storing raw events and computing aggregations at query time, compute aggregations in real-time as events arrive. Store pre-computed aggregates (counts, sums, averages) in materialized views.

Pre-aggregation Storage

storagepre=NdimensionsΓ—Ntime_bucketsΓ—NmetricsΓ—8 bytesstorage_{pre} = N_{dimensions} \times N_{time\_buckets} \times N_{metrics} \times 8\,bytes

Here,

  • NdimensionsN_{dimensions}=Number of unique dimension combinations
  • NtimebucketsN_{time_buckets}=Number of time buckets (e.g., 1440 minutes/day)
  • NmetricsN_{metrics}=Number of metrics per aggregation

Pre-aggregation Savings

Raw events: 1M events/sec Γ— 500 bytes Γ— 86400 = 43.2 TB/day

Pre-aggregated (minute-level, 100 dimension combinations): 100 Γ— 1440 minutes Γ— 5 metrics Γ— 8 bytes Γ— 86400 = ~5 GB/day

Pre-aggregation reduces storage by 99.99% for aggregate queries.

OLAP Database (ClickHouse)

DfColumn-oriented Storage

Column-oriented databases store data by column rather than by row. This enables efficient aggregation queries (SUM, COUNT, AVG) because only the relevant columns are read from disk.

Row 1: [id, ts, user, action, ...]Row 2: [id, ts, user, action, ...]Row-oriented (OLTP)TimestampcolUser IDcolActioncolColumn-oriented (OLAP)Column storage is 10-100x faster for aggregation queries

ClickHouse uses LZ4 compression on column data, achieving 10:1 compression ratios. A 43 TB/day raw event stream becomes ~4 TB compressed in ClickHouse.

Time-series Partitioning

Partition data by time for efficient range queries:

Time Partitioning

partition=date_trunc(timestamp,day)partition = \text{date\_trunc}(timestamp, \text{day})

Here,

  • timestamptimestamp=Event timestamp
  • dayday=Partition granularity

Partition by day for recent data (hot) and by month for historical data (cold). Drop old partitions for data retention management.

Approximate Algorithms

For high-cardinality analytics, use approximate algorithms:

AlgorithmPurposeErrorMemory
HyperLogLogDistinct count~0.8%12 KB
Count-Min SketchFrequency estimationVariesFixed
T-DigestPercentile estimation< 1%100 KB

HyperLogLog Distinct Count

E=Ξ±mΓ—m2Γ—(βˆ‘j=1m2βˆ’Mj)βˆ’1E = \alpha_m \times m^2 \times (\sum_{j=1}^{m} 2^{-M_j})^{-1}

Here,

  • mm=Number of registers (buckets)
  • MjM_j=Maximum leading zeros in register j
  • alphamalpha_m=Bias correction constant

HyperLogLog Efficiency

To count 1 billion distinct user IDs:

  • Exact counting: 1 billion Γ— 8 bytes = 8 GB
  • HyperLogLog: 12 KB with ~0.8% error

HyperLogLog reduces memory by 99.998% with negligible accuracy loss.

Practice Exercises

  1. Design: How would you implement a real-time dashboard that shows the number of active users in the last 5 minutes? What data structure would you use?

  2. Scale: If the system ingests 1M events/second, estimate the Kafka partition count needed and the ClickHouse cluster size for 90-day retention.

  3. Consistency: In a Lambda architecture, how do you handle the transition from batch-computed results to stream-computed results? Design a merge strategy.

  4. Optimization: How would you optimize ClickHouse queries for a dashboard that queries 10 different time ranges and 5 different dimensions simultaneously?

Key Takeaways:

  • Lambda architecture combines batch (completeness) and stream (speed) processing
  • Pre-aggregation at write time reduces query storage by 99%+
  • Column-oriented databases (ClickHouse) are 10-100x faster for aggregation queries
  • HyperLogLog provides distinct counts with 0.8% error at 12 KB memory
  • Time-series partitioning enables efficient range queries and data lifecycle management

What to Learn Next

-> Design Metrics Monitoring Collecting and querying system metrics at scale.

-> Message Queues Kafka for event streaming and partitioning strategies.

-> Event-Driven Architecture Event sourcing and CQRS patterns.

-> Databases Column-oriented vs row-oriented storage trade-offs.

-> Caching Strategies Caching pre-aggregated results for dashboard queries.

-> Design Realtime Analytics This article (stream processing deep dive).

⭐

Premium Content

Design a Real-time Analytics System

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement