Design a Metrics Monitoring System
A metrics monitoring system collects, stores, and visualizes time-series data for infrastructure and application monitoring. Tools such as Prometheus, Datadog, and Grafana work with billions of data points daily, enabling operators to detect anomalies and alert on issues in near real time.
- Time-series Storage: Efficiently store metrics indexed by time and labels
- Real-time Ingestion: Process millions of data points per second
- Alerting: Detect anomalies and notify on-call engineers
Time-series data is append-heavy and read-heavy for recent ranges. The storage engine must optimize for sequential writes and range queries over time windows.
Requirements
Functional Requirements
- Ingest metrics from multiple sources (servers, containers, applications)
- Store time-series data (metric name, labels, timestamp, value)
- Query metrics over time ranges (last 5 min, 1 hour, 7 days)
- Aggregate metrics (rate, avg, sum, count, percentiles)
- Dashboard visualization with auto-refresh
- Alert rules that trigger notifications
- Metric discovery and labeling
Non-Functional Requirements
- Write Throughput: 10M data points/second
- Query Latency: < 1 second for dashboard queries
- Retention: 15 days raw, 1 year downsampled
- Availability: 99.99%
- Accuracy: Exact for counters, approximate for high-cardinality
Monitoring systems are write-heavy (10-100x more writes than reads). The storage engine must optimize for append-only sequential writes and efficient time-range queries.
Back-of-the-Envelope Estimation
Metrics System Capacity
- 10M data points/sec × 86,400 sec = 864B points/day
- Each point: 50 bytes (metric name hash + labels hash + timestamp + value)
- Daily storage: 864B × 50 bytes = 43.2 TB/day
- 15-day retention: 648 TB
- With compression (10:1): 64.8 TB
- Query QPS: 100K (dashboards + alerts)
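The arithmetic above can be reproduced with a short script. This is only a sketch of the estimate: the constants are the figures from the list, and the 50-byte point size and 10:1 compression ratio are the same assumptions made there.

```python
# Back-of-the-envelope capacity estimate using the figures listed above.
POINTS_PER_SEC = 10_000_000   # write throughput target
BYTES_PER_POINT = 50          # assumed uncompressed size of one point
RETENTION_DAYS = 15           # raw retention window
COMPRESSION_RATIO = 10        # assumed 10:1 compression
SECONDS_PER_DAY = 86_400
TB = 10**12                   # decimal terabytes

points_per_day = POINTS_PER_SEC * SECONDS_PER_DAY
daily_tb = points_per_day * BYTES_PER_POINT / TB
raw_retention_tb = daily_tb * RETENTION_DAYS
compressed_tb = raw_retention_tb / COMPRESSION_RATIO

print(f"{points_per_day:,} points/day")
print(f"{daily_tb:.1f} TB/day, {raw_retention_tb:.0f} TB raw, "
      f"{compressed_tb:.1f} TB compressed")
```

Changing any one constant (for example the bytes per point) shows how sensitive the final figure is to that assumption.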
Data Model
Definition: Time-Series Data Point
A time-series data point consists of: metric name, label set (key-value pairs), timestamp, and numeric value. The combination of metric name + labels uniquely identifies a time series.
metric: http_requests_total
labels: {method: "GET", status: "200", endpoint: "/api/v1/users"}
timestamp: 1687267200000 (ms)
value: 15234
# This is one data point in one time series
# A system with 10K endpoints × 3 methods × 5 status codes = 150K time series
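One way to picture this model in code is a series key built from the metric name and label set, with points appended per key. The names `SeriesKey`, `DataPoint`, `store`, and `append` are hypothetical and chosen for illustration; a real TSDB would use a far more compact layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeriesKey:
    """Identity of one time series: metric name plus its full label set."""
    metric: str
    labels: frozenset  # frozenset of (label_name, label_value) pairs

    @classmethod
    def of(cls, metric: str, **labels: str) -> "SeriesKey":
        return cls(metric, frozenset(labels.items()))


@dataclass(frozen=True)
class DataPoint:
    timestamp_ms: int
    value: float


# Hypothetical in-memory store: series key -> points in arrival order.
store: dict[SeriesKey, list[DataPoint]] = {}


def append(key: SeriesKey, point: DataPoint) -> None:
    store.setdefault(key, []).append(point)


get_ok = SeriesKey.of("http_requests_total",
                      method="GET", status="200", endpoint="/api/v1/users")
append(get_ok, DataPoint(1687267200000, 15234))
append(get_ok, DataPoint(1687267201000, 15240))

# Changing a single label value identifies a different series.
get_err = SeriesKey.of("http_requests_total",
                       method="GET", status="500", endpoint="/api/v1/users")
append(get_err, DataPoint(1687267200000, 12))

print(len(store), "series;", len(store[get_ok]), "points in the first one")
```

Using a `frozenset` for labels makes the key independent of label order, so the same labels always map to the same series.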
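Cardinality
$$\text{Cardinality} = \prod_{i=1}^{n} |L_i|$$
Here,
- $n$ = Number of labels on the metric
- $|L_i|$ = Number of unique values for each label $i$
This is an upper bound: it assumes every combination of label values actually occurs.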
Cardinality Explosion
Labels: method (3 values) × status (5 values) × endpoint (10K values) × host (1000 values)
Total time series: 3 × 5 × 10000 × 1000 = 150M time series
At 1 point/sec per series: 150M points/sec, which is 15x the 10M points/sec ingestion target
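The multiplication above is easy to script, which helps when checking the effect of adding or dropping a label. The function name `series_cardinality` is hypothetical, and the result is an upper bound that assumes every label combination occurs.

```python
from math import prod


def series_cardinality(label_values: dict[str, int]) -> int:
    """Upper bound on series count: product of unique values per label."""
    return prod(label_values.values())


labels = {"method": 3, "status": 5, "endpoint": 10_000, "host": 1_000}

total = series_cardinality(labels)
without_host = series_cardinality(
    {name: n for name, n in labels.items() if name != "host"}
)

print(f"with host label:    {total:,} series")
print(f"without host label: {without_host:,} series")
```

Dropping the `host` label in this example cuts the series count by a factor of 1,000, which is why high-cardinality labels deserve scrutiny before they are added.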
High-Level Architecture
Detailed Design
Time-Series Database (TSDB)
Definition: Time-Series Database
A TSDB is optimized for append-heavy time-series workloads. Data is organized by metric name and labels into time series, each with a sequence of timestamp-value pairs.
Storage Engine
Definition: LSM-Tree for Time Series
A Log-Structured Merge-Tree is ideal for time-series data. Writes append to an in-memory memtable, then flush to disk as sorted SSTables. This provides excellent write throughput for append-heavy workloads.
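The write path described above can be sketched for a single series as follows. `TinyLSM` is a hypothetical toy class: it keeps everything in memory, has no write-ahead log, and never compacts, so it illustrates only the memtable-then-sorted-run idea.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store for one series (timestamp -> value)."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable_limit = memtable_limit
        self.memtable: dict[int, float] = {}
        # Each flushed table is a list of (timestamp, value) sorted by timestamp.
        self.sstables: list[list[tuple[int, float]]] = []

    def write(self, ts: int, value: float) -> None:
        self.memtable[ts] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Freeze the memtable into an immutable sorted run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def range_query(self, start: int, end: int) -> list[tuple[int, float]]:
        """Points with start <= ts <= end; newer writes win on duplicates."""
        merged: dict[int, float] = {}
        for table in self.sstables:  # oldest to newest
            lo = bisect.bisect_left(table, (start,))
            hi = bisect.bisect_right(table, (end, float("inf")))
            merged.update(table[lo:hi])
        merged.update(
            {ts: v for ts, v in self.memtable.items() if start <= ts <= end}
        )
        return sorted(merged.items())


db = TinyLSM(memtable_limit=3)
for ts in range(1, 8):
    db.write(ts, ts * 10.0)

print(len(db.sstables), "flushed tables,", len(db.memtable), "point in memtable")
print(db.range_query(3, 7))
```

Because each flushed table is sorted by timestamp, a time-range read is a binary search plus a sequential scan per table, which is the access pattern range queries need.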
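Write Amplification
$$\text{WA} = \frac{B_{\text{written}}}{B_{\text{user}}}$$
Here,
- $B_{\text{written}}$ = Total bytes written including compaction
- $B_{\text{user}}$ = Original user write bytes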
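Use Gorilla-style encoding for time-series compression: delta-of-delta encoding for timestamps and XOR encoding for values. The Gorilla paper reports roughly 12x compression on production monitoring data; the ratio in practice depends on how regular the timestamps are and how slowly the values change.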
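To see why this style of encoding compresses well, the sketch below applies the two transforms without the bit-packing step. The function names are hypothetical, and a real encoder would go on to store the small or zero results in variable-length bit fields.

```python
import struct


def delta_of_delta(timestamps: list[int]) -> list[int]:
    """First timestamp, then the change in delta for each later point."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def undo_delta_of_delta(encoded: list[int]) -> list[int]:
    """Inverse of delta_of_delta."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


def float_bits(x: float) -> int:
    """IEEE 754 double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_values(values: list[float]) -> list[int]:
    """First value's raw bits, then XOR of each value with its predecessor."""
    if not values:
        return []
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]


ts = [1000, 1010, 1020, 1030, 1041]  # scrape every 10 s, last one 1 s late
vals = [12.0, 12.0, 12.0, 13.0]

encoded_ts = delta_of_delta(ts)
encoded_vals = xor_values(vals)
print(encoded_ts)                        # mostly zeros for a regular interval
print([hex(b) for b in encoded_vals])    # zeros where the value did not change
```

Regular scrape intervals turn into runs of zeros, and unchanged values XOR to zero, so most points need only a bit or two once packed.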
Downsampling
Reduce storage by aggregating old data at lower resolution:
| Data Age | Resolution | Data Points/Day (per series) |
|---|---|---|
| 0-15 days | 1 second | 86,400 |
| 15-90 days | 1 minute | 1,440 |
| 90-365 days | 1 hour | 24 |
| 1+ year | 1 day | 1 |
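Downsampling Savings
$$\text{Reduction factor} = \frac{T_{\text{downsampled}}}{T_{\text{raw}}}$$
Here,
- $T_{\text{raw}}$ = Raw data point interval (e.g., 1 second)
- $T_{\text{downsampled}}$ = Downsampled interval (e.g., 1 minute)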
Downsampling Impact
Raw: 86,400 points/day/series
Downsampled to 1 minute: 1,440 points/day/series
Storage reduction: 60x (from 43 TB/day to 0.7 TB/day)
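A minimal downsampling sketch, assuming one average per fixed-width bucket, looks like this. The function name `downsample` is hypothetical. Averaging suits gauges; real systems often keep min, max, sum, and count per bucket so that other aggregates remain answerable.

```python
from collections import defaultdict


def downsample(points, bucket_seconds=60):
    """Average (timestamp_seconds, value) points into fixed-width buckets.

    Returns one (bucket_start, mean_value) pair per non-empty bucket,
    ordered by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]


# Three minutes of synthetic 1-second data.
raw = [(t, float(t % 60)) for t in range(0, 180)]
rollup = downsample(raw, bucket_seconds=60)

print(len(raw), "raw points ->", len(rollup), "downsampled points")
print(rollup)
```

Going from 1-second to 1-minute buckets keeps one point out of every 60, which matches the 60x reduction quoted above.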
Alerting System
Definition: Alert Rule
An alert rule defines a condition over metrics that triggers a notification. Rules are evaluated periodically (e.g., every minute) and have states: pending → firing → resolved.
alert_rule: {
name: "High Error Rate",
query: "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))",
condition: "> 0.05", // > 5% error rate
duration: "5m", // Must be true for 5 minutes
severity: "critical",
notify: ["pagerduty", "slack"]
}
Use the duration field (the "for" clause in Prometheus alerting rules) to avoid alerting on transient spikes. The condition must hold for the full duration before the alert moves from pending to firing. This reduces false positives.
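The pending, firing, and resolved states can be sketched as a small state machine that is fed one evaluated value per tick. `AlertState` and its fields are hypothetical names, and the extra "inactive" state is an assumption added here for a rule that has never fired.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    threshold: float
    for_seconds: int                   # how long the condition must hold
    state: str = "inactive"            # inactive | pending | firing | resolved
    pending_since: Optional[int] = None

    def evaluate(self, now: int, value: float) -> str:
        """Advance the state machine with the rule's value at time `now`."""
        if value > self.threshold:
            if self.state in ("inactive", "resolved"):
                self.state, self.pending_since = "pending", now
            if (self.state == "pending"
                    and now - self.pending_since >= self.for_seconds):
                self.state = "firing"
        else:
            # Condition cleared: a firing alert resolves, a pending one resets.
            self.state = "resolved" if self.state == "firing" else "inactive"
            self.pending_since = None
        return self.state


rule = AlertState(threshold=0.05, for_seconds=300)
samples = (
    [(0, 0.08), (60, 0.02)]                       # transient spike, then clear
    + [(t, 0.09) for t in range(120, 481, 60)]    # sustained breach
    + [(540, 0.01)]                               # recovery
)
history = [rule.evaluate(t, v) for t, v in samples]
print(history)
```

The spike at t=0 never fires because it clears within one evaluation, while the sustained breach starting at t=120 fires once it has lasted the full 300 seconds.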
PromQL-style Query Language
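Rate Calculation
$$\text{rate}(m[\Delta t]) = \frac{m(t) - m(t - \Delta t)}{\Delta t}$$
Here,
- $m$ = Counter metric name, with $m(t)$ its value at time $t$
- $\Delta t$ = Time window for rate calculation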
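A per-second counter rate over a trailing window can be sketched as below. The function name `rate` and the sample layout are assumptions for illustration. The sketch treats any drop in the counter as a restart from zero and divides by the time actually covered by the samples; Prometheus's own `rate()` additionally extrapolates to the window boundaries, which is not modeled here.

```python
def rate(samples, window_seconds):
    """Per-second increase of a counter over the trailing window.

    samples: list of (timestamp_seconds, counter_value), sorted by timestamp.
    """
    if not samples:
        return 0.0
    end = samples[-1][0]
    in_window = [(t, v) for t, v in samples if t >= end - window_seconds]
    if len(in_window) < 2:
        return 0.0

    increase = 0.0
    for (_, prev), (_, cur) in zip(in_window, in_window[1:]):
        # A drop means the counter restarted; count the new value as growth.
        increase += cur - prev if cur >= prev else cur

    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


steady = [(0, 0), (300, 1500)]
with_reset = [(0, 100), (60, 160), (120, 220), (180, 10), (240, 70), (300, 130)]

print(rate(steady, 300))       # 1500 requests over 300 s
print(rate(with_reset, 300))   # restart at t=180 is not counted as a decrease
```

Handling resets matters because a process restart would otherwise show up as a large negative rate.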
PromQL Queries
# Requests per second over last 5 minutes
rate(http_requests_total[5m])
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate (aggregate both sides so the division matches on labels)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Top 10 endpoints by request rate
topk(10, sum by (endpoint) (rate(http_requests_total[5m])))
Practice Exercises
- Design: How would you implement a label index that supports efficient multi-label queries (e.g., method="GET" AND status="200" AND host=~"web-.*")?
- Scale: If the system ingests 10M data points/second with 15-day retention, estimate the storage needed with 10:1 compression and downsampling to 1-minute resolution after 15 days.
- Alerting: Design an alert deduplication and grouping system that sends one notification for related alerts (e.g., 50 servers all reporting high CPU).
- Optimization: How would you optimize dashboard queries that render 50 panels, each querying 10 time series over the last 24 hours?
Key Takeaways:
- Time-series databases use LSM-trees for append-heavy write optimization
- Gorilla-style compression (delta-of-delta timestamps, XOR-encoded values) achieves roughly 12x compression for monitoring data
- Downsampling reduces storage by 60x (1s → 1min) for historical data
- Alert rules with "for" duration prevent false positives from transient spikes
- Label indexing enables efficient multi-dimensional queries across metrics
What to Learn Next
- Design Realtime Analytics: Stream processing for real-time event analytics.
- Observability: Logs, metrics, and traces for system observability.
- Databases: Time-series databases and LSM-tree storage engines.
- Message Queues: Kafka for metric ingestion and streaming.
- Design Notification System: Alert notification delivery via multiple channels.
- Caching Strategies: Caching dashboard queries and metric aggregates.