
Design a Metrics Monitoring System


A metrics monitoring system collects, stores, and visualizes time-series data for infrastructure and application monitoring. Systems like Prometheus, Datadog, and Grafana handle billions of data points daily, enabling operators to detect anomalies and alert on issues in near real time.

  • Time-series Storage: Efficiently store metrics indexed by time and labels
  • Real-time Ingestion: Process millions of data points per second
  • Alerting: Detect anomalies and notify on-call engineers

Time-series data is append-heavy and read-heavy for recent ranges. The storage engine must optimize for sequential writes and range queries over time windows.

Requirements

Functional Requirements

  • Ingest metrics from multiple sources (servers, containers, applications)
  • Store time-series data (metric name, labels, timestamp, value)
  • Query metrics over time ranges (last 5 min, 1 hour, 7 days)
  • Aggregate metrics (rate, avg, sum, count, percentiles)
  • Dashboard visualization with auto-refresh
  • Alert rules that trigger notifications
  • Metric discovery and labeling

Non-Functional Requirements

  • Write Throughput: 10M data points/second
  • Query Latency: < 1 second for dashboard queries
  • Retention: 15 days raw, 1 year downsampled
  • Availability: 99.99%
  • Accuracy: Exact for counters, approximate for high-cardinality

Monitoring systems are write-heavy (10-100x more writes than reads). The storage engine must optimize for append-only sequential writes and efficient time-range queries.

Back-of-the-Envelope Estimation

Metrics System Capacity

  • 10M data points/sec × 86,400 sec = 864B points/day
  • Each point: 50 bytes (metric name hash + labels hash + timestamp + value)
  • Daily storage: 864B × 50 bytes = 43.2 TB/day
  • 15-day retention: 648 TB
  • With compression (10:1): 64.8 TB
  • Query QPS: 100K (dashboards + alerts)
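The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `estimate_capacity` and its parameters are illustrative, and terabytes are taken as decimal (10^12 bytes) to match the figures in the list.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabytes, matching the estimate above


def estimate_capacity(points_per_sec, bytes_per_point, retention_days, compression_ratio):
    """Return points per day plus daily, retained, and compressed storage in TB."""
    points_per_day = points_per_sec * SECONDS_PER_DAY
    daily_tb = points_per_day * bytes_per_point / TB
    retained_tb = daily_tb * retention_days
    return {
        "points_per_day": points_per_day,
        "daily_tb": daily_tb,
        "retained_tb": retained_tb,
        "compressed_tb": retained_tb / compression_ratio,
    }


# Inputs taken from the requirements and estimation lists above.
estimate = estimate_capacity(
    points_per_sec=10_000_000,
    bytes_per_point=50,
    retention_days=15,
    compression_ratio=10,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

Changing a single input, such as the bytes per point or the compression ratio, shows how sensitive the final figure is to each assumption.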

Data Model

Definition: Time-Series Data Point

A time-series data point consists of: metric name, label set (key-value pairs), timestamp, and numeric value. The combination of metric name + labels uniquely identifies a time series.

Example data point
metric: http_requests_total
labels: {method: "GET", status: "200", endpoint: "/api/v1/users"}
timestamp: 1687267200000 (ms)
value: 15234

# This is one data point in one time series
# A system with 10K endpoints × 3 methods × 5 status codes = 150K time series

Cardinality

$$\text{time\_series} = \prod_{\text{label}} \lvert \text{values}(\text{label}) \rvert$$

Here,

  • $\lvert \text{values}(\text{label}) \rvert$ = number of unique values for each label

Cardinality Explosion

Labels: method (3 values) × status (5 values) × endpoint (10K values) × host (1000 values)

Total time series: 3 × 5 × 10,000 × 1,000 = 150M time series

At 1 point/sec per series: 150M points/sec
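The product can be checked directly. This is a small sketch using the label counts from the example; the helper name `series_cardinality` is illustrative. It gives the worst case, where every label combination actually occurs.

```python
from math import prod


def series_cardinality(label_value_counts):
    """Upper bound on time series: product of unique values per label."""
    return prod(label_value_counts.values())


labels = {"method": 3, "status": 5, "endpoint": 10_000}
without_host = series_cardinality(labels)

# Adding a single high-cardinality label multiplies the total.
labels["host"] = 1_000
with_host = series_cardinality(labels)

print(f"without host: {without_host:,}")
print(f"with host:    {with_host:,}")
```

The jump from the first figure to the second comes entirely from one extra label, which is why per-request or per-user labels are usually kept out of metrics.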

High-Level Architecture

[Figure: Metrics Monitoring Architecture. Components shown: Sources (servers, containers, apps, DBs, CDNs); Agents and Collectors (push/pull, scrape + remote write); Processing (aggregation, downsampling, label index); Alert Evaluator; Query Engine; TSDB (InfluxDB/TDengine); Object Storage (long-term); Alert Manager (PagerDuty/Slack).]

Detailed Design

Time-Series Database (TSDB)

Definition: Time-Series Database

A TSDB is optimized for append-heavy time-series workloads. Data is organized by metric name and labels into time series, each with a sequence of timestamp-value pairs.

[Figure: two example time series at 1-minute resolution]
Time Series 1: http_requests_total{method="GET"}: 100, 150, 200, 180, 220
Time Series 2: http_requests_total{method="POST"}: 50, 75, 100, 90, 110
Each time series = metric name + label set + sequence of values

Storage Engine

Definition: LSM-Tree for Time Series

A Log-Structured Merge-Tree (LSM-tree) is well suited to time-series data. Writes go to an in-memory memtable, which is flushed to disk as immutable, sorted SSTables that are later merged by compaction. Because disk writes are sequential, this gives high write throughput for append-heavy workloads.
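The write path described above can be sketched as a toy in-memory model. This is an illustration under simplifying assumptions, not a real storage engine: `TinyLSM` is a hypothetical class, "SSTables" are plain Python lists rather than files, and there is no write-ahead log or compaction.

```python
from bisect import bisect_left, bisect_right


class TinyLSM:
    """Toy LSM store keyed by (series_id, timestamp). No WAL, no compaction."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}   # in-memory buffer: key -> value
        self.sstables = []   # flushed runs, oldest first: (sorted_keys, values)

    def put(self, series_id, timestamp, value):
        self.memtable[(series_id, timestamp)] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one immutable, sorted run."""
        if not self.memtable:
            return
        keys = sorted(self.memtable)
        self.sstables.append((keys, [self.memtable[k] for k in keys]))
        self.memtable = {}

    def range_query(self, series_id, start, end):
        """Return [(timestamp, value)] with start <= timestamp <= end."""
        lo, hi = (series_id, start), (series_id, end)
        found = {}
        for keys, values in self.sstables:  # oldest first, so newer runs overwrite
            i, j = bisect_left(keys, lo), bisect_right(keys, hi)
            found.update(zip(keys[i:j], values[i:j]))
        found.update((k, v) for k, v in self.memtable.items() if lo <= k <= hi)
        return [(ts, v) for (_, ts), v in sorted(found.items())]


db = TinyLSM(memtable_limit=4)
for t in range(6):
    db.put("cpu", t, t * 10)
db.put("mem", 2, 99)
print(db.range_query("cpu", 2, 4))
```

Because each run is sorted by (series, timestamp), a time-range query becomes two binary searches per run instead of a full scan.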

Write Amplification

$$\text{write\_amplification} = \frac{\text{bytes\_written\_to\_disk}}{\text{bytes\_written\_by\_user}}$$

Here,

  • bytes_written_to_disk = total bytes written, including compaction
  • bytes_written_by_user = original user write bytes

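Use Gorilla-style encoding for time-series compression: delta-of-delta encoding for timestamps and XOR encoding for values. The Gorilla paper reports roughly 12x compression on production monitoring data; the actual ratio depends on how regular the timestamps and values are.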
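In the Gorilla scheme, timestamps are stored as delta-of-delta and float values are XORed against the previous value. The sketch below shows only these two transformations and why they produce many zeros. It deliberately leaves out the variable-length bit packing that yields the actual space savings, and the function names are illustrative.

```python
import struct


def delta_of_delta(timestamps):
    """Encode as [first_timestamp, first_delta, delta_of_delta, ...]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded


def undo_delta_of_delta(encoded):
    """Inverse of delta_of_delta."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


def float_bits(x):
    """Raw 64-bit IEEE 754 pattern of a float, as an int."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_values(values):
    """XOR each value's bit pattern with the previous one (first vs 0)."""
    encoded, prev = [], 0
    for v in values:
        bits = float_bits(v)
        encoded.append(bits ^ prev)
        prev = bits
    return encoded


# Scrapes arrive every 60 s, with one sample a second late.
ts = [1000, 1060, 1120, 1180, 1241]
print(delta_of_delta(ts))

# Unchanged values XOR to zero, which a bit-level encoder can store in one bit.
print(xor_values([15.0, 15.0, 15.5]))
```

Regular scrape intervals and slowly changing values both turn into runs of zeros, which is what makes the later bit packing effective.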

Downsampling

Reduce storage by aggregating old data at lower resolution:

| Data Age | Resolution | Data Points/Day/Series |
| --- | --- | --- |
| 0-15 days | 1 second | 86,400 |
| 15-90 days | 1 minute | 1,440 |
| 90-365 days | 1 hour | 24 |
| 1+ year | 1 day | 1 |

Downsampling Savings

$$\text{reduction} = \frac{\text{resolution}_{\text{downsampled}}}{\text{resolution}_{\text{raw}}}$$

Here,

  • $\text{resolution}_{\text{raw}}$ = raw data point interval (e.g., 1 second)
  • $\text{resolution}_{\text{downsampled}}$ = downsampled interval (e.g., 1 minute)

For 1 second to 1 minute, the reduction is 60 s / 1 s = 60.

Downsampling Impact

Raw: 86,400 points/day/series
Downsampled to 1 minute: 1,440 points/day/series

Storage reduction: 60x (from 43 TB/day to 0.7 TB/day)
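A minimal sketch of the downsampling step is shown here, assuming points are `(timestamp_seconds, value)` pairs. The function name `downsample` and the choice to keep avg, min, max, and count per bucket are illustrative; keeping more than the average lets later queries still show spikes.

```python
from collections import defaultdict


def downsample(points, bucket_seconds=60):
    """Aggregate points into fixed buckets.

    Returns a list of (bucket_start, avg, min, max, count), sorted by time.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, sum(vals) / len(vals), min(vals), max(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]


# Two minutes of 1-second samples whose value equals the timestamp.
raw = [(t, float(t)) for t in range(120)]
rollup = downsample(raw, bucket_seconds=60)
print(len(raw), "raw points ->", len(rollup), "downsampled points")
print(rollup[0])
```

Here 120 raw points collapse to 2 rollup rows, the same 60x reduction as in the formula above.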

Alerting System

Definition: Alert Rule

An alert rule defines a condition over metrics that triggers a notification. Rules are evaluated periodically (e.g., every minute) and have states: pending → firing → resolved.

Example alert rule
alert_rule: {
  name: "High Error Rate",
  // Share of requests that returned a 5xx status over the last 5 minutes
  query: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))',
  condition: "> 0.05",     // > 5% error rate
  duration: "5m",          // Must be true for 5 minutes (the "for" duration)
  severity: "critical",
  notify: ["pagerduty", "slack"]
}

Use the "for" duration (the duration field in the rule above) to avoid alerting on transient spikes. The condition must hold continuously for the full duration before the alert moves from pending to firing, which reduces false positives.
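The pending → firing → resolved lifecycle with a "for" duration can be sketched as a small state machine. This is an illustrative model only: the class name `AlertRule`, the extra "inactive" state for a rule that has never breached, and the explicit `now` argument are assumptions made to keep the example testable.

```python
class AlertRule:
    """Tracks one alert's state across periodic evaluations."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.state = "inactive"      # assumed initial state, not from the text
        self.pending_since = None

    def evaluate(self, value, now):
        """Update and return the state given the latest query result."""
        if value <= self.threshold:
            # Condition cleared: a firing alert resolves, a pending one is dropped.
            self.state = "resolved" if self.state == "firing" else "inactive"
            self.pending_since = None
        elif self.state in ("inactive", "resolved"):
            self.state = "pending"
            self.pending_since = now
        elif self.state == "pending" and now - self.pending_since >= self.for_seconds:
            self.state = "firing"
        return self.state


rule = AlertRule(threshold=0.05, for_seconds=300)

# A transient spike: breaches once, then recovers before 5 minutes pass.
spike = [rule.evaluate(v, t) for t, v in [(0, 0.10), (60, 0.01)]]

# A sustained breach, evaluated every minute, then recovery.
samples = [(120 + 60 * i, 0.10) for i in range(6)] + [(480, 0.01)]
sustained = [rule.evaluate(v, t) for t, v in samples]
print(spike)
print(sustained)
```

The spike never leaves pending, while the sustained breach fires only once the condition has held for the full 300 seconds.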

PromQL-style Query Language

Rate Calculation

$$\text{rate}(\text{metric}[5m]) = \frac{\Delta \text{counter}}{\Delta \text{time}}$$

Here,

  • metric = counter metric name
  • 5m = time window for the rate calculation
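The rate formula can be sketched for the samples that fall inside one window. This is a simplified illustration: the function name `counter_rate` is hypothetical, a drop in the counter is treated as a restart from zero, and the boundary extrapolation that Prometheus applies in its own `rate()` is omitted.

```python
def counter_rate(samples):
    """Per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A decrease means the counter reset; count from zero after the reset.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


steady = [(0, 100), (60, 160), (120, 220)]
print(counter_rate(steady))

# The process restarts between t=60 and t=120, so the counter drops to 20.
with_reset = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(counter_rate(with_reset))
```

Handling resets is what lets a rate stay meaningful across process restarts, where a naive last-minus-first difference would go negative.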

PromQL Queries

Requests per second over last 5 minutes

rate(http_requests_total[5m])

99th percentile latency

histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Error rate

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

Top 10 endpoints by request rate

topk(10, sum by (endpoint) (rate(http_requests_total[5m])))

Practice Exercises

  1. Design: How would you implement a label index that supports efficient multi-label queries (e.g., method="GET" AND status="200" AND host=~"web-.*" )?

  2. Scale: If the system ingests 10M data points/second with 15-day retention, estimate the storage needed with 10:1 compression and downsampling to 1-minute resolution after 15 days.

  3. Alerting: Design an alert deduplication and grouping system that sends one notification for related alerts (e.g., 50 servers all reporting high CPU).

  4. Optimization: How would you optimize dashboard queries that render 50 panels, each querying 10 time series over the last 24 hours?

Key Takeaways:

  • Time-series databases use LSM-trees for append-heavy write optimization
  • Gorilla-style encoding (delta-of-delta timestamps, XOR values) achieves roughly 12x compression for monitoring data
  • Downsampling reduces storage by 60x (1s → 1min) for historical data
  • Alert rules with "for" duration prevent false positives from transient spikes
  • Label indexing enables efficient multi-dimensional queries across metrics

