Observability

Observability is the ability to understand a system's internal state from its external outputs. In distributed systems, observability is essential for debugging, performance tuning, and maintaining reliability.

Logs — Discrete events with context
Metrics — Numerical measurements over time
Traces — Request path across service boundaries

You can't fix what you can't see.

What Is Observability?

The three pillars of observability provide complementary views into system behavior.

Definition: Observability

Observability is a measure of how well you can understand the internal state of a system by examining its external outputs. In software, observability is built from three signal types: logs (what happened), metrics (how much/how fast), and traces (where time was spent). Together, they enable debugging and understanding of complex distributed systems.

Monitoring tells you when something is wrong. Observability tells you why. Monitoring uses predefined dashboards and alerts; observability enables ad-hoc exploration of system behavior.

Three Pillars

Pillar	What It Tells You	Example
Logs	What happened at a point in time	"User 123 failed to login at 10:00:01"
Metrics	How the system is performing over time	"Request rate: 1000 QPS, error rate: 0.1%"
Traces	Where time was spent in a request	"API gateway: 5ms, Auth: 10ms, DB: 50ms"

Logging

Structured logging enables efficient querying and analysis.

Definition: Structured Logging

Structured logging records log events in a machine-parseable format (JSON) rather than unstructured text. Each log entry contains consistent fields (timestamp, level, service, trace_id, message) that enable efficient filtering, searching, and correlation across services.

Structured log example:

{
  "timestamp": "2024-01-15T10:00:01.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "user_id": "789",
  "message": "Payment processing failed",
  "error": "Insufficient funds",
  "amount": 29.99,
  "currency": "USD"
}
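
As a minimal sketch of how a service might emit entries shaped like the one above, the following uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the `context` attribute name are hypothetical choices made for illustration, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Keys passed via `extra=` become attributes on the record.
        # `context` is a hypothetical attribute name chosen for this sketch.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service):
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("payment-service")
logger.error(
    "Payment processing failed",
    extra={"context": {"trace_id": "abc123def456", "user_id": "789", "error": "Insufficient funds"}},
)
```

Because every entry carries the same top-level fields, a log backend can filter on `level` or join on `trace_id` without parsing free text.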

Log Levels

Level	When to Use
DEBUG	Detailed diagnostic information
INFO	Normal operation milestones
WARN	Unexpected but recoverable conditions
ERROR	Failures requiring attention
FATAL	System cannot continue running
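
The levels in the table act as a threshold: a logger configured at one level drops everything below it. A small sketch with Python's standard `logging` module illustrates this; note that Python names the top level CRITICAL and the warning level WARNING, and the `emitted_levels` helper is hypothetical.

```python
import logging

# Python's standard logging module uses CRITICAL where the table says FATAL;
# logging.FATAL is an alias for logging.CRITICAL.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def emitted_levels(threshold):
    """Return the level names that pass a logger configured at `threshold`."""
    logger = logging.getLogger(f"levels-demo-{threshold}")
    logger.setLevel(threshold)
    return [name for name in LEVELS if logger.isEnabledFor(getattr(logging, name))]


print(emitted_levels(logging.WARNING))  # ['WARNING', 'ERROR', 'CRITICAL']
```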

Use structured logging in production. Unstructured logs are difficult to query at scale. Tools like Fluentd, Filebeat, and Vector collect and forward logs to centralized systems (Elasticsearch, Loki, CloudWatch).

Metrics

Numerical measurements aggregated over time.

Definition: Metrics

Metrics are numerical measurements collected at regular intervals. They provide a quantitative view of system behavior: request rates, error rates, latencies, resource utilization, and business KPIs. Metrics are stored in time-series databases optimized for aggregation and querying.

Four Golden Signals

Google's SRE team identifies four key signals for monitoring:

Signal	Description	Example
Latency	Time to serve a request	p99 = 200ms
Traffic	Demand on the system	10,000 QPS
Errors	Rate of failed requests	0.1% error rate
Saturation	How full the system is	70% CPU utilization

Error Rate

$$error\_rate = \frac{requests_{5xx}}{requests_{total}} \times 100\%$$

Here,

$requests_{5xx}$ = requests returning 5xx status codes
$requests_{total}$ = total requests in the time window
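
The formula can be checked with a short calculation. This is a sketch that assumes the time window is available as a plain list of HTTP status codes; the `error_rate` function name and the choice to report an empty window as 0% are assumptions.

```python
def error_rate(status_codes):
    """Percentage of requests in the window that returned a 5xx status."""
    total = len(status_codes)
    if total == 0:
        return 0.0  # assumption: an empty window is reported as 0%, not as undefined
    errors_5xx = sum(1 for code in status_codes if 500 <= code <= 599)
    return 100 * errors_5xx / total


# Hypothetical window: 1000 requests, two of which are 5xx. The 404 is a
# client error and does not count toward the 5xx error rate.
window = [200] * 997 + [500, 503, 404]
print(error_rate(window))  # 0.2
```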

RED Method

For request-driven services:

Definition: RED Method

RED (Rate, Errors, Duration) monitors each service:

Rate: Requests per second
Errors: Errors per second
Duration: Latency distribution (percentiles)
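
A rough sketch of computing the three RED values for one service over one time window is shown here. The `Request` record and `red_summary` function are hypothetical, and the percentile uses the simple nearest-rank definition rather than the histogram-based estimates a metrics system would typically produce.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def red_summary(requests, window_seconds):
    """Summarize Rate, Errors, and Duration for one service over one window."""
    durations = sorted(r.duration_ms for r in requests)
    failed = sum(1 for r in requests if r.status >= 500)

    def percentile(p):
        # Nearest rank: the smallest sample with at least p% of samples at or below it.
        if not durations:
            return None
        rank = max(1, -(-p * len(durations) // 100))  # ceil(p * n / 100) in integer math
        return durations[rank - 1]

    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": failed / window_seconds,
        "duration_ms": {"p50": percentile(50), "p95": percentile(95), "p99": percentile(99)},
    }


# Hypothetical 10-second window: 95 fast successes and 5 slow failures.
sample = [Request(10, 200)] * 95 + [Request(400, 500)] * 5
print(red_summary(sample, window_seconds=10))
```

In this sample the rate is 10 requests per second and the error rate is 0.5 per second; p50 and p95 are both 10 ms, while p99 is 400 ms.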

Use percentiles (p50, p95, p99) instead of averages for latency. An average masks outliers: a service with a 10ms average can still have p99 = 500ms, meaning 1% of requests take at least 500ms, 50x the average.
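
A small worked example makes the gap concrete. The latency sample is invented for illustration, and `nearest_rank` is a hypothetical helper implementing the nearest-rank percentile definition.

```python
from statistics import mean

# Hypothetical sample: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [5] * 98 + [255] * 2


def nearest_rank(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


print(mean(latencies_ms))              # 10
print(nearest_rank(latencies_ms, 50))  # 5
print(nearest_rank(latencies_ms, 99))  # 255
```

The mean of 10 ms describes none of the actual requests: most finish in 5 ms, while the slowest 2% take 255 ms, which only the p99 reveals.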

Prometheus

A de facto standard for metrics collection and alerting.

Definition: Prometheus

Prometheus is an open-source monitoring system that collects metrics via pull-based scraping: it periodically fetches metrics over HTTP from an endpoint that each service exposes. It stores data in a time-series database, supports PromQL for querying, and integrates with Alertmanager for alerting. Prometheus is a graduated CNCF project and a de facto standard for cloud-native monitoring.
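
To illustrate what a scrape returns, here is a sketch that hand-renders a counter in the Prometheus text exposition format. In practice a service would normally use an official client library instead; the `render_counter` helper, the metric name, and the label values are hypothetical, and the sketch assumes label values contain no characters that need escaping.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter's current value.
    Assumption: label values contain no quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


requests_total = {
    (("service", "payment"), ("status", "200")): 1027,
    (("service", "payment"), ("status", "500")): 3,
}
print(render_counter("http_requests_total", "Total HTTP requests.", requests_total), end="")
```

Once such a counter is scraped, a PromQL expression like `sum by (service) (rate(http_requests_total[5m]))` would give the per-service request rate over the last five minutes.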

[Figure: Prometheus architecture]

Distributed Tracing

Tracking requests across service boundaries.

Definition: Distributed Tracing

Distributed tracing tracks a request as it flows through multiple services. Each service creates a span (unit of work) with timing information. Spans are linked by a trace ID, forming a tree that shows where time was spent. This enables identifying bottlenecks in microservice architectures.
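
The span tree can be sketched with a small data model. The `Span` fields and the `render_trace` function are hypothetical and kept far simpler than what real tracing backends store; the timings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def render_trace(spans):
    """Return an indented outline of one trace, children listed under their parent."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{span.name}: {span.duration_ms}ms")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical trace: the gateway span covers the whole request.
spans = [
    Span("t1", "a", None, "api-gateway", 0, 65),
    Span("t1", "c", "a", "db", 15, 65),
    Span("t1", "b", "a", "auth", 5, 15),
]
print("\n".join(render_trace(spans)))
```

The outline lists `api-gateway: 65ms` with `auth: 10ms` and `db: 50ms` indented beneath it, which makes the database call stand out as the largest share of the request time.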

Tracing Concepts

Concept	Description
Trace	Complete journey of a request through the system
Span	A single unit of work within a trace
Trace ID	Unique identifier linking all spans in a trace
Parent Span ID	Reference to the span that initiated this span
Context Propagation	Passing trace context between services
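
Context propagation is commonly done with an HTTP header. The sketch below follows the W3C Trace Context `traceparent` layout (version, 32-hex-digit trace ID, 16-hex-digit span ID, flags); the function names and the dictionary used to represent a span are hypothetical, and the sketch does no validation of incoming headers.

```python
import secrets


def new_traceparent():
    """Start a new trace: random 16-byte trace ID, 8-byte span ID, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def start_child_span(incoming_header):
    """Continue the caller's trace: keep its trace ID, record its span as the parent."""
    version, trace_id, caller_span_id, flags = incoming_header.split("-")
    span = {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_span_id": caller_span_id,
    }
    # Header to attach to this service's own outgoing calls.
    outgoing_header = f"{version}-{trace_id}-{span['span_id']}-{flags}"
    return span, outgoing_header


incoming = new_traceparent()
span, outgoing = start_child_span(incoming)
print(incoming)
print(outgoing)
```

Both headers share the same trace ID, while the span ID changes at each hop; that shared ID is what lets a tracing backend stitch spans from different services into one trace.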

Use Jaeger, Zipkin, or AWS X-Ray for distributed tracing. Instrument services with OpenTelemetry, which provides vendor-neutral APIs for traces, metrics, and logs.

OpenTelemetry

Definition: OpenTelemetry

OpenTelemetry (OTel) is an open-source framework for collecting traces, metrics, and logs. It provides SDKs for multiple languages, a collector for processing telemetry, and exporters for multiple backends (Prometheus, Jaeger, Datadog). OTel is a CNCF project and a widely adopted standard for vendor-neutral observability instrumentation.

Alerting

Definition: Alert Fatigue

Alert fatigue occurs when teams receive too many alerts, leading to desensitization and missed critical issues. Effective alerting focuses on symptoms (error rate, latency) rather than causes (CPU, memory). Each alert should be actionable and have a clear runbook.

Alert on symptoms, not causes. "Error rate > 1%" is actionable because it reflects user impact. "CPU > 80%" is a possible cause rather than a symptom, and it often occurs with no user impact at all. Use multi-window alerting to reduce false positives: require the condition to hold over both a short and a long time window before firing.
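
Multi-window evaluation can be sketched as follows. The `should_fire` function, the per-minute sampling, and the window sizes are assumptions made for illustration; averaging per-minute rates is also a simplification of computing the true error ratio over each window.

```python
def should_fire(error_rates_pct, threshold_pct, short_window, long_window):
    """Fire only if the mean error rate exceeds the threshold in both windows.

    `error_rates_pct` holds one error-rate sample per minute, oldest first.
    """
    if len(error_rates_pct) < long_window:
        return False  # assumption: too little history means no alert
    short = error_rates_pct[-short_window:]
    long = error_rates_pct[-long_window:]
    short_avg = sum(short) / len(short)
    long_avg = sum(long) / len(long)
    return short_avg > threshold_pct and long_avg > threshold_pct


brief_spike = [0.1] * 59 + [5.0]       # one bad minute
sustained = [0.1] * 30 + [3.0] * 30    # thirty bad minutes, still ongoing
recovered = [3.0] * 55 + [0.1] * 5     # long incident that has already ended

for history in (brief_spike, sustained, recovered):
    print(should_fire(history, threshold_pct=1.0, short_window=5, long_window=60))
```

Only the sustained case fires. The long window filters out the brief spike, and the short window stops the alert soon after recovery even though the hour-long average is still high.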

Practice Exercises

Design: Design an observability stack for a microservices application with 20 services. Include logging, metrics, tracing, and alerting strategies.
Metrics: Write PromQL queries for: request rate by service, p99 latency, error rate, and saturation (CPU utilization).
Tracing: Draw a trace diagram for an e-commerce checkout: API Gateway → Order Service → Payment Service → Inventory Service. Where would you expect bottlenecks?
Alerting: Design alerting rules for an API gateway. Include rules for error rate, latency, and saturation with appropriate thresholds and durations.

Key Takeaways:

Observability is built from three pillars: logs, metrics, and traces
Structured logging enables efficient querying and correlation
Monitor the four golden signals: latency, traffic, errors, saturation
Prometheus is a de facto standard for metrics and is commonly paired with Grafana for visualization
Distributed tracing tracks requests across service boundaries
OpenTelemetry provides vendor-neutral instrumentation for all three pillars
Alert on symptoms, not causes, to reduce alert fatigue

What to Learn Next

-> Service Mesh: Envoy, Istio, and sidecar proxy patterns.

-> CI/CD Pipelines: Continuous integration and deployment strategies.

-> Containerization: Docker, Kubernetes, and pod scheduling.

-> Security Patterns: Authentication, authorization, encryption, and mTLS.

-> Cost Optimization: Cloud cost management and right-sizing.

-> Load Balancing: Distribution algorithms and health checks.
