Observability
Observability is the ability to understand a system's internal state from its external outputs. In distributed systems, observability is essential for debugging, performance tuning, and maintaining reliability.
- Logs: Discrete events with context
- Metrics: Numerical measurements over time
- Traces: Request path across service boundaries
You can't fix what you can't see.
What Is Observability?
The three pillars of observability provide complementary views into system behavior.
Definition: Observability
Observability is a measure of how well you can understand the internal state of a system by examining its external outputs. In software, observability is built from three signal types: logs (what happened), metrics (how much/how fast), and traces (where time was spent). Together, they enable debugging and understanding of complex distributed systems.
Monitoring tells you when something is wrong. Observability tells you why. Monitoring uses predefined dashboards and alerts; observability enables ad-hoc exploration of system behavior.
Three Pillars
| Pillar | What It Tells You | Example |
|---|---|---|
| Logs | What happened at a point in time | "User 123 failed to login at 10:00:01" |
| Metrics | How the system is performing over time | "Request rate: 1000 QPS, error rate: 0.1%" |
| Traces | Where time was spent in a request | "API gateway: 5ms, Auth: 10ms, DB: 50ms" |
Logging
Structured logging enables efficient querying and analysis.
Definition: Structured Logging
Structured logging records log events in a machine-parseable format (JSON) rather than unstructured text. Each log entry contains consistent fields (timestamp, level, service, trace_id, message) that enable efficient filtering, searching, and correlation across services.
Structured log example:
{
  "timestamp": "2024-01-15T10:00:01.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "user_id": "789",
  "message": "Payment processing failed",
  "error": "Insufficient funds",
  "amount": 29.99,
  "currency": "USD"
}
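One way to produce entries shaped like the example above is a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a reference implementation: the `JsonFormatter` class, the `build_logger` helper, and the convention of passing per-event fields through `extra={"fields": {...}}` are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


logger = build_logger("payment-service")
logger.error(
    "Payment processing failed",
    extra={"fields": {
        "trace_id": "abc123def456",
        "user_id": "789",
        "error": "Insufficient funds",
        "amount": 29.99,
        "currency": "USD",
    }},
)
```

Because every entry carries the same top-level keys, a log backend can filter on `level` or join on `trace_id` without parsing free text.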
Log Levels
| Level | When to Use |
|---|---|
| DEBUG | Detailed diagnostic information |
| INFO | Normal operation milestones |
| WARN | Unexpected but recoverable conditions |
| ERROR | Failures requiring attention |
| FATAL | System cannot continue running |
Use structured logging in production. Unstructured logs are difficult to query at scale. Tools like Fluentd, Filebeat, and Vector collect and forward logs to centralized systems (Elasticsearch, Loki, CloudWatch).
Metrics
Numerical measurements aggregated over time.
Definition: Metrics
Metrics are numerical measurements collected at regular intervals. They provide a quantitative view of system behavior: request rates, error rates, latencies, resource utilization, and business KPIs. Metrics are stored in time-series databases optimized for aggregation and querying.
Four Golden Signals
Google's SRE team identifies four key signals for monitoring:
| Signal | Description | Example |
|---|---|---|
| Latency | Time to serve a request | p99 = 200ms |
| Traffic | Demand on the system | 10,000 QPS |
| Errors | Rate of failed requests | 0.1% error rate |
| Saturation | How full the system is | 70% CPU utilization |
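Error Rate
$$\text{Error Rate} = \frac{N_{\text{5xx}}}{N_{\text{total}}}$$
Here,
- $N_{\text{5xx}}$ = Requests returning 5xx status codes
- $N_{\text{total}}$ = Total requests in the time window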
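The error rate described here (5xx responses divided by total requests in the window) can be computed directly from a list of status codes. The sketch below is illustrative: the function name and the choice to treat an empty window as a 0% error rate are assumptions.

```python
def error_rate(status_codes: list[int]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    total = len(status_codes)
    if total == 0:
        return 0.0  # assumption: an empty window counts as no errors
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / total


# 1000 requests in the window: two 5xx responses; the 404 is not counted.
window = [200] * 997 + [500, 503, 404]
print(f"{error_rate(window):.2%}")  # prints 0.20%
```

Client errors (4xx) are excluded here because the definition counts only 5xx responses; some teams choose to track 4xx separately.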
RED Method
For request-driven services:
Definition: RED Method
RED (Rate, Errors, Duration) monitors each service:
- Rate: Requests per second
- Errors: Errors per second
- Duration: Latency distribution (percentiles)
Use percentiles (p50, p95, p99) instead of averages for latency. The average masks outliers: a service with a 10ms average could have p99 = 500ms, meaning 1% of requests take 500ms or longer, at least 50x the average.
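To see how an average hides tail latency, the following sketch compares the mean with nearest-rank percentiles on a made-up sample. The latency values and the `percentile` helper are assumptions for illustration; production systems usually estimate percentiles from histograms rather than raw samples.

```python
import math
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 98 fast requests and 2 very slow ones.
latencies_ms = [5.0] * 98 + [500.0] * 2

print(f"mean = {mean(latencies_ms):.1f} ms")       # mean = 14.9 ms
print(f"p50  = {percentile(latencies_ms, 50)} ms")  # p50  = 5.0 ms
print(f"p99  = {percentile(latencies_ms, 99)} ms")  # p99  = 500.0 ms
```

The mean suggests a healthy service, while p99 shows that the slowest requests are two orders of magnitude worse than the median.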
Prometheus
The standard for metrics collection and alerting.
Definition: Prometheus
Prometheus is an open-source monitoring system that collects metrics via pull-based scraping. It stores data in a time-series database, supports PromQL for querying, and integrates with Alertmanager for alerting. Prometheus is a CNCF graduated project and the de facto standard for cloud-native monitoring.
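Pull-based scraping means each service exposes its current metric values over HTTP and Prometheus fetches them on a schedule. The sketch below serves counters at `/metrics` in the Prometheus text exposition format using only the Python standard library. The metric name, label values, port, and helper names are assumptions, and label values are not escaped here; a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (metric name, label pairs).
REQUESTS = {
    ("http_requests_total", (("service", "payment"), ("status", "200"))): 1027,
    ("http_requests_total", (("service", "payment"), ("status", "500"))): 3,
}


def render_metrics(counters) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")
            seen.add(name)
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer scrape requests on /metrics; everything else is a 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve /metrics so a scraper can pull from it."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(REQUESTS), end="")
```

Calling `serve()` would start the endpoint; the final line only prints what a scrape would return, so the example does not block.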
[Figure: Prometheus architecture]
Distributed Tracing
Tracking requests across service boundaries.
Definition: Distributed Tracing
Distributed tracing tracks a request as it flows through multiple services. Each service creates a span (unit of work) with timing information. Spans are linked by a trace ID, forming a tree that shows where time was spent. This enables identifying bottlenecks in microservice architectures.
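The span tree can be modeled with a small data structure. In this sketch, the `Span` fields, the IDs, and the timings are hypothetical, and the self-time calculation assumes that a span's direct children do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: list) -> float:
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_span(spans: list) -> Span:
    """Span with the largest self time: a first guess at the bottleneck."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# One trace: the gateway calls auth, then the database.
spans = [
    Span("abc123", "s1", None, "api-gateway", 0, 65),
    Span("abc123", "s2", "s1", "auth", 5, 15),
    Span("abc123", "s3", "s1", "db", 15, 65),
]

for s in spans:
    print(f"{s.name}: {self_time_ms(s, spans)} ms")
print("bottleneck:", slowest_span(spans).name)  # bottleneck: db
```

Subtracting child durations matters: the gateway span lasts 65 ms in total, but only 5 ms of that is its own work.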
Tracing Concepts
| Concept | Description |
|---|---|
| Trace | Complete journey of a request through the system |
| Span | A single unit of work within a trace |
| Trace ID | Unique identifier linking all spans in a trace |
| Parent Span ID | Reference to the span that initiated this span |
| Context Propagation | Passing trace context between services |
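Context propagation is commonly done with an HTTP header. The sketch below uses the W3C Trace Context `traceparent` header layout (`version-traceid-parentid-flags`) to pass a trace ID and parent span ID from one service to the next. The helper names are assumptions, and validation is simplified (for example, all-zero IDs are not rejected); in practice an instrumentation library such as OpenTelemetry handles this.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Return (trace_id, parent_span_id), or None if the header is missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    return match.group(1), match.group(2)


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)  # 32 hex characters
span_a = new_span_id()
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B continues the same trace with a new child span.
incoming_trace_id, parent_span_id = extract(outgoing)
span_b = new_span_id()
print(f"trace={incoming_trace_id} parent={parent_span_id} span={span_b}")
```

Service B keeps the trace ID unchanged and records service A's span ID as its parent, which is what lets a tracing backend reassemble the tree.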
Use Jaeger, Zipkin, or AWS X-Ray for distributed tracing. Instrument services with OpenTelemetry, which provides vendor-neutral APIs for traces, metrics, and logs.
OpenTelemetry
Definition: OpenTelemetry
OpenTelemetry (OTel) is an open-source framework for collecting traces, metrics, and logs. It provides SDKs for multiple languages, a collector for processing telemetry, and exporters for multiple backends (Prometheus, Jaeger, Datadog). OTel is a CNCF project and is widely adopted as the vendor-neutral standard for observability instrumentation.
Alerting
Definition: Alert Fatigue
Alert fatigue occurs when teams receive too many alerts, leading to desensitization and missed critical issues. Effective alerting focuses on symptoms (error rate, latency) rather than causes (CPU, memory). Each alert should be actionable and have a clear runbook.
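Alert on symptoms, not causes. "Error rate > 1%" is actionable because it reflects what users experience. "CPU > 80%" describes a possible cause that may never affect users, so on its own it is often not actionable. Use multi-window alerting to reduce false positives: require a condition to hold across two time windows (for example, a long one and a short one) before firing.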
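A multi-window check can be sketched as follows: the alert fires only when the error rate exceeds the threshold over both a long window and a short window. The 1% threshold, the 60-minute and 5-minute windows, and the per-minute `(total, failed)` sample format are assumptions chosen for illustration.

```python
def window_error_rate(samples):
    """samples: list of (total_requests, failed_requests), one tuple per minute."""
    total = sum(t for t, _ in samples)
    failed = sum(f for _, f in samples)
    return failed / total if total else 0.0


def should_fire(per_minute, threshold=0.01, long_window=60, short_window=5):
    """Fire only if the error rate exceeds the threshold in BOTH windows."""
    long_rate = window_error_rate(per_minute[-long_window:])
    short_rate = window_error_rate(per_minute[-short_window:])
    return long_rate > threshold and short_rate > threshold


# Sustained 2% errors for an hour: both windows agree, so the alert fires.
sustained = [(1000, 20)] * 60

# A large spike 50 minutes ago that has since recovered: the long window
# is still elevated, but the short window is clean, so no alert.
recovered = [(1000, 0)] * 10 + [(1000, 900)] + [(1000, 0)] * 49

# A one-minute blip just now: the short window is elevated, but the long
# window is not, so no alert.
blip = [(1000, 0)] * 59 + [(1000, 100)]

print(should_fire(sustained))  # True
print(should_fire(recovered))  # False
print(should_fire(blip))       # False
```

The long window shows the problem is significant, and the short window confirms it is still happening, which filters out both stale incidents and momentary blips.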
Practice Exercises
- Design: Design an observability stack for a microservices application with 20 services. Include logging, metrics, tracing, and alerting strategies.
- Metrics: Write PromQL queries for: request rate by service, p99 latency, error rate, and saturation (CPU utilization).
- Tracing: Draw a trace diagram for an e-commerce checkout: API Gateway → Order Service → Payment Service → Inventory Service. Where would you expect bottlenecks?
- Alerting: Design alerting rules for an API gateway. Include rules for error rate, latency, and saturation with appropriate thresholds and durations.
Key Takeaways:
- Observability is built from three pillars: logs, metrics, and traces
- Structured logging enables efficient querying and correlation
- Monitor the four golden signals: latency, traffic, errors, saturation
- Prometheus is the standard for metrics; Grafana for visualization
- Distributed tracing tracks requests across service boundaries
- OpenTelemetry provides vendor-neutral instrumentation for all three pillars
- Alert on symptoms, not causes, to reduce alert fatigue
What to Learn Next
-> Service Mesh: Envoy, Istio, and sidecar proxy patterns.
-> CI/CD Pipelines: Continuous integration and deployment strategies.
-> Containerization: Docker, Kubernetes, and pod scheduling.
-> Security Patterns: Authentication, authorization, encryption, and mTLS.
-> Cost Optimization: Cloud cost management and right-sizing.
-> Load Balancing: Distribution algorithms and health checks.