
Observability


Observability is the ability to understand a system's internal state from its external outputs. In distributed systems, observability is essential for debugging, performance tuning, and maintaining reliability.

  • Logs: Discrete events with context
  • Metrics: Numerical measurements over time
  • Traces: Request path across service boundaries

You can't fix what you can't see.

What Is Observability?

The three pillars of observability provide complementary views into system behavior.

Definition: Observability

Observability is a measure of how well you can understand the internal state of a system by examining its external outputs. In software, observability is built from three signal types: logs (what happened), metrics (how much/how fast), and traces (where time was spent). Together, they enable debugging and understanding of complex distributed systems.

Monitoring tells you when something is wrong. Observability tells you why. Monitoring uses predefined dashboards and alerts; observability enables ad-hoc exploration of system behavior.

Three Pillars

| Pillar | What It Tells You | Example |
| --- | --- | --- |
| Logs | What happened at a point in time | "User 123 failed to login at 10:00:01" |
| Metrics | How the system is performing over time | "Request rate: 1000 QPS, error rate: 0.1%" |
| Traces | Where time was spent in a request | "API gateway: 5ms, Auth: 10ms, DB: 50ms" |

Logging

Structured logging enables efficient querying and analysis.

Definition: Structured Logging

Structured logging records log events in a machine-parseable format such as JSON rather than as unstructured text. Each log entry contains consistent fields (timestamp, level, service, trace_id, message) that enable efficient filtering, searching, and correlation across services.

Structured log example:

{
  "timestamp": "2024-01-15T10:00:01.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "user_id": "789",
  "message": "Payment processing failed",
  "error": "Insufficient funds",
  "amount": 29.99,
  "currency": "USD"
}
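
As a rough illustration of how a service might emit entries like the one above, here is a minimal sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields through `extra={"fields": {...}}` are assumptions made for this example, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers attach structured fields
        # with extra={"fields": {...}}; they are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payment-service")
    log.error(
        "Payment processing failed",
        extra={"fields": {"trace_id": "abc123def456", "error": "Insufficient funds"}},
    )
```

Because every entry carries the same keys, a log backend can filter on `level` or join on `trace_id` without parsing free text.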

Log Levels

| Level | When to Use |
| --- | --- |
| DEBUG | Detailed diagnostic information |
| INFO | Normal operation milestones |
| WARN | Unexpected but recoverable conditions |
| ERROR | Failures requiring attention |
| FATAL | System cannot continue running |

Use structured logging in production. Unstructured logs are difficult to query at scale. Tools like Fluentd, Filebeat, and Vector collect and forward logs to centralized systems (Elasticsearch, Loki, CloudWatch).

Metrics

Numerical measurements aggregated over time.

Definition: Metrics

Metrics are numerical measurements collected at regular intervals. They provide a quantitative view of system behavior: request rates, error rates, latencies, resource utilization, and business KPIs. Metrics are stored in time-series databases optimized for aggregation and querying.

Four Golden Signals

Google's SRE team identifies four key signals for monitoring:

| Signal | Description | Example |
| --- | --- | --- |
| Latency | Time to serve a request | p99 = 200ms |
| Traffic | Demand on the system | 10,000 QPS |
| Errors | Rate of failed requests | 0.1% error rate |
| Saturation | How full the system is | 70% CPU utilization |

Error Rate

$$error\_rate = \frac{requests_{5xx}}{requests_{total}} \times 100\%$$

Here,

  • $requests_{5xx}$ = Requests returning 5xx status codes
  • $requests_{total}$ = Total requests in the time window
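
The error rate formula translates directly into code. The sketch below is illustrative only: the function name and the choice to report 0% for an empty window are assumptions, and the request counts are made up.

```python
def error_rate(requests_5xx: int, requests_total: int) -> float:
    """Return the percentage of requests in the window that returned 5xx."""
    if requests_total == 0:
        # Assumption: a window with no traffic is reported as 0% errors.
        return 0.0
    return requests_5xx / requests_total * 100.0


if __name__ == "__main__":
    # Made-up window: 10 server errors out of 10,000 requests.
    print(f"{error_rate(10, 10_000):.2f}%")
```

With these numbers the result is the 0.1% error rate used in the golden signals table.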

RED Method

For request-driven services:

Definition: RED Method

RED (Rate, Errors, Duration) monitors each service:

  • Rate: Requests per second
  • Errors: Errors per second
  • Duration: Latency distribution (percentiles)

Use percentiles (p50, p95, p99) instead of averages for latency. The average masks outliers: a service with a 10ms average could have p99 = 500ms, meaning 1% of requests take at least 500ms, 50x the average.
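
To see how an average can hide tail latency, here is a small sketch that compares the mean with nearest-rank percentiles. The `percentile` helper is written for this example (real metrics systems usually estimate percentiles from histograms), and the latency samples are made up.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [5.0] * 98 + [500.0] * 2

if __name__ == "__main__":
    print("mean:", mean(latencies_ms))
    for p in (50, 95, 99):
        print(f"p{p}:", percentile(latencies_ms, p))
```

In this sample the mean stays under 15ms while p99 is 500ms, so a dashboard that shows only the average would miss the slow requests entirely.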

Prometheus

The standard for metrics collection and alerting.

Definition: Prometheus

Prometheus is an open-source monitoring system that collects metrics via pull-based scraping: it periodically fetches metrics over HTTP from each target, typically from a /metrics endpoint. It stores data in a time-series database, supports PromQL for querying, and integrates with Alertmanager for alerting. Prometheus is a CNCF graduated project and the de facto standard for cloud-native monitoring.
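
Pull-based scraping means each target has to expose its current metric values as text that Prometheus can fetch. The sketch below renders a counter in the Prometheus text exposition format using plain Python. The `render_counter` function and the sample metric are hypothetical; a real service would normally use an official client library rather than hand-rolling this.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str,
                   samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one counter metric as exposition-format text.

    `samples` is a list of (labels, value) pairs, one per time series.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up metric: this is what a scrape of /metrics might return.
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests.",
        [({"method": "GET", "code": "200"}, 1027),
         ({"method": "POST", "code": "500"}, 3)],
    ))
```

Each scrape reads the current counter values; rates such as requests per second are computed later at query time, not by the application.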

Prometheus Architecture

[Figure: Prometheus Monitoring Stack. Targets (App /metrics, Node Exporter, Kube State, Custom Exporter) are scraped by Prometheus (TSDB, PromQL Engine), which is shown alongside Service Discovery, Alertmanager (alerts), and Grafana.]

Distributed Tracing

Tracking requests across service boundaries.

Definition: Distributed Tracing

Distributed tracing tracks a request as it flows through multiple services. Each service creates a span (unit of work) with timing information. Spans are linked by a trace ID, forming a tree that shows where time was spent. This enables identifying bottlenecks in microservice architectures.

Tracing Concepts

| Concept | Description |
| --- | --- |
| Trace | Complete journey of a request through the system |
| Span | A single unit of work within a trace |
| Trace ID | Unique identifier linking all spans in a trace |
| Parent Span ID | Reference to the span that initiated this span |
| Context Propagation | Passing trace context between services |

Use Jaeger, Zipkin, or AWS X-Ray for distributed tracing. Instrument services with OpenTelemetry, which provides vendor-neutral APIs for traces, metrics, and logs.
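
The tracing concepts can be sketched with a small data model. This is a toy illustration, not the OpenTelemetry API: the `Span` class and helper functions are hypothetical, the durations are made up, and `self_time_ms` assumes child spans run one after another. The `traceparent` string follows the W3C Trace Context header layout (version, trace ID, parent span ID, flags).

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    name: str
    trace_id: str                  # shared by every span in the trace
    span_id: str                   # unique per span
    parent_span_id: Optional[str]  # None for the root span
    duration_ms: float


def start_trace(name: str, duration_ms: float) -> Span:
    """Create a root span with a fresh trace ID."""
    return Span(name, secrets.token_hex(16), secrets.token_hex(8), None, duration_ms)


def start_child(parent: Span, name: str, duration_ms: float) -> Span:
    """Create a span that inherits the trace ID and points at its parent."""
    return Span(name, parent.trace_id, secrets.token_hex(8), parent.span_id, duration_ms)


def to_traceparent(span: Span) -> str:
    """Context propagation: the value a service would send downstream
    in a W3C `traceparent` header (flags 01 = sampled)."""
    return f"00-{span.trace_id}-{span.span_id}-01"


def self_time_ms(span: Span, spans: list[Span]) -> float:
    """Span duration minus its direct children (assumes sequential children)."""
    child_total = sum(s.duration_ms for s in spans if s.parent_span_id == span.span_id)
    return span.duration_ms - child_total


if __name__ == "__main__":
    root = start_trace("api-gateway", 65.0)
    auth = start_child(root, "auth", 60.0)
    db = start_child(auth, "db", 50.0)
    trace = [root, auth, db]
    for span in trace:
        print(span.name, self_time_ms(span, trace), "ms self time")
```

With these made-up durations the self times come out to 5ms, 10ms, and 50ms, matching the "API gateway: 5ms, Auth: 10ms, DB: 50ms" breakdown in the pillars table and pointing at the database as the bottleneck.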

OpenTelemetry

Definition: OpenTelemetry

OpenTelemetry (OTel) is an open-source framework for collecting traces, metrics, and logs. It provides SDKs for multiple languages, a collector for processing telemetry, and exporters for multiple backends (Prometheus, Jaeger, Datadog). OTel is a CNCF project and a widely adopted standard for observability instrumentation.

Alerting

Definition: Alert Fatigue

Alert fatigue occurs when teams receive too many alerts, leading to desensitization and missed critical issues. Effective alerting focuses on symptoms (error rate, latency) rather than causes (CPU, memory). Each alert should be actionable and have a clear runbook.

Alert on symptoms, not causes. "Error rate > 1%" is actionable because it describes user-facing impact. "CPU > 80%" is a possible cause, and high CPU may not affect users at all. Use multi-window alerting to reduce false positives: require the condition to hold across two time windows (for example, a long one and a short one) before firing.
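
One way to picture multi-window alerting is a check that fires only when the error rate exceeds the threshold in both a long and a short window. The sketch below is a simplified illustration: the `Window` class, the `should_fire` function, the 1% threshold, and the request counts are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    errors: int  # failed requests observed in the window
    total: int   # all requests observed in the window

    def error_rate(self) -> float:
        """Fraction of failed requests; an empty window counts as 0."""
        return self.errors / self.total if self.total else 0.0


def should_fire(long_window: Window, short_window: Window,
                threshold: float = 0.01) -> bool:
    """Fire only if both windows exceed the threshold.

    The long window filters out brief spikes; the short window stops
    the alert from firing after the problem has already recovered.
    """
    return (long_window.error_rate() > threshold
            and short_window.error_rate() > threshold)


if __name__ == "__main__":
    # Sustained problem: 2% over the long window, 3% over the short one.
    print(should_fire(Window(200, 10_000), Window(30, 1_000)))
    # Brief spike: the short window is bad, the long window is still healthy.
    print(should_fire(Window(5, 10_000), Window(50, 1_000)))
    # Already recovered: the long window is bad, the short window is clean.
    print(should_fire(Window(200, 10_000), Window(0, 1_000)))
```

Only the sustained case fires; the spike and the recovered case are suppressed, which is the false-positive reduction described above.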

Practice Exercises

  1. Design: Design an observability stack for a microservices application with 20 services. Include logging, metrics, tracing, and alerting strategies.

  2. Metrics: Write PromQL queries for: request rate by service, p99 latency, error rate, and saturation (CPU utilization).

  3. Tracing: Draw a trace diagram for an e-commerce checkout: API Gateway → Order Service → Payment Service → Inventory Service. Where would you expect bottlenecks?

  4. Alerting: Design alerting rules for an API gateway. Include rules for error rate, latency, and saturation with appropriate thresholds and durations.

Key Takeaways:

  • Observability is built from three pillars: logs, metrics, and traces
  • Structured logging enables efficient querying and correlation
  • Monitor the four golden signals: latency, traffic, errors, saturation
  • Prometheus is the standard for metrics; Grafana for visualization
  • Distributed tracing tracks requests across service boundaries
  • OpenTelemetry provides vendor-neutral instrumentation for all three pillars
  • Alert on symptoms, not causes, to reduce alert fatigue

What to Learn Next

-> Service Mesh: Envoy, Istio, and sidecar proxy patterns.

-> CI/CD Pipelines: Continuous integration and deployment strategies.

-> Containerization: Docker, Kubernetes, and pod scheduling.

-> Security Patterns: Authentication, authorization, encryption, and mTLS.

-> Cost Optimization: Cloud cost management and right-sizing.

-> Load Balancing: Distribution algorithms and health checks.

⭐

Premium Content

Observability

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert System Design Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement