Observability: Prometheus, Grafana, Jaeger, OpenTelemetry
Difficulty: Senior Level | Companies: Google, Netflix, Uber, Datadog, New Relic
Interview Question
"Design an observability stack for a microservices platform with 100+ services. How do you handle metrics, logs, traces, and alerting at scale?"
Key Concepts
This question tests your understanding of the three pillars of observability: metrics, logs, and traces, and how to implement them at scale.
Complete Observability Architecture
Architecture Overview
OBSERVABILITY ARCHITECTURE

  DATA SOURCES
    Applications | Infrastructure | Databases
          |
          v
  COLLECTION LAYER
    OpenTelemetry Collector
      [Metrics Receiver]  [Logs Receiver]  [Traces Receiver]
          |
          v
  STORAGE LAYER
    [Prometheus (Metrics)]  [Loki (Logs)]  [Jaeger (Traces)]
          |
          v
  VISUALIZATION LAYER
    [Grafana (Dashboards)]  [Grafana Explore (Logs)]  [Jaeger UI (Traces)]
          |
          v
  ALERTING LAYER
    Alertmanager -> PagerDuty | Slack | Email
Mathematical Foundation: SLI/SLO/SLA
Service Level Indicator (SLI):
- Availability SLI: A = (total_time - downtime) / total_time
- Latency SLI: L = requests_within_sla / total_requests
- Error rate SLI: E = failed_requests / total_requests
Service Level Objective (SLO):
- Availability SLO: 99.9% (8.76 hours downtime/year)
- Latency SLO: 99% of requests < 200ms
- Error SLO: < 0.1% error rate
Error Budget:
- Error budget: B = 1 - SLO
- For 99.9% SLO: B = 0.001 = 0.1%
- Monthly error budget (minutes): B_monthly = B × days_in_month × 24 × 60 (43.2 minutes for a 30-day month at 99.9%)
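The error budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `error_budget_minutes` and the 30-day default window are illustrative choices, not part of any library.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    budget_fraction = 1.0 - slo  # B = 1 - SLO
    return budget_fraction * days * 24 * 60


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(slo)
        print(f"{slo:.2%} SLO -> {minutes:.1f} minutes per 30 days")
```

For a 99.9% SLO this gives roughly 43.2 minutes per 30-day window, and about 8.76 hours over 365 days, which matches the yearly figure listed under the availability SLO.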
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path from prometheus.io/path
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
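The `__address__` relabel rule in the kubernetes-pods job is the least obvious part of this config. Prometheus joins the source label values with `;`, matches the fully anchored regex against the result, and writes the replacement into the target label; if the regex does not match, the label is left unchanged. The sketch below imitates that behavior with Python's `re` module. The function name `relabel_address` is hypothetical, and the sketch ignores IPv6 addresses.

```python
import re

# Same pattern as the relabel rule: host, optional existing port, ';', annotated port
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Imitate the replace action that rewrites __address__ from the port annotation."""
    # Prometheus concatenates the source labels with ';' before matching
    joined = f"{address};{port_annotation}"
    # Relabel regexes are anchored at both ends, hence fullmatch
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match means no replacement: the original address is kept
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(relabel_address("10.0.0.5:8080", "9102"))
    print(relabel_address("10.0.0.5", "9102"))
    print(relabel_address("10.0.0.5:8080", ""))
```

The first two calls yield an address on the annotated port; the third has no port annotation, so the discovered address is kept as is.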
# Alert rules (alert_rules.yml)
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 1% for 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is above 1 second"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} is restarting frequently"

      - alert: HighMemoryUsage
        # Containers without a memory limit report a limit of 0; filter them
        # out so the ratio does not become +Inf and fire spuriously
        expr: |
          (container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Container {{ $labels.container }} is using > 90% memory"

  - name: slo_alerts
    rules:
      - alert: SLOBreach
        # A 30d range is expensive to evaluate on every cycle; a recording
        # rule is usually a better fit for long windows
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[30d])) /
            sum(rate(http_requests_total[30d]))
          ) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO breach detected"
          description: "Availability SLO is below 99.9%"
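The SLOBreach rule fires only once the 30-day availability has already dropped below target, which means the error budget is spent by the time anyone is paged. One common complement is to look at the burn rate: how fast the budget is being consumed relative to the rate that would use it up exactly at the end of the window. The sketch below shows the arithmetic only; the function names are hypothetical and the 30-day window is assumed from the rule above.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO).

    A burn rate of 1.0 spends the budget exactly over the full SLO window.
    """
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("slo must be below 1.0 to leave an error budget")
    return error_ratio / budget


def hours_to_exhaustion(error_ratio: float, slo: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo)
    if rate == 0.0:
        return float("inf")
    return window_days * 24 / rate


if __name__ == "__main__":
    # 1% of requests failing against a 99.9% SLO
    print(f"burn rate: {burn_rate(0.01, 0.999):.1f}")
    print(f"hours left: {hours_to_exhaustion(0.01, 0.999):.0f}")
```

With 1% of requests failing against a 99.9% SLO, the burn rate is about 10, so a full 30-day budget would last roughly 72 hours. Alerting on a high burn rate over a short window catches this well before the 30-day ratio crosses the threshold.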
Grafana Dashboard Configuration
{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "yaxes": [
          {
            "label": "Requests/sec",
            "min": 0
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
            "legendFormat": "{{service}}"
          }
        ],
        "yaxes": [
          {
            "label": "Error %",
            "min": 0,
            "max": 100
          }
        ]
      },
      {
        "title": "Latency Distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)",
            "legendFormat": "{{service}} - {{le}}s",
            "format": "heatmap"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(active_connections) by (service)",
            "legendFormat": "{{service}}"
          }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)"
        }
      ]
    }
  }
}
Jaeger Distributed Tracing
# OpenTelemetry instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
# Note: the Jaeger exporter packages are deprecated in recent OpenTelemetry
# Python releases. Jaeger also accepts OTLP, so newer setups typically use
# an OTLP exporter instead.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import time
from typing import Dict, Any, Optional
from functools import wraps

# Initialize tracer
resource = Resource(attributes={
    SERVICE_NAME: "order-service"
})
provider = TracerProvider(resource=resource)
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)
# Batch spans before export to keep overhead off the request path
processor = BatchSpanProcessor(jaeger_exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
class TracingMiddleware:
    """Custom ASGI tracing middleware"""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        with tracer.start_as_current_span(
            f"{scope['method']} {scope['path']}"
        ) as span:
            # Add request attributes (the path is the request target, not a full URL)
            span.set_attribute("http.method", scope["method"])
            span.set_attribute("http.target", scope["path"])
            span.set_attribute("http.scheme", scope.get("scheme", "http"))

            # Track the response size without buffering the body in memory
            response_size = 0

            async def send_wrapper(message):
                nonlocal response_size
                if message["type"] == "http.response.start":
                    span.set_attribute("http.status_code", message["status"])
                elif message["type"] == "http.response.body":
                    response_size += len(message.get("body", b""))
                # Forward every message; without this the client never
                # receives a response
                await send(message)

            await self.app(scope, receive, send_wrapper)

            # Set response body size
            span.set_attribute("http.response_content_length", response_size)
def trace_function(name: Optional[str] = None):
    """Decorator for tracing async functions"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            span_name = name or func.__name__
            with tracer.start_as_current_span(span_name) as span:
                try:
                    result = await func(*args, **kwargs)
                    span.set_status(trace.Status(trace.StatusCode.OK))
                    return result
                except Exception as e:
                    # Mark the span as failed and attach the exception details
                    span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                    span.record_exception(e)
                    raise
        return wrapper
    return decorator
class DistributedTracer:
    """Distributed tracing across services"""

    def __init__(self):
        self.tracer = trace.get_tracer(__name__)

    def trace_http_request(self, method: str, url: str,
                           headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
        """Return a copy of the headers with the current trace context injected"""
        from opentelemetry import propagate

        # Copy so the caller's dict is not mutated
        carrier = dict(headers or {})
        # Uses the globally configured propagator (W3C Trace Context by default)
        propagate.inject(carrier)
        return carrier

    def extract_trace_context(self, headers: Dict[str, str]):
        """Extract trace context from HTTP headers"""
        from opentelemetry import propagate

        return propagate.extract(carrier=headers)
# Example: Tracing across services
import asyncio

distributed_tracer = DistributedTracer()

async def call_service_a():
    with tracer.start_as_current_span("call-service-a") as span:
        # Add custom attributes
        span.set_attribute("service.name", "service-a")
        span.set_attribute("operation.type", "http")

        # Build headers carrying the trace context for the outgoing call
        headers = distributed_tracer.trace_http_request(
            "GET",
            "http://service-b/api/data"
        )

        # Record events
        span.add_event("Making HTTP request", {
            "http.url": "http://service-b/api/data"
        })

        # Simulate the call without blocking the event loop; a real client
        # would send `headers` along with the request
        await asyncio.sleep(0.1)

        span.add_event("HTTP request completed")
        return {"status": "success"}
Tracing Best Practices
Use distributed tracing to track requests across service boundaries. Always propagate trace context in HTTP headers. Add custom attributes for debugging.
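To make "propagate trace context in HTTP headers" concrete, the sketch below builds and parses a W3C Trace Context `traceparent` header by hand, using only the standard library. In practice the OpenTelemetry propagator does this for you; the helper names (`build_traceparent`, `parse_traceparent`, `TraceParent`) are hypothetical, and the sketch handles version `00` headers only.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version(2 hex) - trace-id(32 hex) - parent span id(16 hex) - flags(2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def build_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Create a header for an outgoing call, reusing trace_id if one is given."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # new span for this hop
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid according to the specification
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    incoming = build_traceparent()
    parent = parse_traceparent(incoming)
    # The downstream call keeps the trace ID but gets a fresh span ID
    outgoing = build_traceparent(parent.trace_id, parent.sampled)
    print(incoming)
    print(outgoing)
```

The key property is that every hop keeps the same trace ID while generating its own span ID, which is what lets Jaeger stitch spans from different services into one trace.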
Log Aggregation with Loki
# Structured logging with Loki integration
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, Optional

import requests
from opentelemetry import trace


class LokiHandler(logging.Handler):
    """Custom logging handler that pushes each record to Loki"""

    def __init__(self, loki_url: str, labels: Optional[Dict[str, str]] = None):
        super().__init__()
        self.loki_url = loki_url
        self.labels = labels or {}

    def emit(self, record):
        log_entry = {
            "streams": [
                {
                    # The /loki/api/v1/push endpoint expects the label set under "stream"
                    "stream": {
                        **self.labels,
                        "level": record.levelname.lower(),
                        "logger": record.name
                    },
                    "values": [
                        [
                            # Nanosecond Unix timestamp, sent as a string
                            str(int(record.created * 1e9)),
                            self.format(record)
                        ]
                    ]
                }
            ]
        }
        try:
            requests.post(
                f"{self.loki_url}/loki/api/v1/push",
                json=log_entry,
                timeout=5
            )
        except Exception:
            # Never let a logging failure crash the application
            self.handleError(record)


class StructuredLogger:
    """Structured logger with context"""

    def __init__(self, service_name: str, loki_url: Optional[str] = None):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        # The default effective level is WARNING, which would drop info() calls
        self.logger.setLevel(logging.INFO)
        if loki_url:
            handler = LokiHandler(
                loki_url,
                labels={"service": service_name}
            )
            self.logger.addHandler(handler)

    def log(self, level: str, message: str, context: Optional[Dict[str, Any]] = None):
        """Log with context"""
        # Use the active OpenTelemetry trace ID (not a random UUID) so that
        # log lines can be correlated with traces
        span_context = trace.get_current_span().get_span_context()
        trace_id = format(span_context.trace_id, "032x") if span_context.is_valid else None
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service": self.service_name,
            "level": level,
            "message": message,
            "context": context or {},
            "trace_id": trace_id
        }
        getattr(self.logger, level.lower())(json.dumps(log_data))

    def info(self, message: str, context: Optional[Dict[str, Any]] = None):
        self.log("INFO", message, context)

    def error(self, message: str, context: Optional[Dict[str, Any]] = None):
        self.log("ERROR", message, context)

    def warning(self, message: str, context: Optional[Dict[str, Any]] = None):
        self.log("WARNING", message, context)


# Example usage
logger = StructuredLogger("order-service", "http://loki:3100")
logger.info("Order created", {
    "order_id": "123",
    "user_id": "user-456",
    "amount": 99.99
})
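The handler above issues one HTTP request per log record, which becomes expensive with 100+ services. A minimal sketch of the alternative is to buffer records and send them in one push request, grouped by label set. The function name `build_push_payload` and the `LogEntry` tuple shape are assumptions made for illustration; only the payload layout (`streams`, `stream`, `values`) follows the Loki push API.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# (labels, timestamp in nanoseconds, log line)
LogEntry = Tuple[Dict[str, str], int, str]


def build_push_payload(entries: List[LogEntry]) -> dict:
    """Group buffered entries into one stream per distinct label set."""
    grouped = defaultdict(list)
    for labels, ts_ns, line in entries:
        # Sort the items so identical label sets map to the same key
        key = tuple(sorted(labels.items()))
        grouped[key].append((ts_ns, line))

    streams = []
    for key, values in grouped.items():
        values.sort(key=lambda item: item[0])  # oldest first within a stream
        streams.append({
            "stream": dict(key),
            # Timestamps are sent as strings of nanoseconds
            "values": [[str(ts_ns), line] for ts_ns, line in values],
        })
    return {"streams": streams}


if __name__ == "__main__":
    buffered = [
        ({"service": "order-service", "level": "info"}, 2_000, "order created"),
        ({"service": "order-service", "level": "error"}, 3_000, "payment failed"),
        ({"service": "order-service", "level": "info"}, 1_000, "request received"),
    ]
    print(json.dumps(build_push_payload(buffered), indent=2))
```

A handler built on this would append records to a buffer in `emit` and flush the payload on a timer or size threshold, trading a small delivery delay for far fewer requests.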
Custom Metrics
# Custom Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, Summary
from typing import Dict, Any
import time

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1.0, 2.5, 5.0, 10.0]
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

order_total = Counter(
    'order_total',
    'Total number of orders',
    ['status', 'payment_method']
)

order_value = Summary(
    'order_value',
    'Order value in dollars',
    ['currency']
)
class MetricsCollector:
    """Custom metrics collector"""

    def __init__(self, service_name: str):
        self.service_name = service_name

    def record_request(self, method: str, endpoint: str,
                       status: int, duration: float):
        """Record HTTP request metrics"""
        http_requests_total.labels(
            method=method,
            endpoint=endpoint,
            status=status
        ).inc()
        http_request_duration_seconds.labels(
            method=method,
            endpoint=endpoint
        ).observe(duration)

    def record_order(self, status: str, payment_method: str,
                     value: float, currency: str = "USD"):
        """Record order metrics"""
        order_total.labels(
            status=status,
            payment_method=payment_method
        ).inc()
        order_value.labels(currency=currency).observe(value)

    def set_active_connections(self, count: int):
        """Set active connections count"""
        active_connections.labels(service=self.service_name).set(count)


# Middleware for automatic metrics
class MetricsMiddleware:
    """ASGI middleware for automatic HTTP metrics"""

    def __init__(self, app, collector: MetricsCollector):
        self.app = app
        self.collector = collector

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        # Monotonic clock: unaffected by system clock adjustments
        start_time = time.perf_counter()

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                duration = time.perf_counter() - start_time
                # Caution: the raw path is a high-cardinality label when it
                # contains IDs; prefer the route template where available
                self.collector.record_request(
                    method=scope["method"],
                    endpoint=scope["path"],
                    status=message["status"],
                    duration=duration
                )
            await send(message)

        await self.app(scope, receive, send_wrapper)
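The buckets chosen for `http_request_duration_seconds` determine how accurate the `histogram_quantile(0.99, ...)` value in the HighLatency alert can be, because the quantile is interpolated linearly inside the bucket where it falls. The sketch below is a simplified re-implementation of that estimate for illustration; the real PromQL function handles more edge cases, and the function name and input shape here are assumptions.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # Quantile lies beyond the last finite bound; report that bound
        return buckets[-2][0]

    lower = buckets[index - 1][0] if index > 0 else 0.0
    previous = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation within the bucket
    return lower + (upper - lower) * (rank - previous) / (cumulative - previous)


if __name__ == "__main__":
    observed = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    for q in (0.5, 0.95, 0.995):
        print(q, estimate_quantile(q, observed))
```

Two consequences follow for bucket design: the estimate can never exceed the largest finite bucket bound, and resolution near an alert threshold is only as good as the bucket boundaries around it.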
Observability Benefits
Comprehensive observability enables faster debugging, better capacity planning, and proactive issue detection. Use the three pillars together for complete visibility.
Summary
| Component | Purpose | Data Type |
|---|---|---|
| Prometheus | Metrics collection | Time series data |
| Grafana | Visualization | Dashboards |
| Jaeger | Distributed tracing | Trace spans |
| Loki | Log aggregation | Structured logs |
| OpenTelemetry | Instrumentation | All types |
| Alertmanager | Alert routing | Alerts and notifications |