Observability: Prometheus, Grafana, Jaeger, OpenTelemetry

Difficulty: Senior Level | Companies: Google, Netflix, Uber, Datadog, New Relic

Interview Question

"Design an observability stack for a microservices platform with 100+ services. How do you handle metrics, logs, traces, and alerting at scale?"

ℹ️Key Concepts

This question tests your understanding of the three pillars of observability (metrics, logs, and traces) and how to implement each of them at scale.

Complete Observability Architecture

Architecture Overview

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌───────────────── DATA SOURCES ──────────────────┐                   │
│  │  Applications │ Infrastructure │ Databases       │                   │
│  └──────────────────────┬──────────────────────────┘                   │
│                         │                                               │
│  ┌───────────────── COLLECTION LAYER ──────────────┐                   │
│  │                                                       │              │
│  │  ┌─────────────────────────────────────────────┐    │              │
│  │  │           OpenTelemetry Collector            │    │              │
│  │  │                                               │    │              │
│  │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  │    │              │
│  │  │  │ Metrics  │  │  Logs    │  │ Traces   │  │    │              │
│  │  │  │ Receiver │  │ Receiver │  │ Receiver │  │    │              │
│  │  │  └──────────┘  └──────────┘  └──────────┘  │    │              │
│  │  │                                               │    │              │
│  │  └─────────────────────────────────────────────┘    │              │
│  │                                                       │              │
│  └──────────────────────┬──────────────────────────────┘              │
│                         │                                               │
│  ┌───────────────── STORAGE LAYER ─────────────────┐                 │
│  │                                                       │              │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │              │
│  │  │Prometheus│  │  Loki    │  │  Jaeger  │          │              │
│  │  │(Metrics) │  │  (Logs)  │  │ (Traces) │          │              │
│  │  └──────────┘  └──────────┘  └──────────┘          │              │
│  │                                                       │              │
│  └──────────────────────┬──────────────────────────────┘              │
│                         │                                               │
│  ┌───────────────── VISUALIZATION LAYER ───────────┐                  │
│  │                                                       │              │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │              │
│  │  │ Grafana  │  │ Grafana  │  │Jaeger UI │          │              │
│  │  │Dashboards│  │  (Logs)  │  │ (Traces) │          │              │
│  │  └──────────┘  └──────────┘  └──────────┘          │              │
│  │                                                       │              │
│  └──────────────────────┬──────────────────────────────┘              │
│                         │                                               │
│  ┌───────────────── ALERTING LAYER ────────────────┐                  │
│  │  Alertmanager │ PagerDuty │ Slack │ Email        │                  │
│  └─────────────────────────────────────────────────────┘              │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Mathematical Foundation: SLI/SLO/SLA

Service Level Indicator (SLI):

Availability SLI: A = (total_time - downtime) / total_time
Latency SLI: L = requests_within_threshold / total_requests
Error rate SLI: E = failed_requests / total_requests

Service Level Objective (SLO):

Availability SLO: 99.9% (8.76 hours downtime/year)
Latency SLO: 99% of requests < 200ms
Error SLO: < 0.1% error rate

Error Budget:

Error budget: B = 1 - SLO
For 99.9% SLO: B = 0.001 = 0.1%
Monthly error budget (in minutes): B_monthly = B × days_in_month × 24 × 60
For a 30-day month at 99.9%: B_monthly = 0.001 × 30 × 24 × 60 = 43.2 minutes
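
The error budget formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library API: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the request-based variant assumes the SLO is measured as a ratio of good requests to total requests.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    budget_fraction = 1.0 - slo          # B = 1 - SLO
    return budget_fraction * days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    Returns 1.0 when nothing has been spent and a negative value
    when the budget has been exceeded.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes per 30 days")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of budget left")
```

A team might compare `budget_remaining` against a threshold to decide whether to slow down releases, which is one common way error budgets feed into release policy.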

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2

  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

# alert_rules.yml (separate file, loaded through rule_files above)
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 1% for 5 minutes"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P99 latency is above 1 second"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
          description: "Pod {{ $labels.pod }} is restarting frequently"

      - alert: HighMemoryUsage
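        # Dividing by (limit > 0) drops containers that have no memory limit set,
        # which would otherwise produce +Inf and fire the alert
        expr: |
          (container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)) > 0.9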
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Container {{ $labels.container }} is using > 90% memory"

  - name: slo_alerts
    rules:
      - alert: SLOBreach
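        # A 30d range only works if Prometheus retains at least 30 days of data;
        # consider a recording rule for the ratio to keep evaluation cheap
        expr: |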
          (
            sum(rate(http_requests_total{status!~"5.."}[30d])) /
            sum(rate(http_requests_total[30d]))
          ) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO breach detected"
          description: "Availability SLO is below 99.9%"
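
The HighLatency rule above relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below is a simplified Python rendering of that linear interpolation, assuming classic histograms with cumulative `(upper_bound, count)` buckets ending in `+Inf`. The function is illustrative and skips edge cases that Prometheus itself handles.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("buckets must end with a +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total                      # position of the target observation
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    observed = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, observed))
```

Because the estimate is interpolated within a bucket, its accuracy depends on bucket boundaries. Placing a boundary near the alert threshold (1 second here) keeps the P99 alert meaningful.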

Grafana Dashboard Configuration

{
  "dashboard": {
    "title": "Service Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (service)",
            "legendFormat": "{{service}}"
          }
        ],
        "yaxes": [
          {
            "label": "Requests/sec",
            "min": 0
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) * 100",
            "legendFormat": "{{service}}"
          }
        ],
        "yaxes": [
          {
            "label": "Error %",
            "min": 0,
            "max": 100
          }
        ]
      },
      {
        "title": "Latency Distribution",
        "type": "heatmap",
        "targets": [
          {
            "expr": "sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)",
            "legendFormat": "{{service}} - {{le}}s",
            "format": "heatmap"
          }
        ]
      },
      {
        "title": "Active Connections",
        "type": "singlestat",
        "targets": [
          {
            "expr": "sum(active_connections) by (service)",
            "legendFormat": "{{service}}"
          }
        ]
      }
    ],
    "templating": {
      "list": [
        {
          "name": "service",
          "type": "query",
          "query": "label_values(http_requests_total, service)"
        }
      ]
    }
  }
}

Jaeger Distributed Tracing

# OpenTelemetry instrumentation
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
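# Note: the Jaeger Thrift exporter is deprecated upstream; recent Jaeger
# versions ingest OTLP directly, so an OTLP exporter is the usual choice now.
from opentelemetry.exporter.jaeger.thrift import JaegerExporter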
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import time
from typing import Dict, Any, Optional
from functools import wraps

# Initialize tracer
resource = Resource(attributes={
    SERVICE_NAME: "order-service"
})

provider = TracerProvider(resource=resource)
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-agent",
    agent_port=6831,
)
processor = BatchSpanProcessor(jaeger_exporter)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

class TracingMiddleware:
    """Custom tracing middleware"""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            with tracer.start_as_current_span(
                f"{scope['method']} {scope['path']}"
            ) as span:
                # Add attributes
                span.set_attribute("http.method", scope["method"])
                span.set_attribute("http.url", scope["path"])
                span.set_attribute("http.scheme", scope.get("scheme", "http"))

                # Track response size without buffering the whole body in memory
                body_size = 0

                async def send_wrapper(message):
                    nonlocal body_size

                    if message["type"] == "http.response.start":
                        span.set_attribute("http.status_code", message["status"])

                    elif message["type"] == "http.response.body":
                        body_size += len(message.get("body", b""))

                    # Forward every message, otherwise the client never receives a response
                    await send(message)

                await self.app(scope, receive, send_wrapper)

                # Set response body size
                span.set_attribute("http.response_content_length", body_size)

        else:
            await self.app(scope, receive, send)

def trace_function(name: Optional[str] = None):
    """Decorator for tracing async functions"""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            span_name = name or func.__name__

            with tracer.start_as_current_span(span_name) as span:
                try:
                    result = await func(*args, **kwargs)
                    span.set_status(trace.Status(trace.StatusCode.OK))
                    return result
                except Exception as e:
                    # Mark the span as failed and attach the error message
                    span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                    span.record_exception(e)
                    raise

        return wrapper
    return decorator

class DistributedTracer:
    """Distributed tracing across services"""

    def __init__(self):
        self.tracer = trace.get_tracer(__name__)

    def trace_http_request(self, method: str, url: str,
                          headers: Optional[Dict[str, str]] = None) -> Dict[str, str]:
        """Add trace context to HTTP headers"""
        from opentelemetry import propagate

        # Copy so the caller's dict is not mutated
        carrier = dict(headers or {})
        # With the default propagators this writes the W3C traceparent header
        propagate.inject(carrier)
        return carrier

    def extract_trace_context(self, headers: Dict[str, str]):
        """Extract trace context from HTTP headers"""
        from opentelemetry import propagate

        return propagate.extract(carrier=headers)

# Example: Tracing across services
import asyncio

distributed_tracer = DistributedTracer()

async def call_service_a():
    with tracer.start_as_current_span("call-service-a") as span:
        # Add custom attributes
        span.set_attribute("service.name", "service-a")
        span.set_attribute("operation.type", "http")

        # Build headers carrying the trace context; pass them to the HTTP client
        headers = distributed_tracer.trace_http_request(
            "GET",
            "http://service-b/api/data"
        )

        # Record events
        span.add_event("Making HTTP request", {
            "http.url": "http://service-b/api/data"
        })

        # Simulate work without blocking the event loop
        await asyncio.sleep(0.1)

        span.add_event("HTTP request completed")

        return {"status": "success"}

ℹ️Tracing Best Practices

Use distributed tracing to track requests across service boundaries. Always propagate trace context in HTTP headers. Add custom attributes for debugging.
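
To make "propagate trace context in HTTP headers" concrete, the sketch below shows what travels on the wire in the W3C Trace Context `traceparent` header: a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and trace flags. It uses only the standard library and handles version `00` only. The helper names are hypothetical, and in a real service the OpenTelemetry `propagate.inject` and `propagate.extract` calls shown earlier do this work.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version 00 - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Build a traceparent value, reusing trace_id when continuing a trace."""
    trace_id = trace_id or secrets.token_hex(16)
    span_id = secrets.token_hex(8)        # every hop gets its own span ID
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = _TRACEPARENT_RE.match(header.strip().lower())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None                       # all-zero IDs are invalid
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def outgoing_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Continue the incoming trace if present, otherwise start a new one."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    trace_id = parsed[0] if parsed else None
    sampled = parsed[2] if parsed else True
    return {"traceparent": new_traceparent(trace_id, sampled)}


if __name__ == "__main__":
    inbound = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
    print(outgoing_headers(inbound))
```

The trace ID stays constant across every hop while the span ID changes, which is what lets a backend such as Jaeger stitch spans from different services into one trace.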

Log Aggregation with Loki

# Structured logging with Loki integration
import logging
import json
from typing import Dict, Any, Optional
from datetime import datetime
import uuid
import requests

class LokiHandler(logging.Handler):
    """Custom logging handler for Loki"""

    def __init__(self, loki_url: str, labels: Dict[str, str] = None):
        super().__init__()
        self.loki_url = loki_url
        self.labels = labels or {}

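    def emit(self, record):
        log_entry = {
            "streams": [
                {
                    # Loki's push API expects the label set under "stream"
                    "stream": {
                        **self.labels,
                        "level": record.levelname.lower(),
                        "logger": record.name
                    },
                    "values": [
                        [
                            # Nanosecond epoch timestamp of the record, as a string
                            str(int(record.created * 1e9)),
                            self.format(record)
                        ]
                    ]
                }
            ]
        }

        try:
            # One synchronous push per record; batch or use an agent under real load
            requests.post(
                f"{self.loki_url}/loki/api/v1/push",
                json=log_entry,
                timeout=5
            )
        except Exception:
            # Never let a logging failure crash the application
            self.handleError(record)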

class StructuredLogger:
    """Structured logger with context"""

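    def __init__(self, service_name: str, loki_url: Optional[str] = None):
        self.service_name = service_name
        self.logger = logging.getLogger(service_name)
        # The default effective level is WARNING, which would drop info() calls
        self.logger.setLevel(logging.INFO)

        if loki_url:
            handler = LokiHandler(
                loki_url,
                labels={"service": service_name}
            )
            self.logger.addHandler(handler)

    def log(self, level: str, message: str, context: Optional[Dict[str, Any]] = None):
        """Log with context"""
        # Reuse the active trace ID so logs can be correlated with traces;
        # a random ID per log line would not match any trace
        from opentelemetry import trace
        span_context = trace.get_current_span().get_span_context()
        trace_id = format(span_context.trace_id, "032x") if span_context.is_valid else None

        log_data = {
            "timestamp": datetime.utcnow().isoformat() + "Z",
            "service": self.service_name,
            "level": level,
            "message": message,
            "context": context or {},
            "trace_id": trace_id
        }

        getattr(self.logger, level.lower())(json.dumps(log_data))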

    def info(self, message: str, context: Dict[str, Any] = None):
        self.log("INFO", message, context)

    def error(self, message: str, context: Dict[str, Any] = None):
        self.log("ERROR", message, context)

    def warning(self, message: str, context: Dict[str, Any] = None):
        self.log("WARNING", message, context)

# Example usage
logger = StructuredLogger("order-service", "http://loki:3100")

logger.info("Order created", {
    "order_id": "123",
    "user_id": "user-456",
    "amount": 99.99
})
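
Sending one HTTP request per log line, as the handler above does, becomes expensive at 100+ services. A common mitigation is to batch entries and group them by label set before pushing. The sketch below shows one way to build such a payload with the standard library only: `LokiBatch` is a hypothetical helper rather than part of any Loki client, and the actual HTTP push is left out.

```python
import json
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

LabelKey = Tuple[Tuple[str, str], ...]


class LokiBatch:
    """Accumulates log lines and groups them into one stream per label set."""

    def __init__(self, max_entries: int = 100):
        self.max_entries = max_entries
        self._streams: Dict[LabelKey, List[List[str]]] = defaultdict(list)
        self._size = 0

    def add(self, labels: Dict[str, str], line: str,
            ts_ns: Optional[int] = None) -> bool:
        """Queue a line; returns True when the batch should be flushed."""
        key = tuple(sorted(labels.items()))   # identical labels share a stream
        timestamp = ts_ns if ts_ns is not None else time.time_ns()
        self._streams[key].append([str(timestamp), line])
        self._size += 1
        return self._size >= self.max_entries

    def drain(self) -> dict:
        """Return the push payload and reset the batch."""
        payload = {
            "streams": [
                {"stream": dict(key), "values": values}
                for key, values in self._streams.items()
            ]
        }
        self._streams.clear()
        self._size = 0
        return payload


if __name__ == "__main__":
    batch = LokiBatch(max_entries=3)
    batch.add({"service": "order-service", "level": "info"}, "Order created")
    batch.add({"service": "order-service", "level": "info"}, "Order paid")
    batch.add({"service": "order-service", "level": "error"}, "Payment failed")
    print(json.dumps(batch.drain(), indent=2))
```

Keeping the label set small (service, level) and putting high-cardinality values such as order IDs in the log line itself is generally what keeps Loki's index compact.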

Custom Metrics

# Custom Prometheus metrics
from prometheus_client import Counter, Histogram, Gauge, Summary
from typing import Dict, Any
import time

# Define metrics
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1.0, 2.5, 5.0, 10.0]
)

active_connections = Gauge(
    'active_connections',
    'Number of active connections',
    ['service']
)

order_total = Counter(
    'order_total',
    'Total number of orders',
    ['status', 'payment_method']
)

order_value = Summary(
    'order_value',
    'Order value in dollars',
    ['currency']
)

class MetricsCollector:
    """Custom metrics collector"""

    def __init__(self, service_name: str):
        self.service_name = service_name

    def record_request(self, method: str, endpoint: str, 
                      status: int, duration: float):
        """Record HTTP request metrics"""
        http_requests_total.labels(
            method=method,
            endpoint=endpoint,
            status=status
        ).inc()

        http_request_duration_seconds.labels(
            method=method,
            endpoint=endpoint
        ).observe(duration)

    def record_order(self, status: str, payment_method: str, 
                    value: float, currency: str = "USD"):
        """Record order metrics"""
        order_total.labels(
            status=status,
            payment_method=payment_method
        ).inc()

        order_value.labels(currency=currency).observe(value)

    def set_active_connections(self, count: int):
        """Set active connections count"""
        active_connections.labels(service=self.service_name).set(count)

# Middleware for automatic metrics
class MetricsMiddleware:
    """Middleware for automatic HTTP metrics"""

    def __init__(self, app, collector: MetricsCollector):
        self.app = app
        self.collector = collector

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
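            # Monotonic clock: unaffected by system clock adjustments
            start_time = time.perf_counter()

            async def send_wrapper(message):
                if message["type"] == "http.response.start":
                    # Measures time until the response starts, not full completion
                    duration = time.perf_counter() - start_time
                    self.collector.record_request(
                        method=scope["method"],
                        # Raw paths containing IDs create unbounded label cardinality;
                        # prefer a route template where one is available
                        endpoint=scope["path"],
                        status=message["status"],
                        duration=duration
                    )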

                await send(message)

            await self.app(scope, receive, send_wrapper)
        else:
            await self.app(scope, receive, send)
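
Using the raw request path as the `endpoint` label, as the middleware above does, gives every order ID or user ID its own time series. One way to bound that is to normalize paths before labeling. The sketch below is a heuristic: the `normalize_path` helper and the `:id` / `:uuid` placeholders are assumptions for illustration, and a framework's own route template is usually a better source when available.

```python
import re

_UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMERIC_RE = re.compile(r"\d+")


def normalize_path(path: str) -> str:
    """Replace variable path segments with placeholders to bound label cardinality."""
    path = path.split("?", 1)[0]          # drop any query string
    segments = []
    for segment in path.split("/"):
        if _NUMERIC_RE.fullmatch(segment):
            segments.append(":id")
        elif _UUID_RE.fullmatch(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments) or "/"


if __name__ == "__main__":
    for raw in ("/orders/123/items/42", "/health", "/"):
        print(raw, "->", normalize_path(raw))
```

In the middleware this would wrap the label value, for example `endpoint=normalize_path(scope["path"])`, so that `/orders/1` and `/orders/2` land in the same series.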

✅Observability Benefits

Comprehensive observability enables faster debugging, better capacity planning, and proactive issue detection. Use the three pillars together for complete visibility.

Summary

Component	Purpose	Data Type
Prometheus	Metrics collection	Time series data
Grafana	Visualization	Dashboards
Jaeger	Distributed tracing	Trace spans
Loki	Log aggregation	Structured logs
OpenTelemetry	Instrumentation	All types
Alertmanager	Alert routing	Alerts and notifications