Kafka Monitoring

Overview

Monitoring Apache Kafka is essential for maintaining reliable, scalable, and performant event streaming platforms. A comprehensive monitoring strategy covers broker health, consumer lag, throughput metrics, and resource utilization across the entire Kafka ecosystem.

Why Monitoring Matters

Early Detection: Identify issues before they impact production workloads
Capacity Planning: Understand growth patterns for infrastructure scaling
Performance Optimization: Tune configurations based on real metrics
SLA Compliance: Ensure message delivery guarantees are met

Key Kafka Metrics

Broker Metrics

Metric	Description	Warning Threshold
`kafka.server.BrokerTopicMetrics.MessagesInPerSec`	Messages received per second	Varies by workload
`kafka.server.BrokerTopicMetrics.BytesInPerSec`	Bytes received per second	> 80% network capacity
`kafka.server.BrokerTopicMetrics.BytesOutPerSec`	Bytes sent per second	> 80% network capacity
`kafka.network.RequestMetrics.TotalRequestsPerSec`	Total requests per second	> 10,000 req/s
`kafka.server.ReplicaManager.UnderReplicatedPartitions`	Partitions with missing replicas	> 0

Consumer Group Metrics

Architecture Diagram

kafka_consumer_group_lag{topic="orders", partition="0", group="order-processor"} 15234
kafka_consumer_group_lag{topic="orders", partition="1", group="order-processor"} 8921
kafka_consumer_group_lag{topic="orders", partition="2", group="order-processor"} 23456

Producer Metrics

record-error-rate: Failed record sends
record-send-rate: Successful record sends
batch-size-avg: Average batch size in bytes
compression-rate-avg: Compression ratio

JMX Metrics Deep Dive

Enabling JMX on Kafka Brokers

# Set JMX options in server.properties or environment
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

Key JMX MBeans

# Python example using jmxquery
from jmxquery import JMXQuery

jmx_url = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"

queries = [
    JMXQuery(
        metric_name="kafka_server_brokertopicmetrics_messagesinpersec_count",
        metric_type="gauge",
        metric_tags={"topic": "Orders"}
    ),
    JMXQuery(
        metric_name="kafka_server_replicamanager_underreplicatedpartitions",
        metric_type="gauge"
    )
]

results = JMXQuery(jmx_url).query(queries)
print(results)

Critical JMX MBeans

# jmx_exporter_config.yml
rules:
  - pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>Count
    name: kafka_server_brokertopicmetrics_messagesinpersec_count
    type: counter
    
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: gauge
    
  - pattern: kafka.network<type=RequestMetrics, name=TotalRequestsPerSec, request=(.+), error=(.+)><>Count
    name: kafka_network_requestmetrics_totalrequestspersec_count
    labels:
      request: $1
      error: $2

Prometheus Setup

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - 'kafka-broker-1:9404'
          - 'kafka-broker-2:9404'
          - 'kafka-broker-3:9404'
    metrics_path: /metrics
    
  - job_name: 'kafka-connect'
    static_configs:
      - targets:
          - 'kafka-connect-1:9404'
          - 'kafka-connect-2:9404'
          
  - job_name: 'kafka-consumer-groups'
    static_configs:
      - targets:
          - 'kafka-consumer-metrics:9404'

Prometheus Alert Rules

# kafka_alerts.yml
groups:
  - name: kafka_alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions detected"
          description: "{{ $value }} partitions are under-replicated"
          
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_group_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag is high"
          description: "Consumer group {{ $labels.group }} has lag of {{ $value }}"
          
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker is down"
          description: "Broker {{ $labels.instance }} is unreachable"

Grafana Dashboard Setup

Essential Dashboard Panels

{
  "panels": [
    {
      "title": "Messages In Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Consumer Group Lag",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum by (group) (kafka_consumer_group_lag)",
          "legendFormat": "{{group}}"
        }
      ]
    },
    {
      "title": "Under Replicated Partitions",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
        }
      ],
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 1, "color": "red"}
        ]
      }
    }
  ]
}

Dashboard Best Practices

Broker Overview: CPU, memory, disk usage, network I/O
Topic Metrics: Messages/sec, bytes/sec, partition distribution
Consumer Metrics: Lag, throughput, rebalance frequency
Cluster Health: ISR size, controller status, log segments

Alerting Strategies

Alert Severity Levels

Level	Response Time	Examples
Critical	Immediate	Broker down, under-replicated partitions
Warning	1 hour	High consumer lag, disk space low
Info	Next business day	Config changes, rebalances

Common Alert Patterns

# Consumer lag alert with dynamic threshold
- alert: KafkaConsumerLagExceedsThreshold
  expr: |
    (
      kafka_consumer_group_lag 
      / ignoring(group) group_left() 
      kafka_server_brokertopicmetrics_messagesinpersec_count
    ) > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer {{ $labels.group }} is falling behind"

Monitoring Best Practices

1. Establish Baselines

Monitor metrics during normal operations for 1-2 weeks
Set thresholds based on standard deviations from baseline
Account for daily/weekly patterns in traffic

2. Implement Health Checks

def kafka_health_check(brokers):
    """Check Kafka cluster health"""
    checks = {
        'brokers_reachable': check_broker_connectivity(brokers),
        'all_partitions_isr': check_isr_status(),
        'consumer_lags_healthy': check_consumer_lags(),
        'disk_space_adequate': check_disk_usage(),
        'jmx_accessible': check_jmx_endpoint()
    }
    return all(checks.values())

3. Log Aggregation

Collect Kafka logs with Filebeat or Fluentd
Index logs in Elasticsearch or Loki
Create log-based alerts for ERROR messages

4. Distributed Tracing

Use OpenTelemetry for end-to-end tracing
Track message flow from producer to consumer
Monitor end-to-end latency across services

Summary

Effective Kafka monitoring requires a multi-layered approach combining JMX metrics, Prometheus collection, Grafana visualization, and intelligent alerting. Focus on consumer lag, broker health, and throughput metrics to ensure your Kafka deployment remains reliable and performant.

Kafka Monitoring

Kafka Monitoring

Overview

Why Monitoring Matters

Key Kafka Metrics

Broker Metrics

Consumer Group Metrics

Producer Metrics

JMX Metrics Deep Dive

Enabling JMX on Kafka Brokers

Key JMX MBeans

Critical JMX MBeans

Prometheus Setup

Prometheus Configuration

Prometheus Alert Rules

Grafana Dashboard Setup

Essential Dashboard Panels

Dashboard Best Practices

Alerting Strategies

Alert Severity Levels

Common Alert Patterns

Monitoring Best Practices

1. Establish Baselines

2. Implement Health Checks

3. Log Aggregation

4. Distributed Tracing

Summary

Premium Content

Need Expert Kafka Help?