Kafka Monitoring
Overview
Monitoring Apache Kafka is essential for maintaining reliable, scalable, and performant event streaming platforms. A comprehensive monitoring strategy covers broker health, consumer lag, throughput metrics, and resource utilization across the entire Kafka ecosystem.
Why Monitoring Matters
- Early Detection: Identify issues before they impact production workloads
- Capacity Planning: Understand growth patterns for infrastructure scaling
- Performance Optimization: Tune configurations based on real metrics
- SLA Compliance: Ensure message delivery guarantees are met
Key Kafka Metrics
Broker Metrics
| Metric | Description | Warning Threshold |
|---|---|---|
| kafka.server.BrokerTopicMetrics.MessagesInPerSec | Messages received per second | Varies by workload |
| kafka.server.BrokerTopicMetrics.BytesInPerSec | Bytes received per second | > 80% network capacity |
| kafka.server.BrokerTopicMetrics.BytesOutPerSec | Bytes sent per second | > 80% network capacity |
| kafka.network.RequestMetrics.TotalRequestsPerSec | Total requests per second | > 10,000 req/s |
| kafka.server.ReplicaManager.UnderReplicatedPartitions | Partitions with fewer in-sync replicas than the replication factor | > 0 |
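The 80% network-capacity thresholds in the table compare a bytes-per-second rate against link speed, which is usually quoted in bits per second. A minimal sketch of that conversion follows. The 10 Gbit/s link and the helper names (`network_utilization`, `exceeds_warning`) are assumptions for illustration and are not part of any Kafka tooling.

```python
def network_utilization(bytes_per_sec: float, link_bits_per_sec: float) -> float:
    """Return the fraction of link capacity used by the given byte rate."""
    if link_bits_per_sec <= 0:
        raise ValueError("link_bits_per_sec must be positive")
    # Metrics are in bytes/s, link speed is in bits/s, so multiply by 8.
    return (bytes_per_sec * 8) / link_bits_per_sec


def exceeds_warning(bytes_per_sec: float, link_bits_per_sec: float,
                    threshold: float = 0.80) -> bool:
    """True when utilization is above the warning threshold (80% by default)."""
    return network_utilization(bytes_per_sec, link_bits_per_sec) > threshold


# Hypothetical example: 1.1 GB/s inbound on a 10 Gbit/s link (88% utilization).
LINK_10G = 10_000_000_000
print(exceeds_warning(1_100_000_000, LINK_10G))
```

In practice the capacity figure would come from the broker's actual NIC, and replication traffic counts against the same link.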
Consumer Group Metrics
kafka_consumer_group_lag{topic="orders", partition="0", group="order-processor"} 15234
kafka_consumer_group_lag{topic="orders", partition="1", group="order-processor"} 8921
kafka_consumer_group_lag{topic="orders", partition="2", group="order-processor"} 23456
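Lag is reported per partition, so a per-group view means summing across partitions. The sketch below does that for the sample lines above with a deliberately simplified parser. It assumes well-formed lines without escaped quotes in label values; a real pipeline would more likely aggregate in PromQL or use a client library's parser.

```python
import re
from collections import defaultdict

SAMPLE = '''\
kafka_consumer_group_lag{topic="orders", partition="0", group="order-processor"} 15234
kafka_consumer_group_lag{topic="orders", partition="1", group="order-processor"} 8921
kafka_consumer_group_lag{topic="orders", partition="2", group="order-processor"} 23456
'''

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def total_lag_by_group(text: str) -> dict:
    """Sum kafka_consumer_group_lag over partitions for each (group, topic)."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("name") != "kafka_consumer_group_lag":
            continue  # skip comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[(labels["group"], labels["topic"])] += float(match.group("value"))
    return dict(totals)


print(total_lag_by_group(SAMPLE))
```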
Producer Metrics
- record-error-rate: Failed record sends
- record-send-rate: Successful record sends
- batch-size-avg: Average batch size in bytes
- compression-rate-avg: Compression ratio
JMX Metrics Deep Dive
Enabling JMX on Kafka Brokers
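# Set JMX options in the broker's environment before starting Kafka.
# KAFKA_JMX_OPTS is an environment variable; it is not read from server.properties.
# Authentication and SSL are disabled here, so use this only on a trusted network.
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"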
Key JMX MBeans
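# Python example using the jmxquery package (needs a Java runtime on the PATH)
from jmxquery import JMXConnection, JMXQuery

jmx_url = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"

queries = [
    # Cumulative message count for the "orders" topic
    JMXQuery(
        "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders",
        attribute="Count",
        metric_name="kafka_server_brokertopicmetrics_messagesinpersec_count",
        metric_labels={"topic": "orders"},
    ),
    # Broker-wide under-replicated partition gauge
    JMXQuery(
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
        attribute="Value",
        metric_name="kafka_server_replicamanager_underreplicatedpartitions",
    ),
]

# Queries run through a JMXConnection, not through JMXQuery itself
for metric in JMXConnection(jmx_url).query(queries):
    print(metric.metric_name, metric.metric_labels, metric.value)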
Critical JMX MBeans
# jmx_exporter_config.yml
rules:
  - pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>Count
    name: kafka_server_brokertopicmetrics_messagesinpersec_count
    type: COUNTER
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
  - pattern: kafka.network<type=RequestMetrics, name=TotalRequestsPerSec, request=(.+), error=(.+)><>Count
    name: kafka_network_requestmetrics_totalrequestspersec_count
    labels:
      request: $1
      error: $2
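The third rule turns regex capture groups into Prometheus labels: `$1` becomes `request` and `$2` becomes `error`. The sketch below reproduces that mapping with Python's `re` module. jmx_exporter itself uses Java regular expressions, but this particular pattern behaves the same in both. The input string is a hypothetical bean-attribute name written in the format the pattern expects, not captured output from a broker.

```python
import re

# Same regex as the third rule above.
PATTERN = re.compile(
    r"kafka.network<type=RequestMetrics, name=TotalRequestsPerSec, "
    r"request=(.+), error=(.+)><>Count"
)


def labels_for(bean_attribute: str):
    """Return the labels the rule would attach, or None if it does not match."""
    match = PATTERN.match(bean_attribute)
    if match is None:
        return None
    return {"request": match.group(1), "error": match.group(2)}


# Hypothetical input string for illustration.
example = ("kafka.network<type=RequestMetrics, name=TotalRequestsPerSec, "
           "request=Produce, error=NONE><>Count")
print(labels_for(example))
```

A bean that lacks either property does not match the pattern, so the rule emits nothing for it.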
Prometheus Setup
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - 'kafka-broker-1:9404'
          - 'kafka-broker-2:9404'
          - 'kafka-broker-3:9404'
    metrics_path: /metrics

  - job_name: 'kafka-connect'
    static_configs:
      - targets:
          - 'kafka-connect-1:9404'
          - 'kafka-connect-2:9404'

  - job_name: 'kafka-consumer-groups'
    static_configs:
      - targets:
          - 'kafka-consumer-metrics:9404'
Prometheus Alert Rules
# kafka_alerts.yml
groups:
  - name: kafka_alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions detected"
          description: "{{ $value }} partitions are under-replicated"

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_group_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag is high"
          description: "Consumer group {{ $labels.group }} has lag of {{ $value }}"

      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker is down"
          description: "Broker {{ $labels.instance }} is unreachable"
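Each rule's `for:` clause means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is pending. The sketch below is a simplified model of that state machine, not Prometheus code. It assumes a fixed evaluation interval and ignores details such as missed evaluations.

```python
def alert_states(condition_results, interval_s, for_s):
    """Map per-evaluation condition results to inactive/pending/firing states."""
    states = []
    active_since = None  # time at which the condition most recently became true
    for i, is_true in enumerate(condition_results):
        now = i * interval_s
        if not is_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states


# Broker-down rule from above: 15s evaluation interval, for: 1m.
results = [False, True, True, False, True, True, True, True, True]
print(alert_states(results, interval_s=15, for_s=60))
```

The brief recovery at the fourth evaluation resets the timer, which is how `for:` filters out short blips.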
Grafana Dashboard Setup
Essential Dashboard Panels
{
  "panels": [
    {
      "title": "Messages In Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum by (instance) (rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Consumer Group Lag",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum by (group) (kafka_consumer_group_lag)",
          "legendFormat": "{{group}}"
        }
      ]
    },
    {
      "title": "Under Replicated Partitions",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
        }
      ],
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 1, "color": "red"}
        ]
      }
    }
  ]
}
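The first panel applies `rate()` to a cumulative counter to get messages per second. The sketch below shows the core of that calculation on hypothetical samples, including a counter reset such as a broker restart. It is an approximation for illustration: Prometheus's own `rate()` also extrapolates to the edges of the query window.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero (e.g. broker restart),
        # so the whole current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# Hypothetical scrapes every 15s; the counter resets between t=30 and t=45.
samples = [(0, 1000), (15, 1600), (30, 2500), (45, 300), (60, 900)]
print(per_second_rate(samples))
```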
Dashboard Best Practices
- Broker Overview: CPU, memory, disk usage, network I/O
- Topic Metrics: Messages/sec, bytes/sec, partition distribution
- Consumer Metrics: Lag, throughput, rebalance frequency
- Cluster Health: ISR size, controller status, log segments
Alerting Strategies
Alert Severity Levels
| Level | Response Time | Examples |
|---|---|---|
| Critical | Immediate | Broker down, under-replicated partitions |
| Warning | 1 hour | High consumer lag, disk space low |
| Info | Next business day | Config changes, rebalances |
Common Alert Patterns
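# Consumer lag alert with dynamic threshold: lag expressed as seconds of
# backlog at the topic's current ingest rate.
# Assumes the messages-in counter is exported with a "topic" label.
- alert: KafkaConsumerLagExceedsThreshold
  expr: |
    (
      sum by (group, topic) (kafka_consumer_group_lag)
      / on(topic) group_left()
      sum by (topic) (rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))
    ) > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer {{ $labels.group }} is falling behind"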
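One way to read a lag-to-throughput ratio: dividing lag (messages) by an ingest rate (messages per second) gives a rough estimate, in seconds, of how far the consumer trails the producers. A minimal sketch of that arithmetic follows. The lag figure is the sum of the three sample lag values shown earlier; the 120 msg/s rate is a hypothetical number.

```python
def seconds_behind(lag_messages: float, ingest_rate_per_sec: float) -> float:
    """Estimate how many seconds of traffic the consumer group is behind."""
    if ingest_rate_per_sec <= 0:
        # No incoming traffic: any remaining lag cannot be expressed in seconds.
        return float("inf") if lag_messages > 0 else 0.0
    return lag_messages / ingest_rate_per_sec


# 47,611 messages of lag on a topic receiving a hypothetical 120 msg/s.
backlog = seconds_behind(47611, 120)
print(round(backlog, 1), backlog > 300)
```

Because the threshold scales with traffic, a busy topic and a quiet topic can share the same 300-second limit.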
Monitoring Best Practices
1. Establish Baselines
- Monitor metrics during normal operations for 1-2 weeks
- Set thresholds based on standard deviations from baseline
- Account for daily/weekly patterns in traffic
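A threshold based on standard deviations can be derived from baseline samples as mean plus `k` standard deviations. The sketch below assumes `k = 3` and a handful of made-up samples. To account for daily or weekly patterns, the same calculation could be run separately per hour-of-week bucket.

```python
import statistics


def baseline_threshold(samples, k: float = 3.0) -> float:
    """Return mean + k * sample standard deviation of the baseline values."""
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    return statistics.fmean(samples) + k * statistics.stdev(samples)


# Hypothetical baseline of a metric observed during normal operation.
baseline = [100, 110, 90, 105, 95]
threshold = baseline_threshold(baseline)
print(round(threshold, 2))
```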
2. Implement Health Checks
def kafka_health_check(brokers):
    """Check Kafka cluster health; return True only if every check passes."""
    # The check_* helpers are placeholders to implement for your environment;
    # each is expected to return a bool.
    checks = {
        'brokers_reachable': check_broker_connectivity(brokers),
        'all_partitions_isr': check_isr_status(),
        'consumer_lags_healthy': check_consumer_lags(),
        'disk_space_adequate': check_disk_usage(),
        'jmx_accessible': check_jmx_endpoint()
    }
    return all(checks.values())
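A single boolean hides which check failed. One possible variant, sketched below, takes the checks as callables and returns the names of those that failed, treating an exception as a failure. The function name `failed_checks` and the stub checks are assumptions for illustration.

```python
from typing import Callable, Dict, List


def failed_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each check and return the names of those that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that raises counts as a failure
        if not ok:
            failed.append(name)
    return failed


def _jmx_down() -> bool:
    """Stub that simulates an unreachable JMX endpoint."""
    raise ConnectionError("JMX endpoint unreachable")


# Stub checks standing in for real implementations.
checks = {
    "brokers_reachable": lambda: True,
    "all_partitions_isr": lambda: False,
    "jmx_accessible": _jmx_down,
}
print(failed_checks(checks))
```

An empty list means the cluster passed every check, so `not failed_checks(checks)` gives the same answer as the boolean version.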
3. Log Aggregation
- Collect Kafka logs with Filebeat or Fluentd
- Index logs in Elasticsearch or Loki
- Create log-based alerts for ERROR messages
4. Distributed Tracing
- Use OpenTelemetry for end-to-end tracing
- Track message flow from producer to consumer
- Monitor end-to-end latency across services
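Tracing instrumentation depends on the client library in use, so it is not shown here. End-to-end latency itself can be approximated without tracing by comparing each record's produce timestamp with the time it was consumed. The sketch below assumes both timestamps are in milliseconds and that producer and consumer clocks are reasonably synchronized; the sample records are made up.

```python
import math


def end_to_end_latencies_ms(records):
    """records: iterable of (produce_ts_ms, consume_ts_ms) pairs."""
    return [consumed - produced for produced, consumed in records]


def percentile(values, pct: float):
    """Nearest-rank percentile of a non-empty list of values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical (produce, consume) timestamps in milliseconds.
records = [(1000, 1012), (1005, 1030), (1010, 1018), (1020, 1140), (1030, 1041)]
latencies = end_to_end_latencies_ms(records)
print(percentile(latencies, 50), percentile(latencies, 95))
```

Tail percentiles such as p95 tend to reveal slow consumers that an average would hide.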
Summary
Effective Kafka monitoring requires a multi-layered approach combining JMX metrics, Prometheus collection, Grafana visualization, and intelligent alerting. Focus on consumer lag, broker health, and throughput metrics to ensure your Kafka deployment remains reliable and performant.