Kafka on Kubernetes

Overview

Running Kafka on Kubernetes provides automated operations, scalability, and resource efficiency. This guide covers deploying Kafka with the Strimzi operator, managing persistent storage, topics, and users, and setting up autoscaling, rolling updates, monitoring, and disaster recovery.

Benefits of Kubernetes

Automated Deployment: Declarative configuration
Self-Healing: Automatic pod restarts
Scaling: Horizontal and vertical scaling
Resource Efficiency: Better cluster utilization
Rolling Updates: Zero-downtime upgrades

Strimzi Operator Setup

Install Strimzi

# Install Strimzi using Helm
helm repo add strimzi https://strimzi.io/charts/
helm repo update

helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --create-namespace \
  --set watchAnyNamespace=true
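
Before creating a cluster, it can help to confirm that the operator actually came up. The sketch below assumes the chart's default Deployment name, strimzi-cluster-operator, and the kafka namespace used above; the function name wait_for_operator is only illustrative.

```shell
#!/usr/bin/env bash
# Block until the Strimzi operator Deployment reports a successful rollout.
# Assumption: the Helm chart's default Deployment name, strimzi-cluster-operator.
wait_for_operator() {
  local namespace="${1:-kafka}"
  local timeout="${2:-120s}"
  kubectl rollout status deployment/strimzi-cluster-operator \
    --namespace="${namespace}" --timeout="${timeout}"
}

# Example: wait_for_operator kafka 180s
```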

Kafka Cluster CRD

# kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.5.1
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.5"
    storage:
      type: persistent-claim
      size: 100Gi
      class: fast-ssd
      deleteClaim: false
    resources:
      requests:
        memory: 4Gi
        cpu: 2
      limits:
        memory: 8Gi
        cpu: 4
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
      class: fast-ssd
      deleteClaim: false
  entityOperator:
    topicOperator: {}
    userOperator: {}
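
The replication settings above work together: for producers using acks=all, a partition stays writable as long as at least min.insync.replicas replicas are in sync, so the number of replicas that can be lost without blocking writes is the replication factor minus min.insync.replicas. A minimal sketch of that arithmetic (the function name tolerated_failures is illustrative):

```shell
#!/usr/bin/env bash
# Print how many replicas can be lost while acks=all writes still succeed.
# Arguments: replication factor, min.insync.replicas.
tolerated_failures() {
  local replication_factor="$1"
  local min_isr="$2"
  if [ "$min_isr" -gt "$replication_factor" ]; then
    echo "min.insync.replicas ($min_isr) exceeds replication factor ($replication_factor)" >&2
    return 1
  fi
  echo $(( replication_factor - min_isr ))
}

tolerated_failures 3 2   # prints 1
```

With a replication factor of 3 and min.insync.replicas of 2, one broker can be offline, for example during a rolling restart, without blocking producers.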

Apply Configuration

# Apply Kafka cluster
kubectl apply -f kafka-cluster.yaml

# Check cluster status
kubectl get kafka -n kafka

# Check pods
kubectl get pods -n kafka -l app.kubernetes.io/name=kafka
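
In scripts or CI pipelines, polling the pod list by hand is awkward. One option is to block on the Ready condition that Strimzi sets on the Kafka resource. The helper below is a sketch; its name and default timeout are assumptions, not part of the original guide.

```shell
#!/usr/bin/env bash
# Block until the Kafka custom resource reports the Ready condition.
# Defaults match the cluster name and namespace used in this guide.
wait_for_kafka() {
  local cluster="${1:-my-kafka-cluster}"
  local namespace="${2:-kafka}"
  local timeout="${3:-300s}"
  kubectl wait "kafka/${cluster}" --for=condition=Ready \
    --namespace="${namespace}" --timeout="${timeout}"
}

# Example: wait_for_kafka my-kafka-cluster kafka 600s
```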

Persistent Volumes

StorageClass Configuration

# storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
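provisioner: ebs.csi.aws.com  # gp3 volumes need the AWS EBS CSI driver
parameters:
  type: gp3
  iops: "3000"  # gp3 baseline; iopsPerGB of 10 on a 100Gi volume would fall below it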
  encrypted: "true"
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

PersistentVolumeClaim

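# pvc.yaml
# Shown for reference: Strimzi creates one claim per broker automatically,
# named data-<cluster>-kafka-<index>.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-my-kafka-cluster-kafka-0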
  namespace: kafka
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
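
The requested volume size should follow from expected traffic rather than a fixed default. As a rough, hedged sketch, per-broker disk can be estimated from the ingest rate, the retention period, and the replication factor. The function name, the headroom parameter, and the sample figures are all assumptions for illustration; the estimate ignores compression and index overhead.

```shell
#!/usr/bin/env bash
# Rough per-broker disk estimate in GiB.
# Arguments: ingest rate (KiB/s), retention (days), replication factor,
#            broker count, headroom percentage.
broker_storage_gib() {
  local ingest_kib_s="$1" retention_days="$2" rf="$3" brokers="$4" headroom_pct="$5"
  # Data retained across the whole cluster, counting every replica.
  local retained_kib=$(( ingest_kib_s * retention_days * 86400 * rf ))
  local per_broker_kib=$(( retained_kib / brokers ))
  local with_headroom_kib=$(( per_broker_kib * (100 + headroom_pct) / 100 ))
  # Round up to a whole number of GiB.
  echo $(( (with_headroom_kib + 1048575) / 1048576 ))
}

broker_storage_gib 100 7 3 3 30   # prints 75
```

Under these sample assumptions (100 KiB/s, 7 days, 3 replicas on 3 brokers, 30% headroom) the estimate fits within the 100Gi claim shown above.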

StatefulSet Configuration

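# Conceptual sketch only; do not apply it. Strimzi creates and manages the broker
# pods itself (recent releases use its own StrimziPodSet resource instead of a StatefulSet).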
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-kafka-cluster-kafka
  namespace: kafka
spec:
  serviceName: my-kafka-cluster-kafka
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: kafka
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kafka
    spec:
      containers:
        - name: kafka
          image: quay.io/strimzi/kafka:0.37.0-kafka-3.5.1  # tags combine the Strimzi and Kafka versions
          ports:
            - containerPort: 9092
              name: plain
            - containerPort: 9093
              name: tls
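          env:
            # Illustrative only: a real broker ID must be numeric (the pod
            # ordinal), whereas metadata.name yields the full pod name.
            - name: KAFKA_CFG_NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name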
          volumeMounts:
            - name: data
              mountPath: /var/lib/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi

Topic Management

Topic CRD

# orders-topic.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 604800000  # 7 days
    cleanup.policy: delete
    compression.type: lz4
    min.insync.replicas: 2
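
Kafka expects retention.ms in milliseconds, which is easy to get wrong by a factor of 1000. A small helper like the following (the name retention_ms is illustrative) shows where the 604800000 above comes from:

```shell
#!/usr/bin/env bash
# Convert a retention period in days to the millisecond value retention.ms expects.
retention_ms() {
  local days="$1"
  echo $(( days * 24 * 60 * 60 * 1000 ))
}

retention_ms 7   # prints 604800000
```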

Topic Operations

# Create topic
kubectl apply -f orders-topic.yaml

# List topics
kubectl get kafkatopics -n kafka

# Update topic (the partition count can be raised but not lowered)
kubectl patch kafkatopic orders -n kafka --type merge -p '{"spec":{"partitions":12}}'

# Delete topic
kubectl delete kafkatopic orders -n kafka
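
Kafka allows a topic's partition count to grow but never to shrink, and adding partitions changes which partition a given key maps to. A small guard could run before patching a KafkaTopic; check_partition_change below is a hypothetical helper, not a Strimzi command.

```shell
#!/usr/bin/env bash
# Succeed only when the requested partition count is not lower than the current one.
check_partition_change() {
  local current="$1"
  local desired="$2"
  if [ "$desired" -lt "$current" ]; then
    echo "refusing to shrink topic from $current to $desired partitions" >&2
    return 1
  fi
  echo "ok: $current -> $desired"
}

# Possible usage against the orders topic:
#   current=$(kubectl get kafkatopic orders -n kafka -o jsonpath='{.spec.partitions}')
#   check_partition_change "$current" 12 && \
#     kubectl patch kafkatopic orders -n kafka --type merge -p '{"spec":{"partitions":12}}'
```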

User Management

User CRD

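# app-user.yaml
# Requires a listener with "authentication: type: tls" and
# "authorization: type: simple" under spec.kafka in the Kafka resource.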
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: app-user
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: orders
        operations:
          - Read
          - Describe
        host: "*"
      - resource:
          type: topic
          name: orders
        operations:
          - Write
        host: "*"
      - resource:
          type: group
          name: order-processor
        operations:
          - Read
        host: "*"
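
Once the User Operator has reconciled a TLS user, it stores the client certificate and key in a Secret named after the user, and the cluster CA certificate lives in a Secret named <cluster>-cluster-ca-cert. The sketch below copies those files to a local directory; the function name and output layout are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Write the TLS files a client needs to connect as a KafkaUser.
# Arguments: user name, cluster name, namespace (default kafka), output dir (default .).
fetch_client_tls_files() {
  local user="$1" cluster="$2" namespace="${3:-kafka}" out_dir="${4:-.}"
  # Dots inside a Secret key must be escaped in a kubectl jsonpath expression.
  kubectl get secret "$user" -n "$namespace" \
    -o jsonpath='{.data.user\.crt}' | base64 --decode > "$out_dir/user.crt"
  kubectl get secret "$user" -n "$namespace" \
    -o jsonpath='{.data.user\.key}' | base64 --decode > "$out_dir/user.key"
  # The cluster CA certificate lets the client verify the brokers.
  kubectl get secret "${cluster}-cluster-ca-cert" -n "$namespace" \
    -o jsonpath='{.data.ca\.crt}' | base64 --decode > "$out_dir/ca.crt"
}

# Example: fetch_client_tls_files app-user my-kafka-cluster kafka ./certs
```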

Horizontal Pod Autoscaling

HPA Configuration

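# kafka-hpa.yaml
# Caution: Strimzi sets the broker count from spec.kafka.replicas and reverts
# direct changes to the pods it manages, and new brokers receive no existing
# partitions until a reassignment runs. Treat this manifest as illustrative.
# The Pods metric below also needs a custom metrics adapter.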
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-broker-hpa
  namespace: kafka
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: my-kafka-cluster-kafka
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: kafka_server_BrokerTopicMetrics_MessagesInPerSec
        target:
          type: AverageValue
          averageValue: "100000"

Custom Metrics

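# custom-metrics.yaml
# The External metric must be served by an external metrics adapter such as
# Prometheus Adapter; the metric name depends on how that adapter is configured.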
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-consumer-hpa
  namespace: kafka
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-consumer
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_group_lag
          selector:
            matchLabels:
              group: order-processor
        target:
          type: AverageValue
          averageValue: "1000"
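
For an External metric with an AverageValue target, the autoscaler roughly divides the total metric value by the target and rounds up, then clamps the result to minReplicas and maxReplicas. The sketch below reproduces that arithmetic in simplified form (it ignores the HPA tolerance and stabilization window) and adds one Kafka-specific point: within a consumer group each partition is read by at most one consumer, so replicas beyond the partition count sit idle. Function names are illustrative.

```shell
#!/usr/bin/env bash
# Estimate the replica count an HPA with an AverageValue target would request.
# Arguments: total lag, target average lag per pod, minReplicas, maxReplicas.
desired_consumers() {
  local total_lag="$1" target="$2" min="$3" max="$4"
  local desired=$(( (total_lag + target - 1) / target ))   # ceiling division
  if [ "$desired" -lt "$min" ]; then desired="$min"; fi
  if [ "$desired" -gt "$max" ]; then desired="$max"; fi
  echo "$desired"
}

# Consumers that can actually receive partitions from a topic.
active_consumers() {
  local replicas="$1" partitions="$2"
  if [ "$replicas" -gt "$partitions" ]; then
    echo "$partitions"
  else
    echo "$replicas"
  fi
}

desired_consumers 4500 1000 2 20   # prints 5
active_consumers 20 6              # prints 6
```

With the 6-partition orders topic, scaling to the configured maximum of 20 replicas would leave 14 consumers without work, so the partition count is worth raising alongside maxReplicas.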

Rolling Updates

Zero-Downtime Updates

# Excerpt of the Kafka resource: settings that keep partitions available
# while Strimzi restarts brokers one at a time
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    replicas: 3
    rack:
      topologyKey: topology.kubernetes.io/zone
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: persistent-claim
      size: 100Gi
  cruiseControl: {}

Update Strategies

# Trigger a manual rolling update by annotating the StrimziPodSet
kubectl annotate strimzipodset my-kafka-cluster-kafka -n kafka \
  strimzi.io/manual-rolling-update="true"

# Monitor rolling update
kubectl get pods -n kafka -l app.kubernetes.io/name=kafka -w

# Wait for the cluster to report Ready again
kubectl wait kafka/my-kafka-cluster -n kafka --for=condition=Ready --timeout=600s

Monitoring with Prometheus

ServiceMonitor Configuration

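# servicemonitor.yaml
# Assumes metrics are enabled through metricsConfig in the Kafka resource and
# that a Service exposes a port named "metrics". Strimzi's own examples scrape
# the pods directly with a PodMonitor on the tcp-prometheus port.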
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kafka-metrics
  namespace: kafka
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      strimzi.io/cluster: my-kafka-cluster
  namespaceSelector:
    matchNames:
      - kafka
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

PrometheusRule

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: kafka
spec:
  groups:
    - name: kafka
      rules:
        - alert: KafkaUnderReplicatedPartitions
          expr: kafka_server_replicamanager_underreplicatedpartitions > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Under-replicated partitions detected"
            description: "{{ $value }} partitions are under-replicated"
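
        # Adjust the metric and label names to match your lag exporter; Strimzi's
        # Kafka Exporter, for example, exposes kafka_consumergroup_lag with a
        # consumergroup label.
        - alert: KafkaConsumerLagHigh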
          expr: kafka_consumer_group_lag > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer lag is high"
            description: "Consumer group {{ $labels.group }} has lag of {{ $value }}"

Grafana Dashboard

{
  "dashboard": {
    "title": "Kafka on Kubernetes",
    "panels": [
      {
        "title": "Kafka Brokers",
        "type": "stat",
        "targets": [
          {
            "expr": "count(kafka_server_replicamanager_leadercount)",
            "legendFormat": "Brokers"
          }
        ]
      },
      {
        "title": "Messages In Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(kafka_server_brokertopicmetrics_messagesin_total{topic!=\"\"}[5m]))",
            "legendFormat": "Messages/sec"
          }
        ]
      }
    ]
  }
}

Best Practices

Resource Management

# Resource recommendations (excerpt of the Kafka resource)
spec:
  kafka:
    resources:
      requests:
        memory: 4Gi
        cpu: 2
      limits:
        memory: 8Gi
        cpu: 4
  zookeeper:
    resources:
      requests:
        memory: 2Gi
        cpu: 1
      limits:
        memory: 4Gi
        cpu: 2

Backup Strategy

#!/bin/bash
# backup_kafka.sh

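# Record the live partition layout and topic configs. Kafka has no built-in
# data snapshot command; protect the data itself with volume snapshots or
# MirrorMaker 2 replication.
kubectl exec -n kafka my-kafka-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server my-kafka-cluster-kafka-bootstrap:9092 --describe \
  > kafka-topics-describe.txt

# List the broker PersistentVolumeClaims to snapshot with your storage tooling
kubectl get pvc -n kafka -l strimzi.io/cluster=my-kafka-cluster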

# Export topic configurations
kubectl get kafkatopics -n kafka -o yaml > kafka-topics-backup.yaml

Disaster Recovery

# MirrorMaker 2 for cross-cluster replication
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaMirrorMaker2
metadata:
  name: my-mirror-maker-2
  namespace: kafka
spec:
  version: 3.5.1
  replicas: 2
  connectCluster: "target-cluster"
  clusters:
    - alias: "source-cluster"
      bootstrapServers: kafka-source:9092
    - alias: "target-cluster"
      bootstrapServers: kafka-target:9092
  mirrors:
    - sourceCluster: "source-cluster"
      targetCluster: "target-cluster"
      sourceConnector: {}  # creates the MirrorSourceConnector that copies topic data
      topicsPattern: "orders|payments|users"
      topicsExcludePattern: ".*internal"
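
It is worth checking which topic names the include and exclude patterns above would select before deploying. The sketch below approximates MirrorMaker 2's behaviour, assuming the patterns must match the whole topic name and that exclusion wins over inclusion; grep -E -x stands in for Java's regex engine, which is close enough for simple patterns like these. The function name is_mirrored and the sample topic names are illustrative.

```shell
#!/usr/bin/env bash
# Report whether a topic would be replicated under the patterns used in this guide.
# grep -x requires the pattern to match the entire line (the whole topic name).
is_mirrored() {
  local topic="$1"
  local include='orders|payments|users'
  local exclude='.*internal'
  if printf '%s\n' "$topic" | grep -E -x -q -- "$exclude"; then
    return 1
  fi
  printf '%s\n' "$topic" | grep -E -x -q -- "$include"
}

for topic in orders payments orders-internal inventory; do
  if is_mirrored "$topic"; then
    echo "$topic: mirrored"
  else
    echo "$topic: skipped"
  fi
done
```

Here orders and payments are mirrored, while orders-internal and inventory are skipped: the first matches the exclude pattern and the second matches neither pattern.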

Summary

Running Kafka on Kubernetes with Strimzi provides automated operations, scalable deployment, and reliable storage. For production deployments, add autoscaling, monitoring and alerting, and tested backup and disaster recovery procedures.
