Kafka Deep Dive

Apache Kafka is the de facto standard for event streaming. Master its architecture, partitioning model, consumer groups, and exactly-once semantics for building event-driven systems.

Distributed: Horizontally scalable across many brokers
Durable: Persistent log with configurable retention
Real-Time: Low-latency event delivery, typically in the millisecond range

Kafka is not just a message queue—it's a distributed event log.

Kafka Architecture

Definition: Apache Kafka

Apache Kafka is a distributed event streaming platform that stores events in an immutable, append-only log. Producers write events to topics, and consumers read from topics at their own pace. Kafka provides durability through replication, ordering within partitions, and horizontal scalability through partitioning.

Key Concepts

Concept	Description
Topic	Named stream of events (like a table)
Partition	Unit of parallelism within a topic
Offset	Unique sequential ID for each event in a partition
Broker	Kafka server that stores and serves data
Replica	Copy of a partition for fault tolerance
Consumer Group	Group of consumers that divide partition consumption
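
To make the topic, partition, and offset vocabulary concrete, here is a minimal in-memory sketch of a single partition. The `PartitionLog` class and its method names are hypothetical and exist only for illustration; a real broker persists the log to disk and replicates it.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of events."""
    events: list = field(default_factory=list)

    def append(self, event):
        """Append an event and return its offset (its position in the log)."""
        self.events.append(event)
        return len(self.events) - 1

    def read(self, offset, max_events=10):
        """Return up to max_events events starting at the given offset."""
        return self.events[offset:offset + max_events]


log = PartitionLog()
first = log.append("order-created")
second = log.append("order-paid")

# Each reader tracks its own offset; reading never removes events from the log.
slow_reader_offset = 0   # still at the beginning
fast_reader_offset = 2   # already caught up
slow_batch = log.read(slow_reader_offset)
fast_batch = log.read(fast_reader_offset)
```

Because consumption is just "read from an offset", a slow consumer and a fast consumer can share the same log without interfering with each other.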

Partitioning Model

Definition: Kafka Partitioning

Partitions are the fundamental unit of parallelism in Kafka. Each topic has one or more partitions, and each partition is an ordered, immutable sequence of events. Events with the same key are routed to the same partition (as long as the partition count stays the same), which preserves order per key.

Partition Assignment

$$P(event) = hash(event.key) \bmod N_{partitions}$$

Here,

$P(event)$ = Partition assigned to the event
$event.key$ = The event's partition key
$N_{partitions}$ = Total number of partitions
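
The formula can be sketched in a few lines of Python. This is an illustration under one stated assumption: `zlib.crc32` stands in for the hash function. Kafka's default Java partitioner hashes the serialized key bytes with murmur2, so real partition numbers will differ from what this sketch produces. The `partition_for` helper is a hypothetical name.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Compute hash(key) mod N, using CRC32 as a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["user-1", "user-2", "user-1", "user-3", "user-1"]
assignments = [partition_for(k, 6) for k in keys]

# Every occurrence of "user-1" lands in the same partition,
# so its events keep their relative order.
user_1_partitions = {p for k, p in zip(keys, assignments) if k == "user-1"}
```

The property that matters is determinism: the same key and the same partition count always give the same partition.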

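The number of partitions is set at topic creation; it can be increased later but never decreased. More partitions mean more parallelism but also more file handles, more memory usage, and potentially higher end-to-end latency. Increasing the count also changes which partition a key maps to, so per-key ordering can break across the change. Start with a reasonable number (e.g., 6-12) and scale as needed.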
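
One consequence of adding partitions later is worth checking by hand: because assignment is hash mod N, changing N changes where some keys land. The sketch below measures this for a doubling from 6 to 12 partitions. It assumes CRC32 as a stand-in hash (not Kafka's actual partitioner) and uses made-up keys, so the exact fraction is illustrative only.

```python
import zlib


def partition_for(key, num_partitions):
    """hash(key) mod N with CRC32 as a stand-in hash (assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(1000)]  # hypothetical keys

# Keys whose partition differs after growing the topic from 6 to 12 partitions.
moved = [k for k in keys if partition_for(k, 6) != partition_for(k, 12)]
moved_fraction = len(moved) / len(keys)

# With mod-N assignment, doubling N either keeps a key where it was
# or moves it to (old partition + 6); it never lands anywhere else.
consistent = all(partition_for(k, 12) % 6 == partition_for(k, 6) for k in keys)
```

Events written for a moved key before the change stay in the old partition, while new events go to the new one, which is why ordering for that key is no longer guaranteed across the resize.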

Consumer Groups

Definition: Consumer Group

A consumer group is a set of consumers that collaboratively consume events from a topic. Each partition is assigned to exactly one consumer in the group at a time, so events within a partition are processed in order. A single consumer may own several partitions, but two consumers in the same group never share one.

Configuration	Effect
N consumers = N partitions	Each consumer gets one partition
N consumers < N partitions	Some consumers get multiple partitions
N consumers > N partitions	Some consumers are idle
Rebalancing	Automatic redistribution when consumers join/leave

Consumer Group Scaling

Topic "orders" has 6 partitions with 3 consumers in a group:

Initial state: Each consumer handles 2 partitions
Consumer 4 joins: Rebalance → 2 consumers get 2, 2 consumers get 1
Consumer 1 fails (starting again from the initial state): Rebalance → remaining 2 consumers handle 3 each

This automatic rebalancing lets event processing scale horizontally: adding consumers spreads the partitions further, up to one consumer per partition.
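
The partition counts in the scaling scenario can be reproduced with a simplified round-robin assignment. This is only a sketch: `assign_partitions` and the consumer names are hypothetical, and Kafka's built-in assignors also account for things like multiple topics and minimizing partition movement during a rebalance.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across consumers round-robin (simplified assignor)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(6))  # topic "orders" with 6 partitions

initial = assign_partitions(partitions, ["c1", "c2", "c3"])
after_join = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])
after_failure = assign_partitions(partitions, ["c2", "c3"])

# More consumers than partitions: the extra consumers receive nothing.
oversized = assign_partitions(partitions, [f"c{i}" for i in range(1, 9)])
idle_consumers = [c for c, owned in oversized.items() if not owned]
```

The last case shows why the partition count is the ceiling on useful consumers in a group: with 8 consumers and 6 partitions, two consumers sit idle.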

Exactly-Once Semantics

Definition: Exactly-Once in Kafka

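Exactly-once semantics in Kafka ensures that each event takes effect exactly once, even when producers retry or consumers restart. Within Kafka this is achieved through idempotent producers, transactional APIs, and consumer offset management within transactions. Extending the guarantee to a downstream system such as a database requires that system to take part, for example through idempotent writes or a transactional outbox.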

Mechanism	Purpose
Idempotent producer	Prevents duplicate writes during retries
Transactions	Atomic writes across multiple partitions
Consumer offset in transaction	Commit offset with processed data atomically
Transactional outbox	Atomic database write + event publish
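
The idea behind the idempotent producer can be illustrated with a toy model: the broker remembers the last sequence number it accepted from each producer and ignores a retry that repeats it. The `BrokerPartition` class below is hypothetical and heavily simplified; it is not Kafka's implementation, only a sketch of the deduplication principle.

```python
class BrokerPartition:
    """Toy partition that drops writes whose sequence number was already seen."""

    def __init__(self):
        self.log = []
        self.last_sequence = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, sequence, event):
        """Append the event unless it is a retry; return True if appended."""
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate caused by a retry: ignore it
        self.log.append(event)
        self.last_sequence[producer_id] = sequence
        return True


partition = BrokerPartition()
results = [
    partition.append("producer-A", 0, "payment-1"),
    partition.append("producer-A", 1, "payment-2"),
    partition.append("producer-A", 1, "payment-2"),  # retry after a lost ack
]
```

The retry is acknowledged as harmless rather than written twice, so the log ends up with each payment exactly once.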

Exactly-once semantics requires cooperation between the producer, Kafka, and the consumer. The producer must be idempotent, Kafka must support transactions, and the consumer must commit offsets within a transaction.
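
When the consumer writes to an external database rather than back to Kafka, one common way to get the same effect is to store the consumed offset in the same database transaction as the result. The sketch below shows that pattern and is not Kafka's transactional API. It assumes SQLite as a stand-in for any relational database, and the table names, the `process` function, and the event shape are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (event_id TEXT PRIMARY KEY, amount INTEGER)")
conn.execute(
    "CREATE TABLE offsets (partition_id INTEGER PRIMARY KEY, next_offset INTEGER)"
)
conn.commit()


def process(conn, partition_id, offset, event):
    """Store the result and the consumed offset in one database transaction."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE partition_id = ?", (partition_id,)
    ).fetchone()
    if row is not None and offset < row[0]:
        return False  # already processed: a redelivery after a crash or rebalance
    with conn:  # commits both writes together, or rolls both back on error
        conn.execute(
            "INSERT INTO payments (event_id, amount) VALUES (?, ?)",
            (event["id"], event["amount"]),
        )
        conn.execute(
            "INSERT OR REPLACE INTO offsets (partition_id, next_offset) VALUES (?, ?)",
            (partition_id, offset + 1),
        )
    return True


event = {"id": "pay-1", "amount": 500}
first = process(conn, 0, 0, event)
redelivered = process(conn, 0, 0, event)  # same offset delivered again
payment_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
stored_offset = conn.execute(
    "SELECT next_offset FROM offsets WHERE partition_id = 0"
).fetchone()[0]
```

On restart, a consumer using this approach would read `next_offset` from the database and resume from there instead of relying on the offset committed to Kafka.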

Kafka Retention

Retention Type	Description
Time-based	Delete events after N days (default: 7 days)
Size-based	Keep only the last N GB per partition
Log compaction	Keep only the latest value per key
Infinite	Never delete (requires sufficient disk)
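
Log compaction is the least intuitive of these policies, so a small sketch may help. The `compact` function below is a hypothetical illustration of the end result only: it keeps the most recent record for each key and preserves the original offsets. Real compaction runs in the background on older log segments, so a topic may temporarily hold more than one record per key.

```python
def compact(log):
    """Keep only the latest record per key; surviving records keep their offsets."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # a later record overwrites an earlier one
    return sorted((offset, key, value) for key, (offset, value) in latest.items())


log = [
    ("user-1", "email=a@example.com"),  # offset 0
    ("user-2", "email=b@example.com"),  # offset 1
    ("user-1", "email=c@example.com"),  # offset 2, supersedes offset 0
]
compacted = compact(log)
```

After compaction, offset 0 is gone, and offsets 1 and 2 remain unchanged. A compacted topic therefore works like a changelog from which the current value for every key can be rebuilt.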

Practice Exercises

Topic Design: Design the Kafka topics for an e-commerce order system. What topics would you create, how many partitions, and what retention policy?
Consumer Design: Design a consumer group for processing payment events. How do you ensure exactly-once processing when writing to a PostgreSQL database?
Partitioning Strategy: For a topic with user events, choose a partitioning strategy that ensures events for the same user are ordered but load is balanced. What happens when a user has significantly more events than others?
Architecture Decision: Compare Kafka with RabbitMQ for a task queue system. What are the trade-offs in terms of ordering, throughput, and replay capability?

Key Takeaways:

Kafka stores events in an immutable, append-only log
Partitions provide parallelism; consumer groups divide partition consumption
Exactly-once semantics requires idempotent producers, transactions, and consumer coordination
Retention can be time-based, size-based, or log-compacted
The number of partitions caps the parallelism of a consumer group
Kafka is ideal for event sourcing, data pipelines, and real-time analytics

What to Learn Next

-> Stream Processing Real-time data processing with Flink, Spark Streaming, and Kafka Streams.

-> Redis Deep Dive Redis data structures, persistence, clustering, and use cases.

-> Event-Driven Architecture Event sourcing, CQRS, and message-driven systems.

-> Message Queues Async processing, event-driven architecture, and pub/sub patterns.

-> Data Lake Architecture Storage, processing, and governance for large-scale data.

-> Batch Processing MapReduce, Spark, and distributed batch processing.
