Data Systems
Kafka Deep Dive
Apache Kafka is the de facto standard for event streaming. Master its architecture, partitioning model, consumer groups, and exactly-once semantics for building event-driven systems.
- Distributed: Horizontally scalable across many brokers
- Durable: Persistent log with configurable retention
- Real-Time: Low-latency event delivery, typically in the millisecond range
Kafka is not just a message queue—it's a distributed event log.
Kafka Architecture
Definition: Apache Kafka
Apache Kafka is a distributed event streaming platform that stores events in an immutable, append-only log. Producers write events to topics, and consumers read from topics at their own pace. Kafka provides durability through replication, ordering within partitions, and horizontal scalability through partitioning.
Key Concepts
| Concept | Description |
|---|---|
| Topic | Named stream of events (like a table) |
| Partition | Unit of parallelism within a topic |
| Offset | Unique sequential ID for each event in a partition |
| Broker | Kafka server that stores and serves data |
| Replica | Copy of a partition for fault tolerance |
| Consumer Group | Group of consumers that divide partition consumption |
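To make the log, offset, and "read at their own pace" ideas concrete, here is a minimal in-memory sketch of a single partition. It is a toy model, not Kafka's implementation: `PartitionLog` and its methods are hypothetical names invented for this illustration, and there is no networking, replication, or persistence.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of events."""
    events: list = field(default_factory=list)

    def append(self, event):
        # The offset is simply the position of the event in the log.
        self.events.append(event)
        return len(self.events) - 1

    def read(self, offset, max_events=10):
        # Reading never removes events; each reader tracks its own offset.
        return self.events[offset:offset + max_events]


log = PartitionLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# Two independent readers at different positions in the same log.
slow_reader_offset = 1  # has read only the first event
fast_reader_offset = 3  # has read everything so far

print(log.read(slow_reader_offset))  # ['order-paid', 'order-shipped']
print(log.read(fast_reader_offset))  # []
```

Because reads do not consume events, a new reader can start from offset 0 and replay the full history, which is the main difference from a traditional queue.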
Partitioning Model
Definition: Kafka Partitioning
A partition is the fundamental unit of parallelism in Kafka. Each topic has one or more partitions, and each partition is an ordered, immutable sequence of events. With the default partitioner, events with the same key are written to the same partition (as long as the partition count does not change), which preserves order per key.
Partition Assignment
$$p = \operatorname{hash}(k) \bmod N$$
Here,
- $p$ = Partition assigned to the event
- $k$ = The event's partition key
- $N$ = Total number of partitions
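The rule for keyed events (hash the key, then take the remainder modulo the partition count) can be sketched in a few lines. This is an illustration only: `choose_partition` is a hypothetical helper, and `zlib.crc32` is a stand-in hash. Kafka's default partitioner uses murmur2 on the serialized key, so the actual partition numbers in a real cluster will differ.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    # zlib.crc32 is a stand-in hash chosen for this sketch; Kafka's default
    # partitioner uses murmur2, so real partition numbers will differ.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["user-1", "user-2", "user-1", "user-3", "user-1"]
for key in keys:
    # Every "user-1" event lands in the same partition, preserving its order.
    print(key, "->", choose_partition(key, 6))
```

For a fixed partition count the mapping is stable, which is what gives per-key ordering. Calling the function with a different `num_partitions` can move a key to another partition, which is why changing the partition count of a keyed topic needs care.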
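The number of partitions is set at topic creation; it can be increased later but never decreased. Increasing it changes the key-to-partition mapping, so per-key ordering does not hold across the change. More partitions mean more parallelism but also more file handles, more memory usage, and potentially higher end-to-end latency. Start with a moderate number (e.g., 6-12) and scale as needed.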
Consumer Groups
Definition: Consumer Group
A consumer group is a set of consumers that collaboratively consume events from a topic. Each partition is assigned to exactly one consumer in the group at a time, so events within a partition are processed in order. A consumer may own several partitions, but two consumers in the same group never share one.
| Configuration | Effect |
|---|---|
| Consumers = partitions | Each consumer gets one partition |
| Consumers < partitions | Some consumers get multiple partitions |
| Consumers > partitions | The extra consumers sit idle |
| Rebalancing | Automatic redistribution when consumers join/leave |
Consumer Group Scaling
Topic "orders" has 6 partitions with 3 consumers in a group:
- Initial state: each consumer handles 2 partitions
- Consumer 4 joins: rebalance → 2 consumers get 2 partitions, 2 consumers get 1
- Consumer 1 fails (from the initial 3-consumer state): rebalance → the remaining 2 consumers handle 3 each

This automatic rebalancing lets event processing scale horizontally, up to one active consumer per partition.
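The partition counts in the scenario above can be reproduced with a small simulation. This is a sketch under simplifying assumptions: `assign` is a hypothetical helper that deals partitions out round-robin, whereas Kafka's built-in assignors (range, round-robin, sticky, cooperative sticky) may choose a different layout while giving the same counts per consumer here.

```python
def assign(partitions, consumers):
    """Spread partitions over consumers round-robin (hypothetical helper)."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = list(range(6))  # topic "orders" with 6 partitions

initial = assign(partitions, ["c1", "c2", "c3"])
after_join = assign(partitions, ["c1", "c2", "c3", "c4"])
after_failure = assign(partitions, ["c2", "c3"])  # c1 fails from the initial state

for label, result in [("initial", initial),
                      ("c4 joins", after_join),
                      ("c1 fails", after_failure)]:
    print(label, result)
```

The three results hold 2/2/2, then 2/2/1/1, then 3/3 partitions per consumer. Passing more than six consumers leaves the extras with an empty list, which is the idle-consumer case from the table.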
Exactly-Once Semantics
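Definition: Exactly-Once in Kafka
Exactly-once semantics in Kafka means that the effect of each event is applied exactly once, even when producers retry or consumers restart. Within Kafka (consume, process, produce back to Kafka), this is achieved through idempotent producers, the transactional API, and committing consumer offsets inside the same transaction. Extending the guarantee to a downstream system outside Kafka additionally requires idempotent writes or storing offsets atomically with the results in that system.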
| Mechanism | Purpose |
|---|---|
| Idempotent producer | Prevents duplicate writes during retries |
| Transactions | Atomic writes across multiple partitions |
| Consumer offset in transaction | Commit offset with processed data atomically |
| Transactional outbox | Atomic database write + event publish |
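The idempotent-producer row can be illustrated with a toy model of the deduplication idea: each write carries a producer ID and a sequence number, and the partition ignores a sequence number it has already stored. `PartitionWithDedup` is a hypothetical class for this sketch; real brokers track sequence numbers per producer and partition at the batch level and also reject out-of-order sequences, which is omitted here.

```python
class PartitionWithDedup:
    """Toy partition that drops retried writes it has already stored."""

    def __init__(self):
        self.log = []
        self.last_sequence = {}  # producer_id -> last sequence number stored

    def append(self, producer_id, sequence, event):
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate from a retry; already in the log
        self.log.append(event)
        self.last_sequence[producer_id] = sequence
        return True


partition = PartitionWithDedup()
partition.append("producer-a", 0, "order-created")
partition.append("producer-a", 1, "order-paid")
# The acknowledgement for sequence 1 was lost, so the producer retries it.
retry_stored = partition.append("producer-a", 1, "order-paid")

print(retry_stored)   # False
print(partition.log)  # ['order-created', 'order-paid']
```

Without the sequence check, the retry would append "order-paid" a second time, which is the duplicate that at-least-once delivery allows.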
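Exactly-once semantics requires cooperation between the producer, Kafka, and the consumer. The producer must be idempotent and transactional, the brokers must support transactions, the consuming application must commit its offsets as part of the transaction, and downstream consumers must read with isolation.level=read_committed so they never see aborted writes.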
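When the results go to a database rather than back to Kafka, one common way to get the same effect is to store the consumer's offset in the same database transaction as the processed data. The sketch below shows that idea with Python's built-in `sqlite3` module. It is not Kafka's transactional API: the table names, the `process` function, and the hard-coded `events` list are all assumptions made for the illustration, and a real consumer would fetch events from a broker and seek to the stored offset on startup.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, amount INTEGER)")
db.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO offsets VALUES (0, 0)")
db.commit()

# Hypothetical events already fetched from partition 0: (offset, id, amount).
events = [(0, "pay-1", 500), (1, "pay-2", 1200), (2, "pay-3", 80)]


def process(batch):
    for offset, payment_id, amount in batch:
        (next_offset,) = db.execute(
            "SELECT next_offset FROM offsets WHERE part = 0"
        ).fetchone()
        if offset < next_offset:
            continue  # already processed before a crash or redelivery
        with db:  # one transaction: both writes commit, or neither does
            db.execute("INSERT INTO payments VALUES (?, ?)", (payment_id, amount))
            db.execute(
                "UPDATE offsets SET next_offset = ? WHERE part = 0", (offset + 1,)
            )


process(events)
process(events)  # simulated redelivery of the same batch

print(db.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # 3
```

Because the payment row and the offset move together, a crash can never leave the database with a processed event whose offset was not recorded, or the reverse.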
Kafka Retention
| Retention Type | Description |
|---|---|
| Time-based | Delete events after N days (default: 7 days) |
| Size-based | Keep only the last N GB per partition |
| Log compaction | Keep only the latest value per key |
| Infinite | Never delete (requires sufficient disk) |
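Log compaction is the least intuitive of these policies, so here is a minimal sketch of its end result: for each key, only the most recent record survives. `compact` is a hypothetical function written for this illustration. Real compaction runs in the background on older log segments, leaves the active segment untouched, and treats records with a null value as delete markers, none of which is modeled here.

```python
def compact(log):
    """Keep only the most recent record for each key, in log order."""
    latest_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset
    ]


log = [
    ("user-1", "email=a@example.com"),
    ("user-2", "email=b@example.com"),
    ("user-1", "email=c@example.com"),
]

print(compact(log))
# [('user-2', 'email=b@example.com'), ('user-1', 'email=c@example.com')]
```

A compacted topic therefore behaves like a changelog from which the latest state per key can always be rebuilt, which suits use cases such as user profiles or configuration.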
Practice Exercises
- Topic Design: Design the Kafka topics for an e-commerce order system. What topics would you create, how many partitions, and what retention policy?
- Consumer Design: Design a consumer group for processing payment events. How do you ensure exactly-once processing when writing to a PostgreSQL database?
- Partitioning Strategy: For a topic with user events, choose a partitioning strategy that ensures events for the same user are ordered but load is balanced. What happens when a user has significantly more events than others?
- Architecture Decision: Compare Kafka with RabbitMQ for a task queue system. What are the trade-offs in terms of ordering, throughput, and replay capability?
Key Takeaways:
- Kafka stores events in an immutable, append-only log
- Partitions provide parallelism; consumer groups divide partition consumption
- Exactly-once semantics requires idempotent producers, transactions, and consumer coordination
- Retention can be time-based, size-based, or log-compacted
- The number of partitions determines maximum parallelism
- Kafka is ideal for event sourcing, data pipelines, and real-time analytics
What to Learn Next
- Stream Processing: Real-time data processing with Flink, Spark Streaming, and Kafka Streams.
- Redis Deep Dive: Redis data structures, persistence, clustering, and use cases.
- Event-Driven Architecture: Event sourcing, CQRS, and message-driven systems.
- Message Queues: Async processing, event-driven architecture, and pub/sub patterns.
- Data Lake Architecture: Storage, processing, and governance for large-scale data.
- Batch Processing: MapReduce, Spark, and distributed batch processing.