Kafka System Design Interview
Difficulty: Senior Level | Companies: LinkedIn, Uber, Netflix, Spotify, Confluent
Content
System design interviews test your ability to architect distributed systems using Kafka. Focus on requirements, trade-offs, and practical implementation.
Design Framework
Architecture Diagram
System Design Process:
1. Requirements Clarification
βββ Functional requirements
βββ Non-functional requirements
βββ Constraints
2. High-Level Design
βββ Component diagram
βββ Data flow
βββ API design
3. Deep Dive
βββ Kafka architecture
βββ Schema design
βββ Partitioning strategy
4. Trade-offs
βββ Consistency vs availability
βββ Latency vs throughput
βββ Cost vs performance
Design Problem: E-Commerce Order System
Requirements
Architecture Diagram
Functional Requirements:
βββ Place orders
βββ Process payments
βββ Update inventory
βββ Send notifications
βββ Track order status
Non-Functional Requirements:
βββ High availability (99.99%)
βββ Low latency (< 100ms)
βββ Exactly-once processing
βββ Orderly processing per user
βββ Support 100K orders/day
Constraints:
βββ Existing PostgreSQL database
βββ Must integrate with payment gateway
βββ Budget for infrastructure
High-Level Design
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Order Service β
β βββ Kafka Producer β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Kafka Cluster β
β βββ orders topic (64 partitions) β
β βββ payments topic (32 partitions) β
β βββ inventory topic (32 partitions) β
β βββ notifications topic (16 partitions) β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Consumer Services β
β βββ Payment Service β
β βββ Inventory Service β
β βββ Notification Service β
β βββ Analytics Service β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
Topic Design
{
"topics": {
"orders": {
"partitions": 64,
"replication_factor": 3,
"retention_ms": 604800000,
"cleanup_policy": "delete",
"key_strategy": "user_id"
},
"payments": {
"partitions": 32,
"replication_factor": 3,
"retention_ms": 2592000000,
"cleanup_policy": "compact,delete",
"key_strategy": "order_id"
},
"inventory": {
"partitions": 32,
"replication_factor": 3,
"retention_ms": 86400000,
"cleanup_policy": "delete",
"key_strategy": "product_id"
},
"notifications": {
"partitions": 16,
"replication_factor": 3,
"retention_ms": 86400000,
"cleanup_policy": "delete",
"key_strategy": "user_id"
}
}
}
Java Implementation
Order Producer
@Service
public class OrderService {
private final KafkaTemplate<String, Order> kafkaTemplate;
private final SchemaRegistryClient schemaRegistry;
@Autowired
public OrderService(KafkaTemplate<String, Order> kafkaTemplate) {
this.kafkaTemplate = kafkaTemplate;
}
@Transactional
public Order placeOrder(OrderRequest request) {
// Validate order
validateOrder(request);
// Create order
Order order = createOrder(request);
// Send to Kafka (atomic with DB transaction)
kafkaTemplate.send("orders", order.getUserId(), order);
return order;
}
private void validateOrder(OrderRequest request) {
if (request.getItems().isEmpty()) {
throw new InvalidOrderException("Order must have items");
}
if (request.getTotalAmount().compareTo(BigDecimal.ZERO) <= 0) {
throw new InvalidOrderException("Invalid amount");
}
}
private Order createOrder(OrderRequest request) {
Order order = new Order();
order.setOrderId(UUID.randomUUID().toString());
order.setUserId(request.getUserId());
order.setItems(request.getItems());
order.setTotalAmount(request.getTotalAmount());
order.setStatus(OrderStatus.PENDING);
order.setCreatedAt(Instant.now());
return order;
}
}
Payment Consumer
@Service
public class PaymentConsumer {
private final PaymentService paymentService;
private final KafkaTemplate<String, Payment> kafkaTemplate;
@KafkaListener(
topics = "orders",
groupId = "payment-service",
containerFactory = "kafkaListenerContainerFactory"
)
@Transactional
public void processOrder(Order order) {
try {
// Process payment
Payment payment = paymentService.charge(order);
// Send payment event
kafkaTemplate.send("payments", order.getOrderId(), payment);
// Update order status
order.setStatus(OrderStatus.PAYMENT_PROCESSED);
} catch (PaymentException e) {
// Handle payment failure
order.setStatus(OrderStatus.PAYMENT_FAILED);
kafkaTemplate.send("orders", order.getUserId(), order);
}
}
}
Python Implementation
Order Producer
from fastapi import FastAPI, HTTPException
from kafka import KafkaProducer
from pydantic import BaseModel
from typing import List
import json
import uuid
app = FastAPI()
producer = KafkaProducer(
bootstrap_servers=['kafka1:9092', 'kafka2:9092'],
key_serializer=lambda k: k.encode('utf-8'),
value_serializer=lambda v: json.dumps(v).encode('utf-8'),
acks='all',
enable_idempotence=True,
transactional_id='order-service-tx'
)
producer.init_transactions()
class OrderItem(BaseModel):
product_id: str
quantity: int
price: float
class OrderRequest(BaseModel):
user_id: str
items: List[OrderItem]
@app.post("/orders")
async def create_order(request: OrderRequest):
try:
# Validate
if not request.items:
raise HTTPException(status_code=400, detail="Order must have items")
# Create order
order = {
"order_id": str(uuid.uuid4()),
"user_id": request.user_id,
"items": [item.dict() for item in request.items],
"total_amount": sum(item.price * item.quantity for item in request.items),
"status": "PENDING"
}
# Send to Kafka
producer.begin_transaction()
producer.send("orders", key=order["user_id"], value=order)
producer.commit_transaction()
return {"order_id": order["order_id"], "status": "created"}
except Exception as e:
producer.abort_transaction()
raise HTTPException(status_code=500, detail=str(e))
Payment Consumer
from kafka import KafkaConsumer, KafkaProducer
import json
consumer = KafkaConsumer(
'orders',
bootstrap_servers=['kafka1:9092'],
group_id='payment-service',
enable_auto_commit=False,
auto_offset_reset='earliest'
)
producer = KafkaProducer(
bootstrap_servers=['kafka1:9092'],
key_serializer=lambda k: k.encode('utf-8'),
value_serializer=lambda v: json.dumps(v).encode('utf-8'),
acks='all',
enable_idempotence=True,
transactional_id='payment-service-tx'
)
producer.init_transactions()
for message in consumer:
order = json.loads(message.value)
try:
# Process payment
payment = process_payment(order)
# Send payment event
producer.begin_transaction()
producer.send("payments", key=order["order_id"], value=payment)
# Update order status
order["status"] = "PAYMENT_PROCESSED"
producer.send("orders", key=order["user_id"], value=order)
producer.commit_transaction()
consumer.commit()
except Exception as e:
producer.abort_transaction()
print(f"Payment failed: {e}")
Schema Design
Avro Schemas
// Order schema
{
"type": "record",
"name": "Order",
"namespace": "com.example.events",
"fields": [
{"name": "order_id", "type": "string"},
{"name": "user_id", "type": "string"},
{"name": "items", "type": {"type": "array", "items": "OrderItem"}},
{"name": "total_amount", "type": "double"},
{"name": "status", "type": "OrderStatus"},
{"name": "created_at", "type": "long", "logicalType": "timestamp-millis"}
]
}
// Payment schema
{
"type": "record",
"name": "Payment",
"namespace": "com.example.events",
"fields": [
{"name": "payment_id", "type": "string"},
{"name": "order_id", "type": "string"},
{"name": "amount", "type": "double"},
{"name": "status", "type": "PaymentStatus"},
{"name": "processed_at", "type": "long", "logicalType": "timestamp-millis"}
]
}
Monitoring & Alerting
# Prometheus alerts
groups:
- name: kafka-alerts
rules:
- alert: HighConsumerLag
expr: kafka_consumergroup_current_offset - kafka_consumergroup_end_offset > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "High consumer lag detected"
- alert: TransactionFailed
expr: rate(kafka_producer_transaction_failed_total[5m]) > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Transaction failures detected"
- alert: OrderProcessingDelayed
expr: histogram_quantile(0.99, rate(order_processing_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Order processing latency high"
Performance Considerations
Architecture Diagram
Partitioning Strategy:
βββ Use user_id as key for ordering guarantees
βββ 64 partitions for orders (100K/day = ~70 orders/min)
βββ 32 partitions for payments (lower volume)
βββ 16 partitions for notifications (lowest volume)
Replication:
βββ Replication factor: 3
βββ min.insync.replicas: 2
βββ acks: all
Retention:
βββ Orders: 7 days (for debugging)
βββ Payments: 30 days (for audit)
βββ Notifications: 1 day (ephemeral)
βββ Use compacted topics for state
βΉοΈ
Key Insight: In system design interviews, focus on the trade-offs you make and justify your decisions. There's no single "correct" answer - demonstrate your understanding of Kafka's capabilities and limitations.
Follow-Up Questions
- How would you handle a sudden spike in order volume?
- What happens if the payment service is down?
- How would you implement order cancellation?
- Explain how you would handle duplicate orders.
- How would you scale this system to support 1M orders/day?