Data Systems
Stream Processing
Stream processing enables real-time data analysis and event-driven architectures. Master the fundamentals of event time, windowing, state management, and exactly-once processing semantics.
- Real-Time β Process events as they arrive, not in batches
- Stateful β Maintain context across events for complex analytics
- Exactly-Once β Guarantee each event is processed exactly once
Stream processing turns data in motion into data at rest.
Stream Processing Fundamentals
DfStream Processing
Stream processing is a programming paradigm where data is processed continuously as it arrives, rather than in batches. It enables real-time analytics, event-driven workflows, and immediate responses to changing data. Stream processing systems handle high-throughput, low-latency data flows with state management and fault tolerance.
Batch vs Stream Processing
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Data | Bounded (finite) | Unbounded (infinite) |
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | High (optimized for volume) | Moderate (optimized for latency) |
| Use Case | Analytics reports, ETL | Real-time dashboards, alerts |
| Complexity | Simpler | More complex (state, ordering) |
Modern architectures often combine both: stream processing for real-time insights and batch processing for historical analysis. The Lambda Architecture uses both paths; the Kappa Architecture uses only stream processing.
Event Time vs Processing Time
DfEvent Time vs Processing Time
Event time is when the event actually occurred (embedded in the data). Processing time is when the event is processed by the system. The difference between them is called lateness. Stream processing systems must handle out-of-order events and late-arriving data.
Windowing Strategies
DfWindowing
Windowing divides an unbounded stream into finite segments (windows) for processing. Windows group events by time or count, enabling aggregation and analysis over bounded datasets. The choice of windowing strategy affects the completeness and accuracy of results.
| Window Type | Description | Use Case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping | Hourly aggregates |
| Sliding | Fixed-size, overlapping | Rolling averages |
| Session | Activity-based with gap | User session analysis |
| Global | All events in one window | Running totals |
Window Aggregation
Here,
- =Event i in the stream
- =Set of events in the window
- =Aggregation function (count, sum, avg, etc.)
Tumbling Window for Website Metrics
Process website clicks in 1-minute tumbling windows:
Window [10:00:00 - 10:01:00): 45 clicks β count = 45 Window [10:01:00 - 10:02:00): 52 clicks β count = 52 Window [10:02:00 - 10:03:00): 38 clicks β count = 38
Late event (arrived at 10:02:30 but occurred at 10:01:45) goes to the [10:01:00 - 10:02:00) window if allowed.
Exactly-Once Semantics
DfExactly-Once Processing
Exactly-once processing guarantees that each event is processed exactly once, with no duplicates and no misses. This is achieved through idempotent operations, transactional writes, and coordination between the source and sink. It is one of the hardest problems in stream processing.
| Guarantee | Description | Complexity |
|---|---|---|
| At-most-once | May lose events, no duplicates | Low |
| At-least-once | May have duplicates, no losses | Moderate |
| Exactly-once | No losses, no duplicates | High |
Exactly-once semantics is often misunderstood. It means exactly-once effects, not exactly-once processing. The system may process an event multiple times but ensure the downstream effects (writes, side effects) happen only once.
State Management
DfStateful Stream Processing
Stateful stream processing maintains information across events. Examples include counting events per user, tracking the latest value per device, or detecting patterns across events. State management is critical for correctness and must handle failures, scaling, and reprocessing.
| State Backend | Use Case |
|---|---|
| RocksDB | Large state, local storage |
| Heap | Small state, fast access |
| External | Shared state, distributed |
Stream Processing Frameworks
| Framework | Model | State Management | Exactly-Once |
|---|---|---|---|
| Apache Flink | Event-time, true streaming | RocksDB, heap | Yes |
| Spark Streaming | Micro-batch | Checkpointing | Yes |
| Kafka Streams | Event-time, embedded | RocksDB | Yes |
Flink is the gold standard for stream processing. It provides true event-time processing, exactly-once semantics, and flexible state management. Use Kafka Streams for lightweight embedded processing and Spark Streaming for micro-batch analytics.
Practice Exercises
-
Windowing Design: Design a stream processing pipeline that computes the average temperature from IoT sensors every 5 minutes, with a 1-minute late tolerance. What windowing strategy would you use?
-
Exactly-Once Design: Explain how you would achieve exactly-once semantics for a stream processing pipeline that reads from Kafka, transforms events, and writes to PostgreSQL.
-
State Management: Design a fraud detection system that flags transactions more than 3 standard deviations above a user's average. What state needs to be maintained, and how do you handle state recovery?
-
Architecture Comparison: Compare Flink and Spark Streaming for a real-time analytics dashboard. What are the trade-offs in terms of latency, complexity, and fault tolerance?
Key Takeaways:
- Stream processing handles unbounded data with low latency
- Event time vs processing time affects result correctness
- Windowing strategies (tumbling, sliding, session) determine aggregation scope
- Exactly-once semantics requires coordination between source and sink
- State management is critical for correctness and fault tolerance
- Flink is the gold standard; Kafka Streams for embedded processing
What to Learn Next
-> Kafka Deep Dive Event streaming, partitioning, and exactly-once semantics.
-> Batch Processing MapReduce, Spark, and distributed batch processing.
-> Event-Driven Architecture Event sourcing, CQRS, and message-driven systems.
-> Message Queues Async processing, event-driven architecture, and pub/sub patterns.
-> Data Lake Architecture Storage, processing, and governance for large-scale data.
-> Realtime Analytics Design Designing real-time analytics dashboards.