Batch vs Streaming: Choosing the Right Processing Paradigm
The fundamental question in data engineering is whether to process data in batches or as a continuous stream.
The Batch-Streaming Spectrum
Processing Paradigms:
| Paradigm | Latency | Throughput | Complexity | Cost |
|---|---|---|---|---|
| Pure Batch (Daily ETL) | Hours | High | Simple ops | Lower |
| Micro-Batch (Spark Streaming) | Seconds | Medium | Medium ops | Medium |
| Pure Streaming (Flink, Kafka Streams) | Milliseconds | Lower | Complex ops | Higher |
Key Considerations:
- Latency requirements β how fast do you need results?
- Data volume β how much data are you processing?
- Consistency needs β exact-once or at-least-once?
- Operational maturity β can your team manage streaming infrastructure?
Key Insight: The choice is not binary but a design decision based on latency requirements, data volume, consistency needs, and operational maturity. Micro-batch systems (Spark Structured Streaming, Flink with mini-batch) occupy the middle ground.
Lambda vs Kappa Architecture
Processing Spectrum
Architecture Diagram
Batch processing is the execution of a set of jobs (tasks) on a collection of data items simultaneously, typically on a scheduled basis. Data is accumulated over a period (hour, day, week) and processed as a single unit. Batch systems optimize for throughput (records/second) over latency (time to first result). Examples: MapReduce, Spark batch, Hive, traditional ETL tools.
Stream processing is the processing of data in motion β events are processed individually or in small micro-batches as they arrive, without waiting for a complete dataset. Stream systems optimize for latency (time from event to result) over throughput. Examples: Apache Flink, Kafka Streams, Spark Structured Streaming, Apache Storm.
Lambda architecture is a data processing pattern that combines batch and stream processing to provide both comprehensive and real-time views of data. The batch layer provides accurate, comprehensive results on historical data; the speed layer provides low-latency, approximate results on recent data; the serving layer merges both views for queries. The core formula: Query_result = Batch_view(batch_data) βͺ Speed_view(recent_data).
Kappa architecture is a data processing pattern that uses only stream processing for both real-time and historical data processing. Instead of separate batch and speed layers, all data is processed through a single stream processing pipeline. Historical reprocessing is achieved by replaying the event log (Kafka) through an updated stream processor. The core principle: there is no batch layer β only the stream layer.
Latency-Throughput Trade-off
For a processing system with data volume V, processing rate R, and latency L: L = V / R (for batch). For streaming: L β processing_time_per_event. The batch-throughput relationship: Throughput_batch = V / (T_schedule + T_process + T_load). For streaming: Throughput_stream = R_events (sustained rate). The crossover point where streaming becomes cost-effective: V < R * T_budget, where T_budget is the maximum acceptable latency.
Cost Comparison
Total_cost_batch = C_storage + C_compute_batch + C_operational_batch. Total_cost_streaming = C_storage + C_compute_streaming + C_operational_streaming + C_state_management. Batch is cheaper for: high-volume, low-frequency workloads. Streaming is cheaper for: low-volume, high-frequency workloads. Breakeven: V_batch * C_compute_batch = V_streaming * C_compute_streaming * frequency.
The CAP theorem states that a distributed data store can provide at most two of: Consistency, Availability, Partition tolerance. In batch processing, CP systems are common (HDFS with replication factor 3). In streaming, AP systems dominate (Kafka with eventual consistency across partitions). The choice depends on whether your workload prioritizes consistency (financial calculations) or availability (real-time dashboards).
Batch systems provide exactly-once semantics by default (re-run the entire batch). Streaming systems face a fundamental tension: exactly-once requires coordinated transactions (Kafka Transactions, Flink checkpointing) which add latency and complexity. At-least-once with idempotent processing is often more practical for high-throughput streaming. The trade-off: exactly_once = f(complexity, latency, cost); at_least_once = f(simplicity, lower_latency, lower_cost).
Batch vs Streaming Comparison
| Dimension | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Throughput | Very high (optimized for volume) | Moderate (optimized for latency) |
| Complexity | Lower | Higher (state, watermarks, exactly-once) |
| Cost | Lower (scheduled compute) | Higher (always-on compute) |
| Data Completeness | Complete (all data available) | Approximate (late data handling) |
| Reprocessing | Re-run entire batch | Replay from offset |
| Fault Tolerance | Simple (re-run) | Complex (checkpointing, WAL) |
| State Management | Stateless | Stateful (aggregations, joins) |
| Operational Overhead | Lower (scheduled) | Higher (24/7 monitoring) |
| Tooling Maturity | Mature (Spark, Hive, Airflow) | Evolving (Flink, Kafka Streams) |
| Best For | Analytics, reporting, ML training | Alerts, dashboards, fraud detection |
| Data Volume | Unlimited (limited by cluster) | Bounded by consumer lag |
| Schema Evolution | Easy (reprocess entire dataset) | Complex (running migrations) |
- Define latency requirements: If results must be available within seconds -> streaming. If hours/days is acceptable -> batch.
- Assess data arrival pattern: If events arrive continuously -> streaming. If data is generated in discrete chunks -> batch.
- Evaluate completeness needs: If every record must be counted exactly -> batch (or streaming with watermark + completeness guarantee). If approximate is acceptable -> streaming.
- Measure data volume: High-volume (> 1TB/day) favors batch for cost efficiency. Low-volume (< 100GB/day) favors streaming for simplicity.
- Assess operational maturity: Streaming requires 24/7 monitoring, state management expertise, and watermarks. Batch requires scheduling, monitoring, and backfill capabilities.
- Consider reprocessing: If historical reprocessing is frequent, streaming with replayable logs (Kappa) is more maintainable than Lambda architecture.
- Evaluate downstream consumers: If consumers are dashboards/APIs -> streaming. If consumers are ML models/reports -> batch.
- Prototype both approaches: Build minimal POCs for batch and streaming, measure latency, cost, and developer velocity.
- Consider hybrid: Use streaming for real-time alerts + batch for comprehensive analytics (Lambda) or use micro-batch streaming for both.
- Document decision rationale: Record the factors that influenced the choice for future reference.
Production Code
Lambda Architecture: Batch + Streaming Merge
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from datetime import datetime, timedelta
import logging
logger = logging.getLogger(__name__)
class LambdaArchitecture:
"""
Lambda Architecture implementation: batch layer + speed layer + serving layer.
"""
def __init__(self, spark: SparkSession):
self.spark = spark
# ------------------------------------------------------
# BATCH LAYER: Comprehensive, accurate results
# ------------------------------------------------------
def batch_layer(self, input_path: str, batch_output_path: str, execution_date: str):
"""Run batch processing for a complete dataset."""
logger.info(f"Running batch layer for {execution_date}")
# Read complete dataset for the day
raw_df = (
self.spark.read
.parquet(f"{input_path}/date={execution_date}")
)
# Comprehensive aggregation
batch_result = (
raw_df
.groupBy("product_id", "region")
.agg(
F.count("*").alias("total_orders"),
F.sum("amount").alias("total_revenue"),
F.countDistinct("customer_id").alias("unique_customers"),
F.avg("amount").alias("avg_order_value"),
F.percentile_approx("amount", 0.5).alias("median_order_value"),
)
)
# Write batch view
(
batch_result
.write
.mode("overwrite")
.parquet(f"{batch_output_path}/date={execution_date}")
)
logger.info(f"Batch layer completed for {execution_date}")
return batch_result
# ------------------------------------------------------
# SPEED LAYER: Real-time approximations
# ------------------------------------------------------
def speed_layer(self, kafka_bootstrap: str, topic: str, speed_output_path: str):
"""Run streaming processing for real-time approximations."""
logger.info("Starting speed layer streaming query")
raw_stream = (
self.spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafka_bootstrap)
.option("subscribe", topic)
.option("startingOffsets", "latest")
.load()
)
# Parse and aggregate
parsed = (
raw_stream
.select(F.from_json(F.col("value").cast("string"), self._get_schema()).alias("data"))
.select("data.*")
.withWatermark("event_time", "5 minutes")
.groupBy(
F.window("event_time", "5 minutes"),
"product_id",
"region",
)
.agg(
F.count("*").alias("recent_orders"),
F.sum("amount").alias("recent_revenue"),
)
)
query = (
parsed.writeStream
.format("parquet")
.option("path", speed_output_path)
.option("checkpointLocation", f"{speed_output_path}/_checkpoint")
.trigger(processingTime="1 minute")
.outputMode("append")
.start()
)
return query
# ------------------------------------------------------
# SERVING LAYER: Merge batch + speed views
# ------------------------------------------------------
def serving_layer(
self,
batch_output_path: str,
speed_output_path: str,
merged_output_path: str,
execution_date: str,
):
"""Merge batch and speed views into a unified serving view."""
logger.info("Merging batch and speed views")
# Read batch view (complete, accurate)
batch_view = (
self.spark.read
.parquet(f"{batch_output_path}/date={execution_date}")
)
# Read speed view (recent, approximate)
speed_view = (
self.spark.read
.parquet(speed_output_path)
.filter(F.col("window_start") >= F.lit(execution_date))
)
# Merge: prefer batch for historical, speed for recent
merged = (
batch_view.alias("b")
.join(
speed_view.alias("s"),
(F.col("b.product_id") == F.col("s.product_id")) &
(F.col("b.region") == F.col("s.region")),
"full_outer",
)
.select(
F.coalesce(F.col("s.product_id"), F.col("b.product_id")).alias("product_id"),
F.coalesce(F.col("s.region"), F.col("b.region")).alias("region"),
F.coalesce(F.col("s.recent_orders"), F.col("b.total_orders")).alias("total_orders"),
F.coalesce(F.col("s.recent_revenue"), F.col("b.total_revenue")).alias("total_revenue"),
)
)
(
merged.write
.mode("overwrite")
.parquet(f"{merged_output_path}/date={execution_date}")
)
logger.info(f"Serving layer merged for {execution_date}")
return merged
def _get_schema(self):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
return StructType([
StructField("product_id", StringType()),
StructField("region", StringType()),
StructField("amount", DoubleType()),
StructField("event_time", TimestampType()),
])
Kappa Architecture: Replay-Based Reprocessing
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import logging
logger = logging.getLogger(__name__)
class KappaArchitecture:
"""
Kappa Architecture: process all data through streaming, reprocess by replaying.
"""
def __init__(self, spark: SparkSession):
self.spark = spark
def stream_processor(self, kafka_bootstrap: str, topic: str, output_path: str, version: str):
"""
Deploy a streaming processor. To reprocess, deploy a new version
and replay from the beginning of the Kafka topic.
"""
logger.info(f"Deploying stream processor version: {version}")
# Read from Kafka (replayable log)
raw_stream = (
self.spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafka_bootstrap)
.option("subscribe", topic)
.option("startingOffsets", "earliest") # For reprocessing
.option("group.id", f"processor-{version}")
.load()
)
# Version-specific transformation logic
processed = self._apply_transformations(raw_stream, version)
# Write to versioned output
query = (
processed.writeStream
.format("parquet")
.option("path", f"{output_path}/version={version}")
.option("checkpointLocation", f"{output_path}/_checkpoint/{version}")
.trigger(processingTime="5 minutes")
.outputMode("append")
.start()
)
return query
def _apply_transformations(self, stream, version: str):
"""Apply version-specific transformations."""
parsed = (
stream
.select(F.from_json(F.col("value").cast("string"), self._get_schema()).alias("data"))
.select("data.*")
)
if version == "v1":
# Original logic
return parsed.select(
"event_id", "user_id", "amount",
F.col("timestamp").alias("event_time"),
)
elif version == "v2":
# Updated logic: added fraud detection
return parsed.select(
"event_id", "user_id", "amount",
F.col("timestamp").alias("event_time"),
).withColumn(
"is_fraud",
F.when(F.col("amount") > 10000, True).otherwise(False),
)
return parsed
def reprocess(self, kafka_bootstrap: str, topic: str, output_path: str, new_version: str):
"""
Reprocess historical data by deploying a new version and replaying.
Old version continues serving until new version catches up.
"""
logger.info(f"Starting reprocessing with version {new_version}")
# Deploy new version (reads from beginning of topic)
new_query = self.stream_processor(
kafka_bootstrap, topic, output_path, new_version
)
# Wait for new version to catch up to current time
# (monitored via consumer lag or watermark)
logger.info("New version deployed. Monitor consumer lag for catch-up completion.")
return new_query
def _get_schema(self):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
return StructType([
StructField("event_id", StringType()),
StructField("user_id", StringType()),
StructField("amount", DoubleType()),
StructField("timestamp", TimestampType()),
])
Lambda Architecture Drawbacks: Lambda requires maintaining two codebases (batch and speed) that produce the same results, leading to code duplication and divergence. The merge layer adds complexity. Kappa eliminates this by using a single streaming pipeline with replay for reprocessing. However, Kappa requires a durable, replayable event log (Kafka) and efficient stream processors.
Micro-Batch as a Middle Ground: Spark Structured Streaming and Flink (in mini-batch mode) provide a pragmatic compromise: near-real-time latency (seconds) with batch-like processing semantics. This eliminates the need for Lambda architecture in many use cases where sub-second latency is not required.
- Batch processing optimizes for throughput and completeness; streaming optimizes for latency and freshness.
- Lambda architecture combines batch + speed layers but introduces dual codebases and merge complexity.
- Kappa architecture uses only stream processing with replay for reprocessing, simplifying the architecture.
- CAP theorem applies: batch favors consistency (CP), streaming favors availability (AP).
- The choice depends on latency requirements, data volume, operational maturity, and completeness needs.
- Micro-batch streaming (Spark, Flink) provides a pragmatic middle ground for many use cases.
- Always prototype both approaches and measure latency, cost, and developer velocity before committing.
Best Practices
- Start with batch if your team is new to data engineering. Add streaming only when latency requirements demand it.
- Use micro-batch streaming (Spark Structured Streaming) as a middle ground β near-real-time with batch-like simplicity.
- Choose Kappa over Lambda when possible. Replayable event logs (Kafka) eliminate the need for dual codebases.
- If using Lambda, share transformation logic between batch and speed layers via a common library to prevent code divergence.
- Set appropriate watermarks in streaming systems to bound state growth while maintaining completeness.
- Monitor consumer lag as a key SLA metric. Lag exceeding retention period = data loss.
- Design for reprocessing from day one. Store raw data in replayable formats (Kafka, append-only Parquet).
- Use idempotent processing so that reprocessing (batch re-run or stream replay) produces identical results.
- Document the latency-completeness trade-off for each pipeline and get stakeholder agreement on acceptable approximations.
- Evaluate operational cost β streaming requires 24/7 monitoring, state management, and expertise that batch does not.
Architecture Pattern Comparison
| Dimension | Lambda | Kappa | Pure Batch | Micro-Batch |
|---|---|---|---|---|
| Code Duplication | High (2 codebases) | Low (1 codebase) | None | None |
| Reprocessing | Replay speed layer | Replay event log | Re-run batch | Re-run micro-batch |
| Latency | Sub-second | Seconds-minutes | Hours | Seconds |
| Complexity | High | Medium | Low | Low-Medium |
| Operational Cost | High | Medium | Low | Low-Medium |
| Data Completeness | High | High | Complete | Near-complete |
| Best For | Real-time + accuracy | Real-time + simplicity | Analytics/ML | Near-real-time analytics |
See Also
- 022 - Spark Structured Streaming - Micro-batch streaming implementation
- 019 - Apache Kafka: Topics, Producers, and Consumers - Event log for Kappa architecture
- 021 - Apache Spark: RDDs, DataFrames, and the Catalyst Optimizer - Batch processing engine
- 024 - Data Ingestion Patterns - Ingestion for both paradigms
- 016 - ETL vs ELT - Transformation paradigm selection