Batch vs Streaming: Lambda, Kappa, and Architecture Trade-offs

Batch vs Streaming: Choosing the Right Processing Paradigm

The fundamental question in data engineering is whether to process data in batches or as a continuous stream.

The Batch-Streaming Spectrum

Processing Paradigms:

Paradigm	Latency	Throughput	Complexity	Cost
Pure Batch (Daily ETL)	Hours	High	Simple ops	Lower
Micro-Batch (Spark Streaming)	Seconds	Medium	Medium ops	Medium
Pure Streaming (Flink, Kafka Streams)	Milliseconds	Lower	Complex ops	Higher

Key Considerations:

Latency requirements — how fast do you need results?
Data volume — how much data are you processing?
Consistency needs — exactly-once or at-least-once?
Operational maturity — can your team manage streaming infrastructure?

Key Insight: The choice is not binary but a design decision based on latency requirements, data volume, consistency needs, and operational maturity. Micro-batch systems (Spark Structured Streaming, Flink with mini-batch) occupy the middle ground.

Lambda vs Kappa Architecture

[Figure: Processing Spectrum]

[Figure: Architecture Diagram]

Batch processing runs a set of jobs over a bounded collection of data, typically on a schedule. Data accumulates over a period (hour, day, week) and is then processed as a single unit. Batch systems optimize for throughput (records/second) over latency (time to first result). Examples: MapReduce, Spark batch, Hive, traditional ETL tools.

Stream processing is the processing of data in motion — events are processed individually or in small micro-batches as they arrive, without waiting for a complete dataset. Stream systems optimize for latency (time from event to result) over throughput. Examples: Apache Flink, Kafka Streams, Spark Structured Streaming, Apache Storm.

Lambda architecture is a data processing pattern that combines batch and stream processing to provide both comprehensive and real-time views of data. The batch layer provides accurate, comprehensive results on historical data; the speed layer provides low-latency, approximate results on recent data; the serving layer merges both views for queries. The core formula: Query_result = Batch_view(batch_data) ∪ Speed_view(recent_data).
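As a minimal sketch of the serving-layer merge, assume both views hold additive order counts keyed by product, and that the speed view contains only events that arrived after the batch cutoff. The function name `serve` and the sample numbers are hypothetical, chosen only for illustration.

```python
from typing import Dict


def serve(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Answer a query by combining the batch view with the speed view."""
    merged = dict(batch_view)  # start from the accurate historical totals
    for key, count in speed_view.items():
        # Add the increments seen since the last batch run
        merged[key] = merged.get(key, 0) + count
    return merged


batch_view = {"p1": 120, "p2": 45}  # orders up to the last batch cutoff
speed_view = {"p1": 3, "p3": 1}     # orders since the cutoff
print(serve(batch_view, speed_view))  # {'p1': 123, 'p2': 45, 'p3': 1}
```

When the next batch run completes, the speed view for the period it covers is discarded, so approximation errors in the speed layer do not accumulate.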

Kappa architecture is a data processing pattern that uses only stream processing for both real-time and historical data processing. Instead of separate batch and speed layers, all data is processed through a single stream processing pipeline. Historical reprocessing is achieved by replaying the event log (Kafka) through an updated stream processor. The core principle: there is no batch layer — only the stream layer.
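To illustrate replay-based reprocessing, the sketch below uses a plain Python list as a stand-in for the event log. The names `EVENT_LOG`, `processor_v1`, `processor_v2`, and `replay` are hypothetical; a real deployment would read from Kafka rather than from memory.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

# Stand-in for a durable, replayable log such as a Kafka topic.
EVENT_LOG: List[Event] = [
    {"event_id": "e1", "user_id": "u1", "amount": 250.0},
    {"event_id": "e2", "user_id": "u2", "amount": 12500.0},
]


def processor_v1(event: Event) -> Event:
    """Original logic: project the fields of interest."""
    return {"event_id": event["event_id"], "amount": event["amount"]}


def processor_v2(event: Event) -> Event:
    """Updated logic: same projection plus a large-amount flag."""
    out = processor_v1(event)
    out["is_large"] = event["amount"] > 10000
    return out


def replay(log: List[Event], processor: Callable[[Event], Event]) -> List[Event]:
    """Reprocess history by running every logged event through a processor."""
    return [processor(event) for event in log]


view_v1 = replay(EVENT_LOG, processor_v1)  # what the old version serves
view_v2 = replay(EVENT_LOG, processor_v2)  # rebuilt from the same log
print(view_v2[1])  # {'event_id': 'e2', 'amount': 12500.0, 'is_large': True}
```

The log itself is never modified; only the processor changes, which is why a single codebase is enough.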

Latency-Throughput Trade-off

For a batch system with data volume V per run and processing rate R, the processing latency is L = V / R. For streaming, L ≈ processing_time_per_event. Effective batch throughput over a full cycle is Throughput_batch = V / (T_schedule + T_process + T_load); for streaming, Throughput_stream = R_events (the sustained event rate). Batch can meet a maximum acceptable latency T_budget only while V / R ≤ T_budget, that is, V ≤ R * T_budget. Once V exceeds R * T_budget, or T_budget drops below the scheduling interval, incremental or streaming processing becomes necessary.
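Reading L = V / R against a latency budget can be made concrete with a small calculation. This is a sketch only; the function names and the workload figures (36 million records, 10,000 records/s) are hypothetical.

```python
def batch_latency_seconds(volume_records: float, rate_records_per_s: float) -> float:
    """Batch processing latency L = V / R for one accumulated batch."""
    if rate_records_per_s <= 0:
        raise ValueError("processing rate must be positive")
    return volume_records / rate_records_per_s


def batch_meets_budget(
    volume_records: float, rate_records_per_s: float, budget_s: float
) -> bool:
    """Batch fits the latency budget while V <= R * T_budget."""
    return volume_records <= rate_records_per_s * budget_s


# Hypothetical workload: 36 million records per run at 10,000 records/s.
latency = batch_latency_seconds(36_000_000, 10_000)
print(latency)  # 3600.0 seconds, i.e. one hour per run

print(batch_meets_budget(36_000_000, 10_000, budget_s=4 * 3600))  # True
print(batch_meets_budget(36_000_000, 10_000, budget_s=60))        # False
```

With a four-hour budget the batch job fits comfortably; with a one-minute budget it cannot, no matter how it is scheduled.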

Cost Comparison

Total_cost_batch = C_storage + C_compute_batch + C_operational_batch. Total_cost_streaming = C_storage + C_compute_streaming + C_operational_streaming + C_state_management. Batch tends to be cheaper for high-volume, low-frequency workloads, because compute runs only on a schedule. Streaming tends to be cheaper for low-volume, high-frequency workloads, where a small always-on job replaces many short batch runs. Breakeven: V_batch * C_compute_batch = V_streaming * C_compute_streaming * frequency.
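The cost formulas can be sketched as a small model. Every figure below (hourly rates, run counts, monthly amounts) is a made-up placeholder rather than a measurement, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PipelineCost:
    """Monthly cost terms from the formulas above (arbitrary currency units)."""
    storage: float
    compute: float
    operational: float
    state_management: float = 0.0  # only the streaming formula has this term

    def total(self) -> float:
        return self.storage + self.compute + self.operational + self.state_management


def scheduled_compute_cost(hourly_rate: float, hours_per_run: float, runs: int) -> float:
    """Batch compute is paid only while a scheduled run is executing."""
    return hourly_rate * hours_per_run * runs


def always_on_compute_cost(hourly_rate: float, hours_per_month: float = 730.0) -> float:
    """Streaming compute is paid for every hour of the month."""
    return hourly_rate * hours_per_month


# Placeholder numbers: a large cluster for 2 hours a day vs a small always-on job.
batch = PipelineCost(
    storage=200.0,
    compute=scheduled_compute_cost(hourly_rate=8.0, hours_per_run=2.0, runs=30),
    operational=300.0,
)
stream = PipelineCost(
    storage=200.0,
    compute=always_on_compute_cost(hourly_rate=2.0),
    operational=600.0,
    state_management=150.0,
)
print(batch.total())   # 980.0
print(stream.total())  # 2410.0
```

Plugging in real rates for a given workload is the point of the exercise; the ordering of the two totals here follows only from the placeholder inputs.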

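The CAP theorem states that a distributed data store cannot guarantee Consistency, Availability, and Partition tolerance all at once: when a network partition occurs, it must give up either consistency or availability. As a rough analogy, batch pipelines tend to behave like CP systems, waiting or failing rather than producing partial output (for example, jobs reading from HDFS with replication factor 3). Streaming pipelines often lean toward availability, continuing to emit results while some partitions lag and reconciling later (for example, consumers reading Kafka partitions that progress independently). The choice depends on whether your workload prioritizes consistency (financial calculations) or availability (real-time dashboards).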

Batch systems get effectively exactly-once results almost for free: if a run fails, re-run the entire batch and overwrite its output. Streaming systems face a fundamental tension: exactly-once requires coordinated transactions (Kafka transactions, Flink checkpointing), which add latency and complexity. At-least-once delivery with idempotent processing is often more practical for high-throughput streaming. The trade-off: exactly_once = f(complexity, latency, cost); at_least_once = f(simplicity, lower_latency, lower_cost).
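A minimal sketch of at-least-once delivery with idempotent processing, assuming every event carries a unique `event_id`. The class `IdempotentRevenueSink` and the helper `deliver` are hypothetical and stand in for a keyed upsert into a real store.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, float]  # (event_id, product_id, amount)


class IdempotentRevenueSink:
    """Stores one record per event_id, so redelivered events change nothing."""

    def __init__(self) -> None:
        self._by_event: Dict[str, Tuple[str, float]] = {}

    def apply(self, event: Event) -> None:
        event_id, product_id, amount = event
        # Keyed upsert: a duplicate overwrites the record with identical data
        self._by_event[event_id] = (product_id, amount)

    def revenue_by_product(self) -> Dict[str, float]:
        totals: Dict[str, float] = {}
        for product_id, amount in self._by_event.values():
            totals[product_id] = totals.get(product_id, 0.0) + amount
        return totals


def deliver(events: Iterable[Event], sink: IdempotentRevenueSink) -> None:
    for event in events:
        sink.apply(event)


events: List[Event] = [("e1", "p1", 10.0), ("e2", "p1", 5.0), ("e3", "p2", 7.5)]

exactly_once = IdempotentRevenueSink()
deliver(events, exactly_once)

at_least_once = IdempotentRevenueSink()
deliver(events + [events[1], events[0]], at_least_once)  # e2 and e1 redelivered

print(at_least_once.revenue_by_product())  # {'p1': 15.0, 'p2': 7.5}
print(exactly_once.revenue_by_product() == at_least_once.revenue_by_product())  # True
```

A sink that simply added each delivered amount to a running total would report 30.0 for p1 in the second case, which is why the keyed write matters.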

Batch vs Streaming Comparison

Dimension	Batch Processing	Stream Processing
Latency	Minutes to hours	Milliseconds to seconds
Throughput	Very high (optimized for volume)	Moderate (optimized for latency)
Complexity	Lower	Higher (state, watermarks, exactly-once)
Cost	Lower (scheduled compute)	Higher (always-on compute)
Data Completeness	Complete (all data available)	Approximate (late data handling)
Reprocessing	Re-run entire batch	Replay from offset
Fault Tolerance	Simple (re-run)	Complex (checkpointing, WAL)
State Management	Per-run only (no long-lived state)	Stateful (aggregations, joins)
Operational Overhead	Lower (scheduled)	Higher (24/7 monitoring)
Tooling Maturity	Mature (Spark, Hive, Airflow)	Evolving (Flink, Kafka Streams)
Best For	Analytics, reporting, ML training	Alerts, dashboards, fraud detection
Data Volume	Bounded by cluster capacity	Bounded by sustained throughput (excess shows up as consumer lag)
Schema Evolution	Easy (reprocess entire dataset)	Complex (running migrations)
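
The "Data Completeness" row comes down to how late events are handled. The sketch below is a simplified, pure-Python model of tumbling windows with a watermark; it is not how any particular engine implements them. The function name `windowed_counts` and the sample events are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def windowed_counts(
    events: List[Tuple[int, str]], window_s: int, allowed_lateness_s: int
) -> Tuple[Dict[Tuple[int, str], int], List[Tuple[int, str]]]:
    """Count events per (window_start, key); drop events behind the watermark.

    `events` are (event_time_seconds, key) pairs in arrival order.
    """
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    dropped: List[Tuple[int, str]] = []
    max_event_time = 0
    for event_time, key in events:
        max_event_time = max(max_event_time, event_time)
        # Watermark: the engine assumes nothing older than this will arrive
        watermark = max_event_time - allowed_lateness_s
        window_start = (event_time // window_s) * window_s
        if window_start + window_s <= watermark:
            dropped.append((event_time, key))  # window already finalized
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


arrivals = [(10, "a"), (200, "a"), (310, "a"), (900, "a"), (250, "a")]
counts, dropped = windowed_counts(arrivals, window_s=300, allowed_lateness_s=300)
print(counts)   # {(0, 'a'): 2, (300, 'a'): 1, (900, 'a'): 1}
print(dropped)  # [(250, 'a')]
```

The event at time 250 arrives after the watermark has passed its window, so the streaming result undercounts by one. A batch job over the same data would count it, which is the completeness gap the table describes.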

Define latency requirements: If results must be available within seconds -> streaming. If hours/days is acceptable -> batch.
Assess data arrival pattern: If events arrive continuously -> streaming. If data is generated in discrete chunks -> batch.
Evaluate completeness needs: If every record must be counted exactly -> batch (or streaming with watermark + completeness guarantee). If approximate is acceptable -> streaming.
Measure data volume: High volume (> 1 TB/day) favors batch for cost efficiency. At low volume (< 100 GB/day), always-on streaming compute stays affordable. Treat both thresholds as rough rules of thumb.
Assess operational maturity: Streaming requires 24/7 monitoring, state management expertise, and watermark tuning. Batch requires scheduling, monitoring, and backfill capabilities.
Consider reprocessing: If historical reprocessing is frequent, streaming with replayable logs (Kappa) is more maintainable than Lambda architecture.
Evaluate downstream consumers: If consumers are dashboards/APIs -> streaming. If consumers are ML models/reports -> batch.
Prototype both approaches: Build minimal POCs for batch and streaming, measure latency, cost, and developer velocity.
Consider hybrid: Use streaming for real-time alerts + batch for comprehensive analytics (Lambda) or use micro-batch streaming for both.
Document decision rationale: Record the factors that influenced the choice for future reference.
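
The first steps of this checklist can be sketched as a small helper. The thresholds (one hour, one second), the field names, and the function `recommend` are illustrative assumptions, not fixed rules, and the helper deliberately ignores cost and completeness.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    max_latency_s: float             # how quickly results are needed
    continuous_arrival: bool         # events arrive continuously vs in chunks
    team_runs_streaming_today: bool  # proxy for operational maturity


def recommend(w: Workload) -> str:
    """Return 'batch', 'micro-batch', or 'streaming' as a starting point."""
    # Hours or days of acceptable latency, or chunked arrival: batch is enough
    if w.max_latency_s >= 3600 or not w.continuous_arrival:
        return "batch"
    # Sub-second budgets leave no room for micro-batch trigger intervals
    if w.max_latency_s < 1:
        return "streaming"
    # Seconds-to-minutes budgets: let operational maturity decide
    return "streaming" if w.team_runs_streaming_today else "micro-batch"


print(recommend(Workload(86_400, True, False)))  # batch
print(recommend(Workload(0.5, True, True)))      # streaming
print(recommend(Workload(30, True, False)))      # micro-batch
```

The output is a starting hypothesis for the prototyping step, not a substitute for it.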

Production Code

Lambda Architecture: Batch + Streaming Merge

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import logging

logger = logging.getLogger(__name__)


class LambdaArchitecture:
    """
    Lambda Architecture implementation: batch layer + speed layer + serving layer.
    """

    def __init__(self, spark: SparkSession):
        self.spark = spark

    # ------------------------------------------------------
    # BATCH LAYER: Comprehensive, accurate results
    # ------------------------------------------------------
    def batch_layer(self, input_path: str, batch_output_path: str, execution_date: str):
        """Run batch processing for a complete dataset."""
        logger.info(f"Running batch layer for {execution_date}")

        # Read complete dataset for the day
        raw_df = (
            self.spark.read
            .parquet(f"{input_path}/date={execution_date}")
        )

        # Comprehensive aggregation
        batch_result = (
            raw_df
            .groupBy("product_id", "region")
            .agg(
                F.count("*").alias("total_orders"),
                F.sum("amount").alias("total_revenue"),
                F.countDistinct("customer_id").alias("unique_customers"),
                F.avg("amount").alias("avg_order_value"),
                F.percentile_approx("amount", 0.5).alias("median_order_value"),
            )
        )

        # Write batch view
        (
            batch_result
            .write
            .mode("overwrite")
            .parquet(f"{batch_output_path}/date={execution_date}")
        )

        logger.info(f"Batch layer completed for {execution_date}")
        return batch_result

    # ------------------------------------------------------
    # SPEED LAYER: Real-time approximations
    # ------------------------------------------------------
    def speed_layer(self, kafka_bootstrap: str, topic: str, speed_output_path: str):
        """Run streaming processing for real-time approximations."""
        logger.info("Starting speed layer streaming query")

        raw_stream = (
            self.spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", kafka_bootstrap)
            .option("subscribe", topic)
            .option("startingOffsets", "latest")
            .load()
        )

        # Parse and aggregate
        parsed = (
            raw_stream
            .select(F.from_json(F.col("value").cast("string"), self._get_schema()).alias("data"))
            .select("data.*")
            .withWatermark("event_time", "5 minutes")
            .groupBy(
                F.window("event_time", "5 minutes"),
                "product_id",
                "region",
            )
            .agg(
                F.count("*").alias("recent_orders"),
                F.sum("amount").alias("recent_revenue"),
            )
        )

        query = (
            parsed.writeStream
            .format("parquet")
            .option("path", speed_output_path)
            .option("checkpointLocation", f"{speed_output_path}/_checkpoint")
            .trigger(processingTime="1 minute")
            .outputMode("append")
            .start()
        )

        return query

    # ------------------------------------------------------
    # SERVING LAYER: Merge batch + speed views
    # ------------------------------------------------------
    def serving_layer(
        self,
        batch_output_path: str,
        speed_output_path: str,
        merged_output_path: str,
        execution_date: str,
    ):
        """Merge batch and speed views into a unified serving view."""
        logger.info("Merging batch and speed views")

        # Read batch view (complete, accurate)
        batch_view = (
            self.spark.read
            .parquet(f"{batch_output_path}/date={execution_date}")
        )

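        # Read speed view (recent, approximate). Assumes the batch view fully
        # covers execution_date, so only windows starting after that day are
        # kept (avoids double-counting). The 5-minute windows are then
        # collapsed to one row per key so the join cannot fan out.
        speed_view = (
            self.spark.read
            .parquet(speed_output_path)
            .filter(
                F.col("window.start")
                >= F.date_add(F.to_date(F.lit(execution_date)), 1).cast("timestamp")
            )
            .groupBy("product_id", "region")
            .agg(
                F.sum("recent_orders").alias("recent_orders"),
                F.sum("recent_revenue").alias("recent_revenue"),
            )
        )

        # Merge: batch totals plus speed-layer increments since the cutoff
        merged = (
            batch_view.alias("b")
            .join(
                speed_view.alias("s"),
                (F.col("b.product_id") == F.col("s.product_id")) &
                (F.col("b.region") == F.col("s.region")),
                "full_outer",
            )
            .select(
                F.coalesce(F.col("b.product_id"), F.col("s.product_id")).alias("product_id"),
                F.coalesce(F.col("b.region"), F.col("s.region")).alias("region"),
                (
                    F.coalesce(F.col("b.total_orders"), F.lit(0))
                    + F.coalesce(F.col("s.recent_orders"), F.lit(0))
                ).alias("total_orders"),
                (
                    F.coalesce(F.col("b.total_revenue"), F.lit(0.0))
                    + F.coalesce(F.col("s.recent_revenue"), F.lit(0.0))
                ).alias("total_revenue"),
            )
        )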

        (
            merged.write
            .mode("overwrite")
            .parquet(f"{merged_output_path}/date={execution_date}")
        )

        logger.info(f"Serving layer merged for {execution_date}")
        return merged

    def _get_schema(self):
        from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
        return StructType([
            StructField("product_id", StringType()),
            StructField("region", StringType()),
            StructField("amount", DoubleType()),
            StructField("event_time", TimestampType()),
        ])

Kappa Architecture: Replay-Based Reprocessing

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import logging

logger = logging.getLogger(__name__)


class KappaArchitecture:
    """
    Kappa Architecture: process all data through streaming, reprocess by replaying.
    """

    def __init__(self, spark: SparkSession):
        self.spark = spark

    def stream_processor(self, kafka_bootstrap: str, topic: str, output_path: str, version: str):
        """
        Deploy a streaming processor. To reprocess, deploy a new version
        and replay from the beginning of the Kafka topic.
        """
        logger.info(f"Deploying stream processor version: {version}")

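        # Read from Kafka (replayable log). Spark tracks offsets in the
        # checkpoint rather than in a Kafka consumer group, so "earliest" only
        # takes effect when the query starts with a fresh checkpoint (one per
        # version, see below).
        raw_stream = (
            self.spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", kafka_bootstrap)
            .option("subscribe", topic)
            .option("startingOffsets", "earliest")  # For reprocessing
            .option("groupIdPrefix", f"processor-{version}")
            .load()
        )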

        # Version-specific transformation logic
        processed = self._apply_transformations(raw_stream, version)

        # Write to versioned output
        query = (
            processed.writeStream
            .format("parquet")
            .option("path", f"{output_path}/version={version}")
            .option("checkpointLocation", f"{output_path}/_checkpoint/{version}")
            .trigger(processingTime="5 minutes")
            .outputMode("append")
            .start()
        )

        return query

    def _apply_transformations(self, stream, version: str):
        """Apply version-specific transformations."""
        parsed = (
            stream
            .select(F.from_json(F.col("value").cast("string"), self._get_schema()).alias("data"))
            .select("data.*")
        )

        if version == "v1":
            # Original logic
            return parsed.select(
                "event_id", "user_id", "amount",
                F.col("timestamp").alias("event_time"),
            )

        elif version == "v2":
            # Updated logic: added fraud detection
            return parsed.select(
                "event_id", "user_id", "amount",
                F.col("timestamp").alias("event_time"),
            ).withColumn(
                "is_fraud",
                F.when(F.col("amount") > 10000, True).otherwise(False),
            )

        return parsed

    def reprocess(self, kafka_bootstrap: str, topic: str, output_path: str, new_version: str):
        """
        Reprocess historical data by deploying a new version and replaying.
        Old version continues serving until new version catches up.
        """
        logger.info(f"Starting reprocessing with version {new_version}")

        # Deploy new version (reads from beginning of topic)
        new_query = self.stream_processor(
            kafka_bootstrap, topic, output_path, new_version
        )

        # Wait for new version to catch up to current time
        # (monitored via consumer lag or watermark)
        logger.info("New version deployed. Monitor consumer lag for catch-up completion.")

        return new_query

    def _get_schema(self):
        from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType
        return StructType([
            StructField("event_id", StringType()),
            StructField("user_id", StringType()),
            StructField("amount", DoubleType()),
            StructField("timestamp", TimestampType()),
        ])

Lambda Architecture Drawbacks: Lambda requires maintaining two codebases (batch and speed) that produce the same results, leading to code duplication and divergence. The merge layer adds complexity. Kappa eliminates this by using a single streaming pipeline with replay for reprocessing. However, Kappa requires a durable, replayable event log (Kafka) and efficient stream processors.

Micro-Batch as a Middle Ground: Spark Structured Streaming and Flink (in mini-batch mode) provide a pragmatic compromise: near-real-time latency (seconds) with batch-like processing semantics. This eliminates the need for Lambda architecture in many use cases where sub-second latency is not required.
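
As a conceptual sketch of micro-batching (not how any specific engine schedules work), the generator below groups events by trigger interval so that each group can be handled with ordinary batch logic. The function name `micro_batches` and the sample arrivals are hypothetical.

```python
from itertools import groupby
from typing import Iterator, List, Tuple


def micro_batches(
    events: List[Tuple[float, str]], interval_s: float
) -> Iterator[Tuple[float, List[str]]]:
    """Yield (batch_start_time, payloads) for each non-empty trigger interval.

    `events` are (arrival_time_seconds, payload) pairs sorted by arrival time.
    """
    for bucket, group in groupby(events, key=lambda e: int(e[0] // interval_s)):
        yield bucket * interval_s, [payload for _, payload in group]


arrivals = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
for start, payloads in micro_batches(arrivals, interval_s=1):
    # Each micro-batch is processed as a small, bounded dataset
    print(start, payloads)
# 0 ['a', 'b']
# 1 ['c']
# 3 ['d']
```

Latency is bounded below by the trigger interval, which is why this approach suits second-level rather than sub-second requirements.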

Batch processing optimizes for throughput and completeness; streaming optimizes for latency and freshness.
Lambda architecture combines batch + speed layers but introduces dual codebases and merge complexity.
Kappa architecture uses only stream processing with replay for reprocessing, simplifying the architecture.
CAP-style trade-offs apply: batch pipelines tend to favor consistency (CP), while streaming pipelines often favor availability (AP).
The choice depends on latency requirements, data volume, operational maturity, and completeness needs.
Micro-batch streaming (Spark, Flink) provides a pragmatic middle ground for many use cases.
Always prototype both approaches and measure latency, cost, and developer velocity before committing.

Best Practices

Start with batch if your team is new to data engineering. Add streaming only when latency requirements demand it.
Use micro-batch streaming (Spark Structured Streaming) as a middle ground — near-real-time with batch-like simplicity.
Choose Kappa over Lambda when possible. Replayable event logs (Kafka) eliminate the need for dual codebases.
If using Lambda, share transformation logic between batch and speed layers via a common library to prevent code divergence.
Set appropriate watermarks in streaming systems to bound state growth while maintaining completeness.
Monitor consumer lag as a key SLA metric. If lag exceeds the topic's retention period, messages expire before they are read, which means data loss.
Design for reprocessing from day one. Store raw data in replayable formats (Kafka, append-only Parquet).
Use idempotent processing so that reprocessing (batch re-run or stream replay) produces identical results.
Document the latency-completeness trade-off for each pipeline and get stakeholder agreement on acceptable approximations.
Evaluate operational cost — streaming requires 24/7 monitoring, state management, and expertise that batch does not.
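
The consumer-lag practice can be sketched as a simple check that converts lag in messages into lag in time and compares it with retention. The function name `lag_status`, the 50% warning threshold, and the example rates are assumptions for illustration; real monitoring would pull these values from broker metrics.

```python
def lag_status(
    lag_messages: int,
    produce_rate_per_s: float,
    retention_s: float,
    warn_fraction: float = 0.5,
) -> str:
    """Classify consumer lag relative to the topic's retention period."""
    if produce_rate_per_s <= 0:
        raise ValueError("produce rate must be positive")
    # Approximate age of the oldest unread message
    lag_seconds = lag_messages / produce_rate_per_s
    if lag_seconds >= retention_s:
        return "data-loss"  # unread messages are already being expired
    if lag_seconds >= warn_fraction * retention_s:
        return "warning"
    return "ok"


SEVEN_DAYS_S = 7 * 24 * 3600  # 604800 seconds of retention

print(lag_status(1_000_000, 1_000, SEVEN_DAYS_S))    # ok
print(lag_status(400_000_000, 1_000, SEVEN_DAYS_S))  # warning
print(lag_status(700_000_000, 1_000, SEVEN_DAYS_S))  # data-loss
```

Alerting at a fraction of retention rather than at the limit leaves time to scale consumers before any data expires.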

Architecture Pattern Comparison

Dimension	Lambda	Kappa	Pure Batch	Micro-Batch
Code Duplication	High (2 codebases)	Low (1 codebase)	None	None
Reprocessing	Re-run batch layer	Replay event log	Re-run batch	Re-run micro-batch
Latency	Sub-second	Milliseconds-seconds	Hours	Seconds
Complexity	High	Medium	Low	Low-Medium
Operational Cost	High	Medium	Low	Low-Medium
Data Completeness	High	High	Complete	Near-complete
Best For	Real-time + accuracy	Real-time + simplicity	Analytics/ML	Near-real-time analytics