Error Handling, Retries, and Dead Letter Queues

Error Handling and Resilience Patterns for Data Pipelines

Production data pipelines fail. The question is not if your pipeline will fail, but how it handles failure.

The Error Landscape in Data Pipelines

Error Categories:

Category	Example	Response	Retry?	DLQ?
Transient	Network timeout	Exponential backoff	Yes	No
Rate Limit	HTTP 429	Wait and retry	Yes	No
Permanent	Schema violation	Route to DLQ	No	Yes
Poison Message	Always fails parse	Route to DLQ	No	Yes
Systemic	Disk full	Alert ops, halt	No	No
Intermittent	DB lock	Retry with backoff	Yes	After max retries

Handling Strategies:

Retries with exponential backoff — for transient errors
Dead letter queues — for poison messages
Circuit breakers — for cascading failure prevention
Fallback mechanisms — for graceful degradation

Key Insight: Each error category requires a different handling strategy.

Circuit Breaker Pattern

Exponential Backoff Timeline

Architecture Diagram

Exponential backoff is a retry strategy where the wait time between retry attempts increases exponentially. For the i-th retry attempt, the wait time is: wait_i = base_delay * 2^i, where base_delay is the initial wait time (e.g., 1 second). With jitter: wait_i = base_delay * 2^i * random(0, 1). Jitter prevents thundering herd problems where all clients retry simultaneously.

A dead letter queue is a queue to which messages are sent when they cannot be processed successfully after a configured number of retries. DLQs decouple poison messages from the main processing queue, preventing a single bad record from blocking the entire pipeline. Messages in the DLQ are inspected, debugged, and either reprocessed or permanently discarded.

A circuit breaker is a fault tolerance pattern that prevents a system from repeatedly attempting an operation that is likely to fail. The circuit breaker has three states: (1) Closed — requests pass through normally, (2) Open — requests fail immediately without attempting the operation, (3) Half-Open — a limited number of requests pass through to test if the operation has recovered. The circuit opens when the error rate exceeds a threshold and closes after a timeout.

Idempotent processing ensures that applying the same operation multiple times produces the same result as applying it once. Formally: f(f(x)) = f(x). Idempotency is essential for retry safety — if a retry executes the operation twice, the data should be correct. Achieved via: (1) deterministic deduplication keys, (2) UPSERT/MERGE operations, (3) transactional writes, (4) idempotent consumers.

Exponential Backoff Formula

For retry attempt i with base_delay B and maximum delay M: wait_i = min(B * 2^i, M). Total wait time for N retries: Total_wait = Σ(min(B * 2^i, M)) for i = 0 to N-1. With jitter: wait_i = min(B * 2^i * random(0, 1), M). For B=1s, M=60s, N=5: Total_wait ≈ 1 + 2 + 4 + 8 + 16 = 31s (without jitter).

Circuit Breaker State Transition

Error_rate = Failed_requests / Total_requests (over window W). Circuit opens when: Error_rate > threshold (e.g., 50%) AND min_requests > minimum_threshold (e.g., 10 requests). Circuit transitions to half-open after: timeout_seconds (e.g., 30s). Half-open allows: max_half_open_requests (e.g., 5) test requests. If test requests succeed: circuit closes. If any fail: circuit opens again.

DLQ Processing Cost

DLQ processing cost = DLQ_depth * (inspection_cost + remediation_cost). For a pipeline with 0.1% poison rate and 1M records/day: DLQ_depth = 1000 records/day. If inspection_cost = $0.001 and remediation_cost =$ 0.01: Daily_DLQ_cost = 1000 * 0.011 = $11.00. Reducing poison rate to 0.01% reduces DLQ cost to$ 1.10/day.

A retry is safe if and only if the retried operation is idempotent. Formally: if op(x) is idempotent, then op(op(x)) = op(x), and retries produce correct results. If op(x) is not idempotent, retries cause duplicate side effects (double writes, duplicate charges). For non-idempotent operations, implement deduplication at the consumer level using a unique request ID.

A circuit breaker prevents cascading failures if: (1) the error rate window W is large enough to detect real failures (> 30 seconds), (2) the minimum request threshold prevents opening on single failures, (3) the recovery timeout allows enough time for the downstream service to recover (typically 30-60 seconds), (4) half-open testing uses a small number of probe requests (5-10% of normal traffic).

Key Concepts

Concept	Description	Implementation
Retry	Automatically retry failed operations	`tenacity` decorator, custom retry logic
Exponential Backoff	Increasing wait between retries	`wait=wait_exponential(multiplier=1, max=60)`
Jitter	Random delay to prevent thundering herd	`wait=wait_exponential + wait_random(0, 1)`
Max Retries	Maximum retry attempts before failure	`stop=stop_after_attempt(5)`
Dead Letter Queue	Store unprocessable messages	Kafka DLQ topic, SQS DLQ
Circuit Breaker	Stop calling failing services	`pybreaker`, custom implementation
Idempotency	Safe to retry without side effects	Dedup keys, UPSERT, transactions
Fallback	Alternative when primary fails	Cached data, default values
Bulkhead	Isolate failure to prevent cascade	Thread pools, connection pools
Timeout	Limit wait time for operations	`timeout=30` on requests
Graceful Degradation	Continue with reduced functionality	Partial results, skip failed records
Error Budget	Track failure tolerance	SLO-based error budgets
Poison Message	Record that always fails processing	Route to DLQ after N retries
Retry Queue	Queue for messages needing retry	Separate retry topic with delay
Alert on DLQ	Notify when DLQ depth exceeds threshold	Prometheus alert on queue depth
DLQ Replay	Reprocess messages from DLQ	Manual or automated replay tool
Failure Injection	Test resilience with deliberate failures	Chaos engineering, fault injection
Backpressure	Slow down producers when consumers lag	Kafka consumer lag monitoring

Classify errors: Distinguish transient (retry), permanent (DLQ), and systemic (circuit breaker) failures.
Implement retries with backoff: Use exponential backoff with jitter for transient errors. Set max retries based on error tolerance.
Configure DLQs: Set up dead letter queues for messages that exceed retry limits. Include original message, error details, and timestamps.
Deploy circuit breakers: Place circuit breakers before external service calls. Configure thresholds based on error rate and minimum request counts.
Ensure idempotency: Design all retried operations to be idempotent. Use deduplication keys and UPSERT operations.
Implement fallbacks: Provide degraded responses when primary operations fail. Use cached data, default values, or partial results.
Set timeouts: Configure timeouts on all external calls to prevent hung requests from blocking resources.
Monitor error rates: Track error rates per pipeline stage, error type, and error category. Alert on anomalies.
Build DLQ tooling: Create tools to inspect, debug, and reprocess DLQ messages. Automate common remediations.
Test resilience: Use chaos engineering to inject failures and verify error handling works correctly.

Production Code

Retry Handler with Exponential Backoff

import time
import random
import logging
from typing import Callable, Any, Optional
from functools import wraps
from dataclasses import dataclass
from datetime import datetime
import traceback

logger = logging.getLogger(__name__)


@dataclass
class RetryConfig:
    """
    Configuration for retry behavior.
    
    Parameters:
        max_retries (int): Maximum number of retry attempts before raising exception
        base_delay (float): Initial delay in seconds before first retry
        max_delay (float): Maximum delay cap in seconds (prevents unbounded growth)
        exponential_base (float): Base for exponential growth (2.0 = doubling each retry)
        jitter (bool): Add random jitter to prevent thundering herd problem
        retryable_exceptions (tuple): Exception types that should trigger retry
        
    Example with backoff timing:
        max_retries=5, base_delay=1.0, exponential_base=2.0
        Attempt 1: wait 1.0s
        Attempt 2: wait 2.0s
        Attempt 3: wait 4.0s
        Attempt 4: wait 8.0s
        Attempt 5: wait 16.0s
        Total wait: 31.0s
    """
    max_retries: int = 5
    base_delay: float = 1.0
    max_delay: float = 60.0
    exponential_base: float = 2.0
    jitter: bool = True
    retryable_exceptions: tuple = (Exception,)


class RetryHandler:
    """Production retry handler with exponential backoff and jitter."""

    def __init__(self, config: RetryConfig = None):
        self.config = config or RetryConfig()

    def __call__(self, func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs) -> Any:
            last_exception = None
            for attempt in range(self.config.max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except self.config.retryable_exceptions as e:
                    last_exception = e
                    if attempt == self.config.max_retries:
                        logger.error(
                            f"Max retries ({self.config.max_retries}) exceeded "
                            f"for {func.__name__}: {e}"
                        )
                        raise

                    # Calculate delay with exponential backoff
                    delay = min(
                        self.config.base_delay * (self.config.exponential_base ** attempt),
                        self.config.max_delay,
                    )

                    # Add jitter
                    if self.config.jitter:
                        delay = delay * random.uniform(0, 1)

                    logger.warning(
                        f"Attempt {attempt + 1}/{self.config.max_retries} failed "
                        f"for {func.__name__}: {e}. Retrying in {delay:.2f}s..."
                    )
                    time.sleep(delay)

            raise last_exception

        return wrapper


# Usage example
@RetryHandler(config=RetryConfig(
    max_retries=5,
    base_delay=1.0,
    max_delay=30.0,
    jitter=True,
    retryable_exceptions=(ConnectionError, TimeoutError),
))
def fetch_from_api(url: str, timeout: int = 30) -> dict:
    """Fetch data from external API with automatic retries."""
    import requests
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


# Custom retry handler for specific error patterns
class SmartRetryHandler:
    """Retry handler that adapts behavior based on error type."""

    def __init__(self):
        self.error_counts = {}
        self.circuit_open = False
        self.circuit_open_until = None

    def execute_with_retry(
        self,
        func: Callable,
        *args,
        max_retries: int = 3,
        **kwargs,
    ) -> Any:
        """Execute function with smart retry logic."""
        # Check circuit breaker
        if self.circuit_open:
            if datetime.utcnow() < self.circuit_open_until:
                raise CircuitBreakerOpenError(
                    f"Circuit breaker open until {self.circuit_open_until}"
                )
            else:
                self.circuit_open = False
                logger.info("Circuit breaker entering half-open state")

        for attempt in range(max_retries + 1):
            try:
                result = func(*args, **kwargs)
                # Success: reset error count
                self.error_counts.pop(func.__name__, None)
                return result

            except RateLimitError as e:
                # Rate limit: wait longer before retry
                delay = min(60 * (2 ** attempt), 300)
                logger.warning(f"Rate limited. Waiting {delay}s before retry.")
                time.sleep(delay)

            except TemporaryError as e:
                # Transient error: standard exponential backoff
                delay = min(2 ** attempt, 30)
                logger.warning(f"Transient error. Retrying in {delay}s: {e}")
                time.sleep(delay)

            except PermanentError as e:
                # Permanent error: don't retry, send to DLQ
                logger.error(f"Permanent error, not retrying: {e}")
                raise

            except Exception as e:
                # Unknown error: track and potentially open circuit
                self.error_counts[func.__name__] = (
                    self.error_counts.get(func.__name__, 0) + 1
                )
                if self.error_counts[func.__name__] >= 10:
                    self.circuit_open = True
                    self.circuit_open_until = datetime.utcnow() + timedelta(seconds=60)
                    logger.error(
                        f"Circuit breaker OPEN for {func.__name__} "
                        f"after {self.error_counts[func.__name__]} consecutive errors"
                    )
                raise

        raise Exception(f"Max retries ({max_retries}) exceeded for {func.__name__}")


class CircuitBreakerOpenError(Exception):
    pass

Dead Letter Queue Implementation

import json
import logging
from datetime import datetime
from typing import Dict, List, Optional
from dataclasses import dataclass, asdict
from kafka import KafkaProducer, KafkaConsumer

logger = logging.getLogger(__name__)


@dataclass
class DLQMessage:
    """Message stored in dead letter queue with full context."""
    original_topic: str
    original_partition: int
    original_offset: int
    original_key: Optional[str]
    original_value: str
    error_type: str
    error_message: str
    error_traceback: str
    retry_count: int
    first_failure_time: str
    last_failure_time: str
    pipeline_name: str
    stage_name: str


class DeadLetterQueueManager:
    """Manage dead letter queues for failed pipeline records."""

    def __init__(
        self,
        kafka_bootstrap: str,
        dlq_topic: str = "pipeline.dlq",
        max_retries: int = 3,
    ):
        self.dlq_topic = dlq_topic
        self.max_retries = max_retries
        self.producer = KafkaProducer(
            bootstrap_servers=kafka_bootstrap,
            acks="all",
            enable_idempotence=True,
            value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
        )
        self.retry_counts: Dict[str, int] = {}

    def send_to_dlq(
        self,
        original_message,
        error: Exception,
        pipeline_name: str,
        stage_name: str,
    ):
        """Send a failed message to the dead letter queue."""
        dlq_msg = DLQMessage(
            original_topic=original_message.topic,
            original_partition=original_message.partition,
            original_offset=original_message.offset,
            original_key=original_message.key.decode("utf-8") if original_message.key else None,
            original_value=original_message.value.decode("utf-8"),
            error_type=type(error).__name__,
            error_message=str(error),
            error_traceback=traceback.format_exc(),
            retry_count=self.retry_counts.get(
                f"{original_message.topic}:{original_message.offset}", 0
            ),
            first_failure_time=datetime.utcnow().isoformat(),
            last_failure_time=datetime.utcnow().isoformat(),
            pipeline_name=pipeline_name,
            stage_name=stage_name,
        )

        self.producer.send(
            topic=self.dlq_topic,
            key=original_message.key,
            value=asdict(dlq_msg),
        )

        logger.warning(
            f"Message sent to DLQ: topic={original_message.topic}, "
            f"offset={original_message.offset}, error={type(error).__name__}"
        )

    def should_retry(self, message_key: str) -> bool:
        """Check if a message should be retried or sent to DLQ."""
        count = self.retry_counts.get(message_key, 0)
        if count >= self.max_retries:
            return False
        self.retry_counts[message_key] = count + 1
        return True

    def replay_dlq(
        self,
        target_processor: callable,
        filter_fn: callable = None,
        max_messages: int = None,
    ) -> Dict:
        """Replay messages from DLQ through a processor."""
        consumer = KafkaConsumer(
            self.dlq_topic,
            bootstrap_servers=self.producer.config["bootstrap_servers"],
            auto_offset_reset="earliest",
            value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        )

        replayed = 0
        failed = 0
        skipped = 0

        for message in consumer:
            if max_messages and replayed >= max_messages:
                break

            dlq_msg = message.value

            # Apply filter if provided
            if filter_fn and not filter_fn(dlq_msg):
                skipped += 1
                continue

            try:
                # Reconstruct original message for processing
                original = {
                    "topic": dlq_msg["original_topic"],
                    "key": dlq_msg["original_key"],
                    "value": dlq_msg["original_value"],
                }
                target_processor(original)
                replayed += 1
                logger.info(f"Successfully replayed message: {dlq_msg['original_offset']}")
            except Exception as e:
                failed += 1
                logger.error(f"Replay failed for offset {dlq_msg['original_offset']}: {e}")

        consumer.close()
        return {"replayed": replayed, "failed": failed, "skipped": skipped}

    def get_dlq_stats(self) -> Dict:
        """Get DLQ statistics for monitoring."""
        consumer = KafkaConsumer(
            self.dlq_topic,
            bootstrap_servers=self.producer.config["bootstrap_servers"],
            auto_offset_reset="earliest",
        )

        total_messages = 0
        error_types = {}
        pipelines = {}

        for partition in consumer.assignment():
            total_messages += consumer.end_offsets([partition])[partition]

        consumer.close()

        return {
            "total_messages": total_messages,
            "error_types": error_types,
            "pipelines": pipelines,
        }


# Example: Pipeline with DLQ integration
def process_with_dlq(kafka_bootstrap: str, input_topic: str):
    """Process messages with dead letter queue for failures."""
    dlq = DeadLetterQueueManager(kafka_bootstrap, "pipeline.dlq")
    consumer = KafkaConsumer(
        input_topic,
        bootstrap_servers=kafka_bootstrap,
        group_id="pipeline-processor",
        auto_offset_reset="earliest",
        enable_auto_commit=False,
    )

    for message in consumer:
        message_key = f"{message.topic}:{message.offset}"

        if not dlq.should_retry(message_key):
            dlq.send_to_dlq(
                message,
                Exception("Max retries exceeded"),
                "daily_orders_etl",
                "transform",
            )
            consumer.commit()
            continue

        try:
            # Process message
            data = json.loads(message.value.decode("utf-8"))
            # ... transformation logic ...
            consumer.commit()

        except Exception as e:
            dlq.send_to_dlq(message, e, "daily_orders_etl", "transform")

    consumer.close()

Idempotency in Practice: Every retried operation must be idempotent. For database writes, use INSERT ... ON CONFLICT DO UPDATE (PostgreSQL) or INSERT ... ON DUPLICATE KEY UPDATE (MySQL). For Kafka consumers, commit offsets only after successful processing. For API calls, include a unique request ID in the payload and use idempotency keys.

Chaos Engineering for Pipelines: Test error handling by deliberately injecting failures: kill processes, corrupt data, introduce latency, and exhaust resources. Use tools like Chaos Monkey, Litmus, or custom fault injection to verify that retries, DLQs, and circuit breakers work correctly under real failure conditions.

Classify errors: transient (retry), permanent (DLQ), systemic (circuit breaker).
Exponential backoff with jitter prevents thundering herd: wait = base * 2^attempt * random(0,1).
Dead letter queues decouple poison messages from main processing, enabling debugging and reprocessing.
Circuit breakers prevent cascading failures by stopping calls to failing services.
All retried operations must be idempotent: f(f(x)) = f(x). Use dedup keys and UPSERT operations.
Monitor DLQ depth and error rates. Alert when thresholds are exceeded.
Test resilience with chaos engineering: inject failures and verify error handling works correctly.

Best Practices

Classify errors before retrying. Only retry transient errors (timeouts, rate limits, locks). Route permanent errors to DLQ.
Use exponential backoff with jitter to prevent thundering herd. Base delay: 1s, max delay: 60s, jitter: random(0,1).
Set max retries based on error tolerance. For critical pipelines: 5 retries. For non-critical: 3 retries.
Implement DLQs for every pipeline. Store original message, error details, retry count, and timestamps.
Ensure idempotency for all retried operations. Use deduplication keys, UPSERT operations, and transactional writes.
Deploy circuit breakers before external service calls. Configure: error rate threshold (50%), minimum requests (10), timeout (30s).
Implement fallbacks for graceful degradation. Use cached data, default values, or partial results when primary operations fail.
Set timeouts on all external calls. Prevent hung requests from blocking resources. Typical timeout: 30s.
Monitor error rates per pipeline stage, error type, and error category. Alert on anomalies.
Build DLQ tooling: create tools to inspect, debug, and reprocess DLQ messages. Automate common remediations.

Error Handling Pattern Selection

Pattern	Use Case	Complexity	Latency Impact	Resource Impact
Simple Retry	Transient failures	Low	Low	Low
Exponential Backoff	Rate limits, timeouts	Low	Medium	Low
Circuit Breaker	Cascading failures	Medium	High (when open)	Low
Dead Letter Queue	Poison messages	Medium	Low	Medium
Fallback	Graceful degradation	Medium	Low	Low
Bulkhead	Resource isolation	High	Low	High
Chaos Engineering	Resilience testing	High	N/A	High

Retry Configuration Guide

Error Type	Max Retries	Base Delay	Max Delay	Jitter	Example
Network Timeout	5	1s	30s	Yes	API calls
Rate Limit (429)	10	60s	300s	Yes	External APIs
Database Lock	3	5s	60s	Yes	DB writes
Schema Error	0	N/A	N/A	No	Permanent failure
Disk Full	0	N/A	N/A	No	Systemic failure
OOM	2	30s	300s	Yes	Resource exhaustion