🚀 EMR Spark Deep Dive

Master EMR Spark dynamic allocation, shuffle optimization, and performance tuning.

Module: AWS Data Engineering • Topic 41 of 65 • Premium Content

Spark on EMR Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SPARK ON EMR ARCHITECTURE                                  │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  DRIVER → EXECUTORS (distributed across core/task nodes)            │    │
│  │                                                                     │    │
│  │  Driver: SparkContext, task scheduling, result aggregation         │    │
│  │  Executors: Task execution, data caching, shuffle                  │    │
│  │                                                                     │    │
│  │  DYNAMIC ALLOCATION:                                                │    │
│  │  • Min/Max executors based on workload                              │    │
│  │  • Scale-out when tasks are pending                                 │    │
│  │  • Scale-in when executors are idle                                 │    │
│  │  • External shuffle service for state preservation                 │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  SHUFFLE OPTIMIZATION:                                                      │
│  • Partition count: Default 200, tune based on data size                   │
│  • Adaptive Query Execution (AQE): Auto-optimize at runtime                │
│  • Skew join handling: Detect and split skewed partitions                   │
│  • Coalesce partitions: Merge small partitions automatically               │
└─────────────────────────────────────────────────────────────────────────────┘

Spark Configuration

spark_config = {
    # Dynamic Allocation
    'spark.dynamicAllocation.enabled': 'true',
    'spark.shuffle.service.enabled': 'true',
    'spark.dynamicAllocation.minExecutors': '2',
    'spark.dynamicAllocation.maxExecutors': '100',
    'spark.dynamicAllocation.executorIdleTimeout': '60s',
    
    # Adaptive Query Execution
    'spark.sql.adaptive.enabled': 'true',
    'spark.sql.adaptive.coalescePartitions.enabled': 'true',
    'spark.sql.adaptive.skewJoin.enabled': 'true',
    'spark.sql.adaptive.skewJoin.skewedPartitionFactor': '5',
    
    # Shuffle Optimization
    'spark.sql.shuffle.partitions': '200',
    'spark.shuffle.compress': 'true',
    'spark.shuffle.spill.compress': 'true',
    'spark.io.compression.codec': 'snappy',
    
    # Memory Management
    'spark.executor.memory': '16g',
    'spark.executor.memoryFraction': '0.8',
    'spark.executor.memoryOverhead': '4g',
    'spark.driver.memory': '8g',
    'spark.memory.fraction': '0.8'
}

Interview Q&A

Q1: How does dynamic allocation work?

Answer: Executors are added when tasks are pending and removed when idle. The external shuffle service preserves shuffle state when executors are removed.

Q2: What is the optimal shuffle partition count?

Answer: Start with 200. Tune based on data size: ~128MB per partition. Use AQE for automatic optimization.

Q3: What is Adaptive Query Execution?

Answer: AQE optimizes queries at runtime based on actual data statistics: auto-coalesce partitions, handle skew joins, optimize join strategies.

Summary

Dynamic Allocation: Auto-scale executors based on workload
AQE: Runtime query optimization based on data statistics
Shuffle: Tune partition count, enable compression
Memory: Proper memory allocation prevents OOM errors
Cost: Use spot instances for task nodes