πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

18. Garbage Collection Tuning in PySpark

🟒 Free Lesson

Advertisement

18. Garbage Collection Tuning in PySpark

DfGarbage Collection (GC)

Garbage collection is the automatic memory management process in the JVM that reclaims memory occupied by objects that are no longer referenced by the application. In PySpark, GC directly impacts application performance and stability.

DfStop-the-World (STW) Pause

A stop-the-world pause occurs when the JVM suspends all application threads to perform garbage collection. Long STW pauses can cause application timeouts, data skew, and performance degradation.

GC Overhead Formula
OGC=TGCTtotalΓ—100%O_{GC} = \frac{T_{GC}}{T_{total}} \times 100\%

Here,

  • OGCO_{GC}=GC overhead percentage (target < 5%)
  • TGCT_{GC}=Time spent in garbage collection
  • TtotalT_{total}=Total application execution time

Heap Memory Distribution

Htotal=Hyoung+Hold+Hoff_heapH_{total} = H_{young} + H_{old} + H_{off\_heap}

Here,

  • HtotalH_{total}=Total heap memory allocated
  • HyoungH_{young}=Young generation space (eden + survivor)
  • HoldH_{old}=Old generation space (tenured)
  • Hoff_heapH_{off\_heap}=Off-heap memory (for Tachyon, Netty)

Spark uses both on-heap and off-heap memory. On-heap memory is managed by the JVM garbage collector, while off-heap memory is managed manually. Understanding this distinction is crucial for effective GC tuning.

Use the G1GC garbage collector for most Spark workloads. It provides a good balance between throughput and pause times. For low-latency requirements, consider ZGC or Shenandoah.

ThGC Impact on Performance

Theorem: GC overhead directly impacts Spark application performance. When GC overhead exceeds 5% of total execution time, applications experience significant performance degradation. Optimal GC configuration reduces overhead to below 2% while maintaining sufficient memory for application operations.

  • Garbage collection reclaims unused JVM memory automatically
  • Stop-the-world pauses cause application thread suspension
  • GC overhead should be kept below 5% for optimal performance
  • G1GC is recommended for most Spark workloads
  • Off-heap memory reduces GC pressure but requires manual management

JVM Memory Model & GC Comparison

JVM Heap Memory LayoutTotal HeapYoung GenEden + SurvivorShort-lived objectsOld GenTenured objectsLong-lived dataOff-Heap (Tungsten)No GC overhead | Binary format | Managed by Sparkspark.memory.offHeap.enabled=trueGC Algorithm ComparisonG1GC (Default)Balanced throughput/latencyZGCLow latency {"<"}10msShenandoahConcurrent collectionCMS (Legacy)Deprecated in Java 9+Target GC overhead {"<"} 5% | Use G1GC for most workloads | Off-heap eliminates GC for Tungsten dataLong STW pauses cause timeouts and stragglers

πŸ—οΈ Garbage Collection Architecture

πŸ“š Detailed Explanation

Why GC Tuning Matters

  • JVM's automatic memory management can become a bottleneck
  • Untuned GC β†’ long pause times, out-of-memory errors, degraded performance
  • Proper tuning keeps GC overhead below 5% of total execution time

JVM Memory Structure

GenerationPurposeContents
Young GenerationNew object allocationEden + Survivor spaces
Old GenerationLong-lived objectsPromoted from young generation
Off-Heap MemoryManual managementOutside JVM heap, reduces GC pressure

Spark Executor Memory Layout

Memory TypePurpose
Execution MemoryShuffles, joins, sorts
Storage MemoryCached data, broadcast variables
User MemoryUser data structures
Reserved MemorySpark internals

Understanding these divisions helps configure appropriate heap sizes.


Garbage Collector Comparison

CollectorBest ForTrade-off
Serial GCSingle-core machinesSimple, low overhead
Parallel GCMulti-core, throughput-focusedBetter throughput
G1GCMost Spark workloadsBalance throughput/latency
ZGC/ShenandoahLatency-sensitive appsUltra-low latency

Key Takeaway: Use G1GC for most Spark workloads. Consider ZGC for low-latency requirements.


Key GC Tuning Parameters

ParameterDescription
-Xmx / -XmsHeap size (max/start)
-XX:NewRatioYoung to old generation ratio
-XX:+UseG1GCSelect G1 garbage collector
-XX:MaxGCPauseMillisTarget max pause time
-XX:G1HeapRegionSizeG1 region size

GC Logging & Monitoring

  • Enable via JVM flags: -Xlog:gc* or -XX:+PrintGCDetails
  • Metrics available through Spark UI
  • Reveals:
    • GC frequency
    • Pause times
    • Memory pressure

Common GC Issues & Solutions

IssueSymptomSolution
Frequent minor GCHigh GC overheadIncrease young generation size
Long major GC pausesApplication stallsTune old generation, use G1GC
OutOfMemoryErrorCrashIncrease heap, fix memory leaks

Code-Level Optimizations

  1. Avoid unnecessary object creation
  2. Use primitive types instead of wrapper classes
  3. Reuse objects where possible
  4. Minimize broadcast variables for large datasets

Key Takeaway: Code-level changes can significantly reduce GC pressure.


Production Monitoring

Set alerts for:

  • GC overhead > 5% of total execution time
  • Pause times > 1 second
  • Memory utilization trends

Advanced Techniques

  • Off-heap memory for large datasets
  • Memory-mapped files for external data
  • Custom memory allocators for specific use cases
  • Provide more control but require careful implementation

πŸ“Š Key Concepts Table

ConceptDescriptionUse Case
Garbage CollectionAutomatic memory reclamationJVM memory management
Stop-the-World PauseThread suspension during GCPerformance impact
Young GenerationNew object allocation spaceShort-lived objects
Old GenerationLong-lived object storagePersistent data
Off-Heap MemoryNon-JVM managed memoryLarge datasets
GC AlgorithmCollection strategy selectionPerformance optimization
Heap SizeTotal memory allocationResource configuration
GC OverheadTime spent in GCPerformance metric

πŸ’» Code Examples

Basic GC Configuration

from pyspark.sql import SparkSession

# Configure Spark with GC settings
# Parameter: spark.executor.extraJavaOptions β€” JVM options for executors
spark = SparkSession.builder \
    .appName("GCConfiguration") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.extraJavaOptions", 
        "-XX:+UseG1GC "
        "-XX:MaxGCPauseMillis=200 "
        "-XX:InitiatingHeapOccupancyPercent=35 "
        "-XX:G1HeapRegionSize=16m "
        "-XX:G1ReservePercent=15 "
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
    .config("spark.driver.extraJavaOptions",
        "-XX:+UseG1GC "
        "-XX:MaxGCPauseMillis=200 "
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
    .getOrCreate()

# Read data
df = spark.read \
    .format("parquet") \
    .load("hdfs:///data/input")

# Process data
result = df.groupBy("category").count()

# Write results
result.write \
    .mode("overwrite") \
    .parquet("hdfs:///data/output")

spark.stop()

Memory Configuration for Large Datasets

from pyspark.sql import SparkSession

# Configure memory for large dataset processing
# Parameter: spark.memory.fraction β€” fraction of heap for execution + storage
spark = SparkSession.builder \
    .appName("LargeDatasetGC") \
    .config("spark.executor.memory", "16g") \
    .config("spark.executor.memoryOverhead", "4g") \
    .config("spark.memory.fraction", "0.8") \
    .config("spark.memory.storageFraction", "0.3") \
    .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC "
        "-XX:MaxGCPauseMillis=300 "
        "-XX:InitiatingHeapOccupancyPercent=40 "
        "-XX:G1HeapRegionSize=32m "
        "-XX:G1ReservePercent=20 "
        "-XX:ConcGCThreads=4 "
        "-XX:ParallelGCThreads=8 "
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
    .getOrCreate()

# Read large dataset
df = spark.read \
    .format("parquet") \
    .load("hdfs:///data/large_dataset")

# Process with caching
df.cache()
df.count()  # Materialize cache

# Perform complex transformations
result = df \
    .repartition(500) \
    .groupBy("category", "subcategory") \
    .agg({"value": "sum", "count": "count"}) \
    .orderBy("category")

# Write results
result.write \
    .mode("overwrite") \
    .parquet("hdfs:///data/output")

spark.stop()

GC Monitoring and Analysis

from pyspark.sql import SparkSession
import time
import psutil

spark = SparkSession.builder \
    .appName("GCMonitoring") \
    .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC "
        "-XX:+PrintGCDetails "
        "-XX:+PrintGCDateStamps "
        "-Xloggc:/tmp/spark-gc.log") \
    .getOrCreate()

# Function to monitor memory usage
def monitor_memory():
    """Monitor JVM memory usage"""
    # Get Spark context
    sc = spark.sparkContext
    
    # Get executor memory info
    # Note: This is a simplified monitoring approach
    # In production, use JMX or Spark metrics system
    
    print("Memory monitoring started")
    
    # Monitor during processing
    df = spark.read \
        .format("parquet") \
        .load("hdfs:///data/input")
    
    # Cache and count to trigger GC
    df.cache()
    count = df.count()
    
    print(f"Row count: {count}")
    
    # Perform operations that may trigger GC
    result = df \
        .repartition(100) \
        .groupBy("category") \
        .agg({"value": "sum"})
    
    result.write \
        .mode("overwrite") \
        .parquet("hdfs:///data/output")

# Function to analyze GC logs
def analyze_gc_logs():
    """Analyze GC log files"""
    # This would typically parse the GC log file
    # For demonstration, we'll show the expected log format
    
    print("GC Log Analysis:")
    print("Look for patterns like:")
    print("- GC pause times > 500ms")
    print("- Full GC frequency > 1 per minute")
    print("- Memory allocation failures")
    print("- Promotion failures")

# Run monitoring
monitor_memory()
analyze_gc_logs()

spark.stop()

GC Optimization for Specific Workloads

from pyspark.sql import SparkSession

# Configuration for batch processing (throughput-optimized)
spark_batch = SparkSession.builder \
    .appName("BatchProcessingGC") \
    .config("spark.executor.memory", "32g") \
    .config("spark.executor.extraJavaOptions",
        "-XX:+UseParallelGC "
        "-XX:ParallelGCThreads=8 "
        "-XX:+PrintGCDetails "
        "-Xloggc:/tmp/spark-batch-gc.log") \
    .getOrCreate()

# Configuration for streaming (latency-optimized)
spark_streaming = SparkSession.builder \
    .appName("StreamingGC") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC "
        "-XX:MaxGCPauseMillis=100 "
        "-XX:InitiatingHeapOccupancyPercent=30 "
        "-XX:+PrintGCDetails "
        "-Xloggc:/tmp/spark-streaming-gc.log") \
    .getOrCreate()

# Configuration for iterative algorithms (memory-optimized)
spark_iterative = SparkSession.builder \
    .appName("IterativeAlgorithmGC") \
    .config("spark.executor.memory", "16g") \
    .config("spark.memory.fraction", "0.85") \
    .config("spark.executor.extraJavaOptions",
        "-XX:+UseG1GC "
        "-XX:G1HeapRegionSize=16m "
        "-XX:InitiatingHeapOccupancyPercent=35 "
        "-XX:+PrintGCDetails "
        "-Xloggc:/tmp/spark-iterative-gc.log") \
    .getOrCreate()

# Example: Batch processing
df_batch = spark_batch.read \
    .format("parquet") \
    .load("hdfs:///data/batch_input")

result_batch = df_batch \
    .repartition(200) \
    .groupBy("category") \
    .agg({"value": "sum"})

result_batch.write \
    .mode("overwrite") \
    .parquet("hdfs:///data/batch_output")

spark_batch.stop()

# Example: Streaming
df_stream = spark_streaming.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "topic1") \
    .load()

result_stream = df_stream \
    .selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

result_stream.awaitTermination()
spark_streaming.stop()

# Example: Iterative algorithm (e.g., PageRank)
df_iterative = spark_iterative.read \
    .format("parquet") \
    .load("hdfs:///data/graph_data")

# Simple iterative algorithm
for i in range(10):
    df_iterative = df_iterative \
        .repartition(100) \
        .groupBy("node_id") \
        .agg({"rank": "sum"}) \
        .withColumn("rank", col("sum(rank)") * 0.85 + 0.15)

df_iterative.write \
    .mode("overwrite") \
    .parquet("hdfs:///data/pagerank_output")

spark_iterative.stop()

πŸ“ˆ Performance Metrics

MetricOptimalTypicalPoorOptimization
GC Overhead< 2%2-5%> 5%Heap tuning, code optimization
Minor GC Pause< 50ms50-200ms> 200msYoung gen sizing
Major GC Pause< 200ms200ms-1s> 1sAlgorithm selection
GC Frequency< 1/min1-10/min> 10/minMemory allocation patterns
Memory Utilization60-80%40-60%< 40%Caching, memory config

πŸ† Best Practices

  1. Monitor GC metrics β€” Track overhead, pause times, and frequency
  2. Use appropriate collector β€” G1GC for most workloads, ZGC for low latency
  3. Right-size heap β€” Avoid over-provisioning or under-provisioning
  4. Tune generation ratios β€” Adjust young/old generation balance
  5. Enable GC logging β€” Use logs for analysis and tuning
  6. Optimize code β€” Reduce object creation and unnecessary allocations
  7. Use off-heap memory β€” For large datasets to reduce GC pressure
  8. Test with production data β€” Validate GC settings with realistic workloads
  9. Set up alerts β€” Monitor for high GC overhead and long pauses
  10. Document configurations β€” Maintain records of GC tuning changes

πŸ”— Related Topics

  • 08-caching-persistence.mdx: Memory management and caching
  • 10-serialization-kryo.mdx: Serialization impact on memory
  • 17-cluster-management.mdx: Cluster-level resource management

See Also

  • 09-udf-optimization.mdx: UDF memory usage patterns
  • 11-structured-streaming.mdx: Streaming memory requirements
  • Kafka Streams (kafka/03): JVM tuning in streaming systems
  • Data Engineering Streaming (data-engineering/022): Production memory management
⭐

Premium Content

18. Garbage Collection Tuning in PySpark

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert PySpark Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement