PySpark RDD Fundamentals

RDDs are immutable, partitioned, distributed collections with lineage-based fault tolerance
Narrow transformations can be pipelined; wide transformations require shuffle barriers
Optimal partition size is 128MB–200MB; partition count = max(dataSize/partitionSize, totalCores)
Recovery cost is proportional to lineage depth × data size per partition
Use persist() for reuse; use coalesce() to reduce partitions without full shuffle
Data skew causes stragglers: stage time = max(partition times)

Definition: Resilient Distributed Dataset (RDD)

An RDD is an immutable, partitioned collection of elements that can be operated on in parallel. Each RDD is defined by five properties: a list of partitions, a function to compute each split, a list of dependencies on parent RDDs, an optional partitioner (for key-value RDDs), and an optional list of preferred locations for each split.

Definition: Partition

A partition is a logical chunk of data stored on a single node. The number of partitions determines parallelism — each partition is processed by one task on one executor core.

Definition: Lineage

Lineage is the complete record of transformations used to build an RDD. It is stored as a DAG and enables fault tolerance by allowing Spark to recompute only the lost partitions without data replication.

Definition: Shuffle

A shuffle is the process of redistributing data across partitions, typically across the network. It occurs during wide transformations and is usually among the most expensive operations in Spark because it involves disk I/O, network I/O, and serialization.

[Figures: Narrow vs Wide Transformation Partition Mapping; RDD Lineage Fault Tolerance; RDD Architecture Overview; RDD Transformation DAG; Narrow vs Wide Transformations]

Partition Count Formula

$$P = \max\left(\frac{S_{data}}{S_{partition}}, C_{cores}\right)$$

Here,

$P$ = Number of partitions
$S_{data}$ = Total data size in bytes
$S_{partition}$ = Target partition size (default 128 MB for HDFS)
$C_{cores}$ = Total available executor cores
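The formula above can be evaluated with a few lines of plain Python. This is a minimal sketch, not a Spark API: the function name `partition_count` and its parameters are hypothetical, and rounding the size ratio up to a whole partition is an added assumption, since the formula itself does not say how to round.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def partition_count(data_bytes: int, target_partition_bytes: int, total_cores: int) -> int:
    """Evaluate P = max(S_data / S_partition, C_cores).

    Assumption: the size ratio is rounded up so the result is a whole number.
    """
    if target_partition_bytes <= 0 or total_cores <= 0:
        raise ValueError("partition size and core count must be positive")
    by_size = math.ceil(data_bytes / target_partition_bytes)
    return max(by_size, total_cores)


# 10 GB at 128 MB per partition gives 80 partitions, more than the 16 cores.
large = partition_count(10 * GB, 128 * MB, 16)

# 1 GB at 128 MB per partition gives only 8, so the core count (16) wins.
small = partition_count(1 * GB, 128 * MB, 16)

print(large, small)
```

The second case shows why the core count acts as a floor: with fewer partitions than cores, some cores would sit idle.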

Recovery Cost (Lineage Recomputation)

$$Cost_{recovery} = \sum_{i=1}^{k} Cost(T_i) \times D_i$$

Here,

$Cost_{recovery}$ = Total cost to recompute lost partitions
$Cost(T_i)$ = Cost of transformation $T_i$ in the lineage
$D_i$ = Data size processed at step $i$
$k$ = Number of transformations in the lineage path
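As a worked instance of the recovery-cost sum, the sketch below evaluates it for a made-up three-step lineage. The per-unit costs and data sizes are illustrative values in arbitrary units, not measurements, and `recovery_cost` is a hypothetical helper name.

```python
def recovery_cost(steps):
    """Sum Cost(T_i) * D_i over the lineage path.

    `steps` is a list of (cost_per_unit, data_units) pairs, one per transformation.
    """
    return sum(cost * size for cost, size in steps)


# Hypothetical lineage of k = 3 transformations (arbitrary units).
lineage = [
    (2.0, 100),  # T_1: cost 2.0 per unit over D_1 = 100 units -> 200
    (1.0, 100),  # T_2: cost 1.0 per unit over D_2 = 100 units -> 100
    (3.0, 40),   # T_3: cost 3.0 per unit over D_3 = 40 units  -> 120
]

total = recovery_cost(lineage)

# If the RDD were checkpointed after T_2, only T_3 would need recomputing.
after_checkpoint = recovery_cost(lineage[2:])

print(total, after_checkpoint)
```

The second call illustrates why checkpointing a long lineage lowers recovery cost: the steps before the checkpoint drop out of the sum.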

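Narrow transformations (map, filter, flatMap) keep a 1:1 parent-to-child partition mapping, so Spark can pipeline them within a single stage without a shuffle. Wide transformations (groupByKey, reduceByKey, join) have an M:N mapping: each child partition may read from many parent partitions, so they introduce a shuffle barrier that ends the stage and cannot be pipelined across. One exception is a join of two RDDs that already share the same partitioner, which can run without a shuffle.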
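The two partition mappings can be imitated in plain Python, with lists standing in for partitions. This is only a toy model under stated assumptions: `narrow_map` and `wide_group_by_key` are hypothetical helpers, keys are small integers, and records are routed with `key % num_out` as a stand-in for a hash partitioner.

```python
from collections import defaultdict


def narrow_map(partitions, fn):
    """Narrow: output partition i is computed from input partition i only."""
    return [[fn(x) for x in part] for part in partitions]


def wide_group_by_key(partitions, num_out):
    """Wide: every (key, value) record is routed to partition key % num_out,
    so each output partition may receive records from every input partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:
        for key, value in part:
            out[key % num_out][key].append(value)
    return [dict(d) for d in out]


parts = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]

# Narrow step: partition boundaries are unchanged.
pairs = narrow_map(parts, lambda x: (x % 3, x))

# Wide step: records with the same key end up in the same partition.
grouped = wide_group_by_key(pairs, num_out=3)

print([len(p) for p in pairs])
print(grouped)
```

After the narrow step each partition still holds the records it started with; after the wide step, output partition 1 holds values that originated in all three input partitions, which is the data movement a shuffle performs.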

A common target is 128–200 MB per partition. Too few partitions leave cores underutilized; too many cause excessive task scheduling overhead. Use repartition() to increase the partition count (this always performs a full shuffle) or coalesce() to decrease it without a full shuffle.

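Avoid collect() on large datasets: it brings all data to the driver node, which can cause an OutOfMemoryError. Use take(n) to inspect a few elements, or foreach() to process elements on the executors. Note that show(n) is a DataFrame method and is not available on RDDs.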

Theorem: Fault Tolerance via Lineage

Any lost partition of an RDD can be recomputed from its lineage in O(L × D) time, where L is the lineage depth (number of transformations) and D is the data size at that partition. Provided the transformations are deterministic, this guarantees correctness without data replication, unlike storage systems such as HDFS, which replicates each block 3× by default.

Code Examples

Basic RDD Operations

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("RDD_Basics").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create RDD from collection
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(data, 4)  # 4 partitions
print(f"Partitions: {rdd.getNumPartitions()}")

# Narrow transformations (no shuffle)
mapped = rdd.map(lambda x: x * 2)
filtered = rdd.filter(lambda x: x > 5)
flattened = rdd.flatMap(lambda x: [x, x * 10])

# Wide transformation (shuffle)
pairs = rdd.map(lambda x: (x % 3, x))
grouped = pairs.groupByKey()
reduced = pairs.reduceByKey(lambda a, b: a + b)

# Actions (trigger execution)
print(f"Count: {rdd.count()}")
print(f"First: {rdd.first()}")
print(f"Take: {rdd.take(3)}")
print(f"Sum: {rdd.reduce(lambda a, b: a + b)}")

# Check lineage on an RDD that has a shuffle in its history
print(reduced.toDebugString().decode())

sc.stop()

RDD Persistence Levels

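from pyspark import StorageLevel

# Assumes an existing RDD named `rdd` (see Basic RDD Operations).
# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
# PySpark always stores Python objects in serialized (pickled) form.
rdd.cache()

# Spark raises an error if you assign a different storage level while
# one is already set, so unpersist before switching levels.
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# Other levels include StorageLevel.DISK_ONLY and StorageLevel.OFF_HEAP.
# MEMORY_ONLY_SER is a Scala/Java level; PySpark 3.x does not define it.

# Unpersist when done
rdd.unpersist()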

Key Concepts Table

Concept	Description	Example
Partition	Logical chunk of data for parallel processing	`rdd.getNumPartitions()`
Lineage	DAG of transformations for fault tolerance	`rdd.toDebugString()`
Narrow Transform	1:1 partition mapping, no shuffle	`map()`, `filter()`, `flatMap()`
Wide Transform	M:N partition mapping, requires shuffle	`groupByKey()`, `reduceByKey()`, `join()`
Lazy Evaluation	Transforms built but not executed until action	Build DAG → Action triggers execution
Action	Triggers computation, returns result	`collect()`, `count()`, `first()`
Cache/Persist	Store RDD in memory/disk for reuse	`rdd.cache()` or `rdd.persist()`
Checkpoint	Write RDD to reliable storage, truncate lineage	`rdd.checkpoint()`
Broadcast	Read-only variable cached on each executor	`sc.broadcast(variable)`
Accumulator	Add-only shared variable for aggregations; tasks add to it, only the driver reads its value	`sc.accumulator(0)`

Best Practices

Prefer DataFrames over RDDs: DataFrames use the Catalyst optimizer and the Tungsten engine for automatic optimization
Use reduceByKey over groupByKey: reduceByKey combines values locally before the shuffle, reducing network I/O
Cache wisely: only cache RDDs that are reused across multiple actions
Partition appropriately: aim for 128–200 MB per partition
Avoid collect() on large datasets: use take(n) or foreach() instead
Use coalesce() to reduce partitions: it avoids the full shuffle that repartition() performs
Enable Kryo serialization: for JVM-side data it is often much faster and more compact than Java serialization (the Spark tuning guide cites up to 10x)
Monitor shuffle spill: spilling to disk indicates memory pressure
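To see why local combining reduces shuffle traffic, the sketch below counts how many records would cross the network with and without a map-side combine. It is a plain-Python toy model, not Spark's implementation: `local_combine` is a hypothetical helper and the input partitions are made-up word-count pairs.

```python
def local_combine(partition, fn):
    """Merge values per key inside one partition, as a map-side combine would."""
    combined = {}
    for key, value in partition:
        combined[key] = fn(combined[key], value) if key in combined else value
    return list(combined.items())


# Two hypothetical input partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]

# groupByKey-style: every record is shuffled as-is.
without_combine = sum(len(p) for p in partitions)

# reduceByKey-style: each partition ships at most one record per distinct key.
combined = [local_combine(p, lambda x, y: x + y) for p in partitions]
with_combine = sum(len(p) for p in combined)

print(without_combine, with_combine)
print(combined)
```

The saving grows with the number of repeated keys per partition; when every key is unique within its partition, the two approaches shuffle the same number of records.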

Key Takeaways

RDDs are the foundation of Spark's distributed computing model
Narrow transformations pipeline without shuffle; wide transformations require shuffle
Lineage enables fault tolerance without data replication
Optimal partition size: 128MB–200MB
Use persist() for reuse; use coalesce() to reduce partitions
Data skew causes stragglers: stage time = max(partition times)
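The straggler effect in the last takeaway can be shown with simple arithmetic. The sketch assumes every partition runs on its own core at the same time, so the stage finishes when the slowest task does. The task durations are invented for illustration and `stage_time` is a hypothetical helper.

```python
def stage_time(partition_seconds):
    """Stage time = max(partition times), assuming one core per partition."""
    return max(partition_seconds)


# Both layouts represent the same total work: 40 task-seconds.
balanced = [10, 10, 10, 10]
skewed = [4, 4, 4, 28]  # one hot key sends most of the data to one partition

balanced_time = stage_time(balanced)
skewed_time = stage_time(skewed)

print(sum(balanced), sum(skewed))
print(balanced_time, skewed_time)
```

With identical total work, the skewed layout takes almost three times as long because three cores idle while the fourth processes the oversized partition.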
