Spark Fundamentals for Big Data

Apache Spark is the unified analytics engine for large-scale data processing. This module covers the core abstractions: RDDs, DataFrames, Spark SQL, and the Catalyst optimizer.

Spark Architecture

1. RDDs (Resilient Distributed Datasets)

RDDs are Spark's foundational abstraction — immutable, partitioned collections of elements that can be operated on in parallel.

Partition count determines parallelism:

\text{Parallelism} = \text{num\_executors} \times \text{cores\_per\_executor}

Key Properties

Resilient: Fault-tolerant via lineage graph
Distributed: Data split across partitions on multiple nodes
Dataset: Collection of partitioned data with primitives

Transformations vs Actions

from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext("local", "RDDExample")

# Transformation (lazy)
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x ** 2)
filtered = squared.filter(lambda x: x > 10)

# Action (triggers computation)
result = filtered.collect()  # [16, 25]
total = filtered.reduce(lambda a, b: a + b)  # 41

Narrow vs Wide Dependencies

2. DataFrames and Spark SQL

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, window, lag

spark = SparkSession.builder \
    .appName("DataEngineering") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()

df = spark.read.parquet("s3://bucket/events/")

# DataFrame API
result = df \
    .filter(col("event_type") == "purchase") \
    .groupBy(window("timestamp", "1 hour")) \
    .agg(
        avg("amount").alias("avg_amount"),
        count("*").alias("total_purchases")
    )

# Spark SQL
df.createOrReplaceTempView("events")
spark.sql("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        AVG(amount) AS avg_amount,
        COUNT(*) AS total_purchases
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY 1
    ORDER BY 1
""")

3. Lazy Evaluation and Catalyst Optimizer

4. Partitioning Strategy

# Repartition by key
df.repartition(100, "user_id")

# Coalesce to reduce partitions
df.coalesce(10)

# Bucketing for frequent joins
df.write \
    .bucketBy(256, "user_id") \
    .sortBy("user_id") \
    .saveAsTable("bucketed_events")

Partition Tuning Rules

Scenario	Recommendation
Small files problem	Coalesce before write
Large shuffle operations	Increase `spark.sql.shuffle.partitions`
Join on key	Partition by join key
Time-series range	Partition by date

5. Spark MLlib

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
rf = RandomForestClassifier(featuresCol="features", labelCol="label_idx")

pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

6. Performance Tuning Checklist

Cache wisely: df.cache() for repeated use; df.persist() with storage level
Avoid shuffles: Use broadcast joins for small tables (broadcast(small_df))
Tune memory: spark.executor.memory, spark.driver.memory
AQE: Adaptive Query Execution (spark.sql.adaptive.enabled=true)
Broadcast threshold: spark.sql.autoBroadcastJoinThreshold

Key Takeaways

RDDs provide low-level control; DataFrames leverage Catalyst optimization
Lazy evaluation enables whole-plan optimization before execution
Partitioning is the primary lever for performance tuning
Shuffle is expensive — design pipelines to minimize data movement

Spark Fundamentals for Big Data

Spark Fundamentals for Big Data

Spark Architecture

1. RDDs (Resilient Distributed Datasets)

Key Properties

Transformations vs Actions

Narrow vs Wide Dependencies

2. DataFrames and Spark SQL

3. Lazy Evaluation and Catalyst Optimizer

4. Partitioning Strategy

Partition Tuning Rules

5. Spark MLlib

6. Performance Tuning Checklist

Key Takeaways

Premium Content

Need Expert Data Science Help?