Spark Fundamentals for Big Data

Module 15: Data Engineering and MLOps

Apache Spark is a unified analytics engine for large-scale data processing. This module covers its core abstractions: RDDs, DataFrames, Spark SQL, and the Catalyst optimizer.

Spark Architecture

[Figure: Spark application architecture. The Driver Program (JVM) hosts the SparkContext / SparkSession and talks to the Cluster Manager (YARN / Mesos / K8s / Standalone), which allocates Executors (JVMs) on Worker Nodes 1 to 3. Each executor holds Cache / Broadcast data and runs tasks in parallel on data partitions.]

1. RDDs (Resilient Distributed Datasets)

RDDs are Spark's foundational abstraction: immutable, partitioned collections of elements that can be operated on in parallel.

Partition count sets how many tasks a stage has; the number of tasks that can run at the same time is capped by the total number of executor cores:

$$\text{Parallelism} = \text{num\_executors} \times \text{cores\_per\_executor}$$
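As a worked example of the formula above, the sketch below computes the number of task slots and how many scheduling "waves" a stage would need for a given partition count. The function name `task_waves` and the cluster sizes are illustrative assumptions, and the arithmetic assumes one core per task and ignores skew and stragglers.

```python
import math


def task_waves(num_partitions, num_executors, cores_per_executor):
    """Return (task_slots, waves) for one stage, assuming one core per task."""
    task_slots = num_executors * cores_per_executor
    # Each wave runs at most `task_slots` tasks; one task per partition.
    waves = math.ceil(num_partitions / task_slots)
    return task_slots, waves


# Hypothetical cluster: 10 executors with 4 cores each, 200 partitions.
slots, waves = task_waves(num_partitions=200, num_executors=10, cores_per_executor=4)
print(slots, waves)  # 40 5
```

With fewer partitions than task slots, some cores sit idle; with many more, the stage runs in several waves.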

Key Properties

  • Resilient: Fault-tolerant via lineage graph
  • Distributed: Data split across partitions on multiple nodes
  • Dataset: A collection of partitioned records, holding either primitive values or objects such as tuples
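The "resilient" property can be illustrated with a toy model in plain Python. This is only a sketch of the idea of lineage, not how Spark implements it: `ToyRDD`, `lose`, and the sample data are hypothetical names invented for the example. Each dataset remembers how to rebuild any of its partitions, so a lost partition is recomputed instead of restored from a replica.

```python
class ToyRDD:
    """Toy stand-in for an RDD that can rebuild partitions from its lineage."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # lineage: recipe for building partition i
        self._cache = {}          # materialized partitions (may be lost)

    def map(self, fn):
        parent = self
        # The child only records how to derive its partition from the parent.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in parent.partition(i)])

    def partition(self, i):
        if i not in self._cache:  # never built, or lost: replay the lineage
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose(self, i):
        """Simulate an executor failure that drops one materialized partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = [[1, 2], [3, 4], [5, 6]]            # made-up data in 3 partitions
base = ToyRDD(3, lambda i: list(source[i]))
doubled = base.map(lambda x: x * 2)

first = doubled.collect()    # [2, 4, 6, 8, 10, 12]
doubled.lose(1)              # partition 1 disappears
again = doubled.collect()    # same result; only partition 1 is recomputed
```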

Transformations vs Actions

from pyspark import SparkContext

# Local SparkContext for the example
sc = SparkContext("local", "RDDExample")

# Create an RDD, then chain transformations (lazy: nothing runs yet)
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x ** 2)           # 1, 4, 9, 16, 25
filtered = squared.filter(lambda x: x > 10)   # keeps 16 and 25

# Actions (trigger computation)
result = filtered.collect()  # [16, 25]
total = filtered.reduce(lambda a, b: a + b)  # 41

sc.stop()
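The laziness of transformations can be observed without a cluster. The following plain-Python sketch is a toy model, not Spark's implementation: `LazySeq` and `square` are hypothetical names. Transformations only record a step; the action `collect` replays the recorded steps.

```python
class LazySeq:
    """Toy model of lazy transformations: records steps, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = list(data)
        self._steps = tuple(steps)  # recorded (kind, fn) pairs; nothing runs yet

    def map(self, fn):
        return LazySeq(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazySeq(self._data, self._steps + (("filter", fn),))

    def collect(self):
        """Action: replay every recorded step over the data."""
        items = self._data
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


calls = []


def square(x):
    calls.append(x)  # record each invocation so laziness is observable
    return x ** 2


pipeline = LazySeq([1, 2, 3, 4, 5]).map(square).filter(lambda x: x > 10)
calls_before_action = len(calls)  # 0: building the pipeline ran nothing
result = pipeline.collect()       # [16, 25]
```

Because the whole chain is known before anything runs, an engine has the chance to rearrange or combine steps first, which is the property the Catalyst section relies on.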

Narrow vs Wide Dependencies

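[Figure: Narrow vs. wide dependencies between parent partitions P0, P1, P2 and child partitions P0', P1', P2'.]

  • Narrow dependency: each partition feeds a limited number of child partitions (map, filter, union)
  • Wide dependency: each partition feeds many child partitions, which requires a shuffle (reduceByKey, groupByKey, join)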
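The difference between the two dependency types can be sketched in plain Python. The functions `map_values` and `reduce_by_key` below are toy versions written for this example, the data is made up, and CRC32 is used only as a deterministic stand-in for Spark's own partitioner hash.

```python
import functools
import zlib
from collections import defaultdict

# Three input partitions of (key, value) records; the data is made up.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def map_values(parts, fn):
    """Narrow: output partition i is built from input partition i only."""
    return [[(k, fn(v)) for k, v in part] for part in parts]


def reduce_by_key(parts, fn, num_out):
    """Wide: any input partition may send records to any output partition."""
    shuffled = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            # Route by key so that equal keys always meet in the same bucket.
            shuffled[zlib.crc32(k.encode()) % num_out][k].append(v)
    return [
        {k: functools.reduce(fn, vs) for k, vs in bucket.items()}
        for bucket in shuffled
    ]


doubled = map_values(partitions, lambda v: v * 2)
totals = reduce_by_key(partitions, lambda a, b: a + b, num_out=2)
```

In `map_values` no record leaves its partition, while `reduce_by_key` has to move records between partitions before it can combine them; that movement is the shuffle.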

2. DataFrames and Spark SQL

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, window

spark = SparkSession.builder \
    .appName("DataEngineering") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()

df = spark.read.parquet("s3://bucket/events/")

# DataFrame API
result = df \
    .filter(col("event_type") == "purchase") \
    .groupBy(window("timestamp", "1 hour")) \
    .agg(
        avg("amount").alias("avg_amount"),
        count("*").alias("total_purchases")
    )

# Spark SQL
df.createOrReplaceTempView("events")
sql_result = spark.sql("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        AVG(amount) AS avg_amount,
        COUNT(*) AS total_purchases
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY 1
    ORDER BY 1
""")

3. Lazy Evaluation and Catalyst Optimizer

[Figure: Catalyst optimizer pipeline. Unresolved Plan → Analysis (resolve references) → Logical Plan (optimize) → Physical Plan (select algorithms) → Code Generation. Key optimizations: predicate pushdown, column pruning, constant folding, join reordering, whole-stage code generation, Tungsten execution.]

Transformations are optimized before execution.
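Constant folding, one of the optimizations named above, is easy to show on a tiny expression tree. The sketch below is a toy rewrite rule in plain Python and makes no claim about Catalyst's internal data structures: the tuple encoding `(op, left, right)`, the column name `amount`, and the function `fold_constants` are all assumptions made for illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(expr):
    """Recursively replace operator nodes whose operands are both literals."""
    if not isinstance(expr, tuple):
        return expr  # a literal number or a column name (string)
    op, left, right = expr
    left, right = fold_constants(left), fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)  # evaluate once, at planning time
    return (op, left, right)


# amount * (60 * 60) is rewritten to amount * 3600 before any row is read.
plan = ("*", "amount", ("*", 60, 60))
optimized = fold_constants(plan)
print(optimized)  # ('*', 'amount', 3600)
```

The rewrite happens on the plan, not on the data, so its cost does not grow with the number of rows.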

4. Partitioning Strategy

# Repartition by key (full shuffle; returns a new DataFrame)
by_user = df.repartition(100, "user_id")

# Coalesce to reduce partitions without a full shuffle
compacted = df.coalesce(10)

# Bucketing for frequent joins
df.write \
    .bucketBy(256, "user_id") \
    .sortBy("user_id") \
    .saveAsTable("bucketed_events")
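To see why repartitioning by key costs more than coalescing, the two operations can be modeled on plain Python lists. This is a simplified sketch under stated assumptions: `repartition_by_key` and `coalesce` are toy functions written for this example, the records are made up, and CRC32 stands in for Spark's own hash function.

```python
import zlib


def repartition_by_key(parts, key_fn, num_out):
    """Full shuffle: every record is routed individually by the hash of its key."""
    out = [[] for _ in range(num_out)]
    for part in parts:
        for rec in part:
            digest = zlib.crc32(str(key_fn(rec)).encode())
            out[digest % num_out].append(rec)
    return out


def coalesce(parts, num_out):
    """No shuffle: whole input partitions are concatenated into fewer outputs."""
    num_out = min(num_out, len(parts))  # coalesce never increases the count
    out = [[] for _ in range(num_out)]
    for i, part in enumerate(parts):
        out[i * num_out // len(parts)].extend(part)
    return out


# Four partitions of (user_id, amount) records.
parts = [[("u1", 10), ("u2", 20)], [("u1", 30)], [("u3", 40)], [("u2", 50)]]

by_user = repartition_by_key(parts, key_fn=lambda r: r[0], num_out=2)
fewer = coalesce(parts, 2)
```

After `repartition_by_key`, all records for one `user_id` sit in the same partition, which is what a later join or aggregation on that key needs. `coalesce` only merges neighbors, so it is cheap but gives no such guarantee and can leave partitions uneven.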

Partition Tuning Rules

| Scenario | Recommendation |
| --- | --- |
| Small files problem | Coalesce before write |
| Large shuffle operations | Increase spark.sql.shuffle.partitions |
| Join on key | Partition by join key |
| Time-series range | Partition by date |

5. Spark MLlib

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

# feature_cols (a list of column names), train_df, and test_df are assumed to be defined earlier
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
rf = RandomForestClassifier(featuresCol="features", labelCol="label_idx")

pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
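The fit/transform contract behind the pipeline can be sketched in plain Python. The classes below are toy stand-ins written for this example and are not MLlib APIs; the classifier stage is left out to keep the sketch short. The indexer assigns index 0 to the most frequent label and breaks ties alphabetically, which mirrors the documented default ordering of `StringIndexer` in recent Spark versions, but treat that detail as an assumption to verify for your version.

```python
from collections import Counter


class ToyStringIndexer:
    """Estimator: learns a label -> index mapping from the training rows."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent label gets index 0; ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


class ToyIndexerModel:
    """Transformer returned by fit(): applies the learned mapping."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        # An unseen label raises KeyError in this toy version.
        return [{**r, self.output_col: self.mapping[r[self.input_col]]} for r in rows]


class ToyAssembler:
    """Transformer with no fit step: gathers columns into one feature list."""

    def __init__(self, input_cols, output_col):
        self.input_cols, self.output_col = input_cols, output_col

    def transform(self, rows):
        return [{**r, self.output_col: [r[c] for c in self.input_cols]} for r in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # later stages see earlier outputs
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train_rows = [
    {"label": "yes", "x1": 1.0, "x2": 0.5},
    {"label": "no", "x1": 2.0, "x2": 1.5},
    {"label": "yes", "x1": 0.0, "x2": 2.5},
]
pipeline = ToyPipeline([ToyStringIndexer("label", "label_idx"),
                        ToyAssembler(["x1", "x2"], "features")])
model = pipeline.fit(train_rows)
out = model.transform([{"label": "no", "x1": 3.0, "x2": 4.0}])
```

The point of the design is that everything learned from the training data lives in the fitted model, so the same mapping is applied unchanged to test data.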

6. Performance Tuning Checklist

  • Cache wisely: df.cache() for DataFrames that are reused; df.persist() when an explicit storage level is needed
  • Avoid shuffles: Use broadcast joins for small tables (broadcast(small_df))
  • Tune memory: spark.executor.memory, spark.driver.memory
  • AQE: Adaptive Query Execution (spark.sql.adaptive.enabled=true)
  • Broadcast threshold: spark.sql.autoBroadcastJoinThreshold
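The broadcast join item in the checklist can be illustrated with a small plain-Python model. It is a sketch of the general technique (a hash join against a small in-memory table), not Spark's implementation; the function `broadcast_hash_join` and the sample tables are hypothetical.

```python
def broadcast_hash_join(large_parts, small_rows, key):
    """Inner-join each partition of the large side against a copy of the small side."""
    # Build the lookup once; in a cluster this copy would be sent to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for part in large_parts:  # partitions are processed independently, in place
        out = []
        for row in part:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined


# Large side: two partitions of events. Small side: a tiny dimension table.
events = [
    [{"user_id": 1, "country": "DE", "amount": 10}],
    [{"user_id": 2, "country": "FR", "amount": 20},
     {"user_id": 3, "country": "XX", "amount": 5}],
]
countries = [
    {"country": "DE", "region": "EU"},
    {"country": "FR", "region": "EU"},
]

result = broadcast_hash_join(events, countries, "country")
```

No record of the large side changes partition, which is why this kind of join avoids a shuffle. It only works while the small side fits comfortably in memory on every executor.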

Key Takeaways

  • RDDs provide low-level control; DataFrames leverage Catalyst optimization
  • Lazy evaluation enables whole-plan optimization before execution
  • Partitioning is the primary lever for performance tuning
  • Shuffle is expensive: design pipelines to minimize data movement