πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Spark on Kubernetes

Apache SparkInfrastructure⭐ Premium

Advertisement

Spark on Kubernetes

Difficulty: Senior Level | Companies: Databricks, Netflix, Uber, Apple, Airbnb

Spark on Kubernetes Architecture

Kubernetes mode runs Spark driver and executors as pods. This provides better resource isolation and cloud-native deployment.

Basic Kubernetes Configuration

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("SparkOnK8s") \
    .master("k8s://https://kubernetes.default.svc:6443") \
    .config("spark.kubernetes.container.image", "spark:3.5.0") \
    .config("spark.kubernetes.container.image.pullPolicy", "IfNotPresent") \
    .config("spark.kubernetes.namespace", "spark-jobs") \
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark") \
    .config("spark.kubernetes.driver.podTemplateFile", "driver-pod.yaml") \
    .config("spark.kubernetes.executor.podTemplateFile", "executor-pod.yaml") \
    .getOrCreate()

ℹ️

Interview Insight: Kubernetes mode requires container images with Spark installed. Use official Spark Docker images or build custom ones with your dependencies.

Pod Resource Configuration

# Configure executor resources via Spark properties
spark = SparkSession.builder \
    .appName("K8sResources") \
    .config("spark.kubernetes.executor.request.cores", "2") \
    .config("spark.kubernetes.executor.limit.cores", "4") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.memoryOverhead", "2g") \
    .config("spark.kubernetes.driver.request.cores", "1") \
    .config("spark.kubernetes.driver.limit.cores", "2") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

# Resource requests guarantee minimum resources
# Resource limits cap maximum usage
# Always set both for production workloads

Dynamic Allocation on Kubernetes

spark = SparkSession.builder \
    .appName("K8sDynamicAlloc") \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", "2") \
    .config("spark.dynamicAllocation.maxExecutors", "50") \
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s") \
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s") \
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s") \
    .config("spark.shuffle.service.enabled", "false") \
    .config("spark.kubernetes.executor.deleteOnTermination", "true") \
    .getOrCreate()

# On Kubernetes, Spark manages executor pods directly
# No external shuffle service needed (Spark 3.x)
# Executors are created and destroyed as pods

⚠️

Warning: Dynamic allocation on Kubernetes creates/deletes pods frequently. Ensure your cluster autoscaler is configured to handle pod churn without excessive delays.

Container Image Management

# Build custom Spark image with dependencies
# Dockerfile example:
# FROM apache/spark:3.5.0
# COPY requirements.txt /opt/spark/requirements.txt
# RUN pip install -r /opt/spark/requirements.txt
# COPY jars/ /opt/spark/jars/

# Reference custom image
spark = SparkSession.builder \
    .appName("CustomImage") \
    .config("spark.kubernetes.container.image", "myregistry/spark:3.5.0-custom") \
    .config("spark.kubernetes.container.image.pullSecrets", "registry-secret") \
    .config("spark.kubernetes.container.image.pullPolicy", "Always") \
    .getOrCreate()

# Cache images on nodes for faster startup
# Use DaemonSet or pre-pull images in node setup

Volume Mounts and Storage

# Mount PVCs for persistent storage
spark = SparkSession.builder \
    .appName("K8sStorage") \
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-storage.mount.path", "/data") \
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.spark-storage.mount.readOnly", "false") \
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-storage.mount.path", "/data") \
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-storage.options.claimName", "spark-pvc") \
    .getOrCreate()

# Use emptyDir for temporary shuffle storage
spark = SparkSession.builder \
    .appName("K8sTempStorage") \
    .config("spark.kubernetes.executor.volumes.emptyDir.spark-local-dir.mount.path", "/tmp/spark-local") \
    .config("spark.kubernetes.executor.volumes.emptyDir.spark-local-dir.options.sizeLimit", "50Gi") \
    .config("spark.local.dir", "/tmp/spark-local") \
    .getOrCreate()

# Mount ConfigMaps for configuration
spark = SparkSession.builder \
    .appName("K8sConfig") \
    .config("spark.kubernetes.driver.volumes.configMap.spark-conf.mount.path", "/opt/spark/conf") \
    .config("spark.kubernetes.driver.volumes.configMap.spark-conf.options.name", "spark-config") \
    .getOrCreate()

Secrets Management

# Access Kubernetes secrets
spark = SparkSession.builder \
    .appName("K8sSecrets") \
    .config("spark.kubernetes.driver.secretKeyRef.DB_PASSWORD", "db-secret:password") \
    .config("spark.kubernetes.executor.secretKeyRef.DB_PASSWORD", "db-secret:password") \
    .config("spark.kubernetes.driver.env.DB_HOST", "db-service") \
    .getOrCreate()

# Secrets are mounted as environment variables
# Use Kubernetes secrets for passwords, API keys, certificates

ℹ️

Pro Tip: Never hardcode secrets in Spark configs or container images. Always use Kubernetes secrets or a secrets management service like Vault.

Monitoring on Kubernetes

# Enable metrics for Kubernetes monitoring
spark = SparkSession.builder \
    .appName("K8sMonitoring") \
    .config("spark.eventLog.enabled", "true") \
    .config("spark.eventLog.dir", "s3a://logs/spark-events") \
    .config("spark.metrics.conf", "metrics.properties") \
    .config("spark.metrics.conf.*.sink.class", "org.apache.spark.metrics.sink.PrometheusSink") \
    .getOrCreate()

# Prometheus scraping configuration
# Pod annotations for Prometheus:
# prometheus.io/scrape: "true"
# prometheus.io/port: "4040"
# prometheus.io/path: "/metrics/prometheus"

# Monitor with kubectl
# kubectl get pods -n spark-jobs
# kubectl logs -f spark-driver-pod -n spark-jobs
# kubectl describe pod spark-executor-pod -n spark-jobs

Network Configuration

# Configure network for driver-executor communication
spark = SparkSession.builder \
    .appName("K8sNetwork") \
    .config("spark.kubernetes.driver.podIP", "driver-pod-ip") \
    .config("spark.kubernetes.executor.podIP", "executor-pod-ip") \
    .config("spark.driver.host", "driver-hostname") \
    .config("spark.driver.port", "7077") \
    .config("spark.blockManager.port", "7078") \
    .getOrCreate()

# Use headless services for stable DNS
# apiVersion: v1
# kind: Service
# metadata:
#   name: spark-driver
# spec:
#   clusterIP: None
#   selector:
#     app: spark-driver
#   ports:
#   - port: 7077

Production Deployment Patterns

# Pattern 1: Scheduled batch jobs with CronJob
# apiVersion: batch/v1
# kind: CronJob
# metadata:
#   name: spark-daily-etl
# spec:
#   schedule: "0 2 * * *"
#   jobTemplate:
#     spec:
#       template:
#         spec:
#           containers:
#           - name: spark-submit
#             image: myregistry/spark:3.5.0
#             command: ["spark-submit", "--master", "k8s://...", "local:///app/etl.py"]

# Pattern 2: Streaming with Deployment
# Use StatefulSet for long-running streaming applications
# Provides stable network identity and persistent storage

# Pattern 3: ML training with Job
# Use Kubernetes Jobs for finite ML training tasks
# Jobs automatically clean up after completion

ℹ️

Key Takeaway: Spark on Kubernetes provides cloud-native deployment with better resource isolation. Configure pod templates carefully, use dynamic allocation, and leverage Kubernetes-native monitoring and secrets management.

Follow-Up Questions

  • How does Spark on Kubernetes differ from Spark on YARN architecturally?
  • Explain the trade-offs between client and cluster deploy modes on K8s.
  • How would you handle pod failures and retries in production?
  • Describe strategies for optimizing Spark startup time on Kubernetes.
  • How does Spark integrate with Kubernetes autoscaling?

Advertisement