Interview Question
ℹ️ Interview Context
Company: Netflix / Uber
Role: Senior Data Engineer / Platform Engineer
Difficulty: Advanced
Time: 60-90 minutes
Question: "Describe the best practices for running Airflow in production at scale. How do you handle scaling, migration, disaster recovery, and operational excellence?"
Detailed Theory
Production Fundamentals
# production_fundamentals.py
"""
Production Best Practices:
1. Scaling:
- Horizontal scaling
- Vertical scaling
- Auto-scaling
2. High Availability:
- Redundancy
- Failover
- Disaster recovery
3. Monitoring:
- Health checks
- Alerting
- Performance monitoring
4. Operations:
- Runbooks
- Incident response
- Change management
5. Cost Optimization:
- Resource right-sizing
- Spot instances
- Scheduling optimization
"""
1. Scaling Strategies
# scaling_strategies.py
"""
Scaling Airflow:
1. Horizontal Scaling:
- Add more workers
- Add more scheduler instances
- Use Kubernetes for dynamic scaling
2. Vertical Scaling:
- Increase worker resources
- Increase database resources
- Increase webserver resources
3. Auto-scaling:
- Kubernetes HPA
- Celery worker auto-scaling
- Database read replicas
"""
# Kubernetes HPA configuration
HPA_CONFIG = """
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: airflow-worker-hpa
  namespace: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: airflow-worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
"""
# Database read replicas
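DATABASE_SCALING = """
# PostgreSQL scaling
1. Primary database for all Airflow components (scheduler, webserver, workers)
2. Read replicas for external read-heavy queries (reporting, dashboards)
3. Connection pooling with PgBouncer
Configuration:
- Airflow reads and writes -> Primary (Airflow uses a single sql_alchemy_conn)
- Reporting and analytics queries -> Read replicas
- DAG parsing -> Primary (the DAG processor writes serialized DAGs)
"""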
# Celery worker auto-scaling
CELERY_SCALING = """
# Celery worker auto-scaling
1. Monitor queue depth
2. Scale workers based on queue size
3. Use cloud auto-scaling groups
Metrics to monitor:
- Queue depth
- Worker utilization
- Task execution time
"""
ℹ️ Pro Tip
Use Kubernetes for dynamic scaling. It allows you to scale workers based on demand and release resources when not needed.
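The auto-scaling approaches in this section each reduce to a small sizing calculation. The sketch below is illustrative only. `hpa_desired_replicas` follows the replica formula published in the Kubernetes HPA documentation, including its default 10% tolerance, but it ignores the `behavior` policies and the handling of multiple metrics. `celery_workers_for_queue` is a hypothetical helper, not part of Celery or Airflow. The bounds of 2 and 20 are borrowed from `HPA_CONFIG`, and `worker_concurrency` is assumed to mirror the `[celery] worker_concurrency` setting.

```python
import math


def _clamp(value: int, low: int, high: int) -> int:
    """Keep a replica count inside the configured bounds."""
    return max(low, min(high, value))


def hpa_desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 2,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA rule: ceil(current * currentMetric / targetMetric)."""
    ratio = current_utilization / target_utilization
    # The HPA skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return _clamp(math.ceil(current_replicas * ratio), min_replicas, max_replicas)


def celery_workers_for_queue(
    queue_depth: int,
    worker_concurrency: int,
    min_workers: int = 2,
    max_workers: int = 20,
) -> int:
    """Hypothetical rule: enough workers to run every queued task at once."""
    if worker_concurrency <= 0:
        raise ValueError("worker_concurrency must be positive")
    needed = math.ceil(queue_depth / worker_concurrency)
    return _clamp(needed, min_workers, max_workers)
```

For example, 4 workers at 90% CPU against a 70% target give ceil(4 × 90 / 70) = 6 replicas, while 100 queued tasks at a concurrency of 16 call for 7 Celery workers.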
2. High Availability
# high_availability.py
"""
High Availability:
Ensure Airflow remains operational during failures.
"""
# Multi-AZ deployment
MULTI_AZ_CONFIG = """
# Deploy across multiple Availability Zones
1. Webserver: Multi-AZ with ALB
2. Scheduler: Active-active (Airflow 2.0+ runs multiple schedulers concurrently, coordinated by row-level database locks)
3. Workers: Multi-AZ with auto-scaling
4. Database: Multi-AZ with automated failover
"""
# Scheduler HA
SCHEDULER_HA = """
# Scheduler High Availability
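1. Multiple scheduler instances running active-active (Airflow 2.0+)
2. Row-level database locking (SELECT ... FOR UPDATE SKIP LOCKED; needs PostgreSQL or MySQL 8+)
3. No failover step: the remaining schedulers keep scheduling if one stops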
Configuration:
[scheduler]
scheduler_heartbeat_sec = 5
max_tis_per_query = 512
"""
# Database HA
DATABASE_HA = """
# Database High Availability
1. Primary-Replica setup
2. Automated failover
3. Connection pooling
Configuration:
[database]
sql_alchemy_pool_size = 20
sql_alchemy_max_overflow = 30
sql_alchemy_pool_pre_ping = True
"""
# Disaster recovery
DISASTER_RECOVERY = """
# Disaster Recovery Plan
1. Regular backups
2. Cross-region replication
3. Recovery time objectives (RTO)
4. Recovery point objectives (RPO)
Backup strategy:
- Daily full backups
- Hourly incremental backups
- WAL archiving for point-in-time recovery
"""
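The pool settings in `DATABASE_HA` apply per process, so the total load on the database depends on how many Airflow processes hold a pooled engine. As a rough sketch: SQLAlchemy's queue pool lets one engine open up to `pool_size + max_overflow` connections, which gives a worst-case budget to compare against the server limit. The function names and the `reserved` headroom are assumptions made for illustration, and `process_count` has to be measured in your own deployment.

```python
def max_db_connections(
    process_count: int,
    pool_size: int = 20,
    max_overflow: int = 30,
) -> int:
    """Worst-case connections if every pooled process fills its pool."""
    return process_count * (pool_size + max_overflow)


def fits_connection_limit(
    process_count: int,
    db_max_connections: int,
    reserved: int = 10,
    pool_size: int = 20,
    max_overflow: int = 30,
) -> bool:
    """True if the worst case still leaves `reserved` connections free.

    `reserved` is an assumed headroom for admin sessions and backups.
    """
    budget = db_max_connections - reserved
    return max_db_connections(process_count, pool_size, max_overflow) <= budget
```

With the values above, four pooled processes could open up to 200 connections, double PostgreSQL's default `max_connections` of 100, which is one reason to put PgBouncer in front of the database.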
3. Migration Strategies
# migration_strategies.py
"""
Migration Strategies:
Migrate Airflow deployments with minimal downtime.
"""
# Airflow 1.x to 2.x migration
MIGRATION_1_TO_2 = """
# Airflow 1.x to 2.x Migration
1. Database migration
- Backup existing database
- Run airflow db upgrade (renamed to airflow db migrate in Airflow 2.7)
- Verify migration
2. DAG migration
- Update imports
- Replace SubDAGs with TaskGroups
- Update operator imports
3. Configuration migration
- Update airflow.cfg
- Migrate to environment variables
- Test configuration
4. Testing
- Test all DAGs
- Verify connections
- Test scheduling
"""
# Cloud migration
CLOUD_MIGRATION = """
# On-premise to Cloud Migration
1. Assessment
- Inventory current setup
- Identify dependencies
- Plan migration phases
2. Preparation
- Set up cloud infrastructure
- Configure networking
- Set up secrets management
3. Migration
- Migrate DAGs
- Migrate connections
- Migrate variables
4. Validation
- Test all pipelines
- Verify monitoring
- Performance testing
"""
# Multi-region migration
MULTI_REGION_MIGRATION = """
# Multi-region Migration
1. Set up target region
2. Replicate DAGs and configurations
3. Set up database replication
4. Test cross-region failover
5. Gradual traffic migration
"""
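The "Update imports" step of the 1.x to 2.x migration can be partly automated by scanning DAG files for legacy module paths. The sketch below is a minimal example under stated assumptions: `LEGACY_IMPORTS` is a deliberately partial table containing two well-known renames, and `find_legacy_imports` is a hypothetical helper, not an Airflow tool. It only reports findings and never rewrites files.

```python
import re
from typing import Dict, List, Tuple

# Partial mapping of Airflow 1.x module paths to their 2.x replacements.
LEGACY_IMPORTS: Dict[str, str] = {
    "airflow.operators.bash_operator": "airflow.operators.bash",
    "airflow.operators.python_operator": "airflow.operators.python",
}


def find_legacy_imports(source: str) -> List[Tuple[int, str, str]]:
    """Return (line number, old module, new module) for each legacy import."""
    findings: List[Tuple[int, str, str]] = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for old, new in LEGACY_IMPORTS.items():
            # Match both "from <module> import ..." and "import <module>".
            pattern = rf"\b(?:from|import)\s+{re.escape(old)}\b"
            if re.search(pattern, line):
                findings.append((lineno, old, new))
    return findings
```

Running this over every file in the DAG folder before the upgrade gives a checklist of imports to change. Extend the mapping with the provider-package moves that apply to your own DAGs.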
4. Operational Excellence
# operational_excellence.py
"""
Operational Excellence:
Runbooks, incident response, and best practices.
"""
# Runbook template
RUNBOOK_TEMPLATE = """
# Runbook: [Incident Type]
## Overview
- Description of the incident
- Impact assessment
- Severity level
## Detection
- How to detect the incident
- Monitoring alerts
- Symptoms
## Investigation
- Initial investigation steps
- Logs to check
- Metrics to review
## Resolution
- Step-by-step resolution
- Commands to run
- Configuration changes
## Prevention
- How to prevent recurrence
- Improvements to implement
- Monitoring enhancements
"""
# Incident response
INCIDENT_RESPONSE = """
# Incident Response Process
1. Detection
- Monitoring alerts
- User reports
- Automated checks
2. Triage
- Assess severity
- Identify impact
- Assign owner
3. Investigation
- Gather information
- Identify root cause
- Determine scope
4. Resolution
- Implement fix
- Verify resolution
- Communicate status
5. Post-mortem
- Document incident
- Identify improvements
- Update runbooks
"""
# Change management
CHANGE_MANAGEMENT = """
# Change Management Process
1. Change request
- Document change
- Risk assessment
- Approval process
2. Testing
- Test in staging
- Validate changes
- Rollback plan
3. Deployment
- Deploy to production
- Monitor impact
- Verify success
4. Documentation
- Update documentation
- Communicate changes
- Train users
"""
⚠️ Important
Always have a rollback plan for changes. Test changes in staging before production deployment.
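A runbook template is only useful if each runbook actually fills it in. One way to enforce that, sketched here as an assumption rather than an established tool, is a small check that a runbook contains every `##` section from `RUNBOOK_TEMPLATE`. The names `REQUIRED_SECTIONS` and `missing_runbook_sections` are hypothetical.

```python
from typing import List

# Section headings taken from the runbook template in this section.
REQUIRED_SECTIONS = (
    "Overview",
    "Detection",
    "Investigation",
    "Resolution",
    "Prevention",
)


def missing_runbook_sections(runbook_text: str) -> List[str]:
    """Return the required '## ' headings that the runbook does not contain."""
    headings = {
        line[3:].strip()
        for line in runbook_text.splitlines()
        if line.startswith("## ")
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]
```

A check like this could run in CI against the runbook repository, so that an incomplete runbook fails review before an incident exposes the gap.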
5. Cost Optimization
# cost_optimization.py
"""
Cost Optimization:
Reduce costs while maintaining performance.
"""
# Resource right-sizing
RIGHT_SIZING = """
# Resource Right-sizing
1. Monitor resource usage
2. Identify over-provisioned resources
3. Right-size based on actual usage
4. Review regularly
Tools:
- Cloud provider cost explorer
- Kubernetes resource metrics
- Application performance monitoring
"""
# Spot instances
SPOT_INSTANCES = """
# Use Spot Instances for Workers
1. Spot instances for non-critical workloads
2. On-demand for critical workloads
3. Reserved instances for base capacity
Benefits:
- Up to 90% discount compared with on-demand pricing (provider-quoted maximum; actual savings vary)
- Good for fault-tolerant workloads
- Auto-scaling with spot fleet
"""
# Scheduling optimization
SCHEDULING_OPTIMIZATION = """
# Optimize Scheduling
1. Avoid peak hours
2. Use idle resources
3. Batch processing
4. Resource-aware scheduling
Benefits:
- Lower costs
- Better resource utilization
- Reduced contention
"""
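To see how the spot and on-demand mix affects the bill, a blended hourly cost can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model: the function name and all prices are hypothetical inputs, and it ignores reserved-instance commitments and the cost of rerunning tasks after a spot interruption.

```python
def blended_hourly_cost(
    total_workers: int,
    spot_fraction: float,
    on_demand_price: float,
    spot_discount: float,
) -> float:
    """Hourly cost of a worker fleet split between spot and on-demand.

    spot_fraction: share of workers on spot capacity, between 0 and 1.
    spot_discount: spot discount relative to on-demand, between 0 and 1.
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_workers = total_workers * spot_fraction
    on_demand_workers = total_workers - spot_workers
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_workers * on_demand_price + spot_workers * spot_price
```

For instance, 10 workers at a hypothetical 1.00 per hour, with half of them on spot at a 70% discount, cost about 6.50 per hour instead of 10.00.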
Real-World Scenarios
Scenario 1: Netflix's Production Setup
# netflix_production.py
"""
Netflix-style production setup:
- Global deployment
- Multi-region
- Cost optimization
"""
from airflow.decorators import dag, task
from datetime import datetime
from typing import Dict, Any


@dag(
    dag_id='netflix_global_pipeline',
    schedule='@hourly',  # use schedule_interval on Airflow < 2.4
    start_date=datetime(2024, 1, 1),
    catchup=False,
    max_active_runs=10,
    max_active_tasks=100,
    tags=['netflix', 'global', 'production'],
)
def netflix_global():
    @task
    def extract_global() -> Dict[str, Any]:
        """Extract from global sources"""
        return {'sources': ['us', 'eu', 'apac']}

    @task
    def process_region(region: str) -> Dict[str, Any]:
        """Process by region"""
        return {'region': region, 'processed': True}

    @task
    def aggregate_global(results: list) -> Dict[str, Any]:
        """Aggregate global results"""
        return {'total_regions': len(results)}

    # Global pipeline: fan out one mapped task per region, then aggregate
    sources = extract_global()
    # expand() accepts keyword arguments only (dynamic task mapping, Airflow 2.3+)
    regional = process_region.expand(region=sources['sources'])
    aggregate = aggregate_global(regional)


netflix_global()
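Dynamic task mapping fans `process_region` out into one task instance per element of the source list and passes all of the results to `aggregate_global`. The plain-Python sketch below mirrors only that data flow, with no Airflow dependency. It does not model scheduling, parallelism, retries, or XCom, and `run_pipeline` is a hypothetical name introduced for illustration.

```python
from typing import Any, Dict, List


def extract_global() -> Dict[str, Any]:
    """Stand-in for the extract task."""
    return {"sources": ["us", "eu", "apac"]}


def process_region(region: str) -> Dict[str, Any]:
    """Stand-in for one mapped task instance."""
    return {"region": region, "processed": True}


def aggregate_global(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stand-in for the reduce step that receives every mapped result."""
    return {"total_regions": len(results)}


def run_pipeline() -> Dict[str, Any]:
    sources = extract_global()
    # Equivalent of expand(region=...): one call per list element.
    regional = [process_region(region=r) for r in sources["sources"]]
    return aggregate_global(regional)
```

In Airflow the list comprehension becomes three parallel task instances whose count is decided at run time, which is what lets the pipeline absorb a new region without a DAG change.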
Best Practices
# best_practices.py
"""
Production Best Practices:
1. Scaling:
- Implement auto-scaling
- Monitor resource usage
- Right-size resources
2. High Availability:
- Deploy across multiple AZs
- Implement failover
- Regular backups
3. Operations:
- Create runbooks
- Implement incident response
- Regular training
4. Cost Optimization:
- Monitor costs
- Use spot instances
- Optimize scheduling
5. Continuous Improvement:
- Regular reviews
- Performance tuning
- Process improvement
"""
ℹ️ Netflix Interview Tip
Netflix emphasizes operational excellence. When discussing production best practices, highlight the importance of monitoring, runbooks, and continuous improvement. Also explain how you would handle incidents and optimize costs at scale.
Summary
Production best practices are critical for reliable Airflow operation. Key takeaways:
- Scaling for performance
- High availability for reliability
- Migration for upgrades
- Operations for maintainability
- Cost optimization for efficiency
For Netflix and Uber interviews, focus on:
- Scaling strategies
- High availability patterns
- Operational excellence
- Cost optimization
- Continuous improvement
This question is part of the Apache Airflow Advanced interview preparation series. Practice explaining these concepts before your interview.