π Amazon MWAA
Master Amazon Managed Workflows for Apache Airflow (MWAA) on AWS.
Module: AWS Data Engineering β’ Topic 32 of 65 β’ Premium Content
MWAA Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AMAZON MWAA ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MWAA ENVIRONMENT (Managed) β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Scheduler (2+) β Web Server β Workers (2+) β β β
β β β Parse DAGs β Airflow UI β Execute Tasks β β β
β β β Schedule runs β REST API β Run Operators β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Metadata Database (RDS PostgreSQL - Managed) β β β
β β β β’ DAG runs, task instances, connections, variables β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββΌββββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β S3 (DAGs/ β β Glue/EMR/Lambda β β CloudWatch β β
β β Plugins/Logs) β β (Task Targets) β β (Monitoring) β β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Example DAG
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.operators.python import PythonOperator
default_args = {
'owner': 'data-engineering',
'depends_on_past': False,
'email_on_failure': True,
'email': ['team@company.com'],
'retries': 3,
'retry_delay': timedelta(minutes=5),
}
with DAG(
'daily_etl_pipeline',
default_args=default_args,
description='Daily ETL pipeline',
schedule_interval='0 2 * * *',
start_date=datetime(2024, 1, 1),
catchup=False,
tags=['etl', 'production'],
) as dag:
wait_for_data = S3KeySensor(
task_id='wait_for_raw_data',
bucket_name='data-lake-raw',
bucket_key='landing/{{ ds }}/',
wildcard_match=True,
timeout=3600,
poke_interval=60,
)
run_glue_job = GlueJobOperator(
task_id='run_etl_job',
job_name='process-daily-data',
script_location='s3://glue-scripts/process_data.py',
s3_bucket='glue-scripts',
iam_role_name='GlueRole',
create_job_kwargs={
'GlueVersion': '4.0',
'NumberOfWorkers': 10,
'WorkerType': 'G.1X'
},
job_poll_interval=30,
aws_conn_id='aws_default',
dag=dag,
)
wait_for_data >> run_glue_job
Interview Q&A
Q1: MWAA vs. Step Functions?
Answer: MWAA for complex multi-step workflows with Python expertise. Step Functions for serverless, visual AWS-native workflows.
Q2: How does MWAA scale?
Answer: MWAA scales workers based on queued tasks. Configure min/max workers and scheduler count based on workload.
Q3: What are Airflow Connections?
Answer: Connections store credentials and configuration for external services (AWS, databases, APIs). Managed securely in the metadata database.
Summary
- MWAA: Managed Apache Airflow on AWS
- Components: Scheduler, Web Server, Workers, Metadata DB
- Storage: S3 for DAGs, plugins, and logs
- Integration: Native AWS operators for Glue, Lambda, EMR, S3
- Scaling: Auto-scaling workers based on task queue
- Use Case: Complex multi-step workflow orchestration