π CDC Patterns on AWS
Master Change Data Capture with DMS, Debezium, and DynamoDB Streams.
Module: AWS Data Engineering β’ Topic 23 of 65 β’ Premium Content
CDC Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CDC PATTERNS ON AWS β
β β
β OPTION 1: AWS DMS (Managed) β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β Source βββββΊβ DMS βββββΊβ S3 βββββΊβ Redshift β β
β β (RDS) β β Replicatonβ β (Lake) β β (WH) β β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β
β OPTION 2: Debezium on MSK (Self-managed) β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β Source βββββΊβ Debezium βββββΊβ MSK βββββΊβ Lambda/ β β
β β (MySQL) β β Connectorβ β (Kafka) β β Glue β β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β
β OPTION 3: DynamoDB Streams β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
β β DynamoDB βββββΊβ Streams βββββΊβ Lambda βββββΊβ S3/other β β
β β Table β β β β Process β β β β
β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
AWS DMS Configuration
import boto3
dms = boto3.client('dms')
# Create replication instance
response = dms.create_replication_instance(
ReplicationInstanceIdentifier='cdc-replication-instance',
ReplicationInstanceClass='dms.r5.large',
AllocatedStorage=100,
MultiAZ=True,
VpcSecurityGroupIds=['sg-12345678'],
ReplicationSubnetGroupIdentifier='my-subnet-group',
PubliclyAccessible=False,
EngineVersion='3.5.1',
AutoMinorVersionUpgrade=True
)
# Create source endpoint (RDS MySQL)
source_endpoint = dms.create_endpoint(
EndpointIdentifier='source-mysql',
EndpointType='source',
EngineName='mysql',
ServerName='source-db.cluster-123456.us-east-1.rds.amazonaws.com',
Port=3306,
DatabaseName='production',
Username='dms_user',
Password='SecurePassword123!',
SslMode='require'
)
# Create target endpoint (S3)
target_endpoint = dms.create_endpoint(
EndpointIdentifier='target-s3',
EndpointType='target',
EngineName='s3',
S3Settings={
'BucketName': 'cdc-data-lake',
'BucketFolder': 'raw/',
'ServiceAccessRoleArn': 'arn:aws:iam::123456789012:role/DMSS3Role',
'CompressionType': 'GZIP',
'ExternalTableDefinition': '{"Version":1,"DataFormat":"parquet"}'
}
)
# Create replication task with CDC
replication_task = dms.create_replication_task(
ReplicationTaskIdentifier='cdc-task',
SourceEndpointArn=source_endpoint['Endpoint']['EndpointArn'],
TargetEndpointArn=target_endpoint['Endpoint']['EndpointArn'],
ReplicationInstanceArn=response['ReplicationInstance']['ReplicationInstanceArn'],
MigrationType='cdc',
TableMappings='''{
"rules": [{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "include-all-tables",
"object-locator": {
"schema-name": "production",
"table-name": "%"
},
"rule-action": "include"
}]
}''',
ReplicationTaskSettings='''{
"TargetMetadata": {
"TargetSchema": "",
"SupportLobs": true,
"FullLobMode": false,
"LobChunkSize": 64,
"LimitedSizeLobMode": true,
"LobMaxSize": 32
},
"FullLoadSettings": {
"TargetTablePrepMode": "DROP_AND_CREATE",
"CreatePkAfterFullLoad": true
},
"Logging": {
"EnableLogging": true,
"LogComponents": [
{"Id": "SOURCE_UNLOAD", "Severity": "LOGGER_SEVERITY_DEFAULT"},
{"Id": "TARGET_LOAD", "Severity": "LOGGER_SEVERITY_DEFAULT"},
{"Id": "SOURCE_CAPTURE", "Severity": "LOGGER_SEVERITY_DEFAULT"},
{"Id": "TARGET_APPLY", "Severity": "LOGGER_SEVERITY_DEFAULT"}
]
}
}'''
)
βΉοΈ
Pro Tip: DMS CDC uses the source database's transaction log (MySQL binlog, PostgreSQL WAL) to capture changes with minimal impact on source performance.
Debezium on MSK
{
"name": "mysql-cdc-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"tasks.max": "1",
"database.hostname": "source-db.cluster-123456.us-east-1.rds.amazonaws.com",
"database.port": "3306",
"database.user": "debezium",
"database.password": "secure-password",
"database.server.id": "184054",
"database.server.name": "production",
"database.include.list": "production_db",
"database.history.kafka.bootstrap.servers": "b-1.msk-cluster:9092",
"database.history.kafka.topic": "schema-changes.production",
"decimal.handling.mode": "double",
"time.precision.mode": "connect",
"transforms": "route,unwrap",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
"transforms.route.replacement": "$3",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
}
}
Interview Q&A
Q1: DMS vs Debezium?
Answer: DMS is AWS-managed, simpler setup, limited transformations. Debezium is open-source, more flexible, runs on MSK. DMS for quick migrations; Debezium for complex streaming architectures.
Q2: What are the three CDC modes in DMS?
Answer: 1) Full load only, 2) Full load + CDC, 3) CDC only. Use full load + CDC for initial migration with ongoing sync.
Q3: How do DynamoDB Streams work?
Answer: Streams capture item-level modifications (INSERT, MODIFY, REMOVE) with 24-hour retention. Lambda or Kinesis can consume stream records.
Summary
- DMS: AWS-managed CDC, easy setup, supports most databases
- Debezium: Open-source, runs on MSK, more flexible
- DynamoDB Streams: Built-in CDC for DynamoDB tables
- Best Practice: Use transaction log-based CDC for minimal source impact