Troubleshooting Framework
AWS Troubleshooting Framework

  Identify    →    Diagnose      →    Resolve
  Symptoms         Root Cause         Fix
  Logs             Analysis           Prevent
  Metrics          Testing            Monitor

  Common Areas
  • Performance (slow queries, high latency)
  • Connectivity (network, IAM, permissions)
  • Data (corruption, duplication, missing)
  • Cost (unexpected charges, optimization)
  • Security (access denied, compliance)
Q1: How do you troubleshoot slow Athena queries?
Answer:
Slow Query Troubleshooting:
Slow Athena Query Diagnosis

  Step 1: Check Query Statistics
    aws athena get-query-execution --query-execution-id your-query-id

    Look at (under QueryExecution.Statistics):
    • Data scanned vs data returned (DataScannedInBytes)
    • Execution time breakdown (queue, planning, and engine time)

  Step 2: Analyze Query Plan
    EXPLAIN (FORMAT GRAPHVIZ) SELECT ...

    Look for:
    • Full table scans (no partition pruning)
    • Large intermediate results
    • Skewed joins

  Step 3: Check Data Layout
    • File sizes (small files = overhead)
    • Partition pruning (WHERE clause on partition columns)
    • Column pruning (SELECT specific columns)
Common Fixes:
-- Problem: Full table scan
-- Before: 100GB scanned, 5 minutes (event_date is not a partition column)
SELECT * FROM events WHERE event_date = '2024-01-15';

-- After: 2.7GB scanned, 10 seconds (filter on the partition columns)
SELECT * FROM events
WHERE year = 2024 AND month = 1 AND day = 15;

-- Problem: SELECT * scans all columns
-- Before: 100GB scanned
SELECT * FROM events;

-- After: 5GB scanned (column pruning; needs a columnar format such as Parquet or ORC)
SELECT event_id, event_time, event_type FROM events;
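The statistics from Step 1 can be condensed into a quick before/after comparison. The following is a minimal sketch, assuming the response shape returned by boto3's `get_query_execution`. The helper name `summarize_athena_stats` and the default price of 5 USD per TB scanned are illustrative assumptions; check the current pricing for your region.

```python
def summarize_athena_stats(response, price_per_tb=5.0):
    """Summarize scan volume and timing from a get_query_execution response.

    price_per_tb is an assumed list price, not a value read from AWS.
    """
    stats = response['QueryExecution']['Statistics']
    scanned_bytes = stats.get('DataScannedInBytes', 0)
    return {
        'scanned_gb': scanned_bytes / 1024 ** 3,
        'engine_seconds': stats.get('EngineExecutionTimeInMillis', 0) / 1000,
        'queue_seconds': stats.get('QueryQueueTimeInMillis', 0) / 1000,
        'estimated_cost_usd': scanned_bytes / 1024 ** 4 * price_per_tb,
    }


# Hand-built sample response (not real query output)
sample_response = {'QueryExecution': {'Statistics': {
    'DataScannedInBytes': 100 * 1024 ** 3,
    'EngineExecutionTimeInMillis': 300000,
    'QueryQueueTimeInMillis': 1500,
}}}
summary = summarize_athena_stats(sample_response)
```

Running the helper on the same query before and after a partitioning or column-pruning change shows whether the fix actually reduced the bytes scanned.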
Q2: How do you troubleshoot EMR cluster failures?
Answer:
EMR Troubleshooting Checklist:
EMR Cluster Failure Diagnosis

  Step 1: Check Cluster Status
    aws emr describe-cluster --cluster-id j-XXXXXXXXX

    Look at:
    • Cluster.Status.State
    • Cluster.Status.StateChangeReason (Code, Message)
    • Cluster.Ec2InstanceAttributes

  Step 2: Check Step Status
    aws emr list-steps --cluster-id j-XXXXXXXXX

    Look at:
    • Status.State
    • Status.FailureDetails

  Step 3: Check Logs
    s3://my-bucket/j-XXXXXXXXX/logs/

    Look at:
    • instance-controller.log
    • spark-history/
    • hive-server2/
Common Issues & Solutions:
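# Issue 1: Spot instance interruption
# Solution: Use On-Demand or handle interruption
import requests

IMDS = 'http://169.254.169.254/latest'

def get_token():
    # Request an IMDSv2 session token
    response = requests.put(
        f'{IMDS}/api/token',
        headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'},
        timeout=2
    )
    return response.text

def handle_spot_interruption():
    # Check for interruption notice (404 until one is scheduled)
    response = requests.get(
        f'{IMDS}/meta-data/spot/instance-action',
        headers={'X-aws-ec2-metadata-token': get_token()},
        timeout=2
    )
    if response.status_code == 200:
        # Save state and prepare for termination
        # (save_checkpoint and upload_to_s3 are application-specific placeholders)
        save_checkpoint()
        upload_to_s3()

# Issue 2: OutOfMemoryError
# Solution: Increase executor memory
# These are launch-time settings; spark.conf.set() on a running session does not apply them
# spark-submit --conf spark.executor.memory=8g --conf spark.driver.memory=4g job.py

# Issue 3: Slow shuffle
# Solution: Tune shuffle partitions to the shuffle volume (200 is the default)
spark.conf.set('spark.sql.shuffle.partitions', '200')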
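A shuffle partition count can be estimated from the shuffle volume instead of being guessed. This is a minimal sketch, assuming a target of roughly 128 MB per partition, which is a common rule of thumb rather than a Spark requirement. The function name `suggest_shuffle_partitions` is hypothetical.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024 ** 2,
                               min_partitions=1):
    """Return a partition count that keeps each shuffle partition near the target size."""
    estimated = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, estimated)


# Example: a stage that shuffles 50 GiB
partitions_for_50_gib = suggest_shuffle_partitions(50 * 1024 ** 3)
```

The shuffle volume for a stage can be read from the Stages tab of the Spark UI (shuffle read/write), and the result passed to `spark.sql.shuffle.partitions`.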
Q3: How do you troubleshoot Redshift query performance?
Answer:
Redshift Performance Diagnosis:
Redshift Query Performance Diagnosis

  Step 1: Check Query Execution
    SELECT query, starttime, endtime,
           DATEDIFF(ms, starttime, endtime) AS elapsed_ms, aborted
    FROM stl_query
    WHERE query = [your-query-id]

  Step 2: Analyze Query Plan
    EXPLAIN SELECT ...

    Look for:
    • Nested loop vs hash join
    • Data redistribution steps (DS_BCAST_INNER, DS_DIST_BOTH)
    • Sort key usage

  Step 3: Check Table Statistics
    SELECT * FROM svv_table_info
    WHERE "table" = 'your_table'

    Look at:
    • skew_sortkey1
    • skew_rows
    • unsorted
Performance Tuning:
-- Check for data skew
SELECT
    t."table",
    t.skew_sortkey1,
    t.skew_rows
FROM svv_table_info t
WHERE t.database = 'analytics'
ORDER BY t.skew_rows DESC;

-- Optimize distribution
ALTER TABLE fact_sales ALTER DISTSTYLE KEY DISTKEY customer_id;

-- Add sort key
ALTER TABLE fact_sales ALTER SORTKEY (sale_date, customer_id);

-- Vacuum and analyze
VACUUM SORT ONLY fact_sales;
ANALYZE fact_sales;
Q4: How do you troubleshoot Glue job failures?
Answer:
Glue Job Troubleshooting:
Glue Job Failure Diagnosis

  Step 1: Check Job Status
    aws glue get-job-run --job-name my-job --run-id jr_XXX

    Look at:
    • JobRunState
    • ExecutionTime
    • ErrorMessage

  Step 2: Check CloudWatch Logs
    /aws-glue/jobs/error

    Look for:
    • Python exceptions
    • JVM errors
    • Out of memory

  Step 3: Common Issues
    • Schema mismatch
    • Permission denied
    • Resource limits exceeded
    • Timeout errors
Common Glue Failures:
# Issue 1: Schema mismatch
# Solution: Use resolveChoice
from awsglue.transforms import ResolveChoice
resolvechoice = ResolveChoice.apply(
    frame=dynamic_frame,
    choice="match_catalog",
    database="mydb",
    table_name="mytable"
)

# Issue 2: Permission denied
# Solution: Check IAM role permissions
# Required policies:
# - AWSGlueServiceRole
# - AmazonS3FullAccess (or, preferably, access scoped to specific buckets)
# AWSLambdaRole is not required to run a Glue job

# Issue 3: Out of memory
# Solution: Increase worker type
# Use G.2X or G.4X workers for large datasets
Q5: How do you troubleshoot Lambda function failures?
Answer:
Lambda Troubleshooting:
Lambda Function Failure Diagnosis

  Step 1: Check CloudWatch Logs
    /aws/lambda/my-function

    Look for:
    • START/END markers
    • Error stack traces
    • Timeout warnings

  Step 2: Check Metrics
    AWS/Lambda namespace

    • Invocations
    • Errors
    • Duration
    • Throttles

  Step 3: Common Issues
    • Timeout (Lambda default is 3 seconds; API Gateway integrations time out at 29 seconds by default)
    • Memory exceeded
    • Cold start latency
    • Permission errors
Lambda Optimization:
# Issue 1: Timeout
# Solution: Increase timeout (maximum 15 minutes) or optimize code
def lambda_handler(event, context):
    # Check the remaining time budget before starting each unit of work
    for record in event.get('Records', []):
        if context.get_remaining_time_in_millis() < 10_000:
            break  # stop early instead of being killed mid-record
        process(record)  # process() is an application-specific placeholder
    # Use Step Functions for tasks that need more than 15 minutes

# Issue 2: Cold start
# Solution: Use provisioned concurrency
import boto3
lambda_client = boto3.client('lambda')
lambda_client.put_provisioned_concurrency_config(
    FunctionName='my-function',
    Qualifier='prod',  # must be a published version or alias, not $LATEST
    ProvisionedConcurrentExecutions=10
)

# Issue 3: Memory exceeded
# Solution: Increase memory (also increases CPU)
# Monitor via the REPORT line in CloudWatch Logs:
# Max Memory Used / Memory Size
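The memory ratio can be computed directly from the REPORT line that Lambda writes to CloudWatch Logs at the end of each invocation. This is a minimal sketch; the helper name `memory_utilization` is hypothetical and the sample line is hand-written in the usual REPORT layout, not copied from a real function.

```python
import re

# Matches the "Memory Size" and "Max Memory Used" fields of a REPORT line
REPORT_PATTERN = re.compile(r'Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB')


def memory_utilization(report_line):
    """Return used/allocated memory as a fraction, or None if the line is not a REPORT line."""
    match = REPORT_PATTERN.search(report_line)
    if match is None:
        return None
    allocated_mb = int(match.group(1))
    used_mb = int(match.group(2))
    return used_mb / allocated_mb


sample_line = ('REPORT RequestId: 00000000-0000-0000-0000-000000000000\t'
               'Duration: 102.25 ms\tBilled Duration: 103 ms\t'
               'Memory Size: 128 MB\tMax Memory Used: 96 MB')
utilization = memory_utilization(sample_line)
```

A ratio that stays close to 1.0 across invocations suggests the function is at risk of running out of memory; a very low ratio suggests it may be over-provisioned.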
Q6: How do you troubleshoot S3 performance issues?
Answer:
S3 Performance Diagnosis:
S3 Performance Troubleshooting

  Step 1: Check Request Metrics
    aws s3api list-bucket-metrics-configurations --bucket my-bucket
    (request metrics must be enabled on the bucket before they appear)

    Look at:
    • Latency (first byte, last byte)
    • Errors (4xx, 5xx)
    • Throughput (MB/s)

  Step 2: Check CloudWatch Metrics
    AWS/S3 namespace

    • 4xxErrors
    • 5xxErrors
    • AllRequests
    • BytesUploaded/BytesDownloaded

  Step 3: Common Issues
    • 403 Forbidden: IAM/ACL issues
    • 404 Not Found: Wrong key/bucket
    • 503 Slow Down: Throttling
    • High latency: Too many small files
S3 Optimization:
# Issue 1: Throttling (503 errors)
# Solution: Use exponential backoff
import time
import random
from botocore.exceptions import ClientError

def s3_operation_with_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except ClientError as e:
            # Retry only on throttling; re-raise once retries are exhausted
            if e.response['Error']['Code'] != 'SlowDown' or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.uniform(0, 1))

# Issue 2: High latency with small files
# Solution: Aggregate small files into larger objects
# (multipart upload helps with large objects, not small ones)
# For Athena, aim for 128MB-1GB files

# Issue 3: Slow listing
# Solution: Use S3 Inventory instead of ListObjectsV2
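Aggregating small files starts with deciding which objects to merge together. The sketch below groups object keys greedily into batches near a target size. It is a planning step only and assumes the sizes have already been collected (for example from an S3 Inventory report); the function name `plan_compaction` and the 128 MB default are illustrative.

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024 ** 2):
    """Group keys into batches whose combined size stays at or below target_bytes.

    file_sizes maps object key -> size in bytes. A single file larger than
    the target ends up in a batch of its own.
    """
    batches = []
    current_batch = []
    current_size = 0
    for key, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would exceed the target
        if current_batch and current_size + size > target_bytes:
            batches.append(current_batch)
            current_batch = []
            current_size = 0
        current_batch.append(key)
        current_size += size
    if current_batch:
        batches.append(current_batch)
    return batches


MB = 1024 ** 2
example_plan = plan_compaction({'a': 60 * MB, 'b': 60 * MB, 'c': 60 * MB, 'd': 10 * MB})
```

Each batch would then be read and rewritten as one output file by whatever engine performs the compaction (Spark, Glue, or an Athena CTAS query).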
Q7: How do you troubleshoot network connectivity issues?
Answer:
Network Troubleshooting:
Network Connectivity Troubleshooting

  Step 1: Check VPC Configuration
    • Security Groups (outbound/inbound rules)
    • NACLs (subnet-level rules)
    • Route Tables (internet/NAT gateway routes)

  Step 2: Check VPC Endpoints
    • Gateway endpoints (S3, DynamoDB)
    • Interface endpoints (other services)
    • Endpoint policies

  Step 3: Use VPC Flow Logs
    • ACCEPT/REJECT status
    • Source/destination IPs
    • Ports and protocols
Common Network Issues:
# Issue 1: Cannot connect to S3 from VPC
# Solution: Add VPC endpoint (a gateway endpoint is the default type)
import boto3

ec2 = boto3.client('ec2')
ec2.create_vpc_endpoint(
    VpcId='vpc-12345678',
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=['rtb-12345678']
)

# Issue 2: Security group blocking traffic
# Solution: Add outbound rule
ec2.authorize_security_group_egress(
    GroupId='sg-12345678',
    IpPermissions=[
        {
            'IpProtocol': 'tcp',
            'FromPort': 443,
            'ToPort': 443,
            'IpRanges': [{'CidrIp': '0.0.0.0/0'}]
        }
    ]
)

# Issue 3: NACL blocking ephemeral ports
# Solution: Allow inbound traffic on ports 1024-65535
# (NACLs are stateless, so return traffic needs its own rule)
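When a security group or NACL is suspected, REJECT records in VPC Flow Logs show which traffic was blocked. This is a minimal sketch that parses records in the default (version 2) flow log format. The field names follow the documented default order with hyphens replaced by underscores, and the sample records are illustrative values, not real traffic.

```python
# Default flow log record fields, in order
FLOW_LOG_FIELDS = [
    'version', 'account_id', 'interface_id', 'srcaddr', 'dstaddr',
    'srcport', 'dstport', 'protocol', 'packets', 'bytes',
    'start', 'end', 'action', 'log_status',
]


def parse_flow_log_record(line):
    """Split one space-separated flow log record into a dict of named fields."""
    return dict(zip(FLOW_LOG_FIELDS, line.split()))


def rejected_flows(lines):
    """Return only the records whose action is REJECT."""
    records = (parse_flow_log_record(line) for line in lines)
    return [record for record in records if record.get('action') == 'REJECT']


sample_lines = [
    '2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK',
    '2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK',
]
blocked = rejected_flows(sample_lines)
```

The destination port of a rejected record usually points at the missing rule: a high port number on return traffic suggests the NACL ephemeral-port range, while a service port such as 443 suggests a security group rule.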
Q8: How do you troubleshoot IAM permission issues?
Answer:
IAM Troubleshooting:
IAM Permission Troubleshooting

  Step 1: Use IAM Policy Simulator
    • Select user/role
    • Select action (e.g., s3:GetObject)
    • Select resource (e.g., arn:aws:s3:::bucket/*)
    • Check results

  Step 2: Check Policy Evaluation
    1. Explicit Deny → Access denied
    2. No matching Allow → Access denied
    3. Matching Allow → Access allowed

  Step 3: Common Issues
    • Resource policy vs identity policy
    • Condition keys (aws:SourceVpc, aws:PrincipalOrgID)
    • Service control policies (SCPs)
    • Permission boundaries
IAM Debugging Steps:
# Step 1: List the policies that feed into effective permissions
import boto3

iam = boto3.client('iam')
# List attached managed policies
policies = iam.list_attached_user_policies(UserName='user')
for policy in policies['AttachedPolicies']:
    print(policy['PolicyArn'])
# Inline and group policies are listed separately
# (list_user_policies, list_groups_for_user)

# Step 2: Check for explicit denies
# Look for Deny statements in all policies

# Step 3: Use CloudTrail for access denied errors
# lookup_events returns management events only (last 90 days);
# S3 data events such as GetObject must be read from a trail's log files
cloudtrail = boto3.client('cloudtrail')
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {
            'AttributeKey': 'EventName',
            'AttributeValue': 'PutBucketPolicy'
        }
    ],
    MaxResults=10
)
for event in events['Events']:
    if 'AccessDenied' in event.get('CloudTrailEvent', ''):
        print(event['CloudTrailEvent'])
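The evaluation order from Step 2 (explicit deny, then allow, otherwise implicit deny) can be illustrated with a small evaluator. This is a deliberately simplified sketch: it ignores conditions, principals, SCPs, and permission boundaries, and it uses shell-style wildcard matching as an approximation of IAM matching. The function name `evaluate_access` and the statement layout are assumptions for illustration.

```python
from fnmatch import fnmatchcase


def evaluate_access(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request.

    Each statement is a dict with 'Effect', 'Action' (list of patterns)
    and 'Resource' (list of patterns).
    """
    def matches(statement):
        action_match = any(fnmatchcase(action, p) for p in statement['Action'])
        resource_match = any(fnmatchcase(resource, p) for p in statement['Resource'])
        return action_match and resource_match

    matching = [s for s in statements if matches(s)]
    # An explicit deny always wins over any allow
    if any(s['Effect'] == 'Deny' for s in matching):
        return 'ExplicitDeny'
    if any(s['Effect'] == 'Allow' for s in matching):
        return 'Allow'
    # No statement matched, or only non-matching allows exist
    return 'ImplicitDeny'


example_statements = [
    {'Effect': 'Allow', 'Action': ['s3:*'], 'Resource': ['arn:aws:s3:::bucket/*']},
    {'Effect': 'Deny', 'Action': ['s3:DeleteObject'], 'Resource': ['arn:aws:s3:::bucket/*']},
]
```

For real policies, the IAM Policy Simulator remains the reference; the sketch only shows why adding another Allow never overrides an existing Deny.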
Q9: How do you troubleshoot data quality issues?
Answer:
Data Quality Troubleshooting:
Data Quality Troubleshooting

  Step 1: Identify the Issue
    • Missing data
    • Duplicate data
    • Invalid formats
    • Schema mismatches

  Step 2: Trace Data Flow
    Source → Ingestion → Processing → Storage → Query

    Check each stage for:
    • Record counts
    • Data transformation
    • Error handling

  Step 3: Implement Quality Checks
    • Schema validation
    • Null checks
    • Range checks
    • Referential integrity
Quality Check Implementation:
from pyspark.sql.functions import col, count, when

def check_data_quality(df):
    issues = []
    # Check for nulls
    null_counts = df.select([
        count(when(col(c).isNull(), c)).alias(c)
        for c in df.columns
    ]).collect()[0]
    # Use a name other than `count` so the imported function is not shadowed
    for column, null_count in null_counts.asDict().items():
        if null_count > 0:
            issues.append(f"NULL values in {column}: {null_count}")
    # Check for duplicates
    total_count = df.count()
    distinct_count = df.distinct().count()
    if total_count != distinct_count:
        issues.append(f"Duplicate records: {total_count - distinct_count}")
    # Check for invalid values
    if 'amount' in df.columns:
        invalid_amounts = df.filter(col('amount') < 0).count()
        if invalid_amounts > 0:
            issues.append(f"Invalid amounts (negative): {invalid_amounts}")
    return issues
Q10: How do you troubleshoot Spark job failures?
Answer:
Spark Troubleshooting:
Spark Job Failure Diagnosis

  Step 1: Check Spark UI
    • Jobs tab: Failed/stuck jobs
    • Stages tab: Task failures, shuffle read/write
    • Storage tab: Cached DataFrames
    • Executors tab: Memory usage, GC time

  Step 2: Common Spark Errors
    • OutOfMemoryError: Increase executor memory
    • Data skew: Salt keys, repartition
    • Shuffle overflow: Increase shuffle partitions
    • Task not serializable: Fix lambda closures

  Step 3: Check Cluster Resources
    • CPU utilization
    • Memory usage
    • Disk space
    • Network throughput
Spark Optimization:
# Issue 1: OutOfMemoryError
# Solution: Increase memory or optimize
# Set these at launch; they cannot be changed on a running session.
# spark.memory.fraction replaces the legacy memoryFraction settings (default 0.6)
# spark-submit --conf spark.executor.memory=8g --conf spark.memory.fraction=0.8 job.py

# Issue 2: Data skew
# Solution: Salt the skewed key
from pyspark.sql.functions import col, concat, lit, rand
df_salted = df.withColumn(
    'salt',
    (rand() * 10).cast('int')
).withColumn(
    'salted_key',
    concat(col('key'), lit('_'), col('salt').cast('string'))
)

# Issue 3: Shuffle overflow
# Solution: Increase shuffle partitions (the default is 200)
spark.conf.set('spark.sql.shuffle.partitions', '400')

# Issue 4: Task not serializable
# Solution: Keep non-serializable objects (clients, connections) out of closures;
# create them inside the function that runs on the executor
# Related: broadcast small lookup tables instead of shuffling them
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), 'key')
Q11: How do you troubleshoot Kinesis stream issues?
Answer:
Kinesis Troubleshooting:
Kinesis Stream Troubleshooting

  Step 1: Check Stream Status
    aws kinesis describe-stream-summary --stream-name my-stream

    Look at:
    • StreamStatus (ACTIVE, CREATING, etc.)
    • OpenShardCount
    • RetentionPeriodHours

  Step 2: Check CloudWatch Metrics
    AWS/Kinesis namespace

    • GetRecords.IteratorAgeMilliseconds
    • PutRecords.Success
    • ReadProvisionedThroughputExceeded
    • WriteProvisionedThroughputExceeded

  Step 3: Common Issues
    • Iterator age growing: Consumer too slow
    • Throttling: Increase shards or use fan-out
    • Data loss: Check checkpointing
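When throttling points to too few shards, the required count can be estimated from the per-shard limits of a provisioned stream: 1 MB/s or 1,000 records/s for writes, and 2 MB/s for shared-throughput reads. This is a minimal sketch; the function name `estimate_shards` is hypothetical, and enhanced fan-out consumers (which get dedicated read throughput) are not modeled.

```python
import math

WRITE_MB_PER_SHARD = 1.0       # write limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000  # write limit per shard, records/s
READ_MB_PER_SHARD = 2.0        # shared read limit per shard, MB/s


def estimate_shards(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_sec / READ_MB_PER_SHARD),
    )


# Example: 5 MB/s in, 3,000 records/s, 14 MB/s total read across consumers
shards_needed = estimate_shards(5, 3000, 14)
```

In the example the read side is the constraint, which is the case where enhanced fan-out is often a better fix than adding shards.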
Q12: How do you troubleshoot Glue Data Catalog issues?
Answer:
Data Catalog Troubleshooting:
Glue Data Catalog Troubleshooting

  Step 1: Check Crawler Status
    aws glue get-crawler --name my-crawler

    Look at:
    • State
    • LastCrawl (Status, ErrorMessage)
    • Configuration, Version

    Table counts come from a separate call:
    aws glue get-crawler-metrics --crawler-name-list my-crawler

  Step 2: Check Table Schema
    aws glue get-table --database-name db --name tbl

    Look at:
    • StorageDescriptor (columns, formats)
    • PartitionKeys
    • Parameters (classification, etc.)

  Step 3: Common Issues
    • Schema inference errors
    • Missing partitions
    • Incorrect data types
    • Permission issues
Q13: How do you troubleshoot cost anomalies?
Answer:
Cost Anomaly Troubleshooting:
Cost Anomaly Troubleshooting

  Step 1: Identify Anomalous Service
    aws ce get-cost-and-usage
      --time-period Start=2024-01-01,End=2024-02-01
      --granularity MONTHLY
      --metrics BlendedCost
      --group-by Type=DIMENSION,Key=SERVICE
    (End is exclusive, so 2024-02-01 covers all of January)

  Step 2: Drill Down
    • By region
    • By tag (CostCenter, Environment)
    • By instance type

  Step 3: Common Causes
    • Left-over resources (EC2, RDS)
    • Data transfer costs
    • Unexpected API calls
    • Reserved Instance expiry
Cost Investigation Script:
from datetime import date, timedelta
import boto3

def investigate_cost_spike():
    ce = boto3.client('ce')
    today = date.today()
    month_start = today.replace(day=1)
    last_month_start = (month_start - timedelta(days=1)).replace(day=1)
    if today == month_start:
        return  # no completed days in the current month yet

    def costs_by_service(start, end):
        # End is exclusive in Cost Explorer time periods
        response = ce.get_cost_and_usage(
            TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
            Granularity='MONTHLY',
            Metrics=['BlendedCost'],
            GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
        )
        return {
            group['Keys'][0]: float(group['Metrics']['BlendedCost']['Amount'])
            for group in response['ResultsByTime'][0]['Groups']
        }

    # Get current month costs (month to date)
    current = costs_by_service(month_start, today)
    # Get last month costs for comparison
    previous = costs_by_service(last_month_start, month_start)

    # Compare
    for service_name, current_cost in current.items():
        previous_cost = previous.get(service_name, 0.0)
        if previous_cost == 0:
            continue  # new service this month; no percentage change to compute
        change = ((current_cost - previous_cost) / previous_cost) * 100
        if change > 50:  # More than 50% increase
            print(f"Anomaly: {service_name} increased {change:.1f}%")
Q14: How do you troubleshoot data replication issues?
Answer:
Replication Troubleshooting:
Data Replication Troubleshooting

  Step 1: Check Replication Status
    S3 CRR:
    aws s3api get-bucket-replication --bucket my-bucket

    RDS Read Replicas:
    aws rds describe-db-instances --db-instance-identifier replica

  Step 2: Check CloudWatch Metrics
    • ReplicaLag (RDS)
    • ReplicationLatency, BytesPendingReplication (S3)
    • OperationsFailedReplication (S3)

  Step 3: Common Issues
    • Versioning not enabled
    • IAM role permissions
    • Replication rules misconfigured
    • Network connectivity
Q15: How do you troubleshoot serverless application issues?
Answer:
Serverless Troubleshooting:
Serverless Troubleshooting

  Lambda + API Gateway
    • Check Lambda logs in CloudWatch
    • Check API Gateway execution logs
    • Verify IAM permissions
    • Check Lambda configuration (timeout, memory)

  Step Functions
    • Check execution history
    • Review state input/output
    • Identify failed state
    • Check IAM permissions for each state

  EventBridge
    • Check event rules
    • Verify target permissions
    • Review dead-letter queue
Q16: How do you troubleshoot security vulnerabilities?
Answer:
Security Troubleshooting:
Security Vulnerability Troubleshooting

  Step 1: Use Security Hub
    • Review critical/high findings
    • Check compliance status
    • Prioritize by severity

  Step 2: Use GuardDuty Findings
    • UnauthorizedAccess
    • CryptoCurrency
    • Malware

  Step 3: Remediation
    • Patch vulnerabilities
    • Update security groups
    • Enable encryption
    • Rotate credentials
Q17: How do you troubleshoot slow ETL pipelines?
Answer:
Slow ETL Troubleshooting:
Slow ETL Pipeline Troubleshooting

  Step 1: Identify Bottleneck
    Ingestion → Processing → Loading → Validation

    Measure time at each stage

  Step 2: Common Causes
    • Small files (too many I/O operations)
    • Data skew (uneven processing)
    • Insufficient resources
    • Network latency
    • Serialization overhead

  Step 3: Solutions
    • File compaction
    • Partition optimization
    • Resource scaling
    • Parallel processing
    • Caching frequent reads
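Measuring time at each stage (Step 1) needs very little code. The sketch below uses a context manager to record wall-clock time per stage; the names `timed_stage` and `slowest_stage` are hypothetical, and the stage bodies are left to the pipeline.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_stage(name, timings):
    """Record the wall-clock duration of the enclosed block under timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failed runs still show timings
        timings[name] = time.perf_counter() - start


def slowest_stage(timings):
    """Return the name of the stage with the largest recorded duration."""
    return max(timings, key=timings.get)


# Usage: wrap each pipeline stage
timings = {}
with timed_stage('ingestion', timings):
    pass  # the ingestion code would run here
with timed_stage('processing', timings):
    pass  # the processing code would run here
```

Publishing the recorded durations as custom metrics makes it possible to see which stage regressed when the pipeline slows down.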
Q18: How do you troubleshoot data consistency issues?
Answer:
Data Consistency Troubleshooting:
Data Consistency Troubleshooting

  Common Consistency Issues
    • Duplicate records
    • Missing records
    • Out-of-order delivery
    • Stale data

  Troubleshooting Steps
    1. Check source vs destination counts
    2. Verify processing logic (idempotency)
    3. Check for late-arriving data
    4. Verify replication lag

  Prevention Strategies
    • Idempotent processing
    • Transaction logs
    • Watermarking for late data
    • Exactly-once semantics
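Comparing source and destination goes further when it is done on record keys, not just counts, because matching counts can hide a missing record offset by a duplicate. This is a minimal in-memory sketch, assuming every record has a unique identifier at the source; the function name `reconcile` is hypothetical, and large datasets would need the same comparison done as a join in the processing engine.

```python
from collections import Counter


def reconcile(source_ids, destination_ids):
    """Classify differences between source and destination record identifiers."""
    source = set(source_ids)
    destination_counts = Counter(destination_ids)
    destination = set(destination_counts)
    return {
        'missing': sorted(source - destination),
        'duplicated': sorted(k for k, n in destination_counts.items() if n > 1),
        'unexpected': sorted(destination - source),
    }


# Both sides hold four records, yet the data does not match
report = reconcile([1, 2, 3, 4], [1, 2, 2, 5])
```

Duplicates usually point to non-idempotent writes on retry, while missing records point to dropped or late-arriving data.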
Q19: How do you troubleshoot performance degradation?
Answer:
Performance Troubleshooting:
Performance Degradation Troubleshooting

  Step 1: Establish Baseline
    • Historical metrics
    • Expected performance
    • SLA requirements

  Step 2: Identify Root Cause
    • Resource exhaustion (CPU, memory, disk)
    • Network issues
    • Query plan changes
    • Data volume growth

  Step 3: Apply Fixes
    • Scale resources
    • Optimize queries
    • Add caching
    • Partition data
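A baseline is only useful if there is a rule for deciding when a new measurement departs from it. One simple option, shown here as a sketch, flags a value that sits more than a chosen number of standard deviations above the historical mean. The function name `is_degraded` and the threshold of 3 are assumptions; metrics with strong daily or weekly patterns need a seasonal baseline instead.

```python
import statistics


def is_degraded(baseline, current, threshold=3.0):
    """Return True if current exceeds the baseline mean by more than threshold standard deviations.

    baseline must contain at least two historical measurements.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: any increase counts as a deviation
        return current > mean
    return (current - mean) / stdev > threshold


# Example: query latencies in milliseconds from previous runs
latency_baseline = [100, 102, 98, 101, 99]
```

With this baseline, `is_degraded(latency_baseline, 110)` flags the run, while a value of 101 stays within normal variation.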
Q20: How do you troubleshoot infrastructure as code issues?
Answer:
IaC Troubleshooting:
Infrastructure as Code Troubleshooting

  CloudFormation
    • Check stack events
    • Review error messages
    • Check IAM permissions
    • Verify resource limits

  CDK
    • Synthesize and review templates
    • Check for circular dependencies
    • Verify context values

  Terraform
    • Run terraform plan first
    • Check state file
    • Verify provider configuration
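When checking CloudFormation stack events, the earliest failure is usually the informative one, because later failures on sibling resources often just report that creation was cancelled. The sketch below assumes the events have already been fetched and use the field names returned by the stack events API (`Timestamp`, `LogicalResourceId`, `ResourceStatus`, `ResourceStatusReason`); the `root_failure` helper and the sample events are hypothetical.

```python
def root_failure(events):
    """Return the earliest failed event that is not just a cancellation side effect."""
    failures = [
        e for e in events
        if e["ResourceStatus"].endswith("_FAILED")
        and "cancelled" not in e.get("ResourceStatusReason", "").lower()
    ]
    # ISO 8601 timestamps in one format and timezone sort chronologically as strings.
    return min(failures, key=lambda e: e["Timestamp"], default=None)


# Hypothetical events, newest first
events = [
    {"Timestamp": "2024-01-01T10:00:09Z", "LogicalResourceId": "MyStack",
     "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "ResourceStatusReason": "The following resource(s) failed to create: [Bucket, Queue]."},
    {"Timestamp": "2024-01-01T10:00:08Z", "LogicalResourceId": "Queue",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled"},
    {"Timestamp": "2024-01-01T10:00:07Z", "LogicalResourceId": "Bucket",
     "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "example-bucket already exists"},
    {"Timestamp": "2024-01-01T10:00:01Z", "LogicalResourceId": "MyStack",
     "ResourceStatus": "CREATE_IN_PROGRESS",
     "ResourceStatusReason": "User Initiated"},
]

culprit = root_failure(events)
```

Here `culprit` points at the `Bucket` resource, whose status reason is the message worth reading first.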
Q21: How do you troubleshoot monitoring gaps?
Answer:
Monitoring Troubleshooting:
Monitoring Gap Troubleshooting

  Identify Gaps
    • Missing metrics
    • No alerts for critical failures
    • Incomplete dashboards
    • No logging for key operations

  Implement Missing Monitoring
    • Custom CloudWatch metrics
    • CloudTrail for API logging
    • X-Ray for tracing
    • CloudWatch Logs Insights

  Validate Coverage
    • SLI/SLO coverage
    • Alert response time
    • Dashboard completeness
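Identifying gaps amounts to comparing what should alert against what actually does. The following sketch assumes two inputs that would have to be assembled by hand or exported from the monitoring system: a mapping of components to the signals that must have an alarm, and a list of (component, signal) pairs for alarms that exist. All names in it are hypothetical.

```python
def coverage_gaps(required, alarms):
    """Return {component: [missing signals]} for components lacking required alarms.

    required: component -> set of signals that must alert.
    alarms: iterable of (component, signal) pairs that currently have an alarm.
    """
    covered = {}
    for component, signal in alarms:
        covered.setdefault(component, set()).add(signal)

    gaps = {}
    for component, signals in required.items():
        missing = signals - covered.get(component, set())
        if missing:
            gaps[component] = sorted(missing)
    return gaps


# Hypothetical inventory
required = {
    "ingest-lambda": {"errors", "duration"},
    "etl-job": {"failures", "runtime"},
    "warehouse": {"cpu", "disk"},
}
alarms = [
    ("ingest-lambda", "errors"),
    ("ingest-lambda", "duration"),
    ("etl-job", "failures"),
]

gaps = coverage_gaps(required, alarms)
```

Keeping the `required` mapping in version control gives the "Validate Coverage" step something explicit to check against.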
Q22: How do you troubleshoot compliance issues?
Answer:
Compliance Troubleshooting:
Compliance Troubleshooting

  Step 1: Identify Non-Compliance
    • AWS Config rules
    • Security Hub findings
    • Audit Manager assessments

  Step 2: Assess Impact
    • Which resources are affected?
    • What data is at risk?
    • Regulatory implications

  Step 3: Remediate
    • Enable encryption
    • Update access controls
    • Enable logging
    • Update policies
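The impact assessment in Step 2 can be approximated by ordering findings so that the most severe ones, and those touching sensitive data, are remediated first. This is a sketch under assumptions: the simplified finding dictionaries (`resource`, `severity`, `issue`) and the `sensitive_resources` set are hypothetical, and only the severity labels follow the ones Security Hub uses.

```python
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "INFORMATIONAL": 4}


def triage(findings, sensitive_resources):
    """Order findings by severity, putting sensitive resources first within a severity."""
    def sort_key(finding):
        is_sensitive = finding["resource"] in sensitive_resources
        return (SEVERITY_RANK[finding["severity"]], not is_sensitive, finding["resource"])

    return sorted(findings, key=sort_key)


# Hypothetical findings
findings = [
    {"resource": "logs-bucket", "severity": "MEDIUM", "issue": "access logging disabled"},
    {"resource": "customer-db", "severity": "HIGH", "issue": "storage not encrypted"},
    {"resource": "scratch-bucket", "severity": "HIGH", "issue": "public read access"},
    {"resource": "customer-db", "severity": "LOW", "issue": "missing tags"},
]

ordered = triage(findings, sensitive_resources={"customer-db"})
work_queue = [f["issue"] for f in ordered]
```

Whether sensitivity should outrank severity is a policy decision. Swapping the first two elements of the sort key changes the ordering accordingly.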
Q23: How do you troubleshoot multi-region issues?
Answer:
Multi-Region Troubleshooting:
Multi-Region Troubleshooting

  Common Issues
    • Replication lag
    • Region-specific service issues
    • Cross-region latency
    • Data consistency across regions

  Troubleshooting Steps
    1. Check replication status per region
    2. Verify cross-region connectivity
    3. Check region-specific quotas
    4. Review region-specific costs
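Checking replication status per region comes down to comparing the last write applied in each replica with the latest write in the primary. The sketch below assumes those timestamps have already been collected from the relevant service; the `lagging_regions` function, the five-minute threshold, and the sample values are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def lagging_regions(primary_ts, replica_ts, max_lag=timedelta(minutes=5)):
    """Return {region: lag} for replicas that trail the primary by more than max_lag."""
    return {
        region: primary_ts - ts
        for region, ts in replica_ts.items()
        if primary_ts - ts > max_lag
    }


# Hypothetical timestamps of the last applied write
primary = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
replicas = {
    "eu-west-1": primary - timedelta(seconds=40),
    "ap-southeast-2": primary - timedelta(minutes=12),
}

late = lagging_regions(primary, replicas)
```

Using timezone-aware timestamps throughout avoids false lag readings when regions report times in different zones.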
Q24: How do you troubleshoot batch job scheduling issues?
Answer:
Scheduling Troubleshooting:
Batch Job Scheduling Troubleshooting

  Common Issues
    • Job not running
    • Job running late
    • Job failing intermittently
    • Dependency issues

  Troubleshooting Steps
    1. Check scheduler (Airflow, Step Functions)
    2. Verify IAM permissions
    3. Check resource availability
    4. Review dependency chain

  Solutions
    • Add retry logic
    • Implement dead letter queues
    • Add monitoring and alerting
    • Implement circuit breakers
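Retry logic and a dead letter queue, two of the solutions listed above, fit together naturally: retry transient failures with growing delays, and park the payload somewhere inspectable once retries run out. The sketch below is a generic illustration that uses a plain list as the dead letter queue; `run_with_retry`, its parameters, and the sample jobs are hypothetical and not tied to any particular scheduler.

```python
import time


def run_with_retry(job, payload, max_attempts=3, base_delay=1.0,
                   sleep=time.sleep, dead_letters=None):
    """Run job(payload), retrying with exponential backoff; dead-letter on final failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(payload)
        except Exception as exc:
            if attempt == max_attempts:
                # Retries exhausted: keep the payload and error for later inspection.
                if dead_letters is not None:
                    dead_letters.append({"payload": payload, "error": repr(exc)})
                return None
            sleep(base_delay * 2 ** (attempt - 1))  # waits 1s, 2s, 4s, ...


# Hypothetical jobs: one recovers on the third call, one never succeeds.
calls = {"n": 0}


def flaky(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return f"processed {payload}"


def always_fails(payload):
    raise ValueError("bad input")


delays, dlq = [], []
ok = run_with_retry(flaky, "batch-1", sleep=delays.append)
failed = run_with_retry(always_fails, "batch-2", sleep=delays.append, dead_letters=dlq)
```

Passing `sleep` as a parameter lets the example record the delays instead of waiting. In production code the default `time.sleep` would apply, and catching a narrower exception type than `Exception` is usually preferable.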
Q25: How do you troubleshoot disaster recovery scenarios?
Answer:
DR Troubleshooting:
Disaster Recovery Troubleshooting

  Pre-DR Checklist
    • Backup status and freshness
    • Replication lag
    • DR site readiness
    • DNS failover configuration

  During DR
    • Activate DR site
    • Update DNS
    • Restore from backups
    • Verify data integrity

  Post-DR
    • Validate all services are running
    • Check data consistency
    • Monitor for issues
    • Plan failback
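Parts of the pre-DR checklist can be expressed as checks against recovery objectives. The sketch below assumes the team has defined an RPO and an RTO and can obtain the last backup time, the current replication lag, and the DNS record TTL; the function name, the parameters, and the rule comparing DNS TTL with the RTO are illustrative assumptions rather than an established standard.

```python
from datetime import datetime, timedelta, timezone


def pre_dr_failures(now, last_backup, replication_lag, dns_ttl, rpo, rto):
    """Return the list of pre-DR checks that currently fail."""
    failures = []
    if now - last_backup > rpo:
        failures.append("backup older than RPO")
    if replication_lag > rpo:
        failures.append("replication lag exceeds RPO")
    if dns_ttl > rto:
        # Clients may keep resolving to the failed site for longer than the RTO allows.
        failures.append("DNS TTL longer than RTO")
    return failures


# Hypothetical readings
now = datetime(2024, 1, 2, 0, 0, tzinfo=timezone.utc)
problems = pre_dr_failures(
    now=now,
    last_backup=now - timedelta(hours=26),
    replication_lag=timedelta(seconds=30),
    dns_ttl=timedelta(seconds=60),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=1),
)
```

Running checks like these on a schedule, rather than only during a DR exercise, surfaces stale backups before they matter.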
Summary
Mastering AWS troubleshooting requires understanding:
- Methodology: Systematic approach to identify, diagnose, resolve
- Service-Specific Issues: Athena, EMR, Redshift, Glue, Lambda, S3
- Common Patterns: Performance, connectivity, security, cost
- Prevention: Monitoring, alerting, automation, documentation
- Tools: CloudWatch, CloudTrail, X-Ray, Config
These concepts form the foundation for effectively debugging and resolving issues in AWS data systems.