S3 Performance Optimization
Master multipart upload, transfer acceleration, and S3 request rate optimization.
Module: AWS Data Engineering • Topic 36 of 65 • Premium Content
S3 Performance Architecture
Architecture Diagram
S3 PERFORMANCE OPTIMIZATION

1. MULTIPART UPLOAD
   - Split files >100MB into chunks
   - Upload parts in parallel (e.g., 10 concurrent threads)
   - Retry individual parts on failure
   - Required for files >5GB

2. PREFIX PARALLELISM
   - 5,500 GET/HEAD requests per second per prefix
   - 3,500 PUT/COPY/POST/DELETE requests per second per prefix
   - Use multiple prefixes for high throughput
   - Example: s3://bucket/{date}/{hour}/{partition}/

3. TRANSFER ACCELERATION
   - Uses CloudFront edge locations
   - Faster long-distance and cross-region transfers
   - Per-GB surcharge, from $0.04/GB (check current AWS pricing)

4. S3 SELECT
   - Filter at storage layer
   - Reduce data transferred
   - Supports Parquet, JSON, CSV
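To make the Transfer Acceleration item more concrete, here is a minimal sketch that enables acceleration on a bucket and builds the accelerate endpoint hostname. It assumes `s3` is a boto3 S3 client created elsewhere; the helper names `enable_acceleration` and `accelerate_endpoint` are hypothetical and exist only for illustration.

```python
def enable_acceleration(s3, bucket):
    """Turn on Transfer Acceleration for a bucket.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    if "." in bucket:
        # Accelerated buckets need DNS-compliant names without periods
        raise ValueError("bucket name must not contain periods")
    s3.put_bucket_accelerate_configuration(
        Bucket=bucket,
        AccelerateConfiguration={"Status": "Enabled"},
    )


def accelerate_endpoint(bucket):
    """Return the hostname that routes requests through edge locations."""
    return f"{bucket}.s3-accelerate.amazonaws.com"
```

With boto3, a client can be pointed at this endpoint by passing `Config(s3={"use_accelerate_endpoint": True})` from `botocore.config` when the client is created. Acceleration tends to pay off only when the client is far from the bucket's region, so it is worth measuring before enabling it everywhere.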
Multipart Upload Example
import boto3
from boto3.s3.transfer import TransferConfig
import os

s3 = boto3.client('s3')

# Configure multipart upload
config = TransferConfig(
    multipart_threshold=1024 * 1024 * 100,  # 100 MB
    max_concurrency=10,
    multipart_chunksize=1024 * 1024 * 100,  # 100 MB
    use_threads=True
)

# Upload with multipart
s3.upload_file(
    'large_file.parquet',
    'data-lake-bucket',
    'raw/data/file.parquet',
    Config=config
)

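# Manual multipart upload, for cases that need part-by-part control
# (parts are sent one after another here, not in parallel)
def multipart_upload(bucket, key, file_path, part_size=100*1024*1024):
    # The loop below would upload no parts for an empty file, so reject it early
    if os.path.getsize(file_path) == 0:
        raise ValueError('file is empty; use put_object instead')

    response = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = response['UploadId']
    parts = []
    try:
        with open(file_path, 'rb') as f:
            part_number = 1
            while True:
                data = f.read(part_size)
                if not data:
                    break
                response = s3.upload_part(
                    Bucket=bucket, Key=key,
                    PartNumber=part_number, UploadId=upload_id, Body=data
                )
                # S3 needs each part's number and ETag to assemble the object
                parts.append({'PartNumber': part_number, 'ETag': response['ETag']})
                part_number += 1
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
    except Exception:
        # Abort so S3 does not keep (and bill for) the orphaned parts
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise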
Interview Q&A
Q1: When should you use multipart upload?
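Answer: For files larger than about 100MB; it is required above 5GB, the limit for a single PUT. Parts upload in parallel, which improves throughput, and a failed part can be retried on its own instead of restarting the whole upload.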
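One practical detail behind this answer is choosing a part size. S3 requires every part except the last to be at least 5 MiB and allows at most 10,000 parts per upload, so very large files need bigger parts. The sketch below is one way to pick a size under those limits; `plan_parts` and `ceil_div` are hypothetical helper names, and the 100 MiB default simply mirrors the example above.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding on large sizes."""
    return -(-a // b)


def plan_parts(file_size, preferred_part_size=100 * MIB):
    """Return (part_size, part_count) for a multipart upload."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size if the preferred size would need too many parts
    if ceil_div(file_size, part_size) > MAX_PARTS:
        part_size = ceil_div(file_size, MAX_PARTS)
    return part_size, ceil_div(file_size, part_size)
```

For a 1 GiB file, `plan_parts(1024 * MIB)` keeps the 100 MiB part size and yields 11 parts; for multi-terabyte files the part size grows so that the count stays within 10,000.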
Q2: How does prefix parallelism work?
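Answer: S3 applies its request rate limits per prefix: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second. Each prefix gets its own limit, so spreading keys across multiple prefixes (e.g., date partitions) and reading or writing them in parallel multiplies the aggregate throughput.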
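As an illustration of this answer, the sketch below builds keys in the `{date}/{hour}/{partition}/` layout from the diagram and estimates how many prefixes a target request rate would need. The function names and the hash-based partition assignment are assumptions made for this example, not a prescribed S3 scheme.

```python
import hashlib
from datetime import datetime

GET_RPS_PER_PREFIX = 5_500   # GET/HEAD limit per prefix
PUT_RPS_PER_PREFIX = 3_500   # PUT/COPY/POST/DELETE limit per prefix


def partitioned_key(name, when, partitions=8):
    """Build a key shaped like {date}/{hour}/{partition}/{name}."""
    # A stable hash keeps the same object name in the same partition
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    partition = int(digest, 16) % partitions
    return f"{when:%Y-%m-%d}/{when:%H}/{partition:02d}/{name}"


def prefixes_needed(target_rps, per_prefix_rps):
    """Smallest number of prefixes whose combined limit covers target_rps."""
    return -(-target_rps // per_prefix_rps)  # ceiling division


# Example: a key for an object written at 09:00 on 2024-01-15
key = partitioned_key("events.parquet", datetime(2024, 1, 15, 9))
```

Under these limits, serving 20,000 GET requests per second needs `prefixes_needed(20_000, GET_RPS_PER_PREFIX)`, which is 4 prefixes, provided the load is spread evenly across them.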
Q3: What is the benefit of S3 Select?
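Answer: S3 Select filters data at the storage layer with a SQL expression, so only the matching rows and columns are transferred instead of the whole object. It can reduce costs by up to 80% for queries on large objects, depending on how selective the query is.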
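A minimal sketch of such a query is shown below. It assumes `s3` is a boto3 S3 client and that the object is a CSV file with a header row; `select_rows` is a hypothetical helper name. The call itself, `select_object_content`, returns an event stream in which only `Records` events carry row data.

```python
def select_rows(s3, bucket, key, sql):
    """Run an S3 Select query on a CSV object and return the result text.

    `s3` is assumed to be a boto3 S3 client; the object is assumed to be
    a CSV file whose first line is a header.
    """
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The payload is an event stream; skip Stats, Progress, and End events
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join the raw bytes first so characters split across chunks decode cleanly
    return b"".join(chunks).decode("utf-8")
```

A call might look like `select_rows(s3, "data-lake-bucket", "raw/data/events.csv", "SELECT s.user_id FROM S3Object s WHERE s.country = 'DE'")`, where the object key and column names are made up for illustration.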
Summary
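- Multipart Upload: Use for files >100MB (required above 5GB), parallel part uploads
- Prefix Parallelism: 5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE req/s per prefix
- Transfer Acceleration: Routes transfers through CloudFront edge locations for faster long-distance and cross-region transfers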
- S3 Select: Filter at storage layer, reduce data transfer
- Connection Pooling: Reuse connections for high throughput
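
The connection pooling point can be sketched as follows: share one client across worker threads so that requests reuse pooled connections instead of opening new ones. The function name `fetch_all` and the pool size of 32 are assumptions for this example; botocore's default pool holds 10 connections, so the pool is usually sized to at least the number of worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed client setup (not executed here); the pool should be at least as
# large as the number of worker threads so threads do not wait for a socket:
#   import boto3
#   from botocore.config import Config
#   s3 = boto3.client("s3", config=Config(max_pool_connections=32))


def fetch_all(s3, bucket, keys, max_workers=16):
    """Download many objects concurrently through one shared client."""

    def fetch(key):
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        return key, body.read()

    # One client means one connection pool, so sockets are reused across calls
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch, keys))
```

This reads each object fully into memory, which suits many small objects; for large objects, streaming the body or using `download_file` would be the safer choice.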