⚡ S3 Performance Optimization

Master multipart upload, transfer acceleration, and S3 request rate optimization.

S3 Performance Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    S3 PERFORMANCE OPTIMIZATION                              │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  1. MULTIPART UPLOAD                                                │    │
│  │     • Split files >100MB into chunks                                │    │
│  │     • Upload parts in parallel (e.g., 10 concurrent threads)        │    │
│  │     • Retry individual parts on failure                             │    │
│  │     • Required for files >5GB                                       │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  2. PREFIX PARALLELISM                                              │    │
│  │     • 5,500 GET/HEAD requests per second per prefix                 │    │
│  │     • 3,500 PUT/COPY/POST/DELETE requests per second per prefix     │    │
│  │     • Use multiple prefixes for high throughput                     │    │
│  │     • Example: s3://bucket/{date}/{hour}/{partition}/               │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  3. TRANSFER ACCELERATION                                           │    │
│  │     • Use CloudFront edge locations                                 │    │
│  │     • Faster cross-region transfers                                 │    │
│  │     • Adds ~$0.04-$0.08/GB on top of standard transfer rates        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  4. S3 SELECT                                                       │    │
│  │     • Filter at storage layer                                       │    │
│  │     • Reduce data transferred                                       │    │
│  │     • Supports Parquet, JSON, CSV                                   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
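
As a rough illustration of the Transfer Acceleration box, the sketch below assumes an existing boto3 S3 client is passed in as `s3`. The helper names (`enable_acceleration`, `accelerate_endpoint`, `estimate_acceleration_fee`) are hypothetical, and the default rate of $0.04/GB is only an illustrative figure; check current pricing for your edge location before relying on it.

```python
def enable_acceleration(s3, bucket):
    """Turn on Transfer Acceleration for a bucket (s3 is a boto3 S3 client)."""
    s3.put_bucket_accelerate_configuration(
        Bucket=bucket,
        AccelerateConfiguration={'Status': 'Enabled'},
    )


def accelerate_endpoint(bucket):
    """Return the accelerate hostname for a bucket."""
    # Accelerated buckets must have DNS-compliant names without periods
    if '.' in bucket:
        raise ValueError('bucket names containing dots cannot use acceleration')
    return f'{bucket}.s3-accelerate.amazonaws.com'


def estimate_acceleration_fee(gigabytes, rate_per_gb=0.04):
    """Estimate the acceleration surcharge in USD (rate is an assumption)."""
    return round(gigabytes * rate_per_gb, 2)
```

With boto3, a client can be pointed at the accelerate endpoint by passing `Config(s3={'use_accelerate_endpoint': True})` from `botocore.config` when creating it.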

Multipart Upload Example

import boto3
from boto3.s3.transfer import TransferConfig
import os

s3 = boto3.client('s3')

# Configure multipart upload
config = TransferConfig(
    multipart_threshold=1024 * 1024 * 100,  # 100 MB
    max_concurrency=10,
    multipart_chunksize=1024 * 1024 * 100,  # 100 MB
    use_threads=True
)

# Upload with multipart
s3.upload_file(
    'large_file.parquet',
    'data-lake-bucket',
    'raw/data/file.parquet',
    Config=config
)

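# Manual multipart upload, for cases that need per-part control
def multipart_upload(bucket, key, file_path, part_size=100*1024*1024):
    # S3 allows at most 10,000 parts, so fail early if part_size is too small
    file_size = os.path.getsize(file_path)
    if file_size > part_size * 10000:
        raise ValueError('part_size too small: S3 allows at most 10,000 parts')

    response = s3.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = response['UploadId']
    parts = []

    try:
        with open(file_path, 'rb') as f:
            part_number = 1
            while True:
                data = f.read(part_size)
                if not data:
                    break
                response = s3.upload_part(
                    Bucket=bucket, Key=key,
                    PartNumber=part_number, UploadId=upload_id, Body=data
                )
                # S3 needs each part's number and ETag to assemble the object
                parts.append({'PartNumber': part_number, 'ETag': response['ETag']})
                part_number += 1

        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={'Parts': parts}
        )
    except Exception:
        # Abort so already-uploaded parts do not linger and accrue storage charges
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise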
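
When uploading parts manually, the part size has to respect two S3 limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. The following is a minimal sketch of picking a part size under those limits; `plan_parts` is a hypothetical helper, not part of boto3.

```python
MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(file_size, preferred_part_size=100 * 1024 * 1024):
    """Return (part_size, part_count) that respects the S3 part limits."""
    if file_size <= 0:
        raise ValueError('file_size must be positive')
    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size if the preferred size would need more than 10,000 parts
    smallest_allowed = (file_size + MAX_PARTS - 1) // MAX_PARTS
    part_size = max(part_size, smallest_allowed)
    part_count = (file_size + part_size - 1) // part_size
    return part_size, part_count
```

For a 1 GiB file with the default 100 MiB preference this yields 11 parts; for multi-terabyte files the part size grows so the count stays within 10,000.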

Interview Q&A

Q1: When should you use multipart upload?

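Answer: Consider it once an object reaches about 100MB; it is required above 5GB, the limit for a single PUT. It improves throughput through parallel part uploads and adds resiliency, because a failed part can be retried on its own.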
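
The size thresholds in this answer can be written down as a small decision helper. This is only a sketch: `upload_strategy` and its return labels are hypothetical, and the 5 GB single-PUT limit is assumed here to mean 5 GiB.

```python
MULTIPART_RECOMMENDED = 100 * 1024 * 1024   # ~100 MB guideline from the answer
SINGLE_PUT_LIMIT = 5 * 1024 ** 3            # single PUT limit, assumed as 5 GiB


def upload_strategy(size_bytes):
    """Classify an object size into a suggested upload method."""
    if size_bytes > SINGLE_PUT_LIMIT:
        return 'multipart-required'
    if size_bytes >= MULTIPART_RECOMMENDED:
        return 'multipart-recommended'
    return 'single-put'
```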

Q2: How does prefix parallelism work?

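Answer: S3 scales request rates per prefix: at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second for each prefix. Spreading objects across multiple prefixes (e.g., date partitions) lets S3 serve them in parallel, so aggregate throughput grows with the number of prefixes.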
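
The arithmetic behind prefix parallelism can be sketched as below, using the per-prefix rates quoted in this lesson. The hash-based shard prefix in `sharded_key` is one common way to spread keys and is an assumption here, not the lesson's own layout (which partitions by date and hour); both function names are hypothetical.

```python
import hashlib

GET_RPS_PER_PREFIX = 5500   # GET/HEAD requests per second per prefix
PUT_RPS_PER_PREFIX = 3500   # PUT/COPY/POST requests per second per prefix


def prefixes_needed(target_rps, per_prefix_rps):
    """Smallest number of prefixes whose combined rate covers target_rps."""
    return -(-target_rps // per_prefix_rps)  # integer ceiling division


def sharded_key(key, shard_count=16):
    """Prepend a stable two-digit shard prefix derived from a hash of the key."""
    digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
    shard = int(digest, 16) % shard_count
    return f'{shard:02d}/{key}'
```

For example, sustaining 55,000 reads per second would need `prefixes_needed(55000, GET_RPS_PER_PREFIX)`, which is 10 prefixes.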

Q3: What is the benefit of S3 Select?

Answer: S3 Select filters data at the storage layer, so only the matching rows and columns are transferred. It can reduce costs by up to 80% for queries on large objects.
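
A minimal sketch of an S3 Select call follows, assuming a boto3 S3 client is passed in as `s3` and the object is a CSV file with a header row. `select_rows` is a hypothetical helper. AWS has reportedly closed S3 Select to new customers, so check that it is available in your account before building on it.

```python
def select_rows(s3, bucket, key, sql):
    """Run an S3 Select query on a CSV object and return matching rows as text."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=sql,
        InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
        OutputSerialization={'CSV': {}},
    )
    chunks = []
    # The payload is an event stream; only Records events carry row data
    for event in response['Payload']:
        if 'Records' in event:
            chunks.append(event['Records']['Payload'])
    return b''.join(chunks).decode('utf-8')
```

A call might look like `select_rows(s3, 'data-lake-bucket', 'raw/data/file.csv', "SELECT s.name FROM s3object s WHERE s.country = 'DE'")`, where the object key and column names are made up for illustration.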

Summary

Multipart Upload: Use for files >100MB (required above 5GB), parallel part uploads
Prefix Parallelism: 5,500 reads/s and 3,500 writes/s per prefix; add prefixes to scale
Transfer Acceleration: Routes through CloudFront edge locations for cross-region transfers
S3 Select: Filter at storage layer, reduce data transfer
Connection Pooling: Reuse connections for high throughput
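
Connection pooling can be sketched with botocore's `max_pool_connections` setting, which defaults to 10. The sizing rule below (one connection per concurrent transfer thread) and both helper names are assumptions for illustration, not a recommendation from AWS.

```python
def pool_size_for(max_concurrency, workers=1, floor=10):
    """Suggest a pool size: one connection per thread, never below the default."""
    return max(floor, max_concurrency * workers)


def make_pooled_client(max_concurrency=10, workers=1):
    """Create an S3 client whose connection pool matches the expected threads."""
    # Imported lazily so the sizing helper works without boto3 installed
    import boto3
    from botocore.config import Config

    config = Config(
        max_pool_connections=pool_size_for(max_concurrency, workers)
    )
    return boto3.client('s3', config=config)
```

Reusing one such client across uploads keeps connections open between requests instead of paying the connection setup cost each time.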