Design an Object Storage System
Object storage systems like Amazon S3 store trillions of objects with extreme durability (11 nines) and availability. Unlike file systems, object storage provides a flat namespace with a simple REST API for storing and retrieving arbitrary data.
- Flat Namespace — Objects identified by unique keys in a bucket
- Durability — 99.999999999% (11 nines) durability through replication
- Massive Scale — Trillions of objects, exabytes of data
Object storage trades the hierarchical structure of file systems for massive scalability and durability. Every object is immutable—you update by writing a new version.
Requirements
Functional Requirements
- Store and retrieve objects (up to 5 TB per object)
- Support bucket-based organization
- RESTful API: PUT, GET, DELETE objects
- Versioning and lifecycle management
- Access control lists (ACLs) and bucket policies
- Multipart upload for large objects
- Event notifications on object changes
Non-Functional Requirements
- Durability: 99.999999999% (11 nines)
- Availability: 99.99% for standard tier
- Scale: 10 trillion objects, 100 PB per region
- Latency: First byte < 100ms for GET
- Throughput: 5,500 GET requests/second per prefix
Durability of 11 nines means each object has roughly a 1-in-100-billion chance of being lost in a given year. If you store 10 million objects, you can expect to lose about one object every 10,000 years. This requires aggressive replication and integrity checking.
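The arithmetic behind a durability figure can be checked in a few lines. This is a minimal sketch that assumes durability is quoted per object per year and that losses are independent; the function names are illustrative, not part of any real API.

```python
def annual_loss_probability(nines: int) -> float:
    """Per-object probability of loss in one year for a durability of `nines` nines."""
    return 10.0 ** -nines


def expected_annual_losses(num_objects: int, nines: int) -> float:
    """Expected number of objects lost per year (assumes independent losses)."""
    return num_objects * annual_loss_probability(nines)


def years_per_lost_object(num_objects: int, nines: int) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_losses(num_objects, nines)


if __name__ == "__main__":
    objects = 10_000_000
    print(f"expected losses per year: {expected_annual_losses(objects, 11):.1e}")
    print(f"years per lost object:    {years_per_lost_object(objects, 11):,.0f}")
```

With 10 million objects and 11 nines, the expected loss rate works out to about one object every 10,000 years.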
Back-of-the-Envelope Estimation
Object Storage Capacity
- 10 trillion objects × 1 KB average = 10 PB minimum
- With 3x replication: 30 PB raw storage
- 1000 storage nodes × 100 TB each = 100 PB capacity
- 5,500 GET/sec/prefix × 1000 prefixes = 5.5M GET/sec
- Metadata: 10 trillion × 500 bytes = 5 PB (distributed)
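The estimates above can be reproduced with a short script, which makes it easy to vary the inputs. This is a sketch using decimal units (1 PB = 10^15 bytes); the function name and parameter names are assumptions made for illustration.

```python
KB, TB, PB = 10**3, 10**12, 10**15  # decimal units


def estimate_capacity(
    objects: int = 10 * 10**12,          # 10 trillion objects
    avg_object_bytes: int = 1 * KB,
    replication_factor: int = 3,
    nodes: int = 1_000,
    node_capacity_bytes: int = 100 * TB,
    metadata_bytes_per_object: int = 500,
    gets_per_prefix: int = 5_500,
    prefixes: int = 1_000,
) -> dict:
    """Return the back-of-the-envelope figures as a dictionary."""
    logical_bytes = objects * avg_object_bytes
    return {
        "logical_pb": logical_bytes / PB,
        "raw_pb": logical_bytes * replication_factor / PB,
        "cluster_pb": nodes * node_capacity_bytes / PB,
        "metadata_pb": objects * metadata_bytes_per_object / PB,
        "peak_gets_per_sec": gets_per_prefix * prefixes,
    }


if __name__ == "__main__":
    for name, value in estimate_capacity().items():
        print(f"{name}: {value:,}")
```

Note that at a 1 KB average object size, metadata (5 PB) is half the size of the logical data (10 PB), which is one reason the metadata layer needs its own scaling strategy.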
API Design
PUT /{bucket}/{key}
Content-Type: image/jpeg
Body: <binary data>
Response: { "etag": "d41d8cd98f00b204e9800998ecf8427e" }
GET /{bucket}/{key}
Response: <binary data>
Headers: Content-Length, ETag, Last-Modified
DELETE /{bucket}/{key}
Response: 204 No Content
POST /{bucket}/{key}?uploads
Response: { "upload_id": "upload_123" }
PUT /{bucket}/{key}?uploadId=upload_123&partNumber=1
Response: { "etag": "part_etag" }
POST /{bucket}/{key}?uploadId=upload_123
Body: { "parts": [...] }
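To make the request and response shapes concrete, here is a minimal in-memory sketch of the PUT, GET, and DELETE semantics. It is not a server: the class `InMemoryObjectStore` and its method names are hypothetical, and it assumes the ETag is the MD5 hex digest of the body, which is consistent with the example ETag above (the MD5 of an empty body).

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StoredObject:
    data: bytes
    content_type: str
    etag: str
    last_modified: datetime


class InMemoryObjectStore:
    """Toy stand-in for the object API; keys are (bucket, key) pairs."""

    def __init__(self) -> None:
        self._objects: dict[tuple[str, str], StoredObject] = {}

    def put(self, bucket: str, key: str, data: bytes,
            content_type: str = "application/octet-stream") -> dict:
        # Assumption: ETag is the MD5 hex digest of the object body.
        etag = hashlib.md5(data).hexdigest()
        self._objects[(bucket, key)] = StoredObject(
            data, content_type, etag, datetime.now(timezone.utc))
        return {"etag": etag}

    def get(self, bucket: str, key: str) -> tuple[bytes, dict]:
        obj = self._objects[(bucket, key)]  # KeyError plays the role of a 404
        headers = {
            "Content-Length": len(obj.data),
            "ETag": obj.etag,
            "Last-Modified": obj.last_modified.isoformat(),
        }
        return obj.data, headers

    def delete(self, bucket: str, key: str) -> int:
        # Deleting a missing key still succeeds, so DELETE is idempotent.
        self._objects.pop((bucket, key), None)
        return 204
```

A real service would put this behind HTTP handlers and split the body and the metadata across separate services; the sketch only shows the contract each call is expected to honor.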
High-Level Architecture
Detailed Design
Data Model
Definition: Object Storage Data Model
An object consists of data (the blob) and metadata (key, size, checksum, custom headers). Objects are organized into buckets with a flat key namespace within each bucket.
// Object Metadata
{
  bucket: "images",
  key: "photos/2026/06/photo1.jpg",
  size: 10485760, // 10 MB, the sum of the parts below
  content_type: "image/jpeg",
  etag: "d41d8cd98f00b204",
  created_at: "2026-06-20T10:00:00Z",
  storage_class: "STANDARD",
  version_id: "v1",
  checksum: "sha256:abc123...",
  parts: [ // For multipart uploads
    { part_num: 1, offset: 0, size: 5242880, etag: "..." },
    { part_num: 2, offset: 5242880, size: 5242880, etag: "..." }
  ]
}
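One way to express this record in code is a pair of immutable dataclasses with a consistency check on the part list. This is a sketch, not a prescribed schema: the class names and the `parts_are_consistent` helper are hypothetical, and only a subset of the fields is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartInfo:
    part_num: int
    offset: int
    size: int
    etag: str


@dataclass(frozen=True)
class ObjectMetadata:
    bucket: str
    key: str
    size: int
    content_type: str
    etag: str
    version_id: str
    storage_class: str = "STANDARD"
    parts: tuple[PartInfo, ...] = ()

    def parts_are_consistent(self) -> bool:
        """Parts must be numbered 1..n, contiguous, and sum to the object size."""
        if not self.parts:
            return True  # single-part object, nothing to check
        expected_offset = 0
        ordered = sorted(self.parts, key=lambda p: p.part_num)
        for expected_num, part in enumerate(ordered, start=1):
            if part.part_num != expected_num or part.offset != expected_offset:
                return False
            expected_offset += part.size
        return expected_offset == self.size
```

Freezing the dataclasses mirrors object immutability: an update produces a new metadata record with a new `version_id` rather than mutating the old one.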
Data Chunking and Placement
Large objects are split into chunks for efficient storage and replication:
Definition: Object Chunking
Large objects are split into fixed-size chunks (e.g., 64 MB). Each chunk is independently replicated across storage nodes. This enables parallel uploads, efficient replication, and partial reads.
Chunk Count
$$N_{\text{chunks}} = \left\lceil \frac{S}{C} \right\rceil$$
Here,
- $S$ = Total object size in bytes
- $C$ = Chunk size (typically 64 MB)
Chunk Calculation
For a 1 GB video file with 64 MB chunks: chunks = ⌈1024 MB / 64 MB⌉ = 16 chunks
Each 64 MB chunk is replicated 3×, so it occupies 192 MB of raw storage. The whole object occupies 16 × 192 MB = 3,072 MB (3 GB).
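The chunk calculation, plus one possible way to pick replica nodes for each chunk, can be sketched as follows. The chunking functions follow the ceiling formula directly. The placement function uses rendezvous (highest-random-weight) hashing, which is an assumption chosen for illustration and not something the design above prescribes; node names and chunk IDs are made up.

```python
import hashlib

MIB = 1024 * 1024
CHUNK_SIZE = 64 * MIB  # 64 MB chunks, as in the example above


def chunk_count(object_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Ceiling division: number of chunks needed for an object."""
    return -(-object_size // chunk_size)


def chunk_ranges(object_size: int, chunk_size: int = CHUNK_SIZE):
    """Yield (index, offset, length) for each chunk; the last may be short."""
    for index in range(chunk_count(object_size, chunk_size)):
        offset = index * chunk_size
        yield index, offset, min(chunk_size, object_size - offset)


def place_replicas(chunk_id: str, nodes: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` distinct nodes by rendezvous hashing (illustrative choice)."""
    def score(node: str) -> bytes:
        return hashlib.sha256(f"{node}:{chunk_id}".encode()).digest()
    return sorted(nodes, key=score, reverse=True)[:replicas]


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(10)]
    for index, offset, length in chunk_ranges(130 * MIB):
        print(index, offset, length, place_replicas(f"obj-1/{index}", nodes))
```

Because each chunk is placed independently, the replicas of a large object spread across many nodes, which is what allows parallel reads and parallel repair. A production placement policy would also account for racks, zones, and free capacity.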
Replication Strategy
Definition: Erasure Coding
Erasure coding splits data into k data chunks and generates m parity chunks, so the original data can be recovered after the loss of any m of the k + m chunks. It is more storage-efficient than full replication.
| Strategy | Storage Overhead (raw ÷ logical) | Durability | Read Performance |
|---|---|---|---|
| 3x Replication | 300% | High | Fast (any replica) |
| Reed-Solomon (10+4) | 140% | Very High | Moderate (decode) |
| Reed-Solomon (10+2) | 120% | High | Moderate |
Erasure Coding Overhead
$$\text{Overhead} = \frac{k + m}{k}$$
Here,
- $k$ = Data chunks
- $m$ = Parity chunks
Erasure Coding vs Replication
For a 100 MB object:
- 3x Replication: 300 MB storage, can tolerate 2 node failures
- Reed-Solomon (10+4): 140 MB storage, can tolerate 4 node failures
RS is more storage-efficient with higher fault tolerance.
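The comparison can be computed directly from k and m, treating 3x replication as the degenerate case k = 1, m = 2. The second half of the sketch shows the recovery idea with a single XOR parity chunk (k data chunks plus 1 parity). This is deliberately simpler than Reed-Solomon, which generalizes the same idea to m parity chunks; all function names are illustrative.

```python
from functools import reduce


def storage_factor(k: int, m: int) -> float:
    """Raw storage divided by logical size: (k + m) / k."""
    return (k + m) / k


def raw_storage_mb(object_mb: float, k: int, m: int) -> float:
    """Raw storage consumed by an object of `object_mb` megabytes."""
    return object_mb * (k + m) / k


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode_single_parity(chunks: list[bytes]) -> bytes:
    """XOR of all equal-length data chunks; tolerates the loss of one chunk."""
    return reduce(xor_bytes, chunks)


def recover_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one missing data chunk from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


if __name__ == "__main__":
    print("3x replication:", raw_storage_mb(100, k=1, m=2), "MB")
    print("RS(10+4):      ", raw_storage_mb(100, k=10, m=4), "MB")
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = encode_single_parity(data)
    print(recover_missing([data[0], data[2]], parity))  # rebuilds data[1]
```

The trade-off is visible in `recover_missing`: a degraded read has to fetch the surviving chunks and compute, whereas a replicated read just fetches another copy.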
Metadata Architecture
The metadata service is the control plane.
Metadata operations (PUT/GET object metadata) are separate from data operations (PUT/GET object data). This separation allows metadata to scale independently and be cached aggressively.
| Metadata Tier | Technology | Use Case |
|---|---|---|
| Hot | Redis | Frequently accessed metadata |
| Warm | Cassandra | Recent objects, time-series access |
| Cold | S3/Object Store | Archived metadata, audit logs |
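A tiered lookup of this kind is often implemented as a read-through cache: check the hot tier, fall back to warm and then cold, and promote whatever was found. The sketch below uses plain dictionaries as stand-ins for Redis, Cassandra, and the archive store. It does not use any real client library, and the class name, tier names, and LRU eviction policy are assumptions.

```python
from collections import OrderedDict


class TieredMetadataStore:
    """Read-through metadata lookup across hot, warm, and cold tiers."""

    def __init__(self, hot_capacity: int = 2) -> None:
        self.hot_capacity = hot_capacity
        self.hot: OrderedDict[str, dict] = OrderedDict()  # stand-in for a cache
        self.warm: dict[str, dict] = {}                   # stand-in for the primary store
        self.cold: dict[str, dict] = {}                   # stand-in for the archive

    def put(self, key: str, metadata: dict) -> None:
        # Writes land in the warm tier (source of truth) and warm the cache.
        self.warm[key] = metadata
        self._cache(key, metadata)

    def get(self, key: str) -> tuple[dict, str]:
        """Return (metadata, tier_that_served_it); raise KeyError if absent."""
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as most recently used
            return self.hot[key], "hot"
        for name, tier in (("warm", self.warm), ("cold", self.cold)):
            if key in tier:
                self._cache(key, tier[key])  # promote on read
                return tier[key], name
        raise KeyError(key)

    def _cache(self, key: str, metadata: dict) -> None:
        self.hot[key] = metadata
        self.hot.move_to_end(key)
        while len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict least recently used
```

A real deployment would also need cache invalidation on overwrite and delete, which this sketch omits.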
Multipart Upload
For large objects, multipart upload enables:
- Parallel uploads of chunks
- Resumable uploads on failure
- Upload of objects larger than the 5 GB single-request limit (up to 5 TB per object)
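The initiate, upload-part, and complete steps can be sketched with an in-memory manager. Parts may arrive in any order and can be re-sent after a failure, and completion verifies each part's ETag before assembling the object. The class `MultipartUploads` is hypothetical, and using the MD5 digest as the part ETag is an assumption made for illustration.

```python
import hashlib
import uuid


class MultipartUploads:
    """Toy multipart upload manager that keeps parts in memory."""

    def __init__(self) -> None:
        self._uploads: dict[str, dict[int, bytes]] = {}

    def initiate(self) -> str:
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        # Re-uploading the same part number overwrites it, which makes retries safe.
        self._uploads[upload_id][part_number] = data
        return hashlib.md5(data).hexdigest()

    def complete(self, upload_id: str, parts: list[tuple[int, str]]) -> bytes:
        """Assemble parts in part-number order after verifying each ETag."""
        stored = self._uploads[upload_id]
        assembled = bytearray()
        for part_number, etag in sorted(parts):
            data = stored.get(part_number)
            if data is None or hashlib.md5(data).hexdigest() != etag:
                raise ValueError(f"part {part_number} is missing or corrupted")
            assembled += data
        del self._uploads[upload_id]  # the upload is finished
        return bytes(assembled)


if __name__ == "__main__":
    uploads = MultipartUploads()
    upload_id = uploads.initiate()
    etag2 = uploads.upload_part(upload_id, 2, b"world")   # parts can arrive out of order
    etag1 = uploads.upload_part(upload_id, 1, b"hello ")
    print(uploads.complete(upload_id, [(1, etag1), (2, etag2)]))
```

In a real system the parts would be written to storage nodes as they arrive, and completion would only stitch together metadata rather than copy bytes.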
Practice Exercises
- Design: How would you implement object versioning? What are the storage implications of keeping all versions vs. lifecycle policies?
- Durability: If you use 3x replication across 3 data centers, calculate the probability of data loss given a 0.1% annual disk failure rate per node.
- Scale: Design a system to handle 5,500 GET requests per second per prefix. How would you distribute load across storage nodes?
- Optimization: How would you implement a CDN cache invalidation system for objects stored in S3? Design for both instant and eventual consistency.
Key Takeaways:
- Object storage uses a flat namespace with bucket organization for massive scalability
- Erasure coding (RS 10+4) is more storage-efficient than 3x replication with higher durability
- Separating metadata from data services allows independent scaling
- Multipart upload enables parallel chunk uploads for large objects
- Object immutability simplifies consistency but requires versioning for updates
What to Learn Next
- Design Pastebin: Storing large text objects with S3-style storage.
- Databases: Distributed metadata storage with Cassandra and PostgreSQL.
- CDNs: Caching objects at the edge for low-latency access.
- Data Replication: Replication strategies for durability and availability.
- Design Google Drive: File sync and storage with conflict resolution.
- Consistent Hashing: Distributing objects across storage nodes.