
Design Pastebin

Pastebin allows users to paste text or code and share it via a unique URL. Services like GitHub Gist, hastebin, and dpaste handle millions of pastes daily with simple creation and reading workflows.

Paste Creation — Accept text input, generate unique URL, store with metadata
Paste Reading — Retrieve paste content via short URL with minimal latency
Expiration — Pastes auto-delete after configurable TTL periods

Pastebin is simpler than most system design problems: it is essentially a write-once-read-many (WORM) system with time-based expiration.

Requirements

Functional Requirements

Users can create a paste with text content (up to 10 MB)
Each paste gets a unique, shareable URL
Pastes can be public or private (unlisted)
Pastes expire after a configurable duration (10 min, 1 hour, 1 day, 1 week, never)
Support syntax highlighting for code pastes
Users can set a custom name for the paste (optional)

Non-Functional Requirements

Latency: Read paste in < 100ms
Availability: 99.9% uptime
Durability: Pastes must not be lost before expiration
Scalability: 10M new pastes/day, 100M reads/day

Pastebin is read-heavy: at 100M reads against 10M writes per day, the read-to-write ratio is about 10:1, and popular pastes can push it higher. The write path can tolerate slightly higher latency since paste creation is not time-critical, but reads must be fast.

Back-of-the-Envelope Estimation

Storage and Traffic Estimation

Write path:

10M pastes/day ≈ 116 QPS
Average paste size: 10 KB
Daily write storage: 10M × 10 KB = 100 GB/day
Annual storage: ~36.5 TB

Read path:

100M reads/day ≈ 1,160 QPS
Average read size: 10 KB
Peak read QPS (3x): ~3,500 QPS

Cache strategy:

Hot pastes (20% of daily read volume): ~20M pastes × 10 KB = 200 GB cache
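The arithmetic above can be reproduced in a few lines, which makes it easy to rerun with different inputs. This is a minimal sketch: the variable names are illustrative, units are decimal (1 GB = 1,000,000 KB), and the 20% hot fraction is assumed to apply to daily read volume, since that is what yields the 20M figure.

```python
SECONDS_PER_DAY = 86_400

pastes_per_day = 10_000_000
reads_per_day = 100_000_000
avg_paste_kb = 10        # average paste size in KB
peak_factor = 3          # peak-to-average multiplier from the read path estimate
hot_fraction = 0.20      # assumed share of daily reads that should fit in cache

# Traffic
write_qps = pastes_per_day / SECONDS_PER_DAY
read_qps = reads_per_day / SECONDS_PER_DAY
peak_read_qps = read_qps * peak_factor

# Storage (decimal units: KB -> GB -> TB)
daily_storage_gb = pastes_per_day * avg_paste_kb / 1_000_000
annual_storage_tb = daily_storage_gb * 365 / 1_000
cache_gb = reads_per_day * hot_fraction * avg_paste_kb / 1_000_000

print(f"write QPS:      {write_qps:.0f}")
print(f"read QPS:       {read_qps:.0f}")
print(f"peak read QPS:  {peak_read_qps:.0f}")
print(f"daily storage:  {daily_storage_gb:.0f} GB")
print(f"annual storage: {annual_storage_tb:.1f} TB")
print(f"cache size:     {cache_gb:.0f} GB")
```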

API Design


POST /api/v1/pastes
Request:  { "content": "...", "syntax": "python", "visibility": "public", "expires_in": "1h" }
Response: { "paste_id": "abc123", "url": "https://paste.example.com/abc123" }

GET /{paste_id}
Response: { "content": "...", "syntax": "python", "created_at": "...", "expires_at": "..." }

GET /api/v1/pastes/{paste_id}/raw
Response: Plain text content (no JSON wrapping)

DELETE /api/v1/pastes/{paste_id}
Response: { "status": "deleted" }
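One detail the create endpoint has to handle is turning the `expires_in` value into an absolute `expires_at` timestamp. The sketch below shows one way to do that. Only `"1h"` appears in the request example; the other tokens (`"10m"`, `"1d"`, `"1w"`, `"never"`) and the function name are assumptions chosen to match the durations listed in the functional requirements.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed wire values; only "1h" is shown in the API example.
EXPIRY_OPTIONS = {
    "10m": timedelta(minutes=10),
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
    "never": None,
}

def compute_expires_at(expires_in: str, now: datetime) -> Optional[datetime]:
    """Return the absolute expiry time, or None for pastes that never expire."""
    if expires_in not in EXPIRY_OPTIONS:
        raise ValueError(f"unsupported expires_in: {expires_in!r}")
    ttl = EXPIRY_OPTIONS[expires_in]
    return None if ttl is None else now + ttl

created_at = datetime(2026, 6, 20, 10, 0, tzinfo=timezone.utc)
print(compute_expires_at("1h", created_at))
```

Restricting input to a fixed set of options keeps validation trivial and avoids parsing arbitrary duration strings.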

High-Level Architecture

Detailed Design

Storage Layer

Pastebin has two distinct storage needs:

Hot vs Cold Storage

Hot storage holds frequently accessed data in memory (Redis). Cold storage holds all data durably on disk or object storage. Pastebin uses a write-through pattern: writes go to both hot and cold storage, but reads try hot storage first.

Data Type	Storage	Reason
Paste content (large)	Object Storage (S3)	Cost-effective, durable, high throughput
Paste metadata	Relational DB (PostgreSQL)	Structured queries, relationships
Hot pastes	Redis	Sub-millisecond reads, TTL support
Expired paste tracking	Redis sorted set	Efficient expiration scanning

Expiration Strategy

Pastebin requires automatic deletion of expired pastes:

Option A: Lazy Expiration (Recommended)

Check expires_at on every read
Delete if expired; return 404
Simple, no background processing
A cached copy can still be served after expiry unless the cache TTL is aligned with expires_at
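The lazy approach can be sketched as a check inside the read handler. This is an illustration only: a plain dict stands in for the metadata store, and `PasteNotFound` is a hypothetical exception representing the 404 response.

```python
from datetime import datetime, timezone
from typing import Optional

class PasteNotFound(Exception):
    """Stands in for an HTTP 404 response."""

def read_paste(store: dict, paste_id: str, now: datetime) -> dict:
    """Return the paste record, deleting it first if it has expired."""
    record = store.get(paste_id)
    if record is None:
        raise PasteNotFound(paste_id)
    expires_at: Optional[datetime] = record["expires_at"]
    if expires_at is not None and expires_at <= now:
        del store[paste_id]  # lazy delete on the first read after expiry
        raise PasteNotFound(paste_id)
    return record

store = {
    "abc123": {
        "content": "print('hi')",
        "expires_at": datetime(2026, 6, 20, 11, 0, tzinfo=timezone.utc),
    }
}
before_expiry = datetime(2026, 6, 20, 10, 30, tzinfo=timezone.utc)
print(read_paste(store, "abc123", before_expiry)["content"])
```

Note that a paste nobody reads again is never deleted under this scheme, which is why lazy expiration is often paired with an occasional background sweep.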

Option B: Active Expiration

Background worker scans for expired pastes
Deletes from database and cache periodically
More consistent, but adds complexity
Track expirations in a Redis sorted set: member = paste ID, score = expires_at timestamp (add with ZADD, fetch due entries with ZRANGEBYSCORE)
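The background worker's sweep can be sketched without a Redis dependency. Below, a heap-backed `ExpiryIndex` class is a hypothetical in-memory stand-in for the sorted set, and plain dicts stand in for the database and cache; a real worker would query Redis and issue deletes against the actual stores.

```python
import heapq

class ExpiryIndex:
    """In-memory stand-in for a sorted set ordered by expiry timestamp."""

    def __init__(self) -> None:
        self._heap = []  # entries are (expires_at_epoch, paste_id)

    def add(self, paste_id: str, expires_at: float) -> None:
        heapq.heappush(self._heap, (expires_at, paste_id))

    def pop_due(self, now: float, limit: int = 1000) -> list:
        """Remove and return up to `limit` paste IDs whose expiry is <= now."""
        due = []
        while self._heap and self._heap[0][0] <= now and len(due) < limit:
            due.append(heapq.heappop(self._heap)[1])
        return due

def sweep(index: ExpiryIndex, metadata: dict, cache: dict, now: float) -> int:
    """Delete every due paste from the metadata store and the cache."""
    expired = index.pop_due(now)
    for paste_id in expired:
        metadata.pop(paste_id, None)
        cache.pop(paste_id, None)
    return len(expired)

index = ExpiryIndex()
metadata = {"a": {}, "b": {}, "c": {}}
cache = {"a": {}, "c": {}}
index.add("a", 100.0)
index.add("b", 200.0)
index.add("c", 300.0)
print(sweep(index, metadata, cache, now=250.0))
```

The `limit` parameter bounds the work done per sweep, so a backlog is drained over several iterations instead of one long pass.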

Expiration Scan Rate

$$\text{scan\_rate} = \frac{N_{\text{expired}}}{\text{window\_size}}$$

Here,

$N_{\text{expired}}$ = number of pastes expiring in the window
$\text{window\_size}$ = scan interval in seconds

Expiration Processing Load

If 10M pastes/day expire on average:

Scan rate = 10M / 86400 ≈ 116 expired pastes/second

This is manageable with a single background worker using a Redis sorted set.

Content Storage Pattern

Store large paste content in object storage, metadata in the database:


// Metadata record
{
  paste_id: "abc123",
  user_id: "user_456",
  content_path: "s3://pastes/ab/c1/abc123.txt",
  syntax: "python",
  visibility: "public",
  created_at: "2026-06-20T10:00:00Z",
  expires_at: "2026-06-20T11:00:00Z",
  size_bytes: 10240
}

Separating metadata from content allows you to cache and query metadata efficiently while keeping large content objects in cost-effective object storage.
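The `content_path` in the record above appears to nest objects under two 2-character prefixes taken from the paste ID. The sketch below derives a path that way. The layout is inferred from that single example path, and the function name and default bucket are assumptions.

```python
def content_path(paste_id: str, bucket: str = "pastes") -> str:
    """Build an object-storage path with two 2-character prefix levels."""
    if len(paste_id) < 4:
        raise ValueError("paste_id must have at least 4 characters")
    return f"s3://{bucket}/{paste_id[:2]}/{paste_id[2:4]}/{paste_id}.txt"

print(content_path("abc123"))  # s3://pastes/ab/c1/abc123.txt
```

Deriving the path from the ID means the location can always be recomputed, although storing it in the metadata record, as shown above, leaves room to change the layout later.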

Syntax Highlighting

Support syntax highlighting for code pastes:

Client submits paste with syntax parameter (or auto-detect)
Server stores raw content in object storage
On read, apply syntax highlighting at the CDN edge or application layer
Cache highlighted HTML alongside raw content

To avoid server-side highlighting overhead altogether, render on the client with Prism.js or highlight.js instead. Store the syntax hint with the metadata so the client can load the appropriate language pack.

Scaling Considerations

Database Partitioning

Partition paste metadata by paste ID hash:


shard = hash(paste_id) % NUM_SHARDS

This distributes pastes evenly across shards. For time-based queries (e.g., "recent pastes"), maintain a separate time-indexed table or use a time-series database.
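When implementing the shard formula, the hash must be stable across processes and machines. Python's built-in `hash()` is salted per process for strings, so the sketch below uses a SHA-256 digest instead. The value of `NUM_SHARDS` and the function name are illustrative.

```python
import hashlib

NUM_SHARDS = 16  # illustrative value

def shard_for(paste_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a paste ID to a shard index using a process-independent hash."""
    digest = hashlib.sha256(paste_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards

print(shard_for("abc123"))
```

Plain modulo sharding remaps most keys when `NUM_SHARDS` changes; consistent hashing is a common alternative if the shard count is expected to grow.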

Read Path Optimization

Check Redis cache for paste metadata
Cache hit: use the cached metadata to locate the content and return it from object storage (served via the CDN for public pastes)
Cache miss: query the database, populate the cache, then fetch and return the content
CDN caches public paste content at edge locations
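The read steps above can be sketched as a cache-aside lookup. Plain dicts stand in for Redis, the metadata database, and object storage, and the CDN layer is omitted; the function and parameter names are hypothetical.

```python
from typing import Optional

def get_paste(paste_id: str, cache: dict, db: dict, object_store: dict) -> Optional[dict]:
    """Return metadata plus content, or None if the paste does not exist."""
    meta = cache.get(paste_id)
    if meta is None:
        # Cache miss: fall back to the database and populate the cache.
        meta = db.get(paste_id)
        if meta is None:
            return None
        cache[paste_id] = meta
    content = object_store[meta["content_path"]]
    return {**meta, "content": content}

db = {"abc123": {"syntax": "python", "content_path": "s3://pastes/ab/c1/abc123.txt"}}
object_store = {"s3://pastes/ab/c1/abc123.txt": "print('hi')"}
cache = {}
print(get_paste("abc123", cache, db, object_store)["content"])
```

After the first call the metadata is in `cache`, so later reads skip the database entirely.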

Write Path Optimization

Generate unique paste ID (Snowflake or random)
Write metadata to database (async if possible)
Write content to object storage (S3 multipart upload for large pastes)
Populate Redis cache proactively
Return paste URL to user

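For pastes larger than 1 MB, upload the content to object storage asynchronously and return the URL immediately without waiting for the upload to complete. Treat the paste as pending until then, and notify the user when it is ready. Multipart upload only helps for the largest pastes, since S3 requires every part except the last to be at least 5 MiB.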
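A synchronous version of the write path can be sketched as follows, with dicts standing in for the database, object storage, and Redis. The ID alphabet, ID length, and function names are assumptions. The sketch also writes content before metadata, which differs from the order listed above; that ordering is a choice made here so a metadata row never points at a missing object.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed ID alphabet

def new_paste_id(length: int = 8) -> str:
    """Generate a random ID from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

def create_paste(content: str, syntax: str, db: dict, object_store: dict, cache: dict) -> str:
    """Store a paste and return its shareable URL."""
    paste_id = new_paste_id()
    while paste_id in db:  # retry on the rare collision
        paste_id = new_paste_id()
    path = f"s3://pastes/{paste_id[:2]}/{paste_id[2:4]}/{paste_id}.txt"
    object_store[path] = content  # content first (ordering chosen for this sketch)
    meta = {"paste_id": paste_id, "syntax": syntax, "content_path": path,
            "size_bytes": len(content.encode("utf-8"))}
    db[paste_id] = meta
    cache[paste_id] = meta        # populate the cache proactively
    return f"https://paste.example.com/{paste_id}"

db, object_store, cache = {}, {}, {}
print(create_paste("print('hi')", "python", db, object_store, cache))
```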

What to Learn Next

Design URL Shortener: Similar WORM pattern with ID generation and caching strategies.
Caching Strategies: Cache-aside, write-through, and TTL-based expiration patterns.
CDNs: Caching static content at the edge for global low-latency access.
Databases: Choosing between SQL and NoSQL for metadata storage.
Design Object Storage: Building scalable blob storage for large content objects.
Design Unique ID Generator: Snowflake IDs, UUIDs, and distributed ID generation.
