System Design Foundations
Introduction to System Design
System design is the discipline of defining the architecture, components, modules, interfaces, and data flow of a system to satisfy specified requirements. This guide provides a rigorous foundation for reasoning about distributed systems at scale.
- Architecture — The high-level structure of a system and its components
- Trade-offs — Every design decision involves competing constraints
- Scalability — Systems must handle growth in users, data, and traffic
The goal of system design is not perfection—it is making informed decisions under uncertainty.
What Is System Design?
System design is the process of defining the architecture, interfaces, data models, and operational characteristics of a software system to meet functional and non-functional requirements.
Definition: System Design
System design is the disciplined practice of specifying the structure and behavior of a system, from a high-level view down to more detailed ones. It encompasses architectural decisions (component decomposition, communication patterns), data modeling (schemas, storage strategies), and operational concerns (scalability, reliability, observability). The goal is to produce a blueprint that satisfies requirements while balancing competing constraints such as cost, complexity, and time-to-market.
Functional vs Non-Functional Requirements
Every system design begins with understanding requirements. These split into two categories:
| Category | Examples |
|---|---|
| Functional | User authentication, search, payments, notifications |
| Non-Functional | Latency < 100ms, 99.99% uptime, support 1M concurrent users, GDPR compliance |
Non-functional requirements (NFRs) are often called quality attributes or "-ilities". They are the primary drivers of architectural decisions: two systems with the same functional requirements but different NFRs can end up with very different architectures.
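An availability target such as the 99.99% in the table is easier to reason about as a downtime budget. The following is a minimal sketch of that conversion, assuming a 365-day year; the function name `downtime_budget_minutes` is illustrative and not from any library.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 365) -> float:
    """Return the downtime (in minutes) allowed over `days` at the given availability.

    Assumption: a 365-day year by default; leap years are ignored.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows {downtime_budget_minutes(target):.1f} minutes of downtime per year")
```

Under these assumptions, 99.99% uptime leaves roughly 52.6 minutes of downtime per year, which is why each additional "nine" tends to change the architecture rather than just the configuration.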
The Core Principles
Several principles guide effective system design:
Principle 1: Know Your Constraints
Before designing anything, understand:
- Scale: How many users? How much data? What growth rate?
- Latency: What are the response time requirements?
- Consistency: How strong must the consistency guarantees be?
- Budget: What are the cost constraints?
Principle 2: Design for Failure
Distributed systems fail. Networks partition, disks fail, processes crash. Design every component to be resilient:
- Redundancy at every layer
- Graceful degradation under failure
- Circuit breakers and bulkheads
- Health checks and automatic recovery
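As an illustration of one item from the list above, here is a toy circuit breaker. It is a sketch under stated assumptions, not a production implementation: the class name, the consecutive-failure threshold, and the single-trial "half-open" behavior are all choices made for this example.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (illustrative only).

    Opens after `max_failures` consecutive failures, fails fast while open,
    and allows a trial call once `reset_after` seconds have passed.
    """

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a flaky dependency in `breaker.call(...)` lets callers fail fast instead of piling up on a service that is already down, which is one way to get graceful degradation.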
Principle 3: Keep It Simple
Complexity is the enemy of reliability. Every additional component, every additional network hop, every additional layer of abstraction introduces failure modes and increases cognitive load.
YAGNI (You Aren't Gonna Need It) applies powerfully to system design. Design for current requirements plus one step ahead. Over-engineering is often more dangerous than under-engineering because it wastes resources and obscures the system's true complexity.
Principle 4: Make Trade-offs Explicit
Every design decision involves trade-offs. The best engineers can articulate why they chose one approach over another and what they sacrificed.
The Design Process
A systematic approach to system design follows these phases:
Phase 1: Requirements Gathering
Clarify functional and non-functional requirements. Ask questions:
- What are the core use cases?
- What is the expected scale (users, data, QPS)?
- What are the latency requirements?
- What consistency guarantees are needed?
- What are the availability targets?
Phase 2: Back-of-the-Envelope Estimation
Quantify the problem before designing the solution:
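Traffic Estimation
$$\text{QPS} = \frac{\text{DAU} \times R}{86400}$$
Here,
- QPS = Queries per second
- DAU = Daily active users
- R = Requests per user per day
- 86400 = Seconds in a day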
Estimating QPS for a Social Media Feed
Suppose we have 100M daily active users, each viewing 10 feeds per day and posting 2 updates per day.
Feed reads: QPS_read = (100M × 10) / 86400 ≈ 11,600 QPS
Feed writes: QPS_write = (100M × 2) / 86400 ≈ 2,300 QPS
Peak QPS (3x average): read ≈ 35,000 QPS, write ≈ 7,000 QPS
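The social media feed estimate above can be reproduced with a few lines of Python. This is a minimal sketch; the function names and the default peak factor of 3 are assumptions taken from the worked example, not a general rule.

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_active_users: int, actions_per_user_per_day: float) -> float:
    """Average queries per second, assuming traffic is spread evenly over the day."""
    return daily_active_users * actions_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 3.0) -> float:
    """Peak estimate; the 3x factor is the assumption used in the example above."""
    return avg_qps * peak_factor


read_qps = average_qps(100_000_000, 10)   # 10 feed views per user per day
write_qps = average_qps(100_000_000, 2)   # 2 posts per user per day

print(f"read:  avg {read_qps:,.0f} QPS, peak {peak_qps(read_qps):,.0f} QPS")
print(f"write: avg {write_qps:,.0f} QPS, peak {peak_qps(write_qps):,.0f} QPS")
```

The script keeps full precision, so its figures differ slightly from the rounded ones in the text (for example, about 11,574 rather than 11,600 read QPS). For back-of-the-envelope work, that difference does not matter.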
Phase 3: High-Level Design
Identify the major components and their interactions:
- Client Layer: Web, mobile, API consumers
- Application Layer: Business logic, orchestration
- Data Layer: Databases, caches, search indices
- Infrastructure Layer: Load balancers, message queues, CDN
Phase 4: Detailed Design
Deep-dive into each component:
- Data models and schemas
- API contracts
- Database selection and partitioning
- Caching strategies
- Communication patterns (sync vs async)
Phase 5: Trade-off Analysis
Document the decisions made and alternatives considered. This is where senior engineers distinguish themselves—not by knowing the "right" answer, but by understanding why a choice was made.
System Design Taxonomy
Systems can be categorized along several dimensions:
Monolithic vs Distributed
Synchronous vs Asynchronous
- Synchronous: Request-response, tight coupling, simpler to reason about
- Asynchronous: Event-driven, decoupled, better for scalability and resilience
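The contrast between the two styles can be sketched as follows. This is an illustrative toy: an in-process `deque` stands in for a real message queue, and the order and email functions are hypothetical names made up for the example.

```python
from collections import deque


def send_email(order_id, outbox):
    """Stand-in for a slow downstream call (hypothetical)."""
    outbox.append(f"email for {order_id}")


def place_order_sync(order_id, outbox):
    # Synchronous: the caller waits for the downstream work to finish.
    send_email(order_id, outbox)
    return "confirmed"


def place_order_async(order_id, queue):
    # Asynchronous: record an event and return immediately;
    # a separate worker does the downstream work later.
    queue.append({"type": "order_placed", "order_id": order_id})
    return "accepted"


def run_worker(queue, outbox):
    # Drain the queue; in a real system this would run in another process.
    while queue:
        event = queue.popleft()
        send_email(event["order_id"], outbox)
```

In the synchronous version a slow or failing `send_email` directly slows or fails the order request. In the asynchronous version the request succeeds as soon as the event is queued, at the cost of the email being sent only eventually.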
Stateless vs Stateful
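Definition: Stateless vs Stateful
A stateless system stores no client-specific state between requests. Each request contains all information needed to process it. A stateful system keeps session state on the server between requests, so a client's requests must either be routed to the same instance (sticky sessions) or the state must be moved to an external state store.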
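A small sketch can make the difference concrete. The shopping-cart example, class name, and function name below are hypothetical, chosen only to illustrate the definition.

```python
class StatefulCartServer:
    """Stateful: each server instance remembers carts in its own memory."""

    def __init__(self):
        self._carts = {}  # session_id -> list of items, local to this instance

    def add_item(self, session_id, item):
        self._carts.setdefault(session_id, []).append(item)
        return list(self._carts[session_id])


def add_item_stateless(cart, item):
    """Stateless: the request carries the whole cart; nothing is kept server-side."""
    return [*cart, item]


# Two instances behind a load balancer (simulated).
server_a, server_b = StatefulCartServer(), StatefulCartServer()
server_a.add_item("session-1", "book")
# The next request lands on a different instance, which has never seen the session.
print(server_b.add_item("session-1", "pen"))
# The stateless handler gives the same answer on any instance.
print(add_item_stateless(["book"], "pen"))
```

The second request to the stateful servers loses the book because it reached a different instance. That is the problem sticky sessions or an external state store are meant to solve.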
Key Metrics
System design requires understanding and optimizing for specific metrics:
Little's Law (Concurrency)
$$L = \lambda W$$
Here,
- $L$ = Average number of concurrent requests in the system
- $\lambda$ = Average arrival rate (requests per second)
- $W$ = Average time a request spends in the system (seconds)
Applying Little's Law
If your service receives 1000 QPS and each request takes 200ms to process:
L = 1000 × 0.2 = 200 concurrent requests
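This is the average number of requests in flight. If each worker or thread handles one request at a time, you need at least 200 of them to keep up with the average load. Bursts above the average will still queue, so provision some headroom.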
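The same calculation can be written as a short helper. This is a sketch: the function names are illustrative, and `target_utilization` is an added assumption for leaving headroom, not part of Little's Law itself.

```python
import math


def concurrent_requests(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x average time in system."""
    return arrival_rate_per_s * avg_time_in_system_s


def workers_needed(arrival_rate_per_s: float, avg_time_in_system_s: float,
                   target_utilization: float = 1.0) -> int:
    """Workers required if each handles one request at a time.

    Assumption: `target_utilization` below 1.0 reserves headroom for bursts.
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    in_flight = concurrent_requests(arrival_rate_per_s, avg_time_in_system_s)
    # Round before ceil so floating-point noise does not add a spurious worker.
    return math.ceil(round(in_flight / target_utilization, 9))


print(workers_needed(1000, 0.2))        # average load only
print(workers_needed(1000, 0.2, 0.7))   # with roughly 30% headroom
```

At 1000 QPS and 200 ms per request, the first call gives the 200 workers from the example, and aiming for 70% utilization raises that to 286.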
Amdahl's Law
When optimizing system performance, Amdahl's Law tells us the maximum improvement possible:
Amdahl's Law
$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$
Here,
- $S$ = Maximum speedup
- $P$ = Fraction of the system that can be parallelized
- $N$ = Speedup of the parallelizable portion
Amdahl's Law in Practice
If 75% of your system can be parallelized and the parallelizable portion runs 10x faster:
S = 1 / ((1 - 0.75) + 0.75/10) = 1 / (0.25 + 0.075) ≈ 3.08x
Even with infinite parallelism, the sequential 25% limits speedup to 4x. This is why understanding bottlenecks is critical.
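The worked example and the 4x ceiling can both be checked numerically. A minimal sketch, with illustrative function names:

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when `parallel_fraction` of the work runs `parallel_speedup`x faster."""
    if not 0 <= parallel_fraction <= 1:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / parallel_speedup)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the parallel speedup grows without limit."""
    if not 0 <= parallel_fraction < 1:
        raise ValueError("parallel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


for n in (2, 10, 100, 1000):
    print(f"N={n:>4}: speedup {amdahl_speedup(0.75, n):.2f}x")
print(f"limit: {amdahl_limit(0.75):.2f}x")
```

Running the loop shows the speedup approaching but never reaching 4x, which is the diminishing return described above.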
The CAP Theorem
One of the most fundamental results in distributed systems:
Definition: CAP Theorem
The CAP Theorem (Brewer, 2000; Gilbert & Lynch, 2002) states that a distributed data store can provide at most two of the following three guarantees:
- Consistency (C): Every read receives the most recent write or an error
- Availability (A): Every request receives a non-error response, though not necessarily one reflecting the most recent write
- Partition Tolerance (P): The system continues to operate despite network partitions
Since network partitions are inevitable in distributed systems, the real choice is between CP and AP systems.
The CAP theorem is often misunderstood. It does not say you must always choose between C and A. The choice arises only during a network partition. Most of the time, when the network is healthy, you can have both.
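The choice that appears during a partition can be illustrated with a deliberately simplified model. This is a toy sketch, not a real replication protocol: the two-replica store, the `partitioned` flag, and the `"CP"`/`"AP"` mode strings are all assumptions made for the example.

```python
class ToyReplicatedStore:
    """Two replicas, A and B. Writes arrive at A; reads are served by B."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value_a = None
        self.value_b = None
        self.partitioned = False  # True means A cannot reach B

    def write(self, value):
        if not self.partitioned:
            # Healthy network: replicate everywhere, so C and A both hold.
            self.value_a = value
            self.value_b = value
            return
        if self.mode == "CP":
            # Give up availability: refuse writes that cannot be replicated.
            raise RuntimeError("unavailable: cannot replicate during partition")
        # AP: stay available, accept the write locally; B is now stale.
        self.value_a = value

    def read_from_b(self):
        return self.value_b


cp, ap = ToyReplicatedStore("CP"), ToyReplicatedStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True

ap.write("v2")            # accepted, but replica B still holds "v1"
print(ap.read_from_b())   # stale read: availability kept, consistency lost
try:
    cp.write("v2")
except RuntimeError as err:
    print(err)            # write rejected: consistency kept, availability lost
```

Before the partition both stores behave identically, which matches the point that the trade-off only bites while the network is split.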
Practice Exercises
- Conceptual: Explain the difference between scalability and performance. Can a system be performant but not scalable? Give an example.
- Estimation: A URL shortener handles 100M new URLs per month with a 10:1 read-to-write ratio. Estimate the QPS for reads and writes. How much storage is needed for 5 years at 500 bytes per URL record?
- Design: Sketch a high-level architecture for a real-time notification system that must deliver messages to 50M users within 500ms. Identify the key components and their responsibilities.
- Trade-offs: Compare synchronous and asynchronous architectures for processing payment transactions. What are the trade-offs in terms of consistency, latency, and complexity?
Key Takeaways:
- System design is the discipline of defining architecture, components, and data flow to satisfy requirements
- Non-functional requirements (NFRs) are the primary drivers of architectural decisions
- Design for failure, keep it simple, and make trade-offs explicit
- Use back-of-the-envelope estimation to quantify the problem before designing the solution
- Understand fundamental results like Little's Law, Amdahl's Law, and the CAP theorem
What to Learn Next
- Scalability Fundamentals: Vertical vs horizontal scaling, load balancing, and capacity planning.
- Networking Fundamentals: TCP/IP, HTTP, DNS, CDNs, and network latency.
- API Design: REST, GraphQL, gRPC, versioning, and rate limiting.
- Databases: SQL vs NoSQL, indexing, replication, and sharding.
- CAP Theorem: Consistency models, availability, and partition tolerance.
- Load Balancing: Distribution algorithms and L4 vs L7 load balancing.