System Design Foundations

Introduction to System Design

System design is the discipline of defining the architecture, components, modules, interfaces, and data flow of a system to satisfy specified requirements. This guide provides a rigorous foundation for reasoning about distributed systems at scale.

Architecture — The high-level structure of a system and its components
Trade-offs — Every design decision involves competing constraints
Scalability — Systems must handle growth in users, data, and traffic

The goal of system design is not perfection—it is making informed decisions under uncertainty.

What Is System Design?

System design is the process of defining the architecture, interfaces, data models, and operational characteristics of a software system to meet functional and non-functional requirements.

Definition: System Design

System design is the disciplined practice of specifying the structure, behavior, and other views of a system. It encompasses architectural decisions (component decomposition, communication patterns), data modeling (schemas, storage strategies), and operational concerns (scalability, reliability, observability). The goal is to produce a blueprint that satisfies requirements while balancing competing constraints such as cost, complexity, and time-to-market.

Functional vs Non-Functional Requirements

Every system design begins with understanding requirements. These split into two categories:

Category	Examples
Functional	User authentication, search, payments, notifications
Non-Functional	Latency < 100ms, 99.99% uptime, support 1M concurrent users, GDPR compliance

Non-functional requirements (NFRs) are often called quality attributes or "-ilities" (scalability, reliability, availability). They are the primary drivers of architectural decisions: two systems with the same functional requirements but different NFRs can end up with very different architectures.
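To make an availability target such as the 99.99% in the table concrete, it helps to convert it into a downtime budget. The sketch below is illustrative only; the function name `allowed_downtime_seconds` and the 365-day year are assumptions, not part of the original guide.

```python
# Assumption: a 365-day year (leap years ignored) is precise enough for estimation.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the yearly budget by a factor of ten, which is one reason a stricter availability target tends to change the architecture rather than just the configuration.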

The Core Principles

Several principles guide effective system design:

Principle 1: Know Your Constraints

Before designing anything, understand:

Scale: How many users? How much data? What growth rate?
Latency: What are the response time requirements?
Consistency: How strong are the consistency guarantees?
Budget: What are the cost constraints?

Principle 2: Design for Failure

Distributed systems fail. Networks partition, disks fail, and processes crash. Design every component to be resilient:

Redundancy at every layer
Graceful degradation under failure
Circuit breakers and bulkheads
Health checks and automatic recovery
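As one illustration of the circuit breaker idea from the list above, here is a minimal sketch. The class name, the thresholds, and the injectable `clock` parameter are all assumptions made for this example; production libraries usually add more states, metrics, and thread safety.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call because the circuit is open."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to degrade gracefully, such as serving a cached or default response.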

Principle 3: Keep It Simple

Complexity is the enemy of reliability. Every additional component, every additional network hop, every additional layer of abstraction introduces failure modes and increases cognitive load.

YAGNI (You Aren't Gonna Need It) applies powerfully to system design. Design for current requirements plus one step ahead. Over-engineering can be more dangerous than under-engineering because it wastes resources and obscures the system's true complexity.

Principle 4: Make Trade-offs Explicit

Every design decision involves trade-offs. The best engineers can articulate why they chose one approach over another and what they sacrificed.

The Design Process

A systematic approach to system design follows these phases:

Phase 1: Requirements Gathering

Clarify functional and non-functional requirements. Ask questions:

What are the core use cases?
What is the expected scale (users, data, QPS)?
What are the latency requirements?
What consistency guarantees are needed?
What are the availability targets?

Phase 2: Back-of-the-Envelope Estimation

Quantify the problem before designing the solution:

Traffic Estimation

$$\text{QPS} = \frac{\text{DAU} \times \text{Actions per User per Day}}{86400}$$

Here,

$\text{QPS}$ = Queries per second
$\text{DAU}$ = Daily active users
$86400$ = Seconds in a day

Estimating QPS for a Social Media Feed

Suppose we have 100M daily active users, each viewing 10 feeds per day and posting 2 updates per day.

Feed reads: QPS_read = (100M × 10) / 86400 ≈ 11,600 QPS

Feed writes: QPS_write = (100M × 2) / 86400 ≈ 2,300 QPS

Peak QPS (3x average): peak read ≈ 35,000 QPS; peak write ≈ 7,000 QPS
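The same estimate can be reproduced in a few lines of code, which makes it easy to vary the inputs. This is a rough sketch: the function names are illustrative, and the 3x peak factor is simply the assumption used in the example above.

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_active_users: int, actions_per_user_per_day: float) -> float:
    """Average queries per second, spread evenly over a day."""
    return daily_active_users * actions_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps: float, peak_factor: float = 3.0) -> float:
    """Peak load as a multiple of the average (3x is an assumed rule of thumb)."""
    return avg_qps * peak_factor


if __name__ == "__main__":
    dau = 100_000_000
    read = average_qps(dau, 10)   # 10 feed views per user per day
    write = average_qps(dau, 2)   # 2 posts per user per day
    print(f"average read:  {read:,.0f} QPS, peak: {peak_qps(read):,.0f} QPS")
    print(f"average write: {write:,.0f} QPS, peak: {peak_qps(write):,.0f} QPS")
```

The exact results (about 11,574 and 2,315 QPS on average) round to the figures quoted in the example; at this stage the order of magnitude matters more than the digits.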

Phase 3: High-Level Design

Identify the major components and their interactions:

Client Layer: Web, mobile, API consumers
Application Layer: Business logic, orchestration
Data Layer: Databases, caches, search indices
Infrastructure Layer: Load balancers, message queues, CDN

Phase 4: Detailed Design

Deep-dive into each component:

Data models and schemas
API contracts
Database selection and partitioning
Caching strategies
Communication patterns (sync vs async)

Phase 5: Trade-off Analysis

Document the decisions made and alternatives considered. This is where senior engineers distinguish themselves—not by knowing the "right" answer, but by understanding why a choice was made.

System Design Taxonomy

Systems can be categorized along several dimensions:

Monolithic vs Distributed

Synchronous vs Asynchronous

Synchronous: Request-response, tight coupling, simpler to reason about
Asynchronous: Event-driven, decoupled, better for scalability and resilience
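The contrast can be sketched with a signup flow that sends a welcome email. Everything here is hypothetical (`send_email`, `signup_sync`, `signup_async`), and an in-process `queue.Queue` stands in for a real message broker.

```python
import queue
import threading


def send_email(user: str, sent: list) -> None:
    """Stand-in for a slow downstream call; records the user instead of sending."""
    sent.append(user)


def signup_sync(user: str, sent: list) -> str:
    # Synchronous: the caller waits for the email step, and fails if it fails.
    send_email(user, sent)
    return "created"


def signup_async(user: str, jobs: queue.Queue) -> str:
    # Asynchronous: enqueue the work and return immediately.
    jobs.put(user)
    return "accepted"


def worker(jobs: queue.Queue, sent: list) -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        user = jobs.get()
        try:
            if user is None:
                return
            send_email(user, sent)
        finally:
            jobs.task_done()


if __name__ == "__main__":
    sent: list = []
    jobs: queue.Queue = queue.Queue()
    t = threading.Thread(target=worker, args=(jobs, sent))
    t.start()
    print(signup_sync("alice", sent))    # email already sent when this returns
    print(signup_async("bob", jobs))     # email sent later by the worker
    jobs.put(None)                       # tell the worker to stop
    t.join()
    print(sent)
```

The asynchronous path returns before the email is sent, so the caller is decoupled from the email service's latency and failures, at the cost of having to handle delivery that completes (or fails) later.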

Stateless vs Stateful

Definition: Stateless vs Stateful

A stateless system stores no client-specific state between requests. Each request contains all information needed to process it. A stateful system maintains session state, requiring sticky sessions or external state stores.
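A small sketch can make the difference visible. The two cart server classes below are hypothetical, and a plain dict stands in for an external state store such as a cache or database.

```python
class StatefulCartServer:
    """Keeps carts in process memory, so a client must keep hitting this instance."""

    def __init__(self):
        self._carts = {}

    def add_item(self, session_id: str, item: str) -> list:
        self._carts.setdefault(session_id, []).append(item)
        return list(self._carts[session_id])


class StatelessCartServer:
    """Keeps no per-client state; any instance can serve any request."""

    def __init__(self, store: dict):
        self._store = store  # shared external store (a dict is an assumption here)

    def add_item(self, session_id: str, item: str) -> list:
        cart = self._store.get(session_id, []) + [item]
        self._store[session_id] = cart
        return list(cart)


if __name__ == "__main__":
    a, b = StatefulCartServer(), StatefulCartServer()
    a.add_item("s1", "book")
    print(b.add_item("s1", "pen"))   # instance b never saw the book

    shared = {}
    c, d = StatelessCartServer(shared), StatelessCartServer(shared)
    c.add_item("s1", "book")
    print(d.add_item("s1", "pen"))   # instance d sees both items
```

With the stateful servers, a load balancer would need sticky sessions to keep a client on the same instance; with the stateless servers, requests can be routed to any instance because the state lives in the shared store.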

Key Metrics

System design requires understanding and optimizing for specific metrics:

Little's Law (Concurrency)

$$L = \lambda \times W$$

Here,

$L$ = Average number of concurrent requests in the system
$\lambda$ = Average arrival rate (requests per second)
$W$ = Average time a request spends in the system (seconds)

Applying Little's Law

If your service receives 1000 QPS and each request takes 200ms to process:

L = 1000 × 0.2 = 200 concurrent requests

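This is the average number of requests in flight. Assuming each worker or thread handles one request at a time, it is also the minimum number of workers needed to keep up with the load without a growing queue. Real traffic is bursty, so provision headroom above this figure.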
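The calculation is simple enough to keep as a helper for capacity planning. In this sketch the function names and the `target_utilization` parameter are assumptions; the latter is one simple way to add headroom and is not part of Little's Law itself.

```python
import math


def concurrent_requests(arrival_rate_per_s: float, avg_time_in_system_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * avg_time_in_system_s


def workers_needed(arrival_rate_per_s: float, avg_time_in_system_s: float,
                   target_utilization: float = 1.0) -> int:
    """Workers required if each handles one request at a time.

    target_utilization below 1.0 leaves headroom for bursts (an assumed policy).
    """
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    in_flight = concurrent_requests(arrival_rate_per_s, avg_time_in_system_s)
    return math.ceil(in_flight / target_utilization)


if __name__ == "__main__":
    print(concurrent_requests(1000, 0.2))    # average requests in flight
    print(workers_needed(1000, 0.2))         # bare minimum
    print(workers_needed(1000, 0.2, 0.7))    # with roughly 30% headroom
```

For the example above (1000 QPS at 200ms), the bare minimum is 200 workers, and running them at 70% target utilization raises that to 286.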

Amdahl's Law

When optimizing system performance, Amdahl's Law tells us the maximum improvement possible:

Amdahl's Law

$$S = \frac{1}{(1 - p) + \frac{p}{s}}$$

Here,

$S$ = Overall speedup of the system
$p$ = Fraction of execution time that can be parallelized
$s$ = Speedup of the parallelizable portion

Amdahl's Law in Practice

If 75% of your system can be parallelized and the parallel portion runs 10x faster:

S = 1 / ((1 - 0.75) + 0.75/10) = 1 / (0.25 + 0.075) = 3.08x

Even with infinite parallelism, the sequential 25% limits speedup to 4x. This is why understanding bottlenecks is critical.
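A short sketch shows how quickly the curve flattens as the parallel speedup grows. The function names are illustrative; the formula is the one given above.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1 / ((1 - p) + p / s)


def amdahl_limit(p: float) -> float:
    """Upper bound on speedup as s grows without bound: 1 / (1 - p)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return float("inf") if p == 1 else 1 / (1 - p)


if __name__ == "__main__":
    for s in (2, 10, 100, 1000):
        print(f"s = {s:>4}: overall speedup = {amdahl_speedup(0.75, s):.2f}x")
    print(f"limit: {amdahl_limit(0.75):.2f}x")
```

With p = 0.75, going from s = 10 to s = 1000 moves the overall speedup only from about 3.08x to just under 4x, so further gains have to come from shrinking the sequential portion.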

The CAP Theorem

One of the most fundamental results in distributed systems:

Definition: CAP Theorem

The CAP Theorem (Brewer, 2000; Gilbert & Lynch, 2002) states that a distributed data store can provide at most two of the following three guarantees:

Consistency (C): Every read receives the most recent write or an error
Availability (A): Every request receives a non-error response, though not necessarily the most recent write
Partition Tolerance (P): The system continues to operate despite network partitions

Since network partitions are inevitable in distributed systems, the real choice is between CP and AP systems.

The CAP theorem is often misunderstood. It does not say you must always give up consistency or availability. The trade-off applies only during a network partition, when you must choose between C and A. Most of the time, when the network is healthy, you can have both.
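The choice during a partition can be illustrated with a toy model. This is a deliberately simplified sketch, not how any real data store is implemented: the `Replica` class, the `replicate` helper, and the `"CP"`/`"AP"` mode flag are all assumptions made for illustration.

```python
class Replica:
    """Toy replica that behaves as CP or AP when cut off from its peers."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True when this replica cannot reach its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable: cannot confirm latest value")
        # AP (or healthy network): answer with the local value, which may be stale.
        return self.value


def replicate(value, replicas) -> None:
    """Deliver a write to every replica that is currently reachable."""
    for replica in replicas:
        if not replica.partitioned:
            replica.value = value


if __name__ == "__main__":
    cp, ap = Replica("CP"), Replica("AP")
    replicate("v1", [cp, ap])            # healthy network: both are up to date
    cp.partitioned = ap.partitioned = True
    replicate("v2", [cp, ap])            # the write cannot reach either replica
    print(ap.read())                     # AP stays available but serves stale data
    try:
        cp.read()
    except RuntimeError as exc:
        print(exc)                       # CP sacrifices availability
```

While the network is healthy both replicas return the latest value; only once the partition begins do their behaviors diverge.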

Practice Exercises

Conceptual: Explain the difference between scalability and performance. Can a system be performant but not scalable? Give an example.
Estimation: A URL shortener handles 100M new URLs per month and has a 10:1 read-to-write ratio. Estimate the QPS for reads and writes. How much storage is needed for 5 years at 500 bytes per URL record?
Design: Sketch a high-level architecture for a real-time notification system that must deliver messages to 50M users within 500ms. Identify the key components and their responsibilities.
Trade-offs: Compare synchronous and asynchronous architectures for processing payment transactions. What are the trade-offs in terms of consistency, latency, and complexity?

Key Takeaways:

System design is the discipline of defining architecture, components, and data flow to satisfy requirements
Non-functional requirements (NFRs) are the primary drivers of architectural decisions
Design for failure, keep it simple, and make trade-offs explicit
Use back-of-the-envelope estimation to quantify the problem before designing the solution
Understand fundamental results like Little's Law, Amdahl's Law, and the CAP theorem

What to Learn Next

-> Scalability Fundamentals: Vertical vs horizontal scaling, load balancing, and capacity planning.

-> Networking Fundamentals: TCP/IP, HTTP, DNS, CDNs, and network latency.

-> API Design: REST, GraphQL, gRPC, versioning, and rate limiting.

-> Databases: SQL vs NoSQL, indexing, replication, and sharding.

-> CAP Theorem: Consistency models, availability, and partition tolerance.

-> Load Balancing: Distribution algorithms and L4 vs L7 load balancing.
