System Design Foundations

Scalability Fundamentals

Scalability is the ability of a system to handle increased load by adding resources. This guide covers the mathematical foundations, strategies, and trade-offs for building systems that grow gracefully with demand.

Vertical Scaling — Upgrade the machine (bigger CPU, more RAM)
Horizontal Scaling — Add more machines to distribute load
Capacity Planning — Predict and prepare for future growth

Scale is not about handling today's traffic—it's about handling tomorrow's.

What Is Scalability?

Scalability is a system's ability to maintain or improve performance as resources (compute, memory, storage, network) are added.

Definition: Scalability

Scalability is the capability of a system to handle a growing amount of work by adding resources. Scaling is linear when doubling the resources doubles the throughput. Most real systems scale sublinearly because of coordination and contention overhead, so near-linear scaling is the practical goal. Scalability can be achieved vertically (scaling up) or horizontally (scaling out).
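One way to put a number on this definition is a scaling efficiency ratio: throughput gain divided by resource gain. The sketch below is illustrative only; the function name and the measurements are hypothetical, not taken from a real system.

```python
def scaling_efficiency(base_throughput, base_resources, new_throughput, new_resources):
    """Return throughput gain divided by resource gain (1.0 means linear scaling)."""
    values = (base_throughput, base_resources, new_throughput, new_resources)
    if min(values) <= 0:
        raise ValueError("all inputs must be positive")
    throughput_gain = new_throughput / base_throughput
    resource_gain = new_resources / base_resources
    return throughput_gain / resource_gain


# Hypothetical measurement: 2 servers served 1000 req/s, 4 servers serve 1800 req/s.
efficiency = scaling_efficiency(1000, 2, 1800, 4)
print(f"scaling efficiency: {efficiency:.2f}")
```

A value near 1.0 indicates near-linear scaling; a value well below 1.0 suggests that some shared resource or sequential step is limiting the gain.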

Vertical vs Horizontal Scaling

Definition: Vertical Scaling (Scale Up)

Vertical scaling increases the capacity of a single machine by adding more CPU, RAM, or storage. It is simpler to implement (typically no code changes are needed) but has a hard ceiling: the largest machine available.

Definition: Horizontal Scaling (Scale Out)

Horizontal scaling distributes load across multiple machines. It requires the application to be stateless or to externalize its state. The theoretical limit is much higher than with vertical scaling, but it introduces distributed systems complexity.
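To illustrate what externalizing state means, here is a minimal sketch in which two interchangeable servers share one session store. The class names are hypothetical, and the in-memory dictionary only stands in for a shared external store such as a cache or database.

```python
class SessionStore:
    """In-memory stand-in for a shared external store (assumption for illustration)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = dict(session)


class AppServer:
    """A stateless server: it keeps no per-user data between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_add_to_cart(self, session_id, item):
        # Load the session from the shared store, update it, and write it back.
        session = self.store.get(session_id)
        cart = list(session.get("cart", []))
        cart.append(item)
        self.store.put(session_id, {**session, "cart": cart})
        return {"served_by": self.name, "cart": cart}


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

# Consecutive requests from the same user can land on different servers.
server_a.handle_add_to_cart("s1", "book")
response = server_b.handle_add_to_cart("s1", "pen")
print(response)
```

Because neither server holds the cart itself, any server can handle any request, which is what lets a load balancer add or remove machines freely.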

Dimension	Vertical	Horizontal
Complexity	Low (no code changes)	High (distributed systems)
Ceiling	Limited by largest machine	Virtually unlimited
Cost	Rises steeply at the high end	Roughly linear growth
Fault Tolerance	Single point of failure	Redundant
Latency	No inter-node communication	Network overhead

The Math of Scaling

Amdahl's Law and Scaling

When you add N machines, the theoretical maximum speedup is:

Amdahl's Law for N Machines

$$S(N) = \frac{1}{(1 - p) + \frac{p}{N}}$$

Here,

$S(N)$ = Speedup with $N$ machines
$p$ = Fraction of the workload that is parallelizable
$1 - p$ = Sequential fraction (fixed overhead)

Diminishing Returns of Horizontal Scaling

If 90% of your workload is parallelizable and you scale from 1 to 100 machines:

S(100) = 1 / (0.1 + 0.9/100) = 1 / 0.109 ≈ 9.17x

Going from 1 to 100 machines yields only about a 9x speedup. The 10% sequential fraction dominates: even with unlimited machines, the speedup cannot exceed 1 / (1 - p) = 10x.
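The same calculation can be repeated for other machine counts. The following is a small sketch of the formula above; the function name is illustrative, and p = 0.9 is the value from the worked example.

```python
def amdahl_speedup(p, n):
    """Speedup S(N) = 1 / ((1 - p) + p / N) for parallel fraction p and N machines."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


p = 0.9
for n in (1, 10, 100, 1000):
    print(f"N={n:>4}: speedup = {amdahl_speedup(p, n):.2f}x")

# As N grows, the speedup approaches the upper bound 1 / (1 - p).
print(f"upper bound: {1.0 / (1.0 - p):.2f}x")
```

The curve flattens quickly: most of the achievable speedup arrives with the first few dozen machines, and later additions contribute very little.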

Gustafson's Law

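Gustafson's Law offers a more optimistic view. It assumes that when you add resources you also grow the problem, so the parallel part of the work expands with the number of machines while the sequential part stays fixed: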

Gustafson's Law

$$S(N) = N - (1 - p)(N - 1)$$

Here,

$S(N)$ = Scaled speedup with $N$ machines
$p$ = Parallel fraction of the workload
$N$ = Number of machines (processors)
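To see how different the two views are, the sketch below evaluates both formulas for the same parallel fraction. The function names are illustrative, and p = 0.9 with 100 machines reuses the numbers from the Amdahl example.

```python
def amdahl_speedup(p, n):
    """Fixed problem size: S(N) = 1 / ((1 - p) + p / N)."""
    return 1.0 / ((1.0 - p) + p / n)


def gustafson_speedup(p, n):
    """Problem size grows with N: S(N) = N - (1 - p)(N - 1)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n - (1.0 - p) * (n - 1)


p, n = 0.9, 100
print(f"Amdahl    (fixed workload):  {amdahl_speedup(p, n):.2f}x")
print(f"Gustafson (scaled workload): {gustafson_speedup(p, n):.2f}x")
```

The two laws answer different questions. Amdahl's Law asks how much faster a fixed job finishes, while Gustafson's Law asks how much more work fits into the same time.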

Load Balancing

Load balancing distributes incoming requests across multiple servers to ensure no single server becomes a bottleneck.

Definition: Load Balancing

Load balancing is the process of distributing network traffic across multiple servers to ensure high availability and reliability. A load balancer sits between clients and servers, routing each request to the optimal server based on a distribution algorithm.

Common Load Balancing Algorithms

Algorithm	How It Works	Best For
Round Robin	Distributes requests sequentially	Servers with equal capacity
Weighted Round Robin	Distributes proportionally to weight	Servers with different capacities
Least Connections	Routes to server with fewest active connections	Long-lived connections
IP Hash	Maps client IP to a specific server	Session persistence needs
Least Response Time	Routes to server with lowest latency	Latency-sensitive workloads
Consistent Hashing	Maps requests to servers using hash ring	Minimizes redistribution on scaling
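Three of the selection rules in the table can be sketched in a few lines. This is a simplified illustration, not how any particular load balancer implements them; the server names, weights, and connection counts are hypothetical.

```python
import itertools


def round_robin(servers):
    """Return an iterator that cycles through the servers in a fixed order."""
    return itertools.cycle(servers)


def weighted_round_robin(weights):
    """Cycle through servers, repeating each one according to its integer weight."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_connections(active_connections):
    """Return the server that currently has the fewest active connections."""
    return min(active_connections, key=active_connections.get)


rr = round_robin(["a", "b", "c"])
rr_picks = [next(rr) for _ in range(5)]

wrr = weighted_round_robin({"big": 2, "small": 1})
wrr_picks = [next(wrr) for _ in range(6)]

lc_pick = least_connections({"a": 12, "b": 3, "c": 7})

print(rr_picks, wrr_picks, lc_pick)
```

The weighted variant here simply repeats each server by its weight, which sends short bursts to the heavier servers. Real load balancers often interleave the picks more smoothly.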

For stateless services, round-robin is often sufficient. For stateful services or when connection durations vary significantly, least-connections or consistent hashing are preferred.
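Consistent hashing is the least obvious of these, so a minimal hash ring sketch may help. It assumes SHA-256 as the hash function and 100 virtual nodes per server; both are illustrative choices, and the class and key names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node):
        # Each node is placed at several points to even out the distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # Find the first point at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["a", "b", "c"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}

ring.add("d")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"keys moved after adding a node: {len(moved)} of {len(keys)}")
```

Every key that moves lands on the new node, and the rest stay where they were. With plain modulo hashing, changing the server count would remap most keys instead.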

Capacity Planning

Capacity planning ensures your system can handle future load without over-provisioning.

Capacity Planning Formula

$$\text{Required Capacity} = \text{Peak QPS} \times (1 + \text{Safety Margin})$$

Here,

$\text{Peak QPS}$ = Maximum queries per second at peak load
$\text{Safety Margin}$ = Buffer for unexpected spikes, expressed as a fraction (typically 0.2 to 0.5, i.e. 20-50%)
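As a quick illustration of the formula, the sketch below evaluates it at both ends of the typical safety margin range. The function name and the peak of 1000 QPS are hypothetical.

```python
def required_capacity(peak_qps, safety_margin):
    """Required Capacity = Peak QPS x (1 + Safety Margin), margin given as a fraction."""
    if peak_qps < 0 or safety_margin < 0:
        raise ValueError("peak_qps and safety_margin must be non-negative")
    return peak_qps * (1.0 + safety_margin)


# Hypothetical peak of 1000 QPS with a 20% and a 50% safety margin.
for margin in (0.2, 0.5):
    print(f"margin {margin:.0%}: provision for {required_capacity(1000, margin):.0f} QPS")
```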

Step-by-Step Capacity Planning

Forecast demand: Estimate future traffic based on growth rates
Profile current system: Measure resource utilization at current load
Identify bottlenecks: Which resource (CPU, memory, I/O, network) is the limiting factor?
Calculate headroom: How much capacity is needed for the forecast?
Plan scaling triggers: At what utilization threshold should you scale?
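The last two steps can be turned into a rough calculation. The sketch below assumes that utilization grows linearly with load, which real systems only approximate. All names, the 70% target, and the measurements are hypothetical.

```python
import math


def per_server_capacity(measured_qps, measured_utilization, target_utilization):
    """Estimate the QPS one server sustains at the target utilization.

    Assumes utilization scales linearly with load (a simplification).
    """
    return measured_qps / measured_utilization * target_utilization


def servers_needed(forecast_peak_qps, safety_margin, capacity_per_server):
    """Round up the required capacity divided by per-server capacity."""
    required = forecast_peak_qps * (1.0 + safety_margin)
    return math.ceil(required / capacity_per_server)


def should_scale_out(current_utilization, threshold=0.7):
    """Scaling trigger: add capacity once utilization reaches the threshold."""
    return current_utilization >= threshold


# Hypothetical profile: one server handles 200 QPS at 50% CPU; target is 70% CPU.
capacity = per_server_capacity(200, 0.5, 0.7)
count = servers_needed(forecast_peak_qps=1500, safety_margin=0.3,
                       capacity_per_server=capacity)
print(f"per-server capacity: {capacity:.0f} QPS, servers needed: {count}")
print("scale out now?", should_scale_out(0.75))
```

Sizing against a target utilization below 100% leaves headroom on every server, so the trigger fires before any machine saturates.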

Capacity Planning for Growth

Current state:

10M daily active users
5 QPS average, 15 QPS peak
80% CPU utilization at peak

Projected (next year, 3x growth):

30M daily active users
15 QPS average, 45 QPS peak
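Required: 45 QPS peak × 1.3 safety = 58.5 QPS capacity

If the current load runs on a single server that reaches 80% CPU at its 15 QPS peak, and throughput scales roughly linearly with CPU, that server saturates at about 15 / 0.8 = 18.75 QPS. Serving 58.5 QPS then needs at least 58.5 / 18.75 ≈ 3.1, rounded up to 4 servers.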

Scaling Strategies

The Database Bottleneck

The most common scaling bottleneck is the database. Strategies include:

Read replicas: Offload read traffic to replica databases
Sharding: Partition data across multiple database instances
Caching: Store frequently accessed data in memory
Denormalization: Trade write complexity for read performance
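Two of these strategies, sharding and caching, can be sketched together. The example below uses in-memory dictionaries as stand-ins for database shards and for a cache. The class names, keys, and shard count are hypothetical, and hash-modulo routing is only one of several possible sharding schemes.

```python
import hashlib


class ShardedStore:
    """Routes each key to one of several shards using a stable hash."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def get(self, key):
        return self._shard_for(key).get(key)

    def put(self, key, value):
        self._shard_for(key)[key] = value


class CacheAside:
    """Cache-aside reads: check the cache first, fall back to the store on a miss."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def write(self, key, value):
        self.store.put(key, value)
        self.cache.pop(key, None)  # invalidate so the next read sees the new value


db = CacheAside(ShardedStore(shard_count=4))
db.write("user:1", {"name": "Ada"})
first = db.read("user:1")   # miss: loaded from the shard
second = db.read("user:1")  # hit: served from the cache
print(first, second, db.hits, db.misses)
```

Note that modulo-based routing remaps most keys when the shard count changes, so resharding needs a migration plan or a scheme such as consistent hashing.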

The Three-Layer Scaling Model

Practice Exercises

Estimation: A system currently handles 500 QPS with a single server at 70% CPU. If traffic grows to 2000 QPS, how many servers are needed? How does the answer change if Amdahl's Law applies with a 5% sequential fraction?
Design: Design a horizontally scalable web application architecture. Consider: how do you handle session state? How do you handle file uploads? How do you deploy new versions without downtime?
Trade-offs: Compare vertical and horizontal scaling for a relational database. When is each approach appropriate? What are the cost implications at 10x, 100x, and 1000x scale?
Analysis: Draw a decision tree for choosing a load balancing algorithm based on: server heterogeneity, connection duration variability, and session persistence requirements.

Key Takeaways:

Vertical scaling is simple but has a ceiling; horizontal scaling is complex but virtually unlimited
Amdahl's Law shows diminishing returns from adding more machines due to sequential overhead
Load balancing distributes traffic; the algorithm depends on server heterogeneity and workload characteristics
Capacity planning requires forecasting, profiling, and planning scaling triggers
The database is typically the first scaling bottleneck—use read replicas, sharding, and caching

What to Learn Next

-> Networking Fundamentals TCP/IP, HTTP, DNS, CDNs, and network latency.

-> API Design REST, GraphQL, gRPC, versioning, and rate limiting.

-> Databases SQL vs NoSQL, indexing, replication, and sharding.

-> Caching Strategies Redis, Memcached, cache invalidation, and write strategies.

-> Load Balancing Algorithms, health checks, and L4 vs L7.

-> CAP Theorem Consistency models, availability, and partition tolerance.
