System Design Foundations
Scalability Fundamentals
Scalability is the ability of a system to handle increased load by adding resources. This guide covers the mathematical foundations, strategies, and trade-offs for building systems that grow gracefully with demand.
- Vertical Scaling — Upgrade the machine (bigger CPU, more RAM)
- Horizontal Scaling — Add more machines to distribute load
- Capacity Planning — Predict and prepare for future growth
Scale is not about handling today's traffic—it's about handling tomorrow's.
What Is Scalability?
Scalability is a system's ability to maintain or improve performance as load grows by adding resources (compute, memory, storage, network).
Definition: Scalability
Scalability is the capability of a system to handle a growing amount of work by adding resources. Under ideal (linear) scaling, doubling the resources doubles the throughput; most real systems fall somewhat short of that because of coordination overhead and contention. Scalability can be achieved vertically (scaling up) or horizontally (scaling out).
Vertical vs Horizontal Scaling
Definition: Vertical Scaling (Scale Up)
Vertical scaling increases the capacity of a single machine by adding more CPU, RAM, or storage. It is simpler to implement (no code changes needed) but has a hard ceiling: the largest machine available.
Definition: Horizontal Scaling (Scale Out)
Horizontal scaling distributes load across multiple machines. It requires the application to be stateless or to externalize state. Its theoretical ceiling is much higher than that of vertical scaling, but it introduces distributed systems complexity.
| Dimension | Vertical | Horizontal |
|---|---|---|
| Complexity | Low (no code changes) | High (distributed systems) |
| Ceiling | Limited by largest machine | Virtually unlimited |
| Cost | Grows super-linearly at the high end | Grows roughly linearly with machine count |
| Fault Tolerance | Single point of failure | Redundant |
| Latency | No inter-node communication | Network overhead |
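The requirement that a horizontally scaled application be stateless, or externalize its state, can be illustrated with a small sketch. `SessionStore` here is a hypothetical in-memory stand-in for a shared store (such as a cache or database), and `AppServer` and its method are illustrative names, not part of any framework.

```python
class SessionStore:
    """Hypothetical stand-in for a shared external store (e.g. a cache or database)."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """A stateless application server: it keeps no per-user data in its own memory."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # shared by every server instance

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})
        return cart


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

# Consecutive requests from one user may land on different servers.
server_a.add_to_cart("user-42", "book")
cart = server_b.add_to_cart("user-42", "pen")
print(cart)  # ['book', 'pen']
```

Because neither server holds the cart itself, any server can handle any request, which is what lets a load balancer add or remove machines freely.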
The Math of Scaling
Amdahl's Law and Scaling
When you add N machines, the theoretical maximum speedup is:
Amdahl's Law for N Machines
$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$
Here,
- $S(N)$ = Speedup with N machines
- $P$ = Fraction of workload that is parallelizable
- $1 - P$ = Sequential fraction (fixed overhead)
Diminishing Returns of Horizontal Scaling
If 90% of your workload is parallelizable and you scale from 1 to 100 machines:
S(100) = 1 / (0.1 + 0.9/100) = 1 / 0.109 ≈ 9.17x
Going from 1 to 100 machines yields only about a 9x speedup because the 10% sequential portion dominates. Even with unlimited machines, the speedup cannot exceed 1 / 0.1 = 10x.
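The calculation above can be reproduced for other machine counts with a few lines of Python. This is a minimal sketch; the function name `amdahl_speedup` and its parameter names are chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, n_machines):
    """Theoretical speedup under Amdahl's Law: 1 / ((1 - P) + P / N)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / n_machines)


# Speedup for a 90% parallelizable workload at increasing machine counts.
for n in (1, 10, 100, 1000):
    print(f"N={n:>4}: {amdahl_speedup(0.9, n):.2f}x")
```

The printed values flatten out as N grows, approaching but never reaching 1 / (1 - P).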
Gustafson's Law
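Gustafson's Law offers a more optimistic view: with more resources you can solve a larger problem in the same time, so the parallel part of the workload grows with N while the sequential part stays fixed.
Gustafson's Law
$$S(N) = (1 - P) + P \cdot N$$
Here,
- $S(N)$ = Scaled speedup with N machines
- $P$ = Parallel fraction of the workload
- $N$ = Number of processors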
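To see how the two laws differ, the sketch below evaluates both for the same parallel fraction. It assumes the usual statement of Gustafson's Law, S(N) = (1 - P) + P × N; the function names are illustrative.

```python
def gustafson_speedup(parallel_fraction, n_processors):
    """Scaled speedup under Gustafson's Law: (1 - P) + P * N."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return (1.0 - parallel_fraction) + parallel_fraction * n_processors


def amdahl_speedup(parallel_fraction, n_processors):
    """Fixed-size speedup under Amdahl's Law, included for comparison."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_processors)


for n in (1, 10, 100):
    print(
        f"N={n:>3}: Gustafson {gustafson_speedup(0.9, n):6.2f}x, "
        f"Amdahl {amdahl_speedup(0.9, n):5.2f}x"
    )
```

The gap comes from the assumption about the workload: Amdahl's Law holds the problem size fixed, while Gustafson's Law lets it grow with the number of processors.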
Load Balancing
Load balancing distributes incoming requests across multiple servers to ensure no single server becomes a bottleneck.
Definition: Load Balancing
Load balancing is the process of distributing network traffic across multiple servers to ensure high availability and reliability. A load balancer sits between clients and servers, routing each request to a server selected by a distribution algorithm.
Common Load Balancing Algorithms
| Algorithm | How It Works | Best For |
|---|---|---|
| Round Robin | Distributes requests sequentially | Servers with equal capacity |
| Weighted Round Robin | Distributes proportionally to weight | Servers with different capacities |
| Least Connections | Routes to server with fewest active connections | Long-lived connections |
| IP Hash | Maps client IP to a specific server | Session persistence needs |
| Least Response Time | Routes to server with lowest latency | Latency-sensitive workloads |
| Consistent Hashing | Maps requests to servers using hash ring | Minimizes redistribution on scaling |
For stateless services, round-robin is often sufficient. For stateful services or when connection durations vary significantly, least-connections or consistent hashing are preferred.
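Three of the algorithms in the table can be sketched in a few dozen lines. This is an illustrative sketch, not a production load balancer: the class names, the `pick` method, the server names, and the choice of SHA-256 with 100 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib
import itertools


class RoundRobin:
    """Cycle through servers in a fixed order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self, request_key=None):
        return next(self._cycle)


class LeastConnections:
    """Pick the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self, request_key=None):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


class ConsistentHashRing:
    """Map keys to servers on a hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) pairs
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, server):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{server}#{i}"), server))

    def pick(self, request_key):
        # First ring position clockwise from the key's hash, wrapping around.
        index = bisect.bisect_left(self._ring, (self._hash(request_key), ""))
        return self._ring[index % len(self._ring)][1]


servers = ["app-1", "app-2", "app-3"]

rr = RoundRobin(servers)
print([rr.pick() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']

lc = LeastConnections(servers)
first = lc.pick()
second = lc.pick()
lc.release(first)
print(lc.pick())  # app-1, because its earlier connection was released

ring = ConsistentHashRing(servers)
keys = [f"user-{i}" for i in range(1000)]
before = {key: ring.pick(key) for key in keys}
ring.add("app-4")
moved = [key for key in keys if ring.pick(key) != before[key]]
print(f"{len(moved)} of {len(keys)} keys moved after adding a server")
```

With consistent hashing, only the keys claimed by the new server change owner; a plain `hash(key) % server_count` scheme would remap most keys whenever the server count changes.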
Capacity Planning
Capacity planning ensures your system can handle future load without over-provisioning.
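Capacity Planning Formula
$$\text{Required Capacity} = \text{QPS}_{\text{peak}} \times (1 + \text{Safety Margin})$$
Here,
- $\text{QPS}_{\text{peak}}$ = Maximum queries per second at peak
- $\text{Safety Margin}$ = Buffer for unexpected spikes (typically 20-50%)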
Step-by-Step Capacity Planning
- Forecast demand: Estimate future traffic based on growth rates
- Profile current system: Measure resource utilization at current load
- Identify bottlenecks: Which resource (CPU, memory, I/O, network) is the limiting factor?
- Calculate headroom: How much capacity is needed for the forecast?
- Plan scaling triggers: At what utilization threshold should you scale?
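The steps above can be sketched as a small calculation. All input numbers below are hypothetical and chosen only for illustration, and the helper names are not from any library. The sketch also assumes that throughput scales linearly with CPU utilization and with server count, which real systems only approximate.

```python
import math


def per_server_capacity(measured_qps, cpu_utilization, target_utilization=1.0):
    """Estimate the QPS one server can sustain at the target CPU utilization.

    Assumes throughput scales linearly with CPU, which is a simplification.
    """
    return measured_qps / cpu_utilization * target_utilization


def servers_needed(forecast_peak_qps, safety_margin, capacity_per_server):
    """Servers required to serve the forecast peak plus a safety margin."""
    required_qps = forecast_peak_qps * (1.0 + safety_margin)
    return math.ceil(required_qps / capacity_per_server)


# Hypothetical inputs for illustration.
current_peak_qps = 400       # profiled on a single server
cpu_at_peak = 0.5            # CPU utilization observed at that peak
growth_factor = 3.0          # forecast demand
safety_margin = 0.25         # headroom for unexpected spikes
scale_out_threshold = 0.75   # scaling trigger: keep servers at or below 75% CPU

forecast_peak = current_peak_qps * growth_factor
capacity = per_server_capacity(current_peak_qps, cpu_at_peak, scale_out_threshold)
print(servers_needed(forecast_peak, safety_margin, capacity))  # 3
```

Here CPU is treated as the bottleneck resource; if profiling shows memory, I/O, or network saturating first, the same calculation applies with that resource's utilization instead.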
Capacity Planning for Growth
Current state:
- 10M daily active users
- 5 QPS average, 15 QPS peak
- 80% CPU utilization at peak
Projected (next year, 3x growth):
- 30M daily active users
- 15 QPS average, 45 QPS peak
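- Required: 45 QPS peak × 1.3 safety margin = 58.5 QPS of capacity
The current single server reaches 80% CPU at its 15 QPS peak, so it saturates at roughly 15 / 0.8 = 18.75 QPS. Serving 58.5 QPS therefore takes 58.5 / 18.75 ≈ 3.12, rounded up to at least 4 servers, assuming throughput scales linearly with CPU and with server count.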
Scaling Strategies
The Database Bottleneck
The most common scaling bottleneck is the database. Strategies include:
- Read replicas: Offload read traffic to replica databases
- Sharding: Partition data across multiple database instances
- Caching: Store frequently accessed data in memory
- Denormalization: Trade write complexity for read performance
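Three of these strategies (sharding, read replicas, and caching) can be illustrated with in-memory stand-ins. This is a toy sketch: the dictionaries stand in for real database instances, replication is shown as an immediate copy even though it is usually asynchronous, and all class and function names are hypothetical.

```python
import hashlib
import itertools


def shard_for(key, shard_count):
    """Sharding: map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


class ReplicatedDatabase:
    """Read replicas: writes go to the primary, reads rotate across replicas."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key, value):
        self.primary[key] = value
        # Simplification: real replication is usually asynchronous and may lag.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        return self.replicas[next(self._next_replica)].get(key)


class CachedReader:
    """Caching (cache-aside): check memory first, fall back to the database."""

    def __init__(self, database):
        self.database = database
        self.cache = {}
        self.database_reads = 0

    def get(self, key):
        if key not in self.cache:
            self.database_reads += 1
            self.cache[key] = self.database.read(key)
        return self.cache[key]


db = ReplicatedDatabase(replica_count=2)
db.write("user:1", {"name": "Ada"})

reader = CachedReader(db)
reader.get("user:1")
reader.get("user:1")
print(reader.database_reads)  # 1: the second read is served from the cache
print(shard_for("user:1", 4))
```

The sketch leaves out cache invalidation and replica lag, which are where most of the real complexity of these strategies lies.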
The Three-Layer Scaling Model
Practice Exercises
- Estimation: A system currently handles 500 QPS with a single server at 70% CPU. If traffic grows to 2000 QPS, how many servers are needed? What if Amdahl's law applies with 5% sequential overhead?
- Design: Design a horizontally scalable web application architecture. Consider: how do you handle session state? How do you handle file uploads? How do you deploy new versions without downtime?
- Trade-offs: Compare vertical and horizontal scaling for a relational database. When is each approach appropriate? What are the cost implications at 10x, 100x, and 1000x scale?
- Analysis: Draw a decision tree for choosing a load balancing algorithm based on: server heterogeneity, connection duration variability, and session persistence requirements.
Key Takeaways:
- Vertical scaling is simple but has a ceiling; horizontal scaling is complex but virtually unlimited
- Amdahl's Law shows diminishing returns from adding more machines due to sequential overhead
- Load balancing distributes traffic; the algorithm depends on server heterogeneity and workload characteristics
- Capacity planning requires forecasting, profiling, and planning scaling triggers
- The database is typically the first scaling bottleneck—use read replicas, sharding, and caching
What to Learn Next
- Networking Fundamentals: TCP/IP, HTTP, DNS, CDNs, and network latency.
- API Design: REST, GraphQL, gRPC, versioning, and rate limiting.
- Databases: SQL vs NoSQL, indexing, replication, and sharding.
- Caching Strategies: Redis, Memcached, cache invalidation, and write strategies.
- Load Balancing: Algorithms, health checks, and L4 vs L7.
- CAP Theorem: Consistency models, availability, and partition tolerance.