The Growth Trap
Every successful startup eventually hits a wall. If your user base doubles every month, but your database takes 6 weeks to upgrade, you are on a collision course with a total system outage.
Capacity Planning is the SRE practice of ensuring that the system always has enough "Headroom" to handle both predictable growth and unpredictable spikes, while keeping costs under control.
Key Metrics for Capacity
SREs don't just look at "CPU Usage." We look at Saturation.
Saturation is the measure of how full a specific resource is. Once a resource hits 100% saturation, everything behind it starts to queue up, and latency explodes.
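The "latency explodes" effect can be sketched with a basic queueing-theory approximation (the M/M/1 wait-time result, used here purely as an illustration, not as a model of any specific system):

```python
def latency_multiplier(saturation: float) -> float:
    """Approximate how much average latency inflates as a resource
    fills up, using the M/M/1 queueing result: wait ~ 1 / (1 - rho)."""
    if saturation >= 1.0:
        return float("inf")  # the queue grows without bound
    return 1.0 / (1.0 - saturation)

# Latency is 2x baseline at 50% saturation, 10x at 90%, ~100x at 99%:
for s in (0.50, 0.90, 0.99):
    print(f"{s:.0%} saturated -> {latency_multiplier(s):.0f}x baseline")
```

The shape is what matters: the curve is gentle until roughly 70% saturation, then turns nearly vertical, which is why SREs watch saturation rather than raw usage.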
The Four Main Constraints
- Compute (CPU): Can the processors handle the number of requests?
- Memory (RAM): Is there enough space to store temporary data without "swapping" to the slow disk?
- Storage (Disk/IOPS): Can the database write data fast enough?
- Network: Is the "pipe" big enough to move the data between microservices?
Measuring Headroom
Headroom is the difference between your total capacity and your peak usage.
$$\text{Headroom} = \text{Total Capacity} - \text{Peak Demand}$$
If your peak demand is 70% of your total capacity, you have 30% headroom.
- SRE Rule of Thumb: You should generally strive to maintain 30% to 50% headroom.
- Why? Because if one server dies, the remaining servers must instantly absorb its traffic. If a ten-server cluster is running at 90% capacity and one server fails, the nine survivors must each absorb a share of its load, pushing them to 100% and triggering a cascading failure.
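Both rules above reduce to a few lines of arithmetic. A minimal sketch (the server counts and utilization figures are hypothetical):

```python
def headroom_pct(total_capacity: float, peak_demand: float) -> float:
    """Headroom = Total Capacity - Peak Demand, expressed as a
    percentage of total capacity."""
    return (total_capacity - peak_demand) / total_capacity * 100

def survives_one_failure(servers: int, utilization: float) -> bool:
    """If one server dies, each survivor's load rises to
    utilization * N / (N - 1); it must stay below 100%."""
    return utilization * servers / (servers - 1) < 1.0

print(headroom_pct(100, 70))           # -> 30.0 (30% headroom)
print(survives_one_failure(10, 0.70))  # True: survivors land near 78%
print(survives_one_failure(10, 0.90))  # False: survivors hit 100%
```

The second check is the "N+1" question every capacity plan should answer before an incident answers it for you.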
Predicting the Future: Linear vs. Exponential
How many servers will we need in 6 months?
1. The Linear Model
If you add 1,000 new users every month, you can easily project your needs in a straight line.
2. The Step-Function Model
Sometimes growth is flat, then spikes (e.g., a Black Friday sale or a marketing campaign). SREs work closely with the Marketing team to "Proactively Scale" before the spike hits.
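For the linear case, the projection is simple enough to compute directly. A sketch with made-up numbers (`users_per_server` is whatever your load tests say one server can handle):

```python
import math

def projected_servers(current_users: int, monthly_growth: int,
                      months_ahead: int, users_per_server: int) -> int:
    """Linear model: users grow by a fixed amount each month, so
    capacity needs grow in a straight line too."""
    future_users = current_users + monthly_growth * months_ahead
    return math.ceil(future_users / users_per_server)

# 50,000 users today, +1,000 per month, 1,000 users per server:
print(projected_servers(50_000, 1_000, 6, 1_000))  # -> 56 servers
```

A step-function spike breaks this model entirely, which is why planned events call for deliberate, proactive scaling rather than trend extrapolation.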
Automation: Vertical vs. Horizontal Scaling
In the cloud era, capacity planning has shifted from "buying hardware" to "configuring software."
Vertical Scaling (Scaling Up)
Making the existing server bigger (adding more CPU/RAM).
- Pros: Extremely simple. No code changes required.
- Cons: There is a physical limit. You eventually run into the biggest server AWS offers.
Horizontal Scaling (Scaling Out)
Adding more servers of the same size.
- Pros: Theoretically infinite growth.
- Cons: Highly complex. Your application must be "Stateless" to handle traffic arriving at 50 different servers.
Auto-Scaling
Modern SREs use the HPA (Horizontal Pod Autoscaler) in Kubernetes. Instead of manually adding servers, we declare a target: "Keep average CPU across the Pods at 70%; if it rises above that, automatically add replicas until it falls back below the target."
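The HPA's core arithmetic is documented by Kubernetes: it scales the replica count toward the target utilization. The formula can be sketched as:

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_pct: float,
                     target_cpu_pct: float) -> int:
    """The HPA scaling rule from the Kubernetes docs:
    desired = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

# 10 Pods averaging 91% CPU against a 70% target:
print(desired_replicas(10, 91, 70))  # -> 13 Pods
```

Note the same rule also scales down: if average CPU drops to 35% against a 70% target, the desired count falls to 5.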
Cost Optimization: The "Cloud Tax"
A common mistake is Over-provisioning (running 100 servers when you only need 10). SREs are responsible for "Efficiency." We use Right-sizing to ensure we aren't wasting thousands of dollars on idle compute power.
We often use Spot Instances (cheap, temporary servers) for non-critical background tasks to save up to 90% on cloud costs.
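The savings are easy to estimate. A sketch with entirely hypothetical figures (real Spot discounts vary by instance type, region, and time):

```python
def monthly_spot_savings(on_demand_hourly: float, spot_discount: float,
                         instances: int, hours_per_month: int = 730) -> float:
    """Dollars saved per month by moving a fleet of interruptible
    background workers from on-demand to Spot pricing."""
    on_demand_cost = on_demand_hourly * instances * hours_per_month
    return on_demand_cost * spot_discount

# 20 batch workers at $0.10/hour on-demand, with a 90% Spot discount:
print(f"${monthly_spot_savings(0.10, 0.90, 20):,.2f} saved per month")
```

The catch, of course, is that Spot capacity can be reclaimed with little warning, which is why it belongs under non-critical, restartable workloads only.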
In the next section, we will look at the silent killer of SRE productivity: Toil.