What is an Error Budget?
An Error Budget is the quantitative measure of the acceptable unreliability your service can tolerate over a specific time window.
It is derived directly from your SLO (Service Level Objective).
The Formula
If your Availability SLO is 99.9% (three nines), your Error Budget is the remaining 0.1%.
Error Budget = 100% - SLO
Over a 30-day window (43,200 minutes), a 99.9% SLO allows for 43,200 × 0.001 = 43.2 minutes of downtime. That is your "budget" to spend on risky activities.
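The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard library or tool: the function name `error_budget_minutes` and its parameters are assumptions introduced here for the example.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60        # 30 days -> 43,200 minutes
    budget_fraction = (100 - slo_percent) / 100   # Error Budget = 100% - SLO
    return window_minutes * budget_fraction


# Rounded to one decimal place to hide floating-point noise.
print(round(error_budget_minutes(99.9), 1))   # 43.2
```

Tightening the SLO shrinks the budget quickly: under the same assumptions, 99.99% over 30 days leaves roughly 4.3 minutes.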
Spending the Budget
In SRE, failure is not just an accident; it is a resource. You "spend" your error budget on things that might cause instability but provide value:
- New Feature Launches: Pushing code that has been tested but might have unknown production bugs.
- Infrastructure Changes: Upgrading your Kubernetes cluster or migrating a database.
- Chaos Engineering: Deliberately injecting failure to test resilience.
- System Failure: Unplanned outages or performance degradations. These provide no value, but they draw from the same budget whether you planned for them or not.
The Decision Maker: Frozen vs. Fast
The Error Budget acts as the ultimate objective referee in the conflict between Developers (speed) and SREs (stability).
Scenario A: Large Budget Remaining
If it's the 20th of the month and you have used only 5 minutes of your 43-minute budget, you are in high-confidence territory. Developers are encouraged to deploy as fast as they want. SREs might even run chaos experiments.
Scenario B: Budget Exhausted
If a catastrophic outage on the 5th of the month burns 45 minutes, more than the entire 43.2-minute budget, you are in the red. The Policy: For the remainder of the month, all feature launches are frozen.
- Developers are no longer allowed to push new code to production.
- Instead, they are reassigned to help SREs write tests, build better monitoring, or optimize the bottleneck that caused the original outage.
This aligns incentives: if developers write buggy code that burns the budget, they lose the ability to ship features until they help make the system more reliable.
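The two scenarios reduce to a simple comparison between minutes spent and minutes allowed. The sketch below assumes the policy is a hard freeze once the budget reaches zero; the function name `release_policy` and the return values `"deploy"` and `"freeze"` are hypothetical, and real policies often add intermediate stages.

```python
def release_policy(budget_minutes: float, spent_minutes: float) -> str:
    """Return "deploy" while budget remains, "freeze" once it is exhausted.

    Hypothetical helper: assumes a single hard-freeze threshold at zero.
    """
    remaining = budget_minutes - spent_minutes
    if remaining <= 0:
        return "freeze"   # Scenario B: feature launches stop
    return "deploy"       # Scenario A: budget is available to spend


print(release_policy(43.2, 5))    # deploy
print(release_policy(43.2, 45))   # freeze
```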
Advanced Concept: Burn Rate
Simply knowing how much budget remains isn't enough for a professional SRE. You need to know how fast you are spending it. This is called the Burn Rate.
- Burn Rate 1.0: You are consuming the budget at exactly the rate that will leave you with zero at the end of 30 days. This is the highest rate you can sustain without running out early.
- Burn Rate 10.0: You are burning the budget 10x faster than you should. If this continues, you will run out of budget in 3 days.
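One common way to compute these numbers is to divide the observed error ratio by the error ratio the SLO allows. The sketch below assumes that definition and a full budget at the start of the window; `burn_rate` and `days_until_exhausted` are hypothetical helper names.

```python
def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed_error_ratio = (100 - slo_percent) / 100   # 99.9% -> 0.001
    return observed_error_ratio / allowed_error_ratio


def days_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_days / rate


# 1% of requests failing against a 99.9% SLO (0.1% allowed).
rate = burn_rate(0.01, 99.9)
print(round(rate, 2))                         # 10.0
print(round(days_until_exhausted(rate), 2))   # 3.0
```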
Burn Rate Alerting
Instead of alerting only when the site is already down (which is too late), SREs set alerts on the Burn Rate. If a tool like Prometheus detects a sustained burn rate of 14.4 over the last hour, then 2% of the 30-day budget (14.4 × 1 hour out of 720 hours) has been burned in just 60 minutes.
This represents a major incident, and the pagers should fire immediately, even if the service is still technically "up" but slightly degraded.
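The 2% figure follows from multiplying the burn rate by the share of the 30-day window that the alert window covers. The sketch below checks that arithmetic in plain Python. It is not a Prometheus rule, and `budget_consumed`, `should_page`, and the default threshold of 14.4 are assumptions chosen to match the example above.

```python
def budget_consumed(rate: float, alert_window_hours: float,
                    window_days: int = 30) -> float:
    """Fraction of the total budget burned during the alert window."""
    window_hours = window_days * 24               # 30 days -> 720 hours
    return rate * alert_window_hours / window_hours


def should_page(rate: float, threshold: float = 14.4) -> bool:
    """Page when the sustained burn rate reaches the threshold."""
    return rate >= threshold


# A burn rate of 14.4 sustained for one hour.
print(round(budget_consumed(14.4, 1) * 100, 2))   # 2.0 (percent)
print(should_page(14.4))                          # True
```

A burn rate of 1.0 over the same hour would consume only about 0.14% of the budget and would not page under this threshold.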
Summary: The Benefits of Error Budgets
- Removes Emotion: "I think it's too risky to deploy" becomes "The data shows we only have 2 minutes of budget left."
- Encourages Innovation: If you have 40 minutes of budget at the end of the month, you are encouraged to take big risks!
- Automates Priority: It automatically shifts the team's focus from "Features" to "Stability" based on mathematical data, not management pressure.