GuideDevOps
Lesson 6 of 15

Error Budgets

Part of the Site Reliability Engineering tutorial series.

What is an Error Budget?

An Error Budget is the quantitative measure of the acceptable unreliability your service can tolerate over a specific time window.

It is derived directly from your SLO (Service Level Objective).

The Formula

If your Availability SLO is 99.9% (three nines), your Error Budget is the remaining 0.1%.

Error Budget = 100% - SLO

Over a 30-day window (43,200 minutes), a 99.9% SLO allows for exactly 43.2 minutes of downtime. That is your "budget" to spend on risky activities.
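The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name error_budget_minutes and its parameters are made up for this lesson and do not come from any particular SRE tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    Hypothetical helper for illustration: Error Budget = 100% - SLO,
    applied to the number of minutes in the window.
    """
    window_minutes = window_days * 24 * 60          # 30 days -> 43,200 minutes
    budget_fraction = (100.0 - slo_percent) / 100.0  # 99.9% SLO -> 0.001
    return window_minutes * budget_fraction


# Compare a few common SLO targets over a 30-day window.
for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% SLO -> {error_budget_minutes(slo):.2f} minutes per 30 days")
```

Each extra nine cuts the budget by a factor of ten, which is why the choice of SLO directly decides how much room there is for risky changes.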


Spending the Budget

In SRE, failure is not just an accident; it is a resource. You "spend" your error budget on things that might cause instability but provide value:

  • New Feature Launches: Pushing code that has been tested but might have unknown production bugs.
  • Infrastructure Changes: Upgrading your Kubernetes cluster or migrating a database.
  • Chaos Engineering: Deliberately injecting failure to test resilience.
  • System Failure: Unplanned outages or performance degradations. These provide no value, but they draw from the same budget.

The Decision Maker: Frozen vs. Fast

The Error Budget acts as the objective referee in the conflict between Developers (speed) and SREs (stability).

Scenario A: Large Budget Remaining

If it's the 20th of the month and you have used only 5 minutes of your 43-minute budget, you are in high-confidence territory. Developers are encouraged to deploy as fast as they want. SREs might even run chaos experiments.

Scenario B: Budget Exhausted

If a catastrophic outage on the 5th of the month burns 45 minutes, more than the entire 43.2-minute budget, you are in the red. The policy: for the remainder of the month, all feature launches are frozen.

  • Developers are no longer allowed to push new code to production.
  • Instead, they are reassigned to help SREs write tests, build better monitoring, or optimize the bottleneck that caused the original outage.

This aligns incentives: if developers write buggy code that burns the budget, they lose the ability to ship features until they help make the system more reliable.
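Scenario A and Scenario B can be expressed as a simple policy check. The sketch below assumes the only rule is "freeze once the budget is fully spent"; real policies are usually written down by the team and may include more conditions. The names BudgetStatus and release_policy are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    """Hypothetical snapshot of the error budget for the current window."""
    budget_minutes: float  # total budget for the window (43.2 for 99.9% over 30 days)
    spent_minutes: float   # downtime consumed so far

    @property
    def remaining_minutes(self) -> float:
        return self.budget_minutes - self.spent_minutes


def release_policy(status: BudgetStatus) -> str:
    """Return 'freeze' once the budget is exhausted, otherwise 'ship'."""
    return "freeze" if status.remaining_minutes <= 0 else "ship"


scenario_a = BudgetStatus(budget_minutes=43.2, spent_minutes=5.0)   # 20th of the month
scenario_b = BudgetStatus(budget_minutes=43.2, spent_minutes=45.0)  # outage on the 5th

print(release_policy(scenario_a))  # ship
print(release_policy(scenario_b))  # freeze
```

Because the decision depends only on the numbers, neither Developers nor SREs have to argue the case for or against a deploy.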


Advanced Concept: Burn Rate

Knowing how much budget remains isn't enough for a professional SRE. You also need to know how fast you are spending it. This is called the Burn Rate.

  • Burn Rate 1.0: You are consuming the budget at exactly the rate that will leave you with zero at the end of 30 days. (This is the break-even rate: sustainable, but with no margin left over.)
  • Burn Rate 10.0: You are burning the budget 10x faster than the break-even rate. If this continues, a full 30-day budget will be gone in 3 days (30 / 10).
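One common way to compute a burn rate is to divide the observed error ratio by the error ratio the SLO allows. The sketch below assumes that definition and a constant burn rate; the function names are hypothetical.

```python
def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    Assumes errors are measured as a fraction (0.01 means 1% of requests fail).
    """
    allowed_error_ratio = (100.0 - slo_percent) / 100.0
    return observed_error_ratio / allowed_error_ratio


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days for a full budget to run out if the burn rate stays constant."""
    return window_days / rate


# With a 99.9% SLO, 0.1% of requests may fail. A 1% failure rate is 10x that.
rate = burn_rate(observed_error_ratio=0.01, slo_percent=99.9)
print(f"burn rate: {rate:.1f}")
print(f"budget lasts: {days_until_exhausted(rate):.1f} days")
```

A burn rate below 1.0 means the service will finish the window with budget to spare.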

Burn Rate Alerting

Instead of alerting only once the site is already down (which is too late), SREs set alerts on the Burn Rate. If an alerting rule in a tool like Prometheus detects a sustained burn rate of 14.4 over the last hour, that means 2% of the 30-day budget (14.4 x 1 hour / 720 hours) has been burned in just 60 minutes.

This represents a major incident, and the pagers should fire immediately, even if the service is still technically "up" but slightly degraded.
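The relationship between the 14.4 threshold and the 2% figure can be checked with a short calculation. This is a sketch of the arithmetic only, not a Prometheus rule; the function names and the 720-hour (30-day) default are assumptions made for this example.

```python
SLO_WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day window


def budget_consumed_fraction(rate: float, alert_window_hours: float,
                             slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Fraction of the whole budget burned during the alert window."""
    return rate * alert_window_hours / slo_window_hours


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Burn rate that consumes budget_fraction within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(observed_rate: float, threshold: float) -> bool:
    """Page when the observed burn rate reaches the threshold."""
    return observed_rate >= threshold


threshold = burn_rate_threshold(budget_fraction=0.02, alert_window_hours=1)
print(f"threshold: {threshold:.1f}")
print(f"budget burned in 1 hour at 14.4: {budget_consumed_fraction(14.4, 1):.1%}")
print(should_page(observed_rate=16.0, threshold=threshold))
```

Working backwards from "how much budget am I willing to lose before someone is paged" gives a threshold that scales with the length of the alert window.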


Summary: The Benefits of Error Budgets

  1. Removes Emotion: "I think it's too risky to deploy" becomes "The data shows we only have 2 minutes of budget left."
  2. Encourages Innovation: If you have 40 minutes of budget at the end of the month, you are encouraged to take big risks!
  3. Automates Prioritization: It shifts the team's focus from "Features" to "Stability" based on measured data, not management pressure.