SLA, SLO, SLI - Monitoring & Observability

The Reliability Dilemma

In the world of DevOps and Site Reliability Engineering (SRE), there is constant tension between two groups:

Developers (Product): Want to push new features, code, and massive updates to production as fast as possible.
Operations (SRE): Want to lock down production, prevent all deployments, and maintain 100% stability.

If you push too fast, the site crashes. If you lock it down, the company loses to competitors. How do you find the objective, mathematical middle ground?

The solution is the triad of SLAs, SLOs, and SLIs, popularized by Google's SRE practice.

1. SLI (Service Level Indicator)

An SLI is a measurement of how your system is actually performing. It is a quantitative metric derived from monitoring data, not an opinion.

To calculate an SLI, you measure the ratio of "good" events divided by the "total" events, expressed as a percentage.

Common SLIs:

Availability: (Number of successful HTTP responses, e.g. 2xx) / (Total HTTP responses).
- Result: 99.95% of requests succeeded today.
Latency: (Number of requests that returned in under 400ms) / (Total requests).
- Result: 98% of requests were "fast" today.
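
The good-over-total ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring integration: the function name `sli` and the request counts are hypothetical, chosen so the results match the two example figures.

```python
def sli(good_events: int, total_events: int) -> float:
    """Return an SLI as a percentage: good events divided by total events."""
    if total_events == 0:
        # Assumption for this sketch: with no traffic there is nothing to
        # fail, so report 100%. Real systems may prefer "no data" instead.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical counts for one day of traffic.
availability = sli(good_events=199_900, total_events=200_000)
latency = sli(good_events=196_000, total_events=200_000)

print(f"Availability SLI: {availability:.2f}%")
print(f"Latency SLI: {latency:.2f}%")
```

In practice the two counts would come from your monitoring system rather than from literals.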

You must choose SLIs that directly translate to user happiness. (Hint: Users do not care about a "CPU Usage SLI").

2. SLO (Service Level Objective)

An SLO is your internal target for the SLI. It is the goal the engineering team collectively agrees the system must hit to keep users happy.

It combines an SLI with a target percentage over a specific time window.

Examples:

Availability SLO: 99.9% of HTTP requests must succeed over a rolling 30-day window.
Latency SLO: 95% of API requests must complete in under 500ms over a rolling 30-day window.
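
One way to picture an SLO is as a small record holding the three parts just described: the indicator, the target percentage, and the window. The sketch below assumes a hypothetical `SLO` class and made-up request counts. It only shows the comparison against the target and does not collect any real metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical container: an SLI name, a target, and a time window."""
    sli_name: str
    target_percent: float
    window_days: int

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Compare the measured SLI for the window against the target."""
        if total_events == 0:
            return True  # assumption: no traffic counts as meeting the SLO
        measured = 100.0 * good_events / total_events
        return measured >= self.target_percent


availability_slo = SLO("successful HTTP requests", 99.9, 30)
latency_slo = SLO("API requests under 500ms", 95.0, 30)

# Made-up counts for the last 30 days.
availability_met = availability_slo.is_met(999_500, 1_000_000)  # measured 99.95%
latency_met = latency_slo.is_met(940_000, 1_000_000)            # measured 94.00%

print(availability_met, latency_met)
```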

The Myth of 100% Reliability

The most important rule of SRE: Your SLO must NEVER be 100%.

If a system targets 100%, you can never update it, you can never take maintenance windows, and you will spend unbounded money chasing the impossible. Even Amazon and Netflix go down. An objective of 99.9% means you are explicitly allowed to be broken for about 43 minutes in every 30-day window (0.1% of 43,200 minutes is 43.2 minutes). This leeway is what lets you deploy new features safely.
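
The downtime allowance is simple arithmetic: the length of the window multiplied by the fraction the SLO leaves uncovered. A small sketch, with `allowed_downtime_minutes` as a hypothetical helper name, shows how quickly the allowance shrinks as nines are added:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits within the window."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

For 99.9% this gives roughly 43.2 minutes, and for 100% it gives zero, which is why a 100% target leaves no room for deployments at all.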

3. SLA (Service Level Agreement)

An SLA is an external, legal contract between your company and your paying customers. It details what happens if you fail to meet your targets.

SLAs are written by lawyers and business executives, not engineers.

Example SLA: "We guarantee 99.5% uptime per month. If we drop below this, we will refund you 10% of your monthly subscription fee for every hour of downtime."

The Golden Rule of SLAs and SLOs

Your internal engineering target (SLO) must always be stricter than your legal contract (SLA).

If your legal SLA is 99.5%, your engineering SLO should be set higher, for example 99.9%. This way, when the engineering team breaches its internal SLO, the pagers go off and the team scrambles to fix the system well before the company breaches the legal SLA and has to pay out customer refunds.
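
The gap between the two thresholds works as an early-warning zone. The sketch below uses the 99.9% and 99.5% figures from the example. The function name and the three status labels are hypothetical, and the measured values are invented for illustration.

```python
def reliability_status(measured_percent: float,
                       slo_percent: float = 99.9,
                       sla_percent: float = 99.5) -> str:
    """Classify a measured availability against the internal SLO and legal SLA."""
    if slo_percent <= sla_percent:
        # The golden rule: the internal target must be the stricter one.
        raise ValueError("SLO must be stricter than the SLA")
    if measured_percent < sla_percent:
        return "sla_breached"   # contract violated, refunds are owed
    if measured_percent < slo_percent:
        return "slo_breached"   # page engineers, SLA is still intact
    return "healthy"


for measured in (99.95, 99.7, 99.2):
    print(f"{measured}% -> {reliability_status(measured)}")
```

A measurement of 99.7% lands in the middle band: engineers get paged, but no customer is owed a refund yet.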

Error Budgets (The Negotiator)

How do you weaponize SLOs to solve the developer vs. operations conflict mentioned at the start?

You use Error Budgets.

Let's assume your database has a 30-day target SLO of 99.9% availability. Mathematically, this means you are explicitly allowed about 43 minutes of downtime in a 30-day window (0.1% of 43,200 minutes is 43.2 minutes). Those 43 minutes are your "Error Budget." It is a currency.

The Workflow:

The Developers roll out aggressive features early in the month.
A bad deployment crashes the database and burns 30 minutes of the downtime budget.
Later that week, a networking glitch drops the database for 15 minutes.
The Operations team looks at the math: 30m + 15m = 45m.
The Error Budget is exhausted (breached).
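
The workflow above can be sketched as a small ledger over a rolling 30-day window. The incident durations come from the example. The dates, the `incidents` list, and the `remaining_budget` helper are hypothetical, added only to show how old incidents age out of the window.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=30)
SLO_PERCENT = 99.9
# 43,200 minutes in the window, of which 0.1% may be spent on downtime.
BUDGET_MINUTES = WINDOW.total_seconds() / 60 * (100.0 - SLO_PERCENT) / 100.0

# (start time, minutes of downtime, cause); the dates are invented.
incidents = [
    (datetime(2024, 3, 4), 30.0, "bad deployment"),
    (datetime(2024, 3, 8), 15.0, "networking glitch"),
]


def remaining_budget(now: datetime) -> float:
    """Budget left after subtracting downtime inside the rolling window."""
    window_start = now - WINDOW
    spent = sum(minutes for started, minutes, _ in incidents
                if window_start < started <= now)
    return BUDGET_MINUTES - spent


after_both = remaining_budget(datetime(2024, 3, 10))    # both incidents count
after_aging_out = remaining_budget(datetime(2024, 4, 5))  # first incident expired

print(f"{after_both:.1f} minutes left on March 10")
print(f"{after_aging_out:.1f} minutes left on April 5")
```

A negative remainder means the budget is overspent: 45 minutes of downtime against an allowance of about 43.2.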

The Enforcement: When the Error Budget runs out, consequence policies automatically kick in. The rule might state: "If the Error Budget is exhausted, all feature deployments are frozen. Developers are only permitted to write tests, fix bugs, and optimize performance until the 30-day rolling window regenerates their budget allowance."
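
A freeze rule like this is easy to encode as a gate in a deployment pipeline. The following is a minimal sketch of that idea. The `ChangeType` categories and the `may_deploy` function are hypothetical names, and a real policy would be whatever your team has agreed to in writing.

```python
from enum import Enum


class ChangeType(Enum):
    FEATURE = "feature"
    BUG_FIX = "bug_fix"
    TEST = "test"
    PERFORMANCE = "performance"


# Work that stays permitted while the budget is exhausted, per the example rule.
ALLOWED_DURING_FREEZE = {ChangeType.BUG_FIX, ChangeType.TEST, ChangeType.PERFORMANCE}


def may_deploy(change: ChangeType, remaining_budget_minutes: float) -> bool:
    """Allow everything while budget remains; otherwise only stability work."""
    if remaining_budget_minutes > 0:
        return True
    return change in ALLOWED_DURING_FREEZE


print(may_deploy(ChangeType.FEATURE, remaining_budget_minutes=12.5))
print(may_deploy(ChangeType.FEATURE, remaining_budget_minutes=-1.8))
print(may_deploy(ChangeType.BUG_FIX, remaining_budget_minutes=-1.8))
```

Because the decision depends only on the remaining budget, neither team has to argue the case for or against a given release.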

The Error Budget is the ultimate, objective referee. It removes emotion. If the budget is full, deploy as fast as you want! If the budget is empty, system stability is the only priority.