GuideDevOps
Lesson 7 of 15

Incident Management

Part of the Site Reliability Engineering tutorial series.

The Chaos of a Major Incident

When a critical system fails—the database goes read-only, or the payment gateway begins returning 100% errors—the engineering team naturally enters a state of panic. Developers start frantically checking their local machines, managers start asking "ETA?", and customer support is buried in angry emails.

Incident Management is the formal SRE process for converting this chaos into a structured, calm, and effective restoration of service.


The Incident Response Roles

During a major outage (Sev-1 or Sev-2), you must immediately appoint people to three strictly separated roles.

1. Incident Commander (IC)

The person in charge. They are the Dictator of the incident.

  • They do NOT touch code. (This is the most common mistake).
  • They coordinate the other responders, make the final decisions, and ensure the process is being followed.
  • If two engineers disagree on a fix, the IC breaks the tie.

2. Operations Lead (Ops)

The subject matter expert. They are the Fixer.

  • They are actually looking at the dashboards, logs, and terminal.
  • They propose mitigations ("Should we roll back?" or "Should we flush the cache?").
  • They execute the commands approved by the IC.

3. Communications Lead (Comms)

The Public Relations officer.

  • They are responsible for keeping the rest of the company (and the users) updated.
  • They write the StatusPage updates.
  • They keep managers and stakeholders out of the "War Room" so the IC and Ops can focus on the fix.
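
The three roles above can be modeled as a simple record that lives in the incident channel's topic. The Python sketch below is purely illustrative (the `Incident` class and its field names are not from any specific tool); it also shows an explicit IC handoff, which long-running incidents require:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative sketch: a minimal record of who holds each role.
@dataclass
class Incident:
    severity: int    # 1 = most severe
    commander: str   # IC: coordinates and decides, never types commands
    ops_lead: str    # Ops: hands on the keyboard
    comms_lead: str  # Comms: status page and stakeholder updates
    declared_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def handoff_commander(self, new_ic: str) -> str:
        """Long incidents need explicit, announced IC handoffs."""
        old_ic, self.commander = self.commander, new_ic
        return f"IC handoff: {old_ic} -> {new_ic}"

incident = Incident(severity=1, commander="alice",
                    ops_lead="bob", comms_lead="carol")
print(incident.handoff_commander("dave"))  # IC handoff: alice -> dave
```

Keeping the assignment explicit like this prevents the classic failure mode where everyone assumes someone else is in charge.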

The OODA Loop: A Framework for Response

SREs borrow the OODA Loop from military fighter pilots to manage fast-moving crises.

  1. Observe: Gather facts. What are the metrics saying? Is the error rate 5% or 100%? Are the logs showing "Connection Refused"?
  2. Orient: Contextualize the facts. Was there a deployment 10 minutes ago? Did AWS announce a regional outage?
  3. Decide: Pick a mitigation. Note: The goal of an incident is NOT to "fix it perfectly." The goal is to Restore Service as fast as possible.
  4. Act: Execute the decision. (e.g., Roll back the deployment).

Once you Act, you immediately return to Observe to see if it worked.
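
The loop above can be sketched in code. This is an illustrative Python skeleton (the metrics dict, the 1% threshold, and the mitigation names are all hypothetical), showing how each cycle ends in an action and then returns to observation:

```python
# Illustrative sketch of one pass through the OODA loop.
# observe(), the thresholds, and the mitigations are hypothetical.

def observe(metrics):
    """Observe: gather facts from monitoring (here, a fake metrics dict)."""
    return metrics["error_rate"], metrics["recent_deploy"]

def ooda_cycle(metrics, act):
    error_rate, recent_deploy = observe(metrics)               # 1. Observe
    if error_rate < 0.01:
        return "healthy"                                       # nothing to do
    cause = "deploy" if recent_deploy else "unknown"           # 2. Orient
    action = "rollback" if cause == "deploy" else "failover"   # 3. Decide
    act(action)                                                # 4. Act
    return action  # caller loops back to Observe to verify

actions = []
result = ooda_cycle({"error_rate": 0.42, "recent_deploy": True},
                    actions.append)
print(result)  # rollback
```

Note that the cycle picks the fastest plausible mitigation, not the most elegant fix; verification happens on the next Observe pass.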


Mitigation vs. Root Cause

There is a massive difference between Mitigating an incident and Fixing it.

Imagine a building is on fire because of a faulty microwave.

  • Mitigation: Spraying water on the fire and evacuating the residents. (Service is restored).
  • Root Cause: Investigating why the microwave's safety fuse didn't trip and redesigning the microwave manufacturing process. (Prevents future fires).

In an incident, you focus 100% on Mitigation. You do not care why the code is buggy while the site is down. You roll it back first, restore service, and investigate the "Why" tomorrow during the postmortem.
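
In code terms, the incident-time priority might look like this hypothetical Python sketch: restore service immediately, and file the "why" as a postmortem follow-up instead of debugging live (`rollback` and `file_followup` are illustrative stand-ins, not a real API):

```python
# Sketch: mitigate first, defer root-cause work to the postmortem.
# rollback() and file_followup() are hypothetical stand-ins.

followups = []

def rollback(deploy_id: str) -> str:
    """Stand-in for whatever restores service fastest."""
    return f"rolled back {deploy_id}"

def file_followup(summary: str) -> None:
    """The 'why' becomes a postmortem action item, not live debugging."""
    followups.append(summary)

def mitigate(deploy_id: str) -> str:
    status = rollback(deploy_id)  # restore service NOW
    file_followup(f"Investigate why {deploy_id} broke prod")  # tomorrow's work
    return status

print(mitigate("v2.4.1"))  # rolled back v2.4.1
```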


Best Practices for the "War Room"

  • Declare Early: It is better to declare a "Major Incident" and realize it was a false alarm 5 minutes later, than to wait 2 hours while the site is down because everyone was afraid to "bother" people.
  • Talk Out Loud: In the Slack channel or Zoom call, the Ops Lead should state exactly what they are doing: "I am now going to restart the Nginx service on Cluster A." This ensures the incident timeline gets documented accurately for the postmortem.
  • Force a Decision: Avoid "Analysis Paralysis." If the team has been debating a fix for 10 minutes, the IC should step in and force a decision.

In the next tutorial, we will learn how to handle the most important part of the incident: The Blameless Postmortem.