Postmortems & Blameless Culture - Site Reliability Engineering

The SRE Postmortem Philosophy

In many traditional companies, when a major outage occurs, the first question leadership asks is: "Who caused this?"

In SRE culture, this is a forbidden question. Instead, we ask: "What was wrong with our systems that allowed this to happen?"

Why Blamelessness?

If engineers know they will be fired or punished for a mistake, they will hide it. They may lie about the timeline, delete logs, or stay silent. This destroys the very information the team needs to understand what happened.

If the culture is blameless, engineers are encouraged to be brutally honest: "I accidentally typed the wrong command because the terminal prompt doesn't clearly show which environment I'm in."

Now you have a System Problem to fix (an ambiguous terminal prompt) rather than a Human Problem to punish (the engineer who mistyped).
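
As a minimal sketch of what such a system fix might look like, the Python snippet below builds a prompt string that always names the environment and makes production stand out. The function name `build_prompt`, the environment names, and the `!!` marker are assumptions made for illustration; they do not come from any particular shell or tool.

```python
def build_prompt(environment: str, user: str = "engineer") -> str:
    """Return a shell-style prompt that always names the environment.

    Hypothetical helper: the names and format are illustrative only.
    """
    env = environment.strip().lower()
    if env in ("prod", "production"):
        # Make production impossible to overlook.
        return f"[!! PRODUCTION !!] {user}$ "
    return f"[{env}] {user}$ "


if __name__ == "__main__":
    print(build_prompt("staging"))
    print(build_prompt("production"))
```

The point of the design is that the environment is part of every prompt, so the engineer never has to remember which cluster a terminal is connected to.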

The "Five Whys" Technique

To find the root cause, SREs use the Five Whys. You start with the symptom and keep asking "Why?" until you reach a systemic flaw.

Example: The Database was Deleted.

Why was the DB deleted? Because an intern ran a DROP DATABASE command on production.
Why could an intern run that command? Because they had administrative access to the production cluster.
Why did they have administrative access? Because our onboarding script automatically grants "Admin" to all new hires.
Why does it grant Admin to everyone? Because we haven't implemented granular RBAC roles yet.
Why haven't we implemented RBAC? Because it wasn't prioritized in the last three quarters.

The Root Cause: Role-Based Access Control (RBAC) was never prioritized, so the onboarding script kept granting Admin to every new hire.
The Action Item: Implement RBAC. Firing the intern would not have fixed the missing access controls.
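
The chain above can also be recorded as data, which makes it easy to paste into a postmortem document. The following is a rough sketch, not a standard tool: the `Why` class and the `root_cause` helper are hypothetical names, and the answers are paraphrased from the example.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Treat the answer to the last 'Why?' as the root cause."""
    if not chain:
        raise ValueError("a Five Whys chain needs at least one question")
    return chain[-1].answer


chain = [
    Why("Why was the DB deleted?",
        "An intern ran a destructive command on production."),
    Why("Why could an intern run that command?",
        "They had administrative access to the production cluster."),
    Why("Why did they have administrative access?",
        "The onboarding script grants Admin to all new hires."),
    Why("Why does it grant Admin to everyone?",
        "Granular RBAC roles are not implemented yet."),
    Why("Why haven't we implemented RBAC?",
        "It wasn't prioritized in the last three quarters."),
]

if __name__ == "__main__":
    for depth, why in enumerate(chain, start=1):
        print(f"{depth}. {why.question} {why.answer}")
    print("Root cause:", root_cause(chain))
```

Note that the first answer names a person, while the last one names a process gap. Stopping at the first answer is exactly what the technique is meant to prevent.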

Anatomy of a Great Postmortem

A postmortem document is a permanent record of the incident. It should contain:

1. The Executive Summary

A 3-sentence summary of what happened, how long it lasted, and the business impact (e.g., "15% of users couldn't log in").

2. The Timeline

A timestamped, chronological account of the incident, precise to the minute.

12:01: Monitoring alert fires.
12:05: On-call engineer acknowledges.
12:15: Outage confirmed.
12:22: Rollback executed.
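
A timeline like this one also yields simple response metrics. The sketch below uses the four entries above and computes how many minutes after the first alert each event happened. It assumes all timestamps fall on the same day; `minutes_between` and `offsets_from_first_event` are hypothetical helper names.

```python
from datetime import datetime

TIMELINE = [
    ("12:01", "Monitoring alert fires."),
    ("12:05", "On-call engineer acknowledges."),
    ("12:15", "Outage confirmed."),
    ("12:22", "Rollback executed."),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def offsets_from_first_event(timeline):
    """Pair each event with its distance in minutes from the first entry."""
    first = timeline[0][0]
    return [(minutes_between(first, stamp), event) for stamp, event in timeline]


if __name__ == "__main__":
    for offset, event in offsets_from_first_event(TIMELINE):
        print(f"T+{offset:>2} min  {event}")
```

For this incident, the alert was acknowledged 4 minutes after it fired, and the rollback was executed 21 minutes after it fired.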

3. The Root Cause

A clear explanation of the systemic failure using the Five Whys.

4. Action Items (The "Fix-it" List)

Every postmortem must result in at least one concrete action item. Action items must be:

Prioritized: Must be fixed before the next feature deployment.
Assigned: To a specific person.
Tracked: In a ticket system like Jira or GitHub Issues.
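
The three requirements above can be checked mechanically before a postmortem is closed. This is a minimal sketch under stated assumptions: the `ActionItem` fields, the `problems` helper, and the ticket key `OPS-123` are all hypothetical, not part of Jira, GitHub Issues, or any other tracker's API.

```python
from dataclasses import dataclass


@dataclass
class ActionItem:
    title: str
    owner: str                        # a specific person, not a team
    ticket: str                       # key or URL in the ticket system
    blocks_next_feature_deploy: bool  # True if it is prioritized


def problems(item: ActionItem) -> list[str]:
    """Return the requirements this action item fails (empty if none)."""
    issues = []
    if not item.blocks_next_feature_deploy:
        issues.append("not prioritized ahead of the next feature deployment")
    if not item.owner.strip():
        issues.append("no owner assigned")
    if not item.ticket.strip():
        issues.append("no tracking ticket")
    return issues


if __name__ == "__main__":
    good = ActionItem("Implement RBAC", "alice", "OPS-123", True)
    vague = ActionItem("Be more careful", "", "", True)
    print(problems(good))   # no failed requirements
    print(problems(vague))  # missing owner and ticket
```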

Postmortem Best Practices

Schedule it Quickly: Hold the postmortem meeting within 48 to 72 hours of the incident while the details are still fresh.
Write it for Everyone: The document should be readable by both engineers and non-technical managers.
Celebrate the "Near Misses": If an automated system successfully caught a failure and prevented an outage, write a "Positive Postmortem" to celebrate the system's resilience!
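
The scheduling rule can be turned into a small date calculation. The sketch below reads "within 48 to 72 hours" as a 48-hour target and a 72-hour hard limit, which is one interpretation of the guideline; the function names and the example dates are hypothetical.

```python
from datetime import datetime, timedelta


def meeting_deadlines(incident_time: datetime) -> tuple[datetime, datetime]:
    """Return (target, latest) meeting times: 48 and 72 hours after the incident."""
    return (incident_time + timedelta(hours=48),
            incident_time + timedelta(hours=72))


def is_timely(incident_time: datetime, meeting_time: datetime) -> bool:
    """True if the meeting happens no later than 72 hours after the incident."""
    _, latest = meeting_deadlines(incident_time)
    return incident_time <= meeting_time <= latest


if __name__ == "__main__":
    incident = datetime(2024, 1, 1, 12, 1)  # illustrative date only
    target, latest = meeting_deadlines(incident)
    print("Aim for:", target)
    print("No later than:", latest)
```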

By treating every failure as a free lesson, the SRE team ensures that the system becomes progressively more reliable over time.