The Human Side of Observability
We have explored Metrics, Logs, and Traces. All of these generate beautiful graphs. However, an observability stack is ultimately useless if nobody is looking at it when a system fails.
Alerting is the bridge between the technical systems (Prometheus, Grafana) and the humans (the engineering team).
When a critical threshold is breached (e.g., "The payment API is returning 100% errors"), the system must find a human, wake them up if necessary, and assign them the responsibility of fixing it.
The individual assigned to receive these alerts for a given shift is "On-Call."
Alert Fatigue (The Silent Killer)
If you only remember one thing about Alerting, remember this: Over-alerting is much worse than under-alerting.
If an engineer's phone buzzes 50 times a day with minor warnings ("CPU hit 80%", "A single user had a login error", "Disk is at 60%"), the engineer will rapidly develop Alert Fatigue. Their brain will subconsciously classify the pager sound as "noise," and they will stop jumping to investigate. When the true critical alert finally triggers ("The entire database is destroyed"), they will ignore it, assuming it's just another false alarm.
The Rules of Good Alerting
To prevent Alert Fatigue, you must adhere rigidly to these rules:
- Alert on Symptoms, not Causes. Do not alert when the database CPU hits 90%. Alert when the user's API request takes longer than 3 seconds. The user doesn't care about CPU; the user cares about latency.
- Every Alert must be Actionable. If an alert fires, and the engineer logs on, looks at the dashboard, and says "Ah, it's fine, it'll fix itself," that alert should be explicitly deleted from the codebase. It is noise.
- Use Runbooks. Every single alert configuration should include a link to a wiki/document (a Runbook) that explains precisely why the alert fired and the top 3 troubleshooting steps to mitigate it.
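To make these rules concrete, here is a sketch of what a symptom-based Prometheus alerting rule might look like. The metric name `http_request_duration_seconds_bucket` and the runbook URL are illustrative placeholders; adapt them to your own instrumentation:

```yaml
groups:
  - name: api-symptoms
    rules:
      # Symptom, not cause: we alert on the latency the user experiences,
      # not on internal signals like CPU utilization.
      - alert: APIHighLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 3
        for: 5m              # must be sustained for 5 minutes, reducing noise
        labels:
          severity: page
        annotations:
          summary: "95th percentile API latency is above 3 seconds"
          runbook_url: "https://wiki.example.com/runbooks/api-high-latency"
```

Note the `runbook_url` annotation: tools that consume this alert can surface the link directly, so the engineer who is paged lands on the troubleshooting steps immediately.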
Tooling: PagerDuty and OpsGenie
Prometheus Alertmanager and Grafana can technically send emails or directly ping Slack channels. However, enterprise teams need far more sophisticated logic around who gets notified, how, and in what order.
- Who is on shift today? Is it John or Sarah?
- If John doesn't acknowledge the alert within 5 minutes, who is John's manager? (Escalation Policies).
- How do we bypass the "Do Not Disturb" mode on Sarah's iPhone to physically wake her up?
Tools like PagerDuty, OpsGenie, and VictorOps sit between monitoring tools and humans to solve this scheduling nightmare.
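The handoff between the two sides is usually just a webhook carrying JSON. As a rough illustration, a payload sent to PagerDuty's Events API v2 looks something like this (the integration key and field values are placeholders):

```json
{
  "routing_key": "YOUR_INTEGRATION_KEY",
  "event_action": "trigger",
  "payload": {
    "summary": "Checkout API error rate above 5%",
    "source": "grafana",
    "severity": "critical"
  }
}
```

Everything downstream of this payload, including schedules, escalations, and phone calls, is the paging tool's job.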
The Workflow:
- Grafana detects a >5% error rate and sends a JSON payload to PagerDuty.
- PagerDuty checks the developer schedule calendar. It realizes John is on-call.
- PagerDuty sends a push notification to John's phone.
- John ignores it (he is asleep).
- After 3 minutes, PagerDuty physically calls John's cell phone with an automated robotic voice. John acknowledges the alert on the phone keypad.
- PagerDuty syncs a message to the company Slack: "Incident #100 - High Error Rate. John has acknowledged."
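The escalation mechanics in the workflow above can be sketched as a small state machine. This is a hedged simulation, not PagerDuty's actual implementation; the names, timeout, and notification channels are all illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    summary: str
    acknowledged: bool = False
    notifications: list = field(default_factory=list)

def escalate(incident, on_call, manager, ack_after=None, timeout=3):
    """Notify the on-call engineer; if they do not acknowledge within
    `timeout` minutes, escalate with a phone call and loop in the manager.

    `ack_after` simulates how many minutes pass before the engineer
    acknowledges (None means they never do, e.g. they are asleep).
    """
    incident.notifications.append(("push", on_call))
    if ack_after is not None and ack_after <= timeout:
        incident.acknowledged = True
        return incident
    # No acknowledgement within the timeout: place an automated phone call,
    # then escalate to the next person in the policy.
    incident.notifications.append(("phone_call", on_call))
    incident.notifications.append(("phone_call", manager))
    return incident

# John sleeps through the push notification, so escalation kicks in.
inc = escalate(Incident("High error rate"), "john", "sarah", ack_after=None)
print(inc.notifications)
# [('push', 'john'), ('phone_call', 'john'), ('phone_call', 'sarah')]
```

The key design idea is that the escalation policy is data (who, in what order, with what timeouts), not code the engineer has to edit during an outage.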
Incident Response (The Blameless Culture)
When a massive outage occurs and the site goes down, it is classified as a "Major Incident".
Handling an incident requires strict protocol, often adapted from emergency services and firefighting models.
Roles During a Major Incident
- Incident Commander (IC): The person in charge of managing the crisis. They do NOT touch code or debug. Their job is communication, coordination, and deciding when to escalate.
- Operations/Resolvers: The Subject Matter Experts (SMEs). This could be a database engineer or the lead developer. They actively look at the code and the servers to fix the issue.
- Communications Lead: Someone assigned to write status updates for the public Twitter page or the internal company executives so they stop bothering the Operations team.
The Post-Mortem (Root Cause Analysis)
Once the fire is put out and the system is stable, the most important step in the DevOps lifecycle occurs: The Post-Mortem.
Within 48 hours, the engineering team must write a document detailing exactly what happened, a timeline of events, and the Action Items that will prevent it from ever happening again.
CRITICAL RULE: The Post-Mortem MUST BE BLAMELESS.
If an intern accidentally deleted the production database, the Post-Mortem should not state "Tim is an idiot and deleted the DB." If people are punished for mistakes, they will hide future mistakes to protect their jobs, destroying observability.
A blameless post-mortem asks: "Why did the system allow an intern to delete the database without three layers of authorization or a secondary backup check?"
You fix the system, not the human.