Lesson 8 of 13

Grafana

Part of the Monitoring & Observability tutorial series.

What is Grafana?

If Prometheus is the engine that collects and stores metric data, Grafana is the dashboard.

Grafana is an open-source analytics and interactive visualization web application. It allows you to query, visualize, alert on, and understand your metrics no matter where they are stored.

While Grafana is most famously paired with Prometheus, it is completely backend-agnostic. You can connect it to Elasticsearch, InfluxDB, AWS CloudWatch, Google Cloud Monitoring, and even traditional SQL databases like PostgreSQL or MySQL.
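As a concrete illustration, a Prometheus data source can be wired up through Grafana's file-based provisioning mechanism rather than the UI (a minimal sketch; the file path and the Prometheus URL are assumptions about your setup):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
# Grafana reads this directory at startup and creates the data source automatically.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090    # assumed address of your Prometheus server
    isDefault: true
```

This keeps the data source definition in version control alongside the rest of your configuration, instead of living only in clicks.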


Creating Dashboards

A Grafana Dashboard is a collection of "Panels." Each panel visualizes a specific query, usually as a time-series line chart, a gauge, a stat box, or a heatmap.

The Power of Templating (Variables)

Instead of hardcoding a dashboard to monitor server-db-1, Grafana allows you to create Variables. You create a drop-down menu at the top of the dashboard containing all your servers.

When you select server-db-2 from the dropdown, every PromQL query running in every panel dynamically updates to filter by {instance="server-db-2"}.

This allows you to write one unified "Linux Node" dashboard that can monitor thousands of individual servers effortlessly.
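As a sketch, a CPU panel on such a dashboard might run a query like the one below, where $instance is the dashboard variable populated by the dropdown (the metric name assumes the Node Exporter is installed):

```promql
# Percentage of CPU time spent non-idle on the currently selected server
100 - (avg(rate(node_cpu_seconds_total{mode="idle", instance="$instance"}[5m])) * 100)
```

Switching the dropdown simply substitutes a different value for $instance in every panel.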

Grafana Dashboard Community

You rarely need to build complex dashboards from scratch! Because the monitoring stack is standardized across the industry, the community shares their work.

If you deploy the Node Exporter to monitor a Linux server, you can visit grafana.com/dashboards and search for "Node Exporter". You will find the "Node Exporter Full" dashboard (Dashboard ID: 1860).

By pasting that ID into your Grafana instance's import screen, you instantly get a massive, carefully crafted dashboard featuring CPU utilization bars, memory usage panels, and disk IOPS graphs, built and refined by the community.


Alerting: When Dashboards Aren't Enough

Dashboards are great for active investigations, but nobody gets paid to stare at a TV screen for 8 hours a day waiting for a line chart to turn red.

A professional observability stack relies heavily on Alerting.

With Alerting, you define rules: queries that Prometheus or Grafana evaluates continuously against your metrics. When an evaluation breaches its threshold, an alert fires.

The Alerting Philosophy

There is a massive difference between a symptom and a cause.

❌ Bad Alerting (Alerting on causes):

  • Alerting when CPU hits 95%.
  • Alerting when a single worker node crashes.

Why is this bad? If CPU hits 95%, but the website is still returning HTTP 200s and responding in 200ms, the end-users do not care! The system is handling the load. Waking up an engineer at 3 AM because a server worked hard is a great way to cause Alert Fatigue (when engineers start ignoring alarms because they cry wolf too often).

✅ Good Alerting (Alerting on Symptoms):

  • Alerting when the API response time exceeds 2 seconds for more than 5 minutes.
  • Alerting when the HTTP 5xx error rate exceeds 2% of total traffic.

Why is this good? We alert on things that actually impact the end-user (latency, errors, availability). If the API is slow, the engineer wakes up, looks at the dashboard, and then discovers the root cause was high CPU. Alert on the pain, debug the cause.
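The second "good" alert above translates almost directly into PromQL (a sketch: the metric name http_requests_total and its status label follow common instrumentation conventions, but depend on how your app actually exposes metrics):

```promql
# Ratio of 5xx responses to all responses over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
> 0.02
```

Note that this alerts on the error rate as a fraction of traffic, not an absolute count, so it behaves the same at low and high request volumes.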


How Alerting Works Technically

There are two primary ways the open-source stack handles alerting: Grafana Alerting and Prometheus AlertManager.

Grafana Alerting Framework

Grafana has a built-in alerting engine. You build a query visually in the UI, set a threshold (e.g., Query A is above 80), configure the evaluation interval (e.g., Evaluate every 1 minute for 5 minutes), and define the contact points (Slack, PagerDuty, Email).

Prometheus AlertManager

For cloud-native, heavily automated teams practicing "Configuration as Code", the preference is often Prometheus's native AlertManager.

You write alert rules in YAML files, version-controlled in Git alongside your infrastructure code:

groups:
- name: API_Alerts
  rules:
  - alert: HighErrorRate
    # The PromQL expression to evaluate
    expr: job:request_error_rate{job="my-app"} > 0.05
    # How long the state must be bad before firing
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High Request Error Rate on {{ $labels.instance }}"
      description: "App is returning >5% HTTP 500s for 5 minutes."

If the expr keeps returning results for 5 consecutive minutes (the for duration), Prometheus fires the alert and sends a JSON payload to the AlertManager component.

AlertManager is responsible for:

  1. Grouping & Deduplication: If 50 web servers crash simultaneously, AlertManager groups the alerts into one single notification and suppresses duplicates, so your pager rings once, not 50 times.
  2. Routing: Sending warnings to a #monitoring Slack channel, but escalating "Critical" severities directly to PagerDuty to wake up the on-call engineer.
  3. Silencing: Allowing an engineer to "mute" alerts temporarily while they fix the underlying database issue.
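Those three responsibilities map onto AlertManager's own YAML configuration (a minimal sketch; the receiver names, the Slack webhook, and the PagerDuty key are placeholders):

```yaml
# alertmanager.yml
route:
  receiver: slack-monitoring        # default route: everything goes to Slack
  group_by: ['alertname', 'job']    # grouping: many identical alerts -> one notification
  routes:
    - match:
        severity: critical          # routing: critical alerts page the on-call engineer
      receiver: pagerduty-oncall

receivers:
  - name: slack-monitoring
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/<your-webhook>'
        channel: '#monitoring'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<your-pagerduty-key>'
```

Silencing, by contrast, happens at runtime rather than in this file: engineers create temporary silences through the AlertManager web UI or the amtool CLI.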