GuideDevOps
Lesson 10 of 15

Toil Reduction

Part of the Site Reliability Engineering tutorial series.

What is Toil?

In the SRE world, Toil is a specific type of work. It’s not just "work I don't like." It is work that has the following five characteristics:

  1. Manual: Someone has to physically type a command or click a button.
  2. Repetitive: You've done this exact task 10 times before.
  3. Automatable: A machine could do it just as well (or better).
  4. Tactical: It solves a problem right now but doesn't provide long-term value.
  5. Scales Linearly: If your service grows 10x, the work takes 10x longer.

Examples of Toil vs. Engineering

Task | Is it Toil? | Why?
Resetting a crashed server | ✅ Yes | Repetitive manual work.
Writing a script to auto-reset servers | ❌ No | Engineering. It provides long-term value.
Updating a firewall rule manually | ✅ Yes | Tactical and manual.
Writing Terraform code for firewalls | ❌ No | Engineering. It creates a versioned record.
Performing a manual code deployment | ✅ Yes | If you follow a "checklist," it's toil.
Building a CI/CD pipeline | ❌ No | It eliminates the checklist forever.

The 50% Rule

Google's SRE organization has a well-known goal: an SRE should spend no more than 50% of their time on toil.

What happens to the other 50%?

The other half should be spent on engineering projects: work that reduces future toil, such as writing code to automate the task you just did by hand, or that improves the service itself.

Why does this rule exist?

If a service grows, the amount of toil naturally grows with it. If you spend 100% of your time on toil, you will eventually be buried. You'll stop having time to fix the root causes of outages, and the system will slowly degrade into a nightmare.

By capping toil at 50%, you make sure the team always has time to engineer their way out of the work.
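
To make the cap concrete, here is a minimal Python sketch that turns a week's logged hours into a toil percentage and checks it against the 50% ceiling. The `WeekLog` type, the function names, and the sample hours are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

TOIL_CAP = 0.50  # the 50% ceiling described above


@dataclass
class WeekLog:
    """Hypothetical record of one engineer's week."""
    toil_hours: float     # tickets, manual resets, checklist deployments
    project_hours: float  # automation and other engineering work


def toil_fraction(log: WeekLog) -> float:
    """Return the share of logged time spent on toil (0.0 to 1.0)."""
    total = log.toil_hours + log.project_hours
    if total == 0:
        return 0.0  # nothing logged, so nothing to report
    return log.toil_hours / total


def over_cap(log: WeekLog, cap: float = TOIL_CAP) -> bool:
    """True when toil takes up more of the week than the cap allows."""
    return toil_fraction(log) > cap


week = WeekLog(toil_hours=26, project_hours=14)  # made-up sample data
print(f"{toil_fraction(week):.0%} toil, over cap: {over_cap(week)}")
# 65% toil, over cap: True
```

A week that lands exactly on 50% is treated as within the cap here; a stricter team might prefer `>=`.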


The Automation ROI (Return on Investment)

SREs are careful about what they automate. You shouldn't spend 40 hours automating a task that only takes 5 minutes once a year.

The Golden Calculation:

  • How long does the task take manually? (e.g., 10 minutes)
  • How often does it happen? (e.g., 5 times a week)
  • Total Toil = 50 minutes / week.
  • Time to Automate = 5 hours.
  • Result: 5 hours is 300 minutes, and 300 / 50 = 6, so you "break even" in 6 weeks. After that, you save 50 minutes every week for as long as the task exists.
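
The same arithmetic can be written as a small Python function. This is only a sketch: the function name and parameters are hypothetical, and it ignores any time spent maintaining the automation afterwards.

```python
def break_even_weeks(task_minutes: float,
                     runs_per_week: float,
                     automation_hours: float) -> float:
    """Weeks until the time spent automating is paid back by toil saved."""
    weekly_toil_minutes = task_minutes * runs_per_week
    if weekly_toil_minutes <= 0:
        raise ValueError("the task must take some time and happen at least sometimes")
    return (automation_hours * 60) / weekly_toil_minutes


# The example above: 10 minutes, 5 times a week, 5 hours to automate.
print(break_even_weeks(task_minutes=10, runs_per_week=5, automation_hours=5))
# 6.0
```

Running it on the earlier counterexample (a 5-minute task done once a year, 40 hours to automate) gives a break-even point of roughly 480 years, which is why that task is not worth automating.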

How to Reduce Toil

  1. Track It: You can't fix what you don't measure. Keep a simple log of how many hours you spend on "tickets" vs. "projects."
  2. Identify Patterns: If three different people asked you to "refresh the staging cache" this week, that is a prime candidate for automation.
  3. Self-Service: Instead of you running the script for the developers, give the script to the developers (or build a Slack bot) so they can run it themselves.
  4. Eliminate, Don't Just Automate: Sometimes, the best way to reduce toil is to delete a feature that nobody uses but takes 2 hours of maintenance every week.
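
Steps 1 and 2 can start as a few lines of Python run over whatever request log you already keep. The sketch below counts repeated requests and flags the ones that cross a threshold; the function name, the sample log, and the threshold of three (borrowed from the staging cache example) are assumptions for illustration, not a standard.

```python
from collections import Counter


def automation_candidates(requests: list[str], min_count: int = 3) -> list[tuple[str, int]]:
    """Return (task, count) pairs for tasks requested at least min_count times."""
    # Normalize case and whitespace so near-identical requests are counted together.
    counts = Counter(task.strip().lower() for task in requests)
    return [(task, n) for task, n in counts.most_common() if n >= min_count]


# Made-up week of requests, one string per ticket or chat message.
week_log = [
    "Refresh the staging cache",
    "reset crashed server",
    "refresh the staging cache",
    "update firewall rule",
    "Refresh the staging cache ",
]

print(automation_candidates(week_log))
# [('refresh the staging cache', 3)]
```

Real request text is rarely this tidy, so in practice you would likely tag each ticket with a task type rather than match on free text.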

In the next section, we will look at how SREs handle the most stressful form of toil: The On-Call Shift.