People and procedures are just as crucial as software in an IT setting. Site reliability engineering is a discipline that originated at Google. It applies software engineering practices to operations, with the aim of building more robust and scalable applications. This article covers the duties of a Site Reliability Engineer and the best practices for that position, such as incident response documentation, post-mortem analysis, and root-cause analysis.
What does a Site Reliability Engineer do?
Site reliability engineering is “what happens when a software engineer is tasked with what used to be considered operations,” as defined by Ben Treynor, founder of Google’s Site Reliability Team.
A Site Reliability Engineer is accountable for the uptime, performance, monitoring, and incident response of our company’s platforms and services.
Everything that enters production must first pass a series of checks to ensure that it meets basic standards: architecture diagrams, service dependencies, monitoring and logging strategies, backup procedures, and, where applicable, high availability configurations.
Hardware degradation, networking issues, excessive use of resources, or sluggish responses from our services can occur at any time, even if the software meets all of the requirements. It’s crucial that we be ready at all times to take swift and decisive action.
Our performance is evaluated using mean time to recovery (MTTR) and mean time to failure (MTTF). In other words, we need to restore service as quickly as possible (a low MTTR) and prevent a recurrence (a high MTTF).
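As a rough illustration of how these two metrics can be computed, here is a minimal sketch. It assumes each outage is recorded with a start and a restore timestamp; the `Outage` class, the function names, and the sample data are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    started: datetime   # when the service went down
    restored: datetime  # when service was restored


def mean_time_to_recovery(outages: list[Outage]) -> timedelta:
    """Average downtime per outage (requires at least one outage)."""
    total = sum((o.restored - o.started for o in outages), timedelta())
    return total / len(outages)


def mean_time_to_failure(outages: list[Outage]) -> timedelta:
    """Average uptime between one recovery and the next failure
    (requires at least two outages)."""
    ordered = sorted(outages, key=lambda o: o.started)
    uptimes = [nxt.started - prev.restored
               for prev, nxt in zip(ordered, ordered[1:])]
    return sum(uptimes, timedelta()) / len(uptimes)


# Hypothetical outage log, for illustration only.
outages = [
    Outage(datetime(2023, 1, 1, 3, 0), datetime(2023, 1, 1, 3, 30)),
    Outage(datetime(2023, 1, 11, 3, 30), datetime(2023, 1, 11, 4, 30)),
    Outage(datetime(2023, 1, 31, 4, 30), datetime(2023, 1, 31, 5, 0)),
]
mttr = mean_time_to_recovery(outages)
mttf = mean_time_to_failure(outages)
print("MTTR:", mttr)
print("MTTF:", mttf)
```

A shorter MTTR means faster recovery, while a longer MTTF means failures are further apart, so the two numbers should move in opposite directions as reliability improves.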
The importance of documenting incidents
To better understand the work of a Site Reliability Engineer, let’s examine a sample situation that might arise in the course of their work.
At three in the morning, you, as the on-call engineer, get an SMS warning that one of your platforms is down. Every second the platform is down, your company is losing money, so you get up, connect to the company VPN, and begin investigating the problem. This is simple to do if you are familiar with the downed service or platform. If you have been with the organisation for a while and have the necessary expertise, fixing it will be a breeze. The first steps and the overall plan should be clear to you, right?
If you are unfamiliar with the platform, or if it is not yours, this could be a problem.
You need to decide what action to take.
What is a runbook?
A runbook can be an invaluable tool in this case. A runbook is a set of procedures to follow or checks to perform when an issue arises with a service, application, or platform. It is written in the same format as other operating manuals. When a site reliability engineer is on call, it might mean the difference between minutes and hours: with a runbook outlining the steps to take, our service can be restored rapidly, or at least within a reasonable amount of time.
For software, a runbook is ideally written by the developers and followed by whoever is on call. This means the runbook needs to be in place before the new software is released. Naturally, this is the best-case scenario.
Custom-written runbooks can be used to troubleshoot infrastructure, hosts, and all deployed services and platforms. Each problem or service that needs human intervention can have its own set of procedures. Where feasible, we should turn the runbook into a script so that the system can fix itself automatically. If that is not feasible, we should write down everything that must be done when an alert is received. Either way, we still need to investigate and make the resulting adjustments to our service if we ever want to stop the problem from happening again.
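To make the idea of a scripted runbook more concrete, here is a minimal sketch. It assumes a runbook can be expressed as an ordered list of steps, each returning whether it succeeded; the `Step` class, the `run_runbook` function, and the example steps are hypothetical placeholders rather than a real tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_runbook(steps: list[Step]) -> tuple[bool, list[str]]:
    """Run the steps in order and stop at the first failure.

    Returns (resolved, log). The log doubles as a record of what was tried.
    """
    log: list[str] = []
    for step in steps:
        try:
            ok = bool(step.action())
        except Exception:
            ok = False  # an unexpected error counts as a failed step
        log.append(f"{'OK' if ok else 'FAILED'}: {step.description}")
        if not ok:
            log.append("ESCALATE: human intervention required")
            return False, log
    return True, log


# Hypothetical steps; real actions would call out to the actual systems.
steps = [
    Step("check disk space", lambda: True),
    Step("restart service", lambda: False),
    Step("verify health endpoint", lambda: True),
]
resolved, log = run_runbook(steps)
for line in log:
    print(line)
```

Stopping at the first failed step keeps the script from making things worse, and the returned log gives the on-call engineer a starting point when a human has to take over.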
Even if your company doesn’t use runbooks, every step taken while trying to solve an issue should be documented.
It’s helpful to keep track of what was done to fix a problem, even if that effort ultimately failed. You’ll be able to avoid repeating the same errors in the future if you do this.
Even if your company has several runbooks in place, you will inevitably face a scenario that wasn’t anticipated during planning and development, necessitating the creation of yet another document.
What is an incident response report?
To prevent a recurrence of an issue once it has been fixed, it is important to keep detailed records of what happened. The purpose of an incident response report is to document the entire process, from the initial actions taken to the final commands run (both successful and unsuccessful ones). What was your process like? Did you use a runbook, or did you have to figure it out on the fly? When did the incident begin? Did we get an alert, or did we find out there was a problem some other way? If we did receive an alert, how helpful was it in resolving the problem?
The incident response report should include not only a description of the incident but also a record of all communications related to its resolution.
Whom did we alert about the outage? What person or persons assisted us? How many people did this incident impact negatively? What was the extent of the problem, and how long had it been going on?
All of the steps we took to restore service must be carefully recorded. This data will help us get to the bottom of things. Once we have collected it and identified the underlying cause, we will be able to make or request the changes needed to make our platform more reliable. Over time, this can reduce the mean time to recovery (MTTR) and increase the mean time to failure (MTTF).
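One way to capture the questions above in a consistent shape is a small record type. The sketch below is only an assumption about what such a report might contain; the field names, the `IncidentReport` class, and the sample service name and command are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Action:
    at: datetime
    command: str
    succeeded: bool
    note: str = ""


@dataclass
class IncidentReport:
    service: str
    detected_at: datetime
    detected_via: str                    # e.g. "alert" or "customer report"
    runbook_used: Optional[str] = None   # None if we improvised
    people_notified: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def record(self, at: datetime, command: str, succeeded: bool,
               note: str = "") -> None:
        """Log every attempt, including the ones that did not work."""
        self.actions.append(Action(at, command, succeeded, note))

    def duration(self) -> Optional[timedelta]:
        if self.resolved_at is None:
            return None
        return self.resolved_at - self.detected_at

    def failed_attempts(self) -> list[Action]:
        return [a for a in self.actions if not a.succeeded]


# Hypothetical incident, for illustration only.
report = IncidentReport("example-api", datetime(2023, 5, 2, 3, 0), "alert")
report.people_notified.append("on-call manager")
report.record(datetime(2023, 5, 2, 3, 10), "systemctl restart example-api",
              False, "service crashed again after 30 seconds")
report.record(datetime(2023, 5, 2, 3, 25), "free disk space on /var",
              True, "logs had filled the volume")
report.resolved_at = datetime(2023, 5, 2, 3, 30)
print(report.duration(), len(report.failed_attempts()))
```

Keeping failed attempts alongside successful ones preserves the information needed later for root cause analysis.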
What is a postmortem report?
In the field of site reliability engineering, finding a solution is only half the battle. Our top priority is to prevent this from happening again. Thus, a root cause analysis is required. If we want to do this analysis right, we need all of the data that should be in a report about how the incident was handled.
Our postmortem report will include a timeline of all the steps we took to resolve the issue, a root cause analysis, corrective and preventative measures, and an explanation of how we got our service back up and running.
What worked and what didn’t in resolving the incident can be discussed or brainstormed as part of the root cause analysis.
To prevent or lessen the severity of future incidents, we will have to file tickets detailing the necessary corrective and preventative actions.
Finally, we’ll fill out the resolution and recovery section with technical details and perhaps some code snippets that we used to fix the issue.
Monitoring and alerting
As we saw in our example, an alarm sparked the action. Of course, a reliable monitoring system is essential for the existence of the alert.
Site reliability engineers rely heavily on monitoring and alert systems.
So that we always have a complete picture of our system’s status, it is essential that we track every metric possible within our platform. A monitoring strategy needs to be developed concurrently with the system architecture, or in tandem with each service that will be supported.
Typically, metrics are monitored, thresholds are defined for them, and alerts are triggered when a threshold is met or exceeded.
The takeaway here is to develop software that interprets alerts and repairs our system automatically, notifying us only when human intervention is required. The alerts themselves should link to all relevant documentation and runbooks, as well as to a detailed description of the service experiencing the problem.
Monitoring output generally falls into three categories:
- Alerts: signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets: signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging: this data is recorded for later use in diagnosis or investigation and is not publicly accessible. Logs aren’t expected to be read unless there’s a specific reason to.
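The three kinds of monitoring output just described (act now, act within a few days, record only) can be sketched as a simple threshold-based classifier. This is a minimal illustration under stated assumptions: the `Rule` fields, the `PAGE`/`TICKET`/`LOG` names, the threshold values, and the runbook URL are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    PAGE = "page"      # a human must act immediately
    TICKET = "ticket"  # a human must act, but within days is fine
    LOG = "log"        # recorded only, nobody is notified


@dataclass
class Rule:
    metric: str
    ticket_at: float  # at or above this value, open a ticket
    page_at: float    # at or above this value, page the on-call engineer
    runbook: str      # link attached to every notification


def classify(rule: Rule, value: float) -> Output:
    """Map a metric value to the least intrusive output that still fits."""
    if value >= rule.page_at:
        return Output.PAGE
    if value >= rule.ticket_at:
        return Output.TICKET
    return Output.LOG


def notification(rule: Rule, value: float) -> str:
    kind = classify(rule, value)
    return f"[{kind.value}] {rule.metric}={value} runbook: {rule.runbook}"


# Hypothetical rule: disk usage in percent.
disk = Rule("disk_used_percent", ticket_at=80, page_at=95,
            runbook="https://wiki.example.com/runbooks/disk-full")
for value in (42, 85, 97):
    print(notification(disk, value))
```

Only the `PAGE` outcome should wake anyone up. Sending the other two to a ticket queue and a log store keeps notifications limited to situations that need action.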
We don’t need to be notified every time a threshold is crossed, only when we need to take action to rectify the situation. In my opinion, email is not the best choice for an alerting channel. Put simply, it doesn’t work. There is a phenomenon called “alert fatigue,” in which people ignore email alerts because they receive so many of them and the vast majority are not emergencies. Because of this, we risk not only missing unimportant warnings but also failing to act when it is necessary. By the time we decide to take action, it may be too late.
Changes sometimes reach production without our knowledge or, worse, without adhering to the guidelines established for deployment. As a result, it’s crucial to have a well-established process for handling changes, and all developers should adhere to it. It is the responsibility of a site reliability engineer to establish such guidelines, develop the tools required to automate the processes, and make it possible to roll back or deploy new services with minimal disruption.
Checking that upcoming changes and new services meet certain criteria is an integral part of change management. Each change or new service must come with:
- Monitoring plan
- Alert runbooks
- Owners list
- High availability strategies
- Deployment and rollback processes
- Data retention and backups
With this documentation in place, we will be able to respond appropriately to potential failures.
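A checklist like the one above lends itself to an automated gate before deployment. The sketch below assumes each service ships a small manifest that points to the relevant documents; the key names, the `missing_items` function, and the example manifest are hypothetical.

```python
# Keys mirror the checklist; the exact names are an assumption.
REQUIRED_ITEMS = (
    "monitoring_plan",
    "alert_runbooks",
    "owners",
    "high_availability",
    "deploy_and_rollback",
    "retention_and_backups",
)


def missing_items(manifest: dict[str, str]) -> list[str]:
    """Return the required items that are absent or empty in the manifest."""
    return [item for item in REQUIRED_ITEMS
            if not manifest.get(item, "").strip()]


# Hypothetical manifest for a new service.
manifest = {
    "monitoring_plan": "docs/monitoring.md",
    "alert_runbooks": "docs/runbooks/",
    "owners": "team-payments",
    "high_availability": "",
    "deploy_and_rollback": "docs/deploy.md",
}
gaps = missing_items(manifest)
if gaps:
    print("Not ready for production, missing:", ", ".join(gaps))
else:
    print("All change-management criteria are documented.")
```

A check like this only confirms that the documents exist, not that they are any good, so it complements a human review rather than replacing it.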
The wise person will automate what they can, document what they can’t, and know the difference.
Keep meticulous records. This is a lesson I had to learn the hard way. My experience as a system administrator over the past five years has taught me the importance of thorough documentation; without it, I would have had a much harder time resolving the issues that arose. By adhering to the guidelines above, any site reliability engineer can make great strides toward a more stable platform and greatly reduce the need for a 3 a.m. wake-up call.