Going On-Call
On-call engineers are the first line of defense for any unplanned work, be it production issues or ad hoc support requests. Separating deep work from operational work lets the majority of the team focus on development while on-call engineers focus only on unpredictable operational issues and support tasks. Effective on-call engineers are prized by their teammates and managers, and they grow quickly from the relationship-building and learning opportunities that on-call rotations provide.
How On-Call Works
Note
Main areas: incidents and support requests
On-call developers rotate based on a schedule. The length of a rotation can be as short as a day, though more often it’s a week or two. Every qualified developer takes part in the rotation. Developers who are new to the team or lack necessary skills are often asked to “shadow” a few primary on-call rotations to learn the ropes.
Some schedules have a primary and a secondary on-call developer; the secondary acts as a backup when the primary is unavailable.
Most of an on-call’s time is spent fielding ad hoc support requests such as bug reports, questions about how their team’s software behaves, and usage questions.
However, every on-call will eventually be hit with an operational incident (critical problem with production software). An incident is reported to on-call by an alert from an automated monitoring system or by a support engineer who manually observes a problem. On-call developers must triage, mitigate, and resolve incidents.
“Paging” the on-call is an anachronism from before cell phones—these days, an alert is routed through channels such as chat, email, phone calls, or text messages.
All on-call rotations should begin and end with a handoff.
Important On-Call Skills
Make Yourself Available:
An on-call’s job is to respond to requests and alerts. Don’t ignore requests or try to hide. Expect to be interrupted, and accept that you can’t do as much deep work while on-call.
Figure out what on-call expectations are, and don’t get caught in a situation where you can’t respond.
A fast response is generally expected from the on-call engineer, but not necessarily a fast resolution.
For example: “I am currently assisting someone else; can I get back to you in 15 minutes?”
Pay Attention:
Information relevant to on-call work comes in through many channels: chat, email, phone calls, text messages, tickets, logs, metrics, monitoring tools, and even meetings.
Proactively read release notes and chat or email channels that list operational information like software deployments or configuration changes.
Keep operational dashboards up in the background or on a nearby TV so you can establish a baseline for normal behavior. When incidents do occur, you’ll be able to tell which graphs look odd.
Create a list of resources that you can rely on in an emergency: direct links to critical dashboards and runbooks for your services, instructions for accessing logs, important chat rooms, and troubleshooting guides.
Prioritize Work:
Keep a prioritized list of your tasks, and work your way down it from highest to lowest priority as tasks are finished or become blocked.
As you work, alerts will fire, and new questions will come in. Quickly triage the interruption: either set it aside or begin working on it if it’s an emergency.
If the new request is higher priority than your current task but isn’t critical, try to finish your current task, or at least get it to a good stopping point before context switching.
If you can’t tell how urgent a request is, ask what the impact of the request is. The impact will determine the priority. If you disagree with the requestor about an issue’s prioritization, discuss it with your manager.
Google Cloud’s support priority ladder offers one example of how priority levels may be defined (https://cloud.google.com/support/docs/best-practice#setting_the_priority_and_escalating/):
P1: Critical Impact—Service Unusable in Production
P2: High Impact—Service Use Severely Impaired
P3: Medium Impact—Service Use Partially Impaired
P4: Low Impact—Service Fully Usable
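As a rough sketch of triaging by impact, the snippet below maps a reported impact level to one of these priorities; the function name, the boolean flags, and the example are illustrative assumptions, not part of Google’s definitions.

    # Illustrative sketch: map reported impact to a P1-P4 priority.
    # The flags and mapping below are assumptions for illustration only.
    def triage(service_unusable: bool, severely_impaired: bool,
               partially_impaired: bool) -> str:
        """Return a priority label based on the reported impact."""
        if service_unusable:
            return "P1"  # critical impact: service unusable in production
        if severely_impaired:
            return "P2"  # high impact: service use severely impaired
        if partially_impaired:
            return "P3"  # medium impact: service use partially impaired
        return "P4"      # low impact: service fully usable

    # Example: one feature is degraded, but the service is still usable.
    print(triage(service_unusable=False, severely_impaired=False,
                 partially_impaired=True))  # -> P3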
Service level indicators, objectives, and agreements also help prioritize operational work.
Note
Service level indicators (SLIs) such as error rate, request latency, and requests per second are the easiest way to see if an application is healthy.
Service level objectives (SLOs) define SLI targets for healthy application behavior. If error rate is an SLI for an application, an SLO might be request error rate less than 0.001 percent.
Service level agreements (SLAs) are agreements about what happens when an SLO is missed. (Companies that violate SLAs with their customers usually need to return money and may even face contract termination.)
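To make the SLI/SLO distinction concrete, here is a minimal Python sketch that computes an error-rate SLI from request counts and checks it against the example SLO above; the request counts are made-up numbers for illustration.

    # Minimal sketch: compute an error-rate SLI and check it against an SLO.
    # The request counts below are made-up numbers for illustration.
    ERROR_RATE_SLO = 0.00001  # 0.001 percent, expressed as a fraction

    def error_rate_sli(error_count: int, request_count: int) -> float:
        """SLI: fraction of requests that failed."""
        return error_count / request_count if request_count else 0.0

    sli = error_rate_sli(error_count=25, request_count=1_000_000)
    if sli >= ERROR_RATE_SLO:
        print(f"SLO violated: error rate {sli:.4%} is not below {ERROR_RATE_SLO:.4%}")
    else:
        print(f"Within SLO: error rate {sli:.4%}")

In practice, the counts would come from the monitoring system’s metrics rather than hard-coded values.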
Communicate Clearly:
To communicate clearly, be polite, direct, responsive, and thorough.
Under a barrage of operational tasks and interruptions, developers get stressed and grumpy—it’s human nature. Be patient and polite when responding to support tasks. While it might be your 10th interruption of the day, it’s the requestor’s first interaction with you.
It can feel uncomfortable to be direct, but being direct doesn’t mean being rude. Brevity ensures that your communication is read and understood.
If you don’t know an answer, say so. If you do know the answer, speak up.
Respond to requests quickly. Responses don’t have to be solutions.
Post status updates periodically. Updates should include what you’ve found since your last update and what you’re planning on doing next. Every time you make an update, provide a new time estimate.
Track Your Work:
Write down what you’re doing as you work.
Each item that you work on while on-call should be in an issue tracker or the team’s on-call log.
Track progress as you work by writing updates in each ticket.
Include the final steps that mitigated or resolved the issue in the ticket so you’ll have the solution documented if the issue appears again.
Tracking progress reminds you where you left off when you come back to a ticket after an interruption.
The next on-call will be able to see the state of ongoing work by reading your issues, and anyone you ask for help can read the log to catch up.
Logged questions and incidents also create a searchable knowledge base that future on-calls can refer to. Don’t use chats for tracking work.
Close finished issues so dangling tickets don’t clutter on-call boards and skew on-call support metrics. Ask the requestor to confirm that their issue has been addressed before closing their ticket. If a requestor isn’t responding, say that you’re going to close the ticket in 24 hours due to lack of response; then do so.
Always include timestamps in your notes. Timestamps help operators correlate events across the system when debugging issues. Knowing that a service was restarted at 1 PM is useful when customers begin reporting latency at 1:05 PM.
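One lightweight way to get timestamps for free is to append notes through a small helper; the sketch below is one hypothetical approach (the log filename and the example entries are placeholders, not a team convention).

    # Minimal sketch: append timestamped entries to an on-call log file.
    # "oncall_log.txt" is a placeholder filename, not a team convention.
    from datetime import datetime, timezone

    def log_entry(note: str, path: str = "oncall_log.txt") -> None:
        """Append a note prefixed with a UTC timestamp."""
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        with open(path, "a", encoding="utf-8") as f:
            f.write(f"{stamp}  {note}\n")

    log_entry("Restarted checkout-service after memory alert")
    log_entry("Customers reporting elevated latency; investigating")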
Handling Incidents
Incident handling is an on-call’s most important responsibility.
Most developers think handling an incident is about fixing a production problem.
Resolving the problem is important, but in a critical incident, the top objective is to mitigate the impact of the problem and restore service (stop the bleeding).
The second objective is to capture information so you can later analyze how and why the problem happened.
Determining the cause of the incident, proving it to be the culprit, and fixing the underlying problem are only your third priority.
Apply these steps to handle incidents successfully.
Triage:
Determine a problem’s priority by looking at its impact.
If you’re having trouble determining issue severity, ask for help. Triage is not the time to prove you can figure things out on your own; time is of the essence.
Likewise, triage is not the time to troubleshoot problems. Your users will continue to suffer while you troubleshoot. Save troubleshooting for the mitigation and resolution phases.
Coordination:
Coordination starts by figuring out who’s in charge.
For lower-priority incidents, the on-call is in charge and will coordinate.
For larger incidents, an incident commander will take charge.
Commanders keep track of who is doing what and what the current state of the investigation is.
Once someone takes charge, all relevant parties must be notified of the incident.
Large incidents have war rooms to help with communication. War rooms are virtual or physical spaces used to coordinate incident response.
Track communication in written form in a central location: a ticketing system or chat.
Mitigation:
Your goal in the mitigation phase is to reduce the problem’s impact.
Mitigation isn’t about fixing the problem; it’s about reducing its severity. Incidents are commonly mitigated by rolling back a software release to a “last known good” version or by shifting traffic away from the problem. Depending on the situation, mitigation might involve turning off a feature flag, removing a machine from a pool, or rolling back a just-deployed service.
Ideally, the software you’re working with will have a runbook for the problem. Runbooks are predefined step-by-step instructions to mitigate common problems and perform actions such as restarts and rollbacks.
Once mitigated, the problem might be hard to reproduce.
Quickly saving telemetry data, stack traces, heap dumps, logs, and screenshots of dashboards will help with debugging and root-cause analysis later.
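As a sketch of what “quickly saving” evidence might look like, the snippet below copies a log file and a metrics snapshot into a timestamped directory; the paths and metric names are hypothetical.

    # Minimal sketch: copy diagnostic data into a timestamped directory before
    # it disappears. The log path and metric names are hypothetical.
    import json
    import shutil
    from datetime import datetime, timezone
    from pathlib import Path

    def snapshot_incident(log_path: str, metrics: dict,
                          out_root: str = "incident-data") -> Path:
        """Save a copy of a log file and a metrics snapshot for later analysis."""
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        out_dir = Path(out_root) / stamp
        out_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(log_path, out_dir / Path(log_path).name)
        (out_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
        return out_dir

    # Example (hypothetical path and readings):
    # snapshot_incident("/var/log/checkout-service.log",
    #                   {"error_rate": 0.002, "p99_latency_ms": 950})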
You’ll often find gaps in metrics, tooling, and configuration while trying to mitigate the problem. Important metrics might be missing, incorrect permissions might be granted, or systems might be misconfigured. Quickly write down any gaps that you find—anything that would have made your life better while troubleshooting. Open tickets during the follow-up phase to address these gaps.
Resolution:
Once mitigation is complete, the incident is no longer an emergency. You can take time to troubleshoot and resolve the underlying issues.
During the resolution phase, focus on the immediate technical problems.
Use the scientific method to troubleshoot technical problems.
Chapter 12 of Google’s Site Reliability Engineering book offers a hypothetico-deductive model of the scientific method.
Examine the problem, make a diagnosis, and then test and treat.
Once you have a clear view of the symptoms, diagnose the problem by looking for the causes. Diagnosis is a search, and like any search, you can use search algorithms to troubleshoot. For small problems, a linear search—examining components front to back—is fine. Use divide and conquer or a binary search (also called half-splitting) on bigger systems.
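To illustrate half-splitting, here is a small Python sketch that finds the first broken stage in a hypothetical request pipeline by probing the midpoint and discarding the half that checks out healthy; the stage names and health check are made up.

    # Minimal sketch of half-splitting: find the first broken stage in a
    # pipeline by probing the midpoint and discarding the healthy half.
    # It assumes that once one stage breaks, everything downstream of it
    # also looks unhealthy.
    from typing import Callable, Optional, Sequence

    def first_broken_stage(stages: Sequence[str],
                           is_healthy: Callable[[str], bool]) -> Optional[str]:
        lo, hi = 0, len(stages) - 1
        broken = None
        while lo <= hi:
            mid = (lo + hi) // 2
            if is_healthy(stages[mid]):
                lo = mid + 1            # problem is further downstream
            else:
                broken = stages[mid]    # candidate; look for an earlier break
                hi = mid - 1
        return broken

    stages = ["load-balancer", "frontend", "auth", "checkout", "payments", "database"]
    # Pretend everything from "checkout" onward looks unhealthy.
    print(first_broken_stage(stages, lambda s: stages.index(s) < 3))  # -> checkout

Each probe stands in for whatever check is fast and reliable in your system, such as glancing at a dashboard or sending a test request to that component.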
Follow-Up:
Incidents are a big deal, so they need follow-up.
The goal is to learn from the incident and to prevent it from happening again.
A postmortem document is written and reviewed, and tasks are opened to prevent recurrence.
The on-call engineer who dealt with the incident is responsible for drafting a postmortem document, which should capture what happened, what was learned, and what needs to be done to prevent the incident from happening again.
One good example is Atlassian’s postmortem template (https://www.atlassian.com/incident-management/postmortem/templates/).
A critical section of any postmortem document is the root-cause analysis (RCA). Root-cause analysis is commonly performed using the “five whys” technique: ask why the problem happened, then ask why that cause occurred, and keep asking until you reach an underlying cause worth fixing.
After a postmortem meeting, follow-up tasks must be completed. If tasks are assigned to you, work with your manager and the postmortem team to prioritize them properly. An incident can’t be closed until all remaining follow-up tasks have been finished.
Some teams even use old postmortem documents to simulate production issues and train new engineers. (A public collection of postmortems is available at https://github.com/danluu/post-mortems.)
Providing Support
Support requests follow a pretty standard flow.
When a request comes in, you should acknowledge that you’ve seen it and ask questions to make sure you understand the problem.
Once you’ve got a grasp on the problem, give a time estimate on the next update: “I’ll get back to you by 5 PM with an update.”
Next, start investigating, and update the requestor as you go.
Follow the same mitigation and resolution strategies that we outlined earlier.
When you think the issue is resolved, ask the requestor to confirm. Finally, close out the request.
Support can feel like a distraction, since your “real” job is programming. Think of support as an opportunity to learn. You’ll get to see how your team’s software is used in the real world and the ways in which it fails or confuses users.
Don’t Be a Hero
On-call activities can feel gratifying. Colleagues routinely thank you for helping them with issues, and managers praise efficient incident resolution.
However, doing too much can lead to burnout.
For some engineers, jumping into “firefighting” mode becomes a reflex as they become more experienced.
Talented firefighting engineers can be a godsend to a team: everyone knows that when things get tough, all they need to do is ask the firefighter, and they’ll fix it.
Firefighters who are pulled into every issue effectively become permanently on-call.
Firefighter engineers also struggle with their programming or design work because they are constantly being interrupted. And teams that rely on a firefighter won’t develop their own expertise and troubleshooting abilities.
If you feel that you are the only one who can fix a problem or that you are routinely involved in firefighting when not on-call, you might be becoming a “hero.” Talk to your manager or tech lead about ways to find better balance and get more people trained and available to step in.
Note
Sources:
The Missing README: A Guide for the New Software Engineer © 2021 by Chris Riccomini and Dmitriy Ryaboy, Chapter 9.