Written: March 27, 2023
Edited: March 29, 2023
Many of the ideas here are taken from my experience (and mistakes) leading the oncall program for Twitter's web app at scale. Please feel free to disagree with this guide in certain places (which is good, as you should always question processes and try to find ways to improve them).
I will tweak the ideas described here throughout my career and use it as the baseline for oncall programs at my future companies.
If you're taking the time to read this, it probably means that you've been scorched a time or two by the fire of leading or participating in really tough incidents at work.
Perhaps your team has a good oncall program and you're looking to improve it. Perhaps your team doesn't have a dedicated oncall program yet.
Whatever the case, you're here for a reason. I'll make it worth your while.
In 2021, I was put in charge of leading the oncall program for all of Twitter.com, the Twitter web app (a ~30 engineer rotation at the time pulled from a team of ~120). Our engineers were oncall roughly 4 times per year (2 times as primary and 2 times as secondary, something we will discuss shortly).
Our engineers faced regressions from new feature or infra launches, problems caused by internal microservices failures that challenged the partition tolerance of Twitter, natural disasters affecting our data centers, and even wild curveballs like a DDoS attack out of mainland China (which is a story for another time).
Above all, our oncall engineers had to "keep the ship pointed forward in the storm" no matter what so that the web app was available for any of the hundreds of millions of active users every day who wanted to use it.
To say this was stressful for oncall engineers is an understatement.
In the following sections, I'll try to distill insights on how to prevent burnout for oncall engineers, provide good support, educate with proper onboarding, and build a culture that makes sure your team is never caught flat footed during an incident.
Let's get to it. 🔥
You might be surprised to see that this first section is more of an abstract discussion about what oncall even is. But some organizations seem to muddle what oncall should be responsible for and what their tasks should be, so this is a good place to start.
As a general rule, your oncall rotation should either be domain- or team-specific. There needs to be clear boundaries between what your oncall engineers are (and are not) responsible for.
Once those rules are established, it's important to make clear that "the buck stops with oncall" when it comes to emergencies. The oncall gets to make the call about if a rollback (deploying a prior version of the code), revert (reverting a problematic pull request), forward fix (deploying with a revert/code fix instead of a rollback), or other action is necessary. The only time this should be overruled is if management or executives want to make a decision on what happens in an incident.
In other words, your primary oncall is the captain in the room in any incident (I'll explain more on the concept of primary and secondary oncalls in a bit). One of the worst feelings as oncall is being responsible for everything (such as getting pages in the middle of the night) but not really being empowered to make final decisions on any of it. It creates a feeling of helplessness and resentment that will weigh on your oncall program like a stone sinking through water.
In addition to leading incidents and making decisions on them, oncall should be responsible for the following tasks:
- Leading all deploys on the team for the week
- Handling pages for the week
- Analyzing health trends on metrics/logging dashboards for the service(s) they're responsible for
- Triaging new errors that are reported by the support team or error monitoring services
- Creating (and working on) tickets for new errors or oncall-related tasks
- Training new oncall engineers who are shadowing the rotation
- Writing all postmortem documents for any incidents they lead
This seems like a lot, and that's because it is. Some of these tasks can be divvied up between primary and secondary oncall roles. What are those, you might ask?
Good question. Let me explain right now.
I've previously said that "the buck stops with oncall." However, what I really meant is that the buck stops with the primary oncall.
If you have a small engineering team (5-6 people), disregard everything that this section has to say (although I'll have a lot more to talk about with scope versus number of rotations soon). If this is you, your rotation is already so small that having a staggered primary/secondary oncall program would increase engineer burnout.
If that's you, go ahead and skip to the next section.
Okay. If you're still here, that means you should build two rotations on your team. Each rotation will have the same engineers in it, but their shifts should be staggered (so that an engineer is primary oncall and then secondary something like half a month later).
The primary oncall should get pages first, lead incidents, be point on training new oncall, and just generally be the one in charge of oncall for the entire week.
Secondary oncall should be back up (both with pages if primary misses them as well as with covering oncall if the primary has to be out). You can also choose to divide tasks like writing postmortem documents, traiging some errors, and other lower priority tasks to the secondary oncall.
Primaries should be expected to be within 15 minutes of their computer at all times in case they receive a page. Missing a page should be a big deal (and multiple instances of it should be brought up by the manager). As I've said, the buck stops with oncall. This is true of authority as well as responsibility.
Secondaries should be within 15-30 minutes of their laptop (unless they let their primary know they'll be out) in case something happens and a page gets escalated to them.
You should find some service that can page the oncall engineers. PagerDuty is popular at time of writing, but it can be expensive - there are other services out there like it, though. These apps should be given override ability on the engineers' phones so oncall engineers know when a problem is going down and they're being paged.
No matter what you do, these pages (which can come at any time of the day or night) will wear on your oncall engineers. That's why you have to be very rigorous about the way you put together a schedule for your oncall program.
As I mentioned above, you should make every attempt to place your engineers into a staggered primary/secondary schedule.
Let's dig into this scheduling process a bit more.
Oncall shifts should run 24/7 for a week straight. The specifics (like start and stop days/times) can be between you and your team. If you have enough engineers distributed across the globe, you could also experiment with having some engineers cover while others are asleep.
In an ideal world, your oncall rotations should be a minimum of 8 people and a maximum of around 25. The reason for this is simple - 8 engineers equals people having a primary oncall shift of once every 2 months and 25 engineers is a primary oncall shift around twice a year.
Any less than 8 engineers and you'll run a big risk of burning your oncall engineers out from being oncall all the time (and believe me, feeling like the world is on your shoulders multiple times a month is not fun); any more than 25 and you run the risk of having your engineers forget how to perform their oncall duties (although this can be mitigated with a strong secondary rotation where the secondary oncall engineer actually performs some duties - more on this soon).
Engineers who perform their primary oncall rotation duties should have a week or two off (at minimum) before they do their secondary oncall shift. Having someone oncall for two weeks in a row (primary and then secondary) where they're tied down to their laptop is fairly cruel. This can (rightfully) lead to anger, frustration, and even turnover if it continues despite complaints.
Add new engineers to the rotation at least several months out and give the entire team a heads up publicly when you do it. This helps mitigate problems with the schedule shifting. You don't want an enginer to find coverage for their shift only to have it shift around multiple times. When you update the schedule, tell people to review their future shifts immediately and to let you know if their newly scheduled times (almost a quarter out) don't work. In this instance, the right thing to do is for you to be the one to find them coverage.
Speaking of finding coverage on your schedule, oncall engineers should be responsible at all other times when they need swaps for weeks they can't work their oncall shifts. It should be a big deal if someone goes out on vacation and misses getting coverage for their shift.
The final piece of this scheduling puzzle is to do with trainings; you should make a third (and rarely-used) rotation for shadowing engineers. This component needs its own section (further below), so I'll end this one by saying that no team is complete unless people consider themselves simultaneously teachers and learners.
Before we get into a discussion of training your team, I'd like to momentarily discuss the tension with oncall schedules between being oncall for more scope and being oncall more frequently.
If you choose more scope, you'll be able to have a large roster of engineers who will be oncall less frequently. This would be on the high end of the roster size (>= 25 people) that I mentioned earlier and would equate to less oncall burnout and engineers not being tied down to their laptops 24/7 a lot of weeks every year (which can build frustration). However, the cons to this approach are that engineers may complain they're losing knowledge on how to properly perform their oncall duties given the large gap in between rotations (which is something that happened to my oncall team at Twitter).
In contrast, having a smaller rotation with much less scope will mean that your engineers get intimately familiar with the subject matter area they're responsible for and run no risk of forgetting it. They will also experience less stress from being responsible for covering a smaller domain while oncall. The main con for this approach is that your engineers will also be oncall more often; this could lead to burnout and prevent them from taking vacations or decompressing when needed from their normal (non-oncall) work.
You should avoid having both large scope and frequent oncall as this is the worst of both worlds (although you may be forced into it - small startups and big corporations will have to make very different trade-offs here).
It's important to note that you will never reach perfection in this area. You will have people on your team who want more oncall shifts with reduced scope and people who would rather have less shifts and more scope. On top of all of this, you may build up your rotation only to have some unexpected turnover that momentarily decreases the size of your rotation no matter what you did.
Perfection is not possible; instead, focus on the moving target I just described and do your best to design your rotation (more scope, more rotations, etc.) for what the current needs of your company dictate.
The important part is to pick a direction with your team and stick to it (at least until it becomes obvious a change is needed).
I've previously mentioned that you should create a "shadow" rotation. Let's expound on that (and other training processes) right now.
I would recommend not adding an engineer to your oncall rotation unless they have met the following qualifications (roughly):
- They are a senior (or higher) engineer and have been on the team for >= 3 months
- They are a junior/mid-level engineer and have been on the team for >= 6 months
This allows these new team members to get used to the codebase, the way your architecture is set up, and the organization itself. However (and as I've mentioned all along), this will need to be shorter if your team is quite small.
Once you've deemed an engineer ready to go into the rotation, you should either meet with them in person/over video chat or have some training materials ready for them that do the following:
- Give a breakdown of the philosophy of oncall at your company
- Describe what primary and secondary oncall are at your company
- Describe what the tasks are for engineers while they're oncall (and the breakdowns for primary/secondary)
The engineer should then immediately go into the shadow rotation. You should link them up with the primary engineers for the weeks they will be shadowing and make introductions (if they don't already know each other).
I would recommend you put your new oncall engineer through 2 weeks of the shadow rotation. In the first week, they should merely shadow and be a fly on the wall in all incidents, investigations, deploys, rollbacks, and anything else the primary engineer does.
In the second week of shadowing, you should flip the schedule so the shadowing engineer is "primary" from 9 AM to 5 PM their local time. They should receive all pages and otherwise function as a primary engineer, but they should have the full support of the training engineer (the "real" primary engineer) in case they have questions. You will need to temporarily modify the paging schedule for this week so the shadowing engineer receives pages during 9 AM to 5 PM their local time.
Once an engineer has completed this training, you should view them as ready to go into the full rotation. Place them in the primary and secondary schedules while respecting the issues with adjusting a schedule that I mentioned previously.
As you bring on new engineers, it will become more-and-more important to have a culture where oncall engineers are properly supported by their team.
In the TV show Avatar: The Last Airbender, there is a great character named Uncle Iroh. One of the surprising skills he reveals during the show is that he can redirect lightning that is fired at him.
Your team should view the oncall engineers (and particularly the primary oncall for each week) as mainly responsible for redirecting lightning in times of crisis. This means that they need to find the right fix, the right people to work on it, and the right short-term mitigation techniques for an issue (which may be a rollback, quick patch, revert of code, etc.). The buck does stop with oncall, but the team needs to be ready to chip in when required.
In other words, work hard to develop a team mentality of "it's good to help oncall."
If oncall is stumped (since no one can possibly know everything about this field) on a high priority issue they got a page about, it may warrant manually paging in other key team members for fixing the problem (such as with a page in the middle of the night when no one else is online yet). You should advise your oncall engineers in their runbooks and oncall training that this should only be used under extreme circumstances (more on runbooks in just a second).
However, oncall should always be able to pull in whatever resources they need to get a job done (depending on the severity of the problem). If the ship is headed in the wrong way and it's high priority, oncall should be able to ask for (and should get) the help they need in order to right the course.
Everybody forgets things, so it's crucial to develop great runbooks that are easy to flip through in the heat of the moment. These will be useful for both new engineers and also oncall veterans who may not have had a shift in a while.
It's also important to note here that these runbooks should never be self-hosted. You should use Google docs, Notion, or some other provider for them. The reason here is simple - the last thing you want in the middle of a very high severity incident (such as the entire company architecture being down) is for the runbooks to be down as well because you decided to run this documentation off the same servers.
In addition, these runbooks should be designed for speed. They should be written by recognizing they might be flipped through under times of extreme stress. The last thing an oncall needs during a crucial incident is to feel like they're combing the secret dusty vaults of a lost civilization while trying to decipher what previous engineers wrote down.
I would advise that your runbooks should include (at a minimum):
- How to deploy builds
- How to rollback builds
- If the team uses a GUI for the above, a guide on manual processes via CLI for them
- General oncall processes (such as what we've been discussing in this article)
- A breakdown of company structure (which other teams/oncalls are responsible for what parts of the company's architecture)
- How to use your paging system (and how to perform manual pages)
- How to schedule a postmortem (assuming your team does them for SOC II compliance)
- How to create a postmortem document for incidents and what to do with it
Try to keep a flattened layout structure for this documentation, and try to favor a bigger initial main table of contents over lots of pages with nested sub-categories (which is harder to search in the heat of the moment).
As time goes on, you and your oncall team will need to reevaluate these runbooks in order to add new ones, update old ones, and keep an FAQ of edge cases that occur for your oncall program.
In any case, please make sure to empower your oncall engineers to make active modifications to these runbooks when they deem it necessary. Your team should view these as living documents; if you take it entirely on yourself (or a specific static group) to be the gatekeepers of periodically updating this documentation, you will almost certainly find them frequently falling out of use by oncall because the information in them is stale.
There is a lot more I could say here about runbooks, but most of it would be wasted. Anything worth talking about from this point forward will revolve around your company's specific needs, culture, and codebase. Just follow the north star of making documentation quick, informative, and easy to parse through for your oncall engineers, and your runbooks will likely be quite useful.
If there is one section in this article that will be contentious, it's what should qualify as a page (or an incident) for your oncall engineers.
In addition to clearly delineating boundaries for what your oncall engineers are responsible for, you should also attempt to define levels of severity for pages. There is nothing more crushing for the soul than multiple pages in the middle of the night for what is the equivalent of a light warning, especially if the oncall engineer already had a busy schedule of big incidents the day before.
A rough placement of alerts into low, medium, and high severity is generally appropriate. You can experiment with having all alerts page during normal business hours and only high ones page in the middle of the night. You can also try having low priority alerts log to an alert channel somewhere (although this is only worth doing if people actually read it). Whatever you decide on, you must have an agreement with your oncall rotations that pages will not be ignored.
Ignoring pages points to bad alert setups, ambivalent oncall engineers, or both. If pages for a certain error are repeatedly ignored, I would suggest addressing the situation with oncall engineers but also examining if the page should be kept around.
I would also recommend that you avoid wiring up alerts that depend on a certain level of traffic to validate data. If not, your traffic levels could drop due to normal workday/weekend fluctuations and an alert will be fired off to trigger a page.
For non-alert-reported incidents (in other words, issues that are reported by employees or your users), this will fall to primary oncall to decide if an incident is worth calling. I would emphasize that this should be their call. Your oncall engineers will hone this sense over time which will also make them much better engineers in general. A little trial by fire never hurt anyone and is generally how people learn very quickly (whether they want to admit it or not).
The bar for what qualifies as an incident will be different for every organization. I would urge you to have your oncall engineers always err on the side of caution and to prioritize immediate fixes for a problem. However, you will only be able to have your oncall engineers enforce a quality bar that management supports. Should oncall call an incident for UI regressions? Does it need to be a deeper infrastructure issue to warrant a full blown incident? These sorts of questions should be decided ahead of time and examples should in your training materials and runbooks.
Once a forward fix/revert/rollback has happened, your team (including your oncall engineers) can focus on the long-term solution.
At the end of every week, the cycle will reset and your new primary and secondary oncall engineers will take over. I would highly recommend that you do an in-person oncall handoff meeting. You may have people push back on this, but I truly believe this is crucial.
All the video recordings and notes in the world can't replace being in a room (Zoom or real life) with someone and being able to pepper them with questions about what happened last week, why it happened, and what problems are still ongoing. This meeting should be a mind meld of information from the departing oncalls to the incoming ones.
A good process here will mean that ongoing incidents have a smooth handoff to an incoming oncall who will then take it over and run with it. This also ensures that you don't end up with oncalls who are working on oncall tasks for weeks after their shift ended because your company didn't have a good handoff process.
I would recommend that you spend no more than 30 minutes in a meeting like this and end immediately once the communication is done and the incoming primary and secondary oncalls have no more questions.
You should take everything we just discussed in this article as a foundation to start your own oncall program.
Over time, you'll come up with new best practices of your own. When that happens, I hope you'll get back in touch with me so I can learn too.
As always, thanks for reading.