How to Build an Effective and Sustainable On-Call Schedule For Your Team

Written by nawazdhandala | Published 2020/09/25
Tech Story Tags: devops | devops-tools | sre | incident-management | incident-responsiveness | incident-response-plan | incident | on-call

TLDR A lot of tech companies struggle with creating an effective and efficient on-call schedule internally for their product and service, which results in longer downtimes when something goes wrong. With the evolution of DevOps, Software Developers now find themselves part of an on-call rotation. Having an on-call schedule for your team is an emergency last line of defense against downtimes. An effective schedule can help reduce friction and help keep your engineers happy. An organization should have a “No Downtime” engineering and ops process in place.

A lot of tech companies struggle with creating an effective and efficient on-call schedule internally for their product and service, which results in longer downtimes when something goes wrong. They often over-burden their team members with repeated on-call duty, resulting in team fatigue. Here’s how to create an on-call schedule that your team might just love.
On-call doesn’t have to suck the life out of your employees. There’s another side to it. A better one.
An on-call schedule ensures that someone competent is available to get services back up and running if they go down, so that customers don’t have trouble using your product or service. Though on-call isn’t a new concept in the world of DevOps and IT Ops, the execution and roles have greatly evolved over the years.

How Has On-Call Evolved Over the Years?

In the past, being on-call and resolving issues as they occur used to be the sole responsibility of Sysadmins and Operation Engineers. With the evolution of DevOps, Software Developers now find themselves part of an on-call rotation and this has worked well for most companies.
On-call schedules used to be created in spreadsheets (some teams still use this method) and communicated to the team without considering their specific availability. The person on-call had to be available at that time or day. The approach lacked flexibility: it was a nightmare to find a replacement if the person on-call had an emergency, and it was a hassle to find someone who could help if the person on-call wasn't able to resolve an issue on their own.
Thanks to ops platforms like Fyipe, which have a built-in on-call scheduling feature, we no longer have to create schedules in spreadsheets or manually inform the person on-call.
What still remains an issue, however, is the negative attitude towards being on-call. No-one wanted to be on-call then and no-one wants it now but it’s an absolute necessity.
Being on-call doesn’t have to suck! An effective on-call schedule can help reduce friction and help keep your engineers happy. Happy on-call team means happy customers!
The only way this is possible without draining your team is to ensure the schedule takes care of their work-life balance and doesn’t deplete any single engineer completely.

Why Do You Need to Have Someone On-Call?

Being on-call is the first step an organization takes towards improving its availability and reliability for its customers or users. On-call engineers are the last line of defense against customer-impacting outages and ensure that issues are resolved as quickly as possible. You need to be there when your customers need you. On-call ensures this.
If your team feels that being on-call sucks, they are responding negatively to a symptom.
The cause is less systemic and more a reflection of the team's or organization's basic engineering prowess.
An organization should have a “No Downtime” engineering and ops process in place. Having an on-call schedule for your team is an emergency last line of defense against downtimes.

Who Should Be Part of Your On-Call Team?

Here’s an interesting story. About 11 years ago, Google came up with a new strategy for production management. It realized that as R&D was pushing more and more features to production, Operation Engineers were having a tough time keeping production as stable as possible. The two teams were working in opposite directions, which led to increased tension due to their different skill sets, backgrounds, incentives, and metrics and ultimately resulted in a clash between them.
In order to bridge the gap between the two teams, Ben Treynor, one of Google’s ops leaders, thought of an innovative solution that led to the creation of a new team at Google called Site Reliability Engineering (SRE). The team was made up of 50% Sysadmins and 50% Software Engineers. This improved operational efficiency multi-fold.
Many companies have followed along similar lines and we have seen them succeed over the years with this strategy. It makes sense to include engineers who have worked on the code in the on-call team for the following reasons:
  1. They have a deeper understanding of the code or feature they have worked on, so if an issue is caused by code errors they are able to fix it faster.
  2. Engineers get exposure to ops processes as well. This gives them a holistic view of how a certain coding practice plays out in the production environment, which helps them produce better quality code.

How Should You Create an On-Call Schedule?

Creating a schedule for on-call rotation primarily depends on several factors:
  1. Team Size
  2. Geographical distribution of the team
  3. Feature or service wise distribution of teams
  4. Creating rotation plans based on team size

When You are a Solo Flyer (Team Size = 1)

When you are a single-person team, creating an on-call schedule is a no-brainer. It is highly likely that you are just starting your journey as a startup and you are the only person responsible for everything in your company. Hence, you need to be available when you are alerted and be on-call 24 x 7 x 365.
Starting a company is tough and you might be drained when you leave for the day. This might even result in you missing alerts or calls.
Our advice is to have an ex-colleague / workmate as backup and add them as a secondary on-call person so that if you don’t wake up and acknowledge the alert, he or she will jump on a call, fix it or notify you / call you immediately. It’s also highly recommended you use on-call software because it calls you in a loop until the issue is acknowledged or resolved.

When Your Team Size = 2

Approach 1: Alternate the primary on-call every other day. Let your peer choose between MWF (Monday, Wednesday and Friday) and TTS (Tuesday, Thursday and Saturday), and split the Sundays of the month between you, with one of you taking the odd-numbered Sundays and the other the even-numbered ones. On any given day, Person A will be primary and will be alerted first, and Person B will be called when Person A misses notifications or fails to acknowledge an incident.
Primary on-call members will be the ones who will be alerted first. If they do not pick up the call or respond to alerts then secondary on-call members are alerted.
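The alternate-day scheme in Approach 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names A and B are placeholders, and the Sunday split is read here as "odd-numbered Sundays of the month go to one person, even-numbered Sundays to the other".

```python
from datetime import date

# Hypothetical names for the two teammates.
MWF_PERSON = "A"
TTS_PERSON = "B"


def primary_on_call(day: date) -> str:
    """Return who is primary on the given day under the alternate-day scheme."""
    weekday = day.weekday()  # 0 is Monday, 6 is Sunday
    if weekday in (0, 2, 4):  # Monday, Wednesday, Friday
        return MWF_PERSON
    if weekday in (1, 3, 5):  # Tuesday, Thursday, Saturday
        return TTS_PERSON
    # Sunday: alternate by the Sunday's position within the month (assumption).
    sunday_number = (day.day - 1) // 7 + 1
    return MWF_PERSON if sunday_number % 2 == 1 else TTS_PERSON


def secondary_on_call(day: date) -> str:
    """The other teammate is the fallback if the primary misses the alert."""
    return TTS_PERSON if primary_on_call(day) == MWF_PERSON else MWF_PERSON
```

One caveat of this reading: a month with five Sundays gives the same person the fifth Sunday and the first Sunday of the next month, so you may want to swap one of them by hand.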
Approach 2: You can also rotate weekly. You act as the primary on-call person and your partner acts as the secondary on-call for the week. The following week the roles swap: your partner becomes primary and you become the secondary on-call person.
We found the best approach to on-call rotation is to do it weekly.

When Your Team Size = 3

Let’s say there are three people in the on-call team: A, B and C. For week 1, A is the primary on-call person, B is the secondary on-call person and C is the backup. The purpose of having a secondary on-call is to ensure that your partner is alerted in case you miss the alert, so that one of you can start working on the issue.
When an alert is triggered, person A receives the first alert. This alert can be in the form of a call, SMS, email or even a Slack message, depending on the set preference.
Ideally, A should be available to acknowledge the alert and start working on it. But if for some reason A misses the alert, it goes to B, who can then start working on the issue. If neither A nor B acknowledges the alert, it finally goes to C, the backup.
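The hand-off from primary to secondary to backup is a simple ordered escalation. Here is a minimal Python sketch of that idea; the `acknowledges` callback is a hypothetical stand-in for "page this person and wait for a response", not the API of any particular on-call tool.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str], acknowledges: Callable[[str], bool]) -> Optional[str]:
    """Alert each person in order and return the first one who acknowledges.

    `acknowledges` is a placeholder: a real system would call, text or
    email the person and give up after a timeout.
    """
    for person in chain:
        if acknowledges(person):
            return person
    return None  # nobody responded; the incident is still unacknowledged


# Week 1 chain: primary A, secondary B, backup C.
chain = ["A", "B", "C"]
unavailable = {"A"}  # suppose A misses the alert
responder = escalate(chain, lambda person: person not in unavailable)
print(responder)  # B
```

Returning `None` when the whole chain is silent makes the "nobody picked up" case explicit, which is the situation you most want to notice.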
In week 2, B who was the secondary on-call person in week 1 replaces A as the primary on-call person and C replaces B to become the secondary on-call person. A now becomes the backup on-call person for week 2.
In week 3, C replaces B, who was the primary on-call person for week 2, to become the primary on-call for week 3. A jumps up to become the secondary on-call and B becomes the backup for week 3.
In week 4, A becomes the primary on-call person, B becomes secondary and C becomes backup again, and so the cycle repeats.
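The three-person cycle is a plain rotation, which is easy to express with a `collections.deque`. The sketch below is one possible way to compute the roles for any week; the function name and the 1-based week numbering are choices made for this example.

```python
from collections import deque

ROLES = ("primary", "secondary", "backup")


def roles_for_week(team, week):
    """Return {role: person} for a 1-based week number.

    Each week the line-up shifts left by one, so the secondary becomes
    primary, the backup becomes secondary and the primary drops to backup.
    """
    order = deque(team)
    order.rotate(-(week - 1))  # a negative rotation shifts to the left
    return dict(zip(ROLES, order))


for week in range(1, 5):
    print(week, roles_for_week(["A", "B", "C"], week))
```

Week 4 comes out identical to week 1, which matches the cycle described above.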

When Your Team Size = 4 or More

If your team size is 4 members or more, we have seen that the best strategy is to have weekly rotations.
Let’s say there are 6 people on the team, A, B, C, D, E, and F. The rotation must be decided such that it remains fair to everyone while reducing the stress of being available all the time. This can be done by ensuring that everyone works as primary, secondary and backup on-call for the same amount of time.
Week 1: Let’s say A works as the primary on-call, B works as secondary on-call and C works as the backup.
  1. Primary on-call: A
  2. Secondary on-call: B
  3. Backup: C
Free from all on-call responsibility: D, E and F
Week 2: A will be relieved of being on-call since they acted as primary for a week with maximum responsibility. It's time for B to replace A and become the primary for this week and C replaces B as secondary on-call.
Since the backup position is empty and D hasn’t had any responsibilities yet, D will act as the backup for this week.
  1. Primary on-call: B
  2. Secondary on-call: C
  3. Backup: D
Free from all on-call responsibility: A, E and F
Week 3: B is relieved from all on-call responsibilities since B worked as primary on-call for the previous week. C becomes primary on-call, D becomes secondary on-call and E becomes backup.
  1. Primary on-call: C
  2. Secondary on-call: D
  3. Backup: E
Free from all on-call responsibility: A, B and F
Week 4: Now D replaces C and becomes the primary on-call. E becomes secondary and F becomes the backup.
  1. Primary on-call: D
  2. Secondary on-call: E
  3. Backup: F
Free from all on-call responsibility: A, B and C
As we can see, each week a person moves up a notch in the responsibility cycle from backup to secondary and from secondary to primary on-call. As this happens, the person with maximum responsibility moves out of the cycle and stays out until everyone has fulfilled the responsibility of being a backup on-call. Once this is done, people who exited the cycle first enter the cycle first and this continues.
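For larger teams the same idea becomes a sliding window of three roles over the team list. The following Python sketch generates the roster for the six-person example and checks the fairness claim, namely that over one full cycle everyone carries the same load. It assumes a team of at least three people and a fixed team order.

```python
from collections import Counter


def weekly_roster(team, week):
    """Return (primary, secondary, backup, free) for a 1-based week number."""
    n = len(team)
    start = (week - 1) % n
    primary = team[start]
    secondary = team[(start + 1) % n]
    backup = team[(start + 2) % n]
    free = [p for p in team if p not in (primary, secondary, backup)]
    return primary, secondary, backup, free


team = ["A", "B", "C", "D", "E", "F"]
load = Counter()
for week in range(1, len(team) + 1):
    primary, secondary, backup, free = weekly_roster(team, week)
    load.update([primary, secondary, backup])
    print(f"Week {week}: primary={primary} secondary={secondary} backup={backup} free={free}")

# Over one full cycle everyone holds each role once, so the load is equal.
assert set(load.values()) == {3}
```

In week 5 the window wraps around, so A, the first person to leave the cycle, is also the first to re-enter it as backup.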
Tip: Some teams follow the same rotation but do it every day instead of every week. This works perfectly when the team size is large, say 30 or more and everyone is aware of the schedule. In smaller teams, however, this creates a lot of tension among engineers who hate constant distractions. This also tends to affect their work-life balance.
With a weekly rotation, engineers are mentally and physically prepared to be available for a week and can focus on their regular tasks in the other 3 weeks of the month. This has proven to be a better plan for most of our clients who have teams this size.

Feature-wise distribution of teams

It makes sense to have a member from the team responsible for rolling out a feature to also be responsible for maintaining it. An on-call person who is already aware of how the feature is designed and of the code style is much faster in resolving the issue and avoiding it in the future as well.
Having someone from the team that developed the feature on-call as a backup helps resolve issues faster.
Hence, an on-call team must always include at least one person as a backup / secondary on-call from the team that was responsible for rolling out the feature.
Important: Sometimes a team works on multiple features and so on-call schedules based on features might clash. In such a case it’s important that a single person doesn't act as the primary on-call for two features. It must also be ensured, if possible, that each of the two on-call schedules has at least one person who isn’t part of the other.
This makes it fail-safe such that even in the worst-case scenario, there will be two different people on-call for two different features.
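These two rules can be checked mechanically before a schedule is published. The sketch below is one hedged way to do it in Python; the shape of the `schedules` dictionary and the feature names are assumptions made purely for the example, and the second rule is read as "each schedule has at least one member the other lacks".

```python
from itertools import combinations


def schedule_conflicts(schedules):
    """Return human-readable problems found across feature on-call schedules.

    `schedules` maps a feature name to {"primary": str, "members": set};
    this layout is an assumption made for the example.
    """
    problems = []
    for (feat_a, a), (feat_b, b) in combinations(schedules.items(), 2):
        if a["primary"] == b["primary"]:
            problems.append(f"{a['primary']} is primary for both {feat_a} and {feat_b}")
        # Each schedule should include someone who is not in the other one.
        if not (a["members"] - b["members"]) or not (b["members"] - a["members"]):
            problems.append(f"{feat_a} and {feat_b} have no member unique to each schedule")
    return problems


bad = {
    "checkout": {"primary": "A", "members": {"A", "B", "C"}},
    "search": {"primary": "A", "members": {"A", "B", "C"}},
}
good = {
    "checkout": {"primary": "A", "members": {"A", "B", "C"}},
    "search": {"primary": "B", "members": {"B", "C", "D"}},
}
print(schedule_conflicts(bad))
print(schedule_conflicts(good))
```

The first call reports both problems, while the second returns an empty list because the primaries differ and each schedule has a member the other does not.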

Accounting for Geographical Distribution of Teams

Enterprise companies often have a huge team that may be geographically distributed. Companies of this size typically use a follow-the-sun model. This ensures that on-call duty doesn’t extend beyond office hours and helps maintain a work-life balance.
Let’s say a team is distributed into two subgroups working in two geographical regions, the US and India. Proper on-call scheduling and rotation ensures that the Indian team receives the alert when the team in the US is off their work schedule, and similarly that the team in the US receives the alert when the team in India is offline.
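Routing by time of day can be sketched as below. This is a simplified illustration, not a production router: it assumes a single fixed UTC offset for the US site (Eastern Standard Time, ignoring daylight saving) and an assumed daytime window of 08:00 to 20:00 local time, with India covering everything outside that window.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offset chosen for illustration: US Eastern Standard Time,
# ignoring daylight saving. A real setup would use proper time zone data.
US_TZ = timezone(timedelta(hours=-5))
DAY_START, DAY_END = 8, 20  # assumed daytime window at the US site, local hours


def region_on_call(now_utc: datetime) -> str:
    """Route to the US during the US daytime window and to India otherwise."""
    us_hour = now_utc.astimezone(US_TZ).hour
    return "US" if DAY_START <= us_hour < DAY_END else "India"


print(region_on_call(datetime(2020, 9, 25, 15, 0, tzinfo=timezone.utc)))  # US
print(region_on_call(datetime(2020, 9, 25, 4, 0, tzinfo=timezone.utc)))   # India
```

In the first call it is 10:00 at the US site, so the US team is alerted. In the second it is 23:00 there (and 09:30 in India), so the alert goes to India.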
Alerts customized according to the time zone prevent burnout for members in either of the regions. However, a problem can arise when only members of one region are scheduled to receive alerts and work on issues based on the time of day.
Warning: Sometimes the teams may require information such as logs, which would only be available with the team in the other time zone, in order to work on the issue. In such a scenario, it becomes extremely difficult to have someone from the team in the other time zone go back to the office in the middle of the night and help with the information required so that the issue can be resolved in time.
Approach: We have found that the best way to ensure that such situations are avoided is to have someone on the team in the other time zone be on-call but this can be further optimized by deciding the schedule based on the priority of the incident. In case of low or moderate priority events, the alert to the team in the other time zone could be avoided. In such a scenario, a member of the same time zone would act as a backup.
In case of high priority incidents, alerts can be sent to team members in the other time zone as well.

High priority — critical incidents

Scenario 1: When it's day in the US and night in India and an issue occurs.
  1. Primary on-call: Member 1 of the team in the US
  2. Secondary on-call: Member 2 of the team in the US
  3. Backup: Member 3 of the team in India
  4. Secondary Backup: Member 2 of the team in India (in case the backup misses the alert)
Scenario 2: When it's day in India and night in the US.
  1. Primary on-call: Member 1 of the team in India
  2. Secondary on-call: Member 2 of the team in India
  3. Backup: Member 3 of the team in the US
  4. Secondary Backup: Member 2 of the team in the US (in case the backup misses the alert)

Low priority — low impact incidents

Scenario 1: When it's day in the US and night in India and an issue occurs.
  1. Primary on-call: Member 1 of the team in the US
  2. Secondary on-call: Member 2 of the team in the US
  3. Backup: Member 3 of the team in the US
Scenario 2: When it's day in India and night in the US.
  1. Primary on-call: Member 1 of the team in India
  2. Secondary on-call: Member 2 of the team in India
  3. Backup: Member 3 of the team in India
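The four scenarios above follow one pattern: the first two responders always come from the region where it is daytime, and only high priority incidents escalate across time zones. A minimal Python sketch of that pattern follows; the function name and the tuple layout are illustrative choices, while the member numbers are taken from the scenarios.

```python
def escalation_chain(active_region, other_region, priority):
    """Return the ordered list of (role, member, region) to alert.

    `active_region` is where it is currently daytime. `priority` is "high"
    for critical incidents; anything else is treated as low or moderate.
    """
    chain = [
        ("primary", "Member 1", active_region),
        ("secondary", "Member 2", active_region),
    ]
    if priority == "high":
        # Critical incidents may wake up the team in the other time zone.
        chain.append(("backup", "Member 3", other_region))
        chain.append(("secondary backup", "Member 2", other_region))
    else:
        # Low and moderate incidents stay within the active region.
        chain.append(("backup", "Member 3", active_region))
    return chain


for step in escalation_chain("US", "India", "high"):
    print(step)
```

Swapping the first two arguments gives Scenario 2, and passing `"low"` keeps the whole chain inside the region that is awake.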
Using the above methodology, you would now be able to design an on-call rotation schedule that works best for your team. But there are some more things you should consider while creating an on-call team.

Tips to Build an Awesome On-Call Culture

  1. Have only those people on the team who are independently capable of working and resolving issues related to code, servers, and other network issues. Having members in the role of SRE, like Google, is probably a much better idea than having just a Sysadmin or DevOps person on call.
  2. Make sure you take a poll from your team members before finalizing a schedule. It's always better to find a middle ground that serves both the firm and the engineers well. Even after implementation, regular feedback ensures the team doesn’t have trouble following it.
  3. Ensure the schedule takes care of the work-life balance of your employees. Ensure that they get enough sleep and have a healthy work environment.
  4. Make sure the person on-call isn’t burdened with anything else while he or she is on call. Extra work on top of on-call duty reduces efficiency and is counterproductive.
  5. Help your team develop a culture of empathy. Your team should care for each other and should learn it from you. Sometimes a person who was supposed to be on call might have some emergency due to which they might not be available to be on-call. In such a scenario, someone from the team should eagerly come up and volunteer to shoulder the responsibility and cover that person. They shouldn’t be forced into it.
  6. Your schedule should have the flexibility for people who fall ill, or for someone who has a child or is about to have one. The schedule designer must take note of these situations and adapt accordingly until they are fit to start again. It’s always a good idea to keep a list of people who can replace the on-call person in case of emergencies.
We at Fyipe help hundreds of businesses across the globe run efficiently and reduce downtime to improve the customer experience every day.

Written by nawazdhandala | Founder, HackerBay.io
Published by HackerNoon on 2020/09/25