updated on-call docs

Marius Pana 2017-01-21 11:21:51 +02:00
parent b7b89e396a
commit d5d28e52e7

@ -3,7 +3,12 @@ A summary of expectations and helpful information for being on-call.
![Alert Fatigue](../assets/img/misc/alert_fatigue.png)
## What is On-Call?
Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise. For example, if you are on-call, should any alarms be triggered by our monitoring solution, you will receive a "page" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state.
At Spearhead, being on-call means that you can be contacted at any time to investigate and fix issues that may arise. There are two on-call scenarios that you will deal with:
* during your normal work shift
* being on-call for outside working hours
For example, if you are on-call outside normal working hours, should any alarms be triggered by our monitoring solution, you will receive a "page" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state.
At Spearhead Systems we consider you to be on-call in both cases: during normal working hours, when you are proactively working with [DoIT](http://doit.sphs.ro/) and looking over your assigned cards/boards, and when you are formally "on-call" and issues are being redirected to you.
@ -16,8 +21,8 @@ On-call responsibilities extend beyond normal office hours, and if you are on-ca
* Have a way to charge your MiFi.
* Team alert escalation happens within 5 minutes, set/stagger your notification timeouts (push, SMS, phone...) accordingly.
 * Make sure texts and calls from Spearhead Systems (and directly from colleagues) can bypass your "Do Not Disturb" settings.
* Be prepared (environment is set up, a current working copy of the necessary repos is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, you have Java installed, ssh-keys and so on...)
* Read our Incident Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc.
* Be prepared (environment is set up, you have remote access tools ready and functional, your credentials are current, you have Java installed, ssh-keys and so on...)
* Read our Issue Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc.
* Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc.
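
The advice above to stagger notification timeouts within the 5-minute escalation window can be sketched as follows. This is only an illustration: the channel names, the one-minute safety margin, and the `stagger_notifications` helper are assumptions made for the example, not settings taken from our paging tool.

```python
from datetime import timedelta

# From the checklist above: team alert escalation happens within 5 minutes.
ESCALATION_TIMEOUT = timedelta(minutes=5)


def stagger_notifications(channels, window=ESCALATION_TIMEOUT,
                          margin=timedelta(minutes=1)):
    """Spread channels evenly from t=0 so the last one fires `margin`
    before the alert would escalate (hypothetical helper)."""
    if not channels:
        return []
    if len(channels) == 1:
        return [(channels[0], timedelta(0))]
    usable = window - margin
    step = usable / (len(channels) - 1)
    return [(name, step * i) for i, name in enumerate(channels)]


# Example channel order (an assumption): least to most intrusive.
schedule = stagger_notifications(["push", "sms", "phone"])
for name, delay in schedule:
    print(f"{name}: notify after {int(delay.total_seconds())}s")
```

Ordering channels from least to most intrusive means a quick acknowledgement of the push notification avoids the phone call, while the margin leaves time to acknowledge before the alert moves on to the next person.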
1. **Triage**
@ -31,7 +36,7 @@ On-call responsibilities extend beyond normal office hours, and if you are on-ca
1. **Fix**
* You are empowered to dive into any problem and act to fix it.
* Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before.
 * If the issue is not very time-sensitive and you have other priority work, make a note of this in DoIT to keep track of it (with an appropriate severity).
 * If the issue is not very time-sensitive and you have other priority work, make a note of this in DoIT to keep track of it (with an appropriate severity and due date).
1. **Improve**
 * If a particular issue keeps happening, or if an issue alerts often but turns out to be a preventable non-issue, consider making the improvement a longer-term task.
@ -39,7 +44,9 @@ On-call responsibilities extend beyond normal office hours, and if you are on-ca
* If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized.
1. **Support**
* When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note.
* When your on-call "shift" ends, let the next on-call and team know about issues that have not been resolved yet and other experiences of note.
 * Make an effort to cleanly hand over necessary information. We use Slack, email and DoIT to communicate.
 * This is a best practice that should be applied whenever sharing details would improve the efficiency of the team.
* If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.
* Support each other: when doing activities that might generate plenty of pages, it is courteous to "take the page" away from the on-call by notifying them and scheduling an override for the duration.
@ -53,20 +60,18 @@ On-call responsibilities extend beyond normal office hours, and if you are on-ca
* Service owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once and it's often best to let the subject matter expert do the cutting.
## Recommendations
If your team is starting its own on-call rotation, here are some scheduling recommendations from the Operations team.
* Always have a backup schedule. Yes, this means two people being on-call at the same time; however, it takes a lot of the stress off the primary if they know they have a specific backup they can contact, rather than trying to choose a random member of the team.
* Always have a backup schedule. Yes, this means two people being on-call at the same time; however, it takes a lot of the stress off the primary if they know they have a specific backup they can contact, rather than trying to choose a random member of the team.
* A backup shift should generally come directly after a primary shift. It gives the previous primary a chance to pass on additional context which may have come up during their shift. It also helps to prevent people from sitting on issues with the intent of letting the next shift fix them.
* The third-level of your escalation (after backup schedule) should probably be your entire team. This should hopefully never happen (it's happened once in the history of the Support team), but when it does, it's useful to be able to just get the next available person.
* The third-level of your escalation (after backup schedule) should probably be your entire team. This should hopefully never happen, but when it does, it's useful to be able to just get the next available person.
![Escalation](../assets/img/misc/escalation.png)
* Team managers can (and should) be part of your normal rotation. It gives a better insight into what has been going on.
* Team leaders (TL) can (and should) be part of your normal rotation. It gives a better insight into what has been going on.
* New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time).
* We recommend you set your escalation timeout to 5 minutes. This should be plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.
* Our escalation timeout is set to 5 minutes. This is usually plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.
* Triggering an escalation is done automatically in most situations based on the type, priority and severity of the issue.
* When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient.
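
The escalation chain recommended above (primary, then backup, then the entire team, with a 5-minute timeout per level) can be illustrated with a minimal sketch. It models only the time-based path. The level names and the `current_escalation_level` function are assumptions made for the example, and real escalations also depend on the type, priority and severity of the issue.

```python
from datetime import timedelta

# From the recommendations above: 5 minutes per escalation level.
ESCALATION_TIMEOUT = timedelta(minutes=5)
# Level names are illustrative labels, not identifiers from a paging tool.
ESCALATION_CHAIN = ["primary", "backup", "entire team"]


def current_escalation_level(unacknowledged_for,
                             chain=ESCALATION_CHAIN,
                             timeout=ESCALATION_TIMEOUT):
    """Return who should be paged once an incident has gone
    unacknowledged for the given (non-negative) duration."""
    level = unacknowledged_for // timeout  # whole timeouts elapsed
    # Stay on the last level once the chain is exhausted.
    return chain[min(level, len(chain) - 1)]


for minutes in (0, 4, 5, 10, 60):
    who = current_escalation_level(timedelta(minutes=minutes))
    print(f"unacknowledged for {minutes} min -> page {who}")
```

Capping at the last level reflects the recommendation that the third level is the whole team: once an alert reaches it, the next available person picks it up.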