<!DOCTYPE html>
<!--[if lt IE 7 ]><html class="no-js ie6"><![endif]-->
<!--[if IE 7 ]><html class="no-js ie7"><![endif]-->
<!--[if IE 8 ]><html class="no-js ie8"><![endif]-->
<!--[if IE 9 ]><html class="no-js ie9"><![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Being On-Call - Spearhead Systems Incident Response Documentation</title>
<!-- Author and License -->
<meta name="author" content="Spearhead Systems, Inc." />
<meta name="dcterms.license" content="http://www.apache.org/licenses/LICENSE-2.0" />
<!-- Page Description -->
<meta name="keywords" content="spearhead, incident, response" />
<meta name="robots" content="index, follow, noarchive" />
<!-- Mobile -->
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<meta name="theme-color" content="#1f293a" />
<!-- Canonical Link -->
<link rel="canonical" href="https://response.spearhead.systems/oncall/being_oncall/">
<!-- Favicon -->
<link rel="shortcut icon" type="image/x-icon" href="../../assets/img/icon.png" />
<link rel="icon" type="image/x-icon" href="../../assets/img/icon.png" />
<!-- Apple -->
<meta name="apple-mobile-web-app-title" content="Spearhead Systems Incident Response Documentation" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<link rel="apple-touch-icon" href="../../assets/img/icon.png">
<!-- Open Graph -->
<meta property="og:url" content="https://response.spearhead.systems/oncall/being_oncall/" />
<meta property="og:title" content="Being On-Call - Spearhead Systems Incident Response Documentation" />
<meta property="og:site_name" content="Spearhead Systems Incident Response Documentation" />
<meta property="og:description" content="A collection of information about the Spearhead Systems incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work." />
<meta property="og:image" content="https://response.spearhead.systems/assets/img/cover.png" />
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<!-- Twitter -->
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="Being On-Call - Spearhead Systems Incident Response Documentation" />
<meta name="twitter:description" content="A collection of information about the Spearhead Systems incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work." />
<meta name="twitter:image" content="https://response.spearhead.systems/assets/img/cover.png" />
<!-- Style -->
<style>
@font-face {
font-family: 'Icon';
src: url('../../assets/fonts/icon.eot?52m981');
src: url('../../assets/fonts/icon.eot?#iefix52m981')
format('embedded-opentype'),
url('../../assets/fonts/icon.woff?52m981')
format('woff'),
url('../../assets/fonts/icon.ttf?52m981')
format('truetype'),
url('../../assets/fonts/icon.svg?52m981#icon')
format('svg');
font-weight: normal;
font-style: normal;
}
</style>
<link rel="stylesheet" href="../../assets/stylesheets/application-a422ff04cc.css">
<link rel="stylesheet" href="../../assets/stylesheets/palettes-05ab2406df.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:400,700|Roboto+Mono">
<style>
body, input {
font-family: 'Roboto', Helvetica, Arial, sans-serif;
}
pre, code {
font-family: 'Roboto Mono', 'Courier New', 'Courier', monospace;
}
</style>
<link rel="stylesheet" href="../../assets/css/extra.css">
<!-- Scripts -->
<script src="../../assets/javascripts/modernizr-4ab42b99fd.js"></script>
</head>
<body class="palette-primary-green palette-accent-blue-grey">
<div class="backdrop">
<div class="backdrop-paper"></div>
</div>
<input class="toggle" type="checkbox" id="toggle-drawer">
<input class="toggle" type="checkbox" id="toggle-search">
<label class="toggle-button overlay" for="toggle-drawer"></label>
<header class="header">
<nav aria-label="Header">
<div class="bar default">
<div class="button button-menu" role="button" aria-label="Menu">
<label class="toggle-button icon icon-menu" for="toggle-drawer">
<span></span>
</label>
</div>
<div class="stretch">
<div class="mainlogo">
<a href="/" title="Go to homepage.">
<img src="../../assets/img/logo.png" title="Spearhead Systems" />
</a>
</div>
<div class="title">
<span class="path">
Incident Response
<i class="icon icon-link"></i>
</span>
<span class="path">
On-Call <i class="icon icon-link"></i>
</span>
Being On-Call
</div>
</div>
<div class="button button-twitter" role="button" aria-label="Twitter">
<a href="https://twitter.com/spearhead_sys" title="@spearhead_sys on Twitter" target="_blank" class="toggle-button icon icon-twitter"></a>
</div>
<div class="button button-github" role="button" aria-label="GitHub">
<a href="https://github.com/spearheadsys" title="@spearheadsys on GitHub" target="_blank" class="toggle-button icon icon-github"></a>
</div>
<div class="button button-search" role="button" aria-label="Search">
<label class="toggle-button icon icon-search" title="Search" for="toggle-search"></label>
</div>
</div>
<div class="bar search">
<div class="button button-close" role="button" aria-label="Close">
<label class="toggle-button icon icon-back" for="toggle-search"></label>
</div>
<div class="stretch">
<div class="field">
<input class="query" type="text" placeholder="Search" autocapitalize="off" autocorrect="off" autocomplete="off" spellcheck="false">
</div>
</div>
<div class="button button-reset" role="button" aria-label="Search">
<button class="toggle-button icon icon-close" id="reset-search"></button>
</div>
</div>
</nav>
</header>
<main class="main">
<div class="drawer">
<nav aria-label="Navigation">
<a href="https://github.com/spearheadsys/issue-response-docs" class="project">
<div class="banner">
<div class="logo">
<img src="../../assets/img/icon.png">
</div>
<div class="name">
<strong>
Spearhead Systems Incident Response Documentation
<span class="version">
</span>
</strong>
<br>
spearheadsys/issue-response-docs
</div>
</div>
</a>
<div class="scrollable">
<div class="wrapper">
<ul class="repo">
<li class="repo-download">
<a href="https://github.com/spearheadsys/issue-response-docs/archive/master.zip" target="_blank" title="Download" data-action="download">
<i class="icon icon-download"></i> Download
</a>
</li>
<li class="repo-stars">
<a href="https://github.com/spearheadsys/issue-response-docs/stargazers" target="_blank" title="Stargazers" data-action="star">
<i class="icon icon-star"></i> Stars
<span class="count">&ndash;</span>
</a>
</li>
</ul>
<hr/>
<div class="toc">
<ul>
<li>
<a class="" title="Home" href="../..">
Home
</a>
</li>
<li>
<span class="section">On-Call</span>
<ul>
<li>
<a class="current" title="Being On-Call" href="./">
Being On-Call
</a>
<ul>
<li class="anchor">
<a title="What is On-Call?" href="#what-is-on-call">
What is On-Call?
</a>
</li>
<li class="anchor">
<a title="Responsibilities" href="#responsibilities">
Responsibilities
</a>
</li>
<li class="anchor">
<a title="Not Responsibilities" href="#not-responsibilities">
Not Responsibilities
</a>
</li>
<li class="anchor">
<a title="Recommendations" href="#recommendations">
Recommendations
</a>
</li>
<li class="anchor">
<a title="Etiquette" href="#etiquette">
Etiquette
</a>
</li>
</ul>
</li>
<li>
<a class="" title="Alerting Principles" href="../alerting_principles/">
Alerting Principles
</a>
</li>
</ul>
</li>
<li>
<span class="section">Before an Incident</span>
<ul>
<li>
<a class="" title="Severity Levels" href="../../before/severity_levels/">
Severity Levels
</a>
</li>
<li>
<a class="" title="Different Roles" href="../../before/different_roles/">
Different Roles
</a>
</li>
<li>
<a class="" title="Call Etiquette" href="../../before/call_etiquette/">
Call Etiquette
</a>
</li>
</ul>
</li>
<li>
<span class="section">During an Incident</span>
<ul>
<li>
<a class="" title="During An Incident" href="../../during/during_an_incident/">
During An Incident
</a>
</li>
<li>
<a class="" title="Security Incident" href="../../during/security_incident_response/">
Security Incident
</a>
</li>
</ul>
</li>
<li>
<span class="section">After an Incident</span>
<ul>
<li>
<a class="" title="Post-Mortem Process" href="../../after/post_mortem_process/">
Post-Mortem Process
</a>
</li>
<li>
<a class="" title="Post-Mortem Template" href="../../after/post_mortem_template/">
Post-Mortem Template
</a>
</li>
</ul>
</li>
<li>
<span class="section">Training</span>
<ul>
<li>
<a class="" title="Overview" href="../../training/overview/">
Overview
</a>
</li>
<li>
<a class="" title="Team Leader" href="../../training/team_leader/">
Team Leader
</a>
</li>
<li>
<a class="" title="Sysadmin" href="../../training/sysadmin/">
Sysadmin
</a>
</li>
<li>
<a class="" title="Scribe" href="../../training/scribe/">
Scribe
</a>
</li>
<li>
<a class="" title="Subject Matter Expert" href="../../training/subject_matter_expert/">
Subject Matter Expert
</a>
</li>
<li>
<a class="" title="Glossary" href="../../training/glossary/">
Glossary
</a>
</li>
</ul>
</li>
<li>
<a class="" title="About" href="../../about/">
About
</a>
</li>
</ul>
</div>
</div>
</div>
</nav>
</div>
<article class="article">
<div class="wrapper">
<h1>Being On-Call</h1>
<p>A summary of expectations and helpful information for being on-call.</p>
<p><img alt="Alert Fatigue" src="../../assets/img/misc/alert_fatigue.png" /></p>
<h2 id="what-is-on-call">What is On-Call?<a class="headerlink" href="#what-is-on-call" title="Permanent link">#</a></h2>
<p>At Spearhead, being on-call means that you are responsible for monitoring our communications channels and responding to requests at any time. There are two on-call scenarios that you will deal with:</p>
<ul>
<li>during your normal work shift</li>
<li>outside working hours</li>
</ul>
<p>For example, if you are on-call outside normal working hours and an alarm is triggered by our monitoring solution, or a customer emails our support channel, you will receive a "notification" (an alert on your mobile device, an email, a phone call, an SMS, etc.) giving you details on what has broken.
You will be expected to gather as much information as possible, create the required cards in our ticketing systems, delegate or assign each card to the right person/watchers, and otherwise take whatever actions are necessary to resolve the issue.</p>
<!-- At Spearhead Systems we consider you are on-call during normal working hours in which case you are proactively working with [DoIT](http://doit.sphs.ro/) and looking over your assigned cards/boards as well as when you are formally "on-call" and issues are being redirected to you. -->
<p>On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and it is the problem that the Spearhead Systems technical support service is trying to fix!</p>
<p>When you are on-call during normal working hours, you are the central contact for our entire support team. We expect you to delegate and assign cards to your colleagues and only attempt to resolve issues yourself if your current workload permits.
When you are on-call outside working hours, you are expected to handle as much of the process as possible and delegate only if the issue is outside your area of expertise or you encounter problems that require another colleague's input.</p>
<div class="admonition note">
<p class="admonition-title">When in the office</p>
<p>Generally speaking, you are on-call during your normal working hours even if you are not <em>the</em> on-call engineer. This means you keep an eye on the cards that are assigned to you directly or that you are a watcher for. If you ever have no assigned cards and it is not clear what to work on, ask a TL or senior Sysadmin to point you in the right direction.</p>
</div>
<h2 id="responsibilities">Responsibilities<a class="headerlink" href="#responsibilities" title="Permanent link">#</a></h2>
<ol>
<li>
<p><strong>Prepare</strong></p>
<ul>
<li>Have your laptop and Internet with you (office, home, a phone with a tethering plan, etc).<ul>
<li>Have a way to charge your phone.</li>
</ul>
</li>
<li>Team alert escalation happens within 30 minutes, so set/stagger your notification timeouts (push, SMS, phone...) accordingly.<ul>
<li>Make sure texts and calls from Spearhead Systems (and directly from colleagues) can bypass your "Do Not Disturb" settings.</li>
</ul>
</li>
<li>Be prepared: your environment is set up, your remote access tools are ready and functional, your credentials are current, you have Java installed, your SSH keys are in place, and so on.</li>
<li>Read our Issue Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc.</li>
<li>Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments, etc.</li>
</ul>
</li>
<li>
<p><strong>Triage</strong></p>
<ul>
<li>Acknowledge and act on alerts whenever you can (see the first "Not Responsibilities" point below).</li>
<li>Determine the urgency of the problem:<ul>
<li>Is it something that should be worked on right now or escalated into a major incident ("production server on fire" situations, security alerts)? If so, do so.</li>
<li>Is it tactical work that doesn't have to happen during the night (for example, a disk utilization high watermark where there's plenty of space left and the trend is not indicating impending doom)? If so, snooze the issue until a more suitable time (working hours, the next morning...) and get back to fixing it then.</li>
</ul>
</li>
<li>Check our <em>internal Chat</em> for current activity. Often (but not always) actions that could potentially cause alerts will be announced there.</li>
<li>Do the alert and your initial investigation indicate a general problem, or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert for, escalate to another team member or group.</li>
</ul>
</li>
<li>
<p><strong>Fix</strong></p>
<ul>
<li>You are empowered to dive into any problem and act to fix it.</li>
<li>Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before.</li>
<li>If the issue is not very time-sensitive and you have other priority work, make a note of it in DoIT to keep track of it (with an appropriate severity, comment and due date).</li>
</ul>
</li>
<li>
<p><strong>Improve</strong></p>
<ul>
<li>If a particular issue keeps happening, or if an issue alerts often but turns out to be a preventable non-issue, improving this should perhaps be a longer-term task.<ul>
<li>Disks that fill up, logs that should be rotated, noisy alerts... (we use Ansible and Rundeck, so go ahead and start automating!)</li>
<li>When we perform a DoD (definition of done), this is a good time to bring up recurring issues for discussion.</li>
</ul>
</li>
<li>If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized.</li>
</ul>
</li>
<li>
<p><strong>Support</strong></p>
<ul>
<li>When your on-call "shift" ends, let the next on-call and team know about issues that have not been resolved yet and other experiences of note.<ul>
<li>Make an effort to cleanly hand over the necessary information. We use <em>internal Chat</em>, email and DoIT to communicate.</li>
<li>This is a best practice that should be applied whenever sharing details would benefit the efficiency of the team.</li>
</ul>
</li>
<li>If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.</li>
<li>Support each other: when doing activities that might generate plenty of alerts, it is courteous to "place the service/host in maintenance" and take it away from the on-call by notifying them and scheduling an override for the duration.</li>
</ul>
</li>
</ol>
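<p>As an illustration of the staggering advice in the Prepare step, the following Python sketch spreads a set of notification methods evenly across the 30-minute team escalation window mentioned above. The function name <code>stagger_notifications</code>, the method labels and the even spacing are assumptions made for this example; they are not part of DoIT or of our monitoring solution, so adapt the delays to whatever your notification tool actually supports.</p>
<pre><code class="python">from datetime import timedelta

# From the text above: team alert escalation happens within 30 minutes.
ESCALATION_WINDOW = timedelta(minutes=30)


def stagger_notifications(methods, window=ESCALATION_WINDOW):
    """Spread notification methods evenly across the escalation window.

    The first method fires immediately and each later method fires one
    equal step after the previous one, so all of them fire before the
    window ends. Even spacing is an assumption of this sketch.
    """
    if not methods:
        return []
    step = window / len(methods)
    return [(method, step * index) for index, method in enumerate(methods)]


# Hypothetical method labels, ordered from least to most intrusive.
schedule = stagger_notifications(["push", "sms", "phone"])
for method, delay in schedule:
    print(f"{method}: {int(delay.total_seconds() // 60)} min after the alert")
</code></pre>
<p>With three methods, the sketch fires the push notification immediately, the SMS after 10 minutes and the phone call after 20 minutes, all before the team escalation.</p>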
<h2 id="not-responsibilities">Not Responsibilities<a class="headerlink" href="#not-responsibilities" title="Permanent link">#</a></h2>
<ol>
<li>
<p>No expectation to be the first to acknowledge <em>all</em> of the alerts during the on-call period.</p>
<ul>
<li>Commutes (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That's what the backup on-call and schedule are for.</li>
</ul>
</li>
<li>
<p>No expectation to fix all issues by yourself.</p>
<ul>
<li>No one knows everything. Your whole team is here to help. There is no shame, and much to be learned, by escalating issues you are not certain about. "Never hesitate to escalate".</li>
<li>Service owners will always know more about how their stuff works. Especially if our documentation and theirs is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once, and it's often best to let the subject matter expert do the cutting.</li>
</ul>
</li>
</ol>
<h2 id="recommendations">Recommendations<a class="headerlink" href="#recommendations" title="Permanent link">#</a></h2>
<ul>
<li>
<p>Always have a backup schedule. Yes, this means two people being on-call at the same time; however, it takes a lot of the stress off the primary if they know they have a specific backup they can contact, rather than trying to choose a random member of the team.</p>
</li>
<li>
<p>The third level of your escalation (after the backup schedule) should probably be your entire team. This should hopefully never happen, but when it does, it's useful to be able to just get the next available person.</p>
</li>
</ul>
<p><img alt="Escalation" src="../../assets/img/misc/escalation.png" /></p>
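<p>The three-level chain recommended above (primary, then backup, then the entire team) can be sketched in Python as follows. This is only an illustration of the ordering: <code>escalate</code> and its <code>acknowledges</code> callback are hypothetical names invented for this example, and the sketch does not describe how our tooling triggers escalations.</p>
<pre><code class="python"># Escalation order as recommended above.
ESCALATION_CHAIN = ["primary on-call", "backup on-call", "entire team"]


def escalate(acknowledges):
    """Walk the chain in order and return the first level that acknowledges.

    `acknowledges` is a hypothetical callback: it takes a level name and
    returns True if someone at that level acknowledged the alert.
    """
    for level in ESCALATION_CHAIN:
        if acknowledges(level):
            return level
    return None  # nobody acknowledged, so the alert is still open


# Example: the primary is commuting, so the backup picks up the alert.
available = {"backup on-call", "entire team"}
print(escalate(lambda level: level in available))
</code></pre>
<p>Keeping the chain as an ordered list makes the intent explicit: the entire team is only reached when both the primary and the backup have stayed silent.</p>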
<ul>
<li>
<p>Team leaders (TL) are a part of our normal rotation. It gives a better insight into what has been going on.</p>
</li>
<li>
<p>New members of the team should shadow your on-call rotation during their first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations as well.)</p>
</li>
</ul>
<!-- // we do not yet implement escalation for incidents, not automatically // * Our escalation timeout is set to 5 minutes. This is usually plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.
* Triggering an escalation is done automatically in most situations based on the type, priority and severity of the issue.
* Escalations only happen to incidents! Service Requests must be manually escalated based on customer input -->
<ul>
<li>When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift: for example, a service has been flapping, or an issue is likely to recur. If you want to be formal, this can be a written report via email, but generally a verbal summary during our morning stand-up is sufficient.</li>
</ul>
<h3 id="notification-method-recommendations">Notification Method Recommendations<a class="headerlink" href="#notification-method-recommendations" title="Permanent link">#</a></h3>
<p>You are free to set up your notification rules as you see fit, to match how you would best like to respond to incidents. If you're not sure how to configure them, the Support team has some recommendations.</p>
<p><img alt="Mobile Alerts" src="../../assets/img/misc/mobile_alerts.png" /></p>
<!-- // still working on integration for SMS // * Use Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. (DoIT is in the process of integration with SNS for push notifications)
* Use Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you. -->
<h2 id="etiquette">Etiquette<a class="headerlink" href="#etiquette" title="Permanent link">#</a></h2>
<ul>
<li>
<p>If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice.</p>
</li>
<li>
<p>Don't close or otherwise modify a card out from under someone else. If that specific card was not assigned to you as owner or watcher, you shouldn't be modifying it. Instead, add a comment with your notes in the monitoring system and in DoIT.</p>
</li>
</ul>
<p><img alt="Acknowledging" src="../../assets/img/misc/ack.png" /></p>
<ul>
<li>
<p>If you are testing something, or performing an action that you know will cause an alert from our monitoring or that our customers may identify as an issue, it's customary to "place the host/service in downtime" for the duration of your testing and to notify all the involved parties. Notify the person on-call as well, so they are aware of your testing.</p>
</li>
<li>
<p>"Never hesitate to escalate" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help.</p>
</li>
<li>
<p>Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives that might get in the way of on-call time, and one day it might be you who needs to swap your on-call time in order to have a night out with a friend from out of town.</p>
</li>
<li>
<p>If you get called for an issue during your on-call shift, you are responsible for resolving it, even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.</p>
</li>
</ul>
<aside class="copyright" role="note">
Copyright &copy; Spearhead Systems, Inc. &ndash;
Documentation built with
<a href="http://www.mkdocs.org" target="_blank">MkDocs</a>
using the
<a href="http://squidfunk.github.io/mkdocs-material/" target="_blank">
Material
</a>
theme.
</aside>
<footer class="footer">
<nav class="pagination" aria-label="Footer">
<div class="previous">
<a href="../.." title="Home">
<span class="direction">
Previous
</span>
<div class="page">
<div class="button button-previous" role="button" aria-label="Previous">
<i class="icon icon-back"></i>
</div>
<div class="stretch">
<div class="title">
Home
</div>
</div>
</div>
</a>
</div>
<div class="next">
<a href="../alerting_principles/" title="Alerting Principles">
<span class="direction">
Next
</span>
<div class="page">
<div class="stretch">
<div class="title">
Alerting Principles
</div>
</div>
<div class="button button-next" role="button" aria-label="Next">
<i class="icon icon-forward"></i>
</div>
</div>
</a>
</div>
</nav>
</footer>
</div>
</article>
<div class="results" role="status" aria-live="polite">
<div class="scrollable">
<div class="wrapper">
<div class="meta"></div>
<div class="list"></div>
</div>
</div>
</div>
</main>
<script>
var base_url = '../..';
var repo_id = 'spearheadsys/issue-response-docs';
</script>
<script src="../../assets/javascripts/application-997097ee0c.js"></script>
</body>
</html>