<!DOCTYPE html>
<!--[if lt IE 7 ]><html class="no-js ie6"><![endif]-->
<!--[if IE 7 ]><html class="no-js ie7"><![endif]-->
<!--[if IE 8 ]><html class="no-js ie8"><![endif]-->
<!--[if IE 9 ]><html class="no-js ie9"><![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Different Roles - Spearhead Systems Incident Response Documentation</title>
<!-- Author and License -->
<meta name="author" content="Spearhead Systems, Inc." />
<meta name="dcterms.license" content="http://www.apache.org/licenses/LICENSE-2.0" />
<!-- Page Description -->
<meta name="keywords" content="spearhead, incident, response" />
<meta name="robots" content="index, follow, noarchive" />
<!-- Mobile -->
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<meta name="theme-color" content="#1f293a" />
<!-- Canonical Link -->
<link rel="canonical" href="https://response.spearhead.systems/before/different_roles/">
<!-- Favicon -->
<link rel="shortcut icon" type="image/x-icon" href="../../assets/img/icon.png" />
<link rel="icon" type="image/x-icon" href="../../assets/img/icon.png" />
<!-- Apple -->
<meta name="apple-mobile-web-app-title" content="Spearhead Systems Incident Response Documentation" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<link rel="apple-touch-icon" href="../../assets/img/icon.png">
<!-- Open Graph -->
<meta property="og:url" content="https://response.spearhead.systems/before/different_roles/" />
<meta property="og:title" content="Different Roles - Spearhead Systems Incident Response Documentation" />
<meta property="og:site_name" content="Spearhead Systems Incident Response Documentation" />
<meta property="og:description" content="A collection of information about the Spearhead Systems incident response process: not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and afterwards." />
<meta property="og:image" content="https://response.spearhead.systems/assets/img/cover.png" />
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<!-- Twitter -->
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="Different Roles - Spearhead Systems Incident Response Documentation" />
<meta name="twitter:description" content="A collection of information about the Spearhead Systems incident response process: not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and afterwards." />
<meta name="twitter:image" content="https://response.spearhead.systems/assets/img/cover.png" />
<!-- Style -->
<style>
@font-face {
font-family: 'Icon';
src: url('../../assets/fonts/icon.eot?52m981');
src: url('../../assets/fonts/icon.eot?#iefix52m981')
format('embedded-opentype'),
url('../../assets/fonts/icon.woff?52m981')
format('woff'),
url('../../assets/fonts/icon.ttf?52m981')
format('truetype'),
url('../../assets/fonts/icon.svg?52m981#icon')
format('svg');
font-weight: normal;
font-style: normal;
}
</style>
<link rel="stylesheet" href="../../assets/stylesheets/application-a422ff04cc.css">
<link rel="stylesheet" href="../../assets/stylesheets/palettes-05ab2406df.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Colfax Regular:400,700|Roboto+Mono">
<style>
body, input {
font-family: 'Colfax Regular', Helvetica, Arial, sans-serif;
}
pre, code {
font-family: 'Roboto Mono', 'Courier New', 'Courier', monospace;
}
</style>
<link rel="stylesheet" href="../../assets/css/extra.css">
<!-- Scripts -->
<script src="../../assets/javascripts/modernizr-4ab42b99fd.js"></script>
</head>
<body class="palette-primary-green palette-accent-blue-grey">
<div class="backdrop">
<div class="backdrop-paper"></div>
</div>
<input class="toggle" type="checkbox" id="toggle-drawer">
<input class="toggle" type="checkbox" id="toggle-search">
<label class="toggle-button overlay" for="toggle-drawer"></label>
<header class="header">
<nav aria-label="Header">
<div class="bar default">
<div class="button button-menu" role="button" aria-label="Menu">
<label class="toggle-button icon icon-menu" for="toggle-drawer">
<span></span>
</label>
</div>
<div class="stretch">
<div class="mainlogo">
<a href="/" title="Go to homepage.">
<img src="../../assets/img/logo.png" title="Spearhead Systems" alt="Spearhead Systems" />
</a>
</div>
<div class="title">
<span class="path">
Incident Response
<i class="icon icon-link"></i>
</span>
<span class="path">
Before an Incident <i class="icon icon-link"></i>
</span>
Different Roles
</div>
</div>
<div class="button button-twitter" role="button" aria-label="Twitter">
<a href="https://twitter.com/spearhead_sys" title="@spearhead_sys on Twitter" target="_blank" class="toggle-button icon icon-twitter"></a>
</div>
<div class="button button-github" role="button" aria-label="GitHub">
<a href="https://github.com/spearheadsys" title="@spearheadsys on GitHub" target="_blank" class="toggle-button icon icon-github"></a>
</div>
<div class="button button-search" role="button" aria-label="Search">
<label class="toggle-button icon icon-search" title="Search" for="toggle-search"></label>
</div>
</div>
<div class="bar search">
<div class="button button-close" role="button" aria-label="Close">
<label class="toggle-button icon icon-back" for="toggle-search"></label>
</div>
<div class="stretch">
<div class="field">
<input class="query" type="text" placeholder="Search" autocapitalize="off" autocorrect="off" autocomplete="off" spellcheck="false">
</div>
</div>
<div class="button button-reset" role="button" aria-label="Search">
<button class="toggle-button icon icon-close" id="reset-search"></button>
</div>
</div>
</nav>
</header>
<main class="main">
<div class="drawer">
<nav aria-label="Navigation">
<a href="https://github.com/spearheadsys/issue-response-docs" class="project">
<!-- <div class="banner">
<div class="logo">
<img src="../../assets/img/icon.png">
</div>
<div class="name">
<strong>
Spearhead Systems Incident Response Documentation
<span class="version">
</span>
</strong>
<br>
spearheadsys/issue-response-docs
</div>
</div> -->
</a>
<div class="scrollable">
<div class="wrapper">
<!--
<ul class="repo">
<li class="repo-download">
<a href="https://github.com/spearheadsys/issue-response-docs/archive/master.zip" target="_blank" title="Download" data-action="download">
<i class="icon icon-download"></i> Download
</a>
</li>
<li class="repo-stars">
<a href="https://github.com/spearheadsys/issue-response-docs/stargazers" target="_blank" title="Stargazers" data-action="star">
<i class="icon icon-star"></i> Stars
<span class="count">&ndash;</span>
</a>
</li>
</ul>
<hr/>
-->
<div class="toc">
<ul>
<li>
<a class="" title="Home" href="../..">
Home
</a>
</li>
<li>
<span class="section">On-Call</span>
<ul>
<li>
<a class="" title="Being On-Call" href="../../oncall/being_oncall/">
Being On-Call
</a>
</li>
<li>
<a class="" title="Alerting Principles" href="../../oncall/alerting_principles/">
Alerting Principles
</a>
</li>
</ul>
</li>
<li>
<span class="section">Before an Incident</span>
<ul>
<li>
<a class="" title="Severity Levels" href="../severity_levels/">
Severity Levels
</a>
</li>
<li>
<a class="current" title="Different Roles" href="./">
Different Roles
</a>
<ul>
<li class="anchor">
<a title="Team Leader (TL)" href="#team-leader-tl">
Team Leader (TL)
</a>
</li>
<li class="anchor">
<a title="Sysadmin" href="#sysadmin">
Sysadmin
</a>
</li>
<li class="anchor">
<a title="Scribe" href="#scribe">
Scribe
</a>
</li>
<li class="anchor">
<a title="Subject Matter Expert" href="#subject-matter-expert">
Subject Matter Expert
</a>
</li>
<li class="anchor">
<a title="Customer Liaison" href="#customer-liaison">
Customer Liaison
</a>
</li>
</ul>
</li>
<li>
<a class="" title="Call Etiquette" href="../call_etiquette/">
Call Etiquette
</a>
</li>
</ul>
</li>
<li>
<span class="section">During an Incident</span>
<ul>
<li>
<a class="" title="During An Incident" href="../../during/during_an_incident/">
During An Incident
</a>
</li>
<li>
<a class="" title="Security Incident" href="../../during/security_incident_response/">
Security Incident
</a>
</li>
</ul>
</li>
<li>
<span class="section">After an Incident</span>
<ul>
<li>
<a class="" title="Post-Mortem Process" href="../../after/post_mortem_process/">
Post-Mortem Process
</a>
</li>
<li>
<a class="" title="Post-Mortem Template" href="../../after/post_mortem_template/">
Post-Mortem Template
</a>
</li>
</ul>
</li>
<li>
<span class="section">Training</span>
<ul>
<li>
<a class="" title="Overview" href="../../training/overview/">
Overview
</a>
</li>
<li>
<a class="" title="Incident Commander" href="../../training/incident_commander/">
Incident Commander
</a>
</li>
<li>
<a class="" title="Deputy" href="../../training/deputy/">
Deputy
</a>
</li>
<li>
<a class="" title="Scribe" href="../../training/scribe/">
Scribe
</a>
</li>
<li>
<a class="" title="Subject Matter Expert" href="../../training/subject_matter_expert/">
Subject Matter Expert
</a>
</li>
<li>
<a class="" title="Glossary" href="../../training/glossary/">
Glossary
</a>
</li>
</ul>
</li>
<li>
<a class="" title="About" href="../../about/">
About
</a>
</li>
</ul>
</div>
</div>
</div>
</nav>
</div>
<article class="article">
<div class="wrapper">
<h1>Different Roles</h1>
<p>There are several roles for our incident response teams at Spearhead Systems. Certain roles have only one person per incident (e.g. support engineer), whereas other roles can have multiple people (e.g. Sysadmins and Solution Architects). It's all about coming together as a team, working the problem, and reaching a resolution quickly.</p>
<p>Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page.</p>
<p><img alt="Incident Response Structure" src="../../assets/img/misc/incident_response_roles.png" /></p>
<hr />
<h2 id="team-leader-tl">Team Leader (TL)<a class="headerlink" href="#team-leader-tl" title="Permanent link">#</a></h2>
<h3 id="what-is-it">What is it?<a class="headerlink" href="#what-is-it" title="Permanent link">#</a></h3>
<p>A Team Leader acts as the single source of truth for what is currently happening and what is going to happen during a major incident. They come in all shapes, sizes, and colors. TLs are also the key people in a project (represented as a board in DoIT).</p>
<h3 id="why-have-one">Why have one?<a class="headerlink" href="#why-have-one" title="Permanent link">#</a></h3>
<p>As any system grows in size and complexity, things break and cause incidents. The TL is needed to help drive major incidents to resolution by organizing their team towards a common goal.</p>
<h3 id="what-are-the-responsibilities">What are the responsibilities?<a class="headerlink" href="#what-are-the-responsibilities" title="Permanent link">#</a></h3>
<ol>
<li>Help prepare for projects and incidents:<ul>
<li>Set up communication channels.</li>
<li>Create the DoIT board(s) and other project planning materials.</li>
<li>Funnel people to these communication channels.</li>
<li>Train team members on how to communicate, and train other TLs.</li>
</ul>
</li>
<li>Drive incidents and projects to resolution:<ul>
<li>Get everyone on the same communication channel.</li>
<li>Collect status information from team members about their services or areas of ownership.</li>
<li>Collect proposed repair actions, then recommend repair actions to be taken.</li>
<li>Delegate all repair actions; the TL is NOT a resolver.</li>
<li>Be the single authority on system status.</li>
<li>Communicate directly with customers and end-users.<ul>
<li>This is done by the TL, not by the engineers themselves!</li>
</ul>
</li>
</ul>
</li>
<li>Post-mortem:<ul>
<li>Create the initial post-mortem template right after the incident, so people can record their thoughts while they are fresh.</li>
<li>Assign the post-mortem once the event is over; this can be done after the call.</li>
<li>Work with Managers/Support on scheduling preventive actions.</li>
</ul>
</li>
</ol>
<h3 id="who-are-they">Who are they?<a class="headerlink" href="#who-are-they" title="Permanent link">#</a></h3>
<p>Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule.</p>
<h3 id="how-can-i-become-one">How can I become one?<a class="headerlink" href="#how-can-i-become-one" title="Permanent link">#</a></h3>
<p>Take a look at our <a href="../../training/incident_commander/">Team Leader training guide</a>.</p>
<hr />
<h2 id="sysadmin">Sysadmin<a class="headerlink" href="#sysadmin" title="Permanent link">#</a></h2>
<h3 id="what-is-it_1">What is it?<a class="headerlink" href="#what-is-it_1" title="Permanent link">#</a></h3>
<p>A Sysadmin is a direct support role for the Team Leader. This is not a shadow role where the person just observes; the Sysadmin is expected to perform important tasks during an incident.</p>
<h3 id="why-have-one_1">Why have one?<a class="headerlink" href="#why-have-one_1" title="Permanent link">#</a></h3>
<p>It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin supports the TL and helps them stay focused on the incident.</p>
<h3 id="what-are-the-responsibilities_1">What are the responsibilities?<a class="headerlink" href="#what-are-the-responsibilities_1" title="Permanent link">#</a></h3>
<p>The Sysadmin is expected to:</p>
<ol>
<li>Raise issues with the TL that might otherwise go unaddressed (keeping an eye on timers that have been started, circling back to items missed during a roll call, etc.).</li>
<li>Be a "hot standby" TL, should the primary TL need to transition to an SME role or otherwise step away from the TL role.</li>
<li>Page SMEs or other on-call engineers as instructed by the Team Leader.</li>
<li>Manage the incident call, and be prepared to remove people from the call if instructed by the Team Leader.</li>
<li>Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), Slack, and email/telephone as necessary.</li>
</ol>
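<p>The first responsibility above includes keeping an eye on timers the TL has started. As an illustration only, the Python sketch below tracks named timers and lists the ones that have expired, so the Sysadmin can prompt the TL. The <code>CheckinTracker</code> class and its methods are hypothetical; they are not part of DoIT, Slack, or any tool we currently use.</p>
<pre><code class="python">import time
from dataclasses import dataclass, field


@dataclass
class CheckinTracker:
    """Hypothetical helper: tracks named timers started by the TL during an incident."""
    # Maps a timer description to the monotonic time (in seconds) at which it expires.
    deadlines: dict = field(default_factory=dict)

    def start(self, name, minutes, now=None):
        """Record a timer, e.g. start("next check-in", 3)."""
        now = time.monotonic() if now is None else now
        self.deadlines[name] = now + minutes * 60

    def clear(self, name):
        """Drop a timer once the TL has circled back to it."""
        self.deadlines.pop(name, None)

    def overdue(self, now=None):
        """Return the names of expired timers, earliest deadline first."""
        now = time.monotonic() if now is None else now
        expired = [
            (deadline, name)
            for name, deadline in self.deadlines.items()
            if now >= deadline
        ]
        return [name for _, name in sorted(expired)]


tracker = CheckinTracker()
tracker.start("next check-in", minutes=3, now=0)
tracker.start("confirm prod-server-387723 restart", minutes=5, now=0)
print(tracker.overdue(now=200))  # ['next check-in']
</code></pre>
<p>Passing <code>now</code> explicitly keeps the example deterministic; by default the sketch uses <code>time.monotonic()</code>, which is not affected by wall-clock adjustments during a long call.</p>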
<h3 id="who-are-they_1">Who are they?<a class="headerlink" href="#who-are-they_1" title="Permanent link">#</a></h3>
<p>Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as a Team Leader, as they may be required to take over command.</p>
<h3 id="how-can-i-become-one_1">How can I become one?<a class="headerlink" href="#how-can-i-become-one_1" title="Permanent link">#</a></h3>
<p>Take a look at our <a href="../../training/deputy/">Sysadmin training guide</a>. Sysadmins also need to be <a href="../../training/incident_commander/">trained as Team Leaders</a>.</p>
<hr />
<p>TODO: Move the Scribe responsibilities to the TL and Sysadmin roles, or assign them to our junior staff?</p>
<h2 id="scribe">Scribe<a class="headerlink" href="#scribe" title="Permanent link">#</a></h2>
<h3 id="what-is-it_2">What is it?<a class="headerlink" href="#what-is-it_2" title="Permanent link">#</a></h3>
<p>A Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review.</p>
<h3 id="why-have-one_2">Why have one?<a class="headerlink" href="#why-have-one_2" title="Permanent link">#</a></h3>
<p>The Team Leader needs to focus on the problem at hand, and the subject matter experts need to focus on resolving the incident. It is important to capture a timeline of events as they happen, so that it can be reviewed during the post-mortem to determine how well we performed, and so that we can accurately determine any additional impact we might not have noticed at the time.</p>
<h3 id="what-are-the-responsibilities_2">What are the responsibilities?<a class="headerlink" href="#what-are-the-responsibilities_2" title="Permanent link">#</a></h3>
<p>The Scribe is expected to:</p>
<ol>
<li>Ensure the incident call is being recorded.</li>
<li>Record important data, events, and actions in Slack as they happen. Specifically:<ul>
<li>Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock")</li>
<li>Status reports when the TL provides one (Example: "We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next check-in in 3 minutes")</li>
<li>Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.")</li>
</ul>
</li>
</ol>
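<p>A consistent format makes the Slack notes easier to turn into a post-mortem timeline. The Python sketch below shows one possible format, assuming UTC timestamps and the three kinds of notes listed above. <code>NoteKind</code>, <code>format_note</code>, and the <code>ACTION</code>/<code>STATUS</code>/<code>NOTE</code> labels are illustrative, not an agreed convention.</p>
<pre><code class="python">from datetime import datetime, timezone
from enum import Enum


class NoteKind(Enum):
    # The three kinds of notes the Scribe records (labels are illustrative).
    ACTION = "ACTION"
    STATUS = "STATUS"
    CALLOUT = "NOTE"


def format_note(kind, text, at, author=None):
    """Build one timeline line for the Scribe to post in Slack.

    Timestamps are rendered in UTC so notes can later be lined up with logs.
    """
    stamp = at.astimezone(timezone.utc).strftime("%H:%M:%S UTC")
    who = f" ({author})" if author else ""
    return f"[{stamp}] {kind.value}{who}: {text}"


at = datetime(2024, 1, 1, 14, 5, 0, tzinfo=timezone.utc)
print(format_note(NoteKind.ACTION, "prod-server-387723 is being restarted", at))
# [14:05:00 UTC] ACTION: prod-server-387723 is being restarted
print(format_note(NoteKind.CALLOUT, "We should have a better way to determine stuck locks.", at, author="Bob B"))
# [14:05:00 UTC] NOTE (Bob B): We should have a better way to determine stuck locks.
</code></pre>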
<h3 id="who-are-they_2">Who are they?<a class="headerlink" href="#who-are-they_2" title="Permanent link">#</a></h3>
<p>Anyone can act as the Scribe during an incident; the Scribe is chosen by the Team Leader at the start of the call. Typically the Sysadmin acts as the Scribe, but this is not required, and for larger incidents it may not be possible.</p>
<h3 id="how-can-i-become-one_2">How can I become one?<a class="headerlink" href="#how-can-i-become-one_2" title="Permanent link">#</a></h3>
<p>Follow our <a href="../../training/scribe/">Scribe training guide</a>, and then let the Team Leaders know that you would like to be considered as Scribe for the next incident.</p>
<p>TODO: End of the Scribe responsibilities to move to the TL and Sysadmin roles.</p>
<hr />
<h2 id="subject-matter-expert">Subject Matter Expert<a class="headerlink" href="#subject-matter-expert" title="Permanent link">#</a></h2>
<h3 id="what-is-it_3">What is it?<a class="headerlink" href="#what-is-it_3" title="Permanent link">#</a></h3>
<p>A Subject Matter Expert (SME), sometimes called a "Resolver" or "Architect", is a domain expert or designated owner of a component or service that is part of the Spearhead Systems service delivery concept.</p>
<h3 id="why-have-one_3">Why have one?<a class="headerlink" href="#why-have-one_3" title="Permanent link">#</a></h3>
<p>The TL and Sysadmins are not all-knowing super beings. When there is a problem with a service or a particular system, an expert in that service is needed to quickly help identify and fix the issue.</p>
<h3 id="what-are-the-responsibilities_3">What are the responsibilities?<a class="headerlink" href="#what-are-the-responsibilities_3" title="Permanent link">#</a></h3>
<ol>
<li>Being able to diagnose common problems with the service.</li>
<li>Being able to rapidly fix issues found during an incident.</li>
<li>Concise communication skills, specifically for CAN reports:<ul>
<li>Condition: What is the current state of the service? Is it healthy or not?</li>
<li>Actions: What actions need to be taken if the service is not in a healthy state?</li>
<li>Needs: What support does the resolver need to perform an action?</li>
</ul>
</li>
</ol>
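<p>To illustrate the CAN structure, the Python sketch below models a report as a small data class that renders the three parts as short lines an SME could read out on the call. The <code>CanReport</code> name and its fields are hypothetical; only the Condition/Actions/Needs breakdown comes from this page.</p>
<pre><code class="python">from dataclasses import dataclass, field


@dataclass
class CanReport:
    """Hypothetical structure for an SME's CAN report (Condition, Actions, Needs)."""
    service: str
    healthy: bool
    condition: str
    actions: list = field(default_factory=list)
    needs: list = field(default_factory=list)

    def render(self):
        """Return the report as short lines, one per CAN component."""
        state = "healthy" if self.healthy else "NOT healthy"
        lines = [f"Condition: {self.service} is {state}. {self.condition}"]
        # Actions and Needs only apply when the service is not healthy.
        if not self.healthy:
            lines.append("Actions: " + ("; ".join(self.actions) or "none proposed yet"))
            lines.append("Needs: " + ("; ".join(self.needs) or "nothing"))
        return "\n".join(lines)


report = CanReport(
    service="service A",
    healthy=False,
    condition="It is not processing events due to a stuck lock.",
    actions=["restart the app stack"],
    needs=["approval from the TL to restart"],
)
print(report.render())
# Condition: service A is NOT healthy. It is not processing events due to a stuck lock.
# Actions: restart the app stack
# Needs: approval from the TL to restart
</code></pre>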
<h3 id="who-are-they_3">Who are they?<a class="headerlink" href="#who-are-they_3" title="Permanent link">#</a></h3>
<p>Anyone who is considered a "domain expert" can act as a resolver for an incident. Typically, the service's primary on-call engineer acts as the SME for that service.</p>
<h3 id="how-can-i-become-one_3">How can I become one?<a class="headerlink" href="#how-can-i-become-one_3" title="Permanent link">#</a></h3>
<p>Take a look at our <a href="../../training/subject_matter_expert/">Subject Matter Expert training guide</a>. You should also discuss with your team and service owner to determine what the requirements are for your particular service.</p>
<hr />
<h2 id="customer-liaison">Customer Liaison<a class="headerlink" href="#customer-liaison" title="Permanent link">#</a></h2>
<h3 id="what-is-it_4">What is it?<a class="headerlink" href="#what-is-it_4" title="Permanent link">#</a></h3>
<p>The Customer Liaison is responsible for interacting with customers, either directly or via our public communication channels. This is typically a member of the Customer Support team.</p>
<h3 id="why-have-one_4">Why have one?<a class="headerlink" href="#why-have-one_4" title="Permanent link">#</a></h3>
<p>While all of the other roles are actively working to identify the cause and resolve the issue, we need a role focused purely on customer interaction, so that it is handled properly and with the care and attention it needs.</p>
<h3 id="what-are-the-responsibilities_4">What are the responsibilities?<a class="headerlink" href="#what-are-the-responsibilities_4" title="Permanent link">#</a></h3>
<ol>
<li>Post any public-facing messages regarding the incident (DoIT, Twitter, StatusPage, etc.).</li>
<li>Notify the TL of any customers reporting that they are affected by the incident.</li>
</ol>
<h3 id="who-are-they_4">Who are they?<a class="headerlink" href="#who-are-they_4" title="Permanent link">#</a></h3>
<p>Any member of the Support Team can act as a customer liaison.</p>
<h3 id="how-can-i-become-one_4">How can I become one?<a class="headerlink" href="#how-can-i-become-one_4" title="Permanent link">#</a></h3>
<p>Talk to the Support Team about becoming our next Customer Liaison.</p>
<aside class="copyright" role="note">
Copyright &copy; Spearhead Systems, Inc. &ndash;
Documentation built with
<a href="http://www.mkdocs.org" target="_blank">MkDocs</a>
using the
<a href="http://squidfunk.github.io/mkdocs-material/" target="_blank">
Material
</a>
theme.
</aside>
<footer class="footer">
<nav class="pagination" aria-label="Footer">
<div class="previous">
<a href="../severity_levels/" title="Severity Levels">
<span class="direction">
Previous
</span>
<div class="page">
<div class="button button-previous" role="button" aria-label="Previous">
<i class="icon icon-back"></i>
</div>
<div class="stretch">
<div class="title">
Severity Levels
</div>
</div>
</div>
</a>
</div>
<div class="next">
<a href="../call_etiquette/" title="Call Etiquette">
<span class="direction">
Next
</span>
<div class="page">
<div class="stretch">
<div class="title">
Call Etiquette
</div>
</div>
<div class="button button-next" role="button" aria-label="Next">
<i class="icon icon-forward"></i>
</div>
</div>
</a>
</div>
</nav>
</footer>
</div>
</article>
<div class="results" role="status" aria-live="polite">
<div class="scrollable">
<div class="wrapper">
<div class="meta"></div>
<div class="list"></div>
</div>
</div>
</div>
</main>
<script>
var base_url = '../..';
var repo_id = 'spearheadsys/issue-response-docs';
</script>
<script src="../../assets/javascripts/application-997097ee0c.js"></script>
</body>
</html>