<!DOCTYPE html>
<!--[if lt IE 7 ]><html class="no-js ie6"><![endif]-->
<!--[if IE 7 ]><html class="no-js ie7"><![endif]-->
<!--[if IE 8 ]><html class="no-js ie8"><![endif]-->
<!--[if IE 9 ]><html class="no-js ie9"><![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Post-Mortem Process - Spearhead Systems Incident Response Documentation</title>
<!-- Author and License -->
<meta name="author" content="Spearhead Systems, Inc." />
<meta name="dcterms.license" content="http://www.apache.org/licenses/LICENSE-2.0" />
<!-- Page Description -->
<meta name="keywords" content="spearhead, incident, response" />
<meta name="robots" content="index, follow, noarchive" />
<!-- Mobile -->
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0" />
<meta name="theme-color" content="#1f293a" />
<!-- Canonical Link -->
<link rel="canonical" href="https://response.spearhead.systems/after/post_mortem_process/">
<!-- Favicon -->
<link rel="shortcut icon" type="image/x-icon" href="../../assets/img/icon.png" />
<link rel="icon" type="image/x-icon" href="../../assets/img/icon.png" />
<!-- Apple -->
<meta name="apple-mobile-web-app-title" content="Spearhead Systems Incident Response Documentation" />
<meta name="apple-mobile-web-app-capable" content="yes" />
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />
<link rel="apple-touch-icon" href="../../assets/img/icon.png">
<!-- Open Graph -->
<meta property="og:url" content="https://response.spearhead.systems/after/post_mortem_process/" />
<meta property="og:title" content="Post-Mortem Process - Spearhead Systems Incident Response Documentation" />
<meta property="og:site_name" content="Spearhead Systems Incident Response Documentation" />
<meta property="og:description" content="A collection of information about the Spearhead Systems incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work." />
<meta property="og:image" content="https://response.spearhead.systems/assets/img/cover.png" />
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<!-- Twitter -->
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:title" content="Post-Mortem Process - Spearhead Systems Incident Response Documentation" />
<meta name="twitter:description" content="A collection of information about the Spearhead Systems incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work." />
<meta name="twitter:image" content="https://response.spearhead.systems/assets/img/cover.png" />
<!-- Style -->
<style>
@font-face {
font-family: 'Icon';
src: url('../../assets/fonts/icon.eot?52m981');
src: url('../../assets/fonts/icon.eot?#iefix52m981')
format('embedded-opentype'),
url('../../assets/fonts/icon.woff?52m981')
format('woff'),
url('../../assets/fonts/icon.ttf?52m981')
format('truetype'),
url('../../assets/fonts/icon.svg?52m981#icon')
format('svg');
font-weight: normal;
font-style: normal;
}
</style>
<link rel="stylesheet" href="../../assets/stylesheets/application-a422ff04cc.css">
<link rel="stylesheet" href="../../assets/stylesheets/palettes-05ab2406df.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Colfax Regular:400,700|Roboto+Mono">
<style>
body, input {
font-family: 'Colfax Regular', Helvetica, Arial, sans-serif;
}
pre, code {
font-family: 'Roboto Mono', 'Courier New', 'Courier', monospace;
}
</style>
<link rel="stylesheet" href="../../assets/css/extra.css">
<!-- Scripts -->
<script src="../../assets/javascripts/modernizr-4ab42b99fd.js"></script>
</head>
<body class="palette-primary-green palette-accent-blue-grey">
<div class="backdrop">
<div class="backdrop-paper"></div>
</div>
<input class="toggle" type="checkbox" id="toggle-drawer">
<input class="toggle" type="checkbox" id="toggle-search">
<label class="toggle-button overlay" for="toggle-drawer"></label>
<header class="header">
<nav aria-label="Header">
<div class="bar default">
<div class="button button-menu" role="button" aria-label="Menu">
<label class="toggle-button icon icon-menu" for="toggle-drawer">
<span></span>
</label>
</div>
<div class="stretch">
<div class="mainlogo">
<a href="/" title="Go to homepage.">
<img src="../../assets/img/logo.png" title="Spearhead Systems" />
</a>
</div>
<div class="title">
<span class="path">
Incident Response
<i class="icon icon-link"></i>
</span>
<span class="path">
After an Incident <i class="icon icon-link"></i>
</span>
Post-Mortem Process
</div>
</div>
<div class="button button-twitter" role="button" aria-label="Twitter">
<a href="https://twitter.com/spearhead_sys" title="@spearhead_sys on Twitter" target="_blank" class="toggle-button icon icon-twitter"></a>
</div>
<div class="button button-github" role="button" aria-label="GitHub">
<a href="https://github.com/spearheadsys" title="@spearheadsys on GitHub" target="_blank" class="toggle-button icon icon-github"></a>
</div>
<div class="button button-search" role="button" aria-label="Search">
<label class="toggle-button icon icon-search" title="Search" for="toggle-search"></label>
</div>
</div>
<div class="bar search">
<div class="button button-close" role="button" aria-label="Close">
<label class="toggle-button icon icon-back" for="toggle-search"></label>
</div>
<div class="stretch">
<div class="field">
<input class="query" type="text" placeholder="Search" autocapitalize="off" autocorrect="off" autocomplete="off" spellcheck="false">
</div>
</div>
<div class="button button-reset" role="button" aria-label="Search">
<button class="toggle-button icon icon-close" id="reset-search"></button>
</div>
</div>
</nav>
</header>
<main class="main">
<div class="drawer">
<nav aria-label="Navigation">
<a href="https://github.com/spearheadsys/issue-response-docs" class="project">
<div class="banner">
<div class="logo">
<img src="../../assets/img/icon.png">
</div>
<div class="name">
<strong>
Spearhead Systems Incident Response Documentation
<span class="version">
</span>
</strong>
<br>
spearheadsys/issue-response-docs
</div>
</div>
</a>
<div class="scrollable">
<div class="wrapper">
<ul class="repo">
<li class="repo-download">
<a href="https://github.com/spearheadsys/issue-response-docs/archive/master.zip" target="_blank" title="Download" data-action="download">
<i class="icon icon-download"></i> Download
</a>
</li>
<li class="repo-stars">
<a href="https://github.com/spearheadsys/issue-response-docs/stargazers" target="_blank" title="Stargazers" data-action="star">
<i class="icon icon-star"></i> Stars
<span class="count">&ndash;</span>
</a>
</li>
</ul>
<hr/>
<div class="toc">
<ul>
<li>
<a class="" title="Home" href="../..">
Home
</a>
</li>
<li>
<span class="section">On-Call</span>
<ul>
<li>
<a class="" title="Being On-Call" href="../../oncall/being_oncall/">
Being On-Call
</a>
</li>
<li>
<a class="" title="Alerting Principles" href="../../oncall/alerting_principles/">
Alerting Principles
</a>
</li>
</ul>
</li>
<li>
<span class="section">Before an Incident</span>
<ul>
<li>
<a class="" title="Severity Levels" href="../../before/severity_levels/">
Severity Levels
</a>
</li>
<li>
<a class="" title="Different Roles" href="../../before/different_roles/">
Different Roles
</a>
</li>
<li>
<a class="" title="Call Etiquette" href="../../before/call_etiquette/">
Call Etiquette
</a>
</li>
</ul>
</li>
<li>
<span class="section">During an Incident</span>
<ul>
<li>
<a class="" title="During An Incident" href="../../during/during_an_incident/">
During An Incident
</a>
</li>
<li>
<a class="" title="Security Incident" href="../../during/security_incident_response/">
Security Incident
</a>
</li>
</ul>
</li>
<li>
<span class="section">After an Incident</span>
<ul>
<li>
<a class="current" title="Post-Mortem Process" href="./">
Post-Mortem Process
</a>
<ul>
<li class="anchor">
<a title="Owner Designation" href="#owner-designation">
Owner Designation
</a>
</li>
<li class="anchor">
<a title="Owner Responsibilities" href="#owner-responsibilities">
Owner Responsibilities
</a>
</li>
<li class="anchor">
<a title="Post-Mortem Wiki Page" href="#post-mortem-wiki-page">
Post-Mortem Wiki Page
</a>
</li>
<li class="anchor">
<a title="Post-Mortem Meeting" href="#post-mortem-meeting">
Post-Mortem Meeting
</a>
</li>
<li class="anchor">
<a title="Examples" href="#examples">
Examples
</a>
</li>
<li class="anchor">
<a title="Useful Resources" href="#useful-resources">
Useful Resources
</a>
</li>
</ul>
</li>
<li>
<a class="" title="Post-Mortem Template" href="../post_mortem_template/">
Post-Mortem Template
</a>
</li>
</ul>
</li>
<li>
<span class="section">Training</span>
<ul>
<li>
<a class="" title="Overview" href="../../training/overview/">
Overview
</a>
</li>
<li>
<a class="" title="Incident Commander" href="../../training/incident_commander/">
Incident Commander
</a>
</li>
<li>
<a class="" title="Deputy" href="../../training/deputy/">
Deputy
</a>
</li>
<li>
<a class="" title="Scribe" href="../../training/scribe/">
Scribe
</a>
</li>
<li>
<a class="" title="Subject Matter Expert" href="../../training/subject_matter_expert/">
Subject Matter Expert
</a>
</li>
<li>
<a class="" title="Glossary" href="../../training/glossary/">
Glossary
</a>
</li>
</ul>
</li>
<li>
<a class="" title="About" href="../../about/">
About
</a>
</li>
</ul>
</div>
</div>
</div>
</nav>
</div>
<article class="article">
<div class="wrapper">
<h1>Post-Mortem Process</h1>
<p>For every major incident (SEV-2/1), we need to follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again. The post-mortem should also cover the incident response process itself.</p>
<p><img alt="Post-Mortem" src="../../assets/img/headers/pagerduty_post_mortem.jpg" /></p>
<h2 id="owner-designation">Owner Designation<a class="headerlink" href="#owner-designation" title="Permanent link">#</a></h2>
<p>The first step is to designate a post-mortem owner. The IC does this either at the end of a major incident call or very shortly after, and will notify you directly if you are the owner. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below.</p>
<h2 id="owner-responsibilities">Owner Responsibilities<a class="headerlink" href="#owner-responsibilities" title="Permanent link">#</a></h2>
<p>As owner of a post-mortem, you are responsible for the following:</p>
<ul>
<li>Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).</li>
<li>Updating the page with all of the necessary content.</li>
<li>Investigating the incident, pulling in whomever you need from other teams to assist in the investigation.</li>
<li>Creating follow-up JIRA tickets (<em>You are only responsible for creating the tickets, not following them up to resolution</em>).</li>
<li>Running the post-mortem meeting (<em>these generally run themselves, but you should get people back on topic if the conversation starts to wander</em>).</li>
<li>In cases where we need a public blog post, creating &amp; reviewing it with appropriate parties.</li>
</ul>
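<p>The "within 5 business days" deadline can be worked out by hand, but a small helper makes it harder to miscount across a weekend. The following is a minimal sketch, not part of any official tooling. It assumes business days are Monday to Friday and ignores public holidays; the function name <code>add_business_days</code> and the example date are hypothetical.</p>

```python
from datetime import date, timedelta

# Assumption: Saturday (5) and Sunday (6) are the only non-business days.
WEEKEND = (5, 6)


def add_business_days(start, days):
    """Return the date that falls `days` business days after `start`."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() not in WEEKEND:
            remaining -= 1
    return current


# Example: an incident on Wednesday 2024-03-06 (date chosen for illustration).
incident_day = date(2024, 3, 6)
deadline = add_business_days(incident_day, 5)
print("Hold the post-mortem meeting on or before", deadline.isoformat())
```

<p>If your team observes public holidays, you would need to extend the check with a holiday list before relying on the result.</p>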
<h2 id="post-mortem-wiki-page">Post-Mortem Wiki Page<a class="headerlink" href="#post-mortem-wiki-page" title="Permanent link">#</a></h2>
<p>Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.</p>
<ol>
<li>
<p>(If not already done by the IC) Create a new post-mortem page for the incident.</p>
</li>
<li>
<p>Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.</p>
<ul>
<li>Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.</li>
</ul>
</li>
<li>
<p>Begin populating the page with all of the information you have.</p>
<ul>
<li>The timeline should be the main focus to begin with.<ul>
<li>The timeline should include important changes in status/impact, and also key actions taken by responders.</li>
<li>You should mark the start of the incident in red, and the resolution in green (that is, the points at which we went into and out of SEV).</li>
</ul>
</li>
<li>Go through the history in Slack to identify the responders, and add them to the page.<ul>
<li>Identify the Incident Commander and Scribe in this list.</li>
</ul>
</li>
</ul>
</li>
<li>
<p>Populate the page with more detailed information.</p>
<ul>
<li>For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.</li>
</ul>
</li>
<li>
<p>Perform an analysis of the incident.</p>
<ul>
<li>Capture all available data regarding the incident. What caused it, how many customers were affected, etc.</li>
<li>Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered.</li>
<li>Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery).</li>
<li>Identify the underlying cause of the incident (what happened, and why it happened).</li>
</ul>
</li>
<li>
<p>Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets).</p>
<ul>
<li>Go through the history in Slack to identify any TODO items.</li>
<li>Label all tickets with their severity level and date tags.</li>
<li>Identify any actions which can reduce recurrence of the incident.<ul>
<li>(There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it).</li>
</ul>
</li>
<li>Identify any actions which can make our incident response process better.</li>
<li>Be careful not to create too many tickets. Generally we only want to create P0/P1 tickets: things that absolutely should be dealt with.</li>
</ul>
</li>
<li>
<p>Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.</p>
<ul>
<li>Avoid using the word "outage" unless it really was a full outage; use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.</li>
<li>Look at other examples of previous post-mortems to see the kind of thing you should send.</li>
</ul>
</li>
</ol>
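<p>Two of the steps above lend themselves to a quick mechanical check: keeping the timeline in chronological order, and making sure every timeline item links to the data it came from. The following is a minimal sketch under stated assumptions. The <code>TimelineEntry</code> structure, its field names, and the example entries and URLs are all hypothetical; they are not an export format of any wiki or chat tool.</p>

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One row of the post-mortem timeline (hypothetical structure)."""
    when: datetime
    description: str
    evidence_url: str = ""  # link to a graph, log search, tweet, etc.


def sort_timeline(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda entry: entry.when)


def missing_evidence(entries):
    """Return the entries that do not yet link to a data source."""
    return [entry for entry in entries if not entry.evidence_url.strip()]


# Illustrative data only; the URLs are placeholders.
timeline = [
    TimelineEntry(datetime(2024, 3, 6, 14, 20), "Rolled back deploy"),
    TimelineEntry(datetime(2024, 3, 6, 14, 5), "Incident start",
                  "https://example.com/graph/1"),
    TimelineEntry(datetime(2024, 3, 6, 14, 45), "Incident resolved",
                  "https://example.com/graph/2"),
]

ordered = sort_timeline(timeline)
for entry in missing_evidence(ordered):
    print("Needs a source:", entry.when.isoformat(), entry.description)
```

<p>A check like this only flags gaps; deciding which graph or search best illustrates each data point is still up to the post-mortem owner.</p>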
<h2 id="post-mortem-meeting">Post-Mortem Meeting<a class="headerlink" href="#post-mortem-meeting" title="Permanent link">#</a></h2>
<p>These meetings should generally last 15-30 minutes, and are intended to be a wrap-up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us.</p>
<p>You should invite the following people to the post-mortem meeting:</p>
<ul>
<li>Always<ul>
<li>The incident commander.</li>
<li>Service owners involved in the incident.</li>
<li>Key engineer(s)/responders involved in the incident.</li>
</ul>
</li>
<li>Optional<ul>
<li>Customer liaison. (Only SEV-1 incidents)</li>
</ul>
</li>
</ul>
<p>A general agenda for the meeting would be something like:</p>
<ol>
<li>Recap the timeline, to make sure everyone agrees and is on the same page.</li>
<li>Recap important points, and any unusual items.</li>
<li>Discuss how the problem could've been caught.<ul>
<li>Did it show up in canary?</li>
<li>Could it have been caught in tests, or in the load test environment?</li>
</ul>
</li>
<li>Discuss customer impact. Any comments from customers, etc.</li>
<li>Review the action items that have been created, and discuss whether they are appropriate or whether more are needed.</li>
</ol>
<h2 id="examples">Examples<a class="headerlink" href="#examples" title="Permanent link">#</a></h2>
<p>Here are some examples of post-mortems from other companies as a reference:</p>
<ul>
<li><a href="https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc">Stripe</a></li>
<li><a href="https://blog.lastpass.com/2015/06/lastpass-security-notice.html/comment-page-2/">LastPass</a></li>
<li><a href="https://aws.amazon.com/message/5467D2/">AWS</a></li>
<li><a href="https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html">Twilio</a></li>
<li><a href="https://status.heroku.com/incidents/151">Heroku</a></li>
<li><a href="http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html">Netflix</a></li>
<li><a href="https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016">GOV.UK Rail Accident Investigation</a></li>
<li><a href="https://github.com/danluu/post-mortems">A List of Post-mortems!</a></li>
</ul>
<h2 id="useful-resources">Useful Resources<a class="headerlink" href="#useful-resources" title="Permanent link">#</a></h2>
<ul>
<li><a href="http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011">Advanced PostMortem Fu and Human Error 101 (Velocity 2011)</a></li>
<li><a href="http://fractio.nl/2015/10/30/blame-language-sharing/">Blame. Language. Sharing.</a></li>
</ul>
<aside class="copyright" role="note">
Copyright &copy; Spearhead Systems, Inc. &ndash;
Documentation built with
<a href="http://www.mkdocs.org" target="_blank">MkDocs</a>
using the
<a href="http://squidfunk.github.io/mkdocs-material/" target="_blank">
Material
</a>
theme.
</aside>
<footer class="footer">
<nav class="pagination" aria-label="Footer">
<div class="previous">
<a href="../../during/security_incident_response/" title="Security Incident">
<span class="direction">
Previous
</span>
<div class="page">
<div class="button button-previous" role="button" aria-label="Previous">
<i class="icon icon-back"></i>
</div>
<div class="stretch">
<div class="title">
Security Incident
</div>
</div>
</div>
</a>
</div>
<div class="next">
<a href="../post_mortem_template/" title="Post-Mortem Template">
<span class="direction">
Next
</span>
<div class="page">
<div class="stretch">
<div class="title">
Post-Mortem Template
</div>
</div>
<div class="button button-next" role="button" aria-label="Next">
<i class="icon icon-forward"></i>
</div>
</div>
</a>
</div>
</nav>
</footer>
</div>
</article>
<div class="results" role="status" aria-live="polite">
<div class="scrollable">
<div class="wrapper">
<div class="meta"></div>
<div class="list"></div>
</div>
</div>
</div>
</main>
<script>
var base_url = '../..';
var repo_id = 'spearheadsys/issue-response-docs';
</script>
<script src="../../assets/javascripts/application-997097ee0c.js"></script>
</body>
</html>