postmortem specifics to sph

Marius Pana 2017-01-21 14:39:08 +02:00
parent e85a239ff8
commit 0003db92e2


@@ -1,4 +1,4 @@
This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section.
This is a standard template for post-mortems. Each section describes the type of information you will want to put in that section.
---
@@ -11,10 +11,10 @@ This is a standard template we use for post-mortems at PagerDuty. Each section d
** Meeting Scheduled For:** _Schedule the meeting on the "Incident Post-Mortem Meetings" shared calendar, for within 5 business days after the incident. Put the date/time here._
** Call Recording:** _Link to the incident call recording._
** Call Recording:** _Link to the incident call recording, Slack transcript, or DoIT card._
## Overview
_Include a **short** sentence or two summarizing the root cause, timeline summary, and the impact. E.g. "On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."_
_Include a **short** sentence or two summarizing the root cause, timeline summary, and the impact. E.g. "On the morning of August 99th, we suffered a 1 minute IN-3 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."_
## What Happened
_Include a short description of what happened._
@@ -28,8 +28,8 @@ _Include a description what solved the problem. If there was a temporary fix in
## Impact
_Be very specific here, include exact numbers._
| Time in SEV-1 | ?mins |
| Time in SEV-2 | ?mins |
| Time in SR-3 | ?mins |
| Time in IN-3 | ?mins |
| Notifications Delivered out of SLA | ??% (?? of ??) |
| Events Dropped / Not Accepted | ??% (?? of ??) _Should usually be 0, but always check_ |
| Accounts Affected | ?? |
@@ -38,13 +38,13 @@ _Be very specific here, include exact numbers._
## Responders
* _Who was the IC?_
* _Who was the TL?_
* _Who was the scribe?_
* _Who else was involved?_
* _Who else was involved?_
## Timeline
_Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at._
_Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the IN-3 ended, (6) links to tools/logs that show how the timestamp was arrived at._
| Time (UTC) | Event | Data Link |
| ---------- | ----- | --------- |
@@ -60,7 +60,7 @@ _Some important times to include: (1) time the root cause began, (2) time of the
* _List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes._
## Action Items
_Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: “sev1_YYYYMMDD” (such as sev1_20150911) and simply “sev1”. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process._
_Each action item should be in the form of a DoIT card that respects the GTD next-actions principle: "a clear and concise single action to move things forward". Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process._
## Messaging
@@ -70,7 +70,7 @@ _This is a follow-up for employees. It should be sent out right after the post-m
> Briefly summarize what happened and where the post-mortem page (this page) can be found.
### External Message
_This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)_
_This is what will be included on the public-facing status website (status.spearhead.systems) regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)_
> Summary