For every major issue (SR/IN +major), we need to follow up with a post-mortem. A blame-free, detailed description, of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included.
The first step is that a post-mortem owner will be designated. This is done by the TL either at the end of a major incident call, or very shortly after. You will be notified directly by the TL if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use DoIT and Slack for coordinating followup. A detailed list of the steps is available below,
* Scheduling the post-mortem meeting (on a shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).
1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.
* Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.
1. Begin populating the page with all of the information you have.
* The timeline should be the main focus to begin with.
* The timeline should include important changes in status/impact, and also key actions taken by responders.
* For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Check_MK graph, a logwatch search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
* Be careful with creating too many cards. Generally we only want to create things that are of top priority. Things that absolutely should be dealt with.
1. Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.
* Avoid using the word "outage" unless it really was a full outage, use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.
* Look at other examples of previous post-mortems to see the kind of thing you should send.
## Post-Mortem Meeting
These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us.
You should invite the following people to the post-mortem meeting,
* [A List of Post-mortems!](https://github.com/danluu/post-mortems)
## Useful Resources
* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011)