refactor to sphs specifics
parent 5b4f9a148f
commit e85a239ff8
@@ -1,24 +1,24 @@
|
|||||||
For every major incident (SEV-2/1), we need to follow up with a post-mortem. A blame-free, detailed description, of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included.
|
For every major issue (SR/IN +major), we need to follow up with a post-mortem. A blame-free, detailed description, of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included.
|
||||||
|
|
||||||
![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg)
|
![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg)
|
||||||
|
|
||||||
## Owner Designation
|
## Owner Designation
|
||||||
The first step is that a post-mortem owner will be designated. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below,
|
The first step is that a post-mortem owner will be designated. This is done by the TL either at the end of a major incident call, or very shortly after. You will be notified directly by the TL if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use DoIT and Slack for coordinating followup. A detailed list of the steps is available below,
|
||||||
|
|
||||||
## Owner Responsibilities
|
## Owner Responsibilities
|
||||||
As owner of a post-mortem, you are responsible for the following,
|
As owner of a post-mortem, you are responsible for the following,
|
||||||
|
|
||||||
* Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).
|
* Scheduling the post-mortem meeting (on a shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).
|
||||||
* Updating the page with all of the necessary content.
|
* Updating the page with all of the necessary content.
|
||||||
* Investigating the incident, pulling in whomever you need from other teams to assist in the investigation.
|
* Investigating the incident, pulling in whomever you need from other teams to assist in the investigation.
|
||||||
* Creating follow-up JIRA tickets (_You are only responsible for creating the tickets, not following them up to resolution_).
|
* Creating follow-up DoIT cards (_You are only responsible for creating the cards, not following them up to resolution_).
|
||||||
* Running the post-mortem meeting (_these generally run themselves, but you should get people back on topic if the conversation starts to wander_).
|
* Running the post-mortem meeting (_these generally run themselves, but you should get people back on topic if the conversation starts to wander_).
|
||||||
* In cases where we need a public blog post, creating & reviewing it with appropriate parties.
|
* In cases where we need a public blog post, creating & reviewing it with appropriate parties.
|
||||||
|
|
||||||
## Post-Mortem Wiki Page
|
## Post-Mortem Wiki Page
|
||||||
Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.
|
Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.
|
||||||
|
|
||||||
1. (If not already done by the IC) Create a new post-mortem page for the incident.
|
1. (If not already done by the TL) Create a new post-mortem page for the incident.
|
||||||
|
|
||||||
1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.
|
1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.
|
||||||
* Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.
|
* Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.
|
||||||
@@ -26,12 +26,12 @@ Once you've been designated as the owner of a post-mortem, you should start upda
|
|||||||
1. Begin populating the page with all of the information you have.
|
1. Begin populating the page with all of the information you have.
|
||||||
* The timeline should be the main focus to begin with.
|
* The timeline should be the main focus to begin with.
|
||||||
* The timeline should include important changes in status/impact, and also key actions taken by responders.
|
* The timeline should include important changes in status/impact, and also key actions taken by responders.
|
||||||
* You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV).
|
* You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SR/IN +major).
|
||||||
* Go through the history in Slack to identify the responders, and add them to the page.
|
* Go through the history in DoIT and Slack to identify the responders, and add them to the page.
|
||||||
* Identify the Incident Commander and Scribe in this list.
|
* Identify the Team Leader and Scribe in this list.
|
||||||
|
|
||||||
1. Populate the page with more detailed information.
|
1. Populate the page with more detailed information.
|
||||||
* For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
|
* For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Check_MK graph, a logwatch search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
|
||||||
|
|
||||||
1. Perform an analysis of the incident.
|
1. Perform an analysis of the incident.
|
||||||
* Capture all available data regarding the incident. What caused it, how many customers were affected, etc.
|
* Capture all available data regarding the incident. What caused it, how many customers were affected, etc.
|
||||||
@@ -39,13 +39,13 @@ Once you've been designated as the owner of a post-mortem, you should start upda
|
|||||||
* Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery)
|
* Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery)
|
||||||
* Identify the underlying cause of the incident (What happened, and why did it happen).
|
* Identify the underlying cause of the incident (What happened, and why did it happen).
|
||||||
|
|
||||||
1. Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets),
|
1. Create any followup action DoIT cards (or note down topics for discussion if we need to decide on a direction to go before creating cards),
|
||||||
* Go through the history in Slack to identify any TODO items.
|
* Go through the history in DoIT and Slack to identify any TODO items.
|
||||||
* Label all tickets with their severity level and date tags.
|
* Label all tickets with their severity level and date tags.
|
||||||
* Any actions which can reduce re-occurrence of the incident.
|
* Any actions which can reduce re-occurrence of the incident.
|
||||||
* (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it).
|
* (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it).
|
||||||
* Identify any actions which can make our incident response process better.
|
* Identify any actions which can make our incident response process better.
|
||||||
* Be careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with.
|
* Be careful with creating too many cards. Generally we only want to create things that are of top priority. Things that absolutely should be dealt with.
|
||||||
|
|
||||||
1. Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.
|
1. Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.
|
||||||
* Avoid using the word "outage" unless it really was a full outage, use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.
|
* Avoid using the word "outage" unless it really was a full outage, use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.
|
||||||
@@ -57,18 +57,18 @@ These meetings should generally last 15-30 minutes, and are intended to be a wra
|
|||||||
You should invite the following people to the post-mortem meeting,
|
You should invite the following people to the post-mortem meeting,
|
||||||
|
|
||||||
* Always
|
* Always
|
||||||
* The incident commander.
|
* The team leader.
|
||||||
* Service owners involved in the incident.
|
* Service owners involved in the incident.
|
||||||
* Key engineer(s)/responders involved in the incident.
|
* Key engineer(s)/responders involved in the incident.
|
||||||
* Optional
|
* Optional
|
||||||
* Customer liaison. (Only SEV-1 incidents)
|
* Customer liaison. (Only SR/IN +major incidents)
|
||||||
|
|
||||||
A general agenda for the meeting would be something like,
|
A general agenda for the meeting would be something like,
|
||||||
|
|
||||||
1. Recap the timeline, to make sure everyone agrees and is on the same page.
|
1. Recap the timeline, to make sure everyone agrees and is on the same page.
|
||||||
1. Recap important points, and any unusual items.
|
1. Recap important points, and any unusual items.
|
||||||
1. Discuss how the problem could've been caught.
|
1. Discuss how the problem could've been caught.
|
||||||
* Did it show up in canary?
|
* Did it send any weak signals?
|
||||||
* Could it have been caught in tests, or loadtest environment?
|
* Could it have been caught in tests, or loadtest environment?
|
||||||
1. Discuss customer impact. Any comments from customers, etc.
|
1. Discuss customer impact. Any comments from customers, etc.
|
||||||
1. Review action items that have been created, discuss if appropriate, or if more are needed, etc.
|
1. Review action items that have been created, discuss if appropriate, or if more are needed, etc.
|
||||||
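The document asks for the post-mortem meeting to be scheduled within 5 business days of the incident. As a rough illustration of how an owner might work out that deadline, here is a minimal sketch. The function name `post_mortem_deadline` is hypothetical, and the sketch assumes that business days are Monday to Friday, that public holidays are ignored, and that counting starts on the day after the incident. None of these assumptions come from the process described above.

```python
from datetime import date, timedelta


def post_mortem_deadline(incident_day: date, business_days: int = 5) -> date:
    """Return the latest date for the post-mortem meeting.

    Assumptions (not taken from the process doc): weekends are Saturday
    and Sunday, public holidays are not considered, and the count starts
    on the day after the incident.
    """
    current = incident_day
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        # weekday() returns 0 for Monday through 6 for Sunday
        if current.weekday() < 5:
            remaining -= 1
    return current


if __name__ == "__main__":
    # An incident on Friday 2024-01-05 gives a deadline one week later.
    print(post_mortem_deadline(date(2024, 1, 5)))
```

A team with a holiday calendar would need to skip those dates as well; the loop above only skips weekends.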
|