updating to spearhead specifics

Marius Pana 2017-01-21 15:26:04 +02:00
parent fe5a948988
commit 5df0610b7f
12 changed files with 26 additions and 346 deletions

@@ -1,7 +1,5 @@
This site documents parts of the Spearhead Systems Issue Response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after.
-Few companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone.
This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com).
## What is this?

Binary file not shown. Before: 11 KiB. After: 158 KiB.

Binary file not shown. After: 11 KiB.

Binary file not shown. Before: 17 KiB. After: 26 KiB.

Binary file not shown. After: 22 KiB.

@@ -1,5 +1,5 @@
-!!! note "Incident Commander Required"
-As with all major incidents at PagerDuty, security ones will also involve an Incident Commander, who will delegate the tasks to relevant resolvers. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity.
+!!! note "Team Leader Required"
+As with all major incidents, security ones will also involve a Team Leader, who will delegate the tasks to relevant resolvers. Tasks may be performed in parallel as assigned by the TL. Contact one at the earliest possible opportunity.
## Checklist
Details for each of these items are available in the next section.
@@ -42,7 +42,7 @@ Identify the key responders for the security incident, and keep them all in the
* The security and site-reliability teams should usually be involved.
* A representative for any affected services should be involved.
-* An Incident Commander (IC) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate.
+* A Team Leader (TL) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate.
* Do not communicate with anyone not on the response team about the incident until forensics has been performed. The attack could be happening internally.
* Give the project an innocuous codename that can be used for chats/documents so if anyone overhears they don't realize it's a security incident. (e.g. sapphire-unicorn).
* Prefix all emails, and chat topics with "Attorney Work Project".
@@ -117,7 +117,7 @@ Work with law enforcement to identify the source of the attack, letting any syst
* Contact security companies to help in assessing risk and any PR next steps.
## External Communication
-**Delegate to:** Marketing Team
+**Delegate to:** TL, Marketing Team
Once you have validated all of the information you have is accurate, have a timeline of events, and know exactly what information was compromised, how it was compromised, and sure that it won't happen again. Only then should you prepare and release a public statement to customers informing them of the compromised information and any steps they need to take.
@@ -139,3 +139,4 @@ Once you have validated all of the information you have is accurate, have a time
* [Responding to IT Security Incidents](https://technet.microsoft.com/en-us/library/cc700825.aspx) (Microsoft)
* [Defining Incident Management Processes for CSIRTs: A Work in Progress](http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=7153) (CMU)
* [Creating and Managing Computer Security Incident Handling Teams (CSIRTS)](https://www.first.org/conference/2008/papers/killcrece-georgia-slides.pdf) (CERT)
+* [Google Infrastructure Security Design Overview](https://cloud.google.com/security/security-design/) (Google)

@@ -1,4 +1,5 @@
-This documentation covers parts of the Spearhead Systems Incident Response process. It is a copy of [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident or service request, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in our operational technical support response process (or those wishing to become part of our support team). See the [about page](about.md) for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) that may not yet be open sourced.
+This documentation covers parts of the Spearhead Systems Incident Response process. It is a copy of [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident or service request, but also what to do during and after. It is intended to be used by those involved in our operational technical support response process (or those wishing to become part of our support team). See the [about page](about.md) for more information on what this documentation is and why it exists.
+This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and other systems that have not been open sourced.
!!! note "Issue, Incident and Service Request"
At Spearhead we use the term *issue* to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". An IN will generally be an issue that has impact on the normal functioning of the business while a SR generally does not.
@@ -17,7 +18,7 @@ If you've never been on-call before or part of a support delivery team, you migh
Reading material for things you probably want to know before an incident occurs. You likely don't want to be reading these during an actual incident.
* [Severity Levels](before/severity_levels.md) - _Information on our severity level classification. What constitutes a Low issue? What's a "Major Incident"?, etc._
-* [Different Roles for Incidents](before/different_roles.md) - _Information on the roles during an incident; Incident Commander, Scribe, etc._
+* [Different Roles for Incidents](before/different_roles.md) - _Information on the roles during an incident; Team Leader, Sysadmin, etc._
* [Incident Call Etiquette](before/call_etiquette.md) - _Our etiquette guidelines for incident calls, before you find yourself in one._
## During an Incident

@@ -6,9 +6,9 @@ We manage how we get alerted based on many factors such as the customers contrac
| Priority | Alerts | Response |
| -------- | ------ | -------- |
| Major | Major-Priority Spearhead Alert 24/7/365. | Requires **immediate human action**. |
-| Normal | Normal-Priority Spearhead Alert during **business hours only**. | Requires human action that same working day. |
-| Minor | Minor-Priority Spearhead Alert 24/7/365. | Requires human action at some point. |
-| Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter out ticketing or inboxes. If they are enabled they should be sent only to required/specific people, not groups. |
+| Normal | Normal-Priority Alert during **business hours only**. | Requires human action that same working day. |
+| Minor | Minor-Priority Alert 24/7/365. | Requires human action at some point. |
+| Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter our ticketing or inboxes. If they are enabled they should be sent only to required/specific people, not groups. |
Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer.
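
For readers who want to see the priority table as routing logic, here is a minimal Python sketch. It is illustrative only and not part of this commit: the names `PriorityPolicy`, `POLICIES`, `is_business_hours`, and `should_alert` are hypothetical, and the business-hours window (Monday to Friday, 09:00 to 18:00) is an assumption, since the documentation does not define it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PriorityPolicy:
    """One row of the priority table (hypothetical representation)."""
    around_the_clock: bool  # True: alert 24/7/365; False: business hours only
    suppressed: bool        # True: informational event, nobody is alerted
    response: str           # expected human response, as worded in the table


# Mirrors the four rows of the table above.
POLICIES = {
    "Major": PriorityPolicy(True, False, "immediate human action"),
    "Normal": PriorityPolicy(False, False, "human action that same working day"),
    "Minor": PriorityPolicy(True, False, "human action at some point"),
    "Notification": PriorityPolicy(False, True, "no response required"),
}


def is_business_hours(now: datetime) -> bool:
    # Assumption: Monday-Friday, 09:00-18:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 18


def should_alert(priority: str, now: datetime) -> bool:
    """Return True if an issue of this priority should alert someone at `now`."""
    policy = POLICIES[priority]
    if policy.suppressed:
        return False
    return policy.around_the_clock or is_business_hours(now)
```

Actual response and resolution times are deliberately left out of the sketch, because they depend on each customer's contract (SLA) rather than on the priority alone.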

@@ -1,57 +0,0 @@
So you want to be a deputy? You've come to the right place!
![Deputy](../assets/img/headers/incident_command_support.jpg)
*Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743801731/in/album-72157633494644719/)*
## Purpose
The purpose of the Deputy is to support the IC by keeping track of timers, notifying the IC of important information, and paging other people as directed by the IC.
It's important for the IC to focus on the problem at hand, rather than worrying about monitoring timers. The deputy is there to help support the IC and keep them focussed on the incident.
As a Deputy, you will be expected to take over command from the IC if they request it.
**You should not be performing any remediations, checking graphs, or investigating logs**. Those tasks will be delegated to the resolvers by the IC.
## Prerequisites
Before you can be a Deputy, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!
* Be trained as an [Incident Commander](/training/incident_commander.md).
## Responsibilities
Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from a Deputy, as well as what we expect from the other roles you'll be interacting with.
## Training Process
The training process for a Deputy is quite simple.
* Follow our [Incident Commander Training](/training/incident_commander.md).
* Read this page.
## Incident Call Procedures and Lingo
The [Steps for Deputy](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident.
Here are some examples of phrases and patterns you should use during incident calls.
### Keep Track of Responders
As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the `!ic responders` to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them.
> Do we have a representative from [X] on the call?
> (pause)
> Deputy, can you go ahead and page the [X] on-call please.
You can page them however you see fit, phone call, etc.
### Provide Executive Status Updates
Provide regular status updates on Slack (roughly every 30mins), giving an executive summary of the current status during SEV-1 incidents. Keep it short and to the point, and use @here. Mention the current state, the actions in progress, customer impact, and expected time remaining. It's OK to miss out some of those if the information isn't known.
> @here: We are in SEV-1 due to X. Current actions in progress are to do Y. Expecting 3 mins to complete that action. Once action is complete, system should recover on its own within 5 minutes.
### Alert IC to Timers
You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. This is as simple as telling the IC on the call,
> IC, be advised the incident is now at the 10 minute mark.
Similarly, when the IC asks for someone to get back to them in X minutes, you are expected to keep track of that. You should remind the IC when that time has been reached.
> IC, be advised the timer for [TEAM]'s investigation is up.

@@ -1,263 +0,0 @@
So you want to be an incident commander? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern)!
![Gene Kranz](../assets/img/headers/gene_kranz.jpg)
*Credit: [NASA](https://en.wikipedia.org/wiki/File:Eugene_F._Kranz_at_his_console_at_the_NASA_Mission_Control_Center.jpg)*
## Purpose
If you could boil down the definition of an Incident Commander to one sentence, it would be,
> Take whatever actions are necessary to protect PagerDuty systems and customers.
The purpose of the Incident Commander is to be the decision maker during a major incident, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution.
The Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final.
Your job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. **You should not be performing any actions or remediations, checking graphs, or investigating logs.** Those tasks should be delegated.
## Prerequisites
Before you can be an Incident Commander, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!
* Has **excellent knowledge of PagerDuty systems** and is able to quickly evaluate good vs bad options, and quickly identify what's actually going on.
* Been at PagerDuty for at least 6 months and has a **solid understanding of the incident notification pipeline and web stack**.
* Excellent verbal and written **communication skills**.
* Has **knowledge of obscure PagerDuty terms**.
* Has gravitas and is **willing to kick people off a call** to remove distractions, even if it's the CEO.
## Responsibilities
Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with.
## Qualities
Some qualities we expect from an effective leader include being able to:
* Take command.
* Motivate responders.
* Communicate clear directions.
* Size up the situation and make rapid decisions.
* Assess the effectiveness of tactics/strategies.
* Be flexible and modify your plans as necessary.
As a leader, you should try to:
* Be proficient in your job.
* Make sound and timely decisions.
* Ensure tasks are understood.
* Be prepared to step out of a tactical role to assume a leadership role.
## Training Process
The process is fairly loose for now. Here's a list of things you can do to train though,
* Read the rest of this page, particularly the sections below.
* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF).
* Shadow a FF to see how it's run.
* Be the scribe for multiple FF's.
* Be the incident commander for multiple FF's.
* Play a game of "[Keep Talking and Nobody Explodes](http://www.keeptalkinggame.com/)" with other people in the office.
* For a more realistic experience, play it with someone in a different office over Hangouts.
* Shadow a current incident commander for at least a full week shift.
* Get alerted when they do, join in on the same calls.
* Sit in on an active incident call, follow along with the chat, and follow along with what the Incident Commander is doing.
* **Do not actively participate in the call, keep your questions until the end.**
* Reverse shadow a current incident commander for at least a full week shift.
* You should be the one to respond to incidents, and you will take point on calls, however the current IC will be there to take over should you not know how to proceed.
## Graduation
What's the difference between an IC in training, and an IC? (This isn't the set up to a joke). Simple, an IC puts themselves on the schedule.
## Handling Incidents
Every incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one.
1. **Identify the symptoms.**
* Identify what the symptoms are, how big the issue is, and whether it's escalating/flapping/static.
1. **Size-up the situation.**
* Gather as much information as you can, as quickly as you can (remember the incident is still happening while you're doing this).
* Get the facts, the possibilities of what can happen, and the probability of those things happening.
1. **Stabilize the incident.**
* Identify actions you can use to proceed.
* Gather support for the plan (See "Polling During a Decision" below).
* Delegate remediation actions to your SME's.
1. **Provide regular updates.**
* Maintain a cadence, and provide regular updates to everyone on the call.
* What's happening, what are we doing about it, etc.
## Deputy
The deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander's position if required.
## Communication Responsibilities
Sharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands.
When given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks.
After an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary.
## Incident Call Procedures and Lingo
The [Steps for Incident Commander](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident.
Additionally, aside from following the [usual incident call etiquette](/before/call_etiquette.md), there are a few extra etiquette guidelines you should follow as IC:
* Always announce when you join the call if you are the on-call IC.
* Don't let discussions get out of hand. Keep conversations short.
* Note objections from others, but your call is final.
* If anyone is being actively disruptive to your call, kick them off.
* Announce the end of the call.
Here are some examples of phrases and patterns you should use during incident calls.
### Start of Call Announcement
At the start of any major incident call, the incident commander should announce the following,
> This is [NAME], I am the Incident Commander for this call.
This establishes to everyone on the call what your name is, and that you are now the commander. You should state "Incident Commander" and not "IC", as newcomers may not be familiar with the terminology yet. The word "commander" makes it very clear that you're in charge.
### Start of Incident, IC Not Present
If you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following,
> Is there an IC on the call?
> (pause)
> Hearing no response, this is [NAME], and I am now the Incident Commander for this call.
If the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure).
### Checking if SME's are Present
During a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your deputy can page one if no one answers.
> Do we have a representative from [X] on the call?
> (pause)
> Deputy, can you go ahead and page the [X] on-call please.
### Assigning Tasks
When you need to give out an assignment or task, give it to a person directly, never say "can someone do..." as this leads to the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect). Instead, all actions should be assigned to a specific person, and time-boxed with a specific number of minutes.
> IC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes.
> Bob: Understood
Keep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings.
### Polling During a Decision
If a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. But it's important that no one can come later and object to the plan, saying things like "I knew that would happen". An IC will use very specific language to be sure that doesn't happen.
> The proposal is to [EXPLAIN PROPOSAL]
> Are there any strong objections to this plan?
> (pause)
> Hearing no objections, we are proceeding with this proposal.
If you were to ask "Does everyone agree?", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter.
### Status Updates
It's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page.
> While we wait for [X], here's an update of our current situation.
> We are currently in a SEV-1 situation, we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time.
> Are there any additional actions or proposals from anyone else at this time?
### Transfer of Command
Transfer of command involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place,
* Commander has become fatigued and is unable to continue.
* Incident complexity changes.
* Change of command is necessary for effectiveness or efficiency.
* Personal emergencies arise (e.g., Incident Commander has a family emergency).
Never feel like you are not doing your job properly by handing over. Handovers are encouraged. In order to hand over, out of band from the main call (via Slack for example), notify the other IC that you wish to transfer command. Update them with anything you feel appropriate. Then announce on the call,
> Everyone on the call, be advised, at this time I am handing over command to [X].
The new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander.
Note that the arrival of a more qualified person does NOT necessarily mean a change in incident command.
### Maintaining Order
Often times on a call people will be talking over one another, or an argument on the correct way to proceed may break out. As Incident Commander it's important that order is maintained on a call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO). But often times you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long.
> (noise)
> Ok everyone, can we all speak one at a time please. So far I'm hearing two options to proceed: 1) [X], 2) [Y].
> Are there any other proposals someone would like to make at this time?
> ...etc
### Getting Straight Answers
You may ask a question as IC and receive an answer that doesn't actually answer your question. This is generally when you ask for a yes/no answer but get a more detailed explanation. This can often times be because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed.
> IC: Is this going to disable the service for everyone?
> SME: Well... for some people it....
> IC: Stop. I need a yes/no answer. Is this going to disable the service for everyone?
> SME: Well... it might not do...
> IC: Stop. I'm going to ask again, and the only two words I want to hear from you are "yes" or "no". Is this going to disable the service for everyone?
> SME: Well.. like I was saying..
> IC: Stop. Leave the call. Backup IC can you please page the backup on-call for [service] so that we can get an answer.
### Executive Swoop
You may get someone who would be senior to you during peacetime come on the call and start overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. While this is rare, you can get things back on track with the following,
> Executive: No, I don't want us doing that. Everyone stop. We need to rollback instead.
> IC: Hold please. [EXECUTIVE], do you wish to take over command?
> Executive: Yes/No
> (If yes) IC: Understood. Everyone on the call, be advised, at this time I am handing over command to [EXECUTIVE]. They are now the incident commander for this call.
> (If no) IC: In that case, please cause no further interruptions or I will remove you from the call.
This makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call.
### End of Call Sign-Off
At the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where followup discussion can take place. It's also customary to thank everyone.
> Ok everyone, we're ending the call at this time. Please continue any followup discussion on Slack. Thanks everyone.
## Examples From Pop Culture
PagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work.
---
<iframe width="560" height="315" src="https://www.youtube.com/embed/gmLgi5mdTVo" frameborder="0" allowfullscreen></iframe>
Here's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. Here are some things to note:
* Walks into the room, and immediately obvious that he's the IC. Calms the noise, and makes sure everyone is paying attention.
* Provides a status update so people are aware of the situation.
* Projector breaks, doesn't get sidetracked on fixing it, just moves on to something else.
* Provides a proposal for how to proceed and elicits feedback.
* Listens to the feedback calmly.
* When counter-proposal is raised, states that he agrees and why.
* Allows a discussion to happen, listens to all points. When discussion gets out of hand, re-asserts command of the situation.
* Explains his decision, and why.
* Explains his full plan and decision, so everyone is on the same page.
---
<iframe width="560" height="315" src="https://www.youtube.com/embed/KhoXFVQsIxw" frameborder="0" allowfullscreen></iframe>
Another clip from Apollo 13. Things to note:
* Summarizes the situation, and states the facts.
* Listens to the feedback from various people.
* When a trusted SME provides information counter to what everyone else is saying, asks for additional clarification ("What do you mean, everything?")
* Wise cracking remarks are not acknowledged by the IC ("You can't run a vacuum cleaner on 12 amps!")
* "That's the deal?".. "That's the deal".
* Once decision is made, moves on to the next discussion.
* Delegates tasks.

@ -1,10 +1,10 @@
If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Incident Commander, take a look at the [Incident Commander Training page](/training/incident_commander.md). If you are on-call for any team at Spearhead Systems, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Team Leader, take a look at the [Team Leader Training page](/training/team_leader.md).
![Incident Response](../assets/img/headers/incident_response.jpg) ![Incident Response](../assets/img/headers/incident_response.jpg)
*Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743809853/in/album-72157633494644719/)* *Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743809853/in/album-72157633494644719/)*
## On-Call Expectations ## On-Call Expectations
If you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. Getting paged about a SEV-3 or SEV-4 in your system comes with different expectations than getting paged with a major SEV-2. If you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. Getting paged about an IN-3 or SR-3 in your system comes with different expectations than getting paged with an IN-1.
### Before Going On-Call ### Before Going On-Call
@ -13,17 +13,17 @@ If you are on-call for your team, there are certain expectations of you as that
1. [Incident Call Etiquette](/before/call_etiquette.md) - How to behave during an incident call. 1. [Incident Call Etiquette](/before/call_etiquette.md) - How to behave during an incident call.
1. [During an Incident](/during/during_an_incident.md) - What to do during an incident. You are specifically interested in the "Resolver" steps, but you should familiarize yourself with the entire document. 1. [During an Incident](/during/during_an_incident.md) - What to do during an incident. You are specifically interested in the "Resolver" steps, but you should familiarize yourself with the entire document.
1. [Glossary](/training/glossary.md) - Familiarize yourself with the terminology that may be used during the call. 1. [Glossary](/training/glossary.md) - Familiarize yourself with the terminology that may be used during the call.
1. Make sure you have set up your alerting methods, and that PagerDuty can bypass your "Do Not Disturb" settings. 1. Make sure you have set up your alerting methods, and that these can bypass your "Do Not Disturb" settings.
1. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged. 1. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged.
1. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc. 1. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc.
1. If you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander. 1. If you are a Team Leader, make sure you are not on-call for your team at the same time as being on-call as Team Leader.
### During On-Call Period ### During On-Call Period
1. Have your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc). 1. Have your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc).
1. If you have important appointments, you need to get someone else on your team to cover that time slot in advance. 1. If you have important appointments, you need to get someone else on your team to cover that time slot in advance.
1. When you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes). 1. When you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes).
1. You will be asked questions or given actions by the Incident Commander. Answer questions concisely, and follow all actions given (even if you disagree with them). 1. You will be asked questions or given actions by the Team Leader. Answer questions concisely, and follow all actions given (even if you disagree with them).
## Response Mobilization ## Response Mobilization
When an incident occurs, you must be mobilized or assigned to become part of the incident response. In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised. When an incident occurs, you must be mobilized or assigned to become part of the incident response. In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised.
@ -43,12 +43,12 @@ The organizational structure is generally based on seniority. The more senior me
### Wartime ### Wartime
Wartime is different, and you will notice on our major incident calls that there's a different organizational structure. Wartime is different, and you will notice on our major incident calls that there's a different organizational structure.
* The Incident Commander is in charge. No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO. * The Team Leader is in charge. No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO.
* Primary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service. * Primary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service.
* Decisions will be made by the IC after consideration of the information presented. Once that decision is made, it is final. * Decisions will be made by the TL after consideration of the information presented. Once that decision is made, it is final.
* Riskier decisions can be made by the IC than would normally be considered during peacetime. * Riskier decisions can be made by the TL than would normally be considered during peacetime.
* For example, the IC may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else. * For example, the TL may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else.
* The IC may go against a consensus decision. If a poll is done and 9/10 people agree but 1 disagrees, the IC may choose the dissenting option despite a majority vote. * The TL may go against a consensus decision. If a poll is done and 9/10 people agree but 1 disagrees, the TL may choose the dissenting option despite a majority vote.
* Even if you disagree, the IC's decision is final. During the call is not the time to argue with them. * Even if you disagree, the TL's decision is final. During the call is not the time to argue with them.
* The IC may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before. * The TL may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before.
* You may be asked to leave the call by the IC, or you may even be forcibly kicked off a call. It is at the IC's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different than peacetime. * You may be asked to leave the call by the TL, or you may even be forcibly kicked off a call. It is at the TL's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different than peacetime.


@ -47,8 +47,8 @@ pages:
- Post-Mortem Template: 'after/post_mortem_template.md' - Post-Mortem Template: 'after/post_mortem_template.md'
- Training: - Training:
- Overview: 'training/overview.md' - Overview: 'training/overview.md'
- Incident Commander: 'training/team_leader.md' - Team Leader: 'training/team_leader.md'
- Deputy: 'training/sysadmin.md' - Sysadmin: 'training/sysadmin.md'
- Scribe: 'training/scribe.md' - Scribe: 'training/scribe.md'
- Subject Matter Expert: 'training/subject_matter_expert.md' - Subject Matter Expert: 'training/subject_matter_expert.md'
- Glossary: 'training/glossary.md' - Glossary: 'training/glossary.md'