diff --git a/.DS_Store b/.DS_Store index 01b019f..ef21c03 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..326c651 --- /dev/null +++ b/.gitignore @@ -0,0 +1,3 @@ +site/ +*sublime-* +.DS_Store diff --git a/.nojekyll b/.nojekyll deleted file mode 100644 index e69de29..0000000 diff --git a/404.html b/404.html deleted file mode 100644 index 8fd2af8..0000000 --- a/404.html +++ /dev/null @@ -1,477 +0,0 @@ - - - - - - - - - - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-Spearhead Systems Incident Response Documentation
-Sorry! We couldn't find that page.
-Looks like our well-trained server monkeys dropped the ball. Rest assured they will be dealt with. In the meantime, you probably want to head home.
- - - - - - \ No newline at end of file diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..95efe6e --- /dev/null +++ b/LICENSE @@ -0,0 +1,13 @@ +Copyright 2016 PagerDuty, Inc. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. diff --git a/_site/LICENSE b/_site/LICENSE new file mode 100644 index 0000000..95efe6e --- /dev/null +++ b/_site/LICENSE @@ -0,0 +1,13 @@ +Copyright 2016 PagerDuty, Inc. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. diff --git a/_site/README.md b/_site/README.md new file mode 100644 index 0000000..0bde4a6 --- /dev/null +++ b/_site/README.md @@ -0,0 +1,36 @@ +# Spearhead Systems Issue Response Documentation [![Build Status](https://travis-ci.com/PagerDuty/incident-response-docs.svg?token=zdc1SxQUyY3TG9TLD3Xz&branch=master)](https://travis-ci.com/PagerDuty/incident-response-docs) +This is a public version of the Issue Response process used at Spearhead Systems. It is based on the PagerDuty Incident Response process, modified to fit our specific requirements. 
It is used to prepare new employees for on-call responsibilities, and provides information not only on preparing for an issue (incident or service request), but also what to do during and after. See the [about page](docs/about.md) for more information on what this documentation is and why it exists. + +You can view the documentation [directly](/docs/index.md) in this repository, or rendered as a website at https://response.spearhead.systems. + +[![Spearhead Systems Issue Response Documentation](screenshot.png)](https://response.spearhead.systems) + +## Development +We use [MkDocs](http://www.mkdocs.org/) to create a static site from this repository. For local development, + +1. [Install MkDocs](http://www.mkdocs.org/#installation). `pip install mkdocs` +1. Install the [MkDocs Material theme](https://github.com/squidfunk/mkdocs-material). `pip install mkdocs-material` +1. To test locally, run `mkdocs serve` from the project directory. + +## Deploying +1. Run `mkdocs build --clean` to produce the static site for upload. +1. Upload the `site` directory to S3 (or wherever you would like it to be hosted). + + aws s3 sync ./site/ s3://[BUCKET_NAME] \ + --acl public-read \ + --exclude "*.py*" \ + --delete + +## License +[Apache 2](http://www.apache.org/licenses/LICENSE-2.0) (See [LICENSE](LICENSE) file) + +## Contributing +Thank you for considering contributing! If you have any questions, just ask - or submit your issue or pull request anyway. The worst that can happen is we'll politely ask you to change something. We appreciate all friendly contributions. + +Here is our preferred process for submitting a pull request, + +1. Fork it ( https://github.com/PagerDuty/incident-response-docs/fork ) +1. Create your feature branch (`git checkout -b my-new-feature`) +1. Commit your changes (`git commit -am 'Add some feature'`) +1. Push to the branch (`git push origin my-new-feature`) +1. Create a new Pull Request. 
diff --git a/_site/docs/about.md b/_site/docs/about.md new file mode 100644 index 0000000..ab5ff64 --- /dev/null +++ b/_site/docs/about.md @@ -0,0 +1,31 @@ +This site documents parts of the Spearhead Systems Issue Response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after. + +Few companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone. + +This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com). + +## What is this? + +A collection of pages detailing how to efficiently deal with any incident or service request that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly. + +## Who is this for? + +It is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically this is for all of our Technical Support staff. + +## Why do I need it? + +As a service provider, Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers' issues; this documentation is therefore a guideline for how we handle these requests. 
This documentation will give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time. + +## What is covered? + +Anything from preparing to [go on-call](/oncall/being_oncall.md), definitions of [severities](/before/severity_levels.md), incident [call etiquette](/before/call_etiquette.md), all the way to how to run a [post-mortem](/after/post_mortem_process.md), providing a [post-mortem template](/after/post_mortem_template.md) and even a [security incident response process](/during/security_incident_response.md). + +## What is missing? + +Lots. Dig in and help us complete the picture. We can migrate most processes from Sharepoint here. + +## License + +This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file. + +Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account; feel free to fork the repository and use it as a base for your own internal documentation. diff --git a/_site/docs/after/post_mortem_process.md b/_site/docs/after/post_mortem_process.md new file mode 100644 index 0000000..76a9775 --- /dev/null +++ b/_site/docs/after/post_mortem_process.md @@ -0,0 +1,91 @@ +For every major incident (SEV-2/1), we need to follow up with a post-mortem. A post-mortem is a blame-free, detailed description of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included. 
+ +![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg) + +## Owner Designation +The first step is that a post-mortem owner will be designated. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below, + +## Owner Responsibilities +As owner of a post-mortem, you are responsible for the following, + +* Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident). +* Updating the page with all of the necessary content. +* Investigating the incident, pulling in whomever you need from other teams to assist in the investigation. +* Creating follow-up JIRA tickets (_You are only responsible for creating the tickets, not following them up to resolution_). +* Running the post-mortem meeting (_these generally run themselves, but you should get people back on topic if the conversation starts to wander_). +* In cases where we need a public blog post, creating & reviewing it with appropriate parties. + +## Post-Mortem Wiki Page +Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information. + +1. (If not already done by the IC) Create a new post-mortem page for the incident. + +1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar. + * Create the meeting on the "Incident Post-Mortem Meetings" shared calendar. + +1. Begin populating the page with all of the information you have. + * The timeline should be the main focus to begin with. 
+ * The timeline should include important changes in status/impact, and also key actions taken by responders. + * You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV). + * Go through the history in Slack to identify the responders, and add them to the page. + * Identify the Incident Commander and Scribe in this list. + +1. Populate the page with more detailed information. + * For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline. + +1. Perform an analysis of the incident. + * Capture all available data regarding the incident. What caused it, how many customers were affected, etc. + * Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered. + * Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery) + * Identify the underlying cause of the incident (What happened, and why did it happen). + +1. Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets), + * Go through the history in Slack to identify any TODO items. + * Label all tickets with their severity level and date tags. + * Any actions which can reduce re-occurrence of the incident. + * (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it). + * Identify any actions which can make our incident response process better. + * Be careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with. + +1. Write the external message that will be sent to customers. 
This will be reviewed during the post-mortem meeting before it is sent out. + * Avoid using the word "outage" unless it really was a full outage, use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA. + * Look at other examples of previous post-mortems to see the kind of thing you should send. + +## Post-Mortem Meeting +These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us. + +You should invite the following people to the post-mortem meeting, + +* Always + * The incident commander. + * Service owners involved in the incident. + * Key engineer(s)/responders involved in the incident. +* Optional + * Customer liaison. (Only SEV-1 incidents) + +A general agenda for the meeting would be something like, + +1. Recap the timeline, to make sure everyone agrees and is on the same page. +1. Recap important points, and any unusual items. +1. Discuss how the problem could've been caught. + * Did it show up in canary? + * Could it have been caught in tests, or loadtest environment? +1. Discuss customer impact. Any comments from customers, etc. +1. Review action items that have been created, discuss if appropriate, or if more are needed, etc. 
+ +## Examples +Here are some examples of post-mortems from other companies as a reference, + +* [Stripe](https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc) +* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice.html/comment-page-2/) +* [AWS](https://aws.amazon.com/message/5467D2/) +* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html) +* [Heroku](https://status.heroku.com/incidents/151) +* [Netflix](http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html) +* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016) +* [A List of Post-mortems!](https://github.com/danluu/post-mortems) + +## Useful Resources + +* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011) +* [Blame. Language. Sharing.](http://fractio.nl/2015/10/30/blame-language-sharing/) diff --git a/_site/docs/after/post_mortem_template.md b/_site/docs/after/post_mortem_template.md new file mode 100644 index 0000000..781e410 --- /dev/null +++ b/_site/docs/after/post_mortem_template.md @@ -0,0 +1,79 @@ +This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section. + +--- + +!!! note "Guidelines" + This page is intended to be reviewed during a post-mortem meeting that should be scheduled within 5 business days of any event. + Your first step should be to schedule the post-mortem meeting in the shared calendar for within 5 business days after the incident. + Don't wait until you've filled in the info to schedule the meeting, however make sure the page is completed by the meeting. 
+ +**Post-Mortem Owner:** _Your name goes here._ + +**Meeting Scheduled For:** _Schedule the meeting on the "Incident Post-Mortem Meetings" shared calendar, for within 5 business days after the incident. Put the date/time here._ + +**Call Recording:** _Link to the incident call recording._ + +## Overview +_Include a **short** sentence or two summarizing the root cause, timeline summary, and the impact. E.g. "On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."_ + +## What Happened +_Include a short description of what happened._ + +## Root Cause +_Include a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process._ + +## Resolution +_Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution._ + +## Impact +_Be very specific here, include exact numbers._ + +| Time in SEV-1 | ?mins | +| Time in SEV-2 | ?mins | +| Notifications Delivered out of SLA | ??% (?? of ??) | +| Events Dropped / Not Accepted | ??% (?? of ??) _Should usually be 0, but always check_ | +| Accounts Affected | ?? | +| Users Affected | ?? | +| Support Requests Raised | ?? _Include any relevant links to tickets_ | + +## Responders + +* _Who was the IC?_ +* _Who was the scribe?_ +* _Who else was involved?_ +* _Who else was involved?_ + +## Timeline +_Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. 
when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at._ + +| Time (UTC) | Event | Data Link | +| ---------- | ----- | --------- | + +## How'd We Do? + +### What Went Well? + +* _List anything you did well and want to call out. It's OK to not list anything._ + +### What Didn't Go So Well? + +* _List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes._ + +## Action Items +_Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: “sev1_YYYYMMDD” (such as sev1_20150911) and simply “sev1”. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process._ + +## Messaging + +### Internal Email +_This is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page._ + +> Briefly summarize what happened and where the post-mortem page (this page) can be found. + +### External Message +_This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)_ + +> Summary + +> What Happened? + +> What Are We Doing About This? 
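The tagging convention described under Action Items above (a dated tag such as `sev1_20150911` plus a plain `sev1` tag) can be sketched in Python. This is only an illustrative helper: the function name `action_item_tags` is hypothetical, and extending the pattern to severities other than SEV-1 is an assumption, since the template only shows the SEV-1 form.

```python
from __future__ import annotations

from datetime import date


def action_item_tags(severity: int, incident_date: date) -> list[str]:
    """Build the two tags for a post-mortem action item ticket.

    Hypothetical helper: the template only documents the SEV-1 form
    ("sev1_YYYYMMDD" and "sev1"); other severities are assumed to
    follow the same pattern.
    """
    if severity < 1:
        raise ValueError("severity must be a positive integer")
    base = f"sev{severity}"
    # date objects accept strftime-style format specs inside f-strings.
    dated = f"{base}_{incident_date:%Y%m%d}"
    return [dated, base]


if __name__ == "__main__":
    # Reproduces the example tag given in the template text.
    print(action_item_tags(1, date(2015, 9, 11)))
```

Generating the tags from the incident date rather than typing them by hand keeps every ticket for one incident searchable under the same pair of labels.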
diff --git a/assets/css/extra.css b/_site/docs/assets/css/extra.css similarity index 100% rename from assets/css/extra.css rename to _site/docs/assets/css/extra.css diff --git a/assets/img/cover.png b/_site/docs/assets/img/cover.png similarity index 100% rename from assets/img/cover.png rename to _site/docs/assets/img/cover.png diff --git a/assets/img/headers/gene_kranz.jpg b/_site/docs/assets/img/headers/gene_kranz.jpg similarity index 100% rename from assets/img/headers/gene_kranz.jpg rename to _site/docs/assets/img/headers/gene_kranz.jpg diff --git a/assets/img/headers/incident_command_support.jpg b/_site/docs/assets/img/headers/incident_command_support.jpg similarity index 100% rename from assets/img/headers/incident_command_support.jpg rename to _site/docs/assets/img/headers/incident_command_support.jpg diff --git a/assets/img/headers/incident_response.jpg b/_site/docs/assets/img/headers/incident_response.jpg similarity index 100% rename from assets/img/headers/incident_response.jpg rename to _site/docs/assets/img/headers/incident_response.jpg diff --git a/assets/img/headers/obama_phone.jpg b/_site/docs/assets/img/headers/obama_phone.jpg similarity index 100% rename from assets/img/headers/obama_phone.jpg rename to _site/docs/assets/img/headers/obama_phone.jpg diff --git a/assets/img/headers/pagerduty_ir.jpg b/_site/docs/assets/img/headers/pagerduty_ir.jpg similarity index 100% rename from assets/img/headers/pagerduty_ir.jpg rename to _site/docs/assets/img/headers/pagerduty_ir.jpg diff --git a/assets/img/headers/pagerduty_post_mortem.jpg b/_site/docs/assets/img/headers/pagerduty_post_mortem.jpg similarity index 100% rename from assets/img/headers/pagerduty_post_mortem.jpg rename to _site/docs/assets/img/headers/pagerduty_post_mortem.jpg diff --git a/assets/img/headers/sph_ir.jpg b/_site/docs/assets/img/headers/sph_ir.jpg similarity index 100% rename from assets/img/headers/sph_ir.jpg rename to _site/docs/assets/img/headers/sph_ir.jpg diff --git 
a/assets/img/headers/typewriter.jpg b/_site/docs/assets/img/headers/typewriter.jpg similarity index 100% rename from assets/img/headers/typewriter.jpg rename to _site/docs/assets/img/headers/typewriter.jpg diff --git a/assets/img/icon.png b/_site/docs/assets/img/icon.png similarity index 100% rename from assets/img/icon.png rename to _site/docs/assets/img/icon.png diff --git a/assets/img/logo.png b/_site/docs/assets/img/logo.png similarity index 100% rename from assets/img/logo.png rename to _site/docs/assets/img/logo.png diff --git a/assets/img/misc/ack.png b/_site/docs/assets/img/misc/ack.png similarity index 100% rename from assets/img/misc/ack.png rename to _site/docs/assets/img/misc/ack.png diff --git a/assets/img/misc/alert_fatigue.png b/_site/docs/assets/img/misc/alert_fatigue.png similarity index 100% rename from assets/img/misc/alert_fatigue.png rename to _site/docs/assets/img/misc/alert_fatigue.png diff --git a/assets/img/misc/communicate.png b/_site/docs/assets/img/misc/communicate.png similarity index 100% rename from assets/img/misc/communicate.png rename to _site/docs/assets/img/misc/communicate.png diff --git a/assets/img/misc/escalation.png b/_site/docs/assets/img/misc/escalation.png similarity index 100% rename from assets/img/misc/escalation.png rename to _site/docs/assets/img/misc/escalation.png diff --git a/assets/img/misc/incident_response_roles.png b/_site/docs/assets/img/misc/incident_response_roles.png similarity index 100% rename from assets/img/misc/incident_response_roles.png rename to _site/docs/assets/img/misc/incident_response_roles.png diff --git a/assets/img/misc/mobile_alerts.png b/_site/docs/assets/img/misc/mobile_alerts.png similarity index 100% rename from assets/img/misc/mobile_alerts.png rename to _site/docs/assets/img/misc/mobile_alerts.png diff --git a/assets/img/misc/oncall_burnout.png b/_site/docs/assets/img/misc/oncall_burnout.png similarity index 100% rename from assets/img/misc/oncall_burnout.png rename to 
_site/docs/assets/img/misc/oncall_burnout.png diff --git a/assets/img/misc/schedule.png b/_site/docs/assets/img/misc/schedule.png similarity index 100% rename from assets/img/misc/schedule.png rename to _site/docs/assets/img/misc/schedule.png diff --git a/assets/img/misc/triage.png b/_site/docs/assets/img/misc/triage.png similarity index 100% rename from assets/img/misc/triage.png rename to _site/docs/assets/img/misc/triage.png diff --git a/assets/img/screenshots/high_business_hours.png b/_site/docs/assets/img/screenshots/high_business_hours.png similarity index 100% rename from assets/img/screenshots/high_business_hours.png rename to _site/docs/assets/img/screenshots/high_business_hours.png diff --git a/assets/img/screenshots/high_urgency.png b/_site/docs/assets/img/screenshots/high_urgency.png similarity index 100% rename from assets/img/screenshots/high_urgency.png rename to _site/docs/assets/img/screenshots/high_urgency.png diff --git a/assets/img/screenshots/low_urgency.png b/_site/docs/assets/img/screenshots/low_urgency.png similarity index 100% rename from assets/img/screenshots/low_urgency.png rename to _site/docs/assets/img/screenshots/low_urgency.png diff --git a/assets/img/screenshots/prio-high.png b/_site/docs/assets/img/screenshots/prio-high.png similarity index 100% rename from assets/img/screenshots/prio-high.png rename to _site/docs/assets/img/screenshots/prio-high.png diff --git a/assets/img/screenshots/prio-low.png b/_site/docs/assets/img/screenshots/prio-low.png similarity index 100% rename from assets/img/screenshots/prio-low.png rename to _site/docs/assets/img/screenshots/prio-low.png diff --git a/assets/img/screenshots/prio-norm.png b/_site/docs/assets/img/screenshots/prio-norm.png similarity index 100% rename from assets/img/screenshots/prio-norm.png rename to _site/docs/assets/img/screenshots/prio-norm.png diff --git a/assets/img/screenshots/suppressed.png b/_site/docs/assets/img/screenshots/suppressed.png similarity index 100% rename from 
assets/img/screenshots/suppressed.png rename to _site/docs/assets/img/screenshots/suppressed.png diff --git a/assets/img/thumbnails/nims_core.png b/_site/docs/assets/img/thumbnails/nims_core.png similarity index 100% rename from assets/img/thumbnails/nims_core.png rename to _site/docs/assets/img/thumbnails/nims_core.png diff --git a/assets/img/thumbnails/nims_training.png b/_site/docs/assets/img/thumbnails/nims_training.png similarity index 100% rename from assets/img/thumbnails/nims_training.png rename to _site/docs/assets/img/thumbnails/nims_training.png diff --git a/_site/docs/before/call_etiquette.md b/_site/docs/before/call_etiquette.md new file mode 100644 index 0000000..9eb7bb4 --- /dev/null +++ b/_site/docs/before/call_etiquette.md @@ -0,0 +1,50 @@ +You've just joined an incident call, and you've never been on one before. You have no idea what's going on, or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of. + +![Obama phone](../assets/img/headers/obama_phone.jpg) +*Credit: [Official White House Photo](https://commons.wikimedia.org/wiki/File:Barack_Obama_on_phone_with_Benjamin_Netanyahu_2009-06-08.jpg) by Pete Souza* + +## First Steps + +* If you intend on participating on the incident call you should join both the call, and Slack. +* Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum. +* Keep your microphone muted until you have something to say. +* Identify yourself when you join the call; State your name and the system you are the expert for. +* Speak up and speak clearly. +* Be direct and factual. +* Keep conversations/discussions short and to the point. +* Bring any concerns to the Incident Commander (IC) on the call. +* Respect time constraints given by the Incident Commander. 
+ +## Lingo +**Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.** + +![Communication](../assets/img/misc/communicate.png) + +Standard radio [voice procedure](https://en.wikipedia.org/wiki/Voice_procedure#Words_in_voice_procedure) does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are, + +* **Ack/Rog** - "I have received and understood" +* **Say Again** - "Repeat your last message" +* **Standby** - "Please wait a moment for the next response" +* **Wilco** - "Will comply" + +Do not invent new abbreviations, and always favor being explicit over implicit. It is better to make things clearer than to try and save time by abbreviating, only to have a misunderstanding because others didn't know the abbreviation. + +## The Commander +The Incident Commander (IC) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking. + +* Follow all instructions from the incident commander, without exception. +* Do not perform any actions unless the incident commander has told you to do so. +* The commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them. +* Once the commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll. +* Answer any questions the commander asks you in a clear and concise way. + * Answering that you "don't know" something is perfectly acceptable. Do not try to guess. +* The commander may ask you to investigate something and get back to them in X minutes. 
Make sure you are ready with an answer within that time. + * Answering that you need more time is perfectly acceptable, but you need to give the commander an estimate of how much time. + +## Problems? + +#### There's no incident commander on the call! I don't know what to do! +Ask on the call if an IC is present. If you have no response, type `!ic page` in Slack. This will page the primary and backup IC to the call. + +#### I can join the call or Slack, but not both, what should I do? +You're welcome to join only one of the channels, however you should not actively participate in the incident response if so, as it causes disjointed communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it. diff --git a/_site/docs/before/different_roles.md b/_site/docs/before/different_roles.md new file mode 100644 index 0000000..f0d3e30 --- /dev/null +++ b/_site/docs/before/different_roles.md @@ -0,0 +1,134 @@ +There are several roles for our incident response teams at Spearhead Systems. Certain roles only have one person per incident (e.g. support engineer), whereas other roles can have multiple people (e.g. System/Solution Architects, juniors, etc.). It's all about coming together as a team, working the problem, and getting a solution quickly. + +Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page. + +![Incident Response Structure](../assets/img/misc/incident_response_roles.png) + +--- + +## Team Leader (IC) + +### What is it? +A Team Leader acts as the single source of truth of what is currently happening and what is going to happen during a major incident. They come in all shapes, sizes, and colors. TL's are also the key elements in a project (boards in DoIT). + +### Why have one? +As any system grows in size and complexity, things break and cause incidents. 
The TL is needed to help drive major incidents to resolution by organizing his team towards a common goal. + +### What are the responsibilities? +1. Help prepare for projects and incidents, + * Setup communications channels. + * Create the DoIT board(s) and other project planning related materials. + * Funnel people to these communications channels. + * Train team members on how to communicate and train other TL's. +1. Drive incidents and projects to resolution, + * Get everyone on the same communication channel. + * Collect information from team members for their services/area of ownership status. + * Collect proposed repair actions, then recommend repair actions to be taken. + * Delegate all repair actions, the TL is NOT a resolver. + * Be the single authority on system status + * Communicate directly with the customers and end-users + - not the engineers themselves! +1. Post Mortem, + * Creating the initial template right after the incident so people can put in their thoughts while fresh. + * Assigning the post-mortem after the event is over, this can be done after the call. + * Work with Managers/Support on scheduling preventive actions. + +### Who are they? +Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule. + +### How can I become one? +Take a look at our [Team Leader training guide](/training/incident_commander.md). + +--- + +## Deputy + +### What is it? +A Deputy is a direct support role for the Incident Commander. This is not a shadow where the person just observes, the Deputy is expected to perform important tasks during an incident. + +### Why have one? +It's important for the IC to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The deputy helps to support the IC and keep them focussed on the incident. + +### What are the responsibilities? +The Deputy is expected to: + +1. 
Bring up issues to the Incident Commander that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc). +1. Be a "hot standby" Incident Commander, should the primary need to either transition to an SME, or otherwise have to step away from the IC role. +1. Page SMEs or other on-call engineers as instructed by the Incident Commander. +1. Manage the incident call, and be prepared to remove people from the call if instructed by the Incident Commander. +1. Liaise with stakeholders and provide status updates on Slack as necessary. + +### Who are they? +Any Incident Commander can act as a deputy. Deputies need to be trained as an Incident Commander as they may be required to take over command. + +### How can I become one? +Take a look at our [Deputy training guide](/training/deputy.md). Deputies also need to be [trained as an Incident Commander](/training/incident_commander.md). + +--- + +## Scribe + +### What is it? +A Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review. + +### Why have one? +The incident commander will need to focus on the problem at hand, and the subject matter experts will need to focus on resolving the incident. It is important to capture a timeline of events as they happen so that they can be reviewed during the post-mortem to determine how well we performed, and so we can accurately determine any additional impact that we might not have noticed at the time. + +### What are the responsibilities? +The Scribe is expected to: + +1. Ensure the incident call is being recorded. +1. Note in Slack important data, events, and actions, as they happen.
Specifically: + * Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock") + * Status reports when one is provided by the IC (Example: "We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes") + * Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.") + +### Who are they? +Anyone can act as a scribe during an incident, and are chosen by the Incident Commander at the start of the call. Typically the Deputy will act as the Scribe, but that doesn't necessarily need to happen, and for larger incidents may not be possible. + +### How can I become one? +Follow our [Scribe training guide](/training/scribe.md), and then notify the Incident Commanders that you would like to be considered for scribing for the next incident. + +--- + +## Subject Matter Expert + +### What is it? +A Subject Matter Expert (SME), sometimes called a "Resolver", is a domain expert or designated owner of a component or service that is part of the PagerDuty software stack. + +### Why have one? +The IC and deputy are not all-knowing super beings. When there is a problem with a service, an expert in that service is needed to be able to quickly help identify and fix issues. + +### What are the responsibilities? +1. Being able to diagnose common problems with the service. +1. Being able to rapidly fix issues found during an incident. +1. Concise communication skills, specifically for CAN reports: + * Condition: What is the current state of the service? Is it healthy or not? + * Actions: What actions need to be taken if the service is not in a healthy state? + * Needs: What support does the resolver need to perform an action? + +### Who are they? +Anyone who is considered a "domain expert" can act as a resolver for an incident. 
Typically the service's primary on-call will act as the SME for that service. + +### How can I become one? +Take a look at our [Subject Matter Expert training guide](/training/subject_matter_expert.md). You should also discuss with your team and service owner to determine what the requirements are for your particular service. + +--- + +## Customer Liaison + +### What is it? +A person responsible for interacting with customers, either directly, or via our public communication channels. Typically a member of the Customer Support team. + +### Why have one? +All of the other roles will be actively working on identifying the cause and resolving the issue, so we need a role which is focused purely on the customer interaction side of things so that it can be done properly, with the due care and attention it needs. + +### What are the responsibilities? +1. Post any publicly facing messages regarding the incident (Twitter, StatusPage, etc). +1. Notify the IC of any customers reporting that they are affected by the incident. + +### Who are they? +Any member of the Support Team can act as a customer liaison. + +### How can I become one? +Discuss with the Support Team about becoming our next customer liaison. diff --git a/_site/docs/before/severity_levels.md b/_site/docs/before/severity_levels.md new file mode 100644 index 0000000..d7d95c1 --- /dev/null +++ b/_site/docs/before/severity_levels.md @@ -0,0 +1,92 @@ +The first step in any incident response process is to determine what actually constitutes an incident. We classify issues into two high-level categories, "SR" or "IN", each with an attached priority of "Minor", "Normal" or "Major". "SR" are "Service Requests", which are initiated by a customer and usually do not constitute a critical issue (there are exceptions), and "IN" are "Incidents", which are generally "urgent". + +All of our operational issues are to be classified as either a Service Request or an Incident.
Incidents have priority over Service Requests, provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher-severity SR or IN before a lower one (a "Major" priority gets a more intensive response than a "Normal" one, for example). + +!!! note "Always Assume The Worst" + If you are unsure which level an incident is (e.g. not sure if an IN is Major or Normal), **treat it as the higher one**. An incident in progress is not the time to discuss or litigate severities; just assume the highest and review during a post-mortem. +
| Severity | Description | What To Do |
| -------- | ----------- | ---------- |
| Major | • The system is in a critical state and is actively impacting a large number of customers.<br>• Functionality has been severely impaired for a long time, breaking SLA.<br>• Customer-data-exposing security vulnerability has come to our attention. | See During an Incident. |
| Normal | • Functionality of virtualization platform is severely impaired.<br>• E-mail system is offline. | See During an Incident. |
| | Anything above this line is considered a "Major Incident". These are generally Incidents (IN). Below are service requests (SR) which are usually initiated by a human who can help with prioritizing. A call is triggered for all major incidents (indifferently of SR or IN). | |
| Normal | • Partial loss of functionality, only affecting minority of customers.<br>• Something that has the likelihood of becoming Major if nothing is done.<br>• No redundancy in a service (failure of 1 more node will cause outage). | • Work on issue as your top priority.<br>• Liaise with engineers of affected systems to identify cause.<br>• If related to recent deployment, rollback.<br>• Monitor status and notice if/when it escalates.<br>• Mention on Slack if you think it has the potential to escalate. |
| Normal | • Performance issues (delays, etc). Tasks that require non-immediate attention.<br>• Job failure (not impacting alerting). | • Work on the issue as your first priority (above "Low" tasks).<br>• Monitor status and notice if/when it escalates. |
| Low | • Normal bugs which aren't impacting system use, cosmetic issues, etc. | • Create a DoIT ticket and assign to owner of affected system. |
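The ordering rule stated at the top of this page (higher priority first, and an Incident ahead of a Service Request only when the priorities are equal) can be sketched in a few lines of Python. This is a minimal illustration, not how DoIT works internally: the `Priority` enum, the `Issue` class, the reference strings, and the helper names are all hypothetical, and the three levels follow the "Minor", "Normal", "Major" wording used in the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Assumed numeric ranking; only the relative order matters.
    MINOR = 1
    NORMAL = 2
    MAJOR = 3


@dataclass(frozen=True)
class Issue:
    ref: str            # hypothetical identifier, not a real DoIT card id
    kind: str           # "IN" (incident) or "SR" (service request)
    priority: Priority


def work_order(issues):
    """Sort highest priority first; at equal priority, IN comes before SR."""
    return sorted(issues, key=lambda i: (-i.priority, i.kind != "IN"))


def assume_the_worst(candidates):
    """When unsure between several levels, treat the issue as the highest."""
    return max(candidates)


queue = [
    Issue("SR-1", "SR", Priority.NORMAL),
    Issue("IN-1", "IN", Priority.NORMAL),
    Issue("SR-2", "SR", Priority.MAJOR),
    Issue("IN-2", "IN", Priority.MINOR),
]
ordered = [issue.ref for issue in work_order(queue)]
print(ordered)
```

Note that the Major service request is handled before the Normal incident, which is the exception the text describes. `assume_the_worst([Priority.NORMAL, Priority.MAJOR])` returns `Priority.MAJOR`, mirroring the "Always Assume The Worst" note.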
+ +!!! note "Be Specific" + When creating Cards in DoIT, be as specific as possible and include all necessary details, such as when the issue started and what may have triggered it. Document your efforts through worklogs and be specific there as well. diff --git a/_site/docs/during/during_an_incident.md b/_site/docs/during/during_an_incident.md new file mode 100644 index 0000000..49a711e --- /dev/null +++ b/_site/docs/during/during_an_incident.md @@ -0,0 +1,111 @@ +Information on what to do during a major incident. See our [severity level descriptions](/before/severity_levels.md) for what constitutes a major incident. + +!!! note "Documentation" + For your own internal documentation, you should make sure that this page has all of the necessary information prominently displayed, such as phone bridge numbers, Slack rooms, and important chat commands. Here is an example: + +
#incident-chat | https://a-voip-provider.com/incident-call | +1 555 BIG FIRE (+1 555 244 3473) / PIN: 123456
Need an IC? Do !ic page in Slack
For executive summary updates only, join #executive-summary-updates.
+ +!!! info "Security Incident?" + If this is a security incident, you should follow the [Security Incident Response](/during/security_incident_response.md) process. + +## Don't Panic! + +1. Join the incident call and chat (see links above). + * Anyone is free to join the call or chat to observe and follow along with the incident. + * If you wish to participate, however, you should join both. If you can't join the call for some reason, you should have a dedicated proxy for the call. Disjointed discussions in the chat room are ultimately distracting. + +1. Follow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand. + * If you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once can become overwhelming, so we should try to maintain a hierarchical structure to the call if possible. + +1. Follow instructions from the Incident Commander. + * **Is there no IC on the call?** + * Manually page them with `!ic page` in Slack. This will page the primary and backup ICs at the same time. + * Never hesitate to page the IC. It's much better to have them and not need them than the other way around. + +## Steps for Incident Commander +Resolve the incident as quickly and as safely as possible, using the Deputy to assist you. Delegate any tasks to relevant experts at your discretion. + +1. Announce on the call and in Slack that you are the incident commander, and who you have designated as deputy (usually the backup IC) and scribe. + +1. Identify if there is an obvious cause to the incident (recent deployment, spike in traffic, etc.), delegate investigation to relevant experts, + * Use the service experts on the call to assist in the analysis. They should be able to quickly provide confirmation of the cause, but not always. It's the call of the IC on how to proceed in cases where the cause is not positively known.
Confer with service owners and use their knowledge to help you. + +1. Identify investigation & repair actions (roll back, rate-limit services, etc) and delegate actions to relevant service experts. Typically something like this (obviously not an exhaustive list), + * **Bad Deployment:** Roll it back. + * **Web Application Stuck/Crashed:** Do a rolling restart. + * **Event Flood:** Validate automatic throttling is sufficient, adjust manually if not. + * **Data Center Outage:** Validate automation has removed bad data center. Force it to do so if not. + * **Degraded Service Behavior without load:** Gather forensic data (heap dumps, etc), and consider doing a rolling restart. + +1. Listen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly. + * Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly ("If in doubt, tweet it out"). + +1. Once incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now. + * Move the remaining, non-time-critical discussion to Slack. + * Follow up to ensure the customer liaison wraps up the incident publicly. + * Identify any post-incident clean-up work. + * You may need to perform debriefing/analysis of the underlying root cause. + +1. (After call ends) Create the post-mortem page from the template, and assign an owner to the post-mortem for the incident. + +1. (After call ends) Send out an internal email explaining that we had a major incident, provide a link to the post-mortem. + +## Steps for Deputy +You are there to support the IC in whatever they need. + +1. Monitor the status, and notify the IC if/when the incident escalates in severity level, + * OfficerURL can help you to monitor the status on Slack, + * `!status` - Will tell you the current status. 
+ * `!status stalk` - Will continually monitor the status and report it to the room every 30s. + +1. Be prepared to page other people as directed by the Incident Commander. + +1. Provide regular status updates in Slack (roughly every 30 minutes) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @here. + +1. Follow instructions from the Incident Commander. + +## Steps for Scribe +You are there to document the key information from the incident in Slack. + +1. Update the Slack room with who the IC is, who the Deputy is, and that you're the scribe (if not already done). + * e.g. "IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson" + +1. You should add notes to Slack when significant actions are taken, or findings are determined. You don't need to wait for the IC to direct this - use your own judgment. + * You should also add `TODO` notes to the Slack room that indicate follow-ups slated for later. + +1. Follow instructions from the Incident Commander. + +## Steps for Subject Matter Experts +You are there to support the incident commander in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on the repair actions. + +1. Investigate the incident by analyzing any graphs or logs at your disposal. Announce all findings to the incident commander. + * If you are unsure of the cause, that's fine; state that you are investigating and provide regular updates to the IC. + +1. Announce all suggestions for resolution to the incident commander. It is their decision on how to proceed; do not take any action unless told to do so! + +1. Follow instructions from the incident commander. + +1. (Optional) Once the call is over and post-mortem is created, add any notes you think are relevant to the post-mortem page. + +## Steps for Customer Liaison +Be on stand-by to post public-facing messages regarding the incident. + +1.
You will typically be required to update the status page and to send Tweets from our various accounts at certain times during the call. + +1. Follow instructions from the Incident Commander. diff --git a/_site/docs/during/security_incident_response.md b/_site/docs/during/security_incident_response.md new file mode 100644 index 0000000..cd8d0a3 --- /dev/null +++ b/_site/docs/during/security_incident_response.md @@ -0,0 +1,141 @@ +!!! note "Incident Commander Required" + As with all major incidents at PagerDuty, security ones will also involve an Incident Commander, who will delegate the tasks to relevant resolvers. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity. + +## Checklist +Details for each of these items are available in the next section. + +1. Stop the attack in progress. +1. Cut off the attack vector. +1. Assemble the response team. +1. Isolate affected instances. +1. Identify timeline of attack. +1. Identify compromised data. +1. Assess risk to other systems. +1. Assess risk of re-attack. +1. Apply additional mitigations, additions to monitoring, etc. +1. Forensic analysis of compromised systems. +1. Internal communication. +1. Involve law enforcement. +1. Reach out to external parties that may have been used as vector for attack. +1. External communication. + +--- + +## Attack Mitigation +Stop the attack as quickly as you can, via any means necessary. Shut down servers, network isolate them, turn off a data center if you have to. Some common things to try, + +* Shutdown the instance from the provider console (do not delete or terminate if you can help it, as we'll need to do forensics). +* If you happen to be logged into the box you can try to, + * Re-instate our default iptables rules to restrict traffic. + * `kill -9` any active session you think is an attacker. + * Change root password, and update /etc/shadow to lock out all other users. 
+ * `sudo shutdown now` + +## Cut Off Attack Vector +Identify the likely attack vectors and patch/fix them so they cannot be re-exploited immediately after stopping the attack. + +* If you suspect a third-party provider is compromised, delete all accounts except your own (and those of others who are physically present) and immediately rotate your password and MFA tokens. +* If you suspect a service application was an attack vector, disable any relevant code paths, or shut down the service entirely. + +## Assemble Response Team +Identify the key responders for the security incident, and keep them all in the loop. Set up a secure method of communicating all information associated with the incident. Details on the incident (or even the fact that an incident has occurred) should be kept private to the responders until you are confident the attack is not being triggered internally. + +* The security and site-reliability teams should usually be involved. +* A representative for any affected services should be involved. +* An Incident Commander (IC) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate. +* Do not communicate with anyone not on the response team about the incident until forensics has been performed. The attack could be happening internally. +* Give the project an innocuous codename (e.g. sapphire-unicorn) that can be used for chats/documents so if anyone overhears they don't realize it's a security incident. +* Prefix all emails and chat topics with "Attorney Work Project". + +## Isolate Affected Instances +Any instances which were affected by the attack should be immediately isolated from any other instances. As soon as possible, an image of the system should be taken and put into a read-only cold storage for later forensic analysis.
+ +* Blacklist the IP addresses for any affected instances from all other hosts. +* Turn off and shut down the instances immediately if you didn't do that to stop the attack. +* Take a disk image for any disks attached to the instances, and ship them to an off-site cold storage location. You should make sure these images are read-only and cannot be tampered with. + +## Identify Timeline of Attack +Work with all tools at your disposal to identify the timeline of the attack, along with exactly what the attacker did. + +* Any reconnaissance the attacker performed on the system before the attack started. +* When the attacker gained access to the system. +* What actions the attacker performed on the system, and when. +* Identify how long the attacker had access to the system before they were detected, and before they were kicked out. +* Identify any queries the attacker ran on databases. +* Try to identify if the attacker still has access to the system via another back door. Monitor logs for unusual activity, etc. + +## Compromised Data +Using forensic analysis of log files, time-series graphs, and any other information/tools at your disposal, attempt to identify what information was compromised (if any), + +* Identify any data that was compromised during the attack. + * Was any data exfiltrated from a database? + * What keys were on the system that are now considered compromised? + * Was the attacker able to identify other components of the system (map out the network, etc.)? +* Find exactly what customer data has been compromised, if any. + +## Assess Risk +Based on the data that was compromised, assess the risk to other systems. + +* Does the attacker have enough information to find another way in? +* Were any passwords or keys stored on the host? If so, they should be considered compromised, regardless of how they were stored.
+* Any user accounts that were used in the initial attack should rotate all of their keys and passwords on every other system they have an account. + +## Apply Additional Mitigations +Start applying mitigations to other parts of your system. + +* Rotate any compromised data. +* Identify any new alerting which is needed to notify of a similar breach. +* Block any IP addresses associated with the attack. +* Identify any keys/credentials that are compromised and revoke their access immediately. + +## Forensic Analysis +Once you are confident the systems are secured, and enough monitoring is in place to detect another attack, you can move onto the forensic analysis stage. + +* Take any read-only images you created, any access logs you have, and comb through them for more information about the attack. +* Identify exactly what happened, how it happened, and how to prevent it in future. +* Keep track of all IP addresses involved in the attack. +* Monitor logs for any attempt to regain access to the system by the attacker. + +## Internal Communication +**Delegate to:** VP or Director of Engineering + +Communicate internally only once you are confident (via forensic analysis) that the attack was not sourced internally. + +* Don't go into too much detail. +* Overview the timeline. +* Discuss mitigation steps taken. +* Follow up with more information once it is known. + +## Liaise With Law Enforcement / External Actors +**Delegate to:** VP or Director of Engineering + +Work with law enforcement to identify the source of the attack, letting any system owners know that systems under their control may be compromised, etc. + +* Contact local law enforcement. +* Contact FBI. +* Contact operators for any systems used in the attack, their systems may also have been compromised. +* Contact security companies to help in assessing risk and any PR next steps. 
+ +## External Communication +**Delegate to:** Marketing Team + +Once you have validated all of the information you have is accurate, have a timeline of events, and know exactly what information was compromised, how it was compromised, and sure that it won't happen again. Only then should you prepare and release a public statement to customers informing them of the compromised information and any steps they need to take. + +* Include the date in the title of any announcement, so that it's never confused for a potential new breach. +* Don't say "We take security very seriously". It makes everyone cringe when they read it. +* Be honest, accept responsibility, and present the facts, along with exactly how we plan to prevent such things in future. +* Be as detailed as possible with the timeline. +* Be as detailed as possible in what information was compromised, and how it affects customers. If we were storing something we shouldn't have been, be honest about it. It'll come out later and it'll be much worse. +* Don't name and shame any external parties that might have caused the compromise. It's bad form. (Unless they've already publicly disclosed, in which case we can link to their disclosure). +* Release the external communication as soon as possible, preferably within a few days of the compromise. The longer we wait, the worse it will be. +* Figure out if there is a way to get in touch with customers' internal security teams before the general public notice is sent. 
+ +--- + +## Additional Reading + +* [Computer Security Incident Handling Guide](http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf) (NIST) +* [Incident Handler's Handbook](https://www.sans.org/reading-room/whitepapers/incident/incident-handlers-handbook-33901) (SANS) +* [Responding to IT Security Incidents](https://technet.microsoft.com/en-us/library/cc700825.aspx) (Microsoft) +* [Defining Incident Management Processes for CSIRTs: A Work in Progress](http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=7153) (CMU) +* [Creating and Managing Computer Security Incident Handling Teams (CSIRTS)](https://www.first.org/conference/2008/papers/killcrece-georgia-slides.pdf) (CERT) diff --git a/_site/docs/index.md b/_site/docs/index.md new file mode 100644 index 0000000..60ca4b3 --- /dev/null +++ b/_site/docs/index.md @@ -0,0 +1,57 @@ +This documentation covers parts of the Spearhead Systems Issue Response process. It is a copy of [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the [about page](about.md) for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and may not yet be open sourced. + +!!! note "Issue, Incident and Service Request" + At Spearhead we use the term *issue* to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". 
Note that we use the term Incident to describe both a service request as well as incidents. For brevity we will use SR and IN throughout this documentation. + +A "service request" is usually initiated by a human and is generally not critical for the normal functioning of the business while an "incident" is an issue that is or can cause interruption to normal business functions. + +![Issue Response at Spearhead Systems](./assets/img/headers/sph_ir.jpg) + +## Being On-Call + +If you've never been on-call before, you might be wondering what it's all about. These pages describe what the expectations of being on-call are, along with some resources to help you. + +* [Being On-Call](oncall/being_oncall.md) - _A guide to being on-call, both what your responsibilities are, and what they are not._ +* [Alerting Principles](oncall/alerting_principles.md) - _The principles we use to determine what things page an engineer, and what time of day they page._ + +## Before an Incident + +Reading material for things you probably want to know before an incident occurs. You likely don't want to be reading these during an actual incident. + +* [Severity Levels](before/severity_levels.md) - _Information on our severity level classification. What constitutes a Low issue? What's a "Major Incident"?, etc._ +* [Different Roles for Incidents](before/different_roles.md) - _Information on the roles during an incident; Incident Commander, Scribe, etc._ +* [Incident Call Etiquette](before/call_etiquette.md) - _Our etiquette guidelines for incident calls, before you find yourself in one._ + +## During an Incident + +Information and processes during an incident. 
+ +* [During an Incident](during/during_an_incident.md) - _Information on what to do during an incident, and how to constructively contribute._ +* [Security Incident Response](during/security_incident_response.md) - _Security incidents are handled differently to normal operational incidents._ + +## After an Incident + +Our followup processes, how we make sure we don't repeat mistakes and are always improving. + +* [Post-Mortem Process](after/post_mortem_process.md) - _Information on our post-mortem process; what's involved and how to write or run a post-mortem._ +* [Post-Mortem Template](after/post_mortem_template.md) - _The template we use for writing our post-mortems for major incidents._ + +## Training + +So, you want to learn about incident response? You've come to the right place. + +* [Training Overview](training/overview.md) - _An overview of our training guides and additional training material from third-parties._ +* [Incident Commander Training](training/incident_commander.md) - _A guide to becoming our next Incident Commander._ +* [Deputy Training](training/deputy.md) - _How to be a deputy and back up the Incident Commander._ +* [Scribe Training](training/scribe.md) - _A guide to scribing._ +* [Subject Matter Expert Training](training/subject_matter_expert.md) - _A guide on responsibilities and behavior for all participants in a major incident._ +* [Glossary of Incident Response Terms](training/glossary.md) - _A collection of terms that you may hear being used, along with their definition._ + +## Additional Reading + +Useful material and resources from external parties that are relevant to incident response. 
+ +* [Incident Management for Operations](http://shop.oreilly.com/product/0636920036159.do) (O'Reilly) +* [Incident Response](http://shop.oreilly.com/product/9780596001308.do) (O'Reilly) +* [Debriefing Facilitation Guide](http://extfiles.etsy.com/DebriefingFacilitationGuide.pdf) (Etsy) +* [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system) (FEMA) diff --git a/_site/docs/oncall/alerting_principles.md b/_site/docs/oncall/alerting_principles.md new file mode 100644 index 0000000..89c2803 --- /dev/null +++ b/_site/docs/oncall/alerting_principles.md @@ -0,0 +1,35 @@ +We manage how we get alerted based on many factors, such as the customer's contractual SLA and the urgency of their request or incident. **An alert or notification is something which requires a human to perform an action**. Based on the severity of the issue (service request or incident) we prioritize accordingly in [DoIT](http://doit.sphs.ro). + +!!! warning "Major Priority Alerts" + Anything that wakes up a human in the middle of the night should be **immediately human actionable**. If it is not, then we need to adjust the alert to not page at those times. + +| Priority | Alerts | Response | +| -------- | ------ | -------- | +| Major | Major-Priority Spearhead Alert 24/7/365. | Requires **immediate human action**. | +| Normal | Normal-Priority Spearhead Alert during **business hours only**. | Requires human action that same working day. | +| Minor | Minor-Priority Spearhead Alert 24/7/365. | Requires human action at some point. | + +Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer. + +If you're setting up a new alert/notification, consider the chart above for how you want to alert people.
Be mindful not to create new high-priority alerts if they don't require an immediate response, for example. + +!!! info "Alert Channels" + Presently we use email as the only notification method. This means keeping an eye on your email is essential! + SMS and Push notifications are in the pipeline for DoIT. + +## Examples + +#### "Production service is failing for 75% of requests, automation is unable to resolve." +This would be a **Major** priority IN, requiring immediate human action to resolve. + +![Major Urgency](../assets/img/screenshots/prio-high.png) + +#### A customer sends an email stating that "Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve." +This would be a **Normal** priority SR, requiring human action soon, but not immediately. + +![Normal Urgency](../assets/img/screenshots/prio-norm.png) + +#### "An SSL certificate is due to expire in one week." +This would be a **Minor** priority SR, requiring human action some time soon. + +![Minor Urgency](../assets/img/screenshots/prio-low.png) diff --git a/_site/docs/oncall/being_oncall.md b/_site/docs/oncall/being_oncall.md new file mode 100644 index 0000000..6b69246 --- /dev/null +++ b/_site/docs/oncall/being_oncall.md @@ -0,0 +1,95 @@ +A summary of expectations and helpful information for being on-call. + +![Alert Fatigue](../assets/img/misc/alert_fatigue.png) + +## What is On-Call? +Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise. For example, if you are on-call, should any alarms be triggered by our monitoring solution, you will receive a "page" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state.
+ +At Spearhead Systems we consider you are on-call during normal working hours in which case you are proactively working with [DoIT](http://doit.sphs.ro/) and looking over your assigned cards/boards as well as when you are formally "on-call" and issues are being redirected to you. + +On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and is the problem that the Spearhead Systems professional services is trying to fix! + +## Responsibilities + +1. **Prepare** + * Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc). + * Have a way to charge your MiFi. + * Team alert escalation happens within 5 minutes, set/stagger your notification timeouts (push, SMS, phone...) accordingly. + * Make sure Spearhead Systems (and colleagues directly) texts and calls can bypass your "Do Not Disturb" settings. + * Be prepared (environment is set up, a current working copy of the necessary repos is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, you have Java installed, ssh-keys and so on...) + * Read our Incident Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc. + * Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc. + +1. **Triage** + * Acknowledge and act on alerts whenever you can (see the first "Not responsibilities" point below) + * Determine the urgency of the problem: + * Is it something that should be worked on right now or escalated into a major incident? ("production server on fire" situations. Security alerts) - do so. 
+ * Is it some tactical work that doesn't have to happen during the night? (for example, disk utilization high watermark, but there's plenty of space left and the trend is not indicating impending doom) - snooze the alert until a more suitable time (working hours, the next morning...) and get back to fixing it then. + * Check Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there. + * Do the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert for, then escalate to another team member or group. + +1. **Fix** + * You are empowered to dive into any problem and act to fix it. + * Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before. + * If the issue is not very time sensitive and you have other priority work, make a note of this in DoIT to keep track of it (with an appropriate severity). + +1. **Improve** + * If a particular issue keeps happening; if an issue alerts often but turns out to be a preventable non-issue – perhaps improving this should be a longer-term task. + * Disks that fill up, logs that should be rotated, noisy alerts... (we use Ansible, go ahead and start automating!) + * If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized. + +1. **Support** + * When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note. 
+ * If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance. + * Support each other: when doing activities that might generate plenty of pages, it is courteous to "take the page" away from the on-call by notifying them and scheduling an override for the duration. + +## Not Responsibilities + +1. No expectation to be the first to acknowledge _all_ of the alerts during the on-call period. + * Commutes (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That's what the backup on-call and schedule are for. + +1. No expectation to fix all issues by yourself. + * No one knows everything. Your whole team is here to help. There is no shame in, and much to be learned from, escalating issues you are not certain about. "Never hesitate to escalate". + * Service owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once – and it's often best to let the subject matter expert do the cutting. + +## Recommendations +If your team is starting its own on-call rotation, here are some scheduling recommendations from the Operations team. + +* Always have a backup schedule. Yes, this means two people being on-call at the same time, however it takes a lot of the stress off of the primary if they know they have a specific backup they can contact, rather than trying to choose a random member of the team. + * A backup shift should generally come directly after a primary shift. It gives the previous primary a chance to pass on additional context which may have come up during their shift. It also helps to prevent people from sitting on issues with the intent of letting the next shift fix it. 
+ +* The third-level of your escalation (after backup schedule) should probably be your entire team. This should hopefully never happen (it's happened once in the history of the Support team), but when it does, it's useful to be able to just get the next available person. + +![Escalation](../assets/img/misc/escalation.png) + +* Team managers can (and should) be part of your normal rotation. It gives a better insight into what has been going on. + +* New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time). + +* We recommend you set your escalation timeout to 5 minutes. This should be plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway. + +* When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient. + +### Notification Method Recommendations +You are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents. If you're not sure how to configure them, the Support team has some recommendations, + +![Mobile Alerts](../assets/img/misc/mobile_alerts.png) + +* Use Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. 
(DoIT is in the process of integrating with SNS for push notifications.) +* Use Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you. + +## Etiquette + +* If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice. + +* Don't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then you shouldn't be acknowledging it. Add a comment with your notes instead. + +![Acknowledging](../assets/img/misc/ack.png) + +* If you are testing something, or performing an action that you know will cause a page (notification, alert), it's customary to "take the pager" for the time during which you will be testing. Notify the person on-call that you are taking the pager for the next hour while you test. + +* "Never hesitate to escalate" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help. + +* Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town. + +* If an issue comes up during your on-call shift for which you got paged, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible. 
diff --git a/_site/docs/training/deputy.md b/_site/docs/training/deputy.md new file mode 100644 index 0000000..07f15cd --- /dev/null +++ b/_site/docs/training/deputy.md @@ -0,0 +1,57 @@ +So you want to be a deputy? You've come to the right place! + +![Deputy](../assets/img/headers/incident_command_support.jpg) +*Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743801731/in/album-72157633494644719/)* + +## Purpose +The purpose of the Deputy is to support the IC by keeping track of timers, notifying the IC of important information, and paging other people as directed by the IC. + +It's important for the IC to focus on the problem at hand, rather than worrying about monitoring timers. The deputy is there to help support the IC and keep them focussed on the incident. + +As a Deputy, you will be expected to take over command from the IC if they request it. + +**You should not be performing any remediations, checking graphs, or investigating logs**. Those tasks will be delegated to the resolvers by the IC. + +## Prerequisites +Before you can be a Deputy, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! + +* Be trained as an [Incident Commander](/training/incident_commander.md). + +## Responsibilities +Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from a Deputy, as well as what we expect from the other roles you'll be interacting with. + +## Training Process +The training process for a Deputy is quite simple. + +* Follow our [Incident Commander Training](/training/incident_commander.md). +* Read this page. + +## Incident Call Procedures and Lingo +The [Steps for Deputy](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. + +Here are some examples of phrases and patterns you should use during incident calls. 
+ +### Keep Track of Responders +As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the `!ic responders` to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them. + +> Do we have a representative from [X] on the call? + +> (pause) + +> Deputy, can you go ahead and page the [X] on-call please. + +You can page them however you see fit, phone call, etc. + +### Provide Executive Status Updates +Provide regular status updates on Slack (roughly every 30mins), giving an executive summary of the current status during SEV-1 incidents. Keep it short and to the point, and use @here. Mention the current state, the actions in progress, customer impact, and expected time remaining. It's OK to miss out some of those if the information isn't known. + +> @here: We are in SEV-1 due to X. Current actions in progress are to do Y. Expecting 3 mins to complete that action. Once action is complete, system should recover on its own within 5 minutes. + +### Alert IC to Timers +You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. This is as simple as telling the IC on the call, + +> IC, be advised the incident is now at the 10 minute mark. + +Similarly, when the IC asks for someone to get back to them in X minutes, you are expected to keep track of that. You should remind the IC when that time has been reached. + +> IC, be advised the timer for [TEAM]'s investigation is up. diff --git a/_site/docs/training/glossary.md b/_site/docs/training/glossary.md new file mode 100644 index 0000000..d197a4c --- /dev/null +++ b/_site/docs/training/glossary.md @@ -0,0 +1,14 @@ +Ever wonder what all of those strange words you sometimes see in our documentation mean? 
This page is here to help. + +| Term | Description | +| ---- | ----------- | +| **IC / Incident Commander** | The incident commander is the person responsible for bringing any major incident to resolution. They are the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. [More info](../before/different_roles.md). | +| **Deputy** | Typically the backup IC. The deputy's job is to support the IC during the call, providing them with any help they need. [More info](../before/different_roles.md). | +| **Scribe** | The scribe's job is to keep a log of all activities performed during the call in a written chat log on Slack. [More info](../before/different_roles.md). | +| **Resolver** | A person on the incident call who is able to help resolve issues within a particular system. Also referred to as an SME (see below). [More info](../before/different_roles.md). | +| **SME** | "Subject Matter Expert", someone who is an expert in a particular service or subject who can provide information to the IC, and perform resolution actions for a particular system. [More info](../before/different_roles.md). | +| **CAN Report** | CAN stands for "Conditions" "Actions" "Needs", if an IC asks you for a CAN report, you should provide the current state of your service (condition), what actions need to be taken to return it to a healthy state (actions), and what support you need in order to perform the actions (needs). | +| **Sev / Severity** | How severe the incident is. The "sev" of an incident determines the type of response we give. The higher the severity, the higher the likelihood of making risky actions to resolve the situation. [More info](../before/severity_levels.md). | +| **Span of Control** | Refers to the number of direct reports you have. For example, if the IC has 10 people as direct reports on a call, they have a large span of control. 
We aim to make the span of control as small as we can while still being productive. | +| **Grenade Thrower** | Someone who joins the call late in the game, and provides information that completely derails the current thinking. They then leave almost immediately. | +| **Executive Swoop** | When an executive comes on the call and drops some sort of bombshell. A version of grenade throwing. | diff --git a/_site/docs/training/incident_commander.md b/_site/docs/training/incident_commander.md new file mode 100644 index 0000000..4721981 --- /dev/null +++ b/_site/docs/training/incident_commander.md @@ -0,0 +1,263 @@ +So you want to be an incident commander? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern)! + +![Gene Kranz](../assets/img/headers/gene_kranz.jpg) +*Credit: [NASA](https://en.wikipedia.org/wiki/File:Eugene_F._Kranz_at_his_console_at_the_NASA_Mission_Control_Center.jpg)* + +## Purpose +If you could boil down the definition of an Incident Commander to one sentence, it would be, + +> Take whatever actions are necessary to protect PagerDuty systems and customers. + +The purpose of the Incident Commander is to be the decision maker during a major incident, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. + +The Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. + +Your job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. **You should not be performing any actions or remediations, checking graphs, or investigating logs.** Those tasks should be delegated. 
+ +## Prerequisites +Before you can be an Incident Commander, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! + +* Has **excellent knowledge of PagerDuty systems** and is able to quickly evaluate good vs bad options, and quickly identify what's actually going on. +* Been at PagerDuty for at least 6 months and has a **solid understanding of the incident notification pipeline and web stack**. +* Excellent verbal and written **communication skills**. +* Has **knowledge of obscure PagerDuty terms**. +* Has gravitas and is **willing to kick people off a call** to remove distractions, even if it's the CEO. + +## Responsibilities +Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with. + +## Qualities +Some qualities we expect from an effective leader include being able to: + +* Take command. +* Motivate responders. +* Communicate clear directions. +* Size up the situation and make rapid decisions. +* Assess the effectiveness of tactics/strategies. +* Be flexible and modify your plans as necessary. + +As a leader, you should try to: + +* Be proficient in your job. +* Make sound and timely decisions. +* Ensure tasks are understood. +* Be prepared to step out of a tactical role to assume a leadership role. + +## Training Process +The process is fairly loose for now. Here's a list of things you can do to train though, + +* Read the rest of this page, particularly the sections below. + +* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF). + * Shadow a FF to see how it's run. + * Be the scribe for multiple FF's. + * Be the incident commander for multiple FF's. + +* Play a game of "[Keep Talking and Nobody Explodes](http://www.keeptalkinggame.com/)" with other people in the office. 
+ * For a more realistic experience, play it with someone in a different office over Hangouts. + +* Shadow a current incident commander for at least a full week shift. + * Get alerted when they do, join in on the same calls. + * Sit in on an active incident call, follow along with the chat, and follow along with what the Incident Commander is doing. + * **Do not actively participate in the call, keep your questions until the end.** + +* Reverse shadow a current incident commander for at least a full week shift. + * You should be the one to respond to incidents, and you will take point on calls, however the current IC will be there to take over should you not know how to proceed. + +## Graduation +What's the difference between an IC in training, and an IC? (This isn't the set up to a joke). Simple, an IC puts themselves on the schedule. + +## Handling Incidents +Every incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one. + +1. **Identify the symptoms.** + * Identify what the symptoms are, how big the issue is, and whether it's escalating/flapping/static. + +1. **Size-up the situation.** + * Gather as much information as you can, as quickly as you can (remember the incident is still happening while you're doing this). + * Get the facts, the possibilities of what can happen, and the probability of those things happening. + +1. **Stabilize the incident.** + * Identify actions you can use to proceed. + * Gather support for the plan (See "Polling During a Decision" below). + * Delegate remediation actions to your SME's. + +1. **Provide regular updates.** + * Maintain a cadence, and provide regular updates to everyone on the call. + * What's happening, what are we doing about it, etc. + +## Deputy +The deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. 
Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander’s position if required. + +## Communication Responsibilities +Sharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands. + +When given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks. + +After an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary. + +## Incident Call Procedures and Lingo +The [Steps for Incident Commander](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. + +Additionally, aside from following the [usual incident call etiquette](/before/call_etiquette.md), there are a few extra etiquette guidelines you should follow as IC: + +* Always announce when you join the call if you are the on-call IC. +* Don't let discussions get out of hand. Keep conversations short. +* Note objections from others, but your call is final. +* If anyone is being actively disruptive to your call, kick them off. +* Announce the end of the call. + +Here are some examples of phrases and patterns you should use during incident calls. + +### Start of Call Announcement +At the start of any major incident call, the incident commander should announce the following, + +> This is [NAME], I am the Incident Commander for this call. + +This establishes to everyone on the call what your name is, and that you are now the commander. You should state "Incident Commander" and not "IC", as newcomers may not be familiar with the terminology yet. 
The word "commander" makes it very clear that you're in charge. + +### Start of Incident, IC Not Present +If you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following, + +> Is there an IC on the call? + +> (pause) + +> Hearing no response, this is [NAME], and I am now the Incident Commander for this call. + +If the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure) + +### Checking if SME's are Present +During a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your deputy can page one if no one answers. + +> Do we have a representative from [X] on the call? + +> (pause) + +> Deputy, can you go ahead and page the [X] on-call please. + +### Assigning Tasks +When you need to give out an assignment or task, give it to a person directly, never say "can someone do..." as this leads to the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect). Instead, all actions should be assigned to a specific person, and time-boxed with a specific number of minutes. + +> IC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes. + +> Bob: Understood + +Keep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings. + +### Polling During a Decision +If a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. But it's important that no one can come later and object to the plan, saying things like "I knew that would happen". An IC will use very specific language to be sure that doesn't happen. 
+ +> The proposal is to [EXPLAIN PROPOSAL] + +> Are there any strong objections to this plan? + +> (pause) + +> Hearing no objections, we are proceeding with this proposal. + +If you were to ask "Does everyone agree?", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter. + +### Status Updates +It's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page. + +> While we wait for [X], here's an update of our current situation. + +> We are currently in a SEV-1 situation, we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time. + +> Are there any additional actions or proposals from anyone else at this time? + +### Transfer of Command +Transfer of command involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place, + +* Commander has become fatigued and is unable to continue. +* Incident complexity changes. +* Change of command is necessary for effectiveness or efficiency. +* Personal emergencies arise (e.g., Incident Commander has a family emergency). + +Never feel like you are not doing your job properly by handing over. Handovers are encouraged. In order to hand over, out of band from the main call (via Slack for example), notify the other IC that you wish to transfer command. Update them with anything you feel appropriate. 
Then announce on the call, + +> Everyone on the call, be advised, at this time I am handing over command to [X]. + +The new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander. + +Note that the arrival of a more qualified person does NOT necessarily mean a change in incident command. + +### Maintaining Order +Often times on a call people will be talking over one another, or an argument on the correct way to proceed may break out. As Incident Commander it's important that order is maintained on a call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO). But often times you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long. + +> (noise) + +> Ok everyone, can we all speak one at a time please. So far I'm hearing two options to proceed: 1) [X], 2) [Y]. + +> Are there any other proposals someone would like to make at this time? + +> ...etc + +### Getting Straight Answers +You may ask a question as IC and receive an answer that doesn't actually answer your question. This is generally when you ask for a yes/no answer but get a more detailed explanation. This can often times be because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed. + +> IC: Is this going to disable the service for everyone? + +> SME: Well... for some people it.... + +> IC: Stop. I need a yes/no answer. Is this going to disable the service for everyone? + +> SME: Well... it might not do... + +> IC: Stop. I'm going to ask again, and the only two words I want to hear from you are "yes" or "no". Is this going to disable the service for everyone? + +> SME: Well.. like I was saying.. + +> IC: Stop. Leave the call. 
Backup IC can you please page the backup on-call for [service] so that we can get an answer. + +### Executive Swoop +You may get someone who would be senior to you during peacetime come on the call and start overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. While this is rare, you can get things back on track with the following, + +> Executive: No, I don't want us doing that. Everyone stop. We need to rollback instead. + +> IC: Hold please. [EXECUTIVE], do you wish to take over command? + +> Executive: Yes/No + +> (If yes) IC: Understood. Everyone on the call, be advised, at this time I am handing over command to [EXECUTIVE]. They are now the incident commander for this call. + +> (If no) IC: In that case, please cause no further interruptions or I will remove you from the call. + +This makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call. + +### End of Call Sign-Off +At the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where followup discussion can take place. It's also customary to thank everyone. + +> Ok everyone, we're ending the call at this time. Please continue any followup discussion on Slack. Thanks everyone. + +## Examples From Pop Culture +PagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work. + +--- + + +Here's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. 
Here are some things to note: + +* Walks into the room, and it is immediately obvious that he's the IC. Calms the noise, and makes sure everyone is paying attention. +* Provides a status update so people are aware of the situation. +* Projector breaks, doesn't get sidetracked on fixing it, just moves on to something else. +* Provides a proposal for how to proceed and elicits feedback. + * Listens to the feedback calmly. + * When counter-proposal is raised, states that he agrees and why. +* Allows a discussion to happen, listens to all points. When discussion gets out of hand, re-asserts command of the situation. + * Explains his decision, and why. +* Explains his full plan and decision, so everyone is on the same page. + +--- + + +Another clip from Apollo 13. Things to note: + +* Summarizes the situation, and states the facts. +* Listens to the feedback from various people. +* When a trusted SME provides information counter to what everyone else is saying, asks for additional clarification ("What do you mean, everything?") +* Wise cracking remarks are not acknowledged by the IC ("You can't run a vacuum cleaner on 12 amps!") +* "That's the deal?".. "That's the deal". +* Once decision is made, moves on to the next discussion. +* Delegates tasks. diff --git a/_site/docs/training/overview.md b/_site/docs/training/overview.md new file mode 100644 index 0000000..4165d8c --- /dev/null +++ b/_site/docs/training/overview.md @@ -0,0 +1,22 @@ +Learning about the Spearhead Systems incident response process is an important part of being an effective on-call engineer at Spearhead Systems. This section goes over our training material for the various roles that are involved in our incident response, along with some additional information and training material from government agencies. 
+ +## Training Guides +Our training guides are split up by role, however you are encouraged to read through the training guides even for roles you don't belong to, as it can give you some good insight into how those people will be behaving during major incidents. + +* [Incident Commander Training](/training/incident_commander.md) - The "IC" is the person who drives a major incident to resolution. They're the person who will be directing everyone else. +* [Deputy Training](/training/deputy.md) - The Deputy is someone who supports the Incident Commander and can take over for them if necessary. +* [Scribe Training](/training/scribe.md) - This is intended for individuals who will be acting as a scribe during an incident. +* [SME / Resolver Training](/training/subject_matter_expert.md) - This is relevant to everyone at Spearhead Systems who are on-call for any team. + +## National Incident Management System (NIMS) +Our incident response process is loosely based on the [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system), which is described as, + + _A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards—regardless of cause, size, location, or complexity—in order to reduce loss of life, property and harm to the environment._ + +While it might not initially seem that this would be applicable to an IT operations environment, we've found that many of the lessons learned from major incidents in these situations can be directly applied to our industry too. The principles are the same and span many different environments. 
+ +[![NIMS](../assets/img/thumbnails/nims_core.png)](https://www.fema.gov/pdf/emergency/nims/NIMS_core.pdf) [![NIMS Training](../assets/img/thumbnails/nims_training.png)](https://www.fema.gov/pdf/emergency/nims/nims_training_program.pdf) + +If you want to learn more about NIMS, we recommend the [ICS-100](https://training.fema.gov/is/courseoverview.aspx?code=IS-100.b) and [ICS-700](https://training.fema.gov/is/courseoverview.aspx?code=IS-700.a) online training courses, which go over NIMS and the Incident Command System (You can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of [additional training material and courses from FEMA](https://training.fema.gov/nims/) on NIMS, which I would encourage you to look at. + +Also take a look at the [Additional Reading](/#additional-reading) section on the home page. diff --git a/_site/docs/training/scribe.md b/_site/docs/training/scribe.md new file mode 100644 index 0000000..b45c124 --- /dev/null +++ b/_site/docs/training/scribe.md @@ -0,0 +1,75 @@ +So you want to be a scribe? You've come to the right place! You don't need to be a senior team member to become a deputy or scribe; anyone can do it, provided you have the requisite knowledge! + +![Typewriter](../assets/img/headers/typewriter.jpg) +*Credit: [Holly Chaffin](http://www.publicdomainpictures.net/view-image.php?image=49706&picture=antique-typewriter-keys)* + +## Purpose +The purpose of the Scribe is to maintain a timeline of key events during an incident, documenting actions and keeping track of any followup items that will need to be addressed. + +It's important for the rest of the command staff to be able to focus on the problem at hand, rather than worrying about documenting the steps. + +Your job as Scribe is to listen to the call and to watch the incident Slack room, keeping track of context and actions that need to be performed, documenting these in Slack as you go. 
**You should not be performing any remediations, checking graphs, or investigating logs.** Those tasks will be delegated to the subject matter experts (SMEs) by the Incident Commander. + + +## Prerequisites +Before you can be a Scribe, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! + +* Excellent verbal and written **communication skills**. +* Has **knowledge of obscure PagerDuty terms**. + +## Responsibilities +Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from a Scribe, as well as what we expect from the other roles you'll be interacting with. + +## Training Process +There is no formal training process for this role; reading this page should be sufficient for most tasks. Here's a list of things you can do to train, though: + +* Read the rest of this page, particularly the sections below. + +* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF). + * Shadow a FF to see how it's run. + * Be the scribe for multiple FFs. + +## Scribing +Scribing is more art than science. The objective is to keep an accurate record of important events that occurred on the call, so that we can look back at the timeline to see what happened. But what exactly is important? There's no definitive answer, and it really comes down to judgement and experience. But here are some general things you most definitely want to capture as scribe. + +* The result of any polling decisions. + * This is not "9 people voted yay, 3 voted nay". + * It is "Polled for if we should do rolling restart. is proceeding with restart." +* Any followup items that are called out as "We should do this..", "Why didn't this?..", etc. + * This is not "Why isn't the Support representative on the call?" + * This is "TODO: Why didn't we get paged for this earlier?" 
+ +## Incident Call Procedures and Lingo +The [Steps for Scribe](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. + +Here are some examples of phrases and patterns you should use during incident calls. + +### Status Stalking +At the start of any major incident call, you should start our status stalking bot, so that it will post to the room an update automatically. + +> !status stalk + +This will provide the update and allow the IC to see the status without having to keep asking. + +### Note Important Actions +During a call, you will hear lots of discussion happening, you should not be documenting all of this in the chat room. You only want to document things which will be important for the final timeline. It's not always obvious what this might be, and it's usually a matter of judgement. You generally want to note any actions the IC has asked someone to perform, along with the result of any polling decisions. + +> Polled for decision on whether to perform rolling restart. We are proceeding with restart. [USER_A] to execute. + +Some actions might seem important at the time, but end up not being. That's OK. It's better to have more info than not enough, but don't go overboard. + +### Note Followup Actions +Sometimes during the call, someone will either mention something we "should fix", or the IC will specifically ask you to note a followup item. You can do this in Slack by simply prefixing with "TODO", this will make it easier to search for later. + +> TODO: Why did we not get paged for the fall in traffic on [X] cluster? + +The post-mortem owner will find these after and raise tasks for them. + +### End of Call Notification +When the IC ends the call, you should post a message into Slack to let everyone know the call is over, and that they should continue discussion elsewhere. + +> Call is over, thanks everyone. Follow up in Slack. + +Don't forget to also stop the status stalking. 
+ +> !status unstalk diff --git a/_site/docs/training/subject_matter_expert.md b/_site/docs/training/subject_matter_expert.md new file mode 100644 index 0000000..5b11dac --- /dev/null +++ b/_site/docs/training/subject_matter_expert.md @@ -0,0 +1,54 @@ +If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Incident Commander, take a look at the [Incident Commander Training page](/training/incident_commander.md). + +![Incident Response](../assets/img/headers/incident_response.jpg) +*Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743809853/in/album-72157633494644719/)* + +## On-Call Expectations +If you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. Getting paged about a SEV-3 or SEV-4 in your system comes with different expectations than getting paged with a major SEV-2. + +### Before Going On-Call + +1. Be prepared, by having already familiarized yourself with our incident response policies and procedures. In particular, + 1. [Different Roles for Incidents](/before/different_roles.md) - You will be acting as a "Resolver" or "SME". But you should familiarize yourself with the other roles and what they will be doing. + 1. [Incident Call Etiquette](/before/call_etiquette.md) - How to behave during an incident call. + 1. [During an Incident](/during/during_an_incident.md) - What to do during an incident. You are specifically interested in the "Resolver" steps, but you should familiarize yourself with the entire document. + 1. [Glossary](/training/glossary.md) - Familiarize yourself with the terminology that may be used during the call. +1. 
Make sure you have set up your alerting methods, and that PagerDuty can bypass your "Do Not Disturb" settings. +1. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged. +1. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc. +1. If you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander. + +### During On-Call Period + +1. Have your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc). +1. If you have important appointments, you need to get someone else on your team to cover that time slot in advance. +1. When you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes). + 1. You will be asked questions or given actions by the Incident Commander. Answer questions concisely, and follow all actions given (even if you disagree with them). + +## Response Mobilization +When an incident occurs, you must be mobilized or assigned to become part of the incident response. In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised. + +## "Never Hesitate to Escalate" +If you're not sure about something, it is perfectly acceptable to bring in other SMEs from your team that you believe know a given system better than you. Don't let your ego keep you from bringing in additional help. 
Our motto is "Never hesitate to escalate", you will never be looked down upon for escalating something because you didn't know how to handle it. + +## Blameless +There will be incidents. Some will be caused by you, some will be caused by others... some will just happen. Our entire incident response process is completely blameless. Blaming people is counter productive and just distracts from the problem at hand. No matter how an incident started, they all need to get solved as quickly as possible. + +## Wartime vs Peacetime +Behavior during a major incident is very different to any other alert you may have received in the past. We call a major incident "wartime", and make a distinction between that and normal everyday operations ("peacetime"). + +### Peacetime +The organizational structure is generally based on seniority. The more senior members of a team will lead discussions, and managers or team leads will have the final say. Decisions are made after careful consideration of all options, and to minimize potential risk to customers. + +### Wartime +Wartime is different, and you will notice on our major incident calls that there's a different organizational structure. + +* The Incident Commander is in charge. No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO. +* Primary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service. +* Decisions will be made by the IC after consideration of the information presented. Once that decision is made, it is final. +* Riskier decisions can be made by the IC than would normally be considered during peacetime. + * For example, the IC may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else. +* The IC may go against a consensus decision. If a poll is done, and 9/10 people agree but 1 disagrees. 
The IC may choose the disagreement option despite a majority vote. + * Even if you disagree, the IC's decision is final. During the call is not the time to argue with them. +* The IC may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before. +* You may be asked to leave the call by the IC, or you may even be forcibly kicked off a call. It is at the IC's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different than peacetime. diff --git a/_site/mkdocs.yml b/_site/mkdocs.yml new file mode 100644 index 0000000..576145d --- /dev/null +++ b/_site/mkdocs.yml @@ -0,0 +1,65 @@ +# Project Information +site_name: Spearhead Systems Incident Response Documentation +site_description: A collection of information about the Spearhead Systems incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work. +site_author: Spearhead Systems, Inc. +site_favicon: 'assets/img/icon.png' +site_url: https://response.spearhead.systems +base_url: https://response.spearhead.systems + +# Repository +repo_name: 'GitHub' +repo_url: https://github.com/spearheadsys/issue-response-docs + +# Copyright +copyright: 'Copyright © Spearhead Systems, Inc.' 
+ +# Theme +theme: 'material' +theme_dir: 'theme' +extra_css: ['assets/css/extra.css'] +extra: + logo: 'issue-response-docs/assets/img/icon.png' + cover: 'assets/img/cover.png' + palette: + primary: 'green' + accent: 'blue grey' + font: + text: 'Colfax Regular' + code: 'Roboto Mono' + author: + github: 'spearheadsys' + twitter: 'spearhead_sys' + +# Contents +pages: + - Home: 'index.md' + - On-Call: + - Being On-Call: 'oncall/being_oncall.md' + - Alerting Principles: 'oncall/alerting_principles.md' + - Before an Incident: + - Severity Levels: 'before/severity_levels.md' + - Different Roles: 'before/different_roles.md' + - Call Etiquette: 'before/call_etiquette.md' + - During an Incident: + - During An Incident: 'during/during_an_incident.md' + - Security Incident: 'during/security_incident_response.md' + - After an Incident: + - Post-Mortem Process: 'after/post_mortem_process.md' + - Post-Mortem Template: 'after/post_mortem_template.md' + - Training: + - Overview: 'training/overview.md' + - Incident Commander: 'training/incident_commander.md' + - Deputy: 'training/deputy.md' + - Scribe: 'training/scribe.md' + - Subject Matter Expert: 'training/subject_matter_expert.md' + - Glossary: 'training/glossary.md' + - About: 'about.md' + +# Analytics +# google_analytics: ['UA-8759953-1', 'auto'] + +# Extensions +markdown_extensions: + - toc(permalink=#) + - sane_lists: + - admonition: diff --git a/_site/screenshot.png b/_site/screenshot.png new file mode 100644 index 0000000..a7f6d36 Binary files /dev/null and b/_site/screenshot.png differ diff --git a/_site/theme/404.html b/_site/theme/404.html new file mode 100644 index 0000000..6feaf62 --- /dev/null +++ b/_site/theme/404.html @@ -0,0 +1,15 @@ +{% extends "base.html" %} + +{# mkdocs-material doesn't use content as a block, so cheating and using footer here, as that does use a block #} +{% block footer %} + +
+

Sorry! We couldn't find that page.

+

Looks like our well-trained server monkeys dropped the ball. Rest assured they will be dealt with. In the meantime, you probably want to head home. +

+ + + +{% endblock %} diff --git a/_site/theme/base.html b/_site/theme/base.html new file mode 100644 index 0000000..2d94e99 --- /dev/null +++ b/_site/theme/base.html @@ -0,0 +1,196 @@ + + + + + + + + + {% set title = page_title ~ ' - ' ~ site_name if page_title else site_name %} + {{ title }} + + + {% if site_author %}{% endif %} + + + + {% if page_description %}{% endif %} + + + + + + + + + {% if canonical_url %}{% endif %} + + + {% set favicon = favicon | default("assets/images/favicon-e565ddfa3b.ico", true) %} + + + + + + + + {% if config.extra.logo %}{% endif %} + + + + + + + + + + + + + + + + + + + + {% if config.extra.palette %} + + {% endif %} + {% if config.extra.font != "none" %} + {% set text = config.extra.get("font", {}).text | default("Ubuntu") %} + {% set code = config.extra.get("font", {}).code | default("Ubuntu Mono") %} + {% set font = text + ':400,700|' + code | replace(' ', '+') %} + + + {% endif %} + {% for path in extra_css %} + + {% endfor %} + + + + {% block extrahead %}{% endblock %} + + {% set palette = config.extra.get("palette", {}) %} + {% set primary = palette.primary | replace(' ', '-') | lower %} + {% set accent = palette.accent | replace(' ', '-') | lower %} + + {% if repo_name == "GitHub" and repo_url %} + {% set repo_id = repo_url | replace("https://github.com/", "") %} + {% if repo_id[-1:] == "/" %} + {% set repo_id = repo_id[:-1] %} + {% endif %} + {% endif %} +
+
+
+ + + +
+ {% include "header.html" %} +
+
+ {% set h1 = "\x3ch1 id=" in content %} +
+ {% include "drawer.html" %} +
+
+
+ {% if not h1 %} +

{{ page_title | default(site_name, true)}}

+ {% endif %} + {{ content }} + + {% block footer %} +
+ {% include "footer.html" %} +
+ {% endblock %} +
+
+
+
+
+
+
+
+
+
+
+ + + {% for path in extra_javascript %} + + {% endfor %} + {% if google_analytics %} + + {% endif %} + + diff --git a/_site/theme/drawer.html b/_site/theme/drawer.html new file mode 100644 index 0000000..45b00c4 --- /dev/null +++ b/_site/theme/drawer.html @@ -0,0 +1,59 @@ + diff --git a/_site/theme/header.html b/_site/theme/header.html new file mode 100644 index 0000000..f64db37 --- /dev/null +++ b/_site/theme/header.html @@ -0,0 +1,63 @@ + diff --git a/about/index.html b/about/index.html deleted file mode 100644 index 21ef9dc..0000000 --- a/about/index.html +++ /dev/null @@ -1,548 +0,0 @@ - - - - - - - - - - About - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

About

- -

This site documents parts of the Spearhead Systems Issue Response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It covers not only how to prepare for an incident but also what to do during and after one.

-

Few companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone.

-

This documentation is complementary to what is available in our existing wiki.

-

What is this?#

-

A collection of pages detailing how to efficiently deal with any incident or service request that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.

-

Who is this for?#

-

It is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically, this is for all of our Technical Support staff.

-

Why do I need it?#

-

As a service provider, Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service, which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers' issues, so this documentation is a guideline for how we handle these requests. It will give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time.

-

What is covered?#

-

Anything from preparing to go on-call, definitions of severities, incident call etiquette, all the way to how to run a post-mortem, providing a post-mortem template and even a security incident response process.

-

What is missing?#

-

Lots. Dig in and help us complete the picture. We can migrate most processes from SharePoint here.

-

License#

-

This documentation is provided under the Apache License 2.0. In plain English, that means you can use and modify this documentation, both commercially and privately. However, you must include any original copyright notices, and the original LICENSE file.

-

Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account. Feel free to fork the repository and use it as a base for your own internal documentation.

- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/after/post_mortem_process/index.html b/after/post_mortem_process/index.html deleted file mode 100644 index 34c97dd..0000000 --- a/after/post_mortem_process/index.html +++ /dev/null @@ -1,672 +0,0 @@ - - - - - - - - - - Post-Mortem Process - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Post-Mortem Process

- -

For every major incident (SEV-2/1), we need to follow up with a post-mortem: a blame-free, detailed description of exactly what went wrong to cause the incident, along with a list of steps to take to prevent a similar incident from occurring again. The post-mortem should also cover the incident response process itself.

-

Post-Mortem

-

Owner Designation#

-

The first step is that a post-mortem owner will be designated. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below.

-

Owner Responsibilities#

-

As owner of a post-mortem, you are responsible for the following:

-
    -
  • Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).
  • -
  • Updating the page with all of the necessary content.
  • -
  • Investigating the incident, pulling in whomever you need from other teams to assist in the investigation.
  • -
  • Creating follow-up JIRA tickets (You are only responsible for creating the tickets, not following them up to resolution).
  • -
  • Running the post-mortem meeting (these generally run themselves, but you should get people back on topic if the conversation starts to wander).
  • -
  • In cases where we need a public blog post, creating & reviewing it with appropriate parties.
  • -
-

Post-Mortem Wiki Page#

-

Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.

-
    -
  1. -

    (If not already done by the IC) Create a new post-mortem page for the incident.

    -
  2. -
  3. -

    Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar.

    -
      -
    • Create the meeting on the "Incident Post-Mortem Meetings" shared calendar.
    • -
    -
  4. -
  5. -

    Begin populating the page with all of the information you have.

    -
      -
    • The timeline should be the main focus to begin with.
        -
      • The timeline should include important changes in status/impact, and also key actions taken by responders.
      • -
      • You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV).
      • -
      -
    • -
    • Go through the history in Slack to identify the responders, and add them to the page.
        -
      • Identify the Incident Commander and Scribe in this list.
      • -
      -
    • -
    -
  6. -
  7. -

    Populate the page with more detailed information.

    -
      -
    • For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.
    • -
    -
  8. -
  9. -

    Perform an analysis of the incident.

    -
      -
    • Capture all available data regarding the incident. What caused it, how many customers were affected, etc.
    • -
    • Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered.
    • -
    • Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery)
    • -
    • Identify the underlying cause of the incident (What happened, and why did it happen).
    • -
    -
  10. -
  11. -

    Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets),

    -
      -
    • Go through the history in Slack to identify any TODO items.
    • -
    • Label all tickets with their severity level and date tags.
    • -
    • Any actions which can reduce re-occurrence of the incident.
        -
      • (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it).
      • -
      -
    • -
    • Identify any actions which can make our incident response process better.
    • -
    • Be careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with.
    • -
    -
  12. -
  13. -

    Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.

    -
      -
    • Avoid using the word "outage" unless it really was a full outage; use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.
    • -
    • Look at other examples of previous post-mortems to see the kind of thing you should send.
    • -
    -
  14. -
-

Post-Mortem Meeting#

-

These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us.

-

You should invite the following people to the post-mortem meeting:

-
    -
  • Always
      -
    • The incident commander.
    • -
    • Service owners involved in the incident.
    • -
    • Key engineer(s)/responders involved in the incident.
    • -
    -
  • -
  • Optional
      -
    • Customer liaison. (Only SEV-1 incidents)
    • -
    -
  • -
-

A general agenda for the meeting would be something like:

-
    -
  1. Recap the timeline, to make sure everyone agrees and is on the same page.
  2. -
  3. Recap important points, and any unusual items.
  4. -
  5. Discuss how the problem could've been caught.
      -
    • Did it show up in canary?
    • -
    • Could it have been caught in tests, or loadtest environment?
    • -
    -
  6. -
  7. Discuss customer impact. Any comments from customers, etc.
  8. -
  9. Review action items that have been created, discuss if appropriate, or if more are needed, etc.
  10. -
-

Examples#

-

Here are some examples of post-mortems from other companies as a reference:

- -

Useful Resources#

- - - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/after/post_mortem_template/index.html b/after/post_mortem_template/index.html deleted file mode 100644 index b63ac1b..0000000 --- a/after/post_mortem_template/index.html +++ /dev/null @@ -1,670 +0,0 @@ - - - - - - - - - - Post-Mortem Template - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Post-Mortem Template

- -

This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section.

-
-
-

Guidelines

-

This page is intended to be reviewed during a post-mortem meeting that should be scheduled within 5 business days of any event. Your first step should be to schedule the post-mortem meeting in the shared calendar for within 5 business days after the incident. Don't wait until you've filled in the info to schedule the meeting; however, make sure the page is completed by the meeting.

-
-

Post-Mortem Owner: Your name goes here.

-

Meeting Scheduled For: Schedule the meeting on the "Incident Post-Mortem Meetings" shared calendar, for within 5 business days after the incident. Put the date/time here.

-

Call Recording: Link to the incident call recording.

-

Overview#

-

Include a short sentence or two summarizing the root cause, timeline summary, and the impact. E.g. "On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."

-

What Happened#

-

Include a short description of what happened.

-

Root Cause#

-

Include a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process.

-

Resolution#

-

Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution.

-

Impact

-

Be very specific here; include exact numbers.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Time in SEV-1 | ? mins
Notifications Delivered out of SLA | ??% (?? of ??)
Events Dropped / Not Accepted | ??% (?? of ??) (Should usually be 0, but always check)
Accounts Affected | ??
Users Affected | ??
Support Requests Raised | ?? (Include any relevant links to tickets)
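
The percentage rows in the table above follow the pattern "??% (?? of ??)". As a rough sketch of how such a cell could be filled in consistently, assuming you already have the affected and total counts (the function name `format_impact` and the sample numbers are hypothetical, not taken from any real incident):

```python
def format_impact(affected: int, total: int) -> str:
    """Format an impact figure as 'P% (affected of total)'.

    Assumption: three decimal places are enough, matching the
    "0.024%" style used in the Overview example.
    """
    if total == 0:
        # Avoid dividing by zero when nothing was processed at all.
        return "0% (0 of 0)"
    pct = 100 * affected / total
    return f"{pct:.3f}% ({affected} of {total})"


# Illustrative numbers only.
print(format_impact(12, 50000))
```

Reporting the raw counts alongside the percentage lets readers check the figure themselves.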
-

Responders

-
    -
  • Who was the IC?
  • -
  • Who was the scribe?
  • -
  • Who else was involved?
  • -
-

Timeline

-

Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at.

- - - - - - - - - -
Time (UTC) | Event | Data Link
-

How'd We Do?

-

What Went Well?

-
    -
  • List anything you did well and want to call out. It's OK to not list anything.
  • -
-

What Didn't Go So Well?

-
    -
  • List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes.
  • -
-

Action Items

-

Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: “sev1_YYYYMMDD” (such as sev1_20150911) and simply “sev1”. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process.
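
The two-tag convention described above ("sev1_YYYYMMDD" plus "sev1") is easy to mistype by hand. Here is a minimal sketch of generating the pair from the incident date; the helper name `sev1_tags` is hypothetical, and how the tags are then applied to a JIRA ticket is left out because the template does not specify it.

```python
from datetime import date
from typing import List


def sev1_tags(incident_date: date) -> List[str]:
    """Return the dated tag and the plain tag for a SEV-1 incident."""
    # %Y%m%d gives the zero-padded YYYYMMDD form used in the template.
    dated_tag = f"sev1_{incident_date.strftime('%Y%m%d')}"
    return [dated_tag, "sev1"]


# The template's own example date.
print(sev1_tags(date(2015, 9, 11)))
```

Using the same dated tag on every ticket makes it possible to find all action items for one incident with a single search.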

-

Messaging

-

Internal Email

-

This is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page.

-
-

Briefly summarize what happened and where the post-mortem page (this page) can be found.

-
-

External Message

-

This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)

-
-

Summary

-

What Happened?

-

What Are We Doing About This?

-
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/assets/fonts/icon.eot b/assets/fonts/icon.eot deleted file mode 100755 index 8f81638..0000000 Binary files a/assets/fonts/icon.eot and /dev/null differ diff --git a/assets/fonts/icon.svg b/assets/fonts/icon.svg deleted file mode 100755 index 86250e7..0000000 --- a/assets/fonts/icon.svg +++ /dev/null @@ -1,22 +0,0 @@ - - - -Generated by IcoMoon - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/assets/fonts/icon.ttf b/assets/fonts/icon.ttf deleted file mode 100755 index b5ab560..0000000 Binary files a/assets/fonts/icon.ttf and /dev/null differ diff --git a/assets/fonts/icon.woff b/assets/fonts/icon.woff deleted file mode 100755 index ed0f20d..0000000 Binary files a/assets/fonts/icon.woff and /dev/null differ diff --git a/assets/images/favicon-e565ddfa3b.ico b/assets/images/favicon-e565ddfa3b.ico deleted file mode 100644 index e85006a..0000000 Binary files a/assets/images/favicon-e565ddfa3b.ico and /dev/null differ diff --git a/assets/images/favicon.ico b/assets/images/favicon.ico deleted file mode 100644 index e85006a..0000000 Binary files a/assets/images/favicon.ico and /dev/null differ diff --git a/assets/javascripts/application-997097ee0c.js b/assets/javascripts/application-997097ee0c.js deleted file mode 100644 index 1199f2e..0000000 --- a/assets/javascripts/application-997097ee0c.js +++ /dev/null @@ -1 +0,0 @@ -function pegasus(t,e){return e=new XMLHttpRequest,e.open("GET",t),t=[],e.onreadystatechange=e.then=function(n,o,i,r){if(n&&n.call&&(t=[,n,o]),4==e.readyState&&(i=t[0|e.status/200])){try{r=JSON.parse(e.responseText)}catch(s){r=null}i(r,e)}},e.send(),e}if("document"in self&&("classList"in document.createElement("_")?!function(){"use strict";var t=document.createElement("_");if(t.classList.add("c1","c2"),!t.classList.contains("c2")){var e=function(t){var e=DOMTokenList.prototype[t];DOMTokenList.prototype[t]=function(t){var 
n,o=arguments.length;for(n=0;o>n;n++)t=arguments[n],e.call(this,t)}};e("add"),e("remove")}if(t.classList.toggle("c3",!1),t.classList.contains("c3")){var n=DOMTokenList.prototype.toggle;DOMTokenList.prototype.toggle=function(t,e){return 1 in arguments&&!this.contains(t)==!e?e:n.call(this,t)}}t=null}():!function(t){"use strict";if("Element"in t){var e="classList",n="prototype",o=t.Element[n],i=Object,r=String[n].trim||function(){return this.replace(/^\s+|\s+$/g,"")},s=Array[n].indexOf||function(t){for(var e=0,n=this.length;n>e;e++)if(e in this&&this[e]===t)return e;return-1},a=function(t,e){this.name=t,this.code=DOMException[t],this.message=e},c=function(t,e){if(""===e)throw new a("SYNTAX_ERR","An invalid or illegal string was specified");if(/\s/.test(e))throw new a("INVALID_CHARACTER_ERR","String contains an invalid character");return s.call(t,e)},l=function(t){for(var e=r.call(t.getAttribute("class")||""),n=e?e.split(/\s+/):[],o=0,i=n.length;i>o;o++)this.push(n[o]);this._updateClassName=function(){t.setAttribute("class",this.toString())}},u=l[n]=[],d=function(){return new l(this)};if(a[n]=Error[n],u.item=function(t){return this[t]||null},u.contains=function(t){return t+="",-1!==c(this,t)},u.add=function(){var t,e=arguments,n=0,o=e.length,i=!1;do t=e[n]+"",-1===c(this,t)&&(this.push(t),i=!0);while(++nc;c++)a[s[c]]=i(a[s[c]],a);n&&(e.addEventListener("mouseover",this.onMouse,!0),e.addEventListener("mousedown",this.onMouse,!0),e.addEventListener("mouseup",this.onMouse,!0)),e.addEventListener("click",this.onClick,!0),e.addEventListener("touchstart",this.onTouchStart,!1),e.addEventListener("touchmove",this.onTouchMove,!1),e.addEventListener("touchend",this.onTouchEnd,!1),e.addEventListener("touchcancel",this.onTouchCancel,!1),Event.prototype.stopImmediatePropagation||(e.removeEventListener=function(t,n,o){var i=Node.prototype.removeEventListener;"click"===t?i.call(e,t,n.hijacked||n,o):i.call(e,t,n,o)},e.addEventListener=function(t,n,o){var 
i=Node.prototype.addEventListener;"click"===t?i.call(e,t,n.hijacked||(n.hijacked=function(t){t.propagationStopped||n(t)}),o):i.call(e,t,n,o)}),"function"==typeof e.onclick&&(r=e.onclick,e.addEventListener("click",function(t){r(t)},!1),e.onclick=null)}}var e=navigator.userAgent.indexOf("Windows Phone")>=0,n=navigator.userAgent.indexOf("Android")>0&&!e,o=/iP(ad|hone|od)/.test(navigator.userAgent)&&!e,i=o&&/OS 4_\d(_\d)?/.test(navigator.userAgent),r=o&&/OS [6-7]_\d/.test(navigator.userAgent),s=navigator.userAgent.indexOf("BB10")>0;t.prototype.needsClick=function(t){switch(t.nodeName.toLowerCase()){case"button":case"select":case"textarea":if(t.disabled)return!0;break;case"input":if(o&&"file"===t.type||t.disabled)return!0;break;case"label":case"iframe":case"video":return!0}return/\bneedsclick\b/.test(t.className)},t.prototype.needsFocus=function(t){switch(t.nodeName.toLowerCase()){case"textarea":return!0;case"select":return!n;case"input":switch(t.type){case"button":case"checkbox":case"file":case"image":case"radio":case"submit":return!1}return!t.disabled&&!t.readOnly;default:return/\bneedsfocus\b/.test(t.className)}},t.prototype.sendClick=function(t,e){var n,o;document.activeElement&&document.activeElement!==t&&document.activeElement.blur(),o=e.changedTouches[0],n=document.createEvent("MouseEvents"),n.initMouseEvent(this.determineEventType(t),!0,!0,window,1,o.screenX,o.screenY,o.clientX,o.clientY,!1,!1,!1,!1,0,null),n.forwardedTouchEvent=!0,t.dispatchEvent(n)},t.prototype.determineEventType=function(t){return n&&"select"===t.tagName.toLowerCase()?"mousedown":"click"},t.prototype.focus=function(t){var e;o&&t.setSelectionRange&&0!==t.type.indexOf("date")&&"time"!==t.type&&"month"!==t.type?(e=t.value.length,t.setSelectionRange(e,e)):t.focus()},t.prototype.updateScrollParent=function(t){var 
e,n;if(e=t.fastClickScrollParent,!e||!e.contains(t)){n=t;do{if(n.scrollHeight>n.offsetHeight){e=n,t.fastClickScrollParent=n;break}n=n.parentElement}while(n)}e&&(e.fastClickLastScrollTop=e.scrollTop)},t.prototype.getTargetElementFromEventTarget=function(t){return t.nodeType===Node.TEXT_NODE?t.parentNode:t},t.prototype.onTouchStart=function(t){var e,n,r;if(t.targetTouches.length>1)return!0;if(e=this.getTargetElementFromEventTarget(t.target),n=t.targetTouches[0],o){if(r=window.getSelection(),r.rangeCount&&!r.isCollapsed)return!0;if(!i){if(n.identifier&&n.identifier===this.lastTouchIdentifier)return t.preventDefault(),!1;this.lastTouchIdentifier=n.identifier,this.updateScrollParent(e)}}return this.trackingClick=!0,this.trackingClickStart=t.timeStamp,this.targetElement=e,this.touchStartX=n.pageX,this.touchStartY=n.pageY,t.timeStamp-this.lastClickTimen||Math.abs(e.pageY-this.touchStartY)>n?!0:!1},t.prototype.onTouchMove=function(t){return this.trackingClick?((this.targetElement!==this.getTargetElementFromEventTarget(t.target)||this.touchHasMoved(t))&&(this.trackingClick=!1,this.targetElement=null),!0):!0},t.prototype.findControl=function(t){return void 0!==t.control?t.control:t.htmlFor?document.getElementById(t.htmlFor):t.querySelector("button, input:not([type=hidden]), keygen, meter, output, progress, select, textarea")},t.prototype.onTouchEnd=function(t){var e,s,a,c,l,u=this.targetElement;if(!this.trackingClick)return!0;if(t.timeStamp-this.lastClickTimethis.tapTimeout)return!0;if(this.cancelNextClick=!1,this.lastClickTime=t.timeStamp,s=this.trackingClickStart,this.trackingClick=!1,this.trackingClickStart=0,r&&(l=t.changedTouches[0],u=document.elementFromPoint(l.pageX-window.pageXOffset,l.pageY-window.pageYOffset)||u,u.fastClickScrollParent=this.targetElement.fastClickScrollParent),a=u.tagName.toLowerCase(),"label"===a){if(e=this.findControl(u)){if(this.focus(u),n)return!1;u=e}}else if(this.needsFocus(u))return 
t.timeStamp-s>100||o&&window.top!==window&&"input"===a?(this.targetElement=null,!1):(this.focus(u),this.sendClick(u,t),o&&"select"===a||(this.targetElement=null,t.preventDefault()),!1);return o&&!i&&(c=u.fastClickScrollParent,c&&c.fastClickLastScrollTop!==c.scrollTop)?!0:(this.needsClick(u)||(t.preventDefault(),this.sendClick(u,t)),!1)},t.prototype.onTouchCancel=function(){this.trackingClick=!1,this.targetElement=null},t.prototype.onMouse=function(t){return this.targetElement?t.forwardedTouchEvent?!0:t.cancelable&&(!this.needsClick(this.targetElement)||this.cancelNextClick)?(t.stopImmediatePropagation?t.stopImmediatePropagation():t.propagationStopped=!0,t.stopPropagation(),t.preventDefault(),!1):!0:!0},t.prototype.onClick=function(t){var e;return this.trackingClick?(this.targetElement=null,this.trackingClick=!1,!0):"submit"===t.target.type&&0===t.detail?!0:(e=this.onMouse(t),e||(this.targetElement=null),e)},t.prototype.destroy=function(){var t=this.layer;n&&(t.removeEventListener("mouseover",this.onMouse,!0),t.removeEventListener("mousedown",this.onMouse,!0),t.removeEventListener("mouseup",this.onMouse,!0)),t.removeEventListener("click",this.onClick,!0),t.removeEventListener("touchstart",this.onTouchStart,!1),t.removeEventListener("touchmove",this.onTouchMove,!1),t.removeEventListener("touchend",this.onTouchEnd,!1),t.removeEventListener("touchcancel",this.onTouchCancel,!1)},t.notNeeded=function(t){var e,o,i,r;if("undefined"==typeof 
window.ontouchstart)return!0;if(o=+(/Chrome\/([0-9]+)/.exec(navigator.userAgent)||[,0])[1]){if(!n)return!0;if(e=document.querySelector("meta[name=viewport]")){if(-1!==e.content.indexOf("user-scalable=no"))return!0;if(o>31&&document.documentElement.scrollWidth<=window.outerWidth)return!0}}if(s&&(i=navigator.userAgent.match(/Version\/([0-9]*)\.([0-9]*)/),i[1]>=10&&i[2]>=3&&(e=document.querySelector("meta[name=viewport]")))){if(-1!==e.content.indexOf("user-scalable=no"))return!0;if(document.documentElement.scrollWidth<=window.outerWidth)return!0}return"none"===t.style.msTouchAction||"manipulation"===t.style.touchAction?!0:(r=+(/Firefox\/([0-9]+)/.exec(navigator.userAgent)||[,0])[1],r>=27&&(e=document.querySelector("meta[name=viewport]"),e&&(-1!==e.content.indexOf("user-scalable=no")||document.documentElement.scrollWidth<=window.outerWidth))?!0:"none"===t.style.touchAction||"manipulation"===t.style.touchAction?!0:!1)},t.attach=function(e,n){return new t(e,n)},"function"==typeof define&&"object"==typeof define.amd&&define.amd?define(function(){return t}):"undefined"!=typeof module&&module.exports?(module.exports=t.attach,module.exports.FastClick=t):window.FastClick=t}(),function(){var t=function(e){var n=new t.Index;return n.pipeline.add(t.trimmer,t.stopWordFilter,t.stemmer),e&&e.call(n,n),n};t.version="0.6.0",t.utils={},t.utils.warn=function(t){return function(e){t.console&&console.warn&&console.warn(e)}}(this),t.utils.asString=function(t){return void 0===t||null===t?"":t.toString()},t.EventEmitter=function(){this.events={}},t.EventEmitter.prototype.addListener=function(){var t=Array.prototype.slice.call(arguments),e=t.pop(),n=t;if("function"!=typeof e)throw new TypeError("last argument must be a function");n.forEach(function(t){this.hasHandler(t)||(this.events[t]=[]),this.events[t].push(e)},this)},t.EventEmitter.prototype.removeListener=function(t,e){if(this.hasHandler(t)){var n=this.events[t].indexOf(e);this.events[t].splice(n,1),this.events[t].length||delete 
this.events[t]}},t.EventEmitter.prototype.emit=function(t){if(this.hasHandler(t)){var e=Array.prototype.slice.call(arguments,1);this.events[t].forEach(function(t){t.apply(void 0,e)})}},t.EventEmitter.prototype.hasHandler=function(t){return t in this.events},t.tokenizer=function(e){return arguments.length&&null!=e&&void 0!=e?Array.isArray(e)?e.map(function(e){return t.utils.asString(e).toLowerCase()}):e.toString().trim().toLowerCase().split(t.tokenizer.seperator):[]},t.tokenizer.seperator=/[\s\-]+/,t.Pipeline=function(){this._stack=[]},t.Pipeline.registeredFunctions={},t.Pipeline.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing registered function: "+n),e.label=n,t.Pipeline.registeredFunctions[e.label]=e},t.Pipeline.warnIfFunctionNotRegistered=function(e){var n=e.label&&e.label in this.registeredFunctions;n||t.utils.warn("Function is not registered with pipeline. This may cause problems when serialising the index.\n",e)},t.Pipeline.load=function(e){var n=new t.Pipeline;return e.forEach(function(e){var o=t.Pipeline.registeredFunctions[e];if(!o)throw new Error("Cannot load un-registered function: "+e);n.add(o)}),n},t.Pipeline.prototype.add=function(){var e=Array.prototype.slice.call(arguments);e.forEach(function(e){t.Pipeline.warnIfFunctionNotRegistered(e),this._stack.push(e)},this)},t.Pipeline.prototype.after=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var o=this._stack.indexOf(e);if(-1==o)throw new Error("Cannot find existingFn");o+=1,this._stack.splice(o,0,n)},t.Pipeline.prototype.before=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var o=this._stack.indexOf(e);if(-1==o)throw new Error("Cannot find existingFn");this._stack.splice(o,0,n)},t.Pipeline.prototype.remove=function(t){var e=this._stack.indexOf(t);-1!=e&&this._stack.splice(e,1)},t.Pipeline.prototype.run=function(t){for(var e=[],n=t.length,o=this._stack.length,i=0;n>i;i++){for(var r=t[i],s=0;o>s&&(r=this._stack[s](r,i,t),void 
0!==r&&""!==r);s++);void 0!==r&&""!==r&&e.push(r)}return e},t.Pipeline.prototype.reset=function(){this._stack=[]},t.Pipeline.prototype.toJSON=function(){return this._stack.map(function(e){return t.Pipeline.warnIfFunctionNotRegistered(e),e.label})},t.Vector=function(){this._magnitude=null,this.list=void 0,this.length=0},t.Vector.Node=function(t,e,n){this.idx=t,this.val=e,this.next=n},t.Vector.prototype.insert=function(e,n){this._magnitude=void 0;var o=this.list;if(!o)return this.list=new t.Vector.Node(e,n,o),this.length++;if(en.idx?n=n.next:(o+=e.val*n.val,e=e.next,n=n.next);return o},t.Vector.prototype.similarity=function(t){return this.dot(t)/(this.magnitude()*t.magnitude())},t.SortedSet=function(){this.length=0,this.elements=[]},t.SortedSet.load=function(t){var e=new this;return e.elements=t,e.length=t.length,e},t.SortedSet.prototype.add=function(){var t,e;for(t=0;t1;){if(r===t)return i;t>r&&(e=i),r>t&&(n=i),o=n-e,i=e+Math.floor(o/2),r=this.elements[i]}return r===t?i:-1},t.SortedSet.prototype.locationFor=function(t){for(var e=0,n=this.elements.length,o=n-e,i=e+Math.floor(o/2),r=this.elements[i];o>1;)t>r&&(e=i),r>t&&(n=i),o=n-e,i=e+Math.floor(o/2),r=this.elements[i];return r>t?i:t>r?i+1:void 0},t.SortedSet.prototype.intersect=function(e){for(var n=new t.SortedSet,o=0,i=0,r=this.length,s=e.length,a=this.elements,c=e.elements;;){if(o>r-1||i>s-1)break;a[o]!==c[i]?a[o]c[i]&&i++:(n.add(a[o]),o++,i++)}return n},t.SortedSet.prototype.clone=function(){var e=new t.SortedSet;return e.elements=this.toArray(),e.length=e.elements.length,e},t.SortedSet.prototype.union=function(t){var e,n,o;return this.length>=t.length?(e=this,n=t):(e=t,n=this),o=e.clone(),o.add.apply(o,n.toArray()),o},t.SortedSet.prototype.toJSON=function(){return this.toArray()},t.Index=function(){this._fields=[],this._ref="id",this.pipeline=new t.Pipeline,this.documentStore=new t.Store,this.tokenStore=new t.TokenStore,this.corpusTokens=new t.SortedSet,this.eventEmitter=new 
t.EventEmitter,this._idfCache={},this.on("add","remove","update",function(){this._idfCache={}}.bind(this))},t.Index.prototype.on=function(){var t=Array.prototype.slice.call(arguments);return this.eventEmitter.addListener.apply(this.eventEmitter,t)},t.Index.prototype.off=function(t,e){return this.eventEmitter.removeListener(t,e)},t.Index.load=function(e){e.version!==t.version&&t.utils.warn("version mismatch: current "+t.version+" importing "+e.version);var n=new this;return n._fields=e.fields,n._ref=e.ref,n.documentStore=t.Store.load(e.documentStore),n.tokenStore=t.TokenStore.load(e.tokenStore),n.corpusTokens=t.SortedSet.load(e.corpusTokens),n.pipeline=t.Pipeline.load(e.pipeline),n},t.Index.prototype.field=function(t,e){var e=e||{},n={name:t,boost:e.boost||1};return this._fields.push(n),this},t.Index.prototype.ref=function(t){return this._ref=t,this},t.Index.prototype.add=function(e,n){var o={},i=new t.SortedSet,r=e[this._ref],n=void 0===n?!0:n;this._fields.forEach(function(n){var r=this.pipeline.run(t.tokenizer(e[n.name]));o[n.name]=r,t.SortedSet.prototype.add.apply(i,r)},this),this.documentStore.set(r,i),t.SortedSet.prototype.add.apply(this.corpusTokens,i.toArray());for(var s=0;s0&&(o=1+Math.log(this.documentStore.length/n)),this._idfCache[e]=o},t.Index.prototype.search=function(e){var n=this.pipeline.run(t.tokenizer(e)),o=new t.Vector,i=[],r=this._fields.reduce(function(t,e){return t+e.boost},0),s=n.some(function(t){return this.tokenStore.has(t)},this);if(!s)return[];n.forEach(function(e,n,s){var a=1/s.length*this._fields.length*r,c=this,l=this.tokenStore.expand(e).reduce(function(n,i){var r=c.corpusTokens.indexOf(i),s=c.idf(i),l=1,u=new t.SortedSet;if(i!==e){var d=Math.max(3,i.length-e.length);l=1/Math.log(d)}r>-1&&o.insert(r,a*s*l);for(var h=c.tokenStore.get(i),f=Object.keys(h),p=f.length,m=0;p>m;m++)u.add(h[f[m]].ref);return n.union(u)},new t.SortedSet);i.push(l)},this);var a=i.reduce(function(t,e){return t.intersect(e)});return 
a.map(function(t){return{ref:t,score:o.similarity(this.documentVector(t))}},this).sort(function(t,e){return e.score-t.score})},t.Index.prototype.documentVector=function(e){for(var n=this.documentStore.get(e),o=n.length,i=new t.Vector,r=0;o>r;r++){var s=n.elements[r],a=this.tokenStore.get(s)[e].tf,c=this.idf(s);i.insert(this.corpusTokens.indexOf(s),a*c)}return i},t.Index.prototype.toJSON=function(){return{version:t.version,fields:this._fields,ref:this._ref,documentStore:this.documentStore.toJSON(),tokenStore:this.tokenStore.toJSON(),corpusTokens:this.corpusTokens.toJSON(),pipeline:this.pipeline.toJSON()}},t.Index.prototype.use=function(t){var e=Array.prototype.slice.call(arguments,1);e.unshift(this),t.apply(this,e)},t.Store=function(){this.store={},this.length=0},t.Store.load=function(e){var n=new this;return n.length=e.length,n.store=Object.keys(e.store).reduce(function(n,o){return n[o]=t.SortedSet.load(e.store[o]),n},{}),n},t.Store.prototype.set=function(t,e){this.has(t)||this.length++,this.store[t]=e},t.Store.prototype.get=function(t){return this.store[t]},t.Store.prototype.has=function(t){return t in this.store},t.Store.prototype.remove=function(t){this.has(t)&&(delete this.store[t],this.length--)},t.Store.prototype.toJSON=function(){return{store:this.store,length:this.length}},t.stemmer=function(){var t={ational:"ate",tional:"tion",enci:"ence",anci:"ance",izer:"ize",bli:"ble",alli:"al",entli:"ent",eli:"e",ousli:"ous",ization:"ize",ation:"ate",ator:"ate",alism:"al",iveness:"ive",fulness:"ful",ousness:"ous",aliti:"al",iviti:"ive",biliti:"ble",logi:"log"},e={icate:"ic",ative:"",alize:"al",iciti:"ic",ical:"ic",ful:"",ness:""},n="[^aeiou]",o="[aeiouy]",i=n+"[^aeiouy]*",r=o+"[aeiou]*",s="^("+i+")?"+r+i,a="^("+i+")?"+r+i+"("+r+")?$",c="^("+i+")?"+r+i+r+i,l="^("+i+")?"+o,u=new RegExp(s),d=new RegExp(c),h=new RegExp(a),f=new RegExp(l),p=/^(.+?)(ss|i)es$/,m=/^(.+?)([^s])s$/,v=/^(.+?)eed$/,g=/^(.+?)(ed|ing)$/,y=/.$/,w=/(at|bl|iz)$/,S=new RegExp("([^aeiouylsz])\\1$"),k=new 
RegExp("^"+i+o+"[^aeiouwxy]$"),E=/^(.+?[^aeiou])y$/,x=/^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/,b=/^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/,T=/^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/,C=/^(.+?)(s|t)(ion)$/,L=/^(.+?)e$/,_=/ll$/,A=new RegExp("^"+i+o+"[^aeiouwxy]$"),O=function(n){var o,i,r,s,a,c,l;if(n.length<3)return n;if(r=n.substr(0,1),"y"==r&&(n=r.toUpperCase()+n.substr(1)),s=p,a=m,s.test(n)?n=n.replace(s,"$1$2"):a.test(n)&&(n=n.replace(a,"$1$2")),s=v,a=g,s.test(n)){var O=s.exec(n);s=u,s.test(O[1])&&(s=y,n=n.replace(s,""))}else if(a.test(n)){var O=a.exec(n);o=O[1],a=f,a.test(o)&&(n=o,a=w,c=S,l=k,a.test(n)?n+="e":c.test(n)?(s=y,n=n.replace(s,"")):l.test(n)&&(n+="e"))}if(s=E,s.test(n)){var O=s.exec(n);o=O[1],n=o+"i"}if(s=x,s.test(n)){var O=s.exec(n);o=O[1],i=O[2],s=u,s.test(o)&&(n=o+t[i])}if(s=b,s.test(n)){var O=s.exec(n);o=O[1],i=O[2],s=u,s.test(o)&&(n=o+e[i])}if(s=T,a=C,s.test(n)){var O=s.exec(n);o=O[1],s=d,s.test(o)&&(n=o)}else if(a.test(n)){var O=a.exec(n);o=O[1]+O[2],a=d,a.test(o)&&(n=o)}if(s=L,s.test(n)){var O=s.exec(n);o=O[1],s=d,a=h,c=A,(s.test(o)||a.test(o)&&!c.test(o))&&(n=o)}return s=_,a=d,s.test(n)&&a.test(n)&&(s=y,n=n.replace(s,"")),"y"==r&&(n=r.toLowerCase()+n.substr(1)),n};return O}(),t.Pipeline.registerFunction(t.stemmer,"stemmer"),t.generateStopWordFilter=function(t){var e=t.reduce(function(t,e){return t[e]=e,t},{});return function(t){return t&&e[t]!==t?t:void 
0}},t.stopWordFilter=t.generateStopWordFilter(["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]),t.Pipeline.registerFunction(t.stopWordFilter,"stopWordFilter"),t.trimmer=function(t){return t.replace(/^\W+/,"").replace(/\W+$/,"")},t.Pipeline.registerFunction(t.trimmer,"trimmer"),t.TokenStore=function(){this.root={docs:{}},this.length=0},t.TokenStore.load=function(t){var e=new this;return e.root=t.root,e.length=t.length,e},t.TokenStore.prototype.add=function(t,e,n){var n=n||this.root,o=t.charAt(0),i=t.slice(1);return o in n||(n[o]={docs:{}}),0===i.length?(n[o].docs[e.ref]=e,void(this.length+=1)):this.add(i,e,n[o])},t.TokenStore.prototype.has=function(t){if(!t)return!1;for(var e=this.root,n=0;nt){for(;" "!=this[t]&&--t>0;);return this.substring(0,t)+"…"}return this},HTMLElement.prototype.wrap=function(t){t.length||(t=[t]);for(var e=t.length-1;e>=0;e--){var n=e>0?this.cloneNode(!0):this,o=t[e],i=o.parentNode,r=o.nextSibling;n.appendChild(o),r?i.insertBefore(n,r):i.appendChild(n)}},document.addEventListener("DOMContentLoaded",function(){"use 
strict";Modernizr.addTest("ios",function(){return!!navigator.userAgent.match(/(iPad|iPhone|iPod)/g)}),Modernizr.addTest("standalone",function(){return!!navigator.standalone}),FastClick.attach(document.body);var t=document.getElementById("toggle-search"),e=(document.getElementById("reset-search"),document.querySelector(".drawer")),n=document.querySelectorAll(".anchor"),o=document.querySelector(".search .field"),i=document.querySelector(".query"),r=document.querySelector(".results .meta");Array.prototype.forEach.call(n,function(t){t.querySelector("a").addEventListener("click",function(){document.getElementById("toggle-drawer").checked=!1,document.body.classList.remove("toggle-drawer")})});var s=window.pageYOffset,a=function(){var t=window.pageYOffset+window.innerHeight,n=Math.max(0,window.innerHeight-e.offsetHeight);t>document.body.clientHeight-(96-n)?"absolute"!=e.style.position&&(e.style.position="absolute",e.style.top=null,e.style.bottom=0):e.offsetHeighte.offsetTop+e.offsetHeight?(e.style.position="fixed",e.style.top=null,e.style.bottom="-96px"):window.pageYOffsets?e.style.top&&(e.style.position="absolute",e.style.top=Math.max(0,s)+"px",e.style.bottom=null):e.style.bottom&&(e.style.position="absolute",e.style.top=t-e.offsetHeight+"px",e.style.bottom=null),s=Math.max(0,window.pageYOffset)},c=function(){var t=document.querySelector(".main");window.removeEventListener("scroll",a),matchMedia("only screen and (max-width: 959px)").matches?(e.style.position=null,e.style.top=null,e.style.bottom=null):e.offsetHeight+96o;o++)t1e4?n=(n/1e3).toFixed(0)+"k":n>1e3&&(n=(n/1e3).toFixed(1)+"k");var o=document.querySelector(".repo-stars .count");o.innerHTML=n},function(t,e){console.error(t,e.status)})}),"standalone"in window.navigator&&window.navigator.standalone){var node,remotes=!1;document.addEventListener("click",function(t){for(node=t.target;"A"!==node.nodeName&&"HTML"!==node.nodeName;)node=node.parentNode;"href"in 
node&&-1!==node.href.indexOf("http")&&(-1!==node.href.indexOf(document.location.host)||remotes)&&(t.preventDefault(),document.location.href=node.href)},!1)} \ No newline at end of file diff --git a/assets/javascripts/application.js b/assets/javascripts/application.js deleted file mode 100644 index 1199f2e..0000000 --- a/assets/javascripts/application.js +++ /dev/null @@ -1 +0,0 @@ -function pegasus(t,e){return e=new XMLHttpRequest,e.open("GET",t),t=[],e.onreadystatechange=e.then=function(n,o,i,r){if(n&&n.call&&(t=[,n,o]),4==e.readyState&&(i=t[0|e.status/200])){try{r=JSON.parse(e.responseText)}catch(s){r=null}i(r,e)}},e.send(),e}if("document"in self&&("classList"in document.createElement("_")?!function(){"use strict";var t=document.createElement("_");if(t.classList.add("c1","c2"),!t.classList.contains("c2")){var e=function(t){var e=DOMTokenList.prototype[t];DOMTokenList.prototype[t]=function(t){var n,o=arguments.length;for(n=0;o>n;n++)t=arguments[n],e.call(this,t)}};e("add"),e("remove")}if(t.classList.toggle("c3",!1),t.classList.contains("c3")){var n=DOMTokenList.prototype.toggle;DOMTokenList.prototype.toggle=function(t,e){return 1 in arguments&&!this.contains(t)==!e?e:n.call(this,t)}}t=null}():!function(t){"use strict";if("Element"in t){var e="classList",n="prototype",o=t.Element[n],i=Object,r=String[n].trim||function(){return this.replace(/^\s+|\s+$/g,"")},s=Array[n].indexOf||function(t){for(var e=0,n=this.length;n>e;e++)if(e in this&&this[e]===t)return e;return-1},a=function(t,e){this.name=t,this.code=DOMException[t],this.message=e},c=function(t,e){if(""===e)throw new a("SYNTAX_ERR","An invalid or illegal string was specified");if(/\s/.test(e))throw new a("INVALID_CHARACTER_ERR","String contains an invalid character");return s.call(t,e)},l=function(t){for(var e=r.call(t.getAttribute("class")||""),n=e?e.split(/\s+/):[],o=0,i=n.length;i>o;o++)this.push(n[o]);this._updateClassName=function(){t.setAttribute("class",this.toString())}},u=l[n]=[],d=function(){return new 
l(this)};if(a[n]=Error[n],u.item=function(t){return this[t]||null},u.contains=function(t){return t+="",-1!==c(this,t)},u.add=function(){var t,e=arguments,n=0,o=e.length,i=!1;do t=e[n]+"",-1===c(this,t)&&(this.push(t),i=!0);while(++nc;c++)a[s[c]]=i(a[s[c]],a);n&&(e.addEventListener("mouseover",this.onMouse,!0),e.addEventListener("mousedown",this.onMouse,!0),e.addEventListener("mouseup",this.onMouse,!0)),e.addEventListener("click",this.onClick,!0),e.addEventListener("touchstart",this.onTouchStart,!1),e.addEventListener("touchmove",this.onTouchMove,!1),e.addEventListener("touchend",this.onTouchEnd,!1),e.addEventListener("touchcancel",this.onTouchCancel,!1),Event.prototype.stopImmediatePropagation||(e.removeEventListener=function(t,n,o){var i=Node.prototype.removeEventListener;"click"===t?i.call(e,t,n.hijacked||n,o):i.call(e,t,n,o)},e.addEventListener=function(t,n,o){var i=Node.prototype.addEventListener;"click"===t?i.call(e,t,n.hijacked||(n.hijacked=function(t){t.propagationStopped||n(t)}),o):i.call(e,t,n,o)}),"function"==typeof e.onclick&&(r=e.onclick,e.addEventListener("click",function(t){r(t)},!1),e.onclick=null)}}var e=navigator.userAgent.indexOf("Windows Phone")>=0,n=navigator.userAgent.indexOf("Android")>0&&!e,o=/iP(ad|hone|od)/.test(navigator.userAgent)&&!e,i=o&&/OS 4_\d(_\d)?/.test(navigator.userAgent),r=o&&/OS 
[6-7]_\d/.test(navigator.userAgent),s=navigator.userAgent.indexOf("BB10")>0;t.prototype.needsClick=function(t){switch(t.nodeName.toLowerCase()){case"button":case"select":case"textarea":if(t.disabled)return!0;break;case"input":if(o&&"file"===t.type||t.disabled)return!0;break;case"label":case"iframe":case"video":return!0}return/\bneedsclick\b/.test(t.className)},t.prototype.needsFocus=function(t){switch(t.nodeName.toLowerCase()){case"textarea":return!0;case"select":return!n;case"input":switch(t.type){case"button":case"checkbox":case"file":case"image":case"radio":case"submit":return!1}return!t.disabled&&!t.readOnly;default:return/\bneedsfocus\b/.test(t.className)}},t.prototype.sendClick=function(t,e){var n,o;document.activeElement&&document.activeElement!==t&&document.activeElement.blur(),o=e.changedTouches[0],n=document.createEvent("MouseEvents"),n.initMouseEvent(this.determineEventType(t),!0,!0,window,1,o.screenX,o.screenY,o.clientX,o.clientY,!1,!1,!1,!1,0,null),n.forwardedTouchEvent=!0,t.dispatchEvent(n)},t.prototype.determineEventType=function(t){return n&&"select"===t.tagName.toLowerCase()?"mousedown":"click"},t.prototype.focus=function(t){var e;o&&t.setSelectionRange&&0!==t.type.indexOf("date")&&"time"!==t.type&&"month"!==t.type?(e=t.value.length,t.setSelectionRange(e,e)):t.focus()},t.prototype.updateScrollParent=function(t){var e,n;if(e=t.fastClickScrollParent,!e||!e.contains(t)){n=t;do{if(n.scrollHeight>n.offsetHeight){e=n,t.fastClickScrollParent=n;break}n=n.parentElement}while(n)}e&&(e.fastClickLastScrollTop=e.scrollTop)},t.prototype.getTargetElementFromEventTarget=function(t){return t.nodeType===Node.TEXT_NODE?t.parentNode:t},t.prototype.onTouchStart=function(t){var e,n,r;if(t.targetTouches.length>1)return!0;if(e=this.getTargetElementFromEventTarget(t.target),n=t.targetTouches[0],o){if(r=window.getSelection(),r.rangeCount&&!r.isCollapsed)return!0;if(!i){if(n.identifier&&n.identifier===this.lastTouchIdentifier)return 
t.preventDefault(),!1;this.lastTouchIdentifier=n.identifier,this.updateScrollParent(e)}}return this.trackingClick=!0,this.trackingClickStart=t.timeStamp,this.targetElement=e,this.touchStartX=n.pageX,this.touchStartY=n.pageY,t.timeStamp-this.lastClickTimen||Math.abs(e.pageY-this.touchStartY)>n?!0:!1},t.prototype.onTouchMove=function(t){return this.trackingClick?((this.targetElement!==this.getTargetElementFromEventTarget(t.target)||this.touchHasMoved(t))&&(this.trackingClick=!1,this.targetElement=null),!0):!0},t.prototype.findControl=function(t){return void 0!==t.control?t.control:t.htmlFor?document.getElementById(t.htmlFor):t.querySelector("button, input:not([type=hidden]), keygen, meter, output, progress, select, textarea")},t.prototype.onTouchEnd=function(t){var e,s,a,c,l,u=this.targetElement;if(!this.trackingClick)return!0;if(t.timeStamp-this.lastClickTimethis.tapTimeout)return!0;if(this.cancelNextClick=!1,this.lastClickTime=t.timeStamp,s=this.trackingClickStart,this.trackingClick=!1,this.trackingClickStart=0,r&&(l=t.changedTouches[0],u=document.elementFromPoint(l.pageX-window.pageXOffset,l.pageY-window.pageYOffset)||u,u.fastClickScrollParent=this.targetElement.fastClickScrollParent),a=u.tagName.toLowerCase(),"label"===a){if(e=this.findControl(u)){if(this.focus(u),n)return!1;u=e}}else if(this.needsFocus(u))return t.timeStamp-s>100||o&&window.top!==window&&"input"===a?(this.targetElement=null,!1):(this.focus(u),this.sendClick(u,t),o&&"select"===a||(this.targetElement=null,t.preventDefault()),!1);return o&&!i&&(c=u.fastClickScrollParent,c&&c.fastClickLastScrollTop!==c.scrollTop)?!0:(this.needsClick(u)||(t.preventDefault(),this.sendClick(u,t)),!1)},t.prototype.onTouchCancel=function(){this.trackingClick=!1,this.targetElement=null},t.prototype.onMouse=function(t){return 
this.targetElement?t.forwardedTouchEvent?!0:t.cancelable&&(!this.needsClick(this.targetElement)||this.cancelNextClick)?(t.stopImmediatePropagation?t.stopImmediatePropagation():t.propagationStopped=!0,t.stopPropagation(),t.preventDefault(),!1):!0:!0},t.prototype.onClick=function(t){var e;return this.trackingClick?(this.targetElement=null,this.trackingClick=!1,!0):"submit"===t.target.type&&0===t.detail?!0:(e=this.onMouse(t),e||(this.targetElement=null),e)},t.prototype.destroy=function(){var t=this.layer;n&&(t.removeEventListener("mouseover",this.onMouse,!0),t.removeEventListener("mousedown",this.onMouse,!0),t.removeEventListener("mouseup",this.onMouse,!0)),t.removeEventListener("click",this.onClick,!0),t.removeEventListener("touchstart",this.onTouchStart,!1),t.removeEventListener("touchmove",this.onTouchMove,!1),t.removeEventListener("touchend",this.onTouchEnd,!1),t.removeEventListener("touchcancel",this.onTouchCancel,!1)},t.notNeeded=function(t){var e,o,i,r;if("undefined"==typeof window.ontouchstart)return!0;if(o=+(/Chrome\/([0-9]+)/.exec(navigator.userAgent)||[,0])[1]){if(!n)return!0;if(e=document.querySelector("meta[name=viewport]")){if(-1!==e.content.indexOf("user-scalable=no"))return!0;if(o>31&&document.documentElement.scrollWidth<=window.outerWidth)return!0}}if(s&&(i=navigator.userAgent.match(/Version\/([0-9]*)\.([0-9]*)/),i[1]>=10&&i[2]>=3&&(e=document.querySelector("meta[name=viewport]")))){if(-1!==e.content.indexOf("user-scalable=no"))return!0;if(document.documentElement.scrollWidth<=window.outerWidth)return!0}return"none"===t.style.msTouchAction||"manipulation"===t.style.touchAction?!0:(r=+(/Firefox\/([0-9]+)/.exec(navigator.userAgent)||[,0])[1],r>=27&&(e=document.querySelector("meta[name=viewport]"),e&&(-1!==e.content.indexOf("user-scalable=no")||document.documentElement.scrollWidth<=window.outerWidth))?!0:"none"===t.style.touchAction||"manipulation"===t.style.touchAction?!0:!1)},t.attach=function(e,n){return new t(e,n)},"function"==typeof 
define&&"object"==typeof define.amd&&define.amd?define(function(){return t}):"undefined"!=typeof module&&module.exports?(module.exports=t.attach,module.exports.FastClick=t):window.FastClick=t}(),function(){var t=function(e){var n=new t.Index;return n.pipeline.add(t.trimmer,t.stopWordFilter,t.stemmer),e&&e.call(n,n),n};t.version="0.6.0",t.utils={},t.utils.warn=function(t){return function(e){t.console&&console.warn&&console.warn(e)}}(this),t.utils.asString=function(t){return void 0===t||null===t?"":t.toString()},t.EventEmitter=function(){this.events={}},t.EventEmitter.prototype.addListener=function(){var t=Array.prototype.slice.call(arguments),e=t.pop(),n=t;if("function"!=typeof e)throw new TypeError("last argument must be a function");n.forEach(function(t){this.hasHandler(t)||(this.events[t]=[]),this.events[t].push(e)},this)},t.EventEmitter.prototype.removeListener=function(t,e){if(this.hasHandler(t)){var n=this.events[t].indexOf(e);this.events[t].splice(n,1),this.events[t].length||delete this.events[t]}},t.EventEmitter.prototype.emit=function(t){if(this.hasHandler(t)){var e=Array.prototype.slice.call(arguments,1);this.events[t].forEach(function(t){t.apply(void 0,e)})}},t.EventEmitter.prototype.hasHandler=function(t){return t in this.events},t.tokenizer=function(e){return arguments.length&&null!=e&&void 0!=e?Array.isArray(e)?e.map(function(e){return t.utils.asString(e).toLowerCase()}):e.toString().trim().toLowerCase().split(t.tokenizer.seperator):[]},t.tokenizer.seperator=/[\s\-]+/,t.Pipeline=function(){this._stack=[]},t.Pipeline.registeredFunctions={},t.Pipeline.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing registered function: "+n),e.label=n,t.Pipeline.registeredFunctions[e.label]=e},t.Pipeline.warnIfFunctionNotRegistered=function(e){var n=e.label&&e.label in this.registeredFunctions;n||t.utils.warn("Function is not registered with pipeline. 
This may cause problems when serialising the index.\n",e)},t.Pipeline.load=function(e){var n=new t.Pipeline;return e.forEach(function(e){var o=t.Pipeline.registeredFunctions[e];if(!o)throw new Error("Cannot load un-registered function: "+e);n.add(o)}),n},t.Pipeline.prototype.add=function(){var e=Array.prototype.slice.call(arguments);e.forEach(function(e){t.Pipeline.warnIfFunctionNotRegistered(e),this._stack.push(e)},this)},t.Pipeline.prototype.after=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var o=this._stack.indexOf(e);if(-1==o)throw new Error("Cannot find existingFn");o+=1,this._stack.splice(o,0,n)},t.Pipeline.prototype.before=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var o=this._stack.indexOf(e);if(-1==o)throw new Error("Cannot find existingFn");this._stack.splice(o,0,n)},t.Pipeline.prototype.remove=function(t){var e=this._stack.indexOf(t);-1!=e&&this._stack.splice(e,1)},t.Pipeline.prototype.run=function(t){for(var e=[],n=t.length,o=this._stack.length,i=0;n>i;i++){for(var r=t[i],s=0;o>s&&(r=this._stack[s](r,i,t),void 0!==r&&""!==r);s++);void 0!==r&&""!==r&&e.push(r)}return e},t.Pipeline.prototype.reset=function(){this._stack=[]},t.Pipeline.prototype.toJSON=function(){return this._stack.map(function(e){return t.Pipeline.warnIfFunctionNotRegistered(e),e.label})},t.Vector=function(){this._magnitude=null,this.list=void 0,this.length=0},t.Vector.Node=function(t,e,n){this.idx=t,this.val=e,this.next=n},t.Vector.prototype.insert=function(e,n){this._magnitude=void 0;var o=this.list;if(!o)return this.list=new t.Vector.Node(e,n,o),this.length++;if(en.idx?n=n.next:(o+=e.val*n.val,e=e.next,n=n.next);return o},t.Vector.prototype.similarity=function(t){return this.dot(t)/(this.magnitude()*t.magnitude())},t.SortedSet=function(){this.length=0,this.elements=[]},t.SortedSet.load=function(t){var e=new this;return e.elements=t,e.length=t.length,e},t.SortedSet.prototype.add=function(){var t,e;for(t=0;t1;){if(r===t)return 
i;t>r&&(e=i),r>t&&(n=i),o=n-e,i=e+Math.floor(o/2),r=this.elements[i]}return r===t?i:-1},t.SortedSet.prototype.locationFor=function(t){for(var e=0,n=this.elements.length,o=n-e,i=e+Math.floor(o/2),r=this.elements[i];o>1;)t>r&&(e=i),r>t&&(n=i),o=n-e,i=e+Math.floor(o/2),r=this.elements[i];return r>t?i:t>r?i+1:void 0},t.SortedSet.prototype.intersect=function(e){for(var n=new t.SortedSet,o=0,i=0,r=this.length,s=e.length,a=this.elements,c=e.elements;;){if(o>r-1||i>s-1)break;a[o]!==c[i]?a[o]c[i]&&i++:(n.add(a[o]),o++,i++)}return n},t.SortedSet.prototype.clone=function(){var e=new t.SortedSet;return e.elements=this.toArray(),e.length=e.elements.length,e},t.SortedSet.prototype.union=function(t){var e,n,o;return this.length>=t.length?(e=this,n=t):(e=t,n=this),o=e.clone(),o.add.apply(o,n.toArray()),o},t.SortedSet.prototype.toJSON=function(){return this.toArray()},t.Index=function(){this._fields=[],this._ref="id",this.pipeline=new t.Pipeline,this.documentStore=new t.Store,this.tokenStore=new t.TokenStore,this.corpusTokens=new t.SortedSet,this.eventEmitter=new t.EventEmitter,this._idfCache={},this.on("add","remove","update",function(){this._idfCache={}}.bind(this))},t.Index.prototype.on=function(){var t=Array.prototype.slice.call(arguments);return this.eventEmitter.addListener.apply(this.eventEmitter,t)},t.Index.prototype.off=function(t,e){return this.eventEmitter.removeListener(t,e)},t.Index.load=function(e){e.version!==t.version&&t.utils.warn("version mismatch: current "+t.version+" importing "+e.version);var n=new this;return n._fields=e.fields,n._ref=e.ref,n.documentStore=t.Store.load(e.documentStore),n.tokenStore=t.TokenStore.load(e.tokenStore),n.corpusTokens=t.SortedSet.load(e.corpusTokens),n.pipeline=t.Pipeline.load(e.pipeline),n},t.Index.prototype.field=function(t,e){var e=e||{},n={name:t,boost:e.boost||1};return this._fields.push(n),this},t.Index.prototype.ref=function(t){return this._ref=t,this},t.Index.prototype.add=function(e,n){var o={},i=new 
t.SortedSet,r=e[this._ref],n=void 0===n?!0:n;this._fields.forEach(function(n){var r=this.pipeline.run(t.tokenizer(e[n.name]));o[n.name]=r,t.SortedSet.prototype.add.apply(i,r)},this),this.documentStore.set(r,i),t.SortedSet.prototype.add.apply(this.corpusTokens,i.toArray());for(var s=0;s0&&(o=1+Math.log(this.documentStore.length/n)),this._idfCache[e]=o},t.Index.prototype.search=function(e){var n=this.pipeline.run(t.tokenizer(e)),o=new t.Vector,i=[],r=this._fields.reduce(function(t,e){return t+e.boost},0),s=n.some(function(t){return this.tokenStore.has(t)},this);if(!s)return[];n.forEach(function(e,n,s){var a=1/s.length*this._fields.length*r,c=this,l=this.tokenStore.expand(e).reduce(function(n,i){var r=c.corpusTokens.indexOf(i),s=c.idf(i),l=1,u=new t.SortedSet;if(i!==e){var d=Math.max(3,i.length-e.length);l=1/Math.log(d)}r>-1&&o.insert(r,a*s*l);for(var h=c.tokenStore.get(i),f=Object.keys(h),p=f.length,m=0;p>m;m++)u.add(h[f[m]].ref);return n.union(u)},new t.SortedSet);i.push(l)},this);var a=i.reduce(function(t,e){return t.intersect(e)});return a.map(function(t){return{ref:t,score:o.similarity(this.documentVector(t))}},this).sort(function(t,e){return e.score-t.score})},t.Index.prototype.documentVector=function(e){for(var n=this.documentStore.get(e),o=n.length,i=new t.Vector,r=0;o>r;r++){var s=n.elements[r],a=this.tokenStore.get(s)[e].tf,c=this.idf(s);i.insert(this.corpusTokens.indexOf(s),a*c)}return i},t.Index.prototype.toJSON=function(){return{version:t.version,fields:this._fields,ref:this._ref,documentStore:this.documentStore.toJSON(),tokenStore:this.tokenStore.toJSON(),corpusTokens:this.corpusTokens.toJSON(),pipeline:this.pipeline.toJSON()}},t.Index.prototype.use=function(t){var e=Array.prototype.slice.call(arguments,1);e.unshift(this),t.apply(this,e)},t.Store=function(){this.store={},this.length=0},t.Store.load=function(e){var n=new this;return n.length=e.length,n.store=Object.keys(e.store).reduce(function(n,o){return 
n[o]=t.SortedSet.load(e.store[o]),n},{}),n},t.Store.prototype.set=function(t,e){this.has(t)||this.length++,this.store[t]=e},t.Store.prototype.get=function(t){return this.store[t]},t.Store.prototype.has=function(t){return t in this.store},t.Store.prototype.remove=function(t){this.has(t)&&(delete this.store[t],this.length--)},t.Store.prototype.toJSON=function(){return{store:this.store,length:this.length}},t.stemmer=function(){var t={ational:"ate",tional:"tion",enci:"ence",anci:"ance",izer:"ize",bli:"ble",alli:"al",entli:"ent",eli:"e",ousli:"ous",ization:"ize",ation:"ate",ator:"ate",alism:"al",iveness:"ive",fulness:"ful",ousness:"ous",aliti:"al",iviti:"ive",biliti:"ble",logi:"log"},e={icate:"ic",ative:"",alize:"al",iciti:"ic",ical:"ic",ful:"",ness:""},n="[^aeiou]",o="[aeiouy]",i=n+"[^aeiouy]*",r=o+"[aeiou]*",s="^("+i+")?"+r+i,a="^("+i+")?"+r+i+"("+r+")?$",c="^("+i+")?"+r+i+r+i,l="^("+i+")?"+o,u=new RegExp(s),d=new RegExp(c),h=new RegExp(a),f=new RegExp(l),p=/^(.+?)(ss|i)es$/,m=/^(.+?)([^s])s$/,v=/^(.+?)eed$/,g=/^(.+?)(ed|ing)$/,y=/.$/,w=/(at|bl|iz)$/,S=new RegExp("([^aeiouylsz])\\1$"),k=new RegExp("^"+i+o+"[^aeiouwxy]$"),E=/^(.+?[^aeiou])y$/,x=/^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/,b=/^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/,T=/^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/,C=/^(.+?)(s|t)(ion)$/,L=/^(.+?)e$/,_=/ll$/,A=new RegExp("^"+i+o+"[^aeiouwxy]$"),O=function(n){var o,i,r,s,a,c,l;if(n.length<3)return n;if(r=n.substr(0,1),"y"==r&&(n=r.toUpperCase()+n.substr(1)),s=p,a=m,s.test(n)?n=n.replace(s,"$1$2"):a.test(n)&&(n=n.replace(a,"$1$2")),s=v,a=g,s.test(n)){var O=s.exec(n);s=u,s.test(O[1])&&(s=y,n=n.replace(s,""))}else if(a.test(n)){var O=a.exec(n);o=O[1],a=f,a.test(o)&&(n=o,a=w,c=S,l=k,a.test(n)?n+="e":c.test(n)?(s=y,n=n.replace(s,"")):l.test(n)&&(n+="e"))}if(s=E,s.test(n)){var O=s.exec(n);o=O[1],n=o+"i"}if(s=x,s.test(n)){var 
O=s.exec(n);o=O[1],i=O[2],s=u,s.test(o)&&(n=o+t[i])}if(s=b,s.test(n)){var O=s.exec(n);o=O[1],i=O[2],s=u,s.test(o)&&(n=o+e[i])}if(s=T,a=C,s.test(n)){var O=s.exec(n);o=O[1],s=d,s.test(o)&&(n=o)}else if(a.test(n)){var O=a.exec(n);o=O[1]+O[2],a=d,a.test(o)&&(n=o)}if(s=L,s.test(n)){var O=s.exec(n);o=O[1],s=d,a=h,c=A,(s.test(o)||a.test(o)&&!c.test(o))&&(n=o)}return s=_,a=d,s.test(n)&&a.test(n)&&(s=y,n=n.replace(s,"")),"y"==r&&(n=r.toLowerCase()+n.substr(1)),n};return O}(),t.Pipeline.registerFunction(t.stemmer,"stemmer"),t.generateStopWordFilter=function(t){var e=t.reduce(function(t,e){return t[e]=e,t},{});return function(t){return t&&e[t]!==t?t:void 0}},t.stopWordFilter=t.generateStopWordFilter(["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]),t.Pipeline.registerFunction(t.stopWordFilter,"stopWordFilter"),t.trimmer=function(t){return t.replace(/^\W+/,"").replace(/\W+$/,"")},t.Pipeline.registerFunction(t.trimmer,"trimmer"),t.TokenStore=function(){this.root={docs:{}},this.length=0},t.TokenStore.load=function(t){var e=new this;return e.root=t.root,e.length=t.length,e},t.TokenStore.prototype.add=function(t,e,n){var n=n||this.root,o=t.charAt(0),i=t.slice(1);return o in 
n||(n[o]={docs:{}}),0===i.length?(n[o].docs[e.ref]=e,void(this.length+=1)):this.add(i,e,n[o])},t.TokenStore.prototype.has=function(t){if(!t)return!1;for(var e=this.root,n=0;nt){for(;" "!=this[t]&&--t>0;);return this.substring(0,t)+"…"}return this},HTMLElement.prototype.wrap=function(t){t.length||(t=[t]);for(var e=t.length-1;e>=0;e--){var n=e>0?this.cloneNode(!0):this,o=t[e],i=o.parentNode,r=o.nextSibling;n.appendChild(o),r?i.insertBefore(n,r):i.appendChild(n)}},document.addEventListener("DOMContentLoaded",function(){"use strict";Modernizr.addTest("ios",function(){return!!navigator.userAgent.match(/(iPad|iPhone|iPod)/g)}),Modernizr.addTest("standalone",function(){return!!navigator.standalone}),FastClick.attach(document.body);var t=document.getElementById("toggle-search"),e=(document.getElementById("reset-search"),document.querySelector(".drawer")),n=document.querySelectorAll(".anchor"),o=document.querySelector(".search .field"),i=document.querySelector(".query"),r=document.querySelector(".results .meta");Array.prototype.forEach.call(n,function(t){t.querySelector("a").addEventListener("click",function(){document.getElementById("toggle-drawer").checked=!1,document.body.classList.remove("toggle-drawer")})});var s=window.pageYOffset,a=function(){var t=window.pageYOffset+window.innerHeight,n=Math.max(0,window.innerHeight-e.offsetHeight);t>document.body.clientHeight-(96-n)?"absolute"!=e.style.position&&(e.style.position="absolute",e.style.top=null,e.style.bottom=0):e.offsetHeighte.offsetTop+e.offsetHeight?(e.style.position="fixed",e.style.top=null,e.style.bottom="-96px"):window.pageYOffsets?e.style.top&&(e.style.position="absolute",e.style.top=Math.max(0,s)+"px",e.style.bottom=null):e.style.bottom&&(e.style.position="absolute",e.style.top=t-e.offsetHeight+"px",e.style.bottom=null),s=Math.max(0,window.pageYOffset)},c=function(){var t=document.querySelector(".main");window.removeEventListener("scroll",a),matchMedia("only screen and (max-width: 
959px)").matches?(e.style.position=null,e.style.top=null,e.style.bottom=null):e.offsetHeight+96o;o++)t1e4?n=(n/1e3).toFixed(0)+"k":n>1e3&&(n=(n/1e3).toFixed(1)+"k");var o=document.querySelector(".repo-stars .count");o.innerHTML=n},function(t,e){console.error(t,e.status)})}),"standalone"in window.navigator&&window.navigator.standalone){var node,remotes=!1;document.addEventListener("click",function(t){for(node=t.target;"A"!==node.nodeName&&"HTML"!==node.nodeName;)node=node.parentNode;"href"in node&&-1!==node.href.indexOf("http")&&(-1!==node.href.indexOf(document.location.host)||remotes)&&(t.preventDefault(),document.location.href=node.href)},!1)} \ No newline at end of file diff --git a/assets/javascripts/modernizr-4ab42b99fd.js b/assets/javascripts/modernizr-4ab42b99fd.js deleted file mode 100644 index e82c909..0000000 --- a/assets/javascripts/modernizr-4ab42b99fd.js +++ /dev/null @@ -1 +0,0 @@ -!function(e,t,n){function r(e,t){return typeof e===t}function i(){var e,t,n,i,o,a,s;for(var l in x)if(x.hasOwnProperty(l)){if(e=[],t=x[l],t.name&&(e.push(t.name.toLowerCase()),t.options&&t.options.aliases&&t.options.aliases.length))for(n=0;nf;f++)if(h=e[f],g=_.style[h],l(h,"-")&&(h=m(h)),_.style[h]!==n){if(o||r(i,"undefined"))return a(),"pfx"==t?h:!0;try{_.style[h]=i}catch(y){}if(_.style[h]!=g)return a(),"pfx"==t?h:!0}return a(),!1}function g(e,t,n){var i;for(var o in e)if(e[o]in t)return n===!1?e[o]:(i=t[e[o]],r(i,"function")?s(i,n||t):i);return!1}function v(e,t,n,i,o){var a=e.charAt(0).toUpperCase()+e.slice(1),s=(e+" "+P.join(a+" ")+a).split(" ");return r(t,"string")||r(t,"undefined")?h(s,t,i,o):(s=(e+" "+A.join(a+" ")+a).split(" "),g(s,t,n))}function y(e,t,r){return v(e,n,n,t,r)}var x=[],E={_version:"3.3.1",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var 
n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){x.push({name:e,fn:t,options:n})},addAsyncTest:function(e){x.push({name:null,fn:e})}},S=function(){};S.prototype=E,S=new S;var b,w=[],C=t.documentElement,T="svg"===C.nodeName.toLowerCase();!function(){var e={}.hasOwnProperty;b=r(e,"undefined")||r(e.call,"undefined")?function(e,t){return t in e&&r(e.constructor.prototype[t],"undefined")}:function(t,n){return e.call(t,n)}}(),E._l={},E.on=function(e,t){this._l[e]||(this._l[e]=[]),this._l[e].push(t),S.hasOwnProperty(e)&&setTimeout(function(){S._trigger(e,S[e])},0)},E._trigger=function(e,t){if(this._l[e]){var n=this._l[e];setTimeout(function(){var e,r;for(e=0;e",r.insertBefore(n.lastChild,r.firstChild)}function r(){var e=C.elements;return"string"==typeof e?e.split(" "):e}function i(e,t){var n=C.elements;"string"!=typeof n&&(n=n.join(" ")),"string"!=typeof e&&(e=e.join(" ")),C.elements=n+" "+e,u(t)}function o(e){var t=w[e[S]];return t||(t={},b++,e[S]=b,w[b]=t),t}function a(e,n,r){if(n||(n=t),g)return n.createElement(e);r||(r=o(n));var i;return i=r.cache[e]?r.cache[e].cloneNode():E.test(e)?(r.cache[e]=r.createElem(e)).cloneNode():r.createElem(e),!i.canHaveChildren||x.test(e)||i.tagUrn?i:r.frag.appendChild(i)}function s(e,n){if(e||(e=t),g)return e.createDocumentFragment();n=n||o(e);for(var i=n.frag.cloneNode(),a=0,s=r(),l=s.length;l>a;a++)i.createElement(s[a]);return i}function l(e,t){t.cache||(t.cache={},t.createElem=e.createElement,t.createFrag=e.createDocumentFragment,t.frag=t.createFrag()),e.createElement=function(n){return C.shivMethods?a(n,e,t):t.createElem(n)},e.createDocumentFragment=Function("h,f","return function(){var n=f.cloneNode(),c=n.createElement;h.shivMethods&&("+r().join().replace(/[\w\-:]+/g,function(e){return t.createElem(e),t.frag.createElement(e),'c("'+e+'")'})+");return n}")(C,t.frag)}function u(e){e||(e=t);var 
r=o(e);return!C.shivCSS||h||r.hasCSS||(r.hasCSS=!!n(e,"article,aside,dialog,figcaption,figure,footer,header,hgroup,main,nav,section{display:block}mark{background:#FF0;color:#000}template{display:none}")),g||l(e,r),e}function c(e){for(var t,n=e.getElementsByTagName("*"),i=n.length,o=RegExp("^(?:"+r().join("|")+")$","i"),a=[];i--;)t=n[i],o.test(t.nodeName)&&a.push(t.applyElement(f(t)));return a}function f(e){for(var t,n=e.attributes,r=n.length,i=e.ownerDocument.createElement(N+":"+e.nodeName);r--;)t=n[r],t.specified&&i.setAttribute(t.nodeName,t.nodeValue);return i.style.cssText=e.style.cssText,i}function d(e){for(var t,n=e.split("{"),i=n.length,o=RegExp("(^|[\\s,>+~])("+r().join("|")+")(?=[[\\s,>+~#.:]|$)","gi"),a="$1"+N+"\\:$2";i--;)t=n[i]=n[i].split("}"),t[t.length-1]=t[t.length-1].replace(o,a),n[i]=t.join("}");return n.join("{")}function p(e){for(var t=e.length;t--;)e[t].removeNode()}function m(e){function t(){clearTimeout(a._removeSheetTimer),r&&r.removeNode(!0),r=null}var r,i,a=o(e),s=e.namespaces,l=e.parentWindow;return!_||e.printShived?e:("undefined"==typeof s[N]&&s.add(N),l.attachEvent("onbeforeprint",function(){t();for(var o,a,s,l=e.styleSheets,u=[],f=l.length,p=Array(f);f--;)p[f]=l[f];for(;s=p.pop();)if(!s.disabled&&T.test(s.media)){try{o=s.imports,a=o.length}catch(m){a=0}for(f=0;a>f;f++)p.push(o[f]);try{u.push(s.cssText)}catch(m){}}u=d(u.reverse().join("")),i=c(e),r=n(e,u)}),l.attachEvent("onafterprint",function(){p(i),clearTimeout(a._removeSheetTimer),a._removeSheetTimer=setTimeout(t,500)}),e.printShived=!0,e)}var h,g,v="3.7.3",y=e.html5||{},x=/^<|^(?:button|map|select|textarea|object|iframe|option|optgroup)$/i,E=/^(?:a|b|code|div|fieldset|h1|h2|h3|h4|h5|h6|i|label|li|ol|p|q|span|strong|style|table|tbody|td|th|tr|ul)$/i,S="_html5shiv",b=0,w={};!function(){try{var e=t.createElement("a");e.innerHTML="",h="hidden"in e,g=1==e.childNodes.length||function(){t.createElement("a");var e=t.createDocumentFragment();return"undefined"==typeof 
e.cloneNode||"undefined"==typeof e.createDocumentFragment||"undefined"==typeof e.createElement}()}catch(n){h=!0,g=!0}}();var C={elements:y.elements||"abbr article aside audio bdi canvas data datalist details dialog figcaption figure footer header hgroup main mark meter nav output picture progress section summary template time video",version:v,shivCSS:y.shivCSS!==!1,supportsUnknownElements:g,shivMethods:y.shivMethods!==!1,type:"default",shivDocument:u,createElement:a,createDocumentFragment:s,addElements:i};e.html5=C,u(t);var T=/^$|\b(?:all|print)\b/,N="html5shiv",_=!g&&function(){var n=t.documentElement;return!("undefined"==typeof t.namespaces||"undefined"==typeof t.parentWindow||"undefined"==typeof n.applyElement||"undefined"==typeof n.removeNode||"undefined"==typeof e.attachEvent)}();C.type+=" print",C.shivPrint=m,m(t),"object"==typeof module&&module.exports&&(module.exports=C)}("undefined"!=typeof e?e:this,t);var N={elem:u("modernizr")};S._q.push(function(){delete N.elem});var _={style:N.elem.style};S._q.unshift(function(){delete _.style});var z=(E.testProp=function(e,t,r){return h([e],n,t,r)},function(){function e(e,t){var i;return e?(t&&"string"!=typeof t||(t=u(t||"div")),e="on"+e,i=e in t,!i&&r&&(t.setAttribute||(t=u("div")),t.setAttribute(e,""),i="function"==typeof t[e],t[e]!==n&&(t[e]=n),t.removeAttribute(e)),i):!1}var r=!("onblur"in t.documentElement);return e}());E.hasEvent=z,S.addTest("inputsearchevent",z("search"));var k=E.testStyles=f,$=function(){var e=navigator.userAgent,t=e.match(/applewebkit\/([0-9]+)/gi)&&parseFloat(RegExp.$1),n=e.match(/w(eb)?osbrowser/gi),r=e.match(/windows phone/gi)&&e.match(/iemobile\/([0-9])+/gi)&&parseFloat(RegExp.$1)>=9,i=533>t&&e.match(/android/gi);return n||i||r}();$?S.addTest("fontface",!1):k('@font-face {font-family:"font";src:url("https://")}',function(e,n){var 
r=t.getElementById("smodernizr"),i=r.sheet||r.styleSheet,o=i?i.cssRules&&i.cssRules[0]?i.cssRules[0].cssText:i.cssText||"":"",a=/src/i.test(o)&&0===o.indexOf(n.split(" ")[0]);S.addTest("fontface",a)});var j="Moz O ms Webkit",P=E._config.usePrefixes?j.split(" "):[];E._cssomPrefixes=P;var A=E._config.usePrefixes?j.toLowerCase().split(" "):[];E._domPrefixes=A,E.testAllProps=v,E.testAllProps=y;var R="CSS"in e&&"supports"in e.CSS,F="supportsCSS"in e;S.addTest("supports",R||F),S.addTest("csstransforms3d",function(){var e=!!y("perspective","1px",!0),t=S._config.usePrefixes;if(e&&(!t||"webkitPerspective"in C.style)){var n,r="#modernizr{width:0;height:0}";S.supports?n="@supports (perspective: 1px)":(n="@media (transform-3d)",t&&(n+=",(-webkit-transform-3d)")),n+="{#modernizr{width:7px;height:18px;margin:0;padding:0;border:0}}",k(r+n,function(t){e=7===t.offsetWidth&&18===t.offsetHeight})}return e}),S.addTest("json","JSON"in e&&"parse"in JSON&&"stringify"in JSON),S.addTest("checked",function(){return k("#modernizr {position:absolute} #modernizr input {margin-left:10px} #modernizr :checked {margin-left:20px;display:block}",function(e){var t=u("input");return t.setAttribute("type","checkbox"),t.setAttribute("checked","checked"),e.appendChild(t),20===t.offsetLeft})}),S.addTest("target",function(){var t=e.document;if(!("querySelectorAll"in t))return!1;try{return t.querySelectorAll(":target"),!0}catch(n){return!1}}),S.addTest("contains",r(String.prototype.contains,"function")),i(),o(w),delete E.addTest,delete E.addAsyncTest;for(var M=0;M #mq-test-1 { width: 42px; }',r.insertBefore(o,i),n=42===a.offsetWidth,r.removeChild(o),{matches:n,media:e}}}(e.document)}(this),function(e){"use strict";function t(){E(!0)}var n={};e.respond=n,n.update=function(){};var r=[],i=function(){var t=!1;try{t=new e.XMLHttpRequest}catch(n){t=new e.ActiveXObject("Microsoft.XMLHTTP")}return function(){return t}}(),o=function(e,t){var 
n=i();n&&(n.open("GET",e,!0),n.onreadystatechange=function(){4!==n.readyState||200!==n.status&&304!==n.status||t(n.responseText)},4!==n.readyState&&n.send(null))};if(n.ajax=o,n.queue=r,n.regex={media:/@media[^\{]+\{([^\{\}]*\{[^\}\{]*\})+/gi,keyframes:/@(?:\-(?:o|moz|webkit)\-)?keyframes[^\{]+\{(?:[^\{\}]*\{[^\}\{]*\})+[^\}]*\}/gi,urls:/(url\()['"]?([^\/\)'"][^:\)'"]+)['"]?(\))/g,findStyles:/@media *([^\{]+)\{([\S\s]+?)$/,only:/(only\s+)?([a-zA-Z]+)\s?/,minw:/\([\s]*min\-width\s*:[\s]*([\s]*[0-9\.]+)(px|em)[\s]*\)/,maxw:/\([\s]*max\-width\s*:[\s]*([\s]*[0-9\.]+)(px|em)[\s]*\)/},n.mediaQueriesSupported=e.matchMedia&&null!==e.matchMedia("only all")&&e.matchMedia("only all").matches,!n.mediaQueriesSupported){var a,s,l,u=e.document,c=u.documentElement,f=[],d=[],p=[],m={},h=30,g=u.getElementsByTagName("head")[0]||c,v=u.getElementsByTagName("base")[0],y=g.getElementsByTagName("link"),x=function(){var e,t=u.createElement("div"),n=u.body,r=c.style.fontSize,i=n&&n.style.fontSize,o=!1;return t.style.cssText="position:absolute;font-size:1em;width:1em",n||(n=o=u.createElement("body"),n.style.background="none"),c.style.fontSize="100%",n.style.fontSize="100%",n.appendChild(t),o&&c.insertBefore(n,c.firstChild),e=t.offsetWidth,o?c.removeChild(n):n.removeChild(t),c.style.fontSize=r,i&&(n.style.fontSize=i),e=l=parseFloat(e)},E=function(t){var n="clientWidth",r=c[n],i="CSS1Compat"===u.compatMode&&r||u.body[n]||r,o={},m=y[y.length-1],v=(new Date).getTime();if(t&&a&&h>v-a)return e.clearTimeout(s),void(s=e.setTimeout(E,h));a=v;for(var S in f)if(f.hasOwnProperty(S)){var b=f[S],w=b.minw,C=b.maxw,T=null===w,N=null===C,_="em";w&&(w=parseFloat(w)*(w.indexOf(_)>-1?l||x():1)),C&&(C=parseFloat(C)*(C.indexOf(_)>-1?l||x():1)),b.hasquery&&(T&&N||!(T||i>=w)||!(N||C>=i))||(o[b.media]||(o[b.media]=[]),o[b.media].push(d[b.rules]))}for(var z in p)p.hasOwnProperty(z)&&p[z]&&p[z].parentNode===g&&g.removeChild(p[z]);p.length=0;for(var k in o)if(o.hasOwnProperty(k)){var 
$=u.createElement("style"),j=o[k].join("\n");$.type="text/css",$.media=k,g.insertBefore($,m.nextSibling),$.styleSheet?$.styleSheet.cssText=j:$.appendChild(u.createTextNode(j)),p.push($)}},S=function(e,t,r){var i=e.replace(n.regex.keyframes,"").match(n.regex.media),o=i&&i.length||0;t=t.substring(0,t.lastIndexOf("/"));var a=function(e){return e.replace(n.regex.urls,"$1"+t+"$2$3")},s=!o&&r;t.length&&(t+="/"),s&&(o=1);for(var l=0;o>l;l++){var u,c,p,m;s?(u=r,d.push(a(e))):(u=i[l].match(n.regex.findStyles)&&RegExp.$1,d.push(RegExp.$2&&a(RegExp.$2))),p=u.split(","),m=p.length;for(var h=0;m>h;h++)c=p[h],f.push({media:c.split("(")[0].match(n.regex.only)&&RegExp.$2||"all",rules:d.length-1,hasquery:c.indexOf("(")>-1,minw:c.match(n.regex.minw)&&parseFloat(RegExp.$1)+(RegExp.$2||""),maxw:c.match(n.regex.maxw)&&parseFloat(RegExp.$1)+(RegExp.$2||"")})}E()},b=function(){if(r.length){var t=r.shift();o(t.href,function(n){S(n,t.href,t.media),m[t.href]=!0,e.setTimeout(function(){b()},0)})}},w=function(){for(var t=0;tf;f++)if(h=e[f],g=_.style[h],l(h,"-")&&(h=m(h)),_.style[h]!==n){if(o||r(i,"undefined"))return a(),"pfx"==t?h:!0;try{_.style[h]=i}catch(y){}if(_.style[h]!=g)return a(),"pfx"==t?h:!0}return a(),!1}function g(e,t,n){var i;for(var o in e)if(e[o]in t)return n===!1?e[o]:(i=t[e[o]],r(i,"function")?s(i,n||t):i);return!1}function v(e,t,n,i,o){var a=e.charAt(0).toUpperCase()+e.slice(1),s=(e+" "+P.join(a+" ")+a).split(" ");return r(t,"string")||r(t,"undefined")?h(s,t,i,o):(s=(e+" "+A.join(a+" ")+a).split(" "),g(s,t,n))}function y(e,t,r){return v(e,n,n,t,r)}var x=[],E={_version:"3.3.1",_config:{classPrefix:"",enableClasses:!0,enableJSClass:!0,usePrefixes:!0},_q:[],on:function(e,t){var n=this;setTimeout(function(){t(n[e])},0)},addTest:function(e,t,n){x.push({name:e,fn:t,options:n})},addAsyncTest:function(e){x.push({name:null,fn:e})}},S=function(){};S.prototype=E,S=new S;var b,w=[],C=t.documentElement,T="svg"===C.nodeName.toLowerCase();!function(){var 
e={}.hasOwnProperty;b=r(e,"undefined")||r(e.call,"undefined")?function(e,t){return t in e&&r(e.constructor.prototype[t],"undefined")}:function(t,n){return e.call(t,n)}}(),E._l={},E.on=function(e,t){this._l[e]||(this._l[e]=[]),this._l[e].push(t),S.hasOwnProperty(e)&&setTimeout(function(){S._trigger(e,S[e])},0)},E._trigger=function(e,t){if(this._l[e]){var n=this._l[e];setTimeout(function(){var e,r;for(e=0;e",r.insertBefore(n.lastChild,r.firstChild)}function r(){var e=C.elements;return"string"==typeof e?e.split(" "):e}function i(e,t){var n=C.elements;"string"!=typeof n&&(n=n.join(" ")),"string"!=typeof e&&(e=e.join(" ")),C.elements=n+" "+e,u(t)}function o(e){var t=w[e[S]];return t||(t={},b++,e[S]=b,w[b]=t),t}function a(e,n,r){if(n||(n=t),g)return n.createElement(e);r||(r=o(n));var i;return i=r.cache[e]?r.cache[e].cloneNode():E.test(e)?(r.cache[e]=r.createElem(e)).cloneNode():r.createElem(e),!i.canHaveChildren||x.test(e)||i.tagUrn?i:r.frag.appendChild(i)}function s(e,n){if(e||(e=t),g)return e.createDocumentFragment();n=n||o(e);for(var i=n.frag.cloneNode(),a=0,s=r(),l=s.length;l>a;a++)i.createElement(s[a]);return i}function l(e,t){t.cache||(t.cache={},t.createElem=e.createElement,t.createFrag=e.createDocumentFragment,t.frag=t.createFrag()),e.createElement=function(n){return C.shivMethods?a(n,e,t):t.createElem(n)},e.createDocumentFragment=Function("h,f","return function(){var n=f.cloneNode(),c=n.createElement;h.shivMethods&&("+r().join().replace(/[\w\-:]+/g,function(e){return t.createElem(e),t.frag.createElement(e),'c("'+e+'")'})+");return n}")(C,t.frag)}function u(e){e||(e=t);var r=o(e);return!C.shivCSS||h||r.hasCSS||(r.hasCSS=!!n(e,"article,aside,dialog,figcaption,figure,footer,header,hgroup,main,nav,section{display:block}mark{background:#FF0;color:#000}template{display:none}")),g||l(e,r),e}function c(e){for(var t,n=e.getElementsByTagName("*"),i=n.length,o=RegExp("^(?:"+r().join("|")+")$","i"),a=[];i--;)t=n[i],o.test(t.nodeName)&&a.push(t.applyElement(f(t)));return 
a}function f(e){for(var t,n=e.attributes,r=n.length,i=e.ownerDocument.createElement(N+":"+e.nodeName);r--;)t=n[r],t.specified&&i.setAttribute(t.nodeName,t.nodeValue);return i.style.cssText=e.style.cssText,i}function d(e){for(var t,n=e.split("{"),i=n.length,o=RegExp("(^|[\\s,>+~])("+r().join("|")+")(?=[[\\s,>+~#.:]|$)","gi"),a="$1"+N+"\\:$2";i--;)t=n[i]=n[i].split("}"),t[t.length-1]=t[t.length-1].replace(o,a),n[i]=t.join("}");return n.join("{")}function p(e){for(var t=e.length;t--;)e[t].removeNode()}function m(e){function t(){clearTimeout(a._removeSheetTimer),r&&r.removeNode(!0),r=null}var r,i,a=o(e),s=e.namespaces,l=e.parentWindow;return!_||e.printShived?e:("undefined"==typeof s[N]&&s.add(N),l.attachEvent("onbeforeprint",function(){t();for(var o,a,s,l=e.styleSheets,u=[],f=l.length,p=Array(f);f--;)p[f]=l[f];for(;s=p.pop();)if(!s.disabled&&T.test(s.media)){try{o=s.imports,a=o.length}catch(m){a=0}for(f=0;a>f;f++)p.push(o[f]);try{u.push(s.cssText)}catch(m){}}u=d(u.reverse().join("")),i=c(e),r=n(e,u)}),l.attachEvent("onafterprint",function(){p(i),clearTimeout(a._removeSheetTimer),a._removeSheetTimer=setTimeout(t,500)}),e.printShived=!0,e)}var h,g,v="3.7.3",y=e.html5||{},x=/^<|^(?:button|map|select|textarea|object|iframe|option|optgroup)$/i,E=/^(?:a|b|code|div|fieldset|h1|h2|h3|h4|h5|h6|i|label|li|ol|p|q|span|strong|style|table|tbody|td|th|tr|ul)$/i,S="_html5shiv",b=0,w={};!function(){try{var e=t.createElement("a");e.innerHTML="",h="hidden"in e,g=1==e.childNodes.length||function(){t.createElement("a");var e=t.createDocumentFragment();return"undefined"==typeof e.cloneNode||"undefined"==typeof e.createDocumentFragment||"undefined"==typeof e.createElement}()}catch(n){h=!0,g=!0}}();var C={elements:y.elements||"abbr article aside audio bdi canvas data datalist details dialog figcaption figure footer header hgroup main mark meter nav output picture progress section summary template time 
video",version:v,shivCSS:y.shivCSS!==!1,supportsUnknownElements:g,shivMethods:y.shivMethods!==!1,type:"default",shivDocument:u,createElement:a,createDocumentFragment:s,addElements:i};e.html5=C,u(t);var T=/^$|\b(?:all|print)\b/,N="html5shiv",_=!g&&function(){var n=t.documentElement;return!("undefined"==typeof t.namespaces||"undefined"==typeof t.parentWindow||"undefined"==typeof n.applyElement||"undefined"==typeof n.removeNode||"undefined"==typeof e.attachEvent)}();C.type+=" print",C.shivPrint=m,m(t),"object"==typeof module&&module.exports&&(module.exports=C)}("undefined"!=typeof e?e:this,t);var N={elem:u("modernizr")};S._q.push(function(){delete N.elem});var _={style:N.elem.style};S._q.unshift(function(){delete _.style});var z=(E.testProp=function(e,t,r){return h([e],n,t,r)},function(){function e(e,t){var i;return e?(t&&"string"!=typeof t||(t=u(t||"div")),e="on"+e,i=e in t,!i&&r&&(t.setAttribute||(t=u("div")),t.setAttribute(e,""),i="function"==typeof t[e],t[e]!==n&&(t[e]=n),t.removeAttribute(e)),i):!1}var r=!("onblur"in t.documentElement);return e}());E.hasEvent=z,S.addTest("inputsearchevent",z("search"));var k=E.testStyles=f,$=function(){var e=navigator.userAgent,t=e.match(/applewebkit\/([0-9]+)/gi)&&parseFloat(RegExp.$1),n=e.match(/w(eb)?osbrowser/gi),r=e.match(/windows phone/gi)&&e.match(/iemobile\/([0-9])+/gi)&&parseFloat(RegExp.$1)>=9,i=533>t&&e.match(/android/gi);return n||i||r}();$?S.addTest("fontface",!1):k('@font-face {font-family:"font";src:url("https://")}',function(e,n){var r=t.getElementById("smodernizr"),i=r.sheet||r.styleSheet,o=i?i.cssRules&&i.cssRules[0]?i.cssRules[0].cssText:i.cssText||"":"",a=/src/i.test(o)&&0===o.indexOf(n.split(" ")[0]);S.addTest("fontface",a)});var j="Moz O ms Webkit",P=E._config.usePrefixes?j.split(" "):[];E._cssomPrefixes=P;var A=E._config.usePrefixes?j.toLowerCase().split(" "):[];E._domPrefixes=A,E.testAllProps=v,E.testAllProps=y;var R="CSS"in e&&"supports"in e.CSS,F="supportsCSS"in 
e;S.addTest("supports",R||F),S.addTest("csstransforms3d",function(){var e=!!y("perspective","1px",!0),t=S._config.usePrefixes;if(e&&(!t||"webkitPerspective"in C.style)){var n,r="#modernizr{width:0;height:0}";S.supports?n="@supports (perspective: 1px)":(n="@media (transform-3d)",t&&(n+=",(-webkit-transform-3d)")),n+="{#modernizr{width:7px;height:18px;margin:0;padding:0;border:0}}",k(r+n,function(t){e=7===t.offsetWidth&&18===t.offsetHeight})}return e}),S.addTest("json","JSON"in e&&"parse"in JSON&&"stringify"in JSON),S.addTest("checked",function(){return k("#modernizr {position:absolute} #modernizr input {margin-left:10px} #modernizr :checked {margin-left:20px;display:block}",function(e){var t=u("input");return t.setAttribute("type","checkbox"),t.setAttribute("checked","checked"),e.appendChild(t),20===t.offsetLeft})}),S.addTest("target",function(){var t=e.document;if(!("querySelectorAll"in t))return!1;try{return t.querySelectorAll(":target"),!0}catch(n){return!1}}),S.addTest("contains",r(String.prototype.contains,"function")),i(),o(w),delete E.addTest,delete E.addAsyncTest;for(var M=0;M #mq-test-1 { width: 42px; }',r.insertBefore(o,i),n=42===a.offsetWidth,r.removeChild(o),{matches:n,media:e}}}(e.document)}(this),function(e){"use strict";function t(){E(!0)}var n={};e.respond=n,n.update=function(){};var r=[],i=function(){var t=!1;try{t=new e.XMLHttpRequest}catch(n){t=new e.ActiveXObject("Microsoft.XMLHTTP")}return function(){return t}}(),o=function(e,t){var n=i();n&&(n.open("GET",e,!0),n.onreadystatechange=function(){4!==n.readyState||200!==n.status&&304!==n.status||t(n.responseText)},4!==n.readyState&&n.send(null))};if(n.ajax=o,n.queue=r,n.regex={media:/@media[^\{]+\{([^\{\}]*\{[^\}\{]*\})+/gi,keyframes:/@(?:\-(?:o|moz|webkit)\-)?keyframes[^\{]+\{(?:[^\{\}]*\{[^\}\{]*\})+[^\}]*\}/gi,urls:/(url\()['"]?([^\/\)'"][^:\)'"]+)['"]?(\))/g,findStyles:/@media 
*([^\{]+)\{([\S\s]+?)$/,only:/(only\s+)?([a-zA-Z]+)\s?/,minw:/\([\s]*min\-width\s*:[\s]*([\s]*[0-9\.]+)(px|em)[\s]*\)/,maxw:/\([\s]*max\-width\s*:[\s]*([\s]*[0-9\.]+)(px|em)[\s]*\)/},n.mediaQueriesSupported=e.matchMedia&&null!==e.matchMedia("only all")&&e.matchMedia("only all").matches,!n.mediaQueriesSupported){var a,s,l,u=e.document,c=u.documentElement,f=[],d=[],p=[],m={},h=30,g=u.getElementsByTagName("head")[0]||c,v=u.getElementsByTagName("base")[0],y=g.getElementsByTagName("link"),x=function(){var e,t=u.createElement("div"),n=u.body,r=c.style.fontSize,i=n&&n.style.fontSize,o=!1;return t.style.cssText="position:absolute;font-size:1em;width:1em",n||(n=o=u.createElement("body"),n.style.background="none"),c.style.fontSize="100%",n.style.fontSize="100%",n.appendChild(t),o&&c.insertBefore(n,c.firstChild),e=t.offsetWidth,o?c.removeChild(n):n.removeChild(t),c.style.fontSize=r,i&&(n.style.fontSize=i),e=l=parseFloat(e)},E=function(t){var n="clientWidth",r=c[n],i="CSS1Compat"===u.compatMode&&r||u.body[n]||r,o={},m=y[y.length-1],v=(new Date).getTime();if(t&&a&&h>v-a)return e.clearTimeout(s),void(s=e.setTimeout(E,h));a=v;for(var S in f)if(f.hasOwnProperty(S)){var b=f[S],w=b.minw,C=b.maxw,T=null===w,N=null===C,_="em";w&&(w=parseFloat(w)*(w.indexOf(_)>-1?l||x():1)),C&&(C=parseFloat(C)*(C.indexOf(_)>-1?l||x():1)),b.hasquery&&(T&&N||!(T||i>=w)||!(N||C>=i))||(o[b.media]||(o[b.media]=[]),o[b.media].push(d[b.rules]))}for(var z in p)p.hasOwnProperty(z)&&p[z]&&p[z].parentNode===g&&g.removeChild(p[z]);p.length=0;for(var k in o)if(o.hasOwnProperty(k)){var $=u.createElement("style"),j=o[k].join("\n");$.type="text/css",$.media=k,g.insertBefore($,m.nextSibling),$.styleSheet?$.styleSheet.cssText=j:$.appendChild(u.createTextNode(j)),p.push($)}},S=function(e,t,r){var i=e.replace(n.regex.keyframes,"").match(n.regex.media),o=i&&i.length||0;t=t.substring(0,t.lastIndexOf("/"));var a=function(e){return e.replace(n.regex.urls,"$1"+t+"$2$3")},s=!o&&r;t.length&&(t+="/"),s&&(o=1);for(var 
l=0;o>l;l++){var u,c,p,m;s?(u=r,d.push(a(e))):(u=i[l].match(n.regex.findStyles)&&RegExp.$1,d.push(RegExp.$2&&a(RegExp.$2))),p=u.split(","),m=p.length;for(var h=0;m>h;h++)c=p[h],f.push({media:c.split("(")[0].match(n.regex.only)&&RegExp.$2||"all",rules:d.length-1,hasquery:c.indexOf("(")>-1,minw:c.match(n.regex.minw)&&parseFloat(RegExp.$1)+(RegExp.$2||""),maxw:c.match(n.regex.maxw)&&parseFloat(RegExp.$1)+(RegExp.$2||"")})}E()},b=function(){if(r.length){var t=r.shift();o(t.href,function(n){S(n,t.href,t.media),m[t.href]=!0,e.setTimeout(function(){b()},0)})}},w=function(){for(var t=0;tli:before{content:"\e602";display:block;float:left;font-family:Icon;font-size:16px;width:1.2em;margin-left:-1.2em;vertical-align:-.1em}.article p>code{white-space:nowrap;padding:2px 4px}.article kbd{display:inline-block;padding:3px 5px;line-height:10px}.article hr{margin-top:1.5em}.article img{max-width:100%}.article pre{padding:16px;margin:1.5em -16px 0;line-height:1.5em;overflow:auto;-webkit-overflow-scrolling:touch}.article table{margin:3em 0 1.5em;font-size:13px;overflow:hidden}.no-js .article table{display:inline-block;max-width:100%;overflow:auto;-webkit-overflow-scrolling:touch}.article table th{min-width:100px;font-size:12px;text-align:left}.article table td,.article table th{padding:12px 16px;vertical-align:top}.article blockquote{padding-left:16px}.article .data{margin:1.5em -16px;padding:1.5em 0;overflow:auto;-webkit-overflow-scrolling:touch;text-align:center}.article .data table{display:inline-block;margin:0 16px;text-align:left}.footer{position:absolute;bottom:0;left:0;right:0;padding:0 4px}.copyright{margin:1.5em 0}.pagination{max-width:1184px;height:92px;padding:4px 0;margin-left:auto;margin-right:auto;overflow:hidden}.pagination a{display:block;height:100%}.pagination .next,.pagination .previous{position:relative;float:left;height:100%}.pagination .previous{width:25%}.pagination .previous .direction,.pagination .previous .stretch{display:none}.pagination 
.next{width:75%;text-align:right}.pagination .page{display:table;position:absolute;bottom:4px}.pagination .direction{display:block;position:absolute;bottom:40px;width:100%;font-size:15px;line-height:20px;padding:0 52px}.pagination .stretch{padding:0 4px}.pagination .stretch .title{font-size:18px;padding:11px 0 13px}.admonition{margin:20px -16px 0;padding:20px 16px}.admonition>:first-child{margin-top:0}.admonition .admonition-title{font-size:20px}.admonition .admonition-title:before{content:"\e611";display:block;float:left;font-family:Icon;font-size:24px;vertical-align:-.1em;margin-right:5px}.admonition.warning .admonition-title:before{content:"\e610"}.article h3{font-weight:700}.article h4{font-weight:400;font-style:italic}.article h2 a,.article h3 a,.article h4 a,.article h5 a,.article h6 a{font-weight:400;font-style:normal}.bar{-webkit-transform:translateZ(0);transform:translateZ(0);-webkit-transition:opacity .2s cubic-bezier(.75,0,.25,1),-webkit-transform .4s cubic-bezier(.75,0,.25,1);transition:opacity .2s cubic-bezier(.75,0,.25,1),-webkit-transform .4s cubic-bezier(.75,0,.25,1);transition:opacity .2s cubic-bezier(.75,0,.25,1),transform .4s cubic-bezier(.75,0,.25,1);transition:opacity .2s cubic-bezier(.75,0,.25,1),transform .4s cubic-bezier(.75,0,.25,1),-webkit-transform .4s cubic-bezier(.75,0,.25,1)}#toggle-search:checked~.header .bar,.toggle-search .bar{-webkit-transform:translate3d(0,-56px,0);transform:translate3d(0,-56px,0)}.bar.search .button-reset{-webkit-transform:scale(.5);transform:scale(.5);-webkit-transition:opacity .4s cubic-bezier(.1,.7,.1,1),-webkit-transform .4s cubic-bezier(.1,.7,.1,1);transition:opacity .4s cubic-bezier(.1,.7,.1,1),-webkit-transform .4s cubic-bezier(.1,.7,.1,1);transition:opacity .4s cubic-bezier(.1,.7,.1,1),transform .4s cubic-bezier(.1,.7,.1,1);transition:opacity .4s cubic-bezier(.1,.7,.1,1),transform .4s cubic-bezier(.1,.7,.1,1),-webkit-transform .4s cubic-bezier(.1,.7,.1,1);opacity:0}.bar.search.non-empty 
.button-reset{-webkit-transform:scale(1);transform:scale(1);opacity:1}.results{-webkit-transition:opacity .3s .1s,width 0s .4s,height 0s .4s;transition:opacity .3s .1s,width 0s .4s,height 0s .4s}#toggle-search:checked~.main .results,.toggle-search .results{-webkit-transition:opacity .4s,width 0s,height 0s;transition:opacity .4s,width 0s,height 0s}.results .list a{-webkit-transition:background .25s;transition:background .25s}.no-csstransforms3d .bar.default{display:table}.no-csstransforms3d .bar.search{display:none;margin-top:0}.no-csstransforms3d #toggle-search:checked~.header .bar.default,.no-csstransforms3d .toggle-search .bar.default{display:none}.no-csstransforms3d #toggle-search:checked~.header .bar.search,.no-csstransforms3d .toggle-search .bar.search{display:table}.bar.search{opacity:0}.bar.search .query{background:transparent;color:rgba(0,0,0,.87)}.bar.search .query::-webkit-input-placeholder{color:rgba(0,0,0,.26)}.bar.search .query:-moz-placeholder,.bar.search .query::-moz-placeholder{color:rgba(0,0,0,.26)}.bar.search .query:-ms-input-placeholder{color:rgba(0,0,0,.26)}.bar.search .button .icon:active{background:rgba(0,0,0,.12)}.results{box-shadow:0 4px 7px rgba(0,0,0,.23),0 8px 25px rgba(0,0,0,.05);background:#fff;color:rgba(0,0,0,.87);opacity:0}#toggle-search:checked~.main .results,.toggle-search .results{opacity:1}.results .meta{background:#e84e40;color:#fff}.results .list a{border-bottom:1px solid rgba(0,0,0,.12)}.results .list a:last-child{border-bottom:none}.results .list a:active{background:rgba(0,0,0,.12)}.result span{color:rgba(0,0,0,.54)}#toggle-search:checked~.header,.toggle-search .header{background:#fff;color:rgba(0,0,0,.54)}#toggle-search:checked~.header:before,.toggle-search .header:before{background:rgba(0,0,0,.54)}#toggle-search:checked~.header .bar.default,.toggle-search .header .bar.default{opacity:0}#toggle-search:checked~.header .bar.search,.toggle-search .header .bar.search{opacity:1}.bar.search{margin-top:8px}.bar.search 
.query{font-size:18px;padding:13px 0;margin:0;width:100%;height:48px}.bar.search .query::-ms-clear{display:none}.results{position:fixed;top:0;left:0;width:0;height:100%;z-index:1;overflow-y:scroll;-webkit-overflow-scrolling:touch}.results .scrollable{top:56px}#toggle-search:checked~.main .results,.toggle-search .results{width:100%;overflow-y:visible}.results .meta{font-weight:700}.results .meta strong{display:block;font-size:11px;max-width:1200px;margin-left:auto;margin-right:auto;padding:16px}.results .list a{display:block}.result{max-width:1200px;margin-left:auto;margin-right:auto;padding:12px 16px 16px}.result h1{line-height:24px}.result h1,.result span{text-overflow:ellipsis;white-space:nowrap;overflow:hidden}.result span{font-size:12px}.no-csstransforms3d .results{display:none}.no-csstransforms3d #toggle-search:checked~.main .results,.no-csstransforms3d .toggle-search .results{display:block;overflow:auto}.meta{text-transform:uppercase;font-weight:700}@media only screen and (min-width:960px){.backdrop{background:#f2f2f2}.backdrop-paper:after{box-shadow:0 1.5px 3px rgba(0,0,0,.24),0 3px 8px rgba(0,0,0,.05)}.button-menu{display:none}.drawer{float:left;height:auto;margin-bottom:96px;padding-top:80px}.drawer,.drawer .scrollable{position:static}.article{margin-left:262px}.footer{z-index:4}.copyright{margin-bottom:64px}.results{height:auto;top:64px}.results .scrollable{position:static;max-height:413px}}@media only screen and (max-width:959px){#toggle-drawer:checked~.overlay,.toggle-drawer .overlay{width:100%;height:100%}.drawer{-webkit-transform:translate3d(-262px,0,0);transform:translate3d(-262px,0,0);-webkit-transition:-webkit-transform .25s cubic-bezier(.4,0,.2,1);transition:-webkit-transform .25s cubic-bezier(.4,0,.2,1);transition:transform .25s cubic-bezier(.4,0,.2,1);transition:transform .25s cubic-bezier(.4,0,.2,1),-webkit-transform .25s cubic-bezier(.4,0,.2,1)}.no-csstransforms3d .drawer{display:none}.drawer{background:#fff}.project{box-shadow:0 1.5px 3px 
rgba(0,0,0,.24),0 3px 8px rgba(0,0,0,.05);background:#e84e40;color:#fff}.drawer{position:fixed;z-index:4}#toggle-search:checked~.main .results,.drawer,.toggle-search .results{height:100%}}@media only screen and (min-width:720px){.header{height:64px;padding:8px}.header .stretch{padding:0 16px}.header .stretch .title{font-size:20px;padding:12px 0}.project .name{margin:26px 0 0 5px}.article .wrapper{padding:128px 24px 96px}.article .data{margin:1.5em -24px}.article .data table{margin:0 24px}.article h2{padding-top:100px;margin-top:-64px}.ios.standalone .article h2{padding-top:28px;margin-top:8px}.article h3,.article h4{padding-top:84px;margin-top:-64px}.ios.standalone .article h3,.ios.standalone .article h4{padding-top:20px;margin-top:0}.article pre{padding:1.5em 24px;margin:1.5em -24px 0}.footer{padding:0 8px}.pagination{height:96px;padding:8px 0}.pagination .direction{padding:0 56px;bottom:40px}.pagination .stretch{padding:0 8px}.admonition{margin:20px -24px 0;padding:20px 24px}.bar.search .query{font-size:20px;padding:12px 0}.results .scrollable{top:64px}.results .meta strong{padding:16px 24px}.result{padding:16px 24px 20px}}@media only screen and (min-width:1200px){.header{width:100%}.drawer .scrollable .wrapper hr{width:48px}}@media only screen and (orientation:portrait){.ios.standalone .header{height:76px;padding-top:24px}.ios.standalone .header:before{content:" ";position:absolute;top:0;left:0;z-index:3;width:100%;height:20px}.ios.standalone .drawer .scrollable{top:124px}.ios.standalone .project{padding-top:20px}.ios.standalone .project:before{content:" ";position:absolute;top:0;left:0;z-index:3;width:100%;height:20px}.ios.standalone .article{position:absolute;top:76px;right:0;bottom:0;left:0}.ios.standalone .results .scrollable{top:76px}}@media only screen and (orientation:portrait) and (min-width:720px){.ios.standalone .header{height:84px;padding-top:28px}.ios.standalone .results .scrollable{top:84px}}@media only screen and (max-width:719px){.bar 
.path{display:none}}@media only screen and (max-width:479px){.button-github,.button-twitter{display:none}}@media only screen and (min-width:720px) and (max-width:959px){.header .stretch{padding:0 24px}}@media only screen and (min-width:480px){.pagination .next,.pagination .previous{width:50%}.pagination .previous .direction{display:block}.pagination .previous .stretch{display:table}}@media print{.drawer,.footer,.header,.headerlink{display:none}.article .wrapper{padding-top:0}.article pre,.article pre *{color:rgba(0,0,0,.87)!important}.article pre{border:1px solid rgba(0,0,0,.12)}.article table{border-radius:none;box-shadow:none}.article table th{color:#e84e40}} \ No newline at end of file diff --git a/assets/stylesheets/application.css b/assets/stylesheets/application.css deleted file mode 100644 index 965bbb3..0000000 --- a/assets/stylesheets/application.css +++ /dev/null @@ -1 +0,0 @@ -html{box-sizing:border-box;-moz-box-sizing:border-box;-webkit-box-sizing:border-box}*,:after,:before{box-sizing:inherit;-moz-box-sizing:inherit;-webkit-box-sizing:inherit}html{font-size:62.5%;-webkit-text-size-adjust:none;-ms-text-size-adjust:none;text-size-adjust:none}a,abbr,acronym,address,applet,article,aside,audio,b,big,blockquote,body,canvas,caption,center,cite,code,dd,del,details,dfn,div,dl,dt,em,embed,fieldset,figcaption,figure,footer,form,h1,h2,h3,h4,h5,h6,header,hgroup,html,i,iframe,img,ins,kbd,label,legend,li,main,mark,menu,nav,object,ol,output,p,pre,q,ruby,s,samp,section,small,span,strike,strong,sub,summary,sup,table,tbody,td,tfoot,th,thead,time,tr,tt,u,ul,var,video{margin:0;padding:0;border:0}main{display:block}ul{list-style:none}table{border-collapse:collapse;border-spacing:0}td{text-align:left;font-weight:400;vertical-align:middle}button{outline:0;padding:0;background:transparent;border:none;font-size:inherit}input{-webkit-appearance:none;-moz-appearance:none;-ms-appearance:none;-o-appearance:none;appearance:none;outline:none;border:none}a{text-decoration:none;color:in
herit}a,button,input,label{-webkit-tap-highlight-color:rgba(255,255,255,0);-webkit-tap-highlight-color:transparent}h1,h2,h3,h4,h5,h6{font-weight:inherit}pre{background:rgba(0,0,0,.05)}pre,pre code{color:rgba(0,0,0,.87)}.c,.c1,.cm,.o{color:rgba(0,0,0,.54)}.k,.kn{color:#a71d5d}.kd,.kt{color:#0086b3}.n.f,.nf{color:#795da3}.nx{color:#0086b3}.s,.s1{color:#183691}.bp,.mi{color:#9575cd}.icon{font-family:Icon;speak:none;font-style:normal;font-weight:400;font-variant:normal;text-transform:none;line-height:1;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.icon-search:before{content:"\e600"}.icon-back:before{content:"\e601"}.icon-link:before{content:"\e602"}.icon-close:before{content:"\e603"}.icon-menu:before{content:"\e604"}.icon-forward:before{content:"\e605"}.icon-twitter:before{content:"\e606"}.icon-github:before{content:"\e607"}.icon-download:before{content:"\e608"}.icon-star:before{content:"\e609"}.icon-warning:before{content:"\e610"}.icon-note:before{content:"\e611"}a{-webkit-transition:color .25s;transition:color .25s}.overlay{-webkit-transition:opacity .25s,width 0s .25s,height 0s .25s;transition:opacity .25s,width 0s .25s,height 0s .25s}#toggle-drawer:checked~.overlay,.toggle-drawer .overlay{-webkit-transition:opacity .25s,width 0s,height 0s;transition:opacity .25s,width 0s,height 0s}.js .header{-webkit-transition:background .6s,color .6s;transition:background .6s,color .6s}.js .header:before{-webkit-transition:background .6s;transition:background .6s}.button .icon{-webkit-transition:background .25s;transition:background .25s}body{color:rgba(0,0,0,.87)}@supports (-webkit-appearance:none){body{background:#e84e40}}.ios body{background:#fff}hr{border:0;border-top:1px solid rgba(0,0,0,.12)}.toggle-button{cursor:pointer;color:inherit}.backdrop,.backdrop-paper:after{background:#fff}.overlay{background:rgba(0,0,0,.54);opacity:0}#toggle-drawer:checked~.overlay,.toggle-drawer .overlay{opacity:1}.header{box-shadow:0 1.5px 3px rgba(0,0,0,.24),0 3px 8px 
rgba(0,0,0,.05);background:#e84e40;color:#fff}.ios.standalone .header:before{background:rgba(0,0,0,.12)}.bar .path{color:hsla(0,0%,100%,.7)}.button .icon{border-radius:100%}.button .icon:active{background:hsla(0,0%,100%,.12)}html{height:100%}body{position:relative;min-height:100%}hr{display:block;height:1px;padding:0;margin:0}.locked{height:100%;overflow:hidden}.scrollable{position:absolute;top:0;right:0;bottom:0;left:0;overflow:auto;-webkit-overflow-scrolling:touch}.scrollable .wrapper{height:100%}.ios .scrollable .wrapper{margin-bottom:2px}.toggle{display:none}.toggle-button{display:block}.backdrop{position:absolute;top:0;right:0;bottom:0;left:0;z-index:-1}.backdrop-paper{max-width:1200px;height:100%;margin-left:auto;margin-right:auto}.backdrop-paper:after{content:" ";display:block;height:100%;margin-left:262px}.overlay{width:0;height:0;z-index:3}.header,.overlay{position:fixed;top:0}.header{-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none;left:0;z-index:2;height:56px;padding:4px;overflow:hidden}.ios.standalone .header{position:absolute}.bar{display:table;max-width:1184px;margin-left:auto;margin-right:auto}.bar a{display:block}.no-js .bar .button-search{display:none}.bar .path .icon:before{vertical-align:-1.5px}.button{display:table-cell;vertical-align:top;width:1%}.button button{margin:0;padding:0}.button button:active:before{position:relative;top:0;left:0}.button .icon{display:inline-block;font-size:24px;padding:8px;margin:4px}.stretch{display:table;table-layout:fixed;width:100%}.header .stretch{padding:0 20px}.stretch .title{display:table-cell;overflow:hidden;white-space:nowrap;text-overflow:ellipsis}.header .stretch .title{font-size:18px;padding:13px 0}.main{max-width:1200px;margin-left:auto;margin-right:auto}body,input{font-weight:400;-webkit-font-smoothing:antialiased;-moz-osx-font-smoothing:grayscale}.no-fontface body,.no-fontface input,body,input{font-family:Helvetica Neue,Helvetica,Arial,sans-serif}.no-fontface 
code,.no-fontface kbd,.no-fontface pre,code,kbd,pre{font-family:Courier New,Courier,monospace}#toggle-drawer:checked~.main .drawer,.toggle-drawer .drawer{-webkit-transform:translateZ(0);transform:translateZ(0)}.no-csstransforms3d #toggle-drawer:checked~.main .drawer,.no-csstransforms3d .toggle-drawer .drawer{display:block}.project{-webkit-transition:none;transition:none}.project .logo img{-webkit-transition:box-shadow .4s;transition:box-shadow .4s}.repo a{-webkit-transition:box-shadow .4s,opacity .4s;transition:box-shadow .4s,opacity .4s}.drawer .toc a.current,.drawer .toc a:focus,.drawer .toc a:hover{color:#e84e40}.drawer .anchor a{border-left:2px solid #e84e40}.drawer .section{color:rgba(0,0,0,.54)}.ios.standalone .project:before{background:rgba(0,0,0,.12)}.project .logo img{background:#fff;border-radius:100%}.project:focus .logo img,.project:hover .logo img{box-shadow:0 4px 7px rgba(0,0,0,.23),0 8px 25px rgba(0,0,0,.05)}.repo a{background:#00bfa5;color:#fff;border-radius:3px}.repo a:focus,.repo a:hover{box-shadow:0 4px 7px rgba(0,0,0,.23),0 8px 25px rgba(0,0,0,.05);opacity:.8}.repo a .count{background:rgba(0,0,0,.26);color:#fff;border-radius:0 3px 3px 0}.repo a .count:before{border-width:15px 5px 15px 0;border-color:transparent rgba(0,0,0,.26);border-style:solid}.drawer{width:262px;font-size:13px;line-height:1em}.ios .drawer{overflow:scroll;-webkit-overflow-scrolling:touch}.drawer .toc li a{display:block;padding:14.5px 24px;white-space:nowrap;overflow:hidden;text-overflow:ellipsis}.drawer .toc li.anchor a{margin-left:12px;padding:10px 24px 10px 12px}.drawer .toc li ul{margin-left:12px}.drawer .current+ul{margin-bottom:9px}.drawer .section{display:block;padding:14.5px 24px}.drawer .scrollable{top:104px;z-index:-1}.drawer .scrollable .wrapper{height:auto;min-height:100%}.drawer .scrollable .wrapper hr{margin:12px 0;margin-right:auto}.drawer .scrollable .wrapper .toc{margin:12px 0}.project{display:block}.project 
.banner{display:table;width:100%;height:104px;padding:20px}.project .logo{display:table-cell;width:64px;padding-right:12px}.project .logo img{display:block;width:64px;height:64px}.project .name{display:table-cell;padding-left:4px;font-size:14px;line-height:1.25em;vertical-align:middle}.project .logo+.name{font-size:12px}.repo{margin:24px 0;text-align:center}.repo li{display:inline-block;padding-right:12px;white-space:nowrap}.repo li:last-child{padding-right:0}.repo a{display:inline-block;padding:0 10px 0 6px;font-size:12px;line-height:30px;height:30px}.repo a .icon{font-size:18px;vertical-align:-3px}.repo a .count{display:inline-block;position:relative;padding:0 8px 0 4px;margin:0 -10px 0 8px;font-size:12px}.repo a .count:before{content:" ";display:block;position:absolute;top:0;left:-5px}.no-js .repo a .count{display:none}.drawer .toc li a{font-weight:700}.drawer .toc li.anchor a{font-weight:400}.drawer .section,.repo a{font-weight:700}.repo a{text-transform:uppercase}.repo a .count{text-transform:none;font-weight:700}pre span{-webkit-transition:color .25s;transition:color .25s}.copyright a{-webkit-transition:color .25s;transition:color .25s}.ios.standalone .article{background:-webkit-linear-gradient(top,#fff 50%,#e84e40 0);background:linear-gradient(180deg,#fff 50%,#e84e40 0)}.ios.standalone .article .wrapper{background:-webkit-linear-gradient(top,#fff 50%,#fff 0);background:linear-gradient(180deg,#fff 50%,#fff 0)}.article a,.article h1,.article h2{color:#e84e40}.article code{background:#eee}.article kbd{color:#555;background-color:#fcfcfc;border:1px solid #ccc;border-bottom-color:#bbb;border-radius:3px;box-shadow:inset 0 -1px 0 #bbb}.article h1{border-bottom:1px solid rgba(0,0,0,.12)}.article a{border-bottom:1px dotted}.article a:focus,.article a:hover{color:#00bfa5}.article .headerlink{color:rgba(0,0,0,.26);border:none}.article table{box-shadow:0 1.5px 3px rgba(0,0,0,.24),0 3px 8px rgba(0,0,0,.05);border-radius:3px}.article table 
th{background:#ee7a70;color:#fff}.article table td{border-top:1px solid rgba(0,0,0,.05)}.article blockquote{border-left:2px solid rgba(0,0,0,.54);color:rgba(0,0,0,.54)}.footer{background:#e84e40;color:#fff}.footer a{border:none}.copyright{color:rgba(0,0,0,.54)}.pagination a .button,.pagination a .title{color:#fff}.pagination .direction{color:hsla(0,0%,100%,.7)}.admonition{background:#29b6f6;color:#fff}.admonition pre{background:hsla(0,0%,100%,.3)}.admonition.warning{background:#e84e40}.admonition a,.admonition a:hover{color:#fff}.article{font-size:14px;line-height:1.7em}.article:after{content:" ";display:block;clear:both}.article .wrapper{padding:116px 16px 92px}.ios.standalone .article{position:absolute;top:56px;right:0;bottom:0;left:0;overflow:auto;-webkit-overflow-scrolling:touch}.ios.standalone .article .wrapper{position:relative;min-height:100%;padding-top:60px;margin-bottom:2px}.article h1{font-size:24px;line-height:1.333334em;padding:20px 0 42px}.article h2{font-size:20px;line-height:1.4em;padding-top:92px;margin-top:-56px}.ios.standalone .article h2{padding-top:36px;margin:0}.article h3,.article h4{font-size:14px;padding-top:76px;margin-top:-56px}.ios.standalone .article h3,.ios.standalone .article h4{padding-top:20px;margin-top:0}.article .headerlink{float:right;margin-left:20px;font-size:14px}h1 .article .headerlink{display:none}.article ol,.article p,.article ul{margin-top:1.5em}.article li,.article li ol,.article li ul{margin-top:.75em}.article li{margin-left:18px}.article li p{display:inline}.article ul>li:before{content:"\e602";display:block;float:left;font-family:Icon;font-size:16px;width:1.2em;margin-left:-1.2em;vertical-align:-.1em}.article p>code{white-space:nowrap;padding:2px 4px}.article kbd{display:inline-block;padding:3px 5px;line-height:10px}.article hr{margin-top:1.5em}.article img{max-width:100%}.article pre{padding:16px;margin:1.5em -16px 0;line-height:1.5em;overflow:auto;-webkit-overflow-scrolling:touch}.article table{margin:3em 0 
1.5em;font-size:13px;overflow:hidden}.no-js .article table{display:inline-block;max-width:100%;overflow:auto;-webkit-overflow-scrolling:touch}.article table th{min-width:100px;font-size:12px;text-align:left}.article table td,.article table th{padding:12px 16px;vertical-align:top}.article blockquote{padding-left:16px}.article .data{margin:1.5em -16px;padding:1.5em 0;overflow:auto;-webkit-overflow-scrolling:touch;text-align:center}.article .data table{display:inline-block;margin:0 16px;text-align:left}.footer{position:absolute;bottom:0;left:0;right:0;padding:0 4px}.copyright{margin:1.5em 0}.pagination{max-width:1184px;height:92px;padding:4px 0;margin-left:auto;margin-right:auto;overflow:hidden}.pagination a{display:block;height:100%}.pagination .next,.pagination .previous{position:relative;float:left;height:100%}.pagination .previous{width:25%}.pagination .previous .direction,.pagination .previous .stretch{display:none}.pagination .next{width:75%;text-align:right}.pagination .page{display:table;position:absolute;bottom:4px}.pagination .direction{display:block;position:absolute;bottom:40px;width:100%;font-size:15px;line-height:20px;padding:0 52px}.pagination .stretch{padding:0 4px}.pagination .stretch .title{font-size:18px;padding:11px 0 13px}.admonition{margin:20px -16px 0;padding:20px 16px}.admonition>:first-child{margin-top:0}.admonition .admonition-title{font-size:20px}.admonition .admonition-title:before{content:"\e611";display:block;float:left;font-family:Icon;font-size:24px;vertical-align:-.1em;margin-right:5px}.admonition.warning .admonition-title:before{content:"\e610"}.article h3{font-weight:700}.article h4{font-weight:400;font-style:italic}.article h2 a,.article h3 a,.article h4 a,.article h5 a,.article h6 a{font-weight:400;font-style:normal}.bar{-webkit-transform:translateZ(0);transform:translateZ(0);-webkit-transition:opacity .2s cubic-bezier(.75,0,.25,1),-webkit-transform .4s cubic-bezier(.75,0,.25,1);transition:opacity .2s 
cubic-bezier(.75,0,.25,1),-webkit-transform .4s cubic-bezier(.75,0,.25,1);transition:opacity .2s cubic-bezier(.75,0,.25,1),transform .4s cubic-bezier(.75,0,.25,1);transition:opacity .2s cubic-bezier(.75,0,.25,1),transform .4s cubic-bezier(.75,0,.25,1),-webkit-transform .4s cubic-bezier(.75,0,.25,1)}#toggle-search:checked~.header .bar,.toggle-search .bar{-webkit-transform:translate3d(0,-56px,0);transform:translate3d(0,-56px,0)}.bar.search .button-reset{-webkit-transform:scale(.5);transform:scale(.5);-webkit-transition:opacity .4s cubic-bezier(.1,.7,.1,1),-webkit-transform .4s cubic-bezier(.1,.7,.1,1);transition:opacity .4s cubic-bezier(.1,.7,.1,1),-webkit-transform .4s cubic-bezier(.1,.7,.1,1);transition:opacity .4s cubic-bezier(.1,.7,.1,1),transform .4s cubic-bezier(.1,.7,.1,1);transition:opacity .4s cubic-bezier(.1,.7,.1,1),transform .4s cubic-bezier(.1,.7,.1,1),-webkit-transform .4s cubic-bezier(.1,.7,.1,1);opacity:0}.bar.search.non-empty .button-reset{-webkit-transform:scale(1);transform:scale(1);opacity:1}.results{-webkit-transition:opacity .3s .1s,width 0s .4s,height 0s .4s;transition:opacity .3s .1s,width 0s .4s,height 0s .4s}#toggle-search:checked~.main .results,.toggle-search .results{-webkit-transition:opacity .4s,width 0s,height 0s;transition:opacity .4s,width 0s,height 0s}.results .list a{-webkit-transition:background .25s;transition:background .25s}.no-csstransforms3d .bar.default{display:table}.no-csstransforms3d .bar.search{display:none;margin-top:0}.no-csstransforms3d #toggle-search:checked~.header .bar.default,.no-csstransforms3d .toggle-search .bar.default{display:none}.no-csstransforms3d #toggle-search:checked~.header .bar.search,.no-csstransforms3d .toggle-search .bar.search{display:table}.bar.search{opacity:0}.bar.search .query{background:transparent;color:rgba(0,0,0,.87)}.bar.search .query::-webkit-input-placeholder{color:rgba(0,0,0,.26)}.bar.search .query:-moz-placeholder,.bar.search .query::-moz-placeholder{color:rgba(0,0,0,.26)}.bar.search 
.query:-ms-input-placeholder{color:rgba(0,0,0,.26)}.bar.search .button .icon:active{background:rgba(0,0,0,.12)}.results{box-shadow:0 4px 7px rgba(0,0,0,.23),0 8px 25px rgba(0,0,0,.05);background:#fff;color:rgba(0,0,0,.87);opacity:0}#toggle-search:checked~.main .results,.toggle-search .results{opacity:1}.results .meta{background:#e84e40;color:#fff}.results .list a{border-bottom:1px solid rgba(0,0,0,.12)}.results .list a:last-child{border-bottom:none}.results .list a:active{background:rgba(0,0,0,.12)}.result span{color:rgba(0,0,0,.54)}#toggle-search:checked~.header,.toggle-search .header{background:#fff;color:rgba(0,0,0,.54)}#toggle-search:checked~.header:before,.toggle-search .header:before{background:rgba(0,0,0,.54)}#toggle-search:checked~.header .bar.default,.toggle-search .header .bar.default{opacity:0}#toggle-search:checked~.header .bar.search,.toggle-search .header .bar.search{opacity:1}.bar.search{margin-top:8px}.bar.search .query{font-size:18px;padding:13px 0;margin:0;width:100%;height:48px}.bar.search .query::-ms-clear{display:none}.results{position:fixed;top:0;left:0;width:0;height:100%;z-index:1;overflow-y:scroll;-webkit-overflow-scrolling:touch}.results .scrollable{top:56px}#toggle-search:checked~.main .results,.toggle-search .results{width:100%;overflow-y:visible}.results .meta{font-weight:700}.results .meta strong{display:block;font-size:11px;max-width:1200px;margin-left:auto;margin-right:auto;padding:16px}.results .list a{display:block}.result{max-width:1200px;margin-left:auto;margin-right:auto;padding:12px 16px 16px}.result h1{line-height:24px}.result h1,.result span{text-overflow:ellipsis;white-space:nowrap;overflow:hidden}.result span{font-size:12px}.no-csstransforms3d .results{display:none}.no-csstransforms3d #toggle-search:checked~.main .results,.no-csstransforms3d .toggle-search .results{display:block;overflow:auto}.meta{text-transform:uppercase;font-weight:700}@media only screen and 
(min-width:960px){.backdrop{background:#f2f2f2}.backdrop-paper:after{box-shadow:0 1.5px 3px rgba(0,0,0,.24),0 3px 8px rgba(0,0,0,.05)}.button-menu{display:none}.drawer{float:left;height:auto;margin-bottom:96px;padding-top:80px}.drawer,.drawer .scrollable{position:static}.article{margin-left:262px}.footer{z-index:4}.copyright{margin-bottom:64px}.results{height:auto;top:64px}.results .scrollable{position:static;max-height:413px}}@media only screen and (max-width:959px){#toggle-drawer:checked~.overlay,.toggle-drawer .overlay{width:100%;height:100%}.drawer{-webkit-transform:translate3d(-262px,0,0);transform:translate3d(-262px,0,0);-webkit-transition:-webkit-transform .25s cubic-bezier(.4,0,.2,1);transition:-webkit-transform .25s cubic-bezier(.4,0,.2,1);transition:transform .25s cubic-bezier(.4,0,.2,1);transition:transform .25s cubic-bezier(.4,0,.2,1),-webkit-transform .25s cubic-bezier(.4,0,.2,1)}.no-csstransforms3d .drawer{display:none}.drawer{background:#fff}.project{box-shadow:0 1.5px 3px rgba(0,0,0,.24),0 3px 8px rgba(0,0,0,.05);background:#e84e40;color:#fff}.drawer{position:fixed;z-index:4}#toggle-search:checked~.main .results,.drawer,.toggle-search .results{height:100%}}@media only screen and (min-width:720px){.header{height:64px;padding:8px}.header .stretch{padding:0 16px}.header .stretch .title{font-size:20px;padding:12px 0}.project .name{margin:26px 0 0 5px}.article .wrapper{padding:128px 24px 96px}.article .data{margin:1.5em -24px}.article .data table{margin:0 24px}.article h2{padding-top:100px;margin-top:-64px}.ios.standalone .article h2{padding-top:28px;margin-top:8px}.article h3,.article h4{padding-top:84px;margin-top:-64px}.ios.standalone .article h3,.ios.standalone .article h4{padding-top:20px;margin-top:0}.article pre{padding:1.5em 24px;margin:1.5em -24px 0}.footer{padding:0 8px}.pagination{height:96px;padding:8px 0}.pagination .direction{padding:0 56px;bottom:40px}.pagination .stretch{padding:0 8px}.admonition{margin:20px -24px 0;padding:20px 
24px}.bar.search .query{font-size:20px;padding:12px 0}.results .scrollable{top:64px}.results .meta strong{padding:16px 24px}.result{padding:16px 24px 20px}}@media only screen and (min-width:1200px){.header{width:100%}.drawer .scrollable .wrapper hr{width:48px}}@media only screen and (orientation:portrait){.ios.standalone .header{height:76px;padding-top:24px}.ios.standalone .header:before{content:" ";position:absolute;top:0;left:0;z-index:3;width:100%;height:20px}.ios.standalone .drawer .scrollable{top:124px}.ios.standalone .project{padding-top:20px}.ios.standalone .project:before{content:" ";position:absolute;top:0;left:0;z-index:3;width:100%;height:20px}.ios.standalone .article{position:absolute;top:76px;right:0;bottom:0;left:0}.ios.standalone .results .scrollable{top:76px}}@media only screen and (orientation:portrait) and (min-width:720px){.ios.standalone .header{height:84px;padding-top:28px}.ios.standalone .results .scrollable{top:84px}}@media only screen and (max-width:719px){.bar .path{display:none}}@media only screen and (max-width:479px){.button-github,.button-twitter{display:none}}@media only screen and (min-width:720px) and (max-width:959px){.header .stretch{padding:0 24px}}@media only screen and (min-width:480px){.pagination .next,.pagination .previous{width:50%}.pagination .previous .direction{display:block}.pagination .previous .stretch{display:table}}@media print{.drawer,.footer,.header,.headerlink{display:none}.article .wrapper{padding-top:0}.article pre,.article pre *{color:rgba(0,0,0,.87)!important}.article pre{border:1px solid rgba(0,0,0,.12)}.article table{border-radius:none;box-shadow:none}.article table th{color:#e84e40}} \ No newline at end of file diff --git a/assets/stylesheets/palettes-05ab2406df.css b/assets/stylesheets/palettes-05ab2406df.css deleted file mode 100644 index ead0d84..0000000 --- a/assets/stylesheets/palettes-05ab2406df.css +++ /dev/null @@ -1 +0,0 @@ -@supports 
(-webkit-appearance:none){.palette-primary-red{background:#e84e40}}.palette-primary-red .footer,.palette-primary-red .header{background:#e84e40}.palette-primary-red .drawer .toc a.current,.palette-primary-red .drawer .toc a:focus,.palette-primary-red .drawer .toc a:hover{color:#e84e40}.palette-primary-red .drawer .anchor a{border-left:2px solid #e84e40}.ios.standalone .palette-primary-red .article{background:-webkit-linear-gradient(top,#fff 50%,#e84e40 0);background:linear-gradient(180deg,#fff 50%,#e84e40 0)}.palette-primary-red .article a,.palette-primary-red .article code,.palette-primary-red .article h1,.palette-primary-red .article h2{color:#e84e40}.palette-primary-red .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-red .article table th{background:#ee7a70}.palette-primary-red .results .meta{background:#e84e40}@supports (-webkit-appearance:none){.palette-primary-pink{background:#e91e63}}.palette-primary-pink .footer,.palette-primary-pink .header{background:#e91e63}.palette-primary-pink .drawer .toc a.current,.palette-primary-pink .drawer .toc a:focus,.palette-primary-pink .drawer .toc a:hover{color:#e91e63}.palette-primary-pink .drawer .anchor a{border-left:2px solid #e91e63}.ios.standalone .palette-primary-pink .article{background:-webkit-linear-gradient(top,#fff 50%,#e91e63 0);background:linear-gradient(180deg,#fff 50%,#e91e63 0)}.palette-primary-pink .article a,.palette-primary-pink .article code,.palette-primary-pink .article h1,.palette-primary-pink .article h2{color:#e91e63}.palette-primary-pink .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-pink .article table th{background:#ef568a}.palette-primary-pink .results .meta{background:#e91e63}@supports (-webkit-appearance:none){.palette-primary-purple{background:#ab47bc}}.palette-primary-purple .footer,.palette-primary-purple .header{background:#ab47bc}.palette-primary-purple .drawer .toc a.current,.palette-primary-purple .drawer .toc a:focus,.palette-primary-purple .drawer .toc 
a:hover{color:#ab47bc}.palette-primary-purple .drawer .anchor a{border-left:2px solid #ab47bc}.ios.standalone .palette-primary-purple .article{background:-webkit-linear-gradient(top,#fff 50%,#ab47bc 0);background:linear-gradient(180deg,#fff 50%,#ab47bc 0)}.palette-primary-purple .article a,.palette-primary-purple .article code,.palette-primary-purple .article h1,.palette-primary-purple .article h2{color:#ab47bc}.palette-primary-purple .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-purple .article table th{background:#c075cd}.palette-primary-purple .results .meta{background:#ab47bc}@supports (-webkit-appearance:none){.palette-primary-deep-purple{background:#7e57c2}}.palette-primary-deep-purple .footer,.palette-primary-deep-purple .header{background:#7e57c2}.palette-primary-deep-purple .drawer .toc a.current,.palette-primary-deep-purple .drawer .toc a:focus,.palette-primary-deep-purple .drawer .toc a:hover{color:#7e57c2}.palette-primary-deep-purple .drawer .anchor a{border-left:2px solid #7e57c2}.ios.standalone .palette-primary-deep-purple .article{background:-webkit-linear-gradient(top,#fff 50%,#7e57c2 0);background:linear-gradient(180deg,#fff 50%,#7e57c2 0)}.palette-primary-deep-purple .article a,.palette-primary-deep-purple .article code,.palette-primary-deep-purple .article h1,.palette-primary-deep-purple .article h2{color:#7e57c2}.palette-primary-deep-purple .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-deep-purple .article table th{background:#9e81d1}.palette-primary-deep-purple .results .meta{background:#7e57c2}@supports (-webkit-appearance:none){.palette-primary-indigo{background:#3f51b5}}.palette-primary-indigo .footer,.palette-primary-indigo .header{background:#3f51b5}.palette-primary-indigo .drawer .toc a.current,.palette-primary-indigo .drawer .toc a:focus,.palette-primary-indigo .drawer .toc a:hover{color:#3f51b5}.palette-primary-indigo .drawer .anchor a{border-left:2px solid #3f51b5}.ios.standalone .palette-primary-indigo 
.article{background:-webkit-linear-gradient(top,#fff 50%,#3f51b5 0);background:linear-gradient(180deg,#fff 50%,#3f51b5 0)}.palette-primary-indigo .article a,.palette-primary-indigo .article code,.palette-primary-indigo .article h1,.palette-primary-indigo .article h2{color:#3f51b5}.palette-primary-indigo .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-indigo .article table th{background:#6f7dc8}.palette-primary-indigo .results .meta{background:#3f51b5}@supports (-webkit-appearance:none){.palette-primary-blue{background:#5677fc}}.palette-primary-blue .footer,.palette-primary-blue .header{background:#5677fc}.palette-primary-blue .drawer .toc a.current,.palette-primary-blue .drawer .toc a:focus,.palette-primary-blue .drawer .toc a:hover{color:#5677fc}.palette-primary-blue .drawer .anchor a{border-left:2px solid #5677fc}.ios.standalone .palette-primary-blue .article{background:-webkit-linear-gradient(top,#fff 50%,#5677fc 0);background:linear-gradient(180deg,#fff 50%,#5677fc 0)}.palette-primary-blue .article a,.palette-primary-blue .article code,.palette-primary-blue .article h1,.palette-primary-blue .article h2{color:#5677fc}.palette-primary-blue .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-blue .article table th{background:#8099fd}.palette-primary-blue .results .meta{background:#5677fc}@supports (-webkit-appearance:none){.palette-primary-light-blue{background:#03a9f4}}.palette-primary-light-blue .footer,.palette-primary-light-blue .header{background:#03a9f4}.palette-primary-light-blue .drawer .toc a.current,.palette-primary-light-blue .drawer .toc a:focus,.palette-primary-light-blue .drawer .toc a:hover{color:#03a9f4}.palette-primary-light-blue .drawer .anchor a{border-left:2px solid #03a9f4}.ios.standalone .palette-primary-light-blue .article{background:-webkit-linear-gradient(top,#fff 50%,#03a9f4 0);background:linear-gradient(180deg,#fff 50%,#03a9f4 0)}.palette-primary-light-blue .article a,.palette-primary-light-blue .article 
code,.palette-primary-light-blue .article h1,.palette-primary-light-blue .article h2{color:#03a9f4}.palette-primary-light-blue .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-light-blue .article table th{background:#42bff7}.palette-primary-light-blue .results .meta{background:#03a9f4}@supports (-webkit-appearance:none){.palette-primary-cyan{background:#00bcd4}}.palette-primary-cyan .footer,.palette-primary-cyan .header{background:#00bcd4}.palette-primary-cyan .drawer .toc a.current,.palette-primary-cyan .drawer .toc a:focus,.palette-primary-cyan .drawer .toc a:hover{color:#00bcd4}.palette-primary-cyan .drawer .anchor a{border-left:2px solid #00bcd4}.ios.standalone .palette-primary-cyan .article{background:-webkit-linear-gradient(top,#fff 50%,#00bcd4 0);background:linear-gradient(180deg,#fff 50%,#00bcd4 0)}.palette-primary-cyan .article a,.palette-primary-cyan .article code,.palette-primary-cyan .article h1,.palette-primary-cyan .article h2{color:#00bcd4}.palette-primary-cyan .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-cyan .article table th{background:#40cddf}.palette-primary-cyan .results .meta{background:#00bcd4}@supports (-webkit-appearance:none){.palette-primary-teal{background:#009688}}.palette-primary-teal .footer,.palette-primary-teal .header{background:#009688}.palette-primary-teal .drawer .toc a.current,.palette-primary-teal .drawer .toc a:focus,.palette-primary-teal .drawer .toc a:hover{color:#009688}.palette-primary-teal .drawer .anchor a{border-left:2px solid #009688}.ios.standalone .palette-primary-teal .article{background:-webkit-linear-gradient(top,#fff 50%,#009688 0);background:linear-gradient(180deg,#fff 50%,#009688 0)}.palette-primary-teal .article a,.palette-primary-teal .article code,.palette-primary-teal .article h1,.palette-primary-teal .article h2{color:#009688}.palette-primary-teal .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-teal .article table th{background:#40b0a6}.palette-primary-teal .results 
.meta{background:#009688}@supports (-webkit-appearance:none){.palette-primary-green{background:#259b24}}.palette-primary-green .footer,.palette-primary-green .header{background:#259b24}.palette-primary-green .drawer .toc a.current,.palette-primary-green .drawer .toc a:focus,.palette-primary-green .drawer .toc a:hover{color:#259b24}.palette-primary-green .drawer .anchor a{border-left:2px solid #259b24}.ios.standalone .palette-primary-green .article{background:-webkit-linear-gradient(top,#fff 50%,#259b24 0);background:linear-gradient(180deg,#fff 50%,#259b24 0)}.palette-primary-green .article a,.palette-primary-green .article code,.palette-primary-green .article h1,.palette-primary-green .article h2{color:#259b24}.palette-primary-green .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-green .article table th{background:#5cb45b}.palette-primary-green .results .meta{background:#259b24}@supports (-webkit-appearance:none){.palette-primary-light-green{background:#7cb342}}.palette-primary-light-green .footer,.palette-primary-light-green .header{background:#7cb342}.palette-primary-light-green .drawer .toc a.current,.palette-primary-light-green .drawer .toc a:focus,.palette-primary-light-green .drawer .toc a:hover{color:#7cb342}.palette-primary-light-green .drawer .anchor a{border-left:2px solid #7cb342}.ios.standalone .palette-primary-light-green .article{background:-webkit-linear-gradient(top,#fff 50%,#7cb342 0);background:linear-gradient(180deg,#fff 50%,#7cb342 0)}.palette-primary-light-green .article a,.palette-primary-light-green .article code,.palette-primary-light-green .article h1,.palette-primary-light-green .article h2{color:#7cb342}.palette-primary-light-green .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-light-green .article table th{background:#9dc671}.palette-primary-light-green .results .meta{background:#7cb342}@supports (-webkit-appearance:none){.palette-primary-lime{background:#c0ca33}}.palette-primary-lime 
.footer,.palette-primary-lime .header{background:#c0ca33}.palette-primary-lime .drawer .toc a.current,.palette-primary-lime .drawer .toc a:focus,.palette-primary-lime .drawer .toc a:hover{color:#c0ca33}.palette-primary-lime .drawer .anchor a{border-left:2px solid #c0ca33}.ios.standalone .palette-primary-lime .article{background:-webkit-linear-gradient(top,#fff 50%,#c0ca33 0);background:linear-gradient(180deg,#fff 50%,#c0ca33 0)}.palette-primary-lime .article a,.palette-primary-lime .article code,.palette-primary-lime .article h1,.palette-primary-lime .article h2{color:#c0ca33}.palette-primary-lime .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-lime .article table th{background:#d0d766}.palette-primary-lime .results .meta{background:#c0ca33}@supports (-webkit-appearance:none){.palette-primary-yellow{background:#f9a825}}.palette-primary-yellow .footer,.palette-primary-yellow .header{background:#f9a825}.palette-primary-yellow .drawer .toc a.current,.palette-primary-yellow .drawer .toc a:focus,.palette-primary-yellow .drawer .toc a:hover{color:#f9a825}.palette-primary-yellow .drawer .anchor a{border-left:2px solid #f9a825}.ios.standalone .palette-primary-yellow .article{background:-webkit-linear-gradient(top,#fff 50%,#f9a825 0);background:linear-gradient(180deg,#fff 50%,#f9a825 0)}.palette-primary-yellow .article a,.palette-primary-yellow .article code,.palette-primary-yellow .article h1,.palette-primary-yellow .article h2{color:#f9a825}.palette-primary-yellow .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-yellow .article table th{background:#fbbe5c}.palette-primary-yellow .results .meta{background:#f9a825}@supports (-webkit-appearance:none){.palette-primary-amber{background:#ffb300}}.palette-primary-amber .footer,.palette-primary-amber .header{background:#ffb300}.palette-primary-amber .drawer .toc a.current,.palette-primary-amber .drawer .toc a:focus,.palette-primary-amber .drawer .toc a:hover{color:#ffb300}.palette-primary-amber .drawer 
.anchor a{border-left:2px solid #ffb300}.ios.standalone .palette-primary-amber .article{background:-webkit-linear-gradient(top,#fff 50%,#ffb300 0);background:linear-gradient(180deg,#fff 50%,#ffb300 0)}.palette-primary-amber .article a,.palette-primary-amber .article code,.palette-primary-amber .article h1,.palette-primary-amber .article h2{color:#ffb300}.palette-primary-amber .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-amber .article table th{background:#ffc640}.palette-primary-amber .results .meta{background:#ffb300}@supports (-webkit-appearance:none){.palette-primary-orange{background:#fb8c00}}.palette-primary-orange .footer,.palette-primary-orange .header{background:#fb8c00}.palette-primary-orange .drawer .toc a.current,.palette-primary-orange .drawer .toc a:focus,.palette-primary-orange .drawer .toc a:hover{color:#fb8c00}.palette-primary-orange .drawer .anchor a{border-left:2px solid #fb8c00}.ios.standalone .palette-primary-orange .article{background:-webkit-linear-gradient(top,#fff 50%,#fb8c00 0);background:linear-gradient(180deg,#fff 50%,#fb8c00 0)}.palette-primary-orange .article a,.palette-primary-orange .article code,.palette-primary-orange .article h1,.palette-primary-orange .article h2{color:#fb8c00}.palette-primary-orange .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-orange .article table th{background:#fca940}.palette-primary-orange .results .meta{background:#fb8c00}@supports (-webkit-appearance:none){.palette-primary-deep-orange{background:#ff7043}}.palette-primary-deep-orange .footer,.palette-primary-deep-orange .header{background:#ff7043}.palette-primary-deep-orange .drawer .toc a.current,.palette-primary-deep-orange .drawer .toc a:focus,.palette-primary-deep-orange .drawer .toc a:hover{color:#ff7043}.palette-primary-deep-orange .drawer .anchor a{border-left:2px solid #ff7043}.ios.standalone .palette-primary-deep-orange .article{background:-webkit-linear-gradient(top,#fff 50%,#ff7043 
0);background:linear-gradient(180deg,#fff 50%,#ff7043 0)}.palette-primary-deep-orange .article a,.palette-primary-deep-orange .article code,.palette-primary-deep-orange .article h1,.palette-primary-deep-orange .article h2{color:#ff7043}.palette-primary-deep-orange .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-deep-orange .article table th{background:#ff9472}.palette-primary-deep-orange .results .meta{background:#ff7043}@supports (-webkit-appearance:none){.palette-primary-brown{background:#795548}}.palette-primary-brown .footer,.palette-primary-brown .header{background:#795548}.palette-primary-brown .drawer .toc a.current,.palette-primary-brown .drawer .toc a:focus,.palette-primary-brown .drawer .toc a:hover{color:#795548}.palette-primary-brown .drawer .anchor a{border-left:2px solid #795548}.ios.standalone .palette-primary-brown .article{background:-webkit-linear-gradient(top,#fff 50%,#795548 0);background:linear-gradient(180deg,#fff 50%,#795548 0)}.palette-primary-brown .article a,.palette-primary-brown .article code,.palette-primary-brown .article h1,.palette-primary-brown .article h2{color:#795548}.palette-primary-brown .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-brown .article table th{background:#9b8076}.palette-primary-brown .results .meta{background:#795548}@supports (-webkit-appearance:none){.palette-primary-grey{background:#757575}}.palette-primary-grey .footer,.palette-primary-grey .header{background:#757575}.palette-primary-grey .drawer .toc a.current,.palette-primary-grey .drawer .toc a:focus,.palette-primary-grey .drawer .toc a:hover{color:#757575}.palette-primary-grey .drawer .anchor a{border-left:2px solid #757575}.ios.standalone .palette-primary-grey .article{background:-webkit-linear-gradient(top,#fff 50%,#757575 0);background:linear-gradient(180deg,#fff 50%,#757575 0)}.palette-primary-grey .article a,.palette-primary-grey .article code,.palette-primary-grey .article h1,.palette-primary-grey .article 
h2{color:#757575}.palette-primary-grey .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-grey .article table th{background:#989898}.palette-primary-grey .results .meta{background:#757575}@supports (-webkit-appearance:none){.palette-primary-blue-grey{background:#546e7a}}.palette-primary-blue-grey .footer,.palette-primary-blue-grey .header{background:#546e7a}.palette-primary-blue-grey .drawer .toc a.current,.palette-primary-blue-grey .drawer .toc a:focus,.palette-primary-blue-grey .drawer .toc a:hover{color:#546e7a}.palette-primary-blue-grey .drawer .anchor a{border-left:2px solid #546e7a}.ios.standalone .palette-primary-blue-grey .article{background:-webkit-linear-gradient(top,#fff 50%,#546e7a 0);background:linear-gradient(180deg,#fff 50%,#546e7a 0)}.palette-primary-blue-grey .article a,.palette-primary-blue-grey .article code,.palette-primary-blue-grey .article h1,.palette-primary-blue-grey .article h2{color:#546e7a}.palette-primary-blue-grey .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-blue-grey .article table th{background:#7f929b}.palette-primary-blue-grey .results .meta{background:#546e7a}.palette-accent-red .article a:focus,.palette-accent-red .article a:hover{color:#ff2d6f}.palette-accent-red .repo a{background:#ff2d6f}.palette-accent-pink .article a:focus,.palette-accent-pink .article a:hover{color:#f50057}.palette-accent-pink .repo a{background:#f50057}.palette-accent-purple .article a:focus,.palette-accent-purple .article a:hover{color:#e040fb}.palette-accent-purple .repo a{background:#e040fb}.palette-accent-deep-purple .article a:focus,.palette-accent-deep-purple .article a:hover{color:#7c4dff}.palette-accent-deep-purple .repo a{background:#7c4dff}.palette-accent-indigo .article a:focus,.palette-accent-indigo .article a:hover{color:#536dfe}.palette-accent-indigo .repo a{background:#536dfe}.palette-accent-blue .article a:focus,.palette-accent-blue .article a:hover{color:#6889ff}.palette-accent-blue .repo 
a{background:#6889ff}.palette-accent-light-blue .article a:focus,.palette-accent-light-blue .article a:hover{color:#0091ea}.palette-accent-light-blue .repo a{background:#0091ea}.palette-accent-cyan .article a:focus,.palette-accent-cyan .article a:hover{color:#00b8d4}.palette-accent-cyan .repo a{background:#00b8d4}.palette-accent-teal .article a:focus,.palette-accent-teal .article a:hover{color:#00bfa5}.palette-accent-teal .repo a{background:#00bfa5}.palette-accent-green .article a:focus,.palette-accent-green .article a:hover{color:#12c700}.palette-accent-green .repo a{background:#12c700}.palette-accent-light-green .article a:focus,.palette-accent-light-green .article a:hover{color:#64dd17}.palette-accent-light-green .repo a{background:#64dd17}.palette-accent-lime .article a:focus,.palette-accent-lime .article a:hover{color:#aeea00}.palette-accent-lime .repo a{background:#aeea00}.palette-accent-yellow .article a:focus,.palette-accent-yellow .article a:hover{color:#ffd600}.palette-accent-yellow .repo a{background:#ffd600}.palette-accent-amber .article a:focus,.palette-accent-amber .article a:hover{color:#ffab00}.palette-accent-amber .repo a{background:#ffab00}.palette-accent-orange .article a:focus,.palette-accent-orange .article a:hover{color:#ff9100}.palette-accent-orange .repo a{background:#ff9100}.palette-accent-deep-orange .article a:focus,.palette-accent-deep-orange .article a:hover{color:#ff6e40}.palette-accent-deep-orange .repo a{background:#ff6e40}@media only screen and (max-width:959px){.palette-primary-red .project{background:#e84e40}.palette-primary-pink .project{background:#e91e63}.palette-primary-purple .project{background:#ab47bc}.palette-primary-deep-purple .project{background:#7e57c2}.palette-primary-indigo .project{background:#3f51b5}.palette-primary-blue .project{background:#5677fc}.palette-primary-light-blue .project{background:#03a9f4}.palette-primary-cyan .project{background:#00bcd4}.palette-primary-teal 
.project{background:#009688}.palette-primary-green .project{background:#259b24}.palette-primary-light-green .project{background:#7cb342}.palette-primary-lime .project{background:#c0ca33}.palette-primary-yellow .project{background:#f9a825}.palette-primary-amber .project{background:#ffb300}.palette-primary-orange .project{background:#fb8c00}.palette-primary-deep-orange .project{background:#ff7043}.palette-primary-brown .project{background:#795548}.palette-primary-grey .project{background:#757575}.palette-primary-blue-grey .project{background:#546e7a}} \ No newline at end of file diff --git a/assets/stylesheets/palettes.css b/assets/stylesheets/palettes.css deleted file mode 100644 index ead0d84..0000000 --- a/assets/stylesheets/palettes.css +++ /dev/null @@ -1 +0,0 @@ -@supports (-webkit-appearance:none){.palette-primary-red{background:#e84e40}}.palette-primary-red .footer,.palette-primary-red .header{background:#e84e40}.palette-primary-red .drawer .toc a.current,.palette-primary-red .drawer .toc a:focus,.palette-primary-red .drawer .toc a:hover{color:#e84e40}.palette-primary-red .drawer .anchor a{border-left:2px solid #e84e40}.ios.standalone .palette-primary-red .article{background:-webkit-linear-gradient(top,#fff 50%,#e84e40 0);background:linear-gradient(180deg,#fff 50%,#e84e40 0)}.palette-primary-red .article a,.palette-primary-red .article code,.palette-primary-red .article h1,.palette-primary-red .article h2{color:#e84e40}.palette-primary-red .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-red .article table th{background:#ee7a70}.palette-primary-red .results .meta{background:#e84e40}@supports (-webkit-appearance:none){.palette-primary-pink{background:#e91e63}}.palette-primary-pink .footer,.palette-primary-pink .header{background:#e91e63}.palette-primary-pink .drawer .toc a.current,.palette-primary-pink .drawer .toc a:focus,.palette-primary-pink .drawer .toc a:hover{color:#e91e63}.palette-primary-pink .drawer .anchor a{border-left:2px solid 
#e91e63}.ios.standalone .palette-primary-pink .article{background:-webkit-linear-gradient(top,#fff 50%,#e91e63 0);background:linear-gradient(180deg,#fff 50%,#e91e63 0)}.palette-primary-pink .article a,.palette-primary-pink .article code,.palette-primary-pink .article h1,.palette-primary-pink .article h2{color:#e91e63}.palette-primary-pink .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-pink .article table th{background:#ef568a}.palette-primary-pink .results .meta{background:#e91e63}@supports (-webkit-appearance:none){.palette-primary-purple{background:#ab47bc}}.palette-primary-purple .footer,.palette-primary-purple .header{background:#ab47bc}.palette-primary-purple .drawer .toc a.current,.palette-primary-purple .drawer .toc a:focus,.palette-primary-purple .drawer .toc a:hover{color:#ab47bc}.palette-primary-purple .drawer .anchor a{border-left:2px solid #ab47bc}.ios.standalone .palette-primary-purple .article{background:-webkit-linear-gradient(top,#fff 50%,#ab47bc 0);background:linear-gradient(180deg,#fff 50%,#ab47bc 0)}.palette-primary-purple .article a,.palette-primary-purple .article code,.palette-primary-purple .article h1,.palette-primary-purple .article h2{color:#ab47bc}.palette-primary-purple .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-purple .article table th{background:#c075cd}.palette-primary-purple .results .meta{background:#ab47bc}@supports (-webkit-appearance:none){.palette-primary-deep-purple{background:#7e57c2}}.palette-primary-deep-purple .footer,.palette-primary-deep-purple .header{background:#7e57c2}.palette-primary-deep-purple .drawer .toc a.current,.palette-primary-deep-purple .drawer .toc a:focus,.palette-primary-deep-purple .drawer .toc a:hover{color:#7e57c2}.palette-primary-deep-purple .drawer .anchor a{border-left:2px solid #7e57c2}.ios.standalone .palette-primary-deep-purple .article{background:-webkit-linear-gradient(top,#fff 50%,#7e57c2 0);background:linear-gradient(180deg,#fff 50%,#7e57c2 
0)}.palette-primary-deep-purple .article a,.palette-primary-deep-purple .article code,.palette-primary-deep-purple .article h1,.palette-primary-deep-purple .article h2{color:#7e57c2}.palette-primary-deep-purple .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-deep-purple .article table th{background:#9e81d1}.palette-primary-deep-purple .results .meta{background:#7e57c2}@supports (-webkit-appearance:none){.palette-primary-indigo{background:#3f51b5}}.palette-primary-indigo .footer,.palette-primary-indigo .header{background:#3f51b5}.palette-primary-indigo .drawer .toc a.current,.palette-primary-indigo .drawer .toc a:focus,.palette-primary-indigo .drawer .toc a:hover{color:#3f51b5}.palette-primary-indigo .drawer .anchor a{border-left:2px solid #3f51b5}.ios.standalone .palette-primary-indigo .article{background:-webkit-linear-gradient(top,#fff 50%,#3f51b5 0);background:linear-gradient(180deg,#fff 50%,#3f51b5 0)}.palette-primary-indigo .article a,.palette-primary-indigo .article code,.palette-primary-indigo .article h1,.palette-primary-indigo .article h2{color:#3f51b5}.palette-primary-indigo .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-indigo .article table th{background:#6f7dc8}.palette-primary-indigo .results .meta{background:#3f51b5}@supports (-webkit-appearance:none){.palette-primary-blue{background:#5677fc}}.palette-primary-blue .footer,.palette-primary-blue .header{background:#5677fc}.palette-primary-blue .drawer .toc a.current,.palette-primary-blue .drawer .toc a:focus,.palette-primary-blue .drawer .toc a:hover{color:#5677fc}.palette-primary-blue .drawer .anchor a{border-left:2px solid #5677fc}.ios.standalone .palette-primary-blue .article{background:-webkit-linear-gradient(top,#fff 50%,#5677fc 0);background:linear-gradient(180deg,#fff 50%,#5677fc 0)}.palette-primary-blue .article a,.palette-primary-blue .article code,.palette-primary-blue .article h1,.palette-primary-blue .article h2{color:#5677fc}.palette-primary-blue .article 
.headerlink{color:rgba(0,0,0,.26)}.palette-primary-blue .article table th{background:#8099fd}.palette-primary-blue .results .meta{background:#5677fc}@supports (-webkit-appearance:none){.palette-primary-light-blue{background:#03a9f4}}.palette-primary-light-blue .footer,.palette-primary-light-blue .header{background:#03a9f4}.palette-primary-light-blue .drawer .toc a.current,.palette-primary-light-blue .drawer .toc a:focus,.palette-primary-light-blue .drawer .toc a:hover{color:#03a9f4}.palette-primary-light-blue .drawer .anchor a{border-left:2px solid #03a9f4}.ios.standalone .palette-primary-light-blue .article{background:-webkit-linear-gradient(top,#fff 50%,#03a9f4 0);background:linear-gradient(180deg,#fff 50%,#03a9f4 0)}.palette-primary-light-blue .article a,.palette-primary-light-blue .article code,.palette-primary-light-blue .article h1,.palette-primary-light-blue .article h2{color:#03a9f4}.palette-primary-light-blue .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-light-blue .article table th{background:#42bff7}.palette-primary-light-blue .results .meta{background:#03a9f4}@supports (-webkit-appearance:none){.palette-primary-cyan{background:#00bcd4}}.palette-primary-cyan .footer,.palette-primary-cyan .header{background:#00bcd4}.palette-primary-cyan .drawer .toc a.current,.palette-primary-cyan .drawer .toc a:focus,.palette-primary-cyan .drawer .toc a:hover{color:#00bcd4}.palette-primary-cyan .drawer .anchor a{border-left:2px solid #00bcd4}.ios.standalone .palette-primary-cyan .article{background:-webkit-linear-gradient(top,#fff 50%,#00bcd4 0);background:linear-gradient(180deg,#fff 50%,#00bcd4 0)}.palette-primary-cyan .article a,.palette-primary-cyan .article code,.palette-primary-cyan .article h1,.palette-primary-cyan .article h2{color:#00bcd4}.palette-primary-cyan .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-cyan .article table th{background:#40cddf}.palette-primary-cyan .results .meta{background:#00bcd4}@supports 
(-webkit-appearance:none){.palette-primary-teal{background:#009688}}.palette-primary-teal .footer,.palette-primary-teal .header{background:#009688}.palette-primary-teal .drawer .toc a.current,.palette-primary-teal .drawer .toc a:focus,.palette-primary-teal .drawer .toc a:hover{color:#009688}.palette-primary-teal .drawer .anchor a{border-left:2px solid #009688}.ios.standalone .palette-primary-teal .article{background:-webkit-linear-gradient(top,#fff 50%,#009688 0);background:linear-gradient(180deg,#fff 50%,#009688 0)}.palette-primary-teal .article a,.palette-primary-teal .article code,.palette-primary-teal .article h1,.palette-primary-teal .article h2{color:#009688}.palette-primary-teal .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-teal .article table th{background:#40b0a6}.palette-primary-teal .results .meta{background:#009688}@supports (-webkit-appearance:none){.palette-primary-green{background:#259b24}}.palette-primary-green .footer,.palette-primary-green .header{background:#259b24}.palette-primary-green .drawer .toc a.current,.palette-primary-green .drawer .toc a:focus,.palette-primary-green .drawer .toc a:hover{color:#259b24}.palette-primary-green .drawer .anchor a{border-left:2px solid #259b24}.ios.standalone .palette-primary-green .article{background:-webkit-linear-gradient(top,#fff 50%,#259b24 0);background:linear-gradient(180deg,#fff 50%,#259b24 0)}.palette-primary-green .article a,.palette-primary-green .article code,.palette-primary-green .article h1,.palette-primary-green .article h2{color:#259b24}.palette-primary-green .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-green .article table th{background:#5cb45b}.palette-primary-green .results .meta{background:#259b24}@supports (-webkit-appearance:none){.palette-primary-light-green{background:#7cb342}}.palette-primary-light-green .footer,.palette-primary-light-green .header{background:#7cb342}.palette-primary-light-green .drawer .toc a.current,.palette-primary-light-green .drawer 
.toc a:focus,.palette-primary-light-green .drawer .toc a:hover{color:#7cb342}.palette-primary-light-green .drawer .anchor a{border-left:2px solid #7cb342}.ios.standalone .palette-primary-light-green .article{background:-webkit-linear-gradient(top,#fff 50%,#7cb342 0);background:linear-gradient(180deg,#fff 50%,#7cb342 0)}.palette-primary-light-green .article a,.palette-primary-light-green .article code,.palette-primary-light-green .article h1,.palette-primary-light-green .article h2{color:#7cb342}.palette-primary-light-green .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-light-green .article table th{background:#9dc671}.palette-primary-light-green .results .meta{background:#7cb342}@supports (-webkit-appearance:none){.palette-primary-lime{background:#c0ca33}}.palette-primary-lime .footer,.palette-primary-lime .header{background:#c0ca33}.palette-primary-lime .drawer .toc a.current,.palette-primary-lime .drawer .toc a:focus,.palette-primary-lime .drawer .toc a:hover{color:#c0ca33}.palette-primary-lime .drawer .anchor a{border-left:2px solid #c0ca33}.ios.standalone .palette-primary-lime .article{background:-webkit-linear-gradient(top,#fff 50%,#c0ca33 0);background:linear-gradient(180deg,#fff 50%,#c0ca33 0)}.palette-primary-lime .article a,.palette-primary-lime .article code,.palette-primary-lime .article h1,.palette-primary-lime .article h2{color:#c0ca33}.palette-primary-lime .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-lime .article table th{background:#d0d766}.palette-primary-lime .results .meta{background:#c0ca33}@supports (-webkit-appearance:none){.palette-primary-yellow{background:#f9a825}}.palette-primary-yellow .footer,.palette-primary-yellow .header{background:#f9a825}.palette-primary-yellow .drawer .toc a.current,.palette-primary-yellow .drawer .toc a:focus,.palette-primary-yellow .drawer .toc a:hover{color:#f9a825}.palette-primary-yellow .drawer .anchor a{border-left:2px solid #f9a825}.ios.standalone .palette-primary-yellow 
.article{background:-webkit-linear-gradient(top,#fff 50%,#f9a825 0);background:linear-gradient(180deg,#fff 50%,#f9a825 0)}.palette-primary-yellow .article a,.palette-primary-yellow .article code,.palette-primary-yellow .article h1,.palette-primary-yellow .article h2{color:#f9a825}.palette-primary-yellow .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-yellow .article table th{background:#fbbe5c}.palette-primary-yellow .results .meta{background:#f9a825}@supports (-webkit-appearance:none){.palette-primary-amber{background:#ffb300}}.palette-primary-amber .footer,.palette-primary-amber .header{background:#ffb300}.palette-primary-amber .drawer .toc a.current,.palette-primary-amber .drawer .toc a:focus,.palette-primary-amber .drawer .toc a:hover{color:#ffb300}.palette-primary-amber .drawer .anchor a{border-left:2px solid #ffb300}.ios.standalone .palette-primary-amber .article{background:-webkit-linear-gradient(top,#fff 50%,#ffb300 0);background:linear-gradient(180deg,#fff 50%,#ffb300 0)}.palette-primary-amber .article a,.palette-primary-amber .article code,.palette-primary-amber .article h1,.palette-primary-amber .article h2{color:#ffb300}.palette-primary-amber .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-amber .article table th{background:#ffc640}.palette-primary-amber .results .meta{background:#ffb300}@supports (-webkit-appearance:none){.palette-primary-orange{background:#fb8c00}}.palette-primary-orange .footer,.palette-primary-orange .header{background:#fb8c00}.palette-primary-orange .drawer .toc a.current,.palette-primary-orange .drawer .toc a:focus,.palette-primary-orange .drawer .toc a:hover{color:#fb8c00}.palette-primary-orange .drawer .anchor a{border-left:2px solid #fb8c00}.ios.standalone .palette-primary-orange .article{background:-webkit-linear-gradient(top,#fff 50%,#fb8c00 0);background:linear-gradient(180deg,#fff 50%,#fb8c00 0)}.palette-primary-orange .article a,.palette-primary-orange .article code,.palette-primary-orange .article 
h1,.palette-primary-orange .article h2{color:#fb8c00}.palette-primary-orange .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-orange .article table th{background:#fca940}.palette-primary-orange .results .meta{background:#fb8c00}@supports (-webkit-appearance:none){.palette-primary-deep-orange{background:#ff7043}}.palette-primary-deep-orange .footer,.palette-primary-deep-orange .header{background:#ff7043}.palette-primary-deep-orange .drawer .toc a.current,.palette-primary-deep-orange .drawer .toc a:focus,.palette-primary-deep-orange .drawer .toc a:hover{color:#ff7043}.palette-primary-deep-orange .drawer .anchor a{border-left:2px solid #ff7043}.ios.standalone .palette-primary-deep-orange .article{background:-webkit-linear-gradient(top,#fff 50%,#ff7043 0);background:linear-gradient(180deg,#fff 50%,#ff7043 0)}.palette-primary-deep-orange .article a,.palette-primary-deep-orange .article code,.palette-primary-deep-orange .article h1,.palette-primary-deep-orange .article h2{color:#ff7043}.palette-primary-deep-orange .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-deep-orange .article table th{background:#ff9472}.palette-primary-deep-orange .results .meta{background:#ff7043}@supports (-webkit-appearance:none){.palette-primary-brown{background:#795548}}.palette-primary-brown .footer,.palette-primary-brown .header{background:#795548}.palette-primary-brown .drawer .toc a.current,.palette-primary-brown .drawer .toc a:focus,.palette-primary-brown .drawer .toc a:hover{color:#795548}.palette-primary-brown .drawer .anchor a{border-left:2px solid #795548}.ios.standalone .palette-primary-brown .article{background:-webkit-linear-gradient(top,#fff 50%,#795548 0);background:linear-gradient(180deg,#fff 50%,#795548 0)}.palette-primary-brown .article a,.palette-primary-brown .article code,.palette-primary-brown .article h1,.palette-primary-brown .article h2{color:#795548}.palette-primary-brown .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-brown .article 
table th{background:#9b8076}.palette-primary-brown .results .meta{background:#795548}@supports (-webkit-appearance:none){.palette-primary-grey{background:#757575}}.palette-primary-grey .footer,.palette-primary-grey .header{background:#757575}.palette-primary-grey .drawer .toc a.current,.palette-primary-grey .drawer .toc a:focus,.palette-primary-grey .drawer .toc a:hover{color:#757575}.palette-primary-grey .drawer .anchor a{border-left:2px solid #757575}.ios.standalone .palette-primary-grey .article{background:-webkit-linear-gradient(top,#fff 50%,#757575 0);background:linear-gradient(180deg,#fff 50%,#757575 0)}.palette-primary-grey .article a,.palette-primary-grey .article code,.palette-primary-grey .article h1,.palette-primary-grey .article h2{color:#757575}.palette-primary-grey .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-grey .article table th{background:#989898}.palette-primary-grey .results .meta{background:#757575}@supports (-webkit-appearance:none){.palette-primary-blue-grey{background:#546e7a}}.palette-primary-blue-grey .footer,.palette-primary-blue-grey .header{background:#546e7a}.palette-primary-blue-grey .drawer .toc a.current,.palette-primary-blue-grey .drawer .toc a:focus,.palette-primary-blue-grey .drawer .toc a:hover{color:#546e7a}.palette-primary-blue-grey .drawer .anchor a{border-left:2px solid #546e7a}.ios.standalone .palette-primary-blue-grey .article{background:-webkit-linear-gradient(top,#fff 50%,#546e7a 0);background:linear-gradient(180deg,#fff 50%,#546e7a 0)}.palette-primary-blue-grey .article a,.palette-primary-blue-grey .article code,.palette-primary-blue-grey .article h1,.palette-primary-blue-grey .article h2{color:#546e7a}.palette-primary-blue-grey .article .headerlink{color:rgba(0,0,0,.26)}.palette-primary-blue-grey .article table th{background:#7f929b}.palette-primary-blue-grey .results .meta{background:#546e7a}.palette-accent-red .article a:focus,.palette-accent-red .article a:hover{color:#ff2d6f}.palette-accent-red .repo 
a{background:#ff2d6f}.palette-accent-pink .article a:focus,.palette-accent-pink .article a:hover{color:#f50057}.palette-accent-pink .repo a{background:#f50057}.palette-accent-purple .article a:focus,.palette-accent-purple .article a:hover{color:#e040fb}.palette-accent-purple .repo a{background:#e040fb}.palette-accent-deep-purple .article a:focus,.palette-accent-deep-purple .article a:hover{color:#7c4dff}.palette-accent-deep-purple .repo a{background:#7c4dff}.palette-accent-indigo .article a:focus,.palette-accent-indigo .article a:hover{color:#536dfe}.palette-accent-indigo .repo a{background:#536dfe}.palette-accent-blue .article a:focus,.palette-accent-blue .article a:hover{color:#6889ff}.palette-accent-blue .repo a{background:#6889ff}.palette-accent-light-blue .article a:focus,.palette-accent-light-blue .article a:hover{color:#0091ea}.palette-accent-light-blue .repo a{background:#0091ea}.palette-accent-cyan .article a:focus,.palette-accent-cyan .article a:hover{color:#00b8d4}.palette-accent-cyan .repo a{background:#00b8d4}.palette-accent-teal .article a:focus,.palette-accent-teal .article a:hover{color:#00bfa5}.palette-accent-teal .repo a{background:#00bfa5}.palette-accent-green .article a:focus,.palette-accent-green .article a:hover{color:#12c700}.palette-accent-green .repo a{background:#12c700}.palette-accent-light-green .article a:focus,.palette-accent-light-green .article a:hover{color:#64dd17}.palette-accent-light-green .repo a{background:#64dd17}.palette-accent-lime .article a:focus,.palette-accent-lime .article a:hover{color:#aeea00}.palette-accent-lime .repo a{background:#aeea00}.palette-accent-yellow .article a:focus,.palette-accent-yellow .article a:hover{color:#ffd600}.palette-accent-yellow .repo a{background:#ffd600}.palette-accent-amber .article a:focus,.palette-accent-amber .article a:hover{color:#ffab00}.palette-accent-amber .repo a{background:#ffab00}.palette-accent-orange .article a:focus,.palette-accent-orange .article 
a:hover{color:#ff9100}.palette-accent-orange .repo a{background:#ff9100}.palette-accent-deep-orange .article a:focus,.palette-accent-deep-orange .article a:hover{color:#ff6e40}.palette-accent-deep-orange .repo a{background:#ff6e40}@media only screen and (max-width:959px){.palette-primary-red .project{background:#e84e40}.palette-primary-pink .project{background:#e91e63}.palette-primary-purple .project{background:#ab47bc}.palette-primary-deep-purple .project{background:#7e57c2}.palette-primary-indigo .project{background:#3f51b5}.palette-primary-blue .project{background:#5677fc}.palette-primary-light-blue .project{background:#03a9f4}.palette-primary-cyan .project{background:#00bcd4}.palette-primary-teal .project{background:#009688}.palette-primary-green .project{background:#259b24}.palette-primary-light-green .project{background:#7cb342}.palette-primary-lime .project{background:#c0ca33}.palette-primary-yellow .project{background:#f9a825}.palette-primary-amber .project{background:#ffb300}.palette-primary-orange .project{background:#fb8c00}.palette-primary-deep-orange .project{background:#ff7043}.palette-primary-brown .project{background:#795548}.palette-primary-grey .project{background:#757575}.palette-primary-blue-grey .project{background:#546e7a}} \ No newline at end of file diff --git a/before/call_etiquette/index.html b/before/call_etiquette/index.html deleted file mode 100644 index ee6398b..0000000 --- a/before/call_etiquette/index.html +++ /dev/null @@ -1,587 +0,0 @@ - - - - - - - - - - Call Etiquette - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Call Etiquette

- -

You've just joined an incident call, and you've never been on one before. You have no idea what's going on, or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of.

-

Obama phone -Credit: Official White House Photo by Pete Souza

-

First Steps#

-
    -
  • If you intend to participate in the incident call, you should join both the call and Slack.
  • -
  • Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum.
  • -
  • Keep your microphone muted until you have something to say.
  • -
  • Identify yourself when you join the call: state your name and the system you are the expert for.
  • -
  • Speak up and speak clearly.
  • -
  • Be direct and factual.
  • -
  • Keep conversations/discussions short and to the point.
  • -
  • Bring any concerns to the Incident Commander (IC) on the call.
  • -
  • Respect time constraints given by the Incident Commander.
  • -
-

Lingo#

-

Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.

-

Communication

-

Standard radio voice procedure does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in most active use on major incident calls are:

-
    -
  • Ack/Rog - "I have received and understood"
  • -
  • Say Again - "Repeat your last message"
  • -
  • Standby - "Please wait a moment for the next response"
  • -
  • Wilco - "Will comply"
  • -
-

Do not invent new abbreviations, and always favor being explicit over implicit. It is better to make things clearer than to try to save time by abbreviating, only to have a misunderstanding because others didn't know the abbreviation.

-

The Commander#

-

The Incident Commander (IC) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking.

-
    -
  • Follow all instructions from the incident commander, without exception.
  • -
  • Do not perform any actions unless the incident commander has told you to do so.
  • -
  • The commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them.
  • -
  • Once the commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll.
  • -
  • Answer any questions the commander asks you in a clear and concise way.
      -
    • Answering that you "don't know" something is perfectly acceptable. Do not try to guess.
    • -
    -
  • -
  • The commander may ask you to investigate something and get back to them in X minutes. Make sure you are ready with an answer within that time.
      -
    • Answering that you need more time is perfectly acceptable, but you need to give the commander an estimate of how much time.
    • -
    -
  • -
-

Problems?#

-

There's no incident commander on the call! I don't know what to do!#

-

Ask on the call if an IC is present. If you have no response, type !ic page in Slack. This will page the primary and backup IC to the call.

-

I can join the call or Slack, but not both, what should I do?#

-

You're welcome to join only one of the channels. However, if you do, you should not actively participate in the incident response, as it causes disjointed communication. Liaise with someone who is both in Slack and on the call, and give them any input you have so that they can raise it.

- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/before/different_roles/index.html b/before/different_roles/index.html deleted file mode 100644 index 18cf75b..0000000 --- a/before/different_roles/index.html +++ /dev/null @@ -1,666 +0,0 @@ - - - - - - - - - - Different Roles - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Different Roles

- -

There are several roles for our incident response teams at Spearhead Systems. Certain roles only have one person per incident (e.g. support engineer), whereas other roles can have multiple people (e.g. Sysadmins, Solution Architects, etc.). It's all about coming together as a team, working the problem, and getting a solution quickly.

-

Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page.

-

Incident Response Structure

-
-

Team Leader (TL)#

-

What is it?#

-

A Team Leader acts as the single source of truth for what is currently happening and what is going to happen during a major incident. They come in all shapes, sizes, and colors. TLs are also the key elements in a project (boards in DoIT).

-

Why have one?#

-

As any system grows in size and complexity, things break and cause incidents. The TL is needed to help drive major incidents to resolution by organizing their team towards a common goal.

-

What are the responsibilities?#

-
    -
  1. Help prepare for projects and incidents,
      -
    • Set up communications channels.
    • -
    • Create the DoIT board(s) and other project planning related materials.
    • -
    • Funnel people to these communications channels.
    • -
    • Train team members on how to communicate and train other TL's.
    • -
    -
  2. -
  3. Drive incidents and projects to resolution,
      -
    • Get everyone on the same communication channel.
    • -
    • Collect information from team members for their services/area of ownership status.
    • -
    • Collect proposed repair actions, then recommend repair actions to be taken.
    • -
    • Delegate all repair actions; the TL is NOT a resolver.
    • -
    • Be the single authority on system status
    • -
    • Communicate directly with the customers and end-users
        -
      • not the engineers themselves!
      • -
      -
    • -
    -
  4. -
  5. Post Mortem,
      -
    • Create the initial template right after the incident so people can put in their thoughts while fresh.
    • -
    • Assign the post-mortem once the event is over; this can be done after the call.
    • -
    • Work with Managers/Support on scheduling preventive actions.
    • -
    -
  6. -
-

Who are they?#

-

Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule.

-

How can I become one?#

-

Take a look at our Team Leader training guide.

-
-

Sysadmin#

-

What is it?#

-

A Sysadmin is a direct support role for the Team Leader. This is not a shadow role where the person just observes; the Sysadmin is expected to perform important tasks during an incident.

-

Why have one?#

-

It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin supports the TL and helps them stay focused on the incident.

-

What are the responsibilities?#

-

The Sysadmin is expected to:

-
    -
  1. Bring up issues to the TL that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc).
  2. -
  3. Be a "hot standby" TL, should the primary need to either transition to a SME, or otherwise have to step away from the TL role.
  4. -
  5. Page SME's or other on-call engineers as instructed by the Team Leader.
  6. -
  7. Manage the incident call, and be prepared to remove people from the call if instructed by the Team Leader.
  8. -
  9. Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), Slack and email/telephone as necessary.
  10. -
-

Who are they?#

-

Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as a Team Leader, as they may be required to take over command.

-

How can I become one?#

-

Take a look at our Sysadmin training guide. Sysadmins also need to be trained as Team Leaders.

-
-

TODO:::move scribe responsibilities to TL and Sysadmin -::: or assign this to our juniors?

-

Scribe#

-

What is it?#

-

A Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review.

-

Why have one?#

-

The incident commander will need to focus on the problem at hand, and the subject matter experts will need to focus on resolving the incident. It is important to capture a timeline of events as they happen so that they can be reviewed during the post-mortem to determine how well we performed, and so we can accurately determine any additional impact that we might not have noticed at the time.

-

What are the responsibilities?#

-

The Scribe is expected to:

-
    -
  1. Ensure the incident call is being recorded.
  2. -
  3. Note in Slack important data, events, and actions, as they happen. Specifically:
      -
    • Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock")
    • -
    • Status reports when one is provided by the IC (Example: "We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes")
    • -
    • Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.")
    • -
    -
  4. -
-

Who are they?#

-

Anyone can act as a Scribe during an incident; the Scribe is chosen by the Incident Commander at the start of the call. Typically the Deputy will act as the Scribe, but that doesn't necessarily need to happen, and for larger incidents may not be possible.

-

How can I become one?#

-

Follow our Scribe training guide, and then notify the Incident Commanders that you would like to be considered for scribing for the next incident.

-

TODO::: END move scribe responsibilities to TL and Sysadmin

-
-

Subject Matter Expert#

-

What is it?#

-

A Subject Matter Expert (SME), sometimes called a "Resolver" or "Architect", is a domain expert or designated owner of a component or service that is part of the Spearhead Systems service delivery concept.

-

Why have one?#

-

The TL and Sysadmins are not all-knowing super beings. When there is a problem with a service or a particular system, an expert in that service is needed to be able to quickly help identify and fix issues.

-

What are the responsibilities?#

-
    -
  1. Being able to diagnose common problems with the service.
  2. -
  3. Being able to rapidly fix issues found during an incident.
  4. -
  5. Concise communication skills, specifically for CAN reports:
      -
    • Condition: What is the current state of the service? Is it healthy or not?
    • -
    • Actions: What actions need to be taken if the service is not in a healthy state?
    • -
    • Needs: What support does the resolver need to perform an action?
    • -
    -
  6. -
-
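The CAN report format described above can be sketched as a small data structure. This is a minimal illustration only: the `CanReport` class, its field names, and the `summary` format are hypothetical and not part of any Spearhead Systems tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CanReport:
    """Hypothetical container for a Condition / Actions / Needs report."""
    service: str                                       # service the SME is reporting on
    condition: str                                     # current state: healthy or not, and why
    actions: List[str] = field(default_factory=list)   # actions to take if unhealthy
    needs: List[str] = field(default_factory=list)     # support the resolver needs

    def summary(self) -> str:
        """Render the report as one concise line suitable for a call or Slack."""
        actions = "; ".join(self.actions) or "none"
        needs = "; ".join(self.needs) or "none"
        return (
            f"[{self.service}] Condition: {self.condition} | "
            f"Actions: {actions} | Needs: {needs}"
        )
```

Keeping the three parts as separate fields makes it harder to skip one of them when reporting under pressure.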

Who are they?#

-

Anyone who is considered a "domain expert" can act as a resolver for an incident. Typically the service's primary on-call will act as the SME for that service.

-

How can I become one?#

-

Take a look at our Subject Matter Expert training guide. You should also discuss with your team and service owner to determine what the requirements are for your particular service.

-
-

Customer Liaison#

-

What is it?#

-

A person responsible for interacting with customers, either directly, or via our public communication channels. Typically a member of the Customer Support team.

-

Why have one?#

-

All of the other roles will be actively working on identifying the cause and resolving the issue. We need a role which is focused purely on the customer interaction side of things so that it can be done properly, with the due care and attention it needs.

-

What are the responsibilities?#

-
    -
  1. Post any publicly facing messages regarding the incident (DoIT, Twitter, StatusPage, etc).
  2. -
  3. Notify the TL of any customers reporting that they are affected by the incident.
  4. -
-

Who are they?#

-

Any member of the Support Team can act as a customer liaison.

-

How can I become one?#

-

Discuss with the Support Team about becoming our next customer liaison.

- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/before/severity_levels/index.html b/before/severity_levels/index.html deleted file mode 100644 index ed84e2e..0000000 --- a/before/severity_levels/index.html +++ /dev/null @@ -1,605 +0,0 @@ - - - - - - - - - - Severity Levels - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Severity Levels

- -

The first step in any incident response process is to determine what actually constitutes an incident. We classify issues into two high-level categories, "SR" and "IN", each with an attached priority of "Minor", "Normal" or "Major". An "SR" is a "Service Request", initiated by a customer, and usually does not constitute a critical issue (there are exceptions). An "IN" is an "Incident", which is generally urgent.

-

All of our operational issues are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher severity SR or IN before a lower one (a "Major" priority gets a more intensive response than a "Normal" incident, for example).
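The ordering rule above can be sketched in code. This is a minimal illustration, assuming the three priorities (Minor, Normal, Major) and the two kinds (SR, IN) named on this page; the `Ticket` class and `triage_order` function are hypothetical names and not part of DoIT.

```python
from dataclasses import dataclass

# Higher number = more urgent. Names follow the priorities described above.
PRIORITY_RANK = {"Minor": 0, "Normal": 1, "Major": 2}
# At equal priority, Incidents (IN) come before Service Requests (SR).
KIND_RANK = {"SR": 0, "IN": 1}


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket record, for illustration only."""
    title: str
    kind: str       # "SR" or "IN"
    priority: str   # "Minor", "Normal" or "Major"


def triage_order(tickets):
    """Return tickets most urgent first: by priority, then IN before SR."""
    return sorted(
        tickets,
        key=lambda t: (PRIORITY_RANK[t.priority], KIND_RANK[t.kind]),
        reverse=True,
    )
```

Under this sketch a Major SR is handled before a Normal IN, which matches the exception stated above, while a Normal IN is handled before a Normal SR.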

-
-

Always Assume The Worst

-

If you are unsure which level an incident is (e.g. not sure if IN is Major or Normal), treat it as the higher one. During an incident is not the time to discuss or litigate severities, just assume the highest and review during a post-mortem.
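As a small sketch of this guidance: when you are torn between several levels, pick the highest. The level names below are taken from the table on this page; `SEVERITY_RANK` and `assumed_severity` are hypothetical names used only for illustration.

```python
# Higher number = more severe. Level names as used in the severity table.
SEVERITY_RANK = {"Low": 0, "Normal": 1, "Major": 2}


def assumed_severity(candidates):
    """Given the levels you are unsure between, return the highest one."""
    if not candidates:
        raise ValueError("at least one candidate severity is required")
    return max(candidates, key=SEVERITY_RANK.__getitem__)
```

The actual level can then be revisited during the post-mortem, as described above.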

-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SeverityDescriptionWhat To Do
Major -
    -
  • The system is in a critical state and is actively impacting a large number of customers.
  • -
  • Functionality has been severely impaired for a long time, breaking SLA.
  • -
  • Customer-data-exposing security vulnerability has come to our attention.
  • -
-
See During an Incident.
Normal -
    -
  • Functionality of virtualization platform is severely impaired.
  • -
  • E-mail system is offline.
  • -
-
See During an Incident.
Anything above this line is considered a "Major Incident". These are generally Incidents (IN). Below are Service Requests (SR), which are usually initiated by a human who can help with prioritizing. A call is triggered for all major incidents (regardless of whether they are an SR or an IN).
Normal -
    -
  • Partial loss of functionality, only affecting minority of customers.
  • -
  • Something that has the likelihood of becoming Major if nothing is done.
  • -
  • No redundancy in a service (failure of 1 more node will cause outage).
  • -
-
-
    -
  • Work on issue as your top priority.
  • -
  • Liaise with engineers of affected systems to identify cause.
  • -
  • If related to recent deployment, rollback.
  • -
  • Monitor status and notice if/when it escalates.
  • -
  • Mention on Slack if you think it has the potential to escalate.
  • -
-
Normal -
    -
  • Performance issues (delays, etc). Tasks that require non-immediate attention.
  • -
  • Job failure (not impacting alerting).
  • -
-
-
    -
  • Work on the issue as your first priority (above "Low" tasks).
  • -
  • Monitor status and notice if/when it escalates.
  • -
-
Low -
    -
  • Normal bugs which aren't impacting system use, cosmetic issues, etc.
  • -
-
-
    -
  • Create a DoIT ticket and assign to owner of affected system.
  • -
-
- -
-

Be Specific

-

When creating cards in DoIT, be as specific as possible and include all necessary details. Include relevant details regarding when the issue started, what may have triggered it, etc. Document your efforts through worklogs and be specific there as well.

-
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/docs/about.md b/docs/about.md new file mode 100644 index 0000000..ab5ff64 --- /dev/null +++ b/docs/about.md @@ -0,0 +1,31 @@ +This site documents parts of the Spearhead Systems Issue Response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after. + +Few companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone. + +This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com). + +## What is this? + +A collection of pages detailing how to efficiently deal with any incident or service request that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly. + +## Who is this for? + +It is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically this is for all of our Technical Support staff. + +## Why do I need it? + +As a service provider Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers issues therefore this documentation is a guideline for how we handle these requests. 
This documentation will give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time. + +## What is covered? + +Anything from preparing to [go on-call](/oncall/being_oncall.md), definitions of [severities](/before/severity_levels.md), incident [call etiquette](/before/call_etiquette.md), all the way to how to run a [post-mortem](/after/post_mortem_process.md), providing a [post-mortem template](/after/post_mortem_template.md) and even a [security incident response process](/during/security_incident_response.md). + +## What is missing? + +Lots, dig in and help us complete the picture. We can migrate most processes from Sharepoint here. + +## License + +This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file. + +Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account, feel free to fork the repository and use it as a base for your own internal documentation. diff --git a/docs/after/post_mortem_process.md b/docs/after/post_mortem_process.md new file mode 100644 index 0000000..76a9775 --- /dev/null +++ b/docs/after/post_mortem_process.md @@ -0,0 +1,91 @@ +For every major incident (SEV-2/1), we need to follow up with a post-mortem. This is a blame-free, detailed description of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included. 
+ +![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg) + +## Owner Designation +The first step is that a post-mortem owner will be designated. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below, + +## Owner Responsibilities +As owner of a post-mortem, you are responsible for the following, + +* Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident). +* Updating the page with all of the necessary content. +* Investigating the incident, pulling in whomever you need from other teams to assist in the investigation. +* Creating follow-up JIRA tickets (_You are only responsible for creating the tickets, not following them up to resolution_). +* Running the post-mortem meeting (_these generally run themselves, but you should get people back on topic if the conversation starts to wander_). +* In cases where we need a public blog post, creating & reviewing it with appropriate parties. + +## Post-Mortem Wiki Page +Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information. + +1. (If not already done by the IC) Create a new post-mortem page for the incident. + +1. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar. + * Create the meeting on the "Incident Post-Mortem Meetings" shared calendar. + +1. Begin populating the page with all of the information you have. + * The timeline should be the main focus to begin with. 
+ * The timeline should include important changes in status/impact, and also key actions taken by responders. + * You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV). + * Go through the history in Slack to identify the responders, and add them to the page. + * Identify the Incident Commander and Scribe in this list. + +1. Populate the page with more detailed information. + * For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline. + +1. Perform an analysis of the incident. + * Capture all available data regarding the incident. What caused it, how many customers were affected, etc. + * Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered. + * Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery) + * Identify the underlying cause of the incident (What happened, and why did it happen). + +1. Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets), + * Go through the history in Slack to identify any TODO items. + * Label all tickets with their severity level and date tags. + * Any actions which can reduce re-occurrence of the incident. + * (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it). + * Identify any actions which can make our incident response process better. + * Be careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with. + +1. Write the external message that will be sent to customers. 
This will be reviewed during the post-mortem meeting before it is sent out. + * Avoid using the word "outage" unless it really was a full outage, use the word "incident" instead. Customers generally see "outage" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA. + * Look at other examples of previous post-mortems to see the kind of thing you should send. + +## Post-Mortem Meeting +These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us. + +You should invite the following people to the post-mortem meeting, + +* Always + * The incident commander. + * Service owners involved in the incident. + * Key engineer(s)/responders involved in the incident. +* Optional + * Customer liaison. (Only SEV-1 incidents) + +A general agenda for the meeting would be something like, + +1. Recap the timeline, to make sure everyone agrees and is on the same page. +1. Recap important points, and any unusual items. +1. Discuss how the problem could've been caught. + * Did it show up in canary? + * Could it have been caught in tests, or loadtest environment? +1. Discuss customer impact. Any comments from customers, etc. +1. Review action items that have been created, discuss if appropriate, or if more are needed, etc. 
+ +## Examples +Here are some examples of post-mortems from other companies as a reference, + +* [Stripe](https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc) +* [LastPass](https://blog.lastpass.com/2015/06/lastpass-security-notice.html/comment-page-2/) +* [AWS](https://aws.amazon.com/message/5467D2/) +* [Twilio](https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html) +* [Heroku](https://status.heroku.com/incidents/151) +* [Netflix](http://techblog.netflix.com/2012/10/post-mortem-of-october-222012-aws.html) +* [GOV.UK Rail Accident Investigation](https://www.gov.uk/government/publications/kyle-beck-safety-digest/near-miss-at-kyle-beck-3-august-2016) +* [A List of Post-mortems!](https://github.com/danluu/post-mortems) + +## Useful Resources + +* [Advanced PostMortem Fu and Human Error 101 (Velocity 2011)](http://www.slideshare.net/jallspaw/advanced-postmortem-fu-and-human-error-101-velocity-2011) +* [Blame. Language. Sharing.](http://fractio.nl/2015/10/30/blame-language-sharing/) diff --git a/docs/after/post_mortem_template.md b/docs/after/post_mortem_template.md new file mode 100644 index 0000000..781e410 --- /dev/null +++ b/docs/after/post_mortem_template.md @@ -0,0 +1,79 @@ +This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section. + +--- + +!!! note "Guidelines" + This page is intended to be reviewed during a post-mortem meeting that should be scheduled within 5 business days of any event. + Your first step should be to schedule the post-mortem meeting in the shared calendar for within 5 business days after the incident. + Don't wait until you've filled in the info to schedule the meeting, however make sure the page is completed by the meeting. 
+ +**Post-Mortem Owner:** _Your name goes here._ + +**Meeting Scheduled For:** _Schedule the meeting on the "Incident Post-Mortem Meetings" shared calendar, for within 5 business days after the incident. Put the date/time here._ + +**Call Recording:** _Link to the incident call recording._ + +## Overview +_Include a **short** sentence or two summarizing the root cause, timeline summary, and the impact. E.g. "On the morning of August 99th, we suffered a 1-minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA."_ + +## What Happened +_Include a short description of what happened._ + +## Root Cause +_Include a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process._ + +## Resolution +_Include a description of what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution._ + +## Impact +_Be very specific here, include exact numbers._ + +| Time in SEV-1 | ?mins | +| Time in SEV-2 | ?mins | +| Notifications Delivered out of SLA | ??% (?? of ??) | +| Events Dropped / Not Accepted | ??% (?? of ??) _Should usually be 0, but always check_ | +| Accounts Affected | ?? | +| Users Affected | ?? | +| Support Requests Raised | ?? _Include any relevant links to tickets_ | + +## Responders + +* _Who was the IC?_ +* _Who was the scribe?_ +* _Who else was involved?_ +* _Who else was involved?_ + +## Timeline +_Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. 
when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at._ + +| Time (UTC) | Event | Data Link | +| ---------- | ----- | --------- | + +## How'd We Do? + +### What Went Well? + +* _List anything you did well and want to call out. It's OK to not list anything._ + +### What Didn't Go So Well? + +* _List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes._ + +## Action Items +_Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: “sev1_YYYYMMDD” (such as sev1_20150911) and simply “sev1”. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process._ + +## Messaging + +### Internal Email +_This is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page._ + +> Briefly summarize what happened and where the post-mortem page (this page) can be found. + +### External Message +_This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)_ + +> Summary + +> What Happened? + +> What Are We Doing About This? 
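The tag convention from the Action Items section of the template can be sketched in a few lines of Python. This is only an illustration: `incident_tags` is a hypothetical helper that formats strings and does not talk to JIRA, and it assumes SEV-2 tickets follow the same pattern as the `sev1_20150911` example given in the template.

```python
from datetime import date


def incident_tags(severity: int, incident_date: date) -> list[str]:
    """Build the two labels described in the Action Items section.

    Hypothetical helper for illustration only. Assumes SEV-2 tickets
    use the same naming pattern as the documented SEV-1 example.
    """
    if severity not in (1, 2):
        # Post-mortems are only required for major incidents (SEV-2/1).
        raise ValueError("expected severity 1 or 2")
    base = f"sev{severity}"
    # Dated tag first (e.g. sev1_20150911), then the plain severity tag.
    return [f"{base}_{incident_date:%Y%m%d}", base]


print(incident_tags(1, date(2015, 9, 11)))
```

Deriving both tags from one function keeps the dated tag and the plain severity tag consistent across every ticket raised for the same incident.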
diff --git a/docs/assets/css/extra.css b/docs/assets/css/extra.css new file mode 100644 index 0000000..01e30dc --- /dev/null +++ b/docs/assets/css/extra.css @@ -0,0 +1,399 @@ +/* Colfax Font */ +@font-face { + font-family: 'Colfax Regular'; + font-style: normal; + font-weight: 400; + src: local('ColfaxRegular'), url(https://www.pagerduty.com/wp-content/themes/startit-child/fonts/ColfaxWebRegular.woff) format('woff2'); +} + +@font-face { + font-family: 'Colfax Light'; + font-style: normal; + font-weight: 100; + src: local('ColfaxRegular'), url(https://www.pagerduty.com/wp-content/themes/startit-child/fonts/ColfaxWebLight.woff) format('woff2'); +} + +/* Defaults */ +body { + font-weight: 500; + -webkit-font-smoothing: antialiased; +} + +/* Change the colour theme to better match PagerDuty */ + +/* background: pd-green */ +.repo a { + background: #25c151; +} + +@media only screen and (max-width: 959px) { + .palette-primary-green .project { + background: #25c151; + } +} + +/* background: pd-navy */ +.palette-primary-green, +.palette-primary-green .footer, +.palette-primary-green .header, +.palette-primary-green .results .meta, +.palette-primary-green .article table th { + background: #1f293a; +} + +.palette-primary-green .article table th { + background: #555; +} + +/* font: pd-green */ +.palette-primary-green .article h1, +.palette-primary-green .article h2, +.palette-primary-green .drawer .toc a.current, +.palette-primary-green .drawer .toc a:focus, +.palette-primary-green .drawer .toc a:hover, +.palette-primary-green .article a:hover { + color: #25c151; +} + +/* font: pd-navy */ +.palette-primary-green .article a, +.palette-primary-green .article code, +.palette-primary-green .article h1, +.palette-primary-green .article h2 { + color: #1f293a; +} + +/* Selected nav section */ +.palette-primary-green .drawer .anchor a { + border-left: 3px solid #25c151; +} + +/* Hide the page title, already in the navbar */ +.article h1 { + display: none; +} + +/* But show it when 
printing */ +@media print { + .article h1 { + display: block; + padding-top: 0em; + padding-bottom: 0.1em; + margin-top: 0em; + margin-bottom: 0em; + border-bottom: none; + } + + /* Also add a heading when printing */ + .article h1:before { + background: url(/assets/img/logo.png) 0em -0.07em no-repeat; + background-size: 7em; + display: block; + height: 2em; + width: 100%; + padding-left: 7.2em; + content: 'Incident Response'; + border-bottom: 1px solid #ddd; + margin-bottom: 0.6em; + } +} + + +/* Want the font to be bigger for articles, easier reading. */ +.article { + font-size: 1.45em; +} + +/* Too much whitespace at the top, not enough at bottom */ +.article .wrapper { + padding: 56px 16px 132px !important; +} + +@media only screen and (min-width: 720px) { + .article .wrapper { + padding: 70px 24px 126px !important; + } +} + +/* Get rid of the whitespace when printing, let people set own margins. */ +@media print { + .article .wrapper { + padding: 0em !important; + } +} + +ul, ol { + padding-left: 1em; +} + +/* Expanding border menu */ +.drawer .toc li a { + overflow: hidden; + position: relative; +} + +.drawer .toc li a:before { + display: block; + content: ''; + position: absolute; + height: 2em; + left: 0px; + top: 0.5em; + border-left: 5px solid #25c151; + transform: scaleY(0); + transition: transform 250ms ease-in-out; +} + +.drawer .toc li a:hover:before { + transform: scaleY(1); +} + +/* Don't do it on active menu items */ +.drawer .toc a.current:hover:before, +.drawer .toc li.anchor a:hover:before { + transform: scaleY(0); + display: none; +} + +/* Don't overflow horizontally on nav */ +.drawer .toc ul li a { + white-space: nowrap; + text-overflow: ellipsis; +} + +/* Change the title bar to include the PD logo */ +nav div.mainlogo { + width: 15em; + display: table-cell; +} + +nav div.mainlogo a { + min-height: 3.5em; + margin-bottom: -1.25em; + width: 14.5em; + + background: url(/assets/img/logo.png) 0em 0.1em no-repeat; + background-size: contain; +} + 
+nav div.mainlogo img { + display: none; +} + +/* Admonition */ +.admonition { + background: #25c151; +} +.admonition.info { + background: #f5a623; +} + +@media print { + .admonition { + padding: 1em 2em !important; + } +} + +/* Typography */ +h4 { + font-weight: bold; + text-decoration: underline; +} + +.project .logo+.name { + font-size: 13px; +} + +span.bad { + color: #f00; +} + +span.good { + color: #008800; +} + +span.code, +code { + font-family: monospace; + color: #00f !important; + border-radius: 2px; + padding: 0.1em; + border: 1px solid #eee; + background: #f4f4f4; +} + +/* Icons */ +.button .icon:hover { + transition: color 250ms ease-in-out; + color: #25c151; +} + +/* Images */ +.article .wrapper { + overflow: hidden; +} + +/* Center all images */ +.article img { + display: block; + margin: 0 auto; +} + +/* Header images */ +.article h1 + p + p img { + max-width: 110%; + margin-left: -2em; +} + +/* Image Captions */ +img + em { + position: relative; + font-size: 0.8em; + margin-right: -2.3em; + padding: 0em 1em; + float: right; + margin-top: -2.1em; + color: #000; + border-top-left-radius: 3px; + background: rgba(255, 255, 255, 0.7); +} + +/* Fixes for smaller screen sizes */ +@media only screen and (max-width: 720px) { + .article h1 + p + p img { + max-width: 120%; + } + + .article h1 + p + p img + em { + margin-right: -1.4em; + margin-top: -2em; + } +} + +/* Hack to hide the header images when printing. */ +@media print { + .article h1 + p + p img { + display: none; + } + + .article h1 + p + p img + em { + display: none; + } +} + +/* Quotes */ +.article blockquote { + border-left: 3px solid #555; + background: #f9f9f9; + padding: 1em; + padding-left: 16px; + margin-top: 1em; + color: #333; + font-style: italic; +} + +.article blockquote p { + margin: 0em; + padding: 0.5em 0em; +} + +/* Horizontal Rules */ +.article hr { + margin-top: 2em; + border-top: 2px solid #f4f4f4; +} + +/* Don't care about copyright notice for this project, Apache License. 
*/ +aside.copyright { + display: none; +} + +/* Custom tables */ +table.custom-table td ul { + margin-top: -0.8em; + padding-top: 0px; + padding-left: 0px; +} + +table.custom-table td.warning { + font-weight: bold; + text-align: center; + color: #f00; + background: #f4f4f4; +} + +table.custom-table td.sev-1 { + background: #ffe7e7; + color: #f00; + font-weight: bold; +} + +table.custom-table td.sev-2 { + background: #ffd; + color: rgb(255,153,0); + font-weight: bold; +} + +table.custom-table td.sev-3 { + background: #e0f0ff; + color: rgb(51,102,255); + font-weight: bold; +} + +table.custom-table td.sev-4 { + background: #f0f0f0; + color: rgb(128,128,128); + font-weight: bold; +} + +table.custom-table td.sev-5 { + background: #ddfade; + color: rgb(0,128,0); + font-weight: bold; +} + +table.custom-table td.centered { + text-align: center; +} + +/* Embeds */ +iframe { + display: block; + margin: 0 auto; + margin-top: 1em; +} + +/* Contact summary table */ +#contact-summary { + margin-bottom: -2em; + background: #fff; + color: #000; +} + +/* Super horrible hack to get the training PDF images correct */ +#national-incident-management-system-nims ~ p img { + display: inline; +} +#national-incident-management-system-nims ~ p:nth-of-type(6) { + text-align: center; +} + +/* 404 Page */ +#error { + text-align: center; + padding: 0em 5em; +} + +#error h1 { + display: block; + font-size: 2.5em; + padding-bottom: 1em; + margin-bottom: 1em; + margin-top: 1em; + border-bottom: 1px solid #eee; +} + +#error p { + font-style: italic; + color: #555; +} diff --git a/docs/assets/img/cover.png b/docs/assets/img/cover.png new file mode 100644 index 0000000..6325795 Binary files /dev/null and b/docs/assets/img/cover.png differ diff --git a/docs/assets/img/headers/gene_kranz.jpg b/docs/assets/img/headers/gene_kranz.jpg new file mode 100644 index 0000000..773fb29 Binary files /dev/null and b/docs/assets/img/headers/gene_kranz.jpg differ diff --git 
a/docs/assets/img/headers/incident_command_support.jpg b/docs/assets/img/headers/incident_command_support.jpg new file mode 100644 index 0000000..eed6180 Binary files /dev/null and b/docs/assets/img/headers/incident_command_support.jpg differ diff --git a/docs/assets/img/headers/incident_response.jpg b/docs/assets/img/headers/incident_response.jpg new file mode 100644 index 0000000..e45fc8a Binary files /dev/null and b/docs/assets/img/headers/incident_response.jpg differ diff --git a/docs/assets/img/headers/obama_phone.jpg b/docs/assets/img/headers/obama_phone.jpg new file mode 100644 index 0000000..79b4a77 Binary files /dev/null and b/docs/assets/img/headers/obama_phone.jpg differ diff --git a/docs/assets/img/headers/pagerduty_ir.jpg b/docs/assets/img/headers/pagerduty_ir.jpg new file mode 100644 index 0000000..00b6114 Binary files /dev/null and b/docs/assets/img/headers/pagerduty_ir.jpg differ diff --git a/docs/assets/img/headers/pagerduty_post_mortem.jpg b/docs/assets/img/headers/pagerduty_post_mortem.jpg new file mode 100644 index 0000000..7561025 Binary files /dev/null and b/docs/assets/img/headers/pagerduty_post_mortem.jpg differ diff --git a/docs/assets/img/headers/sph_ir.jpg b/docs/assets/img/headers/sph_ir.jpg new file mode 100644 index 0000000..9cbd836 Binary files /dev/null and b/docs/assets/img/headers/sph_ir.jpg differ diff --git a/docs/assets/img/headers/typewriter.jpg b/docs/assets/img/headers/typewriter.jpg new file mode 100644 index 0000000..37dfbbc Binary files /dev/null and b/docs/assets/img/headers/typewriter.jpg differ diff --git a/docs/assets/img/icon.png b/docs/assets/img/icon.png new file mode 100644 index 0000000..39b6248 Binary files /dev/null and b/docs/assets/img/icon.png differ diff --git a/docs/assets/img/logo.png b/docs/assets/img/logo.png new file mode 100644 index 0000000..8ebf36f Binary files /dev/null and b/docs/assets/img/logo.png differ diff --git a/docs/assets/img/misc/ack.png b/docs/assets/img/misc/ack.png new file mode 100644 
index 0000000..e83bd50 Binary files /dev/null and b/docs/assets/img/misc/ack.png differ diff --git a/docs/assets/img/misc/alert_fatigue.png b/docs/assets/img/misc/alert_fatigue.png new file mode 100644 index 0000000..6a3c823 Binary files /dev/null and b/docs/assets/img/misc/alert_fatigue.png differ diff --git a/docs/assets/img/misc/communicate.png b/docs/assets/img/misc/communicate.png new file mode 100644 index 0000000..c708c51 Binary files /dev/null and b/docs/assets/img/misc/communicate.png differ diff --git a/docs/assets/img/misc/escalation.png b/docs/assets/img/misc/escalation.png new file mode 100644 index 0000000..c8595be Binary files /dev/null and b/docs/assets/img/misc/escalation.png differ diff --git a/docs/assets/img/misc/incident_response_roles.png b/docs/assets/img/misc/incident_response_roles.png new file mode 100644 index 0000000..a250b42 Binary files /dev/null and b/docs/assets/img/misc/incident_response_roles.png differ diff --git a/docs/assets/img/misc/mobile_alerts.png b/docs/assets/img/misc/mobile_alerts.png new file mode 100644 index 0000000..225b77c Binary files /dev/null and b/docs/assets/img/misc/mobile_alerts.png differ diff --git a/docs/assets/img/misc/oncall_burnout.png b/docs/assets/img/misc/oncall_burnout.png new file mode 100644 index 0000000..fe39e67 Binary files /dev/null and b/docs/assets/img/misc/oncall_burnout.png differ diff --git a/docs/assets/img/misc/schedule.png b/docs/assets/img/misc/schedule.png new file mode 100644 index 0000000..b7d1f7a Binary files /dev/null and b/docs/assets/img/misc/schedule.png differ diff --git a/docs/assets/img/misc/triage.png b/docs/assets/img/misc/triage.png new file mode 100644 index 0000000..223fe68 Binary files /dev/null and b/docs/assets/img/misc/triage.png differ diff --git a/docs/assets/img/screenshots/high_business_hours.png b/docs/assets/img/screenshots/high_business_hours.png new file mode 100644 index 0000000..fe06d21 Binary files /dev/null and 
b/docs/assets/img/screenshots/high_business_hours.png differ diff --git a/docs/assets/img/screenshots/high_urgency.png b/docs/assets/img/screenshots/high_urgency.png new file mode 100644 index 0000000..5efa9f9 Binary files /dev/null and b/docs/assets/img/screenshots/high_urgency.png differ diff --git a/docs/assets/img/screenshots/low_urgency.png b/docs/assets/img/screenshots/low_urgency.png new file mode 100644 index 0000000..15c54f3 Binary files /dev/null and b/docs/assets/img/screenshots/low_urgency.png differ diff --git a/docs/assets/img/screenshots/prio-high.png b/docs/assets/img/screenshots/prio-high.png new file mode 100644 index 0000000..563d215 Binary files /dev/null and b/docs/assets/img/screenshots/prio-high.png differ diff --git a/docs/assets/img/screenshots/prio-low.png b/docs/assets/img/screenshots/prio-low.png new file mode 100644 index 0000000..841de1c Binary files /dev/null and b/docs/assets/img/screenshots/prio-low.png differ diff --git a/docs/assets/img/screenshots/prio-norm.png b/docs/assets/img/screenshots/prio-norm.png new file mode 100644 index 0000000..43c98a0 Binary files /dev/null and b/docs/assets/img/screenshots/prio-norm.png differ diff --git a/docs/assets/img/screenshots/suppressed.png b/docs/assets/img/screenshots/suppressed.png new file mode 100644 index 0000000..dc9910b Binary files /dev/null and b/docs/assets/img/screenshots/suppressed.png differ diff --git a/docs/assets/img/thumbnails/nims_core.png b/docs/assets/img/thumbnails/nims_core.png new file mode 100644 index 0000000..48bce4c Binary files /dev/null and b/docs/assets/img/thumbnails/nims_core.png differ diff --git a/docs/assets/img/thumbnails/nims_training.png b/docs/assets/img/thumbnails/nims_training.png new file mode 100644 index 0000000..0025545 Binary files /dev/null and b/docs/assets/img/thumbnails/nims_training.png differ diff --git a/docs/before/call_etiquette.md b/docs/before/call_etiquette.md new file mode 100644 index 0000000..9eb7bb4 --- /dev/null +++ 
b/docs/before/call_etiquette.md @@ -0,0 +1,50 @@ +You've just joined an incident call, and you've never been on one before. You have no idea what's going on, or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of. + +![Obama phone](../assets/img/headers/obama_phone.jpg) +*Credit: [Official White House Photo](https://commons.wikimedia.org/wiki/File:Barack_Obama_on_phone_with_Benjamin_Netanyahu_2009-06-08.jpg) by Pete Souza* + +## First Steps + +* If you intend on participating on the incident call you should join both the call, and Slack. +* Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum. +* Keep your microphone muted until you have something to say. +* Identify yourself when you join the call; State your name and the system you are the expert for. +* Speak up and speak clearly. +* Be direct and factual. +* Keep conversations/discussions short and to the point. +* Bring any concerns to the Incident Commander (IC) on the call. +* Respect time constraints given by the Incident Commander. + +## Lingo +**Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.** + +![Communication](../assets/img/misc/communicate.png) + +Standard radio [voice procedure](https://en.wikipedia.org/wiki/Voice_procedure#Words_in_voice_procedure) does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). 
The ones in more active use on major incident calls are, + +* **Ack/Rog** - "I have received and understood" +* **Say Again** - "Repeat your last message" +* **Standby** - "Please wait a moment for the next response" +* **Wilco** - "Will comply" + +Do not invent new abbreviations, and always favor being explicit over implicit. It is better to make things clearer than to try to save time by abbreviating, only to have a misunderstanding because others didn't know the abbreviation. + +## The Commander +The Incident Commander (IC) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking. + +* Follow all instructions from the incident commander, without exception. +* Do not perform any actions unless the incident commander has told you to do so. +* The commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them. +* Once the commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll. +* Answer any questions the commander asks you in a clear and concise way. + * Answering that you "don't know" something is perfectly acceptable. Do not try to guess. +* The commander may ask you to investigate something and get back to them in X minutes. Make sure you are ready with an answer within that time. + * Answering that you need more time is perfectly acceptable, but you need to give the commander an estimate of how much time. + +## Problems? + +#### There's no incident commander on the call! I don't know what to do! +Ask on the call if an IC is present. If you get no response, type `!ic page` in Slack. This will page the primary and backup IC to the call. + +#### I can join the call or Slack, but not both, what should I do? 
+You're welcome to join only one of the channels; however, if you do, you should not actively participate in the incident response, as it causes disjointed communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it. diff --git a/docs/before/different_roles.md b/docs/before/different_roles.md new file mode 100644 index 0000000..a00ec8e --- /dev/null +++ b/docs/before/different_roles.md @@ -0,0 +1,138 @@ +There are several roles for our incident response teams at Spearhead Systems. Certain roles only have one person per incident (e.g. support engineer), whereas other roles can have multiple people (e.g. Sysadmins, Solution Architects, etc.). It's all about coming together as a team, working the problem, and getting a solution quickly. + +Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page. + +![Incident Response Structure](../assets/img/misc/incident_response_roles.png) + +--- + +## Team Leader (TL) + +### What is it? +A Team Leader acts as the single source of truth for what is currently happening and what is going to happen during a major incident. They come in all shapes, sizes, and colors. TLs are also the key elements in a project (boards in DoIT). + +### Why have one? +As any system grows in size and complexity, things break and cause incidents. The TL is needed to help drive major incidents to resolution by organizing their team towards a common goal. + +### What are the responsibilities? +1. Help prepare for projects and incidents, + * Set up communication channels. + * Create the DoIT board(s) and other project planning related materials. + * Funnel people to these communication channels. + * Train team members on how to communicate and train other TLs. +1. Drive incidents and projects to resolution, + * Get everyone on the same communication channel. 
+ * Collect information from team members for their services/area of ownership status. + * Collect proposed repair actions, then recommend repair actions to be taken. + * Delegate all repair actions; the TL is NOT a resolver. + * Be the single authority on system status. + * Communicate directly with the customers and end-users + - not the engineers themselves! +1. Post Mortem, + * Create the initial template right after the incident so people can put in their thoughts while fresh. + * Assign the post-mortem after the event is over; this can be done after the call. + * Work with Managers/Support on scheduling preventive actions. + +### Who are they? +Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule. + +### How can I become one? +Take a look at our [Team Leader training guide](/training/incident_commander.md). + +--- + +## Sysadmin + +### What is it? +A Sysadmin is a direct support role for the Team Leader. This is not a shadow role where the person just observes; the Sysadmin is expected to perform important tasks during an incident. + +### Why have one? +It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin helps to support the TL and keep them focused on the incident. + +### What are the responsibilities? +The Sysadmin is expected to: + +1. Bring up issues to the TL that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc). +1. Be a "hot standby" TL, should the primary need to either transition to an SME, or otherwise have to step away from the TL role. +1. Page SMEs or other on-call engineers as instructed by the Team Leader. +1. Manage the incident call, and be prepared to remove people from the call if instructed by the Team Leader. +1. 
Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), Slack and email/telephone as necessary. + +### Who are they? +Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as a Team Leader as they may be required to take over command. + +### How can I become one? +Take a look at our [Sysadmin training guide](/training/deputy.md). Sysadmins also need to be [trained as Team Leaders](/training/incident_commander.md). + +--- + +TODO:::move scribe responsibilities to TL and Sysadmin +::: or assign this to our juniors? +## Scribe + +### What is it? +A Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review. + +### Why have one? +The incident commander will need to focus on the problem at hand, and the subject matter experts will need to focus on resolving the incident. It is important to capture a timeline of events as they happen so that they can be reviewed during the post-mortem to determine how well we performed, and so we can accurately determine any additional impact that we might not have noticed at the time. + +### What are the responsibilities? +The Scribe is expected to: + +1. Ensure the incident call is being recorded. +1. Note in Slack important data, events, and actions, as they happen. Specifically: + * Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock") + * Status reports when one is provided by the IC (Example: "We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes") + * Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.") + +### Who are they? +Anyone can act as a scribe during an incident; the scribe is chosen by the Incident Commander at the start of the call. 
Typically the Deputy will act as the Scribe, but that doesn't necessarily need to happen, and for larger incidents may not be possible. + +### How can I become one? +Follow our [Scribe training guide](/training/scribe.md), and then notify the Incident Commanders that you would like to be considered for scribing for the next incident. + +TODO::: END move scribe responsibilities to TL and Sysadmin + +--- + +## Subject Matter Expert + +### What is it? +A Subject Matter Expert (SME), sometimes called a "Resolver" or "Architect", is a domain expert or designated owner of a component or service that is part of the Spearhead Systems service delivery concept. + +### Why have one? +The TL and Sysadmins are not all-knowing super beings. When there is a problem with a service or a particular system, an expert in that service is needed to be able to quickly help identify and fix issues. + +### What are the responsibilities? +1. Being able to diagnose common problems with the service. +1. Being able to rapidly fix issues found during an incident. +1. Concise communication skills, specifically for CAN reports: + * Condition: What is the current state of the service? Is it healthy or not? + * Actions: What actions need to be taken if the service is not in a healthy state? + * Needs: What support does the resolver need to perform an action? + +### Who are they? +Anyone who is considered a "domain expert" can act as a resolver for an incident. Typically the service's primary on-call will act as the SME for that service. + +### How can I become one? +Take a look at our [Subject Matter Expert training guide](/training/subject_matter_expert.md). You should also discuss with your team and service owner to determine what the requirements are for your particular service. + +--- + +## Customer Liaison + +### What is it? +A person responsible for interacting with customers, either directly, or via our public communication channels. Typically a member of the Customer Support team. 
+ +### Why have one? +All of the other roles will be actively working on identifying the cause and resolving the issue, so we need a role which is focused purely on the customer interaction side of things so that it can be done properly, with the due care and attention it needs. + +### What are the responsibilities? +1. Post any publicly facing messages regarding the incident (DoIT, Twitter, StatusPage, etc). +1. Notify the TL of any customers reporting that they are affected by the incident. + +### Who are they? +Any member of the Support Team can act as a customer liaison. + +### How can I become one? +Discuss with the Support Team about becoming our next customer liaison. diff --git a/docs/before/severity_levels.md b/docs/before/severity_levels.md new file mode 100644 index 0000000..d7d95c1 --- /dev/null +++ b/docs/before/severity_levels.md @@ -0,0 +1,92 @@ +The first step in any incident response process is to determine what actually constitutes an incident. We have two high-level categories for classifying issues, "SR" and "IN", each with an attached priority of "Minor", "Normal" or "Major". "SR" are "Service requests" initiated by a customer and usually do not constitute a critical issue (there are exceptions) and "IN" are "incidents" which are generally "urgent". + +All of our operational issues are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher-severity SR or IN before a lower one (a "Major" priority gets a more intensive response than a "Normal" incident for example). + +!!! note "Always Assume The Worst" + If you are unsure which level an incident is (e.g. not sure if IN is Major or Normal), **treat it as the higher one**. During an incident is not the time to discuss or litigate severities; just assume the highest and review during a post-mortem.
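The ordering rule described above (higher priority first, incidents ahead of service requests at equal priority) and the "assume the worst" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Priority`, `Issue`, `work_order` and `assume_worst`, the numeric ranks, and the sample issues are hypothetical and are not part of DoIT or of the process itself.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical numeric ranks; only the relative order matters.
    MINOR = 1
    NORMAL = 2
    MAJOR = 3


@dataclass(frozen=True)
class Issue:
    title: str
    kind: str  # "IN" (incident) or "SR" (service request)
    priority: Priority


def work_order(issues):
    """Sort by priority (highest first); at equal priority, IN comes before SR."""
    return sorted(issues, key=lambda i: (-i.priority, i.kind != "IN"))


def assume_worst(*candidates):
    """When unsure between several priorities, treat the issue as the highest one."""
    return max(candidates)


# Made-up sample data for illustration only.
queue = [
    Issue("Cosmetic bug in portal", "SR", Priority.MINOR),
    Issue("E-mail system is offline", "IN", Priority.NORMAL),
    Issue("Customer asks for urgent restore", "SR", Priority.MAJOR),
    Issue("New mailbox request", "SR", Priority.NORMAL),
]

for issue in work_order(queue):
    print(issue.priority.name, issue.kind, issue.title)
```

In this sketch a Major service request is still handled before a Normal incident, which matches the "provided that there are no Service Requests with a higher priority" condition.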
+
+| Severity | Description | What To Do |
+| -------- | ----------- | ---------- |
+| Major | <ul><li>The system is in a critical state and is actively impacting a large number of customers.</li><li>Functionality has been severely impaired for a long time, breaking SLA.</li><li>Customer-data-exposing security vulnerability has come to our attention.</li></ul> | See [During an Incident](/during/during_an_incident.md). |
+| Normal | <ul><li>Functionality of virtualization platform is severely impaired.</li><li>E-mail system is offline.</li></ul> | See [During an Incident](/during/during_an_incident.md). |
+| | Anything above this line is considered a "Major Incident". These are generally Incidents (IN). Below are service requests (SR), which are usually initiated by a human who can help with prioritizing. A call is triggered for all major incidents (regardless of whether they are SR or IN). | |
+| Normal | <ul><li>Partial loss of functionality, affecting only a minority of customers.</li><li>Something that has the likelihood of becoming Major if nothing is done.</li><li>No redundancy in a service (failure of 1 more node will cause outage).</li></ul> | <ul><li>Work on issue as your top priority.</li><li>Liaise with engineers of affected systems to identify cause.</li><li>If related to recent deployment, rollback.</li><li>Monitor status and notice if/when it escalates.</li><li>Mention on Slack if you think it has the potential to escalate.</li></ul> |
+| Normal | <ul><li>Performance issues (delays, etc). Tasks that require non-immediate attention.</li><li>Job failure (not impacting alerting).</li></ul> | <ul><li>Work on the issue as your first priority (above "Low" tasks).</li><li>Monitor status and notice if/when it escalates.</li></ul> |
+| Low | <ul><li>Normal bugs which aren't impacting system use, cosmetic issues, etc.</li></ul> | <ul><li>Create a DoIT ticket and assign to owner of affected system.</li></ul> |
+ +!!! note "Be Specific" + When creating Cards in DoIT, be as specific as possible and include all necessary details. Include relevant details regarding when the issue started, what may have triggered it, etc. Document your efforts through worklogs and be specific there as well. diff --git a/docs/during/during_an_incident.md b/docs/during/during_an_incident.md new file mode 100644 index 0000000..49a711e --- /dev/null +++ b/docs/during/during_an_incident.md @@ -0,0 +1,111 @@ +Information on what to do during a major incident. See our [severity level descriptions](/before/severity_levels.md) for what constitutes a major incident. + +!!! note "Documentation" + For your own internal documentation, you should make sure that this page has all of the necessary information prominently displayed, such as phone bridge numbers, Slack rooms, important chat commands, etc. Here is an example, + + + + + + + + + + + + + + + + + + +
#incident-chat | https://a-voip-provider.com/incident-call | +1 555 BIG FIRE (+1 555 244 3473) / PIN: 123456
Need an IC? Do `!ic page` in Slack
For executive summary updates only, join #executive-summary-updates.
+ +!!! info "Security Incident?" + If this is a security incident, you should follow the [Security Incident Response](/during/security_incident_response.md) process. + +## Don't Panic! + +1. Join the incident call and chat (see links above). + * Anyone is free to join the call or chat to observe and follow along with the incident. + * If you wish to participate, however, you should join both. If you can't join the call for some reason, you should have a dedicated proxy for the call. Disjointed discussions in the chat room are ultimately distracting. + +1. Follow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand. + * If you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once can become overwhelming, so we should try to maintain a hierarchical structure to the call if possible. + +1. Follow instructions from the Incident Commander. + * **Is there no IC on the call?** + * Manually page them via Slack with `!ic page`. This will page the primary and backup ICs at the same time. + * Never hesitate to page the IC. It's much better to have them and not need them than the other way around. + +## Steps for Incident Commander +Resolve the incident as quickly and as safely as possible, and use the Deputy to assist you. Delegate any tasks to relevant experts at your discretion. + +1. Announce on the call and in Slack that you are the incident commander, and who you have designated as deputy (usually the backup IC) and as scribe. + +1. Identify if there is an obvious cause to the incident (recent deployment, spike in traffic, etc.), and delegate investigation to relevant experts. + * Use the service experts on the call to assist in the analysis. They should be able to quickly provide confirmation of the cause, but not always. It's the call of the IC on how to proceed in cases where the cause is not positively known.
Confer with service owners and use their knowledge to help you. + +1. Identify investigation & repair actions (roll back, rate-limit services, etc) and delegate actions to relevant service experts. Typically something like this (obviously not an exhaustive list), + * **Bad Deployment:** Roll it back. + * **Web Application Stuck/Crashed:** Do a rolling restart. + * **Event Flood:** Validate automatic throttling is sufficient, adjust manually if not. + * **Data Center Outage:** Validate automation has removed bad data center. Force it to do so if not. + * **Degraded Service Behavior without load:** Gather forensic data (heap dumps, etc), and consider doing a rolling restart. + +1. Listen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly. + * Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly ("If in doubt, tweet it out"). + +1. Once incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now. + * Move the remaining, non-time-critical discussion to Slack. + * Follow up to ensure the customer liaison wraps up the incident publicly. + * Identify any post-incident clean-up work. + * You may need to perform debriefing/analysis of the underlying root cause. + +1. (After call ends) Create the post-mortem page from the template, and assign an owner to the post-mortem for the incident. + +1. (After call ends) Send out an internal email explaining that we had a major incident, provide a link to the post-mortem. + +## Steps for Deputy +You are there to support the IC in whatever they need. + +1. Monitor the status, and notify the IC if/when the incident escalates in severity level, + * OfficerURL can help you to monitor the status on Slack, + * `!status` - Will tell you the current status. 
+ * `!status stalk` - Will continually monitor the status and report it to the room every 30s. + +1. Be prepared to page other people as directed by the Incident Commander. + +1. Provide regular status updates in Slack (roughly every 30 minutes) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @here. + +1. Follow instructions from the Incident Commander. + +## Steps for Scribe +You are there to document the key information from the incident in Slack. + +1. Update the Slack room with who the IC is, who the Deputy is, and that you're the scribe (if not already done). + * e.g. "IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson" + +1. You should add notes to Slack when significant actions are taken, or findings are determined. You don't need to wait for the IC to direct this - use your own judgment. + * You should also add `TODO` notes to the Slack room that indicate follow-ups slated for later. + +1. Follow instructions from the Incident Commander. + +## Steps for Subject Matter Experts +You are there to support the incident commander in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on the repair actions. + +1. Investigate the incident by analyzing any graphs or logs at your disposal. Announce all findings to the incident commander. + * If you are unsure of the cause, that's fine; state that you are investigating and provide regular updates to the IC. + +1. Announce all suggestions for resolution to the incident commander. It is their decision on how to proceed; do not follow any actions unless told to do so! + +1. Follow instructions from the incident commander. + +1. (Optional) Once the call is over and post-mortem is created, add any notes you think are relevant to the post-mortem page. + +## Steps for Customer Liaison +Be on stand-by to post public-facing messages regarding the incident. + +1. 
You will typically be required to update the status page and to send Tweets from our various accounts at certain times during the call. + +1. Follow instructions from the Incident Commander. diff --git a/docs/during/security_incident_response.md b/docs/during/security_incident_response.md new file mode 100644 index 0000000..cd8d0a3 --- /dev/null +++ b/docs/during/security_incident_response.md @@ -0,0 +1,141 @@ +!!! note "Incident Commander Required" + As with all major incidents at PagerDuty, security ones will also involve an Incident Commander, who will delegate the tasks to relevant resolvers. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity. + +## Checklist +Details for each of these items are available in the next section. + +1. Stop the attack in progress. +1. Cut off the attack vector. +1. Assemble the response team. +1. Isolate affected instances. +1. Identify timeline of attack. +1. Identify compromised data. +1. Assess risk to other systems. +1. Assess risk of re-attack. +1. Apply additional mitigations, additions to monitoring, etc. +1. Forensic analysis of compromised systems. +1. Internal communication. +1. Involve law enforcement. +1. Reach out to external parties that may have been used as a vector for attack. +1. External communication. + +--- + +## Attack Mitigation +Stop the attack as quickly as you can, via any means necessary. Shut down servers, network isolate them, turn off a data center if you have to. Some common things to try, + +* Shut down the instance from the provider console (do not delete or terminate if you can help it, as we'll need to do forensics). +* If you happen to be logged into the box you can try to, + * Re-instate our default iptables rules to restrict traffic. + * `kill -9` any active session you think is an attacker. + * Change root password, and update /etc/shadow to lock out all other users.
+ * `sudo shutdown now` + +## Cut Off Attack Vector +Identify the likely attack vectors and patch/fix them so they cannot be re-exploited immediately after stopping the attack. + +* If you suspect a third-party provider is compromised, delete all accounts except your own (and those of others who are physically present) and immediately rotate your password and MFA tokens. +* If you suspect a service application was an attack vector, disable any relevant code paths, or shut down the service entirely. + +## Assemble Response Team +Identify the key responders for the security incident, and keep them all in the loop. Set up a secure method of communicating all information associated with the incident. Details on the incident (or even the fact that an incident has occurred) should be kept private to the responders until you are confident the attack is not being triggered internally. + +* The security and site-reliability teams should usually be involved. +* A representative for any affected services should be involved. +* An Incident Commander (IC) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate. +* Do not communicate with anyone not on the response team about the incident until forensics has been performed. The attack could be happening internally. +* Give the project an innocuous codename that can be used for chats/documents so if anyone overhears they don't realize it's a security incident (e.g. sapphire-unicorn). +* Prefix all emails and chat topics with "Attorney Work Project". + +## Isolate Affected Instances +Any instances which were affected by the attack should be immediately isolated from any other instances. As soon as possible, an image of the system should be taken and put into a read-only cold storage for later forensic analysis.
+ +* Blacklist the IP addresses for any affected instances from all other hosts. +* Turn off and shut down the instances immediately if you didn't do that to stop the attack. +* Take a disk image of every disk attached to the instances, and ship the images to an off-site cold storage location. You should make sure these images are read-only and cannot be tampered with. + +## Identify Timeline of Attack +Work with all tools at your disposal to identify the timeline of the attack, along with exactly what the attacker did. + +* Any reconnaissance the attacker performed on the system before the attack started. +* When the attacker gained access to the system. +* What actions the attacker performed on the system, and when. +* Identify how long the attacker had access to the system before they were detected, and before they were kicked out. +* Identify any queries the attacker ran on databases. +* Try to identify if the attacker still has access to the system via another back door. Monitor logs for unusual activity, etc. + +## Compromised Data +Using forensic analysis of log files, time-series graphs, and any other information/tools at your disposal, attempt to identify what information was compromised (if any), + +* Identify any data that was compromised during the attack. + * Was any data exfiltrated from a database? + * What keys were on the system that are now considered compromised? + * Was the attacker able to identify other components of the system (map out the network, etc.)? +* Find exactly what customer data has been compromised, if any. + +## Assess Risk +Based on the data that was compromised, assess the risk to other systems. + +* Does the attacker have enough information to find another way in? +* Were any passwords or keys stored on the host? If so, they should be considered compromised, regardless of how they were stored.
+* The owners of any user accounts that were used in the initial attack should rotate all of their keys and passwords on every other system where they have an account. + +## Apply Additional Mitigations +Start applying mitigations to other parts of your system. + +* Rotate any compromised data. +* Identify any new alerting which is needed to notify of a similar breach. +* Block any IP addresses associated with the attack. +* Identify any keys/credentials that are compromised and revoke their access immediately. + +## Forensic Analysis +Once you are confident the systems are secured, and enough monitoring is in place to detect another attack, you can move on to the forensic analysis stage. + +* Take any read-only images you created, any access logs you have, and comb through them for more information about the attack. +* Identify exactly what happened, how it happened, and how to prevent it in future. +* Keep track of all IP addresses involved in the attack. +* Monitor logs for any attempt to regain access to the system by the attacker. + +## Internal Communication +**Delegate to:** VP or Director of Engineering + +Communicate internally only once you are confident (via forensic analysis) that the attack was not sourced internally. + +* Don't go into too much detail. +* Give an overview of the timeline. +* Discuss mitigation steps taken. +* Follow up with more information once it is known. + +## Liaise With Law Enforcement / External Actors +**Delegate to:** VP or Director of Engineering + +Work with law enforcement to identify the source of the attack, letting any system owners know that systems under their control may be compromised, etc. + +* Contact local law enforcement. +* Contact the FBI. +* Contact operators for any systems used in the attack; their systems may also have been compromised. +* Contact security companies to help in assessing risk and any PR next steps.
+ +## External Communication +**Delegate to:** Marketing Team + +Wait until you have validated that all of the information you have is accurate, have a timeline of events, know exactly what information was compromised and how it was compromised, and are sure that it won't happen again. Only then should you prepare and release a public statement to customers informing them of the compromised information and any steps they need to take. + +* Include the date in the title of any announcement, so that it's never confused for a potential new breach. +* Don't say "We take security very seriously". It makes everyone cringe when they read it. +* Be honest, accept responsibility, and present the facts, along with exactly how we plan to prevent such things in future. +* Be as detailed as possible with the timeline. +* Be as detailed as possible in what information was compromised, and how it affects customers. If we were storing something we shouldn't have been, be honest about it. It'll come out later and it'll be much worse. +* Don't name and shame any external parties that might have caused the compromise. It's bad form. (Unless they've already publicly disclosed, in which case we can link to their disclosure). +* Release the external communication as soon as possible, preferably within a few days of the compromise. The longer we wait, the worse it will be. +* Figure out if there is a way to get in touch with customers' internal security teams before the general public notice is sent.
+ +--- + +## Additional Reading + +* [Computer Security Incident Handling Guide](http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-61r2.pdf) (NIST) +* [Incident Handler's Handbook](https://www.sans.org/reading-room/whitepapers/incident/incident-handlers-handbook-33901) (SANS) +* [Responding to IT Security Incidents](https://technet.microsoft.com/en-us/library/cc700825.aspx) (Microsoft) +* [Defining Incident Management Processes for CSIRTs: A Work in Progress](http://resources.sei.cmu.edu/library/asset-view.cfm?assetid=7153) (CMU) +* [Creating and Managing Computer Security Incident Handling Teams (CSIRTS)](https://www.first.org/conference/2008/papers/killcrece-georgia-slides.pdf) (CERT) diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..4753736 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,58 @@ +This documentation covers parts of the Spearhead Systems Issue Response process. It is a copy of [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the [about page](about.md) for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and may not yet be open sourced. + +!!! note "Issue, Incident and Service Request" + At Spearhead we use the term *issue* to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". 
Note that we use the term Incident to describe both service requests and incidents. For brevity we will use SR and IN throughout this documentation. + +A "service request" is usually initiated by a human and is generally not critical for the normal functioning of the business, while an "incident" is an issue that is causing, or can cause, an interruption to normal business functions. + +![Issue Response at Spearhead Systems](./assets/img/headers/sph_ir.jpg) + +## Being On-Call + +If you've never been on-call before, you might be wondering what it's all about. These pages describe what the expectations of being on-call are, along with some resources to help you. + +* [Being On-Call](oncall/being_oncall.md) - _A guide to being on-call, both what your responsibilities are, and what they are not._ +* [Alerting Principles](oncall/alerting_principles.md) - _The principles we use to determine what things page an engineer, and what time of day they page._ + +## Before an Incident + +Reading material for things you probably want to know before an incident occurs. You likely don't want to be reading these during an actual incident. + +* [Severity Levels](before/severity_levels.md) - _Information on our severity level classification. What constitutes a Low issue? What's a "Major Incident"?, etc._ +* [Different Roles for Incidents](before/different_roles.md) - _Information on the roles during an incident; Incident Commander, Scribe, etc._ +* [Incident Call Etiquette](before/call_etiquette.md) - _Our etiquette guidelines for incident calls, before you find yourself in one._ + +## During an Incident + +Information and processes during an incident.
+ +* [During an Incident](during/during_an_incident.md) - _Information on what to do during an incident, and how to constructively contribute._ +* [Security Incident Response](during/security_incident_response.md) - _Security incidents are handled differently to normal operational incidents._ + +## After an Incident + +Our followup processes, how we make sure we don't repeat mistakes and are always improving. + +* [Post-Mortem Process](after/post_mortem_process.md) - _Information on our post-mortem process; what's involved and how to write or run a post-mortem._ +* [Post-Mortem Template](after/post_mortem_template.md) - _The template we use for writing our post-mortems for major incidents._ + +## Training + +So, you want to learn about incident response? You've come to the right place. + +* [Training Overview](training/overview.md) - _An overview of our training guides and additional training material from third-parties._ +* [Incident Commander Training](training/incident_commander.md) - _A guide to becoming our next Incident Commander._ +* [Deputy Training](training/deputy.md) - _How to be a deputy and back up the Incident Commander._ +* [Scribe Training](training/scribe.md) - _A guide to scribing._ +* [Subject Matter Expert Training](training/subject_matter_expert.md) - _A guide on responsibilities and behavior for all participants in a major incident._ +* [Glossary of Incident Response Terms](training/glossary.md) - _A collection of terms that you may hear being used, along with their definition._ + +## Additional Reading + +Useful material and resources from external parties that are relevant to incident response. 
+ +* [Incident Management for Operations](http://shop.oreilly.com/product/0636920036159.do) (O'Reilly) +* [Incident Response](http://shop.oreilly.com/product/9780596001308.do) (O'Reilly) +* [Debriefing Facilitation Guide](http://extfiles.etsy.com/DebriefingFacilitationGuide.pdf) (Etsy) +* [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system) (FEMA) +* [Every Minute Counts: Leading Heroku's Incident Response](https://www.heavybit.com/library/video/every-minute-counts-coordinating-herokus-incident-response/) (Blake Gentry) diff --git a/docs/oncall/alerting_principles.md b/docs/oncall/alerting_principles.md new file mode 100644 index 0000000..fb267d1 --- /dev/null +++ b/docs/oncall/alerting_principles.md @@ -0,0 +1,36 @@ +We manage how we get alerted based on many factors such as the customer's contractual SLA, the urgency of their request or incident, etc. **An alert or notification is something which requires a human to perform an action**. Based on the severity of the issue (service request or incident) we prioritize accordingly in [DoIT](http://doit.sphs.ro). + +!!! warning "Major Priority Alerts" + Anything that wakes up a human in the middle of the night should be **immediately human actionable**. If it is not, then we need to adjust the alert to not page at those times. + +| Priority | Alerts | Response | +| -------- | ------ | -------- | +| Major | Major-Priority Spearhead Alert 24/7/365. | Requires **immediate human action**. | +| Normal | Normal-Priority Spearhead Alert during **business hours only**. | Requires human action that same working day. | +| Minor | Minor-Priority Spearhead Alert 24/7/365. | Requires human action at some point. | +| Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter our ticketing system or inboxes. If they are enabled they should be sent only to required/specific people, not groups.
| + +Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer. + +If you're setting up a new alert/notification, consider the chart above for how you want to alert people. Be mindful of not creating new high-priority alerts if they don't require an immediate response, for example. + +!!! info "Alert Channels" + Presently we use email as the only notification method. This means keeping an eye on your email is essential! + SMS and Push notifications are in the pipeline for DoIT. + +## Examples + +#### "Production service is failing for 75% of requests, automation is unable to resolve." +This would be a **Major** priority IN, requiring immediate human action to resolve. + +![Major Urgency](../assets/img/screenshots/prio-high.png) + +#### A customer sends an email stating that "Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve." +This would be a **Normal** priority SR, requiring human action soon, but not immediately. + +![Normal Urgency](../assets/img/screenshots/prio-norm.png) + +#### "An SSL certificate is due to expire in one week." +This would be a **Minor** priority SR, requiring human action some time soon. + +![Minor Urgency](../assets/img/screenshots/prio-low.png) diff --git a/docs/oncall/being_oncall.md b/docs/oncall/being_oncall.md new file mode 100644 index 0000000..6b69246 --- /dev/null +++ b/docs/oncall/being_oncall.md @@ -0,0 +1,95 @@ +A summary of expectations and helpful information for being on-call. + +![Alert Fatigue](../assets/img/misc/alert_fatigue.png) + +## What is On-Call? +Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise.
For example, if you are on-call, should any alarms be triggered by our monitoring solution, you will receive a "page" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state. + +At Spearhead Systems we consider you to be on-call both during normal working hours, when you are proactively working with [DoIT](http://doit.sphs.ro/) and looking over your assigned cards/boards, and when you are formally "on-call" and issues are being redirected to you. + +On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and it is the problem that Spearhead Systems professional services is trying to fix! + +## Responsibilities + +1. **Prepare** + * Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc). + * Have a way to charge your MiFi. + * Team alert escalation happens within 5 minutes; set/stagger your notification timeouts (push, SMS, phone...) accordingly. + * Make sure texts and calls from Spearhead Systems (and directly from colleagues) can bypass your "Do Not Disturb" settings. + * Be prepared (environment is set up, a current working copy of the necessary repos is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, you have Java installed, ssh-keys and so on...) + * Read our Incident Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc. + * Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc. + +1. 
**Triage** + * Acknowledge and act on alerts whenever you can (see the first "Not responsibilities" point below) + * Determine the urgency of the problem: + * Is it something that should be worked on right now or escalated into a major incident? ("production server on fire" situations. Security alerts) - do so. + * Is it some tactical work that doesn't have to happen during the night? (for example, disk utilization high watermark, but there's plenty of space left and the trend is not indicating impending doom) - snooze the alert until a more suitable time (working hours, the next morning...) and get back to fixing it then. + * Check Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there. + * Do the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert for, then escalate to another team member or group. + +1. **Fix** + * You are empowered to dive into any problem and act to fix it. + * Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before. + * If the issue is not very time sensitive and you have other priority work, make a note of this in DoIT to keep track of it (with an appropriate severity). + +1. **Improve** + * If a particular issue keeps happening; if an issue alerts often but turns out to be a preventable non-issue – perhaps improving this should be a longer-term task. + * Disks that fill up, logs that should be rotated, noisy alerts... (we use Ansible, go ahead and start automating!) + * If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation.
Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized. + +1. **Support** + * When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note. + * If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance. + * Support each other: when doing activities that might generate plenty of pages, it is courteous to "take the page" away from the on-call by notifying them and scheduling an override for the duration. + +## Not Responsibilities + +1. No expectation to be the first to acknowledge _all_ of the alerts during the on-call period. + * Commutes (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That's what the backup on-call and schedule are for. + +1. No expectation to fix all issues by yourself. + * No one knows everything. Your whole team is here to help. There is no shame in, and much to be learned from, escalating issues you are not certain about. "Never hesitate to escalate". + * Service owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once – and it's often best to let the subject matter expert do the cutting. + +## Recommendations +If your team is starting its own on-call rotation, here are some scheduling recommendations from the Operations team. + +* Always have a backup schedule. Yes, this means two people being on-call at the same time, however it takes a lot of the stress off of the primary if they know they have a specific backup they can contact, rather than trying to choose a random member of the team.
+ * A backup shift should generally come directly after a primary shift. It gives the previous primary a chance to pass on additional context which may have come up during their shift. It also helps to prevent people from sitting on issues with the intent of letting the next shift fix it. + +* The third level of your escalation (after backup schedule) should probably be your entire team. This should hopefully never happen (it's happened once in the history of the Support team), but when it does, it's useful to be able to just get the next available person. + +![Escalation](../assets/img/misc/escalation.png) + +* Team managers can (and should) be part of your normal rotation. It gives a better insight into what has been going on. + +* New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time). + +* We recommend you set your escalation timeout to 5 minutes. This should be plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway. + +* When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift: a service has been flapping, an issue is likely to recur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient. + +### Notification Method Recommendations +You are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents.
If you're not sure how to configure them, the Support team has some recommendations: + +![Mobile Alerts](../assets/img/misc/mobile_alerts.png) + +* Use Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. (DoIT is in the process of integrating with SNS for push notifications.) +* Use Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you. + +## Etiquette + +* If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice. + +* Don't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then you shouldn't be acknowledging it. Add a comment with your notes instead. + +![Acknowledging](../assets/img/misc/ack.png) + +* If you are testing something, or performing an action that you know will cause a page (notification, alert), it's customary to "take the pager" for the time during which you will be testing. Notify the person on-call that you are taking the pager for the next hour while you test. + +* "Never hesitate to escalate" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help. + +* Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town.
+ +* If an issue comes up during your on-call shift for which you got paged, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible. diff --git a/docs/training/deputy.md b/docs/training/deputy.md new file mode 100644 index 0000000..07f15cd --- /dev/null +++ b/docs/training/deputy.md @@ -0,0 +1,57 @@ +So you want to be a deputy? You've come to the right place! + +![Deputy](../assets/img/headers/incident_command_support.jpg) +*Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743801731/in/album-72157633494644719/)* + +## Purpose +The purpose of the Deputy is to support the IC by keeping track of timers, notifying the IC of important information, and paging other people as directed by the IC. + +It's important for the IC to focus on the problem at hand, rather than worrying about monitoring timers. The deputy is there to help support the IC and keep them focussed on the incident. + +As a Deputy, you will be expected to take over command from the IC if they request it. + +**You should not be performing any remediations, checking graphs, or investigating logs**. Those tasks will be delegated to the resolvers by the IC. + +## Prerequisites +Before you can be a Deputy, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! + +* Be trained as an [Incident Commander](/training/incident_commander.md). + +## Responsibilities +Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from a Deputy, as well as what we expect from the other roles you'll be interacting with. + +## Training Process +The training process for a Deputy is quite simple. + +* Follow our [Incident Commander Training](/training/incident_commander.md). +* Read this page. 
+ +## Incident Call Procedures and Lingo +The [Steps for Deputy](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. + +Here are some examples of phrases and patterns you should use during incident calls. + +### Keep Track of Responders +As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the `!ic responders` to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them. + +> Do we have a representative from [X] on the call? + +> (pause) + +> Deputy, can you go ahead and page the [X] on-call please. + +You can page them however you see fit, phone call, etc. + +### Provide Executive Status Updates +Provide regular status updates on Slack (roughly every 30mins), giving an executive summary of the current status during SEV-1 incidents. Keep it short and to the point, and use @here. Mention the current state, the actions in progress, customer impact, and expected time remaining. It's OK to miss out some of those if the information isn't known. + +> @here: We are in SEV-1 due to X. Current actions in progress are to do Y. Expecting 3 mins to complete that action. Once action is complete, system should recover on its own within 5 minutes. + +### Alert IC to Timers +You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. This is as simple as telling the IC on the call, + +> IC, be advised the incident is now at the 10 minute mark. + +Similarly, when the IC asks for someone to get back to them in X minutes, you are expected to keep track of that. You should remind the IC when that time has been reached. + +> IC, be advised the timer for [TEAM]'s investigation is up. 
diff --git a/docs/training/glossary.md b/docs/training/glossary.md new file mode 100644 index 0000000..d197a4c --- /dev/null +++ b/docs/training/glossary.md @@ -0,0 +1,14 @@ +Ever wonder what all of those strange words you sometimes see in our documentation mean? This page is here to help. + +| Term | Description | +| ---- | ----------- | +| **IC / Incident Commander** | The incident commander is the person responsible for bringing any major incident to resolution. They are the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. [More info](../before/different_roles.md). | +| **Deputy** | Typically the backup IC. The deputy's job is to support the IC during the call, providing them with any help they need. [More info](../before/different_roles.md). | +| **Scribe** | The scribe's job is to keep a log of all activities performed during the call in a written chat log on Slack. [More info](../before/different_roles.md). | +| **Resolver** | A person on the incident call who is able to help resolve issues within a particular system. Also referred to as an SME (see below). [More info](../before/different_roles.md). | +| **SME** | "Subject Matter Expert", someone who is an expert in a particular service or subject who can provide information to the IC, and perform resolution actions for a particular system. [More info](../before/different_roles.md). | +| **CAN Report** | CAN stands for "Conditions", "Actions", "Needs". If an IC asks you for a CAN report, you should provide the current state of your service (condition), what actions need to be taken to return it to a healthy state (actions), and what support you need in order to perform the actions (needs). | +| **Sev / Severity** | How severe the incident is. The "sev" of an incident determines the type of response we give. The higher the severity, the higher the likelihood of taking risky actions to resolve the situation.
[More info](../before/severity_levels.md). | +| **Span of Control** | Refers to the number of direct reports you have. For example, if the IC has 10 people as direct reports on a call, they have a large span of control. We aim to make the span of control as minimal as we can while still being productive. | +| **Grenade Thrower** | Someone who joins the call late in the game, and provides information that completely derails the current thinking. They then leave almost immediately. | +| **Executive Swoop** | When an executive comes on the call and drops some sort of bombshell. A version of grenade throwing. | diff --git a/docs/training/incident_commander.md b/docs/training/incident_commander.md new file mode 100644 index 0000000..4721981 --- /dev/null +++ b/docs/training/incident_commander.md @@ -0,0 +1,263 @@ +So you want to be an incident commander? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern)! + +![Gene Kranz](../assets/img/headers/gene_kranz.jpg) +*Credit: [NASA](https://en.wikipedia.org/wiki/File:Eugene_F._Kranz_at_his_console_at_the_NASA_Mission_Control_Center.jpg)* + +## Purpose +If you could boil down the definition of an Incident Commander to one sentence, it would be, + +> Take whatever actions are necessary to protect PagerDuty systems and customers. + +The purpose of the Incident Commander is to be the decision maker during a major incident, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. + +The Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. + +Your job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details.
**You should not be performing any actions or remediations, checking graphs, or investigating logs.** Those tasks should be delegated. + +## Prerequisites +Before you can be an Incident Commander, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! + +* Has **excellent knowledge of PagerDuty systems** and is able to quickly evaluate good vs bad options, and quickly identify what's actually going on. +* Been at PagerDuty for at least 6 months and has a **solid understanding of the incident notification pipeline and web stack**. +* Excellent verbal and written **communication skills**. +* Has **knowledge of obscure PagerDuty terms**. +* Has gravitas and is **willing to kick people off a call** to remove distractions, even if it's the CEO. + +## Responsibilities +Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with. + +## Qualities +Some qualities we expect from an effective leader include being able to: + +* Take command. +* Motivate responders. +* Communicate clear directions. +* Size up the situation and make rapid decisions. +* Assess the effectiveness of tactics/strategies. +* Be flexible and modify your plans as necessary. + +As a leader, you should try to: + +* Be proficient in your job. +* Make sound and timely decisions. +* Ensure tasks are understood. +* Be prepared to step out of a tactical role to assume a leadership role. + +## Training Process +The process is fairly loose for now. Here's a list of things you can do to train though, + +* Read the rest of this page, particularly the sections below. + +* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF). + * Shadow a FF to see how it's run. + * Be the scribe for multiple FF's. + * Be the incident commander for multiple FF's. 
+ +* Play a game of "[Keep Talking and Nobody Explodes](http://www.keeptalkinggame.com/)" with other people in the office. + * For a more realistic experience, play it with someone in a different office over Hangouts. + +* Shadow a current incident commander for at least a full week shift. + * Get alerted when they do, join in on the same calls. + * Sit in on an active incident call, follow along with the chat, and follow along with what the Incident Commander is doing. + * **Do not actively participate in the call, keep your questions until the end.** + +* Reverse shadow a current incident commander for at least a full week shift. + * You should be the one to respond to incidents, and you will take point on calls, however the current IC will be there to take over should you not know how to proceed. + +## Graduation +What's the difference between an IC in training, and an IC? (This isn't the set up to a joke). Simple, an IC puts themselves on the schedule. + +## Handling Incidents +Every incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one. + +1. **Identify the symptoms.** + * Identify what the symptoms are, how big the issue is, and whether it's escalating/flapping/static. + +1. **Size-up the situation.** + * Gather as much information as you can, as quickly as you can (remember the incident is still happening while you're doing this). + * Get the facts, the possibilities of what can happen, and the probability of those things happening. + +1. **Stabilize the incident.** + * Identify actions you can use to proceed. + * Gather support for the plan (See "Polling During a Decision" below). + * Delegate remediation actions to your SME's. + +1. **Provide regular updates.** + * Maintain a cadence, and provide regular updates to everyone on the call. + * What's happening, what are we doing about it, etc. 
+ +## Deputy +The deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander’s position if required. + +## Communication Responsibilities +Sharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands. + +When given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks. + +After an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary. + +## Incident Call Procedures and Lingo +The [Steps for Incident Commander](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. + +Additionally, aside from following the [usual incident call etiquette](/before/call_etiquette.md), there are a few extra etiquette guidelines you should follow as IC: + +* Always announce when you join the call if you are the on-call IC. +* Don't let discussions get out of hand. Keep conversations short. +* Note objections from others, but your call is final. +* If anyone is being actively disruptive to your call, kick them off. +* Announce the end of the call. + +Here are some examples of phrases and patterns you should use during incident calls. + +### Start of Call Announcement +At the start of any major incident call, the incident commander should announce the following, + +> This is [NAME], I am the Incident Commander for this call.
+ +This establishes to everyone on the call what your name is, and that you are now the commander. You should state "Incident Commander" and not "IC", as newcomers may not be familiar with the terminology yet. The word "commander" makes it very clear that you're in charge. + +### Start of Incident, IC Not Present +If you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following, + +> Is there an IC on the call? + +> (pause) + +> Hearing no response, this is [NAME], and I am now the Incident Commander for this call. + +If the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure) + +### Checking if SME's are Present +During a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your deputy can page one if no one answers. + +> Do we have a representative from [X] on the call? + +> (pause) + +> Deputy, can you go ahead and page the [X] on-call please. + +### Assigning Tasks +When you need to give out an assignment or task, give it to a person directly, never say "can someone do..." as this leads to the [bystander effect](https://en.wikipedia.org/wiki/Bystander_effect). Instead, all actions should be assigned to a specific person, and time-boxed with a specific number of minutes. + +> IC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes. + +> Bob: Understood + +Keep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings. + +### Polling During a Decision +If a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. 
But it's important that no one can come later and object to the plan, saying things like "I knew that would happen". An IC will use very specific language to be sure that doesn't happen. + +> The proposal is to [EXPLAIN PROPOSAL] + +> Are there any strong objections to this plan? + +> (pause) + +> Hearing no objections, we are proceeding with this proposal. + +If you were to ask "Does everyone agree?", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter. + +### Status Updates +It's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page. + +> While we wait for [X], here's an update of our current situation. + +> We are currently in a SEV-1 situation, we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time. + +> Are there any additional actions or proposals from anyone else at this time? + +### Transfer of Command +Transfer of command involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place, + +* Commander has become fatigued and is unable to continue. +* Incident complexity changes. +* Change of command is necessary for effectiveness or efficiency. +* Personal emergencies arise (e.g., Incident Commander has a family emergency). + +Never feel like you are not doing your job properly by handing over. Handovers are encouraged.
In order to hand over, out of band from the main call (via Slack for example), notify the other IC that you wish to transfer command. Update them with anything you feel appropriate. Then announce on the call, + +> Everyone on the call, be advised, at this time I am handing over command to [X]. + +The new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander. + +Note that the arrival of a more qualified person does NOT necessarily mean a change in incident command. + +### Maintaining Order +Often on a call people will be talking over one another, or an argument on the correct way to proceed may break out. As Incident Commander it's important that order is maintained on a call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO). But often you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long. + +> (noise) + +> Ok everyone, can we all speak one at a time please. So far I'm hearing two options to proceed: 1) [X], 2) [Y]. + +> Are there any other proposals someone would like to make at this time? + +> ...etc + +### Getting Straight Answers +You may ask a question as IC and receive an answer that doesn't actually answer your question. This is generally when you ask for a yes/no answer but get a more detailed explanation. This is often because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed. + +> IC: Is this going to disable the service for everyone? + +> SME: Well... for some people it.... + +> IC: Stop. I need a yes/no answer. Is this going to disable the service for everyone? + +> SME: Well... it might not do... + +> IC: Stop. I'm going to ask again, and the only two words I want to hear from you are "yes" or "no".
Is this going to disable the service for everyone? + +> SME: Well.. like I was saying.. + +> IC: Stop. Leave the call. Backup IC, can you please page the backup on-call for [service] so that we can get an answer. + +### Executive Swoop +You may get someone who would be senior to you during peacetime come on the call and start overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. While this is rare, you can get things back on track with the following, + +> Executive: No, I don't want us doing that. Everyone stop. We need to roll back instead. + +> IC: Hold please. [EXECUTIVE], do you wish to take over command? + +> Executive: Yes/No + +> (If yes) IC: Understood. Everyone on the call, be advised, at this time I am handing over command to [EXECUTIVE]. They are now the incident commander for this call. + +> (If no) IC: In that case, please cause no further interruptions or I will remove you from the call. + +This makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call. + +### End of Call Sign-Off +At the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where followup discussion can take place. It's also customary to thank everyone. + +> Ok everyone, we're ending the call at this time. Please continue any followup discussion on Slack. Thanks everyone. + +## Examples From Pop Culture +PagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work.
+ +--- + + +Here's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. Here are some things to note: + +* Walks into the room, and immediately obvious that he's the IC. Calms the noise, and makes sure everyone is paying attention. +* Provides a status update so people are aware of the situation. +* Projector breaks, doesn't get sidetracked on fixing it, just moves on to something else. +* Provides a proposal for how to proceed and elicits feedback. + * Listens to the feedback calmly. + * When counter-proposal is raised, states that he agrees and why. +* Allows a discussion to happen, listens to all points. When discussion gets out of hand, re-asserts command of the situation. + * Explains his decision, and why. +* Explains his full plan and decision, so everyone is on the same page. + +--- + + +Another clip from Apollo 13. Things to note: + +* Summarizes the situation, and states the facts. +* Listens to the feedback from various people. +* When a trusted SME provides information counter to what everyone else is saying, asks for additional clarification ("What do you mean, everything?") +* Wise cracking remarks are not acknowledged by the IC ("You can't run a vacuum cleaner on 12 amps!") +* "That's the deal?".. "That's the deal". +* Once decision is made, moves on to the next discussion. +* Delegates tasks. diff --git a/docs/training/overview.md b/docs/training/overview.md new file mode 100644 index 0000000..4693424 --- /dev/null +++ b/docs/training/overview.md @@ -0,0 +1,24 @@ +Learning about the Spearhead Systems incident response process is an important part of being an effective on-call engineer at Spearhead Systems. This section goes over our training material for the various roles that are involved in our incident response, along with some additional information and training material from government agencies.
+ +## Training Guides +Our training guides are split up by role, however you are encouraged to read through the training guides even for roles you don't belong to, as it can give you some good insight into how those people will be behaving during major incidents. + +* [Incident Commander Training](/training/incident_commander.md) - The "IC" is the person who drives a major incident to resolution. They're the person who will be directing everyone else. +* [Deputy Training](/training/deputy.md) - The Deputy is someone who supports the Incident Commander and can take over for them if necessary. +* [Scribe Training](/training/scribe.md) - This is intended for individuals who will be acting as a scribe during an incident. +* [SME / Resolver Training](/training/subject_matter_expert.md) - This is relevant to everyone at Spearhead Systems who are on-call for any team. + +## National Incident Management System (NIMS) +Our incident response process is loosely based on the [US National Incident Management System (NIMS)](https://www.fema.gov/national-incident-management-system), which is described as, + + _A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards—regardless of cause, size, location, or complexity—in order to reduce loss of life, property and harm to the environment._ + +While it might not initially seem that this would be applicable to an IT operations environment, we've found that many of the lessons learned from major incidents in these situations can be directly applied to our industry too. The principles are the same and span many different environments. 
+ +[![NIMS](../assets/img/thumbnails/nims_core.png)](https://www.fema.gov/pdf/emergency/nims/NIMS_core.pdf) [![NIMS Training](../assets/img/thumbnails/nims_training.png)](https://www.fema.gov/pdf/emergency/nims/nims_training_program.pdf) + +If you want to learn more about NIMS, we recommend the [ICS-100](https://training.fema.gov/is/courseoverview.aspx?code=IS-100.b) and [ICS-700](https://training.fema.gov/is/courseoverview.aspx?code=IS-700.a) online training courses, which go over NIMS and the Incident Command System (You can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of [additional training material and courses from FEMA](https://training.fema.gov/nims/) on NIMS, which I would encourage you to look at. + +If you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local [CERT programs](https://www.fema.gov/community-emergency-response-teams) (Community Emergency Response Teams). Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. Not only is it an opportunity to get real world experience with disaster response, but the skills you learn can be applied to everyday life too. + +Also take a look at the [Additional Reading](/#additional-reading) section on the home page. diff --git a/docs/training/scribe.md b/docs/training/scribe.md new file mode 100644 index 0000000..b45c124 --- /dev/null +++ b/docs/training/scribe.md @@ -0,0 +1,75 @@ +So you want to be a scribe? You've come to the right place! You don't need to be a senior team member to become a deputy or scribe, anyone can do it providing you have the requisite knowledge! 
+ +![Typewriter](../assets/img/headers/typewriter.jpg) +*Credit: [Holly Chaffin](http://www.publicdomainpictures.net/view-image.php?image=49706&picture=antique-typewriter-keys)* + +## Purpose +The purpose of the Scribe is to maintain a timeline of key events during an incident, documenting actions and keeping track of any followup items that will need to be addressed. + +It's important for the rest of the command staff to be able to focus on the problem at hand, rather than worrying about documenting the steps. + +Your job as Scribe is to listen to the call and to watch the incident Slack room, keeping track of context and actions that need to be performed, documenting these in Slack as you go. **You should not be performing any remediations, checking graphs, or investigating logs.** Those tasks will be delegated to the subject matter experts (SMEs) by the Incident Commander. + + +## Prerequisites +Before you can be a Scribe, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet; you can still continue with training! + +* Excellent verbal and written **communication skills**. +* Has **knowledge of obscure PagerDuty terms**. + +## Responsibilities +Read up on our [Different Roles for Incidents](/before/different_roles.md) to see what is expected from a Scribe, as well as what we expect from the other roles you'll be interacting with. + +## Training Process +There is no formal training process for this role; reading this page should be sufficient for most tasks. Here's a list of things you can do to train though, + +* Read the rest of this page, particularly the sections below. + +* Participate in [Failure Friday](https://www.pagerduty.com/blog/failure-friday-at-pagerduty/) (FF). + * Shadow an FF to see how it's run. + * Be the scribe for multiple FFs. + +## Scribing +Scribing is more art than science. 
The objective is to keep an accurate record of important events that occurred on the call, so that we can look back at the timeline to see what happened. But what exactly is important? There's no definitive answer, and it really comes down to judgement and experience. But here are some general things you most definitely want to capture as scribe. + +* The result of any polling decisions. + * This is not "9 people voted yay, 3 voted nay". + * It is "Polled for if we should do rolling restart. is proceeding with restart." +* Any followup items that are called out as "We should do this..", "Why didn't this?..", etc. + * This is not "Why isn't the Support representative on the call?" + * This is "TODO: Why didn't we get paged for this earlier?" + +## Incident Call Procedures and Lingo +The [Steps for Scribe](/during/during_an_incident.md) provide a detailed description of what you should be doing during an incident. + +Here are some examples of phrases and patterns you should use during incident calls. + +### Status Stalking +At the start of any major incident call, you should start our status stalking bot, so that it automatically posts status updates to the room. + +> !status stalk + +This will provide the update and allow the IC to see the status without having to keep asking. + +### Note Important Actions +During a call, you will hear lots of discussion happening. You should not be documenting all of this in the chat room. You only want to document things which will be important for the final timeline. It's not always obvious what this might be, and it's usually a matter of judgement. You generally want to note any actions the IC has asked someone to perform, along with the result of any polling decisions. + +> Polled for decision on whether to perform rolling restart. We are proceeding with restart. [USER_A] to execute. + +Some actions might seem important at the time, but end up not being. That's OK. 
It's better to have more info than not enough, but don't go overboard. + +### Note Followup Actions +Sometimes during the call, someone will either mention something we "should fix", or the IC will specifically ask you to note a followup item. You can do this in Slack by simply prefixing with "TODO", this will make it easier to search for later. + +> TODO: Why did we not get paged for the fall in traffic on [X] cluster? + +The post-mortem owner will find these after and raise tasks for them. + +### End of Call Notification +When the IC ends the call, you should post a message into Slack to let everyone know the call is over, and that they should continue discussion elsewhere. + +> Call is over, thanks everyone. Follow up in Slack. + +Don't forget to also stop the status stalking. + +> !status unstalk diff --git a/docs/training/subject_matter_expert.md b/docs/training/subject_matter_expert.md new file mode 100644 index 0000000..5b11dac --- /dev/null +++ b/docs/training/subject_matter_expert.md @@ -0,0 +1,54 @@ +If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Incident Commander, take a look at the [Incident Commander Training page](/training/incident_commander.md). + +![Incident Response](../assets/img/headers/incident_response.jpg) +*Credit: [oregondot @ Flickr](https://www.flickr.com/photos/oregondot/8743809853/in/album-72157633494644719/)* + +## On-Call Expectations +If you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. Getting paged about a SEV-3 or SEV-4 in your system comes with different expectations than getting paged with a major SEV-2. + +### Before Going On-Call + +1. 
Be prepared, by having already familiarized yourself with our incident response policies and procedures. In particular, + 1. [Different Roles for Incidents](/before/different_roles.md) - You will be acting as a "Resolver" or "SME". But you should familiarize yourself with the other roles and what they will be doing. + 1. [Incident Call Etiquette](/before/call_etiquette.md) - How to behave during an incident call. + 1. [During an Incident](/during/during_an_incident.md) - What to do during an incident. You are specifically interested in the "Resolver" steps, but you should familiarize yourself with the entire document. + 1. [Glossary](/training/glossary.md) - Familiarize yourself with the terminology that may be used during the call. +1. Make sure you have set up your alerting methods, and that PagerDuty can bypass your "Do Not Disturb" settings. +1. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged. +1. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc. +1. If you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander. + +### During On-Call Period + +1. Have your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc). +1. If you have important appointments, you need to get someone else on your team to cover that time slot in advance. +1. When you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes). + 1. You will be asked questions or given actions by the Incident Commander. Answer questions concisely, and follow all actions given (even if you disagree with them). + +## Response Mobilization +When an incident occurs, you must be mobilized or assigned to become part of the incident response. 
In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised. + +## "Never Hesitate to Escalate" +If you're not sure about something, it is perfectly acceptable to bring in other SMEs from your team that you believe know a given system better than you. Don't let your ego keep you from bringing in additional help. Our motto is "Never hesitate to escalate". You will never be looked down upon for escalating something because you didn't know how to handle it. + +## Blameless +There will be incidents. Some will be caused by you, some will be caused by others... some will just happen. Our entire incident response process is completely blameless. Blaming people is counterproductive and just distracts from the problem at hand. No matter how an incident started, they all need to get solved as quickly as possible. + +## Wartime vs Peacetime +Behavior during a major incident is very different from any other alert you may have received in the past. We call a major incident "wartime", and make a distinction between that and normal everyday operations ("peacetime"). + +### Peacetime +The organizational structure is generally based on seniority. The more senior members of a team will lead discussions, and managers or team leads will have the final say. Decisions are made after careful consideration of all options, and to minimize potential risk to customers. + +### Wartime +Wartime is different, and you will notice on our major incident calls that there's a different organizational structure. + +* The Incident Commander is in charge. 
No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO. +* Primary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service. +* Decisions will be made by the IC after consideration of the information presented. Once that decision is made, it is final. +* Riskier decisions can be made by the IC than would normally be considered during peacetime. + * For example, the IC may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else. +* The IC may go against a consensus decision. If a poll is done and 9 of 10 people agree but 1 disagrees, the IC may choose the dissenting option despite the majority vote. + * Even if you disagree, the IC's decision is final. During the call is not the time to argue with them. +* The IC may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before. +* You may be asked to leave the call by the IC, or you may even be forcibly kicked off a call. It is at the IC's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different than peacetime. diff --git a/during/during_an_incident/index.html b/during/during_an_incident/index.html deleted file mode 100644 index eb0bcc6..0000000 --- a/during/during_an_incident/index.html +++ /dev/null @@ -1,719 +0,0 @@ - - - - - - - - - - During An Incident - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

During An Incident

- -

Information on what to do during a major incident. See our severity level descriptions for what constitutes a major incident.

-
-

Documentation

-

For your own internal documentation, you should make sure that this page has all of the necessary information prominently displayed, such as phone bridge numbers, Slack rooms, and important chat commands. Here is an example:

-

- - - - - - - - - - - - - - - -
#incident-chat | https://a-voip-provider.com/incident-call | +1 555 BIG FIRE (+1 555 244 3473) / PIN: 123456
Need an IC? Do !ic page in Slack
For executive summary updates only, join #executive-summary-updates.

-
-
-

Security Incident?

-

If this is a security incident, you should follow the Security Incident Response process.

-
-

Don't Panic!#

-
    -
  1. -

    Join the incident call and chat (see links above).

    -
      -
    • Anyone is free to join the call or chat to observe and follow along with the incident.
    • -
    • If you wish to participate however, you should join both. If you can't join the call for some reason, you should have a dedicated proxy for the call. Disjointed discussions in the chat room are ultimately distracting.
    • -
    -
  2. -
  3. -

    Follow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand.

    -
      -
    • If you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once can become overwhelming, so we should try to maintain a hierarchical structure to the call if possible.
    • -
    -
  4. -
  5. -

    Follow instructions from the Incident Commander.

    -
      -
    • Is there no IC on the call?
        -
      • Manually page them with !ic page in Slack. This will page the primary and backup ICs at the same time.
      • -
      • Never hesitate to page the IC. It's much better to have them and not need them than the other way around.
      • -
      -
    • -
    -
  6. -
-

Steps for Incident Commander#

-

Resolve the incident as quickly and as safely as possible, use the Deputy to assist you. Delegate any tasks to relevant experts at your discretion.

-
    -
  1. -

    Announce on the call and in Slack that you are the incident commander, who you have designated as deputy (usually the backup IC), and scribe.

    -
  2. -
  3. -

    Identify whether there is an obvious cause of the incident (recent deployment, spike in traffic, etc.), and delegate investigation to relevant experts.

    -
      -
    • Use the service experts on the call to assist in the analysis. They should be able to quickly provide confirmation of the cause, but not always. It's the call of the IC on how to proceed in cases where the cause is not positively known. Confer with service owners and use their knowledge to help you.
    • -
    -
  4. -
  5. -

    Identify investigation & repair actions (roll back, rate-limit services, etc) and delegate actions to relevant service experts. Typically something like this (obviously not an exhaustive list),

    -
      -
    • Bad Deployment: Roll it back.
    • -
    • Web Application Stuck/Crashed: Do a rolling restart.
    • -
    • Event Flood: Validate automatic throttling is sufficient, adjust manually if not.
    • -
    • Data Center Outage: Validate automation has removed bad data center. Force it to do so if not.
    • -
    • Degraded Service Behavior without load: Gather forensic data (heap dumps, etc), and consider doing a rolling restart.
    • -
    -
  6. -
  7. -

    Listen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly.

    -
      -
    • Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly ("If in doubt, tweet it out").
    • -
    -
  8. -
  9. -

    Once the incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now.

    -
      -
    • Move the remaining, non-time-critical discussion to Slack.
    • -
    • Follow up to ensure the customer liaison wraps up the incident publicly.
    • -
    • Identify any post-incident clean-up work.
    • -
    • You may need to perform debriefing/analysis of the underlying root cause.
    • -
    -
  10. -
  11. -

    (After call ends) Create the post-mortem page from the template, and assign an owner to the post-mortem for the incident.

    -
  12. -
  13. -

    (After call ends) Send out an internal email explaining that we had a major incident, provide a link to the post-mortem.

    -
  14. -
-

Steps for Deputy#

-

You are there to support the IC in whatever they need.

-
    -
  1. -

    Monitor the status, and notify the IC if/when the incident escalates in severity level,

    -
      -
    • OfficerURL can help you to monitor the status on Slack,
        -
      • !status - Will tell you the current status.
      • -
      • !status stalk - Will continually monitor the status and report it to the room every 30s.
      • -
      -
    • -
    -
  2. -
  3. -

    Be prepared to page other people as directed by the Incident Commander.

    -
  4. -
  5. -

    Provide regular status updates in Slack (roughly every 30 minutes) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @here.

    -
  6. -
  7. -

    Follow instructions from the Incident Commander.

    -
  8. -
-

Steps for Scribe#

-

You are there to document the key information from the incident in Slack.

-
    -
  1. -

    Update the Slack room with who the IC is, who the Deputy is, and that you're the scribe (if not already done).

    -
      -
    • e.g. "IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson"
    • -
    -
  2. -
  3. -

    You should add notes to Slack when significant actions are taken, or findings are determined. You don't need to wait for the IC to direct this - use your own judgment.

    -
      -
    • You should also add TODO notes to the Slack room that indicate follow-ups slated for later.
    • -
    -
  4. -
  5. -

    Follow instructions from the Incident Commander.

    -
  6. -
-
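
Because follow-ups are prefixed with TODO, they can be collected from the chat history after the call. The following is a minimal sketch in Python, assuming the transcript has been exported as plain-text lines. The `extract_todos` helper and the sample messages are hypothetical and not part of any Slack tooling.

```python
import re

# Matches a message that starts with "TODO" (optionally followed by a colon).
TODO_PATTERN = re.compile(r"^\s*TODO\b:?\s*(.+)$", re.IGNORECASE)


def extract_todos(messages):
    """Return the text of every message that was prefixed with TODO."""
    todos = []
    for message in messages:
        match = TODO_PATTERN.match(message)
        if match:
            todos.append(match.group(1).strip())
    return todos


# Hypothetical transcript lines, in the style of the scribe notes above.
transcript = [
    "IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson",
    "Polled for decision on whether to perform rolling restart. We are proceeding with restart.",
    "TODO: Why did we not get paged for the fall in traffic on [X] cluster?",
    "todo check throttling limits after the incident",
]

for item in extract_todos(transcript):
    print(item)
```

The match is case-insensitive and anchored to the start of the message, so a note only counts as a follow-up when the scribe deliberately led with TODO.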

Steps for Subject Matter Experts#

-

You are there to support the incident commander in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on the repair actions.

-
    -
  1. -

    Investigate the incident by analyzing any graphs or logs at your disposal. Announce all findings to the incident commander.

    -
      -
    • If you are unsure of the cause, that's fine, state that you are investigating and provide regular updates to the IC.
    • -
    -
  2. -
  3. -

    Announce all suggestions for resolution to the incident commander. It is their decision how to proceed; do not take any action unless told to do so!

    -
  4. -
  5. -

    Follow instructions from the incident commander.

    -
  6. -
  7. -

    (Optional) Once the call is over and post-mortem is created, add any notes you think are relevant to the post-mortem page.

    -
  8. -
-

Steps for Customer Liaison#

-

Be on stand-by to post public facing messages regarding the incident.

-
    -
  1. -

    You will typically be required to update the status page and to send Tweets from our various accounts at certain times during the call.

    -
  2. -
  3. -

    Follow instructions from the Incident Commander.

    -
  4. -
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/during/security_incident_response/index.html b/during/security_incident_response/index.html deleted file mode 100644 index 31fa8eb..0000000 --- a/during/security_incident_response/index.html +++ /dev/null @@ -1,742 +0,0 @@ - - - - - - - - - - Security Incident - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- - -
-
- -

Security Incident

- -
-

Incident Commander Required

-

As with all major incidents at PagerDuty, security ones will also involve an Incident Commander, who will delegate the tasks to relevant resolvers. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity.

-
-

Checklist#

-

Details for each of these items are available in the next section.

-
    -
  1. Stop the attack in progress.
  2. -
  3. Cut off the attack vector.
  4. -
  5. Assemble the response team.
  6. -
  7. Isolate affected instances.
  8. -
  9. Identify timeline of attack.
  10. -
  11. Identify compromised data.
  12. -
  13. Assess risk to other systems.
  14. -
  15. Assess risk of re-attack.
  16. -
  17. Apply additional mitigations, additions to monitoring, etc.
  18. -
  19. Forensic analysis of compromised systems.
  20. -
  21. Internal communication.
  22. -
  23. Involve law enforcement.
  24. -
  25. Reach out to external parties that may have been used as vector for attack.
  26. -
  27. External communication.
  28. -
-
-

Attack Mitigation#

-

Stop the attack as quickly as you can, via any means necessary. Shut down servers, isolate them from the network, turn off a data center if you have to. Some common things to try:

-
    -
  • Shut down the instance from the provider console (do not delete or terminate if you can help it, as we'll need to do forensics).
  • -
  • If you happen to be logged into the box you can try to,
      -
    • Re-instate our default iptables rules to restrict traffic.
    • -
    • kill -9 any active session you think is an attacker.
    • -
    • Change root password, and update /etc/shadow to lock out all other users.
    • -
    • sudo shutdown now
    • -
    -
  • -
-

Cut Off Attack Vector#

-

Identify the likely attack vectors and patch/fix them so they cannot be re-exploited immediately after stopping the attack.

-
    -
  • If you suspect a third-party provider is compromised, delete all accounts except your own (and those of others who are physically present) and immediately rotate your password and MFA tokens.
  • -
  • If you suspect a service application was an attack vector, disable any relevant code paths, or shut down the service entirely.
  • -
-

Assemble Response Team#

-

Identify the key responders for the security incident, and keep them all in the loop. Set up a secure method of communicating all information associated with the incident. Details on the incident (or even the fact that an incident has occurred) should be kept private to the responders until you are confident the attack is not being triggered internally.

-
    -
  • The security and site-reliability teams should usually be involved.
  • -
  • A representative for any affected services should be involved.
  • -
  • An Incident Commander (IC) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate.
  • -
  • Do not communicate with anyone not on the response team about the incident until forensics has been performed. The attack could be happening internally.
  • -
  • Give the project an innocuous codename that can be used for chats/documents so if anyone overhears they don't realize it's a security incident. (e.g. sapphire-unicorn).
  • -
  • Prefix all emails, and chat topics with "Attorney Work Project".
  • -
-
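
An innocuous codename in the style of "sapphire-unicorn" can be picked at random rather than invented under pressure. This is a small sketch in Python; the word lists and the `make_codename` helper are hypothetical, and any neutral vocabulary would do.

```python
import secrets

# Hypothetical word lists; any neutral, non-security vocabulary works.
ADJECTIVES = ["sapphire", "amber", "quiet", "velvet", "copper"]
NOUNS = ["unicorn", "lantern", "harbor", "meadow", "falcon"]


def make_codename():
    """Pick a random adjective-noun pair, e.g. 'sapphire-unicorn'."""
    return f"{secrets.choice(ADJECTIVES)}-{secrets.choice(NOUNS)}"


print(make_codename())
```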

Isolate Affected Instances#

-

Any instances which were affected by the attack should be immediately isolated from any other instances. As soon as possible, an image of the system should be taken and put into a read-only cold storage for later forensic analysis.

-
    -
  • Blacklist the IP addresses for any affected instances from all other hosts.
  • -
  • Shut down the instances immediately if you didn't already do so to stop the attack.
  • -
  • Take a disk image for any disks attached to the instances, and ship them to an off-site cold storage location. You should make sure these images are read-only and cannot be tampered with.
  • -
-
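
One way to show later that a stored image has not been tampered with is to record a cryptographic digest when the image is taken and compare it before analysis. The process above does not prescribe this technique; it is an assumption of this sketch, and the helper names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_unchanged(path, recorded_digest):
    """Compare the current digest with the one recorded when the image was taken."""
    return sha256_of_file(path) == recorded_digest
```

The recorded digest should be stored separately from the image itself, so that someone who can alter the image cannot also alter the digest.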

Identify Timeline of Attack#

-

Work with all tools at your disposal to identify the timeline of the attack, along with exactly what the attacker did.

-
    -
  • Any reconnaissance the attacker performed on the system before the attack started.
  • -
  • When the attacker gained access to the system.
  • -
  • What actions the attacker performed on the system, and when.
  • -
  • Identify how long the attacker had access to the system before they were detected, and before they were kicked out.
  • -
  • Identify any queries the attacker ran on databases.
  • -
  • Try to identify if the attacker still has access to the system via another back door. Monitor logs for unusual activity, etc.
  • -
-
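
The timeline questions above come down to ordering events and measuring the gaps between them. The sketch below assumes events have already been pulled from logs by hand. The sample events, the "kind" labels, and the helper names are all hypothetical.

```python
from datetime import datetime

# Hypothetical events reconstructed from logs: (ISO 8601 timestamp, kind, note).
# The "kind" labels are an assumption made for this sketch.
RAW_EVENTS = [
    ("2016-03-02T09:41:00", "evicted", "Instance shut down, attacker locked out"),
    ("2016-03-01T22:15:00", "access", "First successful login by attacker"),
    ("2016-03-02T09:05:00", "detected", "Unusual queries noticed by on-call"),
    ("2016-03-01T21:50:00", "recon", "Port scan from attacker IP"),
]


def build_timeline(raw_events):
    """Parse timestamps and return the events in chronological order."""
    parsed = [(datetime.fromisoformat(ts), kind, note) for ts, kind, note in raw_events]
    return sorted(parsed, key=lambda event: event[0])


def first_time(timeline, kind):
    """Return the timestamp of the earliest event of the given kind."""
    return next(when for when, event_kind, _ in timeline if event_kind == kind)


timeline = build_timeline(RAW_EVENTS)
for when, kind, note in timeline:
    print(f"{when.isoformat()}  {kind:<9} {note}")

# How long the attacker had access before detection, and before eviction.
access = first_time(timeline, "access")
undetected_for = first_time(timeline, "detected") - access
total_access = first_time(timeline, "evicted") - access
print("Access before detection:", undetected_for)
print("Access before eviction:", total_access)
```

Keeping timestamps in one format and one time zone before sorting avoids most ordering mistakes when events come from several systems.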

Compromised Data#

-

Using forensic analysis of log files, time-series graphs, and any other information/tools at your disposal, attempt to identify what information was compromised (if any),

-
    -
  • Identify any data that was compromised during the attack.
      -
    • Was any data exfiltrated from a database?
    • -
    • What keys were on the system that are now considered compromised?
    • -
    • Was the attacker able to identify other components of the system (map out the network, etc)?
    • -
    -
  • -
  • Find exactly what customer data has been compromised, if any.
  • -
-

Assess Risk#

-

Based on the data that was compromised, assess the risk to other systems.

-
    -
  • Does the attacker have enough information to find another way in?
  • -
  • Were any passwords or keys stored on the host? If so, they should be considered compromised, regardless of how they were stored.
  • -
  • The owners of any user accounts that were used in the initial attack should rotate all of their keys and passwords on every other system where they have an account.
  • -
-

Apply Additional Mitigations#

-

Start applying mitigations to other parts of your system.

-
    -
  • Rotate any compromised data.
  • -
  • Identify any new alerting which is needed to notify of a similar breach.
  • -
  • Block any IP addresses associated with the attack.
  • -
  • Identify any keys/credentials that are compromised and revoke their access immediately.
  • -
-

Forensic Analysis#

-

Once you are confident the systems are secured, and enough monitoring is in place to detect another attack, you can move onto the forensic analysis stage.

-
    -
  • Take any read-only images you created, any access logs you have, and comb through them for more information about the attack.
  • -
  • Identify exactly what happened, how it happened, and how to prevent it in future.
  • -
  • Keep track of all IP addresses involved in the attack.
  • -
  • Monitor logs for any attempt to regain access to the system by the attacker.
  • -
-

Internal Communication#

-

Delegate to: VP or Director of Engineering

-

Communicate internally only once you are confident (via forensic analysis) that the attack was not sourced internally.

-
    -
  • Don't go into too much detail.
  • -
  • Overview the timeline.
  • -
  • Discuss mitigation steps taken.
  • -
  • Follow up with more information once it is known.
  • -
-

Liaise With Law Enforcement / External Actors#

-

Delegate to: VP or Director of Engineering

-

Work with law enforcement to identify the source of the attack, letting any system owners know that systems under their control may be compromised, etc.

-
    -
  • Contact local law enforcement.
  • -
  • Contact FBI.
  • -
  • Contact operators for any systems used in the attack, their systems may also have been compromised.
  • -
  • Contact security companies to help in assessing risk and any PR next steps.
  • -
-

External Communication#

-

Delegate to: Marketing Team

-

Prepare and release a public statement only once you have validated that all of the information you have is accurate, have a timeline of events, know exactly what information was compromised and how, and are sure that it won't happen again. The statement should inform customers of the compromised information and any steps they need to take.

-
    -
  • Include the date in the title of any announcement, so that it's never confused for a potential new breach.
  • -
  • Don't say "We take security very seriously". It makes everyone cringe when they read it.
  • -
  • Be honest, accept responsibility, and present the facts, along with exactly how we plan to prevent such things in future.
  • -
  • Be as detailed as possible with the timeline.
  • -
  • Be as detailed as possible in what information was compromised, and how it affects customers. If we were storing something we shouldn't have been, be honest about it. It'll come out later and it'll be much worse.
  • -
  • Don't name and shame any external parties that might have caused the compromise. It's bad form. (Unless they've already publicly disclosed, in which case we can link to their disclosure).
  • -
  • Release the external communication as soon as possible, preferably within a few days of the compromise. The longer we wait, the worse it will be.
  • -
  • Figure out if there is a way to get in touch with customers' internal security teams before the general public notice is sent.
  • -
-
-

Additional Reading#

- - - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/index.html b/index.html deleted file mode 100644 index dd68b11..0000000 --- a/index.html +++ /dev/null @@ -1,583 +0,0 @@ - - - - - - - - - - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Spearhead Systems Incident Response Documentation

- -

This documentation covers parts of the Spearhead Systems Issue Response process. It is a copy of PagerDuty's documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the about page for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our existing wiki and may not yet be open sourced.

Issue, Incident and Service Request

At Spearhead we use the term issue to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". Note that we use the term Incident to describe both service requests and incidents. For brevity, we use SR and IN throughout this documentation.

A "service request" is usually initiated by a human and is generally not critical to the normal functioning of the business, while an "incident" is an issue that is causing, or can cause, an interruption to normal business functions.

Issue Response at Spearhead Systems

Being On-Call

If you've never been on-call before, you might be wondering what it's all about. These pages describe what the expectations of being on-call are, along with some resources to help you.

  • Being On-Call - A guide to being on-call: what your responsibilities are, and what they are not.
  • Alerting Principles - The principles we use to determine which events page an engineer, and at what time of day they page.

Before an Incident

Reading material for things you probably want to know before an incident occurs. You likely don't want to be reading these during an actual incident.

  • Severity Levels - Information on our severity level classification. What constitutes a Low issue? What is a "Major Incident"?
  • Different Roles for Incidents - Information on the roles during an incident: Incident Commander, Scribe, etc.
  • Incident Call Etiquette - Our etiquette guidelines for incident calls, to read before you find yourself in one.

During an Incident

Information and processes during an incident.

After an Incident

Our follow-up processes: how we make sure we don't repeat mistakes and keep improving.

  • Post-Mortem Process - Information on our post-mortem process: what's involved and how to write or run a post-mortem.
  • Post-Mortem Template - The template we use for writing our post-mortems for major incidents.

Training

So, you want to learn about incident response? You've come to the right place.

Additional Reading

Useful material and resources from external parties that are relevant to incident response.

- - - - - - \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml new file mode 100644 index 0000000..6a10c5c --- /dev/null +++ b/mkdocs.yml @@ -0,0 +1,64 @@ +# Project Information +site_name: Spearhead Systems Incident Response Documentation +site_description: A collection of information about the Spearhead Systems incident response process. Not only how to prepare new employees for on-call responsibilities, but also how to handle major incidents, both in preparation and after-work. +site_author: Spearhead Systems, Inc. +site_favicon: 'assets/img/icon.png' +site_url: https://response.spearhead.systems + +# Repository +repo_name: 'GitHub' +repo_url: https://github.com/spearheadsys/issue-response-docs + +# Copyright +copyright: 'Copyright © Spearhead Systems, Inc.' + +# Theme +theme: 'material' +theme_dir: 'theme' +extra_css: ['assets/css/extra.css'] +extra: + logo: 'assets/img/icon.png' + cover: 'assets/img/cover.png' + palette: + primary: 'green' + accent: 'blue grey' + font: + text: 'Colfax Regular' + code: 'Roboto Mono' + author: + github: 'spearheadsys' + twitter: 'spearhead_sys' + +# Contents +pages: + - Home: 'index.md' + - On-Call: + - Being On-Call: 'oncall/being_oncall.md' + - Alerting Principles: 'oncall/alerting_principles.md' + - Before an Incident: + - Severity Levels: 'before/severity_levels.md' + - Different Roles: 'before/different_roles.md' + - Call Etiquette: 'before/call_etiquette.md' + - During an Incident: + - During An Incident: 'during/during_an_incident.md' + - Security Incident: 'during/security_incident_response.md' + - After an Incident: + - Post-Mortem Process: 'after/post_mortem_process.md' + - Post-Mortem Template: 'after/post_mortem_template.md' + - Training: + - Overview: 'training/overview.md' + - Incident Commander: 'training/incident_commander.md' + - Deputy: 'training/deputy.md' + - Scribe: 'training/scribe.md' + - Subject Matter Expert: 'training/subject_matter_expert.md' + - Glossary: 'training/glossary.md' + - 
About: 'about.md' + +# Analytics +# google_analytics: ['UA-8759953-1', 'auto'] + +# Extensions +markdown_extensions: + - toc(permalink=#) + - sane_lists: + - admonition: diff --git a/mkdocs/js/lunr.min.js b/mkdocs/js/lunr.min.js deleted file mode 100644 index b0198df..0000000 --- a/mkdocs/js/lunr.min.js +++ /dev/null @@ -1,7 +0,0 @@ -/** - * lunr - http://lunrjs.com - A bit like Solr, but much smaller and not as bright - 0.7.0 - * Copyright (C) 2016 Oliver Nightingale - * MIT Licensed - * @license - */ -!function(){var t=function(e){var n=new t.Index;return n.pipeline.add(t.trimmer,t.stopWordFilter,t.stemmer),e&&e.call(n,n),n};t.version="0.7.0",t.utils={},t.utils.warn=function(t){return function(e){t.console&&console.warn&&console.warn(e)}}(this),t.utils.asString=function(t){return void 0===t||null===t?"":t.toString()},t.EventEmitter=function(){this.events={}},t.EventEmitter.prototype.addListener=function(){var t=Array.prototype.slice.call(arguments),e=t.pop(),n=t;if("function"!=typeof e)throw new TypeError("last argument must be a function");n.forEach(function(t){this.hasHandler(t)||(this.events[t]=[]),this.events[t].push(e)},this)},t.EventEmitter.prototype.removeListener=function(t,e){if(this.hasHandler(t)){var n=this.events[t].indexOf(e);this.events[t].splice(n,1),this.events[t].length||delete this.events[t]}},t.EventEmitter.prototype.emit=function(t){if(this.hasHandler(t)){var e=Array.prototype.slice.call(arguments,1);this.events[t].forEach(function(t){t.apply(void 0,e)})}},t.EventEmitter.prototype.hasHandler=function(t){return t in this.events},t.tokenizer=function(e){return arguments.length&&null!=e&&void 0!=e?Array.isArray(e)?e.map(function(e){return t.utils.asString(e).toLowerCase()}):e.toString().trim().toLowerCase().split(t.tokenizer.seperator):[]},t.tokenizer.seperator=/[\s\-]+/,t.tokenizer.load=function(t){var e=this.registeredFunctions[t];if(!e)throw new Error("Cannot load un-registered function: "+t);return 
e},t.tokenizer.label="default",t.tokenizer.registeredFunctions={"default":t.tokenizer},t.tokenizer.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing tokenizer: "+n),e.label=n,this.registeredFunctions[n]=e},t.Pipeline=function(){this._stack=[]},t.Pipeline.registeredFunctions={},t.Pipeline.registerFunction=function(e,n){n in this.registeredFunctions&&t.utils.warn("Overwriting existing registered function: "+n),e.label=n,t.Pipeline.registeredFunctions[e.label]=e},t.Pipeline.warnIfFunctionNotRegistered=function(e){var n=e.label&&e.label in this.registeredFunctions;n||t.utils.warn("Function is not registered with pipeline. This may cause problems when serialising the index.\n",e)},t.Pipeline.load=function(e){var n=new t.Pipeline;return e.forEach(function(e){var i=t.Pipeline.registeredFunctions[e];if(!i)throw new Error("Cannot load un-registered function: "+e);n.add(i)}),n},t.Pipeline.prototype.add=function(){var e=Array.prototype.slice.call(arguments);e.forEach(function(e){t.Pipeline.warnIfFunctionNotRegistered(e),this._stack.push(e)},this)},t.Pipeline.prototype.after=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e);if(-1==i)throw new Error("Cannot find existingFn");i+=1,this._stack.splice(i,0,n)},t.Pipeline.prototype.before=function(e,n){t.Pipeline.warnIfFunctionNotRegistered(n);var i=this._stack.indexOf(e);if(-1==i)throw new Error("Cannot find existingFn");this._stack.splice(i,0,n)},t.Pipeline.prototype.remove=function(t){var e=this._stack.indexOf(t);-1!=e&&this._stack.splice(e,1)},t.Pipeline.prototype.run=function(t){for(var e=[],n=t.length,i=this._stack.length,r=0;n>r;r++){for(var o=t[r],s=0;i>s&&(o=this._stack[s](o,r,t),void 0!==o&&""!==o);s++);void 0!==o&&""!==o&&e.push(o)}return e},t.Pipeline.prototype.reset=function(){this._stack=[]},t.Pipeline.prototype.toJSON=function(){return this._stack.map(function(e){return 
t.Pipeline.warnIfFunctionNotRegistered(e),e.label})},t.Vector=function(){this._magnitude=null,this.list=void 0,this.length=0},t.Vector.Node=function(t,e,n){this.idx=t,this.val=e,this.next=n},t.Vector.prototype.insert=function(e,n){this._magnitude=void 0;var i=this.list;if(!i)return this.list=new t.Vector.Node(e,n,i),this.length++;if(en.idx?n=n.next:(i+=e.val*n.val,e=e.next,n=n.next);return i},t.Vector.prototype.similarity=function(t){return this.dot(t)/(this.magnitude()*t.magnitude())},t.SortedSet=function(){this.length=0,this.elements=[]},t.SortedSet.load=function(t){var e=new this;return e.elements=t,e.length=t.length,e},t.SortedSet.prototype.add=function(){var t,e;for(t=0;t1;){if(o===t)return r;t>o&&(e=r),o>t&&(n=r),i=n-e,r=e+Math.floor(i/2),o=this.elements[r]}return o===t?r:-1},t.SortedSet.prototype.locationFor=function(t){for(var e=0,n=this.elements.length,i=n-e,r=e+Math.floor(i/2),o=this.elements[r];i>1;)t>o&&(e=r),o>t&&(n=r),i=n-e,r=e+Math.floor(i/2),o=this.elements[r];return o>t?r:t>o?r+1:void 0},t.SortedSet.prototype.intersect=function(e){for(var n=new t.SortedSet,i=0,r=0,o=this.length,s=e.length,a=this.elements,h=e.elements;;){if(i>o-1||r>s-1)break;a[i]!==h[r]?a[i]h[r]&&r++:(n.add(a[i]),i++,r++)}return n},t.SortedSet.prototype.clone=function(){var e=new t.SortedSet;return e.elements=this.toArray(),e.length=e.elements.length,e},t.SortedSet.prototype.union=function(t){var e,n,i;this.length>=t.length?(e=this,n=t):(e=t,n=this),i=e.clone();for(var r=0,o=n.toArray();rp;p++)c[p]===a&&d++;h+=d/f*l.boost}}this.tokenStore.add(a,{ref:o,tf:h})}n&&this.eventEmitter.emit("add",e,this)},t.Index.prototype.remove=function(t,e){var n=t[this._ref],e=void 0===e?!0:e;if(this.documentStore.has(n)){var i=this.documentStore.get(n);this.documentStore.remove(n),i.forEach(function(t){this.tokenStore.remove(t,n)},this),e&&this.eventEmitter.emit("remove",t,this)}},t.Index.prototype.update=function(t,e){var e=void 
0===e?!0:e;this.remove(t,!1),this.add(t,!1),e&&this.eventEmitter.emit("update",t,this)},t.Index.prototype.idf=function(t){var e="@"+t;if(Object.prototype.hasOwnProperty.call(this._idfCache,e))return this._idfCache[e];var n=this.tokenStore.count(t),i=1;return n>0&&(i=1+Math.log(this.documentStore.length/n)),this._idfCache[e]=i},t.Index.prototype.search=function(e){var n=this.pipeline.run(this.tokenizerFn(e)),i=new t.Vector,r=[],o=this._fields.reduce(function(t,e){return t+e.boost},0),s=n.some(function(t){return this.tokenStore.has(t)},this);if(!s)return[];n.forEach(function(e,n,s){var a=1/s.length*this._fields.length*o,h=this,u=this.tokenStore.expand(e).reduce(function(n,r){var o=h.corpusTokens.indexOf(r),s=h.idf(r),u=1,l=new t.SortedSet;if(r!==e){var c=Math.max(3,r.length-e.length);u=1/Math.log(c)}o>-1&&i.insert(o,a*s*u);for(var f=h.tokenStore.get(r),d=Object.keys(f),p=d.length,v=0;p>v;v++)l.add(f[d[v]].ref);return n.union(l)},new t.SortedSet);r.push(u)},this);var a=r.reduce(function(t,e){return t.intersect(e)});return a.map(function(t){return{ref:t,score:i.similarity(this.documentVector(t))}},this).sort(function(t,e){return e.score-t.score})},t.Index.prototype.documentVector=function(e){for(var n=this.documentStore.get(e),i=n.length,r=new t.Vector,o=0;i>o;o++){var s=n.elements[o],a=this.tokenStore.get(s)[e].tf,h=this.idf(s);r.insert(this.corpusTokens.indexOf(s),a*h)}return r},t.Index.prototype.toJSON=function(){return{version:t.version,fields:this._fields,ref:this._ref,tokenizer:this.tokenizerFn.label,documentStore:this.documentStore.toJSON(),tokenStore:this.tokenStore.toJSON(),corpusTokens:this.corpusTokens.toJSON(),pipeline:this.pipeline.toJSON()}},t.Index.prototype.use=function(t){var e=Array.prototype.slice.call(arguments,1);e.unshift(this),t.apply(this,e)},t.Store=function(){this.store={},this.length=0},t.Store.load=function(e){var n=new this;return n.length=e.length,n.store=Object.keys(e.store).reduce(function(n,i){return 
n[i]=t.SortedSet.load(e.store[i]),n},{}),n},t.Store.prototype.set=function(t,e){this.has(t)||this.length++,this.store[t]=e},t.Store.prototype.get=function(t){return this.store[t]},t.Store.prototype.has=function(t){return t in this.store},t.Store.prototype.remove=function(t){this.has(t)&&(delete this.store[t],this.length--)},t.Store.prototype.toJSON=function(){return{store:this.store,length:this.length}},t.stemmer=function(){var t={ational:"ate",tional:"tion",enci:"ence",anci:"ance",izer:"ize",bli:"ble",alli:"al",entli:"ent",eli:"e",ousli:"ous",ization:"ize",ation:"ate",ator:"ate",alism:"al",iveness:"ive",fulness:"ful",ousness:"ous",aliti:"al",iviti:"ive",biliti:"ble",logi:"log"},e={icate:"ic",ative:"",alize:"al",iciti:"ic",ical:"ic",ful:"",ness:""},n="[^aeiou]",i="[aeiouy]",r=n+"[^aeiouy]*",o=i+"[aeiou]*",s="^("+r+")?"+o+r,a="^("+r+")?"+o+r+"("+o+")?$",h="^("+r+")?"+o+r+o+r,u="^("+r+")?"+i,l=new RegExp(s),c=new RegExp(h),f=new RegExp(a),d=new RegExp(u),p=/^(.+?)(ss|i)es$/,v=/^(.+?)([^s])s$/,g=/^(.+?)eed$/,m=/^(.+?)(ed|ing)$/,y=/.$/,S=/(at|bl|iz)$/,w=new RegExp("([^aeiouylsz])\\1$"),k=new RegExp("^"+r+i+"[^aeiouwxy]$"),x=/^(.+?[^aeiou])y$/,b=/^(.+?)(ational|tional|enci|anci|izer|bli|alli|entli|eli|ousli|ization|ation|ator|alism|iveness|fulness|ousness|aliti|iviti|biliti|logi)$/,E=/^(.+?)(icate|ative|alize|iciti|ical|ful|ness)$/,F=/^(.+?)(al|ance|ence|er|ic|able|ible|ant|ement|ment|ent|ou|ism|ate|iti|ous|ive|ize)$/,_=/^(.+?)(s|t)(ion)$/,z=/^(.+?)e$/,O=/ll$/,P=new RegExp("^"+r+i+"[^aeiouwxy]$"),T=function(n){var i,r,o,s,a,h,u;if(n.length<3)return n;if(o=n.substr(0,1),"y"==o&&(n=o.toUpperCase()+n.substr(1)),s=p,a=v,s.test(n)?n=n.replace(s,"$1$2"):a.test(n)&&(n=n.replace(a,"$1$2")),s=g,a=m,s.test(n)){var T=s.exec(n);s=l,s.test(T[1])&&(s=y,n=n.replace(s,""))}else if(a.test(n)){var T=a.exec(n);i=T[1],a=d,a.test(i)&&(n=i,a=S,h=w,u=k,a.test(n)?n+="e":h.test(n)?(s=y,n=n.replace(s,"")):u.test(n)&&(n+="e"))}if(s=x,s.test(n)){var T=s.exec(n);i=T[1],n=i+"i"}if(s=b,s.test(n)){var 
T=s.exec(n);i=T[1],r=T[2],s=l,s.test(i)&&(n=i+t[r])}if(s=E,s.test(n)){var T=s.exec(n);i=T[1],r=T[2],s=l,s.test(i)&&(n=i+e[r])}if(s=F,a=_,s.test(n)){var T=s.exec(n);i=T[1],s=c,s.test(i)&&(n=i)}else if(a.test(n)){var T=a.exec(n);i=T[1]+T[2],a=c,a.test(i)&&(n=i)}if(s=z,s.test(n)){var T=s.exec(n);i=T[1],s=c,a=f,h=P,(s.test(i)||a.test(i)&&!h.test(i))&&(n=i)}return s=O,a=c,s.test(n)&&a.test(n)&&(s=y,n=n.replace(s,"")),"y"==o&&(n=o.toLowerCase()+n.substr(1)),n};return T}(),t.Pipeline.registerFunction(t.stemmer,"stemmer"),t.generateStopWordFilter=function(t){var e=t.reduce(function(t,e){return t[e]=e,t},{});return function(t){return t&&e[t]!==t?t:void 0}},t.stopWordFilter=t.generateStopWordFilter(["a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your"]),t.Pipeline.registerFunction(t.stopWordFilter,"stopWordFilter"),t.trimmer=function(t){return t.replace(/^\W+/,"").replace(/\W+$/,"")},t.Pipeline.registerFunction(t.trimmer,"trimmer"),t.TokenStore=function(){this.root={docs:{}},this.length=0},t.TokenStore.load=function(t){var e=new this;return e.root=t.root,e.length=t.length,e},t.TokenStore.prototype.add=function(t,e,n){var n=n||this.root,i=t.charAt(0),r=t.slice(1);return i in 
n||(n[i]={docs:{}}),0===r.length?(n[i].docs[e.ref]=e,void(this.length+=1)):this.add(r,e,n[i])},t.TokenStore.prototype.has=function(t){if(!t)return!1;for(var e=this.root,n=0;n":">",'"':""","'":"'","/":"/"};function escapeHtml(string){return String(string).replace(/[&<>"'\/]/g,function(s){return entityMap[s]})}var whiteRe=/\s*/;var spaceRe=/\s+/;var equalsRe=/\s*=/;var curlyRe=/\s*\}/;var tagRe=/#|\^|\/|>|\{|&|=|!/;function parseTemplate(template,tags){if(!template)return[];var sections=[];var tokens=[];var spaces=[];var hasTag=false;var nonSpace=false;function stripSpace(){if(hasTag&&!nonSpace){while(spaces.length)delete tokens[spaces.pop()]}else{spaces=[]}hasTag=false;nonSpace=false}var openingTagRe,closingTagRe,closingCurlyRe;function compileTags(tags){if(typeof tags==="string")tags=tags.split(spaceRe,2);if(!isArray(tags)||tags.length!==2)throw new Error("Invalid tags: "+tags);openingTagRe=new RegExp(escapeRegExp(tags[0])+"\\s*");closingTagRe=new RegExp("\\s*"+escapeRegExp(tags[1]));closingCurlyRe=new RegExp("\\s*"+escapeRegExp("}"+tags[1]))}compileTags(tags||mustache.tags);var scanner=new Scanner(template);var start,type,value,chr,token,openSection;while(!scanner.eos()){start=scanner.pos;value=scanner.scanUntil(openingTagRe);if(value){for(var i=0,valueLength=value.length;i0?sections[sections.length-1][4]:nestedTokens;break;default:collector.push(token)}}return nestedTokens}function Scanner(string){this.string=string;this.tail=string;this.pos=0}Scanner.prototype.eos=function(){return this.tail===""};Scanner.prototype.scan=function(re){var match=this.tail.match(re);if(!match||match.index!==0)return"";var string=match[0];this.tail=this.tail.substring(string.length);this.pos+=string.length;return string};Scanner.prototype.scanUntil=function(re){var index=this.tail.search(re),match;switch(index){case-1:match=this.tail;this.tail="";break;case 0:match="";break;default:match=this.tail.substring(0,index);this.tail=this.tail.substring(index)}this.pos+=match.length;return 
match};function Context(view,parentContext){this.view=view;this.cache={".":this.view};this.parent=parentContext}Context.prototype.push=function(view){return new Context(view,this)};Context.prototype.lookup=function(name){var cache=this.cache;var value;if(name in cache){value=cache[name]}else{var context=this,names,index,lookupHit=false;while(context){if(name.indexOf(".")>0){value=context.view;names=name.split(".");index=0;while(value!=null&&index")value=this._renderPartial(token,context,partials,originalTemplate);else if(symbol==="&")value=this._unescapedValue(token,context);else if(symbol==="name")value=this._escapedValue(token,context);else if(symbol==="text")value=this._rawValue(token);if(value!==undefined)buffer+=value}return buffer};Writer.prototype._renderSection=function(token,context,partials,originalTemplate){var self=this;var buffer="";var value=context.lookup(token[1]);function subRender(template){return self.render(template,context,partials)}if(!value)return;if(isArray(value)){for(var j=0,valueLength=value.length;jthis.depCount&&!this.defined){if(G(l)){if(this.events.error&&this.map.isDefine||g.onError!==ca)try{f=i.execCb(c,l,b,f)}catch(d){a=d}else f=i.execCb(c,l,b,f);this.map.isDefine&&void 0===f&&((b=this.module)?f=b.exports:this.usingExports&& -(f=this.exports));if(a)return a.requireMap=this.map,a.requireModules=this.map.isDefine?[this.map.id]:null,a.requireType=this.map.isDefine?"define":"require",w(this.error=a)}else f=l;this.exports=f;if(this.map.isDefine&&!this.ignore&&(r[c]=f,g.onResourceLoad))g.onResourceLoad(i,this.map,this.depMaps);y(c);this.defined=!0}this.defining=!1;this.defined&&!this.defineEmitted&&(this.defineEmitted=!0,this.emit("defined",this.exports),this.defineEmitComplete=!0)}}else this.fetch()}},callPlugin:function(){var a= -this.map,b=a.id,d=p(a.prefix);this.depMaps.push(d);q(d,"defined",u(this,function(f){var l,d;d=m(aa,this.map.id);var 
e=this.map.name,P=this.map.parentMap?this.map.parentMap.name:null,n=i.makeRequire(a.parentMap,{enableBuildCallback:!0});if(this.map.unnormalized){if(f.normalize&&(e=f.normalize(e,function(a){return c(a,P,!0)})||""),f=p(a.prefix+"!"+e,this.map.parentMap),q(f,"defined",u(this,function(a){this.init([],function(){return a},null,{enabled:!0,ignore:!0})})),d=m(h,f.id)){this.depMaps.push(f); -if(this.events.error)d.on("error",u(this,function(a){this.emit("error",a)}));d.enable()}}else d?(this.map.url=i.nameToUrl(d),this.load()):(l=u(this,function(a){this.init([],function(){return a},null,{enabled:!0})}),l.error=u(this,function(a){this.inited=!0;this.error=a;a.requireModules=[b];B(h,function(a){0===a.map.id.indexOf(b+"_unnormalized")&&y(a.map.id)});w(a)}),l.fromText=u(this,function(f,c){var d=a.name,e=p(d),P=M;c&&(f=c);P&&(M=!1);s(e);t(j.config,b)&&(j.config[d]=j.config[b]);try{g.exec(f)}catch(h){return w(C("fromtexteval", -"fromText eval for "+b+" failed: "+h,h,[b]))}P&&(M=!0);this.depMaps.push(e);i.completeLoad(d);n([d],l)}),f.load(a.name,n,l,j))}));i.enable(d,this);this.pluginMaps[d.id]=d},enable:function(){V[this.map.id]=this;this.enabling=this.enabled=!0;v(this.depMaps,u(this,function(a,b){var c,f;if("string"===typeof a){a=p(a,this.map.isDefine?this.map:this.map.parentMap,!1,!this.skipMap);this.depMaps[b]=a;if(c=m(L,a.id)){this.depExports[b]=c(this);return}this.depCount+=1;q(a,"defined",u(this,function(a){this.defineDep(b, -a);this.check()}));this.errback?q(a,"error",u(this,this.errback)):this.events.error&&q(a,"error",u(this,function(a){this.emit("error",a)}))}c=a.id;f=h[c];!t(L,c)&&(f&&!f.enabled)&&i.enable(a,this)}));B(this.pluginMaps,u(this,function(a){var b=m(h,a.id);b&&!b.enabled&&i.enable(a,this)}));this.enabling=!1;this.check()},on:function(a,b){var c=this.events[a];c||(c=this.events[a]=[]);c.push(b)},emit:function(a,b){v(this.events[a],function(a){a(b)});"error"===a&&delete this.events[a]}};i={config:j,contextName:b, 
-registry:h,defined:r,urlFetched:S,defQueue:A,Module:Z,makeModuleMap:p,nextTick:g.nextTick,onError:w,configure:function(a){a.baseUrl&&"/"!==a.baseUrl.charAt(a.baseUrl.length-1)&&(a.baseUrl+="/");var b=j.shim,c={paths:!0,bundles:!0,config:!0,map:!0};B(a,function(a,b){c[b]?(j[b]||(j[b]={}),U(j[b],a,!0,!0)):j[b]=a});a.bundles&&B(a.bundles,function(a,b){v(a,function(a){a!==b&&(aa[a]=b)})});a.shim&&(B(a.shim,function(a,c){H(a)&&(a={deps:a});if((a.exports||a.init)&&!a.exportsFn)a.exportsFn=i.makeShimExports(a); -b[c]=a}),j.shim=b);a.packages&&v(a.packages,function(a){var b,a="string"===typeof a?{name:a}:a;b=a.name;a.location&&(j.paths[b]=a.location);j.pkgs[b]=a.name+"/"+(a.main||"main").replace(ia,"").replace(Q,"")});B(h,function(a,b){!a.inited&&!a.map.unnormalized&&(a.map=p(b))});if(a.deps||a.callback)i.require(a.deps||[],a.callback)},makeShimExports:function(a){return function(){var b;a.init&&(b=a.init.apply(ba,arguments));return b||a.exports&&da(a.exports)}},makeRequire:function(a,e){function j(c,d,m){var n, -q;e.enableBuildCallback&&(d&&G(d))&&(d.__requireJsBuild=!0);if("string"===typeof c){if(G(d))return w(C("requireargs","Invalid require call"),m);if(a&&t(L,c))return L[c](h[a.id]);if(g.get)return g.get(i,c,a,j);n=p(c,a,!1,!0);n=n.id;return!t(r,n)?w(C("notloaded",'Module name "'+n+'" has not been loaded yet for context: '+b+(a?"":". 
Use require([])"))):r[n]}J();i.nextTick(function(){J();q=s(p(null,a));q.skipMap=e.skipMap;q.init(c,d,m,{enabled:!0});D()});return j}e=e||{};U(j,{isBrowser:z,toUrl:function(b){var d, -e=b.lastIndexOf("."),k=b.split("/")[0];if(-1!==e&&(!("."===k||".."===k)||1e.attachEvent.toString().indexOf("[native code"))&& -!Y?(M=!0,e.attachEvent("onreadystatechange",b.onScriptLoad)):(e.addEventListener("load",b.onScriptLoad,!1),e.addEventListener("error",b.onScriptError,!1)),e.src=d,J=e,D?y.insertBefore(e,D):y.appendChild(e),J=null,e;if(ea)try{importScripts(d),b.completeLoad(c)}catch(m){b.onError(C("importscripts","importScripts failed for "+c+" at "+d,m,[c]))}};z&&!q.skipDataMain&&T(document.getElementsByTagName("script"),function(b){y||(y=b.parentNode);if(I=b.getAttribute("data-main"))return s=I,q.baseUrl||(E=s.split("/"), -s=E.pop(),O=E.length?E.join("/")+"/":"./",q.baseUrl=O),s=s.replace(Q,""),g.jsExtRegExp.test(s)&&(s=I),q.deps=q.deps?q.deps.concat(s):[s],!0});define=function(b,c,d){var e,g;"string"!==typeof b&&(d=c,c=b,b=null);H(c)||(d=c,c=null);!c&&G(d)&&(c=[],d.length&&(d.toString().replace(ka,"").replace(la,function(b,d){c.push(d)}),c=(1===d.length?["require"]:["require","exports","module"]).concat(c)));if(M){if(!(e=J))N&&"interactive"===N.readyState||T(document.getElementsByTagName("script"),function(b){if("interactive"=== -b.readyState)return N=b}),e=N;e&&(b||(b=e.getAttribute("data-requiremodule")),g=F[e.getAttribute("data-requirecontext")])}(g?g.defQueue:R).push([b,c,d])};define.amd={jQuery:!0};g.exec=function(b){return eval(b)};g(q)}})(this); diff --git a/mkdocs/js/search-results-template.mustache b/mkdocs/js/search-results-template.mustache deleted file mode 100644 index a8b3862..0000000 --- a/mkdocs/js/search-results-template.mustache +++ /dev/null @@ -1,4 +0,0 @@ - diff --git a/mkdocs/js/search.js b/mkdocs/js/search.js deleted file mode 100644 index d5c8661..0000000 --- a/mkdocs/js/search.js +++ /dev/null @@ -1,88 +0,0 @@ -require([ - base_url + 
'/mkdocs/js/mustache.min.js', - base_url + '/mkdocs/js/lunr.min.js', - 'text!search-results-template.mustache', - 'text!../search_index.json', -], function (Mustache, lunr, results_template, data) { - "use strict"; - - function getSearchTerm() - { - var sPageURL = window.location.search.substring(1); - var sURLVariables = sPageURL.split('&'); - for (var i = 0; i < sURLVariables.length; i++) - { - var sParameterName = sURLVariables[i].split('='); - if (sParameterName[0] == 'q') - { - return decodeURIComponent(sParameterName[1].replace(/\+/g, '%20')); - } - } - } - - var index = lunr(function () { - this.field('title', {boost: 10}); - this.field('text'); - this.ref('location'); - }); - - data = JSON.parse(data); - var documents = {}; - - for (var i=0; i < data.docs.length; i++){ - var doc = data.docs[i]; - doc.location = base_url + doc.location; - index.add(doc); - documents[doc.location] = doc; - } - - var search = function(){ - - var query = document.getElementById('mkdocs-search-query').value; - var search_results = document.getElementById("mkdocs-search-results"); - while (search_results.firstChild) { - search_results.removeChild(search_results.firstChild); - } - - if(query === ''){ - return; - } - - var results = index.search(query); - - if (results.length > 0){ - for (var i=0; i < results.length; i++){ - var result = results[i]; - doc = documents[result.ref]; - doc.base_url = base_url; - doc.summary = doc.text.substring(0, 200); - var html = Mustache.to_html(results_template, doc); - search_results.insertAdjacentHTML('beforeend', html); - } - } else { - search_results.insertAdjacentHTML('beforeend', "

No results found

"); - } - - if(jQuery){ - /* - * We currently only automatically hide bootstrap models. This - * requires jQuery to work. - */ - jQuery('#mkdocs_search_modal a').click(function(){ - jQuery('#mkdocs_search_modal').modal('hide'); - }); - } - - }; - - var search_input = document.getElementById('mkdocs-search-query'); - - var term = getSearchTerm(); - if (term){ - search_input.value = term; - search(); - } - - search_input.addEventListener("keyup", search); - -}); diff --git a/mkdocs/js/text.js b/mkdocs/js/text.js deleted file mode 100644 index 17921b6..0000000 --- a/mkdocs/js/text.js +++ /dev/null @@ -1,390 +0,0 @@ -/** - * @license RequireJS text 2.0.12 Copyright (c) 2010-2014, The Dojo Foundation All Rights Reserved. - * Available via the MIT or new BSD license. - * see: http://github.com/requirejs/text for details - */ -/*jslint regexp: true */ -/*global require, XMLHttpRequest, ActiveXObject, - define, window, process, Packages, - java, location, Components, FileUtils */ - -define(['module'], function (module) { - 'use strict'; - - var text, fs, Cc, Ci, xpcIsWindows, - progIds = ['Msxml2.XMLHTTP', 'Microsoft.XMLHTTP', 'Msxml2.XMLHTTP.4.0'], - xmlRegExp = /^\s*<\?xml(\s)+version=[\'\"](\d)*.(\d)*[\'\"](\s)*\?>/im, - bodyRegExp = /]*>\s*([\s\S]+)\s*<\/body>/im, - hasLocation = typeof location !== 'undefined' && location.href, - defaultProtocol = hasLocation && location.protocol && location.protocol.replace(/\:/, ''), - defaultHostName = hasLocation && location.hostname, - defaultPort = hasLocation && (location.port || undefined), - buildMap = {}, - masterConfig = (module.config && module.config()) || {}; - - text = { - version: '2.0.12', - - strip: function (content) { - //Strips declarations so that external SVG and XML - //documents can be added to a document without worry. Also, if the string - //is an HTML document, only the part inside the body tag is returned. 
- if (content) { - content = content.replace(xmlRegExp, ""); - var matches = content.match(bodyRegExp); - if (matches) { - content = matches[1]; - } - } else { - content = ""; - } - return content; - }, - - jsEscape: function (content) { - return content.replace(/(['\\])/g, '\\$1') - .replace(/[\f]/g, "\\f") - .replace(/[\b]/g, "\\b") - .replace(/[\n]/g, "\\n") - .replace(/[\t]/g, "\\t") - .replace(/[\r]/g, "\\r") - .replace(/[\u2028]/g, "\\u2028") - .replace(/[\u2029]/g, "\\u2029"); - }, - - createXhr: masterConfig.createXhr || function () { - //Would love to dump the ActiveX crap in here. Need IE 6 to die first. - var xhr, i, progId; - if (typeof XMLHttpRequest !== "undefined") { - return new XMLHttpRequest(); - } else if (typeof ActiveXObject !== "undefined") { - for (i = 0; i < 3; i += 1) { - progId = progIds[i]; - try { - xhr = new ActiveXObject(progId); - } catch (e) {} - - if (xhr) { - progIds = [progId]; // so faster next time - break; - } - } - } - - return xhr; - }, - - /** - * Parses a resource name into its component parts. Resource names - * look like: module/name.ext!strip, where the !strip part is - * optional. - * @param {String} name the resource name - * @returns {Object} with properties "moduleName", "ext" and "strip" - * where strip is a boolean. - */ - parseName: function (name) { - var modName, ext, temp, - strip = false, - index = name.indexOf("."), - isRelative = name.indexOf('./') === 0 || - name.indexOf('../') === 0; - - if (index !== -1 && (!isRelative || index > 1)) { - modName = name.substring(0, index); - ext = name.substring(index + 1, name.length); - } else { - modName = name; - } - - temp = ext || modName; - index = temp.indexOf("!"); - if (index !== -1) { - //Pull off the strip arg. 
- strip = temp.substring(index + 1) === "strip"; - temp = temp.substring(0, index); - if (ext) { - ext = temp; - } else { - modName = temp; - } - } - - return { - moduleName: modName, - ext: ext, - strip: strip - }; - }, - - xdRegExp: /^((\w+)\:)?\/\/([^\/\\]+)/, - - /** - * Is an URL on another domain. Only works for browser use, returns - * false in non-browser environments. Only used to know if an - * optimized .js version of a text resource should be loaded - * instead. - * @param {String} url - * @returns Boolean - */ - useXhr: function (url, protocol, hostname, port) { - var uProtocol, uHostName, uPort, - match = text.xdRegExp.exec(url); - if (!match) { - return true; - } - uProtocol = match[2]; - uHostName = match[3]; - - uHostName = uHostName.split(':'); - uPort = uHostName[1]; - uHostName = uHostName[0]; - - return (!uProtocol || uProtocol === protocol) && - (!uHostName || uHostName.toLowerCase() === hostname.toLowerCase()) && - ((!uPort && !uHostName) || uPort === port); - }, - - finishLoad: function (name, strip, content, onLoad) { - content = strip ? text.strip(content) : content; - if (masterConfig.isBuild) { - buildMap[name] = content; - } - onLoad(content); - }, - - load: function (name, req, onLoad, config) { - //Name has format: some.module.filext!strip - //The strip part is optional. - //if strip is present, then that means only get the string contents - //inside a body tag in an HTML string. For XML/SVG content it means - //removing the declarations so the content can be inserted - //into the current doc without problems. - - // Do not bother with the work if a build and text will - // not be inlined. - if (config && config.isBuild && !config.inlineText) { - onLoad(); - return; - } - - masterConfig.isBuild = config && config.isBuild; - - var parsed = text.parseName(name), - nonStripName = parsed.moduleName + - (parsed.ext ? '.' 
+ parsed.ext : ''), - url = req.toUrl(nonStripName), - useXhr = (masterConfig.useXhr) || - text.useXhr; - - // Do not load if it is an empty: url - if (url.indexOf('empty:') === 0) { - onLoad(); - return; - } - - //Load the text. Use XHR if possible and in a browser. - if (!hasLocation || useXhr(url, defaultProtocol, defaultHostName, defaultPort)) { - text.get(url, function (content) { - text.finishLoad(name, parsed.strip, content, onLoad); - }, function (err) { - if (onLoad.error) { - onLoad.error(err); - } - }); - } else { - //Need to fetch the resource across domains. Assume - //the resource has been optimized into a JS module. Fetch - //by the module name + extension, but do not include the - //!strip part to avoid file system issues. - req([nonStripName], function (content) { - text.finishLoad(parsed.moduleName + '.' + parsed.ext, - parsed.strip, content, onLoad); - }); - } - }, - - write: function (pluginName, moduleName, write, config) { - if (buildMap.hasOwnProperty(moduleName)) { - var content = text.jsEscape(buildMap[moduleName]); - write.asModule(pluginName + "!" + moduleName, - "define(function () { return '" + - content + - "';});\n"); - } - }, - - writeFile: function (pluginName, moduleName, req, write, config) { - var parsed = text.parseName(moduleName), - extPart = parsed.ext ? '.' + parsed.ext : '', - nonStripName = parsed.moduleName + extPart, - //Use a '.js' file name so that it indicates it is a - //script that can be loaded across domains. - fileName = req.toUrl(parsed.moduleName + extPart) + '.js'; - - //Leverage own load() method to load plugin value, but only - //write out values that do not have the strip argument, - //to avoid any potential issues with ! in file names. - text.load(nonStripName, req, function (value) { - //Use own write() method to construct full module value. - //But need to create shell that translates writeFile's - //write() to the right interface. 
- var textWrite = function (contents) { - return write(fileName, contents); - }; - textWrite.asModule = function (moduleName, contents) { - return write.asModule(moduleName, fileName, contents); - }; - - text.write(pluginName, nonStripName, textWrite, config); - }, config); - } - }; - - if (masterConfig.env === 'node' || (!masterConfig.env && - typeof process !== "undefined" && - process.versions && - !!process.versions.node && - !process.versions['node-webkit'])) { - //Using special require.nodeRequire, something added by r.js. - fs = require.nodeRequire('fs'); - - text.get = function (url, callback, errback) { - try { - var file = fs.readFileSync(url, 'utf8'); - //Remove BOM (Byte Mark Order) from utf8 files if it is there. - if (file.indexOf('\uFEFF') === 0) { - file = file.substring(1); - } - callback(file); - } catch (e) { - if (errback) { - errback(e); - } - } - }; - } else if (masterConfig.env === 'xhr' || (!masterConfig.env && - text.createXhr())) { - text.get = function (url, callback, errback, headers) { - var xhr = text.createXhr(), header; - xhr.open('GET', url, true); - - //Allow plugins direct access to xhr headers - if (headers) { - for (header in headers) { - if (headers.hasOwnProperty(header)) { - xhr.setRequestHeader(header.toLowerCase(), headers[header]); - } - } - } - - //Allow overrides specified in config - if (masterConfig.onXhr) { - masterConfig.onXhr(xhr, url); - } - - xhr.onreadystatechange = function (evt) { - var status, err; - //Do not explicitly handle errors, those should be - //visible via console output in the browser. - if (xhr.readyState === 4) { - status = xhr.status || 0; - if (status > 399 && status < 600) { - //An http 4xx or 5xx error. Signal an error. 
- err = new Error(url + ' HTTP status: ' + status); - err.xhr = xhr; - if (errback) { - errback(err); - } - } else { - callback(xhr.responseText); - } - - if (masterConfig.onXhrComplete) { - masterConfig.onXhrComplete(xhr, url); - } - } - }; - xhr.send(null); - }; - } else if (masterConfig.env === 'rhino' || (!masterConfig.env && - typeof Packages !== 'undefined' && typeof java !== 'undefined')) { - //Why Java, why is this so awkward? - text.get = function (url, callback) { - var stringBuffer, line, - encoding = "utf-8", - file = new java.io.File(url), - lineSeparator = java.lang.System.getProperty("line.separator"), - input = new java.io.BufferedReader(new java.io.InputStreamReader(new java.io.FileInputStream(file), encoding)), - content = ''; - try { - stringBuffer = new java.lang.StringBuffer(); - line = input.readLine(); - - // Byte Order Mark (BOM) - The Unicode Standard, version 3.0, page 324 - // http://www.unicode.org/faq/utf_bom.html - - // Note that when we use utf-8, the BOM should appear as "EF BB BF", but it doesn't due to this bug in the JDK: - // http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058 - if (line && line.length() && line.charAt(0) === 0xfeff) { - // Eat the BOM, since we've already found the encoding on this file, - // and we plan to concatenating this buffer with others; the BOM should - // only appear at the top of a file. - line = line.substring(1); - } - - if (line !== null) { - stringBuffer.append(line); - } - - while ((line = input.readLine()) !== null) { - stringBuffer.append(lineSeparator); - stringBuffer.append(line); - } - //Make sure we return a JavaScript string and not a Java string. - content = String(stringBuffer.toString()); //String - } finally { - input.close(); - } - callback(content); - }; - } else if (masterConfig.env === 'xpconnect' || (!masterConfig.env && - typeof Components !== 'undefined' && Components.classes && - Components.interfaces)) { - //Avert your gaze! 
- Cc = Components.classes; - Ci = Components.interfaces; - Components.utils['import']('resource://gre/modules/FileUtils.jsm'); - xpcIsWindows = ('@mozilla.org/windows-registry-key;1' in Cc); - - text.get = function (url, callback) { - var inStream, convertStream, fileObj, - readData = {}; - - if (xpcIsWindows) { - url = url.replace(/\//g, '\\'); - } - - fileObj = new FileUtils.File(url); - - //XPCOM, you so crazy - try { - inStream = Cc['@mozilla.org/network/file-input-stream;1'] - .createInstance(Ci.nsIFileInputStream); - inStream.init(fileObj, 1, 0, false); - - convertStream = Cc['@mozilla.org/intl/converter-input-stream;1'] - .createInstance(Ci.nsIConverterInputStream); - convertStream.init(inStream, "utf-8", inStream.available(), - Ci.nsIConverterInputStream.DEFAULT_REPLACEMENT_CHARACTER); - - convertStream.readString(inStream.available(), readData); - convertStream.close(); - inStream.close(); - callback(readData.value); - } catch (e) { - throw new Error((fileObj && fileObj.path || '') + ': ' + e); - } - }; - } - return text; -}); diff --git a/mkdocs/search_index.json b/mkdocs/search_index.json deleted file mode 100644 index a026b08..0000000 --- a/mkdocs/search_index.json +++ /dev/null @@ -1,834 +0,0 @@ -{ - "docs": [ - { - "location": "/", - "text": "This documentation covers parts of the Spearhead Systems Issue Response process. It is a copy of \nPagerDuty's\n documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the \nabout page\n for more information on what this documentation is and why it exists. 
This documentation is complementary to what is available in our \nexisting wiki\n and may not yet be open sourced.\n\n\n\n\nIssue, Incident and Service Request\n\n\nAt Spearhead we use the term \nissue\n to define any request from our customers. Issues fall into two categories: \"Service Requests (SR)\" and \"Incidents (IN)\". Note that we use the term Incident to describe both a service request as well as incidents. For brevity we will use SR and IN throughout this documentation.\n\n\n\n\nA \"service request\" is usually initiated by a human and is generally not critical for the normal functioning of the business while an \"incident\" is an issue that is or can cause interruption to normal business functions. \n\n\n\n\nBeing On-Call\n#\n\n\nIf you've never been on-call before, you might be wondering what it's all about. These pages describe what the expectations of being on-call are, along with some resources to help you.\n\n\n\n\nBeing On-Call\n - \nA guide to being on-call, both what your responsibilities are, and what they are not.\n\n\nAlerting Principles\n - \nThe principles we use to determine what things page an engineer, and what time of day they page.\n\n\n\n\nBefore an Incident\n#\n\n\nReading material for things you probably want to know before an incident occurs. You likely don't want to be reading these during an actual incident.\n\n\n\n\nSeverity Levels\n - \nInformation on our severity level classification. What constitutes a Low issue? 
What's a \"Major Incident\"?, etc.\n\n\nDifferent Roles for Incidents\n - \nInformation on the roles during an incident; Incident Commander, Scribe, etc.\n\n\nIncident Call Etiquette\n - \nOur etiquette guidelines for incident calls, before you find yourself in one.\n\n\n\n\nDuring an Incident\n#\n\n\nInformation and processes during an incident.\n\n\n\n\nDuring an Incident\n - \nInformation on what to do during an incident, and how to constructively contribute.\n\n\nSecurity Incident Response\n - \nSecurity incidents are handled differently to normal operational incidents.\n\n\n\n\nAfter an Incident\n#\n\n\nOur followup processes, how we make sure we don't repeat mistakes and are always improving.\n\n\n\n\nPost-Mortem Process\n - \nInformation on our post-mortem process; what's involved and how to write or run a post-mortem.\n\n\nPost-Mortem Template\n - \nThe template we use for writing our post-mortems for major incidents.\n\n\n\n\nTraining\n#\n\n\nSo, you want to learn about incident response? 
You've come to the right place.\n\n\n\n\nTraining Overview\n - \nAn overview of our training guides and additional training material from third-parties.\n\n\nIncident Commander Training\n - \nA guide to becoming our next Incident Commander.\n\n\nDeputy Training\n - \nHow to be a deputy and back up the Incident Commander.\n\n\nScribe Training\n - \nA guide to scribing.\n\n\nSubject Matter Expert Training\n - \nA guide on responsibilities and behavior for all participants in a major incident.\n\n\nGlossary of Incident Response Terms\n - \nA collection of terms that you may hear being used, along with their definition.\n\n\n\n\nAdditional Reading\n#\n\n\nUseful material and resources from external parties that are relevant to incident response.\n\n\n\n\nIncident Management for Operations\n (O'Reilly)\n\n\nIncident Response\n (O'Reilly)\n\n\nDebriefing Facilitation Guide\n (Etsy)\n\n\nUS National Incident Management System (NIMS)\n (FEMA)\n\n\nEvery Minute Counts: Leading Heroku's Incident Response\n (Blake Gentry)", - "title": "Home" - }, - { - "location": "/#being-on-call", - "text": "If you've never been on-call before, you might be wondering what it's all about. These pages describe what the expectations of being on-call are, along with some resources to help you. Being On-Call - A guide to being on-call, both what your responsibilities are, and what they are not. Alerting Principles - The principles we use to determine what things page an engineer, and what time of day they page.", - "title": "Being On-Call" - }, - { - "location": "/#before-an-incident", - "text": "Reading material for things you probably want to know before an incident occurs. You likely don't want to be reading these during an actual incident. Severity Levels - Information on our severity level classification. What constitutes a Low issue? What's a \"Major Incident\"?, etc. Different Roles for Incidents - Information on the roles during an incident; Incident Commander, Scribe, etc. 
Incident Call Etiquette - Our etiquette guidelines for incident calls, before you find yourself in one.", - "title": "Before an Incident" - }, - { - "location": "/#during-an-incident", - "text": "Information and processes during an incident. During an Incident - Information on what to do during an incident, and how to constructively contribute. Security Incident Response - Security incidents are handled differently to normal operational incidents.", - "title": "During an Incident" - }, - { - "location": "/#after-an-incident", - "text": "Our followup processes, how we make sure we don't repeat mistakes and are always improving. Post-Mortem Process - Information on our post-mortem process; what's involved and how to write or run a post-mortem. Post-Mortem Template - The template we use for writing our post-mortems for major incidents.", - "title": "After an Incident" - }, - { - "location": "/#training", - "text": "So, you want to learn about incident response? You've come to the right place. Training Overview - An overview of our training guides and additional training material from third-parties. Incident Commander Training - A guide to becoming our next Incident Commander. Deputy Training - How to be a deputy and back up the Incident Commander. Scribe Training - A guide to scribing. Subject Matter Expert Training - A guide on responsibilities and behavior for all participants in a major incident. Glossary of Incident Response Terms - A collection of terms that you may hear being used, along with their definition.", - "title": "Training" - }, - { - "location": "/#additional-reading", - "text": "Useful material and resources from external parties that are relevant to incident response. 
Incident Management for Operations (O'Reilly) Incident Response (O'Reilly) Debriefing Facilitation Guide (Etsy) US National Incident Management System (NIMS) (FEMA) Every Minute Counts: Leading Heroku's Incident Response (Blake Gentry)", - "title": "Additional Reading" - }, - { - "location": "/oncall/being_oncall/", - "text": "A summary of expectations and helpful information for being on-call.\n\n\n\n\nWhat is On-Call?\n#\n\n\nBeing on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise. For example, if you are on-call, should any alarms be triggered by our monitoring solution, you will receive a \"page\" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state.\n\n\nAt Spearhead Systems we consider you are on-call during normal working hours in which case you are proactively working with \nDoIT\n and looking over your assigned cards/boards as well as when you are formally \"on-call\" and issues are being redirected to you.\n\n\nOn-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and is the problem that the Spearhead Systems professional services is trying to fix!\n\n\nResponsibilities\n#\n\n\n\n\n\n\nPrepare\n\n\n\n\nHave your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc).\n\n\nHave a way to charge your MiFi.\n\n\n\n\n\n\nTeam alert escalation happens within 5 minutes, set/stagger your notification timeouts (push, SMS, phone...) 
accordingly.\n\n\nMake sure Spearhead Systems (and colleagues directly) texts and calls can bypass your \"Do Not Disturb\" settings.\n\n\n\n\n\n\nBe prepared (environment is set up, a current working copy of the necessary repos is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, you have Java installed, ssh-keys and so on...)\n\n\nRead our Incident Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc.\n\n\nBe aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc.\n\n\n\n\n\n\n\n\nTriage\n\n\n\n\nAcknowledge and act on alerts whenever you can (see the first \"Not responsibilities\" point below)\n\n\nDetermine the urgency of the problem:\n\n\nIs it something that should be worked on right now or escalated into a major incident? (\"production server on fire\" situations. Security alerts) - do so.\n\n\nIs it some tactical work that doesn't have to happen during the night? (for example, disk utilization high watermark, but there's plenty of space left and the trend is not indicating impending doom) - snooze the alert until a more suitable time (working hours, the next morning...) and get back to fixing it then.\n\n\n\n\n\n\nCheck Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there.\n\n\nDoes the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? 
If it does not look like a problem you are the expert for, then escalate to another team member or group.\n\n\n\n\n\n\n\n\nFix\n\n\n\n\nYou are empowered to dive into any problem and act to fix it.\n\n\nInvolve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before.\n\n\nIf the issue is not very time sensitive and you have other priority work, make a note of this in DoIT to keep a track of it (with an appropriate severity).\n\n\n\n\n\n\n\n\nImprove\n\n\n\n\nIf a particular issue keeps happening; if an issue alerts often but turns out to be a preventable non-issue \u2013 perhaps improving this should be a longer-term task.\n\n\nDisks that fill up, logs that should be rotated, noisy alerts...(we use ansible, go ahead and start automating!)\n\n\n\n\n\n\nIf information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. 
Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized.\n\n\n\n\n\n\n\n\nSupport\n\n\n\n\nWhen your on-call \"shift\" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note.\n\n\nIf you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.\n\n\nSupport each other: when doing activities that might generate plenty of pages, it is courteous to \"take the page\" away from the on-call by notifying them and scheduling an override for the duration.\n\n\n\n\n\n\n\n\nNot Responsibilities\n#\n\n\n\n\n\n\nNo expectation to be the first to acknowledge \nall\n of the alerts during the on-call period.\n\n\n\n\nCommute (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That's why we have the backup on-call and schedule for.\n\n\n\n\n\n\n\n\nNo expectation to fix all issues by yourself.\n\n\n\n\nNo one knows everything. Your whole team is here to help. There is no shame, and much to be learned, by escalating issues you are not certain about. \"Never hesitate to escalate\".\n\n\nService owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once \u2013 and it's often best to let the subject matter expert do the cutting.\n\n\n\n\n\n\n\n\nRecommendations\n#\n\n\nIf your team is starting its own on-call rotation, here are some scheduling recommendations from the Operations team.\n\n\n\n\n\n\nAlways have a backup schedule. 
Yes, this means two people being on-call at the same time, however it takes a lot of the stress off of the primary if they know they have a specific backup they can contact, rather than trying to chose a random member of the team.\n\n\n\n\nA backup shift should generally come directly after a primary shift. It gives chance for the previous primary to pass on additional context which may have come up during their shift. It also helps to prevent people from sitting on issues with the intent of letting the next shift fix it.\n\n\n\n\n\n\n\n\nThe third-level of your escalation (after backup schedule) should probably be your entire team. This should hopefully never happen (it's happened once in the history of the Support team), but when it does, it's useful to be able to just get the next available person.\n\n\n\n\n\n\n\n\n\n\n\n\nTeam managers can (and should) be part of your normal rotation. It gives a better insight into what has been going on.\n\n\n\n\n\n\nNew members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time).\n\n\n\n\n\n\nWe recommend you set your escalation timeout to 5 minutes. This should be plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.\n\n\n\n\n\n\nWhen going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. 
If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient.\n\n\n\n\n\n\nNotification Method Recommendations\n#\n\n\nYou are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents. If you're not sure how to configure them, the Support team has some recommendations,\n\n\n\n\n\n\nUse Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. (DoIT is in the process of integratoin with SNS for push notifications)\n\n\nUse Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you.\n\n\n\n\nEtiquette\n#\n\n\n\n\n\n\nIf the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice.\n\n\n\n\n\n\nDon't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then you shouldn't be acknowledging it. Add a comment with your notes instead.\n\n\n\n\n\n\n\n\n\n\n\n\nIf you are testing something, or performing an action that you know will cause a page (notification, alert), it's customary to \"take the pager\" for the time during which you will be testing. Notify the person on-call that you are taking the pager for the next hour while you test.\n\n\n\n\n\n\n\"Never hesitate to escalate\" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. 
Likewise, never look down on someone else if they ask you for help.\n\n\n\n\n\n\nAlways consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town.\n\n\n\n\n\n\nIf an issue comes up during your on-call shift for which you got paged, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.", - "title": "Being On-Call" - }, - { - "location": "/oncall/being_oncall/#what-is-on-call", - "text": "Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise. For example, if you are on-call, should any alarms be triggered by our monitoring solution, you will receive a \"page\" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state. At Spearhead Systems we consider you are on-call during normal working hours in which case you are proactively working with DoIT and looking over your assigned cards/boards as well as when you are formally \"on-call\" and issues are being redirected to you. On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and is the problem that the Spearhead Systems professional services is trying to fix!", - "title": "What is On-Call?" 
- }, - { - "location": "/oncall/being_oncall/#responsibilities", - "text": "Prepare Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc). Have a way to charge your MiFi. Team alert escalation happens within 5 minutes, set/stagger your notification timeouts (push, SMS, phone...) accordingly. Make sure Spearhead Systems (and colleagues directly) texts and calls can bypass your \"Do Not Disturb\" settings. Be prepared (environment is set up, a current working copy of the necessary repos is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, you have Java installed, ssh-keys and so on...) Read our Incident Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc. Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc. Triage Acknowledge and act on alerts whenever you can (see the first \"Not responsibilities\" point below) Determine the urgency of the problem: Is it something that should be worked on right now or escalated into a major incident? (\"production server on fire\" situations. Security alerts) - do so. Is it some tactical work that doesn't have to happen during the night? (for example, disk utilization high watermark, but there's plenty of space left and the trend is not indicating impending doom) - snooze the alert until a more suitable time (working hours, the next morning...) and get back to fixing it then. Check Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there. Does the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? 
If it does not look like a problem you are the expert for, then escalate to another team member or group. Fix You are empowered to dive into any problem and act to fix it. Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before. If the issue is not very time sensitive and you have other priority work, make a note of this in DoIT to keep a track of it (with an appropriate severity). Improve If a particular issue keeps happening; if an issue alerts often but turns out to be a preventable non-issue \u2013 perhaps improving this should be a longer-term task. Disks that fill up, logs that should be rotated, noisy alerts...(we use ansible, go ahead and start automating!) If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized. Support When your on-call \"shift\" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note. If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance. Support each other: when doing activities that might generate plenty of pages, it is courteous to \"take the page\" away from the on-call by notifying them and scheduling an override for the duration.", - "title": "Responsibilities" - }, - { - "location": "/oncall/being_oncall/#not-responsibilities", - "text": "No expectation to be the first to acknowledge all of the alerts during the on-call period. Commute (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. 
That's why we have the backup on-call and schedule for. No expectation to fix all issues by yourself. No one knows everything. Your whole team is here to help. There is no shame, and much to be learned, by escalating issues you are not certain about. \"Never hesitate to escalate\". Service owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once \u2013 and it's often best to let the subject matter expert do the cutting.", - "title": "Not Responsibilities" - }, - { - "location": "/oncall/being_oncall/#recommendations", - "text": "If your team is starting its own on-call rotation, here are some scheduling recommendations from the Operations team. Always have a backup schedule. Yes, this means two people being on-call at the same time, however it takes a lot of the stress off of the primary if they know they have a specific backup they can contact, rather than trying to chose a random member of the team. A backup shift should generally come directly after a primary shift. It gives chance for the previous primary to pass on additional context which may have come up during their shift. It also helps to prevent people from sitting on issues with the intent of letting the next shift fix it. The third-level of your escalation (after backup schedule) should probably be your entire team. This should hopefully never happen (it's happened once in the history of the Support team), but when it does, it's useful to be able to just get the next available person. Team managers can (and should) be part of your normal rotation. It gives a better insight into what has been going on. New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. 
(All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time). We recommend you set your escalation timeout to 5 minutes. This should be plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway. When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient.", - "title": "Recommendations" - }, - { - "location": "/oncall/being_oncall/#notification-method-recommendations", - "text": "You are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents. If you're not sure how to configure them, the Support team has some recommendations, Use Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. (DoIT is in the process of integratoin with SNS for push notifications) Use Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you.", - "title": "Notification Method Recommendations" - }, - { - "location": "/oncall/being_oncall/#etiquette", - "text": "If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice. 
Don't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then you shouldn't be acknowledging it. Add a comment with your notes instead. If you are testing something, or performing an action that you know will cause a page (notification, alert), it's customary to \"take the pager\" for the time during which you will be testing. Notify the person on-call that you are taking the pager for the next hour while you test. \"Never hesitate to escalate\" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help. Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town. If an issue comes up during your on-call shift for which you got paged, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.", - "title": "Etiquette" - }, - { - "location": "/oncall/alerting_principles/", - "text": "We manage how we get alerted based on many factors such as the customers contractual SLA, the urgency of their request or incident, etc.. \nan alert or notification is something which requires a human to perform an action\n. Based on the severity of the issue (service request or incident) we prioritize accordingly in \nDoIT\n.\n\n\n\n\nMajor Priority Alerts\n\n\nAnything that wakes up a human in the middle of the night should be \nimmediately human actionable\n. 
If it is none of those things, then we need to adjust the alert to not page at those times.\n\n\n\n\n\n\n\n\n\n\nPriority\n\n\nAlerts\n\n\nResponse\n\n\n\n\n\n\n\n\n\n\nMajor\n\n\nMajor-Priority Spearhead Alert 24/7/365.\n\n\nRequires \nimmediate human action\n.\n\n\n\n\n\n\nNormal\n\n\nNormal-Priority Spearhead Alert during \nbusiness hours only\n.\n\n\nRequires human action that same working day.\n\n\n\n\n\n\nMinor\n\n\nMinor-Priority Spearhead Alert 24/7/365.\n\n\nRequires human action at some point.\n\n\n\n\n\n\nNotification\n\n\nSuppressed Events. No response required.\n\n\nInformational only. We do not need these to clutter out ticketing or inboxes. If they are enabled they should be sent only to required/specific people, not groups.\n\n\n\n\n\n\n\n\nBoth IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer.\n\n\nIf you're setting up a new alert/notification, consider the chart above for how you want to alert people. Be mindful of not creating new high-priority alerts if they don't require an immediate response, for example.\n\n\n\n\nAlert Channels\n\n\nPresently we use email as the only notification method. This means keeping an eye on your email is essential!\nSMS and Push notifications are in the pipeline for DoIT. \n\n\n\n\nExamples\n#\n\n\n\"Production service is failing for 75% of requests, automation is unable to resolve.\"_\n#\n\n\nThis would be a \nMajor\n priority IN, requiring immediate human action to resolve.\n\n\n\n\n\"A customer sends an email stating that \"Production server disk space is filling, expected to be full in 48 hours. 
Log rotation is insufficient to resolve.\"\n#\n\n\nThis would be a \nNormal\n priority SR, requiring human action soon, but not immediately.\n\n\n\n\n\"An SSL certificate is due to expire in one week.\"\n#\n\n\nThis would be a \nMinor\n priority SR, requiring human action some time soon.", - "title": "Alerting Principles" - }, - { - "location": "/oncall/alerting_principles/#examples", - "text": "", - "title": "Examples" - }, - { - "location": "/oncall/alerting_principles/#production-service-is-failing-for-75-of-requests-automation-is-unable-to-resolve_", - "text": "This would be a Major priority IN, requiring immediate human action to resolve.", - "title": "\"Production service is failing for 75% of requests, automation is unable to resolve.\"_" - }, - { - "location": "/oncall/alerting_principles/#a-customer-sends-an-email-stating-that-production-server-disk-space-is-filling-expected-to-be-full-in-48-hours-log-rotation-is-insufficient-to-resolve", - "text": "This would be a Normal priority SR, requiring human action soon, but not immediately.", - "title": "\"A customer sends an email stating that \"Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve.\"" - }, - { - "location": "/oncall/alerting_principles/#an-ssl-certificate-is-due-to-expire-in-one-week", - "text": "This would be a Minor priority SR, requiring human action some time soon.", - "title": "\"An SSL certificate is due to expire in one week.\"" - }, - { - "location": "/before/severity_levels/", - "text": "The first step in any incident response process is to determine what actually constitutes an incident. We have two high level categories for classifying incidents: this is done using \"SR\" or \"IN\" defintions with an attached priority of \"Minor\", \"Normal\" or \"Major\". 
\"SR\" are \"Service requests\" initiated by a customer and usually do not constitute a critical issue (there are exceptions) and \"IN\" are \"incidents\" which are generally \"urgent\".\n\n\nAll of our operational issues are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher severity SR or IN than a lower one (a \"Major\" priority gets a more intensive response than a \"Normal\" incident for example).\n\n\n\n\nAlways Assume The Worst\n\n\nIf you are unsure which level an incident is (e.g. not sure if IN is Major or Normal), \ntreat it as the higher one\n. During an incident is not the time to discuss or litigate severities, just assume the highest and review during a post-mortem.\n\n\n\n\n\n \n\n \n\n \nSeverity\n\n \nDescription\n\n \nWhat To Do\n\n \n\n \n\n \n\n \n\n \nMajor\n\n \n\n \n\n \nThe system is in a critical state and is actively impacting a large number of customers.\n\n \nFunctionality has been severely impaired for a long time, breaking SLA.\n\n \nCustomer-data-exposing security vulnerability has come to our attention.\n\n \n\n \n\n \nSee \nDuring an Incident\n.\n\n \n\n \n\n \nNormal\n\n \n\n \n\n \nFunctionality of virtualization platform is severely impaired.\n\n \nE-mail system is offline.\n\n \n\n \n\n \nSee \nDuring an Incident\n.\n\n \n\n \n\n \nAnything above this line is considered a \"Major Incident\". These are generally Incidents (IN). Below are service requests (SR) which are usually initiated by a human who can help with prioritizing. 
A call is triggered for all major incidents (indifferently of SR or IN).\n\n \n\n \n\n \nNormal\n\n \n\n \n\n \nPartial loss of functionality, only affecting minority of customers.\n\n \nSomething that has the likelihood of becoming Major if nothing is done.\n\n \nNo redundancy in a service (failure of 1 more node will cause outage).\n\n \n\n \n\n \n\n \n\n \nWork on issue as your top priority.\n\n \nLiaise with engineers of affected systems to identify cause.\n\n \nIf related to recent deployment, rollback.\n\n \nMonitor status and notice if/when it escalates.\n\n \nMention on Slack if you think it has the potential to escalate.\n\n \n\n \n\n \n\n \n\n \nNormal\n\n \n\n \n\n \nPerformance issues (delays, etc). Tasks that require non-immediate attention.\n\n \nJob failure (not impacting alerting).\n\n \n\n \n\n \n\n \n\n \nWork on the issue as your first priority (above \"Low\" tasks).\n\n \nMonitor status and notice if/when it escalates.\n\n \n\n \n\n \n\n \n\n \nLow\n\n \n\n \n\n \nNormal bugs which aren't impacting system use, cosmetic issues, etc.\n\n \n\n \n\n \n\n \n\n \nCreate a DoIT ticket and assign to owner of affected system.\n\n \n\n \n\n \n\n \n\n\n\n\n\n\n\nBe Specific\n\n\nWhen creating Cards in Doit, be as specific as possible and include all necessary details. Include relevant details regarding when the issue started, what may have triggered it, etc.. Document your efforts through worklogs and be specific there as well.", - "title": "Severity Levels" - }, - { - "location": "/before/different_roles/", - "text": "There are several roles for our incident response teams at Spearhead Systems. Certain roles only have one person per incident (e.g. support engineer), whereas other roles can have multiple people (e.g. Sysadmins, Solution Architects, etc.). 
It's all about coming together as a team, working the problem, and getting a solution quickly.\n\n\nHere is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page.\n\n\n\n\n\n\nTeam Leader (TL)\n#\n\n\nWhat is it?\n#\n\n\nA Team Leader acts as the single source of truth of what is currently happening and what is going to happen during an major incident. They come in all shapes, sizes, and colors. TL's are also the key elements in a project (boards in DoIT).\n\n\nWhy have one?\n#\n\n\nAs any system grows in size and complexity, things break and cause incidents. The TL is needed to help drive major incidents to resolution by organizing his team towards a common goal.\n\n\nWhat are the responsibilities?\n#\n\n\n\n\nHelp prepare for projects and incidents,\n\n\nSetup communications channels.\n\n\nCreate the DoIT board(s) and other project planning related materials.\n\n\nFunnel people to these communications channels.\n\n\nTrain team members on how to communicate and train other TL's.\n\n\n\n\n\n\nDrive incidents and projects to resolution,\n\n\nGet everyone on the same communication channel.\n\n\nCollect information from team members for their services/area of ownership status.\n\n\nCollect proposed repair actions, then recommend repair actions to be taken.\n\n\nDelegate all repair actions, the TL is NOT a resolver.\n\n\nBe the single authority on system status\n\n\nCommunicate directly with the customers and end-users\n\n\nnot the engineers themselves!\n\n\n\n\n\n\n\n\n\n\nPost Mortem,\n\n\nCreating the initial template right after the incident so people can put in their thoughts while fresh.\n\n\nAssigning the post-mortem after the event is over, this can be done after the call.\n\n\nWork with Managers/Support on scheduling preventive actions.\n\n\n\n\n\n\n\n\nWho are they?\n#\n\n\nAnyone on the TL on-call schedule. 
Trainees are typically on the TL Shadow schedule.\n\n\nHow can I become one?\n#\n\n\nTake a look at our \nTeam Leader training guide\n.\n\n\n\n\nSysadmin\n#\n\n\nWhat is it?\n#\n\n\nA Sysadmin is a direct support role for the Team Leader. This is not a shadow where the person just observes, the Sysadmin is expected to perform important tasks during an incident.\n\n\nWhy have one?\n#\n\n\nIt's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin helps to support the TL and keep them stay focussed on the incident.\n\n\nWhat are the responsibilities?\n#\n\n\nThe Sysadmin is expected to:\n\n\n\n\nBring up issues to the TL that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc).\n\n\nBe a \"hot standby\" TL, should the primary need to either transition to a SME, or otherwise have to step away from the TL role.\n\n\nPage SME's or other on-call engineers as instructed by the Team Leader.\n\n\nManage the incident call, and be prepared to remove people from the call if instructed by the Team Leader.\n\n\nLiaise with stakeholders and provide status updates on DoIT (using worklogs and comments), Slack and email/telefone as necessary.\n\n\n\n\nWho are they?\n#\n\n\nAny Team Leader can act as a Sysadmin. Sysadmins need to be trained as an Team Leader as they may be required to take over command.\n\n\nHow can I become one?\n#\n\n\nTake a look at our \nSysadmin training guide\n. 
Sysadmins also need to be \ntrained as an Team Leaders\n.\n\n\n\n\nTODO:::move scribe responsibilities to TL and Sysadmin\n::: or assign this to our juniors?\n\n\nScribe\n#\n\n\nWhat is it?\n#\n\n\nA Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review.\n\n\nWhy have one?\n#\n\n\nThe incident commander will need to focus on the problem at hand, and the subject matter experts will need to focus on resolving the incident. It is important to capture a timeline of events as they happen so that they can be reviewed during the post-mortem to determine how well we performed, and so we can accurate determine any additional impact that we might not have noticed at the time.\n\n\nWhat are the responsibilities?\n#\n\n\nThe Scribe is expected to:\n\n\n\n\nEnsure the incident call is being recorded.\n\n\nNote in Slack important data, events, and actions, as they happen. Specifically:\n\n\nKey actions as they are taken (Example: \"prod-server-387723 is being restarted to attempt to remove the stuck lock\")\n\n\nStatus reports when one is provided by the IC (Example: \"We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes\")\n\n\nAny key callouts either during the call or at the ending review (Example: \"Note: (Bob B) We should have a better way to determine stuck locks.\")\n\n\n\n\n\n\n\n\nWho are they?\n#\n\n\nAnyone can act as a scribe during an incident, and are chosen by the Incident Commander at the start of the call. 
Typically the Deputy will act as the Scribe, but that doesn't necessarily need to happen, and for larger incidents may not be possible.\n\n\nHow can I become one?\n#\n\n\nFollow our \nScribe training guide\n, and then notify the Incident Commanders that you would like to be considered for scribing for the next incident.\n\n\nTODO::: END move scribe responsibilities to TL and Sysadmin\n\n\n\n\nSubject Matter Expert\n#\n\n\nWhat is it?\n#\n\n\nA Subject Matter Expert (SME), sometimes called a \"Resolver\" or \"Architect\", is a domain expert or designated owner of a component or service that is part of the Spearhead Systems service delivery concept.\n\n\nWhy have one?\n#\n\n\nThe TL and Sysadmins are not all-knowing super beings. When there is a problem with a service or a particular system, an expert in that service is needed to be able to quickly help identify and fix issues.\n\n\nWhat are the responsibilities?\n#\n\n\n\n\nBeing able to diagnose common problems with the service.\n\n\nBeing able to rapidly fix issues found during an incident.\n\n\nConcise communication skills, specifically for CAN reports:\n\n\nCondition: What is the current state of the service? Is it healthy or not?\n\n\nActions: What actions need to be taken if the service is not in a healthy state?\n\n\nNeeds: What support does the resolver need to perform an action?\n\n\n\n\n\n\n\n\nWho are they?\n#\n\n\nAnyone who is considered a \"domain expert\" can act as a resolver for an incident. Typically the service's primary on-call will act as the SME for that service.\n\n\nHow can I become one?\n#\n\n\nTake a look at our \nSubject Matter Expert training guide\n. You should also discuss with your team and service owner to determine what the requirements are for your particular service.\n\n\n\n\nCustomer Liaison\n#\n\n\nWhat is it?\n#\n\n\nA person responsible for interacting with customers, either directly, or via our public communication channels. 
Typically a member of the Customer Support team.\n\n\nWhy have one?\n#\n\n\nAll of the other roles will be actively working on identifying the cause and resolving the issue, we need a role which is focused purely on the customer interaction side of things so that it can be done properly, with the due care and attention it needs.\n\n\nWhat are the responsibilities?\n#\n\n\n\n\nPost any publicly facing messages regarding the incident (DoIT, Twitter, StatusPage, etc).\n\n\nNotify the TL of any customers reporting that they are affected by the incident.\n\n\n\n\nWho are they?\n#\n\n\nAny member of the Support Team can act as a customer liaison.\n\n\nHow can I become one?\n#\n\n\nDiscuss with the Support Team about becoming our next customer liaison.", - "title": "Different Roles" - }, - { - "location": "/before/different_roles/#team-leader-tl", - "text": "", - "title": "Team Leader (TL)" - }, - { - "location": "/before/different_roles/#what-is-it", - "text": "A Team Leader acts as the single source of truth of what is currently happening and what is going to happen during an major incident. They come in all shapes, sizes, and colors. TL's are also the key elements in a project (boards in DoIT).", - "title": "What is it?" - }, - { - "location": "/before/different_roles/#why-have-one", - "text": "As any system grows in size and complexity, things break and cause incidents. The TL is needed to help drive major incidents to resolution by organizing his team towards a common goal.", - "title": "Why have one?" - }, - { - "location": "/before/different_roles/#what-are-the-responsibilities", - "text": "Help prepare for projects and incidents, Setup communications channels. Create the DoIT board(s) and other project planning related materials. Funnel people to these communications channels. Train team members on how to communicate and train other TL's. Drive incidents and projects to resolution, Get everyone on the same communication channel. 
Collect information from team members for their services/area of ownership status. Collect proposed repair actions, then recommend repair actions to be taken. Delegate all repair actions, the TL is NOT a resolver. Be the single authority on system status Communicate directly with the customers and end-users not the engineers themselves! Post Mortem, Creating the initial template right after the incident so people can put in their thoughts while fresh. Assigning the post-mortem after the event is over, this can be done after the call. Work with Managers/Support on scheduling preventive actions.", - "title": "What are the responsibilities?" - }, - { - "location": "/before/different_roles/#who-are-they", - "text": "Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule.", - "title": "Who are they?" - }, - { - "location": "/before/different_roles/#how-can-i-become-one", - "text": "Take a look at our Team Leader training guide .", - "title": "How can I become one?" - }, - { - "location": "/before/different_roles/#sysadmin", - "text": "", - "title": "Sysadmin" - }, - { - "location": "/before/different_roles/#what-is-it_1", - "text": "A Sysadmin is a direct support role for the Team Leader. This is not a shadow where the person just observes, the Sysadmin is expected to perform important tasks during an incident.", - "title": "What is it?" - }, - { - "location": "/before/different_roles/#why-have-one_1", - "text": "It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin helps to support the TL and keep them stay focussed on the incident.", - "title": "Why have one?" - }, - { - "location": "/before/different_roles/#what-are-the-responsibilities_1", - "text": "The Sysadmin is expected to: Bring up issues to the TL that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc). 
Be a \"hot standby\" TL, should the primary need to either transition to a SME, or otherwise have to step away from the TL role. Page SME's or other on-call engineers as instructed by the Team Leader. Manage the incident call, and be prepared to remove people from the call if instructed by the Team Leader. Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), Slack and email/telefone as necessary.", - "title": "What are the responsibilities?" - }, - { - "location": "/before/different_roles/#who-are-they_1", - "text": "Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as an Team Leader as they may be required to take over command.", - "title": "Who are they?" - }, - { - "location": "/before/different_roles/#how-can-i-become-one_1", - "text": "Take a look at our Sysadmin training guide . Sysadmins also need to be trained as an Team Leaders . TODO:::move scribe responsibilities to TL and Sysadmin\n::: or assign this to our juniors?", - "title": "How can I become one?" - }, - { - "location": "/before/different_roles/#scribe", - "text": "", - "title": "Scribe" - }, - { - "location": "/before/different_roles/#what-is-it_2", - "text": "A Scribe documents the timeline of an incident as it progresses, and makes sure all important decisions and data are captured for later review.", - "title": "What is it?" - }, - { - "location": "/before/different_roles/#why-have-one_2", - "text": "The incident commander will need to focus on the problem at hand, and the subject matter experts will need to focus on resolving the incident. It is important to capture a timeline of events as they happen so that they can be reviewed during the post-mortem to determine how well we performed, and so we can accurate determine any additional impact that we might not have noticed at the time.", - "title": "Why have one?" 
- }, - { - "location": "/before/different_roles/#what-are-the-responsibilities_2", - "text": "The Scribe is expected to: Ensure the incident call is being recorded. Note in Slack important data, events, and actions, as they happen. Specifically: Key actions as they are taken (Example: \"prod-server-387723 is being restarted to attempt to remove the stuck lock\") Status reports when one is provided by the IC (Example: \"We are in SEV-1, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes\") Any key callouts either during the call or at the ending review (Example: \"Note: (Bob B) We should have a better way to determine stuck locks.\")", - "title": "What are the responsibilities?" - }, - { - "location": "/before/different_roles/#who-are-they_2", - "text": "Anyone can act as a scribe during an incident, and are chosen by the Incident Commander at the start of the call. Typically the Deputy will act as the Scribe, but that doesn't necessarily need to happen, and for larger incidents may not be possible.", - "title": "Who are they?" - }, - { - "location": "/before/different_roles/#how-can-i-become-one_2", - "text": "Follow our Scribe training guide , and then notify the Incident Commanders that you would like to be considered for scribing for the next incident. TODO::: END move scribe responsibilities to TL and Sysadmin", - "title": "How can I become one?" - }, - { - "location": "/before/different_roles/#subject-matter-expert", - "text": "", - "title": "Subject Matter Expert" - }, - { - "location": "/before/different_roles/#what-is-it_3", - "text": "A Subject Matter Expert (SME), sometimes called a \"Resolver\" or \"Architect\", is a domain expert or designated owner of a component or service that is part of the Spearhead Systems service delivery concept.", - "title": "What is it?" 
- }, - { - "location": "/before/different_roles/#why-have-one_3", - "text": "The TL and Sysadmins are not all-knowing super beings. When there is a problem with a service or a particular system, an expert in that service is needed to be able to quickly help identify and fix issues.", - "title": "Why have one?" - }, - { - "location": "/before/different_roles/#what-are-the-responsibilities_3", - "text": "Being able to diagnose common problems with the service. Being able to rapidly fix issues found during an incident. Concise communication skills, specifically for CAN reports: Condition: What is the current state of the service? Is it healthy or not? Actions: What actions need to be taken if the service is not in a healthy state? Needs: What support does the resolver need to perform an action?", - "title": "What are the responsibilities?" - }, - { - "location": "/before/different_roles/#who-are-they_3", - "text": "Anyone who is considered a \"domain expert\" can act as a resolver for an incident. Typically the service's primary on-call will act as the SME for that service.", - "title": "Who are they?" - }, - { - "location": "/before/different_roles/#how-can-i-become-one_3", - "text": "Take a look at our Subject Matter Expert training guide . You should also discuss with your team and service owner to determine what the requirements are for your particular service.", - "title": "How can I become one?" - }, - { - "location": "/before/different_roles/#customer-liaison", - "text": "", - "title": "Customer Liaison" - }, - { - "location": "/before/different_roles/#what-is-it_4", - "text": "A person responsible for interacting with customers, either directly, or via our public communication channels. Typically a member of the Customer Support team.", - "title": "What is it?" 
- }, - { - "location": "/before/different_roles/#why-have-one_4", - "text": "All of the other roles will be actively working on identifying the cause and resolving the issue, we need a role which is focused purely on the customer interaction side of things so that it can be done properly, with the due care and attention it needs.", - "title": "Why have one?" - }, - { - "location": "/before/different_roles/#what-are-the-responsibilities_4", - "text": "Post any publicly facing messages regarding the incident (DoIT, Twitter, StatusPage, etc). Notify the TL of any customers reporting that they are affected by the incident.", - "title": "What are the responsibilities?" - }, - { - "location": "/before/different_roles/#who-are-they_4", - "text": "Any member of the Support Team can act as a customer liaison.", - "title": "Who are they?" - }, - { - "location": "/before/different_roles/#how-can-i-become-one_4", - "text": "Discuss with the Support Team about becoming our next customer liaison.", - "title": "How can I become one?" - }, - { - "location": "/before/call_etiquette/", - "text": "You've just joined an incident call, and you've never been on one before. You have no idea what's going on, or what you're supposed to be doing. This page will help you through your first time on an incident call, and will provide a reference for future calls you may be a part of.\n\n\n\n\nCredit: \nOfficial White House Photo\n by Pete Souza\n\n\nFirst Steps\n#\n\n\n\n\nIf you intend on participating on the incident call you should join both the call, and Slack.\n\n\nMake sure you are in a quiet environment in order to participate on the call. 
Background noise should be kept to a minimum.\n\n\nKeep your microphone muted until you have something to say.\n\n\nIdentify yourself when you join the call; State your name and the system you are the expert for.\n\n\nSpeak up and speak clearly.\n\n\nBe direct and factual.\n\n\nKeep conversations/discussions short and to the point.\n\n\nBring any concerns to the Incident Commander (IC) on the call.\n\n\nRespect time constraints given by the Incident Commander.\n\n\n\n\nLingo\n#\n\n\nUse clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.\n\n\n\n\nStandard radio \nvoice procedure\n does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are,\n\n\n\n\nAck/Rog\n - \"I have received and understood\"\n\n\nSay Again\n - \"Repeat your last message\"\n\n\nStandby\n - \"Please wait a moment for the next response\"\n\n\nWilco\n - \"Will comply\"\n\n\n\n\nDo not invent new abbreviations, and always favor being explicit of implicit. It is better to make things clearer than to try and save time by abbreviating, only to have a misunderstanding because others didn't know the abbreviation.\n\n\nThe Commander\n#\n\n\nThe Incident Commander (IC) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking.\n\n\n\n\nFollow all instructions from the incident commander, without exception.\n\n\nDo not perform any actions unless the incident commander has told you to do so.\n\n\nThe commander will typically poll for any strong objections before performing a large action. 
This is your time to raise any objections if you have them.\n\n\nOnce the commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll.\n\n\nAnswer any questions the commander asks you in a clear and concise way.\n\n\nAnswering that you \"don't know\" something is perfectly acceptable. Do not try to guess.\n\n\n\n\n\n\nThe commander may ask you to investigate something and get back to them in X minutes. Make sure you are ready with an answer within that time.\n\n\nAnswering that you need more time is perfectly acceptable, but you need to give the commander an estimate of how much time.\n\n\n\n\n\n\n\n\nProblems?\n#\n\n\nThere's no incident commander on the call! I don't know what to do!\n#\n\n\nAsk on the call if an IC is present. If you have no response, type \n!ic page\n in Slack. This will page the primary and backup IC to the call.\n\n\nI can join the call or Slack, but not both, what should I do?\n#\n\n\nYou're welcome to join only one of the channels, however you should not actively participate in the incident response if so, as it causes disjoined communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it.", - "title": "Call Etiquette" - }, - { - "location": "/before/call_etiquette/#first-steps", - "text": "If you intend on participating on the incident call you should join both the call, and Slack. Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum. Keep your microphone muted until you have something to say. Identify yourself when you join the call; State your name and the system you are the expert for. Speak up and speak clearly. Be direct and factual. Keep conversations/discussions short and to the point. Bring any concerns to the Incident Commander (IC) on the call. 
Respect time constraints given by the Incident Commander.", - "title": "First Steps" - }, - { - "location": "/before/call_etiquette/#lingo", - "text": "Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication. Standard radio voice procedure does not need to be followed on calls. However, you should familiarize yourself with the terms, as you may hear them on a call (or need to use them yourself). The ones in more active use on major incident calls are, Ack/Rog - \"I have received and understood\" Say Again - \"Repeat your last message\" Standby - \"Please wait a moment for the next response\" Wilco - \"Will comply\" Do not invent new abbreviations, and always favor being explicit of implicit. It is better to make things clearer than to try and save time by abbreviating, only to have a misunderstanding because others didn't know the abbreviation.", - "title": "Lingo" - }, - { - "location": "/before/call_etiquette/#the-commander", - "text": "The Incident Commander (IC) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking. Follow all instructions from the incident commander, without exception. Do not perform any actions unless the incident commander has told you to do so. The commander will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them. Once the commander has made a decision, that decision is final and should be followed, even if you disagreed during the poll. Answer any questions the commander asks you in a clear and concise way. Answering that you \"don't know\" something is perfectly acceptable. Do not try to guess. The commander may ask you to investigate something and get back to them in X minutes. 
Make sure you are ready with an answer within that time. Answering that you need more time is perfectly acceptable, but you need to give the commander an estimate of how much time.", - "title": "The Commander" - }, - { - "location": "/before/call_etiquette/#problems", - "text": "", - "title": "Problems?" - }, - { - "location": "/before/call_etiquette/#theres-no-incident-commander-on-the-call-i-dont-know-what-to-do", - "text": "Ask on the call if an IC is present. If you have no response, type !ic page in Slack. This will page the primary and backup IC to the call.", - "title": "There's no incident commander on the call! I don't know what to do!" - }, - { - "location": "/before/call_etiquette/#i-can-join-the-call-or-slack-but-not-both-what-should-i-do", - "text": "You're welcome to join only one of the channels, however you should not actively participate in the incident response if so, as it causes disjoined communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it.", - "title": "I can join the call or Slack, but not both, what should I do?" - }, - { - "location": "/during/during_an_incident/", - "text": "Information on what to do during a major incident. See our \nseverity level descriptions\n for what constitutes a major incident.\n\n\n\n\nDocumentation\n\n\nFor your own internal documentation, you should make sure that this page has all of the necessary information prominently displayed. Such as: phone bridge numbers, Slack rooms, important chat commands, etc. Here is an example,\n\n\n\n \n\n \n\n \n\n \n\n \n#incident-chat\n\n \nhttps://a-voip-provider.com/incident-call\n\n \n+1 555 BIG FIRE\n (+1 555 244 3473) / PIN: 123456\n\n \n\n \n\n \nNeed an IC? 
Do \n!ic page\n in Slack\n\n \n\n \n\n \nFor executive summary updates only, join \n#executive-summary-updates\n.\n\n \n\n \n\n\n\n\n\n\n\n\nSecurity Incident?\n\n\nIf this is a security incident, you should follow the \nSecurity Incident Response\n process.\n\n\n\n\nDon't Panic!\n#\n\n\n\n\n\n\nJoin the incident call and chat (see links above).\n\n\n\n\nAnyone is free to join the call or chat to observe and follow along with the incident.\n\n\nIf you wish to participate however, you should join both. If you can't join the call for some reason, you should have a dedicated proxy for the call. Disjointed discussions in the chat room are ultimately distracting.\n\n\n\n\n\n\n\n\nFollow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand.\n\n\n\n\nIf you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once get become overwhelming, so we should try to maintain a hierarchical structure to the call if possible.\n\n\n\n\n\n\n\n\nFollow instructions from the Incident Commander.\n\n\n\n\nIs there no IC on the call?\n\n\nManually page them via Slack, with \n!ic page\n in Slack. This will page the primary and backup IC's at the same time.\n\n\nNever hesitate to page the IC. It's much better to have them and not need them than the other way around.\n\n\n\n\n\n\n\n\n\n\n\n\nSteps for Incident Commander\n#\n\n\nResolve the incident as quickly and as safely as possible, use the Deputy to assist you. Delegate any tasks to relevant experts at your discretion.\n\n\n\n\n\n\nAnnounce on the call and in Slack that you are the incident commander, who you have designated as deputy (usually the backup IC), and scribe.\n\n\n\n\n\n\nIdentify if there is an obvious cause to the incident (recent deployment, spike in traffic, etc.), delegate investigation to relevant experts,\n\n\n\n\nUse the service experts on the call to assist in the analysis. 
They should be able to quickly provide confirmation of the cause, but not always. It's the call of the IC on how to proceed in cases where the cause is not positively known. Confer with service owners and use their knowledge to help you.\n\n\n\n\n\n\n\n\nIdentify investigation \n repair actions (roll back, rate-limit services, etc) and delegate actions to relevant service experts. Typically something like this (obviously not an exhaustive list),\n\n\n\n\nBad Deployment:\n Roll it back.\n\n\nWeb Application Stuck/Crashed:\n Do a rolling restart.\n\n\nEvent Flood:\n Validate automatic throttling is sufficient, adjust manually if not.\n\n\nData Center Outage:\n Validate automation has removed bad data center. Force it to do so if not.\n\n\nDegraded Service Behavior without load:\n Gather forensic data (heap dumps, etc), and consider doing a rolling restart.\n\n\n\n\n\n\n\n\nListen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly.\n\n\n\n\nAnnouncing publicly is at your discretion as IC. If you are unsure, then announce publicly (\"If in doubt, tweet it out\").\n\n\n\n\n\n\n\n\nOnce incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. 
This usually indicates there's no more productive work to be done for the incident right now.\n\n\n\n\nMove the remaining, non-time-critical discussion to Slack.\n\n\nFollow up to ensure the customer liaison wraps up the incident publicly.\n\n\nIdentify any post-incident clean-up work.\n\n\nYou may need to perform debriefing/analysis of the underlying root cause.\n\n\n\n\n\n\n\n\n(After call ends) Create the post-mortem page from the template, and assign an owner to the post-mortem for the incident.\n\n\n\n\n\n\n(After call ends) Send out an internal email explaining that we had a major incident, provide a link to the post-mortem.\n\n\n\n\n\n\nSteps for Deputy\n#\n\n\nYou are there to support the IC in whatever they need.\n\n\n\n\n\n\nMonitor the status, and notify the IC if/when the incident escalates in severity level,\n\n\n\n\nOfficerURL can help you to monitor the status on Slack,\n\n\n!status\n - Will tell you the current status.\n\n\n!status stalk\n - Will continually monitor the status and report it to the room every 30s.\n\n\n\n\n\n\n\n\n\n\n\n\nBe prepared to page other people as directed by the Incident Commander.\n\n\n\n\n\n\nProvide regular status updates in Slack (roughly every 30mins) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @here.\n\n\n\n\n\n\nFollow instructions from the Incident Commander.\n\n\n\n\n\n\nSteps for Scribe\n#\n\n\nYou are there to document the key information from the incident in Slack.\n\n\n\n\n\n\nUpdate the Slack room with who the IC is, who the Deputy is, and that you're the scribe (if not already done).\n\n\n\n\ne.g. \"IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson\"\n\n\n\n\n\n\n\n\nYou should add notes to Slack when significant actions are taken, or findings are determined. 
You don't need to wait for the IC to direct this - use your own judgment.\n\n\n\n\nYou should also add \nTODO\n notes to the Slack room that indicate follow-ups slated for later.\n\n\n\n\n\n\n\n\nFollow instructions from the Incident Commander.\n\n\n\n\n\n\nSteps for Subject Matter Experts\n#\n\n\nYou are there to support the incident commander in identifying the cause of the incident, suggesting and evaluation repair actions, and following through on the repair actions.\n\n\n\n\n\n\nInvestigate the incident by analyzing any graphs or logs at your disposal. Announce all findings to the incident commander.\n\n\n\n\nIf you are unsure of the cause, that's fine, state that you are investigating and provide regular updates to the IC.\n\n\n\n\n\n\n\n\nAnnounce all suggestions for resolution to the incident commander, it is their decision on how to proceed, do not follow any actions unless told to do so!\n\n\n\n\n\n\nFollow instructions from the incident commander.\n\n\n\n\n\n\n(Optional) Once the call is over and post-mortem is created, add any notes you think are relevant to the post-mortem page.\n\n\n\n\n\n\nSteps for Customer Liaison\n#\n\n\nBe on stand-by to post public facing messages regarding the incident.\n\n\n\n\n\n\nYou will typically be required to update the status page and to send Tweets from our various accounts at certain times during the call.\n\n\n\n\n\n\nFollow instructions from the Incident Commander.", - "title": "During An Incident" - }, - { - "location": "/during/during_an_incident/#dont-panic", - "text": "Join the incident call and chat (see links above). Anyone is free to join the call or chat to observe and follow along with the incident. If you wish to participate however, you should join both. If you can't join the call for some reason, you should have a dedicated proxy for the call. Disjointed discussions in the chat room are ultimately distracting. 
Follow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand. If you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once can become overwhelming, so we should try to maintain a hierarchical structure to the call if possible. Follow instructions from the Incident Commander. Is there no IC on the call? Manually page them with !ic page in Slack. This will page the primary and backup ICs at the same time. Never hesitate to page the IC. It's much better to have them and not need them than the other way around.", - "title": "Don't Panic!" - }, - { - "location": "/during/during_an_incident/#steps-for-incident-commander", - "text": "Resolve the incident as quickly and as safely as possible; use the Deputy to assist you. Delegate any tasks to relevant experts at your discretion. Announce on the call and in Slack that you are the incident commander, who you have designated as deputy (usually the backup IC), and scribe. Identify if there is an obvious cause to the incident (recent deployment, spike in traffic, etc.), and delegate investigation to relevant experts. Use the service experts on the call to assist in the analysis. They should be able to quickly provide confirmation of the cause, but not always. It's the call of the IC on how to proceed in cases where the cause is not positively known. Confer with service owners and use their knowledge to help you. Identify investigation and repair actions (roll back, rate-limit services, etc) and delegate actions to relevant service experts. Typically something like this (obviously not an exhaustive list), Bad Deployment: Roll it back. Web Application Stuck/Crashed: Do a rolling restart. Event Flood: Validate automatic throttling is sufficient, adjust manually if not. Data Center Outage: Validate automation has removed bad data center. Force it to do so if not. 
Degraded Service Behavior without load: Gather forensic data (heap dumps, etc), and consider doing a rolling restart. Listen for prompts from your Deputy regarding severity escalations, decide whether we need to announce publicly, and instruct customer liaison accordingly. Announcing publicly is at your discretion as IC. If you are unsure, then announce publicly (\"If in doubt, tweet it out\"). Once incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now. Move the remaining, non-time-critical discussion to Slack. Follow up to ensure the customer liaison wraps up the incident publicly. Identify any post-incident clean-up work. You may need to perform debriefing/analysis of the underlying root cause. (After call ends) Create the post-mortem page from the template, and assign an owner to the post-mortem for the incident. (After call ends) Send out an internal email explaining that we had a major incident, provide a link to the post-mortem.", - "title": "Steps for Incident Commander" - }, - { - "location": "/during/during_an_incident/#steps-for-deputy", - "text": "You are there to support the IC in whatever they need. Monitor the status, and notify the IC if/when the incident escalates in severity level, OfficerURL can help you to monitor the status on Slack, !status - Will tell you the current status. !status stalk - Will continually monitor the status and report it to the room every 30s. Be prepared to page other people as directed by the Incident Commander. Provide regular status updates in Slack (roughly every 30mins) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @here. 
Follow instructions from the Incident Commander.", - "title": "Steps for Deputy" - }, - { - "location": "/during/during_an_incident/#steps-for-scribe", - "text": "You are there to document the key information from the incident in Slack. Update the Slack room with who the IC is, who the Deputy is, and that you're the scribe (if not already done). e.g. \"IC: Bob Boberson, Deputy: Deputy Deputyson, Scribe: Writer McWriterson\" You should add notes to Slack when significant actions are taken, or findings are made. You don't need to wait for the IC to direct this - use your own judgment. You should also add TODO notes to the Slack room that indicate follow-ups slated for later. Follow instructions from the Incident Commander.", - "title": "Steps for Scribe" - }, - { - "location": "/during/during_an_incident/#steps-for-subject-matter-experts", - "text": "You are there to support the incident commander in identifying the cause of the incident, suggesting and evaluating repair actions, and following through on the repair actions. Investigate the incident by analyzing any graphs or logs at your disposal. Announce all findings to the incident commander. If you are unsure of the cause, that's fine; state that you are investigating and provide regular updates to the IC. Announce all suggestions for resolution to the incident commander. It is their decision on how to proceed; do not take any actions unless told to do so! Follow instructions from the incident commander. (Optional) Once the call is over and post-mortem is created, add any notes you think are relevant to the post-mortem page.", - "title": "Steps for Subject Matter Experts" - }, - { - "location": "/during/during_an_incident/#steps-for-customer-liaison", - "text": "Be on stand-by to post public-facing messages regarding the incident. You will typically be required to update the status page and to send Tweets from our various accounts at certain times during the call. 
Follow instructions from the Incident Commander.", - "title": "Steps for Customer Liaison" - }, - { - "location": "/during/security_incident_response/", - "text": "Incident Commander Required\n\n\nAs with all major incidents at PagerDuty, security ones will also involve an Incident Commander, who will delegate the tasks to relevant resolvers. Tasks may be performed in parallel as assigned by the IC. Page one at the earliest possible opportunity.\n\n\n\n\nChecklist\n#\n\n\nDetails for each of these items are available in the next section.\n\n\n\n\nStop the attack in progress.\n\n\nCut off the attack vector.\n\n\nAssemble the response team.\n\n\nIsolate affected instances.\n\n\nIdentify timeline of attack.\n\n\nIdentify compromised data.\n\n\nAssess risk to other systems.\n\n\nAssess risk of re-attack.\n\n\nApply additional mitigations, additions to monitoring, etc.\n\n\nForensic analysis of compromised systems.\n\n\nInternal communication.\n\n\nInvolve law enforcement.\n\n\nReach out to external parties that may have been used as vector for attack.\n\n\nExternal communication.\n\n\n\n\n\n\nAttack Mitigation\n#\n\n\nStop the attack as quickly as you can, via any means necessary. Shut down servers, network isolate them, turn off a data center if you have to. 
Some common things to try,\n\n\n\n\nShutdown the instance from the provider console (do not delete or terminate if you can help it, as we'll need to do forensics).\n\n\nIf you happen to be logged into the box you can try to,\n\n\nRe-instate our default iptables rules to restrict traffic.\n\n\nkill -9\n any active session you think is an attacker.\n\n\nChange root password, and update /etc/shadow to lock out all other users.\n\n\nsudo shutdown now\n\n\n\n\n\n\n\n\nCut Off Attack Vector\n#\n\n\nIdentify the likely attack vectors and path/fix them so they cannot be re-exploited immediately after stopping the attack.\n\n\n\n\nIf you suspect a third-party provider is compromised, delete all accounts except your own (and those of others who are physically present) and immediately rotate your password and MFA tokens.\n\n\nIf you suspect a service application was an attack vector, disable any relevant code paths, or shut down the service entirely.\n\n\n\n\nAssemble Response Team\n#\n\n\nIdentify the key responders for the security incident, and keep them all in the loop. Set up a secure method of communicating all information associated with the incident. Details on the incident (or even the fact that an incident has occurred) should be kept private to the responders until you are confident the attack is not being triggered internally.\n\n\n\n\nThe security and site-reliability teams should usually be involved.\n\n\nA representative for any affected services should be involved.\n\n\nAn Incident Commander (IC) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate.\n\n\nDo not communicate with anyone not on the response team about the incident until forensics has been performed. 
The attack could be happening internally.\n\n\nGive the project an innocuous codename that can be used for chats/documents so if anyone overhears they don't realize it's a security incident. (e.g. sapphire-unicorn).\n\n\nPrefix all emails, and chat topics with \"Attorney Work Project\".\n\n\n\n\nIsolate Affected Instances\n#\n\n\nAny instances which were affected by the attack should be immediately isolated from any other instances. As soon as possible, an image of the system should be taken and put into a read-only cold storage for later forensic analysis.\n\n\n\n\nBlacklist the IP addresses for any affected instances from all other hosts.\n\n\nTurn off and shutdown the instances immediately if you didn't do that to stop the attack.\n\n\nTake a disk image for any disks attached to the instances, and ship them to an off-site cold storage location. You should make sure these images are read-only and cannot be tampered with.\n\n\n\n\nIdentify Timeline of Attack\n#\n\n\nWork with all tools at your disposal to identify the timeline of the attack, along with exactly what the attacker did.\n\n\n\n\nAny reconnaissance the attacker performed on the system before the attack started.\n\n\nWhen the attacker gained access to the system.\n\n\nWhat actions the attacker performed on the system, and when.\n\n\nIdentify how long the attacker had access to the system before they were detected, and before they were kicked out.\n\n\nIdentify any queries the attacker ran on databases.\n\n\nTry to identify if the attacker still has access to the system via another back door. 
Monitor logs for unusual activity, etc.\n\n\n\n\nCompromised Data\n#\n\n\nUsing forensic analysis of log files, time-series graphs, and any other information/tools at your disposal, attempt to identify what information was compromised (if any),\n\n\n\n\nIdentify any data that was compromised during the attack.\n\n\nWas any data exfiltrated from a database?\n\n\nWhat keys were on the system that are now considering compromised?\n\n\nWas the attacker able to identify other components of the system (map out the network, etc).\n\n\n\n\n\n\nFind exactly what customer data has been compromised, if any.\n\n\n\n\nAssess Risk\n#\n\n\nBased on the data that was compromised, assess the risk to other systems.\n\n\n\n\nDoes the attacker have enough information to find another way in?\n\n\nWere any passwords or keys stored on the host? If so, they should be considered compromised, regardless of how they were stored.\n\n\nAny user accounts that were used in the initial attack should rotate all of their keys and passwords on every other system they have an account.\n\n\n\n\nApply Additional Mitigations\n#\n\n\nStart applying mitigations to other parts of your system.\n\n\n\n\nRotate any compromised data.\n\n\nIdentify any new alerting which is needed to notify of a similar breach.\n\n\nBlock any IP addresses associated with the attack.\n\n\nIdentify any keys/credentials that are compromised and revoke their access immediately.\n\n\n\n\nForensic Analysis\n#\n\n\nOnce you are confident the systems are secured, and enough monitoring is in place to detect another attack, you can move onto the forensic analysis stage.\n\n\n\n\nTake any read-only images you created, any access logs you have, and comb through them for more information about the attack.\n\n\nIdentify exactly what happened, how it happened, and how to prevent it in future.\n\n\nKeep track of all IP addresses involved in the attack.\n\n\nMonitor logs for any attempt to regain access to the system by the 
attacker.\n\n\n\n\nInternal Communication\n#\n\n\nDelegate to:\n VP or Director of Engineering\n\n\nCommunicate internally only once you are confident (via forensic analysis) that the attack was not sourced internally.\n\n\n\n\nDon't go into too much detail.\n\n\nOverview the timeline.\n\n\nDiscuss mitigation steps taken.\n\n\nFollow up with more information once it is known.\n\n\n\n\nLiaise With Law Enforcement / External Actors\n#\n\n\nDelegate to:\n VP or Director of Engineering\n\n\nWork with law enforcement to identify the source of the attack, letting any system owners know that systems under their control may be compromised, etc.\n\n\n\n\nContact local law enforcement.\n\n\nContact FBI.\n\n\nContact operators for any systems used in the attack, their systems may also have been compromised.\n\n\nContact security companies to help in assessing risk and any PR next steps.\n\n\n\n\nExternal Communication\n#\n\n\nDelegate to:\n Marketing Team\n\n\nOnce you have validated all of the information you have is accurate, have a timeline of events, and know exactly what information was compromised, how it was compromised, and sure that it won't happen again. Only then should you prepare and release a public statement to customers informing them of the compromised information and any steps they need to take.\n\n\n\n\nInclude the date in the title of any announcement, so that it's never confused for a potential new breach.\n\n\nDon't say \"We take security very seriously\". It makes everyone cringe when they read it.\n\n\nBe honest, accept responsibility, and present the facts, along with exactly how we plan to prevent such things in future.\n\n\nBe as detailed as possible with the timeline.\n\n\nBe as detailed as possible in what information was compromised, and how it affects customers. If we were storing something we shouldn't have been, be honest about it. 
It'll come out later and it'll be much worse.\n\n\nDon't name and shame any external parties that might have caused the compromise. It's bad form. (Unless they've already publicly disclosed, in which case we can link to their disclosure).\n\n\nRelease the external communication as soon as possible, preferably within a few days of the compromise. The longer we wait, the worse it will be.\n\n\nFigure out if there is a way to get in touch with customers' internal security teams before the general public notice is sent.\n\n\n\n\n\n\nAdditional Reading\n#\n\n\n\n\nComputer Security Incident Handling Guide\n (NIST)\n\n\nIncident Handler's Handbook\n (SANS)\n\n\nResponding to IT Security Incidents\n (Microsoft)\n\n\nDefining Incident Management Processes for CSIRTs: A Work in Progress\n (CMU)\n\n\nCreating and Managing Computer Security Incident Handling Teams (CSIRTS)\n (CERT)", - "title": "Security Incident" - }, - { - "location": "/during/security_incident_response/#checklist", - "text": "Details for each of these items are available in the next section. Stop the attack in progress. Cut off the attack vector. Assemble the response team. Isolate affected instances. Identify timeline of attack. Identify compromised data. Assess risk to other systems. Assess risk of re-attack. Apply additional mitigations, additions to monitoring, etc. Forensic analysis of compromised systems. Internal communication. Involve law enforcement. Reach out to external parties that may have been used as vector for attack. External communication.", - "title": "Checklist" - }, - { - "location": "/during/security_incident_response/#attack-mitigation", - "text": "Stop the attack as quickly as you can, via any means necessary. Shut down servers, network isolate them, turn off a data center if you have to. Some common things to try, Shutdown the instance from the provider console (do not delete or terminate if you can help it, as we'll need to do forensics). 
If you happen to be logged into the box you can try to, Re-instate our default iptables rules to restrict traffic. kill -9 any active session you think is an attacker. Change root password, and update /etc/shadow to lock out all other users. sudo shutdown now", - "title": "Attack Mitigation" - }, - { - "location": "/during/security_incident_response/#cut-off-attack-vector", - "text": "Identify the likely attack vectors and patch/fix them so they cannot be re-exploited immediately after stopping the attack. If you suspect a third-party provider is compromised, delete all accounts except your own (and those of others who are physically present) and immediately rotate your password and MFA tokens. If you suspect a service application was an attack vector, disable any relevant code paths, or shut down the service entirely.", - "title": "Cut Off Attack Vector" - }, - { - "location": "/during/security_incident_response/#assemble-response-team", - "text": "Identify the key responders for the security incident, and keep them all in the loop. Set up a secure method of communicating all information associated with the incident. Details on the incident (or even the fact that an incident has occurred) should be kept private to the responders until you are confident the attack is not being triggered internally. The security and site-reliability teams should usually be involved. A representative for any affected services should be involved. An Incident Commander (IC) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate. Do not communicate with anyone not on the response team about the incident until forensics has been performed. The attack could be happening internally. Give the project an innocuous codename that can be used for chats/documents so if anyone overhears they don't realize it's a security incident. 
(e.g. sapphire-unicorn). Prefix all emails and chat topics with \"Attorney Work Project\".", - "title": "Assemble Response Team" - }, - { - "location": "/during/security_incident_response/#isolate-affected-instances", - "text": "Any instances which were affected by the attack should be immediately isolated from any other instances. As soon as possible, an image of the system should be taken and put into read-only cold storage for later forensic analysis. Blacklist the IP addresses for any affected instances from all other hosts. Turn off and shut down the instances immediately if you didn't do that to stop the attack. Take a disk image for any disks attached to the instances, and ship them to an off-site cold storage location. You should make sure these images are read-only and cannot be tampered with.", - "title": "Isolate Affected Instances" - }, - { - "location": "/during/security_incident_response/#identify-timeline-of-attack", - "text": "Work with all tools at your disposal to identify the timeline of the attack, along with exactly what the attacker did. Any reconnaissance the attacker performed on the system before the attack started. When the attacker gained access to the system. What actions the attacker performed on the system, and when. Identify how long the attacker had access to the system before they were detected, and before they were kicked out. Identify any queries the attacker ran on databases. Try to identify if the attacker still has access to the system via another back door. Monitor logs for unusual activity, etc.", - "title": "Identify Timeline of Attack" - }, - { - "location": "/during/security_incident_response/#compromised-data", - "text": "Using forensic analysis of log files, time-series graphs, and any other information/tools at your disposal, attempt to identify what information was compromised (if any), Identify any data that was compromised during the attack. Was any data exfiltrated from a database? 
What keys were on the system that are now considered compromised? Was the attacker able to identify other components of the system (map out the network, etc)? Find exactly what customer data has been compromised, if any.", - "title": "Compromised Data" - }, - { - "location": "/during/security_incident_response/#assess-risk", - "text": "Based on the data that was compromised, assess the risk to other systems. Does the attacker have enough information to find another way in? Were any passwords or keys stored on the host? If so, they should be considered compromised, regardless of how they were stored. Any user accounts that were used in the initial attack should rotate all of their keys and passwords on every other system on which they have an account.", - "title": "Assess Risk" - }, - { - "location": "/during/security_incident_response/#apply-additional-mitigations", - "text": "Start applying mitigations to other parts of your system. Rotate any compromised data. Identify any new alerting which is needed to notify of a similar breach. Block any IP addresses associated with the attack. Identify any keys/credentials that are compromised and revoke their access immediately.", - "title": "Apply Additional Mitigations" - }, - { - "location": "/during/security_incident_response/#forensic-analysis", - "text": "Once you are confident the systems are secured, and enough monitoring is in place to detect another attack, you can move onto the forensic analysis stage. Take any read-only images you created, any access logs you have, and comb through them for more information about the attack. Identify exactly what happened, how it happened, and how to prevent it in future. Keep track of all IP addresses involved in the attack. 
Monitor logs for any attempt to regain access to the system by the attacker.", - "title": "Forensic Analysis" - }, - { - "location": "/during/security_incident_response/#internal-communication", - "text": "Delegate to: VP or Director of Engineering Communicate internally only once you are confident (via forensic analysis) that the attack was not sourced internally. Don't go into too much detail. Give an overview of the timeline. Discuss mitigation steps taken. Follow up with more information once it is known.", - "title": "Internal Communication" - }, - { - "location": "/during/security_incident_response/#liaise-with-law-enforcement-external-actors", - "text": "Delegate to: VP or Director of Engineering Work with law enforcement to identify the source of the attack, letting any system owners know that systems under their control may be compromised, etc. Contact local law enforcement. Contact FBI. Contact operators for any systems used in the attack, as their systems may also have been compromised. Contact security companies to help in assessing risk and any PR next steps.", - "title": "Liaise With Law Enforcement / External Actors" - }, - { - "location": "/during/security_incident_response/#external-communication", - "text": "Delegate to: Marketing Team Prepare and release a public statement to customers only once you have validated that all of the information you have is accurate, have a timeline of events, know exactly what information was compromised and how it was compromised, and are sure that it won't happen again. The statement should inform customers of the compromised information and any steps they need to take. Include the date in the title of any announcement, so that it's never confused for a potential new breach. Don't say \"We take security very seriously\". It makes everyone cringe when they read it. Be honest, accept responsibility, and present the facts, along with exactly how we plan to prevent such things in future. Be as detailed as possible with the timeline. 
Be as detailed as possible in what information was compromised, and how it affects customers. If we were storing something we shouldn't have been, be honest about it. It'll come out later and it'll be much worse. Don't name and shame any external parties that might have caused the compromise. It's bad form. (Unless they've already publicly disclosed, in which case we can link to their disclosure). Release the external communication as soon as possible, preferably within a few days of the compromise. The longer we wait, the worse it will be. Figure out if there is a way to get in touch with customers' internal security teams before the general public notice is sent.", - "title": "External Communication" - }, - { - "location": "/during/security_incident_response/#additional-reading", - "text": "Computer Security Incident Handling Guide (NIST) Incident Handler's Handbook (SANS) Responding to IT Security Incidents (Microsoft) Defining Incident Management Processes for CSIRTs: A Work in Progress (CMU) Creating and Managing Computer Security Incident Handling Teams (CSIRTS) (CERT)", - "title": "Additional Reading" - }, - { - "location": "/after/post_mortem_process/", - "text": "For every major incident (SEV-2/1), we need to follow up with a post-mortem. A blame-free, detailed description, of exactly what went wrong in order to cause the incident, along with a list of steps to take in order to prevent a similar incident from occurring again in the future. The incident response process itself should also be included.\n\n\n\n\nOwner Designation\n#\n\n\nThe first step is that a post-mortem owner will be designated. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. 
Please use Slack for coordinating followup. A detailed list of the steps is available below,\n\n\nOwner Responsibilities\n#\n\n\nAs owner of a post-mortem, you are responsible for the following,\n\n\n\n\nScheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident).\n\n\nUpdating the page with all of the necessary content.\n\n\nInvestigating the incident, pulling in whomever you need from other teams to assist in the investigation.\n\n\nCreating follow-up JIRA tickets (\nYou are only responsible for creating the tickets, not following them up to resolution\n).\n\n\nRunning the post-mortem meeting (\nthese generally run themselves, but you should get people back on topic if the conversation starts to wander\n).\n\n\nIn cases where we need a public blog post, creating \n reviewing it with appropriate parties.\n\n\n\n\nPost-Mortem Wiki Page\n#\n\n\nOnce you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information.\n\n\n\n\n\n\n(If not already done by the IC) Create a new post-mortem page for the incident.\n\n\n\n\n\n\nSchedule a post-mortem meeting for within 5 business days of the incident. 
You should schedule this before filling in the page, just so it's on the calendar.\n\n\n\n\nCreate the meeting on the \"Incident Post-Mortem Meetings\" shared calendar.\n\n\n\n\n\n\n\n\nBegin populating the page with all of the information you have.\n\n\n\n\nThe timeline should be the main focus to begin with.\n\n\nThe timeline should include important changes in status/impact, and also key actions taken by responders.\n\n\nYou should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV).\n\n\n\n\n\n\nGo through the history in Slack to identify the responders, and add them to the page.\n\n\nIdentify the Incident Commander and Scribe in this list.\n\n\n\n\n\n\n\n\n\n\n\n\nPopulate the page with more detailed information.\n\n\n\n\nFor each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline.\n\n\n\n\n\n\n\n\nPerform an analysis of the incident.\n\n\n\n\nCapture all available data regarding the incident. What caused it, how many customers were affected, etc.\n\n\nAny commands or queries you use to look up data should be posted in the page so others can see how the data was gathered.\n\n\nCapture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery)\n\n\nIdentify the underlying cause of the incident (What happened, and why did it happen).\n\n\n\n\n\n\n\n\nCreate any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets),\n\n\n\n\nGo through the history in Slack to identify any TODO items.\n\n\nLabel all tickets with their severity level and date tags.\n\n\nAny actions which can reduce re-occurrence of the incident.\n\n\n(There may be some trade-off here, and that's fine. 
Sometimes the ROI isn't worth the effort that would go into it).\n\n\n\n\n\n\nIdentify any actions which can make our incident response process better.\n\n\nBe careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with.\n\n\n\n\n\n\n\n\nWrite the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out.\n\n\n\n\nAvoid using the word \"outage\" unless it really was a full outage, use the word \"incident\" instead. Customers generally see \"outage\" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA.\n\n\nLook at other examples of previous post-mortems to see the kind of thing you should send.\n\n\n\n\n\n\n\n\nPost-Mortem Meeting\n#\n\n\nThese meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us.\n\n\nYou should invite the following people to the post-mortem meeting,\n\n\n\n\nAlways\n\n\nThe incident commander.\n\n\nService owners involved in the incident.\n\n\nKey engineer(s)/responders involved in the incident.\n\n\n\n\n\n\nOptional\n\n\nCustomer liaison. (Only SEV-1 incidents)\n\n\n\n\n\n\n\n\nA general agenda for the meeting would be something like,\n\n\n\n\nRecap the timeline, to make sure everyone agrees and is on the same page.\n\n\nRecap important points, and any unusual items.\n\n\nDiscuss how the problem could've been caught.\n\n\nDid it show up in canary?\n\n\nCould it have been caught in tests, or loadtest environment?\n\n\n\n\n\n\nDiscuss customer impact. 
Any comments from customers, etc.\n\n\nReview action items that have been created, discuss if appropriate, or if more are needed, etc.\n\n\n\n\nExamples\n#\n\n\nHere are some examples of post-mortems from other companies as a reference,\n\n\n\n\nStripe\n\n\nLastPass\n\n\nAWS\n\n\nTwilio\n\n\nHeroku\n\n\nNetflix\n\n\nGOV.UK Rail Accident Investigation\n\n\nA List of Post-mortems!\n\n\n\n\nUseful Resources\n#\n\n\n\n\nAdvanced PostMortem Fu and Human Error 101 (Velocity 2011)\n\n\nBlame. Language. Sharing.", - "title": "Post-Mortem Process" - }, - { - "location": "/after/post_mortem_process/#owner-designation", - "text": "The first step is that a post-mortem owner will be designated. This is done by the IC either at the end of a major incident call, or very shortly after. You will be notified directly by the IC if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use Slack for coordinating followup. A detailed list of the steps is available below,", - "title": "Owner Designation" - }, - { - "location": "/after/post_mortem_process/#owner-responsibilities", - "text": "As owner of a post-mortem, you are responsible for the following, Scheduling the post-mortem meeting (on the shared calendar) and inviting the relevant people (this should be scheduled within 5 business days of the incident). Updating the page with all of the necessary content. Investigating the incident, pulling in whomever you need from other teams to assist in the investigation. Creating follow-up JIRA tickets ( You are only responsible for creating the tickets, not following them up to resolution ). Running the post-mortem meeting ( these generally run themselves, but you should get people back on topic if the conversation starts to wander ). 
In cases where we need a public blog post, creating reviewing it with appropriate parties.", - "title": "Owner Responsibilities" - }, - { - "location": "/after/post_mortem_process/#post-mortem-wiki-page", - "text": "Once you've been designated as the owner of a post-mortem, you should start updating the page with all the relevant information. (If not already done by the IC) Create a new post-mortem page for the incident. Schedule a post-mortem meeting for within 5 business days of the incident. You should schedule this before filling in the page, just so it's on the calendar. Create the meeting on the \"Incident Post-Mortem Meetings\" shared calendar. Begin populating the page with all of the information you have. The timeline should be the main focus to begin with. The timeline should include important changes in status/impact, and also key actions taken by responders. You should mark the start of the incident in red, and the resolution in green (for when we went into/out of SEV). Go through the history in Slack to identify the responders, and add them to the page. Identify the Incident Commander and Scribe in this list. Populate the page with more detailed information. For each item in the timeline, identify a metric, or some third-party page where the data came from. This could be a link to a Datadog graph, a SumoLogic search, a Tweet, etc. Anything which shows the data point you're trying to illustrate in the timeline. Perform an analysis of the incident. Capture all available data regarding the incident. What caused it, how many customers were affected, etc. Any commands or queries you use to look up data should be posted in the page so others can see how the data was gathered. Capture the impact to customers (generally in terms of event submission, delayed processing, and slow notification delivery) Identify the underlying cause of the incident (What happened, and why did it happen). 
Create any followup action JIRA tickets (or note down topics for discussion if we need to decide on a direction to go before creating tickets), Go through the history in Slack to identify any TODO items. Label all tickets with their severity level and date tags. Any actions which can reduce re-occurrence of the incident. (There may be some trade-off here, and that's fine. Sometimes the ROI isn't worth the effort that would go into it). Identify any actions which can make our incident response process better. Be careful with creating too many tickets. Generally we only want to create things that are P0/P1's. Things that absolutely should be dealt with. Write the external message that will be sent to customers. This will be reviewed during the post-mortem meeting before it is sent out. Avoid using the word \"outage\" unless it really was a full outage, use the word \"incident\" instead. Customers generally see \"outage\" and assume everything was down, when in reality it was likely just some alerts delivered outside of SLA. Look at other examples of previous post-mortems to see the kind of thing you should send.", - "title": "Post-Mortem Wiki Page" - }, - { - "location": "/after/post_mortem_process/#post-mortem-meeting", - "text": "These meetings should generally last 15-30 minutes, and are intended to be a wrap up of the post-mortem process. We should discuss what happened, what we could've done better, and any followup actions we need to take. The goal is to suss out any disagreement on the facts, analysis, or recommended actions, and to get some wider awareness of the problems that are causing reliability issues for us. You should invite the following people to the post-mortem meeting, Always The incident commander. Service owners involved in the incident. Key engineer(s)/responders involved in the incident. Optional Customer liaison. 
(Only SEV-1 incidents) A general agenda for the meeting would be something like, Recap the timeline, to make sure everyone agrees and is on the same page. Recap important points, and any unusual items. Discuss how the problem could've been caught. Did it show up in canary? Could it have been caught in tests, or loadtest environment? Discuss customer impact. Any comments from customers, etc. Review action items that have been created, discuss if appropriate, or if more are needed, etc.", - "title": "Post-Mortem Meeting" - }, - { - "location": "/after/post_mortem_process/#examples", - "text": "Here are some examples of post-mortems from other companies as a reference, Stripe LastPass AWS Twilio Heroku Netflix GOV.UK Rail Accident Investigation A List of Post-mortems!", - "title": "Examples" - }, - { - "location": "/after/post_mortem_process/#useful-resources", - "text": "Advanced PostMortem Fu and Human Error 101 (Velocity 2011) Blame. Language. Sharing.", - "title": "Useful Resources" - }, - { - "location": "/after/post_mortem_template/", - "text": "This is a standard template we use for post-mortems at PagerDuty. Each section describes the type of information you will want to put in that section.\n\n\n\n\n\n\nGuidelines\n\n\nThis page is intended to be reviewed during a post-mortem meeting that should be scheduled within 5 business days of any event.\nYour first step should be to schedule the post-mortem meeting in the shared calendar for within 5 business days after the incident.\nDon't wait until you've filled in the info to schedule the meeting, however make sure the page is completed by the meeting.\n\n\n\n\n Post-Mortem Owner:\n \nYour name goes here.\n\n\n Meeting Scheduled For:\n \nSchedule the meeting on the \"Incident Post-Mortem Meetings\" shared calendar, for within 5 business days after the incident. 
Put the date/time here.\n\n\n Call Recording:\n \nLink to the incident call recording.\n\n\nOverview\n#\n\n\nInclude a \nshort\n sentence or two summarizing the root cause, timeline summary, and the impact. E.g. \"On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA.\"\n\n\nWhat Happened\n#\n\n\nInclude a short description of what happened.\n\n\nRoot Cause\n#\n\n\nInclude a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process.\n\n\nResolution\n#\n\n\nInclude a description what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution.\n\n\nImpact\n#\n\n\nBe very specific here, include exact numbers.\n\n\n\n\n\n\n\n\nTime in SEV-1\n\n\n?mins\n\n\n\n\n\n\n\n\n\n\nNotifications Delivered out of SLA\n\n\n??% (?? of ??)\n\n\n\n\n\n\nEvents Dropped / Not Accepted\n\n\n??% (?? of ??) \nShould usually be 0, but always check\n\n\n\n\n\n\nAccounts Affected\n\n\n??\n\n\n\n\n\n\nUsers Affected\n\n\n??\n\n\n\n\n\n\nSupport Requests Raised\n\n\n?? \nInclude any relevant links to tickets\n\n\n\n\n\n\n\n\nResponders\n#\n\n\n\n\nWho was the IC?\n\n\nWho was the scribe?\n\n\nWho else was involved?\n\n\nWho else was involved?\n\n\n\n\nTimeline\n#\n\n\nSome important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at.\n\n\n\n\n\n\n\n\nTime (UTC)\n\n\nEvent\n\n\nData Link\n\n\n\n\n\n\n\n\n\n\nHow'd We Do?\n#\n\n\nWhat Went Well?\n#\n\n\n\n\nList anything you did well and want to call out. 
It's OK to not list anything.\n\n\n\n\nWhat Didn't Go So Well?\n#\n\n\n\n\nList anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes.\n\n\n\n\nAction Items\n#\n\n\nEach action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: \u201csev1_YYYYMMDD\u201d (such as sev1_20150911) and simply \u201csev1\u201d. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process.\n\n\nMessaging\n#\n\n\nInternal Email\n#\n\n\nThis is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page.\n\n\n\n\nBriefly summarize what happened and where the post-mortem page (this page) can be found.\n\n\n\n\nExternal Message\n#\n\n\nThis is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.)\n\n\n\n\nSummary\n\n\nWhat Happened?\n\n\nWhat Are We Doing About This?", - "title": "Post-Mortem Template" - }, - { - "location": "/after/post_mortem_template/#overview", - "text": "Include a short sentence or two summarizing the root cause, timeline summary, and the impact. E.g. \"On the morning of August 99th, we suffered a 1 minute SEV-1 due to a runaway process on our primary database machine. 
This slowness caused roughly 0.024% of alerts that had begun during this time to be delivered out of SLA.\"", - "title": "Overview" - }, - { - "location": "/after/post_mortem_template/#what-happened", - "text": "Include a short description of what happened.", - "title": "What Happened" - }, - { - "location": "/after/post_mortem_template/#root-cause", - "text": "Include a description of the root cause. If there were any actions taken that exacerbated the issue, also include them here with the intention of learning from any mistakes made during the resolution process.", - "title": "Root Cause" - }, - { - "location": "/after/post_mortem_template/#resolution", - "text": "Include a description what solved the problem. If there was a temporary fix in place, describe that along with the long-term solution.", - "title": "Resolution" - }, - { - "location": "/after/post_mortem_template/#impact", - "text": "Be very specific here, include exact numbers. Time in SEV-1 ?mins Notifications Delivered out of SLA ??% (?? of ??) Events Dropped / Not Accepted ??% (?? of ??) Should usually be 0, but always check Accounts Affected ?? Users Affected ?? Support Requests Raised ?? Include any relevant links to tickets", - "title": "Impact" - }, - { - "location": "/after/post_mortem_template/#responders", - "text": "Who was the IC? Who was the scribe? Who else was involved? Who else was involved?", - "title": "Responders" - }, - { - "location": "/after/post_mortem_template/#timeline", - "text": "Some important times to include: (1) time the root cause began, (2) time of the page, (3) time that the status page was updated (i.e. when the incident became public), (4) time of any significant actions, (5) time the SEV-2/1 ended, (6) links to tools/logs that show how the timestamp was arrived at. Time (UTC) Event Data Link", - "title": "Timeline" - }, - { - "location": "/after/post_mortem_template/#howd-we-do", - "text": "", - "title": "How'd We Do?" 
- }, - { - "location": "/after/post_mortem_template/#what-went-well", - "text": "List anything you did well and want to call out. It's OK to not list anything.", - "title": "What Went Well?" - }, - { - "location": "/after/post_mortem_template/#what-didnt-go-so-well", - "text": "List anything you think we didn't do very well. The intent is that we should follow up on all points here to improve our processes.", - "title": "What Didn't Go So Well?" - }, - { - "location": "/after/post_mortem_template/#action-items", - "text": "Each action item should be in the form of a JIRA ticket, and each ticket should have the same set of two tags: \u201csev1_YYYYMMDD\u201d (such as sev1_20150911) and simply \u201csev1\u201d. Include action items such as: (1) any fixes required to prevent the root cause in the future, (2) any preparedness tasks that could help mitigate the problem if it came up again, (3) remaining post-mortem steps, such as the internal email, as well as the status-page public post, (4) any improvements to our incident response process.", - "title": "Action Items" - }, - { - "location": "/after/post_mortem_template/#messaging", - "text": "", - "title": "Messaging" - }, - { - "location": "/after/post_mortem_template/#internal-email", - "text": "This is a follow-up for employees. It should be sent out right after the post-mortem meeting is over. It only needs a short paragraph summarizing the incident and a link to this wiki page. Briefly summarize what happened and where the post-mortem page (this page) can be found.", - "title": "Internal Email" - }, - { - "location": "/after/post_mortem_template/#external-message", - "text": "This is what will be included on the status.pagerduty.com website regarding this incident. What are we telling customers, including an apology? (The apology should be genuine, not rote.) Summary What Happened? 
What Are We Doing About This?", - "title": "External Message" - }, - { - "location": "/training/overview/", - "text": "Learning about the Spearhead Systems incident response process is an important part of being an effective on-call engineer at Spearhead Systens. This section goes over our training material for the various roles that are involved in our incident response, along with some additional information and training material from government agencies.\n\n\nTraining Guides\n#\n\n\nOur training guides are split up by role, however you are encouraged to read through the training guides even for roles you don't belong to, as it can give you some good insight into how those people will be behaving during major incidents.\n\n\n\n\nIncident Commander Training\n - The \"IC\" is the person who drives a major incident to resolution. They're the person who will be directing everyone else.\n\n\nDeputy Training\n - The Deputy is someone who supports the Incident Commander and can take over for them if necessary.\n\n\nScribe Training\n - This is intended for individuals who will be acting as a scribe during an incident.\n\n\nSME / Resolver Training\n - This is relevant to everyone at Spearhead Systems who are on-call for any team.\n\n\n\n\nNational Incident Management System (NIMS)\n#\n\n\nOur incident response process is loosely based on the \nUS National Incident Management System (NIMS)\n, which is described as,\n\n\nA systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards\u2014regardless of cause, size, location, or complexity\u2014in order to reduce loss of life, property and harm to the environment.\n\n\nWhile it might not initially seem that this would be applicable to an IT operations environment, we've found that many of the lessons learned from major incidents in these situations can be 
directly applied to our industry too. The principles are the same and span many different environments.\n\n\n \n\n\nIf you want to learn more about NIMS, we recommend the \nICS-100\n and \nICS-700\n online training courses, which go over NIMS and the Incident Command System (You can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of \nadditional training material and courses from FEMA\n on NIMS, which I would encourage you to look at.\n\n\nIf you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local \nCERT programs\n (Community Emergency Response Teams). Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. Not only is it an opportunity to get real world experience with disaster response, but the skills you learn can be applied to everyday life too.\n\n\nAlso take a look at the \nAdditional Reading\n section on the home page.", - "title": "Overview" - }, - { - "location": "/training/overview/#training-guides", - "text": "Our training guides are split up by role, however you are encouraged to read through the training guides even for roles you don't belong to, as it can give you some good insight into how those people will be behaving during major incidents. Incident Commander Training - The \"IC\" is the person who drives a major incident to resolution. They're the person who will be directing everyone else. Deputy Training - The Deputy is someone who supports the Incident Commander and can take over for them if necessary. Scribe Training - This is intended for individuals who will be acting as a scribe during an incident. 
SME / Resolver Training - This is relevant to everyone at Spearhead Systems who are on-call for any team.", - "title": "Training Guides" - }, - { - "location": "/training/overview/#national-incident-management-system-nims", - "text": "Our incident response process is loosely based on the US National Incident Management System (NIMS) , which is described as, A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards\u2014regardless of cause, size, location, or complexity\u2014in order to reduce loss of life, property and harm to the environment. While it might not initially seem that this would be applicable to an IT operations environment, we've found that many of the lessons learned from major incidents in these situations can be directly applied to our industry too. The principles are the same and span many different environments. If you want to learn more about NIMS, we recommend the ICS-100 and ICS-700 online training courses, which go over NIMS and the Incident Command System (You can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of additional training material and courses from FEMA on NIMS, which I would encourage you to look at. If you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local CERT programs (Community Emergency Response Teams). Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. Not only is it an opportunity to get real world experience with disaster response, but the skills you learn can be applied to everyday life too. 
Also take a look at the Additional Reading section on the home page.", - "title": "National Incident Management System (NIMS)" - }, - { - "location": "/training/incident_commander/", - "text": "So you want to be an incident commander? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern)!\n\n\n\n\nCredit: \nNASA\n\n\nPurpose\n#\n\n\nIf you could boil down the definition of an Incident Commander to one sentence, it would be,\n\n\n\n\nTake whatever actions are necessary to protect PagerDuty systems and customers.\n\n\n\n\nThe purpose of the Incident Commander is to be the decision maker during an major incident; Delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution.\n\n\nThe Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final.\n\n\nYour job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. \nYou should not be performing any actions or remediations, checking graphs, or investigating logs.\n Those tasks should be delegated.\n\n\nPrerequisites\n#\n\n\nBefore you can be an Incident Commander, it is expected that you meet the following criteria. 
Don't worry if you don't meet them all yet, you can still continue with training!\n\n\n\n\nHas \nexcellent knowledge of PagerDuty systems\n and is able to quickly evaluate good vs bad options, and quickly identify what's actually going on.\n\n\nBeen at PagerDuty for at least 6 months and has a \nsolid understanding of the incident notification pipeline and web stack\n.\n\n\nExcellent verbal and written \ncommunication skills\n.\n\n\nHas \nknowledge of obscure PagerDuty terms\n.\n\n\nHas gravitas and is \nwilling to kick people off a call\n to remove distractions, even if it's the CEO.\n\n\n\n\nResponsibilities\n#\n\n\nRead up on our \nDifferent Roles for Incidents\n to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with.\n\n\nQualities\n#\n\n\nSome qualities we expect from an effective leader include being able to:\n\n\n\n\nTake command.\n\n\nMotivate responders.\n\n\nCommunicate clear directions.\n\n\nSize up the situation and make rapid decisions.\n\n\nAssess the effectiveness of tactics/strategies.\n\n\nBe flexible and modify your plans as necessary.\n\n\n\n\nAs a leader, you should try to:\n\n\n\n\nBe proficient in your job.\n\n\nMake sound and timely decisions.\n\n\nEnsure tasks are understood.\n\n\nBe prepared to step out of a tactical role to assume a leadership role.\n\n\n\n\nTraining Process\n#\n\n\nThe process is fairly loose for now. 
Here's a list of things you can do to train though,\n\n\n\n\n\n\nRead the rest of this page, particularly the sections below.\n\n\n\n\n\n\nParticipate in \nFailure Friday\n (FF).\n\n\n\n\nShadow a FF to see how it's run.\n\n\nBe the scribe for multiple FF's.\n\n\nBe the incident commander for multiple FF's.\n\n\n\n\n\n\n\n\nPlay a game of \"\nKeep Talking and Nobody Explodes\n\" with other people in the office.\n\n\n\n\nFor a more realistic experience, play it with someone in a different office over Hangouts.\n\n\n\n\n\n\n\n\nShadow a current incident commander for at least a full week shift.\n\n\n\n\nGet alerted when they do, join in on the same calls.\n\n\nSit in on an active incident call, follow along with the chat, and follow along with what the Incident Commander is doing.\n\n\nDo not actively participate in the call, keep your questions until the end.\n\n\n\n\n\n\n\n\nReverse shadow a current incident commander for at least a full week shift.\n\n\n\n\nYou should be the one to respond to incidents, and you will take point on calls, however the current IC will be there to take over should you not know how to proceed.\n\n\n\n\n\n\n\n\nGraduation\n#\n\n\nWhat's the difference between an IC in training, and an IC? (This isn't the set up to a joke). 
Simple, an IC puts themselves on the schedule.\n\n\nHandling Incidents\n#\n\n\nEvery incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one.\n\n\n\n\n\n\nIdentify the symptoms.\n\n\n\n\nIdentify what the symptoms are, how big the issue is, and whether it's escalating/flapping/static.\n\n\n\n\n\n\n\n\nSize-up the situation.\n\n\n\n\nGather as much information as you can, as quickly as you can (remember the incident is still happening while you're doing this).\n\n\nGet the facts, the possibilities of what can happen, and the probability of those things happening.\n\n\n\n\n\n\n\n\nStabilize the incident.\n\n\n\n\nIdentify actions you can use to proceed.\n\n\nGather support for the plan (See \"Polling During a Decision\" below).\n\n\nDelegate remediation actions to your SME's.\n\n\n\n\n\n\n\n\nProvide regular updates.\n\n\n\n\nMaintain a cadence, and provide regular updates to everyone on the call.\n\n\nWhat's happening, what are we doing about it, etc.\n\n\n\n\n\n\n\n\nDeputy\n#\n\n\nThe deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander\u2019s position if required.\n\n\nCommunication Responsibilities\n#\n\n\nSharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. 
You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands.\n\n\nWhen given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks.\n\n\nAfter an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary.\n\n\nIncident Call Procedures and Lingo\n#\n\n\nThe \nSteps for Incident Commander\n provide a detailed description of what you should be doing during an incident.\n\n\nAdditionally, aside from following the \nusual incident call etiquette\n, there a few extra etiquette guidelines you should follow as IC:\n\n\n\n\nAlways announce when you join the call if you are the on-call IC.\n\n\nDon't let discussions get out of hand. Keep conversations short.\n\n\nNote objections from others, but your call is final.\n\n\nIf anyone is being actively disruptive to your call, kick them off.\n\n\nAnnounce the end of the call.\n\n\n\n\nHere are some examples of phrases and patterns you should use during incident calls.\n\n\nStart of Call Announcement\n#\n\n\nAt the start of any major incident call, the incident commander should announce the following,\n\n\n\n\nThis is [NAME], I am the Incident Commander for this call.\n\n\n\n\nThis establishes to everyone on the call what your name is, and that you are now the commander. You should state \"Incident Commander\" and not \"IC\", as newcomers may not be familiar with the terminology yet. 
The word \"commander\" makes it very clear that you're in charge.\n\n\nStart of Incident, IC Not Present\n#\n\n\nIf you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following,\n\n\n\n\nIs there an IC on the call?\n\n\n(pause)\n\n\nHearing no response, this is [NAME], and I am now the Incident Commander for this call.\n\n\n\n\nIf the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure)\n\n\nChecking if SME's are Present\n#\n\n\nDuring a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your deputy can page one if no one answers.\n\n\n\n\nDo we have a representative from [X] on the call?\n\n\n(pause)\n\n\nDeputy, can you go ahead and page the [X] on-call please.\n\n\n\n\nAssigning Tasks\n#\n\n\nWhen you need to give out an assignment or task, give it to a person directly, never say \"can someone do...\" as this leads to the \nbystander effect\n. Instead, all actions should be assigned to a specific person, and time-boxed with a specific number of minutes.\n\n\n\n\nIC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes.\n\n\nBob: Understood\n\n\n\n\nKeep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings.\n\n\nPolling During a Decision\n#\n\n\nIf a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. But it's important that no one can come later and object to the plan, saying things like \"I knew that would happen\". 
An IC will use very specific language to be sure that doesn't happen.\n\n\n\n\nThe proposal is to [EXPLAIN PROPOSAL]\n\n\nAre there any strong objections to this plan?\n\n\n(pause)\n\n\nHearing no objects, we are proceeding with this proposal.\n\n\n\n\nIf you were to ask \"Does everyone agree?\", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter.\n\n\nStatus Updates\n#\n\n\nIt's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page.\n\n\n\n\nWhile we wait for [X], here's an update of our current situation.\n\n\nWe are currently in a SEV-1 situation, we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time.\n\n\nAre there any additional actions or proposals from anyone else at this time?\n\n\n\n\nTransfer of Command\n#\n\n\nTransfer of command, involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place,\n\n\n\n\nCommander has become fatigued and is unable to continue.\n\n\nIncident complexity changes.\n\n\nChange of command is necessary for effectiveness or efficiency.\n\n\nPersonal emergencies arise (e.g., Incident Commander has a family emergency).\n\n\n\n\nNever feel like you are not doing your job properly by handing over. Handovers are encouraged. 
In order to handover, out of band from the main call (via Slack for example), notify the other IC that you wish to transfer command. Update them with anything you feel appropriate. Then announce on the call,\n\n\n\n\nEveryone on the call, be advised, at this time I am handing over command to [X].\n\n\n\n\nThe new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander.\n\n\nNote that the arrival of a more qualified person does NOT necessarily mean a change in incident command.\n\n\nMaintaining Order\n#\n\n\nOften times on a call people will be talking over one another, or an argument on the correct way to proceed may break out. As Incident Commander it's important that order is maintained on a call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO). But often times you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long.\n\n\n\n\n(noise)\n\n\nOk everyone, can we all speak one at a time please. So far I'm hearing two options to proceed: 1) [X], 2) [Y].\n\n\nAre there any other proposals someone would like to make at this time?\n\n\n...etc\n\n\n\n\nGetting Straight Answers\n#\n\n\nYou may ask a question as IC and receive an answer that doesn't actually answer your question. This is generally when you ask for a yes/no answer but get a more detailed explanation. This can often times be because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed.\n\n\n\n\nIC: Is this going to disable the service for everyone?\n\n\nSME: Well... for some people it....\n\n\nIC: Stop. I need a yes/no answer. Is this going to disable the service for everyone?\n\n\nSME: Well... it might not do...\n\n\nIC: Stop. 
I'm going to ask again, and the only two words I want to hear from you are \"yes\" or \"no. If this going to disable the service for everyone?\n\n\nSME: Well.. like I was saying..\n\n\nIC: Stop. Leave the call. Backup IC can you please page the backup on-call for [service] so that we can get an answer.\n\n\n\n\nExecutive Swoop\n#\n\n\nYou may get someone who would be senior to you during peacetime come on the call and start overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. While this is rare, you can get things back on track with the following,\n\n\n\n\nExecutive: No, I don't want us doing that. Everyone stop. We need to rollback instead.\n\n\nIC: Hold please. [EXECUTIVE], do you wish to take over command?\n\n\nExecutive: Yes/No\n\n\n(If yes) IC: Understood. Everyone on the call, be advised, at this time I am handling over command to [EXECUTIVE]. They are now the incident commander for this call.\n\n\n(If no) IC: In that case, please cause no further interruptions or I will remove you from the call.\n\n\n\n\nThis makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call.\n\n\nEnd of Call Sign-Off\n#\n\n\nAt the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where followup discussion can take place. It's also customary to thank everyone.\n\n\n\n\nOk everyone, we're ending the call at this time. Please continue any followup discussion on Slack. Thanks everyone.\n\n\n\n\nExamples From Pop Culture\n#\n\n\nPagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. 
We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work.\n\n\n\n\n\n\n\nHere's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. Here are some things to note:\n\n\n\n\nWalks into the room, and immediately obvious that he's the IC. Calms the noise, and makes sure everyone is paying attention.\n\n\nProvides a status update so people are aware of the situation.\n\n\nProjector breaks, doesn't get sidetracked on fixing it, just moves on to something else.\n\n\nProvides a proposal for how to proceed and elicits feedback.\n\n\nListens to the feedback calmly.\n\n\nWhen counter-proposal is raised, states that he agrees and why.\n\n\n\n\n\n\nAllows a discussion to happen, listens to all points. When discussion gets out of hand, re-asserts command of the situation.\n\n\nExplains his decision, and why.\n\n\n\n\n\n\nExplains his full plan and decision, so everyone is on the same page.\n\n\n\n\n\n\n\n\n\nAnother clip from Apollo 13. Things to note:\n\n\n\n\nSummarizes the situation, and states the facts.\n\n\nListens to the feedback from various people.\n\n\nWhen a trusted SME provides information counter to what everyone else is saying, asks for additional clarification (\"What do you mean, everything?\")\n\n\nWise cracking remarks are not acknowledged by the IC (\"You can't run a vacuum cleaner on 12 amps!\")\n\n\n\"That's the deal?\".. \"That's the deal\".\n\n\nOnce decision is made, moves on to the next discussion.\n\n\nDelegates tasks.", - "title": "Incident Commander" - }, - { - "location": "/training/incident_commander/#purpose", - "text": "If you could boil down the definition of an Incident Commander to one sentence, it would be, Take whatever actions are necessary to protect PagerDuty systems and customers. 
The purpose of the Incident Commander is to be the decision maker during an major incident; Delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution. The Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. Your job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. You should not be performing any actions or remediations, checking graphs, or investigating logs. Those tasks should be delegated.", - "title": "Purpose" - }, - { - "location": "/training/incident_commander/#prerequisites", - "text": "Before you can be an Incident Commander, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! Has excellent knowledge of PagerDuty systems and is able to quickly evaluate good vs bad options, and quickly identify what's actually going on. Been at PagerDuty for at least 6 months and has a solid understanding of the incident notification pipeline and web stack . Excellent verbal and written communication skills . Has knowledge of obscure PagerDuty terms . Has gravitas and is willing to kick people off a call to remove distractions, even if it's the CEO.", - "title": "Prerequisites" - }, - { - "location": "/training/incident_commander/#responsibilities", - "text": "Read up on our Different Roles for Incidents to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with.", - "title": "Responsibilities" - }, - { - "location": "/training/incident_commander/#qualities", - "text": "Some qualities we expect from an effective leader include being able to: Take command. Motivate responders. Communicate clear directions. Size up the situation and make rapid decisions. 
Assess the effectiveness of tactics/strategies. Be flexible and modify your plans as necessary. As a leader, you should try to: Be proficient in your job. Make sound and timely decisions. Ensure tasks are understood. Be prepared to step out of a tactical role to assume a leadership role.", - "title": "Qualities" - }, - { - "location": "/training/incident_commander/#training-process", - "text": "The process is fairly loose for now. Here's a list of things you can do to train though, Read the rest of this page, particularly the sections below. Participate in Failure Friday (FF). Shadow a FF to see how it's run. Be the scribe for multiple FF's. Be the incident commander for multiple FF's. Play a game of \" Keep Talking and Nobody Explodes \" with other people in the office. For a more realistic experience, play it with someone in a different office over Hangouts. Shadow a current incident commander for at least a full week shift. Get alerted when they do, join in on the same calls. Sit in on an active incident call, follow along with the chat, and follow along with what the Incident Commander is doing. Do not actively participate in the call, keep your questions until the end. Reverse shadow a current incident commander for at least a full week shift. You should be the one to respond to incidents, and you will take point on calls, however the current IC will be there to take over should you not know how to proceed.", - "title": "Training Process" - }, - { - "location": "/training/incident_commander/#graduation", - "text": "What's the difference between an IC in training, and an IC? (This isn't the set up to a joke). Simple, an IC puts themselves on the schedule.", - "title": "Graduation" - }, - { - "location": "/training/incident_commander/#handling-incidents", - "text": "Every incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one. Identify the symptoms. 
Identify what the symptoms are, how big the issue is, and whether it's escalating/flapping/static. Size-up the situation. Gather as much information as you can, as quickly as you can (remember the incident is still happening while you're doing this). Get the facts, the possibilities of what can happen, and the probability of those things happening. Stabilize the incident. Identify actions you can use to proceed. Gather support for the plan (See \"Polling During a Decision\" below). Delegate remediation actions to your SME's. Provide regular updates. Maintain a cadence, and provide regular updates to everyone on the call. What's happening, what are we doing about it, etc.", - "title": "Handling Incidents" - }, - { - "location": "/training/incident_commander/#deputy", - "text": "The deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander\u2019s position if required.", - "title": "Deputy" - }, - { - "location": "/training/incident_commander/#communication-responsibilities", - "text": "Sharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands. When given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks. 
After an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary.", - "title": "Communication Responsibilities" - }, - { - "location": "/training/incident_commander/#incident-call-procedures-and-lingo", - "text": "The Steps for Incident Commander provide a detailed description of what you should be doing during an incident. Additionally, aside from following the usual incident call etiquette , there a few extra etiquette guidelines you should follow as IC: Always announce when you join the call if you are the on-call IC. Don't let discussions get out of hand. Keep conversations short. Note objections from others, but your call is final. If anyone is being actively disruptive to your call, kick them off. Announce the end of the call. Here are some examples of phrases and patterns you should use during incident calls.", - "title": "Incident Call Procedures and Lingo" - }, - { - "location": "/training/incident_commander/#start-of-call-announcement", - "text": "At the start of any major incident call, the incident commander should announce the following, This is [NAME], I am the Incident Commander for this call. This establishes to everyone on the call what your name is, and that you are now the commander. You should state \"Incident Commander\" and not \"IC\", as newcomers may not be familiar with the terminology yet. The word \"commander\" makes it very clear that you're in charge.", - "title": "Start of Call Announcement" - }, - { - "location": "/training/incident_commander/#start-of-incident-ic-not-present", - "text": "If you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following, Is there an IC on the call? (pause) Hearing no response, this is [NAME], and I am now the Incident Commander for this call. 
If the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure)", - "title": "Start of Incident, IC Not Present" - }, - { - "location": "/training/incident_commander/#checking-if-smes-are-present", - "text": "During a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your deputy can page one if no one answers. Do we have a representative from [X] on the call? (pause) Deputy, can you go ahead and page the [X] on-call please.", - "title": "Checking if SME's are Present" - }, - { - "location": "/training/incident_commander/#assigning-tasks", - "text": "When you need to give out an assignment or task, give it to a person directly, never say \"can someone do...\" as this leads to the bystander effect . Instead, all actions should be assigned to a specific person, and time-boxed with a specific number of minutes. IC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes. Bob: Understood Keep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings.", - "title": "Assigning Tasks" - }, - { - "location": "/training/incident_commander/#polling-during-a-decision", - "text": "If a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. But it's important that no one can come later and object to the plan, saying things like \"I knew that would happen\". An IC will use very specific language to be sure that doesn't happen. The proposal is to [EXPLAIN PROPOSAL] Are there any strong objections to this plan? (pause) Hearing no objects, we are proceeding with this proposal. 
If you were to ask \"Does everyone agree?\", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter.", - "title": "Polling During a Decision" - }, - { - "location": "/training/incident_commander/#status-updates", - "text": "It's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page. While we wait for [X], here's an update of our current situation. We are currently in a SEV-1 situation, we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time. Are there any additional actions or proposals from anyone else at this time?", - "title": "Status Updates" - }, - { - "location": "/training/incident_commander/#transfer-of-command", - "text": "Transfer of command, involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place, Commander has become fatigued and is unable to continue. Incident complexity changes. Change of command is necessary for effectiveness or efficiency. Personal emergencies arise (e.g., Incident Commander has a family emergency). Never feel like you are not doing your job properly by handing over. Handovers are encouraged. In order to handover, out of band from the main call (via Slack for example), notify the other IC that you wish to transfer command. Update them with anything you feel appropriate. 
Then announce on the call, Everyone on the call, be advised, at this time I am handing over command to [X]. The new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander. Note that the arrival of a more qualified person does NOT necessarily mean a change in incident command.", - "title": "Transfer of Command" - }, - { - "location": "/training/incident_commander/#maintaining-order", - "text": "Often times on a call people will be talking over one another, or an argument on the correct way to proceed may break out. As Incident Commander it's important that order is maintained on a call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO). But often times you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long. (noise) Ok everyone, can we all speak one at a time please. So far I'm hearing two options to proceed: 1) [X], 2) [Y]. Are there any other proposals someone would like to make at this time? ...etc", - "title": "Maintaining Order" - }, - { - "location": "/training/incident_commander/#getting-straight-answers", - "text": "You may ask a question as IC and receive an answer that doesn't actually answer your question. This is generally when you ask for a yes/no answer but get a more detailed explanation. This can often times be because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed. IC: Is this going to disable the service for everyone? SME: Well... for some people it.... IC: Stop. I need a yes/no answer. Is this going to disable the service for everyone? SME: Well... it might not do... IC: Stop. I'm going to ask again, and the only two words I want to hear from you are \"yes\" or \"no. If this going to disable the service for everyone? SME: Well.. 
like I was saying.. IC: Stop. Leave the call. Backup IC can you please page the backup on-call for [service] so that we can get an answer.", - "title": "Getting Straight Answers" - }, - { - "location": "/training/incident_commander/#executive-swoop", - "text": "You may get someone who would be senior to you during peacetime come on the call and start overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. While this is rare, you can get things back on track with the following, Executive: No, I don't want us doing that. Everyone stop. We need to rollback instead. IC: Hold please. [EXECUTIVE], do you wish to take over command? Executive: Yes/No (If yes) IC: Understood. Everyone on the call, be advised, at this time I am handling over command to [EXECUTIVE]. They are now the incident commander for this call. (If no) IC: In that case, please cause no further interruptions or I will remove you from the call. This makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call.", - "title": "Executive Swoop" - }, - { - "location": "/training/incident_commander/#end-of-call-sign-off", - "text": "At the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where followup discussion can take place. It's also customary to thank everyone. Ok everyone, we're ending the call at this time. Please continue any followup discussion on Slack. Thanks everyone.", - "title": "End of Call Sign-Off" - }, - { - "location": "/training/incident_commander/#examples-from-pop-culture", - "text": "PagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. 
We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work. Here's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. Here are some things to note: Walks into the room, and immediately obvious that he's the IC. Calms the noise, and makes sure everyone is paying attention. Provides a status update so people are aware of the situation. Projector breaks, doesn't get sidetracked on fixing it, just moves on to something else. Provides a proposal for how to proceed and elicits feedback. Listens to the feedback calmly. When counter-proposal is raised, states that he agrees and why. Allows a discussion to happen, listens to all points. When discussion gets out of hand, re-asserts command of the situation. Explains his decision, and why. Explains his full plan and decision, so everyone is on the same page. Another clip from Apollo 13. Things to note: Summarizes the situation, and states the facts. Listens to the feedback from various people. When a trusted SME provides information counter to what everyone else is saying, asks for additional clarification (\"What do you mean, everything?\") Wise cracking remarks are not acknowledged by the IC (\"You can't run a vacuum cleaner on 12 amps!\") \"That's the deal?\".. \"That's the deal\". Once decision is made, moves on to the next discussion. Delegates tasks.", - "title": "Examples From Pop Culture" - }, - { - "location": "/training/deputy/", - "text": "So you want to be a deputy? You've come to the right place!\n\n\n\n\nCredit: \noregondot @ Flickr\n\n\nPurpose\n#\n\n\nThe purpose of the Deputy is to support the IC by keeping track of timers, notifying the IC of important information, and paging other people as directed by the IC.\n\n\nIt's important for the IC to focus on the problem at hand, rather than worrying about monitoring timers. 
The deputy is there to help support the IC and keep them focussed on the incident.\n\n\nAs a Deputy, you will be expected to take over command from the IC if they request it.\n\n\nYou should not be performing any remediations, checking graphs, or investigating logs\n. Those tasks will be delegated to the resolvers by the IC.\n\n\nPrerequisites\n#\n\n\nBefore you can be a Deputy, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!\n\n\n\n\nBe trained as an \nIncident Commander\n.\n\n\n\n\nResponsibilities\n#\n\n\nRead up on our \nDifferent Roles for Incidents\n to see what is expected from a Deputy, as well as what we expect from the other roles you'll be interacting with.\n\n\nTraining Process\n#\n\n\nThe training process for a Deputy is quite simple.\n\n\n\n\nFollow our \nIncident Commander Training\n.\n\n\nRead this page.\n\n\n\n\nIncident Call Procedures and Lingo\n#\n\n\nThe \nSteps for Deputy\n provide a detailed description of what you should be doing during an incident.\n\n\nHere are some examples of phrases and patterns you should use during incident calls.\n\n\nKeep Track of Responders\n#\n\n\nAs you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the \n!ic responders\n to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them.\n\n\n\n\nDo we have a representative from [X] on the call?\n\n\n(pause)\n\n\nDeputy, can you go ahead and page the [X] on-call please.\n\n\n\n\nYou can page them however you see fit, phone call, etc.\n\n\nProvide Executive Status Updates\n#\n\n\nProvide regular status updates on Slack (roughly every 30mins), giving an executive summary of the current status during SEV-1 incidents. Keep it short and to the point, and use @here. 
Mention the current state, the actions in progress, customer impact, and expected time remaining. It's OK to miss out some of those if the information isn't known.\n\n\n\n\n@here: We are in SEV-1 due to X. Current actions in progress are to do Y. Expecting 3 mins to complete that action. Once action is complete, system should recover on its own within 5 minutes.\n\n\n\n\nAlert IC to Timers\n#\n\n\nYou are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. This is as simple as telling the IC on the call,\n\n\n\n\nIC, be advised the incident is now at the 10 minute mark.\n\n\n\n\nSimilarly, when the IC asks for someone to get back to them in X minutes, you are expected to keep track of that. You should remind the IC when that time has been reached.\n\n\n\n\nIC, be advised the timer for [TEAM]'s investigation is up.", - "title": "Deputy" - }, - { - "location": "/training/deputy/#purpose", - "text": "The purpose of the Deputy is to support the IC by keeping track of timers, notifying the IC of important information, and paging other people as directed by the IC. It's important for the IC to focus on the problem at hand, rather than worrying about monitoring timers. The deputy is there to help support the IC and keep them focussed on the incident. As a Deputy, you will be expected to take over command from the IC if they request it. You should not be performing any remediations, checking graphs, or investigating logs . Those tasks will be delegated to the resolvers by the IC.", - "title": "Purpose" - }, - { - "location": "/training/deputy/#prerequisites", - "text": "Before you can be a Deputy, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! 
Be trained as an Incident Commander .", - "title": "Prerequisites" - }, - { - "location": "/training/deputy/#responsibilities", - "text": "Read up on our Different Roles for Incidents to see what is expected from a Deputy, as well as what we expect from the other roles you'll be interacting with.", - "title": "Responsibilities" - }, - { - "location": "/training/deputy/#training-process", - "text": "The training process for a Deputy is quite simple. Follow our Incident Commander Training . Read this page.", - "title": "Training Process" - }, - { - "location": "/training/deputy/#incident-call-procedures-and-lingo", - "text": "The Steps for Deputy provide a detailed description of what you should be doing during an incident. Here are some examples of phrases and patterns you should use during incident calls.", - "title": "Incident Call Procedures and Lingo" - }, - { - "location": "/training/deputy/#keep-track-of-responders", - "text": "As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the !ic responders to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them. Do we have a representative from [X] on the call? (pause) Deputy, can you go ahead and page the [X] on-call please. You can page them however you see fit, phone call, etc.", - "title": "Keep Track of Responders" - }, - { - "location": "/training/deputy/#provide-executive-status-updates", - "text": "Provide regular status updates on Slack (roughly every 30mins), giving an executive summary of the current status during SEV-1 incidents. Keep it short and to the point, and use @here. Mention the current state, the actions in progress, customer impact, and expected time remaining. It's OK to miss out some of those if the information isn't known. @here: We are in SEV-1 due to X. Current actions in progress are to do Y. 
Expecting 3 mins to complete that action. Once action is complete, system should recover on its own within 5 minutes.", - "title": "Provide Executive Status Updates" - }, - { - "location": "/training/deputy/#alert-ic-to-timers", - "text": "You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. This is as simple as telling the IC on the call, IC, be advised the incident is now at the 10 minute mark. Similarly, when the IC asks for someone to get back to them in X minutes, you are expected to keep track of that. You should remind the IC when that time has been reached. IC, be advised the timer for [TEAM]'s investigation is up.", - "title": "Alert IC to Timers" - }, - { - "location": "/training/scribe/", - "text": "So you want to be a scribe? You've come to the right place! You don't need to be a senior team member to become a deputy or scribe, anyone can do it providing you have the requisite knowledge!\n\n\n\n\nCredit: \nHolly Chaffin\n\n\nPurpose\n#\n\n\nThe purpose of the Scribe is to maintain a timeline of key events during an incident. Documenting actions, and keeping track of any followup items that will need to be addressed.\n\n\nIt's important for the rest of the command staff to be able to focus on the problem at hand, rather than worrying about documenting the steps.\n\n\nYour job as Scribe is to listen to the call and to watch the incident Slack room, keeping track of context and actions that need to be performed, documenting these in Slack as you go. \nYou should not be performing any remediations, checking graphs, or investigating logs.\n Those tasks will be delegated to the subject matter experts (SME's) by the Incident Commander.\n\n\nPrerequisites\n#\n\n\nBefore you can be a Scribe, it is expected that you meet the following criteria. 
Don't worry if you don't meet them all yet, you can still continue with training!\n\n\n\n\nExcellent verbal and written \ncommunication skills\n.\n\n\nHas \nknowledge of obscure PagerDuty terms\n.\n\n\n\n\nResponsibilities\n#\n\n\nRead up on our \nDifferent Roles for Incidents\n to see what is expected from a Scribe, as well as what we expect from the other roles you'll be interacting with.\n\n\nTraining Process\n#\n\n\nThere is no formal training process for this role, reading this page should be sufficient for most tasks. Here's a list of things you can do to train though,\n\n\n\n\n\n\nRead the rest of this page, particularly the sections below.\n\n\n\n\n\n\nParticipate in \nFailure Friday\n (FF).\n\n\n\n\nShadow a FF to see how it's run.\n\n\nBe the scribe for multiple FF's.\n\n\n\n\n\n\n\n\nScribing\n#\n\n\nScribing is more art than science. The objective is to keep an accurate record of important events that occurred on the call, so that we can look back at the timeline to see what happened. But what exactly is important? There's no overwhelming answer, and it really comes down the judgement and experience. But here are some general things you most definitely want to capture as scribe.\n\n\n\n\nThe result of any polling decisions.\n\n\n This is not \"9 people voted yay, 3 voted nay\".\n\n\n It is \"Polled for if we should do rolling restart. 
\n is proceeding with restart.\"\n\n\n\n\n\n\nAny followup items that are called out as \"We should do this..\", \"Why didn't this?..\", etc.\n\n\n This is not \"Why isn't the Support representative on the call?\"\n\n\n This is \"TODO: Why didn't we get paged for this earlier?\"\n\n\n\n\n\n\n\n\nIncident Call Procedures and Lingo\n#\n\n\nThe \nSteps for Scribe\n provide a detailed description of what you should be doing during an incident.\n\n\nHere are some examples of phrases and patterns you should use during incident calls.\n\n\nStatus Stalking\n#\n\n\nAt the start of any major incident call, you should start our status stalking bot, so that it will post to the room an update automatically.\n\n\n\n\n!status stalk\n\n\n\n\nThis will provide the update and allow the IC to see the status without having to keep asking.\n\n\nNote Important Actions\n#\n\n\nDuring a call, you will hear lots of discussion happening, you should not be documenting all of this in the chat room. You only want to document things which will be important for the final timeline. It's not always obvious what this might be, and it's usually a matter of judgement. You generally want to note any actions the IC has asked someone to perform, along with the result of any polling decisions.\n\n\n\n\nPolled for decision on whether to perform rolling restart. We are proceeding with restart. [USER_A] to execute.\n\n\n\n\nSome actions might seem important at the time, but end up not being. That's OK. It's better to have more info than not enough, but don't go overboard.\n\n\nNote Followup Actions\n#\n\n\nSometimes during the call, someone will either mention something we \"should fix\", or the IC will specifically ask you to note a followup item. 
You can do this in Slack by simply prefixing with \"TODO\", this will make it easier to search for later.\n\n\n\n\nTODO: Why did we not get paged for the fall in traffic on [X] cluster?\n\n\n\n\nThe post-mortem owner will find these after and raise tasks for them.\n\n\nEnd of Call Notification\n#\n\n\nWhen the IC ends the call, you should post a message into Slack to let everyone know the call is over, and that they should continue discussion elsewhere.\n\n\n\n\nCall is over, thanks everyone. Follow up in Slack.\n\n\n\n\nDon't forget to also stop the status stalking.\n\n\n\n\n!status unstalk", - "title": "Scribe" - }, - { - "location": "/training/scribe/#purpose", - "text": "The purpose of the Scribe is to maintain a timeline of key events during an incident. Documenting actions, and keeping track of any followup items that will need to be addressed. It's important for the rest of the command staff to be able to focus on the problem at hand, rather than worrying about documenting the steps. Your job as Scribe is to listen to the call and to watch the incident Slack room, keeping track of context and actions that need to be performed, documenting these in Slack as you go. You should not be performing any remediations, checking graphs, or investigating logs. Those tasks will be delegated to the subject matter experts (SME's) by the Incident Commander.", - "title": "Purpose" - }, - { - "location": "/training/scribe/#prerequisites", - "text": "Before you can be a Scribe, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training! Excellent verbal and written communication skills . 
Has knowledge of obscure PagerDuty terms .", - "title": "Prerequisites" - }, - { - "location": "/training/scribe/#responsibilities", - "text": "Read up on our Different Roles for Incidents to see what is expected from a Scribe, as well as what we expect from the other roles you'll be interacting with.", - "title": "Responsibilities" - }, - { - "location": "/training/scribe/#training-process", - "text": "There is no formal training process for this role, reading this page should be sufficient for most tasks. Here's a list of things you can do to train though, Read the rest of this page, particularly the sections below. Participate in Failure Friday (FF). Shadow a FF to see how it's run. Be the scribe for multiple FF's.", - "title": "Training Process" - }, - { - "location": "/training/scribe/#scribing", - "text": "Scribing is more art than science. The objective is to keep an accurate record of important events that occurred on the call, so that we can look back at the timeline to see what happened. But what exactly is important? There's no overwhelming answer, and it really comes down the judgement and experience. But here are some general things you most definitely want to capture as scribe. The result of any polling decisions. This is not \"9 people voted yay, 3 voted nay\". It is \"Polled for if we should do rolling restart. is proceeding with restart.\" Any followup items that are called out as \"We should do this..\", \"Why didn't this?..\", etc. This is not \"Why isn't the Support representative on the call?\" This is \"TODO: Why didn't we get paged for this earlier?\"", - "title": "Scribing" - }, - { - "location": "/training/scribe/#incident-call-procedures-and-lingo", - "text": "The Steps for Scribe provide a detailed description of what you should be doing during an incident. 
Here are some examples of phrases and patterns you should use during incident calls.", - "title": "Incident Call Procedures and Lingo" - }, - { - "location": "/training/scribe/#status-stalking", - "text": "At the start of any major incident call, you should start our status stalking bot, so that it will post to the room an update automatically. !status stalk This will provide the update and allow the IC to see the status without having to keep asking.", - "title": "Status Stalking" - }, - { - "location": "/training/scribe/#note-important-actions", - "text": "During a call, you will hear lots of discussion happening, you should not be documenting all of this in the chat room. You only want to document things which will be important for the final timeline. It's not always obvious what this might be, and it's usually a matter of judgement. You generally want to note any actions the IC has asked someone to perform, along with the result of any polling decisions. Polled for decision on whether to perform rolling restart. We are proceeding with restart. [USER_A] to execute. Some actions might seem important at the time, but end up not being. That's OK. It's better to have more info than not enough, but don't go overboard.", - "title": "Note Important Actions" - }, - { - "location": "/training/scribe/#note-followup-actions", - "text": "Sometimes during the call, someone will either mention something we \"should fix\", or the IC will specifically ask you to note a followup item. You can do this in Slack by simply prefixing with \"TODO\", this will make it easier to search for later. TODO: Why did we not get paged for the fall in traffic on [X] cluster? 
The post-mortem owner will find these after and raise tasks for them.", - "title": "Note Followup Actions" - }, - { - "location": "/training/scribe/#end-of-call-notification", - "text": "When the IC ends the call, you should post a message into Slack to let everyone know the call is over, and that they should continue discussion elsewhere. Call is over, thanks everyone. Follow up in Slack. Don't forget to also stop the status stalking. !status unstalk", - "title": "End of Call Notification" - }, - { - "location": "/training/subject_matter_expert/", - "text": "If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Incident Commander, take a look at the \nIncident Commander Training page\n.\n\n\n\n\nCredit: \noregondot @ Flickr\n\n\nOn-Call Expectations\n#\n\n\nIf you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. Getting paged about a SEV-3 or SEV-4 in your system comes with different expectations than getting paged with a major SEV-2.\n\n\nBefore Going On-Call\n#\n\n\n\n\nBe prepared, by having already familiarized yourself with our incident response policies and procedures. In particular,\n\n\nDifferent Roles for Incidents\n - You will be acting as a \"Resolver\" or \"SME\". But you should familiarize yourself with the other roles and what they will be doing.\n\n\nIncident Call Etiquette\n - How to behave during an incident call.\n\n\nDuring an Incident\n - What to do during an incident. 
You are specifically interested in the \"Resolver\" steps, but you should familiarize yourself with the entire document.\n\n\nGlossary\n - Familiarize yourself with the terminology that may be used during the call.\n\n\n\n\n\n\nMake sure you have set up your alerting methods, and that PagerDuty can bypass your \"Do Not Disturb\" settings.\n\n\nCheck you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged.\n\n\nBe aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc.\n\n\nIf you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander.\n\n\n\n\nDuring On-Call Period\n#\n\n\n\n\nHave your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc).\n\n\nIf you have important appointments, you need to get someone else on your team to cover that time slot in advance.\n\n\nWhen you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes).\n\n\nYou will be asked questions or given actions by the Incident Commander. Answer questions concisely, and follow all actions given (even if you disagree with them).\n\n\n\n\n\n\n\n\nResponse Mobilization\n#\n\n\nWhen an incident occurs, you must be mobilized or assigned to become part of the incident response. In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. 
While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised.\n\n\n\"Never Hesitate to Escalate\"\n#\n\n\nIf you're not sure about something, it is perfectly acceptable to bring in other SMEs from your team that you believe know a given system better than you. Don't let your ego keep you from bringing in additional help. Our motto is \"Never hesitate to escalate\", you will never be looked down upon for escalating something because you didn't know how to handle it.\n\n\nBlameless\n#\n\n\nThere will be incidents. Some will be caused by you, some will be caused by others... some will just happen. Our entire incident response process is completely blameless. Blaming people is counter productive and just distracts from the problem at hand. No matter how an incident started, they all need to get solved as quickly as possible.\n\n\nWartime vs Peacetime\n#\n\n\nBehavior during a major incident is very different to any other alert you may have received in the past. We call a major incident \"wartime\", and make a distinction between that and normal everyday operations (\"peacetime\").\n\n\nPeacetime\n#\n\n\nThe organizational structure is generally based on seniority. The more senior members of a team will lead discussions, and managers or team leads will have the final say. Decisions are made after careful consideration of all options, and to minimize potential risk to customers.\n\n\nWartime\n#\n\n\nWartime is different, and you will notice on our major incident calls that there's a different organizational structure.\n\n\n\n\nThe Incident Commander is in charge. 
No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO.\n\n\nPrimary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service.\n\n\nDecisions will be made by the IC after consideration of the information presented. Once that decision is made, it is final.\n\n\nRiskier decisions can be made by the IC than would normally be considered during peacetime.\n\n\nFor example, the IC may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else.\n\n\n\n\n\n\nThe IC may go against a consensus decision. If a poll is done, and 9/10 people agree but 1 disagrees. The IC may choose the disagreement option despite a majority vote.\n\n\nEven if you disagree, the IC's decision is final. During the call is not the time to argue with them.\n\n\n\n\n\n\nThe IC may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before.\n\n\nYou may be asked to leave the call by the IC, or you may even be forceable kicked off a call. It is at the IC's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different than peacetime.", - "title": "Subject Matter Expert" - }, - { - "location": "/training/subject_matter_expert/#on-call-expectations", - "text": "If you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. 
Getting paged about a SEV-3 or SEV-4 in your system comes with different expectations than getting paged with a major SEV-2.", - "title": "On-Call Expectations" - }, - { - "location": "/training/subject_matter_expert/#before-going-on-call", - "text": "Be prepared, by having already familiarized yourself with our incident response policies and procedures. In particular, Different Roles for Incidents - You will be acting as a \"Resolver\" or \"SME\". But you should familiarize yourself with the other roles and what they will be doing. Incident Call Etiquette - How to behave during an incident call. During an Incident - What to do during an incident. You are specifically interested in the \"Resolver\" steps, but you should familiarize yourself with the entire document. Glossary - Familiarize yourself with the terminology that may be used during the call. Make sure you have set up your alerting methods, and that PagerDuty can bypass your \"Do Not Disturb\" settings. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc. If you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander.", - "title": "Before Going On-Call" - }, - { - "location": "/training/subject_matter_expert/#during-on-call-period", - "text": "Have your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc). If you have important appointments, you need to get someone else on your team to cover that time slot in advance. When you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes). You will be asked questions or given actions by the Incident Commander. 
Answer questions concisely, and follow all actions given (even if you disagree with them).", - "title": "During On-Call Period" - }, - { - "location": "/training/subject_matter_expert/#response-mobilization", - "text": "When an incident occurs, you must be mobilized or assigned to become part of the incident response. In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised.", - "title": "Response Mobilization" - }, - { - "location": "/training/subject_matter_expert/#never-hesitate-to-escalate", - "text": "If you're not sure about something, it is perfectly acceptable to bring in other SMEs from your team that you believe know a given system better than you. Don't let your ego keep you from bringing in additional help. Our motto is \"Never hesitate to escalate\", you will never be looked down upon for escalating something because you didn't know how to handle it.", - "title": "\"Never Hesitate to Escalate\"" - }, - { - "location": "/training/subject_matter_expert/#blameless", - "text": "There will be incidents. Some will be caused by you, some will be caused by others... some will just happen. Our entire incident response process is completely blameless. Blaming people is counter productive and just distracts from the problem at hand. No matter how an incident started, they all need to get solved as quickly as possible.", - "title": "Blameless" - }, - { - "location": "/training/subject_matter_expert/#wartime-vs-peacetime", - "text": "Behavior during a major incident is very different to any other alert you may have received in the past. 
We call a major incident \"wartime\", and make a distinction between that and normal everyday operations (\"peacetime\").", - "title": "Wartime vs Peacetime" - }, - { - "location": "/training/subject_matter_expert/#peacetime", - "text": "The organizational structure is generally based on seniority. The more senior members of a team will lead discussions, and managers or team leads will have the final say. Decisions are made after careful consideration of all options, and to minimize potential risk to customers.", - "title": "Peacetime" - }, - { - "location": "/training/subject_matter_expert/#wartime", - "text": "Wartime is different, and you will notice on our major incident calls that there's a different organizational structure. The Incident Commander is in charge. No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO. Primary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service. Decisions will be made by the IC after consideration of the information presented. Once that decision is made, it is final. Riskier decisions can be made by the IC than would normally be considered during peacetime. For example, the IC may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else. The IC may go against a consensus decision. If a poll is done, and 9/10 people agree but 1 disagrees. The IC may choose the disagreement option despite a majority vote. Even if you disagree, the IC's decision is final. During the call is not the time to argue with them. The IC may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before. 
You may be asked to leave the call by the IC, or you may even be forceable kicked off a call. It is at the IC's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different than peacetime.", - "title": "Wartime" - }, - { - "location": "/training/glossary/", - "text": "Ever wonder what all of those strange words you sometimes see in our documentation mean? This page is here to help.\n\n\n\n\n\n\n\n\nTerm\n\n\nDescription\n\n\n\n\n\n\n\n\n\n\nIC / Incident Commander\n\n\nThe incident commander is the person responsible for bringing any major incident to resolution. They are the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. \nMore info\n.\n\n\n\n\n\n\nDeputy\n\n\nTypically the backup IC. The deputy's job is to support the IC during the call, providing them with any help they need. \nMore info\n.\n\n\n\n\n\n\nScribe\n\n\nThe scribe's job is to keep a log of all activities performed during the call in a written chat log on Slack. \nMore info\n.\n\n\n\n\n\n\nResolver\n\n\nA person on the incident call who is able to help resolve issues within a particular system. Also referred to as an SME (see below). \nMore info\n.\n\n\n\n\n\n\nSME\n\n\n\"Subject Matter Expert\", someone who is an expert in a particular service or subject who can provide information to the IC, and perform resolution actions for a particular system. \nMore info\n.\n\n\n\n\n\n\nCAN Report\n\n\nCAN stands for \"Conditions\" \"Actions\" \"Needs\", if an IC asks you for a CAN report, you should provide the current state of your service (condition), what actions need to be taken to return it to a healthy state (actions), and what support you need in order to perform the actions (needs).\n\n\n\n\n\n\nSev / Severity\n\n\nHow severe the incident is. The \"sev\" of an incident determines the type of response we give. 
The higher the severity, the higher the likelihood of making risky actions to resolve the situation. \nMore info\n.\n\n\n\n\n\n\nSpan of Control\n\n\nRefers to the number of direct reports you have. For example, if the IC has 10 people as direct reports on a call, they have a large span of control. We aim to make the span of control as minimal as we can while still being productive.\n\n\n\n\n\n\nGrenade Thrower\n\n\nSomeone who joins the call at a late time in the game, and provides information that completely derails the current thinking. They then leave almost immediately.\n\n\n\n\n\n\nExecutive Swoop\n\n\nWhen an executive comes on the call and drops some sort of bombshell. A version of grenade throwing.", - "title": "Glossary" - }, - { - "location": "/about/", - "text": "This site documents parts of the Spearhead Systems Issue Response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after.\n\n\nFew companies seem to talk about their internal processes for dealing with major incidents. We would like to change that by opening up our documentation to the community, in the hopes that it proves useful to others who may want to formalize their own processes. Additionally, it provides an opportunity for others to suggest improvements, which ends up helping everyone.\n\n\nThis documentation is complementary to what is available in our \nexisting wiki\n.\n\n\nWhat is this?\n#\n\n\nA collection of pages detailing how to efficiently deal with any incident or service request that might arise, along with information on how to go on-call effectively. 
It provides lessons learned the hard way, along with training material for getting you up to speed quickly.\n\n\nWho is this for?\n#\n\n\nIt is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically this is for all of our Technical Support staff.\n\n\nWhy do I need it?\n#\n\n\nAs a service provider Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers issues therefore this documentation is a guideline for how we handle these requests. This documentation will allow you give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time.\n\n\nWhat is covered?\n#\n\n\nAnything from preparing to \ngo on-call\n, definitions of \nseverities\n, incident \ncall etiquette\n, all the way to how to run a \npost-mortem\n, providing a \npost-mortem template\n and even a \nsecurity incident response process\n.\n\n\nWhat is missing?\n#\n\n\nLots, dig in an help us complete the picture. We can migrate most processes from Sharepoint here.\n\n\nLicense\n#\n\n\nThis documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file.\n\n\nWhether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. 
You can view the source code for all of this documentation on our GitHub account, feel free to fork the repository and use it as a base for your own internal documentation.", - "title": "About" - }, - { - "location": "/about/#what-is-this", - "text": "A collection of pages detailing how to efficiently deal with any incident or service request that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.", - "title": "What is this?" - }, - { - "location": "/about/#who-is-this-for", - "text": "It is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically this is for all of our Technical Support staff.", - "title": "Who is this for?" - }, - { - "location": "/about/#why-do-i-need-it", - "text": "As a service provider Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers issues therefore this documentation is a guideline for how we handle these requests. This documentation will allow you give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time.", - "title": "Why do I need it?" - }, - { - "location": "/about/#what-is-covered", - "text": "Anything from preparing to go on-call , definitions of severities , incident call etiquette , all the way to how to run a post-mortem , providing a post-mortem template and even a security incident response process .", - "title": "What is covered?" - }, - { - "location": "/about/#what-is-missing", - "text": "Lots, dig in an help us complete the picture. We can migrate most processes from Sharepoint here.", - "title": "What is missing?" 
- }, - { - "location": "/about/#license", - "text": "This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file. Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account, feel free to fork the repository and use it as a base for your own internal documentation.", - "title": "License" - } - ] -} \ No newline at end of file diff --git a/oncall/alerting_principles/index.html b/oncall/alerting_principles/index.html deleted file mode 100644 index 840b846..0000000 --- a/oncall/alerting_principles/index.html +++ /dev/null @@ -1,575 +0,0 @@ - - - - - - - - - - Alerting Principles - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Alerting Principles

- -

We manage how we get alerted based on many factors, such as the customer's contractual SLA and the urgency of their request or incident. An alert or notification is something which requires a human to perform an action. Based on the severity of the issue (service request or incident) we prioritize accordingly in DoIT.

-
-

Major Priority Alerts

-

Anything that wakes up a human in the middle of the night should be immediately human actionable. If it is none of those things, then we need to adjust the alert to not page at those times.

-
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Priority | Alerts | Response |
| --- | --- | --- |
| Major | Major-Priority Spearhead Alert 24/7/365. | Requires immediate human action. |
| Normal | Normal-Priority Spearhead Alert during business hours only. | Requires human action that same working day. |
| Minor | Minor-Priority Spearhead Alert 24/7/365. | Requires human action at some point. |
| Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter our ticketing or inboxes. If they are enabled they should be sent only to required/specific people, not groups. |
-

Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer.

-

If you're setting up a new alert/notification, consider the chart above for how you want to alert people. Be mindful of not creating new high-priority alerts if they don't require an immediate response, for example.
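
As a rough illustration of the chart above, the priorities can be encoded as data and used to decide whether an alert should go out at a given moment. This is only a sketch: the `PriorityPolicy` class, the `should_alert_now` function and the 09:00 to 18:00 weekday business hours are assumptions made for the example, not part of DoIT.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class PriorityPolicy:
    """How one priority level from the chart is delivered (hypothetical type)."""
    around_the_clock: bool  # True for 24/7/365, False for business hours only
    response: str


# Values copied from the chart; the dictionary layout itself is an assumption.
POLICIES = {
    "Major": PriorityPolicy(True, "Requires immediate human action."),
    "Normal": PriorityPolicy(False, "Requires human action that same working day."),
    "Minor": PriorityPolicy(True, "Requires human action at some point."),
}

# Assumed business hours; the documentation does not state them.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def should_alert_now(priority: str, now: datetime) -> bool:
    """Return True if an alert of this priority should be sent at `now`."""
    if priority == "Notification":
        return False  # suppressed event, no response required
    policy = POLICIES[priority]
    if policy.around_the_clock:
        return True
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END
```

Keeping the chart as data makes it harder to add a new around-the-clock alert by accident: a new alert has to be given one of the existing priorities first.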

-
-

Alert Channels

-

Presently we use email as the only notification method. This means keeping an eye on your email is essential! SMS and Push notifications are in the pipeline for DoIT.

-
-

Examples#

-

"Production service is failing for 75% of requests, automation is unable to resolve."#

-

This would be a Major priority IN, requiring immediate human action to resolve.

-

Major Urgency

-

A customer sends an email stating that "Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve."#

-

This would be a Normal priority SR, requiring human action soon, but not immediately.

-

Normal Urgency

-

"An SSL certificate is due to expire in one week."#

-

This would be a Minor priority SR, requiring human action some time soon.

-

Minor Urgency

- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/oncall/being_oncall/index.html b/oncall/being_oncall/index.html deleted file mode 100644 index b648525..0000000 --- a/oncall/being_oncall/index.html +++ /dev/null @@ -1,685 +0,0 @@ - - - - - - - - - - Being On-Call - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Being On-Call

- -

A summary of expectations and helpful information for being on-call.

-

Alert Fatigue

-

What is On-Call?#

-

Being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise. For example, if you are on-call, should any alarms be triggered by our monitoring solution, you will receive a "page" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state.

-

At Spearhead Systems we consider you to be on-call in two situations: during normal working hours, when you are proactively working with DoIT and looking over your assigned cards/boards, and when you are formally "on-call" and issues are being redirected to you.

-

On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and it is the problem that Spearhead Systems professional services is trying to fix!

-

Responsibilities#

-
    -
  1. -

    Prepare

    -
      -
    • Have your laptop and Internet with you (office, home, a MiFi dongle, a phone with a tethering plan, etc).
        -
      • Have a way to charge your MiFi.
      • -
      -
    • -
    • Team alert escalation happens within 5 minutes, set/stagger your notification timeouts (push, SMS, phone...) accordingly.
        -
      • Make sure texts and calls from Spearhead Systems (and directly from colleagues) can bypass your "Do Not Disturb" settings.
      • -
      -
    • -
    • Be prepared (environment is set up, a current working copy of the necessary repos is local and functioning, you have configured and tested environments on workstations, your credentials for third-party services are current, you have Java installed, ssh-keys and so on...)
    • -
    • Read our Incident Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc.
    • -
    • Be aware of your upcoming on-call time (primary, backup) and arrange swaps around travel, vacations, appointments etc.
    • -
    -
  2. -
  3. -

    Triage

    -
      -
    • Acknowledge and act on alerts whenever you can (see the first "Not responsibilities" point below)
    • -
    • Determine the urgency of the problem:
        -
      • Is it something that should be worked on right now or escalated into a major incident? ("production server on fire" situations. Security alerts) - do so.
      • -
      • Is it some tactical work that doesn't have to happen during the night? (for example, disk utilization high watermark, but there's plenty of space left and the trend is not indicating impending doom) - snooze the alert until a more suitable time (working hours, the next morning...) and get back to fixing it then.
      • -
      -
    • -
    • Check Slack for current activity. Often (but not always) actions that could potentially cause alerts will be announced there.
    • -
    • Does the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert for, then escalate to another team member or group.
    • -
    -
  4. -
  5. -

    Fix

    -
      -
    • You are empowered to dive into any problem and act to fix it.
    • -
    • Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before.
    • -
    • If the issue is not very time sensitive and you have other priority work, make a note of this in DoIT to keep a track of it (with an appropriate severity).
    • -
    -
  6. -
  7. -

    Improve

    -
      -
    • If a particular issue keeps happening; if an issue alerts often but turns out to be a preventable non-issue – perhaps improving this should be a longer-term task.
        -
      • Disks that fill up, logs that should be rotated, noisy alerts... (we use Ansible, go ahead and start automating!)
      • -
      -
    • -
    • If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized.
    • -
    -
  8. -
  9. -

    Support

    -
      -
    • When your on-call "shift" ends, let the next on-call know about issues that have not been resolved yet and other experiences of note.
    • -
    • If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.
    • -
    • Support each other: when doing activities that might generate plenty of pages, it is courteous to "take the page" away from the on-call by notifying them and scheduling an override for the duration.
    • -
    -
  10. -
-

Not Responsibilities#

-
    -
  1. -

    No expectation to be the first to acknowledge all of the alerts during the on-call period.

    -
      -
    • Commute (and other necessary distractions) are facts of life, and sometimes it is not possible to receive or act on an alert before it escalates. That's what the backup on-call and the escalation schedule are for.
    • -
    -
  2. -
  3. -

    No expectation to fix all issues by yourself.

    -
      -
    • No one knows everything. Your whole team is here to help. There is no shame, and much to be learned, by escalating issues you are not certain about. "Never hesitate to escalate".
    • -
    • Service owners will always know more about how their stuff works. Especially if our and their documentation is lacking, double-checking with the relevant team avoids mistakes. Measure twice, cut once – and it's often best to let the subject matter expert do the cutting.
    • -
    -
  4. -
-

Recommendations#

-

If your team is starting its own on-call rotation, here are some scheduling recommendations from the Operations team.

-
    -
  • -

    Always have a backup schedule. Yes, this means two people being on-call at the same time, however it takes a lot of the stress off of the primary if they know they have a specific backup they can contact, rather than trying to choose a random member of the team.

    -
      -
    • A backup shift should generally come directly after a primary shift. It gives the previous primary a chance to pass on additional context which may have come up during their shift. It also helps to prevent people from sitting on issues with the intent of letting the next shift fix it.
    • -
    -
  • -
  • -

    The third level of your escalation (after the backup schedule) should probably be your entire team. This should hopefully never happen (it's happened once in the history of the Support team), but when it does, it's useful to be able to just get the next available person.

    -
  • -
-

Escalation

-
    -
  • -

    Team managers can (and should) be part of your normal rotation. It gives a better insight into what has been going on.

    -
  • -
  • -

    New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time).

    -
  • -
  • -

    We recommend you set your escalation timeout to 5 minutes. This should be plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.

    -
  • -
  • -

    When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient.

    -
  • -
-

Notification Method Recommendations#

-

You are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents. If you're not sure how to configure them, the Support team has some recommendations,

-

Mobile Alerts

-
    -
  • Use Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. (DoIT is in the process of integrating with SNS for push notifications.)
  • -
  • Use Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you.
  • -
-
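
The recommendations above, together with the 5 minute escalation timeout suggested earlier on this page, can be sketched as a simple per-minute plan. The function name and the method labels (`push`, `email`, `phone`, `sms`) are hypothetical and chosen for illustration; DoIT currently only sends email.

```python
ESCALATION_TIMEOUT_MIN = 5  # recommended escalation timeout from this page


def notification_plan(escalation_timeout_min: int = ESCALATION_TIMEOUT_MIN):
    """Return (minute, methods) pairs for one alert, up to the escalation time."""
    # Minute 0: push notification and email are the first method of notification.
    plan = [(0, ["push", "email"])]
    # Every minute after that, and before escalation, fall back to phone and SMS.
    for minute in range(1, escalation_timeout_min):
        plan.append((minute, ["phone", "sms"]))
    return plan


if __name__ == "__main__":
    for minute, methods in notification_plan():
        print(f"+{minute} min: {', '.join(methods)}")
```

With the default timeout this gives one push/email attempt followed by four phone/SMS attempts before the alert escalates away from you.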

Etiquette#

-
    -
  • -

    If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice.

    -
  • -
  • -

    Don't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then you shouldn't be acknowledging it. Add a comment with your notes instead.

    -
  • -
-

Acknowledging

-
    -
  • -

    If you are testing something, or performing an action that you know will cause a page (notification, alert), it's customary to "take the pager" for the time during which you will be testing. Notify the person on-call that you are taking the pager for the next hour while you test.

    -
  • -
  • -

    "Never hesitate to escalate" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help.

    -
  • -
  • -

    Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town.

    -
  • -
  • -

    If an issue comes up during your on-call shift for which you got paged, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.

    -
  • -
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/screenshot.png b/screenshot.png new file mode 100644 index 0000000..a7f6d36 Binary files /dev/null and b/screenshot.png differ diff --git a/sitemap.xml b/sitemap.xml deleted file mode 100644 index 7479afc..0000000 --- a/sitemap.xml +++ /dev/null @@ -1,130 +0,0 @@ - - - - - - https://response.spearhead.systems/ - 2017-01-13 - daily - - - - - - - https://response.spearhead.systems/oncall/being_oncall/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/oncall/alerting_principles/ - 2017-01-13 - daily - - - - - - - - https://response.spearhead.systems/before/severity_levels/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/before/different_roles/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/before/call_etiquette/ - 2017-01-13 - daily - - - - - - - - https://response.spearhead.systems/during/during_an_incident/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/during/security_incident_response/ - 2017-01-13 - daily - - - - - - - - https://response.spearhead.systems/after/post_mortem_process/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/after/post_mortem_template/ - 2017-01-13 - daily - - - - - - - - https://response.spearhead.systems/training/overview/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/training/incident_commander/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/training/deputy/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/training/scribe/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/training/subject_matter_expert/ - 2017-01-13 - daily - - - - https://response.spearhead.systems/training/glossary/ - 2017-01-13 - daily - - - - - - - https://response.spearhead.systems/about/ - 2017-01-13 - daily - - - - \ No newline at end of file diff --git a/theme/404.html b/theme/404.html new file mode 100644 index 0000000..6feaf62 --- /dev/null +++ b/theme/404.html @@ -0,0 
+1,15 @@ +{% extends "base.html" %} + +{# mkdocs-material doesn't use content as a block, so cheating and using footer here, as that does use a block #} +{% block footer %} + +
+

Sorry! We couldn't find that page.

+

Looks like our well-trained server monkeys dropped the ball. Rest assured they will be dealt with. In the meantime, you probably want to head home. +

+ +
+ {% include "footer.html" %} +
+ +{% endblock %} diff --git a/theme/base.html b/theme/base.html new file mode 100644 index 0000000..2d94e99 --- /dev/null +++ b/theme/base.html @@ -0,0 +1,196 @@ + + + + + + + + + {% set title = page_title ~ ' - ' ~ site_name if page_title else site_name %} + {{ title }} + + + {% if site_author %}{% endif %} + + + + {% if page_description %}{% endif %} + + + + + + + + + {% if canonical_url %}{% endif %} + + + {% set favicon = favicon | default("assets/images/favicon-e565ddfa3b.ico", true) %} + + + + + + + + {% if config.extra.logo %}{% endif %} + + + + + + + + + + + + + + + + + + + + {% if config.extra.palette %} + + {% endif %} + {% if config.extra.font != "none" %} + {% set text = config.extra.get("font", {}).text | default("Ubuntu") %} + {% set code = config.extra.get("font", {}).code | default("Ubuntu Mono") %} + {% set font = text + ':400,700|' + code | replace(' ', '+') %} + + + {% endif %} + {% for path in extra_css %} + + {% endfor %} + + + + {% block extrahead %}{% endblock %} + + {% set palette = config.extra.get("palette", {}) %} + {% set primary = palette.primary | replace(' ', '-') | lower %} + {% set accent = palette.accent | replace(' ', '-') | lower %} + + {% if repo_name == "GitHub" and repo_url %} + {% set repo_id = repo_url | replace("https://github.com/", "") %} + {% if repo_id[-1:] == "/" %} + {% set repo_id = repo_id[:-1] %} + {% endif %} + {% endif %} +
+
+
+ + + +
+ {% include "header.html" %} +
+
+ {% set h1 = "\x3ch1 id=" in content %} +
+ {% include "drawer.html" %} +
+
+
+ {% if not h1 %} +

{{ page_title | default(site_name, true)}}

+ {% endif %} + {{ content }} + + {% block footer %} +
+ {% include "footer.html" %} +
+ {% endblock %} +
+
+
+
+
+
+
+
+
+
+
+ + + {% for path in extra_javascript %} + + {% endfor %} + {% if google_analytics %} + + {% endif %} + + diff --git a/theme/drawer.html b/theme/drawer.html new file mode 100644 index 0000000..f22a5c9 --- /dev/null +++ b/theme/drawer.html @@ -0,0 +1,59 @@ + diff --git a/theme/header.html b/theme/header.html new file mode 100644 index 0000000..f64db37 --- /dev/null +++ b/theme/header.html @@ -0,0 +1,63 @@ + diff --git a/training/deputy/index.html b/training/deputy/index.html deleted file mode 100644 index b84f5ae..0000000 --- a/training/deputy/index.html +++ /dev/null @@ -1,592 +0,0 @@ - - - - - - - - - - Deputy - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Deputy

- -

So you want to be a deputy? You've come to the right place!

-

Deputy (Credit: oregondot @ Flickr)

-

Purpose#

-

The purpose of the Deputy is to support the IC by keeping track of timers, notifying the IC of important information, and paging other people as directed by the IC.

-

It's important for the IC to focus on the problem at hand, rather than worrying about monitoring timers. The deputy is there to help support the IC and keep them focussed on the incident.

-

As a Deputy, you will be expected to take over command from the IC if they request it.

-

You should not be performing any remediations, checking graphs, or investigating logs. Those tasks will be delegated to the resolvers by the IC.

-

Prerequisites#

-

Before you can be a Deputy, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!

- -

Responsibilities#

-

Read up on our Different Roles for Incidents to see what is expected from a Deputy, as well as what we expect from the other roles you'll be interacting with.

-

Training Process#

-

The training process for a Deputy is quite simple.

- -

Incident Call Procedures and Lingo#

-

The Steps for Deputy provide a detailed description of what you should be doing during an incident.

-

Here are some examples of phrases and patterns you should use during incident calls.

-

Keep Track of Responders#

-

As you listen to the call, you should keep track of the responders to the call as you hear them speak. Make a note on a piece of paper, or use the !ic responders command to see who they are. The IC may ask you who is on-call for a particular system, and you should know the answer, and be able to page them.

-
-

Do we have a representative from [X] on the call?

-

(pause)

-

Deputy, can you go ahead and page the [X] on-call please.

-
-

You can page them however you see fit, phone call, etc.

-

Provide Executive Status Updates#

-

Provide regular status updates on Slack (roughly every 30mins), giving an executive summary of the current status during SEV-1 incidents. Keep it short and to the point, and use @here. Mention the current state, the actions in progress, customer impact, and expected time remaining. It's OK to miss out some of those if the information isn't known.

-
-

@here: We are in SEV-1 due to X. Current actions in progress are to do Y. Expecting 3 mins to complete that action. Once action is complete, system should recover on its own within 5 minutes.
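
To keep updates consistent between deputies, the message above can be assembled from whichever pieces of information are known. This is a minimal sketch; `executive_update` and its parameters are hypothetical, and posting the text to Slack is left out.

```python
def executive_update(cause, actions=None, action_eta_min=None, recovery_eta_min=None):
    """Build a short SEV-1 status line, skipping any details that are not known."""
    parts = [f"@here: We are in SEV-1 due to {cause}."]
    if actions:
        parts.append(f"Current actions in progress are to {actions}.")
    if action_eta_min is not None:
        parts.append(f"Expecting {action_eta_min} mins to complete that action.")
    if recovery_eta_min is not None:
        parts.append(
            "Once action is complete, system should recover on its own "
            f"within {recovery_eta_min} minutes."
        )
    return " ".join(parts)


if __name__ == "__main__":
    print(executive_update("X", actions="do Y", action_eta_min=3, recovery_eta_min=5))
```

Leaving out an argument simply drops that sentence, which matches the guidance that it is OK to omit information that isn't known yet.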

-
-

Alert IC to Timers#

-

You are expected to keep track of how long the incident has been running for, and provide callouts to the IC every 10 minutes so they can take actions such as increasing the severity, or asking Support to Tweet out. This is as simple as telling the IC on the call,

-
-

IC, be advised the incident is now at the 10 minute mark.

-
-

Similarly, when the IC asks for someone to get back to them in X minutes, you are expected to keep track of that. You should remind the IC when that time has been reached.

-
-

IC, be advised the timer for [TEAM]'s investigation is up.
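
The two kinds of callouts described above (the recurring 10 minute mark and the per-team reminders) can be tracked with very little state. The sketch below is illustrative only: `due_callouts` and the `reminders` mapping are hypothetical names, and a paper note or phone timer works just as well.

```python
def due_callouts(elapsed_min, reminders, interval_min=10):
    """Return the callouts the Deputy should make at `elapsed_min` minutes.

    `reminders` maps a team name to the incident minute at which the IC
    asked to hear back from that team.
    """
    callouts = []
    # Recurring callout every `interval_min` minutes of incident time.
    if elapsed_min > 0 and elapsed_min % interval_min == 0:
        callouts.append(
            f"IC, be advised the incident is now at the {elapsed_min} minute mark."
        )
    # One-off reminders for investigations the IC put a timer on.
    for team, due_min in sorted(reminders.items()):
        if due_min == elapsed_min:
            callouts.append(
                f"IC, be advised the timer for {team}'s investigation is up."
            )
    return callouts


if __name__ == "__main__":
    reminders = {"Database": 7}  # example: IC asked Database to report back at minute 7
    for minute in range(1, 21):
        for line in due_callouts(minute, reminders):
            print(f"[{minute:02d}] {line}")
```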

-
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/training/glossary/index.html b/training/glossary/index.html deleted file mode 100644 index 8bb9156..0000000 --- a/training/glossary/index.html +++ /dev/null @@ -1,563 +0,0 @@ - - - - - - - - - - Glossary - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Glossary

- -

Ever wonder what all of those strange words you sometimes see in our documentation mean? This page is here to help.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| Term | Description |
| --- | --- |
| IC / Incident Commander | The incident commander is the person responsible for bringing any major incident to resolution. They are the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final. More info. |
| Deputy | Typically the backup IC. The deputy's job is to support the IC during the call, providing them with any help they need. More info. |
| Scribe | The scribe's job is to keep a log of all activities performed during the call in a written chat log on Slack. More info. |
| Resolver | A person on the incident call who is able to help resolve issues within a particular system. Also referred to as an SME (see below). More info. |
| SME | "Subject Matter Expert", someone who is an expert in a particular service or subject who can provide information to the IC, and perform resolution actions for a particular system. More info. |
| CAN Report | CAN stands for "Conditions", "Actions", "Needs". If an IC asks you for a CAN report, you should provide the current state of your service (conditions), what actions need to be taken to return it to a healthy state (actions), and what support you need in order to perform the actions (needs). |
| Sev / Severity | How severe the incident is. The "sev" of an incident determines the type of response we give. The higher the severity, the higher the likelihood of making risky actions to resolve the situation. More info. |
| Span of Control | Refers to the number of direct reports you have. For example, if the IC has 10 people as direct reports on a call, they have a large span of control. We aim to make the span of control as minimal as we can while still being productive. |
| Grenade Thrower | Someone who joins the call late in the game, and provides information that completely derails the current thinking. They then leave almost immediately. |
| Executive Swoop | When an executive comes on the call and drops some sort of bombshell. A version of grenade throwing. |
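
As an illustration of the CAN report entry above, the three parts map naturally onto a small record. The `CanReport` class and its `summary` method are hypothetical, shown only to make the structure concrete.

```python
from dataclasses import dataclass


@dataclass
class CanReport:
    """Conditions, Actions, Needs: the three parts of a CAN report."""
    conditions: str  # current state of your service
    actions: str     # what must be done to return it to a healthy state
    needs: str       # what support you need to perform those actions

    def summary(self) -> str:
        """Render the report in the order an IC expects to hear it."""
        return (
            f"Conditions: {self.conditions} "
            f"Actions: {self.actions} "
            f"Needs: {self.needs}"
        )


if __name__ == "__main__":
    report = CanReport(
        conditions="Queue depth is rising.",
        actions="Restart the stuck workers.",
        needs="Someone to watch the dashboard during the restart.",
    )
    print(report.summary())
```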
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/training/incident_commander/index.html b/training/incident_commander/index.html deleted file mode 100644 index 8e89a96..0000000 --- a/training/incident_commander/index.html +++ /dev/null @@ -1,833 +0,0 @@ - - - - - - - - - - Incident Commander - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Incident Commander

- -

So you want to be an incident commander? You've come to the right place! You don't need to be a senior team member to become an IC, anyone can do it providing you have the requisite knowledge (yes, even an intern)!

-

Gene Kranz (Credit: NASA)

-

Purpose#

-

If you could boil down the definition of an Incident Commander to one sentence, it would be,

-
-

Take whatever actions are necessary to protect PagerDuty systems and customers.

-
-

The purpose of the Incident Commander is to be the decision maker during a major incident, delegating tasks and listening to input from subject matter experts in order to bring the incident to resolution.

-

The Incident Commander becomes the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final.

-

Your job as an IC is to listen to the call and to watch the incident Slack room in order to provide clear coordination, recruiting others to gather context/details. You should not be performing any actions or remediations, checking graphs, or investigating logs. Those tasks should be delegated.

-

Prerequisites#

-

Before you can be an Incident Commander, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!

-
    -
  • Has excellent knowledge of PagerDuty systems and is able to quickly evaluate good vs bad options, and quickly identify what's actually going on.
  • -
  • Been at PagerDuty for at least 6 months and has a solid understanding of the incident notification pipeline and web stack.
  • -
  • Excellent verbal and written communication skills.
  • -
  • Has knowledge of obscure PagerDuty terms.
  • -
  • Has gravitas and is willing to kick people off a call to remove distractions, even if it's the CEO.
  • -
-

Responsibilities#

-

Read up on our Different Roles for Incidents to see what is expected from an Incident Commander, as well as what we expect from the other roles you'll be interacting with.

-

Qualities#

-

Some qualities we expect from an effective leader include being able to:

-
    -
  • Take command.
  • -
  • Motivate responders.
  • -
  • Communicate clear directions.
  • -
  • Size up the situation and make rapid decisions.
  • -
  • Assess the effectiveness of tactics/strategies.
  • -
  • Be flexible and modify your plans as necessary.
  • -
-

As a leader, you should try to:

-
    -
  • Be proficient in your job.
  • -
  • Make sound and timely decisions.
  • -
  • Ensure tasks are understood.
  • -
  • Be prepared to step out of a tactical role to assume a leadership role.
  • -
-

Training Process#

-

The process is fairly loose for now. Here's a list of things you can do to train though,

-
    -
  • -

    Read the rest of this page, particularly the sections below.

    -
  • -
  • -

    Participate in Failure Friday (FF).

    -
      -
    • Shadow a FF to see how it's run.
    • -
    • Be the scribe for multiple FF's.
    • -
    • Be the incident commander for multiple FF's.
    • -
    -
  • -
  • -

    Play a game of "Keep Talking and Nobody Explodes" with other people in the office.

    -
      -
    • For a more realistic experience, play it with someone in a different office over Hangouts.
    • -
    -
  • -
  • -

    Shadow a current incident commander for at least a full week shift.

    -
      -
    • Get alerted when they do, join in on the same calls.
    • -
    • Sit in on an active incident call, follow along with the chat, and follow along with what the Incident Commander is doing.
    • -
    • Do not actively participate in the call, keep your questions until the end.
    • -
    -
  • -
  • -

    Reverse shadow a current incident commander for at least a full week shift.

    -
      -
    • You should be the one to respond to incidents, and you will take point on calls, however the current IC will be there to take over should you not know how to proceed.
    • -
    -
  • -
-

Graduation#

-

What's the difference between an IC in training, and an IC? (This isn't the set up to a joke). Simple, an IC puts themselves on the schedule.

-

Handling Incidents#

-

Every incident is different (we're hopefully not repeating the same issue multiple times!), but there's a common process you can apply to each one.

-
    -
  1. -

    Identify the symptoms.

    -
      -
    • Identify what the symptoms are, how big the issue is, and whether it's escalating/flapping/static.
    • -
    -
  2. -
  3. -

    Size-up the situation.

    -
      -
    • Gather as much information as you can, as quickly as you can (remember the incident is still happening while you're doing this).
    • -
    • Get the facts, the possibilities of what can happen, and the probability of those things happening.
    • -
    -
  4. -
  5. -

    Stabilize the incident.

    -
      -
    • Identify actions you can use to proceed.
    • -
    • Gather support for the plan (See "Polling During a Decision" below).
    • -
    • Delegate remediation actions to your SME's.
    • -
    -
  6. -
  7. -

    Provide regular updates.

    -
      -
    • Maintain a cadence, and provide regular updates to everyone on the call.
    • -
    • What's happening, what are we doing about it, etc.
    • -
    -
  8. -
-

Deputy#

-

The deputy for an incident is generally the backup Incident Commander. However, as an Incident Commander, you may appoint one or more Deputies. Note that Deputy Incident Commanders must be as qualified as the Incident Commander, and that if a Deputy is assigned, he or she must be fully qualified to assume the Incident Commander’s position if required.

-

Communication Responsibilities#

-

Sharing information during an incident is a critical process. As an Incident Commander (or Deputy), you should be prepared to brief others as necessary. You will also be required to communicate your intentions and decisions clearly so that there is no ambiguity in your commands.

-

When given information from a responder, you should clearly acknowledge that you have received and understood their message, so that the responder can be confident in moving on to other tasks.

-

After an incident, you should communicate with other training Incident Commanders on any debrief actions you feel are necessary.

-

Incident Call Procedures and Lingo#

-

The Steps for Incident Commander provide a detailed description of what you should be doing during an incident.

-

Additionally, aside from following the usual incident call etiquette, there are a few extra etiquette guidelines you should follow as IC:

-
    -
  • Always announce when you join the call if you are the on-call IC.
  • -
  • Don't let discussions get out of hand. Keep conversations short.
  • -
  • Note objections from others, but your call is final.
  • -
  • If anyone is being actively disruptive to your call, kick them off.
  • -
  • Announce the end of the call.
  • -
-

Here are some examples of phrases and patterns you should use during incident calls.

-

Start of Call Announcement#

-

At the start of any major incident call, the incident commander should announce the following,

-
-

This is [NAME], I am the Incident Commander for this call.

-
-

This establishes to everyone on the call what your name is, and that you are now the commander. You should state "Incident Commander" and not "IC", as newcomers may not be familiar with the terminology yet. The word "commander" makes it very clear that you're in charge.

-

Start of Incident, IC Not Present#

-

If you are trained to be an IC and have joined a call, even if you aren't the IC on-call, you should do the following,

-
-

Is there an IC on the call?

-

(pause)

-

Hearing no response, this is [NAME], and I am now the Incident Commander for this call.

-
-

If the on-call IC joins later, you may hand over to them at your discretion (see below for the hand-off procedure).

-

Checking if SME's are Present#

-

During a call, you will want to know who is available from the various teams in order to resolve the incident. Etiquette dictates that people should announce themselves, but sometimes you may be joining late to the call. If you need a representative from a team, just ask on the call. Your deputy can page one if no one answers.

-
-

Do we have a representative from [X] on the call?

-

(pause)

-

Deputy, can you go ahead and page the [X] on-call please.

-
-

Assigning Tasks#

-

When you need to give out an assignment or task, give it to a person directly. Never say "can someone do...", as this leads to the bystander effect. Instead, all actions should be assigned to a specific person, and time-boxed with a specific number of minutes.

-
-

IC: Bob, please investigate the high latency on web app boxes. I'll come back to you for an answer in 3 minutes.

-

Bob: Understood

-
-

Keep track of how many minutes you assigned, and check in with that person after that time. You can get help from your deputy to help track the timings.
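
The time-boxing described above can be tracked with something as simple as the following sketch. It is only an illustration, not a tool this process prescribes: the `Assignment` class, the `overdue` helper, and the second assignee ("Alice") are all hypothetical names made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Assignment:
    """One time-boxed task handed to a specific person (hypothetical structure)."""
    assignee: str
    task: str
    assigned_at: datetime
    minutes: int

    @property
    def due_at(self) -> datetime:
        # The moment the IC promised to come back for an answer.
        return self.assigned_at + timedelta(minutes=self.minutes)


def overdue(assignments, now):
    """Return assignments whose time box has elapsed, earliest deadline first."""
    late = [a for a in assignments if now >= a.due_at]
    return sorted(late, key=lambda a: a.due_at)


start = datetime(2016, 1, 1, 12, 0)
assignments = [
    Assignment("Bob", "investigate high latency on web app boxes", start, 3),
    Assignment("Alice", "check the queue depth", start, 10),  # hypothetical second task
]

# Four minutes in, only Bob's 3-minute time box has expired.
for a in overdue(assignments, start + timedelta(minutes=4)):
    print(f"Check in with {a.assignee}: {a.task}")
```

A deputy could keep a list like this by hand just as well; the point is that every task has a named owner and a deadline to check back on.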

-

Polling During a Decision#

-

If a decision needs to be made, it comes down to the IC. Once the IC makes a decision, it is final. But it's important that no one can come later and object to the plan, saying things like "I knew that would happen". An IC will use very specific language to be sure that doesn't happen.

-
-

The proposal is to [EXPLAIN PROPOSAL]

-

Are there any strong objections to this plan?

-

(pause)

-

Hearing no objections, we are proceeding with this proposal.

-
-

If you were to ask "Does everyone agree?", you'd get people speaking over each other, you'd have quiet people not speaking up, etc. Asking for any STRONG objections gives people the chance to object, but only if they feel strongly on the matter.

-

Status Updates#

-

It's important to maintain a cadence during a major incident call. Whenever there is a lull in the proceedings, usually because you're waiting for someone to get back to you, you can fill the gap by explaining the current situation and the actions that are outstanding. This makes sure everyone is on the same page.

-
-

While we wait for [X], here's an update of our current situation.

-

We are currently in a SEV-1 situation, we believe to be caused by [X]. There's an open question to [Y] who will be getting back to us in 2 minutes. In the meantime, we have Tweeted out that we are experiencing issues. Our next Tweet will be in 10 minutes if the incident is still ongoing at that time.

-

Are there any additional actions or proposals from anyone else at this time?

-
-

Transfer of Command#

-

Transfer of command involves (as the name suggests) transferring command to another Incident Commander. There are multiple reasons why a transfer of command might take place,

-
    -
  • Commander has become fatigued and is unable to continue.
  • -
  • Incident complexity changes.
  • -
  • Change of command is necessary for effectiveness or efficiency.
  • -
  • Personal emergencies arise (e.g., Incident Commander has a family emergency).
  • -
-

Never feel like you are not doing your job properly by handing over. Handovers are encouraged. To hand over, notify the other IC out of band from the main call (via Slack, for example) that you wish to transfer command. Update them with anything you feel is appropriate. Then announce on the call,

-
-

Everyone on the call, be advised, at this time I am handing over command to [X].

-
-

The new IC should then announce on the call as if they were joining a new call (see above), so that everyone is aware of the new commander.

-

Note that the arrival of a more qualified person does NOT necessarily mean a change in incident command.

-

Maintaining Order#

-

Often on a call, people will be talking over one another, or an argument about the correct way to proceed may break out. As Incident Commander, it's important that you maintain order on the call. The Incident Commander has the power to remove someone from the call if necessary (even if it's the CEO), but often you just need to remind people to speak one at a time. Sometimes the discussion can be healthy even if it starts as an argument, but you shouldn't let it go on for too long.

-
-

(noise)

-

Ok everyone, can we all speak one at a time please. So far I'm hearing two options to proceed: 1) [X], 2) [Y].

-

Are there any other proposals someone would like to make at this time?

-

...etc

-
-

Getting Straight Answers#

-

You may ask a question as IC and receive an answer that doesn't actually answer your question. This generally happens when you ask for a yes/no answer but get a more detailed explanation. Often it is because the person doesn't understand the call etiquette. But if it continues, you need to take action in order to proceed.

-
-

IC: Is this going to disable the service for everyone?

-

SME: Well... for some people it....

-

IC: Stop. I need a yes/no answer. Is this going to disable the service for everyone?

-

SME: Well... it might not do...

-

IC: Stop. I'm going to ask again, and the only two words I want to hear from you are "yes" or "no". Is this going to disable the service for everyone?

-

SME: Well.. like I was saying..

-

IC: Stop. Leave the call. Backup IC can you please page the backup on-call for [service] so that we can get an answer.

-
-

Executive Swoop#

-

You may get someone who would be senior to you during peacetime come on the call and start overriding your decisions as IC. This is unacceptable behaviour during wartime, as the IC is in command. While this is rare, you can get things back on track with the following,

-
-

Executive: No, I don't want us doing that. Everyone stop. We need to rollback instead.

-

IC: Hold please. [EXECUTIVE], do you wish to take over command?

-

Executive: Yes/No

-

(If yes) IC: Understood. Everyone on the call, be advised, at this time I am handing over command to [EXECUTIVE]. They are now the incident commander for this call.

-

(If no) IC: In that case, please cause no further interruptions or I will remove you from the call.

-
-

This makes it clear to the executive that they have the option of being in charge and making decisions, but in order to do so they must continue as an Incident Commander. If they refuse, then remind them that you are in charge and disruptive interruptions will not be tolerated. If they continue, remove them from the call.

-

End of Call Sign-Off#

-

At the end of an incident, you should announce to everyone on the call that you are ending the call at this time, and provide information on where followup discussion can take place. It's also customary to thank everyone.

-
-

Ok everyone, we're ending the call at this time. Please continue any followup discussion on Slack. Thanks everyone.

-
-

Examples From Pop Culture#

-

PagerDuty employees have access to all previous incident calls, and can listen to them at their discretion. We can't release these calls, so for everyone else, here are some short examples from popular culture to show the techniques at work.

-
- - -

Here's a clip from the movie Apollo 13, where Gene Kranz (Flight Director / Incident Commander) shows some great examples of Incident Command. Here are some things to note:

-
    -
  • Walks into the room, and immediately obvious that he's the IC. Calms the noise, and makes sure everyone is paying attention.
  • -
  • Provides a status update so people are aware of the situation.
  • -
  • Projector breaks, doesn't get sidetracked on fixing it, just moves on to something else.
  • -
  • Provides a proposal for how to proceed and elicits feedback.
      -
    • Listens to the feedback calmly.
    • -
    • When counter-proposal is raised, states that he agrees and why.
    • -
    -
  • -
  • Allows a discussion to happen, listens to all points. When discussion gets out of hand, re-asserts command of the situation.
      -
    • Explains his decision, and why.
    • -
    -
  • -
  • Explains his full plan and decision, so everyone is on the same page.
  • -
-
- - -

Another clip from Apollo 13. Things to note:

-
    -
  • Summarizes the situation, and states the facts.
  • -
  • Listens to the feedback from various people.
  • -
  • When a trusted SME provides information counter to what everyone else is saying, asks for additional clarification ("What do you mean, everything?")
  • -
  • Wise cracking remarks are not acknowledged by the IC ("You can't run a vacuum cleaner on 12 amps!")
  • -
  • "That's the deal?".. "That's the deal".
  • -
  • Once decision is made, moves on to the next discussion.
  • -
  • Delegates tasks.
  • -
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/training/overview/index.html b/training/overview/index.html deleted file mode 100644 index d7e3e02..0000000 --- a/training/overview/index.html +++ /dev/null @@ -1,545 +0,0 @@ - - - - - - - - - - Overview - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Overview

- -

Learning about the Spearhead Systems incident response process is an important part of being an effective on-call engineer at Spearhead Systems. This section goes over our training material for the various roles that are involved in our incident response, along with some additional information and training material from government agencies.

-

Training Guides#

-

Our training guides are split up by role, however you are encouraged to read through the training guides even for roles you don't belong to, as it can give you some good insight into how those people will be behaving during major incidents.

-
    -
  • Incident Commander Training - The "IC" is the person who drives a major incident to resolution. They're the person who will be directing everyone else.
  • -
  • Deputy Training - The Deputy is someone who supports the Incident Commander and can take over for them if necessary.
  • -
  • Scribe Training - This is intended for individuals who will be acting as a scribe during an incident.
  • -
  • SME / Resolver Training - This is relevant to everyone at Spearhead Systems who is on-call for any team.
  • -
-

National Incident Management System (NIMS)#

-

Our incident response process is loosely based on the US National Incident Management System (NIMS), which is described as,

-

A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards—regardless of cause, size, location, or complexity—in order to reduce loss of life, property and harm to the environment.

-

While it might not initially seem that this would be applicable to an IT operations environment, we've found that many of the lessons learned from major incidents in these situations can be directly applied to our industry too. The principles are the same and span many different environments.

-

NIMS NIMS Training

-

If you want to learn more about NIMS, we recommend the ICS-100 and ICS-700 online training courses, which go over NIMS and the Incident Command System (you can also take an online examination after training in order to get a certificate from FEMA). There is also a wealth of additional training material and courses from FEMA on NIMS, which we encourage you to look at.

-

If you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local CERT programs (Community Emergency Response Teams). Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. Not only is it an opportunity to get real world experience with disaster response, but the skills you learn can be applied to everyday life too.

-

Also take a look at the Additional Reading section on the home page.

- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/training/scribe/index.html b/training/scribe/index.html deleted file mode 100644 index a244c75..0000000 --- a/training/scribe/index.html +++ /dev/null @@ -1,625 +0,0 @@ - - - - - - - - - - Scribe - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Scribe

- -

So you want to be a scribe? You've come to the right place! You don't need to be a senior team member to become a deputy or scribe, anyone can do it providing you have the requisite knowledge!

-

Typewriter -Credit: Holly Chaffin

-

Purpose#

-

The purpose of the Scribe is to maintain a timeline of key events during an incident. This means documenting actions and keeping track of any followup items that will need to be addressed.

-

It's important for the rest of the command staff to be able to focus on the problem at hand, rather than worrying about documenting the steps.

-

Your job as Scribe is to listen to the call and to watch the incident Slack room, keeping track of context and actions that need to be performed, documenting these in Slack as you go. You should not be performing any remediations, checking graphs, or investigating logs. Those tasks will be delegated to the subject matter experts (SME's) by the Incident Commander.

-

Prerequisites#

-

Before you can be a Scribe, it is expected that you meet the following criteria. Don't worry if you don't meet them all yet, you can still continue with training!

-
    -
  • Excellent verbal and written communication skills.
  • -
  • Has knowledge of obscure PagerDuty terms.
  • -
-

Responsibilities#

-

Read up on our Different Roles for Incidents to see what is expected from a Scribe, as well as what we expect from the other roles you'll be interacting with.

-

Training Process#

-

There is no formal training process for this role; reading this page should be sufficient for most tasks. Here's a list of things you can do to train, though,

-
    -
  • -

    Read the rest of this page, particularly the sections below.

    -
  • -
  • -

    Participate in Failure Friday (FF).

    -
      -
    • Shadow a FF to see how it's run.
    • -
    • Be the scribe for multiple FF's.
    • -
    -
  • -
-

Scribing#

-

Scribing is more art than science. The objective is to keep an accurate record of important events that occurred on the call, so that we can look back at the timeline to see what happened. But what exactly is important? There's no single answer, and it really comes down to judgement and experience. But here are some general things you most definitely want to capture as scribe.

-
    -
  • The result of any polling decisions.
      -
    • This is not "9 people voted yay, 3 voted nay".
    • -
    • It is "Polled for if we should do rolling restart. is proceeding with restart."
    • -
    -
  • -
  • Any followup items that are called out as "We should do this..", "Why didn't this?..", etc.
      -
    • This is not "Why isn't the Support representative on the call?"
    • -
    • This is "TODO: Why didn't we get paged for this earlier?"
    • -
    -
  • -
-

Incident Call Procedures and Lingo#

-

The Steps for Scribe provide a detailed description of what you should be doing during an incident.

-

Here are some examples of phrases and patterns you should use during incident calls.

-

Status Stalking#

-

At the start of any major incident call, you should start our status stalking bot, so that it will post to the room an update automatically.

-
-

!status stalk

-
-

This will provide the update and allow the IC to see the status without having to keep asking.

-

Note Important Actions#

-

During a call, you will hear lots of discussion happening. You should not be documenting all of this in the chat room. You only want to document things which will be important for the final timeline. It's not always obvious what this might be, and it's usually a matter of judgement. You generally want to note any actions the IC has asked someone to perform, along with the result of any polling decisions.

-
-

Polled for decision on whether to perform rolling restart. We are proceeding with restart. [USER_A] to execute.

-
-

Some actions might seem important at the time, but end up not being. That's OK. It's better to have more info than not enough, but don't go overboard.

-

Note Followup Actions#

-

Sometimes during the call, someone will either mention something we "should fix", or the IC will specifically ask you to note a followup item. You can do this in Slack by simply prefixing the message with "TODO", which will make it easier to search for later.

-
-

TODO: Why did we not get paged for the fall in traffic on [X] cluster?

-
-

The post-mortem owner will find these after and raise tasks for them.
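
As a rough sketch of why the "TODO" prefix helps, the snippet below filters followup items out of a list of chat messages. It assumes the messages have already been exported as plain strings; the `followup_items` function is a hypothetical helper for illustration and does not use any Slack API.

```python
def followup_items(messages):
    """Return the text of every message that starts with the "TODO" prefix."""
    items = []
    for message in messages:
        text = message.strip()
        if text.upper().startswith("TODO"):
            # Drop the prefix and any separator (":" or spaces) that follows it.
            items.append(text[4:].lstrip(": ").strip())
    return items


# Sample messages taken from the examples on this page.
log = [
    "Polled for decision on whether to perform rolling restart. We are proceeding with restart.",
    "TODO: Why did we not get paged for the fall in traffic on [X] cluster?",
    "Call is over, thanks everyone. Follow up in Slack.",
]

for item in followup_items(log):
    print(item)
```

Because the scribe uses one consistent prefix, a plain text search in the chat client gives the same result without any code.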

-

End of Call Notification#

-

When the IC ends the call, you should post a message into Slack to let everyone know the call is over, and that they should continue discussion elsewhere.

-
-

Call is over, thanks everyone. Follow up in Slack.

-
-

Don't forget to also stop the status stalking.

-
-

!status unstalk

-
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file diff --git a/training/subject_matter_expert/index.html b/training/subject_matter_expert/index.html deleted file mode 100644 index 47ceedf..0000000 --- a/training/subject_matter_expert/index.html +++ /dev/null @@ -1,601 +0,0 @@ - - - - - - - - - - Subject Matter Expert - Spearhead Systems Incident Response Documentation - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
-
- - - -
- -
-
- -
- -
-
-
- -

Subject Matter Expert

- -

If you are on-call for any team at PagerDuty, you may be paged for a major incident and will be expected to respond as a subject matter expert (SME) for your service. This page details everything you need to know in order to be prepared for that responsibility. If you are interested in becoming an Incident Commander, take a look at the Incident Commander Training page.

-

Incident Response -Credit: oregondot @ Flickr

-

On-Call Expectations#

-

If you are on-call for your team, there are certain expectations of you as that on-call. This applies to both the primary and secondary on-calls. Getting paged about a SEV-3 or SEV-4 in your system comes with different expectations than getting paged with a major SEV-2.

-

Before Going On-Call#

-
    -
  1. Be prepared, by having already familiarized yourself with our incident response policies and procedures. In particular,
      -
    1. Different Roles for Incidents - You will be acting as a "Resolver" or "SME". But you should familiarize yourself with the other roles and what they will be doing.
    2. -
    3. Incident Call Etiquette - How to behave during an incident call.
    4. -
    5. During an Incident - What to do during an incident. You are specifically interested in the "Resolver" steps, but you should familiarize yourself with the entire document.
    6. -
    7. Glossary - Familiarize yourself with the terminology that may be used during the call.
    8. -
    -
  2. -
  3. Make sure you have set up your alerting methods, and that PagerDuty can bypass your "Do Not Disturb" settings.
  4. -
  5. Check you can join the incident call. You may need to install a browser plugin. You don't want to be doing that the first time you get paged.
  6. -
  7. Be aware of your upcoming on-call time and arrange swaps around travel, vacations, appointments, etc.
  8. -
  9. If you are an Incident Commander, make sure you are not on-call for your team at the same time as being on-call as Incident Commander.
  10. -
-

During On-Call Period#

-
    -
  1. Have your laptop and Internet with you at all times during your on-call period (office, home, a MiFi, a phone with a tethering plan, etc).
  2. -
  3. If you have important appointments, you need to get someone else on your team to cover that time slot in advance.
  4. -
  5. When you receive an alert for a major incident, you are expected to join the incident call and Slack as quickly as possible (within minutes).
      -
    1. You will be asked questions or given actions by the Incident Commander. Answer questions concisely, and follow all actions given (even if you disagree with them).
    2. -
    -
  6. -
-

Response Mobilization#

-

When an incident occurs, you must be mobilized or assigned to become part of the incident response. In other words, until you are mobilized to the incident via a page or being directly asked by someone else on the incident, you remain in your everyday role. After being mobilized, your first task is to check in and receive an assignment. While it's tempting to see an incident happening and want to jump in and help, when resources show up that have not been requested, the management of the incident can be compromised.

-

"Never Hesitate to Escalate"#

-

If you're not sure about something, it is perfectly acceptable to bring in other SMEs from your team that you believe know a given system better than you. Don't let your ego keep you from bringing in additional help. Our motto is "Never hesitate to escalate". You will never be looked down upon for escalating something because you didn't know how to handle it.

-

Blameless#

-

There will be incidents. Some will be caused by you, some will be caused by others... some will just happen. Our entire incident response process is completely blameless. Blaming people is counter productive and just distracts from the problem at hand. No matter how an incident started, they all need to get solved as quickly as possible.

-

Wartime vs Peacetime#

-

Behavior during a major incident is very different to any other alert you may have received in the past. We call a major incident "wartime", and make a distinction between that and normal everyday operations ("peacetime").

-

Peacetime#

-

The organizational structure is generally based on seniority. The more senior members of a team will lead discussions, and managers or team leads will have the final say. Decisions are made after careful consideration of all options, and to minimize potential risk to customers.

-

Wartime#

-

Wartime is different, and you will notice on our major incident calls that there's a different organizational structure.

-
    -
  • The Incident Commander is in charge. No matter their rank during peacetime, they are now the highest ranked individual on the call, higher than the CEO.
  • -
  • Primary responders (folks acting as primary on-call for a team/service) are the highest ranked individuals for that service.
  • -
  • Decisions will be made by the IC after consideration of the information presented. Once that decision is made, it is final.
  • -
  • Riskier decisions can be made by the IC than would normally be considered during peacetime.
      -
    • For example, the IC may decide to drop events for a particular customer in order to maintain the integrity of the system for everyone else.
    • -
    -
  • -
  • The IC may go against a consensus decision. If a poll is done and 9 out of 10 people agree but 1 disagrees, the IC may choose the dissenting option despite the majority vote.
      -
    • Even if you disagree, the IC's decision is final. During the call is not the time to argue with them.
    • -
    -
  • -
  • The IC may use language or behave in a way you find rude. This is wartime, and they need to do whatever it takes to resolve the situation, so sometimes rudeness occurs. This is never anything personal, and something you should be prepared to experience if you've never been in a wartime situation before.
  • -
  • You may be asked to leave the call by the IC, or you may even be forcibly kicked off a call. It is at the IC's discretion to do this if they feel you are not providing useful input. Again, this is nothing personal and you should remember that wartime is different from peacetime.
  • -
- - - - -
-
-
-
-
-
-
-
-
-
-
- - - - - - \ No newline at end of file