From 7a9ec8a64360aef177b3efabc02f907f5280489f Mon Sep 17 00:00:00 2001 From: Marius Pana Date: Sun, 13 Aug 2017 20:17:52 +0300 Subject: [PATCH] update cf last stand-up --- _site/docs/index.md | 4 +- docs/about.md | 10 ++-- docs/after/post_mortem_process.md | 2 +- docs/before/call_etiquette.md | 19 ++++---- docs/before/different_roles.md | 17 +++---- docs/before/severity_levels.md | 10 ++-- docs/during/during_an_incident.md | 20 ++++---- docs/during/security_incident_response.md | 13 ++--- docs/index.md | 8 ++-- docs/oncall/alerting_principles.md | 9 ++-- docs/oncall/being_oncall.md | 58 +++++++++++++---------- 11 files changed, 91 insertions(+), 79 deletions(-) diff --git a/_site/docs/index.md b/_site/docs/index.md index 60ca4b3..ab19513 100644 --- a/_site/docs/index.md +++ b/_site/docs/index.md @@ -1,7 +1,7 @@ -This documentation covers parts of the Spearhead Systems Issue Response process. It is a copy of [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and furthermore a cut-down version of our own internal documentation, used at Spearhead Systems for any issue (incident or service request), and to prepare new employees for on-call responsibilities. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the [about page](about.md) for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and may not yet be open sourced. +This documentation covers parts of the Spearhead Systems response process for technical support service requests and incidents. 
It is based on [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and furthermore a cut-down version of our own internal documentation, used to prepare new employees for servicing our customer requests and incidents. It provides information not only on preparing for an incident, but also what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the [about page](about.md) for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and may not yet be public. !!! note "Issue, Incident and Service Request" - At Spearhead we use the term *issue* to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". Note that we use the term Incident to describe both a service request as well as incidents. For brevity we will use SR and IN throughout this documentation. + At Spearhead we use the term *issue* to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". We use the term *issue* to describe both a service request as well as incidents. For brevity we will use SR and IN throughout this documentation. A "service request" is usually initiated by a human and is generally not critical for the normal functioning of the business while an "incident" is an issue that is or can cause interruption to normal business functions. diff --git a/docs/about.md b/docs/about.md index 0e8f7dc..83b5008 100644 --- a/docs/about.md +++ b/docs/about.md @@ -1,4 +1,4 @@ -This site documents parts of the Spearhead Systems Issue Response process. 
It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after. +This site documents parts of the Spearhead Systems technical support response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com). @@ -8,11 +8,11 @@ A collection of pages detailing how to efficiently deal with any incident or ser ## Who is this for? -It is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically this is for all of our Technical Support staff. +It is intended for our technical support staff and customers/partners looking for more details regarding our support process. ## Why do I need it? -As a service provider Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers issues therefore this documentation is a guideline for how we handle these requests. This documentation will allow you give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time. +As a service provider Spearhead Systems deals with technical support requests on a daily basis. The reason we exist is to deliver our technical support services, which boils down to responding to incidents and service requests. 
We want to deliver a smooth and seamless experience for resolving our customers' issues; therefore, this documentation is a guideline for how we handle these requests. This documentation will give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time. ## What is covered? @@ -20,10 +20,12 @@ Anything from preparing to [go on-call](/oncall/being_oncall.md), definitions of ## What is missing? -Lots, dig in an help us complete the picture. We can migrate most processes from Sharepoint here. +Lots, dig in and help us complete the picture. We can migrate most processes from Sharepoint here. We're also looking for experienced operations/support people who are willing to share their experience with us and help us provide a better support service. ## License This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file. Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account, feel free to fork the repository and use it as a base for your own internal documentation. + +Please also check out [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) response documentation which has made our own efforts in documenting our process much easier. diff --git a/docs/after/post_mortem_process.md b/docs/after/post_mortem_process.md index 42d7693..49d38e3 100644 --- a/docs/after/post_mortem_process.md +++ b/docs/after/post_mortem_process.md @@ -3,7 +3,7 @@ For every major issue (SR/IN +major), we need to follow up with a post-mortem. 
A ![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg) ## Owner Designation -The first step is that a post-mortem owner will be designated. This is done by the TL either at the end of a major incident call, or very shortly after. You will be notified directly by the TL if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use DoIT and Slack for coordinating followup. A detailed list of the steps is available below, +The first step is that a post-mortem owner will be designated. This is done by the TL either at the end of a major incident call, or very shortly after. You will be notified directly by the TL if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use DoIT and our internal Chat for coordinating followup. A detailed list of the steps is available below, ## Owner Responsibilities As owner of a post-mortem, you are responsible for the following, diff --git a/docs/before/call_etiquette.md b/docs/before/call_etiquette.md index 6f892dc..d7cf63b 100644 --- a/docs/before/call_etiquette.md +++ b/docs/before/call_etiquette.md @@ -3,20 +3,20 @@ You've just joined Spearhead Systems support staff and you've never worked in a ![Obama phone](../assets/img/headers/obama_phone.jpg) *Credit: [Official White House Photo](https://commons.wikimedia.org/wiki/File:Barack_Obama_on_phone_with_Benjamin_Netanyahu_2009-06-08.jpg) by Pete Souza* -## First Steps +## First Steps regarding Incidents -* If you intend on participating on the incident call you should join both the call, review the associated cards in DoIT, and jump on the corresponding Slack channel. 
+* If you intend on participating on the incident call you should join both the call (if there is a call), review the associated cards in DoIT, and jump on the corresponding internal Chat channel. * Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum. * Keep your microphone muted until you have something to say. * Identify yourself when you join the call; State your name and the system you are the expert for. * Speak up and speak clearly. * Be direct and factual. * Keep conversations/discussions short and to the point. -* Bring any concerns to the Team Leader (IC) on the call. +* Bring any concerns to the Team Leader (TL) on the call. * Respect time constraints given by the Team Leader. !!! warning "Incident Call" - Not all issues start with an incident call. Some issues may be completely automated and available only in DoIT while others may be in the incipient stages and the customer may still be on the phone/Slack detailing their issue. + Not all issues start with an incident call. Some issues may be completely automated and available only in DoIT while others may be in the incipient stages and the customer may still be on the phone/internal Chat detailing their issue. ## Lingo **Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.** @@ -35,6 +35,9 @@ Do not invent new abbreviations, and always favor being explicit of implicit. It ## The Team Leader The Team Leader (TL) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking. +!!! info "TL is not available" + A TL may not be available in which case the incident call will be guided by the senior Sysadmin or SME available. + * Follow all instructions from the team leader, without exception. 
* Do not perform any actions unless the team leader has told you to do so. * The team leader will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them. @@ -47,11 +50,7 @@ The Team Leader (TL) is the leader of the incident response process, and is resp ## Problems? #### There's no team leader on the call! I don't know what to do! -Ask on the call if an TL is present. If you have no response, try asking in Slack. If there is no TL the sysadmin can take over this role temporarily. +Ask on the call if a TL is present. If you have no response, try asking in our internal Chat. If there is no TL the sysadmin can take over this role temporarily. #### There is not enough information! -The definitive source of information for all issues is in DoIT. If at any point there is a discrepancy ask the TL or Sysadmins to provide up to date information and update the card/tasks accordingly. At a minimum information should be available in Slack. - -#### I can join the call or Slack, but not both, what should I do? -You're welcome to join only one of the channels, however you should not actively participate in the incident response if so, as it causes disjoined communication. Liaise with someone who is both in Slack and on the call to provide any input you may have so that they can raise it. - +The definitive source of information for all issues is in DoIT. If it is lacking there then you need to make a note of it and make sure that whoever created the card understands the importance of complete information in a timely manner. If at any point there is a discrepancy ask the TL or Sysadmins to provide up-to-date information and update the card/tasks accordingly. 
diff --git a/docs/before/different_roles.md b/docs/before/different_roles.md index e68f1f5..5843ae0 100644 --- a/docs/before/different_roles.md +++ b/docs/before/different_roles.md @@ -1,5 +1,5 @@ -Our support services are deliviered via a flat organizational structure. The same people that deliver projects are also there to deliver ongoing support/maintenance services. -There are several roles in our support team at Spearhead Systems. Certain roles only have one person per incident (e.g. sysadmin), whereas other roles can have multiple people (e.g. Sysadmins, Solution Architects, etc.). It's all about coming together as a team, working the problem, and getting a solution quickly. +Our support services are currently delivered via a flat organizational structure. +There are, however, several roles in our support team at Spearhead Systems. Certain roles only have one person per incident (e.g. sysadmin), whereas other roles can have multiple people (e.g. Sysadmins, Solution Architects, etc.). It's all about coming together as a team, working the problem, and getting a solution quickly. Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page. @@ -10,7 +10,7 @@ Here is a rough outline of our role hierarchy, with each role discussed in more ## Team Leader (TL) ### What is it? -A Team Leader acts as the single source of truth of what is currently happening and what is going to happen during an major incident. They come in all shapes, sizes, and colors. TL's are also the key elements in a project (boards in DoIT). +A Team Leader acts as the single source of truth of what is currently happening and what is going to happen during a major incident and general ongoing support. They come in all shapes, sizes, and colors. TL's are also the key elements in a project (boards in DoIT). ### Why have one? As any system grows in size and complexity, things break and cause incidents. 
The TL is needed to help drive major incidents to resolution by organizing his team towards a common goal. A TL's skillset includes project and resource management skills which are essential in driving both projects and incidents to a smooth resolution. @@ -21,6 +21,7 @@ As any system grows in size and complexity, things break and cause incidents. Th * Create the DoIT board(s) and other project planning related materials. * Funnel people to these communications channels. * Train team members on how to communicate and train other TL's. + * Train team members and help them prepare with the proper know-how/tools to deliver the project. 1. Drive incidents and projects to resolution, * Get everyone on the same communication channel. * Collect information from team members for their services/area of ownership status. @@ -35,7 +36,7 @@ As any system grows in size and complexity, things break and cause incidents. Th * Work with Managers/Support on scheduling preventive actions. ### Who are they? -Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule. +Anyone on the on-call schedule is a TL during his shift. Trainees are typically on the TL Shadow schedule. ### How can I become one? Take a look at our [Team Leader training guide](/training/incident_commander.md). @@ -48,16 +49,16 @@ Take a look at our [Team Leader training guide](/training/incident_commander.md) A Sysadmin is a direct support role for the Team Leader. This is not a shadow where the person just observes, the Sysadmin is expected to perform important tasks during an incident. ### Why have one? -It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin helps to support the TL and keep them stay focussed on the incident. +It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. 
The Sysadmin supports the TL and helps them stay focused on the incident. ### What are the responsibilities? The Sysadmin is expected to: 1. Bring up issues to the TL that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc). 1. Be a "hot standby" TL, should the primary need to either transition to a SME, or otherwise have to step away from the TL role. -1. Page SME's or other on-call engineers as instructed by the Team Leader. +1. Call SME's or other on-call engineers as instructed by the Team Leader. 1. Manage the incident call, and be prepared to remove people from the call if instructed by the Team Leader. -1. Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), Slack and email/telefone as necessary. +1. Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), internal Chat and email/telephone as necessary. ### Who are they? Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as an Team Leader as they may be required to take over command. @@ -79,7 +80,7 @@ The Team Leader will need to focus on the problem at hand, and the sysadmins and The Scribe is expected to: 1. Ensure the incident call is being recorded. -1. Note in DoIT, Slack, etc. important data, events, and actions, as they happen. Specifically: +1. Note in DoIT, internal Chat, etc. important data, events, and actions, as they happen. 
Specifically: * Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock") * Status reports when one is provided by the TL (Example: "We are in IN-Major, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes") * Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.") diff --git a/docs/before/severity_levels.md b/docs/before/severity_levels.md index 4cad096..9c81fcb 100644 --- a/docs/before/severity_levels.md +++ b/docs/before/severity_levels.md @@ -1,6 +1,6 @@ -The first step in any incident response process is to determine what actually constitutes an incident. We have two high level categories for classifying incidents: this is done using "SR" or "IN" defintions with an attached priority of "Minor", "Normal" or "Major". "SR" are "Service requests" initiated by a customer and usually do not constitute a critical issue (there are exceptions) while "IN" are "incidents" which are generally "urgent". +The first step in any incident response process is to determine what actually constitutes an incident. We have two high-level categories for classifying incidents: these are "SR" or "IN" definitions with an attached priority of "Minor", "Normal" or "Major". "SR" are "Service requests" initiated by a customer and usually do not constitute a critical issue (there are exceptions) while "IN" are "incidents" which are generally "urgent". -All of our operational issues are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher severity SR or IN than a lower one (a "Major" priority gets a more intensive response than a "Normal" incident for example). 
+All issues reported to Spearhead are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher-severity SR or IN before a lower-severity one (a "Major" priority gets a more intensive response than a "Normal" incident for example). !!! note "Always Assume The Worst" If you are unsure which level an incident is (e.g. not sure if IN is Major or Normal), **treat it as the higher one**. During an incident is not the time to discuss or litigate severities, just assume the highest and review during a post-mortem. @@ -26,7 +26,7 @@ All of our operational issues are to be classified as either a Service Request o See During an Incident. - Normal + Major