update per last stand-up

This commit is contained in:
Marius Pana 2017-08-13 20:17:52 +03:00
parent 23a9056b57
commit 7a9ec8a643
11 changed files with 91 additions and 79 deletions

This documentation covers parts of the Spearhead Systems response process for technical support service requests and incidents. It is based on [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) documentation and is furthermore a cut-down version of our own internal documentation, used to prepare new employees for servicing our customer requests and incidents. It provides information not only on preparing for an incident, but also on what to do during and after. It is intended to be used by on-call practitioners and those involved in an operational incident response process (or those wishing to enact a formal incident response process). See the [about page](about.md) for more information on what this documentation is and why it exists. This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and may not yet be public.
!!! note "Issue, Incident and Service Request"
At Spearhead we use the term *issue* to define any request from our customers. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". We use the term *issue* to describe both service requests and incidents. For brevity we will use SR and IN throughout this documentation.
A "service request" is usually initiated by a human and is generally not critical to the normal functioning of the business, while an "incident" is an issue that is causing, or can cause, an interruption to normal business functions.

This site documents parts of the Spearhead Systems technical support response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after.
This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com).
## Who is this for?
It is intended for our technical support staff and customers/partners looking for more details regarding our support process.
## Why do I need it?
As a service provider, Spearhead Systems deals with technical support requests on a daily basis. The reason we exist is to deliver our technical support services, which boils down to responding to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers' issues, so this documentation is a guideline for how we handle these requests. It will give you a head start on how to deal with issues in a way that leads to the fastest possible recovery time.
## What is covered?
## What is missing?
Lots, dig in and help us complete the picture. We can migrate most processes from Sharepoint here. We're also looking for experienced operations/support people who are willing to share their experience with us and help us provide a better support service.
## License
This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file.
Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account, feel free to fork the repository and use it as a base for your own internal documentation.
Please also check-out [PagerDuty's](https://github.com/PagerDuty/incident-response-docs/) response documentation which has made our own efforts in documenting our process much easier.

For every major issue (SR/IN +major), we need to follow up with a post-mortem.
![Post-Mortem](../assets/img/headers/pagerduty_post_mortem.jpg)
## Owner Designation
The first step is that a post-mortem owner will be designated. This is done by the TL either at the end of a major incident call, or very shortly after. You will be notified directly by the TL if you are the owner for the post-mortem. The owner is responsible for populating the post-mortem page, looking up logs, managing the followup investigation, and keeping all interested parties in the loop. Please use DoIT and our internal Chat for coordinating followup. A detailed list of the steps is available below.
## Owner Responsibilities
As owner of a post-mortem, you are responsible for the following,

![Obama phone](../assets/img/headers/obama_phone.jpg)
*Credit: [Official White House Photo](https://commons.wikimedia.org/wiki/File:Barack_Obama_on_phone_with_Benjamin_Netanyahu_2009-06-08.jpg) by Pete Souza*
## First Steps regarding Incidents
* If you intend to participate in the incident call you should join the call (if there is one), review the associated cards in DoIT, and jump on the corresponding internal Chat channel.
* Make sure you are in a quiet environment in order to participate on the call. Background noise should be kept to a minimum.
* Keep your microphone muted until you have something to say.
* Identify yourself when you join the call: state your name and the system you are the expert for.
* Speak up and speak clearly.
* Be direct and factual.
* Keep conversations/discussions short and to the point.
* Bring any concerns to the Team Leader (TL) on the call.
* Respect time constraints given by the Team Leader.
!!! warning "Incident Call"
Not all issues start with an incident call. Some issues may be completely automated and available only in DoIT while others may be in the incipient stages and the customer may still be on the phone/internal Chat detailing their issue.
## Lingo
**Use clear terminology, and avoid using acronyms or abbreviations during a call. Clear and accurate communication is more important than quick communication.**
Do not invent new abbreviations, and always favor being explicit over implicit.
## The Team Leader
The Team Leader (TL) is the leader of the incident response process, and is responsible for bringing the incident to resolution. They will announce themselves at the start of the call, and will generally be doing most of the talking.
!!! info "TL is not available"
A TL may not be available, in which case the incident call will be guided by the most senior Sysadmin or SME available.
* Follow all instructions from the team leader, without exception.
* Do not perform any actions unless the team leader has told you to do so.
* The team leader will typically poll for any strong objections before performing a large action. This is your time to raise any objections if you have them.
## Problems?
#### There's no team leader on the call! I don't know what to do!
Ask on the call if a TL is present. If you have no response, try asking in our internal Chat. If there is no TL, the sysadmin can take over this role temporarily.
#### There is not enough information!
The definitive source of information for all issues is DoIT. If information is lacking there, make a note of it and make sure that whoever created the card understands the importance of providing complete information in a timely manner. If at any point there is a discrepancy, ask the TL or Sysadmins to provide up-to-date information and update the card/tasks accordingly.

Our support services are currently delivered via a flat organizational structure.
There are however several roles in our support team at Spearhead Systems. Certain roles have only one person per incident (e.g. the Team Leader), whereas other roles can have multiple people (e.g. Sysadmins, Solution Architects, etc.). It's all about coming together as a team, working the problem, and getting a solution quickly.
Here is a rough outline of our role hierarchy, with each role discussed in more detail on the rest of this page.
## Team Leader (TL)
### What is it?
A Team Leader acts as the single source of truth for what is currently happening and what is going to happen, both during a major incident and during general ongoing support. They come in all shapes, sizes, and colors. TLs are also the key elements in a project (boards in DoIT).
### Why have one?
As any system grows in size and complexity, things break and cause incidents. The TL is needed to help drive major incidents to resolution by organizing the team towards a common goal. A TL's skillset includes project and resource management skills, which are essential in driving both projects and incidents to a smooth resolution.
* Create the DoIT board(s) and other project planning related materials.
* Funnel people to these communications channels.
* Train team members on how to communicate and train other TL's.
* Train team members and help them prepare with the proper know-how/tools to deliver the project.
1. Drive incidents and projects to resolution,
* Get everyone on the same communication channel.
* Collect information from team members for their services/area of ownership status.
* Work with Managers/Support on scheduling preventive actions.
### Who are they?
Anyone on the on-call schedule is a TL during their shift. Trainees are typically on the TL Shadow schedule.
### How can I become one?
Take a look at our [Team Leader training guide](/training/incident_commander.md).
A Sysadmin is a direct support role for the Team Leader. This is not a shadow where the person just observes, the Sysadmin is expected to perform important tasks during an incident.
### Why have one?
It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin supports the TL and helps them stay focused on the incident.
### What are the responsibilities?
The Sysadmin is expected to:
1. Bring up issues to the TL that may otherwise not be addressed (keeping an eye on timers that have been started, circling back around to missed items from a roll call, etc).
1. Be a "hot standby" TL, should the primary need to either transition to a SME, or otherwise have to step away from the TL role.
1. Call SME's or other on-call engineers as instructed by the Team Leader.
1. Manage the incident call, and be prepared to remove people from the call if instructed by the Team Leader.
1. Liaise with stakeholders and provide status updates on DoIT (using worklogs and comments), internal Chat and email/telephone as necessary.
### Who are they?
Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as a Team Leader as they may be required to take over command.
The Scribe is expected to:
1. Ensure the incident call is being recorded.
1. Note in DoIT, internal Chat, etc. important data, events, and actions, as they happen. Specifically:
* Key actions as they are taken (Example: "prod-server-387723 is being restarted to attempt to remove the stuck lock")
* Status reports when one is provided by the TL (Example: "We are in IN-Major, service A is currently not processing events due to a stuck lock, X is restarting the app stack, next checkin in 3 minutes")
* Any key callouts either during the call or at the ending review (Example: "Note: (Bob B) We should have a better way to determine stuck locks.")
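As a rough illustration, the scribe's entries above could be captured as timestamped worklog lines. This is only a sketch: the helper and its format are assumptions for illustration, not a feature of DoIT or our Chat.

```python
from datetime import datetime, timezone

def worklog_entry(kind, text, author=None, now=None):
    """Format one scribe entry: UTC timestamp, entry kind, optional author, text."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%d %H:%M:%SZ")
    who = f" ({author})" if author else ""
    return f"[{ts}] {kind.upper()}{who}: {text}"

# Example entries mirroring the bullets above (fixed time for reproducibility):
t = datetime(2017, 8, 13, 17, 5, tzinfo=timezone.utc)
print(worklog_entry("action", "prod-server-387723 restarted to clear the stuck lock", now=t))
print(worklog_entry("note", "We should have a better way to determine stuck locks.", author="Bob B", now=t))
```

Consistent, timestamped entries make the post-mortem timeline much easier to reconstruct afterwards.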

The first step in any incident response process is to determine what actually constitutes an incident. We have two high-level categories for classifying incidents: "SR" or "IN", with an attached priority of "Minor", "Normal" or "Major". "SRs" are "Service Requests", initiated by a customer and usually not constituting a critical issue (there are exceptions), while "INs" are "Incidents", which are generally urgent.
All issues reported to Spearhead are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests, provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher-severity SR or IN before a lower one (a "Major" priority gets a more intensive response than a "Normal" incident, for example).
!!! note "Always Assume The Worst"
If you are unsure which level an incident is (e.g. not sure if an IN is Major or Normal), **treat it as the higher one**. The middle of an incident is not the time to discuss or litigate severities; just assume the highest and review during the post-mortem.
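The prioritization rules above can be sketched as a simple sort key: higher priority first and, within equal priority, Incidents before Service Requests. The rank tables and issue structure below are illustrative assumptions, not part of DoIT.

```python
# Illustrative sketch of the triage ordering described above.
PRIORITY_RANK = {"Major": 0, "Normal": 1, "Minor": 2}
TYPE_RANK = {"IN": 0, "SR": 1}  # at equal priority, Incidents come first

def triage_order(issues):
    """Sort issues into the order they should be worked."""
    return sorted(
        issues,
        key=lambda i: (PRIORITY_RANK[i["priority"]], TYPE_RANK[i["type"]]),
    )

queue = [
    {"id": "SR-12", "type": "SR", "priority": "Major"},
    {"id": "IN-34", "type": "IN", "priority": "Normal"},
    {"id": "IN-56", "type": "IN", "priority": "Major"},
]
print([i["id"] for i in triage_order(queue)])  # ['IN-56', 'SR-12', 'IN-34']
```

Note that a Major Service Request still outranks a Normal Incident, matching the rule that Incidents take priority only when no Service Request has a higher priority.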
<td>See <a href="/during/during_an_incident">During an Incident</a>.</td>
</tr>
<tr>
<td class="sev-1">Major</td>
<td>
<ul>
<li>Functionality of virtualization platform is severely impaired.</li>
</td>
<td>See <a href="/during/during_an_incident">During an Incident</a>.</td>
</tr>
<tr>
<td class="warning" colspan="3">Anything above this line is considered a "Major Incident". These are generally Incidents (IN). Below are service requests (SR) which are usually initiated by a human who can help with prioritizing. A call is triggered for all major incidents (indifferently of SR or IN).</td>
</tr>
<tr>
<td class="sev-2">Normal</td>
<td>

Information on what to do during a major incident. See our [severity level descriptions](/before/severity_levels.md) for what constitutes a major incident.
!!! note "Documentation"
Always document your activities. Keep a detailed worklog of your actions in DoIT and communicate verbosely in our internal Chat or other channels (email, etc.).
<table class="custom-table" id="contact-summary">
<thead>
</thead>
<tbody>
<tr>
<td><a href="#">#support</a> (on MS Teams/internal Chat)</td>
<td><a href="#">http://response.spearhead.systems</a></td>
<td><a href="#">+40728 005 263</a> </td>
</tr>
<tr>
<td colspan="3" class="centered">Need a TL? Ask a Sysadmin!</td>
</tr>
<tr>
<td colspan="3"><em>For executive summary updates only, join <a href="#">#executive-summary-updates</a>.</em></td>
* If you wish to participate however, you should join both. If you can't join the call for some reason, you should have a dedicated proxy for the call. Disjointed discussions in the chat room are ultimately distracting.
1. Follow along with the call/chat, add any comments you feel are appropriate, but keep the discussion relevant to the problem at hand.
* If you are not an SME, try to filter any discussion through the primary SME for your service. Too many people discussing at once becomes overwhelming, so we try to maintain a hierarchical structure to the call if possible.
1. Follow instructions from the Team Leader.
* **Is there no TL on the call?**
* Call them!
* Never hesitate to call the TL. It's much better to have them and not need them than the other way around.
!!! info "Not a call?"
Not all issues begin with a formal call. Some issues are self-explanatory and automatically generated via our monitoring platforms, a customer logging on to our portal, etc. In these scenarios [DoIT](http://doit.sphs.ro) is the definitive source. If that is not sufficient ask your TL and Sysadmin.
## Steps for the Team Leader
Resolve the incident as quickly and as safely as possible, using the Sysadmin to assist you. Delegate any tasks to relevant experts at your discretion.
1. Announce on the call, in DoIT and in our internal Chat that you are the team leader, who you have designated as sysadmin (usually the backup TL), and scribe/juniors if any.
1. Identify if there is an obvious cause to the incident (recent deployment, spike in traffic, etc.), delegate investigation to relevant experts,
* Use the service experts on the call to assist in the analysis. They should be able to quickly provide confirmation of the cause, but not always. It's the call of the TL on how to proceed in cases where the cause is not positively known. Confer with service owners and use their knowledge to help you.
* Announcing publicly is at your discretion as TL. If you are unsure, then announce publicly ("If in doubt, tweet it out").
1. Once the incident has recovered or is actively recovering, you can announce that the incident is over and that the call is ending. This usually indicates there's no more productive work to be done for the incident right now.
* Move the remaining, non-time-critical discussion to our internal Chat.
* Follow up to ensure the customer liaison wraps up the incident publicly.
* Identify any post-incident clean-up work.
* You may need to perform debriefing/analysis of the underlying root cause.
You are there to support the TL in whatever they need.
1. Be prepared to page other people as directed by the Team Leader.
1. Provide regular status updates in our internal Chat (roughly every 30mins) to the executive team, giving an executive summary of the current status. Keep it short and to the point, and use @<channel-name>.
1. Perform any remediations, check graphs, and analyze or investigate logs, unless otherwise delegated by the TL.
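The 30-minute update cadence mentioned above can be sketched as a small helper that tells the Sysadmin when the next executive summary is due. The function names and the interval default are illustrative assumptions, not part of any existing tooling.

```python
from datetime import datetime, timedelta

# Assumed cadence from the guideline above (roughly every 30 minutes).
UPDATE_INTERVAL = timedelta(minutes=30)

def next_update_due(last_update, interval=UPDATE_INTERVAL):
    """Return when the next status update should be posted."""
    return last_update + interval

def update_overdue(last_update, now, interval=UPDATE_INTERVAL):
    """True if the Sysadmin should post an executive summary now."""
    return now >= next_update_due(last_update, interval)

last = datetime(2017, 8, 13, 20, 0)
print(next_update_due(last))                                 # 2017-08-13 20:30:00
print(update_overdue(last, datetime(2017, 8, 13, 20, 45)))   # True
```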

Stop the attack as quickly as you can, via any means necessary.
* Shut down the instance from the provider console (do not delete or terminate it if you can help it, as we'll need to do forensics).
* If you happen to be logged into the box you can try to,
* Apply firewall rules to restrict traffic.
* `kill -9` any active session you think is an attacker.
* Change root password, and update /etc/shadow to lock out all other users.
* `sudo shutdown now`
Identify the likely attack vectors and patch/fix them so they cannot be re-exploited immediately after stopping the attack.
* If you suspect a third-party provider is compromised, delete all accounts except your own (and those of others who are physically present) and immediately rotate your password and MFA tokens.
* Disable/remove ssh keys that do not belong to you and those of others who are physically present.
* If you suspect a service application was an attack vector, disable any relevant code paths, or shut down the service entirely.
## Assemble Response Team
Identify the key responders for the security incident, and keep them all in the loop. Set up a secure method of communicating all information associated with the incident (internal Chat is one option). Details on the incident (or even the fact that an incident has occurred) should be kept private to the responders until you are confident the attack is not being triggered internally.
* The security and site-reliability teams should usually be involved.
* A representative for any affected services should be involved.
* A Team Leader (TL) should be appointed, who will also appoint the usual incident command roles. The incident command team will be responsible for keeping documentation of actions taken, and for notifying internal stakeholders as appropriate.
* Do not communicate with anyone not on the response team about the incident until forensics has been performed. The attack could be happening internally.
* Give the project an innocuous codename that can be used for chats/documents so if anyone overhears they don't realize it's a security incident. (e.g. sapphire-unicorn).
* Prefix all emails, and chat topics with "Legal Work Project".
## Isolate Affected Instances
Any instances which were affected by the attack should be immediately isolated from any other instances. As soon as possible, an image of the system should be taken and put into read-only cold storage for later forensic analysis.
Once you are confident the systems are secured and enough monitoring is in place:
* Monitor logs for any attempt to regain access to the system by the attacker.
## Internal Communication
**Delegate to:** CTO, GM
Communicate internally only once you are confident (via forensic analysis) that the attack was not sourced internally.
* Follow up with more information once it is known.
## Liaise With Law Enforcement / External Actors
**Delegate to:** CTO, GM
Work with law enforcement to identify the source of the attack, letting any system owners know that systems under their control may be compromised, etc.
* Contact security companies to help in assessing risk and any PR next steps.
## External Communication
**Delegate to:** TL, PR/Marketing
Only once you have validated that all of the information you have is accurate, have a timeline of events, and know exactly what information was compromised, how it was compromised, and that it won't happen again, should you prepare and release a public statement to customers informing them of the compromised information and any steps they need to take.

This documentation covers parts of the Spearhead Systems technical support response process. It is used at Spearhead Systems for any technical issue (incident or service request), and to prepare new employees for technical support responsibilities. It provides information not only on preparing for an incident or service request, but also what to do during and after. It is intended to be used by those involved in our operational technical support response process (or those wishing to become part of our support team). See the [about page](about.md) for more information on what this documentation is and why it exists.
This documentation is complementary to what is available in our [existing wiki](https://sphsys.sharepoint.com) and other systems that have not been open sourced.
!!! note "Issue: Incidents and Service Requests"
At Spearhead we use the term *issue* to define any request that we receive. Issues fall into two categories: "Service Requests (SR)" and "Incidents (IN)". An IN will generally be an issue that has an impact on the normal functioning of the business, while an SR generally does not.
![Incident Response at Spearhead Systems](./assets/img/headers/sph_ir.jpg)
## Before an Incident
Reading material for things you want to know before an incident occurs. You don't want to be reading these during an actual incident.
* [Severity Levels](before/severity_levels.md) - _Information on our severity level classification. What constitutes a Low issue? What's a "Major Incident"? etc._
* [Different Roles for Incidents](before/different_roles.md) - _Information on the roles during an incident; Team Leader, Sysadmin, etc._

We manage how we get alerted based on many factors, such as the customer's contractual SLA, the urgency of their request or incident, and so on. **An alert or notification is something which requires a human to perform an action**. Based on the severity of the issue (service request or incident) we prioritize accordingly in [DoIT](http://doit.sphs.ro).
!!! warning "Major Priority Alerts"
Anything that wakes up a human in the middle of the night should be **immediately human actionable**. If it is none of those things, then we need to adjust the alert to not bother us at those times.
| Priority | Alerts | Response |
| -------- | ------ | -------- |
| Minor | Minor-Priority Alert 24/7/365. | Requires human action at some point. |
| Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter our ticketing or inboxes. If they are enabled they should be sent only to required/specific people, not groups. |
Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page.
If you're setting up a new alert/notification, consider the chart above for how you want to alert people. Be mindful not to create new high-priority alerts if they don't require an immediate response, for example.
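As a rough sketch, the chart above amounts to a simple routing rule. This is a hypothetical helper for illustration only, not part of DoIT; the Major and Normal rows are inferred from the examples later in this page:

```python
# Hypothetical sketch of the priority chart above -- not DoIT's actual code.
# Maps each priority to whether it should interrupt a human immediately,
# plus the expected response from the table.
PRIORITY_POLICY = {
    # priority     -> (notify_immediately, expected_response)
    "major":        (True,  "immediate human action, any hour"),
    "normal":       (True,  "human action soon, during working hours"),
    "minor":        (False, "human action at some point"),
    "notification": (False, "informational only, no response required"),
}

def should_wake_someone(priority: str) -> bool:
    """Return True only if the alert warrants interrupting a person right away."""
    notify_now, _expected = PRIORITY_POLICY[priority.lower()]
    return notify_now
```

For example, a Major alert pages immediately, while a Notification never should; anything that would wake someone up must pass this kind of check.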
!!! info "Alert Channels"
Primarily we use email as the notification/alert method, and all of our customers are encouraged to use it. Secondly, there is the DoIT customer portal, which will send alerts to the on-call person(s) and escalate based on SLA/contractual agreements. Thirdly, we use our centralized support telephone number and individual phones. This means keeping an eye on your email is essential!
SMS and Push notifications are in the pipeline for DoIT.
## Examples
#### "An SSL certificate is due to expire in one week."
This would be a **Minor** priority SR, requiring human action some time soon.
![Minor Urgency](../assets/img/screenshots/prio-low.png)

A summary of expectations and helpful information for being on-call.
![Alert Fatigue](../assets/img/misc/alert_fatigue.png)
## What is On-Call?
At Spearhead, being on-call means that you are responsible for monitoring our communications channels and responding to requests at any time. There are two on-call scenarios that you will deal with:
* during your normal work shift
* outside working hours
For example, if you are on-call outside normal working hours, should any alarms be triggered by our monitoring solution or a customer emails our support channel, you will receive a "notification" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken.
You will be expected to gather as much information as possible, create the required cards in our ticketing systems, delegate or assign the card to the right person/watchers and otherwise take whatever actions are necessary in order to resolve the issue.
<!-- At Spearhead Systems we consider you are on-call during normal working hours in which case you are proactively working with [DoIT](http://doit.sphs.ro/) and looking over your assigned cards/boards as well as when you are formally "on-call" and issues are being redirected to you. -->
On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and is the problem that the Spearhead Systems technical support services is trying to fix!
When you are on-call during normal working hours you are the central contact for our entire support team. We expect you will delegate and assign the card to your colleagues and only attempt to resolve issues if your current workload permits.
When you are on-call outside working hours you are expected to handle as much of the process as possible and delegate only if it is outside your area of expertise or you encounter problems that require another colleague's input.
!!! note "When in the office"
Generally speaking, you are on-call during your normal working hours even if you are not *the* on-call engineer. This means you are keeping an eye on the cards assigned to you directly or that you are a watcher for. If you are ever in a position where you have no assigned cards and it is not clear what to work on, ask a TL or senior Sysadmin to help point you in the right direction.
## Responsibilities
1. **Prepare**
* Have your laptop and Internet with you (office, home, a phone with a tethering plan, etc).
* Have a way to charge your phone.
* Team alert escalation happens within 30 minutes, set/stagger your notification timeouts (push, SMS, phone...) accordingly.
* Make sure texts and calls from Spearhead Systems (and directly from colleagues) can bypass your "Do Not Disturb" settings.
* Be prepared (environment is set up, you have remote access tools ready and functional, your credentials are current, you have Java installed, ssh-keys and so on...)
* Read our Issue Response documentation (that's this!) to understand how we handle incidents and service requests, what the different roles and methods of communication are, etc.
* Acknowledge and act on alerts whenever you can (see the first "Not responsibilities" point below)
* Determine the urgency of the problem:
* Is it something that should be worked on right now or escalated into a major incident ("production server on fire" situations, security alerts)? If so, do so.
* Is it some tactical work that doesn't have to happen during the night? (for example, disk utilization high watermark, but there's plenty of space left and the trend is not indicating impending doom) - snooze the issue until a more suitable time (working hours, the next morning...) and get back to fixing it then.
* Check our *internal Chat* for current activity. Often (but not always) actions that could potentially cause alerts will be announced there.
* Does the alert and your initial investigation indicate a general problem or an issue with a specific service that the relevant team should look into? If it does not look like a problem you are the expert for, then escalate to another team member or group.
1. **Fix**
* You are empowered to dive into any problem and act to fix it.
* Involve other team members as necessary: do not hesitate to escalate if you cannot figure out the cause within a reasonable timeframe or if the service / alert is something you have not tackled before.
* If the issue is not very time sensitive and you have other priority work, make a note of this in DoIT to keep a track of it (with an appropriate severity, comment and due date).
1. **Improve**
* If a particular issue keeps happening, or an issue alerts often but turns out to be a preventable non-issue, improving it should become a longer-term task.
* Disks that fill up, logs that should be rotated, noisy alerts...(we use ansible and rundeck, go ahead and start automating!)
* When we perform a DoD (definition of done) this is a good time to bring up recurring issues for discussion.
* If information is difficult / impossible to find, write it down. Constantly refactor and improve our knowledge base and documentation. Add redundant links and pointers if your mental model of the wiki / codebase does not match the way it is currently organized.
1. **Support**
* When your on-call "shift" ends, let the next on-call and team know about issues that have not been resolved yet and other experiences of note.
* Make an effort to cleanly handover necessary information. We use *internal Chat*, email and DoIT to communicate.
* This is a best practice that should be applied whenever sharing details would benefit the efficiency of the team.
* If you are making a change that impacts the schedule (adding / removing yourself, for example), let others know since many of us make arrangements around the on-call schedule well in advance.
* Support each other: when doing activities that might generate plenty of alerts, it is courteous to "place the service/host in maintenance" and take the noise away from the on-call by notifying them and scheduling an override for the duration.
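The "impending doom" judgment call from the triage step above can be sketched as a trend projection. This is a hypothetical check for illustration only, not our actual monitoring logic; the threshold of two days is an assumed value:

```python
# Hypothetical triage helper for a disk-utilization alert -- illustration only.
# Decide whether the alert needs action now or can wait until working hours.

def days_until_full(used_gb: float, total_gb: float, growth_gb_per_day: float) -> float:
    """Project how many days remain before the disk fills at the current trend."""
    if growth_gb_per_day <= 0:
        return float("inf")  # usage flat or shrinking: no deadline
    return (total_gb - used_gb) / growth_gb_per_day

def triage_disk_alert(used_gb: float, total_gb: float,
                      growth_gb_per_day: float,
                      act_now_threshold_days: float = 2.0) -> str:
    """Return 'act now' if the disk will fill within the threshold, else snooze."""
    if days_until_full(used_gb, total_gb, growth_gb_per_day) <= act_now_threshold_days:
        return "act now"
    return "snooze until working hours"
```

A disk at 90% with 10 GB/day growth fills within a day and needs action now; one at 80% growing 1 GB/day can safely wait for the next morning.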
## Not Responsibilities
![Escalation](../assets/img/misc/escalation.png)
* Team leaders (TL) are a part of our normal rotation. It gives a better insight into what has been going on.
* New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also.)
<!-- // we do not yet implement escalation for incidents, not automatically // * Our escalation timeout is set to 5 minutes. This is usually plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.
* Triggering an escalation is done automatically in most situations based on the type, priority and severity of the issue.
* Escalations only happen to incidents! Service Requests must be manually escalated based on customer input -->
* When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary during our morning stand-up is sufficient.
### Notification Method Recommendations
You are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents. If you're not sure how to configure them, the Support team has some recommendations:
![Mobile Alerts](../assets/img/misc/mobile_alerts.png)
<!-- // still working on integration for SMS // * Use Push Notification and Email as your first method of notification. Most of us have phones with us at all times, so this is a prudent first method and is usually sufficient. (DoIT is in the process of integration with SNS for push notifications)
* Use Phone and/or SMS notification each minute after, until the escalation time. If Push didn't work, then it's likely you need something stronger, like a phone call. Keep calling every minute until it's too late. If you don't pick up by the 3rd time, then it's unlikely you are able to respond, and the incident will get escalated away from you. -->
## Etiquette
* If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got called during the night. Cut them some slack and be nice.
* Don't close or otherwise modify a card out from under someone else. If that card is not assigned to you as owner or watcher, then you shouldn't be modifying it. Instead, add a comment with your notes in the monitoring system and in DoIT.
![Acknowledging](../assets/img/misc/ack.png)
* If you are testing something, or performing an action that you know will cause an alert from our monitoring or might be identified as an issue by our customers, it's customary to "place the host/service in downtime" and announce it to all involved parties for the time during which you will be testing. Notify the person on-call so they are aware of your testing.
* "Never hesitate to escalate" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help.
* Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town.
* If an issue comes up during your on-call shift for which you got called, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.
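The "place the host/service in downtime" practice above can be sketched as an alert-suppression rule. This is purely illustrative; the real mechanism depends on the monitoring system in use, and the function names and window format here are invented:

```python
# Hypothetical sketch of downtime-window alert suppression -- illustration only.
# A window is a (start, end) pair of datetimes keyed by host name.
from datetime import datetime

def in_maintenance(host: str, windows: dict, at: datetime) -> bool:
    """True if the host has a scheduled downtime window covering the given time."""
    start, end = windows.get(host, (None, None))
    return start is not None and start <= at < end

def route_alert(host: str, windows: dict, at: datetime) -> str:
    """Suppress alerts for hosts in downtime; otherwise notify the on-call."""
    if in_maintenance(host, windows, at):
        return "suppressed (maintenance)"
    return "notify on-call"
```

The point is that the suppression happens before the on-call is disturbed: an alert from a host inside its scheduled window never reaches them, while everything else pages as usual.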