diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..01b019f Binary files /dev/null and b/.DS_Store differ diff --git a/404.html b/404.html index 192f043..be6f116 100644 --- a/404.html +++ b/404.html @@ -214,7 +214,7 @@
This site documents parts of the Spearhead Systems Issue Response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after.
+This site documents parts of the Spearhead Systems technical support response process. It is a cut-down version of our internal documentation, used at Spearhead Systems for any incident or service request, and to prepare new employees for on-call responsibilities. It provides information not only on preparation but also what to do during and after.
This documentation is complementary to what is available in our existing wiki.
A collection of pages detailing how to efficiently deal with any incident or service request that might arise, along with information on how to go on-call effectively. It provides lessons learned the hard way, along with training material for getting you up to speed quickly.
It is intended for on-call practitioners and those involved in an operational incident or service request response process, or those wishing to enact a formal incident response process. Specifically this is for all of our Technical Support staff.
+It is intended for our technical support staff and customers/partners looking for more details regarding our support process.
As a service provider Spearhead Systems deals with service requests on a daily basis. The reason we exist is to deliver a service which in most cases boils down to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers issues therefore this documentation is a guideline for how we handle these requests. This documentation will allow you give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time.
+As a service provider Spearhead Systems deals with technical support requests on a daily basis. The reason we exist is to deliver our technical support services which boils down to responsind to incidents and service requests. We want to deliver a smooth and seamless experience for resolving our customers issues therefore this documentation is a guideline for how we handle these requests. This documentation will give you a head start on how to deal with issues in a way which leads to the fastest possible recovery time.
Anything from preparing to go on-call, definitions of severities, incident call etiquette, all the way to how to run a post-mortem, providing a post-mortem template and even a security incident response process.
Lots, dig in an help us complete the picture. We can migrate most processes from Sharepoint here.
+Lots, dig in an help us complete the picture. We can migrate most processes from Sharepoint here. We're also looking for experienced operations/support people who are willing to share their experience with us and help us provide a better support service.
This documentation is provided under the Apache License 2.0. In plain English that means you can use and modify this documentation and use it both commercially and for private use. However, you must include any original copyright notices, and the original LICENSE file.
Whether you are a Spearhead Systems customer or not, we want you to have the ability to use this documentation internally at your own company. You can view the source code for all of this documentation on our GitHub account, feel free to fork the repository and use it as a base for your own internal documentation.
+Please also check-out PagerDuty's response documentation which has made our own efforts in documenting our process much easier.
Anyone on the TL on-call schedule. Trainees are typically on the TL Shadow schedule.
+Anyone on the on-call schedule is a TL durin his shift. Trainees are typically on the TL Shadow schedule.
Take a look at our Team Leader training guide.
A Sysadmin is a direct support role for the Team Leader. This is not a shadow where the person just observes, the Sysadmin is expected to perform important tasks during an incident.
It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin helps to support the TL and keep them stay focussed on the incident.
+It's important for the TL to focus on the problem at hand, rather than worrying about documenting the steps or monitoring timers. The Sysadmin helps to support the TL and help them stay focussed on the incident.
The Sysadmin is expected to:
Any Team Leader can act as a Sysadmin. Sysadmins need to be trained as an Team Leader as they may be required to take over command.
@@ -537,7 +538,7 @@ There are several roles in our support team at Spearhead Systems. Certain rolesThe Scribe is expected to:
The first step in any incident response process is to determine what actually constitutes an incident. We have two high level categories for classifying incidents: this is done using "SR" or "IN" defintions with an attached priority of "Minor", "Normal" or "Major". "SR" are "Service requests" initiated by a customer and usually do not constitute a critical issue (there are exceptions) while "IN" are "incidents" which are generally "urgent".
-All of our operational issues are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher severity SR or IN than a lower one (a "Major" priority gets a more intensive response than a "Normal" incident for example).
+The first step in any incident response process is to determine what actually constitutes an incident. We have two high level categories for classifying incidents: these are "SR" or "IN" defintions with an attached priority of "Minor", "Normal" or "Major". "SR" are "Service requests" initiated by a customer and usually do not constitute a critical issue (there are exceptions) while "IN" are "incidents" which are generally "urgent".
+All issues reported to Spearhead are to be classified as either a Service Request or an Incident. Incidents have priority over Service Requests provided that there are no Service Requests with a higher priority. In general you will want to resolve a higher severity SR or IN than a lower one (a "Major" priority gets a more intensive response than a "Normal" incident for example).
Always Assume The Worst
If you are unsure which level an incident is (e.g. not sure if IN is Major or Normal), treat it as the higher one. During an incident is not the time to discuss or litigate severities, just assume the highest and review during a post-mortem.
@@ -459,7 +459,7 @@Information on what to do during a major incident. See our severity level descriptions for what constitutes a major incident.
Documentation
-Always document your activities. Keep a detailed worklog of your actions in DoIT and communicate verbosely on Slack or other channels (email, etc.).
+Always document your activities. Keep a detailed worklog of your actions in DoIT and communicate verbosely in our internal Chat or other channels (email, etc.).
#support | +#support (on MS Teams/internal Chat) | http://response.spearhead.systems | +40728 005 263 | ||
Need an TL? Do !tl page in Slack |
+ Need an TL? Use a Sysadmin! | ||||
For executive summary updates only, join #executive-summary-updates. | @@ -510,15 +510,15 @@
Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer.
+Both IN and SR (incidents, service requests) share the same priorities. The actual response / resolution times vary and are based upon contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page.
If you're setting up a new alert/notification, consider the chart above for how you want to alert people. Be mindful of not creating new high-priority alerts if they don't require an immediate response, for example.
Alert Channels
-Presently we use email as the only notification method. This means keeping an eye on your email is essential! -SMS and Push notifications are in the pipeline for DoIT.
+Primarily we use email as the notification/alert methods and all of our customers are encouraged to use this method. Secondly there is the DoIT customer portal which will send alerts to the on-call person(s) and escalate based on SLA/contractual agreements. Thirdly we use our centralized support telephone number and individual phones. This means keeping an eye on your email is essential!
+SMS and Push notifications are in the pipeline for DoIT.
A summary of expectations and helpful information for being on-call.
At Spearhead being on-call means that you are able to be contacted at any time in order to investigate and fix issues that may arise. There are two on-call scenarios that you will deal with:
+At Spearhead, being on-call means that you are responsible for monitoring our communications channels and responding to requests at any time. There are two on-call scenarios that you will deal with:
For example, if you are on-call outside normal working hours, should any alarms be triggered by our monitoring solution, you will receive a "page" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. You will be expected to take whatever actions are necessary in order to resolve the issue and return your service to a normal state.
-At Spearhead Systems we consider you are on-call during normal working hours in which case you are proactively working with DoIT and looking over your assigned cards/boards as well as when you are formally "on-call" and issues are being redirected to you.
-On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and is the problem that the Spearhead Systems professional services is trying to fix!
+For example, if you are on-call outside normal working hours, should any alarms be triggered by our monitoring solution or a customer emails our support channel, you will receive a "notification" (an alert on your mobile device, email, phone call, or SMS, etc.) giving you details on what has broken. +You will be expected to gather as much information as possible, create the required cards in our ticketing systems, delegate or assign the card to the right person/watchers and otherwise take whatever actions are necessary in order to resolve the issue.
+ + +On-call responsibilities extend beyond normal office hours, and if you are on-call you are expected to be able to respond to issues, even at 2am. This sounds horrible (and it can be), but this is what our customers go through, and is the problem that the Spearhead Systems technical support services is trying to fix!
+When you are on-call during normal working hours you are the central contact for our entire support team. We expect you will delegate and assign the card to your colleagues and only attempt to resolve issues if your current workload permits. +When you are on-call outside working hours you are expected to handle as much of the process as possible and delegate only if it is outside your area of expertise or you encounter problems that require another colleagues input.
+When in the office
+You are generally speaking on-call during your normal working hours even if you are not the on-call engineer. This means you are keeping an eye on the cards assigned to you directly or that you are a watcher for. If you are ever in a position that you have no assigned cards and it is not clear what to work on ask a TL or senior Sysadmin to help point you in the right direction.
+Prepare
Improve
Support
Team leaders (TL) can (and should) be part of your normal rotation. It gives a better insight into what has been going on.
+Team leaders (TL) are a part of our normal rotation. It gives a better insight into what has been going on.
New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also. Just not at the same time).
+New members of the team should shadow your on-call rotation during the first few weeks. They should get all alerts, and should follow along with what you are doing. (All new employees shadow the Support team for one week of on-call, but it's useful to have new team members shadow your team rotations also.).
Our escalation timeout is set to 5 minutes. This is usually plenty of time for someone to acknowledge the incident if they're able to. If they're not able to within 5 minutes, then they're probably not in a good position to respond to the incident anyway.
-When going off-call, you should provide a quick summary to the next on-call about any issues that may come up during their shift. A service has been flapping, an issue is likely to re-occur, etc. If you want to be formal, this can be a written report via email, but generally a verbal summary is sufficient.
-You are free to set up your notification rules as you see fit, to match how you would like to best respond to incidents. If you're not sure how to configure them, the Support team has some recommendations,
-If the current on-call comes into the office at 12pm looking tired, it's not because they're lazy. They probably got paged in the night. Cut them some slack and be nice.
Don't acknowledge an incident out from under someone else. If you didn't get paged for the incident, then you shouldn't be acknowledging it. Add a comment with your notes instead.
+Don't close or otherwise modify a card out from under someone else. If you didn't get that specific card assigned to you as owner or a watcher, then you shouldn't be modifying it. Add a comment with your notes instead in the monitoring system and in DoIT.
If you are testing something, or performing an action that you know will cause a page (notification, alert), it's customary to "take the pager" for the time during which you will be testing. Notify the person on-call that you are taking the pager for the next hour while you test.
+If you are testing something, or performing an action that you know will cause an alert from our monitoring or possibly may be identified as an issue by our customers, it's customary to "place the host/service in downtime" and announce all the involved parties, for the time during which you will be testing. Notify the person on-call so they are aware of your testing.
"Never hesitate to escalate" - Never feel ashamed to rope in someone else if you're not sure how to resolve an issue. Likewise, never look down on someone else if they ask you for help.
@@ -610,7 +616,7 @@Always consider covering an hour or so of someone else's on-call time if they request it and you are able. We all have lives which might get in the way of on-call time, and one day it might be you who needs to swap their on-call time in order to have a night out with your friend from out of town.
If an issue comes up during your on-call shift for which you got paged, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.
+If an issue comes up during your on-call shift for which you got called, you are responsible for resolving it. Even if it takes 3 hours and there's only 1 hour left of your shift. You can hand over to the next on-call if they agree, but you should never assume that's possible.