Severity Levels

The first step in any incident response process is to determine what actually constitutes an incident. We classify operational issues into two high-level categories, "SR" and "IN", each with an attached priority of "Minor", "Normal" or "Major". An "SR" is a "Service Request" initiated by a customer and usually does not constitute a critical issue (there are exceptions), while an "IN" is an "Incident", which is generally urgent.

All of our operational issues are to be classified as either a Service Request or an Incident. Incidents take priority over Service Requests, provided that there are no Service Requests with a higher priority. In general, resolve a higher-severity SR or IN before a lower one (a "Major" incident gets a more intensive response than a "Normal" incident, for example).
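The ordering rule above can be read as a two-part sort key: compare priority first, and let an Incident win only when priorities are equal. The following is a minimal sketch of that reading, not a description of any real tooling; the `Ticket` class, the `Priority` and `Kind` enums, the numeric values and the `work_order` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Priority names come from the text; the numeric values are an assumption.
    MINOR = 1
    NORMAL = 2
    MAJOR = 3


class Kind(IntEnum):
    # Assumption: an IN outranks an SR only when their priorities are equal.
    SR = 0  # Service Request
    IN = 1  # Incident


@dataclass(frozen=True)
class Ticket:
    ident: str
    kind: Kind
    priority: Priority


def work_order(tickets):
    """Return tickets most urgent first: by priority, then IN before SR."""
    return sorted(tickets, key=lambda t: (t.priority, t.kind), reverse=True)


queue = [
    Ticket("SR-1", Kind.SR, Priority.NORMAL),
    Ticket("IN-1", Kind.IN, Priority.NORMAL),
    Ticket("SR-2", Kind.SR, Priority.MAJOR),
]
ordered = [t.ident for t in work_order(queue)]
```

In this example the Major Service Request is handled first, and the Normal Incident is handled before the Normal Service Request.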

Always Assume The Worst

If you are unsure which level an incident is (e.g. not sure whether an IN is Major or Normal), treat it as the higher one. An ongoing incident is not the time to discuss or litigate severities; assume the highest and review it during a post-mortem.
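As a rough illustration of this rule, a responder who is torn between several levels simply takes the highest of them. The sketch below assumes a hypothetical `Priority` enum and `assumed_priority` helper; the fallback to Major when no candidate is given is also an assumption, in the spirit of the heading above.

```python
from enum import IntEnum


class Priority(IntEnum):
    # Priority names come from the text; the numeric values are an assumption.
    MINOR = 1
    NORMAL = 2
    MAJOR = 3


def assumed_priority(candidates):
    """Pick the highest priority among the levels the responder is unsure between."""
    if not candidates:
        # Assumption: with no information at all, fall back to the highest level.
        return Priority.MAJOR
    return max(candidates)
```

For example, `assumed_priority([Priority.NORMAL, Priority.MAJOR])` yields `Priority.MAJOR`; the actual level is then settled in the post-mortem.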

Severity: Major
Description:
  • The system is in a critical state and is actively impacting a large number of customers.
  • Functionality has been severely impaired for a long time, breaking SLA.
  • A customer-data-exposing security vulnerability has come to our attention.
What To Do:
  • See During an Incident.

Severity: Normal
Description:
  • Functionality of the virtualization platform is severely impaired.
  • E-mail system is offline.
What To Do:
  • See During an Incident.

Anything above this line is considered a "Major Incident". These are generally Incidents (IN). Below are Service Requests (SR), which are usually initiated by a human who can help with prioritizing. A call is triggered for all major incidents, regardless of whether they are an SR or an IN.

Severity: Normal
Description:
  • Partial loss of functionality, only affecting a minority of customers.
  • Something that is likely to become Major if nothing is done.
  • No redundancy in a service (failure of 1 more node will cause an outage).
What To Do:
  • Work on the issue as your top priority.
  • Liaise with engineers of affected systems to identify the cause.
  • If related to a recent deployment, roll back.
  • Monitor status and notice if/when it escalates.
  • Mention on Slack if you think it has the potential to escalate.

Severity: Normal
Description:
  • Performance issues (delays, etc.). Tasks that require non-immediate attention.
  • Job failure (not impacting alerting).
What To Do:
  • Work on the issue as your first priority (above "Low" tasks).
  • Monitor status and notice if/when it escalates.

Severity: Low
Description:
  • Normal issues which aren't impacting system use, cosmetic issues, etc.
What To Do:
  • Create a DoIT card and assign it to the owner of the affected system.

Be Specific

When creating cards in DoIT, be as specific as possible and include all necessary details, such as when the issue started and what may have triggered it. Document your efforts through worklogs and be specific there as well.