We manage how we get alerted based on many factors, such as the customer's contractual SLA and the urgency of their request or incident. An alert or notification is something that requires a human to perform an action. We prioritize each issue (service request or incident) in DoIT according to its severity.

!!! warning "Major Priority Alerts"
    Anything that wakes up a human in the middle of the night should be immediately human actionable. If it is not, we need to adjust the alert so that it does not bother us at those times.

| Priority | Alerts | Response |
|----------|--------|----------|
| Major | Major-Priority Spearhead Alert 24/7/365. | Requires immediate human action. |
| Normal | Normal-Priority Alert during business hours only. | Requires human action that same working day. |
| Minor | Minor-Priority Alert 24/7/365. | Requires human action at some point. |
| Notification | Suppressed Events. | No response required. Informational only. We do not need these to clutter our ticketing or inboxes. If they are enabled, they should be sent only to required/specific people, not groups. |
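
The priorities above can also be written down as data, which makes it easier to check whether a given alert should reach a human at a given time. The following is a minimal sketch, not how DoIT implements it: the names `Schedule`, `AlertPolicy`, `POLICIES`, and `alerts_now` are hypothetical, and the business hours (Monday to Friday, 09:00 to 17:00 local time) are an assumption, since this page does not define them.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import Enum


class Schedule(Enum):
    ALWAYS = "24/7/365"
    BUSINESS_HOURS = "business hours only"
    SUPPRESSED = "suppressed"


@dataclass(frozen=True)
class AlertPolicy:
    schedule: Schedule  # when the alert is allowed to reach a human
    response: str       # expected response, worded as in the priority chart


# One entry per row of the priority chart.
POLICIES = {
    "Major": AlertPolicy(Schedule.ALWAYS, "Requires immediate human action."),
    "Normal": AlertPolicy(Schedule.BUSINESS_HOURS, "Requires human action that same working day."),
    "Minor": AlertPolicy(Schedule.ALWAYS, "Requires human action at some point."),
    "Notification": AlertPolicy(Schedule.SUPPRESSED, "No response required."),
}

# Assumption: business hours are Monday-Friday, 09:00-17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(moment: datetime) -> bool:
    """Return True if `moment` falls inside the assumed business hours."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and BUSINESS_START <= moment.time() < BUSINESS_END


def alerts_now(priority: str, moment: datetime) -> bool:
    """Return True if an alert of this priority should reach a human at `moment`."""
    schedule = POLICIES[priority].schedule
    if schedule is Schedule.SUPPRESSED:
        return False  # informational only, never alerts anyone
    if schedule is Schedule.ALWAYS:
        return True
    return in_business_hours(moment)
```

For instance, under these assumed hours `alerts_now("Normal", datetime(2024, 1, 6, 3, 0))` (a Saturday at 03:00) returns `False`, while the same call with `"Major"` returns `True`.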

Incidents (IN) and service requests (SR) share the same priorities. The actual response and resolution times vary and are based on the contractual agreement with the customer. These SLA details are available in DoIT on the organization page.

If you're setting up a new alert/notification, use the chart above to decide how you want to alert people. For example, avoid creating new Major-priority alerts for issues that don't require an immediate response.
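
As a rough aid when setting up a new alert, the decision in the chart above can be sketched as a short function. This is only an illustration: `choose_priority` and its three yes/no parameters are hypothetical names, not part of DoIT, and real cases may need judgment that a few booleans cannot capture.

```python
def choose_priority(needs_human_action: bool,
                    needs_immediate_action: bool,
                    needs_action_same_working_day: bool) -> str:
    """Map three yes/no questions about a new alert to a priority name."""
    if not needs_human_action:
        # Informational only: should not clutter ticketing or inboxes.
        return "Notification"
    if needs_immediate_action:
        # Worth waking someone up for, at any hour.
        return "Major"
    if needs_action_same_working_day:
        # Handled during business hours, the same working day.
        return "Normal"
    # A human must act at some point, but there is no urgency.
    return "Minor"


# A production service failing for most requests, with automation unable to help:
outage = choose_priority(True, True, True)

# An SSL certificate that expires in one week:
certificate = choose_priority(True, False, False)
```

Checking the immediate case first means an alert is only ever raised to Major when someone has explicitly stated that it cannot wait.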

!!! info "Alert Channels"
    Email is our primary notification/alert method, and all of our customers are encouraged to use it. Second is the DoIT customer portal, which sends alerts to the on-call person(s) and escalates based on SLA/contractual agreements. Third, we use our centralized support telephone number and individual phones. This means keeping an eye on your email is essential!

SMS and push notifications are in the pipeline for DoIT.

Examples

"Production service is failing for 75% of requests, automation is unable to resolve."

This would be a Major priority IN, requiring immediate human action to resolve.

Major Urgency

A customer sends an email stating that "Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve."

This would be a Normal priority SR, requiring human action soon, but not immediately.

Normal Urgency

"An SSL certificate is due to expire in one week."

This would be a Minor priority SR, requiring human action some time soon.

Minor Urgency