Alerting Principles
We manage how we get alerted based on many factors, such as the customer's contractual SLA and the urgency of their request or incident. An alert or notification is something which requires a human to perform an action. Based on the severity of the issue (service request or incident), we prioritize accordingly in DoIT.
Major Priority Alerts
Anything that wakes up a human in the middle of the night should be immediately human actionable. If it is not, then we need to adjust the alert so that it does not page at those times.
Priority | Alerts | Response |
---|---|---|
Major | Major-Priority Spearhead Alert 24/7/365. | Requires immediate human action. |
Normal | Normal-Priority Spearhead Alert during business hours only. | Requires human action that same working day. |
Minor | Minor-Priority Spearhead Alert 24/7/365. | Requires human action at some point. |
Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter our ticketing or inboxes. If they are enabled, they should be sent only to required/specific people, not groups. |
Incidents (IN) and service requests (SR) share the same priorities. The actual response and resolution times vary and are based on contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer.
If you're setting up a new alert/notification, consider the chart above for how you want to alert people. Be mindful of not creating new high-priority alerts if they don't require an immediate response, for example.
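As a rough illustration of how the chart above could be applied when deciding whether a new alert should fire at a given time, here is a minimal Python sketch. It is not how DoIT or Spearhead implement this; the names `Priority`, `AlertPolicy`, `POLICIES`, and `should_alert` are hypothetical, and the business hours (Monday to Friday, 09:00 to 17:00) are an assumption, since the actual hours are not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import Enum


class Priority(Enum):
    MAJOR = "Major"
    NORMAL = "Normal"
    MINOR = "Minor"
    NOTIFICATION = "Notification"


@dataclass(frozen=True)
class AlertPolicy:
    raises_alert: bool         # False means the event is suppressed
    business_hours_only: bool  # True means no alert outside business hours
    response: str              # expected response, taken from the chart


# One entry per row of the priority chart.
POLICIES = {
    Priority.MAJOR: AlertPolicy(True, False, "Requires immediate human action."),
    Priority.NORMAL: AlertPolicy(True, True, "Requires human action that same working day."),
    Priority.MINOR: AlertPolicy(True, False, "Requires human action at some point."),
    Priority.NOTIFICATION: AlertPolicy(False, False, "Informational only."),
}

# ASSUMPTION: business hours are Mon-Fri, 09:00-17:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(when: datetime) -> bool:
    """Return True if `when` falls on a weekday within the assumed hours."""
    return when.weekday() < 5 and BUSINESS_START <= when.time() < BUSINESS_END


def should_alert(priority: Priority, when: datetime) -> bool:
    """Decide whether an event of this priority raises an alert at `when`."""
    policy = POLICIES[priority]
    if not policy.raises_alert:
        return False
    if policy.business_hours_only:
        return in_business_hours(when)
    return True
```

For example, `should_alert(Priority.NORMAL, datetime(2024, 1, 6, 3, 0))` evaluates a Normal-priority event at 03:00 on a Saturday and returns `False`, while the same call with `Priority.MAJOR` returns `True`. Keeping the chart in a single lookup table makes it easy to see at a glance which priorities can page someone out of hours.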
Alert Channels
Email is currently our only notification method. This means keeping an eye on your email is essential! SMS and push notifications are in the pipeline for DoIT.
Examples
"Production service is failing for 75% of requests, automation is unable to resolve."
This would be a Major priority IN, requiring immediate human action to resolve.
A customer sends an email stating: "Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve."
This would be a Normal priority SR, requiring human action soon, but not immediately.
"An SSL certificate is due to expire in one week."
This would be a Minor priority SR, requiring human action some time soon.