We manage how we get alerted based on factors such as the customer's contractual SLA and the urgency of their request or incident. **An alert or notification is something that requires a human to perform an action.** Based on the severity of the issue (service request or incident), we prioritize accordingly in [DoIT](http://doit.sphs.ro).
!!! warning "Major Priority Alerts"
    Anything that wakes up a human in the middle of the night should be **immediately human actionable**. If it is not, we need to adjust the alert so that it does not page at those times.
| Priority | Alerts | Response |
| -------- | ------ | -------- |
| Major | Major-Priority Spearhead Alert 24/7/365. | Requires **immediate human action**. |
| Normal | Normal-Priority Alert during **business hours only**. | Requires human action that same working day. |
| Minor | Minor-Priority Alert 24/7/365. | Requires human action at some point. |
| Notification | Suppressed Events. No response required. | Informational only. We do not need these to clutter our ticketing system or inboxes. If they are enabled, they should be sent only to the specific people who need them, not to groups. |
Incidents (IN) and service requests (SR) share the same priorities. The actual response and resolution times vary and are based on contractual agreements with the customer. These details (SLA) are available in DoIT on the organization page of the respective customer.
If you're setting up a new alert or notification, use the table above to decide how you want to alert people. For example, avoid creating new Major-priority alerts for issues that don't require an immediate response.
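As a rough illustration of the priority table above, the following Python sketch encodes each priority's delivery window and checks whether an alert would be sent at a given time. It is not how DoIT is implemented: the names `Priority`, `POLICIES`, `is_business_hours`, and `should_send` are hypothetical, and the Monday to Friday, 09:00 to 18:00 business hours are an assumption, since the real hours depend on the customer's contract.

```python
from dataclasses import dataclass
from datetime import datetime, time
from enum import Enum


class Priority(Enum):
    MAJOR = "Major"
    NORMAL = "Normal"
    MINOR = "Minor"
    NOTIFICATION = "Notification"


@dataclass(frozen=True)
class Policy:
    window: str    # "always", "business_hours", or "suppressed"
    response: str  # expected human response, taken from the table


# Assumption: business hours are Monday-Friday, 09:00-18:00 local time.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)

POLICIES = {
    Priority.MAJOR: Policy("always", "immediate human action"),
    Priority.NORMAL: Policy("business_hours", "human action that same working day"),
    Priority.MINOR: Policy("always", "human action at some point"),
    # Notifications are treated as suppressed by default in this sketch.
    Priority.NOTIFICATION: Policy("suppressed", "no response required"),
}


def is_business_hours(when: datetime) -> bool:
    """Return True if `when` falls inside the assumed business hours."""
    return when.weekday() < 5 and BUSINESS_START <= when.time() < BUSINESS_END


def should_send(priority: Priority, when: datetime) -> bool:
    """Return True if an alert of this priority would be delivered at `when`."""
    window = POLICIES[priority].window
    if window == "always":
        return True
    if window == "business_hours":
        return is_business_hours(when)
    return False  # suppressed


if __name__ == "__main__":
    saturday_night = datetime(2024, 1, 6, 3, 0)
    for priority in Priority:
        sent = should_send(priority, saturday_night)
        print(f"{priority.value}: sent={sent}, expects {POLICIES[priority].response}")
```

Running a candidate alert through a check like this before enabling it makes it easier to see whether it could reach someone in the middle of the night, and whether the expected response justifies that.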
!!! info "Alert Channels"
    Presently we use email as the only notification method. This means keeping an eye on your email is essential!
    SMS and push notifications are in the pipeline for DoIT.
## Examples
#### "Production service is failing for 75% of requests, automation is unable to resolve."
This would be a **Major** priority IN, requiring immediate human action to resolve.
![Major Urgency](../assets/img/screenshots/prio-high.png)
#### A customer sends an email stating: "Production server disk space is filling, expected to be full in 48 hours. Log rotation is insufficient to resolve."
This would be a **Normal** priority SR, requiring human action soon, but not immediately.
![Normal Urgency](../assets/img/screenshots/prio-norm.png)
#### "An SSL certificate is due to expire in one week."
This would be a **Minor** priority SR, requiring human action some time soon.
![Minor Urgency](../assets/img/screenshots/prio-low.png)