When designing a system which sends alerts to people, it's important to consider what is effective and what is respectful of people's time. Time is one of our most precious resources.
It is the easiest thing in the world to turn on a slew of alerting conditions and direct the potential onslaught of alerts to an existing distribution list (e.g., an e-mail list or a Slack channel).
But is that sensible?
I would say no. It causes alert fatigue and numbness.
When faced with an onslaught of alerting e-mails, for example, there appear to be five typical ways people respond:
1. Turn the alerts off at the source.
2. Ignore them and delete them manually from the inbox.
3. Ignore them, depending on someone else on the notification list to respond.
4. Ignore them and set up a rule to delete/archive them automatically.
5. Set up e-mail inbox filters/rules to suss out the 'important' alerts.
Most of these approaches waste time or fail to derive any value from the alert. #3 and #5 are the only cases where the alert might result in an action being taken to fix the underlying issue. Even in the case of #5, though, the logic that extracts the value is not easily shared among a team. A carefully tuned set of e-mail inbox filters/rules silos the knowledge of how to interpret the importance of the alerts into one person's e-mail configuration, which is a fragile place for a team's knowledge to live.
Attention Interrupted
When people's eyes are drawn away and their attention is interrupted, when they context switch, they slow down, and time is consumed that will never be returned. Sorry about that.
Every e-mail in an inbox and every chat message that pops up a notification on a desktop takes a little slice of time and focus away from everyone who receives it... even if they subsequently decide to ignore it. It may seem like an inconsequential amount of time when dealing with one alert at a time, but multiply it by the number of people receiving it and the number of times it repeats, and it adds up quickly.
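To put rough, purely illustrative numbers on it: if one alert costs each recipient even two minutes of refocusing, goes to ten people, and fires six times a day, that's 2 × 10 × 6 = 120 minutes of collective attention spent every day on a single alert.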
For this reason, it's critical to tune alerting to avoid creating waste and to be respectful of people's time and attention.
I'm very intentionally using the terminology 'tuning' here because in my experience it's more of a recurring activity to smooth out rough spots, adjust based on learning, and improve efficiencies, as opposed to a one-time configuration you set and forget. It's ok to turn off alerts if they're not providing value. It's ok that an alert is not perfect at first. To make it effective and efficient though, make it WORTHY!
Worthy Alerts
- Are Actionable and Necessary
- Assign Ownership
- Provide appropriate context
- Are tune-able, on an ongoing basis, to a level that respects the receiver
- Let you know only when the alert state has changed
Actionable and Necessary
An actionable alert is directed to a person who can address it and is important enough that action needs to be taken in response.
If the alert is not going to a person who can take action to investigate or fix it, then it would be better to put it on a dashboard, log it, or send basic outage/issue communication based on human judgement.
Alerts should not fire for normal system behavior. If CPU spikes every day at 4am-4:10am and there is no significant adverse effect on doing business, then that should not alert. It would be a waste of human attention and create alert fatigue and numbness to other incoming alerts which may actually be important. Similarly, if it's normal that sales drop on a day when a business is closed, that should not trigger a sales alert.
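As a minimal sketch of that idea (the threshold, the 4am window, and the function name here are all hypothetical, and a real monitoring tool would express this in its own rule language), the point is to encode what you already know is normal into the condition itself rather than paging a human for it:

```python
from datetime import datetime, time

# Hypothetical known-normal window: a nightly batch job spikes CPU from 04:00 to 04:10.
KNOWN_NORMAL_WINDOWS = [(time(4, 0), time(4, 10))]
CPU_THRESHOLD_PERCENT = 90.0  # illustrative value, not a recommendation

def should_alert_on_cpu(cpu_percent: float, now: datetime) -> bool:
    """Fire only when the reading is abnormal AND outside the known-normal windows."""
    in_normal_window = any(start <= now.time() <= end for start, end in KNOWN_NORMAL_WINDOWS)
    return cpu_percent > CPU_THRESHOLD_PERCENT and not in_normal_window
```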
Assign Ownership
Alerts with specific, individual owners are responded to more quickly. When you have a team with more than one member capable of addressing an issue, always assign a first responder automatically to expedite investigation and resolution. Sending an alert to an e-mail distribution list or a Slack channel can increase awareness, but it also fogs up any sense of clarity around who owns fixing that thing until someone claims it. Even if your team is great at rallying and claiming issues aggressively, the interruption has still been created and pulled everyone out of their focused context of other work. There are many great tools that enable this capability with rotating on-call schedules as well as automatic escalation tiers. Assigning an initial owner/responder automatically is not mutually exclusive with enlisting the help of key team members or swarming/mobbing when appropriate. It's a good first step in general.
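A toy sketch of the principle (the rotation and field names are made up; real on-call tools such as PagerDuty or Opsgenie manage schedules, overrides, and escalation tiers for you):

```python
from datetime import date

# Hypothetical rotation. The only point: every alert resolves to exactly one named
# first responder, so nothing lands unowned while a channel full of people waits.
ROTATION = ["alice", "bob", "carol"]

def first_responder(today: date) -> str:
    """Pick today's owner deterministically from the rotation."""
    return ROTATION[today.toordinal() % len(ROTATION)]

def route_alert(alert: dict, today: date) -> dict:
    alert["owner"] = first_responder(today)  # everyone else is FYI, not on the hook
    return alert
```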
Context
An alert needs to have just enough context to know where something may have gone wrong and enable a responder to start digging into the symptom(s) and cause(s). Having a host/device/instance/pod/service/metric name and a short description of the condition that triggered the alert is generally enough since the alert is directed to a person with enough context to know where to dig. If the team size or number of contexts is unwieldy, runbooks can be employed to better share investigative next steps.
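A hypothetical payload along those lines might look like the following; the field names and runbook URL are illustrative only, not any specific tool's schema:

```python
# Just enough context to know where to start digging, not a full diagnosis.
alert = {
    "service": "checkout-api",
    "host": "prod-checkout-03",
    "metric": "http_5xx_rate",
    "condition": "5xx rate above 2% for 10 minutes",
    "runbook": "https://wiki.example.com/runbooks/checkout-5xx",
}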
Resist the urge to make an alert to a person a complete explanation of what went wrong and why. Except for physical hardware replacement, if you find an alert has so much context that it knows exactly what went wrong and how to fix it, then often it's a great candidate to either (a) have that alert trigger an action which automates the healing or (b) fix the root cause so the system never degrades into that bad state again.
Tune-able
An alert that seems like a good idea at first may turn out to be a horrible one, and one of the quickest ways to encourage alert fatigue and numbness is to keep spamming people with alerts that seemed like good ideas at first and never come back to clean them up. It's like a digital form of littering, but an order of magnitude worse because it makes new waste every time it triggers.
Therefore, make sure whatever mechanism you're using to send alerts allows careful and effective tuning of the conditions under which the alert will fire.
It's also best if those people closest to investigating and responding to an alert can do that tuning themselves, over time.
There are 3 levels of progression towards self-service alert tuning utopia here:
- The person who initially defines the alert can easily tune the conditions under which it fires to whatever level is desired. Tuning to the nth degree** means being able to start simple and iterate toward a more and more effective alert, building up the conditions for when it fires so it becomes more and more accurate using what you learn over time.
- The person receiving the alert can influence the definition of the conditions under which the alert fires, based on what they observe to be normal. There is a self-service mechanism to establish snooze/mute periods for known maintenance windows and issues, so that alerts never fire during intended downtime (see the sketch below).
- The receiver of an alert can define, self-service, the conditions and alerting policies for the things they support.
**: Tuning to the nth degree for what is 'normal' typically requires a rich expression language (example), time-series data, and a system for defining routing rules for alerts that is appropriate to an organization's size.
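One way to picture the self-service end state (a minimal sketch, assuming the alert definition lives in data the receiving team can edit and version; the class, fields, and defaults are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple

# The threshold, duration, and mute windows live in a definition the team owns,
# not in one person's inbox rules. Values here are placeholders to tune over time.
@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float                 # adjusted as you learn what 'normal' is
    duration_minutes: int = 10       # must breach for this long before firing
    mute_windows: List[Tuple[datetime, datetime]] = field(default_factory=list)

    def is_muted(self, now: datetime) -> bool:
        """True during known maintenance windows, so intended downtime never pages anyone."""
        return any(start <= now <= end for start, end in self.mute_windows)
```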
State Change Aware or "Don't spam me bro!"
If something breaks, it shouldn't take repeated notifications to get someone's attention. When it does, that's a sign that something else is wrong, either with the alerting pipeline or with the volume of inbound requests in that person's inbox. Alerting systems can group, batch, and set periods for how often to notify people, and when to automatically escalate a firing alert to another person if the intended first responder is unavailable. A good alerting system can also self-resolve a previously fired alert when a system heals on its own or the cause of an issue subsides, letting a team know that something happened while choosing when it makes sense to invest effort investigating the cause.
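The core of that behavior is notifying on state transitions rather than on every evaluation cycle. A minimal sketch (function and state names are illustrative; real systems layer grouping, repeat intervals, and escalation on top):

```python
from typing import Dict, Optional

# Send something only on OK -> FIRING and FIRING -> OK transitions.
_last_state: Dict[str, str] = {}

def evaluate(alert_name: str, is_breaching: bool) -> Optional[str]:
    new_state = "FIRING" if is_breaching else "OK"
    old_state = _last_state.get(alert_name, "OK")
    _last_state[alert_name] = new_state
    if old_state == "OK" and new_state == "FIRING":
        return f"{alert_name}: problem started"       # page the first responder
    if old_state == "FIRING" and new_state == "OK":
        return f"{alert_name}: resolved on its own"   # informational; investigate when it suits
    return None  # no state change, nothing to send
```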
Are your alerts worthy?
Are they respectful of people's time and attention?
Are they actionable and necessary?
Is there a clear owner and first responder?
Do they provide an appropriate amount of context?
Are they tune-able? Even by the receiver?