Managing Software System Alerts
Blaring sirens. Whooping car alarms. Crying newborns. Be it noise, lights or colour, everyday life is full of alerts to which we need to react. People learn heuristics to guide them through the how-tos of everyday life. These rules are normally generated through reaction to the piercing sirens and learning from the resulting successes and failures. However, one must also learn to discern between showstopping signs, indicative warnings and false alarms.
Just like real life, supporting software systems requires reacting to alerts and system status at key points in the business day. Yet the common divide between developers and support often leads to poor alerting capabilities. Developer empathy is often lost when considering which exceptional situations to flag.
Developers have previously discussed alerting considerations and their impact on support. Nevertheless, it is only when engineers experience the problem themselves that they stand up and take notice. During my current sabbatical, our warning mechanisms have made it difficult to separate the alerts that I won’t react to from the information that will help me keep up to date. To this end, I reflect on the state of our system sirens, and attempt to identify some good and bad practices in our approaches.
Cum On Feel the Noize
To understand the solutions presented, we must first understand the problem. There is indeed a selfish motivation at play here. Regardless of the origin, it has definitely shone a spotlight on an existing issue that until now I was happy to endure.
Validating the state of some of our systems, be it production or non-production, is a minefield. A myriad of dashboards, emails and custom alerts bombard engineers and support teams alike. Infrastructure mayday emails in particular flood our inboxes. For me on sabbatical, despite reviewing my mailbox filters ahead of my leave, I find navigating the flood to find the useful content exceptionally challenging. Testing environment volumes are often exceptionally high. Despite focusing on only the primary communication groups, I am continually swiping left to clear the clutter.
I Got the Message
While my situation is rather specific, and arguably self-inflicted, it highlights several challenges facing the development and support teams. The first relates to the level of context switching this demands. Programmers find themselves switching more often between email and IDEs. Support teams shift between monitoring dashboards, emails and other applications as they process not only alerts, but ongoing status snapshots. While it might be acceptable for leadership to live in Outlook, we are fostering a requirement for developers to live there too.
We should also consider the relevance of each of these signals. While it is perfectly acceptable for me not to react to these signs, should our engineers be ignoring them? Different behaviour is expected for alerts originating from production and from non-production environments. Nevertheless, some warnings are consistently ignored due to known issues or low thresholds. Instilling a culture of reacting selectively to alerts desensitises developers.
One (Clear Channel Stripped)
Given our problem statement, what solutions can help alleviate the current burden? One cause is that communication and alerting channels have been combined, which is why I am filtering through these groups to find the juicy content. A trainer once explained that forums such as emails and social media feeds exhibit similar traits to gambling. I am definitely guilty of searching for the jackpot, while clearing the junk!
One could argue that separate conduits allow for teams to ignore warnings altogether. Team culture should be evaluated to determine if this is a true worry. Regardless, in addressing the volumes for non-production environments, this is definitely a useful technique to utilise.
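To make the idea of separate conduits concrete, here is a minimal sketch of routing messages by origin and environment. It is not how our systems work today. The channel names, the `Message` fields and the "human"/"system" labels are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical channel names; substitute whatever destinations your team uses.
CHANNELS = {
    "conversation": "team-chat",
    "production": "prod-alerts",
    "non-production": "test-env-alerts",
}


@dataclass(frozen=True)
class Message:
    source: str       # "human" or "system" (assumed labels)
    environment: str  # e.g. "production", "uat", "dev"
    text: str


def route(message: Message) -> str:
    """Return the channel a message belongs in, keeping chat and alerts apart."""
    if message.source == "human":
        return CHANNELS["conversation"]
    if message.environment == "production":
        return CHANNELS["production"]
    # Everything else is treated as non-production noise.
    return CHANNELS["non-production"]
```

With a split like this, someone on leave could follow the conversation channel alone and mute the rest, while support keeps a dedicated view of production.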
Of course, adoption of on-demand testing environments would be a far more effective solution. There are many additional benefits aside from reducing the number of SOS messages sent. Unstable environments can be wiped clean with a single click. Infrastructure costs can be reduced as environments are only available when required. Testing coordination becomes a thing of the past as multiple different environments can be brought up for differing purposes. This is but a distant dream for us currently, but one cannot ignore its place in our alerting allegory.
Message in a Bottle
Another rather troubling issue is how developers and infrastructure analysts decide which events warrant alerts and warnings. My experiences show that groups gravitate towards opposing extremes. In the majority of cases, when to raise alerts is only considered once it is too late and the offending feature is live and misbehaving. However, when the need to send a software smoke signal is realised, the volume is not considered. This leads to the opposing problem of flooding subscribers. A new issue emerges with the shared alerting infrastructure that many larger organisations possess. No one wants to be the software engineer who inadvertently initiated a distributed denial of service attack on another application!
Do we really need all of these messages? Of course not! Many are regularly ignored. Developers will instantaneously jump to remedy one issue, but leave another to fester.
The problem with instilling a psychological mindset of different reactions to different alerts is the need to learn them. You are enforcing a set of heuristics to govern your application. Regardless of whether these unwritten rules depend on alert type, infrastructure or environment, if the reaction required differs you are increasing the learning overhead for application support. Recurring sirens should be evaluated to determine why they keep appearing. The most common cause for us is low message thresholds, which should be reviewed more regularly. Here we should either be increasing the limit, evaluating our infrastructure size, or both.
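One way to consider volume up front is to suppress repeats of the same alert within a time window. The sketch below is a simplified illustration rather than a description of any particular alerting product. The `AlertThrottle` class, its method names and the alert keys are hypothetical, and timestamps are passed in as plain seconds to keep the example deterministic.

```python
from typing import Dict


class AlertThrottle:
    """Suppress repeats of the same alert key within a time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_sent: Dict[str, float] = {}
        # Count of suppressed repeats per key, useful for a later summary.
        self.suppressed: Dict[str, int] = {}

    def should_send(self, key: str, now: float) -> bool:
        """Return True if an alert for this key should go out at time `now`."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        self._last_sent[key] = now
        return True


throttle = AlertThrottle(window_seconds=300)
first = throttle.should_send("queue-depth", now=0)     # sent
repeat = throttle.should_send("queue-depth", now=60)   # suppressed, inside window
later = throttle.should_send("queue-depth", now=301)   # sent, window has passed
```

Keeping a count of suppressed repeats means the information is not lost. It can be rolled into a single summary instead of a flood of identical messages.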
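Reviewing recurring sirens can start with something as simple as counting them. The sketch below assumes the alert history is available as a list of alert names, which is an assumption about your tooling. The function name and the sample names are illustrative only.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_alerts(alert_names: Iterable[str], min_count: int) -> List[Tuple[str, int]]:
    """Return alerts seen at least `min_count` times, most frequent first."""
    counts = Counter(alert_names)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hypothetical history of alert names from one review period.
history = ["queue-depth", "disk", "queue-depth", "queue-depth", "disk", "cpu"]
candidates = recurring_alerts(history, min_count=2)
```

Anything on the resulting list is a candidate for a threshold review, an infrastructure sizing discussion, or both.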
Much like the different sounding cries of a newborn, the instinct must be to immediately address the issue. The fix can certainly differ. Nevertheless, one should never question the need to react.
Yet another cause boils down to a poor choice of signalling medium. Emails are not the best alerting mechanism. Be it a small startup or a large organisation, there are a myriad of different communication tools available to teams today. Yet email is seen as old reliable. It is the most overused tool of any organisation. Emails promote confusion. They cultivate a reactionary culture where people rarely speak. They encourage context switching. There are valid scenarios for email usage. However, I would argue that system monitoring is not one such use case.
Centralised diagnostic tools should provide the ability to capture current state, potentially using a traffic light system. The use of a RAG (red, amber, green) status allows us to beat the psychology of ignorance and be clear on which alerts require action. There are cultural considerations when using colour and iconography. However, the premise is that systems should be monitored in one place.
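A traffic light roll-up can be sketched in a few lines. In this illustration the overall status is simply the worst status of any component. The `Rag` enum, the component names and the choice to report green when nothing is registered are all assumptions, and a real dashboard may well treat missing data as amber.

```python
from enum import IntEnum
from typing import Mapping


class Rag(IntEnum):
    """Traffic light statuses, ordered so that a higher value is worse."""
    GREEN = 0
    AMBER = 1
    RED = 2


def overall_status(components: Mapping[str, Rag]) -> Rag:
    """Roll component statuses up to the worst one.

    Assumption: an empty mapping reports GREEN.
    """
    return max(components.values(), default=Rag.GREEN)


# Hypothetical components of one system.
status = overall_status({"api": Rag.GREEN, "queue": Rag.AMBER, "batch": Rag.GREEN})
```

The point of a single roll-up is that nobody has to learn which individual warnings can be ignored. Anything other than green calls for a reaction.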
Much like we need to contemplate our communication channels, the medium through which application alerts are communicated must also be cautiously deliberated. It must be accessible via any dashboard capabilities that are made available to teams, possibly as some kind of news feed. One tool is required to communicate the news of the world for your systems.
Last Train to Clarksville
Yes, the suggestions listed in this article stem from a very selfish place. The need to optimise my email monitoring is not the key takeaway. The goal is to convey that programmers and support analysts alike are often the forgotten stakeholders in many applications. Supportability should not be the last resort.
Next time you write a feature, be mindful of what happens when it goes wrong. Who needs to know? How should this be communicated? How many times should the same type of signal be sent? Having business users notify the teams of a problem should be the absolute last resort. The exception rather than the rule.
With the many interruptions I encountered while writing this piece, I appreciate your reads!