What is the ultimate alerting strategy to make sure your alerts are meaningful and not just noise?
Production monitoring is critical to the success of your app. We know that. But how can you be sure that the right information is reaching the right people? Automating the monitoring process is only effective when actionable information reaches the right person, and the answer is automatic alerts. Still, there are some elements and guidelines that can help us get the most out of our monitoring, no matter which tools we use.
To help you develop a better workflow, we’ve identified the key benefits your alerts can offer you. Let us check them out.
Timeliness – Know as soon as something bad happens
Our apps and servers are always up and running, and there is a lot going on at any given moment. It is therefore important to be notified of new errors as soon as they are introduced into the system.
Even if you're a big fan of sifting through log files, they only give you a retrospective view of what happened to the app, servers, or users. Some will say that timing is everything, and receiving real-time alerts is critical to your business: we want to fix issues before they have a serious impact on our users or our app.
This is where third-party tools and integrations are valuable, letting us know as soon as something happens. The idea may not sound appealing when an alert goes off at 3:00 in the morning, but you still cannot deny its importance.
When it comes to a production environment every second counts, and you want to know from the moment an error appears.
Context is the key to understanding issues
It is important to know when an error occurred, and the next step is to understand where it occurred. Aleksey Vorona, a senior Java developer at xMatters, told us that for his company, context is the most important component of alerts: "Once you hit an error in the app, you want to get as much information as possible so you can understand it. This context can be the machine the app ran on, the user IDs and the developer who owns the error. The more information you have, the easier it is to understand the issue."
Context is everything. When it comes to alerts, it is the different values and components that will help you understand exactly what happened. For example, it is helpful to know whether a new deployment has introduced new errors, or to receive alerts when the number of logged or uncaught errors exceeds a certain threshold. You will also want to know whether a particular error is new or recurring, and what caused it to appear or reappear.
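To make the "new vs. recurring, above a threshold" idea concrete, here is a minimal sketch of that alerting decision. The fingerprint format and the threshold value are assumptions for illustration, not any specific tool's behavior.

```python
from collections import Counter

# Assumed threshold; in practice you would tune this per service.
ALERT_THRESHOLD = 100

seen_fingerprints = set()

def should_alert(fingerprint: str, counts: Counter) -> bool:
    """Alert on brand-new errors, or on known errors that cross the threshold."""
    counts[fingerprint] += 1
    if fingerprint not in seen_fingerprints:
        seen_fingerprints.add(fingerprint)
        return True  # a new error is always worth a look
    return counts[fingerprint] >= ALERT_THRESHOLD

counts = Counter()
# Hypothetical fingerprint: exception type plus code location.
first = should_alert("NullPointerException@Checkout.pay", counts)   # new error
second = should_alert("NullPointerException@Checkout.pay", counts)  # recurring, below threshold
```

The first occurrence triggers an alert because the error is new; subsequent occurrences stay quiet until the count crosses the threshold, which keeps recurring noise out of your inbox.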
If we go into more detail, there are 5 critical values we want to see in each error:
- What error entered the system?
- Where did it happen inside the code?
- How many times has it occurred, and how urgent is it?
- When was it first seen?
- When did it last occur?
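The five values above can be captured in a small record attached to every alert. This is an illustrative sketch; the field names are assumptions, not any vendor's schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ErrorRecord:
    """The five critical values we want to see for each error."""
    what: str             # which error entered the system
    where: str            # location inside the code
    count: int            # how many times it has occurred (implies urgency)
    first_seen: datetime  # first occurrence
    last_seen: datetime   # most recent occurrence

    @property
    def is_new(self) -> bool:
        # A single occurrence with identical timestamps is a brand-new error.
        return self.count == 1 and self.first_seen == self.last_seen

# Hypothetical example record.
rec = ErrorRecord(
    what="IllegalStateException",
    where="BillingService.charge:42",
    count=1,
    first_seen=datetime(2017, 5, 1, 3, 0),
    last_seen=datetime(2017, 5, 1, 3, 0),
)
```

Carrying all five values in one structure means the person receiving the alert never has to go digging for the basics.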
These were some of the issues we had to deal with here at OverOps, in an attempt to help developers, managers and DevOps teams automate their manual error handling processes. Because each team has its own unique way of dealing with problems, we have created a customizable dashboard where you can quickly see the top 5 values for each error.
OverOps allows you to quickly identify critical errors, understand where they happened within the code and know if they are critical or not.
You need to know when, where, what and how many errors and exceptions happen to understand their importance and urgency.
Root Cause Identification – Why did it happen in the first place?
Now that we're getting automatic alerts with the right context, it's time to figure out why the errors happened in the first place. For most engineering teams, it's time to get into the log files and start looking for that needle in the haystack. That is, if the error was logged in the first place. However, we see that the best-performing teams have a different way of doing things.
Typically, applications throw hundreds of thousands or even millions of errors every day, and it becomes a real challenge to get to their true root cause in a scalable manner without spending days on it. For large companies like Intuit, log searches have not been helpful; Sumit Nagal, chief quality engineer at Intuit, notes that "even if we found the problems in the logs, some of them were not reproducible. Finding, reproducing and solving problems in these areas is a real challenge."
Instead of sifting through logs in an attempt to find critical issues and closing tickets labeled "could not reproduce," Intuit chose to use OverOps. With OverOps, the development team can immediately identify the cause of each exception, along with the variable state that caused it. The company significantly improved the development team's productivity by surfacing the root cause with a single click.
Getting to the root cause, along with the source code and full variable state, will help you understand why errors occurred in the first place.
Communication – Keeping the team in sync
You cannot handle alerts without everyone on the development team being on board, so communication is a key aspect of alerting. First, it is important to assign each alert to the right person. The team should be on the same page, knowing what each member is responsible for and who is working on which part of the app.
Some teams treat this process as less important than it should be, and assign team members to handle alerts only after they fire. This is bad practice, and it is not as effective as some hope.
Imagine the following scenario: it's a Saturday night and the app crashes. Alerts are sent to various people across the company and some team members try to help, but they have never touched this part of the application or the code. You now have 7 team members talking past each other, trying to figure out what needs to be done to resolve the issue.
This was caused by a lack of communication earlier in the project: team members were unaware of who was responsible, what was deployed, or how to handle events when alerts were sent.
Communication is important, and you need to work on improving it as part of the error handling process.
Responsibility – Make sure the right person handles the alert
Continuing with the communication topic from the previous section, an important part of this concept is knowing that the alert reaches the right person, and that he or she is taking care of it. We may know which team member was the last to touch the code before it broke, but is he the one responsible for fixing it now? In our interview with him, Aleksey Vorona noted that it is important for him to know who is responsible for every alert or issue that arises. The person who wrote the code may handle it better than other team members, or another team member may be better equipped to solve it.
The bottom line is that by automating your alerts, you can route exception handling tasks directly to the team member in charge of them. Otherwise, the right people may miss important information and responsibility goes out the window. Such issues can lead to dissatisfied users, performance problems or even a complete crash of servers and systems.
Team members should be alerted to issues they are responsible for maintaining, so it is always clear who is responsible for which tasks.
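One simple way to encode that ownership is a routing table from code areas to owners, with an on-call fallback. The component names and addresses below are hypothetical.

```python
# Hypothetical ownership map: component -> the person responsible for it.
OWNERS = {
    "billing": "alice@example.com",
    "checkout": "bob@example.com",
}

# Fallback when a component has no registered owner.
DEFAULT_ONCALL = "oncall@example.com"

def route_alert(component: str) -> str:
    """Return the address that should receive an alert for this component."""
    return OWNERS.get(component, DEFAULT_ONCALL)
```

Keeping the map in code (or config) under version review also forces the team to agree, ahead of time, on who owns what.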
Process – The alert handling cycle
Your team members communicate and work together, which is great. However, you still need a game plan the team will follow. A good example of a game plan is a deliberate strategy for handling exceptions, instead of addressing each event individually.
Exceptions are a core component of any production environment, and they usually signal a warning that requires attention. When exceptions are misused, they can lead to performance issues and harm the app and its users without your knowledge.
How do you prevent this from happening? One way is to implement an "Inbox Zero" policy for exceptions as the company's game plan. It is a process in which unique exceptions are identified, handled and eventually eliminated as soon as they appear.
We researched how companies address their exceptions and found that some tend to push them off to a "later" date, just like emails. We've found that companies that implement an Inbox Zero policy have a better understanding of how their app works, cleaner log files, and developers who can focus on important new projects. We will cover this in more depth in the next chapter.
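The Inbox Zero cycle described above can be sketched as a tiny triage loop: deduplicate incoming events by fingerprint, keep an entry in the inbox until it is resolved, and reopen resolved errors that reappear (a regression). This is an illustrative sketch of the policy, not any particular product's workflow.

```python
# Inbox of unresolved errors: fingerprint -> occurrence count.
inbox = {}
# Fingerprints the team has already eliminated.
resolved = set()

def ingest(fingerprint: str) -> None:
    """Record an incoming exception event, reopening regressions."""
    if fingerprint in resolved:
        resolved.discard(fingerprint)  # it came back: treat as a regression
    inbox[fingerprint] = inbox.get(fingerprint, 0) + 1

def resolve(fingerprint: str) -> None:
    """Remove an error from the inbox once it has been fixed."""
    inbox.pop(fingerprint, None)
    resolved.add(fingerprint)

# Hypothetical flow: two occurrences of the same error, then a fix.
ingest("timeout@SearchService")
ingest("timeout@SearchService")
resolve("timeout@SearchService")
```

The goal is that the inbox trends toward empty: every fingerprint is either being worked on or has been eliminated, rather than snoozed to "later."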
Find the right game plans for you and implement them as part of a better alert handling process.
Integrations? Yes please!
Handling alerts yourself may work, but it does not scale in the long run. For companies like Comcast, which serves over 23 million X1 XFINITY devices, it's almost impossible to know which alerts are critical and should be addressed as soon as possible. This is where third-party tools and integrations become your best friends.
After integrating OverOps with their automated deployment model, Comcast was able to monitor their application servers. The company deploys a new version of their app on a weekly basis, and OverOps helps them identify the unknown errors that Comcast did not anticipate. Watch John McCann, Senior Product Engineering Director at Comcast Cable, explain how OverOps helps companies automate their deployments.
Integrations can also be helpful within your current alert workflow. For example, Aleksey Vorona of xMatters is working on a unified IT alerting platform and has built an integration with OverOps. The integration gives the company access to critical information, such as the variable state that caused each error, and alerts the right team member.
Use third-party tools and integrations to supercharge your alerts and make them meaningful.
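In practice, integrations like this usually exchange a structured payload over a webhook. Here is a minimal sketch of building such a payload; the URL and field names are assumptions for illustration, not xMatters' or OverOps' actual API.

```python
import json

# Placeholder endpoint; a real integration would use the vendor's documented URL.
WEBHOOK_URL = "https://alerts.example.com/hooks/errors"

def build_alert_payload(error: str, owner: str, variable_state: dict) -> str:
    """Serialize the context an on-call engineer needs into a JSON body."""
    return json.dumps({
        "error": error,
        "assignee": owner,          # the team member who should be notified
        "variable_state": variable_state,  # the state that caused the error
    })

# Hypothetical alert: an error plus its assignee and offending variable state.
body = build_alert_payload(
    "NullPointerException", "alice@example.com", {"orderId": None}
)
```

The body would then be POSTed to `WEBHOOK_URL`; because the variable state travels with the alert, the recipient can start debugging without first hunting through logs.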
Alerts are important, but there is much more to them than just adding them to your implementation. You want to make sure you have information about why they occurred in the first place, how you should take care of them and how you can get the most out of them (as opposed to just knowing that something bad has happened). Automatic alerts are an important part of any monitoring system. We need the right people to know when, where and why things go wrong in production so they can fix them as soon as possible.