It would be great if a company could predict every possible incident scenario; there would be no more problems. But even with the best monitoring in the universe, the best alerting in the world, a perfect code review process, the best developers, and solid automated CI/CD (continuous integration, continuous delivery) checks, the best that engineering leadership can do is minimize incidents, not eliminate them entirely.
There will always be some incidents, because the interactions between systems are complex and constantly evolving, and it's impossible to validate everything. People move in and out of a company, employees misunderstand protocols, one change interferes with another: there are many variables. Both code and people are subject to failure.
When it comes to alerts, though, more is not better: having a thousand alerts is effectively the same as having zero. What you want are relevant alerts that genuinely draw attention. For cloud providers, for example, it doesn't make sense to alert every time a hard disk fails; given the number of disks across all their data centers around the world and the likelihood of disk failures, alerts would fire nonstop. Rather than alerting, it is better to invest in an automated system that handles these failures.
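As a rough illustration of that idea, the sketch below records each disk failure, hands it to an automated remediation workflow, and only pages a human if the failure rate over the last hour looks abnormal. This is an assumption about how such a system might be wired together; `remediate_disk` and `page_oncall` are placeholder names, not a real API.

```python
from collections import deque
from time import time

WINDOW_SECONDS = 3600          # look at the last hour
FAILURE_RATE_THRESHOLD = 50    # failures per hour considered abnormal

recent_failures = deque()

def on_disk_failure(disk_id: str) -> None:
    now = time()
    recent_failures.append(now)

    # Drop events that fell out of the time window.
    while recent_failures and recent_failures[0] < now - WINDOW_SECONDS:
        recent_failures.popleft()

    # Routine failures go to automation, not to a human.
    remediate_disk(disk_id)

    # Only an unusual spike is worth a person's attention.
    if len(recent_failures) > FAILURE_RATE_THRESHOLD:
        page_oncall(f"Disk failure rate abnormal: {len(recent_failures)}/hour")

def remediate_disk(disk_id: str) -> None:
    # Placeholder: schedule an automated drain-and-replace workflow.
    print(f"scheduling automated replacement for {disk_id}")

def page_oncall(message: str) -> None:
    # Placeholder: send a human-facing alert.
    print(f"ALERT: {message}")
```

The point is the shape of the design: individual failures feed automation, and only the aggregate signal reaches a person.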
Undoing a change that has caused a problem is not always easy or viable. Developers need to take this into account and build in the ability to roll back automatically, that is, to undo the change in production if problems arise.
If The New York Times decides to overhaul its graphic design, for instance, the newspaper's technology team won't have the chance to test the new look on every combination of device, browser, and operating system on the planet to guarantee that it will work universally. To avoid problems, developers could implement the change so that content can be displayed in both the new and the old design. This is not trivial. Another measure is to put the new design into production without enabling it: the content continues to be rendered in the previous design, with the new one deployed but protected by a switch. This switch can be a flag, such as DESIGN-VERSION, which can be changed from 13 to 14 to change how users view the site's content. If the new version doesn't work, the flag can be reverted to 13 without any major problems.
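A minimal sketch of the DESIGN-VERSION switch might look like the following. The flag store here is just an in-memory dictionary and `get_flag` is a placeholder; in practice it would be a feature-flag service or configuration system, but the shape of the technique is the same: both renderers ship to production, and the flag decides which one runs.

```python
FLAGS = {"DESIGN-VERSION": 13}  # the old design is the default

def get_flag(name: str) -> int:
    # Placeholder for a lookup against a real flag/configuration service.
    return FLAGS[name]

def render_old_design(content: str) -> str:
    return f"<body class='design-13'>{content}</body>"

def render_new_design(content: str) -> str:
    return f"<body class='design-14'>{content}</body>"

def render_page(content: str) -> str:
    # Both code paths are deployed; the flag chooses between them.
    if get_flag("DESIGN-VERSION") >= 14:
        return render_new_design(content)
    return render_old_design(content)

# Rolling back the redesign is a flag change, not a redeploy:
# FLAGS["DESIGN-VERSION"] = 13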
This type of system must be thought out intentionally, which is sometimes difficult because, depending on the change, it can be very hard or impossible to go back to the way things were before.
Health indicators for systems
Ideally, systems have tests that validate how they are running in production, providing health indicators that can be used to build a general health dashboard. When someone in charge sees that the health of the system has been affected by a change, the team can automatically roll back to the previous version and start an investigation to understand what happened.
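One hedged way to sketch that loop: poll a few health indicators after a deployment, compare them to a recorded baseline, and trigger an automatic rollback if the system looks meaningfully worse. The metric names, the degradation factor, and the `fetch_metric` and `rollback_to` helpers are all assumptions standing in for whatever monitoring and deployment tooling is actually in place.

```python
BASELINE = {"error_rate": 0.01, "p99_latency_ms": 250.0}
DEGRADATION_FACTOR = 2.0  # "twice as bad as baseline" counts as unhealthy

def fetch_metric(name: str) -> float:
    ...  # placeholder: query the monitoring system for the current value

def rollback_to(version: str) -> None:
    ...  # placeholder: redeploy the previous version

def is_healthy(current: dict) -> bool:
    # Every indicator must stay within the allowed factor of its baseline.
    return all(
        current[name] <= baseline * DEGRADATION_FACTOR
        for name, baseline in BASELINE.items()
    )

def check_after_deploy(previous_version: str) -> None:
    current = {name: fetch_metric(name) for name in BASELINE}
    if not is_healthy(current):
        # Health degraded after the change: go back first, investigate later.
        rollback_to(previous_version)
```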
In Microsoft's Azure there is a huge combination of server types, network routers, data center designs, and software versions. Imagine the disparity that exists, with more than 60 data center regions spread all over the world and a very high number of software changes per day. Over time, the heterogeneity that accumulates in a cloud server fleet is enormous. When developing any new version of the system, the change management process must be quite rigorous.
Going back to the concept in the book The Black Swan: The Impact of the Highly Improbable, by Nassim Nicholas Taleb (Random House, 2008), it is not possible to predict every possible situation in life or in a technology company. Obviously, you can’t foresee every interaction in a system, but if every little part is tested rationally, the chances of a major problem occurring when the parts are combined are reduced. However, if no tests are implemented, the chance of an unforeseen interaction causing a problem is enormous.
The way forward is to limit the scope of a problem and have agile mechanisms to undo any changes that cause incidents. A good practice for an application, for example, is not to roll out a new feature to everyone at once but to an initial 1% of users, and then get feedback to check that everything is OK. If it is, you move on to 5% of users, then 25%, and so on, until you reach everyone. It's worth crawling at first and, once you're doing well enough, you can walk and run afterward.
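A simple sketch of that staged rollout, under the assumption that each user has a stable identifier: hash the user into a bucket from 0 to 99 and enable the feature only for buckets below the current rollout percentage. Ramping from 1% to 5% to 25% to 100% is then just a configuration change.

```python
import hashlib

ROLLOUT_PERCENT = 1  # start with 1% of users

def user_bucket(user_id: str) -> int:
    # Stable hash so the same user always lands in the same bucket.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def feature_enabled(user_id: str) -> bool:
    return user_bucket(user_id) < ROLLOUT_PERCENT

# As confidence grows, raise ROLLOUT_PERCENT to 5, then 25, then 100.
```

Because the bucketing is deterministic, a user who sees the new feature keeps seeing it as the percentage grows, which keeps the experience consistent during the ramp.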
You can’t foresee every interaction in a system, but if every little part is tested rationally, the chances of a major problem occurring when the parts are combined are reduced.

Marcus Fontoura
Marcus Fontoura is a technical fellow and CTO for Azure Core at Microsoft, and the author of A Platform Mindset. He works on efforts related to large-scale distributed systems, data centers, and engineering productivity. Fontoura has held several roles as an architect and research scientist at big tech companies such as Yahoo! and Google, and was most recently the CTO at Stone, a leading Brazilian fintech.