If the CEO of any company publicly announces that she or he is going to bring in a new CTO to finally put an end to technology incidents, you should know right away that the money spent on that hire will be wasted: the expectation is so unrealistic that even the best CTO in Silicon Valley couldn’t live up to it. Not only are incidents bound to happen, but they are also welcome to a certain extent. CTOs want to have incidents! Teams that have never had to face incidents won’t ever be prepared to deal with them when they inevitably arise. That doesn’t mean, however, that you should sit back and wait for them to happen. You must work actively to prevent them and to contain their consequences.
You need teams that are well equipped to deal with incidents, automated tools and processes that minimize them, and clear leadership guidelines so that incidents become, in the end, learning opportunities in the present and drivers of structural actions in the short, medium, and long term.
Post-mortem meetings are decisive for understanding what happened during an incident and for finding the root causes of the problem. Just as important is a blameless corporate culture, which creates a work environment in which everyone feels safe to make mistakes (and to admit to them) and can carry out their tasks with the responsibility and boldness needed to innovate, something many authors refer to as fearless execution.
Incidents are to be expected, and no one should be reprimanded just for having been involved in one. A good CTO, then, is the one who, applying a platform mindset, establishes a culture that values collaborative work and plans a system of technical checks and balances to deal with incidents, not the one who promises to put an end to them.
Incidents are a part of life (and technology)
In The Black Swan: The Impact of the Highly Improbable (Random House, 2008), Nassim Nicholas Taleb argues that the unexpected is an integral part of life and that we can, therefore, take advantage of it. The book’s title refers to the fact that we can’t assume black swans don’t exist just because we’ve never seen one, a mistake that has been made before in human history.
This kind of thinking is also fundamental to the analysis of computer systems. We cannot believe that there are no bugs and that a system is simply correct. We must assume that there are bugs and they just haven’t manifested themselves in production yet.
Incidents often occur because some part of the system is changed, leading to a different code execution path and triggering the problem: the manifestation of a bug that had been sitting there for a while. A principle of software engineering is that if a piece of code has not been tested, it has bugs. In large-scale systems it is impossible to test every interaction, which means that dormant bugs exist and will eventually surface.
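To make the idea of a dormant bug concrete, here is a minimal, purely hypothetical Python sketch (the function and config keys are invented for illustration): the code works for every path the existing tests exercise, and the bug only surfaces when a change routes execution through the untested branch.

```python
def parse_retry_limit(config: dict) -> int:
    """Read the retry limit from a config dict, defaulting to 3."""
    value = config.get("retry_limit", 3)
    if isinstance(value, str):
        # Dormant bug: this branch crashes on values like "unlimited" the
        # moment someone edits the config by hand and supplies a
        # non-numeric string.
        return int(value)
    return value

# The existing tests only cover integer values, so the bug sits dormant
# until a configuration change exercises the string path.
assert parse_retry_limit({"retry_limit": 5}) == 5
assert parse_retry_limit({}) == 3
```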
The culture of testing is extremely important, but it’s impossible to predict every situation. There isn’t a single company in the world without incidents, because they are part of the game, and of life, as Taleb preaches. You may be sure that it always takes 10 minutes to get your child from home to school, based on the hundreds of journeys you’ve made on the same route, but one day it will take you 50 minutes because a tree fell in the middle of the road and stopped all traffic in the area. Events like this happen, and there’s no way of predicting them.
That’s also the idea behind incidents: problems arise without anyone anticipating them. Engineering processes are designed to minimize incidents and should be rigorous enough to reduce their scale. The worst-case scenario is ignoring an incident and allowing it to manifest again. At that point it is no longer an incident but a recurring problem that becomes technical debt.
Engineering leaders don’t want recurring problems, only one-off incidents, which happen from time to time. When an incident is identified, short-, medium-, and long-term structural actions must be set in motion so that it doesn’t happen again. We must assimilate the logic of the black swan and prepare ourselves to mitigate incidents quickly when they happen, minimizing the impact on our customers.
Google “horses”
Brazilian computer engineer Luiz André Barroso used to tell the story (whether it’s true or legend, who knows) of a large-scale incident in which one of Google’s data centers became totally disconnected. When people investigated the cause of the service interruption, they discovered that it was…a horse.
Somewhere near a Google data center in South America, a horse died, and a deep hole was dug to bury it. In the process of digging, the diggers hit the underground network cables connecting the data center to Google’s network, taking it completely offline.
What can we learn from this story? Could the horse incident have been avoided? Probably not, but once it was discovered that this possibility existed, it became necessary to ask whether cables should be installed even deeper, or encased in some kind of cut-proof metal. You can’t expect Google to dig up all its existing data centers around the world to bury cables deeper or sheathe them in metal. How much would those structural actions cost? Millions of dollars, and it would be an endless job. If, on the other hand, the cables at the next data centers were already built to be resilient to horse burials, the incident would have yielded a far more feasible structural plan.
There are also immediate actions that can be taken after an incident, for example when its cause was insufficient testing. Reinforcing the testing culture and integrating it as much as possible into the tooling is a fundamental measure for avoiding bugs.
Imagine, in an almost cartoonish example, a mistake that causes a system to interpret a bank deposit as a subtraction from someone’s current account rather than an addition of money. The person deposits 200 dollars and finds they are 200 dollars poorer. It would be a terrible situation in every sense, and it’s always worse when a fault in the system is noticed by the customer before it’s detected by the company.
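As an illustration only (the class and test below are hypothetical, not code from any real banking system), this is the kind of sign error the example describes, and the kind of unit test that would catch it long before a customer does:

```python
class Account:
    """Toy account used only to illustrate the sign-error bug."""

    def __init__(self, balance: float = 0.0):
        self.balance = balance

    def deposit(self, amount: float) -> None:
        # The buggy version would do `self.balance -= amount`,
        # treating a deposit as a subtraction.
        self.balance += amount


def test_deposit_increases_balance():
    account = Account(balance=50.0)
    account.deposit(200.0)
    assert account.balance == 250.0  # fails immediately if the sign is wrong
```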
Companies need automatic alerts and monitoring systems to detect incidents before they become public knowledge. No organization wants to have to rely on customers notifying them of failures, as in the absurd hypothetical example above. Once internal alerts flag an incident, companies need a structured incident management process to contain it.
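Here is a toy sketch of the idea, not any specific monitoring product (the function names and the 5% threshold are invented for illustration): watch an error-rate metric over a sliding window and raise an alert as soon as it crosses a threshold, so the company learns about the failure before its customers do.

```python
def error_rate(window: list[bool]) -> float:
    """Fraction of failed requests in a sliding window (True = failure)."""
    return sum(window) / len(window) if window else 0.0


def check_and_alert(window: list[bool], threshold: float = 0.05) -> None:
    """Compare the current error rate against a threshold and alert on breach."""
    rate = error_rate(window)
    if rate > threshold:
        # In a real system this would page the on-call rotation and open an
        # incident; here we just print the alert.
        print(f"ALERT: error rate {rate:.1%} exceeds the {threshold:.0%} threshold")


check_and_alert([False] * 95 + [True] * 5)   # 5% errors: at the threshold, no alert
check_and_alert([False] * 90 + [True] * 10)  # 10% errors: triggers the alert
```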
Mechanisms to deal with incidents
Once an incident alert is triggered, someone on call is paged. This person is responsible for analyzing what is happening using the organization’s monitoring mechanisms, which pinpoint which systems are having problems and why they are not working, so the situation can be resolved as quickly as possible. When a system has a problem, the initial goal is to put out the fire as soon as possible, sometimes with a palliative fix. A definitive solution to the problem can come later.
Cloud computing providers face an even more delicate situation when incidents occur, since so many people around the world depend on them. Because of this, they publish official communications and reports explaining in detail what happened in serious incidents. This is also the case when a mass-use service, like Instagram, goes down: the company usually creates a page to give users and stakeholders an explanation. This type of external communication may not be a regulatory requirement, but it is good practice, especially for companies that can have a substantial impact on their customers, such as cloud providers and financial institutions.
When a critical incident occurs, an online emergency meeting, also known as a “war room,” is opened to try to resolve the problem, and a specific person is assigned command. It is not always possible to know the root cause of an incident immediately after it has been resolved, because the initial goal, and the most urgent, is to mitigate the problem. Once the incident is under control, the post-mortem exercise is to investigate what happened and write a document about that incident, explaining the root cause and defining short-, medium-, and long-term actions to avoid and minimize the occurrence of similar incidents in the future.
- A short-term action might be to create an alert for the situation if it recurs.
- A hypothetical medium-term action may be to add more controls to the change management system so that it can’t be bypassed or circumvented; this may be a significant change to the system and take months to implement.
- A long-term action could take years, as it could be related, for example, to resolving a major old technical debt, requiring more structured and planned actions.
Ideally, companies should already have a list of technical debts that need to be monitored and solved at any given point. These outstanding issues should be considered during the planning cycle and weighed against other features that need to be developed.

Marcus Fontoura
Marcus Fontoura is a technical fellow and CTO for Azure Core at Microsoft, and author of A Platform Mindset. He works on efforts related to large-scale distributed systems, data centers, and engineering productivity. Fontoura has had several roles as an architect and research scientist in big tech companies, such as Yahoo! and Google, and was most recently the CTO at Stone, a leading Brazilian fintech.