Technology incidents and the importance of repeatedly asking why

After a technology incident is mitigated, the usual next step is a meeting with the engineers involved, who try to locate the root cause of the problem. During the incident itself, the main goal is to get the system back up and running and to limit the damage. Afterward, the team needs to clean up anything that’s been left behind, investigate and identify the root cause, and write the post-mortem, a report describing what happened and proposing improvements for the future.

For the post-mortem, some tech leaders use the 5 whys strategy, developed at Toyota to solve quality problems with its products. The technique advocates asking “why?” several times:

Why did this incident happen?
Is it the complexity of the system?
If not, then what caused the incident?
Why did it stop working?
How likely is it that this will happen again?

From there, it’s possible to start thinking about structural actions, so that similar problems don’t happen in the future. The purpose of looking for the root cause and writing the post-mortem is to generate learning and prevent issues from recurring.

The goal should be to make the environment increasingly safe, with improvements derived from these learnings. Another very important point is that all of this must take place in a blameless environment.

A blameless post-mortem is an analysis of incidents without pointing fingers. In principle, no one is responsible for the incident; it just happened. And the more checks and balances there are in the engineering life cycle, the more the focus is taken away from looking for culprits. It is true that, in most cases, it is new code written by somebody that causes an incident, but the validations should have ensured that it wouldn’t be a problem. Human failures are to be expected, and this needs to be integrated into the risk management system.

The purpose of the post-mortem process is to improve the state of the platforms, tools, systems, and processes. The aim is to raise the bar so that incidents happen less often. Leadership must constantly reinforce that the company culture is blameless; otherwise, no one will believe it or trust that it is OK to take responsibility for mistakes.


About the author

Marcus Fontoura

Marcus Fontoura is a technical fellow and CTO for Azure Core at Microsoft, and author of A Platform Mindset. He works on efforts related to large-scale distributed systems, data centers, and engineering productivity. Fontoura has had several roles as an architect and research scientist in big tech companies, such as Yahoo! and Google, and was most recently the CTO at Stone, a leading Brazilian fintech.



