A blameless culture for incidents

Top technology companies invest in hiring the best people in the market and in creating good processes to retain and develop them. It is just as important to nurture a corporate culture in which employees feel confident that causing an incident will not have negative repercussions for their careers. A developer who does nothing and never deploys code into production runs no risk of causing incidents, but the company also produces nothing. That is a worst-case scenario for all parties.

Technology leaders want their teams to work with a mindset of fearless execution, because that is what allows creativity and innovation to flourish. But for people to act without fear, controls need to be in place. Anyone who writes code needs to be judicious and test their work, but they should also know that everyone in the company understands that systems and humans fail, and that it’s impossible to protect against every failure.

It is best to assume that incidents will eventually happen and organize processes to mitigate and understand them rather than to hope incidents won’t happen. The blameless culture revolves around this too. There will always be a human component in any incident, and leadership must make it clear to employees that no one has to be a hero and do everything alone. Spreading this concept is fundamental so that people don’t feel the need to do things in secret and instead truly embrace the philosophy of collaborative work.

Dismissals because of incidents shouldn’t happen in technology companies, because making mistakes, taking responsibility, and fixing them need to be part of the established culture.

Teams that never lose don’t know how to recover

Sometimes repeated incidents create a fear that some system vulnerability will produce the same situation again. The engineering team gets paranoid about these situations, and rightly so. The engineering leader must be present and set a calm tone so that tension doesn’t escalate further. Nobody can think strategically when they’re desperate. An incident war room must be objective, investigative, and collaborative. Calmness needs to prevail for the solution to emerge.

It is worth highlighting that, up to a point, incidents are even welcome. Without incidents, an engineering team:

Doesn’t develop the skills to mitigate them
Doesn’t learn to hold post-mortem meetings
Doesn’t have to rethink its own processes

Having incidents is part of our daily lives in tech, and no one can wish for zero incidents in their company. We should, however, minimize the impact on the customer when they occur.

To draw a parallel with the world of sports, a team without incidents is like the team that wins every game in the tournament but loses the final match. It may even be beneficial to lose a few games along the way, so that the team can identify its weaknesses and learn to reorganize and react promptly without panicking when something goes wrong. In the same way, incidents are learning moments that end up nurturing a platform-mindset culture.

Sometimes it’s even worth holding an incident simulation just to gauge the team’s reaction. It’s important for the team to be trained in how to behave in an emergency. A completely new situation in which the company is losing millions of dollars a minute generates enormous pressure, and the team must know how to act professionally under pressure.

About the author

Marcus Fontoura

Marcus Fontoura is a technical fellow and CTO for Azure Core at Microsoft, and author of A Platform Mindset. He works on efforts related to large-scale distributed systems, data centers, and engineering productivity. Fontoura has had several roles as an architect and research scientist in big tech companies, such as Yahoo! and Google, and was most recently the CTO at Stone, a leading Brazilian fintech.
