An incident usually occurs when something changes. If a team doesn’t do anything new, the system stays the same and the chances of exposing a problem are very low. But nobody wants a technology organization frozen in time. The pace of change is usually fast, so incidents are often caused by a recent, poorly tested modification that reveals a vulnerability.

External events rarely cause incidents, but it does happen. At the turn of the year 2000, there was widespread concern about computer failures, because the number of digits used to represent the year was about to change.

Engineers all over the world had to anticipate problems in systems that stored only the last two digits of the year and assumed the century was 19xx, and make the code changes needed so that the year “00” wouldn’t be interpreted as 1900 instead of 2000. The Y2K problem, or millennium bug, as this situation came to be known, was an incident that had to be avoided proactively. If no one had thought of it, many systems could have failed, possibly causing serious incidents on a global scale.

Apart from these exceptional cases of external causes, foreseeable or unforeseeable, it is far more common for incidents to be caused by internal changes. Mitigation therefore often means locating the change that caused the problem and reversing it as soon as possible, returning the system to its previous state, where everything should work again. This is why tests, validations, and alerts are so important. With them in place, when you deploy a change to production, automated tests monitoring the health of the system detect faults so that a rollback can be carried out and the system returns to the way it was before the unwanted change was deployed.
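The deploy-check-rollback pattern described above can be sketched in a few lines. This is an illustrative toy, not a real deployment tool; the function and parameter names are assumptions for the sake of the example.

```python
# Hypothetical sketch: apply a change, run health checks, and roll back
# automatically if any check fails. All names here are illustrative.

def deploy(version, health_checks, apply, rollback):
    """Apply `version`; if any health check fails, revert to the prior state."""
    apply(version)
    for check in health_checks:
        if not check():
            rollback()          # return the system to its previous state
            return "rolled-back"
    return "deployed"

# Example: a change that breaks a check triggers an automatic rollback.
state = {"version": "v1"}
result = deploy(
    "v2",
    health_checks=[lambda: state["version"] == "v1"],  # v2 fails this check
    apply=lambda v: state.update(version=v),
    rollback=lambda: state.update(version="v1"),
)
# After the call, state["version"] is back to "v1".
```

Real systems (Kubernetes rollouts, blue-green deployments) implement the same idea with far more machinery, but the core loop is the same: observe health after the change, and revert when the observations fail.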

Automated tests to prevent bigger issues

When code is approved in code review, it enters a CI/CD (continuous integration, continuous delivery) pipeline on its way to production. This pipeline runs more tests and validations. Once in production, automated checks need to keep monitoring the state of the system, as if it were a patient who has their temperature taken every five minutes to detect a possible fever. Unlike nurses and mothers, computers don’t get tired of running these tests. They run all the time, and once you realize that a system is in poor health, you can investigate what is going on and try to prevent a major incident.
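The temperature-every-five-minutes analogy maps directly onto a periodic health probe: poll a metric at a fixed interval and raise an alert when it crosses a threshold. A minimal sketch, with made-up names and simulated readings standing in for a real metrics source:

```python
# Illustrative periodic health probe. In production this would read a real
# metric (latency, error rate) and sleep between samples; here the readings
# are simulated so the example is self-contained.

def monitor(read_metric, threshold, samples):
    """Return the index of the first unhealthy sample, or None if all pass."""
    for i in range(samples):
        if read_metric() > threshold:
            return i            # unhealthy: time to investigate or roll back
        # a real probe would wait here, e.g. time.sleep(300) for 5 minutes
    return None

readings = iter([36.5, 36.8, 38.9])          # simulated "temperature" samples
first_alert = monitor(lambda: next(readings), threshold=37.5, samples=3)
# The third sample (index 2) exceeds the threshold and raises the alert.
```

Monitoring systems such as Prometheus alerting rules or Kubernetes liveness probes are production-grade versions of this same loop.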

In many cases, when the problem has been triggered by a change, reversing the change makes the problem go away. More complicated cases may require more elaborate actions. For example, if a change has brought down a credit card transaction system, you can revert to the way things were before, but what about all the transactions that weren’t processed in the meantime?

Palliative actions may have to be taken to allow people to complete their payments. It’s a delicate situation: someone may have tried a few times to pay for a service by credit card and, since the operation wasn’t working, decided to pay through another mechanism. The result could be duplicate payments. If this happens to many people, the next day the company will face a flood of calls to the customer service department, which becomes a customer service incident of its own.
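One common safeguard against the duplicate-payment scenario is an idempotency key: the client sends the same key on every retry of the same purchase, and the server processes each key at most once. This is a minimal in-memory sketch under that assumption; a real payment system would persist the keys durably and handle concurrency.

```python
# Hypothetical idempotent charge endpoint. The key/receipt shapes are
# assumptions for illustration, not a real payment provider's API.

processed = {}  # idempotency_key -> receipt (a real system uses a database)

def charge(idempotency_key, amount):
    """Process a payment once per key; retries return the original receipt."""
    if idempotency_key in processed:
        return processed[idempotency_key]     # retry: no second charge
    receipt = {"amount": amount, "status": "charged"}
    processed[idempotency_key] = receipt
    return receipt

first = charge("order-123", 50)
retry = charge("order-123", 50)   # customer retried after a timeout
# Only one charge was recorded, so no duplicate payment occurs.
```

Payment APIs such as Stripe's expose exactly this mechanism via an idempotency header, so client retries during an outage don't turn into double charges.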

Situations like this can cause a chain reaction, with technical problems triggering related customer service problems.


About the author

Marcus Fontoura

Marcus Fontoura is a technical fellow and CTO for Azure Core at Microsoft, and author of A Platform Mindset. He works on efforts related to large-scale distributed systems, data centers, and engineering productivity. Fontoura has had several roles as an architect and research scientist in big tech companies, such as Yahoo! and Google, and was most recently the CTO at Stone, a leading Brazilian fintech.