Tools, tests, and AI to combat incidents

Assuming that incidents are intrinsic to technology, the best course of action is for companies to build resilient systems. However, no system is perfect, which is why it’s equally unrealistic to spend all of a team’s time increasing test coverage for existing systems without producing anything new.

The development conveyor belt of an organization, composed of tools, processes, and validations, should reach a level of maturity that gives the engineering and product teams the agility to innovate without worrying too much about the stability of the system, something that some authors call fearless execution. Most platforms should already be well tested, so that the chance of a team “breaking” something big is low. The more technical debt, inconsistencies, and missing validations there are, the less secure the team feels.

There are certain areas of a company, such as new product development, that need to move fast and validate their ideas quickly. Well-designed platforms should insulate these areas from major problems so that a problematic change doesn’t generate a tsunami of setbacks.

Artificial intelligence tools help a lot because they can be integrated into a company’s development environment and perform various functions. For example, you can point to a piece of code you’ve written and have the tool generate unit tests that validate that code. With this support from AI, the team’s degree of trust in its changes should increase. AI increases productivity, but humans must still check whether the generated code is good and, if it is, send it for code review.

If the artificial intelligence hallucinates and creates something that doesn’t work, the code review process should come into play. Engineering processes already involve humans naturally (human in the loop, or HITL), which makes the use of co-pilots for code development and testing a very suitable AI application.
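As a minimal sketch of that workflow, suppose an engineer has written the small function below and asks an assistant to propose tests for it. Both `apply_discount` and the test cases are hypothetical and are not the output of any specific tool; they only illustrate the kind of artifact a human would inspect before sending it to code review.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Return `price` reduced by `percent` (hypothetical function under test)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    """Tests of the kind an assistant might propose; a human still reviews them."""

    def test_regular_discount(self):
        self.assertEqual(apply_discount(100.0, 25), 75.0)

    def test_zero_percent_keeps_price(self):
        self.assertEqual(apply_discount(100.0, 0), 100.0)

    def test_full_discount_is_free(self):
        self.assertEqual(apply_discount(100.0, 100), 0.0)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(100.0, 120)
```

Saved in a file, these tests can be run with `python -m unittest`. The reviewer’s job is to confirm that the expected values reflect the intended behavior rather than simply mirroring whatever the code currently does.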


Automation against breaches

Ideally, the tools themselves should be able to categorically prevent an unauthorized change to a system. In other words, they should allow the engineering leadership to impose a freeze, if it decides to do so, directly in the tool, without ambiguity or room for non-compliance.
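A freeze enforced by the tool itself could look roughly like the following sketch. The names `FreezeWindow`, `DeploymentBlocked`, and `check_deploy_allowed`, as well as the dates, are assumptions made for illustration; a real pipeline would load the windows from wherever leadership declares them and call the check before every deployment.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FreezeWindow:
    """A period during which leadership has prohibited deployments (hypothetical type)."""
    start: datetime
    end: datetime
    reason: str

    def covers(self, moment: datetime) -> bool:
        # The start is inclusive and the end is exclusive.
        return self.start <= moment < self.end


class DeploymentBlocked(Exception):
    """Raised when a deployment is attempted during an active freeze."""


def check_deploy_allowed(now: datetime, freezes: list[FreezeWindow]) -> None:
    """Raise DeploymentBlocked if any freeze window covers `now`; otherwise return."""
    for freeze in freezes:
        if freeze.covers(now):
            raise DeploymentBlocked(
                f"Deployments are frozen until {freeze.end.isoformat()}: {freeze.reason}"
            )


# Example freeze for a sensitive business period (the dates are made up).
FREEZES = [
    FreezeWindow(
        start=datetime(2024, 11, 28, tzinfo=timezone.utc),
        end=datetime(2024, 12, 2, tzinfo=timezone.utc),
        reason="peak sales weekend",
    )
]
```

Because the check raises an exception instead of printing a warning, a pipeline step that calls `check_deploy_allowed(datetime.now(timezone.utc), FREEZES)` fails outright during a freeze, which leaves no room for non-compliance.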

You can try to avoid incidents with freezes, which is a proactive way for the leadership not to expose the system to instability at times that it considers sensitive for the business. However, in addition to warning engineers about the freeze, it is necessary to have blocks in the system itself that reinforce the directive. The counterpoint is that imposing a freeze is not the best solution for the business itself, because it is more productive to have deployments happening continuously.

Blocking deployments, or only executing them within certain timeframes, means that when they finally go into production, there could be a lot of accumulated changes. What if there’s a bug? With fewer accumulated changes, it’s much easier to understand what happened and to mitigate the incident more quickly. That’s why frequent deployments of smaller changes are ideal.

A week without deployments generates an accumulation of seven days of changes. If something goes wrong, it’s extremely complex to look for causes with so many new things deployed at the same time. The reality is that the fewer deployment freezes a team imposes, the easier it is to determine the cause of problems and minimize incidents.
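One way to see the cost of accumulated changes is to count how many checks it takes to locate a faulty change by bisection. This technique is brought in here only as an illustration; it is not described in the text above. The sketch assumes a single culprit, a system that stays broken once the culprit is applied, and a hypothetical `is_bad` callback standing in for a build, deploy, and verify cycle.

```python
from typing import Callable, Sequence


def find_first_bad(changes: Sequence[str], is_bad: Callable[[int], bool]) -> tuple[str, int]:
    """Return the first breaking change and the number of checks performed.

    `is_bad(i)` reports whether the system is broken with changes[0..i] applied.
    Assumes the system was healthy before changes[0] and is broken after the
    last change.
    """
    if not changes:
        raise ValueError("at least one change is required")
    low, high = 0, len(changes) - 1
    checks = 0
    while low < high:
        mid = (low + high) // 2
        checks += 1
        if is_bad(mid):
            high = mid      # the culprit is at mid or earlier
        else:
            low = mid + 1   # the culprit is after mid
    return changes[low], checks


# A day's worth of changes versus a week's worth (the counts are made up).
daily = [f"change-{i}" for i in range(8)]
weekly = [f"change-{i}" for i in range(56)]

print(find_first_bad(daily, lambda i: i >= 5))    # ('change-5', 3)
print(find_first_bad(weekly, lambda i: i >= 37))  # ('change-37', 6)
```

Even in this idealized case, the larger batch needs twice as many verification cycles. Real incidents are usually worse, because changes deployed together can interact and there may be more than one culprit.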

Incidents during the day, when everyone is in the office or working online, are easier to deal with than incidents at times when most people are asleep. For very critical situations, teams sometimes choose to deploy at the weekend or in the early hours of the morning, with staff on duty, but in general, deployments should happen during regular working hours. If they cause micro-incidents during these times, more staff are present to monitor and mitigate the impact. Managing incidents is more natural, more fluid, and more agile when deployments take place all the time.

About the author

Marcus Fontoura

Marcus Fontoura is a technical fellow and CTO for Azure Core at Microsoft, and author of A Platform Mindset. He works on efforts related to large-scale distributed systems, data centers, and engineering productivity. Fontoura has had several roles as an architect and research scientist in big tech companies, such as Yahoo! and Google, and was most recently the CTO at Stone, a leading Brazilian fintech.
