Published on 12 December 2022

Chaos engineering: what is it and how does it work?

What would you do if you entered a data centre to find a room full of hundreds of tangled cables? How would you untangle and organise them? One possible solution would be to pick a cable at random, unplug it to observe the result, and then label the cable accordingly. In the programming world, this task has become an entire discipline: chaos engineering.

Chaos engineering is used to prevent errors and to better understand the system

What is chaos engineering?

Chaos engineering is the discipline of running controlled experiments on a distributed system in order to improve its ability to withstand adverse situations.

Chaos engineering also aims to avoid the kind of incident that causes companies large economic losses, for example downtime that leads to loss of service. In 2019, Facebook lost an estimated 90 million dollars after a server outage that lasted 14 hours. On another occasion, Delta Air Lines suffered a loss of some 150 million dollars after a power outage that caused over 2,000 flights to be cancelled.

Netflix, WhatsApp, Instagram, and HBO are among the many platforms that have stopped working without warning, leaving their programmers in a real predicament and forcing them to work against the clock under extreme pressure. Chaos engineering seeks to prevent this type of situation by anticipating such events, predicting possible errors, and automating potential solutions. This work is similar to that of a Site Reliability Engineer: both seek greater system stability and earlier detection of errors, and in many organisations the two roles share a department.

How does the chaos engineering methodology work?

The steps of the chaos engineering methodology are not as simple as the example of the room full of tangled cables. This discipline is not just about breaking the system at random and observing the results; it requires a safe and well-documented procedure:

  1. Defining the current state of the system.
  2. Formulating a positive hypothesis (that the system will remain stable) and designing an experiment that significantly disrupts the system in order to test it.
  3. Defining an environment that guarantees a minimum impact and limits the number of affected users.
  4. Identifying metrics and preparing real-time monitoring of the system.
  5. Notifying all departments that may be affected by the experiment.
  6. Performing the experiment.
  7. Analysing the results based on metrics.
  8. Performing the experiment again with a wider scope.
  9. Automating the solution for possible similar future scenarios.

These steps not only confirm that the system tolerates the fault, or is now ready to recover from it, but also test and improve real-time monitoring, and train teams to identify and fix real problems faster and more efficiently.
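As a rough illustration, several of these steps can be simulated in a few lines of Python. This is only a minimal sketch under stated assumptions, not a production tool: `ToyService`, `run_experiment`, `run_with_growing_scope`, the 1% error threshold, and the list of scopes are all hypothetical names and values invented for this example. The fault is injected only for a small sample of users (step 3), the error rate is the monitored metric (step 4), and the experiment is repeated with a wider scope only while the hypothesis holds (step 8).

```python
import random
from dataclasses import dataclass


class ToyService:
    """Hypothetical service: a primary server plus an optional fallback."""

    def __init__(self, has_fallback):
        self.has_fallback = has_fallback
        self.faulted_users = set()  # users whose requests hit the broken primary

    def handle_request(self, user_id):
        # A request succeeds if the primary works for this user
        # or a fallback can take over.
        primary_ok = user_id not in self.faulted_users
        return primary_ok or self.has_fallback

    def inject_fault(self, user_ids):
        self.faulted_users = set(user_ids)

    def clear_fault(self):
        self.faulted_users = set()


@dataclass
class ExperimentResult:
    blast_radius: float
    baseline_error_rate: float
    observed_error_rate: float
    hypothesis_held: bool


def error_rate(service, user_ids):
    """Metric (step 4): fraction of requests that fail."""
    failures = sum(1 for uid in user_ids if not service.handle_request(uid))
    return failures / len(user_ids)


def run_experiment(service, user_ids, blast_radius, max_error_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: record the current (steady) state of the system.
    baseline = error_rate(service, user_ids)
    # Step 3: limit the experiment to a small sample of users.
    sample_size = max(1, int(len(user_ids) * blast_radius))
    affected = rng.sample(user_ids, sample_size)
    # Step 6: inject the fault, always clearing it afterwards.
    service.inject_fault(affected)
    try:
        observed = error_rate(service, affected)
    finally:
        service.clear_fault()
    # Step 7: compare the metric with the hypothesis from step 2.
    return ExperimentResult(blast_radius, baseline, observed,
                            observed <= max_error_rate)


def run_with_growing_scope(service, user_ids, radii=(0.01, 0.1, 0.5)):
    """Step 8: repeat with a wider scope, stopping at the first failure."""
    results = []
    for radius in radii:
        result = run_experiment(service, user_ids, radius)
        results.append(result)
        if not result.hypothesis_held:
            break
    return results


users = list(range(1000))
resilient_results = run_with_growing_scope(ToyService(has_fallback=True), users)
fragile_results = run_with_growing_scope(ToyService(has_fallback=False), users)
print(len(resilient_results), all(r.hypothesis_held for r in resilient_results))  # 3 True
print(len(fragile_results), fragile_results[0].observed_error_rate)  # 1 1.0
```

The service with a fallback passes at every scope, while the one without it fails at the smallest scope, so only 1% of users are ever exposed to the fault. In a real system the fault injection and the metrics would come from the infrastructure and monitoring tools already in place.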

Chaos engineering is a good career path for programmers specialised in distributed cloud systems

Is chaos engineering a good career opportunity?

Chaos engineering is not a new discipline. In 2010, Netflix published the article "5 Lessons We’ve Learned Using AWS" to explain the importance of the cloud and distributed systems, as well as the complexity of managing them. At that time, they created a set of solutions called Chaos Monkey to deliberately cause server failures and observe how they affected their infrastructure.
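The idea of causing a server failure on purpose and observing the effect can be sketched with a simulated fleet. The following is a toy illustration only, not Netflix's implementation: the fleet, the instance names, and the functions `terminate_random_instance` and `service_available` are all hypothetical, and no real servers are involved.

```python
import random


def terminate_random_instance(instances, rng):
    """Mark one running instance, chosen at random, as terminated."""
    running = [name for name, up in instances.items() if up]
    if not running:
        return None  # nothing left to terminate
    victim = rng.choice(running)
    instances[victim] = False
    return victim


def service_available(instances, min_healthy):
    """The service counts as available while enough instances are running."""
    return sum(instances.values()) >= min_healthy


rng = random.Random(42)
# Hypothetical fleet: instance name -> is it running?
fleet = {"web-1": True, "web-2": True, "web-3": True}

victim = terminate_random_instance(fleet, rng)
print("terminated:", victim)
print("still available:", service_available(fleet, min_healthy=2))
```

With three instances and a minimum of two healthy ones, the simulated service survives the loss of any single instance. Terminating a second one would bring it below the threshold, which is exactly the kind of weakness such an experiment is meant to reveal before a real outage does.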

And so chaos engineering was born, a methodology that has been refined ever since and that continues to grow thanks to the success of cloud storage, social networks, and online entertainment platforms.

Like version control, chaos engineering is a very useful and necessary tool in the world of programming and system maintenance. If your passion is computing and distributed systems are your natural habitat, don't hesitate: take a new path in your professional career with BETWEEN!


