What would you do if you entered a data centre to find a room full of hundreds of tangled cables? How would you untangle and organise them? One possible solution would be to pick a cable at random, unplug it to observe the result, and then label the cable accordingly. In the programming world, this task has become an entire discipline: chaos engineering.
What is chaos engineering?
Chaos engineering is the discipline that, through controlled experimentation, improves the adaptability of a distributed system when faced with an adverse situation.
Chaos engineering aims to avoid the kind of incidents that cause large economic losses to companies, for example downtime that interrupts a service. In 2019, Facebook lost an estimated 90 million dollars after a server outage that lasted just 14 hours. On another occasion, Delta Air Lines suffered a loss of some 150 million dollars after a power outage that caused over 2,000 flights to be cancelled.
Netflix, WhatsApp, Instagram, and HBO are among the many platforms that have stopped working without warning, leaving their programmers in a real predicament and forcing them to work against the clock under extreme pressure. Chaos engineering seeks to put an end to this type of situation by anticipating such events, predicting possible errors, and automating potential solutions. This work is not dissimilar to that of a Site Reliability Engineer. Both roles seek greater system stability and earlier detection of errors, and consequently share a department in many organisations.
How does the chaos engineering methodology work?
The steps of the chaos engineering methodology are not as simple as the example of the room full of tangled cables. This discipline is not just about breaking the system at random and observing the results. It requires a safe and well-documented procedure:
- Defining the current state of the system.
- Formulating a positive hypothesis (the system will keep working normally) to be tested by an experiment that greatly disrupts the stability of the system.
- Defining an environment that guarantees a minimum impact and limits the number of affected users.
- Identifying metrics and preparing real-time monitoring of the system.
- Notifying all departments that may be affected by the experiment.
- Performing the experiment.
- Analysing the results based on metrics.
- Performing the experiment again with a wider scope, so that a larger part of the system is affected.
- Automating the solution for possible similar future scenarios.
These steps not only confirm that the system withstands the fault, or is now ready to recover from it, but also check and improve real-time monitoring, and train teams to identify and fix real problems both faster and more efficiently.
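To make the procedure more concrete, here is a minimal sketch of one experiment run against a simulated fleet of servers. Everything in it is an assumption made for illustration: the `Server` class, the `success_rate` metric, the single client retry, and the 95% threshold are hypothetical and do not come from any real chaos engineering tool. It only shows the shape of the loop: measure the steady state, limit the scope, inject the fault, measure again, and always roll back.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical stand-in for a real server or service instance."""
    name: str
    healthy: bool = True


def success_rate(servers, requests=1000, retries=1, seed=0):
    """Fraction of simulated requests served, allowing `retries` extra attempts."""
    rng = random.Random(seed)
    served = 0
    for _request in range(requests):
        # Each attempt lands on a random server; the client retries on failure.
        for _attempt in range(retries + 1):
            if rng.choice(servers).healthy:
                served += 1
                break
    return served / requests


def run_experiment(servers, blast_radius=1, threshold=0.95, seed=0):
    """Run one simulated chaos experiment and report whether the hypothesis held."""
    # Define the current state of the system with a measurable metric.
    baseline = success_rate(servers, seed=seed)

    # Hypothesis: the success rate stays at or above `threshold`.
    # The scope is limited to `blast_radius` randomly chosen servers.
    victims = random.Random(seed).sample(servers, blast_radius)

    try:
        # Perform the experiment: inject the fault.
        for server in victims:
            server.healthy = False
        # Measure the same metric while the fault is active.
        observed = success_rate(servers, seed=seed)
    finally:
        # Always roll back, even if the measurement itself fails.
        for server in victims:
            server.healthy = True

    return {
        "baseline": baseline,
        "observed": observed,
        "victims": [server.name for server in victims],
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    fleet = [Server(f"srv-{i}") for i in range(10)]
    # Repeat the experiment with a growing scope.
    for radius in (1, 3, 5):
        print(radius, run_experiment(fleet, blast_radius=radius))
```

In a real system the metric would come from production monitoring rather than a simulation, and the rollback step is what keeps the experiment safe. The notification and automation steps are organisational and are left out of this sketch.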
Is chaos engineering a good career opportunity?
Chaos engineering is not a new discipline. In 2010, Netflix published the article 5 Lessons We’ve Learned Using AWS to explain the importance of the cloud and distributed systems, and also the complexity of managing them. At that time, they created a tool called Chaos Monkey to cause server failures deliberately and observe how they affected their infrastructure.
And so, chaos engineering was born, a methodology that has continued to be perfected since, and that continues to grow due to the success of cloud storage, social networks, and online entertainment platforms.
In the same way as version control, chaos engineering is presented as a very useful and absolutely necessary tool for the world of programming and system maintenance. If your passion is computing and distributed systems are your natural habitat, don't hesitate - take a new path in your professional career with BETWEEN!