What do you need to work as a Site Reliability Engineer?

Written by Susana Morcuende | 17-Jun-2020 9:01:42

The rivalry between the Development and Operations departments in technology companies is not new to this decade, or even to this century. That is why, in 2003, Ben Treynor, Google's vice president of Engineering, decided to assign software engineers to tasks that traditionally belonged to operations. Thus was born the concept of Site Reliability Engineering and, with it, the position of Site Reliability Engineer (SRE), a role increasingly valued by companies that want to innovate regularly in their products while maintaining a high level of service reliability.

The SRE team brokers peace between Development professionals, who want to ship new features as fast as possible, and the Operations staff, whose priority is product stability. Thanks to SREs, each engineering division can focus fully on its objectives:

Development: write code and innovate.
SRE: monitor the operation of the products in order to detect and solve any error early.
Operations: take care of setup, maintenance and periodic testing.

Have you thought about working as an SRE? It is a growing role that combines the best of both worlds and lets you learn something new every day. Find out how far you could go and what training and skills you need to get there.

What is a Site Reliability Engineer and what does the role involve?

A Site Reliability Engineer (SRE) splits their time between developing software to improve stability and performance, and monitoring and troubleshooting, in order to ensure service availability while supporting business growth and innovation.

SRE specialists design systems with a high tolerance for failure, using strategies such as graceful degradation (deactivating some processes so that the system keeps functioning even during incidents) or defense in depth, which provides ways for errors to correct themselves automatically.
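As a minimal sketch of the degradation strategy described above, the Python snippet below keeps a page working when an optional feature fails. The names get_recommendations, DEFAULT_ITEMS, broken_service and healthy_service are hypothetical and chosen only for illustration; a real system would likely add timeouts, logging and alerting on top of this.

```python
from typing import Callable, List

# Hypothetical static fallback shown when personalization is unavailable.
DEFAULT_ITEMS = ["bestseller-1", "bestseller-2"]


def get_recommendations(fetch: Callable[[], List[str]]) -> List[str]:
    """Return personalized items, or a static list if the optional service fails."""
    try:
        return fetch()
    except Exception:
        # Degrade: the page still renders, just without personalization.
        return list(DEFAULT_ITEMS)


def broken_service() -> List[str]:
    """Stand-in for a dependency that is currently down (assumption)."""
    raise TimeoutError("recommendation service unavailable")


def healthy_service() -> List[str]:
    """Stand-in for the same dependency when it responds normally."""
    return ["for-you-1", "for-you-2"]


if __name__ == "__main__":
    print(get_recommendations(healthy_service))  # personalized items
    print(get_recommendations(broken_service))   # static fallback items
```

The design choice here is that the non-essential feature fails closed to a simpler result instead of propagating the error to the user.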

How are SRE and DevOps different?

Because the Site Reliability Engineer role is conceived as a bridge between Development and Operations, it is often confused with that of a DevOps engineer. The clearest distinction comes from Google, which indicates that:

DevOps functions are more generic and stem from a business culture based on integrating the two areas, Development and Operations, but without a methodology that defines how to do it. Each organization should study its own ways of working to find the most appropriate approach.
The responsibilities of the SRE, on the other hand, are well defined and are expected to follow what is set out in the book Site Reliability Engineering: How Google Runs Production Systems, written by the Google SRE team.

This book covers the concepts that, according to Google, are basic to delimiting and coordinating the work of the SRE, such as:

The Service Level Agreement (SLA), that is, the minimum availability percentage that the system must maintain for end users.
The error budget, that is, the amount of unavailability the system is allowed to accumulate in a given period of time. Every experiment the Development team wants to run must fit within this error budget.
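The relationship between these two concepts reduces to simple arithmetic: whatever the availability target does not cover is the budget. The sketch below assumes the budget is measured in minutes of downtime over a fixed window; the function names and the 30-day default are illustrative choices, not something prescribed by the book.

```python
def error_budget_minutes(availability_target: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a target availability over a period."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def budget_remaining(availability_target: float, period_days: int,
                     downtime_minutes: float) -> float:
    """Minutes of budget left after the downtime already consumed (may be negative)."""
    return error_budget_minutes(availability_target, period_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))
    # After 30 minutes of outages, about 13.2 minutes of budget remain.
    print(round(budget_remaining(0.999, 30, 30.0), 1))
```

A negative result from budget_remaining would mean the budget is exhausted, which under this model is the signal to pause risky releases until the window resets.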

However, bear in mind that Google's methodology is just that: very Google. Elsewhere, no two SREs or DevOps engineers will be alike, since other companies mix and adjust the tasks of these two profiles according to their needs.

How to become a Site Reliability Engineer?

To work as an SRE, your resume should combine the following elements:

Training in Computer Engineering or similar university specializations.
Previous experience in the areas of Systems and Software Development. Perhaps your career is stronger in one than the other, but it is important that you have knowledge of both.
Soft skills such as oral and written communication, teamwork, decisiveness when facing problems and a willingness to keep learning.

Professionals in the computing sector currently face a multitude of challenges, such as data storage in the age of big data, digital transformation, the use of open source software and the renewal of legacy systems. In Site Reliability Engineering, the challenge is to automate the most repetitive and cumbersome work (called toil in the Google SRE team's jargon).

Likewise, we must not lose sight of the fact that incident resolution usually absorbs a good part of an SRE's working hours. In fact, according to the 2019 SRE Report from Catchpoint, which conducts an annual survey to assess the state of the profession, 49% of site reliability engineers say they have had to deal with an incident in the past week. And 50% of respondents say they have had to resolve a service outage lasting more than a day at some point in their career.

Do you identify with this description, and would you not hesitate to give it your all to put an end to any treacherous software failure? Then working as an SRE is for you. Come to BETWEEN and climb one more step in your professional career with us!
