Supporting SRE and Chaos Engineering Disciplines with Azure Services

According to Google, Site Reliability Engineering (SRE) is a discipline and a role that incorporates aspects of software engineering and applies them to infrastructure and operations problems. In essence, SRE does the work traditionally done by an operations team, but with engineers who have software expertise. An SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. It is a discipline that helps address many SRE challenges. Specifically, SRE defines 7 principles and 18 practices for reaching resilience and reliability in our systems. The principles include SLOs, minimizing toil, and reducing the cost of failure. The practices include on-call, incident response, postmortem culture, and data processing pipelines. Cloud providers promise resilience and reliability in their elevator pitches, and Azure is no exception.

In this talk I will explain what SRE and Chaos Engineering are and how the two disciplines relate. I will explore the main challenges and benefits for cloud, operations, security, and development engineers (using examples with .NET), and finally I will present a tool that we are implementing to integrate chaos engineering into the CI/CD pipelines that .NET developers use.
Intermediate Cloud & Serverless
