How to get started with SRE?

Gagan Goswami
2 min read · Jan 21, 2021

A step-by-step guide to start implementing SRE concepts.

SRE is different for every organization and workload.

An SRE team is responsible for the availability, reliability, performance, monitoring, and emergency response of infrastructure and applications, and for reducing manual work by applying SRE principles and practices. The SRE team works directly with developers and DevOps engineers to deploy new features and changes, so that it can protect the SLO by keeping deployments within the error budget. For instance, the team approves a deployment only if a sufficient error budget is available. There should also be a plan for disaster recovery and rollback in case something goes wrong.
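To make the error budget concrete, here is a minimal sketch of the arithmetic for a time-based availability SLO. The function name `error_budget`, the 99.9% target, and the 30-day window are illustrative assumptions, not values from any particular team.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    # Everything the SLO does not promise is budget we are allowed to spend.
    return window * (1.0 - slo_target)


# Assumed example: a 99.9% SLO measured over a 30-day window.
budget = error_budget(0.999, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

With these assumed numbers the budget comes to roughly 43 minutes of downtime per 30 days, which is the amount that incidents and risky deployments together are allowed to consume.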

Basic SRE principles and practices:

  • Implementation of SLOs, SLIs, and error budgets
  • Blameless postmortems
  • Incident management process
  • Toil reduction ("untoil")
  • Monitoring
  • Capacity planning
  • Code review

To start introducing SRE, we need to shift our mindset toward keeping the infrastructure and application healthy, rather than just resolving alerts.

How can we start setting this mindset?

SRE is a broad discipline that cannot be implemented in one attempt. Trying to implement everything at once can lead to failure and to negative sentiment towards SRE.

We can get started by implementing a few basic SRE principles and practices while making use of the team's current skill set, which will gradually change our mindset about the type of service we provide. Implementing SLIs and SLOs and mastering one specific type of service at the beginning helps us get onto the SRE path.

Start by selecting some ongoing projects whose infrastructure and applications we already have some insight into.

For instance, we can select a project based on ECS and start implementing SLIs and an SLO for infrastructure availability.
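As a rough illustration of what an availability SLI and SLO check could look like, the sketch below computes the SLI as the share of passed health checks in a window and compares it with the target. The `AvailabilityWindow` class, the `meets_slo` helper, and the sample counts are hypothetical; in practice the counts would come from whatever monitoring the project already has.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityWindow:
    """Health-check results collected over one SLO window (hypothetical data)."""
    good_checks: int   # health checks that passed
    total_checks: int  # all health checks run in the window

    def sli(self) -> float:
        """Availability SLI: fraction of health checks that passed."""
        if self.total_checks <= 0:
            raise ValueError("no health-check data in this window")
        return self.good_checks / self.total_checks


def meets_slo(window: AvailabilityWindow, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return window.sli() >= slo_target


# Assumed example: one check per minute for 30 days, 30 of them failed.
window = AvailabilityWindow(good_checks=43_170, total_checks=43_200)
print(f"SLI: {window.sli():.4%}, meets 99.9% SLO: {meets_slo(window, 0.999)}")
```

Starting with a single, simple SLI like this keeps the first iteration small: the team only needs a count of good and total checks before it can begin tracking the SLO.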

Once the team starts tracking the SLO and understands its responsibility to defend it, we can move forward with the next steps.

Some useful tips

Deployments should only be done after approval from the SRE team.

Do deployments only when you have a sufficient error budget.

Keep a disaster recovery and rollback plan ready.
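The approval tips above can be sketched as a simple deployment gate. This is only one possible policy: the function names, the request counts, and the 20% minimum reserve are assumptions chosen for illustration, and each team should pick its own threshold.

```python
def remaining_budget_fraction(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events          # failures observed so far
    return 1.0 - actual_bad / allowed_bad


def approve_deployment(remaining: float, min_reserve: float = 0.2) -> bool:
    """Approve only if at least min_reserve of the budget is left (assumed policy)."""
    return remaining >= min_reserve


# Assumed example: 1,000,000 requests this window, 500 failed, SLO of 99.9%.
remaining = remaining_budget_fraction(0.999, good_events=999_500, total_events=1_000_000)
print(f"Remaining budget: {remaining:.0%}, approve: {approve_deployment(remaining)}")
```

Keeping a reserve rather than spending the budget down to zero leaves room for a failed deployment and its rollback without breaching the SLO.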

Conclusion

This beginner-level implementation is best suited to getting started with SRE. In the next blog post, we'll see how to move on to the next step in implementing SRE and building an SRE team. After this series, we'll have a proper SRE team doing automation, toil reduction, code review, observability setup, and capacity planning, and taking responsibility for the availability, reliability, and performance of the application, including its infrastructure.

Thanks for reading!

Gagan Goswami

DevOps & SRE, focusing on architecting, automating, and optimizing complex deployments on AWS. I’m a 5x AWS Certified & Datadog Technical Specialist.