Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
SRE Principles
- Find Service Level
Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) are the parameters by which the reliability, availability, and performance of a service are measured.
- Error Budgets
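To make the relationship between an SLI, an SLO, and an error budget concrete, here is a minimal Python sketch. It assumes an availability SLI defined as the fraction of successful requests. The function names and the figure of 400 failed requests are illustrative assumptions, not part of any standard tooling.

```python
from fractions import Fraction


def availability_sli(good_requests: int, total_requests: int) -> Fraction:
    """SLI (assumed definition): fraction of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return Fraction(good_requests, total_requests)


def error_budget(slo: Fraction, total_requests: int) -> int:
    """Allowed failed requests: (1 - SLO) * total, rounded down."""
    return int((1 - slo) * total_requests)


SLO = Fraction("0.999")   # 99.9% availability target
total = 1_000_000         # requests in the four-week window
failed = 400              # hypothetical number of failed requests

sli = availability_sli(total - failed, total)
budget = error_budget(SLO, total)
remaining = budget - failed

print(f"SLI = {float(sli):.4%}, meets SLO: {sli >= SLO}")
print(f"error budget = {budget} errors, remaining = {remaining}")
```

Exact fractions are used instead of floats so that (1 - 0.999) × 1,000,000 comes out as exactly 1,000 rather than a value with floating-point noise.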
An error budget is 1 minus the SLO of the service. A service with a 99.9% SLO has a 0.1% error budget. If our service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.
- Eliminate Toil
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. An SRE's job is to eliminate as much toil as possible through automation.
- Automate Everything
Automation by the SRE team provides:
– Consistency as systems scale
– A platform for extending to other systems
– Faster repairs for common problems
– Faster action than humans
– Time savings by decoupling operator from
- Support Releases
Running reliable services requires reliable release processes.
Continuously build and deploy, including
– Automating check gates
– A/B deployments and other methods for checking sanity
SREs are not afraid to roll back a problematic release.
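An automated check gate for an A/B deployment might look like the following Python sketch. It is a hypothetical example, not a prescribed method: the function name `gate_decision`, the 1.5× error-rate threshold, and the minimum sample size are all assumptions chosen for illustration. The gate compares the error rate of the candidate release (B) against the current release (A) and recommends promoting, rolling back, or waiting for more traffic.

```python
def gate_decision(
    baseline_errors: int,
    baseline_requests: int,
    candidate_errors: int,
    candidate_requests: int,
    max_ratio: float = 1.5,     # assumed tolerance: candidate may err up to 1.5x baseline
    min_requests: int = 1000,   # assumed minimum sample before judging the candidate
) -> str:
    """Return 'promote', 'rollback', or 'wait' for a candidate release."""
    if baseline_requests <= 0:
        raise ValueError("baseline_requests must be positive")
    if candidate_requests < min_requests:
        return "wait"  # not enough traffic on the candidate yet

    baseline_rate = baseline_errors / baseline_requests
    candidate_rate = candidate_errors / candidate_requests

    # Roll back when the candidate is clearly worse than the baseline.
    if candidate_rate > baseline_rate * max_ratio:
        return "rollback"
    return "promote"


print(gate_decision(10, 10_000, 12, 10_000))  # candidate close to baseline
print(gate_decision(10, 10_000, 50, 10_000))  # candidate much worse
print(gate_decision(10, 10_000, 0, 100))      # too little candidate traffic
```

Note that with a baseline error rate of zero, this sketch rolls back on any candidate error. A real gate would likely add an absolute tolerance and look at more signals than error rate alone.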