Overall, great write up. As an engineer mostly involved on the build side it introduced me to a number of good ideas and confirmed others. It’s Google DevOps++.
Here is a quick bite-sized packaging of the main takeaways. Watch out for the key points.
- SRE teams are staffed by a mix of sys admins and software developers
- The aim is to spend no more than 50% of each individual’s time on ‘toil’. SREs write code towards that aim of making Google’s systems run themselves, which also results in a high acceptance of change
Acknowledge the fundamental tension between ops (stability) and dev (change) – align both sides to focus on delivery speed within acceptable risk boundaries.
The error budget stems from the observation that 100% is the wrong reliability target for basically everything – the right target is a product specific question.
The error budget determines how unreliable the service is allowed to be within a single quarter. As long as there is error budget remaining, new releases can be pushed. This gives SRE and Dev teams focus and structure to find the right balance between innovation and reliability.
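To make the arithmetic concrete, here is a minimal sketch that treats the error budget as the number of requests allowed to fail in a quarter. The SLO value, the request counts and the function names are my own assumptions for illustration, not figures or code from the book.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests allowed to fail in the period for a given SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    # round() absorbs floating-point noise in (1 - slo) * total_requests.
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Failures still allowed before the budget is exhausted (may go negative)."""
    return error_budget(slo, total_requests) - failed_requests


def can_release(slo: float, total_requests: int, failed_requests: int) -> bool:
    """New releases are allowed only while some error budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0
```

With a hypothetical 99.9% SLO and one million requests in the quarter, the budget is 1,000 failed requests. A service that has already failed 400 of them can still ship, while one that has failed 1,000 cannot.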
- Toil is defined as boring, repetitive tasks with no enduring value. At least 50% of your time is supposed to be spent on building stuff to eliminate toil
- Toil leads to boredom > discontent > quitting
Explicitly align the risk taken by a given service with the risk the business is willing to bear – make a service reliable enough, but no more reliable than it needs to be. Work with the product owners directly to establish the threshold
Golden signals to monitor on a service – latency, traffic, errors and saturation
Machines are instrumented out of the box to a large extent. Discards email as primary notification mechanism. Defined levels of alerts:
- Pages – respond now
- Tickets – respond later
Emphasis on building systems to be automatic not just ‘automated’ – system should require minimal babysitting and take reasonable steps to respond to anomalies – for example, DB notices problems and fails over automatically
Focus on a self-sufficient/self service model for the consuming teams. `Release engineering` is a separate function – which develops tools and best practices.
Builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine. Instead, builds depend on known versions of build tools, such as compilers, and dependencies, such as libraries – in other words, always reproducible
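One way to picture "known versions" is pinning every tool and library to a content hash and refusing to build when anything differs. The sketch below is only my illustration of that idea, not how Google's build system works. The names `verify_pinned` and `sha256_hex` and the artifact contents are hypothetical.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(artifacts: Dict[str, bytes], pins: Dict[str, str]) -> List[str]:
    """Return the names of pinned inputs that are missing or do not match.

    `pins` maps an input name to its expected SHA-256 digest;
    `artifacts` maps an input name to the bytes actually available.
    An empty result means every pinned input is exactly the expected one.
    """
    mismatches = []
    for name, expected in pins.items():
        data = artifacts.get(name)
        if data is None or sha256_hex(data) != expected:
            mismatches.append(name)
    return sorted(mismatches)
```

A build wrapper could call `verify_pinned` before compiling and abort on a non-empty result, so that whatever happens to be installed on the build machine never leaks into the output.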
- All code goes into the main branch
- Branch from main for a release > this is never merged back
- Bug fixes are submitted to the main and then cherry picked into the branch for inclusion in the release
- In addition to CI, tests are run in the context of what’s being released
- An independent testing environment runs system tests on packaged build artifacts
- Config files are external to the binary
- Dynamically changing config goes into a central storage
Being on call
- Flexible alert delivery systems that can dispatch pages via multiple mechanisms (email, SMS, robot call, app) across multiple devices
- Limiting the number of engineers in the on-call rotation ensures that engineers do not lose touch with the production systems
- Open a bug for every issue reported
- While troubleshooting a high-volume service, it might not be feasible to log everything. Log one out of every 1000 requests (for example) and use a statistical sampling approach.
- Primary goal is to ensure that the incident is documented, root cause understood and preventive steps are put in place
- Blameless postmortems are a core tenet – focus on the contributing causes of the incident without indicting any individual or team for inappropriate behaviour. IMO this has to do with psychological safety – if we concentrate on blame, it inhibits open sharing.
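The log sampling idea from the list above can be sketched in a few lines. This is a minimal, assumption-laden version: it keeps every Nth request using a simple counter, and the class and function names are mine, not the book's.

```python
import itertools


class EveryNthSampler:
    """Decides whether to log a request, keeping one out of every `n`."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self._counter = itertools.count()

    def should_log(self) -> bool:
        # True for request 0, n, 2n, ... and False for everything in between.
        return next(self._counter) % self.n == 0


def estimate_total(sampled_count: int, n: int) -> int:
    """Scale a count seen in the sampled logs back up to an estimate."""
    return sampled_count * n
```

If 3 sampled log lines show a particular error at a 1-in-1000 rate, `estimate_total(3, 1000)` suggests roughly 3,000 affected requests. A random or hash-based choice is a common alternative when traffic is periodic, since a fixed stride can line up with the pattern.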
Testing for reliability
Similar to the take on reliability – the theme is about fitness for purpose. The level of testing is proportional to the criticality of the system in question – as opposed to thoughtless statements like ‘100% coverage’
Reliable product launch at scale
- A dedicated consulting team within SRE tackles the task of launching at scale – staffed with experienced SREs. The aim is to have a process which is:
- Lightweight – engineers will sidestep a burdensome process
- Adaptable – caters to everything from small changes to high-visibility public announcements
- Launch control works on a checklist basis – points are mostly drawn from experience and serve to provide the appropriate level of rigor/facilitate conversations
- Updates go out in a rolling manner with verification steps interspersed
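The rolling update with interspersed verification mentioned above could look something like this. It is a rough sketch under my own assumptions: the `update` and `verify` callbacks, the host names and the batch size are all hypothetical stand-ins for whatever the real deployment tooling provides.

```python
from typing import Callable, List, Sequence


def rolling_update(
    hosts: Sequence[str],
    batch_size: int,
    update: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[str]:
    """Update hosts batch by batch, verifying each batch before moving on.

    Stops at the first batch that fails verification, leaving the remaining
    hosts untouched. Returns the hosts that were updated and verified.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    done: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            update(host)
        if not all(verify(host) for host in batch):
            break  # halt the rollout; later batches keep the old version
        done.extend(batch)
    return done
```

The point of the verification step is that a bad release only ever reaches one batch before the rollout halts, rather than the whole fleet.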
It has taken Google 10 years to fine-tune the process, and the book admits that there have been low points where the difficulty of launching a new service had become ‘legendary’.
In the next post…