How seriously do you take your SLOs?
The Google SRE book is one of the most important texts I have ever read. If you are working in the software industry and haven’t read it in its entirety, then I urge you to click the link and read it right now. It will fundamentally change the way you understand the ‘ops’ part of your job, and also radically inform your decision making process with respect to product.
In this post we’ll be digging into a story told in “Chapter 4 - Service Level Objectives,” and reflecting on takeaways you can apply to both your current job and your overall career trajectory.
Introducing Chubby
Chubby is an internal Google service that provides coarse-grained distributed locking functionality, first described in a 2006 research paper. “Coarse-grained” in this context means that Chubby is designed for low-throughput use cases where an application might hold on to a lock for hours or even days, in contrast to high-throughput use cases such as locking a database table for a few seconds during a transaction.
While it might not be intuitively obvious, it turns out that locking semantics are actually a pretty good way of reaching consensus in a distributed system. A classic example is electing a primary node in a database cluster. MongoDB solves the issue by running a complicated election algorithm to determine which node inside a replica set will become the primary, but an alternative design is to simply have every replica attempt to acquire the same advisory lock and have whoever “wins” the locking race become the primary.
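To make that alternative concrete, here is a minimal sketch of lock-based leader election in Python. It stands in a process-local threading.Lock for the distributed lock service, so the single-process setup and all of the names here are illustrative only; this is not Chubby’s actual API.

```python
import threading

# Stand-in for a distributed lock service such as Chubby. In a real system this
# would be an RPC to an external service; a process-local lock keeps the sketch
# self-contained and runnable.
_advisory_lock = threading.Lock()
_primary = None


def try_become_primary(replica_id: str) -> bool:
    """Attempt to grab the shared advisory lock without blocking.

    Whichever replica wins the race becomes the primary; everyone else stays a
    follower (and, in a real system, would watch the lock so it could take over
    if the primary dies).
    """
    global _primary
    if _advisory_lock.acquire(blocking=False):
        _primary = replica_id
        return True
    return False


def replica(replica_id: str) -> None:
    if try_become_primary(replica_id):
        print(f"{replica_id}: won the lock, acting as primary")
    else:
        print(f"{replica_id}: lock already held, acting as follower")


if __name__ == "__main__":
    threads = [threading.Thread(target=replica, args=(f"replica-{i}",)) for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"elected primary: {_primary}")
```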
That’s exactly the approach used in a lot of Google products, and the result of this is a big reduction in complexity across Google’s codebases. Rather than needing to implement consensus algorithms from scratch inside every new product, it is possible to outsource that undifferentiated heavy lifting to Chubby.
There are other use cases for this kind of distributed locking service, and Chubby also has additional functionality such as the ability to emit events in response to state changes. The paper is worth a read if you want the details, but all you really need to know to understand the rest of this post is that Chubby’s distributed locking functionality is broadly useful and many Google products depend on this service, both directly and indirectly.
SLAs ‘n SLOs
When you have an internal service like Chubby, it’s necessary to put in place SLAs between the Chubby team and the teams using their service in exactly the same way you would for a customer-facing service. If you have SaaS customers you promise 99% availability to and you depend on an internal service that makes no guarantees about its own availability, then fulfilling that promise becomes very difficult indeed.
At smaller companies you naturally avoid this issue by having a codebase small enough for your single engineering team to take end-to-end ownership over the whole thing, but this utopia never lasts long. There comes a point in time when you are forced to carve out slices of responsibility from the cake in order to continue scaling.
If you cut cleanly all the way down from the frosting to the platter, you avoid interdependencies between teams. That’s the ideal scenario, but internal platform services like Chubby make that ideal impossible: every team in the organization can’t collectively own Chubby and every other shared service, yet those shared services are valuable precisely because they simplify the implementation of everything built on top of them.
Someone’s gotta own it and be responsible for it.
A couple of definitions, for those unfamiliar:
- An SLA (service-level agreement) is an agreement you make with your customers which guarantees a certain level of service. For instance, you might promise that your app will have an uptime of 99.9%. Oftentimes there will be penalties you have to pay if you miss these guarantees.
- An SLO (service-level objective) is like an SLA, but it is internal. SLOs represent the objectives an engineering team must hit in order to fulfill its SLAs to customers, and oftentimes an SLO will be stricter than the corresponding SLA. In the case of our app’s uptime, we might set an internal SLO of 99.95% uptime. (The sketch after this list puts rough numbers on what these percentages mean in practice.)
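To put rough numbers behind those percentages, here is a quick back-of-the-envelope sketch, assuming a 30-day month; the targets are just the illustrative figures from the definitions above.

```python
# Rough downtime budgets for a 30-day month at a few availability targets.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9995, 0.9999):
    allowed_downtime = MINUTES_PER_MONTH * (1 - target)
    print(f"{target:.2%} availability -> {allowed_downtime:6.1f} minutes of downtime per month")
```

The gap is dramatic: 99% availability leaves room for over seven hours of downtime a month, while 99.95% leaves barely twenty minutes.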
If we promised a 99% uptime to our SaaS customers and we rely upon Chubby for critical functionality, then we would want the Chubby team to commit to some kind of availability-based SLO.
And indeed, the team responsible for building and operating Chubby does have these objectives in place.
Your observed SLO becomes your reality
The story goes that one day the Chubby engineers noticed that whenever Chubby experienced a global outage, it caused downstream service outages that impacted end users. Further investigation of Chubby’s dependent services revealed that the engineers working on those services had haphazardly added dependencies on Chubby, assuming it would never go down.
At the scale of Google, one-in-a-million events happen every day. Every engineer working at Google understands this either through personal experience or by proxy. Even so, the Chubby engineers were so good at keeping Chubby online that fellow Googlers were happy to naïvely assume that Chubby would always be online.
Not only did Google engineers write this code in the first place, but it also made it through code review. It’s hard to overstate just how good the Chubby engineers must be at what they do.
The problem is that it doesn’t really matter what you promise to external stakeholders. What those stakeholders observe becomes reality. If you promise a 99% uptime and deliver a 99.99% uptime, then your customers will start to expect the latter and build with that expectation in mind.
So, what’s the fix here?
Solving Chubby-induced outages
A lot of organizations operate under a nebulous and ill-defined culture of “ownership.” Ownership is—on paper—a great idea. It makes a lot of sense to empower software engineering teams to own their work, but doing so necessarily requires that teams are both held accountable and given some degree of autonomy.
My career in software engineering has taught me that “ownership” tends to be a suitcase term. It oftentimes ends up meaning one of two things:
- A form of “collective” ownership wherein no one really owns anything.
- A cudgel used against engineering teams when objectives are missed, or when something needs to change and the person invoking the term doesn’t know what needs to change.
In organizations where this type of “ownership” is employed, the solution would likely be to tell the Chubby team to fix their outages. It’s a horribly misguided solution, but it’s one I’ve seen play out a number of times in the real world.
Google is not one of these organizations. They—quite literally—wrote the book on SRE. The root cause of the problem here isn’t that the Chubby team can’t hit 100% uptime; it’s the downstream teams adding unnecessary critical dependencies on the Chubby service.
How do you incite change in these downstream teams? A lesser organization would have posted an @channel message in the #devs Slack channel asking everyone to please improve the reliability of their code by factoring in possible Chubby outages. Google took a much more extreme—and effective—approach.
The Chubby team decided that once per quarter they would look at their uptime metrics. If they significantly exceeded their SLO, then they would intentionally take the system down to drop their availability to the level of their SLO.
Intentionally triggering outages in this way forces downstream teams to confront the reality of a Chubby outage sooner rather than later. Unnecessary dependencies on Chubby get flushed out shortly after they are added, because the teams that introduced them are forced to reckon with the realities of distributed systems almost immediately.
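As a rough sketch of the bookkeeping this implies, assuming a 90-day quarter and a hypothetical 99.95% SLO (these numbers and the function below are illustrative only, not the Chubby team’s actual tooling):

```python
SECONDS_PER_QUARTER = 90 * 24 * 3600  # roughly one quarter


def planned_outage_seconds(observed_availability: float, slo: float) -> float:
    """Seconds of synthetic downtime needed to bring availability down to the SLO.

    Returns 0 if the service is already at or below its SLO for the quarter.
    """
    excess = max(0.0, observed_availability - slo)
    return excess * SECONDS_PER_QUARTER


# Example: the service ran at 99.99% availability against a 99.95% SLO.
minutes = planned_outage_seconds(observed_availability=0.9999, slo=0.9995) / 60
print(f"planned downtime this quarter: {minutes:.0f} minutes")
```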
How seriously do you take your SLOs?
A good test of your organization’s engineering culture is to imagine how this problem would be solved where you work. I think the majority of engineering shops out there would find Google’s solution unpalatable, and yet if you think it through from first principles it is the only good solution.
The reality is that Chubby will go down at some point in time, intentionally or not. Downstream services must be capable of dealing with this inevitability.
How can you properly incentivize those service teams to build in resilience without intentionally taking the system down intermittently? It would be fantastic if we could simply rely on everyone else’s innate sense of engineering excellence, but the reality is that shortcuts get taken every day in the interest of meeting deadlines or stuffing promo packets.
This is one of the motivating factors behind Netflix’s adoption of chaos engineering. On the one hand, distributed systems fail in spectacular and hard-to-predict ways, and chaos engineering is a good fit for diagnosing these unpredictable failure modes. On the other hand, it’s a reality that not all code will be written perfectly with resilience in mind.
Injecting failures in a controlled fashion is one way of surfacing both of these issues so they can be resolved. By breaking things regularly we are able to both improve the resilience of our codebase and build organizational scar tissue. Engineers who work in places that leverage chaos engineering are much better firefighters than their counterparts because they build far more operational expertise through repeated exposure to outages.
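To give a flavor of what controlled failure injection can look like at the code level, here is a toy sketch of a client wrapper that refuses a configurable fraction of calls to a dependency. The class names and the failure_rate knob are invented for illustration; this is not how Chaos Monkey or any real chaos tooling is implemented, but the effect on callers is similar: they are forced to exercise their fallback paths.

```python
import random


class InjectedFailureError(RuntimeError):
    """Raised by the wrapper in place of a real outage."""


class FailureInjectingClient:
    """Wraps a client for some dependency and deliberately fails a fraction of calls.

    Real chaos tooling typically works at the infrastructure level (terminating
    instances, dropping traffic), but the point is the same: callers must handle
    failure as a normal, expected event.
    """

    def __init__(self, real_client, failure_rate: float = 0.01):
        self._real_client = real_client
        self._failure_rate = failure_rate

    def call(self, *args, **kwargs):
        if random.random() < self._failure_rate:
            raise InjectedFailureError("injected failure: pretend the dependency is down")
        return self._real_client.call(*args, **kwargs)


if __name__ == "__main__":
    class EchoClient:
        def call(self, message):
            return message

    client = FailureInjectingClient(EchoClient(), failure_rate=0.3)
    for i in range(5):
        try:
            print(client.call(f"request {i}"))
        except InjectedFailureError as err:
            print(f"request {i} failed: {err} (fallback logic would run here)")
```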
We often think about SLOs as a minimum target, and perceive exceeding them to always be a good thing. In reality, significantly exceeding our SLOs can cause damage elsewhere in the organization by giving engineers on other teams a false sense of security. To truly take ownership of your SLOs, you need to take them seriously in both directions.
Sometimes it’s a good idea to intentionally kill a worker node, just to see what happens downstream.