Implicit is better than explicit
I have some thoughts on Google Cloud’s latest global outage! The incident report contains some interesting nuggets, if you look deep enough.
For the most part it’s a pretty typical failure for a complicated distributed system. Google’s service for authorizing inbound API calls was missing handling for an edge case triggered while parsing empty configuration values, and hitting that case caused the service to panic and crash. When Google pushed out a configuration change containing blank values, replicating that change globally turned it into a global outage:
On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.
Complicated systems are complicated, and a lot of incidents are caused by small bugs which look obvious in retrospect.
The tail end of the previous section combined with the next section of the report is surprising, though (emphasis mine):
Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this.
I’m not surprised that a Google engineer forgot to apply exponential backoff here—anyone else could have done the same thing. I am, however, surprised that it was possible for this failure mode to occur in production in the first place.
Two things seem to have gone horribly wrong here:
- The read from the Spanner table didn’t go over a service mesh.[1]
- Borg (Google’s internal cluster manager, the precursor to Kubernetes) doesn’t seem to apply backoff when containers crash loop.
As an engineer working on GCP, you shouldn’t have to consciously think about either of these. In a large, established engineering organization these concerns get handled implicitly, outside of your application logic. Sometimes implicit is better than explicit, and fault tolerance in a distributed system is a perfect example of this principle.
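For reference, the “randomized exponential backoff” the report mentions is only a little code when you do end up writing it by hand. Here’s a minimal TypeScript sketch; the `withBackoff` name and its parameters are my own illustration, not Google’s implementation:

```typescript
// Minimal sketch of randomized exponential backoff ("full jitter").
// Names and parameters are illustrative, not Google's actual code.
async function withBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
  maxDelayMs = 30_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      // Double the window each attempt (capped), then pick a random point
      // inside it so restarted clients don't all retry in lockstep.
      const windowMs = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      const delayMs = Math.random() * windowMs;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

The jitter is the important part: without the `Math.random()`, every crashed task wakes up on the same schedule and hammers the datastore in waves.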
In a previous post we looked at various implementations of a `getUsers` function, where each implementation was progressively safer and more robust. One major improvement was to ensure that `getUsers` enforced its own page size limits, which prevents callers from exhausting available memory by requesting overly large amounts of data.
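That post’s code isn’t reproduced here, but the shape of the improvement looks roughly like this TypeScript sketch, where `fetchUsersPage` and the limit value are stand-ins:

```typescript
// Sketch of the idea: getUsers clamps the page size itself rather than
// trusting callers to pass something sensible.
interface User {
  id: string;
  name: string;
}

const MAX_PAGE_SIZE = 100; // illustrative limit

async function getUsers(requestedPageSize?: number): Promise<User[]> {
  // Implicit safety: a missing or oversized page size can't exhaust memory.
  const pageSize = Math.min(requestedPageSize ?? MAX_PAGE_SIZE, MAX_PAGE_SIZE);
  return fetchUsersPage(pageSize);
}

// Hypothetical data-access helper so the sketch is self-contained.
async function fetchUsersPage(limit: number): Promise<User[]> {
  return []; // imagine a real query with `LIMIT ${limit}` here
}
```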
If you’re a strict adherent to the principle that “explicit is better than implicit”, then you’ll find that utterly horrifying and prefer to have the caller be explicit about the page size they desire.
Being explicit has merit, but if you militantly follow that idea in all cases then you’ll inevitably end up where Google found themselves the other week. There’s simply too much going on in a modern system for a human to accurately specify every single teeny tiny detail. Mistakes happen, and you’ll miss some things.
I love that the default (read: implicit) retry policy for a Kubernetes pod applies backoff to pod restarts when one crashes during startup. If you’ve ever seen a pod with `CrashLoopBackOff` status in your Kubernetes cluster, then you too have benefited from this implicit behavior. It’s specifically designed to avoid—or at least mitigate—the kind of thundering herd problem Google experienced here.
Conversely, I grieved back in March when Shopify’s card vaulting endpoint started returning intermittent 403s. We declared an incident at work and had to deploy a code change to our retry policy, because the code for depositing cards into Shopify’s vault had never considered a 403 to be transient. In a larger team with more resources available for infrastructure investment we’d be using a service mesh for that network call instead, and the incident would have been resolved in five minutes instead of forty.
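The fix itself was small. Something along these lines, with hypothetical names and a 403/429/5xx classification that’s my illustration rather than our actual vaulting code:

```typescript
// Illustrative sketch only: the fix amounted to teaching the retry policy
// that an intermittent 403 can be transient.
interface HttpResponse {
  status: number;
  ok: boolean;
}

function isTransient(status: number): boolean {
  return status === 403 || status === 429 || status >= 500;
}

async function depositCardWithRetry(
  deposit: () => Promise<HttpResponse>,
  maxAttempts = 3,
): Promise<HttpResponse> {
  let last: HttpResponse | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = await deposit();
    if (last.ok || !isTransient(last.status)) {
      return last;
    }
    // A real implementation would also back off between attempts,
    // as in the earlier sketch.
  }
  return last!; // retries exhausted; surface the last response
}
```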
Implicit is just so much better than explicit.
[1] I’m assuming this is the case from how the report is written, anyway. Talking about implementation of exponential backoff within the context of Service Control sounds like an implementation in code rather than a service mesh configuration.