1 July, 2025 · 4 minute read

You can’t always retry a 5xx

Networks are unreliable. Even when services are otherwise healthy, transient failures like timeouts, dropped connections, and DNS blips can cause your application to fail for no good reason.

Sometimes that’s OK. If you’re building a simple system that only makes a few network requests then it’s reasonable to let these transient failures bubble up and let the caller decide how they’d like to handle them.

But in more complicated systems with deeper call graphs, this error handling strategy breaks down. It’s hard to justify failing an entire request because one out of a hundred network calls happened to fail. Choosing to fail fast in this scenario means failing often, and this directly impacts your ability to meet SLAs.
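As a rough illustration (with made-up numbers): if each network call succeeds 99.9% of the time and a single request fans out to 100 of them, then failing fast on the first error caps your success rate at roughly 0.999^100 ≈ 90.5%, before your own code even gets a chance to go wrong.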

Most modern systems involve endpoints which make lots of network requests, so you typically want some sort of retry logic around nearly every cross-network request you make.

Designing a basic retry policy

Some errors are not worth retrying. A 400 or 401 HTTP status is extremely unlikely to resolve itself through additional attempts, so it’s better to bubble up the error in that case. When retrying, we’re predominantly concerned with transient errors which we reasonably expect to disappear with a retry.

Network errors are almost always transient in nature, and most engineers tend to think of server-side 5xx errors as being largely transient, too.

Below is an example function which POSTs data to a URL and retries the request when either of these errors is encountered. If we were to also retry 429s, this would be pretty close to a “typical” retry policy:

import axios, { isAxiosError } from 'axios';

function isTransientError(error: unknown) {
  return (
    isAxiosError(error)
    && (
      // network error (no response was received, so there's no status code)
      !error.status
      // 5xx http error
      || (
        error.status >= 500
        && error.status <= 599
      )
    )
  );
}

async function post(
  url: string,
  data: Record<string, unknown>,
) {
  const maxAttempts = 3;
  let attempt = 0;

  while (attempt < maxAttempts) {
    try {
      return await axios.post(url, data);
    } catch (error) {
      // bubble up non-transient errors, and don't swallow the final failure
      if (!isTransientError(error) || attempt === maxAttempts - 1) {
        throw error;
      }
    }

    // back off exponentially, with jitter
    const jitter = Math.random() * 200;
    const delay = Math.min(
      1_000 * Math.pow(2, attempt) + jitter,
      10_000,
    );
    await new Promise(resolve => setTimeout(resolve, delay));

    attempt++;
  }
}

Our basic retry policy doesn’t stack up well here. If we were making GET requests it’d be fine—assuming we aren’t retrying excessively and overloading the target server—but POST endpoints need to be treated carefully, because they typically trigger a side effect.

If that side effect isn’t idempotent and carries real-world consequences, then carelessly retrying could do serious damage.

Consider the following server, and suppose it is running behind a load balancer with a one second idle timeout:

import Koa from 'koa';
import { koaBody } from 'koa-body';
import Router from '@koa/router';
import { setTimeout as sleep } from 'node:timers/promises';

const app = new Koa();
const router = new Router();

const data = new Map<string, any>();

router.get('/:key', async (ctx) => {
  const key = ctx.params.key;
  if (!data.has(key)) {
    ctx.status = 404;
  } else {
    ctx.status = 200;
    ctx.body = data.get(key);
  }
});

router.post('/:key', async (ctx) => {
  const key = ctx.params.key;
  const value = ctx.request.body.value;

  data.set(key, value);

  // sleep for 2s, longer than the load balancer's one second idle timeout
  await sleep(2_000);

  ctx.status = 200;
});

const server = app
  .use(koaBody())
  .use(router.routes())
  .use(router.allowedMethods())
  .listen(3000);

When a POST request is sent to this server, the following sequence of events occurs:

  1. The load balancer receives the POST request, and starts its one second timer.
  2. The load balancer forwards the request on to the server.
  3. The server immediately updates its in-memory datastore.
  4. The server starts sleeping for two seconds.
  5. After one second, the load balancer times out the request and responds to the caller with an HTTP 504 status code.
  6. One second after that, the server tries to respond with an HTTP 200 status code. This is unsuccessful because the load balancer has given up on waiting for a response.

In this scenario the side effect the caller requested was successfully executed, despite the caller receiving a 5xx status code. Here a retry wouldn’t be the end of the world, because the operation being performed (setting a key in a map) is idempotent. But this is not always the case!
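To make that concrete, here is a sketch of what a less forgiving variant of that handler could look like. It is a hypothetical addition to the server above (it reuses the router and sleep from that example): instead of setting a key, it appends an event to a list, so every retried request that reaches the server leaves behind a duplicate record.

// Hypothetical, non-idempotent variant of the POST handler above.
// Reuses `router` and `sleep` from the earlier server example.
const events = new Map<string, unknown[]>();

router.post('/:key/events', async (ctx) => {
  const key = ctx.params.key;
  const value = ctx.request.body.value;

  // append rather than set: a retried request adds a second copy of `value`
  const existing = events.get(key) ?? [];
  events.set(key, [...existing, value]);

  // same artificial delay, so the load balancer still responds with a 504 first
  await sleep(2_000);

  ctx.status = 200;
});

A client following our basic retry policy would see the 504, retry, and quietly record the same event twice.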

A real-world example

In June 2024, the feature flagging platform Hypertune experienced a system-wide outage. At the time, their SDK was configured to retry feature flag retrieval over a one minute period, and at work we had a call to their initIfNeeded method before reading any flags, as their docs recommend.

The end result was that all calls to API endpoints which consumed feature flags failed with an HTTP 504 response code. Some developers using our API had retry logic which triggered on 5xx status codes, and wound up purchasing duplicate orders as a result. The full incident writeup is here if you are interested.

Idempotency keys might have helped here, but they’re not always a silver bullet. Stripe, for instance, implements shallow idempotency, which means HTTP requests made with the same idempotency key see exactly the same HTTP response as the initial request. If you saw a 504 status code on your initial call to paymentIntents.capture, then retrying with the same idempotency key will not help you ensure the payment intent was actually captured.
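If you end up in that position, one option is to fetch the payment intent and check its status before deciding whether a retry is needed. The sketch below uses the Stripe Node SDK; the exact statuses you need to handle (and what handling them means) will depend on your integration, so treat it as a starting point rather than a recipe.

import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Sketch: after a 504 from paymentIntents.capture, work out whether the capture
// actually happened before trying again. `paymentIntentId` is assumed to be
// known from the original request.
async function captureIfNeeded(paymentIntentId: string) {
  const intent = await stripe.paymentIntents.retrieve(paymentIntentId);

  if (intent.status === 'succeeded') {
    // the original capture made it through, so there's nothing left to do
    return intent;
  }

  if (intent.status === 'requires_capture') {
    // the capture never landed, so it's safe to try again
    return stripe.paymentIntents.capture(paymentIntentId);
  }

  // anything else (processing, canceled, ...) needs domain-specific handling
  throw new Error(`unexpected payment intent status: ${intent.status}`);
}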

Safely retrying 5xx status codes

The following is a good blueprint to follow for safely retrying 5xx status codes:

  1. If the operation is deeply idempotent ⇒ feel free to naïvely retry the operation.
  2. Otherwise ⇒ verify the side effect didn’t happen, and then retry.

The second case can be trickier than it first appears. Ideally you’d simply try to GET the resource you’re trying to create, but this isn’t always as easy as it sounds. The two biggest edge cases you’ll run into are:

  1. Not having an ID to fetch the resource by. In this case your best option is typically to fetch the last N created resources from whatever list endpoint your API vendor offers and search for the thing you were trying to create. If the API supports attaching metadata to resources, then stashing a unique ID there to use as a lookup key is helpful.
  2. Eventual consistency. Shopify’s Order objects, for example, take a while to show up in their Admin API. You need to poll the order query for a little while before you can safely retry creation. (There’s a sketch combining both of these workarounds below.)
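Putting those two workarounds together, a verify-then-retry helper might look something like the sketch below. Everything here is hypothetical: `client` stands in for whatever SDK your vendor ships, its create and list methods are assumptions, `clientRequestId` is the unique metadata value we attach at creation time so we can find the resource later, and isTransientError is the helper from earlier in the post.

import { randomUUID } from 'node:crypto';
import { setTimeout as sleep } from 'node:timers/promises';

// Hypothetical vendor SDK surface; swap in whatever your API actually offers.
interface Order {
  id: string;
  metadata: Record<string, string>;
}

interface OrderClient {
  create(input: { sku: string; metadata: Record<string, string> }): Promise<Order>;
  // assumed to return the most recently created orders first
  list(params: { limit: number }): Promise<Order[]>;
}

async function createOrderSafely(client: OrderClient, sku: string) {
  // unique ID stashed in metadata so we can find the order if the response goes missing
  const clientRequestId = randomUUID();

  try {
    return await client.create({ sku, metadata: { clientRequestId } });
  } catch (error) {
    // isTransientError is the same helper defined at the start of the post
    if (!isTransientError(error)) {
      throw error;
    }
  }

  // The create call failed with a transient error, so the order may or may not
  // exist. Poll the list endpoint for a while to account for eventual consistency.
  for (let attempt = 0; attempt < 5; attempt++) {
    const recent = await client.list({ limit: 50 });
    const existing = recent.find(
      (order) => order.metadata.clientRequestId === clientRequestId,
    );
    if (existing) {
      // the side effect happened after all; don't create a duplicate
      return existing;
    }
    await sleep(1_000 * (attempt + 1));
  }

  // We're now reasonably confident the order was never created, so retry once.
  return client.create({ sku, metadata: { clientRequestId } });
}

How long you poll for, and how many recent resources you scan, depends entirely on how stale your vendor’s list endpoint can be, so those numbers are placeholders.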

