How to write robust webhook handlers
Webhooks are surprisingly hard to process correctly. Almost every software project I’ve worked on has needed to process webhooks from third-party vendors, and almost every team I’ve worked with has missed one detail or another in the implementation.
In this post I’ll go over some of the details that you should think about when designing an endpoint to process webhook events.
Webhook best practices
Validate the event
Most webhook vendors will send headers alongside the event payload which you can use to verify that the event truly originated from the vendor, and not from a random person who guessed the URL of your handler.
Some vendors—like Stripe—make this easy, and offer a convenient helper function for you to use:
async function webhookHandler(ctx) {
  const signature = ctx.headers['stripe-signature'];

  let event;
  try {
    event = stripe.webhooks.constructEvent(
      ctx.rawBody,
      signature,
      process.env.STRIPE_WEBHOOK_SECRET,
    );
  } catch (error) {
    // report error
    return { status: 400 };
  }

  // process event
}
Other vendors require you to compute the webhook signature yourself, like Zoom:
import crypto from 'node:crypto';

async function webhookHandler(ctx) {
  const expectedSignature = ctx.headers['x-zm-signature'];
  const timestamp = ctx.headers['x-zm-request-timestamp'];

  // Zoom signs the string `v0:{timestamp}:{rawBody}` with your webhook secret
  const message = `v0:${timestamp}:${ctx.rawBody}`;
  const hmac = crypto
    .createHmac('sha256', process.env.ZOOM_WEBHOOK_SECRET)
    .update(message)
    .digest('hex');
  const computedSignature = `v0=${hmac}`;

  if (expectedSignature !== computedSignature) {
    // report error
    return { status: 400 };
  }

  // process event
}
If your vendor doesn’t support some way of verifying webhook authenticity, then you should ask them to add it. It’s an important security feature that helps protect you against forged requests and replay attacks.
Don’t assume ordering
Webhook vendors typically don’t guarantee that events will arrive at your endpoint in the order they occurred. Making such a guarantee is very difficult, and it implies delaying every subsequent event whenever your handler repeatedly fails to process a particular one.
This means that while you’d intuitively expect to never receive a customer.updated event after a customer.deleted event for the same customer, in reality you may very well need to process that sequence in production.
The easiest way of working around this problem is to simply retrieve the resource via the vendor’s API, to ensure your handler is always working with the freshest possible data:
// ❌
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);

  if (event.type === 'customer.deleted') {
    await customersRepo.deleteById(event.data.object.id);
  } else if (event.type === 'customer.updated') {
    await customersRepo.upsert(event.data.object);
  }
}

// ✅
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);

  if (event.type.startsWith('customer.')) {
    // always fetch the latest state rather than trusting the payload
    const customer = await stripe.customers.retrieve(
      event.data.object.id,
    );
    await syncCustomer(customer);
  }
}

async function syncCustomer(
  customer: Stripe.Customer | Stripe.DeletedCustomer,
) {
  if (customer.deleted) {
    await customersRepo.deleteById(customer.id);
  } else {
    await customersRepo.upsert(customer);
  }
}
This approach does increase processing latency and put more pressure on your API rate limits, but this tradeoff is generally worth making unless you are dealing with a very high volume of webhooks.
Make them idempotent
It is not possible to guarantee exactly-once delivery in a distributed system. Your options are “at most once” or “at least once”, and almost all vendors that fire webhooks will opt for the latter.
While you can’t have exactly-once delivery, you can have exactly-once processing within the boundaries of your system. You do this by adding an idempotency layer on top of your webhook handler. A very minimal (and not production-grade) example looks like this:
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);

  const result = await redis.set(event.id, '1', {
    EX: 60 * 60 * 24, // 1 day
    NX: true,
  });
  if (result !== 'OK') {
    // bail out
  }

  // process event
}
In the example I’m using Redis as my idempotency store. I’m using the event’s ID as an idempotency key, and the NX option means Redis will only apply my SET command if no value has been previously set for the key. I expire the key after 24 hours, because in this fictitious example my webhook vendor only retries delivery for that long—and some vendors may reuse event IDs in future.
This isn’t production grade because you normally want at least two idempotency keys for each webhook. One key is used to filter out redeliveries from the webhook vendor, and the other key is used to track whether your own system is currently processing the webhook—a good segue into the next component of a robust webhook handler.
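To make the two-key idea concrete, here is a hypothetical sketch, again using node-redis; the claimEvent and markProcessed helpers and the key prefixes are illustrative names rather than anything standard:

// `processed:` marks events that finished successfully and filters out
// vendor redeliveries; `processing:` is a short-lived lock that stops a
// concurrent redelivery while the first attempt is still in flight.
async function claimEvent(event) {
  if (await redis.exists(`processed:${event.id}`)) {
    return false; // already handled, acknowledge and bail out
  }

  const lock = await redis.set(`processing:${event.id}`, '1', {
    EX: 60 * 5, // lock expires on its own if the worker crashes mid-run
    NX: true,
  });
  return lock === 'OK';
}

async function markProcessed(event) {
  await redis
    .multi()
    .set(`processed:${event.id}`, '1', { EX: 60 * 60 * 24 })
    .del(`processing:${event.id}`)
    .exec();
}

The caller (your handler, or the background worker discussed next) would check claimEvent before doing any work and call markProcessed once processing succeeds; if processing fails, the processing: lock simply expires and the vendor’s next redelivery gets another chance.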
Process events asynchronously
The following is an antipattern:
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);
  // <idempotency>

  await doSomeHeavyProcessing(event);

  return { status: 200 };
}
All network requests should have a timeout defined, and the POST request your vendor is making to your backend is no different. Most vendors will document the request timeout they’ve chosen, and if your handler doesn’t respond with a 200 OK status code within that timeout they’ll mark the delivery as failed and retry it later on.
The exact timeout depends on the vendor. Some are more generous than others—below are some examples.
- HubSpot: 1 second(!)
- Slack: 3 seconds
- Zoom: 3 seconds
- Chargebee: multiple timeout components, up to 60s
- Marketo: 30 seconds
- Twilio: configurable
Even if you’ve got an idempotency layer in place, you still want to avoid timing out your vendor’s POST. Vendors like Shopify and Stripe will stop sending webhooks to you if too many deliveries fail within a certain period of time.
You can solve this by moving all of the heavy work into the background. Ideally, you’d put the events in a queue so you can retry processing and dead letter events that refuse to process successfully:
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);
  // <idempotency>

  await sqs.sendMessage({
    MessageBody: JSON.stringify(event),
    QueueUrl: 'https://...',
  }).promise();

  return { status: 200 };
}
Note the possibility of a partial failure here. If writing the idempotency key into Redis succeeds but enqueuing the message fails, then you’ll end up dropping the webhook entirely.
There are a few ways of solving this problem. The most obvious is to simply retry the queue operation a few times, although those retries need to fit within the vendor’s timeout. You can also implement a rollback on the idempotency store so that the next redelivery doesn’t bail out:
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);
  // <idempotency>

  try {
    await retry(() =>
      sqs.sendMessage({
        MessageBody: JSON.stringify(event),
        QueueUrl: 'https://...',
      }).promise(),
    );
  } catch (error) {
    // fire and forget
    redis.del(event.id).catch((error) => {
      // ...
    });
    throw error;
  }

  return { status: 200 };
}
It’s still possible to get very unlucky and have all of your retries fail in addition to the Redis rollback, which leaves you in an awkward state. Ultimately you shouldn’t worry too much about this—even if you design a system that can enqueue the webhook in the face of an arbitrarily long network problem, you will still end up missing some webhooks. This is because the vendor themselves could fail to deliver webhooks to you in the case of network disruptions between them and your server!
Periodically run a full resync
It’s not possible to completely protect against dropped events, so you should have a cronjob which periodically performs a full synchronization between your system and your vendor.
Say you’re using webhook events to maintain a copy of your Stripe customer data. You could run the following job to pull everything from Stripe and save it into your database:
async function syncAllStripeCustomers() {
  // handle deleted customers
  for await (const customer of customersRepo.list()) {
    const stripeCustomer = await stripe.customers.retrieve(customer.id);
    if (stripeCustomer.deleted) {
      await syncCustomer(stripeCustomer);
    }
  }

  // sync over any other customers
  await stripe.customers.list().autoPagingEach(async (customer) => {
    await syncCustomer(customer);
  });
}
In addition to “catching up” on any missed webhooks, this sync job also helps fix any inconsistent state caused by your webhook handlers. It’s my observation that almost all webhook handlers end up relying on event ordering to some degree. Even if your system is perfect today, it’s only a matter of time before someone starts relying on the data inside the event payload instead of looking up the resource from the API in the name of efficiency.
Sometimes you are even forced to do so. If your vendor is firing webhooks to your backend at a rate greater than the API rate limit they offer to you, then it’s impossible to look up the latest data from their API without totally exhausting your rate limit—preventing you from doing anything other than process their webhooks.
Buffer webhooks (at scale, maybe)
If you’re operating at a scale where webhooks are coming in faster than your API rate limit and your vendor won’t give you an increase, then you can’t hit their API for every single webhook that comes in. If you’re okay with introducing some processing delays, then you can solve the problem by buffering your webhook events.
There are two main solutions.
Priority queue
The first idea is to store all of your incoming events in a priority queue, using each event’s timestamp as its priority value. You want to keep events in the queue for a period of time before processing them, which gives your system time to account for delayed event deliveries.
This also requires some form of grouping. If you store everything in a single queue and reset a single debounce timer every time an event comes in, the queue can grow without bound whenever your vendor consistently sends events faster than your debounce window.
One idea is to group events by their resource’s ID. For example, all webhooks relating to a particular Stripe charge can be grouped under that charge’s ID and that group can be debounced independently of any other charge. It’s very unlikely that you’ll end up with a particular resource that changes faster than your debounce time over an extended period of time.
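For illustration only, here is a rough sketch of that idea built on a Redis sorted set, with the event timestamp as the score and a resource key as the member. The bufferEvent and drainDueEvents names, the webhook-buffer key, and the debounce window are all assumptions, and drainDueEvents would need to run on a short interval:

const DEBOUNCE_MS = 60 * 1000; // assumed debounce window

async function bufferEvent(event) {
  // Group by resource: re-adding the same member just bumps its score,
  // which is what debounces bursts of events for the same resource.
  await redis.zAdd('webhook-buffer', {
    score: event.created * 1000, // Stripe events carry a unix `created` timestamp in seconds
    value: `${event.data.object.object}:${event.data.object.id}`,
  });
}

async function drainDueEvents() {
  // Anything whose latest event is older than the debounce window is due.
  const cutoff = Date.now() - DEBOUNCE_MS;
  const due = await redis.zRangeByScore('webhook-buffer', 0, cutoff);

  for (const key of due) {
    await redis.zRem('webhook-buffer', key);
    const [objectType, objectId] = key.split(':');
    if (objectType === 'customer') {
      // fetch the latest state once per debounce window and sync it
      const customer = await stripe.customers.retrieve(objectId);
      await syncCustomer(customer);
    }
  }
}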
Deduplicate by resource
The priority queue idea tends to be a bit more complicated than it’s worth, mainly because there are no good off-the-shelf priority queue components that support a high cardinality priority value. Most priority queue implementations I’ve seen tend to consist of a fixed number of queues which correspond to a small number of priority levels (e.g. low / medium / high), which isn’t suitable if you want to use a timestamp as the priority value.
It turns out you don’t really need a queue—you just need a key-value store that supports a key expiration time. Your webhook handler stores the event payload under a key (usually the resource ID) with a short expiration time, and you have a consumer that is subscribed to keyspace events.
To achieve this in Redis, you’ll first need to enable keyspace notifications for expired keys:
CONFIG SET notify-keyspace-events Ex
And then you’ll need two pieces: a webhook handler that buffers each incoming event under a short-lived, resource-scoped key, and a subscriber that receives the expiration notifications and kicks off event processing. The handler looks like this:
async function webhookHandler(ctx) {
  const event = validateEvent(ctx);

  // looks like: `customer:cus_Ht5bYdLwaqcYM1`
  const eventKey = `${event.data.object.object}:${event.data.object.id}`;

  await redis.set(eventKey, JSON.stringify(event), {
    EX: 5,
  });

  return { status: 200 };
}
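The subscriber half might look something like the following minimal sketch, assuming node-redis v4 and Redis database 0 (the processEvent call signature shown is an assumption). Note that the expired-key notification only carries the key name, which is why the key encodes the object type and ID:

const subscriber = redis.duplicate();
await subscriber.connect();

// `Ex` notifications are published to `__keyevent@<db>__:expired` with the
// expired key name as the message; the stored value is already gone by then.
await subscriber.subscribe('__keyevent@0__:expired', async (key) => {
  const [objectType, objectId] = key.split(':');
  await processEvent(objectType, objectId);
});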
Under the hood, processEvent can dispatch out to your regular event processors, like the syncCustomer function introduced earlier in the article.
This does come with some pretty big operational drawbacks that you should keep in mind. If the subscriber misses a notification—because it crashed, for instance—then there is no way to “replay” the expiration event, because by the time your subscriber comes back online the key will be long gone. You probably want to run multiple replicas of the keyspace subscriber for redundancy, and once you start horizontally scaling this service you’ll need to ensure that you have concurrency control around the processEvent function to avoid processing the same event twice.
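One straightforward way to get that concurrency control is a short-lived per-key lock, sketched below; the handleExpiredKey name and the lock TTL are assumptions. Because the lock keys themselves expire, the subscriber has to ignore their notifications:

async function handleExpiredKey(key) {
  // The lock keys below also expire and trigger notifications of their own,
  // so ignore them here instead of treating them as events.
  if (key.startsWith('lock:')) {
    return;
  }

  // Only one replica wins the SET NX race for a given key.
  const claimed = await redis.set(`lock:${key}`, '1', {
    EX: 60, // assumed upper bound on processing time
    NX: true,
  });
  if (claimed !== 'OK') {
    return; // another replica is already handling this event
  }

  const [objectType, objectId] = key.split(':');
  await processEvent(objectType, objectId);
}

Each subscriber replica would then call handleExpiredKey from its expiration listener instead of calling processEvent directly.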
You’ll also usually find that your subscriber is I/O bound, and that can be a problem depending on how you’re hosting it. Google’s Cloud Run can be used to host background workers like this, but there’s no way to autoscale a Cloud Run service based on I/O metrics, which means you’ll need to move to something else.
Of course, there are other options available. You could use MongoDB Atlas triggers in tandem with a document TTL and stream expired documents to AWS EventBridge. This means you have less infrastructure to manage yourself, although it does also mean you have less ability to tune things.
How important is this stuff?
It depends! The only technique on this list that I would say is a “must do” is verifying the authenticity of incoming events, because not doing so can have major security implications.
Beyond that, the extent to which you engineer your webhook handlers depends on your use case. An initial MVP can get away with cutting more corners than an established application with users that rely on it working correctly, and a webhook handler that deals with money needs to be much more robust than a webhook handler that synchronizes chat messages.
I will say, though, that everything on this list is quite easy to implement. Note how the syncCustomer method we introduced to decouple our webhook handler from event ordering wound up being reused inside our cronjob. Each item on this list probably only takes 10 minutes of your time, and can save you a lot of headaches down the line.