4 September, 2023•5 minute read

Efficient data retention policies

Last year Retool authored a post describing how they upgraded a 4 TB PostgreSQL database from 9.6 to 13 with “minimal downtime.” It’s a good blog post—major version upgrades of PostgreSQL are hard—but one sentence in particular caught my eye:

“The larger 2 TB table, an append-only table of audit events in the app, was easy to transfer: we waited until after the cutover to migrate the contents, as the Retool product functions just fine even if that table is empty.”

The “append-only table of audit events” is a recurring character in internet businesses, and it often ends up being by far the largest table in your database. In Retool’s case it was about half the DB, and at Crimson our audit table is similarly large. This table is everywhere. If you’ve worked in the industry for more than a few months you’ve ran into it, and it can get out of hand if you don’t get on top of it.

Over at gethook we don’t yet have an audit table, but we do have something analogous. Our event table contains a row for every webhook we receive, and it gets big pretty quickly despite us only being in closed beta.

Provided you have the right indexes and aren’t running pathological queries table size isn’t a concern for day-to-day database performance, but large tables do cause other issues. Databases are uniquely hungry for IOPS which pushes up the price of database disks compared to other storage options, and large tables also make migrations significantly harder and slower—as the Retool engineering team will attest to.

Both Stripe and gethook only retain webhook events for 30 days for these reasons. There’s no real reason to store a bunch of rarely accessed historical data inside your main production database. The costs almost always outweigh the benefits.

The key here is to have a background worker periodically delete old data from your database, but there’s an art to doing this efficiently. Most implementations I’ve seen are wrong, and in this post I’ll explain how to do it correctly.

The components of an efficient data cleanup job are as follows:

Delete in batches. Batching the work is important because it safeguards us against our worker process timing out, and also makes individual delete queries run faster. Trying to delete millions of rows in a single query is likely to take some time, and long-running transactions can introduce feedback into the rest of the system.

💡 A key part of designing systems which last is to limit “worst case load.” Batching the process of deleting old data means our query only ever works on N rows at a time, which caps its worst case execution time.

Don’t load data into your worker process. You want to be deleting data with a single query to minimize network traffic. Anything which does a select-then-delete is violating this best practice.

Don’t send deleted items over the wire. Following the same logic as the previous point, you don’t want to put a returning clause in your query. Less data transfer means quicker data cleanup.

The query

With that out of the way, here’s a query that efficiently deletes old data by following the aforementioned guidelines.

Click to copy

delete from "event"where "id" in (  select "id" from "event"  where "created_at" < @horizon  limit @batch_size)

PostgreSQL (and SQLite¹) require the most complex query, which is an unusual occurrence as Postgres tends to be very expressive. The problem is that Postgres doesn’t support a limit clause inside a delete statement which necessitates the use of a subquery to emulate a limit. A CTE works, too, but I personally find the subquery variant more readable.

You’d think the Postgres query would work as-is in other database engines, but MySQL has an annoying limitation where the delete statement cannot refer to the same table as a select subquery. If you use a sub-subquery to create a temporary table you can work around this limitation, but you’re much better off just using limit directly—it’s clearer and more performant.

How about foreign keys?

Something this doesn’t handle are foreign key constraints. If you’re okay with cascaded deletes then you won’t need to do anything, but if you’d prefer to handle cleanup of relations by hand or are using a database engine without foreign key constraints (e.g. Vitess) you’ll need a different query.

Adjusting the above queries to handle foreign keys is a pain. Postgres doesn’t support delete join so your best option is to (ab)use CTEs—you will need a returning clause inside the CTE so you can properly delete the related rows inside your main query, but you’re not sending the data back to the client so all is well.

Something like the following would work; the key thing to get right is making sure things are deleted in the correct order. Here I delete from event_delivery before deleting from event as event_delivery has a foreign key constraint on its event_id column.

Click to copy

with "events_to_delete" as (  select "id" from "event"  where "created_at" < @horizon  limit @batch_size),"deleted_deliveries" as (  delete from "event_delivery"  where "event_id" in (    select "id" from "events_to_delete"  )  returning "event_id") delete from "event"where "id" in (  select "event_id" from "deleted_deliveries")

I’m not aware of a way to do this perfectly with MySQL. The lack of returning makes the CTE approach impossible, and join can’t be combined with limit inside a delete. Even the “sub-subquery into a temporary table” strategy doesn’t work.

If you’re building on Vitess and don’t have access to foreign key constraints—and by extension, cascaded deletes—then I suspect the best you can do is to break out separate select and delete queries. At the scale where these optimizations really begin to matter, you’ll be sharding your dataset anyway which helps keep things snappy.

Make sure to update your APIs!

Deletion of old data should happen asynchronously via a background worker, which means that your cleanup process is eventually consistent. This means users of your software will be able to see records older than your retention period in some cases.

You can fix this by updating your API endpoints to also apply the retention period as part of its filters. Here’s a simplified example from gethook’s “list events” endpoint:

list-events.sql

Click to copy

select * from `event`where `cursor` > @cursorand `created_at` > @horizonorder by `cursor` asclimit @page_size

Efficient data retention policies

The query

How about foreign keys?

Make sure to update your APIs!

Other solutions

Rebuild the table

Truncate the table

« Newer post

Older post »

Get in touch 👋