Efficient data retention policies
Last year Retool authored a post describing how they upgraded a 4 TB PostgreSQL database from 9.6 to 13 with “minimal downtime.” It’s a good blog post—major version upgrades of PostgreSQL are hard—but one sentence in particular caught my eye:
“The larger 2 TB table, an append-only table of audit events in the app, was easy to transfer: we waited until after the cutover to migrate the contents, as the Retool product functions just fine even if that table is empty.”
The “append-only table of audit events” is a recurring character in internet businesses, and it often ends up being by far the largest table in your database. In Retool’s case it was about half the DB, and at Crimson our audit table is similarly large. This table is everywhere. If you’ve worked in the industry for more than a few months you’ve run into it, and it can get out of hand if you don’t stay on top of it.
Over at gethook we don’t yet have an audit table, but we do have something analogous. Our `event` table contains a row for every webhook we receive, and it gets big pretty quickly despite us only being in closed beta.
Provided you have the right indexes and aren’t running pathological queries, table size isn’t a concern for day-to-day database performance, but large tables do cause other issues. Databases are uniquely hungry for IOPS, which pushes up the price of database disks compared to other storage options, and large tables also make migrations significantly harder and slower—as the Retool engineering team will attest.
Both Stripe and gethook only retain webhook events for 30 days for these reasons. There’s no real reason to store a bunch of rarely accessed historical data inside your main production database. The costs almost always outweigh the benefits.
The key here is to have a background worker periodically delete old data from your database, but there’s an art to doing this efficiently. Most implementations I’ve seen are wrong, and in this post I’ll explain how to do it correctly.
The components of an efficient data cleanup job are as follows:
Delete in batches. Batching the work is important because it safeguards us against our worker process timing out, and it also makes individual delete queries run faster. Trying to delete millions of rows in a single query is likely to take some time, and long-running transactions can cause lock contention and other problems for the rest of the system.
💡 A key part of designing systems which last is to limit “worst case load.” Batching the process of deleting old data means our query only ever works on `N` rows at a time, which caps its worst-case execution time.
Don’t load data into your worker process. You want to be deleting data with a single query to minimize network traffic. Anything which does a `select` followed by a `delete` is violating this best practice.
Don’t send deleted items over the wire. Following the same logic as the previous point, you don’t want to put a `returning` clause in your query. Less data transfer means quicker data cleanup.
With that out of the way, here’s a query that efficiently deletes old data by following the aforementioned guidelines.
```sql
delete from "event"
where "id" in (
  select "id"
  from "event"
  where "created_at" < @horizon
  limit @batch_size
)
```
PostgreSQL (and SQLite¹) requires the most complex query here, which is an unusual occurrence, as Postgres tends to be very expressive. The problem is that Postgres doesn’t support a `limit` clause inside a `delete` statement, which necessitates the use of a subquery to emulate the `limit`. A CTE works too, but I personally find the subquery variant more readable.
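To make the “background worker” side of this concrete, here’s a minimal sketch of a cleanup loop built around that query, using Python’s `sqlite3`. The function name, schema, and parameters are my own illustrations, not anything from gethook’s codebase:

```python
import sqlite3

def purge_old_events(conn, horizon, batch_size=1000):
    """Delete events older than `horizon` in batches of `batch_size`.

    Each batch runs in its own short transaction, so no single delete
    can hold locks for long. Returns the total number of rows deleted.
    """
    total = 0
    while True:
        with conn:  # commit after every batch
            cur = conn.execute(
                'delete from "event" where "id" in ('
                ' select "id" from "event"'
                ' where "created_at" < ?'
                ' limit ?'
                ')',
                (horizon, batch_size),
            )
        total += cur.rowcount
        if cur.rowcount < batch_size:  # partial batch means we're done
            return total
```

A real worker would run this on a schedule (e.g. every few minutes) and could sleep between batches to spread the I/O load.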
You’d think the Postgres query would work as-is in other database engines, but MySQL has an annoying limitation where the `delete` statement cannot refer to the same table as a `select` subquery. You can work around this limitation with a sub-subquery that creates a temporary table, but you’re much better off just using `limit` directly—it’s clearer and more performant.
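For reference, the direct MySQL version is pleasantly short (same `@horizon` and `@batch_size` parameters as the Postgres query):

```sql
delete from `event`
where `created_at` < @horizon
limit @batch_size
```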
How about foreign keys?
Something this doesn’t handle is foreign key constraints. If you’re okay with cascaded deletes then you won’t need to do anything, but if you’d prefer to handle cleanup of relations by hand or are using a database engine without foreign key constraints (e.g. Vitess) you’ll need a different query.
Adjusting the above queries to handle foreign keys is a pain. Postgres doesn’t support `delete join`, so your best option is to (ab)use CTEs—you’ll need a `returning` clause inside the CTE so you can properly delete the related rows in your main query, but since you’re not sending the data back to the client, all is well.
Something like the following would work; the key thing to get right is making sure rows are deleted in the correct order. Here I delete from `event_delivery` before deleting from `event`, as `event_delivery` has a foreign key constraint on its `event_id` column.
```sql
with "events_to_delete" as (
  select "id"
  from "event"
  where "created_at" < @horizon
  limit @batch_size
),
"deleted_deliveries" as (
  delete from "event_delivery"
  where "event_id" in (
    select "id" from "events_to_delete"
  )
  returning "event_id"
)
delete from "event"
where "id" in (
  select "event_id" from "deleted_deliveries"
)
```
I’m not aware of a way to do this perfectly with MySQL. The lack of `returning` makes the CTE approach impossible, and `join` can’t be combined with `limit` inside a `delete`. Even the “sub-subquery into a temporary table” strategy doesn’t work.
If you’re building on Vitess and don’t have access to foreign key constraints—and, by extension, cascaded deletes—then I suspect the best you can do is break out separate `delete` queries. At the scale where these optimizations really begin to matter you’ll be sharding your dataset anyway, which helps keep things snappy.
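As a sketch of what “separate `delete` queries” might look like, here’s one possible shape using Python’s `sqlite3`. The schema and names are illustrative; the important details are that the child table goes first, both statements share one transaction, and both subqueries use the same predicate and a deterministic `order by` so they target the same set of events:

```python
import sqlite3

def purge_events_and_deliveries(conn, horizon, batch_size=1000):
    """Delete one batch of old events and their deliveries, child table first.

    The identical ordered subquery in both statements picks the same
    event ids; a single transaction keeps the pair atomic.
    """
    subquery = (
        'select "id" from "event"'
        ' where "created_at" < ? order by "id" limit ?'
    )
    with conn:
        conn.execute(
            f'delete from "event_delivery" where "event_id" in ({subquery})',
            (horizon, batch_size),
        )
        cur = conn.execute(
            f'delete from "event" where "id" in ({subquery})',
            (horizon, batch_size),
        )
    return cur.rowcount  # events deleted in this batch
```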
Make sure to update your APIs!
Deletion of old data should happen asynchronously via a background worker, which means that your cleanup process is eventually consistent. This means users of your software will be able to see records older than your retention period in some cases.
You can fix this by updating your API endpoints to apply the retention period as part of their filters too. Here’s a simplified example from gethook’s “list events” endpoint:
```sql
select *
from `event`
where `cursor` > @cursor
and `created_at` > @horizon
order by `cursor` asc
limit @page_size
```
The solution involving a bounded `delete` query works for any application, but in some edge cases you might want to do something slightly different. In the vast majority of cases, though, I’d recommend using a background worker with one of the queries from this post.
Rebuild the table
Let’s say you have a truly massive amount of data—hundreds of gigabytes inside one table—and you only want to retain a small percentage of it. In this case it’s actually faster to build a new table. You do this like so:
- Copy the data you care about into a new table
- Rename the old table to something else
- Rename the new table to the name of the old table
- Drop the old table
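The steps above can be sketched with Python’s `sqlite3` (Postgres uses the same `alter table ... rename to` syntax; table names are illustrative):

```python
import sqlite3

def rebuild_event_table(conn, horizon):
    """Keep only rows newer than `horizon` by rebuilding the table.

    Sketch only: a production version would also recreate indexes on
    the new table and handle writes arriving during the copy.
    """
    with conn:
        # 1. Copy the data you care about into a new table.
        conn.execute(
            'create table "event_new" as'
            ' select * from "event" where "created_at" >= ?',
            (horizon,),
        )
        # 2 & 3. Swap the tables via renames.
        conn.execute('alter table "event" rename to "event_old"')
        conn.execute('alter table "event_new" rename to "event"')
        # 4. Drop the old table.
        conn.execute('drop table "event_old"')
```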
Note that you can avoid getting into this situation by implementing a sensible data retention policy early on in the lifecycle of your application. If you’re diligent from the start about clearing out old data, your table can’t get to the size where this strategy makes sense in the first place.
Truncate the table
Another option, which works well for partitioned tables, is the `truncate` statement. Where `delete` works by removing rows one by one and writing each change to the transaction log, `truncate` simply drops the entire table and recreates the structure in place. Foreign key constraints do get checked by `truncate` prior to dropping the table, but triggers don’t get executed.
You can’t combine `truncate` with filters or a `limit`, meaning it will delete everything in the target table. In cases where you’re okay with this, it’s a good option in place of `delete`. Generally speaking, the all-or-nothing nature of `truncate` limits its usefulness to situations involving partitioned tables, as the partitions effectively function as a filter for the truncation.
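For example, with time-based partitions a whole month of aged-out data can go in one cheap statement (the partition name here is hypothetical):

```sql
-- removes every row in the January partition; the parent table's
-- other partitions are untouched
truncate table "event_2024_01"
```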