17 April, 2023•6 minute read

Object IDs for humans

IDs are an essential part of API design. So much of the code we write as engineers uses IDs in some capacity, yet many engineering teams reach for the tried and tested UUID format without considering the tradeoffs. The format you use for IDs can have a surprising impact on performance, debuggability, and developer experience.

In my recent projects I’ve instead been using an object type prefix concatenated with a KSUID. An example ID might look like team_2OVAmoKVfUfq7BjyrlMTNaMxcJM, and I think this format has a lot going for it.

Why are UUIDs so popular?

A few decades ago, when software companies served many orders of magnitude less traffic than today, it was commonplace to use an auto-incrementing integer for IDs. This strategy has largely been displaced by the UUID, and it’s worth taking a moment to explore why this shift happened.

The problem with an auto-incrementing primary key is that every insert requires the database to acquire a lock so that it can atomically increment the sequence. When you're handling the volume of traffic a modern SaaS app sees, this can result in a significant performance penalty.

The other downside, also related to performance, is that the service sending the query to the database has no way of knowing ahead of time what the final ID will be. This means that if the service needs to insert a graph of records, it's impossible to bulk insert them all in one query. It's necessary to make a round trip: insert the parent row, read back its generated ID, and only then insert the rows that reference it.

At scale this is also slow. Ideally we want our service to be able to generate IDs upfront so that we can insert both our user and user_info rows in one query, without needing to worry about a potential ID collision.
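To make the difference concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (users_auto, user_info_auto, users_uuid, user_info_uuid) and the function names are hypothetical and chosen only for illustration. SQLite runs in-process, so there is no real network round trip here; the point is only that the first function must read a value back from the database before it can continue, while the second never does.

```python
import sqlite3
import uuid


def make_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical tables: one pair keyed by an auto-incrementing integer,
    # one pair keyed by a client-generated UUID stored as text.
    conn.execute("CREATE TABLE users_auto (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
    conn.execute("CREATE TABLE user_info_auto (user_id INTEGER, bio TEXT)")
    conn.execute("CREATE TABLE users_uuid (id TEXT PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE user_info_uuid (user_id TEXT, bio TEXT)")


def insert_with_round_trip(conn: sqlite3.Connection, name: str, bio: str) -> int:
    # The database picks the ID, so we have to read it back
    # before the dependent row can be written.
    cur = conn.execute("INSERT INTO users_auto (name) VALUES (?)", (name,))
    user_id = cur.lastrowid
    conn.execute("INSERT INTO user_info_auto (user_id, bio) VALUES (?, ?)", (user_id, bio))
    return user_id


def insert_with_client_id(conn: sqlite3.Connection, name: str, bio: str) -> str:
    # The ID is generated up front, so nothing has to be read back
    # between the two statements.
    user_id = str(uuid.uuid4())
    conn.execute("INSERT INTO users_uuid (id, name) VALUES (?, ?)", (user_id, name))
    conn.execute("INSERT INTO user_info_uuid (user_id, bio) VALUES (?, ?)", (user_id, bio))
    return user_id
```

Against a networked database, the two statements in insert_with_client_id could be sent together in a single batch, because neither depends on a value returned by the other.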

UUIDs solve these scalability problems quite well. Because the likelihood of two clients generating identical UUIDs is vanishingly small, it’s possible to significantly cut down on database load by avoiding database locks and round trips.

There are a few other benefits as well. Sequentially increasing IDs are simple to guess for anyone poking at your API, and they also leak information such as how many users your app has.

UUIDs are also very simple to use. Most database engines have native support for them, and every programming language has libraries for generating them.

So what’s wrong with UUIDs?

If we compare UUIDs to auto-incrementing primary keys they look quite appealing. There's a reason why they're the default option for so many engineering teams: they work well enough most of the time.

But they’re not totally perfect, and they come with tradeoffs. Here are some problems which you’ll run into if you use UUIDs.

UUIDs can’t be ordered by time. The UUID v4¹ format, the one most engineering teams use, does not include the time at which it was generated. Spending almost every bit on randomness reduces the chance of an ID collision, but it also makes the ID less useful.

Storing a created_at column is generally best practice, but there are real-world situations where you might only have an ID to work with. Recently at work we had to debug an issue with our analytics pipeline: events are streamed to a Lambda function via Kinesis, and that Lambda is responsible for converting the events into CSV format and writing them into an S3 bucket, from which they eventually get synced into a data lake.

Something was broken in this Lambda, and we needed to investigate the contents of the S3 bucket. The S3 endpoint for listing objects guarantees that object keys are returned in UTF-8 binary order, which would have been very helpful if we were using an ID format that included a timestamp component, because the listing would then have come back in roughly chronological order. Unfortunately, we weren't.

Bulk inserts are slow. Following on from the above, if you need to bulk insert a large number of rows you’ll find that UUIDs can slow things down. Because they’re not sequential, your database engine needs to figure out which page to write each and every row into, rather than being able to perform a sequential write for everything.

At small scales—both traffic and row count—this isn’t much of a problem. But this lack of scalability is worth bearing in mind, especially given how easy it is to simply pick a better ID format at the start of a new project.

Copy/paste is annoying. Here’s a UUID: d447601f-3ce1-49a2-85c9-08b7178603eb. Try to double-click select it and you’ll find that you can’t, because your browser thinks the ID consists of five words.

Not being able to quickly copy/paste UUIDs is remarkably annoying. This may seem like a minor gripe, but it’s worth getting small details right. I find myself copying IDs all the time, and the extra bit of time wasted by manually selecting the full UUID adds up quickly.

I actually think this inconvenience alone justifies swapping to a different format².

They’re inscrutable. Looking at a UUID tells you nothing about the object it references. If you’re debugging an error then you’ll often need to look at the surrounding log message in order to figure out what kind of resource the ID is referring to, and you lose a lot of glance value when checking over configuration values. Which option below is easier to read?

config.toml

    [project-one]
    default_tenant_id = "4c1be4b0-c5b6-42ff-8d84-22984a07afbe"
    # ...

    [project-two]
    default_tenant_id = "price_2OXmz8X7zWQvG1qNYSTEVTLJ5jx"
    # ...

The second option is much easier to read, and we can even see that someone's likely made a mistake when filling out the configuration values as project two's default_tenant_id is referring to the ID of some price object!

This downside of UUIDs is really underestimated. Engineers spend a lot of time trawling through log messages and throwing ID values around, and being able to simply glance at an ID and instantly understand what it refers to is a godsend.

There's one final sub-point to make on this, and that's GraphQL. The Global Object Identification specification requires that all objects in your graph implement a Node interface containing an id field, and that any Node can be looked up via a node query. A GraphQL schema implementing this looks as follows:

schema.gql

    interface Node {
      id: ID!
    }

    type Note implements Node {
      id: ID!
      content: String!
      # other fields
    }

    type Query {
      node(id: ID!): Node
    }

The point of this specification is to provide a standard mechanism for refreshing client caches. Clients can simply query node with the ID of the object they want to refresh.

The problem you'll immediately run into when using UUIDs is that it's hard to know which table to query given only a UUID. You can work around this by adding additional infrastructure like an ID service, but at this point things are getting pretty complicated. It's worth noting that this drawback is actually shared between UUIDs and database-generated IDs.

Are these drawbacks dealbreakers for the UUID format? Not necessarily. Plenty of engineering teams use UUIDs and do great work. But I think these drawbacks are significant enough that it’s worth looking for a new default option that solves these problems without compromising on the things that make UUIDs great.

Enter the prefix + KSUID strategy.

What makes a prefix + KSUID so nice for IDs?

Here’s another specimen ID: user_2OVFytmKTeZPMhvaU5g0EbHh7hn.

There’s a lot to like about this ID format. In addition to being easily copyable (try double-clicking that ID!), it is immediately obvious that the ID refers to a user. If I’m working to resolve a production outage, I save valuable seconds that would otherwise have been spent trying to figure out what kind of data the ID in my error log refers to.
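The prefix also helps with the GraphQL node lookup problem described earlier: the ID itself says which table to query. A minimal sketch of that routing follows, where the TABLE_BY_PREFIX mapping and the table_for_id helper are hypothetical names invented for illustration rather than part of any library.

```python
# Hypothetical mapping from ID prefix to the table that stores that object type.
TABLE_BY_PREFIX = {
    "user": "users",
    "team": "teams",
    "note": "notes",
}


def table_for_id(object_id: str) -> str:
    """Return the table to query for a prefixed ID such as 'user_2OVF...'."""
    # Split on the last underscore so that prefixes which themselves
    # contain underscores would still work.
    prefix, separator, suffix = object_id.rpartition("_")
    if not separator or not prefix or not suffix:
        raise ValueError(f"ID has no type prefix: {object_id!r}")
    if prefix not in TABLE_BY_PREFIX:
        raise ValueError(f"Unknown ID prefix: {prefix!r}")
    return TABLE_BY_PREFIX[prefix]
```

A node resolver could call table_for_id on the incoming ID and query only that table, with no separate ID service. A bare UUID carries no such hint, so the same function can only reject it.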

Digging into the technical details, the KSUID format is k-sortable, which means that bulk inserts can be written sequentially, and records can be roughly ordered by creation time using only their ID field. We also get a little more collision resistance: KSUIDs have 128 bits of randomness compared to a UUIDv4’s 122 bits, which makes the random space 2^6 = 64 times larger.
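For a feel of how such an ID is put together, here is a rough sketch based on my understanding of the published KSUID layout: a 4-byte timestamp counting seconds from an offset epoch of 1400000000, followed by 16 random bytes, base62-encoded to 27 characters. The new_id and base62_encode helpers are hypothetical names for illustration, and in real code a maintained KSUID library is likely the better choice.

```python
import os
import time
from typing import Optional

# Assumption: KSUID timestamps count seconds from this offset epoch.
KSUID_EPOCH = 1_400_000_000
# Digits, then uppercase, then lowercase: the same order as their ASCII codes,
# so fixed-width encoded strings sort the same way as the underlying bytes.
BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"


def base62_encode(raw: bytes, width: int = 27) -> str:
    """Encode bytes as a zero-padded, fixed-width base62 string."""
    number = int.from_bytes(raw, "big")
    chars = []
    while number:
        number, remainder = divmod(number, 62)
        chars.append(BASE62[remainder])
    return "".join(reversed(chars)).rjust(width, "0")


def new_id(prefix: str, now: Optional[float] = None) -> str:
    """Build a '<prefix>_<ksuid>' style ID; `now` is injectable for testing."""
    seconds = int(time.time() if now is None else now) - KSUID_EPOCH
    # 4 bytes of timestamp first, then 16 bytes (128 bits) of randomness.
    raw = seconds.to_bytes(4, "big") + os.urandom(16)
    return f"{prefix}_{base62_encode(raw)}"
```

Because the timestamp occupies the leading bytes and the encoding is fixed-width, IDs with the same prefix generated in different seconds sort in creation order as plain strings. IDs generated within the same second are ordered only by their random bytes, which is why the ordering is rough rather than exact.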

Of course, there are some tradeoffs. This ID requires more storage space than a UUID (a KSUID alone is 20 bytes of raw data, or 27 characters as text, against a UUID's 16 bytes), and databases don’t have built-in support for this ID format. You’ll need to generate IDs yourself in your code.

Neither of these is a particularly big concern, though. The latter could even be seen as a benefit: one of the reasons we moved to UUIDs was that we could generate them in code and assume no collision, so that we could insert relational data in bulk. Relying on the database to generate your IDs takes you back to the bad old days where you needed to make a database round trip before inserting your related row.

For your next project, consider trying out the prefix + KSUID approach. I think you'll like it.

¹ It's worth mentioning that there are other UUID formats such as UUIDv7 which do include a timestamp, but using these formats costs you a large number of random bits. UUIDv4 has 122 bits of random data whereas UUIDv7 only has 74.
² Prior to my current strategy, I had been using NanoIDs simply because you can easily copy them.
