5 May, 2024 · 7 minute read

Always use a dataloader

Can you spot the hidden findMany query lurking inside this GraphQL schema?

type User {
  id: ID!
  email: String!
  # ...
}

type Query {
  user(id: ID!): User
}

This is not a trick question; it is possible for a GraphQL operation executed against this schema to cause multiple database queries. If you’ve been able to identify it, then you should consider applying to one of our open roles at Rye. We’re hiring engineers who have—among other things—a deep understanding of GraphQL.

This schema looks inconspicuous, and it’s tempting to implement the user resolver using a straight findOne call. But remember that thanks to aliasing it is possible for a consumer of your schema to invoke a single resolver multiple times from a single query document, like so:

query HitTheDBMultipleTimes {
  firstUser: user(id: "foo") {
    id
    # ...
  }
  secondUser: user(id: "bar") {
    id
    # ...
  }
}

If user is implemented with a findOne call then our GraphQL server will hit the database twice, causing unnecessary load. In my experience, queries like this one are rarely written by hand, but they appear all the time when using something like Apollo’s batch HTTP link.

Fortunately, this problem is easy to avoid by reaching for the dataloader library. While most people think about data loaders in the context of the N+1 problem, I find that they are a really valuable tool in almost all GraphQL use cases, and it’s very rare for me to load data without one.

Using a dataloader

The dataloader library can be installed using your Node.js package manager of choice, and using it to wrap a database access is straightforward. Here I am using Drizzle as my query builder, Pothos as my schema builder, and Yoga as my GraphQL server:

// Add DataLoader instance(s) to context
const yoga = createYoga({
  schema: builder.toSchema(),
  context: () => ({
    usersLoader: new DataLoader(
      async (userIds: readonly string[]) => {
        const rows = await db
          .select()
          .from(users)
          .where(inArray(users.id, [...userIds]));

        // First element in this array must correspond to first
        // ID in the `userIds` list.
        return userIds.map((userId) =>
          rows.find((it) => it.id === userId) || null,
        );
      },
    ),
  }),
});

// In the resolver; use the DataLoader
builder.queryType({
  fields: (t) => ({
    user: t.field({
      type: User,
      args: {
        id: t.arg.id({ required: true }),
      },
      resolve: (_root, args, ctx) =>
        ctx.usersLoader.load(args.id),
    }),
  }),
});

When this application receives a GraphQL operation that requests user multiple times, the data loader will coalesce those requests together and issue only a single database query. This behavior is the primary selling point of the dataloader library that most will already be familiar with, although this particular case of redundant database reads can catch people out.

Even if you are only loading a single record from within a resolver, it is still worth using a dataloader. You guard against the aliasing pattern shown earlier, and you gain some futureproofing for when you inevitably start making users available elsewhere in your graph.

There are also some other less well known benefits to using data loaders that we will cover in the next section.

Less-known reasons why data loaders are awesome

“Safe by default” behavior

Earlier this week I dove deep and implemented a getUsers function for a variety of different “levels” of reliability. Building software is hard, and architecting systems for reliability requires deep thought and consideration applied all throughout the stack.

We can make the difficulty curve a bit less steep by making better tech stack decisions. A getUsers function written to target a PostgreSQL database has no natural defence against a poorly written “fetch one million records” query, because SQL prioritizes query flexibility above all else. I’ve spent a lot of time over the course of my career debugging queries like that.

If we instead use Firestore as our database, we are immediately limited to fetching only 30 records by ID in one go. It becomes impossible for us to overwhelm our datastore with a poorly written query, and instead the failure occurs closer to the edge of the stack with a smaller blast radius. DynamoDB is also a pretty good option, as query result sets are limited to returning only 1 MB of data per API request.
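Staying under a cap like Firestore’s is mechanical work. A generic chunking helper (a sketch, not tied to any particular SDK) splits an ID list into per-query batches:

```typescript
// Split a list of IDs into chunks no larger than the store's
// per-query limit (e.g. 30 for a Firestore `in` query).
function chunk<T>(items: readonly T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// e.g. 70 IDs → 3 queries fetching 30, 30, and 10 records
const batches = chunk(
  Array.from({ length: 70 }, (_, i) => `id-${i}`),
  30,
);
```

This is exactly the shape of constraint that a data loader’s maxBatchSize option gives you for free, as we’ll see below.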

As software engineers, we tend to be skeptical of constraints. If you’ve worked in the industry for longer than a few months then you’ll have a war story about needing to work around technical constraints imposed on you by a past decision. But constraints aren’t a universally bad thing—they can give you nice guardrails which keep you on the straight and narrow.

Data loaders add a few constraints of their own to your codebase, but those constraints pay off by adding a lot of safety and performance to your system. It’s not just about coalescing multiple findOne calls together; data loaders also do the following:

  1. Batching. The default data loader behavior is to coalesce all of the loads made within a single tick into one database request, but you can control this behavior and specify a maximum batch size by passing the maxBatchSize option. With this set, your data loader will only ever try to fetch maxBatchSize records in a single attempt. It’s really nice getting this optimization for free—this batching behavior is completely transparent to consumers of the data loader.
  2. Caching. Repeated attempts to load the same record will only ever issue a single database lookup, and this is true even in the case where you have multiple overlapping findMany operations running. The dataloader library is really smart about deduping this work for you, and this is something that’s tricky to do correctly by hand.
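Both behaviors can be illustrated with a hand-rolled mini-loader. This is a toy sketch of what the dataloader package does internally, not its actual implementation:

```typescript
type BatchFn<K, V> = (keys: readonly K[]) => Promise<(V | null)[]>;

class MiniLoader<K, V> {
  private cache = new Map<K, Promise<V | null>>();
  private queue: { key: K; resolve: (value: V | null) => void }[] = [];

  constructor(
    private batchFn: BatchFn<K, V>,
    private maxBatchSize = Infinity,
  ) {}

  load(key: K): Promise<V | null> {
    // Caching: repeated loads of the same key share one promise.
    const cached = this.cache.get(key);
    if (cached) return cached;

    const promise = new Promise<V | null>((resolve) => {
      this.queue.push({ key, resolve });
      // Flush after the current tick, once every load has been enqueued.
      if (this.queue.length === 1) queueMicrotask(() => void this.flush());
    });
    this.cache.set(key, promise);
    return promise;
  }

  private async flush() {
    const pending = this.queue;
    this.queue = [];
    // Batching: never request more than maxBatchSize keys at once.
    for (let i = 0; i < pending.length; i += this.maxBatchSize) {
      const slice = pending.slice(i, i + this.maxBatchSize);
      const values = await this.batchFn(slice.map((p) => p.key));
      slice.forEach((p, idx) => p.resolve(values[idx]));
    }
  }
}

// Four loads, but only two batch calls: ['a', 'b'] and ['c'].
// The duplicate 'a' is served from the cache.
(async () => {
  let batchCalls = 0;
  const loader = new MiniLoader<string, string>(async (keys) => {
    batchCalls += 1;
    return keys.map((key) => `user:${key}`);
  }, 2);

  await Promise.all([
    loader.load('a'),
    loader.load('b'),
    loader.load('c'),
    loader.load('a'),
  ]);
  console.log(batchCalls); // → 2
})();
```

The real library handles scheduling, error propagation, and cache-key customization far more carefully than this sketch does, which is exactly why it’s worth depending on.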

These two features are in addition to the core value proposition of solving N+1 query patterns, and they are extremely underrated value adds. These are things you should be doing anyway, but which often fall by the wayside when querying a database directly due to time constraints or lack of thoroughness. Moving that database access to inside a data loader gives you these benefits almost for free, and the end result is a substantially more resilient system.

A consistent view of the world

It’s often impossible to completely avoid issuing redundant database loads from within your application code. Consider the following example, where we want to enforce some ACL rules on an incoming request before going on to generate an email using GPT:

// Authorization helper
async function validateAccess(ctx: Context) {
  // This will issue a database query
  const user = await ctx.scope.usersService.fromId(
    ctx.auth.userId,
  );

  if (user.role !== 'admin') {
    throw new UnauthorizedError();
  }
}

// API request handler
async function endpoint(ctx: Context) {
  await validateAccess(ctx);

  // And this will *also* issue a database query
  const user = await ctx.scope.usersService.fromId(
    ctx.auth.userId,
  );
  const email = await ctx.scope.aiService.generateEmail({
    user,
    // ...
  });

  ctx.status = 200;
  ctx.body = { email };
}

This example is contrived; it’s hard to provide a realistic example within the confines of a single blog post. It’s easy to refactor this particular authorization function to take a user object directly rather than making its own lookup, but in practice the object graph can be complicated and a refactor like this one can prove difficult¹—if you can even identify the existence of the redundant find in the first place.

In addition to this extra user lookup being wasteful, it is also vulnerable to data races. We have no guarantee that the second user object we fetch is consistent with the first one, as it is entirely possible for something else to mutate the record out from under us in between our two lookups. This can cause really subtle and hard-to-debug errors in real world applications.
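The race is easy to reproduce in miniature. This sketch uses an in-memory map as a stand-in for the database, with a mutation landing between the two reads; all names are hypothetical:

```typescript
type SessionUser = { id: string; role: string };

// In-memory stand-in for the users table.
const table = new Map<string, SessionUser>([
  ['u1', { id: 'u1', role: 'admin' }],
]);

// Each call re-reads the table, just as a findOne re-queries the database.
const fetchUser = async (id: string) => ({ ...table.get(id)! });

async function handleRequest(): Promise<boolean> {
  const first = await fetchUser('u1'); // role: 'admin' — the ACL check passes

  // Another request demotes the user in between our two reads.
  table.set('u1', { id: 'u1', role: 'member' });

  const second = await fetchUser('u1'); // role: 'member'
  return first.role === second.role;
}
```

handleRequest() resolves to false: two lookups of the “same” user disagree. A request-scoped data loader would return the cached admin object both times.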

At Crimson I built a variety of recommendation engines which give our students ideas on how they can improve their college candidacy. Recommendations include things like extracurricular activities, book suggestions, and more.

We also have an all-in-one “strategy recommender” available to our students. This recommender uses all of the individual recommendation engines to produce a single multi-year roadmap which guides the student all the way through their high school career. It’s a complicated feature with a lot of data flowing through it; a single run results in about 140 prompts being sent to GPT. This feature actually helped me identify an issue in OpenAI’s infrastructure that caused 0.28% of prompts to hang indefinitely.

It’s very important here that all of the subsystems involved in the recommendation process have a consistent view of reality. If one module fetches different student profile data from another due to a data race, then the cohesion of the strategy recommendation will suffer. At best, our students will come away from the experience disappointed. At worst, we give them conflicting recommendations that—when followed—end up hurting their chance of admission to their dream school.

The caching behavior offered by data loaders completely solves this problem at the architectural level. Data flows can be reasoned about in isolation, and engineers can retain confidence that the overall system will end up working as expected. When we refactor our authorization logic to use a data loader, the result of the first lookup is saved to the loader’s internal cache. When we go on to look up the user a second time, the data is read directly from that cache rather than being requested from the database.

// Authorization helper
async function validateAccess(ctx: Context) {
  // This call will issue a database query
  const user = await ctx.scope.loaders.usersById.load(
    ctx.auth.userId,
  );

  if (user.role !== 'admin') {
    throw new UnauthorizedError();
  }
}

// API request handler
async function endpoint(ctx: Context) {
  await validateAccess(ctx);

  // ... and this one will read the `user` from cache;
  // the same object as the one retrieved in `validateAccess`
  const user = await ctx.scope.loaders.usersById.load(
    ctx.auth.userId,
  );
  const email = await ctx.scope.aiService.generateEmail({
    user,
    // ...
  });

  ctx.status = 200;
  ctx.body = { email };
}

We get a performance win, which is nice. But the real victory is that we were able to remove a nasty source of bugs from our system with almost zero effort on our part. Even better, we can retain confidence in our system as it evolves over time because our use of data loaders has given us a systematic improvement to our codebase. We don’t need to continuously solve one-off data race issues, because the tools we’ve chosen help us fall into the pit of success.

Conclusion

When building GraphQL systems, you want to be using data loaders for almost all of your data accesses. The world overindexes on their ability to solve N+1 query patterns when resolving relational data, but they’re actually a tremendously powerful abstraction that comes with other benefits as well.

While the N+1 query problem is largely unique to GraphQL, the other problems solved by data loaders are far more universal in the world of backend systems. Every backend—whether it’s RESTful, GraphQL, gRPC, or something else—needs to deal with data races, batching of work, and caching. Data loaders are a particularly elegant solution for all of these concerns.

Next time you consider calling findOne or findMany directly on your ORM, think twice.



  1. Not even just from the perspective of refactoring the actual app; there’s probably also a lot of tests that will need updating if you proceed with this refactor!
