1 July, 2024 · 7 minute read

Bulletproof GraphQL polling

At Crimson one of the first projects undertaken by my AI team was to produce high-quality meeting summaries. Document summarization is a great use case for LLMs, but it is also a fairly slow process. Our first implementation took ~90s to summarize a typical Crimson strategy session between a consultant and a student, and this made a synchronous request/response model impossible. Amazon API Gateway—used throughout Crimson to ship APIs—has a hard 29s maximum request timeout.

We didn’t want to use GraphQL subscriptions because introducing WebSockets into the mix would have added a significant amount of complexity to our system. This meant we needed to model this asynchronous operation within the GraphQL schema itself.

We needed to figure out a good design. At the time there weren’t any good guides available online and we had a very tight deadline so we wound up winging it. The design ended up OK, but wasn’t entirely perfect.

There are lots of use cases which need to be implemented as long-running asynchronous operations. While I’ve indexed heavily on document summarization so far, there are many other examples we could consider. A good, reusable pattern for modeling this kind of operation in GraphQL would have been tremendously valuable to us.

This article steps through what is—in my opinion—the best design. Before we get there, though, we’ll take a look at some less than perfect designs and consider their limitations.

A few bad schemas

A very simple design might look like this:

```graphql
enum SummaryStatus {
  DONE
  FAILED
  PENDING
}

type Summary {
  id: ID!
  content: String
  status: SummaryStatus!
}

type Mutation {
  summaryCreate(input: SummaryCreateInput!): Summary!
}

type Query {
  summary(id: ID!): Summary
}
```

Whenever you see a field like state, status, or type in your GraphQL schema you should run for the hills. These fields are quite natural in modern programming languages with support for discriminated or tagged union types, but are a common source of confusion when used in GraphQL’s simple and inexpressive type system.

Consider the content field. Our summary has content if it’s in the DONE state, and has no content when it is in either of the other two states. There is no way of encoding this invariant in the GraphQL schema, so we are forced into typing content as a nullable string field. It would be much better if we were able to make content non-null when the summary transitions to the DONE state.

This is possible to do by deleting the status field and leveraging union types. We’ll add a concrete type for each state a summary can be in—implementing a common base interface is optional—and then type the summary and summaryCreate fields so they return a union of these concrete types.

Our FailedSummary and PendingSummary types wouldn’t have a content field at all, while our DoneSummary type would have content: String!. Perfectly type-safe, and this approach scales as we add new fields for each state (example: an error field on FailedSummary, or eta on PendingSummary).

Here’s what the improved schema using unions looks like:

```graphql
type FailedSummary {
  id: ID!
  # no content
}

type PendingSummary {
  id: ID!
  # no content
}

type DoneSummary {
  id: ID!
  # non-null!
  content: String!
}

union Summary = FailedSummary | PendingSummary | DoneSummary

type Mutation {
  summaryCreate(input: SummaryCreateInput!): Summary!
}

type Query {
  summary(id: ID!): Summary
}
```
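To make the type-safety claim concrete, here is a minimal TypeScript sketch of how a client might consume this union. It assumes the client selects the standard `__typename` meta field on each result; the type aliases and the `describeSummary` helper are illustrative names, not something from the Crimson codebase.

```typescript
// Hypothetical client-side mirror of the union schema above.
// `__typename` is the standard GraphQL meta field that identifies which
// member of the union was returned.
type FailedSummary = { __typename: 'FailedSummary'; id: string };
type PendingSummary = { __typename: 'PendingSummary'; id: string };
type DoneSummary = { __typename: 'DoneSummary'; id: string; content: string };

type Summary = FailedSummary | PendingSummary | DoneSummary;

// Illustrative helper: turn any summary state into something displayable.
function describeSummary(summary: Summary): string {
  switch (summary.__typename) {
    case 'DoneSummary':
      // Narrowed to DoneSummary, so `content` is a plain string here and
      // no null check is needed.
      return summary.content;
    case 'PendingSummary':
      return 'Still summarizing...';
    case 'FailedSummary':
      return 'Summarization failed.';
  }
}
```

Trying to read `summary.content` outside the `DoneSummary` branch is a compile-time error, which is exactly the invariant the nullable `content` field could not express.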

But this design still isn’t quite perfect. There are three major flaws that I can see:

  1. This solution is very tightly coupled to the specifics of the particular problem being solved. Summarization is only one kind of asynchronous operation; if we have other such operations (machine translation, perhaps) then we need to duplicate a bunch of schema boilerplate.
  2. There are still cases where type safety is difficult. Although unexpressive languages like Go can work around their lack of discriminated unions—check out go-graphql-client—not all such languages will have great tooling available.
  3. We can’t block another query behind a summarization job. Imagine we have an analytics query field, and that we wanted to retrieve some data from that field after finishing the summarization job. The current schema design forces us to either load analytics eagerly every time we poll for the summary, or to make a waterfalled network request after we finish polling. Both options kind of suck.

This schema design is what I implemented at Crimson, but there is a better design that solves for these remaining concerns. It’s creative, and I think you’ll like it.

The optimal solution

Let’s tackle the first problem. The previous schema we looked at was serviceable, but the way in which we modeled our asynchronous summarization method was very tightly coupled to the summarization feature.

If we wanted to support asynchronous machine translation, for instance, then one could easily imagine that we’d need FailedTranslation, PendingTranslation, and DoneTranslation types. This looks awfully similar to the *Summary types, and it would be nice if we could remove some of this boilerplate.

Let’s fix this by introducing a “job” abstraction. All asynchronous operations in our schema will return a job object which tracks the status of the job.

```graphql
interface Job {
  id: ID!
}

type SummaryJob implements Job {
  id: ID!
  summary: Summary
}

type SummaryCreateResult {
  job: SummaryJob!
}

type Mutation {
  summaryCreate(
    input: SummaryCreateInput!
  ): SummaryCreateResult!
}

type Query {
  job(id: ID!): Job
  summary(id: ID!): Summary
}
```

Additional fields can be added to Job depending on your use case. The naming here is quite nice; I personally think that Job.eta is more elegant than PendingSummary.eta.

This solves our first concern. We can reuse the Job interface for all of our asynchronous operations, and we still have flexibility to use more granular union types or to add job-specific fields to the implementing types (e.g. SummaryJob) if necessary.

We’ve also managed to remove the unions—at least for now—which makes consuming code simpler in languages with a more primitive type system.

The final limitation we need to work around is figuring out a way of “blocking” queries on the status of our job. In an ideal world, we’d be able to send out a single GraphQL document which polls for our completed summary job and fetches some analytics if the job completed. We don’t want to fetch the analytics if the job is pending because that would be inefficient, and we also want to avoid waterfalling a network request.

The fix here is to directly return the Query type from job. This will let the user provide an arbitrary query nested under job. If the job hasn’t finished yet then we’ll short-circuit and return null instead of resolving that query.
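In schema terms, that change might look roughly like the fragment below, written as a TypeScript `typeDefs` string so it can be dropped into a schema-first server. This is my reading of the design rather than a definitive schema: the `typeDefs` name is arbitrary, and the `Summary` type, the mutation, and the analytics field are left out for brevity.

```typescript
// A sketch of the schema described above. Only the parts relevant to
// `Job.query` are shown; the rest of the schema is assumed to be unchanged.
const typeDefs = /* GraphQL */ `
  interface Job {
    id: ID!
    # Null while the job is still running, the root Query type once it is done.
    query: Query
  }

  type SummaryJob implements Job {
    id: ID!
    query: Query
    summary: Summary
  }

  type Query {
    job(id: ID!): Job
    summary(id: ID!): Summary
  }
`;
```

The field has to be nullable: returning null is how the server signals that the nested selection was skipped.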

Consuming this schema looks like so:

```graphql
query FetchSummary($jobId: ID!, $summaryId: ID!) {
  job(id: $jobId) {
    query {
      analytics {
        # ...
      }
      summary(id: $summaryId) {
        ...summaryFragment
      }
    }
  }
}
```
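On the client, polling then boils down to re-sending this one document until `job.query` comes back non-null. The sketch below assumes an `execute` function that sends the FetchSummary query through whatever GraphQL client you use; `execute`, `pollUntilDone`, and the option names are all hypothetical, and the default interval and attempt count are arbitrary.

```typescript
// Placeholder shape for the data nested under `job.query`.
type JobQueryData = { analytics: unknown; summary: unknown };

// Shape of the response to the FetchSummary query above.
type FetchSummaryResult = { job: { query: JobQueryData | null } | null };

// Stand-in for your GraphQL client (hypothetical).
type Execute = () => Promise<FetchSummaryResult>;

interface PollOptions {
  intervalMs?: number;
  maxAttempts?: number;
  // Injectable so tests do not have to wait on real timers.
  sleep?: (ms: number) => Promise<void>;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollUntilDone(
  execute: Execute,
  options: PollOptions = {},
): Promise<JobQueryData> {
  const { intervalMs = 2000, maxAttempts = 60, sleep = defaultSleep } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { job } = await execute();
    if (job === null) {
      throw new Error('Job not found');
    }
    // `job.query` is null while the job is pending. A non-null value means
    // the job finished and the nested fields were resolved in this request.
    if (job.query !== null) {
      return job.query;
    }
    // Wait before the next attempt, but not after the final one.
    if (attempt < maxAttempts) {
      await sleep(intervalMs);
    }
  }
  throw new Error('Timed out waiting for job to finish');
}
```

Because the analytics and the summary arrive in the same response that reports completion, there is no follow-up request to waterfall.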

You can, of course, keep SummaryJob.summary if you so desire. It’s a convenient shorthand analogous to lifting up nodes to the top of a Relay-style connection type.

Circling back to the idea of adding more fields to Job, this final design really benefits from a done: Boolean! field. This lets consumers handle the case where they are truly only interested in whether the job completed, and don’t need to query for side effects of the job.

Handling errors

One thing missing from our above schema is error handling. There are many ways of modeling this, and the correct option might differ based on your exact use case.

One pattern I am fond of is to leverage field aliases to support running a different query when the job has failed. Here’s what that looks like:

```graphql
query PollSummary($id: ID!) {
  job(id: $id) {
    done
    data: query(when: SUCCESS) {
      # ...
    }
    error: query(when: ERROR) {
      # ...
    }
  }
}
```

You do lose a bit of type strictness here in the case where your error and success states are mutually exclusive, because it’s not possible to model this invariant in the schema. You’ll end up with some redundant null checks inside your application. If that’s a particular concern to you, you can also make query a union type over Query | SummaryError.

I’d be careful with that approach, however, because it’s often the case that you want to be able to support partial failures of your asynchronous operation. In this case, the error and data fields aren’t mutually exclusive. The query(when: JobState) pattern still works well for this use case—the mental shift here is to think of your job as existing in multiple states at once—but you could also add a dedicated errors field to the Job type itself if you prefer.
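Server-side, the query(when: JobState) pattern can be sketched as a small resolver function. Everything here is an assumption made for illustration: the `JobRecord` shape, the `hasResults` and `hasErrors` flags, and defaulting `when` to SUCCESS when the argument is omitted.

```typescript
type JobState = 'SUCCESS' | 'ERROR';

// Hypothetical job record. `hasResults` and `hasErrors` can both be true,
// which is how a partial failure is represented.
interface JobRecord {
  id: string;
  done: boolean;
  hasResults: boolean;
  hasErrors: boolean;
}

// Returns an empty object (meaning "resolve the nested query") when the job
// is finished and matches the requested state, and null otherwise.
function resolveJobQuery(
  job: JobRecord,
  when: JobState = 'SUCCESS',
): Record<string, never> | null {
  if (!job.done) {
    return null;
  }
  const matches = when === 'SUCCESS' ? job.hasResults : job.hasErrors;
  return matches ? {} : null;
}
```

For a partially failed job both the `data` and `error` aliases resolve, which is the "multiple states at once" view of a job.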

There isn’t really a universally “best” method for modeling errors here. It’s highly contextual to your specific use case, and what “failure” means to your application.

How to implement a Query field

If the job isn’t done yet then return null. Otherwise you can simply return an empty object from the resolver, and your GraphQL server will take care of the rest. Here’s what it looks like in Pothos:

```typescript
// Step 1: Save a reference to the query type in a variable
const QueryType = builder.queryType({
  fields: (t) => ({
    job: t.field({
      nullable: true,
      type: Job,
      args: {
        id: t.arg.id({ required: true }),
      },
      resolve: async (_source, { id }) => {
        // resolve your job
      },
    }),
  }),
});

// Step 2: Make sure `Job.query` is specified in your schema
builder.objectType(Job, {
  name: 'Job',
  fields: (t) => ({
    // ...
    query: t.field({
      nullable: true,
      type: QueryType,
      // @ts-expect-error I get a type error here
      resolve: (job) => (job.done ? {} : null),
    }),
  }),
});
```

If you want to see a more concrete example, I have a sample GraphQL server available here for you to play with.

This looks deceptively simple. The key insight here is that there’s nothing special about the Query type; it’s just a regular object type with resolvers defined for its fields. Returning an empty object works for the same reason you don’t need to explicitly return a count field to resolve this Kanban board type:

```typescript
export const KanbanBoard = objectType('KanbanBoard', {
  definition(t) {
    t.nonNull.id('id');
    t.nonNull.list.nonNull.field('tasks', {
      type: Task,
    });
    t.nonNull.int('count', {
      resolve: (board) => board.tasks.length,
    });
  },
});
```
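If the empty-object trick still feels like magic, a toy executor makes the mechanics visible. The sketch below is not how any particular GraphQL server is implemented. It only mimics the general rule that each selected field is resolved by calling its resolver with the parent value, falling back to reading the property off the parent when no resolver is defined. `executeFields` and `queryResolvers` are made-up names.

```typescript
type Resolver = (parent: Record<string, unknown>) => unknown;

// Toy stand-in for a GraphQL executor: resolve each selected field against
// the parent value.
function executeFields(
  parent: Record<string, unknown>,
  resolvers: Record<string, Resolver>,
  selection: string[],
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const field of selection) {
    const resolver = resolvers[field];
    // Default behaviour when no resolver exists: read the property directly.
    result[field] = resolver ? resolver(parent) : parent[field];
  }
  return result;
}

// Root Query fields have resolvers that never look at their parent value.
const queryResolvers: Record<string, Resolver> = {
  summary: () => ({ id: 'summary-1', content: 'Hello' }),
};

// The parent is `{}`, yet `summary` still resolves.
const data = executeFields({}, queryResolvers, ['summary']);
```

The empty object is just a non-null placeholder parent. All of the real work happens in the field resolvers, the same way `count` is computed from `tasks` rather than stored on the board.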

Conclusion

Designing GraphQL schemas to support asynchronous operations in an elegant manner requires careful consideration and thoughtful structuring. The most obvious solution of introducing a union type over your objects to represent each state over the lifecycle of the operation works, but comes with some pretty heavy drawbacks. By adopting some systems-level thinking and introducing a Job abstraction, you get a reusable building block that decouples specific types in your schema from the technical details of polling for completion.

Additionally, providing a mechanism to block queries until certain jobs are completed enhances the efficiency and usability of the API. While error handling remains a context-dependent challenge, field aliases and unions are each robust ways to manage different kinds of failure modes.

This approach not only addresses the common pitfalls of simpler designs but also offers a scalable and elegant solution for complex, long-running tasks in GraphQL. Give it a try; I think you’ll really like it.

