Bulletproof GraphQL polling
At Crimson one of the first projects undertaken by my AI team was to produce high-quality meeting summaries. Document summarization is a great use case for LLMs, but it is also a fairly slow process. Our first implementation took ~90s to summarize a typical Crimson strategy session between a consultant and a student, and this made a synchronous request/response model impossible. Amazon API Gateway—used throughout Crimson to ship APIs—has a hard 29s maximum request timeout.
We didn’t want to use GraphQL subscriptions because introducing WebSockets into the mix would have added a significant amount of complexity to our system. This meant we needed to model this asynchronous operation within the GraphQL schema itself.
We needed to figure out a good design. At the time there weren’t any good guides available online and we had a very tight deadline so we wound up winging it. The design ended up OK, but wasn’t entirely perfect.
There are lots of use cases which need to be implemented as long-running asynchronous operations. While I’ve indexed heavily on document summarization so far, there are many other examples we could consider. A good, reusable pattern for modeling this kind of operation in GraphQL would have been tremendously valuable to us.
This article steps through what is—in my opinion—the best design. Before we get there, though, we’ll take a look at some less than perfect designs and consider their limitations.
A few bad schemas
A very simple design might look like this:
enum SummaryStatus {
  DONE
  FAILED
  PENDING
}

type Summary {
  id: ID!
  content: String
  status: SummaryStatus!
}

type Mutation {
  summaryCreate(input: SummaryCreateInput!): Summary!
}

type Query {
  summary(id: ID!): Summary
}
Whenever you see a field like state, status, or type in your GraphQL schema you should run for the hills. These fields are quite natural in modern programming languages with support for discriminated or tagged union types, but are a common source of confusion when used in GraphQL’s simple and inexpressive type system.
Consider the content field. Our summary has content if it’s in the DONE state, and has no content when it is in either of the other two states. There is no way of encoding this invariant in the GraphQL schema, so we are forced into typing content as a nullable string field. It would be much better if we were able to make content non-null when the summary transitions to the DONE state.
This is possible to do by deleting the status field and leveraging union types. We’ll add a concrete type for each state a summary can be in (implementing a common base interface is optional) and then type the summary and summaryCreate fields so they return a union of these concrete types.
Our FailedSummary and PendingSummary types wouldn’t have a content field at all, while our DoneSummary type would have content: String!. This is perfectly type-safe, and the approach scales as we add new fields for each state (for example, an error field on FailedSummary, or eta on PendingSummary).
Here’s what the improved schema using unions looks like:
type FailedSummary {
  id: ID!
  # no content
}

type PendingSummary {
  id: ID!
  # no content
}

type DoneSummary {
  id: ID!
  # non-null!
  content: String!
}

union Summary = FailedSummary | PendingSummary | DoneSummary

type Mutation {
  summaryCreate(input: SummaryCreateInput!): Summary!
}

type Query {
  summary(id: ID!): Summary
}
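On the client, a union like this maps naturally onto a TypeScript discriminated union. The following is a minimal sketch, assuming the query selects __typename on every member of the union; the describeSummary helper and its display strings are hypothetical and only there to show the narrowing.

```typescript
// Client-side mirror of the Summary union. Assumes the query selects
// __typename on each member so TypeScript can narrow on it.
type FailedSummary = { __typename: "FailedSummary"; id: string };
type PendingSummary = { __typename: "PendingSummary"; id: string };
type DoneSummary = { __typename: "DoneSummary"; id: string; content: string };

type Summary = FailedSummary | PendingSummary | DoneSummary;

// Hypothetical helper: turns a summary into a line of display text.
function describeSummary(summary: Summary): string {
  switch (summary.__typename) {
    case "DoneSummary":
      // content is a plain string in this branch; no null check is needed.
      return summary.content;
    case "PendingSummary":
      return "Still summarizing...";
    case "FailedSummary":
      return "Summarization failed.";
  }
}
```

Accessing summary.content outside the DoneSummary branch is a compile-time error, which is exactly the invariant the nullable content field could not express.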
But this design still isn’t quite perfect. There are three major flaws that I can see:
- This solution is very tightly coupled to the specifics of the particular problem being solved. Summarization is only one kind of asynchronous operation; if we have other such operations (machine translation, perhaps) then we need to duplicate a bunch of schema boilerplate.
- There are still cases where type safety is difficult. Although unexpressive languages like Go can work around their lack of discriminated unions (check out go-graphql-client), not all such languages will have great tooling available.
- We can’t block another query behind a summarization job. Imagine we have an analytics query field, and that we wanted to retrieve some data from that field after finishing the summarization job. The current schema design forces us to either load analytics eagerly every time we poll for the summary, or to make a waterfalled network request after we finish polling. Both options kind of suck.
This schema design is what I implemented at Crimson, but there is a better design that solves for these remaining concerns. It’s creative, and I think you’ll like it.
The optimal solution
Let’s tackle the first problem. The previous schema we looked at was serviceable, but the way in which we modeled our asynchronous summarization method was very tightly coupled to the summarization feature.
If we wanted to support asynchronous machine translation, for instance, then one could easily imagine that we’d need FailedTranslation, PendingTranslation, and DoneTranslation types. This looks awfully similar to the *Summary types, and it would be nice if we could remove some of this boilerplate.
Let’s fix this by introducing a “job” abstraction. All asynchronous operations in our schema will return a job object which tracks the status of the job.
interface Job {
  id: ID!
}

type SummaryJob implements Job {
  id: ID!
  summary: Summary
}

type SummaryCreateResult {
  job: SummaryJob!
}

type Mutation {
  summaryCreate(
    input: SummaryCreateInput!
  ): SummaryCreateResult!
}

type Query {
  job(id: ID!): Job
  summary(id: ID!): Summary
}
Additional fields can be added to Job depending on your use case. The naming here is quite nice; I personally think that Job.eta is more elegant than PendingSummary.eta.
This solves our first concern. We can reuse the Job interface for all of our asynchronous operations, and we still have flexibility to use more granular union types or to add job-specific fields to the implementing types (e.g. SummaryJob) if necessary.
We’ve also managed to remove the unions—at least for now—which makes consuming code simpler in languages with a more primitive type system.
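To make the job abstraction a little more concrete, here is a rough sketch of what the server side of summaryCreate and the job query field might do. Everything in it is an assumption for illustration: the in-memory jobs map stands in for real job storage, and the summarize callback stands in for the slow LLM call.

```typescript
// Mirrors the Job interface and SummaryJob type from the schema.
interface Job {
  id: string;
}
interface Summary {
  id: string;
  content: string;
}
interface SummaryJob extends Job {
  summary: Summary | null;
}

// Hypothetical in-memory store; a real system would persist jobs somewhere.
const jobs = new Map<string, Job>();

// Sketch of summaryCreate: register the job, start the slow work without
// awaiting it, and return the job straight away.
function summaryCreate(
  text: string,
  summarize: (text: string) => Promise<string>,
): { job: SummaryJob } {
  const job: SummaryJob = { id: `job-${jobs.size + 1}`, summary: null };
  jobs.set(job.id, job);
  void summarize(text)
    .then((content) => {
      job.summary = { id: `summary-for-${job.id}`, content };
    })
    .catch(() => {
      // A real implementation would record the failure on the job here.
    });
  return { job };
}

// Sketch of the job query field: one lookup shared by every kind of job.
function findJob(id: string): Job | null {
  return jobs.get(id) ?? null;
}
```

The point of the sketch is that only summaryCreate knows anything about summaries; the lookup works unchanged for any other asynchronous operation that registers a job.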
The final limitation we need to work around is figuring out a way of “blocking” queries on the status of our job. In an ideal world, we’d be able to send out a single GraphQL document which polls for our completed summary job and fetches some analytics if the job completed. We don’t want to fetch the analytics if the job is pending because that would be inefficient, and we also want to avoid waterfalling a network request.
The fix here is to add a query field to Job that returns the Query type itself. This lets the user provide an arbitrary query nested under job. If the job hasn’t finished yet then we’ll short-circuit and return null instead of resolving that query.
Consuming this schema looks like so:
query FetchSummary($jobId: ID!, $summaryId: ID!) {
  job(id: $jobId) {
    query {
      analytics {
        # ...
      }
      summary(id: $summaryId) {
        ...summaryFragment
      }
    }
  }
}
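A client can poll with this document by re-sending it until the nested query field comes back non-null. The helper below is a minimal sketch of that loop, not a recommendation of any particular GraphQL client: pollJob, fetchOnce, intervalMs, and maxAttempts are all hypothetical names, and fetchOnce is assumed to send the FetchSummary document and return its data.

```typescript
// Shape of the FetchSummary response; query stays null until the job finishes.
type JobPollData<T> = { job: { query: T | null } | null };

// Hypothetical helper: re-sends the same document until job.query resolves.
async function pollJob<T>(
  fetchOnce: () => Promise<JobPollData<T>>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const { job } = await fetchOnce();
    if (job === null) throw new Error("Job not found");
    // A non-null query means the job finished and the nested fields resolved.
    if (job.query !== null) return job.query;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("Timed out waiting for job");
}
```

Because the analytics and summary fields sit under the nested query, they arrive in the same response that reports completion, so no follow-up request is needed.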
You can, of course, keep SummaryJob.summary if you so desire. It’s a convenient shorthand analogous to lifting up nodes to the top of a Relay-style connection type.
Circling back to the idea of adding more fields to Job, this final design really benefits from a done: Boolean! field. This lets consumers handle the case where they are truly only interested in whether the job completed, and don’t need to query for side effects of the job.
Handling errors
One thing missing from our above schema is error handling. There are many ways of modeling this, and the correct option might differ based on your exact use case.
One pattern I am fond of is to leverage field aliases to support running a different query when the job has failed. Here’s what that looks like:
query PollSummary($id: ID!) {
  job(id: $id) {
    done
    data: query(when: SUCCESS) {
      # ...
    }
    error: query(when: ERROR) {
      # ...
    }
  }
}
You do lose a bit of type strictness here in the case where your error and success states are mutually exclusive, because it’s not possible to model this invariant in the schema. You’ll end up with some redundant null checks inside your application. If that’s a particular concern to you, you can also make query a union type over Query | SummaryError.
I’d be careful with that approach, however, because it’s often the case that you want to be able to support partial failures of your asynchronous operation. In this case, the error and data fields aren’t mutually exclusive. The query(when: JobState) pattern still works well for this use case (the mental shift here is to think of your job as existing in multiple states at once), but you could also add a dedicated errors field to the Job type itself if you prefer.
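On the consuming side, the redundant null checks can be collected in one place. The sketch below shows one way a client might interpret the aliased data and error fields, including the partial-failure case. The Outcome type and the interpret function are hypothetical, and the sketch only looks at the two aliases; it deliberately ignores the done field.

```typescript
// Shape of the two aliased fields in the PollSummary response. Data and
// Failure stand in for whatever is selected under each alias.
type PollResult<Data, Failure> = {
  job: { data: Data | null; error: Failure | null };
};

// Hypothetical client-side view of the states a job can be observed in.
type Outcome<Data, Failure> =
  | { kind: "pending" }
  | { kind: "success"; data: Data }
  | { kind: "failure"; error: Failure }
  | { kind: "partial"; data: Data; error: Failure };

function interpret<Data, Failure>(
  result: PollResult<Data, Failure>,
): Outcome<Data, Failure> {
  const { data, error } = result.job;
  // Both aliases resolved: the job succeeded in part and failed in part.
  if (data !== null && error !== null) return { kind: "partial", data, error };
  if (data !== null) return { kind: "success", data };
  if (error !== null) return { kind: "failure", error };
  // Neither alias resolved, so neither state has been reached yet.
  return { kind: "pending" };
}
```

If your success and error states really are mutually exclusive, the partial case simply never occurs and can be dropped from the type.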
There isn’t really a universally “best” method for modeling errors here. It’s highly contextual to your specific use case, and what “failure” means to your application.
How to implement a Query field
If the job isn’t done yet then return null. Otherwise you can simply return an empty object from the resolver, and your GraphQL server will take care of the rest. Here’s what it looks like in Pothos:
// Step 1: Save a reference to the query type in a variable
const QueryType = builder.queryType({
  fields: (t) => ({
    job: t.field({
      nullable: true,
      type: Job,
      args: {
        id: t.arg.id({ required: true }),
      },
      resolve: async (_source, { id }) => {
        // resolve your job
      },
    }),
  }),
});

// Step 2: Make sure `Job.query` is specified in your schema
builder.objectType(Job, {
  name: 'Job',
  fields: (t) => ({
    // ...
    query: t.field({
      nullable: true,
      type: QueryType,
      // @ts-expect-error I get a type error here
      resolve: (job) => (job.done ? {} : null),
    }),
  }),
});
If you want to see a more concrete example, I have a sample GraphQL server available here for you to play with.
This looks deceptively simple. The key insight here is that there’s nothing special about the Query type; it’s just a regular object type with resolvers defined for its fields. Returning an empty object works for the same reason you don’t need to explicitly return a count field to resolve this Kanban board type:
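import { objectType } from 'nexus';

export const KanbanBoard = objectType({
  name: 'KanbanBoard',
  definition(t) {
    t.nonNull.id('id');
    t.nonNull.list.nonNull.field('tasks', {
      type: Task,
    });
    // `count` is derived from the parent board, so it never has to be stored on it.
    t.nonNull.int('count', {
      resolve: (board) => board.tasks.length,
    });
  },
});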
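If it helps to see why the empty object is enough, here is a deliberately tiny model of field resolution in plain TypeScript. It is an illustration under simplifying assumptions (no arguments, no nested selections, no async), not how Pothos or any other GraphQL server is actually implemented. All names and sample values in it are hypothetical.

```typescript
// Toy model: every field on a type has a resolver that receives the parent value.
type Resolver<Parent> = (parent: Parent) => unknown;
type Resolvers<Parent> = Record<string, Resolver<Parent>>;

function resolveFields<Parent>(
  resolvers: Resolvers<Parent>,
  parent: Parent,
  selection: string[],
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const field of selection) {
    const resolve = resolvers[field];
    if (!resolve) throw new Error(`Unknown field: ${field}`);
    result[field] = resolve(parent);
  }
  return result;
}

// Root Query resolvers never read their parent, so any object will do.
const queryResolvers: Resolvers<object> = {
  analytics: () => ({ views: 42 }),
  summary: () => ({ content: "Meeting notes" }),
};

type JobRecord = { done: boolean };

const jobResolvers: Resolvers<JobRecord> = {
  done: (job) => job.done,
  // Same idea as the Job.query resolver in the Pothos example: hand an
  // empty object to the Query resolvers when done, otherwise return null.
  query: (job) =>
    job.done ? resolveFields(queryResolvers, {}, ["analytics", "summary"]) : null,
};
```

Resolving done and query against a finished job yields the analytics and summary data, while a pending job yields null for query without any of the Query resolvers running.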
Conclusion
Designing GraphQL schemas to support asynchronous operations in an elegant manner requires careful consideration and thoughtful structuring. The most obvious solution of introducing a union type over your objects to represent each state over the lifecycle of the operation works, but comes with some pretty heavy drawbacks. By adopting some systems-level thinking and introducing a Job abstraction, you get a reusable building block that decouples specific types in your schema from the technical details of polling for completion.
Additionally, providing a mechanism to block queries until certain jobs are completed enhances the efficiency and usability of the API. While error handling remains a context-dependent challenge, field aliases and unions are both robust ways to manage different kinds of failure modes.
This approach not only addresses the common pitfalls of simpler designs but also offers a scalable and elegant solution for complex, long-running tasks in GraphQL. Give it a try; I think you’ll really like it.