26 December, 2023 · 6 minute read

Making breaking GraphQL changes without breaking anything

We had an issue the other month with our internal summarization microservice, which necessitated what would otherwise be a breaking change. This service has always supported attaching arbitrary metadata to summaries in the form of key-value pairs (à la Stripe), but didn’t have a way of passing options to the chosen summarization method.

This is a big limitation, and prevented us from letting end users control things like the tone of voice used, or final length of the produced summary.

The initial solution was to simply reuse our Metadata scalar type:

```graphql
type Mutation {
  summaryCreate(
    prompt_id: ID!,
    text: String!,
    metadata: Metadata,
    # new optional options bag
    options: Metadata,
  ): Summary!
  # ...
}
```

This solution worked well up until we realized that in some cases we actually need to be able to pass in highly structured data as options, which of course doesn’t work with our Metadata scalar type’s validation logic. Whoops.
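To make the mismatch concrete, here is a minimal sketch of the kind of validation a Stripe-style key-value scalar might perform. The `isValidMetadata` name and the exact rule (a flat object whose values are all strings) are assumptions for illustration, not the service's actual validation logic:

```typescript
// Hypothetical rule: metadata is a flat object mapping string keys to string values.
function isValidMetadata(value: unknown): value is Record<string, string> {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  return Object.values(value).every((v) => typeof v === "string");
}

// Simple options fit the key-value shape.
const flatOptions = { tone: "formal", length: "short" };
// Structured options contain a nested object, so they are rejected.
const structuredOptions = { tone: "formal", length: { min: 50, max: 200 } };

console.log(isValidMetadata(flatOptions)); // true
console.log(isValidMetadata(structuredOptions)); // false
```

Any option that needs nesting fails a check like this, which is why reusing the scalar stopped being viable.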

In theory we can swap the type of options from Metadata to an arbitrary JSON [1] scalar type, but this is a breaking change. Any existing graph consumers running this mutation with a Metadata-typed options value will immediately fail type checking at runtime, even though all valid Metadata values are also valid JSON values.

This is because GraphQL scalars are nominal types, and there’s no mechanism for indicating hierarchy or relationships between custom scalar types. Scalar types are—as far as your GraphQL server is concerned—black boxes.
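The following is a deliberately simplified sketch of the check a GraphQL validator applies when a variable is passed to an argument. It only handles named types with an optional `!` suffix (lists and default values are ignored), and the function name is my own rather than anything from a real library:

```typescript
// Simplified sketch of the "is this variable allowed in this position" check.
// List types and variable default values are deliberately ignored here.
function isVariableUsageAllowed(variableType: string, argumentType: string): boolean {
  const variableNonNull = variableType.endsWith("!");
  const argumentNonNull = argumentType.endsWith("!");

  // A nullable variable can never flow into a non-null argument.
  if (argumentNonNull && !variableNonNull) return false;

  // Beyond nullability, the named types must match exactly. The comparison is
  // by name, so the validator cannot know that one scalar's values fit another.
  const strip = (t: string) => t.replace(/!$/, "");
  return strip(variableType) === strip(argumentType);
}

console.log(isVariableUsageAllowed("Metadata", "Metadata")); // true
console.log(isVariableUsageAllowed("Metadata", "JSON")); // false
console.log(isVariableUsageAllowed("ID!", "ID")); // true
```

The second call is our situation: the old client still declares `$options: Metadata`, and the name no longer matches.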

Good to know

This whole issue can be avoided in the first place by simply using an input object parameter instead of putting all the arguments directly on the mutation. The wrapping CreateSummaryInput type signature doesn’t change when an individual field’s scalar type changes—so the request still passes type checks.
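As a rough illustration, compare the operation documents a client would send in each design. The input-object variant is hypothetical (our mutation does not look like this), and the `input` argument name is an assumption:

```typescript
// Operation for arguments that live directly on the mutation. The variable
// declaration names the scalar, so it breaks if the type of `options` changes.
const topLevelArgsOperation = `
  mutation CreateSummary($promptId: ID!, $text: String!, $options: Metadata) {
    summaryCreate(prompt_id: $promptId, text: $text, options: $options) {
      __typename
    }
  }
`;

// Operation for a hypothetical input-object variant. The client only ever names
// CreateSummaryInput, so the scalar type of its `options` field never appears.
const inputObjectOperation = `
  mutation CreateSummary($input: CreateSummaryInput!) {
    summaryCreate(input: $input) {
      __typename
    }
  }
`;

console.log(topLevelArgsOperation.includes("Metadata")); // true
console.log(inputObjectOperation.includes("Metadata")); // false
```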

We could also fix this issue in a backwards-compatible manner by relaxing the constraints on Metadata such that it’s possible to store more than simple key-value pairs, but this also isn’t ideal. We support filtering based on metadata values, and this can be done much more efficiently when metadata complexity is constrained.

Constraints are just a good thing to have in general. Having good guard rails in place combined with a product-oriented approach to graph design yields a highly capable GraphQL service that is delightful to build on.

Fortunately, it’s possible to both have our cake and eat it by rewriting incoming queries before they hit our GraphQL framework.

GraphQL query rewriter

GraphQL query rewriter is a small JavaScript library which applies “rewriters” to GraphQL queries. The library has an accompanying express middleware, which lets you run it before queries actually hit your GraphQL server.

Rewriters allow you to transparently migrate incoming queries to work around what would otherwise be backwards-incompatible schema changes. For instance, updating a query from user(id: String!): User! to user(id: ID!): User! would ordinarily break any consumers using the old query despite the ID scalar technically being a superset of String.

With GraphQL query rewriter, it is possible to make this schema change and simply rewrite the old queries to use the new type. This approach to handling breaking changes is conceptually similar to Stripe’s “version change modules.”

In the case of my summaryCreate mutation I was able to use this exact rewriter to evolve my schema without causing breakage or needing to resort to less elegant solutions like introducing a summaryCreateV2 mutation.

```typescript
// ...
const rewriters: Rewriter[] = [
  // ...
  new FieldArgTypeRewriter({
    fieldName: 'summaryCreate',
    argName: 'options',
    oldType: 'Metadata',
    newType: 'JSON',
  }),
  // ...
];
// ...
```
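Conceptually, a rewriter like this finds the variable that feeds the targeted argument and changes its declared type. The sketch below shows that idea on a toy data structure. It is not the library's implementation, and every type and function name in it is made up for illustration:

```typescript
// Toy stand-ins for the parts of a parsed operation that matter here.
interface VariableDefinition { name: string; type: string }
interface FieldArgument { name: string; variable: string }
interface FieldCall { name: string; args: FieldArgument[] }
interface Operation { variables: VariableDefinition[]; fields: FieldCall[] }

interface ArgTypeRule { fieldName: string; argName: string; oldType: string; newType: string }

// Returns a copy of `op` where any variable passed to the matching field
// argument and declared with `oldType` is redeclared as `newType`.
function rewriteArgType(op: Operation, rule: ArgTypeRule): Operation {
  const affected = new Set<string>();
  for (const field of op.fields) {
    if (field.name !== rule.fieldName) continue;
    for (const arg of field.args) {
      if (arg.name === rule.argName) affected.add(arg.variable);
    }
  }
  return {
    ...op,
    variables: op.variables.map((v) =>
      affected.has(v.name) && v.type === rule.oldType ? { ...v, type: rule.newType } : v
    ),
  };
}

const legacy: Operation = {
  variables: [
    { name: "text", type: "String!" },
    { name: "options", type: "Metadata" },
  ],
  fields: [
    {
      name: "summaryCreate",
      args: [
        { name: "text", variable: "text" },
        { name: "options", variable: "options" },
      ],
    },
  ],
};

const rewritten = rewriteArgType(legacy, {
  fieldName: "summaryCreate",
  argName: "options",
  oldType: "Metadata",
  newType: "JSON",
});

// `options` is now declared as JSON, while `text` is left alone.
console.log(rewritten.variables);
```

Operations that already declare the new type pass through unchanged, so old and new clients can coexist.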

The library provides a number of rewriters out of the box. One particularly useful one is FieldArgsToInputTypeRewriter which can be used to migrate a mutation which takes a bunch of top-level arguments (like summaryCreate) over to taking a single input parameter.
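The transformation behind that kind of migration is easy to picture. Here is a toy version operating on a plain map of argument values rather than a real query AST. The `argsToInput` helper and the sample values are hypothetical:

```typescript
// Toy model: a field call's arguments as a plain name-to-value map.
type Args = Record<string, unknown>;

// Moves the listed top-level arguments under a single input argument and
// leaves any other arguments where they were.
function argsToInput(args: Args, argNames: string[], inputArgName = "input"): Args {
  const input: Args = {};
  const rest: Args = {};
  for (const [name, value] of Object.entries(args)) {
    if (argNames.includes(name)) input[name] = value;
    else rest[name] = value;
  }
  return { ...rest, [inputArgName]: input };
}

const legacyArgs = { prompt_id: "p_1", text: "Some long text", options: { tone: "formal" } };

// All three arguments end up nested under a single `input` key.
console.log(argsToInput(legacyArgs, ["prompt_id", "text", "options"]));
```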

But because rewriters are just bits of code that modify the AST of incoming queries, it is possible to perform any arbitrary rewrite so long as you are willing to implement it. You’re only really limited by your imagination.

Downsides of rewriting queries

In my view there are two major considerations to keep in mind when leveraging this technique. The first one is that it just doesn’t play nicely with existing GraphQL tooling.

GraphiQL, for instance, leverages schema introspection to support autocompletion and typechecking of queries inside your browser. Because the rewriter rules aren’t formally advertised as part of your schema, the end result is GraphiQL “incorrectly” complaining about queries which actually are legitimate inputs for your server.

While the GraphiQL example is relatively benign, the larger issue is that anything which consumes your GraphQL schema can end up breaking. Most teams that are all-in on GraphQL are using codegen tools to benefit from stronger type-safety guarantees and increased automation, and these tools really don’t play nicely with query rewriting.

In the case of codegen, these backwards-compatible breaking changes can still break your build. The default behavior for a lot of codegen tools is to throw an error and bail out of actually emitting code when one of your GraphQL documents fails to typecheck, forcing you to update your client then and there.

Older deployed versions of your app will continue to work just fine [2], but you won’t be able to defer the work of updating your codebase if you want to deploy anything new. That can be a pretty big drag on team velocity.

Stripe avoids this issue in their REST API by distributing versioned schema files, and in theory it’s possible to do this in GraphQL as well.

You could take a snapshot of your schema as it exists before making your backwards-incompatible change and then swap your consumers over to using that snapshot instead of the live schema. This breaks the convention of GraphQL services being unversioned, but it does allow you to ship new versions of your consumers without needing to migrate to the new schema upfront.
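As a sketch of what that could look like on the consumer side, assuming the consumer uses GraphQL Code Generator (other codegen tools have an equivalent setting), the config would point at a committed snapshot file instead of the live endpoint. All of the paths here are hypothetical:

```typescript
// codegen.ts: a sketch assuming GraphQL Code Generator; all paths are hypothetical.
const config = {
  // Point at the committed snapshot instead of introspecting the live service.
  schema: "./schema.snapshot.graphql",
  documents: ["src/**/*.graphql"],
  generates: {
    "./src/generated/types.ts": {
      plugins: ["typescript", "typescript-operations"],
    },
  },
};

export default config;
```

Migrating to the new schema then becomes a matter of replacing the snapshot file whenever the team is ready.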

Which leads nicely into the second drawback of this technique: it gets less ergonomic the more you lean on it. Disrupting your tooling once in a while for a particularly high impact schema update is probably fine for most teams, but you really don’t want to use query rewriting as a crutch for bad schema design.

Because GraphQL has so much “smart” client tooling that leverages service schema files, this kind of out-of-band request rewriting is never going to be fully backwards-compatible with those tools. If you use schema-aware tools—as you should be!—then query rewriting is only really useful for maintaining support of old consumers.

Languages other than JavaScript

The GraphQL query rewriter library I’m referring to in this post is only available for JavaScript, but the general concept of rewriting GraphQL queries is hardly unique to a single language. While I’m not aware of any analogous library available off-the-shelf in other languages, the library itself doesn’t really do all that much.

All you need to do is run some code against the incoming query’s AST before passing it over to your server. Most GraphQL servers support middleware / plugins, and oftentimes you’ll have access to the AST at that point which obviates the need for you to parse the query yourself.
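The shape of such a hook is roughly the same everywhere. Below is a framework-free sketch of an Express-style middleware. The request type is a minimal stand-in I've defined myself, and the string replacement is a naive placeholder for a real AST rewrite:

```typescript
// Minimal request shape so the sketch needs no framework types.
interface GraphQLRequest {
  body: { query: string; variables?: Record<string, unknown> };
}
type Next = () => void;
type QueryTransform = (query: string) => string;

// Builds an Express-style middleware that rewrites the query in place and then
// hands control to whatever GraphQL handler is registered next.
function rewriteQueries(transform: QueryTransform) {
  return (req: GraphQLRequest, _res: unknown, next: Next): void => {
    req.body.query = transform(req.body.query);
    next();
  };
}

// Naive placeholder transform: a real one would parse, edit, and print the AST.
const middleware = rewriteQueries((q) => q.replace(": Metadata", ": JSON"));

const req: GraphQLRequest = {
  body: {
    query: `
      mutation ($promptId: ID!, $text: String!, $options: Metadata) {
        summaryCreate(prompt_id: $promptId, text: $text, options: $options) {
          __typename
        }
      }
    `,
  },
};

let nextCalled = false;
middleware(req, {}, () => {
  nextCalled = true;
});

// The variable declaration now reads `$options: JSON`.
console.log(req.body.query);
```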

In other cases, like graphql-go, there is no support for middleware / plugins and the API wants you to pass in a raw query string. In these cases you unfortunately need to write some of your own boilerplate, and also need to take a slight performance hit by parsing and reserializing the incoming request:

server.go
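```go
package main

import (
	"log"

	"github.com/graphql-go/graphql"
	"github.com/graphql-go/graphql/language/parser"
	"github.com/graphql-go/graphql/language/printer"
	"github.com/graphql-go/graphql/language/source"
)

func main() {
	// set up schema
	schema, err := graphql.NewSchema(schemaConfig)
	if err != nil {
		log.Fatalf("failed to init schema, error: %v", err)
	}

	query := `
    {
      hello
    }
  `

	// Parse the raw query string into an AST document.
	src := source.NewSource(&source.Source{
		Body: []byte(query),
		Name: "Query",
	})
	doc, err := parser.Parse(parser.ParseParams{Source: src})
	if err != nil {
		log.Fatalf("failed to parse query, error: %v", err)
	}

	// TODO: manipulate `doc` here

	// printer.Print returns an interface{}, so assert it back to a string.
	newQuery, ok := printer.Print(doc).(string)
	if !ok {
		log.Fatal("failed to print rewritten query")
	}

	params := graphql.Params{
		Schema:        schema,
		RequestString: newQuery,
	}
	r := graphql.Do(params)
	// ...
}
```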

Quite a bit more work than pulling in something from off the shelf, but also not particularly onerous.

Alternatives

The go-to solution for making these backwards-incompatible changes has always been to implement a new query or mutation and mark the old one as deprecated. Tools like Apollo GraphOS can be used to track usage of the deprecated operations so you can safely remove them once everyone has updated.

More concretely, our schema could look like the following:

```graphql
type Mutation {
  summaryCreate(
    # ...
    options: Metadata
  ): Summary! @deprecated(reason: "Use `summaryCreateV2` instead")
  summaryCreateV2(
    # ...
    options: JSON
  ): Summary!
}
```

It’s a little ugly compared to rewriting queries, but this “add, deprecate, migrate, remove” workflow is the de facto standard for handling schema-breaking changes in a backwards compatible fashion.

For a really small change (like swapping a scalar type) I think this approach is usually a little overkill.

Conclusion

If you’ve backed yourself into a corner with your GraphQL schema and need to make a backwards-incompatible update, then query rewriting can let you elegantly paper over the issue and pretend it never happened.

While the technique has some weaknesses when it comes to GraphQL tooling that consumes service schemas, it’s still generally pretty safe. Stripe uses a similar concept when they migrate incoming requests and outgoing responses to their REST API as part of their versioning strategy, and they haven’t (yet) run into any problems as far as I know.

While you can perform arbitrarily complicated transformations to incoming queries, I would recommend only using the technique for straightforward schema changes like swapping a scalar type or migrating a mutation over to taking a single input argument. The more complicated the transformation, the more surprising the mismatch between runtime service behavior and the advertised schema becomes.



  1. If we wanted to be perfectly precise here, a JSONObject scalar would be more appropriate as we don’t support options being a top-level array and don’t intend to down the line. JSON is more flexible, though, and there’s nothing wrong with enforcing the object constraint out of band.
  2. Assuming you aren’t doing anything crazy, like type-checking everything at runtime against the live service schema!
