Don’t rely too naively on GPT’s JSON mode
Shortly after OpenAI’s DevDay I wrote up a deep dive into the “JSON mode” which came bundled with the November GPT models. While there are some unfortunate limitations to be aware of when using this feature, it is definitely worth using when building new things on top of GPT.
And then right after I’d published that post, the DevDay releases were overshadowed by the news that Sam Altman had been ousted as the CEO of OpenAI, and the company was in a state of chaos.
OpenAI eventually recovered from this drama, but between those events and their frequent API outages it’s clear that building on top of their platform comes with risk. Even something benign like a low API rate limit can cap your business’s ability to scale when building on top of OpenAI.
That risk means that you shouldn’t be blindly relying on JSON mode.
Building with resiliency in mind
JSON mode guarantees that the response generated by GPT will be a valid JSON object. In theory you can pass the response straight to serde_json::from_str without any preprocessing, but you should resist this temptation.
Instead, pretend JSON mode doesn’t exist. Keep using that regex to deal with Markdown code fences in the generated message. If you do end up needing to fall back to a different LLM, you’ll be glad you built in this redundancy from the start: at the time of writing, JSON mode is only available with the latest GPT and Mistral models.
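As a sketch of that defensive parsing, here’s what stripping an (optional) Markdown code fence before handing the payload to serde_json::from_str can look like. This uses plain string handling from the standard library in place of the regex, and strip_code_fence is a name invented for illustration:

```rust
/// Extract the JSON payload from a model response that may be wrapped
/// in a Markdown code fence (```json ... ```). If no fence is present,
/// the trimmed response is returned as-is, so this is safe to run on
/// JSON-mode output too. In real code the result would go straight
/// into serde_json::from_str with proper error handling.
fn strip_code_fence(response: &str) -> &str {
    let trimmed = response.trim();
    if let Some(rest) = trimmed.strip_prefix("```") {
        // Drop the optional language tag on the opening fence line.
        let body = match rest.split_once('\n') {
            Some((_lang, body)) => body,
            None => rest,
        };
        // Drop the closing fence if one exists.
        body.trim_end()
            .strip_suffix("```")
            .unwrap_or(body)
            .trim()
    } else {
        trimmed
    }
}

fn main() {
    let fenced = "```json\n{\"ok\": true}\n```";
    assert_eq!(strip_code_fence(fenced), "{\"ok\": true}");

    let bare = "{\"ok\": true}";
    assert_eq!(strip_code_fence(bare), "{\"ok\": true}");
}
```

Because the function is a no-op on already-clean JSON, the same code path works whether or not the vendor you are calling supports JSON mode.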
LLMs served over an API are an unusual kind of product in that you send them language and receive language back. While there are subtle differences in how best to prompt different models, the overall integration surface is largely interchangeable. On paper, swapping one LLM for another is trivially easy.
The moment you start to use features which are only available from a particular language model is the moment you make switching hard. Function calling was the first example of this, and JSON mode is the latest. Relying too much on these features greatly limits your LLM options down the line.
Function calling is hard to avoid if your use case is a good fit for that solution, but JSON mode is really easy to be agnostic about. All you need to do is keep using your old code that dealt with text mode responses.
Long-term resiliency strategies
It’s important that anyone building applications on OpenAI’s models is thinking about long-term resiliency. You need to plan for a few different cases:
- OpenAI goes away and you are forced to switch to a different model.
- A different vendor releases a superior model to GPT and you want to switch to improve your product.
- Your company is wildly successful to the point where OpenAI rate limits become a bottleneck.
All of these problems are addressed by the same methods. The two big mitigating strategies are:
- Don’t rely on vendor-specific features like JSON mode. If you do use these features, then build in compensations to handle the case where these features are no longer available (i.e. in the case of JSON mode, pretend you made a text mode API call).
- Develop a solid suite of evals so you are able to systematically measure system performance. This will allow you to actually understand the impact switching models will have on your product.
Both of these points are things any team should already be doing. Without evals, for instance, you have no way of improving your system in a controlled, measured fashion; any change you make is a shot in the dark. Model testing and evaluation is covered in any “Intro to ML” course!
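A sketch of what the core of such an eval suite can look like: a set of prompt/expected-answer cases scored against whatever model-calling function you plug in. Everything here (EvalCase, accuracy, the stub model) is a hypothetical stand-in, not a real API:

```rust
/// One eval case: a prompt and the answer we expect back.
struct EvalCase {
    prompt: &'static str,
    expected: &'static str,
}

/// Score a model against the eval set with exact-match accuracy.
/// `ask_model` is whatever function calls your LLM of choice, which
/// is exactly what makes it easy to compare vendors: swap the
/// closure, keep the cases, diff the scores.
fn accuracy(cases: &[EvalCase], ask_model: impl Fn(&str) -> String) -> f64 {
    let passed = cases
        .iter()
        .filter(|c| ask_model(c.prompt).trim() == c.expected)
        .count();
    passed as f64 / cases.len() as f64
}

fn main() {
    let cases = [
        EvalCase { prompt: "2+2?", expected: "4" },
        EvalCase { prompt: "Capital of France?", expected: "Paris" },
    ];

    // Stub model for illustration; a real harness would make an API call.
    let stub = |prompt: &str| match prompt {
        "2+2?" => "4".to_string(),
        _ => "Paris".to_string(),
    };

    let score = accuracy(&cases, stub);
    assert!((score - 1.0).abs() < f64::EPSILON);
}
```

Real eval suites usually need fuzzier scoring than exact match, but even this much gives you a number to compare before and after a model switch.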
The other thing you should consider doing is tracking all of your API calls. Store this log of calls somewhere, and later on it’s possible to use that data to fine-tune a smaller model like Llama or Mistral 7B and use that to replace your GPT use. Smaller models can very easily outperform larger ones with surprisingly little task-specific data.
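Logging calls can be as simple as appending prompt/completion pairs to a JSONL file, which happens to be the shape most fine-tuning pipelines want as input. A minimal sketch with hand-rolled escaping for illustration (in practice you would reach for serde_json, and log_call is a name invented here):

```rust
use std::env;
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Escape just enough for this sketch: backslashes, quotes, newlines.
/// A real implementation would serialize with serde_json instead.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Append one prompt/completion pair as a line of JSONL.
fn log_call(path: &Path, prompt: &str, completion: &str) -> std::io::Result<()> {
    let line = format!(
        "{{\"prompt\":\"{}\",\"completion\":\"{}\"}}\n",
        escape(prompt),
        escape(completion)
    );
    OpenOptions::new()
        .create(true)
        .append(true)
        .open(path)?
        .write_all(line.as_bytes())
}

fn main() -> std::io::Result<()> {
    let path = env::temp_dir().join("llm_calls.jsonl");
    let _ = fs::remove_file(&path); // start fresh for the demo

    log_call(&path, "What is 2+2?", "4")?;

    let logged = fs::read_to_string(&path)?;
    assert!(logged.contains("\"completion\":\"4\""));
    Ok(())
}
```

The point is less the code than the habit: every production call you log today is free training data for a cheaper model tomorrow.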
There are even a few services available which make this easy:
- OpenPipe is an all-in-one solution that handles logging your API requests in addition to fine-tuning and model serving for you.
- LLM Report is a solid option for tracking API requests, and provides some tremendously useful dashboards for more general monitoring of your GPT-based application. Tracked API calls can be exported as a CSV file, and the service comes with a generous free tier (or you can host it yourself, since it’s open source).
- PromptLayer is the OG in this space. The UI is a bit clunkier than LLM Report in my opinion, but it’s usable. Compared to LLM Report there’s less focus on analytics and monitoring, and more focus on prompt versioning and evaluation.
My take on the recent ByteDance news is that they’re in hot water because they were (allegedly) using GPT output to produce a competing LLM. I think that so long as you aren’t serving your fine-tuned model over an API to public developers you should be fine. That, at least, is my reading.
A side benefit of this fine-tuning work is that you can also achieve significant cost savings. Llama 2 runs reasonably well even on commodity hardware like the RTX 3090. If you have some upfront capital (or are happy to use a service like OpenPipe) then ~50x cost savings are possible relative to vanilla GPT-3.5.
There’s a lot of hype around LLM technology, and a lot of the hype is well-warranted. There’s a lot of innovation happening at an extreme pace, and features like function calling and JSON mode are very nice to have.
But it’s important to make sure you aren’t coupling yourself too tightly to the intricacies of any one particular vendor. Switching from one LLM to another is the easiest thing to do in the world provided you aren’t using any vendor’s proprietary features, and building with an eye on maintaining that flexibility wins you a lot of resilience.