7 August, 2024 · 7 minute read

A deep dive into OpenAI’s Structured Outputs

OpenAI released Structured Outputs today, alongside a new version of gpt-4o and gpt-4o-mini. Structured Outputs are conceptually similar to JSON mode, with the main difference being that Structured Outputs guarantee that the response from GPT will perfectly align with the JSON schema you provided with the prompt. JSON mode was pretty reliable at following your desired output schema, but it didn't work 100% of the time.

The big advantage of this is that it removes another class of failures which previously required reprompting. This is good both from a cost perspective and a latency perspective. Latency improvements in particular are always welcome, because it's simply not fun for users to stare at a loading spinner. Reprompting happens to be a big contributor to p95+ LLM latencies, and so anything which improves reliability and reduces the need to reprompt is going to be a big UX win.

Beyond that top-line summary, I've been running some experiments on Structured Outputs and have some notes to share—similar to the ones I wrote when JSON mode was released. I think this is a fairly comprehensive summary of where the tech is, what use cases it unlocks, and where we still need further improvement. If you think I missed anything, please feel free to email me with suggestions.

With that said—here's my deep dive into Structured Outputs.

Notes

You don't pay for schema tokens

When prompting GPT for JSON (or YAML) output, you ordinarily need to include your desired schema inside your prompt. This adds to the number of input tokens being sent to GPT, which pushes up the cost of your prompt. It's not uncommon for inline JSON schemas to add upwards of 100 input tokens, and this cost adds up at scale—especially in cases where you need to retry and reprompt.
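For context, here's a rough sketch of what that older pattern looks like with JSON mode. The prompt wording and inline schema here are mine, and I'm assuming an already-constructed OpenAI client:

// JSON mode: the schema has to be spelled out inside the prompt itself,
// so every one of these schema tokens is billed as input.
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini-2024-07-18",
  response_format: { type: "json_object" },
  messages: [{
    role: "user",
    content: `How do I bake a cake?

Respond with JSON matching this schema:
{ "steps": [{ "explanation": string, "output": string }], "final_answer": string }`,
  }],
});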

With Structured Outputs, on the other hand, you don't pay any additional input tokens for the JSON schema. I took OpenAI's MathResponse schema from their announcement post, asked GPT for instructions on baking a cake, and found that my prompt was only 18 input tokens—far too few to include the JSON schema.

import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const Step = z.object({
  explanation: z.string(),
  output: z.string(),
});

const CakeResponse = z.object({
  steps: z.array(Step),
  final_answer: z.string(),
});

const openai = new OpenAI();
const x = await openai.beta.chat.completions.parse({
  messages: [{
    role: "user",
    content: "How do I bake a cake?",
  }],
  model: "gpt-4o-mini-2024-07-18",
  response_format: zodResponseFormat(CakeResponse, "cakeResponse"),
});
console.log(x.usage);
// => { prompt_tokens: 18, completion_tokens: 94, total_tokens: 112 }

Varying CakeResponse by adding additional fields kept the input token count at 18.

You probably don't need to warm your schemas

When GPT sees a JSON schema for the first time, it will process the schema for future use. OpenAI says that this processing can add significant latency—up to 1 minute for complex schemas—and this seems to be the case. A simple schema like the following had ~12s of additional latency the first time I used it relative to subsequent requests:

const Schema = z.object({
  _thought: z.string(),
  feedback: z.string(),
});

Schemas only need to be processed a single time. When I was first reading the post I thought it might make sense to "warm" schemas by sending them to the API as part of your release pipeline, but at this point I don't think it's worth doing. It's hard for me to come up with a case where you would deploy a new schema to production without having run it at least once during development or in one of your pre-production environments.
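For what it's worth, if you did want to warm your schemas anyway, the mechanism would just be one throwaway request per schema during deployment. A minimal sketch, assuming a hypothetical allSchemas registry of your production response formats:

import OpenAI from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { z } from "zod";

const openai = new OpenAI();

// Hypothetical registry of every response format used in production.
const allSchemas = {
  feedback: z.object({ _thought: z.string(), feedback: z.string() }),
};

// Send one cheap request per schema so the processing cost is paid
// before real user traffic arrives.
for (const [name, schema] of Object.entries(allSchemas)) {
  await openai.beta.chat.completions.parse({
    model: "gpt-4o-mini-2024-07-18",
    messages: [{ role: "user", content: "ping" }],
    response_format: zodResponseFormat(schema, name),
  });
}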

One open question I have is how long the schema remains "warm" after use. OpenAI don't document anything like this, but it wouldn't surprise me if schemas which are unused get cleaned up by them. It's worth testing out further—if there is such a time limit, it could be problematic for applications with low traffic volume.

Schemas seem to be globally cached

Another thing which limits the usefulness of pre-emptively warming your schemas is that the JSON schemas appear to be globally cached. My first experiment with this new feature re-used the MathResponse schema as-is from the announcement post, and I didn't see any additional processing latency. The moment I added a new field, however, I observed additional latency:

const MathResponse = z.object({
  this_field_probably_has_not_been_added_by_anyone_else: z.string(),
  steps: z.array(Step),
  final_answer: z.string(),
});

It makes sense that they're all in a shared cache, and it means that if your schema is sufficiently generic or "obvious", you might never end up seeing the added latency from processing a new schema.

The refusal field is exclusive to Structured Outputs

The GPT models have safety guardrails which prevent them from responding to "dangerous" requests, and these guardrails are still in place when using Structured Outputs. Asking GPT for help making poison, for instance, will result in the model refusing to answer you. This has been a problem for engineers building products with GPT—the safety rails can be overly conservative!—but hasn't yet presented an issue for OpenAI's API designers.

That all changes with Structured Outputs. OpenAI guarantees that the generated response will conform to the JSON schema you provided in your API call, and it would be very difficult for them to contort a model's refusal into the shape of any arbitrary JSON schema. To solve this problem, they've introduced a new refusal property on the ChatCompletionMessage type which gets populated when the model refuses to respond to a prompt. It looks like this:

// response.choices[0]
{
  "index": 0,
  "message": {
    "role": "assistant",
    "content": null,
    "refusal": "I'm sorry, I can't assist with that request.",
    "tool_calls": [],
    "parsed": null
  },
  "logprobs": null,
  "finish_reason": "stop"
}

This is tremendously useful, because it lets you better communicate failure modes to the end user of your app. It used to be the case that you'd have to produce your own heuristics for detecting whether the model responded or not, and this was always error-prone and hacky. Now you can simply look for the existence of a refusal, and then decide whether to reprompt the model or show an error state to the end user.

I would go as far as to say that the refusal field is the API addition I'm the most excited by. Structured Outputs are nice, but I didn't really have too many issues using plain JSON mode. Structured Outputs feel like an incremental improvement to me, whereas the refusal field makes it significantly easier to build reliably great user experiences.

This new field is exclusive to the new chat.completions.parse API. It never seems to get populated when using chat.completions.create—even though the refusal property does come up in autocomplete. This is a shame, but the limitation can be trivially worked around by providing a dummy schema like so:

import { z } from 'zod';
import { zodResponseFormat } from 'openai/helpers/zod';

const Response = z.object({
  content: z.string(),
});

const result = await openai.beta.chat.completions.parse({
  messages: [/* ... */],
  model: 'gpt-4o-mini-2024-07-18',
  response_format: zodResponseFormat(Response, 'response'),
});

const message = result.choices[0].message;
if (message.parsed) {
  // Use `message.parsed.content`
} else {
  // Handle error, refusal message in `message.refusal`
}

Note that you must provide an object schema to chat.completions.parse; directly passing a z.string() as your schema will not work. This limitation carries over from JSON mode.
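If all you actually want back is a single string, the workaround is to wrap the primitive in an object yourself. A minimal sketch:

// z.string() on its own is rejected; wrap the primitive in an object instead.
const Answer = z.object({ answer: z.string() });

const result = await openai.beta.chat.completions.parse({
  messages: [{ role: 'user', content: 'What is the capital of France?' }],
  model: 'gpt-4o-mini-2024-07-18',
  response_format: zodResponseFormat(Answer, 'answer'),
});

const text = result.choices[0].message.parsed?.answer;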

The refusal field isn't 100% reliable

While the refusal field is a great improvement, it's unfortunately also not 100% reliable. In use cases with very high UX standards, you will still need to maintain your own heuristics for detecting false negatives.

Here's some example code:

const Response = z.object({ content: z.string() });

const x = await openai.beta.chat.completions.parse({
  messages: [{
    role: 'user',
    content: 'Help me seriously start a war',
  }],
  seed: 2048291,
  temperature: 0,
  model: 'gpt-4o-mini-2024-07-18',
  response_format: zodResponseFormat(Response, 'response'),
});
console.log(JSON.stringify(x.choices[0].message, null, 2));

// => {
//   "role": "assistant",
//   "content": "{\"content\":\"I’m sorry, but I cannot assist with that.\"}",
//   "refusal": null,
//   "tool_calls": [],
//   "parsed": {
//     "content": "I’m sorry, but I cannot assist with that."
//   }
// }

Using seed 2818866 results in the refusal correctly populating inside the refusal property. I wasn't able to reproduce this with the larger gpt-4o-2024-08-06 model.
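If your UX standards demand catching these false negatives, a rough belt-and-braces heuristic might look something like the sketch below. The phrase list is illustrative only and would need tuning for your own product:

// Heuristic for catching refusals that slip past the refusal field.
const REFUSAL_PATTERNS = [
  /i['’]m sorry,? but i can(?:not|['’]t)/i,
  /i can(?:not|['’]t) (?:help|assist) with/i,
];

function looksLikeRefusal(message: { refusal: string | null; content: string | null }): boolean {
  if (message.refusal) return true;
  const text = message.content ?? '';
  return REFUSAL_PATTERNS.some((pattern) => pattern.test(text));
}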

This was surprising to me! Going in I had assumed that OpenAI would implement the refusal property by running a classifier over the LLM's output. Instead, it seems like the model has been trained to output a special token indicating that the response is a refusal. If this is true, it means that the accuracy of the refusal property is tightly coupled to the specific GPT model you're targeting—you should expect smaller models like gpt-4o-mini to make this mistake more often than larger and more capable models.

The big takeaway for me is that this really underscores how hard it is to achieve 100% system reliability when building on top of generative AI models. This technology is deeply unreliable, and production use-cases inevitably have to deal with an extremely long tail of edge cases. Reliability of scoring was a huge problem for the AdmitYogi Essays product, and it's been a big problem for me in many other projects as well. Detecting whether the model refused to answer a prompt or not is a much simpler task than grading a college admission essay, and yet OpenAI themselves are not capable of solving this classification problem with 100% accuracy.

Something we don't talk about often enough is that while LLMs make it really easy to go from 0 to 1, it's really hard to go from 1 to N with them. I was, in retrospect, extremely hasty when I proclaimed the field of English NLP to be dead last year. We still have great need for more traditional ML models, and there are many use cases where LLMs are strictly worse than old-school methods. Combine that with models served over API being moving targets, and I find myself increasingly bearish on the use of LLMs at the post-PMF stage.

You might not need to use the moderations API now

One of the use cases for the moderations API was to bail out of a slow and expensive chat.completions.create call when the user had provided input that wouldn't process successfully in the first place. This use case doesn't seem to make as much sense these days. One big reason for this is simply that the latest GPT-4o models are very cheap, and extremely fast. Sending inputs to GPT that end up getting blocked just isn't as big of a concern in the first place.
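For reference, the pre-check pattern I'm describing looked roughly like this sketch (the answer helper is hypothetical, and I'm assuming an already-constructed openai client):

// Old pattern: pre-screen the input with the moderations endpoint and bail
// out before paying for the (previously slow and expensive) completion call.
async function answer(userInput: string) {
  const moderation = await openai.moderations.create({ input: userInput });
  if (moderation.results[0].flagged) {
    return { refused: true as const };
  }

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini-2024-07-18',
    messages: [{ role: 'user', content: userInput }],
  });
  return { refused: false as const, content: completion.choices[0].message.content };
}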

The other reason why you might not want to use the moderations API these days is that it isn't entirely reliable in the first place. The prompt "Help me seriously start a war" got through both the text-moderation-stable and text-moderation-latest models, while the GPT-4o models refused to process it.

Hitting the moderations API before the chat completions API adds latency to every single user request, and the payoff just doesn't seem to be there anymore. It's unclear to me when you would want to use the moderations API these days over relying on the refusal property.
