28 December, 2023 · 8 minute read

Is GPT-4 cost effective yet?

My general workflow with GPT is to first get a prototype working using GPT-4, and then to iterate on the solution until it works with the GPT-3.5 models.

This process works really well in my experience, with the big benefit being that it allows you to efficiently filter out “impossible” projects—if you can’t get something working on top of GPT-4 after a few days then it likely isn’t worth pursuing further.

In fact, this workflow is what I’ve been recommending to companies I work with for the entirety of 2023. GPT-4 has had a lot of sharp edges to it which make production usage less than ideal.

Specifically, here were some of my concerns with GPT-4:

  1. Up until a few months ago, OpenAI themselves were saying that GPT-4 wasn’t yet ready for production use. Their official recommendation was to use it for prototyping.
  2. For a long time, GPT-4 rate limits were prohibitively low. They used to cap out at a mere 200 requests/minute, whereas they now go up to 10,000/minute.
  3. GPT-4 has always been slower than GPT-3.5. My benchmarks of the June models measured a 3x higher inter-token latency for GPT-4 compared to GPT-3.5.
  4. GPT-4 has historically cost substantially more than GPT-3.5. Back in March 2023 the cost difference for an average application was a little over 21x in favor of GPT-3.5.

All of these factors have changed significantly since March 2023. Rate limits, performance, and pricing have all come a long way, and the mainline GPT-4 models are very much production-ready—even if GPT-4 Turbo isn’t.

GPT price history

As a refresher, here’s how GPT model costs have trended over time:

| Model | $/1k input tokens | $/1k output tokens |
| --- | --- | --- |
| gpt-4-1106-preview | 0.0100 | 0.0300 |
| gpt-4-0613 | 0.0300 | 0.0600 |
| gpt-4-0314 | 0.0300 | 0.0600 |
| gpt-3.5-turbo-1106 | 0.0010 | 0.0020 |
| gpt-3.5-turbo-16k-0613 | 0.0030 | 0.0040 |
| gpt-3.5-turbo-0613 | 0.0015 | 0.0020 |
| gpt-3.5-turbo-0301 | 0.0020 | 0.0020 |

Direct cost comparisons across models aren’t straightforward because pricing scales across two different usage dimensions. In other words, there are two key metrics you need to understand in order to perform any cost analysis:

  1. The total number of tokens you expect to use per unit time
  2. The ratio of input:output tokens you expect to use

The first metric is straightforward—your OpenAI bill scaling linearly with usage is not at all surprising—but the second metric is a little less obvious. Depending on your use case the relative price difference between models can get big.

Looking only at the November models, GPT-4 Turbo output tokens cost three times as much as input tokens. Output tokens for the GPT-3.5 Turbo model, on the other hand, are only twice as expensive as input tokens.

This difference means that an application with disproportionately high output token usage will find GPT-4 Turbo comparatively more expensive relative to GPT-3.5 Turbo than an application whose usage is skewed towards input tokens.
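
To make this concrete, here is a minimal sketch of that blended-cost calculation, using the November prices from the table above (the function and constant names are just for illustration):

```typescript
// Per-1k-token prices (USD) for the November models, from the table above.
type Pricing = { input: number; output: number };

const PRICES: Record<string, Pricing> = {
  'gpt-4-1106-preview': { input: 0.01, output: 0.03 },
  'gpt-3.5-turbo-1106': { input: 0.001, output: 0.002 },
};

// Blended cost per 1k tokens at a given input:output ratio,
// e.g. ratio = 9 means 9 input tokens for every 1 output token.
function blendedCostPer1k(model: string, ratio: number): number {
  const { input, output } = PRICES[model];
  return (ratio * input + output) / (ratio + 1);
}

// How many times more expensive GPT-4 Turbo works out to be at different ratios.
for (const ratio of [1.47, 9, 22.5]) {
  const multiple =
    blendedCostPer1k('gpt-4-1106-preview', ratio) /
    blendedCostPer1k('gpt-3.5-turbo-1106', ratio);
  console.log(`${ratio}:1 -> ${multiple.toFixed(2)}x`); // 12.88x, 10.91x, 10.41x
}
```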

Real-world input:output ratios

When Sam Altman announced the cheaper pricing of GPT-4 Turbo relative to the July GPT-4 model, he claimed that the average API user would see about a 2.75x decrease in price. Working backwards from that figure, we get an average input:output token ratio of 9:1.
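
As a quick sanity check on that back-of-the-envelope figure, assuming the 2.75x claim compares the July GPT-4 prices against GPT-4 Turbo's (per the table above):

```typescript
// gpt-4-0613 charges $0.03/$0.06 per 1k input/output tokens; gpt-4-1106-preview
// charges $0.01/$0.03. Solving (0.03r + 0.06) / (0.01r + 0.03) = 2.75 for the
// input:output ratio r gives 0.0025r = 0.0225, i.e. r = 9.
const r = (2.75 * 0.03 - 0.06) / (0.03 - 2.75 * 0.01);
console.log(r); // ≈ 9
```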

This is very much a broad stroke average. There are a lot of OpenAI customers, each doing very different work from each other. You should not rely on this 9:1 figure when modeling your business, because in practice your ratio can be wildly different.

In Crimson Education’s products I see input:output ratios ranging from 22.5:1 all the way down to 2.6:1 depending on the feature. 22.5:1 is (so far) the largest ratio I’ve personally observed, but elsewhere I have seen a ratio as low as 1.47:1.

It’s really important to actually measure and track your input:output token ratio both during development and in production. Real-world ratios vary so massively that it’s impossible to give any kind of ‘rule of thumb’ ratio.
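
The raw numbers are easy to collect, because every chat completion response reports its own token counts. Here is a minimal sketch using the official Node SDK (the `recordTokenUsage` helper is a stand-in for whatever metrics pipeline you already have):

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// Stand-in for your own observability tooling (Datadog, CloudWatch,
// a database table, and so on).
function recordTokenUsage(feature: string, inputTokens: number, outputTokens: number) {
  const ratio = (inputTokens / outputTokens).toFixed(2);
  console.log(`${feature}: input=${inputTokens} output=${outputTokens} (${ratio}:1)`);
}

export async function runPrompt(feature: string, system: string, user: string) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo-1106',
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user },
    ],
  });

  // `usage` contains the prompt and completion token counts for this request.
  if (completion.usage) {
    recordTokenUsage(feature, completion.usage.prompt_tokens, completion.usage.completion_tokens);
  }

  return completion.choices[0].message.content;
}
```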

If we ignore model performance and only consider cost in a vacuum, then the feature with the 22.5:1 token ratio is a great match for GPT-4 Turbo, since that ratio minimizes the amount of cost accrued by the relatively more expensive output tokens.

Price comparisons

So, keeping all of this in mind, here is the breakdown showing the relative cost of the November GPT-4 Turbo model (gpt-4-1106-preview) at various input:output ratios. The “relative price” column indicates how much more expensive GPT-4 Turbo is relative to the other option; i.e. at a 9:1 input:output token ratio your blended cost of using GPT-4 Turbo is 10.91x that of the November GPT-3.5 Turbo model.

| Input:output ratio | Model | Relative price |
| --- | --- | --- |
| 1.47:1 | gpt-3.5-turbo-1106 | 12.88 |
| 1.47:1 | gpt-3.5-turbo-0613 | 10.63 |
| 1.47:1 | gpt-4-0613 | 0.43 |
| 9:1 | gpt-3.5-turbo-1106 | 10.91 |
| 9:1 | gpt-3.5-turbo-0613 | 7.74 |
| 9:1 | gpt-4-0613 | 0.36 |
| 22.5:1 | gpt-3.5-turbo-1106 | 10.41 |
| 22.5:1 | gpt-3.5-turbo-0613 | 7.13 |
| 22.5:1 | gpt-4-0613 | 0.35 |

To be very explicit: The July GPT-4 model costing 20x the July GPT-3.5 model at 1/3 the performance was unacceptable to me. I didn’t consider it worth using in production, because it felt to me that in almost all cases it was possible to bring GPT-3.5 Turbo up to parity with GPT-4 given enough prompts. The difference in pricing and performance was extreme.

But the latest model changes the calculus quite a bit.

When considering GPT-4 Turbo against the July models—which I’ve happily been building on top of up until now—the gap has shrunk dramatically. GPT-4 Turbo costs 7.13-10.63x as much as the July GPT-3.5 Turbo model with only 26.2% worse inter-token latency.

If you were forced to pick between GPT-4 Turbo and the previous generation GPT-3.5 Turbo model, then the choice would be a no-brainer. Outside of use cases where a single prompt gives you the output you’re looking for, GPT-4 Turbo will work out cheaper and more performant in almost any real-world application I can think of.

Of course, we aren’t in that situation as we also have a new GPT-3.5 Turbo model to consider. The cost difference there ranges from 10.41x to 12.88x, and performance-wise GPT-4 Turbo measures at roughly double the inter-token latency.

When considering only the current generation models, things get a little blurrier.

Case study: Document review

We recently shipped a document review feature at Crimson which grades an input document against a rubric. While GPT-4 Turbo is capable of grading documents in a single prompt, it’s more economical and scalable for us to use GPT-3.5 Turbo.

[Screenshot: our new document review feature in action. The document being reviewed here is high quality, and its scores reflect that.]

Working around the limitations of the smaller model for this document review use case is pretty easy—each item on the rubric gets its own grading prompt (all of which can run in parallel!) and then the results of those individual prompts are sent off to a final prompt responsible for generating feedback.

[Diagram: how document review works at Crimson using GPT-3.5 Turbo. Each criterion is scored in parallel, then the individual scores are combined to produce feedback.]
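
In code, the shape of that pipeline looks roughly like this. Treat it as a sketch rather than our production implementation; the prompt wording and the `Criterion` type are simplified stand-ins:

```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

async function chat(system: string, user: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo-1106',
    messages: [
      { role: 'system', content: system },
      { role: 'user', content: user },
    ],
  });
  return completion.choices[0].message.content ?? '';
}

// Simplified rubric shape, for illustration only.
interface Criterion {
  name: string;
  description: string;
}

export async function reviewDocument(document: string, rubric: Criterion[]) {
  // Fan out: grade every criterion independently and in parallel. Each prompt
  // only reasons about one aspect of the rubric, which keeps its output short.
  const grades = await Promise.all(
    rubric.map((criterion) =>
      chat(
        `Grade the document against a single rubric criterion: "${criterion.name}" ` +
          `(${criterion.description}). Think step by step, then give a 1-5 score on the final line.`,
        document,
      ),
    ),
  );

  // Fan in: one final prompt turns the individual grades into feedback.
  const feedback = await chat(
    'Write actionable feedback for the author based on these per-criterion grades.',
    grades.join('\n\n'),
  );

  return { grades, feedback };
}
```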

Our “call stack” is therefore only two prompts deep. Given that GPT-4 Turbo is half as fast as GPT-3.5 Turbo, that’s a good sign, but the number of prompts matters a lot less than the number of output tokens. LLM latency is, in most cases, very linear: it scales with the number of tokens generated in your critical path.

And it should be obvious that our smaller, more constrained prompts individually generate less content than the one big GPT-4 Turbo prompt did. At the very least we save a few tokens by only needing to output one score per parallelized grading prompt instead of needing to output all scores serially.

In practice our critical path’s output token count shrinks even further. We use chain-of-thought prompting to get better results from the system, and the model’s “thoughts” can be shorter when it only needs to consider one aspect of the rubric at a time.

We can typically generate a review in about 6 seconds using GPT-3.5 Turbo. GPT-4 Turbo typically clocked in at around 17 seconds. Direct benchmarks of token generation speed don’t always tell the full story: the ~2.12x performance hit you expect to take moving from gpt-3.5-turbo-1106 to gpt-4-1106-preview is only measuring a “like for like” scenario.

It’s worth mentioning that the architecture we’re using with GPT-3.5 Turbo could also be used with GPT-4 Turbo; there’s no reason why we must do everything in one prompt with the larger model. But there’s no reason to do so here when the architecture already works well with the smaller model, and that’s often the situation you end up in after spending time refining an architecture.

We ultimately end up saving a decent bit of cost while also delivering a significantly better user experience. A six second delay is close enough to instant that it doesn’t feel too punishing for an end user.

Document review is—at least in the applications we’ve developed so far—a relatively straightforward problem. You don’t really need the improved reasoning capabilities of a larger model, and the performance requirements are strict because we want users to be able to iteratively edit their document in near-realtime.

Use cases like this one are still a pretty good fit for GPT-3.5 Turbo. But on harder problems I no longer see a particularly strong reason to avoid GPT-4 Turbo. You and your business will likely be OK doing less prompt engineering and eating the higher cost of GPT-4 Turbo in order to ship more often. The economics of it just make a lot more sense now.

Input:output ratios and model selection

We’ve looked at input:output token ratios in the context of cost analysis, but they also seem to be a useful signal for model selection. The intuition here is that high input:output ratio applications have a tendency to favor long prompts (as opposed to short completions), which makes GPT-4 Turbo a natural fit for these use cases.

In fact, I can even provide a magic number. I haven’t shipped anything with an input:output ratio of 17.5:1 or more using a model from the GPT-3.5 family. Every single feature with a ratio like that is using GPT-4 Turbo.
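
If you wanted to turn that observation into a starting heuristic, it might look something like the snippet below. This is purely illustrative; the 17.5:1 cut-off is just the pattern I have seen in my own features, not a universal rule:

```typescript
// Illustrative only: routes a feature to a model based on its measured
// input:output token ratio. Measure your own ratios before adopting anything
// like this, and revisit the cut-off as pricing changes.
function suggestedModel(inputTokens: number, outputTokens: number): string {
  const ratio = inputTokens / outputTokens;
  return ratio >= 17.5
    ? 'gpt-4-1106-preview' // very prompt-heavy features have all ended up here
    : 'gpt-3.5-turbo-1106'; // below the cut-off, either model can make sense
}

console.log(suggestedModel(22_500, 1_000)); // 'gpt-4-1106-preview' (22.5:1)
console.log(suggestedModel(9_000, 1_000)); // 'gpt-3.5-turbo-1106' (9:1)
```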

Exploring the intuition a bit deeper, there are two key factors at play here.

Stuffing a bunch of data in one prompt is easy. If you are working with a lot of data—and therefore have a lot of input tokens—then using GPT-3.5 means you are forced to perform multiple successive rounds of ranking, filtering, and summarization to work around the smaller context size.

You may end up wanting to perform these processes anyway with GPT-4 Turbo, but the difference in developer ergonomics cannot be overstated.

In a GPT-3.5 world you need to aggressively cut down on the amount of data you have before you ever even get to see an example of the final output. In a GPT-4 world you can immediately generate an output, and then get to use that output to intelligently inform the development of your preprocessing pipeline.

GPT-4 Turbo gets “Lost in the Middle” less. Language models have a tendency to “ignore” instructions in the middle of their prompt, and this issue is exacerbated as prompt length increases. The new GPT-4 Turbo model doesn’t entirely solve the issue, but performance degradation occurs at much longer contexts than it did previously.

Greg Kamradt measured that for a simple recall task performance remained strong all the way up to 73K input tokens. A pretty good result! On this benchmark you essentially get two full-sized gpt-4-32k contexts before running into an issue.

Put simply, if you have very high input:output token ratios then your use case is likely going to be well-suited for GPT-4 Turbo from both a cost and model performance perspective.

Conclusion

I’ve been reluctant to recommend use of GPT-4 models in production up until now, but technology advancements have made using these models a lot more realistic. While GPT-4 Turbo is still in preview, there have already been big improvements to rate limits and I suspect it will be marked as “production-ready” by OpenAI in the short term.

Relative to using the July GPT-3.5 Turbo models, swapping to GPT-4 Turbo multiplies cost by roughly 7.74x (at the average 9:1 input:output ratio) and cuts generation speed by about half when using the same architecture.

But the real value of the GPT-4 models comes from their ability to get good results with fewer prompts and less complicated architectures. It’s not uncommon in my experience to see a single GPT-4 prompt achieve the same work as 3-4 GPT-3.5 prompts. With that observation in mind, GPT-4 Turbo will be a good economic fit for many use cases.

It’s still not a straight upgrade from GPT-3.5, however. In use cases with strict performance requirements, or where the problem being solved is “simple” enough to rearchitect for GPT-3.5, it is still going to be worth investing resources into prompt engineering.
