13 November, 2023 · 3 minute read

Big GPT performance improvements

I’ve been digging into the announcements from OpenAI’s DevDay (here’s the deep dive on the new JSON mode if you missed it) and while playing with the new models I’ve noticed they are significantly faster than the previous iteration of GPTs. Sam Altman had specifically called out the improved performance of GPT-4 Turbo compared to plain GPT-4, but this performance increase also applies to the new GPT-3.5 Turbo model!

This is big news for anyone building on top of GPT. The major limiting factor in my view—aside from low rate limits—has been how slow they are to run. If you aren’t able to leverage streaming then the user experience suffers immensely. There’s nothing worse than staring at a loading spinner for 60s+.
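To make that concrete, here's a minimal sketch of streaming with the OpenAI Python client (v1.x); it assumes OPENAI_API_KEY is set in the environment and simply prints tokens as they arrive rather than waiting for the full completion:

```python
# Minimal streaming sketch using the OpenAI Python client (v1.x).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Explain HTTP caching in two sentences."}],
    stream=True,
)

# Render tokens as they arrive instead of blocking on the whole response.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```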

Methodology

I’ve used Anyscale’s brand new llmperf tool to measure these results. If you’re interested in a detailed discussion of the methodology you can read the blog post announcing the tool, but if you are only interested in the final results there are two key metrics: time to first token (TTFT), the delay between sending a request and receiving the first token of the response, and inter-token latency (ITL), the average time between each subsequent output token.

Out of the box llmperf only reports the mean reading of these metrics—I’m hoping to PR a change later this week—but it’s possible to calculate the standard deviation after the fact as it outputs a raw data file in addition to the summary statistics.
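For reference, here's a rough sketch of pulling the standard deviation out of a raw per-request output file. The file name and record keys below are assumptions on my part, so check what your llmperf run actually writes before reusing this:

```python
# Sketch: compute mean and standard deviation from llmperf's raw per-request
# output. The file name and record keys are assumptions; adjust them to match
# the JSON your run actually produces.
import json
import statistics

with open("raw_request_metrics.json") as f:  # hypothetical file name
    records = json.load(f)

# Hypothetical per-request keys for inter-token latency and time to first token
itl = [r["inter_token_latency_ms"] for r in records]
ttft = [r["ttft_ms"] for r in records]

print(f"ITL:  mean={statistics.mean(itl):.1f}ms  sd={statistics.stdev(itl):.1f}ms")
print(f"TTFT: mean={statistics.mean(ttft):.1f}ms  sd={statistics.stdev(ttft):.1f}ms")
```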

It should also be noted that these measurements aren’t definitive. GPT latency varies wildly over time as a function of system load (you can see this visually on PromptLayer’s latency graph), so you would expect to see variation between two runs executed at different times. For the purposes of this blog post, all of my benchmarks were run within 30 minutes of each other.

The command I’ve used is python3 llmperf.py -r 1000 -f openai -m "{{model_name}}" --min-lines 8 --max-lines 10 --sleep 2. This results in 1,000 requests made to GPT, averaging ~176 prompt tokens and ~104 completion tokens per request. These are pretty small compared to what I’ve seen in production; on paper this shouldn’t matter (LLM latency is dominated by the number of generated tokens and scales roughly linearly with it), but in practice it’s possible OpenAI has some prioritization logic based on expected usage. It would be worthwhile to investigate this possibility down the line!
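For completeness, here's a small sketch of how the same benchmark can be run back-to-back against each of the four models. It just shells out to llmperf.py with the command above, and assumes the script lives in the working directory with OPENAI_API_KEY set:

```python
# Run the same llmperf benchmark against each model in turn.
# Assumes llmperf.py is in the current directory and OPENAI_API_KEY is set.
import subprocess

MODELS = [
    "gpt-3.5-turbo-0613",
    "gpt-3.5-turbo-1106",
    "gpt-4-0613",
    "gpt-4-1106-preview",
]

for model in MODELS:
    subprocess.run(
        [
            "python3", "llmperf.py",
            "-r", "1000",
            "-f", "openai",
            "-m", model,
            "--min-lines", "8",
            "--max-lines", "10",
            "--sleep", "2",
        ],
        check=True,
    )
```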

Results

With all that said, the results are:

| Model | ITL (ms) | TTFT (ms) |
| --- | --- | --- |
| gpt-3.5-turbo-0613 (Old GPT-3.5 Turbo) | 21.7 (SD = 0.2) | 564 (SD = 190.6) |
| gpt-3.5-turbo-1106 (New GPT-3.5 Turbo) | 12.9 (SD = 3.5) | 598 (SD = 294.3) |
| gpt-4-0613 (GPT-4) | 64.5 (SD = 13.4) | 762 (SD = 347.6) |
| gpt-4-1106-preview (GPT-4 Turbo) | 27.4 (SD = 6.5) | 615 (SD = 329.4) |
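As a quick sanity check, the relative improvements discussed below can be computed directly from the mean ITL figures in the table:

```python
# Percentage reduction in inter-token latency, using the mean values above.
itl_ms = {
    "gpt-3.5-turbo-0613": 21.7,
    "gpt-3.5-turbo-1106": 12.9,
    "gpt-4-0613": 64.5,
    "gpt-4-1106-preview": 27.4,
}

def improvement(old: str, new: str) -> float:
    return (itl_ms[old] - itl_ms[new]) / itl_ms[old] * 100

print(f"GPT-3.5 Turbo: {improvement('gpt-3.5-turbo-0613', 'gpt-3.5-turbo-1106'):.0f}% lower ITL")
print(f"GPT-4 Turbo:   {improvement('gpt-4-0613', 'gpt-4-1106-preview'):.0f}% lower ITL")
```

That works out to roughly a 41% reduction for the new GPT-3.5 Turbo and a 58% reduction for GPT-4 Turbo.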

Discussion

The key takeaways for me are:

Token generation is much faster on the new models. GPT-4 Turbo’s inter-token latency is roughly 58% lower than plain GPT-4’s (27.4ms vs 64.5ms), and the new GPT-3.5 Turbo’s is roughly 40% lower than its predecessor’s (12.9ms vs 21.7ms).

Time to first token hasn’t meaningfully changed. The differences between the old and new models are well within one standard deviation, so responses don’t start arriving any sooner; the win is in how quickly the rest of the output streams in once it does.
