Big GPT performance improvements
I’ve been digging into the announcements from OpenAI’s DevDay (here’s the deep dive on the new JSON mode if you missed it) and while playing with the new models I’ve noticed they are significantly faster than the previous iteration of GPTs. Sam Altman had specifically called out the improved performance of GPT-4 Turbo compared to plain GPT-4, but this performance increase also applies to the new GPT-3.5 Turbo model!
This is big news for anyone building on top of GPT. The major limiting factor in my view—aside from low rate limits—has been how slow they are to run. If you aren’t able to leverage streaming then the user experience suffers immensely. There’s nothing worse than staring at a loading spinner for 60s+.
Methodology
I’ve used Anyscale’s brand new `llmperf` tool to measure these results. If you’re interested in a detailed discussion of the methodology you can read the blog post announcing the tool, but if you’re only interested in the final results there are two key metrics:
- Time to first token (“TTFT”): This measures how long it takes, in milliseconds, for the first generated token to be produced.
- Inter-token latency (“ITL”): This is effectively “milliseconds per generated token.”
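To make those definitions concrete, here’s a minimal sketch of how you could measure both numbers for a single streaming request using the `openai` Python client (v1.x). This is just an illustration of what the metrics mean, not how `llmperf` computes them internally, and it treats streamed chunks as a stand-in for tokens.

```python
# Illustrative only: measure TTFT and ITL for one streaming request.
# Chunks are used as an approximation of tokens.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def measure_stream(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first_token_time = None
    chunk_times = []

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Skip chunks with no generated text (e.g. the initial role delta).
        if not chunk.choices or chunk.choices[0].delta.content is None:
            continue
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now
        chunk_times.append(now)

    ttft_ms = (first_token_time - start) * 1000
    # ITL: average gap between successive chunks, in milliseconds.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    itl_ms = (sum(gaps) / len(gaps)) * 1000 if gaps else 0.0
    return ttft_ms, itl_ms


print(measure_stream("gpt-3.5-turbo-1106", "Write a haiku about latency."))
```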
Out of the box `llmperf` only reports the mean for these metrics—I’m hoping to PR a change later this week—but it’s possible to calculate the standard deviation after the fact, as it outputs a raw data file in addition to the summary statistics.
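That post-processing is straightforward. The sketch below assumes the raw output is a JSON file containing one record per request with per-request TTFT and ITL values; the file name and field names here are hypothetical, so adjust them to whatever `llmperf` actually writes out.

```python
# Post-processing sketch: compute mean and standard deviation from a raw
# per-request results file. File name and field names are hypothetical.
import json
import statistics

with open("raw_results.json") as f:
    records = json.load(f)  # assumed: a list of per-request dicts

ttfts = [r["ttft_ms"] for r in records]                # hypothetical field
itls = [r["inter_token_latency_ms"] for r in records]  # hypothetical field

for name, values in [("TTFT", ttfts), ("ITL", itls)]:
    print(f"{name}: mean={statistics.mean(values):.1f} ms, "
          f"SD={statistics.stdev(values):.1f} ms")
```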
It should also be noted that these measurements aren’t definitive. GPT latency varies wildly over time as a function of system load (you can see this visually on PromptLayer’s latency graph), so you would expect to see variation between two runs performed at different times. For the purposes of this blog post, all of my benchmarks were run within 30 minutes of each other.
The command I’ve used is `python3 llmperf.py -r 1000 -f openai -m "{{model_name}}" --min-lines 8 --max-lines 10 --sleep 2`. This results in 1,000 requests made to GPT, averaging ~176 prompt tokens and ~104 completion tokens per request. These are pretty small compared to what I’ve seen in production; on paper this shouldn’t matter (LLM latency is overwhelmingly linear in generated token count), but in practice it’s possible OpenAI has some prioritization logic based on expected usage. It would be worthwhile to investigate this possibility down the line!
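If you want to reproduce the setup, here’s a hedged convenience wrapper (not part of `llmperf`) that runs the same command against each model back to back, so the runs land close together in time as described above. It assumes `llmperf.py` is in the current directory and accepts exactly the flags shown.

```python
# Convenience wrapper: run the same benchmark command against each model
# back to back so the runs are close together in time.
import subprocess

MODELS = [
    "gpt-3.5-turbo-0613",
    "gpt-3.5-turbo-1106",
    "gpt-4-0613",
    "gpt-4-1106-preview",
]

for model in MODELS:
    subprocess.run(
        [
            "python3", "llmperf.py",
            "-r", "1000",
            "-f", "openai",
            "-m", model,
            "--min-lines", "8",
            "--max-lines", "10",
            "--sleep", "2",
        ],
        check=True,
    )
```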
Results
With all that said, the results are:
| Model | ITL (ms) | TTFT (ms) |
|---|---|---|
| gpt-3.5-turbo-0613 (Old GPT-3.5 Turbo) | 21.7 (SD = 0.2) | 564 (SD = 190.6) |
| gpt-3.5-turbo-1106 (New GPT-3.5 Turbo) | 12.9 (SD = 3.5) | 598 (SD = 294.3) |
| gpt-4-0613 (GPT-4) | 64.5 (SD = 13.4) | 762 (SD = 347.6) |
| gpt-4-1106-preview (GPT-4 Turbo) | 27.4 (SD = 6.5) | 615 (SD = 329.4) |
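To put these numbers in end-to-end terms, treating total latency as roughly TTFT + ITL × generated tokens (the linear model mentioned in the methodology section), here’s a back-of-the-envelope estimate for a ~104-token completion like the ones in this benchmark. These are approximations derived from the means above, not measured figures.

```python
# Back-of-the-envelope estimate: total latency ≈ TTFT + ITL * generated tokens.
# Values are the mean TTFT/ITL from the table above; 104 is the average
# completion length in this benchmark.
RESULTS = {
    "gpt-3.5-turbo-0613": {"ttft_ms": 564, "itl_ms": 21.7},
    "gpt-3.5-turbo-1106": {"ttft_ms": 598, "itl_ms": 12.9},
    "gpt-4-0613": {"ttft_ms": 762, "itl_ms": 64.5},
    "gpt-4-1106-preview": {"ttft_ms": 615, "itl_ms": 27.4},
}

COMPLETION_TOKENS = 104

for model, r in RESULTS.items():
    total_ms = r["ttft_ms"] + r["itl_ms"] * COMPLETION_TOKENS
    print(f"{model}: ~{total_ms / 1000:.1f}s for {COMPLETION_TOKENS} tokens")
```

By this estimate the new GPT-3.5 Turbo comes in around ~1.9s, the old GPT-3.5 Turbo around ~2.8s, GPT-4 Turbo around ~3.5s, and plain GPT-4 around ~7.5s, which lines up with the takeaways below.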
Discussion
The key takeaways for me are:
- The new GPT-3.5 Turbo model is almost twice as fast as the previous one.
- GPT-4 Turbo is getting pretty close to the performance of the old GPT-3.5 Turbo model, which should make it usable for a wider variety of products.
- Latency is quite variable on both the newer models and the older GPT-4 model. Presumably `gpt-3.5-turbo-0613` being the most popular model enables OpenAI to right-size compute capacity for it more effectively than for the others.
- Altman said during the opening keynote that this round of releases prioritized price reductions first, and that further speed improvements are on the horizon. It will be interesting to see how much lower these numbers can be pushed once speed becomes the top priority.