Alpine is slowing down your Python builds
At Crimson we have a brand new “chain service” which is responsible for executing LangChain chains. Centralizing your API calls to LLMs like GPT has a host of benefits, but the big one is that it allows you to intelligently prioritize units of work in order to minimize the impact of rate limits.
Crimson has a strong internal preference for TypeScript, but the chain service is written in Python. While LangChain does have a TypeScript port, it does not have feature parity with the Python implementation and likely never will. Language choice isn’t a hill worth dying on: engineers constantly switch between languages over the course of the working day.
Originally we’d based the Dockerfile for the chain service on one of our existing Dockerfiles, and we quickly ran into issues with how long the testing and build pipelines took to run for this service. The root cause was using Alpine as the base image.
What’s wrong with Alpine?
For a lot of engineers, Alpine is the go-to distribution for Docker images because it is extremely small. A smaller base image means your final Docker image will also be smaller, which saves on bandwidth and storage costs and can also reduce startup times. For languages like JavaScript or Go, Alpine is an excellent choice.
But when building a Python project, a problem you’ll run into very quickly is that Alpine uses `musl` for its C standard library instead of the much more common `glibc` implementation. This is a really important detail whenever you’re working with native dependencies, because a binary built against a `glibc` standard library can’t run on a system without `glibc` installed.
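If you’re not sure which C library a given image uses, you can ask the Python interpreter itself. This is a minimal sketch using the standard library’s `platform.libc_ver()`; the `describe_libc` helper is a name made up for illustration. On glibc-based systems it reports the library and its version. My assumption is that on Alpine it comes back empty, because the function is built around detecting glibc, so check that on your own image.

```python
import platform


def describe_libc() -> str:
    """Return a human-readable description of the C library Python is linked against."""
    # libc_ver() inspects the running interpreter binary and returns a
    # (name, version) tuple; both are empty strings if nothing was detected.
    name, version = platform.libc_ver()
    if name:
        return f"{name} {version}".strip()
    # Assumption: an empty result on Linux usually means a non-glibc libc such as musl.
    return "unknown (possibly musl, or a non-Linux platform)"


print(describe_libc())
```

Running this inside each candidate base image is a quick way to confirm what your native dependencies will be linked against.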
In ecosystems where Alpine is a common choice, like JavaScript, most native dependencies ship both a `glibc`-linked and a `musl`-linked binary. This means that when you `npm install` something like `sharp`, you don’t need to build the underlying C code from scratch because someone else has already built it against `musl`.
In the Python ecosystem, libraries with native dependencies also ship precompiled binaries to minimize install time, but almost none of these libraries ship binaries compiled against `musl`. This means that if you’re using Alpine as your base image, you’ll need to compile most if not all of your native dependencies from scratch. Even on small projects with only a few such dependencies, this can take a long time.
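You can tell which C library a precompiled wheel targets from its filename: the last dash-separated field is the platform tag, where `manylinux` tags mean glibc and `musllinux` tags mean musl. The sketch below classifies a filename on that basis. The `wheel_libc` helper and the `example_pkg` filenames are hypothetical, and it is a simplification rather than a reimplementation of pip’s compatibility logic.

```python
def wheel_libc(filename: str) -> str:
    """Classify a wheel filename by the C library its platform tag targets."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    stem = filename[: -len(".whl")]
    # The platform tag is the final dash-separated field of the filename.
    platform_tag = stem.rsplit("-", 1)[-1]
    # A wheel can carry several platform tags joined by dots.
    for tag in platform_tag.split("."):
        if tag.startswith("manylinux"):
            return "glibc"
        if tag.startswith("musllinux"):
            return "musl"
    if platform_tag == "any":
        return "pure-python"
    return "other"


# Hypothetical filenames, for illustration only.
for name in [
    "example_pkg-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
    "example_pkg-1.0.0-cp311-cp311-musllinux_1_1_x86_64.whl",
    "example_pkg-1.0.0-py3-none-any.whl",
]:
    print(name, "->", wheel_libc(name))
```

If the files published for one of your dependencies include no `musllinux` wheel, an install on Alpine falls back to building from the source distribution, which is where the time goes.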
In our case, this meant our chain service could take upwards of 20 minutes to install dependencies in CI. Running `pdm sync` locally, on the other hand, would complete nearly instantly because our local machines use `glibc`.
Swapping off Alpine
In the end, swapping off of Alpine only required us to change two lines of code. We swapped our base image from `python:3.11-alpine` to `python:3.11-slim`, and removed an `apk add` command that was previously installing some build tools such as `g++`. Those tools are now unnecessary, as we no longer need to build any native code.
The results from this two-line change are both impactful and surprising:
Base image | Build time | Image size |
---|---|---|
python:3.11-alpine | 16m8s | 557 MB |
python:3.11-slim | 28s (-97.1%) | 300 MB (-46.2%) |
The reduction in build time was expected, but a 97% improvement is extreme. There are very few changes you can make to a pipeline that will yield that degree of speedup.
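For anyone who wants to check the build-time figure, here is the arithmetic as a short sketch. The `percent_reduction` helper is just an illustrative name; the timings are the ones from the table above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100


alpine_seconds = 16 * 60 + 8  # 16m8s from the table
slim_seconds = 28             # 28s from the table

print(f"{percent_reduction(alpine_seconds, slim_seconds):.1f}%")  # 97.1%
```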
What we didn’t expect, however, was that the resulting image would also be smaller than the one based on Alpine. In retrospect it seems obvious that downloading an optimized binary directly would result in less disk use than compiling from source, but a 46.2% reduction is well beyond what I would have thought possible.
The difference between Alpine-based images and Debian-based images will of course vary with your particular project’s dependency tree. In our case we have a number of large native dependencies such as `numpy` and `spacy`, which increase the relative improvement.
A project with no native dependencies won’t notice the difference between the two base images, but these projects are likely rare.
Don’t use Alpine for Python projects
It’s clear that Alpine is currently a poor base image choice for Python projects. Down the line it’s possible that more Python libraries will start shipping prebuilt `musl`-compatible binaries, but for now I recommend sticking to Debian-based Linux distributions. This optimizes both build time and final image size.
The degree of improvement you get from shifting base image will depend on the dependency tree of your project. Our chain service pulls in a number of Python libraries with large native dependencies such as `numpy` and `spacy`, which makes swapping off Alpine a far bigger improvement. Projects with few or no native dependencies are unlikely to notice the difference, but it’s unusual to be in that position for long. A lot of really useful Python libraries use native code to improve performance, so it’s only a matter of time before you need to start worrying about this problem.
Just go with Debian from the start.