2 June, 2023 · 3 minute read

Alpine is slowing down your Python builds

At Crimson we have a brand new “chain service” which is responsible for executing LangChain chains. Centralizing your API calls to LLMs like GPT has a host of benefits, but the big one is that it allows you to intelligently prioritize units of work in order to minimize the impact of rate limits.

Crimson has a strong internal preference for TypeScript, but the chain service is written in Python. While LangChain does have a TypeScript port, it does not have feature parity with the Python implementation and likely never will. Language choice isn’t a hill worth dying on: engineers constantly switch between languages over the course of the working day.

Originally we’d based the Dockerfile for the chain service off one of our existing Dockerfiles, and we quickly ran into issues with the time it took for our testing and build pipelines to run for this service. The root cause was using Alpine as the base image.

What’s wrong with Alpine?

For a lot of engineers, Alpine is the go-to distribution for Docker images because it is extremely small. A smaller base image means your final Docker image will also be smaller which saves on bandwidth and storage costs, and can also reduce startup times. For languages like JavaScript or Go, Alpine is an excellent choice.

But when building a Python project, a problem you’ll run into very quickly is that Alpine uses musl as its C standard library instead of the much more common glibc. This is a really important detail whenever you’re working with native dependencies, because a binary linked against glibc can’t run on a system without glibc installed.
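If you aren’t sure which C library a given container is running, you can ask the interpreter. The snippet below is a minimal sketch using the standard library’s platform.libc_ver(); the function name detect_libc is purely illustrative. Note the assumption baked in: an empty result is only a hint that you might be on musl, because libc_ver() also returns empty strings on platforms where it can’t work out the answer.

```python
import platform


def detect_libc() -> str:
    """Return a best-effort description of the C library Python is linked against."""
    # libc_ver() returns a (name, version) tuple, e.g. ("glibc", "2.36").
    # When it cannot identify the library it returns empty strings.
    name, version = platform.libc_ver()
    if name:
        return f"{name} {version}".strip()
    # Assumption: an empty name *may* mean musl (as on Alpine), but it can
    # also mean a non-Linux platform, so treat this as a hint only.
    return "unknown (possibly musl)"


if __name__ == "__main__":
    print(detect_libc())
```

Running this inside both base images is a quick way to confirm which libc your dependencies have to be compatible with.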

In ecosystems where Alpine is a common choice—like JavaScript—most native dependencies ship both a glibc and musl-linked binary. This means that when you npm install something like sharp, you don’t need to build the underlying C code from scratch because someone else has already built it against musl.

In the Python ecosystem, libraries with native dependencies also ship precompiled binaries (wheels) to minimize install time, but almost none of these libraries ship binaries compiled against musl. This means that if you’re using Alpine as your base image, you’ll need to compile most if not all of your native dependencies from scratch. Even on small projects with only a few such dependencies, this can take a long time.
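You can tell which libc a prebuilt wheel targets from its filename: the last dash-separated component is the platform tag, where manylinux tags indicate glibc and musllinux tags indicate musl. Here is a rough sketch of that check. The function names and the example_pkg filenames are hypothetical, and the parsing is deliberately simplistic rather than a full implementation of the wheel naming rules.

```python
def wheel_platform_tags(filename: str) -> list[str]:
    """Extract the platform tags from a wheel filename."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    stem = filename[: -len(".whl")]
    # The platform tag is the final dash-separated component; a wheel can
    # declare several platform tags at once, joined with dots.
    return stem.split("-")[-1].split(".")


def libc_targets(filename: str) -> set[str]:
    """Return which Linux C libraries a wheel was built for."""
    targets = set()
    for tag in wheel_platform_tags(filename):
        if tag.startswith("manylinux"):
            targets.add("glibc")
        elif tag.startswith("musllinux"):
            targets.add("musl")
    return targets


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    for name in [
        "example_pkg-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
        "example_pkg-1.0.0-cp311-cp311-musllinux_1_1_x86_64.whl",
        "example_pkg-1.0.0-py3-none-any.whl",
    ]:
        print(name, "->", libc_targets(name) or "no libc-specific tag")
```

If none of the wheels published for a dependency target musl, installing it on Alpine falls back to the source distribution and compiles it locally.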

In our case, this meant our chain service could take upwards of 20 minutes to install dependencies in CI. Running pdm sync locally, on the other hand, would complete nearly instantly because our local machines use glibc.

Swapping off Alpine

In the end, swapping off of Alpine only required us to change two lines of code. We swapped our base image from python:3.11-alpine to python:3.11-slim, and removed an apk add command that was previously installing build tools such as g++. Debian doesn’t use apk, so that command had to go regardless, and the tools it installed are now unnecessary anyway as we no longer need to build any native code.

The results from this two-line change are both impactful and surprising:

Base image           Build time      Image size
python:3.11-alpine   16m8s           557 MB
python:3.11-slim     28s (-97.1%)    300 MB (-46.2%)

The reduction in build time was expected, but a 97% improvement is extreme. There are very few changes you can make to a pipeline that will yield that degree of speedup.
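The 97.1% figure follows directly from the two build times reported above (16m8s and 28s). As a quick sanity check, here is the arithmetic in Python; the helper names are purely illustrative and not part of our pipeline.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes/seconds duration into total seconds."""
    return minutes * 60 + seconds


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


alpine_build = to_seconds(16, 8)  # 16m8s = 968 seconds
slim_build = to_seconds(0, 28)    # 28 seconds

print(round(percent_reduction(alpine_build, slim_build), 1))  # 97.1
```

Put another way, the Alpine build took roughly 35 times as long as the slim build (968 / 28).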

What we didn’t expect, however, was that the resulting image would also be smaller than the one based on Alpine. In retrospect it seems obvious that downloading an optimized binary directly would result in less disk use than compiling from source, but a 46.2% reduction is well beyond what I would have thought possible.

How much you gain by moving from an Alpine-based image to a Debian-based one will of course depend on your particular project’s dependency tree. In our case we have a number of large native dependencies such as numpy and spacy, which increase the relative improvement.

A project with no native dependencies won’t notice the difference between the two base images, but these projects are likely rare.
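To get a feel for how many native dependencies your own project has, you can scan the installed packages for compiled extension modules. The following is a rough sketch using the standard library’s importlib.metadata. The function name and the list of file suffixes are assumptions made for illustration, and packages installed without file records are silently skipped, so treat the output as a lower bound.

```python
from importlib.metadata import distributions

# Assumption: these suffixes are a reasonable proxy for "ships compiled code".
NATIVE_SUFFIXES = {".so", ".pyd", ".dylib"}


def packages_with_native_code() -> list[str]:
    """List installed distributions that ship at least one compiled file."""
    found = set()
    for dist in distributions():
        # dist.files is None when the installer recorded no file list.
        files = dist.files or []
        if any(f.suffix in NATIVE_SUFFIXES for f in files):
            name = dist.metadata["Name"]
            if name:
                found.add(name)
    return sorted(found)


if __name__ == "__main__":
    for name in packages_with_native_code():
        print(name)
```

Run it inside the virtual environment your project installs into. The longer the list, the more an Alpine-based build is likely to cost you.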

Don’t use Alpine for Python projects

It’s clear that Alpine is currently a poor base image choice for Python projects. Down the line it’s possible that more Python libraries will start shipping prebuilt musl-compatible binaries, but for now I recommend sticking to Debian-based Linux distributions. This optimizes both build time and final image size.

The degree of improvement you get from shifting base image will depend on the dependency tree of your project. Our chain service pulls in a number of Python libraries with large native dependencies such as numpy and spacy, which makes swapping away from Alpine a far bigger improvement. Projects with few or no native dependencies are unlikely to notice the difference, but it’s unusual to be in that position for long. A lot of really useful Python libraries use native code to improve performance, so it’s only a matter of time before you need to start worrying about this problem.

Just go with Debian from the start.
