Beware the architecture astronaut
The other day I posted about how to use pg_stat_statements to find slow PostgreSQL queries. I walked through a real-world example where I found a bottleneck in Crimson Education’s calendar syncing code and was able to speed up one pathological query 1,129x.
In between identifying this query and fixing it I naturally spent some time digging into the history of the surrounding code. Our calendar sync setup is quite complicated and dates back to the early days of Crimson when we weren’t quite so good at documenting things, so I had to cast a pretty wide net. Between old git commits, pull requests, Google Docs, and Slack messages I was able to piece together a better understanding of the code.
In doing so I learned two rather funny things.
The first is brief: all the heavy lifting of diffing our local state with the remote calendar’s state is contained within an internal company library called caldav-client. Curiously, however, all of the loggers created by this package are namespaced under the name tsdav.
Why tsdav instead of caldav-client? I’m not sure. But I did find out that the developer who wrote our caldav-client library went on to open source it under the tsdav name! It’s not uncommon for companies to contribute internal libraries to the open source community, but it does seem a bit weird that our developer didn’t migrate us over to his open source version.
The other thing I discovered was that our calendar syncing logic has actually been a recurring source of performance bottlenecks. The latest issue I fixed is far from the first time someone has needed to take a magnifying glass to it and squeeze out a performance win.
And in one of those instances, our resident architecture astronaut stepped in and tried to introduce a brand new microservice. Let’s dig into that.
Architecture astronautics in action
So first things first: what is an “architecture astronaut”? It’s a term coined by Joel Spolsky, and it refers to people who build towers of abstractions so high that they end up running out of oxygen. Put more simply, an “architecture astronaut” is someone who tends to over-architect technical solutions when simpler approaches would suffice. The legendary FizzBuzz Enterprise Edition project is a good example of extreme architecture astronautics.
With that out of the way we can proceed with the story.
On 16 February 2021 the Crimson calendar sync started burning database CPU cycles like crazy. After 5 minutes of the database sitting at 80%+ CPU utilization, New Relic triggered an alert and sent a Slack message notifying the engineering team of the problem.
One engineer jumped in quickly, and identified that the root cause of the spike in resource usage was the calendar sync.
The very next reply came from our architecture astronaut. His message bluntly stated that we needed to split off calendar sync into its own service with a separate database.
Did the astronaut do any kind of benchmarking? Had he investigated the code for obvious bottlenecks? Had he even run an explain on the query that was causing the CPU usage? Nope.
At this point in time he was an intermediate software engineer, and he had drunk the Kool-Aid big time. In his mind the solution to any scaling problem was always to spin up new microservices.
Eight hours later a more senior member of the team based in China began his work day. He posted in the Slack thread that some key database columns weren’t indexed and that this was causing the slowdown.
He also noted that the performance of our pagedUsers GraphQL query dropped dramatically when request throughput exceeded ~50 requests/s.
By indexing some columns and making some small tweaks to pagedUsers the team was able to resolve the performance issues. The code continued ticking away without incident up until December 2023, at which point I came in and found that the calendar sync had been doing hundreds of times more work than it needed to the whole time. In the pg_stat_statements post I said that indexes can’t fix bad code, but it turns out they can prop up bad code for a hell of a lot longer than you might otherwise think.
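To make the effect of a missing index concrete, here is a minimal sketch of the kind of check that could have been run before reaching for a new service. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the plan output differs from what a PostgreSQL explain would show, and the calendar_event table, its columns, and the index name are all hypothetical rather than Crimson's actual schema.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); the detail column is the readable part.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")

# Hypothetical table standing in for the real calendar sync data.
conn.execute(
    "CREATE TABLE calendar_event ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL,"
    " remote_uid TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO calendar_event (user_id, remote_uid) VALUES (?, ?)",
    [(i % 100, f"uid-{i}") for i in range(10_000)],
)

lookup = "SELECT id FROM calendar_event WHERE user_id = ?"

# Without an index on user_id the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# After indexing the filtered column the planner can search the index instead.
conn.execute("CREATE INDEX idx_calendar_event_user_id ON calendar_event (user_id)")
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The plan before the index is a scan of calendar_event, and the plan after it is a search that names idx_calendar_event_user_id. The same before-and-after comparison works in PostgreSQL with its own EXPLAIN statement, and it takes minutes rather than the weeks a service split would.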
We are failing the next generation
I joined the company in March 2022—long after this incident—and by the time I arrived on scene our architecture astronaut had been promoted to senior software engineer. He wound up leaving a while afterwards, but I really enjoyed working with him for that period of time we were both at Crimson. He was easily one of my favorite colleagues of all time.
At that point he still had architecture astronaut tendencies—and riffing on him for that was part of the fun—but the additional year of experience had mellowed him out a little. I suspect that if the calendar sync had gone haywire in 2022 while he was around, he would have—maybe—done some benchmarking before suggesting the bootstrapping of a new microservice.
But I’m not really writing this post as a reflection on just this particular case of architecture astronautics. I think there is a much broader insight to keep in mind here that affects software engineering across the board.
Here’s my thesis: a lot of today’s junior engineering talent have never been exposed to anything other than shoddy microservice architectures. A sizable proportion of our new engineering talent are incapable of evaluating the tradeoffs of microservices because they’ve simply never known anything else. And if they don’t get the opportunity to see other ways of solving problems, their overall career development will be stunted.
Being able to profile some code and optimize it is a really valuable skill. You’ll never end up developing it, however, if you’re in an environment where your first instinct upon seeing a performance issue is to immediately cut and paste code out into a different process. Not only is this bad for the junior engineer, but it’s also a pretty big business problem.
After all, microservice boundaries should be defined by the bounded contexts of a business’ domain. Microservices are simply not a tool for making your hot path run faster. Can you use microservices for this purpose? Sure. It depends.
But if this is the only way you ever try to solve performance issues you are going to wind up with a sprawling distributed monolith and will find it extremely difficult to get anything done.
At the end of the day it doesn’t matter if you move the calendar sync out into its own service if the underlying database table is missing indexes and the SQL statements are touching hundreds of rows when they should be touching only one.
The saving grace here was that we had a senior engineer step in and avoid disaster. This incident was able to become a learning moment for our architecture astronaut thanks to his efforts.
Takeaways
There are two key highlights from this saga.
The first is the value of being socratic. While I was critical of microservices in the previous section, they are actually a perfectly good tool to use in certain situations. Sometimes you really do need the ability to scale a large system at a granular level, and sometimes you really do need a hard, physical boundary between two modules of code.
The word “sometimes” here means that the onus is on you to evaluate whether your situation falls into that category. If you hear that some calendar sync logic is burning CPU cycles, then your first course of action should be to ask “why?” Your second step should be to actually figure out the root cause. Only then will you be able to intelligently compare the tradeoffs involved in solving the problem.
If your first instinct is to offer a solution—and that means any solution, not just microservices—then you aren’t being socratic about the situation.
The second is the important role of senior engineers. Part of your job is to help the more junior members of your team grow into highly competent engineers. Sometimes the best way to do this is to let your juniors make a mistake, and sometimes—as was the case here—the best way to help them grow is to proactively help them avoid making a mess of things.
After I helped Crimson acquire the College Planner product, we got to work on integrating it with our marketing websites. At the time of acquisition the user accounts system was using Amazon Cognito user IDs as the primary key.
That was a pretty big issue for us, because Cognito doesn’t support linking multiple social logins to an individual user account, and we had a business requirement to support that.
One of the intermediate engineers on my team came up with a plan to fix this issue. Their plan was to migrate our user table over to using email addresses as a primary key instead of the Cognito ID. Then, we could have a user_cognito bridge table to link users to their Cognito accounts. A nice, simple design that looks perfectly fine so long as you ignore the choice of primary key.
See, email address primary keys are a pretty bad idea. You’re in for a world of hurt the instant that anyone wants to update their profile information, and you also end up plastering PII all over your foreign key columns. I knew we would have issues, but I also knew that we’d run into them pretty quickly. By that point the scale of the problem would still be small enough that migrating the primary key again would be trivial.
I asked a few probing questions to see if they would pick up on the issue—they didn’t—and then signed off on the design.
A week later the engineer discovered the issue, and migrated over to a UUID primary key. I’m not the world’s biggest fan of UUIDs, but they’re a massively better choice than email addresses. Since this failure and subsequent recovery, I have never again seen this engineer make an obviously poor primary key choice. That’s progress right there.
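For illustration, the shape of the design the engineer ended up with can be sketched as follows. This is a rough approximation using Python's built-in sqlite3 module, not the actual College Planner schema: the user and user_cognito table names come from the story above, while the column names, helper functions, and sample identifiers are assumptions made up for the example.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Surrogate UUID primary key; the email is just a unique attribute.
# Column names are hypothetical.
conn.executescript(
    """
    CREATE TABLE "user" (
        id    TEXT PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    CREATE TABLE user_cognito (
        cognito_id TEXT PRIMARY KEY,
        user_id    TEXT NOT NULL REFERENCES "user" (id)
    );
    """
)


def create_user(conn, email):
    """Insert a user keyed by a freshly generated UUID and return that key."""
    user_id = str(uuid.uuid4())
    conn.execute('INSERT INTO "user" (id, email) VALUES (?, ?)', (user_id, email))
    return user_id


def link_cognito(conn, user_id, cognito_id):
    """Attach one Cognito identity (one social login) to an existing user."""
    conn.execute(
        "INSERT INTO user_cognito (cognito_id, user_id) VALUES (?, ?)",
        (cognito_id, user_id),
    )


user_id = create_user(conn, "student@example.com")
link_cognito(conn, user_id, "google_123")
link_cognito(conn, user_id, "facebook_456")

# Changing the email touches exactly one row; no foreign keys need rewriting.
conn.execute(
    'UPDATE "user" SET email = ? WHERE id = ?', ("renamed@example.com", user_id)
)

linked = [
    row[0]
    for row in conn.execute(
        "SELECT cognito_id FROM user_cognito WHERE user_id = ? ORDER BY cognito_id",
        (user_id,),
    )
]
current_email = conn.execute(
    'SELECT email FROM "user" WHERE id = ?', (user_id,)
).fetchone()[0]
```

With an email primary key, that same update would have to cascade through every foreign key column that references the user, and each of those columns would be carrying PII.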
We learn when we fail, and in this case it was a smooth and controlled failure with an easy migration back to the golden path. Far better to let the engineer in question build some scar tissue by really feeling the impact of a suboptimal design.
But then on the other end of the spectrum it is a really hard job to split code out into a microservice and then roll that decision back at a later date. You absolutely want to avoid making that mistake in the first place, because—although your junior will learn a lot from making such a mess—the cleanup is so involved.
We have a duty to be socratic in our work, and challenge the next generation of engineers to be better.