9 February, 2024 · 4 minute read

If you’re paying for AWS support, use it

A few years ago while working at Kwotimation I needed to build a search feature. We’d received feedback from customers that they were finding it difficult to find records inside their dashboard, and we figured full-text search would resolve the issue nicely. A quick prototype I’d built using Elasticsearch seemed promising, so we decided to productize it.

As we were a small team with limited operational capacity I really didn’t want to have to manage my own Elasticsearch instance. Like most early stage startups, we had significantly more important work which needed doing compared to keeping infrastructure online so we heavily favored managed services. Fortunately for us, AWS actually offers a managed Elasticsearch service and so I went to work writing the Terraform needed to provision it.

But alas, I made a pretty big mistake with the IAM policy of my shiny new Elasticsearch cluster. Before I explain the mistake I made, let’s first examine this IAM policy for an S3 bucket:

```hcl
resource "aws_s3_bucket" "main" {
  count  = length(var.tlds)
  bucket = element(var.tlds, count.index)
  acl    = "public-read"

  policy = <<POLICY
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::${element(var.tlds, count.index)}/*"
        }
    ]
}
POLICY

  website {
    index_document = "index.html"
    error_document = "index.html"
  }
}
```

See that ARN under the Resource section? It’s missing both a region and an account ID. This is completely legitimate for an S3 bucket’s ARN, because despite S3 buckets being regional the name of an S3 bucket is required to be globally unique across all accounts. You therefore don’t need the region or account number to disambiguate which bucket the ARN refers to, and so the canonical representation omits these details.
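To make the difference concrete, here are the two ARN shapes side by side (the bucket name, region, account ID, and domain name below are made-up examples):

```
# S3 bucket ARN — the region and account ID slots are legitimately left empty:
arn:aws:s3:::my-bucket/*

# Elasticsearch domain ARN — the region and account ID are required:
arn:aws:es:ap-southeast-2:123456789012:domain/my-domain/*
```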

AWS’s managed Elasticsearch offering has no such global naming constraint, and so you do need to include the region and account ID components of the ARN. I unfortunately missed this detail when initially creating the infrastructure, and wound up with the following buggy IaC.

Neither my staging nor production deployments were able to access their respective Elasticsearch clusters. All because I had neglected to include the region and account ID in the ARN of the access policy!

```hcl
resource "aws_elasticsearch_domain" "main" {
  domain_name           = "${var.name}-es-${var.environment}"
  elasticsearch_version = "7.10"

  access_policies = <<CONFIG
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "es:*",
            "Principal": {
              "AWS": "*"
            },
            "Effect": "Allow",
            "Resource": "arn:aws:es:::domain/${var.name}-es-${var.environment}/*"
        }
    ]
}
CONFIG

  # ...
}
```
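For reference, a sketch of what the fixed policy could look like. It uses the Terraform AWS provider's `aws_region` and `aws_caller_identity` data sources to look up the region and account ID rather than hard-coding them; this isn't the exact code we shipped, just one reasonable way to write it:

```hcl
# Look up the current region and account ID so they can be interpolated into the ARN.
data "aws_region" "current" {}
data "aws_caller_identity" "current" {}

resource "aws_elasticsearch_domain" "main" {
  domain_name           = "${var.name}-es-${var.environment}"
  elasticsearch_version = "7.10"

  access_policies = <<CONFIG
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "es:*",
            "Principal": {
              "AWS": "*"
            },
            "Effect": "Allow",
            "Resource": "arn:aws:es:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:domain/${var.name}-es-${var.environment}/*"
        }
    ]
}
CONFIG

  # ...
}
```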

In retrospect this mistake is very obvious, but the problem is that when you’re in the weeds and trying to execute on a lot of different deliverables concurrently these things are extremely hard to see for yourself. The longer you stare at a block of code, the harder it becomes to really “see” it for what it is.

I opened up an AWS support ticket—my first one ever!—and the person on the other end was able to see my rookie mistake and set me on the right path within 24 hours. That’s incredible service given that our business had spent next to nothing on AWS up until that point thanks to all the free startup credits we’d accrued.

There are a lot of valid critiques you can level against Amazon, but they are deadly serious about their “customer obsession” principle. I’ve experienced it first hand as an Amazon.com shopper, an AWS customer, and as an AWS Community Builder.

I’m not the only one who recognizes this. Some months ago, on the night before AWS Cloud Day Auckland, I was at a bar networking with some fellow engineers. I decided to stick around for a while after the rest of our group had dispersed, and found myself chatting with one of Forsyth Barr’s directors. Hanging out a little too late at bars is a pretty good way of meeting interesting people.

I bring up this conversation because he had actually been speaking to some of the folks from AWS earlier that day, and he himself was planning to attend the Cloud Day event. He impressed upon me multiple times over the course of that evening that the AWS employees had a genuine desire to help people in New Zealand build things.

He spoke at length about the investments AWS were planning, the training programs they were offering, and the credits and technical assistance they were handing out like candy.

There are a lot of companies out there for whom their values and principles are little more than marketing fluff. Amazon is absolutely not one of those companies, and they really do care deeply about the success of their customers. Talk to anyone who’s dealt with someone from Amazon, and you’ll almost always hear positive stories. They’re a pretty incredible company to partner up with.

And yet there are so many engineering teams out there that simply don’t make use of this. I’ve seen so many engineers—from juniors to seniors—bang their head against the wall for hours trying to debug their cloud system when all they really needed to do was open up an AWS support ticket.

My rule of thumb: if I can’t solve a problem in AWS within 15-20 minutes, then I open a support ticket. No point letting your ego get in the way of asking for help; becoming a senior engineer means eating a lot of humble pie along the way.

Oftentimes I’ll move on to some other deliverable while waiting for their response. If it’s mission critical I’ll keep banging my head against the problem in the hope that I’ll figure something out. If I do manage to figure it out myself then that’s fine—I can just close the ticket and pretend nothing ever happened. That’s exactly what happened when I ran into problems using Secrets Manager to store short-lived OAuth credentials last year; write rates greater than once per 10 minutes result in running out of “secret versions”.

But in those cases where I can’t find the fix on my own, the AWS support team have usually been able to pull through. The only time I’ve ever left feeling dissatisfied is when AWS itself lacks a capability I require—and that’s no fault of the customer support engineers.

The best part of it all is that when I use AWS support I’m not draining capacity from my team. I’m not saying it’s necessarily a bad thing to shoulder tap a team member for assistance—in fact, I generally think you should be leaning on your team mates’ expertise—but if there’s an alternate option available that doesn’t suck up your team’s resources then why wouldn’t you take advantage of it?

AWS support exists to increase your team’s leverage. If you’re paying for it—and you probably are—then you’d better use it.
