The missing piece for self-healing Elastic Beanstalk apps
At Crimson we use Elastic Beanstalk to run our Docker images, and we leverage an old internal Terraform module to provision them. I’d never really used Elastic Beanstalk prior to this job, and I’ve decided that I don’t really like it. It’s a fragile and hard-to-use service, and App Runner, ECS, and EKS are all much better AWS services that span a wide range of use cases and levels of expertise. I really do recommend looking into one of those alternate services instead of using Elastic Beanstalk—they’re all great.
One headscratcher we’ve run into a few times is that all the EC2 instances in an Elastic Beanstalk environment go unhealthy and never get replaced with new ones. When this happens we’re able to sign in to the AWS Console and see the environment marked as unhealthy by Elastic Beanstalk, but Elastic Beanstalk never does anything to correct the situation. We have synthetics that let us intervene manually and quickly, but this obviously isn’t a great long-term solution.
The issue is puzzling because Elastic Beanstalk has a setting named “Application Healthcheck URL” under the aws:elasticbeanstalk:application namespace. I come from using Kubernetes, so to me this seems like all you should really need for self-healing, but it’s not. I recently figured out how to get self-healing working in Elastic Beanstalk, and in this post I’ll explain both why it doesn’t work out of the box and how to fix it.
The key thing to know here is that there are three components that perform health checks within an Elastic Beanstalk environment, and they all do different things. The following list explains at a high level what each of them is responsible for, but of course I’m omitting some details for brevity.
- Elastic Beanstalk will periodically ping the health check URL you configured (or, by default, check port 80 over TCP). If health checks fail then your environment’s status goes red, and it’s possible for you to set up CloudWatch alarms to alert on this. Beyond that, the Elastic Beanstalk health checks don’t seem to do much.
- The Elastic Load Balancer also pings your health check URL, and if one of your instances becomes unhealthy for an extended period of time the ELB will stop sending requests to it.
- The Auto Scaling group also runs health checks, and if one of your instances goes unhealthy it will terminate the bad instance and replace it with a new one.
The Auto Scaling group sounds like it does what we need it to, but there’s a catch. ASGs support different kinds of health checks; specifically in our case the interesting options are “EC2 checks” and “ELB health checks.”
By default, an Auto Scaling group only performs EC2 checks. EC2 checks essentially just check that the underlying hardware is okay, which is useful but doesn’t help us at all in cases where the machine is fine but our application has crashed. ELB health checks, on the other hand, allow the Auto Scaling group to look at the health check results of the load balancer attached to your Auto Scaling group.
If ELB has marked an instance as unhealthy and stopped sending requests to it, then the Auto Scaling group will terminate the unhealthy instance and start up a new one. This is exactly what we want, but the stumbling block is that this isn’t the default behavior. You need to opt in to ELB health checks.
This explains the lack of self-healing. What’s happening is the following:
- The ELB recognizes an instance has gone unhealthy and stops sending requests to it.
- The Elastic Beanstalk environment recognizes that things aren’t healthy so the environment goes red.
- The Auto Scaling group sees that the underlying EC2 instances are fine, and does nothing.
- Eventually all of your instances are unhealthy and your service goes down.
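The sequence above can be sketched as a toy model. This is a simplification for illustration only, not how AWS actually implements Auto Scaling: the Instance class and the should_replace function are hypothetical names, and real health evaluation involves extra details (grace periods, check intervals) that are left out here.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical snapshot of one instance as two health checkers see it."""
    ec2_status_ok: bool  # EC2 checks pass (the machine itself is fine)
    elb_healthy: bool    # load balancer health check passes (the app responds)


def should_replace(instance: Instance, health_check_type: str) -> bool:
    """Toy version of the Auto Scaling group's replace-or-keep decision."""
    if not instance.ec2_status_ok:
        # A failed EC2 check leads to replacement under either setting.
        return True
    if health_check_type == "ELB":
        # With ELB checks enabled, the load balancer's verdict counts too.
        return not instance.elb_healthy
    # EC2-only checks: a crashed app on a healthy machine goes unnoticed.
    return False


# The machine is fine, but the application has crashed.
crashed_app = Instance(ec2_status_ok=True, elb_healthy=False)

print(should_replace(crashed_app, "EC2"))  # False
print(should_replace(crashed_app, "ELB"))  # True
```

With the default setting the crashed instance is left alone, which matches the behavior described above. Only the ELB setting causes it to be replaced.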
This is annoying, but there’s a remedy: turn on those ELB health checks for the Auto Scaling group. When you provision the infrastructure yourself using Terraform or ClickOps, changing this setting is easy, but in Elastic Beanstalk things are less straightforward. Most environment configuration happens by passing in EB-specific configuration options, but there’s no option available to change the Auto Scaling group’s health check setting! So, how do you do it?
The answer is in the .ebextensions folder. By adding a Resources key to a configuration file it’s possible to use CloudFormation to customize almost any aspect of the resources spun up by an Elastic Beanstalk environment. The Auto Scaling group is one of the resources that can be customized, and enabling ELB health checks is simply a matter of dropping the following YAML file into that folder:
Resources:
  AWSEBAutoScalingGroup:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
After you’ve added that configuration file and uploaded a new application version, you can redeploy your environment and get self-healing behavior.
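One way to confirm the change took effect is to inspect the Auto Scaling group after the deploy. The helper below is a sketch rather than part of the setup described in this post: groups_missing_elb_checks is a hypothetical function that takes a dict shaped like the response from boto3’s describe_auto_scaling_groups call and returns the names of groups still on EC2-only checks. The sample response and group names are made up.

```python
def groups_missing_elb_checks(response: dict) -> list[str]:
    """Return names of Auto Scaling groups whose HealthCheckType is not ELB."""
    return [
        group["AutoScalingGroupName"]
        for group in response.get("AutoScalingGroups", [])
        if group.get("HealthCheckType") != "ELB"
    ]


# Made-up sample shaped like a describe_auto_scaling_groups response.
# Assuming boto3 is installed and credentials are configured, a real one
# would come from boto3.client("autoscaling").describe_auto_scaling_groups().
sample = {
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "example-env-a",
            "HealthCheckType": "ELB",
            "HealthCheckGracePeriod": 300,
        },
        {
            "AutoScalingGroupName": "example-env-b",
            "HealthCheckType": "EC2",
            "HealthCheckGracePeriod": 0,
        },
    ]
}

print(groups_missing_elb_checks(sample))  # ['example-env-b']
```

An empty list means every group in the response has opted in to ELB health checks.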