Launch Error Handling when using AWS CodeDeploy + AutoScaling Group
Over the week I got to work on improving the provisioning process and deployment of a legacy service using:
- terraform + packer
- AWS CodeDeploy
The setup for the migration is fairly simple; the most time-consuming part was actually handling errors that happen during instance launch. I spent a good amount of time looking into CodeDeploy and decided to write down what I learned, since I failed to find anything useful by googling.
Context
The application’s launch template (also provisioned by terraform and used by the auto-scaling group) contains a shell script for userdata:
- We choose to install codedeploy-agent in userdata instead of baking it into the AMI. In a previous test, a coworker of mine noticed that launch time increased by ~1min when codedeploy-agent was baked into the AMI, while installing it in userdata takes <10s.
- We have some service-specific provisioning that needs to be done after the instance is launched but before the service is started. Part of the provisioning logic here in userdata may fail with a non-zero exit code.
Setup
Simplified terraform setup:
resource "aws_autoscaling_group" "my_asg" {
  name                = "my-asg-prefix"
  vpc_zone_identifier = ["subnet0", "subnet1", "subnet2"]
  min_size            = 0
  max_size            = 16
  desired_capacity    = 4

  launch_template {
    id      = aws_launch_template.my_template.id
    version = "$Latest"
  }

  load_balancers = [
    aws_elb.my_elb.name,
  ]
}

resource "aws_codedeploy_app" "my_app" {
  name = "my-app"
}

resource "aws_codedeploy_deployment_group" "group" {
  app_name              = aws_codedeploy_app.my_app.name
  deployment_group_name = "foo"
  service_role_arn      = "<service_role_arn>"
  autoscaling_groups = [
    aws_autoscaling_group.my_asg.name,
  ]
}
userdata.sh
#!/bin/bash
set -euo pipefail

exec > >(tee -a /var/log/user-data.out )
exec 2> >(tee -a /var/log/user-data.out >&2)

# launch-time provisioning: application-specific
# ...

# install codedeploy-agent
# <command_to_install_codedeploy_agent>
Problem
AWS CodeDeploy injects a lifecycle hook into the associated auto-scaling group, and that lifecycle hook is supposed to be managed by codedeploy-agent only. When the userdata script exits with a non-zero code, codedeploy-agent is not installed, because set -e aborts the script before it reaches the install step. In the EC2 console instance view the instance is marked as running, but in the auto-scaling group view it stays marked as pending:waiting until it hits the internal lifecycle hook timeout (60000s) set by CodeDeploy.
Switching the order so that codedeploy-agent is installed first didn’t work either: CodeDeploy started installing code before the necessary provisioning was ready, and a non-zero exit code in the provisioning steps that follow is not handled at all.
It’s possible to manually complete the CodeDeploy-managed hook by combining the following commands:
# get instance id from AWS metadata
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# locate lifecycle hook name with ASG name
aws autoscaling describe-lifecycle-hooks

# complete / abandon lifecycle hook with instance id, hook name and ASG name
aws autoscaling complete-lifecycle-action
But obviously that’s not the right approach: userdata.sh and CodeDeploy belong to two independent systems and should not be aware of, or depend on assumptions about, each other.
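For reference, the manual route described above might look roughly like the sketch below. It is shown only to make the coupling concrete, not as a recommendation. Several things here are assumptions: the function name finish_launch_hook is hypothetical, ASG_NAME defaults to the name from the simplified terraform setup, the first hook returned for the group is assumed to be the one CodeDeploy added, and the metadata call assumes IMDSv1 is reachable as in the commands above.

```shell
#!/bin/bash
# Sketch only: complete or abandon the CodeDeploy-managed launch hook by hand.
# ASG_NAME is an assumption taken from the simplified terraform example.
ASG_NAME="${ASG_NAME:-my-asg-prefix}"

# finish_launch_hook is a hypothetical helper name.
finish_launch_hook() {
  local result="$1"   # CONTINUE or ABANDON
  local instance_id hook_name

  # Instance id from the instance metadata service (IMDSv1-style call).
  instance_id="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

  # Assumption: the first hook listed for the group is the CodeDeploy one.
  hook_name="$(aws autoscaling describe-lifecycle-hooks \
    --auto-scaling-group-name "$ASG_NAME" \
    --query 'LifecycleHooks[0].LifecycleHookName' \
    --output text)"

  # Tell the auto-scaling group how to proceed with this instance.
  aws autoscaling complete-lifecycle-action \
    --auto-scaling-group-name "$ASG_NAME" \
    --lifecycle-hook-name "$hook_name" \
    --instance-id "$instance_id" \
    --lifecycle-action-result "$result"
}
```

Calling finish_launch_hook ABANDON from userdata.sh requires the script to know the ASG name and how CodeDeploy names its hook, which is exactly the kind of cross-system assumption to avoid.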
Solution
The solution I ended up with:
- Ensure codedeploy-agent is installed regardless of the userdata script's exit code, with the help of the trap command.
#!/bin/bash
set -euo pipefail

exec > >(tee -a /var/log/user-data.out )
exec 2> >(tee -a /var/log/user-data.out >&2)

# install codedeploy-agent
# trap "<command_to_install_codedeploy_agent>" EXIT

# launch-time provisioning: application-specific
# ...
- Keep the userdata.sh exit information in something that’s less ephemeral, e.g. a file.
- Utilize the BeforeInstall hook in appspec.yaml to detect the possible failure: if an error is detected, returning a non-zero exit code from the script used for BeforeInstall makes codedeploy-agent fail the deployment. Examples below:
hooks:
  BeforeInstall:
    - location: foo/bar/check_userdata_provisioning.sh
      timeout: 60
check_userdata_provisioning.sh:
#!/bin/bash
set -euo pipefail

# exit code recorded by userdata.sh
ret=$(cat /var/run/userdata.output)
if [ "$ret" -eq 0 ]; then
  echo "Continue with deployment. DEPLOYMENT_ID=${DEPLOYMENT_ID}"
  exit 0
else
  echo "Abort the deployment. DEPLOYMENT_ID=${DEPLOYMENT_ID}"
  exit "$ret"
fi
The output file is used as a medium for communicating state information between userdata.sh and codedeploy-agent. It doesn’t have to be an exit code, or even a file: as long as the BeforeInstall hook script has a way to identify the state set by userdata.sh, you’re good to go. The important part is that the lifecycle hook is managed by CodeDeploy, and you should leave it completely untouched from userdata.sh.
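The trap example in the solution does not show how the exit code ends up in /var/run/userdata.output. Below is a minimal sketch of one way to combine the two steps, under a few assumptions: the function names (install_codedeploy_agent, provision, on_exit, run_userdata) and the STATUS_FILE variable are hypothetical and not part of the original script, both placeholders stand in for the real commands, and the logging redirects are left out for brevity.

```shell
#!/bin/bash
# Sketch: record the provisioning result, then install the agent no matter what.
# STATUS_FILE defaults to the path read by the BeforeInstall check script.
STATUS_FILE="${STATUS_FILE:-/var/run/userdata.output}"

install_codedeploy_agent() {
  # Placeholder (hypothetical): replace with <command_to_install_codedeploy_agent>.
  echo "installing codedeploy-agent"
}

provision() {
  # Placeholder (hypothetical): application-specific launch-time provisioning.
  true
}

on_exit() {
  local ret=$?                    # exit status at the moment the script exits
  echo "$ret" > "$STATUS_FILE"    # persist it before the agent can start a deployment
  install_codedeploy_agent
}

# The body runs in a subshell so the EXIT trap fires when provisioning ends,
# whether it finished cleanly or was aborted by set -e.
run_userdata() (
  set -euo pipefail
  trap on_exit EXIT
  provision
)
```

In an actual userdata.sh the last line would simply be run_userdata. The status is written before the agent is installed so that the file already exists by the time the BeforeInstall hook reads it.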