Launch Error Handling when using AWS CodeDeploy + AutoScaling Group

Xing Du
Aug 3, 2020

Over the past week I worked on improving the provisioning and deployment of a legacy service using Terraform, AWS CodeDeploy, and an Auto Scaling group.

The setup for the migration is fairly simple; the most time-consuming part is handling errors that happen during instance launch. I spent a good amount of time looking into CodeDeploy and decided to write down what I learned, since googling turned up nothing useful.

Context

The application’s launch template (also provisioned by Terraform and used by the Auto Scaling group) contains a shell script for userdata:

  • We chose to install codedeploy-agent in userdata instead of baking it into the AMI. In a previous test, a coworker of mine noticed that launch time increased by ~1min when codedeploy-agent was baked into the AMI, while installing it in userdata takes <10s.
  • We have some service-specific provisioning that needs to be done after the instance is launched but before the service is started. Part of this provisioning logic in userdata may fail with a non-zero exit code.

Setup

Simplified terraform setup:

resource "aws_autoscaling_group" "my_asg" {
  name                = "my-asg-prefix"
  vpc_zone_identifier = ["subnet0", "subnet1", "subnet2"]
  min_size            = 0
  max_size            = 16
  desired_capacity    = 4

  launch_template {
    id      = aws_launch_template.my_template.id
    version = "$Latest"
  }

  load_balancers = [
    aws_elb.my_elb.name,
  ]
}

resource "aws_codedeploy_app" "my_app" {
  name = "my-app"
}

resource "aws_codedeploy_deployment_group" "group" {
  app_name              = aws_codedeploy_app.my_app.name
  deployment_group_name = "foo"
  service_role_arn      = "<service_role_arn>"

  autoscaling_groups = [
    aws_autoscaling_group.my_asg.name,
  ]
}

userdata.sh

#!/bin/bash
set -euo pipefail
exec > >(tee -a /var/log/user-data.out )
exec 2> >(tee -a /var/log/user-data.out >&2)
# launch-time provisioning: application-specific
# ...
# install codedeploy-agent
# <command_to_install_codedeploy_agent>

Problem

AWS CodeDeploy injects a lifecycle hook into the associated Auto Scaling group, and that lifecycle hook is supposed to be managed by codedeploy-agent only. When the userdata script exits with a non-zero code, codedeploy-agent is never installed, because set -e aborts the script before it reaches the install step. In the EC2 console the instance is marked as running, but in the Auto Scaling group view it stays in Pending:Wait until it hits the internal lifecycle hook timeout (60000s) set by CodeDeploy.

Switching the order so that codedeploy-agent is installed first didn’t work either. CodeDeploy started installing code before the necessary provisioning was ready, and a non-zero exit code in the provisioning steps that follow is not handled at all.
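
To confirm that an instance is stuck behind the CodeDeploy hook rather than failing for some other reason, the Auto Scaling API can be queried directly. The following is a minimal sketch, assuming the AWS CLI is installed and has read access to Auto Scaling. The helper names asg_lifecycle_state and asg_lifecycle_hooks are hypothetical and only used for illustration.

```shell
#!/bin/bash
set -euo pipefail

# Print the Auto Scaling lifecycle state of one instance (e.g. Pending:Wait).
# Hypothetical helper name; $1 is an EC2 instance id.
asg_lifecycle_state() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[0].LifecycleState' \
    --output text
}

# List the hooks attached to an ASG with their timeout (seconds) and default result.
# Hypothetical helper name; $1 is the Auto Scaling group name.
asg_lifecycle_hooks() {
  aws autoscaling describe-lifecycle-hooks \
    --auto-scaling-group-name "$1" \
    --query 'LifecycleHooks[].[LifecycleHookName,HeartbeatTimeout,DefaultResult]' \
    --output text
}
```

With the setup above, this would be called as asg_lifecycle_state <instance_id> and asg_lifecycle_hooks my-asg-prefix. An instance that reports Pending:Wait while the EC2 console shows it as running is waiting on a lifecycle hook.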

It’s possible to manually complete the CodeDeploy-managed hook by combining the following commands:

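# get instance id from AWS metadata
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# locate lifecycle hook name with ASG name
aws autoscaling describe-lifecycle-hooks --auto-scaling-group-name "<asg_name>"
# complete / abandon lifecycle hook with instance id, hook name and ASG name
aws autoscaling complete-lifecycle-action \
    --auto-scaling-group-name "<asg_name>" \
    --lifecycle-hook-name "<hook_name>" \
    --instance-id "$INSTANCE_ID" \
    --lifecycle-action-result ABANDON  # or CONTINUE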

But that’s not the right approach: userdata.sh and CodeDeploy belong to two independent systems and should not be aware of, or depend on assumptions about, each other.

Solution

The solution I ended up with:

  • Ensure codedeploy-agent is installed regardless of the userdata script’s exit code, with the help of the trap command:
#!/bin/bash
set -euo pipefail
exec > >(tee -a /var/log/user-data.out )
exec 2> >(tee -a /var/log/user-data.out >&2)
# install codedeploy-agent
# trap "<command_to_install_codedeploy_agent>" EXIT
# launch-time provisioning: application-specific
# ...
  • Keep the userdata.sh exit information in something less ephemeral, e.g. a file.
  • Use the BeforeInstall hook in appspec.yaml to detect a possible failure: if an error is detected, returning a non-zero exit code from the script used for BeforeInstall makes codedeploy-agent fail the deployment. Examples below:
hooks:
  BeforeInstall:
    - location: foo/bar/check_userdata_provisioning.sh
      timeout: 60

check_userdata_provisioning.sh:

#!/bin/bash
set -euo pipefail
# exit status recorded by userdata.sh; a missing file aborts the hook via set -e
ret=$(cat /var/run/userdata.output)
if [ "$ret" -eq 0 ]; then
    echo "Continue with deployment. DEPLOYMENT_ID=${DEPLOYMENT_ID}"
    exit 0
else
    echo "Abort the deployment. DEPLOYMENT_ID=${DEPLOYMENT_ID}"
    exit "$ret"
fi

The output file is the medium for passing state information from userdata.sh to codedeploy-agent. It doesn’t have to be an exit code, or even a file: as long as the BeforeInstall hook script has a way to identify the state set by userdata.sh, you’re good to go. The important part is that the lifecycle hook is managed by CodeDeploy, and you should leave it completely untouched from userdata.sh.
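
The userdata.sh side of this hand-off can be sketched as follows. This is a minimal sketch rather than the exact script I run: the names STATUS_FILE, finish, provision and run_userdata are hypothetical, the agent install is still a placeholder, and the path is assumed to match the one read by the BeforeInstall script. Provisioning runs in a subshell so that the EXIT trap fires as soon as provisioning ends, whether it succeeded or not.

```shell
#!/bin/bash
set -euo pipefail

# Assumed location; must match what the BeforeInstall script reads.
STATUS_FILE="${STATUS_FILE:-/var/run/userdata.output}"

install_codedeploy_agent() {
  # Placeholder for <command_to_install_codedeploy_agent>.
  echo "installing codedeploy-agent"
}

# Record the provisioning exit status first, then install the agent,
# so the file exists before CodeDeploy runs the BeforeInstall hook.
finish() {
  echo "$1" > "$STATUS_FILE"
  install_codedeploy_agent
}

provision() {
  # launch-time provisioning: application-specific
  :
}

# The function body is a subshell: the EXIT trap runs when it ends,
# and the subshell keeps the exit status of the provisioning step.
run_userdata() (
  trap 'finish $?' EXIT
  provision
)
```

A real userdata.sh would end with a call to run_userdata. Writing the status before installing the agent matters: once the agent is up, CodeDeploy may start the deployment right away, and the BeforeInstall script expects the file to be there.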
