More context was detailed in my other post: part I
After moving my EKS provisioning project to Terraform Cloud (part I, part II), I was faced with another “unauthenticated” error.
The difference from last time is that it happens only sometimes.
Troubleshooting
The TFC agent knows which IAM role to assume, and that role has been whitelisted in aws-auth. With limited insight available from Terraform Cloud, I needed to dig deeper into my tool bag.
A different tool
EKS control plane logging is available in CloudWatch, and the EKS Terraform module (terraform-aws-modules/eks) provisions it for you by default. Specifically, these are the relevant settings:
# The log group name format is /aws/eks/<cluster-name>/cluster
create_cloudwatch_log_group = true # creates CW log group
cluster_enabled_log_types = ["audit", "api", "authenticator", "controllerManager", "scheduler"] # defaults to ["audit", "api", "authenticator"]
cloudwatch_log_group_retention_in_days = 180 # in days
# use `cloudwatch_log_group_kms_key_id` if logs need to be encrypted
Discovery
Go to AWS CloudWatch, select “Log groups”, and filter to view the authenticator log streams. Narrowing the time window down to match the TFC agent's apply step, I found what I was looking for:
2023-11-11T11:11:11.111-08:00 time="2023-11-11T11:11:11Z" level=warning msg="access denied" arn="arn:aws:iam::<aws_account_id>:role/my-tfc-agent-instance-profile-role" client="127.0.0.1:37232" error="ARN is not mapped" method=POST path=/authenticate
whereas a successful run (same workspace, same change) shows:
2023-11-11T11:11:11.111-08:00 time="2023-11-11T11:11:11Z" level=info msg="access granted" arn="arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role" client="127.0.0.1:55540" groups="[system:masters]" method=POST path=/authenticate uid="aws-iam-authenticator:<aws_account_id>:<my_tfc_agent_user_id>" username=my-tfc-agent-username
The expected part is that the instance profile role isn't whitelisted in aws-auth, so it gets an Unauthorized error.
The unexpected part is: why does my Terraform agent not assume the specified role?
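Because the failure only shows up sometimes, it may help to count these "access denied" events instead of searching for them by hand after each failed run. The following is a minimal sketch, assuming the control plane log group follows the name format shown earlier; the resource name, metric name, and namespace are hypothetical, and the filter applies to the whole log group, not only the authenticator streams.

```hcl
# Sketch only: all names below are illustrative assumptions.
resource "aws_cloudwatch_log_metric_filter" "authenticator_access_denied" {
  name           = "eks-authenticator-access-denied"
  log_group_name = "/aws/eks/<cluster-name>/cluster"

  # Quoted phrase match on unstructured log events.
  # Metric filters apply to every stream in the log group,
  # so other components logging this phrase are counted too.
  pattern = "\"access denied\""

  metric_transformation {
    name      = "EksAuthenticatorAccessDenied" # hypothetical metric name
    namespace = "Custom/EKS"                   # hypothetical namespace
    value     = "1"                            # emit 1 per matching event
  }
}
```

A CloudWatch alarm on that metric could then flag the runs where the agent authenticated with the wrong role.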
Explanation
The following is my hypothesis.
Going through the Terraform run, I noticed that there are no changes to any aws resources (which depend on the aws provider, where assume_role is specified); only kubernetes resources change.
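For reference, an aws provider that specifies assume_role typically looks something like the sketch below. The role ARN is the one from the successful log line; the region is an assumption, since the real provider block is not shown in this post.

```hcl
provider "aws" {
  region = "us-west-2" # assumption: the actual region is not stated here

  # The role the TFC agent is expected to assume for every AWS call.
  assume_role {
    role_arn = "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role"
  }
}
```

Only resources and data sources that go through this provider inherit the assumed role; anything that obtains AWS credentials another way does not.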
Terraform Cloud tries to streamline runs by decoupling the steps of a run from the agents. Different steps (terraform plan, terraform apply) can be picked up by different agents from the same pool. To improve performance, the output diff of terraform plan is saved to disk and streamed to the agent (if needed) to be fed into terraform apply.
My guess (I could be wrong; hopefully someone from HashiCorp can shed some light on this) is:
When a diff does not contain aws changes, the terraform runtime may choose to optimize by not initializing the provider. It then uses whatever is in the environment variables (e.g. AWS_ACCESS_KEY_ID) and on disk (e.g. if $HOME/.aws/credentials is mounted into the terraform container at runtime) to authenticate with AWS, and these may have been affected by a previous run on the same agent.
This is why it sometimes falls back to using the instance profile.
Solution
I believe the problem is specific to EKS: kubernetes resources have an implicit dependency on the aws provider, but terraform cannot figure that out because the dependency is never stated explicitly.
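One possible way to state that dependency explicitly is to have the kubernetes provider request its token through the exec plugin and pass the role to the AWS CLI itself. This is a minimal sketch under several assumptions: the AWS CLI is installed on the agent, the module is named eks and exposes the cluster_endpoint and cluster_certificate_authority_data outputs, and the cluster name placeholder is filled in. I have not verified that it resolves the intermittent failure described here.

```hcl
provider "kubernetes" {
  # Assumption: the EKS module instance is called "eks".
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # Fetch a token at apply time and name the role explicitly,
  # rather than relying on whatever credentials the agent happens to hold.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args = [
      "eks", "get-token",
      "--cluster-name", "<cluster-name>",
      "--role-arn", "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role",
    ]
  }
}
```

With --role-arn set, the token request assumes the mapped role regardless of whether the aws provider was initialized for that run, although the agent's base credentials must still be allowed to assume it.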
What I wish was possible
Maybe there's an existing way to achieve what these options provide; I'd love to learn about it if anyone can share how to do it today. It would be really nice to have one or more of the following options (Terraform / TFC folks, please give it a read!):
- adding a force_init meta-argument for all providers to always initialize a provider regardless of the content of the diff
- adding a flag to terraform apply to always initialize a provider regardless of the content of the diff, and making it configurable on TFC
- supporting a reuse_agent configuration on TFC to allow specifying stickiness between the terraform plan step and the terraform apply step
What I did
Given the unpredictability of this issue, I ended up whitelisting the instance profile role on my EKS cluster:
# in EKS module
  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role"
      username = "my-tfc-agent"
      groups   = ["system:masters"]
-   }
+   },
+   {
+     rolearn  = "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-instance-profile-role"
+     username = "my-tfc-agent-instance-profile"
+     groups   = ["system:masters"]
+   }
  ]
Conclusion
- Take advantage of k8s audit logs; they are very powerful.
- terraform and TFC have their limitations, but this one is very niche.
- Be practical and focus on what you have control over.