Debug EKS Unauthenticated Error-Part II

Xing Du
Nov 16, 2023

More context was detailed in my other post: part I

After moving my EKS provisioning project to Terraform Cloud (part I, part II), I was faced with another “unauthenticated” error.

Unlike last time, this one only happens intermittently.

Troubleshooting

The TFC agent knows which IAM role to assume, and that role has been whitelisted in aws-auth. With limited insight available from Terraform Cloud, I needed to dig deeper into my toolbox.

A different tool

EKS control plane logging is available in CloudWatch, and the EKS Terraform module (terraform-aws-modules/eks) provisions it for you by default. Specifically, these are the relevant settings:

# The log group name format is /aws/eks/<cluster-name>/cluster
create_cloudwatch_log_group            = true # creates the CloudWatch log group
cluster_enabled_log_types              = ["audit", "api", "authenticator", "controllerManager", "scheduler"] # defaults to ["audit", "api", "authenticator"]
cloudwatch_log_group_retention_in_days = 180 # in days
# use `cloudwatch_log_group_kms_key_id` if logs need to be encrypted

Discovery

Go to AWS CloudWatch, select “Log groups”, open the cluster's log group, and filter for the authenticator log streams. After narrowing the time window to match the TFC agent's apply step, I found what I was looking for:

2023-11-11T11:11:11.111-08:00 time="2023-11-11T11:11:11Z" level=warning msg="access denied" arn="arn:aws:iam::<aws_account_id>:role/my-tfc-agent-instance-profile-role" client="127.0.0.1:37232" error="ARN is not mapped" method=POST path=/authenticate

where a successful run (same workspace, same change) shows:

2023-11-11T11:11:11.111-08:00 time="2023-11-11T11:11:11Z" level=info msg="access granted" arn="arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role" client="127.0.0.1:55540" groups="[system:masters]" method=POST path=/authenticate uid="aws-iam-authenticator:<aws_account_id>:<my_tfc_agent_user_id>" username=my-tfc-agent-username

The expected part: the instance profile role isn’t whitelisted in aws-auth, so it gets an Unauthorized error.

The unexpected part: why did my Terraform agent not assume the specified role?
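The console filtering described above can also be kept as a saved CloudWatch Logs Insights query, so the next intermittent failure is quicker to track down. This is a minimal sketch: the resource label and query name are hypothetical, the log group placeholder follows the /aws/eks/<cluster-name>/cluster format mentioned earlier, and it assumes the authenticator log stream names start with "authenticator".

```hcl
# Hypothetical saved query; the resource label and name are illustrative only.
resource "aws_cloudwatch_query_definition" "eks_authenticator_denied" {
  name = "eks/authenticator-access-denied"

  # Replace the placeholder with the real cluster name.
  log_group_names = ["/aws/eks/<cluster-name>/cluster"]

  # Show the most recent "access denied" entries from the authenticator streams.
  query_string = <<-EOT
    fields @timestamp, @logStream, @message
    | filter @logStream like /^authenticator/
    | filter @message like /access denied/
    | sort @timestamp desc
    | limit 50
  EOT
}
```

Running the saved query over the time window of a failed apply should surface the same kind of "ARN is not mapped" line shown above.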

Explanation

The following is my hypothesis.

Going through the Terraform run, I noticed that none of the AWS resources changed (they depend on the aws provider, where assume_role is specified); only Kubernetes resources changed.
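For reference, an aws provider block with assume_role looks roughly like the sketch below. The region is an assumption made for illustration; the role ARN reuses the placeholder from the log lines above.

```hcl
provider "aws" {
  # Assumed region, not taken from the original setup.
  region = "us-west-2"

  # Resources managed through this provider are created with the assumed role,
  # not with the agent's ambient credentials (e.g. the instance profile).
  assume_role {
    role_arn = "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role"
  }
}
```

Only the aws provider is configured to assume the role here; anything that talks to AWS outside of this provider falls back to whatever credentials the agent finds at runtime.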

Terraform Cloud tries to streamline runs by decoupling the steps of a run from the agents. Different steps (terraform plan, terraform apply) can be picked up by different agents from the same pool. To improve performance, the output diff of terraform plan is saved on disk and streamed to the agent (if needed) to be fed into terraform apply.

My guess (I could be wrong, hopefully someone from HashiCorp can shed some light on this) is:

When a diff does not contain AWS changes, the Terraform runtime may choose to optimize by not initializing the aws provider. It then authenticates with AWS using whatever is in the environment variables (e.g. AWS_ACCESS_KEY_ID) and on disk (e.g. $HOME/.aws/credentials, if it is mounted into the Terraform container at runtime), and these may have been affected by a previous run on the same agent.

This would explain why it sometimes falls back to the instance profile.

Solution

I believe the problem is specific to EKS: Kubernetes resources have an implicit dependency on the aws provider, but Terraform cannot figure it out because nothing in the configuration states it explicitly.
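One way to make that dependency explicit is to pass the role to the Kubernetes provider itself. This is only a sketch and I have not verified it against this exact setup. It assumes the kubernetes provider gets its token through the `aws eks get-token` exec plugin, that the AWS CLI is installed on the agent, that the agent's ambient credentials are allowed to assume the role, and that the EKS module is instantiated as `module.eks` with the `cluster_endpoint`, `cluster_certificate_authority_data` and `cluster_name` outputs.

```hcl
# Assumption: the EKS module is declared as module "eks" in the same workspace.
provider "kubernetes" {
  host                   = module.eks.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)

  # The token is requested at apply time. --role-arn makes the CLI assume the
  # whitelisted role instead of using the agent's ambient credentials directly.
  exec {
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "aws"
    args = [
      "eks", "get-token",
      "--cluster-name", module.eks.cluster_name,
      "--role-arn", "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role",
    ]
  }
}
```

With this in place, the identity presented to the cluster no longer depends on whether the aws provider was initialized for the run.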

What I wish was possible

It would be really nice to have one or more of the following options (Terraform / TFC folks, please give it a read!). Maybe there’s already a way to achieve what they provide; if so, I’d love to learn how.

  1. Add a force_init meta-argument for all providers to always initialize a provider regardless of the content of the diff.
  2. Add a flag to terraform apply to always initialize a provider regardless of the content of the diff, and make it configurable on TFC.
  3. Support a reuse_agent configuration on TFC to allow specifying stickiness between the terraform plan step and the terraform apply step.

What I did

Given the unpredictability of this issue, I ended up whitelisting the instance profile role in my EKS cluster's aws-auth:

# in EKS module
  aws_auth_roles = [
    {
      rolearn  = "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-role"
      username = "my-tfc-agent"
      groups   = ["system:masters"]
-   }
+   },
+   {
+     rolearn  = "arn:aws:iam::<aws_account_id>:role/my-tfc-agent-instance-profile-role"
+     username = "my-tfc-agent-instance-profile"
+     groups   = ["system:masters"]
+   }
  ]

Conclusion

  1. Take advantage of the EKS control plane logs (the authenticator log in particular); they’re very powerful.
  2. Terraform and TFC have their limitations, but this one is very niche.
  3. Be practical and focus on what you have control over.

