I ran into an “unauthorized” error while migrating my EKS terraform provisioning project to Terraform Cloud last week.
The debugging process was somewhat interesting, and I hope writing it down helps whoever runs into the same problem in the future.
Context
Project Setup
A simplified version of my terraform project, which should be a fairly common setup:
provider "aws" {
region = var.region
allowed_account_ids = ["${var.aws_account_id}"]
assume_role {
role_arn = var.role_arn
}
}
provider "tfe" {
hostname = "app.terraform.io"
}
provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.this.token
}
data "aws_eks_cluster_auth" "this" {
name = module.eks.cluster_name
}
Migration
- prior to migrating to Terraform Cloud, terraform runs locally with my local AWS credentials
- the project contains AWS resources and k8s resources: a running EKS cluster has already been provisioned by terraform
- this migration contains no changes beyond those needed to move where terraform is executed
- the existing tfstate (on an s3 backend) is automatically migrated to Terraform Cloud with the backend change and terraform init
- on Terraform Cloud, the workspace is configured to use the agent execution mode: terraform runs from an agent, which is hosted on a dedicated EC2 instance
Changes
More details on how to migrate to Terraform Cloud will be covered in a different post. For context, some high-level changes:
- point the backend to the cloud block (see the sketch after this list)
- IAM role change
- workspace name change
- connectivity: VPC peering connection changes
- security group changes
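For reference, a minimal sketch of the backend change, with a hypothetical organization and workspace name; after this change, terraform init offers to copy the existing s3 state into the TFC workspace:
terraform {
  cloud {
    hostname     = "app.terraform.io"
    organization = "my-org"          # placeholder
    workspaces {
      name = "my-eks-workspace"      # placeholder
    }
  }
}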
Problem
Symptom
After making the necessary changes (see above) to stop terraform from complaining, I arrived at the last error:
Error: Unauthorized
with kubernetes_some_resource.my_resource
on my_tf.tf line xx, in resource "kubernetes_some_resource" "my_resource":
resource "kubernetes_some_resource" "my_resource" {
Debugging
A couple of thoughts to narrow down the investigation:
- this didn't happen when using the s3 backend and local terraform, so it's caused by a difference in the environment
- the project contains AWS and k8s resources, and only the k8s resources are affected
- a connectivity issue would have surfaced as a tcp dial timeout instead
The root cause is in how the terraform runtime authenticates to EKS, i.e. it is related to this block:
provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = data.aws_eks_cluster_auth.this.token
}
data "aws_eks_cluster_auth" "this" {
name = module.eks.cluster_name
}
Possible reasons include (but are not limited to):
- the auth token is expired by the time terraform gets to the k8s resources: a token is valid for 15 minutes, and this is possible (but unlikely) if the TFC run uses a stale token from tfstate instead of issuing a new one
- the auth token is valid but does not have permission to access EKS
Validate token expiration
To verify or rule out the first guess, I:
- commented out the aws_eks_cluster_auth data source
- used a sensitive variable to pass the token to the k8s provider
- issued the token via the AWS CLI (aws eks get-token --cluster-name <cluster_name> --role-arn <tfc_agent_role>) and passed it in via the variable
- made sure the whole process was done within 15 minutes
i.e.:
provider "kubernetes" {
host = module.eks.cluster_endpoint
cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
token = var.eks_auth_token
}
var "eks_auth_token" {
type = string
sensitive = true
}
This run ended up with the same error message, meaning the problem is probably not an expired token.
Validate token permission
To verify this, we can try hitting the cluster's k8s API with an auth token issued from the AWS CLI, without going through terraform:
- fetch cluster_endpoint from either terraform output or the AWS web console
- fetch cluster_certificate_authority_data from either terraform output or the AWS web console, decode it, and save it locally: echo -n '<cluster_certificate_authority_data>' | base64 --decode > /tmp/mycert
- issue a token with the TFC agent role: aws eks get-token --cluster-name <cluster_name> --role-arn <tfc_agent_role>
- call the cluster API with that token: curl -s --cacert <(cat /tmp/mycert) --header "Authorization: Bearer <token>" --request GET 'https://<cluster_endpoint>/openapi/v2'
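Put together, the manual check looks roughly like this (a sketch with placeholder names, assuming the local AWS CLI is allowed to assume the TFC agent role):
# Fetch the cluster endpoint and CA bundle without the web console
ENDPOINT=$(aws eks describe-cluster --name my-cluster --query 'cluster.endpoint' --output text)
aws eks describe-cluster --name my-cluster \
  --query 'cluster.certificateAuthority.data' --output text | base64 --decode > /tmp/mycert

# Issue a token as the TFC agent role and hit the k8s API directly
TOKEN=$(aws eks get-token --cluster-name my-cluster \
  --role-arn arn:aws:iam::111122223333:role/tfc-agent-role \
  --query 'status.token' --output text)
curl -s --cacert /tmp/mycert --header "Authorization: Bearer $TOKEN" "$ENDPOINT/openapi/v2"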
This setup returned the same unauthorized error message. However, if I replace tfc_agent_role with the role I used for local runs and repeat the steps above, I get a successful response.
This confirms that the error is caused by the difference in the AWS IAM role being used, and that this part is not explicitly covered in my terraform source.
Root cause
After some research, I found this page explaining the issue. I provisioned my EKS cluster with an OIDC provider to drive RBAC with SSO and intentionally didn't specify any aws-auth-related configuration.
The IAM role used to create the cluster is automatically granted system:masters permissions in k8s and doesn't need to be explicitly added to the aws-auth ConfigMap, while any other IAM role needs to be added to aws-auth explicitly to be authorized.
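As a side note, if you still have a working identity against the cluster, you can check which IAM roles are currently mapped (a quick sketch, assuming kubectl is already pointed at this cluster):
# Show the IAM role/user mappings that EKS uses for authorization
kubectl -n kube-system get configmap aws-auth -o yaml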
In this case, the role I used for local runs was the one used during cluster creation, so even though it was never added to aws-auth, it could still provision k8s resources with that role. After migrating to TFC, the TFC agent role needs to be added to aws-auth to prevent this error, i.e.:
# in EKS module
aws_auth_roles = [
  {
    rolearn  = local.tfc_role_arn
    username = "tfc-agent"
    groups   = ["system:masters"]
  },
  {
    rolearn  = local.local_role_arn
    username = "local-user"
    groups   = ["system:masters"]
  }
]
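One caveat: whether the module actually writes these mappings into the ConfigMap depends on the EKS module and its version; the line below is a sketch assuming terraform-aws-modules/eks v19-style inputs, so check your module's documentation for the exact flag:
# in EKS module (assumption: v19-style inputs of terraform-aws-modules/eks)
manage_aws_auth_configmap = true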
After adding the above block, the unauthorized error message went away and I successfully migrated this project to Terraform Cloud.
Conclusion
If you found this helpful, give it a clap; it would mean the world to me. Please share it with whoever needs it, and I'd appreciate it if you want to buy me a coffee.