A few weeks ago, I ran into a scary but interesting issue: the k8s control plane of my cluster went through a state reset, deleting hundreds of running workloads and taking down every web service.
It is a rarely seen problem, and it’s critical to know the solution ahead of time.
In this article, I’ll walk through what happened and the recovery steps, followed by a root cause analysis. Hopefully this will be helpful.
What Happened
My Kubernetes cluster is provisioned with a tool called kops, which exposes the control plane to me: I have full permission to access the control plane, and I’m responsible for managing it.
The `etcd` cluster (used by the control plane for state management) triggered a cluster reset, deleting all the state data. The control plane started deleting all workloads to converge to this “desired state”.
Recovery Steps
Luckily, `kops`-provisioned k8s clusters back up `etcd` state data to external storage (e.g. AWS S3). Snapshots are taken both hourly and daily. Using a backup snapshot to recover the state is the approach here.
High level
- identify the correct version of the state file to restore from
- issue restore commands to the `etcd` cluster with the correct version of the state data
- observe the k8s cluster converge and monitor the progress
- manually apply the state gap between the point of backup and the point of the incident
Details
`kops` stores backups at `<configBase>/backups/etcd/<etcdClusterName>` (reference), where `<configBase>` is the S3 directory fed to kops, and `<etcdClusterName>` is the name of the etcd cluster. In this case, there are 2 `etcd` clusters: `main` and `events`.
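For example, the backup directories can be listed with the AWS CLI. This is just a sketch: `<configBase>` stands for the same S3 location fed to kops, and your directory names will differ.

```
# list the available snapshots for each etcd cluster; each entry is a
# directory named after the backup's creation timestamp
aws s3 ls <configBase>/backups/etcd/main/
aws s3 ls <configBase>/backups/etcd/events/
```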
- locate the right backup directory (named after the timestamp of creation): use the last known stable state of the k8s cluster. Write down the prefix (the name of the directory) somewhere. Note that `main` and `events` have different timestamps
- ssh to your control nodes one at a time, and check `/var/log/etcd.log` to see which node acts as the leader for the `etcd` cluster: `I am leader with token` is what you are looking for
- on the leader control node, `docker exec` into the `etcd-manager` pod (you need to do this for both `main` and `events`). Identify the container id with `docker ps` and get a shell with `docker exec -it <container_id> /bin/bash`
- run the restoration command for the matching `etcd` cluster: `/etcd-manager-ctl -backup-store=s3://<bucket> restore-backup <s3_prefix_timestamp>`. Note that this timestamp (from step 1) should be the one for the `etcd` cluster the current `etcd-manager` is responsible for (you can tell which one from the container name)
- back on the control node host, monitor the logs to watch the progress: `tail -F /var/log/etcd.log`
- repeat the same steps for the other `etcd` cluster (if there’s more than one); a consolidated sketch of the whole session follows this list
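Putting the steps together, a session on the leader control node looks roughly like this. The container id, bucket, and timestamp are placeholders, not exact values from the incident:

```
# keep the etcd logs open in a separate terminal on the leader control node
tail -F /var/log/etcd.log

# find the etcd-manager containers (one for "main", one for "events")
docker ps | grep etcd-manager

# get a shell inside the etcd-manager container for the "main" cluster
docker exec -it <main_container_id> /bin/bash

# inside the container: queue a restore of the chosen backup
/etcd-manager-ctl -backup-store=s3://<bucket> restore-backup <s3_prefix_timestamp>

# exit the container, then repeat the exec + restore-backup for "events",
# using its own container id and its own backup timestamp
```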
Important note:
If you issued a wrong command, e.g. made a typo in the backup timestamp, the failed command will be stuck in the command queue, preventing future commands from being processed.
In that case, run `/etcd-manager-ctl list-commands` (on the leader node, inside the `etcd-manager` pod) to see which commands are in the queue, and run `/etcd-manager-ctl delete-command` to remove the incorrect command.
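A minimal sketch of that cleanup, run inside the `etcd-manager` pod on the leader node. I’m assuming the same `-backup-store` flag as the restore command above; adjust to your setup:

```
# inspect the queued commands -- a mistyped restore will be stuck at the front
/etcd-manager-ctl -backup-store=s3://<bucket> list-commands

# remove the bad command, then queue the restore again with the right timestamp
/etcd-manager-ctl -backup-store=s3://<bucket> delete-command
```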
Root Cause Analysis
- the k8s control plane depends on the `etcd` cluster for state management (records of all resources). Data inside the `etcd` cluster is the source of truth for what state k8s should be in.
  - if data in `etcd` is updated, the k8s cluster will update accordingly.
  - if data in `etcd` is reset to empty, the k8s cluster will delete all resources as well.
- a kops-provisioned k8s cluster uses an `etcd-manager` component to “manage” the `etcd` cluster: leader election, backup, restore, etc. `etcd-manager` is hard-coded to back up `etcd` data periodically to a given S3 bucket; daily and hourly backups are kept in a `backups` directory in that bucket.
- in addition to the backup files, `etcd-manager` also puts 2 files (`created` and `spec`) used for `etcd` cluster state control in the same `backups` directory (also hardcoded, unable to modify). These files are read every 2s by `etcd-manager`.
  - if both files are gone, `etcd-manager` will complain about “need to create a new etcd cluster” but do nothing about the current `etcd` cluster and its data.
  - if only `spec` exists and `created` does not, `etcd-manager` will proceed to create a new `etcd` cluster, wiping data in the current `etcd` cluster if there is any.
- a lifecycle policy was added to the same S3 bucket to prune old backups. Both `created` and `spec` were last modified months ago, and got deleted by this policy. `etcd-manager` started logging “need to create new etcd cluster”, but did nothing about the existing `etcd` cluster since then.
- on the day of the incident, a deployment job was kicked off to update the k8s cluster provisioning template. This process went through kops, and kops detected that the `spec` file was missing on S3 and created it directly on S3. `etcd-manager` detected the newly-created `spec` file, and proceeded to create a new `etcd` cluster.
- all data in the current `etcd` cluster was wiped as a result, and my k8s cluster started deleting all workloads and evicting all nodes to match the (now empty) “desired state”.
- a few minutes later, all workloads were fully wiped, and all services became unavailable.