A few weeks ago, I ran into a scary but interesting issue: the k8s control plane of my cluster went through a state reset, deleting hundreds of running workloads and taking down every web service.
It is a rarely seen problem, and it’s critical to know the solution ahead of time.
In this article, I’ll walk through what happened and the recovery steps, followed by a root cause analysis. Hopefully this will be helpful.
What Happened
My Kubernetes cluster is provisioned with a tool called kops, which exposes the control plane to me: I have full permission to access the control plane, and I’m responsible for managing it.
The `etcd` cluster (used by the control plane for state management) triggered a cluster reset, deleting all the state data. The control plane started deleting all workloads to converge to this “desired state”.
Recovery Steps
Luckily, `kops`-provisioned k8s clusters back up `etcd` state data to external storage (e.g. AWS S3). Snapshots are taken both hourly and daily. Using a backup snapshot to recover the state is the approach here.
High level
- identify the correct version of the state file to restore from
- issue restore commands to the `etcd` cluster with the correct version of the state data
- observe the k8s cluster converge and monitor the progress
- manually apply the state gap between the point of backup and the point of the incident
Details
`kops` stores backups at `<configBase>/backups/etcd/<etcdClusterName>` (reference), where `<configBase>` is the S3 directory fed to kops, and `<etcdClusterName>` is the name of the etcd cluster. In this case, there are 2 `etcd` clusters: `main` and `events`.
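For example, the backup directories can be listed with the AWS CLI. This is just a sketch: `<configBase>` stands for the same S3 location fed to kops, and your directory names will differ.

```
# list the available snapshots for each etcd cluster; each entry is a
# directory named after the backup's creation timestamp
aws s3 ls <configBase>/backups/etcd/main/
aws s3 ls <configBase>/backups/etcd/events/
```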
- locate the right backup directory (named after the timestamp of creation): use the last known stable state of the k8s cluster. Write down the prefix (the name of the directory) somewhere. Note that `main` and `events` have different timestamps
- ssh to your control nodes one at a time, and check `/var/log/etcd.log` to see which node acts as the leader for the `etcd` cluster: `I am leader with token` is what you are looking for
- on the leader control node, `docker exec` into the `etcd-manager` pod (you need to do this for both `main` and `events`). Identify the container id with `docker ps` and get a shell with `docker exec -it <container_id> /bin/bash`
- run the restoration command for the matching `etcd` cluster: `/etcd-manager-ctl -backup-store=s3://<bucket> restore-backup <s3_prefix_timestamp>`. Note that this timestamp (from step 1) should be the one for the `etcd` cluster the current `etcd-manager` is responsible for (you can tell which one from the container name)
- back on the control node host, monitor the logs to watch the progress: `tail -F /var/log/etcd.log`
- repeat the same steps for the other `etcd` cluster (if there’s more than one); a consolidated sketch of the whole session follows this list
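Putting the steps together, a session on the leader control node looks roughly like this. The container id, bucket, and timestamp are placeholders, not exact values from the incident:

```
# keep the etcd logs open in a separate terminal on the leader control node
tail -F /var/log/etcd.log

# find the etcd-manager containers (one for "main", one for "events")
docker ps | grep etcd-manager

# get a shell inside the etcd-manager container for the "main" cluster
docker exec -it <main_container_id> /bin/bash

# inside the container: queue a restore of the chosen backup
/etcd-manager-ctl -backup-store=s3://<bucket> restore-backup <s3_prefix_timestamp>

# exit the container, then repeat the exec + restore-backup for "events",
# using its own container id and its own backup timestamp
```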
Important note:
If you issued a wrong command, e.g. made a typo in the backup timestamp, the failed command will be stuck in the command queue, preventing future commands from being processed.
In that case, run `/etcd-manager-ctl list-commands` (on the leader node, inside the `etcd-manager` pod) to see which commands are in the queue, and run `/etcd-manager-ctl delete-command` to remove the incorrect command.
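A minimal sketch of that cleanup, run inside the `etcd-manager` pod on the leader node. I’m assuming the same `-backup-store` flag as the restore command above; adjust to your setup:

```
# inspect the queued commands -- a mistyped restore will be stuck at the front
/etcd-manager-ctl -backup-store=s3://<bucket> list-commands

# remove the bad command, then queue the restore again with the right timestamp
/etcd-manager-ctl -backup-store=s3://<bucket> delete-command
```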
Root Cause Analysis
- the k8s control plane depends on the `etcd` cluster for state management (records of all resources). Data inside the `etcd` cluster is the source of truth for what state k8s should be in.
  - if data in `etcd` is updated, the k8s cluster will update accordingly.
  - if data in `etcd` is reset to empty, the k8s cluster will delete all resources as well.
- a kops-provisioned k8s cluster uses an `etcd-manager` component to “manage” the `etcd` cluster: leader election, backup, restore, etc. `etcd-manager` is hard-coded to back up `etcd` data periodically to a given S3 bucket; daily and hourly backups are kept in a `backups` directory in that bucket.
- in addition to the backup files, `etcd-manager` also puts 2 files (`created` and `spec`) used for `etcd` cluster state control in the same `backups` directory (also hardcoded, unable to modify). These files are read every 2s by `etcd-manager`.
  - if both files are gone, `etcd-manager` will complain about “need to create a new etcd cluster” but do nothing about the current `etcd` cluster and its data.
  - if only `spec` exists and `created` does not, `etcd-manager` will proceed to create a new `etcd` cluster, wiping data in the current `etcd` cluster if there is any.
- a lifecycle policy was added to the same S3 bucket to prune old backups. Both `created` and `spec` were last modified months ago, and got deleted by this policy. `etcd-manager` started logging “need to create new etcd cluster”, but did nothing about the existing `etcd` cluster since then.
- on the day of the incident, a deployment job was kicked off to update the k8s cluster provisioning template. This process went through kops, and kops detected that the `spec` file was missing on S3 and created it directly on S3. `etcd-manager` detected the newly-created `spec` file, and proceeded to create a new `etcd` cluster.
- all data in the current `etcd` cluster was wiped as a result, and my k8s cluster started deleting all workloads and evicting all nodes to match the (now empty) “desired state”.
- a few minutes later, all workloads were fully wiped, and all services became unavailable.