(Jan. 27, 2020) – So you’ve got a Kubernetes cluster. Now what?
etcd is the persistent key-value store that holds the state of a Kubernetes cluster. Every pod, deployment, secret, and node object – everything the API server knows about the cluster – is stored in etcd.
That makes etcd one of the most important things to back up, and an essential “Day 2” consideration for Kubernetes administrators.
etcd itself works on a quorum model: a write is committed only once a majority of the members agree on it. A single Kubernetes cluster should therefore run several etcd members, ideally an odd number, for high availability. Each member holds its own full copy of the cluster data, persisted on disk in its write-ahead log (WAL).
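To make the quorum model concrete, here is a minimal sketch of the majority arithmetic. The function names quorum and fault_tolerance are ours, chosen for illustration, and are not part of etcd.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print cluster size, quorum size, and tolerated failures.
    for size in (1, 3, 4, 5):
        print(size, quorum(size), fault_tolerance(size))
```

Note that a four-member cluster tolerates one failure, the same as a three-member cluster, which is why odd member counts are the usual choice.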
If an etcd member goes down and cannot be recovered, a new member can be brought up to replace it. This makes the etcd cluster behind a Kubernetes cluster fairly resilient to failures, as long as a majority of its members stays healthy.
One way to see this in action is illustrated in this quick video, in which we migrate an etcd member to another host in the cluster. etcd members maintain a list of the other members and track each one by a value called the member ID. We pick a member from the list returned by etcdctl member list, then stop it with Ctrl+C to simulate a failure. Once the member, Node1, is stopped, we copy its data directory, where the WAL lives, to another host.
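As a rough sketch of how the member ID could be picked out of that list in a script, the snippet below parses the comma-separated output that etcdctl member list prints by default with the v3 API. The field order (ID, status, name, peer URLs, client URLs) is an assumption about that output format, and the sample IDs and addresses are made up for illustration; check them against your own etcdctl version.

```python
from typing import Dict, List


def parse_member_list(output: str) -> List[Dict[str, str]]:
    """Parse the simple output of `etcdctl member list`.

    Assumes one member per line, fields separated by ", " in the order
    ID, status, name, peer URLs, client URLs (extra fields are ignored).
    """
    members = []
    for line in output.strip().splitlines():
        fields = [field.strip() for field in line.split(", ")]
        if len(fields) < 5:
            continue  # skip lines that do not look like a member entry
        members.append({
            "id": fields[0],
            "status": fields[1],
            "name": fields[2],
            "peer_urls": fields[3],
            "client_urls": fields[4],
        })
    return members


def member_id_for(output: str, name: str) -> str:
    """Return the member ID of the member with the given name."""
    for member in parse_member_list(output):
        if member["name"] == name:
            return member["id"]
    raise KeyError(name)


# Made-up sample output, for illustration only.
SAMPLE = """\
8211f1d0f64f3269, started, Node1, http://10.0.0.1:2380, http://10.0.0.1:2379, false
91bc3c398fb3c146, started, Node2, http://10.0.0.2:2380, http://10.0.0.2:2379, false
fd422379fda50e48, started, Node3, http://10.0.0.3:2380, http://10.0.0.3:2379, false
"""

if __name__ == "__main__":
    print(member_id_for(SAMPLE, "Node1"))
```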
Using etcdctl member update, we tell the etcd cluster to update Node1’s entry in the member list with the peer URL (IP address) of the “new” Node1 host. After the update completes, we start etcd on the new host.
Because etcd finds the Node1 data directory on the new host, the new member assumes the identity of the Node1 member as if it had never been gone. etcd recognizes that the new node is still Node1, even though the host itself has a completely different identity. The “new” Node1 then catches up on all of the transactions it missed while the original member was down, and the cluster is once again whole.
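The migration steps described above can be sketched as a small helper that assembles the commands without running them. This is only an outline under stated assumptions: it uses the v3 etcdctl syntax (member update with --peer-urls), uses scp as a stand-in for whatever copy tool you prefer, and the function name, host name, data directory path, and URL in the example are all hypothetical. Stopping the old member and starting etcd on the new host are left as manual steps, as in the video.

```python
import posixpath
import shlex
from typing import List


def migration_plan(member_id: str, data_dir: str,
                   new_host: str, new_peer_url: str) -> List[List[str]]:
    """Build (but do not run) the commands for moving a member to a new host.

    Manual steps not covered here: stop the old member before the copy,
    and start etcd on the new host after the member update.
    """
    # Copy into the parent directory so the data directory keeps its name.
    parent = posixpath.dirname(data_dir.rstrip("/"))
    return [
        # 1. Look up the member ID of the member to migrate.
        ["etcdctl", "member", "list"],
        # 2. With the old member stopped, copy its data directory (WAL included).
        ["scp", "-r", data_dir, f"{new_host}:{parent}/"],
        # 3. Point the existing member entry at the new host's peer URL.
        ["etcdctl", "member", "update", member_id,
         f"--peer-urls={new_peer_url}"],
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    plan = migration_plan(
        member_id="8211f1d0f64f3269",
        data_dir="/var/lib/etcd/Node1.etcd",
        new_host="new-node1.example.com",
        new_peer_url="http://10.0.0.9:2380",
    )
    for command in plan:
        print(" ".join(shlex.quote(arg) for arg in command))
```

Building the commands as argument lists keeps the sketch side-effect free, so the plan can be reviewed before anything touches a live cluster.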