Debug 101
Today we’re starting a new series called Debug 101, which deals with those issues that gave us particularly bad headaches and took a long time to debug, understand and fix. We believe strongly in open source software and open issue resolution, so we try to describe our problems and suggest fixes as we go, so you don’t have to shave that yak. We already have, and the yak looks awesome.
Nodes successfully joined, not!
We use different cloud providers (typically Kubernetes-managed) to develop and launch workloads on Kubernetes. One of the E2E automated tests we run uses AWS EC2, where we install the Kubernetes cluster ourselves using kubeadm. Custom installations give us the flexibility of custom setups and profiles, and allow us to test Pipeline with different k8s versions. From time to time we encountered strange behavior in the cluster, the foremost of which was that, although we started EC2 instances, they did not show up in the Kubernetes cluster. This issue occurred sporadically and was nondeterministic. After some test runs we identified a (potential) common source: either the instances in question were of a stronger instance variety, or the cluster size was particularly small (fewer than three or four nodes). Since we run our automated tests on spot price instances, and the pool is recommended by an internal tool/project (which we will open source soon), it’s hard for us to know beforehand what our instance types will be (unless explicitly specified), since Hollowtrees determines this according to a variety of factors (rulesets, spot price history, price trends in the AZ, fleet diversity, etc.).
To cut to the chase, the problem is that, though the master and the node start in parallel, after some time the controller removes the node from the cluster. Here’s the sequence of events:
- Master starts successfully
- Node starts and tries to join the master:
kubeadm join --token ${TOKEN} ${MASTER}
- Eventually, the node goes missing from the cluster node list:
kubectl get node
- The logs on the node say that it has successfully joined the master:
[bootstrap] The server supports the Certificates API (certificates.k8s.io/v1beta1)
and that the node join is complete:
* Certificate signing request is sent to master and response received.
* Kubelet informed of new secure connection details.
- Running kubectl get nodes on the master registers the machine as joined
- We have now arrived at kubectl get events, which allows us to see what happened chronologically (a sorted variant is sketched below, after the raw output):
1m 1m 1 ip-10-0-100-184.eu-west-1.compute.internal.14fb501a0ae334a3 Node Normal RegisteredNode controllermanager Node ip-10-0-100-184.eu-west-1.compute.internal event: Registered Node ip-10-0-100-184.eu-west-1.compute.internal in Controller
1m 1m 1 ip-10-0-100-184.eu-west-1.compute.internal.14fb501a19555324 Node Normal DeletingNode controllermanager Node ip-10-0-100-184.eu-west-1.compute.internal event: Deleting Node ip-10-0-100-184.eu-west-1.compute.internal because it's not present according to cloud provider
1m 1m 1 ip-10-0-100-184.eu-west-1.compute.internal.14fb501b435fa741 Node Normal RemovingNode controllermanager Node ip-10-0-100-184.eu-west-1.compute.internal event: Removing Node ip-10-0-100-184.eu-west-1.compute.internal from Controller
1m 1m 1 ip-10-0-100-184.eu-west-1.compute.internal.14fb501c241eff24 Node Normal Starting kubelet, ip-10-0-100-18
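As a side note, events are not guaranteed to be printed in chronological order; kubectl can sort them for you via the standard --sort-by option (nothing specific to this setup):
kubectl get events --sort-by=.metadata.creationTimestamp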
This can be pared down to: Removing Node ip-10-0-100-184.eu-west-1.compute.internal from Controller.
As you can see, once the controller loop calls the Amazon cloud provider (having detected that the cluster is running on EC2), it removes the node immediately. If we dig a bit deeper into the code, we can trace this behavior back to this PR. When it operates correctly, the Amazon controller loop removes instances from the cluster if it believes the node is gone or no longer available. This is most likely done to clean up EC2 instances that die, whether terminated as the result of a spot price surge or lost due to other communication issues that affect Raft/gossip traffic.
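If you want to double-check what the cloud provider actually sees, a quick sanity check from any machine with AWS credentials might look like this (the region and private-dns-name filter below are assumptions based on the node names above; adjust them to your cluster):
aws ec2 describe-instances \
  --region eu-west-1 \
  --filters "Name=private-dns-name,Values=ip-10-0-100-184.eu-west-1.compute.internal" \
  --query "Reservations[].Instances[].State.Name"
If this comes back empty, or with a terminated state, while the kubelet on that machine is still running, the controller's decision to delete the node is at least consistent with the cloud provider's view.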
We’ve built Hollowtrees with the exact purpose of avoiding these kinds of problems in the cloud (subscribe to our social channels or check back next week for more on this and other topics).
This issue can actually be fixed with a simple loop:
# Retry joining until this node shows up in the cluster's node list
until kubectl get node | grep "$(hostname -f)"
do
    # Clean up state from the previous attempt before retrying the join
    kubeadm reset
    kubeadm join --token ${TOKEN} ${MASTER}
    echo "Waiting for Master to start up..."
    sleep 10
done
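For completeness: the ${TOKEN} and ${MASTER} variables have to reach the node somehow; how you deliver them (EC2 user data, configuration management, etc.) is up to your setup and not covered here. A minimal sketch of producing them on the master with standard kubeadm commands (the IP below is just the example master's private address, and 6443 is kubeadm's default API server port):
# On the master: mint a bootstrap token that nodes can join with
TOKEN=$(kubeadm token create)
# Inspect existing tokens and their TTLs
kubeadm token list
# MASTER is the API server endpoint the node should join
MASTER="10.0.0.51:6443"  # example private IP + kubeadm's default port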
How it works
On the first check there is no kubeconfig configuration yet, so kubectl fails and the initial join occurs. On the second iteration, the credentials from the first join are now in place, so the check can tell whether the node has actually joined the cluster. If the node has not joined, the loop resets and retries, repeating this process until the join actually happens (more on the credentials after the listing below). A successful join will look like this:
NAME STATUS ROLES AGE VERSION
ip-10-0-0-51.eu-west-1.compute.internal Ready master 4m v1.8.3
ip-10-0-100-184.eu-west-1.compute.internal Ready <none> 3m v1.8.3
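One detail the loop glosses over: on the node, kubeadm join writes the kubelet's credentials to /etc/kubernetes/kubelet.conf (the standard kubeadm location), and kubectl needs to be pointed at a valid kubeconfig for the check to succeed. A minimal sketch of an explicit, scriptable check, assuming that standard path (the jsonpath expression is our own addition, not part of the original setup):
# Assumes the standard kubeadm path; adjust if your setup differs
export KUBECONFIG=/etc/kubernetes/kubelet.conf
# Print each node together with the status of its Ready condition
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
A node listed with "True" here corresponds to the Ready status in the table above.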