Kubernetes node pool upgrades with Pipeline

The content of this page hasn't been updated for years and might refer to discontinued products and projects.

One of the many best practices for operating Kubernetes clusters is to upgrade the Kubernetes version of those clusters frequently. This ensures that you’ll be running your workload with a version of Kubernetes that contains the latest security fixes, stability improvements, and features. There are two ways of doing this:

manually with, for example, kubeadm when you’re running on bare-metal
and, if you’re using one of the many cloud-provider Kubernetes solutions, the assisted way

Both have their benefits and downsides. The manual process is very flexible, but error-prone by nature. The cloud-provider assisted version can be rigid, but its chance of failure is quite low. Based on our customers’ feedback, we’ve seen that enterprises like to customize the upgrade process while mitigating their risks as much as possible. Since it’s unacceptable to run into an upgrade issue on a cluster with a production workload, they tend to use managed Kubernetes solutions and limited but assisted upgrade processes.

Banzai Cloud aims to make managed Kubernetes node pool upgrades cloud-agnostic and highly customizable. We have carefully analyzed and compared existing managed Kubernetes upgrade solutions in reference to customer needs, so we could come up with a meticulously crafted upgrade process that fits current market requirements.

A quick review of the managed node upgrade solutions that already exist

At Banzai Cloud we’ve been building our own cloud-agnostic container management platform for hybrid clouds. Along the way, we’ve offered Kubernetes support both the easy way (cloud provider managed K8s) and the hard way (provisioning our own CNCF certified K8s distribution, PKE). Today, Pipeline provisions Kubernetes on multiple cloud providers (Amazon, Azure, Google) as well as on-prem (VMware and bare metal) and makes available a wide range of Day 1 and Day 2 operations on top of these clusters.

Credits to Google GKE

When we started to build Pipeline and PKE, the only cloud provider managed Kubernetes distribution was Google’s GKE. If you are familiar with both Pipeline and GKE then you may have already noticed that they share some concepts, like node pools.

We’d like to take a moment to give Google credit for setting such a high bar for managed Kubernetes. It’s influenced many of the choices we made while building Pipeline and PKE and we have, overall, strived to offer a GKE-like experience on-prem and across the five major clouds Pipeline supports.

When we got to the point of providing rolling upgrades for Kubernetes, it was no different. We dug deep into how GKE does rolling upgrades, and set that as the bare minimum requirement across all the clouds we support. Since Pipeline uses the Cadence workflow engine under the hood, we managed to build a highly customizable managed rolling upgrade flow, with the option to add custom flows and cloud provider specific steps (if any), to choose between sequential or parallel execution, and more.

Now let’s compare the ways EKS and GKE handle rolling node pool upgrades, and examine the approach, based on best-of-breed technologies, that we came up with.

EKS node group upgrade

AWS’s EKS uses ‘node groups’, which are essentially the same as Pipeline’s node pools. Node groups are backed by Auto Scaling groups. EKS’s managed node update process works as follows. It updates the Auto Scaling group’s launch template with the new image and increases the desired size of the group by count(AZs) * 2 (let’s call these extra nodes surge nodes, as other providers do). Then it cordons all the old nodes through the Kubernetes API server. After this, old nodes are drained and removed from the Auto Scaling group one by one, while new nodes come up one by one to replace them. In the end, the size of the group is restored to its original value.

As you can see, the upgrade process is a fairly decent one, but it lacks configurability. As of now, apart from the AMI and a forced drain, nothing can be configured during the upgrade process.

GKE node pool upgrade

Google’s GKE offers a slightly more versatile upgrade process for Kubernetes node pools. By default, it upgrades nodes one at a time, but this number is configurable, unlike in EKS node groups. Like EKS, it cordons and then drains the nodes that have been selected for upgrade. It offers fine-tuning options such as a configurable number of surge nodes: the node pool is first extended with nodes running the new version, and only then are the nodes running the old version removed. The other fine-tuning option worth pointing out is maxUnavailable, which determines the number of nodes that can be unavailable simultaneously during an upgrade.
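
For illustration, the two GKE settings described above map to the --max-surge-upgrade and --max-unavailable-upgrade flags of gcloud. The sketch below only prints the command so that it can be reviewed before running; the helper name gke_surge_command and the cluster and pool names are made up for this example, and depending on your gcloud configuration you may also need to pass a zone or region.

```shell
# Hypothetical helper: prints the gcloud command instead of running it,
# so the surge settings can be reviewed first.
gke_surge_command() {
  local cluster="$1" pool="$2" max_surge="$3" max_unavailable="$4"
  printf 'gcloud container node-pools update %s --cluster=%s --max-surge-upgrade=%s --max-unavailable-upgrade=%s\n' \
    "$pool" "$cluster" "$max_surge" "$max_unavailable"
}

# Example (made-up names): one extra node during the upgrade,
# and no node allowed to be unavailable.
gke_surge_command my-cluster pool1 1 0
```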

Pipeline node pool upgrade

Pipeline drew inspiration from the cloud providers we just talked about when implementing its upgrade algorithm. The main benefit of using Pipeline’s node pool upgrade procedure is that it combines the benefits of other cloud providers’ solutions, and offers a cloud-agnostic method and API for you to manage upgrades across clouds.

The provider independent node pool upgrade algorithm does the following:

1. Checks input parameters and calculates a new node version label.
2. Ensures that the node pool’s desired capacity is stable.
3. Updates the node pool label and image in the pool template.
4. Optional: increases node pool size by maxSurge nodes.
5. Disables the cluster autoscaler if it’s enabled.
6. Cordons maxBatchSize old nodes, and drains them.
7. Waits for maxBatchSize new nodes to come up and become ready.
8. Optional: checks that the number of unavailable nodes hasn’t reached the configured maxUnavailable threshold.
9. Repeats steps 6 through 8, until all nodes are replaced.
10. Optional: decreases node pool size by maxSurge nodes.
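
To make the batching concrete, here is a minimal dry-run sketch of the surge, cordon-and-drain, and wait loop from the list above. It is not Pipeline’s implementation: simulate_upgrade is a hypothetical function that only prints a plan, calls no cloud or Kubernetes API, and leaves out the autoscaler handling and the maxUnavailable check.

```shell
# Dry-run simulation of the batch loop; nothing is actually upgraded.
# Arguments: <number of old nodes> <maxBatchSize> [maxSurge]
simulate_upgrade() {
  local old_nodes="$1" max_batch_size="$2" max_surge="${3:-0}"
  local new_nodes=0 batch=0 size

  if [ "$max_batch_size" -lt 1 ]; then
    echo "maxBatchSize must be at least 1" >&2
    return 1
  fi

  # Optional surge: grow the pool before replacing anything.
  if [ "$max_surge" -gt 0 ]; then
    echo "surge: +${max_surge} nodes"
  fi

  while [ "$old_nodes" -gt 0 ]; do
    batch=$((batch + 1))
    # The last batch may be smaller than maxBatchSize.
    size=$(( old_nodes < max_batch_size ? old_nodes : max_batch_size ))
    old_nodes=$((old_nodes - size))
    new_nodes=$((new_nodes + size))
    echo "batch ${batch}: cordon+drain ${size} old, wait for ${size} new (old=${old_nodes}, new=${new_nodes})"
  done

  # Shrink back to the original size if a surge was used.
  if [ "$max_surge" -gt 0 ]; then
    echo "surge: -${max_surge} nodes"
  fi
}

# Five nodes, two per batch, one surge node: three batches in total.
simulate_upgrade 5 2 1
```

A larger maxBatchSize shortens the upgrade but removes more capacity at once, which is the trade-off the surge and unavailability options are meant to control.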

In addition, we support lots of configuration parameters for different batch sizes, number of surge and unavailable nodes, and drain options. Check the documentation for more details.

Upgrading an EKS node pool with Pipeline

Let’s test the feature by creating an EKS cluster with Kubernetes version 1.15.11 and updating it to 1.16.8 with the help of the Banzai CLI. Before upgrading any Kubernetes cluster, we advise you to familiarize yourself with the official Kubernetes version and version skew support policy to avoid incompatibilities (unless your clusters are managed by Banzai Cloud as part of a subscription).
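
One rule from that policy is easy to check up front: the control plane should be upgraded one minor version at a time. The following sketch encodes only that rule; is_supported_upgrade is a hypothetical helper written for this example, not part of the Banzai CLI, and it does not cover the rest of the skew policy.

```shell
# Hypothetical helper: succeeds only if the target is the same minor
# version or exactly one minor version ahead of the current one.
is_supported_upgrade() {
  local current="$1" target="$2"
  local cur_major cur_minor tgt_major tgt_minor rest

  # Split "MAJOR.MINOR[.PATCH]" into its first two components.
  cur_major="${current%%.*}"; rest="${current#*.}"; cur_minor="${rest%%.*}"
  tgt_major="${target%%.*}";  rest="${target#*.}";  tgt_minor="${rest%%.*}"

  [ "$cur_major" -eq "$tgt_major" ] || return 1
  [ "$tgt_minor" -ge "$cur_minor" ] && [ "$tgt_minor" -le $((cur_minor + 1)) ]
}

if is_supported_upgrade "1.15.11" "1.16"; then
  echo "upgrade path OK"
else
  echo "skipping minor versions is not supported"
fi
```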

Create an EKS cluster with Pipeline with Kubernetes version 1.15.11:

CLUSTER="my-cluster"
REGION="eu-west-1"
VERSION="1.15.11"
NEW_VERSION="1.16"

banzai cluster create <<EOF
{
  "name": "${CLUSTER}",
  "location": "${REGION}",
  "cloud": "amazon",
  "secretName": "aws",
  "properties": {
    "eks": {
      "version": "${VERSION}",
      "nodePools": {
        "pool1": {
          "spotPrice": "0.03",
          "count": 3,
          "minCount": 3,
          "maxCount": 5,
          "autoscaling": true,
          "instanceType": "t2.medium"
        }
      }
    }
  }
}
EOF

Update the control plane of this cluster to 1.16 with the AWS CLI:

aws eks update-cluster-version --region ${REGION} --name ${CLUSTER} --kubernetes-version ${NEW_VERSION}

Grab the latest image ID of the target Kubernetes version from the AWS Parameter Store:

IMAGE=$(aws ssm get-parameter --name /aws/service/eks/optimized-ami/${NEW_VERSION}/amazon-linux-2/recommended/image_id --region ${REGION} --query "Parameter.Value" --output text)
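
Before passing the value on, it can be worth checking that the lookup really returned an image ID, since a failed or empty query would otherwise end up in the update request. A minimal sketch, assuming AMI IDs have the form ami- followed by lowercase hexadecimal characters; is_ami_id is a hypothetical helper and the ID in the example is made up.

```shell
# Hypothetical guard: accept only values that look like an AMI ID.
is_ami_id() {
  case "$1" in
    ami-*[!0-9a-f]*) return 1 ;;  # a non-hex character after the prefix
    ami-?*)          return 0 ;;  # prefix plus at least one hex character
    *)               return 1 ;;  # empty, "None", or anything else
  esac
}

# Example with a made-up ID; in the walkthrough you would pass "${IMAGE}".
if is_ami_id "ami-0123456789abcdef0"; then
  echo "looks like an AMI ID"
fi
```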

Start the upgrade to this image version:

banzai cluster nodepool update --cluster-name ${CLUSTER} pool1 <<EOF
{
  "image": "${IMAGE}",
  "options": {
    "maxBatchSize": 2,
    "maxUnavailable": 0,
    "drain": {
      "timeout": 300,
      "failOnError": true,
      "podSelector": "app=database"
    }
  }
}
EOF
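
While the update runs, and after it finishes, you can check which nodes still report the old kubelet version. The sketch below assumes kubectl access to the cluster for the live variant; nodes_not_on_version is a hypothetical helper, and the node names and version suffixes in the offline example are made up.

```shell
# Reads "<node-name> <kubelet-version>" lines from stdin and prints
# the nodes whose kubelet version does not start with the target.
nodes_not_on_version() {
  local target="$1"
  awk -v prefix="v${target}" 'NF >= 2 && index($2, prefix) != 1 { print $1 }'
}

# Against a live cluster (requires kubectl access):
#   kubectl get nodes --no-headers \
#     -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion \
#     | nodes_not_on_version "1.16"

# Offline example with made-up node names and versions:
printf '%s\n' \
  'ip-10-0-1-10 v1.15.11-eks-aaaaaa' \
  'ip-10-0-2-20 v1.16.8-eks-bbbbbb' \
  | nodes_not_on_version "1.16"
```

An empty result means every listed node already runs a kubelet of the target version.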

During the upgrade you can follow what Pipeline is doing step by step. For the CLI, check out this asciicast animation:

Upgrade capacity strategies

You can choose from several capacity strategies to be used during node pool upgrades:

decreased capacity: for every upgrade batch, your cluster shrinks by maxBatchSize nodes and then returns to its normal size (breathing)
maintained capacity: your cluster grows by maxSurge nodes, so the original capacity is maintained during the upgrade
adaptive upgrade: based on the utilization of the target node pool, the cluster upgrader chooses one of the above strategies
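
As a rough illustration of the adaptive idea, the choice can be thought of as a simple rule on node pool utilization. This is a sketch only: choose_strategy is a hypothetical function, and the 70% default threshold is an assumption made for the example, not a documented Pipeline value.

```shell
# Hypothetical decision rule; the 70% threshold is an assumption,
# not a documented Pipeline default.
# Arguments: <utilization in percent> [threshold in percent]
choose_strategy() {
  local utilization_percent="$1" threshold="${2:-70}"
  if [ "$utilization_percent" -ge "$threshold" ]; then
    echo "maintained"   # busy pool: add maxSurge nodes first
  else
    echo "decreased"    # spare headroom: shrink by maxBatchSize per batch
  fi
}

choose_strategy 85   # prints "maintained"
choose_strategy 30   # prints "decreased"
```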

In the following video, we show you how we upgraded a 100-node EKS cluster with cluster monitoring enabled. You can observe how some nodes come and go on the Grafana Dashboard, and how the pods are rescheduled during draining because we use the decreased capacity strategy for the upgrade.

See the documentation for the details of the procedure and the available node pool upgrade options. The open source version for EKS node pool upgrade is available on our hosted trial service. For all the available options and for the sophisticated upgrade process, check the enterprise version of the Pipeline platform.

About Banzai Cloud Pipeline

Banzai Cloud’s Pipeline provides a platform for enterprises to develop, deploy, and scale container-based applications. It leverages best-of-breed cloud components, such as Kubernetes, to create a highly productive, yet flexible environment for developers and operations teams alike. Strong security measures — multiple authentication backends, fine-grained authorization, dynamic secret management, automated secure communications between components using TLS, vulnerability scans, static code analysis, CI/CD, and so on — are default features of the Pipeline platform.
