Last time we discussed how our Pipeline PaaS deploys and provisions an AWS EFS filesystem on Kubernetes and what the performance benefits are for Spark or TensorFlow. This post gives:
- An introduction to TensorFlow on Kubernetes
- The benefits of EFS for TensorFlow (image data storage for TensorFlow jobs)
Pipeline uses the kubeflow framework to deploy:
- A JupyterHub to create & manage interactive Jupyter notebooks
- A TensorFlow Training Controller that can be configured to use CPUs or GPUs
- A TensorFlow Serving container
Note that Pipeline also has default Spotguides for Spark and Zeppelin to help support your data science experience.
Introduction to TensorFlow
TensorFlow is a popular open source software library for numerical computation using data flow graphs. A computational graph is a series of TensorFlow operations arranged into a graph of nodes, where the nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays - tensors - communicated between them. This flexible architecture allows for the deployment of computation to one or more CPUs or GPUs on a desktop, server, or mobile device using a single API.
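As a tiny illustration of that model (using the TensorFlow 1.x API the rest of this post relies on), the snippet below builds a graph first and only computes it when a session runs it:

import tensorflow as tf

# Build the graph: the constants and the matmul are nodes, the tensors
# flowing between them are the edges.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0], [2.0]])
product = tf.matmul(a, b)

# Nothing has been computed yet; the session executes the graph.
with tf.Session() as sess:
    print(sess.run(product))   # [[ 5.], [11.]]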
There are plenty of examples and scripts for running TensorFlow workloads, most of them running on single nodes/machines. In order to run your scripts on a cluster of TensorFlow servers, you need to modify them, create the ClusterSpec, and explicitly define which tasks run on which devices (a rough sketch of this follows below). A typical TensorFlow cluster consists of workers and parameter servers. Workers run copies of the training, while parameter servers (ps) maintain the model parameters. You can find more info about these here: Distributed TensorFlow. Kubeflow is a handy way to deploy workers and parameter servers, allowing us to define a TfJob, which makes it easy to describe TensorFlow specific deployments.
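To make that concrete, here's a rough sketch in plain TensorFlow 1.x of the boilerplate a single-node script has to gain; the host names and ports below are made-up placeholders, not values taken from the Inception example:

import tensorflow as tf

# Describe the cluster explicitly: which tasks are workers, which are parameter servers.
cluster = tf.train.ClusterSpec({
    "worker": ["worker-0:2222", "worker-1:2222"],   # run copies of the training
    "ps": ["ps-0:2222"],                            # maintain the model parameters
})

# Every task starts a server for its own role and index, e.g. worker 0:
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Parameter server tasks would simply call server.join(); worker tasks place
# variables on the ps tasks and build the model on themselves:
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
    # ... build the model and the training op here ...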
We’ve selected an example walk-through for provisioning the Pipeline PaaS, inception_distributed_training.py, a distributed training job from the well-known Inception model, adapted to run on kubeflow. Inception is a deep convolutional neural network architecture for state-of-the-art classification and detection of images.
Kubeflow passes TensorFlow cluster specs (workers and parameter servers) as JSON in an environment variable called TF_CONFIG. That’s why we needed to modify the inception_distributed_training.py script a bit.
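The gist of that change looks roughly like the snippet below. It's a simplified sketch rather than the actual patch, and it assumes the conventional TF_CONFIG layout with cluster and task keys:

import json
import os

import tensorflow as tf

# kubeflow injects the cluster layout as JSON, so read it instead of hard-coding it
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_def = tf_config.get("cluster", {})   # e.g. {"worker": [...], "ps": [...]}
task = tf_config.get("task", {})             # e.g. {"type": "worker", "index": 0}

cluster = tf.train.ClusterSpec(cluster_def)
server = tf.train.Server(cluster,
                         job_name=task.get("type", "worker"),
                         task_index=task.get("index", 0))

if task.get("type") == "ps":
    server.join()   # parameter servers just serve variables
# workers carry on to build the model and run the training loop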
Ok, now let’s take a look at how to get this started. First, you need to check out two projects on GitHub:
git clone https://github.com/google/kubeflow.git
git clone https://github.com/banzaicloud/tensorflow-models.git
cd tensorflow-models
git checkout master-k8s
Spoiler alert 1: Pipeline automates all of this, and we’ve made the corresponding Helm chart deployments available. The purpose of this detailed walk-through is to understand what’s happening behind the scenes when you choose model training or serving with Pipeline.
Create a Kubernetes cluster on AWS and attach EFS storage
This is pretty much the same as described in the previous post about EFS on AWS. To recap:
- you need to provision a Kubernetes cluster on AWS (with or without Pipeline)
- and attach EFS storage (with or without Pipeline)
For this example we’ve provisioned a two-node cluster (m4.2xlarge) with EFS available as a PVC with the name efs.
Build Docker images for our Inception training example
You can either use our Docker image or build it yourself:
cd tensorflow-models/research
docker build -t banzaicloud/tensorflow-inception-example:v0.1 -f inception/k8s/docker/Dockerfile .
This image contains the bazel-bin executable versions of the following scripts:
- download_and_preprocess_imagenet - downloads a set of images (imagenet) and prepares TfRecord files
- download_and_preprocess_flowers - downloads a set of images (flowers) and prepares TfRecord files
- imagenet_train - default training script for imagenet data
- imagenet_distributed_train - distributed training script for imagenet data
- imagenet_eval - evaluates training of imagenet data
Download and preprocess images for training jobs
cd tensorflow-models/research/inception/k8s
kubectl apply -f prepare.yaml
Now we have to launch the download_and_preprocess_flowers script, which downloads the data set to our EFS volume, unpacks it and creates TfRecord files suitable for training jobs. We used the flowers data set for the purposes of this demo since it’s much smaller than imagenet, which keeps the download and training times short. To train on imagenet data, all you have to do is change the command in prepare.yaml to download_and_preprocess_imagenet.
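If you’d like to sanity check the result before kicking off the training job, a quick (hypothetical) snippet like this one counts the records in the generated shards; the /efs/image-data path matches the --data_dir used later in training.yaml, while the train-* shard naming is an assumption about the preprocessing output:

import glob

import tensorflow as tf

# Count records in the TfRecord shards written by download_and_preprocess_flowers
shards = glob.glob("/efs/image-data/train-*")
total = 0
for shard in shards:
    for _ in tf.python_io.tf_record_iterator(shard):
        total += 1

print("found %d shards containing %d records" % (len(shards), total))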
Run TensorFlow training
Deploy Kubeflow:
kubectl apply -f [directory_where_you_cloned_kubeflow]/components/ -R
$ kubectl get po
NAME READY STATUS RESTARTS AGE
efs-provisioner-b8dcc9bd7-s6dt4 1/1 Running 0 5h
tf-job-operator-6f7ccdfd4d-mkk9l 1/1 Running 0 18m
As you can see, Kubeflow deploys a tf-job-operator which handles TensorFlow specific deployments like TfJob. It’s quite self-explanatory; you’ll understand once you take a look at training.yaml:
kind: "TfJob"
metadata:
name: "inception-train-job"
spec:
replicaSpecs:
- replicas: 4
tfReplicaType: WORKER
template:
spec:
containers:
- image: banzaicloud/tensorflow-inception-example:v0.1
name: tensorflow
command: ["bazel-bin/inception/imagenet_distributed_train"]
args: ["--batch_size=32", "--num_gpus=0", "--data_dir=/efs/image-data", "--train_dir=/efs/train"]
volumeMounts:
- name: efs-pvc
mountPath: "/efs"
volumes:
- name: efs-pvc
persistentVolumeClaim:
claimName: efs
restartPolicy: OnFailure
- replicas: 2
tfReplicaType: PS
tensorboard:
logDir: /efs/train
serviceType: LoadBalancer
volumes:
- name: efs-pvc
persistentVolumeClaim:
claimName: efs
volumeMounts:
- name: efs-pvc
mountPath: "/efs"
terminationPolicy:
chief:
replicaName: WORKER
replicaIndex: 0
Ok, let’s deploy the training job:
kubectl apply -f training.yaml
Check running pods on your k8s cluster:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
efs-provisioner-b8dcc9bd7-s6dt4 1/1 Running 0 5h
inception-train-job-ps-b91p-0-256s8 1/1 Running 0 1m
inception-train-job-ps-b91p-1-xwqsr 1/1 Running 0 1m
inception-train-job-tensorboard-b91p-5dfcc88c95-p5r97 1/1 Running 0 1m
inception-train-job-worker-b91p-0-hqbg9 1/1 Running 0 1m
inception-train-job-worker-b91p-1-mqp54 1/1 Running 0 1m
inception-train-job-worker-b91p-2-pmqxn 1/1 Running 0 1m
inception-train-job-worker-b91p-3-lpfls 1/1 Running 0 1m
tf-job-operator-6f7ccdfd4d-mkk9l 1/1 Running 0 18m
You may notice that there’s a pod for Tensorboard which is reachable from outside through a LoadBalancer. You can retrieve its address by describing the service:
$ kubectl describe svc inception-train-job-tensorboard-b91p
Name: inception-train-job-tensorboard-b91p
Namespace: default
Labels: app=tensorboard
runtime_id=b91p
tensorflow.org=
tf_job_name=inception-train-job
Annotations: <none>
Selector: app=tensorboard,runtime_id=b91p,tensorflow.org=,tf_job_name=inception-train-job
Type: LoadBalancer
IP: 10.98.1.170
LoadBalancer Ingress: ac6299300fb7211e7bc7c06b9c102cd8-1452377798.eu-west-1.elb.amazonaws.com
Port: tb-port 80/TCP
TargetPort: 6006/TCP
NodePort: tb-port 32515/TCP
Endpoints: 10.38.0.6:6006
Session Affinity: None
External Traffic Policy: Cluster
Tensorboard can visualize your graph, making it easier to understand and debug.
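As a minimal sketch of how data gets there: anything a job writes with a tf.summary.FileWriter under /efs/train (the logDir in training.yaml) shows up on that dashboard. The scalar below is just a stand-in, not something taken from the Inception scripts:

import tensorflow as tf

# A placeholder scalar summary, just to show the mechanics
loss = tf.placeholder(tf.float32, name="loss")
tf.summary.scalar("loss", loss)
summary_op = tf.summary.merge_all()

with tf.Session() as sess:
    # Passing the graph here is what makes it visible on Tensorboard's Graphs tab
    writer = tf.summary.FileWriter("/efs/train", sess.graph)
    for step in range(3):
        summary = sess.run(summary_op, feed_dict={loss: 1.0 / (step + 1)})
        writer.add_summary(summary, global_step=step)
    writer.close()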
If you look at the logs carefully you’ll notice that our example two-node cluster executes learning steps quite slowly. That’s mostly because we’re using generic compute instances on AWS, not GPU compute instances. Using generic compute instances instead of GPU ones doesn’t make sense except to illustrate the benefits and the ease of using GPU resources in Kubernetes, which we will do in the next post in this series.
Spoiler alert 2: it’s considerably faster and uses the built-in GPU plugin/scheduling feature of k8s.
Overall, it’s relatively easy to start a training job on k8s once you have the proper definition yaml files. But it’s even easier to push your Python scripts to a GitHub repository (like in this Spark example) and then leave the cluster provisioning (both cloud and k8s), build, deployment, monitoring, scaling and everything else to an automated CI/CD Pipeline.