In the last entry of our distributed TensorFlow series, we used a research example for distributed training of an Inception model. In this post we’ll showcase how to do the same thing on GPU instances, this time on Azure managed Kubernetes - AKS - deployed with Pipeline.
As you may remember from our previous post, the first thing to consider when running distributed TensorFlow models is whether you have shared storage space available. On AWS we previously used EFS. We currently run something similar on Azure Cloud, AzureFiles, which is a fully managed File Share accessible via the industry standard Server Message Block (SMB) protocol (also known as the Common Internet File System, or CIFS). Azure File Shares can be mounted concurrently by cloud or on-premise deployments. Fortunately, AzureFiles has native support in Kubernetes, so you can dynamically provision AzureFiles via a Storage Class or bind an already existing File Share to a Persistent Volume.
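For reference, the dynamic route is simply a Storage Class backed by the azure-file provisioner; a minimal sketch, in which the class name, SKU and location are only example values, looks roughly like this:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
  location: eastus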
Since we want to keep the model’s training data for the long term, we’ll create a Storage Account and File Share beforehand, instead of provisioning them dynamically.
We’re going to use the same Inception example we did last time, though with a subtly optimized preparation script. That script first downloads and extracts a prepared set of images, the Flowers dataset, then separates the images into training and validation sets, and finally creates TFRecord files. Since these images are mostly small files (100 KByte at most), copying them through a shared storage system isn’t ideal. That’s why we run the first part of the preparation on local disk and only create the TFRecord files on shared storage. TFRecord files are typically larger, at least tens of MBytes, and there’s no significant overhead to reading them from AzureFile compared to AzureDisk; in other words, you can’t significantly boost overall training speed by placing these files on AzureDisk, while keeping them on AzureFile lets you run workers distributed across separate nodes.
The required steps are virtually identical to those taken on CPU instances; the main difference is the definition of the training job. If you compare training_gpu.yaml to training_cpu.yaml, here’s the difference between the workers’ job definitions:
tfReplicaType: WORKER
template:
  spec:
    containers:
      - image: banzaicloud/tensorflow-inception-example:v0.18-gpu
        ...
        volumeMounts:
          - name: bin
            mountPath: /usr/local/nvidia/bin
          - name: lib
            mountPath: /usr/local/nvidia/lib64
          - name: libcuda
            mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
    volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      - name: libcuda
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
As you’ve probably noticed, the GPU version runs a different image, requests GPU resources from Kubernetes, and mounts the NVIDIA driver folders from the host.
Ok, so now let’s examine the steps necessary to run our example.
Create the Azure cluster
Create a Kubernetes cluster on Azure with AKS, with an agent pool of two Standard_NC6 instances. This instance type has one GPU device available, powered by an NVIDIA Tesla K80 card.
You can accomplish this quickly and efficiently with Pipeline, using this example Create Azure cluster
request:
POST {{pipeline_url}}/api/v1/clusters
{
  "name": "azgputestcluster",
  "location": "eastus",
  "cloud": "azure",
  "nodeInstanceType": "Standard_NC6",
  "properties": {
    "azure": {
      "node": {
        "resourceGroup": "your_resource_group_name",
        "agentCount": 2,
        "agentName": "agentpool1",
        "kubernetesVersion": "1.8.2"
      }
    }
  }
}
You should retrieve the Kubernetes config for your cluster to configure the kubectl command, since we’ll use it for deployment in the next few steps. If you created your cluster with Pipeline, you can retrieve the config with a simple GET request:
GET {{pipeline_url}}/api/v1/clusters/{{cluster_id}}/config
Save the config to a file and point the KUBECONFIG environment variable at it.
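As a quick sketch, assuming you saved the response body to a file called cluster-config.yaml (the file name is arbitrary), that boils down to:

export KUBECONFIG=$PWD/cluster-config.yaml
kubectl get nodes   # quick sanity check that the config works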
Create a Storage Account & File Share on the Azure Portal
Create a general purpose Storage Account, then add a File Share by clicking on Files.
I’ve named the File Share tensorflowshared
. You can use any name you want, but make sure to remember it, as we will need it for the next step.
Go to the Access keys
tab, and save your access key.
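If you prefer the command line to the portal, the Azure CLI can do the same; the commands below are only a rough equivalent, with mytfstorage and your_resource_group_name as placeholders you’ll want to replace:

az storage account create --name mytfstorage --resource-group your_resource_group_name --location eastus --sku Standard_LRS
az storage account keys list --account-name mytfstorage --resource-group your_resource_group_name
az storage share create --name tensorflowshared --account-name mytfstorage --account-key <your_access_key>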
Check out our example code and Kubeflow
git clone https://github.com/google/kubeflow.git
git clone https://github.com/banzaicloud/tensorflow-models.git
git -C tensorflow-models checkout master-k8s-azure
You can find the K8S deployment files in the tensorflow-models/research/inception/k8s folder.
Bind storage as a Persistent Volume
First, encode your Storage Account name and Access key in base64, then set them in pvc_static.yaml.
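The encoding itself is a pair of one-liners:

echo -n 'your_storage_account_name' | base64
echo -n 'your_access_key' | base64

The values end up in the Kubernetes Secret that the azureFile volume in pvc_static.yaml references. The sketch below shows the usual wiring; the Secret name here is only illustrative, so double-check it against the file in the repo, while azurestorageaccountname and azurestorageaccountkey are the keys Kubernetes expects for azureFile volumes:

apiVersion: v1
kind: Secret
metadata:
  name: azure-files-secret   # illustrative name, check pvc_static.yaml
type: Opaque
data:
  azurestorageaccountname: <base64 encoded Storage Account name>
  azurestorageaccountkey: <base64 encoded Access key>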
kubectl create -f tensorflow-models/research/inception/k8s/pvc_static.yaml
Make sure the PV is correctly bound:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv001 5Gi RWX Retain Bound default/azure-files 9s
Run prepare script
kubectl create -f tensorflow-models/research/inception/k8s/prepare.yaml
Check that your File Share contains the image-data
folder with the train & validation files.
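You can also follow the preparation from kubectl itself; the pod name below is just illustrative, take the real one from the kubectl get pods output:

kubectl get pods
kubectl logs -f <prepare-pod-name>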
Deploy training job
We’ll use Kubeflow to deploy the training jobs, so you have to deploy the Kubeflow operator first:
kubectl apply -f kubeflow/components -R
kubectl create -f tensorflow-models/research/inception/k8s/training_gpu.yaml
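To verify that the job was picked up, you can list the custom resource and the pods it spawned; the resource is typically named tfjobs, though the exact CRD name can differ between Kubeflow releases, so adjust accordingly:

kubectl get tfjobs
kubectl get pods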
TensorBoard is reachable on an external IP, which you can obtain by listing the services:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
inception-train-job-tensorboard-kujj LoadBalancer 10.0.162.221 13.82.187.171 80:30691/TCP 3h
...
Summary
Running Deep Learning training jobs in a distributed manner might not be a perfect fit for every use case by default; however, the advantage of using Kubernetes to deploy TensorFlow workloads is that you can freely combine different cluster environments using the same code and the same deployment. For example, you can run the deployment above on a single-node cluster with two GPUs (Standard_NC12) without any modification, and it will be faster than running the same deployment on two nodes with a single GPU each. So what are we gaining here? Well, running a job on a powerful multi-GPU instance usually costs a lot more than running multiple single- or dual-GPU instances. Here’s where Pipeline comes into play: integrated with Hollowtrees, it allows for ideal combinations of price and performance, so you can decide what best fits your needs.
If you’re interested in how Pipeline automates all the preceding steps and runs TensorFlow jobs on different cloud providers (AWS, Azure, Google Cloud), follow us.