In the last entry in our distributed TensorFlow series, we used a research example for distributed training of an
Inception model. In this post we’ll showcase how to do the same thing on
GPU instances, this time on AKS, Azure’s managed Kubernetes service, deployed with Pipeline.
As you may remember from our previous post, the first thing to consider when running distributed TensorFlow models is whether you have shared storage available. On AWS we used EFS. On Azure we use something similar,
AzureFiles, which is a fully managed File Share accessible via the industry standard Server Message Block (SMB) protocol (also known as the Common Internet File System or CIFS). Azure File Shares can be mounted concurrently by cloud or on-premises deployments. Fortunately,
AzureFiles has native support in Kubernetes, so you can dynamically provision AzureFiles via
Storage Class or bind an already existing File Share to a Persistent Volume.
Since we want to keep training model data in the long term, we’ll create a Storage Account and File Share beforehand, instead of dynamically provisioning.
We’re going to use the same Inception example as last time, though with a slightly optimized preparation script. That script will first download and extract a prepared set of images, the Flowers dataset,
then separate the images into training and validation sets, and finally create
TFRecord files. Since these images are mostly small files (100 KB at most), copying them through shared storage isn’t ideal. That’s why we run the first part of the preparation on the local disk, and only then create the
TFRecord files in shared storage. These are typically larger files, at least tens of MB each, and reading them from AzureFile adds no significant overhead compared to reading them from AzureDisk. In other words, placing these files on AzureDisk would not noticeably speed up training, so workers can run distributed across separate nodes and read from AzureFile.
The required steps are virtually identical to those taken on CPU instances, the main difference being the definition of the training job. If you compare training_gpu.yaml to training_cpu.yaml, here’s the difference between the workers’ job definitions:
- image: banzaicloud/tensorflow-inception-example:v0.18-gpu
- name: bin
- name: lib
- name: libcuda
- name: bin
- name: lib
- name: libcuda
As you’ve probably noticed, the GPU job uses a different image, requests GPU resources from
Kubernetes, and mounts the
NVIDIA driver folders (bin, lib and libcuda) into the workers.
Ok, so now let’s examine the steps necessary to run our example.
Create the Azure cluster
Create a Kubernetes cluster with an agent pool of two
Standard_NC6 instances on Azure with AKS. Each instance has one GPU device available, powered by an NVIDIA Tesla K80 card.
You can accomplish this quickly and efficiently with Pipeline, using this example
Create Azure cluster request:
Retrieve the Kubernetes config for your cluster so that you can point
kubectl at it, since we will use kubectl for deployment in the next few steps. If you created your cluster with
Pipeline, you can retrieve the config with a simple GET request:
Save the config to a file and set the KUBECONFIG environment variable to its path.
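As a minimal sketch of that step, assuming you saved the config under the hypothetical path `$HOME/.kube/aks-tensorflow.yaml` (the file name is ours, not something Pipeline prescribes), the environment variable can be set and checked like this:

```shell
# Hypothetical location of the saved config; use whatever path you chose.
CONFIG_FILE="$HOME/.kube/aks-tensorflow.yaml"

# kubectl reads the cluster config from the file named in KUBECONFIG.
export KUBECONFIG="$CONFIG_FILE"

# Optional sanity check: list the agent nodes, but only if kubectl is
# installed and the config file is actually in place.
if command -v kubectl >/dev/null 2>&1 && [ -f "$KUBECONFIG" ]; then
  kubectl get nodes
fi
```

If the check succeeds, you should see the two agent nodes of the cluster listed.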
Create a Storage Account & File Share on Azure Portal
Create a general purpose
Storage Account then add a File Share by clicking on
I’ve named the File Share
tensorflowshared. You can use any name you want, but make sure to remember it, as we will need it for the next step.
Go to the
Access keys tab, and save your access key.
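If you prefer the command line to the Portal, the same three steps can probably be scripted with the Azure CLI. This is only a sketch under assumptions: it presumes `az` is installed and logged in, that the resource group already exists, and the function name `create_file_share` and all argument values are hypothetical placeholders of ours.

```shell
# Sketch only: every name passed in is a placeholder you choose yourself.
# Arguments: resource group, storage account name, file share name.
create_file_share() {
  resource_group="$1"
  account="$2"
  share="$3"

  # Create the storage account with locally redundant storage.
  az storage account create \
    --resource-group "$resource_group" \
    --name "$account" \
    --sku Standard_LRS >/dev/null || return 1

  # Fetch the first access key; it is needed to create the share
  # and later for the Persistent Volume.
  key=$(az storage account keys list \
    --resource-group "$resource_group" \
    --account-name "$account" \
    --query '[0].value' --output tsv) || return 1

  # Create the File Share inside the storage account.
  az storage share create \
    --name "$share" \
    --account-name "$account" \
    --account-key "$key" >/dev/null || return 1

  # Print only the access key so the caller can save it.
  printf '%s\n' "$key"
}
```

A call such as `create_file_share my-resource-group mystorageaccount tensorflowshared` would then print the access key to save for the next step.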
Check out our example code and Kubeflow
git clone https://github.com/google/kubeflow.git
git clone https://github.com/banzaicloud/tensorflow-models.git
git -C tensorflow-models checkout master-k8s-azure
You can find the K8S deployment files in
Bind storage as a Persistent Volume
First, base64-encode the name of your Storage Account and the access key, then set the encoded values in the manifest you create next.
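The encoding step can be sketched as follows. The account name and key below are placeholder values of ours, and the variable names are hypothetical; substitute the Storage Account name and the access key you saved earlier.

```shell
# Placeholder values; replace them with your own account name and access key.
STORAGE_ACCOUNT_NAME="mystorageaccount"
STORAGE_ACCOUNT_KEY="replace-with-your-access-key"

# printf '%s' avoids the trailing newline that echo would add to the input.
# tr -d '\n' removes the line wrapping some base64 tools apply to long values.
ENCODED_NAME=$(printf '%s' "$STORAGE_ACCOUNT_NAME" | base64 | tr -d '\n')
ENCODED_KEY=$(printf '%s' "$STORAGE_ACCOUNT_KEY" | base64 | tr -d '\n')

echo "account name: $ENCODED_NAME"
echo "account key:  $ENCODED_KEY"
```

Stripping newlines matters here because real access keys are long enough for some `base64` implementations to wrap their output, and a wrapped value would not decode to the original key once pasted into a manifest.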
kubectl create -f tensorflow-models/research/inception/k8s/pvc_static.yaml
Make sure the PV is correctly bound:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv001 5Gi RWX Retain Bound default/azure-files 9s
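If you script the deployment, you may not want to inspect that table by hand. One possible approach, sketched here with a hypothetical helper named `wait_for_pv_bound`, is to poll the volume's phase through kubectl's JSONPath output until it reports `Bound`:

```shell
# Poll until the PersistentVolume reports phase "Bound", or give up.
# Arguments: PV name, number of attempts (default 30), delay in seconds (default 2).
wait_for_pv_bound() {
  pv_name="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # .status.phase is Available, Bound, Released or Failed.
    phase=$(kubectl get pv "$pv_name" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Bound" ]; then
      echo "$pv_name is Bound"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$pv_name did not reach Bound (last phase: ${phase:-unknown})" >&2
  return 1
}
```

For the volume shown above, the call would be `wait_for_pv_bound pv001`; a non-zero exit status means the claim never bound within the allotted attempts.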
Run prepare script
kubectl create -f tensorflow-models/research/inception/k8s/prepare.yaml
Check that your File Share contains the
image-data folder with the train & validation files.
Deploy training job
We’ll use Kubeflow to deploy the training jobs, so you have to deploy the Kubeflow operator first:
kubectl apply -f kubeflow/components -R
kubectl create -f tensorflow-models/research/inception/k8s/training_gpu.yaml
TensorBoard is reachable on an external IP, which you can obtain by listing the services:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
inception-train-job-tensorboard-kujj LoadBalancer 10.0.162.221 22.214.171.124 80:30691/TCP 3h
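Because the service name ends in a generated suffix, it can be handy to pick the address out of that listing automatically. The helper below, with the hypothetical name `tensorboard_ip`, is a small sketch that assumes the service name contains `tensorboard` and that its type is `LoadBalancer`, as in the output above:

```shell
# Print the external IP of the first LoadBalancer service whose name
# contains "tensorboard". Columns: NAME TYPE CLUSTER-IP EXTERNAL-IP ...
tensorboard_ip() {
  kubectl get svc --no-headers |
    awk '$1 ~ /tensorboard/ && $2 == "LoadBalancer" { print $4; exit }'
}
```

You could then open the dashboard at `http://$(tensorboard_ip)` once the load balancer has been assigned an address; until then the column shows a pending marker rather than an IP.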
Running deep learning training jobs in a distributed manner might not be a perfect fit for every use case. However, the advantage of using
Kubernetes to deploy TensorFlow workloads is that you can freely combine different cluster environments, using the same code and the same deployment. For example, you can run the deployment above on a single-node cluster with two GPUs (Standard_NC12) without any modification, and it will be faster than running the same deployment on two nodes with a single GPU each. So what do we gain by distributing? Running a job on a powerful multi-GPU instance usually costs a lot more than running it on several smaller instances with one or two GPUs each. This is where Pipeline comes into play: integrated with Hollowtrees, it lets you choose the combination of price and performance that best fits your needs.
If you’re interested in how Pipeline automates all the preceding steps and runs TensorFlow jobs on different cloud providers (AWS, Azure, Google Cloud), follow us.