In the last entry of our distributed TensorFlow series, we used a research example for distributed training of an Inception model. In this post we’ll showcase how to do the same thing on GPU instances, this time on Azure managed Kubernetes - AKS - deployed with Pipeline.
As you may remember from our previous post, the first thing to consider when running distributed TensorFlow models is whether you have shared storage space available. On AWS we previously used EFS. We currently run something similar on Azure Cloud, AzureFiles, which is a fully managed File Share accessible via the industry standard Server Message Block (SMB) protocol (also known as the Common Internet File System, or CIFS). Azure File Shares can be mounted concurrently by cloud or on-premise deployments. Fortunately, AzureFiles has native support in Kubernetes, so you can dynamically provision AzureFiles via a Storage Class or bind an already existing File Share to a Persistent Volume.
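For reference, the dynamic route is simply a Storage Class backed by the azure-file provisioner; a minimal sketch, in which the class name, SKU and location are only example values, looks roughly like this:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
  location: eastus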
Since we want to keep the model’s training data for the long term, we’ll create a Storage Account and File Share beforehand, instead of provisioning them dynamically.
We’re going to use the same Inception example we did last time, though with a subtly optimized preparation script. That script first downloads and extracts a prepared set of images, the Flowers dataset, then separates the images into training and validation sets, and finally creates TFRecord files. Since these images are mostly small files (100 KByte at most), copying them through a shared storage system isn’t ideal. That’s why we run the first part of the preparation on local disk and only create the TFRecord files on shared storage. TFRecord files are typically larger, at least tens of MBytes, and there’s no significant overhead to reading them from AzureFile compared to AzureDisk; in other words, you can’t significantly boost overall training speed by placing these files on AzureDisk, while keeping them on AzureFile lets you run workers distributed across separate nodes.
The required steps are virtually identical to those taken on CPU instances; the main difference is the definition of the training job. If you compare training_gpu.yaml to training_cpu.yaml, here’s the difference between the workers’ job definitions:
tfReplicaType: WORKER
template:
  spec:
    containers:
      - image: banzaicloud/tensorflow-inception-example:v0.18-gpu
        ...
        volumeMounts:
          - name: bin
            mountPath: /usr/local/nvidia/bin
          - name: lib
            mountPath: /usr/local/nvidia/lib64
          - name: libcuda
            mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
    volumes:
      - name: bin
        hostPath:
          path: /usr/lib/nvidia-384/bin
      - name: lib
        hostPath:
          path: /usr/lib/nvidia-384
      - name: libcuda
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
As you’ve probably noticed, the GPU version runs a different image, requests GPU resources from Kubernetes, and mounts the NVIDIA driver folders from the host.
Ok, so now let’s examine the steps necessary to run our example.
Create the Azure cluster
Create a Kubernetes cluster on Azure with AKS, with an agent pool of two Standard_NC6 instances. This instance type has one GPU device available, powered by an NVIDIA Tesla K80 card.
You can accomplish this quickly and efficiently with Pipeline, using this example Create Azure cluster
request:
POST {{pipeline_url}}/api/v1/clusters
{
  "name": "azgputestcluster",
  "location": "eastus",
  "cloud": "azure",
  "nodeInstanceType": "Standard_NC6",
  "properties": {
    "azure": {
      "node": {
        "resourceGroup": "your_resource_group_name",
        "agentCount": 2,
        "agentName": "agentpool1",
        "kubernetesVersion": "1.8.2"
      }
    }
  }
}
You should retrieve the Kubernetes config for your cluster to configure the kubectl command, since we’ll use it for deployment in the next few steps. If you created your cluster with Pipeline, you can retrieve the config with a simple GET request:
GET {{pipeline_url}}/api/v1/clusters/{{cluster_id}}/config
Save the config to a file and point the KUBECONFIG environment variable at it.
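As a quick sketch, assuming you saved the response body to a file called cluster-config.yaml (the file name is arbitrary), that boils down to:

export KUBECONFIG=$PWD/cluster-config.yaml
kubectl get nodes   # quick sanity check that the config works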
Create a Storage Account & File Share on the Azure Portal
Create a general purpose Storage Account, then add a File Share by clicking on Files.
I’ve named the File Share tensorflowshared
. You can use any name you want, but make sure to remember it, as we will need it for the next step.
Go to the Access keys
tab, and save your access key.
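If you prefer the command line to the portal, the Azure CLI can do the same; the commands below are only a rough equivalent, with mytfstorage and your_resource_group_name as placeholders you’ll want to replace:

az storage account create --name mytfstorage --resource-group your_resource_group_name --location eastus --sku Standard_LRS
az storage account keys list --account-name mytfstorage --resource-group your_resource_group_name
az storage share create --name tensorflowshared --account-name mytfstorage --account-key <your_access_key>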
Check out our example code and Kubeflow
git clone https://github.com/google/kubeflow.git
git clone https://github.com/banzaicloud/tensorflow-models.git
git -C tensorflow-models checkout master-k8s-azure
You can find the K8S deployment files in the tensorflow-models/research/inception/k8s folder.
Bind storage as a Persistent Volume
First, encode your Storage Account name and Access key in base64, then set them in pvc_static.yaml.
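The encoding itself is a pair of one-liners:

echo -n 'your_storage_account_name' | base64
echo -n 'your_access_key' | base64

The values end up in the Kubernetes Secret that the azureFile volume in pvc_static.yaml references. The sketch below shows the usual wiring; the Secret name here is only illustrative, so double-check it against the file in the repo, while azurestorageaccountname and azurestorageaccountkey are the keys Kubernetes expects for azureFile volumes:

apiVersion: v1
kind: Secret
metadata:
  name: azure-files-secret   # illustrative name, check pvc_static.yaml
type: Opaque
data:
  azurestorageaccountname: <base64 encoded Storage Account name>
  azurestorageaccountkey: <base64 encoded Access key>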
kubectl create -f tensorflow-models/research/inception/k8s/pvc_static.yaml
Make sure the PV is correctly bound:
kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv001 5Gi RWX Retain Bound default/azure-files 9s
Run prepare script
kubectl create -f tensorflow-models/research/inception/k8s/prepare.yaml
Check that your File Share contains the image-data
folder with the train & validation files.
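You can also follow the preparation from kubectl itself; the pod name below is just illustrative, take the real one from the kubectl get pods output:

kubectl get pods
kubectl logs -f <prepare-pod-name>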
Deploy training job
We’ll use Kubeflow to deploy the training jobs, so you have to deploy the Kubeflow operator first:
kubectl apply -f kubeflow/components -R
kubectl create -f tensorflow-models/research/inception/k8s/training_gpu.yaml
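To verify that the job was picked up, you can list the custom resource and the pods it spawned; the resource is typically named tfjobs, though the exact CRD name can differ between Kubeflow releases, so adjust accordingly:

kubectl get tfjobs
kubectl get pods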
TensorBoard is reachable on an external IP, which you can obtain by listing the services:
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
...
inception-train-job-tensorboard-kujj LoadBalancer 10.0.162.221 13.82.187.171 80:30691/TCP 3h
...
Summary
Running Deep Learning training jobs in a distributed manner might not be a perfect fit for every use case by default; however, the advantage of using Kubernetes to deploy TensorFlow workloads is that you can freely combine different cluster environments using the same code and the same deployment. For example, you can run the deployment above on a single-node cluster with two GPUs (Standard_NC12) without any modification, and it will be faster than running the same deployment on two nodes with a single GPU each. So what are we gaining here? Well, running a job on a powerful multi-GPU instance usually costs a lot more than running multiple single- or dual-GPU instances. Here’s where Pipeline comes into play: integrated with Hollowtrees, it allows for ideal combinations of price and performance, so you can decide what best fits your needs.
If you’re interested in how Pipeline automates all the preceding steps and runs TensorFlow jobs on different cloud providers (AWS, Azure, Google Cloud), follow us.