6. Integration with Kubernetes
Preview Release
This is an early release of Kubernetes support for the IPU. As such, the software is subject to change without notice.
The Kubernetes IPU Operator for V-IPU is available on request from Graphcore support.
Kubernetes (K8s) is an open-source container orchestration and management system. Kubernetes Operators allow the Kubernetes API to be extended with custom objects, and implement the control logic for such custom objects. You can read more about the Operator pattern in the Kubernetes documentation.
The IPU Operator provides a framework to extend the Kubernetes API and to manage IPUs via custom resource definitions (CRDs) and custom controllers. It allows you to specify the number of IPUs required for your Kubernetes workload using annotations.
This chapter outlines the Operator components, installation steps and usage.
Note
Kubernetes uses the word Pod to refer to the smallest deployable units of computing that you can create and manage in Kubernetes. This is not to be confused with the Graphcore IPU-POD, which is a rack-based system of IPUs.
6.1. Components and design
The Operator contains the following components:
- The gc-proxy, which communicates with the V-IPU controller (vipu-server)
- The CRD and controller, which let you allocate IPUs directly from the Kubernetes cluster

The gc-proxy is responsible for:
- Managing the IPU resources by communicating with the V-IPU controller
- Running the REST API server that serves requests from the controller for partition creation and deletion

The CRD and custom controller extend the Kubernetes API and manage IPU resources on your behalf. They are responsible for:
- Watching for CRD events
- Creating worker and launcher Pods based on the CRD configuration
- Adding a finaliser to the custom resource to release the IPUs on deletion
- Setting hostNetwork and securityContext/privileged to true
- Setting the Pod dnsPolicy to ClusterFirstWithHostNet
- Providing webhook REST endpoints to validate the input CRD specification
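The net effect on each generated Pod is roughly the following. This is an illustrative sketch only; the Operator generates the actual spec, and names and images here are placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: example-ipujob-worker-0      # hypothetical name
spec:
  hostNetwork: true                  # required to reach the IPU-POD RDMA network
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: worker
    image: your-application:latest   # placeholder image
    securityContext:
      privileged: true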
6.2. Package contents
The software is delivered as a single tarball containing the following files and directories:
- The CRD specification: gc-ipu-operator-v1.0.3/CRDs/graphcore.ai_ipujobs.yaml
- Documentation for the Operator: gc-ipu-operator-v1.0.3/docs/
- The Helm Chart: gc-ipu-operator-v1.0.3/gc-ipu-operator-helm-chart-v1.0.0-alpha-5.tgz
- The Operator and gc-proxy images: gc-ipu-operator-v1.0.3/gc-operator-images.tar.gz
- Checksum for the Operator: gc-ipu-operator-v1.0.3/gc-operator.cksm
- Checksum for the Helm Chart: gc-ipu-operator-v1.0.3/gc-ipu-operator-helm-chart.cksm
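A typical first step is to unpack the tarball and load the container images somewhere your cluster can pull them from. The commands below are only a sketch: the tarball name, the loaded image names and the localhost:5000 registry are assumptions based on the default repositories in Table 6.1, so adjust them for your environment.
$ tar xzf <operator-tarball>.tar.gz && cd gc-ipu-operator-v1.0.3
$ docker load -i gc-operator-images.tar.gz          # loads the Operator and gc-proxy images
$ docker tag <loaded-controller-image> localhost:5000/controller:<tag>
$ docker push localhost:5000/controller:<tag>       # repeat for the gc-proxy image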
6.3. Deploying the software
6.3.1. Prerequisites
Before you can use IPUs from your Kubernetes workloads, you need to meet the following conditions:
Have access to one or more Graphcore IPU-POD systems
Have a compatible version of the V-IPU controller installed on your IPU-POD system
Create a Kubernetes cluster. At least one of the Kubernetes worker nodes in the cluster must be an IPU-POD head node. See Section 6.11, Known limitations for more information.
Have the kubectl and Helm (v3.0.0 or later) command-line tools installed on your machine.
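For example, you can confirm that both tools are available, and that Helm is v3 or later, with:
$ kubectl version --client
$ helm version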
6.3.2. Installation
Installing the CRDs
To install the CRDs, run the following command:
$ kubectl apply -f <dir>/CRDs/graphcore.ai_ipujobs.yaml
Installing the Operator
Extract the Helm Chart package from the delivery tarball and run the following command:
$ helm install <release-name> <path-to-chart-tar> <custom-parameters>
Where:
- <release-name> is the name you choose for this Helm installation
- <path-to-chart-tar> is the path to the downloaded Helm Chart tar file
- <custom-parameters> is where you customize the installation. You can either use multiple --set key=value arguments, or put your customization in a YAML file and use the --values your-values.yaml argument.
See Section 6.4, Configurations for more information.
For example, the following command deploys the software to the Kubernetes cluster with a minimal configuration:
$ cd ipu-proxy
$ helm install [RELEASE_NAME] . --set global.vipuControllers="pod1:8090:ipunode=pod1\,pod2:8090:ipunode=pod2" \
--set controller.image.repository=[controller-image] --set vipuProxy.image.repository=[proxy-image]
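The same customization can be supplied as a values file instead of multiple --set arguments. This is a sketch only: the keys come from Table 6.1 and the values are placeholders. Note that the commas that must be escaped with \, on the --set command line do not need escaping inside a YAML value:
# your-values.yaml (illustrative)
global:
  vipuControllers: "pod1:8090:ipunode=pod1,pod2:8090:ipunode=pod2"
controller:
  image:
    repository: localhost:5000/controller
vipuProxy:
  image:
    repository: localhost:5000/vipu-proxy

$ helm install [RELEASE_NAME] <path-to-chart-tar> --values your-values.yaml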
This command installs the following in the same namespace where the Helm release is installed:
- gc-proxy and gc-controller as Deployments
- A gc-proxy Service of type ClusterIP
- RBAC: a ServiceAccount and a ClusterRole to manage Pods and ConfigMaps
- A partitions tracker ConfigMap
- Configuration objects for the mutation and validation webhooks
You can read more about installing Helm in the Helm documentation.
You can see all the customization options in the README.md for the Helm Chart.
Multiple V-IPU controller support
The IPU Operator can communicate with multiple V-IPU controllers.
You can specify multiple V-IPU controllers during installation by setting the vipuControllers option on the helm install command line. For example:
--set vipuControllers="pod001:8090:ipunode=node1\,pod002:8091:ipunode=node2"
Alternatively, after installation you can edit the ConfigMap, as shown below, and update the value.
$ kubectl edit configmap gc-ipu-operator-vipu-controllers
Each V-IPU controller is specified with a colon-separated list of three values:
- V-IPU controller host address
- V-IPU controller port
- A label defined as key=value
The same label must be added to the node where the containers corresponding to that V-IPU controller will run. Labeling the node is done with the following command:
$ kubectl label nodes <someworkernode> <key>=<value>
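For example, to match the installation shown above (which uses the labels ipunode=node1 and ipunode=node2), you would label the corresponding worker nodes; the node names here are placeholders:
$ kubectl label nodes pod001-headnode ipunode=node1
$ kubectl label nodes pod002-headnode ipunode=node2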
The ConfigMap can be modified at any time and the IPU Operator automatically adds the new V-IPU controller to its internal list. It can take up to 60 seconds for the new V-IPU controller to be added. When a partition is created, the IPU Operator goes through the list serially until it finds space for the requested number of IPUs.
Verify the installation is successful
When the installation is complete, you can verify that it worked correctly by running the following commands and seeing similar output:
$ kubectl get crd
NAME CREATED AT
ipujobs.graphcore.ai 2021-03-02T12:20:04Z
...
$ helm ls -n <the-namespace-where-you-deployed-the-operator>
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
gc default 1 2021-03-18 11:50:31.35861 +0100 CET deployed gc-ipu-operator-helm-chart-v1.0.0 v1.0.0
$ kubectl get pods -n <the-namespace-where-you-deployed-the-operator>
NAME READY STATUS RESTARTS AGE
gc-ipu-operator-controller-manager-54766f7f7b-x5wtr 2/2 Running 0 5d23h
gc-ipu-operator-vipu-proxy-844c7d6b7f-88bqr 1/1 Running 1 5d23h
6.3.3. Uninstall
$ helm uninstall [RELEASE_NAME]
This removes all the Kubernetes components associated with the chart and deletes the release.
See helm uninstall for command documentation.
Note
The partition tracker ConfigMap ipu-partitions-tracker does not get deleted when you uninstall the Helm release. This is so that when the ipu-proxy is deployed again, it can pick up where it left off (in terms of managing the created partitions). If you wish to remove that ConfigMap, you can run:
$ kubectl delete configmap ipu-partitions-tracker -n <namespace>
6.3.4. Upgrading the Helm Chart
$ helm upgrade [RELEASE_NAME] [CHART]
See helm upgrade for command documentation.
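For example, to move the release installed above to a newer chart version while keeping the existing configuration (the release name gc matches the helm ls output above; the chart filename is a placeholder):
$ helm upgrade gc gc-ipu-operator-helm-chart-<new-version>.tgz --reuse-values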
6.4. Configurations
Table 6.1 lists the configurable parameters of the Helm Chart and their default values.
Parameter | Description | Default
---|---|---
nameOverride | Override the name of the chart in the generated chart resource names | ""
fullNameOverride | Override the fully qualified app name used in naming the generated chart resources. If this is not set, a default fully qualified app name is used. | ""
global.imagePullSecrets | A map of image pull secret names | []
global.kubectlVersion | Kubectl server version | []
global.launcherImagePullPolicy | Launcher image pull policy | "Never"
global.launcherImage | The container image used for each IPUJob launcher container | launcher:latest
global.vipuControllers | List of V-IPU controllers | []
admissionWebhooks.failurePolicy | The admission webhooks failure policy | "Fail"
admissionWebhooks.patch.image.pullPolicy | The admission webhooks patch image pull policy | "IfNotPresent"
admissionWebhooks.patch.image.repository | The admission webhooks patch image repository | "k8s.gcr.io/ingress-nginx/kube-webhook-certgen"
admissionWebhooks.patch.image.tag | The admission webhooks patch image tag | "v1.1.1"
admissionWebhooks.patch.nodeSelector | The Kubernetes node selector for the admission webhooks patch jobs | {}
admissionWebhooks.patch.podAnnotations | The Pod annotations for the admission webhooks patch jobs | {}
admissionWebhooks.patch.priorityClassName | The name of a priority class to use with the admission webhook patching job | ""
admissionWebhooks.patch.runAsUser | The user to use for the admission webhooks patch jobs | 2000
admissionWebhooks.patch.tolerations | The Kubernetes tolerations for the admission webhooks patch jobs. See Taints and Tolerations on the Kubernetes website. | []
admissionWebhooks.port | The port at which the admission webhook server is exposed in the controller container | 9443
admissionWebhooks.service.annotations | The admission webhooks service annotations | {}
admissionWebhooks.service.servicePort | The admission webhooks service port | 443
admissionWebhooks.service.type | The admission webhooks service type | "ClusterIP"
admissionWebhooks.timeoutSeconds | The admission webhooks timeout in seconds | 30
controller.affinity | Controller Kubernetes affinity. See Pod Affinity on the Kubernetes website. | {}
controller.develLogs | Specifies whether to enable or disable development logging mode | true
controller.dnsPolicy | Set the dnsPolicy to ClusterFirstWithHostNet if hostNetwork is true; otherwise set the dnsPolicy to ClusterFirst | "ClusterFirstWithHostNet"
controller.hostNetwork | Set the hostNetwork flag when running the controller | true
controller.image.pullPolicy | The controller image pull policy | "Always"
controller.image.repository | The controller image repository | "localhost:5000/controller"
controller.image.tag | Overrides the controller image tag; the default is the chart appVersion | ""
controller.nodeSelector | Controller Kubernetes node selector | {}
controller.podAnnotations | Controller Pod annotations | {}
controller.podSecurityContext | Controller Pod security policy | {"runAsUser":65532}
controller.rbac.create | Specifies whether to create the RBAC ClusterRole and ClusterRoleBinding and attach them to the service account | true
controller.resources.limits.cpu | The maximum CPU time for the controller, in Kubernetes CPU units | "500m"
controller.resources.limits.memory | The maximum memory for the controller | "512Mi"
controller.resources.requests.cpu | The requested CPU for the controller, in Kubernetes CPU units | "100m"
controller.resources.requests.memory | The requested memory for the controller | "200Mi"
controller.securityContext | Controller security context | {}
controller.service.port | The port for the controller service, used to set up kube-rbac-proxy for protecting the metrics endpoint | 8443
controller.service.type | The Kubernetes service type for the controller | "ClusterIP"
controller.serviceAccount.annotations | Annotations to add to the service account | {}
controller.serviceAccount.create | Specifies whether a service account should be created | true
controller.serviceAccount.name | The name of the service account to use. If not set and create is true, a name is generated using the fullname template. | ""
controller.tolerations | Controller Kubernetes tolerations. See Taints and Tolerations on the Kubernetes website. | []
vipuProxy.affinity | The V-IPU proxy Kubernetes affinity. See Pod Affinity | {}
vipuProxy.image.pullPolicy | The V-IPU proxy image pull policy | "Always"
vipuProxy.image.repository | The V-IPU proxy image repository | "localhost:5000/vipu-proxy"
vipuProxy.image.tag | Overrides the V-IPU proxy image tag; the default is the chart appVersion | ""
vipuProxy.logLevel | V-IPU proxy log level (min 1, max 6) | 2
vipuProxy.nodeSelector | The V-IPU proxy Kubernetes node selector | {}
vipuProxy.podAnnotations | The V-IPU proxy Pod annotations | {}
vipuProxy.podSecurityContext | The V-IPU proxy Pod security policy | {}
vipuProxy.proxyIdleTimeoutSeconds | V-IPU proxy idle timeout in seconds | 60
vipuProxy.proxyPartitionTrackerConfigMap | V-IPU proxy partition tracking ConfigMap name | "ipu-partitions-tracker"
vipuProxy.proxyPort | V-IPU proxy port | 8080
vipuProxy.proxyReadTimeoutSeconds | V-IPU proxy read timeout in seconds | 30
vipuProxy.proxyWriteTimeoutSeconds | V-IPU proxy write timeout in seconds | 300
vipuProxy.rbac.create | Specifies whether to create the RBAC ClusterRole and ClusterRoleBinding and attach them to the service account for the V-IPU proxy | true
vipuProxy.resources | The Kubernetes resource limits and requests for the V-IPU proxy | {}
vipuProxy.securityContext | The V-IPU proxy security context | {}
vipuProxy.service.port | The Kubernetes service port for the V-IPU proxy | 80
vipuProxy.service.type | The Kubernetes service type for the V-IPU proxy | "ClusterIP"
vipuProxy.serviceAccount.annotations | Annotations to add to the service account for the V-IPU proxy | {}
vipuProxy.serviceAccount.create | Specifies whether a service account should be created for the V-IPU proxy | true
vipuProxy.serviceAccount.name | The name of the service account to use for the V-IPU proxy. If not set and create is true, a name is generated using the fullname template. | ""
vipuProxy.tolerations | The V-IPU proxy Kubernetes tolerations. See Taints and Tolerations on the Kubernetes website. | []
6.5. Creating an IPUJob
Once the CRDs and the IPU Operator are installed, you can start submitting IPUJobs (MPI-based AI/ML jobs that use IPUs).
6.6. Training CRD Job
The following YAML file is an example of a declarative definition of an IPUJob for the ResNet-8 TensorFlow application:
apiVersion: graphcore.ai/v1alpha1 # the API that defined this API object type
kind: IPUJob # the kind of this Kubernetes object
metadata:
name: ipujob-sample # the name of the job
spec:
jobInstances: 1 # refers to the number of job instances. More than 1 job instance is usually useful for non-training jobs only.
ipusPerJobInstance: "1" # refers to the number of ipus required per job instance. A separate partition of this size will be created by the operator for each job instances
workerPerJobInstance: "1" # Number of K8s worker pods created with a separate GCD. Refers to the number of poplar instances
modelReplicasPerWorker: "1" # refers to the number of replicas within each WorkerPods
launcher:
command: # the command to trigger the job execution
- mpirun
- --allow-run-as-root
- --bind-to
- none
- -np
- "1"
- python3
- /public_examples/applications/tensorflow/cnns/training/train.py
- --dataset=cifar-10
- --synthetic-data
- --model-size=8
- --batch-size=1
- --batches-per-step=10
- --gradient-accumulation-count=10
- --no-validation
- --no-stochastic-rounding
- --iterations=20
workers:
replicas: 1 # how many workers (poplar instances) should participate in this execution
template: # native Kubernetes Pod template. https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
metadata:
labels:
app: resnet-launcher
spec:
containers: # the containers running inside each worker
- name: resnet
image: resnet:latest
env: # environment variables set on each worker
- name: "IPUOF_LOG_LEVEL"
value: "INFO"
- name: "POPLAR_LOG_LEVEL"
value: "INFO"
Download single-gcd-sample.yaml
Save the above specification file as single-gcd-sample.yaml, then run:
$ kubectl apply -f single-gcd-sample.yaml
ipujob.graphcore.ai/ipujob-sample created
Now you can inspect what happens in the cluster and you should see something similar to:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
gc-ipu-operator-controller-manager-6ff6b6875d-ncjgp 2/2 Running 0 3d22h
gc-ipu-operator-vipu-proxy-849dbf98df-rg8gh 1/1 Running 0 3d22h
ipujob-sample-launcher 1/1 Running 0 10s
ipujob-sample-worker-0 1/1 Running 0 25s
You can also list the IPUJobs in the cluster and see their status:
$ kubectl get ipujobs.graphcore.ai
NAME STATUS AGE
ipujob-sample Running 40s
And you can inspect more details about a specific IPUJob as follows:
$ kubectl describe ipujobs.graphcore.ai ipujob-sample
Name: ipujob-sample
Namespace: default
Labels: <none>
Annotations: <none>
API Version: graphcore.ai/v1alpha1
Kind: IPUJob
Metadata:
Creation Timestamp: 2021-03-22T10:10:31Z
Finalizers:
ipu.finalizers.graphcore.ai
Generation: 2
Manager: manager
Operation: Update
Time: 2021-03-22T10:10:45Z
Resource Version: 29226482
Self Link: /apis/graphcore.ai/v1alpha1/namespaces/default/ipujobs/ipujob-sample
UID: beb81bbe-2309-494a-9e28-2a75a704be15
Spec:
Clean Pod Policy: None
Ipus Per Model Replica: 1
Launcher:
Command:
mpirun
--allow-run-as-root
--bind-to
none
-np
1
python3
/public_examples/applications/tensorflow/cnns/training/train.py
--dataset=cifar-10
--synthetic-data
--model-size=8
--batch-size=1
--batches-per-step=10
--gradient-accumulation-count=10
--no-validation
--no-stochastic-rounding
--iterations=20
Model Replicas: 4
Restart Policy:
Back Off Limit: 3
Type: Never
Workers:
Replicas: 1
Template:
Metadata:
Spec:
Containers:
Env:
Name: IPUOF_LOG_LEVEL
Value: INFO
Name: POPLAR_LOG_LEVEL
Value: INFO
Image: artifactory-systems.eng.graphcore.ai/vipu-k8s-docker-dev-local/resnet-poplar-2.0:operator
Name: resnet
Resources:
Status:
Conditions:
Last Transition Time: 2021-03-22T10:10:31Z
Last Update Time: 2021-03-22T10:10:31Z
Message: IPUJob default/ipujob-sample is waiting for resources to be ready.
Reason: IPUJobPending
Status: False
Type: Pending
Last Transition Time: 2021-03-22T10:10:45Z
Last Update Time: 2021-03-22T10:10:45Z
Message: IPUJob default/ipujob-sample is running.
Reason: IPUJobRunning
Status: True
Type: Running
IPU Partition Created:   true
Launcher Status: Running
Restart Count: 0
Start Time: 2021-03-22T10:10:31Z
Workers Status:
Active: 1
6.6.1. Interactive mode
You can also run the IPUJob in interactive mode, where it does not execute anything by default:
apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
name: interactive-sample-job
spec:
jobInstances: 1 # refers to the number of job instances. More than 1 job instance is usually useful for non-training jobs only.
ipusPerJobInstance: "2" # refers to the number of ipus required per job instance. A separate partition of this size will be created by the operator for each job instances
interactive:
ttl: 3600 # how long should the interactive session last
workers:
replicas: 1
template:
metadata:
labels:
app: resnet-launcher
spec:
containers:
- name: resnet
image: resnet:latest
imagePullPolicy: Always
env:
- name: "IPUOF_LOG_LEVEL"
value: "INFO"
- name: "POPLAR_LOG_LEVEL"
value: "INFO"
Download interactive-job.yaml
Save the above specification as interactive-job.yaml, then run:
$ kubectl apply -f interactive-job.yaml
ipujob.graphcore.ai/interactive-sample-job created
Then you can get terminal access to the job’s launcher Kubernetes Pod:
$ kubectl exec -it interactive-sample-job-launcher -- bash
root@interactive-sample-job-launcher:/public_examples/applications/tensorflow/cnns/training# <run your mpi programs here>
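From that shell you can launch MPI programs manually. For example, you could run the same ResNet training command used in the declarative example above (the paths assume the same container image):
mpirun --allow-run-as-root --bind-to none -np 1 \
    python3 /public_examples/applications/tensorflow/cnns/training/train.py \
    --dataset=cifar-10 --synthetic-data --model-size=8 --batch-size=1 \
    --batches-per-step=10 --gradient-accumulation-count=10 \
    --no-validation --no-stochastic-rounding --iterations=20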
6.7. Inference job spec
The IPU Operator supports running inference jobs. The following is an example of an IPUJob spec that runs a Poplar image in a long-running sleep loop (30,000 seconds at a time).
apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
name: job-inference-1
namespace: default
spec:
jobInstances: 1
ipusPerJobInstance: "1"
cleanPodPolicy: "None"
workers:
template:
metadata:
labels:
app: inference-job
spec:
containers:
- name: resnet
image: artifactory-systems.eng.graphcore.ai/vipu-k8s-docker-dev-local/resnet-poplar-2.1:operator
imagePullPolicy: IfNotPresent
command: [ "/bin/bash", "-c", "--" ]
args: [ "while true; do sleep 30000; done;" ]
Save the above spec as inference.yaml, then run:
$ kubectl apply -f inference.yaml
ipujob.graphcore.ai/job-inference-1 created
Now, you can inspect what happens in the cluster and you should see something similar to:
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-7c98ff5746-jkntn 2/2 Running 0 3h29m
pod/gc-ipu-operator-vipu-proxy-867fcddfc4-79zmb 1/1 Running 0 3h29m
pod/job-inference-1-worker-0 1/1 Running 0 12s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Running 1 1 Successfully reconciled IPUJob 14s
Scale up/down operations
To scale up the number of worker instances to 2, run the following command:
$ kubectl scale ipujob.graphcore.ai/job-inference-1 --replicas 2
ipujob.graphcore.ai/job-inference-1 scaled
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 8m16s
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 8m16s
pod/job-inference-1-worker-0 1/1 Running 0 43s
pod/job-inference-1-worker-1 0/1 ContainerCreating 0 1s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Running 2 2 Successfully reconciled IPUJob 45s
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 8m20s
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 8m20s
pod/job-inference-1-worker-0 1/1 Running 0 47s
pod/job-inference-1-worker-1 1/1 Running 0 5s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Running 2 2 Successfully reconciled IPUJob 49s
To further scale to 4 job instances:
$ kubectl scale ipujob.graphcore.ai/job-inference-1 --replicas 4
ipujob.graphcore.ai/job-inference-1 scaled
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 10m
pod/job-inference-1-worker-0 1/1 Running 0 2m40s
pod/job-inference-1-worker-1 1/1 Running 0 118s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Pending 2 4 partition not ready yet!partition not ready yet! 2m42s
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 10m
pod/job-inference-1-worker-0 1/1 Running 0 2m42s
pod/job-inference-1-worker-1 1/1 Running 0 2m
pod/job-inference-1-worker-2 0/1 ContainerCreating 0 1s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Pending 3 4 partition not ready yet! 2m44s
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 10m
pod/job-inference-1-worker-0 1/1 Running 0 2m43s
pod/job-inference-1-worker-1 1/1 Running 0 2m1s
pod/job-inference-1-worker-2 1/1 Running 0 2s
pod/job-inference-1-worker-3 0/1 ContainerCreating 0 0s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Running 4 4 Successfully reconciled IPUJob 2m45s
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 10m
pod/job-inference-1-worker-0 1/1 Running 0 2m47s
pod/job-inference-1-worker-1 1/1 Running 0 2m5s
pod/job-inference-1-worker-2 1/1 Running 0 6s
pod/job-inference-1-worker-3 1/1 Running 0 4s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Running 4 4 Successfully reconciled IPUJob 2m49s
To scale down the number of job instances to 1, run the following command:
$ kubectl scale ipujob.graphcore.ai/job-inference-1 --replicas 1
ipujob.graphcore.ai/job-inference-1 scaled
$ kubectl get pods,ipujobs
NAME READY STATUS RESTARTS AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d 2/2 Running 0 13m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9 1/1 Running 0 13m
pod/job-inference-1-worker-0 1/1 Running 0 6m2s
NAME STATUS CURRENT DESIRED LASTMESSAGE AGE
ipujob.graphcore.ai/job-inference-1 Running 1 1 Successfully reconciled IPUJob 6m4s
6.7.1. Mounting data volumes for an IPUJob
Every IPUJob will require an input dataset and will possibly produce output files (for example, checkpoints and trained models). Kubernetes Pods are ephemeral by nature, which means that all files inside the containers are lost when the containers are removed. For data persistence, the IPU Operator relies on native Kubernetes volumes.
By specifying volumes
and volumeMounts
under the IPUJob workers’ Pod
template, all the IPUJob workers and their launcher will have the same volume(s)
mounted at the same path. This means that the workers and the launcher all see
the same file system at certain path(s). One thing to keep in mind, however, is
that you need to use a Persistent Volume type that supports multiple Read/Write
mounts. See the Kubernetes
documentation
for a list of volume types you can use.
Here is an example of the single-gcd-sample.yaml
we used above with volumes added to it:
# native Kubernetes volumes which uses NFS
apiVersion: v1
kind: PersistentVolume
metadata:
name: nfs-shared-storage
spec:
capacity:
storage: 5Gi
volumeMode: Filesystem
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Recycle
storageClassName: slow
mountOptions:
- hard
- nfsvers=4.1
nfs:
server: nfs-server.default.svc.cluster.local # this should be your NFS server endpoint
path: "/"
---
# Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: nfs-pvc
spec:
accessModes:
- ReadWriteMany
storageClassName: ""
resources:
requests:
storage: 5Gi
---
apiVersion: graphcore.ai/v1alpha1 # the API that defined this API object type
kind: IPUJob # the kind of this Kubernetes object
metadata:
name: ipujob-sample # the name of the job
spec:
jobInstances: 1 # refers to the number of job instances. More than 1 job instance is usually useful for non-training jobs only.
ipusPerJobInstance: "1" # refers to the number of ipus required per job instance. A separate partition of this size will be created by the operator for each job instances
launcher:
command: # the command to trigger the job execution
- mpirun
- --allow-run-as-root
- --bind-to
- none
- -np
- "1"
- python3
- /public_examples/applications/tensorflow/cnns/training/train.py
- --dataset=cifar-10
- --synthetic-data
- --model-size=8
- --batch-size=1
- --batches-per-step=10
- --gradient-accumulation-count=10
- --no-validation
- --no-stochastic-rounding
- --iterations=20
workers:
replicas: 1 # how many workers (poplar instances) should participate in this execution
template: # native Kubernetes Pod template. https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
metadata:
labels:
app: resnet-launcher
spec:
volumes: # we define here which volumes we want to use with the workers (the same is applied to the launcher too)
- name: mypvc
persistentVolumeClaim:
claimName: nfs-pvc # that is the persistent volume claim we created in the above object
containers: # the containers running inside each worker
- name: resnet
image: resnet:latest
env: # environment variables set on each worker
- name: "IPUOF_LOG_LEVEL"
value: "INFO"
- name: "POPLAR_LOG_LEVEL"
value: "INFO"
volumeMounts:
- name: mypvc # the name of the volume defined in the volumes section
mountPath: /mnt/sample # this is where we mount the volume into both workers and the launcher
---
Download single-gcd-sample-nfs.yaml
The above specification will create an NFS persistent volume (assuming you have an NFS server available), and a persistent volume claim requesting the same amount of storage as the persistent volume.
The IPUJob then mounts that NFS volume at /mnt/sample
in the job’s workers and launchers.
6.7.2. Automatic restarts
You may want your IPUJob to restart automatically in certain cases. Currently, we support four types of restart policy, which can be defined under the IPUJob spec (a minimal sketch follows the list below):
- Always: the job is always restarted when it finishes, regardless of success or failure
- OnFailure: the job is only restarted if it fails, regardless of why it failed
- Never: the job is never restarted when it finishes, regardless of success or failure
- ExitCode: the user provides exit codes themselves; the Operator checks these exit codes to determine the behaviour when an error occurs. For example:
  - 1-127: permanent error, do not restart
  - 128-255: retryable error, restart the job
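A minimal sketch of how this might look in an IPUJob spec, using the restartPolicy fields visible in the kubectl describe output earlier (Type and Back Off Limit); field names and semantics may differ slightly in your Operator version:
spec:
  restartPolicy:
    type: OnFailure    # Always | OnFailure | Never | ExitCode
    backOffLimit: 3    # assumed: maximum number of restarts before the job is marked failed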
6.7.3. Clean up Kubernetes resources and IPU partitions
Once the job is finished and is no longer going to be restarted, the operator can perform automatic cleanup to
free the Kubernetes resources that are no longer needed. This can be
defined in the cleanPodPolicy
under the IPUJob spec. The following values can be set:
- Workers: delete only the workers when the job is finished.
- All: delete all Pods (launcher and workers) and release IPU resources when the job is finished. Note that if cleanPodPolicy is set to All, it takes priority over any restartPolicy; the restartPolicy then acts as if it was set to Never.
- None: don’t delete any Pods when the job is finished.
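The policy is set directly under the IPUJob spec, as shown in the inference example above. A minimal sketch:
spec:
  cleanPodPolicy: "Workers"   # Workers | All | None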
6.8. Debugging problems
When something does not work as expected, you need to debug the problem to understand how to fix it.
Before we talk about how to debug a failed IPUJob, it is probably good to understand how the IPU Operator executes an IPUJob.
6.8.1. How does the IPU Operator work?
The IPU Operator consists of a few components:
Controller: this is the reconcile loop that makes sure the desired state (defined in the IPUJob specifications) matches the state of the world.
Admission webhooks: we have two webhooks:
Defaulting (Mutation) webhook: adds some default values to the submitted IPUJob specifications.
Validating webhook: validates the IPUJob specification.
V-IPU proxy: proxies IPU partition operations to the V-IPU controller and keeps track of the partitions created for jobs running inside the cluster.
When an IPUJob is created in the cluster, the IPU Operator gets notified and creates the following Kubernetes resources:
A ConfigMap to hold a couple of things:
- A kubeexec script, which is used by MPI to trigger remote execution with the worker Pods from the launcher Pod.
- A hostfile, which lists the worker Pods that MPI will use for remote execution.
A Kubernetes RBAC role, role-binding and service account for the launcher Pod which allows the launcher to list and watch worker Pods and exec into them.
A set of worker Pods which participate in the job execution. These Pods are placed into a 365-day sleep as their main process until the launcher triggers the job processes on them.
A launcher Pod which contains two components:
An init-container which runs a small application we provide with the IPU Operator. This program watches the worker Pods until they are all available and in the Ready state.
The program also creates the required IPU partition by interacting with the V-IPU proxy. If the V-IPU proxy sees that this partition already exists and is already owned by this IPUJob, it will reset the partition. This makes it possible to restart the job with a clean IPU partition.
The main container which uses the image provided by the user and runs the user-defined command to trigger the job execution.
The IPU Operator also sets environment variables on the worker Pods that allow Poplar to see and use the IPUs when running the AI/ML program.
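If you need to confirm that these variables are present, you can inspect a worker Pod's environment. The command below is only a sketch (the Pod name comes from the earlier example, and the exact variable names depend on the Poplar SDK version):
$ kubectl exec ipujob-sample-worker-0 -- env | grep -i ipu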
6.8.2. Debugging
There are a few places to look for debug info:
- The status updates for the IPUJob, which you can find by using kubectl to describe the job
- The launcher Pod logs, which can be found by running:
$ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>
- The controller logs, which can be found by running:
$ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>
- The V-IPU proxy logs, which can be found by running:
$ kubectl logs <vipu-proxy-Pod-name> -n <the-namespace-where-the-operator-was-deployed>
6.9. IPU usage statistics
The V-IPU proxy in the IPU Operator keeps track of the used IPU partitions by IPUJobs running inside the cluster. The data is stored in a ConfigMap and it links the IPUJob with the partition it is using. On top of that tracker ConfigMap, the V-IPU proxy exposes a couple of read-only REST endpoints that you can utilize.
By default, these endpoints are only exposed within the cluster, so we can use any container with curl to query them, using the following commands:
/stats

# from inside a Kubernetes Pod that has curl
# gc-ipu-operator-vipu-proxy is the Kubernetes service name for the V-IPU proxy. This name can be different in your installation
$ curl gc-ipu-operator-vipu-proxy/stats | jq .
{
  "default": {          # the default namespace
    "used": 4,
    "available": 28
  },
  "total": {
    "used": 4,
    "available": 28
  }
}

/query

# from inside a Kubernetes Pod that has curl
# gc-ipu-operator-vipu-proxy is the Kubernetes service name for the V-IPU proxy. This name can be different in your installation
$ curl --request POST -H "Content-Type: application/json" --data '{"size":2}' gc-ipu-operator-vipu-proxy/query | jq .
{
  "available": true,
  "numOfPartitions": 14,    # 14 possible partitions are available
  "message": ""
}
$ curl --request POST -H "Content-Type: application/json" --data '{"size":64}' gc-ipu-operator-vipu-proxy/query | jq .
{
  "available": true,
  "numOfPartitions": 0,     # no partition of the requested size is available
  "message": ""
}
6.10. Operator Metrics
The IPU Operator exposes a set of Prometheus metrics that you can use. However, these metrics are exposed behind a protected endpoint. The IPU Operator creates a ClusterRole that grants the permissions to scrape the metrics. To allow your Prometheus server to scrape those metrics, you need to bind that ClusterRole to the service account that the Prometheus server uses.
# find the ClusterRole you must use. Note that the name can be different in your installation.
$ kubectl get clusterrole -l component=metrics-reader
NAME CREATED AT
gc-metrics-reader 2021-03-24T11:13:07Z
# create a ClusterRoleBinding. The service account must be the one that the Prometheus Server uses.
$ kubectl create clusterrolebinding metrics --clusterrole=gc-metrics-reader --serviceaccount=<namespace>:<service-account-name>
6.11. Known limitations
There are currently a few limitations:
- IPUs can only be accessed from within the IPU-POD network by default. Therefore, IPUJob Pods must run on a Kubernetes node that can access the IPUs, which means that at least one of the IPU-POD head nodes has to be a Kubernetes worker node.
- In order to access the RDMA network interface on the head node, the IPUJob Pods run on the host network and in privileged mode.
- For parallel IPUJobs (jobs with more than one worker Pod), you must specify the network interface to be used for MPI communication using the mpirun --mca btl_tcp_if_include option (see the sketch after this list).
- IPU partitions larger than 64 IPUs are currently not supported.
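A minimal sketch of how the launcher command from the training example might pass that option; the interface name (eth0) is a placeholder for whichever interface reaches the other worker Pods:
launcher:
  command:
    - mpirun
    - --allow-run-as-root
    - --mca
    - btl_tcp_if_include
    - eth0              # placeholder: the interface used for MPI traffic
    - --bind-to
    - none
    - -np
    - "2"
    - python3
    - /public_examples/applications/tensorflow/cnns/training/train.py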