6. Integration with Kubernetes

Preview Release

This is an early release of Kubernetes support for the IPU. As such, the software is subject to change without notice.

The Kubernetes IPU Operator for V-IPU is available on request from Graphcore support.

Kubernetes (K8s) is an open-source container orchestration and management system. Kubernetes Operators allow the Kubernetes API to be extended with custom objects, and implement the control logic for such custom objects. You can read more about the Operator pattern in the Kubernetes documentation.

The IPU Operator provides a framework to extend the Kubernetes API and to manage IPUs via custom resource definitions (CRDs) and custom controllers. It allows you to specify the number of IPUs required for your Kubernetes workload using annotations.

This chapter outlines the Operator components, installation steps and usage.

Note

Kubernetes uses the word Pod to refer to the smallest deployable units of computing that you can create and manage in Kubernetes. This is not to be confused with the Graphcore IPU-POD, which is a rack-based system of IPUs.

6.1. Components and design

The Operator contains the following components:

  • The gc-proxy that communicates with the V-IPU controller (vipu-server)

  • The CRD and controller that let you allocate IPUs directly from the Kubernetes cluster.

The gc-proxy is responsible for:

  • Managing the IPU resources by communicating with the V-IPU controller

  • Running the REST API server to serve requests from the controller for partition creation and deletion

The CRD and custom controller extend the Kubernetes API and manage IPU resources on your behalf. They are responsible for:

  • Watching for CRD events

  • Creating worker and launcher Pods based on the CRD configuration. This includes:

    • Adding a finaliser to the custom resource to release the IPUs on deletion

    • Setting hostNetwork and securityContext/privileged to true

    • Setting the Pod dnsPolicy to ClusterFirstWithHostNet

  • Providing webhook REST endpoints to validate the input CRD specification

6.2. Package contents

The software is delivered as a single tarball containing the following files and directories:

The CRD specification:

gc-ipu-operator-v1.0.3/CRDs/graphcore.ai_ipujobs.yaml

Documentation for the Operator:

gc-ipu-operator-v1.0.3/docs/

The Helm Chart:

gc-ipu-operator-v1.0.3/gc-ipu-operator-helm-chart-v1.0.0-alpha-5.tgz

The Operator and gc-proxy images:

gc-ipu-operator-v1.0.3/gc-operator-images.tar.gz

Checksum for the Operator:

gc-ipu-operator-v1.0.3/gc-operator.cksm

Checksum for the Helm Chart:

gc-ipu-operator-v1.0.3/gc-ipu-operator-helm-chart.cksm

6.3. Deploying the software

6.3.1. Prerequisites

Before you can use IPUs from your Kubernetes workloads, you need to meet the following conditions:

  • Have access to one or more Graphcore IPU-POD systems

  • Have a compatible version of the V-IPU controller installed on your IPU-POD system

  • Create a Kubernetes cluster. At least one of the Kubernetes worker nodes in the cluster must run on the head node of the IPU-POD. See Section 6.11, Known limitations for more information.

  • Have the kubectl and Helm (v3.0.0 or later) command-line tools installed on your machine.
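
You can check that the command-line tools are available and recent enough with their standard version commands, for example:

$ kubectl version
$ helm version --short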

6.3.2. Installation

Installing the CRDs

To install the CRDs, run the following command:

$ kubectl apply -f <dir>/CRDs/graphcore.ai_ipujobs.yaml

Installing the Operator

Unzip the Helm package and run the following command:

$ helm install <release-name> <path-to-chart-tar> <custom-parameters>

Where:

  • <release-name> is the name you choose for this Helm installation

  • <path-to-chart-tar> is the path to the downloaded Helm Chart tar file

  • <custom-parameters> is where you customize the installation.

    You can either use multiple --set key=value arguments, or put your customization in a YAML file and use the --values your-values.yaml argument.

See Section 6.4, Configurations for more information.
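
For instance, instead of multiple --set arguments you can collect the same overrides in a values file and pass it with --values. The following is a sketch only: the parameter names come from Table 6.1, and the controller address and image repositories are placeholders you must replace with your own.

# my-values.yaml (example only)
global:
  vipuControllers: "pod1:8090:ipunode=pod1"    # host:port:label of your V-IPU controller
controller:
  image:
    repository: localhost:5000/controller      # registry/repository of the controller image
vipuProxy:
  image:
    repository: localhost:5000/vipu-proxy      # registry/repository of the gc-proxy image

$ helm install <release-name> <path-to-chart-tar> --values my-values.yaml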

For example, the following command deploys the software to the Kubernetes cluster in the default configuration.

$ cd ipu-proxy
$ helm install [RELEASE_NAME] . --set global.vipuControllers="pod1:8090:ipunode=pod1\,pod2:8090:ipunode=pod2" \
  --set controller.image.repository=[controller-image] --set vipuProxy.image.repository=[proxy-image]

This command installs the following in the same namespace where the Helm release is installed:

  • gc-proxy and gc-controller as deployments

  • gc-proxy service of type ClusterIP

  • RBAC: ServiceAccount, ClusterRole to manage Pods and ConfigMaps

  • A Partitions tracker ConfigMap

  • Configuration objects for the mutation and validation webhooks

You can read more about installing Helm in the Kubernetes documentation.

You can see all the customization options in the README.md for the Helm Charts.

Multiple V-IPU controller support

The IPU Operator can communicate with multiple V-IPU controllers.

You can specify multiple V-IPU controllers during installation by setting the vipuControllers option on the helm install command line. For example:

--set vipuControllers="pod001:8090:ipunode=node1\,pod002:8091:ipunode=node2"

Alternatively, after installation you can edit the ConfigMap, as shown below, and update the value.

$ kubectl edit configmap gc-ipu-operator-vipu-controllers

Each V-IPU controller is specified with a colon-separated list of three values:

  • V-IPU controller host address

  • V-IPU controller port

  • A label defined by key=value.

The same label must be added to the node where the containers corresponding to that V-IPU controller will run. Labeling the node is done with the following command:

$ kubectl label nodes <someworkernode> <key>=<value>

The ConfigMap can be modified at any time and the IPU Operator automatically adds the new V-IPU controller to its internal list. It can take up to 60 seconds for the new V-IPU controller to be added. When a partition is created, the IPU Operator goes through the list serially until it finds space for the requested number of IPUs.
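
To check which V-IPU controllers the Operator is currently configured with, you can view the ConfigMap contents (the exact ConfigMap name may include your Helm release name, as in the edit command above):

$ kubectl get configmap gc-ipu-operator-vipu-controllers -o yaml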

Verify the installation is successful

When the installation is complete, you can verify that it worked correctly by running the following commands and seeing similar output:

$ kubectl get crd
NAME                   CREATED AT
ipujobs.graphcore.ai   2021-03-02T12:20:04Z
...

$ helm ls -n <the-namespace-where-you-deployed-the-operator>
NAME  NAMESPACE  REVISION  UPDATED                              STATUS    CHART                              APP VERSION
gc    default    1         2021-03-18 11:50:31.35861 +0100 CET  deployed  gc-ipu-operator-helm-chart-v1.0.0  v1.0.0


$ kubectl get pods -n <the-namespace-where-you-deployed-the-operator>
NAME                                                  READY   STATUS    RESTARTS   AGE
gc-ipu-operator-controller-manager-54766f7f7b-x5wtr   2/2     Running     0          5d23h
gc-ipu-operator-vipu-proxy-844c7d6b7f-88bqr           1/1     Running     1          5d23h

6.3.3. Uninstall

$ helm uninstall [RELEASE_NAME]

This removes all the Kubernetes components associated with the chart and deletes the release.

See helm uninstall for command documentation.

Note

The partition tracker ConfigMap ipu-partitions-tracker does not get deleted when you uninstall the Helm release. This is so that when the ipu-proxy is deployed again, it can pick up from where it was uninstalled before (in terms of managing the created partitions). If you wish to remove that ConfigMap, you can run: kubectl delete configmap ipu-partitions-tracker -n <namespace>

6.3.4. Upgrading the Helm Chart

$ helm upgrade [RELEASE_NAME] [CHART]

See helm upgrade for command documentation.
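
For example, to upgrade the release shown in the helm ls output above to a newer chart while keeping the values set at install time, you could run (the chart path is a placeholder):

$ helm upgrade gc <path-to-new-chart-tar> --reuse-values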

6.4. Configurations

Table 6.1 lists the configurable parameters of the Helm Chart and their default values.

Table 6.1 Helm Chart parameters

Parameter

Description

Default

nameOverride

Override the name of the chart in the generated Chart resource names

“”

fullNameOverride

Override the fully qualified app name which is used in naming the generated chart resources. If this is not set, the fully qualified app name is defaulted to: <helm-release-name>-<either Chart name or nameOverride if it is set>

“”

global.imagePullSecrets

A list of image pull secret names (for example, name: "test-secret")

[]

global.kubectlVersion

Kubectl server version

[]

global.launcherImagePullPolicy

Launcher image pull policy

“Never”

global.launcherImage

The container image used for each IPUJob launcher container

launcher:latest

global.vipuControllers

List of V-IPU controllers

[]

admissionWebhooks.failurePolicy

The admission webhooks failure policy.

“Fail”

admissionWebhooks.patch.image.pullPolicy

The admission webhooks patch image pull policy.

“IfNotPresent”

admissionWebhooks.patch.image.repository

The admission webhooks patch image repository.

“k8s.gcr.io/ingress-nginx/kube-webhook-certgen”

admissionWebhooks.patch.image.tag

The admission webhooks patch image tag.

“v1.1.1”

admissionWebhooks.patch.nodeSelector

The Kubernetes node selector for the admission webhooks patch jobs.

{}

admissionWebhooks.patch.podAnnotations

The pod annotations for the admission webhooks patch jobs.

{}

admissionWebhooks.patch.priorityClassName

The name of a priority class to use with the admission webhook patching job.

“”

admissionWebhooks.patch.runAsUser

The User to use for the admission webhooks patch jobs.

2000

admissionWebhooks.patch.tolerations

The Kubernetes tolerations for the admission webhooks patch jobs. See Taints and Tolerations on the Kubernetes website.

[]

admissionWebhooks.port

The port at which the admission webhook server is exposed in the controller container.

9443

admissionWebhooks.service.annotations

The admission webhooks service annotations.

{}

admissionWebhooks.service.servicePort

The admission webhooks service port.

443

admissionWebhooks.service.type

The admission webhooks service type.

“ClusterIP”

admissionWebhooks.timeoutSeconds

The admission webhooks timeout in seconds.

30

controller.affinity

Controller Kubernetes affinity. See Pod Affinity on the Kubernetes website.

{}

controller.develLogs

Specifies whether to enable or disable development logging mode.

true

controller.dnsPolicy

Set the dnsPolicy to ClusterFirstWithHostNet if hostNetwork is true – otherwise set the dnsPolicy to ClusterFirst

“ClusterFirstWithHostNet”

controller.hostNetwork

Set the hostnetwork flag when running the controller

true

controller.image.pullPolicy

The Controller image pull policy

“Always”

controller.image.repository

The Controller image repository

“localhost:5000/controller”

controller.image.tag

Overrides the Controller image tag whose default is the chart appVersion.

“”

controller.nodeSelector

Controller Kubernetes node selector.

{}

controller.podAnnotations

Controller pod annotations.

{}

controller.podSecurityContext

Controller pod security policy.

{"runAsUser": 65532}

controller.rbac.create

Specifies whether to create rbac clusterrole and clusterrolebinding and attach them to the service account.

true

controller.resources.limits.cpu

The max limit for CPU time for the controller, in Kubernetes CPU units.

“500m”

controller.resources.limits.memory

The max limit for memory for the controller.

“512Mi”

controller.resources.requests.cpu

The requested CPU for the controller, in Kubernetes CPU units.

“100m”

controller.resources.requests.memory

The requested memory for the controller.

“200Mi”

controller.securityContext

Controller security context.

{}

controller.service.port

The port for the controller service, used to set up kube-rbac-proxy for protecting the metrics endpoint

8443

controller.service.type

The Kubernetes service type for the controller.

“ClusterIP”

controller.serviceAccount.annotations

Annotations to add to the service account.

{}

controller.serviceAccount.create

Specifies whether a service account should be created.

true

controller.serviceAccount.name

The name of the service account to use. If not set and create is true, a name is generated using the fullname template.

“”

controller.tolerations

Controller Kubernetes tolerations. See Taints and Tolerations on the Kubernetes website.

[]

vipuProxy.affinity

The V-IPU proxy Kubernetes affinity. See Pod Affinity

{}

vipuProxy.image.pullPolicy

The V-IPU proxy image pull policy.

“Always”

vipuProxy.image.repository

The V-IPU proxy image repository.

“localhost:5000/vipu-proxy”

vipuProxy.image.tag

Overrides the V-IPU proxy image tag, whose default is the chart appVersion.

“”

vipuProxy.logLevel

V-IPU proxy log level (min 1, max 6).

2

vipuProxy.nodeSelector

The V-IPU proxy Kubernetes node selector.

{}

vipuProxy.podAnnotations

The V-IPU proxy pod annotations.

{}

vipuProxy.podSecurityContext

The V-IPU proxy pod security policy.

{}

vipuProxy.proxyIdleTimeoutSeconds

V-IPU proxy idle timeout seconds.

60

vipuProxy.proxyPartitionTrackerConfigMap

V-IPU proxy partition tracking configmap name.

“ipu-partitions-tracker”

vipuProxy.proxyPort

V-IPU proxy port.

8080

vipuProxy.proxyReadTimeoutSeconds

V-IPU proxy read timeout in seconds.

30

vipuProxy.proxyWriteTimeoutSeconds

V-IPU proxy write timeout in seconds.

300

vipuProxy.rbac.create

Specifies whether to create rbac clusterrole and clusterrolebinding and attach them to the service account for V-IPU proxy.

true

vipuProxy.resources

The Kubernetes resources limits and requirements for the V-IPU proxy.

{}

vipuProxy.securityContext

The V-IPU proxy security context.

{}

vipuProxy.service.port

The Kubernetes service port for V-IPU proxy.

80

vipuProxy.service.type

The Kubernetes service type for V-IPU proxy.

“ClusterIP”

vipuProxy.serviceAccount.annotations

Annotations to add to the service account for V-IPU proxy.

{}

vipuProxy.serviceAccount.create

Specifies whether a service account should be created for V-IPU proxy.

true

vipuProxy.serviceAccount.name

The name of the service account to use for V-IPU proxy. If not set and create is true, a name is generated using the fullname template.

“”

vipuProxy.tolerations

The V-IPU proxy Kubernetes tolerations. See Taints and Tolerations on the Kubernetes website.

[]

6.5. Creating an IPUJob

Once the CRDs and the IPU Operator are installed, you can start submitting IPUJobs (MPI-based AI/ML jobs that use IPUs).

6.6. Training CRD Job

The following YAML file is an example of a declarative definition of an IPUJob for the ResNet-8 TensorFlow application:

apiVersion: graphcore.ai/v1alpha1 # the API that defined this API object type
kind: IPUJob # the kind of this Kubernetes object
metadata:
  name: ipujob-sample # the name of the job
spec:
  jobInstances: 1 # refers to the number of job instances. More than 1 job instance is usually useful for non-training jobs only.
  ipusPerJobInstance: "1"  # refers to the number of ipus required per job instance. A separate partition of this size will be created by the operator for each job instances
  workerPerJobInstance: "1" # Number of K8s worker pods created with a separate GCD. Refers to the number of poplar instances
  modelReplicasPerWorker: "1" # refers to the number of replicas within each WorkerPods
  launcher:
    command: # the command to trigger the job execution
      - mpirun
      - --allow-run-as-root
      - --bind-to
      - none
      - -np
      - "1"
      - python3
      - /public_examples/applications/tensorflow/cnns/training/train.py
      - --dataset=cifar-10
      - --synthetic-data
      - --model-size=8
      - --batch-size=1
      - --batches-per-step=10
      - --gradient-accumulation-count=10
      - --no-validation
      - --no-stochastic-rounding
      - --iterations=20
  workers:
    replicas: 1 # how many workers (poplar instances) should participate in this execution
    template: # native Kubernetes Pod template. https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
      metadata:
        labels:
          app: resnet-launcher
      spec:
        containers: # the containers running inside each worker
        - name: resnet
          image: resnet:latest
          env: # environment variables set on each worker
          - name: "IPUOF_LOG_LEVEL"
            value: "INFO"
          - name: "POPLAR_LOG_LEVEL"
            value: "INFO"

Download single-gcd-sample.yaml

Save the above specification file as single-gcd-sample.yaml then run:

$ kubectl apply -f single-gcd-sample.yaml
ipujob.graphcore.ai/ipujob-sample created

Now you can inspect what happens in the cluster and you should see something similar to:

$ kubectl get pods
NAME                                                  READY   STATUS    RESTARTS   AGE
gc-ipu-operator-controller-manager-6ff6b6875d-ncjgp   2/2     Running   0          3d22h
gc-ipu-operator-vipu-proxy-849dbf98df-rg8gh           1/1     Running   0          3d22h
ipujob-sample-launcher                                1/1     Running   0          10s
ipujob-sample-worker-0                                1/1     Running   0          25s

You can also list the IPUJobs in the cluster and see their status:

$ kubectl get  ipujobs.graphcore.ai
NAME            STATUS    AGE
ipujob-sample   Running   40s

And you can inspect more details about a specific IPUJob as follows:

$ kubectl describe ipujobs.graphcore.ai ipujob-sample
Name:         ipujob-sample
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  graphcore.ai/v1alpha1
Kind:         IPUJob
Metadata:
  Creation Timestamp:  2021-03-22T10:10:31Z
  Finalizers:
    ipu.finalizers.graphcore.ai
  Generation:  2
    Manager:         manager
    Operation:       Update
    Time:            2021-03-22T10:10:45Z
  Resource Version:  29226482
  Self Link:         /apis/graphcore.ai/v1alpha1/namespaces/default/ipujobs/ipujob-sample
  UID:               beb81bbe-2309-494a-9e28-2a75a704be15
Spec:
  Clean Pod Policy:        None
  Ipus Per Model Replica:  1
  Launcher:
    Command:
      mpirun
      --allow-run-as-root
      --bind-to
      none
      -np
      1
      python3
      /public_examples/applications/tensorflow/cnns/training/train.py
      --dataset=cifar-10
      --synthetic-data
      --model-size=8
      --batch-size=1
      --batches-per-step=10
      --gradient-accumulation-count=10
      --no-validation
      --no-stochastic-rounding
      --iterations=20
  Model Replicas:  4
  Restart Policy:
    Back Off Limit:  3
    Type:            Never
  Workers:
    Replicas:  1
    Template:
      Metadata:
      Spec:
        Containers:
          Env:
            Name:             IPUOF_LOG_LEVEL
            Value:            INFO
            Name:             POPLAR_LOG_LEVEL
            Value:            INFO
          Image:              artifactory-systems.eng.graphcore.ai/vipu-k8s-docker-dev-local/resnet-poplar-2.0:operator
          Name:               resnet
          Resources:
Status:
  Conditions:
    Last Transition Time:  2021-03-22T10:10:31Z
    Last Update Time:      2021-03-22T10:10:31Z
    Message:               IPUJob default/ipujob-sample is waiting for resources to be ready.
    Reason:                IPUJobPending
    Status:                False
    Type:                  Pending
    Last Transition Time:  2021-03-22T10:10:45Z
    Last Update Time:      2021-03-22T10:10:45Z
    Message:               IPUJob default/ipujob-sample is running.
    Reason:                IPUJobRunning
    Status:                True
    Type:                  Running
  IPU Partition Created:  true
  Launcher Status:         Running
  Restart Count:           0
  Start Time:              2021-03-22T10:10:31Z
  Workers Status:
    Active:  1

6.6.1. Interactive mode

You can also run the IPUJob in interactive mode, where it does not execute anything by default:

apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  name: interactive-sample-job
spec:
  jobInstances: 1 # refers to the number of job instances. More than 1 job instance is usually useful for non-training jobs only.
  ipusPerJobInstance: "2"  # refers to the number of ipus required per job instance. A separate partition of this size will be created by the operator for each job instances
  interactive:
    ttl: 3600 # how long should the interactive session last
  workers:
    replicas: 1
    template:
      metadata:
        labels:
          app: resnet-launcher
      spec:
        containers:
        - name: resnet
          image: resnet:latest
          imagePullPolicy: Always
          env:
          - name: "IPUOF_LOG_LEVEL"
            value: "INFO"
          - name: "POPLAR_LOG_LEVEL"
            value: "INFO"

Download interactive-job.yaml

Save the above specification as interactive-job.yaml then run:

$ kubectl apply -f interactive-job.yaml
ipujob.graphcore.ai/interactive-sample-job created

Then you can have a terminal access to the job’s launcher Kubernetes Pod:

$ kubectl exec -it interactive-sample-job-launcher -- bash
root@interactive-sample-job-launcher:/public_examples/applications/tensorflow/cnns/training# <run your mpi programs here>

6.7. Inference job spec

The IPU Operator supports running inference jobs. The following is an example of an IPUJob spec that runs a Poplar image for 30,000 seconds.

apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  name: job-inference-1
  namespace: default
spec:
  jobInstances: 1
  ipusPerJobInstance: "1"
  cleanPodPolicy: "None"
  workers:
    template:
      metadata:
        labels:
          app: inference-job
      spec:
        containers:
        - name: resnet
          image: artifactory-systems.eng.graphcore.ai/vipu-k8s-docker-dev-local/resnet-poplar-2.1:operator
          imagePullPolicy: IfNotPresent
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 30000; done;" ]

Save the above spec as inference.yaml, then run:

$ kubectl apply -f inference.yaml
ipujob.graphcore.ai/job-inference-1 created

Now, you can inspect what happens in the cluster and you should see something similar to:

$ kubectl get pods,ipujobs
NAME                                                      READY   STATUS    RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-7c98ff5746-jkntn   2/2     Running   0          3h29m
pod/gc-ipu-operator-vipu-proxy-867fcddfc4-79zmb           1/1     Running   0          3h29m
pod/job-inference-1-worker-0                              1/1     Running   0          12s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                      AGE
ipujob.graphcore.ai/job-inference-1   Running   1         1         Successfully reconciled IPUJob   14s

Scale up/down operations

To scale up the number of job instances to 2, run the following command:

$ kubectl scale ipujob.graphcore.ai/job-inference-1 --replicas 2
ipujob.graphcore.ai/job-inference-1 scaled

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS              RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running             0          8m16s
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running             0          8m16s
pod/job-inference-1-worker-0                             1/1     Running             0          43s
pod/job-inference-1-worker-1                             0/1     ContainerCreating   0          1s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                      AGE
ipujob.graphcore.ai/job-inference-1   Running   2         2         Successfully reconciled IPUJob   45s

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running   0          8m20s
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running   0          8m20s
pod/job-inference-1-worker-0                             1/1     Running   0          47s
pod/job-inference-1-worker-1                             1/1     Running   0          5s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                      AGE
ipujob.graphcore.ai/job-inference-1   Running   2         2         Successfully reconciled IPUJob   49s

To further scale to 4 job instances:

$ kubectl scale ipujob.graphcore.ai/job-inference-1 --replicas 4
ipujob.graphcore.ai/job-inference-1 scaled

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running   0          10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running   0          10m
pod/job-inference-1-worker-0                             1/1     Running   0          2m40s
pod/job-inference-1-worker-1                             1/1     Running   0          118s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                                        AGE
ipujob.graphcore.ai/job-inference-1   Pending   2         4         partition not ready yet!partition not ready yet!   2m42s

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS              RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running             0          10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running             0          10m
pod/job-inference-1-worker-0                             1/1     Running             0          2m42s
pod/job-inference-1-worker-1                             1/1     Running             0          2m
pod/job-inference-1-worker-2                             0/1     ContainerCreating   0          1s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                AGE
ipujob.graphcore.ai/job-inference-1   Pending   3         4         partition not ready yet!   2m44s

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS              RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running             0          10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running             0          10m
pod/job-inference-1-worker-0                             1/1     Running             0          2m43s
pod/job-inference-1-worker-1                             1/1     Running             0          2m1s
pod/job-inference-1-worker-2                             1/1     Running             0          2s
pod/job-inference-1-worker-3                             0/1     ContainerCreating   0          0s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                      AGE
ipujob.graphcore.ai/job-inference-1   Running   4         4         Successfully reconciled IPUJob   2m45s

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running   0          10m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running   0          10m
pod/job-inference-1-worker-0                             1/1     Running   0          2m47s
pod/job-inference-1-worker-1                             1/1     Running   0          2m5s
pod/job-inference-1-worker-2                             1/1     Running   0          6s
pod/job-inference-1-worker-3                             1/1     Running   0          4s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                      AGE
ipujob.graphcore.ai/job-inference-1   Running   4         4         Successfully reconciled IPUJob   2m49s

Scale down the number of job instances to 1, by running the following command:

$ kubectl scale ipujob.graphcore.ai/job-inference-1 --replicas 1
ipujob.graphcore.ai/job-inference-1 scaled

$ kubectl get pods,ipujobs
NAME                                                     READY   STATUS    RESTARTS   AGE
pod/gc-ipu-operator-controller-manager-75f6b68db-bp74d   2/2     Running   0          13m
pod/gc-ipu-operator-vipu-proxy-6c46fd5df4-sv5k9          1/1     Running   0          13m
pod/job-inference-1-worker-0                             1/1     Running   0          6m2s

NAME                                  STATUS    CURRENT   DESIRED   LASTMESSAGE                      AGE
ipujob.graphcore.ai/job-inference-1   Running   1         1         Successfully reconciled IPUJob   6m4s

6.7.1. Mounting data volumes for an IPUJob

Every IPUJob will require an input dataset and will possibly produce output files (for example, checkpoints and trained models). The Kubernetes Pods are ephemeral by nature. This means that all files inside the containers are lost when the containers are removed. For data persistence, the IPU Operator relies on using the native Kubernetes volumes.

By specifying volumes and volumeMounts under the IPUJob workers’ Pod template, all the IPUJob workers and their launcher will have the same volume(s) mounted at the same path. This means that the workers and the launcher all see the same file system at certain path(s). One thing to keep in mind, however, is that you need to use a Persistent Volume type that supports multiple Read/Write mounts. See the Kubernetes documentation for a list of volume types you can use.

Here is an example of the single-gcd-sample.yaml we used above with volumes added to it:

# a native Kubernetes volume which uses NFS
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-shared-storage
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    server: nfs-server.default.svc.cluster.local # this should be your NFS server endpoint
    path: "/"
---
# Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
---
apiVersion: graphcore.ai/v1alpha1 # the API that defined this API object type
kind: IPUJob # the kind of this Kubernetes object
metadata:
  name: ipujob-sample # the name of the job
spec:
  jobInstances: 1 # refers to the number of job instances. More than 1 job instance is usually useful for non-training jobs only.
  ipusPerJobInstance: "1"  # refers to the number of ipus required per job instance. A separate partition of this size will be created by the operator for each job instances
  launcher:
    command: # the command to trigger the job execution
      - mpirun
      - --allow-run-as-root
      - --bind-to
      - none
      - -np
      - "1"
      - python3
      - /public_examples/applications/tensorflow/cnns/training/train.py
      - --dataset=cifar-10
      - --synthetic-data
      - --model-size=8
      - --batch-size=1
      - --batches-per-step=10
      - --gradient-accumulation-count=10
      - --no-validation
      - --no-stochastic-rounding
      - --iterations=20
  workers:
    replicas: 1 # how many workers (poplar instances) should participate in this execution
    template: # native Kubernetes Pod template. https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
      metadata:
        labels:
          app: resnet-launcher
      spec:
        volumes: # we define here which volumes we want to use with the workers (the same is applied to the launcher too)
        - name: mypvc
          persistentVolumeClaim:
            claimName: nfs-pvc # that is the persistent volume claim we created in the above object

        containers: # the containers running inside each worker
        - name: resnet
          image: resnet:latest
          env: # environment variables set on each worker
          - name: "IPUOF_LOG_LEVEL"
            value: "INFO"
          - name: "POPLAR_LOG_LEVEL"
            value: "INFO"
          volumeMounts:
          - name: mypvc # the name of the volume defined in the volumes section
            mountPath: /mnt/sample  # this is where we mount the volume into both workers and the launcher
---

Download single-gcd-sample-nfs.yaml

The above specification will create an NFS persistent volume (assuming you have an NFS server available), and a persistent volume claim requesting the same amount of storage as the persistent volume.

The IPUJob then mounts that NFS volume at /mnt/sample in the job’s workers and launchers.
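
As a quick check once the job is running, you can confirm that the volume is visible from one of the Pods, for example (the worker Pod name follows the pattern shown earlier):

$ kubectl exec ipujob-sample-worker-0 -- ls /mnt/sample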

6.7.2. Automatic restarts

You may want your IPUJob to automatically restart in certain cases. Currently, we support four types of restart policies which can be defined under the IPUJob spec:

  • Always: means that the job will always be restarted when finished, regardless of success or failure

  • OnFailure: means that the job will only be restarted if it fails, regardless of why it failed

  • Never: means that the job will never be restarted when finished, regardless of success or failure

  • ExitCode: means that the restart behavior is determined by the exit code returned by the job

The Operator checks the exit code to determine the behavior when an error occurs, for example:

  • 1-127: permanent error, do not restart.

  • 128-255: retryable error, will restart the job.
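
As an illustration, the restart policy is defined under the IPUJob spec. The fragment below is a sketch based on the field names shown in the kubectl describe output in Section 6.6 (Restart Policy, Type, Back Off Limit); check the CRD specification in the package for the exact field names:

spec:
  restartPolicy:
    type: OnFailure    # Always | OnFailure | Never | ExitCode
    backOffLimit: 3    # assumed to limit how many times the job is restarted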

6.7.3. Clean up Kubernetes resources and IPU partitions

Once the job is finished and is no longer going to be restarted, the operator can perform automatic cleanup to free the Kubernetes resources that are no longer needed. This can be defined in the cleanPodPolicy under the IPUJob spec. The following values can be set:

  • Workers: delete only the worker Pods when the job is finished.

  • All: delete all Pods (launcher and workers) and release the IPU resources when the job is finished. Note that this setting takes priority over any restartPolicy: the restartPolicy then behaves as if it were set to Never.

  • None: don’t delete any pods when the job is finished.
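
For example, to remove only the worker Pods when the job finishes while keeping the launcher Pod (and its logs), you could set the following in the IPUJob spec (cleanPodPolicy also appears in the inference example in Section 6.7):

spec:
  cleanPodPolicy: "Workers"    # one of "Workers", "All" or "None"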

6.8. Debugging problems

When something does not work as expected, you need to debug the problem to understand how to fix it.

Before we talk about how to debug a failed IPUJob, it is probably good to understand how the IPU Operator executes an IPUJob.

6.8.1. How does the IPU Operator work?

The IPU Operator consists of a few components:

  • Controller: this is the reconcile loop that makes sure the desired state (defined in the IPUJob specifications) matches the state of the world.

  • Admission webhooks: we have two webhooks:

    • Defaulting (Mutation) webhook: adds some default values to the submitted IPUJob specifications.

    • Validating webhook: validates the IPUJob specification.

  • V-IPU proxy: proxies IPU partition operations to the V-IPU controller and keeps track of the partitions created for jobs running inside the cluster.

When an IPUJob is created in the cluster, the IPU Operator gets notified and creates the following Kubernetes resources:

  • A ConfigMap to hold a couple of things:

    • A kubeexec script which is used by MPI to trigger remote execution with the worker Pods from the launcher Pod.

    • A hostfile which lists the worker Pods that MPI will use for remote execution.

  • A Kubernetes RBAC role, role-binding and service account for the launcher Pod which allows the launcher to list and watch worker Pods and exec into them.

  • A set of worker Pods which participate in the job execution. These Pods are placed into a 365-day sleep as their main process until the launcher triggers the job processes on them.

  • A launcher Pod which contains two components:

    • An init-container which runs a small application we provide with the IPU Operator. This program watches the worker Pods until they are all available and in the Ready state.

      The program also creates the required IPU partition by interacting with the V-IPU proxy. If the V-IPU proxy sees that this partition already exists and is already owned by this IPUJob, it will reset the partition. This makes it possible to restart the job with a clean IPU partition.

    • The main container which uses the image provided by the user and runs the user-defined command to trigger the job execution.

The IPU Operator also sets environment variables on the worker Pods that allow Poplar to see and use the IPUs when running the AI/ML program.
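
If you want to see what was injected, you can inspect a worker Pod's environment, for example the IPUOF-prefixed variables (the worker Pod name below is from the sample job in Section 6.6):

$ kubectl exec ipujob-sample-worker-0 -- env | grep IPUOF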

6.8.2. Debugging

There are a few places to look for debug info:

  • The status updates for the IPUJob, which you can find by describing the job:
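
    $ kubectl describe ipujobs.graphcore.ai <ipujob-name> -n <the-namespace-where-the-job-was-deployed>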

  • The launcher Pod logs which can be found by running:

    $ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>
    
  • The controller logs which can be found by running:

    $ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>
    
  • The V-IPU proxy logs which can be found by running:

    $ kubectl logs <vipu-proxy-Pod-name> -n <the-namespace-where-the-operator-was-deployed>
    

6.9. IPU usage statistics

The V-IPU proxy in the IPU Operator keeps track of the IPU partitions used by IPUJobs running inside the cluster. The data is stored in a ConfigMap which links each IPUJob with the partition it is using. On top of that tracker ConfigMap, the V-IPU proxy exposes a couple of read-only REST endpoints that you can use.

By default, these endpoints are only exposed within the cluster, so we can use any container with curl to query them, using the following commands:

  • /stats

    # from inside a Kubernetes Pod that has curl
    # gc-ipu-operator-vipu-proxy is the Kubernetes service name for the V-IPU proxy. This name can be different in your installation
    $ curl gc-ipu-operator-vipu-proxy/stats | jq .
    {
      "default": { # the default namespace
        "used": 4,
        "available": 28
      },
      "total": {
        "used": 4,
        "available": 28
      }
    }
    
  • /query

    # from inside a Kubernetes Pod that has curl
    # gc-ipu-operator-vipu-proxy is the Kubernetes service name for the V-IPU proxy. This name can be different in your installation
    $ curl --request POST -H "Content-Type: application/json" --data '{"size":2}' gc-ipu-operator-vipu-proxy/query | jq .
    {
      "available": true,
      "numOfPartitions": 14, # 14 possible partitions are available
      "message": ""
    }
    $ curl --request POST -H "Content-Type: application/json" --data '{"size":64}' gc-ipu-operator-vipu-proxy/query | jq .
    {
      "available": true,
      "numOfPartitions": 0, # no partition of the requested size are available
      "message": ""
    }
    

6.10. Operator Metrics

The IPU Operator exposes a set of Prometheus metrics that you can use. However, these metrics are exposed behind a protected endpoint. The IPU Operator creates a ClusterRole that grants the permissions to scrape the metrics. To allow your Prometheus server to scrape those metrics, you need to bind that ClusterRole to the service account that the Prometheus server uses.

# find the ClusterRole you must use. Note that the name can be different in your installation.
$ kubectl get clusterrole -l component=metrics-reader
NAME                CREATED AT
gc-metrics-reader   2021-03-24T11:13:07Z

# create a ClusterRoleBinding. The service account must be the one that the Prometheus Server uses.
$ kubectl create clusterrolebinding metrics --clusterrole=gc-metrics-reader --serviceaccount=<namespace>:<service-account-name>

6.11. Known limitations

There are currently a few limitations:

  1. IPUs can only be accessed from within the IPU-POD network by default.

    Therefore, IPUJob Pods must be run on a Kubernetes node that can access the IPUs, which means that at least one of the IPU-POD head nodes has to be a Kubernetes worker node.

  2. In order to access the RDMA network interface on the head node, the IPUJob Pods run on the host network and in privileged mode.

  3. For parallel IPUJobs (jobs with more than one worker Pod), you must specify the network interface to be used for MPI communication with the mpirun --mca btl_tcp_if_include option, as sketched after this list.

  4. IPU partitions larger than 64 IPUs are currently not supported.
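
As a sketch of point 3, the interface is passed to mpirun in the launcher command of the IPUJob spec. The fragment below reuses the training example from Section 6.6; the interface name eth0 is a placeholder, so replace it with the interface on your IPU-POD head node that should carry the MPI traffic:

  launcher:
    command:
      - mpirun
      - --allow-run-as-root
      - --mca
      - btl_tcp_if_include
      - eth0                  # placeholder: network interface to use for MPI communication
      - -np
      - "2"                   # one MPI process per worker Pod in this example
      - python3
      - /public_examples/applications/tensorflow/cnns/training/train.py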