6. Integration with Kubernetes

Preview Release

This is an early release of Kubernetes support for the IPU. As such, the software is subject to change without notice.

The Kubernetes IPU Operator for V-IPU is available on request from Graphcore support.

Kubernetes (K8s) is an open-source container orchestration and management system. Kubernetes Operators allow the Kubernetes API to be extended with custom objects, and implement the control logic for such custom objects. You can read more about the Operator pattern in the Kubernetes documentation.

The IPU Operator provides a framework to extend the Kubernetes API and to manage IPUs via custom resource definitions (CRDs) and custom controllers. It allows you to specify the number of IPUs required for your Kubernetes workload using annotations.

This chapter outlines the Operator components, installation steps and usage.

Note

Kubernetes uses the word Pod to refer to the smallest deployable units of computing that you can create and manage in Kubernetes. This is not to be confused with the Graphcore IPU-POD, which is a rack-based system of IPUs.

6.1. Components and design

The Operator contains the following components:

  • The gc-proxy that communicates with the V-IPU controller (vipu-server)

  • The CRD and controller that let you allocate IPUs directly from the Kubernetes cluster.

The gc-proxy is responsible for:

  • Managing the IPU resources by communicating with the V-IPU controller

  • Running the REST API server to serve requests from init-containers for partition creation

The CRD and custom controller extend the Kubernetes API and manage IPU resources on your behalf. They are responsible for:

  • Watching for CRD events

  • Creating worker and launcher Pods based on the CRD configuration

  • Adding a finaliser to the custom resource to release the IPUs on deletion

  • Setting hostNetwork and securityContext/privileged to true

  • Setting the Pod dnsPolicy to ClusterFirstWithHostNet

  • Providing webhook REST endpoints to validate the input CRD specification

6.2. Package contents

The software is delivered as a single tarball containing the following files and directories:

The CRD specification:

gc-ipu-operator-v1.0.0-alpha-5/CRDs/graphcore.ai_ipujobs.yaml

Documentation for the Operator:

gc-ipu-operator-v1.0.0-alpha-5/docs/

The Helm Chart:

gc-ipu-operator-v1.0.0-alpha-5/gc-ipu-operator-helm-chart-v1.0.0-alpha-5.tgz

The Operator and gc-proxy images:

gc-ipu-operator-v1.0.0-alpha-5/gc-operator-images.tar.gz

Checksum for the Operator:

gc-ipu-operator-v1.0.0-alpha-5/gc-operator.cksm

Checksum for the Helm Chart:

gc-ipu-operator-v1.0.0-alpha-5/gc-ipu-operator-helm-chart.cksm

6.3. Deploying the software

6.3.1. Prerequisites

Before you can use IPUs from your Kubernetes workloads, you need to meet the following conditions:

  • Have access to one or more Graphcore IPU-PODs

  • Have a compatible version of the V-IPU controller installed on your IPU-PODs

  • Create a Kubernetes cluster. At least one of the Kubernetes worker nodes in the cluster must run on an IPU-POD head node. See Section 6.9, Known limitations for more information.

  • Have the kubectl and Helm (v3.0.0 or later) command-line tools installed on your machine.

6.3.2. Installation

Installing the CRDs

To install the CRDs, run the following command:

$ kubectl apply -f <dir>/CRDs/graphcore.ai_ipujobs.yaml

Installing the Operator

Unzip the Helm package and run the following command:

$ helm install <release-name> <path-to-chart-tar> <custom-parameters>

Where:

  • <release-name> is the name you choose for this Helm installation

  • <path-to-chart-tar> is the path to the downloaded Helm Chart tar file

  • <custom-parameters> is where you customize the installation.

    You can either use multiple --set key=value arguments, or put your customization in a YAML file and use the --values your-values.yaml argument.

See Section 6.4, Configurations for more information.

For example, the following command deploys the software to the Kubernetes cluster in the default configuration.

$ cd ipu-proxy
$ helm install [RELEASE_NAME] . --set vipuServerAddr=[host] --set vipuServerPort=[port] --set vipuClusterName=[cluster] \
  --set controller.image.repository=[controller-image] --set vipuProxy.image.repository=[proxy-image]

This command installs the following in the same namespace where the Helm release is installed:

  • gc-proxy and gc-controller as deployments

  • gc-proxy service of type ClusterIP

  • RBAC: ServiceAccount, ClusterRole to manage Pods and ConfigMaps

  • A Partitions tracker ConfigMap

  • Configuration objects for the mutation and validation webhooks

You can read more about installing Helm in the Kubernetes documentation.

You can see all the customization options in the README.md for the Helm Charts.
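
Alternatively, you can collect the same settings in a values file and pass it to helm install with the --values argument. The following is a minimal sketch; the file name, addresses, cluster name and image repositories are placeholders that you should replace with your own values:

# my-values.yaml (example only)
vipuServerAddr: vipu-ctrl.example.com
vipuServerPort: 8191
vipuClusterName: cluster1
controller:
  image:
    repository: <controller-image>
vipuProxy:
  image:
    repository: <proxy-image>

$ helm install <release-name> <path-to-chart-tar> --values my-values.yaml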

Multiple V-IPU controller support

The IPU Operator can communicate with multiple V-IPU controllers.

You can specify multiple V-IPU controllers during installation by setting the vipuControllers option on the helm install command line. For example:

--set vipuControllers="pod001:8090:ipunode=node1,pod002:8091:ipunode=node2"

Alternatively, after installation you can edit the ConfigMap, as shown below, and update the value.

$ kubectl edit configmap gc-ipu-operator-vipu-controllers

Each V-IPU controller is specified with a colon-separated list of three values:

  • V-IPU controller host address

  • V-IPU controller port

  • A label defined by key=value.

The same label must be added to the node where the containers corresponding to that V-IPU controller will run. Labeling the node is done with the following command:

$ kubectl label nodes <someworkernode> <key>=<value>

The ConfigMap can be modified at any time and the IPU Operator automatically adds the new V-IPU controller to its internal list. It can take up to 60 seconds for the new V-IPU controller to be added. When a partition is created, the IPU Operator goes through the list serially until it finds space for the requested number of IPUs.
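
For example, to match the vipuControllers value shown above, you would label the Kubernetes worker nodes that correspond to each V-IPU controller (the node names used here are placeholders):

$ kubectl label nodes worker-node-1 ipunode=node1
$ kubectl label nodes worker-node-2 ipunode=node2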

Verifying the installation

When the installation is complete, you can verify that it worked by running the following commands; you should see output similar to the following:

$ kubectl get crd
NAME                   CREATED AT
ipujobs.graphcore.ai   2021-03-02T12:20:04Z
...

$ helm ls -n <the-namespace-where-you-deployed-the-operator>
NAME  NAMESPACE  REVISION  UPDATED                              STATUS    CHART                              APP VERSION
gc    default    1         2021-03-18 11:50:31.35861 +0100 CET  deployed  gc-ipu-operator-helm-chart-v1.0.0  v1.0.0


$ kubectl get pods -n <the-namespace-where-you-deployed-the-operator>
NAME                                                  READY   STATUS    RESTARTS   AGE
gc-ipu-operator-controller-manager-54766f7f7b-x5wtr   2/2     Running     0          5d23h
gc-ipu-operator-vipu-proxy-844c7d6b7f-88bqr           1/1     Running     1          5d23h

6.3.3. Uninstall

$ helm uninstall [RELEASE_NAME]

This removes all the Kubernetes components associated with the chart and deletes the release.

See helm uninstall for command documentation.

Note

The partition tracker ConfigMap ipu-partitions-tracker does not get deleted when you uninstall the Helm release. This is so that when the ipu-proxy is deployed again, it can pick up from where it was uninstalled before (in terms of managing the created partitions). If you wish to remove that ConfigMap, you can run: kubectl delete configmap ipu-partitions-tracker -n <namespace>

6.3.4. Upgrading the Helm Chart

$ helm upgrade [RELEASE_NAME] [CHART]
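
For example, to upgrade the release named gc (as shown in the helm ls output above) to a newer chart version while keeping the values set at installation time, you could run something like:

$ helm upgrade gc gc-ipu-operator-helm-chart-<new-version>.tgz --reuse-values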

See helm upgrade for command documentation.

6.4. Configurations

The following table lists the configurable parameters of the Helm Chart and their default values.

Parameter | Description | Default
global.launcherImage | The container image used for each IPUJob launcher init container | launcher:latest
global.imagePullSecrets | Image pull secret names (for example, name: "test-secret") | []
nameOverride | Override the name of the chart in the generated chart resource names | ""
fullNameOverride | Override the fully qualified app name used in naming the generated chart resources. If this is not set, the fully qualified app name defaults to <helm-release-name>-<Chart name, or nameOverride if set> | ""
controller.hostNetwork | Set the hostNetwork flag for the controller Pod | true
controller.dnsPolicy | Set to ClusterFirstWithHostNet if hostNetwork is true; otherwise set to ClusterFirst | ClusterFirstWithHostNet
controller.image.repository | Controller image repository | ""
controller.image.pullPolicy | Controller image pull policy | Always
controller.image.tag | Overrides the controller image tag, which defaults to the chart appVersion | ""
controller.serviceAccount.create | If true, create a service account for the ipu-proxy | true
controller.serviceAccount.annotations | Annotations to add to the service account | {}
controller.serviceAccount.name | The name of the service account to use. If not set and create is true, a name is generated using the "fullname" template | ""
controller.rbac.create | If true, create RBAC ClusterRole and ClusterRoleBinding and attach them to the service account | true
vipuServerAddr | V-IPU controller address | example.com
vipuServerPort | V-IPU controller port | 8191
vipuClusterName | V-IPU controller cluster name | test
proxyPort | ipu-proxy port | 8080
ipuVisibility.advertiseOnMaster | If true, advertise IPU availability on master node(s) | false
podAnnotations | ipu-proxy Pod annotations | {}
podSecurityPolicyContext | ipu-proxy Pod security policy context | {}
securityContext | Security context | {}
service.type | ipu-proxy Kubernetes service type | ClusterIP
service.port | ipu-proxy Kubernetes service port | 80
resources | ipu-proxy compute resources | {}
nodeSelector | ipu-proxy Kubernetes node selector | {}
tolerations | ipu-proxy Kubernetes tolerations | {}
affinity | ipu-proxy Kubernetes node affinity | {}
extendedScheduler.enabled | If true, set up the Kubernetes default scheduler extension. You must manually restart the default scheduler after the Helm release installation; the installation prints instructions for doing so | false
admissionWebhooks.scope | Comma-separated list of namespaces where the webhook performs mutations/validations. Leaving this empty/unset means mutation is performed on all namespaces | ""
admissionWebhooks.timeoutSeconds | Admission webhook timeout | 30
admissionWebhooks.image.repository | Admission webhook image repository | localhost:5000/gc-webhook
admissionWebhooks.image.tag | Admission webhook image tag | v0.2.0
admissionWebhooks.image.pullPolicy | Admission webhook image pull policy | IfNotPresent
admissionWebhooks.failurePolicy | Admission webhook failure policy | Fail
admissionWebhooks.port | Admission webhook Pod port | 8443
admissionWebhooks.service.annotations | Admission webhook service annotations | {}
admissionWebhooks.service.servicePort | Admission webhook service port | 443
admissionWebhooks.service.type | Admission webhook service type | ClusterIP
admissionWebhooks.patch.enabled | Create and configure the admission webhook TLS certificate | true
admissionWebhooks.patch.image.repository | Admission webhook TLS patch image repository | docker.io/jettech/kube-webhook-certgen
admissionWebhooks.patch.image.tag | Admission webhook TLS patch image tag | v1.3.0
admissionWebhooks.patch.image.pullPolicy | Admission webhook TLS patch image pull policy | IfNotPresent
admissionWebhooks.patch.priorityClassName | Admission webhook TLS patch jobs priority class | ""
admissionWebhooks.patch.podAnnotations | Admission webhook TLS patch jobs Pod annotations | {}
admissionWebhooks.patch.nodeSelector | Admission webhook TLS patch jobs nodeSelector | {}
admissionWebhooks.patch.tolerations | Admission webhook TLS patch jobs tolerations | []
admissionWebhooks.patch.runAsUser | Admission webhook TLS patch jobs run-as user | 2000

6.5. Creating an IPUJob

Once the CRDs and the IPU Operator are installed, you can start submitting IPUJobs (MPI-based AI/ML jobs that use IPUs). The following YAML file is an example of a declarative definition of an IPUJob for the ResNet-8 TensorFlow application:

apiVersion: graphcore.ai/v1alpha1 # the API that defined this API object type
kind: IPUJob # the kind of this Kubernetes object
metadata:
  name: ipujob-sample # the name of the job
spec:
  modelReplicas: "4" # how many replicas should the graph model be split into when being processed
  ipusPerModelReplica: "1" # how many IPUs should be assigned to each model replica
  launcher:
    command: # the command to trigger the job execution
      - mpirun
      - --allow-run-as-root
      - --bind-to
      - none
      - -np
      - "1"
      - python3
      - /public_examples/applications/tensorflow/cnns/training/train.py
      - --dataset=cifar-10
      - --synthetic-data
      - --model-size=8
      - --batch-size=1
      - --batches-per-step=10
      - --gradient-accumulation-count=10
      - --no-validation
      - --no-stochastic-rounding
      - --iterations=20
  workers:
    replicas: 1 # how many workers (poplar instances) should participate in this execution
    template: # native Kubernetes Pod template. https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
      metadata:
        labels:
          app: resnet-launcher
      spec:
        containers: # the containers running inside each worker
        - name: resnet
          image: resnet:latest
          env: # environment variables set on each worker
          - name: "IPUOF_LOG_LEVEL"
            value: "INFO"
          - name: "POPLAR_LOG_LEVEL"
            value: "INFO"

Download single-gcd-sample.yaml

Save the above specification file as single-gcd-sample.yaml then run:

$ kubectl apply -f single-gcd-sample.yaml
ipujob.graphcore.ai/ipujob-sample created

Now you can inspect what happens in the cluster and you should see something similar to:

$ kubectl get pods
NAME                                                  READY   STATUS    RESTARTS   AGE
gc-ipu-operator-controller-manager-6ff6b6875d-ncjgp   2/2     Running   0          3d22h
gc-ipu-operator-vipu-proxy-849dbf98df-rg8gh           1/1     Running   0          3d22h
ipujob-sample-launcher                                1/1     Running   0          10s
ipujob-sample-worker-0                                1/1     Running   0          25s

You can also list the IPUJobs in the cluster and see their status:

$ kubectl get  ipujobs.graphcore.ai
NAME            STATUS    AGE
ipujob-sample   Running   40s

And you can inspect more details about a specific IPUJob as follows:

$ kubectl describe ipujobs.graphcore.ai ipujob-sample
Name:         ipujob-sample
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  graphcore.ai/v1alpha1
Kind:         IPUJob
Metadata:
  Creation Timestamp:  2021-03-22T10:10:31Z
  Finalizers:
    ipu.finalizers.graphcore.ai
  Generation:  2
    Manager:         manager
    Operation:       Update
    Time:            2021-03-22T10:10:45Z
  Resource Version:  29226482
  Self Link:         /apis/graphcore.ai/v1alpha1/namespaces/default/ipujobs/ipujob-sample
  UID:               beb81bbe-2309-494a-9e28-2a75a704be15
Spec:
  Clean Pod Policy:        None
  Ipus Per Model Replica:  1
  Launcher:
    Command:
      mpirun
      --allow-run-as-root
      --bind-to
      none
      -np
      1
      python3
      /public_examples/applications/tensorflow/cnns/training/train.py
      --dataset=cifar-10
      --synthetic-data
      --model-size=8
      --batch-size=1
      --batches-per-step=10
      --gradient-accumulation-count=10
      --no-validation
      --no-stochastic-rounding
      --iterations=20
  Model Replicas:  4
  Restart Policy:
    Back Off Limit:  3
    Type:            Never
  Workers:
    Replicas:  1
    Template:
      Metadata:
      Spec:
        Containers:
          Env:
            Name:             IPUOF_LOG_LEVEL
            Value:            INFO
            Name:             POPLAR_LOG_LEVEL
            Value:            INFO
          Image:              artifactory-systems.eng.graphcore.ai/vipu-k8s-docker-dev-local/resnet-poplar-2.0:operator
          Name:               resnet
          Resources:
Status:
  Conditions:
    Last Transition Time:  2021-03-22T10:10:31Z
    Last Update Time:      2021-03-22T10:10:31Z
    Message:               IPUJob default/ipujob-sample is waiting for resources to be ready.
    Reason:                IPUJobPending
    Status:                False
    Type:                  Pending
    Last Transition Time:  2021-03-22T10:10:45Z
    Last Update Time:      2021-03-22T10:10:45Z
    Message:               IPUJob default/ipujob-sample is running.
    Reason:                IPUJobRunning
    Status:                True
    Type:                  Running
  I PU Partition Created:  true
  Launcher Status:         Running
  Restart Count:           0
  Start Time:              2021-03-22T10:10:31Z
  Workers Status:
    Active:  1

6.5.1. Interactive mode

You can also run the IPUJob in interactive mode, where it does not execute anything by default:

apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  name: interactive-sample-job
spec:
  modelReplicas: "4"
  ipusPerModelReplica: "1"
  interactive:
    ttl: 3600 # how long should the interactive session last
  workers:
    replicas: 1
    template:
      metadata:
        labels:
          app: resnet-launcher
      spec:
        containers:
        - name: resnet
          image: resnet:latest
          imagePullPolicy: Always
          env:
          - name: "IPUOF_LOG_LEVEL"
            value: "INFO"
          - name: "POPLAR_LOG_LEVEL"
            value: "INFO"

Download interactive-job.yaml

Save the above specification as interactive-job.yaml then run:

$ kubectl apply -f interactive-job.yaml
ipujob.graphcore.ai/interactive-sample-job created

Then you can have a terminal access to the job’s launcher Kubernetes Pod:

$ kubectl exec -it interactive-sample-job-launcher -- bash
root@interactive-sample-job-launcher:/public_examples/applications/tensorflow/cnns/training# <run your mpi programs here>

6.5.2. Mounting data volumes for an IPUJob

Every IPUJob requires an input dataset and will possibly produce output files (for example, checkpoints and trained models). Kubernetes Pods are ephemeral by nature, which means that all files inside the containers are lost when the containers are removed. For data persistence, the IPU Operator relies on native Kubernetes volumes.

By specifying volumes and volumeMounts under the IPUJob workers’ Pod template, all the IPUJob workers and their launcher will have the same volume(s) mounted at the same path. This means that the workers and the launcher all see the same file system at certain path(s). One thing to keep in mind, however, is that you need to use a Persistent Volume type that supports multiple Read/Write mounts. See the Kubernetes documentation for a list of volume types you can use.

Here is an example of the single-gcd-sample.yaml we used above with volumes added to it:

# native Kubernetes volumes which uses NFS
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-shared-storage
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
    - hard
    - nfsvers=4.1
  nfs:
    server: nfs-server.default.svc.cluster.local # this should be your NFS server endpoint
    path: "/"
---
# Persistent Volume Claim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
---
apiVersion: graphcore.ai/v1alpha1 # the API that defined this API object type
kind: IPUJob # the kind of this Kubernetes object
metadata:
  name: ipujob-sample # the name of the job
spec:
  modelReplicas: "4" # how many replicas should the graph model be split into when being processed
  ipusPerModelReplica: "1" # how many IPUs should be assigned to each model replica
  launcher:
    command: # the command to trigger the job execution
      - mpirun
      - --allow-run-as-root
      - --bind-to
      - none
      - -np
      - "1"
      - python3
      - /public_examples/applications/tensorflow/cnns/training/train.py
      - --dataset=cifar-10
      - --synthetic-data
      - --model-size=8
      - --batch-size=1
      - --batches-per-step=10
      - --gradient-accumulation-count=10
      - --no-validation
      - --no-stochastic-rounding
      - --iterations=20
  workers:
    replicas: 1 # how many workers (poplar instances) should participate in this execution
    template: # native Kubernetes Pod template. https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates
      metadata:
        labels:
          app: resnet-launcher
      spec:
        volumes: # we define here which volumes we want to use with the workers (the same is applied to the launcher too)
        - name: mypvc
          persistentVolumeClaim:
            claimName: nfs-pvc # that is the persistent volume claim we created in the above object

        containers: # the containers running inside each worker
        - name: resnet
          image: resnet:latest
          env: # environment variables set on each worker
          - name: "IPUOF_LOG_LEVEL"
            value: "INFO"
          - name: "POPLAR_LOG_LEVEL"
            value: "INFO"
          volumeMounts:
          - name: mypvc # the name of the volume defined in the volumes section
            mountPath: /mnt/sample  # this is where we mount the volume into both workers and the launcher
---

Download single-gcd-sample-nfs.yaml

The above specification will create an NFS persistent volume (assuming you have an NFS server available), and a persistent volume claim requesting the same amount of storage as the persistent volume.

The IPUJob then mounts that NFS volume at /mnt/sample in the job’s workers and launchers.

6.5.3. Automatic restarts

You may want your IPUJob to automatically restart in certain cases. Currently, we support four restart policies which can be defined under the IPUJob specification: “Never”, “Always”, “OnFailure” and “ExitCode”. You can find more details about these in the CRD reference documentation in the docs directory of the release package.
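
For example, a restart policy can be set in the IPUJob specification as in the following sketch; the field names follow the kubectl describe output shown earlier, and the full set of options is described in the CRD reference:

spec:
  restartPolicy:
    type: OnFailure   # one of Never, Always, OnFailure, ExitCode
    backOffLimit: 3   # how many restarts to attempt before giving up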

6.5.4. Clean up Kubernetes resources and IPU partitions

Once the job is finished and is no longer going to be restarted, automatic cleanup can be performed to free the Kubernetes resources that are no longer needed. This can be defined in the cleanPodPolicy under the IPUJob specification. You can explore the options in the CRD reference documentation.
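
For example (a sketch; None is the value shown in the kubectl describe output earlier, and the other accepted values are listed in the CRD reference):

spec:
  cleanPodPolicy: None   # None disables automatic Pod cleanup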

The IPU partitions are currently only cleaned up when the IPUJob is deleted.

6.6. Debugging problems

When something does not work as expected, you need to debug the problem to understand how to fix it.

Before we talk about how to debug a failed IPUJob, it is probably good to understand how the IPU Operator executes an IPUJob.

6.6.1. How does the IPU Operator work?

The IPU Operator consists of a few components:

  • Controller: this is the reconcile loop that makes sure the desired state (defined in the IPUJob specifications) matches the state of the world.

  • Admission webhooks: there are two webhooks:

    • Defaulting (mutation) webhook: adds default values to the submitted IPUJob specification.

    • Validating webhook: validates the IPUJob specification.

  • V-IPU proxy: proxies IPU partition operations to the V-IPU controller and keeps track of the partitions created for jobs running inside the cluster.

When an IPUJob is created in the cluster, the IPU Operator gets notified and creates the following Kubernetes resources:

  • A ConfigMap to hold a couple of things:

    • A kubeexec script which is used by MPI to trigger remote execution with the worker Pods from the launcher Pod.

    • A hostfile which lists the worker Pods that MPI will use for remote execution.

  • A Kubernetes RBAC role, role-binding and service account for the launcher Pod which allows the launcher to list and watch worker Pods and exec into them.

  • A set of worker Pods which participate in the job execution. These Pods are placed into a 365-day sleep as the main background process until the launcher triggers the job processes on them.

  • A launcher Pod which contains two components:

    • An init-container which runs a small application we provide with the IPU Operator. This program watches the worker Pods until they are all available and in the Ready state.

      The program also creates the required IPU partition by interacting with the V-IPU proxy. If the V-IPU proxy sees that this partition already exists and is owned by this IPUJob, it resets the partition. This makes it possible to restart the job with a clean IPU partition.

    • The main container which uses the image provided by the user and runs the user-defined command to trigger the job execution.

The IPU Operator also sets environment variables on the worker Pods that allow Poplar to see and use the IPUs when running the AI/ML program.
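
For example, after submitting the ipujob-sample job from Section 6.5, you can see some of these resources with a command such as the following (the exact resource names vary between installations):

$ kubectl get pods,configmaps,roles,rolebindings,serviceaccounts -n <the-namespace-where-the-job-was-deployed>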

6.6.2. Debugging

There are a few places to look for debug info:

  • The status updates for the IPUJob, which you can find by using kubectl to describe the job

  • The launcher Pod logs which can be found by running:

    $ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>
    
  • The controller logs which can be found by running:

    $ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>
    
  • The V-IPU proxy logs which can be found by running:

    $ kubectl logs <vipu-proxy-Pod-name> -n <the-namespace-where-the-operator-was-deployed>
    

6.7. IPU usage statistics

The V-IPU proxy in the IPU Operator keeps track of the IPU partitions used by IPUJobs running inside the cluster. The data is stored in a ConfigMap that links each IPUJob with the partition it is using. On top of that tracker ConfigMap, the V-IPU proxy exposes a couple of read-only REST endpoints that you can use.

By default, these endpoints are only exposed within the cluster, so you can query them from any container that has curl (see the end of this section for one way to start such a container), using the following commands:

  • /stats

    # from inside a Kubernetes Pod that has curl
    # gc-ipu-operator-vipu-proxy is the Kubernetes service name for the V-IPU proxy. This name can be different in your installation
    $ curl gc-ipu-operator-vipu-proxy/stats | jq .
    {
      "default": { # the default namespace
        "used": 4,
        "available": 28
      },
      "total": {
        "used": 4,
        "available": 28
      }
    }
    
  • /query

    # from inside a Kubernetes Pod that has curl
    # gc-ipu-operator-vipu-proxy is the Kubernetes service name for the V-IPU proxy. This name can be different in your installation
    $ curl --request POST -H "Content-Type: application/json" --data '{"size":2}' gc-ipu-operator-vipu-proxy/query | jq .
    {
      "available": true,
      "numOfPartitions": 14, # 14 possible partitions are available
      "message": ""
    }
    $ curl --request POST -H "Content-Type: application/json" --data '{"size":64}' gc-ipu-operator-vipu-proxy/query | jq .
    {
      "available": true,
      "numOfPartitions": 0, # no partition of the requested size are available
      "message": ""
    }
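
If you do not already have a container with curl in the cluster, one option (an example only, not part of the IPU Operator) is to start a temporary Pod and run the queries from there:

$ kubectl run curl-test --rm -it --image=curlimages/curl --restart=Never -- sh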
    

6.8. Operator Metrics

The IPU Operator exposes a set of Prometheus metrics that you can use. However, these metrics are exposed behind a protected endpoint. The IPU Operator creates a ClusterRole that grants the permissions to scrape the metrics. To allow your Prometheus server to scrape those metrics, you need to bind that ClusterRole to the service account that the Prometheus server uses.

# find the ClusterRole you must use. Note that the name can be different in your installation.
$ kubectl get clusterrole -l component=metrics-reader
NAME                CREATED AT
gc-metrics-reader   2021-03-24T11:13:07Z

# create a ClusterRoleBinding. The service account must be the one that the Prometheus Server uses.
$ kubectl create clusterrolebinding metrics --clusterrole=gc-metrics-reader --serviceaccount=<namespace>:<service-account-name>

6.9. Known limitations

There are currently a few limitations:

  1. IPUs can only be accessed from within the IPU-POD network by default.

    Therefore, IPUJob Pods must be run on a Kubernetes node that can access the IPUs, which means that at least one of the IPU-POD head nodes has to be a Kubernetes worker node.

  2. In order to access the RDMA network interface on the head node, the IPUJob Pods run on host network and in privileged mode.

  3. For parallel IPUJobs (jobs with more than one worker Pod), you must specify the network interface to be used for MPI communication with the mpirun --mca btl_tcp_if_include option, as shown in the example after this list.

  4. IPU partitions larger than 64 IPUs are currently not supported.
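
For example, the launcher command from the IPUJob in Section 6.5 could be extended as in the following sketch; the interface name eth0 is a placeholder for the interface on your head node that should carry the MPI traffic:

  launcher:
    command:
      - mpirun
      - --allow-run-as-root
      - --mca
      - btl_tcp_if_include
      - eth0
      - -np
      - "2"
      - python3
      - /public_examples/applications/tensorflow/cnns/training/train.py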