1. The IPU Operator

The IPU Operator extends the Kubernetes API by introducing the IPUJob custom resource definition (CRD) and an operator that manages IPUs for this CRD. It allows you to specify the number of IPUs required for a workload. You can read more about the operator pattern in the Kubernetes documentation.

Note

Kubernetes uses the term Pod to refer to the smallest deployable unit of computing that can be created and managed in Kubernetes.

This is not to be confused with the Graphcore IPU-POD and Bow Pod, which are rack-based systems of IPUs. We refer to both of these as IPU-POD in this document.

1.1. Components and design

The IPU Operator contains the following components:

IPUJob CRD

The IPUJob CRD extends the Kubernetes API by introducing a new resource type for managing IPU resources.

An IPUJob allows you to define a workload that can use IPUs. It must always contain a specification for the worker Pod and may also contain a specification for a launcher Pod.

Controller

The controller is the central component of the IPU Operator and handles submitted IPUJobs. It allocates an IPU partition (a set of IPUs) for a workload using the V-IPU Proxy, then starts one or more worker Pods and, where needed, a launcher Pod. The launcher Pod is only started for distributed training, when mpirun or poprun is needed to start further workloads inside the worker Pods.
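The order of operations described above can be sketched in a few lines of Python. This is only an illustrative model, not the operator's actual implementation: the names IPUJobSketch, FakeVipuProxy, create_partition and plan_actions are hypothetical, and the fields do not reflect the real IPUJob schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IPUJobSketch:
    """Hypothetical, simplified stand-in for an IPUJob (not the real CRD schema)."""
    name: str
    num_ipus: int
    num_workers: int = 1
    has_launcher: bool = False  # True models distributed training (mpirun/poprun)


class FakeVipuProxy:
    """Hypothetical stand-in for the V-IPU Proxy; keeps partitions in memory."""

    def __init__(self) -> None:
        self.partitions: Dict[str, int] = {}

    def create_partition(self, job_name: str, num_ipus: int) -> str:
        partition_id = f"{job_name}-partition"
        self.partitions[partition_id] = num_ipus
        return partition_id


def plan_actions(job: IPUJobSketch, proxy: FakeVipuProxy) -> List[str]:
    """Return the ordered actions a controller would take for one job."""
    # 1. Allocate an IPU partition through the proxy.
    partition = proxy.create_partition(job.name, job.num_ipus)
    actions = [f"allocate {partition} ({job.num_ipus} IPUs)"]
    # 2. Start one or more worker Pods.
    actions += [f"start worker {job.name}-worker-{i}" for i in range(job.num_workers)]
    # 3. Start a launcher Pod only when the job specifies one.
    if job.has_launcher:
        actions.append(f"start launcher {job.name}-launcher")
    return actions


if __name__ == "__main__":
    job = IPUJobSketch("train", num_ipus=4, num_workers=2, has_launcher=True)
    for action in plan_actions(job, FakeVipuProxy()):
        print(action)
```

In this sketch, a job without a launcher produces only the partition allocation and the worker actions, which mirrors the non-distributed case.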

V-IPU Proxy

A Pod that the controller uses to communicate with the V-IPU controller in order to create and delete IPU partitions, and to check their state.

Launcher

A Pod that runs an mpirun or poprun command. These commands start workloads inside the worker Pods. The launcher is only used for distributed training.

More details about the V-IPU controller can be found in the V-IPU User Guide and the V-IPU Administrator Guide.