1. The IPU Operator
The IPU Operator extends the Kubernetes API by introducing the IPUJob
custom resource definition (CRD) and an operator for this CRD to
manage IPUs. It allows you to specify the number of IPUs required
for a workload. You can read more about the operator pattern in the
Kubernetes documentation.
Note
Kubernetes uses the term Pod to refer to the smallest deployable unit of computing that can be created and managed in Kubernetes.
This is not to be confused with the Graphcore IPU-POD and Bow Pod, which are rack-based systems of IPUs. We refer to both of these as IPU-POD in this document.
1.1. Components and design
The IPU Operator contains the following components:
- IPUJob CRD
The IPUJob CRD extends the Kubernetes API by introducing a new resource type for managing IPU resources. IPUJob allows you to define a workload that can use IPUs. It must always contain a specification for the worker Pod and may also contain a specification for a launcher Pod.
- Controller
This is the central point of the IPU Operator. It handles submitted IPUJobs. It allocates an IPU partition (a set of IPUs) for a workload using the V-IPU Proxy. Then it starts one or more worker Pods and a launcher Pod. The launcher Pod is only started for distributed training, when mpirun or poprun is needed to start further workloads inside worker Pods.
- V-IPU Proxy
A Pod that the controller uses to communicate with the V-IPU controller in order to create and delete IPU partitions and to check their state.
- Launcher
A Pod that runs an mpirun or poprun command. These commands start workloads inside worker Pods. This is used for distributed training.
More details about the V-IPU Controller can be found in the V-IPU User Guide and the V-IPU Administrator Guide.
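The way the components listed above interact can be illustrated with a small, self-contained Python sketch. It is not the operator's implementation and does not use the real CRD schema or any Kubernetes or V-IPU API: the class names (PodSpec, IPUJob, FakeVipuProxy), the field names (num_ipus, num_workers) and the reconcile function are all hypothetical, chosen only to mirror the description. The sketch shows the order of operations: the controller asks the proxy for an IPU partition, then starts the worker Pods, and adds a launcher Pod only when the job specifies one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PodSpec:
    """Hypothetical stand-in for a Kubernetes Pod specification."""
    name: str
    command: List[str]


@dataclass
class IPUJob:
    """Hypothetical in-memory model of an IPUJob (not the real CRD schema)."""
    name: str
    num_ipus: int                        # number of IPUs the workload needs
    worker: PodSpec                      # a worker specification is always required
    num_workers: int = 1
    launcher: Optional[PodSpec] = None   # only set for distributed training


class FakeVipuProxy:
    """Stand-in for the V-IPU Proxy: it only records partitions in a dict."""

    def __init__(self) -> None:
        self.partitions: Dict[str, int] = {}

    def create_partition(self, name: str, size: int) -> str:
        self.partitions[name] = size
        return name

    def delete_partition(self, name: str) -> None:
        self.partitions.pop(name, None)

    def partition_state(self, name: str) -> str:
        return "ready" if name in self.partitions else "absent"


def reconcile(job: IPUJob, proxy: FakeVipuProxy) -> List[str]:
    """Return the names of the Pods the controller would start, in order."""
    if job.num_ipus < 1:
        raise ValueError("an IPUJob must request at least one IPU")

    # Step 1: allocate an IPU partition for the workload through the proxy.
    partition = proxy.create_partition(f"{job.name}-partition", job.num_ipus)
    if proxy.partition_state(partition) != "ready":
        raise RuntimeError(f"partition {partition} is not ready")

    # Step 2: start one or more worker Pods.
    pods = [f"{job.name}-worker-{i}" for i in range(job.num_workers)]

    # Step 3: start a launcher Pod only when the job defines one.
    if job.launcher is not None:
        pods.append(f"{job.name}-launcher")
    return pods


if __name__ == "__main__":
    proxy = FakeVipuProxy()
    single = IPUJob("train", 4, PodSpec("worker", ["python", "train.py"]))
    print(reconcile(single, proxy))
```

In this sketch a job without a launcher specification yields only worker Pods, while a distributed job that supplies one also gets a launcher Pod, which in the real operator would run mpirun or poprun.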