1. The IPU Operator
The IPU Operator extends the Kubernetes API by introducing a
custom resource definition (CRD) and an operator for this CRD to
manage IPUs. It allows you to specify the number of IPUs required
for a workload. You can read more about the operator pattern in the
Kubernetes documentation.
Kubernetes uses the term Pod to refer to the smallest deployable unit of computing that can be created and managed in Kubernetes.
This is not to be confused with the Graphcore IPU-POD and Bow Pod, which are rack-based systems of IPUs. We refer to both of these as IPU-POD in this document.
1.1. Components and design
The IPU Operator contains the following components:
- IPUJob CRD
The IPUJob CRD extends the Kubernetes API by introducing a new resource type for managing IPU resources.
An IPUJob allows you to define a workload that can use IPUs. It must always contain a specification for the worker Pod and may also contain a specification for a launcher Pod.
- Controller
This is the central point of the IPU Operator. It handles submitted
IPUJobs. It allocates an IPU partition (a set of IPUs) for a workload using the
V-IPU Proxy. Then it starts one or more worker Pods and a launcher Pod. The launcher Pod is only started for distributed training, when
poprun needs to be used to start further workloads inside the worker Pods.
- V-IPU Proxy
A Pod that the controller uses to communicate with the V-IPU controller in order to create and delete IPU partitions and to check their state.
- Launcher
A Pod that runs a poprun command. This command starts workloads inside the worker Pods. It is used only for distributed training.
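The way these components interact can be illustrated with a small Python sketch. This is not the operator's actual implementation or API. The names IPUJob (as a Python class), FakeVipuProxy, reconcile and all field names are hypothetical and exist only for this illustration. The sketch assumes that the controller first requests a partition through the proxy, then starts the worker Pods, and starts a launcher Pod only when the job contains a launcher specification. The relative order of the workers and the launcher is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IPUJob:
    """Hypothetical, simplified stand-in for an IPUJob resource."""
    name: str
    num_ipus: int
    worker_spec: dict                      # always required
    launcher_spec: Optional[dict] = None   # only for distributed training
    num_workers: int = 1


class FakeVipuProxy:
    """Hypothetical in-memory stand-in for the V-IPU Proxy."""

    def __init__(self) -> None:
        self.partitions: Dict[str, int] = {}

    def create_partition(self, job_name: str, num_ipus: int) -> str:
        # Record a partition (a set of IPUs) for the given job.
        self.partitions[job_name] = num_ipus
        return job_name

    def delete_partition(self, job_name: str) -> None:
        self.partitions.pop(job_name, None)


def reconcile(job: IPUJob, proxy: FakeVipuProxy) -> List[str]:
    """Return the ordered list of actions a controller might take for a job."""
    if not job.worker_spec:
        raise ValueError("an IPUJob must contain a worker Pod specification")

    actions: List[str] = []

    # 1. Allocate an IPU partition through the proxy.
    partition = proxy.create_partition(job.name, job.num_ipus)
    actions.append(f"partition:{partition}")

    # 2. Start one or more worker Pods.
    for i in range(job.num_workers):
        actions.append(f"worker:{job.name}-worker-{i}")

    # 3. Start a launcher Pod only for distributed training
    #    (ordering after the workers is an assumption of this sketch).
    if job.launcher_spec is not None:
        actions.append(f"launcher:{job.name}-launcher")

    return actions


if __name__ == "__main__":
    proxy = FakeVipuProxy()
    single = IPUJob("mnist", 1, {"image": "example"})
    print(reconcile(single, proxy))
    # prints ['partition:mnist', 'worker:mnist-worker-0']
```

A job that also sets launcher_spec and num_workers=2 would additionally produce a second worker entry and a final launcher entry, which corresponds to the distributed training case described above.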