2. Concepts and architecture

The V-IPU management software consists of the following components:

  • V-IPU agents: An agent runs on each IPU-Machine in a Pod. It implements the API end-points for managing the IPUs and exporting application-monitoring metrics, as well as the server-side component of IPUoF, which talks to the Poplar runtime.

  • V-IPU controller: The V-IPU controller, called vipu-server, runs on a management node. It is responsible for managing V-IPU agents and serving API end-points.

  • API end-points: The V-IPU controller provides two API end-points: an admin end-point and a user end-point. The admin end-point is used for deployment and administration of a V-IPU cluster, while the user end-point is used by users to request V-IPU resources. Third-party management and scheduling software, such as Kubernetes or Slurm, uses the user API end-point.

    Note: the admin end-point is not accessible to unprivileged users in cloud-based systems.

  • V-IPU command-line interface: Command-line tools provide access to the administration and user functions of the V-IPU controller.
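
The relationship between these components can be illustrated with a minimal Python sketch. It is a toy model only: the class names, method names, role strings and the figure of four IPUs per machine are all hypothetical and do not reflect the real vipu-server API. It shows only that agents are registered through an admin-only path, while resource queries go through a path open to users and schedulers.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for a V-IPU agent on one IPU-Machine."""
    machine_id: str
    ipu_count: int  # illustrative value, not taken from the product


@dataclass
class Controller:
    """Hypothetical stand-in for the V-IPU controller (not the real API)."""
    agents: dict = field(default_factory=dict)

    def admin_register_agent(self, caller_role: str, agent: Agent) -> None:
        # Models the admin end-point: deployment and administration only.
        if caller_role != "admin":
            raise PermissionError("admin end-point requires the admin role")
        self.agents[agent.machine_id] = agent

    def user_total_ipus(self, caller_role: str) -> int:
        # Models the user end-point: any caller may query available resources.
        return sum(a.ipu_count for a in self.agents.values())


controller = Controller()
controller.admin_register_agent("admin", Agent("ipum-0", 4))
controller.admin_register_agent("admin", Agent("ipum-1", 4))
print(controller.user_total_ipus("user"))
```

In this sketch a caller with the "user" role that attempts `admin_register_agent` receives a `PermissionError`, mirroring the note above that the admin end-point is not open to unprivileged users.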

2.1. Architecture

The architecture of the V-IPU management system is shown in Fig. 2.1.

In a typical V-IPU setup, each IPU-Machine runs a V-IPU agent that is responsible for the hardware introspection, configuration, and management of the IPU-Machine.

Fig. 2.1 V-IPU architecture

The V-IPU agents are controlled and managed by the V-IPU controller. For Pods used in “direct attach” mode, the V-IPU controller runs on one of the IPU-Machines. In switched Pods, the V-IPU controller runs on a host server, which may be one of the servers running Poplar or a separate management server.

The V-IPU controller exposes its end-points over gRPC, offering a rich set of administration and user APIs. These can be used to define clusters, create user partitions, and monitor and obtain diagnostics for the IPU clusters.

When a user partition is created to run user applications, a set of IPUs is selected according to the workload requirements, and the necessary IPU, link and Sync-Link configuration is applied to form an isolated sub-cluster for the application. The Host-Links (IPUoF end-points) are also configured, so that the IPUs are ready to run end-user Poplar applications.
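
The partition-creation sequence described above can be sketched as a small Python model. This is an illustration under stated assumptions, not the V-IPU implementation: the `Cluster` and `Partition` classes, their fields, and the "lowest free IDs first" selection policy are all hypothetical, and the two configuration steps are reduced to flags.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical record of an isolated sub-cluster."""
    name: str
    ipu_ids: list
    links_configured: bool = False
    host_links_ready: bool = False


class Cluster:
    """Toy pool of IPUs; not the V-IPU data model."""

    def __init__(self, ipu_count: int):
        self.free = set(range(ipu_count))
        self.partitions = {}

    def create_partition(self, name: str, size: int) -> Partition:
        if name in self.partitions:
            raise ValueError(f"partition {name!r} already exists")
        if size <= 0 or size > len(self.free):
            raise RuntimeError("cannot satisfy the requested number of IPUs")
        # Step 1: select IPUs (placeholder policy: lowest free IDs first).
        chosen = sorted(self.free)[:size]
        self.free -= set(chosen)
        part = Partition(name, chosen)
        # Step 2: stands in for the IPU, link and Sync-Link configuration.
        part.links_configured = True
        # Step 3: stands in for the Host-Link (IPUoF end-point) configuration.
        part.host_links_ready = True
        self.partitions[name] = part
        return part


cluster = Cluster(ipu_count=16)
first = cluster.create_partition("job-a", 4)
second = cluster.create_partition("job-b", 4)
print(first.ipu_ids, second.ipu_ids)
```

Because each selected IPU is removed from the free pool before the partition is returned, no two partitions in this model can share an IPU, which is the isolation property the paragraph above describes.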