6. Integration with Slurm

6.1. Introduction

Slurm is a popular open-source cluster management and job scheduling system. For more information, refer to the Slurm Workload Manager Quick Start User Guide.

This section describes how to use V-IPU with Slurm for the different integration methods.

For information on configuring Slurm with V-IPU, refer to the Integration with Slurm section in the V-IPU Administrator Guide.

IPUs are not connected to a single host; instead, they are connected to all hosts via the host fabric. In the resulting setup (Fig. 6.1), all hosts can use all IPUs and all IPUs are interconnected.

Fig. 6.1 Example topology for an IPU-POD64.

This disaggregated representation poses a significant challenge for integration with Slurm and its node-centric view of cluster resources. This section describes the available methods for integrating IPUs with Slurm.

6.1.1. Cluster setup

IPU-Machines are connected through the host fabric to the Poplar and control hosts, which can be VMs or bare metal. The controller can run on one of the Poplar hosts (for example, Poplar host 1). Fig. 6.2 shows a separate control host for clarity; this is not mandatory.

For more information on cluster setup, refer to the V-IPU Administrator Guide.

Fig. 6.2 Example logical topology of an IPU-POD64.

6.2. Overview of integration options

There are four options for integrating V-IPU with Slurm:

  1. host-IPU mapping (recommended)

  2. using multiple preconfigured static partitions

  3. using a single preconfigured reconfigurable dynamic partition

  4. Graphcore-modified Slurm with IPU resource selection plugin

The pros and cons of each option are summarised in Table 6.1.

Table 6.1 Pros and cons of the different options for integrating V-IPU with Slurm.

Host-IPU mapping

Pros:

  • Uses any vanilla Slurm.

  • IPU host nodes can be added to existing Slurm clusters.

  • If the prolog script fails, the job can be returned to the queue and will be rescheduled without user intervention.

  • If the epilog script fails, the node is removed from the pool; the administrator needs to fix it and add it back.

  • Easy to set up.

Cons:

  • Slurm is not aware of IPUs.

  • IPU-Machines are mapped statically to hosts. This has to be decided when the cluster is created.

  • Reconfiguration requires disabling all nodes mapped to IPUs.

  • IPUs must be used in exact quantities, in multiples of the mapping. So, for an IPU-POD64, mapping 16 IPUs to a single host means only 4 jobs can run simultaneously.

  • Users can modify partitions of other users.

  • Might cause under-utilisation of IPUs in a cluster.

  • Creating an allocation might require additional options, which might be confusing for the end user.

  • Error-prone manual configuration.

  • A single host and more than 16 IPUs require 2 or more nodes.

Preconfigured multiple static partitions

Pros:

  • Simple job submission using only the GRES switch.

  • An IPU partition reset will not affect other workloads.

  • Isolation, so more appropriate for multi-tenancy (for partitions with 4 or more IPUs).

  • No partition management overhead.

  • Slurm counts resources and takes care of scheduling.

Cons:

  • Limited number of configurations supported.

  • Adding GRES is required.

  • Complex prolog script.

  • No support for multiple GCDs/hosts, unless multiple GCD partitions are created.

  • Complex setup (partition names are global across hosts).

  • Changing the partition setup requires draining the node.

  • Some IPU utilities (poprun) can interfere with partitions and break the setup.

  • Small IPU partitions (1 or 2 IPUs) must share a cluster and allocation.

  • Error-prone job submission, because GRES vipu:4 is not equivalent to vipu:4:1 but asks for 4 partitions on a single node.

Single preconfigured reconfigurable dynamic partition

Pros:

  • Simple; no setup required.

  • No extra steps when submitting a job.

  • No partition management overhead.

Cons:

  • Limited to 64 IPUs.

  • A reconfigurable partition cannot be used with multiple GCDs (a job using multiple hosts).

  • IPUs used by multiple users are not separated.

  • When there are not enough resources, the job will fail and the user needs to manually re-add it to the queue.

Graphcore-modified Slurm with IPU resource selection plugin

Pros:

  • The user configures a job with IPUs and replicas and gets everything set up and ready to run.

  • Manages the lifecycle of partitions out of the box.

  • The cluster administrator only needs to add the IPUOF config file to the Slurm config directory.

Cons:

  • Modifies Slurm core and protocols.

  • Changes to internal protocols.

  • Cannot be integrated into an existing Slurm installation.

  • Not supported by SchedMD.

  • Based on an old, unsupported version of Slurm (22.05).

  • The resource selection plugin relies on the cons/tres plugin.

6.4. Preconfigured partition: multiple static partitions

Using multiple static partitions adds user separation, but non-reconfigurable partitions can waste resources. Also, the user can only request the allowed partitions.

See Preconfigured partition: multiple static partitions in the V-IPU Administrator Guide for more information about how to configure this method.

Use the predefined GRES values for the preconfigured partitions; otherwise, any standard Slurm options can be used.

Listing 6.2 Example of how to run a job using multiple static partitions
# srun waits (indefinitely) until the requested vipu GRES becomes available
srun --gres vipu:1 -p ipu job
# copy the work to a directory shared with the compute nodes, then submit
sbatch --gres vipu:4 -p ipu --requeue job_script
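
The same job can also be expressed as a batch script with the options given as #SBATCH directives. The following is a minimal sketch, assuming a Slurm partition named ipu and a vipu GRES configured as described in the V-IPU Administrator Guide; the job name, program name and time limit are illustrative placeholders.

#!/bin/bash
#SBATCH --job-name static-partition-job
#SBATCH -p ipu
#SBATCH --gres vipu:4
#SBATCH --requeue
#SBATCH --time=00:30:00

# The prolog set up by the cluster administrator is expected to expose the
# matching preconfigured IPU partition (see the V-IPU Administrator Guide).
srun <ipu_program>

wait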

6.5. Preconfigured partition: single reconfigurable dynamic partition

Using a single reconfigurable dynamic partition is the most straightforward solution for integration of IPUs with Slurm, but it sacrifices security for simplicity and has other severe limitations (Section 5, Partitions).

See Preconfigured partition: single reconfigurable dynamic partition in the V-IPU Administrator Guide for more information about how to configure this method.

To schedule a job to run on IPUs, the user needs to select the appropriate nodes or Slurm partition.

Listing 6.3 Example of how to run a job using a single reconfigurable partition
srun -p ipu job
# copy the work to a directory shared with the compute nodes, then submit
sbatch -p ipu --requeue job_script
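
Because the only scheduling decision left to the user is which Slurm partition to target, it can be useful to inspect that partition before submitting. The commands below are standard Slurm utilities and assume the partition is named ipu, as in Listing 6.3.

sinfo -p ipu    # nodes and their state in the ipu partition
squeue -p ipu   # jobs queued or running in the ipu partition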

6.6. Graphcore-modified Slurm with IPU resource selection plugin

A Slurm plugin is a dynamically linked code object that provides a customized implementation of well-defined Slurm APIs. Slurm plugins are loaded at runtime by the Slurm libraries, and the customized API callbacks are invoked at the appropriate stages.

Resource selection plugins are a type of Slurm plugin that implements the Slurm resource/node selection APIs. These APIs provide rich interfaces for customized selection of nodes for jobs, for performing any tasks needed to prepare the job run (such as partition creation in our case), and for appropriate clean-up at job termination (such as partition deletion in our case).

6.6.1. Job submission and parameters

The V-IPU resource selection plugin supports the following options:

  • --ipus: Number of IPUs requested for the job

  • -n / --ntasks: Number of tasks for the job. This will correspond to the number of GCDs requested for the job partition.

  • --num-replicas: Number of model replicas for the job.

These parameters can be configured in both sbatch and srun scripts as well as provided on the command line:

$ sbatch --ipus=2 --ntasks=1 --num-replicas=1 myjob.batch
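
The same options can also be passed to srun for an interactive run. The following sketch assumes the plugin options listed above, with <ipu_program> standing in for the actual executable:

$ srun --ipus=2 --ntasks=1 --num-replicas=1 <ipu_program>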

Optional:

If V-IPU GRES has been configured, you can add the following option in the job definition to select a particular GRES model for the V-IPU.

--gres=vipu:<type name>

You can configure the GRES model parameter in both sbatch and srun scripts as well as on the command line. Assuming the desired GRES model to be used is pod64, the command should look like:

$ sbatch --ipus=2 --ntasks=1 --num-replicas=1 --gres=vipu:pod64 myjob.batch

6.6.2. Job script examples

The following is an example of a single-GCD job script:

#!/bin/bash
#SBATCH --job-name single-gcd-job
#SBATCH --ipus 2
#SBATCH -n 1
#SBATCH --time=00:30:00

srun <ipu_program>

wait

You can configure a multi-GCD job in the same way, except that you indicate the number of GCDs requested by setting the number of tasks:

#!/bin/bash
#SBATCH --job-name multi-gcd-job
#SBATCH --ipus 2
#SBATCH -n 2
#SBATCH --time=00:30:00

srun <ipu_program>

wait
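
Assuming the multi-GCD script above is saved as multi-gcd-job.batch (an illustrative file name), it can be submitted and monitored with standard Slurm commands:

$ sbatch multi-gcd-job.batch
$ squeue -u $USER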