6. Integration with Slurm

6.1. Introduction

Slurm is a popular open-source cluster management and job scheduling system. For more information, refer to the Slurm Workload Manager Quick Start User Guide.

This section describes how to use V-IPU with Slurm for the different integration methods.

For information on configuring Slurm with V-IPU, refer to the Integration with Slurm section in the V-IPU Administrator Guide.

IPUs are not connected to a single host; instead, they are connected to all hosts via the host fabric. In the resulting setup (Fig. 6.1), all hosts can use all IPUs and all IPUs are interconnected.

Fig. 6.1 Example topology for an IPU-POD64.

This disaggregated representation poses a significant challenge for integration with Slurm and its node-centric view of cluster resources. This section describes the available methods for integrating IPUs with Slurm.

6.1.1. Cluster setup

IPU-Machines are connected through the host fabric to the Poplar and control hosts, which can be VMs or bare metal. The controller can run on one of the Poplar hosts (for example, Poplar host 1). Fig. 6.2 shows a separate control host for clarity; this is not mandatory.

For more information on cluster setup, refer to the V-IPU Administrator Guide.

Fig. 6.2 Example logical topology of an IPU-POD64.

6.2. Overview of integration options

There are four options for integrating V-IPU with Slurm:

  1. host-IPU mapping (recommended)

  2. using multiple preconfigured static partitions

  3. using a single preconfigured reconfigurable dynamic partition

  4. Graphcore-modified Slurm with IPU resource selection plugin

The pros and cons of each option are summarised in Table 6.1.

Table 6.1 Pros and cons of the different options for integrating V-IPU with Slurm.

Host-IPU mapping

Pros:

  • Uses any vanilla Slurm.

  • IPU host nodes can be added to existing Slurm clusters.

  • If the prolog script fails, the job can be returned to the queue and will be rescheduled without user intervention.

  • If the epilog script fails, the node is removed from the pool; the administrator needs to fix it and add it back.

  • Easy to set up.

Cons:

  • Slurm is not aware of IPUs.

  • IPU-Machines are mapped statically to hosts. This has to be decided when the cluster is created.

  • Reconfiguration requires disabling all nodes mapped to IPUs.

  • IPUs must be used in exact quantities, in multiples of the mapping. So, for an IPU-POD64, mapping 16 IPUs to a single host means only 4 jobs can run simultaneously.

  • Users can modify partitions of other users.

  • Might cause under-utilisation of IPUs in a cluster.

  • Creating an allocation might require additional options, which might be confusing for the end user.

  • Error-prone manual configuration.

  • A single host and more than 16 IPUs require 2 or more nodes.

Preconfigured multiple static partitions

Pros:

  • Simple job submission using only the GRES switch.

  • An IPU partition reset will not affect other workloads.

  • Isolation, so more appropriate for multi-tenancy (for partitions with 4 or more IPUs).

  • No partition management overhead.

  • Slurm counts resources and takes care of scheduling.

Cons:

  • Limited number of configurations supported.

  • Adding GRES is required.

  • Complex prolog script.

  • No support for multiple GCDs/hosts, unless multiple GCD partitions are created.

  • Complex setup (partition names are global across hosts).

  • Changing the partition setup requires draining the node.

  • Some IPU utilities (poprun) can interfere with partitions and break the setup.

  • Small IPU partitions (1 or 2 IPUs) must share a cluster and allocation.

  • Error-prone job submission, because GRES vipu:4 is not equivalent to vipu:4:1 but asks for 4 partitions on a single node.

Single preconfigured reconfigurable dynamic partition

Pros:

  • Simple; no setup required.

  • No extra steps when submitting a job.

  • No partition management overhead.

Cons:

  • Limited to 64 IPUs.

  • A reconfigurable partition cannot be used with multiple GCDs (a job using multiple hosts).

  • IPUs used by multiple users are not separated.

  • When there are not enough resources, the job will fail and the user needs to manually re-add it to the queue.

Graphcore-modified Slurm with IPU resource selection plugin

Pros:

  • The user configures a job with IPUs and replicas and gets everything set up and ready to run.

  • Manages the lifecycle of partitions out of the box.

  • The cluster administrator only needs to add the IPUOF config file to the Slurm config directory.

Cons:

  • Modifies Slurm core and protocols.

  • Changes to internal protocols.

  • Cannot be integrated into an existing Slurm installation.

  • Not supported by SchedMD.

  • Based on an old, unsupported version of Slurm (22.05).

  • The resource selection plugin relies on the cons/tres plugin.

6.4. Preconfigured partition: multiple static partitions

Using multiple static partitions adds user separation, but non-reconfigurable partitions can waste resources. Also, the user can only request the allowed partitions.

See Preconfigured partition: multiple static partitions in the V-IPU Administrator Guide for more information about how to configure this method.

Use the predefined GRES values for the preconfigured partitions; otherwise, any standard Slurm options can be used.

Listing 6.2 Example of how to run a job using multiple static partitions
# srun waits (indefinitely) until the requested vipu GRES becomes available
srun --gres vipu:1 -p ipu job
# copy the work to a directory shared with the compute nodes, then submit
sbatch --gres vipu:4 -p ipu --requeue job_script
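
The same job can also be expressed as a batch script with the options given as #SBATCH directives. The following is a minimal sketch, assuming a Slurm partition named ipu and a vipu GRES configured as described in the V-IPU Administrator Guide; the job name, program name and time limit are illustrative placeholders.

#!/bin/bash
#SBATCH --job-name static-partition-job
#SBATCH -p ipu
#SBATCH --gres vipu:4
#SBATCH --requeue
#SBATCH --time=00:30:00

# The prolog set up by the cluster administrator is expected to expose the
# matching preconfigured IPU partition (see the V-IPU Administrator Guide).
srun <ipu_program>

wait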

6.5. Preconfigured partition: single reconfigurable dynamic partition

Using a single reconfigurable dynamic partition is the most straightforward solution for integration of IPUs with Slurm, but it sacrifices security for simplicity and has other severe limitations (Section 5, Partitions).

See Preconfigured partition: single reconfigurable dynamic partition in the V-IPU Administrator Guide for more information about how to configure this method.

To schedule a job to run on IPUs, the user needs to select the appropriate nodes or Slurm partition.

Listing 6.3 Example of how to run a job using a single reconfigurable partition
srun -p ipu job
# copy the work to a directory shared with the compute nodes, then submit
sbatch -p ipu --requeue job_script
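
Because the only scheduling decision left to the user is which Slurm partition to target, it can be useful to inspect that partition before submitting. The commands below are standard Slurm utilities and assume the partition is named ipu, as in Listing 6.3.

sinfo -p ipu    # nodes and their state in the ipu partition
squeue -p ipu   # jobs queued or running in the ipu partition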

6.6. Graphcore-modified Slurm with IPU resource selection plugin

A Slurm plugin is a dynamically linked code object that provides a customized implementation of well-defined Slurm APIs. Slurm plugins are loaded at runtime by the Slurm libraries, and the customized API callbacks are invoked at the appropriate stages.

Resource selection plugins are a type of Slurm plugin that implements the Slurm resource/node selection APIs. These APIs provide rich interfaces for customized selection of nodes for jobs, for performing any tasks needed to prepare the job run (such as partition creation in our case), and for appropriate clean-up at job termination (such as partition deletion in our case).

6.6.1. Job submission and parameters

The V-IPU resource selection plugin supports the following options:

  • --ipus: Number of IPUs requested for the job

  • -n / --ntasks: Number of tasks for the job. This will correspond to the number of GCDs requested for the job partition.

  • --num-replicas: Number of model replicas for the job.

These parameters can be configured in both sbatch and srun scripts as well as provided on the command line:

$ sbatch --ipus=2 --ntasks=1 --num-replicas=1 myjob.batch
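
The same options can also be passed to srun for an interactive run. The following sketch assumes the plugin options listed above, with <ipu_program> standing in for the actual executable:

$ srun --ipus=2 --ntasks=1 --num-replicas=1 <ipu_program>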

Optional:

If V-IPU GRES has been configured, you can add the following option in the job definition to select a particular GRES model for the V-IPU.

--gres=vipu:<type name>

You can configure the GRES model parameter in both sbatch and srun scripts as well as on the command line. Assuming the desired GRES model to be used is pod64, the command should look like:

$ sbatch --ipus=2 --ntasks=1 --num-replicas=1 --gres=vipu:pod64 myjob.batch

6.6.2. Job script examples

The following is an example of a single-GCD job script:

#!/bin/bash
#SBATCH --job-name single-gcd-job
#SBATCH --ipus 2
#SBATCH -n 1
#SBATCH --time=00:30:00

srun <ipu_program>

wait

You can configure a multi-GCD job in the same way, except that you indicate the number of GCDs requested by setting the number of tasks:

#!/bin/bash
#SBATCH --job-name multi-gcd-job
#SBATCH --ipus 2
#SBATCH -n 2
#SBATCH --time=00:30:00

srun <ipu_program>

wait
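
Assuming the multi-GCD script above is saved as multi-gcd-job.batch (an illustrative file name), it can be submitted and monitored with standard Slurm commands:

$ sbatch multi-gcd-job.batch
$ squeue -u $USER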