8. Integration with Slurm
Slurm is a popular open-source cluster management and job scheduling system. For more information, refer to the Slurm Workload Manager Quick Start User Guide.
This section describes the configuration for the different methods of integrating V-IPU with Slurm.
For information on using Slurm with V-IPU, refer to the Integration with Slurm section in the V-IPU User Guide.
8.1. Overview of integration options
There are four options for integrating V-IPU with Slurm:
- Host-IPU mapping (Section 8.2)
- Preconfigured partition: multiple static partitions (Section 8.3)
- Preconfigured partition: single reconfigurable dynamic partition (Section 8.4)
- Graphcore-modified Slurm with IPU resource selection plugin (Section 8.5)
The pros and cons of each option are summarised in the table Pros and cons of the different options for integrating V-IPU with Slurm in the V-IPU User Guide.
8.2. Host-IPU mapping (recommended)
The recommended solution is to statically map IPUs to Poplar hosts and configure Slurm (which is not aware of IPUs) to schedule workloads on hosts to maximise IPU performance. The proposed solution uses a simple file which maps IPU-Machines to hosts. The Poplar host performance determines how many IPU-Machines can be mapped: it can be a single IPU-Machine or more. In the example below, 4 IPU-Machines (16 IPUs) are assigned to a single host. Mapping IPU-Machines to hosts is a manual process based on the physical connections between IPU-Machines and hosts.
See Host-IPU mapping (recommended) in the V-IPU User Guide for more information about how to use this method.
# Mapping
# Host: IPU-machine list
host-1:ipum1,ipum2,ipum3,ipum4
host-2:ipum5,ipum6,ipum7,ipum8
host-3:ipum9,ipum10,ipum11,ipum12
host-4:ipum13,ipum14,ipum15,ipum16
The mapping file is later used by Slurm prolog and epilog scripts to create an allocation with IPU-Machines assigned to the hosts selected for the job. Slurm needs to be configured to enable the controller daemon to use prolog and epilog scripts by setting:
...
EpilogSlurmctld=/usr/local/slurm/scripts/slurmctld_epilog.sh
...
PrologSlurmctld=/usr/local/slurm/scripts/slurmctld_prolog.sh
More information can be found in the Slurm Workload Manager - Prolog and Epilog Guide. The prolog script, run before the start of the job, reads the mapping file and decides how to create the allocation and cluster of IPUs. The epilog script, run after the job finishes, cleans up IPU resources by removing the IPU allocation and clusters.
Creating a partition is left to the job owner. The minimal_prolog.sh prolog script and the minimal_epilog.sh epilog script below are minimal examples for the mapping solution and might not cover all the steps needed to create a valid IPU allocation.
Apart from the slurmctld prolog and epilog scripts, it is convenient to use a TaskProlog script to set up the IPUOF_VIPU environment variables (see Section 8.4, Preconfigured partition: single reconfigurable dynamic partition, for more about the reconfigurable partition approach). The allocation ID can be shared in the same way using a variable such as VIPU_ALLOCATION_ID.
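For example (the path is an assumption, chosen to match the style of the prolog and epilog entries above), TaskProlog is enabled in slurm.conf with:
TaskProlog=/usr/local/slurm/scripts/task_prolog.sh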
#!/bin/sh
ALLOCATION_ID="c$SLURM_JOBID"
NODELIST="$(scontrol show hostnames "$SLURM_JOB_NODELIST")"
# Get the list of IPU-Machines for the selected hosts. Parsing is
# whitespace-sensitive: the map must use host:ipum1,ipum2 with no spaces.
IPUMLIST=$(for i in $NODELIST; do grep "$i": /opt/slurm-23.02.6/etc/ipum.allocs | awk -F: '{print $2}' ; done | paste -d, -s)
# Count the IPU-Machines
IPUMCOUNT=$(echo "$IPUMLIST" | tr ',' ' ' | wc -w)
echo "print Requesting $IPUMCOUNT IPUs"
# vipu-admin is available on the slurmctld host.
# Retry cluster creation until the matching allocation can be queried.
until vipu-admin get allocation "$ALLOCATION_ID"
do
    vipu-admin create cluster "$ALLOCATION_ID" --agents="$IPUMLIST"
    sleep 5
done
echo export IPUOF_VIPU_API_HOST=poplar-host-1
echo export IPUOF_VIPU_API_PORT=8090
# For benchmarks
echo "export VIPU_ALLOCATION_ID=$ALLOCATION_ID"
echo "export VIPU_NUM_IPUS=$((IPUMCOUNT * 4))"
# IPUOF_VIPU_API_PARTITION_ID needs to be set by job after partition creation
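The last step is left to the job itself. As a sketch (the vipu user CLI command and flag names should be checked against your V-IPU version), the job could create and select its partition like this:
# Run inside the job, after TaskProlog has exported VIPU_ALLOCATION_ID and VIPU_NUM_IPUS
vipu create partition "p$SLURM_JOBID" --allocation "$VIPU_ALLOCATION_ID" --size "$VIPU_NUM_IPUS"
export IPUOF_VIPU_API_PARTITION_ID="p$SLURM_JOBID"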
The prolog script can fail in cases of bad formatting of the ipum.allocs file or when creating a cluster fails (for example, when an IPU-Machine is already part of another cluster). Any other failure of the prolog script indicates a configuration problem. In that case, the epilog script will keep trying to remove the allocation and cluster until Slurm kills the script and the node state becomes "drained". The node then has to be checked and returned to the pool manually.
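For example, a drained node can be inspected and returned to service with standard Slurm commands (the node name is illustrative):
scontrol show node host-1     # shows the reason the node was drained
scontrol update NodeName=host-1 State=RESUME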
#!/bin/bash
ALLOCATION_ID="c$SLURM_JOBID"
# Remove partitions for allocation
PARTITION_LIST=$(vipu-admin list partitions --allocation "$ALLOCATION_ID" --showjson | jq -r '.partitions[] | .id')
for PARTITION in $PARTITION_LIST
do
    while vipu-admin get partition "$PARTITION"
    do
        vipu-admin remove partition -f "$PARTITION"
        sleep 5
    done
done
# Remove the cluster
while vipu-admin get cluster "$ALLOCATION_ID"
do
    vipu-admin remove cluster "$ALLOCATION_ID"
    sleep 5
done
#!/bin/sh
echo export IPUOF_VIPU_API_HOST=vipu-host
echo export IPUOF_VIPU_API_PORT=8090
# Export this so the user knows which allocation can be used to create a partition
echo export VIPU_ALLOCATION_ID="c$SLURM_JOBID"
A single host is mapped to one Pod16 (16 IPUs or 4 IPU-Machines). For a Pod64, which is made up of four Pod16s, there are four hosts. In these cases, it is straightforward to schedule work on a Pod16 (single host) or a Pod64 (all hosts). However, running on a Pod32 requires more configuration.
In order to run on a Pod32, the IPU-Machines need to be on adjacent hosts. To guarantee this, additional Slurm configuration is required. This assumes that the IPU-Machines of consecutive hosts are connected to each other (for example, ipum4, mapped to host-1, is connected to ipum5, mapped to host-2).
To build a Pod32, the Slurm scheduler needs to select adjacent hosts. However, this is not guaranteed and so must be configured. This is done with the features option in slurm.conf to label adjacent hosts. For example, pod32-0 is shared between neighbouring hosts host-1 and host-2. Similarly, pod32-2 is shared between hosts host-3 and host-4. So, when two nodes are requested for a Pod32, the Slurm scheduler will pick two adjacent hosts and the IPU-Machines mapped to them.
At the same time, the user is required to set a job constraint that selects one of the Pod32 features, which helps the Slurm scheduler pick suitable nodes. Optionally, to guarantee a valid number of nodes for each Pod size, partitions can be configured for every supported Pod size (Pod16, Pod32, Pod64). In this way a job is rejected when an invalid number of nodes is requested.
The following configuration can be added to slurm.conf using the include directive.
# Node features are needed only to pick a Pod32: a Pod16 is a single node and a Pod64 uses all nodes
NodeName=host-1 features=pod32-0,pod32-3
NodeName=host-2 features=pod32-0,pod32-1
NodeName=host-3 features=pod32-1,pod32-2
NodeName=host-4 features=pod32-2,pod32-3
# Partitions
PartitionName=pod16 Nodes=host-[1-4] OverSubscribe=EXCLUSIVE MaxNodes=1 MinNodes=1
PartitionName=pod32 Nodes=host-[1-4] OverSubscribe=EXCLUSIVE MaxNodes=2 MinNodes=2
PartitionName=pod64 Nodes=host-[1-4] OverSubscribe=EXCLUSIVE MaxNodes=4 MinNodes=4
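For example (a sketch of a user submission; Slurm's square-bracket "matching OR" constraint syntax makes the scheduler pick two nodes that share one of the listed features):
sbatch --partition=pod32 --nodes=2 --constraint="[pod32-0|pod32-1|pod32-2|pod32-3]" job.sh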
8.3. Preconfigured partition: multiple static partitions
Using multiple static partitions adds user separation, but non-reconfigurable partitions can waste resources. In addition, users can only request the preconfigured partition sizes.
See Preconfigured partition: multiple static partitions in the V-IPU User Guide for more information about how to use this method.
When a basic separation of user workloads is needed, multiple preconfigured partitions can be used. This can be done using generic resources (GRES). You need to configure Slurm with the available GRES types using the GresTypes option. Next, assign GRES to the nodes (in the NodeName definition); this can be done in gres.conf or slurm.conf. For simplicity this guide uses slurm.conf. The value for gres has the format type:model:count, where only type is mandatory. model can be any string to differentiate resource sub-kinds, and count is a positive number. For IPUs, type is vipu and model is mandatory. For IPUs, model stores the size of the partitions present on the node. For example, gres=vipu:2:8 means there are 8 IPU partitions configured with a size of 2 IPUs.
...
Epilog=/usr/local/slurm/scripts/epilog.sh
...
Prolog=/usr/local/slurm/scripts/prolog.sh
...
TaskProlog=/usr/local/slurm/scripts/task_prolog.sh
...
# Define GRES type
GresTypes=vipu
# Define GRES for each node; model is the size in IPUs of the
# predefined partitions on that node
NodeName=host-1 gres=vipu:1:16 state=UNKNOWN
NodeName=host-2 gres=vipu:2:8 state=UNKNOWN
NodeName=host-3 gres=vipu:16:1 state=UNKNOWN
NodeName=host-4 gres=vipu:4:4 state=UNKNOWN
# Slurm partition for IPU nodes
PartitionName=ipu Nodes=host-[1-4]
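With this configuration in place, users request one of the preconfigured partitions via GRES. For example (a sketch), to get an interactive shell on a node with 2-IPU partitions:
srun --partition=ipu --gres=vipu:2:1 --pty bash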
For workloads to use preconfigured IPU partitions, it is necessary to track partition usage (which IPU partitions are free and which are allocated) and pass the information on allocated partitions to the job. A simple way of doing this is by moving files with the same names as the IPU partitions between two directories: one containing the free partitions (named free) and one containing the allocated partitions (named allocated). Exclusive access to these directories is guaranteed by the flock Linux utility. The prolog script takes a file from the free directory, moves it to the allocated directory, and creates a pointer file with the name of the allocated partition and the job ID. The epilog script reverses this process: first it reads the pointer file related to the job to get the name of the partition file, then it moves the partition file from allocated back to free.
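As a one-time setup sketch (the paths match the scripts below; the partition file names are assumptions and must match the partition IDs configured in V-IPU):
mkdir -p /opt/slurm-23.02.6/data/free /opt/slurm-23.02.6/data/allocated
touch /opt/slurm-23.02.6/data/allocation.lock
# One empty file per preconfigured partition, named after its partition ID
touch /opt/slurm-23.02.6/data/free/pt2-0 /opt/slurm-23.02.6/data/free/pt2-1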
#!/bin/bash
# Where allocated partitions are stored
ALLOC_DIR=/opt/slurm-23.02.6/data/allocated
# Where free partitions are stored
FREE_DIR=/opt/slurm-23.02.6/data/free
# Lock file guarding moves between the two directories
ALLOCATION_LOCK=/opt/slurm-23.02.6/data/allocation.lock
# Pointer file recording which partition this job was given
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOBID"
FLOCK="flock --verbose"
echo print checking configuration
# Require preallocated dirs
[ -d "$ALLOC_DIR" ] || exit 1
[ -d "$FREE_DIR" ] || exit 1
PARTITION_FILE="$(ls -1 "$FREE_DIR" | head -1)"
# No free files, exit with failure
[ -z "$PARTITION_FILE" ] && exit 1
PARTITION_FREE="$FREE_DIR/$PARTITION_FILE"
PARTITION_ALLOC="$ALLOC_DIR/$PARTITION_FILE"
# Move the partition file from free to allocated
$FLOCK "$ALLOCATION_LOCK" mv "$PARTITION_FREE" "$PARTITION_ALLOC"
echo "$PARTITION_FILE" > "$PARTITION"
When the number of configured IPU partitions is specified with the gres count, Slurm will not schedule more than count partitions per node, which prevents scheduling jobs on nodes with exhausted resources.
#!/bin/bash
# Where allocated partitions are stored (absolute paths, matching the prolog)
ALLOC_DIR=/opt/slurm-23.02.6/data/allocated
# Where free partitions are stored
FREE_DIR=/opt/slurm-23.02.6/data/free
# Lock file guarding moves between the two directories
ALLOCATION_LOCK=/opt/slurm-23.02.6/data/allocation.lock
# Pointer file created by the prolog for this job
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOB_ID"
FLOCK="flock --verbose"
[ -d "$ALLOC_DIR" ] || exit 0
[ -d "$FREE_DIR" ] || exit 0
# No pointer file means no partition was allocated for this job
[ -f "$PARTITION" ] || exit 0
PARTITION_FILE="$(cat "$PARTITION")"
PARTITION_FREE="$FREE_DIR/$PARTITION_FILE"
PARTITION_ALLOC="$ALLOC_DIR/$PARTITION_FILE"
# Move the partition file from allocated back to free
$FLOCK "$ALLOCATION_LOCK" mv "$PARTITION_ALLOC" "$PARTITION_FREE"
rm -f "$PARTITION"
Lastly, the file with the allocated partition is the interface for the task_prolog script, which takes the allocated partition name and sets the IPUOF_ environment variables for the job.
#!/bin/sh
echo export IPUOF_VIPU_API_HOST=vipu-host
echo export IPUOF_VIPU_API_PORT=8090
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOBID"
# Export the preallocated partition ID so the job can use it
echo "export IPUOF_VIPU_API_PARTITION_ID=$(cat "$PARTITION")"
Slurm treats the GRES model as optional: it is processed only when both model and count are specified. In any other case, the number following the type is treated as count. GRES is a user-provided argument and there are a few cases where Slurm will do something different than expected:
- When model and count are empty (--gres=vipu): Slurm will pick any machine with any vipu GRES present and assume the count to be one.
- When count is empty (for example, --gres=vipu:2): Slurm will treat model as count and pick machines with vipu GRES with the desired count (2 in this case).
This is undesirable, and the simplest way to mitigate it is to use a JobSubmit plugin. We use the Lua JobSubmit plugin. To set it up, add lua to the JobSubmitPlugins option in slurm.conf and place job_submit.lua in the same directory as slurm.conf. The script below only handles the two cases mentioned above and assumes Slurm will fail when a malformed GRES string is given (for example, vipu:::).
function string:split(sep)
    local sep, fields = sep or ":", {}
    local pattern = string.format("([^%s]+)", sep)
    self:gsub(pattern, function(c) fields[#fields+1] = c end)
    return fields
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    local _gres = job_desc.gres or ''
    if _gres ~= '' then
        for _tres in string.gmatch(_gres, '([^,]+)') do
            if string.find(_tres, '^gres:vipu') then
                local _vipu_tres = _tres:split(':')
                local _vipu_partition_size = _vipu_tres[3]
                local _vipus_requested = _vipu_tres[4]
                if _vipu_partition_size == nil then
                    slurm.log_user("ERROR: Undefined IPU partition size, use vipu:<partition_size>:<count>")
                    return slurm.FAILURE
                end
                if _vipus_requested == nil or not (tonumber(_vipus_requested) == 1) then
                    local _new_vipu_tres = string.format("%s:%s:%s:1", _vipu_tres[1], _vipu_tres[2], _vipu_tres[3])
                    slurm.log_user("WARN: Incorrect GRES correcting '%s' to: '%s'", _tres, _new_vipu_tres)
                    job_desc.gres = string.gsub(_gres, _tres, _new_vipu_tres)
                end
            end
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

return slurm.SUCCESS
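To enable the script, add the plugin to slurm.conf (a sketch; the option may already list other plugins on your system):
JobSubmitPlugins=lua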
8.4. Preconfigured partition: single reconfigurable dynamic partition
Using a single reconfigurable dynamic partition is the most straightforward solution for integration of IPUs with Slurm, but it sacrifices security for simplicity and has other severe limitations (see the section on Partitions in the V-IPU User Guide for more information).
See Preconfigured partition: single reconfigurable dynamic partition in the V-IPU User Guide for more information about how to use this method.
The administrator can create reconfigurable partitions of up to 64 IPUs when a cluster is set up. Partitions will be static and persistent across user jobs. To expose these partitions to users, the environment variables IPUOF_VIPU_API_HOST and IPUOF_VIPU_API_PARTITION_ID must be set to run code on IPUs. Usually this can be achieved with the task_prolog_ipu.sh script.
# Lines echoed as "export ..." are applied to the task environment by TaskProlog
echo export IPUOF_VIPU_API_HOST=vipuhost
echo export IPUOF_VIPU_API_PARTITION_ID=big_reconfigurable_partition
# Optional: only needed when vipu-server does not use the default port 8090
echo export IPUOF_VIPU_API_PORT=8090
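With this TaskProlog configured, every task inherits the variables. A quick check from a user job (a sketch):
srun env | grep IPUOF_VIPU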
8.5. Graphcore-modified Slurm with IPU resource selection plugin
See Graphcore-modified Slurm with IPU resource selection plugin in the V-IPU User Guide for more information about how to use this method.
8.5.1. Configuring Slurm to use the V-IPU select plugin
Note
This document assumes that you have installed the Graphcore pre-compiled Slurm binaries with V-IPU plugin support, or that you have already patched and recompiled your Slurm installation with V-IPU support. Slurm binaries and source code are available from the Graphcore Downloads page.
To enable V-IPU resource selection in Slurm, you need to configure the SelectType as select/vipu in the Slurm configuration. The V-IPU Slurm plugin is a layered plugin, which means it can enable V-IPU support for existing resource selection plugins. Options pertaining to the selected secondary resource selection plugin can be specified under SelectTypeParameters.
You must also set PropagateResourceLimitsExcept to MEMLOCK. This prevents host memory limits being propagated to the job, which could cause failures when initialising the IPU.
The following is an example of the Slurm configuration enabling the V-IPU resource selection plugin layered on top of a consumable resource allocation plugin (select/other_cons_tres) with the CPU as a consumable resource:
SelectType=select/vipu
SelectTypeParameters=other_cons_tres,CR_CPU
PropagateResourceLimitsExcept=MEMLOCK
For the SelectTypeParameters supported by each of the existing resource selection plugins, refer to the Slurm documentation.
8.5.2. Configuration parameters
Configuration parameters for the V-IPU resource selection plugin are set in separate configuration files that need to be stored in the same directory as slurm.conf. The default configuration file is named vipu.conf. Moreover, administrators can configure additional GRES models for the V-IPU representing different V-IPU clusters. For the additional GRES models, configuration files are named after the model. For instance, a GRES model named pod1 needs a corresponding configuration file named pod1.conf in the Slurm configuration directory.
The following configuration options are supported:
ApiHost: The host name or IP address for the V-IPU controller.
ApiPort: The port number for the V-IPU controller. Default port is 8090.
IpuofDir: The directory where IPUoF configuration files for user jobs will be stored.
MaxIpusPerJob: Maximum number of IPUs allowed per job. Should not exceed the size of the Pod. The default value is 256.
ApiTimeout: Timeout in seconds for the V-IPU client. The default value is 50.
ForceDeletePartition: Set to 1 to specify forced deletion of partition in case of failures. The default value is 0.
UseReconfigPartition: Set to 1 to specify that reconfigurable partitions should be created. The default value is 0.
In addition, slurm.conf should contain the following configuration to allow sharing the IPUoF configuration files needed by the Graphcore Poplar SDK:
VipuIpuofDir: Path to shared storage location writable by scheduler, and readable by all nodes and user accounts.
8.5.3. The V-IPU GRES plugin
To enable the V-IPU GRES plugin, add vipu to the list of GRES types defined for the Slurm cluster.
GresTypes=vipu
In addition, for each node that can access a V-IPU resource, the following node GRES configuration must be added:
Gres=vipu:<GRES_MODEL>:no_consume:<max partition size>
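For example (an illustrative node entry using the pod64 model, matching the example configuration in the next section):
NodeName=ipu-pod64-001 Gres=vipu:pod64:no_consume:64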
8.5.4. An example Slurm Controller configuration
Note
Note that the following settings will override or take precedence over any values configured in your existing slurm.conf configuration file.
In the following, we outline an example of using the V-IPU plugin to configure a Slurm cluster containing a single IPU-POD64, with 4 compute nodes that have shared access to a directory /home/ipuof. The GRES model is named pod64, and a V-IPU Controller is running on the first node using the default port without mTLS. Node names are assumed to be ipu-pod64-001 through ipu-pod64-004.
1. At the end of slurm.conf, add the following line:

Include v-ipu-plugin.conf

2. Create a file called v-ipu-plugin.conf in the same directory as slurm.conf, containing the following parameters:

SelectType=select/vipu
SelectTypeParameters=other_cons_tres,CR_CPU
PropagateResourceLimitsExcept=MEMLOCK
VipuIpuofDir=/home/ipuof
GresTypes=vipu
NodeName=ipu-pod64-001 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
NodeName=ipu-pod64-002 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
NodeName=ipu-pod64-003 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
NodeName=ipu-pod64-004 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
PartitionName=v-ipu Nodes=ipu-pod64-00[1-4] Default=NO MaxTime=INFINITE State=UP
3. Create a file called vipu.conf in the same directory as slurm.conf, containing the following parameters:

ApiHost=ipu-pod64-001
ApiPort=8090
IpuofDir=/home/ipuof
MaxIpusPerJob=64
4. Create a symbolic link to the vipu.conf file, called pod64.conf, in the same directory as slurm.conf.
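For example, from the Slurm configuration directory:
ln -s vipu.conf pod64.conf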
8.6. Troubleshooting
Table 8.1 lists some possible issues and how to resolve them.
Issue | Action | Possible solution
---|---|---
Slurm job hangs or is constantly requeued | Check prolog file permissions | Set the correct permissions and owner
IPUs are not detected | Check that the IPUOF environment variables are set | Export the variables and contact the administrator
Error that a Pod agent is not registered | Check the mapping file for spaces | Remove the spaces