8. Integration with Slurm
Slurm is a popular open-source cluster management and job scheduling system. For more information, refer to the Slurm Workload Manager Quick Start User Guide.
This section describes the configuration for the different methods of integrating V-IPU with Slurm.
For information on using Slurm with V-IPU, refer to the Integration with Slurm section in the V-IPU User Guide.
8.1. Overview of integration options
There are four options for integrating V-IPU with Slurm:
- Host-IPU mapping (Section 8.2)
- Preconfigured partition: multiple static partitions (Section 8.3)
- Preconfigured partition: single reconfigurable dynamic partition (Section 8.4)
- Graphcore-modified Slurm with IPU resource selection plugin (Section 8.5)
The pros and cons of each option are summarised in the table Pros and cons of the different options for integrating V-IPU with Slurm in the V-IPU User Guide.
8.2. Host-IPU mapping (recommended)
The recommended solution is to statically map IPUs to Poplar hosts and configure Slurm (which is not aware of IPUs) to schedule workloads on hosts to maximise IPU performance. The proposed solution uses a simple file which maps IPU-Machines to hosts. The Poplar host performance determines how many IPU-Machines can be mapped: it can be a single IPU-Machine or more. In the example below, 4 IPU-Machines (16 IPUs) are assigned to a single host. Mapping IPU-Machines to hosts is a manual process based on the physical connections between IPU-Machines and hosts.
See Host-IPU mapping (recommended) in the V-IPU User Guide for more information about how to use this method.
# Mapping
# Host: IPU-machine list
host-1:ipum1,ipum2,ipum3,ipum4
host-2:ipum5,ipum6,ipum7,ipum8
host-3:ipum9,ipum10,ipum11,ipum12
host-4:ipum13,ipum14,ipum15,ipum16
The mapping file is later used by Slurm prolog and epilog scripts to create an allocation with IPU-Machines assigned to the hosts selected for the job. Slurm needs to be configured to enable the controller daemon to use prolog and epilog scripts by setting:
...
EpilogSlurmctld=/usr/local/slurm/scripts/slurmctld_epilog.sh
...
PrologSlurmctld=/usr/local/slurm/scripts/slurmctld_prolog.sh
More information can be found in the Slurm Workload Manager - Prolog and Epilog Guide. The prolog script, run before the start of the job, reads the mapping file and decides how to create the allocation and cluster of IPUs. The epilog script, run after the job finishes, cleans up IPU resources by removing the IPU allocation and clusters.
Creating a partition is left to the job owner. The minimal_prolog.sh prolog script and the minimal_epilog.sh epilog script below are minimal examples for the mapping solution and might not cover all the steps needed to create a valid IPU allocation.
Apart from the slurmctld prolog and epilog scripts, it is convenient to use a TaskProlog script to set up the IPUOF_VIPU environment variables (see Section 8.4, Preconfigured partition: single reconfigurable dynamic partition, for more about the reconfigurable partition approach). The allocation ID can be shared in the same way using a variable such as VIPU_ALLOCATION_ID.
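For example (the path is an assumption, chosen to match the style of the prolog and epilog entries above), TaskProlog is enabled in slurm.conf with:
TaskProlog=/usr/local/slurm/scripts/task_prolog.sh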
#!/bin/sh
ALLOCATION_ID="c$SLURM_JOBID"
NODELIST="$(scontrol show hostnames "$SLURM_JOB_NODELIST")"
# Get the list of IPU-Machines for the selected hosts. Parsing is
# whitespace-sensitive: the map must use host:ipum1,ipum2 with no spaces.
IPUMLIST=$(for i in $NODELIST; do grep "$i": /opt/slurm-23.02.6/etc/ipum.allocs | awk -F: '{print $2}' ; done | paste -d, -s)
# Count the IPU-Machines
IPUMCOUNT=$(echo "$IPUMLIST" | tr ',' ' ' | wc -w)
echo "print Requesting $IPUMCOUNT IPUs"
# vipu-admin is available on the slurmctld host.
# Retry cluster creation until the matching allocation can be queried.
until vipu-admin get allocation "$ALLOCATION_ID"
do
    vipu-admin create cluster "$ALLOCATION_ID" --agents="$IPUMLIST"
    sleep 5
done
echo export IPUOF_VIPU_API_HOST=poplar-host-1
echo export IPUOF_VIPU_API_PORT=8090
# For benchmarks
echo "export VIPU_ALLOCATION_ID=$ALLOCATION_ID"
echo "export VIPU_NUM_IPUS=$((IPUMCOUNT * 4))"
# IPUOF_VIPU_API_PARTITION_ID needs to be set by job after partition creation
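The last step is left to the job itself. As a sketch (the vipu user CLI command and flag names should be checked against your V-IPU version), the job could create and select its partition like this:
# Run inside the job, after TaskProlog has exported VIPU_ALLOCATION_ID and VIPU_NUM_IPUS
vipu create partition "p$SLURM_JOBID" --allocation "$VIPU_ALLOCATION_ID" --size "$VIPU_NUM_IPUS"
export IPUOF_VIPU_API_PARTITION_ID="p$SLURM_JOBID"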
The prolog script can fail in cases of bad formatting of the ipum.allocs file or when creating a cluster fails (for example, when an IPU-Machine is already part of another cluster). Any other failure of the prolog script indicates a configuration problem. In that case, the epilog script will keep trying to remove the allocation and cluster until Slurm kills the script and the node state becomes "drained". The node then has to be checked and returned to the pool manually.
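For example, a drained node can be inspected and returned to service with standard Slurm commands (the node name is illustrative):
scontrol show node host-1     # shows the reason the node was drained
scontrol update NodeName=host-1 State=RESUME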
#!/bin/bash
ALLOCATION_ID="c$SLURM_JOBID"
# Remove partitions for allocation
PARTITION_LIST=$(vipu-admin list partitions --allocation "$ALLOCATION_ID" --showjson | jq -r '.partitions[] | .id')
for PARTITION in $PARTITION_LIST
do
    while vipu-admin get partition "$PARTITION"
    do
        vipu-admin remove partition -f "$PARTITION"
        sleep 5
    done
done
# Remove the cluster
while vipu-admin get cluster "$ALLOCATION_ID"
do
    vipu-admin remove cluster "$ALLOCATION_ID"
    sleep 5
done
#!/bin/sh
echo export IPUOF_VIPU_API_HOST=vipu-host
echo export IPUOF_VIPU_API_PORT=8090
# Export this so the user knows which allocation can be used to create a partition
echo export VIPU_ALLOCATION_ID="c$SLURM_JOBID"
A single host is mapped to one Pod16 (16 IPUs or 4 IPU-Machines). For a Pod64, which is made up of four Pod16s, there are four hosts. In these cases, it is straightforward to schedule work on a Pod16 (single host) or a Pod64 (all hosts). However, running on a Pod32 requires more configuration.
In order to run on a Pod32, the IPU-Machines need to be on adjacent hosts. To guarantee this, additional Slurm configuration is required. This assumes that the IPU-Machines of consecutive hosts are connected to each other (for example, ipum4, mapped to host-1, is connected to ipum5, mapped to host-2).
To build a Pod32, the Slurm scheduler needs to select adjacent hosts. However, this is not guaranteed and so must be configured. This is done with the features option in slurm.conf to label adjacent hosts. For example, pod32-0 is shared between neighbouring hosts host-1 and host-2. Similarly, pod32-2 is shared between hosts host-3 and host-4. So, when two nodes are requested for a Pod32, the Slurm scheduler will pick two adjacent hosts and the IPU-Machines mapped to them.
At the same time, the user is required to set a job constraint that selects one of the Pod32 features, which helps the Slurm scheduler pick suitable nodes. Optionally, to guarantee a valid number of nodes for each Pod size, partitions can be configured for every supported Pod size (Pod16, Pod32, Pod64). In this way a job is rejected when an invalid number of nodes is requested.
The following configuration can be added to slurm.conf using the include directive.
# Node features are needed only to pick a Pod32: a Pod16 is a single node and a Pod64 uses all nodes
NodeName=host-1 features=pod32-0,pod32-3
NodeName=host-2 features=pod32-0,pod32-1
NodeName=host-3 features=pod32-1,pod32-2
NodeName=host-4 features=pod32-2,pod32-3
# Partitions
PartitionName=pod16 Nodes=host-[1-4] OverSubscribe=EXCLUSIVE MaxNodes=1 MinNodes=1
PartitionName=pod32 Nodes=host-[1-4] OverSubscribe=EXCLUSIVE MaxNodes=2 MinNodes=2
PartitionName=pod64 Nodes=host-[1-4] OverSubscribe=EXCLUSIVE MaxNodes=4 MinNodes=4
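For example (a sketch of a user submission; Slurm's square-bracket "matching OR" constraint syntax makes the scheduler pick two nodes that share one of the listed features):
sbatch --partition=pod32 --nodes=2 --constraint="[pod32-0|pod32-1|pod32-2|pod32-3]" job.sh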
8.3. Preconfigured partition: multiple static partitions
Using multiple static partitions adds user separation, but non-reconfigurable partitions can waste resources. In addition, users can only request the preconfigured partition sizes.
See Preconfigured partition: multiple static partitions in the V-IPU User Guide for more information about how to use this method.
When a basic separation of user workloads is needed, multiple preconfigured partitions can be used. This can be done using generic resources (GRES). You need to configure Slurm with the available GRES types using the GresTypes option. Next, assign GRES to the nodes (in the NodeName definition); this can be done in gres.conf or slurm.conf. For simplicity this guide uses slurm.conf. The value for gres has the format type:model:count, where only type is mandatory. model can be any string to differentiate resource sub-kinds, and count is a positive number. For IPUs, type is vipu and model is mandatory. For IPUs, model stores the size of the partitions present on the node. For example, gres=vipu:2:8 means there are 8 IPU partitions configured with a size of 2 IPUs.
...
Epilog=/usr/local/slurm/scripts/epilog.sh
...
Prolog=/usr/local/slurm/scripts/prolog.sh
...
TaskProlog=/usr/local/slurm/scripts/task_prolog.sh
...
# Define GRES type
GresTypes=vipu
# Define GRES for each node; model is the size in IPUs of the
# predefined partitions on that node
NodeName=host-1 gres=vipu:1:16 state=UNKNOWN
NodeName=host-2 gres=vipu:2:8 state=UNKNOWN
NodeName=host-3 gres=vipu:16:1 state=UNKNOWN
NodeName=host-4 gres=vipu:4:4 state=UNKNOWN
# Slurm partition for IPU nodes
PartitionName=ipu Nodes=host-[1-4]
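With this configuration in place, users request one of the preconfigured partitions via GRES. For example (a sketch), to get an interactive shell on a node with 2-IPU partitions:
srun --partition=ipu --gres=vipu:2:1 --pty bash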
For workloads to use preconfigured IPU partitions, it is necessary to track partition usage (which IPU partitions are free and which are allocated) and pass the information on allocated partitions to the job. A simple way of doing this is by moving files with the same names as the IPU partitions between two directories: one containing the free partitions (named free) and one containing the allocated partitions (named allocated). Exclusive access to these directories is guaranteed by the flock Linux utility. The prolog script takes a file from the free directory, moves it to the allocated directory, and creates a pointer file with the name of the allocated partition and the job ID. The epilog script reverses this process: first it reads the pointer file related to the job to get the name of the partition file, then it moves the partition file from allocated back to free.
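As a one-time setup sketch (the paths match the scripts below; the partition file names are assumptions and must match the partition IDs configured in V-IPU):
mkdir -p /opt/slurm-23.02.6/data/free /opt/slurm-23.02.6/data/allocated
touch /opt/slurm-23.02.6/data/allocation.lock
# One empty file per preconfigured partition, named after its partition ID
touch /opt/slurm-23.02.6/data/free/pt2-0 /opt/slurm-23.02.6/data/free/pt2-1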
#!/bin/bash
# Where allocated partitions are stored
ALLOC_DIR=/opt/slurm-23.02.6/data/allocated
# Where free partitions are stored
FREE_DIR=/opt/slurm-23.02.6/data/free
# Lock file guarding moves between the two directories
ALLOCATION_LOCK=/opt/slurm-23.02.6/data/allocation.lock
# Pointer file recording which partition this job was given
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOBID"
FLOCK="flock --verbose"
echo print checking configuration
# Require preallocated dirs
[ -d "$ALLOC_DIR" ] || exit 1
[ -d "$FREE_DIR" ] || exit 1
PARTITION_FILE="$(ls -1 "$FREE_DIR" | head -1)"
# No free files, exit with failure
[ -z "$PARTITION_FILE" ] && exit 1
PARTITION_FREE="$FREE_DIR/$PARTITION_FILE"
PARTITION_ALLOC="$ALLOC_DIR/$PARTITION_FILE"
# Move the partition file from free to allocated
$FLOCK "$ALLOCATION_LOCK" mv "$PARTITION_FREE" "$PARTITION_ALLOC"
echo "$PARTITION_FILE" > "$PARTITION"
When the number of configured IPU partitions is specified with the gres count, Slurm will not schedule more than count partitions per node, which prevents scheduling jobs on nodes with exhausted resources.
#!/bin/bash
# Where allocated partitions are stored (absolute paths, matching the prolog)
ALLOC_DIR=/opt/slurm-23.02.6/data/allocated
# Where free partitions are stored
FREE_DIR=/opt/slurm-23.02.6/data/free
# Lock file guarding moves between the two directories
ALLOCATION_LOCK=/opt/slurm-23.02.6/data/allocation.lock
# Pointer file created by the prolog for this job
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOB_ID"
FLOCK="flock --verbose"
[ -d "$ALLOC_DIR" ] || exit 0
[ -d "$FREE_DIR" ] || exit 0
# No pointer file means no partition was allocated for this job
[ -f "$PARTITION" ] || exit 0
PARTITION_FILE="$(cat "$PARTITION")"
PARTITION_FREE="$FREE_DIR/$PARTITION_FILE"
PARTITION_ALLOC="$ALLOC_DIR/$PARTITION_FILE"
# Move the partition file from allocated back to free
$FLOCK "$ALLOCATION_LOCK" mv "$PARTITION_ALLOC" "$PARTITION_FREE"
rm -f "$PARTITION"
Lastly, the file with the allocated partition is the interface for the task_prolog script, which takes the allocated partition name and sets the IPUOF_ environment variables for the job.
#!/bin/sh
echo export IPUOF_VIPU_API_HOST=vipu-host
echo export IPUOF_VIPU_API_PORT=8090
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOBID"
# Export the preallocated partition ID so the job can use it
echo "export IPUOF_VIPU_API_PARTITION_ID=$(cat "$PARTITION")"
Slurm treats the GRES model as optional: it is processed only when both model and count are specified. In any other case, the number following the type is treated as count. GRES is a user-provided argument and there are a few cases where Slurm will do something different than expected:
- When model and count are empty (--gres=vipu): Slurm will pick any machine with any vipu GRES present and assume the count to be one.
- When count is empty (for example, --gres=vipu:2): Slurm will treat model as count and pick machines with vipu GRES with the desired count (2 in this case).
This is undesirable, and the simplest way to mitigate it is to use a JobSubmit plugin. We use the Lua JobSubmit plugin. To set it up, add lua to the JobSubmitPlugins option in slurm.conf and place job_submit.lua in the same directory as slurm.conf. The script below only handles the two cases mentioned above and assumes Slurm will fail when a malformed GRES string is given (for example, vipu:::).
function string:split(sep)
    local sep, fields = sep or ":", {}
    local pattern = string.format("([^%s]+)", sep)
    self:gsub(pattern, function(c) fields[#fields+1] = c end)
    return fields
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    local _gres = job_desc.gres or ''
    if _gres ~= '' then
        for _tres in string.gmatch(_gres, '([^,]+)') do
            if string.find(_tres, '^gres:vipu') then
                local _vipu_tres = _tres:split(':')
                local _vipu_partition_size = _vipu_tres[3]
                local _vipus_requested = _vipu_tres[4]
                if _vipu_partition_size == nil then
                    slurm.log_user("ERROR: Undefined IPU partition size, use vipu:<partition_size>:<count>")
                    return slurm.FAILURE
                end
                if _vipus_requested == nil or not (tonumber(_vipus_requested) == 1) then
                    local _new_vipu_tres = string.format("%s:%s:%s:1", _vipu_tres[1], _vipu_tres[2], _vipu_tres[3])
                    slurm.log_user("WARN: Incorrect GRES correcting '%s' to: '%s'", _tres, _new_vipu_tres)
                    job_desc.gres = string.gsub(_gres, _tres, _new_vipu_tres)
                end
            end
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

return slurm.SUCCESS
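To enable the script, add the plugin to slurm.conf (a sketch; the option may already list other plugins on your system):
JobSubmitPlugins=lua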
8.4. Preconfigured partition: single reconfigurable dynamic partition
Using a single reconfigurable dynamic partition is the most straightforward solution for integration of IPUs with Slurm, but it sacrifices security for simplicity and has other severe limitations (see the section on Partitions in the V-IPU User Guide for more information).
See Preconfigured partition: single reconfigurable dynamic partition in the V-IPU User Guide for more information about how to use this method.
The administrator can create reconfigurable partitions of up to 64 IPUs when a cluster is set up. Partitions will be static and persistent across user jobs. To expose these partitions to users, the environment variables IPUOF_VIPU_API_HOST and IPUOF_VIPU_API_PARTITION_ID must be set to run code on IPUs. Usually this can be achieved with the task_prolog_ipu.sh script.
# Lines echoed as "export ..." are applied to the task environment by TaskProlog
echo export IPUOF_VIPU_API_HOST=vipuhost
echo export IPUOF_VIPU_API_PARTITION_ID=big_reconfigurable_partition
# Optional: only needed when vipu-server does not use the default port 8090
echo export IPUOF_VIPU_API_PORT=8090
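With this TaskProlog configured, every task inherits the variables. A quick check from a user job (a sketch):
srun env | grep IPUOF_VIPU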
8.5. Graphcore-modified Slurm with IPU resource selection plugin
See Graphcore-modified Slurm with IPU resource selection plugin in the V-IPU User Guide for more information about how to use this method.
8.5.1. Configuring Slurm to use the V-IPU select plugin
Note
This document assumes that you have installed the Graphcore pre-compiled Slurm binaries with V-IPU plugin support, or that you have already patched and recompiled your Slurm installation with V-IPU support. Slurm binaries and source code are available from the Graphcore Downloads page.
To enable V-IPU resource selection in Slurm, you need to configure the SelectType as select/vipu in the Slurm configuration. The V-IPU Slurm plugin is a layered plugin, which means it can enable V-IPU support for existing resource selection plugins. Options pertaining to the selected secondary resource selection plugin can be specified under SelectTypeParameters.
You must also set PropagateResourceLimitsExcept to MEMLOCK. This prevents host memory limits being propagated to the job, which could cause failures when initialising the IPU.
The following is an example of the Slurm configuration enabling the V-IPU resource selection plugin layered on top of a consumable resource allocation plugin (select/other_cons_tres) with the CPU as a consumable resource:
SelectType=select/vipu
SelectTypeParameters=other_cons_tres,CR_CPU
PropagateResourceLimitsExcept=MEMLOCK
For the SelectTypeParameters supported by each of the existing resource selection plugins, refer to the Slurm documentation.
8.5.2. Configuration parameters
Configuration parameters for the V-IPU resource selection plugin are set in separate configuration files that need to be stored in the same directory as slurm.conf. The default configuration file is named vipu.conf. Moreover, administrators can configure additional GRES models for the V-IPU representing different V-IPU clusters. For the additional GRES models, configuration files are named after the model. For instance, a GRES model named pod1 needs a corresponding configuration file named pod1.conf in the Slurm configuration directory.
The following configuration options are supported:
ApiHost: The host name or IP address for the V-IPU controller.
ApiPort: The port number for the V-IPU controller. Default port is 8090.
IpuofDir: The directory where IPUoF configuration files for user jobs will be stored.
MaxIpusPerJob: Maximum number of IPUs allowed per job. Should not exceed the size of the Pod. The default value is 256.
ApiTimeout: Timeout in seconds for the V-IPU client. The default value is 50.
ForceDeletePartition: Set to 1 to specify forced deletion of partition in case of failures. The default value is 0.
UseReconfigPartition: Set to 1 to specify that reconfigurable partitions should be created. The default value is 0.
In addition, slurm.conf should contain the following configuration to allow sharing the IPUoF configuration files needed by the Graphcore Poplar SDK:
VipuIpuofDir: Path to shared storage location writable by scheduler, and readable by all nodes and user accounts.
8.5.3. The V-IPU GRES plugin
To enable the V-IPU GRES plugin, add vipu to the list of GRES types defined for the Slurm cluster.
GresTypes=vipu
In addition, for each node that can access a V-IPU resource, the following node GRES configuration must be added:
Gres=vipu:<GRES_MODEL>:no_consume:<max partition size>
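For example (an illustrative node entry using the pod64 model, matching the example configuration in the next section):
NodeName=ipu-pod64-001 Gres=vipu:pod64:no_consume:64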
8.5.4. An example Slurm Controller configuration
Note
Note that the following settings will override or take precedence over any values configured in your existing slurm.conf configuration file.
In the following, we outline an example of using the V-IPU plugin to configure a Slurm cluster containing a single IPU-POD64, with 4 compute nodes that have shared access to a directory /home/ipuof. The GRES model is named pod64, and a V-IPU Controller is running on the first node using the default port without mTLS. Node names are assumed to be ipu-pod64-001 through ipu-pod64-004.
1. At the end of slurm.conf, add the following line:

Include v-ipu-plugin.conf

2. Create a file called v-ipu-plugin.conf in the same directory as slurm.conf, containing the following parameters:

SelectType=select/vipu
SelectTypeParameters=other_cons_tres,CR_CPU
PropagateResourceLimitsExcept=MEMLOCK
VipuIpuofDir=/home/ipuof
GresTypes=vipu
NodeName=ipu-pod64-001 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
NodeName=ipu-pod64-002 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
NodeName=ipu-pod64-003 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
NodeName=ipu-pod64-004 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
PartitionName=v-ipu Nodes=ipu-pod64-00[1-4] Default=NO MaxTime=INFINITE State=UP
3. Create a file called vipu.conf in the same directory as slurm.conf, containing the following parameters:

ApiHost=ipu-pod64-001
ApiPort=8090
IpuofDir=/home/ipuof
MaxIpusPerJob=64
4. Create a symbolic link to the vipu.conf file, called pod64.conf, in the same directory as slurm.conf.
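For example, from the Slurm configuration directory:
ln -s vipu.conf pod64.conf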
8.6. Troubleshooting
Table 8.1 lists some possible issues and how to resolve them.
Issue | Action | Possible solution
---|---|---
Slurm job hangs or is constantly requeued | Check prolog file permissions | Set the correct permissions and owner
IPUs are not detected | Check that the IPUOF environment variables are set | Export the variables and contact the administrator
Error that a Pod agent is not registered | Check the mapping file for spaces | Remove the spaces