8. Integration with Slurm

Slurm is a popular open-source cluster management and job scheduling system. For more information, refer to the Slurm Workload Manager Quick Start User Guide.

This section describes the configuration for the different methods of integrating V-IPU with Slurm.

For information on using Slurm with V-IPU, refer to the Integration with Slurm section in the V-IPU User Guide.

8.1. Overview of integration options

There are four options for integrating V-IPU with Slurm:

  1. host-IPU mapping (recommended)

  2. using multiple preconfigured static partitions

  3. using a single preconfigured reconfigurable dynamic partition

  4. Graphcore-modified Slurm with IPU resource selection plugin

The pros and cons of each option are summarised in the table Pros and cons for the different options to integrate V-IPU into Slurm in the V-IPU User Guide.

8.3. Preconfigured partition: multiple static partitions

Using multiple static partitions adds user separation, but non-reconfigurable partitions can waste resources, and users are limited to the partition sizes that have been preconfigured.

See Preconfigured partition: multiple static partitions in the V-IPU User Guide for more information about how to use this method.

When a basic separation of user workloads is needed, multiple preconfigured partitions can be used. This is done with generic resources (GRES). First, configure Slurm with the available GRES types using the GresTypes option. Next, assign GRES to the nodes (in the NodeName definition); this can be done in gres.conf or slurm.conf (for simplicity this guide uses slurm.conf).

The value for gres has the format type:model:count. In general only type is mandatory: model can be any string used to differentiate resource sub-kinds, and count is a positive number. For IPUs, type is vipu and model is also mandatory, because model stores the size of the partitions present on the node. For example, gres=vipu:2:8 means the node has 8 IPU partitions, each of size 2 IPUs.

Listing 8.7 Example of configuring GRES in slurm.conf
...
Epilog=/usr/local/slurm/scripts/epilog.sh
...
Prolog=/usr/local/slurm/scripts/prolog.sh
...
TaskProlog=/usr/local/slurm/scripts/task_prolog.sh
...
# Define GRES type
GresTypes=vipu

# Define GRES for each node: model is the partition size (IPUs per
# partition) and count is the number of such partitions
NodeName=host-1 gres=vipu:1:16 state=UNKNOWN
NodeName=host-2 gres=vipu:2:8  state=UNKNOWN
NodeName=host-3 gres=vipu:16:1 state=UNKNOWN
NodeName=host-4 gres=vipu:4:4  state=UNKNOWN

# Slurm partition for IPU nodes
PartitionName=ipu Nodes=host-[1-4]
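
With this configuration in place, a user requests one preconfigured partition of a given size by passing the full type:model:count GRES string. For example, the following submission (the program name is illustrative) asks for a single 2-IPU partition on the ipu Slurm partition:

srun --partition=ipu --gres=vipu:2:1 ./my_ipu_program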

For workloads to use preconfigured IPU partitions, it is necessary to track partition usage (which IPU partitions are free and which are allocated) and to pass the name of the allocated partition to the job. A simple way of doing this is to keep one file per IPU partition, named after the partition, and move it between two directories: free, holding the partitions that are available, and allocated, holding the partitions in use. Exclusive access to these directories is guaranteed with the Linux flock utility. The prolog script takes a file from the free directory, moves it to the allocated directory and creates a pointer file, named after the job ID, containing the name of the allocated partition. The epilog script reverses this process: it reads the pointer file for the job to get the name of the partition file, then moves the partition file from allocated back to free.
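
This scheme assumes that the free and allocated directories exist and that free has been seeded with one file per preconfigured V-IPU partition before any jobs run. A minimal setup sketch is shown below; the data directory and the partition names (pt2_0, pt2_1, ...) are assumptions and must match the partitions actually created on the V-IPU controller.

#!/bin/bash
# One-time setup on the node running the prolog/epilog scripts;
# the paths must match those used in the scripts below
DATA_DIR=/opt/slurm-23.02.6/data
mkdir -p "$DATA_DIR/free" "$DATA_DIR/allocated"
touch "$DATA_DIR/allocation.lock"
# Seed one file per preconfigured partition; the names are examples only
for p in pt2_0 pt2_1 pt2_2; do
    touch "$DATA_DIR/free/$p"
done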

Listing 8.8 Example prolog script to use preconfigured IPU partitions
#!/bin/bash

# Where all allocated partitions are stored
ALLOC_DIR=/opt/slurm-23.02.6/data/allocated
# Where all free partitions are stored
FREE_DIR=/opt/slurm-23.02.6/data/free
# Lock file serialising access to the two directories
ALLOCATION_LOCK=/opt/slurm-23.02.6/data/allocation.lock
# Pointer file recording which partition this job has been given
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOBID"
FLOCK="flock --verbose"

echo "Checking prolog configuration"
# The free and allocated directories must exist
[ -d "$ALLOC_DIR" ] || exit 1
[ -d "$FREE_DIR" ]  || exit 1

PARTITION_FILE="$(ls -1 "$FREE_DIR" | head -1)"

# No free files, exit with failure
[ -z "$PARTITION_FILE" ] && exit 1
PARTITION_FREE="$FREE_DIR/$PARTITION_FILE"
PARTITION_ALLOC="$ALLOC_DIR/$PARTITION_FILE"
# Move the partition file from free to allocated; fail if it was taken in the meantime
$FLOCK "$ALLOCATION_LOCK" mv "$PARTITION_FREE" "$PARTITION_ALLOC" || exit 1

echo "$PARTITION_FILE" > "$PARTITION"

Because the number of preconfigured IPU partitions on each node is specified as the GRES count, Slurm will never allocate more than count partitions per node, which prevents jobs from being scheduled on nodes whose partitions are all in use.
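
You can verify that the GRES definitions have been picked up by listing the generic resources Slurm sees on each node, for example:

sinfo -p ipu -N -o "%N %G"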

Listing 8.9 Example epilog script to use preconfigured IPU partitions
#!/bin/bash

# Where all allocated partitions are stored
ALLOC_DIR=/opt/slurm-23.02.6/data/allocated
# Where all free partitions are stored
FREE_DIR=/opt/slurm-23.02.6/data/free
# Lock file serialising access to the two directories
ALLOCATION_LOCK=/opt/slurm-23.02.6/data/allocation.lock
# Pointer file written by the prolog script
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOB_ID"
FLOCK="flock --verbose"

[ -d "$ALLOC_DIR" ] || exit 0
[ -d "$FREE_DIR" ]  || exit 0
# No pointer file means no partition was allocated for this job
[ -f "$PARTITION" ] || exit 0
PARTITION_FILE="$(cat "$PARTITION")"
PARTITION_FREE="$FREE_DIR/$PARTITION_FILE"
PARTITION_ALLOC="$ALLOC_DIR/$PARTITION_FILE"
# Move the partition file from allocated back to free
$FLOCK "$ALLOCATION_LOCK" mv "$PARTITION_ALLOC" "$PARTITION_FREE"

rm -f "$PARTITION"

Lastly, the pointer file containing the allocated partition name is the interface for the task_prolog script, which reads the name and sets the IPUOF_ environment variables for the job.

Listing 8.10 Example task prolog script
#!/bin/sh

echo export IPUOF_VIPU_API_HOST=vipu-host
echo export IPUOF_VIPU_API_PORT=8090
PARTITION="/opt/slurm-23.02.6/data/p$SLURM_JOBID"
# Export the allocated partition name so the job knows which partition to use
echo "export IPUOF_VIPU_API_PARTITION_ID=$(cat "$PARTITION")"

Slurm treats the GRES model as optional, and it is only recognised when both model and count are specified. In any other case, the number following the type is treated as count. Because the GRES string is a user-provided argument, there are a few cases where Slurm will do something different from what was intended:

  • when model and count are empty (--gres=vipu): Slurm will pick any machine with any vipu GRES present and assume count to be one.

  • when count is empty (for example, --gres=vipu:2): Slurm will treat model as count and pick machines with vipu GRES with the desired count (2 in this case).

This is undesirable, and the simplest way to mitigate it is with a job submit plugin; here we use the Lua JobSubmit plugin. To set it up, add lua to the JobSubmitPlugins option in slurm.conf and place job_submit.lua in the same directory as slurm.conf. The script below handles only the two cases mentioned above and assumes Slurm will fail when a malformed GRES string is given (for example, vipu:::).
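
In slurm.conf this corresponds to the single line:

JobSubmitPlugins=lua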

Listing 8.11 Example of job submit plugin code to mitigate GRES configuration problems
function string:split(sep)
    local sep, fields = sep or ":", {}
    local pattern = string.format("([^%s]+)", sep)
    self:gsub(pattern, function(c) fields[#fields+1] = c end)
    return fields
end

function slurm_job_submit(job_desc, part_list, submit_uid)
    local _gres = job_desc.gres or ''

    if _gres ~= '' then
        for _tres in string.gmatch(_gres, '([^,]+)') do
            if string.find (_tres, '^gres:vipu') then
                local _vipu_tres = _tres:split(':')

                local _vipu_partition_size = _vipu_tres[3]
                local _vipus_requested = _vipu_tres[4]
                if _vipu_partition_size == nil then
                    slurm.log_user("ERROR: Undefined IPU partition size, use vipu:<partition_size>:<count>")
                    return slurm.FAILURE
                end

                if _vipus_requested == nil or not ( tonumber(_vipus_requested) == 1 ) then
                    local _new_vipu_tres = string.format("%s:%s:%s:1", _vipu_tres[1], _vipu_tres[2], _vipu_tres[3])
                    slurm.log_user("WARN: Incorrect GRES correcting '%s' to: '%s'", _tres, _new_vipu_tres)
                    job_desc.gres = string.gsub(_gres, _tres, _new_vipu_tres)
                end
            end
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end

return slurm.SUCCESS

8.4. Preconfigured partition: single reconfigurable dynamic partition

Using a single reconfigurable dynamic partition is the most straightforward way to integrate IPUs with Slurm, but it sacrifices security for simplicity and has other severe limitations (see the section on Partitions in the V-IPU User Guide for more information).

See Preconfigured partition: single reconfigurable dynamic partition in the V-IPU User Guide for more information about how to use this method.

The administrator can create a reconfigurable partition of up to 64 IPUs when the cluster is set up. The partition is static and persists across user jobs. To expose it to users, the environment variables IPUOF_VIPU_API_HOST and IPUOF_VIPU_API_PARTITION_ID must be set before code can run on the IPUs. This is usually done with a task_prolog_ipu.sh script.

Listing 8.12 Example of task_prolog_ipu.sh script
#!/bin/sh

# Lines printed in the form "export NAME=value" are added to the
# environment of the user's job tasks
echo export IPUOF_VIPU_API_HOST=vipuhost
echo export IPUOF_VIPU_API_PARTITION_ID=big_reconfigurable_partition
# Optional: only needed if vipu-server is not using the default port 8090
echo export IPUOF_VIPU_API_PORT=8090
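
For this script to take effect it must be registered as the task prolog in slurm.conf, in the same way as task_prolog.sh in Listing 8.7; the path below is an example and should match wherever the script is installed:

TaskProlog=/usr/local/slurm/scripts/task_prolog_ipu.sh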

8.5. Graphcore-modified Slurm with IPU resource selection plugin

See Graphcore-modified Slurm with IPU resource selection plugin in the V-IPU User Guide for more information about how to use this method.

8.5.1. Configuring Slurm to use the V-IPU select plugin

Note

This document assumes that you have installed the Graphcore pre-compiled Slurm binaries with V-IPU plugin support, or that you have already patched and recompiled your Slurm installation with V-IPU support. Slurm binaries and source code are available from the Graphcore Downloads page.

To enable V-IPU resource selection in Slurm, set SelectType to select/vipu in the Slurm configuration. The V-IPU Slurm plugin is a layered plugin, which means it adds V-IPU support on top of an existing resource selection plugin. Options for the selected secondary resource selection plugin can be specified with SelectTypeParameters.

You must also set PropagateResourceLimitsExcept to MEMLOCK. This prevents host memory limits from being propagated to the job, which could otherwise cause failures when initialising the IPUs.

The following is an example of the Slurm configuration enabling the V-IPU resource selection plugin layered on top of a consumable resource allocation plugin (select/other_cons_tres) with the CPU as a consumable resource:

SelectType=select/vipu
SelectTypeParameters=other_cons_tres,CR_CPU
PropagateResourceLimitsExcept=MEMLOCK

For SelectTypeParameters supported by each of the existing resource selection plugins, refer to the Slurm documentation.

8.5.2. Configuration parameters

Configuration parameters for the V-IPU resource selection plugin are set in separate configuration files that must be stored in the same directory as slurm.conf. The default configuration file is named vipu.conf. Administrators can also configure additional GRES models for the V-IPU, representing different V-IPU clusters. Each additional GRES model is configured in a file named after the model: for instance, a GRES model pod1 needs a corresponding configuration file named pod1.conf in the Slurm configuration directory.

The following configuration options are supported:

  • ApiHost: The host name or IP address for the V-IPU controller.

  • ApiPort: The port number for the V-IPU controller. Default port is 8090.

  • IpuofDir: The directory where IPUoF configuration files for user jobs will be stored.

  • MaxIpusPerJob: The maximum number of IPUs allowed per job. This should not exceed the size of the IPU-POD. The default value is 256.

  • ApiTimeout: Timeout in seconds for the V-IPU client. The default value is 50.

  • ForceDeletePartition: Set to 1 to force deletion of the partition in case of failures. The default value is 0.

  • UseReconfigPartition: Set to 1 to specify that reconfigurable partitions should be created. The default value is 0.

In addition, slurm.conf should contain the following configuration to allow sharing of the IPUoF configuration files needed by the Graphcore Poplar SDK:

  • VipuIpuofDir: Path to a shared storage location that is writable by the scheduler and readable by all nodes and user accounts.
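
Putting these options together, a vipu.conf for a hypothetical 16-IPU cluster with a longer timeout and forced partition deletion enabled might look as follows (the host name, directory and values are examples only):

ApiHost=vipu-ctrl-1
ApiPort=8090
IpuofDir=/home/ipuof
MaxIpusPerJob=16
ApiTimeout=100
ForceDeletePartition=1
UseReconfigPartition=0

The corresponding slurm.conf would then contain VipuIpuofDir=/home/ipuof, pointing at the same shared directory.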

8.5.3. The V-IPU GRES plugin

To enable the V-IPU GRES plugin, add vipu to the list of GRES types defined for the Slurm cluster.

GresTypes=vipu

In addition, for each node that can access a V-IPU resource, the following node GRES configuration must be added:

Gres=vipu:<GRES_MODEL>:no_consume:<max partition size>
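
For example, a node attached to a cluster exposed through a GRES model pod1 with a maximum partition size of 16 IPUs would carry the following definition (the node name is illustrative; other node attributes such as CPUs and RealMemory are omitted):

NodeName=host-1 State=UNKNOWN Gres=vipu:pod1:no_consume:16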

8.5.4. An example Slurm Controller configuration

Note

The following settings will override or take precedence over any values configured in your existing slurm.conf file.

In the following, we outline an example of using the V-IPU plugin to configure a Slurm cluster containing a single IPU-POD64, with 4 compute nodes that have shared access to a directory /home/ipuof. The GRES model is named pod64 and a V-IPU Controller is running on the first node, using the default port and without mTLS.

Node names are assumed to be ipu-pod64-001 through ipu-pod64-004.

  1. At the end of the slurm.conf, add the following line:

    Include v-ipu-plugin.conf
    
  2. Create a file called v-ipu-plugin.conf in the same directory as the slurm.conf containing the following parameters:

    SelectType=select/vipu
    SelectTypeParameters=other_cons_tres,CR_CPU
    PropagateResourceLimitsExcept=MEMLOCK
    VipuIpuofDir=/home/ipuof
    GresTypes=vipu
    
    NodeName=ipu-pod64-001 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
    NodeName=ipu-pod64-002 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
    NodeName=ipu-pod64-003 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
    NodeName=ipu-pod64-004 State=UNKNOWN Gres=vipu:pod64:no_consume:64 CPUs=96 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 RealMemory=760000 TmpDisk=4760000
    
    PartitionName=v-ipu Nodes=ipu-pod64-00[1-4] Default=NO MaxTime=INFINITE State=UP
    
  3. Create a file called vipu.conf in the same directory as slurm.conf containing the following parameters:

    ApiHost=ipu-pod64-001
    ApiPort=8090
    IpuofDir=/home/ipuof
    MaxIpusPerJob=64
    
  4. Create a symbolic link to the vipu.conf file called pod64.conf in the same directory as the slurm.conf.
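
    For example, from the Slurm configuration directory (the directory containing slurm.conf):

    ln -s vipu.conf pod64.conf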

8.6. Troubleshooting

Table 8.1 lists some possible issues and how to resolve them.

Table 8.1 Troubleshooting tips

Issue: Slurm job hangs or is constantly requeued
Action: Check the prolog file permissions
Possible solution: Set the correct permissions and owner

Issue: IPUs are not detected
Action: Check that the environment variables IPUOF_VIPU_API_PARTITION_ID, IPUOF_VIPU_API_HOST and IPUOF_VIPU_API_PORT are set and exported
Possible solution: Export the variables and contact your administrator

Issue: Error that the pod agent is not registered
Action: Check whether ipum.allocs contains any whitespace
Possible solution: Remove the spaces
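
For the first issue, the prolog, epilog and task prolog scripts must be readable and executable on every compute node. Assuming the script locations from Listing 8.7, a typical fix (adjust the owner to your site's policy) is:

chown root:root /usr/local/slurm/scripts/{prolog.sh,epilog.sh,task_prolog.sh}
chmod 755 /usr/local/slurm/scripts/{prolog.sh,epilog.sh,task_prolog.sh}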