5. Partitions

5.1. Overview

This section gives more detailed information about the types of partition which can be provisioned using Virtual-IPU. This information is not normally necessary for running existing applications, but may be valuable to those developing new applications or looking to better understand the provisioning process.

A Virtual-IPU partition represents some number of IPUs which can communicate with one another. Partitions are isolated from each other: communication with physically neighbouring devices that are not in the same partition is prohibited.

The vipu create partition command communicates with the Virtual-IPU controller to allocate some unused IPUs from the resource pool and configure them so that they can be accessed. The size of the partition represents the number of IPUs which are required to run the application. The size of a partition must be a power of 2.

Note that, by default, the provisioned devices are only visible to the user who created the partition. You can override this if, for example, a device should be visible to all users on the system (see Section 5.8.2, Sharing access to devices).

The result of provisioning a partition is a configuration file which is read by the IPUoF client in Poplar. This allows the provisioned IPUs to be accessed remotely over the RDMA network.

You can configure each partition as one or more subsets that will be targeted by a Poplar application. Each such subset is called a graph compilation domain (GCD). See Section 5.11, Multi-GCD partitions for more information.

By default, a partition contains a single GCD.

5.2. Creating a reconfigurable partition

Note

Reconfigurable partitions are only supported with the “c2-compatible” sync type. Any create partition command with --reconfigurable set defaults to the “c2-compatible” sync type. Specifying a different sync type with the --sync-type flag will fail, unless the --index flag is also set; in that case the partition will silently be created with the “c2-compatible” sync type.

You can create a “reconfigurable” partition which makes the IPUs available in a flexible way. This is the preferred method for small scale systems (up to one IPU-POD64 or one Bow Pod64).

A device ID will be assigned to each individual IPU. In addition, if the partition contains more than one IPU, multi-IPU device IDs will be assigned to every subset of IPUs, as shown below.

You can create a reconfigurable partition with a command such as the following:

vipu create partition pt --size 4 --reconfigurable

This allocates four IPUs to a partition called “pt”.

You can see the IPUs that are now available using the gc-info command that was installed with the SDK:

$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]
-+- Id: [1], type:    [Fabric], PCI Domain: [2]
-+- Id: [2], type:    [Fabric], PCI Domain: [1]
-+- Id: [3], type:    [Fabric], PCI Domain: [0]
-+- Id: [4], type: [Multi IPU]
 |--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
-+- Id: [5], type: [Multi IPU]
 |--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
 |--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]
-+- Id: [6], type: [Multi IPU]
 |--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
 |--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
 |--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]

Here, the four individual IPUs have the IDs 0 to 3.

The multi-IPU devices are listed below that. A multi-IPU device always represents a number of IPUs which is a power of two. Here there are three multi-IPU devices:

  • ID 4 contains IPUs 0 and 1

  • ID 5 contains IPUs 2 and 3

  • ID 6 contains all four IPUs

In contrast to preconfigured partitions (see Section 5.3, Creating a preconfigured partition), all the supported device subsets are available and can be attached to. Note that creating reconfigurable partitions with large IPU clusters can yield a very large number of combinations (127 devices for a 64 IPU cluster).
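The number of available devices grows quickly because a reconfigurable partition of n IPUs (n a power of two) exposes one device ID per IPU plus one per power-of-two subset, for n + n/2 + … + 1 = 2n − 1 device IDs in total. This can be sketched with a small helper (device_count is hypothetical, for illustration only; it is not part of the V-IPU tools):

```shell
# device_count: total device IDs exposed by a reconfigurable partition
# of a given power-of-two size: one per single IPU, plus one per
# power-of-two subset (pairs, quads, ...). Sums n + n/2 + ... + 1.
device_count() {
  n=$1
  total=0
  while [ "$n" -ge 1 ]; do
    total=$((total + n))
    n=$((n / 2))
  done
  echo "$total"
}

device_count 4    # 7 device IDs, as in the gc-info listing above
device_count 64   # 127 device IDs for a 64 IPU cluster
```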

Different Poplar programs can make use of these devices concurrently. A device ID that is used by a Poplar program is removed from the list of available device IDs while that Poplar program is running.

5.2.1. Limitations

A reconfigurable partition has the following limitations:

  • Partitions containing multiple GCDs cannot be provisioned as reconfigurable. This is due to the requirement that some IPU configuration must be co-ordinated across all GCDs in the partition.

  • Partitions with greater than 64 IPUs cannot be provisioned as reconfigurable.

  • While IPUs are isolated from IPUs in different partitions, they are not isolated from one another within the partition. Multiple applications sharing a partition can potentially interfere with one another, including receiving traffic from neighbouring devices in the same partition.

  • When multiple applications are contending for device groups concurrently, it is possible that applications cannot attach to devices reliably. This is due to how groups of IPUs are acquired individually by the application stack.

  • If a larger partition is created than is necessary for an application, such that not all the IPUs in the partition are required, then the remaining IPUs are not returned to the Virtual-IPU and made available for provisioning to other partitions or tenants. This can lead to poor system utilisation.

5.3. Creating a preconfigured partition

For large-scale systems, with multiple logical racks and multiple GCDs, you will need to create a preconfigured partition:

$ vipu create partition pt --size 1

This will make a set of IPU devices visible to the current user. These will appear as “Fabric” IPU devices in the output of gc-info:

$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]

When you create a partition that contains more than a single IPU, then a corresponding multi-IPU device is also created. A multi-IPU device always represents a number of IPUs which is a power of two. See the examples below.

$ vipu create partition pt --size 2

$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]
-+- Id: [1], type:    [Fabric], PCI Domain: [2]
-+- Id: [3], type: [Multi IPU]
 |--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]

Here, the two individual IPUs have the IDs 0 and 1. A multi-IPU device, with device ID 3, that contains the two IPUs is shown below that.

$ vipu create partition pt --size 4

$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]
-+- Id: [1], type:    [Fabric], PCI Domain: [2]
-+- Id: [2], type:    [Fabric], PCI Domain: [1]
-+- Id: [3], type:    [Fabric], PCI Domain: [0]
-+- Id: [4], type: [Multi IPU]
 |--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
 |--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
 |--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]

In this example, the four individual IPUs have the IDs 0 to 3. The multi-IPU device (with ID 4) contains the four IPUs.

By default, IPUs provisioned by V-IPU have IPU-Link routing and Sync-Link preconfigured across the entire partition. Smaller subsets of the partition such as a single IPU or pairs of IPUs are not available. Attempting to run code using a single IPU on the example partition defined above will fail with an error, because the partition is configured for all four IPUs to synchronise. Attaching to individual devices will also not behave as you might expect if you have previously been using directly attached PCIe devices. To configure a device which behaves more like a direct attached PCIe device, or for more flexible usage of subsets of IPUs, see Section 5.2, Creating a reconfigurable partition.

5.4. IPU selection

By default, the Virtual-IPU controller will attempt to allocate your IPU devices based on the requirements given. You can also choose IPUs explicitly to add to a partition. Bear in mind that if you do not have access to the IPUs requested, or if a partition is already using the IPUs, then the request will fail.

You can explicitly request IPUs by specifying the cluster in which the IPUs reside, and a unique index that identifies a particular group of IPUs. To fetch and display the available choices for a particular partition size, pass the --options option to the create partition command. Note that the list of available partition placements for a given size is cached on the client side, so the list must first be fetched before creating a partition using the index. The example below shows the available partition placements with four IPUs within a 16 IPU cluster:

$ vipu create partition p --options --size 4
 Index       | Cluster | ILDs | GW Routing | ILD Routing | Size | IPUs
--------------------------------------------------------------------------------------------------------------
 0           | cl00    | 1    | UNDEFINED  | DNC         | 4    | 0.0=ag00:0 0.1=ag00:1 0.2=ag00:2 0.3=ag00:3
 1           | cl00    | 1    | UNDEFINED  | DNC         | 4    | 0.0=ag01:0 0.1=ag01:1 0.2=ag01:2 0.3=ag01:3
 2           | cl00    | 1    | UNDEFINED  | DNC         | 4    | 0.0=ag02:0 0.1=ag02:1 0.2=ag02:2 0.3=ag02:3
 3           | cl00    | 1    | UNDEFINED  | DNC         | 4    | 0.0=ag03:0 0.1=ag03:1 0.2=ag03:2 0.3=ag03:3
 0           | cl01    | 1    | UNDEFINED  | DNC         | 4    | 0.0=ag00:0 0.1=ag00:1 0.2=ag00:2 0.3=ag00:3
--------------------------------------------------------------------------------------------------------------

Only valid partition placements for the active user are shown. If the IPUs are not assigned to the user, or are already in use, they are not shown. The valid partition placement options for a partition size are calculated using a partition mapping generation algorithm on the Virtual-IPU controller. The algorithm ensures that the selected IPUs satisfy the topology and routing requirements for the required size. In addition, restrictions are in place to make sure that IPU fragmentation is avoided as much as possible while still enabling manual IPU selection.

You can create a partition with a specific selection of IPUs as follows:

$ vipu create partition p --size 4 --cluster cl00 --index 2

This selects the four IPUs at index 2 of cluster cl00 in the table above.

5.5. IPU relocation

Partition creation is an asynchronous process consisting of two stages: IPU allocation and configuration. After a set of IPUs has been allocated for a partition, the configuration stage may still fail, for example due to an issue with the corresponding IPU-Machines. In such cases, the Virtual-IPU Controller will retry the configuration until the maximum number of retries (10 by default) has been reached before giving up. However, each retry will use the same set of IPUs allocated in the first stage.

Partition configuration errors can also occur at a later stage, if a partition gets into an ERROR state and is reset by the V-IPU Controller (see Section 5.6.1, Partition state). For many use cases it is useful to change the IPU allocation if the partition cannot reach the ACTIVE state with the initially allocated set of IPUs. To allow this, pass the --relocatable option to the create partition command. This indicates that the partition can be relocated to a different set of IPUs in case of configuration errors.

$ vipu create partition p --size 4 --relocatable

5.6. Displaying partition information

You can display detailed information about a partition with the get partition command:

$ vipu get partition pt1
------------------------------------------------------------
Partition ID         | pt1
Cluster              | cl0
GW-Link Routing      | N/A
IPU-Link Routing     | DNC
Number of ILDs       | 1
Number of IPUs       | 4
Number of GCDs       | 1
Reconfigurable       | false
State                | ACTIVE
Provisioning State   | IDLE
Intra-GCD Sync (GS1) | Replicas=1
Inter-GCD Sync (GS2) | UNDEFINED
Last Error           |
IPUs                 | 0=ag06:0 1=ag06:1 2=ag06:2 3=ag06:3
------------------------------------------------------------

5.6.1. Partition state

A partition can have one of several possible states during its lifetime. The Virtual-IPU controller continuously monitors all created partitions and marks them with the correct state, representing the usability of the partition at the time.

Table 5.1 Partition states

 Partition State | Description
-----------------+-------------------------------------------------------
 ACTIVE          | The partition is ready to be used by applications
 PENDING         | The partition is not yet ready
 REMOVED         | The partition is marked as removed and will be deleted
 ERROR           | An error occurred while getting the partition ready

The Virtual-IPU controller will periodically retry configuration of partitions which are in the ERROR state so that they become ready for use again. An example of a situation in which a partition can go to the ERROR state is a voluntary or involuntary reboot of one of the IPU-Machines contributing IPUs to the partition. In that case, once the IPU-Machine comes back up, the Virtual-IPU controller will execute the necessary configuration to get the IPUs ready to use and mark the partition as ACTIVE again. Other cases in which the V-IPU controller can mark a partition as being in the ERROR state include when an error is detected on one of the IPUs in the partition, and when a server-side IPUoF component (see Section 2, Concepts and architecture) has failed to start for one or more of the corresponding IPUs in the partition.

5.6.2. Provisioning state

While partition state refers to the current state a partition is in, the provisioning state field indicates what operation is currently running for the partition on the Virtual-IPU controller. When the partition state is PENDING or ERROR, more information about the partition can be obtained by looking at its provisioning state on the Virtual-IPU Controller.

Table 5.2 Partition provisioning states

 Provisioning State | Description
--------------------+---------------------------------------------------------------------------------------
 IDLE               | No operation is running on the partition
 CREATING           | The partition is being created
 REMOVING           | The partition is being removed
 RESETTING          | The partition is being reset
 FAILED             | An error occurred during the last operation (see the Last Error field in the partition information)

5.7. IPUoF environment variables

When you create a partition, Poplar needs to know the network endpoints to use to access the IPU devices, the topology of the devices and the configuration state. This information is used by Poplar to connect to the server-side IPUoF component (see Section 2, Concepts and architecture).

This information can most easily be provided by using the following environment variables.

Table 5.3 IPUoF configuration variables

 Variable                    | Description
-----------------------------+------------------------------------------------------------------------------
 IPUOF_VIPU_API_HOST         | The IP address of the server running the V-IPU controller. Required.
 IPUOF_VIPU_API_PORT         | The port to connect to the V-IPU controller. Optional. The default is 8090.
 IPUOF_VIPU_API_PARTITION_ID | The name of the partition to use. Required.
 IPUOF_VIPU_API_GCD_ID       | The ID of the GCD you want to use. Required for multi-GCD systems.
 IPUOF_VIPU_API_TIMEOUT      | The time-out for client calls in seconds. Optional. Default 200.
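For example, you might export these variables before running a Poplar application (the controller address and partition name below are placeholder values, not real ones):

```shell
# Placeholder values: substitute your own controller address and partition name
export IPUOF_VIPU_API_HOST=10.1.2.3        # V-IPU controller address (required)
export IPUOF_VIPU_API_PORT=8090            # optional; 8090 is the default
export IPUOF_VIPU_API_PARTITION_ID=pt1     # partition to attach to (required)
```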

5.8. IPUoF configuration files

The information that Poplar needs to access devices can also be passed via an IPUoF configuration file which is, by default, written to a directory in your home directory (~/.ipuof.conf.d/) when a partition is created on the command line:

$ vipu create partition mypt --size 4
create partition (mypt): success

$ ls ~/.ipuof.conf.d/
mypt.conf

Note

IPUoF configuration files are provided as a convenience. However, because of the difficulties of maintaining and sharing configuration files they are not the recommended way to manage access to IPUs. We recommend the use of environment variables, as described in Section 5.7, IPUoF environment variables.

Configuration files may be useful when the Poplar hosts do not have direct network access to the V-IPU controller (for security reasons, for example).

You can change the directory where the IPUoF configuration file is created, with the --ipuof-config-location option (see Section 8.3, Create partition) or the environment variable VIPU_CLI_IPUOF_CONFIG_LOCATION.

Note

IPUoF configuration files are not written for relocatable partitions, because the IPU configuration may change during the lifetime of such partitions, as described in Section 5.5, IPU relocation.

Poplar will search for an IPUoF configuration file in the following locations:

  • The file specified by IPUOF_CONFIG_PATH

  • $HOME/.ipuof.conf.d/

  • /etc/ipuof.conf.d/

Warning

There must only be one configuration file in the directory. If more than one file is present then they will all be silently ignored.

By default, the automatically-created file is named partition-name.conf but Poplar will read any file that it finds in that directory.
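The search order above can be expressed as a small sketch (this mimics the documented lookup order only; it is not Poplar's actual implementation, and the function name is hypothetical):

```shell
# Echo the IPUoF configuration search locations in priority order:
# an explicit IPUOF_CONFIG_PATH wins, then the per-user directory,
# then the system-wide directory.
ipuof_search_locations() {
  if [ -n "$IPUOF_CONFIG_PATH" ]; then
    echo "$IPUOF_CONFIG_PATH"
  fi
  echo "$HOME/.ipuof.conf.d/"
  echo "/etc/ipuof.conf.d/"
}
```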

5.8.1. Creating an IPUoF configuration file

If you do not have an IPUoF configuration file for the partition you want to use, you can create one using the output of the vipu get partition --ipuof-configs command (see Section 8.4, Get info about a partition). For multi-GCD partitions the --gcd option must also be provided to fetch the correct configuration.

5.8.2. Sharing access to devices

The default location of the IPUoF configuration file implies that a provisioned device is only visible to the current user. If the device should be made visible to all users on the system, then the IPUoF configuration file should be written to the directory (/etc/ipuof.conf.d/) where it can be found by all users:

$ vipu create partition mypt --ipuof-config-location /etc/ipuof.conf.d/ --size 4

In order to keep the IPUoF configuration file in sync with the system configuration, we recommend that you specify this location via the V-IPU configuration file (see Section 8.2.1, Using a configuration file) or a persistent environment variable, which is always set for the user session. For example:

$ export VIPU_CLI_IPUOF_CONFIG_LOCATION=/etc/ipuof.conf.d/
$ vipu create partition mypt --size 4
create partition (mypt): success

$ ls /etc/ipuof.conf.d/
mypt.conf

Note that the user who creates the system-wide partitions will need privileges to write to this directory. This can be achieved with, for example, group permissions:

$ groupadd ipugroup
$ mkdir -m775 /etc/ipuof.conf.d/
$ chgrp ipugroup /etc/ipuof.conf.d/

Warning

You must not modify these automatically created files.

5.9. Removing a partition

When you remove a partition, the associated configuration file is also removed. Because there must only be one configuration file present at a time, you should remove an existing partition before creating a new one:

$ vipu remove partition mypt
remove partition (mypt): success

$ ls ~/.ipuof.conf.d/

5.10. Routing configuration

The routing configuration defines how messages between IPUs will be routed via the available connections.

Table 5.4 lists the different types of routing configurations that can be used, depending on the size of the partition and the physical link topologies.

A default IPU-Link routing configuration will be chosen when the partition is created based on the size and topology of the cluster:

  • If the partition size is less than or equal to 16 it defaults to DNC

  • If the partition size is 16 or more and the partition occupies the whole cluster and the cluster topology has loopback cables enabled, it defaults to RINGSWNC

  • Otherwise it defaults to SWNC
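The rules above can be sketched as follows (the default_routing helper is hypothetical, for illustration only; in particular, giving the RINGSWNC rule precedence for a size of exactly 16 is an assumption):

```shell
# default_routing SIZE WHOLE_CLUSTER LOOPBACK
#   WHOLE_CLUSTER and LOOPBACK are 1 or 0.
#   Echoes the default IPU-Link routing per the documented rules.
default_routing() {
  size=$1; whole_cluster=$2; loopback=$3
  if [ "$size" -ge 16 ] && [ "$whole_cluster" -eq 1 ] && [ "$loopback" -eq 1 ]; then
    # whole cluster with loopback cables enabled
    # (assumed to take precedence at size 16)
    echo RINGSWNC
  elif [ "$size" -le 16 ]; then
    echo DNC
  else
    echo SWNC
  fi
}

default_routing 4 0 0    # DNC
default_routing 64 1 1   # RINGSWNC
default_routing 32 0 0   # SWNC
```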

The routing for IPU-Links can be specified with the --routing option (see Section 8.3, Create partition).

The application may also need to specify the routing. See the Target class in the Poplar and Poplibs API Reference for information on setting the routing used by an application.

Table 5.4 Link routing configurations

 Partition size | IPU-Link routing
----------------+--------------------
 1              | DNC
 2              | DNC
 4              | DNC
 8              | DNC
 16             | DNC
 32             | SWNC
 64             | SWNC, RINGSWNC [*]
 128+           | SWNC, RINGSWNC [*]

Key: IPU-Link routing options

  • BTNC: Barley Twist Network Configuration

  • DNC: Default Network Configuration

  • SWNC: Sliding Window Network Configuration

  • RINGSWNC: Ring with Sliding Window Network Configuration

[*] Note that some routing options are not available depending on the system size and link topology:

  • RINGSWNC requires a partition that uses an entire IPU-Link domain with torus topology and size of at least 16

  • BTNC requires a partition with torus topology and 1 IPU per replica

5.10.1. Intra-GCD sync configuration

You can adjust the configuration of the sync group (GS1) which is shared between IPUs within a GCD. This is typically used to synchronise execution steps between IPUs which are computing a single replica of a replicated graph. See the description of sync groups in the Poplar User Guide.

For example:

$ vipu create partition pt --size 8 --total-num-replicas 2

This will effectively divide the IPUs into two groups such that each group can sync independently of each other. The table below shows the IPUs present in each sync group for a variety of partition sizes.

Table 5.5 Sync configurations within a GCD (independent groups are separated by “;”)

 Size (#IPUs) | Replicas | GS1 independent IPU groups (IPU IDs)
--------------+----------+--------------------------------------
 1            | 1        | 0
 2            | 1        | 0 1
 4            | 1        | 0 1 2 3
 8            | 1        | 0 1 2 3 4 5 6 7
 2            | 2        | 0 ; 1
 4            | 2        | 0 1 ; 2 3
 8            | 2        | 0 1 2 3 ; 4 5 6 7
 4            | 4        | 0 ; 1 ; 2 ; 3
 8            | 4        | 0 1 ; 2 3 ; 4 5 ; 6 7
 8            | 8        | 0 ; 1 ; 2 ; 3 ; 4 ; 5 ; 6 ; 7

In order to sync all IPUs in the partition, the application should use the GS2 sync group.

Note that for single-GCD partitions, if you do not specify the number of replicas then the configuration defaults to synchronising all IPUs in the partition. This is equivalent to specifying a single replica. This means that the following two commands yield an identical configuration:

$ vipu create partition pt --size 8 --total-num-replicas 1
$ vipu create partition pt --size 8
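The grouping shown in Table 5.5 follows a simple pattern: each replica gets a contiguous block of size/replicas IPU IDs. This can be reproduced with a short sketch (the gs1_groups helper is hypothetical, for illustration only):

```shell
# gs1_groups SIZE REPLICAS
#   Echoes the GS1-independent IPU groups, one group per line.
#   Each replica receives a contiguous block of SIZE/REPLICAS IPU IDs.
gs1_groups() {
  size=$1; replicas=$2
  per=$((size / replicas))
  r=0
  while [ "$r" -lt "$replicas" ]; do
    group=""
    i=0
    while [ "$i" -lt "$per" ]; do
      group="$group $((r * per + i))"
      i=$((i + 1))
    done
    echo "${group# }"
    r=$((r + 1))
  done
}

gs1_groups 8 2
# prints:
# 0 1 2 3
# 4 5 6 7
```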

5.11. Multi-GCD partitions

Partitions, by default, yield a single multi-IPU device which can be used by a single Poplar process to run a machine learning application.

You can also specify that a partition is to be split into a number of GCDs.

The following example creates a partition with four IPUs and two GCDs:

$ vipu create partition pt --size 4 --num-gcds 2

The result of this is that the IPUs in the partition are split equally across the requested number of multi-IPU devices:

$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]
-+- Id: [1], type:    [Fabric], PCI Domain: [2]
-+- Id: [2], type:    [Fabric], PCI Domain: [1]
-+- Id: [3], type:    [Fabric], PCI Domain: [0]
-+- Id: [4], type: [Multi IPU]
 |--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
-+- Id: [5], type: [Multi IPU]
 |--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
 |--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]

This allows a separate Poplar process to attach to each subset of IPUs in the partition. These subsets of IPUs are referred to as GCDs. Each device is assigned a GCD ID, which is simply a number representing the position of the GCD in the partition. It is not normally necessary to inspect this number or any other meta-information about the devices, but it can be displayed if needed using the gc-info tool. You will also be able to see the total number of GCDs which were configured into the partition:

$ gc-info --device-info -d 4 | grep 'GCD'
  GCD Id: 0
  Num GCDs: 2

$ gc-info --device-info -d 5 | grep 'GCD'
  GCD Id: 1
  Num GCDs: 2

Note that for partitions with a single IPU per GCD, each individual Fabric IPU device represents a GCD; there is no multi-IPU device:

$ vipu create partition pt --size 2 --num-gcds 2

$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]
-+- Id: [1], type:    [Fabric], PCI Domain: [2]

$ gc-info --device-info -d 0 | grep 'GCD'
  GCD Id: 0
  Num GCDs: 2

$ gc-info --device-info -d 1 | grep 'GCD'
  GCD Id: 1
  Num GCDs: 2

5.11.1. Multi-host IPUoF configuration

The primary use case of these GCDs is to allow different Poplar workloads to be distributed over multiple servers, whilst still permitting the IPUs in the same partition to communicate with one another. Contrast this with creating multiple partitions, which are isolated and unable to communicate.

The recommended way of distributing the IPUoF configuration information to multiple hosts is to set the relevant environment variables on every host. You would specify the same values for the host, port and partition ID but give each host the appropriate GCD ID.

See Section 5.7, IPUoF environment variables for more information.
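As a sketch, two hosts sharing a two-GCD partition would each set the same controller and partition details but a different GCD ID (the hostnames, address and partition name below are placeholders):

```shell
# Same on every host (placeholder address and partition name):
export IPUOF_VIPU_API_HOST=10.1.2.3
export IPUOF_VIPU_API_PARTITION_ID=mypt

# Different per host:
export IPUOF_VIPU_API_GCD_ID=0   # on node0; node1 would set 1 instead
```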

Multi-host IPUoF configuration files

When a multi-GCD partition is created, the IPUoF configuration for each GCD is written to a separate file so that the files can be distributed. Typically, a higher-level job scheduler takes care of providing each server with the necessary device configuration. This information is provided primarily for those wishing to integrate distributed Poplar execution into an existing job scheduler; it is not something you would normally do manually:

$ vipu create partition mypt --size 4 --num-gcds 2
create partition (mypt): success

$ ls -l ~/.ipuof.conf.d/
mypt_gcd0.conf mypt_gcd1.conf

Notice that in this example, because two GCDs were requested, a configuration file for each GCD was produced. You can then distribute each of these to a different server. Note that both servers must have access to the IPUoF data-plane network.

head$ vipu create partition mypt --size 4 --num-gcds 2
create partition (mypt): success

head$ scp ~/.ipuof.conf.d/mypt_gcd0.conf node0:.ipuof.conf.d
head$ scp ~/.ipuof.conf.d/mypt_gcd1.conf node1:.ipuof.conf.d

Warning

There must only be one configuration file in the directory. If more than one file is present then they will all be silently ignored. We recommend the use of environment variables, instead of configuration files, to specify the partition information to each host. See Section 5.7, IPUoF environment variables for more information.

Each node will now see only the devices for its own subset of the IPUs in the partition, determined by which GCD configuration file it is using:

node0$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [3]
-+- Id: [1], type:    [Fabric], PCI Domain: [2]
-+- Id: [2], type: [Multi IPU]
 |--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]

node0$ gc-info --device-info -d 2 | grep 'GCD'
  GCD Id: 0
  Num GCDs: 2

node1$ gc-info -l
-+- Id: [0], type:    [Fabric], PCI Domain: [1]
-+- Id: [1], type:    [Fabric], PCI Domain: [0]
-+- Id: [2], type: [Multi IPU]
 |--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
 |--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]

node1$ gc-info --device-info -d 2 | grep 'GCD'
  GCD Id: 1
  Num GCDs: 2

5.11.2. Intra-GCD sync configuration

For multi-GCD partitions, by default, the intra-GCD sync group (GS1) is configured to synchronise all IPUs in the GCD. You can alter this via the same mechanisms as for single-GCD partitions, for example --total-num-replicas.

5.11.3. Inter-GCD sync configuration

For multi-GCD partitions, by default, the inter-GCD sync group (GS2) is configured to synchronise all IPUs in the entire partition.