5. Partitions
5.1. Overview
This section gives more detailed information about the types of partition which can be provisioned using Virtual-IPU. This information is not normally necessary for running existing applications, but may be valuable to those developing new applications or looking to better understand the provisioning process.
A Virtual-IPU partition represents some number of IPUs which can communicate with one another. They are isolated so that all communication from physically neighbouring devices that are not in the same partition is prohibited.
The vipu create partition command communicates with the Virtual-IPU controller to allocate some unused IPUs from the resource pool and configure them so that they can be accessed. The size of the partition represents the number of IPUs which are required to run the application. The size of a partition must be a power of 2.
Note that, by default, the provisioned devices are only visible to the user who created the partition. You can override this if, for example, a device should be visible to all users on the system (see Section 5.8.2, Sharing access to devices).
The result of provisioning a partition is a configuration file which is read by the IPUoF client in Poplar. This allows the provisioned IPUs to be accessed remotely over the RDMA network.
You can configure each partition as one or more subsets that will be targeted by a Poplar application. Each such subset is called a graph compilation domain (GCD). See Section 5.11, Multi-GCD partitions for more information.
By default, a partition contains a single GCD.
5.2. Creating a reconfigurable partition
Note
Reconfigurable partitions are only supported with the “c2-compatible” sync type.
Any create partition command with --reconfigurable set will default to using the “c2-compatible” sync type. Trying to specify another sync type with the --sync-type flag will result in failure, unless the --index flag is also set, in which case the partition will silently be created with the “c2-compatible” sync type.
You can create a “reconfigurable” partition which makes the IPUs available in a flexible way. This is the preferred method for small scale systems (up to one IPU-POD64 or one Bow Pod64).
A device ID will be assigned to each individual IPU. In addition, if the partition contains more than one IPU, multi-IPU device IDs will be assigned to every subset of IPUs, as shown below.
You can create a reconfigurable partition with a command such as the following:
vipu create partition pt --size 4 --reconfigurable
This allocates four IPUs to a partition called “pt”.
You can see the IPUs that are now available using the gc-info command that was installed with the SDK:
$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
-+- Id: [1], type: [Fabric], PCI Domain: [2]
-+- Id: [2], type: [Fabric], PCI Domain: [1]
-+- Id: [3], type: [Fabric], PCI Domain: [0]
-+- Id: [4], type: [Multi IPU]
|--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
|--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
-+- Id: [5], type: [Multi IPU]
|--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
|--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]
-+- Id: [6], type: [Multi IPU]
|--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
|--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
|--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
|--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]
Here, the four individual IPUs have the IDs 0 to 3.
The multi-IPU devices are listed below that. A multi-IPU device always represents a number of IPUs which is a power of two. Here there are three multi-IPU devices:
ID 4 contains IPUs 0 and 1
ID 5 contains IPUs 2 and 3
ID 6 contains all four IPUs
In contrast to preconfigured partitions (see Section 5.3, Creating a preconfigured partition), all the supported device subsets are available and can be attached to. Note that creating reconfigurable partitions with large IPU clusters can yield a very large number of device combinations (127 devices for a 64-IPU cluster).
Different Poplar programs can make use of these devices concurrently. A device ID that is used by a Poplar program is removed from the list of available device IDs while that Poplar program is running.
5.2.1. Limitations
A reconfigurable partition has the following limitations:
Partitions containing multiple GCDs cannot be provisioned as reconfigurable. This is due to the requirement that some IPU configuration must be co-ordinated across all GCDs in the partition.
Partitions with greater than 64 IPUs cannot be provisioned as reconfigurable.
While IPUs are isolated from IPUs in different partitions, they are not isolated from one another within the partition. Multiple applications sharing a partition can potentially interfere with one another, including receiving traffic from neighbouring devices in the same partition.
When multiple applications are contending for device groups concurrently, it is possible that applications cannot attach to devices reliably. This is due to how groups of IPUs are acquired individually by the application stack.
If you create a partition that is larger than an application needs, the unused IPUs are not returned to the Virtual-IPU and made available for provisioning to other partitions or tenants. This can lead to poor system utilisation.
5.3. Creating a preconfigured partition
For large-scale systems, with multiple logical racks and multiple GCDs, you will need to create a preconfigured partition:
$ vipu create partition pt --size 1
This will make a set of IPU devices visible to the current user.
These will appear as “Fabric” IPU devices in the output of gc-info:
$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
When you create a partition that contains more than a single IPU, then a corresponding multi-IPU device is also created. A multi-IPU device always represents a number of IPUs which is a power of two. See the examples below.
$ vipu create partition pt --size 2
$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
-+- Id: [1], type: [Fabric], PCI Domain: [2]
-+- Id: [3], type: [Multi IPU]
|--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
|--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
Here, the two individual IPUs have the IDs 0 and 1. A multi-IPU device, with device ID 3, that contains the two IPUs is shown below that.
$ vipu create partition pt --size 4
$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
-+- Id: [1], type: [Fabric], PCI Domain: [2]
-+- Id: [2], type: [Fabric], PCI Domain: [1]
-+- Id: [3], type: [Fabric], PCI Domain: [0]
-+- Id: [4], type: [Multi IPU]
|--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
|--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
|--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
|--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]
In this example, the four individual IPUs have the IDs 0 to 3. The multi-IPU device (with ID 4) contains the four IPUs.
By default, IPUs provisioned by V-IPU have IPU-Link routing and Sync-Link preconfigured, so the smaller device subsets are not available. Attaching to individual devices will also not behave as you might expect if you have previously been using directly attached PCIe devices. To configure a device which behaves more like a directly attached PCIe device, see Section 5.2, Creating a reconfigurable partition.
5.4. IPU selection
By default, the Virtual-IPU controller will attempt to allocate your IPU devices based on the requirements given. If this placement needs to be performed manually for whatever reason, then you can explicitly choose the IPUs. Bear in mind that if you do not have access to the IPUs requested, or if a partition is already using the IPUs, then the request will fail.
You can explicitly request IPUs by manually specifying the cluster in which the IPUs reside, and a unique index to identify a particular group of IPUs. To view the available choices for a particular partition size, pass the --options option to the create partition command. The example below shows the available partition placements for four IPUs within a 16-IPU cluster:
$ vipu create partition p --options --size 4
Index | Cluster | ILDs | GW Routing | ILD Routing | Size | IPUs
--------------------------------------------------------------------------------------------------------------
0 | cl00 | 1 | UNDEFINED | DNC | 4 | 0.0=ag00:0 0.1=ag00:1 0.2=ag00:2 0.3=ag00:3
1 | cl00 | 1 | UNDEFINED | DNC | 4 | 0.0=ag01:0 0.1=ag01:1 0.2=ag01:2 0.3=ag01:3
2 | cl00 | 1 | UNDEFINED | DNC | 4 | 0.0=ag02:0 0.1=ag02:1 0.2=ag02:2 0.3=ag02:3
3 | cl00 | 1 | UNDEFINED | DNC | 4 | 0.0=ag03:0 0.1=ag03:1 0.2=ag03:2 0.3=ag03:3
0 | cl01 | 1 | UNDEFINED | DNC | 4 | 0.0=ag00:0 0.1=ag00:1 0.2=ag00:2 0.3=ag00:3
--------------------------------------------------------------------------------------------------------------
Only valid partition placements for the active user are shown. If the IPUs are not assigned to the user, or are already in use, they are not shown.
You can create a partition with a specific selection of IPUs as follows:
$ vipu create partition p --size 4 --cluster cl00 --index 2
This selects the four IPUs with the index 2, from the table above.
5.5. IPU relocation
Partition creation is an asynchronous process consisting of two stages: IPU allocation and configuration. After a set of IPUs has been allocated for a partition, it is possible that the configuration stage does not succeed, for example due to an issue with the corresponding IPU-Machines. In such cases, the Virtual-IPU controller will retry creating the partition until the maximum number of retries (10 by default) has been reached before giving up. However, each of the configuration retries will use the same set of IPUs allocated in the first stage.
Partition configuration errors can also occur at a later stage if a partition gets into an ERROR state and is being reset by the V-IPU controller (see Section 5.6.1, Partition state).
For many use cases, it is useful to change the IPU allocation if the partition is unable to reach the ACTIVE state with the initially allocated set of IPUs. To allow this, use the --relocatable option of the create partition command. This indicates that the partition can be relocated to a different set of IPUs in case of configuration errors.
$ vipu create partition p --size 4 --relocatable
5.6. Displaying partition information
You can display detailed information about a partition with the get partition command:
$ vipu get partition pt1
------------------------------------------------------------
Partition ID | pt1
Cluster | cl0
GW-Link Routing | DEFAULT
IPU-Link Routing | DNC
Number of ILDs | 1
Number of IPUs | 4
Number of GCDs | 1
Reconfigurable | false
State | ACTIVE
Provisioning State | IDLE
Intra-GCD Sync (GS1) | Replicas=1
Inter-GCD Sync (GS2) | UNDEFINED
Last Error |
IPUs | 0=ag06:0 1=ag06:1 2=ag06:2 3=ag06:3
------------------------------------------------------------
5.6.1. Partition state
A partition can have one of several possible states during its lifetime. The Virtual-IPU controller continuously monitors all created partitions and marks them with the correct state, representing the usability of the partition at the time.
Partition State | Description
---|---
ACTIVE | The partition is ready to be used by applications
PENDING | The partition is not yet ready
REMOVED | The partition is marked as removed and will be deleted
ERROR | An error occurred while getting the partition ready
The Virtual-IPU controller will periodically retry configuration of partitions which are in the ERROR state so that they become ready to be used by applications.
An example of a situation in which a partition can go to the ERROR state is a voluntary or involuntary reboot of one of the IPU-Machines contributing IPUs to the partition. In that case, once the IPU-Machine comes back up, the Virtual-IPU controller will execute the necessary configuration to get the IPUs ready to use and mark the partition as ACTIVE again.
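For example, a minimal way to wait for a partition to become (or return to) the ACTIVE state is to poll it with the get partition command. This is only a sketch: it assumes the partition is named pt1 and that the State field is reported exactly as in the example output shown in Section 5.6, Displaying partition information:
$ until vipu get partition pt1 | grep -q ACTIVE; do sleep 10; done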
5.6.2. Provisioning state
While the partition state refers to the current state a partition is in, the provisioning state field indicates what operation is currently running for the partition on the Virtual-IPU controller.
When the partition state is PENDING or ERROR, more information about the partition can be obtained by looking at its provisioning state on the Virtual-IPU controller.
Provisioning State | Description
---|---
IDLE | No operation is running on the partition
CREATING | The partition is being created
REMOVING | The partition is being removed
RESETTING | The partition is being reset
FAILED | An error occurred during the last operation (see the Last Error field in the partition information)
5.7. IPUoF environment variables
When you create a partition, Poplar needs to know the network endpoints to use to access the IPU devices, the topology of the devices and the configuration state. This information is used by Poplar to connect to the server-side IPUoF component (see Section 2, Concepts and architecture).
This information can most easily be provided by using the following environment variables.
IPUOF_VIPU_API_HOST: The IP address of the server running the V-IPU controller. Required.
IPUOF_VIPU_API_PORT: The port to connect to the V-IPU controller. Optional. The default is 8090.
IPUOF_VIPU_API_PARTITION_ID: The name of the partition to use. Required.
IPUOF_VIPU_API_GCD_ID: The ID of the GCD you want to use. Required for multi-GCD systems.
IPUOF_VIPU_API_TIMEOUT: Set the time-out for client calls in seconds. Optional. Default 200.
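For example, to point a Poplar application at the partition “pt” created earlier without using a configuration file, you could export the required variables before launching the application. The controller address and the application name below are placeholders:
$ export IPUOF_VIPU_API_HOST=10.1.2.3
$ export IPUOF_VIPU_API_PARTITION_ID=pt
$ ./my_poplar_app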
5.8. IPUoF configuration files
The information that Poplar needs to access devices can also be passed via an IPUoF configuration file which is, by default, written to a directory in your home directory (~/.ipuof.conf.d/) when a partition is created on the command line:
$ vipu create partition mypt --size 4
create partition (mypt): success
$ ls ~/.ipuof.conf.d/
mypt.conf
Note
IPUoF configuration files are provided as a convenience. However, because of the difficulties of maintaining and sharing configuration files they are not the recommended way to manage access to IPUs. We recommend the use of environment variables, as described in Section 5.7, IPUoF environment variables.
Configuration files may be useful when the Poplar hosts do not have direct network access to the V-IPU controller (for security reasons, for example).
You can change the directory where the IPUoF configuration file is created with the --ipuof-config-location option (see Section 8.3, Create partition) or the environment variable VIPU_CLI_IPUOF_CONFIG_LOCATION.
Note
IPUoF configuration files are not written for the relocatable partitions because IPU configuration may change during the lifetime of such partitions, as described in Section 5.5, IPU relocation.
Poplar will search for an IPUoF configuration file in the following locations:
The file specified by IPUOF_CONFIG_PATH
$HOME/.ipuof.conf.d/
/etc/ipuof.conf.d/
Warning
There must only be one configuration file in the directory. If more than one file is present then they will all be silently ignored.
By default, the automatically-created file is named partition-name.conf but Poplar will read any file that it finds in that directory.
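If you keep a configuration file outside the default search locations, you can point Poplar at it explicitly with IPUOF_CONFIG_PATH. The path and application name below are placeholders:
$ export IPUOF_CONFIG_PATH=/shared/ipuof-configs/mypt.conf
$ ./my_poplar_app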
5.8.1. Creating an IPUoF configuration file
If you do not have an IPUoF configuration file for the partition you want to use, you can create one using the output of the vipu get partition --ipuof-configs command (see Section 8.4, Get partition). For multi-GCD partitions the --gcd option must also be provided to fetch the correct configuration.
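For example, the following sketch recreates the configuration file for a single-GCD partition named mypt in the default search directory. It assumes that the --ipuof-configs option writes the configuration to standard output:
$ vipu get partition mypt --ipuof-configs > ~/.ipuof.conf.d/mypt.conf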
5.9. Removing a partition
When you remove a partition, the associated configuration file is also removed. Because there must only be one configuration file present at a time, you should remove an existing partition before creating a new one:
$ vipu remove partition mypt
remove partition (mypt): success
$ ls ~/.ipuof.conf.d/
5.10. Routing configuration
The routing configuration defines how messages between IPUs will be routed via the available connections.
Table 5.3 lists the different types of routing configurations that can be used, depending on the size of the partition and the physical link topologies.
A default IPU-Link routing configuration will be chosen when the partition is created based on the size and topology of the cluster:
If the partition size is less than or equal to 16 it defaults to DNC
If the partition size is 16 or more and the partition occupies the whole cluster and the cluster topology has loopback cables enabled, it defaults to RINGSWNC
Otherwise it defaults to SWNC
The routing for IPU-Links can be specified with the --routing option (see Section 8.3, Create partition).
The application may also need to specify the routing. See the Target class in the Poplar and Poplibs API Reference for information on setting the routing used by an application.
Partition size | IPU-Link routing
---|---
1 | DNC
2 | DNC
4 | DNC
8 | DNC
16 | DNC
32 | SWNC
64 | SWNC or RINGSWNC [*]
128+ | SWNC or RINGSWNC [*]
Key: IPU-Link routing options
BTNC | Barley Twist Network Configuration
DNC | Default Network Configuration
SWNC | Sliding Window Network Configuration
RINGSWNC | Ring with Sliding Window Network Configuration
[*] Note that some routing options are not available depending on the system size and link topology:
RINGSWNC requires a partition that uses an entire IPU-Link domain with torus topology and size of at least 16
BTNC requires a partition with torus topology and 1 IPU per replica
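For example, to request ring routing explicitly for a partition that occupies an entire 64-IPU IPU-Link domain with a torus topology, you could pass the routing type to create partition. The lower-case option value shown here is an assumption; see Section 8.3, Create partition, for the exact values accepted:
$ vipu create partition pt --size 64 --routing ringswnc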
5.10.1. Intra-GCD sync configuration
You can adjust the configuration of the sync group (GS1) which is shared between IPUs within a GCD. This is typically used to synchronise execution steps between IPUs which are computing a single replica of a replicated graph. See the description of sync groups in the Poplar User Guide.
For example:
$ vipu create partition pt --size 8 --total-num-replicas 2
This will effectively divide the IPUs into two groups such that each group can sync independently of the other. The table below shows the IPUs present in each sync group for a variety of partition sizes.
Size (#IPUs) | Replicas | GS1 independent IPU groups (IPU IDs)
---|---|---
1 | 1 | {0}
2 | 1 | {0 1}
4 | 1 | {0 1 2 3}
8 | 1 | {0 1 2 3 4 5 6 7}
2 | 2 | {0}, {1}
4 | 2 | {0 1}, {2 3}
8 | 2 | {0 1 2 3}, {4 5 6 7}
4 | 4 | {0}, {1}, {2}, {3}
8 | 4 | {0 1}, {2 3}, {4 5}, {6 7}
8 | 8 | {0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}
In order to sync all IPUs in the partition, the application should use the GS2 sync group.
Note that for single-GCD partitions, if you do not specify the number of replicas then the configuration defaults to synchronising all IPUs in the partition. This is equivalent to specifying a single replica. This means that the following two commands yield an identical configuration:
$ vipu create partition pt --size 8 --total-num-replicas 1
$ vipu create partition pt --size 8
5.11. Multi-GCD partitions
Partitions, by default, yield a single multi-IPU device which can be used by a single Poplar process to run a machine learning application.
You can also specify that a partition is to be split into a number of GCDs.
The following example creates a partition with four IPUs and two GCDs:
$ vipu create partition pt --size 4 --num-gcds 2
The result of this is that the IPUs in the partition are split equally across the requested number of multi-IPU devices:
$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
-+- Id: [1], type: [Fabric], PCI Domain: [2]
-+- Id: [2], type: [Fabric], PCI Domain: [1]
-+- Id: [3], type: [Fabric], PCI Domain: [0]
-+- Id: [4], type: [Multi IPU]
|--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
|--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
-+- Id: [5], type: [Multi IPU]
|--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
|--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]
This allows a separate Poplar process to attach to each subset of IPUs in the partition. These subsets of IPUs are referred to as GCDs. Each device is assigned a GCD ID, which is simply a number representing the position of the GCD in the partition. It is not normally necessary to inspect this number or any other meta-information about the devices, but it can be displayed if needed using the gc-info tool. You will also be able to see the total number of GCDs which were configured into the partition:
$ gc-info --device-info -d 4 | grep 'GCD'
GCD Id: 0
Num GCDs: 2
$ gc-info --device-info -d 5 | grep 'GCD'
GCD Id: 1
Num GCDs: 2
Note that for partitions with a single IPU per GCD, each individual Fabric IPU device represents a GCD; there is no multi-IPU device:
$ vipu create partition pt --size 2 --num-gcds 2
$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
-+- Id: [1], type: [Fabric], PCI Domain: [2]
$ gc-info --device-info -d 0 | grep 'GCD'
GCD Id: 0
Num GCDs: 2
$ gc-info --device-info -d 1 | grep 'GCD'
GCD Id: 1
Num GCDs: 2
5.11.1. Multi-host IPUoF configuration
The primary use case of these GCDs is to allow different Poplar workloads to be distributed over multiple servers, whilst still permitting the IPUs in the same partition to communicate with one another. Contrast this with creating multiple partitions, which are isolated and unable to communicate.
The recommended way of distributing the IPUoF configuration information to multiple hosts is to set the relevant environment variables on every host. You would specify the same values for the host, port and partition ID but give each host the appropriate GCD ID.
See Section 5.7, IPUoF environment variables for more information.
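For example, for the two-GCD partition “pt” created above, each host sets the same controller address and partition name but its own GCD ID. The controller address below is a placeholder:
node0$ export IPUOF_VIPU_API_HOST=10.1.2.3
node0$ export IPUOF_VIPU_API_PARTITION_ID=pt
node0$ export IPUOF_VIPU_API_GCD_ID=0
node1$ export IPUOF_VIPU_API_HOST=10.1.2.3
node1$ export IPUOF_VIPU_API_PARTITION_ID=pt
node1$ export IPUOF_VIPU_API_GCD_ID=1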
Multi-host IPUoF configuration files
When a multi-GCD partition is created, the IPUoF configuration for each GCD is written to a separate file so that they can be distributed. Typically, a higher-level job scheduler would take care of ensuring that each server is provided with the necessary device configuration. The following information is provided primarily for those wishing to integrate distributed Poplar execution into an existing job scheduler; it is not typically done manually:
$ vipu create partition mypt --size 4 --num-gcds 2
create partition (mypt): success
$ ls -l ~/.ipuof.conf.d/
mypt_gcd0.conf mypt_gcd1.conf
Notice that in this example, because two GCDs were requested, a configuration file for each GCD was produced. You can then distribute each of these to a different server. Note that both servers must have access to the IPUoF data-plane network.
head$ vipu create partition mypt --size 4 --num-gcds 2
create partition (mypt): success
head$ scp ~/.ipuof.conf.d/mypt_gcd0.conf node0:.ipuof.conf.d
head$ scp ~/.ipuof.conf.d/mypt_gcd1.conf node1:.ipuof.conf.d
Warning
There must only be one configuration file in the directory. If more than one file is present then they will all be silently ignored. We recommend the use of environment variables, instead of configuration files, to specify the partition information to each host. See Section 5.7, IPUoF environment variables for more information.
Each node will now see only a single device, each from a different subset of the IPUs in the partition, based on which GCD configuration file they are using:
node0$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [3]
-+- Id: [1], type: [Fabric], PCI Domain: [2]
-+- Id: [2], type: [Multi IPU]
|--- PCIe Id: [0], DNC Id: [0], PCI Domain: [3]
|--- PCIe Id: [1], DNC Id: [1], PCI Domain: [2]
node0$ gc-info --device-info -d 2 | grep 'GCD'
GCD Id: 0
Num GCDs: 2
node1$ gc-info -l
-+- Id: [0], type: [Fabric], PCI Domain: [1]
-+- Id: [1], type: [Fabric], PCI Domain: [0]
-+- Id: [2], type: [Multi IPU]
|--- PCIe Id: [2], DNC Id: [2], PCI Domain: [1]
|--- PCIe Id: [3], DNC Id: [3], PCI Domain: [0]
node1$ gc-info --device-info -d 2 | grep 'GCD'
GCD Id: 1
Num GCDs: 2
5.11.2. Intra-GCD sync configuration
For multi-GCD partitions, by default, the intra-GCD sync group (GS1) is configured to synchronise all IPUs in the GCD. You can alter this via the same mechanisms as for single-GCD partitions, for example with --total-num-replicas.
5.11.3. Inter-GCD sync configuration
For multi-GCD partitions, by default, the inter-GCD sync group (GS2) is configured to synchronise all IPUs in the entire partition.
5.11.4. Crossing IPU-Link domains
IPU systems are divided into domains based on the interconnect used between the IPUs. For IPU-POD64 and Bow Pod64 based systems, 64 IPUs are interconnected using IPU-Links, and are referred to as an IPU-Link domain. For larger systems, multiples of this unit are interconnected using gateway (GW) links.
For these large systems, only inter-GCD communication is permitted when crossing GW-Links; intra-GCD communication is limited to using IPU-Links. The practical side effect of this is that a single GCD cannot be provisioned with more than 64 IPUs; a larger partition must be provisioned as multiple GCDs.
For example, in a cluster with 128 IPUs, comprising two IPU-Link domains, a 128-IPU partition must be provisioned as a minimum of two GCDs:
$ vipu create partition pt --size 128 --num-gcds 2
The partition can be configured further by subdividing the IPU-Link domains, or supplying other configuration options:
$ vipu create partition pt --size 128 --num-gcds 8 --total-num-replicas 64