3. Collectives
3.1. Supported reduction operators
This section lists the reduction operators that are available when running collective operations.
3.1.1. ADD
The ADD operator calculates the sums of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: 01,02,03,04
input_replica_1: 05,06,07,08
the result will be:
output_replica_0: 06,08,10,12
output_replica_1: 06,08,10,12
ADD supports the following Poplar types: FLOAT, HALF, INT, LONG_LONG, UNSIGNED_LONG_LONG.
3.1.2. MEAN
The MEAN operator calculates the average of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: 1.0,2.0,3.0,4.0
input_replica_1: 5.0,6.0,7.0,8.0
the result will be:
output_replica_0: 3.0,4.0,5.0,6.0
output_replica_1: 3.0,4.0,5.0,6.0
MEAN only supports floating point data types: FLOAT and HALF.
3.1.3. MUL
The MUL operator multiplies the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: 01,02,03,04
input_replica_1: 05,06,07,08
the result will be:
output_replica_0: 05,12,21,32
output_replica_1: 05,12,21,32
MUL supports the following Poplar types: FLOAT, HALF, INT, LONG_LONG, UNSIGNED_LONG_LONG.
3.1.4. MIN
The MIN operator finds the smallest value of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: 01,02,03,04
input_replica_1: 05,06,07,08
the result will be:
output_replica_0: 01,02,03,04
output_replica_1: 01,02,03,04
MIN supports the following Poplar types: FLOAT, HALF, INT, UNSIGNED_INT, LONG_LONG, UNSIGNED_LONG_LONG.
3.1.5. MAX
The MAX operator finds the largest value of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: 01,02,03,04
input_replica_1: 05,06,07,08
the result will be:
output_replica_0: 05,06,07,08
output_replica_1: 05,06,07,08
MAX supports the following Poplar types: FLOAT, HALF, INT, UNSIGNED_INT, LONG_LONG, UNSIGNED_LONG_LONG.
3.1.6. SQUARE_ADD
The SQUARE_ADD operator calculates the sum of squares of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: 01,02,03,04
input_replica_1: 05,06,07,08
the result will be:
output_replica_0: 26,40,58,80
output_replica_1: 26,40,58,80
SQUARE_ADD supports the following Poplar types: FLOAT, HALF, INT, UNSIGNED_INT, LONG_LONG, UNSIGNED_LONG_LONG.
3.1.7. LOGICAL_AND
The LOGICAL_AND operator calculates the logical AND of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: true,false,true,false
input_replica_1: false,true,true,false
the result will be:
output_replica_0: false,false,true,false
output_replica_1: false,false,true,false
The only Poplar data type supported by the LOGICAL_AND operator is BOOL.
3.1.8. LOGICAL_OR
The LOGICAL_OR operator calculates the logical OR of the respective operands. When an AllReduce collective operation is executed on the following inputs:
input_replica_0: true,false,true,false
input_replica_1: false,true,true,false
the result will be:
output_replica_0: true,true,true,false
output_replica_1: true,true,true,false
The only Poplar data type supported by the LOGICAL_OR operator is BOOL.
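The worked examples in this section can be reproduced with a short, illustrative sketch. The Python snippet below is not the GCL API (the helper name and parameters are made up); it simply applies each reduction operator elementwise across the replica inputs, which is the per-element behaviour of an AllReduce.

```python
from functools import reduce

# Illustrative sketch (not the GCL API): apply a reduction operator elementwise
# across the inputs of all replicas, then give every replica a copy of the result.
def all_reduce(replica_inputs, combine, premap=lambda x: x, finalize=lambda acc, n: acc):
    n = len(replica_inputs)
    reduced = [finalize(reduce(combine, (premap(v) for v in column)), n)
               for column in zip(*replica_inputs)]
    return [list(reduced) for _ in range(n)]  # one identical copy per replica

ops = {
    "ADD":         dict(combine=lambda a, b: a + b),
    "MEAN":        dict(combine=lambda a, b: a + b, finalize=lambda acc, n: acc / n),
    "MUL":         dict(combine=lambda a, b: a * b),
    "MIN":         dict(combine=min),
    "MAX":         dict(combine=max),
    "SQUARE_ADD":  dict(combine=lambda a, b: a + b, premap=lambda x: x * x),
    "LOGICAL_AND": dict(combine=lambda a, b: a and b),
    "LOGICAL_OR":  dict(combine=lambda a, b: a or b),
}

print(all_reduce([[1, 2, 3, 4], [5, 6, 7, 8]], **ops["ADD"]))         # [6, 8, 10, 12] on both replicas
print(all_reduce([[1, 2, 3, 4], [5, 6, 7, 8]], **ops["SQUARE_ADD"]))  # [26, 40, 58, 80] on both replicas
```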
3.2. Collective groups
GCL supports a few kinds of communication groups that describe the IPUs taking part in a particular collective operation.
3.2.1. Orthogonal group
ORTHOGONAL groups consist of replicas that are assigned to them orthogonally to the replica ordering in the topology. For example, for 16 replicas (replica index = 0 to 15) and a group size of 4, there will be four groups, assigned as shown in Table 3.1 and in Fig. 3.1.
Group | Replicas
---|---
0 | 0, 4, 8, 12
1 | 1, 5, 9, 13
2 | 2, 6, 10, 14
3 | 3, 7, 11, 15
If there are N replicas denoted {0, ..., N-1} and the group size is k, then there are m = N/k groups of size k:
{0, m, 2m, ...}, {1, m+1, 2m+1, ...}, ..., {m-1, 2m-1, ..., N-1}
ORTHOGONAL groups can also be expressed in terms of stride. Replicas are assigned to groups with a stride equal to the number of groups, where \(number\ of\ groups = \frac{number\ of\ replicas}{group\ size}\).
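A minimal sketch of this stride rule (illustrative Python, not a GCL call; the helper name is made up) reproduces Table 3.1:

```python
# Replica r belongs to orthogonal group (r mod number_of_groups), so the members
# of each group are strided by the number of groups.
def orthogonal_groups(num_replicas, group_size):
    num_groups = num_replicas // group_size
    return [list(range(g, num_replicas, num_groups)) for g in range(num_groups)]

print(orthogonal_groups(16, 4))
# [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
```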
3.2.2. Consecutive group
A CONSECUTIVE group consists of replicas that are assigned to it consecutively, following the replica ordering. Each group has a size equal to the size that the CommGroup is instantiated with. For example, for 16 replicas (replica index = 0 to 15) and a group size of 4, the groups are assigned as shown in Table 3.2 and in Fig. 3.1.
Group | Replicas
---|---
0 | 0, 1, 2, 3
1 | 4, 5, 6, 7
2 | 8, 9, 10, 11
3 | 12, 13, 14, 15
If there are N replicas denoted {0, ..., N-1} and the group size is k, then there are N/k groups of size k:
{0, 1, ..., k-1}, {k, ..., 2k-1}, ..., {N-k, ..., N-1}
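A matching sketch for consecutive assignment (again illustrative only, not the GCL API) reproduces Table 3.2:

```python
# Consecutive groups are contiguous chunks of group_size replicas, in replica order.
def consecutive_groups(num_replicas, group_size):
    return [list(range(start, start + group_size))
            for start in range(0, num_replicas, group_size)]

print(consecutive_groups(16, 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```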
3.2.3. All group
The ALL group tells GCL that all the replicas are taking part in the collective operation as a single group. An example of such a grouping is shown in Table 3.3 and in Fig. 3.1.
Group | Replicas
---|---
0 | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
3.3. Collective operations
GCL supports the following collective operations:
3.3.1. AllGather
AllGather is an operation where input elements from multiple replicas are distributed to all the participating replicas. For the following inputs:
input_replica_0: [x0,y0]
input_replica_1: [x1,y1]
input_replica_2: [x2,y2]
input_replica_3: [x3,y3]
the result will be:
output_replica_0: [x0,y0,x1,y1,x2,y2,x3,y3]
output_replica_1: [x0,y0,x1,y1,x2,y2,x3,y3]
output_replica_2: [x0,y0,x1,y1,x2,y2,x3,y3]
output_replica_3: [x0,y0,x1,y1,x2,y2,x3,y3]
After the operation, each replica receives the aggregation of data from all replicas in the order of the replicas. The output shape will be calculated as [number_of_input_elements * number_of_replicas].
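A minimal sketch of these semantics (plain Python, not the GCL API; the helper name is made up) for the example above:

```python
# Every replica ends up with the concatenation of all replicas' inputs, in replica order.
def all_gather(replica_inputs):
    gathered = [x for replica in replica_inputs for x in replica]
    return [list(gathered) for _ in replica_inputs]

inputs = [["x0", "y0"], ["x1", "y1"], ["x2", "y2"], ["x3", "y3"]]
print(all_gather(inputs)[0])  # ['x0', 'y0', 'x1', 'y1', 'x2', 'y2', 'x3', 'y3']
```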
3.3.2. AllReduce
AllReduce is an operation where input elements are reduced across replicas, with each replica receiving a complete result. For the following inputs:
input_replica_0: x0,y0,z0
input_replica_1: x1,y1,z1
the result will be:
output_replica_0: op(x0,x1),op(y0,y1),op(z0,z1)
output_replica_1: op(x0,x1),op(y0,y1),op(z0,z1)
The output shape will be the same as the input shape and all replicas will have the same data. When a ReduceScatter is followed by an AllGather, the combination is equivalent to an AllReduce. In other words, AllReduce(input) == AllGather(ReduceScatter(input)).
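The identity can be checked with a small self-contained sketch (illustrative Python, not the GCL API; it assumes an element count that divides evenly by the number of replicas, so no ReduceScatter padding is involved):

```python
import operator
from functools import reduce
from math import ceil

def elementwise_reduce(replica_inputs, op=operator.add):
    return [reduce(op, column) for column in zip(*replica_inputs)]

def all_reduce(replica_inputs, op=operator.add):
    reduced = elementwise_reduce(replica_inputs, op)
    return [list(reduced) for _ in replica_inputs]

def reduce_scatter(replica_inputs, op=operator.add):
    n_replicas = len(replica_inputs)
    reduced = elementwise_reduce(replica_inputs, op)
    chunk = ceil(len(reduced) / n_replicas)
    reduced += [0] * (chunk * n_replicas - len(reduced))  # zero padding, if needed
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(n_replicas)]

def all_gather(replica_inputs):
    gathered = [x for replica in replica_inputs for x in replica]
    return [list(gathered) for _ in replica_inputs]

inputs = [[1, 2, 3, 4], [5, 6, 7, 8]]
assert all_reduce(inputs) == all_gather(reduce_scatter(inputs))
```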
3.3.3. AllToAll
AllToAll is an operation where each replica splits its data into one chunk per replica and sends each chunk to a different replica: the chunk at split index i is sent to replica i. For the following inputs:
input_replica_0: a0,a1,a2,a3
input_replica_1: b0,b1,b2,b3
input_replica_2: c0,c1,c2,c3
input_replica_3: d0,d1,d2,d3
the result will be:
output_replica_0: a0,b0,c0,d0
output_replica_1: a1,b1,c1,d1
output_replica_2: a2,b2,c2,d2
output_replica_3: a3,b3,c3,d3
The first dimension of the input shape must be equal to the number of replicas (even if the input only has one dimension), and the output shape will be equal to the input shape.
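In effect, the operation transposes the per-replica splits; a minimal sketch (illustrative only, not the GCL API):

```python
# Chunk i of every replica's input goes to replica i, so the result is the
# "transpose" of the per-replica splits.
def all_to_all(replica_inputs):
    return [list(chunks) for chunks in zip(*replica_inputs)]

inputs = [["a0", "a1", "a2", "a3"],
          ["b0", "b1", "b2", "b3"],
          ["c0", "c1", "c2", "c3"],
          ["d0", "d1", "d2", "d3"]]
print(all_to_all(inputs)[0])  # ['a0', 'b0', 'c0', 'd0']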
3.3.4. Broadcast
Broadcast is an operation where one replica (the root replica) sends its data to all other replicas. For the following inputs:
root_replica_0: a0,a1,a2,a3
input_replica_1: b0,b1,b2,b3
input_replica_2: c0,c1,c2,c3
input_replica_3: d0,d1,d2,d3
the result will be:
root_replica_0: a0,a1,a2,a3
output_replica_1: a0,a1,a2,a3
output_replica_2: a0,a1,a2,a3
output_replica_3: a0,a1,a2,a3
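A minimal sketch of these semantics (illustrative Python, not the GCL API):

```python
# The root replica's data replaces the data on every replica.
def broadcast(replica_inputs, root=0):
    return [list(replica_inputs[root]) for _ in replica_inputs]

inputs = [["a0", "a1", "a2", "a3"],
          ["b0", "b1", "b2", "b3"],
          ["c0", "c1", "c2", "c3"],
          ["d0", "d1", "d2", "d3"]]
print(broadcast(inputs))  # every replica now holds ['a0', 'a1', 'a2', 'a3']
```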
3.3.5. ReduceScatter
ReduceScatter is an operation where input elements are reduced across replicas, with each replica receiving a part of the result. For the following inputs:
input_replica_0: x0,y0,z0
input_replica_1: x1,y1,z1
the result will be:
output_replica_0: op(x0,x1),op(y0,y1)
output_replica_1: op(z0,z1),0
The output shape might not match the input shape, as it will be calculated as [ceil(number_of_input_elements / number_of_replicas)].
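A minimal sketch of these semantics, including the zero padding shown above (illustrative Python, not the GCL API):

```python
import operator
from functools import reduce
from math import ceil

# Reduce elementwise across replicas, then scatter equal-sized chunks to the
# replicas, padding the last chunk with zeros when the sizes do not divide evenly.
def reduce_scatter(replica_inputs, op=operator.add):
    n_replicas = len(replica_inputs)
    reduced = [reduce(op, column) for column in zip(*replica_inputs)]
    chunk = ceil(len(reduced) / n_replicas)
    reduced += [0] * (chunk * n_replicas - len(reduced))
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(n_replicas)]

print(reduce_scatter([[1, 2, 3], [4, 5, 6]]))  # [[5, 7], [9, 0]] -- note the zero padding
```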
3.4. Collective methods
This section describes the available collective methods. Collective methods describe the logical network topologies, that is, they define the datapaths in the network.
GCL allows you to control the method selection through the GCL_OPTIONS environment variable described in Section 4.1.1, Option values. The default value for method is auto, which means that GCL will try to pick the optimal method based on several factors, such as the data size, the number of replicas in the communication group, the communication group type, the number of bytes of data per IPU and the physical network topology.
3.4.1. Anticlockwise ring
This method sends data fragments anticlockwise around the ring of IPUs. The number of fragments is equal to the number of IPUs in the ring.
3.4.2. Bi-directional ring pair
This method splits the data in two and uses the clockwise ring algorithm on one half and the anticlockwise ring algorithm on the other. This will fully use the links in both directions. The number of fragments is equal to twice the number of IPUs in the ring.
3.4.3. Broadcast
This method broadcasts the tensor to all participating replicas in the communication group and performs the reduction locally. This means that the network latency cost is only paid once. This method is faster for small tensors, but has the downside of increased memory use if the tensors are large or the communication group is too large. The GCL_OPTIONS variable syncful.maxBroadcastSize controls the cut-off point above which the broadcast method will not be selected; the cut-off is compared against group_size * numBytes and has a default value of 2048.
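As a rough illustration of how the documented cut-off combines these two quantities (the helper below is hypothetical, not part of GCL, and it assumes an inclusive comparison):

```python
# Hypothetical helper: the broadcast method is only a candidate while
# group_size * num_bytes stays within syncful.maxBroadcastSize (default 2048).
def broadcast_method_allowed(group_size, num_bytes, max_broadcast_size=2048):
    return group_size * num_bytes <= max_broadcast_size

print(broadcast_method_allowed(group_size=16, num_bytes=128))  # True:  16 * 128 == 2048
print(broadcast_method_allowed(group_size=16, num_bytes=256))  # False: 16 * 256 >  2048
```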
3.4.4. Clockwise ring
This method sends data fragments clockwise around the ring of IPUs. The number of fragments is equal to the number of IPUs in the ring.
3.4.5. Meet-in-the-middle ring
This method sends half of the fragments halfway around the ring in the clockwise direction and half of the fragments halfway around the ring in the anticlockwise direction - they meet in the middle. The number of fragments is equal to the number of IPUs in the ring. The disadvantage compared to the bi-directional ring pair method is that the usage of the available bandwidth is not quite optimal. In particular, the final step only uses the links in one direction (assuming an even number of IPUs). The advantage is that this method requires fewer steps and allows the use of larger fragments.
3.4.6. Quad-directional ring
This method divides the fragments into four parts and sends each quarter around one of two rings, using the mirrored and non-mirrored ring patterns.