Collectives
-
interface OptionFlags
Supported option flags (a usage sketch follows this list):
useSynclessCollectives
(true, false, hybrid, auto) [=auto] Type of collective implementation to use.
auto: Choose the appropriate implementation for the operation in question. At the moment this is the same as ‘false’.
true: Use the syncless implementation.
false: Use the syncful implementation.
hybrid: Use syncful over IPU-Links and syncless over GW-Links.
maxBytesPerTile
Integer [=35000] The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.
topology
(rung-ring-2, rung-ring-4, rung-ring-8, ring-on-line, peripheral-ring) [=auto] The topology to use for the syncful implementation. If you do not specify this option, the topology is auto-detected.
auto: Topology automatically selected based on the current graph.
rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.
rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.
rung-ring-8: Relevant for replica size 8. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.
ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU-Link mesh.
peripheral-ring: Relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU-Link mesh.
link
(auto-link, ipu-link, gw-link) [=auto-link] The link type to use between IPUs.
auto-link: Use the link type appropriate for the operation.
ipu-link: Use the IPU-Links.
gw-link: Use the GW-Links.
method
(auto, clockwise_ring, anticlockwise_ring, bidirectional_ring_pair, meet_in_middle_ring, quad_directional_ring) [=auto] The method/topology to be used.
auto: Automatically decide on the most optimal method.
clockwise_ring: Send fragments clockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.
anticlockwise_ring: Send fragments anticlockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.
bidirectional_ring_pair: Split the data into two halves and use the clockwise ring algorithm on one half and the anticlockwise ring algorithm on the other in order to fully utilize the links in both directions. The number of fragments is equal to twice the number of IPUs in the ring.
meet_in_middle_ring: Send half the fragments halfway around the ring in the clockwise direction and half the fragments halfway around the ring in the anticlockwise direction, meeting in the middle. The number of fragments is equal to the number of IPUs in the ring. The disadvantage compared to the “bidirectional_ring_pair” method is that the usage of available bandwidth is not quite optimal; in particular, the final step only uses the links in one direction (assuming an even number of IPUs). The advantage is that it requires fewer steps and allows the use of larger fragments.
quad_directional_ring: Divide fragments in four and send each quarter around one of two rings using the mirrored and non-mirrored ring pattern.
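For illustration, a minimal sketch of passing these options; the values chosen here are arbitrary examples, not recommendations:
#include <poplar/OptionFlags.hpp>

// Option values are passed as strings; unset options keep their defaults.
const poplar::OptionFlags collectiveOptions = {
    {"useSynclessCollectives", "auto"},
    {"maxBytesPerTile", "35000"},
    {"topology", "rung-ring-4"},
    {"method", "bidirectional_ring_pair"},
};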
#include <gcl/Collectives.hpp>
Defines
-
GCL_DEPRECATED(x)
Function scheduled for removal.
-
namespace gcl
Graphcore Communications Library.
CrossReplica functions
Collective operations working across replicas.
-
enum CommGroupType
Enum to define communication group specification type.
Assumption: replica groups are uniform in size and layout on IPUs.
Values:
-
enumerator ALL
All replicas viewed as one group, replica group size is ignored.
-
enumerator CONSECUTIVE
Groups are consecutive in replica.
If there are N replicas denoted {0, … N-1} and group size is k, then there are N/k groups of size k: {0, 1, … k-1}, {k, … 2k-1} … {N-k, … N-1}
-
enumerator ORTHOGONAL
Groups are sliced orthogonal to the replica ordering.
If there are N replicas denoted {0, … N-1} and group size is k, then there are m = N/k groups of size k: {0, m, 2m, …}, {1, m+1, 2m+1, …} … {m-1, 2m-1, … N-1}
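As a worked example, for N = 8 replicas and group size k = 2 the groups are:
CONSECUTIVE: {0,1}, {2,3}, {4,5}, {6,7}
ORTHOGONAL (m = N/k = 4 groups): {0,4}, {1,5}, {2,6}, {3,7}
A sketch of constructing the corresponding groups, using the CommGroup constructor documented below:
#include <gcl/Collectives.hpp>

const gcl::CommGroup consecutivePairs(gcl::CommGroupType::CONSECUTIVE, 2);
const gcl::CommGroup orthogonalPairs(gcl::CommGroupType::ORTHOGONAL, 2);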
-
poplar::Tensor allReduceCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-reduce operation.
The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A replicated tensor with the reduction of data.
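A minimal usage sketch; the graph and program setup are omitted and the debug string is arbitrary:
#include <gcl/Collectives.hpp>

// Sum `grad` element-wise across all replicas and return the result
// as a new replicated tensor.
poplar::Tensor sumAcrossReplicas(poplar::Graph &graph,
                                 const poplar::Tensor &grad,
                                 poplar::program::Sequence &prog) {
  const gcl::CommGroup all(gcl::CommGroupType::ALL, 0); // 0 = default size
  return gcl::allReduceCrossReplica(graph, grad,
                                    popops::CollectiveOperator::ADD, prog,
                                    all, "sumAcrossReplicas");
}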
-
poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-reduce operation.
The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.
- Deprecated:
Use allReduceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A replicated tensor with the reduction of data.
-
std::vector<poplar::Tensor> allReduceCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation.
This gives a performance improvement over sequentially reducing one tensor per operation. For short tensors, the potential latency reduction is a factor of 1/(number-of-tensors).
- Parameters
graph – The replicated graph the input tensors belong to.
datas – The vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A vector of replicated tensors, each containing the reduction of the corresponding tensor in datas across all replicas.
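A sketch of the batched form; reducing the whole vector in one call amortizes the per-operation latency described above:
#include <gcl/Collectives.hpp>
#include <vector>

std::vector<poplar::Tensor> sumAllGrads(poplar::Graph &graph,
                                        const std::vector<poplar::Tensor> &grads,
                                        poplar::program::Sequence &prog) {
  // One batched collective instead of grads.size() sequential ones.
  return gcl::allReduceCrossReplica(graph, grads,
                                    popops::CollectiveOperator::ADD, prog,
                                    gcl::CommGroup{}, "sumAllGrads");
}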
-
std::vector<poplar::Tensor> allReduce(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation.
This gives a performance improvement over sequentially reducing one tensor per operation. For short tensors, the potential latency reduction is a factor of 1/(number-of-tensors).
- Deprecated:
Use allReduceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensors belong to.
datas – The vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A vector of replicated tensors, each containing the reduction of the corresponding tensor in datas across all replicas.
-
poplar::Tensor allReduceCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A replicated tensor with the reduction of data.
-
poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() without the group arg (for all replicas).
- Deprecated:
Use allReduceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A replicated tensor with the reduction of data.
-
std::vector<poplar::Tensor> allReduceCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() with multiple input tensors and without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensors belong to.
datas – A vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A vector of replicated tensors, each containing the reduction of the corresponding tensor in datas across all replicas.
-
std::vector<poplar::Tensor> allReduce(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() with multiple input tensors and without the group arg (for all replicas).
- Deprecated:
Use allReduceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensors belong to.
datas – A vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
A vector of replicated tensors, each containing the reduction of the corresponding tensor in datas across all replicas.
-
void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() but writes the result to the destination tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
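A sketch of the destination form, assuming a preallocated accumulator tensor:
#include <gcl/Collectives.hpp>

void reduceInto(poplar::Graph &graph, const poplar::Tensor &grad,
                poplar::Tensor &accumulator, poplar::program::Sequence &prog) {
  // Writes the cross-replica reduction of `grad` into `accumulator`.
  gcl::allReduceToDestinationCrossReplica(graph, grad, accumulator,
                                          popops::CollectiveOperator::ADD,
                                          prog, gcl::CommGroup{}, "reduceInto");
}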
-
void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() but writes the result to the destination tensor.
- Deprecated:
Use allReduceToDestinationCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, std::vector<poplar::Tensor> &destinations, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceToDestinationCrossReplica() with multiple input and output tensors.
- Parameters
graph – The replicated graph the input tensors belong to.
datas – Vector of replicated tensors to reduce.
destinations – Vector of replicated tensors to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceToDestination(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, std::vector<poplar::Tensor> &destinations, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceToDestination() with multiple input and output tensors.
- Deprecated:
Use allReduceToDestinationCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensors belong to.
datas – Vector of replicated tensors to reduce.
destinations – Vector of replicated tensors to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceToDestinationCrossReplica() without group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceToDestination() without group arg (for all replicas).
- Deprecated:
Use allReduceToDestinationCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlaceCrossReplica(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() but writes the result back to the input data tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
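A sketch of the in-place form, which avoids allocating a separate output tensor:
#include <gcl/Collectives.hpp>

void sumInPlace(poplar::Graph &graph, poplar::Tensor &grad,
                poplar::program::Sequence &prog) {
  // `grad` is overwritten with its reduction across all replicas.
  gcl::allReduceInPlaceCrossReplica(graph, grad,
                                    popops::CollectiveOperator::ADD, prog,
                                    gcl::CommGroup{}, "sumInPlace");
}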
-
void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceCrossReplica() but writes the result back to the input data tensor.
- Deprecated:
Use allReduceInPlaceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlaceCrossReplica(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlaceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation.
This gives a performance improvement over sequentially reducing one tensor per operation. For short tensors, the potential latency reduction is a factor of 1/(number-of-tensors).
- Parameters
graph – The replicated graph the input tensors belong to.
datas – Vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlace(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlaceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation.
This gives a performance improvement over sequentially reducing one tensor per operation. For short tensors, the potential latency reduction is a factor of 1/(number-of-tensors).
- Deprecated:
Use allReduceInPlaceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensors belong to.
datas – Vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlaceCrossReplica(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlaceCrossReplica() without group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlace() without group arg (for all replicas).
- Deprecated:
Use allReduceInPlaceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlaceCrossReplica(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlaceCrossReplica() with multiple input tensors and without group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensors belong to.
datas – Vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
-
void allReduceInPlace(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlaceCrossReplica() with multiple input tensors and without group arg (for all replicas).
- Deprecated:
Use allReduceInPlaceCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensors belong to.
datas – Vector of replicated tensors to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
-
poplar::Tensor reduceScatterCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Reduce the replicated rank-1 tensor toReduce with the result scattered across the replicas.
For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:
Before:
Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]
After:
Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]
For the syncful implementation, an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica will produce an output of shape [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:
Before:
Replica0: toReduce[x0, y0, z0, w0]
Replica1: toReduce[x1, y1, z1, w1]
Replica2: toReduce[x2, y2, z2, w2]
Replica3: toReduce[x3, y3, z3, w3]
Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]
After:
Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3), 0]
Replica2: result[op(z0, z1, z2, z3), 0]
Replica3: result[ 0, 0]
Mapping: result[ IPU0, IPU1]
Note
Only flat input tensors are currently supported.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
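A usage sketch matching the first example above: with a replication factor of 2 and a 3-element input, each replica receives ceil(3/2) = 2 elements, with the trailing element zero-padded:
#include <gcl/Collectives.hpp>

poplar::Tensor scatterSum(poplar::Graph &graph, const poplar::Tensor &toReduce,
                          poplar::program::Sequence &prog) {
  // Output shape is [ceil(numElements / replicationFactor)].
  return gcl::reduceScatterCrossReplica(graph, toReduce,
                                        popops::CollectiveOperator::ADD, prog,
                                        gcl::CommGroup{}, "scatterSum");
}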
-
poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Reduce the replicated rank-1 tensor toReduce with the result scattered across the replicas.
For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:
Before:
Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]
After:
Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]
For the syncful implementation, an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica will produce an output of shape [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:
Before:
Replica0: toReduce[x0, y0, z0, w0]
Replica1: toReduce[x1, y1, z1, w1]
Replica2: toReduce[x2, y2, z2, w2]
Replica3: toReduce[x3, y3, z3, w3]
Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]
After:
Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3), 0]
Replica2: result[op(z0, z1, z2, z3), 0]
Replica3: result[ 0, 0]
Mapping: result[ IPU0, IPU1]
- Deprecated:
Use reduceScatterCrossReplica() instead.
Note
Only flat input tensors are currently supported.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor reduceScatterCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As reduceScatterCrossReplica() without group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As reduceScatterCrossReplica() without group arg (for all replicas).
- Deprecated:
Use reduceScatterCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allGatherCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Gather the replicated tensor toGather and return the result so each replica will have a copy of all other replicas’ toGather tensors. For instance:
Before:
Replica0: toGather[s,t]
Replica1: toGather[u,v]
Replica2: toGather[w,x]
Replica3: toGather[y,z]
After allGather:
Replica0: result[[s,t], [u,v], [w,x], [y,z]]
Replica1: result[[s,t], [u,v], [w,x], [y,z]]
Replica2: result[[s,t], [u,v], [w,x], [y,z]]
Replica3: result[[s,t], [u,v], [w,x], [y,z]]
For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
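A usage sketch: an input of shape [2] on each of four replicas yields an output of shape [4][2], as in the example above:
#include <gcl/Collectives.hpp>

poplar::Tensor gatherAll(poplar::Graph &graph, const poplar::Tensor &toGather,
                         poplar::program::Sequence &prog) {
  // No reduction operator: the output is [replicationFactor][incomingShape].
  return gcl::allGatherCrossReplica(graph, toGather, prog, gcl::CommGroup{},
                                    "gatherAll");
}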
-
poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Gather the replicated tensor toGather and return the result so each replica will have a copy of all other replicas’ toGather tensors. For instance:
Before:
Replica0: toGather[s,t]
Replica1: toGather[u,v]
Replica2: toGather[w,x]
Replica3: toGather[y,z]
After allGather:
Replica0: result[[s,t], [u,v], [w,x], [y,z]]
Replica1: result[[s,t], [u,v], [w,x], [y,z]]
Replica2: result[[s,t], [u,v], [w,x], [y,z]]
Replica3: result[[s,t], [u,v], [w,x], [y,z]]
For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
- Deprecated:
Use allGatherCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allGatherCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allGatherCrossReplica() without group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allGatherCrossReplica() without group arg (for all replicas).
- Deprecated:
Use allGatherCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allToAllCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-to-all exchange of the elements of the input tensor based on replica ID.
The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.
The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:
Input tensor:
Replica0: Tensor T[x0,x1,x2]
Replica1: Tensor T[y0,y1,y2]
Replica2: Tensor T[z0,z1,z2]
Output tensor:
Replica0: Tensor T[x0,y0,z0]
Replica1: Tensor T[x1,y1,z1]
Replica2: Tensor T[x2,y2,z2]
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
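A usage sketch; the leading dimension of the input must equal the number of replicas, so the exchange behaves like a transpose across replicas:
#include <gcl/Collectives.hpp>

poplar::Tensor exchangeAcrossReplicas(poplar::Graph &graph,
                                      const poplar::Tensor &slices,
                                      poplar::program::Sequence &prog) {
  // `slices` has shape [numReplicas][...]; slice r is delivered to replica r.
  return gcl::allToAllCrossReplica(graph, slices, prog, gcl::CommGroup{},
                                   "allToAll");
}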
-
poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-to-all exchange of the elements of the input tensor based on replica ID.
The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.
The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:
Input tensor:
Replica0: Tensor T[x0,x1,x2]
Replica1: Tensor T[y0,y1,y2]
Replica2: Tensor T[z0,z1,z2]
Output tensor:
Replica0: Tensor T[x0,y0,z0]
Replica1: Tensor T[x1,y1,z1]
Replica2: Tensor T[x2,y2,z2]
- Deprecated:
Use allToAllCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allToAllCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allToAllCrossReplica() without group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allToAllCrossReplica() without group arg (for all replicas).
- Deprecated:
Use allToAllCrossReplica() instead.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See OptionFlags
- Returns
The output tensor, with the content described above.
WithinReplica functions
Collective operations working within replicas.
-
Chunks reduceScatterWithinReplica(poplar::Graph &graph, const poplar::Tensor &toReduce, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Given a tensor of rank 2, reduce across the outermost dimension using the specified reduction operator.
This function assumes index i in the outermost dimension is mapped to IPU i. The result is distributed over IPUs such that each IPU has a slice of the final result.
- Parameters
graph – The graph.
toReduce – The tensor to reduce. Each partial should be mapped identically to the others across the IPUs within the rank.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug information.
options – See OptionFlags
- Returns
A vector of chunks, where chunk i resides on IPU i. The chunks may have different numbers of elements (for example, when the number of IPUs does not exactly divide the number of elements).
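A usage sketch; the returned Chunks can be fed to allGatherWithinReplica() below:
#include <gcl/Collectives.hpp>

gcl::Chunks scatterWithinReplica(poplar::Graph &graph,
                                 const poplar::Tensor &partials,
                                 poplar::program::Sequence &prog) {
  // `partials` is rank 2; the outermost dimension indexes IPUs in the replica.
  return gcl::reduceScatterWithinReplica(graph, partials,
                                         popops::CollectiveOperator::ADD, prog,
                                         "scatterWithin");
}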
-
poplar::Tensor allGatherWithinReplica(poplar::Graph &graph, const Chunks &toGather, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Broadcast data distributed over all IPUs.
This function assumes chunk i is mapped to IPU i.
- Parameters
graph – The graph.
toGather – The chunks to gather.
prog – The program sequence to add operations to.
debugContext – Optional debug information.
options – See OptionFlags
- Returns
A 2D tensor that contains a copy of the data for each IPU. Index i in the outermost dimension of the result is mapped to IPU i.
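Combining the two phases gives the usual two-step all-reduce pattern, which is conceptually what allReduceWithinReplica() below provides in one call:
#include <gcl/Collectives.hpp>

poplar::Tensor twoPhaseAllReduce(poplar::Graph &graph,
                                 const poplar::Tensor &partials,
                                 poplar::program::Sequence &prog) {
  // Phase 1: each IPU ends up with one reduced slice of the result.
  gcl::Chunks chunks = gcl::reduceScatterWithinReplica(
      graph, partials, popops::CollectiveOperator::ADD, prog);
  // Phase 2: broadcast the slices so every IPU holds the full result.
  return gcl::allGatherWithinReplica(graph, chunks, prog);
}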
-
poplar::Tensor allReduceWithinReplica(poplar::Graph &graph, const poplar::Tensor &toReduce, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-reduce operation on the specified tensor.
This operation reduces across the outermost dimension of the input and produces a tensor with the same shape, where the innermost dimension is the result of the reduction and the outermost dimension is a number of copies of the result. This function assumes index i in the outermost dimension of the input is mapped to IPU i. Index i in the outermost dimension of the result is mapped to IPU i.
- Parameters
graph – The graph.
toReduce – The tensor to reduce. Each partial should be mapped identically to the others across the IPUs within the rank.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug information.
options – See OptionFlags
- Returns
A tensor with the same shape as toReduce, where the innermost dimension is the result of the reduction and the outermost dimension has a number of copies of the result.
-
struct Chunk
- #include <Collectives.hpp>
Represents a section of a tensor mapped to an IPU.
Public Functions
-
Chunk() = default
-
struct Chunks
- #include <Collectives.hpp>
A vector of Chunk data.
Public Functions
-
Chunks() = default
-
struct CommGroup
- #include <Collectives.hpp>
Struct to specify sub-groups of replicas.
Examples of derived sub-groups:
IPU-link domain sub-rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size/N
where N is a power of two and replicaGroupSize > 1.
Complete IPU-link domain / full rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size
Using GW-links only:
type == ORTHOGONAL && replicaGroupSize == 64/replica-size
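For instance, a hypothetical configuration for a 64-IPU IPU-link domain with a replica size of 4, so 64/replica-size = 16 replicas per domain:
#include <gcl/Collectives.hpp>

// Complete IPU-link domain / full rack: consecutive group of all 16 replicas.
const gcl::CommGroup fullRack(gcl::CommGroupType::CONSECUTIVE, 16);
// GW-links only: orthogonal slicing with the same group size.
const gcl::CommGroup gwOnly(gcl::CommGroupType::ORTHOGONAL, 16);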
Public Functions
-
CommGroup() = default
-
inline CommGroup(const CommGroupType &groupType, unsigned groupSize)
Construct CommGroup.
- Parameters
groupType – The replica group type.
groupSize – The replica group size.
-
virtual ~CommGroup() = default
Protected Attributes
-
CommGroupType replicaGroupType = CommGroupType::ALL
Replica group type.
-
unsigned replicaGroupSize = 0
Replica group size.
0 means the default size for the given group type.