6.2. Collectives

#include <gcl/Collectives.hpp>
interface OptionFlags

Supported Option flags

collectiveImplementation

The type of collective implementation to use. DEPRECATED

method

The method/topology to use. Acceptable values are anticlockwise_ring, auto, bidirectional_ring_pair, broadcast, clockwise_ring, meet_in_middle_ring, or quad_directional_ring. The default value is auto.

  • anticlockwise_ring: Send fragments anticlockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.

  • auto: Automatically select the most appropriate method.

  • bidirectional_ring_pair: Split the data into two halves and use the clockwise ring algorithm on one half and the anticlockwise ring algorithm on the other in order to fully utilize the links in both directions. The number of fragments is equal to twice the number of IPUs in the ring.

  • broadcast: Broadcast the tensor to all replicas and do the reduce locally. This is the fastest option for small tensors.

  • clockwise_ring: Send fragments clockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.

  • meet_in_middle_ring: Send half the fragments halfway around the ring in the clockwise direction and half the fragments halfway around the ring in the anticlockwise direction, meeting in the middle. The number of fragments is equal to the number of IPUs in the ring. The disadvantage compared to the bidirectional_ring_pair method is that the usage of available bandwidth is not quite optimal; in particular, the final step only uses the links in one direction (assuming an even number of IPUs). The advantage is that it requires fewer steps and allows the use of larger fragments.

  • quad_directional_ring: Divide fragments into four and send each quarter around one of two rings using the mirrored and non-mirrored ring patterns.

syncful.allToAll

Selects the AllToAll implementation. The default value is auto.

  • auto: The implementation is selected automatically.

  • drop_off: Sends all the data forward, dropping off one slice per step.

  • single_slice: Sends single slices directly to the destination.

syncful.maxBroadcastSize

For small tensors it is beneficial to broadcast the tensor to all replicas and do the reduce locally so the network latency cost is paid only once. However, the memory use increases for larger group sizes and data volumes. This option controls the size (the product of number of bytes and group size) beyond which broadcast AllReduce will not be used. It must have a positive integer value, and the default is 2048.
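
As an illustration of the arithmetic: with the default value of 2048, a 4-byte tensor reduced across a group of 256 replicas (4 × 256 = 1024) is small enough for the broadcast AllReduce to be used, while a 32-byte tensor across the same group (32 × 256 = 8192) is not.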

syncful.useForwardingToSupportStridedGroups

This option controls whether the store and forward technique is enabled in GCL. Acceptable values are auto, false, or true. Enabling it is useful if the generated traffic patterns try to go beyond the reachability of the sliding window or can potentially deadlock. When store and forward is enabled, data movement between the replicas is broken down into several steps, where intermediate replicas act as lighthouses that receive and forward the data on the way to the destination. This extends the reachability of the sliding window and may decrease the number of overlapping communication rings, which breaks cyclic dependencies in the network. There are situations where this option is not supported; in these cases an exception is thrown if it is enabled. The auto alternative enables the store and forward technique in all cases where it is supported.

syncful.useOptimisedLayout

If the input tensor has been allocated in a GCL friendly way, then reusing the same layout for the source buffer will minimise code when copying fragments to it. This is the default behaviour (and can be explicitly set by setting this option to true). Turning off this behaviour (by setting the option to false) might reduce the cycle count at the cost of higher memory usage.
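
All of the options above are passed as string key/value pairs via poplar::OptionFlags. A minimal sketch, assuming an existing replicated graph graph, a replicated tensor data, and a program sequence prog:

#include <gcl/Collectives.hpp>
#include <poplar/OptionFlags.hpp>

poplar::OptionFlags options{
    {"method", "bidirectional_ring_pair"},   // force a specific topology
    {"syncful.maxBroadcastSize", "4096"},    // raise the broadcast threshold
    {"syncful.useOptimisedLayout", "true"}}; // the default, made explicit

poplar::Tensor result = gcl::allReduceCrossReplica(
    graph, data, gcl::CollectiveOperator::ADD, prog, {}, "allReduce", options);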

Defines

GCL_NO_DISCARD

Produces a compile-time warning for unused return values.

namespace gcl

Graphcore Communication Library

CrossReplica functions

Collective operations working across replicas.

poplar::Tensor allGatherCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an AllGather operation.

This gathers the replicated tensor data and returns the result, so that each replica will have a copy of all other replicas’ data tensors. For instance:

Before:

Replica0: data[s,t]
Replica1: data[u,v]
Replica2: data[w,x]
Replica3: data[y,z]

After:

Replica0: result[[s,t], [u,v], [w,x], [y,z]]
Replica1: result[[s,t], [u,v], [w,x], [y,z]]
Replica2: result[[s,t], [u,v], [w,x], [y,z]]
Replica3: result[[s,t], [u,v], [w,x], [y,z]]

For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to AllGather.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensor with the content described above.
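
A minimal usage sketch, assuming graph is a replicated graph with a replication factor of 4 and data is a replicated tensor of shape [2]:

poplar::Tensor gathered = gcl::allGatherCrossReplica(graph, data, prog);
// Every replica now holds `gathered` of shape [4][2], containing all
// four replicas' copies of `data` in replica order.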

std::vector<poplar::Tensor> allGatherCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allGatherCrossReplica() but with multiple tensors.

This performs an AllGather operation that takes a vector of input tensors and returns a vector of output tensors.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to AllGather.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensors with the content described above.

void allGatherToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, const poplar::Tensor &destination, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allGatherCrossReplica() but writes the result to the destination tensor.

Note

The destination tensor must be mapped to IPUs in the same way as the data tensor.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to AllGather.

  • destination – Tensor to write the result to.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allGatherToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allGatherToDestinationCrossReplica() but with multiple tensors.

This is akin to the allGatherCrossReplica() overload taking multiple tensors.

Note

The destination tensors must be mapped to IPUs in the same way as the data tensor.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to AllGather.

  • destinations – Tensors to write the results to; the provided vector must have the correct size.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

poplar::Tensor allReduceCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Performs an AllReduce operation.

The operation is performed on the provided tensor over the replicas specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor with the same shape as the input, where the output tensors of all replicas hold the same data. For instance:

Before:

Replica0: data[x0,y0]
Replica1: data[x1,y1]
Replica2: data[x2,y2]
Replica3: data[x3,y3]

After:

Replica0: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica1: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica2: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica3: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to AllReduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensor with the content described above.

std::vector<poplar::Tensor> allReduceCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but with multiple tensors.

This performs an AllReduce operation on multiple tensors, batching them up to be executed as a single collective operation. This gives a performance improvement over sequentially reducing one tensor per operation; for short tensors the potential latency reduction is 1/(number-of-tensors).

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to AllReduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensors with the content described above. Each of these tensors contains the reduction of the corresponding tensor in datas across all replicas.
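
A sketch of the batching described above, assuming grad0, grad1, and grad2 are replicated tensors:

std::vector<poplar::Tensor> grads = {grad0, grad1, grad2};
std::vector<poplar::Tensor> reduced = gcl::allReduceCrossReplica(
    graph, grads, gcl::CollectiveOperator::ADD, prog);
// One collective instead of three; reduced[i] holds the cross-replica
// sum of grads[i].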

void allReduceInPlaceCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but writes result back to the input data tensor.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to AllReduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

void allReduceInPlaceCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceInPlaceCrossReplica() but with multiple tensors.

This is akin to allReduceCrossReplica() taking multiple tensors.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to AllReduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, const poplar::Tensor &destination, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but writes the result to the destination tensor.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to AllReduce.

  • destination – Tensor to write the result to.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

void allReduceToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceToDestinationCrossReplica() but with multiple tensors.

This is akin to allReduceCrossReplica() taking multiple tensors.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to AllReduce.

  • destinations – Tensors to write the results to; the provided vector must have the correct size.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

poplar::Tensor allToAllCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an AllToAll operation.

This does an all-to-all exchange of the elements of the input tensor based on replica ID. The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension is used to split up the tensor being sent, with each replica sending every split except the one whose index matches its own replica ID.

The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:

Before:

Replica0: data[a0,a1,a2,a3]
Replica1: data[b0,b1,b2,b3]
Replica2: data[c0,c1,c2,c3]
Replica3: data[d0,d1,d2,d3]

After:

Replica0: result[a0,b0,c0,d0]
Replica1: result[a1,b1,c1,d1]
Replica2: result[a2,b2,c2,d2]
Replica3: result[a3,b3,c3,d3]

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor for AllToAll exchange.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensor with the content described above.
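
A minimal sketch, assuming a replication factor of 4 so that the first dimension of data must be 4 (for example, shape [4][128]):

poplar::Tensor exchanged = gcl::allToAllCrossReplica(graph, data, prog);
// Replica r now holds slice r of every replica's input, ordered by the
// sending replica's ID.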

poplar::Tensor broadcastCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group = {}, unsigned rootReplica = 0, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform a Broadcast operation.

This does a broadcast from one replica (the rootReplica) to all other replicas. For instance:

Before:

Replica0: data[a0,a1,a2,a3] // <-- rootReplica
Replica1: data[b0,b1,b2,b3]
Replica2: data[c0,c1,c2,c3]
Replica3: data[d0,d1,d2,d3]

After:

Replica0: result[a0,a1,a2,a3]
Replica1: result[a0,a1,a2,a3]
Replica2: result[a0,a1,a2,a3]
Replica3: result[a0,a1,a2,a3]

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to Broadcast.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • rootReplica – The replica ID to use as source for the broadcast.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensor with the content described above.
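
A minimal sketch, broadcasting replica 0’s copy of data to all replicas in the default (all-replica) group:

poplar::Tensor result = gcl::broadcastCrossReplica(
    graph, data, prog, {}, /*rootReplica=*/0);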

poplar::Tensor reduceScatterCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform a ReduceScatter operation.

This reduces the replicated rank-1 tensor data, with the result scattered across the replicas. For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:

Before:

Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]

After:

Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]

Multi-IPU mapped input

If an input of shape [numElementsIPU0 + numElementsIPU1 + ...] is mapped to multiple IPUs per replica, the output will have shape [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + ...] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:

Before:

Replica0: toReduce[  x0,   y0,   z0,   w0]
Replica1: toReduce[  x1,   y1,   z1,   w1]
Replica2: toReduce[  x2,   y2,   z2,   w2]
Replica3: toReduce[  x3,   y3,   z3,   w3]
Mapping:  toReduce[IPU0, IPU0, IPU0, IPU1]

After:

Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3),                  0]
Replica2: result[op(z0, z1, z2, z3),                  0]
Replica3: result[                 0,                  0]
Mapping:  result[              IPU0,               IPU1]

Note

Only flat input tensors are supported.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to ReduceScatter.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensor with the content described above.
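
A minimal sketch of the shape arithmetic, assuming a replication factor of 4 and a flat data tensor of 10 elements mapped to a single IPU per replica:

poplar::Tensor scattered = gcl::reduceScatterCrossReplica(
    graph, data, gcl::CollectiveOperator::ADD, prog);
// Each replica's output has shape [ceil(10 / 4)] = [3]; since 4 does
// not evenly divide 10, the trailing elements are zero-padded.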

std::vector<poplar::Tensor> reduceScatterCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As reduceScatterCrossReplica() but with multiple tensors.

This performs a ReduceScatter operation that takes a vector of input tensors and returns a vector of output tensors.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to ReduceScatter.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

The tensors with the content described above.

void reduceScatterToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, const poplar::Tensor &destination, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As reduceScatterCrossReplica() but writes the result to the destination tensor.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to ReduceScatter.

  • destination – Tensor to write the result to.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void reduceScatterToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group = {}, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As reduceScatterToDestinationCrossReplica() but with multiple tensors.

This is akin to reduceScatterCrossReplica() taking multiple tensors.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to ReduceScatter.

  • destinations – Tensors to write the results to; the provided vector must have the correct size.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

WithinReplica functions

Collective operations working within replicas.

poplar::Tensor allGatherWithinReplica(poplar::Graph &graph, const Chunks &toGather, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an AllGather operation.

Broadcast data distributed over all IPUs. This function assumes chunk i is mapped to IPU i. For instance:

Before:

Chunks = [
  [ ], // IPU0 (index=0, offset=0)
  [z], // IPU1 (index=3, offset=0)
  [x], // IPU2 (index=1, offset=0)
  [y]  // IPU3 (index=2, offset=0)
]

After:

result = [
  [x,y,z], // IPU0
  [x,y,z], // IPU1
  [x,y,z], // IPU2
  [x,y,z]  // IPU3
]

Note

Multi-IPU ranks (more than one IPU per rank) are not supported.

Parameters
  • graph – The graph.

  • toGather – The chunks to AllGather.

  • prog – The program sequence to add operations to.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

A 2D tensor that contains a copy of the data for each IPU. Index i in the outermost dimension of the result is mapped to IPU i.

poplar::Tensor allReduceWithinReplica(poplar::Graph &graph, const poplar::Tensor &toReduce, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an AllReduce operation.

This operation reduces across the outermost dimension of the input and produces a tensor of the same shape, where the innermost dimension holds the result of the reduction and the outermost dimension holds a number of copies of the result.

The function assumes index i in the outermost dimension of the input is mapped to IPU i. Index i in the outermost dimension of the result is mapped to IPU i. For instance:

Before:

toReduce = [
  [x0,y0], // IPU0
  [x1,y1], // IPU1
  [x2,y2], // IPU2
  [x3,y3], // IPU3
]

After:

result = [
  [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU0
  [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU1
  [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU2
  [op(x0,x1,x2,x3), op(y0,y1,y2,y3)]  // IPU3
]

Parameters
  • graph – The graph.

  • toReduce – The tensor to AllReduce. Each partial should be mapped identically to the others across the IPUs within the rank.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

A tensor with the same shape as toReduce, where the innermost dimension is the result of the reduction and the outermost dimension has a number of copies of the result.

Chunks reduceScatterWithinReplica(poplar::Graph &graph, const poplar::Tensor &toReduce, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform a ReduceScatter operation.

Given a tensor of rank 2, reduce across the outermost dimension using the specified reduction operator. This function assumes index i in the outermost dimension is mapped to IPU i. The result is distributed over IPUs such that each IPU has a slice of the final result. For instance:

Before:

data = [
  [x0,y0,z0], // IPU0
  [x1,y1,z1], // IPU1
  [x2,y2,z2], // IPU2
  [x3,y3,z3]  // IPU3
]

After:

Chunks = [
  [],                // IPU0 (index=0, offset=0)
  [op(z0,z1,z2,z3)], // IPU1 (index=3, offset=0)
  [op(x0,x1,x2,x3)], // IPU2 (index=1, offset=0)
  [op(y0,y1,y2,y3)]  // IPU3 (index=2, offset=0)
]

Note

Multi-IPU ranks (more than one IPU per rank) are not supported.

Parameters
  • graph – The graph.

  • toReduce – The tensor to ReduceScatter. Each partial should be mapped identically to the others across the IPUs within the rank.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – An optional debug context.

  • options – See OptionFlags.

Returns

A vector of chunks, where chunk i resides on IPU i. The chunks may have different numbers of elements (for example, when the number of IPUs does not exactly divide the number of elements).
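
A minimal sketch of the scatter/gather round trip within a replica, assuming toReduce is a rank-2 tensor with one row per IPU:

gcl::Chunks scattered = gcl::reduceScatterWithinReplica(
    graph, toReduce, gcl::CollectiveOperator::ADD, prog);
poplar::Tensor gathered = gcl::allGatherWithinReplica(graph, scattered, prog);
// Chunks::concat() reassembles the scattered pieces in offset/index order:
poplar::Tensor flat = scattered.concat();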

Enums

enum class CommGroupType

Enumeration to define communication group specification.

Assumption: replica groups are uniform in size and layout on IPUs.

Values:

enumerator ALL

All replicas are viewed as one group.

enumerator CONSECUTIVE

Each group contains a number of consecutive replicas.

If there are N replicas denoted {0, ... N-1} and the group size is k, then there are N/k groups of size k:

{0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}

enumerator ORTHOGONAL

Groups are sliced orthogonally to the replica ordering.

Each group contains replicas separated by a stride equal to the number of groups.

If there are N replicas denoted {0, ... N-1} and the group size is k, then there are m = N/k groups of size k:

{0, m, 2m, ...}, {1, m+1, 2m+1, ...} ... {m-1, 2m-1, ... N-1}
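
As a concrete illustration (not part of the header), with N = 8 replicas and a group size k = 2:

CONSECUTIVE: {0,1}, {2,3}, {4,5}, {6,7}
ORTHOGONAL:  {0,4}, {1,5}, {2,6}, {3,7}  // m = N/k = 4 groups, stride m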

enum class CollectiveOperator

Supported collective operators.

Values:

enumerator ADD

Sum all the elements.

enumerator MEAN

Calculate the mean of all the elements.

enumerator MUL

Multiply all the elements.

enumerator MIN

Return the minimum of all the elements.

enumerator MAX

Return the maximum of all the elements.

enumerator LOGICAL_AND

Logical and of all the elements.

Only supports boolean operands.

enumerator LOGICAL_OR

Logical or of all the elements.

Only supports boolean operands.

enumerator SQUARE_ADD

Square each element before applying the ADD reduction.

enumerator LOCAL

Do nothing and keep the local value.

Functions

std::istream &operator>>(std::istream &inStream, CollectiveOperator &op)

Parse a token from the input stream inStream to op.

Valid input values are the stringified enumerations, for example “ADD” or “MUL”.

Parameters
  • inStream – The input stream to read from.

  • op – The collective operator parsed from the input stream.

Returns

The original input stream.

std::ostream &operator<<(std::ostream &outStream, const CollectiveOperator &op)

Write op to the output stream outStream.

The value written is the stringified enumeration, for example “ADD” or “MUL”.

Parameters
  • outStream – The output stream to write to.

  • op – The collective operator printed to the output stream.

Returns

The original output stream.
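
A minimal sketch of round-tripping an operator through the stream operators above:

#include <sstream>

std::ostringstream out;
out << gcl::CollectiveOperator::ADD; // writes "ADD"

gcl::CollectiveOperator op{};
std::istringstream in("MUL");
in >> op;                            // op == gcl::CollectiveOperator::MUL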

struct Chunk
#include <Collectives.hpp>

Represents a section of a tensor mapped to an IPU.

Public Functions

Chunk() = default

Construct an empty invalid chunk.

Chunk(poplar::Tensor tensor, unsigned index, unsigned offset)

A section of a tensor mapped to an IPU.

Parameters
  • tensor – The mapped tensor.

  • index – The ring index (data parallel index).

  • offset – The offset within rank (model parallel index).

poplar::Tensor getTensor() const

Get the mapped tensor.

Returns

The mapped tensor.

unsigned getIndex() const

Get the ring index (the data-parallel index).

Returns

The ring index.

unsigned getOffset() const

Get the offset within the rank (the model-parallel index).

Returns

The offset within the rank.

void setTensor(poplar::Tensor tensor)

Set mapped tensor.

Parameters

tensor – The mapped tensor.

void setIndex(unsigned index)

Set ring index.

Parameters

index – The ring index.

void setOffset(unsigned offset)

Set the offset.

Parameters

offset – The offset within the rank.

Private Members

poplar::Tensor mTensor

Mapped tensor.

unsigned mIndex = 0

Ring index (data parallel index)

unsigned mOffset = 0

Offset within rank (model parallel index)

struct Chunks
#include <Collectives.hpp>

A vector of Chunk data.

Public Functions

Chunks() = default

Construct an empty chunks object.

inline explicit Chunks(unsigned size)

A vector of Chunk data.

Parameters

size – Length of the chunk vector.

poplar::Tensor getOriginalInput() const

Used to undo shuffles introduced by scatter.

Returns

The original input.

const std::vector<Chunk> &getChunks() const

Chunks produced by the scatter step.

Returns

The chunks created by the scatter.

void setOriginalInput(poplar::Tensor input)

Set original input.

Parameters

input – The original input.

void setChunk(std::vector<Chunk>::size_type i, Chunk chunk)

Set a chunk.

Parameters
  • i – The chunk index.

  • chunk – The new chunk.

void setChunks(std::vector<Chunk> chunks)

Set chunks produced by scatter step.

Parameters

chunks – The produced chunks.

poplar::Tensor concat() const

Concatenate chunks.

Create and return a tensor that consists of the concatenated and sorted Chunk elements. The Chunk elements are sorted primarily by offset and secondarily by index. This operation is performed on the output of the reduceScatterWithinReplica() and on the input of the allGatherWithinReplica() operations.

Returns

A tensor consisting of the concatenated and sorted Chunk elements.

Private Members

poplar::Tensor mOriginalInput

Used to undo shuffles introduced by scatter.

std::vector<Chunk> mChunks

Chunks produced by the scatter step.

struct CommGroup
#include <Collectives.hpp>

Structure to specify sub-groups of replicas.

Examples of derived sub-groups:

  • IPU-link domain sub-rack:

    type == CONSECUTIVE && replicaGroupSize == (ipuLinkDomainSize/replicaSize)/N
    
where N is a power of two and replicaGroupSize > 1.

  • Complete IPU-link domain / full rack:

    type == CONSECUTIVE && replicaGroupSize == ipuLinkDomainSize/replicaSize
    

  • Using GW-links only:

    type == ORTHOGONAL && replicaGroupSize == numberOfIpuLinkDomains
    

Public Functions

CommGroup() = default

Construct a CommGroup where all replicas are viewed as one group.

CommGroup(const CommGroupType groupType, unsigned groupSize, unsigned replicaStride = 1)

Construct a CommGroup with the given specification.

Parameters
  • groupType – Replica group type.

  • groupSize – Number of replicas in the group.

  • replicaStride – Replica group stride.

virtual ~CommGroup() = default
unsigned size(poplar::Graph &graph) const

Get the size of the group in number of replicas.

This either returns a preset non-zero value or calculates the size if the predefined value is zero.

Parameters

graph – The graph this will be used on.

Returns

The number of replicas in the group.

inline CommGroupType type() const

Get the group type as documented in CommGroupType.

Returns

The group type passed to the constructor.

inline unsigned stride() const

Get the stride of the group.

The stride is 1 by default, which defines the standard ring patterns for the collective operations. A stride value greater than 1 divides the groups into multiple independent groups. For example:

For 8 replicas and a flat ring pattern you would get:

0, 1, 2, 3, 4, 5, 6, 7

But with a stride of 2 you will have two interleaved rings:

0, 2, 4, 6
1, 3, 5, 7

Stride is only supported for CommGroupType::CONSECUTIVE.

Returns

The stride, in number of replicas, that was passed to the constructor.
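
A minimal sketch combining group type, size, and stride; with 8 replicas this forms the interleaved groups {0,2,4,6} and {1,3,5,7} shown above:

gcl::CommGroup group(gcl::CommGroupType::CONSECUTIVE, /*groupSize=*/4,
                     /*replicaStride=*/2);
poplar::Tensor result = gcl::allReduceCrossReplica(
    graph, data, gcl::CollectiveOperator::ADD, prog, group);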

Protected Attributes

CommGroupType mReplicaGroupType = CommGroupType::ALL

Replica group type.

unsigned mReplicaGroupSize = 0

Replica group size.

0 indicates the default size for the group type.

unsigned mReplicaGroupStride = 1

Replica group stride.

0 indicates the default replica stride for the group type.

Friends

friend std::ostream &operator<<(std::ostream &outStream, const CommGroup &group)

Output a string representation of the CommGroup.

Parameters
  • outStream – The output stream to write to.

  • group – The group to output as a string.