Collectives

interface OptionFlags

Supported Option flags:

  • useSynclessCollectives (true, false, hybrid, auto) [=auto]

    Type of collective implementation to use.

    • auto: Choose the appropriate implementation for the operation in question. At the moment this is the same as ‘false’.

    • true: Use the syncless implementation. Deprecated: please use Syncless instead.

    • false: Use the syncful implementation. Deprecated: please use Syncful instead.

    • hybrid: Use syncful over IPU-Links and syncless over gw links. Deprecated: please use Hybrid instead.

    • Syncless: Use the syncless implementation.

    • Syncful: Use the syncful implementation.

    • Hybrid: Use syncful over IPU-Links and syncless over gw links.

  • maxBytesPerTile Integer [=35000]

    The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.

  • topology (auto, rung-ring-2, rung-ring-4, rung-ring-8, ring-on-line, peripheral-ring) [=auto]

    The topology to use for the syncful implementation. If you do not specify this option, the topology is detected automatically.

    • auto: Topology automatically selected based on the current graph.

    • rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.

    • rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.

    • rung-ring-8: Relevant for replica size 8. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.

    • ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU-Link mesh.

    • peripheral-ring: Relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU-Link mesh.

  • link (auto-link, ipu-link, gw-link) [=auto-link]

    The link type to use between IPUs.

    • auto-link: Use the link type appropriate for the operation.

    • ipu-link: Use the IPU-Links.

    • gw-link: Use the GW-Links.

  • method (auto, clockwise_ring, anticlockwise_ring, bidirectional_ring_pair, meet_in_middle_ring, quad_directional_ring) [=auto]

    The method/topology to be used.

    • auto: Automatically select the best method.

    • clockwise_ring: Send fragments clockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.

    • anticlockwise_ring: Send fragments anticlockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.

    • bidirectional_ring_pair: Split the data into two halves and use the clockwise ring algorithm on one half and the anticlockwise ring algorithm on the other in order to fully utilize the links in both directions. The number of fragments is equal to twice the number of IPUs in the ring.

    • meet_in_middle_ring: Send half the fragments halfway around the ring in the clockwise direction and half the fragments halfway around the ring in the anticlockwise direction, meeting in the middle. The number of fragments is equal to the number of IPUs in the ring. The disadvantage compared to the “bidirectional_ring_pair” method is that the usage of available bandwidth is not quite optimal, in particular the final step only uses the links in one direction (assuming an even number of IPUs). The advantage is that it requires fewer steps and allows the use of larger fragments.

    • quad_directional_ring: Divide fragments in four and send each quarter around one of two rings using the mirrored and non-mirrored ring pattern.

  • syncful.useOptimisedLayout (true, false) [=true]

    If the input tensor has been allocated in a GCL-friendly way, reusing the same layout for the srcBuffer will minimise code when copying fragments to the srcBuffer. Turning this off might reduce the cycle count at the cost of higher memory usage.

#include <gcl/Collectives.hpp>

Defines

GCL_DEPRECATED(x)

Function scheduled for removal.

GCL_NO_DISCARD

Produce a compile-time warning for unused return values.

popops_CollectiveTypes_hpp

Include guard from a deprecated header.

namespace gcl

Graphcore Communications Library.

CrossReplica functions

Collective operations working across replicas.

GCL_NO_DISCARD poplar::Tensor allReduceCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-reduce operation.

The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor with the same shape as the input, where all replicas' output tensors contain the same data. For instance:

Before:

Replica0: data[x0,y0]
Replica1: data[x1,y1]
Replica2: data[x2,y2]
Replica3: data[x3,y3]

After:

Replica0: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica1: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica2: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica3: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
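As an illustration only (this is not the GCL API; the function and argument names below are invented for the sketch), the data movement above can be modelled in plain Python:

```python
from functools import reduce

def all_reduce(replica_data, op):
    """Model of the allReduceCrossReplica semantics: every replica ends up
    with the element-wise reduction of all replicas' tensors."""
    width = len(replica_data[0])
    reduced = [reduce(op, (rep[i] for rep in replica_data)) for i in range(width)]
    return [list(reduced) for _ in replica_data]

# Four replicas, each holding a two-element tensor, reduced with ADD:
result = all_reduce([[1, 2], [3, 4], [5, 6], [7, 8]], lambda a, b: a + b)
# Every replica now holds [1+3+5+7, 2+4+6+8] == [16, 20].
```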

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A replicated tensor with the reduction of data.

GCL_NO_DISCARD std::vector< poplar::Tensor > allReduceCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-reduce operation on multiple tensors.

As allReduceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation. This gives a performance improvement over sequentially reducing one tensor per operation; for short tensors the potential latency reduction is a factor of 1/(number-of-tensors).

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The vector of replicated tensors to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A vector of replicated tensors. Each tensor contains the reduction of the corresponding tensor in datas across all replicas.

GCL_NO_DISCARD poplar::Tensor allReduceCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allReduceCrossReplica() without the group arg (for all replicas).

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A replicated tensor with the reduction of data.

GCL_NO_DISCARD std::vector< poplar::Tensor > allReduceCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allReduceCrossReplica() with multiple input tensors and without the group arg.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – A vector of replicated tensors to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A vector of replicated tensors. Each tensor contains the reduction of the corresponding tensor in datas across all replicas.

void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but writes the result to the destination tensor.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • destination – Tensor to write the result to.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allReduceToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceToDestinationCrossReplica() with multiple input and output tensors.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – Vector of replicated tensors to reduce.

  • destinations – Vector of replicated tensors to write the result to.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceToDestinationCrossReplica() without group arg.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • destination – Tensor to write the result to.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allReduceInPlaceCrossReplica(poplar::Graph &graph, poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but writes result back to the input data tensor.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allReduceInPlaceCrossReplica(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an all-reduce operation on multiple tensors, writing the results back to the input tensors in datas.

As allReduceInPlaceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation. This gives a performance improvement over sequentially reducing one tensor per operation; for short tensors the potential latency reduction is a factor of 1/(number-of-tensors).

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • datas – Vector of replicated tensors to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allReduceInPlaceCrossReplica(poplar::Graph &graph, poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceInPlaceCrossReplica() without group arg.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

void allReduceInPlaceCrossReplica(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceInPlaceCrossReplica() with multiple input tensors and without group arg.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • datas – Vector of replicated tensors to reduce.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

GCL_NO_DISCARD poplar::Tensor reduceScatterCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Reduce the replicated rank-1 tensor data with the result scattered across the replicas.

For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:

Before:

Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]

After:

Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]
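The reduce-scatter semantics, including the zero-padding when replicationFactor does not evenly divide numElements, can be modelled in plain Python (an illustration only, not the GCL API; the names below are invented for the sketch):

```python
import math
from functools import reduce

def reduce_scatter(replica_data, op):
    """Model of reduceScatterCrossReplica for a single-IPU mapping:
    reduce element-wise across replicas, then scatter the result in
    contiguous chunks of ceil(n / replicas), zero-padding the tail."""
    num_replicas = len(replica_data)
    n = len(replica_data[0])
    chunk = math.ceil(n / num_replicas)
    reduced = [reduce(op, (rep[i] for rep in replica_data)) for i in range(n)]
    reduced += [0] * (chunk * num_replicas - n)  # zero-pad uneven division
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(num_replicas)]

# Two replicas, three elements each (as in the example above), op = ADD:
out = reduce_scatter([[1, 2, 3], [4, 5, 6]], lambda a, b: a + b)
# out == [[5, 7], [9, 0]]: replica 1 receives the zero padding.
```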

Multi-IPU mapped input

For the syncful implementation, an input of shape [numElementsIPU0 + numElementsIPU1 + ...] mapped to multiple IPUs per replica produces an output of shape [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + ...], with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:

Before:

Replica0: toReduce[  x0,   y0,   z0,   w0]
Replica1: toReduce[  x1,   y1,   z1,   w1]
Replica2: toReduce[  x2,   y2,   z2,   w2]
Replica3: toReduce[  x3,   y3,   z3,   w3]
Mapping:  toReduce[IPU0, IPU0, IPU0, IPU1]

After:

Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3),                  0]
Replica2: result[op(z0, z1, z2, z3),                  0]
Replica3: result[                 0,                  0]
Mapping:  result[              IPU0,               IPU1]
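The per-IPU grouping above can likewise be sketched in plain Python (an illustration of the documented behaviour, not the GCL API; the ipu_of mapping argument is invented for the sketch):

```python
import math
from functools import reduce

def reduce_scatter_per_ipu(replica_data, ipu_of, op):
    """Model of the multi-IPU mapping: each IPU's elements are reduced and
    scattered independently, and each replica's output concatenates its
    slice from every IPU group (zero-padded per IPU)."""
    num_replicas = len(replica_data)
    groups = {}
    for i, ipu in enumerate(ipu_of):  # preserve element order within each IPU
        groups.setdefault(ipu, []).append(i)
    out = [[] for _ in range(num_replicas)]
    for ipu in sorted(groups):
        vals = [reduce(op, (rep[i] for rep in replica_data)) for i in groups[ipu]]
        chunk = math.ceil(len(vals) / num_replicas)
        vals += [0] * (chunk * num_replicas - len(vals))  # pad per IPU
        for r in range(num_replicas):
            out[r].extend(vals[r * chunk:(r + 1) * chunk])
    return out

# Four replicas; elements 0-2 live on IPU0, element 3 on IPU1 (op = ADD):
data = [[r, 10 + r, 20 + r, 30 + r] for r in range(4)]
out = reduce_scatter_per_ipu(data, [0, 0, 0, 1], lambda a, b: a + b)
# out == [[6, 126], [46, 0], [86, 0], [0, 0]], matching the pattern above.
```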

Note

Only flat input tensors are currently supported.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce scatter.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

GCL_NO_DISCARD std::vector< poplar::Tensor > reduceScatterCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As reduceScatterCrossReplica() but with vector input argument and vector output as return value.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • datas – The replicated tensors to reduce scatter.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensors, with the content described above.

void reduceScatterToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As reduceScatterCrossReplica() but with vector input/output arguments.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • datas – The replicated tensors to reduce scatter.

  • destinations – Output tensors which must have correct type/shape.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

GCL_NO_DISCARD poplar::Tensor reduceScatterCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As reduceScatterCrossReplica() without group arg.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce scatter.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

GCL_NO_DISCARD poplar::Tensor allGatherCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Gather the replicated tensor data.

The result is returned so that each replica holds a copy of every replica's data tensor. For instance:

Before:

Replica0: data[s,t]
Replica1: data[u,v]
Replica2: data[w,x]
Replica3: data[y,z]

After:

Replica0: result[[s,t], [u,v], [w,x], [y,z]]
Replica1: result[[s,t], [u,v], [w,x], [y,z]]
Replica2: result[[s,t], [u,v], [w,x], [y,z]]
Replica3: result[[s,t], [u,v], [w,x], [y,z]]

For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
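The gather semantics can be modelled in plain Python (an illustration only, not the GCL API; the names below are invented for the sketch):

```python
def all_gather(replica_data):
    """Model of allGatherCrossReplica: every replica receives a copy of
    all replicas' tensors, stacked along a new outermost dimension."""
    gathered = [list(d) for d in replica_data]
    return [[list(d) for d in gathered] for _ in replica_data]

# Four replicas, each holding a two-element tensor (as in the example above):
out = all_gather([["s", "t"], ["u", "v"], ["w", "x"], ["y", "z"]])
# Each replica now holds [['s','t'], ['u','v'], ['w','x'], ['y','z']].
```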

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to gather.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

GCL_NO_DISCARD std::vector< poplar::Tensor > allGatherCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allGatherCrossReplica() but with vector input argument and vector output as return value.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • datas – The replicated tensors to gather.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensors, with the content described above.

void allGatherToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allGatherCrossReplica() but with vector input/output arguments.

Note

The destination tensors must be mapped to IPUs in the same way as the data tensors.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • datas – The replicated tensors to gather.

  • destinations – Output tensors which must have correct type/shape.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

GCL_NO_DISCARD poplar::Tensor allGatherCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allGatherCrossReplica() without group arg.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to gather.

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

GCL_NO_DISCARD poplar::Tensor allToAllCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-to-all exchange of the elements of the input tensor based on replica ID.

The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.

The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:

Before:

Replica0: data[x0,x1,x2]
Replica1: data[y0,y1,y2]
Replica2: data[z0,z1,z2]

After:

Replica0: result[x0,y0,z0]
Replica1: result[x1,y1,z1]
Replica2: result[x2,y2,z2]
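The exchange amounts to a transpose across replicas, which can be modelled in plain Python (an illustration only, not the GCL API; the names below are invented for the sketch):

```python
def all_to_all(replica_data):
    """Model of allToAllCrossReplica: replica d's output slot s holds the
    slice data[s][d] sent by replica s (a transpose across replicas)."""
    n = len(replica_data)
    return [[replica_data[src][dst] for src in range(n)] for dst in range(n)]

# Three replicas, three slices each (as in the example above):
out = all_to_all([["x0", "x1", "x2"],
                  ["y0", "y1", "y2"],
                  ["z0", "z1", "z2"]])
# out == [['x0','y0','z0'], ['x1','y1','z1'], ['x2','y2','z2']]
```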

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to aggregate.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

GCL_NO_DISCARD poplar::Tensor allToAllCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allToAllCrossReplica() without group arg.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to aggregate.

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

GCL_NO_DISCARD poplar::Tensor broadcastCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group={}, unsigned rootReplica=0, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform a broadcast from one replica to all other replicas.

Before:

Replica0: data[x0,x1,x2] // <-- rootReplica
Replica1: data[y0,y1,y2]
Replica2: data[z0,z1,z2]

After:

Replica0: result[x0,x1,x2]
Replica1: result[x0,x1,x2]
Replica2: result[x0,x1,x2]
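The broadcast semantics can be modelled in plain Python (an illustration only, not the GCL API; the names below are invented for the sketch):

```python
def broadcast(replica_data, root_replica=0):
    """Model of broadcastCrossReplica: every replica's result is a copy
    of the root replica's tensor."""
    return [list(replica_data[root_replica]) for _ in replica_data]

# Three replicas; replica 0 is the root (as in the example above):
out = broadcast([["x0", "x1", "x2"], ["y0", "y1", "y2"], ["z0", "z1", "z2"]])
# out == [['x0','x1','x2']] * 3
```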

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to broadcast.

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • rootReplica – The replica ID to use as the source for the broadcast.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

WithinReplica functions

Collective operations working within replicas.

poplar::Tensor concatChunks(Chunks chunks)

Concatenates chunks.

Given a vector of Chunk data, its elements are sorted according to their offset or index, and a tensor consisting of the sorted, concatenated Chunk elements is returned. This operation is performed on the output of the reduceScatterWithinReplica operation and on the input of the allGatherWithinReplica operation.

Parameters

chunks – A structure containing a vector of Chunk data.

Returns

A concatenated tensor consisting of sorted Chunk elements.

GCL_NO_DISCARD Chunks reduceScatterWithinReplica (poplar::Graph &graph, const poplar::Tensor &toReduce, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Reduce a rank-2 tensor.

Given a tensor of rank 2, reduce across the outermost dimension using the specified reduction operator. This function assumes index i in the outermost dimension is mapped to IPU i. The result is distributed over IPUs such that each IPU has a slice of the final result.

Before:

data = [
         [x0,y0,z0], // IPU0
         [x1,y1,z1], // IPU1
         [x2,y2,z2], // IPU2
         [x3,y3,z3]  // IPU3
       ]

After:

Chunks = [
           [],                // IPU0 (index=0, offset=0)
           [op(z0,z1,z2,z3)], // IPU1 (index=3, offset=0)
           [op(x0,x1,x2,x3)], // IPU2 (index=1, offset=0)
           [op(y0,y1,y2,y3)]  // IPU3 (index=2, offset=0)
         ]

Note

Multi-IPU ranks (more than one IPU per rank) are not yet supported.

Parameters
  • graph – The graph.

  • toReduce – The tensor to reduce. Each partial should be mapped identically to the others across the IPUs within the rank.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug information.

  • options – See OptionFlags.

Returns

A vector of chunks, where chunk i resides on IPU i. The chunks may have different numbers of elements (for example, when the number of IPUs does not exactly divide the number of elements).

GCL_NO_DISCARD poplar::Tensor allGatherWithinReplica (poplar::Graph &graph, const Chunks &toGather, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Broadcast data distributed over all IPUs.

This function assumes chunk i is mapped to IPU i.

Before:

Chunks = [
           [ ], // IPU0 (index=2, offset=0)
           [z], // IPU1 (index=1, offset=0)
           [x], // IPU2 (index=3, offset=0)
           [y]  // IPU3 (index=0, offset=0)
         ]

After:

result = [
           [x,y,z], // IPU0
           [x,y,z], // IPU1
           [x,y,z], // IPU2
           [x,y,z]  // IPU3
         ]

Note

Multi-IPU ranks (more than one IPU per rank) are not yet supported.

Parameters
  • graph – The graph.

  • toGather – The chunks to gather.

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug information.

  • options – See OptionFlags.

Returns

A 2D tensor that contains a copy of the data for each IPU. Index i in the outermost dimension of the result is mapped to IPU i.

GCL_NO_DISCARD poplar::Tensor allReduceWithinReplica (poplar::Graph &graph, const poplar::Tensor &toReduce, CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-reduce operation on the specified tensor.

This operation reduces across the outermost dimension of the input and produces a tensor with the same shape where the innermost dimension is the result of the reduction and the outermost dimension is a number of copies of the result.

This function assumes index i in the outermost dimension of the input is mapped to IPU i. Index i in the outermost dimension of the result is mapped to IPU i.

Before:

toReduce = [
             [x0,y0], // IPU0
             [x1,y1], // IPU1
             [x2,y2], // IPU2
             [x3,y3], // IPU3
           ]

After:

result = [
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU0
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU1
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU2
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)]  // IPU3
         ]

Parameters
  • graph – The graph.

  • toReduce – The tensor to reduce. Each partial should be mapped identically to the others across the IPUs within the rank.

  • op – The reduction operator (for example, gcl::CollectiveOperator::ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug information.

  • options – See OptionFlags.

Returns

A tensor with the same shape as toReduce, where the innermost dimension is the result of the reduction and the outermost dimension has a number of copies of the result.

Deprecated CrossReplica functions

Collective operations working across replicas.

inline GCL_NO_DISCARD poplar::Tensor allReduceCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-reduce operation.

The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor with the same shape as the input, where all replicas' output tensors contain the same data. For instance:

Before:

Replica0: data[x0,y0]
Replica1: data[x1,y1]
Replica2: data[x2,y2]
Replica3: data[x3,y3]

After:

Replica0: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica1: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica2: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]
Replica3: result[op(x0,x1,x2,x3), op(y0,y1,y2,y3)]

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A replicated tensor with the reduction of data.

inline GCL_NO_DISCARD std::vector< poplar::Tensor > allReduceCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-reduce operation on multiple tensors.

As allReduceCrossReplica(), but batches up multiple tensors to be executed as a single collective operation. This gives a performance improvement over sequentially reducing one tensor per operation; for short tensors the potential latency reduction is a factor of 1/(number-of-tensors).

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The vector of replicated tensors to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A vector of replicated tensors. Each tensor contains the reduction of the corresponding tensor in datas across all replicas.

inline GCL_NO_DISCARD poplar::Tensor allReduceCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allReduceCrossReplica() without the group arg (for all replicas).

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A replicated tensor with the reduction of data.

inline GCL_NO_DISCARD std::vector< poplar::Tensor > allReduceCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As allReduceCrossReplica() with multiple input tensors and without the group arg.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – A vector of replicated tensors to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

A vector of replicated tensors. Each tensor contains the reduction of the corresponding tensor in datas across all replicas.

inline void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but writes the result to the destination tensor.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • destination – Tensor to write the result to.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline void allReduceToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceToDestinationCrossReplica() with multiple input and output tensors.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – Vector of replicated tensors to reduce.

  • destinations – Vector of replicated tensors to write the result to.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline void allReduceToDestinationCrossReplica(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceToDestinationCrossReplica() without group arg.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • destination – Tensor to write the result to.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline void allReduceInPlaceCrossReplica(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceCrossReplica() but writes result back to the input data tensor.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline void allReduceInPlaceCrossReplica(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an all-reduce operation on multiple tensors, writing the results back to the input tensors.

As allReduceInPlaceCrossReplica(), but batches multiple tensors into a single collective operation. This gives a performance improvement over sequentially reducing one tensor per operation: for short tensors the potential latency reduction is 1/(number-of-tensors).

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – Vector of replicated tensors to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline void allReduceInPlaceCrossReplica(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceInPlaceCrossReplica() without group arg.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline void allReduceInPlaceCrossReplica(poplar::Graph &graph, std::vector<poplar::Tensor> &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceInPlaceCrossReplica() with multiple input tensors and without group arg.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – Vector of replicated tensors to reduce.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline GCL_NO_DISCARD poplar::Tensor reduceScatterCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Reduce the replicated rank-1 tensor data with the result scattered across the replicas.

For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:

Before:

Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]

After:

Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]

Multi IPU mapped input

For the syncful implementation, an input of shape [numElementsIPU0 + numElementsIPU1 + ...] mapped to multiple IPUs per replica produces an output of shape [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + ...], with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:

Before:

Replica0: toReduce[  x0,   y0,   z0,   w0]
Replica1: toReduce[  x1,   y1,   z1,   w1]
Replica2: toReduce[  x2,   y2,   z2,   w2]
Replica3: toReduce[  x3,   y3,   z3,   w3]
Mapping:  toReduce[IPU0, IPU0, IPU0, IPU1]

After:

Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3),                  0]
Replica2: result[op(z0, z1, z2, z3),                  0]
Replica3: result[                 0,                  0]
Mapping:  result[              IPU0,               IPU1]

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Note

Only flat input tensors are currently supported.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce scatter.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

inline GCL_NO_DISCARD std::vector< poplar::Tensor > reduceScatterCrossReplica (poplar::Graph &graph, const std::vector< poplar::Tensor > &datas, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As reduceScatterCrossReplica() but with vector input argument and vector output as return value.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to reduce scatter.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensors, with the content described above.

inline void reduceScatterToDestinationCrossReplica(poplar::Graph &graph, const std::vector<poplar::Tensor> &datas, const std::vector<poplar::Tensor> &destinations, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As reduceScatterCrossReplica() but with vector input/output arguments.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensors belong to.

  • datas – The replicated tensors to reduce scatter.

  • destinations – Output tensors, which must have the correct type and shape.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • group – The subset of replicas for the collective operation.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

inline GCL_NO_DISCARD poplar::Tensor reduceScatterCrossReplica (poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

As reduceScatterCrossReplica() without group arg.

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Parameters
  • graph – The replicated graph the input tensor belongs to.

  • data – The replicated tensor to reduce scatter.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug context.

  • options – See OptionFlags.

Returns

The output tensor, with the content described above.

Deprecated WithinReplica functions

Collective operations working within replicas.

inline GCL_NO_DISCARD Chunks reduceScatterWithinReplica (poplar::Graph &graph, const poplar::Tensor &toReduce, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Reduce a rank-2 tensor.

Given a tensor of rank 2, reduce across the outermost dimension using the specified reduction operator. This function assumes index i in the outermost dimension is mapped to IPU i. The result is distributed over IPUs such that each IPU has a slice of the final result.

Before:

data = [
         [x0,y0,z0], // IPU0
         [x1,y1,z1], // IPU1
         [x2,y2,z2], // IPU2
         [x3,y3,z3]  // IPU3
       ]

After:

Chunks = [
           [],                // IPU0 (index=0, offset=0)
           [op(z0,z1,z2,z3)], // IPU1 (index=3, offset=0)
           [op(x0,x1,x2,x3)], // IPU2 (index=1, offset=0)
           [op(y0,y1,y2,y3)]  // IPU3 (index=2, offset=0)
         ]

Deprecated:

Use the version with gcl::CollectiveOperator instead of popops::CollectiveOperator.

Note

Multi-IPU ranks (more than one IPU per rank) are not yet supported.

Parameters
  • graph – The graph.

  • toReduce – The tensor to reduce. Each partial should be mapped identically to the others across the IPUs within the rank.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug information.

  • options – See OptionFlags.

Returns

A vector of chunks, where chunk i resides on IPU i. The chunks may have different numbers of elements (for example, when the number of IPUs does not exactly divide the number of elements).

inline GCL_NO_DISCARD poplar::Tensor allReduceWithinReplica (poplar::Graph &graph, const poplar::Tensor &toReduce, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext={}, const poplar::OptionFlags &options={})

Perform an all-reduce operation on the specified tensor.

This operation reduces across the outermost dimension of the input and produces a tensor with the same shape where the innermost dimension is the result of the reduction and the outermost dimension is a number of copies of the result.

This function assumes index i in the outermost dimension of the input is mapped to IPU i. Index i in the outermost dimension of the result is mapped to IPU i.

Before:

toReduce = [
             [x0,y0], // IPU0
             [x1,y1], // IPU1
             [x2,y2], // IPU2
             [x3,y3], // IPU3
           ]

After:

result = [
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU0
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU1
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)], // IPU2
           [op(x0,x1,x2,x3), op(y0,y1,y2,y3)]  // IPU3
         ]

Parameters
  • graph – The graph.

  • toReduce – The tensor to reduce. Each partial should be mapped identically to the others across the IPUs within the rank.

  • op – The reduction operator (for example, ADD).

  • prog – The program sequence to add operations to.

  • debugContext – Optional debug information.

  • options – See OptionFlags.

Returns

A tensor with the same shape as toReduce, where the innermost dimension is the result of the reduction and the outermost dimension has a number of copies of the result.

Enums

enum CommGroupType

Enum to define communication group specification type.

Assumption: replica groups are uniform in size and layout on IPUs.

Values:

enumerator ALL

All replicas viewed as one group.

enumerator CONSECUTIVE

Groups are consecutive in the replica ordering.

If there are N replicas denoted {0, ... N-1} and group size is k, then there are N/k groups of size k: {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}.

enumerator ORTHOGONAL

Groups are sliced orthogonal to the replica ordering.

If there are N replicas denoted {0, ... N-1} and group size is k, then there are m = N/k groups of size k: {0, m, 2m, ...}, {1, m+1, 2m+1, ...} ... {m-1, 2m-1, ... N-1}.

enum CollectiveOperator

Supported collective operators.

Values:

enumerator ADD
enumerator MEAN
enumerator MUL
enumerator MIN
enumerator MAX
enumerator LOGICAL_AND

Only supports boolean operands.

enumerator LOGICAL_OR

Only supports boolean operands.

enumerator SQUARE_ADD

Squares each element before applying ADD reduction.

enumerator LOCAL

Do nothing and keep the local value.

Functions

std::istream &operator>>(std::istream &is, CollectiveOperator &op)

Parse token from input stream is to op.

Valid input values are the stringified enumerations, for example “ADD” or “MUL”.

Parameters
  • is – Input stream.

  • op – Storage space for operator.

Returns

The original input stream.

std::ostream &operator<<(std::ostream &os, const CollectiveOperator &op)

Write op to output stream os.

The value written is the stringified enumeration, for example “ADD” or “MUL”.

Parameters
  • os – Output stream.

  • op – The operator to print.

Returns

The original output stream.

gcl::CollectiveOperator castCollectiveOp(popops::CollectiveOperator &op)

Internal function that converts from a deprecated enumeration to a non-deprecated one.

Deprecated:

This method is for internal use only and will be removed soon!

Parameters

op – A collective operator.

Returns

The corresponding gcl::CollectiveOperator.

struct Chunk
#include <Collectives.hpp>

Represents a section of a tensor mapped to an IPU.

Public Functions

Chunk() = default
inline Chunk(poplar::Tensor tensor, unsigned index, unsigned offset)

A section of a tensor mapped to an IPU.

Parameters
  • tensor – Mapped tensor

  • index – Ring index (data parallel index)

  • offset – Offset within rank (model parallel index)

Public Members

poplar::Tensor tensor

Mapped tensor.

unsigned index = 0

Ring index (data parallel index)

unsigned offset = 0

Offset within rank (model parallel index)

struct Chunks
#include <Collectives.hpp>

A vector of Chunk data.

Public Functions

Chunks() = default
inline explicit Chunks(unsigned size)

A vector of Chunk data.

Parameters

size – Length of chunk vector

Public Members

poplar::Tensor originalInput

Used to undo shuffles introduced by scatter.

std::vector<Chunk> chunks

Chunks produced by the scatter step.

struct CommGroup
#include <Collectives.hpp>

Struct to specify sub-groups of replicas.

Examples of derived sub-groups:

  • IPU-link domain sub-rack:

    type == CONSECUTIVE && replicaGroupSize == ipuLinkDomainSize/replica-size/N
    
    where N is a power of two and replicaGroupSize > 1.

  • Complete IPU-link domain / full rack:

    type == CONSECUTIVE && replicaGroupSize == ipuLinkDomainSize/replica-size
    

  • Using GW-links only:

    type == ORTHOGONAL && replicaGroupSize == numberOfIpuLinkDomains
    

Public Functions

CommGroup() = default
CommGroup(const CommGroupType groupType, unsigned groupSize, unsigned replicaStride = 1)

Construct CommGroup.

Parameters
  • groupType – Replica group type.

  • groupSize – Replica group size.

  • replicaStride – Replica group stride.

virtual ~CommGroup() = default

Protected Attributes

CommGroupType replicaGroupType = CommGroupType::ALL

Replica group type.

unsigned replicaGroupSize = 0

Replica group size.

0 indicates the default size for the group type.

unsigned replicaGroupStride = 1

Replica group stride.

0 indicates the default replica stride for the group type.

Friends

friend std::ostream &operator<<(std::ostream &os, const CommGroup &group)

String representation of the CommGroup.

Parameters
  • os – ostream output destination.

  • group – group to represent as string.

namespace popops

Common functions, such as elementwise and reductions.

Deprecated popops functions

Collective operations working across replicas.

std::istream &operator>>(std::istream &is, CollectiveOperator &op)

Parse token from input stream is to op.

Valid input values are the stringified enumerations, for example “ADD” or “MUL”.

Deprecated:

This operator overload has been deprecated and will be removed in a future release.

Parameters
  • is – Input stream.

  • op – Storage space for operator.

Returns

The original input stream.

std::ostream &operator<<(std::ostream &os, const CollectiveOperator &op)

Write op to output stream os.

The value written is the stringified enumeration, for example “ADD” or “MUL”.

Deprecated:

This operator overload has been deprecated and will be removed in a future release.

Parameters
  • os – Output stream.

  • op – The operator to print.

Returns

The original output stream.

CollectiveOperator operationToCollectiveOperator(const Operation &col)

Convert from popops::Operation to popops::CollectiveOperator.

Deprecated:

Use gcl::operationToCollectiveOperator instead.

Parameters

col – An operator.

Returns

The corresponding CollectiveOperator.

Enums

enum CollectiveOperator

Supported collective operators.

Deprecated:

Use gcl::CollectiveOperator instead.

Values:

enumerator ADD
enumerator MEAN
enumerator MUL
enumerator MIN
enumerator MAX
enumerator LOGICAL_AND

Only supports boolean operands.

enumerator LOGICAL_OR

Only supports boolean operands.

enumerator SQUARE_ADD

Squares each element before applying ADD reduction.

enumerator LOCAL

Do nothing and keep the local value.