5. GCL API reference
The Graphcore Communications Library (GCL) provides application-level functions that can be used in Poplar programs for the IPU.
gcl/TileAllocation.hpp
-
namespace gcl
Graphcore Communications Library.
Functions
-
unsigned getNumXBsUsed()
- Returns
The number of exchange blocks used
-
unsigned getMinIoTiles(const poplar::Graph &graph)
The lowest number of io tiles currently supported.
- Parameters
graph – The graph on which to check
- Returns
The lowest number of io tiles currently supported
-
std::vector<unsigned> perIPUTiles(const poplar::Graph &graph, unsigned offset, unsigned count, bool sorted = true, bool tilePairs = true)
Return a list of tile ids optimal for gcl collective operations.
- Parameters
graph – The graph on which to allocate tiles
offset – Skip a number of tiles and allocate from an offset.
count – Number of tile ids to return.
sorted – If true, the returned list of ids will be sorted. This should normally be true and is therefore the default.
tilePairs – Override the default behaviour and return tile pairs. This is normally false and thus not the default, so it has to be requested explicitly by the caller.
- Returns
A vector of tile ids.
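As a brief illustration, here is a minimal sketch of how these helpers might be combined to reserve IO tiles for collectives; the helper name reserveIoTiles is illustrative and not part of GCL, and an existing poplar::Graph is assumed.

#include <gcl/TileAllocation.hpp>
#include <poplar/Graph.hpp>
#include <vector>

// Sketch: reserve the minimum supported number of IO tiles for GCL
// collectives, starting at tile offset 0.
std::vector<unsigned> reserveIoTiles(const poplar::Graph &graph) {
  const unsigned numIoTiles = gcl::getMinIoTiles(graph);
  // perIPUTiles returns a sorted list of tile ids by default.
  return gcl::perIPUTiles(graph, /*offset=*/0, /*count=*/numIoTiles);
}

How the returned tile ids are consumed (for example, to build a virtual graph that separates IO tiles from compute tiles) is left to the application.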
gcl/Collectives.hpp
-
namespace gcl
Graphcore Communications Library.
Enums
-
enum CommGroupType
Enum to define communication group specification type.
Assumption: replica groups are uniform in size and layout on IPUs.
Values:
-
enumerator ALL
All replicas viewed as one group; the replica group size is ignored.
-
enumerator CONSECUTIVE
Groups are consecutive in replica index.
If there are N replicas denoted {0, … N-1} and the group size is k, then there are N/k groups of size k: {0, 1, … k-1}, {k, … 2k-1} … {N-k, … N-1}
-
enumerator ORTHOGONAL
Groups are sliced orthogonal to the replica ordering.
If there are N replicas denoted {0, … N-1} and group size is k, then there are m = N/k groups of size k: {0, m, 2m, …}, {1, m+1, 2m+1, …} … {m-1, 2m-1, … N-1}
Functions
-
poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-reduce operation.
The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.
Supported Option Flags:
useSynclessCollectives (true, false, hybrid, auto) [=auto]
true: Use the syncless implementation.
false: Use the syncful implementation.
hybrid: Use syncful over IPU links and syncless over GW links.
auto: Choose the appropriate implementation for the operation in question. At the moment this is the same as ‘false’.
maxBytesPerTile (integer) [=35000]
The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.
topology (rung-ring-2, rung-ring-4, rung-ring-8, ring-on-line, peripheral-ring) []
The topology to use for the syncful implementation. If this option is not specified, the topology is auto-detected.
rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU-link mesh, moving straight up and wrapping around at the top.
rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU-link mesh, moving straight up and wrapping around at the top.
rung-ring-8: Relevant for replica size 8. The traffic follows one of two physical rings, one on each side of the IPU-link mesh, moving straight up and wrapping around at the top.
ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU-link mesh.
peripheral-ring: Relevant for replica size 1. The traffic follows a single ring around the periphery of the IPU-link mesh.
link (auto-link, ipu-link, gw-link) [=auto-link]
auto-link: Use the link type appropriate for the operation.
ipu-link: Use the IPU links.
gw-link: Use the gateway links.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See above.
- Returns
A replicated tensor with the reduction of data.
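As an illustration, here is a minimal sketch of calling allReduce on groups of two consecutive replicas, with two of the option flags above spelled out. The function name and the chosen option values are illustrative; an existing replicated graph, tensor and program sequence are assumed.

#include <gcl/Collectives.hpp>
#include <poplar/Graph.hpp>
#include <poplar/OptionFlags.hpp>
#include <poplar/Program.hpp>

// Sketch: sum a replicated tensor across pairs of consecutive replicas.
poplar::Tensor addAcrossPairs(poplar::Graph &graph, const poplar::Tensor &t,
                              poplar::program::Sequence &prog) {
  // Replicas {0,1}, {2,3}, ... each form their own reduction group.
  gcl::CommGroup pairs(gcl::CommGroupType::CONSECUTIVE, /*groupSize=*/2);
  poplar::OptionFlags options = {
      {"useSynclessCollectives", "auto"},  // let GCL pick the implementation
      {"maxBytesPerTile", "35000"}};       // default payload limit per IO tile
  return gcl::allReduce(graph, t, popops::CollectiveOperator::ADD, prog, pairs,
                        {"allReduce/pairs"}, options);
}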
-
poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduce() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See above.
- Returns
A replicated tensor with the reduction of data.
-
void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduce() but writes the result to the destination tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See above.
-
void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceToDestination() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See above.
-
void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduce() but writes the result back to the input data tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See above.
-
void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlace() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See above.
-
poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Reduce the replicated rank-1 tensor toReduce, with the result scattered across the replicas.
For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:
Before:
Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]
After:
Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]
For an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica, the output will have shape: [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:
Before:
Replica0: toReduce[x0, y0, z0, w0]
Replica1: toReduce[x1, y1, z1, w1]
Replica2: toReduce[x2, y2, z2, w2]
Replica3: toReduce[x3, y3, z3, w3]
Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]
After:
Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3), 0]
Replica2: result[op(z0, z1, z2, z3), 0]
Replica3: result[0, 0]
Mapping: result[IPU0, IPU1]
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
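A minimal usage sketch, assuming an existing replicated graph and program sequence; the function name is illustrative. With a replication factor of 2 and a 3-element input, as in the first example above, each replica receives a 2-element slice and the final element on the last replica is zero padding.

#include <gcl/Collectives.hpp>

// Sketch: reduce-scatter a rank-1 replicated tensor across all replicas
// with an ADD reduction.
poplar::Tensor scatterSum(poplar::Graph &graph, const poplar::Tensor &data,
                          poplar::program::Sequence &prog) {
  return gcl::reduceScatter(graph, data, popops::CollectiveOperator::ADD, prog,
                            {"reduceScatter/sum"});
}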
-
poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As reduceScatter() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Gather the replicated tensor toGather and return the result so each replica will have a copy of all other replicas’ toGather tensors. For instance:
Before:
Replica0: toGather[x,y]
Replica1: toGather[z,w]
Replica2: toGather[x1, y1]
After allGather:
Replica0: result[x,y,z,w,x1,y1]
Replica1: result[x,y,z,w,x1,y1]
Replica2: result[x,y,z,w,x1,y1]
For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
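A minimal usage sketch; the function name is illustrative. For an input of shape [n] on each replica, the returned tensor has shape [replicationFactor][n], as described above.

#include <gcl/Collectives.hpp>

// Sketch: gather a per-replica shard so that every replica ends up with a
// copy of all shards.
poplar::Tensor gatherShards(poplar::Graph &graph, const poplar::Tensor &shard,
                            poplar::program::Sequence &prog) {
  return gcl::allGather(graph, shard, prog, {"allGather/shards"});
}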
-
poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allGather() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-to-all exchange of the elements of the input tensor based on replica ID.
The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.
The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:
Input tensor:
Replica0: Tensor T[x0,x1,x2]
Replica1: Tensor T[y0,y1,y2]
Replica2: Tensor T[z0,z1,z2]
Output tensor:
Replica0: Tensor T[x0,y0,z0]
Replica1: Tensor T[x1,y1,z1]
Replica2: Tensor T[x2,y2,z2]
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
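A minimal usage sketch; the function name is illustrative and the assert simply restates the requirement on the first dimension. Slice i of the input on replica r ends up at index r of the output on replica i.

#include <gcl/Collectives.hpp>
#include <cassert>

// Sketch: exchange slices between all replicas.
poplar::Tensor exchangeSlices(poplar::Graph &graph, const poplar::Tensor &data,
                              poplar::program::Sequence &prog) {
  // The first dimension of the input must equal the number of replicas.
  assert(data.dim(0) == graph.getReplicationFactor());
  return gcl::allToAll(graph, data, prog, {"allToAll/slices"});
}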
-
poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allToAll() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
-
struct CommGroup
- #include <Collectives.hpp>
Struct to specify sub-groups of replicas.
Examples of derived sub-groups:
IPU-link domain sub-rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size/N
where N is a power of two and replicaGroupSize > 1.
Complete IPU-link domain / full rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size
Using GW-links only:
type == ORTHOGONAL && replicaGroupSize == 64/replica-size
Public Functions
-
CommGroup() = default
-
inline CommGroup(const CommGroupType &groupType, unsigned groupSize)
Construct CommGroup.
- Parameters
groupType – replica group type
groupSize – replica group size
Public Members
-
CommGroupType type = CommGroupType::ALL
Replica group type.
-
unsigned replicaGroupSize = 0
Replica group size.
0 means the default size for the given group type.
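As a brief illustration, assuming a program compiled with 8 replicas, the three group types select the replica groups shown in the comments below. The function is only a sketch; the groupings follow the CommGroupType definitions above.

#include <gcl/Collectives.hpp>

void exampleGroups() {
  gcl::CommGroup all;  // CommGroupType::ALL: one group {0,1,2,3,4,5,6,7}
  gcl::CommGroup consecutive(gcl::CommGroupType::CONSECUTIVE, 4);
  // Two groups: {0,1,2,3} and {4,5,6,7}
  gcl::CommGroup orthogonal(gcl::CommGroupType::ORTHOGONAL, 4);
  // Two groups: {0,2,4,6} and {1,3,5,7}
}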
gcl/CollectiveBalancedReorder.hpp
-
namespace gcl
Graphcore Communications Library.
-
class CollectiveBalancedHostRearrangement
- #include <CollectiveBalancedReorder.hpp>
This class contains functions and data necessary to rearrange tensors on the host side at runtime.
The separation is made so that we can serialize the state and restore it without having to create a poplar::Graph.
Public Functions
-
void rearrangeForCollective(const void *in, void *out, int64_t elemByteSize) const
Reorder the tensor in a balanced, collective-friendly manner (host-side).
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
elemByteSize – The byte size of the elements.
-
void undoRearrangeForCollective(const void *in, void *out, int64_t elemByteSize) const
Reorder tensor back into the expected IR tensor shape and order (host-side).
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
elemByteSize – The byte size of the elements.
-
size_t getNumRearrangedTensorElems() const
Number of elements in the collective balanced (reordered) tensor.
- Returns
The number of elements.
-
void rearrange(const void *in, void *out, int64_t elemByteSize, bool refToGathered) const
Host tensor rearrangement routine.
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
elemByteSize – The byte size of the elements.
refToGathered – Whether to rearrange from reference to gathered ordering, or the other way around.
Public Members
-
unsigned replicationFactor = 0
The graph’s replication factor.
Private Functions
-
template<typename ElementType>
void rearrangeImpl(const ElementType *in, ElementType *out, bool refToGathered) const
Host tensor rearrangement routine.
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
refToGathered – Whether to rearrange from reference to gathered ordering, or the other way around.
-
class CollectiveBalancedReorder
- #include <CollectiveBalancedReorder.hpp>
Helper class to reorder a tensor in a per-tile-balanced fashion such that each replica obtains (for inputs to AllGather or outputs of ReduceScatter) an equally sized 1D tensor with equally sized regions.
This helper class reduces the memory used by the syncful collective. The reordering process:
Flattens the input tensor
Analyses the tile mapping
Determines reordering strategy and required internal padding
Can rearrange and undo the rearrangement on any tensor that has the same tile mapping
Can rearrange and undo the rearrangement on host tensors that are to be copied into CBR-rearranged RemoteBuffers
Public Functions
-
CollectiveBalancedReorder(poplar::Graph &graph_, poplar::Tensor tensor_, unsigned replicationFactor_, const poplar::DebugNameAndId &dnai_, bool allowElementMap = false)
Constructor.
- Parameters
graph_ – The poplar graph.
tensor_ – The reference tensor to rearrange.
replicationFactor_ – The replication factor of the graph.
dnai_ – Debug name and id.
allowElementMap – Allow an alternative representation of the host rearrangements. Sometimes it is beneficial to collapse all intervals into a simple 1-to-1 element map. This flag should be set to true in all new code; it will be deprecated when all frameworks implement serialisation of the newly added elementMap field.
-
poplar::Tensor createReplicaSlice(const poplar::Type &type)
Create a tensor mapped efficiently over the same tiles as the reference tensor.
The returned tensor has the size of the result of the reduce scatter and of the input of the all gather.
- Parameters
type – The type to use when creating the tensor.
- Returns
The efficient tensor created from the reference.
-
poplar::Tensor createCollectivesTensor(const poplar::Type &type, const std::string &debugPrefix)
Create a tensor mapped efficiently over the same tiles as the reference tensor.
The returned tensor has the size of the input of the reduce scatter and of the result of the all gather.
- Parameters
type – The type to use when creating the tensor.
debugPrefix – The debug prefix.
- Returns
The efficient tensor created from the reference.
-
poplar::Tensor undoRearrangeForCollective(const poplar::Tensor &tensor) const
Reorder tensor back into the expected IR tensor shape and order.
- Parameters
tensor – The tensor to rearrange.
- Returns
The tensor with the rearrangement undone.
-
inline std::vector<std::size_t> getReferenceShape() const
Get the shape of the reference tensor.
- Returns
The shape of the reference tensor.
-
inline const CollectiveBalancedHostRearrangement &getHostRearrangement() const
Get a helper class that allows the rearrangement to be applied on the host.
- Returns
The helper class for host rearrangement.
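A sketch of a typical CollectiveBalancedReorder round trip, assuming an existing replicated graph and program sequence. The names are illustrative, and the flatten() of the gathered tensor is an assumption about the layout expected by undoRearrangeForCollective.

#include <gcl/CollectiveBalancedReorder.hpp>
#include <gcl/Collectives.hpp>

// Sketch: build a CBR for a reference tensor, gather a per-replica slice from
// all replicas, then undo the rearrangement to recover the reference order.
poplar::Tensor gatherWithCbr(poplar::Graph &graph, const poplar::Tensor &ref,
                             poplar::program::Sequence &prog,
                             unsigned replicationFactor) {
  gcl::CollectiveBalancedReorder cbr(graph, ref, replicationFactor, {"cbr"},
                                     /*allowElementMap=*/true);

  // Sized as the output of reduceScatter / the input of allGather. In a real
  // program the slice would first be written, for example by reduceScatter.
  poplar::Tensor slice = cbr.createReplicaSlice(ref.elementType());

  // Gather the slices from all replicas ...
  poplar::Tensor gathered = gcl::allGather(graph, slice, prog, {"cbr/gather"});

  // ... and map the result back to the reference tensor's shape and order.
  return cbr.undoRearrangeForCollective(gathered.flatten());
}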
Private Functions
-
void rearrange(const void *in, void *out, int64_t elemByteSize, bool refToGathered) const
Host tensor rearrangement routine.
Private Members
-
unsigned replicationFactor
-
poplar::TensorRearranger simplifier
-
CollectiveBalancedHostRearrangement hostRearrangement
-
const poplar::DebugNameAndId dnai