5. GCL API reference

The Graphcore Communication Library (GCL) provides application-level functions that can be used in Poplar programs for the IPU.

5.1. gcl/TileAllocation.hpp

namespace gcl

Graphcore Communications Library.

Functions

unsigned getNumXBsUsed()¶

Return: The number of exchange blocks used

unsigned getMinIoTiles(const poplar::Graph &graph)¶

The lowest number of IO tiles currently supported.

Return

The lowest number of IO tiles currently supported.

Parameters

graph: The graph on which to check

std::vector<unsigned> perIPUTiles(const poplar::Graph &graph, unsigned offset, unsigned count, bool sorted = true)¶

Return a list of tile IDs optimal for GCL collective operations.

Return

A vector of tile IDs.

Parameters

graph: The graph on which to allocate tiles.
offset: The number of tiles to skip before allocating.
count: The number of tile IDs to return.
sorted: If true, the returned list of IDs is sorted. This should normally be true, which is why it is the default.

5.2. gcl/Collectives.hpp

namespace gcl

Graphcore Communications Library.

Enums

enum CommGroupType¶

Enum to define communication group specification type.

Assumption: replica groups are uniform in size and layout on IPUs.

Values:

enumerator ALL¶: All replicas viewed as one group, replica group size is ignored.

enumerator CONSECUTIVE¶

Groups consist of consecutive replicas.

If there are N replicas denoted {0, … N-1} and the group size is k, then the groups are: {0, 1, … k-1}, {k, … 2k-1} … {N-k, … N-1}

enumerator ORTHOGONAL¶

Groups are sliced orthogonal to the replica ordering.

If there are N replicas denoted {0, … N-1} and the group size is k, then there are m = N/k groups of size k: {0, m, 2m, …}, {1, m+1, 2m+1, …} … {m-1, 2m-1, …, N-1}
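The grouping rules for the three enumerators can be modelled on the host. The following is a minimal sketch, not GCL code: `GroupType` and `replicaGroups` are hypothetical names, and the sketch assumes that the group size divides the replica count and that ORTHOGONAL groups take every (N/k)-th replica, so that each group has exactly k members.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for gcl::CommGroupType, so that the sketch compiles
// without the Poplar SDK.
enum class GroupType { ALL, CONSECUTIVE, ORTHOGONAL };

// Return the replica indices of every group for N replicas and group size k.
// For ALL, k is ignored and a single group with all replicas is returned.
std::vector<std::vector<unsigned>> replicaGroups(GroupType type, unsigned N,
                                                 unsigned k) {
  std::vector<std::vector<unsigned>> groups;
  if (type == GroupType::ALL) {
    groups.emplace_back();
    for (unsigned r = 0; r < N; ++r)
      groups.back().push_back(r);
    return groups;
  }
  // Assumption of this sketch: groups are uniform, so k must divide N.
  assert(k > 0 && N % k == 0 && "group size must divide the replica count");
  const unsigned numGroups = N / k;
  groups.resize(numGroups);
  for (unsigned g = 0; g < numGroups; ++g) {
    for (unsigned i = 0; i < k; ++i) {
      // CONSECUTIVE: group g holds replicas g*k, g*k + 1, ... g*k + k-1.
      // ORTHOGONAL:  group g holds replicas g, g + N/k, g + 2N/k, ...
      groups[g].push_back(type == GroupType::CONSECUTIVE ? g * k + i
                                                         : g + i * numGroups);
    }
  }
  return groups;
}
```

Calling `replicaGroups(GroupType::CONSECUTIVE, 8, 2)` yields four groups of neighbouring replicas, while `GroupType::ORTHOGONAL` with the same arguments pairs each replica with the one four positions away.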

Functions

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Perform an all-reduce operation.

The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.

Supported Option Flags:

useSynclessCollectives (true, false, auto) [=auto]
- true: Use the syncless implementation.
- false: Use the syncful implementation.
- auto: Choose the appropriate implementation for the operation in question. At the moment, syncless is used when going over gateway links and syncful when going over IPU links.

maxBytesPerTile Integer [=35000]

The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.

topology (rung-ring-2, rung-ring-4, ring-on-line, peripheral-ring) []

The topology to use for the syncful implementation. If this option is not specified, the topology is auto-detected.
- rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU link mesh, by moving straight up and assuming wrap-around at the top.
- rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU link mesh, by moving straight up and assuming wrap-around at the top.
- ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU link mesh.
- peripheral-ring: Relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU link mesh.

link (auto-link, ipu-link, gw-link) [=auto-link]
- auto-link: Use the link type appropriate for the operation.
- ipu-link: Use the IPU links.
- gw-link: Use the gateway links.

Return

A replicated tensor with the reduction of data.

Parameters
- graph: The replicated graph the input tensor belongs to.
- data: The replicated tensor to reduce.
- op: The reduction operator (for example, popops::CollectiveOperator::ADD).
- prog: The program sequence to add operations to.
- group: The subset of replicas for the collective operation.
- debugContext: Optional debug context
- options: See above.
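To make the all-reduce semantics concrete, here is a minimal host-side model, assuming an ADD operator over all replicas. It does not use the Poplar or GCL APIs: `modelAllReduceAdd` is a hypothetical helper, and each inner vector stands for one replica's copy of the replicated tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model: perReplica[r] is replica r's copy of the tensor.
// Every replica ends up with the element-wise sum over all replicas,
// which mirrors an all-reduce with an ADD operator over all replicas.
std::vector<std::vector<int>>
modelAllReduceAdd(const std::vector<std::vector<int>> &perReplica) {
  assert(!perReplica.empty());
  const std::size_t n = perReplica[0].size();
  std::vector<int> sum(n, 0);
  for (const auto &replica : perReplica) {
    assert(replica.size() == n && "all replicas must hold the same shape");
    for (std::size_t i = 0; i < n; ++i)
      sum[i] += replica[i];
  }
  // Each replica receives an identical copy of the reduced data.
  return std::vector<std::vector<int>>(perReplica.size(), sum);
}
```

With three replicas holding `{1, 2}`, `{10, 20}` and `{100, 200}`, every replica's result in this model is `{111, 222}`.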

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::allReduce with popops::CollectiveOperator instead.

Return

A replicated tensor with the reduction of data.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::Operation::ADD).
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See above.

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allReduce() without the group arg (for all replicas).

Return

A replicated tensor with the reduction of data.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See above.

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::allReduce with popops::CollectiveOperator instead.

Return

A replicated tensor with the reduction of data.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::Operation::ADD).
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allReduce() but writes the result to the destination tensor.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
destination: Tensor to write the result to.
op: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::allReduceToDestination with popops::CollectiveOperator instead.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
destination: Tensor to write the result to.
op: The reduction operator (for example, popops::Operation::ADD).
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allReduceToDestination() without group arg (for all replicas).

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
destination: Tensor to write the result to.
op: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::allReduceToDestination with popops::CollectiveOperator instead.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
destination: Tensor to write the result to.
op: The reduction operator (for example, popops::Operation::ADD).
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allReduce() but writes result back to the input data tensor.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::allReduceInPlace with popops::CollectiveOperator instead.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::Operation::ADD).
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allReduceInPlace() without group arg (for all replicas).

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::allReduceInPlace with popops::CollectiveOperator instead.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce.
op: The reduction operator (for example, popops::Operation::ADD).
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See above.

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Reduce the replicated rank-1 tensor data, with the result scattered across the replicas. In the examples below, toReduce denotes the input on each replica.

For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:

Before:
- Replica0: toReduce[x0, y0, z0]
- Replica1: toReduce[x1, y1, z1]
After:
- Replica0: result[op(x0, x1), op(y0, y1)]
- Replica1: result[op(z0, z1), 0]

For an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica, the output will have shape: [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:

Before:
- Replica0: toReduce[x0, y0, z0, w0]
- Replica1: toReduce[x1, y1, z1, w1]
- Replica2: toReduce[x2, y2, z2, w2]
- Replica3: toReduce[x3, y3, z3, w3]
- Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]
After:
- Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
- Replica1: result[op(y0, y1, y2, y3), 0]
- Replica2: result[op(z0, z1, z2, z3), 0]
- Replica3: result[0, 0]
- Mapping: result[IPU0, IPU1]
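The shape rule and the zero-padding shown in the examples above can be sketched on the host. This is an illustrative model only, assuming an ADD operator: `reduceScatterOutputSize` and `modelReduceScatterAdd` are hypothetical helpers that do not call GCL, and the second one covers only the single-IPU-per-replica case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Output length per replica when each IPU's share is padded separately:
// the sum over IPUs of ceil(elementsOnIpu / replicationFactor).
std::size_t
reduceScatterOutputSize(const std::vector<std::size_t> &elementsPerIpu,
                        std::size_t replicationFactor) {
  assert(replicationFactor > 0);
  std::size_t total = 0;
  for (std::size_t n : elementsPerIpu)
    total += (n + replicationFactor - 1) / replicationFactor;
  return total;
}

// Host-side model of the single-IPU-per-replica case with an ADD operator.
// perReplica[r] is replica r's input. The result for replica r is out[r],
// zero-padded where the input runs out.
std::vector<std::vector<int>>
modelReduceScatterAdd(const std::vector<std::vector<int>> &perReplica) {
  assert(!perReplica.empty());
  const std::size_t R = perReplica.size();
  const std::size_t n = perReplica[0].size();
  const std::size_t chunk = (n + R - 1) / R; // ceil(n / R)
  std::vector<std::vector<int>> out(R, std::vector<int>(chunk, 0));
  for (std::size_t i = 0; i < n; ++i) {
    int reduced = 0;
    for (const auto &replica : perReplica) {
      assert(replica.size() == n && "all replicas must hold the same shape");
      reduced += replica[i];
    }
    // Element i of the reduction lands on replica i / chunk.
    out[i / chunk][i % chunk] = reduced;
  }
  return out;
}
```

For the first example (three elements, two replicas) the model gives replica 0 the first two reduced elements and replica 1 the third element followed by a zero. For the second example, `reduceScatterOutputSize({3, 1}, 4)` evaluates to 2, matching the two-element result per replica.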

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce scatter.
op: The reduction operator (for example, popops::CollectiveOperator::ADD)
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::reduceScatter with popops::CollectiveOperator instead.

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce scatter.
op: The reduction operator (for example, popops::Operation::ADD)
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As reduceScatter() without group arg (for all replicas).

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce scatter.
op: The reduction operator (for example, popops::CollectiveOperator::ADD)
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Deprecated: Use gcl::reduceScatter with popops::CollectiveOperator instead.

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to reduce scatter.
op: The reduction operator (for example, popops::Operation::ADD)
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Gather the replicated tensor data and return the result, so that each replica has a copy of every replica's tensor. In the example below, toGather denotes the input on each replica.

For instance:

Before:
- Replica0: toGather[x,y]
- Replica1: toGather[z,w]
- Replica2: toGather[x1, y1]
After allGather:
- Replica0: result[x,y,z,w,x1,y1]
- Replica1: result[x,y,z,w,x1,y1]
- Replica2: result[x,y,z,w,x1,y1]

For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
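The gather behaviour and the resulting shape can be illustrated with a small host-side model. This is a sketch under the assumption that all replicas take part. `modelAllGather` is a hypothetical helper and does not call GCL.

```cpp
#include <cassert>
#include <vector>

// Host-side model: perReplica[r] is replica r's input. The gathered result
// has one row per replica (shape [replicationFactor][incomingShape]) and
// every replica receives the same copy of it.
std::vector<std::vector<std::vector<int>>>
modelAllGather(const std::vector<std::vector<int>> &perReplica) {
  assert(!perReplica.empty());
  // The gathered tensor is all inputs stacked in replica order.
  const std::vector<std::vector<int>> &gathered = perReplica;
  // out[r] is the result seen by replica r.
  return std::vector<std::vector<std::vector<int>>>(perReplica.size(),
                                                    gathered);
}
```

In this model, the row index of the result is the replica that contributed the data, which corresponds to the leading replicationFactor dimension of the output shape.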

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to gather.
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allGather() without group arg (for all replicas).

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to gather.
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

Perform an all-to-all exchange of the elements of the input tensor based on replica ID.

The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.

The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:

Input tensor:
- Replica0: Tensor T[x0,x1,x2]
- Replica1: Tensor T[y0,y1,y2]
- Replica2: Tensor T[z0,z1,z2]
Output tensor:
- Replica0: Tensor T[x0,y0,z0]
- Replica1: Tensor T[x1,y1,z1]
- Replica2: Tensor T[x2,y2,z2]
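Seen from the host, the exchange in the example above is a transpose over the replica dimension. The following sketch models that behaviour for scalar slices. `modelAllToAll` is a hypothetical helper, not part of GCL, and it treats the slice a replica keeps for itself as a plain local copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model: perReplica[src][dst] is the slice that replica src holds
// for replica dst. The result out[dst][src] is what replica dst holds
// afterwards at the index of the sending replica.
std::vector<std::vector<int>>
modelAllToAll(const std::vector<std::vector<int>> &perReplica) {
  const std::size_t R = perReplica.size();
  std::vector<std::vector<int>> out(R, std::vector<int>(R, 0));
  for (std::size_t src = 0; src < R; ++src) {
    assert(perReplica[src].size() == R &&
           "first dimension must equal the number of replicas");
    for (std::size_t dst = 0; dst < R; ++dst) {
      // Slice dst of the sender lands at index src on the receiver.
      // For src == dst nothing is sent; the slice simply stays in place.
      out[dst][src] = perReplica[src][dst];
    }
  }
  return out;
}
```

With inputs `{1, 2, 3}`, `{4, 5, 6}` and `{7, 8, 9}` on replicas 0 to 2, replica 0 ends up with `{1, 4, 7}` in this model, which matches the pattern `T[x0,y0,z0]` above.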

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to exchange.
prog: The program sequence to add operations to.
group: The subset of replicas for the collective operation.
debugContext: Optional debug context
options: See gcl::allReduce().

poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})¶

As allToAll() without group arg (for all replicas).

Return

The output tensor, with the content described above.

Parameters

graph: The replicated graph the input tensor belongs to.
data: The replicated tensor to exchange.
prog: The program sequence to add operations to.
debugContext: Optional debug context
options: See gcl::allReduce().

struct CommGroup¶

#include <gcl/Collectives.hpp>

Struct to specify sub-groups of replicas.

Examples of derived sub-groups:

IPU-link domain sub-rack:
```
type == CONSECUTIVE && replicaGroupSize == 64/replica-size/N
```
where N is a power of two and replicaGroupSize > 1.

Complete IPU-link domain / full rack:

type == CONSECUTIVE && replicaGroupSize == 64/replica-size

Using GW-links only:

type == ORTHOGONAL && replicaGroupSize == 64/replica-size
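The group-size expressions above are plain integer arithmetic. As a sketch, the helper below evaluates them for a given replica size, assuming the 64-IPU IPU-link domain that the expressions use. `ipuLinkDomainGroupSize` is a hypothetical name and not part of GCL.

```cpp
#include <cassert>

// replicaSize: number of IPUs per replica.
// n: power-of-two divisor for a sub-rack group (1 selects the full domain).
// Assumes a 64-IPU IPU-link domain, as in the expressions above.
unsigned ipuLinkDomainGroupSize(unsigned replicaSize, unsigned n = 1) {
  const unsigned domainIpus = 64;
  // n must be a non-zero power of two.
  assert(replicaSize > 0 && n > 0 && (n & (n - 1)) == 0);
  // The division must be exact for the groups to be uniform.
  assert(domainIpus % (replicaSize * n) == 0);
  return domainIpus / replicaSize / n;
}
```

For example, with 4 IPUs per replica the full-domain group size is 16, and a half-rack sub-group (N = 2) has size 8. The same full-domain value would be used as replicaGroupSize with either CONSECUTIVE or ORTHOGONAL, depending on which links the group should span.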

Public Functions

CommGroup() = default¶

CommGroup(const CommGroupType &groupType, unsigned groupSize)¶

Construct CommGroup.

Parameters

groupType: replica group type
groupSize: replica group size

Public Members

CommGroupType type = CommGroupType::ALL ¶: Replica group type.

unsigned replicaGroupSize = 0¶: Replica group size.

5.3. gcl/CollectiveBalancedReorder.hpp

namespace gcl

Graphcore Communications Library.

class CollectiveBalancedHostRearrangement¶

#include <gcl/CollectiveBalancedReorder.hpp>

This class contains functions and data necessary to rearrange tensors on the host side at runtime.

The separation is made so that we can serialize the state and restore it without having to create a poplar::Graph.

Public Functions

void rearrangeForCollective(const char *in, char *out, int64_t elemByteSize) const¶

Reorder the tensor in a balanced, collective-friendly manner (host-side).

Parameters

in: Pointer to the input buffer.
out: Pointer to the output buffer.
elemByteSize: The byte size of the elements.

void undoRearrangeForCollective(const char *in, char *out, int64_t elemByteSize) const¶

Reorder tensor back into the expected IR tensor shape and order (host-side).

Parameters

in: Pointer to the input buffer.
out: Pointer to the output buffer.
elemByteSize: The byte size of the elements.

size_t getNumRearrangedTensorElems() const¶

Number of elements in the collective balanced (reordered) tensor.

Return: The number of elements.

void rearrange(const char *in, char *out, int64_t elemByteSize, bool refToGathered) const¶

Host tensor rearrangement routine.

Parameters

in: Pointer to the input buffer.
out: Pointer to the output buffer.
elemByteSize: The byte size of the elements.
refToGathered: Whether to rearrange from reference to gathered (true) or from gathered to reference (false).

Public Members

unsigned replicationFactor = 0¶: The graph’s replication factor.

std::size_t totalElementsPerReplica = 0¶: The total number of elements in one replica’s fragment.

std::vector<poplar::Interval> gatheredToRefSlices¶: The mapping from the gathered tensor back to the reference tensor.

class CollectiveBalancedReorder¶

#include <gcl/CollectiveBalancedReorder.hpp>

Helper class to reorder a tensor in a per-tile-balanced fashion such that each replica obtains (for inputs to AllGather or outputs of ReduceScatter) an equally sized 1D tensor with equally sized regions.

This helper class reduces the memory used by the syncful collective. The reordering process:

- Flattens the input tensor
- Analyzes the tile mapping
- Determines the reordering strategy and the required internal padding

The class can also:

- Rearrange, and undo the rearrangement of, any tensor that has the same tile mapping
- Rearrange, and undo the rearrangement of, host tensors that are to be copied into CBR-rearranged RemoteBuffers

Public Functions

CollectiveBalancedReorder(poplar::Graph &graph_, poplar::Tensor tensor_, unsigned replicationFactor_, const poplar::DebugNameAndId &dnai_)¶

Constructor.

Parameters

graph_: The poplar graph.
tensor_: The reference tensor to rearrange.
replicationFactor_: The replication factor of the graph.
dnai_: Debug name and id.

poplar::Tensor createReplicaSlice(const poplar::Type &type)¶

Create a tensor mapped efficiently over the same tiles as the reference tensor.

The returned tensor has the size of the result of the reduce scatter and of the input of the all gather.

Return

The efficient tensor created from the reference.

Parameters

type: The type to use when creating the tensor.

poplar::Tensor createCollectivesTensor(const poplar::Type &type, const std::string &debugPrefix)¶

Create a tensor mapped efficiently over the same tiles as the reference tensor.

The returned tensor has the size of the input of the reduce scatter and of the result of the all gather.

Return

The efficient tensor created from the reference.

Parameters

type: The type to use when creating the tensor.
debugPrefix: The debug prefix.

poplar::Tensor undoRearrangeForCollective(const poplar::Tensor &tensor) const¶

Reorder tensor back into the expected IR tensor shape and order.

Return

The tensor with the rearrangement undone.

Parameters

tensor: The tensor to rearrange.

std::vector<std::size_t> getReferenceShape() const¶

Get the shape of the reference tensor.

Return: The shape of the reference tensor.

const CollectiveBalancedHostRearrangement &getHostRearrangement() const¶

Get a helper class that allows the rearrangement to be applied on the host.

Return: The helper class for host rearrangement.

Private Functions

void rearrange(const char *in, char *out, int64_t elemByteSize, bool refToGathered) const¶: Host tensor rearrangement routine.

Private Members

poplar::Graph &graph¶: Graph or subgraph on which the tensor and reordered tensor are allocated.

unsigned replicationFactor¶

std::vector<std::size_t> numReplicaElementsPerTile¶

std::vector<poplar::Interval> gatheredToSimplifiedRefSlices¶

poplar::Tensor referenceTensor¶

poplar::TensorRearranger simplifier¶

CollectiveBalancedHostRearrangement hostRearrangement¶

const poplar::DebugNameAndId dnai¶