5. GCL API reference

The Graphcore Communication Library (GCL) provides application-level functions that can be used in Poplar programs for the IPU.

5.1. gcl/TileAllocation.hpp

namespace gcl

Graphcore Communications Library.

Functions

unsigned getNumXBsUsed()

Return

The number of exchange blocks used

unsigned getMinIoTiles(const poplar::Graph &graph)

The lowest number of IO tiles currently supported.

Return

The lowest number of IO tiles currently supported.

Parameters
  • graph: The graph on which to check

std::vector<unsigned> perIPUTiles(const poplar::Graph &graph, unsigned offset, unsigned count, bool sorted = true)

Return a list of tile ids optimal for GCL collective operations.

Return

A vector of tile ids.

Parameters
  • graph: The graph on which to allocate tiles

  • offset: Skip a number of tiles and allocate from an offset

  • count: Number of tile ids to return

  • sorted: If true, the returned list of ids is sorted. This should normally be true, which is also the default.

5.2. gcl/Collectives.hpp

namespace gcl

Graphcore Communications Library.

Enums

enum CommGroupType

Enum to define communication group specification type.

Assumption: replica groups are uniform in size and layout on IPUs.

Values:

enumerator ALL

All replicas are viewed as one group; the replica group size is ignored.

enumerator CONSECUTIVE

Groups consist of consecutive replicas.

If there are N replicas denoted {0, …, N-1} and the group size is k, then the groups are: {0, 1, …, k-1}, {k, …, 2k-1}, …, {N-k, …, N-1}

enumerator ORTHOGONAL

Groups are sliced orthogonal to the replica ordering.

If there are N replicas denoted {0, …, N-1} and the group size is k, then there are m = N/k groups of size k: {0, m, 2m, …}, {1, m+1, 2m+1, …}, …, {m-1, 2m-1, …, N-1}
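The two grouping schemes can be made concrete with a small host-side sketch. The helper below is hypothetical (replicaGroups and GroupKind are not part of GCL) and assumes that k divides N. For the orthogonal case it also assumes that every group holds k replicas, so that members of a group are spaced N/k apart.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper, not part of GCL: lists the replica IDs in
// each group for n replicas and group size k (k must divide n).
enum class GroupKind { Consecutive, Orthogonal };

std::vector<std::vector<unsigned>> replicaGroups(GroupKind kind, unsigned n,
                                                 unsigned k) {
  assert(k > 0 && n % k == 0);
  const unsigned numGroups = n / k;
  std::vector<std::vector<unsigned>> groups(numGroups);
  for (unsigned g = 0; g < numGroups; ++g) {
    for (unsigned i = 0; i < k; ++i) {
      // Consecutive: group g holds replicas g*k .. g*k + k-1.
      // Orthogonal (assumed): group g holds replicas g, g+m, g+2m, ...
      // where m = n/k is the number of groups.
      groups[g].push_back(kind == GroupKind::Consecutive ? g * k + i
                                                         : g + i * numGroups);
    }
  }
  return groups;
}
```

For N = 8 and k = 4, this gives {0, 1, 2, 3}, {4, 5, 6, 7} for consecutive groups and {0, 2, 4, 6}, {1, 3, 5, 7} for orthogonal groups.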

Functions

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an all-reduce operation.

The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.
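To illustrate the semantics only, the following host-side sketch models an all-reduce with addition over consecutive replica groups. The function modelAllReduceAdd is hypothetical and does not use Poplar or GCL; it assumes that every replica holds a tensor of the same length and that the group size divides the number of replicas.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model, not part of GCL. data[r] is the tensor held
// by replica r. Replicas are split into consecutive groups of groupSize.
std::vector<std::vector<int>>
modelAllReduceAdd(const std::vector<std::vector<int>> &data,
                  std::size_t groupSize) {
  assert(groupSize > 0 && data.size() % groupSize == 0);
  std::vector<std::vector<int>> result(data.size());
  for (std::size_t first = 0; first < data.size(); first += groupSize) {
    // Element-wise sum over the replicas of this group.
    std::vector<int> sum(data[first].size(), 0);
    for (std::size_t r = first; r < first + groupSize; ++r)
      for (std::size_t e = 0; e < sum.size(); ++e)
        sum[e] += data[r][e];
    // Every replica in the group receives the same reduced tensor.
    for (std::size_t r = first; r < first + groupSize; ++r)
      result[r] = sum;
  }
  return result;
}
```

With four replicas holding {1, 2}, {3, 4}, {5, 6} and {7, 8} and a group size of 2, replicas 0 and 1 both end up with {4, 6}, and replicas 2 and 3 with {12, 14}.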

Supported Option Flags:

  • useSynclessCollectives (true, false, auto) [=auto]

    • true: Use the syncless implementation.

    • false: Use the syncful implementation.

    • auto: Choose the appropriate implementation for the operation in question. At the moment, syncless is used when going over gateway links and syncful when going over IPU links.

  • maxBytesPerTile Integer [=35000]

    The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.

  • topology (rung-ring-2, rung-ring-4, ring-on-line, peripheral-ring) []

    The topology to use for the syncful implementation. By not specifying this option the topology is auto detected.

    • rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU link mesh, by moving straight up and assuming wrap-around at the top.

    • rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU link mesh, by moving straight up and assuming wrap-around at the top.

    • ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU link mesh.

    • peripheral-ring: Relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU link mesh.

  • link (auto-link, ipu-link, gw-link) [=auto-link]

    • auto-link: Use the link type appropriate for the operation.

    • ipu-link: Use the IPU links.

    • gw-link: Use the gateway links.

Return

A replicated tensor with the reduction of data.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See above.

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::allReduce with popops::CollectiveOperator instead.

Return

A replicated tensor with the reduction of data.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See above.

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduce() without the group arg (for all replicas).

Return

A replicated tensor with the reduction of data.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See above.

poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::allReduce with popops::CollectiveOperator instead.

Return

A replicated tensor with the reduction of data.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduce() but writes the result to the destination tensor.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • destination: Tensor to write the result to.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::allReduceToDestination with popops::CollectiveOperator instead.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • destination: Tensor to write the result to.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceToDestination() without group arg (for all replicas).

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • destination: Tensor to write the result to.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See above.

void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::allReduceToDestination with popops::CollectiveOperator instead.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • destination: Tensor to write the result to.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduce() but writes result back to the input data tensor.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::allReduceInPlace with popops::CollectiveOperator instead.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allReduceInPlace() without group arg (for all replicas).

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See above.

void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::allReduceInPlace with popops::CollectiveOperator instead.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See above.

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Reduce the replicated rank-1 tensor data (shown as toReduce in the examples below), with the result scattered across the replicas.

For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:

  • Before:

    • Replica0: toReduce[x0, y0, z0]

    • Replica1: toReduce[x1, y1, z1]

  • After:

    • Replica0: result[op(x0, x1), op(y0, y1)]

    • Replica1: result[op(z0, z1), 0]

For an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica, the output will have shape: [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:

  • Before:

    • Replica0: toReduce[x0, y0, z0, w0]

    • Replica1: toReduce[x1, y1, z1, w1]

    • Replica2: toReduce[x2, y2, z2, w2]

    • Replica3: toReduce[x3, y3, z3, w3]

    • Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]

  • After:

    • Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]

    • Replica1: result[op(y0, y1, y2, y3), 0]

    • Replica2: result[op(z0, z1, z2, z3), 0]

    • Replica3: result[0, 0]

    • Mapping: result[IPU0, IPU1]
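Both examples above follow the same rule, which can be sketched on the host. The function modelReduceScatter below is a hypothetical model, not part of GCL. It uses addition as the reduction operator and assumes that the elements mapped to one IPU are contiguous in the input and that all replicas hold inputs of the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of the reduceScatter layout, using addition.
// input[r][e] is element e on replica r; ipuOfElement[e] is the IPU that
// element e is mapped to (elements of one IPU are assumed contiguous).
std::vector<std::vector<int>>
modelReduceScatter(const std::vector<std::vector<int>> &input,
                   const std::vector<unsigned> &ipuOfElement) {
  const std::size_t numReplicas = input.size();
  std::vector<std::vector<int>> result(numReplicas);
  std::size_t begin = 0;
  while (begin < ipuOfElement.size()) {
    // Find the run of elements mapped to the same IPU.
    std::size_t end = begin;
    while (end < ipuOfElement.size() &&
           ipuOfElement[end] == ipuOfElement[begin])
      ++end;
    // Each replica receives ceil(n / numReplicas) elements for this IPU.
    const std::size_t n = end - begin;
    const std::size_t perReplica = (n + numReplicas - 1) / numReplicas;
    for (std::size_t r = 0; r < numReplicas; ++r) {
      for (std::size_t i = 0; i < perReplica; ++i) {
        const std::size_t e = r * perReplica + i; // index within this IPU
        int sum = 0; // stays zero when e is past the end (zero padding)
        if (e < n)
          for (const auto &replicaData : input)
            sum += replicaData[begin + e];
        result[r].push_back(sum);
      }
    }
    begin = end;
  }
  return result;
}
```

For two replicas holding {1, 2, 3} and {10, 20, 30} on a single IPU, the model returns {11, 22} for replica 0 and {33, 0} for replica 1, matching the first example.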

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce scatter.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::reduceScatter with popops::CollectiveOperator instead.

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce scatter.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As reduceScatter() without group arg (for all replicas).

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce scatter.

  • op: The reduction operator (for example, popops::CollectiveOperator::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Deprecated: Use gcl::reduceScatter with popops::CollectiveOperator instead.

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to reduce scatter.

  • op: The reduction operator (for example, popops::Operation::ADD).

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Gather the replicated tensor data (shown as toGather in the example below) and return the result, so that each replica holds a copy of the toGather tensors of all replicas.

For instance:

  • Before:

    • Replica0: toGather[x,y]

    • Replica1: toGather[z,w]

    • Replica2: toGather[x1, y1]

  • After allGather:

    • Replica0: result[x,y,z,w,x1,y1]

    • Replica1: result[x,y,z,w,x1,y1]

    • Replica2: result[x,y,z,w,x1,y1]

For an input of shape [incomingShape], the output will have shape [replicationFactor][incomingShape].
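As a rough host-side illustration of the gather semantics, the hypothetical function below (not part of GCL) returns, for every replica, the contributions of all replicas concatenated in replica order. It is written flat, as in the example above, rather than with a leading replicationFactor dimension.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of allGather, not part of GCL.
// input[r] is the tensor held by replica r; result[r] is what replica r
// holds afterwards, written as a flat list.
std::vector<std::vector<int>>
modelAllGather(const std::vector<std::vector<int>> &input) {
  // Concatenate the contributions in replica order.
  std::vector<int> gathered;
  for (const auto &replicaData : input)
    gathered.insert(gathered.end(), replicaData.begin(), replicaData.end());
  // Every replica receives an identical copy of the gathered data.
  return std::vector<std::vector<int>>(input.size(), gathered);
}
```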

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to gather.

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allGather() without group arg (for all replicas).

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to gather.

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

Perform an all-to-all exchange of the elements of the input tensor based on replica ID.

The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.

The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:

  • Input tensor:

    • Replica0: Tensor T[x0,x1,x2]

    • Replica1: Tensor T[y0,y1,y2]

    • Replica2: Tensor T[z0,z1,z2]

  • Output tensor:

    • Replica0: Tensor T[x0,y0,z0]

    • Replica1: Tensor T[x1,y1,z1]

    • Replica2: Tensor T[x2,y2,z2]
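The exchange above amounts to a transpose across replicas, which the following host-side sketch makes explicit. The function modelAllToAll is hypothetical and not part of GCL; it assumes one scalar per slice for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of allToAll, not part of GCL.
// input[r][s] is slice s held by replica r. Replica r sends slice s to
// replica s, which stores it at index r; slice r stays on replica r.
std::vector<std::vector<int>>
modelAllToAll(const std::vector<std::vector<int>> &input) {
  const std::size_t n = input.size();
  std::vector<std::vector<int>> output(n, std::vector<int>(n));
  for (std::size_t sender = 0; sender < n; ++sender) {
    // The first dimension must equal the number of replicas.
    assert(input[sender].size() == n);
    for (std::size_t receiver = 0; receiver < n; ++receiver)
      output[receiver][sender] = input[sender][receiver];
  }
  return output;
}
```

With three replicas holding {1, 2, 3}, {4, 5, 6} and {7, 8, 9}, the outputs are {1, 4, 7}, {2, 5, 8} and {3, 6, 9}.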

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to exchange.

  • prog: The program sequence to add operations to.

  • group: The subset of replicas for the collective operation.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})

As allToAll() without group arg (for all replicas).

Return

The output tensor, with the content described above.

Parameters
  • graph: The replicated graph the input tensor belongs to.

  • data: The replicated tensor to exchange.

  • prog: The program sequence to add operations to.

  • debugContext: Optional debug context

  • options: See gcl::allReduce().

struct CommGroup
#include <gcl/Collectives.hpp>

Struct to specify sub-groups of replicas.

Examples of derived sub-groups:

  • IPU-link domain sub-rack:

    type == CONSECUTIVE && replicaGroupSize == 64/replica-size/N

    where N is a power of two and replicaGroupSize > 1.

  • Complete IPU-link domain / full rack:

    type == CONSECUTIVE && replicaGroupSize == 64/replica-size

  • Using GW-links only:

    type == ORTHOGONAL && replicaGroupSize == 64/replica-size
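The CONSECUTIVE group sizes implied by these formulas can be enumerated with a short sketch. The helper consecutiveGroupSizes is hypothetical and not part of GCL. It assumes an IPU-link domain of 64 IPUs, as in the formulas above, takes replica-size to be the number of IPUs per replica, and treats N = 1 as the complete domain.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not part of GCL: candidate replicaGroupSize values for
// CONSECUTIVE groups inside one 64-IPU IPU-link domain.
// ipusPerReplica is the "replica-size" in the formulas above.
std::vector<unsigned> consecutiveGroupSizes(unsigned ipusPerReplica) {
  assert(ipusPerReplica > 0 && 64 % ipusPerReplica == 0);
  std::vector<unsigned> sizes;
  const unsigned fullDomain = 64 / ipusPerReplica;
  // n = 1 is the complete domain; n = 2, 4, ... are sub-rack groups.
  // Sizes of 1 are skipped because replicaGroupSize must exceed 1.
  for (unsigned n = 1; fullDomain / n > 1; n *= 2)
    sizes.push_back(fullDomain / n);
  return sizes;
}
```

For replicas spanning 4 IPUs each, this yields group sizes 16 (complete domain), 8, 4 and 2.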
    

Public Functions

CommGroup() = default
CommGroup(const CommGroupType &groupType, unsigned groupSize)

Construct CommGroup.

Parameters
  • groupType: replica group type

  • groupSize: replica group size

Public Members

CommGroupType type = CommGroupType::ALL

Replica group type.

unsigned replicaGroupSize = 0

Replica group size.

5.3. gcl/CollectiveBalancedReorder.hpp

namespace gcl

Graphcore Communications Library.

class CollectiveBalancedHostRearrangement
#include <gcl/CollectiveBalancedReorder.hpp>

This class contains functions and data necessary to rearrange tensors on the host side at runtime.

The separation is made so that we can serialize the state and restore it without having to create a poplar::Graph.

Public Functions

void rearrangeForCollective(const char *in, char *out, int64_t elemByteSize) const

Reorder the tensor in a balanced, collective-friendly manner (host-side).

Parameters
  • in: Pointer to the input buffer.

  • out: Pointer to the output buffer.

  • elemByteSize: The byte size of the elements.

void undoRearrangeForCollective(const char *in, char *out, int64_t elemByteSize) const

Reorder tensor back into the expected IR tensor shape and order (host-side).

Parameters
  • in: Pointer to the input buffer.

  • out: Pointer to the output buffer.

  • elemByteSize: The byte size of the elements.

size_t getNumRearrangedTensorElems() const

Number of elements in the collective balanced (reordered) tensor.

Return

The number of elements.

void rearrange(const char *in, char *out, int64_t elemByteSize, bool refToGathered) const

Host tensor rearrangement routine.

Parameters
  • in: Pointer to the input buffer.

  • out: Pointer to the output buffer.

  • elemByteSize: The byte size of the elements.

  • refToGathered: If true, rearrange from the reference layout to the gathered layout; otherwise rearrange from gathered back to reference.

Public Members

unsigned replicationFactor = 0

The graph’s replication factor.

std::size_t totalElementsPerReplica = 0

The total number of elements in one replica’s fragment.

std::vector<poplar::Interval> gatheredToRefSlices

The mapping from the gathered tensor back to the reference tensor.
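The relationship between rearrange(), the refToGathered flag and gatheredToRefSlices can be pictured with a simplified host-side model. The code below is a sketch under stated assumptions and is not GCL's implementation: Interval is a hypothetical stand-in for poplar::Interval, modelRearrange is a hypothetical free function, and any padding added by the balanced reorder is ignored. It assumes that the i-th slice names the reference elements that occupy the i-th contiguous run of the gathered buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for poplar::Interval: half-open range [begin, end)
// of element indices in the reference tensor.
struct Interval {
  std::size_t begin;
  std::size_t end;
};

// Simplified model, not GCL's implementation: slices[i] names the reference
// elements that occupy the i-th contiguous run of the gathered buffer.
void modelRearrange(const std::vector<Interval> &slices, const char *in,
                    char *out, std::int64_t elemByteSize, bool refToGathered) {
  const auto elemBytes = static_cast<std::size_t>(elemByteSize);
  std::size_t gatheredOffset = 0; // element offset in the gathered buffer
  for (const Interval &s : slices) {
    const std::size_t numElems = s.end - s.begin;
    const std::size_t refByte = s.begin * elemBytes;
    const std::size_t gatheredByte = gatheredOffset * elemBytes;
    if (refToGathered) // reference layout -> gathered layout
      std::memcpy(out + gatheredByte, in + refByte, numElems * elemBytes);
    else // gathered layout -> reference layout
      std::memcpy(out + refByte, in + gatheredByte, numElems * elemBytes);
    gatheredOffset += numElems;
  }
}
```

Calling the model with refToGathered set to true and then, on its output, with false restores the original buffer, which mirrors the pairing of rearrangeForCollective() and undoRearrangeForCollective().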

class CollectiveBalancedReorder
#include <gcl/CollectiveBalancedReorder.hpp>

Helper class to reorder a tensor in a per-tile-balanced fashion such that each replica obtains (for inputs to AllGather or outputs of ReduceScatter) an equally sized 1D tensor with equally sized regions.

This helper class reduces the memory used by the syncful collective. The reordering process:

  • Flattens the input tensor

  • Analyzes the tile mapping

  • Determines reordering strategy and required internal padding

  • Can rearrange and undo the rearrangement on any tensor that has the same tile mapping

  • Can rearrange and undo the rearrangement on host tensors that are to be copied into CBR-rearranged RemoteBuffers

Public Functions

CollectiveBalancedReorder(poplar::Graph &graph_, poplar::Tensor tensor_, unsigned replicationFactor_, const poplar::DebugNameAndId &dnai_)

Constructor.

Parameters
  • graph_: The poplar graph.

  • tensor_: The reference tensor to rearrange.

  • replicationFactor_: The replication factor of the graph.

  • dnai_: Debug name and id.

poplar::Tensor createReplicaSlice(const poplar::Type &type)

Create a tensor mapped efficiently over the same tiles as the reference tensor.

The returned tensor has the size of the result of the reduce scatter and of the input of the all gather.

Return

The efficient tensor created from the reference.

Parameters
  • type: The type to use when creating the tensor.

poplar::Tensor createCollectivesTensor(const poplar::Type &type, const std::string &debugPrefix)

Create a tensor mapped efficiently over the same tiles as the reference tensor.

The returned tensor has the size of the input of the reduce scatter and of the result of the all gather.

Return

The efficient tensor created from the reference.

Parameters
  • type: The type to use when creating the tensor.

  • debugPrefix: The debug prefix.

poplar::Tensor undoRearrangeForCollective(const poplar::Tensor &tensor) const

Reorder tensor back into the expected IR tensor shape and order.

Return

The tensor with the rearrangement undone.

Parameters
  • tensor: The tensor to rearrange.

std::vector<std::size_t> getReferenceShape() const

Get the shape of the reference tensor.

Return

The shape of the reference tensor.

const CollectiveBalancedHostRearrangement &getHostRearrangement() const

Get a helper class that allows the rearrangement to be applied on the host.

Return

The helper class for host rearrangement.

Private Functions

void rearrange(const char *in, char *out, int64_t elemByteSize, bool refToGathered) const

Host tensor rearrangement routine.

Private Members

poplar::Graph &graph

Graph or subgraph on which the tensor and reordered tensor are allocated.

unsigned replicationFactor
std::vector<std::size_t> numReplicaElementsPerTile
std::vector<poplar::Interval> gatheredToSimplifiedRefSlices
poplar::Tensor referenceTensor
poplar::TensorRearranger simplifier
CollectiveBalancedHostRearrangement hostRearrangement
const poplar::DebugNameAndId dnai