5. GCL API reference
The Graphcore Communication Library (GCL) provides application-level functions that can be used in Poplar programs for the IPU.
5.1. gcl/TileAllocation.hpp
-
namespace
gcl
Graphcore Communications Library.
Functions
-
unsigned
getNumXBsUsed
()
- Return
The number of exchange blocks used.
-
unsigned
getMinIoTiles
(const poplar::Graph &graph) The lowest number of IO tiles currently supported.
- Return
The lowest number of IO tiles currently supported.
- Parameters
graph
: The graph on which to check.
-
std::vector<unsigned>
perIPUTiles
(const poplar::Graph &graph, unsigned offset, unsigned count, bool sorted = true) Return a list of tile ids optimal for GCL collective operations.
- Return
A vector of tile ids.
- Parameters
graph
: The graph on which to allocate tiles.
offset
: The number of tiles to skip before allocating.
count
: The number of tile ids to return.
sorted
: If true, sort the returned list of ids. This should normally be true and is therefore the default.
5.2. gcl/Collectives.hpp
-
namespace
gcl
Graphcore Communications Library.
Enums
-
enum
CommGroupType
Enum to define communication group specification type.
Assumption: replica groups are uniform in size and layout on IPUs.
Values:
-
enumerator
ALL
All replicas are viewed as one group; the replica group size is ignored.
-
enumerator
CONSECUTIVE
Groups are consecutive in replica ordering.
If there are N replicas denoted {0, … N-1} and group size is k, then the groups are: {0, 1, … k-1}, {k, … 2k-1} … {N-k, … N-1}
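The CONSECUTIVE layout can be sketched on the host as a plain partition of replica indices. The helper name `consecutiveGroups` is hypothetical and is not part of GCL; the sketch assumes the group size divides the number of replicas evenly.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a GCL function): list the members of each
// CONSECUTIVE group for numReplicas replicas and a given group size.
std::vector<std::vector<unsigned>> consecutiveGroups(unsigned numReplicas,
                                                     unsigned groupSize) {
  // Assumption: groups are uniform, so groupSize divides numReplicas.
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  std::vector<std::vector<unsigned>> groups(numReplicas / groupSize);
  for (unsigned replica = 0; replica < numReplicas; ++replica) {
    // Replica r belongs to group r / groupSize.
    groups[replica / groupSize].push_back(replica);
  }
  return groups;
}
```

With 8 replicas and a group size of 4, the last group starts at replica N-k = 4.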
-
enumerator
ORTHOGONAL
Groups are sliced orthogonal to the replica ordering.
If there are N replicas denoted {0, … N-1} and group size is k, then there are m = N/k groups, each of size k: {0, m, 2m, …}, {1, m+1, 2m+1, …} … {m-1, 2m-1, …, N-1}
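A minimal host-side sketch of the ORTHOGONAL layout follows. It rests on an assumption that should be checked against the GCL version in use: every group holds exactly groupSize replicas, so the members of a group are spaced N / groupSize apart. The helper name `orthogonalGroups` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a GCL function): list the members of each
// ORTHOGONAL group, assuming each group contains exactly groupSize replicas.
std::vector<std::vector<unsigned>> orthogonalGroups(unsigned numReplicas,
                                                    unsigned groupSize) {
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  // The number of groups is also the stride between members of a group.
  const unsigned numGroups = numReplicas / groupSize;
  std::vector<std::vector<unsigned>> groups(numGroups);
  for (unsigned replica = 0; replica < numReplicas; ++replica) {
    groups[replica % numGroups].push_back(replica);
  }
  return groups;
}
```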
Functions
-
poplar::Tensor
allReduce
(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) Perform an all-reduce operation.
The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.
Supported Option Flags:
useSynclessCollectives
(true, false, auto) [=auto]
true: Use the syncless implementation.
false: Use the syncful implementation.
auto: Choose the appropriate implementation for the operation in question. At the moment, syncless is used when going over gateway links and syncful when going over IPU links.
maxBytesPerTile
Integer [=35000]
The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.
topology
(rung-ring-2, rung-ring-4, ring-on-line, peripheral-ring) []
The topology to use for the syncful implementation. If this option is not specified, the topology is detected automatically.
rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU link mesh, by moving straight up and assuming wrap-around at the top.
rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU link mesh, by moving straight up and assuming wrap-around at the top.
ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU link mesh.
peripheral-ring: Relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU link mesh.
link
(auto-link, ipu-link, gw-link) [=auto-link]
auto-link: Use the link type appropriate for the operation.
ipu-link: Use the IPU links.
gw-link: Use the gateway links.
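The option names, permitted values and the 64000 limit listed above can be checked before they are handed to a collective. The following is a sketch only: `checkCollectiveOptions` is a hypothetical helper working on a plain `std::map` rather than `poplar::OptionFlags`, and GCL performs its own validation, which may differ.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical pre-check (not a GCL function). Returns an empty string if all
// options are recognised and in range, else a description of the first problem.
std::string checkCollectiveOptions(
    const std::map<std::string, std::string> &options) {
  // Enumerated options and their permitted values, as documented above.
  static const std::map<std::string, std::set<std::string>> allowed = {
      {"useSynclessCollectives", {"true", "false", "auto"}},
      {"topology",
       {"rung-ring-2", "rung-ring-4", "ring-on-line", "peripheral-ring"}},
      {"link", {"auto-link", "ipu-link", "gw-link"}}};
  for (const auto &entry : options) {
    const std::string &name = entry.first;
    const std::string &value = entry.second;
    if (name == "maxBytesPerTile") {
      if (value.empty() ||
          value.find_first_not_of("0123456789") != std::string::npos) {
        return "maxBytesPerTile must be a non-negative integer";
      }
      // The length guard keeps std::stoul from overflowing on long inputs.
      if (value.size() > 9 || std::stoul(value) > 64000) {
        return "maxBytesPerTile exceeds 64000";
      }
      continue;
    }
    const auto it = allowed.find(name);
    if (it == allowed.end()) {
      return "unknown option: " + name;
    }
    if (it->second.count(value) == 0) {
      return "invalid value for " + name + ": " + value;
    }
  }
  return "";
}
```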
- Return
A replicated tensor with the reduction of data.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See above.
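The effect of an all-reduce over replica groups can be modelled on the host with ordinary vectors. This sketch is not GCL code: `allReduceSumModel` is a hypothetical name, the operator is fixed to addition, and the groups are assumed to be CONSECUTIVE. Each inner vector stands for the tensor held by one replica.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (not a GCL function) of a sum all-reduce over CONSECUTIVE
// groups. perReplica[r] is the data held by replica r.
std::vector<std::vector<int>> allReduceSumModel(
    const std::vector<std::vector<int>> &perReplica, std::size_t groupSize) {
  const std::size_t numReplicas = perReplica.size();
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  std::vector<std::vector<int>> result(numReplicas);
  for (std::size_t first = 0; first < numReplicas; first += groupSize) {
    // Reduce element-wise across the replicas of this group.
    std::vector<int> sum(perReplica[first].size(), 0);
    for (std::size_t r = first; r < first + groupSize; ++r) {
      assert(perReplica[r].size() == sum.size());
      for (std::size_t i = 0; i < sum.size(); ++i) {
        sum[i] += perReplica[r][i];
      }
    }
    // Every replica in the group receives the same reduced tensor.
    for (std::size_t r = first; r < first + groupSize; ++r) {
      result[r] = sum;
    }
  }
  return result;
}
```

Passing a group size equal to the number of replicas models the overloads that take no group argument.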
-
poplar::Tensor
allReduce
(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::allReduce with popops::CollectiveOperator instead.
- Return
A replicated tensor with the reduction of data.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See above.
-
poplar::Tensor
allReduce
(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allReduce() without the group arg (for all replicas).
- Return
A replicated tensor with the reduction of data.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See above.
-
poplar::Tensor
allReduce
(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::allReduce with popops::CollectiveOperator instead.
- Return
A replicated tensor with the reduction of data.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceToDestination
(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allReduce() but writes the result to the destination tensor.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
destination
: Tensor to write the result to.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceToDestination
(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::allReduceToDestination with popops::CollectiveOperator instead.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
destination
: Tensor to write the result to.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceToDestination
(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allReduceToDestination() without the group arg (for all replicas).
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
destination
: Tensor to write the result to.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceToDestination
(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::allReduceToDestination with popops::CollectiveOperator instead.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
destination
: Tensor to write the result to.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceInPlace
(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allReduce() but writes the result back to the input data tensor.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceInPlace
(poplar::Graph &graph, poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::allReduceInPlace with popops::CollectiveOperator instead.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceInPlace
(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allReduceInPlace() without the group arg (for all replicas).
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See above.
-
void
allReduceInPlace
(poplar::Graph &graph, poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::allReduceInPlace with popops::CollectiveOperator instead.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See above.
-
poplar::Tensor
reduceScatter
(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) Reduce the replicated rank-1 tensor toReduce (the data argument) with the result scattered across the replicas.
For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:
Before:
Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]
After:
Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]
For an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica, the output will have shape: [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:
Before:
Replica0: toReduce[x0, y0, z0, w0]
Replica1: toReduce[x1, y1, z1, w1]
Replica2: toReduce[x2, y2, z2, w2]
Replica3: toReduce[x3, y3, z3, w3]
Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]
After:
Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3), 0]
Replica2: result[op(z0, z1, z2, z3), 0]
Replica3: result[0, 0]
Mapping: result[IPU0, IPU1]
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce scatter.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
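The output shape and zero-padding rules of reduceScatter() can be reproduced with a small host-side model. This is a sketch, not GCL code: `reduceScatterSumModel` and its `elemsPerIpu` parameter are hypothetical, the operator is fixed to addition, and all replicas are treated as one group. `elemsPerIpu` lists how many consecutive elements of the input are mapped to each IPU of a replica.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (not a GCL function) of a sum reduce-scatter.
// perReplica[r] is the rank-1 input held by replica r.
std::vector<std::vector<int>> reduceScatterSumModel(
    const std::vector<std::vector<int>> &perReplica,
    const std::vector<std::size_t> &elemsPerIpu) {
  const std::size_t numReplicas = perReplica.size();
  assert(numReplicas > 0);
  std::size_t total = 0;
  for (std::size_t n : elemsPerIpu) total += n;
  for (const auto &input : perReplica) assert(input.size() == total);

  std::vector<std::vector<int>> result(numReplicas);
  std::size_t segmentBegin = 0;
  for (std::size_t numElems : elemsPerIpu) {
    // Each replica receives ceil(numElems / numReplicas) elements per IPU.
    const std::size_t chunk = (numElems + numReplicas - 1) / numReplicas;
    for (std::size_t r = 0; r < numReplicas; ++r) {
      for (std::size_t j = 0; j < chunk; ++j) {
        const std::size_t local = r * chunk + j;
        int value = 0;  // Positions past the end of the segment are padding.
        if (local < numElems) {
          for (const auto &input : perReplica) {
            value += input[segmentBegin + local];
          }
        }
        result[r].push_back(value);
      }
    }
    segmentBegin += numElems;
  }
  return result;
}
```

Calling it with `elemsPerIpu = {3}` and two replicas mirrors the first example above; `elemsPerIpu = {3, 1}` with four replicas mirrors the second.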
-
poplar::Tensor
reduceScatter
(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::reduceScatter with popops::CollectiveOperator instead.
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce scatter.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
-
poplar::Tensor
reduceScatter
(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As reduceScatter() without the group arg (for all replicas).
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce scatter.
op
: The reduction operator (for example, popops::CollectiveOperator::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
-
poplar::Tensor
reduceScatter
(poplar::Graph &graph, const poplar::Tensor &data, popops::Operation op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
- Deprecated:
Use gcl::reduceScatter with popops::CollectiveOperator instead.
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to reduce scatter.
op
: The reduction operator (for example, popops::Operation::ADD).
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
-
poplar::Tensor
allGather
(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) Gather the replicated tensor toGather (the data argument) and return the result so each replica will have a copy of all other replicas’ toGather tensors.
For instance:
Before:
Replica0: toGather[x,y]
Replica1: toGather[z,w]
Replica2: toGather[x1, y1]
After allGather:
Replica0: result[x,y,z,w,x1,y1]
Replica1: result[x,y,z,w,x1,y1]
Replica2: result[x,y,z,w,x1,y1]
For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to gather.
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
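As a host-side illustration of the allGather() result, the sketch below concatenates the per-replica inputs in replica order and hands every replica the same copy. `allGatherModel` is a hypothetical name, all replicas are treated as one group, and the result is kept flat, as in the example above, rather than shaped as [replicationFactor][incomingShape].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (not a GCL function) of an all-gather over all replicas.
// perReplica[r] is the flattened input held by replica r.
std::vector<std::vector<int>> allGatherModel(
    const std::vector<std::vector<int>> &perReplica) {
  std::vector<int> gathered;
  for (const auto &input : perReplica) {
    // Replicated tensors have the same shape on every replica.
    assert(input.size() == perReplica.front().size());
    gathered.insert(gathered.end(), input.begin(), input.end());
  }
  // Every replica ends up with the full concatenation.
  return std::vector<std::vector<int>>(perReplica.size(), gathered);
}
```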
-
poplar::Tensor
allGather
(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allGather() without the group arg (for all replicas).
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to gather.
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
-
poplar::Tensor
allToAll
(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) Perform an all-to-all exchange of the elements of the input tensor based on replica ID.
The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.
The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:
Input tensor:
Replica0: Tensor T[x0,x1,x2]
Replica1: Tensor T[y0,y1,y2]
Replica2: Tensor T[z0,z1,z2]
Output tensor:
Replica0: Tensor T[x0,y0,z0]
Replica1: Tensor T[x1,y1,z1]
Replica2: Tensor T[x2,y2,z2]
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to exchange.
prog
: The program sequence to add operations to.
group
: The subset of replicas for the collective operation.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
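Viewed across replicas, the exchange described for allToAll() is a transpose: slice j of replica i ends up as slice i of replica j. The host-side sketch below models that with one scalar per slice. `allToAllModel` is a hypothetical name and all replicas are treated as one group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model (not a GCL function) of an all-to-all exchange.
// perReplica[s][d] is the slice that replica s holds for replica d.
std::vector<std::vector<int>> allToAllModel(
    const std::vector<std::vector<int>> &perReplica) {
  const std::size_t numReplicas = perReplica.size();
  std::vector<std::vector<int>> out(numReplicas,
                                    std::vector<int>(numReplicas, 0));
  for (std::size_t sender = 0; sender < numReplicas; ++sender) {
    // The first dimension of the input must equal the number of replicas.
    assert(perReplica[sender].size() == numReplicas);
    for (std::size_t receiver = 0; receiver < numReplicas; ++receiver) {
      // The receiver stores the slice at the index of the sending replica.
      // The slice with sender == receiver stays local and is simply copied.
      out[receiver][sender] = perReplica[sender][receiver];
    }
  }
  return out;
}
```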
-
poplar::Tensor
allToAll
(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {}) As allToAll() without the group arg (for all replicas).
- Return
The output tensor, with the content described above.
- Parameters
graph
: The replicated graph the input tensor belongs to.
data
: The replicated tensor to exchange.
prog
: The program sequence to add operations to.
debugContext
: Optional debug context.
options
: See gcl::allReduce().
-
struct
CommGroup
- #include <gcl/Collectives.hpp>
Struct to specify sub-groups of replicas.
Examples of derived sub-groups:
IPU-link domain sub-rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size/N
where N is a power of two and replicaGroupSize > 1.
Complete IPU-link domain / full rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size
Using GW-links only:
type == ORTHOGONAL && replicaGroupSize == 64/replica-size
Public Functions
-
CommGroup
() = default
-
CommGroup
(const CommGroupType &groupType, unsigned groupSize) Construct CommGroup.
- Parameters
groupType
: Replica group type.
groupSize
: Replica group size.
Public Members
-
CommGroupType
type
= CommGroupType::ALL Replica group type.
-
unsigned
replicaGroupSize
= 0 Replica group size.
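The group sizes in the IPU-link domain examples for CommGroup follow from one division. The sketch below uses stand-in types (`GroupType`, `GroupSpec`) and a hypothetical helper `ipuLinkDomainGroup` so that it compiles without Poplar. It assumes that "replica-size" means the number of IPUs per replica and takes the figure of 64 IPUs per IPU-link domain from the formulas above.

```cpp
#include <cassert>

// Stand-ins for gcl::CommGroupType and gcl::CommGroup (illustration only).
enum class GroupType { ALL, CONSECUTIVE, ORTHOGONAL };
struct GroupSpec {
  GroupType type;
  unsigned replicaGroupSize;
};

// Hypothetical helper: a CONSECUTIVE group covering 1/n of an IPU-link
// domain of 64 IPUs. n == 1 gives the full-rack case.
GroupSpec ipuLinkDomainGroup(unsigned ipusPerReplica, unsigned n = 1) {
  // n must be a power of two, and the division must be exact.
  assert(ipusPerReplica > 0 && n > 0 && (n & (n - 1)) == 0);
  assert(64 % (ipusPerReplica * n) == 0);
  const unsigned size = 64 / ipusPerReplica / n;
  // Sub-rack groups are only meaningful with more than one replica.
  assert(n == 1 || size > 1);
  return GroupSpec{GroupType::CONSECUTIVE, size};
}
```

With real GCL types the same values would be passed to the CommGroup(groupType, groupSize) constructor.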
5.3. gcl/CollectiveBalancedReorder.hpp
-
namespace
gcl
Graphcore Communications Library.
-
class
CollectiveBalancedHostRearrangement
- #include <gcl/CollectiveBalancedReorder.hpp>
This class contains functions and data necessary to rearrange tensors on the host side at runtime.
The separation is made so that we can serialize the state and restore it without having to create a poplar::Graph.
Public Functions
-
void
rearrangeForCollective
(const char *in, char *out, int64_t elemByteSize) const Reorder the tensor in a balanced, collective-friendly manner (host-side).
- Parameters
in
: Pointer to the input buffer.
out
: Pointer to the output buffer.
elemByteSize
: The byte size of the elements.
-
void
undoRearrangeForCollective
(const char *in, char *out, int64_t elemByteSize) const Reorder tensor back into the expected IR tensor shape and order (host-side).
- Parameters
in
: Pointer to the input buffer.
out
: Pointer to the output buffer.
elemByteSize
: The byte size of the elements.
-
size_t
getNumRearrangedTensorElems
() const Number of elements in the collective balanced (reordered) tensor.
- Return
The number of elements.
-
void
rearrange
(const char *in, char *out, int64_t elemByteSize, bool refToGathered) const Host tensor rearrangement routine.
- Parameters
in
: Pointer to the input buffer.
out
: Pointer to the output buffer.
elemByteSize
: The byte size of the elements.
refToGathered
: Whether to rearrange from the reference layout to the gathered layout (true) or the other way around (false).
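The host-side routines above work on raw byte buffers, with elemByteSize giving the width of one element. The sketch below shows that calling pattern with an explicit element permutation and its inverse. It is illustrative only: `applyPermutation` and `perm` are hypothetical, whereas the real class derives its ordering (and any padding) from the tile mapping of the reference tensor, so its output buffer may be larger than the input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch (not the GCL implementation): move elements of
// elemByteSize bytes between a reference layout and a rearranged layout.
// perm[i] is the reference index of the element stored at rearranged index i.
void applyPermutation(const char *in, char *out, std::int64_t elemByteSize,
                      const std::vector<std::size_t> &perm,
                      bool refToGathered) {
  assert(elemByteSize > 0);
  const std::size_t bytes = static_cast<std::size_t>(elemByteSize);
  for (std::size_t i = 0; i < perm.size(); ++i) {
    // Forward: read reference slot perm[i], write rearranged slot i.
    // Backward: read rearranged slot i, write reference slot perm[i].
    const std::size_t src = refToGathered ? perm[i] : i;
    const std::size_t dst = refToGathered ? i : perm[i];
    std::memcpy(out + dst * bytes, in + src * bytes, bytes);
  }
}
```

Applying the routine with refToGathered set to true and then to false restores the original buffer, which is the relationship between rearrangeForCollective() and undoRearrangeForCollective().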
-
class
CollectiveBalancedReorder
- #include <gcl/CollectiveBalancedReorder.hpp>
Helper class to reorder a tensor in a per-tile-balanced fashion such that each replica obtains (for inputs to AllGather or outputs of ReduceScatter) an equally sized 1D tensor with equally sized regions.
This helper class reduces the memory used by the syncful collective. The reordering process:
Flattens the input tensor
Analyzes the tile mapping
Determines reordering strategy and required internal padding
Can rearrange and undo the rearrangement on any tensor that has the same tile mapping
Can rearrange and undo the rearrangement on host tensors that are to be copied into CBR-rearranged RemoteBuffers
Public Functions
-
CollectiveBalancedReorder
(poplar::Graph &graph_, poplar::Tensor tensor_, unsigned replicationFactor_, const poplar::DebugNameAndId &dnai_) Constructor.
- Parameters
graph_
: The Poplar graph.
tensor_
: The reference tensor to rearrange.
replicationFactor_
: The replication factor of the graph.
dnai_
: Debug name and id.
-
poplar::Tensor
createReplicaSlice
(const poplar::Type &type) Create a tensor mapped efficiently over the same tiles as the reference tensor.
The returned tensor has the size of the result of the reduce scatter and of the input of the all gather.
- Return
The efficient tensor created from the reference.
- Parameters
type
: The type to use when creating the tensor.
-
poplar::Tensor
createCollectivesTensor
(const poplar::Type &type, const std::string &debugPrefix) Create a tensor mapped efficiently over the same tiles as the reference tensor.
The returned tensor has the size of the input of the reduce scatter and of the result of the all gather.
- Return
The efficient tensor created from the reference.
- Parameters
type
: The type to use when creating the tensor.
debugPrefix
: The debug prefix.
-
poplar::Tensor
undoRearrangeForCollective
(const poplar::Tensor &tensor) const Reorder tensor back into the expected IR tensor shape and order.
- Return
The tensor with the rearrangement undone.
- Parameters
tensor
: The tensor to rearrange.
-
std::vector<std::size_t>
getReferenceShape
() const Get the shape of the reference tensor.
- Return
The shape of the reference tensor.
-
const CollectiveBalancedHostRearrangement &
getHostRearrangement
() const Get a helper class that allows the rearrangement to be applied on the host.
- Return
The helper class for host rearrangement.
Private Functions
-
void
rearrange
(const char *in, char *out, int64_t elemByteSize, bool refToGathered) const Host tensor rearrangement routine.
Private Members
-
unsigned
replicationFactor
-
poplar::TensorRearranger
simplifier
-
CollectiveBalancedHostRearrangement
hostRearrangement
-
const poplar::DebugNameAndId
dnai