5. GCL API reference
The Graphcore Communications Library (GCL) provides application-level functions that can be used in Poplar programs for the IPU.
gcl/TileAllocation.hpp
-
namespace gcl
Graphcore Communications Library.
Functions
-
unsigned getNumXBsUsed()
- Returns
The number of exchange blocks used
-
unsigned getMinIoTiles(const poplar::Graph &graph)
The lowest number of io tiles currently supported.
- Parameters
graph – The graph on which to check
- Returns
The lowest number of io tiles currently supported
-
std::vector<unsigned> perIPUTiles(const poplar::Graph &graph, unsigned offset, unsigned count, bool sorted = true, bool tilePairs = true)
Return a list of tile ids optimal for gcl collective operations.
- Parameters
graph – The graph on which to allocate tiles
offset – Skip a number of tiles and allocate from an offset.
count – Number of tile ids to return.
sorted – If true, the returned list of ids will be sorted. This should normally be true and is therefore the default.
tilePairs – Override the default behaviour and return tile pairs. This is normally false and thus not the default, so it has to be requested explicitly by the caller.
- Returns
A vector of tile ids.
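As a brief illustration, here is a minimal sketch of how these helpers might be combined to reserve IO tiles for collectives; the helper name reserveIoTiles is illustrative and not part of GCL, and an existing poplar::Graph is assumed.

#include <gcl/TileAllocation.hpp>
#include <poplar/Graph.hpp>
#include <vector>

// Sketch: reserve the minimum supported number of IO tiles for GCL
// collectives, starting at tile offset 0.
std::vector<unsigned> reserveIoTiles(const poplar::Graph &graph) {
  const unsigned numIoTiles = gcl::getMinIoTiles(graph);
  // perIPUTiles returns a sorted list of tile ids by default.
  return gcl::perIPUTiles(graph, /*offset=*/0, /*count=*/numIoTiles);
}

How the returned tile ids are consumed (for example, to build a virtual graph that separates IO tiles from compute tiles) is left to the application.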
gcl/Collectives.hpp
-
namespace gcl
Graphcore Communications Library.
Enums
-
enum CommGroupType
Enum to define communication group specification type.
Assumption: replica groups are uniform in size and layout on IPUs.
Values:
-
enumerator ALL
All replicas viewed as one group; the replica group size is ignored.
-
enumerator CONSECUTIVE
Groups are consecutive in replica index.
If there are N replicas denoted {0, … N-1} and the group size is k, then there are N/k groups of size k: {0, 1, … k-1}, {k, … 2k-1} … {N-k, … N-1}
-
enumerator ORTHOGONAL
Groups are sliced orthogonal to the replica ordering.
If there are N replicas denoted {0, … N-1} and group size is k, then there are m = N/k groups of size k: {0, m, 2m, …}, {1, m+1, 2m+1, …} … {m-1, 2m-1, … N-1}
Functions
-
poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-reduce operation.
The operation is performed on the provided tensor over replicas as specified by the group argument. This operation reduces across the tensors that the replicated tensor is a handle for. The result is returned as a replicated tensor.
Supported Option Flags:
useSynclessCollectives (true, false, hybrid, auto) [=auto]
true: Use the syncless implementation.
false: Use the syncful implementation.
hybrid: Use syncful over IPU links and syncless over GW links.
auto: Choose the appropriate implementation for the operation in question. At the moment this is the same as ‘false’.
maxBytesPerTile (integer) [=35000]
The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.
topology (rung-ring-2, rung-ring-4, rung-ring-8, ring-on-line, peripheral-ring) []
The topology to use for the syncful implementation. If this option is not specified, the topology is auto-detected.
rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU-link mesh, moving straight up and wrapping around at the top.
rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU-link mesh, moving straight up and wrapping around at the top.
rung-ring-8: Relevant for replica size 8. The traffic follows one of two physical rings, one on each side of the IPU-link mesh, moving straight up and wrapping around at the top.
ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU-link mesh.
peripheral-ring: Relevant for replica size 1. The traffic follows a single ring around the periphery of the IPU-link mesh.
link (auto-link, ipu-link, gw-link) [=auto-link]
auto-link: Use the link type appropriate for the operation.
ipu-link: Use the IPU links.
gw-link: Use the gateway links.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See above.
- Returns
A replicated tensor with the reduction of data.
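As an illustration, here is a minimal sketch of calling allReduce on groups of two consecutive replicas, with two of the option flags above spelled out. The function name and the chosen option values are illustrative; an existing replicated graph, tensor and program sequence are assumed.

#include <gcl/Collectives.hpp>
#include <poplar/Graph.hpp>
#include <poplar/OptionFlags.hpp>
#include <poplar/Program.hpp>

// Sketch: sum a replicated tensor across pairs of consecutive replicas.
poplar::Tensor addAcrossPairs(poplar::Graph &graph, const poplar::Tensor &t,
                              poplar::program::Sequence &prog) {
  // Replicas {0,1}, {2,3}, ... each form their own reduction group.
  gcl::CommGroup pairs(gcl::CommGroupType::CONSECUTIVE, /*groupSize=*/2);
  poplar::OptionFlags options = {
      {"useSynclessCollectives", "auto"},  // let GCL pick the implementation
      {"maxBytesPerTile", "35000"}};       // default payload limit per IO tile
  return gcl::allReduce(graph, t, popops::CollectiveOperator::ADD, prog, pairs,
                        {"allReduce/pairs"}, options);
}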
-
poplar::Tensor allReduce(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduce() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See above.
- Returns
A replicated tensor with the reduction of data.
-
void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduce() but writes the result to the destination tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See above.
-
void allReduceToDestination(poplar::Graph &graph, const poplar::Tensor &data, poplar::Tensor &destination, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceToDestination() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
destination – Tensor to write the result to.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See above.
-
void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduce() but writes the result back to the input data tensor.
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See above.
-
void allReduceInPlace(poplar::Graph &graph, poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allReduceInPlace() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See above.
-
poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Reduce the replicated rank-1 tensor toReduce, with the result scattered across the replicas.
For an input of shape [numElements] mapped to a single IPU per replica, the output will have shape [ceil(numElements / replicationFactor)]. If replicationFactor does not evenly divide numElements, the result is zero-padded. For instance:
Before:
Replica0: toReduce[x0, y0, z0]
Replica1: toReduce[x1, y1, z1]
After:
Replica0: result[op(x0, x1), op(y0, y1)]
Replica1: result[op(z0, z1), 0]
For an input of shape [numElementsIPU0 + numElementsIPU1 + …] mapped to multiple IPUs per replica, the output will have shape: [ceil(numElementsIPU0 / replicationFactor) + ceil(numElementsIPU1 / replicationFactor) + …] with the result grouped per IPU. If replicationFactor does not evenly divide the number of elements on an IPU, the result is zero-padded per IPU. For instance:
Before:
Replica0: toReduce[x0, y0, z0, w0]
Replica1: toReduce[x1, y1, z1, w1]
Replica2: toReduce[x2, y2, z2, w2]
Replica3: toReduce[x3, y3, z3, w3]
Mapping: toReduce[IPU0, IPU0, IPU0, IPU1]
After:
Replica0: result[op(x0, x1, x2, x3), op(w0, w1, w2, w3)]
Replica1: result[op(y0, y1, y2, y3), 0]
Replica2: result[op(z0, z1, z2, z3), 0]
Replica3: result[0, 0]
Mapping: result[IPU0, IPU1]
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
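A minimal usage sketch, assuming an existing replicated graph and program sequence; the function name is illustrative. With a replication factor of 2 and a 3-element input, as in the first example above, each replica receives a 2-element slice and the final element on the last replica is zero padding.

#include <gcl/Collectives.hpp>

// Sketch: reduce-scatter a rank-1 replicated tensor across all replicas
// with an ADD reduction.
poplar::Tensor scatterSum(poplar::Graph &graph, const poplar::Tensor &data,
                          poplar::program::Sequence &prog) {
  return gcl::reduceScatter(graph, data, popops::CollectiveOperator::ADD, prog,
                            {"reduceScatter/sum"});
}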
-
poplar::Tensor reduceScatter(poplar::Graph &graph, const poplar::Tensor &data, popops::CollectiveOperator op, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As reduceScatter() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to reduce scatter.
op – The reduction operator (for example, popops::CollectiveOperator::ADD).
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Gather the replicated tensor toGather and return the result so each replica will have a copy of all other replicas’ toGather tensors. For instance:
Before:
Replica0: toGather[x,y]
Replica1: toGather[z,w]
Replica2: toGather[x1, y1]
After allGather:
Replica0: result[x,y,z,w,x1,y1]
Replica1: result[x,y,z,w,x1,y1]
Replica2: result[x,y,z,w,x1,y1]
For an input of shape [incomingShape] the output will be [replicationFactor][incomingShape].
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
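A minimal usage sketch; the function name is illustrative. For an input of shape [n] on each replica, the returned tensor has shape [replicationFactor][n], as described above.

#include <gcl/Collectives.hpp>

// Sketch: gather a per-replica shard so that every replica ends up with a
// copy of all shards.
poplar::Tensor gatherShards(poplar::Graph &graph, const poplar::Tensor &shard,
                            poplar::program::Sequence &prog) {
  return gcl::allGather(graph, shard, prog, {"allGather/shards"});
}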
-
poplar::Tensor allGather(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allGather() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to gather.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
-
poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const CommGroup &group, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
Perform an all-to-all exchange of the elements of the input tensor based on replica ID.
The shape of the input must have the number of replicas in the graph as its first or only dimension. That dimension will be used to split up the tensor being sent, with each replica sending all splits except for the split index which matches its replica ID. That is, replica 2 will not send input[2] and so on.
The replica receiving the slice will copy that incoming slice into the output at the index which matches the replica ID of the replica which sent it. For instance:
Input tensor:
Replica0: Tensor T[x0,x1,x2]
Replica1: Tensor T[y0,y1,y2]
Replica2: Tensor T[z0,z1,z2]
Output tensor:
Replica0: Tensor T[x0,y0,z0]
Replica1: Tensor T[x1,y1,z1]
Replica2: Tensor T[x2,y2,z2]
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
group – The subset of replicas for the collective operation.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
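A minimal usage sketch; the function name is illustrative and the assert simply restates the requirement on the first dimension. Slice i of the input on replica r ends up at index r of the output on replica i.

#include <gcl/Collectives.hpp>
#include <cassert>

// Sketch: exchange slices between all replicas.
poplar::Tensor exchangeSlices(poplar::Graph &graph, const poplar::Tensor &data,
                              poplar::program::Sequence &prog) {
  // The first dimension of the input must equal the number of replicas.
  assert(data.dim(0) == graph.getReplicationFactor());
  return gcl::allToAll(graph, data, prog, {"allToAll/slices"});
}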
-
poplar::Tensor allToAll(poplar::Graph &graph, const poplar::Tensor &data, poplar::program::Sequence &prog, const poplar::DebugContext &debugContext = {}, const poplar::OptionFlags &options = {})
As allToAll() without the group arg (for all replicas).
- Parameters
graph – The replicated graph the input tensor belongs to.
data – The replicated tensor to exchange.
prog – The program sequence to add operations to.
debugContext – Optional debug context
options – See gcl::allReduce().
- Returns
The output tensor, with the content described above.
-
struct CommGroup
- #include <Collectives.hpp>
Struct to specify sub-groups of replicas.
Examples of derived sub-groups:
IPU-link domain sub-rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size/N
where N is a power of two and replicaGroupSize > 1.
Complete IPU-link domain / full rack:
type == CONSECUTIVE && replicaGroupSize == 64/replica-size
Using GW-links only:
type == ORTHOGONAL && replicaGroupSize == 64/replica-size
Public Functions
-
CommGroup() = default
-
inline CommGroup(const CommGroupType &groupType, unsigned groupSize)
Construct CommGroup.
- Parameters
groupType – replica group type
groupSize – replica group size
Public Members
-
CommGroupType type = CommGroupType::ALL
Replica group type.
-
unsigned replicaGroupSize = 0
Replica group size.
0 means the default size for the given group type.
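As a brief illustration, assuming a program compiled with 8 replicas, the three group types select the replica groups shown in the comments below. The function is only a sketch; the groupings follow the CommGroupType definitions above.

#include <gcl/Collectives.hpp>

void exampleGroups() {
  gcl::CommGroup all;  // CommGroupType::ALL: one group {0,1,2,3,4,5,6,7}
  gcl::CommGroup consecutive(gcl::CommGroupType::CONSECUTIVE, 4);
  // Two groups: {0,1,2,3} and {4,5,6,7}
  gcl::CommGroup orthogonal(gcl::CommGroupType::ORTHOGONAL, 4);
  // Two groups: {0,2,4,6} and {1,3,5,7}
}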
gcl/CollectiveBalancedReorder.hpp
-
namespace gcl
Graphcore Communications Library.
-
class CollectiveBalancedHostRearrangement
- #include <CollectiveBalancedReorder.hpp>
This class contains functions and data necessary to rearrange tensors on the host side at runtime.
The separation is made so that we can serialize the state and restore it without having to create a poplar::Graph.
Public Functions
-
void rearrangeForCollective(const void *in, void *out, int64_t elemByteSize) const
Reorder the tensor in a balanced, collective-friendly manner (host-side).
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
elemByteSize – The byte size of the elements.
-
void undoRearrangeForCollective(const void *in, void *out, int64_t elemByteSize) const
Reorder tensor back into the expected IR tensor shape and order (host-side).
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
elemByteSize – The byte size of the elements.
-
size_t getNumRearrangedTensorElems() const
Number of elements in the collective balanced (reordered) tensor.
- Returns
The number of elements.
-
void rearrange(const void *in, void *out, int64_t elemByteSize, bool refToGathered) const
Host tensor rearrangement routine.
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
elemByteSize – The byte size of the elements.
refToGathered – Whether to rearrange from reference to gathered ordering, or the other way around.
Public Members
-
unsigned replicationFactor = 0
The graph’s replication factor.
Private Functions
-
template<typename ElementType>
void rearrangeImpl(const ElementType *in, ElementType *out, bool refToGathered) const
Host tensor rearrangement routine.
- Parameters
in – Pointer to the input buffer.
out – Pointer to the output buffer.
refToGathered – Whether to rearrange from reference to gathered ordering, or the other way around.
-
class CollectiveBalancedReorder
- #include <CollectiveBalancedReorder.hpp>
Helper class to reorder a tensor in a per-tile-balanced fashion such that each replica obtains (for inputs to AllGather or outputs of ReduceScatter) an equally sized 1D tensor with equally sized regions.
This helper class reduces the memory used by the syncful collective. The reordering process:
Flattens the input tensor
Analyses the tile mapping
Determines reordering strategy and required internal padding
Can rearrange and undo the rearrangement on any tensor that has the same tile mapping
Can rearrange and undo the rearrangement on host tensors that are to be copied into CBR-rearranged RemoteBuffers
Public Functions
-
CollectiveBalancedReorder(poplar::Graph &graph_, poplar::Tensor tensor_, unsigned replicationFactor_, const poplar::DebugNameAndId &dnai_, bool allowElementMap = false)
Constructor.
- Parameters
graph_ – The poplar graph.
tensor_ – The reference tensor to rearrange.
replicationFactor_ – The replication factor of the graph.
dnai_ – Debug name and id.
allowElementMap – Allow an alternative representation of the host rearrangements. Sometimes it is beneficial to collapse all intervals into a simple 1-to-1 element map. This flag should be set to true in all new code; it will be deprecated when all frameworks implement serialisation of the newly added elementMap field.
-
poplar::Tensor createReplicaSlice(const poplar::Type &type)
Create a tensor mapped efficiently over the same tiles as the reference tensor.
The returned tensor has the size of the result of the reduce scatter and of the input of the all gather.
- Parameters
type – The type to use when creating the tensor.
- Returns
The efficient tensor created from the reference.
-
poplar::Tensor createCollectivesTensor(const poplar::Type &type, const std::string &debugPrefix)
Create a tensor mapped efficiently over the same tiles as the reference tensor.
The returned tensor has the size of the input of the reduce scatter and of the result of the all gather.
- Parameters
type – The type to use when creating the tensor.
debugPrefix – The debug prefix.
- Returns
The efficient tensor created from the reference.
-
poplar::Tensor undoRearrangeForCollective(const poplar::Tensor &tensor) const
Reorder tensor back into the expected IR tensor shape and order.
- Parameters
tensor – The tensor to rearrange.
- Returns
The tensor with the rearrangement undone.
-
inline std::vector<std::size_t> getReferenceShape() const
Get the shape of the reference tensor.
- Returns
The shape of the reference tensor.
-
inline const CollectiveBalancedHostRearrangement &getHostRearrangement() const
Get a helper class that allows the rearrangement to be applied on the host.
- Returns
The helper class for host rearrangement.
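A sketch of a typical CollectiveBalancedReorder round trip, assuming an existing replicated graph and program sequence. The names are illustrative, and the flatten() of the gathered tensor is an assumption about the layout expected by undoRearrangeForCollective.

#include <gcl/CollectiveBalancedReorder.hpp>
#include <gcl/Collectives.hpp>

// Sketch: build a CBR for a reference tensor, gather a per-replica slice from
// all replicas, then undo the rearrangement to recover the reference order.
poplar::Tensor gatherWithCbr(poplar::Graph &graph, const poplar::Tensor &ref,
                             poplar::program::Sequence &prog,
                             unsigned replicationFactor) {
  gcl::CollectiveBalancedReorder cbr(graph, ref, replicationFactor, {"cbr"},
                                     /*allowElementMap=*/true);

  // Sized as the output of reduceScatter / the input of allGather. In a real
  // program the slice would first be written, for example by reduceScatter.
  poplar::Tensor slice = cbr.createReplicaSlice(ref.elementType());

  // Gather the slices from all replicas ...
  poplar::Tensor gathered = gcl::allGather(graph, slice, prog, {"cbr/gather"});

  // ... and map the result back to the reference tensor's shape and order.
  return cbr.undoRearrangeForCollective(gathered.flatten());
}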
Private Functions
-
void rearrange(const void *in, void *out, int64_t elemByteSize, bool refToGathered) const
Host tensor rearrangement routine.
Private Members
-
unsigned replicationFactor
-
poplar::TensorRearranger simplifier
-
CollectiveBalancedHostRearrangement hostRearrangement
-
const poplar::DebugNameAndId dnai