Detailed Description

Supported Option flags:

collectiveImplementation (Syncless, Syncful, Hybrid, Auto) [=Auto]

Type of collective implementation to use.
- auto: Choose the appropriate implementation for the operation in question.
- hybrid: Use syncful over IPU-Links and syncless over GW-Links. Deprecated: please use Hybrid instead.
- Syncless: Use the syncless implementation.
- Syncful: Use the syncful implementation.
- Hybrid: Use syncful over IPU-Links and syncless over GW-Links.

useSynclessCollectives (true, false, hybrid, auto) [=auto]

This option is deprecated; use the collectiveImplementation option instead.
- auto: Choose the appropriate implementation for the operation in question. At the moment this is the same as 'false'.
- true: Use the syncless implementation. Deprecated: please use Syncless instead.
- false: Use the syncful implementation. Deprecated: please use Syncful instead.
- hybrid: Use syncful over IPU-Links and syncless over GW-Links. Deprecated: please use Hybrid instead.
- Syncless: Use the syncless implementation.
- Syncful: Use the syncful implementation.
- Hybrid: Use syncful over IPU-Links and syncless over GW-Links.
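
The mapping from the deprecated values to their replacements can be sketched as a small lookup. This is a minimal illustration and not part of GCL: the function name `toCollectiveImplementation` is hypothetical, and mapping `auto` to `Auto` is an assumption based on the matching descriptions of the two options.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not a GCL function): translate a value of the
// deprecated useSynclessCollectives option into the corresponding
// collectiveImplementation value, following the list above.
std::string toCollectiveImplementation(const std::string &value) {
  if (value == "true" || value == "Syncless") return "Syncless";
  if (value == "false" || value == "Syncful") return "Syncful";
  if (value == "hybrid" || value == "Hybrid") return "Hybrid";
  // Assumption: 'auto' corresponds to 'Auto' in the newer option.
  if (value == "auto") return "Auto";
  throw std::invalid_argument("unknown useSynclessCollectives value: " + value);
}
```

A helper like this could be applied once when reading older configuration files, so that the rest of the code only deals with collectiveImplementation values.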

maxBytesPerTile Integer [=35000]

The maximum size of data and padding in the payload buffer to put on each IO tile. The maximum allowed value is 64000.
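
As a rough sketch of how the documented default and limit might be checked before the option is set, consider the helpers below. The names `checkedMaxBytesPerTile` and `minIoTiles` are hypothetical and not part of GCL. The tile count is only an arithmetic lower bound that follows from the per-tile limit; how GCL actually distributes the payload across IO tiles is not described here.

```cpp
#include <cstddef>
#include <stdexcept>

constexpr std::size_t kDefaultMaxBytesPerTile = 35000; // documented default
constexpr std::size_t kMaxBytesPerTileLimit = 64000;   // documented maximum

// Hypothetical helper: reject values above the documented maximum.
std::size_t checkedMaxBytesPerTile(std::size_t requested) {
  if (requested > kMaxBytesPerTileLimit)
    throw std::out_of_range("maxBytesPerTile must not exceed 64000");
  return requested;
}

// Hypothetical helper: if each IO tile holds at most maxBytesPerTile bytes,
// at least ceil(payloadBytes / maxBytesPerTile) tiles are needed.
std::size_t minIoTiles(std::size_t payloadBytes,
                       std::size_t maxBytesPerTile = kDefaultMaxBytesPerTile) {
  if (maxBytesPerTile == 0)
    throw std::invalid_argument("maxBytesPerTile must be positive");
  return (payloadBytes + maxBytesPerTile - 1) / maxBytesPerTile;
}
```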

topology (auto, rung-ring-2, rung-ring-4, rung-ring-8, ring-on-line, peripheral-ring) [=auto]

The topology to use for the syncful implementation. If you do not specify this option, the topology is detected automatically.
- auto: Topology automatically selected based on the current graph.
- rung-ring-2: Relevant for replica size 2. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.
- rung-ring-4: Relevant for replica size 4. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.
- rung-ring-8: Relevant for replica size 8. The traffic follows one of two physical rings, one on each side of the IPU-Link mesh, by moving straight up and assuming wrap-around at the top.
- ring-on-line: Relevant for replica size 2. The traffic follows one of two virtual rings, one on each side of the IPU-Link mesh.
- peripheral-ring: Relevant for replica size 1. The traffic follows a single ring on the periphery of the IPU-Link mesh.
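
The replica sizes listed above can be captured in a lookup table, for example to sanity-check an explicitly chosen topology against the replica size in use. This is a sketch only: `relevantReplicaSize` is a hypothetical name, and returning 0 for `auto` or unrecognised values is a convention chosen for this example.

```cpp
#include <map>
#include <string>

// Hypothetical helper: return the replica size for which a topology is
// documented as relevant, or 0 for 'auto' and unrecognised values.
unsigned relevantReplicaSize(const std::string &topology) {
  static const std::map<std::string, unsigned> sizes = {
      {"rung-ring-2", 2u},
      {"rung-ring-4", 4u},
      {"rung-ring-8", 8u},
      {"ring-on-line", 2u},
      {"peripheral-ring", 1u}};
  const auto it = sizes.find(topology);
  return it == sizes.end() ? 0u : it->second;
}
```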

link (auto-link, ipu-link, gw-link) [=auto-link]

The link type to use between IPUs.
- auto-link: Use the link type appropriate for the operation.
- ipu-link: Use the IPU-Links.
- gw-link: Use the GW-Links.

method (auto, broadcast, clockwise_ring, anticlockwise_ring, bidirectional_ring_pair, meet_in_middle_ring, quad_directional_ring) [=auto]

The method/topology to be used.
- auto: Automatically choose the best method.
- broadcast: Broadcast the tensor to all replicas and do the reduce locally. Faster for small tensors.
- clockwise_ring: Send fragments clockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.
- anticlockwise_ring: Send fragments anticlockwise around the ring. The number of fragments is equal to the number of IPUs in the ring.
- bidirectional_ring_pair: Split the data into two halves and use the clockwise ring algorithm on one half and the anticlockwise ring algorithm on the other in order to fully utilize the links in both directions. The number of fragments is equal to twice the number of IPUs in the ring.
- meet_in_middle_ring: Send half the fragments halfway around the ring in the clockwise direction and half the fragments halfway around the ring in the anticlockwise direction, meeting in the middle. The number of fragments is equal to the number of IPUs in the ring. The disadvantage compared to the "bidirectional_ring_pair" method is that the usage of available bandwidth is not quite optimal; in particular, the final step only uses the links in one direction (assuming an even number of IPUs). The advantage is that it requires fewer steps and allows the use of larger fragments.
- quad_directional_ring: Divide fragments in four and send each quarter around one of two rings using the mirrored and non-mirrored ring pattern.
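
The fragment counts stated for the ring methods above can be summarised in a few lines. This is a minimal sketch: `fragmentCount` is a hypothetical helper and not a GCL function, and it throws for methods whose fragment count is not given in the descriptions (auto, broadcast, quad_directional_ring).

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: number of fragments used by a ring method, as stated
// in the option descriptions, for a ring of ipusInRing IPUs.
unsigned fragmentCount(const std::string &method, unsigned ipusInRing) {
  if (method == "clockwise_ring" || method == "anticlockwise_ring" ||
      method == "meet_in_middle_ring")
    return ipusInRing; // one fragment per IPU in the ring
  if (method == "bidirectional_ring_pair")
    return 2 * ipusInRing; // each half of the data uses its own ring
  throw std::invalid_argument("fragment count not documented for: " + method);
}
```

For a ring of 4 IPUs this gives 4 fragments for meet_in_middle_ring and 8 for bidirectional_ring_pair, which is why the former can use larger fragments for the same tensor.
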

syncful.useOptimisedLayout (true, false) [=true]

If the input tensor has been allocated in a GCL-friendly way, reusing the same layout for the srcBuffer will minimise the code needed to copy fragments to the srcBuffer. Turning this off might reduce the cycle count at the cost of higher memory usage.

syncful.maxBroadcastSize Integer [=2048]

For small tensors, it is beneficial to broadcast the tensor to all replicas and do the reduce locally, so that the network latency cost is paid only once. However, memory use increases with larger group sizes and data volumes. This option sets the limit on group_size * numBytes beyond which broadcast AllReduce will not be used.
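
The threshold test described above amounts to a single comparison. The sketch below uses a hypothetical function name, `fitsBroadcastLimit`. It treats a product exactly equal to the limit as still eligible, which is an assumption based on the wording "beyond which"; the other criteria GCL may apply when choosing a method are not modelled.

```cpp
#include <cstddef>

// Hypothetical helper: true if group_size * numBytes does not exceed the
// syncful.maxBroadcastSize limit (documented default: 2048).
bool fitsBroadcastLimit(std::size_t groupSize, std::size_t numBytes,
                        std::size_t maxBroadcastSize = 2048) {
  return groupSize * numBytes <= maxBroadcastSize;
}
```

With the default limit, a group of 4 replicas stays within the limit for tensors of up to 512 bytes each.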

The documentation for this interface was generated from the following file:

include/gcl/Collectives.hpp