10. Distributed training
Distributed training is supported for IPU Pod systems. Here, the IPUs in a rack are interconnected by IPU-Links, and IPUs in different racks are interconnected by GW-Links. Distributed training uses these links to perform collective operations without host involvement. When multiple instances (host processes) are used, some communication over the host network may still be needed, for example to broadcast the initial values of variables from the first instance to the others.
To perform distributed training on Pod systems, use PopDistStrategy,
which performs data-parallel synchronous training using multiple host processes.
In this respect it is similar to MultiWorkerMirroredStrategy
provided in standard TensorFlow. Initial values are broadcast over the host
network using PopDist.
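The role of the initial broadcast can be illustrated with a small simulation in plain Python. This is only a sketch and does not use the Graphcore or TensorFlow APIs: the helper names (init_weight, broadcast_from_first, all_reduce_mean, train) are hypothetical, and the instances are modelled as entries in a list rather than as real host processes. It shows that averaging gradients keeps replicas identical only if they start from the same values.

```python
import random


def init_weight(seed):
    """Each simulated instance draws its own initial value (differs per seed)."""
    return random.Random(seed).uniform(-1.0, 1.0)


def broadcast_from_first(values):
    """Hypothetical stand-in for the host-network broadcast: copy instance 0's value."""
    return [values[0]] * len(values)


def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x on one data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Every instance receives the mean of all per-instance values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(weights, shards, steps=20, lr=0.05):
    """Synchronous data-parallel SGD: local gradients, averaged, same update everywhere."""
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = all_reduce_mean(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Toy data for y = 3x, split across two simulated instances.
data = [(float(x), 3.0 * x) for x in range(1, 5)]
shards = [data[0::2], data[1::2]]

initial = [init_weight(seed) for seed in (0, 1)]

# With the broadcast, both instances start equal and stay equal.
synced = train(broadcast_from_first(initial), shards)

# Without it, both apply the same averaged gradient, so the initial offset never closes.
unsynced = train(initial, shards)

print("synced:  ", synced)
print("unsynced:", unsynced)
```

In the unsynchronised run every instance applies the same averaged gradient, so the difference between their weights stays at its initial value and the replicas never hold the same model.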
Collective operations (invoked explicitly through a member function such as
reduce(), or implicitly by using an optimizer under the strategy scope) are
performed directly on the IPU by using compiled communications with the GCL
library over the IPU-Links and GW-Links. The
PopDistStrategy is designed for use with PopDist and PopRun.
Refer to the PopDist and PopRun User Guide for more details.
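The difference between an explicit and an implicit collective can be sketched with toy classes. These are hypothetical stand-ins written for illustration only: ToyStrategy, ToySGD and ReduceOp are not the Graphcore or TensorFlow classes, their signatures are assumptions, and the reduction runs on the host in plain Python rather than on the IPU.

```python
from enum import Enum


class ReduceOp(Enum):
    SUM = "sum"
    MEAN = "mean"


class ToyStrategy:
    """Hypothetical strategy object holding one value per replica in a list."""

    def __init__(self, num_replicas):
        self.num_replicas = num_replicas

    def reduce(self, op, per_replica_values):
        """Explicit collective: combine one value from each replica."""
        if len(per_replica_values) != self.num_replicas:
            raise ValueError("expected one value per replica")
        total = sum(per_replica_values)
        return total if op is ReduceOp.SUM else total / self.num_replicas


class ToySGD:
    """Hypothetical optimizer that reduces gradients across replicas internally."""

    def __init__(self, strategy, learning_rate):
        self.strategy = strategy
        self.learning_rate = learning_rate

    def apply_gradients(self, weight, per_replica_grads):
        # Implicit collective: the caller never asks for the reduction.
        grad = self.strategy.reduce(ReduceOp.MEAN, per_replica_grads)
        return weight - self.learning_rate * grad


strategy = ToyStrategy(num_replicas=4)

# Explicit: the user asks for the mean of a per-replica metric.
mean_loss = strategy.reduce(ReduceOp.MEAN, [0.5, 0.25, 0.75, 0.5])

# Implicit: the optimizer averages the gradients before updating the weight.
optimizer = ToySGD(strategy, learning_rate=0.5)
new_weight = optimizer.apply_gradients(1.0, [1.0, 2.0, 3.0, 2.0])

print(mean_loss, new_weight)
```

In both cases the same reduction takes place. The only difference is whether user code requests it directly or the optimizer requests it on the user's behalf.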
A distinction should be made between PopDistStrategy and the
IPUStrategy provided in TensorFlow 2. IPUStrategy targets
a single system with one or more IPUs attached, whereas PopDistStrategy
targets distributed Pod systems.
10.1. PopDistStrategy examples
There are examples for
PopDistStrategy in the Graphcore feature examples on GitHub.
10.2. Limitations and known issues
Replicated Tensor Sharding (RTS) is not supported for distributed training with