10. Distributed training

Distributed training is supported for IPU Pod systems. Here, the IPUs in a rack are interconnected by IPU-Links, and IPUs in different racks are interconnected by GW-Links. Distributed training uses these links to perform collective operations without host involvement. When using multiple instances (host processes), there may however still be a need for communication over the host network, for example for broadcasting the initial values of variables from the first instance to the others.

To perform distributed training on Pod systems, use PopDistStrategy, which performs data-parallel synchronous training using multiple host processes. In this sense it is similar to MultiWorkerMirroredStrategy provided in standard TensorFlow. Initial values are broadcast over the host network using Horovod.

Collective operations (explicitly through a member function like reduce() or implicitly by using an optimizer under the strategy scope) will be performed directly on the IPU by using compiled communications with the GCL library over the IPU-Links and GW-Links. The PopDistStrategy is designed for use with PopDist and PopRun. Refer to the PopDist and PopRun User Guide for more details.

A distinction should be made between the PopDistStrategy and the IPUStrategy provided in TensorFlow 2. The IPUStrategy targets a single system with one or more IPUs attached, whereas PopDistStrategy targets distributed Pod systems. Note that the use of ipu_compiler.compile() is still required to ensure a single XLA graph is compiled, except when using IPUEstimator or IPUPipelineEstimator which already use it internally.

10.1. PopDistStrategy examples

There are examples for PopDistStrategy in the Graphcore feature examples on GitHub.


From Poplar SDK 3.1, TensorFlow 1 will only be supported in CentOS 7. In addition, Examples and Tutorials for TensorFlow 1 are only available up to version 3.0 of the SDK. There has been limited testing of the 3.0 versions of the TensorFlow 1 tutorials and examples with Poplar SDK 3.1.