10. Distributed training

Distributed training is supported for IPU Pod systems. Here, the IPUs in a rack are interconnected by IPU-Links, and IPUs in different racks are interconnected by GW-Links. Distributed training uses these links to perform collective operations without host involvement. When using multiple instances (host processes), there may however still be a need for communication over the host network, for example for broadcasting the initial values of variables from the first instance to the others.

To perform distributed training on Pod systems, use PopDistStrategy, which performs data-parallel synchronous training using multiple host processes. In this sense it is similar to MultiWorkerMirroredStrategy provided in standard TensorFlow. Initial values are broadcast over the host network using PopDist.

Collective operations (explicitly through a member function like reduce() or implicitly by using an optimizer under the strategy scope) will be performed directly on the IPU by using compiled communications with the GCL library over the IPU-Links and GW-Links. The PopDistStrategy is designed for use with PopDist and PopRun. Refer to the PopDist and PopRun User Guide for more details.

A distinction should be made between the PopDistStrategy and the IPUStrategy provided in TensorFlow 2. The IPUStrategy targets a single system with one or more IPUs attached, whereas PopDistStrategy targets distributed Pod systems.

10.1. PopDistStrategy examples

There are examples for PopDistStrategy in the Graphcore feature examples on GitHub.

10.2. Limitations and known issues

  • Replicated Tensor Sharding (RTS) is not supported for distributed training with PopDistStrategy.