4. Poplar distributed configuration library (PopDist)
This section provides information on how you can use the PopDist library to make the appropriate changes to your application so that it can be launched in a distributed environment using PopRun. In short, PopDist provides an API which you can use to write a distributed application. As the examples (Section 4.4, PopDist examples) will demonstrate, very few modifications are needed in order to prepare an application for distributed execution.
PopRun uses mpirun for multiple instance creation, while PopDist uses MPI for communication between multiple hosts. This is not to be confused with communication between IPUs, which is realised using the Graphcore Communication Library (GCL) over either the IPU-Links or the GW-Links.
There are several packages that enable MPI communication in Python applications. We recommend Horovod due to its simplicity. More details about the Horovod package can be found in the Horovod documentation.
For MPI communication in C++ applications we recommend using PopDist directly. More information about backend initialisation and the implemented collectives is available in Section 4.2, PopDist collectives and Section 5, PopDist C++ API reference.
Note
PopTorch, TensorFlow and PopART frameworks use the C++ collectives implemented in PopDist. As a result, the PopDist backend needs to be initialised in Python applications when using any of these frameworks. More information can be found in Section 4.2, PopDist collectives.
4.1. Horovod installation
The implementation of Horovod that you install depends on your machine learning framework:
PyTorch: If you want to use Horovod, you must install the official Horovod package.
TensorFlow 1: You do not need to install Horovod separately because it is bundled in the Graphcore TensorFlow wheel.
TensorFlow 2: If you want to use Horovod, you must install the official Horovod package.
PopART: PopART does not support Horovod, so you will need to use PopDist or MPI directly.
Note
You must install and enable the Poplar SDK following the instructions in the getting started guide for your IPU system. Do this before installing Horovod to ensure that it uses the OpenMPI version that comes with the SDK.
Note
It is recommended to create a separate Python virtual environment for each framework before installing any packages.
4.1.1. PyTorch
If you are using PyTorch, you must install the official Horovod package if you plan to use MPI for host-side communication. Make sure you have activated your PyTorch virtual environment and installed PopTorch. For instructions on how to install PopTorch, refer to the PyTorch Quick Start.
You can install the official Horovod package with the command:
$ pip install horovod
You can validate that Horovod has been installed correctly by using the code snippet in Listing 4.1.
# Copyright (c) 2021 Graphcore Ltd. All rights reserved.

import torch
import poptorch
import popdist
import horovod.torch as hvd

hvd.init()
4.1.2. TensorFlow 2
If you are using TensorFlow 2, you must install the official Horovod package if you plan to use MPI for host-side communication. Make sure you have activated your virtual environment and installed the TensorFlow 2 wheel. For installation instructions, refer to the TensorFlow 2 Quick Start.
Once you have installed the TensorFlow wheel, you can validate that Horovod has been installed correctly by using the code snippet in Listing 4.2.
# Copyright (c) 2021 Graphcore Ltd. All rights reserved.

import popdist
import popdist.tensorflow
from tensorflow.python.ipu import horovod as hvd

hvd.init()
4.1.3. TensorFlow 1
Horovod is bundled in the Graphcore TensorFlow wheel and so you do not have to install it separately. Once you have installed the TensorFlow wheel, you can validate that Horovod has been installed correctly by using the code snippet in Listing 4.2.
4.1.4. PopART
If you are using PopART, then you must install the Graphcore implementation of Horovod.
$ pip install <sdk_path>/horovod.x.y.z.whl
Note
Since PopART has to use the version of Horovod that is bundled with the Poplar SDK, there is no need to validate the installation.
4.2. PopDist collectives
PopDist exposes its own set of C++ functions for MPI communication. Before using them, you need to register an MPI-based backend and initialise it:
#include <popdist/backend.hpp>

// Register the default MPI-based backend and initialise it.
popdist::registerDefaultBackend();
popdist::initializeBackend();

// Synchronise all instances at this point.
popdist::synchronize();

// Release backend resources once communication is complete.
popdist::finalizeBackend();
PopTorch, TensorFlow and PopART use the C++ collectives implemented in PopDist under the hood, so the PopDist backend also needs to be initialised in Python applications:
import popdist
popdist.init()
popdist.synchronize()
Alternatively, you can set the environment variable POPDIST_AUTO_INITIALIZE=1 to instruct PopDist to perform the initialisation automatically. In both cases, the backend is finalised automatically before the program finishes.
Documentation of the collective functions is available in the C++ API.
4.2.1. Running code on a subset of instances
Since frameworks perform collectives and synchronisation under the hood, it is important to let PopDist know which parts of the program are only meant to run on a subset of instances. This can be achieved using the execute_on_instances function.
import popdist

def validate():
    # Synchronisation only happens across 'participants'.
    popdist.synchronize()
    return evaluate(model)  # placeholder: run the actual validation logic here

# IDs of the instances that will perform the validation.
participants = {0, 1}
result = popdist.execute_on_instances(participants, validate)
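To illustrate the selection logic, here is a simplified, pure-Python mirror of what execute_on_instances does. The function name execute_on_subset is hypothetical; the real popdist.execute_on_instances additionally scopes any synchronisation inside the callable to the participating instances.

```python
def execute_on_subset(participants, fn, instance_index):
    # Only instances whose index is in `participants` run `fn`;
    # every other instance skips it and returns None.
    if instance_index in participants:
        return fn()
    return None

# Instance 0 participates, instance 2 does not.
print(execute_on_subset({0, 1}, lambda: "validated", instance_index=0))  # validated
print(execute_on_subset({0, 1}, lambda: "validated", instance_index=2))  # None
```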
4.3. PopDist API
PopDist contains some universal functions which are useful for all frameworks. They can be used to change the application's behaviour dynamically, based on the arguments provided to PopRun. Some examples of this are:
Evaluating the global batch size
Sharding your dataset
Setting the size of buffer to use for shuffling
See Section 5, PopDist C++ API reference and Section 6, PopDist Python API reference for more information.
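As a sketch of the first use case: the global batch size is the per-replica batch size multiplied by the total number of replicas. At runtime the replica count would come from popdist.getNumTotalReplicas(); it is stubbed as a plain argument here so the sketch runs anywhere.

```python
def global_batch_size(replica_batch_size, num_total_replicas):
    # At runtime, num_total_replicas would be obtained from
    # popdist.getNumTotalReplicas(); here it is a plain argument.
    return replica_batch_size * num_total_replicas

# With 4 replicas and a per-replica batch size of 16:
print(global_batch_size(16, 4))  # 64
```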
4.4. PopDist examples
In this section we will detail how you can use PopDist to distribute your application. Examples for PyTorch and TensorFlow are shown.
4.4.1. PyTorch
The code example below outlines the most common steps involved in adding PopDist to an application. For the sake of brevity, the example code shown below assumes that the application is launched using PopRun like this:
$ poprun --vipu-partition=MyPartition --vipu-server-host=127.0.0.1 \
      --num-replicas=4 --num-instances=1 -v python main.py
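As a hedged sketch of the arithmetic involved: PopRun divides the requested replicas evenly across the instances, so each instance ends up managing num_replicas / num_instances local replicas (the value reported by popdist.getNumLocalReplicas() in the example below). The helper name is hypothetical.

```python
def num_local_replicas(num_replicas, num_instances):
    # The total replica count must divide evenly across the instances;
    # each instance then manages the quotient.
    assert num_replicas % num_instances == 0, \
        "--num-replicas must be divisible by --num-instances"
    return num_replicas // num_instances

# The command above: --num-replicas=4 --num-instances=1
print(num_local_replicas(4, 1))  # 4
```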
# Copyright (c) 2021 Graphcore Ltd. All rights reserved.

import torch
import argparse
import poptorch
import popdist
import popdist.poptorch
import horovod.torch as hvd


def command_line_arguments():
    parser = argparse.ArgumentParser(description='PopDist PyTorch example')
    parser.add_argument(
        '--replicas', type=int, default=1, help='Number of replicas')
    return parser.parse_args()


def init_popdist(args):
    popdist.init()
    hvd.init()
    args.use_popdist = True
    if popdist.getNumTotalReplicas() != args.replicas:
        print(f"The number of replicas is overridden by PopRun. "
              f"The new value is {popdist.getNumTotalReplicas()}.")
    args.replicas = int(popdist.getNumLocalReplicas())

    args.popdist_rank = popdist.getInstanceIndex()
    args.popdist_size = popdist.getNumInstances()


def create_model(opts):
    if opts.use_popdist:
        model_opts = popdist.poptorch.Options()
    else:
        model_opts = poptorch.Options()

    return model_opts


if __name__ == '__main__':

    opts = command_line_arguments()

    # Initialise PopDist
    if popdist.isPopdistEnvSet():
        init_popdist(opts)
    else:
        opts.use_popdist = False

    create_model(opts)
The popdist and popdist.poptorch packages are imported at the top of the listing, along with Horovod (horovod.torch). Next, the PopRun command line parameters are parsed and passed to init_popdist, which initialises both Horovod and PopDist.
Note
The use of Horovod is optional; it is not required in order to initialise PopDist.
Some of the applications in our GitHub examples repository, such as the PyTorch CNNs training application, make use of PopDist and PopRun.
4.4.2. TensorFlow 1
When using PopDist with TensorFlow 1, there is an additional class and function that you need to use that are not part of the PopDist API:
PopDistStrategy: This is a context manager for targeting the IPU in a distributed manner.

tf.data.Dataset.shard(): This allows you to split your dataset between instances using PopDist API methods. This is done as follows:
dataset = dataset.shard(num_shards=popdist.getNumInstances(), index=popdist.getInstanceIndex())
The data is automatically split between replicas within each instance, so no further manual splitting is required.
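To show how sharding partitions the data, here is a small pure-Python mirror of the semantics of the Dataset.shard() call above: the instance with index index keeps every num_shards-th sample, starting at its own index. This is a sketch of the behaviour, not the TensorFlow implementation.

```python
def shard(samples, num_shards, index):
    # Instance `index` keeps samples index, index + num_shards,
    # index + 2 * num_shards, ... mirroring Dataset.shard(num_shards, index).
    return samples[index::num_shards]

data = list(range(8))
print(shard(data, num_shards=2, index=0))  # [0, 2, 4, 6]
print(shard(data, num_shards=2, index=1))  # [1, 3, 5, 7]
```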
See the TensorFlow 1 PopDist feature example.
Our TensorFlow 1 CNNs training application also makes use of PopDist and PopRun.
4.4.3. TensorFlow 2
When using PopDist with TensorFlow 2, there is an additional class and function that you need to use that are not part of the PopDist API:
PopDistStrategy: This is a new context manager for targeting the IPU in a distributed manner.

tf.data.Dataset.shard(): This allows you to split your dataset between instances using PopDist API methods. This is done as follows:
dataset = dataset.shard(num_shards=popdist.getNumInstances(), index=popdist.getInstanceIndex())
The data is automatically split between replicas within each instance, so no further manual splitting is required.
See the TensorFlow 2 PopDist feature example.
Note
When using Keras methods such as tf.keras.Model.fit() in a multi-instance session, the progress bar that summarises the metrics and loss for each epoch is a summary across all replicas. Do not be alarmed that the logs of each instance claim to process the whole epoch: each instance only processes its respective dataset shard.
Note
When using tf.keras.Model() to define your model, the batch_size argument for tf.keras.Input() expects the global batch size. You do not need to explicitly provide this parameter, but if you do, be aware of this behaviour.
4.5. Conclusion
A PopDist application can be organised into the following parts:
1. Parsing phase: the PopRun runtime command line parameters are parsed and handled.
2. Initialisation phase: essential values such as rank and size are obtained through PopDist API calls. If you are using Horovod, it also needs to be initialised here.
3. Digestion phase: the PopDist values are actively used to make the application distributed.
The steps above are just rough outlines and may vary from application to application.
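The three phases can be sketched as follows. This is a structural outline only: the PopDist calls are stubbed as plain arguments so the sketch runs anywhere; in a real application they would be popdist.isPopdistEnvSet(), popdist.init(), popdist.getNumLocalReplicas() and so on, as in the PyTorch example above.

```python
import argparse

def parse_phase(argv):
    # 1. Parsing phase: handle the command line parameters.
    parser = argparse.ArgumentParser()
    parser.add_argument('--replicas', type=int, default=1)
    return parser.parse_args(argv)

def init_phase(opts, popdist_local_replicas=None):
    # 2. Initialisation phase: store values such as rank and size.
    # In a real application these come from PopDist (and Horovod, if used);
    # here the local replica count is a stubbed argument.
    opts.use_popdist = popdist_local_replicas is not None
    if opts.use_popdist:
        opts.replicas = popdist_local_replicas
    return opts

def digest_phase(opts):
    # 3. Digestion phase: the stored values drive distributed behaviour.
    return f"configuring {opts.replicas} local replica(s)"

opts = parse_phase(['--replicas', '2'])
opts = init_phase(opts, popdist_local_replicas=4)
print(digest_phase(opts))  # configuring 4 local replica(s)
```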