4. Poplar distributed configuration library (PopDist)

This section provides information on how you can use the PopDist library to make the appropriate changes to your application so that it can be launched in a distributed environment using PopRun. In short, PopDist provides a set of APIs which you can use to write a distributed application. As the examples later in this guide will demonstrate, very few modifications are needed in order to prepare an application for distributed execution.

As mentioned in the introduction, PopRun uses mpirun to create multiple instances, while PopDist uses MPI for communication between hosts. This is not to be confused with communication between IPUs, which is realised using the Graphcore Communication Library (GCL) over either the IPU-Links or the GW-Links.

There are several packages that enable MPI communication in Python applications; we recommend Horovod for its simplicity. More details can be found in the Horovod documentation.

4.1. Installation

If your machine learning framework of choice is PopART, Graphcore’s implementation of Horovod is recommended. The Poplar SDK ships with a wheel for Graphcore Horovod, which can be installed using pip. The Graphcore TensorFlow wheel comes bundled with Horovod, so no additional installation is needed.

Please install and enable the Poplar SDK following the instructions in the Getting Started guide for your IPU system. Do this before installing Horovod to ensure that it uses the OpenMPI version that comes with the SDK.

Horovod can then be installed, with your Python virtual environment of choice activated:

$ pip install <sdk_path>/horovod.x.y.z.whl

PyTorch users, on the other hand, must install the official Horovod package if they plan to use MPI for host-side communication. This can be done using pip:

$ pip install horovod

PyTorch is compatible with the latest version of Horovod. Be sure to have your virtual Python environment of choice activated first.

4.1.1. Validating Horovod

If you use TensorFlow, you can run the following code snippet to verify that Horovod is installed and working correctly.

Listing 4.1 TensorFlow Horovod
# Copyright (c) 2021 Graphcore Ltd. All rights reserved.

import popdist
import popdist.tensorflow
from tensorflow.python.ipu import horovod as hvd

hvd.init()

If you are using PyTorch, you can run the following code snippet to verify that Horovod is installed correctly.

Listing 4.2 PyTorch Horovod
# Copyright (c) 2021 Graphcore Ltd. All rights reserved.

import torch
import poptorch
import popdist
import horovod.torch as hvd

hvd.init()

4.2. The PopDist API

PopDist contains universal functions that are useful in all frameworks. Based on the arguments provided to PopRun, they can be used to dynamically perform tasks such as:

  • Evaluating the global batch size

  • Sharding your dataset

  • Setting the size of the buffer to use for shuffling
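As a sketch of how these values are typically used, the snippet below computes a global batch size and shards a dataset across instances. The plain `instance_index` and `num_instances` parameters stand in for the values PopDist would provide (for example via `popdist.getInstanceIndex()` and `popdist.getNumInstances()`); the helper names are illustrative, not part of the PopDist API.

```python
# Illustrative helpers; instance_index/num_instances stand in for the
# values PopDist derives from the PopRun arguments.

def global_batch_size(micro_batch_size, gradient_accumulation, num_replicas):
    # The global batch size scales with the total number of replicas.
    return micro_batch_size * gradient_accumulation * num_replicas

def shard_dataset(dataset, instance_index, num_instances):
    # Each instance takes every num_instances-th sample, offset by its
    # own index, so no two instances see overlapping data.
    return dataset[instance_index::num_instances]

samples = list(range(8))
print(global_batch_size(4, 2, 4))    # 32 with 4 replicas
print(shard_dataset(samples, 0, 2))  # [0, 2, 4, 6]
print(shard_dataset(samples, 1, 2))  # [1, 3, 5, 7]
```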

4.2.1. PopART

popdist.popart.configureSessionOptions(opts)

Configure PopART session options to work with the PopDist context.

Parameters

opts (popart.SessionOptions) – The session options to configure.

Return type

None

popdist.popart.getDevice(ipusPerReplica)

Get a PopART device that works with the PopDist context.

Parameters

ipusPerReplica (int) – The number of IPUs per replica.

Returns

An attached device.

Return type

popart.DeviceInfo

4.2.2. PopTorch

class popdist.poptorch.Options(*args, **kwargs)

An extension to PopTorch’s Options class so that it is easier to pass application-specific options to PopDist.

4.2.3. TensorFlow

popdist.tensorflow.set_ipu_config(config, ipus_per_replica, configure_device=True)

Set the PopDist configuration options for TensorFlow.

Parameters
  • config – An IPUConfig instance (created with tensorflow.python.ipu.config.IPUConfig()), or an IpuOptions configuration protobuf (created with the deprecated tensorflow.python.ipu.utils.create_ipu_config()), to update.

  • ipus_per_replica (int) – The number of IPUs per replica.

  • configure_device (bool) – Whether to update config to select the IPU device for PopDist execution.

Returns

The passed config.

4.3. PopDist examples

In this section we will detail how you can use PopDist to distribute your application. Examples for PyTorch and TensorFlow are shown.

4.3.1. PyTorch

The code example below outlines the most common steps involved in adding PopDist to an application. For the sake of brevity, the example code shown below assumes that the application is launched using PopRun like this:

$ poprun --vipu-partition=MyPartition --vipu-server-host=127.0.0.1
         --num-replicas=4 --num-instances=1 -v python main.py
Listing 4.3 Simple PopDist Example with PyTorch
# Copyright (c) 2021 Graphcore Ltd. All rights reserved.

import torch
import poptorch
import popdist
import popdist.poptorch
import horovod.torch as hvd


def init_popdist(args):
    hvd.init()
    args.use_popdist = True
    if popdist.getNumTotalReplicas() != args.replicas:
        print(f"The number of replicas is overridden by PopRun. "
              f"The new value is {popdist.getNumTotalReplicas()}.")
    args.replicas = int(popdist.getNumLocalReplicas())

    args.popdist_rank = popdist.getInstanceIndex()
    args.popdist_size = popdist.getNumInstances()


def create_model(opts):
    if opts.use_popdist:
        model_opts = popdist.poptorch.Options()
    else:
        model_opts = poptorch.Options()

    return model_opts


if __name__ == '__main__':

    # Application-specific command line parsing (defined elsewhere)
    opts = command_line_arguments()

    # Initialise PopDist
    if popdist.isPopdistEnvSet():
        init_popdist(opts)
    else:
        opts.use_popdist = False

    create_model(opts)

The PopTorch PopDist package is imported on lines 5 and 6, while Horovod is imported on line 7. Next, the application's command line arguments are parsed and passed to init_popdist, which initialises both Horovod and PopDist.
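For the PopRun invocation shown above (--num-replicas=4 --num-instances=1), the single instance manages all four replicas. A minimal sketch of the arithmetic behind popdist.getNumLocalReplicas() follows; the function is a hypothetical stand-in for illustration, not the real implementation:

```python
def num_local_replicas(num_total_replicas, num_instances):
    # PopRun requires the total replica count to divide evenly across
    # instances; each instance then manages an equal share.
    assert num_total_replicas % num_instances == 0
    return num_total_replicas // num_instances

print(num_local_replicas(4, 1))  # 4: one instance manages all replicas
print(num_local_replicas(4, 2))  # 2 replicas per instance
```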

Note

The use of Horovod is optional. It is not a requirement to initialise PopDist.

Some of the applications in our GitHub examples repository, such as our PyTorch CNNs training application, make use of PopDist and PopRun.

4.3.2. TensorFlow

See the TensorFlow PopDist feature example.

Our TensorFlow CNNs training application also makes use of PopDist and PopRun.

4.4. Conclusion

A PopDist application can be organised into the following parts:

1. Parsing phase: where PopRun runtime command line parameters are parsed and handled.

2. Initialisation phase: where essential variables such as rank and size are obtained through PopDist API calls. If MPI is to be used between the various instances, Horovod is also initialised here.

3. Digestion phase: where PopDist variables are actively used to make the application distributed.

The steps above are just rough outlines and may vary from application to application.
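The three phases can be sketched as a skeleton like the one below. All names here are illustrative stand-ins: the environment-variable check substitutes for popdist.isPopdistEnvSet(), and the fixed rank and size values substitute for the real popdist.getInstanceIndex() and popdist.getNumInstances() calls.

```python
import argparse
import os

def parse_args():
    # Parsing phase: handle the application's command line parameters.
    parser = argparse.ArgumentParser()
    parser.add_argument("--replicas", type=int, default=1)
    return parser.parse_args([])

def initialise(args):
    # Initialisation phase: store rank and size. In a real application
    # these come from PopDist API calls; here we use single-instance
    # defaults, and a stand-in check for a distributed environment.
    args.use_popdist = "HYPOTHETICAL_POPDIST_VAR" in os.environ
    args.rank = 0
    args.size = 1
    return args

def run(args):
    # Digestion phase: use the stored variables to distribute the work.
    return f"instance {args.rank} of {args.size}, distributed={args.use_popdist}"

args = initialise(parse_args())
print(run(args))
```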