13. Replication

This chapter describes how to use replication in PopXL.

13.1. Graph replication

PopXL has the ability to run multiple copies of your model in parallel. This is called graph replication. Replication is a means of parallelising your inference or training workloads. We call each instance of the graph a replica. The replication factor is the number of replicas in total across all replica groups (see Section 13.2, Replica grouping).

This can be set through :py:attr:`~popxl.Ir.replication_factor`.
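
For example, a minimal sketch (using the popxl.Ir constructor in the same way as the listings later in this chapter):

import popxl

# Create an IR that runs four identical copies (replicas) of the model in parallel.
ir = popxl.Ir(replication=4)

assert ir.replication_factor == 4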

13.2. Replica grouping

In PopXL you have the ability to define a grouping of replicas when you create variables. This grouping is used when you initialise or read the variable. Typically, variables are initialised and read on a per-group basis. The default behaviour is that all replicas belong to a single group.

The grouping in question is defined by a ReplicaGrouping object, created with replica_grouping(). A ReplicaGrouping is initialised with a group_size and a stride.

The group_size parameter sets the number of replicas to be grouped together, and the stride parameter sets the replica index difference between two members of a group.
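
For example, the following sketch creates the grouping used in Table 13.2; it assumes only the ir.replica_grouping() call described above and the group_size and stride properties of the returned ReplicaGrouping:

import popxl

ir = popxl.Ir(replication=16)

# Each group contains 4 replicas spaced 4 apart, giving the groups shown in
# Table 13.2: (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14) and (3, 7, 11, 15).
rg = ir.replica_grouping(stride=4, group_size=4)

print(rg.group_size, rg.stride)  # 4 4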

Warning

Limitations:

Table 13.1, Table 13.2 and Table 13.3 show some of the different ways in which group_size and stride partition the replicas into groups.

Table 13.1 Replication factor 16, group_size = 4, and stride = 1

Group    Replicas
0        0, 1, 2, 3
1        4, 5, 6, 7
2        8, 9, 10, 11
3        12, 13, 14, 15

Table 13.2 Replication factor 16, group_size = 4, and stride = 4

Group    Replicas
0        0, 4, 8, 12
1        1, 5, 9, 13
2        2, 6, 10, 14
3        3, 7, 11, 15

Table 13.3 Replication factor 16, group_size = 1, and stride = 1

Group    Replicas
0        0
1        1
2        2
...      ...
14       14
15       15
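
To make the rule concrete, the following plain-Python sketch reproduces the group assignments of Table 13.2. It illustrates the pattern shown in the tables above; it is not the PopXL implementation itself.

replication_factor = 16
group_size = 4
stride = 4

num_groups = replication_factor // group_size
groups = []
for group in range(num_groups):
    # Each block of group_size * stride consecutive replicas is shared by
    # `stride` interleaved groups.
    first = (group // stride) * (group_size * stride) + (group % stride)
    groups.append([first + i * stride for i in range(group_size)])

print(groups)
# [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]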

13.3. Code examples

Listing 13.1 shows a simple example of initialising variables with a few different groupings.

Listing 13.1 Example of setting up different variables.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import popxl
import numpy as np

replication_factor = 8
ir = popxl.Ir(replication=replication_factor)

with ir.main_graph:

    base_shape = [3, 3]

    # Create a tensor with default settings, that is: load the same value to all replicas.
    tensor_1 = popxl.variable(np.ndarray(base_shape))

    # Create a tensor with one variable on each of the replicas:
    tensor_2 = popxl.variable(
        np.ndarray([replication_factor] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=1),
    )

    # Create a tensor where the replicas are grouped together in pairs (0 and 1, 2 and 3, ...):
    group_size = 2
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
    )

    # Create a tensor where each replica is grouped with an orthogonal replica
    # (stride 4): replicas 0 and 4, 1 and 5, 2 and 6, 3 and 7 are grouped together.
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=4),
    )

Listing 13.2 shows an example of using a replica grouping on a remote variable. The IR has two replicas, and each is its own group.

Listing 13.2 Example of setting up different remote variables.
import numpy as np

import popxl
import popxl.ops as ops
from popxl import dtypes

ir = popxl.Ir(replication=2)

num_groups = 2
v_h = np.arange(0, num_groups * 32).reshape((num_groups, 32))

rg = ir.replica_grouping(group_size=ir.replication_factor // num_groups)

with ir.main_graph, popxl.in_sequence():
    remote_buffer = popxl.remote_buffer((32,), dtypes.int32)
    remote_v = popxl.remote_variable(v_h, remote_buffer, replica_grouping=rg)

    v = ops.remote_load(remote_buffer, 0)

    v += 1

    ops.remote_store(remote_buffer, 0, v)

There are a couple of specifics to note here. Firstly, you need the in_sequence() context because there is no data-flow dependency between the in-place add and the remote_store op on the same tensor. Secondly, we manually pass the correct per-replica shape to popxl.remote_buffer; this shape does not include the group dimension.

Note

If you consider v_h to be the data for a single variable, this is akin to sharding the variable over two replicas. In fact, unless you need to AllGather your shards and cannot forgo the CBR optimisation, it is advisable to use replica groupings as shown here to achieve sharding, because this API is much less brittle with respect to what you can do without errors or undefined behaviour.

13.4. Retrieval modes

By default, only one value per group is returned when a variable is read. Usually this is sufficient, as all replicas within a group should hold identical values. However, if you wish to return the values from all replicas in a group (for example, to check that the grouped replicas really are the same), set the retrieval_mode parameter to "all_replicas" when constructing your variable:

Listing 13.3 Example of setting up variables with all_replicas retrieval mode.
    # (Continuing inside the `with ir.main_graph:` block of Listing 13.1.)

    # Create a tensor which is grouped across sequential replicas (0 and 1, 2 and 3, ...) and
    # which returns all the group's values when read. The returned array will be of shape
    # [replication_factor] + base_shape.
    group_size = 2
    tensor_4 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
        retrieval_mode="all_replicas",
    )

    # Create a tensor which is grouped across orthogonal replicas (0 and 2, 1 and 3, ...)
    # and which returns all the group's values when read. The returned array will be of shape
    # [replication_factor] + base_shape.
    tensor_5 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=2, group_size=2),
        retrieval_mode="all_replicas",
    )
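
As a rough sketch of how these settings affect reading the variables back, continuing from Listings 13.1 and 13.3 (this assumes the popxl.Session API with get_tensor_data(), which is not covered in this chapter):

with popxl.Session(ir, "ipu_model") as session:
    # tensor_3 uses the default retrieval mode, so one value per group is
    # returned: shape [replication_factor // group_size] + base_shape.
    per_group = session.get_tensor_data(tensor_3)

    # tensor_4 was created with retrieval_mode="all_replicas", so one value per
    # replica is returned: shape [replication_factor] + base_shape.
    per_replica = session.get_tensor_data(tensor_4)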