13. Replication

This chapter describes how to use replication in PopXL.

13.1. Graph replication

PopXL has the ability to run multiple copies of your model in parallel. This is called graph replication. Replication is a means of parallelising your inference or training workloads. We call each instance of the graph a replica. The replication factor is the number of replicas in total across all replica groups (see Section 13.2, Replica grouping).

This can be set through :py:attr:`~popxl.Ir.replication_factor`.
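
For example, a minimal sketch (assuming nothing beyond the popxl package) of requesting four replicas:

import popxl

ir = popxl.Ir()
ir.replication_factor = 4  # run four replicas of the model in parallel

Listing 13.1 later in this chapter uses the popxl.Ir(replication=...) constructor argument to the same effect.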

13.2. Replica grouping

In PopXL, you can define a grouping of replicas when you create variables. This grouping is used when you initialise or read the variable; typically, variables are initialised and read on a per-group basis. The default behaviour is that all replicas belong to a single group.

The grouping is defined by a ReplicaGrouping object, instantiated with replica_grouping(). A ReplicaGrouping is initialised with a group_size and a stride.

The group_size parameter sets the number of replicas in each group, and the stride parameter sets the difference in replica index between two consecutive members of a group.
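
For example, a grouping matching Table 13.2 below could be created as follows (a minimal sketch, assuming an Ir named ir with a replication factor of 16):

# Assumes `ir` is a popxl.Ir with a replication factor of 16.
# This groups the replicas as {0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}.
grouping = ir.replica_grouping(stride=4, group_size=4)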

Warning

Limitations:

Table 13.1, Table 13.2 and Table 13.3 show some of the different ways in which group_size and stride partition the replicas into groups.

Table 13.1 Replication factor 16, group_size = 4, and stride = 1

Group    Replicas
0        0, 1, 2, 3
1        4, 5, 6, 7
2        8, 9, 10, 11
3        12, 13, 14, 15

Table 13.2 Replication factor 16, group_size = 4, and stride = 4

Group    Replicas
0        0, 4, 8, 12
1        1, 5, 9, 13
2        2, 6, 10, 14
3        3, 7, 11, 15

Table 13.3 Replication factor 16, group_size = 1, and stride = 1

Group    Replicas
0        0
1        1
2        2
…        …
14       14
15       15
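
To make the interaction of the two parameters concrete, the following standalone sketch (plain Python, not part of the PopXL API, and assuming groups tile in blocks of group_size * stride replicas, as in the tables above) reproduces those group assignments:

def group_assignments(replication_factor, group_size, stride):
    """Return, for each group, the list of replica indices it contains."""
    block = group_size * stride  # replicas spanned by one set of interleaved groups
    groups = [[] for _ in range(replication_factor // group_size)]
    for replica in range(replication_factor):
        group = (replica // block) * stride + (replica % stride)
        groups[group].append(replica)
    return groups

# Table 13.2: prints [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
print(group_assignments(16, group_size=4, stride=4))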

13.3. Code examples

Listing 13.1 shows a simple example of initialising a few different groupings.

Listing 13.1 Example of setting up different variables.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import popxl
import numpy as np

replication_factor = 8
ir = popxl.Ir(replication=replication_factor)

with ir.main_graph:

    base_shape = [3, 3]

    # Create a tensor with default settings, that is: load the same value to all replicas.
    tensor_1 = popxl.variable(np.ndarray(base_shape))

    # Create a tensor with one variable on each of the replicas:
    tensor_2 = popxl.variable(
        np.ndarray([replication_factor] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=1),
    )

    # Create a tensor where the replicas are grouped together in pairs:
    group_size = 2
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
    )

    # Create a tensor where each replica is grouped with an orthogonal replica
    # (stride 4, so replicas 0 and 4, 1 and 5, and so on, share a group):
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=4),
    )

13.4. Retrieval modes

By default, only one replica per group is returned. Usually this is sufficient, as all replicas within a group should be identical. However, if you wish to return the values from all replicas in a group (for example, to check that the grouped replicas are indeed identical), set the retrieval_mode parameter to "all_replicas" when constructing your variable:

Listing 13.2 Example of setting up variables with all_replicas retrieval mode.
    # Continues inside the `with ir.main_graph:` block from Listing 13.1.

    # Create a tensor which is grouped across sequential replicas (0 and 1, 2 and 3, and so on)
    # and which returns all of each group's variables when requested. The returned array will
    # be of shape [replication_factor] + base_shape.
    group_size = 2
    tensor_4 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
        retrieval_mode="all_replicas",
    )

    # Create a tensor which is grouped across orthogonal replicas (0 and 2, 1 and 3, and so on)
    # and which returns all of each group's variables when requested. The returned array will
    # be of shape [replication_factor] + base_shape.
    tensor_5 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=2, group_size=2),
        retrieval_mode="all_replicas",
    )