13. Replication
This chapter describes how to use replication in PopXL.
13.1. Graph replication
PopXL has the ability to run multiple copies of your model in parallel. This is called graph replication. Replication is a means of parallelising your inference or training workloads. We call each instance of the graph a replica. The replication factor is the number of replicas in total across all replica groups (see Section 13.2, Replica grouping).
This can be set through popxl.Ir.replication_factor.
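For example, a minimal sketch (assuming you want a replication factor of 4):

import popxl

# Construct an IR that runs four replicas of the model in parallel.
ir = popxl.Ir(replication=4)
assert ir.replication_factor == 4

# Alternatively, the factor can be set through the property after construction.
ir.replication_factor = 4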
13.2. Replica grouping
In PopXL you can define a grouping of replicas when you create variables. This grouping is used when you initialise or read the variable: typically, variables are initialised and read on a per-group basis. The default behaviour is that all replicas belong to a single group.
The grouping is defined by a ReplicaGrouping object, which is instantiated with replica_grouping(). A ReplicaGrouping is initialised with a group_size and a stride. The group_size parameter sets the number of replicas to be grouped together, and the stride parameter sets the replica index difference between two consecutive members of a group.
Warning

Limitations:

- When stride == 1, it is required that replication_factor modulo group_size equals 0.
- When stride != 1, it is required that stride times group_size equals replication_factor.
Table 13.1, Table 13.2 and Table 13.3 show some of the different ways in which group_size and stride partition the replicas into groups.
Table 13.1 Groups with group_size=4 and stride=1 (16 replicas)

Group | Replicas
---|---
0 | 0, 1, 2, 3
1 | 4, 5, 6, 7
2 | 8, 9, 10, 11
3 | 12, 13, 14, 15
Table 13.2 Groups with group_size=4 and stride=4 (16 replicas)

Group | Replicas
---|---
0 | 0, 4, 8, 12
1 | 1, 5, 9, 13
2 | 2, 6, 10, 14
3 | 3, 7, 11, 15
Table 13.3 Groups with group_size=1 (16 replicas)

Group | Replicas
---|---
0 | 0
1 | 1
2 | 2
… | …
14 | 14
15 | 15
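As a sketch, the groupings shown in the tables above correspond to calls such as the following (assuming an IR with 16 replicas; the variable names are illustrative):

import popxl

ir = popxl.Ir(replication=16)

# Table 13.1: consecutive replicas are grouped together.
rg_table_1 = ir.replica_grouping(group_size=4, stride=1)

# Table 13.2: a group is formed from every fourth replica.
rg_table_2 = ir.replica_grouping(group_size=4, stride=4)

# Table 13.3: every replica is its own group.
rg_table_3 = ir.replica_grouping(group_size=1)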
13.3. Code examples
Listing 13.1 shows a simple example of the initialisation of a few different groupings.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import popxl
import numpy as np

replication_factor = 8
ir = popxl.Ir(replication=replication_factor)

with ir.main_graph:

    base_shape = [3, 3]

    # Create a tensor with default settings, that is: load the same value to all replicas.
    tensor_1 = popxl.variable(np.ndarray(base_shape))

    # Create a tensor with one variable on each of the replicas:
    tensor_2 = popxl.variable(
        np.ndarray([replication_factor] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=1),
    )

    # Create a tensor where replicas are grouped together in pairs:
    group_size = 2
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
    )

    # Create a tensor where each replica is grouped with an orthogonal replica:
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=4),
    )
Listing 13.2 shows an example of using a replica grouping on a remote variable. The IR has two replicas, and each is its own group.
import numpy as np

import popxl
import popxl.ops as ops
from popxl import dtypes

ir = popxl.Ir(replication=2)

num_groups = 2
v_h = np.arange(0, num_groups * 32).reshape((num_groups, 32))

rg = ir.replica_grouping(group_size=ir.replication_factor // num_groups)

with ir.main_graph, popxl.in_sequence():
    remote_buffer = popxl.remote_buffer((32,), dtypes.int32)
    remote_v = popxl.remote_variable(v_h, remote_buffer, replica_grouping=rg)

    v = ops.remote_load(remote_buffer, 0)

    v += 1

    ops.remote_store(remote_buffer, 0, v)
There are a couple of specifics to note here. Firstly, you need the in_sequence context because there is no data-flow dependence between the inplace add op and the remote_store op on the same tensor. Secondly, we manually pass the correct per-replica shape to popxl.remote_buffer; this shape does not have the group dimension.
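To inspect the per-group values on the host, you could extend Listing 13.2 along the following lines. This is only a sketch, assuming a machine with at least two IPUs is available; popxl.Session and get_tensor_data() are used here for illustration:

with popxl.Session(ir, "ipu_hw") as session:
    session.run()
    # Each replica increments its own group's slice, so reading the variable
    # back per group should yield v_h + 1, with shape (num_groups, 32).
    data = session.get_tensor_data(remote_v)
    assert np.array_equal(data, v_h + 1)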
Note

If you consider v_h to be the data for a single variable, this is akin to sharding the variable over two replicas. In fact, unless you need to AllGather your shards and cannot forgo the CBR optimisation, it is advisable to just use replica groupings as shown to achieve sharding. This is because the API is much less brittle with respect to what you can do without errors or undefined behaviour.
13.4. Retrieval modes
By default, only one replica per group is returned. Usually this is sufficient as all replicas within
a group should be identical. However, if you wish to return all replicas within a group
(for example to test all grouped replicas are the same), set the retrieval_mode
parameter to
"all_replicas"
when constructing your variable:
    # Create a tensor which is grouped across sequential replicas (0 and 1, 2 and 3) and
    # return all the group's variables when requested. The returned array will be of shape
    # [replication_factor] + base_shape
    group_size = 2
    tensor_4 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
        retrieval_mode="all_replicas",
    )

    # Create a tensor which is grouped across orthogonal replicas (0 and 2, 1 and 3)
    # and return all the group's variables when requested. The returned array will be of shape
    # [replication_factor] + base_shape