13. Replication

This chapter describes how to use replication in PopXL.

13.1. Graph replication

PopXL has the ability to run multiple copies of your model in parallel. This is called graph replication. Replication is a means of parallelising your inference or training workloads. We call each instance of the graph a replica. The replication factor is the number of replicas in total across all replica groups (see Section 13.2, Replica grouping).

This can be set through :py:attr:`~popxl.Ir.replication_factor`.
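
For example, a minimal sketch (assuming nothing beyond the popxl package) of requesting four replicas:

import popxl

ir = popxl.Ir()
ir.replication_factor = 4  # run four replicas of the model in parallel

Listing 13.1 later in this chapter uses the popxl.Ir(replication=...) constructor argument to the same effect.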

13.2. Replica grouping

In PopXL, you can define a grouping of replicas when you create variables. This grouping is used when you initialise or read the variable; typically, variables are initialised and read on a per-group basis. The default behaviour is that all replicas belong to a single group.

The grouping is defined by a ReplicaGrouping object, instantiated with replica_grouping(). A ReplicaGrouping is initialised with a group_size and a stride.

The group_size parameter sets the number of replicas in each group, and the stride parameter sets the difference in replica index between two consecutive members of a group.
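
For example, a grouping matching Table 13.2 below could be created as follows (a minimal sketch, assuming an Ir named ir with a replication factor of 16):

# Assumes `ir` is a popxl.Ir with a replication factor of 16.
# This groups the replicas as {0, 4, 8, 12}, {1, 5, 9, 13}, {2, 6, 10, 14}, {3, 7, 11, 15}.
grouping = ir.replica_grouping(stride=4, group_size=4)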

Warning

Limitations:

Table 13.1, Table 13.2 and Table 13.3 show some of the different ways in which group_size and stride partition the replicas into groups.

Table 13.1 Replication factor 16, group_size = 4, and stride = 1

Group    Replicas
0        0, 1, 2, 3
1        4, 5, 6, 7
2        8, 9, 10, 11
3        12, 13, 14, 15

Table 13.2 Replication factor 16, group_size = 4, and stride = 4

Group    Replicas
0        0, 4, 8, 12
1        1, 5, 9, 13
2        2, 6, 10, 14
3        3, 7, 11, 15

Table 13.3 Replication factor 16, group_size = 1, and stride = 1

Group    Replicas
0        0
1        1
2        2
…        …
14       14
15       15
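
To make the interaction of the two parameters concrete, the following standalone sketch (plain Python, not part of the PopXL API, and assuming groups tile in blocks of group_size * stride replicas, as in the tables above) reproduces those group assignments:

def group_assignments(replication_factor, group_size, stride):
    """Return, for each group, the list of replica indices it contains."""
    block = group_size * stride  # replicas spanned by one set of interleaved groups
    groups = [[] for _ in range(replication_factor // group_size)]
    for replica in range(replication_factor):
        group = (replica // block) * stride + (replica % stride)
        groups[group].append(replica)
    return groups

# Table 13.2: prints [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]
print(group_assignments(16, group_size=4, stride=4))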

13.3. Code examples

Listing 13.1 shows a simple example of initialising a few different groupings.

Listing 13.1 Example of setting up different variables.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import popxl
import numpy as np

replication_factor = 8
ir = popxl.Ir(replication=replication_factor)

with ir.main_graph:

    base_shape = [3, 3]

    # Create a tensor with default settings, that is: load the same value to all replicas.
    tensor_1 = popxl.variable(np.ndarray(base_shape))

    # Create a tensor with one variable on each of the replicas:
    tensor_2 = popxl.variable(
        np.ndarray([replication_factor] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=1),
    )

    # Create a tensor where the replicas are grouped together in pairs:
    group_size = 2
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
    )

    # Create a tensor where each replica is grouped with an orthogonal replica
    # (stride 4, so replicas 0 and 4, 1 and 5, and so on, share a group):
    tensor_3 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=4),
    )

13.4. Retrieval modes

By default, only one replica per group is returned. Usually this is sufficient, as all replicas within a group should be identical. However, if you wish to return the values from all replicas in a group (for example, to check that the grouped replicas are indeed identical), set the retrieval_mode parameter to "all_replicas" when constructing your variable:

Listing 13.2 Example of setting up variables with all_replicas retrieval mode.
    # Continues inside the `with ir.main_graph:` block from Listing 13.1.

    # Create a tensor which is grouped across sequential replicas (0 and 1, 2 and 3, and so on)
    # and which returns all of each group's variables when requested. The returned array will
    # be of shape [replication_factor] + base_shape.
    group_size = 2
    tensor_4 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(group_size=2),
        retrieval_mode="all_replicas",
    )

    # Create a tensor which is grouped across orthogonal replicas (0 and 2, 1 and 3, and so on)
    # and which returns all of each group's variables when requested. The returned array will
    # be of shape [replication_factor] + base_shape.
    tensor_5 = popxl.variable(
        np.ndarray([replication_factor // group_size] + base_shape),
        replica_grouping=ir.replica_grouping(stride=2, group_size=2),
        retrieval_mode="all_replicas",
    )