6. Grouping graph replicas

This section describes how to use popart.VariableSettings to group tensor weights across replicas. For a detailed description of what a replica is, refer to the replication section in the IPU Programmer’s Guide.

6.1. Concept

When using graph replication, variables by default contain the same value on all replicas. With VariableSettings, you can assign distinct tensor values to groups of replicas (and retrieve tensor values from them), which removes the limitation that every replica holds the same value.

6.2. VariableSettings

The VariableSettings object is initialized with two values: a CommGroup and a VariableRetrievalMode. The CommGroup defines the communication groups that the replicas holding this tensor are divided into, and the VariableRetrievalMode specifies how variables are retrieved from the replicas.

A CommGroup in turn consists of a CommGroupType enum value and the size of each group. Possible values for CommGroupType are:

  • popart.CommGroupType.All:

    This is the default group type. With this grouping, all replicas use the same variable values. This CommGroupType ignores the group size. Table 6.1 shows an example of such a grouping.

    Table 6.1 Replication factor 16, CommGroupType = All

    Group   Replicas
    0       0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

  • popart.CommGroupType.Consecutive:

    With this CommGroupType, each replica is grouped with its adjacent replicas (by replica index), and each group has the size that the CommGroup was instantiated with. Table 6.2 shows an example of such a grouping.

    Table 6.2 Replication factor 16, CommGroupType = Consecutive, CommGroup size = 4

    Group   Replicas
    0       0, 1, 2, 3
    1       4, 5, 6, 7
    2       8, 9, 10, 11
    3       12, 13, 14, 15

  • popart.CommGroupType.Orthogonal:

    Unlike Consecutive groups, Orthogonal groups are strided: the first member of each group is the replica whose index equals the group index, and each following member is offset from the previous one by a stride equal to the number of groups. Table 6.3 shows an example.

    Table 6.3 Replication factor 16, CommGroupType = Orthogonal, CommGroup size = 4

    Group   Replicas
    0       0, 4, 8, 12
    1       1, 5, 9, 13
    2       2, 6, 10, 14
    3       3, 7, 11, 15

  • popart.CommGroupType.Ungrouped:

    With Ungrouped, each replica is in its own group, as shown in Table 6.4.

    Table 6.4 Replication factor 16, CommGroupType = Ungrouped

    Group   Replicas
    0       0
    1       1
    2       2
    ...     ...
    14      14
    15      15
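The four grouping schemes above can be reproduced with a few lines of plain Python. The sketch below is illustrative only: replica_groups is a hypothetical helper, not part of the PopART API, and the strings "All", "Consecutive", "Orthogonal" and "Ungrouped" stand in for the popart.CommGroupType values. It also assumes that the group size divides the replication factor evenly.

```python
def replica_groups(group_type, replication_factor, group_size=0):
    """Return a list of groups, each a list of replica indices.

    Hypothetical helper (not part of PopART) that mirrors Tables 6.1 to 6.4.
    """
    replicas = list(range(replication_factor))
    if group_type == "All":
        # The group size is ignored: every replica belongs to group 0.
        return [replicas]
    if group_type == "Ungrouped":
        # Every replica forms its own group.
        return [[r] for r in replicas]
    # Assumption of this sketch: groups must tile the replicas exactly.
    if group_size <= 0 or replication_factor % group_size != 0:
        raise ValueError("group_size must evenly divide replication_factor")
    num_groups = replication_factor // group_size
    if group_type == "Consecutive":
        # Adjacent replica indices share a group.
        return [replicas[g * group_size:(g + 1) * group_size]
                for g in range(num_groups)]
    if group_type == "Orthogonal":
        # First member is the group index; the stride is the number of groups.
        return [replicas[g::num_groups] for g in range(num_groups)]
    raise ValueError(f"unknown group type: {group_type}")


for name in ("All", "Consecutive", "Orthogonal", "Ungrouped"):
    print(name, replica_groups(name, 16, 4))
```

Comparing the printed groups with the tables is a quick way to check which replicas will share a variable value before choosing a CommGroup.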

6.3. Instantiating Variables with VariableSettings

Before creating variables with VariableSettings, you must decide on a replication factor, because the replication factor determines the number of communication groups that require initialization, and thus the size of the initializing buffer.

VariableSettings can be added to the addInitializedInputTensor() or addVarInit() call when initializing your variable.

The initializer buffer used to create these variables must be sized so that it initializes each group individually. This is done by adding an outer dimension to the initializing buffer equal to the number of groups; the graph builder handles the rest. For example, a tensor with shape [2, 3, 4], with a VariableSettings and a replication_factor (that is, the number of replicas) that result in 4 groups, must be instantiated with shape [4, 2, 3, 4], where [g, ...] instantiates the variable on the replicas of group g.
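As a rough illustration of the shape rule, the sketch below builds a per-group initializer as nested Python lists. The helpers filled, grouped_initializer and shape_of are hypothetical names introduced for this example and are not part of PopART; the sketch also assumes that slice g of the outer dimension holds the initial value for group g. In practice the same data would normally be held in a NumPy array of the matching shape.

```python
def filled(shape, value):
    """Return nested lists of the given shape with every element set to value."""
    if not shape:
        return value
    return [filled(shape[1:], value) for _ in range(shape[0])]


def grouped_initializer(base_shape, group_values):
    """Stack one block of base_shape per group along a new outer dimension."""
    return [filled(list(base_shape), value) for value in group_values]


def shape_of(nested):
    """Return the shape of a regular nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return shape


# Base shape [2, 3, 4] with 4 groups, each group starting from a different value.
init = grouped_initializer([2, 3, 4], [0.0, 1.0, 2.0, 3.0])
print(shape_of(init))  # [4, 2, 3, 4]
print(init[2][0][0])   # [2.0, 2.0, 2.0, 2.0]
```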

6.4. Weight input/output

When using PyWeightsIO to read the value of the weights, the buffer size must match the size of the initializing data. If VariableRetrievalMode is AllReplicas, the outer dimension must instead match the replication factor.

For example, for a tensor of shape [2, 3, 4], a replication factor of 4 and a VariableSettings with CommGroup(Consecutive, 2), the PyWeightsIO buffer needs the shape:

  • [2, 2, 3, 4] if we use popart.VariableRetrievalMode.OnePerGroup.

  • [4, 2, 3, 4] if we use popart.VariableRetrievalMode.AllReplicas.

The on-device buffer is populated when using popart.Session.readWeights().
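The two buffer shapes in the example above follow directly from the retrieval mode. The sketch below is a minimal illustration, not PopART code: weights_io_shape is a hypothetical helper, the strings "OnePerGroup" and "AllReplicas" stand in for the popart.VariableRetrievalMode values, and the group size is assumed to divide the replication factor evenly.

```python
def weights_io_shape(base_shape, replication_factor, group_size, retrieval_mode):
    """Return the host buffer shape needed to read back a grouped variable.

    Hypothetical helper (not part of PopART).
    """
    if replication_factor % group_size != 0:
        raise ValueError("group_size must evenly divide replication_factor")
    if retrieval_mode == "OnePerGroup":
        # One copy of the variable per communication group.
        outer = replication_factor // group_size
    elif retrieval_mode == "AllReplicas":
        # One copy of the variable per replica.
        outer = replication_factor
    else:
        raise ValueError(f"unknown retrieval mode: {retrieval_mode}")
    return [outer] + list(base_shape)


print(weights_io_shape([2, 3, 4], 4, 2, "OnePerGroup"))  # [2, 2, 3, 4]
print(weights_io_shape([2, 3, 4], 4, 2, "AllReplicas"))  # [4, 2, 3, 4]
```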

Listing 6.1 Creating buffers for replicas.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import popart
import numpy
from popart import CommGroup, CommGroupType
from popart import VariableRetrievalMode, VariableSettings

builder = popart.Builder()

# replication factor
repl_factor = 4

# Simple base shape of variable on replica
base_shape = [3, 5]

# size of each group
group_size = 2

# The CommGroup we plan to use
communication_group = CommGroup(CommGroupType.Consecutive, group_size)

# VariableSettings to read from groups
settings_grouped = VariableSettings(
    communication_group, VariableRetrievalMode.OnePerGroup
)

# VariableSettings to read from all replicas
settings_individual = VariableSettings(
    communication_group, VariableRetrievalMode.AllReplicas
)

# get init buffer: the outer dimension equals the number of groups
num_groups = settings_grouped.groupCount(repl_factor)
shape = [num_groups] + base_shape
initializer = numpy.zeros(shape, dtype=numpy.float32)  # example

print(initializer.dtype)  # float32

# Creating Variables
a = builder.addInitializedInputTensor(initializer, settings_grouped)
b = builder.addInitializedInputTensor(initializer, settings_individual)

# get IO buffer shapes
shape_a = [settings_grouped.numReplicasReturningVariable(repl_factor)] \
            + base_shape
shape_b = [settings_individual.numReplicasReturningVariable(repl_factor)] \
            + base_shape

# get IO buffers, using the same float32 dtype as the initializer
buffer_a = numpy.empty(shape_a, dtype=numpy.float32)
buffer_b = numpy.empty(shape_b, dtype=numpy.float32)

# finalize IO buffers
weightsIo = popart.PyWeightsIO({a: buffer_a, b: buffer_b})

6.5. ONNX checkpoints

ONNX is not aware of the replication factor by default. Unless told otherwise, the ONNX model interprets the outermost dimension as part of the tensor on each replica, which usually breaks the model logic.

To accommodate this, the builder function popart.Builder.embedReplicationFactor() writes the replication factor into the ONNX model as an attribute of the graph.

The builder does not need the replication factor embedded when using popart.Session.resetHostWeights() to write an ONNX file into a new model.