4. Features

4.1. Options

You can change how PopTorch compiles and executes models using poptorch.Options. You can find a full list of options in Section 10.1, Options. Broadly speaking, the options fall into the following categories:

  1. General options (see Options)

  2. Options related to half precision (see opts.Precision.*)

  3. Management of the training process (see opts.Training.*)

  4. Location of tensors (see: opts.TensorLocations.* and TensorLocationSettings)

  5. Options relevant to the Torch JIT compiler (see opts.Jit.*)

  6. Control of distributed execution environments when using tools other than PopRun (see opts.Distributed.*)

See Section 5, Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation, and replication_factor interact with the output and input sizes.

You can choose to use the IPU Model instead of IPU hardware with the useIpuModel() option.

4.1.1. Setting options via config file

In addition to setting these options programmatically, you can also set them in a config text file by using loadFromFile().

Each line in the file must contain a single command corresponding to setting an option in Options. To set an option within the file, write the command as you would within a Python script, but omit the "options." prefix. For example:

Listing 4.1 Example contents of a config file used to set options
deviceIterations(1)
setExecutionStrategy(poptorch.ShardedExecution())
replicationFactor(1)
enableSyntheticData(True)

Then, instantiate Options and call loadFromFile():

Listing 4.2 Setting options using a config file named “poptorch.conf”
opts = poptorch.Options()
opts.loadFromFile("tmp/poptorch.conf")
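Conceptually, loadFromFile() treats each line of the file as a method call on the Options instance. The sketch below illustrates this behaviour with a hypothetical stand-in class; FakeOptions and load_from_config_text are illustrative only and are not part of the PopTorch API:

```python
# Rough sketch of how a config file maps lines to Options method calls.
# ``FakeOptions`` and ``load_from_config_text`` are hypothetical stand-ins
# for illustration; they are NOT part of PopTorch.
class FakeOptions:
    def __init__(self):
        self.settings = {}

    def deviceIterations(self, n):
        self.settings["device_iterations"] = n
        return self

    def replicationFactor(self, n):
        self.settings["replication_factor"] = n
        return self


def load_from_config_text(options, text):
    # Each non-empty line becomes a method call on the options object.
    for line in text.splitlines():
        line = line.strip()
        if line:
            eval("options." + line)
    return options


opts = load_from_config_text(FakeOptions(),
                             "deviceIterations(4)\nreplicationFactor(2)")
print(opts.settings)  # {'device_iterations': 4, 'replication_factor': 2}
```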

4.2. Model wrapping functions

The basis of PopTorch integration comes from the two model wrapping functions described in the following sections.

Note

PopTorch makes a shallow copy of the model. Changes to the parameters of the models returned by these two model wrapping functions affect the original model, and vice versa. However, primitive variable types, such as the training bool of torch.nn.Module, are not kept in sync. If your PyTorch model is named model, call model.eval() or model.train(), if required, before calling these wrapping functions.

4.2.1. poptorch.trainingModel

This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in training mode. See trainingModel() for more information.

Listing 4.3 An example of the use of trainingModel
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)
ones = torch.ones(10)

# Train on IPU.
for i in range(0, 800):
    # Each call here executes the forward pass, loss calculation, and backward
    # pass in one step.
    # Model input and loss function input are provided together.
    poptorch_out, loss = poptorch_model(input, target)
    print(f"{i}: {loss}")

# Copy the trained weights from the IPU back into the host model.
poptorch_model.copyWeightsToHost()

# Execute the trained weights on host.
model.eval()
native_out = model(input)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-04, atol=1e-04)

Note

By default, PopTorch will only return the final batch of outputs. Please see Section 5.6, poptorch.Options.Training.anchorReturnType for details on what PopTorch returns when using trainingModel() and how you can calculate statistics such as training accuracy over all batches.

4.2.2. poptorch.inferenceModel

This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in inference mode. See inferenceModel() for more information.

Listing 4.4 An example of the use of inferenceModel
import torch
import torchvision
import poptorch

# Some dummy imagenet sized input.
picture_of_a_cat_here = torch.randn([1, 3, 224, 224])

# The model, in this case a MobileNet model with pretrained weights that comes
# canned with Pytorch.
model = torchvision.models.mobilenet_v2(pretrained=True)
model.train(False)

# Wrap in the PopTorch inference wrapper
inference_model = poptorch.inferenceModel(model)

# Execute on IPU.
out_tensor = inference_model(picture_of_a_cat_here)

# Get the top 5 ImageNet classes.
top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
print(top_five_classes)

# Try the same on native PyTorch
native_out = model(picture_of_a_cat_here)

native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
assert any(top_five_classes[1][0] == native_top_five_classes[1][0])

You can also run a model in half precision by converting its parameters and inputs to half before wrapping it:

model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to halfs. Without doing so,
# the Linear parameters will automatically be cast to half, which allows
# training with float32 parameters but half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half

4.2.3. poptorch.PoplarExecutor

You should not create this class directly. It is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It has a few methods which you can use to interface with the IPU.

The PoplarExecutor will implicitly keep the parameters of the source PyTorch model and the PopTorch model(s) in sync. However, you need to copy the weights explicitly if the model is trained on the CPU and inference is then run on the IPU.

See PoplarExecutor for a complete description of the IPU interface functionality.

Listing 4.5 Example of when explicit copies are needed
model = Model()

model.eval()
poptorch_inf = poptorch.inferenceModel(model)

# Switch to "train" mode for "poptorch.trainingModel"; poptorch_inf will remain in "eval" mode
model.train()
poptorch_train = poptorch.trainingModel(model)

# train on IPU
train(poptorch_train)
torch.save(model.state_dict(), "model.save")  # OK

# Already in "eval" mode
validate(poptorch_inf)  # OK

# switch to "eval" mode for CPU
model.eval()
validate(model)  # OK

# train on CPU
model.train()
train_on_cpu(model)

# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)

4.2.4. poptorch.isRunningOnIpu

One useful utility function is isRunningOnIpu(). This returns True when executing on the IPU and False when executing the model outside IPU scope. This allows for different code paths within the model.

A common use case is executing equivalent code to a PopART custom operator when running on the CPU. For example:

class Network(torch.nn.Module):
    def forward(self, x, y):
        if poptorch.isRunningOnIpu():
            # IPU path
            return my_custom_operator(x, y)
        else:
            # CPU path
            return my_torch_implementation(x, y)

4.3. Error handling

4.3.1. Recoverable runtime errors

This category of error is likely to be transient.

Exception type raised by PopTorch: poptorch.RecoverableError (inherits from poptorch.Error)

The exception contains the action required to recover from this error in its recovery_action string attribute.

This attribute can contain:
  • IPU_RESET: Reset the IPU and reload the IPU memory.

  • PARTITION_RESET: Reset the IPU partition. This resets the IPU-links between IPUs.

  • FULL_RESET: Power cycle the system.

4.3.2. Unrecoverable runtime errors

These errors are likely to persist. You should take the system out of operation for analysis and repair.

Exception type raised by PopTorch: poptorch.UnrecoverableError (inherits from poptorch.Error)

4.3.3. Application and other errors

This kind of error is due to an error in the program or a misuse of an API.

Exception type raised by PopTorch: poptorch.Error if the error was detected in the C++ backend, or some generic Python Exception if it happened in the Python layer.

poptorch.Error has the following string attributes:
  • message The error message without any of the context.

  • type The part of the software stack that raised the exception and the category of the error if available.

  • location Where the exception was raised.

Example:

Listing 4.6 How to handle recoverable / unrecoverable errors
    try:
        m = PytorchModel(model_param)
        inference_model = poptorch.inferenceModel(m)
        t1 = torch.tensor([1.])
        t2 = torch.tensor([2.])
        assert inference_model(t1, t2) == 3.0
    except poptorch.RecoverableError as e:
        print(e)
        if e.recovery_action == "FULL_RESET":
            reboot_server()
        elif e.recovery_action == "IPU_RESET":
            print("Need to reset the IPU")
        elif e.recovery_action == "PARTITION_RESET":
            print("Need to reset the partition")
    except poptorch.UnrecoverableError as e:
        print(f"Unrecoverable error: machine needs to be taken offline: {e}")
        shutdown_system()
    except poptorch.Error as e:
        print(f"Received {e.message} from component {e.type}, "
              f"location: {e.location}")
        # Or you could just print all the information at once:
        print(e)
    except Exception as e:
        print(e)

4.4. Multi-IPU execution strategies

This section describes strategies to run PopTorch code on more than one IPU. Some of these allow you to run code in parallel on multiple IPUs. You will need to use one of these execution strategies for PopTorch code that does not fit on a single IPU.

Note

In general, we advise pipelining over as few IPUs as possible. However, you may need to experiment to find the optimal pipeline length; in some corner cases, a longer pipeline can lead to faster throughput.

There are four execution strategies that you can use to run a model on a multi-IPU device:

  • PipelinedExecution

  • ShardedExecution

  • SerialPhasedExecution

  • ParallelPhasedExecution

You can select a strategy with the setExecutionStrategy() option.

The default execution strategy is PipelinedExecution.

The following sections first introduce the general annotation functions that are relevant to all four execution strategies, and then explain each strategy with examples.

By default, PopTorch will not let you run a model if the number of IPUs used is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. Alternatively, you can enable autoRoundNumIPUs() to automatically round the number of IPUs reserved up to a power of 2, with the excess IPUs being reserved but idle. This option is not enabled by default, to prevent unintentional overbooking of IPUs.
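The rounding behaviour can be sketched as follows (round_up_to_power_of_two is a hypothetical helper for illustration, not a PopTorch function):

```python
def round_up_to_power_of_two(num_ipus):
    # Mirrors the effect of autoRoundNumIPUs(True): the number of IPUs
    # actually reserved is the next power of 2, with the excess idle.
    reserved = 1
    while reserved < num_ipus:
        reserved *= 2
    return reserved


for requested in (1, 2, 3, 5, 8):
    print(requested, "->", round_up_to_power_of_two(requested))
```

For example, a model annotated to use 5 IPUs would reserve 8, leaving 3 idle.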

4.4.1. Annotations

Model partitioning using blocks

BeginBlock is a wrapper class, Block is a context manager, and BlockFunction() is a function decorator. These partition models into “blocks” that can be executed on different IPUs. You can use them to define model sharding on a multi-IPU device.

You can use BeginBlock to annotate an existing model. Each call, with example arguments (layer_n, ipu_id=m), places layer_n and all subsequent layers on IPU m; the layers before layer_n remain on the IPU assigned by any earlier annotation (IPU 0 if there is none).

Listing 4.7 Annotating existing layers
import transformers
import torch
import poptorch

# A bert model from hugging face. See the packaged BERT example for actual usage.
pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'


# For later versions of transformers, we need to wrap the model and set
# return_dict to False
class WrappedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wrapped = transformers.BertForQuestionAnswering.from_pretrained(
            pretrained_weights)

    def forward(self, input_ids, attention_mask, token_type_ids):
        return self.wrapped.forward(input_ids,
                                    attention_mask,
                                    token_type_ids,
                                    return_dict=False)

    def __getattr__(self, attr):
        try:
            return torch.nn.Module.__getattr__(self, attr)
        except AttributeError:
            return getattr(self.wrapped, attr)


model = WrappedModel()

# A handy way of seeing the names of all the layers in the network.
print(model)

# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
                                                  ipu_id=1)

# Now all layers from the previous block up to (but not including) this
# layer are on IPU 1, and this layer onwards is on IPU 2
model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
                                                  ipu_id=2)

# Finally, all layers from this layer to the end of the network are on IPU 3.
model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
                                                  ipu_id=3)

# We must batch the data by at least the number of IPUs. Each IPU will still execute
# whatever the model batch size is.
data_batch_size = 4

# Create a poptorch.Options instance to override default options
opts = poptorch.Options()
opts.deviceIterations(data_batch_size)

You can use Block to annotate a model from within its definition. This context manager class defines a scope in the context of the model. Everything within that scope is placed on the specified IPU (unless overridden by a Stage).

Listing 4.8 Annotating a model directly
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):

        # Explicit layers on a certain IPU
        poptorch.Block.useAutoId()
        with poptorch.Block(ipu_id=0):
            x = self.act(self.layer1(x))

        with poptorch.Block(ipu_id=1):
            x = self.act(self.layer2(x))

        with poptorch.Block(ipu_id=2):
            x = self.act(self.layer3(x))
            x = self.act(self.layer4(x))

        with poptorch.Block(ipu_id=3):
            x = self.softmax(x)
        return x


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))

In addition, you can use the BlockFunction() function decorator to place functions (containing one or more layers) onto a particular block. Everything within that function is placed on the specified IPU (unless overridden by a Stage).

Listing 4.9 Annotating functions
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        poptorch.Block.useAutoId()
        x = self.block_one(x)
        x = self.block_two(x)
        x = self.final_activation(x)
        return x

    @poptorch.BlockFunction(ipu_id=0)
    def block_one(self, x):
        x = self.act(self.layer1(x))
        x = self.act(self.layer2(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def block_two(self, x):
        x = self.act(self.layer3(x))
        x = self.act(self.layer4(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def final_activation(self, x):
        return self.softmax(x)


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))

You can use any, or a combination, of these three annotation options. In the above examples, ipu_id is used to specify blocks. This alone is sufficient to enable parallel execution: by default, AutoStage will set up a pipeline in which the pipeline stage of each block is equal to its ipu_id. However, it is equally valid to instead use the user_id argument to assign a name to each block, and then assign each named block to a stage manually using the Stage or Phase classes, as outlined in the next sections.

BeginBlock, Block and BlockFunction() need to follow a set of rules:

  • You must declare all the layers inside a Block scope to avoid missing annotations. BeginBlock doesn’t have the same constraint because all the layers called after this will automatically be added to the last BeginBlock.

  • Note that PopTorch needs to reserve IPUs in powers of 2. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.

  • You should not include unused or dead layers in any BeginBlock or Block.

  • If layer A happens before layer B inside the model and each layer has a BeginBlock associated with it, you need to write BeginBlock for layer A before BeginBlock for layer B.

Failing to obey the above rules will result in compilation errors.

poptorch.Stage and poptorch.AutoStage

Conceptually, BeginBlock and Block collect the layers of a model into a Stage. You can combine multiple stages into a Phase. Multiple phases form an execution strategy.

poptorch.Stage

Stage defines the layers of the model to run on one IPU. A stage can consist of one or more blocks created using BeginBlock or Block and identified by their user_id.

You can define consecutive layers in a model in either the same stage or consecutive stages. Whether stages run in parallel or sequentially depends on the specific execution strategy.

Internally, each operation in a model is assigned a stage_id through Stage.

poptorch.AutoStage

You can use AutoStage if you don’t want to specify stages by hand. This will assign one Stage per BeginBlock or Block.

By default, AutoStage.SameAsIpu is in effect, which means that the stage_id of each Stage will be set to the ipu_id specified for its BeginBlock or Block.

Note that stage_id values must be ascending in PipelinedExecution. Consider the code example above: if blocks “0”, “1” and “2” were assigned to IPUs 0, 1 and 0 respectively, then block “2” would be assigned stage_id 0. This would cause the compiler to fail to schedule the last two stages, “1” and “2”, due to a conflict:

  • The model implies “1” should run earlier than “2”

  • Their stage_id values suggest “2” should run earlier than “1”

When AutoStage.AutoIncrement is true, each new BeginBlock or Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.
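The difference between the two policies can be sketched in plain Python (the helper functions below are illustrative only, not part of the PopTorch API):

```python
def auto_stage_ids(block_ipu_ids, auto_increment=False):
    # AutoStage.SameAsIpu: each block's stage_id equals its ipu_id.
    # AutoStage.AutoIncrement: stage_ids simply count up, one per block.
    if auto_increment:
        return list(range(len(block_ipu_ids)))
    return list(block_ipu_ids)


def schedulable_as_pipeline(stage_ids):
    # PipelinedExecution requires stage_ids in ascending order.
    return all(a <= b for a, b in zip(stage_ids, stage_ids[1:]))


# Blocks "0", "1", "2" placed on IPUs 0, 1 and 0 again:
ipus = [0, 1, 0]
print(auto_stage_ids(ipus))                       # [0, 1, 0]
print(schedulable_as_pipeline(auto_stage_ids(ipus)))  # False: conflict
print(auto_stage_ids(ipus, auto_increment=True))  # [0, 1, 2]
print(schedulable_as_pipeline(auto_stage_ids(ipus, auto_increment=True)))  # True
```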

poptorch.Phase

Phase defines a processing unit of phased execution. It can contain one or more instances of Stage.

Phase is only used in SerialPhasedExecution and ParallelPhasedExecution. It is not used in ShardedExecution and PipelinedExecution.

Listing 4.10 Example of Stage declaration
class Model(torch.nn.Module):
    def forward(self, x, y):
        with poptorch.Block("A"):
            c = x + x
        with poptorch.Block("B"):
            d = y + y
        with poptorch.Block("C"):
            e = x * 3

        return c, d, e


first = poptorch.Phase(poptorch.Stage("A").ipu(0))
# Regrouped in a single stage
second = poptorch.Phase(poptorch.Stage("B", "C").ipu(1))
# 2 separate stages
second = poptorch.Phase(poptorch.Stage("B").ipu(1), poptorch.Stage("C").ipu(3))

In the code snippet above, “B” and “C” will run sequentially on IPU 1 when they are regrouped in a single stage. When they are placed in two separate stages, they will run in parallel on IPUs 1 and 3 simultaneously.

Advanced annotation with strings

You can use Python strings to represent the user_id and ipu_id for a Block or BeginBlock. Because strings are evaluated at runtime, they allow for a dynamic number of stages and phases.

Here is an example showing how to use formatted strings (f-strings) in ParallelPhasedExecution.

In Listing 4.11, there are several places where f-strings are used:

  • Line 25: f"phase{phase}_ipu{ipu}", where phase takes the values 0, 1, 1, 2, 3, 3, 4, 5 and 5, and ipu ranges over 0 and 1. The total number of instances of this f-string is 12: 6 phases times 2 IPUs.

  • Line 32: f"phase{N*2-1}_ipu1", where the phase is 5 and the ipu is 1.

  • Lines 47-48 and 51-52: when defining the stages, four f-strings are used, where n ranges from 0 to 2:

    • f"phase{2*n}_ipu0"

    • f"phase{2*n}_ipu1"

    • f"phase{2*n+1}_ipu0"

    • f"phase{2*n+1}_ipu1"

    These refer to phases 0, 2, 4 and 1, 3, 5, on ipu0 and ipu1 respectively. All 12 f-strings are thus defined in Block and used in Stage dynamically, and they match exactly.

Listing 4.11 An example of parallel phased execution
 1  poptorch.setLogLevel("DEBUG")  # Force debug logging
 2  N = 3
 3  size = 10
 4
 5
 6  class Model(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.weights = []
10          for n in range(N * 6):
11              weight = torch.nn.Parameter(torch.rand(size, size),
12                                          requires_grad=True)
13              self.register_parameter(f"w{n}", weight)
14              self.weights.append(weight)
15
16      def forward(self, in0, target=None):
17          phase = 0
18          weight = iter(self.weights)
19          with poptorch.Block("phase0_ipu0"):
20              ins = torch.split(in0, size)
21          for n in range(N * 3):
22              out = []
23              for ipu in range(2):
24                  x = ins[ipu]
25                  with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                      x = torch.matmul(next(weight), x)
27                      out.append(F.relu(x))
28              ins = out[1], out[0]
29              # We want 2 matmuls in the same phase
30              if n % 3 != 1:
31                  phase += 1
32          with poptorch.Block(f"phase{N*2-1}_ipu1"):
33              res = ins[0] + ins[1]
34              if target is None:
35                  return res
36              return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39  input = torch.rand(size * 2, 1)
40  target = torch.rand(size, 1)
41  model = Model()
42  opts = poptorch.Options()
43  phases = []
44  # Alternate between 0-2 and 1-3
45  for n in range(N):
46      phases.append([
47          poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48          poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49      ])
50      phases.append([
51          poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52          poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53      ])
54  opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55  poptorch_model = poptorch.trainingModel(model, opts)
56  poptorch_model.compile(input, target)

With the above functions as building blocks, you can set execution strategies using the four kinds of execution modes, as shown below.

4.4.2. Available execution strategies

Note that you can use the same annotations with each execution strategy. The strategies differ only in the method of parallelisation and in tensor locations.

Pipelined execution

PipelinedExecution is the default execution strategy. It extends Sharded execution with parallel execution on multiple IPUs.

Parallelisation in PipelinedExecution requires deviceIterations() (required for inference only, but speeds up training) and gradientAccumulation() (for training only) as explained in Section 5, Efficient data batching. deviceIterations() must be greater than or equal to the number of IPUs used by the model. gradientAccumulation() must be greater than or equal to the number of pipeline stages (forward and backward). As well as these constraints, you must also consider the batch dimension, which must be a multiple of deviceIterations() * replicationFactor() * gradientAccumulation() during training and deviceIterations() * replicationFactor() during inference.
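The batch-dimension constraint can be sketched as simple arithmetic (host_batch_multiple is a hypothetical helper for illustration, not part of PopTorch):

```python
def host_batch_multiple(device_iterations, replication_factor,
                        gradient_accumulation, training):
    # The batch dimension seen by the host must be a multiple of:
    #   deviceIterations * replicationFactor * gradientAccumulation (training)
    #   deviceIterations * replicationFactor                        (inference)
    multiple = device_iterations * replication_factor
    if training:
        multiple *= gradient_accumulation
    return multiple


# For example, with 4 device iterations, replication factor 2, and
# 8 gradient accumulation steps:
print(host_batch_multiple(4, 2, 8, training=True))   # 64
print(host_batch_multiple(4, 2, 8, training=False))  # 8
```

With a model batch size of 1, the host would therefore need to supply batches of 64 samples during training and 8 during inference in this configuration.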

After one stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel.

An IPU can only start its own stage of a batch after its previous stage of the current batch has been processed. Hence, all IPUs will be occupied after a “warm-up” period.

At the end of processing, a “cool-down” period is required to aggregate the results and apply weight updates.

Sharded execution

In this strategy, each IPU sequentially executes a distinct part of the model. A single unit of processing in ShardedExecution is called a shard.

A shard is specified using Stage, or, if no Stage is specified, by the user_id passed to BeginBlock or Block. Each shard is executed sequentially on a single IPU. You can place multiple shards on multiple IPUs, but only one IPU is used at a time while the other IPUs are idle. If consecutive stages are allocated to the same IPU, PopART will merge them into a single stage on that IPU. Weights and activations use the on-chip memory of the IPUs. You need to place layers that share weights on the same IPU.

ShardedExecution can be useful for processing a single sample or for debugging. Overall, it has low efficiency because only one IPU is used at a time.

Phased execution

ParallelPhasedExecution and SerialPhasedExecution have the following features in common:

  • A portion of the weights and activations are transferred to and from streaming memory before and after each phase. The portion transferred is the one needed by the layers wrapped in BeginBlock or Block in the current Phase.

  • If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.

  • Both trade off some performance to support larger models with higher memory needs.

  • Any number of phases is allowed.

  • The number of stages in each Phase should match the number of IPUs in each group of IPUs.

  • Stages inside each Phase can run in parallel.

Although you only define the Phase for the forward pass, the corresponding phases for the backward pass are created automatically. The order of phased execution for the backward pass won’t change, but you can decide whether a phase is shared by both the forward and backward passes. In other words, you can decide whether to avoid the memory transfer of a portion of the weights and activations.

Serial phased execution

In SerialPhasedExecution, phases execute on a single group of IPUs sequentially.

Listing 4.12 How to use SerialPhasedExecution
strategy = poptorch.SerialPhasedExecution(
    poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
    poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
    poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2")))

strategy.phase(0).ipus(0, 1)
strategy.phase(1).ipus(0, 1)
strategy.phase(2).ipus(0, 1)

opts.setExecutionStrategy(strategy)

The code above causes all phases to run serially on IPUs 0 and 1: stages “A”, “B” and “C” run on IPU 0, and “A2”, “B2” and “C2” on IPU 1.

Parallel phased execution

In ParallelPhasedExecution, phases are executed in parallel alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from the streaming memory, if the desired weights and activations are already available in another group of IPUs.

Listing 4.13 How to use ParallelPhasedExecution
strategy = poptorch.ParallelPhasedExecution(
    poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
    poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
    poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5")))

strategy.phase(0).ipus(0, 2)
strategy.phase(1).ipus(1, 3)
strategy.phase(2).ipus(0, 2)

opts.setExecutionStrategy(strategy)

In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so the number of stages matches the number of IPUs in each group. The even phases 0 and 2 run on IPUs 0 and 2, while the odd phase 1 runs on IPUs 1 and 3. This allows for faster cross-IPU copies, both inter-phase and intra-phase.

poptorch.Liveness

Liveness controls the availability of tensors on IPU, and is only needed for ParallelPhasedExecution and SerialPhasedExecution.

The default Liveness is AlwaysLive. OffChipAfterFwd, OffChipAfterFwdNoOverlap and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.

4.5. Optimizers

PopTorch supports the following optimizers:

  1. SGD

  2. Adam

  3. AdamW

  4. RMSprop

  5. LAMB

In addition, PopTorch has features to support float16 models, such as loss scaling, velocity scaling, bias correction and accumulator types.

Important

All of these extra attributes (except velocity_scaling) must have the same values across param_groups, and therefore you must set them at the optimizer level.

Listing 4.14 How to update values in an Optimizer
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
poptorch_model = poptorch.trainingModel(model, options, opt)
poptorch_model(input, target)
# Update optimizer attribute
opt.loss_scaling = 1.0
# Update param_group attribute
opt.param_groups[0]["loss_scaling"] = 1.0
# Set the new optimizer in the model
poptorch_model.setOptimizer(opt)
poptorch_model(input, target)

Important

You must call setOptimizer() to apply the new optimizer values to the model.

4.5.1. Loss scaling

When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.

Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.

Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
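As a sanity check of the claim that loss scaling leaves the effective update unchanged, here is a toy one-parameter example in plain Python (the helper names are hypothetical, for illustration only):

```python
def grad_wrt_w(w, x, t):
    # Gradient of the squared-error loss (w*x - t)**2 with respect to w.
    return 2 * (w * x - t) * x


def sgd_step_with_loss_scaling(w, x, t, lr, loss_scaling):
    # Scaling the loss scales the gradient by the same factor...
    scaled_grad = loss_scaling * grad_wrt_w(w, x, t)
    # ...and multiplying by the inverse scale before the update undoes it.
    grad = scaled_grad / loss_scaling
    return w - lr * grad


# A power-of-two scale keeps this toy float arithmetic exact, so both
# calls produce an identical weight update.
print(sgd_step_with_loss_scaling(0.5, 2.0, 1.5, 0.1, 1.0))     # 0.7
print(sgd_step_with_loss_scaling(0.5, 2.0, 1.5, 0.1, 1024.0))  # 0.7
```

In float16, the benefit is that the intermediate scaled gradients sit further away from the underflow threshold; the final update is unaffected.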

You can either set the loss_scaling factors manually, or you can set setAutomaticLossScaling() in opts.Training, which will automatically set a global loss scaling factor. If you both set loss_scaling manually and enable automatic loss scaling, the manually set factor(s) will be used initially and updated automatically during training.

Warning

Automatic loss scaling is an experimental feature and may not behave as expected.

4.5.2. Velocity scaling (SGD combined variant only)

The SGD optimizer, when used with momentum, updates weights based on the velocity values. The combined variant uses one tensor per parameter to store the velocity and the changes to the velocity from accumulated gradients. Unlike the separate variant, each gradient accumulation step therefore involves adding or subtracting values whose magnitude differs from that of the gradients (for which loss scaling is used). You can use the velocity_scaling parameter to scale the combined velocity tensor to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling, so loss_scaling has no impact on the effective scaling of the velocity values.)

As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
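The cancellation described in the parenthetical note above can be checked with plain Python arithmetic (illustrative only; this is not PopTorch code):

```python
loss_scaling = 128.0
velocity_scaling = 64.0
true_grad = 1e-4

scaled_grad = true_grad * loss_scaling  # gradients carry the loss scale
# The combined velocity tensor is held at velocity_scaling, so the update is
# effectively scaled by velocity_scaling / loss_scaling ...
velocity_update = scaled_grad * velocity_scaling / loss_scaling
# ... and loss_scaling cancels: only velocity_scaling affects the velocity.
assert velocity_update == true_grad * velocity_scaling
```

Because both factors are powers of two here, the arithmetic is exact and the equality holds bit-for-bit.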

4.5.3. Accumulation types

To improve numerical stability, some of the optimizers (LAMB, Adam, AdamW, RMSprop) give you the option to choose the data types used by the optimizer's accumulators.

accum_type lets you choose the type used for gradient accumulation. first_order_momentum_accum_type and second_order_momentum_accum_type give you control over the types used to store the first-order and second-order momentum optimizer states, respectively.
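As a sketch using the parameter names listed above (assuming an existing torch.nn.Module called model, and that torch and poptorch are imported), you might keep gradient accumulation and first-order state in float16 while retaining float32 for the second-order state:

```python
# Sketch only: keep memory-hungry accumulators in half precision while
# preserving float32 for the second-order momentum state.
opt = poptorch.optim.Adam(model.parameters(),
                          lr=1e-3,
                          accum_type=torch.float16,
                          first_order_momentum_accum_type=torch.float16,
                          second_order_momentum_accum_type=torch.float32)
```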

4.5.4. Constant attributes

To improve performance and/or save memory, PopTorch will try to embed constant attributes directly in the program.

Important

Trying to modify a constant attribute after the model has been compiled will result in an error.

For PopTorch optimizers (those from the poptorch.optim namespace), the attributes you explicitly pass to the optimizer's constructor will, by default, be considered variable; all others will be considered constant.

You can override this behaviour using markAsConstant() and markAsVariable() before compiling the model.

Listing 4.15 Constant and variable attributes for PopTorch optimizers
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         use_combined_accum=False)
# momentum and loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01, use_combined_accum=False)
# lr and momentum will be marked as variable.
# loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsConstant("loss_scaling")
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsVariable("momentum")

For native optimizers (those from the torch.optim namespace), attributes left at their default values in the constructor will be considered constant.

There is no way to override this behaviour, which is why we recommend that you always use the poptorch.optim optimizers instead.

Listing 4.16 Constant and variable attributes for Torch optimizers
# momentum will be marked as constant (It's not set)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# lr will be marked as variable.
# momentum will still be marked as constant (Because its default value is 0.0)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
# lr and momentum will both be marked as variable.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=1.0)

Note

There is an exception: lr is always marked as variable.

4.6. PopTorch ops

This section describes some “helper” operations you can use within a model.

4.6.1. poptorch.ctc_beam_search_decoder

This function adds a Connectionist Temporal Classification (CTC) beam search decoder operator to the model.

class Model(torch.nn.Module):
    def forward(self, log_probs, lengths):
        return poptorch.ctc_beam_search_decoder(log_probs, lengths)


For more information see: ctc_beam_search_decoder().

4.6.2. poptorch.ipu_print_tensor

This function adds an op to print the content of a tensor on the IPU.

Note

To prevent the print operation being optimised out by the graph optimiser, you must use the return value of ipu_print_tensor().

class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        x = x + 1

        # It is important to make sure the result of the print is used.
        x = poptorch.ipu_print_tensor(x)

        return x + self.bias


For more information see: ipu_print_tensor().

4.6.3. poptorch.identity_loss

You can use this function to implement custom losses. It takes a single PyTorch tensor and will backpropagate a gradient of ones through it.

Listing 4.17 Example of custom loss.
def custom_loss(output, target):
    # Mean squared error with a scale
    loss = output - target
    loss = loss * loss * 5
    return poptorch.identity_loss(loss, reduction="mean")


class ExampleModelWithCustomLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = ExampleModel()

    def forward(self, input, target):
        out = self.model(input)
        return out, custom_loss(out, target)


For more information see: identity_loss().

4.6.4. poptorch.MultiConv

Use the MultiConv wrapper class to define multi-convolutions.

Refer to the PopLibs documentation for multi-convolutions for further information.

For more information see: MultiConv and MultiConvPlanType.

4.6.5. poptorch.nop

PopTorch includes a “no-op” function for debugging purposes.

For more information see: nop().

4.6.6. poptorch.serializedMatMul

Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.

For more information see: serializedMatMul().
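The memory saving comes from computing the product in slices rather than in one go. This plain-Python sketch (illustrative only, not the PopTorch API) shows that multiplying slices of the left-hand operand reproduces the full result, while only one slice's partial product needs to be live at a time:

```python
def matmul(a, b):
    # Naive list-of-lists matrix multiplication.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 2], [3, 4]]

full = matmul(a, b)

# "Serialized" version: process two rows of `a` at a time.
serialized = []
for i in range(0, len(a), 2):
    serialized.extend(matmul(a[i:i + 2], b))

assert serialized == full
```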

4.6.7. poptorch.set_available_memory

Use this function to override the default proportion of tile memory available as temporary memory for use by operations such as a convolution or matrix multiplication. The operators that can be tuned with this setting include:

  • convolution

  • matrix multiplication

  • embedding lookup

  • indexing operations

For more information see: set_available_memory().

4.6.8. Miscellaneous functions

The following PopTorch functions, not related to model creation, are available:

4.7. Half / float16 support

PopTorch supports the half-precision floating point (float16) format. You can simply input float16 tensors into your model. (You can convert a tensor to float16 using tensor = tensor.half().)

You can use your models in one of the following ways:

  1. Convert all parameters (weights) to float16 by using a Module's half() method. This is the most memory-efficient approach; however, small updates to weights may be lost, hindering training.

  2. Keep the parameters (weights) as float32, in which case the parameter updates will occur using float32. However, the parameters will be converted to float16 if you call an operation with a float16 input. This is more memory efficient than using float32 tensors (inputs) but less memory efficient than using float16 weights.

  3. Use a mix of float32 and float16 parameters by manually specifying parameters as float16 or float32.

Note

When PyTorch encounters a mix of float16 and float32 inputs for a given operation, it will usually cast all inputs to float32. PopTorch differs and will cast all inputs to float16. This makes it easier to build models with float32 weights which take float16 tensors. However, if you wish to follow PyTorch behaviour, you can use opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat) where opts is the poptorch.Options object passed to the model wrapping function.
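For example, to restore PyTorch's upcasting behaviour using the option named in the note above:

```python
# Follow PyTorch and upcast mixed float16/float32 inputs to float32.
opts = poptorch.Options()
opts.Precision.halfFloatCasting(
    poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)
```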

Listing 4.18 How to run a model using half precision
model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to half precision. Without this call,
# the float32 Linear parameters would automatically be cast to half when
# given half inputs, which allows training with float32 parameters but
# half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half

Because PopTorch relies on the torch.jit.trace() function, it is limited to tracing operations which run on the CPU. Many of these operations do not support float16 inputs. To allow the full range of operations, PopTorch converts all float16 inputs to float32 before tracing and then restores the inputs to float16 as part of the canonicalization process. Some operations may result in the model running in float32 where float16 would be expected, or vice versa (see Section 6.3, Float 16 operations for full details).

Graphcore’s tutorials repository contains a walkthrough on using half and mixed precision in PopTorch: Half and mixed precision tutorial.

4.8. Automatic mixed-precision casting

PopTorch supports converting your model automatically between float16 and float32. This functionality is not active by default: you must enable it explicitly by calling the autocast(enabled=True) method at model level.

Listing 4.19 Enabling automatic casting at model level
model = MyModel()
model.autocast()
poptorch_model = poptorch.inferenceModel(model)

During compilation, selected layers and operators will have their types adjusted aiming to strike a good compromise between compute efficiency, memory requirements and numerical precision.

You can also set automatic casting at the layer level. In this situation, its effect is hierarchical: changing the setting for a layer affects it and all layers contained within.

In the following example, automatic casting is enabled for all layers of the model, except for the first activation and second convolution.

Listing 4.20 Controlling automatic casting at layer level
model = torch.nn.Sequential()
model.add_module('conv1', torch.nn.Conv2d(1, 20, 5))
model.add_module('relu1', torch.nn.ReLU())
model.add_module('conv2', torch.nn.Conv2d(20, 64, 5))
model.add_module('relu2', torch.nn.ReLU())
model.autocast()
model.relu1.autocast(False)
model.conv2.autocast(False)

You can also set automatic casting with the function decorator @poptorch.autocast(enabled=True). Its effect is to apply automatic casting to the body of the function. Setting its parameter to False has the opposite effect. A typical use-case is applying it to the forward function of custom modules.

Listing 4.21 Controlling automatic casting via decorator
class MyModel(torch.nn.Module):
    @poptorch.autocast()
    def forward(self, x, y):
        return torch.bmm(x, y)


In addition, you can apply poptorch.autocast(enabled=True) to a code-block, with similar effect.

Listing 4.22 Controlling automatic casting in a code block
x = torch.randn(1, 10, 10)
y = torch.randn(1, 10, 10)
with poptorch.autocast():
    z = torch.bmm(x, y)

You can completely turn off this feature for the whole application via the autocastEnabled(bool) method of _PrecisionOptions.

Listing 4.23 Disabling automatic casting
opts = poptorch.Options()
opts.Precision.autocastEnabled(False)
poptorch_model = poptorch.inferenceModel(model, opts)

4.8.1. Custom casting policies

PopTorch provides a mechanism to customize automatic casting behaviour in the form of casting policy classes. A casting policy is defined by four sets of Torch modules and/or torch operators:

  1. fp16 - set of operations to be typed as float16

  2. fp32 - set of operations to be typed as float32

  3. promote - set of operations to be promoted to float32 should they take mixed-precision inputs

  4. demote - set of operations to be demoted to float16 should they take mixed-precision inputs

The following example describes a policy where convolution and ReLU operations are to be performed using float16, whilst batch matrix multiplication is to be performed using float32. Dot product computations will be promoted to float32 when operands have mixed precision.

Listing 4.24 Custom casting policies
fp16 = [torch.nn.Conv2d, torch.relu]
fp32 = [torch.bmm]
promote = [torch.dot]
demote = []
policy = poptorch.autocasting.Policy(fp16, fp32, promote, demote)

opts = poptorch.Options()
opts.Precision.autocastPolicy(policy)
poptorch_model = poptorch.inferenceModel(model, opts)

4.9. PyTorch buffers

PopTorch supports PyTorch buffers in some circumstances. You can use buffers to make tensors persistent, that is, to allow tensors to keep their values from the previous run on each new run, without making them model parameters. However, you must make sure that you only make in-place modifications to the buffer using PyTorch in-place operations (such as += or those ending in _). For example, you can use torch.Tensor.copy_() to copy the contents of another tensor to the buffer.

Unlike when running on the CPU, the following PyTorch code does not increment model.i on each call when running on the IPU:

Listing 4.25 The wrong way to have a persistent tensor
class CounterModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.i = torch.tensor([0.], dtype=torch.float)

    def forward(self):
        self.i += 1
        return self.i


model = CounterModel()
poptorch_model = poptorch.inferenceModel(model)
print(poptorch_model())  # tensor([6.])
print(poptorch_model())  # tensor([6.])

This is because the PyTorch tracer captures the value of model.i when tracing happens and then freezes it as a constant. In fact, the value captured is 6.0 because PyTorch has traced or called the forward method five times before it captures the constant.
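The freezing behaviour can be mimicked in plain Python (an analogy only; torch.jit.trace is not implemented this way): the "traced" function captures the value observed at trace time rather than the live attribute:

```python
class Counter:
    def __init__(self):
        self.i = 0.0

    def forward(self):
        self.i += 1
        return self.i


model = Counter()

# The tracer evaluates forward() several times before capturing a value...
for _ in range(5):
    model.forward()

# ...then the sixth evaluation's result is frozen as a constant.
frozen = model.forward()


def traced_model():
    # Returns the captured constant, never touching model.i again.
    return frozen


assert traced_model() == 6.0
assert traced_model() == 6.0  # does not increment
```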

You can keep the value of a tensor between runs by registering it as a buffer in PyTorch, as the following example shows:

Listing 4.26 An example showing a tensor which is incremented on each iteration by registering it as a buffer.
class CounterModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("i", torch.tensor([0.], dtype=torch.float))

    def forward(self):
        self.i += 1
        return self.i


model = CounterModel()
poptorch_model = poptorch.inferenceModel(model)

print(poptorch_model())  # tensor([1.])
print(poptorch_model())  # tensor([2.])

Note

When running an inference model (with inferenceModel()), any buffers which your model modifies will not be implicitly copied to the host. You will need to call copyWeightsToHost() before reading the value of a buffer which has been changed as a result of a model call.

4.10. Creating custom ops

If you need to implement functionality that is not directly supported in PopTorch, you can create a custom op.

There are two steps to creating a custom op in PopTorch:

  1. Implement the op in C++ using the PopART API

  2. Make the op available in PopTorch so you can use it in your PyTorch model

4.10.1. Implementing the custom op

You will need to implement the new op as C++ code by creating subclasses of, at least, the Op and Opx base classes provided by the PopART API.

If you are going to use the custom op for training, then you will also need to define the classes that implement the gradient operation. For details of how to do this, see the Custom operators chapter of the PopART User Guide.

You can find some examples of PopART custom ops in the Graphcore GitHub tutorials repository.

Compiling the PopART custom op will create a dynamic library file, which you can use with your PyTorch code.

4.10.2. Make the op available to PyTorch

After you have compiled the C++ implementation of the custom op, you can load the library file and call the op from your PyTorch program using the poptorch.custom_op class.

First, load the dynamic library as shown in Listing 4.27.

Listing 4.27 Loading the library for the custom op
myso = list(pathlib.Path("tests").rglob("libcustom_cube_op.*"))
assert myso, "Failed to find libcustom_cube_op"
myop = ctypes.cdll.LoadLibrary(myso[0])

You can now call your custom op using the PopTorch class custom_op.

Both the forward op and backward op are implemented in the PopART code. However, in this inference model example, only the forward op is called:

Listing 4.28 Calling a custom op in a PopTorch inference model
def test_inference():
    class BasicNetwork(nn.Module):
        def forward(self, x, bias):
            x, y = poptorch.custom_op([x, bias],
                                      "Cube",
                                      "com.acme",
                                      1,
                                      example_outputs=[x, x])
            return x, y

In this example [x, x] is assigned to example_outputs, where x is one of the input tensors which is used as a template for the output tensors. The custom op code will need to create the tensors that it returns.

You can also call this custom op inside a training model using custom_op and the backward op will be called automatically.

The Graphcore tutorials repository contains a feature example demonstrating how to load and use a custom op in a PopTorch model: PopTorch example: Custom op.

4.10.3. Passing attributes to the custom op

You can pass attributes to the custom op using a Python dictionary, as shown in Listing 4.29.

Listing 4.29 Passing an attribute to a custom op from PopTorch
    class Model(torch.nn.Module):
        def forward(self, x):
            x = poptorch.custom_op([x],
                                   "LeakyRelu",
                                   "com.acme",
                                   1,
                                   example_outputs=[x],
                                   attributes={"alpha": 0.02})
            return x[0]

You can then access these attributes within the C++ custom op code. The above example passes a Float attribute with the name alpha to the LeakyRELU implementation. See the Custom operators chapter of the PopART User Guide for more information.

Table 4.1 and the code example in Listing 4.30 show how to pass other attribute types to a custom op. PopTorch supports all attributes supported in PopART except for Graph.

Table 4.1 Python types to use to pass attributes to PopART

  PopART attribute type   Python equivalent
  Float                   Python float (converted to float32)
  Floats                  List or tuple of Python float
  Int                     Python int (converted to 64-bit signed int)
  Ints                    List or tuple of Python int
  String                  Python str (converted to ASCII)
  Strings                 List or tuple of Python str
  Graph                   Not supported

Listing 4.30 Passing different attribute types from PopTorch
def test_many_attributes_examples():
    class Model(torch.nn.Module):
        def forward(self, x):
            attributes = {
                "float_one": 1.0,
                "float_minus_two": -2.0,
                "int_zero": 0,
                "int_minus_five": -5,
                "floats_one_two_three": [1.0, 2.0, 3.0],
                "floats_minus_one_two_three": [-1.0, -2.0, -3.0],
                "ints_one_two_three": [1, 2, 3],
                "ints_minus_one_two_three": [-1, -2, -3],
                "a_string": "string with quotes and slash \" ' \\ end",
                "strs": ["abc", "def", "ghi"]
            }

            x = poptorch.custom_op([x],
                                   "ManyAttributeOp",
                                   "test.poptorch",
                                   1,
                                   example_outputs=[x],
                                   attributes=attributes)
            return x[0]

4.11. Profiling

You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.

4.12. Precompilation and caching

4.12.1. Caching

By default, PopTorch will recompile the model every time you instantiate it. However, if you often run the same models, you might want to enable executable caching to save time.

You can do this by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching().
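For example, a sketch of the programmatic route (the cache path is arbitrary, and model is assumed to be an existing torch.nn.Module):

```python
# Compiled executables will be written to, and reloaded from, this directory.
opts = poptorch.Options()
opts.enableExecutableCaching("/tmp/poptorch_cache")
poptorch_model = poptorch.inferenceModel(model, opts)
```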

Warning

The cache directory might grow large quickly because PopTorch doesn’t delete old models from the cache and, depending on the number and size of your models and the number of IPUs used, the executables might be quite large. It is your responsibility to delete the unwanted cache files.

4.12.2. Precompilation

PopTorch supports precompilation: this means you can compile your model on a machine which doesn't have an IPU and export the executable to a file. You can then reload and execute it on a different machine which does have an IPU.

Important

The PopTorch versions on both machines must be an exact match.

To precompile your model you need to wrap it using either trainingModel() or inferenceModel() then call compileAndExport() on the wrapper.

Listing 4.31 How to precompile a model using an offline IPU target.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

opts = poptorch.Options()
# You don't need a real IPU to compile the executable.
opts.useOfflineIpuTarget(ipu_target_version)

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model.compileAndExport(filename, input, target)

Note

If you don’t know the IPU version on your system you can use ipuHardwareVersion().

By default, the exported file will contain your original PyTorch model (including the weights) and enough information to re-create the PopTorch wrapper and reload the executable.

Important

For your model and weights to be exported, your model must be picklable. See https://docs.python.org/3/library/pickle.html for more information. If your model is not picklable then use export_model=False, as shown in Listing 4.34.

The PyTorch model, the PopTorch wrapper and the executable can then all be restored on the target machine using poptorch.load():

Listing 4.32 How to load a precompiled model
poptorch_model = poptorch.load(filename)

# That's all: your model is ready to be used.
poptorch_model(input, target)  # Run on IPU

In some cases you might want to provide some runtime information to select the device: you can do this using the edit_opts_fn argument of poptorch.load():

Listing 4.33 How to load a precompiled model and run on a specific IPU
def setIpuDevice(opts):
    opts.useIpuId(1)  # always use IPU 1


poptorch_model = poptorch.load(filename, edit_opts_fn=setIpuDevice)
poptorch_model(input, target)  # Run on IPU 1

Note

When loading a precompiled model, only run-time options will be applied; all others will be ignored.

Going back to the precompilation step: in some cases you might want to export only the executable and not the Python wrapper or PyTorch model (for example, if your model cannot be pickled).

Listing 4.34 How to export only the executable
1
poptorch_model.compileAndExport(filename, input, target, export_model=False)

This means you will need to re-create and wrap the model yourself before loading the executable:

Listing 4.35 How to load a precompiled executable
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)
poptorch_model.loadExecutable(filename)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model(input, target)  # Run on IPU

Important

Exported models lose their connections to other models.

For example, if you have a poptorch.trainingModel() and a poptorch.inferenceModel() based on the same PyTorch model, you wouldn’t usually need to keep the weights synchronised between the two; PopTorch would take care of it for you, implicitly.

In the following example, PopTorch automatically copies the weights from the training model to the inference model:

Listing 4.36 PopTorch implicit copies
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train the model:
for epoch in epochs:
    training_model(input, target)

# Weights are implicitly copied from the training model
# to the validation model
prediction = validation_model(input)

If you were to export these models:

Listing 4.37 Precompilation of both training and validation models
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
training_model.compileAndExport("training.poptorch", input, target)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)
validation_model.compileAndExport("validation.poptorch", input)

Note

Don’t forget to call model.eval() or model.train(), as required, before calling compileAndExport().

You could then insert explicit copy operations:

Listing 4.38 Explicit weight copies between precompiled training and validation models
training_model = poptorch.load("training.poptorch")
validation_model = poptorch.load("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Need to explicitly copy weights between the two models
    # because they're not connected anymore.
    training_model.copyWeightsToHost()
    validation_model.copyWeightsToDevice()
    run_validation(validation_model)

Or you would need to re-connect the two models by creating the second one from the first one and then loading the executable:

Listing 4.39 Re-connecting precompiled training and validation models
training_model = poptorch.load("training.poptorch")
# Create a validation python model based on the training model
validation_model = poptorch.inferenceModel(training_model)
validation_model.model.eval()
# Load the executable for that model:
validation_model.loadExecutable("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Nothing to do: training_model and validation_model are now connected
    # and PopTorch will implicitly keep the weights in sync between them.
    run_validation(validation_model)

4.13. Environment variables

4.13.1. Logging level

PopTorch uses the following levels of logging:
  • OFF: No logging

  • ERR: Errors only

  • WARN: Warnings and errors only

  • INFO: Info, warnings and errors (default)

  • DEBUG: Adds some extra debugging information

  • TRACE and TRACE_ALL: Trace everything inside PopTorch

You can use the POPTORCH_LOG_LEVEL environment variable to set the logging level:

export POPTORCH_LOG_LEVEL=DEBUG

4.13.2. Profiling

When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.

In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'

By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'

For more options, refer to the PopVision Graph Analyser User Guide.

In order to capture the pvti reports needed for the PopVision System Analyser, you only need to set PVTI_OPTIONS='{"enable":"true"}'.

You can also add extra tracepoints in your own code by using Channel.

4.13.3. IPU Model

By default PopTorch will try to attach to a physical IPU. If instead you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:

export POPTORCH_IPU_MODEL=1

See the Poplar and PopLibs User Guide for the limitations of the IPU Model.

4.13.4. Wait for an IPU to become available

By default, attempting to attach to an IPU when all IPUs are already in use will raise an exception. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1.

export POPTORCH_WAIT_FOR_IPU=1

4.13.5. Enable executable caching

You can enable executable caching by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching.

export POPTORCH_CACHE_DIR=/tmp/poptorch_cache

For more information, see Section 4.12.1, Caching.