4. Features

4.1. Options

You can change how PopTorch compiles and executes models using poptorch.Options. You can find a full list of options in Section 10.1, Options. Broadly speaking, the options fall into the following categories:

  1. General options (see Options)

  2. Options related to half precision (see opts.Precision.*)

  3. Management of the training process (see opts.Training.*)

  4. Location of tensors (see: opts.TensorLocations.* and TensorLocationSettings)

  5. Options relevant to the Torch JIT compiler (see opts.Jit.*)

  6. Control of distributed execution environments when using tools other than PopRun (see opts.Distributed.*)

See Section 5, Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation, and replication_factor interact with the output and input sizes.

You can choose to use the IPU Model instead of IPU hardware with the useIpuModel() option.

4.1.1. Setting options via config file

In addition to setting these options programmatically, you can also set them in a config text file by using loadFromFile().

Each line in the file must contain a single command corresponding to setting an option in Options. To set an option within the file, write the command as you would within a Python script, but omit the "options." prefix. For example:

Listing 4.1 Example contents of a config file used to set options
deviceIterations(1)
setExecutionStrategy(poptorch.ShardedExecution())
replicationFactor(1)
enableSyntheticData(True)

Then, instantiate Options and call loadFromFile():

Listing 4.2 Setting options using a config file named “poptorch.conf”
opts = poptorch.Options()
opts.loadFromFile("tmp/poptorch.conf")
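Conceptually, loadFromFile() treats each line of the file as a method call on the Options instance. The sketch below illustrates this behaviour with a hypothetical stand-in class; FakeOptions and load_from_config_text are illustrative only and are not part of the PopTorch API:

```python
# Rough sketch of how a config file maps lines to Options method calls.
# ``FakeOptions`` and ``load_from_config_text`` are hypothetical stand-ins
# for illustration; they are NOT part of PopTorch.
class FakeOptions:
    def __init__(self):
        self.settings = {}

    def deviceIterations(self, n):
        self.settings["device_iterations"] = n
        return self

    def replicationFactor(self, n):
        self.settings["replication_factor"] = n
        return self


def load_from_config_text(options, text):
    # Each non-empty line becomes a method call on the options object.
    for line in text.splitlines():
        line = line.strip()
        if line:
            eval("options." + line)
    return options


opts = load_from_config_text(FakeOptions(),
                             "deviceIterations(4)\nreplicationFactor(2)")
print(opts.settings)  # {'device_iterations': 4, 'replication_factor': 2}
```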

4.2. Model wrapping functions

The basis of PopTorch integration comes from the two model wrapping functions described in the following sections.

Note

PopTorch makes a shallow copy of the model. Changes to the parameters of the models returned by these two model wrapping functions affect the original model, and vice versa. However, primitive variable types, such as the training bool of torch.nn.Module, are not kept in sync. If your PyTorch model is named model, call model.eval() or model.train(), if required, before calling these wrapping functions.

4.2.1. poptorch.trainingModel

This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in training mode. See trainingModel() for more information.

Listing 4.3 An example of the use of trainingModel
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)
ones = torch.ones(10)

# Train on IPU.
for i in range(0, 800):
    # Each call here executes the forward pass, loss calculation, and backward
    # pass in one step.
    # Model input and loss function input are provided together.
    poptorch_out, loss = poptorch_model(input, target)
    print(f"{i}: {loss}")

# Copy the trained weights from the IPU back into the host model.
poptorch_model.copyWeightsToHost()

# Execute the trained weights on host.
model.eval()
native_out = model(input)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-04, atol=1e-04)

Note

By default, PopTorch will only return the final batch of outputs. Please see Section 5.6, poptorch.Options.Training.anchorReturnType for details on what PopTorch returns when using trainingModel() and how you can calculate statistics such as training accuracy over all batches.

4.2.2. poptorch.inferenceModel

This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in inference mode. See inferenceModel() for more information.

Listing 4.4 An example of the use of inferenceModel
import torch
import torchvision
import poptorch

# Some dummy imagenet sized input.
picture_of_a_cat_here = torch.randn([1, 3, 224, 224])

# The model, in this case a MobileNet model with pretrained weights that comes
# canned with Pytorch.
model = torchvision.models.mobilenet_v2(pretrained=True)
model.train(False)

# Wrap in the PopTorch inference wrapper
inference_model = poptorch.inferenceModel(model)

# Execute on IPU.
out_tensor = inference_model(picture_of_a_cat_here)

# Get the top 5 ImageNet classes.
top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
print(top_five_classes)

# Try the same on native PyTorch
native_out = model(picture_of_a_cat_here)

native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
assert any(top_five_classes[1][0] == native_top_five_classes[1][0])

You can also run a model in half precision by converting its parameters and inputs to half before wrapping it:

model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to halfs. Without doing so,
# the Linear parameters will automatically be cast to half, which allows
# training with float32 parameters but half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half

4.2.3. poptorch.PoplarExecutor

You should not create this class directly. It is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It has a few methods which you can use to interface with the IPU.

The PoplarExecutor will implicitly keep the parameters of the source PyTorch model and the PopTorch model(s) in sync. However, you need to copy the weights explicitly if the model is trained on the CPU and inference is then run on the IPU.

See PoplarExecutor for a complete description of the IPU interface functionality.

Listing 4.5 Example of when explicit copies are needed
model = Model()

model.eval()
poptorch_inf = poptorch.inferenceModel(model)

# Switch to "train" mode for "poptorch.trainingModel"; poptorch_inf will remain in "eval" mode
model.train()
poptorch_train = poptorch.trainingModel(model)

# train on IPU
train(poptorch_train)
torch.save(model.state_dict(), "model.save")  # OK

# Already in "eval" mode
validate(poptorch_inf)  # OK

# switch to "eval" mode for CPU
model.eval()
validate(model)  # OK

# train on CPU
model.train()
train_on_cpu(model)

# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)

4.2.4. poptorch.isRunningOnIpu

One useful utility function is isRunningOnIpu(). This returns True when executing on the IPU and False when executing the model outside IPU scope. This allows for different code paths within the model.

A common use case is executing equivalent code to a PopART custom operator when running on the CPU. For example:

class Network(torch.nn.Module):
    def forward(self, x, y):
        if poptorch.isRunningOnIpu():
            # IPU path
            return my_custom_operator(x, y)
        else:
            # CPU path
            return my_torch_implementation(x, y)

4.3. Error handling

4.3.1. Recoverable runtime errors

This category of error is likely to be transient.

Exception type raised by PopTorch: poptorch.RecoverableError (inherits from poptorch.Error)

The exception contains the action required to recover from this error in its recovery_action string attribute.

This attribute can contain:
  • IPU_RESET: Reset the IPU and reload the IPU memory.

  • PARTITION_RESET: Reset the IPU partition. This resets the IPU-links between IPUs.

  • FULL_RESET: Power cycle the system.

4.3.2. Unrecoverable runtime errors

These errors are likely to persist. You should take the system out of operation for analysis and repair.

Exception type raised by PopTorch: poptorch.UnrecoverableError (inherits from poptorch.Error)

4.3.3. Application and other errors

This kind of error is due to an error in the program or a misuse of an API.

Exception type raised by PopTorch: poptorch.Error if the error was detected in the C++ backend, or some generic Python Exception if it happened in the Python layer.

poptorch.Error has the following string attributes:
  • message The error message without any of the context.

  • type The part of the software stack that raised the exception and the category of the error if available.

  • location Where the exception was raised.

Example:

Listing 4.6 How to handle recoverable / unrecoverable errors
    try:
        m = PytorchModel(model_param)
        inference_model = poptorch.inferenceModel(m)
        t1 = torch.tensor([1.])
        t2 = torch.tensor([2.])
        assert inference_model(t1, t2) == 3.0
    except poptorch.RecoverableError as e:
        print(e)
        if e.recovery_action == "FULL_RESET":
            reboot_server()
        elif e.recovery_action == "IPU_RESET":
            print("Need to reset the IPU")
        elif e.recovery_action == "PARTITION_RESET":
            print("Need to reset the partition")
    except poptorch.UnrecoverableError as e:
        print(f"Unrecoverable error: machine needs to be taken offline: {e}")
        shutdown_system()
    except poptorch.Error as e:
        print(f"Received {e.message} from component {e.type}, "
              f"location: {e.location}")
        # Or you could just print all the information at once:
        print(e)
    except Exception as e:
        print(e)

4.4. Multi-IPU execution strategies

This section describes strategies to run PopTorch code on more than one IPU. Some of these allow you to run code in parallel on multiple IPUs. You will need to use one of these execution strategies for PopTorch code that does not fit on a single IPU.

Note

In general, we advise pipelining over as few IPUs as possible. However, you may need to experiment to find the optimal pipeline length; in some corner cases, a longer pipeline can lead to faster throughput.

There are four execution strategies that you can use to run a model on a multi-IPU device:

  • PipelinedExecution

  • ShardedExecution

  • SerialPhasedExecution

  • ParallelPhasedExecution

You can select a strategy with the setExecutionStrategy() option.

The default execution strategy is PipelinedExecution.

The following sections first introduce the general annotation functions that are relevant to all four execution strategies, and then explain each strategy with examples.

By default, PopTorch will not let you run a model if the number of IPUs used is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. Alternatively, you can enable autoRoundNumIPUs() to automatically round the number of IPUs reserved up to a power of 2, with the excess IPUs being reserved but idle. This option is not enabled by default, to prevent unintentional overbooking of IPUs.
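The rounding behaviour can be sketched as follows (round_up_to_power_of_two is a hypothetical helper for illustration, not a PopTorch function):

```python
def round_up_to_power_of_two(num_ipus):
    # Mirrors the effect of autoRoundNumIPUs(True): the number of IPUs
    # actually reserved is the next power of 2, with the excess idle.
    reserved = 1
    while reserved < num_ipus:
        reserved *= 2
    return reserved


for requested in (1, 2, 3, 5, 8):
    print(requested, "->", round_up_to_power_of_two(requested))
```

For example, a model annotated to use 5 IPUs would reserve 8, leaving 3 idle.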

4.4.1. Annotations

Model partitioning using blocks

BeginBlock is a wrapper class, Block is a context manager, and BlockFunction() is a function decorator. These partition models into “blocks” that can be executed on different IPUs. You can use them to define model sharding on a multi-IPU device.

You can use BeginBlock to annotate an existing model. Each call, with example arguments (layer_n, ipu_id=m), places layer_n and all subsequent layers on IPU m; the layers before layer_n remain on the IPU assigned by any earlier annotation (IPU 0 if there is none).

Listing 4.7 Annotating existing layers
import transformers
import torch
import poptorch

# A bert model from hugging face. See the packaged BERT example for actual usage.
pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'


# For later versions of transformers, we need to wrap the model and set
# return_dict to False
class WrappedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wrapped = transformers.BertForQuestionAnswering.from_pretrained(
            pretrained_weights)

    def forward(self, input_ids, attention_mask, token_type_ids):
        return self.wrapped.forward(input_ids,
                                    attention_mask,
                                    token_type_ids,
                                    return_dict=False)

    def __getattr__(self, attr):
        try:
            return torch.nn.Module.__getattr__(self, attr)
        except AttributeError:
            return getattr(self.wrapped, attr)


model = WrappedModel()

# A handy way of seeing the names of all the layers in the network.
print(model)

# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
                                                  ipu_id=1)

# Now all layers from the previous block up to (but not including) this
# layer are on IPU 1, and this layer onwards is on IPU 2
model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
                                                  ipu_id=2)

# Finally, all layers from this layer to the end of the network are on IPU 3.
model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
                                                  ipu_id=3)

# We must batch the data by at least the number of IPUs. Each IPU will still execute
# whatever the model batch size is.
data_batch_size = 4

# Create a poptorch.Options instance to override default options
opts = poptorch.Options()
opts.deviceIterations(data_batch_size)

You can use Block to annotate a model from within its definition. This context manager class defines a scope in the context of the model. Everything within that scope is placed on the specified IPU (unless overridden by a Stage).

Listing 4.8 Annotating a model directly
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):

        # Explicit layers on a certain IPU
        poptorch.Block.useAutoId()
        with poptorch.Block(ipu_id=0):
            x = self.act(self.layer1(x))

        with poptorch.Block(ipu_id=1):
            x = self.act(self.layer2(x))

        with poptorch.Block(ipu_id=2):
            x = self.act(self.layer3(x))
            x = self.act(self.layer4(x))

        with poptorch.Block(ipu_id=3):
            x = self.softmax(x)
        return x


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))

In addition, you can use the BlockFunction() function decorator to place functions (containing one or more layers) onto a particular block. Everything within that function is placed on the specified IPU (unless overridden by a Stage).

Listing 4.9 Annotating functions
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        poptorch.Block.useAutoId()
        x = self.block_one(x)
        x = self.block_two(x)
        x = self.final_activation(x)
        return x

    @poptorch.BlockFunction(ipu_id=0)
    def block_one(self, x):
        x = self.act(self.layer1(x))
        x = self.act(self.layer2(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def block_two(self, x):
        x = self.act(self.layer3(x))
        x = self.act(self.layer4(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def final_activation(self, x):
        return self.softmax(x)


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))

You can use any, or a combination, of these three annotation options. In the above examples, ipu_id is used to specify blocks. This alone is sufficient to enable parallel execution: by default, AutoStage will set up a pipeline in which the pipeline stage of each block is equal to its ipu_id. However, it is equally valid to instead use the user_id argument to assign a name to each block, and then assign each named block to a stage manually using the Stage or Phase classes, as outlined in the next sections.

BeginBlock, Block and BlockFunction() need to follow a set of rules:

  • You must declare all the layers inside a Block scope to avoid missing annotations. BeginBlock doesn’t have the same constraint because all the layers called after this will automatically be added to the last BeginBlock.

  • Note that PopTorch needs to reserve IPUs in powers of 2. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.

  • You should not include unused or dead layers in any BeginBlock or Block.

  • If layer A happens before layer B inside the model and each layer has a BeginBlock associated with it, you need to write BeginBlock for layer A before BeginBlock for layer B.

Failing to obey the above rules will result in compilation errors.

poptorch.Stage and poptorch.AutoStage

Conceptually, BeginBlock and Block collect the layers of a model into a Stage. You can combine multiple stages into a Phase. Multiple phases form an execution strategy.

poptorch.Stage

Stage defines the layers of the model to run on one IPU. A stage can consist of one or more blocks created using BeginBlock or Block and identified by their user_id.

You can define consecutive layers in a model in either the same stage or consecutive stages. Whether stages run in parallel or sequentially depends on the specific execution strategy.

Internally, each operation in a model is assigned a stage_id through Stage.

poptorch.AutoStage

You can use AutoStage if you don’t want to specify stages by hand. This will assign one Stage per BeginBlock or Block.

By default, AutoStage.SameAsIpu is in effect, which means that the stage_id of each Stage will be set to the ipu_id specified for its BeginBlock or Block.

Note that stage_id values must be ascending in PipelinedExecution. Consider the code example above: if blocks “0”, “1” and “2” were assigned to IPUs 0, 1 and 0 respectively, then block “2” would be assigned stage_id 0. This would cause the compiler to fail to schedule the last two stages, “1” and “2”, due to a conflict:

  • The model implies “1” should run earlier than “2”

  • Their stage_id values suggest “2” should run earlier than “1”

When AutoStage.AutoIncrement is true, each new BeginBlock or Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.
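The difference between the two policies can be sketched in plain Python (the helper functions below are illustrative only, not part of the PopTorch API):

```python
def auto_stage_ids(block_ipu_ids, auto_increment=False):
    # AutoStage.SameAsIpu: each block's stage_id equals its ipu_id.
    # AutoStage.AutoIncrement: stage_ids simply count up, one per block.
    if auto_increment:
        return list(range(len(block_ipu_ids)))
    return list(block_ipu_ids)


def schedulable_as_pipeline(stage_ids):
    # PipelinedExecution requires stage_ids in ascending order.
    return all(a <= b for a, b in zip(stage_ids, stage_ids[1:]))


# Blocks "0", "1", "2" placed on IPUs 0, 1 and 0 again:
ipus = [0, 1, 0]
print(auto_stage_ids(ipus))                       # [0, 1, 0]
print(schedulable_as_pipeline(auto_stage_ids(ipus)))  # False: conflict
print(auto_stage_ids(ipus, auto_increment=True))  # [0, 1, 2]
print(schedulable_as_pipeline(auto_stage_ids(ipus, auto_increment=True)))  # True
```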

poptorch.Phase

Phase defines a processing unit of phased execution. It can contain one or more instances of Stage.

Phase is only used in SerialPhasedExecution and ParallelPhasedExecution. It is not used in ShardedExecution and PipelinedExecution.

Listing 4.10 Example of Stage declaration
class Model(torch.nn.Module):
    def forward(self, x, y):
        with poptorch.Block("A"):
            c = x + x
        with poptorch.Block("B"):
            d = y + y
        with poptorch.Block("C"):
            e = x * 3

        return c, d, e


first = poptorch.Phase(poptorch.Stage("A").ipu(0))
# Regrouped in a single stage
second = poptorch.Phase(poptorch.Stage("B", "C").ipu(1))
# 2 separate stages
second = poptorch.Phase(poptorch.Stage("B").ipu(1), poptorch.Stage("C").ipu(3))

In the code snippet above, “B” and “C” will run sequentially on IPU 1 when they are regrouped in a single stage. When they are placed in two separate stages, they will run in parallel on IPUs 1 and 3 simultaneously.

Advanced annotation with strings

You can use Python strings to represent the user_id and ipu_id for a Block or BeginBlock. Because strings are evaluated at runtime, they allow for a dynamic number of stages and phases.

Here is an example showing how to use formatted strings (f-strings) in ParallelPhasedExecution.

In Listing 4.11, there are several places where f-strings are used:

  • Line 25: f"phase{phase}_ipu{ipu}", where phase takes the values 0, 1, 1, 2, 3, 3, 4, 5 and 5, and ipu ranges over 0 and 1. The total number of instances of this f-string is 12: 6 phases times 2 IPUs.

  • Line 32: f"phase{N*2-1}_ipu1", where the phase is 5 and the ipu is 1.

  • Lines 47-48 and 51-52: when defining the stages, four f-strings are used, where n ranges from 0 to 2:

    • f"phase{2*n}_ipu0"

    • f"phase{2*n}_ipu1"

    • f"phase{2*n+1}_ipu0"

    • f"phase{2*n+1}_ipu1"

    These refer to phases 0, 2, 4 and 1, 3, 5, on ipu0 and ipu1 respectively. All 12 f-strings are thus defined in Block and used in Stage dynamically, and they match exactly.

Listing 4.11 An example of parallel phased execution
 1  poptorch.setLogLevel("DEBUG")  # Force debug logging
 2  N = 3
 3  size = 10
 4
 5
 6  class Model(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.weights = []
10          for n in range(N * 6):
11              weight = torch.nn.Parameter(torch.rand(size, size),
12                                          requires_grad=True)
13              self.register_parameter(f"w{n}", weight)
14              self.weights.append(weight)
15
16      def forward(self, in0, target=None):
17          phase = 0
18          weight = iter(self.weights)
19          with poptorch.Block("phase0_ipu0"):
20              ins = torch.split(in0, size)
21          for n in range(N * 3):
22              out = []
23              for ipu in range(2):
24                  x = ins[ipu]
25                  with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                      x = torch.matmul(next(weight), x)
27                      out.append(F.relu(x))
28              ins = out[1], out[0]
29              # We want 2 matmuls in the same phase
30              if n % 3 != 1:
31                  phase += 1
32          with poptorch.Block(f"phase{N*2-1}_ipu1"):
33              res = ins[0] + ins[1]
34              if target is None:
35                  return res
36              return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39  input = torch.rand(size * 2, 1)
40  target = torch.rand(size, 1)
41  model = Model()
42  opts = poptorch.Options()
43  phases = []
44  # Alternate between 0-2 and 1-3
45  for n in range(N):
46      phases.append([
47          poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48          poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49      ])
50      phases.append([
51          poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52          poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53      ])
54  opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55  poptorch_model = poptorch.trainingModel(model, opts)
56  poptorch_model.compile(input, target)

With the above functions as building blocks, you can set execution strategies using the four kinds of execution modes, as shown below.

4.4.2. Available execution strategies

Note that you can use the same annotations with each execution strategy. The strategies differ only in the method of parallelisation and in tensor locations.

Pipelined execution

PipelinedExecution is the default execution strategy. It extends Sharded execution with parallel execution on multiple IPUs.

Parallelisation in PipelinedExecution requires deviceIterations() (required for inference only, but speeds up training) and gradientAccumulation() (for training only) as explained in Section 5, Efficient data batching. deviceIterations() must be greater than or equal to the number of IPUs used by the model. gradientAccumulation() must be greater than or equal to the number of pipeline stages (forward and backward). As well as these constraints, you must also consider the batch dimension, which must be a multiple of deviceIterations() * replicationFactor() * gradientAccumulation() during training and deviceIterations() * replicationFactor() during inference.
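The batch-dimension constraint can be sketched as simple arithmetic (host_batch_multiple is a hypothetical helper for illustration, not part of PopTorch):

```python
def host_batch_multiple(device_iterations, replication_factor,
                        gradient_accumulation, training):
    # The batch dimension seen by the host must be a multiple of:
    #   deviceIterations * replicationFactor * gradientAccumulation (training)
    #   deviceIterations * replicationFactor                        (inference)
    multiple = device_iterations * replication_factor
    if training:
        multiple *= gradient_accumulation
    return multiple


# For example, with 4 device iterations, replication factor 2, and
# 8 gradient accumulation steps:
print(host_batch_multiple(4, 2, 8, training=True))   # 64
print(host_batch_multiple(4, 2, 8, training=False))  # 8
```

With a model batch size of 1, the host would therefore need to supply batches of 64 samples during training and 8 during inference in this configuration.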

After one stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel.

An IPU can only start its own stage of a batch after its previous stage of the current batch has been processed. Hence, all IPUs will be occupied after a “warm-up” period.

At the end of processing, a “cool-down” period is required to aggregate the results and apply weight updates.

Sharded execution

In this strategy, each IPU sequentially executes a distinct part of the model. A single unit of processing in ShardedExecution is called a shard.

A shard is specified using Stage, or, if no Stage is specified, by the user_id passed to BeginBlock or Block. Each shard is executed sequentially on a single IPU. You can place multiple shards on multiple IPUs, but only one IPU is used at a time while the other IPUs are idle. If consecutive stages are allocated to the same IPU, PopART will merge them into a single stage on that IPU. Weights and activations use the on-chip memory of the IPUs. You need to place layers that share weights on the same IPU.

ShardedExecution can be useful for processing a single sample or for debugging. Overall, it has low efficiency because only one IPU is used at a time.

Phased execution

ParallelPhasedExecution and SerialPhasedExecution have the following features in common:

  • A portion of the weights and activations are transferred to and from streaming memory before and after each phase. The portion transferred is the one needed by the layers wrapped in BeginBlock or Block in the current Phase.

  • If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.

  • Both trade off some performance to support larger models with higher memory needs.

  • Any number of phases is allowed.

  • The number of stages in each Phase should match the number of IPUs in each group of IPUs.

  • Stages inside each Phase can run in parallel.

Although you only define the Phase for the forward pass, the corresponding phases for the backward pass are created automatically. The order of phased execution for the backward pass won’t change, but you can decide whether a phase is shared by both the forward and backward passes. In other words, you can decide whether to avoid the memory transfer of a portion of the weights and activations.

Serial phased execution

In SerialPhasedExecution, phases execute on a single group of IPUs sequentially.

Listing 4.12 How to use SerialPhasedExecution
strategy = poptorch.SerialPhasedExecution(
    poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
    poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
    poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2")))

strategy.phase(0).ipus(0, 1)
strategy.phase(1).ipus(0, 1)
strategy.phase(2).ipus(0, 1)

opts.setExecutionStrategy(strategy)

The code above causes all phases to run serially on IPUs 0 and 1: stages “A”, “B” and “C” run on IPU 0, and “A2”, “B2” and “C2” on IPU 1.

Parallel phased execution

In ParallelPhasedExecution, phases are executed in parallel alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from the streaming memory, if the desired weights and activations are already available in another group of IPUs.

Listing 4.13 How to use ParallelPhasedExecution
strategy = poptorch.ParallelPhasedExecution(
    poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
    poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
    poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5")))

strategy.phase(0).ipus(0, 2)
strategy.phase(1).ipus(1, 3)
strategy.phase(2).ipus(0, 2)

opts.setExecutionStrategy(strategy)

In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so the number of stages matches the number of IPUs in each group. The even phases 0 and 2 run on IPUs 0 and 2, while the odd phase 1 runs on IPUs 1 and 3. This allows for faster cross-IPU copies, both inter-phase and intra-phase.

poptorch.Liveness

Liveness controls the availability of tensors on IPU, and is only needed for ParallelPhasedExecution and SerialPhasedExecution.

The default Liveness is AlwaysLive. OffChipAfterFwd, OffChipAfterFwdNoOverlap and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.

4.5. Optimizers

PopTorch supports the following optimizers:

  1. SGD

  2. Adam

  3. AdamW

  4. RMSprop

  5. LAMB

In addition, PopTorch has features to support float16 models, such as loss scaling, velocity scaling, bias correction and accumulator types.

Important

All of these extra attributes (except velocity_scaling) must have the same values across param_groups, and therefore you must set them at the optimizer level.

Listing 4.14 How to update values in an Optimizer
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
poptorch_model = poptorch.trainingModel(model, options, opt)
poptorch_model(input, target)
# Update optimizer attribute
opt.loss_scaling = 1.0
# Update param_group attribute
opt.param_groups[0]["loss_scaling"] = 1.0
# Set the new optimizer in the model
poptorch_model.setOptimizer(opt)
poptorch_model(input, target)

Important

You must call setOptimizer() to apply the new optimizer values to the model.

4.5.1. Loss scaling

When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.

Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.

Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
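As a sanity check of the claim that loss scaling leaves the effective update unchanged, here is a toy one-parameter example in plain Python (the helper names are hypothetical, for illustration only):

```python
def grad_wrt_w(w, x, t):
    # Gradient of the squared-error loss (w*x - t)**2 with respect to w.
    return 2 * (w * x - t) * x


def sgd_step_with_loss_scaling(w, x, t, lr, loss_scaling):
    # Scaling the loss scales the gradient by the same factor...
    scaled_grad = loss_scaling * grad_wrt_w(w, x, t)
    # ...and multiplying by the inverse scale before the update undoes it.
    grad = scaled_grad / loss_scaling
    return w - lr * grad


# A power-of-two scale keeps this toy float arithmetic exact, so both
# calls produce an identical weight update.
print(sgd_step_with_loss_scaling(0.5, 2.0, 1.5, 0.1, 1.0))     # 0.7
print(sgd_step_with_loss_scaling(0.5, 2.0, 1.5, 0.1, 1024.0))  # 0.7
```

In float16, the benefit is that the intermediate scaled gradients sit further away from the underflow threshold; the final update is unaffected.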

You can either set the loss_scaling factors manually, or you can set setAutomaticLossScaling() in opts.Training, which will automatically set a global loss scaling factor. If you both set loss_scaling manually and enable automatic loss scaling, the manually set factor(s) will be used initially and updated automatically during training.

Warning

Automatic loss scaling is an experimental feature and may not behave as expected.

4.5.2. Velocity scaling (SGD combined variant only)

The SGD optimizer, when used with momentum, updates weights based on the velocity values. The combined variant uses one tensor per parameter to store the velocity and the changes to the velocity from accumulated gradients. Unlike the separate variant, each gradient accumulation step therefore involves adding or subtracting values whose magnitude differs from that of the gradients (for which loss scaling is used). You can use the velocity_scaling parameter to scale the combined velocity tensor to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling, so loss_scaling has no impact on the effective scaling of the velocity values.)

As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
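The cancellation described in the parenthetical note above can be checked with plain Python arithmetic (illustrative only; this is not PopTorch code):

```python
loss_scaling = 128.0
velocity_scaling = 64.0
true_grad = 1e-4

scaled_grad = true_grad * loss_scaling  # gradients carry the loss scale
# The combined velocity tensor is held at velocity_scaling, so the update is
# effectively scaled by velocity_scaling / loss_scaling ...
velocity_update = scaled_grad * velocity_scaling / loss_scaling
# ... and loss_scaling cancels: only velocity_scaling affects the velocity.
assert velocity_update == true_grad * velocity_scaling
```

Because both factors are powers of two here, the arithmetic is exact and the equality holds bit-for-bit.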

4.5.3. Accumulation types

To improve numerical stability, some of the optimizers (LAMB, Adam, AdamW, RMSprop) give you the option to choose the data types used by the optimizer's accumulators.

accum_type lets you choose the type used for gradient accumulation. first_order_momentum_accum_type and second_order_momentum_accum_type give you control over the types used to store the first-order and second-order momentum optimizer states, respectively.
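As a sketch using the parameter names listed above (assuming an existing torch.nn.Module called model, and that torch and poptorch are imported), you might keep gradient accumulation and first-order state in float16 while retaining float32 for the second-order state:

```python
# Sketch only: keep memory-hungry accumulators in half precision while
# preserving float32 for the second-order momentum state.
opt = poptorch.optim.Adam(model.parameters(),
                          lr=1e-3,
                          accum_type=torch.float16,
                          first_order_momentum_accum_type=torch.float16,
                          second_order_momentum_accum_type=torch.float32)
```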

4.5.4. Constant attributes

To improve performance and/or save memory, PopTorch will try to embed constant attributes directly in the program.

Important

Trying to modify a constant attribute after the model has been compiled will result in an error.

For PopTorch optimizers (those from the poptorch.optim namespace), the attributes you explicitly pass to the optimizer's constructor will, by default, be considered variable; all others will be considered constant.

You can override this behaviour using markAsConstant() and markAsVariable() before compiling the model.

Listing 4.15 Constant and variable attributes for PopTorch optimizers
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         use_combined_accum=False)
# momentum and loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01, use_combined_accum=False)
# lr and momentum will be marked as variable.
# loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsConstant("loss_scaling")
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsVariable("momentum")

For native optimizers (those from the torch.optim namespace), attributes left at their default values in the constructor will be considered constant.

There is no way to override this behaviour, which is why we recommend that you always use the poptorch.optim optimizers instead.

Listing 4.16 Constant and variable attributes for Torch optimizers
# momentum will be marked as constant (It's not set)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# lr will be marked as variable.
# momentum will still be marked as constant (Because its default value is 0.0)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
# lr and momentum will both be marked as variable.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=1.0)

Note

There is an exception: lr is always marked as variable.

4.6. PopTorch ops

This section describes some “helper” operations you can use within a model.

4.6.1. poptorch.ctc_beam_search_decoder

This function adds a Connectionist Temporal Classification (CTC) beam search decoder operator to the model.

class Model(torch.nn.Module):
    def forward(self, log_probs, lengths):
        return poptorch.ctc_beam_search_decoder(log_probs, lengths)


For more information see: ctc_beam_search_decoder().

4.6.2. poptorch.ipu_print_tensor

This function adds an op to print the content of a tensor on the IPU.

Note

To prevent the print operation being optimised out by the graph optimiser, you must use the return value of ipu_print_tensor().

class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        x = x + 1

        # It is important to make sure the result of the print is used.
        x = poptorch.ipu_print_tensor(x)

        return x + self.bias


For more information see: ipu_print_tensor().

4.6.3. poptorch.identity_loss

You can use this function to implement custom losses. It takes a single PyTorch tensor and will backpropagate a gradient of ones through it.

Listing 4.17 Example of custom loss.
def custom_loss(output, target):
    # Mean squared error with a scale
    loss = output - target
    loss = loss * loss * 5
    return poptorch.identity_loss(loss, reduction="mean")


class ExampleModelWithCustomLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = ExampleModel()

    def forward(self, input, target):
        out = self.model(input)
        return out, custom_loss(out, target)


For more information see: identity_loss().

4.6.4. poptorch.MultiConv

Use the MultiConv wrapper class to define multi-convolutions.

Refer to the PopLibs documentation for multi-convolutions for further information.

For more information see: MultiConv and MultiConvPlanType.

4.6.5. poptorch.nop

PopTorch includes a “no-op” function for debugging purposes.

For more information see: nop().

4.6.6. poptorch.serializedMatMul

Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.

For more information see: serializedMatMul().
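The memory saving comes from computing the product in slices rather than in one go. This plain-Python sketch (illustrative only, not the PopTorch API) shows that multiplying slices of the left-hand operand reproduces the full result, while only one slice's partial product needs to be live at a time:

```python
def matmul(a, b):
    # Naive list-of-lists matrix multiplication.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 2], [3, 4]]

full = matmul(a, b)

# "Serialized" version: process two rows of `a` at a time.
serialized = []
for i in range(0, len(a), 2):
    serialized.extend(matmul(a[i:i + 2], b))

assert serialized == full
```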

4.6.7. poptorch.set_available_memory

Use this function to override the default proportion of tile memory available as temporary memory for use by operations such as a convolution or matrix multiplication. The operators that can be tuned with this setting include:

  • convolution

  • matrix multiplication

  • embedding lookup

  • indexing operations

For more information see: set_available_memory().

4.6.8. Miscellaneous functions

The following PopTorch functions, not related to model creation, are available:

4.7. Half / float16 support

PopTorch supports the half-precision floating point (float16) format. You can simply input float16 tensors into your model. (You can convert a tensor to float16 using tensor = tensor.half().)

You can use your models in one of the following ways:

  1. Convert all parameters (weights) to float16 by using a Module's half() method. This is the most memory-efficient approach; however, small updates to weights may be lost, hindering training.

  2. Keep the parameters (weights) as float32, in which case the parameter updates will occur using float32. However, the parameters will be converted to float16 if you call an operation with a float16 input. This is more memory efficient than using float32 tensors (inputs) but less memory efficient than using float16 weights.

  3. Use a mix of float32 and float16 parameters by manually specifying parameters as float16 or float32.

Note

When PyTorch encounters a mix of float16 and float32 inputs for a given operation, it will usually cast all inputs to float32. PopTorch differs and will cast all inputs to float16. This makes it easier to build models with float32 weights which take float16 tensors. However, if you wish to follow PyTorch behaviour, you can use opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat) where opts is the poptorch.Options object passed to the model wrapping function.
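For example, to restore PyTorch's upcasting behaviour using the option named in the note above:

```python
# Follow PyTorch and upcast mixed float16/float32 inputs to float32.
opts = poptorch.Options()
opts.Precision.halfFloatCasting(
    poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)
```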

Listing 4.18 How to run a model using half precision
model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to half precision. Without this call,
# the float32 Linear parameters would automatically be cast to half when
# given half inputs, which allows training with float32 parameters but
# half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half

Because PopTorch relies on the torch.jit.trace() function, it is limited to tracing operations which run on the CPU. Many of these operations do not support float16 inputs. To allow the full range of operations, PopTorch converts all float16 inputs to float32 before tracing and then restores the inputs to float16 as part of the canonicalization process. Some operations may result in the model running in float32 where float16 would be expected, or vice versa (see Section 6.3, Float 16 operations for full details).

Graphcore’s tutorials repository contains a walkthrough on using half and mixed precision in PopTorch: Half and mixed precision tutorial.

4.8. Automatic mixed-precision casting

PopTorch supports converting your model automatically between float16 and float32. This functionality is not active by default: you must enable it explicitly by calling the autocast(enabled=True) method at model level.

Listing 4.19 Enabling automatic casting at model level
model = MyModel()
model.autocast()
poptorch_model = poptorch.inferenceModel(model)

During compilation, selected layers and operators will have their types adjusted aiming to strike a good compromise between compute efficiency, memory requirements and numerical precision.

You can also set automatic casting at the layer level. In this situation, its effect is hierarchical: changing the setting for a layer affects it and all layers contained within.

In the following example, automatic casting is enabled for all layers of the model, except for the first activation and second convolution.

Listing 4.20 Controlling automatic casting at layer level
model = torch.nn.Sequential()
model.add_module('conv1', torch.nn.Conv2d(1, 20, 5))
model.add_module('relu1', torch.nn.ReLU())
model.add_module('conv2', torch.nn.Conv2d(20, 64, 5))
model.add_module('relu2', torch.nn.ReLU())
model.autocast()
model.relu1.autocast(False)
model.conv2.autocast(False)

You can also set automatic casting with the function decorator @poptorch.autocast(enabled=True). Its effect is to apply automatic casting to the body of the function. Setting its parameter to False has the opposite effect. A typical use-case is applying it to the forward function of custom modules.

Listing 4.21 Controlling automatic casting via decorator
class MyModel(torch.nn.Module):
    @poptorch.autocast()
    def forward(self, x, y):
        return torch.bmm(x, y)


In addition, you can apply poptorch.autocast(enabled=True) to a code-block, with similar effect.

Listing 4.22 Controlling automatic casting in a code block
x = torch.randn(1, 10, 10)
y = torch.randn(1, 10, 10)
with poptorch.autocast():
    z = torch.bmm(x, y)

You can completely turn off this feature for the whole application via the autocastEnabled(bool) method of _PrecisionOptions.

Listing 4.23 Disabling automatic casting
opts = poptorch.Options()
opts.Precision.autocastEnabled(False)
poptorch_model = poptorch.inferenceModel(model, opts)

4.8.1. Custom casting policies

PopTorch provides a mechanism to customize automatic casting behaviour in the form of casting policy classes. A casting policy is defined by four sets of Torch modules and/or torch operators:

  1. fp16 - set of operations to be typed as float16

  2. fp32 - set of operations to be typed as float32

  3. promote - set of operations to be promoted to float32 should they take mixed-precision inputs

  4. demote - set of operations to be demoted to float16 should they take mixed-precision inputs

The following example describes a policy where convolution and ReLU operations are to be performed using float16, whilst batch matrix multiplication is to be performed using float32. Dot product computations will be promoted to float32 when operands have mixed precision.

Listing 4.24 Custom casting policies
fp16 = [torch.nn.Conv2d, torch.relu]
fp32 = [torch.bmm]
promote = [torch.dot]
demote = []
policy = poptorch.autocasting.Policy(fp16, fp32, promote, demote)

opts = poptorch.Options()
opts.Precision.autocastPolicy(policy)
poptorch_model = poptorch.inferenceModel(model, opts)

4.9. PyTorch buffers

PopTorch supports PyTorch buffers in some circumstances. You can use buffers to make tensors persistent, that is, to allow tensors to keep their values from the previous run on each new run, without making them model parameters. However, you must make sure that you only make in-place modifications to the buffer using PyTorch in-place operations (such as += or those ending in _). For example, you can use torch.Tensor.copy_() to copy the contents of another tensor to the buffer.

Unlike when running on the CPU, the following PyTorch code does not increment model.i on each call when running on the IPU:

Listing 4.25 The wrong way to have a persistent tensor
class CounterModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.i = torch.tensor([0.], dtype=torch.float)

    def forward(self):
        self.i += 1
        return self.i


model = CounterModel()
poptorch_model = poptorch.inferenceModel(model)
print(poptorch_model())  # tensor([6.])
print(poptorch_model())  # tensor([6.])

This is because the PyTorch tracer captures the value of model.i when tracing happens and then freezes it as a constant. In fact, the value captured is 6.0 because PyTorch has traced or called the forward method five times before it captures the constant.
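The freezing behaviour can be mimicked in plain Python (an analogy only; torch.jit.trace is not implemented this way): the "traced" function captures the value observed at trace time rather than the live attribute:

```python
class Counter:
    def __init__(self):
        self.i = 0.0

    def forward(self):
        self.i += 1
        return self.i


model = Counter()

# The tracer evaluates forward() several times before capturing a value...
for _ in range(5):
    model.forward()

# ...then the sixth evaluation's result is frozen as a constant.
frozen = model.forward()


def traced_model():
    # Returns the captured constant, never touching model.i again.
    return frozen


assert traced_model() == 6.0
assert traced_model() == 6.0  # does not increment
```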

You can keep the value of a tensor between runs by registering it as a buffer in PyTorch, as the following example shows:

Listing 4.26 An example showing a tensor which is incremented on each iteration by registering it as a buffer.
class CounterModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("i", torch.tensor([0.], dtype=torch.float))

    def forward(self):
        self.i += 1
        return self.i


model = CounterModel()
poptorch_model = poptorch.inferenceModel(model)

print(poptorch_model())  # tensor([1.])
print(poptorch_model())  # tensor([2.])

Note

When running an inference model (with inferenceModel()), any buffers which your model modifies will not be implicitly copied to the host. You will need to call copyWeightsToHost() before reading the value of a buffer which has been changed as a result of a model call.

4.10. Creating custom ops

If you need to implement functionality that is not directly supported in PopTorch, you can create a custom op.

There are two steps to creating a custom op in PopTorch:

  1. Implement the op in C++ using the PopART API

  2. Make the op available in PopTorch so you can use it in your PyTorch model

4.10.1. Implementing the custom op

You will need to implement the new op as C++ code by creating subclasses of, at least, the Op and Opx base classes provided by the PopART API.

If you are going to use the custom op for training, then you will also need to define the classes that implement the gradient operation. For details of how to do this, see the Custom operators chapter of the PopART User Guide.

You can find some examples of PopART custom ops in the Graphcore GitHub tutorials repository.

Compiling the PopART custom op will create a dynamic library file, which you can use with your PyTorch code.

4.10.2. Make the op available to PyTorch

After you have compiled the C++ implementation of the custom op, you can load the library file and call the op from your PyTorch program using the poptorch.custom_op class.

First, load the dynamic library as shown in Listing 4.27.

Listing 4.27 Loading the library for the custom op
myso = list(pathlib.Path("tests").rglob("libcustom_cube_op.*"))
assert myso, "Failed to find libcustom_cube_op"
myop = ctypes.cdll.LoadLibrary(myso[0])

You can now call your custom op using the PopTorch class custom_op.

Both the forward op and backward op are implemented in the PopART code. However, in this inference model example, only the forward op is called:

Listing 4.28 Calling a custom op in a PopTorch inference model
def test_inference():
    class BasicNetwork(nn.Module):
        def forward(self, x, bias):
            x, y = poptorch.custom_op([x, bias],
                                      "Cube",
                                      "com.acme",
                                      1,
                                      example_outputs=[x, x])
            return x, y

In this example [x, x] is assigned to example_outputs, where x is one of the input tensors which is used as a template for the output tensors. The custom op code will need to create the tensors that it returns.

You can also call this custom op inside a training model using custom_op and the backward op will be called automatically.

The Graphcore tutorials repository contains a feature example demonstrating how to load and use a custom op in a PopTorch model: PopTorch example: Custom op.

4.10.3. Passing attributes to the custom op

You can pass attributes to the custom op using a Python dictionary, as shown in Listing 4.29.

Listing 4.29 Passing an attribute to a custom op from PopTorch
    class Model(torch.nn.Module):
        def forward(self, x):
            x = poptorch.custom_op([x],
                                   "LeakyRelu",
                                   "com.acme",
                                   1,
                                   example_outputs=[x],
                                   attributes={"alpha": 0.02})
            return x[0]

You can then access these attributes within the C++ custom op code. The above example passes a Float attribute with the name alpha to the LeakyRELU implementation. See the Custom operators chapter of the PopART User Guide for more information.

Table 4.1 and the code example in Listing 4.30 show how to pass other attribute types to a custom op. PopTorch supports all attributes supported in PopART except for Graph.

Table 4.1 Python types to use to pass attributes to PopART

  PopART attribute type   Python equivalent
  Float                   Python float (converted to float32)
  Floats                  List or tuple of Python float
  Int                     Python int (converted to 64-bit signed int)
  Ints                    List or tuple of Python int
  String                  Python str (converted to ASCII)
  Strings                 List or tuple of Python str
  Graph                   Not supported

Listing 4.30 Passing different attribute types from PopTorch
def test_many_attributes_examples():
    class Model(torch.nn.Module):
        def forward(self, x):
            attributes = {
                "float_one": 1.0,
                "float_minus_two": -2.0,
                "int_zero": 0,
                "int_minus_five": -5,
                "floats_one_two_three": [1.0, 2.0, 3.0],
                "floats_minus_one_two_three": [-1.0, -2.0, -3.0],
                "ints_one_two_three": [1, 2, 3],
                "ints_minus_one_two_three": [-1, -2, -3],
                "a_string": "string with quotes and slash \" ' \\ end",
                "strs": ["abc", "def", "ghi"]
            }

            x = poptorch.custom_op([x],
                                   "ManyAttributeOp",
                                   "test.poptorch",
                                   1,
                                   example_outputs=[x],
                                   attributes=attributes)
            return x[0]

4.11. Profiling

You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.

4.12. Precompilation and caching

4.12.1. Caching

By default, PopTorch will recompile the model every time you instantiate it. However, if you often run the same models, you might want to enable executable caching to save time.

You can do this by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching().
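For example, a sketch of the programmatic route (the cache path is arbitrary, and model is assumed to be an existing torch.nn.Module):

```python
# Compiled executables will be written to, and reloaded from, this directory.
opts = poptorch.Options()
opts.enableExecutableCaching("/tmp/poptorch_cache")
poptorch_model = poptorch.inferenceModel(model, opts)
```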

Warning

The cache directory might grow large quickly because PopTorch doesn’t delete old models from the cache and, depending on the number and size of your models and the number of IPUs used, the executables might be quite large. It is your responsibility to delete the unwanted cache files.

4.12.2. Precompilation

PopTorch supports precompilation: this means you can compile your model on a machine which doesn't have an IPU and export the executable to a file. You can then reload and execute it on a different machine which does have an IPU.

Important

The PopTorch versions on both machines must be an exact match.

To precompile your model you need to wrap it using either trainingModel() or inferenceModel() then call compileAndExport() on the wrapper.

Listing 4.31 How to precompile a model using an offline IPU target.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

opts = poptorch.Options()
# You don't need a real IPU to compile the executable.
opts.useOfflineIpuTarget(ipu_target_version)

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model.compileAndExport(filename, input, target)

Note

If you don’t know the IPU version on your system you can use ipuHardwareVersion().

By default, the exported file will contain your original PyTorch model (including the weights) and enough information to re-create the PopTorch wrapper and reload the executable.

Important

For your model and weights to be exported, your model must be picklable. See https://docs.python.org/3/library/pickle.html for more information. If your model is not picklable then use export_model=False, as shown in Listing 4.34.

The PyTorch model, the PopTorch wrapper and the executable can then all be restored on the target machine using poptorch.load():

Listing 4.32 How to load a precompiled model
poptorch_model = poptorch.load(filename)

# That's all: your model is ready to be used.
poptorch_model(input, target)  # Run on IPU

In some cases you might want to provide some runtime information to select the device: you can do this using the edit_opts_fn argument of poptorch.load():

Listing 4.33 How to load a precompiled model and run on a specific IPU
def setIpuDevice(opts):
    opts.useIpuId(1)  # always use IPU 1


poptorch_model = poptorch.load(filename, edit_opts_fn=setIpuDevice)
poptorch_model(input, target)  # Run on IPU 1

Note

When loading a precompiled model, only run-time options will be applied; all others will be ignored.

Going back to the precompilation step: in some cases you might want to export only the executable and not the Python wrapper or PyTorch model (for example, if your model cannot be pickled).

Listing 4.34 How to export only the executable
1
poptorch_model.compileAndExport(filename, input, target, export_model=False)

This means you will need to re-create and wrap the model yourself before loading the executable:

Listing 4.35 How to load a precompiled executable
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)
poptorch_model.loadExecutable(filename)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model(input, target)  # Run on IPU

Important

Exported models lose their connections to other models.

For example, if you have a poptorch.trainingModel() and a poptorch.inferenceModel() based on the same PyTorch model, you wouldn’t usually need to keep the weights synchronised between the two; PopTorch would take care of it for you, implicitly.

In the following example, PopTorch automatically copies the weights from the training model to the inference model:

Listing 4.36 PopTorch implicit copies
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train the model:
for epoch in epochs:
    training_model(input, target)

# Weights are implicitly copied from the training model
# to the validation model
prediction = validation_model(input)

If you were to export these models:

Listing 4.37 Precompilation of both training and validation models
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
training_model.compileAndExport("training.poptorch", input, target)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)
validation_model.compileAndExport("validation.poptorch", input)

Note

Don’t forget to call model.eval() or model.train(), as required, before calling compileAndExport().

You could then insert explicit copy operations:

Listing 4.38 Explicit weight copies between precompiled training and validation models
training_model = poptorch.load("training.poptorch")
validation_model = poptorch.load("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Need to explicitly copy weights between the two models
    # because they're not connected anymore.
    training_model.copyWeightsToHost()
    validation_model.copyWeightsToDevice()
    run_validation(validation_model)

Or you would need to re-connect the two models by creating the second one from the first one and then loading the executable:

Listing 4.39 Re-connecting precompiled training and validation models
training_model = poptorch.load("training.poptorch")
# Create a validation python model based on the training model
validation_model = poptorch.inferenceModel(training_model)
validation_model.model.eval()
# Load the executable for that model:
validation_model.loadExecutable("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Nothing to do: training_model and validation_model are now connected
    # and PopTorch will implicitly keep the weights in sync between them.
    run_validation(validation_model)

4.13. Environment variables

4.13.1. Logging level

PopTorch uses the following levels of logging:
  • OFF: No logging

  • ERR: Errors only

  • WARN: Warnings and errors only

  • INFO: Info, warnings and errors (default)

  • DEBUG: Adds some extra debugging information

  • TRACE and TRACE_ALL: Trace everything inside PopTorch

You can use the POPTORCH_LOG_LEVEL environment variable to set the logging level:

export POPTORCH_LOG_LEVEL=DEBUG

4.13.2. Profiling

When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.

In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'

By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'

For more options, refer to the PopVision Graph Analyser User Guide.

In order to capture the pvti reports needed for the PopVision System Analyser, you only need to set PVTI_OPTIONS='{"enable":"true"}'.

You can also add extra tracepoints in your own code by using Channel.

4.13.3. IPU Model

By default PopTorch will try to attach to a physical IPU. If instead you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:

export POPTORCH_IPU_MODEL=1

See the Poplar and PopLibs User Guide for the limitations of the IPU Model.

4.13.4. Wait for an IPU to become available

By default, attempting to attach to an IPU when all IPUs are already in use will raise an exception. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1.

export POPTORCH_WAIT_FOR_IPU=1

4.13.5. Enable executable caching

You can enable executable caching by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching.

export POPTORCH_CACHE_DIR=/tmp/poptorch_cache

For more information, see Section 4.12.1, Caching.