4. Features
4.1. Options
You can change how PopTorch compiles and executes models using poptorch.Options.
You can find a full list of options in Section 10.1, Options.
Broadly speaking, the options fall into the following categories:
- General options (see Options)
- Options related to half precision (see opts.Precision.*)
- Management of the training process (see opts.Training.*)
- Location of tensors (see opts.TensorLocations.* and TensorLocationSettings)
- Options relevant to the Torch JIT compiler (see opts.Jit.*)
- Control of distributed execution environments when using tools other than PopRun (see opts.Distributed.*)
See Section 5, Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation and replication_factor interact with the output and input sizes.
You can choose to use the IPU Model instead of IPU hardware with the useIpuModel() option.
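For example, the following minimal sketch builds an Options object and passes it to a model wrapping function; the specific values are purely illustrative:
import poptorch

opts = poptorch.Options()
opts.deviceIterations(4)    # process 4 batches per call to the model
opts.replicationFactor(1)   # no data-parallel replication
opts.useIpuModel(True)      # run on the IPU Model rather than IPU hardware

# The options are passed when wrapping the model, for example:
# poptorch_model = poptorch.inferenceModel(model, opts)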
4.1.1. Setting options via config file
In addition to setting these options programmatically, you can also set them in a config text file by using loadFromFile().
Each line in the file must contain a single command corresponding to setting an option in Options. To set an option within the file, write the command as you would within a Python script but omit the options. prefix. For example:
deviceIterations(1)
setExecutionStrategy(poptorch.ShardedExecution())
replicationFactor(1)
enableSyntheticData(True)
Then, instantiate Options and call loadFromFile():
opts = poptorch.Options()
opts.loadFromFile("tmp/poptorch.conf")
4.2. Model wrapping functions
The basis of PopTorch integration comes from the two model wrapping functions described in the following sections.
Note
PopTorch makes a shallow copy of the model. Changes to the parameters in the models returned by these two model wrapping functions affect the original model and vice versa. However, primitive variable types will not be kept in sync. This includes the training bool of torch.nn.Module.
If your PyTorch model is named model, call model.eval() or model.train(), if required, before calling these wrapping functions.
4.2.1. poptorch.trainingModel
This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in training mode. See trainingModel() for more information.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)
ones = torch.ones(10)

# Train on IPU.
for i in range(0, 800):
    # Each call here executes the forward pass, loss calculation, and backward
    # pass in one step.
    # Model input and loss function input are provided together.
    poptorch_out, loss = poptorch_model(input, target)
    print(f"{i}: {loss}")

# Copy the trained weights from the IPU back into the host model.
poptorch_model.copyWeightsToHost()

# Execute the trained weights on host.
model.eval()
native_out = model(input)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-04, atol=1e-04)
Note
By default, PopTorch will only return the final batch of outputs. Please see Section 5.6, poptorch.Options.Training.anchorReturnType for details on what PopTorch returns when using trainingModel() and how you can calculate statistics, such as training accuracy, over all batches.
4.2.2. poptorch.inferenceModel
This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in inference mode. See inferenceModel() for more information.
import torch
import torchvision
import poptorch

# Some dummy imagenet sized input.
picture_of_a_cat_here = torch.randn([1, 3, 224, 224])

# The model, in this case a MobileNet model with pretrained weights that comes
# canned with Pytorch.
model = torchvision.models.mobilenet_v2(pretrained=True)
model.train(False)

# Wrap in the PopTorch inference wrapper
inference_model = poptorch.inferenceModel(model)

# Execute on IPU.
out_tensor = inference_model(picture_of_a_cat_here)

# Get the top 5 ImageNet classes.
top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
print(top_five_classes)

# Try the same on native PyTorch
native_out = model(picture_of_a_cat_here)

native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
assert any(top_five_classes[1][0] == native_top_five_classes[1][0])
4.2.3. poptorch.PoplarExecutor
You should not create this class directly. It is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It has a few methods which you can use to interface with the IPU.
The PoplarExecutor will implicitly keep the parameters of the source PyTorch model and the PopTorch model(s) in sync. However, you need to explicitly copy the weights if the model is trained on the CPU and inference is run on the IPU.
See PoplarExecutor for a complete description of the IPU interface functionality.
model = Model()

model.eval()
poptorch_inf = poptorch.inferenceModel(model)

# Switch for "poptorch.trainingModel": poptorch_inf will remain in "eval" mode
model.train()
poptorch_train = poptorch.trainingModel(model)

# train on IPU
train(poptorch_train)
torch.save(model.state_dict(), "model.save")  # OK

# Already in "eval" mode
validate(poptorch_inf)  # OK

# switch to "eval" mode for CPU
model.eval()
validate(model)  # OK

# train on CPU
model.train()
train_on_cpu(model)

# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)
4.2.4. poptorch.isRunningOnIpu
One useful utility function is isRunningOnIpu(). This returns True when executing on the IPU and False when executing the model outside IPU scope. This allows for different code paths within the model.
A common use case is executing equivalent code to a PopART custom operator when running on the CPU. For example:
class Network(torch.nn.Module):
    def forward(self, x, y):
        if poptorch.isRunningOnIpu():
            # IPU path
            return my_custom_operator(x, y)
        else:
            # CPU path
            return my_torch_implementation(x, y)
4.3. Execution strategies
This section describes strategies to run PopTorch code on more than one IPU. Some of these allow code to be run in parallel on multiple IPUs. We recommend that you use these parallel execution strategies with PopTorch code that is already working correctly on a single IPU.
There are four kinds of execution strategies that you can use to run a model on a multi-IPU device:
- PipelinedExecution
- ShardedExecution
- SerialPhasedExecution
- ParallelPhasedExecution
You can select the strategy with the setExecutionStrategy() option. The default execution strategy is PipelinedExecution.
In the following, we first introduce the general functions that are relevant to all four parallel execution strategies. Finally, we explain the four strategies with examples.
By default, PopTorch will not let you run the model if the number of IPUs is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. However, you can also enable autoRoundNumIPUs() to automatically round up the number of IPUs reserved to a power of 2, with the excess being reserved but idle. This option is not enabled by default, to prevent unintentional overbooking of IPUs.
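For example (a minimal sketch):
opts = poptorch.Options()
# Round the number of IPUs reserved up to the next power of 2;
# any excess IPUs are reserved but remain idle.
opts.autoRoundNumIPUs(True)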
4.3.1. Annotation tools
poptorch.Block, poptorch.BeginBlock and poptorch.BlockFunction
BeginBlock and Block are wrapper classes used to define model parallelism in a multi-IPU device. They partition models into "blocks" to be executed on different IPUs.
You can use Block to define a scope in the context of the model.
In the example below, all layers before model.bert.encoder.layer[0] will be put on IPU 0 and all layers from model.bert.encoder.layer[0] onwards (inclusive) will be on IPU 1.
import transformers
import torch
import poptorch

# A bert model from hugging face. See the packaged BERT example for actual usage.
pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'


# For later versions of transformers, we need to wrap the model and set
# return_dict to False
class WrappedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wrapped = transformers.BertForQuestionAnswering.from_pretrained(
            pretrained_weights)

    def forward(self, input_ids, attention_mask, token_type_ids):
        return self.wrapped.forward(input_ids,
                                    attention_mask,
                                    token_type_ids,
                                    return_dict=False)

    def __getattr__(self, attr):
        try:
            return torch.nn.Module.__getattr__(self, attr)
        except torch.nn.modules.module.ModuleAttributeError:
            return getattr(self.wrapped, attr)


model = WrappedModel()

# A handy way of seeing the names of all the layers in the network.
print(model)

# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
                                                  ipu_id=1)

# Now all layers before layer are on IPU 1 and this layer onward is on IPU 2
model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
                                                  ipu_id=2)

# Finally all layers from this layer till the end of the network are on IPU 3.
model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
                                                  ipu_id=3)

# We must batch the data by at least the number of IPUs. Each IPU will still execute
# whatever the model batch size is.
data_batch_size = 4

# Create a poptorch.Options instance to override default options
opts = poptorch.Options()
opts.deviceIterations(data_batch_size)
BeginBlock is an annotation defined outside the model and applied to the current and onward layers (as in the example above), while Block defines a scope inside the model, as shown below. You can use both forms interchangeably.
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):

        # Explicit layers on a certain IPU
        poptorch.Block.useAutoId()
        with poptorch.Block(ipu_id=0):
            x = self.act(self.layer1(x))

        with poptorch.Block(ipu_id=1):
            x = self.act(self.layer2(x))

        with poptorch.Block(ipu_id=2):
            x = self.act(self.layer3(x))
            x = self.act(self.layer4(x))

        with poptorch.Block(ipu_id=3):
            x = self.softmax(x)
        return x


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))
In addition, BlockFunction() is a decorator which you can use to decorate an existing function:
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        poptorch.Block.useAutoId()
        x = self.block_one(x)
        x = self.block_two(x)
        x = self.final_activation(x)
        return x

    @poptorch.BlockFunction(ipu_id=0)
    def block_one(self, x):
        x = self.act(self.layer1(x))
        x = self.act(self.layer2(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def block_two(self, x):
        x = self.act(self.layer3(x))
        x = self.act(self.layer4(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def final_activation(self, x):
        return self.softmax(x)


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))
Either annotation is enough to enable parallel execution in the simple cases. By default, the layers before the first BeginBlock will be placed on IPU 0.
BeginBlock, Block and BlockFunction() need to follow a set of rules:
- You must declare all the layers inside a Block scope to avoid missing annotations. BeginBlock doesn't have the same constraint because all the layers called after it will automatically be added to the last BeginBlock.
- Note that PopTorch needs to reserve IPUs in powers of 2. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.
- You should not include unused or dead layers in any BeginBlock or Block.
- If layer A happens before layer B inside the model and each layer has a BeginBlock associated with it, you need to write the BeginBlock for layer A before the BeginBlock for layer B.
Failing to obey the above rules will result in compilation errors.
poptorch.Stage and poptorch.AutoStage
Conceptually, BeginBlock and Block collect the layers of a model into a Stage. You can combine multiple stages into a Phase. Multiple phases form a parallel execution strategy.
poptorch.Stage
Stage defines the layers of the model to run on one IPU. A stage can consist of one or more blocks created using BeginBlock or Block and identified by their user_id.
You can define consecutive layers in a model in either the same stage or consecutive stages. Whether stages run in parallel or sequentially depends on the specific parallel execution strategy.
Internally, each operation in a model is assigned a stage_id through Stage.
poptorch.AutoStage
You can use AutoStage if you don't want to specify stages by hand. This will assign one Stage per BeginBlock or Block.
By default, AutoStage.SameAsIpu is used, which means the stage_id of the Stage will be set to the ipu_id specified for the BeginBlock or Block.
Note that stage_id values must be ascending in PipelinedExecution. Let's use the code example above: if your blocks "0", "1" and "2" are assigned to IPUs 0, 1 and 0 respectively, then Block "2" will be assigned stage_id 0. This will cause the compiler to fail to schedule the last two stages "1" and "2" due to a conflict:
- The model implies "1" should run earlier than "2".
- Their stage_id values suggest "2" should run earlier than "1".
When AutoStage.AutoIncrement is used, each new BeginBlock or Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.
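The following sketch shows one way the auto-staging mode can be selected; it assumes the execution strategy constructor accepts an AutoStage value:
opts = poptorch.Options()
# Assumption: the strategy takes the AutoStage mode as an argument.
# AutoIncrement gives each new Block/BeginBlock the next stage_id instead of
# deriving it from the ipu_id.
opts.setExecutionStrategy(
    poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement))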
poptorch.Phase
Phase defines a processing unit of phased execution. It can contain one or more stages.
Phase is only used in SerialPhasedExecution and ParallelPhasedExecution. It is not used in ShardedExecution or PipelinedExecution.
class Model(torch.nn.Module):
    def forward(self, x, y):
        with poptorch.Block("A"):
            c = x + x
        with poptorch.Block("B"):
            d = y + y
        with poptorch.Block("C"):
            e = x * 3

        return c, d, e


first = poptorch.Phase(poptorch.Stage("A").ipu(0))
# Regrouped in a single stage
second = poptorch.Phase(poptorch.Stage("B", "C").ipu(1))
# 2 separate stages
second = poptorch.Phase(poptorch.Stage("B").ipu(1), poptorch.Stage("C").ipu(3))
In the code snippet above, “A” and “B” will run in parallel on IPUs 0 and 1 simultaneously because they are placed in two stages. They will run sequentially on one IPU if they are placed in a single stage.
Advanced annotation with strings
You can use Python strings to represent the user_id and ipu_id for a Block or BeginBlock. Because strings are evaluated at runtime, they allow for a dynamic number of stages and phases.
Here is an example showing how to use formatted strings (f-strings) in ParallelPhasedExecution.
In Listing 4.10, there are several places where f-strings are used:
- Line 25: f"phase{phase}_ipu{ipu}", where phase has the values 0, 1, 1, 2, 3, 3, 4, 5, and 5, and ipu ranges from 0 to 1. The total number of instances for this f-string is 12, from 6 phases and 2 IPUs.
- Line 32: f"phase{N*2-1}_ipu1", where phase is 5 and ipu is 1.
- Lines 46-47 and 50-51: when defining Stage, four f-strings are used, where n ranges from 0 to 2:
  f"phase{2*n}_ipu0"
  f"phase{2*n}_ipu1"
  f"phase{2*n+1}_ipu0"
  f"phase{2*n+1}_ipu1"
  These refer to phases 0, 2, 4 and 1, 3, 5, with ipu0 and ipu1, respectively. So all these 12 f-strings are defined in the Block annotations, and used in Stage dynamically. These match exactly.
 1  poptorch.setLogLevel(1)  # Force debug logging
 2  N = 3
 3  size = 10
 4
 5
 6  class Model(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.weights = []
10          for n in range(N * 6):
11              weight = torch.nn.Parameter(torch.rand(size, size),
12                                          requires_grad=True)
13              self.register_parameter(f"w{n}", weight)
14              self.weights.append(weight)
15
16      def forward(self, in0, target=None):
17          phase = 0
18          weight = iter(self.weights)
19          with poptorch.Block("phase0_ipu0"):
20              ins = torch.split(in0, size)
21          for n in range(N * 3):
22              out = []
23              for ipu in range(2):
24                  x = ins[ipu]
25                  with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                      x = torch.matmul(next(weight), x)
27                      out.append(F.relu(x))
28              ins = out[1], out[0]
29              # We want 2 matmuls in the same phase
30              if n % 3 != 1:
31                  phase += 1
32          with poptorch.Block(f"phase{N*2-1}_ipu1"):
33              res = ins[0] + ins[1]
34              if target is None:
35                  return res
36              return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39  input = torch.rand(size * 2, 1)
40  target = torch.rand(size, 1)
41  model = Model()
42  opts = poptorch.Options()
43  phases = []
44  # Alternate between 0-2 and 1-3
45  for n in range(N):
46      phases.append([
47          poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48          poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49      ])
50      phases.append([
51          poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52          poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53      ])
54  opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55  poptorch_model = poptorch.trainingModel(model, opts)
56  poptorch_model.compile(input, target)
4.3.2. Parallel execution strategies
With the above functions as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below.
Note that you can use the same annotation for each execution strategy. They only differ in the method of parallelisation and tensor locations.
poptorch.ShardedExecution
In this strategy, each IPU will sequentially execute a distinct part of the model. A single unit of processing in ShardedExecution is called a shard.
A shard is specified using Stage, or, if no Stage is specified, the user_id passed by BeginBlock or Block is used. Each shard is executed sequentially on a single IPU. You can place multiple shards on multiple IPUs. However, only one IPU is used at a time, while the other IPUs are idle. If an IPU is allocated to run consecutive stages, PopART will merge those consecutive stages into one on the same IPU. Weights and activations will use the on-chip memory of the IPUs. You need to place layers that share weights on the same IPU.
ShardedExecution can be useful for processing a single sample or for debugging. Overall, it has low efficiency because only one IPU is used at a time.
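To select this strategy, pass it to setExecutionStrategy() (a minimal example):
opts = poptorch.Options()
# Execute one shard (one Block/BeginBlock) at a time, each on its own IPU.
opts.setExecutionStrategy(poptorch.ShardedExecution())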
poptorch.PipelinedExecution
This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.
Parallelisation in PipelinedExecution requires deviceIterations() and gradientAccumulation(), as explained in Section 5, Efficient data batching.
After one stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel. An IPU can only start its own stage of a batch after the previous stage of that batch has been processed. Hence, all IPUs will be occupied after a "warm-up" period. At the end of processing, a "cool-down" period is required to aggregate the results and apply weight updates.
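A minimal sketch of selecting pipelined execution together with the batching options it relies on (the values are illustrative):
opts = poptorch.Options()
opts.setExecutionStrategy(poptorch.PipelinedExecution())  # the default strategy
opts.deviceIterations(16)                # keeps the pipeline fed with batches
opts.Training.gradientAccumulation(8)    # needed when training with a pipeline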
Phased execution
ParallelPhasedExecution and SerialPhasedExecution have the following features in common:
- A portion of the weights and activations are transferred to and from streaming memory, before and after each phase.
- If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.
- This specific portion is needed by the layers of the model wrapped in BeginBlock or Block in the current Phase.
- They both trade off some performance for larger models with higher memory needs.
- Any number of phases is allowed.
- The number of stages in each Phase should match the number of IPUs in each group of IPUs.
- Stages inside each Phase can run in parallel.
Although you only define the Phase for forward passes, the corresponding phases for backward passes are created correspondingly. The order of phased execution for backward passes won't change, but you can decide whether a phase is shared by both forward and backward passes. In other words, you decide whether to avoid a memory transfer of a portion of the weights and activations.
poptorch.SerialPhasedExecution
In SerialPhasedExecution, phases execute on a single group of IPUs sequentially.
strategy = poptorch.SerialPhasedExecution(
    poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
    poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
    poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2")))

strategy.phase(0).ipus(0, 1)
strategy.phase(1).ipus(0, 1)
strategy.phase(2).ipus(0, 1)

opts.setExecutionStrategy(strategy)
The code above causes all phases to run serially on IPUs 0 and 1 (A, B and C on IPU 0; A2, B2 and C2 on IPU 1).
poptorch.ParallelPhasedExecution
In ParallelPhasedExecution, phases are executed in parallel, alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from streaming memory, if the desired weights and activations are already available in another group of IPUs.
strategy = poptorch.ParallelPhasedExecution(
    poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
    poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
    poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5")))

strategy.phase(0).ipus(0, 2)
strategy.phase(1).ipus(1, 3)
strategy.phase(2).ipus(0, 2)

opts.setExecutionStrategy(strategy)
In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so the number of groups matches the number of IPUs. Even phases 0 and 2 run on IPU 0 and 2, while odd phase 1 runs on IPU 1 and 3. This allows for faster cross-IPU copies, both inter-phase and intra-phase.
poptorch.Liveness
Liveness controls the availability of tensors on the IPU, and is only needed for ParallelPhasedExecution and SerialPhasedExecution.
The default Liveness is AlwaysLive. OffChipAfterFwd, OffChipAfterFwdNoOverlap and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.
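A sketch of how a non-default liveness might be selected; this assumes the setTensorsLiveness() method on the phased execution strategies, so check the API reference for the exact call:
strategy = poptorch.SerialPhasedExecution(
    poptorch.Phase(poptorch.Stage("A")),
    poptorch.Phase(poptorch.Stage("B")))
# Assumption: phased strategies expose setTensorsLiveness() to choose a
# liveness mode other than the default AlwaysLive.
strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)
opts.setExecutionStrategy(strategy)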
4.4. Optimizers
PopTorch supports the following optimizers, available in the poptorch.optim namespace:
- SGD
- Adam
- AdamW
- RMSprop
- LAMB
In addition, PopTorch has features to support float16 models, such as loss scaling, velocity scaling, bias correction and accumulator types.
Important
All of these extra attributes (except velocity_scaling) must have the same values for different param_groups and therefore you must set them at the optimizer level.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
poptorch_model = poptorch.trainingModel(model, options, opt)
poptorch_model(input, target)
# Update optimizer attribute
opt.loss_scaling = 1.0
# Update param_group attribute
opt.param_groups[0]["loss_scaling"] = 1.0
# Set the new optimizer in the model
poptorch_model.setOptimizer(opt)
poptorch_model(input, target)
Important
You must call setOptimizer() to apply the new optimizer values to the model.
4.4.1. Loss scaling
When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.
Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.
Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
You can either set the loss_scaling factors manually, or you can set setAutomaticLossScaling() in opts.Training, which will automatically set a global loss scaling factor. If you both set loss_scaling manually and enable automatic loss scaling, the manually set factor(s) will be used initially and updated automatically during training.
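For example, a sketch of enabling automatic loss scaling while still providing an initial manual factor (the values are illustrative, and a model is assumed to exist as in the earlier examples):
opts = poptorch.Options()
# Let PopTorch update a global loss scaling factor during training.
opts.Training.setAutomaticLossScaling(True)

opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=8.0,   # used as the initial factor
                         use_combined_accum=False)
poptorch_model = poptorch.trainingModel(model, opts, opt)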
4.4.2. Velocity scaling (SGD combined variant only)
The SGD optimizer, when used with momentum, updates weights based on the velocity values. The combined variant uses one tensor per parameter to store the velocity and the changes to the velocity from accumulated gradients. Unlike the separate variant, therefore, each gradient accumulation step involves adding or subtracting values of a different magnitude to the gradients (for which loss scaling is used). You can therefore use the velocity_scaling parameter to scale the combined velocity tensor to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling, so loss_scaling has no impact on the effective scaling of the velocity parameters.)
As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
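An illustrative use of the combined-variant SGD with velocity scaling (the values are arbitrary, and a model is assumed as above):
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.9,
                         loss_scaling=128.0,
                         velocity_scaling=128.0,   # scales the combined velocity tensor
                         use_combined_accum=True)  # velocity_scaling applies to the combined variant only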
4.4.3. Accumulation types
In order to improve numerical stability, some of the optimizers (LAMB, Adam, AdamW, RMSprop) give you the option to tweak the data type used by the optimizer's accumulators.
accum_type lets you choose the type used for gradient accumulation. first_order_momentum_accum_type and second_order_momentum_accum_type give you control over the type used to store the first-order and second-order momentum optimizer states.
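A sketch of how these arguments might be combined for a float16 model (the choice of types is illustrative):
opt = poptorch.optim.AdamW(model.parameters(),
                           lr=1e-3,
                           # Accumulate gradients in float16 to save memory...
                           accum_type=torch.float16,
                           # ...but keep the momentum states in float32 for stability.
                           first_order_momentum_accum_type=torch.float32,
                           second_order_momentum_accum_type=torch.float32)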
4.4.4. Constant attributes
In order to improve performance and/or save memory, PopTorch will try to embed directly in the program those attributes which are constant.
Important
Trying to modify a constant attribute after the model has been compiled will result in an error.
For PopTorch optimizers (those from the poptorch.optim namespace), by default the attributes explicitly passed to the optimizer's constructor will be considered variable, and the others will be considered constant.
You can override this behaviour using markAsConstant() and markAsVariable() before compiling the model.
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         use_combined_accum=False)
# momentum and loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01, use_combined_accum=False)
# lr and momentum will be marked as variable.
# loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsConstant("loss_scaling")
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsVariable("momentum")
For native optimizers (those from the torch.optim namespace), the attributes which are left at their default values in the constructor will be considered constant.
There is no way to override this behaviour, which is why we recommend you always use the poptorch.optim optimizers instead.
# momentum will be marked as constant (It's not set)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# lr will be marked as variable.
# momentum will still be marked as constant (Because its default value is 0.0)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
# lr and momentum will both be marked as variable.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=1.0)
Note
There is an exception: lr is always marked as variable.
4.5. PopTorch ops
This section describes some “helper” operations you can use within a model.
4.5.1. poptorch.ctc_beam_search_decoder
This function adds a Connectionist Temporal Classification (CTC) beam search decoder operator to the model.
class Model(torch.nn.Module):
    def forward(self, log_probs, lengths):
        return poptorch.ctc_beam_search_decoder(log_probs, lengths)
For more information see: ctc_beam_search_decoder().
4.5.2. poptorch.ipu_print_tensor
This function adds an op to print the content of a tensor on the IPU.
Note
To prevent the print operation being optimised out by the graph optimiser, you must use the return value of ipu_print_tensor().
class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        x = x + 1

        # It is important to make sure the result of the print is used.
        x = poptorch.ipu_print_tensor(x)

        return x + self.bias
For more information see: ipu_print_tensor().
4.5.3. poptorch.identity_loss
You can use this function to implement custom losses. It takes a single PyTorch tensor and will backpropagate a gradient of ones through it.
Note
Passing a PyTorch loss function or another identity_loss to this function is not supported. You must implement multiple losses as composite PyTorch ops.
def custom_loss(output, target):
    # Mean squared error with a scale
    loss = output - target
    loss = loss * loss * 5
    return poptorch.identity_loss(loss, reduction="mean")


class ExampleModelWithCustomLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = ExampleModel()

    def forward(self, input, target):
        out = self.model(input)
        return out, custom_loss(out, target)
For more information see: identity_loss().
4.5.4. poptorch.MultiConv
Use the MultiConv wrapper class to define multi-convolutions. Refer to the PopLibs documentation for multi-convolutions for further information.
For more information see: MultiConv and MultiConvPlanType.
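A minimal sketch of grouping two convolutions into a multi-convolution using the MultiConv context manager (the layer shapes are arbitrary):
class TwoConvs(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_a = torch.nn.Conv2d(4, 4, 3)
        self.conv_b = torch.nn.Conv2d(4, 4, 3)

    def forward(self, x, y):
        # Convolutions run inside this block are planned and executed
        # together as a single multi-convolution.
        with poptorch.MultiConv():
            return self.conv_a(x), self.conv_b(y)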
4.5.5. poptorch.nop
PopTorch includes a "no-op" function for debugging purposes.
For more information see: nop().
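For example (a minimal sketch):
class DebugModel(torch.nn.Module):
    def forward(self, x):
        x = x * 2
        # Passes the tensor through unchanged, but is never optimised away,
        # which can be useful when inspecting the compiled graph.
        return poptorch.nop(x)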
4.5.6. poptorch.serializedMatMul
Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.
For more information see: serializedMatMul().
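An illustrative sketch that serializes a matrix multiplication over its output channels (the serialization mode and factor are arbitrary choices):
class SerializedLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, factor):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(in_features, out_features))
        self.factor = factor

    def forward(self, x):
        # Split the multiplication into `factor` smaller matmuls along the
        # output-channels dimension to reduce peak memory use.
        return poptorch.serializedMatMul(
            x, self.weight,
            poptorch.MatMulSerializationMode.OutputChannels,
            self.factor)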
4.5.7. poptorch.set_available_memory
Use this function to override the proportion of tile memory available for use as temporary memory by a convolution or matrix multiplication.
For more information see: set_available_memory().
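A minimal sketch; the function is applied to the output of the operation whose available memory proportion you want to override:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 128)

    def forward(self, x):
        x = self.fc(x)
        # Allow the preceding matmul to use at most 20% of tile memory
        # as temporary memory.
        return poptorch.set_available_memory(x, 0.2)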
4.5.8. Miscellaneous functions
The following PopTorch functions, not related to model creation, are available:
4.6. Half / float16 support
PopTorch supports the half-precision floating point (float16) format.
You can simply input float16 tensors into your model. (You can convert a tensor to float16 using tensor = tensor.half().)
You can use your models in one of the following ways:
- Convert all parameters (weights) to float16 by using a Module's half() method. This is the most memory-efficient option; however, small updates to weights may be lost, hindering training.
- Keep the parameters (weights) as float32, in which case the parameter updates will occur using float32. However, the parameters will be converted to float16 if you call an operation with a float16 input. This is more memory efficient than using float32 tensors (inputs) but less memory efficient than using float16 weights.
- Use a mix of float32 and float16 parameters by manually specifying parameters as float16 or float32.
Note
When PyTorch encounters a mix of float16 and float32 inputs for a given operation, it will usually cast all inputs to float32. PopTorch differs and will cast all inputs to float16. This makes it easier to build models with float32 weights which take float16 tensors. However, if you wish to follow PyTorch behaviour, you can use opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat), where opts is the poptorch.Options object passed to the model wrapping function.
model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to halfs. Without doing so,
# the Linear parameters will automatically be cast to half, which allows
# training with float32 parameters but half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half
Because PopTorch relies on the torch.jit.trace() function, it is limited to tracing operations which run on the CPU. Many of these operations do not support float16 inputs. To allow the full range of operations, PopTorch converts all float16 inputs to float32 before tracing and then restores the inputs to float16 as part of the canonicalization process. Some operations may result in the model running in float32 where float16 would be expected, or vice versa (see Section 6.3, Float 16 operations for full details).
4.7. Automatic mixed-precision casting
PopTorch supports converting your model automatically between float16 and float32.
This functionality is not active by default: you must enable it explicitly by calling the autocast(enabled=True) method at model level.
model = MyModel()
model.autocast()
poptorch_model = poptorch.inferenceModel(model)
During compilation, selected layers and operators will have their types adjusted aiming to strike a good compromise between compute efficiency, memory requirements and numerical precision.
You can also set automatic casting at the layer level. In this situation, its effect is hierarchical: changing the setting for a layer affects it and all layers contained within.
In the following example, automatic casting is enabled for all layers of the model, except for the first activation and second convolution.
model = torch.nn.Sequential()
model.add_module('conv1', torch.nn.Conv2d(1, 20, 5))
model.add_module('relu1', torch.nn.ReLU())
model.add_module('conv2', torch.nn.Conv2d(20, 64, 5))
model.add_module('relu2', torch.nn.ReLU())
model.autocast()
model.relu1.autocast(False)
model.conv2.autocast(False)
You can also set automatic casting with the function decorator @poptorch.autocast(enabled=True). Its effect is to apply automatic casting to the body of the function; setting its parameter to False has the opposite effect. A typical use case is applying it to the forward function of custom modules.
class MyModel(torch.nn.Module):
    @poptorch.autocast()
    def forward(self, x, y):
        return torch.bmm(x, y)
In addition, you can apply poptorch.autocast(enabled=True) to a code block, with similar effect.
x = torch.randn(1, 10, 10)
y = torch.randn(1, 10, 10)
with poptorch.autocast():
    z = torch.bmm(x, y)
You can turn this feature off completely for the whole application via the autocastEnabled(bool) method of _PrecisionOptions.
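A minimal sketch of disabling automatic casting globally through the precision options:
opts = poptorch.Options()
# Turn automatic mixed-precision casting off for the whole application.
opts.Precision.autocastEnabled(False)
poptorch_model = poptorch.inferenceModel(model, opts)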
4.7.1. Custom casting policies
PopTorch provides a mechanism to customize automatic casting behaviour in the form of casting policy classes. A casting policy is defined by four sets of Torch modules and/or torch operators:
- fp16 - set of operations to be typed as float16
- fp32 - set of operations to be typed as float32
- promote - set of operations to be promoted to float32 should they take mixed-precision inputs
- demote - set of operations to be demoted to float16 should they take mixed-precision inputs
The following example describes a policy where convolution and ReLU operations are to be performed using float16, whilst batch matrix multiplication is to be performed using float32. Dot product computations will be promoted to float32 when operands have mixed precision.
fp16 = [torch.nn.Conv2d, torch.relu]
fp32 = [torch.bmm]
promote = [torch.dot]
demote = []
policy = poptorch.autocasting.Policy(fp16, fp32, promote, demote)
opts = poptorch.Options()
opts.Precision.autocastPolicy(policy)
poptorch_model = poptorch.inferenceModel(model, opts)
4.8. Creating custom ops
If you need to implement functionality that is not directly supported in PopTorch, you can create a custom op.
There are two steps to creating a custom op in PopTorch:
1. Implement the op in C++ using the PopART API.
2. Make the op available in PopTorch so you can use it in your PyTorch model.
4.8.1. Implementing the custom op
You will need to implement the new op as C++ code by creating subclasses of, at least, the Op and Opx base classes provided by the PopART API.
If you are going to use the custom op for training, then you will also need to define the classes that implement the gradient operation. For details of how to do this, see the Custom operators chapter of the PopART User Guide.
You can find some examples of PopART custom ops in the Graphcore GitHub tutorials repository.
Compiling the PopART custom op will create a dynamic library file, which you can use with your PyTorch code.
4.8.2. Make the op available to PyTorch
After you have compiled the C++ implementation of the custom op, you can load the library file and call the op from your PyTorch program using the poptorch.custom_op class.
First, load the dynamic library, as shown in Listing 4.24.
import ctypes
import pathlib

myso = list(pathlib.Path("tests").rglob("libcustom_cube_op.*"))
assert myso, "Failed to find libcustom_cube_op"
myop = ctypes.cdll.LoadLibrary(myso[0])
You can now call your custom op using the PopTorch class custom_op.
Both the forward op and backward op are implemented in the PopART code. However, in this inference model example, only the forward op will be called:
def test_inference():
    class BasicNetwork(nn.Module):
        def forward(self, x, bias):
            x, y = poptorch.custom_op([x, bias],
                                      "Cube",
                                      "com.acme",
                                      1,
                                      example_outputs=[x, x])
            return x, y
In this example, [x, x] is assigned to example_outputs, where x is one of the input tensors which is used as a template for the output tensors. The custom op code will need to create the tensors that it returns.
You can also call this custom op inside a training model using custom_op, and the backward op will be called automatically.
4.8.3. Passing attributes to the custom op
You can pass attributes to the custom op using a Python dictionary, as shown in Listing 4.26.
class Model(torch.nn.Module):
    def forward(self, x):
        x = poptorch.custom_op([x],
                               "LeakyRelu",
                               "com.acme",
                               1,
                               example_outputs=[x],
                               attributes={"alpha": 0.02})
        return x[0]
You can then access these attributes within the C++ custom op code. The above example passes a Float attribute with the name alpha to the LeakyRelu implementation. See the Custom operators chapter of the PopART User Guide for more information.
Table 4.1 and the code example in Listing 4.27 show how to pass other attribute types to a custom op. PopTorch supports all attributes supported in PopART except for Graph.
PopART attribute type | Python equivalent
Float                 | Python float (converted to float32)
Floats                | List or tuple of Python float
Int                   | Python int (converted to 64-bit signed int)
Ints                  | List or tuple of Python int
String                | Python str (converted to ASCII)
Strings               | List or tuple of Python str
Graph                 | Not supported
def test_many_attributes_examples():
    class Model(torch.nn.Module):
        def forward(self, x):
            attributes = {
                "float_one": 1.0,
                "float_minus_two": -2.0,
                "int_zero": 0,
                "int_minus_five": -5,
                "floats_one_two_three": [1.0, 2.0, 3.0],
                "floats_minus_one_two_three": [-1.0, -2.0, -3.0],
                "ints_one_two_three": [1, 2, 3],
                "ints_minus_one_two_three": [-1, -2, -3],
                "a_string": "string with quotes and slash \" ' \\ end",
                "strs": ["abc", "def", "ghi"]
            }

            x = poptorch.custom_op([x],
                                   "ManyAttributeOp",
                                   "test.poptorch",
                                   1,
                                   example_outputs=[x],
                                   attributes=attributes)
4.9. Profiling
You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.
4.10. Precompilation and caching
4.10.1. Caching
By default PopTorch will re-compile the model every time you instantiate a model. However, if you often run the same models you might want to enable executable caching to save time.
You can do this by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching().
Warning
The cache directory might grow large quickly because PopTorch doesn’t delete old models from the cache and, depending on the number and size of your models and the number of IPUs used, the executables might be quite large. It is your responsibility to delete the unwanted cache files.
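For example (a minimal sketch), caching can also be enabled programmatically on the Options object:
opts = poptorch.Options()
# Compiled executables are written to, and reloaded from, this directory.
opts.enableExecutableCaching("/tmp/poptorch_cache")
poptorch_model = poptorch.inferenceModel(model, opts)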
4.10.2. Precompilation
PopTorch supports precompilation: This means you can compile your model on a machine which doesn’t have an IPU and export the executable to a file. You can then reload and execute it on a different machine which does have an IPU.
Important
The PopTorch versions on both machines must be an exact match.
To precompile your model, you need to wrap it using either trainingModel() or inferenceModel(), then call compileAndExport() on the wrapper.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

opts = poptorch.Options()
# You don't need a real IPU to compile the executable.
opts.useOfflineIpuTarget(ipu_target_version)

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model.compileAndExport(filename, input, target)
Note
If you don't know the IPU version on your system, you can use ipuHardwareVersion().
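For example, a sketch of compiling for an offline target that matches the locally detected hardware:
opts = poptorch.Options()
# Query the version of the IPU hardware visible from this host and compile
# for an offline target of the same architecture.
opts.useOfflineIpuTarget(poptorch.ipuHardwareVersion())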
By default, the exported file will contain your original PyTorch model (including the weights), and enough information to re-create the PopTorch wrapper and reload the executable.
Important
For your model and weights to be exported, your model must be picklable. See https://docs.python.org/3/library/pickle.html for more information.
If your model is not picklable, use export_model=False, as shown in Listing 4.31.
The PyTorch model, the PopTorch wrapper and the executable can then all be restored on the target machine using poptorch.load():
poptorch_model = poptorch.load(filename)

# That's all: your model is ready to be used.
poptorch_model(input, target)  # Run on IPU
In some cases you might want to provide some runtime information to select the device: you can do this using the edit_opts_fn argument of poptorch.load():
def setIpuDevice(opts):
    opts.useIpuId(1)  # always use IPU 1


poptorch_model = poptorch.load(filename, edit_opts_fn=setIpuDevice)
poptorch_model(input, target)  # Run on IPU 1
Note
When loading a precompiled model, only run-time options will be applied; all others will be ignored.
Going back to the precompilation step: in some cases you might want to export only the executable and not the python wrapper or torch model (for example if your model cannot be pickled).
poptorch_model.compileAndExport(filename, input, target, export_model=False)
This means you will need to re-create and wrap the model yourself before loading the executable:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)
poptorch_model.loadExecutable(filename)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model(input, target)  # Run on IPU
Important
Exported models lose their connections to other models.
For example, if you have a poptorch.trainingModel() and a poptorch.inferenceModel() based on the same PyTorch model, you wouldn't usually need to keep the weights synchronised between the two; PopTorch would take care of it for you, implicitly.
In the following example, PopTorch automatically copies the weights from the training model to the inference model:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train the model:
for epoch in epochs:
    training_model(input, target)

# Weights are implicitly copied from the training model
# to the validation model
prediction = validation_model(input)
If you were to export these models:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
training_model.compileAndExport("training.poptorch", input, target)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)
validation_model.compileAndExport("validation.poptorch", input)
Note
Don't forget to call model.eval() or model.train(), as required, before calling compileAndExport().
You could then insert explicit copy operations:
training_model = poptorch.load("training.poptorch")
validation_model = poptorch.load("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Need to explicitly copy weights between the two models
    # because they're not connected anymore.
    training_model.copyWeightsToHost()
    validation_model.copyWeightsToDevice()
    run_validation(validation_model)
Or you would need to re-connect the two models by creating the second one from the first one and then loading the executable:
training_model = poptorch.load("training.poptorch")
# Create a validation python model based on the training model
validation_model = poptorch.inferenceModel(training_model)
validation_model.model.eval()
# Load the executable for that model:
validation_model.loadExecutable("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Nothing to do: training_model and validation_model are now connected
    # and PopTorch will implicitly keep the weights in sync between them.
    run_validation(validation_model)
4.11. Environment variables
4.11.1. Logging level
PopTorch uses the following levels of logging:
- OFF: No logging
- ERR: Errors only
- WARN: Warnings and errors only
- INFO: Info, warnings and errors (default)
- DEBUG: Adds some extra debugging information
- TRACE and TRACE_ALL: Trace everything inside PopTorch
You can use the POPTORCH_LOG_LEVEL environment variable to set the logging level:
export POPTORCH_LOG_LEVEL=DEBUG
4.11.2. Profiling
When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.
In order to capture the reports needed for the PopVision Graph Analyser, you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'
By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'
For more options, refer to the PopVision Graph Analyser User Guide.
In order to capture the pvti reports needed for the PopVision System Analyser, you only need to set PVTI_OPTIONS='{"enable":"true"}'.
You can also add extra tracepoints in your own code by using Channel.
4.11.3. IPU Model
By default PopTorch will try to attach to a physical IPU. If, instead, you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:
export POPTORCH_IPU_MODEL=1
See the Poplar and PopLibs User Guide for the limitations of the IPU Model.
4.11.4. Wait for an IPU to become available
By default, attempting to attach to an IPU when all IPUs are already in use will raise an exception. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1:
export POPTORCH_WAIT_FOR_IPU=1
4.11.5. Enable executable caching
You can enable executable caching by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching():
export POPTORCH_CACHE_DIR=/tmp/poptorch_cache
For more information, see Section 4.10.1, Caching.