4. Features

4.1. Options

You can change how PopTorch compiles and executes models using poptorch.Options. You can find a full list of options in the Options section of the Reference chapter. Broadly speaking, the options fall into the following categories:

  1. General options (See poptorch.Options)

  2. Options related to half precision (see poptorch.options._PrecisionOptions)

  3. Management of the training process (see poptorch.options._TrainingOptions)

  4. Control of distributed execution environments (see poptorch.options._DistributedOptions)

  5. Location of tensors (see: poptorch.options._TensorLocationOptions and poptorch.TensorLocationSettings)

  6. Options relevant to the Torch JIT compiler (see poptorch.options._JitOptions)

See Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation, and replication_factor interact with the output and input sizes.

You can choose to use the IPU model or the real IPU hardware via poptorch.Options.useIpuModel.
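
The snippet below is a minimal sketch of the workflow: build an options object, set a few options and pass it to a wrapping function. The option values are purely illustrative and the model is a stand-in.

import torch
import poptorch

model = torch.nn.Linear(10, 10)

# Illustrative values only: process 4 batches per execution, use a single
# replica, and run on the IPU Model rather than real hardware.
opts = poptorch.Options()
opts.deviceIterations(4)
opts.replicationFactor(1)
opts.useIpuModel(True)

poptorch_model = poptorch.inferenceModel(model, opts)
print(poptorch_model(torch.randn(4, 10)))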

4.2. Model wrapping functions

These two model wrapping functions form the basis of the PopTorch integration.

4.2.1. poptorch.trainingModel

This function wraps around a PyTorch model, yielding a PopTorch model that may be run on the IPU in training mode. See poptorch.trainingModel() for a complete reference.

Listing 4.1 An example of the use of poptorch.trainingModel()
 1import torch
 2import poptorch
 3
 4
 5class ExampleModelWithLoss(torch.nn.Module):
 6    def __init__(self):
 7        super().__init__()
 8        self.fc = torch.nn.Linear(10, 10)
 9        self.loss = torch.nn.MSELoss()
10
11    def forward(self, x, target=None):
12        fc = self.fc(x)
13        if self.training:
14            return fc, self.loss(fc, target)
15        return fc
16
17
18torch.manual_seed(0)
19model = ExampleModelWithLoss()
20
21# Wrap the model in our PopTorch annotation wrapper.
22poptorch_model = poptorch.trainingModel(model)
23
24# Some dummy inputs.
25input = torch.randn(10)
26target = torch.randn(10)
27
28# Train on IPU.
29for i in range(0, 800):
30    # Each call here executes the forward pass, loss calculation, and backward
31    # pass in one step.
32    # Model input and loss function input are provided together.
33    poptorch_out, loss = poptorch_model(input, target)
34    print(f"{i}: {loss}")
35
36# Copy the trained weights from the IPU back into the host model.
37poptorch_model.copyWeightsToHost()
38
39# Execute the trained weights on host.
40model.eval()
41native_out = model(input)
42
43# Models should be very close to native output although some operations are
44# numerically different and floating point differences can accumulate.
45torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-06, atol=1e-06)

4.2.2. poptorch.inferenceModel

This function wraps around a PyTorch model, yielding a PopTorch model that can be run on the IPU in inference mode. See poptorch.inferenceModel() for a complete reference.

Listing 4.2 An example of the use of poptorch.inferenceModel()
 1import torch
 2import torchvision
 3import poptorch
 4
 5# Some dummy imagenet sized input.
 6picture_of_a_cat_here = torch.randn([1, 3, 224, 224])
 7
 8# The model, in this case a MobileNet model with pretrained weights that comes
 9# canned with Pytorch.
10model = torchvision.models.mobilenet_v2(pretrained=True)
11model.train(False)
12
13# Wrap in the PopTorch inference wrapper
14inference_model = poptorch.inferenceModel(model)
15
16# Execute on IPU.
17out_tensor = inference_model(picture_of_a_cat_here)
18
19# Get the top 5 ImageNet classes.
20top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
21print(top_five_classes)
22
23# Try the same on native PyTorch
24native_out = model(picture_of_a_cat_here)
25
26native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)
27
28# Models should be very close to native output although some operations are
29# numerically different and floating point differences can accumulate.
30assert any(top_five_classes[1][0] == native_top_five_classes[1][0])

4.2.3. poptorch.PoplarExecutor

This class should not be created directly but is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It only has a few methods which can be used to interface with the IPU.

The PoplarExecutor will implicitly keep the parameters of the source PyTorch model and the PopTorch model(s) in sync. However, the weights need to be copied explicitly if the model is trained on the CPU and inference is run on the IPU.

See PoplarExecutor for a complete description of the IPU interface functionality.

model = Model()
poptorch_train = poptorch.trainingModel(model)
poptorch_inf = poptorch.inferenceModel(model)

train(poptorch_train)
torch.save(model.state_dict(), "model.save") # OK
validate(poptorch_inf) # OK
validate(model) # OK

train(model)
# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)

4.2.4. poptorch.isRunningOnIpu

One useful utility function is poptorch.isRunningOnIpu(). This returns True when executing on the IPU and False when executing the model outside IPU scope. This allows for different code paths within the model.

A common use case is executing equivalent code to a PopART custom operator when running on CPU. For example:

class Network(torch.nn.Module):
    def forward(self, x, y):
        if poptorch.isRunningOnIpu():
            # IPU path
            return my_custom_operator(x, y)
        else:
            # CPU path
            return my_torch_implementation(x, y)

4.3. Parallel execution

This section demonstrates multi-IPU strategies for parallel execution in PopTorch. We recommend that you start such parallel programming from PopTorch code that is working properly on a single IPU.

There are four execution strategies for running a model on a multi-IPU device: poptorch.ShardedExecution, poptorch.PipelinedExecution, poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. These execution strategies are set through poptorch.Options.setExecutionStrategy(). The default execution strategy is poptorch.PipelinedExecution. In the following sections, we first introduce the general APIs that apply to all four parallel execution strategies, and then explain each strategy with examples.

By default, PopTorch will not let you run the model if the number of IPUs is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. However, you can also enable poptorch.Options.autoRoundNumIPUs() to automatically round up the number of IPUs reserved to a power of 2, with the excess being reserved but idle. This option is not enabled by default to prevent unintentional overbooking of IPUs.

4.3.1. Annotation tools

poptorch.Block and poptorch.BeginBlock

poptorch.BeginBlock and poptorch.Block are wrapper classes used to define model parallelism in a multi-IPU device. They partition models into “blocks” that will be executed on different IPUs.

You can use poptorch.Block to define a scope in the context of the model.

In the example below, all layers before model.bert.encoder.layer[0] will be put on IPU 0 and all layers from model.bert.encoder.layer[0] onwards (inclusive) will be on IPU 1.

Listing 4.3 Annotating existing layers.
 1import transformers
 2import torch
 3import poptorch
 4
 5# A bert model from hugging face. See the packaged BERT example for actual usage.
 6pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'
 7
 8
 9# For later versions of transformers, we need to wrap the model and set
10# return_dict to False
11class WrappedModel(torch.nn.Module):
12    def __init__(self):
13        super().__init__()
14        self.wrapped = transformers.BertForQuestionAnswering.from_pretrained(
15            pretrained_weights)
16
17    def forward(self, input_ids, attention_mask, token_type_ids):
18        return self.wrapped.forward(input_ids,
19                                    attention_mask,
20                                    token_type_ids,
21                                    return_dict=False)
22
23    def __getattr__(self, attr):
24        try:
25            return torch.nn.Module.__getattr__(self, attr)
26        except torch.nn.modules.module.ModuleAttributeError:
27            return getattr(self.wrapped, attr)
28
29
30model = WrappedModel()
31
32# A handy way of seeing the names of all the layers in the network.
33print(model)
34
35# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
36# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
37model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
38                                                  ipu_id=1)
39
40# Now all layers before this layer are on IPU 1 and this layer onwards is on IPU 2
41model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
42                                                  ipu_id=2)
43
44# Finally all layers from this layer till the end of the network are on IPU 3.
45model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
46                                                  ipu_id=3)
47
48# We must batch the data by at least the number of IPUs. Each IPU will still execute
49# whatever the model batch size is.
50data_batch_size = 4
51
52# Create a poptorch.Options instance to override default options
53opts = poptorch.Options()
54opts.deviceIterations(data_batch_size)

poptorch.BeginBlock, used in the example above, is an annotation applied from outside the model to a given layer and to all layers after it. poptorch.Block, shown in the example below, instead defines a scope inside the model itself. Both forms can be used interchangeably.

Listing 4.4 Annotating a model directly.
 1class Network(torch.nn.Module):
 2    def __init__(self):
 3        super().__init__()
 4        self.layer1 = torch.nn.Linear(5, 10)
 5        self.layer2 = torch.nn.Linear(10, 5)
 6        self.layer3 = torch.nn.Linear(5, 5)
 7        self.layer4 = torch.nn.Linear(5, 5)
 8
 9        self.act = torch.nn.ReLU()
10        self.softmax = torch.nn.Softmax(dim=1)
11
12    def forward(self, x):
13
14        # Explicit layers on a certain IPU
15        poptorch.Block.useAutoId()
16        with poptorch.Block(ipu_id=0):
17            x = self.act(self.layer1(x))
18
19        with poptorch.Block(ipu_id=1):
20            x = self.act(self.layer2(x))
21
22        with poptorch.Block(ipu_id=2):
23            x = self.act(self.layer3(x))
24            x = self.act(self.layer4(x))
25
26        with poptorch.Block(ipu_id=3):
27            x = self.softmax(x)
28        return x
29
30
31model = Network()
32opts = poptorch.Options()
33opts.deviceIterations(4)
34poptorch_model = poptorch.inferenceModel(model, options=opts)
35print(poptorch_model(torch.rand((4, 5))))

Either annotation is enough to enable parallel execution in the simple cases. By default, the layers before the first poptorch.BeginBlock will be placed on IPU 0.

Both poptorch.BeginBlock and poptorch.Block need to follow a set of rules:

  • All the layers must be declared inside a poptorch.Block scope. This is to avoid missing annotations. poptorch.BeginBlock doesn’t have the same constraint because all the layers called after it will automatically be added to the last poptorch.BeginBlock.

  • Please note that PopTorch needs to reserve IPUs in powers of 2 or multiples of 64. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.

  • Unused or dead layers should NOT be included in any poptorch.BeginBlock or poptorch.Block.

  • If layer A happens before layer B inside the model and each layer has a poptorch.BeginBlock associated with it, you need to write poptorch.BeginBlock for layer A before poptorch.BeginBlock for layer B.

Failing to obey the above rules will result in compilation errors.

poptorch.Stage and poptorch.AutoStage

Conceptually, poptorch.BeginBlock or poptorch.Block collects the layers of a model into a poptorch.Stage; multiple stages can be combined into a poptorch.Phase; and multiple phases form a parallel execution strategy.

poptorch.Stage

poptorch.Stage defines some layers of the model to run on one IPU. It can be made of one or more blocks created using poptorch.BeginBlock or poptorch.Block and identified by their user_id. Consecutive layers in a model can be defined either in the same poptorch.Stage or in consecutive stages. Whether stages run in parallel or sequentially depends on the specific parallel execution strategy.

Internally, each operation in a model is assigned a stage_id through poptorch.Stage.
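
For illustration, the sketch below groups two hypothetical blocks (“B0” and “B1”, the user_ids given to poptorch.Block or poptorch.BeginBlock in the model) into one stage on IPU 0 and places a third block in its own stage on IPU 1, here using the sharded execution strategy described later in this chapter.

import poptorch

# "B0", "B1" and "B2" are hypothetical block names declared in the model.
stage_0 = poptorch.Stage("B0", "B1").ipu(0)
stage_1 = poptorch.Stage("B2").ipu(1)

opts = poptorch.Options()
opts.setExecutionStrategy(poptorch.ShardedExecution(stage_0, stage_1))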

poptorch.AutoStage

You can use poptorch.AutoStage if you don’t want to specify poptorch.Stage by hand. It will assign one poptorch.Stage per poptorch.BeginBlock or poptorch.Block.

By default poptorch.AutoStage.SameAsIpu is in use, which means the stage_id of each poptorch.Stage will be set to the ipu_id specified for the poptorch.BeginBlock or poptorch.Block. Please note that stage_id must be ascending in poptorch.PipelinedExecution. Consider the code example above: if your blocks “0”, “1” and “2” are assigned to IPUs 0, 1 and 0 respectively, then block “2” will be assigned stage_id 0. This will make the compiler fail to schedule the last two stages “1” and “2” due to a conflict:

  • The model implies “1” should run earlier than “2”.

  • Their stage_id values suggest “2” should run earlier than “1”.

When poptorch.AutoStage.AutoIncrement is in use, each new poptorch.BeginBlock or poptorch.Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.
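
As a hedged sketch, the snippet below selects the AutoIncrement mode; this assumes that the execution strategy constructor accepts a poptorch.AutoStage value in place of explicit stages.

import poptorch

# One stage per block, with stage_ids assigned in the order the blocks are
# encountered rather than copied from the ipu_id.
opts = poptorch.Options()
opts.setExecutionStrategy(
    poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement))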

poptorch.Phase

poptorch.Phase defines a processing unit of phased execution. It may contain one or more poptorch.Stage. poptorch.Phase is only used in poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. It is not used in poptorch.ShardedExecution and poptorch.PipelinedExecution.

with poptorch.Block("A"):
    layer()
with poptorch.Block("B"):
    layer()
p = poptorch.Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))

In the code snippet above, “A” and “B” will run in parallel on IPU 0 and 1 simultaneously since they are placed in two stages. They will run sequentially on one IPU if they are placed in a single stage.

Advanced annotation with strings

You can use Python strings to represent the user_id and ipu_id for a poptorch.Block or poptorch.BeginBlock. Since strings are evaluated at runtime, they allow for a dynamic number of stages and phases. The example below uses formatted strings (f-strings) with poptorch.ParallelPhasedExecution.

In the code example below, f-strings are used in two places in forward(). The first is f"phase{phase}_ipu{ipu}" at line 25, where phase takes the values 0, 1, 1, 2, 3, 3, 4, 5 and 5 across the loop iterations and ipu ranges from 0 to 1; this gives 12 distinct names in total, covering 6 phases and 2 IPUs. The second is f"phase{N*2-1}_ipu1" at line 32, where the phase is 5 and the IPU is 1. When defining each poptorch.Stage, four f-strings are used at lines 47-48 and 51-52, with n ranging from 0 to 2:

  • f"phase_{2*n}_ipu0"

  • f"phase{2*n}_ipu1"

  • f"phase_{2*n+1}_ipu0"

  • f"phase{2*n+1}_ipu1"

They refer to phases 0, 2 and 4, and phases 1, 3 and 5, on ipu0 and ipu1 respectively. So all 12 of these names are defined with poptorch.Block and then referenced dynamically when constructing each poptorch.Stage; the two sets match exactly.

Listing 4.5 An example of parallel phased execution
 1poptorch.setLogLevel(1)  # Force debug logging
 2N = 3
 3size = 10
 4
 5
 6class Model(torch.nn.Module):
 7    def __init__(self):
 8        super().__init__()
 9        self.weights = []
10        for n in range(N * 6):
11            weight = torch.nn.Parameter(torch.rand(size, size),
12                                        requires_grad=True)
13            self.register_parameter(f"w{n}", weight)
14            self.weights.append(weight)
15
16    def forward(self, in0, target=None):
17        phase = 0
18        weight = iter(self.weights)
19        with poptorch.Block("phase0_ipu0"):
20            ins = torch.split(in0, size)
21        for n in range(N * 3):
22            out = []
23            for ipu in range(2):
24                x = ins[ipu]
25                with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                    x = torch.matmul(next(weight), x)
27                    out.append(F.relu(x))
28            ins = out[1], out[0]
29            # We want 2 matmuls in the same phase
30            if n % 3 != 1:
31                phase += 1
32        with poptorch.Block(f"phase{N*2-1}_ipu1"):
33            res = ins[0] + ins[1]
34            if target is None:
35                return res
36            return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39input = torch.rand(size * 2, 1)
40target = torch.rand(size, 1)
41model = Model()
42opts = poptorch.Options()
43phases = []
44# Alternate between 0-2 and 1-3
45for n in range(N):
46    phases.append([
47        poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48        poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49    ])
50    phases.append([
51        poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52        poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53    ])
54opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55poptorch_model = poptorch.trainingModel(model, opts)
56poptorch_model.compile(input, target)

4.3.2. Parallel execution strategies

With the above APIs as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below. Note that the same annotation can be used for each of them. They only differ in the method of parallelisation and tensor locations.

poptorch.ShardedExecution

In this strategy, each IPU will sequentially execute a distinct part of the model. A single unit of processing in poptorch.ShardedExecution is a shard. A shard is specified using poptorch.Stage, or, if no poptorch.Stage is specified, the user_id passed by poptorch.BeginBlock or poptorch.Block is used. Each shard is executed sequentially on a single IPU. Multiple shards can be placed on multiple IPUs. However, only one IPU is used at a time, while the other IPUs are idle. If an IPU is allocated to run consecutive stages, PopART will merge consecutive stages into one on the same IPU. Weights and activations will use the on-chip memory of the IPUs. Layers sharing weights need to be placed on the same IPU.

poptorch.ShardedExecution can be useful for processing a single sample or debugging. Overall it has low efficiency since only one IPU is used at a time.

poptorch.PipelinedExecution

This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.

Parallelisation in poptorch.PipelinedExecution requires deviceIterations() and gradientAccumulation() as explained in Efficient data batching. After one poptorch.Stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel. An IPU can only start its own poptorch.Stage of a batch once the previous poptorch.Stage of that batch has been processed. Hence, all IPUs will be occupied after a warm-up period. A cool-down period is required to aggregate the results and apply weight changes.
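
The sketch below shows a two-stage pipeline over two IPUs for inference; the block names and option values are illustrative only.

import torch
import poptorch


class Net(torch.nn.Module):
    def forward(self, x):
        with poptorch.Block("B0"):
            x = x * 2
        with poptorch.Block("B1"):
            x = x + 1
        return x


# Illustrative value: enough device iterations to keep the pipeline fed.
opts = poptorch.Options()
opts.deviceIterations(8)
opts.setExecutionStrategy(poptorch.PipelinedExecution(
    poptorch.Stage("B0").ipu(0),
    poptorch.Stage("B1").ipu(1)))

poptorch_model = poptorch.inferenceModel(Net(), opts)
# poptorch_model(torch.ones(8, 4)) would compile and run the two-stage pipeline.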

Phased execution

poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution have the following features in common:

  • A portion of the weights and activations are transferred to and from streaming memory, before and after each phase.

  • If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.

  • This specific portion is needed by the layers of the model wrapped in poptorch.BeginBlock or poptorch.Block in the current poptorch.Phase.

  • They both trade off some performance for larger models with higher memory needs.

  • Any number of phases is allowed.

  • The number of stages in each poptorch.Phase should match the number of IPUs in each group of IPUs.

  • Stages inside each poptorch.Phase can run in parallel.

Although you only define the poptorch.Phase for the forward pass, the corresponding phases for the backward pass are created automatically. The order of phased execution for the backward pass won’t change, but you can decide whether a phase is shared by both the forward and backward passes. In other words, you decide whether to avoid a memory transfer of a portion of the weights and activations.

poptorch.SerialPhasedExecution

In poptorch.SerialPhasedExecution, phases execute on a single group of IPUs sequentially.

strategy = poptorch.SerialPhasedExecution([
  poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
  poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
  poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])

strategy.phase(0).ipus(0,1)
strategy.phase(1).ipus(0,1)
strategy.phase(2).ipus(0,1)

opts.setExecutionStrategy(strategy)

The code above causes all phases to run serially on IPUs 0 and 1.

poptorch.ParallelPhasedExecution

In poptorch.ParallelPhasedExecution, phases are executed in parallel alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from the streaming memory, if the desired weights and activations are already available in another group of IPUs.

strategy = poptorch.ParallelPhasedExecution([
  poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
  poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
  poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])

strategy.phase(0).ipus(0,2)
strategy.phase(1).ipus(1,3)
strategy.phase(2).ipus(0,2)

opts.setExecutionStrategy(strategy)

In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so the number of stages matches the number of IPUs in each group. Even phases 0 and 2 run on IPUs 0 and 2, while odd phase 1 runs on IPUs 1 and 3, as required. This allows for faster cross-IPU copies, both inter-phase and intra-phase.

poptorch.Liveness

poptorch.Liveness controls the availability of tensors on IPU, and is only needed for poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution.

The default poptorch.Liveness is AlwaysLive. OffChipAfterFwd and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.

4.4. Optimizers

PopTorch supports the following optimizers:

  1. SGD (see poptorch.optim.SGD)

  2. Adam (see poptorch.optim.Adam)

  3. AdamW (see poptorch.optim.AdamW)

  4. RMSprop (see poptorch.optim.RMSprop)

  5. LAMB (see poptorch.optim.LAMB)

In addition, PopTorch has features to support float16 models, such as loss scaling, velocity scaling, bias correction and accumulator types.

Important

All of these extra attributes (except velocity_scaling) cannot have different values for different param_groups and therefore must be set at the optimizer level.

Listing 4.6 How to update values in an Optimizer
 1opt = poptorch.optim.SGD(model.parameters(),
 2                         lr=0.01,
 3                         loss_scaling=2.0,
 4                         velocity_scaling=2.0)
 5poptorch_model = poptorch.trainingModel(model, options, opt)
 6poptorch_model(input, target)
 7# Update optimizer attribute
 8opt.loss_scaling = 1.0
 9# Update param_group attribute
10opt.param_groups[0]["velocity_scaling"] = 1.0
11# Set the new optimizer in the model
12poptorch_model.setOptimizer(opt)
13poptorch_model(input, target)

Important

You must call setOptimizer() for the new optimizer values to be applied to the model.

4.4.1. Loss scaling

When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.

Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.

Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
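
The snippet below is a minimal sketch of setting the parameter on a PopTorch optimizer; the scaling factor is illustrative.

import torch
import poptorch

model = torch.nn.Linear(10, 10).half()

# Scale the loss by 1024 before the backward pass; the gradients are unscaled
# again before the optimizer state is updated.
optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01,
                               loss_scaling=1024.0)

opts = poptorch.Options()
poptorch_model = poptorch.trainingModel(model, opts, optimizer)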

4.4.2. Velocity scaling (SGD only)

The SGD optimizer, when used with momentum, updates weights based on the velocity values. At each update step, the new velocity is a combination of the gradients derived from the loss function and the previous velocity value. Similar to loss scaling, the velocity_scaling parameter allows the velocity values to be scaled to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling so the loss_scaling has no impact on the effective scaling of velocity parameters.)

As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.

4.4.3. Accumulation types

In order to improve numerical stability some of the optimizers (LAMB, Adam, AdamW, RMSprop) give you the option to tweak the data type used by the optimizer’s accumulators.

accum_type lets you choose the type used for gradient accumulation. first_order_momentum_accum_type / second_order_momentum_accum_type give you control over the type used to store the first-order and second-order momentum optimizer states.
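
As a hedged sketch, the snippet below sets these attributes on an AdamW optimizer; the assumption that they accept torch dtypes, and the particular choice of types, are illustrative.

import torch
import poptorch

model = torch.nn.Linear(10, 10)

# Accumulate gradients and the first-order momentum in float16, but keep the
# second-order momentum state in float32 for extra numerical stability.
optimizer = poptorch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    accum_type=torch.float16,
    first_order_momentum_accum_type=torch.float16,
    second_order_momentum_accum_type=torch.float32)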

4.4.4. Constant attributes

To improve performance and/or save memory, PopTorch will try to embed the attributes which are constant directly in the program.

Important

Trying to modify a constant attribute after the model has been compiled will result in an error.

For PopTorch optimizers (those from the poptorch.optim namespace), by default, the attributes explicitly passed to the optimizer’s constructor will be considered variables and the others will be considered constant.

This behaviour can be overridden using markAsConstant() and markAsVariable() before the model is compiled.

Listing 4.7 Constant and variable attributes for PopTorch optimizers
 1# lr and momentum will be marked as variable (velocity_scaling will be constant).
 2opt = poptorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
 3# momentum and velocity_scaling will be marked as constant.
 4opt = poptorch.optim.SGD(model.parameters(), lr=0.01)
 5# lr and momentum will be marked as variable.
 6# velocity_scaling will be marked as constant.
 7opt = poptorch.optim.SGD(model.parameters(),
 8                         lr=0.01,
 9                         momentum=0.0,
10                         velocity_scaling=2.0)
11opt.variable_attrs.markAsConstant("velocity_scaling")
12# lr, momentum and velocity_scaling will be marked as variable.
13opt = poptorch.optim.SGD(model.parameters(), lr=0.01, velocity_scaling=2.0)
14opt.variable_attrs.markAsVariable("momentum")

For native optimizers (those from the torch.optim namespace) the attributes which are left to their default value in the constructor will be considered as constant.

There is no method to override this behaviour, which is why we recommend that you always use the poptorch.optim optimizers instead.

Listing 4.8 Constant and variable attributes for Torch optimizers
1# momentum will be marked as constant (It's not set)
2opt = torch.optim.SGD(model.parameters(), lr=0.01)
3# lr will be marked as variable.
4# momentum will still be marked as constant (Because its default value is 0.0)
5opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
6# lr and momentum will both be marked as variable.
7opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=1.0)

Note

There is an exception: lr is always marked as variable.

4.5. Custom ops

These are helper operations to be used within a model.

4.5.1. poptorch.ipu_print_tensor

Adds an op to print the content of a given IPU tensor.

Warning

To prevent the print operation being optimised out by the graph optimiser, you must use the output of the print.

 1class ExampleModel(torch.nn.Module):
 2    def __init__(self):
 3        super().__init__()
 4        self.bias = torch.nn.Parameter(torch.zeros(()))
 5
 6    def forward(self, x):
 7        x = x + 1
 8
 9        # It is important to make sure the result of the print is used.
10        x = poptorch.ipu_print_tensor(x)
11
12        return x + self.bias
13
14

For more information see: poptorch.ipu_print_tensor().

4.5.2. poptorch.identity_loss

This function is used to implement custom losses. This takes in a single PyTorch tensor and will backpropagate a gradient of ones through it.

Warning

Passing a PyTorch loss function or another identity_loss to this function is not supported. Multiple losses must be implemented via composite PyTorch ops.

Listing 4.9 Example of custom loss.
 1def custom_loss(output, target):
 2    # Mean squared error with a scale
 3    loss = output - target
 4    loss = loss * loss * 5
 5    return poptorch.identity_loss(loss, reduction="mean")
 6
 7
 8class ExampleModelWithCustomLoss(torch.nn.Module):
 9    def __init__(self):
10        super().__init__()
11        self.model = ExampleModel()
12
13    def forward(self, input, target):
14        out = self.model(input)
15        return out, custom_loss(out, target)
16
17

For more information see: poptorch.identity_loss().

4.5.3. poptorch.MultiConv

Use the poptorch.MultiConv wrapper class to define multi-convolutions.

Please refer to the PopLibs documentation for multi-convolutions for further information.
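
As a sketch of the intended usage (the layer shapes are arbitrary), the example below executes two independent convolutions as a single multi-convolution by wrapping their calls in a poptorch.MultiConv context:

import torch
import poptorch


class TwoBranches(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_a = torch.nn.Conv2d(4, 4, 3)
        self.conv_b = torch.nn.Conv2d(4, 4, 3)

    def forward(self, x, y):
        # Both convolutions inside the context are planned and executed
        # together as one multi-convolution.
        with poptorch.MultiConv():
            return self.conv_a(x), self.conv_b(y)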

For more information see: poptorch.MultiConv and poptorch.MultiConvPlanType.

4.5.4. poptorch.custom_op

This is for users who are familiar with PopART. If you need some special features that are not natively supported, you may write a PopART custom op. For more information about how to create PopART custom ops, see Creating custom operations and Building custom operators using PopART. You can call such a PopART custom op using poptorch.custom_op in PopTorch.

It takes three steps to enable a PopART custom op in PopTorch.

First, set the Poplar and PopART environment variables as shown in Setting the environment variables and compile the PopART custom op. You can compile your custom op C++ code and link it with Poplar and PopART to generate a dynamic library. Please refer to the custom op code custom_cube_op.cpp and its CMakeLists.txt under poptorch/tests/custom_ops.

Second, load the dynamic library.

Listing 4.10 Loading the library for the PopART custom op
1import ctypes
2import pathlib
3
4myso = list(pathlib.Path("tests").rglob("libcustom_cube_op.*"))
5assert myso, "Failed to find libcustom_cube_op"
6myop = ctypes.cdll.LoadLibrary(myso[0])

Finally, use poptorch.custom_op to call the op, as shown in the examples below.

For more information see: poptorch.custom_op.

In the PopART custom op, both forward op and backward op are implemented. In the PopTorch inference model, only the forward op will be called.

Listing 4.11 Calling a PopART custom op in a PopTorch inference model
 1def test_inference():
 2    class BasicNetwork(nn.Module):
 3        def forward(self, x, bias):
 4            x, y = poptorch.custom_op([x, bias],
 5                                      "Cube",
 6                                      "com.acme",
 7                                      1,
 8                                      example_outputs=[x, x])
 9            return x, y
10

In the code example above, example_outputs is assigned as [x, x], where x is one of the input tensors, used as a template to provide the right number of output tensors. The real outputs will be allocated memory, calculated and returned by the custom op. You can also call this custom op inside a training model using exactly the same poptorch.custom_op interface; the backward op will be called automatically.

You can pass attributes to custom ops using a Python dictionary, as shown by the following code example:

Listing 4.12 Passing an attribute to a PopART custom op from PopTorch
 1    class Model(torch.nn.Module):
 2        def forward(self, x):
 3            x = poptorch.custom_op([x],
 4                                   "LeakyRelu",
 5                                   "com.acme",
 6                                   1,
 7                                   example_outputs=[x],
 8                                   attributes={"alpha": 0.02})
 9            return x[0]
10

You can then obtain the attributes from within the C++ code. The code above passes a Float attribute named alpha to the LeakyReLU implementation described in the Custom operations chapter of the PopART user guide. PopTorch supports all attribute types supported in PopART except for Graph.

Please refer to the following table and code examples for information on how to pass other attribute types to a PopART custom op implementation:

Table 4.1 Python types to use to pass attributes to PopART

  PopART attribute type    Python equivalent
  ---------------------    -------------------------------------------
  Float                    Python float (converted to 32-bit)
  Floats                   list/tuple of Python floats
  Int                      Python int (converted to 64-bit signed int)
  Ints                     list/tuple of Python ints
  String                   Python str (converted to ASCII)
  Strings                  list/tuple of Python strs
  Graph                    Not supported

Listing 4.13 Passing different attribute types from PopTorch
 1def test_many_attributes_examples():
 2    class Model(torch.nn.Module):
 3        def forward(self, x):
 4            attributes = {
 5                "float_one": 1.0,
 6                "float_minus_two": -2.0,
 7                "int_zero": 0,
 8                "int_minus_five": -5,
 9                "floats_one_two_three": [1.0, 2.0, 3.0],
10                "floats_minus_one_two_three": [-1.0, -2.0, -3.0],
11                "ints_one_two_three": [1, 2, 3],
12                "ints_minus_one_two_three": [-1, -2, -3],
13                "a_string": "string with quotes and slash \" ' \\ end",
14                "strs": ["abc", "def", "ghi"]
15            }
16
17            x = poptorch.custom_op([x],
18                                   "ManyAttributeOp",
19                                   "test.poptorch",
20                                   1,
21                                   example_outputs=[x],
22                                   attributes=attributes)

4.5.5. poptorch.nop

PopTorch includes a “no-op” function for debugging purposes.
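
A minimal sketch of its use is shown below; the surrounding arithmetic is only there to give the no-op something to sit between.

import torch
import poptorch


class DebugBlock(torch.nn.Module):
    def forward(self, x):
        x = x * 2
        # poptorch.nop returns its input unchanged; it is provided for
        # debugging purposes.
        x = poptorch.nop(x)
        return x + 1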

For more information see: poptorch.nop().

4.5.6. poptorch.serializedMatMul

Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.
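
The sketch below splits a matrix multiplication into four serial multiplications along the output channels; the serialization mode and factor shown are illustrative, so check the reference for the exact options available.

import torch
import poptorch


class SerializedLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(128, 512))

    def forward(self, x):
        # Split the multiplication into 4 smaller ones along the output
        # channels dimension to reduce peak memory usage.
        return poptorch.serializedMatMul(
            x, self.weight,
            poptorch.MatMulSerializationMode.OutputChannels, 4)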

For more information see: poptorch.serializedMatMul().

4.5.7. poptorch.set_available_memory

Use this function to override the proportion of tile memory available to be used as temporary memory by a convolution or matrix multiplication.
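
For illustration (the proportion 0.3 is arbitrary), the sketch below lowers the amount of temporary memory available to the preceding convolution:

import torch
import poptorch


class ConvBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, 3)

    def forward(self, x):
        x = self.conv(x)
        # Allow the convolution above to use at most 30% of tile memory as
        # temporary memory.
        return poptorch.set_available_memory(x, 0.3)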

For more information see: poptorch.set_available_memory().

4.6. Miscellaneous functions

PopTorch also provides a number of miscellaneous functions that are not related to model creation; see the Reference chapter for details.

4.7. Half / float 16 support

PopTorch supports the half-precision floating point (float 16) format. You can simply input float 16 tensors into your model. (You can convert a tensor to float 16 using tensor = tensor.half().)

You can use your models in one of the following ways:

  1. Convert all parameters (weights) to float 16 by using a Module’s half() method. This is the most memory-efficient option; however, small updates to weights may be lost, hindering training.

  2. Keep the parameters (weights) as float 32, in which case the parameter updates will occur using float 32. However, the parameters will be converted to float 16 if you call an operation with a float 16 input. This is more memory efficient than using float 32 tensors (inputs) but less memory efficient than using float 16 weights.

  3. Use a mix of float 32 and float 16 parameters by manually specifying parameters as float 16 or float 32.

Note

When PyTorch encounters a mix of float 16 and float 32 inputs for a given operation, it will usually cast all inputs to float 32. PopTorch differs and will cast all inputs to float 16. This makes it easier to build models with float 32 weights which take float 16 tensors. However, if you wish to follow PyTorch’s behaviour, you can use opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat), where opts is the poptorch.Options object passed to the model wrapping function.
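
For example, a minimal sketch of opting in to the PyTorch-style behaviour described in the note above:

import poptorch

# Upcast float 16 inputs to float 32 when an operation receives mixed types,
# matching PyTorch's usual behaviour.
opts = poptorch.Options()
opts.Precision.halfFloatCasting(
    poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)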

Listing 4.14 How to run a model using half precision
 1model = torch.nn.Linear(1, 10)
 2
 3# Convert the parameters (weights) to half. Without doing so,
 4# the Linear parameters will automatically be cast to half, which allows
 5# training with float32 parameters but half tensors.
 6model.half()
 7
 8t1 = torch.tensor([1.]).half()
 9
10opts = poptorch.Options()
11
12inference_model = poptorch.inferenceModel(model, opts)
13out = inference_model(t1)
14
15assert out.dtype == torch.half

Because PopTorch relies on the torch.jit.trace API, it is limited to tracing operations which run on the CPU. Many of these operations do not support float 16 inputs. To allow the full range of operations, PopTorch converts all float 16 inputs to float 32 before tracing and then restores the inputs to float 16 as part of the canonicalization process. Some operations may result in the model running in float 32 where float 16 would be expected, or vice versa (see Float 16 operations for full details).

4.8. Profiling

You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.

4.9. Precompilation and caching

4.9.1. Caching

By default, PopTorch will re-compile the model every time you instantiate a model. However, if you often run the same models, you might want to enable executable caching to save time.

You can do this by either setting the POPTORCH_CACHE_DIR environment variable or by calling poptorch.Options.enableExecutableCaching.
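
A minimal sketch of the second approach is shown below; the cache path is illustrative.

import torch
import poptorch

model = torch.nn.Linear(10, 10)

# Compiled executables will be written to, and reloaded from, this directory.
opts = poptorch.Options()
opts.enableExecutableCaching("/tmp/poptorch_cache")

poptorch_model = poptorch.inferenceModel(model, opts)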

Warning

The cache directory might grow large quickly because PopTorch doesn’t evict old models from the cache and, depending on the number and size of your models and the number of IPUs used, the executables might be quite large. It is your responsibility to delete unwanted cache files.

4.9.2. Precompilation

PopTorch supports precompilation: this means you can compile your model on a machine which doesn’t have an IPU and export the executable to a file. You can then reload and execute it on a different machine which does have an IPU.

Important

The PopTorch versions on both machines must be an exact match.

To precompile your model you need to wrap it using either poptorch.trainingModel() or poptorch.inferenceModel() then call compileAndExport() on the wrapper.

Listing 4.15 How to precompile a model using an offline IPU target.
 1import torch
 2import poptorch
 3
 4
 5class ExampleModelWithLoss(torch.nn.Module):
 6    def __init__(self):
 7        super().__init__()
 8        self.fc = torch.nn.Linear(10, 10)
 9        self.loss = torch.nn.MSELoss()
10
11    def forward(self, x, target=None):
12        fc = self.fc(x)
13        if self.training:
14            return fc, self.loss(fc, target)
15        return fc
16
17
18torch.manual_seed(0)
19model = ExampleModelWithLoss()
20
21opts = poptorch.Options()
22# You don't need a real IPU to compile the executable.
23opts.useOfflineIpuTarget(ipu_target_version)
24
25# Wrap the model in our PopTorch annotation wrapper.
26poptorch_model = poptorch.trainingModel(model, opts)
27
28# Some dummy inputs.
29input = torch.randn(10)
30target = torch.randn(10)
31
32poptorch_model.compileAndExport(filename, input, target)

Note

If you don’t know the IPU version on your system you can use poptorch.ipuHardwareVersion().

By default, the exported file will contain your original Torch model (including the weights) and enough information to re-create the PopTorch wrapper and reload the executable.

Important

For your model and weights to be exported, your model must be picklable. See https://docs.python.org/3/library/pickle.html for more information. If your model is not picklable, use export_model=False; see below for a complete example.

The Torch model, the PopTorch wrapper and the executable can then all be restored on the target machine using poptorch.load():

Listing 4.16 How to load a precompiled model.
1poptorch_model = poptorch.load(filename)
2
3# That's all: your model is ready to be used.
4poptorch_model(input, target)  # Run on IPU

In some cases you might want to provide some runtime information to select the device: this can be done using the edit_opts_fn argument of poptorch.load():

Listing 4.17 How to load a precompiled model and run on a specific IPU.
1def setIpuDevice(opts):
2    opts.useIpuId(1)  # always use IPU 1
3
4
5poptorch_model = poptorch.load(filename, edit_opts_fn=setIpuDevice)
6poptorch_model(input, target)  # Run on IPU 1

Note

Only runtime options will be used, as the executable has already been compiled.

Going back to the precompilation step: in some cases you might want to export only the executable and not the Python wrapper or Torch model (for example, if your model cannot be pickled).

Listing 4.18 How to export only the executable.
1poptorch_model.compileAndExport(filename, input, target, export_model=False)

This means you will need to re-create and wrap the model yourself before loading the executable:

Listing 4.19 How to load a precompiled executable.
 1model = ExampleModelWithLoss()
 2
 3opts = poptorch.Options()
 4
 5# Wrap the model in our PopTorch annotation wrapper.
 6poptorch_model = poptorch.trainingModel(model, opts)
 7poptorch_model.loadExecutable(filename)
 8
 9# Some dummy inputs.
10input = torch.randn(10)
11target = torch.randn(10)
12
13poptorch_model(input, target)  # Run on IPU

Important

Exported models lose their connections to other models.

For example, if you have a poptorch.trainingModel() and a poptorch.inferenceModel() based on the same PyTorch model, you wouldn’t usually need to keep the weights synchronised between the two: PopTorch would take care of it implicitly for you.

For example:

Listing 4.20 PopTorch implicit copies.
 1model = ExampleModelWithLoss()
 2
 3opts = poptorch.Options()
 4
 5# Wrap the model in our PopTorch annotation wrapper.
 6training_model = poptorch.trainingModel(model, opts)
 7model.eval()
 8validation_model = poptorch.inferenceModel(model, opts)
 9
10# Some dummy inputs.
11input = torch.randn(10)
12target = torch.randn(10)
13
14# Train the model:
15for epoch in epochs:
16    training_model(input, target)
17
18# Weights are implicitly copied from the training model
19# to the validation model
20prediction = validation_model(input)

If you were to export these models:

Listing 4.21 Precompilation of both training and validation models.
 1model = ExampleModelWithLoss()
 2
 3opts = poptorch.Options()
 4
 5# Some dummy inputs.
 6input = torch.randn(10)
 7target = torch.randn(10)
 8
 9# Wrap the model in our PopTorch annotation wrapper.
10training_model = poptorch.trainingModel(model, opts)
11training_model.compileAndExport("training.poptorch", input, target)
12model.eval()
13validation_model = poptorch.inferenceModel(model, opts)
14validation_model.compileAndExport("validation.poptorch", input)

Note

Don’t forget to call model.eval() or model.train() as required before calling compileAndExport().

You would then either need to insert explicit copy operations:

Listing 4.22 Loading the precompiled models and explicitly copying the weights.
 1training_model = poptorch.load("training.poptorch")
 2validation_model = poptorch.load("validation.poptorch")
 3
 4for epoch in epochs:
 5    print("Epoch ", epoch)
 6    run_training(training_model)
 7    # Need to explicitly copy weights between the two models
 8    # because they're not connected anymore.
 9    training_model.copyWeightsToHost()
10    validation_model.copyWeightsToDevice()
11    run_validation(validation_model)

Or you would need to re-connect the two models by creating the second one from the first one and then loading the executable:

Listing 4.23 Re-connecting the validation model to the training model after loading.
 1training_model = poptorch.load("training.poptorch")
 2# Create a validation python model based on the training model
 3validation_model = poptorch.inferenceModel(training_model)
 4validation_model.model.eval()
 5# Load the executable for that model:
 6validation_model.loadExecutable("validation.poptorch")
 7
 8for epoch in epochs:
 9    print("Epoch ", epoch)
10    run_training(training_model)
11    # Nothing to do: training_model and validation_model are now connected
12    # and PopTorch will implicitly keep the weights in sync between them.
13    run_validation(validation_model)

4.10. Environment variables

4.10.1. Logging level

PopTorch uses the following levels of logging:

  • OFF: No logging.

  • ERR: Errors only.

  • WARN: Warnings and errors only.

  • INFO: Info, warnings and errors. (Default)

  • DEBUG: Adds some extra debugging information.

  • TRACE and TRACE_ALL: Trace everything inside PopTorch.

The POPTORCH_LOG_LEVEL environment variable can be used to set the logging level:

export POPTORCH_LOG_LEVEL=DEBUG

4.10.2. Profiling

When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.

In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'

By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'

For more options, please refer to the PopVision Graph Analyser User Guide.

In order to capture the pvti reports needed for the PopVision System Analyser, you only need to set PVTI_OPTIONS='{"enable":"true"}':
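
For example, mirroring the export form used for the other environment variables in this chapter:

export PVTI_OPTIONS='{"enable":"true"}'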

You can also add extra tracepoints in your own code by using poptorch.profiling.Channel.

4.10.3. IPU Model

By default PopTorch will try to attach to a physical IPU. If, instead, you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:

export POPTORCH_IPU_MODEL=1

Please see the Poplar and PopLibs User Guide for the limitations of the IPU Model.

4.10.4. Wait for an IPU to become available

By default if you try to attach to an IPU but all the IPUs in the system are already in use, an exception will be raised. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1.

export POPTORCH_WAIT_FOR_IPU=1

4.10.5. Enable executable caching

This can be done by either setting the POPTORCH_CACHE_DIR environment variable or by calling poptorch.Options.enableExecutableCaching.

See also

For more information, see Caching.

export POPTORCH_CACHE_DIR=/tmp/poptorch_cache