3. Features

3.1. Options

The compilation and execution on the IPU can be controlled using poptorch.Options:

See Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation, and replication_factor interact with the output and input sizes.

class poptorch.Options

Options controlling how a model is run on the IPU.

property Distributed

Options specific to distributed execution.

property Jit

Options specific to upstream PyTorch’s JIT compiler.

property Popart

Options specific to the PopART backend. (Advanced users only).

property TensorLocations

Options related to tensor locations.

property Training

Options specific to training.

anchorMode(anchor_mode, anchor_return_period=None)

Specify which data to return from a model

Parameters

anchor_mode (poptorch.AnchorMode) –

  • All: Return a result for each batch.

  • Sum: Return the sum of all the batches.

  • Final: Return the last batch.

  • EveryN: Return every N batches: N is passed in as anchor_return_period.

  • Default: All for inference, Final for training.

For example:

>>> opts = poptorch.Options()
... opts.anchorMode(poptorch.AnchorMode.All)
... # or
... opts.anchorMode(poptorch.AnchorMode.EveryN, 10)
autoRoundNumIPUs(auto_round_num_ipus)

Whether or not to round up the number of IPUs used automatically: the number of IPUs requested must be a power of 2 or a multiple of 64. By default, an error occurs if an unsupported number of IPUs is used by the model, to prevent unintentional overbooking of IPUs.

Parameters

auto_round_num_ipus (bool) –

  • True: round up the number of IPUs to a power of 2 or multiple of 64 automatically

  • False: error if the number of IPUs is not supported

connectionType(connection_type)

When to connect to the IPU (if at all)

Parameters

connection_type (poptorch.ConnectionType) –

  • Always: Attach to the IPU from the start (Default).

  • OnDemand: Wait until the compilation is complete and the executable is ready to be run to attach to the IPU.

  • Never: Never try to attach to an IPU. (Useful for offline compilation, but trying to run an executable will raise an exception).

For example:

>>> opts = poptorch.Options()
... opts.connectionType(poptorch.ConnectionType.OnDemand)
defaultAnchorMode()
Returns

True if the anchorMode is currently set to Default; False otherwise

Return type

bool

deviceIterations(device_iterations)

Number of iterations the device should run over the data before returning to the user. (Default: 1)

Essentially, it is the equivalent of launching the IPU in a loop over that number of batches. This is efficient because that loop runs on the IPU directly.
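For example, a minimal sketch of making the IPU loop over four batches per host call (only the option call itself is shown; data handling is unchanged):

>>> opts = poptorch.Options()
>>> opts.deviceIterations(4)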

enableExecutableCaching(path)

If path is None: disable executable caching.

Otherwise use path as a cache to save / load Poplar executables.

logDir(log_dir)

Where to save log files (Default: Current directory)

randomSeed(random_seed)

Set the seed for the random number generator on the IPU.

replicationFactor(replication_factor)

Number of model replications (Default: 1).

For example if your model uses 1 IPU, a replication factor of 2 will use 2 IPUs. If your model is pipelined across 4 IPUs, a replication factor of 4 will use 16 IPUs total.

setAvailableMemoryProportion(available_memory_proportion)

Memory is set on a per-IPU basis; this should be a dictionary of IPU IDs and float values between 0 and 1.

For example: {"IPU0": 0.5}

setExecutionStrategy(strategy)

Set the execution strategy to use to partition the graph

syncPattern(sync_pattern)

Set the IPU SyncPattern.

Parameters

sync_pattern (poptorch.SyncPattern) –

  • Full

  • SinglePipeline

  • ReplicaAndLadder

useIpuId(ipu_id)

Use the specified IPU id as provided by gc-info.

The number of IPUs associated with the id must be equal to the number of IPUs used by your graph multiplied by the replication factor.

For example if your model uses 1 IPU and the replication factor is 2 you will need to provide an id with 2 IPUs.

If your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide an id containing 16 IPUs in total.

Parameters

ipu_id (int) – IPU id as provided by gc-info.

useIpuModel(use_model)

Use the IPU model or physical hardware.

Default: False (Real Hardware).

This setting takes precedence over the POPTORCH_IPU_MODEL environment variable.

useOfflineIpuTarget(ipu_version=1)

Create an offline IPU target that can only be used for offline compilation.

Note

The offline IPU target cannot be used if the IPU model is enabled.

Parameters

ipu_version (int) – IPU version to target (1 for mk1, 2 for mk2). Default: 1.
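For example, a sketch of preparing options for offline compilation of a Mk2 target; combining this with connectionType(Never) is an assumption about typical offline-compilation usage, based on the connectionType() description above:

>>> opts = poptorch.Options()
>>> opts.useOfflineIpuTarget(2)
>>> opts.connectionType(poptorch.ConnectionType.Never)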

You can choose to use the IPU model or the real IPU hardware via poptorch.Options.useIpuModel.

class poptorch.options._DistributedOptions

Options related to distributed execution.

Can be accessed via poptorch.Options.Distributed:

>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)
configureProcessId(process_id, num_processes)

Manually set the current process ID and the total number of processes.

Parameters
  • process_id (int) – The ID of this process.

  • num_processes (int) – The total number of processes the execution is distributed over.

disable()

Ignore the current options / environment variables and disable distributed execution.

property numProcesses

Total number of processes the execution is distributed over.

property processId

Id of the current process.

setEnvVarNames(var_num_processes, var_process_id)

Utility to read and set processId and numProcesses from environment variables.

Useful if you use a third party library to manage the processes used for the distributed execution such as mpirun.

For example: mpirun -np 4 myscript.py

By default the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.
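For example, a sketch that makes the OpenMPI defaults mentioned above explicit:

>>> opts = poptorch.Options()
>>> opts.Distributed.setEnvVarNames("OMPI_COMM_WORLD_SIZE",
...                                 "OMPI_COMM_WORLD_RANK")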

class poptorch.options._JitOptions

Options related to Pytorch’s JIT compiler.

Can be accessed via poptorch.Options.Jit:

>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)
traceModel(trace_model)

If True: use torch.jit.trace

If False: use torch.jit.script (Experimental)

Trace model is enabled by default.

class poptorch.options._TrainingOptions

Options specific to model training.

Note

You must not set these options for inference models.

Can be accessed via poptorch.Options.Training:

>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)
gradientAccumulation(gradient_accumulation)

Number of samples to accumulate for the gradient calculation.

Accumulate the gradient N times before applying it. This is needed to train with models expressing pipelined model parallelism using the IPU annotation. This is due to weights being shared across pipeline batches so gradients will be updated and used by subsequent batches out of order.

Might be called “pipeline depth” in some other frameworks.

class poptorch.options._PopartOptions

Options specific to the PopART backend.

Only for advanced users.

Any option from popart.SessionOptions can be set using this class.

Note

There is no mapping for the various PopART enums, so integers need to be used instead.

Can be accessed via poptorch.Options.Popart:

>>> opts = poptorch.Options()
>>> opts.Popart.set("autoRecomputation", 3) # RecomputationType::Pipeline
>>> opts.Popart.set("syntheticDataMode",
>>>                  int(popart.SyntheticDataMode.RandomNormal))
setPatterns(patterns, level=2)

Override the default patterns of Popart’s compiler.

Parameters
  • patterns (dict(str,bool)) – Dictionary of pattern names to enable / disable.

  • level (int) – Integer value corresponding to the popart::PatternsLevel to use to initialise the Patterns.

class poptorch.options._TensorLocationOptions(**default_values)

Options controlling where tensors are stored.

Can be accessed via poptorch.Options.TensorLocations:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
setAccumulatorLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the accumulators.

setActivationLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the activations.

setOptimizerLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the optimizer states.

setWeightLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the weights.

class poptorch.TensorLocationSettings(**default_values)

Define where a tensor is stored

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
minElementsForOffChip(min_elements)

A minimum number of elements below which offloading won’t be considered.

minElementsForReplicatedTensorSharding(min_elements)

Only enable Replicated Tensor Sharding (RTS) for tensors with more than min_elements elements.

useIOTilesToLoad(use=True)

Load tensor through IO tiles

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

useIOTilesToStore(use=True)

Use IO tiles to store tensors.

(relevant for replicated tensor sharded tensors)

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

useOnChipStorage(use=True)

Permanent tensor storage

Parameters

use (bool) – True: use on chip memory, False: use off chip memory. None: keep it undefined.

useReplicatedTensorSharding(use=True)

Enable replicated tensor sharding

(relevant for weights and optimizer states)
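As a further sketch, optimizer state could be placed off chip with replicated tensor sharding enabled for larger tensors. This assumes, as in the example at the top of this class, that each setter returns the settings object so calls can be chained; the threshold value is illustrative only:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False)
...     .useReplicatedTensorSharding(True)
...     .minElementsForReplicatedTensorSharding(4096))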

3.2. Model wrapping functions

The basis of PopTorch integration comes from these two model wrapping functions.

3.2.1. poptorch.trainingModel

poptorch.trainingModel(model, options=None, optimizer=None)

Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (poptorch.Options) – The IPU specific options.

  • optimizer (torch.optim.Optimizer) – The optimizer to apply during training.

Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

Listing 3.1 An example of the use of poptorch.trainingModel()
 1import poptorch
 2import torch
 3
 4
 5class ExampleModelWithLoss(torch.nn.Module):
 6    def __init__(self):
 7        super().__init__()
 8        self.fc = torch.nn.Linear(10, 10)
 9        self.loss = torch.nn.MSELoss()
10
11    def forward(self, x, target=None):
12        fc = self.fc(x)
13        if self.training:
14            return fc, self.loss(fc, target)
15        return fc
16
17
18torch.manual_seed(0)
19model = ExampleModelWithLoss()
20
21# Wrap the model in our PopTorch annotation wrapper.
22poptorch_model = poptorch.trainingModel(model)
23
24# Some dummy inputs.
25input = torch.randn(10)
26target = torch.randn(10)
27
28# Train on IPU.
29for i in range(0, 100):
30    # Each call here executes the forward pass, loss calculation, and backward
31    # pass in one step.
32    # Model input and loss function input are provided together.
33    poptorch_out, loss = poptorch_model(input, target)
34    print(f"{i}: {loss}")
35
36# Copy the trained weights from the IPU back into the host model.
37poptorch_model.copyWeightsToHost()
38
39# Execute the trained weights on host.
40model.eval()
41native_out = model(input)
42
43# Models should be very close to native output although some operations are
44# numerically different and floating point differences can accumulate.
45torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-06, atol=1e-06)

3.2.2. poptorch.inferenceModel

poptorch.inferenceModel(model, options=None)

Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (poptorch.Options) – The IPU specific options

Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

Listing 3.2 An example of the use of poptorch.inferenceModel()
 1import poptorch
 2import torch
 3import torchvision
 4
 5# Some dummy imagenet sized input.
 6picture_of_a_cat_here = torch.randn([1, 3, 224, 224])
 7
 8# The model, in this case a MobileNet model with pretrained weights that comes
 9# canned with Pytorch.
10model = torchvision.models.mobilenet_v2(pretrained=True)
11model.train(False)
12
13# Wrap in the PopTorch inference wrapper
14inference_model = poptorch.inferenceModel(model)
15
16# Execute on IPU.
17out_tensor = inference_model(picture_of_a_cat_here)
18
19# Get the top 5 ImageNet classes.
20top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
21print(top_five_classes)
22
23# Try the same on native PyTorch
24native_out = model(picture_of_a_cat_here)
25
26native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)
27
28# Models should be very close to native output although some operations are
29# numerically different and floating point differences can accumulate.
30assert any(top_five_classes[1][0] == native_top_five_classes[1][0])

3.2.3. poptorch.PoplarExecutor

class poptorch.PoplarExecutor(model, options, training, optimizer=None, user_model=None)

This class should not be created directly but is a wrapper around the model that was passed into inferenceModel or trainingModel. It only has a few methods which can be used to interface with the IPU.

__call__(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Note

The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.

compile(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Trace and compile the wrapped model if no executable has been created yet.

copyWeightsToDevice()

Copies the weights from model.parameters() to the IPU device. Implicitly called on first call.

copyWeightsToHost()

Updates the parameters used in model with the weights stored on device. (The weights in model.parameters())

destroy()

Destroy the model: release the IPUs and the executable.

setOptimizer(optimizer)

Sets the optimiser for a training model. Will overwrite the previous one. Supported optimisers: optim.SGD, optim.AdamW, optim.RMSprop.

Note

The PoplarExecutor will implicitly keep in sync the parameters of the source PyTorch model and the PopTorch model(s). However, weights need to be explicitly copied if the model is trained on the CPU and inference is run on the IPU.

model = Model()
poptorch_train = poptorch.trainingModel(model)
poptorch_inf = poptorch.inferenceModel(model)

train(poptorch_train)
torch.save(model.state_dict(), "model.save") # OK
validate(poptorch_inf) # OK
validate(model) # OK

train(model)
# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)

3.3. Parallel execution

This section demonstrates multi-IPU strategies for parallel execution in PopTorch. We recommend that you start such parallel programming from PopTorch code that is working properly on a single IPU.

There are four kinds of execution strategies in total to run a model on a multi-IPU device: poptorch.ShardedExecution, poptorch.PipelinedExecution, poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. These execution strategies are set through poptorch.Options.setExecutionStrategy(). The default execution strategy is poptorch.PipelinedExecution. In the following, we first introduce the general APIs that apply to all four parallel execution strategies. Then we explain the four strategies with examples.

By default, PopTorch will not let you run the model if the number of IPUs is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. However, you can also enable poptorch.Options.autoRoundNumIPUs() to automatically round up the number of IPUs reserved to a power of 2, with the excess being reserved but idle. This option is not enabled by default to prevent unintentional overbooking of IPUs.

3.3.1. Annotation tools

poptorch.Block and poptorch.BeginBlock

poptorch.BeginBlock and poptorch.Block are indispensable wrapper classes to define model parallelism in a multi-IPU device. You can use poptorch.Block to define a scope in the context of the model.

class poptorch.Block(user_id=None, ipu_id=None)

Runs all layers called inside this scope on a specified IPU.

>>> with poptorch.Block("IPU0"):
...     self.layer = MyLayer(x)
__init__(user_id=None, ipu_id=None)
Parameters
  • user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

static useAutoId()

Call this method at the beginning of your forward() method to enable automatic block id generation.

Blocks with a None user_id will be assigned an automatic id which will be the index of this block in the list of id-less Blocks.

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"): # user_id = "special_block"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()

poptorch.BeginBlock is an annotation defined outside the model and applied to the current layer and all subsequent layers.

class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)

Runs all layers from the given layer until the beginning of the next block on a specified IPU.

All layers after this layer will also run on the same IPU until another BeginBlock is encountered.

By default PipelinedExecution will be used, however this can be overridden in the poptorch.Options.

>>> self.layer = poptorch.BeginBlock(MyLayer(x))
__init__(layer_to_call, user_id=None, ipu_id=None)

All subsequent layers of the network will be part of this block until another layer is wrapped.

Parameters
  • layer_to_call (torch.nn.Module) – The layer to run on the specified IPU.

  • user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually create Stages and Phases.

  • ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

poptorch.BeginBlock or poptorch.Block alone is enough to enable parallel execution in the simplest case. By default, the layers before the first poptorch.BeginBlock will be placed on IPU 0. The complete code examples for poptorch.BeginBlock and poptorch.Block are shown below. All layers before model.bert.encoder.layer[0] will be on IPU 0 and all layers from model.bert.encoder.layer[0] onwards (inclusive) will be on IPU 1.

Listing 3.3 Annotations can be attached to layers in existing models.
 1import transformers
 2import torch
 3import poptorch
 4
 5# A bert model from hugging face. See the packaged BERT example for actual usage.
 6pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'
 7model = transformers.BertForQuestionAnswering.from_pretrained(
 8    pretrained_weights)
 9
10# A handy way of seeing the names of all the layers in the network.
11print(model)
12
13# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
14# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
15model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
16                                                  ipu_id=1)
17
18# Now all layers before layer are on IPU 1 and this layer onward is on IPU 2
19model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
20                                                  ipu_id=2)
21
22# Finally all layers from this layer till the end of the network are on IPU 3.
23model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
24                                                  ipu_id=3)
25
26# We must batch the data by at least the number of IPUs. Each IPU will still execute
27# whatever the model batch size is.
28data_batch_size = 4
29
30# Create a poptorch.Options instance to override default options
31opts = poptorch.Options()
32opts.deviceIterations(data_batch_size)
Listing 3.4 PopTorch also supports annotating the model directly. Both forms can be used interchangeably.
 1class Network(torch.nn.Module):
 2    def __init__(self):
 3        super().__init__()
 4        self.layer1 = torch.nn.Linear(5, 10)
 5        self.layer2 = torch.nn.Linear(10, 5)
 6        self.layer3 = torch.nn.Linear(5, 5)
 7        self.layer4 = torch.nn.Linear(5, 5)
 8
 9        self.act = torch.nn.ReLU()
10        self.softmax = torch.nn.Softmax(dim=1)
11
12    def forward(self, x):
13
14        # Explicit layers on a certain IPU
15        poptorch.Block.useAutoId()
16        with poptorch.Block(ipu_id=0):
17            x = self.act(self.layer1(x))
18
19        with poptorch.Block(ipu_id=1):
20            x = self.act(self.layer2(x))
21
22        with poptorch.Block(ipu_id=2):
23            x = self.act(self.layer3(x))
24            x = self.act(self.layer4(x))
25
26        with poptorch.Block(ipu_id=3):
27            x = self.softmax(x)
28        return x
29
30
31model = Network()
32opts = poptorch.Options()
33opts.deviceIterations(4)
34poptorch_model = poptorch.inferenceModel(model, options=opts)
35print(poptorch_model(torch.rand((4, 5))))

Both poptorch.BeginBlock and poptorch.Block need to follow a set of rules:

  • All the layers must be declared inside a poptorch.Block scope; this avoids missing annotations. poptorch.BeginBlock doesn’t have the same constraint because all the layers called after it will automatically be added to the last poptorch.BeginBlock.

  • Please note that PopTorch needs to reserve IPUs in powers of 2 or multiples of 64. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.

  • Unused or dead layers should NOT be included in any poptorch.BeginBlock or poptorch.Block.

  • If layer A happens before layer B inside the model and each layer has a poptorch.BeginBlock associated with it, you need to write poptorch.BeginBlock for layer A before poptorch.BeginBlock for layer B.

Failing to obey the above rules will result in compilation errors.

poptorch.Stage and poptorch.AutoStage

Conceptually, poptorch.BeginBlock or poptorch.Block collects the layers of a model into a poptorch.Stage; multiple stages can be combined into a poptorch.Phase, and multiple phases form a parallel execution strategy.

poptorch.Stage

poptorch.Stage defines some layers of the model to run on one IPU. It can be made of one or more blocks created using poptorch.BeginBlock or poptorch.Block and identified by their user_id. Consecutive layers in a model can be defined either in the same poptorch.Stage or in consecutive stages. Whether stages run in parallel or sequentially depends on the specific parallel execution strategy.

class poptorch.Stage(*block_ids)

The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.

__init__(*block_ids)

Internally, each operation in a model is assigned a stage_id through poptorch.Stage.

poptorch.AutoStage

You can use poptorch.AutoStage if you don’t want to specify poptorch.Stage by hand. It will assign one poptorch.Stage per poptorch.BeginBlock or poptorch.Block.

class poptorch.AutoStage(value)

Defines how the stages are automatically assigned to blocks when the user didn’t explicitly provide stages to the IExecutionStrategy’s constructor.

  • SameAsIpu: The stage id will be set to the selected ipu number.

  • AutoIncrement: The stage id for new blocks is automatically incremented.

Examples:

>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...  layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...  layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...  layer()

By default, the following execution strategy is used:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)

which would translate to stage_id = ipu_id:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=0

Now if instead you use:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)

The last block would be in its own stage rather than sharing one with Block “0”:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=2

By default poptorch.AutoStage.SameAsIpu is in use, which means the stage_id of poptorch.Stage will be set to the ipu_id specified for the poptorch.BeginBlock or poptorch.Block. Please note that stage_id must be ascending in poptorch.PipelinedExecution. Let’s use the code example above. If your blocks “0”, “1”, and “2” are assigned to IPUs 0, 1, and 0, then poptorch.Block “2” will be assigned stage_id 0. This will make the compiler fail to schedule the last two stages “1” and “2” due to a conflict:

  • The model implies “1” should run earlier than “2”.

  • Their stage_id values suggest “2” should run earlier than “1”.

When poptorch.AutoStage.AutoIncrement is in use, each new poptorch.BeginBlock or poptorch.Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.

poptorch.Phase

poptorch.Phase defines a processing unit of phased execution. It may contain one or more poptorch.Stage. poptorch.Phase is only used in poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. It is not used in poptorch.ShardedExecution and poptorch.PipelinedExecution.

class poptorch.Phase(arg)

Represents an execution phase

__init__(arg)

Create a phase.

Parameters

arg (str, poptorch.Stage, [poptorch.Stage], [str]) – must either be one or more Stages, or one or more block user_ids.

If one or more strings are passed they will be interpreted as Block ids representing a single Stage.

Within a Phase, the stages will be executed in parallel.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = Phase("A","B") # One Stage made of 2 blocks

In the second-to-last line above, “A” and “B” will run in parallel on IPUs 0 and 1 simultaneously since they are placed in two stages. In the last line, they will run sequentially on one IPU since they are placed in a single stage.

Advanced annotation with strings

You can use Python strings to represent the user_id and ipu_id for a poptorch.Block or poptorch.BeginBlock. Since strings are evaluated at runtime, they allow for a dynamic number of stages and phases. Here is an example of using formatted strings (f-strings) in poptorch.ParallelPhasedExecution.

In the code example below, f-strings are used on two lines in the forward() method. One is f"phase{phase}_ipu{ipu}" at line 25, where phase takes the values 0, 1, 1, 2, 3, 3, 4, 5, and 5, and ipu ranges from 0 to 1. The total number of instances of this f-string is 12, due to 6 phases and 2 IPUs. The other is f"phase{N*2-1}_ipu1" at line 32, where phase is 5 and ipu is 1. When defining poptorch.Stage, four f-strings are used, where n ranges from 0 to 2, at lines 47-48 and 51-52:

  • f"phase_{2*n}_ipu0"

  • f"phase{2*n}_ipu1"

  • f"phase_{2*n+1}_ipu0"

  • f"phase{2*n+1}_ipu1"

They refer to phases 0, 2, 4 and 1, 3, 5, with ipu0 and ipu1 respectively. So all these 12 f-strings are defined in poptorch.Block and used in poptorch.Stage dynamically. They match exactly.

Listing 3.5 An example of parallel phased execution
 1poptorch.setLogLevel(1)  # Force debug logging
 2N = 3
 3size = 10
 4
 5
 6class Model(torch.nn.Module):
 7    def __init__(self):
 8        super().__init__()
 9        self.weights = []
10        for n in range(N * 6):
11            weight = torch.nn.Parameter(torch.rand(size, size),
12                                        requires_grad=True)
13            self.register_parameter(f"w{n}", weight)
14            self.weights.append(weight)
15
16    def forward(self, in0, target=None):
17        phase = 0
18        weight = iter(self.weights)
19        with poptorch.Block("phase0_ipu0"):
20            ins = torch.split(in0, size)
21        for n in range(N * 3):
22            out = []
23            for ipu in range(2):
24                x = ins[ipu]
25                with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                    x = torch.matmul(next(weight), x)
27                    out.append(F.relu(x))
28            ins = out[1], out[0]
29            # We want 2 matmuls in the same phase
30            if n % 3 != 1:
31                phase += 1
32        with poptorch.Block(f"phase{N*2-1}_ipu1"):
33            res = ins[0] + ins[1]
34            if target is None:
35                return res
36            return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39input = torch.rand(size * 2, 1)
40target = torch.rand(size, 1)
41model = Model()
42opts = poptorch.Options()
43phases = []
44# Alternate between 0-2 and 1-3
45for n in range(N):
46    phases.append([
47        poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48        poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49    ])
50    phases.append([
51        poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52        poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53    ])
54opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55poptorch_model = poptorch.trainingModel(model, opts)
56poptorch_model.compile(input, target)

3.3.2. Parallel execution strategies

With the above APIs as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below. Note that the same annotation can be used for each of them. They only differ in the method of parallelisation and tensor locations.

poptorch.ShardedExecution

In this strategy, each IPU will sequentially execute a distinct part of the model. The single unit of processing in poptorch.ShardedExecution is a shard. A shard is specified using poptorch.Stage, or, if no poptorch.Stage is specified, the user_id passed by poptorch.BeginBlock or poptorch.Block is used. Each shard is executed sequentially on a single IPU. Multiple shards can be placed on multiple IPUs. However, only one IPU is used at a time, while the other IPUs are idle. If an IPU is allocated to run consecutive stages, PopART will merge those consecutive stages into one on the same IPU. Weights and activations will use the on-chip memory of the IPUs. Layers sharing weights need to be placed on the same IPU.

poptorch.ShardedExecution can be useful for processing a single sample or debugging. Overall it has low efficiency since only one IPU is used at a time.

class poptorch.ShardedExecution(*args)

Will shard the execution of the passed Stages or if no stage is passed will consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on the block names
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

poptorch.PipelinedExecution

This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.

Parallelisation in poptorch.PipelinedExecution requires deviceIterations() and gradientAccumulation(), as explained in Efficient data batching. After one poptorch.Stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel. An IPU can only start its own poptorch.Stage of a batch if its previous poptorch.Stage of the current batch has been processed. Hence, all IPUs will be occupied after a warm-up period. A cool-down period is required to aggregate the results and apply weight changes.
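As a sketch of the options typically set together for pipelining (the values are illustrative only; see Efficient data batching for how they interact with input and output sizes):

>>> opts = poptorch.Options()
>>> opts.deviceIterations(8)
>>> opts.Training.gradientAccumulation(4)
>>> opts.setExecutionStrategy(
...     poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu))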

class poptorch.PipelinedExecution(*args)
__init__(*args)

Pipeline the execution of the passed Stages or if no stage is passed consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Create a 3 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A","B","C"))
>>> # Create a 2 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...    poptorch.Stage("A","B"),
...    "C"))
>>> # Automatically create a 3 stages pipeline based on the block names
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

Phased execution

poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution have the following features in common:

  • A portion of the weights and activations is transferred to and from streaming memory before and after each phase.

  • If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.

  • This specific portion is the one needed by the layers of the model wrapped in poptorch.BeginBlock or poptorch.Block in the current poptorch.Phase.

  • They both trade off some performance for larger models with higher memory needs.

  • Any number of phases is allowed.

  • The number of stages in each poptorch.Phase should match the number of IPUs in each group of IPUs.

  • Stages inside each poptorch.Phase can run in parallel.

Although you only define the poptorch.Phase for the forward pass, the corresponding phases for the backward pass are created automatically. The order of phased execution for the backward pass won’t change, but you can decide whether a phase is shared by both the forward and backward passes. In other words, you decide whether to avoid a memory transfer of a portion of the weights and activations.

poptorch.SerialPhasedExecution

In poptorch.SerialPhasedExecution, phases execute on a single group of IPUs sequentially.

class poptorch.SerialPhasedExecution(*phases)

All the phases run serially on a single group of IPUs.

For example:

  • phase 0 runs on ipu 0 & 1

  • phase 1 runs on ipu 0 & 1

  • phase 2 runs on ipu 0 & 1

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0,1)
>>> strategy.phase(1).ipus(0,1)
>>> strategy.phase(2).ipus(0,1)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of the phases. Must be either:

  • a list of poptorch.Phase

  • a list of lists of poptorch.Stage

  • a list of lists of block user_ids (str)

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

setTensorsLiveness(liveness)

See poptorch.Liveness for more information

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4

poptorch.ParallelPhasedExecution

In poptorch.ParallelPhasedExecution, phases are executed in parallel alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from the streaming memory, if the desired weights and activations are already available in another group of IPUs.

class poptorch.ParallelPhasedExecution(*phases)

Phases are executed in parallel alternating between two groups of IPUs.

For example:

  • phase 0 runs on ipu 0 & 2

  • phase 1 runs on ipu 1 & 3

  • phase 2 runs on ipu 0 & 2

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
>>> with poptorch.Block(): # user_id = "2"
...     layer()
>>> with poptorch.Block(): # user_id = "3"
...     layer()
>>> with poptorch.Block(): # user_id = "4"
...     layer()
>>> with poptorch.Block(): # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0,2)
>>> strategy.phase(1).ipus(1,3)
>>> strategy.phase(2).ipus(0,2)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of the phases. Must be either:

  • a list of poptorch.Phase

  • a list of lists of poptorch.Stage

  • a list of lists of block user_ids (str)

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4

In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so these two numbers match. Even phases 0 and 2 run on IPU 0 and 2, while odd phase 1 runs on IPU 1 and 3 as required. This allows for faster cross-IPU copies, both inter-phase and intra-phase.

poptorch.Liveness

poptorch.Liveness controls the availability of tensors on IPU, and is only needed for poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution.

class poptorch.Liveness(value)

When using phased execution:

  • AlwaysLive: The tensors always stay on the IPU between the phases.

  • OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.

  • OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.

The default poptorch.Liveness is AlwaysLive. OffChipAfterFwd and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.
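For example, a sketch of selecting OffChipAfterFwd through setTensorsLiveness() on a serial phased strategy; blocks "A" and "B" are assumed to have been defined in the model's forward method:

>>> strategy = poptorch.SerialPhasedExecution(
...     poptorch.Phase(poptorch.Stage("A")),
...     poptorch.Phase(poptorch.Stage("B")))
>>> strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)
>>> opts = poptorch.Options()
>>> opts.setExecutionStrategy(strategy)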

3.4. Optimizers

You can use a number of optimizers with PopTorch. In addition, PopTorch has features to support float16 models, such as loss scaling.

class poptorch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, loss_scaling=1.0, velocity_scaling=1.0)

Stochastic gradient descent with optional momentum.

The optimizer matches PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.

Nesterov momentum is not currently supported.

__init__(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, loss_scaling=1.0, velocity_scaling=1.0)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float) – learning rate.

  • momentum (float, optional) – momentum factor.

  • dampening (float, optional) – dampening term for momentum.

  • weight_decay (float, optional) – Weight decay (L2 penalty) factor.

  • nesterov (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • velocity_scaling (float, optional) – Factor by which to scale the velocity values to assist numerical stability when using float16.

class poptorch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, loss_scaling=1.0, biasCorrection=True, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)

Adam optimizer with true weight decay.

This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.

AMSGrad is currently not supported.

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, loss_scaling=1.0, biasCorrection=True, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – Weight decay factor.

  • amsgrad (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accumType (torch.dtype, optional) – data type used for gradients.

  • firstOrderMomentumAccumType (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • secondOrderMomentumAccumType (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

class poptorch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, loss_scaling=1.0)

RMSprop optimizer with optional L2 penalty.

This optimizer matches PyTorch’s implementation (torch.optim.RMSprop) with optional loss scaling.

__init__(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, loss_scaling=1.0)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate.

  • alpha (float, optional) – smoothing constant.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – L2 penalty coefficient.

  • momentum (float, optional) – momentum factor.

  • centered (bool, optional) – True: compute centred RMSProp in which the gradient is normalized by an estimate of its variance.

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

class poptorch.optim.LAMB(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, biasCorrection=True, loss_scaling=1.0, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)

Layer-wise Adaptive Moments (LAMB) optimizer (biased version).

Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).

The scaling function phi(z) is fixed as min(z, mwn); mwn is fixed at 10.0.

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, biasCorrection=True, loss_scaling=1.0, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • betas (tuple, optional) – (beta1, beta2) parameters used in LAMB.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – (AdamW) weight decay factor.

  • accumType (torch.dtype, optional) – data type used for gradients.

  • firstOrderMomentumAccumType (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • secondOrderMomentumAccumType (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

step(closure=None)

Performs a single optimization step (parameter update).

Parameters

closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Note

Unless otherwise specified, this function should not modify the .grad field of the parameters.

3.4.1. Loss scaling

When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing. Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.

Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
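For example, a minimal sketch of passing a loss scaling factor to one of the PopTorch optimizers; the value 1024.0 is purely illustrative and model is assumed to be an existing float16 training model:

>>> optimizer = poptorch.optim.AdamW(model.parameters(), lr=1e-3,
...                                  loss_scaling=1024.0)
>>> poptorch_model = poptorch.trainingModel(model, optimizer=optimizer)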

3.4.2. Velocity scaling (SGD only)

The SGD optimizer, when used with momentum, updates weights based on the velocity values. At each update step, the new velocity is a combination of the gradients derived from the loss function and the previous velocity value. Similar to loss scaling, the velocity_scaling parameter allows the velocity values to be scaled to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling so the loss_scaling has no impact on the effective scaling of velocity parameters.)

As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
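For example, a sketch combining both scaling factors with momentum SGD; the values are illustrative only:

>>> optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
...                                loss_scaling=1024.0, velocity_scaling=128.0)
>>> poptorch_model = poptorch.trainingModel(model, optimizer=optimizer)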

3.5. Custom ops

These are helper operations to be used within a model.

3.5.1. poptorch.ipu_print_tensor

class ipu_print_tensor(tensor_to_print, optional_title)

Adds a tensor to be printed on the IPU. When this is executed the tensor will be copied back to host and printed.

When this operation is called in the backward pass it will print the gradient of the tensor.

The operation is an identity operation and it will return the exact same tensor. The returned tensor should be used in place of the original tensor in the rest of the program, to make sure that the print operation isn’t optimised away.

For example if the original code looks like this:

def forward(self, c, d, b):
  a = c + d
  return a + b

And you want to print the value of a. If you do:

def forward(self, c, d, b):
  a = c + d
  poptorch.ipu_print_tensor(a)
  return a + b

Optionally, you may add a second string parameter to be used as a title.


def forward(self, c, d, b):
  a = c + d
  poptorch.ipu_print_tensor(a, "summation")
  return a + b

The result of ipu_print_tensor is not used, therefore it will be optimised out by the graph optimiser and a will not be printed.

Instead you should do:

def forward(self, c, d, b):
  a = c + d
  x = poptorch.ipu_print_tensor(a)
  return x + b

Warning

In order for the print operation to not be optimised out by the graph optimiser, you must use the output of the print.

Parameters

ipu_print_tensor – The tensor to print.

Returns

The input unchanged.

 1class ExampleModel(torch.nn.Module):
 2    def __init__(self):
 3        super().__init__()
 4        self.bias = torch.nn.Parameter(torch.zeros(()))
 5
 6    def forward(self, x):
 7        x += 1
 8
 9        # It is important to make sure the result of the print is used.
10        x = poptorch.ipu_print_tensor(x)
11
12        return x + self.bias

3.5.2. poptorch.identity_loss

This function is used to implement custom losses. This takes in a single PyTorch tensor and will backpropagate a gradient of ones through it.

Warning

Passing a PyTorch loss function or another identity_loss to this function is not supported. Multiple losses must be implemented via composite PyTorch ops.

poptorch.identity_loss(x, reduction)

Marks this operation as being part of the loss calculation and, as such, will back-propagate through it in the PopTorch autograd. This enables multiple losses and custom losses.

Parameters
  • loss (torch.Tensor) – The calculated loss.

  • reduction (str) –

    Reduce the loss output as per PyTorch loss semantics. Supported values are:

    • "sum": Sum the losses.

    • "mean": Take the mean of the losses.

    • "none": Don’t reduce the losses.

Returns

An identity loss custom op.

 1def custom_loss(output, target):
 2    # Mean squared error with a scale
 3    loss = output - target
 4    loss = loss * loss * 5
 5    return poptorch.identity_loss(loss, reduction="mean")
 6
 7
 8class ExampleModelWithCustomLoss(torch.nn.Module):
 9    def __init__(self):
10        super().__init__()
11        self.model = ExampleModel()
12
13    def forward(self, input, target):
14        out = self.model(input)
15        return out, custom_loss(out, target)

3.5.3. poptorch.MultiConv

Use poptorch.MultiConv wrapper class to define multi-convolutions.

class poptorch.MultiConv

Combines all convolution layers evaluated inside this scope into a single multi-convolution.

Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.

For example:

>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)

Combines the two data-independent convolutions into a single multi-convolution.

Refer to the PopLibs documentation for further information on multi-convolutions.

availableMemoryProportions(value)

The available memory proportion per convolution, each in the range [0, 1).

Parameters

value (float, [float]) – Can be a float value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many float values as the number of convolutions.

Returns

self, to support method chaining

cycleBackOff(value)

Cycle back off proportion.

Parameters

value (float) – Number between 0 and 1

Returns

self, to support method chaining

partialsTypes(value)

The partials type used for each convolution.

Parameters

value (MultiConvPartialsType, [MultiConvPartialsType]) – Can be a single instance of poptorch.MultiConvPartialsType in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many poptorch.MultiConvPartialsType values as the number of convolutions.

Returns

self, to support method chaining

perConvReservedTiles(value)

Tiles to reserve for each convolution.

Parameters

value (int) – Number of tiles

Returns

self, to support method chaining

planType(value)

Select the multi-convolution execution strategy.

Parameters

value – An instance of MultiConvPlanType.

Returns

self, to support method chaining

class poptorch.MultiConvPartialsType(value)

Type for the partials of each convolution of a poptorch.MultiConv

  • Float

  • Half

class poptorch.MultiConvPlanType(value)

Selects the execution strategy for a poptorch.MultiConv

  • Parallel: Execute multiple convolutions in parallel (Default).

  • Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.
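Because each setter returns self, the settings can be configured on the scope object before entering it. A sketch, reusing convA and convB from the example above; the proportion and type values are illustrative only:

>>> multi_conv = poptorch.MultiConv()
>>> multi_conv.availableMemoryProportions([0.3, 0.3])
>>> multi_conv.partialsTypes(poptorch.MultiConvPartialsType.Half)
>>> multi_conv.planType(poptorch.MultiConvPlanType.Parallel)
>>> with multi_conv:
...     y = self.convA(x)
...     v = self.convB(u)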

3.5.4. poptorch.custom_op

This is for users who are familiar with PopART. If you need some special features that are not supported in PopART, you may write a PopART custom op. For more information about how to create PopART custom ops, see Creating custom operations and Building custom operators using PopART. You can call such a PopART custom op using poptorch.custom_op in PopTorch.

It takes three steps to enable a PopART custom op in PopTorch.

First, set the Poplar and PopART environment variables as shown in Setting the environment variables and compile the PopART custom op. You can compile your custom op C++ code and link it with Poplar and PopART to generate a dynamic library. Please refer to the custom op code custom_cube_op.cpp and its CMakeLists.txt under poptorch/tests/custom_ops.

Second, load the dynamic library.

Finally, use poptorch.custom_op to finish the call. Its wrapper class is specified below.

class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs)

Applies a custom operation, implemented within PopART, to the inputs.

Parameters
  • inputs (tuple) – A tuple of input tensors, for example, (x, y).

  • name (str) – unique name of the PopART custom op

  • domain (str) – domain for the op

  • domain_version (int) – version of the domain to use

  • example_outputs (iterable) – a tuple of tensors with the same type and shape as the outputs; the value does not matter as all values will be set to zero for tracing purposes.

Returns

The outputs of the forward op of the custom op.
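The original listing for this example is not reproduced in this extract, so the following is an illustrative sketch of such a call from inside a model's forward method. The op name "Cube", the domain "com.acme" and the domain version 1 are hypothetical placeholders; they must match whatever your PopART custom op registers:

>>> # x and y are input tensors inside the model's forward method.
>>> # example_outputs reuses x purely as a shape/type template for the
>>> # two outputs; its values are ignored.
>>> out = poptorch.custom_op((x, y),
...                          "Cube",       # hypothetical op name
...                          "com.acme",   # hypothetical domain
...                          1,            # hypothetical domain version
...                          example_outputs=[x, x])
>>> result = out[0]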

In the PopART custom op, both forward op and backward op are implemented. In the PopTorch inference model, only the forward op will be called.

In the code example above, example_outputs is assigned as [x, x], where x is one of the input tensors and is used as a template to provide the right number of output tensors. The real outputs will be allocated memory, calculated and returned by the custom op. You can also call this custom op inside a training model using exactly the same interface of poptorch.custom_op, and the backward op will be called automatically.

3.5.5. poptorch.nop

PopTorch includes a “no-op” function for debugging purposes.

poptorch.nop(tensor)

A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.

Parameters

tensor (torch.Tensor) – the tensor to simply return by the no-op.

Returns

The same tensor which was input.

Return type

torch.Tensor

3.5.6. poptorch.serializedMatMul

Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.

poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)

Calculates a matrix product using a serialized matrix multiplication.

The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.

Parameters
  • lhs (torch.Tensor) – Left-hand size input matrix.

  • rhs (torch.Tensor) – Right-hand side input matrix.

  • mode (poptorch.MatMulSerializationMode) –

    Which dimension of the matmul to serialize on: for matrix A (m by n) multiplied by matrix B (n by p):

    • InputChannels: Split across the input channels (dimension m).

    • ReducingDim: Split across the reducing dimension (n).

    • OutputChannels: Split across the output channels (dimension p).

    • Disabled: Same as an ordinary matrix multiplication.

  • factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.

  • keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If keep_precision is True, these additions will occur using float32 rather than half precision partials, matching those used for the individual matrix multiplications.
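For example, a sketch of serializing a large matmul over its output channels into 4 smaller multiplications; the shapes are illustrative, and the factor must divide the serialized dimension:

>>> lhs = torch.randn(64, 512)
>>> rhs = torch.randn(512, 1024)
>>> out = poptorch.serializedMatMul(
...     lhs, rhs, poptorch.MatMulSerializationMode.OutputChannels, factor=4)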

3.5.7. poptorch.set_available_memory

Use this function to override the proportion of tile memory available to be used as temporary memory by a convolution or matrix multiplication.

poptorch.set_available_memory(tensor, available_memory_proportion)

Sets the available memory for a convolution or matrix multiplication.

When called on the output of a convolution or a matrix multiplication, it sets the proportion of tile memory (between 0 and 1) to be made available as temporary memory for that convolution or matrix multiplication. Less temporary memory will reduce the time performance but may use less memory overall. However, lower memory proportions result in the use of more live (not temporary) memory, so the overall memory use may increase for values that are too low, possibly resulting in out of memory errors.

In the event that the value is too low, the planner will replan for the smallest memory usage possible.

>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
Parameters
  • tensor (torch.Tensor) – output tensor of a convolution or matrix multiplication (otherwise the statement will be an identity).

  • available_memory_proportion (float) – proportion between 0.0 and 1.0 of tile memory to be made available for temporary memory (default 0.6).

Returns

input tensor, as if calling an identity function.

Return type

torch.Tensor

3.6. Miscellaneous functions

These PopTorch functions, not related to model creation, are available:

poptorch.ipuHardwareIsAvailable()

Indicates whether IPU hardware is available to use.

Returns

True if physical IPUs are available, False otherwise.

Return type

bool
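For example, a sketch of falling back to the IPU Model when no hardware is present, using poptorch.Options.useIpuModel described earlier:

>>> opts = poptorch.Options()
>>> if not poptorch.ipuHardwareIsAvailable():
...     opts.useIpuModel(True)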

poptorch.setLogLevel(level)

Changes the volume of messages printed in the console (stdout)

Parameters

level (str) –

  • TRACE: Print all messages.

  • DEBUG: Print debug messages and above.

  • INFO: Print info messages and above.

  • WARN: Print warnings and errors.

  • ERR: Print errors only.

  • OFF: Print nothing.

3.7. Half / float 16 support

PopTorch supports the half-precision floating point (float 16) format. You can simply input float 16 tensors into your model. (You can convert a tensor to float 16 using tensor = tensor.half())

You can use your models in one of the following ways:

  1. Convert all parameters (weights) to float 16 by using a Module’s half() method. This is the most memory efficient, however small updates to weights may be lost, hindering training.

  2. Keep the parameters (weights) as float 32, in which case the parameter updates will occur using float 32. However, the parameters will be converted to float 16 if you call an operation with a float 16 input. This is more memory efficient than using float 32 tensors (inputs) but less memory efficient than using float 16 weights.

  3. Use a mix of float 32 and float 16 parameters by manually specifying parameters as float 16 or float 32.

Note

When PyTorch encounters a mix of float 16 and float 32 inputs for a given operation, it will usually cast all inputs to float 32. PopTorch differs and will cast all inputs to float 16. This makes it easier to build models with float 32 weights which take float 16 tensors.

Listing 3.6 How to run a model using half precision
1model = torch.nn.Linear(1, 10).half()
2t1 = torch.tensor([1.]).half()
3
4inference_model = poptorch.inferenceModel(model)
5out = inference_model(t1)
6
7assert out.dtype == torch.half

Because PopTorch relies on the torch.jit.trace API, it is limited to tracing operations which run on the CPU. Many of these operations do not support float 16 inputs. To allow the full range of operations, PopTorch converts all float 16 inputs to float 32 before tracing and then restores the inputs to float 16 as part of the canonicalization process. Some operations may result in the model running in float 32 where float 16 would be expected, or vice versa (see Float 16 operations for full details).

3.8. Profiling

You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.

3.9. Environment variables

3.9.1. Logging level

PopTorch uses the following levels of logging:
  • OFF: No logging.

  • ERR: Errors only.

  • WARN: Warnings and errors only.

  • INFO: Info, warnings and errors. (Default)

  • DEBUG: Adds some extra debugging information.

  • TRACE and TRACE_ALL: Trace everything inside PopTorch.

The POPTORCH_LOG_LEVEL environment variable can be used to set the logging level:

export POPTORCH_LOG_LEVEL=DEBUG

3.9.2. Profiling

When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.

In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'

By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'

For more options, please refer to the PopVision Graph Analyser User Guide.

In order to capture the pvti reports needed for the PopVision System Analyser you only need to set PVTI_OPTIONS='{"enable":"true"}'.

You can also add extra tracepoints in your own code by using:

class poptorch.profiling.Channel(name)

Profiling channel.

Note

If the libpvti profiling library is not available at runtime this class becomes a no-op.

Example:

>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")
instrument(obj, *methods)

Instrument the methods of an object.

Parameters
  • obj – Object to instrument

  • methods – One or more methods to wrap in profiling tracepoints.

tracepoint(name)

Create a context tracepoint

>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()
Parameters

name – Name associated to this tracepoint.

3.9.3. IPU Model

By default PopTorch will try to attach to a physical IPU. If instead you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:

export POPTORCH_IPU_MODEL=1

Please see the Poplar and PopLibs User Guide for the limitations of the IPU Model.

3.9.4. Wait for an IPU to become available

By default if you try to attach to an IPU but all the IPUs in the system are already in use, an exception will be raised. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1.

export POPTORCH_WAIT_FOR_IPU=1