9. Reference

9.1. Options

class poptorch.Options

Set of all options controlling how a model is compiled and executed.

Pass an instance of this class to the model wrapping functions poptorch.inferenceModel() and poptorch.trainingModel() to change how the model is compiled and executed. An instance includes general options set within this class such as poptorch.Options.deviceIterations() as well as properties referring to categories of options such as Training.

>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> opts.Training.gradientAccumulation(4)
property Distributed

Options specific to running on multiple IPU servers (IPU-POD).

property Jit

Options specific to upstream PyTorch’s JIT compiler.

property Popart

(Deprecated) Options specific to the PopART backend. (Advanced users only).

property Precision

Options specific to the processing of the JIT graph prior to lowering to Popart.

property TensorLocations

Options related to tensor locations.

property Training

Options specific to training.

anchorMode(anchor_mode, anchor_return_period=None)

Specify which data to return from a model.

Parameters

anchor_mode (poptorch.AnchorMode) –

  • All: Return a result for each batch.

  • Sum: Return the sum of all the batches.

  • Final: Return the last batch.

  • EveryN: Return every N batches: N is passed in as anchor_return_period.

  • Default: All for inference, Final for training.

For example:

>>> opts = poptorch.Options()
>>> opts.anchorMode(poptorch.AnchorMode.All)
... # or
>>> opts.anchorMode(poptorch.AnchorMode.EveryN, 10)
autoRoundNumIPUs(auto_round_num_ipus)

Whether or not to automatically round up the number of IPUs used: the number of IPUs requested must be a power of 2 or a multiple of 64. By default, an error occurs if the model uses an unsupported number of IPUs, to prevent you from unintentionally overbooking IPUs.

Parameters

auto_round_num_ipus (bool) –

  • True: round up the number of IPUs to a power of 2 or multiple of 64 automatically.

  • False: error if the number of IPUs is not supported.
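
For example, a minimal sketch, assuming the wrapped model is annotated to use 3 IPUs (an unsupported count that would otherwise raise an error):

>>> opts = poptorch.Options()
>>> # With rounding enabled, a request for 3 IPUs would be rounded up to 4.
>>> opts.autoRoundNumIPUs(True)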

connectionType(connection_type)

When to connect to the IPU (if at all).

Parameters

connection_type (poptorch.ConnectionType) –

  • Always: Attach to the IPU from the start (default).

  • OnDemand: Wait until the compilation is complete and the executable is ready to be run before attaching to the IPU.

  • Never: Never try to attach to an IPU: this is useful for offline compilation, but trying to run an executable will raise an exception.

For example:

>>> opts = poptorch.Options()
>>> opts.connectionType(poptorch.ConnectionType.OnDemand)
defaultAnchorMode()
Returns

True if the anchorMode is currently set to Default, False otherwise.

deviceIterations(device_iterations)

Number of iterations the device should run over the data before returning to the user (default: 1).

This is equivalent to running the IPU in a loop over the specified number of iterations, with a new batch of data each time. However, increasing deviceIterations is more efficient because the loop runs on the IPU directly.
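
As an illustrative sketch (my_model and my_dataset are placeholders), combining device iterations with poptorch.DataLoader so that each call to the wrapped model runs a full device loop:

>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> # With batch_size=4 and 10 device iterations, each call below is
>>> # assumed to consume 10 * 4 = 40 samples from the loader.
>>> loader = poptorch.DataLoader(opts, my_dataset, batch_size=4)
>>> poptorch_model = poptorch.inferenceModel(my_model, opts)
>>> for data in loader:
...     results = poptorch_model(data)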

enableExecutableCaching(path)

Load/save Poplar executables to the specified path, using it as a cache, to avoid recompiling identical graphs.

Parameters

path (str) – File path for the Poplar executable cache store; setting path to None disables executable caching.
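
For example (the cache directory is arbitrary):

>>> opts = poptorch.Options()
>>> # Executables are saved to, and reloaded from, this directory.
>>> opts.enableExecutableCaching("./exe_cache")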

enableStableNorm(enabled)

Set whether a stable version of norm operators is used. This stable version is slower, but more accurate than its unstable counterpart.

Parameters

enabled (bool) –

  • True: Use stable norm calculation.

  • False: Do not use stable norm calculation.

enableSyntheticData(enabled)

Set whether host I/O is disabled and synthetic data is generated on the IPU instead. This can be used to benchmark models whilst simulating perfect I/O conditions.

Parameters

enabled (bool) –

  • True: Use data generated from a random normal distribution on the IPU. Host I/O is disabled.

  • False: Host I/O is enabled and real data is used.

logDir(log_dir)

Set the log directory.

Parameters

log_dir (str) – Directory where PopTorch saves log files (default: current directory).

randomSeed(random_seed)

Set the seed for the random number generator on the IPU.

Parameters

random_seed (int) – Random seed integer.

relaxOptimizerAttributesChecks(relax=True)

Controls whether unexpected attributes in setOptimizer() lead to warnings or debug messages.

By default PopTorch will print warnings the first time it encounters unexpected attributes in setOptimizer().

Parameters

relax (bool) –

  • True: Redirect warnings to the debug channel.

  • False: Print warnings about unexpected attributes (default behaviour).

replicationFactor(replication_factor)

Number of times to replicate the model (default: 1).

Replicating the model increases the data throughput of the model as PopTorch uses more IPUs. The number of IPUs used is scaled by replication_factor: for example, if your model uses 1 IPU, a replication_factor of 2 will use 2 IPUs; if your model uses 4 IPUs, a replication factor of 4 will use 16 IPUs in total.

Parameters

replication_factor (int) – Number of replicas of the model to create.

setAvailableMemoryProportion(available_memory_proportion)

Memory is set on a per-IPU basis: this should be a dictionary of IPU IDs and float values between 0 and 1.

For example: {"IPU0": 0.5}
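
A fuller sketch (the proportions are illustrative):

>>> opts = poptorch.Options()
>>> # Allow 50% of tile memory as temporary memory on IPU0, 20% on IPU1.
>>> opts.setAvailableMemoryProportion({"IPU0": 0.5, "IPU1": 0.2})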

setExecutionStrategy(strategy)

Set the execution strategy to use to partition the graph.

Parameters

strategy – Must be an instance of one of the execution strategy classes.

syncPattern(sync_pattern)

Set the IPU SyncPattern.

Parameters

sync_pattern (poptorch.SyncPattern) –

  • Full

  • SinglePipeline

  • ReplicaAndLadder

useIpuId(ipu_id)

Use the IPU device specified by the ID (as provided by gc-info)

A device ID may refer to a single or to a group of IPUs (a multi-IPU device). The number of IPUs associated with the ID must be equal to the number of IPUs used by your annotated model multiplied by the replication factor.

For example, if your model uses 1 IPU and the replication factor is 2, you will need to provide a device ID with 2 IPUs; if your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide a device ID which represents a multi-IPU device of 16 IPUs.

You can use the command-line tool gc-info: running gc-info -a shows each device ID and the list of IPUs associated with it.

Parameters

ipu_id (int) – IPU device ID of a single-IPU or multi-IPU device

useIpuModel(use_model)

Whether to use the IPU Model or physical hardware (default)

The IPU model simulates the behaviour of IPU hardware but does not offer all the functionality of an IPU. Please see the Poplar and PopLibs User Guide for further information.

This setting takes precedence over the POPTORCH_IPU_MODEL environment variable.

Parameters

use_model (bool) –

  • True: Use the IPU Model.

  • False: Use IPU hardware.

useOfflineIpuTarget(ipu_version=2)

Create an offline IPU target that can only be used for offline compilation.

Note

The offline IPU target cannot be used if the IPU Model is enabled.

Parameters

ipu_version (int) – IPU version to target (1 for mk1, 2 for mk2). Default: 2.
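
For example, a sketch of offline compilation, assuming my_model and example_input are defined elsewhere:

>>> opts = poptorch.Options()
>>> opts.useOfflineIpuTarget()
>>> opts.connectionType(poptorch.ConnectionType.Never)
>>> poptorch_model = poptorch.inferenceModel(my_model, opts)
>>> # Compile without attaching to hardware and save the executable.
>>> poptorch_model.compileAndExport("offline.poptorch", example_input)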

class poptorch.options._DistributedOptions

Options related to distributed execution.

Can be accessed via poptorch.Options.Distributed:

>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)
configureProcessId(process_id, num_processes)

Manually set the current process ID and the total number of processes.

Parameters
  • process_id (int) – The ID of this process.

  • num_processes (int) – The total number of processes the execution is distributed over.

disable()

Ignore the current options / environment variables and disable distributed execution.

property numProcesses

Total number of processes the execution is distributed over.

property processId

Id of the current process.

setEnvVarNames(var_num_processes, var_process_id)

Utility to read and set processId and numProcesses from environment variables.

Useful if you use a third party library to manage the processes used for the distributed execution such as mpirun.

For example: mpirun -np 4 myscript.py

By default the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.
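
As a sketch, reading the configuration from Slurm-style variables instead of the OpenMPI defaults (the variable names are illustrative and must match your launcher):

>>> opts = poptorch.Options()
>>> opts.Distributed.setEnvVarNames("SLURM_NTASKS", "SLURM_PROCID")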

class poptorch.options._PrecisionOptions(popart_options)

Options related to processing the PyTorch JIT graph prior to lowering to Popart

Can be accessed via poptorch.Options.Precision:

>>> opts = poptorch.Options()
>>> opts.Precision.halfFloatCasting(
...   poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)
enableStochasticRounding(enabled)

Set whether stochastic rounding is enabled on the IPU.

Stochastic rounding rounds a value up or down to half (float16) randomly, such that the expected (mean) result of the rounded value is equal to the unrounded value. It can improve training performance by simulating higher precision behaviour and increasing the speed or likelihood of model convergence. However, the model is non-deterministic and represents a departure from (deterministic) standard IEEE FP16 behaviour.

In the general case, we recommend enabling stochastic rounding for training where convergence is desirable, but not for inference where non-determinism may be undesirable.

Parameters

enabled (bool) –

  • True: Enable stochastic rounding on the IPU.

  • False: Disable stochastic rounding.
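
For example:

>>> opts = poptorch.Options()
>>> opts.Precision.enableStochasticRounding(True)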

halfFloatCasting(half_float_casting)

Changes the casting behaviour for ops involving a float16 (half) and a float32.

The default option, FloatDowncastToHalf, allows parameters (weights) to be stored and updated as float32 but cast to float16 when used in an operation with a float16 input. The benefit of this is higher efficiency and a reduced memory footprint, without the corresponding loss of precision in the parameters during the optimiser update step. However, you can change the behaviour to match PyTorch using the option HalfUpcastToFloat.

Parameters

half_float_casting (poptorch.HalfFloatCastingBehavior) –

  • FloatDowncastToHalf: Any op with operands (inputs) which are a mix of float32 and float16 (half) will cast all operands to half.

  • HalfUpcastToFloat: Implicit casting will follow PyTorch’s rules, promoting float16 (half) inputs to float32 if another input is float32.

runningVarianceAlwaysFloat(value)

Controls whether the running variance tensor of batch normalisation layers should be a float32 regardless of input type.

A batch normalisation layer stores a running estimate of the variances of each channel during training, for use at inference in lieu of batch statistics. Storing the value as a half (float16) can result in poor performance due to the low precision. Enabling this option yields more reliable estimates by forcing all running estimates of variances to be stored as float32, at the cost of extra memory use.

Parameters

value (bool) –

  • True: Always store running estimates of variance as float32.

  • False: Store running estimates of variance as the same type as the layer input.

setPartialsType(dtype)

Set the data type of partial results for matrix multiplication and convolution operators.

The matrix multiplication and convolution operators store intermediate results known as partials as part of the calculation. You can use this option to change the data type of the partials. Using torch.half reduces on-chip memory use at the cost of precision.

Parameters

dtype (torch.dtype) – The type to store partials, which must be either torch.float or torch.half
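
For example, a minimal sketch trading precision for on-chip memory:

>>> import torch
>>> opts = poptorch.Options()
>>> opts.Precision.setPartialsType(torch.half)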

class poptorch.options._JitOptions

Options related to PyTorch’s JIT compiler.

Can be accessed via poptorch.Options.Jit:

>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)
traceModel(trace_model)

Controls whether to use PyTorch’s tracing or scripting.

By default, PopTorch uses PyTorch's JIT tracing, but you can use scripting instead (experimental). See torch.jit.trace and torch.jit.script for details about PyTorch's JIT implementations.

Parameters

trace_model (bool) –

  • True: use torch.jit.trace

  • False: use torch.jit.script (experimental)

class poptorch.options._TensorLocationOptions(**default_values)

Options controlling where to store tensors.

Can be accessed via poptorch.Options.TensorLocations:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
setAccumulatorLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for accumulators.

setActivationLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for activations.

setOptimizerLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for optimiser states.

setWeightLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for weights.

class poptorch.TensorLocationSettings(**default_values)

Define where a tensor is stored

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
minElementsForOffChip(min_elements)

A minimum number of elements below which offloading won’t be considered.

minElementsForReplicatedTensorSharding(min_elements)

Only enable Replicated Tensor Sharding (RTS) for tensors with more than min_elements elements.

useIOTilesToLoad(use=True)

Load tensor through IO tiles

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

useIOTilesToStore(use=True)

Use IO tiles to store tensors.

(relevant for replicated tensor sharded tensors)

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

useOnChipStorage(use=True)

Permanent tensor storage

Parameters

use (bool) – True: use on chip memory. False: use off chip memory. None: keep it undefined.

useReplicatedTensorSharding(use=True)

Enable replicated tensor sharding

(relevant for weights and optimiser states)
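
For example, a sketch placing optimiser state off chip with replicated tensor sharding, assuming these setters return self so the calls can be chained:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False)
...                                      .useReplicatedTensorSharding(True))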

class poptorch.options._TrainingOptions

Options specific to model training.

Note

You must not set these options for inference models.

Can be accessed via poptorch.Options.Training:

>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)
accumulationAndReplicationReductionType(reduction_type)

Set the type of reduction applied to reductions in the graph.

When using a value greater than one for gradientAccumulation() or for replicationFactor(), PopTorch applies a reduction to the gradient outputs from each replica, and to the accumulated gradients. This reduction is independent of the model loss reduction (summing a mean-reduced loss and a sum-reduced loss in a PyTorch model is valid).

This setting governs both the accumulation of the loss gradients in replicated graphs and of all of the gradients when using gradient accumulation.

Parameters

reduction_type (poptorch.ReductionType) –

  • Mean: Reduce gradients by calculating the mean of them.

  • Sum: Reduce gradients by calculating the sum of them.

accumulationReductionType(reduction_type)

The type of reduction (sum or mean) applied to accumulated gradients.

When using a non-unity value for gradientAccumulation, you can specify whether to reduce the gradients by sum or mean (default). When using mean reduction, changing the gradientAccumulation will not change the training curve of the model (barring numerical error and changes due to the different compute batch size e.g. batch normalisation).

Parameters

reduction_type (poptorch.ReductionType) –

  • Mean: Reduce gradients by calculating the mean of them.

  • Sum: Reduce gradients by calculating the sum of them.

gradientAccumulation(gradient_accumulation)

Number of micro-batches to accumulate for the gradient calculation.

Accumulate the gradient gradient_accumulation times before updating the model using the gradient. Other frameworks may refer to this setting as “pipeline depth”. Each micro-batch (a batch of size equal to the batch_size argument passed to poptorch.DataLoader) corresponds to one gradient accumulation. Therefore gradient_accumulation scales the global batch size (the number of samples between optimiser updates).

Note

Increasing gradient_accumulation does not alter the (mini-) batch size used for batch normalisation.

A large value for gradient_accumulation can improve training throughput by amortising optimiser update costs, most notably when using PipelinedExecution or when training is distributed over a number of replicas. However, the consequential increase in the number of samples between optimiser updates can have an adverse impact on training.

The efficiency gains are most notable when training models which use multiple IPUs with pipelined model parallelism (via PipelinedExecution, or by default when annotating the model with poptorch.BeginBlock or poptorch.Block), because the pipeline has “ramp up” and “ramp down” steps around each optimiser update. Increasing the gradient accumulation factor reduces the proportion of time spent in the “ramp up” and “ramp down” phases, increasing overall throughput.

When training involves multiple replicas, including the cases of sharded and phased execution, each optimiser step incurs a communication cost associated with the reduction of the gradients. By accumulating gradients, you can reduce the total number of updates required and thus reduce the total amount of communication.

Note

Increasing the global batch size can have adverse effects on the sample efficiency of training, so it is recommended to use a low or unity gradient accumulation count initially and then increase it to achieve higher throughput. You may also need to scale other hyperparameters, such as the optimiser learning rate, accordingly.
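
As an illustrative sketch of how the options combine (my_dataset is a placeholder), assuming the global batch size is batch_size * replication factor * gradient accumulation:

>>> opts = poptorch.Options()
>>> opts.replicationFactor(2)
>>> opts.Training.gradientAccumulation(8)
>>> # Assumed global batch size between optimiser updates: 4 * 2 * 8 = 64.
>>> loader = poptorch.DataLoader(opts, my_dataset, batch_size=4)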

9.2. Helpers

poptorch.ipuHardwareIsAvailable(num_ipus=1)

Indicates whether any IPU hardware with num_ipus is present in the system.

Note: This function doesn’t check if the IPU is free or already being used.

Parameters

num_ipus (int) – Number of IPUs required.

Returns

True if physical IPUs are available, False otherwise.

Return type

bool

poptorch.ipuHardwareVersion()

Indicates what IPU hardware version is available in the system.

Raise an exception if no hardware is available.

Returns

The IPU hardware version or -1 if unknown.

Return type

int
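
For example, a sketch that falls back to the IPU Model when no hardware is present:

>>> opts = poptorch.Options()
>>> if poptorch.ipuHardwareIsAvailable():
...     print("IPU hardware version:", poptorch.ipuHardwareVersion())
... else:
...     # No physical IPUs available: use the (slower) IPU Model instead.
...     opts.useIpuModel(True)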

poptorch.setLogLevel(level)

Changes the volume of messages printed in the console (stdout)

Parameters

level (str) –

  • TRACE: Print all messages.

  • DEBUG: Print debug messages and above.

  • INFO: Print info messages and above.

  • WARN: Print warnings and errors.

  • ERR: Print errors only.

  • OFF: Print nothing.

class poptorch.profiling.Channel(name)

Profiling channel.

Note

If the libpvti profiling library is not available at runtime this class becomes a no-op.

Example:

>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")
instrument(obj, *methods)

Instrument the methods of an object.

Parameters
  • obj – Object to instrument

  • methods – One or more methods to wrap in profiling tracepoints.

tracepoint(name)

Create a context tracepoint

>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()
Parameters

name – Name associated with this tracepoint.

9.3. PopTorch Ops

poptorch.ipu_print_tensor(tensor, title='')

Adds an op to print the content of a given IPU tensor.

When this is executed the tensor will be copied back to host and printed.

When this operation is called in the backward pass it will print the gradient of the tensor.

The operation is an identity operation and will return the exact same tensor. The returned tensor should be used in place of the original tensor in the rest of the program, to make sure that the print operation isn’t optimised away.

For example if the original code looks like this:

def forward(self, c, d, b):
  a = c + d
  return a + b

If the result of ipu_print_tensor is not used, it will be optimised out by the graph optimiser and the tensor will not be printed.

So if you want to print the value of a, you should do:

def forward(self, c, d, b):
  a = c + d
  x = poptorch.ipu_print_tensor(a)
  return x + b

Optionally, you may add a second string parameter to be used as a title.

def forward(self, c, d, b):
    a = c + d
    x = poptorch.ipu_print_tensor(a, "summation")
    return x + b

Warning

In order for the print operation to not be optimised out by the graph optimiser, you must use the output of the print.

Parameters

tensor (torch.Tensor) – The tensor to print.

Returns

The input unchanged.

poptorch.identity_loss(x, reduction)

Marks this operation as being part of the loss calculation and, as such, will back-propagate through it in the PopTorch autograd. This enables multiple losses and custom losses.

Parameters
  • x (torch.Tensor) – The calculated loss.

  • reduction (str) –

    Reduce the loss output as per PyTorch loss semantics. Supported values are:

    • "sum": Sum the losses.

    • "mean": Take the mean of the losses.

    • "none": Don’t reduce the losses.

Returns

An identity loss custom op.
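
For example, a minimal sketch of a custom loss combining a standard loss with a penalty term (the 0.1 weighting is arbitrary):

>>> import torch
>>> def custom_loss(prediction, target):
...     base = torch.nn.functional.mse_loss(prediction, target)
...     penalty = prediction.abs().mean()  # illustrative regulariser
...     # Mark the combined value as the loss to back-propagate through.
...     return poptorch.identity_loss(base + 0.1 * penalty, reduction="mean")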

class poptorch.MultiConv

Combines all convolution layers evaluated inside this scope into a single multi-convolution.

Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.

For example:

>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)

Combines the two data-independent convolutions into a single multi-convolution.

Refer to the PopLibs documentation for further information on multi-convolutions.

availableMemoryProportions(value)

The available memory proportion per convolution, each [0, 1).

Parameters

value (float, [float]) – Can be a float value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many float values as the number of convolutions.

Returns

self, to support method chaining

cycleBackOff(value)

Cycle back off proportion.

Parameters

value (float) – Number between 0 and 1

Returns

self, to support method chaining

partialsTypes(value)

The partials type used for each convolution.

Parameters

value (torch.dtype, [torch.dtype]) – Can be a single instance of torch.dtype in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many torch.dtype values as the number of convolutions.

Returns

self, to support method chaining

perConvReservedTiles(value)

Tiles to reserve for each convolution.

Parameters

value (int) – Number of tiles

Returns

self, to support method chaining

planType(value)

Select the multi-convolution execution strategy.

Parameters

value – An instance of MultiConvPlanType.

Returns

self, to support method chaining

class poptorch.MultiConvPlanType(value)

Selects the execution strategy for a poptorch.MultiConv

  • Parallel: Execute multiple convolutions in parallel (Default).

  • Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.
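
For example, a sketch combining the MultiConv options above (the memory proportions are illustrative, and convA/convB follow the earlier example):

>>> import torch
>>> mc = poptorch.MultiConv()
>>> mc.availableMemoryProportions([0.3, 0.6]).partialsTypes(torch.float)
>>> mc.planType(poptorch.MultiConvPlanType.Parallel)
>>> with mc:
...     y = self.convA(x)
...     v = self.convB(u)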

class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs, attributes=None)

Applies a custom operation, implemented within PopART, to the inputs.

Parameters
  • inputs (tuple) – A tuple of input tensors, for example, (x, y).

  • name (str) – unique name of the PopART custom

  • domain (str) – domain for the op

  • domain_version (int) – version of the domain to use

  • example_outputs (iterable) – a tuple of tensors with the same type and shape of the outputs; the value does not matter as all values will be set to zero for tracing purposes.

  • attributes (dict) – a dictionary of attributes for the custom op. All attributes keys must be strings. All attribute values must be floats, ints, strings, or a list/tuple containing only floats, only ints or only strings (not a mix of types within the list).

Returns

The outputs of the forward op of the custom op.
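
As a sketch only: the op name, domain, and version below are hypothetical and must match a custom op actually compiled into PopART:

>>> import torch
>>> x, y = torch.randn(2, 4), torch.randn(2, 4)
>>> out = poptorch.custom_op(
...     (x, y),                # inputs
...     "MyCustomAdd",         # hypothetical op name
...     "com.example.ops",     # hypothetical domain
...     1,                     # domain version
...     example_outputs=(x,),  # template for the output shape and type
...     attributes={"alpha": 0.5})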

poptorch.nop(tensor)

A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.

Parameters

tensor (torch.Tensor) – The tensor to be returned unchanged by the no-op.

Returns

The same tensor which was input.

Return type

torch.Tensor

poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)

Calculates a matrix product using a serialized matrix multiplication.

The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.

Parameters
  • lhs (torch.Tensor) – Left-hand size input matrix.

  • rhs (torch.Tensor) – Right-hand side input matrix.

  • mode (poptorch.MatMulSerializationMode) –

    Which dimension of the matmul to serialize on, for matrix A (m by n) multiplied by matrix B (n by p):

    • InputChannels: Split across the input channels (dimension m).

    • ReducingDim: Split across the reducing dimension (n).

    • OutputChannels: Split across the output channels (dimension p).

    • Disabled: Same as an ordinary matrix multiplication.

  • factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.

  • keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If keep_precision is True, these additions will occur using float32 rather than half precision partials, matching those used for the individual matrix multiplications.
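
For example, a sketch serializing over the output channels (the factor is illustrative and must divide the output dimension):

>>> import torch
>>> lhs = torch.randn(64, 128)
>>> rhs = torch.randn(128, 256)
>>> # Split the 256 output channels into 4 multiplications of 64 channels.
>>> out = poptorch.serializedMatMul(
...     lhs, rhs, poptorch.MatMulSerializationMode.OutputChannels, factor=4)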

poptorch.set_available_memory(tensor, available_memory_proportion)

Sets the available memory for a convolution or matrix multiplication.

When called on the output of a convolution or a matrix multiplication, this sets the proportion of tile memory (between 0 and 1) to be made available as temporary memory for the convolution/matrix multiplication. Less temporary memory will reduce the time performance but may use less memory overall. Lower memory proportions result in the use of more live (not temporary) memory, so the overall memory use may increase for values that are too low, possibly resulting in out-of-memory errors.

In the event that the value is too low, the planner will replan for the smallest possible memory usage.

>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
Parameters
  • tensor (torch.Tensor) – output tensor of a convolution or matrix multiplication (otherwise the statement will be an identity).

  • available_memory_proportion (float) – proportion between 0.0 and 1.0 of tile memory to be made available for temporary memory (default 0.6).

Returns

input tensor, as if calling an identity function.

Return type

torch.Tensor

9.4. Model wrapping functions

poptorch.trainingModel(model, options=None, optimizer=None)

Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (poptorch.Options) – The IPU specific options.

  • optimizer (torch.optim.Optimizer) – The optimizer to use during training.

Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

poptorch.inferenceModel(model, options=None)

Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (poptorch.Options) – The IPU specific options

Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

class poptorch.PoplarExecutor(model, options, training, optimizer=None, user_model=None, poptorch_version=None)

This class should not be created directly but is a wrapper around the model that was passed into inferenceModel or trainingModel. It only has a few methods which can be used to interface with the IPU.

__call__(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Note

The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.

attachToDevice()

Attach to target device. Before calling this function, the device must be detached.

compile(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Trace and compile the wrapped model if no executable has been created yet.

Note: The executable created by this method can only be executed; it cannot be exported to file. To precompile and save to file, use compileAndExport().

compileAndExport(filename, *args, export_model=True, **kwargs)

Precompile an executable and save it to file.

args and kwargs are the same arguments as the wrapped PyTorch model.__call__

Parameters
  • filename (str) – Where to save the compiled executable.

  • export_model (bool) – If True the Torch model will be saved in the file alongside the executable. poptorch.load() can be used to restore both the original Torch model, the PopTorch model and the executable. If False then only the executable will be exported and it will be the user’s responsibility to call poptorch.inferenceModel() or poptorch.trainingModel() to re-create the PopTorch model before calling loadExecutable() to restore the executable.
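
For example, a sketch that exports only the executable and later restores it (my_model, opts and example_input are placeholders):

>>> poptorch_model = poptorch.inferenceModel(my_model, opts)
>>> poptorch_model.compileAndExport("my_model.poptorch", example_input,
...                                 export_model=False)
>>> # Later: re-create the PopTorch wrapper, then restore the executable.
>>> poptorch_model = poptorch.inferenceModel(my_model, opts)
>>> poptorch_model.loadExecutable("my_model.poptorch")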

copyWeightsToDevice()

Copies the weights from model.parameters() to the IPU device. Implicitly called on first call.

copyWeightsToHost()

Updates the parameters used in model with the weights stored on the device (the weights in model.parameters()).

destroy()

Destroy the model: release the IPUs and the executable.

detachFromDevice()

Detach from target device. Before calling this function, the device must be attached.

isAttachedToDevice()

Returns True if the target device has been attached, False otherwise.

loadExecutable(filename)

Load an executable previously generated using compileAndExport()

load_state_dict(state_dict, strict=True)

Will call load_state_dict() on the wrapped model and automatically synchronise the weights with the IPU.

property model

Access the wrapped Torch model.

setOptimizer(optimizer)

Sets the optimiser for a training model. Will overwrite the previous one. Supported optimisers: optim.SGD, optim.Adam, optim.AdamW, optim.RMSprop, optim.LAMB.

poptorch.isRunningOnIpu()

This function returns True when executing on the IPU and False when executing the model outside the IPU scope. This allows separate code paths to be marked in the model simply by using:

if poptorch.isRunningOnIpu():
    ...  # IPU path
else:
    ...  # CPU path

Note this will only apply to code during execution. During model creation it will always return False.

Returns

True if running on IPU, otherwise False.

poptorch.load(filename, edit_opts_fn=None)

Load a PopTorch model from a file previously created using compileAndExport()

Parameters
  • filename (str) – Path to the file containing the model to load.

  • edit_opts_fn – Function to edit the options before the model is restored. For example to attach to a specific IPU device.

>>> model = poptorch.inferenceModel(model)
>>> model.compileAndExport("my_model.poptorch")
...
>>> model = poptorch.load("my_model.poptorch")
>>> model(my_input)

9.5. Parallel execution

class poptorch.Block(user_id=None, ipu_id=None)

Runs all layers called inside this scope on a specified IPU.

>>> with poptorch.Block("IPU0"):
...     self.layer = MyLayer(x)
__init__(user_id=None, ipu_id=None)
Parameters
  • user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

static useAutoId()

Call this method at the beginning of your forward() method to enable automatic block id generation.

Blocks with a None user_id will be assigned an automatic id which will be the index of this block in the list of id-less Blocks.

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"): # user_id = "special_block"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)

Runs all layers from the given layer until the beginning of the next block on a specified IPU.

All layers after this layer will also run on the same IPU until another BeginBlock is encountered.

By default PipelinedExecution will be used, however this can be overridden in the poptorch.Options.

>>> self.layer = poptorch.BeginBlock(MyLayer(x))
__init__(layer_to_call, user_id=None, ipu_id=None)

All subsequent layers of the network will be part of this block until another layer is wrapped.

Parameters
  • layer_to_call (torch.nn.Module) – The layer to run on the specified IPU.

  • user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually create Stages and Phases.

  • ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

class poptorch.Stage(*block_ids)

The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.

__init__(*block_ids)
class poptorch.AutoStage(value)

Defines how the stages are automatically assigned to blocks when the user didn’t explicitly provide stages to the IExecutionStrategy’s constructor.

  • SameAsIpu: The stage id will be set to the selected ipu number.

  • AutoIncrement: The stage id for new blocks is automatically incremented.

Examples:

>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...  layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...  layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...  layer()

By default, the following execution strategy is used:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)

which would translate to stage_id = ipu_id:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=0

Now if instead you use:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)

The last block would be in its own stage rather than sharing one with Block “0”:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=2

class poptorch.Phase(arg)

Represents an execution phase

__init__(arg)

Create a phase.

Parameters

arg (str, poptorch.Stage, [poptorch.Stage], [str]) – must either be one or more Stages, or one or more Block user_ids.

If one or more strings are passed they will be interpreted as Block ids representing a single Stage.

Within a Phase, the stages will be executed in parallel.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = Phase("A","B") # One Stage made of 2 blocks
class poptorch.ShardedExecution(*args)

Will shard the execution of the passed Stages or, if no stage is passed, will consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on the block names
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

class poptorch.PipelinedExecution(*args)
__init__(*args)

Pipeline the execution of the passed Stages or, if no stage is passed, consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Create a 3 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A","B","C"))
>>> # Create a 2 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...    poptorch.Stage("A","B"),
...    "C"))
>>> # Automatically create a 3 stages pipeline based on the block names
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

class poptorch.SerialPhasedExecution(*phases)

All the phases run serially on a single group of IPUs.

For example:

  • phase 0 runs on ipu 0 & 1

  • phase 1 runs on ipu 0 & 1

  • phase 2 runs on ipu 0 & 1

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0,1)
>>> strategy.phase(1).ipus(0,1)
>>> strategy.phase(2).ipus(0,1)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of phases. Must be either a list of Phases, a list of lists of Stages, or a list of lists of Block ids (strings).

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

setTensorsLiveness(liveness)

See poptorch.Liveness for more information

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
class poptorch.ParallelPhasedExecution(*phases)

Phases are executed in parallel alternating between two groups of IPUs.

For example:

  • phase 0 runs on ipu 0 & 2

  • phase 1 runs on ipu 1 & 3

  • phase 2 runs on ipu 0 & 2

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
>>> with poptorch.Block(): # user_id = "2"
...     layer()
>>> with poptorch.Block(): # user_id = "3"
...     layer()
>>> with poptorch.Block(): # user_id = "4"
...     layer()
>>> with poptorch.Block(): # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0,2)
>>> strategy.phase(1).ipus(1,3)
>>> strategy.phase(2).ipus(0,2)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of phases. Must be either a list of Phases, a list of lists of Stages, or a list of lists of Block ids (strings).

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
class poptorch.Liveness(value)

When using phased execution:

  • AlwaysLive: The tensors always stay on the IPU between the phases.

  • OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.

  • OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.

9.6. Optimizers

class poptorch.optim.VariableAttributes(variable_attributes, allowed_attributes)

Track which attributes are variable or constant.

Accessible from any PopTorch optimizer via the variable_attrs attribute.

>>> opt = poptorch.optim.SGD(params, lr=0.01)
>>> opt.variable_attrs.isConstant("lr")
isConstant(attr)

Return True if the attribute is marked as constant

markAsConstant(attr)

Explicitly mark an attribute as constant

markAsVariable(attr)

Explicitly mark an attribute as variable

class poptorch.optim.SGD(params, lr=None, momentum=None, dampening=None, weight_decay=None, nesterov=None, loss_scaling=None, velocity_scaling=None)

Stochastic gradient descent with optional momentum.

The optimizer matches PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.

Nesterov momentum is not currently supported.

__init__(params, lr=None, momentum=None, dampening=None, weight_decay=None, nesterov=None, loss_scaling=None, velocity_scaling=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float) – learning rate.

  • momentum (float, optional) – momentum factor.

  • dampening (float, optional) – dampening term for momentum.

  • weight_decay (float, optional) – Weight decay (L2 penalty) factor.

  • nesterov (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • velocity_scaling (float, optional) – Factor by which to scale the velocity values to assist numerical stability when using float16.
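
For example, a minimal sketch (model, opts and the hyper-parameter values are placeholders):

>>> opt = poptorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
...                          loss_scaling=128.0)
>>> poptorch_model = poptorch.trainingModel(model, opts, optimizer=opt)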

class poptorch.optim.Adam(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Adam optimizer.

This optimizer matches PyTorch’s implementation (torch.optim.Adam) with optional loss scaling.

AMSGrad is currently not supported.

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • betas (tuple, optional) – (beta1, beta2) parameters used in Adam.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – Weight decay factor.

  • amsgrad (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accum_type (torch.dtype, optional) – data type used for gradients.

  • first_order_momentum_accum_type (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

class poptorch.optim.AdamW(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Adam optimizer with true weight decay.

This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.

AMSGrad is currently not supported.

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • betas (tuple, optional) – (beta1, beta2) parameters used in AdamW.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – Weight decay factor.

  • amsgrad (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • bias_correction (bool, optional) – True: compute Adam with bias correction.

  • accum_type (torch.dtype, optional) – data type used for gradients.

  • first_order_momentum_accum_type (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

class poptorch.optim.RMSprop(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

RMSprop optimizer with optional L2 penalty.

This optimizer matches PyTorch’s implementation (torch.optim.RMSprop) with optional loss scaling.

__init__(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate.

  • alpha (float, optional) – smoothing constant.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – L2 penalty coefficient.

  • momentum (float, optional) – momentum factor.

  • centered (bool, optional) – True: compute centred RMSProp in which the gradient is normalized by an estimate of its variance.

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accum_type (torch.dtype, optional) – data type used for gradients.

  • first_order_momentum_accum_type (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

class poptorch.optim.LAMB(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Layer-wise Adaptive Moments (LAMB) optimizer (biased version).

Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).

The scaling function phi(z) is fixed as min(z, max_weight_norm).

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • betas (tuple, optional) – (beta1, beta2) parameters used in LAMB.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – weight decay factor.

  • bias_correction (bool, optional) – True: compute LAMB with bias correction.

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • max_weight_norm (float, optional) – maximum value of the output of scaling function, phi(). Set to None to disable scaling function.

  • accum_type (torch.dtype, optional) – data type used for gradients.

  • first_order_momentum_accum_type (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

step(closure=None)

Performs a single optimization step (parameter update).

Parameters

closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Note

Unless otherwise specified, this function should not modify the .grad field of the parameters.

9.7. Data batching

class poptorch.DataLoader(options, dataset, batch_size=1, shuffle=False, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=DataLoaderMode.Sync, async_options=None, **kwargs)

Thin wrapper around the traditional torch.utils.data.DataLoader to abstract away some of the batch sizes calculations.

If this DataLoader is used in a distributed execution environment, it will ensure that each process uses a different subset of the dataset.

__init__(options, dataset, batch_size=1, shuffle=False, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=DataLoaderMode.Sync, async_options=None, **kwargs)
Parameters
  • options (poptorch.Options) – Options that will be used to compile and run the model.

  • dataset – The dataset to get the data from.

  • batch_size (int) – This is the batch size in the conventional sense of being the size that runs through an operation in the model at any given time.

  • shuffle (bool) – Whether or not the dataset should be shuffled.

  • num_workers (int) – Number of worker processes to use to read the data.

  • drop_last (bool) – If True and the number of elements in the dataset is not a multiple of the combined batch size then the incomplete batch at the end will be dropped.

  • persistent_workers (bool) – Re-use workers between iterations if True. If None (default): enabled if num_workers > 0, disabled otherwise.

  • auto_distributed_partitioning (bool) – If True, partitions the dataset for distributed execution automatically. Otherwise, it is assumed that partitioning has been handled manually.

  • mode (poptorch.DataLoaderMode) – If DataLoaderMode.Async, uses an AsynchronousDataAccessor to access the dataset. If DataLoaderMode.Sync, accesses the dataset synchronously.

  • async_options (dict) – Options to pass to AsynchronousDataAccessor.

  • kwargs – Other options to pass to the Torch’s DataLoader’s constructor.

terminate()

If mode==DataLoaderMode.Async, kills the worker process in the underlying AsynchronousDataAccessor manually, otherwise has no effect.
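
For example, a sketch of a typical set-up (my_dataset and poptorch_model are placeholders, and the dataset is assumed to yield data/label pairs):

>>> opts = poptorch.Options()
>>> opts.deviceIterations(20)
>>> loader = poptorch.DataLoader(opts, my_dataset, batch_size=16,
...                              num_workers=4,
...                              mode=poptorch.DataLoaderMode.Async)
>>> for data, labels in loader:
...     out = poptorch_model(data, labels)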

class poptorch.AsynchronousDataAccessor(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=SharingStrategy.FileSystem)

A dataloader which launches the data loading process on a separate thread to allow for the data to be preprocessed asynchronously on the CPU to minimize CPU/IPU transfer time.

This works by loading the data into a ring buffer of shared memory. When the IPU needs another batch it uses the data ready in the ring buffer. The memory is shared so it will be used in place and won't be freed until the next batch is requested. Behind the scenes, the worker thread will be filling the unready elements of the ring buffer.

Important

In order to avoid hanging issues related to OpenMP and fork() the AsynchronousDataAccessor uses the spawn start method which means your dataset must be serializable by pickle. For more information see https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods

__init__(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=SharingStrategy.FileSystem)
Parameters
  • dataset – The dataset to pull data from, this can be any Python iterable.

  • buffer_size – The size of the ring buffer.

  • miss_sleep_time_in_ms – When the buffer is full, how long (in milliseconds) to sleep the worker before checking again.

  • load_indefinitely – If True, when we hit the end of the dataset we will just loop round again.

  • early_preload – If True, start loading data in the ring buffer as soon as the worker is created. If False, wait for an iterator to be created before loading data.

  • sharing_strategy – Method used to pass the dataset object when the child process is spawned. SharedMemory is fast but might be quite limited in size; FileSystem will serialise the dataset to file and reload it, which will be slower.
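
For example, a sketch wrapping an existing poptorch.DataLoader (opts, my_dataset and poptorch_model are placeholders):

>>> loader = poptorch.DataLoader(opts, my_dataset, batch_size=16)
>>> accessor = poptorch.AsynchronousDataAccessor(loader)
>>> for data in accessor:
...     out = poptorch_model(data)
>>> accessor.terminate()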

terminate()

An override function to kill the worker process manually.