10. API reference

10.1. Options

class poptorch.Options

Set of all options controlling how a model is compiled and executed.

Pass an instance of this class to the model wrapping functions poptorch.inferenceModel() and poptorch.trainingModel() to change how the model is compiled and executed. An instance includes general options set within this class such as poptorch.Options.deviceIterations() as well as properties referring to categories of options such as Training.

>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> opts.Training.gradientAccumulation(4)

property Distributed

Options specific to running on multiple IPU servers (IPU-POD).

You should not use these when using PopRun/PopDist. Instead use popdist.poptorch.Options to set these values automatically.

property Jit

Options specific to upstream PyTorch’s JIT compiler.

property Precision

Options specific to the processing of the JIT graph prior to lowering to PopART.

property TensorLocations

Options related to tensor locations.

property Training

Options specific to training.

anchorMode(anchor_mode, anchor_return_period=None)

Specify which data to return from a model.

Parameters
  • anchor_mode (poptorch.AnchorMode) –

    • All: Return a result for each batch.

    • Sum: Return the sum of all the batches.

    • Final: Return the last batch.

    • EveryN: Return every N batches: N is passed in as anchor_return_period.

    • Default: All for inference, Final for training.

  • anchor_return_period (Optional[int]) –

Return type

poptorch.Options

For example:

>>> opts = poptorch.Options()
>>> opts.anchorMode(poptorch.AnchorMode.All)
... # or
>>> opts.anchorMode(poptorch.AnchorMode.EveryN, 10)
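The anchor modes can be pictured with a plain-Python sketch that does not import PopTorch. The helper select_outputs and its string mode names are hypothetical stand-ins for poptorch.AnchorMode, and the choice of which batches EveryN keeps (the N-th, 2N-th and so on) is an assumption made for illustration only.

```python
from typing import List, Optional


def select_outputs(batch_outputs: List[float], mode: str,
                   period: Optional[int] = None) -> List[float]:
    """Return the per-batch results that a given anchor mode would keep."""
    if mode == "All":
        return list(batch_outputs)      # one result per batch
    if mode == "Sum":
        return [sum(batch_outputs)]     # a single summed result
    if mode == "Final":
        return batch_outputs[-1:]       # only the last batch
    if mode == "EveryN":
        if period is None or period < 1:
            raise ValueError("EveryN needs a positive return period")
        # Assumption: keep the N-th, 2N-th, ... batch results.
        return batch_outputs[period - 1::period]
    raise ValueError(f"unknown anchor mode: {mode}")


outputs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(select_outputs(outputs, "EveryN", 2))
```

With six batches, All keeps six results, Sum and Final keep one each, and EveryN with a period of 2 keeps three.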
anchorTensor(short_name, long_name, anchor_mode=None, anchor_return_period=1)

Anchor a tensor such that it may be retrieved after a model run.

Parameters
  • short_name (str) – User defined name to be used for retrieval

  • long_name (str) – The PopART name of the tensor to be anchored

  • anchor_mode (poptorch.AnchorMode) – Specifies when data should be returned. Defaults to None, in which case the tensor will use the same anchor mode used for model outputs.

  • anchor_return_period (int) – Return period if anchor type is EveryN. Defaults to 1.

autoRoundNumIPUs(auto_round_num_ipus=True)

Whether or not to round up the number of IPUs used automatically: the number of IPUs requested must be a power of 2. By default, an error occurs if the model uses an unsupported number of IPUs, to prevent you from unintentionally overbooking IPUs.

Parameters

auto_round_num_ipus (bool) –

  • True: round up the number of IPUs to a power of 2.

  • False: error if the number of IPUs is not supported.

Return type

poptorch.Options
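As a rough illustration of the rounding described above, the following sketch computes the next power of 2 for a requested IPU count. round_up_to_power_of_two is a hypothetical helper, not a PopTorch function, and PopTorch's own rounding logic may differ in detail.

```python
def round_up_to_power_of_two(num_ipus: int) -> int:
    """Smallest power of 2 that is greater than or equal to num_ipus."""
    if num_ipus < 1:
        raise ValueError("num_ipus must be at least 1")
    # (n - 1).bit_length() is the exponent of the next power of 2 >= n.
    return 1 << (num_ipus - 1).bit_length()


for requested in (1, 2, 3, 5, 8):
    print(requested, "->", round_up_to_power_of_two(requested))
```

A model that needs 3 IPUs would therefore reserve 4, which is why the option is off by default.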

connectionType(connection_type)

When to connect to the IPU (if at all).

Parameters

connection_type (poptorch.ConnectionType) –

  • Always: Attach to the IPU from the start (default).

  • OnDemand: Wait until the compilation is complete and the executable is ready to be run to attach to the IPU.

  • Never: Never try to attach to an IPU: this is useful for offline compilation, but trying to run an executable will raise an exception.

Return type

poptorch.Options

For example:

>>> opts = poptorch.Options()
>>> opts.connectionType(poptorch.ConnectionType.OnDemand)

defaultAnchorMode()
Returns

Return type

bool

deviceIterations(device_iterations)

Number of iterations the device should run over the data before returning to the user (default: 1).

This is equivalent to running the IPU in a loop over the specified number of iterations, with a new batch of data each time. However, increasing deviceIterations is more efficient because the loop runs on the IPU directly.

Parameters

device_iterations (int) –

Return type

poptorch.Options

enableExecutableCaching(path)

Load/save Poplar executables to the specified path, using it as a cache, to avoid recompiling identical graphs.

Parameters

path (str) – File path for Poplar executable cache store; setting path to None disables executable caching.

Return type

poptorch.Options

enableProfiling(profile_dir=None)

Enable profiling report generation.

To generate debug information associated with the profiling data, specify autoReport.directory, and either autoReport.all or autoReport.outputDebugInfo, in the POPLAR_ENGINE_OPTIONS environment variable. For example:

POPLAR_ENGINE_OPTIONS={"autoReport.directory":"/profile/output",\
"autoReport.all":"true"}

or:

POPLAR_ENGINE_OPTIONS={"autoReport.directory":"/profile/output",\
"autoReport.outputDebugInfo":"true"}

Debug information and the rest of the profiling data will be stored in the /profile/output directory. Values specified in the environment variable take precedence over profile_dir when both are given.

Parameters

profile_dir (str) – path to directory where report will be created. Defaults to current directory.

Return type

poptorch.Options
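Because POPLAR_ENGINE_OPTIONS holds a JSON object, one way to avoid quoting mistakes is to build it from a Python dictionary. This is a minimal sketch that reuses the /profile/output path from the example above; it assumes the variable needs to be set in the process environment before the model is compiled.

```python
import json
import os

# Engine options from the example above, expressed as a dictionary.
engine_options = {
    "autoReport.directory": "/profile/output",
    "autoReport.all": "true",
}

# json.dumps produces the quoted JSON string the variable expects.
os.environ["POPLAR_ENGINE_OPTIONS"] = json.dumps(engine_options)

print(os.environ["POPLAR_ENGINE_OPTIONS"])
```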

enableStableNorm(enabled)

Set whether a stable version of norm operators is used. This stable version is slower, but more accurate than its unstable counterpart.

Parameters

enabled (bool) –

  • True: Use stable norm calculation.

  • False: Do not use stable norm calculation.

Return type

poptorch.Options

enableSyntheticData(enabled)

Set whether host I/O is disabled and synthetic data is generated on the IPU instead. This can be used to benchmark models whilst simulating perfect I/O conditions.

Parameters

enabled (bool) –

  • True: Use data generated from a random normal distribution on the IPU. Host I/O is disabled.

  • False: Host I/O is enabled and real data is used.

Return type

poptorch.Options

loadFromFile(filepath)

Load options from a config file where each line in the file corresponds to a single option being set. To set an option, simply specify how you would set the option within a Python script, but omit the options. prefix.

For example, if you wanted to set options.deviceIterations(1), this would be set in the config file by adding a single line with contents deviceIterations(1).

Parameters

filepath (str) –

Return type

poptorch.Options
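The file format can be sketched without PopTorch: the snippet below writes a two-line config file and maps each line back to the Python statement it stands for. The file name poptorch.conf and the helper as_python_calls are hypothetical, and the mapping only illustrates the "omit the options. prefix" rule; it is not how loadFromFile is implemented.

```python
import os
import tempfile

# One option per line, written as in Python but without the "options." prefix.
config_lines = [
    "deviceIterations(10)",
    "replicationFactor(2)",
]


def as_python_calls(path: str, prefix: str = "options") -> list:
    """Map each non-empty config line to the equivalent Python statement."""
    with open(path) as config_file:
        return [f"{prefix}.{line.strip()}" for line in config_file if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "poptorch.conf")
    with open(config_path, "w") as config_file:
        config_file.write("\n".join(config_lines) + "\n")
    calls = as_python_calls(config_path)

print(calls)
```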

logCycleCount(log_cycle_count)

Log the number of IPU cycles used in executing the main graph, which is printed by setting the environment variable POPTORCH_LOG_LEVEL=INFO. This option requires IPU hardware to run.

Note: This will have a small detrimental impact on performance.

Parameters

log_cycle_count (bool) –

  • True: Enable logging the IPU cycle count.

  • False: Do not enable IPU cycle count logging.

Return type

poptorch.Options

logDir(log_dir)

Set the log directory

Parameters

log_dir (str) – Directory where PopTorch saves log files (default: current directory)

Return type

poptorch.Options

modelName(name)

Set the model name

Parameters

name (str) – Name of the model. Defaults to “inference” or “training” depending on the type of model created. When profiling, the name sets the subdirectory of the report directory that the profiling output is written to.

Return type

poptorch.Options

randomSeed(random_seed)

Set the seed for the random number generator on the IPU.

Parameters

random_seed (int) – Random seed integer.

Return type

poptorch.Options

relaxOptimizerAttributesChecks(relax=True)

Controls whether unexpected attributes in setOptimizer() lead to warnings or debug messages.

By default PopTorch will print warnings the first time it encounters unexpected attributes in setOptimizer().

Parameters

relax (bool) –

  • True: Redirect warnings to the debug channel.

  • False: Print warnings about unexpected attributes (default behaviour).

Return type

poptorch.Options

replicationFactor(replication_factor)

Number of times to replicate the model (default: 1).

Replicating the model increases the data throughput of the model as PopTorch uses more IPUs. The number of IPUs used is scaled by replication_factor. For example, if your model uses 1 IPU, a replication_factor of 2 will use 2 IPUs; if your model uses 4 IPUs, a replication_factor of 4 will use 16 IPUs in total.

Parameters

replication_factor (int) – Number of replicas of the model to create.

Return type

poptorch.Options
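The scaling rule above is simple arithmetic, sketched here with a hypothetical helper total_ipus (not part of PopTorch). The same product is the number of IPUs a device ID must provide when you select a device with useIpuId().

```python
def total_ipus(ipus_per_replica: int, replication_factor: int = 1) -> int:
    """IPUs needed in total: IPUs used by one replica times the replica count."""
    if ipus_per_replica < 1 or replication_factor < 1:
        raise ValueError("both arguments must be at least 1")
    return ipus_per_replica * replication_factor


print(total_ipus(1, 2))   # single-IPU model, two replicas
print(total_ipus(4, 4))   # model pipelined over 4 IPUs, four replicas
```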

setAvailableMemoryProportion(available_memory_proportion)

Sets the amount of temporary memory made available on a per-IPU basis.

Use this setting to control the amount of temporary memory available to operations such as:

  • convolution

  • matrix multiplication

  • embedding lookups

  • indexing operations

Parameter should be a dictionary of IPU ids and float values between 0 and 1. (for example, {"IPU0": 0.5})

The floating point value has the same meaning and effect as documented in set_available_memory().

Parameters

available_memory_proportion (Dict[str, float]) –
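A small check can catch malformed dictionaries before they reach the options object. The sketch below is hypothetical: it assumes keys follow the "IPU<index>" pattern seen in the example above and that values lie in the closed range 0 to 1; PopTorch's own validation may be stricter or looser.

```python
import re
from typing import Dict


def validate_available_memory_proportion(proportions: Dict[str, float]) -> Dict[str, float]:
    """Check key names and value ranges, returning the dictionary unchanged."""
    for ipu_name, proportion in proportions.items():
        # Assumption: keys look like "IPU0", "IPU1", ...
        if not re.fullmatch(r"IPU\d+", ipu_name):
            raise ValueError(f"unexpected IPU name: {ipu_name!r}")
        if not 0.0 <= proportion <= 1.0:
            raise ValueError(f"{ipu_name}: proportion {proportion} is outside 0..1")
    return proportions


checked = validate_available_memory_proportion({"IPU0": 0.5, "IPU1": 0.2})
print(checked)
```

The checked dictionary could then be passed to setAvailableMemoryProportion().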

setExecutionStrategy(strategy)

Set the execution strategy to use to partition the graph.

Parameters

strategy (Union[poptorch.ParallelPhasedExecution, poptorch.SerialPhasedExecution]) – Must be an instance of one of the execution strategy classes.

Return type

poptorch.Options

showCompilationProgressBar(show=True)

Show / hide a progress bar while the model is being compiled. (The progress bar is shown by default)

Parameters

show (bool) –

Return type

poptorch.Options

syncPattern(sync_pattern)

Controls synchronisation in multi-IPU systems.

This option can be used to allow subsets of IPUs to overlap their work. For example, one set of IPUs could be communicating with the host while other IPUs are processing data.

This option is typically used together with replicated execution, in which case it takes effect on a per-replica basis. If replication is not used, it will apply to all IPUs.

Parameters

sync_pattern (poptorch.SyncPattern) –

  • Full: Require all IPUs to synchronise on every communication between IPUs or between IPUs and host. This is the default.

  • SinglePipeline: Allow IPUs to synchronise with the host independently, without having to synchronise with each other. This permits any one IPU to perform host IO while other IPUs are processing data.

  • ReplicaAndLadder: Allow an IPU group to communicate with the host without requiring synchronisation between groups. This permits multiple IPU groups to alternate between performing host IO and computation.

Return type

poptorch.Options

useIpuId(ipu_id)

Use the IPU device specified by the ID (as provided by gc-info)

A device ID may refer to a single or to a group of IPUs (a multi-IPU device). The number of IPUs associated with the ID must be equal to the number of IPUs used by your annotated model multiplied by the replication factor.

For example, if your model uses 1 IPU and the replication factor is 2, you will need to provide a device ID with 2 IPUs; if your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide a device ID which represents a multi-IPU device of 16 IPUs.

You can use the command-line tool gc-info: running gc-info -a shows each device ID and a list of IPUs associated with the ID.

Parameters

ipu_id (int) – IPU device ID of a single-IPU or multi-IPU device

Return type

poptorch.Options

useIpuModel(use_model)

Whether to use the IPU Model or physical hardware (default)

The IPU model simulates the behaviour of IPU hardware but does not offer all the functionality of an IPU. Please see the Poplar and PopLibs User Guide for further information.

This setting takes precedence over the POPTORCH_IPU_MODEL environment variable.

Parameters

use_model (bool) –

  • True: Use the IPU Model.

  • False: Use IPU hardware.

Return type

poptorch.Options

useOfflineIpuTarget(ipu_version=2)

Create an offline IPU target that can only be used for offline compilation.

Note

The offline IPU target cannot be used if the IPU model is enabled.

Parameters

ipu_version (int) – IPU version to target (1 for Mk1, 2 for Mk2). Default: 2.

Return type

poptorch.Options

class poptorch.options._DistributedOptions

Options related to distributed execution.

You should not use these when using PopRun/PopDist. Instead use popdist.poptorch.Options to set these values automatically.

Can be accessed via poptorch.Options.Distributed:

>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)

configureProcessId(process_id, num_processes)

Manually set the current process ID and the total number of processes.

Parameters
  • process_id (int) – The ID of this process.

  • num_processes (int) – The total number of processes the execution is distributed over.

Return type

poptorch.options._DistributedOptions

disable()

Ignore the current options / environment variables and disable distributed execution.

Return type

poptorch.options._DistributedOptions

property numProcesses

Total number of processes the execution is distributed over.

property processId

Id of the current process.

setEnvVarNames(var_num_processes, var_process_id)

Utility to read and set processId and numProcesses from environment variables.

Useful if you use a third party library to manage the processes used for the distributed execution such as mpirun.

For example: mpirun -np 4 myscript.py

By default the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.

Parameters
  • var_num_processes (str) –

  • var_process_id (str) –

Return type

poptorch.options._DistributedOptions
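Conceptually, this amounts to reading two integers from the environment. The following sketch shows that idea in plain Python; read_process_layout is a hypothetical helper, and the fallback to a single process when the variables are missing is an assumption, not documented PopTorch behaviour.

```python
import os
from typing import Mapping, Optional, Tuple


def read_process_layout(var_num_processes: str = "OMPI_COMM_WORLD_SIZE",
                        var_process_id: str = "OMPI_COMM_WORLD_RANK",
                        environ: Optional[Mapping[str, str]] = None) -> Tuple[int, int]:
    """Return (process_id, num_processes) read from environment variables."""
    env = os.environ if environ is None else environ
    # Assumption: fall back to one process with ID 0 if nothing is set.
    num_processes = int(env.get(var_num_processes, "1"))
    process_id = int(env.get(var_process_id, "0"))
    if not 0 <= process_id < num_processes:
        raise ValueError("process ID must be in the range [0, num_processes)")
    return process_id, num_processes


# Simulate what one of four mpirun-launched processes would see.
fake_env = {"OMPI_COMM_WORLD_SIZE": "4", "OMPI_COMM_WORLD_RANK": "1"}
print(read_process_layout(environ=fake_env))
```

The resulting pair corresponds to the arguments of configureProcessId(process_id, num_processes).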

class poptorch.options._PrecisionOptions(popart_options)

Options related to processing the PyTorch JIT graph prior to lowering to PopART

Can be accessed via poptorch.Options.Precision:

>>> opts = poptorch.Options()
>>> opts.Precision.halfFloatCasting(
...   poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)

autocastEnabled(autocast_enabled)

Controls whether automatic casting functionality is turned on.

Parameters

autocast_enabled (bool) – if True, automatic casting is active. Default value is True.

Return type

poptorch.options._PrecisionOptions

autocastPolicy(autocast_policy)

Set the automatic casting policy.

Parameters
Return type

poptorch.options._PrecisionOptions

enableFloatingPointExceptions(enabled)

Set whether floating point exceptions are enabled on the IPU.

When enabled, an exception will be generated when the IPU encounters any one of the following:

  • Operation resulting in subtraction of infinities

  • Divisions by zero or by infinity

  • Multiplications between zero and infinity

  • Real operations producing complex results

  • Comparison where any one operand is Not-a-Number

Parameters

enabled (bool) –

  • True: raise RuntimeError on floating point exception

  • False: do not raise RuntimeError (default)

Return type

poptorch.options._PrecisionOptions

enableStochasticRounding(enabled)

Set whether stochastic rounding is enabled on the IPU.

Stochastic rounding randomly rounds a value up or down to half precision (float16) such that the expected (mean) result of the rounded value is equal to the unrounded value. It can improve training performance by simulating higher precision behaviour and increasing the speed or likelihood of model convergence. However, the model becomes non-deterministic, which is a departure from (deterministic) standard IEEE FP16 behaviour.

In the general case, we recommend enabling stochastic rounding for training where convergence is desirable, but not for inference where non-determinism may be undesirable.

Parameters

enabled (bool) –

  • True: Enable stochastic rounding on the IPU.

  • False: Disable stochastic rounding.

Return type

poptorch.options._PrecisionOptions
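The "expected value equals the unrounded value" property can be demonstrated on the host with a few lines of plain Python. This is only a sketch of the general technique: it rounds to a uniform grid with spacing step as a stand-in for the spacing between neighbouring float16 values, and it says nothing about how the IPU implements the feature.

```python
import math
import random


def stochastic_round(value: float, step: float, rng: random.Random) -> float:
    """Round value to a multiple of step, rounding up with probability
    proportional to how close value is to the upper neighbour."""
    lower = math.floor(value / step) * step
    fraction = (value - lower) / step       # position between the neighbours
    return lower + step if rng.random() < fraction else lower


rng = random.Random(0)                       # fixed seed for repeatability
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)

print(sorted(set(samples)), round(mean, 2))
```

Every individual sample is either 0.0 or 1.0, yet their mean stays close to 0.3, whereas round-to-nearest would always give 0.0.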

halfFloatCasting(half_float_casting)

Changes the casting behaviour for ops involving a float16 (half) and a float32.

The default option, FloatDowncastToHalf, allows parameters (weights) to be stored as and updated as float32 but cast to float16 when used in an operation with a float16 input. The benefit of this is higher efficiency and reduced memory footprint without the same loss of precision of parameters during the optimiser update step. However, you can change the behaviour to match PyTorch using option HalfUpcastToFloat.

Parameters

half_float_casting (poptorch.HalfFloatCastingBehavior) –

  • FloatDowncastToHalf: Any op with operands (inputs) which are a mix of float32 and float16 (half) will cast all operands to half.

  • HalfUpcastToFloat: Implicit casting will follow PyTorch’s rules, promoting float16 (half) inputs to float32 if another input is float32.

Return type

poptorch.options._PrecisionOptions

runningStatisticsAlwaysFloat(value)

Controls whether the running mean and variance tensors of batch normalisation layers should be float32 regardless of input type.

A batch normalisation layer stores a running estimate of the means and variances of each channel during training, for use at inference in lieu of batch statistics. Storing the values as half (float16) can result in poor performance due to the low precision. Enabling this option yields more reliable estimates by forcing all running estimates of means and variances to be stored as float32, at the cost of extra memory use.

Parameters

value (bool) –

  • True: Always store running estimates of mean and variance as float32.

  • False: Store running estimates of mean and variance as the same type as the layer input.

Return type

poptorch.options._PrecisionOptions

setPartialsType(dtype)

Set the data type of partial results for matrix multiplication and convolution operators.

The matrix multiplication and convolution operators store intermediate results known as partials as part of the calculation. You can use this option to change the data type of the partials. Using torch.half reduces on-chip memory use at the cost of precision.

Parameters
  • dtype (torch.dtype) – The type to store partials, which must be either torch.float or torch.half.

Return type

poptorch.options._PrecisionOptions

class poptorch.options._JitOptions

Options related to PyTorch’s JIT compiler.

Can be accessed via poptorch.Options.Jit:

>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)

traceModel(trace_model)

Controls whether to use PyTorch’s tracing or an alternative.

Currently unused: tracing is locked to torch.jit.trace.

Parameters

trace_model (bool) –

Return type

poptorch.options._JitOptions

class poptorch.options._TensorLocationOptions(**default_values)

Options controlling where to store tensors.

Can be accessed via poptorch.Options.TensorLocations:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))

numIOTiles(num_tiles)

Assigns the number of tiles on the IPU to be IO rather than compute.

Allocating IO (input/output) tiles reduces the number of IPU tiles available for computation but allows you to reduce the latency of copying tensors from the host to the IPUs using the function poptorch.set_overlap_for_input(), or to use off-chip memory with reduced latency by setting the option useIOTilesToLoad(). As reducing the number of computation tiles may reduce performance, you should not use any IO tiles until you have successfully run your model and used profiling to identify “streamCopy” entries which take up a significant proportion of execution time.

Parameters

num_tiles (int) –

Return type

poptorch.TensorLocationSettings

setAccumulatorLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for accumulators.

Return type

poptorch.options._TensorLocationOptions

setActivationLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for activations.

Return type

poptorch.options._TensorLocationOptions

setOptimizerLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for optimiser states.

Return type

poptorch.options._TensorLocationOptions

setWeightLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for weights.

Return type

poptorch.options._TensorLocationOptions

class poptorch.TensorLocationSettings(**default_values)

Define where a tensor is stored

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))

minElementsForOffChip(min_elements)

A minimum number of elements below which offloading won’t be considered.

Parameters

min_elements (int) –

Return type

poptorch.TensorLocationSettings

minElementsForReplicatedTensorSharding(min_elements)

Only enable replicated tensor sharding (RTS) for tensors with more than min_elements elements.

Parameters

min_elements (int) –

Return type

poptorch.TensorLocationSettings

useIOTilesToLoad(use=True)

Load tensor through IO tiles

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

Return type

poptorch.TensorLocationSettings

useIOTilesToStore(use=True)

Use IO tiles to store tensors.

(relevant for replicated tensor sharded tensors)

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

Return type

poptorch.TensorLocationSettings

useOnChipStorage(use=True)

Permanent tensor storage

Parameters

use (bool) – True: use on chip memory. False: use off chip memory. None: keep it undefined.

Return type

poptorch.TensorLocationSettings

useReplicatedTensorSharding(use=True)

Enable replicated tensor sharding

(relevant for weights and optimiser states)

Parameters

use (bool) –

Return type

poptorch.TensorLocationSettings

class poptorch.options._TrainingOptions(popart_options)

Options specific to model training.

Note

You must not set these options for inference models.

Can be accessed via poptorch.Options.Training:

>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)

accumulationAndReplicationReductionType(reduction_type)

Set the type of reduction applied to reductions in the graph.

When using a value greater than one for gradientAccumulation() or for replicationFactor(), PopTorch applies a reduction to the gradient outputs from each replica, and to the accumulated gradients. This reduction is independent of the model loss reduction (summing a mean-reduced loss and a sum-reduced loss in a PyTorch model is valid).

This setting governs both the accumulation of the loss gradients in replicated graphs and of all of the gradients when using gradient accumulation.

Parameters

reduction_type (poptorch.ReductionType) –

  • Mean: Reduce gradients by calculating the mean of them.

  • Sum: Reduce gradients by calculating the sum of them.

Return type

poptorch.options._TrainingOptions

gradientAccumulation(gradient_accumulation)

Number of micro-batches to accumulate for the gradient calculation.

Accumulate the gradient gradient_accumulation times before updating the model using the gradient. Other frameworks may refer to this setting as “pipeline depth”.

Each micro-batch (a batch of size equal to the batch_size argument passed to poptorch.DataLoader) corresponds to one gradient accumulation. Therefore gradient_accumulation scales the global batch size (number of samples between optimiser updates).

Note

Increasing gradient_accumulation does not alter the (mini-) batch size used for batch normalisation.

A large value for gradient_accumulation can improve training throughput by amortising optimiser update costs, most notably when using PipelinedExecution or when training is distributed over a number of replicas. However, the consequential increase in the number of samples between optimiser updates can have an adverse impact on training.

The efficiency gains are most notable when training models on multiple IPUs which express pipelined model parallelism (via PipelinedExecution, or by default when annotating the model with poptorch.BeginBlock or poptorch.Block) because the pipeline has “ramp up” and “ramp down” steps around each optimiser update. Increasing the gradient accumulation factor in this instance reduces the proportion of time spent in the “ramp up” and “ramp down” phases, increasing overall throughput.

When training involves multiple replicas, including the cases of sharded and phased execution, each optimiser step incurs a communication cost associated with the reduction of the gradients. By accumulating gradients, you can reduce the total number of updates required and thus reduce the total amount of communication.

Note

Increasing the global batch size can have adverse effects on the sample efficiency of training so it is recommended to use a low or unity gradient accumulation count initially, and then try increasing to achieve higher throughput. You may also need to scale other hyper-parameters such as the optimiser learning rate accordingly.

Parameters

gradient_accumulation (int) –

Return type

poptorch.options._TrainingOptions
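The effect on the global batch size can be worked through with a short calculation. global_batch_size below is a hypothetical helper; including replication_factor in the product is an assumption based on each replica processing its own micro-batches between optimiser updates.

```python
def global_batch_size(micro_batch_size: int,
                      gradient_accumulation: int,
                      replication_factor: int = 1) -> int:
    """Number of samples contributing to one optimiser update."""
    for name, value in (("micro_batch_size", micro_batch_size),
                        ("gradient_accumulation", gradient_accumulation),
                        ("replication_factor", replication_factor)):
        if value < 1:
            raise ValueError(f"{name} must be at least 1")
    return micro_batch_size * gradient_accumulation * replication_factor


print(global_batch_size(8, 4))       # one replica
print(global_batch_size(8, 4, 2))    # two replicas
```

A DataLoader batch_size of 8 with gradientAccumulation(4) therefore gives 32 samples per update on one replica, while batch normalisation still sees batches of 8.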

setAutomaticLossScaling(enabled)

Set whether automatic loss scaling is enabled on the IPU.

When using float16/half values for activations, gradients, and weights, the loss value needs to be scaled by a constant factor to avoid underflow/overflow. This adjustment is known as loss scaling. This setting automatically sets a global loss scaling factor during training.

Note: This is an experimental feature and may not behave as expected.

Parameters

enabled (bool) –

  • True: Enable automatic loss scaling on the IPU.

  • False: Disable automatic loss scaling.

Return type

poptorch.options._TrainingOptions

setConvolutionDithering(enabled)

Enable convolution dithering.

If true, then convolutions with different parameters will be laid out from different tiles in an effort to improve tile balance in models.

Use MultiConv to apply this option to a specific set of convolutions.

Parameters

enabled (bool) – Enables or disables convolution dithering for all convolutions.

Return type

poptorch.options._TrainingOptions

setMeanAccumulationAndReplicationReductionStrategy(mean_reduction_strategy)

Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType.Mean.

Parameters

mean_reduction_strategy (poptorch.MeanReductionStrategy) –

  • Running: Keeps the reduction buffer as the current mean. This is preferred for numerical stability as the buffer value is never larger than the magnitude of the largest micro batch gradient.

  • Post: Divides by the accumulationFactor and replicatedGraphCount after all of the gradients have been reduced. In some cases this can be faster than using Running, but it is prone to overflow.

  • PostAndLoss (deprecated): Divides by the replicatedGraphCount before the backwards pass, performs the gradient reduction across micro batches, and then divides by the accumulationFactor. This is to support legacy behaviour and is deprecated.

Return type

poptorch.options._TrainingOptions
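The numerical difference between Running and Post can be seen in a host-side sketch. Both helpers below are hypothetical and use scalar "gradients"; the incremental-mean update is the standard textbook formula, chosen for illustration, and is not claimed to be what PopTorch or PopART executes.

```python
from typing import List, Tuple


def running_mean(grads: List[float]) -> Tuple[float, float]:
    """Keep the buffer as the current mean; return (mean, peak buffer size)."""
    mean, peak = 0.0, 0.0
    for count, grad in enumerate(grads, start=1):
        mean += (grad - mean) / count        # incremental mean update
        peak = max(peak, abs(mean))
    return mean, peak


def post_mean(grads: List[float]) -> Tuple[float, float]:
    """Sum first, divide at the end; return (mean, peak buffer size)."""
    total, peak = 0.0, 0.0
    for grad in grads:
        total += grad
        peak = max(peak, abs(total))
    return total / len(grads), peak


micro_batch_grads = [4.0, 8.0, 6.0, 2.0]
print(running_mean(micro_batch_grads))
print(post_mean(micro_batch_grads))
```

Both strategies arrive at the same mean, but the Post buffer grows to the full sum before the division, which is where overflow can occur in float16.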

10.2. Helpers

poptorch.ipuHardwareIsAvailable(num_ipus=1)

Indicates whether any IPU hardware with num_ipus is present in the system.

Note: This function doesn’t check if the IPU is free or already being used.

Parameters

num_ipus (int) –

Returns

True if physical IPUs are available, False otherwise.

Return type

bool

poptorch.ipuHardwareVersion()

Indicates what IPU hardware version is available in the system.

Raises an exception if no hardware is available.

Returns

The IPU hardware version or -1 if unknown.

Return type

int

poptorch.setLogLevel(level)

Changes the volume of messages printed in the console (stdout)

Parameters

level (Union[str, int]) –

  • TRACE: Print all messages.

  • DEBUG: Print debug messages and above.

  • INFO: Print info messages and above.

  • WARN: Print warnings and errors.

  • ERR: Print errors only.

  • OFF: Print nothing.

class poptorch.profiling.Channel(name)

Profiling channel.

Note

If the libpvti profiling library is not available at runtime this class becomes a no-op.

Example:

>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")

instrument(obj, *methods)

Instrument the methods of an object.

Parameters
  • obj – Object to instrument

  • methods – One or more methods to wrap in profiling tracepoints.

tracepoint(name)

Create a context tracepoint

>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()

Parameters

name – Name associated with this tracepoint.

10.3. PopTorch Ops

poptorch.ctc_beam_search_decoder(probs, lengths, blank=0, beam_width=100, top_paths=1)

Add a connectionist temporal classification (CTC) beam search decoder to the model.

Calculates the most likely top paths and their probabilities given the input logarithmic probabilities and the data lengths.

Parameters
  • probs (torch.Tensor) – Logarithmic probabilities tensor with the shape of [input_length, batch_size, num_classes].

  • lengths (torch.Tensor) – Tensor representing lengths of the inputs of shape [batch_size].

  • blank (int) – Integer identifier of the blank class (default: 0).

  • beam_width (int) – Number of beams used during decoding (default: 100).

  • top_paths (int) – Number of most likely paths to return (default: 1).

Returns

Three tensors representing paths’ probabilities - of shape [batch_size, top_paths], paths’ lengths - of shape [batch_size, top_paths] and the decoded paths - of shape [batch_size, top_paths, input_length].

Return type

List[torch.Tensor]

poptorch.ipu_print_tensor(tensor, title='')

Adds an op to print the content of a given IPU tensor.

When this is executed the tensor will be copied back to host and printed.

When this operation is called in the backward pass it will print the gradient of the tensor.

The operation is an identity operation and will return the exact same tensor. The returned tensor must be used in place of the original tensor in the rest of the program, to make sure that the print operation isn’t optimised away.

For example if the original code looks like this:

def forward(self, c, d, b):
  a = c + d
  return a + b

If the result of ipu_print_tensor is not used, it will be optimised out by the graph optimiser and the tensor will not be printed.

So if you want to print the value of a, you should do:

def forward(self, c, d, b):
  a = c + d
  x = poptorch.ipu_print_tensor(a)
  return x + b

Optionally, you may add a second string parameter to be used as a title.

def forward(self, c, d, b):
    a = c + d
    x = poptorch.ipu_print_tensor(a, "summation")
    return x + b

Warning

In order for the print operation to not be optimised out by the graph optimiser, you must use the output of the print.

Parameters
  • tensor (torch.Tensor) – The tensor to print.

  • title (str) – Optional title to print with the tensor.

Returns

The input unchanged.

Return type

torch.Tensor

poptorch.for_loop(count, body, inputs)

An on-device for loop. This loop will execute on device for count number of iterations.

The body should be a python function containing the PyTorch code you wish to execute in a loop. It should take as input the same number of tensors as it outputs. Each iteration will have the previous output passed in as input.

Parameters
  • count (int) – Number of iterations of the loop.

  • body (Callable[List[torch.Tensor], List[torch.Tensor]]) – The function to be executed.

  • inputs (List[torch.Tensor]) – The initial inputs to the function.

Return type

List[torch.Tensor]
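The feedback behaviour (each iteration receives the previous iteration's outputs) can be modelled on the host with ordinary Python values. for_loop_reference is a hypothetical reference model, not the PopTorch op; it assumes the body receives its inputs as positional arguments and may return either a single value or a tuple or list.

```python
from typing import Callable, List, Sequence


def for_loop_reference(count: int, body: Callable, inputs: Sequence) -> List:
    """Host-side model of a loop that feeds its outputs back in as inputs."""
    values = list(inputs)
    for _ in range(count):
        result = body(*values)
        # Accept a single value or a tuple/list of values from the body.
        values = list(result) if isinstance(result, (tuple, list)) else [result]
        if len(values) != len(inputs):
            raise ValueError("body must return as many values as it receives")
    return values


doubled = for_loop_reference(10, lambda x: x * 2, [1])
fibonacci = for_loop_reference(3, lambda a, b: (b, a + b), [0, 1])
print(doubled, fibonacci)
```

The length check mirrors the requirement that the body takes as many tensors as it outputs.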

poptorch.recomputationCheckpoint(*tensors)

Operation for checkpointing values in a computational pipeline stage.

When recomputation is enabled, these values will not be recomputed and they will be stored in memory between forward and backwards passes instead.

Parameters

tensors (List[torch.Tensor]) – One or more tensors which should be checkpointed.

Returns

Tensors (same number and shape as the input tensors).

Return type

List[torch.Tensor]

poptorch.identity_loss(x, reduction)

Marks this operation as being part of the loss calculation so that the PopTorch autograd back-propagates through it. This enables multiple losses and custom losses.

Parameters
  • x (torch.Tensor) – The calculated loss.

  • reduction (str) –

    Reduce the loss output as per PyTorch loss semantics. Supported values are:

    • "sum": Sum the losses.

    • "mean": Take the mean of the losses.

    • "none": Don’t reduce the losses.

Returns

An identity loss custom op.

Return type

torch.Tensor
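
As a rough illustration of the three reduction values, the following plain-Python sketch applies each of them to a list of per-sample losses. It only models the arithmetic of the reduction. The helper name reduce_losses is hypothetical and is not part of PopTorch.

```python
from typing import List, Union


def reduce_losses(losses: List[float],
                  reduction: str) -> Union[float, List[float]]:
    """Model of the "sum", "mean" and "none" reductions (hypothetical helper)."""
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    if reduction == "none":
        # No reduction: the per-sample losses are returned unchanged.
        return list(losses)
    raise ValueError(f"unsupported reduction: {reduction!r}")


per_sample = [1.0, 2.0, 3.0]
summed = reduce_losses(per_sample, "sum")
averaged = reduce_losses(per_sample, "mean")
unreduced = reduce_losses(per_sample, "none")
```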

class poptorch.MultiConv

Combines all convolution layers evaluated inside this scope into a single multi-convolution.

Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.

For example:

>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)

Combines the two data-independent convolutions into a single multi-convolution.

Refer to the PopLibs documentation for further information on multi-convolutions.

availableMemoryProportions(value)

The available memory proportion per convolution, each [0, 1).

Parameters

value (Union[float, List[float]]) – Can be a float value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many float values as the number of convolutions.

Returns

self, to support method chaining

Return type

poptorch.MultiConv

cycleBackOff(value)

Cycle back off proportion.

Parameters

value (float) – Number between 0 and 1

Returns

self, to support method chaining

Return type

poptorch.MultiConv

enableConvDithering(value)

Enable per-convolution dithering.

Parameters

value (Union[bool, List[bool]]) – Can be a bool value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many bool values as the number of convolutions.

Returns

self, to support method chaining

Return type

poptorch.MultiConv

partialsTypes(value)

The partials type used for each convolution.

Parameters

value (Union[torch.dtype, List[torch.dtype]]) – Can be a single instance of torch.dtype in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many torch.dtype values as the number of convolutions.

Returns

self, to support method chaining

Return type

poptorch.MultiConv

perConvReservedTiles(value)

Tiles to reserve for each convolution.

Parameters

value (int) – Number of tiles

Returns

self, to support method chaining

Return type

poptorch.MultiConv

planType(value)

Select the multi-convolution execution strategy.

Parameters

value (poptorch.MultiConvPlanType) – An instance of MultiConvPlanType.

Returns

self, to support method chaining

Return type

poptorch.MultiConv

class poptorch.CPU(layer_to_call, ID)
class poptorch.NameScope(name)

Create a name scope for a code block. All operators originating from this block will have their names prefixed by the given string.

>>> with poptorch.NameScope("CustomString"):
...     y = self.bmm(a, b)
...     z = torch.relu(y)
class poptorch.MultiConvPlanType(value)

Selects the execution strategy for a poptorch.MultiConv

  • Parallel: Execute multiple convolutions in parallel (Default).

  • Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.

class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs, attributes=None)

Applies a custom operation, implemented within PopART, to the inputs.

Parameters
  • inputs (tuple) – A tuple of input tensors, for example, (x, y).

  • name (str) – Unique name of the PopART custom op.

  • domain (str) – Domain for the op.

  • domain_version (int) – Version of the domain to use.

  • example_outputs (iterable) – A tuple of tensors with the same type and shape as the outputs. The value does not matter as all values will be set to zero for tracing purposes.

  • attributes (dict) – A dictionary of attributes for the custom op. All attribute keys must be strings. All attribute values must be floats, ints, strings, or a list/tuple containing only floats, only ints or only strings (not a mix of types within the list).

Returns

The outputs of the forward op of the custom op.

poptorch.nop(tensor)

A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.

Parameters

tensor (torch.Tensor) – the tensor to pass to the no-op.

Returns

The same tensor which was input.

Return type

torch.Tensor

poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)

Calculates a matrix product using a serialized matrix multiplication.

The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.

Parameters
  • lhs (torch.Tensor) – Left-hand side input matrix.

  • rhs (torch.Tensor) – Right-hand side input matrix.

  • mode (poptorch.MatMulSerializationMode) –

    Which dimension of the matmul to serialize on, for matrix A (m by n) multiplied by matrix B (n by p):

    • InputChannels: Split across the input channels (dimension m).

    • ReducingDim: Split across the reducing dimension (n).

    • OutputChannels: Split across the output channels (dimension p).

    • Disabled: Same as an ordinary matrix multiplication.

  • factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.

  • keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If keep_precision is True, these additions will occur using float32 rather than half precision partials, matching those used for the individual matrix multiplications.

Return type

torch.Tensor
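
To make the idea of serialization concrete, here is a minimal plain-Python sketch of splitting a product over the reducing dimension (n) into factor smaller products that are computed one after the other and accumulated. It illustrates the arithmetic only and says nothing about how PopTorch or PopLibs implement it. The function names are hypothetical and nested lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary (m by n) times (n by p) matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def serialized_matmul_reducing_dim(a: Matrix, b: Matrix,
                                   factor: int) -> Matrix:
    """Split the product over the reducing dimension (hypothetical helper)."""
    n = len(b)
    if n % factor != 0:
        raise ValueError("factor must divide the reducing dimension")
    chunk = n // factor
    m, p = len(a), len(b[0])
    result = [[0.0] * p for _ in range(m)]
    for step in range(factor):
        lo, hi = step * chunk, (step + 1) * chunk
        # Multiply one slice of columns of a by the matching rows of b.
        partial = matmul([row[lo:hi] for row in a], b[lo:hi])
        # The addition step that accumulates the partial products.
        for i in range(m):
            for j in range(p):
                result[i][j] += partial[i][j]
    return result


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]
full = matmul(a, b)
split = serialized_matmul_reducing_dim(a, b, factor=2)
```

Only one of the smaller products is live at a time, which is where the memory saving comes from. The accumulation loop corresponds to the addition step mentioned under keep_precision.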

poptorch.set_available_memory(tensor, available_memory_proportion)

Sets the amount of temporary memory made available to an operation.

The operators that can be tuned with this setting include:

  • convolution

  • matrix multiplication

  • embedding lookups

  • indexing operations

When applied to the output of a supported operation, it controls the trade-off between execution cycles and the temporary memory used during the execution of the operation.

The value should be between 0 and 1 (inclusive) and represents a proportion of available memory on the IPU. The default value is 0.6 (therefore, by default, PopTorch will not use more than 60% of IPU memory for temporary data).

PopTorch passes this setting to the PopLibs operator planner, which will try to constrain the use of temporary memory to below this value. Generally, an operation that has more temporary memory available will run in fewer cycles.

For a specific operation, the necessary amount of temporary memory may be more than the amount specified by this option. In this case, a warning message will be generated.

For more information, please refer to the technical note on optimising temporary memory usage.

>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
Parameters
  • tensor (torch.Tensor) – Output tensor from a supported operation (otherwise the statement will be an identity).

  • available_memory_proportion (float) – Proportion between 0.0 and 1.0 of tile memory to be made available for temporary memory (default 0.6).

Returns

The input tensor, as if calling an identity function.

Return type

torch.Tensor

poptorch.set_overlap_for_input(input_tensor, mode)

Sets host overlap setting for input_tensor.

You can increase performance in some cases by overlapping the copying from the host to IPUs with computation. However, this requires a number of IPU tiles to be set aside as IO tiles using poptorch.options._TensorLocationOptions.numIOTiles() which may affect computation performance.

You should use this function at the start of your model’s forward method for each applicable input and use the returned tensor in future ops.

Parameters
  • input_tensor (torch.Tensor) – The input tensor for which to enable overlapping host IO.

  • mode (poptorch.OverlapMode) – Control to what extent the host IO overlaps computation.

Returns

the input tensor, specified for overlap.

Return type

torch.Tensor

See also

poptorch.OverlapMode.

10.4. Model wrapping functions

poptorch.trainingModel(model, options=None, optimizer=None)

Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.

Note

PopTorch makes a shallow copy of the model. Changes to the parameters in the returned training model affect the original model and vice versa. However, primitive variable types are not synced: for example calling model.train() on the original model, which changes the training bool of the model instance, will not alter the model returned by this function. You may need to call model.train() on your model before you call this function for correct behaviour.

Parameters
Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

Return type

poptorch.PoplarExecutor

poptorch.inferenceModel(model, options=None)

Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.

Note

PopTorch makes a shallow copy of the model. Changes to the parameters in the returned inference model affect the original model and vice versa. However, primitive variable types are not synced: for example calling model.eval() on the original model will not alter the model returned by this function. You may need to call model.eval() on your model before you call this function for correct behaviour.

Parameters
Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

Return type

poptorch.PoplarExecutor

class poptorch.PoplarExecutor(model, options, training, poptorch_version, optimizer=None, user_model=None)

This class should not be created directly but is a wrapper around the model that was passed into inferenceModel or trainingModel. It only has a few methods which can be used to interface with the IPU.

__call__(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Note

The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.

Parameters
attachToDevice()

Attach to target device. Before calling this function, the device must be detached and the model compiled.

Return type

None

compile(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Trace and compile the wrapped model if no executable has been created yet.

Note: The executable created by this method can only be executed, it cannot be exported to file. To precompile and save to file use compileAndExport()

Return type

None

compileAndExport(filename, *args, export_model=True, **kwargs)

Precompile an executable and save it to file.

args and kwargs are the same arguments as the wrapped PyTorch model.__call__

Parameters
  • filename (str) – Where to save the compiled executable.

  • export_model (bool) – If True the Torch model will be saved in the file alongside the executable. poptorch.load() can be used to restore both the original Torch model, the PopTorch model and the executable. If False then only the executable will be exported and it will be the user’s responsibility to call poptorch.inferenceModel() or poptorch.trainingModel() to re-create the PopTorch model before calling loadExecutable() to restore the executable.

  • args (List[torch.Tensor]) –

  • kwargs (Dict[str, torch.Tensor]) –

copyWeightsToDevice()

Copies the weights from model.parameters() to the IPU device. Implicitly called on first call.

Return type

None

copyWeightsToHost()

Updates the parameters used in model with the weights stored on device. (The weights in model.parameters())

Return type

None

destroy()

Destroy the model: release the IPUs and the executable.

Return type

None

detachFromDevice()

Detach from target device. Before calling this function, the device must be attached (and the model compiled).

Return type

None

getComputeLatency()

Return compute latency for the last execution of the model.

The compute latency is the interval of time (in fractional seconds) between the last input tensor being transferred to the IPU and the last output tensor becoming available.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getHostIpuLatency()

Return Host-IPU latency for the last execution of the model.

The Host-IPU latency is the interval of time (in fractional seconds) between the first input tensor being requested and the last input tensor being transferred to the IPU.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getIpuHostLatency()

Return IPU-Host latency for the last execution of the model.

The IPU-Host latency is the interval of time (in fractional seconds) between the first output tensor becoming available and the last output tensor being written back to the host.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getLatency()

Return round-trip latency for the last execution of the model.

The round-trip latency is the interval of time (in fractional seconds) between the first input tensor being requested and the last output tensor being written back to the host.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getPerfCounters()

Return performance counters for the last execution of the model.

Return the values (in fractional seconds) of the performance counters corresponding to the latest run of the model. The reference point of the returned value is undefined, however the difference between values is valid.

The returned object is a dictionary where the keys correspond to each of the following events:

  • ‘input’: the IPU requesting an input tensor

  • ‘input_complete’: an input tensor having been transferred

  • ‘output’: the IPU requesting to transmit an output tensor

  • ‘output_complete’: an output tensor having been transferred

The values of the dictionary are nested lists. The first level of nesting corresponds to an input or output index. The second level list contains the actual values as fractional seconds.

Examples:

  • dict[‘input’][1][3]: performance counter for the second input tensor being requested on the fourth iteration of the model

  • dict[‘output_complete’][0][0]: performance counter for the first output tensor having been transferred on the first iteration of the model
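
The nested layout is easier to see with a small worked example. The sketch below builds a dictionary with the shape described above, using made-up timestamps, and derives a round-trip latency per iteration following the wording of getLatency() (first input requested to last output transferred). Whether PopTorch computes its latencies in exactly this way is an assumption, and the helper name round_trip_latency is hypothetical.

```python
from typing import Dict, List, Tuple

PerfCounters = Dict[str, List[List[float]]]


def round_trip_latency(counters: PerfCounters) -> Tuple[float, float, float]:
    """Return (minimum, maximum, average) latency over all iterations."""
    num_iterations = len(counters["input"][0])
    latencies = []
    for it in range(num_iterations):
        # First level: tensor index. Second level: iteration.
        first_request = min(per_input[it]
                            for per_input in counters["input"])
        last_transfer = max(per_output[it]
                            for per_output in counters["output_complete"])
        latencies.append(last_transfer - first_request)
    return min(latencies), max(latencies), sum(latencies) / len(latencies)


# Made-up values: two input tensors, one output tensor, two iterations.
example_counters = {
    "input": [[10.0, 20.0], [10.25, 20.25]],
    "input_complete": [[10.5, 20.5], [10.75, 20.75]],
    "output": [[11.0, 21.5]],
    "output_complete": [[11.5, 22.0]],
}
latency = round_trip_latency(example_counters)
```

Only differences between counters are used here, since the reference point of the individual values is undefined.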

getTensorNames()

Returns a list of all tensor names within the computational graph. Model must be compiled in advance.

Return type

List[str]

isAttachedToDevice()

Returns True if the target device has been attached, False otherwise.

Return type

bool

isCompiled()

Returns True if the model has been compiled (and not destroyed), False otherwise.

Return type

bool

loadExecutable(filename)

Load an executable previously generated using compileAndExport()

Parameters

filename (str) –

Return type

None

load_state_dict(state_dict, strict=True)

Will call load_state_dict() on the wrapped model and automatically synchronise the weights with the IPU.

Returns

  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Return type

NamedTuple with missing_keys and unexpected_keys fields

Parameters
property model

Access the wrapped Torch model.

setOptimizer(optimizer)

Sets the optimiser for a training model. Will overwrite the previous one. Supported optimisers: optim.SGD, optim.Adam, optim.AdamW, optim.RMSProp, optim.LAMB.

Parameters

optimizer (torch.optim.optimizer.Optimizer) –

poptorch.isRunningOnIpu()

This function returns True when executing on IPU and False when executing the model outside IPU scope. This allows for separate codepaths to be marked in the model simply by using:

>>> if poptorch.isRunningOnIpu():
...     pass  # IPU path
... else:
...     pass  # CPU path

Note this will only apply to code during execution. During model creation it will always return False.

Returns

True if running on IPU, otherwise False.

Return type

bool

poptorch.load(filename, edit_opts_fn=None)

Load a PopTorch model from a file previously created using compileAndExport()

Parameters
  • edit_opts_fn (Optional[Callable[poptorch.Options, None]]) – Function to edit the options before the model is restored. For example to attach to a specific IPU device.

  • filename (str) –

Return type

poptorch.PoplarExecutor

>>> model = poptorch.inferenceModel(model)
>>> model.compileAndExport("my_model.poptorch")
...
>>> model = poptorch.load("my_model.poptorch")
>>> model(my_input)

10.5. Parallel execution

class poptorch.Block(user_id=None, ipu_id=None)

A context manager to define blocks of the model.

You can use Block as a context manager. This means you use Python’s “with” statement as follows:

>>> with poptorch.Block("Encoder"):
...     self.layer = MyLayer(x)

All layers called inside this scope will run on the specified IPU, if one is specified. In addition, you can combine multiple blocks into a stage.

__init__(user_id=None, ipu_id=None)
Parameters
  • user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (Optional[int]) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

static useAutoId()

Call this method at the beginning of your forward() method to enable automatic block id generation.

Blocks with a None user_id will be assigned an automatic id which will be the index of this block in the list of id-less Blocks.

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"): # user_id = "special_block"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
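
The numbering rule can be modelled in a few lines of plain Python: only blocks without a user_id consume an automatic index. This is a sketch of the rule as stated above, not PopTorch code, and the helper name assign_block_ids is hypothetical.

```python
from typing import List, Optional


def assign_block_ids(user_ids: List[Optional[str]]) -> List[str]:
    """Resolve block ids in declaration order (hypothetical helper)."""
    next_auto_id = 0
    resolved = []
    for user_id in user_ids:
        if user_id is None:
            # Index of this block in the list of id-less blocks.
            resolved.append(str(next_auto_id))
            next_auto_id += 1
        else:
            # Explicit ids are kept and do not advance the counter.
            resolved.append(user_id)
    return resolved


block_ids = assign_block_ids([None, "special_block", None])
```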
class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)

Define a block by modifying an existing PyTorch module.

You can use this with an existing PyTorch module instance, as follows:

>>> poptorch.BeginBlock(myModel.a_layer)
>>> poptorch.BeginBlock(MyNewLayer())

The wrapped module and all sub-modules will be part of this block until a sub-module is similarly modified to be another block. In addition, if an IPU is specified, the module and its submodules will run on the specified IPU.

You can combine multiple blocks into a stage.

Parameters
  • layer_to_call – PyTorch module to assign to the block.

  • user_id – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

poptorch.BlockFunction(user_id=None, ipu_id=None)

A decorator to define blocks of the model.

You can use BlockFunction as a decorator for an existing function, as follows:

>>> @BlockFunction("Decoder", ipu_id=1)
... def decoder(self, encoder_output):
...     self.decoder_b1(encoder_output)

All layers inside the function and any functions called by the function will run on the specified IPU, if one is specified. In addition, you can combine multiple blocks into a stage.

Parameters
  • user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (Optional[int]) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

class poptorch.Stage(*block_ids)

The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.

__init__(*block_ids)

Initialize self. See help(type(self)) for accurate signature.

Parameters

block_ids (str) –

Return type

None

class poptorch.AutoStage(value)

Defines how the stages are automatically assigned to blocks when the user didn’t explicitly provide stages to the IExecutionStrategy’s constructor.

  • SameAsIpu: The stage id will be set to the selected ipu number.

  • AutoIncrement: The stage id for new blocks is automatically incremented.

Examples:

>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...  layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...  layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...  layer()

By default, the following execution strategy is used:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)

which would translate to stage_id = ipu_id:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=0

Now if instead you use:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)

The last block would be in its own stage rather than sharing one with Block “0”:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=2
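
The two assignments listed above can be reproduced with a small plain-Python model. It takes the ipu_id of each block in declaration order and returns the stage id each block would receive. This is a sketch of the documented outcome only. The helper name assign_stages and the use of strings for the two modes are illustrative assumptions.

```python
from typing import List


def assign_stages(ipu_ids: List[int], auto_stage: str) -> List[int]:
    """Return one stage id per block (hypothetical helper)."""
    if auto_stage == "SameAsIpu":
        # The stage id is the selected IPU number.
        return list(ipu_ids)
    if auto_stage == "AutoIncrement":
        # Every new block gets the next stage id.
        return list(range(len(ipu_ids)))
    raise ValueError(f"unknown mode: {auto_stage!r}")


# Blocks "0", "1" and "2" from the example run on IPUs 0, 1 and 0.
same_as_ipu = assign_stages([0, 1, 0], "SameAsIpu")
auto_increment = assign_stages([0, 1, 0], "AutoIncrement")
```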

class poptorch.Phase(*arg)

Represents an execution phase

__init__(*arg)

Create a phase.

Parameters

arg (Union[str, poptorch.Stage]) – must either be one or more Stages, or one or more blocks user_id.

If one or more strings are passed they will be interpreted as Block ids representing a single Stage.

Within a Phase, the stages will be executed in parallel.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = poptorch.Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = poptorch.Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = poptorch.Phase("A","B") # One Stage made of 2 blocks
class poptorch.ShardedExecution(*args)

Will shard the execution of the passed Stages or if no stage is passed will consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on the block names
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

class poptorch.PipelinedExecution(*args)
__init__(*args)

Pipeline the execution of the passed Stages or if no stage is passed consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Create a 3 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A","B","C"))
>>> # Create a 2 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...    poptorch.Stage("A","B"),
...    "C"))
>>> # Automatically create a 3 stages pipeline based on the block names
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

class poptorch.SerialPhasedExecution(*phases)

All the phases run serially on a single group of IPUs.

For example:

  • phase 0 runs on ipu 0 & 1

  • phase 1 runs on ipu 0 & 1

  • phase 2 runs on ipu 0 & 1

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0,1)
>>> strategy.phase(1).ipus(0,1)
>>> strategy.phase(2).ipus(0,1)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of phases must be either:

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

Return type

poptorch.Phase

setTensorsLiveness(liveness)

See poptorch.Liveness for more information

Parameters

liveness (poptorch.Liveness) –

Return type

poptorch.SerialPhasedExecution

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
Parameters

use (bool) –
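
The two tables above follow a simple pattern, which the sketch below writes out as a formula. It reproduces the documented three-phase case. Applying the same formula to other phase counts is an extrapolation and should be treated as an assumption, and the helper name backward_phase is hypothetical.

```python
def backward_phase(fwd_phase: int, num_fwd_phases: int,
                   separate_backward: bool = False) -> int:
    """Phase in which the backward pass of fwd_phase runs (hypothetical)."""
    last_fwd = num_fwd_phases - 1
    # By default the last forward phase is shared with the backward pass.
    # With a separate backward phase, every backward phase moves up by two.
    offset = 2 * last_fwd + (2 if separate_backward else 0)
    return offset - fwd_phase


shared = [backward_phase(p, 3) for p in range(3)]
separate = [backward_phase(p, 3, separate_backward=True) for p in range(3)]
```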

class poptorch.ParallelPhasedExecution(*phases)

Phases are executed in parallel alternating between two groups of IPUs.

For example:

  • phase 0 runs on ipu 0 & 2

  • phase 1 runs on ipu 1 & 3

  • phase 2 runs on ipu 0 & 2

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
>>> with poptorch.Block(): # user_id = "2"
...     layer()
>>> with poptorch.Block(): # user_id = "3"
...     layer()
>>> with poptorch.Block(): # user_id = "4"
...     layer()
>>> with poptorch.Block(): # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0,2)
>>> strategy.phase(1).ipus(1,3)
>>> strategy.phase(2).ipus(0,2)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of phases must be either:

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

Return type

poptorch.Phase

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
Parameters

use (bool) –

class poptorch.Liveness(value)

When using phased execution:

  • AlwaysLive: The tensors always stay on the IPU between the phases.

  • OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.

  • OffChipAfterFwdNoOverlap: Same as OffChipAfterFwd, except there is no overlapping of load and store operations between phases. This makes it a more memory-efficient mode at the cost of delayed computation.

  • OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.

10.6. Optimizers

class poptorch.optim.VariableAttributes(variable_attributes, allowed_attributes)

Track which attributes are variable or constant.

Is accessible via any PopTorch optimizer via the variable_attrs attribute.

>>> opt = poptorch.optim.SGD(params, lr=0.01)
>>> opt.variable_attrs.isConstant("lr")
isConstant(attr)

Return True if the attribute is marked as constant

Parameters

attr (str) –

Return type

bool

markAsConstant(attr)

Explicitly mark an attribute as constant

Parameters

attr (str) –

Return type

None

markAsVariable(attr)

Explicitly mark an attribute as variable

Parameters

attr (str) –

Return type

None

class poptorch.optim.SGD(params, lr, momentum=None, dampening=None, weight_decay=None, nesterov=None, loss_scaling=None, velocity_scaling=None, use_combined_accum=None, accum_type=None, velocity_accum_type=None)

Stochastic gradient descent with optional momentum.

The optimizer is based on PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.

Nesterov momentum is not currently supported.

PopTorch provides two possible variants. Both variants are mathematically identical to PyTorch but differ in their stability and efficiency.

Note

If you set momentum to zero and do not use gradient accumulation, PopTorch will use a simple SGD variant and ignore the values of use_combined_accum, accum_type and velocity_accum_type.

Separate tensor variant (default)

If you set use_combined_accum to False (default), you will use a more stable but more memory intensive variant. In this case, PopTorch keeps two state tensors for each weight: one for gradient accumulation and one for velocity. It operates as follows when training:

  1. PopTorch runs one or more forward/backwards steps, equal to the number of gradient accumulations (see gradientAccumulation()). Each time PopTorch sums the gradients, storing them in accumulators.

  2. Once all the forward and backwards steps have completed, PopTorch uses the summed gradients to update the velocities. At this stage, PopTorch will correct the scale based on the setting of accumulationAndReplicationReductionType(). PopTorch stores the velocities as optimiser states.

  3. Finally, PopTorch uses the velocities to update the parameters, taking into account the loss scaling and learning rate.

With use_combined_accum set to False, you can independently change the data type used for storing the accumulated gradients and the velocity values using accum_type and velocity_accum_type, respectively.

Velocity scaling is ignored for this variant.

Note

If the number of gradient accumulations is high, you can use off chip memory for the velocity tensors with a minimal performance hit.

>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))

Combined tensor variant

If you set use_combined_accum to True, you will use a less stable but more memory efficient variant. In this case PopTorch uses a single tensor (the combined tensor) for gradient accumulation and velocity. It operates as follows when training:

  1. PopTorch runs one or more forward/backward steps, equal to the number of gradient accumulations (see gradientAccumulation()). For each step, PopTorch immediately calculates an increment or decrement for the combined tensors for each parameter. The amount of increment or decrement takes into account the setting of accumulationAndReplicationReductionType(), as well as removing loss scaling and introducing any velocity scaling.

  2. After running all the steps, the combined tensor will be equal to the new velocities. PopTorch uses these to update the parameters, taking into account the velocity scaling and learning rate.

PopTorch ignores the accum_type and velocity_accum_type values when using a combined tensor. In addition, there are no optimizer state tensors and so opts.TensorLocations.setOptimizerLocation has no effect.
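
The combined tensor variant can be sketched in plain Python for a single scalar weight. This is a simplified model under stated assumptions, not PopTorch's implementation: the function name sgd_combined_update is hypothetical, mean reduction is assumed, dampening and weight decay are left out, and the momentum decay is assumed to be applied to the combined tensor before the next round of accumulation.

```python
def sgd_combined_update(param, combined, grads, lr, momentum,
                        loss_scaling, velocity_scaling):
    """One weight update with a single combined tensor (scalar model)."""
    # Assumption: the combined tensor enters holding the previous
    # (velocity-scaled) velocity, which is decayed by the momentum first.
    combined = momentum * combined

    # Step 1: each step adds its increment straight into the combined tensor.
    # The increment removes the loss scaling, applies the mean reduction and
    # introduces the velocity scaling.
    for grad in grads:
        scaled_grad = grad * loss_scaling  # gradient as produced under loss scaling
        combined += scaled_grad * velocity_scaling / (loss_scaling * len(grads))

    # Step 2: the combined tensor now holds the new velocity. Update the
    # parameter, removing the velocity scaling and applying the learning rate.
    param = param - (lr / velocity_scaling) * combined
    return param, combined


new_param, new_combined = sgd_combined_update(
    param=1.0, combined=0.0, grads=[0.5, 1.5],
    lr=0.1, momentum=0.9, loss_scaling=4.0, velocity_scaling=8.0)
```

Only one value per weight is carried between updates in this model, which is where the memory saving comes from.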

Warning

For both variants, reducing the velocity scaling during training will result in temporary over-estimation of the velocity and could cause model instability. Increasing the scaling may temporarily slow model convergence but will not lead to instability.

__init__(params, lr, momentum=None, dampening=None, weight_decay=None, nesterov=None, loss_scaling=None, velocity_scaling=None, use_combined_accum=None, accum_type=None, velocity_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float) – learning rate.

  • momentum (Optional[float]) – momentum factor.

  • dampening (Optional[float]) – dampening term for momentum.

  • weight_decay (Optional[float]) – Weight decay (L2 penalty) factor.

  • nesterov (Optional[bool]) – Not supported (must be False).

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • velocity_scaling (Optional[float]) – Factor by which to scale the velocity values to assist numerical stability when using float16. (This applies to the combined variant only.)

  • use_combined_accum (Optional[bool]) – Whether to use a combined accumulator.

  • accum_type (Optional[torch.dtype]) – data type used for gradients.

  • velocity_accum_type (Optional[torch.dtype]) – data type used to store the velocity values for each parameter.

Return type

None

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups - a dict containing all parameter groups

Return type

Dict[str, Any]

class poptorch.optim.Adam(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Adam optimizer.

This optimizer matches PyTorch’s implementation (torch.optim.Adam) with optional loss scaling.

AMSGrad is currently not supported.

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate

  • betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in Adam.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – Weight decay factor.

  • amsgrad (Optional[bool]) – Not supported (must be False).

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accum_type (Optional[torch.dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the second order momentum values for each parameter.

Return type

None

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups - a dict containing all parameter groups

Return type

Dict[str, Any]

class poptorch.optim.AdamW(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Adam optimizer with true weight decay.

This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.

AMSGrad is currently not supported.
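
The difference between decoupled ("true") weight decay and an L2 penalty folded into the gradient can be sketched for a single scalar weight. This is an illustrative model based on the standard Adam and AdamW update rules, not PopTorch's code: the function name adam_step and the decoupled flag are hypothetical, and treating bias_correction=False as simply skipping the correction terms is an assumption.

```python
import math


def adam_step(param, grad, m, v, step, lr=0.1, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=True, bias_correction=True):
    """One Adam-style update of a scalar weight (illustrative model)."""
    beta1, beta2 = betas
    if decoupled:
        # AdamW: shrink the weight directly, independently of the gradient.
        param = param * (1.0 - lr * weight_decay)
    else:
        # Adam with an L2 penalty: fold the decay term into the gradient.
        grad = grad + weight_decay * param

    # Update the first and second moment estimates.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    if bias_correction:
        m_hat = m / (1.0 - beta1 ** step)
        v_hat = v / (1.0 - beta2 ** step)
    else:
        # Assumption: without bias correction the raw moments are used.
        m_hat, v_hat = m, v

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


adamw_param, _, _ = adam_step(1.0, 1.0, 0.0, 0.0, step=1,
                              weight_decay=0.1, decoupled=True)
adam_l2_param, _, _ = adam_step(1.0, 1.0, 0.0, 0.0, step=1,
                                weight_decay=0.1, decoupled=False)
```

On the first step the L2 form has its decay term normalised away by the second moment estimate, while the decoupled form still shrinks the weight by lr * weight_decay.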

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate

  • betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in AdamW.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – Weight decay factor.

  • amsgrad (Optional[bool]) – Not supported (must be False).

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • bias_correction (Optional[bool]) – True: compute Adam with bias correction.

  • accum_type (Optional[torch.dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the second order momentum values for each parameter.

Return type

None

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups - a dict containing all parameter groups

Return type

Dict[str, Any]

class poptorch.optim.RMSprop(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, use_tf_variant=None)

RMSprop optimizer with optional L2 penalty.

This optimizer matches PyTorch’s implementation (torch.optim.RMSprop) with optional loss scaling.

However, if the use_tf_variant flag is set to True, it will instead match the TensorFlow implementation, which differs from PyTorch’s implementation in three ways:

  1. The average squared gradients buffer is initialized to ones.

  2. The small epsilon constant is applied inside the square root.

  3. Learning rate is accumulated in the momentum buffer if momentum is used.
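
The three differences can be made concrete with a scalar sketch of one RMSprop step. This is an illustrative model of the uncentred update without weight decay or loss scaling, not PopTorch's code; the function name rmsprop_step, the tf_variant argument and the state dictionary layout are hypothetical.

```python
import math


def rmsprop_step(param, grad, state, lr=0.01, alpha=0.99, eps=1e-8,
                 momentum=0.0, tf_variant=False):
    """One uncentred RMSprop update of a scalar weight (illustrative model)."""
    if "square_avg" not in state:
        # Difference 1: the TensorFlow variant starts the buffer at one.
        state["square_avg"] = 1.0 if tf_variant else 0.0
        state["momentum_buffer"] = 0.0

    state["square_avg"] = alpha * state["square_avg"] + (1.0 - alpha) * grad * grad

    # Difference 2: epsilon goes inside the square root in the TensorFlow variant.
    if tf_variant:
        denom = math.sqrt(state["square_avg"] + eps)
    else:
        denom = math.sqrt(state["square_avg"]) + eps

    if momentum > 0.0:
        if tf_variant:
            # Difference 3: the learning rate is accumulated in the buffer.
            state["momentum_buffer"] = (momentum * state["momentum_buffer"]
                                        + lr * grad / denom)
            return param - state["momentum_buffer"]
        state["momentum_buffer"] = momentum * state["momentum_buffer"] + grad / denom
        return param - lr * state["momentum_buffer"]
    return param - lr * grad / denom


pytorch_style = rmsprop_step(1.0, 1.0, {}, tf_variant=False)
tensorflow_style = rmsprop_step(1.0, 1.0, {}, tf_variant=True)
```

With a unit gradient, the first step of the PyTorch-style update is roughly ten times larger here, because its squared-gradient average starts from zero rather than one.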

__init__(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, use_tf_variant=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate.

  • alpha (Optional[float]) – smoothing constant.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – L2 penalty coefficient.

  • momentum (Optional[float]) – momentum factor.

  • centered (Optional[bool]) – True: compute centred RMSprop in which the gradient is normalized by an estimate of its variance.

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accum_type (Optional[torch.dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the second order momentum values for each parameter.

  • use_tf_variant (Optional[bool]) – If True, use the TensorFlow variant of RMSprop (default: False).

Return type

None

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups - a dict containing all parameter groups

Return type

Dict[str, Any]

class poptorch.optim.LAMB(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Layer-wise Adaptive Moments (LAMB) optimizer (biased version).

Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).

The scaling function phi(z) is fixed as min(z, max_weight_norm).
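
As a sketch of how the scaling function enters the layer-wise trust ratio described in the LAMB paper, the following plain Python computes phi(||w||) / ||update|| for one layer. The function names phi and trust_ratio are hypothetical, and returning 1.0 when either norm is zero is an assumption borrowed from common implementations, not something this reference specifies.

```python
import math


def phi(z, max_weight_norm):
    """Scaling function: min(z, max_weight_norm), or the identity if disabled."""
    if max_weight_norm is None:
        return z
    return min(z, max_weight_norm)


def trust_ratio(weights, update, max_weight_norm=None):
    """Layer-wise ratio between the scaled weight norm and the update norm."""
    weight_norm = math.sqrt(sum(w * w for w in weights))
    update_norm = math.sqrt(sum(u * u for u in update))
    if weight_norm == 0.0 or update_norm == 0.0:
        return 1.0  # assumed fallback, see the note above
    return phi(weight_norm, max_weight_norm) / update_norm


clipped = trust_ratio([3.0, 4.0], [0.6, 0.8], max_weight_norm=2.0)
unclipped = trust_ratio([3.0, 4.0], [0.6, 0.8], max_weight_norm=None)
```

Here the weight norm is 5 and the update norm is 1, so capping the weight norm at 2 reduces the ratio from about 5 to about 2.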

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate

  • betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in LAMB.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – weight decay factor.

  • bias_correction (Optional[bool]) – True: compute LAMB with bias correction.

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • max_weight_norm (Optional[float]) – maximum value of the output of scaling function, phi(). Set to None to disable scaling function.

  • accum_type (Optional[torch.dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[torch.dtype]) – data type used to store the second order momentum values for each parameter.

Return type

None

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

  • state - a dict holding current optimization state. Its content differs between optimizer classes.

  • param_groups - a dict containing all parameter groups

Return type

Dict[str, Any]

step(closure=None)

Performs a single optimization step (parameter update).

Parameters

closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Return type

Optional[float]

Note

Unless otherwise specified, this function should not modify the .grad field of the parameters.

10.7. Data batching

class poptorch.DataLoader(options, dataset, batch_size=1, shuffle=False, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=<DataLoaderMode.Sync: 0>, async_options=None, rebatched_worker_size=None, **kwargs)

Thin wrapper around the traditional torch.utils.data.DataLoader to abstract away some of the batch size calculations.

If this data loader is used in a distributed execution environment, it will ensure that each process uses a different subset of the dataset, provided you first call options.randomSeed(N) with an integer N which is the same across all hosts.

__init__(options, dataset, batch_size=1, shuffle=False, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=<DataLoaderMode.Sync: 0>, async_options=None, rebatched_worker_size=None, **kwargs)
Parameters
  • options (poptorch.Options) – Options that will be used to compile and run the model.

  • dataset (torch.utils.data.Dataset) – The dataset to get the data from.

  • batch_size (int) – This is the batch size in the conventional sense of being the size that runs through an operation in the model at any given time.

  • shuffle (bool) – Whether or not the dataset should be shuffled.

  • num_workers (int) – Number of worker processes to use to read the data.

  • drop_last (bool) – If True and the number of elements in the dataset is not a multiple of the combined batch size then the incomplete batch at the end will be dropped.

  • persistent_workers (Optional[bool]) – Re-use workers between iterations if True.

  • auto_distributed_partitioning (bool) – If True, partitions the dataset for distributed execution automatically. Otherwise, it is assumed that partitioning has been handled manually.

  • mode (poptorch.DataLoaderMode) – If DataLoaderMode.Async, uses an AsynchronousDataAccessor to access the dataset. If DataLoaderMode.Sync, accesses the dataset synchronously.

  • async_options (Optional[Dict[str, Any]]) – Options to pass to AsynchronousDataAccessor.

  • rebatched_worker_size (Optional[int]) – When using AsyncRebatched: batch size of the tensors loaded by the workers. Defaults to the combined batch size. If specified, rebatched_worker_size must be less than or equal to the combined batch size.

  • kwargs – Other options to pass to PyTorch’s DataLoader constructor.
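
The batch size arithmetic that the data loader hides can be sketched as follows. This assumes that the combined batch size is the product of the model batch size, the device iterations, the replication factor and the gradient accumulation count. The function names are hypothetical and the sketch does not call PopTorch.

```python
def combined_batch_size(batch_size, device_iterations=1,
                        replication_factor=1, gradient_accumulation=1):
    """Number of dataset elements consumed per call to the model (assumed formula)."""
    return (batch_size * device_iterations
            * replication_factor * gradient_accumulation)


def batches_per_epoch(dataset_size, combined_size, drop_last=True):
    """Number of combined batches the loader would yield for one pass."""
    full, remainder = divmod(dataset_size, combined_size)
    if drop_last or remainder == 0:
        return full
    return full + 1


size = combined_batch_size(batch_size=4, device_iterations=10,
                           replication_factor=2, gradient_accumulation=4)
full_batches = batches_per_epoch(1000, size, drop_last=True)
dropped_elements = 1000 - full_batches * size
```

With these example settings each call consumes 320 elements, so a 1000-element dataset yields three combined batches and the last 40 elements are dropped when drop_last is True.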

terminate()

If mode==DataLoaderMode.Async, kills the worker process in the underlying AsynchronousDataAccessor manually, otherwise has no effect.

Return type

None

class poptorch.AsynchronousDataAccessor(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=<SharingStrategy.FileSystem: 1>, rebatched_size=None)

A data loader which launches the data loading process on a separate thread to allow for the data to be preprocessed asynchronously on CPU to minimise CPU/IPU transfer time.

This works by loading the data into a ring buffer of shared memory. When the IPU needs another batch it uses the data ready in the ring buffer. The memory is shared so will be used in-place and won’t be freed until the next batch is requested. Behind the scenes the worker thread will be filling the unready elements of the ring buffer.
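
The producer/consumer pattern behind the ring buffer can be sketched with the Python standard library. This is a stand-in under stated assumptions, not the accessor's implementation: it uses a thread and a bounded queue.Queue instead of a spawned process with shared memory, it blocks rather than polling with a sleep time, and the function name async_iterate is hypothetical.

```python
import queue
import threading


def async_iterate(dataset, buffer_size=3):
    """Yield items from dataset while a worker thread fills a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    end_of_data = object()  # sentinel marking the end of the dataset

    def worker():
        for item in dataset:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(end_of_data)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()  # blocks until the worker has an item ready
        if item is end_of_data:
            return
        yield item


loaded = list(async_iterate(range(10), buffer_size=3))
```

The bounded buffer lets the worker run at most buffer_size items ahead of the consumer, which is the role the ring buffer plays between the host and the IPU.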

Important

In order to avoid hanging issues related to OpenMP and fork() the AsynchronousDataAccessor uses the spawn start method which means your dataset must be serializable by pickle. For more information see https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
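
Because the dataset has to survive pickling, it can be worth checking this before handing it to the accessor. The helper below is a minimal sketch using only the standard library; the name is_picklable is hypothetical and not part of PopTorch.

```python
import pickle


def is_picklable(obj):
    """Return True if obj can be serialized by pickle, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


plain_dataset_ok = is_picklable([(0.1, 0), (0.2, 1)])
generator_ok = is_picklable(item for item in range(3))
```

Objects such as generators, open file handles and locks typically fail this check, so a dataset holding them would need to create them lazily in the worker instead.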

Note

When using a torch.utils.data.Dataset with rebatched_size, the accessor will default to drop_last=True. To change that behaviour, wrap the dataset in a poptorch.DataLoader(..., drop_last=False).

__init__(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=<SharingStrategy.FileSystem: 1>, rebatched_size=None)
Parameters
  • dataset (Union[torch.utils.data.Dataset, poptorch.DataLoader]) – The dataset to pull data from. This can be any Python iterable.

  • buffer_size (int) – The size of the ring buffer.

  • miss_sleep_time_in_ms (float) – How long, in milliseconds, the worker sleeps before checking again when the buffer is full.

  • load_indefinitely (bool) – If True, loop back to the start of the dataset when the end is reached.

  • early_preload (bool) – If True, start loading data in the ring buffer as soon as the worker is created. If False, wait for an iterator to be created before loading data.

  • sharing_strategy (poptorch.SharingStrategy) –

    Method to use to pass the dataset object when the child process is spawned.

    • SharedMemory is fast but might be quite limited in size.

    • FileSystem will serialise the dataset to file and reload it which will be slower.

  • rebatched_size (Optional[int]) – If not None: return N batched tensors from the dataset per iteration. (The passed dataset must have a batch_size of 1).

Note

If dataset is an iterable-type poptorch.DataLoader configured with drop_last=False then rebatched_size must be used.
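
Rebatching can be pictured as grouping single-element batches into batches of rebatched_size. The generator below is a plain Python sketch of that idea using lists instead of tensors; the name rebatch is hypothetical and this is not the accessor's implementation.

```python
def rebatch(single_elements, rebatched_size, drop_last=True):
    """Group elements into lists of rebatched_size items each."""
    batch = []
    for element in single_elements:
        batch.append(element)
        if len(batch) == rebatched_size:
            yield batch
            batch = []
    # Emit the incomplete final batch only when drop_last is False.
    if batch and not drop_last:
        yield batch


full_only = list(rebatch(range(7), 3))
with_remainder = list(rebatch(range(7), 3, drop_last=False))
```

With seven elements and a rebatched size of three, the last element is either dropped or returned as a shorter final batch, depending on drop_last.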

terminate()

An override function to kill the worker process manually.

Return type

None

class poptorch.DataLoaderMode(value)
  • Sync: Access data synchronously

  • Async: Uses an AsynchronousDataAccessor to access the dataset

  • AsyncRebatched: For iterable datasets, PyTorch will by default round down the number of elements to a multiple of the combined batch size in each worker. When the number of workers is high and/or the batch size is large, this might lead to a significant part of the dataset being discarded. In this mode, the combined batch size used by the PyTorch workers will be set to 1, and the batched tensor will instead be constructed in the AsynchronousDataAccessor. This mode is identical to Async for map-style datasets.
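
The saving offered by AsyncRebatched can be estimated with simple arithmetic. The sketch below assumes the iterable dataset is split as evenly as possible between the workers, which depends on how the dataset shards itself. Both function names are hypothetical.

```python
def discarded_with_worker_rounding(num_elements, num_workers, combined_batch_size):
    """Elements lost when every worker rounds its share down to full batches."""
    base, extra = divmod(num_elements, num_workers)
    shard_sizes = [base + (1 if index < extra else 0)
                   for index in range(num_workers)]
    return sum(size % combined_batch_size for size in shard_sizes)


def discarded_when_rebatched(num_elements, combined_batch_size):
    """Elements lost when batches are assembled after the workers."""
    return num_elements % combined_batch_size


lost_per_worker = discarded_with_worker_rounding(1000, 8, 64)
lost_rebatched = discarded_when_rebatched(1000, 64)
```

In this example eight workers each hold 125 elements and discard 61 of them, losing 488 in total, whereas assembling the batches afterwards loses only the final 40.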

10.8. Enumerations

class poptorch.SharingStrategy(value)

Strategy to use to pass objects when spawning new processes.

  • SharedMemory: Fast but limited availability.

  • FileSystem: Slower but larger than memory.

10.9. Autocasting

class poptorch.autocasting.autocast(enabled=True)

Creates an auto-casting region for the layers called inside this scope.

>>> with poptorch.autocast():
...     layer()

To turn off auto-casting for this region, set the keyword parameter explicitly.

>>> with poptorch.autocast(enabled=False):
...     layer()
class poptorch.autocasting.Policy(fp16=None, fp32=None, promote=None, demote=None)

Base class for autocast policies.