11. API reference

11.1. Options

class poptorch.Options

Set of all options controlling how a model is compiled and executed.

Pass an instance of this class to the model wrapping functions inferenceModel() and trainingModel() to change how the model is compiled and executed. An instance includes general options set within this class such as deviceIterations() as well as properties referring to categories of options such as Training.

>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> opts.Training.gradientAccumulation(4)
Return type

None

property Distributed: poptorch.options._DistributedOptions

Options specific to running on multiple IPU servers (IPU-PODs).

You should not use these when using PopRun/PopDist. Instead use popdist.poptorch.Options to set these values automatically.

property Jit: poptorch.options._JitOptions

Options specific to upstream PyTorch’s JIT compiler.

See also

_JitOptions

property Precision: poptorch.options._PrecisionOptions

Options specific to the processing of the JIT graph prior to lowering to PopART.

property TensorLocations: poptorch.options._TensorLocationOptions

Options related to tensor locations.

property Training: poptorch.options._TrainingOptions

Options specific to training.

See also

_TrainingOptions

anchorTensor(short_name, long_name, output_mode=None, output_return_period=1)

Anchor a tensor such that it may be retrieved after a model run.

Parameters
  • short_name (str) – User defined name to be used for retrieval

  • long_name (str) – The PopART name of the tensor to be anchored

  • output_mode (poptorch.OutputMode) – Specifies when data should be returned. Defaults to None, in which case the tensor will use the same output mode used for model outputs.

  • output_return_period (int) – Return period if output mode is EveryN. Defaults to 1.
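For example, to anchor the gradient of a bias tensor so it can be inspected after a training step. The PopART tensor name shown here is illustrative; after compilation the available names can be listed with getTensorNames().

>>> opts = poptorch.Options()
>>> opts.anchorTensor('grad_bias', 'Gradient___model.bias')

The anchored value can then be retrieved from the executor by its short name after a run (in recent PopTorch versions, via getAnchoredTensor('grad_bias')).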

appendToLocationExcludes(*excludes)

When printing the IR, all the frames containing one of the excluded strings will be ignored.

This is helpful to get the IR to trace back to user code rather than some function inside a framework.

Parameters

excludes (str) – Append these exclusions to the existing list of exclusions.

Return type

poptorch.Options

autoRoundNumIPUs(auto_round_num_ipus=True)

Whether or not to automatically round up the number of IPUs used: the number of IPUs requested must be a power of 2. By default, an error occurs if the model uses an unsupported number of IPUs, to prevent you from unintentionally overbooking IPUs.

Parameters

auto_round_num_ipus (bool) –

  • True: round up the number of IPUs to a power of 2.

  • False: error if the number of IPUs is not supported.

Return type

poptorch.Options

broadcastBuffers(broadcast_buffers=True)

Broadcast buffers to all replicas.

Only non-broadcast buffers are currently supported, which means each replica will hold a set of buffers not in sync with other replicas’ buffers. To enable non-broadcast buffers, set this option to False.

Parameters

broadcast_buffers (bool) –

clone()

Create an unfrozen deep copy of the current options.

Return type

poptorch.Options

connectionType(connection_type)

When to connect to the IPU (if at all).

Parameters

connection_type (poptorch.ConnectionType) –

  • Always: Attach to the IPU from the start (default).

  • OnDemand: Wait until compilation is complete and the executable is ready to run before attaching to the IPU.

  • Never: Never try to attach to an IPU: this is useful for offline compilation, but trying to run an executable will raise an exception.

Return type

poptorch.Options

For example:

>>> opts = poptorch.Options()
>>> opts.connectionType(poptorch.ConnectionType.OnDemand)
defaultOutputMode()

Returns

True if the output mode is currently set to the default value, False otherwise.

Return type

bool

deviceIterations(device_iterations)

Number of iterations the device should run over the data before returning to the user (default: 1).

This is equivalent to running the IPU in a loop over the specified number of iterations, with a new batch of data each time. However, increasing deviceIterations is more efficient because the loop runs on the IPU directly.

Parameters

device_iterations (int) –

Return type

poptorch.Options

disableModuleNamescope()

Disable adding a name scope for each operator present in the module. This option is enabled by default. The operator name scope is based on the names appearing in the named_modules function of torch.nn.Module.

For example:

>>> class Model(torch.nn.Module):
>>>     def __init__(self, num_groups, num_channels):
>>>         super().__init__()
>>>         self.gn = torch.nn.GroupNorm(num_groups, num_channels)
>>>     def forward(self, x):
>>>         return self.gn(x)

With the name scope enabled, the operator name will be gn/GroupNormalization; with it disabled, the name will be GroupNormalization.

Return type

poptorch.Options

enableExecutableCaching(path)

Load/save Poplar executables to the specified path, using it as a cache, to avoid recompiling identical graphs.

Parameters

path (str) – File path for the Poplar executable cache store; setting path to None disables executable caching.

Return type

poptorch.Options
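For example, to cache compiled executables in a local directory (the path shown is illustrative):

>>> opts = poptorch.Options()
>>> opts.enableExecutableCaching("./exe_cache")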

enableProfiling(profile_dir=None)

Enable profiling report generation.

To generate debug information associated with the profiling data, specify autoReport.directory, and either autoReport.all or autoReport.outputDebugInfo, in the POPLAR_ENGINE_OPTIONS environment variable. For example:

POPLAR_ENGINE_OPTIONS={"autoReport.directory":"/profile/output",\
"autoReport.all":"true"}

or:

POPLAR_ENGINE_OPTIONS={"autoReport.directory":"/profile/output",\
"autoReport.outputDebugInfo":"true"}

Debug information and the rest of the profiling data will be stored in the /profile/output directory. Values specified in the environment variable take precedence over profile_dir when both are given.

Parameters

profile_dir (str) – path to directory where report will be created. Defaults to current directory.

Return type

poptorch.Options

enableStableNorm(enabled)

Set whether a stable version of norm operators is used. This stable version is slower, but more accurate than its unstable counterpart.

Parameters

enabled (bool) –

  • True: Use stable norm calculation.

  • False: Do not use stable norm calculation.

Return type

poptorch.Options

enableSyntheticData(enabled)

Set whether host I/O is disabled and synthetic data is generated on the IPU instead. This can be used to benchmark models whilst simulating perfect I/O conditions.

Parameters

enabled (bool) –

  • True: Use data generated from a random normal distribution on the IPU. Host I/O is disabled.

  • False: Host I/O is enabled and real data is used.

Return type

poptorch.Options

inputReplicaGrouping(input_group_size, input_group_type)

Allows the input batches to be split between groups of replicas, in a similar way to what replicaGrouping() does for weight tensors.

Parameters
  • input_group_size (int) – Number of replicas to place in each input replica group. Must be a factor of replication_factor. Defaults to 1, which will divide the input evenly among all replicas.

  • input_group_type (poptorch.CommGroupType) – Arrangement type to use when placing replicas into input replica groups. Cannot be poptorch.CommGroupType.All. Defaults to poptorch.CommGroupType.Consecutive. For an explanation of the arrangement types, see CommGroupType and Section 4.4.3, Grouping tensor weights across replicas.

Return type

poptorch.Options
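As a sketch, assuming four replicas split into two input groups of two consecutive replicas each, so that each group receives its own share of every input batch:

>>> opts = poptorch.Options()
>>> opts.replicationFactor(4)
>>> opts.inputReplicaGrouping(2, poptorch.CommGroupType.Consecutive)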

loadFromFile(filepath)

Load options from a config file where each line in the file corresponds to a single option being set. To set an option, simply specify how you would set the option within a Python script, but omit the options. prefix.

For example, if you wanted to set options.deviceIterations(1), this would be set in the config file by adding a single line with contents deviceIterations(1).

Parameters

filepath (str) –

Return type

poptorch.Options
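For example, a config file (the filename is illustrative) containing the two lines:

deviceIterations(16)
replicationFactor(2)

could be loaded with:

>>> opts = poptorch.Options()
>>> opts.loadFromFile("poptorch.conf")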

logCycleCount(log_cycle_count)

Log the number of IPU cycles used in executing the main graph.

The cycle count will be printed when this option is enabled and the environment variable POPTORCH_LOG_LEVEL is set to DEBUG. This option requires IPU hardware to run.

Note: This will have a small detrimental impact on performance.

Parameters

log_cycle_count (bool) –

  • True: Enable logging the IPU cycle count.

  • False: Do not enable IPU cycle count logging.

Return type

poptorch.Options

logDir(log_dir)

Set the log directory

Parameters

log_dir (str) – Directory where PopTorch saves log files (default: current directory)

Return type

poptorch.Options

maxRepeatLogs(max_lines)

For often-repeated log lines, set the maximum number of repeated lines that will be logged.

Parameters

max_lines (Optional[int]) – If None, show all log messages. Otherwise suppress repeated messages after max_lines lines. The default is to suppress after 4 lines.

Return type

poptorch.Options

modelName(name)

Set the model name

Parameters

name (str) – Name of the model. Defaults to “inference” or “training” depending on the type of model created. Used when profiling to set the subdirectory of the report directory to which the profiling output is written.

Return type

poptorch.Options

outputMode(output_mode, output_return_period=None)

Specify which data to return from a model.

Parameters
  • output_mode (poptorch.OutputMode) –

    • All: Return a result for each batch.

    • Sum: Return the sum of all the batches.

    • Final: Return the last batch.

    • EveryN: Return every N batches: N is passed in as output_return_period.

    • Default: All for inference, Final for training.

  • output_return_period (Optional[int]) –

Return type

poptorch.Options

For example:

>>> opts = poptorch.Options()
>>> opts.outputMode(poptorch.OutputMode.All)
... # or
>>> opts.outputMode(poptorch.OutputMode.EveryN, 10)
randomSeed(random_seed)

Set the seed for the random number generator on the IPU.

Parameters

random_seed (int) – Random seed integer.

Return type

poptorch.Options

relaxOptimizerAttributesChecks(relax=True)

Controls whether unexpected attributes in setOptimizer() lead to warnings or debug messages.

By default PopTorch will print warnings the first time it encounters unexpected attributes in setOptimizer().

Parameters

relax (bool) –

  • True: Redirect warnings to the debug channel.

  • False: Print warnings about unexpected attributes (default behaviour).

Return type

poptorch.Options

replicationFactor(replication_factor)

Number of times to replicate the model (default: 1).

Replicating the model increases the data throughput of the model as PopTorch uses more IPUs. The number of IPUs used is scaled by replication_factor: for example, if your model uses 1 IPU, a replication_factor of 2 will use 2 IPUs; if your model uses 4 IPUs, a replication factor of 4 will use 16 IPUs in total.

Parameters

replication_factor (int) – Number of replicas of the model to create.

Return type

poptorch.Options
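For example, a minimal sketch of the scaling described above:

>>> opts = poptorch.Options()
>>> opts.replicationFactor(2)
>>> # A model pipelined across 4 IPUs now occupies 4 * 2 = 8 IPUs in total.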

setAvailableMemoryProportion(available_memory_proportion)

Sets the amount of temporary memory made available on a per-IPU basis.

Use this setting to control the amount of temporary memory available to operations such as:

  • convolution

  • matrix multiplication

  • embedding lookups

  • indexing operations

Parameter should be a dictionary of IPU IDs and float values between 0 and 1. (for example, {"IPU0": 0.5})

The floating point value has the same meaning and effect as documented in set_available_memory().

Parameters

available_memory_proportion (Dict[str, float]) –
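For example, to reduce the temporary memory available on the first two IPUs (the values are illustrative):

>>> opts = poptorch.Options()
>>> opts.setAvailableMemoryProportion({"IPU0": 0.3, "IPU1": 0.3})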

setExecutionStrategy(strategy)

Set the execution strategy to use to partition the graph.

Parameters

strategy (Union[poptorch.ParallelPhasedExecution, poptorch.SerialPhasedExecution]) – Must be an instance of one of the execution strategy classes.

Return type

poptorch.Options

showCompilationProgressBar(show=True)

Show / hide a progress bar while the model is being compiled. (The progress bar is shown by default)

Parameters

show (bool) –

Return type

poptorch.Options

sourceLocationExcludes(excludes)

When printing the IR, all the frames containing one of the excluded strings will be ignored.

This is helpful to get the IR to trace back to user code rather than some function inside a framework.

Parameters

excludes (List[str]) – Replace the current list of exclusions with this one.

Return type

poptorch.Options

syncPattern(sync_pattern)

Controls synchronisation in multi-IPU systems.

This option can be used to allow subsets of IPUs to overlap their work. For example, one set of IPUs could be communicating with the host while other IPUs are processing data.

This option is typically used together with replicated execution, in which case it takes effect on a per-replica basis. If replication is not used, it will apply to all IPUs.

Parameters

sync_pattern (poptorch.SyncPattern) –

  • Full: Require all IPUs to synchronise on every communication between IPUs or between IPUs and host. This is the default.

  • SinglePipeline: Allow IPUs to synchronise with the host independently, without having to synchronise with each other. This permits any one IPU to perform host IO while other IPUs are processing data.

  • ReplicaAndLadder: Allow an IPU group to communicate with the host without requiring synchronisation between groups. This permits multiple IPU groups to alternate between performing host IO and computation.

Return type

poptorch.Options

useIpuId(ipu_id)

Use the IPU device specified by the ID (as provided by gc-info).

A device ID may refer to a single or to a group of IPUs (a multi-IPU device). The number of IPUs associated with the ID must be equal to the number of IPUs used by your annotated model multiplied by the replication factor.

For example, if your model uses 1 IPU and the replication factor is 2, you will need to provide a device ID with 2 IPUs; if your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide a device ID which represents a multi-IPU device of 16 IPUs.

You can use the command-line tool gc-info: running gc-info -l shows each device ID and a list of IPUs associated with that ID.

Parameters

ipu_id (int) – IPU device ID of a single-IPU or multi-IPU device

Return type

poptorch.Options

useIpuModel(use_model)

Whether to use the IPU Model or physical hardware (default)

The IPU model simulates the behaviour of IPU hardware but does not offer all the functionality of an IPU. Please see the Poplar and PopLibs User Guide for further information.

This setting takes precedence over the POPTORCH_IPU_MODEL environment variable.

Parameters

use_model (bool) –

  • True: Use the IPU Model.

  • False: Use IPU hardware.

Return type

poptorch.Options

useOfflineIpuTarget(ipu_version=2)

Create an offline IPU target that can only be used for offline compilation.

Note

The offline IPU target cannot be used if the IPU Model is enabled.

Parameters

ipu_version (int) – IPU version to target (1 for Mk1, 2 for Mk2, 21 for Mk2 with FP8 support). Default: 2.

Return type

poptorch.Options
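A minimal offline-compilation sketch, assuming an inference model whose executable is saved with compileAndExport() (see Section 11.4); model and example_input are placeholders:

>>> opts = poptorch.Options()
>>> opts.useOfflineIpuTarget()
>>> poptorch_model = poptorch.inferenceModel(model, opts)
>>> poptorch_model.compileAndExport("model.poptorch", example_input)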

class poptorch.options._DistributedOptions

Options related to distributed execution.

You should not use these when using PopRun/PopDist. Instead use popdist.poptorch.Options to set these values automatically.

Can be accessed via poptorch.Options.Distributed:

>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)
Return type

None

configureProcessId(process_id, num_processes)

Manually set the current process ID and the total number of processes.

Parameters
  • process_id (int) – The ID of this process.

  • num_processes (int) – The total number of processes the execution is distributed over.

Return type

poptorch.options._DistributedOptions

disable()

Ignore the current options / environment variables and disable distributed execution.

Return type

poptorch.options._DistributedOptions

property numProcesses: int

Total number of processes the execution is distributed over.

property processId: int

ID of the current process.

setEnvVarNames(var_num_processes, var_process_id)

Utility to read and set processId and numProcesses from environment variables.

Useful if you use a third party library to manage the processes used for the distributed execution such as mpirun.

For example: mpirun -np 4 myscript.py

By default the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.

Parameters
  • var_num_processes (str) –

  • var_process_id (str) –

Return type

poptorch.options._DistributedOptions
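For example, to read the process count and rank from the environment variables set by a different launcher (the Slurm variable names shown are illustrative):

>>> opts = poptorch.Options()
>>> opts.Distributed.setEnvVarNames("SLURM_NTASKS", "SLURM_PROCID")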

class poptorch.options._PrecisionOptions(popart_options)

Options related to processing the PyTorch JIT graph prior to lowering to PopART

Can be accessed via poptorch.Options.Precision:

>>> opts = poptorch.Options()
>>> opts.Precision.enableFloatingPointExceptions(True)
Parameters

popart_options (poptorch.options._PopartOptions) –

Return type

None

enableFloatingPointExceptions(enabled)

Set whether floating point exceptions are enabled on the IPU.

When enabled, an exception will be generated when the IPU encounters any one of the following:

  • Operation resulting in subtraction of infinities

  • Divisions by zero or by infinity

  • Multiplications between zero and infinity

  • Real operations producing complex results

  • Comparison where any one operand is Not-a-Number

Parameters

enabled (bool) –

  • True: raise RuntimeError on floating point exception

  • False: do not raise RuntimeError (default)

Return type

poptorch.options._PrecisionOptions

enableStochasticRounding(enabled)

Set whether stochastic rounding is enabled on the IPU.

Stochastic rounding randomly rounds values up or down when converting to half (float16), such that the expected (mean) value of the rounded results equals the unrounded value. It can improve training performance by simulating higher-precision behaviour and increasing the speed or likelihood of model convergence. However, the model is non-deterministic and represents a departure from (deterministic) standard IEEE FP16 behaviour.

In the general case, we recommend enabling stochastic rounding for training where convergence is desirable, but not for inference where non-determinism may be undesirable.

Parameters

enabled (bool) –

  • True: Enable stochastic rounding on the IPU.

  • False: Disable stochastic rounding.

Return type

poptorch.options._PrecisionOptions

halfFloatCasting(half_float_casting)

DO NOT USE: about to be removed.

Parameters

half_float_casting (poptorch.HalfFloatCastingBehavior) –

Return type

poptorch.options._PrecisionOptions

runningStatisticsAlwaysFloat(value)

DO NOT USE: about to be removed.

Parameters

value (bool) –

Return type

poptorch.options._PrecisionOptions

setPartialsType(dtype)

Set the data type of partial results for matrix multiplication and convolution operators.

The matrix multiplication and convolution operators store intermediate results known as partials as part of the calculation. You can use this option to change the data type of the partials. Using torch.half reduces on-chip memory use at the cost of precision.

Parameters
dtype (torch.dtype) – The type to store partials, which must be either torch.float or torch.half.

Return type

poptorch.options._PrecisionOptions
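For example, to trade some precision for reduced on-chip memory use:

>>> opts = poptorch.Options()
>>> opts.Precision.setPartialsType(torch.half)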

class poptorch.options._JitOptions(**default_values)

Options related to PyTorch’s JIT compiler.

Can be accessed via poptorch.Options.Jit:

>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)
traceModel(trace_model)

DO NOT USE: about to be removed.

Parameters

trace_model (bool) –

Return type

poptorch.options._JitOptions

class poptorch.options._TensorLocationOptions(**default_values)

Options controlling where to store tensors.

Can be accessed via poptorch.Options.TensorLocations:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
numIOTiles(num_tiles)

Assigns a number of tiles on the IPU to be IO tiles, rather than compute tiles.

Allocating IO (input/output) tiles reduces the number of IPU tiles available for computation, but allows you to reduce the latency of copying tensors from host to IPUs using set_overlap_for_input(), and from IPUs to host using set_overlap_for_output(), or to use off-chip memory with reduced overhead by setting useIOTilesToLoad(). Because reducing the number of computation tiles may reduce performance, you should not use any IO tiles until you have successfully run your model and used profiling to identify “streamCopy” entries which take up a significant proportion of execution time.

Parameters

num_tiles (int) –

Return type

poptorch.TensorLocationSettings
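For example (the tile count is illustrative), reserving IO tiles so that host transfers can overlap computation via set_overlap_for_input() and set_overlap_for_output():

>>> opts = poptorch.Options()
>>> opts.TensorLocations.numIOTiles(64)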

setAccumulatorLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for accumulators.

Return type

poptorch.options._TensorLocationOptions

setActivationLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for activations.

Return type

poptorch.options._TensorLocationOptions

setOptimizerLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for optimiser states.

Return type

poptorch.options._TensorLocationOptions

setWeightLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Update tensor location settings for weights.

Return type

poptorch.options._TensorLocationOptions

class poptorch.TensorLocationSettings(**default_values)

Define where a tensor is stored

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
minElementsForOffChip(min_elements)

A minimum number of elements below which offloading won’t be considered.

Parameters

min_elements (int) –

Return type

poptorch.TensorLocationSettings

minElementsForReplicatedTensorSharding(min_elements)

Only enable replicated tensor sharding (RTS) for tensors with more than min_elements elements.

Parameters

min_elements (int) –

Return type

poptorch.TensorLocationSettings

useIOTilesToLoad(use=True)

Load tensor through IO tiles

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

Return type

poptorch.TensorLocationSettings

useIOTilesToStore(use=True)

Use IO tiles to store tensors.

(relevant for replicated tensor sharded tensors)

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

Return type

poptorch.TensorLocationSettings

useOnChipStorage(use=True)

Permanent tensor storage

Parameters

use (bool) – True: use on chip memory. False: use off chip memory. None: keep it undefined.

Return type

poptorch.TensorLocationSettings

useReplicatedTensorSharding(use=True)

Enable replicated tensor sharding

(relevant for weights and optimiser states)

Parameters

use (bool) –

Return type

poptorch.TensorLocationSettings
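Because each setter returns the settings object, the settings can be chained. A sketch for optimiser state stored off chip with replicated tensor sharding (the threshold is illustrative):

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings()
...     .useOnChipStorage(False)
...     .useReplicatedTensorSharding()
...     .minElementsForReplicatedTensorSharding(1024))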

class poptorch.options._TrainingOptions(popart_options)

Options specific to model training.

Note

You must not set these options for inference models.

Can be accessed via poptorch.Options.Training:

>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)
Parameters

popart_options (poptorch.options._PopartOptions) –

Return type

None

accumulationAndReplicationReductionType(reduction_type)

Set the type of reduction applied to reductions in the graph.

When using a value greater than one for gradientAccumulation() or for replicationFactor(), PopTorch applies a reduction to the gradient outputs from each replica, and to the accumulated gradients. This reduction is independent of the model loss reduction (summing a mean-reduced loss and a sum-reduced loss in a PyTorch model is valid).

This setting governs both the accumulation of the loss gradients in replicated graphs and of all of the gradients when using gradient accumulation.

Parameters

reduction_type (poptorch.ReductionType) –

  • Mean (default): Reduce gradients by calculating the mean of them.

  • Sum: Reduce gradients by calculating the sum of them.

Return type

poptorch.options._TrainingOptions

gradientAccumulation(gradient_accumulation)

Number of micro-batches to accumulate for the gradient calculation.

Accumulate the gradient gradient_accumulation times before updating the model using the gradient. Each micro-batch (a batch of size equal to the batch_size argument passed to DataLoader) corresponds to one gradient accumulation, so gradient_accumulation scales the global batch size (the number of samples between optimiser updates). Other frameworks may refer to this setting as “pipeline depth”.

Note

Increasing gradient_accumulation does not alter the (micro-)batch size used for batch normalisation.

A large value for gradient_accumulation can improve training throughput by amortising optimiser update costs, most notably when using PipelinedExecution or when training is distributed over a number of replicas. However, the consequential increase in the number of samples between optimiser updates can have an adverse impact on training.

Efficiency gains are most notable for models which express pipelined model parallelism over multiple IPUs (via PipelinedExecution, or by default when annotating the model with BeginBlock or Block), because the pipeline has “ramp up” and “ramp down” steps around each optimiser update. Increasing the gradient accumulation factor reduces the proportion of time spent in these “ramp up” and “ramp down” phases, increasing overall throughput.

When training involves multiple replicas, including the cases of sharded and phased execution, each optimiser step incurs a communication cost associated with the reduction of the gradients. By accumulating gradients, you can reduce the total number of updates required and thus reduce the total amount of communication.

Note

Increasing the global batch size can have adverse effects on the sample efficiency of training, so it is recommended to use a low or unit gradient accumulation count initially, and then try increasing it to achieve higher throughput. You may also need to scale other hyper-parameters, such as the optimiser learning rate, accordingly.

Parameters

gradient_accumulation (int) –

Return type

poptorch.options._TrainingOptions
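A sketch of the resulting global batch size (all sizes are illustrative):

>>> opts = poptorch.Options()
>>> opts.replicationFactor(2)
>>> opts.Training.gradientAccumulation(8)
>>> # With a micro-batch size of 4, the number of samples between
>>> # optimiser updates is 4 * 8 * 2 = 64.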

setAutomaticLossScaling(enabled)

Set whether automatic loss scaling is enabled on the IPU.

When using float16/half values for activations, gradients, and weights, the loss value needs to be scaled by a constant factor to avoid underflow/overflow. This adjustment is known as loss scaling. This setting automatically sets a global loss scaling factor during training.

Note: Automatic loss scaling is a preview feature. It is well tested and enabled in some of our example applications, but may not behave as expected in all models. Recommendation: if your model with automatic loss scaling enabled does not converge or triggers a compilation error, then you will need to set the loss scale manually.

Parameters

enabled (bool) –

  • True: Enable automatic loss scaling on the IPU.

  • False: Disable automatic loss scaling.

Return type

poptorch.options._TrainingOptions

setConvolutionDithering(enabled)

Enable convolution dithering.

If true, then convolutions with different parameters will be laid out from different tiles in an effort to improve tile balance in models.

Use MultiConv to apply this option to a specific set of convolutions.

Parameters

enabled (bool) – Enables or disables convolution dithering for all convolutions.

Return type

poptorch.options._TrainingOptions

setMeanAccumulationAndReplicationReductionStrategy(mean_reduction_strategy)

Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType.Mean.

The default reduction strategy depends on the optimizer used. The Running strategy is the default when the accum_type of the optimizer is set to half-precision (float16) format; otherwise, the Post strategy is used, as it is typically more performant, although less numerically robust.

Parameters

mean_reduction_strategy (poptorch.MeanReductionStrategy) –

  • Running: Keeps the reduction buffer as the current mean. This is preferred for numerical stability as the buffer value is never larger than the magnitude of the largest micro batch gradient.

  • Post: Divides by the accumulationFactor and replicatedGraphCount after all of the gradients have been reduced. In some cases this can be faster than using Running; however, it is prone to overflow.

  • PostAndLoss (deprecated): Divides by the replicatedGraphCount before the backwards pass, performs the gradient reduction across micro batches, and then divides by the accumulationFactor. This is to support legacy behaviour and is deprecated.

Return type

poptorch.options._TrainingOptions

11.2. Helpers

poptorch.ipuHardwareIsAvailable(num_ipus=1)

Indicates whether any IPU hardware with num_ipus is present in the system.

Note: This function doesn’t check if the IPU is free or already being used.

Parameters

num_ipus (int) –

Returns

True if physical IPUs are available, False otherwise.

Return type

bool
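For example, a common pattern is to fall back to the IPU Model when no hardware is present (a sketch):

>>> opts = poptorch.Options()
>>> if not poptorch.ipuHardwareIsAvailable():
...     opts.useIpuModel(True)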

poptorch.ipuHardwareVersion()

Indicates what IPU hardware version is available in the system.

Raises an exception if no hardware is available.

Returns

The IPU hardware version or -1 if unknown.

Return type

int

poptorch.setLogLevel(level)

Changes the volume of messages printed in the console (stdout)

Parameters

level (Union[str, int]) –

  • TRACE: Print all messages.

  • DEBUG: Print debug messages and above.

  • INFO: Print info messages and above.

  • WARN: Print warnings and errors.

  • ERR: Print errors only.

  • OFF: Print nothing.
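For example:

>>> poptorch.setLogLevel("INFO")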

class poptorch.profiling.Channel(name)

Profiling channel.

Note

If the libpvti profiling library is not available at runtime this class becomes a no-op.

Example:

>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")
instrument(obj, *methods)

Instrument the methods of an object.

Parameters
  • obj – Object to instrument

  • methods – One or more methods to wrap in profiling tracepoints.

tracepoint(name)

Create a context tracepoint

>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()
Parameters

name – Name associated with this tracepoint.

11.3. PopTorch Ops

poptorch.ctc_beam_search_decoder(probs, lengths, blank=0, beam_width=100, top_paths=1)
Add a connectionist temporal classification (CTC) beam search decoder to the model.

Calculates the most likely top paths and their probabilities given the input logarithmic probabilities and the data lengths.

Parameters
  • probs (Tensor) – Logarithmic probabilities tensor with the shape of [input_length, batch_size, num_classes].

  • lengths (Tensor) – Tensor representing lengths of the inputs of shape [batch_size].

  • blank (int) – Integer identifier of the blank class (default: 0).

  • beam_width (int) – Number of beams used during decoding (default: 100).

  • top_paths (int) – Number of most likely paths to return (default: 1).

Returns

Three tensors: the paths’ probabilities, of shape [batch_size, top_paths]; the paths’ lengths, of shape [batch_size, top_paths]; and the decoded paths, of shape [batch_size, top_paths, input_length].

Return type

List[Tensor]
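A sketch of a call with random log-probabilities (all shapes and sizes are illustrative, and the integer dtype used for lengths is an assumption):

>>> input_length, batch_size, num_classes = 50, 4, 10
>>> probs = torch.randn(input_length, batch_size, num_classes).log_softmax(2)
>>> lengths = torch.full((batch_size,), input_length, dtype=torch.int32)
>>> path_probs, path_lengths, paths = poptorch.ctc_beam_search_decoder(
...     probs, lengths, beam_width=16, top_paths=2)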

poptorch.ipu_print_tensor(tensor, title='')

Adds an op to print the contents of the IPU tensor.

When this is executed the tensor will be copied back to host and printed.

When this operation is called in the backward pass it will print the gradient of the tensor.

The operation is an identity operation and will return the exact same tensor. The returned tensor must be used in place of the original tensor in the rest of the program, to make sure that the print operation isn’t optimised away.

For example, if the original code looks like this:

def forward(self, c, d, b):
  a = c + d
  return a + b

If the result of ipu_print_tensor() is not used, the function will be optimised out by the graph optimiser and the tensor will not be printed.

So if you want to print the value of a, you should do:

def forward(self, c, d, b):
  a = c + d
  x = poptorch.ipu_print_tensor(a)
  return x + b

Optionally, you can add a second string argument to be used as a title, as shown in the following example. The value of a will be printed after the title “summation”. The value of the gradient of a will be printed after the title “summation_gradient” if the operation is called in the backward pass.

def forward(self, c, d, b):
    a = c + d
    x = poptorch.ipu_print_tensor(a, "summation")
    return x + b

Warning

To prevent the print operation being optimised out by the graph optimiser, you must use the output of the print.

Parameters
  • tensor (Tensor) – The tensor to print.

  • title (str) – An optional title to print before the tensor value.

Returns

The input tensor unchanged.

Return type

Tensor

poptorch.for_loop(count, body, inputs)

An on-device for loop. This loop will execute on the device for count iterations.

The body should be a Python function containing the PyTorch code you wish to execute in a loop. It should take as input the same number of tensors as it outputs. Each iteration will have the previous output passed in as input.

Parameters
  • count (int) – Number of iterations of the loop.

  • body (Callable) – The Python function to run as the loop body.

  • inputs (List[Tensor]) – The initial inputs to the loop body.

Return type

List[Tensor]
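A minimal sketch, assuming body takes and returns the same number of tensors:

>>> def body(x):
...     return x * 2
>>> x = torch.tensor([1.0])
>>> out = poptorch.for_loop(10, body, [x])  # out[0] == x * 2 ** 10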

poptorch.recomputationCheckpoint(*tensors)

Operation for checkpointing values in a computational pipeline stage.

When recomputation is enabled, these values will not be recomputed and they will be stored in memory between forward and backwards passes instead.

Parameters

tensors (List[Tensor]) – One or more tensors which should be checkpointed.

Returns

Tensors (same number and shape as the input tensors).

Return type

List[Tensor]
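A sketch inside a model’s forward method, checkpointing two activations between pipeline stages (self.stage1 and self.stage2 are placeholder submodules):

>>> def forward(self, x, y):
...     x, y = self.stage1(x, y)
...     x, y = poptorch.recomputationCheckpoint(x, y)
...     return self.stage2(x, y)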

poptorch.identity_loss(x, reduction)

Marks a tensor as being part of the loss calculation and, as such, will backpropagate through it in the PopTorch autograd.

This function should be called on the (final) loss of a model so that it is used as the start of backpropagation. This is equivalent to calling x.backward() on a tensor x when running on the CPU.

This function is necessary to combine multiple losses into a custom loss. It ensures that the tensor is part of the loss calculation and, as such, should be part of the backpropagation in PopTorch autograd.

Multiple calls to identity_loss can be made inside the same model provided they are all dependent: all marked losses must be traceable into a single final tensor, itself marked by a call to identity_loss, otherwise an error is raised.

Parameters
  • x (Tensor) – The calculated loss.

  • reduction (str) –

    Reduce the loss output as per PyTorch loss semantics. Supported values are:

    • "sum": Sum the losses.

    • "mean": Take the mean of the losses.

    • "none": Don’t reduce the losses.

Returns

The loss tensor with the specified reduction applied.

Return type

Tensor
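A sketch combining a primary loss with an auxiliary term into a single custom loss (the auxiliary term is illustrative):

>>> def forward(self, x, target):
...     out = self.layers(x)
...     loss1 = torch.nn.functional.nll_loss(
...         torch.log_softmax(out, dim=1), target)
...     loss2 = 0.1 * out.pow(2).mean()
...     return out, poptorch.identity_loss(loss1 + loss2, reduction="none")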

class poptorch.MultiConv

Combines all convolution layers evaluated inside this scope into a single multi-convolution.

Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.

For example:

>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)

Combines the two data-independent convolutions into a single multi-convolution.

Refer to the PopLibs documentation for further information on multi-convolutions.

availableMemoryProportions(value)

The available memory proportion per convolution, each in the range [0, 1).

For more information, please refer to the technical note on optimising temporary memory usage.

Parameters

value (Union[float, List[float]]) – Can be a float value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many float values as the number of convolutions.

Returns

self, to support method chaining.

Return type

poptorch.MultiConv

cycleBackOff(value)

Cycle back off proportion.

Parameters

value (float) – Number between 0 and 1.

Returns

self, to support method chaining.

Return type

poptorch.MultiConv

enableConvDithering(value)

Enable per-convolution dithering.

Parameters

value (Union[bool, List[bool]]) – Can be a bool value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many bool values as the number of convolutions.

Returns

self, to support method chaining.

Return type

poptorch.MultiConv

partialsTypes(value)

The partials type used for each convolution.

Parameters

value (Union[dtype, List[dtype]]) – Can be a single instance of torch.dtype in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many torch.dtype values as the number of convolutions.

Returns

self, to support method chaining.

Return type

poptorch.MultiConv

perConvReservedTiles(value)

Tiles to reserve for each convolution.

Parameters

value (int) – Number of tiles.

Returns

self, to support method chaining.

Return type

poptorch.MultiConv

planType(value)

Select the multi-convolution execution strategy.

Parameters

value (poptorch.MultiConvPlanType) – An instance of MultiConvPlanType.

Returns

self, to support method chaining.

Return type

poptorch.MultiConv
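Because each setter returns self, the options can be chained before entering the scope. A sketch (the values are illustrative):

>>> mc = poptorch.MultiConv()
>>> mc.availableMemoryProportions([0.2, 0.6]).partialsTypes(torch.float)
>>> with mc:
...     y = self.convA(x)
...     v = self.convB(u)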

class poptorch.CPU(layer_to_call, ID)

Allow the execution of a CPU op in the middle of an inference IPU graph.

Important

CPU ops are only supported in inference graphs.

Example:

>>> class Model(torch.nn.Module):
>>>     def __init__(self):
>>>         super().__init__()
>>>         self.cpu = poptorch.CPU(self.myCpuOp, "MyCPUOp")
>>>
>>>     def myCpuOp(self, x):
>>>         return x * 2.0
>>>
>>>     def forward(self, x):
>>>         # The arguments passed to "cpu" are forwarded to "myCpuOp"
>>>         out = self.cpu(x)
>>>         out = self.cpu(out)
>>>         out = self.cpu(out)
>>>         return out
__init__(layer_to_call, ID)

Execute a given function on the CPU.

Parameters
  • layer_to_call – Python function to execute on the CPU. The arguments passed when the CPU wrapper is called will be forwarded to layer_to_call.

  • ID – Name of the CPU op.
execute()

Implementation detail.

registerPersistentData()

Implementation detail.

class poptorch.NameScope(name)

Create a name scope for a code block. All operators originating from this block will have their names prefixed by the given string.

>>> with poptorch.NameScope("CustomString"):
...     y = self.bmm(a, b)
...     z = torch.relu(y)
Parameters

name (str) –

class poptorch.MultiConvPlanType(value)

Selects the execution strategy for a poptorch.MultiConv

  • Parallel: Execute multiple convolutions in parallel (Default).

  • Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.

class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs, attributes=None)

Applies a custom operation, implemented within PopART, to the inputs.

Parameters
  • inputs (tuple) – A tuple of input tensors, for example, (x, y).

  • name (str) – Unique name of the PopART custom op.

  • domain (str) – Domain for the op.

  • domain_version (int) – Version of the domain to use.

  • example_outputs (iterable) – A tuple of tensors with the same type and shape as the outputs. The value does not matter as all values will be set to zero for tracing purposes.

  • attributes (dict) – A dictionary of attributes for the custom op. All attribute keys must be strings. All attribute values must be floats, ints, strings, or a list/tuple containing only floats, only ints or only strings (not a mix of types within the list).

Returns

The outputs of the forward op of the custom op.

Return type

List[Tensor]

poptorch.nop(tensor)

A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.

Parameters

tensor (Tensor) – The tensor to pass to the no-op.

Returns

The same tensor which was input.

Return type

Tensor

poptorch.dynamic_slice(tensor, dim, start, size, step)

Torch native dynamic slices can’t be properly intercepted by backends, so this op is provided to enable dynamic slicing in PopTorch applications.

Parameters
  • tensor (Tensor) – The tensor to slice.

  • dim (int) – The dimension to slice along.

  • start (Tensor) – The start index.

  • size (int) – The slice size. Must be a constant int.

  • step (int) – The slice step. Must be a constant int.

Returns

The sliced tensor.

Return type

Tensor
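A sketch taking a two-row slice along the first dimension, where the start index is a tensor computed at runtime:

>>> t = torch.arange(16.0).reshape(4, 4)
>>> start = torch.tensor(1)
>>> rows = poptorch.dynamic_slice(t, 0, start, 2, 1)  # rows 1 and 2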

poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)

Calculates a matrix product using a serialized matrix multiplication.

The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.

Parameters
  • lhs (torch.Tensor) – Left-hand side input matrix.

  • rhs (torch.Tensor) – Right-hand side input matrix.

  • mode (poptorch.MatMulSerializationMode) –

    Which dimension of the matmul to serialize on: for matrix A (m by n) multiplied by matrix B (n by p).

    • InputChannels: Split across the input channels (dimension m).

    • ReducingDim: Split across the reducing dimension (n).

    • OutputChannels: Split across the output channels (dimension p).

    • Disabled: Same as an ordinary matrix multiplication.

  • factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.

  • keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If keep_precision is True, these additions will occur using float32 rather than half precision partials, matching those used for the individual matrix multiplications.

Return type

torch.Tensor
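A sketch serializing over the output channels (the dimensions and factor are illustrative; factor must divide the serialized dimension):

>>> lhs = torch.randn(64, 128)
>>> rhs = torch.randn(128, 256)
>>> out = poptorch.serializedMatMul(
...     lhs, rhs, poptorch.MatMulSerializationMode.OutputChannels, factor=4)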

poptorch.set_available_memory(tensor, available_memory_proportion)

Sets the amount of temporary memory made available to an operation.

The operators that can be tuned with this setting include:

  • convolution

  • matrix multiplication

  • embedding lookups

  • indexing operations

When applied to the output of a supported operation, it controls the trade-off between execution cycles and the temporary memory used during the execution of the operation.

The value should be between 0 and 1 (inclusive) and represents a proportion of available memory on the IPU. The default value is 0.6 (therefore, by default, PopTorch will not use more than 60% of IPU memory for temporary data).

PopTorch passes this setting to the PopLibs operator planner, which will try to constrain the use of temporary memory to below this value. Generally, an operation that has more temporary memory available will run in fewer cycles.

For a specific operation, the necessary amount of temporary memory may be more than the amount specified by this option. In this case, a warning message will be generated.

For more information, please refer to the technical note on optimising temporary memory usage.

>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
Parameters
  • tensor (Tensor) – Output tensor from a supported operation (otherwise the statement will be an identity).

  • available_memory_proportion (float) – Proportion between 0.0 and 1.0 of tile memory to be made available for temporary memory (default 0.6).

Returns

The input tensor, as if calling an identity function.

Return type

Tensor

poptorch.set_overlap_for_input(input_tensors, mode)

Sets host overlap setting for input_tensors.

You can increase performance in some cases by overlapping the copying from the host to IPUs with computation. However, this requires a number of IPU tiles to be set aside as IO tiles using numIOTiles() which may affect computation performance.

You should use this function at the start of your model’s forward method for each applicable input and use the returned tensors in future ops.

Parameters
  • input_tensors – The input tensors for which to enable overlapping host IO. This can be either a single tensor, or any combination of tuple, list, or dict of tensors.

  • mode (poptorch.OverlapMode) – Control to what extent the host IO overlaps computation.

Returns

The input tensors, specified for overlap.

See also

OverlapMode.

poptorch.set_overlap_for_output(output_tensors, mode)

Sets host overlap setting for output_tensors.

You can increase performance in some cases by overlapping the copying from the IPUs to host with computation. However, this requires a number of IPU tiles to be set aside as IO tiles using numIOTiles() which may affect computation performance.

You should use this function at the end of your model’s forward method, for each applicable output, just before returning the tensors.

Parameters
  • output_tensors – The output tensors to enable overlapping host IO for. This can be either a single tensor, or any combination of tuple, list, or dict of tensors.

  • mode (poptorch.OverlapMode) – Control to what extent the host IO overlaps computation.

Returns

The output tensors, specified for overlap.

See also

OverlapMode.
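A sketch of a forward method using both functions. This also requires IO tiles to be reserved with numIOTiles(); the OverlapMode value named here is an assumption about the available modes:

>>> def forward(self, x):
...     x = poptorch.set_overlap_for_input(
...         x, poptorch.OverlapMode.OverlapAccumulationLoop)
...     out = self.layers(x)
...     return poptorch.set_overlap_for_output(
...         out, poptorch.OverlapMode.OverlapAccumulationLoop)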

11.4. Model wrapping functions

poptorch.trainingModel(model, options=None, optimizer=None)

Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.

Note

PopTorch makes a shallow copy of the model and wraps the original model to facilitate weight synchronisation. Changes to the parameters in the returned training model affect the original model and vice versa. However, primitive variable types are not synced: for example, calling model.train() on the original model, which changes the training bool of the model instance, will not alter the model returned by this function. You may need to call model.train() on your model before you call this function for correct behaviour.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (Optional[poptorch.Options]) – The options used to compile and run the model.

  • optimizer (Optional[torch.optim.Optimizer]) – The optimizer to use during training.

Returns

The PoplarExecutor wrapper to use in place of model.

Return type

poptorch.PoplarExecutor
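A minimal training sketch; MyModel is a placeholder whose forward method returns a tuple of (output, loss):

>>> model = MyModel()
>>> model.train()
>>> opts = poptorch.Options()
>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
>>> poptorch_model = poptorch.trainingModel(model, opts, optimizer)
>>> output, loss = poptorch_model(data, labels)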

poptorch.inferenceModel(model, options=None)

Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.

Note

PopTorch makes a shallow copy of the model. Changes to the parameters in the returned inference model affect the original model and vice versa. However, primitive variable types are not synced: for example calling model.eval() on the original model will not alter the model returned by this function. You may need to call model.eval() on your model before you call this function for correct behaviour.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (Optional[poptorch.Options]) – The options used to compile and run the model.

Returns

The PoplarExecutor wrapper to use in place of model.

Return type

poptorch.PoplarExecutor
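A minimal inference sketch (MyModel and input are placeholders):

>>> model = MyModel()
>>> model.eval()
>>> poptorch_model = poptorch.inferenceModel(model)
>>> output = poptorch_model(input)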

class poptorch.PoplarExecutor(model, options, training, poptorch_version, optimizer=None, user_model=None)

This class should not be created directly but is a wrapper around the model that was passed into inferenceModel or trainingModel. It only has a few methods which can be used to interface with the IPU.

__call__(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Note

The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.

attachToDevice()

Attach to target device. Before calling this function, the device must be detached and the model compiled.

Return type

None

compile(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Trace and compile the wrapped model if no executable has been created yet.

Note: The executable created by this method can only be executed; it cannot be exported to a file. To precompile and save to a file, use compileAndExport().

Return type

None

compileAndExport(filename, *args, export_model=True, **kwargs)

Precompile an executable and save it to file.

args and kwargs are the same arguments as the wrapped PyTorch model.__call__

Parameters
  • filename (str) – Where to save the compiled executable.

  • export_model (bool) – If True the Torch model will be saved in the file alongside the executable. load() can be used to restore both the original Torch model, the PopTorch model and the executable. If False then only the executable will be exported and it will be the user’s responsibility to call inferenceModel() or trainingModel() to re-create the PopTorch model before calling loadExecutable() to restore the executable.

  • args (List[Tensor]) –

  • kwargs (Dict[str, Tensor]) –

copyWeightsToDevice()

Copies the weights from model.parameters() to the IPU device. Implicitly called on first call.

Return type

None

copyWeightsToHost()

Updates the parameters used in model with the weights stored on device. (The weights in model.parameters())

Return type

None

copyWeightsToHostIfNeeded()

Return True if the weights on the host were dirty and have been updated. Return False if the weights were already up to date.

Return type

bool

cycleCount()

Returns the number of cycles that the IPU ran.

You must run the model on IPU hardware before calling this method.

Returns

Number of cycles on the IPU for the last model run. If you are using replicas, the returned value represents the number of cycles for the first replica only.

Return type

int

destroy()

Destroy the model: release the IPUs and the executable.

Return type

None

detachFromDevice()

Detach from target device. Before calling this function, the device must be attached (and the model compiled).

Return type

None

getComputeLatency()

Return compute latency for the last execution of the model.

The compute latency is the interval of time (in fractional seconds) between the last input tensor being transferred to the IPU and the last output tensor becoming available.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getHostIpuLatency()

Return Host-IPU latency for the last execution of the model.

The Host-IPU latency is the interval of time (in fractional seconds) between the first input tensor being requested and the last input tensor being transferred to the IPU.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getIpuHostLatency()

Return IPU-Host latency for the last execution of the model.

The IPU-Host latency is the interval of time (in fractional seconds) between the first output tensor becoming available and the last output tensor being written back to the host.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getLatency()

Return round-trip latency for the last execution of the model.

The round-trip latency is the interval of time (in fractional seconds) between the first input tensor being requested and the last output tensor being written back to the host.

The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.

getPerfCounters()

Return performance counters for the last execution of the model.

Return the values (in fractional seconds) of the performance counters corresponding to the latest run of the model. The reference point of the returned value is undefined, however the difference between values is valid.

The returned object is a dictionary where the keys correspond to each of the following events:

  • 'input': the IPU requesting an input tensor

  • 'input_complete': an input tensor having been transferred

  • 'output': the IPU requesting to transmit an output tensor

  • 'output_complete': an output tensor having been transferred

The values of the dictionary are nested lists. The first level of nesting corresponds to an input or output index. The second level list contains the actual values as fractional seconds.

Examples:

  • dict['input'][1][3]: performance counter for the second input tensor being requested on the fourth iteration of the model

  • dict['output_complete'][0][0]: performance counter for the first output tensor having been transferred on the first iteration of the model

getTensorNames()

Returns a list of all tensor names within the computational graph. The model must be compiled in advance.

Return type

List[str]

isAttachedToDevice()

Returns True if the target device has been attached, False otherwise.

Return type

bool

isCompiled()

Returns True if the model has been compiled (and not destroyed), False otherwise.

Return type

bool

loadExecutable(filename)

Load an executable previously generated using compileAndExport()

Parameters

filename (str) –

Return type

None

load_state_dict(state_dict, strict=True)

Will call load_state_dict() on the wrapped model and automatically synchronise the weights with the IPU.

Returns

  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Return type

NamedTuple with missing_keys and unexpected_keys fields

Parameters
  • state_dict (dict) – A dict containing parameters and persistent buffers.

  • strict (bool) – Whether to strictly enforce that the keys in state_dict match the keys returned by the wrapped model’s state_dict() function. Default: True.
property model: torch.nn.modules.module.Module

Access the wrapped Torch model.

property options: poptorch.Options

Access to the options.

See also

Options

property rng_state: List[int]

Return the random number generator’s seed & state of the compiled model.

save(filename, export_model=True, save_rng_state=True)

Save the compiled model to file.

Parameters
  • filename (str) – Where to save the compiled executable.

  • export_model (bool) – If True the Torch model will be saved in the file alongside the executable. load() can be used to restore both the original Torch model, the PopTorch model and the executable. If False then only the executable will be exported and it will be the user’s responsibility to call inferenceModel() or trainingModel() to re-create the PopTorch model before calling loadExecutable() to restore the executable.

  • save_rng_state (bool) – If True the random number generator’s state and seed will be saved in the file alongside the executable.

setOptimizer(optimizer)

Sets the optimiser for a training model. Will overwrite the previous one. Supported optimisers: optim.SGD, optim.Adam, optim.AdamW, optim.RMSProp, optim.LAMB.

Parameters

optimizer (Optimizer) –

poptorch.isRunningOnIpu()

This function returns True when executing on IPU and False when executing the model outside IPU scope. This allows for separate codepaths to be marked in the model simply by using:

>>> if poptorch.isRunningOnIpu():
...     # IPU path
... else:
...     # CPU path

Note this will only apply to code during execution. During model creation it will always return False.

Returns

True if running on IPU, otherwise False.

Return type

bool

poptorch.load(filename, edit_opts_fn=None)

Load a PopTorch model from a file previously created using compileAndExport()

Parameters
  • edit_opts_fn (Optional[Callable[[poptorch.Options], None]]) – Function to edit the options before the model is restored. For example to attach to a specific IPU device.

  • filename (str) –

Return type

poptorch.PoplarExecutor

>>> model = poptorch.inferenceModel(model)
>>> model.compileAndExport("my_model.poptorch")
...
>>> model = poptorch.load("my_model.poptorch")
>>> model(my_input)

11.5. Parallel execution

class poptorch.Block(user_id=None, ipu_id=None)

A context manager to define blocks of the model.

You can use Block as a context manager. This means you use Python’s “with” statement as follows:

>>> with poptorch.Block("Encoder"):
...     self.layer = MyLayer(x)

All layers called inside this scope will run on the specified IPU, if one is specified. In addition, you can combine multiple blocks into a stage.

__init__(user_id=None, ipu_id=None)
Parameters
  • user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same ID are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (Optional[int]) – The ID of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

static useAutoId()

Call this method at the beginning of your forward() method to enable automatic block ID generation.

Blocks with a None user_id will be assigned an automatic ID which will be the index of this block in the list of ID-less Blocks.

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"): # user_id = "special_block"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)

Define a block by modifying an existing PyTorch module.

You can use this with an existing PyTorch module instance, as follows:

>>> poptorch.BeginBlock(myModel.a_layer)
>>> poptorch.BeginBlock(MyNewLayer())

The module and all sub-modules will be part of this block until a sub-module is modified to be in another block. In addition, if an IPU is specified, the module and its submodules will run on the specified IPU.

You can combine multiple blocks into a stage.

Parameters
  • layer_to_call (Module) – PyTorch module to assign to the block.

  • user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same ID are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (Optional[int]) – The ID of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device IDs used by gc-info.

Return type

Module
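
For example, a minimal sketch assuming model has sub-modules layer1 to layer4 (hypothetical names): everything from layer1 onwards runs on IPU 0 until layer3 starts a new block on IPU 1.

>>> poptorch.BeginBlock(model.layer1, ipu_id=0)
>>> poptorch.BeginBlock(model.layer3, ipu_id=1)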

poptorch.BlockFunction(user_id=None, ipu_id=None)

A decorator to define blocks of the model.

You can use BlockFunction as a decorator for an existing function, as follows:

>>> @BlockFunction("Decoder", ipu_id=1)
... def decoder(self, encoder_output):
...     self.decoder_b1(encoder_output)

All layers inside the function and any functions called by the function will run on the specified IPU, if one is specified. In addition, you can combine multiple blocks into a stage.

Parameters
  • user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same ID are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (Optional[int]) – The ID of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device IDs used by gc-info.

poptorch.removeBlocks(module)

Recursively remove BeginBlock annotations from a Module if it contains any.

Parameters

module (torch.nn.Module) – Module to recursively unwrap.

class poptorch.Stage(*block_ids)

The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.

Parameters

block_ids (str) –

Return type

None

__init__(*block_ids)
Parameters

block_ids (str) –

Return type

None

property blocks: List[str]

List of blocks this stage is made of.

ipu(ipu)

Set the IPU on which this stage will run.

Parameters

ipu (int) –

Return type

poptorch.Stage

class poptorch.AutoStage(value)

Defines how the stages are automatically assigned to blocks when the user didn’t explicitly provide stages to the IExecutionStrategy’s constructor.

  • SameAsIpu: The stage ID will be set to the selected IPU number.

  • AutoIncrement: The stage ID for new blocks is automatically incremented.

Examples:

>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...  layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...  layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...  layer()

By default, the following execution strategy is used:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)

which would translate to stage_id = ipu_id:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=0

Now if instead you use:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)

The last block would be in its own stage rather than sharing one with Block “0”:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=2

class poptorch.Phase(*arg)

Represents an execution phase.

Parameters

arg (Union[str, poptorch.Stage]) –

__init__(*arg)

Create a phase.

Parameters

arg (Union[str, poptorch.Stage]) – must be one or more Stages, or one or more block user_ids.

If one or more strings are passed they will be interpreted as Block IDs representing a single Stage.

Within a Phase, the stages will be executed in parallel.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = Phase("A","B") # One Stage made of 2 blocks
ipus(*ipus)

Assign one IPU for each stage contained in this Phase.

The number of IPUs passed must match the number of stages in the Phase.
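
A minimal sketch, assuming blocks "A" and "B" have been defined as above: two single-block stages, one IPU per stage.

>>> p = poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("B"))
>>> p.ipus(0, 1)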

class poptorch.ShardedExecution(*args)

Shard the execution of the passed Stages. If no stage is passed, each unique Block ipu_id encountered during tracing is considered a different stage.

>>> with poptorch.Block(ipu_id=0):
...     layer()
>>> with poptorch.Block(ipu_id=1):
...     layer()
>>> with poptorch.Block(ipu_id=2):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on each block's ipu_id
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either an AutoStage strategy or an explicit list of stages or block IDs.

stage(block_id)

Return the Stage the given block belongs to.

Parameters

block_id (str) – A block ID.

class poptorch.PipelinedExecution(*args)
__init__(*args)

Pipeline the execution of the graph partitions. These partitions can be a Stage, a Block or a BeginBlock. If none of these are passed, an AutoStage strategy can be passed instead to decide how the stage IDs are created. By default, poptorch.AutoStage.SameAsIpu is used: the stage ID will be set to the selected IPU number. This implies that each unique Block or BeginBlock in the graph must have its ipu_id explicitly set when using AutoStage.

Example 1: Block user_ids are known; IPUs are inferred.

>>> with poptorch.Block("A"):
...     layer1()
>>> with poptorch.Block("B"):
...     layer2()
>>> with poptorch.Block("C"):
...     layer3()
>>> with poptorch.Block("D"):
...     layer4()
>>> opts = poptorch.Options()
>>> # Create a 4-stage pipeline based on `user_id`; 4 IPUs will be used.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A","B",
...                                                       "C","D"))

Stages can also be set explicitly:

>>> # Create a 2-stage pipeline from the block `user_id`s; 2 IPUs will be used.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...    poptorch.Stage("A","B"),
...    poptorch.Stage("C","D")))

Example 2: Block ipu_ids are known; use the default AutoStage.

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(ipu_id=0):
...     layer1()
>>> with poptorch.Block(ipu_id=1):
...     layer2()
>>> with poptorch.Block(ipu_id=2):
...     layer3()
>>> with poptorch.Block(ipu_id=3):
...     layer4()
>>> # Automatically create a 4-stage pipeline matching the block `ipu_id`.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
>>> # Note: poptorch.PipelinedExecution()
>>> # is the default execution strategy when blocks are defined.

Example 3: Non-consecutive stages placed on the same IPU.

>>> with poptorch.Block(ipu_id=0):
...     layer1()
>>> with poptorch.Block(ipu_id=1):
...     layer2()
>>> with poptorch.Block(ipu_id=0):
...     layer3()
>>> # Automatically create a 3-stage pipeline forcing the stage
>>> # IDs to be incremental.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...                           poptorch.AutoStage.AutoIncrement))
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either an AutoStage strategy or an explicit list of stages or block IDs.

stage(block_id)

Return the Stage the given block belongs to.

Parameters

block_id (str) – A block ID.

class poptorch.SerialPhasedExecution(*phases)

All the phases run serially on a single group of IPUs.

For example:

  • phase 0 runs on ipu 0 & 1

  • phase 1 runs on ipu 0 & 1

  • phase 2 runs on ipu 0 & 1

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0,1)
>>> strategy.phase(1).ipus(0,1)
>>> strategy.phase(2).ipus(0,1)
>>> opts.setExecutionStrategy(strategy)
Parameters

phases (Union[poptorch.Phase, List[poptorch.Stage], List[str]]) –

__init__(*phases)

Execute the model’s blocks in phases.

Parameters

phases ([Phase], [[Stage]], [[str]]) –

Definition of phases must be either:

  • a list of Phase

  • a list of list of Stage

  • a list of list of Block IDs (Each list of blocks will be considered as a single Stage)

phase(phase)

Return the requested Phase.

Parameters

phase (int) – Index of the phase

Return type

poptorch.Phase

setTensorsLiveness(liveness)

See Liveness for more information.

Parameters

liveness (poptorch.Liveness) –

Return type

poptorch.SerialPhasedExecution
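
A minimal sketch, continuing the strategy defined in the example above:

>>> # Send tensors off chip at the end of the forward pass.
>>> strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)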

stage(block_id)

Return the Stage the given block belongs to.

Parameters

block_id (str) – A block ID.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
Parameters

use (bool) –

class poptorch.ParallelPhasedExecution(*phases)

Phases are executed in parallel alternating between two groups of IPUs.

For example:

  • phase 0 runs on ipu 0 & 2

  • phase 1 runs on ipu 1 & 3

  • phase 2 runs on ipu 0 & 2

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
>>> with poptorch.Block(): # user_id = "2"
...     layer()
>>> with poptorch.Block(): # user_id = "3"
...     layer()
>>> with poptorch.Block(): # user_id = "4"
...     layer()
>>> with poptorch.Block(): # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0,2)
>>> strategy.phase(1).ipus(1,3)
>>> strategy.phase(2).ipus(0,2)
>>> opts.setExecutionStrategy(strategy)
Parameters

phases (Union[poptorch.Phase, List[poptorch.Stage], List[str]]) –

__init__(*phases)

Execute the model’s blocks in phases.

Parameters

phases ([Phase], [[Stage]], [[str]]) –

Definition of phases must be either:

  • a list of Phase

  • a list of list of Stage

  • a list of list of Block IDs (Each list of blocks will be considered as a single Stage)

phase(phase)

Return the requested Phase.

Parameters

phase (int) – Index of the phase

Return type

poptorch.Phase

stage(block_id)

Return the Stage the given block belongs to.

Parameters

block_id (str) – A block ID.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
Parameters

use (bool) –

class poptorch.Liveness(value)

When using phased execution:

  • AlwaysLive: The tensors always stay on the IPU between the phases.

  • OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.

  • OffChipAfterFwdNoOverlap: Same as OffChipAfterFwd, except there is no overlapping of load and store operations between phases. This makes it a more memory-efficient mode at the cost of delayed computation.

  • OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.

class poptorch.CommGroupType(value)

Grouping to be used when distributing an input or per-replica variable among replicas. See Grouping tensor weights across replicas.

  • All: This causes replicaGrouping() to have no effect, as the same variable value is distributed to all replicas. The group count is ignored. This is not valid as an input group type.

  • Consecutive: Each replica group is made up of consecutive replicas, so for group size k the groups would be set up thus:

    {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}

  • Orthogonal: Each replica group is made up by slicing the replicas orthogonally to the replica ordering, so for group size k, with group count m = N/k:

    {0, m, 2m, ...}, {1, m+1, 2m+1, ...} ... {m-1, 2m-1, ... N-1}

  • NoGrouping: Each replica gets its own value of the variable. The group count is ignored.
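
To make the grouping concrete, here is an illustration in plain Python (not part of the PopTorch API) of group membership for N = 8 replicas and group size k = 2:

>>> N, k = 8, 2
>>> # Consecutive groups: [0, 1], [2, 3], [4, 5], [6, 7]
>>> consecutive = [list(range(i, i + k)) for i in range(0, N, k)]
>>> m = N // k  # group count
>>> # Orthogonal groups: [0, 4], [1, 5], [2, 6], [3, 7]
>>> orthogonal = [list(range(i, N, m)) for i in range(m)]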

class poptorch.VariableRetrievalMode(value)

Method to be used when retrieving the value of a grouped variable from grouped replicas. See Grouping tensor weights across replicas.

  • OnePerGroup: Return one value for each replica group (takes the value from the first replica in the group).

  • AllReplicas: Return a value from each replica.

replicaGrouping()

Call this function on a weight tensor (after applying a PopTorch wrapper with inferenceModel() or trainingModel()) to configure replica groups which each receive a different value of the weight tensor. For details and a code example see Section 4.4.3, Grouping tensor weights across replicas.

Parameters
  • comm_group_type (poptorch.CommGroupType) – The replica group arrangement to use for this tensor.

  • shards (int) – The number of replicas in each replica group.

  • variable_retrieval_mode (poptorch.VariableRetrievalMode) – The method to use when retrieving the value of this tensor from the replicas.
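
A minimal sketch, assuming 8 replicas and a wrapped model containing a linear layer fc (a hypothetical name): each pair of consecutive replicas receives its own copy of the weight.

>>> poptorch_model = poptorch.trainingModel(model, options=opts, optimizer=optimizer)
>>> poptorch_model.model.fc.weight.replicaGrouping(
...     poptorch.CommGroupType.Consecutive,
...     2,
...     poptorch.VariableRetrievalMode.OnePerGroup)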

11.6. Optimizers

class poptorch.optim.VariableAttributes(variable_attributes, allowed_attributes)

Track which attributes are variable or constant.

It is accessible via the variable_attrs attribute of any PopTorch optimizer.

>>> opt = poptorch.optim.SGD(params, lr=0.01)
>>> opt.variable_attrs.isConstant("lr")
Parameters
  • variable_attributes (List[str]) –

  • allowed_attributes (List[str]) –

Return type

None

isConstant(attr)

Return True if the attribute is marked as constant.

Parameters

attr (str) –

Return type

bool

markAsConstant(attr)

Explicitly mark an attribute as constant.

Parameters

attr (str) –

Return type

None

markAsVariable(attr)

Explicitly mark an attribute as variable.

Parameters

attr (str) –

Return type

None
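
A minimal sketch: mark loss_scaling as constant so that it can no longer be updated by a later setOptimizer() call.

>>> opt = poptorch.optim.SGD(params, lr=0.01, loss_scaling=128.0)
>>> opt.variable_attrs.markAsConstant("loss_scaling")
>>> opt.variable_attrs.isConstant("loss_scaling")  # now True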

class poptorch.optim.SGD(params, lr, momentum=None, dampening=None, weight_decay=None, nesterov=None, maximize=None, foreach=None, differentiable=None, loss_scaling=None, velocity_scaling=None, use_combined_accum=None, accum_type=None, velocity_accum_type=None, max_grad_norm=None)

Stochastic gradient descent with optional momentum.

The optimizer is based on PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.

PopTorch provides two possible variants. Both variants are mathematically identical to PyTorch but differ in their stability and efficiency.

Note

If you set momentum to zero and do not use gradient accumulation, PopTorch will use a simple SGD variant and ignore the values of use_combined_accum, accum_type and velocity_accum_type.

Separate tensor variant (default)

If you set use_combined_accum to False (the default), you will use a more stable but more memory-intensive variant. In this case, PopTorch keeps two state tensors for each weight: one for gradient accumulation and one for velocity. It operates as follows when training:

  1. PopTorch runs one or more forward/backward steps, equal to the number of gradient accumulation steps (see gradientAccumulation()). Each time, PopTorch sums the gradients, storing them in accumulators.

  2. Once all the forward and backward passes have completed, PopTorch uses the summed gradients to update the velocities. At this stage, PopTorch will correct the scale based on the setting of accumulationAndReplicationReductionType(). PopTorch stores the velocities as optimizer states.

  3. Finally, PopTorch uses the velocities to update the parameters, taking into account the loss scaling and learning rate.

With use_combined_accum set to False, you can independently change the data type used for storing the accumulated gradients and the velocity values using accum_type and velocity_accum_type, respectively.

Velocity scaling is ignored for this variant.

Note

If the number of gradient accumulations is high, you can use off-chip memory for the velocity tensors with a minimal performance hit.

>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))

Combined tensor variant

If you set use_combined_accum to True, you will use a less stable but more memory-efficient variant. In this case PopTorch uses a single tensor (the combined tensor) for gradient accumulation and velocity. It operates as follows when training:

  1. PopTorch runs one or more forward/backward steps, equal to the number of gradient accumulation steps (see gradientAccumulation()). For each step, PopTorch immediately calculates an increment or decrement for the combined tensors for each parameter. The amount of increment or decrement takes into account the setting of accumulationAndReplicationReductionType(), as well as removing loss scaling and introducing any velocity scaling.

  2. After running all the steps, the combined tensor will be equal to the new velocities. PopTorch uses these to update the parameters, taking into account the velocity scaling and learning rate.

PopTorch ignores the accum_type and velocity_accum_type values when using a combined tensor. In addition, there are no optimizer state tensors, so opts.TensorLocations.setOptimizerLocation has no effect.

Warning

For both variants, reducing the velocity scaling during training will result in temporary over-estimation of the velocity and could cause model instability. Increasing the scaling may temporarily slow model convergence but not lead to instability.

Return type

None

__init__(params, lr, momentum=None, dampening=None, weight_decay=None, nesterov=None, maximize=None, foreach=None, differentiable=None, loss_scaling=None, velocity_scaling=None, use_combined_accum=None, accum_type=None, velocity_accum_type=None, max_grad_norm=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float) – learning rate.

  • momentum (Optional[float]) – momentum factor.

  • dampening (Optional[float]) – dampening term for momentum.

  • weight_decay (Optional[float]) – Weight decay (L2 penalty) factor.

  • nesterov (Optional[bool]) – Whether to enable nesterov momentum. Default is False.

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • velocity_scaling (Optional[float]) – Factor by which to scale the velocity values to assist numerical stability when using float16. (This applies to the combined variant only.)

  • use_combined_accum (Optional[bool]) – Whether to use a combined accumulator.

  • accum_type (Optional[dtype]) – data type used for gradients.

  • velocity_accum_type (Optional[dtype]) – data type used to store the velocity values for each parameter.

  • max_grad_norm (Optional[float]) – Maximum norm of gradients. Default is inf.

  • maximize (Optional[bool]) –

  • foreach (Optional[bool]) –

  • differentiable (Optional[bool]) –

Return type

None
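
A minimal sketch, assuming a model trained in half precision: the separate tensor variant with float16 gradient accumulators and float32 velocities.

>>> import torch
>>> optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01,
...                                momentum=0.9,
...                                loss_scaling=128.0,
...                                use_combined_accum=False,
...                                accum_type=torch.float16,
...                                velocity_accum_type=torch.float32)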

class poptorch.optim.Adam(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, foreach=None, maximize=None, capturable=None, differentiable=None, fused=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)

Adam optimizer.

This optimizer matches PyTorch’s implementation (torch.optim.Adam) with optional loss scaling.

AMSGrad is currently not supported.

Return type

None

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, foreach=None, maximize=None, capturable=None, differentiable=None, fused=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate

  • betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in Adam.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – Weight decay factor.

  • amsgrad (Optional[bool]) – Not supported (must be False).

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accum_type (Optional[dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.

  • max_grad_norm (Optional[float]) – Maximum norm of gradients. Default is inf.

  • foreach (Optional[bool]) –

  • maximize (Optional[bool]) –

  • capturable (Optional[bool]) –

  • differentiable (Optional[bool]) –

  • fused (Optional[bool]) –

Return type

None

class poptorch.optim.AdamW(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, maximize=None, foreach=None, capturable=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)

Adam optimizer with true weight decay.

This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.

AMSGrad is currently not supported.

Return type

None

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, maximize=None, foreach=None, capturable=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate

  • betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in AdamW.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – Weight decay factor.

  • amsgrad (Optional[bool]) – Not supported (must be False).

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • bias_correction (Optional[bool]) – True: compute Adam with bias correction.

  • accum_type (Optional[dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.

  • max_grad_norm (Optional[float]) – Maximum norm of gradients. Default is inf.

  • maximize (Optional[bool]) –

  • foreach (Optional[bool]) –

  • capturable (Optional[bool]) –

Return type

None

class poptorch.optim.RMSprop(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, foreach=None, maximize=None, differentiable=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, use_tf_variant=None)

RMSprop optimizer with optional L2 penalty.

This optimizer matches PyTorch’s implementation (torch.optim.RMSprop) with optional loss scaling.

However, if the use_tf_variant flag is set to True, it will instead match the TensorFlow implementation, which differs from PyTorch’s implementation in three ways: 1) the average squared gradients buffer is initialized to ones; 2) the small epsilon constant is applied inside the square root; 3) the learning rate is accumulated in the momentum buffer if momentum is used.

Return type

None

__init__(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, foreach=None, maximize=None, differentiable=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, use_tf_variant=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate.

  • alpha (Optional[float]) – smoothing constant.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – L2 penalty coefficient.

  • momentum (Optional[float]) – momentum factor.

  • centered (Optional[bool]) – True: compute centred RMSprop in which the gradient is normalized by an estimate of its variance.

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accum_type (Optional[dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.

  • use_tf_variant (Optional[bool]) – If True, use the TensorFlow variant of RMSprop. Default is False.

  • foreach (Optional[bool]) –

  • maximize (Optional[bool]) –

  • differentiable (Optional[bool]) –

Return type

None
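
A minimal sketch of selecting the TensorFlow variant (model is assumed to exist):

>>> optimizer = poptorch.optim.RMSprop(model.parameters(), lr=1e-3,
...                                    use_tf_variant=True)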

class poptorch.optim.LAMB(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)

Layer-wise Adaptive Moments (LAMB) optimizer (biased version).

Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).

The scaling function phi(z) is fixed as min(z, max_weight_norm).

Return type

None

__init__(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (Optional[float]) – learning rate

  • betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in LAMB.

  • eps (Optional[float]) – term added to the denominator to ensure numerical stability.

  • weight_decay (Optional[float]) – weight decay factor.

  • bias_correction (Optional[bool]) – True: compute LAMB with bias correction.

  • loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • max_weight_norm (Optional[float]) – maximum value of the output of scaling function, phi(). Set to None to disable scaling function.

  • accum_type (Optional[dtype]) – data type used for gradients.

  • first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.

  • second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.

Return type

None

step(closure=None)

Performs a single optimization step (parameter update).

Parameters

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Return type

Optional[float]

Note

Unless otherwise specified, this function should not modify the .grad field of the parameters.

11.7. Data batching

class poptorch.DataLoader(options, dataset, batch_size=1, shuffle=None, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=DataLoaderMode.Sync, async_options=None, rebatched_worker_size=None, batch_sampler=None, **kwargs)

Thin wrapper around the traditional torch.utils.data.DataLoader to abstract away some of the batch size calculations.

If this data loader is used in a distributed execution environment, it will ensure that each process uses a different subset of the dataset, provided you first call options.randomSeed(N) with an integer N which is the same across all hosts.

__init__(options, dataset, batch_size=1, shuffle=None, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=DataLoaderMode.Sync, async_options=None, rebatched_worker_size=None, batch_sampler=None, **kwargs)
Parameters
  • options (poptorch.Options) – Options that will be used to compile and run the model.

  • dataset (torch.utils.data.Dataset) – The dataset to get the data from.

  • batch_size (int) – This is the batch size in the conventional sense of being the size that runs through an operation in the model at any given time.

  • shuffle (bool) – Whether or not the dataset should be shuffled.

  • num_workers (int) – Number of worker processes to use to read the data.

  • drop_last (bool) – If True and the number of elements in the dataset is not a multiple of the combined batch size then the incomplete batch at the end will be dropped.

  • persistent_workers (Optional[bool]) – Re-use workers between iterations if True.

  • auto_distributed_partitioning (bool) – If True, partitions the dataset for distributed execution automatically. Otherwise, it is assumed that partitioning has been handled manually.

  • mode (poptorch.DataLoaderMode) – If DataLoaderMode.Async, uses an AsynchronousDataAccessor to access the dataset. If DataLoaderMode.Sync, accesses the dataset synchronously.

  • async_options (Optional[Dict[str, Any]]) – Options to pass to AsynchronousDataAccessor.

  • rebatched_worker_size (Optional[int]) – When using AsyncRebatched: batch size of the tensors loaded by the workers. Defaults to the combined batch size. If specified, the rebatched_worker_size must be less than or equal to the combined batch size.

  • batch_sampler (Optional[Union[Sampler[Sequence], Iterable[Sequence]]]) – Defines the strategy used to draw samples from the dataset; returns a batch of indices at a time. Mutually exclusive with batch_size and shuffle.

  • kwargs – Other options to pass to PyTorch’s DataLoader constructor.
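
A minimal sketch, assuming dataset is an existing torch.utils.data.Dataset: with 10 device iterations, each execution of the model consumes 16 * 10 = 160 elements from the dataset (more if replication or gradient accumulation is also enabled).

>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> loader = poptorch.DataLoader(opts, dataset, batch_size=16,
...                              shuffle=True, num_workers=4)
>>> loader.combinedBatchSize  # 160 with the options above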

property combinedBatchSize: Optional[int]

Total number of elements consumed from the dataset for a single execution of the model.

property options: poptorch.Options

A reference to the options that were used to initialise this instance.

terminate()

If mode==DataLoaderMode.Async, kills the worker process in the underlying AsynchronousDataAccessor manually; otherwise has no effect.

Return type

None

class poptorch.AsynchronousDataAccessor(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=SharingStrategy.ForkServer, rebatched_size=None)

A data loader which launches the data loading process on a separate thread, allowing the data to be preprocessed asynchronously on the CPU to minimise CPU/IPU transfer time.

This works by loading the data into a ring buffer of shared memory. When the IPU needs another batch it uses the data ready in the ring buffer. The memory is shared so it will be used in place and won’t be freed until the next batch is requested. Behind the scenes, the worker thread fills the unready elements of the ring buffer.

Note

When using a torch.utils.data.Dataset with rebatched_size, the accessor will default to drop_last=True; to change that behaviour, wrap the dataset in a poptorch.DataLoader(..., drop_last=False).

__init__(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=SharingStrategy.ForkServer, rebatched_size=None)
Parameters
  • dataset (Union[torch.utils.data.Dataset, DataLoader]) – The dataset to pull data from, this can be any Python iterable.

  • buffer_size (int) – The size of the ring buffer.

  • miss_sleep_time_in_ms (float) – How long, in milliseconds, the worker should sleep before checking again when the buffer is full.

  • load_indefinitely (bool) – If True, loop back to the start of the dataset when the end is reached.

  • early_preload (bool) – If True, start loading data in the ring buffer as soon as the worker is created. If False, wait for an iterator to be created before loading data.

  • sharing_strategy (poptorch.SharingStrategy) –

    Method to use to pass the dataset object when the child process is created.

    • SharedMemory is fast but might be quite limited in size.

    • FileSystem will serialise the dataset to file and reload it which will be slower.

    • Fork new processes: no data sharing required but might cause problems if worker processes use threading.

    • ForkServer is similar to Fork but uses a server process to fork child processes. It is safe to use even if worker processes use threading.

  • rebatched_size (Optional[int]) – If not None: return N batched tensors from the dataset per iteration. (The passed dataset must have a batch_size of 1).

Note

If dataset is an iterable-type poptorch.DataLoader configured with drop_last=False then rebatched_size must be used.
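
A minimal sketch, assuming the dataset yields (data, labels) pairs and poptorch_model is an existing wrapped model:

>>> loader = poptorch.DataLoader(opts, dataset, batch_size=16)
>>> accessor = poptorch.AsynchronousDataAccessor(loader)
>>> for data, labels in accessor:
...     out = poptorch_model(data, labels)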

terminate()

An override function to kill the worker process manually.

Return type

None

class poptorch.DataLoaderMode(value)
  • Sync: Access data synchronously.

  • Async: Uses an AsynchronousDataAccessor to access the dataset.

  • AsyncRebatched: For iterable datasets by default PyTorch will round down the number of elements to a multiple of the combined batch size in each worker. When the number of workers is high and/or the batch size large this might lead to a significant part of the dataset being discarded. In this mode, the combined batch size used by the PyTorch workers will be set to 1, and the batched tensor will instead be constructed in the AsynchronousDataAccessor. This mode is identical to Async for map-style datasets.
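
For example, a minimal sketch: let the DataLoader create and manage the asynchronous accessor itself rather than wrapping it manually.

>>> loader = poptorch.DataLoader(opts, dataset, batch_size=16,
...                              mode=poptorch.DataLoaderMode.Async)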

11.8. Enumerations

class poptorch.SharingStrategy(value)

Strategy to use to pass objects when creating new processes.

  • SharedMemory: Spawn new processes and share data using shared memory: fast but limited availability.

  • FileSystem: Spawn new processes and share data using the file system: slower but larger than memory.

  • Fork: Fork new processes: no data sharing required but might cause problems if worker processes use threading.

  • ForkServer: Similar to Fork but a server process is used to fork child processes instead. This server process is single-threaded so there are no issues if worker processes use threading.

class poptorch.OverlapMode(value)
  • NoOverlap: The host will copy the tensor to the IPU only when required: this minimises on-chip memory use at the cost of performance.

  • OverlapAccumulationLoop: The host will preload values for the next gradient accumulation iteration onto an IO tile.

  • OverlapDeviceIterationLoop: The host will preload values not just for the next gradient accumulation iteration, but the next device iteration, onto an IO tile. This may require more IO tiles than the previous setting but offers greater performance.

class poptorch.MatMulSerializationMode(value)

Which dimension of the matrix multiplication to use for the serialization.

class poptorch.SyncPattern(value)
  • Full: Require all IPUs to synchronise on every communication between IPUs or between IPUs and host.

  • SinglePipeline: Allow IPUs to synchronise with the host independently, without having to synchronise with each other. This permits any one IPU to perform host IO while other IPUs are processing data.

  • ReplicaAndLadder: Allow an IPU group to communicate with the host without requiring synchronisation between groups. This permits multiple IPU groups to alternate between performing host IO and computation.

class poptorch.ReductionType(value)
  • Sum: Calculate the sum of all values.

  • Mean: Calculate the mean of all values.

  • NoReduction: Do not reduce.

class poptorch.ConnectionType(value)
  • Always: Attach to the IPU from the start (Default).

  • OnDemand: Wait until the compilation is complete and the executable is ready to be run to attach to the IPU.

  • Never: Never try to attach to an IPU. (Useful for offline compilation, but trying to run an executable will raise an exception).

class poptorch.OutputMode(value)
  • All: Return a result for each batch.

  • Sum: Return the sum of all the batches.

  • Final: Return the last batch.

  • EveryN: Return every N batches. N is passed in as output_return_period.

  • Default: “All” for inference, “Final” for training.

class poptorch.MeanReductionStrategy(value)

Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType.Mean.

  • Running: Keeps the reduction buffer as the current mean. This is preferred for numerical stability as the buffer value is never larger than the magnitude of the largest micro batch gradient.

  • Post: Divides by the accumulationFactor and replicatedGraphCount after all of the gradients have been reduced. In some cases this can be faster than using Running; however, it is prone to overflow.

  • PostAndLoss (deprecated): Divides by the replicatedGraphCount before the backwards pass, performs the gradient reduction across micro batches, and then divides by the accumulationFactor. This is to support legacy behaviour and is deprecated.