11. API reference
11.1. Options
- class poptorch.Options
Set of all options controlling how a model is compiled and executed.
Pass an instance of this class to the model wrapping functions inferenceModel() and trainingModel() to change how the model is compiled and executed. An instance includes general options set within this class, such as deviceIterations(), as well as properties referring to categories of options, such as Training.
>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> opts.Training.gradientAccumulation(4)
- Return type
None
- property Distributed: poptorch.options._DistributedOptions
Options specific to running on multiple IPU servers (IPU-PODs).
You should not use these when using PopRun/PopDist. Instead, use popdist.poptorch.Options to set these values automatically.
See also
- property Jit: poptorch.options._JitOptions
Options specific to upstream PyTorch’s JIT compiler.
See also
- property Precision: poptorch.options._PrecisionOptions
Options specific to the processing of the JIT graph prior to lowering to PopART.
See also
- property TensorLocations: poptorch.options._TensorLocationOptions
Options related to tensor locations.
See also
- property Training: poptorch.options._TrainingOptions
Options specific to training.
See also
- anchorTensor(short_name, long_name, output_mode=None, output_return_period=1)
Anchor a tensor such that it may be retrieved after a model run.
- Parameters
short_name (str) – User defined name to be used for retrieval
long_name (str) – The PopART name of the tensor to be anchored
output_mode (poptorch.OutputMode) – Specifies when data should be returned. Defaults to None, in which case the tensor will use the same output mode used for model outputs.
output_return_period (int) – Return period if the output mode is EveryN. Defaults to 1.
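For example, a minimal sketch of anchoring a gradient tensor and reading it back after a run (the PopART tensor name is illustrative and depends on the model):
>>> opts = poptorch.Options()
>>> # "Gradient___model.fc.weight" is a hypothetical PopART name; use
>>> # getTensorNames() on a compiled model to discover the real ones.
>>> opts.anchorTensor('grad_fc_w', 'Gradient___model.fc.weight')
>>> poptorch_model = poptorch.trainingModel(model, opts, optimizer)
>>> poptorch_model(input, target)
>>> grad = poptorch_model.getAnchoredTensor('grad_fc_w')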
- appendToLocationExcludes(*excludes)
When printing the IR, all the frames containing one of the excluded strings will be ignored.
This is helpful to get the IR to trace back to user code rather than some function inside a framework.
- Parameters
excludes (str) – Append these exclusions to the existing list of exclusions.
- Return type
- autoRoundNumIPUs(auto_round_num_ipus=True)
Whether or not to automatically round up the number of IPUs used: the number of IPUs requested must be a power of 2. By default, an error occurs if the model uses an unsupported number of IPUs, to prevent you from unintentionally overbooking IPUs.
- Parameters
auto_round_num_ipus (bool) –
True: round up the number of IPUs to a power of 2.
False: error if the number of IPUs is not supported.
- Return type
- broadcastBuffers(broadcast_buffers=True)
Broadcast buffers to all replicas.
Only non-broadcast buffers are currently supported, which means each replica will hold a set of buffers not in sync with other replicas' buffers. To enable non-broadcast buffers, set this option to False.
- Parameters
broadcast_buffers (bool) –
- clone()
Create an unfrozen deep copy of the current options.
- Return type
- connectionType(connection_type)
When to connect to the IPU (if at all).
- Parameters
connection_type (poptorch.ConnectionType) –
Always: Attach to the IPU from the start (default).
OnDemand: Wait until the compilation is complete and the executable is ready to run before attaching to the IPU.
Never: Never try to attach to an IPU: this is useful for offline compilation, but trying to run an executable will raise an exception.
- Return type
For example:
>>> opts = poptorch.Options()
>>> opts.connectionType(poptorch.ConnectionType.OnDemand)
- defaultOutputMode()
- Returns
True: outputMode() is currently set to the default.
False: outputMode() is not set to the default.
- Return type
- deviceIterations(device_iterations)
Number of iterations the device should run over the data before returning to the user (default: 1).
This is equivalent to running the IPU in a loop over the specified number of iterations, with a new batch of data each time. However, increasing deviceIterations is more efficient because the loop runs on the IPU directly.
- Parameters
device_iterations (int) –
- Return type
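For example, a sketch of how device iterations combine with the PopTorch DataLoader (dataset is a placeholder):
>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> # With a micro-batch size of 16, each call to the model consumes
>>> # 10 * 16 = 160 samples and only returns to the host once all
>>> # 10 iterations have run on the IPU.
>>> loader = poptorch.DataLoader(opts, dataset, batch_size=16)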
- disableModuleNamescope()
Disable the option that adds a name scope for each operator present in the module. This option is enabled by default. The operator name scope is based on the names appearing in the named_modules function of torch.nn.Module.
For example:
>>> class Model(torch.nn.Module):
>>>     def __init__(self, num_groups, num_channels):
>>>         super().__init__()
>>>         self.gn = torch.nn.GroupNorm(num_groups, num_channels)
>>>     def forward(self, x):
>>>         return self.gn(x)
With the name scope enabled, the name of the group norm operator will be gn/GroupNormalization; with it disabled, it will be GroupNormalization.
- Return type
- enableExecutableCaching(path)
Load/save Poplar executables to the specified path, using it as a cache, to avoid recompiling identical graphs.
- Parameters
path (str) – File path for the Poplar executable cache store; setting path to None disables executable caching.
- Return type
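For example (the cache directory is illustrative):
>>> opts = poptorch.Options()
>>> # Compiled executables are stored here and reused on the next run
>>> # if the graph is identical.
>>> opts.enableExecutableCaching("/tmp/poptorch_cache")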
- enableProfiling(profile_dir=None)
Enable profiling report generation.
To generate debug information associated with the profiling data, specify autoReport.directory, and either autoReport.all or autoReport.outputDebugInfo, in the POPLAR_ENGINE_OPTIONS environment variable. For example:
POPLAR_ENGINE_OPTIONS={"autoReport.directory":"/profile/output", "autoReport.all":"true"}
or:
POPLAR_ENGINE_OPTIONS={"autoReport.directory":"/profile/output", "autoReport.outputDebugInfo":"true"}
Debug information and the rest of the profiling data will be stored in the /profile/output directory. Values specified in the environment variable take precedence over profile_dir when both are given.
- Parameters
profile_dir (str) – Path to the directory where the report will be created. Defaults to the current directory.
- Return type
- enableStableNorm(enabled)
Set whether a stable version of norm operators is used. This stable version is slower, but more accurate than its unstable counterpart.
- Parameters
enabled (bool) –
True: Use stable norm calculation.
False: Do not use stable norm calculation.
- Return type
- enableSyntheticData(enabled)
Set whether host I/O is disabled and synthetic data is generated on the IPU instead. This can be used to benchmark models whilst simulating perfect I/O conditions.
- Parameters
enabled (bool) –
True: Use data generated from a random normal distribution on the IPU. Host I/O is disabled.
False: Host I/O is enabled and real data is used.
- Return type
- from_json(string)
Sets values of the object from a JSON string.
The format of the JSON string is:
{"name.of.accessor": value}
Examples
>>> Options().from_json(
...     '{"Precision.enableFloatingPointExceptions":true}'
... )
>>> Options().from_json('{"_Popart.set":["OptionName", 1]}')
- Parameters
string (str) –
- inputReplicaGrouping(input_group_size, input_group_type)
Allows the input batches to be split between groups of replicas, in a similar way to what replicaGrouping() does for weight tensors.
- Parameters
input_group_size (int) – Number of replicas to place in each input replica group. Must be a factor of replication_factor. Defaults to 1, which will divide the input evenly among all replicas.
input_group_type (poptorch.CommGroupType) – Arrangement type to use when placing replicas into input replica groups. Cannot be poptorch.CommGroupType.All. Defaults to poptorch.CommGroupType.Consecutive. For an explanation of the arrangement types, see CommGroupType and Section 4.4.3, Grouping tensor weights across replicas.
- Return type
- loadFromFile(filepath)
Load options from a config file where each line in the file corresponds to a single option being set. To set an option, specify how you would set the option within a Python script, but omit the options. prefix.
For example, if you wanted to set options.deviceIterations(1), this would be set in the config file by adding a single line with contents deviceIterations(1).
This method can be called multiple times on the same Options object. The options will not be reset to their defaults in between.
For example, if c1.cfg contains the following:
deviceIterations(32)
replicationFactor(2)
and c2.cfg contains the following:
deviceIterations(4)
then calling:
options.loadFromFile('c1.cfg')
options.loadFromFile('c2.cfg')
is equivalent to calling:
options.deviceIterations(4)
options.replicationFactor(2)
- Parameters
filepath (str) –
- Return type
- logCycleCount(log_cycle_count)
Log the number of IPU cycles used in executing the main graph.
The cycle count will be printed when this option is enabled and the environment variable POPTORCH_LOG_LEVEL=DEBUG is set. This option requires IPU hardware to run.
Note: This will have a small detrimental impact on performance.
- Parameters
log_cycle_count (bool) –
True: Enable logging the IPU cycle count.
False: Do not enable IPU cycle count logging.
- Return type
- logDir(log_dir)
Set the log directory
- Parameters
log_dir (str) – Directory where PopTorch saves log files (default: current directory)
- Return type
- maxRepeatLogs(max_lines)
For often-repeated log lines, set the maximum number of repeated lines that will be logged.
- modelName(name)
Set the model name
- Parameters
name (str) – Name of the model. Defaults to "inference" or "training" depending on the type of model created. Used when profiling to set the subdirectory of the report directory where the profiling output is written.
- Return type
- outputMode(output_mode, output_return_period=None)
Specify which data to return from a model.
- Parameters
output_mode (poptorch.OutputMode) –
All: Return a result for each batch.
Sum: Return the sum of all the batches.
Final: Return the last batch.
EveryN: Return every N batches: N is passed in as output_return_period.
Default: All for inference, Final for training.
- Return type
For example:
>>> opts = poptorch.Options()
>>> opts.outputMode(poptorch.OutputMode.All)
... # or
>>> opts.outputMode(poptorch.OutputMode.EveryN, 10)
- randomSeed(random_seed)
Set the seed for the random number generator on the IPU.
- Parameters
random_seed (int) – Random seed integer.
- Return type
- relaxOptimizerAttributesChecks(relax=True)
Controls whether unexpected attributes in setOptimizer() lead to warnings or debug messages.
By default, PopTorch will print warnings the first time it encounters unexpected attributes in setOptimizer().
- Parameters
relax (bool) –
True: Redirect warnings to the debug channel.
False: Print warnings about unexpected attributes (default behaviour).
- Return type
- replicationFactor(replication_factor)
Number of times to replicate the model (default: 1).
Replicating the model increases the data throughput of the model as PopTorch uses more IPUs. This leads to the number of IPUs used being scaled by replication_factor. For example, if your model uses 1 IPU, a replication_factor of 2 will use 2 IPUs; if your model uses 4 IPUs, a replication factor of 4 will use 16 IPUs in total.
- Parameters
replication_factor (int) – Number of replicas of the model to create.
- Return type
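For example, a sketch of replicating a single-IPU model over two IPUs:
>>> opts = poptorch.Options()
>>> # A 1-IPU model replicated twice occupies 2 IPUs and processes
>>> # two micro-batches in parallel.
>>> opts.replicationFactor(2)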
- setAvailableMemoryProportion(available_memory_proportion)
Sets the amount of temporary memory made available on a per-IPU basis.
Use this setting to control the amount of temporary memory available to operations such as:
convolution
matrix multiplication
embedding lookups
indexing operations
The parameter should be a dictionary of IPU IDs and float values between 0 and 1 (for example, {"IPU0": 0.5}).
The floating point value has the same meaning and effect as documented in set_available_memory().
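For example, a sketch limiting temporary memory on a two-IPU pipelined model:
>>> opts = poptorch.Options()
>>> # Allow up to 25% of temporary memory on IPU0 and 50% on IPU1.
>>> opts.setAvailableMemoryProportion({"IPU0": 0.25, "IPU1": 0.5})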
- setExecutionStrategy(strategy)
Set the execution strategy to use to partition the graph.
- Parameters
strategy (Union[poptorch.ParallelPhasedExecution, poptorch.SerialPhasedExecution]) – Must be an instance of one of the execution strategy classes.
- Return type
- showCompilationProgressBar(show=True)
Show or hide a progress bar while the model is being compiled. (The progress bar is shown by default.)
- Parameters
show (bool) –
- Return type
- sourceLocationExcludes(excludes)
When printing the IR, all the frames containing one of the excluded strings will be ignored.
This is helpful to get the IR to trace back to user code rather than some function inside a framework.
- syncPattern(sync_pattern)
Controls synchronisation in multi-IPU systems.
This option can be used to allow subsets of IPUs to overlap their work. For example, one set of IPUs could be communicating with the host while other IPUs are processing data.
This option is typically used together with replicated execution, in which case it takes effect on a per-replica basis. If replication is not used, it will apply to all IPUs.
- Parameters
sync_pattern (poptorch.SyncPattern) –
Full: Require all IPUs to synchronise on every communication between IPUs or between IPUs and host. This is the default.
SinglePipeline: Allow IPUs to synchronise with the host independently, without having to synchronise with each other. This permits any one IPU to perform host IO while other IPUs are processing data.
ReplicaAndLadder: Allow an IPU group to communicate with the host without requiring synchronisation between groups. This permits multiple IPU groups to alternate between performing host IO and computation.
- Return type
- updatableNamedBuffers(buffers)
List of model named buffers that can be updated with a call to buffersFromHost(). This allows you to update just a subset of the model weights, instead of all of them as happens with a weightsFromHost() call.
- Parameters
- Return type
- useIpuId(ipu_id)
Use the IPU device specified by the ID (as provided by gc-info).
A device ID may refer to a single or to a group of IPUs (a multi-IPU device). The number of IPUs associated with the ID must be equal to the number of IPUs used by your annotated model multiplied by the replication factor.
For example, if your model uses 1 IPU and the replication factor is 2, you will need to provide a device ID with 2 IPUs; if your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide a device ID which represents a multi-IPU device of 16 IPUs.
You can use the command-line tool gc-info: running gc-info -l shows each device ID and a list of IPUs associated with the ID.
- Parameters
ipu_id (int) – IPU device ID of a single-IPU or multi-IPU device
- Return type
- useIpuModel(use_model)
Whether to use the IPU Model or physical hardware (default)
The IPU model simulates the behaviour of IPU hardware but does not offer all the functionality of an IPU. Please see the Poplar and PopLibs User Guide for further information.
This setting takes precedence over the POPTORCH_IPU_MODEL environment variable.
- Parameters
use_model (bool) –
True: Use the IPU Model.
False: Use IPU hardware.
- Return type
- useOfflineIpuTarget(ipu_version=2)
Create an offline IPU target that can only be used for offline compilation.
Note
The offline IPU target cannot be used if the IPU Model is enabled.
- Parameters
ipu_version (int) – IPU version to target (1 for Mk1, 2 for Mk2, 21 for Mk2 with FP8 support). Default: 2.
- Return type
- class poptorch.options._DistributedOptions
Options related to distributed execution.
You should not use these when using PopRun/PopDist. Instead, use popdist.poptorch.Options to set these values automatically.
Can be accessed via poptorch.Options.Distributed:
>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)
- Return type
None
- configureProcessId(process_id, num_processes)
Manually set the current process ID and the total number of processes.
- Parameters
- Return type
- disable()
Ignore the current options / environment variables and disable distributed execution.
- Return type
- setEnvVarNames(var_num_processes, var_process_id)
Utility to read and set processId and numProcesses from environment variables.
Useful if you use a third-party library, such as mpirun, to manage the processes used for the distributed execution.
For example:
mpirun -np 4 myscript.py
By default, the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.
- Parameters
- Return type
- class poptorch.options._PrecisionOptions(popart_options)
Options related to processing the PyTorch JIT graph prior to lowering to PopART
Can be accessed via poptorch.Options.Precision:
>>> opts = poptorch.Options()
>>> opts.Precision.enableFloatingPointExceptions(True)
- Parameters
popart_options (poptorch.options._PopartOptions) –
- Return type
None
- enableFloatingPointExceptions(enabled)
Set whether floating point exceptions are enabled on the IPU.
When enabled, an exception will be generated when the IPU encounters any one of the following:
Operation resulting in subtraction of infinities
Divisions by zero or by infinity
Multiplications between zero and infinity
Real operations producing complex results
Comparison where any one operand is Not-a-Number
- Parameters
enabled (bool) –
True: Raise RuntimeError on floating point exceptions.
False: Do not raise RuntimeError (default).
- Return type
- enableStochasticRounding(enabled)
Set whether stochastic rounding is enabled on the IPU.
Stochastic rounding randomly rounds half-precision (float16) values up or down such that the expected (mean) result of the rounding is equal to the unrounded value. It can improve training performance by simulating higher-precision behaviour and increasing the speed or likelihood of model convergence. However, the model is non-deterministic and represents a departure from (deterministic) standard IEEE FP16 behaviour.
In the general case, we recommend enabling stochastic rounding for training, where convergence is desirable, but not for inference, where non-determinism may be undesirable.
- Parameters
enabled (bool) –
True: Enable stochastic rounding on the IPU.
False: Disable stochastic rounding.
- Return type
- halfFloatCasting(half_float_casting)
DO NOT USE: about to be removed.
- Parameters
half_float_casting (poptorch.HalfFloatCastingBehavior) –
- Return type
- runningStatisticsAlwaysFloat(value)
DO NOT USE: about to be removed.
- Parameters
value (bool) –
- Return type
- setPartialsType(dtype)
Set the data type of partial results for matrix multiplication and convolution operators.
The matrix multiplication and convolution operators store intermediate results known as partials as part of the calculation. You can use this option to change the data type of the partials. Using torch.half reduces on-chip memory use at the cost of precision.
- Parameters
dtype (torch.dtype) – The type used to store partials, which must be either torch.float or torch.half.
- Return type
- class poptorch.options._JitOptions(**default_values)
Options related to PyTorch’s JIT compiler.
Can be accessed via poptorch.Options.Jit:
>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)
- class poptorch.options._TensorLocationOptions(**default_values)
Options controlling where to store tensors.
Can be accessed via poptorch.Options.TensorLocations:
>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
- numIOTiles(num_tiles)
Assigns the number of tiles on the IPU to be IO rather than compute.
Allocating IO (input/output) tiles reduces the number of IPU tiles available for computation, but allows you to reduce the latency of copying tensors from host to the IPUs using the function set_overlap_for_input(), from the IPUs to host using the function set_overlap_for_output(), or to use off-chip memory with reduced overhead by setting the option useIOTilesToLoad(). As reducing the number of computation tiles may reduce performance, you should not use any IO tiles until you have successfully run your model and used profiling to identify "streamCopy" entries which take up a significant proportion of execution time.
- Parameters
num_tiles (int) –
- Return type
- setAccumulatorLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Update tensor location settings for accumulators.
- Return type
- setActivationLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Update tensor location settings for activations.
- Return type
- setOptimizerLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Update tensor location settings for optimiser states.
- Return type
- setWeightLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Update tensor location settings for weights.
- Return type
- class poptorch.TensorLocationSettings(**default_values)
Define where a tensor is stored
>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
- minElementsForOffChip(min_elements)
A minimum number of elements below which offloading won’t be considered.
- Parameters
min_elements (int) –
- Return type
- minElementsForReplicatedTensorSharding(min_elements)
Only enable replicated tensor sharding (RTS) for tensors with more than min_elements elements.
- Parameters
min_elements (int) –
- Return type
- useIOTilesToLoad(use=True)
Load tensor through IO tiles
- Parameters
use (bool) – Use IO tiles if True, use Compute tiles if False.
- Return type
- useIOTilesToStore(use=True)
Use IO tiles to store tensors.
(Relevant for replicated-tensor-sharded tensors.)
- Parameters
use (bool) – Use IO tiles if True, use Compute tiles if False.
- Return type
- useOnChipStorage(use=True)
Permanent tensor storage
- Parameters
use (bool) – True: use on chip memory. False: use off chip memory. None: keep it undefined.
- Return type
- class poptorch.options._TrainingOptions(popart_options)
Options specific to model training.
Note
You must not set these options for inference models.
Can be accessed via poptorch.Options.Training:
>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)
- Parameters
popart_options (poptorch.options._PopartOptions) –
- Return type
None
- accumulationAndReplicationReductionType(reduction_type)
Set the type of reduction applied to reductions in the graph.
When using a value greater than one for gradientAccumulation() or replicationFactor(), PopTorch applies a reduction to the gradient outputs from each replica, and to the accumulated gradients. This reduction is independent of the model loss reduction (summing a mean-reduced loss and a sum-reduced loss in a PyTorch model is valid).
This setting governs both the accumulation of the loss gradients in replicated graphs and of all of the gradients when using gradient accumulation.
- Parameters
reduction_type (poptorch.ReductionType) –
Mean (default): Reduce gradients by calculating the mean of them.
Sum: Reduce gradients by calculating the sum of them.
- Return type
- gradientAccumulation(gradient_accumulation)
Number of micro-batches to accumulate for the gradient calculation.
Accumulate the gradient gradient_accumulation times before updating the model using the gradient. Other frameworks may refer to this setting as "pipeline depth".
Each micro-batch (a batch of size equal to the batch_size argument passed to DataLoader) corresponds to one gradient accumulation. Therefore gradient_accumulation scales the global batch size (the number of samples between optimiser updates).
Note
Increasing gradient_accumulation does not alter the (micro-)batch size used for batch normalisation.
A large value for gradient_accumulation can improve training throughput by amortising optimiser update costs, most notably when using PipelinedExecution or when training is distributed over a number of replicas. However, the consequential increase in the number of samples between optimiser updates can have an adverse impact on training.
The efficiency gains are most notable for models spanning multiple IPUs that express pipelined model parallelism (via PipelinedExecution, or by default when annotating the model with BeginBlock or Block), because the pipeline has "ramp up" and "ramp down" steps around each optimiser update. Increasing the gradient accumulation factor reduces the proportion of time spent in the "ramp up" and "ramp down" phases, increasing overall throughput.
When training involves multiple replicas, including the cases of sharded and phased execution, each optimiser step incurs a communication cost associated with the reduction of the gradients. By accumulating gradients, you can reduce the total number of updates required and thus reduce the total amount of communication.
Note
Increasing the global batch size can have adverse effects on the sample efficiency of training, so it is recommended to use a low or unity gradient accumulation count initially, and then try increasing it to achieve higher throughput. You may also need to scale other hyperparameters, such as the optimiser learning rate, accordingly.
- Parameters
gradient_accumulation (int) –
- Return type
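As a worked example of how this option scales the global batch size (the numbers are illustrative):
>>> opts = poptorch.Options()
>>> opts.replicationFactor(2)
>>> opts.Training.gradientAccumulation(8)
>>> # With a DataLoader batch_size of 16 (the micro-batch size), the
>>> # number of samples between optimiser updates is:
>>> #   16 (micro-batch) * 8 (accumulation) * 2 (replicas) = 256
>>> loader = poptorch.DataLoader(opts, dataset, batch_size=16)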
- setAutomaticLossScaling(enabled)
Set whether automatic loss scaling is enabled on the IPU.
When using float16/half values for activations, gradients, and weights, the loss value needs to be scaled by a constant factor to avoid underflow/overflow. This adjustment is known as loss scaling. This setting automatically sets a global loss scaling factor during training.
Note: Automatic loss scaling is a preview feature. It is well tested and enabled in some of our example applications, but may not behave as expected in all models. Recommendation: if your model with automatic loss scaling enabled does not converge or triggers a compilation error, then you will need to set the loss scale manually.
- Parameters
enabled (bool) –
True: Enable automatic loss scaling on the IPU.
False: Disable automatic loss scaling.
- Return type
- setConvolutionDithering(enabled)
Enable convolution dithering.
If true, then convolutions with different parameters will be laid out from different tiles in an effort to improve tile balance in models.
Use MultiConv to apply this option to a specific set of convolutions.
- Parameters
enabled (bool) – Enables or disables convolution dithering for all convolutions.
- Return type
- setMeanAccumulationAndReplicationReductionStrategy(mean_reduction_strategy)
Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType.Mean.
The default reduction strategy depends on the optimizer used. The default strategy is Running when the accum_type of the optimizer is set to half-precision (float16) format. Otherwise the Post strategy is used, as this strategy is typically more performant, but the Post strategy is less numerically robust.
- Parameters
mean_reduction_strategy (poptorch.MeanReductionStrategy) –
Running: Keeps the reduction buffer as the current mean. This is preferred for numerical stability as the buffer value is never larger than the magnitude of the largest micro batch gradient.
Post: Divides by the accumulationFactor and replicatedGraphCount after all of the gradients have been reduced. In some cases this can be faster than using Running; however, it is prone to overflow.
PostAndLoss (deprecated): Divides by the replicatedGraphCount before the backwards pass, performs the gradient reduction across micro batches, and then divides by the accumulationFactor. This is to support legacy behaviour and is deprecated.
- Return type
11.2. Helpers
- poptorch.ipuHardwareIsAvailable(num_ipus=1)
Indicates whether any IPU hardware with num_ipus IPUs is present in the system.
Note: This function doesn't check if the IPU is free or already being used.
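For example, a sketch that falls back to the IPU Model when no hardware is present:
>>> opts = poptorch.Options()
>>> if not poptorch.ipuHardwareIsAvailable():
...     # Simulate the IPU on the host instead of attaching to hardware.
...     opts.useIpuModel(True)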
- poptorch.ipuHardwareVersion()
Indicates what IPU hardware version is available in the system.
Raises an exception if no hardware is available.
- Returns
The IPU hardware version or -1 if unknown.
- Return type
- poptorch.setLogLevel(level)
Changes the volume of messages printed in the console (stdout)
- class poptorch.profiling.Channel(name)
Profiling channel.
Note
If the libpvti profiling library is not available at runtime, this class becomes a no-op.
Example:
>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")
- instrument(obj, *methods)
Instrument the methods of an object.
- Parameters
obj – Object to instrument
methods – One or more methods to wrap in profiling trace points.
- tracepoint(name)
Create a context tracepoint
>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()
- Parameters
name – Name associated with this tracepoint.
11.3. PopTorch ops
- poptorch.ctc_beam_search_decoder(probs, lengths, blank=0, beam_width=100, top_paths=1)
Add a connectionist temporal classification (CTC) beam search decoder to the model.
Calculates the most likely top paths and their probabilities given the input logarithmic probabilities and the data lengths.
- Parameters
probs (Tensor) – Logarithmic probabilities tensor with the shape of [input_length, batch_size, num_classes].
lengths (Tensor) – Tensor representing lengths of the inputs of shape [batch_size].
blank (int) – Integer identifier of the blank class (default: 0).
beam_width (int) – Number of beams used during decoding (default: 100).
top_paths (int) – Number of most likely paths to return (default: 1).
- Returns
Three tensors: the paths' probabilities, of shape [batch_size, top_paths]; the paths' lengths, of shape [batch_size, top_paths]; and the decoded paths, of shape [batch_size, top_paths, input_length].
- Return type
- poptorch.ipu_print_tensor(tensor, title='', print_gradient=True, summarise_threshold=1000, edge_items=3, max_line_width=80, digits=4, float_format='auto', separator=', ', open_bracket='(', close_bracket=')')
Adds an op to print the contents of the IPU tensor.
When this is executed the tensor will be copied back to host and printed.
When this operation is called in the backward pass it will print the gradient of the tensor.
The operation is an identity operation and will return the exact same tensor. The returned tensor must be used in place of the original tensor in the rest of the program, to make sure that the print operation isn’t optimised away.
For example, if the original code looks like this:
def forward(self, c, d, b):
    a = c + d
    return a + b
If the result of ipu_print_tensor() is not used, the function will be optimised out by the graph optimiser and the tensor will not be printed.
So if you want to print the value of a, you should do:
def forward(self, c, d, b):
    a = c + d
    x = poptorch.ipu_print_tensor(a)
    return x + b
Optionally, you can add a second string argument to be used as a title, as shown in the following example. The value of a will be printed after the title "summation". The value of the gradient of a will be printed after the title "summation_gradient" if the operation is called in the backward pass.
def forward(self, c, d, b):
    a = c + d
    x = poptorch.ipu_print_tensor(a, "summation")
    return x + b
Warning
To prevent the print operation being optimised out by the graph optimiser, you must use the output of the print.
- Parameters
tensor (Tensor) – The tensor to print.
title (str) – An optional title to print before the tensor value. Defaults to “”.
print_gradient (bool) – Whether to print the gradient tensor associated with this tensor. Defaults to True.
summarise_threshold (int) – If the number of elements of the tensor exceeds this threshold the output will be summarised. Only the edge elements will be displayed with an ellipsis indicating skipped elements. A value of 0 will disable summarisation. Defaults to 1000.
edge_items (int) – Number of edge elements to include at the beginning and end when summarisation is enabled. Defaults to 3.
max_line_width (int) – Lines longer than this limit will be split across multiple lines. A value of 0 will disable line splitting. Defaults to 80.
digits (int) – Number of digits to display. For integers this limit can be exceeded if any number is large enough. For floating point numbers this does not include the exponent. The number of digits is used in conjunction with analysis of the tensor to determine the width of each element, so that all elements are aligned when printed. A value of 0 disables this analysis and each element will be printed in an unaligned format. Defaults to 4.
float_format (str) – Determines the floating point format to use. Automatic mode determines the appropriate format based on the data. Defaults to "auto". One of:
- "auto": Automatically determine the format through analysis.
- "fixed": Use fixed point, e.g. -100.00.
- "scientific": Use scientific notation, e.g. -1.123e+10.
- "none": Do not display all elements with the same format.
separator (str) – Character used to delineate values. Defaults to ", ".
open_bracket (str) – Character used to open a tensor. Defaults to "(".
close_bracket (str) – Character used to close a tensor. Defaults to ")".
- Returns
The input tensor unchanged.
- Return type
- poptorch.for_loop(count, body, inputs)
An on-device for loop. This loop will execute on the device for count iterations.
The body should be a Python function containing the PyTorch code you wish to execute in a loop. It should take as input the same number of tensors as it outputs. Each iteration will have the previous output passed in as input.
- poptorch.recomputationCheckpoint(*tensors)
Operation for checkpointing values in a computational pipeline stage.
When recomputation is enabled, these values will not be recomputed and they will be stored in memory between forward and backwards passes instead.
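A minimal sketch inside a pipeline stage (layer1 and layer2 are placeholder layers):
def forward(self, x):
    x = self.layer1(x)
    # Store this activation between the forward and backward passes
    # instead of recomputing it.
    x = poptorch.recomputationCheckpoint(x)
    return self.layer2(x)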
- poptorch.identity_loss(x, reduction)
Marks a tensor as being part of the loss calculation and, as such, PopTorch autograd will back-propagate through it.
This function should be called on the (final) loss of a model so that it is used as the start of backpropagation. This is equivalent to calling x.backward() on a tensor x when running on the CPU.
This function is necessary to combine multiple losses into a custom loss. It ensures that the tensor is part of the loss calculation and, as such, should be part of the backpropagation in PopTorch autograd.
Multiple calls to identity_loss can be made inside the same model provided they are all dependent: all marked losses must be traceable into a single final tensor itself marked by a call to identity_loss, otherwise an error is raised.
- Parameters
- Returns
The loss tensor with the specified reduction applied.
- Return type
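For example, a sketch combining two losses into a custom loss (fc is a placeholder layer):
def forward(self, x, target):
    out = self.fc(x)
    l1 = torch.nn.functional.l1_loss(out, target)
    l2 = torch.nn.functional.mse_loss(out, target)
    # Mark the combined tensor as the final loss so backpropagation
    # starts from it.
    loss = poptorch.identity_loss(l1 + l2, reduction="mean")
    return out, loss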
- class poptorch.MultiConv
Combines all convolution layers evaluated inside this scope into a single multi-convolution.
Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.
For example:
>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)
Combines the two data-independent convolutions into a single multi-convolution.
Refer to the PopLibs documentation for further information on multi-convolutions.
- availableMemoryProportions(value)
The available memory proportion per convolution, each in the range [0, 1).
For more information, please refer to the technical note on optimising temporary memory usage.
- cycleBackOff(value)
Cycle back off proportion.
- Parameters
value (float) – Number between 0 and 1.
- Returns
self, to support method chaining.
- Return type
- enableConvDithering(value)
Enable per-convolution dithering.
- partialsTypes(value)
The partials type used for each convolution.
- Parameters
value (Union[dtype, List[dtype]]) – Can be a single instance of torch.dtype, in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many torch.dtype values as the number of convolutions.
- Returns
self, to support method chaining.
- Return type
- perConvReservedTiles(value)
Tiles to reserve for each convolution.
- Parameters
value (int) – Number of tiles.
- Returns
self, to support method chaining.
- Return type
- planType(value)
Select the multi-convolution execution strategy.
- Parameters
value (poptorch.MultiConvPlanType) – An instance of MultiConvPlanType.
- Returns
self, to support method chaining.
- Return type
- class poptorch.CPU(layer_to_call, ID)
Allow the execution of a CPU op in the middle of an inference IPU graph.
Important
CPU ops are only supported in inference graphs.
Example:
>>> class Model(torch.nn.Module):
>>>     def __init__(self):
>>>         super().__init__()
>>>         self.cpu = poptorch.CPU(self.myCpuOp, "MyCPUOp")
>>>
>>>     def myCpuOp(self, x):
>>>         return x * 2.0
>>>
>>>     def forward(self, x):
>>>         # The arguments passed to "cpu" are forwarded to "myCpuOp"
>>>         out = self.cpu(x)
>>>         out = self.cpu(out)
>>>         out = self.cpu(out)
>>>         return out
- __init__(layer_to_call, ID)
Execute a given function on the CPU.
- execute()
Implementation detail.
- registerPersistentData()
Implementation detail.
- class poptorch.NameScope(name)
Create a name scope for a code block. All operators originating from this block will have their names prefixed by the given string.
>>> with poptorch.NameScope("CustomString"):
...     y = self.bmm(a, b)
...     z = torch.relu(y)
- Parameters
name (str) –
- class poptorch.MultiConvPlanType(value)
Selects the execution strategy for a poptorch.MultiConv.
Parallel: Execute multiple convolutions in parallel (default).
Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.
- class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs, attributes=None)
Applies a custom operation, implemented within PopART, to the inputs.
- Parameters
inputs (tuple) – A tuple of input tensors, for example, (x, y).
name (str) – Unique name of the PopART custom op.
domain (str) – Domain for the op.
domain_version (int) – Version of the domain to use.
example_outputs (iterable) – A tuple of tensors with the same type and shape as the outputs. The value does not matter as all values will be set to zero for tracing purposes.
attributes (dict) – A dictionary of attributes for the custom op. All attribute keys must be strings. All attribute values must be floats, ints, strings, or a list/tuple containing only floats, only ints or only strings (not a mix of types within the list).
- Returns
The outputs of the forward op of the custom op.
- Return type
- poptorch.nop(tensor)
A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.
- poptorch.dynamic_slice(tensor, dim, start, size, step)
Torch native dynamic slices can't be properly intercepted by backends, so this op is provided to enable dynamic slicing in PopTorch applications.
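A minimal sketch, assuming start is a scalar integer tensor computed at runtime:
def forward(self, x, start):
    # Slice 4 elements along dimension 0, beginning at a
    # runtime-computed offset, with a step of 1.
    return poptorch.dynamic_slice(x, 0, start, 4, 1)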
- poptorch.dynamic_update(input, src, dim, start, size)
Torch native dynamic slices can't be properly intercepted by backends, so this op is provided to enable dynamic update slicing in PopTorch applications.
- poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)
Calculates a matrix product using a serialized matrix multiplication.
The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.
- Parameters
lhs (torch.Tensor) – Left-hand side input matrix.
rhs (torch.Tensor) – Right-hand side input matrix.
mode (poptorch.MatMulSerializationMode) –
Which dimension of the matmul to serialize on: for matrix A (m by n) multiplied by matrix B (n by p).
InputChannels: Split across the input channels (dimension m).
ReducingDim: Split across the reducing dimension (n).
OutputChannels: Split across the output channels (dimension p).
Disabled: Same as an ordinary matrix multiplication.
factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.
keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If keep_precision is True, these additions will occur using float32 rather than half-precision partials, matching those used for the individual matrix multiplications.
- Return type
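For example, a sketch serializing a large multiplication over its output channels (self.weight is a placeholder):
def forward(self, x):
    # Split the matmul into 4 serial steps over the output channels
    # to reduce peak temporary memory.
    return poptorch.serializedMatMul(
        x, self.weight, poptorch.MatMulSerializationMode.OutputChannels, 4)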
- poptorch.set_available_memory(tensor, available_memory_proportion)
Sets the amount of temporary memory made available to an operation.
The operators that can be tuned with this setting include:
convolution
matrix multiplication
embedding lookups
indexing operations
When applied to the output of a supported operation, it controls the trade-off between execution cycles and the temporary memory used during the execution of the operation.
The value should be between 0 and 1 (inclusive) and represents a proportion of available memory on the IPU. The default value is 0.6 (therefore, by default, PopTorch will not use more than 60% of IPU memory for temporary data).
PopTorch passes this setting to the PopLibs operator planner, which will try to constrain the use of temporary memory to below this value. Generally, an operation that has more temporary memory available will run in fewer cycles.
For a specific operation, the necessary amount of temporary memory may be more than the amount specified by this option. In this case, a warning message will be generated.
For more information, please refer to the technical note on optimising temporary memory usage.
>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
- Parameters
- Returns
The input tensor, as if calling an identity function.
- Return type
- poptorch.set_overlap_for_input(input_tensors, mode)
Sets host overlap setting for input_tensors.
You can increase performance in some cases by overlapping the copying from the host to IPUs with computation. However, this requires a number of IPU tiles to be set aside as IO tiles using numIOTiles(), which may affect computation performance.
You should use this function at the start of your model's forward method for each applicable input, and use the returned tensors in subsequent ops.
- Parameters
input_tensors – The input tensors for which to enable overlapping host IO. This can be either a single tensor, or any combination of tuple, list, or dict of tensors.
mode (poptorch.OverlapMode) – Control to what extent the host IO overlaps computation.
- Returns
the input tensors, specified for overlap.
See also
- poptorch.set_overlap_for_output(output_tensors, mode)
Sets host overlap setting for output_tensors.
You can increase performance in some cases by overlapping the copying from the IPUs to host with computation. However, this requires a number of IPU tiles to be set aside as IO tiles using numIOTiles(), which may affect computation performance.
You should use this function at the end of your model's forward method, for each applicable output, just before returning the tensors.
- Parameters
output_tensors – The output tensors to enable overlapping host IO for. This can be either a single tensor, or any combination of tuple, list, or dict of tensors.
mode (poptorch.OverlapMode) – Control to what extent the host IO overlaps computation.
- Returns
the output tensors, specified for overlap.
See also
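For example, a sketch enabling host IO overlap for a model's input and output; self.layer is a placeholder and OverlapAccumulationLoop is one of the poptorch.OverlapMode values:
>>> opts = poptorch.Options()
>>> opts.TensorLocations.numIOTiles(32)
>>> class Model(torch.nn.Module):
...     def forward(self, x):
...         x = poptorch.set_overlap_for_input(
...             x, poptorch.OverlapMode.OverlapAccumulationLoop)
...         x = self.layer(x)
...         return poptorch.set_overlap_for_output(
...             x, poptorch.OverlapMode.OverlapAccumulationLoop)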
- poptorch.nearest(x, y, batch_x=None, batch_y=None)
PopTorch implementation of the torch_cluster nearest operator.
This op clusters points in x together which are nearest to a given query point in y.
- Parameters
x (Tensor) – Node feature matrix.
y (Tensor) – Node feature matrix.
batch_x (Optional[Union[List[int], Tensor]]) – Batch vector, which assigns each node to a specific sample. batch_x needs to be sorted.
batch_y (Optional[Union[List[int], Tensor]]) – Batch vector, which assigns each node to a specific sample. batch_y needs to be sorted.
- poptorch.fps(src, ptr, ratio=0.5, random_start=False)
PopTorch implementation of the torch_cluster fps operator.
This op is a sampling algorithm from the "PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space" paper, which iteratively samples the most distant point with regard to the remaining points.
- Parameters
- Returns
A tensor of src point indexes.
- Return type
- poptorch.cond(condition, then_body, then_inps, else_body, else_inps)
An on-device if/else operation. This creates two branches of instructions executed conditionally on the device. Only for inference.
The then_body and else_body should be Python functions containing the PyTorch code you wish to execute conditionally on the device. The condition is passed in the form of a boolean Tensor and the branch to be executed is decided at runtime directly on the device. There are a few conditions on the branch functions:
then_body and else_body can accept an arbitrary number of inputs (including zero).
Tensors defined in the cond caller (the outer graph) can be used inside then_body and else_body implicitly, just as if they were passed through the inputs list.
then_body and else_body have to return the same number of corresponding outputs. This is because the result of the cond op is assigned to a common list of tensors.
All the tensors utilized by then_body and else_body are passed in by copy, so updating any of the tensors inside then_body and else_body does not affect the original tensors. To update a tensor passed in, its new value has to be returned from the body and assigned to the original tensor (note that the number of outputs from then_body and else_body has to match).
- Parameters
- Return type
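A minimal sketch of a conditional branch chosen at runtime on the device:
def forward(self, x, y):
    def then_body(a):
        return a + 1
    def else_body(a):
        return a - 1
    condition = (x > y).all()
    # Exactly one branch executes on the device; both bodies must
    # return the same number of outputs.
    out, = poptorch.cond(condition, then_body, [x], else_body, [x])
    return out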
11.4. Model wrapping functions
- poptorch.trainingModel(model, options=None, optimizer=None)
Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.
Note
PopTorch makes a shallow copy of the model and wraps the original model to facilitate weight synchronisation. Changes to the parameters in the returned training model affect the original model and vice versa. However, primitive variable types are not synced. For example, calling model.train() on the original model, which changes the training bool of the model instance, will not alter the model returned by this function. You may need to call model.train() on your model before you call this function for correct behaviour.
- Parameters
model (Union[torch.nn.Module, poptorch.PoplarExecutor]) – The PyTorch model to wrap.
options (Optional[poptorch.Options]) – The IPU-specific options.
optimizer (Optional[torch.optim.Optimizer]) – The optimizer to apply during training.
Supported PyTorch optimizers: optim.SGD, optim.Adam, optim.AdamW, optim.RMSprop.
Supported PopTorch optimizers: SGD, Adam, AdamW, RMSprop, LAMB.
- Returns
The PoplarExecutor wrapper to use in place of model.
- Return type
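For example, a sketch of a training loop, assuming the wrapped model returns (output, loss) from its forward method:
>>> model = MyModel()
>>> model.train()  # Set the training state before wrapping.
>>> opts = poptorch.Options()
>>> optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01)
>>> poptorch_model = poptorch.trainingModel(model, opts, optimizer)
>>> for data, labels in dataloader:
...     output, loss = poptorch_model(data, labels)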
- poptorch.inferenceModel(model, options=None)
Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.
Note
PopTorch makes a shallow copy of the model. Changes to the parameters in the returned inference model affect the original model and vice versa. However, primitive variable types are not synced: for example, calling model.eval() on the original model will not alter the model returned by this function. You may need to call model.eval() on your model before you call this function for correct behaviour.
- Parameters
model (Union[torch.nn.Module, poptorch.PoplarExecutor]) – The PyTorch model to wrap.
options (Optional[poptorch.Options]) – The IPU-specific options.
- Returns
The PoplarExecutor wrapper to use in place of model.
- Return type
- class poptorch.PoplarExecutor(model, options, training, poptorch_version, optimizer=None, user_model=None)
This class should not be created directly but is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It only has a few methods, which can be used to interface with the IPU.
- Parameters
model (torch.nn.Module) –
options (Optional[poptorch.Options]) –
training (bool) –
poptorch_version (str) –
optimizer (Optional[torch.optim.Optimizer]) –
user_model (Optional[torch.nn.Module]) –
- __call__(*args, **kwargs)
Takes the same arguments as the wrapped PyTorch model.__call__.
Note
The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.
- attachToDevice()
Attach to target device. Before calling this function, the device must be detached and the model compiled.
- Return type
None
- compilationTime()
Returns the total model compilation time.
- Returns
An object of type datetime.timedelta representing the compilation time.
- Return type
Note
You must compile the model before calling this method. Also, the showCompilationProgressBar option must be set to True.
- compile(*args, **kwargs)
Takes the same arguments as the wrapped PyTorch model.__call__.
Trace and compile the wrapped model if no executable has been created yet.
Note: The executable created by this method can only be executed; it cannot be exported to file. To precompile and save to file, use compileAndExport().
- Return type
None
- compileAndExport(filename, *args, export_model=True, **kwargs)
Precompile an executable and save it to file.
args and kwargs are the same arguments as the wrapped PyTorch model.__call__.
- Parameters
filename (str) – Where to save the compiled executable.
export_model (bool) – If True, the Torch model will be saved in the file alongside the executable. load() can be used to restore both the original Torch model, the PopTorch model and the executable. If False, only the executable will be exported and it will be the user's responsibility to call inferenceModel() or trainingModel() to re-create the PopTorch model before calling loadExecutable() to restore the executable.
- copyNamedBuffersToDevice()
Copies the buffers from model.parameters() to the IPU device.
- Return type
None
- copyWeightsToDevice()
Copies the weights from model.parameters() to the IPU device. Implicitly called on the first call.
- Return type
None
- copyWeightsToHost()
Updates the parameters used in model with the weights stored on the device (the weights in model.parameters()).
- Return type
None
- copyWeightsToHostIfNeeded()
Return True if the weights on the host were dirty and have been updated. Return False if the weights were already up to date.
- Return type
- cycleCount()
Returns the number of cycles for which the IPU ran.
You must run the model on IPU hardware before calling this method.
- Returns
The number of cycles on the IPU for the last model run. If you are using replicas, the returned value represents the number of cycles for the first replica only.
- Return type
- destroy()
Destroy the model: release the IPUs and the executable.
- Return type
None
- detachFromDevice()
Detach from target device. Before calling this function, the device must be attached (and the model compiled).
- Return type
None
- getComputeLatency()
Return compute latency for the last execution of the model.
The compute latency is the interval of time (in fractional seconds) between the last input tensor being transferred to the IPU and the last output tensor becoming available.
The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.
- getHostIpuLatency()
Return Host-IPU latency for the last execution of the model.
The Host-IPU latency is the interval of time (in fractional seconds) between the first input tensor being requested and the last input tensor being transferred to the IPU.
The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.
- getIpuHostLatency()
Return IPU-Host latency for the last execution of the model.
The IPU-Host latency is the interval of time (in fractional seconds) between the first output tensor becoming available and the last output tensor being written back to the host.
The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.
- getLatency()
Return round-trip latency for the last execution of the model.
The round-trip latency is the interval of time (in fractional seconds) between the first input tensor being requested and the last output tensor being written back to the host.
The result is a tuple containing the minimum, maximum and average latency for the iterations corresponding to the latest invocation of the model.
- getPerfCounters()
Return performance counters for the last execution of the model.
Return the values (in fractional seconds) of the performance counters corresponding to the latest run of the model. The reference point of the returned value is undefined, however the difference between values is valid.
The returned object is a dictionary where the keys correspond to each of the following events:
- 'input': the IPU requesting an input tensor
- 'input_complete': an input tensor having been transferred
- 'output': the IPU requesting to transmit an output tensor
- 'output_complete': an output tensor having been transferred
The values of the dictionary are nested lists. The first level of nesting corresponds to an input or output index. The second level list contains the actual values as fractional seconds.
Examples:
- dict['input'][1][3]: performance counter for the second input tensor being requested on the third iteration of the model
- dict['output_complete'][0][0]: performance counter for the first output tensor having been transferred on the first iteration of the model
- getTensorNames()
Returns a list of all tensor names within the computational graph. The model must be compiled in advance.
- isAttachedToDevice()
Returns True if the target device has been attached, False otherwise.
- Return type
- isCompiled()
Returns True if the model has been compiled (and not destroyed), False otherwise.
- Return type
- loadExecutable(filename)
Load an executable previously generated using compileAndExport().
- Parameters
filename (str) –
- Return type
None
- load_state_dict(state_dict, strict=True)
Will call load_state_dict() on the wrapped model and automatically synchronise the weights with the IPU.
- property model: torch.nn.modules.module.Module
Access the wrapped Torch model.
- property options: poptorch.Options
Access to the options.
See also
- property rng_state: List[int]
Return the random number generator's seed and state for the compiled model.
- save(filename, export_model=True, save_rng_state=True)
Save the compiled model to file.
- Parameters
filename (str) – Where to save the compiled executable.
export_model (bool) – If True, the Torch model will be saved in the file alongside the executable. load() can be used to restore both the original Torch model, the PopTorch model and the executable. If False, only the executable will be exported and it will be the user's responsibility to call inferenceModel() or trainingModel() to re-create the PopTorch model before calling loadExecutable() to restore the executable.
save_rng_state (bool) – If True, the random number generator's state and seed will be saved in the file alongside the executable.
- poptorch.isRunningOnIpu()
This function returns True when executing on IPU and False when executing the model outside IPU scope. This allows for separate code paths to be marked in the model simply by using:
>>> if poptorch.isRunningOnIpu():
>>>     # IPU path
>>> else:
>>>     # CPU path
Note this will only apply to code during execution. During model creation it will always return False.
- Returns
True if running on IPU, otherwise False.
- Return type
- poptorch.load(filename, edit_opts_fn=None)
Load a PopTorch model from a file previously created using compileAndExport().
- Parameters
edit_opts_fn (Optional[Callable[[poptorch.Options], None]]) – Function to edit the options before the model is restored. For example to attach to a specific IPU device.
filename (str) –
- Return type
>>> model = poptorch.inferenceModel(model)
>>> model.compileAndExport("my_model.poptorch")
...
>>> model = poptorch.load("my_model.poptorch")
>>> model(my_input)
11.5. Parallel execution
- class poptorch.Block(user_id=None, ipu_id=None)
A context manager to define blocks of the model.
You can use Block as a context manager. This means you use Python's "with" statement as follows:
>>> with poptorch.Block("Encoder"):
...     self.layer = MyLayer(x)
All layers called inside this scope will run on the specified IPU, if one is specified. In addition, you can combine multiple blocks into a stage.
See also
- __init__(user_id=None, ipu_id=None)
- Parameters
user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same ID are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.
ipu_id (Optional[int]) – The ID of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device IDs used by gc-info.
- static useAutoId()
Call this method at the beginning of your forward() method to enable automatic block ID generation.
Blocks with a None user_id will be assigned an automatic ID which will be the index of this block in the list of ID-less Blocks.
>>> poptorch.Block.useAutoId()
>>> with poptorch.Block():  # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"):  # user_id = "special_block"
...     layer()
>>> with poptorch.Block():  # user_id = "1"
...     layer()
- class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)
Define a block by modifying an existing PyTorch module.
You can use this with an existing PyTorch module instance, as follows:
>>> poptorch.BeginBlock(myModel.a_layer)
>>> poptorch.BeginBlock(MyNewLayer())
The module and all sub-modules will be part of this block until a sub-module is modified to be in another block. In addition, if an IPU is specified, the module and its submodules will run on the specified IPU.
You can combine multiple blocks into a stage.
- Parameters
layer_to_call (Module) – PyTorch module to assign to the block.
user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same ID are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.
ipu_id (Optional[int]) – The ID of the IPU to run on. Note that the
ipu_id
is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device IDs used bygc-info
.
- Return type
See also
- poptorch.BlockFunction(user_id=None, ipu_id=None)
A decorator to define blocks of the model.
You can use BlockFunction as a decorator for an existing function, as follows:
>>> @BlockFunction("Decoder", ipu_id=1)
... def decoder(self, encoder_output):
...     self.decoder_b1(encoder_output)
All layers inside the function and any functions called by the function will run on the specified IPU, if one is specified. In addition, you can combine multiple blocks into a stage.
- Parameters
user_id (Optional[str]) – A user defined identifier for the block. Blocks with the same ID are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.
ipu_id (Optional[int]) – The ID of the IPU to run on. Note that the
ipu_id
is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device IDs used bygc-info
.
See also
- poptorch.removeBlocks(module)
Recursively remove BeginBlock annotations from a Module if it contains any.
- Parameters
module (torch.nn.Module) – Module to recursively unwrap.
- class poptorch.Stage(*block_ids)
The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.
- Parameters
block_ids (str) –
- Return type
None
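A brief sketch, assuming blocks "A" and "B" were defined earlier with poptorch.Block: blocks can be grouped into a stage and pinned to an IPU with the ipu() helper used in the Phase examples below:
>>> stage = poptorch.Stage("A", "B").ipu(0)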
- class poptorch.AutoStage(value)
Defines how the stages are automatically assigned to blocks when the user has not explicitly provided stages to the IExecutionStrategy's constructor.
SameAsIpu: The stage ID will be set to the selected IPU number.
AutoIncrement: The stage ID for new blocks is automatically incremented.
Examples:
>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...     layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...     layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...     layer()
By default, the following execution strategy is used:
>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)
which would translate to stage_id = ipu_id:
Block "0" ipu=0 stage=0
Block “1” ipu=1 stage=1
Block “2” ipu=0 stage=0
Now if instead you use:
>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)
The last block would be in its own stage rather than sharing one with Block “0”:
Block “0” ipu=0 stage=0
Block “1” ipu=1 stage=1
Block “2” ipu=0 stage=2
- class poptorch.Phase(*arg)
Represents an execution phase.
- Parameters
arg (Union[str, poptorch.Stage]) –
- __init__(*arg)
Create a phase.
- Parameters
arg (Union[str, poptorch.Stage]) – must be either one or more Stages, or one or more block user_ids.
If one or more strings are passed, they will be interpreted as Block IDs representing a single Stage.
Within a Phase, the stages are executed in parallel.
>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = Phase("A", "B")  # One stage made of 2 blocks
- ipus(*ipus)
Assign one IPU for each stage contained in this Phase.
The number of IPUs passed must match the number of stages in the Phase.
- class poptorch.ShardedExecution(*args)
Will shard the execution of the passed Stages. If no stage is passed, each unique Block ipu_id encountered during tracing is considered a different stage.
>>> with poptorch.Block(ipu_id=0):
...     layer()
>>> with poptorch.Block(ipu_id=1):
...     layer()
>>> with poptorch.Block(ipu_id=2):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on the block ipu_ids
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
- Parameters
args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either an AutoStage strategy or an explicit list of stages or block IDs.
- class poptorch.PipelinedExecution(*args)
- __init__(*args)
Pipeline the execution of the graph partitions. These partitions can be a Stage, a Block or a BeginBlock. If none of these are passed, an AutoStage strategy can be passed instead to decide how the stage IDs are created. By default, poptorch.AutoStage.SameAsIpu is used: the stage ID will be set to the selected IPU number. This implies that each unique Block or BeginBlock in the graph must have its ipu_id explicitly set when using AutoStage.
Example 1: Block user_ids are known, IPUs are inferred.
>>> with poptorch.Block("A"):
...     layer1()
>>> with poptorch.Block("B"):
...     layer2()
>>> with poptorch.Block("C"):
...     layer3()
>>> with poptorch.Block("D"):
...     layer4()
>>> opts = poptorch.Options()
>>> # Create a 4-stage pipeline based on `user_id`; 4 IPUs will be used.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A", "B",
...                                                       "C", "D"))
Stages can also be set explicitly:
>>> # Create a 2-stage pipeline from the blocks' `user_id`; 2 IPUs will be used.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...     poptorch.Stage("A", "B"),
...     poptorch.Stage("C", "D")))
Example 2: Block ipu_ids are known; use the default AutoStage.
>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(ipu_id=0):
...     layer1()
>>> with poptorch.Block(ipu_id=1):
...     layer2()
>>> with poptorch.Block(ipu_id=2):
...     layer3()
>>> with poptorch.Block(ipu_id=3):
...     layer4()
>>> # Automatically create a 4-stage pipeline matching the block `ipu_id`.
>>> # Note: poptorch.PipelinedExecution() is the default execution
>>> # strategy when blocks are defined.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
Example 3: Non-consecutive stages placed on the same IPU.
>>> with poptorch.Block(ipu_id=0):
...     layer1()
>>> with poptorch.Block(ipu_id=1):
...     layer2()
>>> with poptorch.Block(ipu_id=0):
...     layer3()
>>> # Automatically create a 3-stage pipeline forcing the stage
>>> # IDs to be incremental.
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...     poptorch.AutoStage.AutoIncrement))
- Parameters
args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either an AutoStage strategy or an explicit list of stages or block IDs.
- class poptorch.SerialPhasedExecution(*phases)
All the phases run serially on a single group of IPUs.
For example:
phase 0 runs on IPUs 0 & 1
phase 1 runs on IPUs 0 & 1
phase 2 runs on IPUs 0 & 1
>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0, 1)
>>> strategy.phase(1).ipus(0, 1)
>>> strategy.phase(2).ipus(0, 1)
>>> opts.setExecutionStrategy(strategy)
- Parameters
phases (Union[poptorch.Phase, List[poptorch.Stage], List[str]]) –
- __init__(*phases)
Execute the model's blocks in phases.
- setTensorsLiveness(liveness)
See Liveness for more information.
- Parameters
liveness (poptorch.Liveness) –
- Return type
- stage(block_id)
Return the Stage the given block belongs to.
- Parameters
block_id (str) – A block ID.
- useSeparateBackwardPhase(use=True)
Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:
fwd:         bwd:
phase 0  ->  phase 4
phase 1  ->  phase 3
phase 2  ->  phase 2
Note
The end of the forward pass and the beginning of the backward pass are part of the same phase.
If useSeparateBackwardPhase(True) is used, then no phase will be shared between the forward and backward passes:
fwd:         bwd:
phase 0  ->  phase 6
phase 1  ->  phase 5
phase 2  ->  phase 4
- Parameters
use (bool) –
- class poptorch.ParallelPhasedExecution(*phases)
Phases are executed in parallel alternating between two groups of IPUs.
For example:
phase 0 runs on IPUs 0 & 2
phase 1 runs on IPUs 1 & 3
phase 2 runs on IPUs 0 & 2
>>> poptorch.Block.useAutoId()
>>> with poptorch.Block():  # user_id = "0"
...     layer()
>>> with poptorch.Block():  # user_id = "1"
...     layer()
>>> with poptorch.Block():  # user_id = "2"
...     layer()
>>> with poptorch.Block():  # user_id = "3"
...     layer()
>>> with poptorch.Block():  # user_id = "4"
...     layer()
>>> with poptorch.Block():  # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0, 2)
>>> strategy.phase(1).ipus(1, 3)
>>> strategy.phase(2).ipus(0, 2)
>>> opts.setExecutionStrategy(strategy)
- Parameters
phases (Union[poptorch.Phase, List[poptorch.Stage], List[str]]) –
- __init__(*phases)
Execute the model's blocks in phases.
- stage(block_id)
Return the Stage the given block belongs to.
- Parameters
block_id (str) – A block ID.
- useSeparateBackwardPhase(use=True)
Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:
fwd:         bwd:
phase 0  ->  phase 4
phase 1  ->  phase 3
phase 2  ->  phase 2
Note
The end of the forward pass and the beginning of the backward pass are part of the same phase.
If useSeparateBackwardPhase(True) is used, then no phase will be shared between the forward and backward passes:
fwd:         bwd:
phase 0  ->  phase 6
phase 1  ->  phase 5
phase 2  ->  phase 4
- Parameters
use (bool) –
- class poptorch.Liveness(value)
When using phased execution:
AlwaysLive: The tensors always stay on the IPU between the phases.
OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.
OffChipAfterFwdNoOverlap: Same as OffChipAfterFwd, except there is no overlapping of load and store operations between phases. This makes it a more memory-efficient mode at the cost of delayed computation.
OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.
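A short usage sketch, tying together the phased-execution options documented above; the block IDs "A" and "B" are illustrative:
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A")),
...     poptorch.Phase(poptorch.Stage("B"))])
>>> strategy.phase(0).ipus(0)
>>> strategy.phase(1).ipus(0)
>>> strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)
>>> strategy.useSeparateBackwardPhase(True)
>>> opts.setExecutionStrategy(strategy)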
- class poptorch.CommGroupType(value)
Grouping to be used when distributing an input or per-replica variable among replicas. See Grouping tensor weights across replicas.
All: This causes replicaGrouping() to have no effect, as the same variable value is distributed to all replicas. The group count is ignored. This is not valid as an input group type.
Consecutive: Each replica group is made up of consecutive replicas. For group size k, the groups are:
{0, 1, ..., k-1}, {k, ..., 2k-1}, ..., {N-k, ..., N-1}
Orthogonal: Each replica group is made by slicing the replicas orthogonally to the replica ordering. For group size k, with group count m = N/k, the groups are:
{0, m, 2m, ...}, {1, m+1, 2m+1, ...}, ..., {m-1, 2m-1, ..., N-1}
NoGrouping: Each replica gets its own value of the variable. The group count is ignored.
- class poptorch.VariableRetrievalMode(value)
Method to be used when retrieving the value of a grouped variable from grouped replicas. See Grouping tensor weights across replicas.
OnePerGroup: Return one value for each replica group (takes the value from the first replica in the group).
AllReplicas: Return a value from each replica.
- replicaGrouping()
Call this function on a weight tensor (after applying a PopTorch wrapper with inferenceModel() or trainingModel()) to configure replica groups which each receive a different value of the weight tensor. For details and a code example, see Section 4.4.3, Grouping tensor weights across replicas.
- Parameters
comm_group_type (poptorch.CommGroupType) – The replica group arrangement to use for this tensor.
shards (int) – The number of replicas in each replica group.
variable_retrieval_mode (poptorch.VariableRetrievalMode) – The method to use when retrieving the value of this tensor from the replicas.
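A hedged sketch, assuming a model with a `linear` sub-module trained across four replicas (both the layer name and replica count are illustrative, not part of the API):
>>> poptorch_model = poptorch.trainingModel(model, opts)
>>> model.linear.weight.replicaGrouping(
...     poptorch.CommGroupType.Consecutive,
...     2,
...     poptorch.VariableRetrievalMode.OnePerGroup)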
11.6. Optimizers
- class poptorch.optim.VariableAttributes(variable_attributes, allowed_attributes)
Track which attributes are variable or constant.
Accessible via the variable_attrs attribute of any PopTorch optimizer.
>>> opt = poptorch.optim.SGD(params, lr=0.01)
>>> opt.variable_attrs.isConstant("lr")
- isConstant(attr)
Return True if the attribute is marked as constant.
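A short sketch; the markAsConstant() companion call is an assumption here, shown only to illustrate how isConstant() reflects an attribute's status:
>>> opt = poptorch.optim.SGD(params, lr=0.01)
>>> opt.variable_attrs.markAsConstant("lr")
>>> opt.variable_attrs.isConstant("lr")
True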
- class poptorch.optim.SGD(params, lr, momentum=None, dampening=None, weight_decay=None, nesterov=None, maximize=None, foreach=None, differentiable=None, loss_scaling=None, velocity_scaling=None, use_combined_accum=None, accum_type=None, velocity_accum_type=None, max_grad_norm=None)
Stochastic gradient descent with optional momentum.
The optimizer is based on PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.
PopTorch provides two possible variants. Both variants are mathematically identical to PyTorch but differ in their stability and efficiency.
Note
If you set momentum to zero and do not use gradient accumulation, PopTorch will use a simple SGD variant and ignore the values of use_combined_accum, accum_type and velocity_accum_type.
Separate tensor variant (default)
If you set use_combined_accum to False (default), you will use a more stable but more memory-intensive variant. In this case, PopTorch keeps two state tensors for each weight: one for gradient accumulation and one for velocity. It operates as follows when training:
PopTorch runs one or more forward/backward steps, equal to the number of gradient accumulations (see gradientAccumulation()). Each time, PopTorch sums the gradients, storing them in accumulators.
Once all the forward and backward passes have completed, PopTorch uses the summed gradients to update the velocities. At this stage, PopTorch will correct the scale based on the setting of accumulationAndReplicationReductionType(). PopTorch stores the velocities as optimiser states.
Finally, PopTorch uses the velocities to update the parameters, taking into account the loss scaling and learning rate.
With use_combined_accum set to False, you can independently change the data types used for storing the accumulated gradients and the velocity values, using accum_type and velocity_accum_type respectively.
Velocity scaling is ignored for this variant.
Note
If the number of gradient accumulations is high, you can use off-chip memory for the velocity tensors with a minimal performance hit.
>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
Combined tensor variant
If you set use_combined_accum to True, you will use a less stable but more memory-efficient variant. In this case PopTorch uses a single tensor (the combined tensor) for gradient accumulation and velocity. It operates as follows when training:
PopTorch runs one or more forward/backward steps, equal to the number of gradient accumulations (see gradientAccumulation()). For each step, PopTorch immediately calculates an increment or decrement for the combined tensors for each parameter. The amount of increment or decrement takes into account the setting of accumulationAndReplicationReductionType(), as well as removing loss scaling and introducing any velocity scaling.
After running all the steps, the combined tensor will be equal to the new velocities. PopTorch uses these to update the parameters, taking into account the velocity scaling and learning rate.
PopTorch ignores the accum_type and velocity_accum_type values when using a combined tensor. In addition, there are no optimizer state tensors, so opts.TensorLocations.setOptimizerLocation has no effect.
Warning
For both variants, reducing the velocity scaling during training will result in temporary over-estimation of the velocity and could cause model instability. Increasing the scaling may temporarily slow model convergence but not lead to instability.
- __init__(params, lr, momentum=None, dampening=None, weight_decay=None, nesterov=None, maximize=None, foreach=None, differentiable=None, loss_scaling=None, velocity_scaling=None, use_combined_accum=None, accum_type=None, velocity_accum_type=None, max_grad_norm=None)
- Parameters
params (iterable) – parameters to optimize.
lr (float) – learning rate.
weight_decay (Optional[float]) – Weight decay (L2 penalty) factor.
nesterov (Optional[bool]) – Whether to enable Nesterov momentum. Default is False.
loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
velocity_scaling (Optional[float]) – Factor by which to scale the velocity values to assist numerical stability when using float16. (This applies to the combined variant only.)
use_combined_accum (Optional[bool]) – Whether to use a combined accumulator.
accum_type (Optional[dtype]) – data type used for gradients.
velocity_accum_type (Optional[dtype]) – data type used to store the velocity values for each parameter.
max_grad_norm (Optional[float]) – Maximum norm of gradients. Default is inf.
- Return type
None
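A hedged construction sketch of the separate tensor variant described above, assuming `model` and `opts` are already defined; the data types and scaling values chosen here are illustrative:
>>> import torch
>>> opt = poptorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
...                          loss_scaling=128.0, use_combined_accum=False,
...                          accum_type=torch.float16,
...                          velocity_accum_type=torch.float32)
>>> poptorch_model = poptorch.trainingModel(model, opts, optimizer=opt)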
- class poptorch.optim.Adam(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, foreach=None, maximize=None, capturable=None, differentiable=None, fused=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)
Adam optimizer.
This optimizer matches PyTorch’s implementation (torch.optim.Adam) with optional loss scaling.
AMSGrad is currently not supported.
- Parameters
params (Iterable) –
- Return type
None
- __init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, foreach=None, maximize=None, capturable=None, differentiable=None, fused=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)
- Parameters
params (iterable) – parameters to optimize.
betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in Adam.
eps (Optional[float]) – term added to the denominator to ensure numerical stability.
loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
accum_type (Optional[dtype]) – data type used for gradients.
first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.
second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.
max_grad_norm (Optional[float]) – Maximum norm of gradients. Default is inf.
- Return type
None
- class poptorch.optim.AdamW(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, maximize=None, foreach=None, capturable=None, differentiable=None, fused=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)
Adam optimizer with true weight decay.
This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.
AMSGrad is currently not supported.
- Parameters
params (Iterable) –
- Return type
None
- __init__(params, lr=None, betas=None, eps=None, weight_decay=None, amsgrad=None, maximize=None, foreach=None, capturable=None, differentiable=None, fused=None, loss_scaling=None, bias_correction=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, max_grad_norm=None)
- Parameters
params (iterable) – parameters to optimize.
betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in AdamW.
eps (Optional[float]) – term added to the denominator to ensure numerical stability.
loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
bias_correction (Optional[bool]) – True: compute Adam with bias correction.
accum_type (Optional[dtype]) – data type used for gradients.
first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.
second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.
max_grad_norm (Optional[float]) – Maximum norm of gradients. Default is inf.
- Return type
None
- class poptorch.optim.RMSprop(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, foreach=None, maximize=None, differentiable=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, use_tf_variant=None)
RMSprop optimizer with optional L2 penalty.
This optimizer matches PyTorch's implementation (torch.optim.RMSprop) with optional loss scaling.
However, if the use_tf_variant flag is set to True, it will instead match the TensorFlow implementation, which differs from PyTorch's implementation in three ways:
1) The average squared gradients buffer is initialized to ones.
2) The small epsilon constant is applied inside the square root.
3) The learning rate is accumulated in the momentum buffer if momentum is used.
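A minimal sketch, assuming `model` is defined, that opts into the TensorFlow-style behaviour listed above:
>>> opt = poptorch.optim.RMSprop(model.parameters(), lr=0.001,
...                              use_tf_variant=True)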
- Parameters
params (Iterable) –
- Return type
None
- __init__(params, lr=None, alpha=None, eps=None, weight_decay=None, momentum=None, centered=None, foreach=None, maximize=None, differentiable=None, loss_scaling=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None, use_tf_variant=None)
- Parameters
params (iterable) – parameters to optimize.
eps (Optional[float]) – term added to the denominator to ensure numerical stability.
centered (Optional[bool]) – True: compute centred RMSprop in which the gradient is normalized by an estimate of its variance.
loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
accum_type (Optional[dtype]) – data type used for gradients.
first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.
second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.
use_tf_variant (Optional[bool]) – If True, use the TensorFlow variant of RMSprop. Default is False.
- Return type
None
- class poptorch.optim.LAMB(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
Layer-wise Adaptive Moments (LAMB) optimizer (biased version).
Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).
The scaling function phi(z) is fixed as min(z, max_weight_norm).
- Parameters
params (Iterable) –
- Return type
None
- __init__(params, lr=None, betas=None, eps=None, weight_decay=None, bias_correction=None, loss_scaling=None, max_weight_norm=None, accum_type=None, first_order_momentum_accum_type=None, second_order_momentum_accum_type=None)
- Parameters
params (iterable) – parameters to optimize.
betas (Optional[Tuple[float, float]]) – (beta1, beta2) parameters used in LAMB.
eps (Optional[float]) – term added to the denominator to ensure numerical stability.
bias_correction (Optional[bool]) – True: compute LAMB with bias correction.
loss_scaling (Optional[float]) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
max_weight_norm (Optional[float]) – maximum value of the output of scaling function, phi(). Set to None to disable scaling function.
accum_type (Optional[dtype]) – data type used for gradients.
first_order_momentum_accum_type (Optional[dtype]) – data type used to store the first order momentum values for each parameter.
second_order_momentum_accum_type (Optional[dtype]) – data type used to store the second order momentum values for each parameter.
- Return type
None
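A minimal sketch, assuming `model` is defined; passing max_weight_norm=None disables the phi(z) scaling function described above:
>>> opt = poptorch.optim.LAMB(model.parameters(), lr=0.001,
...                           max_weight_norm=None)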
- step(closure=None)
Performs a single optimization step (parameter update).
- Parameters
closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
- Return type
Note
Unless otherwise specified, this function should not modify the .grad field of the parameters.
11.7. Data batching
- class poptorch.DataLoader(options, dataset, batch_size=1, shuffle=None, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=DataLoaderMode.Sync, async_options=None, rebatched_worker_size=None, batch_sampler=None, **kwargs)
Thin wrapper around the traditional torch.utils.data.DataLoader to abstract away some of the batch size calculations.
If this data loader is used in a distributed execution environment, it will ensure that each process uses a different subset of the dataset, provided you first call options.randomSeed(N) with an integer N which is the same across all hosts.
- __init__(options, dataset, batch_size=1, shuffle=None, num_workers=0, drop_last=True, persistent_workers=None, auto_distributed_partitioning=True, mode=DataLoaderMode.Sync, async_options=None, rebatched_worker_size=None, batch_sampler=None, **kwargs)
- Parameters
options (poptorch.Options) – Options that will be used to compile and run the model.
dataset (torch.utils.data.Dataset) – The dataset to get the data from.
batch_size (int) – This is the batch size in the conventional sense of being the size that runs through an operation in the model at any given time.
shuffle (bool) – Whether or not the dataset should be shuffled.
num_workers (int) – Number of worker processes to use to read the data.
drop_last (bool) – If True and the number of elements in the dataset is not a multiple of the combined batch size then the incomplete batch at the end will be dropped.
persistent_workers (Optional[bool]) – Re-use workers between iterations if True.
auto_distributed_partitioning (bool) – If True, partitions the dataset for distributed execution automatically. Otherwise, it is assumed that partitioning has been handled manually.
mode (poptorch.DataLoaderMode) – If DataLoaderMode.Async, uses an AsynchronousDataAccessor to access the dataset. If DataLoaderMode.Sync, accesses the dataset synchronously.
async_options (Optional[Dict[str, Any]]) – Options to pass to AsynchronousDataAccessor.
rebatched_worker_size (Optional[int]) – When using AsyncRebatched: batch size of the tensors loaded by the workers. Defaults to the combined batch size. If specified, rebatched_worker_size must be less than or equal to the combined batch size.
batch_sampler (Optional[Union[Sampler[Sequence], Iterable[Sequence]]]) – Defines the strategy used to draw samples from the dataset. Returns a batch of indices at a time. Mutually exclusive with batch_size and shuffle.
kwargs – Other options to pass to PyTorch's DataLoader constructor.
- property combinedBatchSize: Optional[int]
Total number of elements consumed from the dataset for a single execution of the model.
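For illustration, the combined batch size is the product of the batch size, device iterations, replication factor and gradient accumulation. A hedged sketch, assuming `dataset` is defined and that replicationFactor() is available on Options (an assumption here), giving 4 x 10 x 2 x 4 = 320:
>>> opts = poptorch.Options()
>>> opts.deviceIterations(10)
>>> opts.replicationFactor(2)
>>> opts.Training.gradientAccumulation(4)
>>> loader = poptorch.DataLoader(opts, dataset, batch_size=4)
>>> loader.combinedBatchSize
320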
- property options: poptorch.Options
A reference to the options that were used to initialise this instance.
- terminate()
If mode==DataLoaderMode.Async, kills the worker process in the underlying AsynchronousDataAccessor manually; otherwise has no effect.
- Return type
None
- class poptorch.AsynchronousDataAccessor(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=SharingStrategy.ForkServer, rebatched_size=None)
A data loader which launches the data loading process on a separate thread to allow for the data to be preprocessed asynchronously on the CPU to minimise CPU/IPU transfer time.
This works by loading the data into a ring buffer of shared memory. When the IPU needs another batch it uses the data already available in the ring buffer. The memory is shared, so it will be used in place and won't be freed until the next batch is requested. Behind the scenes, the worker thread will be filling the unready elements of the ring buffer.
Note
When using a torch.utils.data.Dataset with rebatched_size, the accessor will default to drop_last=True; to change that behaviour, wrap the dataset in a poptorch.DataLoader(..., drop_last=False).
- Parameters
dataset (Union[torch.utils.data.Dataset, DataLoader]) –
buffer_size (int) –
miss_sleep_time_in_ms (float) –
load_indefinitely (bool) –
early_preload (bool) –
sharing_strategy (poptorch.SharingStrategy) –
- __init__(dataset, buffer_size=3, miss_sleep_time_in_ms=0.1, load_indefinitely=True, early_preload=False, sharing_strategy=SharingStrategy.ForkServer, rebatched_size=None)
- Parameters
dataset (Union[torch.utils.data.Dataset, DataLoader]) – The dataset to pull data from, this can be any Python iterable.
buffer_size (int) – The size of the ring buffer.
miss_sleep_time_in_ms (float) – How long (in milliseconds) to sleep the worker when the buffer is full, before checking again.
load_indefinitely (bool) – If True, loop back to the start when the end of the dataset is reached.
early_preload (bool) – If True, start loading data in the ring buffer as soon as the worker is created. If False, wait for an iterator to be created before loading data.
sharing_strategy (poptorch.SharingStrategy) – Method to use to pass the dataset object when the child process is created.
SharedMemory is fast but might be quite limited in size.
FileSystem will serialise the dataset to file and reload it, which will be slower.
Fork forks new processes: no data sharing is required, but it might cause problems if worker processes use threading.
ForkServer is similar to Fork but uses a server process to fork child processes. It is safe to use even if worker processes use threading.
rebatched_size (Optional[int]) – If not None: return N batched tensors from the dataset per iteration. (The passed dataset must have a batch_size of 1).
Note
If dataset is an iterable-type poptorch.DataLoader configured with drop_last=False, then rebatched_size must be used.
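A short usage sketch, assuming `opts`, `dataset` and a compiled `poptorch_model` exist; iterating the accessor drives the worker that fills the ring buffer, and terminate() cleans up explicitly:
>>> loader = poptorch.DataLoader(opts, dataset, batch_size=4, num_workers=2)
>>> accessor = poptorch.AsynchronousDataAccessor(loader)
>>> for batch in accessor:
...     output = poptorch_model(batch)
>>> accessor.terminate()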
- terminate()
An override function to kill the worker process manually.
- Return type
None
- class poptorch.DataLoaderMode(value)
Sync: Access data synchronously.
Async: Uses an AsynchronousDataAccessor to access the dataset.
AsyncRebatched: For iterable datasets, by default PyTorch will round down the number of elements to a multiple of the combined batch size in each worker. When the number of workers is high and/or the batch size is large, this might lead to a significant part of the dataset being discarded. In this mode, the combined batch size used by the PyTorch workers will be set to 1, and the batched tensor will instead be constructed in the AsynchronousDataAccessor. This mode is identical to Async for map-style datasets.
11.8. Enumerations
- class poptorch.SharingStrategy(value)
Strategy to use to pass objects when creating new processes.
SharedMemory: Spawn new processes and share data using shared memory. Fast, but limited availability.
FileSystem: Spawn new processes and share data using the filesystem. Slower, but larger than memory.
Fork: Fork new processes. No data sharing is required, but this might cause problems if worker processes use threading.
ForkServer: Similar to Fork, but a server process is used to fork child processes instead. This server process is single-threaded, so there are no issues if worker processes use threading.
- class poptorch.OverlapMode(value)
NoOverlap: The host will copy the tensor to the IPU only when required: this minimises on-chip memory use at the cost of performance.
OverlapAccumulationLoop: The host will preload values for the next gradient accumulation iteration onto an IO tile.
OverlapDeviceIterationLoop: The host will preload values not just for the next gradient accumulation iteration, but for the next device iteration, onto an IO tile. This may require more IO tiles than the previous setting, but offers greater performance.
- class poptorch.MatMulSerializationMode(value)
Which dimension of the matrix multiplication to use for the serialization.
- class poptorch.SyncPattern(value)
Full: Require all IPUs to synchronise on every communication between IPUs, or between IPUs and host.
SinglePipeline: Allow IPUs to synchronise with the host independently, without having to synchronise with each other. This permits any one IPU to perform host IO while other IPUs are processing data.
ReplicaAndLadder: Allow an IPU group to communicate with the host without requiring synchronisation between groups. This permits multiple IPU groups to alternate between performing host IO and computation.
- class poptorch.ReductionType(value)
Sum: Calculate the sum of all values.
Mean: Calculate the mean of all values.
NoReduction: Do not reduce.
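A brief sketch of where this enumeration is typically consumed; it assumes the accumulationAndReplicationReductionType() setter (referenced in the SGD notes above) is exposed on opts.Training:
>>> opts = poptorch.Options()
>>> opts.Training.accumulationAndReplicationReductionType(
...     poptorch.ReductionType.Mean)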
- class poptorch.ConnectionType(value)
Always: Attach to the IPU from the start (default).
OnDemand: Wait until the compilation is complete and the executable is ready to be run before attaching to the IPU.
Never: Never try to attach to an IPU. (Useful for offline compilation, but trying to run an executable will raise an exception.)
- class poptorch.OutputMode(value)
All: Return a result for each batch.
Sum: Return the sum of all the batches.
Final: Return the last batch.
EveryN: Return every N batches. N is passed in as output_return_period.
Default: "All" for inference, "Final" for training.
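A hedged sketch, assuming Options exposes an outputMode() setter taking the mode and, for EveryN, the return period (an assumption here):
>>> opts = poptorch.Options()
>>> opts.outputMode(poptorch.OutputMode.EveryN, output_return_period=10)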
- class poptorch.MeanReductionStrategy(value)
Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType.Mean.
Running: Keeps the reduction buffer as the current mean. This is preferred for numerical stability, as the buffer value is never larger than the magnitude of the largest micro-batch gradient.
Post: Divides by the accumulationFactor and replicatedGraphCount after all of the gradients have been reduced. In some cases this can be faster than using Running, however it is prone to overflow.
PostAndLoss (deprecated): Divides by the replicatedGraphCount before the backwards pass, performs the gradient reduction across micro batches, and then divides by the accumulationFactor. This is to support legacy behaviour and is deprecated.
(deprecated): Divides by the replicatedGraphCount before the backwards pass, performs the gradient reduction across micro batches, and then divides by the accumulationFactor. This is to support legacy behaviour and is deprecated.