3. Features

3.1. Options

The compilation and execution on the IPU can be controlled using poptorch.Options:

See Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation, and replication_factor interact with the output and input sizes.

class poptorch.Options

Options controlling how a model is run on the IPU.

property Distributed

Options specific to distributed execution.

property Jit

Options specific to upstream PyTorch’s JIT compiler.

property Popart

Options specific to the PopART backend. (Advanced users only).

property TensorLocations

Options related to tensor locations.

property Training

Options specific to training.

anchorMode(anchor_mode, anchor_return_period=None)

Specify which data to return from a model

Parameters

anchor_mode (poptorch.AnchorMode) –

  • All: Return a result for each batch.

  • Sum: Return the sum of all the batches.

  • Final: Return the last batch.

  • EveryN: Return every N batches: N is passed in as anchor_return_period.

  • Default: All for inference, Final for training.

For example:

>>> opts = poptorch.Options()
... opts.anchorMode(poptorch.AnchorMode.All)
... # or
... opts.anchorMode(poptorch.AnchorMode.EveryN, 10)
autoRoundNumIPUs(auto_round_num_ipus)

Whether or not to round up the number of IPUs used automatically: the number of IPUs requested must be a power of 2 or a multiple of 64. By default, an error occurs if an unsupported number of IPUs is used by the model, to prevent unintentional overbooking of IPUs.

Parameters

auto_round_num_ipus (bool) –

  • True: round up the number of IPUs to a power of 2 or multiple of 64 automatically

  • False: error if the number of IPUs is not supported

connectionType(connection_type)

When to connect to the IPU (if at all)

Parameters

connection_type (poptorch.ConnectionType) –

  • Always: Attach to the IPU from the start (Default).

  • OnDemand: Wait until the compilation is complete and the executable is ready to be run to attach to the IPU.

  • Never: Never try to attach to an IPU. (Useful for offline compilation, but trying to run an executable will raise an exception).

For example:

>>> opts = poptorch.Options()
... opts.connectionType(poptorch.ConnectionType.OnDemand)
defaultAnchorMode()
Returns

True if the anchorMode is currently set to Default; False otherwise

Return type

bool

deviceIterations(device_iterations)

Number of iterations the device should run over the data before returning to the user. (Default: 1)

Essentially, it is the equivalent of launching the IPU in a loop over that number of batches. This is efficient because that loop runs on the IPU directly.
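For example, a minimal sketch of making the IPU loop over four batches per host call (only the option call itself is shown; data handling is unchanged):

>>> opts = poptorch.Options()
>>> opts.deviceIterations(4)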

enableExecutableCaching(path)

If path is None: disable executable caching.

Otherwise use path as a cache to save / load Poplar executables.

logDir(log_dir)

Where to save log files (Default: Current directory)

randomSeed(random_seed)

Set the seed for the random number generator on the IPU.

replicationFactor(replication_factor)

Number of model replications (Default: 1).

For example if your model uses 1 IPU, a replication factor of 2 will use 2 IPUs. If your model is pipelined across 4 IPUs, a replication factor of 4 will use 16 IPUs total.

setAvailableMemoryProportion(available_memory_proportion)

Memory is set on a per-IPU basis; this should be a dictionary of IPU IDs and float values between 0 and 1.

For example: {"IPU0": 0.5}

setExecutionStrategy(strategy)

Set the execution strategy to use to partition the graph

syncPattern(sync_pattern)

Set the IPU SyncPattern.

Parameters

sync_pattern (poptorch.SyncPattern) –

  • Full

  • SinglePipeline

  • ReplicaAndLadder

useIpuId(ipu_id)

Use the specified IPU id as provided by gc-info.

The number of IPUs associated with the id must be equal to the number of IPUs used by your graph multiplied by the replication factor.

For example if your model uses 1 IPU and the replication factor is 2 you will need to provide an id with 2 IPUs.

If your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide an id containing 16 IPUs in total.

Parameters

ipu_id (int) – IPU id as provided by gc-info.

useIpuModel(use_model)

Use the IPU model or physical hardware.

Default: False (Real Hardware).

This setting takes precedence over the POPTORCH_IPU_MODEL environment variable.

useOfflineIpuTarget(ipu_version=1)

Create an offline IPU target that can only be used for offline compilation.

Note

The offline IPU target cannot be used if the IPU model is enabled.

Parameters

ipu_version (int) – IPU version to target (1 for mk1, 2 for mk2). Default: 1.
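For example, a sketch of preparing options for offline compilation of a Mk2 target; combining this with connectionType(Never) is an assumption about typical offline-compilation usage, based on the connectionType() description above:

>>> opts = poptorch.Options()
>>> opts.useOfflineIpuTarget(2)
>>> opts.connectionType(poptorch.ConnectionType.Never)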

You can choose to use the IPU model or the real IPU hardware via poptorch.Options.useIpuModel.

class poptorch.options._DistributedOptions

Options related to distributed execution.

Can be accessed via poptorch.Options.Distributed:

>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)
configureProcessId(process_id, num_processes)

Manually set the current process ID and the total number of processes.

Parameters
  • process_id (int) – The ID of this process.

  • num_processes (int) – The total number of processes the execution is distributed over.

disable()

Ignore the current options / environment variables and disable distributed execution.

property numProcesses

Total number of processes the execution is distributed over.

property processId

Id of the current process.

setEnvVarNames(var_num_processes, var_process_id)

Utility to read and set processId and numProcesses from environment variables.

Useful if you use a third party library to manage the processes used for the distributed execution such as mpirun.

For example: mpirun -np 4 myscript.py

By default the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.
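For example, a sketch that makes the OpenMPI defaults mentioned above explicit:

>>> opts = poptorch.Options()
>>> opts.Distributed.setEnvVarNames("OMPI_COMM_WORLD_SIZE",
...                                 "OMPI_COMM_WORLD_RANK")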

class poptorch.options._JitOptions

Options related to Pytorch’s JIT compiler.

Can be accessed via poptorch.Options.Jit:

>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)
traceModel(trace_model)

If True: use torch.jit.trace

If False: use torch.jit.script (Experimental)

Trace model is enabled by default.

class poptorch.options._TrainingOptions

Options specific to model training.

Note

You must not set these options for inference models.

Can be accessed via poptorch.Options.Training:

>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)
gradientAccumulation(gradient_accumulation)

Number of samples to accumulate for the gradient calculation.

Accumulate the gradient N times before applying it. This is needed to train with models expressing pipelined model parallelism using the IPU annotation. This is due to weights being shared across pipeline batches so gradients will be updated and used by subsequent batches out of order.

Might be called “pipeline depth” in some other frameworks.

class poptorch.options._PopartOptions

Options specific to the PopART backend.

Only for advanced users.

Any option from popart.SessionOptions can be set using this class.

Note

There is no mapping for the various PopART enums, so integers need to be used instead.

Can be accessed via poptorch.Options.Popart:

>>> opts = poptorch.Options()
>>> opts.Popart.set("autoRecomputation", 3) # RecomputationType::Pipeline
>>> opts.Popart.set("syntheticDataMode",
>>>                  int(popart.SyntheticDataMode.RandomNormal))
setPatterns(patterns, level=2)

Override the default patterns of Popart’s compiler.

Parameters
  • patterns (dict(str,bool)) – Dictionary of pattern names to enable / disable.

  • level (int) – Integer value corresponding to the popart::PatternsLevel to use to initialise the Patterns.

class poptorch.options._TensorLocationOptions(**default_values)

Options controlling where tensors are stored.

Can be accessed via poptorch.Options.TensorLocations:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
setAccumulatorLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the accumulators.

setActivationLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the activations.

setOptimizerLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the optimizer states.

setWeightLocation(location)
Parameters

location (poptorch.TensorLocationSettings) – Where to store the weights.

class poptorch.TensorLocationSettings(**default_values)

Define where a tensor is stored

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
minElementsForOffChip(min_elements)

A minimum number of elements below which offloading won’t be considered.

minElementsForReplicatedTensorSharding(min_elements)

Only enable Replicated Tensor Sharding (RTS) for tensors with more than min_elements elements.

useIOTilesToLoad(use=True)

Load tensor through IO tiles

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

useIOTilesToStore(use=True)

Use IO tiles to store tensors.

(relevant for replicated tensor sharded tensors)

Parameters

use (bool) – Use IO tiles if True, use Compute tiles if False.

useOnChipStorage(use=True)

Permanent tensor storage

Parameters

use (bool) – True: use on chip memory, False: use off chip memory. None: keep it undefined.

useReplicatedTensorSharding(use=True)

Enable replicated tensor sharding

(relevant for weights and optimizer states)
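As a further sketch, optimizer state could be placed off chip with replicated tensor sharding enabled for larger tensors. This assumes, as in the example at the top of this class, that each setter returns the settings object so calls can be chained; the threshold value is illustrative only:

>>> opts = poptorch.Options()
>>> opts.TensorLocations.setOptimizerLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False)
...     .useReplicatedTensorSharding(True)
...     .minElementsForReplicatedTensorSharding(4096))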

3.2. Model wrapping functions

The basis of PopTorch integration comes from these two model wrapping functions.

3.2.1. poptorch.trainingModel

poptorch.trainingModel(model, options=None, optimizer=None)

Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (poptorch.Options) – The IPU specific options.

  • optimizer (torch.optim.Optimizer) – The optimizer to apply during training.

Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

Listing 3.1 An example of the use of poptorch.trainingModel()
 1import poptorch
 2import torch
 3
 4
 5class ExampleModelWithLoss(torch.nn.Module):
 6    def __init__(self):
 7        super().__init__()
 8        self.fc = torch.nn.Linear(10, 10)
 9        self.loss = torch.nn.MSELoss()
10
11    def forward(self, x, target=None):
12        fc = self.fc(x)
13        if self.training:
14            return fc, self.loss(fc, target)
15        return fc
16
17
18torch.manual_seed(0)
19model = ExampleModelWithLoss()
20
21# Wrap the model in our PopTorch annotation wrapper.
22poptorch_model = poptorch.trainingModel(model)
23
24# Some dummy inputs.
25input = torch.randn(10)
26target = torch.randn(10)
27
28# Train on IPU.
29for i in range(0, 100):
30    # Each call here executes the forward pass, loss calculation, and backward
31    # pass in one step.
32    # Model input and loss function input are provided together.
33    poptorch_out, loss = poptorch_model(input, target)
34    print(f"{i}: {loss}")
35
36# Copy the trained weights from the IPU back into the host model.
37poptorch_model.copyWeightsToHost()
38
39# Execute the trained weights on host.
40model.eval()
41native_out = model(input)
42
43# Models should be very close to native output although some operations are
44# numerically different and floating point differences can accumulate.
45torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-06, atol=1e-06)

3.2.2. poptorch.inferenceModel

poptorch.inferenceModel(model, options=None)

Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.

Parameters
  • model (torch.nn.Module) – The PyTorch model to wrap.

  • options (poptorch.Options) – The IPU specific options

Returns

The poptorch.PoplarExecutor wrapper to use in place of model.

Listing 3.2 An example of the use of poptorch.inferenceModel()
 1import poptorch
 2import torch
 3import torchvision
 4
 5# Some dummy imagenet sized input.
 6picture_of_a_cat_here = torch.randn([1, 3, 224, 224])
 7
 8# The model, in this case a MobileNet model with pretrained weights that comes
 9# canned with Pytorch.
10model = torchvision.models.mobilenet_v2(pretrained=True)
11model.train(False)
12
13# Wrap in the PopTorch inference wrapper
14inference_model = poptorch.inferenceModel(model)
15
16# Execute on IPU.
17out_tensor = inference_model(picture_of_a_cat_here)
18
19# Get the top 5 ImageNet classes.
20top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
21print(top_five_classes)
22
23# Try the same on native PyTorch
24native_out = model(picture_of_a_cat_here)
25
26native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)
27
28# Models should be very close to native output although some operations are
29# numerically different and floating point differences can accumulate.
30assert any(top_five_classes[1][0] == native_top_five_classes[1][0])

3.2.3. poptorch.PoplarExecutor

class poptorch.PoplarExecutor(model, options, training, optimizer=None, user_model=None)

This class should not be created directly but is a wrapper around the model that was passed into inferenceModel or trainingModel. It only has a few methods which can be used to interface with the IPU.

__call__(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Note

The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.

compile(*args, **kwargs)

Takes the same arguments as the wrapped PyTorch model.__call__.

Trace and compile the wrapped model if no executable has been created yet.

copyWeightsToDevice()

Copies the weights from model.parameters() to the IPU device. Implicitly called on first call.

copyWeightsToHost()

Updates the parameters used in model with the weights stored on device. (The weights in model.parameters())

destroy()

Destroy the model: release the IPUs and the executable.

setOptimizer(optimizer)

Sets the optimiser for a training model. Will overwrite the previous one. Supported optimisers: optim.SGD, optim.AdamW, optim.RMSprop.

Note

The PoplarExecutor will implicitly keep in sync the parameters of the source PyTorch model and the PopTorch model(s). However, weights need to be explicitly copied if the model is trained on the CPU and inference is run on the IPU.

model = Model()
poptorch_train = poptorch.trainingModel(model)
poptorch_inf = poptorch.inferenceModel(model)

train(poptorch_train)
torch.save(model.state_dict(), "model.save") # OK
validate(poptorch_inf) # OK
validate(model) # OK

train(model)
# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)

3.3. Parallel execution

This section demonstrates multi-IPU strategies for parallel execution in PopTorch. We recommend that you start such parallel programming from PopTorch code that is working properly on a single IPU.

There are four kinds of execution strategies in total to run a model on a multi-IPU device: poptorch.ShardedExecution, poptorch.PipelinedExecution, poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. These execution strategies are set through poptorch.Options.setExecutionStrategy(). The default execution strategy is poptorch.PipelinedExecution. In the following, we first introduce the general APIs that apply to all four parallel execution strategies. Then we explain the four strategies with examples.

By default, PopTorch will not let you run the model if the number of IPUs is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. However, you can also enable poptorch.Options.autoRoundNumIPUs() to automatically round up the number of IPUs reserved to a power of 2, with the excess being reserved but idle. This option is not enabled by default to prevent unintentional overbooking of IPUs.

3.3.1. Annotation tools

poptorch.Block and poptorch.BeginBlock

poptorch.BeginBlock and poptorch.Block are indispensable wrapper classes to define model parallelism in a multi-IPU device. You can use poptorch.Block to define a scope in the context of the model.

class poptorch.Block(user_id=None, ipu_id=None)

Runs all layers called inside this scope on a specified IPU.

>>> with poptorch.Block("IPU0"):
...     self.layer = MyLayer(x)
__init__(user_id=None, ipu_id=None)
Parameters
  • user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.

  • ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

static useAutoId()

Call this method at the beginning of your forward() method to enable automatic block id generation.

Blocks with a None user_id will be assigned an automatic id which will be the index of this block in the list of id-less Blocks.

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"): # user_id = "special_block"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()

poptorch.BeginBlock is an annotation defined outside the model and applied to the current layer and all subsequent layers.

class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)

Runs all layers from the given layer until the beginning of the next block on a specified IPU.

All layers after this layer will also run on the same IPU until another BeginBlock is encountered.

By default PipelinedExecution will be used, however this can be overridden in the poptorch.Options.

>>> self.layer = poptorch.BeginBlock(MyLayer(x))
__init__(layer_to_call, user_id=None, ipu_id=None)

All subsequent layers of the network will be part of this block until another layer is wrapped.

Parameters
  • layer_to_call (torch.nn.Module) – The layer to run on the specified IPU.

  • user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually create Stages and Phases.

  • ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.

poptorch.BeginBlock or poptorch.Block alone is enough to enable parallel execution in the simplest case. By default, the layers before the first poptorch.BeginBlock will be placed on IPU 0. The complete code examples for poptorch.BeginBlock and poptorch.Block are shown below. All layers before model.bert.encoder.layer[0] will be on IPU 0 and all layers from model.bert.encoder.layer[0] onwards (inclusive) will be on IPU 1.

Listing 3.3 Annotations can be attached to layers in existing models.
 1import transformers
 2import torch
 3import poptorch
 4
 5# A bert model from hugging face. See the packaged BERT example for actual usage.
 6pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'
 7model = transformers.BertForQuestionAnswering.from_pretrained(
 8    pretrained_weights)
 9
10# A handy way of seeing the names of all the layers in the network.
11print(model)
12
13# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
14# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
15model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
16                                                  ipu_id=1)
17
18# Now all layers before layer are on IPU 1 and this layer onward is on IPU 2
19model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
20                                                  ipu_id=2)
21
22# Finally all layers from this layer till the end of the network are on IPU 3.
23model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
24                                                  ipu_id=3)
25
26# We must batch the data by at least the number of IPUs. Each IPU will still execute
27# whatever the model batch size is.
28data_batch_size = 4
29
30# Create a poptorch.Options instance to override default options
31opts = poptorch.Options()
32opts.deviceIterations(data_batch_size)
Listing 3.4 PopTorch also supports annotating the model directly. Both forms can be used interchangeably.
 1class Network(torch.nn.Module):
 2    def __init__(self):
 3        super().__init__()
 4        self.layer1 = torch.nn.Linear(5, 10)
 5        self.layer2 = torch.nn.Linear(10, 5)
 6        self.layer3 = torch.nn.Linear(5, 5)
 7        self.layer4 = torch.nn.Linear(5, 5)
 8
 9        self.act = torch.nn.ReLU()
10        self.softmax = torch.nn.Softmax(dim=1)
11
12    def forward(self, x):
13
14        # Explicit layers on a certain IPU
15        poptorch.Block.useAutoId()
16        with poptorch.Block(ipu_id=0):
17            x = self.act(self.layer1(x))
18
19        with poptorch.Block(ipu_id=1):
20            x = self.act(self.layer2(x))
21
22        with poptorch.Block(ipu_id=2):
23            x = self.act(self.layer3(x))
24            x = self.act(self.layer4(x))
25
26        with poptorch.Block(ipu_id=3):
27            x = self.softmax(x)
28        return x
29
30
31model = Network()
32opts = poptorch.Options()
33opts.deviceIterations(4)
34poptorch_model = poptorch.inferenceModel(model, options=opts)
35print(poptorch_model(torch.rand((4, 5))))

Both poptorch.BeginBlock and poptorch.Block need to follow a set of rules:

  • All the layers must be declared inside a poptorch.Block scope; this avoids missing annotations. poptorch.BeginBlock doesn’t have the same constraint because all the layers called after it will automatically be added to the last poptorch.BeginBlock.

  • Please note that PopTorch needs to reserve IPUs in powers of 2 or multiples of 64. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.

  • Unused or dead layers should NOT be included in any poptorch.BeginBlock or poptorch.Block.

  • If layer A happens before layer B inside the model and each layer has a poptorch.BeginBlock associated with it, you need to write poptorch.BeginBlock for layer A before poptorch.BeginBlock for layer B.

Failing to obey the above rules will result in compilation errors.

poptorch.Stage and poptorch.AutoStage

Conceptually, poptorch.BeginBlock or poptorch.Block collects the layers of a model into a poptorch.Stage; multiple stages can be combined into a poptorch.Phase, and multiple phases form a parallel execution strategy.

poptorch.Stage

poptorch.Stage defines some layers of the model to run on one IPU. It can be made of one or more blocks created using poptorch.BeginBlock or poptorch.Block and identified by their user_id. Consecutive layers in a model can be defined either in the same poptorch.Stage or in consecutive stages. Whether stages run in parallel or sequentially depends on the specific parallel execution strategy.

class poptorch.Stage(*block_ids)

The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.

__init__(*block_ids)

Internally, each operation in a model is assigned a stage_id through poptorch.Stage.

poptorch.AutoStage

You can use poptorch.AutoStage if you don’t want to specify poptorch.Stage by hand. It will assign one poptorch.Stage per poptorch.BeginBlock or poptorch.Block.

class poptorch.AutoStage(value)

Defines how the stages are automatically assigned to blocks when the user didn’t explicitly provide stages to the IExecutionStrategy’s constructor.

  • SameAsIpu: The stage id will be set to the selected ipu number.

  • AutoIncrement: The stage id for new blocks is automatically incremented.

Examples:

>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...  layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...  layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...  layer()

By default, the following execution strategy is used:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)

which would translate to stage_id = ipu_id:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=0

Now if instead you use:

>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)

The last block would be in its own stage rather than sharing one with Block “0”:

  • Block “0” ipu=0 stage=0

  • Block “1” ipu=1 stage=1

  • Block “2” ipu=0 stage=2

By default poptorch.AutoStage.SameAsIpu is in use, which means the stage_id of poptorch.Stage will be set to the ipu_id specified for the poptorch.BeginBlock or poptorch.Block. Please note that stage_id must be ascending in poptorch.PipelinedExecution. Let’s use the code example above. If your blocks “0”, “1”, and “2” are assigned to IPUs 0, 1, and 0, then poptorch.Block “2” will be assigned stage_id 0. This will make the compiler fail to schedule the last two stages “1” and “2” due to a conflict:

  • The model implies “1” should run earlier than “2”.

  • Their stage_id values suggest “2” should run earlier than “1”.

When poptorch.AutoStage.AutoIncrement is in use, each new poptorch.BeginBlock or poptorch.Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.

poptorch.Phase

poptorch.Phase defines a processing unit of phased execution. It may contain one or more poptorch.Stage. poptorch.Phase is only used in poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. It is not used in poptorch.ShardedExecution and poptorch.PipelinedExecution.

class poptorch.Phase(arg)

Represents an execution phase

__init__(arg)

Create a phase.

Parameters

arg (str, poptorch.Stage, [poptorch.Stage], [str]) – must either be one or more Stages, or one or more block user_ids.

If one or more strings are passed they will be interpreted as Block ids representing a single Stage.

Within a Phase, the stages will be executed in parallel.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = Phase("A","B") # One Stage made of 2 blocks

In the second-to-last line above, “A” and “B” will run in parallel on IPUs 0 and 1 simultaneously since they are placed in two stages. In the last line, they will run sequentially on one IPU since they are placed in a single stage.

Advanced annotation with strings

You can use Python strings to represent the user_id and ipu_id for a poptorch.Block or poptorch.BeginBlock. Since strings are evaluated at runtime, they allow for a dynamic number of stages and phases. Here is an example of using formatted strings (f-strings) in poptorch.ParallelPhasedExecution.

In the code example below, f-strings are used on two lines in the forward() method. One is f"phase{phase}_ipu{ipu}" at line 25, where phase takes the values 0, 1, 1, 2, 3, 3, 4, 5, and 5, and ipu ranges from 0 to 1. The total number of instances of this f-string is 12, due to 6 phases and 2 IPUs. The other is f"phase{N*2-1}_ipu1" at line 32, where phase is 5 and ipu is 1. When defining poptorch.Stage, four f-strings are used, where n ranges from 0 to 2, at lines 47-48 and 51-52:

  • f"phase_{2*n}_ipu0"

  • f"phase{2*n}_ipu1"

  • f"phase_{2*n+1}_ipu0"

  • f"phase{2*n+1}_ipu1"

They refer to phases 0, 2, 4 and 1, 3, 5, with ipu0 and ipu1 respectively. So all these 12 f-strings are defined in poptorch.Block and used in poptorch.Stage dynamically. They match exactly.

Listing 3.5 An example of parallel phased execution
 1poptorch.setLogLevel(1)  # Force debug logging
 2N = 3
 3size = 10
 4
 5
 6class Model(torch.nn.Module):
 7    def __init__(self):
 8        super().__init__()
 9        self.weights = []
10        for n in range(N * 6):
11            weight = torch.nn.Parameter(torch.rand(size, size),
12                                        requires_grad=True)
13            self.register_parameter(f"w{n}", weight)
14            self.weights.append(weight)
15
16    def forward(self, in0, target=None):
17        phase = 0
18        weight = iter(self.weights)
19        with poptorch.Block("phase0_ipu0"):
20            ins = torch.split(in0, size)
21        for n in range(N * 3):
22            out = []
23            for ipu in range(2):
24                x = ins[ipu]
25                with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                    x = torch.matmul(next(weight), x)
27                    out.append(F.relu(x))
28            ins = out[1], out[0]
29            # We want 2 matmuls in the same phase
30            if n % 3 != 1:
31                phase += 1
32        with poptorch.Block(f"phase{N*2-1}_ipu1"):
33            res = ins[0] + ins[1]
34            if target is None:
35                return res
36            return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39input = torch.rand(size * 2, 1)
40target = torch.rand(size, 1)
41model = Model()
42opts = poptorch.Options()
43phases = []
44# Alternate between 0-2 and 1-3
45for n in range(N):
46    phases.append([
47        poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48        poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49    ])
50    phases.append([
51        poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52        poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53    ])
54opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55poptorch_model = poptorch.trainingModel(model, opts)
56poptorch_model.compile(input, target)

3.3.2. Parallel execution strategies

With the above APIs as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below. Note that the same annotation can be used for each of them. They only differ in the method of parallelisation and tensor locations.

poptorch.ShardedExecution

In this strategy, each IPU will sequentially execute a distinct part of the model. The single unit of processing in poptorch.ShardedExecution is a shard. A shard is specified using poptorch.Stage, or, if no poptorch.Stage is specified, the user_id passed by poptorch.BeginBlock or poptorch.Block is used. Each shard is executed sequentially on a single IPU. Multiple shards can be placed on multiple IPUs. However, only one IPU is used at a time, while the other IPUs are idle. If an IPU is allocated to run consecutive stages, PopART will merge those consecutive stages into one on the same IPU. Weights and activations will use the on-chip memory of the IPUs. Layers sharing weights need to be placed on the same IPU.

poptorch.ShardedExecution can be useful for processing a single sample or debugging. Overall it has low efficiency since only one IPU is used at a time.

class poptorch.ShardedExecution(*args)

Will shard the execution of the passed Stages or if no stage is passed will consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on the block names
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

poptorch.PipelinedExecution

This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.

Parallelisation in poptorch.PipelinedExecution requires deviceIterations() and gradientAccumulation(), as explained in Efficient data batching. After one poptorch.Stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel. An IPU can only start its own poptorch.Stage of a batch if its previous poptorch.Stage of the current batch has been processed. Hence, all IPUs will be occupied after a warm-up period. A cool-down period is required to aggregate the results and apply weight changes.
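As a sketch of the options typically set together for pipelining (the values are illustrative only; see Efficient data batching for how they interact with input and output sizes):

>>> opts = poptorch.Options()
>>> opts.deviceIterations(8)
>>> opts.Training.gradientAccumulation(4)
>>> opts.setExecutionStrategy(
...     poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu))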

class poptorch.PipelinedExecution(*args)
__init__(*args)

Pipeline the execution of the passed Stages or if no stage is passed consider each unique Block name encountered during tracing as a different stage.

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Create a 3 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A","B","C"))
>>> # Create a 2 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...    poptorch.Stage("A","B"),
...    "C"))
>>> # Automatically create a 3 stages pipeline based on the block names
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
Parameters

args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a poptorch.AutoStage strategy or an explicit list of stages or block ids.

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

Phased execution

poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution have the following features in common:

  • A portion of the weights and activations is transferred to and from streaming memory before and after each phase.

  • If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.

  • This specific portion is the one needed by the layers of the model wrapped in poptorch.BeginBlock or poptorch.Block in the current poptorch.Phase.

  • They both trade off some performance for larger models with higher memory needs.

  • Any number of phases is allowed.

  • The number of stages in each poptorch.Phase should match the number of IPUs in each group of IPUs.

  • Stages inside each poptorch.Phase can run in parallel.

Although you only define the poptorch.Phase for the forward pass, the corresponding phases for the backward pass are created automatically. The order of phased execution for the backward pass won’t change, but you can decide whether a phase is shared by both the forward and backward passes. In other words, you decide whether to avoid a memory transfer of a portion of the weights and activations.

poptorch.SerialPhasedExecution

In poptorch.SerialPhasedExecution, phases execute on a single group of IPUs sequentially.

class poptorch.SerialPhasedExecution(*phases)

All the phases run serially on a single group of IPUs.

For example:

  • phase 0 runs on ipu 0 & 1

  • phase 1 runs on ipu 0 & 1

  • phase 2 runs on ipu 0 & 1

>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0,1)
>>> strategy.phase(1).ipus(0,1)
>>> strategy.phase(2).ipus(0,1)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of the phases. Must be either:

  • a list of poptorch.Phase

  • a list of lists of poptorch.Stage

  • a list of lists of block user_ids (str)

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

setTensorsLiveness(liveness)

See poptorch.Liveness for more information

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4

poptorch.ParallelPhasedExecution

In poptorch.ParallelPhasedExecution, phases are executed in parallel alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from the streaming memory, if the desired weights and activations are already available in another group of IPUs.

class poptorch.ParallelPhasedExecution(*phases)

Phases are executed in parallel alternating between two groups of IPUs.

For example:

  • phase 0 runs on ipu 0 & 2

  • phase 1 runs on ipu 1 & 3

  • phase 2 runs on ipu 0 & 2

>>> poptorch.Block.useAutoId()
>>> with poptorch.Block(): # user_id = "0"
...     layer()
>>> with poptorch.Block(): # user_id = "1"
...     layer()
>>> with poptorch.Block(): # user_id = "2"
...     layer()
>>> with poptorch.Block(): # user_id = "3"
...     layer()
>>> with poptorch.Block(): # user_id = "4"
...     layer()
>>> with poptorch.Block(): # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0,2)
>>> strategy.phase(1).ipus(1,3)
>>> strategy.phase(2).ipus(0,2)
>>> opts.setExecutionStrategy(strategy)
__init__(*phases)

Execute the model’s blocks in phases

Parameters

phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) –

Definition of the phases. Must be either:

  • a list of poptorch.Phase

  • a list of lists of poptorch.Stage

  • a list of lists of block user_ids (str)

phase(phase)

Return the requested poptorch.Phase

Parameters

phase (int) – Index of the phase

stage(block_id)

Return the poptorch.Stage the given block belongs to.

Parameters

block_id (str) – A block id.

useSeparateBackwardPhase(use=True)

Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:

fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2

Note

The end of the forward pass and the beginning of the backward pass are part of the same phase.

If useSeparateBackwardPhase(True) is used then no phase will be shared between the forward and backward passes:

fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4

In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so these two numbers match. Even phases 0 and 2 run on IPU 0 and 2, while odd phase 1 runs on IPU 1 and 3 as required. This allows for faster cross-IPU copies, both inter-phase and intra-phase.

poptorch.Liveness

poptorch.Liveness controls the availability of tensors on IPU, and is only needed for poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution.

class poptorch.Liveness(value)

When using phased execution:

  • AlwaysLive: The tensors always stay on the IPU between the phases.

  • OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.

  • OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.

The default poptorch.Liveness is AlwaysLive. OffChipAfterFwd and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.
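For example, a sketch of selecting OffChipAfterFwd through setTensorsLiveness() on a serial phased strategy; blocks "A" and "B" are assumed to have been defined in the model's forward method:

>>> strategy = poptorch.SerialPhasedExecution(
...     poptorch.Phase(poptorch.Stage("A")),
...     poptorch.Phase(poptorch.Stage("B")))
>>> strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)
>>> opts = poptorch.Options()
>>> opts.setExecutionStrategy(strategy)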

3.4. Optimizers

You can use a number of optimizers with PopTorch. In addition, PopTorch has features to support float16 models, such as loss scaling.

class poptorch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, loss_scaling=1.0, velocity_scaling=1.0)

Stochastic gradient descent with optional momentum.

The optimizer matches PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.

Nesterov momentum is not currently supported.

__init__(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, loss_scaling=1.0, velocity_scaling=1.0)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float) – learning rate.

  • momentum (float, optional) – momentum factor.

  • dampening (float, optional) – dampening term for momentum.

  • weight_decay (float, optional) – Weight decay (L2 penalty) factor.

  • nesterov (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • velocity_scaling (float, optional) – Factor by which to scale the velocity values to assist numerical stability when using float16.

class poptorch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, loss_scaling=1.0, biasCorrection=True, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)

Adam optimizer with true weight decay.

This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.

AMSGrad is currently not supported.

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, loss_scaling=1.0, biasCorrection=True, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – Weight decay factor.

  • amsgrad (bool, optional) – Not supported (must be False).

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

  • accumType (torch.dtype, optional) – data type used for gradients.

  • firstOrderMomentumAccumType (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • secondOrderMomentumAccumType (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

class poptorch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, loss_scaling=1.0)

RMSprop optimizer with optional L2 penalty.

This optimizer matches PyTorch’s implementation (torch.optim.RMSprop) with optional loss scaling.

__init__(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, loss_scaling=1.0)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate.

  • alpha (float, optional) – smoothing constant.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – L2 penalty coefficient.

  • momentum (float, optional) – momentum factor.

  • centered (bool, optional) – True: compute centred RMSProp in which the gradient is normalized by an estimate of its variance.

  • loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.

class poptorch.optim.LAMB(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, biasCorrection=True, loss_scaling=1.0, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)

Layer-wise Adaptive Moments (LAMB) optimizer (biased version).

Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).

The scaling function phi(z) is fixed as min(z, mwn); mwn is fixed at 10.0.

__init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, biasCorrection=True, loss_scaling=1.0, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
Parameters
  • params (iterable) – parameters to optimize.

  • lr (float, optional) – learning rate

  • betas (tuple, optional) – (beta1, beta2) parameters used in LAMB.

  • eps (float, optional) – term added to the denominator to ensure numerical stability.

  • weight_decay (float, optional) – (AdamW) weight decay factor.

  • accumType (torch.dtype, optional) – data type used for gradients.

  • firstOrderMomentumAccumType (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.

  • secondOrderMomentumAccumType (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.

step(closure=None)

Performs a single optimization step (parameter update).

Parameters

closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Note

Unless otherwise specified, this function should not modify the .grad field of the parameters.

3.4.1. Loss scaling

When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing. Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.

Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
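For example, a minimal sketch of passing a loss scaling factor to one of the PopTorch optimizers; the value 1024.0 is purely illustrative and model is assumed to be an existing float16 training model:

>>> optimizer = poptorch.optim.AdamW(model.parameters(), lr=1e-3,
...                                  loss_scaling=1024.0)
>>> poptorch_model = poptorch.trainingModel(model, optimizer=optimizer)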

3.4.2. Velocity scaling (SGD only)

The SGD optimizer, when used with momentum, updates weights based on the velocity values. At each update step, the new velocity is a combination of the gradients derived from the loss function and the previous velocity value. Similar to loss scaling, the velocity_scaling parameter allows the velocity values to be scaled to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling so the loss_scaling has no impact on the effective scaling of velocity parameters.)

As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
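For example, a sketch combining both scaling factors with momentum SGD; the values are illustrative only:

>>> optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9,
...                                loss_scaling=1024.0, velocity_scaling=128.0)
>>> poptorch_model = poptorch.trainingModel(model, optimizer=optimizer)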

3.5. Custom ops

These are helper operations to be used within a model.

3.5.1. poptorch.ipu_print_tensor

class ipu_print_tensor(tensor_to_print, optional_title)

Adds a tensor to be printed on the IPU. When this is executed the tensor will be copied back to host and printed.

When this operation is called in the backward pass it will print the gradient of the tensor.

The operation is an identity operation and it will return the exact same tensor. The returned tensor should be used in place of the original tensor in the rest of the program, to make sure that the print operation isn’t optimised away.

For example if the original code looks like this:

def forward(self, c, d, b):
  a = c + d
  return a + b

And you want to print the value of a. If you do:

def forward(self, c, d, b):
  a = c + d
  poptorch.ipu_print_tensor(a)
  return a + b

Optionally, you may add a second string parameter to be used as a title.


def forward(self, c, d, b):
  a = c + d
  poptorch.ipu_print_tensor(a, "summation")
  return a + b

The result of ipu_print_tensor is not used, therefore it will be optimised out by the graph optimiser and a will not be printed.

Instead you should do:

def forward(self, c, d, b):
  a = c + d
  x = poptorch.ipu_print_tensor(a)
  return x + b

Warning

In order for the print operation to not be optimised out by the graph optimiser, you must use the output of the print.

Parameters

ipu_print_tensor – The tensor to print.

Returns

The input unchanged.

 1class ExampleModel(torch.nn.Module):
 2    def __init__(self):
 3        super().__init__()
 4        self.bias = torch.nn.Parameter(torch.zeros(()))
 5
 6    def forward(self, x):
 7        x += 1
 8
 9        # It is important to make sure the result of the print is used.
10        x = poptorch.ipu_print_tensor(x)
11
12        return x + self.bias

3.5.2. poptorch.identity_loss

This function is used to implement custom losses. This takes in a single PyTorch tensor and will backpropagate a gradient of ones through it.

Warning

Passing a PyTorch loss function or another identity_loss to this function is not supported. Multiple losses must be implemented via composite PyTorch ops.

poptorch.identity_loss(x, reduction)

Marks this operation as being part of the loss calculation and, as such, will back-propagate through it in the PopTorch autograd. This enables multiple losses and custom losses.

Parameters
  • loss (torch.Tensor) – The calculated loss.

  • reduction (str) –

    Reduce the loss output as per PyTorch loss semantics. Supported values are:

    • "sum": Sum the losses.

    • "mean": Take the mean of the losses.

    • "none": Don’t reduce the losses.

Returns

An identity loss custom op.

 1def custom_loss(output, target):
 2    # Mean squared error with a scale
 3    loss = output - target
 4    loss = loss * loss * 5
 5    return poptorch.identity_loss(loss, reduction="mean")
 6
 7
 8class ExampleModelWithCustomLoss(torch.nn.Module):
 9    def __init__(self):
10        super().__init__()
11        self.model = ExampleModel()
12
13    def forward(self, input, target):
14        out = self.model(input)
15        return out, custom_loss(out, target)

3.5.3. poptorch.MultiConv

Use poptorch.MultiConv wrapper class to define multi-convolutions.

class poptorch.MultiConv

Combines all convolution layers evaluated inside this scope into a single multi-convolution.

Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.

For example:

>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)

Combines the two data-independent convolutions into a single multi-convolution.

Refer to the PopLibs documentation for further information on multi-convolutions.

availableMemoryProportions(value)

The available memory proportion per convolution, each in the range [0, 1).

Parameters

value (float, [float]) – Can be a float value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many float values as the number of convolutions.

Returns

self, to support method chaining

cycleBackOff(value)

Cycle back off proportion.

Parameters

value (float) – Number between 0 and 1

Returns

self, to support method chaining

partialsTypes(value)

The partials type used for each convolution.

Parameters

value (MultiConvPartialsType, [MultiConvPartialsType]) – Can be a single instance of poptorch.MultiConvPartialsType in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many poptorch.MultiConvPartialsType values as the number of convolutions.

Returns

self, to support method chaining

perConvReservedTiles(value)

Tiles to reserve for each convolution.

Parameters

value (int) – Number of tiles

Returns

self, to support method chaining

planType(value)

Select the multi-convolution execution strategy.

Parameters

value – An instance of MultiConvPlanType.

Returns

self, to support method chaining

class poptorch.MultiConvPartialsType(value)

Type for the partials of each convolution of a poptorch.MultiConv

  • Float

  • Half

class poptorch.MultiConvPlanType(value)

Selects the execution strategy for a poptorch.MultiConv

  • Parallel: Execute multiple convolutions in parallel (Default).

  • Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.
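Because each setter returns self, the settings can be configured on the scope object before entering it. A sketch, reusing convA and convB from the example above; the proportion and type values are illustrative only:

>>> multi_conv = poptorch.MultiConv()
>>> multi_conv.availableMemoryProportions([0.3, 0.3])
>>> multi_conv.partialsTypes(poptorch.MultiConvPartialsType.Half)
>>> multi_conv.planType(poptorch.MultiConvPlanType.Parallel)
>>> with multi_conv:
...     y = self.convA(x)
...     v = self.convB(u)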

3.5.4. poptorch.custom_op

This is for users who are familiar with PopART. If you need some special features that are not supported in PopART, you may write a PopART custom op. For more information about how to create PopART custom ops, see Creating custom operations and Building custom operators using PopART. You can call such a PopART custom op using poptorch.custom_op in PopTorch.

It takes three steps to enable a PopART custom op in PopTorch.

First, set the Poplar and PopART environment variables as shown in Setting the environment variables and compile the PopART custom op. You can compile your custom op C++ code and link it with Poplar and PopART to generate a dynamic library. Please refer to the custom op code custom_cube_op.cpp and its CMakeLists.txt under poptorch/tests/custom_ops.

Second, load the dynamic library.

Finally, use poptorch.custom_op to finish the call. Its wrapper class is specified below.

class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs)

Applies a custom operation, implemented within PopART, to the inputs.

Parameters
  • inputs (tuple) – A tuple of input tensors, for example, (x, y).

  • name (str) – unique name of the PopART custom op

  • domain (str) – domain for the op

  • domain_version (int) – version of the domain to use

  • example_outputs (iterable) – a tuple of tensors with the same type and shape as the outputs; the value does not matter as all values will be set to zero for tracing purposes.

Returns

The outputs of the forward op of the custom op.
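The original listing for this example is not reproduced in this extract, so the following is an illustrative sketch of such a call from inside a model's forward method. The op name "Cube", the domain "com.acme" and the domain version 1 are hypothetical placeholders; they must match whatever your PopART custom op registers:

>>> # x and y are input tensors inside the model's forward method.
>>> # example_outputs reuses x purely as a shape/type template for the
>>> # two outputs; its values are ignored.
>>> out = poptorch.custom_op((x, y),
...                          "Cube",       # hypothetical op name
...                          "com.acme",   # hypothetical domain
...                          1,            # hypothetical domain version
...                          example_outputs=[x, x])
>>> result = out[0]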

In the PopART custom op, both forward op and backward op are implemented. In the PopTorch inference model, only the forward op will be called.

In the code example above, example_outputs is assigned as [x, x], where x is one of the input tensors and is used as a template to provide the right number of output tensors. The real outputs will be allocated memory, calculated and returned by the custom op. You can also call this custom op inside a training model using exactly the same interface of poptorch.custom_op, and the backward op will be called automatically.

3.5.5. poptorch.nop

PopTorch includes a “no-op” function for debugging purposes.

poptorch.nop(tensor)

A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.

Parameters

tensor (torch.Tensor) – the tensor to simply return by the no-op.

Returns

The same tensor which was input.

Return type

torch.Tensor

3.5.6. poptorch.serializedMatMul

Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.

poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)

Calculates a matrix product using a serialized matrix multiplication.

The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.

Parameters
  • lhs (torch.Tensor) – Left-hand size input matrix.

  • rhs (torch.Tensor) – Right-hand side input matrix.

  • mode (poptorch.MatMulSerializationMode) –

    Which dimension of the matmul to serialize on: for matrix A (m by n) multiplied by matrix B (n by p):

    • InputChannels: Split across the input channels (dimension m).

    • ReducingDim: Split across the reducing dimension (n).

    • OutputChannels: Split across the output channels (dimension p).

    • Disabled: Same as an ordinary matrix multiplication.

  • factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.

  • keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If keep_precision is True, these additions will occur using float32 rather than half precision partials, matching those used for the individual matrix multiplications.
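For example, a sketch of serializing a large matmul over its output channels into 4 smaller multiplications; the shapes are illustrative, and the factor must divide the serialized dimension:

>>> lhs = torch.randn(64, 512)
>>> rhs = torch.randn(512, 1024)
>>> out = poptorch.serializedMatMul(
...     lhs, rhs, poptorch.MatMulSerializationMode.OutputChannels, factor=4)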

3.5.7. poptorch.set_available_memory

Use this function to override the proportion of tile memory available to be used as temporary memory by a convolution or matrix multiplication.

poptorch.set_available_memory(tensor, available_memory_proportion)

Sets the available memory for a convolution or matrix multiplication.

When called on the output of a convolution or a matrix multiplication, it sets the proportion of tile memory (between 0 and 1) to be made available as temporary memory for that convolution or matrix multiplication. Less temporary memory will reduce the time performance but may use less memory overall. However, lower memory proportions result in the use of more live (not temporary) memory, so the overall memory use may increase for values that are too low, possibly resulting in out of memory errors.

In the event that the value is too low, the planner will replan for the smallest memory usage possible.

>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
Parameters
  • tensor (torch.Tensor) – output tensor of a convolution or matrix multiplication (otherwise the statement will be an identity).

  • available_memory_proportion (float) – proportion between 0.0 and 1.0 of tile memory to be made available for temporary memory (default 0.6).

Returns

input tensor, as if calling an identity function.

Return type

torch.Tensor

3.6. Miscellaneous functions

These PopTorch functions, not related to model creation, are available:

poptorch.ipuHardwareIsAvailable()

Indicates whether IPU hardware is available to use.

Returns

True if physical IPUs are available, False otherwise.

Return type

bool
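For example, a sketch of falling back to the IPU Model when no hardware is present, using poptorch.Options.useIpuModel described earlier:

>>> opts = poptorch.Options()
>>> if not poptorch.ipuHardwareIsAvailable():
...     opts.useIpuModel(True)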

poptorch.setLogLevel(level)

Changes the volume of messages printed in the console (stdout)

Parameters

level (str) –

  • TRACE: Print all messages.

  • DEBUG: Print debug messages and above.

  • INFO: Print info messages and above.

  • WARN: Print warnings and errors.

  • ERR: Print errors only.

  • OFF: Print nothing.

3.7. Half / float 16 support

PopTorch supports the half-precision floating point (float 16) format. You can simply input float 16 tensors into your model. (You can convert a tensor to float 16 using tensor = tensor.half())

You can use your models in one of the following ways:

  1. Convert all parameters (weights) to float 16 by using a Module’s half() method. This is the most memory efficient, however small updates to weights may be lost, hindering training.

  2. Keep the parameters (weights) as float 32, in which case the parameter updates will occur using float 32. However, the parameters will be converted to float 16 if you call an operation with a float 16 input. This is more memory efficient than using float 32 tensors (inputs) but less memory efficient than using float 16 weights.

  3. Use a mix of float 32 and float 16 parameters by manually specifying parameters as float 16 or float 32.

Note

When PyTorch encounters a mix of float 16 and float 32 inputs for a given operation, it will usually cast all inputs to float 32. PopTorch differs and will cast all inputs to float 16. This makes it easier to build models with float 32 weights which take float 16 tensors.

Listing 3.6 How to run a model using half precision
1model = torch.nn.Linear(1, 10).half()
2t1 = torch.tensor([1.]).half()
3
4inference_model = poptorch.inferenceModel(model)
5out = inference_model(t1)
6
7assert out.dtype == torch.half

Because PopTorch relies on the torch.jit.trace API, it is limited to tracing operations which run on the CPU. Many of these operations do not support float 16 inputs. To allow the full range of operations, PopTorch converts all float 16 inputs to float 32 before tracing and then restores the inputs to float 16 as part of the canonicalization process. Some operations may result in the model running in float 32 where float 16 would be expected, or vice versa (see Float 16 operations for full details).

3.8. Profiling

You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.

3.9. Environment variables

3.9.1. Logging level

PopTorch uses the following levels of logging:
  • OFF: No logging.

  • ERR: Errors only.

  • WARN: Warnings and errors only.

  • INFO: Info, warnings and errors. (Default)

  • DEBUG: Adds some extra debugging information.

  • TRACE and TRACE_ALL: Trace everything inside PopTorch.

The POPTORCH_LOG_LEVEL environment variable can be used to set the logging level:

export POPTORCH_LOG_LEVEL=DEBUG

3.9.2. Profiling

When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.

In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'

By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:

export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'

For more options, please refer to the PopVision Graph Analyser User Guide.

In order to capture the pvti reports needed for the PopVision System Analyser you only need to set PVTI_OPTIONS='{"enable":"true"}'.

You can also add extra tracepoints in your own code by using:

class poptorch.profiling.Channel(name)

Profiling channel.

Note

If the libpvti profiling library is not available at runtime this class becomes a no-op.

Example:

>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")
instrument(obj, *methods)

Instrument the methods of an object.

Parameters
  • obj – Object to instrument

  • methods – One or more methods to wrap in profiling tracepoints.

tracepoint(name)

Create a context tracepoint

>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()
Parameters

name – Name associated to this tracepoint.

3.9.3. IPU Model

By default PopTorch will try to attach to a physical IPU. If instead you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:

export POPTORCH_IPU_MODEL=1

Please see the Poplar and PopLibs User Guide for the limitations of the IPU Model.

3.9.4. Wait for an IPU to become available

By default if you try to attach to an IPU but all the IPUs in the system are already in use, an exception will be raised. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1.

export POPTORCH_WAIT_FOR_IPU=1