3. Features
3.1. Options
The compilation and execution on the IPU can be controlled using poptorch.Options.
See Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation and replication_factor interact with the output and input sizes.
- class poptorch.Options
Options controlling how a model is run on the IPU.
- property Distributed
Options specific to distributed execution.
See also poptorch.options._DistributedOptions.
- property Jit
Options specific to upstream PyTorch’s JIT compiler.
See also poptorch.options._JitOptions.
- property Popart
Options specific to the PopART backend. (Advanced users only).
See also poptorch.options._PopartOptions.
- property TensorLocations
Options related to tensor locations.
- property Training
Options specific to training.
See also poptorch.options._TrainingOptions.
- anchorMode(anchor_mode, anchor_return_period=None)
Specify which data to return from a model
- Parameters
anchor_mode (poptorch.AnchorMode) –
All: Return a result for each batch.
Sum: Return the sum of all the batches.
Final: Return the last batch.
EveryN: Return every N batches: N is passed in as anchor_return_period.
Default: All for inference, Final for training.
For example:
>>> opts = poptorch.Options()
... opts.anchorMode(poptorch.AnchorMode.All)
... # or
... opts.anchorMode(poptorch.AnchorMode.EveryN, 10)
- autoRoundNumIPUs(auto_round_num_ipus)
Whether or not to round up the number of IPUs used automatically: the number of IPUs requested must be a power of 2 or a multiple of 64. By default, an error occurs if an unsupported number of IPUs is used by the model, to prevent unintentional overbooking of IPUs.
- Parameters
auto_round_num_ipus (bool) –
True: round up the number of IPUs to a power of 2 or multiple of 64 automatically
False: error if the number of IPUs is not supported
- connectionType(connection_type)
When to connect to the IPU (if at all)
- Parameters
connection_type (poptorch.ConnectionType) –
Always: Attach to the IPU from the start (Default).
OnDemand: Wait until the compilation is complete and the executable is ready to be run to attach to the IPU.
Never: Never try to attach to an IPU. (Useful for offline compilation, but trying to run an executable will raise an exception).
For example:
>>> opts = poptorch.Options()
... opts.connectionType(poptorch.ConnectionType.OnDemand)
- defaultAnchorMode()
- Returns
True if the anchorMode is currently set to Default; False otherwise
- Return type
bool
- deviceIterations(device_iterations)
Number of iterations the device should run over the data before returning to the user. (Default: 1)
Essentially, it is the equivalent of launching the IPU in a loop over that number of batches. This is efficient because that loop runs on the IPU directly.
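For example, a minimal sketch (the value 4 is purely illustrative) which makes each call to the wrapped model process 4 batches on the IPU before returning:
>>> opts = poptorch.Options()
... opts.deviceIterations(4)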
- enableExecutableCaching(path)
If path is None: disable executable caching.
Otherwise use path as a cache to save / load Poplar executables.
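For example, a sketch that caches compiled executables in a local directory (the directory name is an arbitrary choice):
>>> opts = poptorch.Options()
... opts.enableExecutableCaching("./exe_cache")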
- logDir(log_dir)
Where to save log files (Default: Current directory)
- randomSeed(random_seed)
Set the seed for the random number generator on the IPU.
- replicationFactor(replication_factor)
Number of model replications (Default: 1).
For example if your model uses 1 IPU, a replication factor of 2 will use 2 IPUs. If your model is pipelined across 4 IPUs, a replication factor of 4 will use 16 IPUs total.
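For example, a sketch that runs two replicas of a single-IPU model (so 2 IPUs in total):
>>> opts = poptorch.Options()
... opts.replicationFactor(2)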
- setAvailableMemoryProportion(available_memory_proportion)
Memory is set on a per-IPU basis; this should be a dictionary of IPU ids and float values between 0 and 1.
For example:
{"IPU0": 0.5}
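A complete call might look like the sketch below (the 0.5 proportion is only an illustration):
>>> opts = poptorch.Options()
... opts.setAvailableMemoryProportion({"IPU0": 0.5})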
- setExecutionStrategy(strategy)
Set the execution strategy to use to partition the graph
- syncPattern(sync_pattern)
Set the IPU SyncPattern.
- Parameters
sync_pattern (poptorch.SyncPattern) –
Full
SinglePipeline
ReplicaAndLadder
- useIpuId(ipu_id)
Use the specified IPU id as provided by gc-info.
The number of IPUs associated with the id must be equal to the number of IPUs used by your graph multiplied by the replication factor.
For example, if your model uses 1 IPU and the replication factor is 2, you will need to provide an id with 2 IPUs.
If your model is pipelined across 4 IPUs and the replication factor is 4, you will need to provide an id containing 16 IPUs in total.
- Parameters
ipu_id (int) – IPU id as provided by gc-info.
- useIpuModel(use_model)
Use the IPU model or physical hardware.
Default: False (Real Hardware).
This setting takes precedence over the
POPTORCH_IPU_MODEL
environment variable.
- useOfflineIpuTarget(ipu_version=1)
Create an offline IPU target that can only be used for offline compilation.
Note
The offline IPU target cannot be used if the IPU model is enabled.
- Parameters
ipu_version (int) – IPU version to target (1 for mk1, 2 for mk2). Default: 1.
You can choose to use the IPU model or the real IPU hardware via poptorch.Options.useIpuModel.
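For instance, a sketch of options for offline compilation targeting a Mk2 IPU; pairing this with ConnectionType.Never follows the connectionType() description above, but the exact combination depends on your workflow:
>>> opts = poptorch.Options()
... opts.useOfflineIpuTarget(2)
... opts.connectionType(poptorch.ConnectionType.Never)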
- class poptorch.options._DistributedOptions
Options related to distributed execution.
Can be accessed via poptorch.Options.Distributed:
>>> opts = poptorch.Options()
>>> opts.Distributed.configureProcessId(0, 2)
- configureProcessId(process_id, num_processes)
Manually set the current process ID and the total number of processes.
- Parameters
process_id (int) – The ID of this process.
num_processes (int) – The total number of processes the execution is distributed over.
- disable()
Ignore the current options / environment variables and disable distributed execution.
- property numProcesses
Total number of processes the execution is distributed over.
- property processId
Id of the current process.
- setEnvVarNames(var_num_processes, var_process_id)
Utility to read and set processId and numProcesses from environment variables.
Useful if you use a third party library, such as mpirun, to manage the processes used for the distributed execution.
For example:
mpirun -np 4 myscript.py
By default the OpenMPI OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_RANK variables are used.
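The call below simply restates those defaults, so it is only a sketch of the interface; substitute the variable names used by your own launcher:
>>> opts = poptorch.Options()
... opts.Distributed.setEnvVarNames("OMPI_COMM_WORLD_SIZE", "OMPI_COMM_WORLD_RANK")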
- class poptorch.options._JitOptions
Options related to PyTorch's JIT compiler.
Can be accessed via poptorch.Options.Jit:
>>> opts = poptorch.Options()
>>> opts.Jit.traceModel(True)
- traceModel(trace_model)
If True: use torch.jit.trace
If False: use torch.jit.script (Experimental)
Trace model is enabled by default.
- class poptorch.options._TrainingOptions
Options specific to model training.
Note
You must not set these options for inference models.
Can be accessed via poptorch.Options.Training:
>>> opts = poptorch.Options()
>>> opts.Training.gradientAccumulation(4)
- gradientAccumulation(gradient_accumulation)
Number of samples to accumulate for the gradient calculation.
Accumulate the gradient N times before applying it. This is needed to train with models expressing pipelined model parallelism using the IPU annotation. This is due to weights being shared across pipeline batches so gradients will be updated and used by subsequent batches out of order.
Might be called “pipeline depth” in some other frameworks.
- class poptorch.options._PopartOptions
Options specific to the PopART backend.
Only for advanced users.
Any option from popart.SessionOptions can be set using this class.
Note
There is no mapping for the various PopART enums so integers need to be used instead.
Can be accessed via poptorch.Options.Popart:
>>> opts = poptorch.Options()
>>> opts.Popart.set("autoRecomputation", 3)  # RecomputationType::Pipeline
>>> opts.Popart.set("syntheticDataMode",
...                 int(popart.SyntheticDataMode.RandomNormal))
- setPatterns(patterns, level=2)
Override the default patterns of PopART's compiler.
- Parameters
patterns (dict(str,bool)) – Dictionary of pattern names to enable / disable.
level (int) – Integer value corresponding to the popart::PatternsLevel to use to initialise the Patterns.
- class poptorch.options._TensorLocationOptions(**default_values)
Options controlling where tensors are stored.
Can be accessed via poptorch.Options.TensorLocations:
>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
- setAccumulatorLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Where to store the accumulators.
- setActivationLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Where to store the activations.
- setOptimizerLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Where to store the optimizer states.
- setWeightLocation(location)
- Parameters
location (poptorch.TensorLocationSettings) – Where to store the weights.
- class poptorch.TensorLocationSettings(**default_values)
Define where a tensor is stored
>>> opts = poptorch.Options()
>>> opts.TensorLocations.setActivationLocation(
...     poptorch.TensorLocationSettings().useOnChipStorage(False))
- minElementsForOffChip(min_elements)
A minimum number of elements below which offloading won’t be considered.
- minElementsForReplicatedTensorSharding(min_elements)
Only enable Replicated Tensor Sharding (RTS) for tensors with more than min_elements elements.
- useIOTilesToLoad(use=True)
Load tensors through IO tiles.
- Parameters
use (bool) – Use IO tiles if True, use Compute tiles if False.
- useIOTilesToStore(use=True)
Use IO tiles to store tensors.
(relevant for replicated tensor sharded tensors)
- Parameters
use (bool) – Use IO tiles if True, use Compute tiles if False.
- useOnChipStorage(use=True)
Permanent tensor storage
- Parameters
use (bool) – True: use on chip memory, False: use off chip memory. None: keep it undefined.
- useReplicatedTensorSharding(use=True)
Enable replicated tensor sharding
(relevant for weights and optimizer states)
3.2. Model wrapping functions
The basis of PopTorch integration comes from these two model wrapping functions.
3.2.1. poptorch.trainingModel
- poptorch.trainingModel(model, options=None, optimizer=None)
Create a PopTorch training model, from a PyTorch model, to run on IPU hardware in training mode.
- Parameters
model (torch.nn.Module) – The PyTorch model to wrap.
options (poptorch.Options) – The IPU specific options
optimizer (torch.optim.Optimizer) –
The optimizers to apply during training.
Supported PyTorch optimizers: optim.SGD, optim.AdamW, optim.RMSprop.
Supported PopTorch optimizers: poptorch.optim.SGD, poptorch.optim.AdamW, poptorch.optim.RMSprop, poptorch.optim.LAMB.
- Returns
The poptorch.PoplarExecutor wrapper to use in place of model.
import poptorch
import torch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train on IPU.
for i in range(0, 100):
    # Each call here executes the forward pass, loss calculation, and backward
    # pass in one step.
    # Model input and loss function input are provided together.
    poptorch_out, loss = poptorch_model(input, target)
    print(f"{i}: {loss}")

# Copy the trained weights from the IPU back into the host model.
poptorch_model.copyWeightsToHost()

# Execute the trained weights on host.
model.eval()
native_out = model(input)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-06, atol=1e-06)
3.2.2. poptorch.inferenceModel
- poptorch.inferenceModel(model, options=None)
Create a PopTorch inference model, from a PyTorch model, to run on IPU hardware in inference mode.
- Parameters
model (torch.nn.Module) – The PyTorch model to wrap.
options (poptorch.Options) – The IPU specific options
- Returns
The poptorch.PoplarExecutor wrapper to use in place of model.
import poptorch
import torch
import torchvision

# Some dummy imagenet sized input.
picture_of_a_cat_here = torch.randn([1, 3, 224, 224])

# The model, in this case a MobileNet model with pretrained weights that comes
# canned with Pytorch.
model = torchvision.models.mobilenet_v2(pretrained=True)
model.train(False)

# Wrap in the PopTorch inference wrapper
inference_model = poptorch.inferenceModel(model)

# Execute on IPU.
out_tensor = inference_model(picture_of_a_cat_here)

# Get the top 5 ImageNet classes.
top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
print(top_five_classes)

# Try the same on native PyTorch
native_out = model(picture_of_a_cat_here)

native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
assert any(top_five_classes[1][0] == native_top_five_classes[1][0])
3.2.3. poptorch.PoplarExecutor
- class poptorch.PoplarExecutor(model, options, training, optimizer=None, user_model=None)
This class should not be created directly but is a wrapper around the model that was passed into inferenceModel or trainingModel. It only has a few methods which can be used to interface with the IPU.
- __call__(*args, **kwargs)
Takes the same arguments as the wrapped PyTorch model.__call__.
Note
The first time the PoplarExecutor wrapper is called, the wrapped model will be traced and compiled.
- compile(*args, **kwargs)
Takes the same arguments as the wrapped PyTorch model.__call__.
Trace and compile the wrapped model if no executable has been created yet.
- copyWeightsToDevice()
Copies the weights from model.parameters() to the IPU device. Implicitly called on first call.
- copyWeightsToHost()
Updates the parameters used in model with the weights stored on the device (the weights in model.parameters()).
- destroy()
Destroy the model: release the IPUs and the executable.
- setOptimizer(optimizer)
Sets the optimiser for a training model. Will overwrite the previous one. Supported optimisers: optim.SGD, optim.AdamW, optim.RMSprop.
Note
The PoplarExecutor
will implicitly keep in sync the parameters
of the source PyTorch model and the PopTorch model(s).
However, weights need to be explicitly copied if the
model is trained on the CPU and inference is run on the IPU.
model = Model()
poptorch_train = poptorch.trainingModel(model)
poptorch_inf = poptorch.inferenceModel(model)
train(poptorch_train)
torch.save(model.state_dict(), "model.save") # OK
validate(poptorch_inf) # OK
validate(model) # OK
train(model)
# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)
3.3. Parallel execution
This section demonstrates multi-IPU strategies for parallel execution in PopTorch. We recommend that you start such parallel programming from PopTorch code that is working properly on a single IPU.
There are four kinds of execution strategies in total to run a model on a multi-IPU device: poptorch.ShardedExecution, poptorch.PipelinedExecution, poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution.
These execution strategies are set through poptorch.Options.setExecutionStrategy().
The default execution strategy is poptorch.PipelinedExecution.
In the following, we first introduce the general APIs that apply to all four parallel execution strategies, and then explain the four strategies with examples.
By default, PopTorch will not let you run the model if the number of IPUs is
not a power of 2.
For this reason, it is preferable to annotate the model so that the number of
IPUs used is a power of 2.
However, you can also enable poptorch.Options.autoRoundNumIPUs()
to
automatically round up the number of IPUs reserved to a power of 2, with the
excess being reserved but idle.
This option is not enabled by default to prevent unintentional overbooking of
IPUs.
3.3.1. Annotation tools
poptorch.Block and poptorch.BeginBlock
poptorch.BeginBlock
and poptorch.Block
are
indispensable wrapper classes to define model
parallelism in a multi-IPU device.
You can use poptorch.Block
to define a scope in the context of the
model.
- class poptorch.Block(user_id=None, ipu_id=None)
Runs all layers called inside this scope on a specified IPU.
>>> with poptorch.Block("IPU0"):
...     self.layer = MyLayer(x)
- __init__(user_id=None, ipu_id=None)
- Parameters
user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually specify pipelines or phases.
ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.
- static useAutoId()
Call this method at the beginning of your forward() method to enable automatic block id generation.
Blocks with a None user_id will be assigned an automatic id which will be the index of this block in the list of id-less Blocks.
>>> poptorch.Block.useAutoId()
>>> with poptorch.Block():  # user_id = "0"
...     layer()
>>> with poptorch.Block("special_block"):  # user_id = "special_block"
...     layer()
>>> with poptorch.Block():  # user_id = "1"
...     layer()
poptorch.BeginBlock is an annotation defined outside the model, and applied to the current and subsequent layers.
- class poptorch.BeginBlock(layer_to_call, user_id=None, ipu_id=None)
Runs all layers from the given layer until the beginning of the next block on a specified IPU.
All layers after this layer will also run on the same IPU until another BeginBlock is encountered.
By default PipelinedExecution will be used, however this can be overridden in the poptorch.Options.
>>> self.layer = poptorch.BeginBlock(MyLayer(x))
- __init__(layer_to_call, user_id=None, ipu_id=None)
All subsequent layers of the network will be part of this block until another layer is wrapped.
- Parameters
layer_to_call (torch.nn.Module) – The layer to run on the specified IPU.
user_id (str, optional) – A user defined identifier for the block. Blocks with the same id are considered as being a single block. Block identifiers are also used to manually create Stages and Phases.
ipu_id (int, optional) – The id of the IPU to run on. Note that the ipu_id is an index in a multi-IPU device within PopTorch, and is separate and distinct from the device ids used by gc-info.
poptorch.BeginBlock
or poptorch.Block
alone is enough to enable parallel execution in the simplest case.
By default, the layers before the first poptorch.BeginBlock
will be placed on IPU 0.
The complete code examples for poptorch.BeginBlock
and
poptorch.Block
are shown below.
All layers before model.bert.encoder.layer[0]
will be on IPU 0 and all layers from model.bert.encoder.layer[0]
onwards (inclusive) will be on IPU 1.
import transformers
import torch
import poptorch

# A bert model from hugging face. See the packaged BERT example for actual usage.
pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'
model = transformers.BertForQuestionAnswering.from_pretrained(
    pretrained_weights)

# A handy way of seeing the names of all the layers in the network.
print(model)

# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
                                                  ipu_id=1)

# Now all layers before layer are on IPU 1 and this layer onward is on IPU 2
model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
                                                  ipu_id=2)

# Finally all layers from this layer till the end of the network are on IPU 3.
model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
                                                  ipu_id=3)

# We must batch the data by at least the number of IPUs. Each IPU will still execute
# whatever the model batch size is.
data_batch_size = 4

# Create a poptorch.Options instance to override default options
opts = poptorch.Options()
opts.deviceIterations(data_batch_size)
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):

        # Explicit layers on a certain IPU
        poptorch.Block.useAutoId()
        with poptorch.Block(ipu_id=0):
            x = self.act(self.layer1(x))

        with poptorch.Block(ipu_id=1):
            x = self.act(self.layer2(x))

        with poptorch.Block(ipu_id=2):
            x = self.act(self.layer3(x))
            x = self.act(self.layer4(x))

        with poptorch.Block(ipu_id=3):
            x = self.softmax(x)
        return x


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))
Both poptorch.BeginBlock
and poptorch.Block
need to follow a set of rules:
- All the layers must be declared inside a poptorch.Block scope. This is to avoid missing annotations. poptorch.BeginBlock doesn't have the same constraint because all the layers called after it will automatically be added to the last poptorch.BeginBlock.
- Please note that PopTorch needs to reserve IPUs in powers of 2 or multiples of 64. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.
- Unused or dead layers should NOT be included in any poptorch.BeginBlock or poptorch.Block.
- If layer A happens before layer B inside the model and each layer has a poptorch.BeginBlock associated with it, you need to write poptorch.BeginBlock for layer A before poptorch.BeginBlock for layer B.
Failing to obey the above rules will result in compilation errors.
poptorch.Stage and poptorch.AutoStage
Conceptually, poptorch.BeginBlock or poptorch.Block collects the layers of a model into a poptorch.Stage; multiple stages can be combined into a poptorch.Phase, and multiple phases form a parallel execution strategy.
poptorch.Stage
poptorch.Stage
defines some layers of the model to run on one IPU.
It can be made of one or more blocks created using poptorch.BeginBlock or poptorch.Block and identified by their user_id.
Consecutive layers in a model can be defined either in the same poptorch.Stage or in consecutive stages.
Whether stages run in parallel or sequentially depends on the specific parallel execution strategy.
- class poptorch.Stage(*block_ids)
The various execution strategies are made of Stages: a stage consists of one or more Blocks running on one IPU.
- __init__(*block_ids)
Internally, each operation in a model is assigned a stage_id through poptorch.Stage.
poptorch.AutoStage
You can use poptorch.AutoStage
if you don’t want to
specify poptorch.Stage
by hand.
It will assign one poptorch.Stage
per poptorch.BeginBlock
or poptorch.Block
.
- class poptorch.AutoStage(value)
Defines how the stages are automatically assigned to blocks when the user didn't explicitly provide stages to the IExecutionStrategy's constructor.
SameAsIpu: The stage id will be set to the selected IPU number.
AutoIncrement: The stage id for new blocks is automatically incremented.
Examples:
>>> # Block "0"
>>> with poptorch.Block(ipu_id=0):
...     layer()
>>> # Block "1"
>>> with poptorch.Block(ipu_id=1):
...     layer()
>>> # Block "2"
>>> with poptorch.Block(ipu_id=0):
...     layer()
By default, the following execution strategy is used:
>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu)
>>> opts.setExecutionStrategy(strategy)
which would translate to stage_id = ipu_id:
Block "0" ipu=0 stage=0
Block “1” ipu=1 stage=1
Block “2” ipu=0 stage=0
Now if instead you use:
>>> strategy = poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement)
>>> opts.setExecutionStrategy(strategy)
The last block would be in its own stage rather than sharing one with Block “0”:
Block “0” ipu=0 stage=0
Block “1” ipu=1 stage=1
Block “2” ipu=0 stage=2
By default poptorch.AutoStage.SameAsIpu
is in use, which means the
stage_id
of poptorch.Stage
will be set to the ipu_id
specified for the poptorch.BeginBlock
or
poptorch.Block
.
Please note that stage_id
must be ascending in
poptorch.PipelinedExecution
.
Let’s use the code example above.
If your blocks "0", "1", and "2" are assigned to IPUs 0, 1, and 0, then poptorch.Block "2" will be assigned stage_id 0. This will make the compiler fail to schedule the last two stages "1" and "2" due to a conflict:
The model implies "1" should run earlier than "2".
Their stage_id values suggest "2" should run earlier than "1".
When poptorch.AutoStage.AutoIncrement
is in use, each new
poptorch.BeginBlock
or
poptorch.Block
will be assigned an automatically incremented
stage_id
.
In the previous example the last stage would be assigned stage_id
2 and
the compilation would succeed.
poptorch.Phase
poptorch.Phase
defines a processing unit of phased execution.
It may contain one or more poptorch.Stage
.
poptorch.Phase
is only used in
poptorch.SerialPhasedExecution
and
poptorch.ParallelPhasedExecution
.
It is not used in
poptorch.ShardedExecution
and
poptorch.PipelinedExecution
.
- class poptorch.Phase(arg)
Represents an execution phase
- __init__(arg)
Create a phase.
- Parameters
arg (str, poptorch.Stage, [poptorch.Stage], [str]) – must either be one or more Stages, or one or more block user_ids.
If one or more strings are passed they will be interpreted as Block ids representing a single Stage.
Within a Phase, the stages will be executed in parallel.
>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> p = Phase(poptorch.Stage("A").ipu(0))
>>> # 2 stages made of one block each
>>> p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
>>> p = Phase("A","B")  # One Stage made of 2 blocks
In the last two lines above, "A" and "B" will run in parallel on IPUs 0 and 1 simultaneously since they are placed in two stages. They will run sequentially on one IPU if they are placed in one stage only.
Advanced annotation with strings
You can use Python strings to represent the user_id
and ipu_id
for a
poptorch.Block
or
poptorch.BeginBlock
.
Since strings are evaluated at runtime,
they allow for a dynamic number of stages and phases.
The example below uses formatted strings (f-strings) in poptorch.ParallelPhasedExecution.
Inside the code example below, there are two lines where f-strings are used in the forward() method.
One is f"phase{phase}_ipu{ipu}"
at Line 25,
where phase
is
0, 1, 1, 2, 3, 3, 4, 5, and 5 respectively,
and ipu
ranges from 0 to 1.
The total number of instances for this f-string is 12 due to
6 phases and 2 IPUs.
The other is f"phase{N*2-1}_ipu1"
at Line 32,
where phase
is 5 and ipu
is 1.
When defining poptorch.Stage, four f-strings are used where n ranges from 0 to 2, at Lines 46-47 and 50-51:
f"phase{2*n}_ipu0"
f"phase{2*n}_ipu1"
f"phase{2*n+1}_ipu0"
f"phase{2*n+1}_ipu1"
They refer to phases 0, 2, 4 and 1, 3, 5, with ipu0 and ipu1 respectively.
So all these 12 f-strings are defined in poptorch.Block and used in poptorch.Stage dynamically. They match exactly.
 1  poptorch.setLogLevel(1)  # Force debug logging
 2  N = 3
 3  size = 10
 4
 5
 6  class Model(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.weights = []
10          for n in range(N * 6):
11              weight = torch.nn.Parameter(torch.rand(size, size),
12                                          requires_grad=True)
13              self.register_parameter(f"w{n}", weight)
14              self.weights.append(weight)
15
16      def forward(self, in0, target=None):
17          phase = 0
18          weight = iter(self.weights)
19          with poptorch.Block("phase0_ipu0"):
20              ins = torch.split(in0, size)
21          for n in range(N * 3):
22              out = []
23              for ipu in range(2):
24                  x = ins[ipu]
25                  with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                      x = torch.matmul(next(weight), x)
27                      out.append(F.relu(x))
28              ins = out[1], out[0]
29              # We want 2 matmuls in the same phase
30              if n % 3 != 1:
31                  phase += 1
32          with poptorch.Block(f"phase{N*2-1}_ipu1"):
33              res = ins[0] + ins[1]
34          if target is None:
35              return res
36          return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39  input = torch.rand(size * 2, 1)
40  target = torch.rand(size, 1)
41  model = Model()
42  opts = poptorch.Options()
43  phases = []
44  # Alternate between 0-2 and 1-3
45  for n in range(N):
46      phases.append([
47          poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48          poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49      ])
50      phases.append([
51          poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52          poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53      ])
54  opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55  poptorch_model = poptorch.trainingModel(model, opts)
56  poptorch_model.compile(input, target)
3.3.2. Parallel execution strategies
With the above APIs as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below. Note that the same annotation can be used for each of them. They only differ in the method of parallelisation and tensor locations.
poptorch.ShardedExecution
In this strategy, each IPU
will sequentially execute a distinct part of the model.
A single unit of processing in poptorch.ShardedExecution is a shard.
A shard is specified using poptorch.Stage
,
or if no poptorch.Stage
is specified,
the user_id
passed by
poptorch.BeginBlock
or poptorch.Block
is used.
Each shard is executed sequentially on a single IPU.
Multiple shards can be placed on multiple IPUs.
However, only one IPU is used at a time, while
the other IPUs are idle.
If an IPU is allocated to run consecutive stages,
PopART will merge consecutive stages into one on the same IPU.
Weights and activations will use the on-chip memory of the IPUs.
Layers sharing weights need to be placed on the same IPU.
poptorch.ShardedExecution
can be useful
for processing a single sample or debugging.
Overall it has low efficiency since only one IPU is used at a time.
- class poptorch.ShardedExecution(*args)
Will shard the execution of the passed Stages or if no stage is passed will consider each unique Block name encountered during tracing as a different stage.
>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Automatically create 3 shards based on the block names
>>> opts.setExecutionStrategy(poptorch.ShardedExecution())
- Parameters
args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a
poptorch.AutoStage
strategy or an explicit list of stages or block ids.
- stage(block_id)
Return the poptorch.Stage the given block belongs to.
- Parameters
block_id (str) – A block id.
poptorch.PipelinedExecution
This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.
Parallelisation in poptorch.PipelinedExecution
requires deviceIterations() and gradientAccumulation(), as explained in Efficient data batching.
After one poptorch.Stage
is finished with processing a batch
on one IPU, it starts immediately processing the next batch.
This creates a pipeline where multiple batches are processed in parallel.
An IPU can only start its own poptorch.Stage
of a batch if
its previous poptorch.Stage
of the current batch is processed.
Hence, all IPUs will be occupied after a warm-up period.
A cool-down period is required to aggregate the results and apply weight
changes.
- class poptorch.PipelinedExecution(*args)
- __init__(*args)
Pipeline the execution of the passed Stages or, if no stage is passed, consider each unique Block name encountered during tracing as a different stage.
>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> opts = poptorch.Options()
>>> # Create a 3 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution("A","B","C"))
>>> # Create a 2 stages pipeline
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution(
...     poptorch.Stage("A","B"),
...     "C"))
>>> # Automatically create a 3 stages pipeline based on the block names
>>> opts.setExecutionStrategy(poptorch.PipelinedExecution())
- Parameters
args (poptorch.AutoStage, [str], [poptorch.Stage]) – Either a
poptorch.AutoStage
strategy or an explicit list of stages or block ids.
- stage(block_id)
Return the poptorch.Stage the given block belongs to.
- Parameters
block_id (str) – A block id.
Phased execution
poptorch.ParallelPhasedExecution
and
poptorch.SerialPhasedExecution
have the following
features in common:
A portion of the weights and activations are transferred to and from streaming memory, before and after each phase.
If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.
This specific portion is needed by the layers of the model wrapped in poptorch.BeginBlock or poptorch.Block in the current poptorch.Phase.
They both trade off some performance in order to support larger models with higher memory needs.
Any number of phases is allowed.
The number of stages in each poptorch.Phase should match the number of IPUs in each group of IPUs.
Stages inside each poptorch.Phase can run in parallel.
Although you only define the poptorch.Phase for forward passes, the corresponding phases for backward passes are created automatically.
The order of phased execution for backward passes won’t change
but you can decide whether a phase is shared by both
forward and backward passes. In other words, you decide whether to avoid
a memory transfer of a portion of the weights and activations.
poptorch.SerialPhasedExecution
In poptorch.SerialPhasedExecution
,
phases execute on a single group of IPUs sequentially.
- class poptorch.SerialPhasedExecution(*phases)
All the phases run serially on a single group of IPUs.
For example:
phase 0 runs on ipu 0 & 1
phase 1 runs on ipu 0 & 1
phase 2 runs on ipu 0 & 1
>>> with poptorch.Block("A"):
...     layer()
>>> with poptorch.Block("A2"):
...     layer()
>>> with poptorch.Block("B"):
...     layer()
>>> with poptorch.Block("B2"):
...     layer()
>>> with poptorch.Block("C"):
...     layer()
>>> with poptorch.Block("C2"):
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
...     poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
...     poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])
>>> strategy.phase(0).ipus(0,1)
>>> strategy.phase(1).ipus(0,1)
>>> strategy.phase(2).ipus(0,1)
>>> opts.setExecutionStrategy(strategy)
- __init__(*phases)
Execute the model’s blocks in phases
- Parameters
phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) – Definition of phases must be either:
a list of poptorch.Phase
a list of lists of poptorch.Stage
a list of lists of poptorch.Block ids (each list of blocks will be considered as a single poptorch.Stage)
- phase(phase)
Return the requested
poptorch.Phase
- Parameters
phase (int) – Index of the phase
- setTensorsLiveness(liveness)
See poptorch.Liveness for more information.
- stage(block_id)
Return the poptorch.Stage the given block belongs to.
- Parameters
block_id (str) – A block id.
- useSeparateBackwardPhase(use=True)
Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:
fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2
Note
The end of the forward pass and the beginning of the backward pass are part of the same phase.
If
useSeparateBackwardPhase(True)
is used then no phase will be shared between the forward and backward passes:
fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
poptorch.ParallelPhasedExecution
In poptorch.ParallelPhasedExecution
,
phases are executed in parallel alternating between two groups of IPUs.
Even phases must run on even IPUs and odd phases on odd IPUs.
Inter-phase cross-IPU copies can replace the memory transfers to and from
the streaming memory, if the desired weights and activations are already
available in another group of IPUs.
- class poptorch.ParallelPhasedExecution(*phases)
Phases are executed in parallel alternating between two groups of IPUs.
For example:
phase 0 runs on ipu 0 & 2
phase 1 runs on ipu 1 & 3
phase 2 runs on ipu 0 & 2
>>> poptorch.Block.useAutoId()
>>> with poptorch.Block():  # user_id = "0"
...     layer()
>>> with poptorch.Block():  # user_id = "1"
...     layer()
>>> with poptorch.Block():  # user_id = "2"
...     layer()
>>> with poptorch.Block():  # user_id = "3"
...     layer()
>>> with poptorch.Block():  # user_id = "4"
...     layer()
>>> with poptorch.Block():  # user_id = "5"
...     layer()
>>> opts = poptorch.Options()
>>> strategy = poptorch.ParallelPhasedExecution([
...     poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
...     poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
...     poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])
>>> strategy.phase(0).ipus(0,2)
>>> strategy.phase(1).ipus(1,3)
>>> strategy.phase(2).ipus(0,2)
>>> opts.setExecutionStrategy(strategy)
- __init__(*phases)
Execute the model’s blocks in phases
- Parameters
phases ([poptorch.Phase], [[poptorch.Stage]], [[str]]) – Definition of phases must be either:
a list of poptorch.Phase
a list of lists of poptorch.Stage
a list of lists of poptorch.Block ids (each list of blocks will be considered as a single poptorch.Stage)
- phase(phase)
Return the requested
poptorch.Phase
- Parameters
phase (int) – Index of the phase
- stage(block_id)
Return the poptorch.Stage the given block belongs to.
- Parameters
block_id (str) – A block id.
- useSeparateBackwardPhase(use=True)
Given a forward pass with 3 phases (0,1,2), by default the phases will run as follows:
fwd:       bwd:
phase 0 -> phase 4
phase 1 -> phase 3
phase 2 -> phase 2
Note
The end of the forward pass and the beginning of the backward pass are part of the same phase.
If
useSeparateBackwardPhase(True)
is used then no phase will be shared between the forward and backward passes:
fwd:       bwd:
phase 0 -> phase 6
phase 1 -> phase 5
phase 2 -> phase 4
In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so these two numbers match. Even phases 0 and 2 run on IPUs 0 and 2, while odd phase 1 runs on IPUs 1 and 3, as required. This allows for faster cross-IPU copies, both inter-phase and intra-phase.
poptorch.Liveness
poptorch.Liveness
controls the availability of tensors on IPU,
and is only needed for
poptorch.ParallelPhasedExecution
and poptorch.SerialPhasedExecution
.
- class poptorch.Liveness(value)
When using phased execution:
AlwaysLive: The tensors always stay on the IPU between the phases.
OffChipAfterFwd: The tensors are sent off the chip at the end of the forward pass and before the beginning of the backward pass.
OffChipAfterEachPhase: The tensors are sent off the chip at the end of each phase.
The default poptorch.Liveness
is AlwaysLive
.
OffChipAfterFwd
and
OffChipAfterEachPhase
may be helpful if you run a large model
with a tight memory budget.
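For example, a minimal sketch of applying a liveness setting to a phased execution strategy (the blocks "A" and "B" are assumed to have been defined in the model, as in the earlier examples):
>>> strategy = poptorch.SerialPhasedExecution([
...     poptorch.Phase(poptorch.Stage("A")),
...     poptorch.Phase(poptorch.Stage("B"))])
>>> strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)
>>> opts = poptorch.Options()
>>> opts.setExecutionStrategy(strategy)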
3.4. Optimizers
You can use a number of optimizers with PopTorch. In addition, PopTorch has features to support float16 models, such as loss scaling.
- class poptorch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, loss_scaling=1.0, velocity_scaling=1.0)
Stochastic gradient descent with optional momentum.
The optimizer matches PyTorch’s implementation (torch.optim.SGD) with optional loss and velocity scaling.
Nesterov momentum is not currently supported.
- __init__(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False, loss_scaling=1.0, velocity_scaling=1.0)
- Parameters
params (iterable) – parameters to optimize.
lr (float) – learning rate.
momentum (float, optional) – momentum factor.
dampening (float, optional) – dampening term for momentum.
weight_decay (float, optional) – Weight decay (L2 penalty) factor.
nesterov (bool, optional) – Not supported (must be False).
loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
velocity_scaling (float, optional) – Factor by which to scale the velocity values to assist numerical stability when using float16.
- class poptorch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, loss_scaling=1.0, biasCorrection=True, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
Adam optimizer with true weight decay.
This optimizer matches PyTorch’s implementation (torch.optim.AdamW) with optional loss scaling.
AMSGrad is currently not supported.
- __init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, loss_scaling=1.0, biasCorrection=True, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
- Parameters
params (iterable) – parameters to optimize.
lr (float, optional) – learning rate
eps (float, optional) – term added to the denominator to ensure numerical stability.
weight_decay (float, optional) – Weight decay factor.
amsgrad (bool, optional) – Not supported (must be False).
loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
accumType (torch.dtype, optional) – data type used for gradients.
firstOrderMomentumAccumType (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.
secondOrderMomentumAccumType (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.
- class poptorch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, loss_scaling=1.0)
RMSprop optimizer with optional L2 penalty.
This optimizer matches PyTorch’s implementation (torch.optim.RMSprop) with optional loss scaling.
- __init__(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, loss_scaling=1.0)
- Parameters
params (iterable) – parameters to optimize.
lr (float, optional) – learning rate.
alpha (float, optional) – smoothing constant.
eps (float, optional) – term added to the denominator to ensure numerical stability.
weight_decay (float, optional) – L2 penalty coefficient.
momentum (float, optional) – momentum factor.
centered (bool, optional) – True: compute centred RMSProp in which the gradient is normalized by an estimate of its variance.
loss_scaling (float, optional) – Factor by which to scale the loss and hence gradients to assist numerical stability when using float16.
- class poptorch.optim.LAMB(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, biasCorrection=True, loss_scaling=1.0, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
Layer-wise Adaptive Moments (LAMB) optimizer (biased version).
Based on “Large Batch Optimization for Deep Learning: Training BERT in 76 minutes” (https://arxiv.org/abs/1904.00962).
The scaling function phi(z) is fixed as min(z, mwn); mwn is fixed at 10.0.
- __init__(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, biasCorrection=True, loss_scaling=1.0, accumType=torch.float32, firstOrderMomentumAccumType=torch.float32, secondOrderMomentumAccumType=torch.float32)
- Parameters
params (iterable) – parameters to optimize.
lr (float, optional) – learning rate
betas (tuple, optional) – (beta1, beta2) parameters used in LAMB.
eps (float, optional) – term added to the denominator to ensure numerical stability.
weight_decay (float, optional) – (AdamW) weight decay factor.
accumType (torch.dtype, optional) – data type used for gradients.
firstOrderMomentumAccumType (torch.dtype, optional) – data type used to store the first order momentum values for each parameter.
secondOrderMomentumAccumType (torch.dtype, optional) – data type used to store the second order momentum values for each parameter.
- step(closure=None)
Performs a single optimization step (parameter update).
- Parameters
closure (callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.
Note
Unless otherwise specified, this function should not modify the
.grad
field of the parameters.
3.4.1. Loss scaling
When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.
Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling
parameter.
PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state.
Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.
Higher loss_scaling
values can improve numerical stability by minimising underflow.
However, too high a value can result in overflow.
The optimal loss scaling factor depends on the model.
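As an illustration, the sketch below passes a loss_scaling value when constructing a PopTorch optimizer and hands it to trainingModel. The value 128.0, the choice of AdamW and the learning rate are arbitrary, and ExampleModelWithLoss refers to the model defined in the trainingModel example above:
import torch
import poptorch

model = ExampleModelWithLoss()
opts = poptorch.Options()
optimizer = poptorch.optim.AdamW(model.parameters(), lr=0.001, loss_scaling=128.0)
poptorch_model = poptorch.trainingModel(model, options=opts, optimizer=optimizer)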
3.4.2. Velocity scaling (SGD only)
The SGD optimizer, when used with momentum, updates weights based on the velocity values.
At each update step, the new velocity is a combination of the gradients derived from the loss function and the previous velocity value.
Similar to loss scaling, the velocity_scaling
parameter allows the velocity values to be scaled to improve numerical precision when using half/float16 values.
(Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling
so the loss_scaling
has no impact on the effective scaling of velocity parameters.)
As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
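For example, a sketch of an SGD optimizer combining momentum with loss and velocity scaling (all numeric values are illustrative; model and opts are assumed to exist, as in the previous sketch):
optimizer = poptorch.optim.SGD(model.parameters(),
                               lr=0.01,
                               momentum=0.9,
                               loss_scaling=128.0,
                               velocity_scaling=128.0)
poptorch_model = poptorch.trainingModel(model, options=opts, optimizer=optimizer)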
3.5. Custom ops
These are helper operations to be used within a model.
3.5.1. poptorch.ipu_print_tensor
- class ipu_print_tensor(tensor_to_print, optional_title)
Adds a tensor to be printed on the IPU. When this is executed the tensor will be copied back to host and printed.
When this operation is called in the backward pass it will print the gradient of the tensor.
The operation is an identity operation and it will return the exact same tensor. The returned tensor should be used in place of the original tensor in the rest of the program, to make sure that the print operation isn't optimised away.
For example if the original code looks like this:
def forward(self, c, d, b):
    a = c + d
    return a + b
And you want to print the value of a. If you do:
def forward(self, c, d, b):
    a = c + d
    poptorch.ipu_print_tensor(a)
    return a + b
Optionally, you may add a second string parameter to be used as a title.
def forward(self, c, d, b):
    a = c + d
    poptorch.ipu_print_tensor(a, "summation")
    return a + b
The result of ipu_print_tensor is not used, therefore it will be optimised out by the graph optimiser and a will not be printed.
Instead you should do:
def forward(self, c, d, b):
    a = c + d
    x = poptorch.ipu_print_tensor(a)
    return x + b
Warning
In order for the print operation to not be optimised out by the graph optimiser, you must use the output of the print.
- Parameters
ipu_print_tensor – The tensor to print.
- Returns
The input unchanged.
class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        x += 1

        # It is important to make sure the result of the print is used.
        x = poptorch.ipu_print_tensor(x)

        return x + self.bias
3.5.2. poptorch.identity_loss
This function is used to implement custom losses. This takes in a single PyTorch tensor and will backpropagate a gradient of ones through it.
Warning
Passing a PyTorch loss function or another identity_loss
to this function is not
supported. Multiple losses must be implemented via composite PyTorch ops.
- poptorch.identity_loss(x, reduction)
Marks this operation as being part of the loss calculation and, as such, will back-propagate through it in the PopTorch autograd. This enables multiple losses and custom losses.
- Parameters
loss (torch.Tensor) – The calculated loss.
reduction (str) –
Reduce the loss output as per PyTorch loss semantics. Supported values are:
"sum": Sum the losses.
"mean": Take the mean of the losses.
"none": Don't reduce the losses.
- Returns
An identity loss custom op.
def custom_loss(output, target):
    # Mean squared error with a scale
    loss = output - target
    loss = loss * loss * 5
    return poptorch.identity_loss(loss, reduction="mean")


class ExampleModelWithCustomLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = ExampleModel()

    def forward(self, input, target):
        out = self.model(input)
        return out, custom_loss(out, target)
3.5.3. poptorch.MultiConv
Use the poptorch.MultiConv wrapper class to define multi-convolutions.
- class poptorch.MultiConv
Combines all convolution layers evaluated inside this scope into a single multi-convolution.
Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.
For example:
>>> with poptorch.MultiConv():
...     y = self.convA(x)
...     v = self.convB(u)
Combines the two data-independent convolutions into a single multi-convolution.
Refer to the PopLibs documentation for further information on multi-convolutions.
- availableMemoryProportions(value)
The available memory proportion per convolution, each [0, 1).
- Parameters
value (float, [float]) – Can be a float value in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many float values as the number of convolutions.
- Returns
self, to support method chaining
- cycleBackOff(value)
Cycle back off proportion.
- Parameters
value (float) – Number between 0 and 1
- Returns
self, to support method chaining
- partialsTypes(value)
The partials type used for each convolution.
- Parameters
value (MultiConvPartialsType, [MultiConvPartialsType]) – Can be a single instance of poptorch.MultiConvPartialsType in which case the same value is used for all of the convolutions. Otherwise, can be a tuple or list containing as many poptorch.MultiConvPartialsType values as the number of convolutions.
- Returns
self, to support method chaining
- perConvReservedTiles(value)
Tiles to reserve for each convolution.
- Parameters
value (int) – Number of tiles
- Returns
self, to support method chaining
- planType(value)
Select the multi-convolution execution strategy.
- Parameters
value – An instance of MultiConvPlanType.
- Returns
self, to support method chaining
- class poptorch.MultiConvPartialsType(value)
Type for the partials of each convolution of a poptorch.MultiConv:
Float
Half
- class poptorch.MultiConvPlanType(value)
Selects the execution strategy for a poptorch.MultiConv:
Parallel: Execute multiple convolutions in parallel (Default).
Serial: Execute each convolution independently. This is equivalent to using the independent convolution API.
3.5.4. poptorch.custom_op
This is for users who are familiar with PopART.
If you need some special features that are not
supported in PopART, you may write a PopART custom op.
For more information about
how to create PopART custom ops see
Creating custom operations
and
Building custom operators using PopART.
You can call such a PopART custom op using
poptorch.custom_op
in PopTorch.
It takes three steps to enable a PopART custom op in PopTorch.
First, set Poplar and PopART environment variables as shown in Setting the environment variables and compile the PopART custom op. You can compile your custom op C++ code and link with Poplar and PopART to generate a dynamic library. Please refer to the custom op code custom_cube_op.cpp and its CMakeLists.txt under poptorch/tests/custom_ops.
Second, load the dynamic library.
Finally, use poptorch.custom_op
to finish the call.
Its wrapper class is specified below.
- class poptorch.custom_op(inputs, name, domain, domain_version, example_outputs)
Applies a custom operation, implemented within PopART, to the inputs.
- Parameters
inputs (tuple) – A tuple of input tensors, for example, (x, y).
name (str) – unique name of the PopART custom op
domain (str) – domain for the op
domain_version (int) – version of the domain to use
example_outputs (iterable) – a tuple of tensors with the same type and shape of the outputs; the value does not matter as all values will be set to zero for tracing purposes.
- Returns
The outputs of the forward op of the custom op.
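The sketch below illustrates the call; the library path, op name "Cube", domain "com.acme" and version 1 are placeholders (use whatever your compiled custom op actually registers), and example_outputs reuses an input tensor purely as a shape/type template:
import ctypes

import torch
import poptorch

# Load the dynamic library containing the compiled PopART custom op
# (the path is an assumption for this sketch).
ctypes.cdll.LoadLibrary("build/custom_cube_op.so")


class ModelWithCustomOp(torch.nn.Module):
    def forward(self, x, y):
        # Name, domain and domain version must match what the custom op registers.
        out = poptorch.custom_op((x, y),
                                 "Cube",
                                 "com.acme",
                                 1,
                                 example_outputs=[x, x])
        return out[0] + out[1]


model = ModelWithCustomOp()
inference_model = poptorch.inferenceModel(model)
result = inference_model(torch.randn(10), torch.randn(10))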
In the PopART custom op, both forward op and backward op are implemented. In the PopTorch inference model, only the forward op will be called.
In the code example above, example_outputs
is assigned as
[x
, x
], where x
is one of the input tensors and used as
a template to provide the right number of output tensors.
The real outputs will be allocated memory, calculated and
returned by the custom op.
You can also call this custom op inside a training model
using exactly the same interface of poptorch.custom_op
,
and the backward op will be called automatically.
3.5.5. poptorch.nop
PopTorch includes a "no-op" function for debugging purposes.
- poptorch.nop(tensor)
A no-operation: it is functionally the same as an identity but is never eliminated by PopART patterns or inlining, so it is useful for debugging.
- Parameters
tensor (torch.Tensor) – the tensor to simply return by the no-op.
- Returns
The same tensor which was input.
- Return type
torch.Tensor
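A minimal sketch of how it might be used inside a model (the surrounding linear layer is arbitrary):
import torch
import poptorch


class DebugBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 4)

    def forward(self, x):
        x = self.layer(x)
        # nop is an identity that will not be optimised away, which makes this
        # intermediate value easier to inspect when debugging.
        x = poptorch.nop(x)
        return x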
3.5.6. poptorch.serializedMatMul
Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.
- poptorch.serializedMatMul(lhs, rhs, mode, factor=0, keep_precision=False)
Calculates a matrix product using a serialized matrix multiplication.
The matrix multiplication, lhs*rhs, is split into separate smaller multiplications, calculated one after the other, to reduce the memory requirements of the multiplication and its gradient calculation.
- Parameters
lhs (torch.Tensor) – Left-hand size input matrix.
rhs (torch.Tensor) – Right-hand side input matrix.
mode (poptorch.MatMulSerializationMode) – Which dimension of the matmul to serialize on: for matrix A (m by n) multiplied by matrix B (n by p).
* InputChannels: Split across the input channels (dimension m).
* ReducingDim: Split across the reducing dimension (n).
* OutputChannels: Split across the output channels (dimension p).
* Disabled: Same as an ordinary matrix multiplication.
factor (int) – Number of serialized multiplications. Must be a factor of the dimension to serialize on.
keep_precision (bool) – (Half/float16 inputs only) The forward op when serializing over ReducingDim and the backwards ops when serializing over InputChannels involve an addition step. If
keep_precision
is True, these additions will occur using float32 rather than half precision partials, matching those used for the individual matrix multiplications.
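For instance, a sketch that serializes a matrix multiplication over its output channels in 4 steps (the shapes and the factor are illustrative; the factor must divide the serialized dimension):
import torch
import poptorch


class SerializedLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(64, 256))

    def forward(self, x):
        # Split the matmul over its output channels (p = 256) into 4 chunks of 64.
        return poptorch.serializedMatMul(
            x, self.weight,
            poptorch.MatMulSerializationMode.OutputChannels,
            factor=4)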
3.5.7. poptorch.set_available_memory
Use this function to override the proportion of tile memory available to be used as temporary memory by a convolution or matrix multiplication.
- poptorch.set_available_memory(tensor, available_memory_proportion)
Sets the available memory for a convolution or matrix multiplication.
When called on the output of a convolution or a matrix multiplication, it sets the proportion of tile memory (between 0 and 1) to be made available as temporary memory for the convolution/matrix multiplication. Less temporary memory will reduce the time performance but may use less memory overall. Lower memory proportions result in the use of more live (not temporary) memory, and so the overall memory may increase for very low values, possibly resulting in out of memory errors.
In the event that the value is too low, the planner will replan for the smallest memory usage possible.
>>> class BasicNetwork(nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.conv = nn.Conv2d(4, 4, 3, stride=2)
...
...     def forward(self, x):
...         out = self.conv(x)
...         out = poptorch.set_available_memory(out, 0.2)
...         return out
- Parameters
tensor (torch.Tensor) – output tensor of a convolution or matrix multiplication (otherwise the statement will be an identity).
available_memory_proportion (float) – proportion between 0.0 and 1.0 of tile memory to be made available for temporary memory (default 0.6).
- Returns
input tensor, as if calling an identity function.
- Return type
torch.Tensor
3.6. Miscellaneous functions
These PopTorch functions, not related to model creation, are available:
- poptorch.ipuHardwareIsAvailable()
Indicates whether IPU hardware is available to use.
- Returns
True if physical IPUs are available, False otherwise.
- Return type
bool
- poptorch.setLogLevel(level)
Changes the volume of messages printed in the console (stdout)
- Parameters
level (str) –
TRACE: Print all messages.
DEBUG: Print debug messages and above.
INFO: Print info messages and above.
WARN: Print warnings and errors.
ERR: Print errors only.
OFF: Print nothing.
3.7. Half / float 16 support
PopTorch supports the half-precision floating point (float 16) format.
You can simply input float 16 tensors into your model.
(You can convert a tensor to float 16 using tensor = tensor.half().)
You can use your models in one of the following ways:
1. Convert all parameters (weights) to float 16 by using a Module's half() method. This is the most memory efficient, however small updates to weights may be lost, hindering training.
2. Keep the parameters (weights) as float 32, in which case the parameter updates will occur using float 32. However, the parameters will be converted to float 16 if you call an operation with a float 16 input. This is more memory efficient than using float 32 tensors (inputs) but less memory efficient than using float 16 weights.
3. Use a mix of float 32 and float 16 parameters by manually specifying parameters as float 16 or float 32.
Note
When PyTorch encounters a mix of float 16 and float 32 inputs for a given operation, it will usually cast all inputs to float 32. PopTorch differs and will cast all inputs to float 16. This makes it easier to build models with float 32 weights which take float 16 tensors.
model = torch.nn.Linear(1, 10).half()
t1 = torch.tensor([1.]).half()

inference_model = poptorch.inferenceModel(model)
out = inference_model(t1)

assert out.dtype == torch.half
Because PopTorch relies on the torch.jit.trace
API, it is limited to tracing operations which run on the CPU.
Many of these operations do not support float 16 inputs.
To allow the full range of operations, PopTorch converts all float 16 inputs to float 32 before tracing and then restores the inputs to float 16 as part of the canonicalization process.
Some operations may result in the model running in float 32 where float 16 would
be expected, or vice versa (see Float 16 operations for full details).
3.8. Profiling
You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.
3.9. Environment variables
3.9.1. Logging level
- PopTorch uses the following levels of logging:
OFF: No logging.
ERR: Errors only.
WARN: Warnings and errors only.
INFO: Info, warnings and errors. (Default)
DEBUG: Adds some extra debugging information.
TRACE and TRACE_ALL: Trace everything inside PopTorch.
The POPTORCH_LOG_LEVEL
environment variable can be used to set the logging level:
export POPTORCH_LOG_LEVEL=DEBUG
3.9.2. Profiling
When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS
environment variable used by Poplar.
In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'
:
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'
By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory
, for example:
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'
For more options, please refer to the PopVision Graph Analyser User Guide.
In order to capture the pvti reports needed for the PopVision System Analyser you only need to set PVTI_OPTIONS='{"enable":"true"}'.
You can also add extra tracepoints in your own code by using:
- class poptorch.profiling.Channel(name)
Profiling channel.
Note
If the
libpvti
profiling library is not available at runtime this class becomes a no-op.Example:
>>> channel = poptorch.profiling.Channel("MyApp")
>>> with channel.tracepoint("TimeThis"):
...     functionToTime()
>>> channel.instrument(myobj, "methodName", "otherMethod")
- instrument(obj, *methods)
Instrument the methods of an object.
- Parameters
obj – Object to instrument
methods – One or more methods to wrap in profiling tracepoints.
- tracepoint(name)
Create a context tracepoint
>>> with channel.tracepoint("DoingSomething"):
...     expensiveCall()
- Parameters
name – Name associated with this tracepoint.
3.9.3. IPU Model
By default PopTorch will try to attach to a physical IPU.
If instead you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:
export POPTORCH_IPU_MODEL=1
Please see the Poplar and PopLibs User Guide for the limitations of the IPU Model.
3.9.4. Wait for an IPU to become available
By default if you try to attach to an IPU but all the IPUs in the system are
already in use, an exception will be raised.
If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1.
export POPTORCH_WAIT_FOR_IPU=1