4. Features
4.1. Options
You can change how PopTorch compiles and executes models using poptorch.Options
.
You can find a full list of options in the Options section of the Reference chapter.
Broadly speaking, the options fall into the following categories:
- General options (see poptorch.Options)
- Options related to half precision (see poptorch.options._PrecisionOptions)
- Management of the training process (see poptorch.options._TrainingOptions)
- Control of distributed execution environments (see poptorch.options._DistributedOptions)
- Location of tensors (see poptorch.options._TensorLocationOptions and poptorch.TensorLocationSettings)
- Options relevant to the Torch JIT compiler (see poptorch.options._JitOptions)
See Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation and replication_factor interact with the output and input sizes.
You can choose to use the IPU Model or the real IPU hardware via poptorch.Options.useIpuModel.
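For illustration, here is a minimal sketch of creating an options object and passing it to a wrapping function; the specific option values are placeholders:
import poptorch

opts = poptorch.Options()
opts.deviceIterations(4)     # number of device iterations per host call
opts.replicationFactor(1)    # number of replicas of the model
opts.useIpuModel(True)       # run on the IPU Model emulator instead of hardware

# The options object is then passed to a wrapping function, for example:
# poptorch_model = poptorch.inferenceModel(model, opts)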
4.2. Model wrapping functions
The basis of PopTorch integration comes from these two model wrapping functions.
4.2.1. poptorch.trainingModel
This function wraps around a PyTorch model, yielding a PopTorch model that may be run on the IPU in training mode. See poptorch.trainingModel() for a complete reference.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train on IPU.
for i in range(0, 800):
    # Each call here executes the forward pass, loss calculation, and backward
    # pass in one step.
    # Model input and loss function input are provided together.
    poptorch_out, loss = poptorch_model(input, target)
    print(f"{i}: {loss}")

# Copy the trained weights from the IPU back into the host model.
poptorch_model.copyWeightsToHost()

# Execute the trained weights on host.
model.eval()
native_out = model(input)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-06, atol=1e-06)
4.2.2. poptorch.inferenceModel
This function wraps around a PyTorch model, yielding a PopTorch model that can be run on the IPU in inference mode. See poptorch.inferenceModel() for a complete reference.
import torch
import torchvision
import poptorch

# Some dummy imagenet sized input.
picture_of_a_cat_here = torch.randn([1, 3, 224, 224])

# The model, in this case a MobileNet model with pretrained weights that comes
# canned with Pytorch.
model = torchvision.models.mobilenet_v2(pretrained=True)
model.train(False)

# Wrap in the PopTorch inference wrapper
inference_model = poptorch.inferenceModel(model)

# Execute on IPU.
out_tensor = inference_model(picture_of_a_cat_here)

# Get the top 5 ImageNet classes.
top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
print(top_five_classes)

# Try the same on native PyTorch
native_out = model(picture_of_a_cat_here)

native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
assert any(top_five_classes[1][0] == native_top_five_classes[1][0])
4.2.3. poptorch.PoplarExecutor
This class should not be created directly but is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It only has a few methods which can be used to interface with the IPU.
The PoplarExecutor will implicitly keep the parameters of the source PyTorch model and the PopTorch model(s) in sync. However, weights need to be explicitly copied if the model is trained on the CPU and inference is run on the IPU.
See PoplarExecutor for a complete description of the IPU interface functionality.
model = Model()
poptorch_train = poptorch.trainingModel(model)
poptorch_inf = poptorch.inferenceModel(model)

train(poptorch_train)
torch.save(model.state_dict(), "model.save")  # OK
validate(poptorch_inf)  # OK
validate(model)  # OK

train(model)
# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)
4.2.4. poptorch.isRunningOnIpu
One useful utility function is poptorch.isRunningOnIpu(). This returns True when executing on the IPU and False when executing the model outside IPU scope. This allows for different code paths within the model.
A common use case is executing equivalent code to a PopART custom operator when running on CPU. For example:
class Network(torch.nn.Module):
    def forward(self, x, y):
        if poptorch.isRunningOnIpu():
            # IPU path
            return my_custom_operator(x, y)
        else:
            # CPU path
            return my_torch_implementation(x, y)
4.3. Parallel execution
This section demonstrates multi-IPU strategies for parallel execution in PopTorch. We recommend that you start such parallel programming from PopTorch code that is working properly on a single IPU.
There are four kinds of execution strategies in total to run a model on a multi-IPU device:
- poptorch.ShardedExecution
- poptorch.PipelinedExecution
- poptorch.SerialPhasedExecution
- poptorch.ParallelPhasedExecution
These execution strategies are set through poptorch.Options.setExecutionStrategy(). The default execution strategy is poptorch.PipelinedExecution.
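For example, a minimal sketch of setting the strategy explicitly on an options object (the poptorch.AutoStage argument shown here is optional and corresponds to the default block-to-stage mapping):
opts = poptorch.Options()
# Explicitly select the default, pipelined strategy.
opts.setExecutionStrategy(
    poptorch.PipelinedExecution(poptorch.AutoStage.SameAsIpu))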
In the following sections, we first introduce the general APIs that apply to all four parallel execution strategies, and then explain each strategy with examples.
By default, PopTorch will not let you run the model if the number of IPUs is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. However, you can also enable poptorch.Options.autoRoundNumIPUs() to automatically round up the number of IPUs reserved to a power of 2, with the excess being reserved but idle. This option is not enabled by default to prevent unintentional overbooking of IPUs.
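For example:
opts = poptorch.Options()
# Round the number of IPUs reserved up to a power of 2 if necessary.
opts.autoRoundNumIPUs(True)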
4.3.1. Annotation tools
poptorch.Block and poptorch.BeginBlock
poptorch.BeginBlock and poptorch.Block are wrapper classes used to define model parallelism in a multi-IPU device. They partition models into "blocks" that will be executed on different IPUs.
You can use poptorch.Block to define a scope in the context of the model.
In the example below, all layers before model.bert.encoder.layer[0] will be put on IPU 0 and all layers from model.bert.encoder.layer[0] onwards (inclusive) will be on IPU 1.
import transformers
import torch
import poptorch

# A bert model from hugging face. See the packaged BERT example for actual usage.
pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'


# For later versions of transformers, we need to wrap the model and set
# return_dict to False
class WrappedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wrapped = transformers.BertForQuestionAnswering.from_pretrained(
            pretrained_weights)

    def forward(self, input_ids, attention_mask, token_type_ids):
        return self.wrapped.forward(input_ids,
                                    attention_mask,
                                    token_type_ids,
                                    return_dict=False)

    def __getattr__(self, attr):
        try:
            return torch.nn.Module.__getattr__(self, attr)
        except torch.nn.modules.module.ModuleAttributeError:
            return getattr(self.wrapped, attr)


model = WrappedModel()

# A handy way of seeing the names of all the layers in the network.
print(model)

# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers
# from "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
                                                  ipu_id=1)

# Now all layers before layer are on IPU 1 and this layer onward is on IPU 2
model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
                                                  ipu_id=2)

# Finally all layers from this layer till the end of the network are on IPU 3.
model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
                                                  ipu_id=3)

# We must batch the data by at least the number of IPUs. Each IPU will still
# execute whatever the model batch size is.
data_batch_size = 4

# Create a poptorch.Options instance to override default options
opts = poptorch.Options()
opts.deviceIterations(data_batch_size)
poptorch.BeginBlock is an annotation defined outside the model, and applied to current and onward layers. Both forms can be used interchangeably.
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):

        # Explicit layers on a certain IPU
        poptorch.Block.useAutoId()
        with poptorch.Block(ipu_id=0):
            x = self.act(self.layer1(x))

        with poptorch.Block(ipu_id=1):
            x = self.act(self.layer2(x))

        with poptorch.Block(ipu_id=2):
            x = self.act(self.layer3(x))
            x = self.act(self.layer4(x))

        with poptorch.Block(ipu_id=3):
            x = self.softmax(x)
        return x


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))
Either annotation is enough to enable parallel execution in simple cases. By default, the layers before the first poptorch.BeginBlock will be placed on IPU 0.
Both poptorch.BeginBlock and poptorch.Block need to follow a set of rules:
- All the layers must be declared inside a poptorch.Block scope, to avoid missing annotations. poptorch.BeginBlock doesn't have the same constraint because all the layers called after it will automatically be added to the last poptorch.BeginBlock.
- Please note that PopTorch needs to reserve IPUs in powers of 2 or multiples of 64. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.
- Unused or dead layers should NOT be included in any poptorch.BeginBlock or poptorch.Block.
- If layer A happens before layer B inside the model and each layer has a poptorch.BeginBlock associated with it, you need to write the poptorch.BeginBlock for layer A before the poptorch.BeginBlock for layer B.
Failing to obey the above rules will result in compilation errors.
poptorch.Stage and poptorch.AutoStage
Conceptually, poptorch.BeginBlock or poptorch.Block collects the layers of a model into a poptorch.Stage, multiple stages can be combined into a poptorch.Phase, and multiple phases form a parallel execution strategy.
poptorch.Stage
poptorch.Stage defines some layers of the model to run on one IPU. It can be made of one or more blocks created using poptorch.BeginBlock or poptorch.Block and identified by their user_id. Consecutive layers in a model can be defined either in the same poptorch.Stage or in consecutive stages. Whether stages run in parallel or sequentially depends on the specific parallel execution strategy. Internally, each operation in a model is assigned a stage_id through poptorch.Stage.
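For illustration, a minimal sketch (the block user_ids "layer1" and "layer2" are hypothetical names assumed to have been declared in the model) of grouping two blocks into one stage placed on IPU 0:
# Combine the blocks registered as "layer1" and "layer2" into a single stage
# and place that stage on IPU 0.
stage0 = poptorch.Stage("layer1", "layer2").ipu(0)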
poptorch.AutoStage
You can use poptorch.AutoStage if you don't want to specify poptorch.Stage by hand. It will assign one poptorch.Stage per poptorch.BeginBlock or poptorch.Block.
By default poptorch.AutoStage.SameAsIpu is in use, which means the stage_id of the poptorch.Stage will be set to the ipu_id specified for the poptorch.BeginBlock or poptorch.Block. Please note that stage_id must be ascending in poptorch.PipelinedExecution.
Let's use the code example above. Suppose your blocks "0", "1", and "2" are assigned to IPUs 0, 1, and 0 respectively. Then the poptorch.Block "2" will be assigned stage_id 0. This will make the compiler fail to schedule the last two stages "1" and "2" due to a conflict: the model implies "1" should run earlier than "2", but their stage_id values suggest "2" should run earlier than "1".
When poptorch.AutoStage.AutoIncrement is in use, each new poptorch.BeginBlock or poptorch.Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.
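A minimal sketch of opting in to this behaviour, assuming an options object opts has been created:
opts = poptorch.Options()
# Give each new block an automatically incremented stage_id instead of
# deriving the stage_id from the block's ipu_id.
opts.setExecutionStrategy(
    poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement))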
poptorch.Phase
poptorch.Phase defines a processing unit of phased execution. It may contain one or more poptorch.Stage. poptorch.Phase is only used in poptorch.SerialPhasedExecution and poptorch.ParallelPhasedExecution. It is not used in poptorch.ShardedExecution and poptorch.PipelinedExecution.
with poptorch.Block("A"): layer() with poptorch.Block("B"): layer() p = Phase(poptorch.Stage("A").ipu(0), poptorch.Stage("B").ipu(1))
In the code snippet above, “A” and “B” will run in parallel on IPU 0 and 1 simultaneously since they are placed in two stages. They will run sequentially on one IPU if they are placed in a single stage.
Advanced annotation with strings
You can use Python strings to represent the user_id and ipu_id for a poptorch.Block or poptorch.BeginBlock. Since strings are evaluated at runtime, they allow for a dynamic number of stages and phases.
The example below uses formatted strings (f-strings) with poptorch.ParallelPhasedExecution.
In the code example below, f-strings are used on two lines in the forward() method. One is f"phase{phase}_ipu{ipu}" at Line 25, where phase is 0, 1, 1, 2, 3, 3, 4, 5, and 5 respectively, and ipu ranges from 0 to 1. The total number of instances of this f-string is 12, due to 6 phases and 2 IPUs. The other is f"phase{N*2-1}_ipu1" at Line 32, where phase is 5 and ipu is 1.
When defining poptorch.Stage, four f-strings are used where n ranges from 0 to 2, at Lines 46-47 and 50-51:
f"phase{2*n}_ipu0"
f"phase{2*n}_ipu1"
f"phase{2*n+1}_ipu0"
f"phase{2*n+1}_ipu1"
They refer to phases 0, 2, 4 and 1, 3, 5, with ipu0 and ipu1 respectively.
So all these 12 f-strings are defined in poptorch.BeginBlock and used in poptorch.Stage dynamically. They match exactly.
 1  poptorch.setLogLevel(1)  # Force debug logging
 2  N = 3
 3  size = 10
 4
 5
 6  class Model(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.weights = []
10          for n in range(N * 6):
11              weight = torch.nn.Parameter(torch.rand(size, size),
12                                          requires_grad=True)
13              self.register_parameter(f"w{n}", weight)
14              self.weights.append(weight)
15
16      def forward(self, in0, target=None):
17          phase = 0
18          weight = iter(self.weights)
19          with poptorch.Block("phase0_ipu0"):
20              ins = torch.split(in0, size)
21          for n in range(N * 3):
22              out = []
23              for ipu in range(2):
24                  x = ins[ipu]
25                  with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                      x = torch.matmul(next(weight), x)
27                      out.append(F.relu(x))
28              ins = out[1], out[0]
29              # We want 2 matmuls in the same phase
30              if n % 3 != 1:
31                  phase += 1
32          with poptorch.Block(f"phase{N*2-1}_ipu1"):
33              res = ins[0] + ins[1]
34          if target is None:
35              return res
36          return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39  input = torch.rand(size * 2, 1)
40  target = torch.rand(size, 1)
41  model = Model()
42  opts = poptorch.Options()
43  phases = []
44  # Alternate between 0-2 and 1-3
45  for n in range(N):
46      phases.append([
47          poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48          poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49      ])
50      phases.append([
51          poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52          poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53      ])
54  opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55  poptorch_model = poptorch.trainingModel(model, opts)
56  poptorch_model.compile(input, target)
4.3.2. Parallel execution strategies
With the above APIs as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below. Note that the same annotation can be used for each of them. They only differ in the method of parallelisation and tensor locations.
poptorch.ShardedExecution
In this strategy, each IPU will sequentially execute a distinct part of the model. A single unit of processing in poptorch.ShardedExecution is a shard. A shard is specified using poptorch.Stage, or, if no poptorch.Stage is specified, the user_id passed by poptorch.BeginBlock or poptorch.Block is used.
Each shard is executed sequentially on a single IPU. Multiple shards can be placed on multiple IPUs. However, only one IPU is used at a time, while the other IPUs are idle. If an IPU is allocated to run consecutive stages, PopART will merge those consecutive stages into one on the same IPU. Weights and activations will use the on-chip memory of the IPUs. Layers sharing weights need to be placed on the same IPU.
poptorch.ShardedExecution can be useful for processing a single sample or for debugging. Overall it has low efficiency since only one IPU is used at a time.
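A minimal sketch of selecting this strategy, assuming the model defines blocks with user_ids "A" and "B":
# Run block "A" on IPU 0 and block "B" on IPU 1, one shard at a time.
strategy = poptorch.ShardedExecution(poptorch.Stage("A").ipu(0),
                                     poptorch.Stage("B").ipu(1))
opts = poptorch.Options()
opts.setExecutionStrategy(strategy)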
poptorch.PipelinedExecution
This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.
Parallelisation in poptorch.PipelinedExecution requires deviceIterations() and gradientAccumulation(), as explained in Efficient data batching. After one poptorch.Stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel. An IPU can only start its own poptorch.Stage of a batch once the previous poptorch.Stage of the current batch has been processed. Hence, all IPUs will be occupied after a warm-up period. A cool-down period is required to aggregate the results and apply weight changes.
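A minimal sketch, assuming blocks with user_ids "0" to "3" have been defined and using placeholder batching values:
strategy = poptorch.PipelinedExecution(poptorch.Stage("0").ipu(0),
                                       poptorch.Stage("1").ipu(1),
                                       poptorch.Stage("2").ipu(2),
                                       poptorch.Stage("3").ipu(3))
opts = poptorch.Options()
opts.setExecutionStrategy(strategy)
# Pipelining needs enough batches in flight to fill the pipeline.
opts.deviceIterations(8)
opts.Training.gradientAccumulation(4)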
Phased execution
poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution have the following features in common:
- A portion of the weights and activations are transferred to and from streaming memory, before and after each phase.
- If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.
- This specific portion is needed by the layers of the model wrapped in poptorch.BeginBlock or poptorch.Block in the current poptorch.Phase.
- They both trade off some performance for larger models with higher memory needs.
- Any number of phases is allowed.
- The number of stages in each poptorch.Phase should match the number of IPUs in each group of IPUs.
- Stages inside each poptorch.Phase can run in parallel.
Although you only define the poptorch.Phase for the forward passes, the corresponding phases for the backward passes are created automatically. The order of phased execution for backward passes won't change, but you can decide whether a phase is shared by both the forward and backward passes. In other words, you decide whether to avoid a memory transfer of a portion of the weights and activations.
poptorch.SerialPhasedExecution
In poptorch.SerialPhasedExecution, phases execute on a single group of IPUs sequentially.
strategy = poptorch.SerialPhasedExecution([
    poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
    poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
    poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2"))])

strategy.phase(0).ipus(0, 1)
strategy.phase(1).ipus(0, 1)
strategy.phase(2).ipus(0, 1)

opts.setExecutionStrategy(strategy)
The code above causes all phases to run serially on IPUs 0 and 1.
poptorch.ParallelPhasedExecution
In poptorch.ParallelPhasedExecution, phases are executed in parallel, alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from streaming memory if the desired weights and activations are already available in another group of IPUs.
strategy = poptorch.ParallelPhasedExecution([
    poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
    poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
    poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5"))])

strategy.phase(0).ipus(0, 2)
strategy.phase(1).ipus(1, 3)
strategy.phase(2).ipus(0, 2)

opts.setExecutionStrategy(strategy)
In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so the number of stages matches the number of IPUs in each group. Even phases 0 and 2 run on IPUs 0 and 2, while odd phase 1 runs on IPUs 1 and 3, as required. This allows for faster cross-IPU copies, both inter-phase and intra-phase.
poptorch.Liveness
poptorch.Liveness controls the availability of tensors on the IPU, and is only needed for poptorch.ParallelPhasedExecution and poptorch.SerialPhasedExecution. The default poptorch.Liveness is AlwaysLive. OffChipAfterFwd and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.
4.4. Optimizers
PopTorch supports the following optimizers:
- SGD (see poptorch.optim.SGD)
- Adam (see poptorch.optim.Adam)
- AdamW (see poptorch.optim.AdamW)
- RMSprop (see poptorch.optim.RMSprop)
- LAMB (see poptorch.optim.LAMB)
In addition, PopTorch has features to support float16 models, such as loss scaling, velocity scaling, bias correction and accumulator types.
Important
All of these extra attributes (except velocity_scaling) cannot have different values for different param_groups and therefore must be set at the optimizer level.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         velocity_scaling=2.0)
poptorch_model = poptorch.trainingModel(model, options, opt)
poptorch_model(input, target)
# Update optimizer attribute
opt.loss_scaling = 1.0
# Update param_group attribute
opt.param_groups[0]["velocity_scaling"] = 1.0
# Set the new optimizer in the model
poptorch_model.setOptimizer(opt)
poptorch_model(input, target)
Important
You must call setOptimizer()
for the new optimizer values to be applied to the model.
4.4.1. Loss scaling
When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.
Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling
parameter.
PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state.
Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.
Higher loss_scaling
values can improve numerical stability by minimising underflow.
However, too high a value can result in overflow.
The optimal loss scaling factor depends on the model.
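A minimal sketch, assuming a float16 model, an options object opts, and placeholder hyper-parameter values:
# Scale the loss by 128 before the backward pass; gradients are rescaled by
# the inverse before the optimizer state is updated.
opt = poptorch.optim.AdamW(model.parameters(), lr=1e-4, loss_scaling=128.0)
poptorch_model = poptorch.trainingModel(model, opts, opt)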
4.4.2. Velocity scaling (SGD only)
The SGD optimizer, when used with momentum, updates weights based on the velocity values.
At each update step, the new velocity is a combination of the gradients derived from the loss function and the previous velocity value.
Similar to loss scaling, the velocity_scaling parameter allows the velocity values to be scaled to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling so the loss_scaling has no impact on the effective scaling of velocity parameters.)
As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
4.4.3. Accumulation types
In order to improve numerical stability some of the optimizers (LAMB, Adam, AdamW, RMSprop) give you the option to tweak the data type used by the optimizer’s accumulators.
accum_type lets you choose the type used for gradient accumulation. first_order_momentum_accum_type and second_order_momentum_accum_type give you control over the type used to store the first-order and second-order momentum optimizer states.
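A minimal sketch, assuming a model and illustrative type choices (float16 gradient accumulation with float32 momenta):
opt = poptorch.optim.Adam(model.parameters(),
                          lr=1e-4,
                          accum_type=torch.float16,
                          first_order_momentum_accum_type=torch.float32,
                          second_order_momentum_accum_type=torch.float32)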
4.4.4. Constant attributes
In order to improve performance and/or save memory, PopTorch will try to embed the attributes which are constant directly in the program.
Important
Trying to modify a constant attribute after the model has been compiled will result in an error.
For PopTorch optimizers (those from the poptorch.optim namespace), by default the attributes explicitly passed to the optimizer's constructor will be considered variables and the others will be considered constant. This behaviour can be overridden using markAsConstant() and markAsVariable() before the model is compiled.
# lr, momentum and velocity_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
# momentum and velocity_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01)
# lr and momentum will be marked as variable.
# velocity_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         velocity_scaling=2.0)
opt.variable_attrs.markAsConstant("velocity_scaling")
# lr, momentum and velocity_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01, velocity_scaling=2.0)
opt.variable_attrs.markAsVariable("momentum")
For native optimizers (those from the torch.optim namespace), the attributes which are left at their default values in the constructor will be considered constant. There is no way to override this behaviour, which is why we recommend you always use the poptorch.optim optimizers instead.
# momentum will be marked as constant (it's not set)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# lr will be marked as variable.
# momentum will still be marked as constant (because its default value is 0.0)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
# lr and momentum will both be marked as variable.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=1.0)
Note
There is an exception: lr
is always marked as variable.
4.5. Custom ops
These are helper operations to be used within a model.
4.5.1. poptorch.ipu_print_tensor
Adds an op to print the content of a given IPU tensor.
Warning
To prevent the print operation being optimised out by the graph optimiser, you must use the output of the print.
class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        x = x + 1

        # It is important to make sure the result of the print is used.
        x = poptorch.ipu_print_tensor(x)

        return x + self.bias
For more information see: poptorch.ipu_print_tensor().
4.5.2. poptorch.identity_loss
This function is used to implement custom losses. This takes in a single PyTorch tensor and will backpropagate a gradient of ones through it.
Warning
Passing a PyTorch loss function or another identity_loss to this function is not supported. Multiple losses must be implemented via composite PyTorch ops.
def custom_loss(output, target):
    # Mean squared error with a scale
    loss = output - target
    loss = loss * loss * 5
    return poptorch.identity_loss(loss, reduction="mean")


class ExampleModelWithCustomLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = ExampleModel()

    def forward(self, input, target):
        out = self.model(input)
        return out, custom_loss(out, target)
For more information see: poptorch.identity_loss().
4.5.3. poptorch.MultiConv
Use the poptorch.MultiConv wrapper class to define multi-convolutions. Please refer to the PopLibs documentation for multi-convolutions for further information.
For more information see: poptorch.MultiConv and poptorch.MultiConvPlanType.
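For illustration, a minimal sketch (the layer shapes are arbitrary) of executing two independent convolutions as a single multi-convolution using the poptorch.MultiConv context manager:
import torch
import poptorch


class TwoConvs(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_a = torch.nn.Conv2d(4, 4, 3)
        self.conv_b = torch.nn.Conv2d(4, 4, 3)

    def forward(self, x, y):
        # Convolutions inside this scope are combined into one
        # multi-convolution when the model runs on the IPU.
        with poptorch.MultiConv():
            return self.conv_a(x), self.conv_b(y)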
4.5.4. poptorch.custom_op
This is for users who are familiar with PopART. If you need some special features that are not supported in PopART, you may write a PopART custom op. For more information about how to create PopART custom ops see Creating custom operations and Building custom operators using PopART. You can call such a PopART custom op using poptorch.custom_op in PopTorch.
It takes three steps to enable a PopART custom op in PopTorch.
First, set the Poplar and PopART environment variables as shown in Setting the environment variables and compile the PopART custom op. You can compile your custom op C++ code and link it with Poplar and PopART to generate a dynamic library. Please refer to the custom op code custom_cube_op.cpp and its CMakeLists.txt under poptorch/tests/custom_ops.
Second, load the dynamic library.
myso = list(pathlib.Path("tests").rglob("libcustom_cube_op.*"))
assert myso, "Failed to find libcustom_cube_op"
myop = ctypes.cdll.LoadLibrary(myso[0])
Finally, use poptorch.custom_op to finish the call. Its wrapper class is specified below. For more information see: poptorch.custom_op.
In the PopART custom op, both forward op and backward op are implemented. In the PopTorch inference model, only the forward op will be called.
def test_inference():
    class BasicNetwork(nn.Module):
        def forward(self, x, bias):
            x, y = poptorch.custom_op([x, bias],
                                      "Cube",
                                      "com.acme",
                                      1,
                                      example_outputs=[x, x])
            return x, y
In the code example above, example_outputs is assigned as [x, x], where x is one of the input tensors which is used as a template to provide the right number of output tensors. The real outputs will be allocated memory, calculated and returned by the custom op.
You can also call this custom op inside a training model using exactly the same poptorch.custom_op interface, and the backward op will be called automatically.
You can pass attributes to custom ops using a Python dictionary, as shown by the following code example:
class Model(torch.nn.Module):
    def forward(self, x):
        x = poptorch.custom_op([x],
                               "LeakyRelu",
                               "com.acme",
                               1,
                               example_outputs=[x],
                               attributes={"alpha": 0.02})
        return x[0]
You can then obtain the attributes from within the C++ code. The above code passes a Float attribute with the name alpha to the LeakyReLU implementation described in the Custom operations chapter of the PopART user guide.
PopTorch supports all attribute types supported in PopART except for Graph. Please refer to the following table and code examples for information on how to pass other attribute types to a PopART custom op implementation:
| PopART attribute type | Python equivalent |
|---|---|
| Float | Python float (converted to 32-bit) |
| Floats | list/tuple of Python floats |
| Int | Python int (converted to 64-bit signed int) |
| Ints | list/tuple of Python ints |
| String | Python str (converted to ASCII) |
| Strings | List/tuple of Python strs |
| Graph | Not supported |
def test_many_attributes_examples():
    class Model(torch.nn.Module):
        def forward(self, x):
            attributes = {
                "float_one": 1.0,
                "float_minus_two": -2.0,
                "int_zero": 0,
                "int_minus_five": -5,
                "floats_one_two_three": [1.0, 2.0, 3.0],
                "floats_minus_one_two_three": [-1.0, -2.0, -3.0],
                "ints_one_two_three": [1, 2, 3],
                "ints_minus_one_two_three": [-1, -2, -3],
                "a_string": "string with quotes and slash \" ' \\ end",
                "strs": ["abc", "def", "ghi"]
            }

            x = poptorch.custom_op([x],
                                   "ManyAttributeOp",
                                   "test.poptorch",
                                   1,
                                   example_outputs=[x],
                                   attributes=attributes)
4.5.5. poptorch.nop
PopTorch includes a "no-op" function for debugging purposes. For more information see: poptorch.nop().
4.5.6. poptorch.serializedMatMul
Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.
For more information see: poptorch.serializedMatMul().
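As an illustration, a minimal sketch inside a model's forward pass; the serialization mode and factor shown here are assumptions chosen for the example, so check the reference for the exact parameters:
class SerializedLayer(torch.nn.Module):
    def forward(self, lhs, rhs):
        # Split the matmul over its output channels into 4 smaller matmuls
        # to reduce peak memory use.
        return poptorch.serializedMatMul(
            lhs, rhs, poptorch.MatMulSerializationMode.OutputChannels, factor=4)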
4.5.7. poptorch.set_available_memory
Use this function to override the proportion of tile memory available to be used as temporary memory by a convolution or matrix multiplication. For more information see: poptorch.set_available_memory().
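As an illustration, a minimal sketch (the layer and the 0.2 proportion are arbitrary); the function is applied to the output of the operation it should affect:
class Classifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 16)

    def forward(self, x):
        x = self.fc(x)
        # Let the preceding matmul use at most 20% of tile memory as
        # temporary memory.
        return poptorch.set_available_memory(x, 0.2)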
4.6. Miscellaneous functions
PopTorch also provides some miscellaneous functions that are not related to model creation; see the Reference chapter for the full list.
4.7. Half / float 16 support
PopTorch supports the half-precision floating point (float 16) format. You can simply input float 16 tensors into your model. (You can convert a tensor to float 16 using tensor = tensor.half().)
You can use your models in one of the following ways:
- Convert all parameters (weights) to float 16 by using a Module's .half() method. This is the most memory efficient, however small updates to weights may be lost, hindering training.
- Keep the parameters (weights) as float 32, in which case the parameter updates will occur using float 32. However, the parameters will be converted to float 16 if you call an operation with a float 16 input. This is more memory efficient than using float 32 tensors (inputs) but less memory efficient than using float 16 weights.
- Use a mix of float 32 and float 16 parameters by manually specifying parameters as float 16 or float 32.
Note
When PyTorch encounters a mix of float 16 and float 32 inputs for a given operation, it will usually cast all inputs to float 32. PopTorch differs and will cast all inputs to float 16. This makes it easier to build models with float 32 weights which take float 16 tensors. However, if you wish to follow PyTorch behavior, you can use opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat), where opts is the poptorch.Options object passed to the model wrapping function.
model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to halfs. Without doing so,
# the Linear parameters will automatically be cast to half, which allows
# training with float32 parameters but half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half
Because PopTorch relies on the torch.jit.trace
API, it is limited to tracing operations which run on the CPU.
Many of these operations do not support float 16 inputs.
To allow the full range of operations, PopTorch converts all float 16 inputs to float 32 before tracing and then restores the inputs to float 16 as part of the canonicalization process.
Some operations may result in the model running in float 32 where float 16 would
be expected, or vice versa (see Float 16 operations for full details).
4.8. Profiling
You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.
4.9. Precompilation and caching
4.9.1. Caching
By default PopTorch will re-compile the model every time you instantiate a model. However if you often run the same models you might want to enable executable caching to save time.
You can do this by either setting the POPTORCH_CACHE_DIR environment variable or by calling poptorch.Options.enableExecutableCaching.
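For example, a minimal sketch (the cache path is arbitrary):
opts = poptorch.Options()
# Compiled executables are written to, and reloaded from, this directory.
opts.enableExecutableCaching("/tmp/poptorch_cache")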
Warning
The cache directory might grow large quickly because PopTorch doesn't evict old models from the cache and, depending on the number and size of your models and the number of IPUs used, the executables might be quite large. It is your responsibility to delete unwanted cache files.
4.9.2. Precompilation
PopTorch supports precompilation: This means you can compile your model on a machine which doesn’t have an IPU and export the executable to a file. You can then reload and execute it on a different machine which does have an IPU.
Important
The PopTorch versions on both machines must be an exact match.
To precompile your model you need to wrap it using either poptorch.trainingModel() or poptorch.inferenceModel(), then call compileAndExport() on the wrapper.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

opts = poptorch.Options()
# You don't need a real IPU to compile the executable.
opts.useOfflineIpuTarget(ipu_target_version)

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model.compileAndExport(filename, input, target)
Note
If you don't know the IPU version on your system you can use poptorch.ipuHardwareVersion().
The exported file by default will contain your original Torch model (including the weights), and enough information to re-create the PopTorch wrapper and reload the executable.
Important
For your model and weights to be exported, your model must be picklable. See https://docs.python.org/3/library/pickle.html for more information.
If your model is not picklable, use export_model=False; see below for a complete example.
The Torch model, the PopTorch wrapper and the executable can then all be restored on the target machine using poptorch.load():
poptorch_model = poptorch.load(filename)

# That's all: your model is ready to be used.
poptorch_model(input, target)  # Run on IPU
In some cases you might want to provide some runtime information to select the device: this can be done using the edit_opts_fn argument of poptorch.load():
def setIpuDevice(opts):
    opts.useIpuId(1)  # always use IPU 1


poptorch_model = poptorch.load(filename, edit_opts_fn=setIpuDevice)
poptorch_model(input, target)  # Run on IPU 1
Note
Only runtime options will be used, as the executable is already compiled.
Going back to the precompilation step: in some cases you might want to export only the executable and not the Python wrapper or Torch model (for example, if your model cannot be pickled):
poptorch_model.compileAndExport(filename, input, target, export_model=False)
This means you will need to re-create and wrap the model yourself before loading the executable:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)
poptorch_model.loadExecutable(filename)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model(input, target)  # Run on IPU
Important
Exported models lose their connections to other models.
For example, if you have a poptorch.trainingModel() and a poptorch.inferenceModel() based on the same PyTorch model, you wouldn't usually need to keep the weights synchronised between the two: PopTorch would take care of it implicitly for you. For example:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train the model:
for epoch in epochs:
    training_model(input, target)

# Weights are implicitly copied from the training model
# to the validation model
prediction = validation_model(input)
If you were to export these models:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
training_model.compileAndExport("training.poptorch", input, target)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)
validation_model.compileAndExport("validation.poptorch", input)
Note
Don't forget to call model.eval() or model.train() as required before calling compileAndExport().
You would then either need to insert explicit copy operations:
training_model = poptorch.load("training.poptorch")
validation_model = poptorch.load("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Need to explicitly copy weights between the two models
    # because they're not connected anymore.
    training_model.copyWeightsToHost()
    validation_model.copyWeightsToDevice()
    run_validation(validation_model)
Or you would need to re-connect the two models by creating the second one from the first one and then loading the executable:
training_model = poptorch.load("training.poptorch")
# Create a validation python model based on the training model
validation_model = poptorch.inferenceModel(training_model)
validation_model.model.eval()
# Load the executable for that model:
validation_model.loadExecutable("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Nothing to do: training_model and validation_model are now connected
    # and PopTorch will implicitly keep the weights in sync between them.
    run_validation(validation_model)
4.10. Environment variables
4.10.1. Logging level
PopTorch uses the following levels of logging:
- OFF: No logging.
- ERR: Errors only.
- WARN: Warnings and errors only.
- INFO: Info, warnings and errors. (Default)
- DEBUG: Adds some extra debugging information.
- TRACE and TRACE_ALL: Trace everything inside PopTorch.
The POPTORCH_LOG_LEVEL environment variable can be used to set the logging level:
export POPTORCH_LOG_LEVEL=DEBUG
4.10.2. Profiling
When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.
In order to capture the reports needed for the PopVision Graph Analyser you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'
By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'
For more options, please refer to the PopVision Graph Analyser User Guide.
In order to capture the pvti reports needed for the PopVision System Analyser you only need to set PVTI_OPTIONS='{"enable":"true"}'.
You can also add extra tracepoints in your own code by using poptorch.profiling.Channel.
4.10.3. IPU Model
By default PopTorch will try to attach to a physical IPU. If, instead, you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:
export POPTORCH_IPU_MODEL=1
Please see the Poplar and PopLibs User Guide for the limitations of the IPU Model.
4.10.4. Wait for an IPU to become available
By default, if you try to attach to an IPU but all the IPUs in the system are already in use, an exception will be raised. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1:
export POPTORCH_WAIT_FOR_IPU=1
4.10.5. Enable executable caching
Executable caching can be enabled by either setting the POPTORCH_CACHE_DIR environment variable or by calling poptorch.Options.enableExecutableCaching.
See also: Caching, for more information.
export POPTORCH_CACHE_DIR=/tmp/poptorch_cache