4. Features
4.1. Options
You can change how PopTorch compiles and executes models using poptorch.Options.
You can find a full list of options in Section 10.1, Options.
Broadly speaking, the options fall into the following categories:
- General options (see Options)
- Options related to half precision (see opts.Precision.*)
- Management of the training process (see opts.Training.*)
- Location of tensors (see opts.TensorLocations.* and TensorLocationSettings)
- Options relevant to the Torch JIT compiler (see opts.Jit.*)
- Control of distributed execution environments when using tools other than PopRun (see opts.Distributed.*)
See Section 5, Efficient data batching for a full explanation of how device_iterations greater than 1, gradient_accumulation and replication_factor interact with the output and input sizes.
You can choose to use the IPU Model instead of IPU hardware with the useIpuModel() option.
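For example, the following minimal sketch builds an Options object and passes it to a model wrapping function; the specific values are purely illustrative:
import poptorch

opts = poptorch.Options()
opts.deviceIterations(4)    # process 4 batches per call to the model
opts.replicationFactor(1)   # no data-parallel replication
opts.useIpuModel(True)      # run on the IPU Model rather than IPU hardware

# The options are passed when wrapping the model, for example:
# poptorch_model = poptorch.inferenceModel(model, opts)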
4.1.1. Setting options via config file
In addition to setting these options programmatically, you can also set them in a config text file by using loadFromFile().
Each line in the file must contain a single command corresponding to setting an option in Options. To set an option within the file, write the command as you would within a Python script but omit the options. prefix. For example:
deviceIterations(1)
setExecutionStrategy(poptorch.ShardedExecution())
replicationFactor(1)
enableSyntheticData(True)
Then, instantiate Options and call loadFromFile():
opts = poptorch.Options()
opts.loadFromFile("tmp/poptorch.conf")
4.2. Model wrapping functions
The basis of PopTorch integration comes from the two model wrapping functions described in the following sections.
Note
PopTorch makes a shallow copy of the model. Changes to the parameters in the models returned by these two model wrapping functions affect the original model and vice versa. However, primitive variable types will not be kept in sync. This includes the training bool of torch.nn.Module.
If your PyTorch model is named model, call model.eval() or model.train(), if required, before calling these wrapping functions.
4.2.1. poptorch.trainingModel
This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in training mode. See trainingModel() for more information.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)
ones = torch.ones(10)

# Train on IPU.
for i in range(0, 800):
    # Each call here executes the forward pass, loss calculation, and backward
    # pass in one step.
    # Model input and loss function input are provided together.
    poptorch_out, loss = poptorch_model(input, target)
    print(f"{i}: {loss}")

# Copy the trained weights from the IPU back into the host model.
poptorch_model.copyWeightsToHost()

# Execute the trained weights on host.
model.eval()
native_out = model(input)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
torch.testing.assert_allclose(native_out, poptorch_out, rtol=1e-04, atol=1e-04)
Note
By default, PopTorch will only return the final batch of outputs. Please see Section 5.6, poptorch.Options.Training.anchorReturnType for details on what PopTorch returns when using trainingModel() and how you can calculate statistics, such as training accuracy, over all batches.
4.2.2. poptorch.inferenceModel
This function wraps a PyTorch model, yielding a PopTorch model that can be run on the IPU in inference mode. See inferenceModel() for more information.
import torch
import torchvision
import poptorch

# Some dummy imagenet sized input.
picture_of_a_cat_here = torch.randn([1, 3, 224, 224])

# The model, in this case a MobileNet model with pretrained weights that comes
# canned with Pytorch.
model = torchvision.models.mobilenet_v2(pretrained=True)
model.train(False)

# Wrap in the PopTorch inference wrapper
inference_model = poptorch.inferenceModel(model)

# Execute on IPU.
out_tensor = inference_model(picture_of_a_cat_here)

# Get the top 5 ImageNet classes.
top_five_classes = torch.topk(torch.softmax(out_tensor, 1), 5)
print(top_five_classes)

# Try the same on native PyTorch
native_out = model(picture_of_a_cat_here)

native_top_five_classes = torch.topk(torch.softmax(native_out, 1), 5)

# Models should be very close to native output although some operations are
# numerically different and floating point differences can accumulate.
assert any(top_five_classes[1][0] == native_top_five_classes[1][0])
4.2.3. poptorch.PoplarExecutor
You should not create this class directly. It is a wrapper around the model that was passed into inferenceModel() or trainingModel(). It has a few methods which you can use to interface with the IPU.
The PoplarExecutor will implicitly keep the parameters of the source PyTorch model and the PopTorch model(s) in sync. However, you need to explicitly copy the weights if the model is trained on the CPU and inference is run on the IPU.
See PoplarExecutor for a complete description of the IPU interface functionality.
model = Model()

model.eval()
poptorch_inf = poptorch.inferenceModel(model)

# Switch for "poptorch.trainingModel": poptorch_inf will remain in "eval" mode
model.train()
poptorch_train = poptorch.trainingModel(model)

# train on IPU
train(poptorch_train)
torch.save(model.state_dict(), "model.save")  # OK

# Already in "eval" mode
validate(poptorch_inf)  # OK

# switch to "eval" mode for CPU
model.eval()
validate(model)  # OK

# train on CPU
model.train()
train_on_cpu(model)

# Explicit copy needed
poptorch_inf.copyWeightsToDevice()
validate(poptorch_inf)
4.2.4. poptorch.isRunningOnIpu
One useful utility function is isRunningOnIpu(). This returns True when executing on the IPU and False when executing the model outside IPU scope. This allows for different code paths within the model.
A common use case is executing equivalent code to a PopART custom operator when running on the CPU. For example:
class Network(torch.nn.Module):
    def forward(self, x, y):
        if poptorch.isRunningOnIpu():
            # IPU path
            return my_custom_operator(x, y)
        else:
            # CPU path
            return my_torch_implementation(x, y)
4.3. Execution strategies
This section describes strategies to run PopTorch code on more than one IPU. Some of these allow code to be run in parallel on multiple IPUs. We recommend that you use these parallel execution strategies with PopTorch code that is already working correctly on a single IPU.
There are four kinds of execution strategies that you can use to run a model on a multi-IPU device:
- PipelinedExecution
- ShardedExecution
- SerialPhasedExecution
- ParallelPhasedExecution
You can select the strategy with the setExecutionStrategy() option. The default execution strategy is PipelinedExecution.
In the following, we first introduce the general functions that are relevant to all four parallel execution strategies. Finally, we explain the four strategies with examples.
By default, PopTorch will not let you run the model if the number of IPUs is not a power of 2. For this reason, it is preferable to annotate the model so that the number of IPUs used is a power of 2. However, you can also enable autoRoundNumIPUs() to automatically round up the number of IPUs reserved to a power of 2, with the excess being reserved but idle. This option is not enabled by default, to prevent unintentional overbooking of IPUs.
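For example (a minimal sketch):
opts = poptorch.Options()
# Round the number of IPUs reserved up to the next power of 2;
# any excess IPUs are reserved but remain idle.
opts.autoRoundNumIPUs(True)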
4.3.1. Annotation tools
poptorch.Block, poptorch.BeginBlock and poptorch.BlockFunction
BeginBlock and Block are wrapper classes used to define model parallelism in a multi-IPU device. They partition models into "blocks" to be executed on different IPUs.
You can use Block to define a scope in the context of the model.
In the example below, all layers before model.bert.encoder.layer[0] will be put on IPU 0 and all layers from model.bert.encoder.layer[0] onwards (inclusive) will be on IPU 1.
import transformers
import torch
import poptorch

# A bert model from hugging face. See the packaged BERT example for actual usage.
pretrained_weights = 'mrm8488/bert-medium-finetuned-squadv2'


# For later versions of transformers, we need to wrap the model and set
# return_dict to False
class WrappedModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.wrapped = transformers.BertForQuestionAnswering.from_pretrained(
            pretrained_weights)

    def forward(self, input_ids, attention_mask, token_type_ids):
        return self.wrapped.forward(input_ids,
                                    attention_mask,
                                    token_type_ids,
                                    return_dict=False)

    def __getattr__(self, attr):
        try:
            return torch.nn.Module.__getattr__(self, attr)
        except torch.nn.modules.module.ModuleAttributeError:
            return getattr(self.wrapped, attr)


model = WrappedModel()

# A handy way of seeing the names of all the layers in the network.
print(model)

# All layers before "model.bert.encoder.layer[0]" will be on IPU 0 and all layers from
# "model.bert.encoder.layer[0]" onwards (inclusive) will be on IPU 1.
model.bert.encoder.layer[0] = poptorch.BeginBlock(model.bert.encoder.layer[0],
                                                  ipu_id=1)

# Now all layers before layer are on IPU 1 and this layer onward is on IPU 2
model.bert.encoder.layer[2] = poptorch.BeginBlock(model.bert.encoder.layer[2],
                                                  ipu_id=2)

# Finally all layers from this layer till the end of the network are on IPU 3.
model.bert.encoder.layer[4] = poptorch.BeginBlock(model.bert.encoder.layer[4],
                                                  ipu_id=3)

# We must batch the data by at least the number of IPUs. Each IPU will still execute
# whatever the model batch size is.
data_batch_size = 4

# Create a poptorch.Options instance to override default options
opts = poptorch.Options()
opts.deviceIterations(data_batch_size)
BeginBlock is an annotation defined outside the model and applied to the current and onward layers (as in the example above), while Block defines a scope inside the model, as shown below. You can use both forms interchangeably.
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):

        # Explicit layers on a certain IPU
        poptorch.Block.useAutoId()
        with poptorch.Block(ipu_id=0):
            x = self.act(self.layer1(x))

        with poptorch.Block(ipu_id=1):
            x = self.act(self.layer2(x))

        with poptorch.Block(ipu_id=2):
            x = self.act(self.layer3(x))
            x = self.act(self.layer4(x))

        with poptorch.Block(ipu_id=3):
            x = self.softmax(x)
        return x


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))
In addition, BlockFunction() is a decorator which you can use to decorate an existing function:
class Network(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(5, 10)
        self.layer2 = torch.nn.Linear(10, 5)
        self.layer3 = torch.nn.Linear(5, 5)
        self.layer4 = torch.nn.Linear(5, 5)

        self.act = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, x):
        poptorch.Block.useAutoId()
        x = self.block_one(x)
        x = self.block_two(x)
        x = self.final_activation(x)
        return x

    @poptorch.BlockFunction(ipu_id=0)
    def block_one(self, x):
        x = self.act(self.layer1(x))
        x = self.act(self.layer2(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def block_two(self, x):
        x = self.act(self.layer3(x))
        x = self.act(self.layer4(x))
        return x

    @poptorch.BlockFunction(ipu_id=1)
    def final_activation(self, x):
        return self.softmax(x)


model = Network()
opts = poptorch.Options()
opts.deviceIterations(4)
poptorch_model = poptorch.inferenceModel(model, options=opts)
print(poptorch_model(torch.rand((4, 5))))
Either annotation is enough to enable parallel execution in the simple cases. By default, the layers before the first BeginBlock will be placed on IPU 0.
BeginBlock, Block and BlockFunction() need to follow a set of rules:
- You must declare all the layers inside a Block scope to avoid missing annotations. BeginBlock doesn't have the same constraint because all the layers called after it will automatically be added to the last BeginBlock.
- Note that PopTorch needs to reserve IPUs in powers of 2. You are advised to configure your model accordingly to take full advantage of the IPUs available. However, if you need to run with a different number of IPUs, you can use poptorch.Options().autoRoundNumIPUs(True) to allow PopTorch to reserve more IPUs than the model specifies.
- You should not include unused or dead layers in any BeginBlock or Block.
- If layer A happens before layer B inside the model and each layer has a BeginBlock associated with it, you need to write the BeginBlock for layer A before the BeginBlock for layer B.
Failing to obey the above rules will result in compilation errors.
poptorch.Stage and poptorch.AutoStage
Conceptually, BeginBlock and Block collect the layers of a model into a Stage. You can combine multiple stages into a Phase. Multiple phases form a parallel execution strategy.
poptorch.Stage
Stage defines the layers of the model to run on one IPU. A stage can consist of one or more blocks created using BeginBlock or Block and identified by their user_id.
You can define consecutive layers in a model in either the same stage or consecutive stages. Whether stages run in parallel or sequentially depends on the specific parallel execution strategy.
Internally, each operation in a model is assigned a stage_id through Stage.
poptorch.AutoStage
You can use AutoStage if you don't want to specify stages by hand. This will assign one Stage per BeginBlock or Block.
By default, AutoStage.SameAsIpu is used, which means the stage_id of the Stage will be set to the ipu_id specified for the BeginBlock or Block.
Note that stage_id values must be ascending in PipelinedExecution. Let's use the code example above: if your blocks "0", "1" and "2" are assigned to IPUs 0, 1 and 0 respectively, then Block "2" will be assigned stage_id 0. This will cause the compiler to fail to schedule the last two stages "1" and "2" due to a conflict:
- The model implies "1" should run earlier than "2".
- Their stage_id values suggest "2" should run earlier than "1".
When AutoStage.AutoIncrement is used, each new BeginBlock or Block will be assigned an automatically incremented stage_id. In the previous example the last stage would be assigned stage_id 2 and the compilation would succeed.
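The following sketch shows one way the auto-staging mode can be selected; it assumes the execution strategy constructor accepts an AutoStage value:
opts = poptorch.Options()
# Assumption: the strategy takes the AutoStage mode as an argument.
# AutoIncrement gives each new Block/BeginBlock the next stage_id instead of
# deriving it from the ipu_id.
opts.setExecutionStrategy(
    poptorch.PipelinedExecution(poptorch.AutoStage.AutoIncrement))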
poptorch.Phase
Phase defines a processing unit of phased execution. It can contain one or more stages.
Phase is only used in SerialPhasedExecution and ParallelPhasedExecution. It is not used in ShardedExecution or PipelinedExecution.
class Model(torch.nn.Module):
    def forward(self, x, y):
        with poptorch.Block("A"):
            c = x + x
        with poptorch.Block("B"):
            d = y + y
        with poptorch.Block("C"):
            e = x * 3

        return c, d, e


first = poptorch.Phase(poptorch.Stage("A").ipu(0))
# Regrouped in a single stage
second = poptorch.Phase(poptorch.Stage("B", "C").ipu(1))
# 2 separate stages
second = poptorch.Phase(poptorch.Stage("B").ipu(1), poptorch.Stage("C").ipu(3))
In the code snippet above, “A” and “B” will run in parallel on IPUs 0 and 1 simultaneously because they are placed in two stages. They will run sequentially on one IPU if they are placed in a single stage.
Advanced annotation with strings
You can use Python strings to represent the user_id and ipu_id for a Block or BeginBlock. Because strings are evaluated at runtime, they allow for a dynamic number of stages and phases.
Here is an example showing how to use formatted strings (f-strings) in ParallelPhasedExecution.
In Listing 4.10, there are several places where f-strings are used:
- Line 25: f"phase{phase}_ipu{ipu}", where phase has the values 0, 1, 1, 2, 3, 3, 4, 5, and 5, and ipu ranges from 0 to 1. The total number of instances for this f-string is 12, from 6 phases and 2 IPUs.
- Line 32: f"phase{N*2-1}_ipu1", where phase is 5 and ipu is 1.
- Lines 46-47 and 50-51: when defining Stage, four f-strings are used, where n ranges from 0 to 2:
  f"phase{2*n}_ipu0"
  f"phase{2*n}_ipu1"
  f"phase{2*n+1}_ipu0"
  f"phase{2*n+1}_ipu1"
  These refer to phases 0, 2, 4 and 1, 3, 5, with ipu0 and ipu1, respectively. So all these 12 f-strings are defined in the Block annotations, and used in Stage dynamically. These match exactly.
 1  poptorch.setLogLevel(1)  # Force debug logging
 2  N = 3
 3  size = 10
 4
 5
 6  class Model(torch.nn.Module):
 7      def __init__(self):
 8          super().__init__()
 9          self.weights = []
10          for n in range(N * 6):
11              weight = torch.nn.Parameter(torch.rand(size, size),
12                                          requires_grad=True)
13              self.register_parameter(f"w{n}", weight)
14              self.weights.append(weight)
15
16      def forward(self, in0, target=None):
17          phase = 0
18          weight = iter(self.weights)
19          with poptorch.Block("phase0_ipu0"):
20              ins = torch.split(in0, size)
21          for n in range(N * 3):
22              out = []
23              for ipu in range(2):
24                  x = ins[ipu]
25                  with poptorch.Block(f"phase{phase}_ipu{ipu}"):
26                      x = torch.matmul(next(weight), x)
27                      out.append(F.relu(x))
28              ins = out[1], out[0]
29              # We want 2 matmuls in the same phase
30              if n % 3 != 1:
31                  phase += 1
32          with poptorch.Block(f"phase{N*2-1}_ipu1"):
33              res = ins[0] + ins[1]
34              if target is None:
35                  return res
36              return res, torch.nn.L1Loss(reduction="mean")(res, target)
37
38
39  input = torch.rand(size * 2, 1)
40  target = torch.rand(size, 1)
41  model = Model()
42  opts = poptorch.Options()
43  phases = []
44  # Alternate between 0-2 and 1-3
45  for n in range(N):
46      phases.append([
47          poptorch.Stage(f"phase{2*n}_ipu0").ipu(0),
48          poptorch.Stage(f"phase{2*n}_ipu1").ipu(2)
49      ])
50      phases.append([
51          poptorch.Stage(f"phase{2*n+1}_ipu0").ipu(1),
52          poptorch.Stage(f"phase{2*n+1}_ipu1").ipu(3)
53      ])
54  opts.setExecutionStrategy(poptorch.ParallelPhasedExecution(*phases))
55  poptorch_model = poptorch.trainingModel(model, opts)
56  poptorch_model.compile(input, target)
4.3.2. Parallel execution strategies
With the above functions as building blocks, we can set execution strategies using the four kinds of execution modes, as shown below.
Note that you can use the same annotation for each execution strategy. They only differ in the method of parallelisation and tensor locations.
poptorch.ShardedExecution
In this strategy, each IPU will sequentially execute a distinct part of the model. A single unit of processing in ShardedExecution is called a shard.
A shard is specified using Stage, or, if no Stage is specified, the user_id passed by BeginBlock or Block is used. Each shard is executed sequentially on a single IPU. You can place multiple shards on multiple IPUs. However, only one IPU is used at a time, while the other IPUs are idle. If an IPU is allocated to run consecutive stages, PopART will merge those consecutive stages into one on the same IPU. Weights and activations will use the on-chip memory of the IPUs. You need to place layers that share weights on the same IPU.
ShardedExecution can be useful for processing a single sample or for debugging. Overall, it has low efficiency because only one IPU is used at a time.
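To select this strategy, pass it to setExecutionStrategy() (a minimal example):
opts = poptorch.Options()
# Execute one shard (one Block/BeginBlock) at a time, each on its own IPU.
opts.setExecutionStrategy(poptorch.ShardedExecution())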
poptorch.PipelinedExecution
This is the default execution strategy. It extends poptorch.ShardedExecution with parallel execution on multiple IPUs.
Parallelisation in PipelinedExecution requires deviceIterations() and gradientAccumulation(), as explained in Section 5, Efficient data batching.
After one stage has finished processing a batch on one IPU, it immediately starts processing the next batch. This creates a pipeline where multiple batches are processed in parallel. An IPU can only start its own stage of a batch after the previous stage of that batch has been processed. Hence, all IPUs will be occupied after a "warm-up" period. At the end of processing, a "cool-down" period is required to aggregate the results and apply weight updates.
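A minimal sketch of selecting pipelined execution together with the batching options it relies on (the values are illustrative):
opts = poptorch.Options()
opts.setExecutionStrategy(poptorch.PipelinedExecution())  # the default strategy
opts.deviceIterations(16)                # keeps the pipeline fed with batches
opts.Training.gradientAccumulation(8)    # needed when training with a pipeline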
Phased execution
ParallelPhasedExecution and SerialPhasedExecution have the following features in common:
- A portion of the weights and activations are transferred to and from streaming memory, before and after each phase.
- If the desired weights and activations are already stored in an IPU of the same group of IPUs, intra-phase cross-IPU copies can replace the copies to and from streaming memory.
- This specific portion is needed by the layers of the model wrapped in BeginBlock or Block in the current Phase.
- They both trade off some performance for larger models with higher memory needs.
- Any number of phases is allowed.
- The number of stages in each Phase should match the number of IPUs in each group of IPUs.
- Stages inside each Phase can run in parallel.
Although you only define the Phase for forward passes, the corresponding phases for backward passes are created correspondingly. The order of phased execution for backward passes won't change, but you can decide whether a phase is shared by both forward and backward passes. In other words, you decide whether to avoid a memory transfer of a portion of the weights and activations.
poptorch.SerialPhasedExecution
In SerialPhasedExecution, phases execute on a single group of IPUs sequentially.
strategy = poptorch.SerialPhasedExecution(
    poptorch.Phase(poptorch.Stage("A"), poptorch.Stage("A2")),
    poptorch.Phase(poptorch.Stage("B"), poptorch.Stage("B2")),
    poptorch.Phase(poptorch.Stage("C"), poptorch.Stage("C2")))

strategy.phase(0).ipus(0, 1)
strategy.phase(1).ipus(0, 1)
strategy.phase(2).ipus(0, 1)

opts.setExecutionStrategy(strategy)
The code above causes all phases to run serially on IPUs 0 and 1 (A, B and C on IPU 0; A2, B2 and C2 on IPU 1).
poptorch.ParallelPhasedExecution
In ParallelPhasedExecution, phases are executed in parallel, alternating between two groups of IPUs. Even phases must run on even IPUs and odd phases on odd IPUs. Inter-phase cross-IPU copies can replace the memory transfers to and from streaming memory, if the desired weights and activations are already available in another group of IPUs.
strategy = poptorch.ParallelPhasedExecution(
    poptorch.Phase(poptorch.Stage("0"), poptorch.Stage("1")),
    poptorch.Phase(poptorch.Stage("2"), poptorch.Stage("3")),
    poptorch.Phase(poptorch.Stage("4"), poptorch.Stage("5")))

strategy.phase(0).ipus(0, 2)
strategy.phase(1).ipus(1, 3)
strategy.phase(2).ipus(0, 2)

opts.setExecutionStrategy(strategy)
In the code example above, there are three phases. Each phase has two stages and each IPU group has two IPUs, so the number of groups matches the number of IPUs. Even phases 0 and 2 run on IPU 0 and 2, while odd phase 1 runs on IPU 1 and 3. This allows for faster cross-IPU copies, both inter-phase and intra-phase.
poptorch.Liveness
Liveness controls the availability of tensors on the IPU, and is only needed for ParallelPhasedExecution and SerialPhasedExecution.
The default Liveness is AlwaysLive. OffChipAfterFwd, OffChipAfterFwdNoOverlap and OffChipAfterEachPhase may be helpful if you run a large model with a tight memory budget.
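A sketch of how a non-default liveness might be selected; this assumes the setTensorsLiveness() method on the phased execution strategies, so check the API reference for the exact call:
strategy = poptorch.SerialPhasedExecution(
    poptorch.Phase(poptorch.Stage("A")),
    poptorch.Phase(poptorch.Stage("B")))
# Assumption: phased strategies expose setTensorsLiveness() to choose a
# liveness mode other than the default AlwaysLive.
strategy.setTensorsLiveness(poptorch.Liveness.OffChipAfterFwd)
opts.setExecutionStrategy(strategy)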
4.4. Optimizers
PopTorch supports the following optimizers, available in the poptorch.optim namespace:
- SGD
- Adam
- AdamW
- RMSprop
- LAMB
In addition, PopTorch has features to support float16 models, such as loss scaling, velocity scaling, bias correction and accumulator types.
Important
All of these extra attributes (except velocity_scaling) must have the same values for different param_groups and therefore you must set them at the optimizer level.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
poptorch_model = poptorch.trainingModel(model, options, opt)
poptorch_model(input, target)
# Update optimizer attribute
opt.loss_scaling = 1.0
# Update param_group attribute
opt.param_groups[0]["loss_scaling"] = 1.0
# Set the new optimizer in the model
poptorch_model.setOptimizer(opt)
poptorch_model(input, target)
Important
You must call setOptimizer() to apply the new optimizer values to the model.
4.4.1. Loss scaling
When training models which use half/float16 values, you can use loss scaling to prevent the gradients from becoming too small and underflowing.
Before calculating the gradients, PopTorch will scale the loss by the value of the loss_scaling parameter. PopTorch will multiply the gradients by the inverse scale prior to updating the optimizer state. Therefore, beyond improving numerical stability, neither the training nor the hyper-parameters are affected.
Higher loss_scaling values can improve numerical stability by minimising underflow. However, too high a value can result in overflow. The optimal loss scaling factor depends on the model.
You can either set the loss_scaling factors manually, or you can set setAutomaticLossScaling() in opts.Training, which will automatically set a global loss scaling factor. If you both set loss_scaling manually and enable automatic loss scaling, the manually set factor(s) will be used initially and updated automatically during training.
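For example, a sketch of enabling automatic loss scaling while still providing an initial manual factor (the values are illustrative, and a model is assumed to exist as in the earlier examples):
opts = poptorch.Options()
# Let PopTorch update a global loss scaling factor during training.
opts.Training.setAutomaticLossScaling(True)

opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=8.0,   # used as the initial factor
                         use_combined_accum=False)
poptorch_model = poptorch.trainingModel(model, opts, opt)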
4.4.2. Velocity scaling (SGD combined variant only)
The SGD optimizer, when used with momentum, updates weights based on the velocity values. The combined variant uses one tensor per parameter to store the velocity and the changes to the velocity from accumulated gradients. Unlike the separate variant, therefore, each gradient accumulation step involves adding or subtracting values of a different magnitude to the gradients (for which loss scaling is used). You can therefore use the velocity_scaling parameter to scale the combined velocity tensor to improve numerical precision when using half/float16 values. (Note that the gradients are, in effect, scaled by velocity_scaling/loss_scaling, so loss_scaling has no impact on the effective scaling of the velocity parameters.)
As with loss scaling, higher values can minimise underflow of the velocity values but may result in overflow.
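An illustrative use of the combined-variant SGD with velocity scaling (the values are arbitrary, and a model is assumed as above):
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.9,
                         loss_scaling=128.0,
                         velocity_scaling=128.0,   # scales the combined velocity tensor
                         use_combined_accum=True)  # velocity_scaling applies to the combined variant only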
4.4.3. Accumulation types
In order to improve numerical stability, some of the optimizers (LAMB, Adam, AdamW, RMSprop) give you the option to tweak the data type used by the optimizer's accumulators.
accum_type lets you choose the type used for gradient accumulation. first_order_momentum_accum_type and second_order_momentum_accum_type give you control over the type used to store the first-order and second-order momentum optimizer states.
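A sketch of how these arguments might be combined for a float16 model (the choice of types is illustrative):
opt = poptorch.optim.AdamW(model.parameters(),
                           lr=1e-3,
                           # Accumulate gradients in float16 to save memory...
                           accum_type=torch.float16,
                           # ...but keep the momentum states in float32 for stability.
                           first_order_momentum_accum_type=torch.float32,
                           second_order_momentum_accum_type=torch.float32)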
4.4.4. Constant attributes
In order to improve performance and/or save memory, PopTorch will try to embed directly in the program those attributes which are constant.
Important
Trying to modify a constant attribute after the model has been compiled will result in an error.
For PopTorch optimizers (those from the poptorch.optim namespace), by default the attributes explicitly passed to the optimizer's constructor will be considered variable, and the others will be considered constant.
You can override this behaviour using markAsConstant() and markAsVariable() before compiling the model.
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         use_combined_accum=False)
# momentum and loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(), lr=0.01, use_combined_accum=False)
# lr and momentum will be marked as variable.
# loss_scaling will be marked as constant.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         momentum=0.0,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsConstant("loss_scaling")
# lr, momentum and loss_scaling will be marked as variable.
opt = poptorch.optim.SGD(model.parameters(),
                         lr=0.01,
                         loss_scaling=2.0,
                         use_combined_accum=False)
opt.variable_attrs.markAsVariable("momentum")
For native optimizers (those from the torch.optim namespace), the attributes which are left at their default values in the constructor will be considered constant.
There is no way to override this behaviour, which is why we recommend you always use the poptorch.optim optimizers instead.
# momentum will be marked as constant (It's not set)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# lr will be marked as variable.
# momentum will still be marked as constant (Because its default value is 0.0)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)
# lr and momentum will both be marked as variable.
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=1.0)
Note
There is an exception: lr is always marked as variable.
4.5. PopTorch ops
This section describes some “helper” operations you can use within a model.
4.5.1. poptorch.ctc_beam_search_decoder
This function adds a Connectionist Temporal Classification (CTC) beam search decoder operator to the model.
class Model(torch.nn.Module):
    def forward(self, log_probs, lengths):
        return poptorch.ctc_beam_search_decoder(log_probs, lengths)
For more information see: ctc_beam_search_decoder().
4.5.2. poptorch.ipu_print_tensor
This function adds an op to print the content of a tensor on the IPU.
Note
To prevent the print operation being optimised out by the graph optimiser, you must use the return value of ipu_print_tensor().
class ExampleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.zeros(()))

    def forward(self, x):
        x = x + 1

        # It is important to make sure the result of the print is used.
        x = poptorch.ipu_print_tensor(x)

        return x + self.bias
For more information see: ipu_print_tensor().
4.5.3. poptorch.identity_loss
You can use this function to implement custom losses. It takes a single PyTorch tensor and will backpropagate a gradient of ones through it.
Note
Passing a PyTorch loss function or another identity_loss to this function is not supported. You must implement multiple losses as composite PyTorch ops.
def custom_loss(output, target):
    # Mean squared error with a scale
    loss = output - target
    loss = loss * loss * 5
    return poptorch.identity_loss(loss, reduction="mean")


class ExampleModelWithCustomLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.model = ExampleModel()

    def forward(self, input, target):
        out = self.model(input)
        return out, custom_loss(out, target)
For more information see: identity_loss().
4.5.4. poptorch.MultiConv
Use the MultiConv wrapper class to define multi-convolutions. Refer to the PopLibs documentation for multi-convolutions for further information.
For more information see: MultiConv and MultiConvPlanType.
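A minimal sketch of grouping two convolutions into a multi-convolution using the MultiConv context manager (the layer shapes are arbitrary):
class TwoConvs(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_a = torch.nn.Conv2d(4, 4, 3)
        self.conv_b = torch.nn.Conv2d(4, 4, 3)

    def forward(self, x, y):
        # Convolutions run inside this block are planned and executed
        # together as a single multi-convolution.
        with poptorch.MultiConv():
            return self.conv_a(x), self.conv_b(y)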
4.5.5. poptorch.nop
PopTorch includes a "no-op" function for debugging purposes.
For more information see: nop().
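For example (a minimal sketch):
class DebugModel(torch.nn.Module):
    def forward(self, x):
        x = x * 2
        # Passes the tensor through unchanged, but is never optimised away,
        # which can be useful when inspecting the compiled graph.
        return poptorch.nop(x)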
4.5.6. poptorch.serializedMatMul
Use this function to create a serialized matrix multiplication, which splits a larger matrix multiplication into smaller matrix multiplications to reduce memory requirements.
For more information see: serializedMatMul().
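An illustrative sketch that serializes a matrix multiplication over its output channels (the serialization mode and factor are arbitrary choices):
class SerializedLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, factor):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.randn(in_features, out_features))
        self.factor = factor

    def forward(self, x):
        # Split the multiplication into `factor` smaller matmuls along the
        # output-channels dimension to reduce peak memory use.
        return poptorch.serializedMatMul(
            x, self.weight,
            poptorch.MatMulSerializationMode.OutputChannels,
            self.factor)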
4.5.7. poptorch.set_available_memory
Use this function to override the proportion of tile memory available for use as temporary memory by a convolution or matrix multiplication.
For more information see: set_available_memory().
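A minimal sketch; the function is applied to the output of the operation whose available memory proportion you want to override:
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 128)

    def forward(self, x):
        x = self.fc(x)
        # Allow the preceding matmul to use at most 20% of tile memory
        # as temporary memory.
        return poptorch.set_available_memory(x, 0.2)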
4.5.8. Miscellaneous functions
The following PopTorch functions, not related to model creation, are available:
4.6. Half / float16 support
PopTorch supports the half-precision floating point (float16) format.
You can simply input float16 tensors into your model. (You can convert a tensor to float16 using tensor = tensor.half().)
You can use your models in one of the following ways:
- Convert all parameters (weights) to float16 by using a Module's half() method. This is the most memory-efficient option; however, small updates to weights may be lost, hindering training.
- Keep the parameters (weights) as float32, in which case the parameter updates will occur using float32. However, the parameters will be converted to float16 if you call an operation with a float16 input. This is more memory efficient than using float32 tensors (inputs) but less memory efficient than using float16 weights.
- Use a mix of float32 and float16 parameters by manually specifying parameters as float16 or float32.
Note
When PyTorch encounters a mix of float16 and float32 inputs for a given operation, it will usually cast all inputs to float32. PopTorch differs and will cast all inputs to float16. This makes it easier to build models with float32 weights which take float16 tensors. However, if you wish to follow PyTorch behaviour, you can use opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat), where opts is the poptorch.Options object passed to the model wrapping function.
model = torch.nn.Linear(1, 10)

# Convert the parameters (weights) to halfs. Without doing so,
# the Linear parameters will automatically be cast to half, which allows
# training with float32 parameters but half tensors.
model.half()

t1 = torch.tensor([1.]).half()

opts = poptorch.Options()

inference_model = poptorch.inferenceModel(model, opts)
out = inference_model(t1)

assert out.dtype == torch.half
Because PopTorch relies on the torch.jit.trace() function, it is limited to tracing operations which run on the CPU. Many of these operations do not support float16 inputs. To allow the full range of operations, PopTorch converts all float16 inputs to float32 before tracing and then restores the inputs to float16 as part of the canonicalization process. Some operations may result in the model running in float32 where float16 would be expected, or vice versa (see Section 6.3, Float 16 operations for full details).
4.7. Automatic mixed-precision casting
PopTorch supports converting your model automatically between float16 and float32.
This functionality is not active by default: you must enable it explicitly by calling the autocast(enabled=True) method at model level.
model = MyModel()
model.autocast()
poptorch_model = poptorch.inferenceModel(model)
During compilation, selected layers and operators will have their types adjusted aiming to strike a good compromise between compute efficiency, memory requirements and numerical precision.
You can also set automatic casting at the layer level. In this situation, its effect is hierarchical: changing the setting for a layer affects it and all layers contained within.
In the following example, automatic casting is enabled for all layers of the model, except for the first activation and second convolution.
model = torch.nn.Sequential()
model.add_module('conv1', torch.nn.Conv2d(1, 20, 5))
model.add_module('relu1', torch.nn.ReLU())
model.add_module('conv2', torch.nn.Conv2d(20, 64, 5))
model.add_module('relu2', torch.nn.ReLU())
model.autocast()
model.relu1.autocast(False)
model.conv2.autocast(False)
You can also set automatic casting with the function decorator @poptorch.autocast(enabled=True). Its effect is to apply automatic casting to the body of the function; setting its parameter to False has the opposite effect. A typical use case is applying it to the forward function of custom modules.
class MyModel(torch.nn.Module):
    @poptorch.autocast()
    def forward(self, x, y):
        return torch.bmm(x, y)
In addition, you can apply poptorch.autocast(enabled=True) to a code block, with similar effect.
x = torch.randn(1, 10, 10)
y = torch.randn(1, 10, 10)
with poptorch.autocast():
    z = torch.bmm(x, y)
You can turn this feature off completely for the whole application via the autocastEnabled(bool) method of _PrecisionOptions.
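A minimal sketch of disabling automatic casting globally through the precision options:
opts = poptorch.Options()
# Turn automatic mixed-precision casting off for the whole application.
opts.Precision.autocastEnabled(False)
poptorch_model = poptorch.inferenceModel(model, opts)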
4.7.1. Custom casting policies
PopTorch provides a mechanism to customize automatic casting behaviour in the form of casting policy classes. A casting policy is defined by four sets of Torch modules and/or torch operators:
- fp16 - set of operations to be typed as float16
- fp32 - set of operations to be typed as float32
- promote - set of operations to be promoted to float32 should they take mixed-precision inputs
- demote - set of operations to be demoted to float16 should they take mixed-precision inputs
The following example describes a policy where convolution and ReLU operations are to be performed using float16, whilst batch matrix multiplication is to be performed using float32. Dot product computations will be promoted to float32 when operands have mixed precision.
fp16 = [torch.nn.Conv2d, torch.relu]
fp32 = [torch.bmm]
promote = [torch.dot]
demote = []
policy = poptorch.autocasting.Policy(fp16, fp32, promote, demote)
opts = poptorch.Options()
opts.Precision.autocastPolicy(policy)
poptorch_model = poptorch.inferenceModel(model, opts)
4.8. Creating custom ops
If you need to implement functionality that is not directly supported in PopTorch, you can create a custom op.
There are two steps to creating a custom op in PopTorch:
1. Implement the op in C++ using the PopART API.
2. Make the op available in PopTorch so you can use it in your PyTorch model.
4.8.1. Implementing the custom op
You will need to implement the new op as C++ code by creating subclasses of, at least, the Op and Opx base classes provided by the PopART API.
If you are going to use the custom op for training, then you will also need to define the classes that implement the gradient operation. For details of how to do this, see the Custom operators chapter of the PopART User Guide.
You can find some examples of PopART custom ops in the Graphcore GitHub tutorials repository.
Compiling the PopART custom op will create a dynamic library file, which you can use with your PyTorch code.
4.8.2. Make the op available to PyTorch
After you have compiled the C++ implementation of the custom op, you can load the library file and call the op from your PyTorch program using the poptorch.custom_op class.
First, load the dynamic library, as shown in Listing 4.24.
import ctypes
import pathlib

myso = list(pathlib.Path("tests").rglob("libcustom_cube_op.*"))
assert myso, "Failed to find libcustom_cube_op"
myop = ctypes.cdll.LoadLibrary(myso[0])
You can now call your custom op using the PopTorch class custom_op.
Both the forward op and backward op are implemented in the PopART code. However, in this inference model example, only the forward op will be called:
def test_inference():
    class BasicNetwork(nn.Module):
        def forward(self, x, bias):
            x, y = poptorch.custom_op([x, bias],
                                      "Cube",
                                      "com.acme",
                                      1,
                                      example_outputs=[x, x])
            return x, y
In this example, [x, x] is assigned to example_outputs, where x is one of the input tensors which is used as a template for the output tensors. The custom op code will need to create the tensors that it returns.
You can also call this custom op inside a training model using custom_op, and the backward op will be called automatically.
4.8.3. Passing attributes to the custom op
You can pass attributes to the custom op using a Python dictionary, as shown in Listing 4.26.
class Model(torch.nn.Module):
    def forward(self, x):
        x = poptorch.custom_op([x],
                               "LeakyRelu",
                               "com.acme",
                               1,
                               example_outputs=[x],
                               attributes={"alpha": 0.02})
        return x[0]
You can then access these attributes within the C++ custom op code. The above example passes a Float attribute with the name alpha to the LeakyRelu implementation. See the Custom operators chapter of the PopART User Guide for more information.
Table 4.1 and the code example in Listing 4.27 show how to pass other attribute types to a custom op. PopTorch supports all attributes supported in PopART except for Graph.
PopART attribute type | Python equivalent
Float                 | Python float (converted to float32)
Floats                | List or tuple of Python float
Int                   | Python int (converted to 64-bit signed int)
Ints                  | List or tuple of Python int
String                | Python str (converted to ASCII)
Strings               | List or tuple of Python str
Graph                 | Not supported
def test_many_attributes_examples():
    class Model(torch.nn.Module):
        def forward(self, x):
            attributes = {
                "float_one": 1.0,
                "float_minus_two": -2.0,
                "int_zero": 0,
                "int_minus_five": -5,
                "floats_one_two_three": [1.0, 2.0, 3.0],
                "floats_minus_one_two_three": [-1.0, -2.0, -3.0],
                "ints_one_two_three": [1, 2, 3],
                "ints_minus_one_two_three": [-1, -2, -3],
                "a_string": "string with quotes and slash \" ' \\ end",
                "strs": ["abc", "def", "ghi"]
            }

            x = poptorch.custom_op([x],
                                   "ManyAttributeOp",
                                   "test.poptorch",
                                   1,
                                   example_outputs=[x],
                                   attributes=attributes)
4.9. Profiling
You can profile a graph produced by PopTorch for analysis using the PopVision Graph Analyser, which can be downloaded from the Graphcore support portal. To do this, use the POPLAR_ENGINE_OPTIONS environment variable.
4.10. Precompilation and caching
4.10.1. Caching
By default PopTorch will re-compile the model every time you instantiate a model. However, if you often run the same models you might want to enable executable caching to save time.
You can do this by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching().
Warning
The cache directory might grow large quickly because PopTorch doesn’t delete old models from the cache and, depending on the number and size of your models and the number of IPUs used, the executables might be quite large. It is your responsibility to delete the unwanted cache files.
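For example (a minimal sketch), caching can also be enabled programmatically on the Options object:
opts = poptorch.Options()
# Compiled executables are written to, and reloaded from, this directory.
opts.enableExecutableCaching("/tmp/poptorch_cache")
poptorch_model = poptorch.inferenceModel(model, opts)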
4.10.2. Precompilation
PopTorch supports precompilation: This means you can compile your model on a machine which doesn’t have an IPU and export the executable to a file. You can then reload and execute it on a different machine which does have an IPU.
Important
The PopTorch versions on both machines must be an exact match.
To precompile your model, you need to wrap it using either trainingModel() or inferenceModel(), then call compileAndExport() on the wrapper.
import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)
        self.loss = torch.nn.MSELoss()

    def forward(self, x, target=None):
        fc = self.fc(x)
        if self.training:
            return fc, self.loss(fc, target)
        return fc


torch.manual_seed(0)
model = ExampleModelWithLoss()

opts = poptorch.Options()
# You don't need a real IPU to compile the executable.
opts.useOfflineIpuTarget(ipu_target_version)

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model.compileAndExport(filename, input, target)
Note
If you don't know the IPU version on your system, you can use ipuHardwareVersion().
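For example, a sketch of compiling for an offline target that matches the locally detected hardware:
opts = poptorch.Options()
# Query the version of the IPU hardware visible from this host and compile
# for an offline target of the same architecture.
opts.useOfflineIpuTarget(poptorch.ipuHardwareVersion())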
By default, the exported file will contain your original PyTorch model (including the weights), and enough information to re-create the PopTorch wrapper and reload the executable.
Important
For your model and weights to be exported, your model must be picklable. See https://docs.python.org/3/library/pickle.html for more information.
If your model is not picklable, use export_model=False, as shown in Listing 4.31.
The PyTorch model, the PopTorch wrapper and the executable can then all be restored on the target machine using poptorch.load():
poptorch_model = poptorch.load(filename)

# That's all: your model is ready to be used.
poptorch_model(input, target)  # Run on IPU
In some cases you might want to provide some runtime information to select the device: you can do this using the edit_opts_fn argument of poptorch.load():
def setIpuDevice(opts):
    opts.useIpuId(1)  # always use IPU 1


poptorch_model = poptorch.load(filename, edit_opts_fn=setIpuDevice)
poptorch_model(input, target)  # Run on IPU 1
Note
When loading a precompiled model, only run-time options will be applied; all others will be ignored.
Going back to the precompilation step: in some cases you might want to export only the executable and not the python wrapper or torch model (for example if your model cannot be pickled).
poptorch_model.compileAndExport(filename, input, target, export_model=False)
This means you will need to re-create and wrap the model yourself before loading the executable:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
poptorch_model = poptorch.trainingModel(model, opts)
poptorch_model.loadExecutable(filename)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

poptorch_model(input, target)  # Run on IPU
Important
Exported models lose their connections to other models.
For example, if you have a poptorch.trainingModel() and a poptorch.inferenceModel() based on the same PyTorch model, you wouldn't usually need to keep the weights synchronised between the two; PopTorch would take care of it for you, implicitly.
In the following example, PopTorch automatically copies the weights from the training model to the inference model:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Train the model:
for epoch in epochs:
    training_model(input, target)

# Weights are implicitly copied from the training model
# to the validation model
prediction = validation_model(input)
If you were to export these models:
model = ExampleModelWithLoss()

opts = poptorch.Options()

# Some dummy inputs.
input = torch.randn(10)
target = torch.randn(10)

# Wrap the model in our PopTorch annotation wrapper.
training_model = poptorch.trainingModel(model, opts)
training_model.compileAndExport("training.poptorch", input, target)
model.eval()
validation_model = poptorch.inferenceModel(model, opts)
validation_model.compileAndExport("validation.poptorch", input)
Note
Don't forget to call model.eval() or model.train(), as required, before calling compileAndExport().
You could then insert explicit copy operations:
training_model = poptorch.load("training.poptorch")
validation_model = poptorch.load("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Need to explicitly copy weights between the two models
    # because they're not connected anymore.
    training_model.copyWeightsToHost()
    validation_model.copyWeightsToDevice()
    run_validation(validation_model)
Or you would need to re-connect the two models by creating the second one from the first one and then loading the executable:
training_model = poptorch.load("training.poptorch")
# Create a validation python model based on the training model
validation_model = poptorch.inferenceModel(training_model)
validation_model.model.eval()
# Load the executable for that model:
validation_model.loadExecutable("validation.poptorch")

for epoch in epochs:
    print("Epoch ", epoch)
    run_training(training_model)
    # Nothing to do: training_model and validation_model are now connected
    # and PopTorch will implicitly keep the weights in sync between them.
    run_validation(validation_model)
4.11. Environment variables
4.11.1. Logging level
PopTorch uses the following levels of logging:
- OFF: No logging
- ERR: Errors only
- WARN: Warnings and errors only
- INFO: Info, warnings and errors (default)
- DEBUG: Adds some extra debugging information
- TRACE and TRACE_ALL: Trace everything inside PopTorch
You can use the POPTORCH_LOG_LEVEL environment variable to set the logging level:
export POPTORCH_LOG_LEVEL=DEBUG
4.11.2. Profiling
When running programs using PopTorch, you can enable profiling by using the POPLAR_ENGINE_OPTIONS environment variable used by Poplar.
In order to capture the reports needed for the PopVision Graph Analyser, you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}':
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}'
By default, report files are output to the current working directory. You can specify a different output directory by setting autoReport.directory, for example:
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'
For more options, refer to the PopVision Graph Analyser User Guide.
In order to capture the pvti reports needed for the PopVision System Analyser, you only need to set PVTI_OPTIONS='{"enable":"true"}'.
You can also add extra tracepoints in your own code by using Channel.
4.11.3. IPU Model
By default PopTorch will try to attach to a physical IPU. If, instead, you want to use the IPU Model, you can do so by setting POPTORCH_IPU_MODEL to 1:
export POPTORCH_IPU_MODEL=1
See the Poplar and PopLibs User Guide for the limitations of the IPU Model.
4.11.4. Wait for an IPU to become available
By default, attempting to attach to an IPU when all IPUs are already in use will raise an exception. If you would rather wait for an IPU to become available, you can do so by setting POPTORCH_WAIT_FOR_IPU to 1:
export POPTORCH_WAIT_FOR_IPU=1
4.11.5. Enable executable caching
You can enable executable caching by either setting the POPTORCH_CACHE_DIR environment variable or by calling enableExecutableCaching():
export POPTORCH_CACHE_DIR=/tmp/poptorch_cache
For more information, see Section 4.10.1, Caching.