PopTorch is a set of extensions for PyTorch that enable PyTorch models to run directly on Graphcore IPU hardware. PopTorch has been designed to require as few changes as possible to your models in order to run on the IPU. However, it does have some differences from native PyTorch execution in order to get the most out of the IPU hardware.
See the “Getting Started” guide for your IPU system on the Graphcore documentation portal for information on installing the Poplar SDK and PopTorch.
In the Graphcore software stack, PyTorch sits at the highest level of abstraction. Poplar and PopLibs provide a software interface to operations running on the IPU. PopTorch compiles PyTorch models into Poplar executables and also provides IPU-specific functions.
PopTorch supports executing native PyTorch models for both inference and training. To run a PyTorch model on the IPU, you must wrap your model with either:

poptorch.trainingModel() if you intend to train the model

poptorch.inferenceModel() if you want to run the model for inference
Both of these functions accept a PyTorch model (torch.nn.Module) and create a representation of the model that can be executed on the IPU hardware.
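For example, the following is a minimal sketch of wrapping a model for inference; the model itself is illustrative, and poptorch.trainingModel() is used in the same way, additionally taking the optimiser and options shown later in this section.

import torch
import poptorch

# Illustrative model; any torch.nn.Module can be wrapped.
model = torch.nn.Linear(10, 2)

# Compile the model into a Poplar executable and run a forward pass on the IPU.
inference_model = poptorch.inferenceModel(model)
output = inference_model(torch.randn(4, 10))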
In training mode, PopTorch uses its own automatic differentiation engine (autograd) that differs from native PyTorch. The input model (torch.nn.Module) is required to have at least one loss built into the forward pass. PopTorch backpropagates the gradients from the loss value(s) to update the model parameters. This is all taken care of automatically, so your training loop does not need to call .backward() on the loss value(s) or .step() on the optimiser.
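The listings below use an ExampleDataset and an ExampleModelWithLoss that are defined elsewhere in the guide. As a rough illustration of a model with its loss built into the forward pass, such a model might look like the following sketch; the layer sizes and choice of loss here are assumptions for illustration, not the guide's actual definitions.

import torch

# Illustrative sketch of a model that computes its loss inside forward().
class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(1, 2)
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, x, target=None):
        out = self.fc(x)
        if target is not None:
            # Returning the loss from forward() lets PopTorch backpropagate
            # from it automatically; no explicit .backward() call is needed.
            return out, self.loss(out, target)
        return out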
The following example shows a typical native PyTorch training loop. The model incorporates a loss criterion within the .forward() method and returns the loss value as a second output (along with the prediction). This native PyTorch training loop manually invokes the .backward() method to backpropagate the gradients. The loop also manually updates the optimiser by calling its .step() method.
training_data = torch.utils.data.DataLoader(ExampleDataset(shape=[1], length=20000),
                                            batch_size=10,
                                            shuffle=True,
                                            drop_last=True)

model = ExampleModelWithLoss()
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

momentum_loss = None

for batch, target in training_data:
    # Zero gradients
    optimizer.zero_grad()

    # Run model.
    _, loss = model(batch, target)

    # Back propagate the gradients.
    loss.backward()

    # Update the weights.
    optimizer.step()

    if momentum_loss is None:
        momentum_loss = loss
    else:
        momentum_loss = momentum_loss * 0.95 + loss * 0.05

    if momentum_loss < 0.1:
        optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001)
1.1. Data batching
An equivalent training loop executing the model on the IPU with PopTorch is shown below. poptorch.DataLoader is used to efficiently load data batches on the IPU. PopTorch follows the data batching semantics of PopART. By default, this means you just pass in data of the normal batch size. However, there are a number of options provided in PopTorch which enable more efficient data loading. See Section 5, Efficient data batching for more information.
Notice that the torch.optim.AdamW optimiser is passed as an input argument to the poptorch.trainingModel() wrapper, which applies the optimiser algorithm during training on the IPU. The optimiser state is automatically managed by the PopART framework, so there is no need to call the .step() method. Another significant change from the native training loop is that there is no explicit call to .backward() on the loss. As mentioned above, PopTorch uses its own automatic differentiation engine and will detect the loss value to backpropagate the gradients from.
# Set up the PyTorch DataLoader to load that much data at each iteration
opts = poptorch.Options()
opts.deviceIterations(10)
training_data = poptorch.DataLoader(options=opts,
                                    dataset=ExampleDataset(shape=[1], length=20000),
                                    batch_size=10,
                                    shuffle=True,
                                    drop_last=True)

model = ExampleModelWithLoss()
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

# Wrap the model in a PopTorch training wrapper
poptorch_model = poptorch.trainingModel(model,
                                        options=opts,
                                        optimizer=optimizer)

momentum_loss = None

for batch, target in training_data:
    # Performs forward pass, loss function evaluation,
    # backward pass and weight update in one go on the device.
    _, loss = poptorch_model(batch, target)

    if momentum_loss is None:
        momentum_loss = loss
    else:
        momentum_loss = momentum_loss * 0.95 + loss * 0.05

    # Optimizer can be updated via setOptimizer.
    if momentum_loss < 0.1:
        poptorch_model.setOptimizer(
            torch.optim.AdamW(model.parameters(), lr=0.0001))
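As a rough illustration of these batching semantics, the sketch below (using a stand-in TensorDataset rather than the guide's ExampleDataset) shows how deviceIterations() changes the amount of data poptorch.DataLoader yields per step.

import torch
import poptorch

# Stand-in dataset for illustration; any map-style dataset works here.
dataset = torch.utils.data.TensorDataset(torch.randn(20000, 1),
                                         torch.randint(0, 2, (20000,)))

opts = poptorch.Options()
opts.deviceIterations(10)

loader = poptorch.DataLoader(options=opts,
                             dataset=dataset,
                             batch_size=10,
                             drop_last=True)

batch, target = next(iter(loader))
# One item from the loader carries the data for all 10 device iterations:
# batch.shape[0] is 100 (batch_size * device_iterations) rather than 10.
print(batch.shape)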
1.2. Parallel and distributed execution
To scale your models, you can enable multi-IPU execution by using PopTorch's Annotations to label or wrap individual parts of your model and assign them to a particular IPU or execution phase. You can then use one of PopTorch's Available execution strategies to determine how the model executes those phases.
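For example, here is a minimal sketch of splitting a model across two IPUs with poptorch.BeginBlock; the model, the layer split and the option values are illustrative assumptions.

import torch
import poptorch

# Illustrative model: the first two layers run on IPU 0, the last on IPU 1.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
model[2] = poptorch.BeginBlock(model[2], ipu_id=1)

opts = poptorch.Options()
opts.deviceIterations(8)

inference_model = poptorch.inferenceModel(model, options=opts)

# With 8 device iterations, each call carries 8 micro-batches (of 10 samples here).
output = inference_model(torch.randn(8 * 10, 128))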
Having assigned the model to run on one or more IPUs, you can add further parallelism through replication. Each replica is an additional copy of the entire model, which runs in parallel.
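A minimal sketch of enabling replication through the options follows; the factor of 2 is an illustrative assumption and requires two IPUs to be available.

import torch
import poptorch

opts = poptorch.Options()
opts.replicationFactor(2)  # run two data-parallel copies of the model
opts.deviceIterations(4)

model = torch.nn.Linear(16, 4)  # illustrative model
inference_model = poptorch.inferenceModel(model, options=opts)

# poptorch.DataLoader scales its combined batch size by the replication factor
# as well, so each step provides data for both replicas.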
PopTorch can also run across multiple hosts. This is necessary when using more than 64 IPUs across IPU-PODs, and may also be beneficial with a smaller number of IPUs, for example for models that involve intensive pre-processing on the CPU. We recommend using the PopRun command-line tool and the PopDist configuration library, which can automatically set up PopTorch to run across multiple IPU-POD hosts. Please refer to the PopDist and PopRun User Guide for more information, including details about installing Horovod if you are using the MPI communication protocol.
1.3. Constraints

PopTorch uses PyTorch's torch.jit.trace API, which means it inherits the constraints of that API. These include:
Inputs must be PyTorch tensors or tuples containing PyTorch tensors.
None can be used as a default value for a parameter, but cannot be explicitly passed as an input value.
torch.jit.trace cannot handle control flow or shape variations within the model. That is, the inputs passed at run-time cannot vary the control flow of the model or the shapes/sizes of the results. If you attempt this, the graph will be frozen to whichever control flow path was traced as a result of the first inputs given to the wrapped model (see the sketch after this list).
All tensor data types and shapes must be constant for the entire dataset.
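To illustrate the control-flow constraint, here is a minimal sketch (the module and values are illustrative): the branch taken during tracing with the first inputs is baked into the compiled graph, so later inputs cannot select the other branch.

import torch
import poptorch

# Illustrative module with input-dependent control flow.
class Clipper(torch.nn.Module):
    def forward(self, x):
        if x.sum() > 0:  # evaluated once, at trace time
            return x * 2
        return x * -1

model = poptorch.inferenceModel(Clipper())
print(model(torch.ones(4)))   # traced with the "positive" branch: returns 2s
print(model(-torch.ones(4)))  # still follows the traced branch: returns -2s, not 1s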
Not all PyTorch operations have been implemented by the PopTorch compiler yet. See Section 6, IPU supported operations for a list of operators that are supported on the IPU. Please also report any unsupported operators to firstname.lastname@example.org so that these ops may be incorporated into a future release.
1.4. Other resources
Please see Graphcore’s website for How-to Videos, Graphcore’s examples GitHub repository for PopTorch applications, and Graphcore’s tutorials GitHub repository for feature examples, tutorials and simple applications. Further developer resources can be found on Graphcore’s developer portal.