PopTorch is a set of extensions for PyTorch to enable PyTorch models to run directly on Graphcore IPU hardware. PopTorch has been designed to require as few changes as possible to your models in order to run on the IPU. However, it does have some differences from native PyTorch execution, to get the most out of IPU hardware.
See the “Getting Started” guide for your IPU system on the Graphcore documentation portal for information on installing the Poplar SDK and PopTorch.
In the Graphcore software stack, PyTorch sits at the highest level of abstraction. Poplar and PopLibs provide a software interface to operations running on the IPU. PopTorch compiles PyTorch models into Poplar executables and also provides IPU-specific functions.
PopTorch supports executing native PyTorch models for both inference and training. To run a PyTorch model on the IPU, you must wrap your model with either:
Both of these functions accept a PyTorch model (torch.nn.Module) and create a representation of the model that can be executed on the IPU hardware.
In training mode, PopTorch uses its own automatic differentiation engine
(autograd) that differs from native PyTorch. The input model (torch.nn.Module)
is required to have at least one loss built into the forward pass. PopTorch
backpropagates the gradients from the loss value(s) to update the model
parameters. This is all taken care of automatically so your training loop does not
need to call
.backward() on the loss value(s) or
.step() on the optimiser.
The following example shows a typical native PyTorch training loop. The model
incorporates a loss criterion within the
.forward() method, and returns the loss
value as a second output (along with the prediction). This native PyTorch training
loop manually invokes the
.backward() method to backpropagate the gradients.
The loop also manually updates the optimiser by calling the
1 training_data = torch.utils.data.DataLoader(ExampleDataset(shape=, 2 length=20000), 3 batch_size=10, 4 shuffle=True, 5 drop_last=True) 6 7 model = ExampleModelWithLoss() 8 model.train() 9 10 optimizer = torch.optim.AdamW(model.parameters(), lr=0.001) 11 12 momentum_loss = None 13 14 for batch, target in training_data: 15 # Zero gradients 16 optimizer.zero_grad() 17 18 # Run model. 19 _, loss = model(batch, target) 20 21 # Back propagate the gradients. 22 loss.backward() 23 24 # Update the weights. 25 optimizer.step() 26 27 if momentum_loss is None: 28 momentum_loss = loss 29 else: 30 momentum_loss = momentum_loss * 0.95 + loss * 0.05 31 32 if momentum_loss < 0.1: 33 optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001)
1.1. Data batching
An equivalent training loop executing the model on the IPU with PopTorch is shown
DataLoader is used to efficiently load data batches
on the IPU. PopTorch follows the data batching semantics of PopART. By default,
this means you will just pass in data of the normal batch size. However, there are a
number of options provided in PopTorch which will enable more efficient data
loading. See Section 5, Efficient data batching for more information.
Notice that the torch.optim.AdamW optimiser is passed as an input argument to the
trainingModel() wrapper which applies the optimiser algorithm
during training on the IPU. The optimiser state is automatically managed by the
PopART framework so there is no need to call the
.step() method. Another
significant change from the native training loop is there is no
As mentioned above, PopTorch uses its own automatic differentiation engine and will
detect the loss value to backpropagate the gradients from.
1 # Set up the PyTorch DataLoader to load that much data at each iteration 2 opts = poptorch.Options() 3 opts.deviceIterations(10) 4 training_data = poptorch.DataLoader(options=opts, 5 dataset=ExampleDataset(shape=, 6 length=20000), 7 batch_size=10, 8 shuffle=True, 9 drop_last=True) 10 11 model = ExampleModelWithLoss() 12 model.train() 13 14 optimizer = torch.optim.AdamW(model.parameters(), lr=0.001) 15 16 # Wrap the model in a PopTorch training wrapper 17 poptorch_model = poptorch.trainingModel(model, 18 options=opts, 19 optimizer=optimizer) 20 21 momentum_loss = None 22 23 for batch, target in training_data: 24 # Performs forward pass, loss function evaluation, 25 # backward pass and weight update in one go on the device. 26 _, loss = poptorch_model(batch, target) 27 28 if momentum_loss is None: 29 momentum_loss = loss 30 else: 31 momentum_loss = momentum_loss * 0.95 + loss * 0.05 32 33 # Optimizer can be updated via setOptimizer. 34 if momentum_loss < 0.1: 35 poptorch_model.setOptimizer( 36 torch.optim.AdamW(model.parameters(), lr=0.0001))
1.2. Parallel and Distributed execution
To scale your models, you can enable Multi-IPU execution strategies using PopTorch’s Annotations to label or wrap individual parts of your model and assign parts of the model to an individual IPU or execution phase. You can also use PopTorch’s Available execution strategies to determine how the model executes the phases.
Having assigned the model to run on one or more IPUs, you can add additional parallelism through replication. Each replica represents an addition copy of the entire model, which runs in parallel.
PopTorch can also run across multiple hosts. This is necessary for using more than 64 IPUs across IPU-PODs and may be beneficial when using a smaller number of IPUs such as models involving intensive pre-processing on the CPU. We recommend using the PopRun command-line tool and and PopDist configuration library, which can automatically set up PopTorch to run across multiple IPU-POD hosts. Please refer to the PopDist and PopRun User Guide for more information, including details about the installation of Horovod if you are using the MPI communication protocol.
The following constraints apply when using PopTorch:
All tensor data types and shapes must be constant for the entire dataset.
As PopTorch compiles to a static graph, it cannot handle control flow variations within the model. This means that the inputs passed at run-time cannot vary the control flow of the model or the shapes or sizes of results. If this is attempted, the graph will be frozen to whichever control flow path was activated as a result of the first inputs given to the wrapped model.
Not all PyTorch operations are implemented within the PopTorch compiler. See Section 6, IPU supported operations for a list of operators that are supported on the IPU. Please also report any unsupported operators to email@example.com so that these ops may be incorporated into a future release.
Whilst any argument type can be used in the forward method, only tensor arguments may change between model invocations, as other types will be statically compiled inside the executable.
1.4. Other resources
Please see Graphcore’s website for How-to Videos, Graphcore’s examples GitHub repository for PopTorch applications, and Graphcore’s tutorials GitHub repository for feature examples, tutorials and simple applications. Further developer resources can be found on Graphcore’s developer portal.