1. Introduction

PopTorch is a set of extensions for PyTorch that enables PyTorch models to run directly on Graphcore IPU hardware. PopTorch has been designed to require as few changes as possible to your models in order to run on the IPU. However, to get the most out of IPU hardware, it does differ from native PyTorch execution in some respects.

See the “Getting Started” guide for your IPU system on the Graphcore documentation portal for information on installing the Poplar SDK and PopTorch.

In the Graphcore software stack, PyTorch sits at the highest level of abstraction. Poplar and PopLibs provide a software interface to operations running on the IPU. PopTorch compiles PyTorch models into Poplar executables and also provides IPU-specific functions.

Fig. 1.1 PyTorch, PopTorch and the Poplar software stack

PopTorch supports executing native PyTorch models for both inference and training. To run a PyTorch model on the IPU, you must wrap your model with either:

  • poptorch.inferenceModel()

  • poptorch.trainingModel()

Both of these functions accept a PyTorch model (torch.nn.Module) and create a representation of the model that can be executed on the IPU hardware.

In training mode, PopTorch uses its own automatic differentiation engine (autograd) that differs from native PyTorch. The input model (torch.nn.Module) is required to have at least one loss built into the forward pass. PopTorch backpropagates the gradients from the loss value(s) to update the model parameters. This is all taken care of automatically so your training loop does not need to call .backward() on the loss value(s) or .step() on the optimiser.
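As an illustration of what "a loss built into the forward pass" can look like, here is a minimal sketch in plain PyTorch. The class name TinyModelWithLoss, its layer sizes and the choice of cross-entropy loss are assumptions made for this example; they are not taken from the PopTorch examples. The sketch runs on the CPU and does not use PopTorch.

```python
import torch


class TinyModelWithLoss(torch.nn.Module):
    """Hypothetical linear classifier that computes its loss in forward()."""

    def __init__(self, in_features=4, num_classes=3):
        super().__init__()
        self.fc = torch.nn.Linear(in_features, num_classes)
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def forward(self, x, target=None):
        logits = self.fc(x)
        if target is None:
            # No target supplied: return the prediction only.
            return logits
        # Target supplied: return the prediction and the loss together.
        return logits, self.loss_fn(logits, target)


model = TinyModelWithLoss()
x = torch.randn(8, 4)               # batch of 8 samples, 4 features each
target = torch.randint(0, 3, (8,))  # one class index per sample

logits, loss = model(x, target)
print(logits.shape, loss.shape)
```

Returning the loss as an output of the model, rather than computing it in the training loop, is what lets a wrapper see the whole computation from inputs to loss value.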

The following example shows a typical native PyTorch training loop. The model incorporates a loss criterion within the .forward() method, and returns the loss value as a second output (along with the prediction). This native PyTorch training loop manually invokes the .backward() method to backpropagate the gradients. The loop also manually updates the optimiser by calling the .step() method.

Listing 1.1 A simple example of training using PyTorch on the CPU
    import torch

    # ExampleDataset and ExampleModelWithLoss are assumed to be defined elsewhere.
    training_data = torch.utils.data.DataLoader(ExampleDataset(shape=[1],
                                                               length=20000),
                                                batch_size=10,
                                                shuffle=True,
                                                drop_last=True)

    model = ExampleModelWithLoss()
    model.train()

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

    momentum_loss = None

    for batch, target in training_data:
        # Zero gradients
        optimizer.zero_grad()

        # Run model.
        _, loss = model(batch, target)

        # Back propagate the gradients.
        loss.backward()

        # Update the weights.
        optimizer.step()

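        # Track a running average of the loss as a plain float so that no
        # autograd graph is kept alive between iterations.
        if momentum_loss is None:
            momentum_loss = loss.item()
        else:
            momentum_loss = momentum_loss * 0.95 + loss.item() * 0.05

        # Lower the learning rate once the averaged loss is small enough.
        if momentum_loss < 0.1:
            optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001)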

1.1. Data batching

An equivalent training loop executing the model on the IPU with PopTorch is shown below. The poptorch.DataLoader is used to load data batches onto the IPU efficiently. PopTorch follows the data batching semantics of PopART. By default, this means you simply pass in data with the normal batch size. However, PopTorch provides a number of options that enable more efficient data loading. See Section 5, Efficient data batching for more information.
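As a rough illustration of how such options change the amount of data loaded per iteration, the sketch below assumes that the number of samples handed to the wrapped model on each call is the model batch size multiplied by the number of device iterations (the deviceIterations option used in Listing 1.2). The helper name combined_batch_size is hypothetical and not part of PopTorch; the dataset length, batch size and device iteration count are the values used in the listings in this section.

```python
def combined_batch_size(batch_size, device_iterations=1):
    """Samples loaded per call, assuming batch_size * device_iterations."""
    return batch_size * device_iterations


dataset_length = 20000   # length passed to ExampleDataset in the listings
batch_size = 10          # batch size seen by the model on the device
device_iterations = 10   # value passed to opts.deviceIterations()

per_call = combined_batch_size(batch_size, device_iterations)

# With drop_last=True, any incomplete final batch would be discarded.
calls_per_epoch = dataset_length // per_call

print(per_call, calls_per_epoch)  # 100 200
```

Under this assumption, the training loop body runs 200 times per epoch instead of 2000, with each call processing 10 batches on the device.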

Notice that the torch.optim.AdamW optimiser is passed as an input argument to the poptorch.trainingModel() wrapper, which applies the optimiser algorithm during training on the IPU. The optimiser state is automatically managed by the PopART framework, so there is no need to call the .step() method. Another significant change from the native training loop is that there is no call to loss.backward(). As mentioned above, PopTorch uses its own automatic differentiation engine and detects the loss value from which to backpropagate the gradients.

Listing 1.2 Equivalent code using PopTorch to train on the IPU
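    import poptorch
    import torch

    # ExampleDataset and ExampleModelWithLoss are assumed to be defined elsewhere.
    # Run 10 device iterations per call and set up the PopTorch DataLoader to
    # load enough data for all of them at each iteration.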
    opts = poptorch.Options()
    opts.deviceIterations(10)
    training_data = poptorch.DataLoader(options=opts,
                                        dataset=ExampleDataset(shape=[1],
                                                               length=20000),
                                        batch_size=10,
                                        shuffle=True,
                                        drop_last=True)

    model = ExampleModelWithLoss()
    model.train()

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

    # Wrap the model in a PopTorch training wrapper
    poptorch_model = poptorch.trainingModel(model,
                                            options=opts,
                                            optimizer=optimizer)

    momentum_loss = None

    for batch, target in training_data:
        # Performs forward pass, loss function evaluation,
        # backward pass and weight update in one go on the device.
        _, loss = poptorch_model(batch, target)

        if momentum_loss is None:
            momentum_loss = loss
        else:
            momentum_loss = momentum_loss * 0.95 + loss * 0.05

        # Optimizer can be updated via setOptimizer.
        if momentum_loss < 0.1:
            poptorch_model.setOptimizer(
                torch.optim.AdamW(model.parameters(), lr=0.0001))
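
To make the calling convention in Listing 1.2 more concrete, here is a CPU-only stand-in written in plain PyTorch. It is not how PopTorch is implemented: PopTorch compiles the model and runs the whole step on the IPU, whereas this sketch simply bundles the familiar zero_grad(), backward() and step() calls behind a single call. The class FusedTrainingStep, the model LinearWithLoss and the assumption that the loss is the last output of the model are all hypothetical and chosen for this example.

```python
import torch


class FusedTrainingStep:
    """Hypothetical CPU stand-in: one call does forward, backward and update."""

    def __init__(self, model, optimizer):
        self.model = model
        self.optimizer = optimizer

    def __call__(self, *inputs):
        self.optimizer.zero_grad()
        outputs = self.model(*inputs)
        loss = outputs[-1]  # assumption: the loss is the last output
        loss.backward()
        self.optimizer.step()
        # Return detached tensors so the caller never holds an autograd graph.
        return tuple(out.detach() for out in outputs)

    def setOptimizer(self, optimizer):
        # Mirrors the method name used in Listing 1.2.
        self.optimizer = optimizer


class LinearWithLoss(torch.nn.Module):
    """Hypothetical regression model with the loss built into forward()."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(1, 1)
        self.loss_fn = torch.nn.MSELoss()

    def forward(self, x, target):
        pred = self.fc(x)
        return pred, self.loss_fn(pred, target)


torch.manual_seed(0)
model = LinearWithLoss()
step = FusedTrainingStep(model,
                         torch.optim.AdamW(model.parameters(), lr=0.01))

x = torch.randn(64, 1)
target = 3.0 * x  # synthetic data: learn y = 3x

first_loss = None
for _ in range(200):
    # No zero_grad(), backward() or step() in the loop body.
    _, loss = step(x, target)
    if first_loss is None:
        first_loss = loss.item()
last_loss = loss.item()

print(last_loss < first_loss)
```

The loop body has the same shape as the one in Listing 1.2: a single call returns the prediction and the loss, and the optimiser can be swapped through setOptimizer().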

1.2. Parallel and Distributed execution

To scale your models, you can enable Multi-IPU execution strategies using PopTorch’s Annotations to label or wrap individual parts of your model and assign parts of the model to an individual IPU or execution phase. You can also use PopTorch’s Available execution strategies to determine how the model executes the phases.

Having assigned the model to run on one or more IPUs, you can add additional parallelism through replication. Each replica represents an additional copy of the entire model, which runs in parallel.

PopTorch can also run across multiple hosts. This is necessary for using more than 64 IPUs across IPU-PODs and may be beneficial when using a smaller number of IPUs, for example with models that involve intensive pre-processing on the CPU. We recommend using the PopRun command-line tool and the PopDist configuration library, which can automatically set up PopTorch to run across multiple IPU-POD hosts. Please refer to the PopRun and PopDist user guide.

1.3. Constraints

PopTorch uses PyTorch’s torch.jit.trace API. That means it inherits the constraints of that API. These include:

  • Inputs must be PyTorch tensors or tuples containing PyTorch tensors.

  • None can be used as a default value for a parameter but cannot be explicitly passed as an input value.

  • torch.jit.trace cannot handle control flow or shape variations within the model. That is, the inputs passed at run-time cannot vary the control flow of the model or the shapes/sizes of results. If you attempt this, the graph will be frozen to whichever control flow path was traced as a result of the first inputs given to the wrapped model.

Note

All tensor data types and shapes must be constant for the entire dataset.
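
The control flow constraint can be reproduced with torch.jit.trace alone, without PopTorch or an IPU. In the sketch below, the function shift is a hypothetical example with a branch that depends on the input values. Tracing it with a positive input records only the "x + 10" path, so the traced version keeps taking that path even when the eager function would not. The tracer normally warns about this; the warning is suppressed here only to keep the output short.

```python
import warnings

import torch


def shift(x):
    # Data-dependent branch: the path taken depends on the input values.
    if x.sum() > 0:
        return x + 10
    return x - 10


with warnings.catch_warnings():
    # Tracing emits a TracerWarning for the tensor-to-bool conversion above.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(shift, torch.ones(3))

negative = -torch.ones(3)
eager_result = shift(negative)    # evaluates the condition: takes x - 10
traced_result = traced(negative)  # replays the recorded x + 10 path

print(eager_result.tolist())   # [-11.0, -11.0, -11.0]
print(traced_result.tolist())  # [9.0, 9.0, 9.0]
```

The two results differ because the traced graph contains no condition at all, only the operations that ran for the first input.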

Not all PyTorch operations have been implemented by the PopTorch compiler yet. See Section 6, IPU supported operations for a list of operators that are supported on the IPU. Please also report any unsupported operators to support@graphcore.ai so that these ops may be incorporated into a future release.

1.4. Other resources

Please see Graphcore’s website for How-to Videos, Graphcore’s examples GitHub repository for PopTorch applications, and Graphcore’s tutorials GitHub repository for feature examples, tutorials and simple applications. Further developer resources can be found on Graphcore’s developer portal.