5. Efficient data batching

By default, PopTorch processes the batch_size that you provide to poptorch.DataLoader.

When using the other options below, the actual number of samples used per step varies to allow the IPU(s) to process data more efficiently.

However, the effective (mini-)batch size for operations which depend on it (such as batch normalization) will not change. All that changes is how much data is actually sent for a single step.

Note

Failure to use poptorch.DataLoader may result in accidentally changing the effective batch size for operations which depend on it, such as batch normalization.
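As a rough guide to how the options in this chapter change the amount of data sent per step, the sketch below multiplies the model batch size by each option. The helper name samples_per_step is hypothetical and not part of PopTorch, and the sketch assumes the factors combine multiplicatively, which is consistent with the "batch_size 200" comment in Listing 5.2 (batch size 2 with 100 device iterations).

```python
def samples_per_step(batch_size, device_iterations=1,
                     replication_factor=1, gradient_accumulation=1):
    """Hypothetical helper: number of samples the host supplies for one step.

    Assumes the options combine multiplicatively; the model batch size seen
    by each operation (for example batch normalization) stays `batch_size`.
    """
    return (batch_size * device_iterations * replication_factor
            * gradient_accumulation)


# Batch size 2 with a 100 iteration loop on the IPU, as in Listing 5.2.
print(samples_per_step(2, device_iterations=100))
```

The model batch size is an input to this calculation and is never changed by it, which is the point made above about operations that depend on the batch size.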

5.1. poptorch.DataLoader

PopTorch provides a thin wrapper around the traditional torch.utils.data.DataLoader to abstract away some of the batch size calculations. If poptorch.DataLoader is used in a distributed execution environment, it will ensure that each process uses a different subset of the dataset.

If you set the DataLoader batch_size to more than 1 then each operation in the model will process that number of elements at any given time.

See below for a usage example.

5.2. poptorch.AsynchronousDataAccessor

To reduce host overhead, you can offload data loading to a separate process by specifying mode=poptorch.DataLoaderMode.Async in the DataLoader constructor. Internally this uses an AsynchronousDataAccessor. This hides part of the host/IPU communication overhead: while the IPU is running, the CPU loads the next batch, so that when the IPU finishes executing and returns to the host, the data is ready for the IPU to pull in.
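The overlap described above can be illustrated with a generic producer/consumer prefetcher. This is only a sketch of the idea, not how AsynchronousDataAccessor is implemented: the function name prefetch and its depth parameter are hypothetical, and a Python thread is used for brevity.

```python
import queue
import threading


def prefetch(iterable, depth=2):
    """Yield items from `iterable`, loading up to `depth` items ahead.

    A background worker fills a bounded queue while the consumer (standing
    in for the IPU step) is busy with the previous item.
    """
    buffer = queue.Queue(maxsize=depth)
    sentinel = object()  # marks the end of the data

    def worker():
        for item in iterable:
            buffer.put(item)  # blocks when the queue is full
        buffer.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


for batch in prefetch(range(5)):
    # The next batch is being loaded while this one is "processed".
    print(batch)
```

The bounded queue keeps the loader from running arbitrarily far ahead of the consumer, so memory use stays proportional to depth.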

Listing 5.1 Use of AsynchronousDataAccessor

    opts = poptorch.Options()
    opts.deviceIterations(device_iterations)
    opts.replicationFactor(replication_factor)

    loader = poptorch.DataLoader(opts,
                                 ExampleDataset(shape=shape,
                                                length=num_tensors),
                                 batch_size=batch_size,
                                 num_workers=num_workers,
                                 mode=poptorch.DataLoaderMode.Async)

    poptorch_model = poptorch.inferenceModel(model, opts)

    for it, (data, _) in enumerate(loader):
        out = poptorch_model(data)

Warning

AsynchronousDataAccessor makes use of the Python multiprocessing module’s spawn start method. Consequently, the entry point of a program that uses it must be guarded by an if __name__ == '__main__': block to avoid endless recursion. The dataset used must also be picklable. For more information, please see https://docs.python.org/3/library/multiprocessing.html#the-spawn-and-forkserver-start-methods.
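A minimal sketch of a guarded entry point follows. The helper is_picklable and the function main are hypothetical names used for illustration, and a plain list stands in for the dataset; the PopTorch calls are left as a comment so the sketch runs without an IPU.

```python
import pickle


def is_picklable(obj):
    """Return True if `obj` can be serialised with pickle."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


def main():
    # Stand-in dataset: a list of (data, label) pairs.
    dataset = [(float(i), i % 2) for i in range(8)]

    # Objects sent to a spawned worker process must be picklable.
    if not is_picklable(dataset):
        raise TypeError("dataset must be picklable for asynchronous loading")

    # Here you would build poptorch.DataLoader(...,
    # mode=poptorch.DataLoaderMode.Async) and run the model.
    return len(dataset)


# Without this guard, each spawned worker would re-import the script and
# execute the top-level code again.
if __name__ == "__main__":
    main()
```

Objects such as lambdas typically fail the pickling check, so dataset transforms are safer written as module-level functions or classes.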

Warning

Tensors being iterated over using an AsynchronousDataAccessor use shared memory. You must clone tensors at each iteration if you wish to keep their references outside of each iteration.

Consider the following example:

predictions, labels = [], []

for data, label in dataloader:
    predictions += poptorch_model(data)
    labels += label

The predictions list will be correct because the model produces a new tensor from its inputs. However, the labels list will contain identical references. The line that extends labels would need to be replaced with the following:

labels += label.detach().clone()
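The aliasing problem can be reproduced without PopTorch. The sketch below is an analogy only: a hypothetical generator, reusing_loader, overwrites one buffer in place on every iteration, in the same spirit as a loader that hands out views of shared memory. Plain Python lists stand in for tensors, and list(batch) plays the role of clone().

```python
def reusing_loader(batches):
    """Yield every batch through the same buffer, overwritten in place."""
    buffer = [0] * len(batches[0])
    for batch in batches:
        buffer[:] = batch  # in-place update, as with shared memory
        yield buffer


source = [[1, 2], [3, 4], [5, 6]]

# Keeping the yielded object stores the same reference three times.
kept_refs = [batch for batch in reusing_loader(source)]

# Copying at each iteration keeps the values seen at that iteration.
kept_copies = [list(batch) for batch in reusing_loader(source)]

print(kept_refs)    # every entry shows the last batch
print(kept_copies)  # each entry shows its own batch
```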

5.3. poptorch.Options.deviceIterations

If you set deviceIterations() to more than 1 then you are telling PopART to execute that many batches in sequence.

Essentially, it is the equivalent of launching the IPU in a loop over that number of batches. This is efficient because that loop runs on the IPU directly.
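The loop equivalence can be sketched in plain Python. The function run_device_loop is a hypothetical stand-in for what happens on the device, and model_fn stands in for the model; the sketch returns only the output of the final batch, mirroring the default behaviour noted in the comments of Listing 5.2.

```python
def run_device_loop(step_data, batch_size, model_fn):
    """Feed `step_data` to `model_fn` one model-sized batch at a time."""
    if len(step_data) % batch_size != 0:
        raise ValueError("step data must hold a whole number of batches")

    output = None
    for start in range(0, len(step_data), batch_size):
        output = model_fn(step_data[start:start + batch_size])
    return output  # output of the final batch only


# 100 device iterations of batch size 2 need 200 samples for one step.
step_data = list(range(200))
final_output = run_device_loop(step_data, batch_size=2, model_fn=sum)
print(final_output)
```

From the host's point of view there is a single call per step; the 100 iterations happen inside it.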

Listing 5.2 Use of device iterations and batch size

from functools import reduce
from operator import mul

import torch
import poptorch


class ExampleModelWithLoss(torch.nn.Module):
    def __init__(self, data_shape, num_classes):
        super().__init__()

        self.fc = torch.nn.Linear(reduce(mul, data_shape), num_classes)
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, x, target=None):
        reshaped = x.reshape([x.shape[0], -1])
        fc = self.fc(reshaped)

        if target is not None:
            return fc, self.loss(fc, target)
        return fc


class ExampleDataset(torch.utils.data.Dataset):
    def __init__(self, shape, length):
        super().__init__()
        self._shape = shape
        self._length = length

        self._all_data = []
        self._all_labels = []

        torch.manual_seed(0)
        for _ in range(length):
            label = 1 if torch.rand(()) > 0.5 else 0
            data = torch.rand(self._shape) + label
            data[0] = -data[0]
            self._all_data.append(data)
            self._all_labels.append(label)

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        return self._all_data[index], self._all_labels[index]


def example():
    # Set the batch size in the conventional sense of being the size that
    # runs through an operation in the model at any given time
    model_batch_size = 2

    # Create a poptorch.Options instance to override default options
    opts = poptorch.Options()

    # Run a 100 iteration loop on the IPU, fetching a new batch each time
    opts.deviceIterations(100)

    # Set up the DataLoader to load that much data at each iteration
    training_data = poptorch.DataLoader(opts,
                                        dataset=ExampleDataset(shape=[3, 2],
                                                               length=10000),
                                        batch_size=model_batch_size,
                                        shuffle=True,
                                        drop_last=True)

    model = ExampleModelWithLoss(data_shape=[3, 2], num_classes=2)
    # Wrap the model in a PopTorch training wrapper
    poptorch_model = poptorch.trainingModel(model, options=opts)

    # Run over the training data with "batch_size" 200 essentially.
    for batch_number, (data, labels) in enumerate(training_data):
        # Execute the device with a 100 iteration loop of batchsize 2.
        # "output" and "loss" will be the respective output and loss of the final
        # batch (the default AnchorMode).

        output, loss = poptorch_model(data, labels)
        print(f"{labels[-1]}, {output}, {loss}")

5.4. poptorch.Options.replicationFactor

replicationFactor() will replicate the model over multiple IPUs to allow automatic data parallelism across many IPUs.

Listing 5.3 Use of replication factor

    # Create a poptorch.Options instance to override default options
    opts = poptorch.Options()

    # Run a 100 iteration loop on the IPU, fetching a new batch each time
    opts.deviceIterations(100)

    # Duplicate the model over 4 replicas.
    opts.replicationFactor(4)

    training_data = poptorch.DataLoader(opts,
                                        dataset=ExampleDataset(shape=[3, 2],
                                                               length=100000),
                                        batch_size=model_batch_size,
                                        shuffle=True,
                                        drop_last=True)

    model = ExampleModelWithLoss(data_shape=[3, 2], num_classes=2)
    # Wrap the model in a PopTorch training wrapper
    poptorch_model = poptorch.trainingModel(model, options=opts)

    # Run over the training data with "batch_size" 800 essentially
    # (batch size 2 x 100 device iterations x 4 replicas).
    for batch_number, (data, labels) in enumerate(training_data):
        # Execute the device with a 100 iteration loop of batchsize 2 across
        # 4 IPUs. "output" and "loss" will be the respective output and loss of the
        # final batch of each replica (the default AnchorMode).
        output, loss = poptorch_model(data, labels)
        print(f"{labels[-1]}, {output}, {loss}")
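To see how many host-side steps a configuration like Listing 5.3 performs per epoch, the arithmetic can be sketched as below. The helper steps_per_epoch is hypothetical, model_batch_size is assumed to be 2 as in Listing 5.2, and drop_last=True is assumed to discard any incomplete final step.

```python
def steps_per_epoch(dataset_length, batch_size, device_iterations,
                    replication_factor):
    """Hypothetical helper: full steps available in one pass over the data."""
    samples_per_step = batch_size * device_iterations * replication_factor
    # Integer division models drop_last=True: a partial step is discarded.
    return dataset_length // samples_per_step


# Listing 5.3: 100000 samples, batch size 2, 100 iterations, 4 replicas.
print(steps_per_epoch(100000, batch_size=2, device_iterations=100,
                      replication_factor=4))
```

Under these assumptions each step consumes 800 samples, of which each replica processes 200.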

5.5. poptorch.Options.Training.gradientAccumulation

You need to use gradientAccumulation() when training with pipelined models because the weights are shared across pipeline batches, so gradients will be both updated and used by subsequent batches out of order. Note that gradientAccumulation() is only needed by poptorch.PipelinedExecution.

See also poptorch.Block.
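The idea behind gradient accumulation can be sketched with a one-parameter model. This is a generic illustration rather than PopTorch's implementation: the function names are hypothetical, the loss is a simple squared error, and the accumulated gradient is averaged before it is applied (whether gradients are summed or averaged is a configuration choice that this sketch does not take from the source).

```python
def gradient(weight, x, y):
    """Gradient of the squared error (weight * x - y) ** 2 w.r.t. weight."""
    return 2.0 * (weight * x - y) * x


def accumulated_step(weight, micro_batches, learning_rate):
    """Accumulate gradients over all micro-batches, then update once."""
    accumulated = 0.0
    for x, y in micro_batches:
        # The weight is not modified here, so every micro-batch in the
        # pipeline sees the same weight value.
        accumulated += gradient(weight, x, y)
    return weight - learning_rate * accumulated / len(micro_batches)


new_weight = accumulated_step(0.0, [(1.0, 2.0), (2.0, 4.0)],
                              learning_rate=0.1)
print(new_weight)
```

Because the weight only changes after the last micro-batch, batches that are in flight at the same time cannot observe a half-updated model.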

Listing 5.4 Use of gradient accumulation

    # Create a poptorch.Options instance to override default options
    opts = poptorch.Options()

    # Run a 400 iteration loop on the IPU, fetching a new batch each time
    opts.deviceIterations(400)

    # Accumulate the gradient 8 times before applying it.
    opts.Training.gradientAccumulation(8)

    training_data = poptorch.DataLoader(opts,
                                        dataset=ExampleDataset(shape=[3, 2],
                                                               length=100000),
                                        batch_size=model_batch_size,
                                        shuffle=True,
                                        drop_last=True)

    # Wrap the model in a PopTorch training wrapper
    poptorch_model = poptorch.trainingModel(model, options=opts)

    # Run over the training data with "batch_size" 6400 essentially
    # (batch size 2 x 8 accumulated batches x 400 device iterations).
    for batch_number, (data, labels) in enumerate(training_data):
        # Execute the device with a 400 iteration loop, where each iteration
        # accumulates gradients over 8 batches of size 2 before updating the
        # weights. "output" and "loss" will be the respective output and loss
        # of the final batch (the default AnchorMode).
        output, loss = poptorch_model(data, labels)
        print(f"{labels[-1]}, {output}, {loss}")

In the code example below, poptorch.Block is used to divide up a different model into disjoint subsets of layers. These blocks can be shared among multiple parallel execution strategies.

Listing 5.5 A training model making use of poptorch.Block

import torch
import torch.nn as nn
import poptorch


class Network(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(784, 784)
        self.layer2 = nn.Linear(784, 784)
        self.layer3 = nn.Linear(784, 128)
        self.layer4 = nn.Linear(128, 10)
        self.softmax = nn.Softmax(1)

    def forward(self, x):
        x = x.view(-1, 784)
        with poptorch.Block("B1"):
            x = self.layer1(x)
        with poptorch.Block("B2"):
            x = self.layer2(x)
        with poptorch.Block("B3"):
            x = self.layer3(x)
        with poptorch.Block("B4"):
            x = self.layer4(x)
            x = self.softmax(x)
        return x


class TrainingModelWithLoss(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, args, loss_inputs=None):
        output = self.model(args)
        if loss_inputs is None:
            return output
        with poptorch.Block("B4"):
            loss = self.loss(output, loss_inputs)
        return output, loss

You can see the code examples of poptorch.SerialPhasedExecution, poptorch.PipelinedExecution, and poptorch.ShardedExecution below.

An instance of class poptorch.PipelinedExecution defines an execution strategy that assigns layers to multiple IPUs as a pipeline. Gradient accumulation is used to push multiple batches through the pipeline allowing IPUs to run in parallel.

Listing 5.6 An example of different parallel execution strategies

    training_data, test_data = get_mnist_data(opts)
    model = Network()
    model_with_loss = TrainingModelWithLoss(model)
    model_opts = poptorch.Options().deviceIterations(1)
    if opts.strategy == "phased":
        strategy = poptorch.SerialPhasedExecution("B1", "B2", "B3", "B4")
        strategy.stage("B1").ipu(0)
        strategy.stage("B2").ipu(0)
        strategy.stage("B3").ipu(0)
        strategy.stage("B4").ipu(0)
        model_opts.setExecutionStrategy(strategy)
    elif opts.strategy == "pipelined":
        strategy = poptorch.PipelinedExecution("B1", "B2", "B3", "B4")
        strategy.stage("B1").ipu(0)
        strategy.stage("B2").ipu(1)
        strategy.stage("B3").ipu(2)
        strategy.stage("B4").ipu(3)
        model_opts.setExecutionStrategy(strategy)
        model_opts.Training.gradientAccumulation(opts.batches_per_step)
    else:
        strategy = poptorch.ShardedExecution("B1", "B2", "B3", "B4")
        strategy.stage("B1").ipu(0)
        strategy.stage("B2").ipu(0)
        strategy.stage("B3").ipu(0)
        strategy.stage("B4").ipu(0)
        model_opts.setExecutionStrategy(strategy)

    if opts.offload_opt:
        model_opts.TensorLocations.setActivationLocation(
            poptorch.TensorLocationSettings().useOnChipStorage(True))
        model_opts.TensorLocations.setWeightLocation(
            poptorch.TensorLocationSettings().useOnChipStorage(True))
        model_opts.TensorLocations.setAccumulatorLocation(
            poptorch.TensorLocationSettings().useOnChipStorage(True))
        model_opts.TensorLocations.setOptimizerLocation(
            poptorch.TensorLocationSettings().useOnChipStorage(False))

    training_model = poptorch.trainingModel(
        model_with_loss,
        model_opts,
        optimizer=optim.AdamW(model.parameters(), lr=opts.lr))

    # Run training on the IPU
    train(training_model, training_data, opts)

Fig. 5.1 shows the pipeline execution for multiple batches on IPUs. There are 4 pipeline stages running on 4 IPUs respectively. Gradient accumulation enables us to keep the same number of pipeline stages, but with a wider pipeline. This helps hide the latency, which is the total time for one item to go through the whole system, as highlighted.

Fig. 5.1 Pipeline execution with gradient accumulation (image: _images/IPU-pipeline.jpg)
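The benefit of a wider pipeline can be estimated with an idealised model. The sketch below assumes every stage takes one equal time slot per batch and ignores the backward pass and the weight update, so it is an illustration of the fill and drain overhead rather than a measurement of IPU behaviour; the function name pipeline_utilisation is hypothetical.

```python
def pipeline_utilisation(num_stages, num_batches):
    """Fraction of time slots in which a given stage does useful work.

    In an idealised pipeline, pushing `num_batches` batches through
    `num_stages` stages takes num_batches + num_stages - 1 time slots,
    and each stage is busy for num_batches of them.
    """
    total_slots = num_batches + num_stages - 1
    return num_batches / total_slots


# Four stages, as in Fig. 5.1, with an increasing number of batches
# pushed through the pipeline before each weight update.
for num_batches in (1, 8, 64):
    print(num_batches, round(pipeline_utilisation(4, num_batches), 3))
```

With a single batch each stage is idle most of the time, while accumulating over more batches amortises the fill and drain phases.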