10. Experimental features

10.1. Distributed execution without PopRun

PopTorch supports distributed execution on a Pod using the IPU over Fabric (IPUoF).

If you run your program with your own distributed processing tool instead of PopRun, the only change your code needs is a call to configureProcessId(), which sets the ID of the current process and the total number of processes that the execution is distributed across.

Note that replicationFactor() sets the number of local replicas (per host), not the total (global) number of replicas.
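
The relationship between local and global replicas can be sketched with a little arithmetic. The helper names below are hypothetical and not part of PopTorch. The sketch assumes that every process uses the same local replication factor, and that one call to the model consumes the model batch size multiplied by the device iterations, the local replicas and the gradient accumulation count.

```python
def global_replicas(local_replicas: int, num_processes: int) -> int:
    """Total replicas across all processes (assumes identical processes)."""
    return local_replicas * num_processes


def samples_per_step(model_batch_size: int, device_iterations: int,
                     local_replicas: int, gradient_accumulation: int) -> int:
    """Samples one process consumes per call to the model (assumed rule)."""
    return (model_batch_size * device_iterations * local_replicas *
            gradient_accumulation)


# Illustrative values only: 2 local replicas in each of 4 processes.
print(global_replicas(local_replicas=2, num_processes=4))  # 8

# Illustrative values only: a model batch size of 2 is an assumption.
print(samples_per_step(model_batch_size=2, device_iterations=400,
                       local_replicas=2, gradient_accumulation=8))  # 12800
```

In this sketch, the value passed to replicationFactor() would be 2, not 8.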

Listing 10.1 Changes required for distributed execution
import poptorch

# "model", "model_batch_size" and "ExampleDataset" are assumed to be defined
# elsewhere in the program.


def process(process_id=0, num_processes=1):
    # Create a poptorch.Options instance to override default options
    opts = poptorch.Options()

    # Run a 400 iteration loop on the IPU, fetching a new batch each time
    opts.deviceIterations(400)

    # Replicate the graph across 2 IPUs in each process.
    opts.replicationFactor(2)

    # Set the ID of the current process and the total number of processes.
    opts.Distributed.configureProcessId(process_id, num_processes)

    # Accumulate the gradient 8 times before applying it.
    opts.Training.gradientAccumulation(8)

    # All the processes must use the same seed if shuffle=True is used for
    # the DataLoader.
    opts.randomSeed(42)

    training_data = poptorch.DataLoader(opts,
                                        dataset=ExampleDataset(shape=[3, 2],
                                                               length=100000),
                                        batch_size=model_batch_size,
                                        shuffle=True,
                                        drop_last=True)

    # Wrap the model in a PopTorch training wrapper
    poptorch_model = poptorch.trainingModel(model, options=opts)

    # Each item yielded by the DataLoader holds all the samples this process
    # consumes in one call to the model: model_batch_size * 400 (device
    # iterations) * 2 (local replicas) * 8 (gradient accumulation).
    for batch_number, (data, labels) in enumerate(training_data):
        # Execute a 400 iteration loop on the 2 local replicas of this
        # process. "output" and "loss" will be the respective output and
        # loss of the final batch of each replica (the default OutputMode).
        output, loss = poptorch_model(data, labels)
        print(f"{batch_number} {labels[-1]}, {output}, {loss}")
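
The listing above leaves open how each process learns its own ID. As a minimal sketch, assuming the program is launched with Open MPI's mpirun (which sets the OMPI_COMM_WORLD_RANK and OMPI_COMM_WORLD_SIZE environment variables), the two values could be read from the environment. The helper process_info_from_env() is hypothetical; other launchers use different variable names, which can be passed in as arguments.

```python
import os


def process_info_from_env(rank_var="OMPI_COMM_WORLD_RANK",
                          size_var="OMPI_COMM_WORLD_SIZE"):
    """Return (process_id, num_processes) from launcher-provided variables.

    Falls back to a single process (0, 1) when the variables are not set.
    """
    process_id = int(os.environ.get(rank_var, "0"))
    num_processes = int(os.environ.get(size_var, "1"))
    if not 0 <= process_id < num_processes:
        raise ValueError(
            f"process_id {process_id} is outside [0, {num_processes})")
    return process_id, num_processes


if __name__ == "__main__":
    process_id, num_processes = process_info_from_env()
    print(f"process {process_id} of {num_processes}")
    # With the listing above in scope, the values would then be passed on:
    # process(process_id, num_processes)
```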

Note

The DataLoader will automatically select a different subset of the dataset based on the process ID.

Warning

All the processes must use the same seed if shuffle=True is used for the DataLoader.
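
To see why the seed must match, consider the following pure-Python sketch. It is not PopTorch's implementation: the strided split and the helper shard_indices() are assumptions made for illustration. Each process shuffles the full index list with its seed and then keeps its own slice. With a shared seed, the slices are disjoint and together cover the dataset. With different seeds, each process slices a different permutation, so samples may be duplicated or missed.

```python
import random


def shard_indices(dataset_len, process_id, num_processes, seed):
    """Shuffle all indices with `seed`, then keep this process's share."""
    indices = list(range(dataset_len))
    random.Random(seed).shuffle(indices)
    # Strided split: process p takes elements p, p + n, p + 2n, ...
    return indices[process_id::num_processes]


# Same seed in both processes: the shards partition the dataset.
shards = [shard_indices(8, p, 2, seed=42) for p in range(2)]
assert set(shards[0]).isdisjoint(shards[1])
assert sorted(shards[0] + shards[1]) == list(range(8))

# Different seeds: nothing guarantees a partition any more.
mixed = [shard_indices(8, 0, 2, seed=1), shard_indices(8, 1, 2, seed=2)]
print("indices seen by both processes:", sorted(set(mixed[0]) & set(mixed[1])))
```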

10.2. torch.nn.CTCLoss

The CTCLoss operator is supported, with some limitations:

  1. The zero_infinity parameter must be set to False.

  2. The reduction parameter must be set to either "sum" or "mean".

  3. The targets tensor must be 2D, corresponding to the stacked, padded layout.
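
PyTorch also accepts targets as a single 1D tensor of concatenated sequences, which falls outside the limitations above. A minimal sketch of converting that layout to the stacked, padded 2D layout follows. The helper stack_targets() is hypothetical and works on plain Python lists. The pad value of 0 is an assumption, chosen because entries beyond each target length are not read by the loss.

```python
def stack_targets(flat_targets, target_lengths, pad_value=0):
    """Convert concatenated 1D targets into a padded 2D (N, S) layout."""
    if sum(target_lengths) != len(flat_targets):
        raise ValueError("target_lengths do not add up to len(flat_targets)")
    max_len = max(target_lengths, default=0)
    rows, start = [], 0
    for length in target_lengths:
        row = list(flat_targets[start:start + length])
        # Pad each row on the right up to the longest sequence.
        rows.append(row + [pad_value] * (max_len - length))
        start += length
    return rows


# Two target sequences, [1, 2] and [3, 4, 5], stored back to back.
print(stack_targets([1, 2, 3, 4, 5], [2, 3]))  # [[1, 2, 0], [3, 4, 5]]
```

The result could then be wrapped with torch.tensor(rows, dtype=torch.long) and passed as targets, together with the unchanged target_lengths, to torch.nn.CTCLoss(reduction="mean", zero_infinity=False).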