16. Application example: MNIST

In this section, you will see how to train a simple machine learning application in PopXL. The neural network in this example has two linear layers. It is trained with the MNIST dataset, which contains 60,000 training images and 10,000 test images. Each input image is a handwritten digit with a resolution of 28x28 pixels.

16.1. Import the necessary libraries

First, you need to import all the required libraries.

import argparse
from typing import Dict, List, Tuple, Mapping
import numpy as np
import torch
from tqdm import tqdm
import popxl
import popxl.ops as ops
import popxl.transforms as transforms
from popxl.ops.call import CallSiteInfo
from mnist_utils import Timer, get_mnist_data

16.2. Prepare dataset

You can use a torch.utils.data.DataLoader for the training and validation data. Here, mnist is a function that returns a torch.utils.data.Dataset for the MNIST dataset.

    training_data = torch.utils.data.DataLoader(
        mnist(train=True),
        batch_size=batch_size,
        shuffle=True,
        drop_last=True,
    )

    validation_data = torch.utils.data.DataLoader(
        mnist(train=False),
        batch_size=test_batch_size,
        shuffle=True,
        drop_last=True,
    )
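
The mnist helper itself is not shown here. A minimal sketch of what it might look like, assuming torchvision is available and the usual MNIST normalisation constants (the example's actual helper may differ):

import torchvision

def mnist(train: bool) -> torch.utils.data.Dataset:
    # Hypothetical helper: downloads MNIST and normalises the images.
    return torchvision.datasets.MNIST(
        "~/.torch/datasets",
        train=train,
        download=True,
        transform=torchvision.transforms.Compose(
            [
                torchvision.transforms.ToTensor(),
                torchvision.transforms.Normalize((0.1307,), (0.3081,)),
            ]
        ),
    )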

16.3. Create IR for training

The training IR is created in build_train_ir. After creating an instance of popxl.Ir, operations are added to the IR within the context of its main graph. These operations are forced to execute in the order in which they are added by using the context manager :py:func:`~popxl.in_sequence`.

    ir = popxl.Ir()
    ir.num_host_transfers = 1
    ir.replication_factor = 1
    with ir.main_graph, popxl.in_sequence():

The first operations load the input images and labels into x and labels from the host-to-device streams img_stream and label_stream, respectively.

        # Host load input and labels
        img_stream = popxl.h2d_stream(
            [opts.batch_size, 28, 28], popxl.float32, name="input_stream"
        )
        x = ops.host_load(img_stream, "x")

        label_stream = popxl.h2d_stream(
            [opts.batch_size], popxl.int32, name="label_stream"
        )
        labels = ops.host_load(label_stream, "labels")

After the data is loaded from the host, you can build the network, calculate the loss and gradients, and finally update the weights. This process is shown in Fig. 16.1 and is detailed in the following sections.

Fig. 16.1 Overview of how to build a training IR in PopXL

To monitor the training process, you can also stream the loss from the IPU devices to the host.

        # Host store to get loss
        loss_stream = popxl.d2h_stream(loss.shape, loss.dtype, name="loss_stream")
        ops.host_store(loss_stream, loss)

16.3.1. Create network

The network has two linear layers. A linear layer is defined by the class Linear, which inherits from popxl.Module. Here we override the build method, which builds the subgraph that performs the linear computation.

class Linear(popxl.Module):
    def __init__(self) -> None:
        """
        Define a linear layer in PopXL.
        """
        self.W: popxl.Tensor = None
        self.b: popxl.Tensor = None

    def build(
        self, x: popxl.Tensor, out_features: int, bias: bool = True
    ) -> Tuple[popxl.Tensor, ...]:
        """
        Override the `build` method to build a graph.
        """
        self.W = popxl.graph_input((x.shape[-1], out_features), popxl.float32, "W")
        y = x @ self.W
        if bias:
            self.b = popxl.graph_input((out_features,), popxl.float32, "b")
            y = y + self.b

        y = ops.gelu(y)
        return y

In Fig. 16.1, you can see two graphs created from the two linear layers by using popxl.Ir.create_graph() and called by using popxl.ops.call_with_info(). The tensors x1 and y are the outputs of the first and second linear graph calls, respectively. The weight tensors, bias tensors, output tensors, graphs, and graph call site information are all returned for the next step. This forward graph of the network is created in the function create_network_fwd_graph.

def create_network_fwd_graph(
    ir, x
) -> Tuple[
    Tuple[popxl.Tensor], Dict[str, popxl.Tensor], List[popxl.Graph], Tuple[CallSiteInfo]
]:
    """
    Define the network architecture.

    Args:
        ir (popxl.Ir): The ir to create model in.
        x (popxl.Tensor): The input tensor of this model.

    Returns:
        Tuple[Tuple[popxl.Tensor], Dict[str, popxl.Tensor], List[popxl.Graph], Tuple[CallSiteInfo]]: The info needed to calculate the gradients later
    """
    # Linear layer 0
    x = x.reshape((-1, 28 * 28))
    W0_data = np.random.normal(0, 0.02, (x.shape[-1], 32)).astype(np.float32)
    W0 = popxl.variable(W0_data, name="W0")
    b0_data = np.random.normal(0, 0.02, (32)).astype(np.float32)
    b0 = popxl.variable(b0_data, name="b0")

    # Linear layer 1
    W1_data = np.random.normal(0, 0.02, (32, 10)).astype(np.float32)
    W1 = popxl.variable(W1_data, name="W1")
    b1_data = np.random.normal(0, 0.02, (10)).astype(np.float32)
    b1 = popxl.variable(b1_data, name="b1")

    # Create graph to call for linear layer 0
    linear_0 = Linear()
    linear_graph_0 = ir.create_graph(linear_0, x, out_features=32)

    # Call the linear layer 0 graph
    fwd_call_info_0 = ops.call_with_info(
        linear_graph_0, x, inputs_dict={linear_0.W: W0, linear_0.b: b0}
    )
    # Output of linear layer 0
    x1 = fwd_call_info_0.outputs[0]

    # Create graph to call for linear layer 1
    linear_1 = Linear()
    linear_graph_1 = ir.create_graph(linear_1, x1, out_features=10)

    # Call the linear layer 1 graph
    fwd_call_info_1 = ops.call_with_info(
        linear_graph_1, x1, inputs_dict={linear_1.W: W1, linear_1.b: b1}
    )
    # Output of linear layer 1
    y = fwd_call_info_1.outputs[0]

    outputs = (x1, y)
    params = {"W0": W0, "W1": W1, "b0": b0, "b1": b1}
    linears = [linear_0, linear_1]
    fwd_call_infos = (fwd_call_info_0, fwd_call_info_1)

    return outputs, params, linears, fwd_call_infos

16.3.2. Calculate gradients and update weights

After creating the forward pass in the training IR, we will calculate the gradients in calculate_grads and update the weights and bias in update_weights_bias.

  • Calculate the loss and the initial gradient dy by using nll_loss_with_softmax_grad().

            # Calculate loss and initial gradients
            probs = ops.softmax(outputs[1], axis=-1)
            loss, dy = ops.nll_loss_with_softmax_grad(probs, labels)
    
  • Construct the graph to calculate the gradients for each layer, bwd_graph_info_0 and bwd_graph_info_1, by applying the :py:func:`~popxl.transforms.autodiff` transform (Section 10.1, Autodiff) to its forward-pass graph. Note that you only need the gradients for W0 and b0 in the first layer, but the gradients for all the inputs, x1, W1 and b1, in the second layer. In this example, you will see two different ways to use autodiff to get the required gradients.

    Let’s start from the second layer. bwd_graph_info_1, returned by autodiff for the second layer, contains the graph that calculates the gradients for that layer. The activations for this layer, activations_1, are obtained from the corresponding forward graph call. After calling the gradient graph, bwd_graph_info_1.graph, with popxl.ops.call_with_info, grads_1_call_info is used to get all the gradients with regard to the inputs x1, W1, and b1. The method fwd_parent_ins_to_grad_parent_outs gives a mapping from the corresponding forward graph inputs, x1, W1, and b1, to their gradients, grad_x_1, grad_w_1, and grad_b_1. The gradient input seeding this call is dy.

        # Obtain graph to calculate gradients from autodiff
        bwd_graph_info_1 = transforms.autodiff(fwd_call_infos[1].called_graph)

        # Get activations for layer 1 from forward call info
        activations_1 = bwd_graph_info_1.inputs_dict(fwd_call_infos[1])

        # Get the gradients dictionary by calling the gradient graphs with ops.call_with_info
        grads_1_call_info = ops.call_with_info(
            bwd_graph_info_1.graph, dy, inputs_dict=activations_1
        )
        # Find the corresponding gradient w.r.t. the input, weights and bias
        grads_1 = bwd_graph_info_1.fwd_parent_ins_to_grad_parent_outs(
            fwd_call_infos[1], grads_1_call_info
        )
        x1 = outputs[0]
        W1 = params["W1"]
        b1 = params["b1"]
        grad_x_1 = grads_1[x1]
        grad_w_1 = grads_1[W1]
        grad_b_1 = grads_1[b1]

    For the first layer, we could obtain the required gradients in the same way, but here we show an alternative approach. We specify the tensors that require gradients, grads_required=[linears[0].W, linears[0].b], when calling autodiff. Their gradients are then returned directly from the popxl.ops.call of the gradient graph bwd_graph_info_0.graph. The gradient input seeding this call is grad_x_1, the gradient w.r.t. the input of the second linear graph, which is the output of the first linear graph.

        # Use autodiff to obtain the graph that calculates the gradients, specifying which graph inputs need gradients
        bwd_graph_info_0 = transforms.autodiff(
            fwd_call_infos[0].called_graph, grads_required=[linears[0].W, linears[0].b]
        )
        # Get activations for layer 0 from forward call info
        activations_0 = bwd_graph_info_0.inputs_dict(fwd_call_infos[0])
        # Get the required gradients by calling the gradient graph with ops.call
        grad_w_0, grad_b_0 = ops.call(
            bwd_graph_info_0.graph, grad_x_1, inputs_dict=activations_0
        )
    
  • Update the weight and bias tensors with SGD by using scaled_add_(), which performs an in-place update v = v - lr * grad for each parameter.

    def update_weights_bias(opts, grads, params) -> None:
        """
        Update weights and bias by W += - lr * grads_w, b += - lr * grads_b.
        """
        for k, v in params.items():
            ops.scaled_add_(v, grads[k], b=-opts.lr)

16.4. Run the IR to train the model

After the IR is built for the given batch size, opts.batch_size, we can run it repeatedly for the required number of epochs. Each session is created from one IR, as shown in the following code:

        train_session = popxl.Session(train_ir, "ipu_model")
        with train_session:
            train(train_session, training_data, opts, input_streams, loss_stream)

The session is run nb_batches times per epoch. Each train_session.run consumes a batch of input images and labels, and streams the resulting loss values back to the host.

def train(train_session, training_data, opts, input_streams, loss_stream) -> None:
    nb_batches = len(training_data)
    for epoch in range(1, opts.epochs + 1):
        print(f"Epoch {epoch}/{opts.epochs}")
        bar = tqdm(training_data, total=nb_batches)
        for data, labels in bar:
            inputs: Mapping[popxl.HostToDeviceStream, np.ndarray] = dict(
                zip(
                    input_streams,
                    [data.squeeze().float().numpy(), labels.int().numpy()],
                )
            )

            outputs = train_session.run(inputs)
            loss = outputs[loss_stream]
            bar.set_description(f"Average loss: {np.mean(loss):.4f}")

After the training session has finished running, the trained tensor values are obtained with train_session.get_tensors_data, which returns a mapping from tensors to their values, trained_weights_data_dict.
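
For example, a minimal sketch of this step, assuming the trained variables are available in the params dictionary returned by create_network_fwd_graph:

# Read the trained values back from the IPU; the result maps each variable
# tensor (W0, b0, W1, b1) to a NumPy array holding its current value.
trained_weights_data_dict = train_session.get_tensors_data(params.values())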

16.5. Create an IR for testing and run the IR to test the model

To test the trained model, you need to create an IR for testing, test_ir, and its corresponding session, test_session, to run the test. The method write_variables_data is used to copy the trained values from trained_weights_data_dict to the corresponding tensors in the test IR, test_variables.

        # Build the ir for testing
        test_ir, test_input_streams, out_stream, test_variables = build_test_ir(opts)
        test_session = popxl.Session(test_ir, "ipu_model")
        # Get test variable values from trained weights
        test_weights_data_dict = get_test_var_values(
            test_variables, trained_weights_data_dict
        )
        # Copy trained weights to the test ir
        test_session.write_variables_data(test_weights_data_dict)
        with test_session:
            test(test_session, test_data, test_input_streams, out_stream)
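
The test and get_test_var_values helpers are not shown here. A hypothetical sketch of the test loop, assuming test_input_streams holds only the image stream and out_stream carries the network output for each batch (the example's actual test function may differ):

def test(test_session, test_data, test_input_streams, out_stream) -> None:
    # Hypothetical accuracy loop over the test DataLoader.
    nr_correct = 0
    nr_total = 0
    for data, labels in test_data:
        inputs = dict(zip(test_input_streams, [data.squeeze().float().numpy()]))
        output = test_session.run(inputs)[out_stream]
        # The predicted class is the index of the largest output value.
        nr_correct += (output.argmax(axis=-1) == labels.numpy()).sum()
        nr_total += labels.shape[0]
    print(f"Test accuracy: {nr_correct / nr_total:.2%}")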