
# 2.4. Half and mixed precision in PopTorch

This tutorial shows how to use half and mixed precision in PopTorch with the example task of training a simple CNN model on a single Graphcore IPU (Mk1 or Mk2).

Before starting this tutorial, we recommend that you read through our tutorial on the basics of PyTorch on the IPU and our MNIST starting tutorial.

Requirements:

- A Poplar SDK environment enabled (see the Getting Started guide for your IPU system)
- Other Python modules: `python -m pip install -r requirements.txt`

To run the Jupyter notebook version of this tutorial:

1. Enable a Poplar SDK environment and install the required packages with `python -m pip install -r requirements.txt`.
2. In the same environment, install the Jupyter notebook server: `python -m pip install jupyter`.
3. Launch a Jupyter server on a specific port: `jupyter-notebook --no-browser --port <port number>`.
4. Connect via SSH to your remote machine, forwarding your chosen port: `ssh -NL <port number>:localhost:<port number> <your username>@<remote machine>`.

For more details about this process, or if you need troubleshooting, see our guide on using IPUs from Jupyter notebooks.

## General

### Motives for half precision

Data is stored in memory, and some formats to store that data require less memory than others. In a device’s memory, when it comes to numerical data, we use either integers or real numbers. Real numbers are represented by one of several floating point formats, which vary in how many bits they use to represent each number. Using more bits allows for greater precision and a wider range of representable numbers, whereas using fewer bits allows for faster calculations and reduces memory and power usage.

In deep learning applications, where less precise calculations are acceptable and throughput is critical, using a lower precision format can provide substantial gains in performance.

The Graphcore IPU provides native support for two floating-point formats:

- IEEE single-precision, which uses 32 bits for each number (FP32)
- IEEE half-precision, which uses 16 bits for each number (FP16)

Some applications which use FP16 do all calculations in FP16, whereas others
use a mix of FP16 and FP32. The latter approach is known as *mixed precision*.

In this tutorial, we are going to talk about real numbers represented in FP32 and FP16, and how to use these data types (dtypes) in PopTorch in order to reduce the memory requirements of a model.
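As a rough illustration of the trade-off between the two formats, the sketch below uses only Python's standard `struct` module (not PopTorch) to round a value to FP16 and FP32 and to compare the storage cost. The helper names `to_fp16` and `to_fp32` are introduced here for illustration only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage cost per number, in bytes.
fp16_bytes = struct.calcsize("<e")
fp32_bytes = struct.calcsize("<f")

# Rounding error introduced when storing 0.1 in each format.
value = 0.1
fp16_error = abs(to_fp16(value) - value)
fp32_error = abs(to_fp32(value) - value)

print(f"FP16: {fp16_bytes} bytes, stores 0.1 as {to_fp16(value)!r}")
print(f"FP32: {fp32_bytes} bytes, stores 0.1 as {to_fp32(value)!r}")
print(f"FP16 rounding error is {fp16_error / fp32_error:.0f}x larger")
```

FP16 halves the storage per number, at the cost of a much larger rounding error for the same value.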

### Numerical stability

Numerical stability refers to how a model’s performance is affected by the use of a lower-precision dtype. We say an operation is “numerically unstable” in FP16 if running it in this dtype causes the model to have worse accuracy compared to running the operation in FP32. Two techniques that can be used to increase the numerical stability of a model are loss scaling and stochastic rounding.

#### Loss scaling

A numerical issue that can occur when training a model in half-precision is that the gradients can underflow. This can be difficult to debug because the model will simply appear to not be training, and can be especially damaging because any gradients which underflow will propagate a value of 0 backwards to other gradient calculations.

The standard solution to this is known as *loss scaling*, which consists of
scaling up the loss value right before the start of backpropagation to prevent
numerical underflow of the gradients. Instructions on how to use loss scaling
will be discussed later in this tutorial.
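The effect can be sketched in plain Python, using the standard `struct` module to emulate FP16 rounding. The gradient value is hypothetical, and the sketch only illustrates the arithmetic; it does not show how PopTorch applies the factor internally.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


true_gradient = 1e-8   # hypothetical gradient, too small for FP16
loss_scaling = 1024.0  # illustrative scaling factor

# Without loss scaling, the gradient underflows to zero in FP16.
unscaled = to_fp16(true_gradient)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which moves this value into the representable FP16 range.
scaled = to_fp16(true_gradient * loss_scaling)

# The scaling is undone before the weight update; here the division is
# performed in higher precision.
recovered = scaled / loss_scaling

print(f"unscaled FP16 gradient: {unscaled}")
print(f"scaled FP16 gradient:   {scaled}")
print(f"recovered gradient:     {recovered}")
```

The unscaled gradient is lost entirely, whereas the scaled gradient survives the trip through FP16 and can be recovered to within the format's rounding error.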

#### Stochastic rounding

When training in half or mixed precision, numbers multiplied by each other will need to be rounded in order to fit into the floating point format used. Stochastic rounding is the process of using a probabilistic equation for the rounding. Instead of always rounding to the nearest representable number, we round up or down with a probability such that the expected value after rounding is equal to the value before rounding. Since the expected value of an addition after rounding is equal to the exact result of the addition, the expected value of a sum is also its exact value.

This means that on average, the values of the parameters of a network will be close to the values they would have had if a higher-precision format had been used. The added bonus of using stochastic rounding is that the parameters can be stored in FP16, which means the parameters can be stored using half as much memory. This can be especially helpful when training with small batch sizes, where the parameters account for a larger share of the total memory used than they do with large batch sizes.

It is highly recommended that you enable this feature when training neural networks with FP16 weights. The instructions to enable it in PopTorch are presented later in this tutorial.
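To see why this matters for FP16 weights, the following pure-Python sketch emulates both rounding modes in software. It is an illustration under simplifying assumptions (positive, finite values only), not a model of the IPU hardware; the helper names are hypothetical. A weight of 1.0 repeatedly receives an update of 1e-4, which is smaller than half the FP16 spacing near 1.0.

```python
import random
import struct


def fp16_nearest(x: float) -> float:
    """Round to the nearest FP16 value (round-to-nearest-even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_neighbours(x: float):
    """Return the FP16 values at-or-below and above a positive finite x."""
    nearest = fp16_nearest(x)
    bits = struct.unpack("<H", struct.pack("<e", nearest))[0]
    if nearest > x:
        bits -= 1  # step down to the neighbour below x
    lower = struct.unpack("<e", struct.pack("<H", bits))[0]
    upper = struct.unpack("<e", struct.pack("<H", bits + 1))[0]
    return lower, upper


def fp16_stochastic(x: float, rng: random.Random) -> float:
    """Round up with probability proportional to the distance from `lower`."""
    lower, upper = fp16_neighbours(x)
    if lower == x:
        return x  # already representable
    p_up = (x - lower) / (upper - lower)
    return upper if rng.random() < p_up else lower


rng = random.Random(0)
update = 1e-4
w_nearest = 1.0
w_stochastic = 1.0
for _ in range(1000):
    w_nearest = fp16_nearest(w_nearest + update)
    w_stochastic = fp16_stochastic(w_stochastic + update, rng)

print(f"exact result:         {1.0 + 1000 * update}")
print(f"round-to-nearest:     {w_nearest}")
print(f"stochastic rounding:  {w_stochastic}")
```

With round-to-nearest, every update is rounded away and the weight never moves. With stochastic rounding, the weight ends up close to the exact result, because each update is applied in expectation.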

## Train a model in half precision

### Import the packages

```
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import poptorch
from tqdm import tqdm
```

### Build the model

We use the same model as in the other tutorials on PopTorch. Just like in the tutorial on efficient data loading, we are using larger images (128x128) to simulate a heavier data load. This will make the difference in memory between FP32 and FP16 meaningful enough to showcase in this tutorial.

```
class CustomModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 5, 3)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(5, 12, 5)
        self.norm = nn.GroupNorm(3, 12)
        self.fc1 = nn.Linear(41772, 100)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(100, 10)
        self.log_softmax = nn.LogSoftmax(dim=0)
        self.loss = nn.NLLLoss()

    def forward(self, x, labels=None):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.norm(self.relu(self.conv2(x)))
        x = torch.flatten(x, start_dim=1)
        x = self.relu(self.fc1(x))
        x = self.log_softmax(self.fc2(x))
        # The model is responsible for the calculation
        # of the loss when using an IPU. We do it this way:
        if self.training:
            return x, self.loss(x, labels)
        return x
```

NOTE: The model inherits `self.training` from `torch.nn.Module`, which initialises its value to True. Use `model.eval()` to set it to False and `model.train()` to switch it back to True.
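The input size of `fc1` (41772) follows from the 128x128 input images and the layer settings of the model. A small sketch of that arithmetic, assuming stride-1 convolutions without padding (the `nn.Conv2d` defaults); the helper functions are written for this illustration:

```python
def conv_out(size: int, kernel: int) -> int:
    """Output size of a convolution with stride 1 and no padding."""
    return size - kernel + 1


def pool_out(size: int, kernel: int, stride: int) -> int:
    """Output size of a max-pool layer without padding."""
    return (size - kernel) // stride + 1


size = 128                   # input images are 128x128
size = conv_out(size, 3)     # conv1: 3x3 kernel -> 126
size = pool_out(size, 2, 2)  # pool: 2x2 window, stride 2 -> 63
size = conv_out(size, 5)     # conv2: 5x5 kernel -> 59
channels = 12                # conv2 output channels

flattened = channels * size * size
print(flattened)  # 41772
```

If you change the image size in the data pipeline, the first argument of `fc1` has to be updated accordingly.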

### Choose parameters

NOTE: If you wish to modify these parameters for educational purposes, make sure you re-run all the cells below this one, including this entire cell as well:

```
# Cast the model parameters and data to FP16
execution_half = True
# Cast the gradient accumulation type of the optimiser to FP16
optimizer_half = True
# Use stochastic rounding
stochastic_rounding = True
# Set partials data type to FP16
partials_half = False
```

#### Casting a model’s parameters

The default data type of the parameters of a PyTorch module is FP32
(`torch.float32`). To convert all the parameters of a model to be represented
in FP16 (`torch.float16`), an operation we will call *downcasting*, we simply
do:

```
model = CustomModel()
if execution_half:
    model = model.half()
```

For this tutorial, we will cast all the model’s parameters to FP16.
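To get a feel for what downcasting saves, the sketch below counts the parameters of `CustomModel` from its layer shapes and converts the count to bytes. This is a back-of-the-envelope estimate for the parameters only; activations, optimiser state and code also occupy memory on the IPU and are not included.

```python
# (name, number of weights, number of biases) for each layer of CustomModel
layers = [
    ("conv1", 1 * 5 * 3 * 3, 5),
    ("conv2", 5 * 12 * 5 * 5, 12),
    ("norm", 12, 12),  # GroupNorm: one scale and one shift per channel
    ("fc1", 41772 * 100, 100),
    ("fc2", 100 * 10, 10),
]

num_params = sum(weights + biases for _, weights, biases in layers)
fp32_bytes = num_params * 4  # 4 bytes per FP32 value
fp16_bytes = num_params * 2  # 2 bytes per FP16 value

print(f"parameters: {num_params:,}")
print(f"FP32: {fp32_bytes / 2**20:.1f} MiB")
print(f"FP16: {fp16_bytes / 2**20:.1f} MiB")
```

Almost all of the parameters belong to `fc1`, so that layer dominates the saving.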

#### Casting a single layer’s parameters

For bigger or more complex models, downcasting all the layers may generate
numerical instabilities and cause underflows. While PopTorch and the IPU
offer features to alleviate those issues, it is still sensible for those
models to cast only the parameters of certain layers and observe how it
affects the overall training job. To downcast the parameters of a single
layer, we select the layer by its *name* and use `half()`:

```
model.conv1 = model.conv1.half()
```

If you would like to upcast a layer instead, you can use `model.conv1.float()`.

NOTE: One can print out a list of the components of a PyTorch model, with their names, by doing `print(model)`.

### Prepare the data

We will use the FashionMNIST dataset that we download from `torchvision`. The
last stage of the pipeline will have to convert the data type of the tensors
representing the images to `torch.half` (equivalent to `torch.float16`) so that
our input data is also in FP16. This has the advantage of reducing the
bandwidth needed between the host and the IPU.

```
transform_list = [
    transforms.Resize(128),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
]
if execution_half:
    transform_list.append(transforms.ConvertImageDtype(torch.half))
transform = transforms.Compose(transform_list)
```

Pull the datasets if they are not available locally:

```
train_dataset = torchvision.datasets.FashionMNIST(
    "~/.torch/datasets", transform=transform, download=True, train=True
)
test_dataset = torchvision.datasets.FashionMNIST(
    "~/.torch/datasets", transform=transform, download=True, train=False
)
```

### Optimizers and loss scaling

The value of the loss scaling factor can be passed as a parameter to the
optimisers in `poptorch.optim`. In this tutorial, we will set it to `1024` for
an AdamW optimizer. For all optimisers (except `poptorch.optim.SGD`),
using a model in FP16 requires the argument `accum_type` to be set to
`torch.float16` as well:

```
accum, loss_scaling = (torch.float16, 1024) if optimizer_half else (torch.float32, None)
optimizer = poptorch.optim.AdamW(
    params=model.parameters(), lr=0.001, accum_type=accum, loss_scaling=loss_scaling
)
```

While higher values of `loss_scaling` minimize underflows, values that are
too high can also generate overflows as well as hurt convergence of the loss.
The optimal value depends on the model and the training job. This is therefore
a hyperparameter for you to tune.
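The trade-off can be sketched in plain Python by emulating FP16 rounding with the standard `struct` module. The gradient magnitudes are hypothetical and the helper functions are written for this illustration; real gradient distributions depend on your model and data.

```python
import math
import struct

FP16_OVERFLOW = 65520.0  # values at or above this round to infinity in FP16


def to_fp16(x: float) -> float:
    """Round to FP16, mapping out-of-range values to +/-inf."""
    if abs(x) >= FP16_OVERFLOW:
        return math.copysign(math.inf, x)
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Hypothetical gradient magnitudes, from very small to fairly large.
gradients = [1e-8, 1e-6, 1e-3, 1.0, 50.0]


def count_failures(loss_scaling: float):
    """Return (underflows, overflows) after scaling and rounding to FP16."""
    scaled = [to_fp16(g * loss_scaling) for g in gradients]
    underflows = sum(1 for s in scaled if s == 0.0)
    overflows = sum(1 for s in scaled if math.isinf(s))
    return underflows, overflows


for scale in (1.0, 1024.0, 2.0**20):
    underflows, overflows = count_failures(scale)
    print(f"loss_scaling={scale:>9.0f}: {underflows} underflow(s), {overflows} overflow(s)")
```

For these example values, no scaling loses the smallest gradient, a factor of 1024 keeps all of them representable, and a factor of 2^20 pushes the largest ones to infinity.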

### Set PopTorch’s options

To configure some features of the IPU and to be able to use PopTorch’s classes
in the next sections, we will need to create an instance of `poptorch.Options`
which stores the options we will be using. We covered some of the available
options in the introductory tutorial for PopTorch.

Let’s initialise our options object before we talk about the options we will use:

```
opts = poptorch.Options()
```

NOTE: This tutorial has been designed to be run on a single IPU. If you do not have access to an IPU, you can use the option `useIpuModel` to run a simulation on CPU instead. You can read more about the IPU Model and its limitations in the Graphcore documentation.

#### Stochastic rounding on IPU

With the IPU, stochastic rounding is implemented directly in the hardware and
only requires you to enable it. To do so, there is the option
`enableStochasticRounding` in the `Precision` namespace of `poptorch.Options`.
This namespace holds other options for using mixed precision that we will talk
about. To enable stochastic rounding, we do:

```
if stochastic_rounding:
    opts.Precision.enableStochasticRounding(True)
```

With the IPU Model, this option won’t change anything, since stochastic rounding is implemented in the IPU hardware.

#### Partials data type

Matrix multiplications and convolutions have intermediate states we
call *partials*. Those partials can be stored in FP32 or FP16. There is
a memory benefit to using FP16 partials but the main benefit is that it can
increase the throughput for some models without affecting accuracy. However,
there is a risk of increasing numerical instability if the values being
multiplied are small, due to underflows. The default data type of partials is
the input’s data type (FP16). For this tutorial, we set partials to FP32 just
to showcase how it can be done. We use the option `setPartialsType` to do it:

```
if partials_half:
    opts.Precision.setPartialsType(torch.half)
else:
    opts.Precision.setPartialsType(torch.float)
```

Further information on the Partials Type setting can be found in our memory and performance optimisation guide.
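The underflow risk with small values can be illustrated with a software emulation of a dot product. This is a loose sketch that uses Python's `struct` module to round intermediate results; it does not reflect how the IPU actually computes partials, and the input values are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, partials):
    """Dot product of FP16 inputs, rounding every partial with `partials`."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = partials(acc + partials(x * y))
    return to_fp16(acc)  # the final result is stored in FP16 either way


n = 4096
a = [to_fp16(1e-3)] * n  # small activations
b = [to_fp16(1e-5)] * n  # small weights

fp16_result = dot(a, b, to_fp16)  # FP16 partials
fp32_result = dot(a, b, to_fp32)  # FP32 partials

print(f"FP16 partials: {fp16_result}")
print(f"FP32 partials: {fp32_result}")
```

With FP16 partials, every individual product underflows to zero, so the sum is zero. With FP32 partials, the products are kept and the final sum is large enough to be stored in FP16.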

### Train the model

We can now train the model. After we have set all our options, we reuse
our `poptorch.Options` instance for the training `poptorch.DataLoader`
that we will be using:

```
train_dataloader = poptorch.DataLoader(
    opts, train_dataset, batch_size=12, shuffle=True, num_workers=40
)
```

We first make sure our model is in training mode, and then wrap it
with `poptorch.trainingModel`:

```
model.train() # Switch the model to training mode
poptorch_model = poptorch.trainingModel(model, options=opts, optimizer=optimizer)
```

Let’s run the training loop for 10 epochs:

```
epochs = 10
for epoch in tqdm(range(epochs), desc="epochs"):
    total_loss = 0.0
    for data, labels in tqdm(train_dataloader, desc="batches", leave=False):
        output, loss = poptorch_model(data, labels)
        total_loss += loss
```

… and release IPU resources:

```
poptorch_model.detachFromDevice()
```

Our new model is now trained and we can start its evaluation.

### Evaluate the model

Some PyTorch operations, such as convolutions, are not supported in FP16 on the CPU,
so we will evaluate our trained model in mixed precision on an IPU
using `poptorch.inferenceModel`:

```
model.eval() # Switch the model to inference mode
poptorch_model_inf = poptorch.inferenceModel(model, options=opts)
test_dataloader = poptorch.DataLoader(opts, test_dataset, batch_size=32, num_workers=40)
```

Run inference on the labelled data:

```
predictions, labels = [], []
for data, label in test_dataloader:
    predictions += poptorch_model_inf(data).data.float().max(dim=1).indices
    labels += label
```

Release resources:

```
poptorch_model_inf.detachFromDevice()
```

We obtained an accuracy of approximately 84% on the test dataset.

```
print(
f"""Eval accuracy on IPU: {100 *
(1 - torch.count_nonzero(torch.sub(torch.tensor(labels),
torch.tensor(predictions))) / len(labels)):.2f}%"""
)
```
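The expression in the `print` call counts the positions where the label and the prediction differ (a non-zero difference) and subtracts that fraction from 1. An equivalent, more explicit formulation in plain Python is sketched below; the `accuracy` helper and the example values are hypothetical and only serve to illustrate the calculation.

```python
def accuracy(predictions, labels):
    """Return the percentage of predictions equal to their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Hypothetical class indices, for illustration only.
example_predictions = [9, 2, 1, 1, 6]
example_labels = [9, 2, 1, 1, 0]

print(f"{accuracy(example_predictions, example_labels):.2f}%")  # 80.00%
```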

## Visualise the memory footprint

We can visually compare the memory footprint on the IPU of the model trained in FP16 and FP32, thanks to Graphcore’s PopVision Graph Analyser.

We generated memory reports of the same training session as covered in this
tutorial for both cases: with and without downcasting the model with
`model.half()`. Here is the figure of both memory footprints, where “source”
and “target” represent the model trained in FP16 and FP32 respectively:

We observed a ~26% reduction in memory usage with the settings of this tutorial, including from peak to peak. The impact on the accuracy was also small, with less than 1% lost!

## Debug floating-point exceptions

Floating-point issues can be difficult to debug because the model will simply
appear to not be training without specific information about what went wrong.
For more detailed information on the issue, we can set
`debug.floatPointOpException` to true in the environment variable
`POPLAR_ENGINE_OPTIONS`. To set this, you can add the following before
the command you use to run your model:

```
POPLAR_ENGINE_OPTIONS='{"debug.floatPointOpException": "true"}'
```
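When working in a notebook, it can be more convenient to set the variable from Python. The sketch below builds the same JSON string with the standard library. It assumes that the assignment runs before the model is compiled and executed, and note that it overwrites any value already stored in `POPLAR_ENGINE_OPTIONS`.

```python
import json
import os

# Serialise the engine options to the JSON string expected in the variable.
engine_options = {"debug.floatPointOpException": "true"}
os.environ["POPLAR_ENGINE_OPTIONS"] = json.dumps(engine_options)

print(os.environ["POPLAR_ENGINE_OPTIONS"])
```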

## Summary

- Use half and mixed precision when you need to save memory on the IPU.
- You can cast a PyTorch model to FP16 with `model.half()`, or a specific layer with `model.layer.half()`.
- Several features are available in PopTorch to improve the numerical stability of a model in FP16:
  - Loss scaling: `poptorch.optim.SGD(..., loss_scaling=1000)`
  - Stochastic rounding: `opts.Precision.enableStochasticRounding(True)`
  - Upcast partials data types: `opts.Precision.setPartialsType(torch.float)`
- The PopVision Graph Analyser can be used to inspect the memory usage of a model and to help debug issues.

Generated:2022-11-22T13:37 Source:walkthrough.py SDK:3.1.0-EA.1+1183 SST:0.0.9