# 1. Overview

The goal of this document is to help Graphcore AI engineers and customers to develop high-performance machine learning models running on the IPU. This document will cover the general topic of optimising performance.

There are many factors that affect model performance, and this document will cover the following:

• Memory optimisation

• Execution schemes

• Optimisations specific to the IPU and Poplar

• Scaling an application over multiple replicas

Although this document focuses on performance it is worth bearing in mind that numerical stability and convergence properties may limit the design options when trying to optimise a model for performance.

Implementation details will not be covered in this document, but you can refer to the framework-specific documentation:

# 2. Understanding the IPU programming model

Before optimising performance on the IPU, it is necessary to understand the IPU programming model. This section summarises the core concepts of the IPU programming model, further details can be found in the IPU Programmer’s Guide.

## 2.1. Core concepts for IPU programming

Programming the IPU is determined by the features of the IPU hardware and the software used to develop the machine learning models.

The IPU is organised in multiple processing cores called tiles. Each tile can be viewed as an independent processor unit, which executes a tile-specific program and has access to local SRAM in the tile (called In-Processor-Memory). All tiles are connected to the exchange fabric, which is the communication network used to transfer data between tiles within an IPU. This exchange can be further distributed over multiple IPUs. The communication bandwidth during exchange is particularly fast, about 47 TB/s within an IPU and 8 TB/s between IPUs.

The IPU is programmed through software abstractions provided by the Poplar Graph Programming framework.

Two core concepts for programming IPUs are:

• The bulk-synchronous parallel model of execution. This decomposes the runtime execution into three phases: local compute, global synchronisation, and data exchange.

• The graph representation of computations. The Poplar graph programming framework operates on a computational graph where vertices represent operations and edges represent the input and output data.

## 2.2. Main differences from GPU programming

For engineers coming from a GPU programming background, the differences in IPU programming are:

• The graph is compiled statically. This means that dynamic tensor access is not easily performed and comes with a memory cost.

• Model parallelism (such as sharding a model) is often required for large-scale models.

• The IPU programmer is more involved with controlling how the IPU memory is used, for example, partitioned and allocated.

• Because the graph and tensor layouts are static, the code required to perform communication during the exchange phase grows as the total number of tensors increases.

## 2.3. Factors affecting model performance

Model performance is driven by:

• Efficient use of compute cycles: the aim is to maximise the use of the compute capabilities of the IPU.

• Efficient use of memory: the aim is to minimise the total memory required and to maximise the bandwidth at which the used memory is accessed.

• Efficient use of communication with the host: the aim is to maximise the bandwidth and minimise the latency of the data transfers between the host and the IPUs.

## 2.4. PopVision Tools

Profiling the compiled program and its execution can provide insights into the possible memory, compute cycles and communications optimisations. The PopVision Graph Analyser helps with understanding how the graph is distributed and run on the IPU. The PopVision System Analyser breaks down all the processes around the processing on the IPU such as graph compilation or host-IPU communications.

# 3. Mapping a model to an IPU system

This section describes how to map a model to the Graphcore IPU hardware and the Poplar graph programming framework.

At this stage, we are not considering any optimisation that could improve the data memory use, such as recomputation or offloading. We are simply interested in assessing how large the model is and how many IPUs may be required to make it fit.

This section describes:

• How a machine learning algorithm is mapped onto a Poplar graph

• How Poplar uses IPU memory at runtime

• How to estimate the memory use of the model tensors at the design stage

## 3.1. Computational graph of ML model training

Most deep learning models are trained using backpropagation and can be represented by a computational graph similar to that in Fig. 3.1.

Fig. 3.1 Computational graph of training a simple network using backpropagation. op is a forward-pass operation, bwd: the corresponding backward pass operation and wu: the weight update operation.

The mapping of the computational graph onto the IPUs is left to the AI engineer and is critical for achieving performance targets. Note that this exercise may be constrained by the capabilities of a given ML framework. For example, PopART offers more flexibility with partitioning the graph into low-level operations than high-level frameworks such as TensorFlow or PyTorch.

## 3.2. Understanding the memory mapping of a computational graph

The Poplar compiler takes the computational graph and creates an executable for running on IPU systems. The IPU memory used by the computational graph is allocated when Poplar compiles the graph. This memory is composed of:

• code (for example, linear algebra operations)

• input and output tensors

• code and buffers supporting data exchanges

### 3.2.1. Machine-learning models use of memory

Before porting a model to an IPU system, it can sometimes be beneficial to estimate the total memory required by a given machine learning model. The total memory required can be decomposed into:

• Data and variables of the model such as

• trainable parameters (for example, weights and biases)

• non-trainable parameters (for example, the mean and variance statistics of the local batch in the normalisation layers)

• activations (outputs of a layer)

• optimiser state variables (for example, the first two moments in the Adam optimisers)

• gradients with respect to the weights

• Program code required to calculate:

• the model output during inference or during the forward pass

• the losses and gradients during the backward pass

• the weight updates, which depend on the optimiser used

In addition to data and program code, Poplar also allocates memory to store the code and data buffers used during the exchange phase. This memory is used during communication between different entities, such as compute sets, tiles and IPUs.

### 3.2.2. IPU memory used by a Poplar computational graph

The memory used by a Poplar graph is composed of:

• Tensor variables, which contain the data at the input and output of a vertex, such as weights, optimiser states, and weight update values;

• Vertex code, which is generally the code performing numerical operations;

• Vertex state memory, used by variables local to a vertex instance;

• Exchange code, which contains the code needed during the exchange phase

• Exchange buffers, which contain data used during the exchange phase.

The exchange code and buffers exist for three types of exchanges:

• Internal exchange for exchanges between tiles on a same IPU

• Inter-IPU (or global) exchange for exchanges between IPUs

• Host exchange for exchanges between the IPUs and the host

First-in first-out (FIFO) queues on the IPU also exist to enqueue and dequeue data coming from the host.

## 3.3. Always-live and not-always-live memory

The memory used by the data is allocated by the Poplar compiler at compile time. The compiler tries to re-use memory by considering memory use over the entire duration of the graph execution. For example, two variables may share the same memory addresses if the lifetime of each variable does not overlap.

Memory that is allocated to a specific entity (for example, compute set code, variables and constants) for the entire lifetime of the graph execution is called always-live memory.

Similarly, the data for not always live variables are only needed for some program steps. If two variables are not live at the same time, they can be allocated to the same location, thus saving memory. This memory is referred to as not-always-live memory.

It is the combination of always-live and not-always-live memory requirements for a model that determine the total memory use and therefore affect performance.

## 3.4. Tensor variables memory use

### 3.4.1. Number of model parameters

At this stage, it is worth doing some back-of-the-envelope calculations to identify which elements of the models (such as convolution kernels, dense layers, embedding tables) may be the largest contributors to the data memory use.

The size of these blocks can be related to some model hyperparameters, such as sequence lengths, image size, number of channels, embeddings table size, and hidden representation size.

Table 3.1 Example summarising elements of the model and how they contribute to memory use

Layer

Inputs

Number of parameters

Convolution layer

Kernel size $$(h,w)$$, $$N$$ filters, $$C$$ channels

$$(w \cdot h \cdot C + 1) N$$ parameters (weights and biases).

Dense layer

Input size $$n$$, output size $$p$$

$$n \cdot p$$ weights and $$p$$ biases.

Batch norm

Tensor size $$n$$ on normalisation axis

(In a convolution layer, normalisation is done along the filters.)

$$2n$$ moving average parameters (training and inference) plus $$2n$$ scale and location parameters ($$\beta$$ and $$\gamma$$ training only)

Layer norm

Input length $$n$$

$$2n$$ parameters

Embeddings table

Vocabulary size $$N$$, hidden size $$h$$

$$N \cdot h$$ parameters

Note that normalisation layers contain parameters that are “non-trainable”. Typically, during training, these are based on the statistics of a micro-batch, and during inference these are re-used from the training stage. These always reside in memory but there will be no optimisation parameters associated with them.

### 3.4.2. Number of activations

During inference, the outputs of the activation functions (referred to as activations) do not need to be stored long-term, as they are typically only used as inputs to the next layer. Note: this is not always the case and some models, in particular image segmentation models, need to pass the activations between multiple layers during inference. If these layers reside on different IPUs, the activations will need to be stored and exchanged, generally increasing both data, code and exchange memory requirements.

The memory use of activations during training is considered in the following sections.

### 3.4.3. Number of optimiser states (training only)

During training, the weight update step of the optimiser may rely on state variables. For example, the Adam and LAMB optimisers need to store the first ($$m$$) and second ($$v$$) moments between weight updates. These optimiser states will need to be stored in always-live memory and scale linearly with the number of trainable parameters in the model.

### 3.4.4. Number of backpropagation variables (training only)

During training via backpropagation, you need to make the activations available in the backward pass in order to calculate the gradients with respect to the activations and weights. As a first approximation, we should consider that all the activations are stored in always-live memory, although in practice recomputation of activations (detailed in Section 5.3, Activation recomputations) is often used.

The number of activations required (either for storage or computation) scales linearly with the compute batch size.

The gradients (with respect to the weights) also need to be stored during training. The parameters of every trainable model have associated gradients, and these need to be stored in memory.

Therefore, the memory needed increases by:

• $$N_{gradients} = N_{activations} \times B_{size}$$

• $$N_{backprop\_ vars} = N_{weights} + N_{biases} + N_{non\_ trainable} + N_{optimiser\_ states}$$

where $$N_{gradients}$$ is the number of gradients, $$N_{activations}$$ is the number of activations, $$B_{size}$$ is the compute batch size, $$N_{backprop\_ vars}$$ is the number of backpropagation variables, $$N_{weights}$$ is the number of weights, $$N_{biases}$$ is the number of biases, $$N_{non\_ trainable}$$ is the number of non-trainable parameters and $$N_{optimiser\_ states}$$ is the number of optimiser states.

### 3.4.5. Determine total memory used by variables

The total amount of memory required for all the model parameters is thus:

• In inference: $$(N_{weights} + N_{biases} + N_{non\_ trainable}) \times N_{FP}$$

• In training: $$(N_{weights} + N_{biases} + N_{gradients} + N_{optimiser\_ states} + N_{backprop\_ vars}) \times N_{FP}$$

where $$N_{weights}$$ is the number of weights, $$N_{biases}$$ is the number of biases, $$N_{non\_ trainable}$$ is the number of non-trainable parameters, $$N_{gradients}$$ is the number of gradients, $$N_{optimiser\_ states}$$ is the number of optimiser states, $$N_{backprop\_ vars}$$ is the number of variables required to store the gradients with respect to the weights during backpropagations and $$N_{FP}$$ is the number of bytes used by each variable and depends on the numerical precision (half or single) chosen.

## 3.5. Vertex code and exchange memory use

Accurate estimation of the exchange code size is very difficult, if not impossible, at the design stage of a model. The internal exchange code is statically compiled, so dynamic access to tensors is not well optimised. It is safer to consider that the entire tensor will be allocated instead of slices of tensors.

The exchange code is always-live in memory.

Predicting the memory used by the exchange code and vertex code is dependent on the model and on the Poplar compiler optimisations. There is no general approach to determine this memory use. The best way to estimate the proportion of the memory used by the vertex code and exchange is to start with a small model and vary parameters such as micro-batch size, number of hidden layers and number of replicas. This will give an estimate of how the memory use grows with the model size and the proportion of the memory used for code versus data (Section 3.4, Tensor variables memory use).

Some rules of thumb to minimise the exchange code size are:

• Lots of small tensors are less efficient than fewer large ones.

• The tensor layouts may matter, since the tensors may need to be “re-arranged” during the exchange, which increases the temporary memory use.

# 4. Optimising for performance

This section describes the various factors that affect ML model performance and under what conditions they are applicable.

## 4.1. Memory

Generally, more efficient memory use will translate into performance improvements during training and inference. Better optimised memory means larger batch sizes can be used, thus improving the throughput in both inference and training. Large operations that need to be serialised over multiple steps, for example, matmul and convolutions, can be executed in fewer cycles when more temporary memory is available. When more In-Processor-Memory is available, this reduces the need to access the slower streaming memory.

Section 5, Common memory optimisations describes the methods used in memory optimisation in detail.

## 4.2. Pipeline execution scheme

Good for Training large models Training with large global batch sizes Low-latency inference

The most common execution scheme for large models on the IPU is pipelining. This is a form of model parallelism, where a large model is split into multiple stages running on multiple IPUs. Each IPU contains a portion of the model and the IPU utilisation is maximised by running a pipeline of data. Once a pipeline is established there is always more than one micro-batch “in flight”.

The pipeline efficiency is bound by the longest stage executed at any one time. There is thus no point in optimising a pipeline stage if its length is dominated by a longer one. Performance gains can also be achieved by “breaking up” the longest stage and re-balancing it over all pipeline stages.

Pipelining works best when the gradient accumulation factor is much greater than the number of stages, since $$2 N$$ stages are always under-utilised during the ramp up and ramp down phases of the pipeline.

During training, pipelining provides two options for the arrangement of the forward and backward passes:

• grouped

• interleaved.

Note these pipeline schedule options may not be available in all frameworks.

Fig. 4.1 Grouped pipeline schedule.

Grouped scheduling (used by default, Fig. 4.1) is generally faster but requires more memory to stash the activations than the interleaved scheduling.

Grouped scheduling requires storing:

• $$2 N+1$$ activations stored in FIFOs on the first IPU

• $$2 (N-1)+1$$ activations stored in FIFOs on the second IPU

• and so on.

Fig. 4.2 Interleaved pipeline schedule.

Conversely the interleaved scheduling (Fig. 4.2) uses less memory to store the activations but may be slower, since a backward pass, generally longer than the forward pass, is always executed at every stage of a pipeline.

Interleaved scheduling requires storing $$N+1$$ FIFOs of activations.

Pipelining generally comes with recomputation of activations, which is described in Section 5.3, Activation recomputations.

Available resources for pipelined execution:

## 4.3. Data parallelism

In cases where the model is small enough, but the dataset is large, the concept of data parallelism may be applied. This means that the same model is loaded onto each IPU, but the data is split up between IPUs. Data parallelism is the most common way to perform distributed training.

Good for Small-ish models (fit on few IPUs) Heavy host IO and preprocessing

### 4.3.1. Graph replication

Data parallelism is achieved by replication of the Poplar graph.

Each replica operates on a micro batch. During inference, the results from each replica are enqueued in the Poplar outfeed queue and sent to the host.

During training via backpropagation, the gradients {$$g_{i}^{n}\}$$ are calculated independently on each replica $$n$$ and then reduced across all replicas before performing the weight update.

Fig. 4.3 Calculation of gradients across replicas and weight updates during training.

Note the weights are not explicitly shared yet they remain consistent ($$w_{i}^{0} = w_{i}^{1} = \ldots = w_{i}^{N}$$) across all replicas by sharing the same weight updates. Therefore, the weights must be initialised to the same value on each replica.

Replicas typically span multiple IPUs or even IPU-Machines or Pod systems. The cross-replica reduction introduces more communication across multiple IPUs and introduces a performance overhead. The relative impact of the communications overhead can be reduced by using a larger replica batch size, typically by increasing the gradient accumulation.

Using replication also increases the exchange code, linearly with the number of replicas, as illustrated in Table 4.1. Note that increasing the number of replicas from 1 to 4 (in Table 4.1) has increased the size of the global exchange code by a factor of 4. In this application the size of the global exchange code is still negligible when compared with the size of the internal and host exchange code.

Table 4.1 BERT-Large, 1 and 4 replicas.

1 Replica

4 Replicas

Control Code

91,668,904 Bytes

117,063,756 Bytes

Internal exchange Code

108,270,156 Bytes

126,520,640 Bytes

Host exchange Code

66,044,276 Bytes

66,147,360 Bytes

Global exchange Code

1,035,988 Bytes

4,657,516 Bytes

You can enable local replication in different frameworks:

However, for better scaling performances it is recommended to use PopDist and PopRun when possible.

### 4.3.2. Multiple SDK instances and replication: PopDist

One challenge that may arise when using graph replication is that the throughput performance stops increasing with the addition of new replicas because of the host limited compute power and bandwidth. For example the model can become CPU-bound from JPEG decoding or data augmentation on the host.

Fig. 4.4 A single instance of the host Poplar SDK feeding into multiple replicas of the graph.

This situation can be improved by using PopDist, which distributes the host-side processes on multiple processors or even hosts. PopDist relies on MPI to seamlessly allow graph replication to use multiple hosts and instances. Two common improvements are:

• Increasing the number of host processes (which share the same IO connectivity), if the application is compute bound by the host preprocessing

• Increasing the number of hosts (vertical scaling), if the application is also IO bound.

In Fig. 4.5, each process is an instance of the host code run by PopRun and is connected to a single instance of the replicated model. The preprocessing and IO communications happen inside the host processes.

Fig. 4.5 With PopDist, multiple instances of the programs, possibly distributed over multiple hosts, can feed into multiple replicas of the graphs.

Code examples for PopDist:

## 4.4. Host-IPU IO optimisation

Good for IO-bound models Model memory requirements are close to the IPU maximum memory

How memory external to the IPU is accessed affects performance. Data is transferred in streams, with the IPU using a callback function to indicate to the host that the host can read data from the buffer (for IPU to host transfer) and that the host can populate the stream (for host to IPU transfer). For more information about data streams, refer to the Poplar and PopLibs User Guide.

There are three options that can be used to maximise IO performance:

• Prefetch depth

• I/O tiles

• Reducing the bandwidth required, for example, by using 8-bit integers or FP16 transfers.

### 4.4.1. Prefetch and prefetch depth

Enabling prefetch means that the IPU should call the callback function as early as possible (for example, immediately after it releases the stream buffer from a previous transfer). The host is then able to fill the buffer in advance of the transfer, meaning the IPU spends less time waiting for the host because the data is available before it is needed. The prefetch depth controls how much data is fetched, and a prefetch depth of 3 seems optimal.

To modify the prefetch depth:

• Tensorflow provides a prefetch_depth option for the IPUInfeedQueue.

• PopART also provides a way to change this value with the session option defaultPrefetchBufferingDepth.

opt = poptorch.Options()
opt._Popart.set("defaultPrefetchBufferingDepth", 3)


### 4.4.2. Overlapping I/O with compute

The option to designate a number of IPU tiles to be I/O tiles allows for the Poplar graph to be constructed so that the data transfer and the computation can overlap in time.

Fig. 4.6 No I/O tiles designated. The StreamCopy happens before execution of the model code.

Fig. 4.7 Using I/O tiles. StreamCopy happens in parallel with the execution of the model code.

Setting the number of IO tiles is left up to you and depends on the application. The IO tiles must hold at least a full micro-batch of data and not all the memory on an IO tile is available for data (typically ~5% is used for the exchange code and buffers). Thus a rule of thumb for the number of IO tiles ($$N_{IO}$$) to allocate is:

$$N_{IO} \approx 1.05*\frac{S_{micro\_ batch} \cdot S_{micro\_ batch}}{S_{tile}}$$

where $$S_{micro\_ batch}$$ is the micro-batch size and $$S_{tile}$$ = 624 kB on the Mk2 IPU.

It may be more optimal if the number of IO tiles is a power of two, and doing a sweep of powers of two values is recommended. Note that, as of SDK 2.3, this will only overlap I/O with computation for a single IPU application or a pipelined application using the grouped schedule. Also, for a multi-replica or multi-instance configuration, the host multi-threading must be configured with the Poplar engine runtime options:

streamCallbacks.multiThreadMode: collaborative


Documentation for I/O tile settings:

### 4.4.3. Data size reduction

Instead of transferring the data in single-precision floating point (FP32) format, it is possible to improve data throughput by transferring the data in half-precision floating point (FP16) or, if possible, in Int8 format. The latter is especially relevant for images, where data normalisation/scaling happens on the accelerator instead of the host. Similarly, the returned data can be reduced if it requires a large portion of the bandwidth.

## 4.5. Host-side processing optimisations

### 4.5.1. Host and IPU preprocessing

You can improve your data pipeline with caching and more efficient preprocessing, for example, normalisation of the data, or you can move some of the preprocessing directly to the IPU. This makes sure that data is generated fast enough.

### 4.5.2. The one-computational-graph concept

The Graphcore frameworks extract the computational graphs from different frameworks like PyTorch and TensorFlow post-process them to obtain highly efficient computational graphs that get executed on the IPU. Since the IPU looks at the whole graph and not just single operators, this compilation process can take a bit longer. Thus, we provide features like executable caching in the different frameworks to make sure that compilation happens rarely and loading the executable to hardware takes very few seconds.

However, when developing the application code, it is important to avoid changing the computational graph because this will trigger recompilation when the change occurs for the first time and later on it will trigger changes of the executable. Hence, it is preferable that there is only one computational graph. A common example that creates more than one graph is when the batch size changes. With a different batch-size, data and compute need to be distributed differently to the IPU tiles and thus a new optimisation is required. Further examples are provided in the technical note Optimising for the IPU: Computational Graph Recompilation and Executable Switching in TensorFlow. Similarly, eager execution breaks down the full graph into multiple subgraphs, which should be avoided as described in the TensorFlow 2 user guide.

### 4.5.3. Looping

Instead of processing sample by sample, it is beneficial to loop over streams of samples instead. In this way, data is continuously transferred between host and device and the host-side overhead of starting and stopping the processing process is avoided. This happens in most of our interfaces, for example, in the TensorFlow Keras interface or our PyTorch interface. However, in some TensorFlow applications, processing still needs to happen sample by sample. In these cases, explicit loops should be used, as shown in the TensorFlow 2 porting examples.

## 4.6. Optimising numerical precision

Good for Most deep learning models May need some granularity to avoid loss of precision in critical parts of the model

Reducing the numerical precision is a common way to improve the training or inference performance of a model. This uses less memory for all operations, increases the bandwidth of data exchanges, and reduces the number of cycles required by arithmetic operations.

Note that numerical stability issues may arise when reducing the numerical precision. While resolving these issues is out of the scope of this document; some key ideas are:

• Using loss scaling when the gradients are calculated in reduced precision. This will limit the risk of underflow errors when gradient values are too small to be accurately represented.

• Enabling stochastic rounding to improve the average numerical accuracy, in particular during weight update.

• Analysing the distributions of the values of the weights and gradients in all layers of the networks, and locally increasing the precision of some critical operations.

For more information on numerical precision on the IPU, refer to the AI Float White Paper.

## 4.7. Replicated tensor sharding (RTS)

Good for Very large models Large number of optimiser states Models without data parallelism Exchange bandwidth already constrained by data IO

Replicated tensor sharding is a technique only available in combination with graph replication. With replicated tensor sharding, a tensor that would otherwise contain the same information on all replicas can be sharded (for example, broken up into equal sub-tensors) across all replicas of a graph. This frees up memory on the replica, but requires re-assembling the tensor when it needs to be used.

This feature can be utilised with tensors both on (stored in IPU SRAM) and off chip (stored in Streaming Memory).

Fig. 4.8 Tensor memory layout before sharding

Fig. 4.9 Tensor memory layout after sharding

Replicated tensor sharding is especially useful for sharding the optimiser states across replicas. This enables you to apply the optimiser update in parallel across many replicas. Each optimiser update then only acts on its corresponding shard of the parameters before reducing over the replicas. This requires additional operations and so is more beneficial on large, distributed models, where the trade-off between these operations is outweighed by the benefits of parallelism.

This method is similar to Microsoft’s Zero Redundancy Optimizer (ZeRO).

Enabling replicated tensor sharding:

• In Tensorflow, CrossReplicaGradientAccumulationOptimizerV2 and GradientAccumulationOptimizerV2 take an optional argument replicated_optimizer_state_sharding. This will shard the optimiser state tensors among replicas. It can also be set when using the pipelining Op . When using multiple instances, the tensor is partitioned across the local replicas owned by each process. Each host instance has access to the complete variable tensor.

• PopART provides a flexible tensor location API to configure RTS.

• Since PopART is the backend for PopTorch, the PopART logic applies for RTS in PopTorch. The tensor location options can be accessed via poptorch.Options.TensorLocations.

For instance, let’s set RTS for the optimizer state tensors as soon as it contains more elements than our number of replicas. Let’s assume we are using PopDist and PopRun to handle distributed training.

• In PopART:

opts = popart.SessionOptions()
popdist.popart.configureSessionOptions(opts)

num_local_replicas = popdist.getNumLocalReplicas()
opts.optimizerStateTensorLocationSettings.minElementsForReplicatedTensorSharding = num_local_replicas
opts.optimizerStateTensorLocationSettings.location.replicatedTensorSharding = popart.ReplicatedTensorSharding.On

• In PopTorch:

opts = popdist.poptorch.Options()

num_local_replicas = popdist.getNumLocalReplicas()
opts.TensorLocations.setOptimizerLocation(
poptorch.TensorLocationSettings()
.minElementsForReplicatedTensorSharding(num_local_replicas)
.useReplicatedTensorSharding(True)
)


Note

PopART’s CommGroup settings are not available in PopTorch. PopTorch uses CommGroupType.Consecutive as default: Tensors are sharded among the local replicas. Each host instance has access to the complete variable tensor.

## 4.8. Tile mapping

A poor tile mapping can lead to strided copying. Strided copying arises when either the memory access patten or the location in memory of data is non-contiguous, for example, when you are transposing data. This leads to slower performance and has a considerable code cost.

There are two ways you can identify that strided copying exists:

• Use the PopVision Graph Analyser tool to examine the list of vertex types. Strided copies will have names of the form DstStridedCopy and there will be many of them.

• PopVision allows you to look at the program tree to track down where the vertices are used.

• The execution graph shows many PreArrange steps, like PreAarrange-0 and PreArrange-1.

• These indicate on-tile copy operations which rearrange data. The data is rearranged so that it has the correct layout for input to vertices, for example, for alignment and to make the input contiguous in memory.

Strided copies could also be the result of suboptimal mapping by the framework. One way to address them directly in the framework is to reverse the operands or to use PopLibs functions directly as part of custom operators. Refer to the technical note Creating Custom Operations for the IPU for details.

Fig. 4.10 PopVision execution trace showing PreArrange steps taking up significantly more cycles than the operation itself. In this example the additional operation takes about 580 cycles while 5800 cycles are needed to re-arrange the tensors. (Note that Group Executions must be disabled in the Options of the Execution Trace plot.)

## 4.9. Other execution schemes

Good for Very large models Very large code Models with interdependencies between pipeline stages This is generally not supported in high-level ML frameworks

Other execution schemes, such as domain sharding and TensorParallel execution, can be considered as a “last resort” as they are not easily implemented at the application level.

# 5. Common memory optimisations

## 5.1. Available memory proportion tuning

The availableMemoryProportion parameter controls the amount of memory allocated for use of temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU and is used by the PopLibs Convolution Planner to set the upper bound on the amount of temporary memory in the first pass of the planner. Note this should affect only the temporary memory and so can be used to reduce not-always-live memory spikes.

The effect of the available memory proportion may not be visible on the Graph Analyser when a model goes out of memory. Only the first stage of the convolution planner uses the available memory proportion. If the operations cannot be made to fit in the first stage the planner will disregard the available memory proportion value and create a generic plan that’s not bounded by the memory.

The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU specifically addresses the tuning of the availableMemoryProportion parameter.

## 5.2. Partials type

Partials are temporary variables, generally accumulators, that are used by the accumulating matrix product units that contain intermediate calculations during matmul and convolution operations. These intermediate values used by the matmul/conv (accumulating matrix product units) contribute to the not-always-live memory use, as the partials are created during the forward pass and reused during the backward pass. Partials are stored in FP32 by default.

Using FP16 for the partials will reduce the temporary memory used by the accumulating matrix product units by half, at the expense of less accurate numerical results. In practice, if the weights are stored in FP16 reducing the partials accuracy to FP16 does not generally affect model stability or accuracy.

Reducing the partials format also improves the performance (See Table 5.1 in the Mixed-Precision Arithmetic for AI white paper). So, each time the numerical precision is halved, the execution speed of matmul and convolution operations is doubled.

• In Tensorflow, the partial type of convolutions and matmuls can be set globally by setting convolution or matmul options with IPUConfig(). This allows you to pass Poplar’s backend options to Tensorflow as a dictionary. The Poplar and PopLibs API Reference lists all options for matrix multiplications and for convolutions.

cfg = IPUConfig()
cfg.convolutions.poplar_options = {'partialsType': 'half'}
cfg.matmuls.poplar_options = {'partialsType': 'half'}


It is also possible to use a dedicated scope that will apply the partial type to the matmuls and convolutions when possible.

• In PopTorch, you can set the partial type of convolutions and matmuls globally using the Precision options. For instance:

opts = poptorch.Options()
opts.Precision.setPartialsType(torch.half)

• In PopART you can set a global partial type for matrix multiplications with the session option partialsTypeMatMuls. Similarly to Tensorflow, you can also set Poplar options for convolutions and matmuls.

userOpts = popart.SessionOptions()
userOpts.convolutionOptions = {'partialsType': 'half'}
userOpts.matmulOptions = {'partialsType': 'half'}


When contructing your graph with the PopART builder API, you can overwrite the partials type of individual nodes by using setPartialsType.

## 5.3. Activation recomputations

In a neural network trained using backpropagation, the activation calculated during the forward pass needs to be stored in order to be reused in the backward pass. This is required in order to calculate the gradient with respect to the activations. This is problematic because storing these activations during the forward pass uses always-live memory and the amount of memory grows linearly with the micro-batch size.

### 5.3.1. Activations recomputation and memory use

Activations are stored as FIFO variables and are shown under the Not Always Live Variables section in PopVision Graph Analyser.

As an illustration, Table 5.1 compares the FIFO variable sizes of the BERT Large model with batch size 1 and batch size 8.

Table 5.1 Comparison between the FIFO variable sizes of a BERT Large model with batch sizes 1 and 8.

Variable

Batch size 1

Batch size 8

/fifo.12/buffer/0 (2/1)

2.0 MB

16.0 MB

/fifo.13/buffer/0 (2/1)

2.0 MB

16.0 MB

/fifo.35/buffer/0

1.5 MB

12.0 MB

/fifo.36/buffer/0

1.5 MB

12.0 MB

One way to reduce the amount of always-live memory needed is to not store the activations during the forward pass and to instead recompute them “on-the-fly” as they are needed during the backward pass. The trade-off with recomputation is that we use less memory, but we have more execution cycles in the backward pass (Fig. 5.1).

Fig. 5.1 Recomputation of activations increases the number of operations. Blue = forward, cyan=forward for recomputation, red=backward

In models that are memory intensive (and most deep-learning models are), using recomputation to save memory is more valuable than the extra execution cycles. For example, BERT Large would not fit in memory without recomputation (Fig. 5.2 and Fig. 5.3).

Fig. 5.2 Total memory use in a model without recomputation of activations (blue) and with recomputation (red).

Fig. 5.3 The change in not-always-live memory for a model without recomputation (green) and for a model with recomputation (blue).

### 5.3.2. Recomputation checkpoints

Recomputation checkpoints are points in the computational graphs where all the activations are stored. Any activations in the graph from that point on will be recomputed from the previous checkpoint values, and stored in not-always-live memory, until the next checkpoint. Checkpoint positions in the graph can be set automatically by the frameworks or you can manually insert. When you pipeline a model, checkpoints are automatically inserted at the beginning of a pipelining stage.

The careful introduction of recomputation checkpoints, whether a model is pipelined or not, significantly saves always-live memory, since the amount of memory saved corresponds to all the activation FIFOs between two checkpoints.

The trade-off is an increase in compute cycles. Effectively, between two recomputation checkpoints, the backwards pass is replaced by a forward and a backwards pass. As a rule of thumb, the backwards pass takes between 2 to 2.2 times as many cycles as a forward pass.

In the inference mode, intermediate activations are never stored and hence recomputations are not required.

• In Tensorflow, recomputation can be enabled using the IPUConfig option allow_recompute. Different recomputation modes exist. The recomputation checkpoints can be set in multiple ways when using pipelining:

• A tensor in the graph can be wrapped in a recomputation_checkpoint.

• In Keras, a special layer is available to indicate the recomputation checkpoints:

• In PopTorch, recomputation is enabled by default when using a pipelined execution. Checkpoints can be added by wrapping one or more tensors in poptorch.recomputationCheckpoint.

Note

By default there is at least one checkpoint per pipeline stage.

• In PopART, you need to select a recomputation type first and enable it by setting the session option autoRecomputation. For instance:

userOpts = popart.SessionOptions()
userOpts.autoRecomputation = popart.RecomputationType.Standard


When using the builder API, manual checkpoints can be added with the builder.checkpointOutput method. For instance, let’s define a simple graph and add a recomputation checkpoint to the output.

# Set RecomputeAll to have all ops recomputed by default
userOpts = popart.SessionOptions()
userOpts.autoRecomputation = popart.RecomputationType.RecomputeAll

def build_graph():
builder = popart.Builder(opsets={"ai.onnx": 10, "ai.graphcore": 1})
y_tensor = builder.aiOnnx.mul([x, x])
# Checkpoint y_tensor
y_tensor = builder.checkpointOutput([y_tensor])[0]
return y_tensor


You can also force the recomputation of a specific output using builder.recomputeOutputInBackwardPass.

The optimiser state parameters (for example, the first and second moments in ADAM) are always present in memory during training. You can also offload these to the host to save IPU memory. The optimiser states are only transferred once at the beginning (from host to IPU) and once at the end (from IPU to host) of the weight update step, so the communication penalty is much less than transferring model parameters.

Offloading variables to the host will create some exchange code, the size of which is difficult to estimate. It will also use temporary (not always live) data memory in order to store the input and output buffers from and to the host.

• In Tensorflow, variable offloading can be set with gradient accumulation and pipelining: TensorFlow 1, TensorFlow 2.

• In PopART it is possible to offload the optimiser state tensors in Streaming Memory using the tensor location setting.

• In PyTorch the same thing can be achieved by using PopTorch’s tensor location settings. To setup optimiser state offloading we can use setOptimizerLocation:

opts = poptorch.Options()
opts.TensorLocations.setOptimizerLocation(
poptorch.TensorLocationSettings().useOnChipStorage(False))


## 5.5. Graph outlining

You can reuse identical code in different parts of the computational graph by outlining the code. This will keep only one copy of the code per tile and the graph will contain multiple call operations to execute the code. Without outlining, identical code at different vertices is always duplicated, taking up memory.

Note: You can think of outlining as the opposite of in-lined functions in C++.

In a Poplar graph, the mapping of the tensors to the vertices is static. When a section of code is outlined, the input and output tensors are statically mapped. This means that when the section of outlined code is called at different points of the computational graph, the input and outputs at this point of the computational graph must be copied into and out of the locations of the statically mapped tensors in the outlined code. The copy operations require more always-live memory and a penalty of using outlining is that you require more cycles.

We will illustrate how outlining works on an example using BERT transformer layers.

In the original, not outlined program, most of the code is under the top-level control program and there are only two additional functions (Fig. 5.4).

Fig. 5.4 Program tree in PopVision

Looking into the control program code we can see the self-attention blocks are repeated multiple times in the program tree (Fig. 5.5, Fig. 5.6). These self-attention blocks operate on large matrices and can take a large amount of program memory.

Fig. 5.5 First transformer layer

Fig. 5.6 Second transformer layer

And similarly for the 22 remaining layers of BERT Large. The code, required for this self-attention operation, is thus duplicated many times and this s the total code memory use.

When the self-attention block is outlined, the outlined section of code is now an independent Poplar function. Looking at the control program, we can see the self-attention blocks of code have been replaced by a call to a Poplar function (Fig. 5.7, Fig. 5.8).

Fig. 5.7 First transformer layer with outlining

Fig. 5.8 Second transformer layer with outlining

The “target” argument is the identifier of the Poplar function being called. So, in this case, Function 3 is being called. Looking into Function 3, called from Call(target: 3) we see the self-attention block (Fig. 5.9).

Fig. 5.9 Details of self-attention block after outlining

The function exists only once in memory and is re-used for multiple input and output tensors. Poplar adds copy operations in the graph so that different variables (tensors) can be used as inputs to the single function. The developer could expect the function to use pointers to eliminate the need for copies, however this is not possible in the static graph model of Poplar, so outlining saves always-live code memory but adds copies at run time.

Fig. 5.10 Execution graph (from PopVision) with outlining

Fig. 5.11 Execution graph (from PopVision) without outlining

The self-attention block spans many execution cycles (in Fig. 5.10 only the start of the self-attention block is shown). So, the execution cycles added for the extra copy operations for the outlined code do not increase the total execution cycles by too much.

Outlining is performed by default by PopART and Tensorflow. They both give the user different levels of control:

## 5.6. Reducing the batch size

In some applications, it may be possible to use a smaller batch size to reduce memory requirements.

If you reduce the global batch size (that is, the total number of samples processed between weight updates) for training, you should also reduce the learning rate. Scaling the learning rate by the same factor as the batch size has been found to be effective - see §2.1 of Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour by Priya Goyal et al. for details.

Using a smaller micro-batch size won’t necessarily hurt performance. The IPU’s architecture makes it better for fine-grained parallelism than other processors which enables better performance at low batch sizes. Further, researchers at Graphcore have shown that smaller batch sizes can stabilise training. See Revisiting Small Batch Training for Deep Neural Networks by Masters and Luschi for full details.

However, there are some operations, such as batch normalisation, that are calculated across a mini-batch. The performance of these operations can degrade when reducing the batch size. If this causes problems, you may choose to replace batch normalisation with group, layer or instance normalisation instead.

As an alternative solution, you may choose to use proxy normalisation, a technique developed by researchers at Graphcore. Proxy normalisation is designed to preserve the desirable properties of batch normalisation while operating independently of the batch size. A TensorFlow 1 implementation is available in Graphcore’s public examples. Proxy normalisation was first described in the paper “Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence” by Labatie et al.

## 5.7. Writing a custom operation

If you have a particular operation in your model that is using a lot of memory, you may find it useful to write a custom operation using a lower-level library that uses less memory.

Refer to the technical note “Creating Custom Operations for the IPU” for details and links to additional resources on writing custom operations.

# 6. Debugging an out-of-memory exception

Sometimes, you will try to run a program on the IPU and find that there is not enough memory to execute your program. In this section, we will discuss how you can resolve this.

## 6.1. Identifying when you’ve run out of IPU memory

All memory allocation on the IPU is done when the program is compiled. If the Poplar compiler cannot allocate enough memory for your program, the compiler will throw an exception. This exception will be of type graph_memory_allocation_error, though this may not be reported if you are using a higher-level framework.

The most common error message looks something like this:

Memory allocation error : Out of memory on tile 0: 876476 bytes used but tiles only have 638976 bytes of memory


Setting the Poplar engine option debug.allowOutOfMemory to true allows compilation to continue even when it has been detected that running out of memory is inevitable. This option is turned on by default when profiling is enabled. If this option is enabled, the following error message will be displayed:

Out of memory: Cannot fit all variable data onto one or more tiles.


See Section 6.3, Profiling the model for further details of how these engine options are set and used.

There are some other error messages that can be shown when there is not enough memory on the IPU. For example, the following error occurs when there is an exchange in which more memory is sent to tile 0 than there is memory on tile 0, and so there is no way the exchange could possibly be executed:

Tile 0 receives more data than it has total memory in exchange 'cs19_98/scatterAdd/multiUpdateAdd_ExchangePre'


Here is another example, which is usually shown when more code needs to be stored on a tile than there is memory available:

tile 0 _poplar_start must be 0 not 638976 bytes from the start of memory. Typically this error occurs when there is too much code to fit incode memory.


There are some error messages that say some memory limit has been exceeded, but actually refer to a memory limit other than the amount of memory available on the IPU itself. For example, the following message refers to host buffer memory, not the memory on the IPU’s tiles.

Buffer 1 needs to be at least 2616922112 Bytes, but remaining host buffer memory is only 268435456


We do not cover how to deal with such issues here.

## 6.2. Memory limits on the IPU

A single Mk2 IPU has 1472 tiles, each of which has 624 KiB of memory, for a total of 897 MiB across the processor.

These values can be useful to help you guide your decisions. For example, suppose you have a model with one billion weights. Even if these weights are stored in float-16, they would require 2 GB of memory, and therefore you would need more than two Mk2 IPUs just to have enough memory for the weights. It’s also a good idea to check your model for any intermediate tensors that would require more memory than there is on a single IPU.

However, you should not try to use these values to make exact calculations before running a program. The exact amount of memory required by an IPU program is unpredictable because the memory required by a program includes the binary code that runs on the IPU. Liveness constraints can also make memory requirements difficult to predict. In general, the best way to determine if a program is going to go out of memory is to try to compile it for the IPU.

Refer to Section 3.2, Understanding the memory mapping of a computational graph, Section 3.3, Always-live and not-always-live memory, and Section 3.4, Tensor variables memory use for more details about memory usage on the IPU.

## 6.3. Profiling the model

By profiling your model, you can collect information about its memory usage and liveness properties. This information can be displayed and explored visually using the PopVision Graph Analyser. For full details of how to use the PopVision Graph Analyser, refer to the PopVision Graph Analyser User Guide.

### 6.3.1. Enabling profiling

To profile the memory consumption of your model, use the following environment variable:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./profile", "autoReport.outputExecutionProfile":"false"}'


The value passed to POPLAR_ENGINE_OPTIONS is a JSON string, so using double quotes inside the braces and single quotes outside them is essential.

Setting autoReport.all to true enables all profiling options. A list of all profiling options can be found under “Engine creation options: Report generation” in the documentation for the Poplar Engine class.

In this case, we have also disabled execution profiling by setting autoReport.outputExecutionProfiling to false. This is because execution profiling adds additional code to your IPU program to record the length of each step. This can add further confusion as to why your program is going out of memory. Without execution profiling, no extra memory on the IPU is used by profiling. You can use the key-value pair "autoReport.outputExecutionProfile":"true" instead if you want to keep execution profiling enabled (for example, if you want to see the effect of using execution profiling on the memory usage of your program).

To use this, you can add this to the start of the command used to run your program:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./profile", "autoReport.outputExecutionProfile":"false"}' python training.py


You can change the directory where the profile is stored by changing the value of the autoReport.directory option:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./profile_inference", "autoReport.outputExecutionProfile":"false"}' python inference.py


You may need to take more care when profiling if you are running a distributed application with PopDist and PopRun.

### 6.3.2. Using offline compilation to reduce IPU usage when profiling

It may be useful to compile an executable for the IPU offline to test if the model goes out of memory without blocking the use of an IPU by other programs.

There are options to do this in each framework:

• TensorFlow: Set the device_connection.type attribute of your IPUConfig object to ipu.config.DeviceConnection.NEVER

• Pytorch: Use opts.connectionType(poptorch.ConnectionType.Never) where opts is your poptorch.Options object

• PopART: Use the device manager to create an offline device and then compile and export your program.

### 6.3.3. Using the PopVision Graph Analyser

The PopVision Graph Analyser can be used to explore the data captured when you profile your program. The application can be downloaded from the PopVision microsite.

The tabs which are most likely to be of use for resolving an out-of-memory error are:

• the Memory Report, which shows how much memory is used on each tile

• the Liveness Report, which shows how much memory is occupied by live variables at each step of the program

For a more comprehensive guide to using the PopVision Graph Analyser, see the PopVision Graph Analyser User Guide.

## 6.4. Deciding what to do

There are quite a few different techniques available for reducing memory usage, and so it can be difficult to decide what to do.

A good starting point is to reduce the batch size. Most reference implementations of models are written for processors which have an optimal batch size which is larger than the optimal batch size for the IPU. This means that many models can consume much more memory than is needed for effective training. See Section 5.6, Reducing the batch size for details.

If you find that your model is still not fitting even after reducing your batch size as far as you reasonably can, you should profile your program (see Section 6.3, Profiling the model and determine which parts of your program are taking up the most memory.

As a rule of thumb, you should start to look at splitting up your model if you still require more than twice as much memory overall as is available on an IPU after reducing the batch size as much as possible.

### 6.4.1. Tile and IPU memory balance

In some cases, you may find that you have enough overall memory for your program, but one or few tiles require much more memory than the others. This may be because you have hit an edge case in the PopLibs implementation of an underlying operation. Cases like these should in general be reported to Graphcore’s support team.

You may be able to get around this issue by identifying the operation which is causing the problem and implementing it yourself in Poplar as a custom op. See Section 5.7, Writing a custom operation for details.

If you are running a program over multiple IPUs, and there is an imbalance in the memory used between different IPUs, you may need to reconsider the way that you are splitting your model and what values you are checkpointing for recomputation.

### 6.4.2. Techniques by liveness of memory

In this section, we discuss techniques for reducing memory usage according to whether the memory reduced is not always live or always live.

#### 6.4.2.1. Reducing not-always-live memory

There are many techniques for reducing the not-always-live memory requirements of your model. You could try using float-16 computations or partials, recomputing activations in the backwards pass, or reducing the available memory proportion for convolutions. All of these are discussed in Section 5, Common memory optimisations.

#### 6.4.2.2. Reducing always-live memory

Dealing with an excess of always-live memory is generally more difficult. It may be the case that the parameters of your model are using too much memory, in which case your choices are limited to storing them in float-16 (which may be unstable, especially without also using stochastic rounding) or increasing the number of IPUs that you are using. If code is taking up a lot of memory, it may be useful to outline certain parts of the model so that code can be reused. See Section 5.5, Graph outlining for full details.

With some optimisers, additional variables are used to maintain running statistics of the gradients. These variables can take up a lot of memory, and this memory must always be live. In this case, you may find it useful to use variable offloading to reduce the amount of memory taken up by these variables. See Section 5.4, Variable offloading for full details.

In some cases, you can afford to swap out your optimiser for one that requires less memory for these variables. For example, you could use stochastic gradient descent with momentum instead of Adam.