1. Overview

The goal of this document is to help Graphcore AI engineers and customers develop high-performance machine learning models running on the IPU. It covers the general topic of optimising performance.

There are many factors that affect model performance, and this document will cover the following:

  • Memory optimisation

  • Execution schemes

  • Optimisations specific to the IPU and Poplar

Although this document focuses on performance, it is worth bearing in mind that numerical stability and convergence properties may limit the design options when optimising a model for performance.

Implementation details are not covered in this document; refer to the framework-specific documentation for those.

2. Understanding the IPU programming model

Before optimising performance on the IPU, it is necessary to understand the IPU programming model. This section summarises the core concepts of the IPU programming model; further details can be found in the IPU Programmer’s Guide.

2.1. Core concepts for IPU programming

Programming the IPU is determined by the features of the IPU hardware and the software used to develop the machine learning models.

The IPU is organised into multiple processing cores called tiles. Each tile can be viewed as an independent processor, which executes a tile-specific program and has access to local SRAM in the tile (called In-Processor-Memory). All tiles are connected to the exchange fabric, which is the communication network used to transfer data between tiles within an IPU. This exchange can be further distributed over multiple IPUs. The communication bandwidth during exchange is particularly high, about 47 TB/s within an IPU and 8 TB/s between IPUs.

The IPU is programmed through software abstractions provided by the Poplar Graph Programming framework.

Two core concepts for programming IPUs are:

  • The bulk-synchronous parallel model of execution. This decomposes the runtime execution into three phases: local compute, global synchronisation, and data exchange.

  • The graph representation of computations. The Poplar graph programming framework operates on a computational graph where vertices represent operations and edges represent the input and output data.

2.2. Main differences with GPU programming

For engineers coming from a GPU programming background, the differences in IPU programming are:

  • The graph is compiled statically. This means that dynamic tensor access is not easily performed and comes with a memory cost.

  • Model parallelism (such as sharding a model) is often required for large-scale models.

  • The IPU programmer is more involved in controlling how the IPU memory is used, for example, how it is partitioned and allocated.

  • Because the graph and tensor layouts are static, the code required to perform communication during the exchange phase grows as the total number of tensors increases.

2.3. Factors affecting model performance

Model performance is driven by:

  • Efficient use of compute cycles: the aim is to maximise the use of the compute capabilities of the IPU.

  • Efficient use of memory: the aim is to minimise the total memory required and to maximise the bandwidth at which the used memory is accessed.

  • Efficient use of communication with the host: the aim is to maximise the bandwidth and minimise the latency of the data transfers between the host and the IPUs.

2.4. PopVision Tools

Profiling the compiled program and its execution can provide insights into possible memory, compute-cycle and communication optimisations. The PopVision Graph Analyser helps with understanding how the graph is distributed and run on the IPU. The PopVision System Analyser breaks down all the processes around the IPU execution, such as graph compilation and host-IPU communication.

3. Mapping a model to an IPU system

This section describes how to map a model to the Graphcore IPU hardware and the Poplar graph programming framework.

At this stage, we are not considering any optimisation that could improve the data memory use, such as recomputation or offloading. We are simply interested in assessing how large the model is and how many IPUs may be required to make it fit.

This section describes:

  • How a machine learning algorithm is mapped onto a Poplar graph

  • How Poplar uses IPU memory at runtime

  • How to estimate the memory use of the model tensors at the design stage

3.1. Computational graph of ML model training

Most deep learning models are trained using backpropagation and can be represented by a computational graph similar to that in Fig. 3.1.

_images/backprop_graph.png

Fig. 3.1 Computational graph of training a simple network using backpropagation. op is a forward-pass operation, bwd is the corresponding backward-pass operation and wu is the weight-update operation.

The mapping of the computational graph onto the IPUs is left to the AI engineer and is critical for achieving performance targets. Note that this exercise may be constrained by the capabilities of a given ML framework. For example, PopART offers more flexibility with partitioning the graph into low-level operations than high-level frameworks such as TensorFlow or PyTorch.

3.2. Understanding the memory mapping of a computational graph

The Poplar compiler takes the computational graph and creates an executable for running on IPU systems. The IPU memory used by the computational graph is allocated when Poplar compiles the graph. This memory is composed of:

  • code (for example, linear algebra operations)

  • input and output tensors

  • code and buffers supporting data exchanges

3.2.1. Machine learning models’ use of memory

Before porting a model to an IPU system, it can sometimes be beneficial to estimate the total memory required by a given machine learning model. The total memory required can be decomposed into:

  • Data and variables of the model such as

    • trainable parameters (for example, weights and biases)

    • non-trainable parameters (for example, the mean and variance statistics of the local batch in the normalisation layers)

    • activations (outputs of a layer)

    • optimiser state variables (for example, the first two moments in the Adam optimiser)

    • gradients with respect to the weights

  • Program code required to calculate:

    • the model output during inference or during the forward pass

    • the losses and gradients during the backward pass

    • the weight updates, which depend on the optimiser used

In addition to data and program code, Poplar also allocates memory to store the code and data buffers used during the exchange phase. This memory is used during communication between different entities, such as compute sets, tiles and IPUs.

3.2.2. IPU memory used by a Poplar computational graph

The memory used by a Poplar graph is composed of:

  • Tensor variables, which contain the data at the input and output of a vertex, such as weights, optimiser states and weight update values

  • Vertex code, which is generally the code performing numerical operations

  • Vertex state memory, used by variables local to a vertex instance

  • Exchange code, which contains the code needed during the exchange phase

  • Exchange buffers, which contain the data used during the exchange phase.

The exchange code and buffers exist for three types of exchanges:

  • Internal exchange for exchanges between tiles on the same IPU

  • Inter-IPU (or global) exchange for exchanges between IPUs

  • Host exchange for exchanges between the IPUs and the host

First-in first-out (FIFO) queues on the IPU also exist to enqueue and dequeue data coming from the host.

3.3. Always-live and not-always-alive memory

The memory used by the data is allocated by the Poplar compiler at compile time. The compiler tries to re-use memory by considering memory use over the entire duration of the graph execution. For example, two variables may share the same memory addresses if the lifetime of each variable does not overlap.

Memory that is allocated to a specific entity (for example, compute set code, variables and constants) for the entire lifetime of the graph execution is called always-live memory.

In contrast, the data for not-always-live variables is only needed during some program steps. If two such variables are not live at the same time, they can be allocated to the same location, thus saving memory. This memory is referred to as not-always-live memory.

It is the combination of always-live and not-always-live memory requirements for a model that determines the total memory use and therefore affects performance.

3.4. Tensor variables memory use

3.4.1. Number of model parameters

At this stage, it is worth doing some back-of-the-envelope calculations to identify which elements of the models (such as convolution kernels, dense layers, embedding tables) may be the largest contributors to the data memory use.

The size of these blocks can be related to some model hyperparameters, such as sequence lengths, image size, number of channels, embeddings table size, and hidden representation size.

Table 3.1 Example summarising elements of the model and how they contribute to memory use

  Layer             | Inputs                                                   | Number of parameters
  Convolution layer | Kernel size \((h,w)\), \(N\) filters, \(C\) channels     | \((w \cdot h \cdot C + 1) N\) parameters (weights and biases)
  Dense layer       | Input size \(n\), output size \(p\)                      | \(n \cdot p\) weights and \(p\) biases
  Batch norm        | Tensor size \(n\) on normalisation axis (in a convolution layer, normalisation is done along the filters) | \(2n\) moving average parameters (training and inference), plus \(2n\) scale and location parameters (\(\beta\) and \(\gamma\), training only)
  Layer norm        | Input length \(n\)                                       | \(2n\) parameters
  Embeddings table  | Vocabulary size \(N\), hidden size \(h\)                 | \(N \cdot h\) parameters

Note that normalisation layers contain parameters that are “non-trainable”. Typically, during training, these are based on the statistics of a micro-batch, and during inference they are re-used from the training stage. They always reside in memory, but there are no optimiser state variables associated with them.
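As a quick sanity check of these counts, the back-of-the-envelope arithmetic in Table 3.1 can be scripted. The sketch below is illustrative only; the layer sizes (a BERT-Base-like encoder with hidden size 768, 12 layers and a 30,522-word vocabulary) are assumptions chosen for the example, not values taken from this document.

    def conv_params(h, w, channels, filters):
        # (w * h * C + 1) * N: weights plus one bias per filter
        return (h * w * channels + 1) * filters

    def dense_params(n_in, n_out):
        # n * p weights plus p biases
        return n_in * n_out + n_out

    def layer_norm_params(n):
        # 2n scale and location parameters
        return 2 * n

    def embedding_params(vocab, hidden):
        # N * h parameters
        return vocab * hidden

    # Rough count for a BERT-Base-like encoder (illustrative assumptions):
    hidden, layers, vocab = 768, 12, 30522
    per_layer = (4 * dense_params(hidden, hidden)       # Q, K, V and output projections
                 + dense_params(hidden, 4 * hidden)     # feed-forward expansion
                 + dense_params(4 * hidden, hidden)     # feed-forward contraction
                 + 2 * layer_norm_params(hidden))       # two layer norms per layer
    total = embedding_params(vocab, hidden) + layers * per_layer
    print(f"~{total / 1e6:.0f} M parameters")           # ~108 M for these assumptions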

3.4.2. Number of activations

During inference, the outputs of the activation functions (referred to as activations) do not need to be stored long-term, as they are typically only used as inputs to the next layer. Note that this is not always the case: some models, in particular image segmentation models, need to pass activations between multiple layers during inference. If these layers reside on different IPUs, the activations will need to be stored and exchanged, generally increasing the data, code and exchange memory requirements.

The memory use of activations during training is considered in the following sections.

3.4.3. Number of optimiser states (training only)

During training, the weight update step of the optimiser may rely on state variables. For example, the Adam and LAMB optimisers need to store the first (\(m\)) and second (\(v\)) moments between weight updates. These optimiser states will need to be stored in always-live memory and scale linearly with the number of trainable parameters in the model.

3.4.4. Number of backpropagation variables (training only)

During training via backpropagation, you need to make the activations available in the backward pass in order to calculate the gradients with respect to the activations and weights. As a first approximation, assume that all the activations are stored in always-live memory, although in practice recomputation of activations (detailed in Section 5.3, Activation recomputations) is often used.

The number of activations required (either for storage or computation) scales linearly with the compute batch size.

The gradients (with respect to the weights) also need to be stored during training. Every trainable parameter of the model has an associated gradient, and these gradients need to be stored in memory.

Therefore, the memory needed increases by:

  • \(N_{gradients} = N_{activations} \times B_{size}\)

  • \(N_{backprop\_ vars} = N_{weights} + N_{biases} + N_{non\_ trainable} + N_{optimiser\_ states}\)

where \(N_{gradients}\) is the number of gradients, \(N_{activations}\) is the number of activations, \(B_{size}\) is the compute batch size, \(N_{backprop\_ vars}\) is the number of backpropagation variables, \(N_{weights}\) is the number of weights, \(N_{biases}\) is the number of biases, \(N_{non\_ trainable}\) is the number of non-trainable parameters and \(N_{optimiser\_ states}\) is the number of optimiser states.

3.4.5. Determine total memory used by variables

The total amount of memory required for all the model parameters is thus:

  • In inference: \((N_{weights} + N_{biases} + N_{non\_ trainable}) \times N_{FP}\)

  • In training: \((N_{weights} + N_{biases} + N_{gradients} + N_{optimiser\_ states} + N_{backprop\_ vars}) \times N_{FP}\)

where \(N_{weights}\) is the number of weights, \(N_{biases}\) is the number of biases, \(N_{non\_ trainable}\) is the number of non-trainable parameters, \(N_{gradients}\) is the number of gradients, \(N_{optimiser\_ states}\) is the number of optimiser states, \(N_{backprop\_ vars}\) is the number of variables required to store the gradients with respect to the weights during backpropagations and \(N_{FP}\) is the number of bytes used by each variable and depends on the numerical precision (half or single) chosen.
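As a hedged worked example of the training formula above, the following sketch estimates the parameter-related memory for a model with 100 million weights stored in FP16 and optimised with Adam. The parameter counts are illustrative assumptions, activation storage is ignored, and the gradient and backpropagation-variable terms are approximated together as one stored gradient per trainable parameter.

    N_weights = 100_000_000
    N_biases = 500_000
    N_non_trainable = 100_000                         # e.g. normalisation statistics
    N_gradients = N_weights + N_biases                # one stored gradient per trainable parameter
    N_optimiser_states = 2 * (N_weights + N_biases)   # Adam: first and second moments
    N_FP = 2                                          # bytes per value in half precision

    training_bytes = (N_weights + N_biases + N_gradients
                      + N_optimiser_states + N_non_trainable) * N_FP
    print(f"~{training_bytes / 2**30:.2f} GiB for parameters, gradients and optimiser state")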

3.5. Vertex code and exchange memory use

Accurate estimation of the exchange code size is very difficult, if not impossible, at the design stage of a model. The internal exchange code is statically compiled, so dynamic access to tensors is not well optimised. It is safer to assume that an entire tensor will be allocated, rather than just a slice of it.

The exchange code is always-live in memory.

Predicting the memory used by the exchange code and vertex code is dependent on the model and on the Poplar compiler optimisations. There is no general approach to determine this memory use. The best way to estimate the proportion of the memory used by the vertex code and exchange is to start with a small model and vary parameters such as micro-batch size, number of hidden layers and number of replicas. This will give an estimate of how the memory use grows with the model size and the proportion of the memory used for code versus data (Section 3.4, Tensor variables memory use).

Some rules of thumb to minimise the exchange code size are:

  • Lots of small tensors are less efficient than fewer large ones.

  • The tensor layouts may matter, since the tensors may need to be “re-arranged” during the exchange, which increases the temporary memory use.

4. Optimising for performance

This section describes the various factors that affect ML model performance and under what conditions they are applicable.

4.1. Memory

Generally, more efficient memory use will translate into performance improvements during training and inference. Better optimised memory means larger batch sizes can be used, thus improving the throughput in both inference and training. Large operations that need to be serialised over multiple steps, for example, matmul and convolutions, can be executed in fewer cycles when more temporary memory is available. When more In-Processor-Memory is available, this reduces the need to access the slower streaming memory.

Section 5, Common memory optimisations describes the methods used in memory optimisation in detail.

4.2. Pipeline execution scheme

Good for:

  • Training large models

  • Training with large global batch sizes

Not recommended:

  • Low-latency inference

The most common execution scheme for large models on the IPU is pipelining. This is a form of model parallelism, where a large model is split into multiple stages running on multiple IPUs. Each IPU contains a portion of the model and the IPU utilisation is maximised by running a pipeline of data. Once a pipeline is established there is always more than one micro-batch “in flight”.

The pipeline efficiency is bounded by the longest stage executing at any one time. There is thus no point in optimising a pipeline stage whose length is dominated by a longer one. Performance gains can also be achieved by “breaking up” the longest stage and re-balancing the work over all pipeline stages.

Pipelining works best when the gradient accumulation factor is much greater than the number of stages, since \(2 N\) stages are always under-utilised during the ramp up and ramp down phases of the pipeline.
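To see why the gradient accumulation factor should be much larger than the number of stages, the following rough sketch estimates pipeline utilisation by counting the micro-batch slots lost to ramp-up and ramp-down. The formula is an approximation for illustration only, not an exact cycle model of the IPU pipeline.

    def pipeline_utilisation(num_stages: int, grad_accum: int) -> float:
        # Roughly 2 * num_stages micro-batch slots are lost to ramp-up and ramp-down
        return grad_accum / (grad_accum + 2 * num_stages)

    for ga in (8, 32, 128, 512):
        print(f"4 stages, gradient accumulation {ga:4d}: "
              f"~{pipeline_utilisation(4, ga):.0%} utilisation")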

During training, pipelining provides two options for the arrangement of the forward and backward passes:

  • grouped

  • interleaved.

Note these pipeline schedule options may not be available in all frameworks.

_images/image2.png

Fig. 4.1 Grouped pipeline schedule.

Grouped scheduling (used by default, Fig. 4.1) is generally faster but requires more memory to stash the activations than the interleaved scheduling.

Grouped scheduling requires storing:

  • \(2 N+1\) activations stored in FIFOs on the first IPU

  • \(2 (N-1)+1\) activations stored in FIFOs on the second IPU

  • and so on.

_images/image3.png

Fig. 4.2 Interleaved pipeline schedule.

Conversely, the interleaved schedule (Fig. 4.2) uses less memory to store the activations but may be slower, since a backward pass, which is generally longer than a forward pass, is executed at every stage of the pipeline.

Interleaved scheduling requires storing \(N+1\) FIFOs of activations.

Pipelining generally comes with recomputation of activations, which is described in Section 5.3, Activation recomputations.

4.3. Data parallelism

In cases where the model is small enough, but the dataset is large, the concept of data parallelism may be applied. This means that the same model is loaded onto each IPU, but the data is split up between IPUs. Data parallelism is the most common way to perform distributed training.

Good for:

  • Small-ish models (that fit on a few IPUs)

Not recommended:

  • Heavy host IO and preprocessing

4.3.1. Graph replication

Data parallelism is achieved by replication of the Poplar graph.

Each replica operates on a micro batch. During inference, the results from each replica are enqueued in the Poplar outfeed queue and sent to the host.

During training via backpropagation, the gradients \(\{g_{i}^{n}\}\) are calculated independently on each replica \(n\) and then reduced across all replicas before performing the weight update.

_images/image4.png

Fig. 4.3 Calculation of gradients across replicas and weight updates during training.

Note that the weights are not explicitly shared, yet they remain consistent (\(w_{i}^{0} = w_{i}^{1} = \ldots = w_{i}^{N}\)) across all replicas because every replica applies the same weight updates. Therefore, the weights must be initialised to the same values on each replica.
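The following toy NumPy sketch illustrates this consistency property: each replica computes gradients on its own micro-batch, the gradients are reduced across replicas (a simple mean here), and every replica applies the same update. The replica count, learning rate and tensor shapes are arbitrary and purely illustrative.

    import numpy as np

    num_replicas, lr = 4, 0.1
    rng = np.random.default_rng(0)
    weights = [np.ones(3) for _ in range(num_replicas)]              # identical initial weights

    local_grads = [rng.normal(size=3) for _ in range(num_replicas)]  # per-replica gradients
    reduced = np.mean(local_grads, axis=0)                           # cross-replica all-reduce

    weights = [w - lr * reduced for w in weights]                    # same update on every replica
    assert all(np.allclose(w, weights[0]) for w in weights)          # weights stay consistent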

Replication typically spans multiple IPUs, IPU-Machines or even IPU-POD systems. The cross-replica reduction adds communication between IPUs and therefore a performance overhead. The relative impact of this communication overhead can be reduced by using a larger replica batch size, typically by increasing the gradient accumulation factor.

Using replication also increases the size of the exchange code, roughly linearly with the number of replicas, as illustrated in Table 4.1. Note that increasing the number of replicas from 1 to 4 (in Table 4.1) increases the size of the global exchange code by about a factor of 4. In this application, the global exchange code is still negligible compared with the internal and host exchange code.

Table 4.1 BERT-Large code sizes with 1 and 4 replicas.

  Code type              | 1 replica         | 4 replicas
  Control code           | 91,668,904 bytes  | 117,063,756 bytes
  Internal exchange code | 108,270,156 bytes | 126,520,640 bytes
  Host exchange code     | 66,044,276 bytes  | 66,147,360 bytes
  Global exchange code   | 1,035,988 bytes   | 4,657,516 bytes

4.3.2. Multiple SDK instances and replication: PopDist

One challenge that may arise when using graph replication is that throughput stops increasing as new replicas are added, because of the host’s limited compute power and bandwidth. For example, the model can become CPU-bound by JPEG decoding or data augmentation on the host.

_images/single_instance_host.png

Fig. 4.4 A single instance of the host Poplar SDK feeding into multiple replicas of the graph.

This situation can be improved by using PopDist, which distributes the host-side processes across multiple processors or even multiple hosts. PopDist relies on MPI to seamlessly allow graph replication to use multiple hosts and instances. Two common improvements are:

  • Increasing the number of host processes (which share the same IO connectivity), if the application is compute bound by the host preprocessing

  • Increasing the number of hosts (horizontal scaling), if the application is also IO bound.

In Fig. 4.5, each process is an instance of the host code run by PopRun and is connected to a single instance of the replicated model. The preprocessing and IO communications happen inside the host processes.

_images/multi_hosts_multi_instances.png

Fig. 4.5 With PopDist, multiple instances of the programs, possibly distributed over multiple hosts, can feed into multiple replicas of the graphs.

4.4. Host-IPU IO optimisation

Good for:

  • IO-bound models

Not recommended:

  • Model memory requirements are close to the IPU maximum memory

How memory external to the IPU is accessed affects performance. Data is transferred in streams: the IPU uses a callback function to indicate to the host that it can read data from the buffer (for IPU-to-host transfers) or populate the stream (for host-to-IPU transfers). For more information about data streams, refer to the Poplar and PopLibs User Guide.

There are three options that can be used to maximise IO performance:

  • Prefetch depth

  • I/O tiles

  • Reducing the bandwidth required, for example, by using 8-bit integers or FP16 transfers.

4.4.1. Prefetch and prefetch depth

Enabling prefetch means that the IPU calls the callback function as early as possible (for example, immediately after it releases the stream buffer from a previous transfer). The host can then fill the buffer in advance of the transfer, meaning the IPU spends less time waiting for the host because the data is available before it is needed. The prefetch depth controls how much data is fetched in advance; in practice, a prefetch depth of 3 often works well.

4.4.2. Overlapping I/O with compute

Designating a number of IPU tiles as I/O tiles allows the Poplar graph to be constructed so that data transfer and computation can overlap in time.

_images/image7.png

Fig. 4.6 No I/O tiles designated. The StreamCopy happens before execution of the model code.

_images/image8.png

Fig. 4.7 Using I/O tiles. StreamCopy happens in parallel to the execution of the model code.

Setting the number of I/O tiles is left up to you and depends on the application. The I/O tiles must hold at least a full micro-batch of data, and not all of the memory on an I/O tile is available for data (typically ~5% is used for the exchange code and buffers). Thus a rule of thumb for the number of I/O tiles (\(N_{IO}\)) to allocate is:

\(N_{IO} \approx 1.05 \cdot \frac{S_{micro\_ batch}}{S_{tile}}\)

where \(S_{micro\_ batch}\) is the size (in bytes) of a micro-batch of data and \(S_{tile}\) is the memory per tile (624 kB on a Mk2 IPU).
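A minimal sketch of this rule of thumb, assuming the micro-batch size is known in bytes (the example batch of 8 uint8 images of 224x224x3 is an arbitrary illustration):

    import math

    def num_io_tiles(micro_batch_bytes: int, tile_bytes: int = 624 * 1024) -> int:
        # ~5% of each I/O tile is assumed to be used by exchange code and buffers
        return math.ceil(1.05 * micro_batch_bytes / tile_bytes)

    print(num_io_tiles(8 * 224 * 224 * 3))   # a micro-batch of 8 uint8 images -> 2 tiles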

It may be more optimal if the number of I/O tiles is a power of two, so sweeping over power-of-two values is recommended. Note that, as of SDK 2.3, I/O will only overlap with computation for a single-IPU application or a pipelined application using the grouped schedule. Also, for a multi-replica or multi-instance configuration, host multi-threading must be configured with the Poplar engine runtime options:

streamCallbacks.multiThreadMode: collaborative
streamCallbacks.numWorkerThreads: 6

4.4.3. Data size reduction

Instead of transferring the data in single-precision floating point (FP32) format, it is possible to improve data throughput by transferring the data in half-precision floating point (FP16) or, if possible, in Int8 format. The latter is especially relevant for images, where data normalisation/scaling happens on the accelerator instead of the host. Similarly, the returned data can be reduced if it requires a large portion of the bandwidth.
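A toy sketch of the idea, using NumPy only: the host sends images as 8-bit integers and the cast and normalisation happen on the device. The function names are illustrative placeholders; device_side_normalise stands in for an operation that would actually run on the IPU.

    import numpy as np

    def host_side(batch_fp32: np.ndarray) -> np.ndarray:
        # 4x less host-IPU bandwidth than sending FP32
        return batch_fp32.astype(np.uint8)

    def device_side_normalise(batch_u8: np.ndarray) -> np.ndarray:
        # cast and scale on the accelerator instead of the host
        return (batch_u8.astype(np.float16) - 128.0) / 128.0

    images = (np.random.rand(8, 224, 224, 3) * 255).astype(np.float32)
    sent = host_side(images)                    # uint8 over the wire
    on_device = device_side_normalise(sent)     # FP16 tensor ready for the model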

4.5. Host-side processing optimisations

4.5.1. Host and IPU preprocessing

You can improve your data pipeline with caching and more efficient preprocessing, for example, normalisation of the data, or you can move some of the preprocessing directly to the IPU. This makes sure that data is generated fast enough.

4.5.2. The one-computational-graph concept

The Graphcore frameworks extract the computational graph from frameworks such as PyTorch and TensorFlow and post-process it to obtain a highly efficient computational graph that is executed on the IPU. Since the IPU compilation considers the whole graph and not just single operators, this compilation process can take longer. Thus, we provide features like executable caching in the different frameworks to make sure that compilation happens rarely and loading the executable onto hardware takes only a few seconds.

However, when developing the application code, it is important to avoid changing the computational graph, because any change triggers a recompilation the first time it occurs and an executable switch thereafter. Hence, it is preferable that there is only one computational graph. A common example that creates more than one graph is a change of batch size: with a different batch size, data and compute need to be distributed differently over the IPU tiles, so a new optimisation is required. Further examples are provided in the technical note Optimising for the IPU: Computational Graph Recompilation and Executable Switching in TensorFlow. Similarly, eager execution breaks the full graph down into multiple subgraphs, which should be avoided as described in the TensorFlow 2 user guide.

4.5.3. Looping

Instead of processing sample by sample, it is beneficial to loop over streams of samples. In this way, data is continuously transferred between host and device, and the host-side overhead of repeatedly starting and stopping execution is avoided. This happens automatically in most of our interfaces, for example, in the TensorFlow Keras interface or our PyTorch interface. However, in some TensorFlow applications, processing still needs to happen sample by sample. In these cases, explicit loops should be used, as shown in the TensorFlow 2 porting examples.
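As a hedged PopTorch sketch of looping on the device: with a device iteration count set, each call to the wrapped model runs many micro-batches in one on-device loop rather than returning to the host after each one. The option name (Options().deviceIterations) is quoted from memory and should be checked against the PopTorch documentation for your SDK version.

    import poptorch

    opts = poptorch.Options()
    opts.deviceIterations(128)   # process 128 micro-batches per host call

    # inference_model = poptorch.inferenceModel(my_torch_model, options=opts)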

4.6. Optimising numerical precision

Good for:

  • Most deep learning models

Not recommended:

  • May need some granularity to avoid loss of precision in critical parts of the model

Reducing the numerical precision is a common way to improve the training or inference performance of a model. This uses less memory for all operations, increases the bandwidth of data exchanges, and reduces the number of cycles required by arithmetic operations.

Note that numerical stability issues may arise when reducing the numerical precision. While resolving these issues is beyond the scope of this document, some key ideas are:

  • Using loss scaling when the gradients are calculated in reduced precision. This limits the risk of underflow when gradient values are too small to be accurately represented (see the sketch after this list).

  • Enabling stochastic rounding to improve the average numerical accuracy, in particular during weight update.

  • Analysing the distributions of the values of the weights and gradients in all layers of the networks, and locally increasing the precision of some critical operations.
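The loss-scaling idea mentioned above can be illustrated with a toy NumPy example: a gradient that underflows to zero in FP16 stays representable if the loss (and hence the gradient) is scaled up, and the scale is divided out again before the weight update. The values are chosen only to demonstrate FP16 underflow.

    import numpy as np

    grad = np.float32(1e-8)                        # too small for FP16
    print(np.float16(grad))                        # 0.0: the gradient underflows

    loss_scale = np.float32(2 ** 15)
    scaled_grad = np.float16(grad * loss_scale)    # gradient of the scaled loss
    print(scaled_grad)                             # ~3.3e-4, representable in FP16

    update = np.float32(scaled_grad) / loss_scale  # unscale before the weight update
    print(update)                                  # ~1e-8 recovered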

For more information on numerical precision on the IPU, refer to the AI Float White Paper.

4.7. Replicated tensor sharding

Good for

Very large models

Large number of optimiser states

Not recommended

Models without data parallelism

Exchange bandwidth already constrained by data IO

Replicated tensor sharding is only available in combination with graph replication. With replicated tensor sharding, a tensor that would otherwise contain the same information on all replicas is sharded (that is, broken up into equal sub-tensors) across all replicas of a graph. This frees up memory on each replica, but requires re-assembling the tensor when it needs to be used.

This feature can be used with tensors stored both on chip (in IPU SRAM) and off chip (in Streaming Memory).

_images/image9.png

Fig. 4.8 Tensor memory layout before sharding

_images/image10.png

Fig. 4.9 Tensor memory layout after sharding

Replicated tensor sharding is especially useful for sharding the optimiser states across replicas. This enables you to apply the optimiser update in parallel across many replicas. Each optimiser update then only acts on its corresponding shard of the parameters before reducing over the replicas. This requires additional operations and so is more beneficial on large, distributed models, where the trade-off between these operations is outweighed by the benefits of parallelism.

This method is similar to Microsoft’s Zero Redundancy Optimizer (ZeRO).
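The following toy NumPy sketch shows the principle (not the Poplar implementation): each replica keeps only its shard of the otherwise-duplicated optimiser state, updates its shard of the parameters, and the updated shards are then gathered. The replica count, shapes and SGD-with-momentum update are arbitrary illustrations.

    import numpy as np

    num_replicas = 4
    params = np.arange(16, dtype=np.float32)             # identical on every replica
    grads = np.ones_like(params)                         # already reduced across replicas

    param_shards = np.split(params, num_replicas)        # each replica owns one shard
    momentum_shards = [np.zeros_like(s) for s in param_shards]

    updated_shards = []
    for r in range(num_replicas):                        # each replica updates only its shard
        g = np.split(grads, num_replicas)[r]
        momentum_shards[r] = 0.9 * momentum_shards[r] + g
        updated_shards.append(param_shards[r] - 0.1 * momentum_shards[r])

    params = np.concatenate(updated_shards)              # all-gather the updated parameters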

4.8. Tile mapping

A poor tile mapping can lead to strided copying. Strided copying arises when either the memory access pattern or the location of data in memory is non-contiguous, for example, when you are transposing data. This leads to slower performance and has a considerable code cost.

There are two ways you can identify that strided copying exists:

  • Use the PopVision Graph Analyser tool to examine the list of vertex types. Strided copies will have names of the form DstStridedCopy and there will be many of them.

    • PopVision allows you to look at the program tree to track down where the vertices are used.

  • The execution graph shows many PreArrange steps, like PreArrange-0 and PreArrange-1.

    • These indicate on-tile copy operations which rearrange data. The data is rearranged so that it has the correct layout for input to vertices, for example, for alignment and to make the input contiguous in memory.

Strided copies could also be the result of suboptimal mapping by the framework. One way to address them directly in the framework is to reverse the operands or to use PopLibs functions directly as part of custom operators. Refer to the technical note Creating Custom Operations for the IPU for details.

_images/image11.png

Fig. 4.10 PopVision execution trace showing PreArrange steps taking up significantly more cycles than the operation itself. In this example the operation itself takes about 580 cycles while 5800 cycles are needed to re-arrange the tensors. (Note that Group Executions must be disabled in the Options of the Execution Trace plot.)

4.9. Other execution schemes

Good for:

  • Very large models

  • Very large code

  • Models with interdependencies between pipeline stages

Not recommended:

  • This is generally not supported in high-level ML frameworks

Other execution schemes, such as domain sharding and TensorParallel execution, can be considered as a “last resort” as they are not easily implemented at the application level.

5. Common memory optimisations

5.1. Available memory proportion tuning

The availableMemoryProportion parameter controls the amount of memory allocated for temporary data while an operation is executing (for example, intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of the available memory on the IPU and is used by the PopLibs convolution planner to set the upper bound on the amount of temporary memory in the first pass of the planner. Note that this should affect only the temporary memory, so it can be used to reduce not-always-live memory spikes.

The effect of the available memory proportion may not be visible in the Graph Analyser when a model goes out of memory. Only the first stage of the convolution planner uses the available memory proportion: if the operations cannot be made to fit in the first stage, the planner disregards the available memory proportion value and creates a generic plan that is not bounded by this memory limit.

The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU specifically addresses the tuning of the availableMemoryProportion parameter.
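As a hedged example, lowering the proportion in PopTorch might look like the following. The option name and signature (Options().setAvailableMemoryProportion taking a mapping from IPU name to proportion) are quoted from memory and should be verified against the PopTorch documentation; other frameworks expose the same Poplar option through their own APIs.

    import poptorch

    opts = poptorch.Options()
    # Allow convolution/matmul planning to use at most 30% of each IPU for temporary data
    opts.setAvailableMemoryProportion({"IPU0": 0.3})

    # training_model = poptorch.trainingModel(my_torch_model, options=opts, optimizer=my_optimizer)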

5.2. Partials type

Partials are temporary variables, generally accumulators, used by the accumulating matrix product units to hold intermediate results during matmul and convolution operations. These intermediate values contribute to the not-always-live memory use, as the partials are created during the forward pass and reused during the backward pass. Partials are stored in FP32 by default.

Using FP16 for the partials halves the temporary memory used by the accumulating matrix product units, at the expense of less accurate numerical results. In practice, if the weights are stored in FP16, reducing the partials precision to FP16 does not generally affect model stability or accuracy.

Reducing the partials precision also improves performance (see Table 5.1 in the AI-Float™ white paper): each time the numerical precision is halved, the execution speed of matmul and convolution operations is doubled.
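As a hedged sketch, selecting FP16 partials in PopTorch might look like the following; the API name (Options().Precision.setPartialsType) is an assumption from memory, and other frameworks expose an equivalent partialsType option for convolutions and matmuls.

    import torch
    import poptorch

    opts = poptorch.Options()
    opts.Precision.setPartialsType(torch.half)   # accumulate matmul/conv partials in FP16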

5.3. Activation recomputations

In a neural network trained using backpropagation, the activations calculated during the forward pass need to be stored so that they can be reused in the backward pass to calculate the gradients with respect to the activations. This is problematic because storing these activations during the forward pass uses always-live memory, and the amount of memory grows linearly with the micro-batch size.

5.3.1. Activations recomputation and memory use

Activations are stored as FIFO variables and are shown under the Not Always Live Variables section in PopVision Graph Analyser.

As an illustration, Table 5.1 compares the FIFO variable sizes of the BERT Large model with batch size 1 and batch size 8.

Table 5.1 Comparison between the FIFO variable sizes of a BERT Large model with batch sizes 1 and 8.

  Variable                | Batch size 1 | Batch size 8
  /fifo.12/buffer/0 (2/1) | 2.0 MB       | 16.0 MB
  /fifo.13/buffer/0 (2/1) | 2.0 MB       | 16.0 MB
  /fifo.35/buffer/0       | 1.5 MB       | 12.0 MB
  /fifo.36/buffer/0       | 1.5 MB       | 12.0 MB

One way to reduce the amount of always-live memory needed is to not store the activations during the forward pass and to instead recompute them “on-the-fly” as they are needed during the backward pass. The trade-off with recomputation is that we use less memory, but we have more execution cycles in the backward pass (Fig. 5.1).

_images/activations_recomputation.png

Fig. 5.1 Recomputation of activations increases the number of operations. Blue = forward, cyan = forward recomputation, red = backward.

In models that are memory intensive (and most deep-learning models are), using recomputation to save memory is more valuable than the extra execution cycles. For example, BERT Large would not fit in memory without recomputation (Fig. 5.2 and Fig. 5.3).

_images/image13.png

Fig. 5.2 Total memory use in a model without recomputation of activations (blue) and with recomputation (red).

_images/image14.png

Fig. 5.3 The change in not-always-live memory for a model without recomputation (green) and for a model with recomputation (blue).

5.3.2. Recomputation checkpoints

Recomputation checkpoints are points in the computational graph where all the activations are stored. Activations after that point are recomputed from the previous checkpoint values, and stored in not-always-live memory, until the next checkpoint. Checkpoint positions in the graph can be set automatically by the frameworks or inserted manually. When you pipeline a model, checkpoints are automatically inserted at the beginning of each pipeline stage.

The careful introduction of recomputation checkpoints, whether a model is pipelined or not, significantly saves always-live memory, since the amount of memory saved corresponds to all the activation FIFOs between two checkpoints.

The trade-off is an increase in compute cycles: effectively, between two recomputation checkpoints, the backward pass is replaced by a forward pass plus a backward pass. As a rule of thumb, a backward pass takes 2 to 2.2 times as many cycles as a forward pass.

In inference mode, intermediate activations do not need to be stored, so recomputation is not required.
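A hedged sketch of manually inserting a recomputation checkpoint, as mentioned above, using PopTorch as an example. The function name (poptorch.recomputationCheckpoint) is quoted from memory and should be checked against the PopTorch documentation; other frameworks provide equivalent checkpointing hooks.

    import torch
    import poptorch

    class Block(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = torch.nn.Sequential(*[torch.nn.Linear(512, 512) for _ in range(4)])

        def forward(self, x):
            x = self.layers(x)
            # Activations are stored here; those computed after this point are
            # recomputed from this checkpoint during the backward pass
            return poptorch.recomputationCheckpoint(x)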

5.4. Variable offloading

The optimiser state parameters (for example, the first and second moments in Adam) are always present in memory during training. You can offload these to the host to save IPU memory. The optimiser states are only transferred once at the beginning (from host to IPU) and once at the end (from IPU to host) of the weight update step, so the communication penalty is much smaller than for transferring model parameters.

Offloading variables to the host will create some exchange code, the size of which is difficult to estimate. It will also use temporary (not always live) data memory in order to store the input and output buffers from and to the host.
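As a hedged PopTorch sketch of offloading the optimiser state to Streaming Memory; the option names (Options().TensorLocations.setOptimizerLocation and TensorLocationSettings().useOnChipStorage) are assumptions from memory and should be verified against the PopTorch documentation for your SDK version.

    import poptorch

    opts = poptorch.Options()
    opts.TensorLocations.setOptimizerLocation(
        poptorch.TensorLocationSettings().useOnChipStorage(False)   # keep optimiser state off chip
    )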

5.5. Graph outlining

You can reuse identical code in different parts of the computational graph by outlining the code. This will keep only one copy of the code per tile and the graph will contain multiple call operations to execute the code. Without outlining, identical code at different vertices is always duplicated, taking up memory.

Note: You can think of outlining as the opposite of in-lined functions in C++.

In a Poplar graph, the mapping of the tensors to the vertices is static. When a section of code is outlined, its input and output tensors are statically mapped. This means that when the outlined code is called at different points of the computational graph, the inputs and outputs at that point must be copied into and out of the locations of the statically mapped tensors in the outlined code. These copy operations require more always-live memory, and a penalty of using outlining is that more cycles are required.

We will illustrate how outlining works on an example using BERT transformer layers.

In the original, non-outlined program, most of the code is under the top-level control program and there are only two additional functions (Fig. 5.4).

_images/image15.png

Fig. 5.4 Program tree in PopVision

Looking into the control program code, we can see that the self-attention blocks are repeated multiple times in the program tree (Fig. 5.5, Fig. 5.6). These self-attention blocks operate on large matrices and can take a large amount of program memory.

_images/image16.png

Fig. 5.5 First transformer layer

_images/image17.png

Fig. 5.6 Second transformer layer

And similarly for the 22 remaining layers of BERT Large. The code required for this self-attention operation is thus duplicated many times, and this increases the total code memory use.

When the self-attention block is outlined, the outlined section of code is now an independent Poplar function. Looking at the control program, we can see the self-attention blocks of code have been replaced by a call to a Poplar function (Fig. 5.7, Fig. 5.8).

_images/image18.png

Fig. 5.7 First transformer layer with outlining

_images/image19.png

Fig. 5.8 Second transformer layer with outlining

The “target” argument is the identifier of the Poplar function being called; in this case, Function 3 is being called. Looking into Function 3, called from Call(target: 3), we see the self-attention block (Fig. 5.9).

_images/image20.png

Fig. 5.9 Details of self-attention block after outlining

The function exists only once in memory and is re-used for multiple input and output tensors. Poplar adds copy operations to the graph so that different variables (tensors) can be used as inputs to the single function. The developer might expect the function to use pointers to eliminate the need for copies; however, this is not possible in the static graph model of Poplar, so outlining saves always-live code memory but adds copies at run time.

_images/image21.png

Fig. 5.10 Execution graph (from PopVision) with outlining

_images/image22.png

Fig. 5.11 Execution graph (from PopVision) without outlining

The self-attention block spans many execution cycles (in Fig. 5.10 only the start of the self-attention block is shown), so the execution cycles added by the extra copy operations for the outlined code do not increase the total execution cycles significantly.