6. Debugging an out-of-memory exception

Sometimes, you will try to run a program on the IPU and find that there is not enough memory to execute your program. In this section, we will discuss how you can resolve this.

6.1. Identifying when you’ve run out of IPU memory

All memory allocation on the IPU is done when the program is compiled. If the Poplar compiler cannot allocate enough memory for your program, the compiler will throw an exception. This exception will be of type graph_memory_allocation_error, though this may not be reported if you are using a higher-level framework.

The most common error message looks something like this:

Memory allocation error : Out of memory on tile 0: 876476 bytes used but tiles only have 638976 bytes of memory

Setting the Poplar engine option debug.allowOutOfMemory to true allows compilation to continue even once it has been detected that the program cannot fit in memory, so that a profile can still be generated. This option is turned on by default when profiling is enabled. If this option is enabled, the following error message is displayed:

Out of memory: Cannot fit all variable data onto one or more tiles.

See Section 6.3, Profiling the model for further details of how these engine options are set and used.
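
For example, to opt in to this behaviour without turning on full profiling, you can set the option through the same POPLAR_ENGINE_OPTIONS environment variable described in Section 6.3 (a minimal example; the script name is illustrative):

POPLAR_ENGINE_OPTIONS='{"debug.allowOutOfMemory":"true"}' python training.py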

There are some other error messages that can be shown when there is not enough memory on the IPU. For example, the following error occurs when there is an exchange in which more memory is sent to tile 0 than there is memory on tile 0, and so there is no way the exchange could possibly be executed:

Tile 0 receives more data than it has total memory in exchange 'cs19_98/scatterAdd/multiUpdateAdd_ExchangePre'

Here is another example, which is usually shown when more code needs to be stored on a tile than there is memory available:

tile 0 _poplar_start must be 0 not 638976 bytes from the start of memory. Typically this error occurs when there is too much code to fit in code memory.

There are some error messages that say some memory limit has been exceeded, but actually refer to a memory limit other than the amount of memory available on the IPU itself. For example, the following message refers to host buffer memory, not the memory on the IPU’s tiles.

Buffer 1 needs to be at least 2616922112 Bytes, but remaining host buffer memory is only 268435456

We do not cover how to deal with such issues here.

6.2. Memory limits on the IPU

A single Mk2 IPU has 1472 tiles, each of which has 624 KiB of memory, for a total of 897 MiB across the processor.

These values can be useful for guiding your decisions. For example, suppose you have a model with one billion weights. Even if these weights are stored in float-16, they would require 2 GB of memory, and therefore you would need more than two Mk2 IPUs just to have enough memory for the weights. It’s also a good idea to check your model for any intermediate tensors that would require more memory than there is on a single IPU.
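
If you want to make this kind of back-of-envelope check programmatically, a minimal Python sketch might look like the following (the parameter count is illustrative, and this is only a lower bound: real programs also need memory for code, activations and optimiser state):

# A rough lower bound: the weights alone, ignoring code, activations
# and optimiser state, all of which also consume memory.
TILES_PER_IPU = 1472
BYTES_PER_TILE = 624 * 1024                   # 624 KiB per tile
IPU_MEMORY = TILES_PER_IPU * BYTES_PER_TILE   # ~897 MiB per Mk2 IPU

num_weights = 1_000_000_000                   # illustrative parameter count
bytes_per_weight = 2                          # float-16

weight_bytes = num_weights * bytes_per_weight
ipus_needed = -(-weight_bytes // IPU_MEMORY)  # ceiling division
print(f"weights: {weight_bytes / 2**30:.2f} GiB, "
      f"at least {ipus_needed} IPUs for the weights alone")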

However, you should not try to use these values to make exact calculations before running a program. The exact amount of memory an IPU program requires is hard to predict, partly because it includes the binary code that runs on the IPU, and partly because liveness constraints affect how allocations can share memory. In general, the best way to determine whether a program will go out of memory is to try to compile it for the IPU.

Refer to Section 3.2, Understanding the memory mapping of a computational graph, Section 3.3, Always-live and not-always-live memory, and Section 3.4, Tensor variables memory use for more details about memory usage on the IPU.

6.3. Profiling the model

By profiling your model, you can collect information about its memory usage and liveness properties. This information can be displayed and explored visually using the PopVision Graph Analyser. For full details of how to use the PopVision Graph Analyser, refer to the PopVision Graph Analyser User Guide.

6.3.1. Enabling profiling

To profile the memory consumption of your model, use the following environment variable:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./profile", "autoReport.outputExecutionProfile":"false"}'

The value passed to POPLAR_ENGINE_OPTIONS is a JSON string, so the double quotes inside the braces are required; the single quotes around the whole value stop the shell from interpreting its contents.

Setting autoReport.all to true enables all profiling options. A list of all profiling options can be found under “Engine creation options: Report generation” in the documentation for the Poplar Engine class.

In this case, we have also disabled execution profiling by setting autoReport.outputExecutionProfile to false. Execution profiling adds instrumentation code to your IPU program to record the duration of each step; this uses tile memory itself and can therefore obscure why your program is going out of memory. Without execution profiling, profiling uses no extra memory on the IPU. You can use the key-value pair "autoReport.outputExecutionProfile":"true" instead if you want to keep execution profiling enabled (for example, to see the effect of execution profiling on the memory usage of your program).

To use these options, prepend the variable definition to the command that runs your program:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./profile", "autoReport.outputExecutionProfile":"false"}' python training.py
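
Alternatively, you can set the variable from inside the script itself, provided this happens before the framework compiles the program. A minimal sketch:

import os

# Must be set before the framework creates the Poplar engine, so place
# this at the very top of the script, before importing the framework.
os.environ["POPLAR_ENGINE_OPTIONS"] = (
    '{"autoReport.all":"true",'
    ' "autoReport.directory":"./profile",'
    ' "autoReport.outputExecutionProfile":"false"}'
)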

You can change the directory where the profile is stored by changing the value of the autoReport.directory option:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./profile_inference", "autoReport.outputExecutionProfile":"false"}' python inference.py

You may need to take more care when profiling if you are running a distributed application with PopDist and PopRun.

6.3.2. Using offline compilation to reduce IPU usage when profiling

It may be useful to compile an executable for the IPU offline, so that you can test whether the model runs out of memory without occupying an IPU that other programs could be using.

Each framework provides options for compiling offline; refer to the documentation for your framework for details. As an illustration, one possible approach in PopTorch is sketched below.
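
This sketch assumes PopTorch's Options.useOfflineIpuTarget and the compile method on the wrapped model; check the documentation for your framework and SDK version for the exact API. The model here is a stand-in for your own:

import torch
import torch.nn as nn
import poptorch

model = nn.Linear(1024, 1024)        # stand-in for your real model

opts = poptorch.Options()
opts.useOfflineIpuTarget()           # compile without attaching to an IPU

poptorch_model = poptorch.inferenceModel(model, opts)
# Compilation is where memory is allocated, so an out-of-memory error
# surfaces here if the model does not fit.
poptorch_model.compile(torch.randn(8, 1024))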

6.3.3. Using the PopVision Graph Analyser

The PopVision Graph Analyser can be used to explore the data captured when you profile your program. The application can be downloaded from the PopVision microsite.

The tabs which are most likely to be of use for resolving an out-of-memory error are:

  • the Memory Report, which shows how much memory is used on each tile

  • the Liveness Report, which shows how much memory is occupied by live variables at each step of the program

For a more comprehensive guide to using the PopVision Graph Analyser, see the PopVision Graph Analyser User Guide.

6.4. Deciding what to do

There are quite a few different techniques available for reducing memory usage, and so it can be difficult to decide what to do.

A good starting point is to reduce the batch size. Most reference implementations of models are written for processors whose optimal batch size is larger than the optimal batch size for the IPU, which means that many models consume much more memory than is needed for effective training. See Section 5.6, Reducing the batch size for details.

If you find that your model still does not fit even after reducing the batch size as far as you reasonably can, you should profile your program (see Section 6.3, Profiling the model) and determine which parts of it are taking up the most memory.

As a rule of thumb, if your program still requires more than twice the memory available on an IPU after you have reduced the batch size as much as possible, you should start to look at splitting up your model.

6.4.1. Tile and IPU memory balance

In some cases, you may find that you have enough memory overall for your program, but one or a few tiles require much more memory than the others. This may be because you have hit an edge case in the PopLibs implementation of an underlying operation. Cases like these should, in general, be reported to Graphcore’s support team.

You may be able to get around this issue by identifying the operation which is causing the problem and implementing it yourself in Poplar as a custom op. See Section 5.7, Writing a custom operation for details.

If you are running a program over multiple IPUs, and there is an imbalance in the memory used between different IPUs, you may need to reconsider the way that you are splitting your model and what values you are checkpointing for recomputation.
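
In PopTorch, for example, pipeline stages are assigned to IPUs by wrapping layers in poptorch.BeginBlock, so moving layers between blocks moves memory between IPUs. A minimal sketch, with hypothetical stages:

import torch
import torch.nn as nn
import poptorch

class PipelinedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Hypothetical split; move layers between the blocks to
        # rebalance memory between the two IPUs.
        self.stage1 = poptorch.BeginBlock(nn.Linear(1024, 1024), ipu_id=0)
        self.stage2 = poptorch.BeginBlock(nn.Linear(1024, 1024), ipu_id=1)

    def forward(self, x):
        return self.stage2(self.stage1(x))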

6.4.2. Techniques by liveness of memory

In this section, we discuss techniques for reducing memory usage according to whether the memory reduced is not always live or always live.

6.4.2.1. Reducing not-always-live memory

There are many techniques for reducing the not-always-live memory requirements of your model. You could try using float-16 computations or partials, recomputing activations in the backwards pass, or reducing the available memory proportion for convolutions. All of these are discussed in Section 5, Common memory optimisations.
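
For instance, here is a sketch of lowering the available memory proportion in PopTorch, assuming the setAvailableMemoryProportion option on poptorch.Options (see Section 5 for the details of the trade-off):

import poptorch

opts = poptorch.Options()
# Plan convolutions and matmuls on IPU 0 to use at most 10% of tile
# memory for temporary values, trading speed for a smaller footprint.
opts.setAvailableMemoryProportion({"IPU0": 0.1})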

6.4.2.2. Reducing always-live memory

Dealing with an excess of always-live memory is generally more difficult. It may be the case that the parameters of your model are using too much memory, in which case your choices are limited to storing them in float-16 (which may be unstable, especially without also using stochastic rounding) or increasing the number of IPUs that you are using. If code is taking up a lot of memory, it may be useful to outline certain parts of the model so that code can be reused. See Section 5.5, Graph outlining for full details.

With some optimisers, additional variables are used to maintain running statistics of the gradients. These variables can take up a lot of memory, and this memory must always be live. In this case, you may find it useful to use variable offloading to reduce the amount of memory taken up by these variables. See Section 5.4, Variable offloading for full details.
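
As an illustration, here is a sketch of offloading optimiser state in PopTorch, assuming the TensorLocations API on poptorch.Options (see Section 5.4 for the authoritative details):

import poptorch

opts = poptorch.Options()
# Keep optimiser state in Streaming Memory (off chip), so it is only
# brought onto the IPU when the optimiser step needs it.
opts.TensorLocations.setOptimizerLocation(
    poptorch.TensorLocationSettings().useOnChipStorage(False)
)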

In some cases, you can afford to swap out your optimiser for one that requires less memory for these variables. For example, you could use stochastic gradient descent with momentum instead of Adam.
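
Adam keeps two extra state tensors per parameter, while SGD with momentum keeps one, roughly halving the always-live optimiser memory. A minimal PyTorch sketch of the swap (the model is a stand-in for your own):

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for your real model

# optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)  # two state tensors per parameter
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # one state tensor per parameter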