1. Introduction

When code is executed on an IPU, a multi-operation computation graph is compiled to run efficiently on the device.

This compilation ensures that the code running on the IPU is optimal: as many tiles as possible are used, as little device memory as possible is used, and the number of execution cycles is kept as low as possible. Note that, in contrast to some other platforms, the graph to be compiled is not just a single operation such as a matmul, but many consecutive operations, so almost every graph is different and will need to be compiled and optimised.

The compilation process performs many optimisations and so can take some time. It is therefore important to know when the compilation of the graph will happen, and to avoid it occurring at inconvenient times or too often. This is especially relevant when running benchmarks, since compilation can add significant overhead.

As a result, it is important to avoid recompilations as far as possible.

2. Consideration 0: Avoiding recompilation

To avoid recompiling the same code every time a TensorFlow process is started, you can turn on caching of the executable. Each generated file is identified by a 64-bit hash value.

Caching is enabled by setting the option --executable_cache_path to a directory where the compiled files will be stored. For example:

export TF_POPLAR_FLAGS="--executable_cache_path=/mnt/data/USERNAME/ipu_cache/"

See Caching of compiled executables in the TensorFlow user guide for more information.
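If it is more convenient to set the flag from Python than from the shell, a minimal sketch is shown below (the cache path is a placeholder; the flag should be in place before the IPU backend is initialised, so setting it before importing TensorFlow is the safest option):

import os

# Placeholder cache directory; replace with a path you can write to
os.environ["TF_POPLAR_FLAGS"] = "--executable_cache_path=/mnt/data/USERNAME/ipu_cache/"

import tensorflow as tf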

However, there are several other cases that can still cause recompilation even when the cache is active. To detect recompilation, set the logging level as shown:

export POPLAR_LOG_LEVEL=INFO

Then look for “Starting phase: graphConstruction”. It is also worth checking for repeated “Loading executables” messages, for example:

Ending phase: Loading executable (duration 607 ms; diff RSS: 0.586426 MB)

These messages might occur in large numbers at the beginning, but should not occur after a warm-up phase. As the example log message above shows, each load can take a significant amount of time (more than 500 ms in this case), so frequent loads during a run add considerable overhead.

Note that there might be some executable load messages in the log at the beginning (for initialisation) and at the end (for returning the final results). These should not cause any problems. Apart from those, executable load messages are not desirable and should be investigated.

Another useful logging option is to set the environment variable:

TF_CPP_VMODULE=poplar_compiler=1

This allows you to view log messages at the Poplar compilation level.

Then look for “Begin XLA compilation: … (Hash 0x…)”. The hash value uniquely identifies the compilation. If the hash values are the same for different runs, the compiled graphs are also the same, and the executable will be loaded if it has been saved previously. Otherwise, recompilation will occur.

Note that when executable caching is enabled, one of the following messages will be observed:

“Loaded ModelName from CachedExecutableFilename” if the executable is found, or “Couldn’t find CachedExecutableFilename” if it is not.

3. Consideration 1: Computational graphs

If different processes run workloads other than initialisation, make sure that they run the same graph. Otherwise, executables will be loaded onto the IPU repeatedly. This will be visible in your logging, together with the respective time measurements.

If you have different workloads, try to either put them together into one graph or distribute the graphs onto different IPUs.

Relying solely on the with ipu_scope("/device:IPU:0"): statement has a high chance of creating different computational graphs. A better approach is to combine the whole computation into one graph and apply the ipu.ipu_compiler.compile method, as described in the model training documentation. Alternatively, higher-level APIs can be used, such as estimators or, in TensorFlow 2, the combination of tf.function and the IPUStrategy.
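For illustration, a minimal sketch of this approach, assuming the Graphcore port of TensorFlow 1.15 in which the tensorflow.python.ipu module provides scopes.ipu_scope and ipu_compiler.compile (the model, shapes and names are placeholders, and IPU system configuration is assumed to be done separately):

import tensorflow as tf
from tensorflow.python import ipu

def model(x):
    # The whole computation lives in one function so that a single graph is compiled
    hidden = tf.layers.dense(x, 128, activation=tf.nn.relu)
    return tf.layers.dense(hidden, 10)

# Fixed input shape (including the batch size) so the compiled executable can be reused
x = tf.placeholder(tf.float32, shape=[16, 784])

with ipu.scopes.ipu_scope("/device:IPU:0"):
    # Compile the complete computation into one IPU executable;
    # compile() returns the list of output tensors
    logits = ipu.ipu_compiler.compile(model, inputs=[x])

Because the whole computation is compiled as one unit, repeated executions with the same input shapes reuse a single executable.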

For further details on how to wrap computational graphs, see the Graphcore TensorFlow documentation.

4. Consideration 2: Batch size

Calculate the correct batch size and keep it constant.

Each change in batch size causes a new graph construction and a new executable. The reason is that the computational graph is static, so the batch size must be fixed at compile time. The execution instructions loaded onto the device depend directly on it, because the program loops over the data multiple times and needs to know the number of repetitions in advance. Furthermore, a different batch size requires a different distribution of the processing across the tiles in order to benefit from larger batch sizes and obtain high efficiency.
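For example, when feeding data with tf.data, fixing the batch size and dropping any incomplete final batch keeps the tensor shapes seen by the compiler constant. A minimal sketch (the data and sizes are placeholders):

import numpy as np
import tensorflow as tf

BATCH_SIZE = 32   # chosen once and kept constant for the whole run

# Placeholder data standing in for a real input pipeline
features = np.random.rand(1000, 784).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices(features)
# drop_remainder=True keeps every batch at exactly BATCH_SIZE, so the compiler
# always sees the same static shape
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)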

5. Consideration 3: Weights

Keep weights and graphs separate.

Freezing the weights into the graph causes recompilation, and therefore slows down processing, whenever the weights change. The reasons for this are the same as discussed above for Consideration 2.
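For illustration, a sketch contrasting the two approaches (the shapes and values are placeholders):

import numpy as np
import tensorflow as tf

# Placeholder weight values standing in for a real checkpoint
w_values = np.random.rand(784, 10).astype(np.float32)

# Frozen into the graph: a new set of values means a new graph and a recompilation
w_frozen = tf.constant(w_values)

# Kept separate from the graph: the graph is compiled once and updated values
# are loaded onto the device without recompiling
w_variable = tf.Variable(w_values, name="w")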

6. Consideration 4: Concatenate

Using tf.concat inside a tf.function will cause the graph to be recompiled at every epoch. This issue is not specific to the IPU, as it also happens on the CPU.

If tf.stack is used instead of tf.concat, the graph is compiled only once.

To replace tf.concat with tf.stack, consider the following example, where X and Y are both tensors of shape [N, D]:

# The two lines below are equivalent: both produce a tensor of shape [2*N, D]
input = tf.concat([X, Y], axis=0)
input = tf.reshape(tf.stack([X, Y]), [2 * N, D])

7. Consideration 5: Constants

In addition to the weights, you might also have other parameters. Where possible, these parameters should be handled as tf.constant (in TensorFlow 2) or tf.placeholder (in TensorFlow 1); otherwise, changing them will always result in different graphs. Note that this is not possible for all parameters: for some, such as the batch size, a change will result in a different computational graph and will require recompilation. On the other hand, parameters such as the limits of while loops can be handled in this way.

The advantage of this method is that the graph is compiled more generically and the respective values are then loaded into the executable without recompilation. However, if you change the parameters within a program run, you will still see different executables being loaded.
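As a small TensorFlow 1 sketch of this for a while-loop limit (the values and names are placeholders; in TensorFlow 2, a tf.constant passed into the tf.function plays the same role):

import tensorflow as tf

# Hard-coded Python value: changing it changes the graph and forces a recompilation
n_steps = 100
count = tf.while_loop(lambda i: i < n_steps, lambda i: i + 1, [tf.constant(0)])

# Fed at run time through a placeholder: the same compiled graph can be reused
# for different limits
n_steps_ph = tf.placeholder(tf.int32, shape=[], name="n_steps")
count = tf.while_loop(lambda i: i < n_steps_ph, lambda i: i + 1, [tf.constant(0)])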

To check whether recompilation is caused by a constant, look for constant({...}) entries in the XLA dump and compare them between runs (see Section 8, Consideration 6: Deep dive). Each constant({...}) entry is associated with an op_name, which may not appear meaningful in the XLA dump. In that case, you can add tf.name_scope() to the implementation so that the op_name entries in the XLA dump indicate which part of the code produces a potentially changing constant.
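A short sketch of adding tf.name_scope (the scope and constant names here are arbitrary examples):

import tensorflow as tf

with tf.name_scope("decoder_limits"):
    # The scope name is prefixed to the op name, so the corresponding
    # constant({...}) entry in the XLA dump is easier to attribute to this code
    max_length = tf.constant(128, name="max_length")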

8. Consideration 6: Deep dive

If none of these approaches apply to your problem, or your program is too complex to spot the source of the recompilation, a last resort in TensorFlow is to compare XLA dump text files (*.txt), using the --xla_dump_to option to specify the output folder. See TensorFlow options for reporting.

Make sure that you get XLA dumps of the different executions that should share the same executable but cause recompilation. Check that the dumps do not get overwritten, and select the largest files for the comparison; a tool such as diffmerge can help you visualise any differences.
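If a graphical diff tool is not available, the standard Python difflib module can produce a unified diff of two dump files. A minimal sketch (the file names are placeholders):

import difflib

# Placeholder file names; pick two of the largest dump files from the runs to compare
with open("xla_dump_run1.txt") as f1, open("xla_dump_run2.txt") as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile="run1", tofile="run2")

print("".join(diff))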

Usually there will be a clear difference in the patterns, such as a variable that has different values between the two variants.

If there is no clear difference, then the wrong files may have been chosen for comparison.

If the weights are frozen into the graph as constants and they are the only thing that differs between the executions, this approach might not help, because only low-dimensional weights are displayed. Larger arrays of constants might also cause issues. Other variables are usually well supported.

9. Code example

This example code addresses the different considerations. It is written for TensorFlow 1.15 but the aforementioned principles also apply to TensorFlow 2. In TensorFlow 2, tf.constant can be used instead of tf.placeholder.

Note

From Poplar SDK 3.1, TensorFlow 1 will only be supported on CentOS 7. In addition, Examples and Tutorials for TensorFlow 1 are only available up to version 3.0 of the SDK. There has been limited testing of the 3.0 versions of the TensorFlow 1 tutorials and examples with Poplar SDK 3.1.

Download the source code: