1. Introduction
When code is executed on an IPU, a multi-operation computation graph is compiled to run efficiently on the device.
This compilation ensures that the code running on the IPU is optimal: as many tiles as possible are used, as little device memory as possible is used, and the number of execution cycles is kept short. Note that, in contrast to some other platforms, the graph to be compiled is not just a single matmul operation but many consecutive operations, so almost every graph is different and will need to be compiled and optimised.
The compilation process performs many optimisations, so it can take some time. It is therefore important to know when graph compilation will happen, and to avoid it occurring at inconvenient times or too often. This is especially relevant when running benchmarks, since compilation can add significant overhead.
As a result, it is important to avoid recompilations as far as possible.
2. Consideration 0: Avoiding recompilation
To avoid recompiling the same code every time a TensorFlow process is started, you can turn on caching of the executable. Each generated file is identified by a 64-bit hash value.
Caching is enabled by setting the option --executable_cache_path to a directory where the compiled files will be stored. For example:
export TF_POPLAR_FLAGS="--executable_cache_path=/mnt/data/USERNAME/ipu_cache/"
See Caching of compiled executables in the TensorFlow user guide for more information.
However, there are several other cases that can still cause recompilation even with the cache active. To detect recompilation, set the logging level as shown:
export POPLAR_LOG_LEVEL=INFO
Then look for “Starting phase: graphConstruction”. You should also look for repeated “Loading executable” messages. For example, this log message:
Ending phase: Loading executable (duration 607 ms; diff RSS: 0.586426 MB)
These messages might occur in large numbers at the beginning but should not occur after a warm-up phase. As the example log message above shows, each load can take a significant amount of time (more than 500 ms in this case), so they add significant overhead if they occur too frequently during a run.
Note that some executable messages might appear in the log at the beginning (for initialisation) and at the end (for final results). Those should not cause any problems. Apart from those, executable messages are not desirable and should be investigated.
Another logging tool is to set the environment variable
TF_CPP_VMODULE=poplar_compiler=1
This allows you to view the logging messages at the Poplar compilation level.
Then look for “Begin XLA compilation: … (Hash 0x…)”. The hash code uniquely identifies the compilation. If the hash codes are the same for different runs, the compilations are also the same and the executable will be loaded if it has been saved previously. Otherwise, recompilation will occur.
Note that when executable caching is enabled, one of the following messages will be observed: “Loaded ModelName from CachedExecutableFilename” if the executable is found, or “Couldn’t find CachedExecutableFilename” if the executable is not found.
3. Consideration 1: Computational graphs
If different processes run workloads other than initialisation, make sure that they run the same graph; otherwise executables will be loaded onto the IPU repeatedly. This will be visible in your logging, together with the respective time measurements.
If you have different workloads, try to either put them together into one graph or distribute the graphs onto different IPUs.
Relying solely on the with ipu_scope("/device:IPU:0"): statement has a high chance of creating different computational graphs. A better approach is to combine the whole computation into one graph and apply the ipu.ipu_compiler.compile method as described in the model training documentation.
Alternatively, higher-level APIs can be used, such as estimators or, in TensorFlow 2, the combination of tf.function and the IPUStrategy.
For further details on how to wrap computational graphs, see the Graphcore TensorFlow documentation.
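To make this concrete, below is a minimal TensorFlow 2 sketch of this approach: the whole computation sits inside a single tf.function executed under an IPUStrategy, so it is traced and compiled as one graph. The module paths (ipu.config.IPUConfig, ipu.ipu_strategy.IPUStrategy) and the experimental_compile flag reflect recent Poplar SDK releases and may differ in your version; the layer and shapes are only illustrative.
import tensorflow as tf
from tensorflow.python import ipu

# Configure a single IPU (assumption: one IPU is available).
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():
    dense = tf.keras.layers.Dense(16)

    # Wrapping the whole computation in a single tf.function means it is
    # traced and compiled as one graph rather than as separate fragments.
    @tf.function(experimental_compile=True)
    def step(x):
        return tf.reduce_sum(dense(x))

    result = strategy.run(step, args=(tf.ones([4, 8]),))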
4. Consideration 2: Batch size
Calculate the correct batch size and keep it constant.
Each change in batch size causes a new graph construction and a new executable. The reason is that the computational graph is static and the batch size is undefined before compilation. The execution instructions loaded onto the device depend directly on the batch size, because the data is looped over multiple times and the number of repetitions needs to be known in advance. Furthermore, a different batch size requires a different distribution of the processing onto the tiles in order to benefit from the synergies of larger batch sizes and to obtain high efficiency.
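As a minimal sketch of keeping the batch size constant (the dataset, shapes and batch size below are only illustrative), batching with drop_remainder=True guarantees that every batch has the same static shape, so a smaller final batch cannot trigger a second compilation:
import tensorflow as tf

BATCH_SIZE = 32  # decide once and keep it fixed for the whole run

# drop_remainder=True guarantees every batch has the same static shape,
# so the final (smaller) batch cannot produce a differently shaped graph.
dataset = (
    tf.data.Dataset.from_tensor_slices(tf.random.uniform([1000, 8]))
    .batch(BATCH_SIZE, drop_remainder=True)
)

@tf.function
def step(batch):
    # batch always has shape [BATCH_SIZE, 8], so this function is traced
    # (and the executable compiled) only once.
    return tf.reduce_mean(batch, axis=0)

for batch in dataset:
    step(batch)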
5. Consideration 3: Weights
Keep weights and graphs separate.
Graph freezing causes recompilation and slows down the processing if the weights change. The reasons for this are the same as discussed above for Consideration 2.
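The sketch below illustrates the difference, assuming a simple matmul workload (the shapes and names are only illustrative): weights frozen into the graph as constants force a rebuild, and therefore a recompilation, whenever their values change, whereas weights held in a tf.Variable can be updated without touching the graph.
import numpy as np
import tensorflow as tf

weights_np = np.random.rand(8, 4).astype(np.float32)

# Weights frozen into the graph as constants: to pick up new values the
# graph has to be rebuilt, and each rebuild embeds different constant
# values, so the compiled executable (and its hash) changes every time.
def build_frozen_step(values):
    @tf.function
    def frozen_step(x):
        return tf.matmul(x, tf.constant(values))
    return frozen_step

# Weights kept in a tf.Variable, separate from the graph: the graph only
# depends on the variable's shape and dtype, so the values can be updated
# without triggering a recompilation.
w = tf.Variable(weights_np)

@tf.function
def variable_step(x):
    return tf.matmul(x, w)

x = tf.ones([2, 8])
build_frozen_step(weights_np)(x)   # a new graph for every set of weights
variable_step(x)
w.assign(np.random.rand(8, 4).astype(np.float32))
variable_step(x)                   # same graph, no recompilation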
6. Consideration 4: Concatenate
Using tf.concat inside a tf.function will cause the graph to be recompiled at every epoch. This issue is not specific to the IPU, as it also happens on the CPU.
If tf.stack is used instead of tf.concat, the graph is compiled only once.
To replace tf.concat with tf.stack, consider the following example:
# The two lines below are equivalent for 2-D tensors X and Y of the same shape
input = tf.concat([X, Y], axis=0)
input = tf.reshape(tf.stack([X, Y]), [2 * X.shape[0], X.shape[1]])
7. Consideration 5: Constants
In addition to the weights you might also have other parameters. These parameters should be handled as tf.constant (in TensorFlow 2) or tf.placeholder (in TensorFlow 1) if possible, otherwise changing them will always result in different graphs. Note that this is not possible with all parameters - for some, like batch size, a change will result in a different computational graph and will require recompilation. On the other hand, parameters such as limits in while loops can be handled as constants.
The advantage with this method is that the graph gets compiled more generically and then the respective variables get loaded into the executable without a recompilation. However, if you change the parameters within a program run, you will still see different executables being loaded.
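As a TensorFlow 2 illustration of this (the function below is hypothetical), passing such a parameter as a tensor rather than as a plain Python value keeps its value out of the trace signature, so the value can change without producing a new graph:
import tensorflow as tf

@tf.function
def accumulate(limit):
    total = tf.constant(0)
    i = tf.constant(0)
    # With a tensor `limit`, AutoGraph builds a single tf.while_loop whose
    # bound is a runtime input rather than a value baked into the graph.
    while i < limit:
        total += i
        i += 1
    return total

# Python ints become part of the trace signature: every new value creates
# a new graph (and a new executable).
accumulate(10)
accumulate(20)               # retraced

# Tensors are traced by shape and dtype only, so the value can change freely.
accumulate(tf.constant(10))
accumulate(tf.constant(20))  # same graph, no recompilation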
To check if compilation is caused by a constant, you can look for constant({...}) in the XLA dump and compare it between runs (see Section 8, Consideration 6: Deep dive).
Also, each use of constant({...}) is connected to an op_name which may not appear meaningful in the XLA dump. In that case, you can add tf.name_scope() to the implementation to make the op_name occurrences in the XLA dump more descriptive, so that you can better understand which part of the code causes a potentially changing constant.
8. Consideration 6: Deep dive
If none of these approaches apply to your problem or your program is too complex to spot the source, a last resort in TensorFlow is to compare XLA dump text files (*.txt) by using the --xla_dump_to option to specify the output folder. See TensorFlow options for reporting.
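One way to produce these dumps, assuming your TensorFlow build reads the option from the XLA_FLAGS environment variable (the dump path below is only an example), is to set the flag before TensorFlow is imported; setting it with export in the shell, as for the other environment variables above, works equally well:
import os

# Set before importing TensorFlow, because XLA reads XLA_FLAGS when it is
# first initialised. The directory path is only an example.
os.environ["XLA_FLAGS"] = "--xla_dump_to=/tmp/xla_dump"

import tensorflow as tf

# Build and run the model as usual; each compiled XLA module is then written
# to the dump directory as text files that can be diffed between runs.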
Make sure that you get XLA dumps of the different executions that should have the same executable but cause recompilation. Check that they don't get overwritten, and filter out the largest ones for the comparison - diffmerge can help you visualise any differences.
Usually there will be a clear difference in the patterns, such as a variable that has different values between the two variants.
If there is no clear difference then the wrong files may have been chosen for comparison.
If you have frozen weights as constants in the graph and those are the only thing differing between the executions, this approach might not help, because only low-dimensional weights are displayed in the dump. Also, larger arrays of constants might cause issues. Other variables are usually well supported.
9. Code example
This example code addresses the different considerations. It is written for TensorFlow 1.15 but the aforementioned principles also apply to TensorFlow 2. In TensorFlow 2, tf.constant can be used instead of tf.placeholder.
Note
From Poplar SDK 3.1, TensorFlow 1 will only be supported on CentOS 7. In addition, Examples and Tutorials for TensorFlow 1 are only available up to version 3.0 of the SDK. There has been limited testing of the 3.0 versions of the TensorFlow 1 tutorials and examples with Poplar SDK 3.1.
Download the source code:
TensorFlow 1.15: recompilation.py.
TensorFlow 2.6: TF2_recompilation.py.