1. Introduction
When code is executed on an IPU, a multi-operation computation graph is compiled to run efficiently on the device.
This compilation ensures that the code running on the IPU is optimal: as many tiles as possible are used, as little device memory as possible is used, and the number of execution cycles is short. Note that, in contrast to some other platforms, the graph to be compiled isn’t just a single matmul operation but many consecutive operations, so almost every graph is different and will need to be compiled and optimised.
The compilation process performs many optimisations and so can take some time. It is therefore important to know when graph compilation will happen, and to prevent it occurring at inconvenient times or too often. This is especially relevant when running benchmarks, since compilation can add significant overhead.
As a result, it is important to avoid recompilations as far as possible. This technical note provides some strategies that can help you with this.
2. Consideration 0: Avoiding recompilation
To avoid recompiling the same code every time a TensorFlow process is started, you can turn on caching of the compiled executables. Each generated file is identified by a 64-bit hash value.
Caching is enabled by setting the option --executable_cache_path to a directory where the compiled files will be stored. For example:
export TF_POPLAR_FLAGS="--executable_cache_path=/mnt/data/USERNAME/ipu_cache/"
See Caching of compiled executables in the TensorFlow user guide for more information.
However, several other cases can still cause recompilation even when the cache is active. To detect recompilation, set the log level as shown:
export POPLAR_LOG_LEVEL=INFO
Then look for “Starting phase: graphConstruction” entries. Relatedly, also look for repeated “Loading executable” messages, such as this one:
Ending phase: Loading executable (duration 607 ms; diff RSS: 0.586426 MB)
These messages might occur in large numbers at the beginning, but should not occur after a warm-up phase. As the example log message above shows, each load can take a significant amount of time (more than 500 ms here), so frequent loads during a run add considerable overhead.
Note that there might be some executable load messages at the beginning of the log (for initialisation) and at the end (for retrieving final results). These should not cause any problems. When running benchmarks, however, you should avoid executable loading and compilation at all costs.
3. Consideration 1: Sessions
Don’t use different sessions on the same device.
This advice is also relevant when you are working with threading. Either use a single session or distribute the sessions to different devices.
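A minimal sketch of the single-session pattern (the computation here is an arbitrary stand-in for your own IPU graph):

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")
y = 2.0 * x  # stands in for any compiled IPU computation

# Reuse one session for every execution on the device. Creating a new
# tf.Session per workload (or per thread) can cause executables to be
# reloaded repeatedly.
with tf.Session() as sess:
    for _ in range(3):
        sess.run(y, feed_dict={x: [[1., 2.]]})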
4. Consideration 2: Computational graphs
If different processes run workloads on the same IPU (apart from initialisation at the beginning), make sure that they run the same graph. Otherwise, executables will be loaded onto the IPU repeatedly. This will be visible in your logging, together with the respective time measurements.
If you have different workloads, either combine them into one graph or distribute the graphs across different IPUs.
Relying solely on the with ipu_scope("/device:IPU:0"): statement has a high chance of creating different computational graphs. A better approach is to combine the whole computation into one graph and compile it with the ipu.ipu_compiler.compile method, as described in the model training documentation.
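A minimal sketch of this pattern (the computation and shapes are arbitrary placeholders for your own model):

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope

cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)


def my_net(a, b):
    # The whole computation lives in one function and is compiled
    # into a single executable.
    return a + b, a * b


with tf.device("cpu"):
    pa = tf.placeholder(np.float32, [2], name="a")
    pb = tf.placeholder(np.float32, [2], name="b")

with ipu_scope("/device:IPU:0"):
    out = ipu.ipu_compiler.compile(my_net, inputs=[pa, pb])

with tf.Session() as sess:
    print(sess.run(out, feed_dict={pa: [1., 1.], pb: [2., 3.]}))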
Alternatively, higher-level APIs can be used, such as estimators or, in TensorFlow 2, the combination of tf.function and the IPUStrategy.
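As a sketch of the TensorFlow 2 pattern (assuming the Graphcore TensorFlow 2 port; the exact configuration calls and the location of IPUStrategy can vary between SDK releases):

import tensorflow as tf
from tensorflow.python import ipu

# Configure a single IPU (see the code example at the end for the
# TensorFlow 1 equivalent).
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():

    @tf.function
    def my_net(a, b):
        return a + b

    # Repeated calls with the same input shapes and dtypes reuse one
    # compiled executable; new shapes trigger a retrace and a recompile.
    result = strategy.run(my_net, args=(tf.ones([2, 2]), tf.ones([2, 2])))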
For further details on how to wrap computational graphs, see the Graphcore TensorFlow documentation.
5. Consideration 3: Batch size
Calculate the correct batch size and keep it constant.
Each change in batch size causes a new graph construction and a new executable. This is because the computational graph is static, so the batch size must be known at compile time: the execution instructions loaded onto the device depend on it directly, since the program loops over the data and needs to know the number of repetitions in advance. Furthermore, a different batch size requires a different distribution of the processing across the tiles in order to benefit from the synergies of larger batch sizes and obtain high efficiency.
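If your dataset size is not a multiple of the chosen batch size, one common mitigation (a sketch, not part of the original example) is to pad the final partial batch up to the fixed batch size, so that every execution sees the same shape, and then discard the outputs for the padded rows:

import numpy as np

BATCH_SIZE = 32  # chosen once and kept constant for the whole run


def pad_to_batch(batch, batch_size=BATCH_SIZE):
    # Pad a partial batch (e.g. the last one in an epoch) with zeros up
    # to the fixed batch size so the compiled executable can be reused.
    # Returns the padded batch and the number of valid rows.
    n = batch.shape[0]
    if n == batch_size:
        return batch, n
    pad = np.zeros((batch_size - n,) + batch.shape[1:], dtype=batch.dtype)
    return np.concatenate([batch, pad], axis=0), n

The outputs for the padded rows are then simply dropped, for example with outputs[:n].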
6. Consideration 4: Weights
Keep weights and graphs separate.
Freezing weights into the graph causes recompilation, and slows down processing, whenever the weights change. The reasons are the same as discussed above for Consideration 2.
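As an illustrative sketch (the names here are arbitrary): keeping the weights in a tf.Variable means that an update only changes the data loaded onto the device, whereas freezing them into the graph as constants changes the graph itself and forces a recompile:

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")

# Weights kept as a variable, separate from the graph structure.
w = tf.get_variable("w", shape=[2, 2], dtype=tf.float32,
                    initializer=tf.zeros_initializer())
y = tf.matmul(x, w)

new_w = tf.placeholder(np.float32, [2, 2], name="new_w")
update_w = w.assign(new_w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Updating the weights does not change the compiled graph.
    sess.run(update_w, feed_dict={new_w: np.eye(2, dtype=np.float32)})
    print(sess.run(y, feed_dict={x: [[1., 2.]]}))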
7. Consideration 5: Constants
In addition to the weights, you might also have other parameters. Where possible, these parameters should be handled as tf.constant or tf.placeholder (in TensorFlow 1); otherwise, changing them will always result in different graphs. Note that this is not possible with all parameters: for some, like the batch size, a change will result in a different computational graph and will require recompilation. On the other hand, parameters such as limits in while loops can be handled as constants.
The advantage of this method is that the graph is compiled more generically, and the respective values are then loaded into the executable without recompilation. However, if you change the parameters within a program run, you will still see different executables being loaded.
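For example, a while-loop limit can be fed as a scalar placeholder (a minimal sketch; as noted above, changing the value within a run still means a different executable is loaded):

import numpy as np
import tensorflow as tf

# The loop limit is a run-time input, so changing it does not change
# the structure of the compiled graph.
limit = tf.placeholder(np.int32, [], name="limit")

i0 = tf.constant(0)
count = tf.while_loop(lambda i: i < limit, lambda i: i + 1, [i0])

with tf.Session() as sess:
    print(sess.run(count, feed_dict={limit: 5}))   # prints 5
    print(sess.run(count, feed_dict={limit: 10}))  # prints 10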
8. Consideration 6: Deep dive
If none of these approaches apply to your problem, or your program is too complex to spot the source of the recompilation, a last resort in TensorFlow is to compare the XLA dump text files (*.txt) generated by setting the --xla_dump_to option in the XLA_FLAGS environment variable to the respective folder, as in the code example below.
Make sure that you get XLA dumps of the different executions that should share the same executable but cause recompilation. Check that the dumps don’t get overwritten, and select the largest files for the comparison; a visual diff tool such as diffmerge can help you visualise any differences.
Usually there will be a clear difference in the patterns, such as a variable that has different values between the two variants.
If there is no clear difference then the wrong files may have been chosen for comparison.
If you have frozen weights as constants in the graph, and those are the only thing differing between the executions, this approach might not help because only low-dimensional weights get displayed. Larger arrays of constants might also cause issues. Other variables are usually well supported.
9. Code example
This example code addresses the different considerations. It is written for
TensorFlow 1.15 but the aforementioned principles also apply to TensorFlow 2.
In TensorFlow 2, tf.constant can be used instead of tf.placeholder.
1"""Tutorial code to play around with graph recompilation and executable loading
2
3Parameters to play around with are CACHING, NOMULTISESSION, PLACEHOLDER,
4and SAMEBATCH. Some comments in the document refer to the underlying tutorial
5in the documentation portal.
6
7The code will print out what the expected behaviour should look like.
8"""
9
10import os
11import numpy as np
12
13import tensorflow as tf
14from tensorflow.python import ipu
15from tensorflow.python.ipu.scopes import ipu_scope
16
17# Consideration 0: Environment setup
18CACHING = True # Cache compiled graph. The folder is tmp_tutorial.
19# Consideration 1: Sessions
20NOMULTISESSION = True # Avoid using different sessions.
21# Consideration 2, 4, 5: Graphs, Weights, Constants
22# Use a placeholder that is handed over to the graph instead of a hard coded
23# hyperparameter that might change between executions.
24PLACEHOLDER = True
25# Consideration 3: Batch size
26SAMEBATCH = True # Change the batch size between executions.
27
28# Consideration 0: Environment setup
29if "TF_POPLAR_FLAGS" in os.environ and not CACHING:
30 os.environ["TF_POPLAR_FLAGS"] = ""
31else:
32 os.environ["TF_POPLAR_FLAGS"] = "--executable_cache_path=tmp_tutorial"
33if "POPLAR_LOG_LEVEL" not in os.environ or \
34 os.environ["POPLAR_LOG_LEVEL"] != "INFO":
35 print("Setting POPLAR_LOG_LEVEL to INFO for graph compilation information.")
36 os.environ["POPLAR_LOG_LEVEL"] = "INFO"
37
38# Consideration 6
39os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
40 np.random.randint(2, 101))
41os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
42os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
43os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
44
45# Configure arguments for targeting the IPU
46cfg = ipu.utils.create_ipu_config()
47cfg = ipu.utils.auto_select_ipus(cfg, 1)
48ipu.utils.configure_ipu_system(cfg)
49
50with tf.device("cpu"):
51 pa = tf.placeholder(np.float32, [None, 2], name="a")
52 pb = tf.placeholder(np.float32, [None, 2], name="b")
53 pc = tf.placeholder(np.float32, [None, 2], name="c")
54
55if PLACEHOLDER:
56 mult = tf.placeholder(np.float32, [], name="multiplier")
57else:
58 mult = np.random.uniform(0, 1)
59
60
61def basic_graph(pa, pb, pc):
62 # Do basic addition with tensors
63 o1 = pa + pb
64 o2 = pa + pc
65 simple_graph_output = mult * (o1 + o2)
66 return simple_graph_output
67
68
69with ipu_scope("/device:IPU:0"):
70 comp_graph = basic_graph(pa, pb, pc)
71
72print("\nWarm up & Caching Test: ")
73print("No compilation after first execution expected but executable load. \n")
74with tf.Session() as sess1, tf.Session() as sess2:
75 # Run the graph through the session feeding it an arbitrary dictionary
76 if PLACEHOLDER:
77 result0 = sess1.run(comp_graph,
78 feed_dict={
79 pa: [[1., 1.]],
80 pb: [[0., 1.]],
81 pc: [[1., 5.]],
82 mult: 10.0
83 })
84 else:
85 result0 = sess1.run(comp_graph,
86 feed_dict={
87 pa: [[1., 1.]],
88 pb: [[0., 1.]],
89 pc: [[1., 5.]],
90 })
91
92# Consideration 6
93os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
94 np.random.randint(101, 201))
95os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
96os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
97os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
98
99# Consideration 2, 4, 5: Graphs, Weights, Constants
100m = np.random.uniform(0, 1)
101if not PLACEHOLDER:
102 mult = m
103 with ipu_scope("/device:IPU:0"):
104 comp_graph = basic_graph(pa, pb, pc)
105
106with tf.Session() as sess1, tf.Session() as sess2:
107 print("\nPlaceholder test. ")
108 print("No recompilation but executable switch should occur.\n")
109 # Run the graph through the session feeding it an arbitrary dictionary
110 if PLACEHOLDER:
111 result1 = sess1.run(comp_graph,
112 feed_dict={
113 pa: [[1., 1.]],
114 pb: [[0., 1.]],
115 pc: [[1., 5.]],
116 mult: m
117 })
118 else:
119 result1 = sess1.run(comp_graph,
120 feed_dict={
121 pa: [[1., 1.]],
122 pb: [[0., 1.]],
123 pc: [[1., 5.]],
124 })
125
126 # Consideration 6
127 os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
128 np.random.randint(201, 301))
129 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
130 os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
131 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
132
133 # Consideration 1: Sessions
134 if NOMULTISESSION:
135 sess2 = sess1
136 else:
137 print("Switching session.")
138
139 print("\nSession Test.")
140 print("No recompilation or executable switch should occur.\n")
141 if PLACEHOLDER:
142 result2 = sess2.run(comp_graph,
143 feed_dict={
144 pa: [[1., 1.]],
145 pb: [[0., 1.]],
146 pc: [[1., 5.]],
147 mult: m
148 })
149 else:
150 result2 = sess2.run(comp_graph,
151 feed_dict={
152 pa: [[1., 1.]],
153 pb: [[0., 1.]],
154 pc: [[1., 5.]],
155 })
156
157 # Consideration 6
158 os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
159 np.random.randint(301, 401))
160 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
161 os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
162 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
163
164 # Consideration 3: Batch size
165 if SAMEBATCH:
166 bs = 1
167 else:
168 bs = np.random.randint(2, 101)
169 print("\nBatch Size Test with batch size %d." % bs)
170 print("No recompilation or executable switch should occur.")
171 print("Batch size should be the original 1.\n")
172 if PLACEHOLDER:
173 result3 = sess2.run(comp_graph,
174 feed_dict={
175 pa: [[1., 1.]] * bs,
176 pb: [[0., 1.]] * bs,
177 pc: [[1., 5.]] * bs,
178 mult: m
179 })
180 else:
181 result3 = sess2.run(comp_graph,
182 feed_dict={
183 pa: [[1., 1.]] * bs,
184 pb: [[0., 1.]] * bs,
185 pc: [[1., 5.]] * bs,
186 })
187
188 print("\nFirst two results should be different (different multiplier).\n")
189 print("Caching/warm up test:\t", result0)
190 print()
191 print("Placeholder test: \t", result1)
192 print()
193 print("Session test: \t", result2)
194 print()
195 if bs > 1:
196 print("Batch size test: \t", result3[:2], "...")
197 else:
198 print("Batch size test: \t", result3)