1. Introduction

When code is executed on an IPU, a multi-operation computation graph is compiled to run efficiently on the device.

This compilation ensures that the code running on the IPU is as efficient as possible: as many tiles as possible are used, as little device memory as possible is consumed, and the number of execution cycles is kept short. Note that, in contrast to some other platforms, the graph to be compiled isn't just a single matmul operation but many consecutive operations, so almost every graph execution is different and will need to be compiled and optimised.

The compilation process performs many optimisations, so it can take some time. It is therefore important to know when graph compilation will happen, and to avoid it occurring at inconvenient times or too often. This is especially relevant when running benchmarks, since compilation can add significant overhead.

As a result, it is important to avoid recompilations as far as possible. This technical note provides some strategies that can help you with this.

2. Consideration 0: Avoiding recompilation

To avoid recompiling the same code every time a TensorFlow process is started, you can turn on caching of the executable. Each generated file is identified by a 64-bit hash value.

Caching is enabled by setting the --executable_cache_path option (in the TF_POPLAR_FLAGS environment variable) to a directory where the compiled files will be stored. For example:

export TF_POPLAR_FLAGS="--executable_cache_path=/mnt/data/USERNAME/ipu_cache/"

See Caching of compiled executables in the TensorFlow user guide for more information.

However, several other cases can still cause recompilation even when the cache is active. To detect recompilation, set the logging level as shown:

export POPLAR_LOG_LEVEL=INFO

Then look for “Starting phase: graphConstruction” entries. You should also look for repeated “Loading executable” messages, for example:

Ending phase: Loading executable (duration 607 ms; diff RSS: 0.586426 MB)

These messages might occur in large numbers at the beginning but should not occur after a warm-up phase. As the example log message above shows, each load can take a significant amount of time (more than 500 ms in this case), so executables that are loaded too frequently during a run add considerable overhead.
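If you capture the log output in a file, you can count both kinds of event. For example (poplar.log is an assumed file name, created by redirecting the program's output):

grep -c "Starting phase: graphConstruction" poplar.log
grep -c "Loading executable" poplar.log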

Note that there might be some executable messages in the log at the beginning (for initialisation) and at the end (for the final results); these should not cause any problems. When running benchmarks, however, you should avoid executable loading and compilation in the measured part of the run at all costs.

3. Consideration 1: Sessions

Don’t use different sessions on the same device.

This advice is also relevant when you are working with threading. Either use a single session or distribute the sessions to different devices.
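As an illustration, here is a minimal TensorFlow 1 sketch of both patterns (plain CPU placeholders; the tensor and session names are only illustrative):

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")
y = x + 1.0
feed = {x: np.ones((1, 2), np.float32)}

# Anti-pattern: alternating between two sessions on the same device
# forces the executable to be loaded again on every switch.
sess_a, sess_b = tf.Session(), tf.Session()
sess_a.run(y, feed_dict=feed)
sess_b.run(y, feed_dict=feed)
sess_a.close()
sess_b.close()

# Preferred: one session, reused for every execution.
with tf.Session() as sess:
    sess.run(y, feed_dict=feed)
    sess.run(y, feed_dict=feed)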

4. Consideration 2: Computational graphs

If different processes run workloads on the same IPU (apart from initialisation at the beginning), make sure that they run the same graph. Otherwise, executables will be loaded onto the IPU repeatedly. This will be visible in your logging, together with the respective time measurements.

If you have different workloads, try to either put them together into one graph or distribute the graphs onto different IPUs.

Relying solely on the with ipu_scope("/device:IPU:0"): statement has a high chance of creating different computational graphs. A better approach is to combine the whole computation into one graph and apply the ipu.ipu_compiler.compile method, as described in the model training documentation. Alternatively, higher-level APIs can be used, such as estimators or, in TensorFlow 2, the combination of tf.function and the IPUStrategy.
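As a minimal sketch (using the same TF1 IPU API as the code example at the end of this technical note; my_net and the placeholder shapes are only illustrative):

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope

def my_net(a, b):
    # The whole computation lives in one function, so it is compiled
    # as a single XLA cluster and cached as one executable.
    return a + b

# Configure arguments for targeting the IPU
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
    pa = tf.placeholder(np.float32, [None, 2], name="a")
    pb = tf.placeholder(np.float32, [None, 2], name="b")

with ipu_scope("/device:IPU:0"):
    [out] = ipu.ipu_compiler.compile(my_net, inputs=[pa, pb])

with tf.Session() as sess:
    print(sess.run(out, feed_dict={pa: [[1., 1.]], pb: [[0., 2.]]}))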

For further details on how to wrap computational graphs, see the Graphcore TensorFlow documentation.

5. Consideration 3: Batch size

Calculate the correct batch size and keep it constant.

Each change in batch size triggers a new graph construction and a new executable. The reason is that the computational graph is static, so the batch size must be known at compile time. The execution instructions loaded onto the device depend directly on it, because the program loops over the data multiple times and needs to know the number of repetitions in advance. Furthermore, a different batch size requires a different distribution of the processing across the tiles, in order to benefit from the larger batch size and obtain high efficiency.
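To make this concrete, here is a small plain-TF1 sketch (names are illustrative). On the IPU, the run with a different batch size would trigger a new graph construction and compilation, even though the placeholder shape is [None, 2]:

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")
y = 2.0 * x

with tf.Session() as sess:
    # Constant batch size: the same executable is reused on every run.
    for _ in range(3):
        sess.run(y, feed_dict={x: np.ones((16, 2), np.float32)})
    # Different batch size: on the IPU this results in a new graph
    # construction and compilation.
    sess.run(y, feed_dict={x: np.ones((32, 2), np.float32)})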

6. Consideration 4: Weights

Keep weights and graphs separate.

Graph freezing causes recompilation and slows down the processing if the weights change. The reasons for this are the same as discussed above for Consideration 2.
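As a rough sketch of the difference (illustrative names; graph construction only, no session run):

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")

# Weights kept as a variable: assigning new values re-uploads data to
# the device but reuses the compiled executable.
w = tf.get_variable("w", shape=[2, 2], dtype=tf.float32)
y = tf.matmul(x, w)

# Anti-pattern (frozen graph): baking the trained values in as a
# constant ties the executable to those exact numbers, so updated
# weights mean a different graph and a recompilation.
w_frozen = tf.constant(np.random.rand(2, 2).astype(np.float32))
y_frozen = tf.matmul(x, w_frozen)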

7. Consideration 5: Constants

In addition to the weights, you might also have other parameters. Where possible, these should be handled as tf.constant or tf.placeholder (in TensorFlow 1); otherwise, changing them will always result in a different graph. Note that this is not possible for all parameters: for some, such as the batch size, a change will result in a different computational graph and will require recompilation. On the other hand, parameters such as limits in while loops can be handled as constants.

The advantage of this method is that the graph is compiled more generically, and the respective values are then loaded into the executable without recompilation. However, if you change the parameters within a program run, you will still see different executables being loaded.
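For example, a while-loop limit can be fed in as a scalar placeholder, mirroring the PLACEHOLDER approach in the code example below (TF1 sketch; with the hard-coded Python value, every new limit produces a new graph):

import tensorflow as tf

# Hard-coded limit: changing `n` changes the graph itself, so a new
# executable has to be compiled.
n = 10
fixed = tf.while_loop(lambda i: i < n, lambda i: i + 1, [tf.constant(0)])

# Fed-in limit: the graph stays identical and only the value changes
# at run time, so the cached executable can be reused.
limit = tf.placeholder(tf.int32, [], name="limit")
flexible = tf.while_loop(lambda i: i < limit, lambda i: i + 1,
                         [tf.constant(0)])

with tf.Session() as sess:
    print(sess.run(fixed))                            # 10
    print(sess.run(flexible, feed_dict={limit: 20}))  # 20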

8. Consideration 6: Deep dive

If none of these approaches apply to your problem, or your program is too complex to spot the source, a last resort in TensorFlow is to compare the XLA dump text files (*.txt) generated by pointing the --xla_dump_to option (in the XLA_FLAGS environment variable) at a suitable folder.
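For example, mirroring the flags used in the code example at the end of this technical note (the dump path is only an illustration):

export XLA_FLAGS="--xla_dump_to=/tmp/xla_dump --xla_dump_hlo_as_text"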

Make sure that you get XLA dumps of the different executions that should share the same executable but cause recompilation. Check that they don't get overwritten, and filter out the largest ones for the comparison; a visual diff tool such as DiffMerge can help you spot any differences.

Usually there will be a clear difference in the patterns, such as a variable that has different values between the two variants.

If there is no clear difference then the wrong files may have been chosen for comparison.

If you have frozen weights as constants in the graph and those are the only thing differing between the executions, this approach might not help, because only low-dimensional weights are displayed. Also, larger arrays of constants might cause issues. Other variables are usually well supported.

9. Code example

This example code addresses the different considerations. It is written for TensorFlow 1.15, but the aforementioned principles also apply to TensorFlow 2. In TensorFlow 2, tf.constant can be used instead of tf.placeholder.


  1"""Tutorial code to play around with graph recompilation and executable loading
  2
  3Parameters to play around with are CACHING, NOMULTISESSION, PLACEHOLDER,
  4and SAMEBATCH. Some comments in the document refer to the underlying tutorial
  5in the documentation portal.
  6
  7The code will print out what the expected behaviour should look like.
  8"""
  9
 10import os
 11import numpy as np
 12
 13import tensorflow as tf
 14from tensorflow.python import ipu
 15from tensorflow.python.ipu.scopes import ipu_scope
 16
 17# Consideration 0: Environment setup
 18CACHING = True  # Cache compiled graph. The folder is tmp_tutorial.
 19# Consideration 1: Sessions
 20NOMULTISESSION = True  # Avoid using different sessions.
 21# Consideration 2, 4, 5: Graphs, Weights, Constants
 22# Use a placeholder that is handed over to the graph instead of a hard coded
 23# hyperparameter that might change between executions.
 24PLACEHOLDER = True
 25# Consideration 3: Batch size
SAMEBATCH = True  # Keep the batch size constant between executions.

# Consideration 0: Environment setup
if CACHING:
    os.environ["TF_POPLAR_FLAGS"] = "--executable_cache_path=tmp_tutorial"
else:
    os.environ["TF_POPLAR_FLAGS"] = ""
 33if "POPLAR_LOG_LEVEL" not in os.environ or \
 34        os.environ["POPLAR_LOG_LEVEL"] != "INFO":
 35    print("Setting POPLAR_LOG_LEVEL to INFO for graph compilation information.")
 36    os.environ["POPLAR_LOG_LEVEL"] = "INFO"
 37
 38# Consideration 6
 39os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
 40    np.random.randint(2, 101))
 41os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
 42os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
 43os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
 44
 45# Configure arguments for targeting the IPU
 46cfg = ipu.utils.create_ipu_config()
 47cfg = ipu.utils.auto_select_ipus(cfg, 1)
 48ipu.utils.configure_ipu_system(cfg)
 49
 50with tf.device("cpu"):
 51    pa = tf.placeholder(np.float32, [None, 2], name="a")
 52    pb = tf.placeholder(np.float32, [None, 2], name="b")
 53    pc = tf.placeholder(np.float32, [None, 2], name="c")
 54
 55if PLACEHOLDER:
 56    mult = tf.placeholder(np.float32, [], name="multiplier")
 57else:
 58    mult = np.random.uniform(0, 1)
 59
 60
 61def basic_graph(pa, pb, pc):
 62    # Do basic addition with tensors
 63    o1 = pa + pb
 64    o2 = pa + pc
 65    simple_graph_output = mult * (o1 + o2)
 66    return simple_graph_output
 67
 68
 69with ipu_scope("/device:IPU:0"):
 70    comp_graph = basic_graph(pa, pb, pc)
 71
 72print("\nWarm up & Caching Test: ")
 73print("No compilation after first execution expected but executable load. \n")
 74with tf.Session() as sess1, tf.Session() as sess2:
 75    # Run the graph through the session feeding it an arbitrary dictionary
 76    if PLACEHOLDER:
 77        result0 = sess1.run(comp_graph,
 78                            feed_dict={
 79                                pa: [[1., 1.]],
 80                                pb: [[0., 1.]],
 81                                pc: [[1., 5.]],
 82                                mult: 10.0
 83                            })
 84    else:
 85        result0 = sess1.run(comp_graph,
 86                            feed_dict={
 87                                pa: [[1., 1.]],
 88                                pb: [[0., 1.]],
 89                                pc: [[1., 5.]],
 90                            })
 91
 92# Consideration 6
 93os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
 94    np.random.randint(101, 201))
 95os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
 96os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
 97os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
 98
 99# Consideration 2, 4, 5: Graphs, Weights, Constants
100m = np.random.uniform(0, 1)
101if not PLACEHOLDER:
102    mult = m
103    with ipu_scope("/device:IPU:0"):
104        comp_graph = basic_graph(pa, pb, pc)
105
106with tf.Session() as sess1, tf.Session() as sess2:
107    print("\nPlaceholder test. ")
108    print("No recompilation but executable switch should occur.\n")
109    # Run the graph through the session feeding it an arbitrary dictionary
110    if PLACEHOLDER:
111        result1 = sess1.run(comp_graph,
112                            feed_dict={
113                                pa: [[1., 1.]],
114                                pb: [[0., 1.]],
115                                pc: [[1., 5.]],
116                                mult: m
117                            })
118    else:
119        result1 = sess1.run(comp_graph,
120                            feed_dict={
121                                pa: [[1., 1.]],
122                                pb: [[0., 1.]],
123                                pc: [[1., 5.]],
124                            })
125
126    # Consideration 6
127    os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
128        np.random.randint(201, 301))
129    os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
130    os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
131    os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
132
133    # Consideration 1: Sessions
134    if NOMULTISESSION:
135        sess2 = sess1
136    else:
137        print("Switching session.")
138
139    print("\nSession Test.")
140    print("No recompilation or executable switch should occur.\n")
141    if PLACEHOLDER:
142        result2 = sess2.run(comp_graph,
143                            feed_dict={
144                                pa: [[1., 1.]],
145                                pb: [[0., 1.]],
146                                pc: [[1., 5.]],
147                                mult: m
148                            })
149    else:
150        result2 = sess2.run(comp_graph,
151                            feed_dict={
152                                pa: [[1., 1.]],
153                                pb: [[0., 1.]],
154                                pc: [[1., 5.]],
155                            })
156
157    # Consideration 6
158    os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
159        np.random.randint(301, 401))
160    os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
161    os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
162    os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
163
164    # Consideration 3: Batch size
165    if SAMEBATCH:
166        bs = 1
167    else:
168        bs = np.random.randint(2, 101)
169    print("\nBatch Size Test with batch size %d." % bs)
170    print("No recompilation or executable switch should occur.")
171    print("Batch size should be the original 1.\n")
172    if PLACEHOLDER:
173        result3 = sess2.run(comp_graph,
174                            feed_dict={
175                                pa: [[1., 1.]] * bs,
176                                pb: [[0., 1.]] * bs,
177                                pc: [[1., 5.]] * bs,
178                                mult: m
179                            })
180    else:
181        result3 = sess2.run(comp_graph,
182                            feed_dict={
183                                pa: [[1., 1.]] * bs,
184                                pb: [[0., 1.]] * bs,
185                                pc: [[1., 5.]] * bs,
186                            })
187
188    print("\nFirst two results should be different (different multiplier).\n")
189    print("Caching/warm up test:\t", result0)
190    print()
191    print("Placeholder test:    \t", result1)
192    print()
193    print("Session test:        \t", result2)
194    print()
195    if bs > 1:
196        print("Batch size test:     \t", result3[:2], "...")
197    else:
198        print("Batch size test:     \t", result3)