1. Introduction
When code is executed on an IPU, a multi-operation computation graph is compiled to run efficiently on the device.
This compilation ensures that the code running on the IPU is optimal: as many tiles as possible are used, as little device memory as possible is used, and the number of execution cycles is short. Note that, in contrast to some other platforms, the graph to be compiled isn’t just a single matmul operation but many consecutive operations, so almost every graph is different and will need to be compiled and optimised.
The compilation process performs many optimisations and so can take some time. It is therefore important to know when graph compilation will happen, and to prevent it occurring at inconvenient times or too often. This is especially relevant when running benchmarks, since compilation can add significant overhead.
As a result, it is important to avoid recompilations as far as possible. This technical note provides some strategies that can help you with this.
2. Consideration 0: Avoiding recompilation
To avoid recompiling the same code every time a TensorFlow process is started, you can turn on caching of the compiled executables. Each generated file is identified by a 64-bit hash value.
Caching is enabled by setting the option --executable_cache_path to a directory where the compiled files will be stored. For example:
export TF_POPLAR_FLAGS="--executable_cache_path=/mnt/data/USERNAME/ipu_cache/"
See Caching of compiled executables in the TensorFlow user guide for more information.
However, several other cases can still cause recompilation even when the cache is active. To detect recompilation, set the log level as shown:
export POPLAR_LOG_LEVEL=INFO
Then look for “Starting phase: graphConstruction” entries. Relatedly, also look for repeated “Loading executable” messages, such as this one:
Ending phase: Loading executable (duration 607 ms; diff RSS: 0.586426 MB)
These messages might occur in large numbers at the beginning, but should not occur after a warm-up phase. As the example log message above shows, each load can take a significant amount of time (more than 500 ms here), so frequent loads during a run add considerable overhead.
Note that there might be some executable load messages at the beginning of the log (for initialisation) and at the end (for retrieving final results). These should not cause any problems. When running benchmarks, however, you should avoid executable loading and compilation at all costs.
3. Consideration 1: Sessions
Don’t use different sessions on the same device.
This advice is also relevant when you are working with threading. Either use a single session or distribute the sessions to different devices.
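A minimal sketch of the single-session pattern (the computation here is an arbitrary stand-in for your own IPU graph):

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")
y = 2.0 * x  # stands in for any compiled IPU computation

# Reuse one session for every execution on the device. Creating a new
# tf.Session per workload (or per thread) can cause executables to be
# reloaded repeatedly.
with tf.Session() as sess:
    for _ in range(3):
        sess.run(y, feed_dict={x: [[1., 2.]]})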
4. Consideration 2: Computational graphs
If different processes run workloads on the same IPU (apart from initialisation at the beginning), make sure that they run the same graph. Otherwise, executables will be loaded onto the IPU repeatedly. This will be visible in your logging, together with the respective time measurements.
If you have different workloads, either combine them into one graph or distribute the graphs across different IPUs.
Relying solely on the with ipu_scope("/device:IPU:0"): statement has a high chance of creating different computational graphs. A better approach is to combine the whole computation into one graph and compile it with the ipu.ipu_compiler.compile method, as described in the model training documentation.
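A minimal sketch of this pattern (the computation and shapes are arbitrary placeholders for your own model):

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope

cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)


def my_net(a, b):
    # The whole computation lives in one function and is compiled
    # into a single executable.
    return a + b, a * b


with tf.device("cpu"):
    pa = tf.placeholder(np.float32, [2], name="a")
    pb = tf.placeholder(np.float32, [2], name="b")

with ipu_scope("/device:IPU:0"):
    out = ipu.ipu_compiler.compile(my_net, inputs=[pa, pb])

with tf.Session() as sess:
    print(sess.run(out, feed_dict={pa: [1., 1.], pb: [2., 3.]}))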
Alternatively, higher-level APIs can be used, such as estimators or, in TensorFlow 2, the combination of tf.function and the IPUStrategy.
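As a sketch of the TensorFlow 2 pattern (assuming the Graphcore TensorFlow 2 port; the exact configuration calls and the location of IPUStrategy can vary between SDK releases):

import tensorflow as tf
from tensorflow.python import ipu

# Configure a single IPU (see the code example at the end for the
# TensorFlow 1 equivalent).
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

strategy = ipu.ipu_strategy.IPUStrategy()

with strategy.scope():

    @tf.function
    def my_net(a, b):
        return a + b

    # Repeated calls with the same input shapes and dtypes reuse one
    # compiled executable; new shapes trigger a retrace and a recompile.
    result = strategy.run(my_net, args=(tf.ones([2, 2]), tf.ones([2, 2])))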
For further details on how to wrap computational graphs, see the Graphcore TensorFlow documentation.
5. Consideration 3: Batch size
Calculate the correct batch size and keep it constant.
Each change in batch size causes a new graph construction and a new executable. This is because the computational graph is static, so the batch size must be known at compile time: the execution instructions loaded onto the device depend on it directly, since the program loops over the data and needs to know the number of repetitions in advance. Furthermore, a different batch size requires a different distribution of the processing across the tiles in order to benefit from the synergies of larger batch sizes and obtain high efficiency.
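If your dataset size is not a multiple of the chosen batch size, one common mitigation (a sketch, not part of the original example) is to pad the final partial batch up to the fixed batch size, so that every execution sees the same shape, and then discard the outputs for the padded rows:

import numpy as np

BATCH_SIZE = 32  # chosen once and kept constant for the whole run


def pad_to_batch(batch, batch_size=BATCH_SIZE):
    # Pad a partial batch (e.g. the last one in an epoch) with zeros up
    # to the fixed batch size so the compiled executable can be reused.
    # Returns the padded batch and the number of valid rows.
    n = batch.shape[0]
    if n == batch_size:
        return batch, n
    pad = np.zeros((batch_size - n,) + batch.shape[1:], dtype=batch.dtype)
    return np.concatenate([batch, pad], axis=0), n

The outputs for the padded rows are then simply dropped, for example with outputs[:n].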
6. Consideration 4: Weights
Keep weights and graphs separate.
Freezing weights into the graph causes recompilation, and slows down processing, whenever the weights change. The reasons are the same as discussed above for Consideration 2.
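As an illustrative sketch (the names here are arbitrary): keeping the weights in a tf.Variable means that an update only changes the data loaded onto the device, whereas freezing them into the graph as constants changes the graph itself and forces a recompile:

import numpy as np
import tensorflow as tf

x = tf.placeholder(np.float32, [None, 2], name="x")

# Weights kept as a variable, separate from the graph structure.
w = tf.get_variable("w", shape=[2, 2], dtype=tf.float32,
                    initializer=tf.zeros_initializer())
y = tf.matmul(x, w)

new_w = tf.placeholder(np.float32, [2, 2], name="new_w")
update_w = w.assign(new_w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Updating the weights does not change the compiled graph.
    sess.run(update_w, feed_dict={new_w: np.eye(2, dtype=np.float32)})
    print(sess.run(y, feed_dict={x: [[1., 2.]]}))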
7. Consideration 5: Constants
In addition to the weights, you might also have other parameters. Where possible, these parameters should be handled as tf.constant or tf.placeholder (in TensorFlow 1); otherwise, changing them will always result in different graphs. Note that this is not possible with all parameters: for some, like the batch size, a change will result in a different computational graph and will require recompilation. On the other hand, parameters such as limits in while loops can be handled as constants.
The advantage of this method is that the graph is compiled more generically, and the respective values are then loaded into the executable without recompilation. However, if you change the parameters within a program run, you will still see different executables being loaded.
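For example, a while-loop limit can be fed as a scalar placeholder (a minimal sketch; as noted above, changing the value within a run still means a different executable is loaded):

import numpy as np
import tensorflow as tf

# The loop limit is a run-time input, so changing it does not change
# the structure of the compiled graph.
limit = tf.placeholder(np.int32, [], name="limit")

i0 = tf.constant(0)
count = tf.while_loop(lambda i: i < limit, lambda i: i + 1, [i0])

with tf.Session() as sess:
    print(sess.run(count, feed_dict={limit: 5}))   # prints 5
    print(sess.run(count, feed_dict={limit: 10}))  # prints 10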
8. Consideration 6: Deep dive
If none of these approaches apply to your problem, or your program is too complex to spot the source of the recompilation, a last resort in TensorFlow is to compare the XLA dump text files (*.txt) generated by setting the --xla_dump_to option in the XLA_FLAGS environment variable to the respective folder, as in the code example below.
Make sure that you get XLA dumps of the different executions that should share the same executable but cause recompilation. Check that the dumps don’t get overwritten, and select the largest files for the comparison; a visual diff tool such as diffmerge can help you visualise any differences.
Usually there will be a clear difference in the patterns, such as a variable that has different values between the two variants.
If there is no clear difference then the wrong files may have been chosen for comparison.
If you have frozen weights as constants in the graph, and those are the only thing differing between the executions, this approach might not help because only low-dimensional weights get displayed. Larger arrays of constants might also cause issues. Other variables are usually well supported.
9. Code example
This example code addresses the different considerations. It is written for
TensorFlow 1.15 but the aforementioned principles also apply to TensorFlow 2.
In TensorFlow 2, tf.constant can be used instead of tf.placeholder.
1"""Tutorial code to play around with graph recompilation and executable loading
2
3Parameters to play around with are CACHING, NOMULTISESSION, PLACEHOLDER,
4and SAMEBATCH. Some comments in the document refer to the underlying tutorial
5in the documentation portal.
6
7The code will print out what the expected behaviour should look like.
8"""
9
10import os
11import numpy as np
12
13import tensorflow as tf
14from tensorflow.python import ipu
15from tensorflow.python.ipu.scopes import ipu_scope
16
17# Consideration 0: Environment setup
18CACHING = True # Cache compiled graph. The folder is tmp_tutorial.
19# Consideration 1: Sessions
20NOMULTISESSION = True # Avoid using different sessions.
21# Consideration 2, 4, 5: Graphs, Weights, Constants
22# Use a placeholder that is handed over to the graph instead of a hard coded
23# hyperparameter that might change between executions.
24PLACEHOLDER = True
25# Consideration 3: Batch size
26SAMEBATCH = True # Change the batch size between executions.
27
28# Consideration 0: Environment setup
29if "TF_POPLAR_FLAGS" in os.environ and not CACHING:
30 os.environ["TF_POPLAR_FLAGS"] = ""
31else:
32 os.environ["TF_POPLAR_FLAGS"] = "--executable_cache_path=tmp_tutorial"
33if "POPLAR_LOG_LEVEL" not in os.environ or \
34 os.environ["POPLAR_LOG_LEVEL"] != "INFO":
35 print("Setting POPLAR_LOG_LEVEL to INFO for graph compilation information.")
36 os.environ["POPLAR_LOG_LEVEL"] = "INFO"
37
38# Consideration 6
39os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
40 np.random.randint(2, 101))
41os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
42os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
43os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
44
45# Configure arguments for targeting the IPU
46cfg = ipu.utils.create_ipu_config()
47cfg = ipu.utils.auto_select_ipus(cfg, 1)
48ipu.utils.configure_ipu_system(cfg)
49
50with tf.device("cpu"):
51 pa = tf.placeholder(np.float32, [None, 2], name="a")
52 pb = tf.placeholder(np.float32, [None, 2], name="b")
53 pc = tf.placeholder(np.float32, [None, 2], name="c")
54
55if PLACEHOLDER:
56 mult = tf.placeholder(np.float32, [], name="multiplier")
57else:
58 mult = np.random.uniform(0, 1)
59
60
61def basic_graph(pa, pb, pc):
62 # Do basic addition with tensors
63 o1 = pa + pb
64 o2 = pa + pc
65 simple_graph_output = mult * (o1 + o2)
66 return simple_graph_output
67
68
69with ipu_scope("/device:IPU:0"):
70 comp_graph = basic_graph(pa, pb, pc)
71
72print("\nWarm up & Caching Test: ")
73print("No compilation after first execution expected but executable load. \n")
74with tf.Session() as sess1, tf.Session() as sess2:
75 # Run the graph through the session feeding it an arbitrary dictionary
76 if PLACEHOLDER:
77 result0 = sess1.run(comp_graph,
78 feed_dict={
79 pa: [[1., 1.]],
80 pb: [[0., 1.]],
81 pc: [[1., 5.]],
82 mult: 10.0
83 })
84 else:
85 result0 = sess1.run(comp_graph,
86 feed_dict={
87 pa: [[1., 1.]],
88 pb: [[0., 1.]],
89 pc: [[1., 5.]],
90 })
91
92# Consideration 6
93os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
94 np.random.randint(101, 201))
95os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
96os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
97os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
98
99# Consideration 2, 4, 5: Graphs, Weights, Constants
100m = np.random.uniform(0, 1)
101if not PLACEHOLDER:
102 mult = m
103 with ipu_scope("/device:IPU:0"):
104 comp_graph = basic_graph(pa, pb, pc)
105
106with tf.Session() as sess1, tf.Session() as sess2:
107 print("\nPlaceholder test. ")
108 print("No recompilation but executable switch should occur.\n")
109 # Run the graph through the session feeding it an arbitrary dictionary
110 if PLACEHOLDER:
111 result1 = sess1.run(comp_graph,
112 feed_dict={
113 pa: [[1., 1.]],
114 pb: [[0., 1.]],
115 pc: [[1., 5.]],
116 mult: m
117 })
118 else:
119 result1 = sess1.run(comp_graph,
120 feed_dict={
121 pa: [[1., 1.]],
122 pb: [[0., 1.]],
123 pc: [[1., 5.]],
124 })
125
126 # Consideration 6
127 os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
128 np.random.randint(201, 301))
129 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
130 os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
131 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
132
133 # Consideration 1: Sessions
134 if NOMULTISESSION:
135 sess2 = sess1
136 else:
137 print("Switching session.")
138
139 print("\nSession Test.")
140 print("No recompilation or executable switch should occur.\n")
141 if PLACEHOLDER:
142 result2 = sess2.run(comp_graph,
143 feed_dict={
144 pa: [[1., 1.]],
145 pb: [[0., 1.]],
146 pc: [[1., 5.]],
147 mult: m
148 })
149 else:
150 result2 = sess2.run(comp_graph,
151 feed_dict={
152 pa: [[1., 1.]],
153 pb: [[0., 1.]],
154 pc: [[1., 5.]],
155 })
156
157 # Consideration 6
158 os.environ["XLA_FLAGS"] = "--xla_dump_to=tmp_xla_{} ".format(
159 np.random.randint(301, 401))
160 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_pass_re=forward-allocation "
161 os.environ["XLA_FLAGS"] += " --xla_hlo_graph_sharding_color "
162 os.environ["XLA_FLAGS"] += " --xla_dump_hlo_as_text "
163
164 # Consideration 3: Batch size
165 if SAMEBATCH:
166 bs = 1
167 else:
168 bs = np.random.randint(2, 101)
169 print("\nBatch Size Test with batch size %d." % bs)
170 print("No recompilation or executable switch should occur.")
171 print("Batch size should be the original 1.\n")
172 if PLACEHOLDER:
173 result3 = sess2.run(comp_graph,
174 feed_dict={
175 pa: [[1., 1.]] * bs,
176 pb: [[0., 1.]] * bs,
177 pc: [[1., 5.]] * bs,
178 mult: m
179 })
180 else:
181 result3 = sess2.run(comp_graph,
182 feed_dict={
183 pa: [[1., 1.]] * bs,
184 pb: [[0., 1.]] * bs,
185 pc: [[1., 5.]] * bs,
186 })
187
188 print("\nFirst two results should be different (different multiplier).\n")
189 print("Caching/warm up test:\t", result0)
190 print()
191 print("Placeholder test: \t", result1)
192 print()
193 print("Session test: \t", result2)
194 print()
195 if bs > 1:
196 print("Batch size test: \t", result3[:2], "...")
197 else:
198 print("Batch size test: \t", result3)