12. IPU Outlined Functions
An outlined function is a block of organized, reusable code that performs a single action. Functions make an application more modular and allow a high degree of code reuse, which can decrease memory usage because only one copy of the code needs to be compiled. However, using functions can increase the amount of computation, because the function inputs need to be copied to the correct function argument locations and the function outputs need to be returned as well.
If the provided function contains any stateful operations, such as stateful random number generation, then the function cannot be reused and it will be inlined automatically.
Note that the function code is only reusable for calls on the same IPUs. This means that benefits of function calls will only be seen if the function calls are made from the same shard, or a pipeline stage mapped to the same IPU.
IPU outlined functions should not be confused with tf.function, which creates a
TensorFlow graph, whereas the IPU function creates a Poplar function which can
be used inside of tf.function.
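The trade-off between code size and copy overhead described above can be illustrated with a toy cost model. This is only a sketch: the helper `estimate_costs` and all the numbers are hypothetical and use made-up units, not measurements from any IPU or Poplar compiler.

```python
# Toy cost model with made-up units; not measurements from real hardware.
def estimate_costs(body_size, copy_cost, num_calls, outlined):
    """Return the assumed code size and copy overhead for `num_calls` calls."""
    if outlined:
        # One shared copy of the body, but every call copies inputs/outputs.
        return {"code": body_size, "copies": copy_cost * num_calls}
    # Inlined: every call site gets its own copy of the body, no extra copies.
    return {"code": body_size * num_calls, "copies": 0}


body_size = 1000  # hypothetical size of the compiled function body
copy_cost = 50    # hypothetical cost of copying arguments and results

for num_calls in (1, 2, 8):
    inlined = estimate_costs(body_size, copy_cost, num_calls, outlined=False)
    shared = estimate_costs(body_size, copy_cost, num_calls, outlined=True)
    print(num_calls, inlined, shared)
```

Under these assumptions the code size of the outlined version stays constant while its copy overhead grows with the number of calls, which is the shape of the trade-off the text describes.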
12.1. Usage
The Python function provided can only take a list of positional arguments. All
of the arguments must be tf.Tensor-like objects, or be convertible to them
(for example constants).
Other non tf.Tensor-like objects can still be accessed by the function using
Python closure capturing.
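As a minimal sketch of closure capturing, the plain-Python example below keeps only the data inputs in the function signature and captures a configuration dictionary from the enclosing scope. Plain lists stand in for tensors, and the names `make_scaled_add` and `options` are hypothetical. In real code the inner function would be the one decorated with `ipu.outlined_function`.

```python
def make_scaled_add(options):
    # `options` is an ordinary Python dict, not a tensor-like object, so it is
    # captured by the closure instead of being passed as an argument.
    def func(a, b):
        # Only positional, data-carrying inputs appear in the signature.
        return [options["scale"] * (x + y) for x, y in zip(a, b)]

    return func


func = make_scaled_add({"scale": 2.0})
result = func([1.0, 2.0], [3.0, 4.0])
print(result)  # [8.0, 12.0]
```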
IPU functions can access TensorFlow variables. However, unless each function
invocation is meant to use the same variable, a variable_scope should be
used.
A variable_scope is not a tf.Tensor-like object and therefore it cannot be
passed as an argument, so if we used the following function:
import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  @ipu.outlined_function
  def func(a):
    with tf.variable_scope("vs", use_resource=True):
      w = tf.get_variable(
          "w",
          shape=[64, 64],
          initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
    x = tf.matmul(a, w)
    return x

  partial = func(batch)
  partial = func(partial)
  # ...
Each invocation of the function will use the same variable.
To circumvent this, we can use Python closures to create unique scopes for each invocation of the function:
import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  # The outer function is just a Python function.
  def func(a, variable_scope_name):
    # The inner function is an IPU function which captures the variable scope
    # name using Python closures to create scopes.
    @ipu.outlined_function
    def f(a):
      with tf.variable_scope(variable_scope_name, use_resource=True):
        w = tf.get_variable(
            "w",
            shape=[64, 64],
            initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
      x = tf.matmul(a, w)
      return x

    return f(a)

  partial = func(batch, "block1")
  partial = func(partial, "block2")
  # ...
Here we wrap the IPU function (f) in a Python function (func), which has
extra arguments (the variable scope name). These extra arguments can then be
captured by the IPU function f, meaning that each invocation of the
function will result in different variables being captured.
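The difference between a shared scope and per-invocation scopes can be sketched without TensorFlow. The example below uses a hypothetical dictionary-based `get_variable` stand-in that identifies a variable by its scope and name; it is a simplification that ignores TensorFlow's actual reuse rules and only illustrates why distinct scope names lead to distinct variables.

```python
# Hypothetical stand-in for a variable store, keyed by "<scope>/<name>".
variable_store = {}


def get_variable(scope_name, name):
    full_name = scope_name + "/" + name
    # Return the existing variable if one with this full name was created.
    return variable_store.setdefault(full_name, {"name": full_name})


def shared():
    # Fixed scope name: every invocation resolves to the same variable.
    return get_variable("vs", "w")


def make_func(variable_scope_name):
    # The scope name is captured by the closure, as in the example above.
    def f():
        return get_variable(variable_scope_name, "w")

    return f


same_a, same_b = shared(), shared()
block1 = make_func("block1")()
block2 = make_func("block2")()

print(same_a is same_b)        # True: one variable shared by both calls
print(block1 is block2)        # False: each scope has its own variable
print(sorted(variable_store))  # ['block1/w', 'block2/w', 'vs/w']
```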
Alternatively we can explicitly pass the tf.Variables as inputs to the
function:
import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  @ipu.outlined_function
  def func(lhs, rhs):
    x = tf.matmul(lhs, rhs)
    return x

  # Create the variables.
  with tf.variable_scope("vs", use_resource=True):
    w1 = tf.get_variable(
        "w1",
        shape=[64, 64],
        initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
    w2 = tf.get_variable(
        "w2",
        shape=[64, 64],
        initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # Pass the variables as inputs to the function.
  partial = func(batch, w1)
  partial = func(partial, w2)
  # ...
12.2. Examples
Functions can be beneficial in many scenarios, especially where we want to reduce the amount of code generated.
12.2.1. Models with common structures
Models often have common structures or layers residing on the same IPU, where the inputs and outputs have the same shapes and data types. We can create a single function for these common building blocks to reduce the code size.
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# =============================================================================

from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import nn_ops
from tensorflow.python.ipu import normalization_ops
from tensorflow.python.ipu import scopes
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[128, 128]))
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()


# The device side main
def body(x):
  w1 = tf.get_variable(
      "w1",
      shape=[128, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
  w2 = tf.get_variable(
      "w2",
      shape=[128, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # The model has some repeated structure to it, and we manually convert it into
  # an IPU function
  @ipu.outlined_function
  def func(a, b):
    x = tf.matmul(a, b)
    x = normalization_ops.layer_norm(x)
    x = nn_ops.gelu(x)
    return x

  # Invoke the function twice with different arguments
  x = func(x, w1)
  x = func(x, w2)
  outfeed = outfeed_queue.enqueue(x)
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(tf.global_variables_initializer())
  sess.run(run_loop)
  result = sess.run(dequeue_outfeed)
  print(result)
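The condition stated in this section (building blocks whose inputs have the same shapes and data types) can be illustrated by grouping call sites by their input signature. The sketch below is plain Python with hypothetical block names and shapes; it is not how the compiler decides what to share, only a way to see which blocks are candidates for a single function.

```python
from collections import defaultdict

# Hypothetical call sites: block name -> list of (shape, dtype) per input.
calls = {
    "block1": [((128, 128), "float32"), ((128, 128), "float32")],
    "block2": [((128, 128), "float32"), ((128, 128), "float32")],
    "head": [((128, 128), "float32"), ((128, 10), "float32")],
}


def signature(inputs):
    # Two call sites are candidates for sharing only if this key matches.
    return tuple((tuple(shape), dtype) for shape, dtype in inputs)


groups = defaultdict(list)
for name, inputs in calls.items():
    groups[signature(inputs)].append(name)

# Keep only the signatures that occur at more than one call site.
reusable = [names for names in groups.values() if len(names) > 1]
print(reusable)  # [['block1', 'block2']]
```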
12.2.2. Serializing large operations
Some operations in the model might generate large intermediate values, which can cause large spikes in memory usage. Such spikes can be reduced by serializing the operation, but serialization can result in extra code. IPU functions can be used to try to avoid this extra code.
# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# =============================================================================

from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import scopes
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[20000, 64]))
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()


# The device side main
def body(x):
  # The model looks as follows:
  # x = a tensor of shape [20000, 64]
  # w = a tensor of shape [64, 128]
  # partial = tf.matmul(x, w) <- output shape is [20000, 128]
  # result = tf.reduce_mean(partial, axis=1) <- output shape is [20000]
  #
  # If the code generated when calculating `partial` and `result` is too large,
  # we can manually serialize the computation and reuse the code
  w = tf.get_variable(
      "w",
      shape=[64, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # We are going to serialize along the 0th dimension of x
  x_shape = tf.shape(x)
  # Split the computation into 10 chunks
  NUM_SPLITS = 10
  SLICE_SIZE = x_shape[0] // NUM_SPLITS

  # An IPU function which works on a part of x
  @ipu.outlined_function
  def func(partial_x, w):
    partial = tf.matmul(partial_x, w)
    partial_result = tf.reduce_mean(partial, axis=1)
    return partial_result

  # A list to store the partial results in
  result_slices = []
  # Loop which works on the serialized slices
  for i in range(NUM_SPLITS):
    # Get the slice
    slice_start = i * SLICE_SIZE
    x_slice = tf.slice(x, [slice_start, 0], [SLICE_SIZE, x_shape[1]])
    # Call the function to generate the partial result
    partial_result = func(x_slice, w)
    result_slices.append(partial_result)

  # Combine the partial results
  result = tf.stack(result_slices)

  outfeed = outfeed_queue.enqueue(result)
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(tf.global_variables_initializer())
  sess.run(run_loop)
  output = sess.run(dequeue_outfeed)
  print(output)
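The serialization pattern used above relies on the fact that a matrix multiplication followed by a per-row mean can be computed independently on row slices of the input. The following plain-Python sketch checks this on small nested lists. The helpers `matmul` and `row_means` and the tiny shapes are illustrative assumptions, not part of the IPU API.

```python
def matmul(a, b):
    # Naive matrix multiplication on nested lists.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]


def row_means(m):
    # Equivalent of a mean over axis 1.
    return [sum(row) / len(row) for row in m]


def func(partial_x, w):
    # The reusable block: the same code runs on every slice.
    return row_means(matmul(partial_x, w))


x = [[float(i + j) for j in range(4)] for i in range(6)]      # shape [6, 4]
w = [[float(i * j % 3) for j in range(2)] for i in range(4)]  # shape [4, 2]

# Reference: the whole computation in one go.
full = func(x, w)

# Serialized: split x along dimension 0 and process one slice at a time.
NUM_SPLITS = 3
SLICE_SIZE = len(x) // NUM_SPLITS
slices = [func(x[i * SLICE_SIZE:(i + 1) * SLICE_SIZE], w)
          for i in range(NUM_SPLITS)]

# Concatenating the per-slice results reproduces the full result.
serialized = [value for part in slices for value in part]
print(serialized == full)  # True
```

Only one slice of the intermediate product exists at a time in the serialized version, which is what reduces the peak memory in the example above.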