13. IPU Outlined Functions

An outlined function is a block of organized, reusable code which is used to perform a single action. Functions provide better modularity for your application and a high degree of code reuse, which can decrease memory usage because only one copy of the code needs to be compiled. However, using functions can increase the amount of computation, because the function inputs need to be copied to the correct function argument locations and the function outputs need to be returned as well.

If the provided function contains any stateful operations, such as stateful random number generation, then the function cannot be reused and it will be inlined automatically.

Note that the function code is only reusable for calls on the same IPU. This means that the benefits of function calls will only be seen if the calls are made from the same shard, or from pipeline stages mapped to the same IPU.
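As a rough way to picture this, the following sketch counts code copies under a deliberately simplified model: one copy per distinct pairing of a function and an IPU. The call placements are made up for illustration, and the model is an assumption that ignores everything else the compiler does.

```python
# Each entry is (function name, index of the IPU the call is placed on).
# The placements are hypothetical and only serve as an illustration.
calls = [
    ("func", 0),
    ("func", 0),
    ("func", 1),
]

# Simplified model: one copy of the code per distinct (function, IPU) pair.
code_copies = len(set(calls))

# Calls that reuse a copy which already exists on their IPU.
calls_sharing_code = len(calls) - code_copies
```

In this toy placement only the second call on IPU 0 reuses existing code; the call on IPU 1 needs its own copy.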

IPU outlined functions should not be confused with tf.function. A tf.function creates a TensorFlow graph, whereas an IPU outlined function creates a Poplar function, which can be used inside a tf.function.

13.1. Usage

The Python function provided can only take a list of positional arguments. All of the arguments must be tf.Tensor-like objects, or be convertible to them (for example, constants). Objects that are not tf.Tensor-like can still be accessed by the function through Python closure capturing.
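The calling convention described above can be pictured with a small pure-Python sketch. The decorator `positional_only` is a hypothetical stand-in for `ipu.outlined_function` and is not part of any library; it only mimics the rule that arguments are passed positionally. The helper `make_labelled_add` and the string `label` are likewise invented for the example, and plain lists stand in for tensors.

```python
import functools


def positional_only(fn):
  """Hypothetical stand-in for ipu.outlined_function (not a real API).

  It only mimics the rule that the wrapped function is called with
  positional arguments.
  """
  @functools.wraps(fn)
  def wrapper(*args, **kwargs):
    if kwargs:
      raise TypeError("only positional arguments are supported")
    return fn(*args)
  return wrapper


def make_labelled_add(label):
  # `label` is a plain Python string, not a tensor-like object, so it is
  # captured by the closure instead of being passed as an argument.
  @positional_only
  def labelled_add(a, b):
    return label, [x + y for x, y in zip(a, b)]
  return labelled_add


add_block1 = make_labelled_add("block1")
result = add_block1([1.0, 2.0], [3.0, 4.0])
```

The inner function sees `label` without it ever appearing in the argument list, which is the same mechanism the closure-based examples in this section rely on.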

IPU functions can access TensorFlow variables. However, unless every invocation of the function is meant to use the same variable, a variable_scope should be used.

A variable_scope is not a tf.Tensor-like object and therefore cannot be passed as an argument. Consider the following function:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  @ipu.outlined_function
  def func(a):
    with tf.variable_scope("vs", use_resource=True):
      w = tf.get_variable(
          "w",
          shape=[64, 64],
          initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
    x = tf.matmul(a, w)
    return x

  partial = func(batch)
  partial = func(partial)
  # ...

Here, each invocation of the function will use the same variable, because the scope name is fixed inside the function.

To circumvent this, we can use Python closures to create unique scopes for each invocation of the function:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  # The outer function is just a Python function.
  def func(a, variable_scope_name):
    # The inner function is an IPU function which captures the variable scope
    # name using Python closures to create scopes.
    @ipu.outlined_function
    def f(a):
      with tf.variable_scope(variable_scope_name, use_resource=True):
        w = tf.get_variable(
            "w",
            shape=[64, 64],
            initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
      x = tf.matmul(a, w)
      return x

    return f(a)

  partial = func(batch, "block1")
  partial = func(partial, "block2")
  # ...

Here we wrap the IPU function (f) in a Python function (func), which has an extra argument (the variable scope name). This extra argument is captured by the IPU function f, so each invocation of func results in a different variable being captured.
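The difference between a fixed scope name and a captured one can be sketched without TensorFlow. The dictionary `variables` and the helper `get_variable` below are toy stand-ins invented for this illustration; in particular, returning the existing entry when a key is requested again is an assumption made to mirror the sharing described above, not a description of how TensorFlow resolves variables.

```python
# Toy variable store keyed by "scope/name". It is a stand-in for
# TensorFlow's variable scopes, used only to illustrate the naming.
variables = {}


def get_variable(scope, name):
  key = scope + "/" + name
  if key not in variables:
    variables[key] = object()  # placeholder for a real variable
  return variables[key]


def fixed_scope_func():
  # The scope name is hard-coded, so every call resolves to "vs/w".
  return get_variable("vs", "w")


def func(variable_scope_name):
  # The inner function captures variable_scope_name from the enclosing call.
  def f():
    return get_variable(variable_scope_name, "w")
  return f()


shared_a = fixed_scope_func()
shared_b = fixed_scope_func()
w_block1 = func("block1")
w_block2 = func("block2")
```

Both calls to `fixed_scope_func` return the same object, while `func("block1")` and `func("block2")` each get their own entry in the store.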

Alternatively, we can explicitly pass the tf.Variables as inputs to the function:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  @ipu.outlined_function
  def func(lhs, rhs):
    x = tf.matmul(lhs, rhs)
    return x

  # Create the variables.
  with tf.variable_scope("vs", use_resource=True):
    w1 = tf.get_variable(
        "w1",
        shape=[64, 64],
        initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
    w2 = tf.get_variable(
        "w2",
        shape=[64, 64],
        initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # Pass the variables as inputs to the function.
  partial = func(batch, w1)
  partial = func(partial, w2)
  # ...

13.2. Examples

Functions can be beneficial in many scenarios, especially where we want to reduce the amount of code generated.

13.2.1. Models with common structures

Models often have common structures or layers residing on the same IPU, where the inputs and outputs have the same shapes and data types. We can create a single function for these common building blocks to reduce the code size.

# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# =============================================================================

from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import nn_ops
from tensorflow.python.ipu import normalization_ops
from tensorflow.python.ipu import scopes
from tensorflow.python.ipu import utils
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[128, 128]))
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()


# The device side main
def body(x):
  w1 = tf.get_variable(
      "w1",
      shape=[128, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
  w2 = tf.get_variable(
      "w2",
      shape=[128, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # The model has some repeated structure to it, and we manually convert it into
  # an IPU function
  @ipu.outlined_function
  def func(a, b):
    x = tf.matmul(a, b)
    x = normalization_ops.layer_norm(x)
    x = nn_ops.gelu(x)
    return x

  # Invoke the function twice with different arguments
  x = func(x, w1)
  x = func(x, w2)
  outfeed = outfeed_queue.enqueue(x)
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(tf.global_variables_initializer())
  sess.run(run_loop)
  result = sess.run(dequeue_outfeed)
  print(result)


13.2.2. Serializing large operations

Some operations in the model might generate large intermediate values, which can cause large spikes in memory usage. Such spikes can be reduced by serializing the operation, although this can result in extra code. IPU functions can be used to avoid that extra code.

# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# =============================================================================

from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import scopes
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[20000, 64]))
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()


# The device side main
def body(x):
  # The model looks as follows:
  # x = a tensor of shape [20000, 64]
  # w = a tensor of shape [64, 128]
  # partial = tf.matmul(x, w) <- output shape is [20000, 128]
  # result = tf.reduce_mean(partial, axis=1) <- output shape is [20000]
  #
  # If the code generated when calculating `partial` and `result` is too large,
  # we can manually serialize the computation and reuse the code
  w = tf.get_variable(
      "w",
      shape=[64, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # We are going to serialize along the 0th dimension of x
  x_shape = tf.shape(x)
  # Split the computation into 10 chunks
  NUM_SPLITS = 10
  SLICE_SIZE = x_shape[0] // NUM_SPLITS

  # An IPU function which works on a part of x
  @ipu.outlined_function
  def func(partial_x, w):
    partial = tf.matmul(partial_x, w)
    partial_result = tf.reduce_mean(partial, axis=1)
    return partial_result

  # A list to store the partial results in
  result_slices = []
  # Loop which works on the serialized slices
  for i in range(NUM_SPLITS):
    # Get the slice
    slice_start = i * SLICE_SIZE
    x_slice = tf.slice(x, [slice_start, 0], [SLICE_SIZE, x_shape[1]])
    # Call the function to generate the partial result
    partial_result = func(x_slice, w)
    result_slices.append(partial_result)

  # Combine the partial results; stacking gives shape [NUM_SPLITS, SLICE_SIZE]
  result = tf.stack(result_slices)

  outfeed = outfeed_queue.enqueue(result)
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(tf.global_variables_initializer())
  sess.run(run_loop)
  output = sess.run(dequeue_outfeed)
  print(output)

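The arithmetic behind the serialization can be checked on the host with a small pure-Python sketch. The helpers `matmul` and `row_means` and the tiny matrix sizes are made up for this illustration and stand in for the [20000, 64] and [64, 128] tensors; nothing here runs on an IPU. Because each output row depends only on the matching input row, applying the same function to each slice gives the same values as the unsplit computation. The partial results are concatenated here so that the output has the same shape as the unsplit result, whereas the example above stacks them.

```python
def matmul(a, b):
  # Plain-Python matrix product of two lists of rows.
  return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
          for row in a]


def row_means(m):
  # Mean over the last axis, one value per row.
  return [sum(row) / len(row) for row in m]


def func(partial_x, w):
  # The body that would be outlined: matmul followed by a row-wise mean.
  return row_means(matmul(partial_x, w))


# Small made-up sizes standing in for [20000, 64] and [64, 128].
x = [[float(i + j) for j in range(4)] for i in range(12)]
w = [[float(i * j % 5) for j in range(3)] for i in range(4)]

NUM_SPLITS = 3
SLICE_SIZE = len(x) // NUM_SPLITS

# Reference: the whole computation in one call.
full = func(x, w)

# Serialized version: the same function is applied to each slice in turn and
# the partial results are concatenated in order.
serialized = []
for i in range(NUM_SPLITS):
  start = i * SLICE_SIZE
  serialized.extend(func(x[start:start + SLICE_SIZE], w))
```

Only one slice's intermediate product exists at a time in the serialized loop, which is the source of the reduced memory peak described above.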