12. IPU Outlined Functions

An outlined function is a block of organized, reusable code that performs a single action. Functions improve the modularity of your application and allow a high degree of code reuse, which can decrease memory usage because only one copy of the code needs to be compiled. However, using functions can increase the amount of computation, because the function inputs need to be copied to the correct function argument locations and the function outputs need to be returned as well.

If the provided function contains any stateful operations, such as stateful random number generation, then the function cannot be reused and it will be inlined automatically.

Note that the function code is only reusable for calls on the same IPU. This means that the benefits of function calls will only be seen if the calls are made from the same shard, or from pipeline stages mapped to the same IPU.
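
As a rough illustration of this point, the following sketch counts how many copies of a function's code would be needed under the simplifying assumption that one copy is compiled for each IPU on which the function is called. The `call_sites` data and the `copies_per_function` helper are hypothetical and are not part of the IPU API; the real compiler may make different decisions.

```python
from collections import defaultdict


def copies_per_function(call_sites):
  """Count code copies, assuming one copy per (function, IPU) pair.

  call_sites is a list of (function_name, ipu_index) tuples. This is a
  simplified model for illustration, not the compiler's actual behaviour.
  """
  ipus_used = defaultdict(set)
  for function_name, ipu_index in call_sites:
    ipus_used[function_name].add(ipu_index)
  return {name: len(ipus) for name, ipus in ipus_used.items()}


# Four calls to the same block, all on IPU 0: the code can be shared.
same_ipu = [("block", 0), ("block", 0), ("block", 0), ("block", 0)]
# Four calls spread over two IPUs: each IPU needs its own copy.
two_ipus = [("block", 0), ("block", 0), ("block", 1), ("block", 1)]

print(copies_per_function(same_ipu))  # {'block': 1}
print(copies_per_function(two_ipus))  # {'block': 2}
```

Under this model, spreading identical calls across more IPUs does not reduce the code size on any one of them.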

IPU outlined functions should not be confused with tf.function, which creates a TensorFlow graph. An IPU outlined function creates a Poplar function, and it can be used inside a tf.function.

12.1. Usage

The Python function provided can only take a list of positional arguments. All of the arguments must be tf.Tensor-like objects, or be convertible to them (for example, constants). Objects that are not tf.Tensor-like can still be accessed by the function through Python closure capture.
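
The calling convention described above can be illustrated without TensorFlow. In the sketch below, `positional_only` is a hypothetical stand-in decorator (it is not `ipu.outlined_function`) that rejects keyword arguments, and plain floats stand in for tensors. The non-tensor setting `activation` is not passed as an argument; it reaches the inner function through a Python closure.

```python
import functools


def positional_only(fn):
  """Hypothetical stand-in that only accepts positional arguments."""
  @functools.wraps(fn)
  def wrapper(*args, **kwargs):
    if kwargs:
      raise TypeError("only positional arguments are supported")
    return fn(*args)
  return wrapper


def make_layer(activation):
  # `activation` is a Python string, not a tensor-like object, so it is
  # captured by the closure instead of being passed as an argument.
  @positional_only
  def layer(x, w):
    y = x * w
    if activation == "relu":
      y = max(y, 0.0)
    return y
  return layer


relu_layer = make_layer("relu")
linear_layer = make_layer("linear")

print(relu_layer(-2.0, 3.0))    # 0.0
print(linear_layer(-2.0, 3.0))  # -6.0
```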

IPU functions can access TensorFlow variables. However, unless every invocation of the function is meant to use the same variable, a variable_scope should be used.

A variable_scope is not a tf.Tensor-like object and therefore it cannot be passed as an argument. Consider the following function:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  @ipu.outlined_function
  def func(a):
    with tf.variable_scope("vs", use_resource=True):
      w = tf.get_variable(
          "w",
          shape=[64, 64],
          initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
    x = tf.matmul(a, w)
    return x

  partial = func(batch)
  partial = func(partial)
  # ...

In this case, each invocation of the function will use the same variable.

To circumvent this, we can use Python closures to create unique scopes for each invocation of the function:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  # The outer function is just a Python function.
  def func(a, variable_scope_name):
    # The inner function is an IPU function which captures the variable scope
    # name using Python closures to create scopes.
    @ipu.outlined_function
    def f(a):
      with tf.variable_scope(variable_scope_name, use_resource=True):
        w = tf.get_variable(
            "w",
            shape=[64, 64],
            initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
      x = tf.matmul(a, w)
      return x

    return f(a)

  partial = func(batch, "block1")
  partial = func(partial, "block2")
  # ...

Here we wrap the IPU function (f) in a Python function (func), which has an extra argument (the variable scope name). This extra argument is captured by the IPU function f, so each invocation of func results in a different variable being used.

Alternatively, we can explicitly pass the tf.Variables as inputs to the function:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()


def model(batch):
  @ipu.outlined_function
  def func(lhs, rhs):
    x = tf.matmul(lhs, rhs)
    return x

  # Create the variables.
  with tf.variable_scope("vs", use_resource=True):
    w1 = tf.get_variable(
        "w1",
        shape=[64, 64],
        initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
    w2 = tf.get_variable(
        "w2",
        shape=[64, 64],
        initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # Pass the variables as inputs to the function.
  partial = func(batch, w1)
  partial = func(partial, w2)
  # ...

12.2. Examples

Functions can be beneficial in many scenarios, especially where we want to reduce the amount of code generated.

12.2.1. Models with common structures

Models often have common structures or layers residing on the same IPU, where the inputs and outputs have the same shapes and data types. We can create a single function for these common building blocks to reduce the code size.

# Copyright 2020 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# =============================================================================

from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import nn_ops
from tensorflow.python.ipu import normalization_ops
from tensorflow.python.ipu import scopes
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[128, 128]))
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()


# The device side main
def body(x):
  w1 = tf.get_variable(
      "w1",
      shape=[128, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))
  w2 = tf.get_variable(
      "w2",
      shape=[128, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # The model has some repeated structure to it, and we manually convert it into
  # an IPU function
  @ipu.outlined_function
  def func(a, b):
    x = tf.matmul(a, b)
    x = normalization_ops.layer_norm(x)
    x = nn_ops.gelu(x)
    return x

  # Invoke the function twice with different arguments
  x = func(x, w1)
  x = func(x, w2)
  outfeed = outfeed_queue.enqueue(x)
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(tf.global_variables_initializer())
  sess.run(run_loop)
  result = sess.run(dequeue_outfeed)
  print(result)

12.2.2. Serializing large operations

Some operations in the model might generate large intermediate values, which can cause large spikes in memory usage. Such spikes can be reduced by serializing the operation, but this can result in extra code. IPU functions can be used to try to avoid the extra code.

from tensorflow.python import ipu
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import scopes
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# The dataset for feeding the graphs
ds = tf.data.Dataset.from_tensors(tf.constant(1.0, shape=[20000, 64]))
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(ds)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()


# The device side main
def body(x):
  # The model looks as follows:
  # x = a tensor of shape [20000, 64]
  # w = a tensor of shape [64, 128]
  # partial = tf.matmul(x, w) <- output shape is [20000, 128]
  # result = tf.reduce_mean(partial, axis=1) <- output shape is [20000]
  #
  # If the code generated when calculating `partial` and `result` is too large,
  # we can manually serialize the computation and reuse the code
  w = tf.get_variable(
      "w",
      shape=[64, 128],
      initializer=tf.glorot_uniform_initializer(dtype=tf.float32))

  # We are going to serialize along the 0th dimension of x
  x_shape = tf.shape(x)
  # Split the computation into 10 chunks
  NUM_SPLITS = 10
  SLICE_SIZE = x_shape[0] // NUM_SPLITS

  # An IPU function which works on a part of x
  @ipu.outlined_function
  def func(partial_x, w):
    partial = tf.matmul(partial_x, w)
    partial_result = tf.reduce_mean(partial, axis=1)
    return partial_result

  # A list to store the partial results in
  result_slices = []
  # Loop which works on the serialized slices
  for i in range(NUM_SPLITS):
    # Get the slice
    slice_start = i * SLICE_SIZE
    x_slice = tf.slice(x, [slice_start, 0], [SLICE_SIZE, x_shape[1]])
    # Call the function to generate the partial result
    partial_result = func(x_slice, w)
    result_slices.append(partial_result)

  # Combine the partial results into a tensor of shape [20000], matching the
  # unsplit model described above
  result = tf.concat(result_slices, axis=0)

  outfeed = outfeed_queue.enqueue(result)
  return outfeed


def my_net():
  r = loops.repeat(10, body, [], infeed_queue)
  return r


with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# The outfeed dequeue has to happen after the outfeed enqueue
dequeue_outfeed = outfeed_queue.dequeue()

# Configure the hardware
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(tf.global_variables_initializer())
  sess.run(run_loop)
  output = sess.run(dequeue_outfeed)
  print(output)
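
The arithmetic behind this example can be checked on the host. The sketch below is plain Python with no IPU dependency; the helper names (`matmul`, `row_means`, `serialized`) are illustrative, and the small shapes are chosen only to keep it fast. It confirms that splitting the rows of x into chunks, applying the matmul and per-row mean to each chunk, and joining the partial results along dimension 0 reproduces the unsplit result. It also computes the size of the float32 intermediate for the shapes used above.

```python
def matmul(a, b):
  """Multiply two matrices stored as lists of rows."""
  return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
          for row in a]


def row_means(m):
  """Mean over axis 1, one value per row."""
  return [sum(row) / len(row) for row in m]


def serialized(x, w, num_splits):
  """Process x in num_splits row chunks and join the partial results."""
  slice_size = len(x) // num_splits
  result = []
  for i in range(num_splits):
    x_slice = x[i * slice_size:(i + 1) * slice_size]
    result.extend(row_means(matmul(x_slice, w)))
  return result


x = [[float(r + c) for c in range(4)] for r in range(20)]
w = [[float(r * c) for c in range(8)] for r in range(4)]

full = row_means(matmul(x, w))
chunked = serialized(x, w, num_splits=10)
print(chunked == full)  # True

# Size in bytes of the float32 intermediate `partial` for the shapes above.
BYTES_PER_FLOAT32 = 4
print(20000 * 128 * BYTES_PER_FLOAT32)          # 10240000 (unsplit)
print((20000 // 10) * 128 * BYTES_PER_FLOAT32)  # 1024000 (per chunk)
```

With 10 chunks, the intermediate that is live at any one time is a tenth of the unsplit size, which is the memory spike that the serialization is meant to reduce.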