19. Keras with IPUs

The Graphcore implementation of Keras includes support for the IPU. You create Keras models in the same way as you would when training on other devices. To target the Poplar XLA device, the Keras model must be created inside the strategy.scope of an IPUStrategy.

For a more practical walkthrough, see this tutorial about using Keras on the IPU from the Graphcore tutorials repository.

19.1. Single IPU models

You can train, evaluate or run inference on single-IPU models through the Keras APIs as you would with other accelerators, as long as you create the model inside the scope of an IPUStrategy:

import tensorflow as tf
from tensorflow.python import ipu

import keras
from keras.datasets import mnist

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return keras.Sequential([
      keras.layers.Flatten(),
      keras.layers.Dense(256, activation='relu'),
      keras.layers.Dense(128, activation='relu'),
      keras.layers.Dense(10)
  ])


# Create a dataset for the model.
def create_dataset():
  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32, drop_remainder=True)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.repeat().prefetch(16)


dataset = create_dataset()

# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  model = create_model()

  # Compile the model for training.
  model.compile(
      loss=keras.losses.SparseCategoricalCrossentropy(),
      optimizer='rmsprop',
      metrics=["accuracy"],
  )

  model.fit(dataset, epochs=2, steps_per_epoch=100)

19.2. Using steps_per_execution

To reduce Python overhead and maximize the performance of your model, pass the steps_per_execution argument to the compile method. This argument sets the number of batches processed sequentially by one replica in a single execution. This can greatly improve performance because the overhead between steps is removed, which increases IPU utilization.

Ideally, steps_per_execution is equal to the number of steps your model needs to run per replica in order to complete one epoch. Note that it is not possible to fetch intermediate results when steps_per_execution is specified. Model weights are read on the Python host only after all steps of an execution have run on the IPU. If you need to access model weights during an epoch (for example, to save a checkpoint), choose steps_per_execution so that an execution ends at each step where you need the weights.
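One way to pick such a value is to take the greatest common divisor of the number of steps per epoch and the checkpoint interval. The sketch below is plain Python and makes no IPU calls. The helper name choose_steps_per_execution and the checkpoint interval of 500 steps are hypothetical, and the sketch assumes that weights only become readable on the host at multiples of steps_per_execution, as described above.

```python
import math


def choose_steps_per_execution(steps_per_epoch, checkpoint_every):
    """Return the largest step count that divides both arguments.

    Hypothetical helper: with this value, every checkpoint step and the end
    of every epoch coincide with the end of an execution.
    """
    return math.gcd(steps_per_epoch, checkpoint_every)


# MNIST training set: 60000 samples in batches of 32, remainder dropped.
steps_per_epoch = 60000 // 32
checkpoint_every = 500  # assumed checkpoint interval, in steps

steps_per_execution = choose_steps_per_execution(steps_per_epoch,
                                                 checkpoint_every)
print(steps_per_epoch, steps_per_execution)
```

A larger checkpoint interval that shares more factors with the epoch length gives a larger steps_per_execution, and therefore fewer returns to the host.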

Note

In order to achieve best performance, steps_per_execution needs to be set before using fit(), evaluate() and predict(), even if no training is performed.

See the documentation for the compile method for full details.

The example below highlights the usage of steps_per_execution:

import tensorflow as tf
from tensorflow.python import ipu

import keras
from keras.datasets import mnist

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return keras.Sequential([
      keras.layers.Flatten(),
      keras.layers.Dense(256, activation='relu'),
      keras.layers.Dense(128, activation='relu'),
      keras.layers.Dense(10)
  ])


# Create a dataset for the model.
def create_dataset():
  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32, drop_remainder=True)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.prefetch(16)


dataset = create_dataset()

# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  model = create_model()

  # Compile the model for training.
  model.compile(
      loss=keras.losses.SparseCategoricalCrossentropy(),
      optimizer='rmsprop',
      metrics=["accuracy"],
      # Anything between 2 and the length of the dataset would work,
      # but the greater `steps_per_execution` the greater the
      # performance gains.
      steps_per_execution=dataset.cardinality(),
  )

  model.fit(dataset, epochs=2)

19.3. Gradient accumulation

When training, gradient accumulation allows us to simulate bigger batch sizes. This is achieved by accumulating the gradients across multiple batches and then performing a single weight update.

For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica to 4 then this simulates an input batch of size 64.
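The equivalence can be checked numerically without an IPU. The following is a minimal sketch in plain Python for a one-parameter linear model with a mean squared error loss. It assumes the accumulated gradients are averaged over the accumulation steps; whether the IPU implementation sums or averages depends on the options you pass, so treat this as an illustration of the arithmetic only. All names in it are illustrative.

```python
import random


def mean_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

micro_batch_size = 16
accumulation_steps = 4

# Gradient computed on one batch of 64 samples.
full_batch_grad = mean_gradient(w, xs, ys)

# Gradient accumulated over 4 batches of 16 samples, then averaged.
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    stop = start + micro_batch_size
    accumulated += mean_gradient(w, xs[start:stop], ys[start:stop])
accumulated_grad = accumulated / accumulation_steps

print(abs(full_batch_grad - accumulated_grad) < 1e-9)
```

Because the weights are not updated between the four small batches, the averaged result matches the gradient of the single large batch up to floating-point rounding.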

Gradient accumulation can be easily enabled for Keras models created inside of an IPUStrategy by calling the following methods:

`Functional` model	`set_gradient_accumulation_options()`
`Sequential` model	`set_gradient_accumulation_options()`
`Model` subclass	`set_gradient_accumulation_options()`

See the respective API documentation for more details.

Note

When using data-parallelism, the steps_per_execution value the model was compiled with must be an integer multiple of gradient_accumulation_steps_per_replica. Data parallelism is discussed in Section 19.5, Automatic data parallelism.
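A value that satisfies this constraint can be obtained by rounding the number of available steps down to the nearest multiple of the accumulation count. The sketch below is plain Python; the helper name largest_valid_steps_per_execution is hypothetical and the numbers correspond to MNIST with a batch size of 32.

```python
def largest_valid_steps_per_execution(available_steps, accumulation_steps):
    """Round available_steps down to a multiple of accumulation_steps."""
    if accumulation_steps < 1:
        raise ValueError("accumulation_steps must be at least 1")
    return (available_steps // accumulation_steps) * accumulation_steps


# 60000 training samples in batches of 32, remainder dropped.
batches_per_epoch = 60000 // 32
gradient_accumulation_steps_per_replica = 10

steps_per_execution = largest_valid_steps_per_execution(
    batches_per_epoch, gradient_accumulation_steps_per_replica)
print(batches_per_epoch, steps_per_execution)
```

The five batches left over after rounding are not consumed, which is why a dataset is typically truncated to the rounded length before training.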

Note

Not all operations are compatible with gradient accumulation.

The example below highlights the usage of set_gradient_accumulation_options:

import tensorflow as tf
from tensorflow.python import ipu

import keras
from keras.datasets import mnist

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return keras.Sequential([
      keras.layers.Flatten(),
      keras.layers.Dense(256, activation='relu'),
      keras.layers.Dense(128, activation='relu'),
      keras.layers.Dense(10)
  ])


# Create a dataset for the model.
def create_dataset():
  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32, drop_remainder=True)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.prefetch(16)


dataset = create_dataset()

# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  model = create_model()

  # `steps_per_execution` must be divisible by `gradient_accumulation_steps_per_replica`.
  # Say we want to accumulate 10 steps before doing a weight update, then we would end up
  # with the following values.
  gradient_accumulation_steps_per_replica = 10
  number_of_accumulated_steps = dataset.cardinality(
  ) // gradient_accumulation_steps_per_replica

  # In order to get the proper `steps_per_execution` value, we have to multiply
  # `number_of_accumulated_steps` with `gradient_accumulation_steps_per_replica`.
  steps_per_execution = number_of_accumulated_steps * \
                        gradient_accumulation_steps_per_replica

  # Now we need to truncate the dataset so Keras will not try to take more data
  # from the dataset than is available.
  dataset = dataset.take(steps_per_execution)

  # Compile the model for training.
  model.compile(
      loss=keras.losses.SparseCategoricalCrossentropy(),
      optimizer='rmsprop',
      metrics=["accuracy"],
      steps_per_execution=steps_per_execution,
  )

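  # Reuse the value computed above so the two settings cannot drift apart.
  model.set_gradient_accumulation_options(
      gradient_accumulation_steps_per_replica=gradient_accumulation_steps_per_replica)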

  model.fit(dataset, epochs=2)

19.4. Model parallelism

The models described so far occupy a single IPU device. However, some models might require the model layers to be split across multiple IPU devices to achieve high compute efficiency.

One method to achieve model parallelism is called pipelining, where the model layers are assigned to pipeline stages. Each pipeline stage can be assigned to a different device and different devices can execute in parallel.

By default, these pipeline stages will be executed using the grouped schedule (Fig. 19.1), where the forward and backward stages are grouped together on each IPU. All IPUs alternate between executing a forward pass and then a backward pass.

Fig. 19.1 Grouped pipeline

Two other schedules are available and can be configured as shown in Section 19.4.4, Pipelining options. When using the interleaved schedule (Fig. 19.2) the forward and backward passes are interleaved (which requires less memory but is likely to be slower). The sequential schedule (Fig. 19.3) executes one stage at a time and may be useful when debugging your model.

Fig. 19.2 Interleaved pipeline

Fig. 19.3 Sequential pipeline

A detailed explanation of pipelining can be found in the technical note on Model parallelism with TensorFlow: sharding and pipelining.

The method to pipeline your model depends on whether your model is a Sequential model, a Functional model, or is subclassed from the Model class.

19.4.1. Sequential model

To enable IPU pipelining for a Sequential model (an instance of keras.Sequential), a list of per-layer pipeline stage assignments should be passed to the set_pipeline_stage_assignment() method of the model.

For example, a simple four layer Sequential model could be assigned to two different pipeline stages as follows:

  model = keras.Sequential([
      keras.layers.Dense(8),  # Pipeline stage 0.
      keras.layers.Dense(16),  # Pipeline stage 0.
      keras.layers.Dense(16),  # Pipeline stage 1.
      keras.layers.Dense(1),  # Pipeline stage 1.
  ])

  model.set_pipeline_stage_assignment([0, 0, 1, 1])

You can confirm which layers are assigned to which stages using the print_pipeline_stage_assignment_summary() method of the model.
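For models with many layers, writing the assignment list by hand is error-prone. The sketch below builds the list from the number of layers in each stage. It is plain Python with no IPU calls, and the helper name stage_assignment_from_counts is hypothetical.

```python
def stage_assignment_from_counts(layers_per_stage):
    """Expand per-stage layer counts into a per-layer list of stage indices."""
    assignment = []
    for stage, count in enumerate(layers_per_stage):
        assignment.extend([stage] * count)
    return assignment


# Two layers in stage 0 followed by two layers in stage 1.
assignment = stage_assignment_from_counts([2, 2])
print(assignment)
```

The resulting list has one entry per layer, in layer order, which is the form described above for set_pipeline_stage_assignment().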

19.4.2. Functional model

There are two ways to enable IPU pipelining for a Functional model (an instance of keras.Model), depending on whether you are pipelining a model you are writing yourself or an existing model.

Pipelining a model you are writing yourself

To pipeline a Functional model you are writing yourself, each layer call must happen within the scope of a keras.ipu.PipelineStage context.

For example, a simple four layer Functional model could be assigned to two different pipeline stages as follows:

  input_layer = keras.layers.Input((28, 28))

  with keras.ipu.PipelineStage(0):
    x = keras.layers.Dense(8)(input_layer)
    x = keras.layers.Dense(16)(x)

  with keras.ipu.PipelineStage(1):
    x = keras.layers.Dense(16)(x)
    x = keras.layers.Dense(1)(x)

  model = keras.Model(inputs=input_layer, outputs=x)

Note

Layers constructed within a PipelineStage context will have that pipeline stage assigned to all invocations of the layer. These assignments are overridden if the layer calls happen within a different PipelineStage context.

Pipelining an existing functional model

To pipeline an existing Functional model, you can use get_pipeline_stage_assignment(). Each layer invocation in the model has an associated FunctionalLayerPipelineStageAssignment object, which indicates what pipeline stage that invocation is assigned to. get_pipeline_stage_assignment returns a list of these stage assignments, which you can inspect and modify. Note that the list is in post-order, which means the assignments are returned in the order they will be executed.

Once you are done modifying the stage assignments, you should use set_pipeline_stage_assignment() to set them on the model.

For example, a naive way of pipelining ResNet50 would be to assign everything up until the “conv4_block2_add” layer invocation to the first stage, then everything else to the second stage, as follows:

strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():

  model = resnet.ResNet50(weights='imagenet')

  # Get the individual assignments - note that they are returned in post-order.
  assignments = model.get_pipeline_stage_assignment()

  # Iterate over them and set their pipeline stages.
  stage_id = 0
  for assignment in assignments:
    assignment.pipeline_stage = stage_id
    # Split the model on the `conv4_block2_add` layer.
    if assignment.layer.name.startswith("conv4_block2_add"):
      stage_id = 1

  # Set the assignments to the model.
  model.set_pipeline_stage_assignment(assignments)

  model.print_pipeline_stage_assignment_summary()
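The loop in this example assigns the stage before it checks the layer name, so the conv4_block2_add invocation itself lands in the first stage. The sketch below reproduces that logic on stand-in objects so it can be run without an IPU. The SimpleNamespace objects only mimic the layer.name and pipeline_stage attributes used above, and the short list of layer names is illustrative.

```python
from types import SimpleNamespace


def split_after(assignments, layer_name_prefix):
    """Assign stage 0 up to and including the matching layer, then stage 1."""
    stage_id = 0
    for assignment in assignments:
        assignment.pipeline_stage = stage_id
        # Switch stage only after the matching invocation has been assigned.
        if assignment.layer.name.startswith(layer_name_prefix):
            stage_id = 1
    return assignments


# Stand-ins for the assignment objects, listed in execution order.
names = [
    "conv4_block2_3_bn", "conv4_block2_add", "conv4_block2_out",
    "conv4_block3_1_conv"
]
assignments = [
    SimpleNamespace(layer=SimpleNamespace(name=name), pipeline_stage=None)
    for name in names
]

stages = [a.pipeline_stage for a in split_after(assignments, "conv4_block2_add")]
print(stages)
```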

Note

You can use print_pipeline_stage_assignment_summary() to print the pipeline stage assignments of the model’s layer invocations.

Note

This method of assigning pipeline stages can also be used with Functional models you are writing yourself, as well as with Sequential models and Model subclasses using the SequentialExtension and ModelExtension equivalents.

19.4.3. Model subclass

Model subclasses are subclasses of keras.Model which override the call method. There are two ways to enable IPU pipelining for an instance of a Model subclass, depending on whether you are pipelining a model you are writing yourself or an existing model. These are very similar to the methods available for Functional models.

Pipelining a model you are writing yourself

To pipeline a Model subclass you are writing yourself, each layer call must happen within the scope of a keras.ipu.PipelineStage context.

For example, a simple four layer Model subclass could be assigned to four different pipeline stages as follows:

class MyModel(keras.Model):
  def __init__(self):
    super().__init__()
    self.dense_layer_1 = keras.layers.Dense(8)
    self.dense_layer_2 = keras.layers.Dense(8)
    self.concat_layer = keras.layers.Concatenate()
    self.dense_layer_3 = keras.layers.Dense(1)

  def call(self, inputs):
    # Invoke layers inside PipelineStage scopes to assign the layer invocations
    # to the specified pipeline stage.
    with keras.ipu.PipelineStage(0):
      x = self.dense_layer_1(inputs)
    with keras.ipu.PipelineStage(1):
      x1 = self.dense_layer_2(x)
      x2 = self.dense_layer_2(x)
    with keras.ipu.PipelineStage(2):
      x1 = self.dense_layer_2(x1)
      x2 = self.dense_layer_2(x2)
      x = self.concat_layer([x1, x2])
    with keras.ipu.PipelineStage(3):
      x = self.dense_layer_3(x)

    return x

Note

Layers constructed within a PipelineStage context will have that pipeline stage assigned to all invocations of the layer. These assignments are overridden if the layer calls happen within a different PipelineStage context.

Pipelining an existing model

To pipeline an existing Model subclass, you must use get_pipeline_stage_assignment(). Each layer invocation in the model has an associated ModelLayerPipelineStageAssignment object, which indicates what pipeline stage that invocation is assigned to. get_pipeline_stage_assignment() returns a list of these stage assignments, which you can inspect and modify. Note that the list is in post-order, which means the assignments are returned in the order they will be executed.

Once you are done modifying the stage assignments, you should use set_pipeline_stage_assignment() to set them on the model.

Before you can get or set pipeline stage assignments, you must first call keras.Model.build() on your model, specifying the input shapes. This traces the model’s call function using the shapes specified. The resulting graph is what will be used for pipelined execution. You can update the graph by calling build again, though this will invalidate existing pipeline stage assignments if the structure of the updated graph is different.

Note

If you need to specify input dtypes when calling keras.Model.build(), you can pass in keras.Input objects instead of plain shapes.

For example, an existing Model subclass with four layers could be assigned to four different pipeline stages as follows:

  model = ExistingModel()

  # Call build to trace the graph generated by the call function.
  # This step is required before getting or setting pipeline stage assignments.
  model.build((28, 28))

  # Get a blank set of pipeline stage assignments.
  assignments = model.get_pipeline_stage_assignment()

  # Modify the assignments by setting pipeline stages.
  for assignment in assignments:
    if assignment.layer == model.dense_layer_1:
      assignment.pipeline_stage = 0
    elif assignment.layer == model.dense_layer_2 and assignment.node_index < 2:
      assignment.pipeline_stage = 1
    elif assignment.layer == model.dense_layer_2 and assignment.node_index < 4:
      assignment.pipeline_stage = 2
    elif assignment.layer == model.concat_layer:
      assignment.pipeline_stage = 2
    elif assignment.layer == model.dense_layer_3:
      assignment.pipeline_stage = 3

  # Apply the modified assignments back to the model.
  model.set_pipeline_stage_assignment(assignments)

Note

You can use print_pipeline_stage_assignment_summary() to print the pipeline stage assignments of the model’s layer invocations.

Note

This method of assigning pipeline stages can also be used with Model subclasses you are writing yourself, as well as with Functional and Sequential models using the SequentialExtension and FunctionalExtension equivalents.

19.4.4. Pipelining options

Pipelining options can be set with the following methods:

`Functional` model	`set_pipelining_options()`
`Sequential` model	`set_pipelining_options()`
`Model` subclass	`set_pipelining_options()`

See the respective API documentation for more details.

Gradient accumulation is always used when training a pipelined model (unless using the Sequential schedule). This means that you must set the option gradient_accumulation_steps_per_replica using this API when using the Grouped or Interleaved schedule. It is optional when using the Sequential schedule.

The API documentation for set_pipelining_options explains that the additional keyword arguments (pipelining_kwargs) will be forwarded to the tensorflow.python.ipu.pipelining_ops.pipeline() operator (which is used internally - see Section 19.11, Implementation details). Refer to the API documentation for pipeline() for details about these arguments.

The code sample below illustrates how options can be set with the set_pipelining_options API.

  model.set_pipelining_options(
      gradient_accumulation_steps_per_replica=16,
      pipeline_schedule=ipu.ops.pipelining_ops.PipelineSchedule.Interleaved)

19.5. Automatic data parallelism

IPU TensorFlow supports automatic data parallelism when multiple IPU devices are configured with the system. Automatic data parallelism is achieved by model replication across available IPU devices. The number of times the model is replicated is called the replication factor; higher replication factors allow higher data throughput.

When replicating, gradients are reduced across replicas during training, which has implications for gradient accumulation. For a non-replicated model, the effective batch size is the product of the dataset batch size and the number of gradient accumulation steps. With a replication factor greater than one, the effective batch size is additionally scaled by the replication factor according to the following formula:

effective_batch_size = dataset_batch_size * gradient_accumulation_steps_per_replica * num_replicas
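As a worked example of the formula, the sketch below evaluates it for a batch size of 16 and 4 accumulation steps, first without replication and then with an assumed replication factor of 2. The function is a plain Python transcription of the formula and is not part of the IPU API.

```python
def effective_batch_size(dataset_batch_size,
                         gradient_accumulation_steps_per_replica,
                         num_replicas=1):
    """Number of samples that contribute to a single weight update."""
    return (dataset_batch_size * gradient_accumulation_steps_per_replica *
            num_replicas)


single_replica = effective_batch_size(16, 4)
two_replicas = effective_batch_size(16, 4, num_replicas=2)
print(single_replica, two_replicas)
```

If you increase the replication factor and want to keep the effective batch size unchanged, you can reduce the number of accumulation steps or the dataset batch size by the same factor.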

19.6. Asynchronous callbacks

IPU TensorFlow supports the use of Callback objects with the Keras APIs. However, there is an important difference to note when specifying steps_per_execution. In IPU TensorFlow, if steps_per_execution is specified for your model, then per-batch callback functions will only be invoked every steps_per_execution steps, which can have the effect of delaying access to results.

However, IPU TensorFlow also supports asynchronous callbacks by providing a polling mechanism which allows results to be accessed at the earliest possible opportunity. Asynchronous callbacks can be enabled by passing True to the following methods:

`Functional` model	`set_asynchronous_callbacks()`
`Sequential` model	`set_asynchronous_callbacks()`
`Model` subclass	`set_asynchronous_callbacks()`

See the respective API documentation for more details.
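To make the delay concrete, the sketch below lists the steps after which per-batch callbacks would run when asynchronous callbacks are not enabled. It is plain Python and only models the "every steps_per_execution steps" behaviour described in this section; the function name and the step counts are hypothetical.

```python
def synchronous_callback_steps(total_steps, steps_per_execution):
    """Steps (counted from 1) after which per-batch callbacks are invoked."""
    return list(
        range(steps_per_execution, total_steps + 1, steps_per_execution))


# With 12 steps and steps_per_execution=4, results surface three times.
delayed = synchronous_callback_steps(12, 4)
# With steps_per_execution=1, callbacks run after every step.
every_step = synchronous_callback_steps(12, 1)
print(delayed, len(every_step))
```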

19.7. Configuring Infeeds and Outfeeds

Keras models created inside of an IPUStrategy scope automatically create IPUInfeedQueue and OutfeedQueue data queues for efficiently feeding data to and from the IPU devices when using fit(), evaluate() and predict().

Instances of IPUInfeedQueue and OutfeedQueue can be created with optional arguments which can affect performance of the model.

Use the following methods to configure the IPUInfeedQueue for your Keras model:

`Functional` model	`set_infeed_queue_options()`
`Sequential` model	`set_infeed_queue_options()`
`Model` subclass	`set_infeed_queue_options()`

Use the following methods to configure the OutfeedQueue for your Keras model:

`Functional` model	`set_outfeed_queue_options()`
`Sequential` model	`set_outfeed_queue_options()`
`Model` subclass	`set_outfeed_queue_options()`

For example, the prefetch_depth parameter of the IPUInfeedQueue and the buffer_depth parameter of the OutfeedQueue can be configured as follows:

from tensorflow.python import ipu

import keras

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return keras.Sequential([
      keras.layers.Flatten(),
      keras.layers.Dense(256, activation='relu'),
      keras.layers.Dense(128, activation='relu'),
      keras.layers.Dense(10)
  ])


# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():

  model = create_model()

  # Set the infeed and outfeed options.
  model.set_infeed_queue_options(prefetch_depth=2)
  model.set_outfeed_queue_options(buffer_depth=2)

19.8. Saving and loading Keras models

Saving and loading a Keras model must be done within the IPUStrategy scope in order to save/load IPU-specific information.

When saving and loading Model subclasses, make sure to save and restore class members, such as layers, via the config. This can be done by overriding the get_config and from_config methods. Re-creating members from scratch can cause errors, as the original members may be restored as part of the IPU-specific internal state.

Note

The arguments pipelining_kwargs from set_pipelining_options() and gradient_accumulation_optimizer_kwargs from set_gradient_accumulation_options() are not serializable, which means that when the model is being saved, their values are not saved. When restoring/loading a model, call set_pipelining_options() or set_gradient_accumulation_options() again.

19.9. Exporting precompiled Keras models for TensorFlow Serving

There are two ways of exporting Keras models for TensorFlow Serving, independent of whether they’re pipelined or not. Keras models can be exported using the tensorflow.python.ipu.serving.export_keras() function. This takes only three arguments: the model to export, a directory where the SavedModel will be stored and, optionally, a batch size value. The other way uses the model’s export_for_ipu_serving() method which takes only the path to the SavedModel directory and, optionally, a batch size value.

It’s important to note that before exporting the model you must build it, providing the input shapes to the model’s build() method. As when exporting non-Keras models, you can set the iterations parameter by calling the model’s compile() method with the steps_per_execution argument. The parameter has the same meaning as for non-Keras models, both non-pipelined and pipelined. In both cases you can use it to tune the inference latency.

The export_for_ipu_serving() method additionally accepts preprocessing_step and postprocessing_step functions, which are included in the SavedModel graph and executed on the CPU on the server. If all preprocessing and postprocessing operations are available on the IPU, call the preprocessing and postprocessing functions inside the Keras model instead. The function bodies are then compiled together with the inference model.

Exported models contain Poplar programs compiled for a specific batch size value. Because of that, you must always provide the batch size value to be used by the exported model. You can do this in two ways:

passing the batch_size argument explicitly to the export function, or
setting the batch size value during model creation and leaving the default value of the batch_size argument.

19.9.1. Non-pipelined Keras model example

This example creates a simple non-pipelined Keras model that adds two inputs together. After that, the model is exported for TensorFlow Serving.

import os
import shutil

import numpy as np
from tensorflow.python import ipu

import keras

# Directory where SavedModel will be written.
saved_model_directory = './my_saved_model_ipu/007'
# Directory should be empty or should not exist.
if os.path.exists(saved_model_directory):
  shutil.rmtree(saved_model_directory)

batch_size = 1
input_shape = (batch_size, 4)
# Number of iterations of the IPU-optimized loop.
iterations = 16

# Configure the IPU for compilation.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.device_connection.enable_remote_buffers = True
cfg.device_connection.type = ipu.config.DeviceConnectionType.ON_DEMAND
cfg.configure_ipu_system()

# Always create Keras models inside an IPU strategy.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Always set `batch_size` if model has explicit input layers.
  input1 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_1")
  input2 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_2")
  x = keras.layers.Add()([input1, input2])
  model = keras.Model(inputs=[input1, input2], outputs=x)

  model.build([input_shape, input_shape])
  # Call compile to set the number of iterations of the inference loop.
  # It can be used to tweak the inference latency.
  model.compile(steps_per_execution=iterations)

# Export as a SavedModel.
runtime_func = model.export_for_ipu_serving(saved_model_directory)
# Alternatively: `runtime_func = ipu.serving.export_keras(model, saved_model_directory)`
print(f"SavedModel written to {saved_model_directory}")

# You can test the exported executable using returned `runtime_func`.
# This should print the numbers from 2 to 17.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  for i in range(iterations):
    input1_data = np.ones(input_shape, dtype=np.float32) * i
    input2_data = np.ones(input_shape, dtype=np.float32) * 2
    print(runtime_func(input1_data, input2_data))

19.9.2. Non-pipelined Keras model example with additional preprocessing and postprocessing steps

This example exports a very simple Keras model with an embedded IPU program that adds two inputs together. The model also performs a preprocessing step (on the IPU) to compute the absolute value of the input tensors and a postprocessing step (on the IPU) to reduce the output.

import os
import shutil

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu

import keras

# Directory where SavedModel will be written.
saved_model_directory = './my_saved_model_ipu/009'
# Directory should be empty or should not exist.
if os.path.exists(saved_model_directory):
  shutil.rmtree(saved_model_directory)

batch_size = 1
input_shape = (batch_size, 4)
# Number of iterations of the IPU-optimized loop.
iterations = 16

# Configure the IPU for compilation.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.device_connection.enable_remote_buffers = True
cfg.device_connection.type = ipu.config.DeviceConnectionType.ON_DEMAND
cfg.configure_ipu_system()


# The preprocessing step is performed fully on the IPU.
def preprocessing_step(lhs_input, rhs_input):
  abs_layer = keras.layers.Lambda(tf.abs)
  return abs_layer(lhs_input), abs_layer(rhs_input)


# The postprocessing step is performed fully on the IPU.
def postprocessing(model_result):
  reduce_layer = keras.layers.Lambda(tf.reduce_sum)
  return reduce_layer(model_result)


# Always create Keras models inside an IPU strategy.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Always set `batch_size` if model has explicit input layers.
  input1 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_1")
  input2 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_2")

  x = keras.layers.Add()(preprocessing_step(input1, input2))
  output = postprocessing(x)

  model = keras.Model(inputs=[input1, input2], outputs=output)

  model.build([input_shape, input_shape])
  # Call compile to set the number of iterations of the inference loop.
  # It can be used to tweak the inference latency.
  model.compile(steps_per_execution=iterations)

# Export as a SavedModel.
runtime_func = model.export_for_ipu_serving(saved_model_directory)
# Alternatively: `runtime_func = ipu.serving.export_keras(model, saved_model_directory)`
print(f"SavedModel written to {saved_model_directory}")

# You can test the exported executable using returned `runtime_func`.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  for i in range(iterations):
    input1_data = np.ones(input_shape, dtype=np.float32) * i
    input2_data = np.ones(input_shape, dtype=np.float32) * 2
    print(runtime_func(input1_data, input2_data))
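The values this test loop should print can be worked out by hand. The sketch below recomputes them in plain Python as a reference, assuming that tf.reduce_sum reduces over all elements of the tensor (its default when no axis is given), so that each call returns a single number. It does not use TensorFlow or the IPU.

```python
batch_size, features, iterations = 1, 4, 16

expected = []
for i in range(iterations):
    # Same inputs as the test loop: one filled with i, the other with 2.
    lhs = [[float(i)] * features for _ in range(batch_size)]
    rhs = [[2.0] * features for _ in range(batch_size)]
    # Preprocessing takes absolute values, the model adds the inputs,
    # and postprocessing sums every element of the result.
    total = sum(
        abs(a) + abs(b)
        for lhs_row, rhs_row in zip(lhs, rhs)
        for a, b in zip(lhs_row, rhs_row))
    expected.append(total)

print(expected[0], expected[-1])
```

Under that assumption each result is 4 * (i + 2), so the sequence runs from 8 to 68 in steps of 4.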

This example exports a very simple Keras model with an embedded IPU program that adds two inputs together. The model also performs a preprocessing step (on the CPU) to convert string tensors to floats and a postprocessing step (on the CPU) to compute the absolute value of the outputs.

import os
import shutil

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu

import keras

# Directory where SavedModel will be written.
saved_model_directory = './my_saved_model_ipu/009'
# Directory should be empty or should not exist.
if os.path.exists(saved_model_directory):
  shutil.rmtree(saved_model_directory)

batch_size = 1
input_shape = (batch_size, 6)
# Number of IPU-optimized iterations.
iterations = 16

# Configure the IPU for compilation.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.device_connection.enable_remote_buffers = True
cfg.device_connection.type = ipu.config.DeviceConnectionType.ON_DEMAND
cfg.configure_ipu_system()

# Prepare the `preprocessing_step` function signature.
preprocessing_step_signature = (tf.TensorSpec(shape=input_shape,
                                              dtype=tf.string),
                                tf.TensorSpec(shape=input_shape,
                                              dtype=tf.string))
# Prepare the `postprocessing_step` function signature.
postprocessing_step_signature = (tf.TensorSpec(shape=input_shape,
                                               dtype=np.float32),)


# The preprocessing step is performed fully on the CPU.
@tf.function(input_signature=preprocessing_step_signature)
def preprocessing_step(lhs_input, rhs_input):
  transform_fn = lambda input: tf.constant(
      1.0) if input == "graphcore" else tf.random.uniform(shape=tuple(),
                                                          dtype=np.float32)
  transform_string = lambda input: tf.stack([
      tf.stack([transform_fn(elem) for elem in tf.unstack(rank1)])
      for rank1 in tf.unstack(input)
  ])
  return transform_string(lhs_input), transform_string(rhs_input)


# The postprocessing step is performed fully on the CPU.
@tf.function(input_signature=postprocessing_step_signature)
def postprocessing_step(model_result):
  return tf.abs(model_result)


# Always create Keras models inside an IPU strategy.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Always set `batch_size` if model has explicit input layers.
  input1 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_1")
  input2 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_2")

  x = keras.layers.Add()([input1, input2])

  model = keras.Model(inputs=[input1, input2], outputs=x)

  model.build([input_shape, input_shape])
  # Call `compile` to set the number of iterations of the inference loop.
  # It can be used to tweak the inference latency.
  model.compile(steps_per_execution=iterations)

# Export as a SavedModel.
runtime_func = model.export_for_ipu_serving(
    saved_model_directory,
    preprocessing_step=preprocessing_step,
    postprocessing_step=postprocessing_step)
# Alternatively: `runtime_func = serving.export_keras(
#   model,
#   saved_model_directory,
#   preprocessing_step=preprocessing_step,
#   postprocessing_step=postprocessing_step)`
print(f"SavedModel written to {saved_model_directory}")

# You can test the exported executable using returned `runtime_func`.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  input1_data = tf.constant(
      ["graphcore", "red", "blue", "yellow", "graphcore", "purple"],
      shape=input_shape,
      dtype=tf.string)
  input2_data = tf.constant(
      ["apple", "banana", "graphcore", "orange", "pineapple", "graphcore"],
      shape=input_shape,
      dtype=tf.string)
  print(runtime_func(input1_data, input2_data))
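
The data flow of this example can be approximated on the host without an IPU. The following is a minimal plain-Python sketch, not the exported program itself: the helper names `transform_string`, `model_add` and `postprocess` are hypothetical, and Python's `random.random()` stands in for `tf.random.uniform`.

```python
import random

# Host-side sketch of the listing above (no IPU or TensorFlow required).
# All helper names are illustrative and not part of any Graphcore API.


def transform_string(row):
  # "graphcore" maps to 1.0; any other string maps to a random float in [0, 1).
  return [1.0 if s == "graphcore" else random.random() for s in row]


def model_add(lhs, rhs):
  # Mirrors keras.layers.Add()([input1, input2]).
  return [a + b for a, b in zip(lhs, rhs)]


def postprocess(values):
  # Mirrors tf.abs(model_result).
  return [abs(v) for v in values]


input1_data = ["graphcore", "red", "blue", "yellow", "graphcore", "purple"]
input2_data = ["apple", "banana", "graphcore", "orange", "pineapple", "graphcore"]

lhs = transform_string(input1_data)
rhs = transform_string(input2_data)
output = postprocess(model_add(lhs, rhs))
print(output)
```

Because every non-"graphcore" string maps to a random value, the printed numbers differ between runs. Positions where either input holds "graphcore" are always at least 1.0.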

19.9.3. Pipelined Keras model example

This example creates a simple pipelined Keras model that multiplies two inputs together in the first pipeline stage and then adds the second input to the result of the multiplication in the second pipeline stage. After that, the model is exported for TensorFlow Serving.

Note that building, compiling and exporting look exactly the same for pipelined and non-pipelined models.

import os
import shutil

import numpy as np
from tensorflow.python import ipu

import keras

# Directory where SavedModel will be written.
saved_model_directory = './my_saved_model_ipu/010'
# Directory should be empty or should not exist.
if os.path.exists(saved_model_directory):
  shutil.rmtree(saved_model_directory)

batch_size = 1
input_shape = (batch_size, 4)
iterations = 16

# Configure the IPU for compilation.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 2
cfg.device_connection.enable_remote_buffers = True
cfg.device_connection.type = ipu.config.DeviceConnectionType.ON_DEMAND
cfg.configure_ipu_system()

# Always create Keras models inside an IPU strategy.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Always set `batch_size` if model has explicit input layers.
  input1 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_1")
  input2 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_2")

  with keras.ipu.PipelineStage(0):
    x = keras.layers.Multiply()([input1, input2])

  with keras.ipu.PipelineStage(1):
    x = keras.layers.Add()([x, input2])

  model = keras.Model(inputs=[input1, input2], outputs=x)
  model.set_pipelining_options(device_mapping=[0, 1])

  model.build([input_shape, input_shape])
  # Call compile to set the number of times each pipeline stage is executed.
  # Adjusting this value can reduce the latency slightly.
  model.compile(steps_per_execution=iterations)

# Export as a SavedModel.
runtime_func = model.export_for_ipu_serving(saved_model_directory)
# Alternatively: `runtime_func = serving.export_keras(model, saved_model_directory)`
print("SavedModel written to", saved_model_directory)

# You can test the exported executable using returned runtime_func
# This should print the even numbers 2 to 32.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  for i in range(iterations):
    input1_data = np.ones(input_shape, dtype=np.float32) * i
    input2_data = np.ones(input_shape, dtype=np.float32) * 2
    print(runtime_func(input1_data, input2_data))
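
The expected values can be checked on the host without an IPU. The following is a minimal plain-Python sketch of the arithmetic performed by the two pipeline stages in the listing above (multiply in stage 0, add in stage 1). The helper names `stage_0`, `stage_1` and `run_pipeline` are hypothetical and are not part of the Keras or IPU APIs.

```python
# Host-side sketch of the arithmetic in the listing above (no IPU required).
# All function names here are illustrative, not part of any Graphcore API.


def stage_0(input1, input2):
  # Mirrors keras.layers.Multiply()([input1, input2]).
  return [a * b for a, b in zip(input1, input2)]


def stage_1(x, input2):
  # Mirrors keras.layers.Add()([x, input2]).
  return [a + b for a, b in zip(x, input2)]


def run_pipeline(input1, input2):
  return stage_1(stage_0(input1, input2), input2)


iterations = 16
width = 4  # Matches input_shape[1] in the listing.

results = []
for i in range(iterations):
  input1_data = [float(i)] * width
  input2_data = [2.0] * width
  results.append(run_pipeline(input1_data, input2_data))

for row in results:
  print(row)
```

Each row holds four copies of 2 * i + 2, so the rows run from 2.0 to 32.0 in steps of 2.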

19.9.4. Pipelined Keras model example with additional preprocessing and postprocessing steps

This example creates a simple pipelined Keras model that multiplies two inputs together in the first computational pipeline stage of the model and then adds the second input to the result of the multiplication in the next pipeline stage. The model also performs a preprocessing stage (on the IPU) to compute the absolute value of the inputs and a postprocessing stage (on the IPU) to reduce the output.

import os
import shutil

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu

import keras

# Directory where SavedModel will be written.
saved_model_directory = './my_saved_model_ipu/011'
# Directory should be empty or should not exist.
if os.path.exists(saved_model_directory):
  shutil.rmtree(saved_model_directory)

batch_size = 1
input_shape = (batch_size, 4)
iterations = 16

# Configure the IPU for compilation.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 4
cfg.device_connection.enable_remote_buffers = True
cfg.device_connection.type = ipu.config.DeviceConnectionType.ON_DEMAND
cfg.configure_ipu_system()


# The preprocessing step is performed fully on the IPU.
def preprocessing_step(lhs_input, rhs_input):
  abs_layer = keras.layers.Lambda(tf.abs)
  return abs_layer(lhs_input), abs_layer(rhs_input)


# The postprocessing step is performed fully on the IPU.
def postprocessing(model_result):
  reduce_layer = keras.layers.Lambda(tf.reduce_sum)
  return reduce_layer(model_result)


# Always create Keras models inside an IPU strategy.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Always set `batch_size` if model has explicit input layers.
  input1_ph = keras.layers.Input(shape=input_shape[1:],
                                 batch_size=batch_size,
                                 name="input_1")
  input2_ph = keras.layers.Input(shape=input_shape[1:],
                                 batch_size=batch_size,
                                 name="input_2")
  with keras.ipu.PipelineStage(0):
    input1, input2 = preprocessing_step(input1_ph, input2_ph)

  with keras.ipu.PipelineStage(1):
    x = keras.layers.Multiply()([input1, input2])

  with keras.ipu.PipelineStage(2):
    x = keras.layers.Add()([x, input2])

  with keras.ipu.PipelineStage(3):
    x = postprocessing(x)

  model = keras.Model(inputs=[input1_ph, input2_ph], outputs=x)
  model.set_pipelining_options(device_mapping=[0, 1, 2, 3])

  model.build([input_shape, input_shape])
  # Call compile to set the number of times each pipeline stage is executed.
  # Adjusting this value can reduce the latency slightly.
  model.compile(steps_per_execution=iterations)

# Export as a SavedModel.
runtime_func = model.export_for_ipu_serving(saved_model_directory)
# Alternatively: `runtime_func = serving.export_keras(model, saved_model_directory)`
print("SavedModel written to", saved_model_directory)

# You can test the exported executable using returned runtime_func
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  for i in range(iterations):
    input1_data = np.ones(input_shape, dtype=np.float32) * i
    input2_data = np.ones(input_shape, dtype=np.float32) * 2
    print(runtime_func(input1_data, input2_data))
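
The values such a model produces can be reproduced on the host. The sketch below follows the four pipeline stages in the listing (absolute value, multiply, add, sum) in plain Python. It assumes that both model inputs are passed to the preprocessing stage and that the final reduction sums every element. The function names are hypothetical and are not part of the Keras or IPU APIs.

```python
# Host-side sketch of the four pipeline stages (no IPU required).
# All function names here are illustrative, not part of any Graphcore API.


def preprocess(lhs, rhs):
  # Stage 0: element-wise absolute value of both inputs.
  return [abs(v) for v in lhs], [abs(v) for v in rhs]


def multiply(a, b):
  # Stage 1: mirrors keras.layers.Multiply().
  return [x * y for x, y in zip(a, b)]


def add(a, b):
  # Stage 2: mirrors keras.layers.Add().
  return [x + y for x, y in zip(a, b)]


def postprocess(values):
  # Stage 3: reduce the output to a single sum.
  return sum(values)


def run_model(input1, input2):
  lhs, rhs = preprocess(input1, input2)
  x = multiply(lhs, rhs)
  x = add(x, rhs)
  return postprocess(x)


iterations = 16
width = 4  # Matches input_shape[1] in the listing.

outputs = [
    run_model([float(i)] * width, [2.0] * width) for i in range(iterations)
]
print(outputs)
```

Under these assumptions each result is 4 * (2 * i + 2), so the printed list runs from 8.0 to 128.0 in steps of 8.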

This example creates a simple pipelined Keras model that multiplies two inputs together in the first pipeline stage and then adds the second input to the result of the multiplication in the second pipeline stage. The model also performs a preprocessing step (on the CPU) to convert string tensors to floats and a postprocessing step (on the CPU) to compute the absolute value of the outputs.

import os
import shutil

import numpy as np
import tensorflow as tf
from tensorflow.python import ipu

import keras

# Directory where SavedModel will be written.
saved_model_directory = './my_saved_model_ipu/010'
# Directory should be empty or should not exist.
if os.path.exists(saved_model_directory):
  shutil.rmtree(saved_model_directory)

batch_size = 1
input_shape = (batch_size, 6)
iterations = 16

# Prepare the `preprocessing_step` function signature.
preprocessing_step_signature = (tf.TensorSpec(shape=input_shape,
                                              dtype=tf.string),
                                tf.TensorSpec(shape=input_shape,
                                              dtype=tf.string))
# Prepare the `postprocessing_step` function signature.
postprocessing_step_signature = (tf.TensorSpec(shape=input_shape,
                                               dtype=np.float32),)


# The preprocessing step is performed fully on the CPU.
@tf.function(input_signature=preprocessing_step_signature)
def preprocessing_step(lhs_input, rhs_input):
  transform_fn = lambda input: tf.constant(
      1.0) if input == "graphcore" else tf.random.uniform(shape=tuple(),
                                                          dtype=np.float32)

  transform_string = lambda input: tf.stack([
      tf.stack([transform_fn(elem) for elem in tf.unstack(rank1)])
      for rank1 in tf.unstack(input)
  ])
  return transform_string(lhs_input), transform_string(rhs_input)


# The postprocessing step is performed fully on the CPU.
@tf.function(input_signature=postprocessing_step_signature)
def postprocessing_step(model_result):
  return tf.abs(model_result)


# Configure the IPU for compilation.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 2
cfg.device_connection.enable_remote_buffers = True
cfg.device_connection.type = ipu.config.DeviceConnectionType.ON_DEMAND
cfg.configure_ipu_system()

# Always create Keras models inside an IPU strategy.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Always set `batch_size` if model has explicit input layers.
  input1 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_1")
  input2 = keras.layers.Input(shape=input_shape[1:],
                              batch_size=batch_size,
                              name="input_2")

  with keras.ipu.PipelineStage(0):
    x = keras.layers.Multiply()([input1, input2])

  with keras.ipu.PipelineStage(1):
    x = keras.layers.Add()([x, input2])

  model = keras.Model(inputs=[input1, input2], outputs=x)
  model.set_pipelining_options(device_mapping=[0, 1])

  model.build([input_shape, input_shape])
  # Call compile to set the number of times each pipeline stage is executed.
  # Adjusting this value can reduce the latency slightly.
  model.compile(steps_per_execution=iterations)

# Export as a SavedModel.
runtime_func = model.export_for_ipu_serving(
    saved_model_directory,
    preprocessing_step=preprocessing_step,
    postprocessing_step=postprocessing_step)
# Alternatively: `runtime_func = serving.export_keras(
#   model,
#   saved_model_directory,
#   preprocessing_step=preprocessing_step,
#   postprocessing_step=postprocessing_step)`

print("SavedModel written to", saved_model_directory)

# You can test the exported executable using the returned runtime_func.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  input1_data = tf.constant(
      ["graphcore", "red", "blue", "yellow", "graphcore", "purple"],
      shape=input_shape,
      dtype=tf.string)
  input2_data = tf.constant(
      ["apple", "banana", "graphcore", "orange", "pineapple", "graphcore"],
      shape=input_shape,
      dtype=tf.string)
  print(runtime_func(input1_data, input2_data))

19.10. IPU-specific Keras layers and optimizers

The ipu_tensorflow_addons.keras namespace contains IPU-specific implementations of standard Keras layers and optimizers. For more information, including details of every layer and optimizer in this namespace and a code example showing how to use them, see Section 20, IPU TensorFlow Addons.

19.11. Implementation details

When you instantiate a standard TensorFlow Keras model inside the scope of an IPUStrategy instance, the model is dynamically injected with additional IPU-specific functions. This is done through the relevant IPU Keras extension classes:

`Functional` model	`FunctionalExtension()`
`Sequential` model	`SequentialExtension()`
`Model` subclass	`ModelExtension()`