4. Keras with IPUs

The Graphcore implementation of TensorFlow includes Keras support for IPUs. You create Keras models in the same way as for other devices, with one requirement: to target the Poplar XLA device, the model must be created inside the strategy.scope of an IPUStrategy.

4.1. Single IPU models

You can train, evaluate or run inference on single-IPU models through the Keras APIs as you would with other accelerators, as long as you create the model inside the scope of an IPUStrategy:

import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return tf.keras.Sequential([
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(256, activation='relu'),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])


# Create a dataset for the model.
def create_dataset():
  mnist = tf.keras.datasets.mnist

  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32, drop_remainder=True)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.repeat().prefetch(16)


dataset = create_dataset()

# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  model = create_model()

  # Compile the model for training.
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.RMSprop(),
      metrics=["accuracy"],
  )

  model.fit(dataset, epochs=2, steps_per_epoch=100)

4.2. Using steps_per_execution

To reduce Python overhead and maximize the performance of your model, pass in the steps_per_execution argument to the compile method. This argument sets the number of batches to process sequentially in a single execution. You should increase this number to improve accelerator utilization.

Note

To achieve the best performance, set steps_per_execution before calling fit(), evaluate() or predict(), even if no training is performed.

See the documentation for the compile method for full details.

The example below highlights the usage of steps_per_execution:

import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return tf.keras.Sequential([
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(256, activation='relu'),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])


# Create a dataset for the model.
def create_dataset():
  mnist = tf.keras.datasets.mnist

  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32, drop_remainder=True)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.repeat().prefetch(16)


dataset = create_dataset()

# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  model = create_model()

  # Compile the model for training.
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.RMSprop(),
      metrics=["accuracy"],
      # Anything between 2 and `steps_per_epoch` could help here.
      steps_per_execution=50,
  )

  model.fit(dataset, epochs=2, steps_per_epoch=100)
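
When choosing a value, it can help to list the candidates that split an epoch into a whole number of executions. The sketch below is plain Python and does not touch the IPU. The helper names `candidate_steps_per_execution` and `executions_per_epoch` are hypothetical, and the sketch assumes you want `steps_per_epoch` to be an exact multiple of `steps_per_execution`.

```python
def candidate_steps_per_execution(steps_per_epoch):
  """Return the values in [2, steps_per_epoch] that divide steps_per_epoch."""
  return [n for n in range(2, steps_per_epoch + 1) if steps_per_epoch % n == 0]


def executions_per_epoch(steps_per_epoch, steps_per_execution):
  """Return how many times the host launches the device in one epoch."""
  if steps_per_epoch % steps_per_execution != 0:
    raise ValueError("steps_per_epoch is not a multiple of steps_per_execution")
  return steps_per_epoch // steps_per_execution


# Values taken from the example above: 100 steps per epoch, 50 per execution.
print(candidate_steps_per_execution(100))
print(executions_per_epoch(100, 50))
```

With the values from the example above, each epoch is covered by two executions instead of 100 separate steps.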

4.3. Gradient accumulation

When training, gradient accumulation allows us to simulate bigger batch sizes. The gradients from several batches are accumulated, and the weight update is then performed once using the accumulated gradients.

For example, if we have a model where each step is of batch size 16 and we use a gradient accumulation factor of 4 then this simulates an input batch of size 64.
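
The arithmetic behind this can be checked with a small plain-Python sketch that does not use TensorFlow or an IPU. It assumes a one-parameter model `y = w * x` with a mean squared error loss, and it assumes the accumulated gradients are averaged; how IPU TensorFlow normalizes accumulated gradients is not covered here. All names in the sketch are illustrative.

```python
import random


def mean_gradient(w, batch):
  """Gradient of the mean squared error of y = w * x over one batch."""
  return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)
data = [(random.random(), random.random()) for _ in range(64)]
w = 0.5

# One large batch of 64 samples.
full_batch_gradient = mean_gradient(w, data)

# Four batches of 16 samples: accumulate the gradients, then average once.
accumulation_steps = 4
batch_size = 16
accumulated = 0.0
for step in range(accumulation_steps):
  start = step * batch_size
  accumulated += mean_gradient(w, data[start:start + batch_size])
accumulated_gradient = accumulated / accumulation_steps

# A single weight update would now use accumulated_gradient.
print(abs(full_batch_gradient - accumulated_gradient) < 1e-9)
```

Under these assumptions, the two gradients agree up to floating-point rounding, which is why four steps of batch size 16 behave like one step of batch size 64.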

Gradient accumulation can be enabled for Keras models created inside an IPUStrategy by calling the set_gradient_accumulation_options() method, which is available on both Functional and Sequential Keras models. See the respective method documentation for more details.

Note

When using data parallelism, the steps_per_execution value the model was compiled with must be an integer multiple of gradient_accumulation_steps_per_replica multiplied by the number of replicas in the model. Data parallelism is discussed in Section 4.5, Automatic data parallelism.
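
The constraint in this note can be checked before compiling the model. The following is a minimal plain-Python sketch; `is_valid_steps_per_execution` is a hypothetical helper, not part of the IPU TensorFlow API.

```python
def is_valid_steps_per_execution(steps_per_execution,
                                 gradient_accumulation_steps_per_replica,
                                 num_replicas=1):
  """Check that steps_per_execution is an integer multiple of
  gradient_accumulation_steps_per_replica * num_replicas."""
  required_multiple = gradient_accumulation_steps_per_replica * num_replicas
  return steps_per_execution % required_multiple == 0


# 50 is a multiple of 10 * 1, so this combination satisfies the constraint.
print(is_valid_steps_per_execution(50, 10, num_replicas=1))
# 50 is not a multiple of 10 * 2, so this combination does not.
print(is_valid_steps_per_execution(50, 10, num_replicas=2))
```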

Note

Not all operations are compatible with gradient accumulation.

The example below highlights the usage of set_gradient_accumulation_options:

import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU device.
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()


# Create a simple model.
def create_model():
  return tf.keras.Sequential([
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(256, activation='relu'),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])


# Create a dataset for the model.
def create_dataset():
  mnist = tf.keras.datasets.mnist

  (x_train, y_train), (_, _) = mnist.load_data()
  x_train = x_train / 255.0

  train_ds = tf.data.Dataset.from_tensor_slices(
      (x_train, y_train)).shuffle(10000).batch(32, drop_remainder=True)
  train_ds = train_ds.map(lambda d, l:
                          (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))

  return train_ds.repeat().prefetch(16)


dataset = create_dataset()

# Create a strategy for execution on the IPU.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  model = create_model()

  # Compile the model for training.
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.RMSprop(),
      metrics=["accuracy"],
      steps_per_execution=50,
  )

  model.set_gradient_accumulation_options(
      gradient_accumulation_steps_per_replica=10)

  model.fit(dataset, epochs=2, steps_per_epoch=100)

4.4. Model parallelism

The models described so far occupy a single IPU device. However, some models might need their layers to be split across multiple IPU devices to achieve high compute efficiency.

One method to achieve model parallelism is called pipelining, where the model layers are assigned to pipeline stages. Each pipeline stage can be assigned to a different device and different devices can execute in parallel.

The method to pipeline your model depends on whether your model is a Sequential or a Functional model.

4.4.1. Sequential model

To enable IPU pipelining for a Sequential model (an instance of tensorflow.keras.Sequential), a list of per-layer pipeline stage assignments should be passed to the set_pipeline_stage_assignment() method of the model.

For example, a simple four layer Sequential model could be assigned to two different pipeline stages as follows:

  model = tf.keras.Sequential([
      tf.keras.layers.Dense(8),  # Pipeline stage 0.
      tf.keras.layers.Dense(16),  # Pipeline stage 0.
      tf.keras.layers.Dense(16),  # Pipeline stage 1.
      tf.keras.layers.Dense(1),  # Pipeline stage 1.
  ])

  model.set_pipeline_stage_assignment([0, 0, 1, 1])

You can confirm which layers are assigned to which stages using the print_pipeline_stage_assignment_summary() method of the model.
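
For longer models, writing the per-layer list by hand is error-prone. As a sketch, the list can be generated from the number of layers in each stage. The helper `stages_from_layer_counts` below is hypothetical and is plain Python; only the resulting list is in the form described above.

```python
def stages_from_layer_counts(layers_per_stage):
  """Expand per-stage layer counts into a per-layer stage assignment list."""
  assignment = []
  for stage, count in enumerate(layers_per_stage):
    if count < 1:
      raise ValueError("Each pipeline stage needs at least one layer")
    assignment.extend([stage] * count)
  return assignment


# Two layers in stage 0 and two layers in stage 1.
print(stages_from_layer_counts([2, 2]))  # [0, 0, 1, 1]
```

The resulting list has one entry per layer, so it could be passed to set_pipeline_stage_assignment() for a model with the same number of layers.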

4.4.2. Functional model

There are two ways to enable IPU pipelining for a Functional model (an instance of tensorflow.keras.Model), depending on whether you are pipelining a model you are writing yourself or an existing model.

Pipelining a model you are writing yourself

To pipeline a Functional model you are writing yourself, each layer call must happen within the scope of an ipu.keras.PipelineStage context.

For example, a simple four layer Functional model could be assigned to two different pipeline stages as follows:

  input_layer = tf.keras.layers.Input((28, 28))

  with ipu.keras.PipelineStage(0):
    x = tf.keras.layers.Dense(8)(input_layer)
    x = tf.keras.layers.Dense(16)(x)

  with ipu.keras.PipelineStage(1):
    x = tf.keras.layers.Dense(16)(x)
    x = tf.keras.layers.Dense(1)(x)

  model = tf.keras.Model(inputs=input_layer, outputs=x)

Pipelining an existing functional model

To pipeline an existing Functional model, you can use get_pipeline_stage_assignment(). Each layer invocation in the model has an associated FunctionalLayerPipelineStageAssignment object, which indicates what pipeline stage that invocation is assigned to. get_pipeline_stage_assignment returns a list of these stage assignments, which you can inspect and modify. Note that the list is in post-order, which means the assignments are returned in the order they will be executed.

Once you are done modifying the stage assignments, you should use set_pipeline_stage_assignment() to set them on the model.

For example, a naive way of pipelining ResNet50 would be to assign everything up to and including the “conv4_block2_add” layer invocation to the first stage, and everything after it to the second stage, as follows:

from tensorflow.python import ipu
from tensorflow.keras.applications.resnet50 import ResNet50

strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  model = ResNet50(weights='imagenet')

  # Get the individual assignments - note that they are returned in post-order.
  assignments = model.get_pipeline_stage_assignment()

  # Iterate over them and set their pipeline stages.
  stage_id = 0
  for assignment in assignments:
    assignment.pipeline_stage = stage_id
    # Split the model on the `conv4_block2_add` layer.
    if assignment.layer.name.startswith("conv4_block2_add"):
      stage_id = 1

  # Set the assignments to the model.
  model.set_pipeline_stage_assignment(assignments)

  model.print_pipeline_stage_assignment_summary()

Note

You can use print_pipeline_stage_assignment_summary() to print the pipeline stage assignments of the model’s layer invocations.

Note

This method of pipelining can also be used with Functional models you are writing yourself, as well as Sequential models using the SequentialExtension equivalents.
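
The same name-based split can be extended to more than two stages. The sketch below works on a plain list of layer names in execution order, so it runs without TensorFlow. The function `assign_stages` and the layer names in the usage example are hypothetical.

```python
def assign_stages(layer_names, split_after):
  """Return one stage ID per layer name.

  layer_names: layer names in execution order.
  split_after: name prefixes, in execution order; each matching layer is the
    last layer of its stage.
  """
  remaining = list(split_after)
  stage = 0
  stages = []
  for name in layer_names:
    stages.append(stage)
    if remaining and name.startswith(remaining[0]):
      remaining.pop(0)
      stage += 1
  if remaining:
    raise ValueError("Split layers not found, or out of order: %s" % remaining)
  return stages


names = ["conv1", "conv2_add", "conv3", "conv4_add", "fc"]
print(assign_stages(names, ["conv2_add", "conv4_add"]))  # [0, 0, 1, 1, 2]
```

For a real model, the names would come from the layer of each assignment returned by get_pipeline_stage_assignment(), and each resulting stage ID would be written to the pipeline_stage attribute of the corresponding assignment, as in the ResNet50 example above.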

4.5. Automatic data parallelism

IPU TensorFlow supports automatic data parallelism when multiple IPU devices are configured with the system. Automatic data parallelism is achieved by model replication across available IPU devices. The number of times the model is replicated is called the replication factor; higher replication factors allow higher data throughput.

When replicating, gradients are reduced across replicas during training, which has implications for gradient accumulation. For a non-replicated model, the effective batch size is the product of the dataset batch size and the number of gradient accumulation steps. In the case of a replication factor greater than one, the effective batch size is additionally scaled by the replication factor according to the following formula:

effective_batch_size = dataset_batch_size * gradient_accumulation_steps_per_replica * num_replicas
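
As a worked example, the formula can be written as a small plain-Python function. The function name and default values are illustrative only. The batch size of 32 and the 10 accumulation steps are taken from the gradient accumulation example earlier in this chapter, and the replication factor of 4 is an assumed value.

```python
def effective_batch_size(dataset_batch_size,
                         gradient_accumulation_steps_per_replica=1,
                         num_replicas=1):
  """Apply the effective batch size formula given above."""
  return (dataset_batch_size * gradient_accumulation_steps_per_replica *
          num_replicas)


# Single replica: 32 * 10 * 1.
print(effective_batch_size(32, 10))  # 320
# Four replicas: 32 * 10 * 4.
print(effective_batch_size(32, 10, num_replicas=4))  # 1280
```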

4.6. Asynchronous callbacks

IPU TensorFlow supports the use of Callback objects with the Keras APIs, but there is an important difference to note when specifying steps_per_execution: if steps_per_execution is specified for your model, per-batch callback functions are only invoked every steps_per_execution steps, which can delay access to results.
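
The size of this delay can be illustrated with a simplified plain-Python model of the behaviour described above. It is not IPU TensorFlow code: the function `callback_times` is hypothetical, and it assumes that the callbacks for all batches in an execution run on the host only once that execution has finished.

```python
def callback_times(num_steps, steps_per_execution):
  """Map each batch (1-based) to the step after which its callback can run."""
  times = {}
  for batch in range(1, num_steps + 1):
    # Round up to the end of the execution that contains this batch.
    executions_needed = -(-batch // steps_per_execution)
    times[batch] = executions_needed * steps_per_execution
  return times


times = callback_times(num_steps=100, steps_per_execution=50)
# Batches 1 and 50 are both reported after step 50; batch 51 after step 100.
print(times[1], times[50], times[51])  # 50 50 100
```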

However, IPU TensorFlow also supports asynchronous callbacks by providing a polling mechanism which allows results to be accessed at the earliest possible opportunity. Asynchronous callbacks can be enabled by invoking set_asynchronous_callbacks() with True on your Sequential or Functional Keras model.

4.7. Porting models from TensorFlow 2.1

Previously, IPU TensorFlow included IPU-specific Keras model classes for Functional and Sequential models. These classes no longer exist and must be replaced with their standard Keras counterparts. Specifically, use of the old IPUSequential (or tensorflow.python.ipu.keras.Sequential) class should be changed to tensorflow.keras.Sequential and use of the old IPUModel (or tensorflow.python.ipu.keras.Model) class should be changed to tensorflow.keras.Model.

Any IPU-specific arguments to the old IPU-specific classes (such as gradient_accumulation_count) should also be removed and the behaviour they specify achieved by the means outlined in this document.

For reference, the following table details APIs that have been removed and their replacements:

| TF2.1 | TF2.4 |
| --- | --- |
| ipu.keras.IPUModel / ipu.keras.Model | Removed, use tensorflow.keras.Model |
| ipu.keras.IPUSequential / ipu.keras.Sequential | Removed, use tensorflow.keras.Sequential |
| ipu.keras.PipelineSequential | Removed, use tensorflow.keras.Sequential and set pipeline stages via Sequential.set_pipeline_stage_assignment |
| ipu.keras.PipelineModel | Removed, use tensorflow.keras.Model and set pipeline stages via ipu.keras.PipelineStage or Functional.set_pipeline_stage_assignment |
| gradient_accumulation_count | Removed, set via Sequential.set_gradient_accumulation_options and Model.set_gradient_accumulation_options |
| gradient_accumulation_count (pipelined models) | Removed, set via Sequential.set_pipelining_options and Model.set_pipelining_options |
| gradient_accumulation_dtype | Removed |
| batch_serialization_iterations, pipeline_schedule, recomputation_mode, forward_propagation_stages_poplar_options, backward_propagation_stages_poplar_options, weight_update_poplar_options, offload_weight_update_variables, replicated_optimizer_state_sharding, offload_activations, offload_gradient_accumulation_buffers, replicated_weight_sharding, offload_weights | Set via Sequential.set_pipelining_options and Model.set_pipelining_options |
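
When porting a larger code base, a quick scan for the removed names can show where changes are needed. The following is a rough plain-Python sketch, not a Graphcore tool: `REMOVED_APIS` and `find_removed_apis` are hypothetical, the mapping covers only some rows of the table above, and a simple text search like this can produce false positives.

```python
import re

# Old names from the table above, mapped to a short replacement hint.
REMOVED_APIS = {
    "ipu.keras.IPUModel": "tensorflow.keras.Model",
    "ipu.keras.Model": "tensorflow.keras.Model",
    "ipu.keras.IPUSequential": "tensorflow.keras.Sequential",
    "ipu.keras.Sequential": "tensorflow.keras.Sequential",
    "ipu.keras.PipelineSequential": "tensorflow.keras.Sequential",
    "ipu.keras.PipelineModel": "tensorflow.keras.Model",
    "gradient_accumulation_count":
        "set_gradient_accumulation_options or set_pipelining_options",
}


def find_removed_apis(source):
  """Return (line_number, old_name, hint) for each removed name found."""
  findings = []
  for line_number, line in enumerate(source.splitlines(), start=1):
    for old_name, hint in REMOVED_APIS.items():
      # The trailing \b stops a name matching as a prefix of a longer name.
      if re.search(re.escape(old_name) + r"\b", line):
        findings.append((line_number, old_name, hint))
  return findings


old_code = """model = ipu.keras.PipelineSequential(
    layers, gradient_accumulation_count=16)"""
for finding in find_removed_apis(old_code):
  print(finding)
```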

As an example, the following snippets show equivalent TF2.1 and TF2.4 code for creating and fitting a pipelined Sequential Keras model.

4.7.1. TF2.1

strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Using IPU-specific PipelineSequential model.
  # IPU-specific arguments passed into model constructor.
  model = ipu.keras.PipelineSequential(
      [tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(256, activation='relu'),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dense(10)],
      gradient_accumulation_count=16,
      device_mapping=[0, 0, 1, 1])

  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.RMSprop()
  )

  model.fit(dataset, epochs=2, steps_per_epoch=128)

4.7.2. TF2.4

strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Using standard keras Sequential model.
  model = tf.keras.Sequential([
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(256, activation='relu'),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dense(10)
  ])

  # IPU-specific arguments passed into separate configuration methods.
  model.set_pipeline_stage_assignment([0, 0, 1, 1])

  # Replication factor is 1 in this example.
  model.set_pipelining_options(gradient_accumulation_steps_per_replica=16)

  # steps_per_execution specified to improve performance.
  model.compile(steps_per_execution=128,
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                optimizer=tf.keras.optimizers.RMSprop())

  model.fit(dataset, epochs=2, steps_per_epoch=128)

4.8. Implementation details

When instantiating a standard TensorFlow Keras model inside the scope of an IPUStrategy instance, it is dynamically injected with additional, IPU-specific, functions. This is done through the relevant IPU Keras extension classes. For tensorflow.keras.Sequential, IPU-specific extensions exist in SequentialExtension and for tensorflow.keras.Model in FunctionalExtension.
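
The general idea of injecting extra methods into an existing object can be sketched in plain Python. This shows one generic technique (rebinding an instance to a dynamically created subclass) and is not necessarily the mechanism IPU TensorFlow uses. The classes `PlainSequential` and `ExampleExtension`, and the function `inject_extension`, are stand-ins invented for this sketch.

```python
class PlainSequential:
  """Stand-in for a standard model class (not the real Keras class)."""

  def __init__(self, layers):
    self.layers = list(layers)


class ExampleExtension:
  """Stand-in extension class that contributes extra methods."""

  def set_stage_assignment(self, stages):
    if len(stages) != len(self.layers):
      raise ValueError("Expected one stage per layer")
    self._stages = list(stages)

  def stage_assignment(self):
    return getattr(self, "_stages", None)


def inject_extension(instance, extension):
  """Rebind the instance to a subclass that also inherits from extension."""
  base = instance.__class__
  instance.__class__ = type(base.__name__, (extension, base), {})
  return instance


model = inject_extension(PlainSequential(["a", "b"]), ExampleExtension)
model.set_stage_assignment([0, 1])
print(model.stage_assignment())  # [0, 1]
print(isinstance(model, PlainSequential))  # True
```

The object keeps its original type for isinstance checks, which matches the behaviour described above, where a standard Keras model gains additional functions without changing how it is created.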