22. TensorFlow Python API

Remember to import the IPU API using:

from tensorflow.python import ipu

You cannot access the IPU API via the top-level tensorflow namespace. For example, this will not work:

import tensorflow as tf
cfg = tf.python.ipu.config.IPUConfig() ...

Note

tensorflow.python.ipu.ipu_strategy.IPUStrategy is an alias of tensorflow.python.ipu.ipu_strategy.IPUStrategyV1.

22.2. Distribution strategy for a single system

class tensorflow.python.ipu.ipu_strategy.IPUExtendedV1(container_strategy, ipu_device, cpu_device)
__init__(container_strategy, ipu_device, cpu_device)
non_slot_devices(var_list)

Device(s) for non-slot variables.

DEPRECATED: TF 1.x ONLY.

This method returns non-slot devices where non-slot variables are placed. Users can create non-slot variables on these devices by using a block:

```python
with tf.distribute.StrategyExtended.colocate_vars_with(tf.distribute.StrategyExtended.non_slot_devices(...)):
  ...
```

Parameters

var_list – The list of variables being optimized, needed with the default tf.distribute.Strategy.

Returns

A sequence of devices for non-slot variables.

property parameter_devices

Returns the tuple of all devices used to place variables.

value_container(value)

Returns the container that this per-replica value belongs to.

Parameters

value – A value returned by run() or a variable created in scope().

Returns

A container that value belongs to. If value does not belong to any container (including the case of container having been destroyed), returns the value itself. value in experimental_local_results(value_container(value)) will always be true.

property worker_devices

Returns the tuple of all devices used for compute replica execution.

tensorflow.python.ipu.ipu_strategy.IPUStrategy

alias of IPUStrategyV1

class tensorflow.python.ipu.ipu_strategy.IPUStrategyV1(ipu_device='/device:IPU:0', cpu_device='/device:CPU:0', enable_dataset_iterators=True, enable_keras_extensions=True)

This is a distribution strategy for targeting a system with one or more IPUs.

Creating variables and Keras models within the scope of the IPUStrategyV1 will ensure that they are placed on the IPU.

A tf.function can be executed on the IPU by calling it from the run function.

Variables will automatically be placed onto the IPUs, but the initializers for the variables will be performed on the CPU device.

from tensorflow.python import ipu

# Create an IPU distribution strategy
strategy = ipu.ipu_strategy.IPUStrategyV1()

with strategy.scope():

    # Instantiate a keras model here
    m = MyModel()

    # And train it
    m.fit(...)

    # Or call a tf.function
    res = strategy.run(my_fn, [...])

__init__(ipu_device='/device:IPU:0', cpu_device='/device:CPU:0', enable_dataset_iterators=True, enable_keras_extensions=True)

Create a new IPUStrategyV1.

Parameters
  • ipu_device – The TensorFlow device representing the IPUs.

  • cpu_device – The TensorFlow device for the CPU.

  • enable_dataset_iterators – Whether to create IPUStrategy-specific dataset iterators inside this strategy scope or to use standard dataset iterators.

  • enable_keras_extensions – Whether to enable IPU-specific Keras extensions to improve Keras performance when using IPUs.

run(fn, args=(), kwargs=None, options=None)

Invokes fn on each replica, with the given arguments.

This method is the primary way to distribute your computation with a tf.distribute object. It invokes fn on each replica. If args or kwargs contain tf.distribute.DistributedValues, such as those produced by a tf.distribute.DistributedDataset from tf.distribute.Strategy.experimental_distribute_dataset or tf.distribute.Strategy.distribute_datasets_from_function, then when fn is executed on a particular replica, it is executed with the component of tf.distribute.DistributedValues that corresponds to that replica.

fn is invoked under a replica context. fn may call tf.distribute.get_replica_context() to access members such as all_reduce. Please see the module-level docstring of tf.distribute for the concept of replica context.

All arguments in args or kwargs can be a nested structure of tensors, e.g. a list of tensors, in which case args and kwargs will be passed to the fn invoked on each replica. Or args or kwargs can be tf.distribute.DistributedValues containing tensors or composite tensors, i.e. tf.compat.v1.TensorInfo.CompositeTensor, in which case each fn call will get the component of a tf.distribute.DistributedValues corresponding to its replica. Note that arbitrary Python values that are not of the types above are not supported.

IMPORTANT: Depending on the implementation of tf.distribute.Strategy and whether eager execution is enabled, fn may be called one or more times. If fn is annotated with tf.function or tf.distribute.Strategy.run is called inside a tf.function (eager execution is disabled inside a tf.function by default), fn is called once per replica to generate a TensorFlow graph, which will then be reused for execution with new inputs. Otherwise, if eager execution is enabled, fn will be called once per replica every step, just like regular Python code.

Example usage:

  1. Constant tensor input.

>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> tensor_input = tf.constant(3.0)
>>> @tf.function
... def replica_fn(input):
...   return input*2.0
>>> result = strategy.run(replica_fn, args=(tensor_input,))
>>> result
PerReplica:{
  0: <tf.Tensor: shape=(), dtype=float32, numpy=6.0>,
  1: <tf.Tensor: shape=(), dtype=float32, numpy=6.0>
}

  2. DistributedValues input.

>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> @tf.function
... def run():
...   def value_fn(value_context):
...     return value_context.num_replicas_in_sync
...   distributed_values = (
...     strategy.experimental_distribute_values_from_function(
...       value_fn))
...   def replica_fn2(input):
...     return input*2
...   return strategy.run(replica_fn2, args=(distributed_values,))
>>> result = run()
>>> result
<tf.Tensor: shape=(), dtype=int32, numpy=4>
  3. Use tf.distribute.ReplicaContext to allreduce values.

>>> strategy = tf.distribute.MirroredStrategy(["gpu:0", "gpu:1"])
>>> @tf.function
... def run():
...    def value_fn(value_context):
...      return tf.constant(value_context.replica_id_in_sync_group)
...    distributed_values = (
...        strategy.experimental_distribute_values_from_function(
...            value_fn))
...    def replica_fn(input):
...      return tf.distribute.get_replica_context().all_reduce("sum", input)
...    return strategy.run(replica_fn, args=(distributed_values,))
>>> result = run()
>>> result
PerReplica:{
  0: <tf.Tensor: shape=(), dtype=int32, numpy=1>,
  1: <tf.Tensor: shape=(), dtype=int32, numpy=1>
}

Parameters
  • fn – The function to run on each replica.

  • args – Optional positional arguments to fn. Its element can be a tensor, a nested structure of tensors or a tf.distribute.DistributedValues.

  • kwargs – Optional keyword arguments to fn. Its element can be a tensor, a nested structure of tensors or a tf.distribute.DistributedValues.

  • options – An optional instance of tf.distribute.RunOptions specifying the options to run fn.

Returns

Merged return value of fn across replicas. The structure of the return value is the same as the return value from fn. Each element in the structure can be either a tf.distribute.DistributedValues object or a Tensor object (for example, if running on a single replica).

22.3. Compiler interface

tensorflow.python.ipu.ipu_compiler.compile(computation, inputs=None)

Builds an operator that compiles and runs computation with the Graphcore IPU XLA backend.

Parameters
  • computation

    A Python function that builds a computation to apply to the input. If the function takes n inputs, inputs should be a list of n tensors.

    computation may return a list of operations and tensors. Tensors must come before operations in the returned list. The return value of compile is a list of tensors corresponding to the tensors from the output of computation.

    All operations returned from computation will be executed when evaluating any of the returned output tensors.

  • inputs – A list of inputs or None (equivalent to an empty list). Each input can be a nested structure containing values that are convertible to tensors. Note that passing an N-dimensional list of compatible values will result in an N-dimensional list of scalar tensors rather than a single rank-N tensor. If you need different behaviour, convert part of inputs to tensors with tf.convert_to_tensor.

Returns

The same data structure as if computation(inputs) were called directly, with the following exceptions for correctness:

  1. None output. A NoOp is returned which control-depends on computation.

  2. Single value output. A tuple containing the value is returned.

  3. Operation-only outputs. A NoOp is returned which control-depends on computation.

Raises

Exception – If the computation was not compiled for an IPU device.

22.4. Scoping contexts

tensorflow.python.ipu.scopes.frontend_attribute(attribute_name, attribute_value, restore_to=None)

Sets the specified scope attribute to the specified value in the graph.

Parameters
  • attribute_name – Name of the attribute.

  • attribute_value – Attribute’s value as a string.

  • restore_to – If the attribute would be undefined at the end of the scope, it is set to this value instead.

Returns

A context

tensorflow.python.ipu.scopes.ipu_jit_scope(ipu_scope)

Provides a scope for compilation of operations.

If you would like to compile several sets of operations together, then this can provide that mechanism.

Parameters

ipu_scope – A name to differentiate between different JIT scopes

Returns

A context

tensorflow.python.ipu.scopes.ipu_scope(device)

Provides a scope for placing operations onto a particular IPU/IPU cluster.

Parameters

device – The name of the TensorFlow device, such as ‘/device:IPU:0’

Returns

A context

tensorflow.python.ipu.scopes.ipu_shard(index)

Control sharding for a set of operations.

Provides a scope which targets operations onto a particular shard (IPU) of a multi-IPU sharded device. Gradients created from these operations will also be put onto the same shard. Consequently an ipu_shard scope enclosing a call to tf.gradients or tf.GradientTape.gradient won’t change the sharding of the backwards ops.

Parameters

index – The index of the IPU on which to place the enclosed operations.

Returns

A context

tensorflow.python.ipu.scopes.outside_compilation_scope(name='outside')

Provides a scope for placing operations on the host, outside the current compilation scope. The operations will be placed on the default host device. This allows for offloading computations from the IPU to the host, which can be useful for operations that are not supported or suitable for execution on the IPU.

Example:

from tensorflow.python.ipu.scopes import ipu_scope, outside_compilation_scope

def my_net(a):
  with ipu_scope("/device:IPU:0"):
    b = a * a
    with outside_compilation_scope():
      c = b + 2  # Placed on the host.
    d = b + c
    return d

Parameters

name – A name for the outside compilation scope.

Returns

A context

tensorflow.python.ipu.scopes.partials_type(override_type)

Override the default type used to store intermediate results by convolution and matrix multiply operations.

EXPERIMENTAL - there are no guarantees that the partials type provided will be used and therefore this should not be used.

Parameters

override_type – Numpy type of the partials (float16 or float32)

Returns

A context

tensorflow.python.ipu.scopes.stochastic_rounding(override)

Control stochastic rounding for a set of operations.

EXPERIMENTAL - there are no guarantees that the stochastic rounding provided will be used and therefore this should not be used.

Parameters

override – if True then stochastic rounding will be used, otherwise it will be disabled for this set of operations.

Returns

A context

22.5. Infeed queue

class tensorflow.python.ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, device_ordinal=None, prefetch_depth=None, optimise_latency=False, **kwargs)

Wraps a tf.data.Dataset object with infeed operations specific to the IPU.

This class, along with tensorflow.python.ipu.loops, is used to create a data pipeline from a dataset into a training/inference loop on the IPU inside a single session.run, which reduces the overhead of calling session.run for each iteration of the loop.

You should pass the infeed queue as an argument to a loop from tensorflow.python.ipu.loops. These loops will then handle the dequeuing of the data to the device automatically.

The following skeleton shows how to use this method when building a training loop. Note how the body signature contains variables which correspond to the nested structure of tf.Tensor objects representing the next element in the infeed queue:

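import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.framework import ops
from tensorflow.python.ops import array_ops, variable_scope, variables
from tensorflow.python.training import gradient_descent

# Create an example dataset.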
dataset = ...  # A `tf.data.Dataset` object.

def dataset_parser(value):
  features, labels = parse_record(value)
  return {"features": features,
          "labels": labels}
# The resulting dataset has a nested structure of: {features, labels}.
dataset = dataset.map(dataset_parser)

infeed_queue = ipu.ipu_infeed_queue.IPUInfeedQueue(dataset)

# dataset can no longer be used beyond this point.

def my_net():
  # Note how the nested structure forms part of the loop body signature.
  def body(loss, features, labels):
    with variable_scope.variable_scope("vs", use_resource=True):
      y = tf.nn.conv2d(features, .....)
      ...
      ...
      logits = tf.nn.xw_plus_b(....)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels))
    optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
    train = optimizer.minimize(loss)
    with ops.control_dependencies([train]):
      return array_ops.identity(loss)

  loss = 0.0
  return ipu.loops.repeat(10000, body, [loss], infeed_queue)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu.ipu_compiler.compile(my_net, inputs=[])

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(variables.global_variables_initializer())
  result = sess.run(res)

__init__(dataset, device_ordinal=None, prefetch_depth=None, optimise_latency=False, **kwargs)

Creates an IPUInfeedQueue object.

Parameters
  • dataset – A tf.data.Dataset object. All transformations (for example, shuffle, repeat and batch) must be applied before the dataset is passed to this function. The dataset can no longer be used after the queue has been created.

  • device_ordinal – Integer ordinal of the IPU device on which this queue will be used. If not specified, the queue tries to deduce the IPU device from the current strategy and, if that fails, defaults to “/device:IPU:0”.

  • prefetch_depth – The number of elements Poplar will prefetch, that is, the depth of the Poplar data stream buffer that can be filled before the device reads from it. By default, prefetch_depth is determined automatically (currently 3). Increasing prefetch_depth allows multiple entries to be prefetched, which increases the probability that a valid entry is in the buffer for the device to read before it falls back to synchronously fetching the next entry. This value must be greater than zero.

  • optimise_latency – Prioritise packet reduction to try to speed up the host transfer. This has the downside that it introduces an extra copy, so it should only be used on small exchanges that will produce lots of packets.

Raises

ValueError – If any dimension of the shapes in dataset.output_shapes is not fully defined. tf.data.Dataset.batch must be called with drop_remainder=True to ensure that the batch size is constant.
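The shape requirement can be illustrated without TensorFlow. The following is a pure-Python sketch, not part of the IPU API: static_batch_shape, is_fully_defined and check_infeed_shapes are hypothetical helpers that model how batching reports an unknown (None) leading dimension unless drop_remainder=True, and how a check for fully defined shapes might reject it.

```python
def static_batch_shape(element_shape, batch_size, drop_remainder):
    """Model the static shape that a batched dataset reports.

    Without drop_remainder the final batch may be short, so the batch
    dimension is unknown (None). With it, every batch has batch_size rows.
    """
    leading = batch_size if drop_remainder else None
    return (leading,) + tuple(element_shape)


def is_fully_defined(shape):
    """Return True when no dimension of the shape is unknown."""
    return all(dim is not None for dim in shape)


def check_infeed_shapes(shapes):
    """Raise ValueError if any shape has an unknown dimension (model only)."""
    for shape in shapes:
        if not is_fully_defined(shape):
            raise ValueError(f"Shape {shape} is not fully defined")


ok = static_batch_shape((28, 28), batch_size=32, drop_remainder=True)
bad = static_batch_shape((28, 28), batch_size=32, drop_remainder=False)

check_infeed_shapes([ok])          # Passes silently.
print(ok, is_fully_defined(ok))    # (32, 28, 28) True
print(bad, is_fully_defined(bad))  # (None, 28, 28) False
```

In this model, only the dataset batched with drop_remainder=True would be accepted by the queue.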

property deleter

A tf.Operation that can be run to delete the resources owned by this IPUInfeedQueue. This allows creating a new IPUInfeedQueue with the same name afterwards.

Returns

A tf.Operation that can be run to delete this IPUInfeedQueue

property dequeued

Returns whether this queue has been dequeued.

Returns

A nested structure of tf.Tensor objects.

get_next()

Obsolete function.

property initializer

A tf.Operation that should be run to initialize this IPUInfeedQueue.

Returns

A tf.Operation that should be run to initialize this IPUInfeedQueue

Raises

ValueError – if the function initializer has already been called.

property number_of_tuple_elements

Returns the number of arguments supplied by this IPUInfeedQueue.

class tensorflow.python.ipu.ipu_infeed_queue.IPUIterator(dataset=None, infeed_spec=None, element_spec=None, **kwargs)

An IPU specific iterator producing tf.Tensor objects from a tf.data.Dataset.

This iterator should be initially constructed in eager mode in order to make sure that the dataset is constructed on a compatible device.

Note that the infeed queue is not deleted.

The elements from the iterator can only be accessed inside of tf.functions for maximum performance.

__init__(dataset=None, infeed_spec=None, element_spec=None, **kwargs)

Creates a new iterator from the given dataset.

If dataset is not specified, the iterator will be created from the given infeed spec and element structure. In particular, this alternative way of constructing the iterator is used when the iterator is reconstructed from its CompositeTensor representation.

Parameters
  • dataset – A tf.data.Dataset object.

  • infeed_spec – The IPUInfeedQueue TypeSpec to create the iterator from.

  • element_spec – A nested structure of TypeSpec objects that represents the type specification of elements of the iterator.

  • **kwargs – Arguments passed to the IPUInfeedQueue.

Raises

ValueError – If dataset is not provided and either infeed_spec or element_spec is not provided, or if dataset is provided and either infeed_spec or element_spec is also provided.

property element_spec

The type specification of an element of this iterator.

>>> dataset = tf.data.Dataset.from_tensors(42)
>>> iterator = iter(dataset)
>>> iterator.element_spec
tf.TensorSpec(shape=(), dtype=tf.int32, name=None)

For more information, read [this guide](https://www.tensorflow.org/guide/data#dataset_structure).

Returns

A (nested) structure of tf.TypeSpec objects matching the structure of an element of this iterator, specifying the type of individual components.

get_next()

Returns the next element.

>>> dataset = tf.data.Dataset.from_tensors(42)
>>> iterator = iter(dataset)
>>> print(iterator.get_next())
tf.Tensor(42, shape=(), dtype=int32)

Returns

A (nested) structure of values matching tf.data.Iterator.element_spec.

Raises

tf.errors.OutOfRangeError – If the end of the iterator has been reached.

get_next_as_optional()

Returns the next element wrapped in tf.experimental.Optional.

If the iterator has reached the end of the sequence, the returned tf.experimental.Optional will have no value.

>>> dataset = tf.data.Dataset.from_tensors(42)
>>> iterator = iter(dataset)
>>> optional = iterator.get_next_as_optional()
>>> print(optional.has_value())
tf.Tensor(True, shape=(), dtype=bool)
>>> print(optional.get_value())
tf.Tensor(42, shape=(), dtype=int32)
>>> optional = iterator.get_next_as_optional()
>>> print(optional.has_value())
tf.Tensor(False, shape=(), dtype=bool)

Returns

A tf.experimental.Optional object representing the next element.

class tensorflow.python.ipu.ipu_infeed_queue.IPUOwnedIterator(dataset=None, infeed_spec=None, element_spec=None, **kwargs)

An IPU specific iterator producing tf.Tensor objects from a tf.data.Dataset.

The iterator resource created through IPUOwnedIterator is owned by the Python object and the life time of the underlying resource is tied to the life time of the IPUOwnedIterator object. This makes IPUOwnedIterator appropriate for use inside of tf.functions.

This iterator should be initially constructed in eager mode in order to make sure that the dataset is constructed on a compatible device.

The elements from the iterator can only be accessed inside of tf.functions for maximum performance.

__init__(dataset=None, infeed_spec=None, element_spec=None, **kwargs)

Creates a new iterator from the given dataset.

If dataset is not specified, the iterator will be created from the given infeed spec and element structure. In particular, this alternative way of constructing the iterator is used when the iterator is reconstructed from its CompositeTensor representation.

Parameters
  • dataset – A tf.data.Dataset object.

  • infeed_spec – The IPUInfeedQueue TypeSpec to create the iterator from.

  • element_spec – A nested structure of TypeSpec objects that represents the type specification of elements of the iterator.

  • **kwargs – Arguments passed to the IPUInfeedQueue.

Raises

ValueError – If dataset is not provided and either infeed_spec or element_spec is not provided, or if dataset is provided and either infeed_spec or element_spec is also provided.

22.6. Outfeed queue

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedMode(value)

Types used to control the IPUOutfeedQueue modes.

Contains the following values:

  • ALL - When used with an IPUOutfeedQueue, all the elements which were enqueued to the queue will be returned by the outfeed.

  • LAST - When used with an IPUOutfeedQueue, only the last element which was enqueued to the queue will be returned by the outfeed.

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueue(outfeed_mode=None, device_ordinal=None, buffer_depth=3, optimise_latency=False)

Generates and adds outfeed enqueue/dequeue operations to the graph.

An outfeed is the counterpart to an infeed and manages the transfer of data (like tensors, tuples or dictionaries of tensors) from the IPU graph to the host.

The queue has two modes of operation - outfeed all or outfeed last. In outfeed all mode every element that is enqueued will be stored for a subsequent dequeue. All of the enqueued elements will be returned when the dequeue operation is run. This is the default behaviour.

In outfeed last mode only the last enqueued element is stored. The dequeue operation will in this case return a single element.
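The difference between the two modes can be sketched with a host-only toy model. This is not the IPU API: ToyOutfeedMode and ToyOutfeedQueue are hypothetical stand-ins written in plain Python, and the choice to drain the buffer on dequeue in outfeed all mode is an assumption made for the illustration.

```python
from enum import Enum


class ToyOutfeedMode(Enum):
    """Hypothetical stand-in for the two outfeed modes."""
    ALL = "all"
    LAST = "last"


class ToyOutfeedQueue:
    """Host-only model that mimics the two outfeed modes."""

    def __init__(self, outfeed_mode=None):
        # As described above, an unspecified mode behaves like outfeed all.
        self.mode = outfeed_mode or ToyOutfeedMode.ALL
        self._buffer = []

    def enqueue(self, element):
        if self.mode is ToyOutfeedMode.LAST:
            # Keep only the most recently enqueued element.
            self._buffer = [element]
        else:
            self._buffer.append(element)

    def dequeue(self):
        if self.mode is ToyOutfeedMode.LAST:
            if not self._buffer:
                raise RuntimeError("No element has been enqueued")
            return self._buffer[0]
        # Assumption: dequeuing returns and drains everything stored so far.
        elements, self._buffer = self._buffer, []
        return elements


all_queue = ToyOutfeedQueue()
last_queue = ToyOutfeedQueue(ToyOutfeedMode.LAST)
for step in range(3):
    all_queue.enqueue(step)
    last_queue.enqueue(step)

print(all_queue.dequeue())   # [0, 1, 2]
print(last_queue.dequeue())  # 2
```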

__init__(outfeed_mode=None, device_ordinal=None, buffer_depth=3, optimise_latency=False)

Creates an IPUOutfeedQueue object.

Parameters
  • outfeed_mode – ipu_outfeed_queue.IPUOutfeedMode type used to control the outfeed behaviour. If not specified, all elements will be returned by the outfeed when the dequeue operation is run.

  • device_ordinal – Integer ordinal of the IPU device on which this queue will be used. If not specified, the queue tries to deduce the IPU device from the current strategy and, if that fails, defaults to “/device:IPU:0”.

  • buffer_depth – The maximum number of elements Poplar can buffer in external memory before blocking the device.

  • optimise_latency – Prioritise packet reduction to try to speed up the host transfer. This has the downside that it introduces an extra copy, so it should only be used on small exchanges that will produce lots of packets.

Raises

ValueError – if the types or values are incorrect

property deleter

A tf.Operation that can be run to delete the resources owned by this IPUOutfeedQueue. This allows creating a new IPUOutfeedQueue with the same name afterwards. The behaviour is undefined if this op is executed concurrently with the dequeue op.

Returns

A tf.Operation that can be run to delete this IPUOutfeedQueue

dequeue(wait_for_completion=False)

Generate host side operation to dequeue the outfeed values.

Parameters

wait_for_completion – whether the dequeueing operation should wait for the current execution of a graph containing the outfeed enqueue to complete. Defaults to False which means that only the tensors which have already been enqueued will be returned.

The return value of this operation depends on the enqueued tensors, the replication factor and the execution mode, where the replication factor is determined by the model.

Note: If the TF_POPLAR_FLAGS environment variable contains the flag --use_synthetic_data then no data will be returned to the host. If outfeed_mode is IPUOutfeedMode.ALL then empty arrays with the same element structure as the enqueued tensors are returned. If outfeed_mode is IPUOutfeedMode.LAST then running the dequeue operation will throw an exception (there is no last element in this case).
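Because dequeuing in IPUOutfeedMode.LAST throws when synthetic data is in use, a script may want to check the environment before running the dequeue operation. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes that flags in TF_POPLAR_FLAGS are separated by whitespace and that the flag appears without a value.

```python
import os

SYNTHETIC_DATA_FLAG = "--use_synthetic_data"


def synthetic_data_enabled(environ=None):
    """Return True if TF_POPLAR_FLAGS contains --use_synthetic_data.

    Assumption: flags are whitespace-separated and the flag appears bare.
    """
    environ = os.environ if environ is None else environ
    flags = environ.get("TF_POPLAR_FLAGS", "").split()
    return SYNTHETIC_DATA_FLAG in flags


def can_dequeue_last(environ=None):
    """With synthetic data there is no last element to dequeue."""
    return not synthetic_data_enabled(environ)


print(synthetic_data_enabled({"TF_POPLAR_FLAGS": "--use_synthetic_data"}))  # True
print(synthetic_data_enabled({}))                                           # False
print(can_dequeue_last({}))                                                 # True
```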

Examples:

  1. Outfeed returning a single tensor:

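import numpy as np
import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.framework import ops
from tensorflow.python.ipu import ipu_compiler, ipu_outfeed_queue, loops

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()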

def body(input):
  output = input + 1
  outfeed = outfeed_queue.enqueue(output)
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, (input))
  return r

# Create the input placeholder first, so that v exists before it is compiled.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example the tensor output is of shape [4, 4] and it is enqueued into the outfeed. If the outfeed_mode is IPUOutfeedMode.ALL, and the model has a replication factor of 2 then the shape of the resulting outfed tensor will be [20, 2, 4, 4], where the first dimension represents the number of times we have enqueued a tensor to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed. The second dimension is the replication factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is IPUOutfeedMode.LAST, then the shape of the resulting outfed tensor will be [2, 4, 4], which represents the value of the output tensor the last time it was enqueued during execution for each of the replicated graphs.

  2. Outfeed returning a tuple of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue((output, sum))
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, (input))
  return r

# Create the input placeholder first, so that v exists before it is compiled.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a tuple of tensors, output and sum, where the former is of shape [4, 4] and latter [1]. If the outfeed_mode is IPUOutfeedMode.ALL and the model has a replication factor of 1, then the resulting outfed is a two-tuple of tensors with shapes ([20, 4, 4], [20, 1]), where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed for each of the tensors in the tuple. If the outfeed_mode is IPUOutfeedMode.LAST, then outfed is a two tuple of tensors with shapes ([4, 4], [1]), which represents the values of the output and sum tensors the last time they were enqueued during execution.

Note that replication factor here is 1, which means that the extra replication dimension is not added.

  3. Outfeed returning a dictionary of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue({"x": output,
                                   "y": sum})
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(40, body, (input))
  return r

# Create the input placeholder first, so that v exists before it is compiled.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a dictionary of tensors, output and sum, where the former is of shape [4, 4] and the latter [1]. If the outfeed_mode is IPUOutfeedMode.ALL and the model has a replication factor of 8, then the resulting outfed is a dictionary of tensors with shapes {"x": [40, 8, 4, 4], "y": [40, 8, 1]}, where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed. In this example the loop is repeated 40 times, and therefore we get 40 values back from the outfeed for each of the tensors in the dictionary. The second dimension is the replication factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is IPUOutfeedMode.LAST, then outfed is a dictionary of tensors with shapes {"x": [8, 4, 4], "y": [8, 1]}, which represents the values of the output and sum tensors the last time they were enqueued during execution for each of the replicated graphs.
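The shape rules described in the three examples above can be summarised in a small helper. This is a pure-Python sketch, not part of the IPU API: dequeued_shape and its parameter names are hypothetical, and it only encodes the cases spelled out in the examples (an iteration dimension in ALL mode, and a replication dimension only when the replication factor is greater than 1).

```python
def dequeued_shape(element_shape, num_enqueues, replication_factor, mode="ALL"):
    """Return the shape of one dequeued tensor, following the examples above.

    element_shape: shape of the tensor passed to enqueue, for example [4, 4].
    num_enqueues: how many times the tensor was enqueued (loop iterations).
    replication_factor: number of replicas in the model.
    mode: "ALL" or "LAST", mirroring the two outfeed modes.
    """
    if mode not in ("ALL", "LAST"):
        raise ValueError(f"Unknown mode: {mode}")
    shape = list(element_shape)
    if replication_factor > 1:
        # The replication dimension is only added for replicated models.
        shape.insert(0, replication_factor)
    if mode == "ALL":
        # One entry per enqueue; LAST keeps only the final one.
        shape.insert(0, num_enqueues)
    return shape


# Example 1: single tensor, 20 iterations, replication factor 2.
print(dequeued_shape([4, 4], 20, 2, "ALL"))   # [20, 2, 4, 4]
print(dequeued_shape([4, 4], 20, 2, "LAST"))  # [2, 4, 4]

# Example 2: tuple of tensors, 20 iterations, replication factor 1.
print([dequeued_shape(s, 20, 1, "ALL") for s in ([4, 4], [1])])
# [[20, 4, 4], [20, 1]]

# Example 3: dictionary of tensors, 40 iterations, replication factor 8.
print({k: dequeued_shape(s, 40, 8, "ALL") for k, s in {"x": [4, 4], "y": [1]}.items()})
# {'x': [40, 8, 4, 4], 'y': [40, 8, 1]}
```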

enqueue(tensors)

Enqueue a tensor, tuple or a dictionary of tensors for being outfed from the IPU graph. This operation is placed on the IPU device. This function returns an Operation which needs to be executed (by either returning it or using tf.control_dependencies(…)).

Examples:

  1. Outfeed returning a single tensor:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
  v = v + 1
  outfeed = outfeed_queue.enqueue(v)
  return (v, outfeed)

def my_net(v):
  r = loops.repeat(20, body, (v))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

...
...

  2. Outfeed returning a tuple of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
  v = v + 1
  x = v * 2
  outfeed = outfeed_queue.enqueue((v, x))
  return (v, outfeed)

def my_net(v):
  r = loops.repeat(20, body, (v))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

...
...

  3. Outfeed returning a dictionary of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
  v = v + 1
  x = v * 2
  outfeed = outfeed_queue.enqueue({"output_1": v,
                                   "output_2": x})
  return (v, outfeed)

def my_net(v):
  r = loops.repeat(20, body, (v))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

...
...

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueueIterator(outfeed_queue)

An iterator producing tf.Tensor objects from an IPUOutfeedQueue.

__init__(outfeed_queue)

Creates a new iterator from the given outfeed queue.

Parameters

outfeed_queue – An ipu.ipu_outfeed_queue.IPUOutfeedQueue object.

class tensorflow.python.ipu.ipu_outfeed_queue.ScopedIPUOutfeedQueue(outfeed_mode=None, device_ordinal=None)

A version of IPUOutfeedQueue which automatically calls delete when it goes out of scope.

Can only be created in eager mode.

__init__(outfeed_mode=None, device_ordinal=None)

Creates an IPUOutfeedQueue object.

Parameters
  • outfeed_mode – ipu_outfeed_queue.IPUOutfeedMode type used to control the outfeed behaviour. If not specified, all elements will be returned by the outfeed when the dequeue operation is run.

  • device_ordinal – Integer ordinal of the IPU device on which this queue will be used. If not specified, the queue tries to deduce the IPU device from the current strategy and, if that fails, defaults to “/device:IPU:0”.

Raises

RuntimeError – if not running in eager mode.

22.7. General utilities

tensorflow.python.ipu.utils.export_dataset_to_file(dataset_or_infeed, output_filename, num_elements, feed_name='', apply_debug_options=True)

Exports num_elements elements from the given dataset or infeed, in binary format, to the specified output_filename.

If the infeed elements are tuples then one file per tuple element will be created. For example, if dataset looks like

[{ "a": A_0, "b": B_0}, { "a": A_1, "b": B_1}, ...]

then export_dataset_to_file(dataset, "my_dataset.bin", 100) will generate:

my_dataset.0.bin   # Contains tensors [ A_0, A_1, ..., A_99]
my_dataset.1.bin   # Contains tensors [ B_0, B_1, ..., B_99]

Parameters
  • dataset_or_infeed – A unary dataset with the same input and output structure or an IPUInfeedQueue.

  • output_filename – Where to export the tensors to.

  • num_elements – Number of elements to export from the dataset.

  • feed_name – Specify the feed name.

  • apply_debug_options – Whether to apply debug options.
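The per-element file naming described above can be illustrated with a small sketch. This is not the library's implementation: the helper name per_element_filenames is hypothetical, and it assumes the element index is inserted just before the file extension, mirroring the my_dataset.0.bin / my_dataset.1.bin pattern in the example.

```python
import os


def per_element_filenames(output_filename, num_tuple_elements):
    """Hypothetical helper: mimic the naming pattern shown in the example."""
    stem, extension = os.path.splitext(output_filename)
    # One file per tuple element, with the element index before the extension.
    return [f"{stem}.{index}{extension}" for index in range(num_tuple_elements)]


print(per_element_filenames("my_dataset.bin", 2))
# ['my_dataset.0.bin', 'my_dataset.1.bin']
```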

tensorflow.python.ipu.utils.export_inputs_to_file(inputs, output_filename, feed_dict)

Export as binary the list of inputs provided to the specified output_filename.

Parameters
  • inputs – List of graph inputs to export.

  • output_filename – Where to export the tensors to.

  • feed_dict – Feed dictionary containing the inputs’ values.

tensorflow.python.ipu.utils.get_num_of_ipus_in_device(ipu_device, device='cpu')

Get the number of physical IPUs.

Parameters
  • ipu_device – The IPU device for which to get the number of IPUs.

  • device – The CPU device which is local to the IPU hardware.

Returns

The number of physical IPUs configured for a particular TF device.

tensorflow.python.ipu.utils.move_variable_initialization_to_cpu(graph=None)

For all variables in the VARIABLES collection, move any initialization ops onto the CPU.

Parameters

graph – Operations are moved around on this graph. The default graph will be used if not specified.

Returns

None

tensorflow.python.ipu.utils.reset_ipu_seed(seed, device='/device:IPU:0', cpu_device='cpu', experimental_identical_replicas=False)

Reset the seed used to generate stateful random numbers and perform stochastic rounding.

Parameters
  • seed – The new random number generator seed.

  • device – The device to which the seed will be applied.

  • cpu_device – The CPU device which is on the same hardware as the IPU device.

  • experimental_identical_replicas – Whether to seed all the local replicas identically. Note that to generate identical sequences of random numbers on all replicas, the Poplar engine option "target.deterministicWorkers" must also be set to "portable". Also note that for multi-replica distribution with multiple processes, the same seed must be passed to each process to ensure that all the replicas globally get the same seed. WARNING: This flag is experimental and subject to change.

Returns

None

tensorflow.python.ipu.utils.running_on_ipu_model()

Check if XLA is configured to run on the IPU Model.

Returns

True if XLA is configured to run on the IPU Model. False if XLA is configured to run on real hardware.

tensorflow.python.ipu.utils.use_synthetic_data_for(synthetic_data_category)

Get whether synthetic data is being used for the given category.

Parameters

synthetic_data_category – A SyntheticDataCategory enum value.

Returns

A bool indicating the result.

22.8. Configuration utilities

class tensorflow.python.ipu.config.DeviceConnectionType(value)

Enumeration to describe the mechanism used to attach to the Poplar device.

  • ALWAYS indicates that the system will attach when configuring the device.

  • ON_DEMAND will defer connection to when the IPU is needed.

  • PRE_COMPILE will never try to attach to a device and anything which is meant to be executed on the device will return all zeros. Used to pre-compile Poplar programs on machines without IPUs. For more information, see Pre-compiling executables.

  • NEVER will never try to attach to a device.

class tensorflow.python.ipu.config.ExecutionProfileType(value)

The execution profile type indicates the desired information in the execution profile.

  • NO_PROFILE indicates that there should be no execution profiling.

  • DEVICE_PROFILE indicates that the execution profile should contain only device-wide events.

  • IPU_PROFILE indicates that the profile should contain IPU level execution events.

  • TILE_PROFILE indicates that the profile should contain Tile level execution events.

class tensorflow.python.ipu.config.MergeRemoteBuffersBehaviour(value)

The remote buffers merging behaviour indicates when or if compatible remote buffers should be merged.

  • NO_MERGING indicates that there should be no merging.

  • MERGE indicates that all compatible remote buffers will be merged.

  • IF_BENEFICIAL indicates that compatible remote buffers will only be merged when it is considered beneficial for code re-use.

class tensorflow.python.ipu.config.SchedulingAlgorithm(value)

Controls the algorithm that the scheduler uses.

  • CHOOSE_BEST compares several of the scheduling algorithms below and selects the one that leads to the lowest predicted overall peak liveness. This can sometimes produce incorrect results because the overall peak liveness isn’t always a good measure for the maximum liveness on one tile of the processor.

  • CLUSTERING groups clusters of operations together in order to look through stretches of instructions with potentially high liveness.

  • POST_ORDER schedules the instructions in the order which is obtained by walking the graph in ‘post order’.

  • LOOK_AHEAD looks ahead a number of operations from any schedulable one, as given by the maximum scheduler lookahead depth and maximum scheduler search space size options. It attempts to look through areas of high liveness.

  • SHORTEST_PATH gives priority to the shortest path to the root.
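To make the POST_ORDER description concrete, the following sketch walks a small dependency graph in post order. It only illustrates the traversal and is not the scheduler itself; the graph, the operation names and the post_order_schedule function are made up for this example.

```python
def post_order_schedule(operands, root):
    """Return operations in post order: every operand before its consumer."""
    schedule = []
    visited = set()

    def visit(op):
        if op in visited:
            return
        visited.add(op)
        for operand in operands.get(op, []):
            visit(operand)
        # An operation is emitted only after all of its operands.
        schedule.append(op)

    visit(root)
    return schedule


# Hypothetical graph: each operation maps to the operations it consumes.
operands = {
    "add": ["mul", "c"],
    "mul": ["a", "b"],
}
print(post_order_schedule(operands, "add"))
# ['a', 'b', 'mul', 'c', 'add']
```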

class tensorflow.python.ipu.config.SelectionOrder(value)

Depending on the communication pattern of the model, the order in which the IPUs are selected and mapped to shards can impact the performance.

For example, given a model which executes on multiple IPUs:

def sharded_graph(pa, pb, pc, pd):
  with ipu.scopes.ipu_shard(0):
    o1 = pa + pb
  with ipu.scopes.ipu_shard(1):
    o2 = o1 + pc
  with ipu.scopes.ipu_shard(2):
    o3 = o2 + pd
    return o3

and a Graphcore Pod system with 16 IPUs:

 _______               _______
|       |             |       |
|  14   |=============|  15   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  12   |=============|  13   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  10   |=============|  11   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   8   |=============|   9   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   6   |=============|   7   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   4   |=============|   5   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   2   |=============|   3   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   0   |=============|   1   |
|_______|             |_______|

Here, each numbered square represents an IPU with the given device ID and the == and || connections represent IPUs directly connected via IPU-Links.

We can see that the ipu_shard(0) directly communicates with ipu_shard(1) and that ipu_shard(1) directly communicates with ipu_shard(2).

If the shards 0, 1, 2 were mapped to IPUs 0, 1, 2 in that order, then the communication between shards 1 and 2 would not have a direct connection via an IPU-Link and would have to perform a “hop” through an intermediate IPU.

If the shards 0, 1, 2 were mapped to IPUs 0, 1, 3 in that order, then the communication between shards 1 and 2 would have a direct connection via an IPU-Link, which will reduce the communication cost.

This enumeration is used to control the order in which the IPUs are selected. Currently, the following IPU selection orderings are supported:

  • AUTO: automatically try to select the best ordering for the given network.

  • ZIGZAG: follow the natural ordering of IPUs. In the above example, the IPUs would be selected in the following order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.

  • SNAKE: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after. In the above example, the IPUs would be selected in the following order: 0, 1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14.

  • HOOF: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after, and the last and first shard are on adjacent IPUs. In the above example, the IPUs would be selected in the following order: 0, 2, 4, 6, 8, 10, 12, 14, 15, 13, 11, 9, 7, 5, 3, 1.

The SNAKE and HOOF IPU selection orders are particularly beneficial for pipelined models.
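The three explicit orderings can be reproduced with a short sketch for the ladder topology drawn above. This is an illustration only, not how the library selects devices: the function names are hypothetical, and directly_linked assumes exactly the links shown in the diagram (IPUs 2k and 2k+1 share a row, and IPUs whose IDs differ by 2 are vertical neighbours).

```python
def zigzag(num_ipus):
    # Natural ordering of device IDs.
    return list(range(num_ipus))


def snake(num_ipus):
    # Alternate the direction of each row so consecutive IPUs stay linked.
    order = []
    for row in range(num_ipus // 2):
        left, right = 2 * row, 2 * row + 1
        order += [left, right] if row % 2 == 0 else [right, left]
    return order


def hoof(num_ipus):
    # Up the left column, then back down the right column.
    return list(range(0, num_ipus, 2)) + list(range(num_ipus - 1, 0, -2))


def directly_linked(a, b):
    same_row = a != b and a // 2 == b // 2
    same_column = abs(a - b) == 2
    return same_row or same_column


def all_consecutive_linked(order):
    return all(directly_linked(a, b) for a, b in zip(order, order[1:]))


print(snake(16))  # [0, 1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14]
print(hoof(16))   # [0, 2, 4, 6, 8, 10, 12, 14, 15, 13, 11, 9, 7, 5, 3, 1]
print(all_consecutive_linked(zigzag(16)))  # False: IPUs 1 and 2 need a hop
```

Under these assumptions, all_consecutive_linked returns True for both snake(16) and hoof(16), and the last and first entries of hoof(16) (IPUs 1 and 0) are also directly linked.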

class tensorflow.python.ipu.config.StochasticRoundingBehaviour(value)

Controls how stochastic rounding is performed.

  • OFF disables stochastic rounding.

  • ON enables stochastic rounding.

  • REPLICA_IDENTICAL_ONLY enables stochastic rounding for portions of the graph which are identified as being replica identical, meaning that when executed with replication they produce the same result on each replica.

tensorflow.python.ipu.config.configure_ipu_system(config, device='cpu', reset_configuration=True)

Configure an IPU system with an IPUConfig or IpuOptions instance.

Parameters
  • config – An IPUConfig instance or IpuOptions configuration protobuf.

  • device – The TensorFlow virtual CPU device which is local to the IPU hardware.

  • reset_configuration – Whether to reset any existing IPU configurations.

Returns

None

tensorflow.python.ipu.config.get_ipu_config(session=None)

Get the configuration of an IPU system.

Parameters

session – An optional session on which to execute.

Returns

A list of IpuOptions instances, one for each PoplarExecutor.

tensorflow.python.ipu.config.reset_ipu_configuration()

Reset the IPU configuration in preparation for it to be reconfigured. Blocks until all currently configured IPU devices have finished executing.

Note that this function does not currently support resetting IPUs that are running in parallel Python threads.

class tensorflow.python.ipu.config.AttributeMetadata
check_type(value)

Checks if value is one of the allowed types for this option. Throws a TypeError if not.

Parameters

value – The value to check against this attribute’s type.

Returns

True if value satisfies this attribute’s type.

property default

The default value for this option. Categories themselves do not have default values.

property deprecated

Whether or not this option/category is deprecated.

property deprecated_msg

The deprecation message for this attribute. None if it is not deprecated.

property name

The full name of the option/category, relative to the config structure’s root.

property type

The type of this option, as a string. The type can be a simple Python type or a type hint. Categories themselves do not have types.

warn_if_deprecated()

Outputs a log warning if this option/category is deprecated.

class tensorflow.python.ipu.config.IPUConfig
allow_recompute: bool = False

Whether or not to recompute instructions during training. If this is enabled then we will attempt to pattern match instructions/pipeline stages in the forward pass and recompute them in the backward pass to avoid having to preserve activations which increase the maximum memory liveness. Enabling this option can reduce memory usage at the expense of extra computation. Stateful operations cannot be recomputed.

selection_order: SelectionOrder = SelectionOrder.AUTO

The order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices. Must be one of SelectionOrder.

serialization_output_folder: str = ""

Specifies the directory in which serialized Poplar executables will be saved. The value must be a valid path. The default (“”) disables executable serialization.

compilation_poplar_options: dict = {}

Set the Poplar compilation options for the session. Must be a dictionary of valid Poplar compilation flags. See the Engine class in the Poplar API reference for the full list of options.

gcl_poplar_options: dict = {}

Set the IPU options for the Graphcore Communication Library. Must be a dictionary of valid GCL options. See the allReduce function in the GCL API reference for the full list of options. The options will be applied to all applicable GCL collective operations in the graph during compilation.

auto_select_ipus: Union[int, List[int], Tuple[int, ...]] = []

Configure the IPUs to be used by the session. The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each device can control a specific number of IPUs, given by the corresponding value in auto_select_ipus. The system will automatically select IPU configurations from the available IPUs that match the desired number of IPUs.

Examples:

config = IPUConfig()

# Create a single TensorFlow device, with one IPU
config.auto_select_ipus = 1

# Create two TensorFlow devices, with two IPUs per device.
config.auto_select_ipus = [2, 2]

# Create two TensorFlow devices, with one IPU in the first device and two
# IPUs in the second device.
config.auto_select_ipus = [1, 2]
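
The mapping from an auto_select_ipus value to TensorFlow device labels can be sketched in plain Python. The helper devices_for is hypothetical and only restates the labelling rule described above; it does not query or configure any hardware.

```python
def devices_for(auto_select_ipus):
    """Hypothetical helper: map each TensorFlow device name to its IPU count."""
    # A single integer describes one device; a list or tuple describes several.
    if isinstance(auto_select_ipus, int):
        counts = [auto_select_ipus]
    else:
        counts = list(auto_select_ipus)
    return {f"/device:IPU:{index}": count for index, count in enumerate(counts)}


print(devices_for(1))       # {'/device:IPU:0': 1}
print(devices_for([1, 2]))  # {'/device:IPU:0': 1, '/device:IPU:1': 2}
```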
select_ipus: Union[int, List[int], Tuple[int, ...]] = []

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The TensorFlow devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each TensorFlow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

$ gc-info -l
Graphcore device listing:

-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id:  [2], type:      [PCIe], PCI Domain: [0000:23:00.0]
-+- Id:  [3], type:      [PCIe], PCI Domain: [0000:24:00.0]
-+- Id:  [4], type:      [PCIe], PCI Domain: [0000:3d:00.0]
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:3e:00.0]
-+- Id:  [6], type:      [PCIe], PCI Domain: [0000:43:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:44:00.0]
-+- Id:  [8], type:      [PCIe], PCI Domain: [0000:8b:00.0]
-+- Id:  [9], type:      [PCIe], PCI Domain: [0000:8c:00.0]
-+- Id: [10], type:      [PCIe], PCI Domain: [0000:8e:00.0]
-+- Id: [11], type:      [PCIe], PCI Domain: [0000:8f:00.0]
-+- Id: [12], type:      [PCIe], PCI Domain: [0000:b8:00.0]
-+- Id: [13], type:      [PCIe], PCI Domain: [0000:b9:00.0]
-+- Id: [14], type:      [PCIe], PCI Domain: [0000:ba:00.0]
-+- Id: [15], type:      [PCIe], PCI Domain: [0000:bb:00.0]
-+- Id: [16], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
-+- Id: [17], type: [Multi IPU]
|--- PCIe Id:  [4], DNC Id: [0], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:43:00.0]
-+- Id: [18], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
-+- Id: [19], type: [Multi IPU]
|--- PCIe Id:  [2], DNC Id: [0], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [20], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
-+- Id: [21], type: [Multi IPU]
|--- PCIe Id: [12], DNC Id: [0], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:ba:00.0]
-+- Id: [22], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
-+- Id: [23], type: [Multi IPU]
|--- PCIe Id: [10], DNC Id: [0], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [1], PCI Domain: [0000:8b:00.0]
-+- Id: [24], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
-+- Id: [25], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [2], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [26], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
-+- Id: [27], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [2], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:8b:00.0]
-+- Id: [28], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [29], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [4], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [5], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [6], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [7], PCI Domain: [0000:8b:00.0]
-+- Id: [30], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
|--- PCIe Id: [13], DNC Id: [8], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [9], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [10], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [11], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [12], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [13], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [14], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [15], PCI Domain: [0000:8b:00.0]

Examples based on the listing above:

config = IPUConfig()

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:1a:00.0 by using IPU configuration index 0
config.select_ipus = 0

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:8b:00.0 by using IPU configuration index 8
config.select_ipus = 8

# Create two TensorFlow devices, with one IPU each, being devices at
# indices 0 and 1
config.select_ipus = [0, 1]

# Create two TensorFlow devices, with four IPUs each, using the device
# configurations at indices 24 (0000:3e:00.0, 0000:44:00.0,
# 0000:3d:00.0, 0000:43:00.0) and 25 (0000:24:00.0, 0000:1b:00.0,
# 0000:23:00.0, 0000:1a:00.0)
config.select_ipus = [24, 25]

# Create four TensorFlow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:23:00.0, 0000:24:00.0.
config.select_ipus = [0, 1, 2, 3]
convolutions

Sub-category containing configuration options that affect convolutions.

convolutions.poplar_options: dict = {}

Set the PopLibs convolution options for the session. Must be a dictionary of valid PopLibs convolution options. See createWeights in the PopLibs API reference for the full list of options. The options will be applied to all convolution operations in the session graph during compilation.

Of particular note is the availableMemoryProportion parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.

See the technical note on Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU for more details and for some practical examples of using availableMemoryProportion.

Another important parameter is partialsType, which sets the type of the values of intermediate calculations (partials). This parameter can either be set to "float" (for float32) or "half" (for float16). Note the use of "float" or "half" and not "float32" or "float16" for the parameter values (this is because Poplar/PopLibs uses the IEEE definitions of what the datatypes should be called). An example showing how to use this parameter is shown below:

from tensorflow.python.ipu import config

cfg = config.IPUConfig()
cfg.convolutions.poplar_options['partialsType'] = "half"
cfg.configure_ipu_system()
device_connection

Sub-category containing configuration options to control when to attach to IPU devices.

device_connection.type: DeviceConnectionType = DeviceConnectionType.ALWAYS

Configure when to attach to the device. For example, you can use this to compile and cache a program without attaching to an IPU, and then later run on a real IPU device without recompiling. Setting the connection type doesn’t impact the ability to profile a model. For possible values, see DeviceConnectionType.

# Compile without attaching to the device.
config = IPUConfig()
config.device_connection.type = DeviceConnectionType.ON_DEMAND
device_connection.version: str = ""

Version of the IPU hardware to use (string). Must be one of “ipu1”, “ipu2” or “” (default). Only required if the connection type provided is DeviceConnectionType.PRE_COMPILE or DeviceConnectionType.NEVER.
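As a sketch of how the connection type and version options fit together when pre-compiling on a machine without IPUs, the fragment below uses only the attributes documented in this section. It assumes the IPU build of TensorFlow is installed and that "ipu2" is the intended target hardware; both are assumptions made for the example.

```python
from tensorflow.python.ipu import config

cfg = config.IPUConfig()
# Never attach to a device; only compile the Poplar programs.
cfg.device_connection.type = config.DeviceConnectionType.PRE_COMPILE
# The version must be given when the connection type is PRE_COMPILE.
# "ipu2" is an assumed target for this example.
cfg.device_connection.version = "ipu2"
cfg.configure_ipu_system()
```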

device_connection.enable_remote_buffers: bool = False

Defaults to False. When the connection type is DeviceConnectionType.PRE_COMPILE, DeviceConnectionType.NEVER or DeviceConnectionType.ON_DEMAND, this option indicates whether remote buffers are enabled and supported on the system which will eventually be used to execute the compiled programs. Set it to True if that system has remote buffers enabled. If the connection type is DeviceConnectionType.ALWAYS, this option is ignored, because in that case the device can be queried to check whether remote buffers are supported (if they are, they will be used automatically).

In order to check whether your target system supports remote buffers you can run the command:

$ gc-info -d 0 -i | grep "remote buffers supported:"

If you see remote buffers supported: 1 in the output, that means that remote buffers are supported on your system. For more information, see the gc-info documentation.

slices

Sub-category containing configuration options that affect slice operations.

slices.poplar_options: dict = {}

Set the PopLibs slice options for the session. Must be a dictionary of valid PopLibs slice options. See embedding::plan in the PopLibs API reference for the full list of options. The options will be passed to multiSlice, multiUpdate, and multiUpdateAdd PopLibs calls. These are most commonly generated when using embeddings.

Of particular note is the availableMemoryProportion parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.

experimental

Sub-category containing experimental configuration options that may be changed or removed with short or no notice.

experimental.always_rearrange_copies_on_the_host: bool = False

The data which is streamed to/from the device might be stored in different layouts on the device and on the host. If so, rearrangement is performed on the device by default. By enabling this option the rearrangement will be performed on the host at the expense of latency.

experimental.enable_remote_buffer_embedding: bool = False

When set to true, HostEmbedding will make use of Poplar remote buffers. The creation of this remote buffer may take several minutes. The remote buffer will be synchronised with every IPU execution, so we recommend that you use high steps_per_execution with this option.

experimental.enable_prng_stability: bool = False

Enable PRNG seed management. This aims to reduce divergence of weights when running models across multiple replicas with stochastic rounding.

experimental.multi_replica_distribution

Sub-category containing configuration options controlling multi replica distribution. This will use the Poplar runtime replica subset feature to let multiple processes collaborate on executing the same Poplar program by executing a subset of the global replicas each.

The total global replication factor will be equal to the local replication factor multiplied by the process_count.

experimental.multi_replica_distribution.process_index: int = 0

The index of the current process being configured.

experimental.multi_replica_distribution.process_count: int = 0

The total number of processes. When set to 0 (default), multi-replica distribution will not be used.
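The replication arithmetic stated above can be written out as a small sketch. The function name is hypothetical; it simply encodes the rule that the global replication factor is the local factor multiplied by process_count, and treats process_count == 0 as "feature not used".

```python
def global_replication_factor(local_replication_factor, process_count):
    """Total number of replicas across all processes, per the rule above."""
    # process_count == 0 is the default and means multi-replica distribution
    # is not used, so only the local replicas exist.
    if process_count == 0:
        return local_replication_factor
    return local_replication_factor * process_count


print(global_replication_factor(4, 2))  # 8
print(global_replication_factor(4, 0))  # 4
```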

floating_point_behaviour

Sub-category containing configuration options that affect the floating point behaviour of the IPU devices, including stochastic rounding and behaviour when an overflow is encountered during execution. For more information, see Controlling the half-precision floating-point unit.

floating_point_behaviour.inv: bool = False

If True, a floating point invalid operation (defined by IEEE 754) will cause an exception.

floating_point_behaviour.div0: bool = False

If True, a floating point divide by zero operation will cause an exception.

floating_point_behaviour.oflo: bool = False

If True, a floating point overflow will cause an exception.

floating_point_behaviour.esr: StochasticRoundingBehaviour = StochasticRoundingBehaviour.OFF

A StochasticRoundingBehaviour. If StochasticRoundingBehaviour.OFF (default) then stochastic rounding will be disabled. Otherwise it’s enabled with the semantics of the particular option.

floating_point_behaviour.nanoo: bool = False

If True, Not-a-Number (NaN) on overflow mode will be enabled.

floating_point_behaviour.set_all: bool = False

If True, unconditionally enables all floating point behaviour options (inv, div0, oflo, esr, nanoo) when the IPUConfig is configured.
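A possible way to combine these options, using only the attributes listed in this sub-category, is sketched below. It assumes the IPU build of TensorFlow is installed; the particular combination of settings is an arbitrary choice for illustration.

```python
from tensorflow.python.ipu import config

cfg = config.IPUConfig()
# Raise exceptions on invalid operations, divide by zero and overflow.
cfg.floating_point_behaviour.inv = True
cfg.floating_point_behaviour.div0 = True
cfg.floating_point_behaviour.oflo = True
# Enable stochastic rounding.
cfg.floating_point_behaviour.esr = config.StochasticRoundingBehaviour.ON
cfg.configure_ipu_system()
```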

io_tiles

Sub-category containing configuration options that affect parallel I/O on a subset of tiles. For more information, see I/O Tiles.

io_tiles.num_io_tiles: int = 0

Number of tiles to reserve for I/O.

io_tiles.place_ops_on_io_tiles: bool = False

Whether to place TensorFlow I/O operations on the I/O tiles.

io_tiles.available_memory_proportion: float = 0.9

Proportion of I/O tiles’ memory which can be used to store data in, with the remaining memory assumed to be used by code. If the size of data which is to be stored on I/O tiles exceeds the total I/O tiles memory multiplied by this proportion, then a warning message will appear and the operations will not be placed on I/O tiles.
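The placement condition described above amounts to a simple comparison, sketched here in plain Python. The function name and the byte counts are hypothetical; in practice the total I/O tile memory depends on the hardware and on num_io_tiles.

```python
def fits_on_io_tiles(data_bytes, io_tile_memory_bytes,
                     available_memory_proportion=0.9):
    """Return True if the data fits in the usable share of I/O tile memory."""
    usable_bytes = io_tile_memory_bytes * available_memory_proportion
    # Data exceeding the usable share triggers the warning described above,
    # and the operations are not placed on the I/O tiles.
    return data_bytes <= usable_bytes


# Hypothetical sizes: 1000 bytes of I/O tile memory in total.
print(fits_on_io_tiles(800, 1000))  # True
print(fits_on_io_tiles(950, 1000))  # False
```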

ipu_model

Sub-category containing configuration options related to the IPU model. Note that these will only have an effect if you are running with the IPU model enabled. For more information, see TF_POPLAR_FLAGS environment variable.

ipu_model.compile_ipu_code: bool = True

Whether or not to compile IPU code for modelling.

ipu_model.tiles_per_ipu: int = 0

The number of tiles per IPU Model device. When set to 0 (the default), Poplar will use the standard number of tiles for the chosen version.

ipu_model.version: str = "ipu2"

Specify the IPU version to be used by the IPU Model. Options are “ipu1” or “ipu2” (default).

matmuls

Sub-category containing configuration options that affect matmuls.

matmuls.clear_pass_type: bool = False

Controls whether or not the “Pass” type of the MatMul is passed to PopLibs. When set to True, PopLibs will not be told about the type of the MatMuls in the graph. This can save memory in some circumstances, such as large batch ResNet models. See matMul in the PopLibs API reference.

matmuls.poplar_options: dict = {}

Set the PopLibs matrix multiplication options for the session. Must be a dictionary of valid PopLibs matrix multiplication options. See matMul in the PopLibs API reference for the full list of options. The options will be applied to all matmul operations in the session graph during compilation.

Of particular note is the availableMemoryProportion parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.

See the technical note on Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU for more details and for some practical examples of using availableMemoryProportion.

Another important parameter is partialsType, which sets the type of the values of intermediate calculations (partials). This parameter can either be set to "float" (for float32) or "half" (for float16). Note the use of "float" or "half" and not "float32" or "float16" for the parameter values (this is because Poplar/PopLibs uses the IEEE definitions of what the datatypes should be called). An example showing how to use this parameter is shown below:

from tensorflow.python.ipu import config

cfg = config.IPUConfig()
cfg.matmuls.poplar_options['partialsType'] = "half"
cfg.configure_ipu_system()
norms

Sub-category containing configuration options that affect normalizations. Note that these options will be applied to all normalisation operations encountered (Fused Batch Norm, IPU Specific Group Norm, IPU Specific Layer Norm and IPU Specific Instance Norm).

norms.use_stable_statistics: bool = False

If True, the difference between the mean and the activations is computed first, before the variance is computed. The implementation with this flag set to True is slower than when it is set to False.

norms.experimental

Sub-category containing experimental configuration options for normalizations that may be changed or removed with short or no notice.

norms.experimental.distributed_batch_norm_replica_group_size: int = 1

When executing fused batch-norms for training, this option specifies how many replicas to aggregate the batch statistics across. For example, if a model is being executed across four replicas and this option is set to two, replicas 0 and 1 will be grouped together and replicas 2 and 3 will be grouped together and the batch norm statistics will be synchronously all-reduced every time the layer is executed (including any recomputation) across the replicas within a group. This option should not be used when using model parallelism (pipelining) and it is not supported with I/O tiles. When recomputation is enabled and the training fused batch norm operation is recomputed, the statistics will have to be all-reduced again, unless the RecomputeAndBackpropagateInterleaved recomputation mode is used.
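The grouping in the example above (four replicas, group size two) can be sketched as follows. The function is hypothetical and only shows which replica indices end up in the same group; requiring the replica count to be a multiple of the group size is an assumption of this sketch, not something stated in the option's description.

```python
def batch_norm_replica_groups(num_replicas, group_size):
    """Partition replica indices into consecutive groups of group_size."""
    # Assumption for this sketch: groups must divide the replicas evenly.
    if num_replicas % group_size != 0:
        raise ValueError("num_replicas must be a multiple of group_size")
    return [
        list(range(start, start + group_size))
        for start in range(0, num_replicas, group_size)
    ]


print(batch_norm_replica_groups(4, 2))  # [[0, 1], [2, 3]]
```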

optimizations

Sub-category containing configuration options that control a variety of optimizations made when lowering the TensorFlow graph to Poplar.

optimizations.math

Sub-category containing configuration options related to simplifying algebraic mathematical expressions.

optimizations.math.fast: bool = False

Enables optimizations which allow arbitrary reassociations and transformations of mathematical operations with no accuracy guarantees. Enabling this option can result in incorrect output for programs that depend on an exact implementation of IEEE floating point for maths functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
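As a reminder of why reassociation carries no accuracy guarantee, the following plain-Python snippet shows that floating-point addition is not associative. It runs on the host with float64 values and only illustrates the general effect; it does not show what the compiler actually rewrites.

```python
a, b, c = 1e16, -1e16, 1.0

left_to_right = (a + b) + c  # the large terms cancel first, then 1.0 is added
reassociated = a + (b + c)   # 1.0 is lost when it is added to -1e16

print(left_to_right)  # 1.0
print(reassociated)   # 0.0
```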

optimizations.math.dot_strength: bool = True

Enable dot strength optimization. When set to True, the graph optimizer will convert a dot product where either the LHS or the RHS contains only batch and/or contracting dimensions to an elementwise matrix multiplication.

optimizations.prefetch_data_streams: bool = True

If True (default), prefetching of data for data streams on the host will be overlapped with execution on the IPU.

optimizations.combine_embedding_lookups: bool = False

If True, fuse embedding lookups which are on the same tensor. This might improve performance but increase memory usage.

optimizations.combine_matmuls: bool = False

If True, fuse matmul operations if they share the same weights or the same input.

optimizations.enable_graph_outlining: bool = True

If True (default), operations in the graph which are the same but with different input tensors may be outlined. This means the same code will be re-used to execute them, reducing the amount of program code, but their inputs will be exchanged into a common memory location to do so, increasing execution time. If you care more about speed than memory, these optimizations can be disabled by setting this option to False.

optimizations.merge_infeed_io_copies: bool = True

If True, this flag will merge the streamed host to device input copies into one larger copy. This may reduce the time to copy data from the host, at the expense of increasing the live tensor memory on the device.

optimizations.maximum_cross_replica_sum_buffer_size: int = 0

The maximum number of bytes that can be waiting before a cross-replica sum op is scheduled. 0 (default) means that they are scheduled immediately. This value represents a trade-off between always-live and not-always-live memory: increasing maximum_cross_replica_sum_buffer_size will lead to larger temporary buffers in the cross-replica sums, but fewer cross-replica sums overall and therefore less control code. If your model contains a lot of trainable variables, you are strongly advised to consider adjusting this option.
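The trade-off can be illustrated with a simplified model in which pending tensors are greedily packed into clusters that do not exceed the buffer limit. This is a sketch under that assumption, not the scheduler's actual algorithm, which also has to respect data dependencies; the function name and sizes are made up for the example.

```python
def cluster_by_buffer_size(tensor_sizes, max_buffer_bytes):
    """Greedily pack tensors into clusters no larger than the limit.

    With a limit of 0, every tensor is scheduled on its own.
    """
    clusters, current, current_bytes = [], [], 0
    for size in tensor_sizes:
        # Flush the pending cluster if adding this tensor would exceed the limit.
        if current and current_bytes + size > max_buffer_bytes:
            clusters.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        clusters.append(current)
    return clusters


gradient_bytes = [4096, 1024, 1024, 8192, 2048]
print(len(cluster_by_buffer_size(gradient_bytes, 0)))     # 5
print(len(cluster_by_buffer_size(gradient_bytes, 8192)))  # 3
```

A larger limit yields fewer, larger reductions, at the cost of a larger temporary buffer.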

optimizations.maximum_reduce_scatter_buffer_size: int = 0

The maximum number of bytes that can be waiting before a reduce scatter op is scheduled.

optimizations.maximum_inter_ipu_copies_buffer_size: int = 0

The maximum number of bytes that can be waiting before an inter-IPU copy is scheduled.

optimizations.maximum_send_recv_cluster_size: int = 0

The maximum number of bytes that can be waiting before a cluster of send/recv instructions to/from the host is scheduled. These are lowered to stream copies that can be merged by Poplar.

optimizations.maximum_reduce_many_buffer_size: int = 0

The maximum size (in bytes) a cluster of reduce operations can reach before it is scheduled. These clusters are lowered to popops ReduceMany operations.

optimizations.maximum_all_gather_buffer_size: int = 0

The maximum size (in bytes) a cluster of all gather operations can reach before it is scheduled. These clusters are lowered to popops AllGather operations.

optimizations.minimum_remote_tensor_size: int = 128

The minimum size (in bytes) a tensor must be in order to be considered for being stored in remote memory.

optimizations.merge_remote_buffers: MergeRemoteBuffersBehaviour = MergeRemoteBuffersBehaviour.IF_BENEFICIAL

Whether to merge compatible remote buffers. Merging of remote buffers can allow for more code re-use if the only difference between computations are the remote buffers being accessed. Must be a MergeRemoteBuffersBehaviour.

optimizations.enable_gather_simplifier: bool = True

If True (default), more aggressive optimizations will be done on embedding lookups.

optimizations.triangular_solve_expander_block_size: int = 0

Defines the block size for the triangular solver expander. The processing within each block is performed on a single tile. The control code for performing computations over blocks is unrolled on the device. For a matrix of rank N and block size B, there are log2(N/B) iterations of the control code. The choice of this parameter therefore has to balance between the amount of data in a tile (lower value is better, gives better parallelism) and the amount of control code (larger value is better, less control code). A value of 0 (default) selects an implementation-defined default.

optimizations.cholesky_block_size: int = 0

Defines the block size for the Cholesky factoriser. The processing within each block is performed on a single tile. The control code for performing computations over blocks is unrolled on the device. For a matrix of rank N and block size B, there are N/B iterations of the control code. The choice of this parameter therefore has to balance between the amount of data in a tile (lower value is better, gives better parallelism) and the amount of control code (larger value is better, less control code). A value of 0 (default) selects an implementation-defined default.
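To get a feel for how the block size affects the amount of unrolled control code, the two iteration counts quoted above (log2(N/B) for the triangular solver and N/B for the Cholesky factoriser) can be tabulated. The sketch below assumes that N/B is a power of two; the function names are illustrative.

```python
import math


def triangular_solve_iterations(n, block_size):
    """log2(N/B) control-code iterations (assumes N/B is a power of two)."""
    return int(math.log2(n / block_size))


def cholesky_iterations(n, block_size):
    """N/B control-code iterations."""
    return n // block_size


for block_size in (32, 64, 128):
    print(block_size,
          triangular_solve_iterations(1024, block_size),
          cholesky_iterations(1024, block_size))
# 32 5 32
# 64 4 16
# 128 3 8
```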

optimizations.enable_fast_math: bool = False

Note

DEPRECATED: ‘enable_fast_math’ has been moved to ‘optimizations.math.fast’. It will be removed from this location in a future release.

Enables optimizations which allow arbitrary reassociations and transformations of mathematical operations with no accuracy guarantees. Enabling this option can result in incorrect output for programs that depend on an exact implementation of IEEE floating point for maths functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

optimizations.enable_dynamic_slice_replacement: bool = True

Control whether or not we replace dynamicSlice/Update with multiSlice/Update. This can increase parallelism and provide better memory usage since multiSlice/Update can be planned.

pooling

Sub-category containing configuration options that affect pooling operations.

pooling.poplar_options: dict = {}

Set the PopLibs pooling compilation options for the session. Must be a dictionary of valid PopLibs pooling options. See pool in the PopLibs API reference for the full list of options. The options will be applied to all pooling operations in the session graph during compilation.

scheduling

Sub-category containing configuration options that affect the scheduling of operations in the graph during compilation.

scheduling.algorithm: SchedulingAlgorithm = SchedulingAlgorithm.CHOOSE_BEST

A SchedulingAlgorithm. If SchedulingAlgorithm.CHOOSE_BEST (default), several schedules will be created and the one with the lowest predicted liveness chosen. Setting this to a specific scheduling algorithm forces the compiler to use that algorithm when ordering the instructions.

scheduling.maximum_scheduler_lookahead_depth: int = 5

Controls how far the LOOK_AHEAD scheduling algorithm can look beyond a given scheduling decision to understand the max-liveness implications. This search space grows very quickly and can take an unacceptable amount of time for large values. Only for SchedulingAlgorithm.LOOK_AHEAD.

scheduling.maximum_scheduler_search_space_size: int = 64

The upper limit on the size of the LOOK_AHEAD scheduling algorithm’s search space, which guarantees that it will terminate in a reasonable amount of time. Only for SchedulingAlgorithm.LOOK_AHEAD.

get_attribute_metadata(attr)

Get the attribute metadata for attr.

Parameters

attr – required, a string which specifies which attribute to retrieve metadata for. Must be its full name relative to the category this method is being called on.

Returns

An AttributeMetadata object containing the metadata for the attribute.

configure_ipu_system(device='cpu')

Configure the IPU system with this config.

Parameters

device – The CPU device which is local to the IPU hardware.

from_dict(dct)

Restore configuration from a dict object.

Parameters

dct – A dictionary containing a configuration.

to_dict()

Export the configuration stored within this configuration object to a dict.

Returns

A dictionary containing the configuration.

from_json(json_cfg)

Restore configuration from a JSON string.

Parameters

json_cfg – A JSON string containing a configuration.

to_json()

Export the configuration stored within this configuration object as a JSON string.

Returns

A JSON string containing the configuration.

selection_order

The order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices. Must be one of SelectionOrder.

select_ipus: Union[int, List[int], Tuple[int, ...]]

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The TensorFlow devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each TensorFlow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

$ gc-info -l
Graphcore device listing:

-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id:  [2], type:      [PCIe], PCI Domain: [0000:23:00.0]
-+- Id:  [3], type:      [PCIe], PCI Domain: [0000:24:00.0]
-+- Id:  [4], type:      [PCIe], PCI Domain: [0000:3d:00.0]
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:3e:00.0]
-+- Id:  [6], type:      [PCIe], PCI Domain: [0000:43:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:44:00.0]
-+- Id:  [8], type:      [PCIe], PCI Domain: [0000:8b:00.0]
-+- Id:  [9], type:      [PCIe], PCI Domain: [0000:8c:00.0]
-+- Id: [10], type:      [PCIe], PCI Domain: [0000:8e:00.0]
-+- Id: [11], type:      [PCIe], PCI Domain: [0000:8f:00.0]
-+- Id: [12], type:      [PCIe], PCI Domain: [0000:b8:00.0]
-+- Id: [13], type:      [PCIe], PCI Domain: [0000:b9:00.0]
-+- Id: [14], type:      [PCIe], PCI Domain: [0000:ba:00.0]
-+- Id: [15], type:      [PCIe], PCI Domain: [0000:bb:00.0]
-+- Id: [16], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
-+- Id: [17], type: [Multi IPU]
|--- PCIe Id:  [4], DNC Id: [0], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:43:00.0]
-+- Id: [18], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
-+- Id: [19], type: [Multi IPU]
|--- PCIe Id:  [2], DNC Id: [0], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [20], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
-+- Id: [21], type: [Multi IPU]
|--- PCIe Id: [12], DNC Id: [0], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:ba:00.0]
-+- Id: [22], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
-+- Id: [23], type: [Multi IPU]
|--- PCIe Id: [10], DNC Id: [0], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [1], PCI Domain: [0000:8b:00.0]
-+- Id: [24], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
-+- Id: [25], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [2], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [26], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
-+- Id: [27], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [2], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:8b:00.0]
-+- Id: [28], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [29], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [4], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [5], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [6], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [7], PCI Domain: [0000:8b:00.0]
-+- Id: [30], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
|--- PCIe Id: [13], DNC Id: [8], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [9], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [10], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [11], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [12], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [13], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [14], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [15], PCI Domain: [0000:8b:00.0]

Examples based on the listing above:

config = IPUConfig()

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:1a:00.0 by using IPU configuration index 0
config.select_ipus = 0

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:8b:00.0 by using IPU configuration index 8
config.select_ipus = 8

# Create two TensorFlow devices, with one IPU each, being devices at
# indices 0 and 1
config.select_ipus = [0, 1]

# Create two TensorFlow devices, with four IPUs each, using the device
# configurations at indices 24 (0000:3e:00.0, 0000:44:00.0,
# 0000:3d:00.0, 0000:43:00.0) and 25 (0000:24:00.0, 0000:1b:00.0,
# 0000:23:00.0, 0000:1a:00.0)
config.select_ipus = [24, 25]

# Create four TensorFlow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:23:00.0, 0000:24:00.0.
config.select_ipus = [0, 1, 2, 3]
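When choosing indices for select_ipus, it can help to see which physical IPUs each configuration index controls. The sketch below parses text in the format of the gc-info -l listing shown above using only the Python standard library. It assumes that the output format matches that listing exactly; the function name is hypothetical and not part of any Graphcore tool.

```python
import re

DEVICE_RE = re.compile(r"^-\+- Id:\s*\[(\d+)\]")
MEMBER_RE = re.compile(r"^\|--- PCIe Id:\s*\[(\d+)\]")


def ipus_per_configuration(listing):
    """Map each configuration index to the physical IPU ids it controls."""
    configs = {}
    current = None
    for line in listing.splitlines():
        line = line.strip()
        device = DEVICE_RE.match(line)
        member = MEMBER_RE.match(line)
        if device:
            current = int(device.group(1))
            # A single-IPU entry controls itself; the members of a
            # Multi IPU entry are listed on the lines that follow it.
            configs[current] = [] if "Multi IPU" in line else [current]
        elif member and current is not None:
            configs[current].append(int(member.group(1)))
    return configs


SAMPLE = """\
-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id: [19], type: [Multi IPU]
|--- PCIe Id:  [2], DNC Id: [0], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
"""

configs = ipus_per_configuration(SAMPLE)
print(configs[19])  # [2, 0]
```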
gcl_poplar_options

Set the IPU options for the Graphcore Communication Library. Must be a dictionary of valid GCL options. See the allReduce function in the GCL API reference for the full list of options. The options will be applied to all applicable GCL collective operations in the graph during compilation.

configure_ipu_system(device='cpu')

Configure the IPU system with this config.

Parameters

device – The CPU device which is local to the IPU hardware.

device_connection

Sub-category containing configuration options to control when to attach to IPU devices.

slices

Sub-category containing configuration options that affect slice operations.

floating_point_behaviour

Sub-category containing configuration options that affect the floating point behaviour of the IPU devices, including stochastic rounding and behaviour when an overflow is encountered during execution. For more information, see Controlling the half-precision floating-point unit.

io_tiles

Sub-category containing configuration options that affect parallel I/O on a subset of tiles. For more information, see I/O Tiles.

auto_select_ipus: Union[int, List[int], Tuple[int, ...]]

Configure the IPUs to be used by the session. The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The devices will be labeled /device:IPU:0, /device:IPU:1 and so on.

Each device can control a specific number of IPUs, given by the num_ipus parameter. The system will automatically select IPU configurations from the available IPUs, where they match the desired number of IPUs.

Examples:

config = IPUConfig()

# Create a single TensorFlow device, with one IPU
config.auto_select_ipus = 1

# Create two TensorFlow devices, with two IPUs per device.
config.auto_select_ipus = [2, 2]

# Create two TensorFlow devices, with one IPU in the first device and two
# IPUs in the second device.
config.auto_select_ipus = [1, 2]
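The relationship between the value assigned to auto_select_ipus and the resulting TensorFlow device labels can be sketched as follows. This is a plain-Python illustration of the labelling rule described above, not a call into the IPU API; the function name is hypothetical.

```python
def describe_auto_selection(auto_select_ipus):
    """Map each TensorFlow device label to the number of IPUs it controls."""
    if isinstance(auto_select_ipus, int):
        counts = [auto_select_ipus]  # a single int describes one device
    else:
        counts = list(auto_select_ipus)
    return {"/device:IPU:%d" % index: count
            for index, count in enumerate(counts)}


print(describe_auto_selection(1))       # {'/device:IPU:0': 1}
print(describe_auto_selection([1, 2]))  # {'/device:IPU:0': 1, '/device:IPU:1': 2}
```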
ipu_model

Sub-category containing configuration options related to the IPU model. Note that these will only have an effect if you are running with the IPU model enabled. For more information, see TF_POPLAR_FLAGS environment variable.

matmuls

Sub-category containing configuration options that affect matmuls.

norms

Sub-category containing configuration options that affect normalizations. Note that these options will be applied to all normalisation operations encountered (Fused Batch Norm, IPU Specific Group Norm, IPU Specific Layer Norm and IPU Specific Instance Norm).

optimizations

Sub-category containing configuration options that control a variety of optimizations made when lowering the TensorFlow graph to Poplar.

pooling

Sub-category containing configuration options that affect pooling operations.

scheduling

Sub-category containing configuration options that affect the scheduling of operations in the graph during compilation.

convolutions

Sub-category containing configuration options that affect convolutions.

serialization_output_folder

Specifies the directory in which serialized Poplar executables will be saved. The value must be a valid path. The default (“”) disables executable serialization.

compilation_poplar_options

Set the Poplar compilation options for the session. Must be a dictionary of valid Poplar compilation flags. See the Engine class in the Poplar API reference for the full list of options.

experimental

Sub-category containing experimental configuration options that may be changed or removed with short or no notice.

22.9. Looping utilities

tensorflow.python.ipu.loops.repeat(n, body, inputs=None, infeed_queue=None, use_while_v1=True)

Builds a loop that executes a fixed number of iterations.

The set of loop-carried tensors corresponds to inputs. body must be a function that takes and returns the values of the loop-carried tensors.

Parameters
  • n – the number of loop iterations

  • body – a Python function that builds the loop body.

  • inputs – a list of initial values passed into the loop or None (equivalent to an empty list).

  • infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.

  • use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises
tensorflow.python.ipu.loops.while_loop(condition, body, inputs=None, infeed_queue=None, maximum_iterations=None, use_while_v1=True)

Builds a while loop for IPUs.

The set of loop-carried tensors corresponds to inputs. Both condition and body take the current value of the loop-carried tensors. condition must return a single boolean value that determines whether iteration continues. body must return an updated list of values for the loop-carried tensors.

Parameters
  • condition – a Python function that builds the loop condition.

  • body – a Python function that builds the loop body.

  • inputs – a list of initial values passed into the loop, or None (equivalent to an empty list).

  • infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.

  • use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

TypeError – if body or condition has the wrong signature.
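The loop-carried semantics of repeat and while_loop can be modelled in eager Python. The sketch below is only a reference for how values flow between iterations: the real functions build graph operations on tensors, and the infeed_queue, maximum_iterations and use_while_v1 arguments are deliberately left out. The function names are hypothetical.

```python
def _as_list(result):
    """Normalise a body's return value to a list of loop-carried values."""
    return list(result) if isinstance(result, (list, tuple)) else [result]


def repeat_reference(n, body, inputs=None):
    """Call body n times, feeding each result into the next call."""
    values = list(inputs) if inputs is not None else []
    for _ in range(n):
        values = _as_list(body(*values))
    return values


def while_loop_reference(condition, body, inputs=None):
    """Call body for as long as condition holds on the current values."""
    values = list(inputs) if inputs is not None else []
    while condition(*values):
        values = _as_list(body(*values))
    return values


# Two loop-carried values: a running total and a constant step.
print(repeat_reference(3, lambda total, step: (total + step, step), [0, 2]))  # [6, 2]
print(while_loop_reference(lambda i: i < 5, lambda i: i + 1, [0]))  # [5]
```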

22.10. Distributed training

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUDistributedVariable(*args, **kwargs)
class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMirroredVariable(*args, **kwargs)
class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMultiWorkerExtendedV1(container_strategy, cluster_resolver, ipu_device, variables_on_host)
__init__(container_strategy, cluster_resolver, ipu_device, variables_on_host)
read_var(var)

Read the aggregate value of a replica-local variable.

tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMultiWorkerStrategy

alias of IPUMultiWorkerStrategyV1

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMultiWorkerStrategyV1(cluster_resolver, ipu_device='/device:IPU:0', variables_on_host=False)

This is a distribution strategy for synchronous training using IPUs on multiple workers with between-graph replication. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use PopDistStrategy instead.

By default variables and ops are placed on the IPU of each worker, but variables can optionally be placed on the host by setting variables_on_host=True. In any case, this strategy will make sure that variables are kept in sync between the workers by performing multi-worker reductions.

The multi-worker reductions are done using TensorFlow’s implementation of collective operations over gRPC.

Variable synchronization

The default behavior is to sync (allreduce) the variables when they are written (sync-on-write). This is a good choice when reads are at least as common as writes. However, for variables where writes are more common than reads (like metrics or population statistics in batch normalization layers), it is beneficial to only sync (allreduce) the variables when they are read (sync-on-read).

In both cases, it is important that all the workers participate in the sync, otherwise progress will be blocked. Take special care in the latter case (with sync-on-read variables), because it implies that all the workers need to read these variables at the same time. For example, it implies that all the workers must checkpoint the model at the same time.

Sync-on-read variables are placed on the IPU even when placement on the host was requested (with variables_on_host=True), because this allows the ops to update the variables directly on the IPU without any host involvement. The variable is streamed to the host and allreduced there only when it is read.
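The difference between the two policies can be illustrated with a small host-side simulation that counts allreduce operations. It assumes SUM aggregation and a single read after all writes; the function names are illustrative and no IPU or TensorFlow code is involved.

```python
def allreduce_sum(per_worker_values, counter):
    """Simulated allreduce: every worker receives the sum of all values."""
    counter["allreduces"] += 1
    total = sum(per_worker_values)
    return [total] * len(per_worker_values)


def run_sync_on_write(updates_per_worker):
    """Every write is allreduced, so all workers always hold the same value."""
    counter = {"allreduces": 0}
    value = [0.0] * len(updates_per_worker)
    for step_updates in zip(*updates_per_worker):
        reduced = allreduce_sum(list(step_updates), counter)
        value = [v + r for v, r in zip(value, reduced)]
    return value[0], counter["allreduces"]


def run_sync_on_read(updates_per_worker):
    """Writes stay local; one allreduce happens when the value is read."""
    counter = {"allreduces": 0}
    local = [sum(updates) for updates in updates_per_worker]
    reduced = allreduce_sum(local, counter)
    return reduced[0], counter["allreduces"]


updates = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # two workers, three writes each
print(run_sync_on_write(updates))  # (66.0, 3)
print(run_sync_on_read(updates))   # (66.0, 1)
```

Both policies arrive at the same value, but when writes outnumber reads, sync-on-read needs fewer allreduces.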

Weight updates

When used during training with an Optimizer, there is an implicit allreduce in the optimizer.apply_gradients() function (which is called from optimizer.minimize()). This will automatically cause the gradients to be streamed to the host of each worker, allreduced between the workers, and then streamed back to the IPU of each worker, where identical weight updates are performed (keeping the workers in sync). This is done even when the call to optimizer.apply_gradients() is inside a function passed to ipu_compiler.compile(), as the allreduce is extracted from the compiled XLA cluster and placed on the host in the outside graph (by internally using an outside_compilation_scope()).

When variables are placed on the host, the weight updates should also be placed on the host. In other words, the optimizer.compute_gradients() call should be placed on the IPU, while the optimizer.apply_gradients() call should be placed on the host. This must be done explicitly. In this scenario all the “slot” variables used by the optimizer (e.g. the momentum accumulator) are then also kept only in host memory and never used on the IPU, saving IPU memory.

Compatibility

IPUEstimator: Pass the IPUMultiWorkerStrategyV1 instance to the RunConfig as the train_distribute argument. When variables are placed on the host, the optimizer.apply_gradients() call should also be placed on the host by using the IPUEstimatorSpec host_call argument. See full example: Distributed training.

IPUPipelineEstimator: Pass the IPUMultiWorkerStrategyV1 instance to the RunConfig as the train_distribute argument. Placing variables on the host is not currently supported here.

Keras Model.fit: Not currently supported.

Custom training loop: Pass the training step function to IPUMultiWorkerStrategyV1.run(). With variables on the IPU, the optimizer.apply_gradients() call can be done from an XLA compiled IPU function, and the inter-host allreduce will be automatically extracted from the compiled XLA cluster and placed on the host. With variables on the host, the optimizer.apply_gradients() call must be explicitly placed on the host.

Example using a custom training loop with pipelining

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = IPUMultiWorkerStrategyV1(cluster_resolver)

sess_config = tf.ConfigProto()
sess_config = strategy.update_config_proto(sess_config)
server = tf.distribute.Server(cluster_resolver.cluster_spec(),
                              job_name=cluster_resolver.task_type,
                              task_index=cluster_resolver.task_id,
                              config=sess_config)
sess_target = server.target

with strategy.scope():

  infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset)
  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

  def stage1(lr, images, labels):
    partial = keras.layers.Dense(256, activation="relu")(images)
    partial = keras.layers.Dense(128, activation="relu")(partial)
    return lr, partial, labels

  def stage2(lr, partial, labels):
    logits = keras.layers.Dense(10)(partial)
    per_example_loss = keras.losses.sparse_categorical_crossentropy(
        y_true=labels, y_pred=logits, from_logits=True)
    # In a custom training loop, the optimiser does an allreduce *sum*, not
    # average, of the gradients across the distributed workers. Therefore
    # we want to divide the loss here by the *global* batch size, which is
    # done by the `tf.nn.compute_average_loss()` function.
    loss = nn.compute_average_loss(per_example_loss)
    return lr, loss

  def optimizer_function(lr, loss):
    optimizer = GradientDescentOptimizer(lr)
    return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  def model(lr):
    pipeline_op = pipelining_ops.pipeline(
        computational_stages=[stage1, stage2],
        gradient_accumulation_count=gradient_accumulation_count,
        inputs=[lr],
        infeed_queue=infeed_queue,
        outfeed_queue=outfeed_queue,
        optimizer_function=optimizer_function,
        name="Pipeline")
    return pipeline_op

  def compiled_model(lr):
    with ipu_scope("/device:IPU:0"):
      return ipu_compiler.compile(model, inputs=[lr])

  with ops.device("cpu"):
    lr = array_ops.placeholder(np.float32, [])

  train_op = strategy.run(compiled_model, args=[lr])

  _, per_worker_losses = outfeed_queue.dequeue()

  # Mean across the local `gradient_accumulation_count` batches:
  per_worker_loss = math_ops.reduce_mean(per_worker_losses)

  # Global mean across the distributed workers (since it is already
  # divided by the global batch size above, we do a sum here):
  global_loss = strategy.reduce(ReduceOp.SUM, per_worker_loss)

  config = ipu.config.IPUConfig()
  config.auto_select_ipus = 2
  config.configure_ipu_system()
  ipu_utils.move_variable_initialization_to_cpu()

  with session_lib.Session(target=sess_target, config=sess_config) as sess:
    sess.run(infeed_queue.initializer)
    sess.run(variables.global_variables_initializer())

    for _ in range(10):
      sess.run(train_op, {lr: 0.01})
      global_loss_val = sess.run(global_loss)
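The comment in stage2 about dividing by the global batch size can be checked with a little arithmetic. The sketch below uses plain Python numbers in place of tensors and assumes two workers with a local batch size of two. The helper is a stand-in for the effect of tf.nn.compute_average_loss(), not a reimplementation of it.

```python
def average_loss(per_example_losses, global_batch_size):
    """Sum the local per-example losses and divide by the global batch size."""
    return sum(per_example_losses) / global_batch_size


worker_losses = [[0.5, 1.5], [2.0, 4.0]]  # two workers, local batch size 2
global_batch_size = sum(len(losses) for losses in worker_losses)

per_worker = [average_loss(losses, global_batch_size)
              for losses in worker_losses]
global_loss = sum(per_worker)  # the allreduce *sum* across workers

print(per_worker)   # [0.5, 1.5]
print(global_loss)  # 2.0, the mean of all four per-example losses
```

Because each worker has already divided by the global batch size, summing across workers yields the global mean.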
__init__(cluster_resolver, ipu_device='/device:IPU:0', variables_on_host=False)

DEPRECATED FUNCTION

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use PopDistStrategy instead.

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUOnReadPolicy(aggregation)
class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUOnWritePolicy(aggregation)
class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUSyncOnReadVariable(*args, **kwargs)

22.11. Horovod

tensorflow.python.ipu.distributed.allgather(tensor, name=None)

An op which concatenates the input tensor with the same input tensor on all other Horovod processes.

The concatenation is done on the first dimension, so the input tensors on the different processes must have the same rank and shape, except for the first dimension, which is allowed to be different.

Returns

A tensor of the same type as tensor, concatenated on dimension zero across all processes. The shape is identical to the input shape, except for the first dimension, which may be greater and is the sum of all first dimensions of the tensors in different Horovod processes.
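The shape rule for allgather can be illustrated with nested Python lists standing in for 2-D tensors. This is a host-side sketch of the concatenation semantics only; it does not use Horovod, and the function name is hypothetical.

```python
def allgather_reference(per_process_tensors):
    """Concatenate 2-D tensors (lists of rows) from all processes on dim 0."""
    row_lengths = {len(row) for tensor in per_process_tensors for row in tensor}
    if len(row_lengths) > 1:
        raise ValueError("all dimensions except the first must match")
    gathered = []
    for tensor in per_process_tensors:
        gathered.extend(tensor)
    return gathered


rank0 = [[1, 2], [3, 4]]  # shape [2, 2]
rank1 = [[5, 6]]          # shape [1, 2]
result = allgather_reference([rank0, rank1])

print(result)       # [[1, 2], [3, 4], [5, 6]]
print(len(result))  # 3, the sum of the first dimensions
```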

tensorflow.python.ipu.distributed.allreduce(tensor, op=None, prescale_factor=1.0, postscale_factor=1.0)

Perform an allreduce on a tf.Tensor or tf.IndexedSlices.

This function performs a bandwidth-optimal ring allreduce on the input tensor. If the input is a tf.IndexedSlices, the function instead does an allgather on the values and the indices, effectively doing an allreduce on the represented tensor.

Parameters
  • tensor – tf.Tensor, tf.Variable, or tf.IndexedSlices to reduce. The shape of the input must be identical across all ranks.

  • op – The reduction operation to combine tensors across different ranks. Defaults to Average if None is given.

  • prescale_factor – Multiplicative factor to scale tensor before allreduce.

  • postscale_factor – Multiplicative factor to scale tensor after allreduce.

Returns

A tensor of the same shape and type as tensor, reduced across all processes using op (averaged by default).
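The interplay of the reduction and the two scale factors can be sketched with scalars standing in for tensors. The snippet assumes that prescale_factor is applied to each rank's value before the reduction and postscale_factor to the reduced result, as the parameter descriptions suggest. The boolean average flag is a stand-in for the op argument and is not part of the real signature.

```python
def allreduce_reference(per_rank_values, average=True,
                        prescale_factor=1.0, postscale_factor=1.0):
    """Scale each rank's value, reduce across ranks, then scale the result."""
    scaled = [value * prescale_factor for value in per_rank_values]
    reduced = sum(scaled)
    if average:
        reduced /= len(per_rank_values)
    return reduced * postscale_factor


values = [1.0, 2.0, 3.0, 6.0]  # one value per rank
print(allreduce_reference(values))                 # 3.0
print(allreduce_reference(values, average=False))  # 12.0
print(allreduce_reference(values, average=False,
                          prescale_factor=0.5,
                          postscale_factor=4.0))   # 24.0
```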

tensorflow.python.ipu.distributed.broadcast(tensor, root_rank, name=None)

An op which broadcasts the input tensor on root rank to the same input tensor on all other Horovod processes.

The broadcast operation is keyed by the name of the op. The tensor type and shape must be the same on all Horovod processes for a given name. The broadcast will not start until all processes are ready to send and receive the tensor.

Returns

A tensor of the same shape and type as tensor, with the value broadcasted from root rank.

class tensorflow.python.ipu.distributed.ipu_horovod_strategy.IPUHorovodExtendedV1(container_strategy, cluster_resolver, ipu_device, variables_on_host)
__init__(container_strategy, cluster_resolver, ipu_device, variables_on_host)
tensorflow.python.ipu.distributed.ipu_horovod_strategy.IPUHorovodStrategy

alias of IPUHorovodStrategyV1

class tensorflow.python.ipu.distributed.ipu_horovod_strategy.IPUHorovodStrategyV1(ipu_device='/device:IPU:0', variables_on_host=False)

This is a distribution strategy using Horovod. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use PopDistStrategy instead.

Usage is very similar to the IPUMultiWorkerStrategyV1, with the following differences:

  • There is no cluster_resolver argument, as Horovod’s built-in cluster discovery is used. Hence the TF_CONFIG environment variable containing the cluster configuration is not needed.

  • As Horovod sets up the necessary communication channels, starting a tf.distribute.Server is not needed either.

  • Launching the cluster should be done with the mpirun tool.

Example using a custom training loop with pipelining

strategy = IPUHorovodStrategyV1()

with strategy.scope():

  infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset)
  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

  def stage1(lr, images, labels):
    partial = keras.layers.Dense(256, activation="relu")(images)
    partial = keras.layers.Dense(128, activation="relu")(partial)
    return lr, partial, labels

  def stage2(lr, partial, labels):
    logits = keras.layers.Dense(10)(partial)
    per_example_loss = keras.losses.sparse_categorical_crossentropy(
        y_true=labels, y_pred=logits, from_logits=True)
    # In a custom training loop, the optimiser does an allreduce *sum*, not
    # average, of the gradients across the distributed workers. Therefore
    # we want to divide the loss here by the *global* batch size, which is
    # done by the `tf.nn.compute_average_loss()` function.
    loss = nn.compute_average_loss(per_example_loss)
    return lr, loss

  def optimizer_function(lr, loss):
    optimizer = GradientDescentOptimizer(lr)
    return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  def model(lr):
    pipeline_op = pipelining_ops.pipeline(
        computational_stages=[stage1, stage2],
        gradient_accumulation_count=gradient_accumulation_count,
        inputs=[lr],
        infeed_queue=infeed_queue,
        outfeed_queue=outfeed_queue,
        optimizer_function=optimizer_function,
        name="Pipeline")
    return pipeline_op

  def compiled_model(lr):
    with ipu_scope("/device:IPU:0"):
      return ipu_compiler.compile(model, inputs=[lr])

  with ops.device("cpu"):
    lr = array_ops.placeholder(np.float32, [])

  train_op = strategy.run(compiled_model, args=[lr])

  _, per_worker_losses = outfeed_queue.dequeue()

  # Mean across the local `gradient_accumulation_count` batches:
  per_worker_loss = math_ops.reduce_mean(per_worker_losses)

  # Global mean across the distributed workers (since it is already
  # divided by the global batch size above, we do a sum here):
  global_loss = strategy.reduce(ReduceOp.SUM, per_worker_loss)

  config = ipu.config.IPUConfig()
  config.auto_select_ipus = 2
  config.configure_ipu_system()
  ipu_utils.move_variable_initialization_to_cpu()

  with session.Session() as sess:
    sess.run(infeed_queue.initializer)
    sess.run(variables.global_variables_initializer())

    for _ in range(10):
      sess.run(train_op, {lr: 0.01})
      global_loss_val = sess.run(global_loss)
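
The comments in the example about dividing by the global batch size can be checked with plain arithmetic. This sketch uses made-up loss values and two hypothetical workers. It only illustrates why a sum across workers recovers the global mean, and it does not call any TensorFlow function.

```python
# Hypothetical per-example losses on two workers (illustrative values only).
worker_losses = [[1.0, 3.0], [2.0, 6.0]]
global_batch_size = sum(len(losses) for losses in worker_losses)

# Each worker divides the sum of its local losses by the *global* batch size.
per_worker_loss = [sum(losses) / global_batch_size for losses in worker_losses]

# Summing the per-worker values then yields the mean over all examples.
global_loss = sum(per_worker_loss)
global_mean = sum(sum(losses) for losses in worker_losses) / global_batch_size
```
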
__init__(ipu_device='/device:IPU:0', variables_on_host=False)

DEPRECATED FUNCTION

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use PopDistStrategy instead.

class tensorflow.python.ipu.distributed.popdist_strategy.IPUDistributedVariable(*args, **kwargs)
class tensorflow.python.ipu.distributed.popdist_strategy.IPUMirroredVariable(*args, **kwargs)
class tensorflow.python.ipu.distributed.popdist_strategy.IPUOnReadPolicy(aggregation)
class tensorflow.python.ipu.distributed.popdist_strategy.IPUOnWritePolicy(aggregation)
class tensorflow.python.ipu.distributed.popdist_strategy.IPUSyncOnReadVariable(*args, **kwargs)
class tensorflow.python.ipu.distributed.popdist_strategy.PopDistExtendedV1(container_strategy, cluster_resolver, ipu_device, add_ipu_cross_replica_reductions)
__init__(container_strategy, cluster_resolver, ipu_device, add_ipu_cross_replica_reductions)
non_slot_devices(var_list)

Device(s) for non-slot variables.

DEPRECATED: TF 1.x ONLY.

This method returns non-slot devices where non-slot variables are placed. Users can create non-slot variables on these devices by using a block:

```python
with tf.distribute.StrategyExtended.colocate_vars_with(tf.distribute.StrategyExtended.non_slot_devices(…)):
```

Parameters

var_list – The list of variables being optimized, needed with the default tf.distribute.Strategy.

Returns

A sequence of devices for non-slot variables.

read_var(var)

Read the aggregate value of a replica-local variable.

class tensorflow.python.ipu.distributed.popdist_strategy.PopDistStrategy(ipu_device='/device:IPU:0', add_ipu_cross_replica_reductions=True, enable_dataset_iterators=True, enable_keras_extensions=True)

This is a distribution strategy for multi-replica distribution that uses compiled communications with GCL for reductions over IPU-Links and GW-Links. It uses Horovod for broadcasting of the initial values of variables to all processes. It also uses Horovod when a reduction is requested with a CPU as the current device.

This is the recommended distribution strategy when using PopDist and PopRun. The GCL reductions will then be performed across all the global replicas in the application.

__init__(ipu_device='/device:IPU:0', add_ipu_cross_replica_reductions=True, enable_dataset_iterators=True, enable_keras_extensions=True)
update_ipu_config(config)

Update the given IPU configuration with the multi-replica distribution options.

Parameters

config – The IPUConfig instance to update.

Returns

The IPUConfig instance.

Note

Both tensorflow.python.ipu.distributed.popdist_strategy.PopDistStrategy and tensorflow.python.ipu.distributed.ipu_horovod_strategy.IPUHorovodStrategy are still available through the deprecated module tensorflow.python.ipu.horovod.

22.12. Serving utilities

tensorflow.python.ipu.serving.export_keras(model, export_dir, batch_size=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False)

Export Keras model using the SavedModel format for TensorFlow serving.

Wrap model’s call function inside a while loop, add an infeed for the inputs and an outfeed for the outputs, convert any variables into constants and write a SavedModel containing an IPU runtime function and Poplar executable.

Parameters
  • model (tf.keras.Model) – The Keras model to export.

  • export_dir (str) – The path to the directory where the SavedModel will be written.

  • batch_size (int, optional) – The batch size value to be used in the exported model. If not specified and the model was built with a specified batch size (different than None), the exported model will use the currently set batch size. This argument must be specified if the model’s batch size is None.

  • output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel’s SignatureDef. If not provided, outputs will be named: output_0, output_1 and so on.

  • preprocessing_step (Callable or tf.function, optional) – Function that runs the preprocessing step on the CPU device. This function is called just before the Keras model. preprocessing_step and the Keras model are exported together. The preprocessing_step output is passed directly to the Keras model input queue.

  • preprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the preprocessing_step function. If preprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from preprocessing_step.

  • postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. This function is called after the Keras model. postprocessing_step and the Keras model are exported together. Tensors from the Keras model output queue are inputs to postprocessing_step.

  • postprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the postprocessing_step function. If postprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from postprocessing_step.

  • purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and if the target directory is not empty, the function fails with an error.

Returns

A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel’s assets subfolder.

Return type

tf.function

Raises
  • ValueError – If model does not have the export_for_ipu_serving method.

  • ValueError – If export_dir is not an empty directory and purge_export_dir is not set to True.

  • TypeError – If preprocessing_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.

  • TypeError – If postprocessing_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.

  • ValueError – If preprocessing_step_signature is an empty tuple or a list.

  • ValueError – If postprocessing_step_signature is an empty tuple or a list.

  • ValueError – If preprocessing_step is provided and preprocessing_step_signature is not provided and preprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If postprocessing_step is provided and postprocessing_step_signature is not provided and postprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.
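
The signature-related conditions above can be read as a small decision procedure. The sketch below restates them in plain Python under stated assumptions: resolve_signature and FakeFunction are hypothetical helpers, not part of the IPU API, and the input_signature attribute stands in for the signature captured from a tf.function. The check that every element is a tf.TensorSpec is omitted.

```python
def resolve_signature(step, signature):
    """Return the signature to use for `step`, following the documented rules."""
    if signature is not None:
        if not isinstance(signature, (list, tuple)):
            raise TypeError("signature must be a tuple, a list or None")
        if len(signature) == 0:
            raise ValueError("signature must not be empty")
        return signature
    # No explicit signature: fall back to the one captured by the function.
    captured = getattr(step, "input_signature", None)
    if captured is None:
        raise ValueError("no signature given and none captured from the step")
    return captured


class FakeFunction:
    """Stand-in for a tf.function, optionally created with an input_signature."""

    def __init__(self, input_signature=None):
        self.input_signature = input_signature


# An explicit signature takes precedence; otherwise the captured one is used.
explicit = resolve_signature(FakeFunction(), ["spec"])
captured = resolve_signature(FakeFunction(["spec_a"]), None)
```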

tensorflow.python.ipu.serving.export_pipeline(computational_stages, export_dir, iterations, inputs=None, device_mapping=None, pipeline_schedule=None, poplar_options=None, name=None, predict_step_signature=None, input_dataset=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False)

Create a pipelined SavedModel in export_dir for TensorFlow Serving.

Create a pipeline op using computational_stages, add an infeed for the inputs and an outfeed for the outputs, freeze any variables into constants and write a SavedModel containing an IPU runtime function (preceded by optional preprocessing step) and Poplar executable.

SavedModel flow: preprocessing_step (optional, CPU) -> predict_step (IPU) -> postprocessing_step (optional, CPU) -> result, where predict_step = computational_stages[0].

Parameters
  • computational_stages (list) – A list of Python functions or TensorFlow functions, where each function represents a computational stage in the pipeline. The function takes the outputs of the previous pipeline stage as its inputs.

  • export_dir (str) – Path to the directory where the SavedModel will be written.

  • iterations (int) – The number of times each computational stage will be executed during the execution of the pipeline. It can also be considered as the pipeline depth.

  • inputs (list, optional) – Arguments passed to the first computational stage without usage of infeed queue.

  • device_mapping (list, optional) – If provided, a list of length equal to the number of computational stages. An element at index i in the list represents which IPU the computational_stages[i] should reside on. This can be used to make sure computational stages which share tf.Variable objects are resident on the same IPU.

  • pipeline_schedule (PipelineSchedule, optional) – Which scheduling algorithm to use for pipeline lowering. Defaults to PipelineSchedule.Grouped.

  • poplar_options (list, optional) – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grain control of the Poplar options for a given forward propagation computational stage.

  • name (str, optional) – Name of this pipeline.

  • predict_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the first computational stage. If preprocessing_step is not provided and input_dataset is provided, this argument should be None. If preprocessing_step is provided or preprocessing_step and input_dataset are not provided and first computational stage is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from the first computational stage.

  • input_dataset (tf.Dataset, optional) – Dataset from which SavedModel’s input_signature will be inferred.

  • output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel’s SignatureDef. If not provided, outputs will be named: output_0, output_1 and so on.

  • preprocessing_step (Callable or tf.function, optional) – Function that runs preprocessing step on the CPU device. Function is called just before the first computational stage. preprocessing_step and compiled pipelined computational stages are exported together. preprocessing_step output will be directly passed to the input queue of the first computational stage.

  • preprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the preprocessing_step function. If preprocessing_step and input_dataset are provided, this argument should be None. If preprocessing_step is provided and input_dataset is not provided and preprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from preprocessing_step.

  • postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. Function is called after predict_step. postprocessing_step and predict_step are exported together. Tensors from the predict_step output queue are postprocessing_step inputs.

  • postprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the postprocessing_step function. If postprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from postprocessing_step.

  • purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and if target dir is not empty, the function fails with an error.

Returns

A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel’s assets subfolder.

Return type

tf.function

Raises
  • ValueError – If export_dir is not an empty directory.

  • TypeError – If input_dataset is not a tf.Dataset or NoneType.

  • TypeError – If predict_step_signature is neither a tuple, list of tf.TensorSpec objects nor a NoneType.

  • TypeError – If preprocessing_step_signature is neither a tuple, list of tf.TensorSpec objects nor a NoneType.

  • TypeError – If postprocessing_step_signature is neither a tuple, list of tf.TensorSpec objects nor a NoneType.

  • ValueError – If predict_step_signature is an empty tuple or list.

  • ValueError – If preprocessing_step_signature is an empty tuple or list.

  • ValueError – If postprocessing_step_signature is an empty tuple or list.

  • ValueError – If preprocessing_step is not provided and both predict_step_signature and input_dataset are provided.

  • ValueError – If preprocessing_step, predict_step_signature and input_dataset are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If preprocessing_step, preprocessing_step_signature and input_dataset are all provided.

  • ValueError – If preprocessing_step is provided, neither preprocessing_step_signature nor input_dataset is provided, and preprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If preprocessing_step and predict_step_signature are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If postprocessing_step is provided and postprocessing_step_signature is not provided and postprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.

tensorflow.python.ipu.serving.export_single_step(predict_step, export_dir, iterations, predict_step_signature=None, input_dataset=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False)

Create a SavedModel in export_dir for TensorFlow Serving.

Wrap predict_step inside a while loop, add an infeed for the inputs and an outfeed for the outputs, freeze any variables into constants and write a SavedModel containing a compiled IPU runtime function (preceded by optional preprocessing step) and Poplar executable.

SavedModel flow: preprocessing_step (optional, CPU) -> predict_step (IPU) -> postprocessing_step (optional, CPU) -> result
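
The order of the three steps can be pictured as ordinary function composition. This is a simplified pure-Python sketch that ignores the infeed and outfeed queues, the loop and the device placement; make_serving_flow is a hypothetical helper introduced only for illustration.

```python
def make_serving_flow(predict_step, preprocessing_step=None,
                      postprocessing_step=None):
    """Compose the optional CPU steps around predict_step in the stated order."""
    def flow(*inputs):
        if preprocessing_step is not None:
            inputs = preprocessing_step(*inputs)
            # Normalise a single return value to a tuple of arguments.
            if not isinstance(inputs, tuple):
                inputs = (inputs,)
        outputs = predict_step(*inputs)
        if postprocessing_step is not None:
            outputs = postprocessing_step(outputs)
        return outputs
    return flow


flow = make_serving_flow(
    predict_step=lambda x: x * 2,
    preprocessing_step=lambda x: x + 1,
    postprocessing_step=lambda y: y - 3,
)
result = flow(5)
```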

Parameters
  • predict_step (Callable or tf.function) – Function to compile into the IPU platform and export.

  • export_dir (str) – Path to the directory where the SavedModel will be written.

  • iterations (int) – Number of loop iterations.

  • predict_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the predict_step function. If preprocessing_step is not provided and input_dataset is provided, this argument should be None. If preprocessing_step is provided or preprocessing_step and input_dataset are not provided and predict_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from predict_step.

  • input_dataset (tf.Dataset, optional) – Dataset from which SavedModel input_signature will be inferred. If preprocessing_step is not provided and predict_step_signature is provided, this argument should be None. If preprocessing_step and preprocessing_step_signature are provided this argument should be None.

  • output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel’s SignatureDef. If not provided, outputs will be named: output_0, output_1 and so on.

  • preprocessing_step (Callable or tf.function, optional) – Function that runs the preprocessing step on the CPU device. Function is called just before predict_step. preprocessing_step and predict_step are exported together. preprocessing_step output is directly passed to the predict_step input queue.

  • preprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the preprocessing_step function. If preprocessing_step and input_dataset are provided, this argument should be None. If preprocessing_step is provided and input_dataset is not provided and preprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from preprocessing_step.

  • postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. Function is called after predict_step. postprocessing_step and predict_step are exported together. Tensors from the predict_step output queue are postprocessing_step inputs.

  • postprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the postprocessing_step function. If postprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from postprocessing_step.

  • purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and if target dir is not empty, the function fails with an error.

Returns

A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel’s assets subfolder.

Return type

tf.function

Raises
  • ValueError – If export_dir is not an empty directory.

  • TypeError – If input_dataset is not a tf.Dataset or NoneType.

  • TypeError – If predict_step_signature is neither a tuple, list of tf.TensorSpec objects nor a NoneType.

  • TypeError – If preprocessing_step_signature is neither a tuple, list of tf.TensorSpec objects nor a NoneType.

  • TypeError – If postprocessing_step_signature is neither a tuple, list of tf.TensorSpec objects nor a NoneType.

  • ValueError – If predict_step_signature is an empty tuple or list.

  • ValueError – If preprocessing_step_signature is an empty tuple or list.

  • ValueError – If postprocessing_step_signature is an empty tuple or list.

  • ValueError – If preprocessing_step is not provided and both predict_step_signature and input_dataset are provided.

  • ValueError – If preprocessing_step, predict_step_signature and input_dataset are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If preprocessing_step, preprocessing_step_signature and input_dataset are all provided.

  • ValueError – If preprocessing_step is provided, neither preprocessing_step_signature nor input_dataset is provided, and preprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If preprocessing_step and predict_step_signature are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.

  • ValueError – If postprocessing_step is provided and postprocessing_step_signature is not provided and postprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.

22.13. Datasets

22.13.1. Dataset benchmarking

tensorflow.python.ipu.dataset_benchmark.dataset_benchmark(dataset, number_of_epochs, elements_per_epochs, print_stats=True, apply_debug_options=True, do_memcpy=True)

Allows the user to benchmark performance of a tf.data.Dataset.

Parameters
  • dataset – An instance of tf.data.Dataset which will be benchmarked.

  • number_of_epochs – The number of epochs this dataset will be run for.

  • elements_per_epochs – The number of elements there are in each epoch.

  • print_stats – Whether to print statistics about the performance to the console.

  • apply_debug_options – Whether to apply debug options.

  • do_memcpy – Whether to perform a memcpy operation which simulates a dataset buffer being copied to a Poplar managed buffer.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

  • elements_processed - number of elements processed.

  • total_bytes_processed - total number of bytes which was processed.

  • time_elapsed - the time it took (in seconds) for the epoch to complete.

  • elements_per_second - number of elements processed per second.

  • bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed with Python's native json library (see https://docs.python.org/3/library/json.html).

Raises
  • TypeError – if dataset is not an instance of tf.data.Dataset.

  • ValueError – if number_of_epochs or elements_per_epochs is less than 1.

  • InvalidArgumentError – if dataset contains a shape with a dimension of size 0.
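
As a brief illustration of post-processing the result, the sketch below parses a hand-written JSON string with Python's json module. The layout of sample_stats (a top-level "epochs" list) and all of its values are assumptions made for this example; inspect the string returned on your system for the actual structure.

```python
import json

# Hypothetical output; the real string comes from running the benchmark op.
sample_stats = json.dumps({
    "epochs": [
        {"elements_processed": 1000, "total_bytes_processed": 4000000,
         "time_elapsed": 0.5, "elements_per_second": 2000.0,
         "bandwidth": 0.008},
        {"elements_processed": 1000, "total_bytes_processed": 4000000,
         "time_elapsed": 0.25, "elements_per_second": 4000.0,
         "bandwidth": 0.016},
    ]
})

stats = json.loads(sample_stats)
# Pick the epoch with the highest throughput.
best = max(stats["epochs"], key=lambda e: e["elements_per_second"])
# Average throughput over all recorded epochs.
mean_rate = (sum(e["elements_per_second"] for e in stats["epochs"])
             / len(stats["epochs"]))
```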

tensorflow.python.ipu.dataset_benchmark.infeed_benchmark(infeed_queue, number_of_epochs, elements_per_epochs, print_stats=True, do_memcpy=True)

Allows the user to benchmark performance of an ipu.ipu_infeed_queue.IPUInfeedQueue.

Parameters
  • infeed_queue – An instance of ipu.ipu_infeed_queue.IPUInfeedQueue which will be benchmarked.

  • number_of_epochs – The number of epochs this infeed queue will be run for.

  • elements_per_epochs – The number of elements there are in each epoch.

  • print_stats – Whether to print statistics about the performance to the console.

  • do_memcpy – Whether to perform a memcpy operation which simulates a dataset buffer being copied to a Poplar managed buffer.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

  • elements_processed - number of elements processed.

  • total_bytes_processed - total number of bytes which was processed.

  • time_elapsed - the time it took (in seconds) for the epoch to complete.

  • elements_per_second - number of elements processed per second.

  • bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed with Python's native json library (see https://docs.python.org/3/library/json.html).

Raises
  • TypeError – if infeed_queue is not an instance of ipu.ipu_infeed_queue.IPUInfeedQueue.

  • ValueError – if number_of_epochs or elements_per_epochs is less than 1.

  • InvalidArgumentError – if infeed_queue contains a shape with a dimension of size 0.

22.13.2. Dataset wrappers

class tensorflow.python.ipu.data.ops.dataset_ops.BufferDataset(input_dataset, buffer_size)

A Dataset which makes sure there is a multiple of buffer_size number of elements available.

__init__(input_dataset, buffer_size)
A Dataset which makes sure there is a multiple of buffer_size number of elements available.

Parameters
  • input_dataset – The input dataset.

  • buffer_size – The number of dataset elements which will be available.

22.14. Estimators

22.14.1. IPUEstimator

class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None, train_batch_size=None, eval_batch_size=None, predict_batch_size=None)

Estimator with IPU support.

IPUEstimator handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. It also provides a simple way to use multiple IPUs in the form of either data parallelism or model parallelism.

The data parallelism is based on graph replication. One batch from the dataset returned by the input_fn (of size batch_size) is sent to each replica, giving an effective batch size of num_replicas * batch_size. The only change needed to the model_fn is that the optimizer should be wrapped in a CrossReplicaOptimizer in order to average the gradients across the replicas.

This can also be combined with distributed multi-worker training using the IPUMultiWorkerStrategyV1, giving a total effective batch size of num_workers * num_replicas * batch_size.

The desired global batch size can be passed as train_batch_size, eval_batch_size and predict_batch_size, and the local batch size will be calculated based on the number of replicas and the number of distributed workers and passed to the input_fn and model_fn in params['batch_size']. If the input_fn returns a dataset batched with dataset.batch(params['batch_size'], drop_remainder=True), the global batch size will be as desired.
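
The relationship between the global and local batch sizes can be sketched as follows. The helper local_batch_size is hypothetical and only mirrors the arithmetic described above, including the divisibility requirement stated for train_batch_size; it is not the estimator's own implementation.

```python
def local_batch_size(global_batch_size, num_replicas, num_workers=1):
    """Split a global batch size evenly over all replicas on all workers."""
    divisor = num_replicas * num_workers
    if global_batch_size % divisor != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"num_replicas * num_workers = {divisor}")
    return global_batch_size // divisor


# A global batch of 64 over 4 replicas on each of 2 workers.
batch = local_batch_size(64, num_replicas=4, num_workers=2)
```

With these illustrative numbers, each replica would receive params['batch_size'] == 8, and 2 * 4 * 8 recovers the requested global batch size of 64.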

The model parallelism supported by this class is basic sharding. Consider using the IPUPipelineEstimator to get pipelined execution.

For efficiency, it supports compiling a graph that contains multiple iterations of the training/prediction/evaluation loop, which will be fully executed on the IPU before yielding back to the TensorFlow Python runtime on the CPU.

See https://tensorflow.org/guide/estimators for general information about estimators.

Parameters
  • model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.

  • model_dir – Directory in which to save model parameters, the graph, and so on. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If it is a PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.

  • config – A RunConfig object.

  • params – dict of hyperparameters that will be passed into model_fn. Keys are names of parameters, values are basic Python types.

  • warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm-start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm-started, and it is assumed that vocabularies and tf.Tensor names are unchanged.

  • train_batch_size – If not None, an int representing the global training batch size. This global batch size is transformed to a local batch size passed as params['batch_size'] to the input_fn and model_fn during training. Must be divisible by the number of replicas multiplied by the number of distributed workers.

  • eval_batch_size – If not None, an int representing the global evaluation batch size. Same behaviour as train_batch_size, only during evaluation.

  • predict_batch_size – If not None, an int representing the global prediction batch size. Same behaviour as train_batch_size, only during prediction.

class tensorflow.python.ipu.ipu_estimator.IPUEstimatorSpec(mode, predictions=None, loss=None, train_op=None, eval_metric_ops=None, eval_metrics=None, host_call=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None)

Ops and objects returned from a model_fn and passed to IPUEstimator.

This is very similar to EstimatorSpec, with the addition of two extra arguments: eval_metrics and host_call. If neither of those arguments are needed, an EstimatorSpec can be passed to the IPUEstimator instead.

eval_metrics is a tuple of a (function, tensors), where tensors is either a list of tf.Tensor or a dict from strings to tf.Tensor, that is passed to the function. The function runs on the CPU and returns a dict of metrics. The tensors are transferred from the IPU to the CPU host and passed to the function.

Exactly one of eval_metrics and eval_metric_ops must be provided during evaluation. The major difference between the two is that while the eval_metric_ops will execute directly on the IPU, the eval_metrics will execute on the CPU host using the provided function. Example:

def my_metrics_fn(features, labels):
  return {
      "accuracy": tf.metrics.accuracy(labels, features),
      "precision": tf.metrics.precision(labels, features),
      "recall": tf.metrics.recall(labels, features),
  }

eval_metrics = (my_metrics_fn, [features, labels])
spec = IPUEstimatorSpec(mode, loss=loss, eval_metrics=eval_metrics)

host_call is a tuple of a function and a list of tensors to pass to that function. host_call only works for training and is executed on the CPU for every training step. The tensors are transferred from the IPU to the CPU host and passed to the function.

This functionality can be used for e.g. doing all-reduce of the gradients and weight updates on the host during distributed training with the IPUMultiWorkerStrategyV1. Example:

def my_host_fn(*host_gradients):
  # This will all-reduce the gradients and update the weights on the host.
  return optimizer.apply_gradients(zip(host_gradients, variables))

train_op = tf.identity(loss)
grads_and_vars = optimizer.compute_gradients(loss, var_list=variables)
gradients = [g for (g, _) in grads_and_vars]
host_call = (my_host_fn, gradients)

spec = IPUEstimatorSpec(mode=mode,
                        loss=loss,
                        train_op=train_op,
                        host_call=host_call)

See full example: Distributed training.

The various hooks (training_hooks, evaluation_hooks, prediction_hooks) support instances of tf.estimator.SessionRunHook. To log tensor values from within the model_fn, use the IPULoggingTensorHook.

For documentation of the remaining arguments, see EstimatorSpec.

class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None, train_batch_size=None, eval_batch_size=None, predict_batch_size=None)

Estimator with IPU support.

IPUEstimator handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. It also provides a simple way to use multiple IPUs in the form of either data parallelism or model parallelism.

The data parallelism is based on graph replication. One batch from the dataset returned by the input_fn (of size batch_size) is sent to each replica, giving an effective batch size of num_replicas * batch_size. The only change needed to the model_fn is that the optimizer should be wrapped in a CrossReplicaOptimizer in order to average the gradients across the replicas.

This can also be combined with distributed multi-worker training using the IPUMultiWorkerStrategyV1, giving a total effective batch size of num_workers * num_replicas * batch_size.

The desired global batch size can be passed as train_batch_size, eval_batch_size and predict_batch_size, and the local batch size will be calculated based on the number of replicas and the number of distributed workers and passed to the input_fn and model_fn in params['batch_size']. If the input_fn returns a dataset batched with dataset.batch(params['batch_size'], drop_remainder=True), the global batch size will be as desired.
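
The relationship between the global and local batch sizes described above can be sketched in plain Python. This only illustrates the arithmetic: the helper name local_batch_size is hypothetical and not part of the IPU API, and the error that IPUEstimator itself raises for a non-divisible batch size may differ.

```python
def local_batch_size(global_batch_size, num_replicas, num_workers=1):
    """Illustrative only: derive the per-replica batch size from a global one."""
    divisor = num_replicas * num_workers
    if global_batch_size % divisor != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"num_replicas * num_workers = {divisor}")
    return global_batch_size // divisor


# Assumed setup: 2 workers with 4 replicas each, global batch size of 64.
batch_size = local_batch_size(64, num_replicas=4, num_workers=2)
params = {"batch_size": batch_size}  # what input_fn and model_fn would see

# Multiplying back out recovers the requested global batch size.
effective = 2 * 4 * params["batch_size"]
```

With these assumed numbers, each replica sees batches of 8 examples, and the effective batch size across all workers and replicas is again 64.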

The model parallelism supported by this class is basic sharding. Consider using the IPUPipelineEstimator to get pipelined execution.

For efficiency, it supports compiling a graph that contains multiple iterations of the training/prediction/evaluation loop, which will be fully executed on the IPU before yielding back to the TensorFlow Python runtime on the CPU.

See https://tensorflow.org/guide/estimators for general information about estimators.

Parameters
  • model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.

  • model_dir – Directory to save model parameters, graphs, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.

  • config – A RunConfig object.

  • params – dict of hyperparameters that will be passed into model_fn. Keys are names of parameters, values are basic Python types.

  • warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm-start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm-started, and it is assumed that vocabularies and tf.Tensor names are unchanged.

  • train_batch_size – If not None, an int representing the global training batch size. This global batch size is transformed to a local batch size passed as params['batch_size'] to the input_fn and model_fn during training. Must be divisible by the number of replicas multiplied by the number of distributed workers.

  • eval_batch_size – If not None, an int representing the global evaluation batch size. Same behaviour as train_batch_size, only during evaluation.

  • predict_batch_size – If not None, an int representing the global prediction batch size. Same behaviour as train_batch_size, only during prediction.

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters

name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.

Returns

A string which is the path of the directory that contains evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters
  • input_fn – A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

  • steps – Number of steps for which to evaluate model.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.

  • checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.

  • name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensors. Next, this method calls the Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.
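
The order of preference for choosing which mode's variables are saved can be sketched as follows. This is a plain-Python illustration, not the exporter's actual code: the function name mode_used_for_variables is hypothetical, and the strings are the values of the tf.estimator.ModeKeys constants.

```python
# String values of tf.estimator.ModeKeys.TRAIN, EVAL and PREDICT, listed in
# the documented order of preference.
MODE_PREFERENCE = ("train", "eval", "infer")


def mode_used_for_variables(requested_modes):
    """Return the requested mode that comes first in the preference order."""
    for mode in MODE_PREFERENCE:
        if mode in requested_modes:
            return mode
    raise ValueError("input_receiver_fn_map must contain at least one mode")


# Hypothetical map keys: only evaluation and prediction graphs are requested,
# so the evaluation graph is the one used for saving variables.
chosen = mode_used_for_variables({"eval", "infer"})
```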

For the variables and tf.MetaGraphDefs, this method creates a timestamped export directory below export_dir_base and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.
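
As a rough illustration of the assets_extra mapping described above, the sketch below computes where each source file would end up. It is an assumption-laden example rather than the exporter's implementation: the helper planned_asset_copies, the export directory path and the second dictionary entry are all hypothetical.

```python
import posixpath


def planned_asset_copies(export_dir, assets_extra):
    """Map each source file to its destination under <export_dir>/assets.extra."""
    return {
        source: posixpath.join(export_dir, "assets.extra", destination)
        for destination, source in assets_extra.items()
    }


assets_extra = {
    # The example from the text: copy a file without renaming it.
    "my_asset_file.txt": "/path/to/my_asset_file.txt",
    # Hypothetical entry: copy into a subdirectory under a new name.
    "vocab/tokens.txt": "/path/to/vocabulary.txt",
}

# Hypothetical timestamped export directory.
copies = planned_asset_copies("/tmp/export/1234567890", assets_extra)
```

Note that the dictionary keys are destinations and the values are sources, which is the reverse of what a "copy from, copy to" reading might suggest.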

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – Whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The path to the exported directory as a bytes object.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide on SavedModel, see Using the SavedModel format (https://tensorflow.org/guide/saved_model#savedmodels_from_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – Whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • experimental_mode – tf.estimator.ModeKeys value indicating which mode will be exported. Note that this feature is experimental.

Returns

The path to the exported directory as a bytes object.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see SavedModel from Estimators (https://www.tensorflow.org/guide/estimator#savedmodels_from_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – Whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the NodeDefs. For a detailed guide, see Stripping Default-Valued Attributes (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The path to the exported directory as a bytes object.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

get_variable_names()

Returns list of all variable names in this model.

Returns

List of names.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters

name – string or a list of strings, name of the tensor.

Returns

NumPy array – value of the tensor.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns

The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns

The model_fn with the following signature: def model_fn(features, labels, mode, config)

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)

Yields predictions for given features.

Parameters
  • input_fn – A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:

    • features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.

    • A tuple, in which case the first item is extracted as features.

  • predict_keys – list of str, names of the keys to predict. It is used if tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used, the rest of the predictions will be filtered from the dictionary. If None, returns all.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.

  • checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.

  • yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.

  • num_predictions – If not None, the generator will raise StopIteration after yielding this number of predictions. This allows draining the generator by using list(predictions). If None, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than are actually generated, instead of raising the StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using list(predictions); you have to consume the expected number of elements yourself, for example using [next(predictions) for _ in range(num_predictions)].

Yields

Evaluated values of predictions tensors.
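
The num_predictions behaviour described above can be illustrated without an IPU. In the sketch below, endless_predictions is a hypothetical stand-in for the generator returned by predict() when num_predictions is None; a real generator would fail fatally rather than keep producing values once the dataset is exhausted. The point is only how to consume a known number of elements from a generator that never raises StopIteration.

```python
import itertools


def endless_predictions():
    """Stand-in for the generator returned when num_predictions is None."""
    index = 0
    while True:
        yield {"prediction": index}
        index += 1


expected = 3  # the number of predictions the caller knows to be available
predictions = endless_predictions()

# list(predictions) would never return here, so take a fixed number instead.
first = list(itertools.islice(predictions, expected))

# Equivalent explicit form, as suggested in the parameter description.
more = [next(predictions) for _ in range(expected)]
```

Both forms request exactly the expected number of elements and never ask the generator for one more.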

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters
  • input_fn – A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.

  • steps – Number of steps for which to train the model. steps works incrementally: if you call train(steps=10) twice, training runs for 20 steps in total. If you don’t want incremental behavior, set max_steps instead. If set, max_steps must be None.

  • max_steps – Number of total steps for which to train the model. If set, steps must be None. Two calls to train(steps=100) mean 200 training iterations. On the other hand, two calls to train(max_steps=100) mean that the second call will not do any iterations, since the first call did all 100 steps.

  • saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.

Returns

self, for chaining.
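
The difference between steps and max_steps in train() can be sketched with a toy class that only tracks the global step. StepCounter is hypothetical and does no training; it is a minimal model of the documented semantics, assuming that exactly one of the two arguments is given.

```python
class StepCounter:
    """Toy stand-in for an estimator that only tracks the global step."""

    def __init__(self):
        self.global_step = 0

    def train(self, steps=None, max_steps=None):
        if (steps is None) == (max_steps is None):
            raise ValueError("Provide exactly one of steps and max_steps.")
        if steps is not None:
            # Incremental: always run `steps` more iterations.
            self.global_step += steps
        else:
            # Absolute: run only until the global step reaches `max_steps`.
            self.global_step = max(self.global_step, max_steps)
        return self  # returned for chaining, as in the documented API


incremental = StepCounter().train(steps=100).train(steps=100)
absolute = StepCounter().train(max_steps=100).train(max_steps=100)
```

After both calls, incremental.global_step is 200 while absolute.global_step is 100, because the second max_steps call has nothing left to do.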

22.14.2. IPUPipelineEstimator

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator for pipelining on IPUs.

IPUPipelineEstimator, like IPUEstimator, handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. Additionally, it adds support for pipelined execution over multiple IPUs.

The major API difference from the IPUEstimator is that the provided model_fn must return an IPUPipelineEstimatorSpec that contains the information needed for pipelined execution.

Data parallelism based on graph replication is supported. Each replica will consume gradient_accumulation_count batches from the dataset returned by the input_fn and accumulate the gradients, giving an effective batch size of num_replicas * gradient_accumulation_count * batch_size. The optimizer in the model_fn should be wrapped in a CrossReplicaOptimizer in order to average the gradients across the replicas.

This can further be combined with distributed multi-worker training using the IPUMultiWorkerStrategyV1, giving a total effective batch size of num_workers * num_replicas * gradient_accumulation_count * batch_size.
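
The effective batch size formulas above amount to a simple product. The sketch below spells it out with assumed example values; the function name pipeline_effective_batch_size is hypothetical and not part of the IPU API.

```python
def pipeline_effective_batch_size(batch_size, gradient_accumulation_count,
                                  num_replicas=1, num_workers=1):
    """Number of examples that contribute to a single weight update."""
    return (num_workers * num_replicas *
            gradient_accumulation_count * batch_size)


# Assumed values: batches of 4, accumulated over 16 batches, on 2 replicas.
single_worker = pipeline_effective_batch_size(
    4, gradient_accumulation_count=16, num_replicas=2)

# The same configuration distributed over 2 workers.
multi_worker = pipeline_effective_batch_size(
    4, gradient_accumulation_count=16, num_replicas=2, num_workers=2)
```

With these numbers, one worker gives an effective batch size of 128 and two workers give 256.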

Refer to the pipelining_ops documentation for more details about pipelining.

Note: because the model_fn is compiled to run on the IPU, you must use the warm_start_from parameter for a warm start and not the tf.train.init_from_checkpoint method.

Parameters
  • model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.

  • model_dir – Directory to save model parameters, graphs, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.

  • config – A RunConfig object.

  • params – dict of hyperparameters that will be passed into model_fn. Keys are names of parameters, values are basic Python types.

  • warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm started, and it is assumed that vocabularies and tf.Tensor names are unchanged.

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(mode, computational_stages, gradient_accumulation_count=None, eval_metrics_fn=None, optimizer_function=None, device_mapping=None, loss_accumulator_dtype=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None, reduction_method=GradientAccumulationReductionMethod.SUM, **pipeline_op_kwargs)

Ops and objects returned from a model_fn and passed to IPUPipelineEstimator.

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator for pipelining on IPUs.

IPUPipelineEstimator, like IPUEstimator, handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. Additionally, it adds support for pipelined execution over multiple IPUs.

The major API difference from the IPUEstimator is that the provided model_fn must return an IPUPipelineEstimatorSpec that contains the information needed for pipelined execution.

Data parallelism based on graph replication is supported. Each replica will consume gradient_accumulation_count batches from the dataset returned by the input_fn and accumulate the gradients, giving an effective batch size of num_replicas * gradient_accumulation_count * batch_size. The optimizer in the model_fn should be wrapped in a CrossReplicaOptimizer in order to average the gradients across the replicas.

This can further be combined with distributed multi-worker training using the IPUMultiWorkerStrategyV1, giving a total effective batch size of num_workers * num_replicas * gradient_accumulation_count * batch_size.

Refer to the pipelining_ops documentation for more details about pipelining.

Note: because the model_fn is compiled to run on the IPU, you must use the warm_start_from parameter for a warm start and not the tf.train.init_from_checkpoint method.

Parameters
  • model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.

  • model_dir – Directory to save model parameters, graphs, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.

  • config – A RunConfig object.

  • params – dict of hyperparameters that will be passed into model_fn. Keys are names of parameters, values are basic Python types.

  • warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm started, and it is assumed that vocabularies and tf.Tensor names are unchanged.

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters

name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.

Returns

A string which is the path of the directory that contains evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters
  • input_fn – A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

  • steps – Number of steps for which to evaluate model.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.

  • checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.

  • name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensors. Next, this method calls the Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.

For the variables and tf.MetaGraphDefs, this method creates a timestamped export directory below export_dir_base and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The path to the exported directory as a bytes object.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide on SavedModel, see Using the SavedModel format (https://tensorflow.org/guide/saved_model#savedmodels_from_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput objects, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • experimental_mode – tf.estimator.ModeKeys value indicating which mode will be exported. Note that this feature is experimental.

Returns

The path to the exported directory as a bytes object.

Raises
  • ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see SavedModel from Estimators (https://www.tensorflow.org/guide/estimator#savedmodels_from_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput objects, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

Parameters
  • export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.

  • serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.

  • assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.

  • as_text – whether to write the SavedModel proto in text format.

  • checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

  • strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the NodeDefs. For a detailed guide, see Stripping Default-Valued Attributes (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The path to the exported directory as a bytes object.

Raises
  • ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

get_variable_names()

Returns list of all variable names in this model.

Returns

List of names.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters

name – a string or a list of strings giving the name of the tensor.

Returns

Numpy array - value of the tensor.

Raises

ValueError – If the Estimator has not produced a checkpoint yet.

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns

The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns

The model_fn with the following signature: def model_fn(features, labels, mode, config)

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)

Yields predictions for given features.

Parameters
  • input_fn

    A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:

    • features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.

    • A tuple, in which case the first item is extracted as features.

  • predict_keys – list of str, name of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used then rest of the predictions will be filtered from the dictionary. If None, returns all.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.

  • checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.

  • yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.

  • num_predictions – If not None, the generator will raise StopIteration after yielding this number of predictions. This allows draining the generator by using list(predictions). If None, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than what is actually generated, instead of raising the StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using list(predictions), you have to consume the expected number of elements yourself, e.g. using [next(predictions) for _ in range(num_predictions)].

Yields

Evaluated values of predictions tensors.
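The manual draining pattern mentioned for num_predictions=None can be sketched with an ordinary Python generator. Here fake_predictions is a hypothetical stand-in for the generator returned by predict(); it simply never ends, whereas the text above says the real generator triggers a fatal error once the dataset is exhausted.

```python
import itertools


def fake_predictions():
    """Hypothetical stand-in for the generator returned by predict()."""
    for index in itertools.count():
        yield {"class_id": index}


def take(predictions, count):
    """Consume exactly `count` items instead of calling list(predictions)."""
    return [next(predictions) for _ in range(count)]


first_three = take(fake_predictions(), 3)
```

Calling list() on such a generator would never return, which is why the expected number of elements has to be consumed explicitly.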

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters
  • input_fn

    A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where

    • features is a tf.Tensor or a dictionary of string feature name to Tensor

    • labels is a Tensor or a dictionary of string label name to Tensor

    Both features and labels are consumed by model_fn.

  • hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.

  • steps – Number of steps for which to train the model. steps works incrementally: if you call train(steps=10) twice, training runs for 20 steps in total. If you don’t want incremental behavior, set max_steps instead. If set, max_steps must be None.

  • max_steps – Total number of steps for which to train the model. If set, steps must be None. Two calls to train(steps=100) mean 200 training iterations. On the other hand, two calls to train(max_steps=100) mean that the second call will not do any iterations, since the first call did all 100 steps.

  • saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.

Returns

self, for chaining.
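The difference between steps and max_steps described above can be modelled with a small counter. This is a sketch of the documented semantics only, not the Estimator implementation; steps_to_run and simulate are hypothetical names, and the sketch rejects the case where neither argument is given.

```python
def steps_to_run(global_step, steps=None, max_steps=None):
    """Return how many steps one train() call would run in this sketch."""
    if steps is not None and max_steps is not None:
        raise ValueError("Only one of steps and max_steps may be set.")
    if steps is not None:
        return steps  # incremental: always run this many more steps
    if max_steps is not None:
        return max(0, max_steps - global_step)  # stop at an absolute total
    raise ValueError("This sketch requires steps or max_steps.")


def simulate(calls):
    """Apply a sequence of train() keyword arguments to a global step counter."""
    global_step = 0
    for kwargs in calls:
        global_step += steps_to_run(global_step, **kwargs)
    return global_step


incremental_total = simulate([{"steps": 100}, {"steps": 100}])
capped_total = simulate([{"max_steps": 100}, {"max_steps": 100}])
```

In this model the two steps=100 calls end at global step 200, while the second max_steps=100 call contributes nothing because the total has already been reached.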

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(mode, computational_stages, gradient_accumulation_count=None, eval_metrics_fn=None, optimizer_function=None, device_mapping=None, loss_accumulator_dtype=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None, reduction_method=GradientAccumulationReductionMethod.SUM, **pipeline_op_kwargs)

Ops and objects returned from a model_fn and passed to IPUPipelineEstimator.

22.14.3. Run configs

class tensorflow.python.ipu.ipu_run_config.IPURunConfig(iterations_per_loop=1, ipu_options=None, num_replicas=1, num_shards=1, ordinal=0, prefetch_depth=None)

IPU related configuration required by IPUEstimator.

static __new__(cls, iterations_per_loop=1, ipu_options=None, num_replicas=1, num_shards=1, ordinal=0, prefetch_depth=None)

Creates an IPURunConfig instance.

Parameters
  • iterations_per_loop – The number of mini-batches consumed on the IPU device before returning to the CPU host for each Session.run. The global step counter is increased by iterations_per_loop for every Session.run. The number of weight updates can be less than the number of iterations if gradient accumulation is used.

  • ipu_options – An IPUConfig which you have populated with your desired configuration options before creating this IPURunConfig. The IPUEstimator will then configure the IPU system with this ipu_options object when it builds your model.

  • num_replicas – Number of replicated graphs (data parallelism)

  • num_shards – Number of IPU devices on which the graph is sharded (model parallelism)

  • ordinal – The IPU device ordinal to use. For instance, 0 corresponds to /device:IPU:0.

  • prefetch_depth – Integer or None. The prefetch_depth to be used by the IPUInfeedQueue that is created internally.
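The relationship between iterations_per_loop, the global step counter and weight updates can be written out as simple arithmetic. This is a sketch: both function names are hypothetical, and the weight update formula assumes one update per accumulation_count mini-batches, which the text above does not state explicitly.

```python
def global_step_after(session_runs, iterations_per_loop):
    """Global step after a number of Session.run calls."""
    return session_runs * iterations_per_loop


def weight_updates_after(session_runs, iterations_per_loop, accumulation_count=1):
    """Assumption: one weight update per `accumulation_count` mini-batches."""
    iterations = global_step_after(session_runs, iterations_per_loop)
    return iterations // accumulation_count
```

For example, 10 calls with iterations_per_loop=100 advance the global step by 1000, while under the stated assumption an accumulation count of 4 gives 250 weight updates.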

class tensorflow.python.ipu.ipu_run_config.RunConfig(ipu_run_config=None, master=None, **kwargs)

RunConfig with IPU support.

__init__(ipu_run_config=None, master=None, **kwargs)

Constructs a RunConfig with IPU support.

These are the arguments specific to the RunConfig for IPUs. All remaining keyword arguments are passed to the base class, which is documented below.

Parameters
  • ipu_run_config – IPURunConfig object for IPU-specific configuration.

  • master – a string. The address of the distributed master to use for training.

Constructs a RunConfig.

All distributed training related properties (cluster_spec, is_chief, master, num_worker_replicas, num_ps_replicas, task_id, and task_type) are set based on the TF_CONFIG environment variable, if the pertinent information is present. The TF_CONFIG environment variable is a JSON object with attributes: cluster and task.

cluster is a JSON serialized version of ClusterSpec’s Python dict from server_lib.py, mapping task types (usually one of the TaskType enums) to a list of task addresses.

task has two attributes: type and index, where type can be any of the task types in cluster. When TF_CONFIG contains said information, the following properties are set on this class:

  • cluster_spec is parsed from TF_CONFIG['cluster']. Defaults to {}. If present, must have one and only one node in the chief attribute of cluster_spec.

  • task_type is set to TF_CONFIG['task']['type']. Must set if cluster_spec is present; must be worker (the default value) if cluster_spec is not set.

  • task_id is set to TF_CONFIG['task']['index']. Must set if cluster_spec is present; must be 0 (the default value) if cluster_spec is not set.

  • master is determined by looking up task_type and task_id in the cluster_spec. Defaults to ‘’.

  • num_ps_replicas is set by counting the number of nodes listed in the ps attribute of cluster_spec. Defaults to 0.

  • num_worker_replicas is set by counting the number of nodes listed in the worker and chief attributes of cluster_spec. Defaults to 1.

  • is_chief is determined based on task_type and cluster.

There is a special node with task_type as evaluator, which is not part of the (training) cluster_spec. It handles the distributed evaluation job.

Example of non-chief node:

```
cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'worker', 'index': 1}})
config = RunConfig()
assert config.master == 'host4:2222'
assert config.task_id == 1
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'worker'
assert not config.is_chief
```

Example of chief node:

```
cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'chief', 'index': 0}})
config = RunConfig()
assert config.master == 'host0:2222'
assert config.task_id == 0
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'chief'
assert config.is_chief
```

Example of evaluator node (evaluator is not part of training cluster):

```
cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'evaluator', 'index': 0}})
config = RunConfig()
assert config.master == ''
assert config.evaluator_master == ''
assert config.task_id == 0
assert config.num_ps_replicas == 0
assert config.num_worker_replicas == 0
assert config.cluster_spec == {}
assert config.task_type == 'evaluator'
assert not config.is_chief
```
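The derivation rules listed above can be sketched in plain Python, without TensorFlow, to make the three examples easier to follow. The function derive_run_config is hypothetical and returns a plain dict; it mirrors only the documented rules, so the real RunConfig may differ in details such as validation and address formatting.

```python
import json


def derive_run_config(tf_config_json):
    """Hypothetical helper mirroring the documented TF_CONFIG rules."""
    tf_config = json.loads(tf_config_json) if tf_config_json else {}
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    task_type = task.get("type", "worker")  # documented default
    task_id = task.get("index", 0)          # documented default

    if task_type == "evaluator":
        # The evaluator is not part of the training cluster.
        return {"cluster_spec": {}, "task_type": task_type,
                "task_id": task_id, "master": "", "num_ps_replicas": 0,
                "num_worker_replicas": 0, "is_chief": False}

    addresses = cluster.get(task_type, [])
    num_workers = len(cluster.get("worker", [])) + len(cluster.get("chief", []))
    return {
        "cluster_spec": cluster,
        "task_type": task_type,
        "task_id": task_id,
        # master is looked up from task_type and task_id in the cluster.
        "master": addresses[task_id] if addresses else "",
        "num_ps_replicas": len(cluster.get("ps", [])),
        "num_worker_replicas": num_workers if cluster else 1,
        # Simplification: only the chief task type counts as chief here.
        "is_chief": task_type == "chief",
    }


cluster = {"chief": ["host0:2222"],
           "ps": ["host1:2222", "host2:2222"],
           "worker": ["host3:2222", "host4:2222", "host5:2222"]}
worker = derive_run_config(
    json.dumps({"cluster": cluster, "task": {"type": "worker", "index": 1}}))
chief = derive_run_config(
    json.dumps({"cluster": cluster, "task": {"type": "chief", "index": 0}}))
evaluator = derive_run_config(
    json.dumps({"cluster": cluster, "task": {"type": "evaluator", "index": 0}}))
```

Note that num_worker_replicas counts the chief as well as the workers, which is why the examples report 4 replicas for a cluster with three worker addresses.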

N.B.: If save_checkpoints_steps or save_checkpoints_secs is set, keep_checkpoint_max might need to be adjusted accordingly, especially in distributed training. For example, setting save_checkpoints_secs to 60 without adjusting keep_checkpoint_max (which defaults to 5) means that a checkpoint is garbage collected about 5 minutes after it is written. In distributed training, the evaluation job starts asynchronously and might fail to find or load the checkpoint because of this race condition.
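The retention window in the example above is the product of the two settings. A minimal sketch, assuming checkpoints are written at a steady interval; checkpoint_retention_secs is a hypothetical helper, not part of RunConfig:

```python
def checkpoint_retention_secs(save_checkpoints_secs, keep_checkpoint_max):
    """Approximate lifetime of a checkpoint before it is garbage collected."""
    if not keep_checkpoint_max:  # None or 0 keeps every checkpoint
        return float("inf")
    return save_checkpoints_secs * keep_checkpoint_max
```

With the values from the text, checkpoint_retention_secs(60, 5) gives 300 seconds, so an evaluation job that starts later than that after a checkpoint was written may no longer find it.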

Parameters
  • model_dir – directory where model parameters, graph, etc are saved. If PathLike object, the path will be resolved. If None, will use a default value set by the Estimator.

  • tf_random_seed – Random seed for TensorFlow initializers. Setting this value allows consistency between reruns.

  • save_summary_steps – Save summaries every this many steps.

  • save_checkpoints_steps – Save checkpoints every this many steps. Cannot be specified with save_checkpoints_secs.

  • save_checkpoints_secs – Save checkpoints every this many seconds. Cannot be specified with save_checkpoints_steps. Defaults to 600 seconds if neither save_checkpoints_steps nor save_checkpoints_secs is set in the constructor. If both save_checkpoints_steps and save_checkpoints_secs are None, then checkpoints are disabled.

  • session_config – a ConfigProto used to set session parameters, or None.

  • keep_checkpoint_max – The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept). If a saver is passed to the estimator, this argument will be ignored.

  • keep_checkpoint_every_n_hours – Number of hours between each checkpoint to be saved. The default value of 10,000 hours effectively disables the feature.

  • log_step_count_steps – The frequency, in number of global steps, that the global step and the loss will be logged during training. Also controls the frequency that the global steps / s will be logged (and written to summary) during training.

  • train_distribute – An optional instance of tf.distribute.Strategy. If specified, then Estimator will distribute the user’s model during training, according to the policy specified by that strategy. Setting experimental_distribute.train_distribute is preferred.

  • device_fn – A callable invoked for every Operation that takes the Operation and returns the device string. If None, defaults to the device function returned by tf.train.replica_device_setter with round-robin strategy.

  • protocol – An optional argument which specifies the protocol used when starting server. None means default to grpc.

  • eval_distribute – An optional instance of tf.distribute.Strategy. If specified, then Estimator will distribute the user’s model during evaluation, according to the policy specified by that strategy. Setting experimental_distribute.eval_distribute is preferred.

  • experimental_distribute – An optional tf.contrib.distribute.DistributeConfig object specifying DistributionStrategy-related configuration. The train_distribute and eval_distribute can be passed as parameters to RunConfig or set in experimental_distribute but not both.

  • experimental_max_worker_delay_secs – An optional integer specifying the maximum time a worker should wait before starting. By default, workers are started at staggered times, with each worker being delayed by up to 60 seconds. This is intended to reduce the risk of divergence, which can occur when many workers simultaneously update the weights of a randomly initialized model. Users who warm-start their models and train them for short durations (a few minutes or less) should consider reducing this default to improve training times.

  • session_creation_timeout_secs – Max time workers should wait for a session to become available (on initialization or when recovering a session) with MonitoredTrainingSession. Defaults to 7200 seconds, but users may want to set a lower value to detect problems with variable / session (re)-initialization more quickly.

  • checkpoint_save_graph_def – Whether to save the GraphDef and MetaGraphDef to checkpoint_dir. The GraphDef is saved after the session is created as graph.pbtxt. MetaGraphDefs are saved out for every checkpoint as model.ckpt-*.meta.

Raises
  • ValueError – If both save_checkpoints_steps and save_checkpoints_secs are set.
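The interaction between save_checkpoints_steps and save_checkpoints_secs (default of 600 seconds, explicit None to disable, error when both are set) can be sketched as follows. The function resolve_checkpoint_schedule and the sentinel _USE_DEFAULT are hypothetical and only model the documented rules.

```python
_USE_DEFAULT = object()  # sentinel meaning "argument not passed"


def resolve_checkpoint_schedule(save_checkpoints_steps=_USE_DEFAULT,
                                save_checkpoints_secs=_USE_DEFAULT):
    """Return (steps, secs); (None, None) means checkpoints are disabled."""
    if (save_checkpoints_steps is _USE_DEFAULT
            and save_checkpoints_secs is _USE_DEFAULT):
        return None, 600  # documented default when neither is set
    steps = (None if save_checkpoints_steps is _USE_DEFAULT
             else save_checkpoints_steps)
    secs = (None if save_checkpoints_secs is _USE_DEFAULT
            else save_checkpoints_secs)
    if steps is not None and secs is not None:
        raise ValueError(
            "save_checkpoints_steps and save_checkpoints_secs "
            "cannot both be set.")
    return steps, secs
```

The sentinel is what lets the sketch distinguish "not passed" (use the 600 second default) from an explicit None for both arguments (disable checkpoints).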

22.14.4. Session run hooks

class tensorflow.python.ipu.ipu_session_run_hooks.IPULoggingTensorHook(every_n_iter=None, every_n_secs=None, at_end=False, formatter=None, logging_mode=IPUOutfeedMode.LAST)

Prints the given tensors every N local steps, every N seconds, or at end.

This is a version of tf.estimator.LoggingTensorHook that supports logging from inside a function compiled for the IPU. The implementation uses an IPU outfeed in order to send the tensors from the compiled function to the host.

The tensors will be printed to the log, with INFO severity.

LoggingMode

alias of IPUOutfeedMode

__init__(every_n_iter=None, every_n_secs=None, at_end=False, formatter=None, logging_mode=IPUOutfeedMode.LAST)

Initializes the hook.

Parameters
  • every_n_iter – int, print the tensor values once every N steps.

  • every_n_secs – int or float, print the tensor values once every N seconds. Exactly one of every_n_iter and every_n_secs should be provided (unless at_end is True).

  • at_end – bool specifying whether to print the tensor values at the end of the run.

  • formatter – function that takes a dict with tensor names and values and returns a string. If None, uses default formatting.

  • logging_mode – IPULoggingTensorHook.LoggingMode that determines the behaviour when enqueuing multiple tensor values between dequeues (e.g. print all of them or only the last one).
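The effect of the logging mode can be pictured with a plain list standing in for the outfeed. This is a sketch only: the strings "all" and "last" stand in for the IPUOutfeedMode values, and values_to_print is a hypothetical function, not part of the hook.

```python
def values_to_print(enqueued_values, mode):
    """Return what one host-side dequeue would print in this sketch."""
    if mode == "all":
        return list(enqueued_values)
    if mode == "last":
        return list(enqueued_values[-1:])  # empty if nothing was enqueued
    raise ValueError("mode must be 'all' or 'last'")


# Three loss values enqueued on the device between two dequeues.
losses = [0.9, 0.7, 0.5]
printed_all = values_to_print(losses, "all")
printed_last = values_to_print(losses, "last")
```

With the default LAST mode only the most recent value reaches the log, which keeps output short when many iterations run on the device between dequeues.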

after_run(run_context, run_values)

Called after each call to run().

The run_values argument contains results of requested ops/tensors by before_run().

The run_context argument is the same one sent to the before_run() call. run_context.request_stop() can be called to stop the iteration.

If session.run() raises any exceptions then after_run() is not called.

Parameters
  • run_context – A SessionRunContext object.

  • run_values – A SessionRunValues object.

begin()

Called once before using the session.

When called, the default graph is the one that will be launched in the session. The hook can modify the graph by adding new operations to it. After the begin() call the graph will be finalized and the other callbacks cannot modify the graph anymore. A second call of begin() on the same graph should not change the graph.

end(session)

Called at the end of session.

The session argument can be used in case the hook wants to run final ops, such as saving a last checkpoint.

If session.run() raises an exception other than OutOfRangeError or StopIteration then end() is not called. Note the difference between end() and after_run() behavior when session.run() raises OutOfRangeError or StopIteration. In that case end() is called but after_run() is not called.

Parameters

session – A TensorFlow Session that will soon be closed.

log(tensors)

Logs the given tensors.

Parameters

tensors – either a dict from string to tf.Tensor, a list/tuple of tf.Tensor objects, or a tf.Tensor.

Returns

The logging operation. It might be necessary to add a control dependency on this operation, or include it in the training operation using tf.group(), to prevent it from being pruned from the graph.

22.15. Keras

22.15.1. IPU specific Keras extensions

class tensorflow.python.ipu.keras.extensions.FunctionalExtension(*args, **kwargs)
get_pipeline_stage_assignment()

Returns the pipeline stage assignment of all the layers in the model.

If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of FunctionalLayerPipelineStageAssignment for each invocation of each layer in the model (excluding input layers) in post order (which means that layers are returned in the order they are executed).

print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)

Prints a summary of the pipeline stage assignment of the model.

Parameters
  • line_length – Total length of printed lines (e.g. set this to adapt the display to different terminal window sizes).

  • print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).

reset_pipeline_stage_assignment()

Resets the pipeline stage assignment so that the model is no longer pipelined.

set_asynchronous_callbacks(asynchronous=False)

Sets the asynchronous callbacks options when calling fit(), evaluate() and predict().

When running fit(), evaluate() and predict(), the callbacks the model is configured with are executed after steps_per_execution steps have executed. Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application. Note that this option is ignored for fit() and evaluate() when running a pipelined model with accumulate_outfeed=True (configured via set_pipelining_options).

Parameters

asynchronous – Whether asynchronous callbacks should be enabled.

set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, experimental_normalize_gradients=None, gradient_accumulation_reduction_method='sum', **gradient_accumulation_optimizer_kwargs)

Sets the gradient accumulation options for non-pipelined models which are to be used when training a model. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (experimental_normalize_gradients). They will be removed in a future version. Instructions for updating: experimental_normalize_gradients=True has been deprecated and will be replaced in a future release with the use of mean reduction when accumulating gradients. Please update your optimizer settings.

When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps; these accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
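The two examples above follow from multiplying the three quantities together. A minimal sketch, with effective_batch_size as a hypothetical helper name:

```python
def effective_batch_size(batch_size, accumulation_steps_per_replica, num_replicas):
    """Batch size simulated by one weight update."""
    return batch_size * accumulation_steps_per_replica * num_replicas


single_replica = effective_batch_size(16, 4, 1)
four_replicas = effective_batch_size(16, 4, 4)
```

These reproduce the figures in the text: 64 for a single replica and 256 for four replicas.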

See the Gradient accumulation section in the documentation for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. This value multiplied by the number of replicas needs to divide the steps_per_execution value the model has been compiled with. This value is saved/loaded when the model is saved/loaded.

  • experimental_normalize_gradients – If set to True, the gradients for each step are first scaled by 1/(gradient_accumulation_steps_per_replica * number of replicas) before being added to the gradient accumulation buffer. Note that this option is experimental and the behavior might change in future releases. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options again.
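The divisibility requirement on gradient_accumulation_steps_per_replica can be checked up front with a few lines of arithmetic. This is a sketch of the documented constraint only; check_steps_per_execution is a hypothetical helper and the error message is not the one raised by the library.

```python
def check_steps_per_execution(steps_per_execution,
                              accumulation_steps_per_replica,
                              num_replicas):
    """Validate the constraint and return how often the product fits."""
    product = accumulation_steps_per_replica * num_replicas
    if steps_per_execution % product != 0:
        raise ValueError(
            "steps_per_execution (%d) is not divisible by "
            "gradient_accumulation_steps_per_replica * replicas (%d)"
            % (steps_per_execution, product))
    return steps_per_execution // product
```

For instance, steps_per_execution=64 with 4 accumulation steps on 4 replicas passes the check, whereas steps_per_execution=10 with 4 accumulation steps on one replica does not.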

set_infeed_queue_options(**kwargs)

Sets the options for all instances of IPUInfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.

set_outfeed_queue_options(**kwargs)

Sets the options for all instances of IPUOutfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently feed data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.

set_pipeline_stage_assignment(pipeline_stage_assignment)

Sets the pipeline stage assignment of all the invocations of all the layers in the model.

Sets the pipeline stage assignment of all the invocations of all the layers (excluding input layers) in the model. This assignment is used to create a model-parallel execution of this model when calling fit(), evaluate() and predict(). Note that this pipeline stage assignment is ignored when using the call() function on this model.

Parameters

pipeline_stage_assignment – A list of the same length as the total number of invocations of all the layers in this model (excluding input layers). All elements have to be instances of FunctionalLayerPipelineStageAssignment which are used to indicate which pipeline stage a particular layer invocation should be assigned to.

Raises

ValueError – pipeline_stage_assignment is not a valid assignment.

set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, experimental_normalize_gradients=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)

Sets the pipelining options, including gradient accumulation options, for pipelined models. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (experimental_normalize_gradients). They will be removed in a future version. Instructions for updating: experimental_normalize_gradients=True has been deprecated and will be replaced in a future release with the use of mean reduction when accumulating gradients. Please update your pipeline settings.

Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set, because pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica accumulates the gradients for gradient_accumulation_steps_per_replica steps. The accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
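The batch-size arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: effective_batch_size is a hypothetical helper and is not part of the IPU API.

```python
def effective_batch_size(batch_size,
                         gradient_accumulation_steps_per_replica,
                         num_replicas):
    """Number of samples that contribute to a single weight update."""
    return batch_size * gradient_accumulation_steps_per_replica * num_replicas


# Single replica: 16 * 4 * 1
print(effective_batch_size(16, 4, 1))
# Four replicas: 16 * 4 * 4
print(effective_batch_size(16, 4, 4))
```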

When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.

See the Gradient accumulation section in the documentation for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. This value multiplied by the number of replicas needs to divide the steps_per_execution value the model has been compiled with. This value is saved/loaded when the model is saved/loaded.

  • device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.

  • accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.

  • experimental_normalize_gradients – If set to True, the gradients for each step are first scaled by 1/(gradient_accumulation_steps_per_replica * number of replicas) before being added to the gradient accumulation buffer. Note that this option is experimental and the behavior might change in future releases. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options again.
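The divisibility requirement on gradient_accumulation_steps_per_replica described above can be sketched as a small check to run before compiling a model. The helper name and its explicit num_replicas argument are assumptions made for illustration; they are not part of the IPU API.

```python
def steps_per_execution_is_valid(steps_per_execution,
                                 gradient_accumulation_steps_per_replica,
                                 num_replicas):
    """True if (accumulation steps * replicas) divides steps_per_execution."""
    divisor = gradient_accumulation_steps_per_replica * num_replicas
    return steps_per_execution % divisor == 0


# 4 accumulation steps on 2 replicas: steps_per_execution must be a multiple of 8.
print(steps_per_execution_is_valid(16, 4, 2))
print(steps_per_execution_is_valid(12, 4, 2))
```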

class tensorflow.python.ipu.keras.extensions.FunctionalLayerPipelineStageAssignment(layer, node_index, pipeline_stage=None)

A class to indicate at which pipeline stage a layer in a Functional model should be executed.

Keras Layers can be called multiple times in order to share weights between layers. Each of these calls produces Tensor output which can be executed in different pipeline stages (as long as these stages are mapped to the same device).

property inbound_layers

Returns the input layers for the layer in this assignment. This can be useful for identifying which specific node_index this is.

property layer

Returns the Keras layer associated with this assignment.

property node_index

Returns the specific call to the layer that produced a Tensor.

property pipeline_stage

Returns the pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.
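The constraint that calls to a shared layer must land on stages mapped to the same device can be sketched in plain Python. In this sketch each assignment is modelled as a (layer name, node_index, pipeline_stage) tuple and device_mapping as a list indexed by pipeline stage. The helper find_device_conflicts and the tuple representation are hypothetical and are not part of the IPU API.

```python
from collections import defaultdict


def find_device_conflicts(assignments, device_mapping):
    """Return layers whose calls are placed on more than one device.

    assignments: list of (layer_name, node_index, pipeline_stage) tuples.
    device_mapping: device_mapping[i] is the IPU for pipeline stage i.
    """
    devices_per_layer = defaultdict(set)
    for layer_name, _node_index, stage in assignments:
        devices_per_layer[layer_name].add(device_mapping[stage])
    return {name: sorted(devices)
            for name, devices in devices_per_layer.items()
            if len(devices) > 1}


# The "embedding" layer is called twice, in stages 0 and 2.
assignments = [("embedding", 0, 0), ("dense", 0, 1), ("embedding", 1, 2)]

print(find_device_conflicts(assignments, device_mapping=[0, 1, 0]))
print(find_device_conflicts(assignments, device_mapping=[0, 1, 2]))
```

With the first mapping both calls to the shared layer sit on IPU 0, so no conflict is reported. With the second mapping the two calls are split across IPUs 0 and 2.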

class tensorflow.python.ipu.keras.extensions.ModelExtension(*args, **kwargs)
get_pipeline_stage_assignment()

Returns the pipeline stage assignment of the layers in the model.

If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of ModelLayerPipelineStageAssignment for each layer in the model in post order (which means that layers are returned in the order they are executed).

print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)

Prints a summary of the pipeline stage assignment of the model.

Parameters
  • line_length – Total length of printed lines (e.g. set this to adapt the display to different terminal window sizes).

  • print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).
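Capturing the summary with print_fn follows the usual Keras pattern of passing a function that collects each line. The sketch below uses a hypothetical stand-in, summary_stub, in place of a real model method, and its output lines are made up, so only the capture pattern itself is meaningful.

```python
def summary_stub(print_fn=None):
    """Hypothetical stand-in for print_pipeline_stage_assignment_summary()."""
    print_fn = print if print_fn is None else print_fn
    # Made-up lines: the real summary format is not reproduced here.
    for line in ("layer      pipeline stage", "dense      0", "dense_1    1"):
        print_fn(line)


captured_lines = []
summary_stub(print_fn=captured_lines.append)  # called once per line
summary_text = "\n".join(captured_lines)
print(summary_text)
```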

reset_pipeline_stage_assignment()

Resets the pipeline stage assignment so that the model is no longer pipelined.

set_asynchronous_callbacks(asynchronous=False)

Sets the asynchronous callbacks options when calling fit(), evaluate() and predict().

When running fit(), evaluate() and predict(), the callbacks the model is configured with are executed after steps_per_execution steps have executed.

Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application.

Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options).

Parameters

asynchronous – Whether asynchronous callbacks should be enabled.
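The latency difference between the two modes can be illustrated with a toy model that counts how many steps must complete before the results of a given step reach the callbacks. This is a sketch of the behaviour described above, not of the implementation, and results_visible_after is a hypothetical helper.

```python
def results_visible_after(step, steps_per_execution, asynchronous):
    """Completed steps needed before results of `step` (0-based) reach callbacks."""
    if asynchronous:
        # Callbacks are invoked after every step.
        return step + 1
    # Callbacks run only once the whole execution containing `step` has finished.
    executions_needed = step // steps_per_execution + 1
    return executions_needed * steps_per_execution


for step in range(6):
    print(step,
          results_visible_after(step, 4, asynchronous=False),
          results_visible_after(step, 4, asynchronous=True))
```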

set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, experimental_normalize_gradients=None, gradient_accumulation_reduction_method='sum', **gradient_accumulation_optimizer_kwargs)

Sets the gradient accumulation options for non-pipelined models which are to be used when training a model. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (experimental_normalize_gradients). They will be removed in a future version. Instructions for updating: experimental_normalize_gradients=True has been deprecated and will be replaced in a future release with the use of mean reduction when accumulating gradients. Please update your optimizer settings.

When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica accumulates the gradients for gradient_accumulation_steps_per_replica steps. The accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.

See the Gradient accumulation section in the documentation for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. This value multiplied by the number of replicas needs to divide the steps_per_execution value the model has been compiled with. This value is saved/loaded when the model is saved/loaded.

  • experimental_normalize_gradients – If set to True, the gradients for each step are first scaled by 1/(gradient_accumulation_steps_per_replica * number of replicas) before being added to the gradient accumulation buffer. Note that this option is experimental and the behavior might change in future releases. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options again.
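The overflow trade-off between sum and mean reduction can be illustrated without any IPU hardware. The sketch below emulates a float16 accumulation buffer using the half-precision format of Python's struct module. It assumes that mean reduction scales each gradient by 1/N before it is added to the buffer; that detail, and the helpers to_float16 and accumulate, are assumptions made for illustration and not a description of the IPU implementation.

```python
import struct


def to_float16(x):
    """Round x to the nearest IEEE half-precision value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def accumulate(gradients, method):
    """Accumulate gradients in an emulated float16 buffer using 'sum' or 'mean'."""
    scale = 1.0 / len(gradients) if method == "mean" else 1.0
    buffer = 0.0
    for g in gradients:
        buffer = to_float16(buffer + to_float16(g * scale))
    return buffer


# Four large per-step gradients for one variable (float16 max is 65504).
grads = [20000.0] * 4
print(accumulate(grads, "sum"))   # exceeds the float16 range
print(accumulate(grads, "mean"))  # stays in range, but is 4x smaller than the sum
```

The mean result is smaller than the sum by a factor of the number of accumulated steps, which is why the learning rate may need adjusting when switching reduction method.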

set_infeed_queue_options(**kwargs)

Sets the options for all instances of IPUInfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.

set_outfeed_queue_options(**kwargs)

Sets the options for all instances of IPUOutfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently feed data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.

set_pipeline_stage_assignment(pipeline_stage_assignment)

Sets the pipeline stage assignment for all the layers in the model.

Sets the pipeline stage assignment of all the layers in the model which is used to create a model-parallel execution of this Model when calling fit(), evaluate() and predict(). Note that this pipelining stage assignment is ignored when using the call() function on this model.

Parameters

pipeline_stage_assignment – A list of the same length as the number of layers in this model. All elements can be either integers or instances of ModelLayerPipelineStageAssignment. If all the elements are integers, then a layer in this model at index i is assigned to a pipeline stage pipeline_stage_assignment[i]. Otherwise, if all the elements are of type ModelLayerPipelineStageAssignment then a layer in this model at index i is assigned to a pipeline stage indicated by pipeline_stage_assignment[i].pipeline_stage.

Raises

ValueError – if pipeline_stage_assignment is not a valid assignment.

set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, experimental_normalize_gradients=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)

Sets the pipelining options, including gradient accumulation options, for pipelined models. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (experimental_normalize_gradients). They will be removed in a future version. Instructions for updating: experimental_normalize_gradients=True has been deprecated and will be replaced in a future release with the use of mean reduction when accumulating gradients. Please update your pipeline settings.

Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set, because pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica accumulates the gradients for gradient_accumulation_steps_per_replica steps. The accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.

When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.

See the Gradient accumulation section in the documentation for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. This value multiplied by the number of replicas needs to divide the steps_per_execution value the model has been compiled with. This value is saved/loaded when the model is saved/loaded.

  • device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.

  • accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.

  • experimental_normalize_gradients – If set to True, the gradients for each step are first scaled by 1/(gradient_accumulation_steps_per_replica * number of replicas) before being added to the gradient accumulation buffer. Note that this option is experimental and the behavior might change in future releases. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options again.

class tensorflow.python.ipu.keras.extensions.ModelLayerPipelineStageAssignment(layer, node_index, pipeline_stage=None)

A class to indicate at which pipeline stage a layer in a Model subclass should be executed.

Keras Layers can be called multiple times in order to share weights between layers. Each of these calls produces Tensor output which can be executed in different pipeline stages (as long as these stages are mapped to the same device).

property inbound_layers

Returns the input layers for the layer in this assignment. This can be useful for identifying which specific node_index this is.

property layer

Returns the Keras layer associated with this assignment.

property node_index

Returns the specific call to the layer that produced a Tensor.

property pipeline_stage

Returns the pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.

class tensorflow.python.ipu.keras.extensions.PipelineStage(stage)

A scope within which Keras Layers and/or calls to Keras layers can be assigned to pipeline stages.

Pipeline stages can be assigned to all calls of a layer by constructing the layer within a PipelineStage scope as follows:

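from tensorflow.keras.layers import Dense, Input
from tensorflow.python import ipu
from tensorflow.python.ipu.keras.extensions import PipelineStage

strategy = ipu.ipu_strategy.IPUStrategy()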
input_layer = Input(2)
with strategy.scope():
  with PipelineStage(0):
    x = Dense(4)(input_layer)

  with PipelineStage(1):
    x = Dense(4)(x)

Pipeline stages can also be assigned to individual layer calls, as follows:

strategy = ipu.ipu_strategy.IPUStrategy()
# The input width matches Dense(4) so the shared layer can also be called on its own output.
input_layer = Input(4)
l = Dense(4)
with strategy.scope():
  with PipelineStage(0):
    x = l(input_layer)

  with PipelineStage(1):
    x = l(x)

Pipeline stages assigned to layer calls take precedence over those assigned when constructing the layer.
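The precedence rule can be written down as a one-line resolution function. This is a plain-Python sketch of the rule only; resolve_pipeline_stage is a hypothetical helper and does not reflect how the extension stores the stages internally.

```python
def resolve_pipeline_stage(construction_stage, call_stage):
    """Stage used for a layer call: the call-time stage wins when it is set."""
    return call_stage if call_stage is not None else construction_stage


# Layer constructed in PipelineStage(0) and called in PipelineStage(1).
print(resolve_pipeline_stage(construction_stage=0, call_stage=1))
# Layer constructed in PipelineStage(0) and called outside any PipelineStage scope.
print(resolve_pipeline_stage(construction_stage=0, call_stage=None))
```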

class tensorflow.python.ipu.keras.extensions.SequentialExtension(*args, **kwargs)
get_pipeline_stage_assignment()

Returns the pipeline stage assignment of the layers in the model.

If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of SequentialLayerPipelineStageAssignment for each layer in the model in post order (which means that layers are returned in the order they are executed).

print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)

Prints a summary of the pipeline stage assignment of the model.

Parameters
  • line_length – Total length of printed lines (e.g. set this to adapt the display to different terminal window sizes).

  • print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).

reset_pipeline_stage_assignment()

Resets the pipeline stage assignment so that the model is no longer pipelined.

set_asynchronous_callbacks(asynchronous=False)

Sets the asynchronous callbacks options when calling fit(), evaluate() and predict().

When running fit(), evaluate() and predict(), the callback functions are called after executing the number of steps specified by steps_per_execution, where each step processes one batch.

Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application.

Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options).

Parameters

asynchronous – Whether asynchronous callbacks should be enabled.

set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, experimental_normalize_gradients=None, gradient_accumulation_reduction_method='sum', **gradient_accumulation_optimizer_kwargs)

Sets the gradient accumulation options for non-pipelined models which are to be used when training a model. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (experimental_normalize_gradients). They will be removed in a future version. Instructions for updating: experimental_normalize_gradients=True has been deprecated and will be replaced in a future release with the use of mean reduction when accumulating gradients. Please update your optimizer settings.

When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica accumulates the gradients for gradient_accumulation_steps_per_replica steps. The accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.

See the Gradient accumulation section in the documentation for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. This value multiplied by the number of replicas needs to divide the steps_per_execution value the model has been compiled with. This value is saved/loaded when the model is saved/loaded.

  • experimental_normalize_gradients – If set to True, the gradients for each step are first scaled by 1/(gradient_accumulation_steps_per_replica * number of replicas) before being added to the gradient accumulation buffer. Note that this option is experimental and the behavior might change in future releases. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options again.

set_infeed_queue_options(**kwargs)

Sets the options for all instances of IPUInfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.

set_outfeed_queue_options(**kwargs)

Sets the options for all instances of IPUOutfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently feed data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.

set_pipeline_stage_assignment(pipeline_stage_assignment)

Sets the pipeline stage assignment of all the layers in the model.

Sets the pipeline stage assignment of all the layers in the model which is used to create a model-parallel execution of this Sequential model when calling fit(), evaluate() and predict(). Note that this pipelining stage assignment is ignored when using the call() function on this model.

Parameters

pipeline_stage_assignment – A list of the same length as the number of layers in this model. All elements can be either integers or instances of SequentialLayerPipelineStageAssignment. If all the elements are integers, then a layer in this model at index i is assigned to a pipeline stage pipeline_stage_assignment[i]. Otherwise, if all the elements are of type SequentialLayerPipelineStageAssignment then a layer in this model at index i is assigned to a pipeline stage indicated by pipeline_stage_assignment[i].pipeline_stage.

Raises

ValueError – if pipeline_stage_assignment is not a valid assignment.
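For the integer form of the assignment, the mapping from layer index to pipeline stage can be sketched as follows. The helper layers_per_stage is hypothetical and only checks the length requirement stated above; the real method may apply further validity checks that are not reproduced here.

```python
def layers_per_stage(layer_names, pipeline_stage_assignment):
    """Group layers by stage, where entry i is the stage of the layer at index i."""
    if len(pipeline_stage_assignment) != len(layer_names):
        raise ValueError(
            "pipeline_stage_assignment must have one entry per layer: "
            f"got {len(pipeline_stage_assignment)} for {len(layer_names)} layers.")
    stages = {}
    for name, stage in zip(layer_names, pipeline_stage_assignment):
        stages.setdefault(stage, []).append(name)
    return stages


# A four-layer Sequential model split evenly over two pipeline stages.
layer_names = ["dense", "dense_1", "dense_2", "dense_3"]
print(layers_per_stage(layer_names, [0, 0, 1, 1]))
```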

set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, experimental_normalize_gradients=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)

Sets the pipelining options, including gradient accumulation options, for pipelined models. (deprecated arguments)

Warning: SOME ARGUMENTS ARE DEPRECATED: (experimental_normalize_gradients). They will be removed in a future version. Instructions for updating: experimental_normalize_gradients=True has been deprecated and will be replaced in a future release with the use of mean reduction when accumulating gradients. Please update your pipeline settings.

Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set, because pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica accumulates the gradients for gradient_accumulation_steps_per_replica steps. The accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16, we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.

When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.

See the Gradient accumulation section in the documentation for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. This value multiplied by the number of replicas needs to divide the steps_per_execution value the model has been compiled with. This value is saved/loaded when the model is saved/loaded.

  • device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.

  • accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.

  • experimental_normalize_gradients – If set to True, the gradients for each step are first scaled by 1/(gradient_accumulation_steps_per_replica * number of replicas) before being added to the gradient accumulation buffer. Note that this option is experimental and the behavior might change in future releases. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options again.
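The effect of accumulate_outfeed on what callbacks receive during training can be sketched with a toy calculation. Here the per-step metric values are summed, normalised by gradient_accumulation_steps_per_replica, and the same value is reported for every accumulated batch. The helper accumulated_metric is hypothetical and is a simplified model of the behaviour described above, not the implementation.

```python
def accumulated_metric(per_step_values, normalise_by):
    """Value seen by callbacks for each batch when the outfeed is accumulated."""
    value = sum(per_step_values) / normalise_by
    # Every batch in the accumulation window reports the same normalised value.
    return [value] * len(per_step_values)


# Four steps of loss with gradient_accumulation_steps_per_replica=4.
losses = [1.0, 0.5, 0.25, 0.25]
print(accumulated_metric(losses, normalise_by=4))
```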

class tensorflow.python.ipu.keras.extensions.SequentialLayerPipelineStageAssignment(layer, pipeline_stage=None)

A class used to indicate which pipeline stage a layer in a Sequential model should be executed in.

property layer

Returns the Keras layer associated with this assignment.

property pipeline_stage

Returns the pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.

22.16. Keras layers

22.16.1. Keras layer specializations for the Graphcore IPU

class tensorflow.python.ipu.keras.layers.AssumeEqualAcrossReplicas(*args, **kwargs)

Layer for marking values as equal across replicas to try and prevent divergent control flow compilation errors.

Divergent control flow describes the situation where program flow differs among replicas. This happens when the value of a conditional is not the same across all replicas. This is a problem if the conditional body requires a cross-replica sync, as only some replicas will reach it. If this happens, the execution will hang as the operation waits for all replicas to sync.

To warn the user about this, Poplar checks for divergent control flow during compilation. However, since the values of tensors are unknown at compilation time, it cannot be certain whether a tensor will lead to divergent control flow. assume_equal_across_replicas can be used to mark tensors that are equal across all replicas, which prevents them from causing divergence errors when used in a conditional.
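
The hang condition described above can be pictured with a small simulation. This is a conceptual sketch in plain Python, not IPU code: the function names are hypothetical, and each list entry stands for the value of the conditional on one replica.

```python
def replicas_reaching_sync(condition_values):
    """Return the indices of replicas whose conditional is true.

    Only replicas with a true value enter the conditional body and
    reach the cross-replica sync inside it.
    """
    return [i for i, value in enumerate(condition_values) if value]


def would_hang(condition_values):
    """A sync hangs when some, but not all, replicas reach it."""
    reached = len(replicas_reaching_sync(condition_values))
    return 0 < reached < len(condition_values)


uniform = [True, True, True, True]      # same value on every replica
divergent = [True, False, True, True]   # replica 1 takes another path
```

Here would_hang(uniform) is False and would_hang(divergent) is True, because replica 1 never arrives at the sync that the others wait on.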

Parameters

inplace – A bool controlling whether the given tensor(s) are copied or operated on in place. This is needed when using AssumeEqualAcrossReplicas with tensor slices.

call(inputs, **kwargs)

This is where the layer’s logic lives.

Note that the call() method in tf.keras differs slightly from the one in the Keras API. In the Keras API, masking support for layers can be passed as additional arguments, whereas tf.keras has a compute_mask() method to support masking.

Parameters
  • inputs

    Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules:

    • inputs must be explicitly passed. A layer cannot have zero arguments, and inputs cannot be provided via the default value of a keyword argument.

    • NumPy array or Python scalar values in inputs get cast as tensors.

    • Keras mask metadata is only collected from inputs.

    • Layers are built (build(input_shape) method) using shape info from inputs only.

    • input_spec compatibility is only checked against inputs.

    • Mixed precision input casting is only applied to inputs. If a layer has tensor arguments in *args or **kwargs, their casting behavior in mixed precision should be handled manually.

    • The SavedModel input specification is generated using inputs only.

    • Integration with various ecosystem packages such as TFMOT, TFLite and TF.js is only supported for inputs and not for tensors in positional and keyword arguments.

  • *args – Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above.

  • **kwargs

    Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved:

    • training: Boolean scalar tensor or Python boolean indicating whether the call is meant for training or inference.

    • mask: Boolean input mask. If the layer’s call() method takes a mask argument, its default value will be set to the mask generated for inputs by the previous layer (if input did come from a layer that generated a corresponding mask, i.e. if it came from a Keras layer with masking support).

Returns

A tensor or list/tuple of tensors.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.
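
Because get_config() may return the same dictionary on every call, editing the result directly can change the layer's own state. The sketch below shows the copy-first pattern using a hypothetical ConfigHolder class in place of a real layer.

```python
import copy


class ConfigHolder:
    """Hypothetical stand-in for a layer that caches its config."""

    def __init__(self, rate):
        self._config = {"name": "holder", "rate": rate}

    def get_config(self):
        # Returns the cached dict itself, not a fresh copy.
        return self._config


layer = ConfigHolder(rate=0.5)

# Copy before editing so the layer's own config is left untouched.
edited = copy.deepcopy(layer.get_config())
edited["rate"] = 0.1
```

After this, layer.get_config() still reports the original rate, while edited carries the change.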

class tensorflow.python.ipu.keras.layers.CTCInferenceLayer(*args, **kwargs)

Computes CTC (Connectionist Temporal Classification) predictions using a beam search. This implementation is designed and optimized for the IPU and cannot be used with other systems.

Parameters
  • blank_index – The class index to use for the blank label.

  • beam_width – The beam width to use in the beam search.

  • top_paths – The number of paths to return.

  • from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.
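
The from_logits option distinguishes unnormalized scores from log probabilities. As a minimal sketch of that relationship, assuming the usual log-softmax conversion (this document does not state how the layer converts logits internally), one row of logits can be normalized as follows. The log_softmax helper is written here for illustration only.

```python
import math


def log_softmax(logits):
    """Convert one row of logits to log probabilities."""
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


logits = [2.0, 1.0, 0.0]            # one timestep, three classes
log_probs = log_softmax(logits)
total_probability = sum(math.exp(lp) for lp in log_probs)
```

The resulting values are all non-positive and their exponentials sum to one, which is what a log-probability input is expected to satisfy.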

call(data, data_length, **kwargs)
Parameters
  • data – The data input [max_time, batch_size, num_classes] tensor.

  • data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.

Returns

  • Label probabilities: Negative log probabilities that each path is correct.

  • Label lengths: Length of each path of predictions.

  • Decoded labels: The predictions made by the beam search.

Return type

A tuple of values

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

class tensorflow.python.ipu.keras.layers.CTCPredictionsLayer(*args, **kwargs)

Computes CTC (Connectionist Temporal Classification) most probable predictions.

Returns the most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It also fills the values off the end with the blank index.

This layer performs many post-processing steps to create the predictions. If your model is close to its memory limit, it may be worth using CTCInferenceLayer instead, streaming its results off the device and performing the processing on the CPU. However, this will create a larger stream copy that may also cost memory.

Parameters
  • blank_index – The class index to use for the blank label.

  • beam_width – The beam width to use in the beam search.

  • top_paths – The number of paths to return.

  • from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.

call(data, data_length, **kwargs)
Parameters
  • data – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of log probabilities.

  • data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry. If not provided, the layer can only perform inference.

Returns

The most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It fills the values off the end with the blank index.
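
Filling "the values off the end" can be pictured as padding each predicted label sequence to a common length with the blank index. The helper below is a hypothetical plain-Python sketch of that step, not the layer's implementation.

```python
def pad_with_blank(predictions, max_length, blank_index):
    """Pad each predicted label sequence to `max_length`."""
    padded = []
    for labels in predictions:
        fill = [blank_index] * (max_length - len(labels))
        padded.append(list(labels) + fill)
    return padded


# Two batch entries with predictions of different lengths.
decoded = [[3, 1, 4], [2]]
result = pad_with_blank(decoded, max_length=4, blank_index=0)
```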

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

class tensorflow.python.ipu.keras.layers.Dropout(*args, **kwargs)

Dropout layer optimized for running on the IPU.

The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the expected sum is unchanged.

Note that the Dropout layer only applies when training is set to True, so no values are dropped during inference.
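
The scaling rule above can be checked with a short plain-Python sketch. It is not the IPU implementation: the dropout function and its rng argument are hypothetical, and the seed handling differs from the layer's seed parameter.

```python
import random


def dropout(inputs, rate, training, rng):
    """Inverted dropout on a flat list of values."""
    if not training:
        # Inference: the input passes through unchanged.
        return list(inputs)
    scale = 1.0 / (1.0 - rate)
    # Drop each unit with probability `rate`; scale the survivors.
    return [0.0 if rng.random() < rate else x * scale for x in inputs]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
values = [1.0] * 10000
trained = dropout(values, rate=0.25, training=True, rng=rng)
inferred = dropout(values, rate=0.25, training=False, rng=rng)
```

With many inputs, the mean of trained stays close to the mean of values, because the surviving units are scaled by 1/(1 - rate).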

Parameters
  • rate – Float between 0 and 1. Fraction of the input units to drop.

  • noise_shape – 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input.

  • seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple) containing a pair of 32-bit integers that will be used to seed the random number generator that generates the dropout mask.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)

Perform dropout.

Parameters
  • inputs – Input tensor (of any rank).

  • training – Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (doing nothing).

Returns

In training mode, a tensor which has some nodes set to zero, as randomly selected based on other parameters. In inference mode, a tensor that is identical to the input tensor.

compute_output_shape(input_shape)

Computes the output shape of the layer.

If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.

Parameters

input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.

Returns

An output shape tuple.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

class tensorflow.python.ipu.keras.layers.EffectiveTransformer(*args, **kwargs)

EffectiveTransformer is an implementation of a multihead attention network.

Transformers of this type are described in the following paper: https://arxiv.org/abs/1706.03762

This implementation is optimised for batches of padded sequences, by dynamically compressing the input sequences for computationally expensive parts of the algorithm. This compression is achieved by the removal of padding for those computations that do not rely on a 1:1 relationship between the input to and from sequences.

For an input sequence tensor X of shape [B, N], the algorithm will process X in compressed chunks of shape [B', N], where B' is less than or equal to max_batch_size. The algorithm output, however, keeps the input batch size B. Though the maximum batch size of compressed sequences to be processed in each chunk is of shape [B', N], the parameter sequences_per_iter determines the upper limit on the total number of compressed sequences to be processed for each B' sized batch.

The distinction between max_batch_size and sequences_per_iter is of importance when a corpus of data has much variance in the length of its sequences (the degree of padding in each row). max_batch_size determines the upper bound on the number of rows of data to be processed in each chunk and sequences_per_iter determines the upper bound on the number of sequences to be compressed into each chunk. This distinction is important to consider because a chunk of compressed sequences will need to be decompressed at points in the algorithm. This can incur large memory usage if the number of compressed sequences to process is high and the uncompressed shape unbounded.

sequences_per_iter must be less than or equal to max_batch_size.
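
To make the idea of compressing padded sequences concrete, the sketch below removes padding from a small [B, N] batch and then restores the padded layout. It only illustrates the general technique. The helper names are hypothetical and the layer's actual chunking scheme is not reproduced here.

```python
def compress(padded, lengths):
    """Drop padding: keep the first `lengths[i]` tokens of each row."""
    return [tok for row, n in zip(padded, lengths) for tok in row[:n]]


def decompress(flat, lengths, width, pad=0):
    """Rebuild the padded [B, N] layout from the flat token list."""
    rows, start = [], 0
    for n in lengths:
        rows.append(flat[start:start + n] + [pad] * (width - n))
        start += n
    return rows


# B = 3 sequences padded to N = 4 tokens.
batch = [[7, 8, 0, 0], [5, 0, 0, 0], [1, 2, 3, 0]]
lengths = [2, 1, 3]

flat = compress(batch, lengths)
# Six real tokens fit into ceil(6 / 4) = 2 full-width rows, not 3.
full_rows_needed = -(-len(flat) // 4)
restored = decompress(flat, lengths, width=4)
```

The decompression step is where memory use can grow, since the restored tensor has the full padded shape again.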

Parameters
  • output_layer_size – The number of output units.

  • max_batch_size – The upper limit to which additional sequences will be compressed into a chunk of data. This is the maximum size of the uncompressed sequence tensor.

  • use_scale – If True, learn a scale parameter.

  • num_attention_heads – The number of attention heads to use for multihead attention.

  • attention_head_size – The size of each attention head.

  • sequences_per_iter – The number of full-sequence equivalents to process in each data chunk. Must be less than or equal to max_batch_size.

  • qkv_activation – The activation function to use for the Query, Key and Value embeddings.

  • attention_dropout_prob – Dropout probability applied to the attention distribution.

  • output_activation – The activation function to use for the layer output.

  • output_dropout_prob – Dropout probability applied to the layer output.

  • layer_norm_output – Whether to apply Layer Normalisation to the output.

  • embedding_initializer – The initializer to be used for the QKV embeddings. Default is ‘glorot_uniform’.

  • embedding_bias_initializer – The initializer to be used for QKV embeddings additive bias. Defaults to ‘zeros’.

  • output_initializer – The initializer for the output layer. Defaults to ‘glorot_uniform’.

  • output_bias_initializer – The initializer for the output layer additive bias. Defaults to ‘zeros’.

build(input_shapes)

Builds an EffectiveTransformer Layer with respect to the provided input_shapes.

Parameters
  • input_shapes – A list of Tensor shapes of length four or five. In the case of four elements provided in input_shapes, the Tensor shapes should correspond to the from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths Tensor arguments to the call method. In the case of five Tensor shapes provided in input_shapes, the fifth element should correspond to the optional q_mask input to the call method.

call(inputs, training=True)

Performs a single forward pass of an EffectiveTransformer layer instance.

As input, two sequence sets and their respective sequence lengths are required. The two sets of sequences are referred to as the ‘from’ sequences and ‘to’ sequences, referring to the computed attention relationship. In the case that the ‘from’ and ‘to’ sequence sets are equal, this layer will compute self-attention.

Parameters
  • inputs – A list of input Tensors, of at least four elements, containing from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths. Additionally, a fifth tensor q_mask for attention head masking can be provided.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

class tensorflow.python.ipu.keras.layers.Embedding(*args, **kwargs)

This is designed to be a replacement for the typical use cases of the Keras Embedding layer.

Parameters
  • input_dim – int > 0. Size of the vocabulary, i.e. maximum integer index + 1.

  • output_dim – int >= 0. Dimension of the dense embedding.

  • embeddings_initializer – Initializer for the embeddings matrix.

  • serialization_factor – If greater than 1, the embedding lookup will be broken up into serialization_factor smaller lookups, serialized along the 0th dimension. This option should not be used unless the parameters of this layer are used by another layer. If this is the case, then serialization can reduce the maximum memory at the cost of extra computation.

Input shape:

2D tensor with shape: (batch_size, input_length).

Output shape:

3D tensor with shape: (batch_size, input_length, output_dim).
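
The shapes above, and the effect of serialization_factor, can be sketched in plain Python. The serialized version below resolves the lookup one slice of the table at a time along the 0th dimension. The function names are hypothetical, and the requirement that the factor divides input_dim is an assumption of this sketch rather than a documented rule.

```python
def embedding_lookup(table, indices):
    """Map a (batch_size, input_length) index grid to vectors."""
    return [[table[i] for i in row] for row in indices]


def serialized_lookup(table, indices, serialization_factor):
    """Same result, computed one slice of the table at a time."""
    input_dim = len(table)
    assert input_dim % serialization_factor == 0
    slice_size = input_dim // serialization_factor
    output = [[None] * len(row) for row in indices]
    for s in range(serialization_factor):
        lo, hi = s * slice_size, (s + 1) * slice_size
        # Only indices that fall inside this slice are resolved now.
        for b, row in enumerate(indices):
            for t, i in enumerate(row):
                if lo <= i < hi:
                    output[b][t] = table[lo:hi][i - lo]
    return output


# input_dim = 4, output_dim = 2
table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
indices = [[0, 3, 1], [2, 2, 0]]  # (batch_size=2, input_length=3)

full = embedding_lookup(table, indices)
split = serialized_lookup(table, indices, serialization_factor=2)
```

Both results have shape (2, 3, 2), matching (batch_size, input_length, output_dim). The serialized version touches a smaller slice per step at the cost of extra passes.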

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)

Perform an embedding lookup.

Parameters

inputs – An integer tensor of indices into the embedding variable.

Returns

The entries of the embedding tensor corresponding to the indices in inputs.

compute_output_shape(input_shape)

Computes the output shape of the layer.

If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.

Parameters

input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.

Returns

An output shape tuple.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

tensorflow.python.ipu.keras.layers.GroupNorm

alias of GroupNormalization

class tensorflow.python.ipu.keras.layers.GroupNormalization(*args, **kwargs)

Group normalization layer optimized for running on the IPU.

This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Group normalization is described in this paper: https://arxiv.org/abs/1803.08494.

Parameters
  • groups – The number of groups to use in the normalization.

  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • strided_channel_grouping – Selects whether to group the channels dimension for group normalisation with a stride between channels. This makes the PopLibs implementation more efficient but is unconventional. Among other things this will mean that using pre-trained weights would not be possible if not produced with this unconventional implementation.

  • trainable – Boolean, if True the variables will be marked as trainable.
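
The sketch below shows group normalization on a single value per channel, without the gamma and beta parameters. The strided branch reflects one reading of strided_channel_grouping (group g takes channels g, g + groups, and so on), which is an assumption here, as are the function name and the epsilon default.

```python
import math


def group_norm(channels, groups, epsilon=1e-3, strided=False):
    """Normalize a flat list of channel values within each group."""
    count = len(channels)
    assert count % groups == 0
    size = count // groups
    output = [0.0] * count
    for g in range(groups):
        if strided:
            # Assumed strided layout: channels g, g + groups, ...
            members = list(range(g, count, groups))
        else:
            # Conventional layout: contiguous blocks of channels.
            members = list(range(g * size, (g + 1) * size))
        mean = sum(channels[i] for i in members) / size
        var = sum((channels[i] - mean) ** 2 for i in members) / size
        for i in members:
            output[i] = (channels[i] - mean) / math.sqrt(var + epsilon)
    return output


values = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]
contiguous = group_norm(values, groups=2)
strided = group_norm(values, groups=2, strided=True)
```

The two layouts give different outputs for the same input, which is consistent with the warning that weights trained with one grouping do not transfer to the other.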

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
Parameters

inputs – The tensor to apply normalization to.

Returns

The tensor resulting from applying normalization.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

tensorflow.python.ipu.keras.layers.InstanceNorm

alias of InstanceNormalization

class tensorflow.python.ipu.keras.layers.InstanceNormalization(*args, **kwargs)

Instance normalization layer optimized for use on the IPU.

This layer is used like the standard Keras InstanceNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Instance normalization is described in this paper: https://arxiv.org/abs/1607.08022.

Parameters
  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
Parameters

inputs – The tensor to apply normalization to.

Returns

The tensor resulting from applying normalization.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

tensorflow.python.ipu.keras.layers.LayerNorm

alias of LayerNormalization

class tensorflow.python.ipu.keras.layers.LayerNormalization(*args, **kwargs)

Layer normalization layer optimized for use on the IPU.

This layer is used like the standard Keras LayerNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Layer normalization is described in this paper: https://arxiv.org/abs/1607.06450.

Parameters
  • axis – Integer or List/Tuple. The axis that should be normalized (typically the features axis).

  • epsilon – Small float added to variance to avoid dividing by zero.

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • beta_regularizer – Optional regularizer for the beta weight.

  • gamma_regularizer – Optional regularizer for the gamma weight.

  • beta_constraint – Optional constraint for the beta weight.

  • gamma_constraint – Optional constraint for the gamma weight.

  • trainable – Boolean, if True the variables will be marked as trainable.
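
As a minimal sketch of how center and scale interact with the normalization, the plain-Python function below normalizes one feature vector and then optionally applies gamma and beta. The function name and the epsilon default are assumptions for illustration, not the layer's implementation.

```python
import math


def layer_norm(features, gamma, beta, epsilon=1e-3,
               center=True, scale=True):
    """Normalize one feature vector, then apply gamma and beta."""
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    normed = [(x - mean) / math.sqrt(var + epsilon) for x in features]
    if scale:
        normed = [x * g for x, g in zip(normed, gamma)]
    if center:
        normed = [x + b for x, b in zip(normed, beta)]
    return normed


features = [1.0, 2.0, 3.0, 4.0]
gamma = [2.0] * 4
beta = [0.5] * 4

plain = layer_norm(features, gamma, beta, center=False, scale=False)
affine = layer_norm(features, gamma, beta)
```

With both options enabled, each output equals gamma times the normalized value plus beta.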

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
Parameters

inputs – The tensor to apply normalization to.

Returns

The tensor resulting from applying normalization.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

class tensorflow.python.ipu.keras.layers.RecomputationCheckpoint(*args, **kwargs)

Layer for checkpointing values in a computational pipeline stage. When recomputation is enabled, these values will not be recomputed and they will be stored in memory instead.

This layer can reduce memory liveness peaks when using recomputation if there are too many activations which need to be recomputed before the backpropagation operations can be executed.

This layer should be used with the RecomputationMode.RecomputeAndBackpropagateInterleaved pipelining recomputation mode.

Note that this layer has no effect when used with the RecomputationMode.RecomputeThenBackpropagate pipelining recomputation mode.
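
The trade-off a checkpoint makes can be illustrated with a toy chain of stages. This is a conceptual sketch only: the helper functions are hypothetical and do not model how IPU pipelining schedules recomputation.

```python
def forward(stages, x, checkpoints):
    """Run `stages` in order, storing only checkpointed activations.

    `checkpoints` holds the stage indices whose outputs are kept;
    index -1 denotes the chain input, which is always kept.
    """
    stored = {-1: x}
    for i, stage in enumerate(stages):
        x = stage(x)
        if i in checkpoints:
            stored[i] = x
    return x, stored


def recompute(stages, stored, target):
    """Recover the output of stage `target` from the nearest stored
    activation at or before it, counting the stages re-run."""
    start = max(i for i in stored if i <= target)
    x, work = stored[start], 0
    for i in range(start + 1, target + 1):
        x = stages[i](x)
        work += 1
    return x, work


stages = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3,
          lambda v: v * v]

_, no_ckpt = forward(stages, 5, checkpoints=set())
_, with_ckpt = forward(stages, 5, checkpoints={1})

value_a, work_a = recompute(stages, no_ckpt, target=2)
value_b, work_b = recompute(stages, with_ckpt, target=2)
```

Both calls recover the same activation, but the checkpointed run re-executes one stage instead of three, in exchange for keeping one extra value in memory.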

call(inputs, **kwargs)

Checkpoint the input tensors.

Parameters

inputs – A tensor or a structure of tensors which should be checkpointed.

Returns

A tensor or a structure of tensors which matches shape and type of inputs.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

class tensorflow.python.ipu.keras.layers.SerialDense(*args, **kwargs)

Densely-connected NN layer where the dot operation is serialized to reduce the size of this operation.

Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).

Given the input tensor with shape [..., m, k] and kernel tensor with shape [k, n], the matrix multiplication can be serialized as follows:

  • Along the m dimension of input, by setting serialization_dimension to input_columns.

  • Along the k dimension of input and kernel by setting serialization_dimension to input_rows_kernel_columns.

  • Along the n dimension of kernel, by setting serialization_dimension to kernel_rows.
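
The three options in the list above can be sketched with nested lists. The dimension labels "m", "k" and "n" used below are hypothetical shorthand for this sketch, not the strings accepted by serialization_dimension, and the code illustrates only the arithmetic, not the IPU implementation.

```python
def matmul(a, b):
    """Plain matrix product of a [m, k] and b [k, n] nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def serial_matmul(a, b, factor, dimension):
    """Compute a @ b as `factor` smaller products along one dimension."""
    m, k, n = len(a), len(b), len(b[0])
    if dimension == "m":
        # Split the rows of the input and stack the partial outputs.
        step = m // factor
        return [row for s in range(factor)
                for row in matmul(a[s * step:(s + 1) * step], b)]
    if dimension == "k":
        # Split the shared dimension and sum the partial products.
        step = k // factor
        total = [[0.0] * n for _ in range(m)]
        for s in range(factor):
            lo, hi = s * step, (s + 1) * step
            part = matmul([row[lo:hi] for row in a], b[lo:hi])
            total = [[t + p for t, p in zip(tr, pr)]
                     for tr, pr in zip(total, part)]
        return total
    if dimension == "n":
        # Split the kernel's output dimension and join the columns.
        step = n // factor
        parts = [matmul(a, [row[s * step:(s + 1) * step] for row in b])
                 for s in range(factor)]
        return [sum((p[i] for p in parts), []) for i in range(m)]
    raise ValueError(f"unknown dimension: {dimension}")


a = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]          # [m=2, k=4]
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]      # [k=4, n=2]
expected = matmul(a, b)
```

Each split yields the same result as the full product. Only the size of the intermediate operations changes, which is why the factor must divide the chosen dimension.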

Example:

# as first layer in a sequential model:
model = Sequential()
model.add(SerialDense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(SerialDense(32))
Parameters
  • units – Positive integer, dimensionality of the output space.

  • serialization_factor – An integer indicating the number of smaller matrix multiplies this operation is broken up into. Must divide the dimension along which the operation is serialized.

  • serialization_dimension – A string, must be one of input_columns, input_rows_kernel_columns or kernel_rows. Indicates the dimension along which the operation is serialized.

  • activation – Activation function to use. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).

  • use_bias – Boolean, whether the layer uses a bias vector.

  • kernel_initializer – Initializer for the kernel weights matrix.

  • bias_initializer – Initializer for the bias vector.

  • kernel_regularizer – Regularizer function applied to the kernel weights matrix.

  • bias_regularizer – Regularizer function applied to the bias vector.

  • activity_regularizer – Regularizer function applied to the output of the layer (its “activation”).

  • kernel_constraint – Constraint function applied to the kernel weights matrix.

  • bias_constraint – Constraint function applied to the bias vector.

Input shape:

N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).

Output shape:

N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, **kwargs)
Parameters

inputs – The tensor to apply the dense weights to.

Returns

The tensor resulting from applying the dense weights.

compute_output_shape(input_shape)

Computes the output shape of the layer.

If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.

Parameters

input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.

Returns

An output shape tuple.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

22.17. Keras losses

22.17.1. Keras loss functions for the Graphcore IPU

class tensorflow.python.ipu.keras.losses.CTCLoss(*args, **kwargs)

Computes CTC (Connectionist Temporal Classification) loss. This implementation is designed and optimized for the IPU and cannot be used with other systems.

Usage:

labels = tf.keras.layers.Input((max_label_length), batch_size=batch_size,
                               dtype=np.int32, name="labels")
data = tf.keras.layers.Input((max_time, num_classes),
                             batch_size=batch_size, dtype=np.float32,
                             name="data")
label_length = tf.keras.layers.Input((), batch_size=batch_size,
                                     dtype=np.int32, name="label_length")
logit_length = tf.keras.layers.Input((), batch_size=batch_size,
                                     dtype=np.int32, name="logit_length")

dense_layer = tf.keras.layers.Dense(num_classes)
transpose_layer = tf.keras.layers.Lambda(
    lambda x: tf.keras.backend.permute_dimensions(x, (1, 0, 2)))
ctc_loss_layer = ipu.keras.losses.CTCLoss(from_logits=True)

x = dense_layer(data)
x = transpose_layer(x)
loss = ctc_loss_layer(labels, x, label_length, logit_length)

model = ipu.keras.Model((labels, data, label_length, logit_length), loss)
get_loss_output = lambda y_true, y_pred: y_pred
model.compile('sgd', loss=get_loss_output)

Parameters
  • blank_index – The class index to use for the blank label.

  • from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.

call(labels, data, label_length, data_length, **kwargs)
Parameters
  • labels – The labels input [batch_size, max_label_length] tensor.

  • data – The data input [max_time, batch_size, num_classes] tensor.

  • label_length – A tensor of shape [batch_size] containing the number of labels in each labels batch entry.

  • data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.

Returns

The calculated loss.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. The callers should make a copy of the returned dict if they want to modify it.

Returns

Python dictionary.

22.18. Keras optimizers

22.18.1. Keras Optimizer wrappers for the Graphcore IPU

class tensorflow.python.ipu.keras.optimizers.CrossReplicaOptimizer(opt, name='CrossReplicaOptimizer')

An optimizer that averages gradients across IPU replicas.

classmethod from_config(config, custom_objects=None)

Creates a CrossReplicaOptimizer from its config.

This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.

Parameters
  • config – A Python dictionary, typically the output of get_config.

  • custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.

Returns

A CrossReplicaOptimizer instance.
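
As a conceptual illustration of averaging gradients across replicas, the following pure-Python sketch averages each variable's gradient over all replicas. It is not how the optimizer is implemented on the IPU; the function name average_across_replicas and the use of scalar gradients are assumptions made to keep the example short.

```python
def average_across_replicas(per_replica_grads):
    """Average each variable's gradient over all replicas.

    per_replica_grads: one list of gradients per replica, each list holding
    one float per variable (scalars keep the sketch short).
    """
    num_replicas = len(per_replica_grads)
    # zip(*...) groups the gradients of the same variable together.
    return [sum(grads) / num_replicas for grads in zip(*per_replica_grads)]


# Two replicas, two variables.
averaged = average_across_replicas([[1.0, 4.0], [3.0, 8.0]])
```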

class tensorflow.python.ipu.keras.optimizers.GradientAccumulationOptimizer(opt, num_mini_batches, *nargs, **kwargs)

An optimizer which performs the weight update after multiple batches have been accumulated.
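
The idea of delaying the weight update can be sketched in plain Python. The AccumulatingSGD class below is hypothetical and only mirrors the behaviour described above for a single scalar weight. It sums the accumulated gradients, which is an assumption: whether the real optimizer sums or averages them is not stated here.

```python
class AccumulatingSGD:
    """Hypothetical SGD that updates the weight every num_mini_batches steps."""

    def __init__(self, learning_rate, num_mini_batches):
        self.learning_rate = learning_rate
        self.num_mini_batches = num_mini_batches
        self._accumulated = 0.0
        self._count = 0

    def step(self, weight, grad):
        # Accumulate the gradient; only update once enough have been seen.
        self._accumulated += grad
        self._count += 1
        if self._count < self.num_mini_batches:
            return weight
        weight -= self.learning_rate * self._accumulated
        self._accumulated = 0.0
        self._count = 0
        return weight


opt = AccumulatingSGD(learning_rate=0.5, num_mini_batches=2)
w = 1.0
w_after_first = opt.step(w, grad=1.0)               # no update yet
w_after_second = opt.step(w_after_first, grad=3.0)  # update with summed gradient
```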

class tensorflow.python.ipu.keras.optimizers.IpuOptimizer(opt, name=None, **kwargs)

The wrapper interface for ipu.keras v2 optimizers. Any custom wrappers written for IPU keras applications should inherit from this class and override the appropriate functions.

This provides the convenience of automatically passing calls to functions that have not been overridden on to the wrapped optimizer, and also allows you to define custom APIs specifically for the IPU.
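
The wrapper pattern described here can be sketched without TensorFlow. The classes below are hypothetical and do not reproduce the IpuOptimizer implementation; they only show how a wrapper can forward anything it does not override to the optimizer it wraps.

```python
class InnerOptimizer:
    """Hypothetical wrapped optimizer."""

    def get_slot_names(self):
        return ["momentum"]

    def preprocess_gradients(self, grad, var):
        return grad, var


class Wrapper:
    """Forwards anything it does not define to the wrapped optimizer."""

    def __init__(self, opt):
        self._opt = opt

    def __getattr__(self, name):
        # Only called when normal attribute lookup on the wrapper fails.
        return getattr(self._opt, name)


class HalvingWrapper(Wrapper):
    def preprocess_gradients(self, grad, var):
        # Overridden: halve the gradient, then defer to the wrapped optimizer.
        return self._opt.preprocess_gradients(grad * 0.5, var)


wrapped = HalvingWrapper(InnerOptimizer())
slot_names = wrapped.get_slot_names()               # forwarded
grad, var = wrapped.preprocess_gradients(4.0, "w")  # overridden
```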

add_slot(var, slot_name, initializer='zeros')

Default wrapper that calls the wrapped optimizer’s add_slot.

Parameters
  • var – A variable to add.

  • slot_name – The name of the slot.

  • initializer – Default initializer for var.

property clipnorm

float or None. If set, clips gradients to a maximum norm.

property clipvalue

float or None. If set, clips gradients to a maximum value.

get_config()

Returns the config of the IpuOptimizer instance.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

The returned config will contain at a minimum, inner_optimizer_config, inner_optimizer_type and name.

Returns

Python dictionary.

get_gradients(loss, params)

Default wrapper that calls the wrapped optimizer’s get_gradients.

Parameters
  • loss – A loss to compute gradients of.

  • params – A list of variables to compute gradients with respect to.

get_slot(var, slot_name)

Default wrapper that calls the wrapped optimizer’s get_slot.

Parameters
  • var – A variable to look up.

  • slot_name – The name of the slot.

get_slot_names()

Default wrapper that calls the wrapped optimizer’s get_slot_names.

get_weights()

Default wrapper that calls the wrapped optimizer’s get_weights.

property global_clipnorm

float or None. If set, clips gradients to a maximum norm.

preprocess_gradients(grad, var)

Default wrapper that calls the wrapped optimizer’s preprocess_gradients, if it has one.

set_weights(weights)

Default wrapper that calls the wrapped optimizer’s set_weights.

Parameters

weights – The weights to set.

variables()

Returns the variables of the wrapped optimizer.

property weights

Returns variables of this Optimizer based on the order created.

class tensorflow.python.ipu.keras.optimizers.MapGradientOptimizer(opt, gradient_mapping_function, name='MapGradientOptimizer')

Removed, please use MapGradientOptimizerInvertedChaining.

class tensorflow.python.ipu.keras.optimizers.MapGradientOptimizerInvertedChaining(opt, gradient_mapping_function, name='MapGradientOptimizerInvertedChaining')

Apply a function to gradients before they are applied to the variables.

If wrapping multiple optimizers then the outer mapping functions will be applied first (this is the opposite way to MapGradientOptimizer). If used with the MapGradientOptimizer wrapper then the MapGradientOptimizers will always be applied first.
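
The ordering rule for nested wrappers can be made concrete with a small pure-Python sketch. The classes are hypothetical, and the mapping functions take only the gradient for brevity, which is an assumption about their signature. The point is simply that the outer function runs before the inner one.

```python
class RecordingOptimizer:
    """Hypothetical innermost optimizer that records what it receives."""

    def __init__(self):
        self.applied = None

    def apply(self, grad):
        self.applied = grad


class MapGradient:
    """Hypothetical wrapper: map the gradient, then pass it inwards."""

    def __init__(self, opt, gradient_mapping_function):
        self._opt = opt
        self._fn = gradient_mapping_function

    def apply(self, grad):
        self._opt.apply(self._fn(grad))


base = RecordingOptimizer()
inner = MapGradient(base, lambda g: g * 10)  # applied second
outer = MapGradient(inner, lambda g: g + 1)  # applied first
outer.apply(2)
```

With the outer function applied first, the recorded gradient is (2 + 1) * 10; the opposite order would give 2 * 10 + 1.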

classmethod from_config(config, custom_objects=None)

Creates a MapGradientOptimizerInvertedChaining from its config.

This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.

Parameters
  • config – A Python dictionary, typically the output of get_config.

  • custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.

Returns

A MapGradientOptimizerInvertedChaining instance.

get_config()

Returns the config of the MapGradientOptimizerInvertedChaining instance.

22.19. Operators

It is also possible to access the operators via the tensorflow.python.ipu.ops namespace, for example: tensorflow.python.ipu.ops.normalization_ops.group_norm().

tensorflow.python.ipu.application_compile_op.experimental_application_compile_op(func, inputs=None, output_path=None, freeze_variables=False, name=None)

An operation that compiles a function into an executable for the IPU. The operation itself should be placed on CPU, and it will compile for the default IPU device.

WARNING: This API is experimental and subject to change.

Example usage:

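import numpy as np
import tensorflow as tf
from tensorflow.python.ipu.application_compile_op import experimental_application_compile_op

def model(x):
  return x * x

v = tf.placeholder(tf.float32, shape=(2,))
compile_model = experimental_application_compile_op(model, inputs=[v])

with tf.Session() as sess:
  executable_path = sess.run(compile_model, {v: np.zeros(v.shape)})

Parameters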
  • func – The Python function to compile.

  • inputs – The inputs passed to the function, as func(*inputs).

  • output_path – The path where the executable will be stored. If None, a temporary file is used.

  • freeze_variables – If True, any referenced variables will be captured by their values (when the compile op is executed) and embedded into the compiled executable as constants. If False, the referenced variables instead become implicit inputs that must be provided when executing the compiled executable.

  • name – Optional op name.

Returns

A Tensor of type string with the path to the compiled executable.

22.19.1. Control flow operations

tensorflow.python.ipu.control_flow_ops.barrier(tensors, insert_barrier_for_gradients=False, name=None)

A control flow operation to force the scheduling of operations in the Poplar XLA backend.

For example, given the following program:

def func(a, b, c, d):
  e = a + b
  f = c + d
  g = e + a
  return f, g

The operations f and g are independent of each other, meaning that either f or g can execute first. However, if we want to force f to execute first, we can insert a barrier operation:

def func(a, b, c, d):
  e = a + b
  f = c + d
  f, a = ipu.control_flow_ops.barrier([f, a])
  g = e + a
  return f, g

This will result in f executing before g as now there is a data dependency between the operations.

Parameters

tensors – A tensor or a structure of tensors which all have to be executed before the outputs of the barrier operation can be used.

Returns

A tensor or a structure of tensors which matches the shape and type of the tensors argument.

22.19.2. Custom operations

tensorflow.python.ipu.custom_ops.codelet_expression_op(vertex_expression, *args)

Add a custom fused elementwise expression operation to the graph.

The automatic gradient calculation in TensorFlow does not have visibility of the operations performed by this function and so this operation cannot be used for training.

In the following example, the Python function my_custom_op() provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.

def my_custom_op(x, y, z):
    return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)

Parameters
  • vertex_expression – A Python function that defines the codelet expression.

  • args – The tensor inputs to the expression.

Returns

The Tensor which is the result of applying the elementwise operation.

tensorflow.python.ipu.custom_ops.cpu_user_operation(inputs, library_path, outs=None, name='UserOp', op_name='Callback', separate_gradients=False, inputs_with_gradients=None, attributes=None, gradient_attributes=None)

Call the CPU function located in the shared library at library_path as part of the normal TensorFlow execution. The given inputs are copied from the IPU to the CPU, and the outputs are copied back to the IPU afterwards.

The shape and type of the outputs should be specified by outs. If it is None it will default to no output. outs should be a dictionary with two elements like so:

outs = {
  "output_types": [my_types_as_a_list],
  "output_shapes": [my_shapes_as_a_list],
}

Parameters
  • inputs – The tensor inputs to the operation.

  • library_path – The path to the shared object that contains the functions to execute the operation.

  • outs – A dictionary describing the output tensor shapes and types.

  • name – The name of the operation.

  • op_name – The prefix of the functions inside the shared object file. This defaults to ‘Callback’.

  • separate_gradients – When set to True, multiple gradient ops will be generated, one for each input. When False, a single gradient op will be generated, which should produce the partial derivatives for all inputs.

  • inputs_with_gradients – A list of input indices. If this is defined then the op will only calculate derivatives for the specified inputs.

  • attributes – An optional string object which is passed as an argument to the Poplar function. Allows you to specify function attributes which were not known at the compile time of the C++ Poplar function. Can be used to pass a JSON or ProtoBuf serialized string to the Poplar function for ease of use. See the documentation for examples.

  • gradient_attributes – The same as attributes, however this is passed as the attributes argument to the gradient operations (if training).

Returns

The array of tensor outputs.

tensorflow.python.ipu.custom_ops.precompiled_user_op(inputs, library_path, gp_path='', outs=None, name='UserOp', op_name='Build', separate_gradients=False, inputs_with_gradients=None, attributes=None, gradient_attributes=None)

Call the Poplar function located in the shared library at library_path as part of the normal TensorFlow execution with the given inputs.

The shape and type of the output should be specified by outs. If it is None it will default to no output. outs should be a dictionary with two elements like this:

outs = {
  "output_types": [my_types_as_a_list],
  "output_shapes": [my_shapes_as_a_list],
}

Parameters
  • inputs – The tensor inputs to the operation.

  • library_path – The path to the shared object file that contains the functions to build the Poplar operation in the graph.

  • gp_path – The path to a precompiled codelet file, if you have one.

  • outs – A dictionary describing the output tensor shapes and types.

  • name – The name of the operation in TensorFlow.

  • op_name – The prefix of the functions inside the shared object file. This defaults to ‘Build’.

  • separate_gradients – When set to true, multiple gradient ops will be generated, one for each input. When false, a single gradient op will be generated, which should produce the partial derivatives for all inputs (or all inputs specified in inputs_with_gradients).

  • inputs_with_gradients – A list of input indices. If this is defined then the op will only calculate derivatives for the specified inputs.

  • attributes – An optional string object which is passed as an argument to the build function. Allows you to specify function attributes which were not known at the compile time of the C++ Poplar function. Can be used to pass a JSON or ProtoBuf serialized string to the Poplar function for ease of use. See the documentation for examples.

  • gradient_attributes – The same as attributes, however this is passed as the attributes argument to the gradient operation (if training).

Returns

The array of tensor outputs.

22.19.3. Functional operators

tensorflow.python.ipu.functional_ops.outlined_function(func=None, unique_sharding=False, keep_input_layouts=None, name=None)

An outlined function is a block of organized, reusable code which is used to perform a single action. Functions provide better modularity for your application and a high degree of code reuse, which can decrease memory usage at the expense of passing the arguments around.

Arguments can be passed in two ways: as a parameter of the Python function func, or as a value defined in the enclosing scope and used within func. Arguments that are compile-time graph constants should be defined in the enclosing scope, as this makes them eligible for expression evaluation. Arguments passed via function parameters will always be treated as runtime values.

Functions can be used by models constrained by memory which have common structures or to serialize some large operations.

If the provided function contains any stateful operations, such as stateful random number generation, then the function cannot be reused and it will be inlined automatically.

See the documentation for more details and examples.

Parameters
  • func – A Python function which takes a list of positional arguments only. All the arguments must be tf.Tensor-like objects, or be convertible to them. See the documentation for examples of how to pass non tf.Tensor-like objects to the functions. The function provided must return at least one tf.Tensor-like object.

  • unique_sharding – Makes sure that all function inputs are copied to a single device before the function call is executed. Enabling this can increase performance as any inter IPU communication can be more efficiently scheduled and any duplicated copies can be elided.

  • keep_input_layouts – A hint to decide whether to keep the layouts of the function inputs when calling the function or re-allocate them based on the operations inside the function. Reallocating them can improve the performance, but it can also increase the IPU code size. When set to ‘None’, this option will be decided automatically.

  • name – The name of the function.

Returns

An Operation that executes the function.

22.19.4. Image operations

tensorflow.python.ipu.image_ops.normalise_image(image, channel_offsets, channel_scales, scale=1, name=None)

Pad an image to have 4 channels and normalise it according to the following formula:

image = (image[c] * scale - channel_offsets[c]) * channel_scales[c]

for each of the c channels in the image.

Parameters
  • image – An [X,Y,Z,3] tensor, where the channels are the innermost dimension. Must be uint8, float32 or float16.

  • channel_offsets – A [3] array or tensor of offsets for the channels.

  • channel_scales – A [3] array or tensor of scales for the channels.

  • scale – A scalar constant that will scale the image before channel normalization. Defaults to 1.

  • name – Optional op name.

Returns

An [X,Y,Z,4] tensor with the same type as the input image, except for uint8 inputs, where the output is float16.
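
The normalisation formula can be checked on a single pixel with a short pure-Python sketch. The helper normalise_pixel is hypothetical, and filling the padded fourth channel with zero is an assumption, since the padding value is not stated above.

```python
def normalise_pixel(pixel, channel_offsets, channel_scales, scale=1):
    """Normalise one 3-channel pixel and pad it to 4 channels."""
    out = [(value * scale - offset) * ch_scale
           for value, offset, ch_scale
           in zip(pixel, channel_offsets, channel_scales)]
    # Assumption: the padded fourth channel is filled with zero.
    out.append(0.0)
    return out


result = normalise_pixel([10, 20, 30],
                         channel_offsets=[1.0, 2.0, 3.0],
                         channel_scales=[0.5, 0.5, 2.0],
                         scale=0.5)
```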

22.19.5. Graphcore utility operations

tensorflow.python.ipu.internal_ops.fifo(x, depth, offload=False, name=None)

Introduces a first-in-first-out queue with a fixed depth.

Parameters
  • x – The tensor to enqueue.

  • depth – The depth of the queue.

  • offload – Whether to offload the queue storage to Poplar remote buffers.

  • name – Optional op name.

Returns

A Tensor which was dequeued from the fifo. This will be x at t - depth. The first depth iterations will have unspecified values.
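
The delay behaviour can be sketched with a fixed-depth queue in plain Python. The Fifo class below is hypothetical and uses None to stand in for the unspecified values returned during the first depth iterations.

```python
from collections import deque


class Fifo:
    """Fixed-depth queue: each push returns the value pushed depth steps ago."""

    def __init__(self, depth):
        # None stands in for the unspecified values of the first depth steps.
        self._queue = deque([None] * depth)

    def push(self, x):
        self._queue.append(x)
        return self._queue.popleft()


fifo = Fifo(depth=2)
outputs = [fifo.push(t) for t in range(5)]
```

From iteration t = 2 onwards, each output is the value that was pushed at t - 2.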

tensorflow.python.ipu.internal_ops.get_current_iteration_counter(name=None, **kwargs)

Returns which gradient accumulation iteration the pipeline is in.

Returns

A scalar tensor with the iteration count.

tensorflow.python.ipu.internal_ops.print_tensor(input, name='')

Print the specified input.

Parameters
  • input – The tensor to print.

  • name – Optional op name.

Returns

An operator that prints the specified input to the standard error. For the tensor to be printed one must either return it as part of their XLA function which is consumed by ipu_compiler.compile, or include the returned op in the input to session.run, or use the operator as a control dependency for executed ops by specifying with tf.control_dependencies([print_op]).

Examples

  1. Returning the print operation as part of the XLA function:

import tensorflow as tf

from tensorflow.python.ipu import internal_ops
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import scopes

def my_net(v):
  print_op = internal_ops.print_tensor(v)
  v = v + 1
  return v, print_op

with scopes.ipu_scope("/device:IPU:0"):
  # Example input; any tensor can be used here.
  v = tf.placeholder(tf.float32, [2, 2], name="v")
  res = ipu_compiler.compile(my_net, inputs=[v])

...

  2. Including the print operation in session.run:

import numpy as np
import tensorflow as tf

from tensorflow.python.ipu import internal_ops
from tensorflow.python.ipu import scopes

with scopes.ipu_scope("/device:IPU:0"):
  pa = tf.placeholder(np.float32, [2, 2], name="a")
  print_op = internal_ops.print_tensor(pa)
  x = pa + 1

with tf.Session() as session:
  result = session.run([x, print_op], feed_dict={pa: np.ones([2, 2])})

...

  3. Using control dependencies:

import numpy as np
import tensorflow as tf

from tensorflow.python.ipu import internal_ops
from tensorflow.python.ipu import scopes

with scopes.ipu_scope("/device:IPU:0"):
  pa = tf.placeholder(np.float32, [2, 2], name="a")
  print_op = internal_ops.print_tensor(pa)
  with tf.control_dependencies([print_op]):
    x = pa + 1

with tf.Session() as session:
  result = session.run(x, feed_dict={pa: np.ones([2, 2])})

...

tensorflow.python.ipu.internal_ops.remap(x, name=None)

Clone and map the input linearly across the IPU.

Parameters
  • x – The tensor to remap.

  • name – Optional op name.

Returns

A Tensor which has been linearly mapped across the IPU.

tensorflow.python.ipu.internal_ops.remap_deduce(x, name=None)

Clone the tensor and deduce the tile mapping.

Parameters
  • x – The tensor to remap.

  • name – Optional op name.

Returns

A Tensor which has been mapped across the IPU by deducing the tile layout from the input parameter.

22.19.6. IPU-specific maths operations

tensorflow.python.ipu.math_ops.segment_sum(data, segment_ids, num_segments, name=None)

Computes the sum along segments of a tensor, such that:

\[output_i = \sum_j data_j\]

where sum is over j such that segment_ids[j] == i.

If the sum is empty for a given segment ID i then output[i] = 0.

Segments are partitions of a tensor along the first dimension indexed by a 1-D segment_ids tensor.

Read the TensorFlow documentation on segmentation for a more detailed explanation of segments.

For example:

c = tf.constant([[1, 2, 3, 4], [4, 3, 2, 1], [5, 6, 7, 8]])
ipu.math_ops.segment_sum(c, tf.constant([0, 0, 1]), 2)
# ==> [[5, 5, 5, 5],
#      [5, 6, 7, 8]]

Caution

The segment_ids must be sorted in ascending order. If provided with an unsorted tensor, no exception will be raised and the behaviour of this operation is undefined.

num_segments must be specified and must be at least 1 + max(segment_ids).

Parameters
  • data – tf.Tensor with rank >= 1.

  • segment_ids – A sorted tf.Tensor of int32 with rank == 1 and the same length as the 0th dimension of data.

  • num_segments – Number of segments to take within data.

  • name – Name for the operation (optional).

Returns

A tf.Tensor of the same type and rank as data but where the length of the 0th dimension is equal to num_segments, which comprises the sum of all the elements within the same segment in each cross-section.

Raises
  • ValueError – If the ranks of data and segment_ids are not fully defined.

  • ValueError – If the lengths of the 0th dimension of data and of segment_ids are not equal.

  • ValueError – If data does not have at least rank 1.

  • ValueError – If segment_ids does not have a rank equal to 1.
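
The semantics described above can be written as a reference implementation in plain Python. This is only a sketch for rank-2 data given as a list of rows; the function below is a stand-in and does not call the IPU operation.

```python
def segment_sum(data, segment_ids, num_segments):
    """Reference semantics for rank-2 data given as a list of rows."""
    if len(data) != len(segment_ids):
        raise ValueError("data and segment_ids must have the same length")
    width = len(data[0])
    # Segments with no rows stay at zero.
    output = [[0] * width for _ in range(num_segments)]
    for row, segment in zip(data, segment_ids):
        for col, value in enumerate(row):
            output[segment][col] += value
    return output


c = [[1, 2, 3, 4], [4, 3, 2, 1], [5, 6, 7, 8]]
result = segment_sum(c, [0, 0, 1], 3)
```

Here num_segments is 3, so the last output row corresponds to a segment ID that never appears in segment_ids and is therefore all zeros.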

tensorflow.python.ipu.math_ops.serialized_matmul(a, b, serialization_factor, serialization_dimension, transpose_a=False, transpose_b=False, name=None)

Multiplies matrix a by matrix b, producing a * b, with the multiplication being serialized on one of the dimensions.

Serializing a matrix multiplication operation can reduce the code size of the multiplication at the expense of extra computation due to copying of tensors.

The inputs must, following any transpositions, be tensors of rank >= 2 where the inner 2 dimensions specify valid matrix multiplication dimensions, and any further outer dimensions specify matching batch size.

Either matrix can be transposed on the fly by setting the corresponding flag to True. These are False by default.

Given the tensor a with shape [..., m, k] and tensor b with shape [..., k, n] after the transpositions, the matrix multiplication can be serialized as follows:

  • Along the columns dimension of a (the m-dimension), by setting serialization_dimension to a_columns.

  • Along the rows dimension of a and the columns dimension of b (the k-dimension), by setting serialization_dimension to a_rows_b_columns.

  • Along the rows dimension of b (the n-dimension), by setting serialization_dimension to b_rows.

Note that taking a gradient of a serialized matrix multiplication means that the backward propagation of the matrix multiply will also be serialized.

Note that adjoining and sparse matrices are not supported.

Parameters
  • a – tf.Tensor of type float16, float32, int32 and rank >= 2.

  • b – tf.Tensor with same type and rank as a.

  • serialization_factor – An integer indicating the number of smaller matrix multiplies this operation is broken up into. Must divide the dimension along which the operation is serialized.

  • serialization_dimension – A string, must be one of a_columns, a_rows_b_columns or b_rows. Indicates the dimension along which the operation is serialized.

  • transpose_a – If True, a is transposed before multiplication.

  • transpose_b – If True, b is transposed before multiplication.

  • name – Name for the operation (optional).

Returns

A tf.Tensor of the same type as a and b where each inner-most matrix is the product of the corresponding matrices in a and b, e.g. if all transpose attributes are False:

output[…, i, j] = sum_k (a[…, i, k] * b[…, k, j]), for all indices i, j.
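
To show why serializing along the shared k-dimension gives the same result as a single multiplication, here is a pure-Python sketch. It only illustrates the arithmetic (slicing k and summing the partial products) for rank-2 inputs; the function names are hypothetical and nothing here reflects how the operation is scheduled on the IPU.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def serialized_matmul_k(a, b, serialization_factor):
    """Split the shared k-dimension into slices and sum the partial products."""
    k = len(b)
    if k % serialization_factor != 0:
        raise ValueError("serialization_factor must divide the k-dimension")
    step = k // serialization_factor
    m, n = len(a), len(b[0])
    output = [[0] * n for _ in range(m)]
    for start in range(0, k, step):
        a_slice = [row[start:start + step] for row in a]
        b_slice = b[start:start + step]
        partial = matmul(a_slice, b_slice)
        for i in range(m):
            for j in range(n):
                output[i][j] += partial[i][j]
    return output


a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 0], [0, 2]]
full = matmul(a, b)
serialized = serialized_matmul_k(a, b, serialization_factor=2)
```

Serializing along the m- or n-dimension would instead compute slices of the output independently and concatenate them, with no summation of partial results.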

22.19.7. Pipelining operators

class tensorflow.python.ipu.pipelining_ops.OptimizerFunctionOutput(opt, loss, compute_gradients_args=None, compute_gradients_kwargs=None, apply_gradients_args=None, apply_gradients_kwargs=None, variables=None, tape=None, gradient_capture_context=None, captured_gradient_outfeed=None)

A helper class used for returning a structured output from an optimizer_function in a pipeline.

__init__(opt, loss, compute_gradients_args=None, compute_gradients_kwargs=None, apply_gradients_args=None, apply_gradients_kwargs=None, variables=None, tape=None, gradient_capture_context=None, captured_gradient_outfeed=None)

Creates an OptimizerFunctionOutput object.

Parameters
  • opt – An instance of optimizer.Optimizer which is used to generate the back-propagation and the weight update pipeline stages.

  • loss – The loss which is passed to the optimizer when calling compute_gradients.

  • compute_gradients_args – Positional arguments (not including loss) which are passed to the compute_gradients function.

  • compute_gradients_kwargs – Keyword arguments (not including loss) which are passed to the compute_gradients function.

  • apply_gradients_args – Positional arguments (not including grads_and_vars) which are passed to the apply_gradients function.

  • apply_gradients_kwargs – Keyword arguments (not including grads_and_vars) which are passed to the apply_gradients function.

  • variables – A list or tuple of variables to compute gradients with respect to when opt is an instance of OptimizerV2.

  • tape – A GradientTape for gradient computation when opt is an instance of OptimizerV2.

  • gradient_capture_context – An ipu.eager.backprop.GradientCaptureContext for accessing gradients captured by ipu.ops.grad_util_ops.capture_upstream_gradients.

  • captured_gradient_outfeed – An ipu.IPUOutfeedQueue to which any captured gradients are pushed.

class tensorflow.python.ipu.pipelining_ops.PipelineSchedule(value)

The PipelineSchedule describes how stages are interleaved on the IPUs servicing the pipeline. The forward and backward passes of each stage will execute on the same IPUs. So, in the core of the pipeline there is a choice as to whether to run the forward stages together, or the backward stages and the forward stages together.

Grouped

This groups the forward passes on multiple IPUs. This requires more memory since activations need to be stored until the backward stages run together. However, since forward passes tend to be smaller than backward passes, Grouped tends to improve the speed of the execution, as different IPUs don’t spend so much time waiting for each other.

Interleaved

This schedules the backward passes whenever the forward passes have just generated some activations. Consequently fewer activations are required to be stored between the forward and backward pipeline stages, so less memory is required. However, since forward and backward stages tend to be very different in terms of execution cycles, the overall performance of the pipeline tends to be slower.

Sequential

This is a debug mode, where the pipeline is scheduled in the same way as if it were a sharded model.

class tensorflow.python.ipu.pipelining_ops.PipelineStageOptions(convolution_options=None, matmul_options=None, slice_options=None)

A helper class which can be used to configure Poplar compilation options (such as availableMemoryProportion or partialsType) inside a pipeline forward, backward and weight update stage. This will override the global convolution, matmul and slice Poplar options.

__init__(convolution_options=None, matmul_options=None, slice_options=None)

Creates a PipelineStageOptions object.

Parameters
  • convolution_options – If provided, a dictionary of Poplar option flags for all the convolution operations in the stage.

  • matmul_options – If provided, a dictionary of Poplar option flags for all the matmul operations in the stage.

  • slice_options – If provided, a dictionary of Poplar option flags for all the slice operations in the stage.

class tensorflow.python.ipu.pipelining_ops.RecomputationMode(value)

When working with pipeline models for training, recomputation might be required in order to reduce the number of activations being stored on the device at any given time.

This Enum class is used to control the recomputation implementation, with the following approaches supported:

  • Auto: automatically try to select the best recomputation strategy based on the provided model and pipeline schedule.

  • RecomputeThenBackpropagate: first recompute all the activations and then perform backpropagation. This mode allows for better code reuse as the corresponding forward propagation and the recomputation operations can share the exact same code. This recomputation mode is supported by PipelineSchedule.Grouped and PipelineSchedule.Interleaved pipeline schedules. This is the default recomputation mode for PipelineSchedule.Grouped and PipelineSchedule.Interleaved pipeline schedules.

  • RecomputeAndBackpropagateInterleaved: recompute and backpropagate operations are interleaved together. This mode can help reduce the maximum liveness compared to RecomputeThenBackpropagate as the backpropagation operations can be scheduled as soon as possible, however less code reuse will be possible. This recomputation mode is supported by PipelineSchedule.Grouped and PipelineSchedule.Sequential pipeline schedules. This is the default recomputation mode for the PipelineSchedule.Sequential pipeline schedule.

tensorflow.python.ipu.pipelining_ops.pipeline(computational_stages, gradient_accumulation_count=None, gradient_accumulation_dtype=None, repeat_count=1, batch_serialization_iterations=1, inputs=None, infeed_queue=None, outfeed_queue=None, optimizer_function=None, device_mapping=None, pipeline_schedule=None, recomputation_mode=None, forward_propagation_stages_poplar_options=None, backward_propagation_stages_poplar_options=None, weight_update_poplar_options=None, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, offload_activations=None, offload_gradient_accumulation_buffers=None, replicated_weight_sharding=None, offload_weights=None, continuous_weight_updates=False, outfeed_loss=False, accumulate_outfeed=False, accumulate_outfeed_dtype=None, outfeed_mask=None, reduction_method=GradientAccumulationReductionMethod.SUM, name=None)

Sets up a series of computational stages, where the outputs of one stage are the inputs to the next one. These stages are then executed in parallel across multiple IPUs. This approach can be used to split a model so that its layers are executed on different IPUs.

The first stage takes the inputs and the infeed_queue (if provided) as its inputs. If the infeed_queue is provided, it is automatically dequeued (similar to the ipu.loops API) therefore care needs to be taken to make sure the signature of the first pipeline stage matches both the arguments from inputs and the infeed_queue, otherwise an error is thrown.

All tensors which are used in the pipeline which are not TensorFlow Variables need to be explicitly passed as inputs to the pipeline. If an input does not change its value during the execution of the pipeline op (for example hyperparameters such as learning rate), it needs to be passed as part of inputs. Alternatively, if these values change during execution (for example the model processes different batches of data) the input should be passed through the infeed_queue (see IPUInfeedQueue).

When training a model, an optional optimizer_function can be provided. This function takes all the outputs from the last computational stage as inputs, and returns an instance of OptimizerFunctionOutput that is used to generate the backwards pass of the model using the TensorFlow Optimizer API. This will internally create corresponding backpropagation pipeline stages for each pipeline stage and colocate them such that the activations and weights required for the gradient calculation and application stay on the device in order to minimise the number of copies between IPUs.

Note that the gradients, which are calculated by the compute_gradients function, will be accumulated automatically during the execution of the pipeline, unless continuous_weight_updates is enabled.

If the last computational stage has any outputs, then an outfeed_queue (see IPUOutfeedQueue) is required and all the outputs from the last computational stage are enqueued to the outfeed_queue.

Note that pipelining supports the recomputation of activations for stateless ops during the backwards pass. This reduces the number of activations that will be stored on the device, saving memory at the expense of additional computation. To enable recomputation, use the tensorflow.python.ipu.utils.set_recomputation_options() function when configuring the device.

For example, a simple inference network for MNIST can be split across two IPUs:

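# The example uses TensorFlow 1.x style APIs (tf.Session).
import tensorflow.compat.v1 as tf
from tensorflow import keras
from tensorflow.python.framework import ops
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import pipelining_ops
from tensorflow.python.ops import variable_scope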

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

# Create a pipelined model which is split across two stages.
def stage1(image):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(image)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return partial

def stage2(partial):
  logits = keras.layers.Dense(10)(partial)
  probabilities = tf.nn.softmax(logits)
  classes = tf.argmax(input=logits, axis=1)
  return probabilities, classes

def model():
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
                      computational_stages=[stage1, stage2],
                      gradient_accumulation_count=250,
                      repeat_count=2,
                      inputs=[],
                      infeed_queue=infeed_queue,
                      outfeed_queue=outfeed_queue,
                      device_mapping=[3,1],
                      name="Pipeline")
  return pipeline_op

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[])

outfeed_op = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(compiled_model)
  probabilities, classes = sess.run(outfeed_op)

In this setup, the model is split across two IPUs. By default, the first two layers would be executed on the first IPU, and the third layer together with the probabilities and classes on the second IPU. Here, device_mapping is used to override the default IPU allocation: the first two layers are executed on the fourth IPU, and the third layer together with the probabilities and classes on the second IPU.

This creates a pipeline of depth 250 (specified by the gradient_accumulation_count), which means each pipeline stage is executed 250 times.

This pipeline is then executed 2 times (specified by the repeat_count). The results of the pipeline (probabilities and classes) are returned to the host by the outfeed queue.
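The placement and execution counts of this example can be summarised with a small plain-Python sketch. The function summarize_pipeline is hypothetical and not part of the IPU API. It assumes that, without device_mapping, stage i is placed on IPU i, and that IPU indices are zero-based (so index 3 is the fourth IPU).

```python
def summarize_pipeline(num_stages, gradient_accumulation_count, repeat_count,
                       device_mapping=None):
  """Summarises stage placement and execution counts for a pipeline."""
  if device_mapping is None:
    # Assumed default: stage i is placed on IPU i.
    device_mapping = list(range(num_stages))
  if len(device_mapping) != num_stages:
    raise ValueError("device_mapping needs one entry per computational stage")
  return {
      # Maps each stage index to the IPU index it runs on.
      "stage_to_ipu": dict(enumerate(device_mapping)),
      # Each stage runs gradient_accumulation_count times per pipeline
      # execution, and the pipeline is executed repeat_count times.
      "executions_per_stage": gradient_accumulation_count * repeat_count,
  }

summary = summarize_pipeline(
    num_stages=2, gradient_accumulation_count=250, repeat_count=2,
    device_mapping=[3, 1])
```

For the inference example, summary["stage_to_ipu"] is {0: 3, 1: 1} and summary["executions_per_stage"] is 500.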

We can also train this network by providing optimizer_function:

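# The example uses TensorFlow 1.x style APIs (tf.Session, tf.placeholder).
import numpy as np
import tensorflow.compat.v1 as tf
from tensorflow import keras
from tensorflow.python.framework import ops
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import pipelining_ops
from tensorflow.python.ops import variable_scope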

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

# Create a pipelined model which is split across two stages.
def stage1(lr, images, labels):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(images)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return lr, partial, labels

def stage2(lr, partial, labels):
  logits = keras.layers.Dense(10)(partial)
  cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
                        labels=labels, logits=logits)
  loss = tf.reduce_mean(cross_entropy)
  return lr, loss

def optimizer_function(lr, loss):
  optimizer = tf.train.GradientDescentOptimizer(lr)
  return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

def model(lr):
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
                      computational_stages=[stage1, stage2],
                      gradient_accumulation_count=128,
                      repeat_count=10,
                      inputs=[lr],
                      infeed_queue=infeed_queue,
                      outfeed_queue=outfeed_queue,
                      optimizer_function=optimizer_function,
                      name="Pipeline")
  return pipeline_op

with ops.device('cpu'):
  lr = tf.placeholder(np.float16, [])

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[lr])

outfeed_op = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(compiled_model, {lr: 0.01})
  losses = sess.run(outfeed_op)

Here the tf.train.GradientDescentOptimizer generates the pipeline stages which calculate the gradients and apply them to the weights. Note how the loss is returned to the host by the outfeed queue.

If a model requires multiple computational pipeline stages to access the same tf.Variable, then all of these computational stages need to be placed on the same IPU using the device_mapping argument.
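A plain-Python sketch of this constraint is shown below. The helper find_split_variables and the variable names are hypothetical; the sketch only checks a proposed device_mapping against a description of which variables each stage uses, and does not call the IPU API.

```python
def find_split_variables(stage_variables, device_mapping):
  """Returns the variables used by stages mapped to different IPUs."""
  ipus_per_variable = {}
  for stage_index, variables in enumerate(stage_variables):
    for name in variables:
      ipus_per_variable.setdefault(name, set()).add(
          device_mapping[stage_index])
  # A variable is a problem if the stages that use it span several IPUs.
  return sorted(name for name, ipus in ipus_per_variable.items()
                if len(ipus) > 1)

# Hypothetical three-stage model whose first and last stages share "embedding".
stage_variables = [{"embedding"}, {"dense"}, {"embedding", "bias"}]

split = find_split_variables(stage_variables, device_mapping=[0, 1, 2])
colocated = find_split_variables(stage_variables, device_mapping=[0, 1, 0])
```

With the mapping [0, 1, 2] the shared variable is reported (split is ["embedding"]); with [0, 1, 0] both stages that use it are on IPU 0 and colocated is empty.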

Note that modifying tf.Variable values in a pipeline stage and/or during the gradient calculation will result in undefined behavior. These variables can only be modified by the apply_gradients member function of the applied Optimizer.

Note that arguments marked with (EXPERIMENTAL) are under active development and might not provide representative performance.

Parameters
  • computational_stages – a list of python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.

  • gradient_accumulation_count – the number of times each pipeline stage will be executed.

  • gradient_accumulation_dtype

    The data type used for the gradient accumulation buffer. One of:

    • None: Use an accumulator of the same type as the variable type.

    • A DType: Use this type for all the accumulators.

    • A callable that takes the variable and returns a DType: Allows specifying the accumulator type on a per-variable basis.

    The gradients passed to Optimizer.apply_gradients will have the dtype requested here. If that dtype is different from the variable dtype a cast is needed at some point to make them compatible. If you want to cast the gradients immediately, you can wrap your optimizer in the MapGradientOptimizer with a tf.cast.

  • repeat_count – the number of times the pipeline will be executed.

  • batch_serialization_iterations – (EXPERIMENTAL) the number of times a loop executes to compute a batch on each pipeline stage execution. Currently only supported with PipelineSchedule.Sequential.

  • inputs – arguments passed to the first pipeline stage.

  • infeed_queue – optional IPUInfeedQueue. If passed, it is dequeued and passed as an input to the first pipeline stage.

  • outfeed_queue – IPUOutfeedQueue, required if the last computational stage has any outputs. The outputs of these are enqueued to this queue and they can be accessed on the host.

  • optimizer_function – optional Python function which takes the output of the last computational stage as parameters and returns an instance of pipelining_ops.OptimizerFunctionOutput in order to generate the back-propagation and weight-update parts of the model suitable for training.

  • device_mapping – If provided, a list of length equal to the number of computational stages. An element at index i in the list represents which IPU the computational stage computational_stages[i] should reside on. This can be used to make sure computational stages which share tf.Variable are resident on the same IPU.

  • pipeline_schedule – Which scheduling algorithm to use for pipeline lowering. Defaults to PipelineSchedule.Grouped.

  • recomputation_mode – The recomputation mode to use for training pipeline models. Defaults to RecomputationMode.Auto. Only applies if recomputation is enabled. This must be done by using the tensorflow.python.ipu.utils.set_recomputation_options() function when configuring the device.

  • forward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine-grained control of the Poplar options for a given forward propagation computational stage.

  • backward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine-grained control of the Poplar options for a given backward propagation computational stage.

  • weight_update_poplar_options – If provided, a PipelineStageOptions object which allows for fine-grained control of the Poplar options for the weight update stage.

  • offload_weight_update_variables – When enabled, any tf.Variable which is only used by the weight update of the pipeline (for example the accumulator variable when using the tf.train.MomentumOptimizer) will be stored in the remote memory. During the weight update this variable will be streamed onto the device and then streamed back to the remote memory after it has been updated. Requires the machine to be configured with support for Poplar remote buffers. Offloading variables into remote memory can reduce maximum memory liveness, but can also increase the computation time of the weight update. When set to None the variables will be placed in either in-processor or remote memory automatically based on the current best placement strategy. Note that this option has no effect for inference only pipelines.

  • replicated_optimizer_state_sharding – If True, any tf.Variable which is offloaded (for example the accumulator variable when using the tf.train.MomentumOptimizer) will be partitioned across the replicas. This can exploit the additional bandwidth of the IPU-Links to improve overall throughput. However, it might increase the code size, and hence the model might need adjusting (for example the PopLibs option availableMemoryProportion might need to be changed). Note that this option has no effect for inference only pipelines.

  • offload_activations – When enabled, all the activations for the batches which are not being executed by the pipeline stages at the given time are stored in remote memory. Requires the machine to be configured with support for Poplar remote buffers. Offloading activations into remote memory can reduce maximum memory liveness, but can also increase the computation time as activations have to be copied from/to the device(s). When set to None, the activations might be offloaded when beneficial.

  • offload_gradient_accumulation_buffers – (EXPERIMENTAL) When enabled, all the gradient accumulation buffers are stored in remote memory. Offloading gradient accumulation buffers into remote memory can reduce maximum memory liveness, but can also increase the computation time as the buffers have to be copied to the device, updated and then copied off the device. Requires the machine to be configured with support for Poplar remote buffers. When set to None, the gradient accumulation buffers might be offloaded when beneficial. Note that this option has no effect for inference only pipelines.

  • replicated_weight_sharding – (EXPERIMENTAL) When enabled and running a replicated model, any tf.Variable used by the pipeline stage computations (excluding those only used by the weight update) will be partitioned across the replicas. Whenever a partitioned tf.Variable is accessed, it will first be all-gathered across replicas to make sure each replica has access to the whole tf.Variable. This can exploit the additional bandwidth of the IPU-Links to improve overall throughput. When set to None, the variables might be partitioned when beneficial. This feature is enabled by default when the pipeline schedule is PipelineSchedule.Sequential and batch_serialization_iterations > 1, where this option can reduce the memory usage at the cost of extra communication.

  • offload_weights – (EXPERIMENTAL) When enabled and replicated_weight_sharding is enabled, any tf.Variable which is partitioned across replicas will be stored in Poplar remote buffers. Offloading variables into remote memory can further reduce maximum memory liveness, but can also increase the computation time due to extra communication. When set to None the variables will be placed in either in-processor or remote memory automatically based on the current best placement strategy.

  • continuous_weight_updates – ** CURRENTLY UNIMPLEMENTED ** When training, this option will apply the gradients to the resource variables immediately, rather than accumulating the gradients and applying them at the end of each execution of the pipeline.

  • outfeed_loss – If True, the loss given by the optimizer_function will be enqueued on the outfeed, instead of the outputs from the last computational stage. Cannot be set when outfeed_mask is set.

  • accumulate_outfeed – Data (loss or outputs) is normally enqueued immediately after the last computational stage inside the pipeline. If this option is True, the data will instead be accumulated and only enqueued once at the end of pipeline execution. To use this option, the provided outfeed_queue must be in the IPUOutfeedMode.ALL mode (see IPUOutfeedMode).

  • accumulate_outfeed_dtype

    The data type used for the outfeed accumulation buffers. One of:

    • None: Use an accumulator of the same type as the variable type.

    • A DType: Use this type for all the accumulators.

    • A callable that takes the variable and returns a DType: Allows specifying the accumulator type on a per-variable basis.

  • outfeed_mask – If set, a list of booleans of the same length as the number of outputs from the last computational stage. If outfeed_mask[i] evaluates to False, then the output at that index is enqueued to the outfeed queue, and if it is set to True it is not enqueued. Cannot be set when outfeed_loss is set. Can only be used when optimizer_function has been set.

  • reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • name – name of this pipeline.

Returns

An Operation that executes the pipeline.
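The masking semantics of the outfeed_mask parameter described above can be sketched in plain Python. The helper apply_outfeed_mask is hypothetical and only mirrors the documented rule that an output is enqueued when its mask entry is False; it does not call the IPU API.

```python
def apply_outfeed_mask(outputs, outfeed_mask):
  """Returns the outputs that would be enqueued to the outfeed queue."""
  if len(outputs) != len(outfeed_mask):
    raise ValueError("outfeed_mask needs one entry per output")
  # A True entry masks the output out; a False entry keeps it.
  return [output for output, masked in zip(outputs, outfeed_mask)
          if not masked]

# The last stage of the training example returns (lr, loss); masking out
# the learning rate leaves only the loss to be enqueued.
enqueued = apply_outfeed_mask(["lr", "loss"], outfeed_mask=[True, False])
```

Here enqueued is ["loss"].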

tensorflow.python.ipu.pipelining_ops.recomputation_checkpoint(tensors, name=None