21. TensorFlow Python API

Remember to import the IPU API using:

from tensorflow.python import ipu

You cannot access the IPU API via the top-level tensorflow namespace. For example, this will not work:

import tensorflow as tf
cfg = tf.python.ipu.config.IPUConfig() ...

21.2. Compiler interface

tensorflow.python.ipu.ipu_compiler.compile(computation, inputs=None)

Builds an operator that compiles and runs computation with the Graphcore IPU XLA backend.

Parameters

computation –
A Python function that builds a computation to apply to the input. If the function takes n inputs, inputs should be a list of n tensors.

computation may return a list of operations and tensors. Tensors must come before operations in the returned list. The return value of compile is a list of tensors corresponding to the tensors from the output of computation.

All operations returned from computation will be executed when evaluating any of the returned output tensors.
inputs – A list of inputs or None (equivalent to an empty list). Each input can be a nested structure containing values that are convertible to tensors. Note that passing an N-dimensional list of compatible values will result in an N-dimensional list of scalar tensors rather than a single rank-N tensor. If you need different behaviour, convert part of inputs to tensors with tf.convert_to_tensor.

Returns

The same data structure as if computation(inputs) were called directly, with the following exceptions for correctness:

None output: a NoOp is returned which control-depends on computation.
Single value output: a tuple containing the value is returned.
Operation-only outputs: a NoOp is returned which control-depends on computation.

Raises

Exception – If the computation was not compiled for an IPU device.
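
The output rules above can be modelled without TensorFlow. The following is a pure-Python sketch, not the library's implementation: Tensor, Operation and model_compile_outputs are hypothetical stand-ins introduced for illustration, and the string "NoOp" stands in for the real no-op operation.

```python
class Tensor:
    """Hypothetical stand-in for a tf.Tensor."""


class Operation:
    """Hypothetical stand-in for a tf.Operation."""


def model_compile_outputs(outputs):
    """Model the documented structure of the value returned by compile()."""
    # None output: a NoOp is returned.
    if outputs is None:
        return "NoOp"
    if not isinstance(outputs, (list, tuple)):
        outputs = [outputs]
    # Tensors must come before operations in the returned list.
    seen_operation = False
    for item in outputs:
        if isinstance(item, Operation):
            seen_operation = True
        elif seen_operation:
            raise ValueError("tensors must come before operations")
    tensors = [item for item in outputs if not isinstance(item, Operation)]
    # Operation-only outputs: a NoOp is returned.
    if not tensors:
        return "NoOp"
    # Single value output: a tuple containing the value is returned.
    if len(tensors) == 1:
        return (tensors[0],)
    # Otherwise only the tensors are returned; operations are dropped.
    return tensors
```

In this model, operations never appear in the result; in the real API they are still executed whenever one of the returned tensors is evaluated.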

21.3. Scoping contexts

tensorflow.python.ipu.scopes.frontend_attribute(attribute_name, attribute_value, restore_to=None)

Sets the specified scope attribute to the specified value in the graph.

Parameters

attribute_name – Name of the attribute.
attribute_value – Attribute’s value as a string.
restore_to – If the attribute would be undefined at the end of the scope, it is set to this value instead.

Returns

A context
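
The interaction between the attribute value and restore_to can be illustrated with a plain dictionary in place of the graph. This is only a sketch under that assumption: toy_frontend_attribute and the attribute name "mode" are hypothetical, and the real scope operates on graph attributes rather than a dict.

```python
from contextlib import contextmanager


@contextmanager
def toy_frontend_attribute(attributes, name, value, restore_to=None):
    """Set attributes[name] for the duration of the scope (toy model)."""
    previous = attributes.get(name)
    attributes[name] = value
    try:
        yield
    finally:
        if previous is not None:
            # An enclosing scope had set the attribute: restore its value.
            attributes[name] = previous
        elif restore_to is not None:
            # The attribute would be undefined: use restore_to instead.
            attributes[name] = restore_to
        else:
            attributes.pop(name, None)


attrs = {}
with toy_frontend_attribute(attrs, "mode", "fast", restore_to="default"):
    inside = attrs["mode"]
after = dict(attrs)
```

Here inside is "fast" within the scope, and after the scope exits the attribute holds "default" rather than being removed.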

tensorflow.python.ipu.scopes.ipu_jit_scope(ipu_scope)

Provides a scope for compilation of operations.

If you would like to compile several sets of operations together, then this can provide that mechanism.

Parameters: ipu_scope – A name to differentiate between different JIT scopes
Returns: A context

tensorflow.python.ipu.scopes.ipu_scope(device)

Provides a scope for placing operations onto a particular IPU/IPU cluster.

Parameters: device – The name of the TensorFlow device, such as ‘/device:IPU:0’
Returns: A context

tensorflow.python.ipu.scopes.ipu_shard(index)

Control sharding for a set of operations.

Provides a scope which targets operations onto a particular shard (IPU) of a multi-IPU sharded device. Gradients created from these operations will also be put onto the same shard. Consequently an ipu_shard scope enclosing a call to tf.gradients or tf.GradientTape.gradient won’t change the sharding of the backwards ops.

Parameters: index – The index of the IPU on which to place the enclosed operations.
Returns: A context

tensorflow.python.ipu.scopes.outside_compilation_scope(name='outside')

Provides a scope for placing operations on the host, outside the current compilation scope. The operations will be placed on the default host device. This allows for offloading computations from the IPU to the host, which can be useful for operations that are not supported or suitable for execution on the IPU.

Example:

from tensorflow.python.ipu.scopes import ipu_scope, outside_compilation_scope

def my_net(a):
  with ipu_scope("/device:IPU:0"):
    b = a * a
    with outside_compilation_scope():
      c = b + 2  # Placed on the host.
    d = b + c
    return d

Parameters: name – A name for the outside compilation scope.
Returns: A context

tensorflow.python.ipu.scopes.partials_type(override_type)

Override the default type used to store intermediate results by convolution and matrix multiply operations.

EXPERIMENTAL - there are no guarantees that the partials type provided will be used and therefore this should not be used.

Parameters: override_type – Numpy type of the partials (float16 or float32)
Returns: A context

tensorflow.python.ipu.scopes.stochastic_rounding(override)

Control stochastic rounding for a set of operations.

EXPERIMENTAL - there are no guarantees that the stochastic rounding provided will be used and therefore this should not be used.

Parameters: override – if True then stochastic rounding will be used, otherwise it will be disabled for this set of operations.
Returns: A context

21.4. Infeed queue

class tensorflow.python.ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, device_ordinal=0, prefetch_depth=None, optimise_latency=False)

Wraps a tf.data.Dataset object with infeed operations specific to the IPU.

This class, along with tensorflow.python.ipu.loops, is used to create a data pipeline from a dataset into a training/inference loop on the IPU inside a single session.run, which reduces the overheads of calling session.run for each iteration of the loop.

You should pass the infeed queue as an argument to a loop from tensorflow.python.ipu.loops. These loops will then handle the dequeuing of the data to the device automatically.

The following skeleton shows how to use this method when building a training loop. Note how the body signature contains variables which correspond to the nested structure of tf.Tensor objects representing the next element in the infeed queue:

# Create an example dataset.
dataset = ...  # A `tf.data.Dataset` object.

def dataset_parser(value):
  features, labels = parse_record(value)
  return {"features": features,
          "labels": labels}
# The resulting dataset has a nested structure of: {features, labels}.
dataset = dataset.map(dataset_parser)

infeed_queue = ipu.ipu_infeed_queue.IPUInfeedQueue(dataset)

# dataset can no longer be used beyond this point.

def my_net():
  # Note how the nested structure forms part of the loop body signature.
  def body(loss, features, labels):
    with variable_scope.variable_scope("vs", use_resource=True):
      y = tf.nn.conv2d(features, .....)
      ...
      ...
      logits = tf.nn.xw_plus_b(....)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits, labels=labels))
    optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
    train = optimizer.minimize(loss)
    with ops.control_dependencies([train]):
      return array_ops.identity(loss)

  loss = 0.0
  return ipu.loops.repeat(10000, body, [loss], infeed_queue)

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu.ipu_compiler.compile(my_net, inputs=[])

with tf.Session() as sess:
  sess.run(infeed_queue.initializer)
  sess.run(variables.global_variables_initializer())
  result = sess.run(res)

__init__(dataset, device_ordinal=0, prefetch_depth=None, optimise_latency=False)

Creates an IPUInfeedQueue object.

Parameters

dataset – a tf.data.Dataset object. All transformations (for example shuffle, repeat, batch) must be applied before passing it to this function. This dataset can no longer be used after creating this queue.
device_ordinal – ordinal of the IPU device on which this queue will be used. By default the queue will be used on “/device:IPU:0”.
prefetch_depth – the number of elements Poplar will prefetch, that is, the depth of the Poplar datastream buffer which may be filled before being read by the device. By default the prefetch depth is determined automatically (currently defaults to 3). Increasing prefetch_depth allows multiple entries to be prefetched, increasing the probability that there will be a valid entry in the buffer for the device to read before falling back to synchronously fetching the next entry. This value must be greater than zero.
optimise_latency – Prioritise packet reduction to try to speed up the host transfer. This has the downside that it will introduce an extra copy and so should only be used on small exchanges that will produce lots of packets.

Raises

ValueError – if any dimension of the shapes in dataset.output_shapes is not fully defined. tf.data.Dataset.batch must be called with drop_remainder=True to ensure that the batch size is constant.
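
The reason for the drop_remainder=True requirement can be seen by counting batch sizes. The sketch below is plain Python, not TensorFlow; batch_sizes is a hypothetical helper that lists the size of each batch produced from a finite dataset. Without drop_remainder the final batch may be smaller, so the batch dimension cannot be a fixed, fully defined value.

```python
def batch_sizes(num_elements, batch_size, drop_remainder):
    """Return the size of every batch produced from num_elements items."""
    full_batches, remainder = divmod(num_elements, batch_size)
    sizes = [batch_size] * full_batches
    if remainder and not drop_remainder:
        # The trailing partial batch is kept, so batch sizes differ.
        sizes.append(remainder)
    return sizes


variable = batch_sizes(10, 4, drop_remainder=False)
constant = batch_sizes(10, 4, drop_remainder=True)
```

With 10 elements and a batch size of 4, keeping the remainder yields a last batch of 2, while dropping it leaves only batches of 4.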

property deleter

A tf.Operation that can be run to delete the resources owned by this IPUInfeedQueue. This allows creating a new IPUInfeedQueue with the same name afterwards.

Returns: A tf.Operation that can be run to delete this IPUInfeedQueue

property dequeued

Returns whether this queue has been dequeued.

Returns: A bool indicating whether this queue has been dequeued.

get_next(): Obsolete function.

property initializer

A tf.Operation that should be run to initialize this IPUInfeedQueue.

Returns: A tf.Operation that should be run to initialize this IPUInfeedQueue
Raises: ValueError – if the function initializer has already been called.

property number_of_tuple_elements: Returns the number of arguments supplied by this IPUInfeedQueue.

21.5. Outfeed queue

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedMode(value)

Types used to control the IPUOutfeedQueue modes.

Contains the following values:

ALL - When used with an IPUOutfeedQueue, all the elements which were enqueued to the queue will be returned by the outfeed.
LAST - When used with an IPUOutfeedQueue, only the last element which was enqueued to the queue will be returned by the outfeed.

class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueue(outfeed_mode=None, device_ordinal=0, buffer_depth=3, optimise_latency=False)

Generates and adds outfeed enqueue/dequeue operations to the graph.

An outfeed is the counterpart to an infeed and manages the transfer of data (like tensors, tuples or dictionaries of tensors) from the IPU graph to the host.

The queue has two modes of operation - outfeed all or outfeed last. In outfeed all mode every element that is enqueued will be stored for a subsequent dequeue. All of the enqueued elements will be returned when the dequeue operation is run. This is the default behaviour.

In outfeed last mode only the last enqueued element is stored. The dequeue operation will in this case return a single element.
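
The difference between the two modes can be sketched with a toy host-side model. ToyOutfeedMode and ToyOutfeedQueue are hypothetical names used only for this illustration; they mimic the storage behaviour described above and do not model devices, replication or when the real queue is emptied.

```python
from enum import Enum


class ToyOutfeedMode(Enum):
    ALL = 1
    LAST = 2


class ToyOutfeedQueue:
    """Toy model of which enqueued elements a dequeue returns."""

    def __init__(self, outfeed_mode=None):
        # As in the documented API, no mode means "outfeed all".
        self.mode = outfeed_mode or ToyOutfeedMode.ALL
        self._elements = []

    def enqueue(self, element):
        if self.mode is ToyOutfeedMode.LAST:
            # Only the most recent element is stored.
            self._elements = [element]
        else:
            self._elements.append(element)

    def dequeue(self):
        if self.mode is ToyOutfeedMode.LAST:
            # Fails if nothing has been enqueued yet.
            return self._elements[-1]
        return list(self._elements)


all_queue = ToyOutfeedQueue()
last_queue = ToyOutfeedQueue(ToyOutfeedMode.LAST)
for step in range(1, 4):
    all_queue.enqueue(step)
    last_queue.enqueue(step)
```

After three enqueues, the outfeed all queue returns every element in order, while the outfeed last queue returns only the final one.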

__init__(outfeed_mode=None, device_ordinal=0, buffer_depth=3, optimise_latency=False)

Creates an IPUOutfeedQueue object.

Parameters

outfeed_mode – ipu_outfeed_queue.IPUOutfeedMode type used to control the outfeed behaviour. If not specified then all elements will be returned by the outfeed when the dequeue operation is run.
device_ordinal – ordinal of the IPU device on which this queue will be used. By default the queue will be used on “/device:IPU:0”.
buffer_depth – The maximum number of elements Poplar can buffer in external memory before blocking the device.
optimise_latency – Prioritise packet reduction to try to speed up the host transfer. This has the downside that it will introduce an extra copy and so should only be used on small exchanges that will produce lots of packets.

Raises

ValueError – if the types or values are incorrect

property deleter

A tf.Operation that can be run to delete the resources owned by this IPUOutfeedQueue. This allows creating a new IPUOutfeedQueue with the same name afterwards. The behaviour is undefined if this op is executed concurrently with the dequeue op.

Returns: A tf.Operation that can be run to delete this IPUOutfeedQueue

dequeue(wait_for_completion=False)

Generate host side operation to dequeue the outfeed values.

Parameters: wait_for_completion – whether the dequeueing operation should wait for the current execution of a graph containing the outfeed enqueue to complete. Defaults to False which means that only the tensors which have already been enqueued will be returned.

The return value of this operation depends on the enqueued tensors, the replication factor and the execution mode. The replication factor is determined by the model.

Note: If the TF_POPLAR_FLAGS environment variable contains the flag --use_synthetic_data then no data will be returned to the host. If outfeed_mode is IPUOutfeedMode.ALL then empty arrays with the same element structure as the enqueued tensors are returned. If outfeed_mode is IPUOutfeedMode.LAST then running the dequeue operation will throw an exception (there is no last element in this case).

Examples:

Outfeed returning a single tensor:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
  output = input + 1
  outfeed = outfeed_queue.enqueue(output)
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, (input))
  return r

# The placeholder must exist before it is passed to compile.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example the tensor output is of shape [4, 4] and it is enqueued into the outfeed. If the outfeed_mode is IPUOutfeedMode.ALL, and the model has a replication factor of 2 then the shape of the resulting outfed tensor will be [20, 2, 4, 4], where the first dimension represents the number of times we have enqueued a tensor to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed. The second dimension is the replication factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is IPUOutfeedMode.LAST, then the shape of the resulting outfed tensor will be [2, 4, 4], which represents the value of the output tensor the last time it was enqueued during execution for each of the replicated graphs.

Outfeed returning a tuple of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue((output, sum))
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(20, body, (input))
  return r

# The placeholder must exist before it is passed to compile.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a tuple of tensors, output and sum, where the former is of shape [4, 4] and the latter [1]. If the outfeed_mode is IPUOutfeedMode.ALL and the model has a replication factor of 1, then the resulting outfed is a two-tuple of tensors with shapes ([20, 4, 4], [20, 1]), where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed. In this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed for each of the tensors in the tuple. If the outfeed_mode is IPUOutfeedMode.LAST, then outfed is a two-tuple of tensors with shapes ([4, 4], [1]), which represents the values of the output and sum tensors the last time they were enqueued during execution.

Note that replication factor here is 1, which means that the extra replication dimension is not added.

Outfeed returning a dictionary of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
  output = input + 1
  sum = tf.reduce_sum(output)
  outfeed = outfeed_queue.enqueue({"x": output,
                                   "y": sum})
  return (output, outfeed)

def my_net(input):
  r = loops.repeat(40, body, (input))
  return r

# The placeholder must exist before it is passed to compile.
with ops.device('cpu'):
  v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()
with tf.Session() as sess:
  result = sess.run(res, {v:np.ones([4, 4], np.float32)})
  outfed = sess.run(outfeed)

In this example we outfeed a dictionary of tensors, output and sum, where the former is of shape [4, 4] and the latter [1]. If the outfeed_mode is IPUOutfeedMode.ALL and the model has a replication factor of 8, then the resulting outfed is a dictionary of tensors with shapes: {“x”: [40, 8, 4, 4], “y”: [40, 8, 1]}, where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed. In this example the loop is repeated 40 times, and therefore we get 40 values back from the outfeed for each of the tensors in the dictionary. The second dimension is the replication factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is IPUOutfeedMode.LAST, then outfed is a dictionary of tensors with shapes: {“x”: [8, 4, 4], “y”: [8, 1]}, which represents the values of the output and sum tensors the last time they were enqueued during execution for each of the replicated graphs.
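
The shape rules used in the three examples above can be summarised in a few lines of plain Python. This is a sketch of the arithmetic only: outfed_shape and outfed_structure are hypothetical helpers, shapes are written as lists, tuples and dictionaries are treated as containers, and the mode is passed as the string "ALL" or "LAST".

```python
def outfed_shape(element_shape, num_enqueues, replication_factor, mode):
    """Shape of one dequeued tensor, following the rules described above."""
    prefix = []
    if mode == "ALL":
        # One entry per enqueue.
        prefix.append(num_enqueues)
    elif mode != "LAST":
        raise ValueError("mode must be 'ALL' or 'LAST'")
    if replication_factor > 1:
        # The replication dimension is only added when replicating.
        prefix.append(replication_factor)
    return prefix + list(element_shape)


def outfed_structure(structure, num_enqueues, replication_factor, mode):
    """Apply outfed_shape to every shape in a tuple or dictionary."""
    args = (num_enqueues, replication_factor, mode)
    if isinstance(structure, dict):
        return {k: outfed_structure(v, *args) for k, v in structure.items()}
    if isinstance(structure, tuple):
        return tuple(outfed_structure(v, *args) for v in structure)
    return outfed_shape(structure, *args)


single = outfed_structure([4, 4], 20, 2, "ALL")
pair = outfed_structure(([4, 4], [1]), 20, 1, "ALL")
named = outfed_structure({"x": [4, 4], "y": [1]}, 40, 8, "LAST")
```

These calls reproduce the shapes quoted in the text: [20, 2, 4, 4] for the single tensor, ([20, 4, 4], [20, 1]) for the tuple with no replication, and {"x": [8, 4, 4], "y": [8, 1]} for the dictionary in outfeed last mode.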

enqueue(tensors)

Enqueue a tensor, tuple or a dictionary of tensors to be outfed from the IPU graph. This operation is placed on the IPU device. This function returns an Operation which needs to be executed (by either returning it or using tf.control_dependencies(…)).

Examples:

Outfeed returning a single tensor:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
  v = v + 1
  outfeed = outfeed_queue.enqueue(v)
  return (v, outfeed)

def my_net(v):
  r = loops.repeat(20, body, (v))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

...
...

Outfeed returning a tuple of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
  v = v + 1
  x = v * 2
  outfeed = outfeed_queue.enqueue((v, x))
  return (v, outfeed)

def my_net(v):
  r = loops.repeat(20, body, (v))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

...
...

Outfeed returning a dictionary of tensors:

outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
  v = v + 1
  x = v * 2
  outfeed = outfeed_queue.enqueue({"output_1": v,
                                   "output_2": x})
  return (v, outfeed)

def my_net(v):
  r = loops.repeat(20, body, (v))
  return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])

...
...

21.6. General utilities

tensorflow.python.ipu.utils.export_dataset_to_file(dataset_or_infeed, output_filename, num_elements, feed_name='', apply_options=True)

Export as binary num_elements from the given infeed to the specified output_filename.

If the infeed elements are tuples then one file per tuple element will be created. For example, if dataset looks like

[{ "a": A_0, "b": B_0}, { "a": A_1, "b": B_1}, ...]

then export_dataset_to_file(dataset, "my_dataset.bin", 100) will generate:

my_dataset.0.bin   # Contains tensors [ A_0, A_1, ..., A_99]
my_dataset.1.bin   # Contains tensors [ B_0, B_1, ..., B_99]
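
The file layout in this example can be sketched in plain Python. Both helpers are hypothetical and inferred only from the example above: exported_filenames derives the per-element file names from output_filename, and group_by_position shows how the values of each element position end up together. Neither writes any file, and the real function may behave differently in cases the example does not cover.

```python
import os


def exported_filenames(output_filename, num_tuple_elements):
    """File names following the my_dataset.<index>.bin pattern shown above."""
    root, extension = os.path.splitext(output_filename)
    return [f"{root}.{index}{extension}" for index in range(num_tuple_elements)]


def group_by_position(elements):
    """Collect the values of each key, in order, across all elements."""
    keys = list(elements[0])
    return [[element[key] for element in elements] for key in keys]


names = exported_filenames("my_dataset.bin", 2)
columns = group_by_position([{"a": "A_0", "b": "B_0"},
                             {"a": "A_1", "b": "B_1"}])
```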

Parameters

dataset_or_infeed – A unary dataset with the same input and output structure or an IPUInfeedQueue.
output_filename – Where to export the tensors to.
num_elements – Number of elements to export from the dataset.
feed_name – Specify the feed name.
apply_options – Whether to apply optimization options which can improve the dataset performance.

tensorflow.python.ipu.utils.export_inputs_to_file(inputs, output_filename, feed_dict)

Export as binary the list of inputs provided to the specified output_filename.

Parameters

inputs – List of graph inputs to export.
output_filename – Where to export the tensors to.
feed_dict – Feed dictionary containing the inputs’ values.

tensorflow.python.ipu.utils.get_num_of_ipus_in_device(ipu_device, device='cpu')

Get the number of physical IPUs.

Parameters

ipu_device – The IPU device for which to get the number of IPUs.
device – The CPU device which is local to the IPU hardware.

Returns

The number of physical IPUs configured for a particular TF device.

tensorflow.python.ipu.utils.move_variable_initialization_to_cpu(graph=None)

For all variables in the VARIABLES collection, move any initialization ops onto the CPU.

Parameters: graph – Operations are moved around on this graph. The default graph will be used if not specified.
Returns: None

tensorflow.python.ipu.utils.reset_ipu_seed(seed, device='/device:IPU:0', cpu_device='cpu', experimental_identical_replicas=False)

Reset the seed used to generate stateful random numbers and perform stochastic rounding.

Parameters

seed – The new random number generator seed.
device – The device to which the seed will be applied.
cpu_device – The CPU device which is on the same hardware as the IPU device.
experimental_identical_replicas – Whether to seed all the local replicas identically. Note that to generate identical sequences of random numbers on all replicas, the Poplar engine option "target.deterministicWorkers" must also be set to "portable". Also note that for multi-replica distribution with multiple processes, the same seed must be passed to each process to ensure that all the replicas globally get the same seed. WARNING: This flag is experimental and subject to change.

Returns

None

tensorflow.python.ipu.utils.running_on_ipu_model()

Check if XLA is configured to run on the IPU Model.

Returns: True if XLA is configured to run on the IPU Model. False if XLA is configured to run on real hardware.

tensorflow.python.ipu.utils.use_synthetic_data_for(synthetic_data_category)

Get whether synthetic data is being used for the given category.

Parameters: synthetic_data_category – A SyntheticDataCategory enum value.
Returns: A bool indicating the result.

21.7. Configuration utilities

class tensorflow.python.ipu.config.DeviceConnectionType(value)

Enumeration to describe the mechanism used to attach to the Poplar device.

ALWAYS indicates that the system will attach when configuring the device.
ON_DEMAND will defer connection to when the IPU is needed.
PRE_COMPILE will never try to attach to a device and anything which is meant to be executed on the device will return all zeros. Used to pre-compile Poplar programs on machines without IPUs. For more information, see Pre-compiling executables.
NEVER will never try to attach to a device.

class tensorflow.python.ipu.config.ExecutionProfileType(value)

The execution profile type indicates the desired information in the execution profile.

NO_PROFILE indicates that there should be no execution profiling.
DEVICE_PROFILE indicates that the execution profile should contain only device wide events.
IPU_PROFILE indicates that the profile should contain IPU level execution events.
TILE_PROFILE indicates that the profile should contain Tile level execution events.

class tensorflow.python.ipu.config.MergeRemoteBuffersBehaviour(value)

The remote buffers merging behaviour indicates when or if compatible remote buffers should be merged.

NO_MERGING indicates that there should be no merging.
MERGE indicates that all compatible remote buffers will be merged.
IF_BENEFICIAL indicates that compatible remote buffers will only be merged when it is considered beneficial for code re-use.

class tensorflow.python.ipu.config.SchedulingAlgorithm(value)

Controls the algorithm that the scheduler uses.

CHOOSE_BEST compares several of the scheduling algorithms below and selects the one that leads to the lowest predicted overall peak liveness. This can sometimes produce incorrect results because the overall peak liveness isn’t always a good measure for the maximum liveness on one tile of the processor.
CLUSTERING groups clusters of operations together in order to look through stretches of instructions with potentially high liveness.
POST_ORDER schedules the instructions in the order which is obtained by walking the graph in ‘post order’.
LOOK_AHEAD looks ahead a number of operations from any schedulable one, as given by the maximum scheduler lookahead depth and maximum scheduler search space size options. It attempts to look through areas of high liveness.
SHORTEST_PATH gives priority to the shortest path to the root.
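
As an illustration of what a post-order walk produces, the sketch below runs a generic depth-first post-order traversal over a toy dependency graph. It is not the scheduler used by the IPU backend: post_order, the instruction names and the operand visiting order are all assumptions made for the example.

```python
def post_order(graph, root):
    """Visit every operand of an instruction before the instruction itself.

    graph maps an instruction name to the list of its operand names.
    """
    visited = set()
    order = []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        for operand in graph.get(node, []):
            visit(operand)
        # All operands are scheduled, so the instruction can follow.
        order.append(node)

    visit(root)
    return order


# mul = (a + b) * c
toy_graph = {"add": ["a", "b"], "mul": ["add", "c"]}
schedule = post_order(toy_graph, "mul")
```

For this graph the schedule is a, b, add, c, mul: each instruction appears only after everything it depends on.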

class tensorflow.python.ipu.config.SelectionOrder(value)

Depending on the communication pattern of the model, the order in which the IPUs are selected and mapped to shards can impact the performance.

For example, given a model which executes on multiple IPUs:

def sharded_graph(pa, pb, pc, pd):
  with ipu.scopes.ipu_shard(0):
    o1 = pa + pb
  with ipu.scopes.ipu_shard(1):
    o2 = o1 + pc
  with ipu.scopes.ipu_shard(2):
    o3 = o2 + pd
    return o3

and a Graphcore Pod system with 16 IPUs:

 _______               _______
|       |             |       |
|  14   |=============|  15   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  12   |=============|  13   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  10   |=============|  11   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   8   |=============|   9   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   6   |=============|   7   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   4   |=============|   5   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   2   |=============|   3   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   0   |=============|   1   |
|_______|             |_______|

Here, each numbered square represents an IPU with the given device ID and the == and || connections represent IPUs directly connected via IPU-Links.

We can see that the ipu_shard(0) directly communicates with ipu_shard(1) and that ipu_shard(1) directly communicates with ipu_shard(2).

If the shards 0, 1, 2 were mapped to IPUs 0, 1, 2 in that order, then the communication between shards 1 and 2 would not have a direct connection via an IPU-Link and would have to perform a “hop” through an intermediate IPU.

If the shards 0, 1, 2 were mapped to IPUs 0, 1, 3 in that order, then the communication between shards 1 and 2 would have a direct connection via an IPU-Link, which will reduce the communication cost.

This enumeration is used to control the order in which the IPUs are selected. Currently, the following IPU selection orderings are supported:

AUTO: automatically try and select the best selection given the network.
ZIGZAG: follow the natural ordering of IPUs. In the above example, the IPUs would be selected in the following order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.
SNAKE: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after. In the above example, the IPUs would be selected in the following order: 0, 1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14.
HOOF: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after, and the last and first shard are on adjacent IPUs. In the above example, the IPUs would be selected in the following order: 0, 2, 4, 6, 8, 10, 12, 14, 15, 13, 11, 9, 7, 5, 3, 1.

The SNAKE and HOOF IPU selection orders are particularly beneficial for pipelined models.
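
The three explicit orderings can be reproduced for the 16-IPU ladder in the diagram with a short pure-Python sketch. The helper names are hypothetical and the linked function encodes only the topology drawn above (IPUs 2k and 2k+1 share a row, and rows are joined column-wise); it does not query real hardware.

```python
NUM_IPUS = 16  # Must be even: the ladder has two IPUs per row.


def linked(a, b):
    """True if IPUs a and b are directly connected in the ladder diagram."""
    same_row = a != b and a // 2 == b // 2
    same_column = a % 2 == b % 2 and abs(a - b) == 2
    return same_row or same_column


def zigzag(n=NUM_IPUS):
    return list(range(n))


def snake(n=NUM_IPUS):
    order = []
    for row in range(n // 2):
        pair = [2 * row, 2 * row + 1]
        # Reverse every other row so consecutive IPUs stay linked.
        order.extend(pair if row % 2 == 0 else pair[::-1])
    return order


def hoof(n=NUM_IPUS):
    # Up the left column, then back down the right column.
    return list(range(0, n, 2)) + list(range(n - 1, 0, -2))


def all_consecutive_linked(order):
    return all(linked(a, b) for a, b in zip(order, order[1:]))
```

Checking the results confirms the statements above: every consecutive pair in snake() and hoof() is directly linked, the first and last IPUs of hoof() are adjacent, and zigzag() breaks the chain between IPUs 1 and 2.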

class tensorflow.python.ipu.config.StochasticRoundingBehaviour(value)

Controls how stochastic rounding is performed.

OFF disables stochastic rounding.
ON enables stochastic rounding.
REPLICA_IDENTICAL_ONLY enables stochastic rounding for portions of the graph which are identified as being replica identical, meaning that when executed with replication they produce the same result on each replica.

tensorflow.python.ipu.config.configure_ipu_system(config, device='cpu', reset_configuration=True)

Configure an IPU system with an IPUConfig or IpuOptions instance.

Parameters

config – An IPUConfig instance or IpuOptions configuration protobuf.
device – The TensorFlow virtual CPU device which is local to the IPU hardware.
reset_configuration – Whether to reset any existing IPU configurations.

Returns

None

tensorflow.python.ipu.config.get_ipu_config(session=None)

Get the configuration of an IPU system.

Parameters: session – An optional session on which to execute.
Returns: A list of IpuOptions instances, one for each PoplarExecutor.

tensorflow.python.ipu.config.reset_ipu_configuration()

Reset the IPU configuration in preparation for it to be reconfigured. This blocks until all currently configured IPU devices have finished executing.

Note that this function does not currently support resetting IPUs that are running in parallel Python threads.

class tensorflow.python.ipu.config.AttributeMetadata

check_type(value)

Checks if value is one of the allowed types for this option. Throws a TypeError if not.

Parameters: value – The value to check against this attribute’s type.
Returns: True if value satisfies this attribute’s type.

property default: The default value for this option. Categories themselves do not have default values.

property deprecated: Whether or not this option/category is deprecated.

property deprecated_msg: The deprecation message for this attribute. None if it is not deprecated.

property name: The full name of the option/category, relative to the config structure’s root.

property type: The type of this option, as a string. The type can be a simple Python type or a type hint. Categories themselves do not have types.

warn_if_deprecated(): Outputs a log warning if this option/category is deprecated.

class tensorflow.python.ipu.config.IPUConfig

allow_recompute: bool = False: Whether or not to recompute instructions during training. If this is enabled then we will attempt to pattern match instructions/pipeline stages in the forward pass and recompute them in the backward pass to avoid having to preserve activations which increase the maximum memory liveness. Enabling this option can reduce memory usage at the expense of extra computation. Stateful operations cannot be recomputed.

selection_order: SelectionOrder = SelectionOrder.AUTO: The order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices. Must be one of SelectionOrder.

serialization_output_folder: str = "": Specifies the directory in which serialized Poplar executables will be saved. The value must be a valid path. The default (“”) disables executable serialization.

compilation_poplar_options: dict = {}: Set the Poplar compilation options for the session. Must be a dictionary of valid Poplar compilation flags. See the Engine class in the Poplar API reference for the full list of options.

gcl_poplar_options: dict = {}: Set the IPU options for the Graphcore Communication Library. Must be a dictionary of valid GCL options. See the allReduce function in the GCL API reference for the full list of options. The options will be applied to all applicable GCL collective operations in the graph during compilation.

auto_select_ipus: Union[int, List[int], Tuple[int, ...]] = []

Configure the IPUs to be used by the session. The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The devices will be labelled /device:IPU:0, /device:IPU:1 and so on.

Each device can control a specific number of IPUs, given by the corresponding value in auto_select_ipus. The system will automatically select IPU configurations from the available IPUs, where they match the desired number of IPUs.

Examples:

config = IPUConfig()

# Create a single TensorFlow device, with one IPU
config.auto_select_ipus = 1

# Create two TensorFlow devices, with two IPUs per device.
config.auto_select_ipus = [2, 2]

# Create two TensorFlow devices, with one IPU in the first device and two
# IPUs in the second device.
config.auto_select_ipus = [1, 2]
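As a rough illustration of the labelling rule described above, the following sketch expands a value of auto_select_ipus into a mapping from TensorFlow device label to requested IPU count. The helper describe_auto_select is hypothetical and is not part of the IPU API. It uses plain Python only, so it can be run without IPU hardware.

```python
from typing import Dict, List, Tuple, Union


def describe_auto_select(
        value: Union[int, List[int], Tuple[int, ...]]) -> Dict[str, int]:
    """Map each TensorFlow device label to its requested number of IPUs."""
    # A bare integer is treated as a single device with that many IPUs.
    counts = [value] if isinstance(value, int) else list(value)
    return {f"/device:IPU:{i}": n for i, n in enumerate(counts)}


print(describe_auto_select(1))
print(describe_auto_select([1, 2]))
```

The second call corresponds to the last example above: one IPU in /device:IPU:0 and two IPUs in /device:IPU:1.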

select_ipus: Union[int, List[int], Tuple[int, ...]] = []

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The TensorFlow devices will be labelled /device:IPU:0, /device:IPU:1 and so on.

Each TensorFlow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

user@host:~$ gc-info -l
Graphcore device listing:

-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id:  [2], type:      [PCIe], PCI Domain: [0000:23:00.0]
-+- Id:  [3], type:      [PCIe], PCI Domain: [0000:24:00.0]
-+- Id:  [4], type:      [PCIe], PCI Domain: [0000:3d:00.0]
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:3e:00.0]
-+- Id:  [6], type:      [PCIe], PCI Domain: [0000:43:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:44:00.0]
-+- Id:  [8], type:      [PCIe], PCI Domain: [0000:8b:00.0]
-+- Id:  [9], type:      [PCIe], PCI Domain: [0000:8c:00.0]
-+- Id: [10], type:      [PCIe], PCI Domain: [0000:8e:00.0]
-+- Id: [11], type:      [PCIe], PCI Domain: [0000:8f:00.0]
-+- Id: [12], type:      [PCIe], PCI Domain: [0000:b8:00.0]
-+- Id: [13], type:      [PCIe], PCI Domain: [0000:b9:00.0]
-+- Id: [14], type:      [PCIe], PCI Domain: [0000:ba:00.0]
-+- Id: [15], type:      [PCIe], PCI Domain: [0000:bb:00.0]
-+- Id: [16], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
-+- Id: [17], type: [Multi IPU]
|--- PCIe Id:  [4], DNC Id: [0], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:43:00.0]
-+- Id: [18], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
-+- Id: [19], type: [Multi IPU]
|--- PCIe Id:  [2], DNC Id: [0], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [20], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
-+- Id: [21], type: [Multi IPU]
|--- PCIe Id: [12], DNC Id: [0], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:ba:00.0]
-+- Id: [22], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
-+- Id: [23], type: [Multi IPU]
|--- PCIe Id: [10], DNC Id: [0], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [1], PCI Domain: [0000:8b:00.0]
-+- Id: [24], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
-+- Id: [25], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [2], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [26], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
-+- Id: [27], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [2], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:8b:00.0]
-+- Id: [28], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [29], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [4], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [5], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [6], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [7], PCI Domain: [0000:8b:00.0]
-+- Id: [30], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
|--- PCIe Id: [13], DNC Id: [8], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [9], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [10], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [11], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [12], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [13], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [14], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [15], PCI Domain: [0000:8b:00.0]

Examples based on the listing above:

config = IPUConfig()

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:1a:00.0 by using IPU configuration index 0
config.select_ipus = 0

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:8b:00.0 by using IPU configuration index 8
config.select_ipus = 8

# Create two TensorFlow devices, with one IPU each, being devices at
# indices 0 and 1
config.select_ipus = [0, 1]

# Create two TensorFlow devices, with four IPUs each, using the device
# configurations at indices 24 (0000:3e:00.0, 0000:44:00.0,
# 0000:3d:00.0, 0000:43:00.0) and 25 (0000:24:00.0, 0000:1b:00.0,
# 0000:23:00.0, 0000:1a:00.0)
config.select_ipus = [24, 25]

# Create four TensorFlow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:23:00.0, 0000:24:00.0.
config.select_ipus = [0, 1, 2, 3]
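To find suitable configuration indices programmatically, the listing can be parsed. The sketch below is a minimal example that assumes the output has exactly the layout shown in the listing above. The names parse_listing, HEADER and MEMBER are hypothetical, and the sample text is a short excerpt of that listing rather than live gc-info output.

```python
import re
from typing import Dict, List

# Excerpt of the listing above (two single IPUs and one two-IPU configuration).
LISTING = """\
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:3e:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:44:00.0]
-+- Id: [16], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
"""

HEADER = re.compile(r"^-\+- Id:\s*\[(\d+)\], type:\s*\[([^\]]+)\]")
MEMBER = re.compile(r"^\|--- PCIe Id:\s*\[(\d+)\]")


def parse_listing(text: str) -> Dict[int, List[int]]:
    """Return {configuration index: [physical IPU ids]}."""
    configs: Dict[int, List[int]] = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = int(header.group(1))
            # A PCIe entry is a single IPU whose id equals the index.
            configs[current] = [current] if header.group(2) == "PCIe" else []
            continue
        member = MEMBER.match(line)
        if member and current is not None:
            configs[current].append(int(member.group(1)))
    return configs


configs = parse_listing(LISTING)
two_ipu_configs = [idx for idx, ipus in configs.items() if len(ipus) == 2]
print(two_ipu_configs)
```

An index found this way could then be assigned to select_ipus in the same way as in the examples above.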

convolutions

Sub-category containing configuration options that affect convolutions.

convolutions.poplar_options: dict = {}

Set the PopLibs convolution options for the session. Must be a dictionary of valid PopLibs convolution options. See createWeights in the PopLibs API reference for the full list of options. The options will be applied to all convolution operations in the session graph during compilation.

Of particular note is the availableMemoryProportion parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.

See the technical note on Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU for more details and for some practical examples of using availableMemoryProportion.
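The following sketch shows one way to build such an options dictionary. The helper memory_proportion_options is hypothetical. It assumes that option values are passed as strings and that the proportion lies between 0 (exclusive) and 1 (inclusive); check both assumptions against the PopLibs API reference.

```python
def memory_proportion_options(proportion: float) -> dict:
    """Build an options dictionary limiting temporary memory to `proportion`."""
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in the range (0, 1]")
    # Assumption: option values are passed as strings.
    return {"availableMemoryProportion": str(proportion)}


conv_options = memory_proportion_options(0.1)
print(conv_options)
# The dictionary would then be assigned to the option described above, e.g.
# config.convolutions.poplar_options = conv_options
```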

device_connection

Sub-category containing configuration options to control when to attach to IPU devices.

device_connection.type: DeviceConnectionType = DeviceConnectionType.ALWAYS

Configure when to attach to the device. For example, you can use this to compile and cache a program without attaching to an IPU, and then later run on a real IPU device without recompiling. Setting the connection type doesn’t impact the ability to profile a model. For possible values, see DeviceConnectionType.

# Compile without attaching to the device.
config = IPUConfig()
config.device_connection.type = DeviceConnectionType.ON_DEMAND

If using DeviceConnectionType.PRE_COMPILE to compile models to run on C600 cards then the link topology will need to be set to “line” using the POPLAR_TARGET_OPTIONS environment variable. See Environment variables in the Poplar and PopLibs API Reference for more information.

device_connection.version: str = "": Version of the IPU architecture to use (string). Must be one of “ipu1”, “ipu2”, “ipu21” or “” (default). A specific version is required if the connection type is specified as DeviceConnectionType.PRE_COMPILE or DeviceConnectionType.NEVER. Do not specify a version otherwise.
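The rule relating the connection type to the version can be sketched as a small check. This is an illustration only: ConnectionType is a local stand-in for DeviceConnectionType with arbitrary numeric values, and check_connection is a hypothetical helper, not part of the IPU API.

```python
from enum import Enum


class ConnectionType(Enum):
    """Local stand-in for DeviceConnectionType, for illustration only."""
    ALWAYS = 0
    ON_DEMAND = 1
    PRE_COMPILE = 2
    NEVER = 3


VALID_VERSIONS = {"", "ipu1", "ipu2", "ipu21"}
NEEDS_VERSION = {ConnectionType.PRE_COMPILE, ConnectionType.NEVER}


def check_connection(conn_type: ConnectionType, version: str = "") -> None:
    """Raise ValueError if the type/version combination breaks the rule above."""
    if version not in VALID_VERSIONS:
        raise ValueError(f"unknown IPU version: {version!r}")
    if conn_type in NEEDS_VERSION and not version:
        raise ValueError(f"{conn_type.name} requires an explicit version")
    if conn_type not in NEEDS_VERSION and version:
        raise ValueError(f"{conn_type.name} must not specify a version")


check_connection(ConnectionType.ON_DEMAND)            # no version: accepted
check_connection(ConnectionType.PRE_COMPILE, "ipu2")  # version given: accepted
```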

device_connection.enable_remote_buffers: bool = False

Defaults to False. When the connection type is DeviceConnectionType.PRE_COMPILE, DeviceConnectionType.NEVER or DeviceConnectionType.ON_DEMAND, this option indicates whether remote buffers are enabled and supported on the system that will eventually execute the compiled programs. Set it to True if that system has remote buffers enabled. If the connection type is DeviceConnectionType.ALWAYS, this option is ignored because the device can be queried directly to check whether remote buffers are supported (if they are, they will be used automatically).

In order to check whether your target system supports remote buffers you can run the command:

$ gc-info -d 0 -i | grep "remote buffers supported:"

If you see remote buffers supported: 1 in the output, that means that remote buffers are supported on your system. For more information, see the gc-info documentation.
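The same check can be scripted. The sketch below assumes the relevant output line has the form "remote buffers supported: 1" quoted above. The helper remote_buffers_supported is hypothetical, and the sample string stands in for real gc-info output.

```python
def remote_buffers_supported(gc_info_output: str) -> bool:
    """Return True if the output reports 'remote buffers supported: 1'."""
    for line in gc_info_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "remote buffers supported":
            return value.strip() == "1"
    return False


# In practice the text could be captured from the command above, for example
# with subprocess.run(["gc-info", "-d", "0", "-i"], capture_output=True,
# text=True).stdout. A sample line is used here instead.
sample = "  remote buffers supported: 1"
print(remote_buffers_supported(sample))
```

The result could then be used as the value of device_connection.enable_remote_buffers when compiling for that system.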

slices

Sub-category containing configuration options that affect slice operations.

slices.poplar_options: dict = {}

Set the PopLibs slice options for the session. Must be a dictionary of valid PopLibs slice options. See embedding::plan in the PopLibs API reference for the full list of options. The options will be passed to multiSlice, multiUpdate, and multiUpdateAdd PopLibs calls. These are most commonly generated when using embeddings.

Of particular note is the availableMemoryProportion parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.

experimental

Sub-category containing experimental configuration options that may be changed or removed with short or no notice.

experimental.always_rearrange_copies_on_the_host: bool = False: The data which is streamed to/from the device might be stored in different layouts on the device and on the host. If so, rearrangement is performed on the device by default. By enabling this option the rearrangement will be performed on the host at the expense of latency.

experimental.enable_remote_buffer_embedding: bool = False: When set to true, HostEmbedding will make use of Poplar remote buffers. The creation of this remote buffer may take several minutes. The remote buffer will be synchronised with every IPU execution, so we recommend that you use a high value of n in repeat() for your training loop.

experimental.enable_prng_stability: bool = False: Enable PRNG seed management. This aims to reduce divergence of weights when running models across multiple replicas with stochastic rounding.

experimental.multi_replica_distribution

Sub-category containing configuration options controlling multi replica distribution. This will use the Poplar runtime replica subset feature to let multiple processes collaborate on executing the same Poplar program by executing a subset of the global replicas each.

The total global replication factor will be equal to the local replication factor multiplied by the process_count.

experimental.multi_replica_distribution.process_index: int = 0: The index of the current process being configured.

experimental.multi_replica_distribution.process_count: int = 0: The total number of processes. When set to 0 (default), multi-replica distribution will not be used.
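The relationship between the local and global replication factors can be written out as a short calculation. The function names below are hypothetical, and the range check on process_index is an assumption rather than documented behaviour.

```python
def global_replication_factor(local_replication_factor: int,
                              process_count: int) -> int:
    """Total replicas across all processes, per the rule stated above."""
    if process_count == 0:
        # Default: multi-replica distribution is not used.
        return local_replication_factor
    return local_replication_factor * process_count


def check_process_index(process_index: int, process_count: int) -> None:
    """Assumed sanity check: indices run from 0 to process_count - 1."""
    if process_count > 0 and not 0 <= process_index < process_count:
        raise ValueError("process_index must be in [0, process_count)")


# Four processes, each running two local replicas.
print(global_replication_factor(2, 4))
```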

floating_point_behaviour

Sub-category containing configuration options that affect the floating point behaviour of the IPU devices, including stochastic rounding and behaviour when an overflow is encountered during execution. For more information, see Controlling the half-precision floating-point unit.

floating_point_behaviour.inv: bool = False: If True, a floating point invalid operation (defined by IEEE 754) will cause an exception.

floating_point_behaviour.div0: bool = False: If True, a floating point divide by zero operation will cause an exception.

floating_point_behaviour.oflo: bool = False: If True, a floating point overflow will cause an exception.

floating_point_behaviour.esr: StochasticRoundingBehaviour = StochasticRoundingBehaviour.OFF: A StochasticRoundingBehaviour. If StochasticRoundingBehaviour.OFF (default) then stochastic rounding will be disabled. Otherwise it’s enabled with the semantics of the particular option.

floating_point_behaviour.nanoo: bool = False: If True, Not-a-Number (NaN) on overflow mode will be enabled.

floating_point_behaviour.set_all: bool = False: If True, unconditionally enables all floating point behaviour options (inv, div0, oflo, esr, nanoo) when the IPUConfig is configured.

io_tiles

Sub-category containing configuration options that affect parallel I/O on a subset of tiles. For more information, see I/O Tiles.

io_tiles.num_io_tiles: int = 0: Number of tiles to reserve for I/O.

io_tiles.place_ops_on_io_tiles: bool = False: Whether to place TensorFlow I/O operations on the I/O tiles.

io_tiles.available_memory_proportion: float = 0.9: Proportion of I/O tiles’ memory which can be used to store data in, with the remaining memory assumed to be used by code. If the size of data which is to be stored on I/O tiles exceeds the total I/O tiles memory multiplied by this proportion, then a warning message will appear and the operations will not be placed on I/O tiles.
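The size check described for io_tiles.available_memory_proportion amounts to a simple comparison. The sketch below uses a hypothetical helper fits_on_io_tiles and made-up memory figures; the per-tile memory size is a parameter here because it depends on the hardware.

```python
def fits_on_io_tiles(data_bytes: int,
                     num_io_tiles: int,
                     bytes_per_tile: int,
                     available_memory_proportion: float = 0.9) -> bool:
    """Apply the size check described above for data placed on I/O tiles."""
    budget = num_io_tiles * bytes_per_tile * available_memory_proportion
    return data_bytes <= budget


# Hypothetical figures: 32 I/O tiles with 600,000 bytes of memory each,
# giving a data budget of roughly 17.28 million bytes at the default 0.9.
print(fits_on_io_tiles(16_000_000, 32, 600_000))
print(fits_on_io_tiles(18_000_000, 32, 600_000))
```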

ipu_model

Sub-category containing configuration options related to the IPU model. Note that these will only have an effect if you are running with the IPU model enabled. For more information, see TF_POPLAR_FLAGS environment variable.

ipu_model.compile_ipu_code: bool = True: Whether or not to compile IPU code for modelling.

ipu_model.tiles_per_ipu: int = 0: The number of tiles per IPU Model device. When set to 0 (the default), Poplar will use the standard number of tiles for the chosen version.

ipu_model.version: str = "ipu2": Specify the IPU version to be used by the IPU Model. Options are “ipu1” or “ipu2” (default).

matmuls

Sub-category containing configuration options that affect matmuls.

matmuls.clear_pass_type: bool = False: Controls whether or not the “Pass” type of the MatMul is passed to PopLibs. When set to True, PopLibs will not be told about the type of the MatMuls in the graph. This can save memory in some circumstances, such as large batch ResNet models. See matMul in the PopLibs API reference.

matmuls.poplar_options: dict = {}

Set the PopLibs matrix multiplication options for the session. Must be a dictionary of valid PopLibs matrix multiplication options. See matMul in the PopLibs API reference for the full list of options. The options will be applied to all matmul operations in the session graph during compilation.

Of particular note is the availableMemoryProportion parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.

See the technical note on Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU for more details and for some practical examples of using availableMemoryProportion.

norms

Sub-category containing configuration options that affect normalizations. Note that these options will be applied to all normalisation operations encountered (Fused Batch Norm, IPU Specific Group Norm, IPU Specific Layer Norm and IPU Specific Instance Norm).

norms.use_stable_statistics: bool = False: If True, computes the difference between the mean and the activations first, and then computes the variance from that difference. The implementation with this flag set to True is slower than when set to False.
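The numerical motivation for norms.use_stable_statistics can be illustrated in plain Python. This is a sketch of the general idea only, not the PopLibs implementation; the function names and sample values are chosen for illustration.

```python
def variance_naive(xs):
    """E[x^2] - E[x]^2: subtracts two large, nearly equal numbers."""
    n = len(xs)
    mean = sum(xs) / n
    return sum(x * x for x in xs) / n - mean * mean


def variance_stable(xs):
    """Subtract the mean from each value first, then average the squares."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# Values with a large mean and a small spread; the exact variance is 22.5.
activations = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(variance_stable(activations))
print(variance_naive(activations))
```

The stable form recovers the exact variance, whereas the naive form loses it to rounding because the squared values are too large to be represented exactly. The effect is more pronounced at lower precision.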

norms.experimental

Sub-category containing experimental configuration options for normalizations that may be changed or removed with short or no notice.

norms.experimental.distributed_batch_norm_replica_group_size: int = 1: When executing fused batch-norms for training, this option specifies how many replicas to aggregate the batch statistics across. For example, if a model is being executed across four replicas and this option is set to two, replicas 0 and 1 will be grouped together and replicas 2 and 3 will be grouped together and the batch norm statistics will be synchronously all-reduced every time the layer is executed (including any recomputation) across the replicas within a group. This option should not be used when using model parallelism (pipelining) and it is not supported with I/O tiles. When recomputation is enabled and the training fused batch norm operation is recomputed, the statistics will have to be all-reduced again, unless the RecomputeAndBackpropagateInterleaved recomputation mode is used.
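The grouping in the example above (four replicas, group size two) can be reproduced with a short sketch. The helper batch_norm_replica_groups is hypothetical, and the requirement that the group size divides the replica count evenly is an assumption.

```python
def batch_norm_replica_groups(num_replicas: int, group_size: int):
    """Partition replica indices into consecutive groups of `group_size`."""
    if group_size < 1 or num_replicas % group_size != 0:
        raise ValueError("group_size must evenly divide num_replicas")
    return [list(range(start, start + group_size))
            for start in range(0, num_replicas, group_size)]


print(batch_norm_replica_groups(4, 2))
```

With the default group size of 1, every replica forms its own group, so no statistics are shared between replicas.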

optimizations

Sub-category containing configuration options that control a variety of optimizations made when lowering the TensorFlow graph to Poplar.

optimizations.math

Sub-category containing configuration options related to simplifying algebraic mathematical expressions.

optimizations.math.fast: bool = False: Enables optimizations which allow arbitrary re-associations and transformations of mathematical operations with no accuracy guarantees. Enabling this option can result in incorrect output for programs that depend on an exact implementation of IEEE floating point for maths functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

optimizations.math.dot_strength: bool = True: Enable dot strength optimization. When set to True, the graph optimizer will convert a dot product where either the LHS or the RHS contains only batch and/or contracting dimensions to an elementwise matrix multiplication.

optimizations.prefetch_data_streams: bool = True: If True (default), prefetching of data for data streams on the host will be overlapped with execution on the IPU.

optimizations.combine_embedding_lookups: bool = False: If True, fuse embedding lookups which are on the same tensor. This might improve performance but increase memory usage.

optimizations.combine_matmuls: bool = False: If True, fuse matmul operations if they share the same weights or the same input.

optimizations.enable_graph_outlining: bool = True: If True (default), operations in the graph which are the same but with different input tensors may be outlined. This means the same code will be re-used to execute them, reducing the amount of program code, but their inputs will be exchanged into a common memory location to do so, increasing execution time. If you care more about speed than memory, these optimizations can be disabled by setting this option to False.

optimizations.merge_infeed_io_copies: bool = True: If True, this flag will merge the streamed host to device input copies into one larger copy. This may reduce the time to copy data from the host, at the expense of increasing the live tensor memory on the device.

optimizations.maximum_cross_replica_sum_buffer_size: int = 0: The maximum number of bytes that can be waiting before a cross replica sum op is scheduled. 0 (default) means that they are scheduled immediately. This value represents an always-live versus not-always-live trade-off: increasing maximum_cross_replica_sum_buffer_size will lead to larger temporary buffers in the cross replica sums, but fewer cross replica sums overall and therefore less control code. If your model contains a lot of trainable variables, you are strongly advised to consider adjusting this option.
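The trade-off between buffer size and the number of cross replica sums can be pictured with a toy model. The sketch below greedily batches tensors under a byte limit. It is not the compiler's scheduling algorithm, and the function name and tensor sizes are hypothetical.

```python
def cluster_by_bytes(tensor_sizes, max_buffer_bytes):
    """Greedily group tensor sizes so no multi-tensor group exceeds the limit."""
    clusters, current, current_bytes = [], [], 0
    for size in tensor_sizes:
        # Flush the pending group if adding this tensor would exceed the limit.
        if current and current_bytes + size > max_buffer_bytes:
            clusters.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        clusters.append(current)
    return clusters


gradient_sizes = [400, 400, 400, 400]  # bytes, hypothetical
print(len(cluster_by_bytes(gradient_sizes, 0)))     # one group per tensor
print(len(cluster_by_bytes(gradient_sizes, 1000)))  # fewer, larger groups
```

In this model a limit of 0 yields one reduction per tensor, while a larger limit yields fewer reductions that each need a larger temporary buffer.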

optimizations.maximum_reduce_scatter_buffer_size: int = 0: The maximum number of bytes that can be waiting before a reduce scatter op is scheduled.

optimizations.maximum_inter_ipu_copies_buffer_size: int = 0: The maximum number of bytes that can be waiting before an inter-IPU copy is scheduled.

optimizations.maximum_send_recv_cluster_size: int = 0: The maximum number of bytes that can be waiting before a cluster of send/recv instructions to/from the host is scheduled. These are lowered to stream copies that can be merged by Poplar.

optimizations.maximum_reduce_many_buffer_size: int = 0: The maximum size (in bytes) a cluster of reduce operations can reach before it is scheduled. These clusters are lowered to popops ReduceMany operations.

optimizations.maximum_all_gather_buffer_size: int = 0: The maximum size (in bytes) a cluster of all gather operations can reach before it is scheduled. These clusters are lowered to popops AllGather operations.

optimizations.minimum_remote_tensor_size: int = 128: The minimum size (in bytes) a tensor must be in order to be considered for being stored in remote memory.

optimizations.merge_remote_buffers: MergeRemoteBuffersBehaviour = MergeRemoteBuffersBehaviour.IF_BENEFICIAL: Whether to merge compatible remote buffers. Merging of remote buffers can allow for more code re-use if the only difference between computations are the remote buffers being accessed. Must be a MergeRemoteBuffersBehaviour.

optimizations.enable_gather_simplifier: bool = True: If True (default), more aggressive optimizations will be done on embedding lookups.

optimizations.triangular_solve_expander_block_size: int = 0: Defines the block size for the triangular solver expander. The processing within each block is performed on a single tile. The control code for performing computations over blocks is unrolled on the device. For a matrix of rank N and block size B, there are log2(N/B) iterations of the control code. The choice of this parameter therefore has to balance between the amount of data in a tile (lower value is better, gives better parallelism) and the amount of control code (larger value is better, less control code). A value of 0 (default) selects an implementation-defined default.

optimizations.cholesky_block_size: int = 0: Defines the block size for the Cholesky factoriser. The processing within each block is performed on a single tile. The control code for performing computations over blocks is unrolled on the device. For a matrix of rank N and block size B, there are N/B iterations of the control code. The choice of this parameter therefore has to balance between the amount of data in a tile (lower value is better, gives better parallelism) and the amount of control code (larger value is better, less control code). A value of 0 (default) selects an implementation-defined default.
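To get a feel for how the block size affects the amount of unrolled control code, the iteration counts stated for the two block size options can be tabulated. This sketch simply evaluates log2(N/B) and N/B. The function names are hypothetical, and N and B are assumed to be powers of two with B no larger than N.

```python
import math


def triangular_solve_iterations(n: int, block_size: int) -> int:
    """log2(N/B) control-code iterations, as stated for the triangular solver."""
    return int(math.log2(n / block_size))


def cholesky_iterations(n: int, block_size: int) -> int:
    """N/B control-code iterations, as stated for the Cholesky factoriser."""
    return n // block_size


# Iteration counts for a matrix of rank 1024 at three block sizes.
for b in (16, 64, 256):
    print(b, triangular_solve_iterations(1024, b), cholesky_iterations(1024, b))
```

Larger blocks reduce the iteration count for both operations, but the count for the Cholesky factoriser falls linearly with B while the count for the triangular solver falls only logarithmically.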

optimizations.enable_fast_math: bool = False

Note

DEPRECATED: ‘enable_fast_math’ has been moved to ‘optimizations.math.fast’. It will be removed from this location in a future release.

Enables optimizations which allow arbitrary re-associations and transformations of mathematical operations with no accuracy guarantees. Enabling this option can result in incorrect output for programs that depend on an exact implementation of IEEE floating point for maths functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.

optimizations.enable_dynamic_slice_replacement: bool = True: Control whether or not we replace dynamicSlice/Update with multiSlice/Update. This can increase parallelism and provide better memory usage since multiSlice/Update can be planned.

pooling

Sub-category containing configuration options that affect pooling operations.

pooling.poplar_options: dict = {}: Set the PopLibs pooling compilation options for the session. Must be a dictionary of valid PopLibs pooling options. See pool in the PopLibs API reference for the full list of options. The options will be applied to all pooling operations in the session graph during compilation.

scheduling

Sub-category containing configuration options that affect the scheduling of operations in the graph during compilation.

scheduling.algorithm: SchedulingAlgorithm = SchedulingAlgorithm.CHOOSE_BEST: A SchedulingAlgorithm. If SchedulingAlgorithm.CHOOSE_BEST (default), several schedules will be created and the one with the lowest predicted liveness chosen. Setting this to a specific scheduling algorithm forces the compiler to use that algorithm when ordering the instructions.

scheduling.maximum_scheduler_lookahead_depth: int = 5: Controls how far the LOOK_AHEAD scheduling algorithm can look beyond a given scheduling decision to understand the max-liveness implications. This search space grows very quickly and can take an unacceptable amount of time for large values. Only for SchedulingAlgorithm.LOOK_AHEAD.

scheduling.maximum_scheduler_search_space_size: int = 64: The upper-limit to the size of the LOOK_AHEAD scheduling algorithm’s search space to guarantee that it will terminate in a reasonable amount of time. Only for SchedulingAlgorithm.LOOK_AHEAD.

get_attribute_metadata(attr)

Get the attribute metadata for attr.

Parameters: attr – required, a string which specifies which attribute to retrieve metadata for. Must be its full name relative to the category this method is being called on.
Returns: An AttributeMetadata object containing the metadata for the attribute.
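The phrase "full name relative to the category" can be illustrated with a small string helper. The function relative_attribute_name is hypothetical and only shows how the dotted name shortens as the category becomes more specific. It assumes that the method can also be called on a sub-category, as the parameter description suggests.

```python
def relative_attribute_name(full_name: str, category: str = "") -> str:
    """Strip the category prefix from a dotted attribute name."""
    if not category:
        return full_name
    prefix = category + "."
    if not full_name.startswith(prefix):
        raise ValueError(f"{full_name!r} is not in category {category!r}")
    return full_name[len(prefix):]


# Name to pass when calling on the top-level config, then on sub-categories.
print(relative_attribute_name("optimizations.math.fast"))
print(relative_attribute_name("optimizations.math.fast", "optimizations"))
print(relative_attribute_name("optimizations.math.fast", "optimizations.math"))
```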

configure_ipu_system(device='cpu')

Configure the IPU system with this config.

Parameters: device – The CPU device which is local to the IPU hardware.

from_dict(dct)

Restore configuration from a dict object.

Parameters: dct – A dictionary containing a configuration.

to_dict()

Export the configuration stored within this configuration object to a dict.

Returns: A dictionary containing the configuration.

from_json(json_cfg)

Restore configuration from a JSON string.

Parameters: json_cfg – A JSON string containing a configuration.

to_json()

Export the configuration stored within this configuration object as a JSON string.

Returns: A JSON string containing the configuration.

allow_recompute: The order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices. Must be one of SelectionOrder.

select_ipus: Union[int, List[int], Tuple[int, ...]]

Configure the IPUs to be used by the session.

The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The TensorFlow devices will be labelled /device:IPU:0, /device:IPU:1 and so on.

Each TensorFlow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

user@host:~$ gc-info -l
Graphcore device listing:

-+- Id:  [0], type:      [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id:  [1], type:      [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id:  [2], type:      [PCIe], PCI Domain: [0000:23:00.0]
-+- Id:  [3], type:      [PCIe], PCI Domain: [0000:24:00.0]
-+- Id:  [4], type:      [PCIe], PCI Domain: [0000:3d:00.0]
-+- Id:  [5], type:      [PCIe], PCI Domain: [0000:3e:00.0]
-+- Id:  [6], type:      [PCIe], PCI Domain: [0000:43:00.0]
-+- Id:  [7], type:      [PCIe], PCI Domain: [0000:44:00.0]
-+- Id:  [8], type:      [PCIe], PCI Domain: [0000:8b:00.0]
-+- Id:  [9], type:      [PCIe], PCI Domain: [0000:8c:00.0]
-+- Id: [10], type:      [PCIe], PCI Domain: [0000:8e:00.0]
-+- Id: [11], type:      [PCIe], PCI Domain: [0000:8f:00.0]
-+- Id: [12], type:      [PCIe], PCI Domain: [0000:b8:00.0]
-+- Id: [13], type:      [PCIe], PCI Domain: [0000:b9:00.0]
-+- Id: [14], type:      [PCIe], PCI Domain: [0000:ba:00.0]
-+- Id: [15], type:      [PCIe], PCI Domain: [0000:bb:00.0]
-+- Id: [16], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
-+- Id: [17], type: [Multi IPU]
|--- PCIe Id:  [4], DNC Id: [0], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [1], PCI Domain: [0000:43:00.0]
-+- Id: [18], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
-+- Id: [19], type: [Multi IPU]
|--- PCIe Id:  [2], DNC Id: [0], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [20], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
-+- Id: [21], type: [Multi IPU]
|--- PCIe Id: [12], DNC Id: [0], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:ba:00.0]
-+- Id: [22], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
-+- Id: [23], type: [Multi IPU]
|--- PCIe Id: [10], DNC Id: [0], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [1], PCI Domain: [0000:8b:00.0]
-+- Id: [24], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
-+- Id: [25], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [2], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [26], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
-+- Id: [27], type: [Multi IPU]
|--- PCIe Id:  [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [2], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [3], PCI Domain: [0000:8b:00.0]
-+- Id: [28], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [29], type: [Multi IPU]
|--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [4], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [5], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [6], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [7], PCI Domain: [0000:8b:00.0]
-+- Id: [30], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
|--- PCIe Id:  [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
|--- PCIe Id: [13], DNC Id: [8], PCI Domain: [0000:b9:00.0]
|--- PCIe Id: [15], DNC Id: [9], PCI Domain: [0000:bb:00.0]
|--- PCIe Id: [12], DNC Id: [10], PCI Domain: [0000:b8:00.0]
|--- PCIe Id: [14], DNC Id: [11], PCI Domain: [0000:ba:00.0]
|--- PCIe Id:  [9], DNC Id: [12], PCI Domain: [0000:8c:00.0]
|--- PCIe Id: [11], DNC Id: [13], PCI Domain: [0000:8f:00.0]
|--- PCIe Id: [10], DNC Id: [14], PCI Domain: [0000:8e:00.0]
|--- PCIe Id:  [8], DNC Id: [15], PCI Domain: [0000:8b:00.0]

Examples of setting select_ipus (an int, or a list or tuple of ints), based on the listing above:

config = IPUConfig()

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:1a:00.0 by using IPU configuration index 0
config.select_ipus = 0

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:8b:00.0 by using IPU configuration index 8
config.select_ipus = 8

# Create two TensorFlow devices, with one IPU each, being devices at
# indices 0 and 1
config.select_ipus = [0, 1]

# Create two TensorFlow devices, with four IPUs each. The device
# configurations at indices 24 (0000:3e:00.0, 0000:44:00.0,
# 0000:3d:00.0, 0000:43:00.0) and 25 (0000:24:00.0, 0000:1b:00.0,
# 0000:23:00.0, 0000:1a:00.0)
config.select_ipus = [24, 25]

# Create four TensorFlow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:23:00.0, 0000:24:00.0.
config.select_ipus = [0, 1, 2, 3]
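
To check which PCI addresses a configuration index covers, the listing can be parsed programmatically. The following is a minimal sketch that assumes the text format of the multi-IPU entries shown above. `parse_listing` and the `LISTING` excerpt are hypothetical helpers for illustration and are not part of the IPU API.

```python
import re

# Excerpt of the listing above (configuration indices 24 and 25).
LISTING = """\
-+- Id: [24], type: [Multi IPU]
|--- PCIe Id:  [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
|--- PCIe Id:  [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
|--- PCIe Id:  [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
|--- PCIe Id:  [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
-+- Id: [25], type: [Multi IPU]
|--- PCIe Id:  [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
|--- PCIe Id:  [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
|--- PCIe Id:  [2], DNC Id: [2], PCI Domain: [0000:23:00.0]
|--- PCIe Id:  [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
"""

HEADER = re.compile(r"-\+- Id: \[\s*(\d+)\]")
MEMBER = re.compile(r"\|--- PCIe Id:\s*\[\s*(\d+)\].*PCI Domain: \[([^\]]+)\]")


def parse_listing(text):
    """Map each configuration index to the PCI addresses of its IPUs."""
    devices = {}
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        member = MEMBER.match(line)
        if header:
            current = int(header.group(1))
            devices[current] = []
        elif member and current is not None:
            devices[current].append(member.group(2))
    return devices


devices = parse_listing(LISTING)
print(devices[24])
```

The addresses are kept in listing order, which follows the DNC Id of each IPU within the multi-IPU configuration.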

compilation_poplar_options: Set the IPU options for the Graphcore Communication Library. Must be a dictionary of valid GCL options. See the allReduce function in the GCL API reference for the full list of options. The options will be applied to all applicable GCL collective operations in the graph during compilation.

configure_ipu_system(device='cpu')

Configure the IPU system with this config.

Parameters: device – The CPU device which is local to the IPU hardware.

convolutions: Sub-category containing configuration options to control when to attach to IPU devices.

device_connection: Sub-category containing configuration options that affect slice operations.

experimental: Sub-category containing configuration options that affect the floating point behaviour of the IPU devices, including stochastic rounding and behaviour when an overflow is encountered during execution. For more information, see Controlling the half-precision floating-point unit.

floating_point_behaviour: Sub-category containing configuration options that affect parallel I/O on a subset of tiles. For more information, see I/O Tiles.

auto_select_ipus

Configure the IPUs to be used by the session. The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The devices will be labelled /device:IPU:0, /device:IPU:1 and so on.

Each device can control a specific number of IPUs, given by the num_ipus parameter. The system will automatically select IPU configurations from the available IPUs, where they match the desired number of IPUs.

Examples:

config = IPUConfig()

# Create a single TensorFlow device, with one IPU
config.auto_select_ipus = 1

# Create two TensorFlow devices, with two IPUs per device.
config.auto_select_ipus = [2, 2]

# Create two TensorFlow devices, with one IPU in the first device and two
# IPUs in the second device.
config.auto_select_ipus = [1, 2]

io_tiles: Sub-category containing configuration options related to the IPU model. Note that these will only have an effect if you are running with the IPU model enabled. For more information, see TF_POPLAR_FLAGS environment variable.

ipu_model: Sub-category containing configuration options that affect matmuls.

matmuls: Sub-category containing configuration options that affect normalizations. Note that these options will be applied to all normalisation operations encountered (Fused Batch Norm, IPU Specific Group Norm, IPU Specific Layer Norm and IPU Specific Instance Norm).

norms: Sub-category containing configuration options that control a variety of optimizations made when lowering the TensorFlow graph to Poplar.

optimizations: Sub-category containing configuration options that affect pooling operations.

pooling: Sub-category containing configuration options that affect the scheduling of operations in the graph during compilation.

convolutions: Sub-category containing configuration options that affect convolutions.

selection_order: The order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices. Must be one of SelectionOrder.

serialization_output_folder: Specifies the directory in which serialized Poplar executables will be saved. The value must be a valid path. The default ("") disables executable serialization.

compilation_poplar_options: Set the Poplar compilation options for the session. Must be a dictionary of valid Poplar compilation flags. See the Engine class in the Poplar API reference for the full list of options.

experimental: Sub-category containing experimental configuration options that may be changed or removed with short or no notice.

21.8. Looping utilities

tensorflow.python.ipu.loops.repeat(n, body, inputs=None, infeed_queue=None, use_while_v1=True)

Builds a loop that executes a fixed number of iterations.

The set of loop-carried tensors corresponds to inputs. body must be a function that takes and returns the values of the loop-carried tensors.

Parameters

n – the number of loop iterations
body – a Python function that builds the loop body.
inputs – a list of initial values passed into the loop or None (equivalent to an empty list).
infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.
use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

ValueError – if there is a type error.
TypeError – if body has the wrong signature.
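
The loop-carried data flow described above can be modelled in plain Python. This is only a sketch of the semantics, not the IPU implementation: `repeat_sketch` is a hypothetical stand-in that runs eagerly on Python values, whereas `loops.repeat` builds graph operations and can also consume data from an infeed queue.

```python
def repeat_sketch(n, body, inputs=None):
    """Run `body` n times, feeding its outputs back in as its inputs."""
    values = list(inputs or [])
    for _ in range(n):
        result = body(*values)
        # Accept either a single value or a sequence of loop-carried values.
        values = list(result) if isinstance(result, (list, tuple)) else [result]
    return values


def body(total, step):
    # Both values are loop-carried: `total` changes, `step` is passed through.
    return total + step, step


final = repeat_sketch(5, body, inputs=[0, 2])
print(final)
```

Note that `body` must return as many values as it takes, because its outputs become the inputs of the next iteration.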

tensorflow.python.ipu.loops.while_loop(condition, body, inputs=None, infeed_queue=None, maximum_iterations=None, use_while_v1=True)

Builds a while loop for IPUs.

The set of loop-carried tensors corresponds to inputs. Both condition and body take the current value of the loop-carried tensors. condition must return a single boolean value that determines whether iteration continues. body must return an updated list of values for the loop-carried tensors.

Parameters

condition – a Python function that builds the loop condition.
body – a Python function that builds the loop body.
inputs – a list of initial values passed into the loop, or None (equivalent to an empty list).
infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.
use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.

Returns

The final values of the loop-carried tensors.

Raises

TypeError – if body or condition has the wrong signature.
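
As a plain-Python illustration of how condition and body interact, the sketch below evaluates both eagerly on Python values. `while_loop_sketch` is a hypothetical helper, not the IPU implementation, and it ignores infeed_queue and maximum_iterations.

```python
def while_loop_sketch(condition, body, inputs=None):
    """Iterate `body` for as long as `condition` holds on the carried values."""
    values = list(inputs or [])
    while condition(*values):
        values = list(body(*values))
    return values


# Halve `x` until it drops below 1, counting the iterations.
final = while_loop_sketch(
    condition=lambda x, count: x >= 1.0,
    body=lambda x, count: [x / 2.0, count + 1],
    inputs=[8.0, 0],
)
print(final)
```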

21.9. Distributed training

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMirroredVariable(*args, **kwargs)

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMultiWorkerExtended(container_strategy, cluster_resolver, ipu_device, variables_on_host)

__init__(container_strategy, cluster_resolver, ipu_device, variables_on_host)

read_var(var): Read the aggregate value of a replica-local variable.

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUMultiWorkerStrategy(cluster_resolver, ipu_device='/device:IPU:0', variables_on_host=False)

This is a distribution strategy for synchronous training using IPUs on multiple workers with between-graph replication.

By default variables and ops are placed on the IPU of each worker, but variables can optionally be placed on the host by setting variables_on_host=True. In any case, this strategy will make sure that variables are kept in sync between the workers by performing multi-worker reductions.

The multi-worker reductions are done using TensorFlow’s implementation of collective operations over gRPC.

Variable synchronization

The default behavior is to sync (allreduce) the variables when they are written (sync-on-write). This is a good choice when reads are at least as common as writes. However, for variables where writes are more common than reads (like metrics or population statistics in batch normalization layers), it is beneficial to only sync (allreduce) the variables when they are read (sync-on-read).

In both cases, it is important that all the workers participate in the sync, otherwise progress will be blocked. Take special care in the latter case (with sync-on-read variables), because it implies that all the workers need to read these variables at the same time. For example, it implies that all the workers must checkpoint the model at the same time.

Sync-on-read variables are placed on the IPU even when host placement was requested (with variables_on_host=True), because this allows the ops to update the variables directly on the IPU without any host involvement. A variable is streamed to the host and allreduced there only when it is read.
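
The trade-off between the two modes can be illustrated by counting reductions. The following is a plain-Python sketch, not the strategy's implementation: `AllReduceCounter`, `run_sync_on_write` and `run_sync_on_read` are hypothetical names, and a sum is assumed as the aggregation.

```python
class AllReduceCounter:
    """Stand-in for the communication layer that counts reductions."""

    def __init__(self):
        self.calls = 0

    def allreduce_sum(self, per_worker_values):
        self.calls += 1
        return sum(per_worker_values)


def run_sync_on_write(updates_per_worker, comm):
    # Every write is reduced across the workers immediately, so all workers
    # hold the same value and a read needs no communication.
    value = 0.0
    for step_updates in zip(*updates_per_worker):
        value += comm.allreduce_sum(step_updates)
    return value


def run_sync_on_read(updates_per_worker, comm):
    # Writes only touch the local copy; a single reduction happens on read.
    local_values = [sum(updates) for updates in updates_per_worker]
    return comm.allreduce_sum(local_values)


# Two workers, three writes each, followed by one read.
updates = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]

write_comm = AllReduceCounter()
read_comm = AllReduceCounter()
on_write = run_sync_on_write(updates, write_comm)
on_read = run_sync_on_read(updates, read_comm)
print(on_write, write_comm.calls, on_read, read_comm.calls)
```

Both modes produce the same aggregate here, but sync-on-write communicates once per write while sync-on-read communicates once per read.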

Weight updates

When used during training with an Optimizer, there is an implicit allreduce in the optimizer.apply_gradients() function (which is called from optimizer.minimize()). This will automatically cause the gradients to be streamed to the host of each worker, allreduced between the workers, and then streamed back to the IPU of each worker, where identical weight updates are performed (keeping the workers in sync). This is done even when the call to optimizer.apply_gradients() is inside a function passed to ipu_compiler.compile(), as the allreduce is extracted from the compiled XLA cluster and placed on the host in the outside graph (by internally using an outside_compilation_scope()).

When variables are placed on the host, the weight updates should also be placed on the host. In other words, the optimizer.compute_gradients() call should be placed on the IPU, while the optimizer.apply_gradients() call should be placed on the host. This must be done explicitly. In this scenario all the “slot” variables used by the optimizer (e.g. the momentum accumulator) are then also kept only in host memory and never used on the IPU, saving IPU memory.

Compatibility

IPUEstimator: Pass the IPUMultiWorkerStrategy instance to the RunConfig as the train_distribute argument. When variables are placed on the host, the optimizer.apply_gradients() call should also be placed on the host by using the IPUEstimatorSpec host_call argument.

IPUPipelineEstimator: Pass the IPUMultiWorkerStrategy instance to the RunConfig as the train_distribute argument. Placing variables on the host is not currently supported here.

Keras Model.fit: Not currently supported.

Custom training loop: Pass the training step function to IPUMultiWorkerStrategy.experimental_run_v2(). With variables on the IPU, the optimizer.apply_gradients() call can be done from an XLA compiled IPU function, and the inter-host allreduce will be automatically extracted from the compiled XLA cluster and placed on the host. With variables on the host, the optimizer.apply_gradients() call must be explicitly placed on the host.

Example using a custom training loop with pipelining

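# Imports used by this example. `dataset` (a tf.data.Dataset) and
# `gradient_accumulation_count` (an int) are assumed to be defined by the user.
import numpy as np
import tensorflow as tf
from tensorflow.python import ipu, keras
from tensorflow.python.client import session as session_lib
from tensorflow.python.distribute.reduce_util import ReduceOp
from tensorflow.python.framework import ops
from tensorflow.python.ipu import ipu_compiler, ipu_infeed_queue, ipu_outfeed_queue
from tensorflow.python.ipu import utils as ipu_utils
from tensorflow.python.ipu.ipu_multi_worker_strategy import IPUMultiWorkerStrategy
from tensorflow.python.ipu.ops import pipelining_ops
from tensorflow.python.ipu.scopes import ipu_scope
from tensorflow.python.ops import array_ops, math_ops, nn, variables
from tensorflow.python.training.gradient_descent import GradientDescentOptimizer

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()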
strategy = IPUMultiWorkerStrategy(cluster_resolver)

sess_config = tf.ConfigProto()
sess_config = strategy.update_config_proto(sess_config)
server = tf.distribute.Server(cluster_resolver.cluster_spec(),
                              job_name=cluster_resolver.task_type,
                              task_index=cluster_resolver.task_id,
                              config=sess_config)
sess_target = server.target

with strategy.scope():

  infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset)
  outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

  def stage1(lr, images, labels):
    partial = keras.layers.Dense(256, activation="relu")(images)
    partial = keras.layers.Dense(128, activation="relu")(partial)
    return lr, partial, labels

  def stage2(lr, partial, labels):
    logits = keras.layers.Dense(10)(partial)
    per_example_loss = keras.losses.sparse_categorical_crossentropy(
        y_true=labels, y_pred=logits, from_logits=True)
    # In a custom training loop, the optimiser does an allreduce *sum*, not
    # average, of the gradients across the distributed workers. Therefore
    # we want to divide the loss here by the *global* batch size, which is
    # done by the `tf.nn.compute_average_loss()` function.
    loss = nn.compute_average_loss(per_example_loss)
    return lr, loss

  def optimizer_function(lr, loss):
    optimizer = GradientDescentOptimizer(lr)
    return pipelining_ops.OptimizerFunctionOutput(optimizer, loss)

  def model(lr):
    pipeline_op = pipelining_ops.pipeline(
        computational_stages=[stage1, stage2],
        gradient_accumulation_count=gradient_accumulation_count,
        inputs=[lr],
        infeed_queue=infeed_queue,
        outfeed_queue=outfeed_queue,
        optimizer_function=optimizer_function,
        name="Pipeline")
    return pipeline_op

  def compiled_model(lr):
    with ipu_scope("/device:IPU:0"):
      return ipu_compiler.compile(model, inputs=[lr])

  with ops.device("cpu"):
    lr = array_ops.placeholder(np.float32, [])

  train_op = strategy.experimental_run_v2(compiled_model, args=[lr])

  _, per_worker_losses = outfeed_queue.dequeue()

  # Mean across the local `gradient_accumulation_count` batches:
  per_worker_loss = math_ops.reduce_mean(per_worker_losses)

  # Global mean across the distributed workers (since it is already
  # divided by the global batch size above, we do a sum here):
  global_loss = strategy.reduce(ReduceOp.SUM, per_worker_loss)

  config = ipu.config.IPUConfig()
  config.auto_select_ipus = 2
  config.configure_ipu_system()
  ipu_utils.move_variable_initialization_to_cpu()

  with session_lib.Session(target=sess_target, config=sess_config) as sess:
    sess.run(infeed_queue.initializer)
    sess.run(variables.global_variables_initializer())

    for _ in range(10):
      sess.run(train_op, {lr: 0.01})
      global_loss_val = sess.run(global_loss)
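
The comments in the example rely on a small piece of arithmetic: if each worker divides the sum of its per-example losses by the global batch size, then a sum across workers yields the global mean. The following worked check uses made-up loss values and plain Python only; it does not call any TensorFlow function.

```python
# Three workers, each with a local batch of two examples (illustrative values).
per_example_losses = [[0.2, 0.4], [0.6, 0.8], [1.0, 1.2]]

global_batch_size = sum(len(losses) for losses in per_example_losses)

# What each worker computes locally: local sum divided by the GLOBAL batch size.
per_worker_loss = [sum(losses) / global_batch_size
                   for losses in per_example_losses]

# Reducing with a sum across the workers...
global_loss = sum(per_worker_loss)

# ...matches the mean over every example in the global batch.
global_mean = sum(sum(losses) for losses in per_example_losses) / global_batch_size

print(global_loss, global_mean)
```

Dividing by the local batch size instead would make the summed result too large by a factor equal to the number of workers.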

__init__(cluster_resolver, ipu_device='/device:IPU:0', variables_on_host=False)

DEPRECATED FUNCTION

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use PopDistStrategy instead.

class tensorflow.python.ipu.ipu_multi_worker_strategy.IPUSyncOnReadVariable(*args, **kwargs)

21.10. Horovod

tensorflow.python.ipu.distributed.allgather(tensor, name=None)

An op which concatenates the input tensor with the same input tensor on all other Horovod processes.

The concatenation is done on the first dimension, so the input tensors on the different processes must have the same rank and shape, except for the first dimension, which is allowed to be different.

Returns: A tensor of the same type as tensor, concatenated on dimension zero across all processes. The shape is identical to the input shape, except for the first dimension, which may be greater and is the sum of all first dimensions of the tensors in different Horovod processes.
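
The shape rule can be illustrated with nested Python lists standing in for rank-2 tensors. This is a sketch of the semantics only: `allgather_sketch` is a hypothetical helper that runs in a single process and does not use Horovod.

```python
def allgather_sketch(per_process_rows):
    """Concatenate the per-process tensors along the first dimension."""
    row_lengths = {len(row) for rows in per_process_rows for row in rows}
    if len(row_lengths) > 1:
        raise ValueError("all dimensions except the first must match")
    return [row for rows in per_process_rows for row in rows]


rank0 = [[1, 2]]            # shape (1, 2)
rank1 = [[3, 4], [5, 6]]    # shape (2, 2)

gathered = allgather_sketch([rank0, rank1])
print(gathered)
```

The first dimension of the result (3) is the sum of the first dimensions of the inputs (1 and 2), while the second dimension is unchanged.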

tensorflow.python.ipu.distributed.allreduce(tensor, op=None)

Perform an allreduce on a tf.Tensor or tf.IndexedSlices.

This function performs a bandwidth-optimal ring allreduce on the input tensor. If the input is a tf.IndexedSlices, the function instead does an allgather on the values and the indices, effectively doing an allreduce on the represented tensor.

Parameters

tensor – tf.Tensor, tf.Variable, or tf.IndexedSlices to reduce. The shape of the input must be identical across all ranks.
op – The reduction operation to combine tensors across different ranks. Defaults to Average if None is given.

Returns

A tensor of the same shape and type as tensor, reduced across all processes using op.
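
The difference between an averaging and a summing reduction can be shown with flat Python lists standing in for tensors. This is a single-process sketch of the result only, not of the ring algorithm. `allreduce_sketch` and the string values `"average"` and `"sum"` are hypothetical stand-ins for the real op values.

```python
def allreduce_sketch(per_rank_tensors, op="average"):
    """Combine identically shaped tensors element by element."""
    if len({len(tensor) for tensor in per_rank_tensors}) != 1:
        raise ValueError("the shape must be identical across all ranks")
    summed = [sum(values) for values in zip(*per_rank_tensors)]
    if op == "sum":
        return summed
    if op == "average":
        return [value / len(per_rank_tensors) for value in summed]
    raise ValueError(f"unsupported op: {op}")


rank0 = [1.0, 2.0]
rank1 = [3.0, 6.0]

averaged = allreduce_sketch([rank0, rank1])
summed = allreduce_sketch([rank0, rank1], op="sum")
print(averaged, summed)
```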

tensorflow.python.ipu.distributed.broadcast(tensor, root_rank, name=None)

An op which broadcasts the input tensor on root rank to the same input tensor on all other Horovod processes.

The broadcast operation is keyed by the name of the op. The tensor type and shape must be the same on all Horovod processes for a given name. The broadcast will not start until all processes are ready to send and receive the tensor.

Returns: A tensor of the same shape and type as tensor, with the value broadcast from root rank.

class tensorflow.python.ipu.distributed.ipu_horovod_strategy.IPUHorovodExtended(container_strategy, cluster_resolver, ipu_device, variables_on_host)

__init__(container_strategy, cluster_resolver, ipu_device, variables_on_host)

class tensorflow.python.ipu.distributed.popdist_strategy.IPUMirroredVariable(*args, **kwargs)

class tensorflow.python.ipu.distributed.popdist_strategy.IPUSyncOnReadVariable(*args, **kwargs)

class tensorflow.python.ipu.distributed.popdist_strategy.PopDistExtendedV1(container_strategy, cluster_resolver, ipu_device, add_ipu_cross_replica_reductions)

__init__(container_strategy, cluster_resolver, ipu_device, add_ipu_cross_replica_reductions)

read_var(var): Read the aggregate value of a replica-local variable.

class tensorflow.python.ipu.distributed.popdist_strategy.PopDistStrategy(ipu_device='/device:IPU:0', add_ipu_cross_replica_reductions=True)

This is a distribution strategy for multi-replica distribution that uses compiled communications with GCL for reductions over IPU links and gateway links, while using Horovod for broadcasting of the initial values of variables to all processes, or when a reduction is requested with a CPU as the current device.

This is the recommended distribution strategy when using PopDist and PopRun. The GCL reductions will then be performed across all the global replicas in the application.

__init__(ipu_device='/device:IPU:0', add_ipu_cross_replica_reductions=True)

update_ipu_config(config)

Update the given IPU configuration with the multi-replica distribution options.

Parameters: config – The IPUConfig instance to update.
Returns: The IPUConfig instance.

Note

Both tensorflow.python.ipu.distributed.popdist_strategy.PopDistStrategy and tensorflow.python.ipu.distributed.ipu_horovod_strategy.IPUHorovodStrategy are still available through the deprecated module tensorflow.python.ipu.horovod.

21.11. Serving utilities

class tensorflow.python.ipu.serving.Tensor(op, value_index, dtype)

Represents one of the outputs of an Operation.

A Tensor is a symbolic handle to one of the outputs of an Operation. It does not hold the values of that operation’s output, but instead provides a means of computing those values in a TensorFlow tf.compat.v1.Session.

This class has two primary purposes:

A Tensor can be passed as an input to another Operation. This builds a dataflow connection between operations, which enables TensorFlow to execute an entire Graph that represents a large, multi-step computation.
After the graph has been launched in a session, the value of the Tensor can be computed by passing it to tf.Session.run. t.eval() is a shortcut for calling tf.compat.v1.get_default_session().run(t).

In the following example, c, d, and e are symbolic Tensor objects, whereas result is a numpy array that stores a concrete value:

```python
# Build a dataflow graph.
c = tf.constant([[1.0, 2.0], [3.0, 4.0]])
d = tf.constant([[1.0, 1.0], [0.0, 1.0]])
e = tf.matmul(c, d)

# Construct a Session to execute the graph.
sess = tf.compat.v1.Session()

# Execute the graph and store the value that e represents in result.
result = sess.run(e)
```

consumers()

Returns a list of `Operation`s that consume this tensor.

Returns: A list of `Operation`s.

property device: The name of the device on which this tensor will be produced, or None.

property dtype: The DType of elements in this tensor.

eval(feed_dict=None, session=None)

Evaluates this tensor in a Session.

Calling this method will execute all preceding operations that produce the inputs needed for the operation that produces this tensor.

N.B. Before invoking Tensor.eval(), its graph must have been launched in a session, and either a default session must be available, or session must be specified explicitly.

Parameters

feed_dict – A dictionary that maps Tensor objects to feed values. See tf.Session.run for a description of the valid feed values.
session – (Optional.) The Session to be used to evaluate this tensor. If none, the default session will be used.

Returns

A numpy array corresponding to the value of this tensor.

experimental_ref()

Returns a hashable reference object to this Tensor.

Warning: Experimental API that could be changed or removed.

The primary use case for this API is to put tensors in a set or dictionary. Tensors cannot be put in a set or dictionary directly because tensor.__hash__() is no longer available starting with TensorFlow 2.0.

```python
import tensorflow as tf

x = tf.constant(5)
y = tf.constant(10)
z = tf.constant(10)

# The following will raise an exception starting with TensorFlow 2.0:
# TypeError: Tensor is unhashable if Tensor equality is enabled.
tensor_set = {x, y, z}
tensor_dict = {x: 'five', y: 'ten', z: 'ten'}
```

Instead, we can use tensor.experimental_ref().

```python
tensor_set = {x.experimental_ref(),
              y.experimental_ref(),
              z.experimental_ref()}

print(x.experimental_ref() in tensor_set)
# ==> True

tensor_dict = {x.experimental_ref(): 'five',
               y.experimental_ref(): 'ten',
               z.experimental_ref(): 'ten'}

print(tensor_dict[y.experimental_ref()])
# ==> ten
```

Also, the reference object provides .deref() function that returns the original Tensor.

```python
x = tf.constant(5)
print(x.experimental_ref().deref())
# ==> tf.Tensor(5, shape=(), dtype=int32)
```
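
The idea behind a hashable reference can be sketched without TensorFlow. In the following illustration, `FakeTensor` and `Ref` are hypothetical classes: `FakeTensor` overloads equality and therefore loses its hash, and `Ref` hashes by object identity. This shows the general pattern only and makes no claim about how TensorFlow implements its reference object.

```python
class FakeTensor:
    """Stand-in for an object whose `==` is overloaded."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Defining __eq__ without __hash__ makes instances unhashable.
        return self.value == other.value


class Ref:
    """Hashable wrapper that compares and hashes by object identity."""

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def deref(self):
        return self._wrapped

    def __hash__(self):
        return id(self._wrapped)

    def __eq__(self, other):
        return isinstance(other, Ref) and self._wrapped is other._wrapped


x, y, z = FakeTensor(5), FakeTensor(10), FakeTensor(10)

try:
    {x}
    hashable = True
except TypeError:
    hashable = False

refs = {Ref(x), Ref(y), Ref(z)}
print(hashable, len(refs), Ref(x) in refs)
```

Because the wrapper uses identity, `y` and `z` stay distinct members of the set even though they hold the same value.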

get_shape(): Alias of Tensor.shape.

property graph: The Graph that contains this tensor.

property name: The string name of this tensor.

property op: The Operation that produces this tensor as an output.

set_shape(shape)

Updates the shape of this tensor.

This method can be called multiple times, and will merge the given shape with the current shape of this tensor. It can be used to provide additional information about the shape of this tensor that cannot be inferred from the graph alone. For example, this can be used to provide additional information about the shapes of images:

```python
_, image_data = tf.compat.v1.TFRecordReader(...).read(...)
image = tf.image.decode_png(image_data, channels=3)

# The height and width dimensions of image are data dependent, and
# cannot be computed without executing the op.
print(image.shape)
# ==> TensorShape([Dimension(None), Dimension(None), Dimension(3)])

# We know that each image in this dataset is 28 x 28 pixels.
image.set_shape([28, 28, 3])
print(image.shape)
# ==> TensorShape([Dimension(28), Dimension(28), Dimension(3)])
```

NOTE: This shape is not enforced at runtime. Setting incorrect shapes can result in inconsistencies between the statically-known graph and the runtime value of tensors. For runtime validation of the shape, use tf.ensure_shape instead.

Parameters: shape – A TensorShape representing the shape of this tensor, a TensorShapeProto, a list, a tuple, or None.
Raises: ValueError – If shape is not compatible with the current shape of this tensor.
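
The merging behaviour can be sketched with plain lists in which None marks an unknown dimension. `merge_shapes` is a hypothetical helper that illustrates the rule described above, assuming both shapes have a known rank. It is not TensorFlow's implementation.

```python
def merge_shapes(current, new):
    """Combine two shapes, filling unknown (None) dimensions where possible."""
    if len(current) != len(new):
        raise ValueError("shapes have different ranks")
    merged = []
    for known, given in zip(current, new):
        if known is None:
            merged.append(given)
        elif given is None or given == known:
            merged.append(known)
        else:
            raise ValueError(f"dimension {known} is not compatible with {given}")
    return merged


# Unknown height and width are filled in by the more specific shape.
merged = merge_shapes([None, None, 3], [28, 28, 3])

# A conflicting known dimension is rejected.
try:
    merge_shapes([28, 28, 3], [32, 28, 3])
    conflict_detected = False
except ValueError:
    conflict_detected = True

print(merged, conflict_detected)
```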

property shape

Returns the TensorShape that represents the shape of this tensor.

The shape is computed using shape inference functions that are registered in the Op for each Operation. See tf.TensorShape for more details of what a shape represents.

The inferred shape of a tensor is used to provide shape information without having to launch the graph in a session. This can be used for debugging, and providing early error messages. For example:

```python
c = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(c.shape)
# ==> TensorShape([Dimension(2), Dimension(3)])

d = tf.constant([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

print(d.shape)
# ==> TensorShape([Dimension(4), Dimension(2)])

# Raises a ValueError, because c and d do not have compatible
# inner dimensions.
e = tf.matmul(c, d)

f = tf.matmul(c, d, transpose_a=True, transpose_b=True)

print(f.shape)
# ==> TensorShape([Dimension(3), Dimension(4)])
```

In some cases, the inferred shape may have unknown dimensions. If the caller has additional information about the values of these dimensions, Tensor.set_shape() can be used to augment the inferred shape.

Returns: A TensorShape representing the shape of this tensor.

property value_index: The index of this tensor in the outputs of its Operation.

tensorflow.python.ipu.serving.export_pipeline(computational_stages, export_dir, iterations, inputs=None, device_mapping=None, pipeline_schedule=None, poplar_options=None, name=None, predict_step_signature=None, input_dataset=None, variable_initializer=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False, checkpoint_restore_dir=None)

Create a pipelined SavedModel in export_dir for TensorFlow Serving.

Create a pipeline op using computational_stages, add an infeed for the inputs and an outfeed for the outputs, freeze any variables into constants and write a SavedModel containing an IPU runtime function (preceded by an optional preprocessing step) and Poplar executable.

SavedModel flow:
predict_step = computational_stages[0]
preprocessing_step (optional, CPU) -> predict_step (IPU) -> postprocessing_step (optional, CPU) -> result

Parameters

computational_stages (list) – A list of python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.
export_dir (str) – Path to the directory where the SavedModel will be written.
iterations (int) – The number of times each computational stage will be executed during the execution of the pipeline. It can also be considered as the pipeline depth.
inputs (list, optional) – Arguments passed to the first computational stage.
device_mapping (list, optional) – If provided, a list of length equal to the number of computational stages. An element at index i in the list represents which IPU the computational_stages[i] should reside on. This can be used to make sure computational stages which share tf.Variable objects are resident on the same IPU.
pipeline_schedule (PipelineSchedule, optional) – Which scheduling algorithm to use for pipeline lowering. Defaults to PipelineSchedule.Grouped.
poplar_options (list, optional) – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grain control of the Poplar options for a given forward propagation computational stage.
name (str, optional) – Name of this pipeline.
predict_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the first computational stage. If preprocessing_step is not provided and input_dataset is provided, this argument should be None. If preprocessing_step is provided or preprocessing_step and input_dataset are not provided and first computational stage is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from the first computational stage.
input_dataset (tf.Dataset, optional) – Dataset from which SavedModel input_signature will be inferred.

variable_initializer (Callable, optional) –

A function that initializes variables. Takes a tf.Session as the only argument. For example, this function allows restoring model’s variables from a checkpoint:

def variable_initializer(session):
  saver = tf.train.Saver()
  ipu.utils.move_variable_initialization_to_cpu()
  init = tf.global_variables_initializer()
  session.run(init)
  saver.restore(session, 'path/to/checkpoint')

output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel’s SignatureDef. If not provided, outputs will be named: output_0, output_1, … output_n.
preprocessing_step (Callable or tf.function, optional) – Function that runs the preprocessing step on the CPU. Function is called just before the first computational stage. preprocessing_step and compiled pipelined computational stages are exported together. preprocessing_step output will be directly passed to the input queue of the first computational stage.
preprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the preprocessing_step function. If preprocessing_step and input_dataset are provided, this argument should be None. If preprocessing_step is provided and input_dataset is not provided and preprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from preprocessing_step.
postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. Function is called after predict_step. postprocessing_step and predict_step are exported together. Tensors from the predict_step output queue are postprocessing_step inputs.
postprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the postprocessing_step function. If postprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from postprocessing_step.
purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and if the target directory is not empty, the function fails with an error.
checkpoint_restore_dir (str) – Path to saved checkpoint, where the model Variables are to be restored. To be used with preprocessing only.

Returns

A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel’s assets subfolder.

Return type

function

Raises

ValueError – If export_dir is not an empty directory.
TypeError – If input_dataset is not a tf.Dataset or NoneType.
TypeError – If predict_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.
TypeError – If preprocessing_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.
TypeError – If postprocessing_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.
ValueError – If predict_step_signature is an empty tuple or list.
ValueError – If preprocessing_step_signature is an empty tuple or list.
ValueError – If postprocessing_step_signature is an empty tuple or list.
ValueError – If preprocessing_step is not provided and both predict_step_signature and input_dataset are provided.
ValueError – If none of preprocessing_step, predict_step_signature and input_dataset is provided and predict_step is not a tf.function, or is a tf.function but no input_signature is provided.
ValueError – If preprocessing_step, preprocessing_step_signature and input_dataset are all provided.
ValueError – If preprocessing_step is provided, neither preprocessing_step_signature nor input_dataset is provided, and preprocessing_step is not a tf.function, or is a tf.function but no input_signature is provided.
ValueError – If preprocessing_step, predict_step_signature are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.
ValueError – If postprocessing_step is provided and postprocessing_step_signature is not provided and postprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.

tensorflow.python.ipu.serving.export_single_step(predict_step, export_dir, iterations, predict_step_signature=None, input_dataset=None, variable_initializer=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False, checkpoint_restore_dir=None)

Create a SavedModel in export_dir for TensorFlow Serving.

Wrap predict_step inside a while loop, add an infeed for the inputs and an outfeed for the outputs, freeze any variables into constants and write a SavedModel containing an IPU runtime function and Poplar executable.

SavedModel flow: preprocessing_step (optional, CPU) -> predict_step (IPU) -> postprocessing_step (optional, CPU) -> result
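
The ordering of the three stages can be pictured with plain Python callables. This is only a sketch of the data flow: the function bodies below are hypothetical stand-ins, and in a real export preprocessing_step and postprocessing_step run on the CPU while predict_step is compiled for the IPU.

```python
# Hypothetical stand-ins for the three stages. The names mirror the
# export_single_step arguments, but the bodies are made up for illustration.
def preprocessing_step(raw):
    # CPU stage (optional): scale the raw values into [0, 1].
    return [x / 255.0 for x in raw]


def predict_step(features):
    # IPU stage in a real export; here it is just a sum over the features.
    return sum(features)


def postprocessing_step(prediction):
    # CPU stage (optional): round the model output.
    return round(prediction, 2)


def run_flow(raw):
    # preprocessing_step -> predict_step -> postprocessing_step -> result
    return postprocessing_step(predict_step(preprocessing_step(raw)))


result = run_flow([0, 51, 204])
print(result)
```

The output of each stage is the input of the next, which is why the signature of the first stage in the chain determines the input signature of the exported SavedModel.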

Parameters

predict_step (Callable or tf.function) – Function to compile for the IPU platform and export.
export_dir (str) – Path to the directory where the SavedModel will be written.
iterations (int) – Number of loop iterations.
predict_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the predict_step function. If preprocessing_step is not provided and input_dataset is provided, this argument should be None. If preprocessing_step is provided or preprocessing_step and input_dataset are not provided and predict_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from predict_step.
input_dataset (tf.Dataset, optional) – Dataset from which SavedModel’s input_signature will be inferred. If preprocessing_step is not provided and predict_step_signature is provided, this argument should be None. If preprocessing_step and preprocessing_step_signature are provided, this argument should be None.

variable_initializer (Callable, optional) –

A function that initializes variables. Takes a tf.Session as the only argument. For example, this function allows restoring model’s variables from a checkpoint:

def variable_initializer(session):
  saver = tf.train.Saver()
  ipu.utils.move_variable_initialization_to_cpu()
  init = tf.global_variables_initializer()
  session.run(init)
  saver.restore(session, 'path/to/checkpoint')

output_names (str or list, optional) – Output name or list of names for the outputs in the SavedModel’s SignatureDef. If not provided, outputs will be named: output_0, output_1 and so on.
preprocessing_step (Callable or tf.function, optional) – Function that runs preprocessing step on the CPU device. Function is called just before predict_step. preprocessing_step and predict_step are exported together. preprocessing_step output will be directly passed to the predict_step input queue.
preprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the preprocessing_step function. If preprocessing_step and input_dataset are provided, this argument should be None. If preprocessing_step is provided and input_dataset is not provided and preprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from preprocessing_step.
postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. Function is called after predict_step. postprocessing_step and predict_step are exported together. Tensors from the predict_step output queue are postprocessing_step inputs.
postprocessing_step_signature (list or tuple, optional) – A sequence of tf.TensorSpec objects that describe the input arguments of the postprocessing_step function. If postprocessing_step is a tf.function and input_signature was specified during tf.function creation then this argument can be None and the signature will be captured directly from postprocessing_step.
purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and if the target directory is not empty, the function fails with an error.
checkpoint_restore_dir (str) – Path to saved checkpoint, for which the model Variables are to be restored.

Returns

A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel’s assets subfolder.

Return type

function

Raises

ValueError – If export_dir is not an empty directory.
TypeError – If input_dataset is not a tf.Dataset or NoneType.
TypeError – If predict_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.
TypeError – If preprocessing_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.
TypeError – If postprocessing_step_signature is neither a tuple, a list of tf.TensorSpec objects nor a NoneType.
ValueError – If predict_step_signature is an empty tuple or list.
ValueError – If preprocessing_step_signature is an empty tuple or list.
ValueError – If postprocessing_step_signature is an empty tuple or list.
ValueError – If preprocessing_step is not provided and both predict_step_signature and input_dataset are provided.
ValueError – If preprocessing_step, predict_step_signature and input_dataset are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.
ValueError – If preprocessing_step, preprocessing_step_signature and input_dataset are all provided.
ValueError – If preprocessing_step is provided, neither preprocessing_step_signature nor input_dataset is provided, and preprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.
ValueError – If preprocessing_step and predict_step_signature are not provided and predict_step is not a tf.function or is a tf.function but no input_signature is provided.
ValueError – If postprocessing_step is provided and postprocessing_step_signature is not provided and postprocessing_step is not a tf.function or is a tf.function but no input_signature is provided.
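
Some of the argument combinations listed above are mutually exclusive, which can be easier to see as code. The following is a minimal sketch that restates three of the ValueError rules in plain Python. The helper check_signature_arguments is hypothetical and is not part of the IPU API; the library performs its own, more complete validation.

```python
def check_signature_arguments(preprocessing_step=None,
                              predict_step_signature=None,
                              preprocessing_step_signature=None,
                              input_dataset=None):
    """Hypothetical helper mirroring three of the ValueError rules above."""
    # Rule: a signature, if given, must not be an empty tuple or list.
    named_signatures = (
        ("predict_step_signature", predict_step_signature),
        ("preprocessing_step_signature", preprocessing_step_signature),
    )
    for name, signature in named_signatures:
        if signature is not None and len(signature) == 0:
            raise ValueError(f"{name} must not be empty")

    # Rule: without preprocessing_step, predict_step_signature and
    # input_dataset must not both be provided.
    if (preprocessing_step is None
            and predict_step_signature is not None
            and input_dataset is not None):
        raise ValueError(
            "provide predict_step_signature or input_dataset, not both")

    # Rule: preprocessing_step, preprocessing_step_signature and
    # input_dataset must not all be provided.
    if (preprocessing_step is not None
            and preprocessing_step_signature is not None
            and input_dataset is not None):
        raise ValueError(
            "provide preprocessing_step_signature or input_dataset, not both")


# Placeholder objects stand in for tf.TensorSpec lists and datasets.
check_signature_arguments(predict_step_signature=["spec"])  # accepted
```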

21.12. Datasets

21.12.1. Dataset benchmarking

tensorflow.python.ipu.dataset_benchmark.dataset_benchmark(dataset, number_of_epochs, elements_per_epochs, print_stats=True, apply_options=True, do_memcpy=True)

Allows the user to benchmark performance of a tf.data.Dataset.

Parameters

dataset – An instance of tf.data.Dataset which will be benchmarked.
number_of_epochs – The number of epochs this dataset will be run for.
elements_per_epochs – The number of elements there are in each epoch.
print_stats – Whether to print statistics about the performance to the console.
apply_options – Whether to apply optimization options which can improve the dataset performance.
do_memcpy – Whether to perform a memcpy operation which simulates a dataset buffer being copied to a Poplar managed buffer.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

elements_processed - number of elements processed.

total_bytes_processed - total number of bytes which were processed.

time_elapsed - the time it took (in seconds) for the epoch to complete.

elements_per_second - number of elements processed per second.

bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed with the json module from the Python standard library (see https://docs.python.org/3/library/json.html).
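
As a sketch of how the returned string might be inspected, the example below parses a hand-written payload with json.loads. Only the metric names come from the list above. The top-level "epochs" key and all of the numbers are assumptions made for illustration, so check the layout of the real output before relying on it.

```python
import json

# Hypothetical payload: the "epochs" key and every value are assumptions.
# Only the metric names are taken from the documentation above.
json_string = """
{"epochs": [
  {"elements_processed": 1000, "total_bytes_processed": 4000000,
   "time_elapsed": 0.5, "elements_per_second": 2000.0, "bandwidth": 0.008},
  {"elements_processed": 1000, "total_bytes_processed": 4000000,
   "time_elapsed": 0.4, "elements_per_second": 2500.0, "bandwidth": 0.01}
]}
"""

stats = json.loads(json_string)

# Pick the epoch with the highest throughput and sum the elapsed time.
best = max(stats["epochs"], key=lambda epoch: epoch["elements_per_second"])
total_time = sum(epoch["time_elapsed"] for epoch in stats["epochs"])

print(f"best throughput: {best['elements_per_second']:.0f} elements/s")
print(f"total time: {total_time:.1f} s")
```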

Raises

TypeError – if dataset is not an instance of tf.data.Dataset.
ValueError – if number_of_epochs or elements_per_epochs is less than 1.
InvalidArgumentError – if dataset contains a shape with a dimension of size 0.

tensorflow.python.ipu.dataset_benchmark.infeed_benchmark(infeed_queue, number_of_epochs, elements_per_epochs, print_stats=True, do_memcpy=True)

Allows the user to benchmark performance of an ipu.ipu_infeed_queue.IPUInfeedQueue.

Parameters

infeed_queue – An instance of ipu.ipu_infeed_queue.IPUInfeedQueue which will be benchmarked.
number_of_epochs – The number of epochs this infeed queue will be run for.
elements_per_epochs – The number of elements there are in each epoch.
print_stats – Whether to print statistics about the performance to the console.
do_memcpy – Whether to perform a memcpy operation which simulates a dataset buffer being copied to a Poplar managed buffer.

Returns

A JSON string with performance statistics, which records the following metrics every epoch:

elements_processed - number of elements processed.

total_bytes_processed - total number of bytes which were processed.

time_elapsed - the time it took (in seconds) for the epoch to complete.

elements_per_second - number of elements processed per second.

bandwidth - the bandwidth achieved, measured in GB/s.

The returned JSON string can be parsed with the json module from the Python standard library (see https://docs.python.org/3/library/json.html).

Raises

TypeError – if infeed_queue is not an instance of ipu.ipu_infeed_queue.IPUInfeedQueue.
ValueError – if number_of_epochs or elements_per_epochs is less than 1.
InvalidArgumentError – if infeed_queue contains a shape with a dimension of size 0.

21.12.2. Dataset wrappers

class tensorflow.python.ipu.data.ops.dataset_ops.BufferDataset(input_dataset, buffer_size)

A Dataset which makes sure there is a multiple of buffer_size number of elements available.

__init__(input_dataset, buffer_size)

A Dataset which makes sure there is a multiple of buffer_size number of elements available.

Parameters

input_dataset – The input dataset.
buffer_size – The number of dataset elements which will be available.

21.13. Estimators

21.13.1. IPUEstimator

class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None, train_batch_size=None, eval_batch_size=None, predict_batch_size=None)

Estimator with IPU support.

IPUEstimator handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. It also provides a simple way to use multiple IPUs in the form of either data parallelism or model parallelism.

The data parallelism is based on graph replication. One batch from the dataset returned by the input_fn (of size batch_size) is sent to each replica, giving an effective batch size of num_replicas * batch_size. The only change needed to the model_fn is that the optimizer should be wrapped in a CrossReplicaOptimizer in order to average the gradients across the replicas.

This can also be combined with distributed multi-worker training using the IPUMultiWorkerStrategy, giving a total effective batch size of num_workers * num_replicas * batch_size.

The desired global batch size can be passed as train_batch_size, eval_batch_size and predict_batch_size, and the local batch size will be calculated based on the number of replicas and the number of distributed workers and passed to the input_fn and model_fn in params['batch_size']. If the input_fn returns a dataset batched with dataset.batch(params['batch_size'], drop_remainder=True), the global batch size will be as desired.
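
The relationship between the global and local batch sizes can be written out as simple arithmetic. The helper below is a hypothetical illustration of that calculation and is not part of the IPUEstimator API. It assumes the global batch size is split evenly across num_replicas * num_workers, as the description above implies.

```python
def local_batch_size(global_batch_size, num_replicas, num_workers=1):
    """Hypothetical helper: split a global batch size across replicas and workers."""
    divisor = num_replicas * num_workers
    if global_batch_size % divisor != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"num_replicas * num_workers = {divisor}")
    return global_batch_size // divisor


# Two workers, each with four replicas, and a desired global batch size of 64.
per_replica = local_batch_size(64, num_replicas=4, num_workers=2)

# Multiplying back out recovers the global batch size.
effective = 2 * 4 * per_replica
print(per_replica, effective)
```

In this sketch per_replica plays the role of params['batch_size'], the value that the input_fn would pass to dataset.batch.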

The model parallelism supported by this class is basic sharding. Consider using the IPUPipelineEstimator to get pipelined execution.

For efficiency, it supports compiling a graph that contains multiple iterations of the training/prediction/evaluation loop, which will be fully executed on the IPU before yielding back to the TensorFlow Python runtime on the CPU.

See https://tensorflow.org/guide/estimators for general information about estimators.

Parameters

model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.
model_dir – Directory to save model parameters, graph, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.
config – A RunConfig object.
params – dict of hyperparameters that will be passed into model_fn. Keys are names of parameters, values are basic Python types.
warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm-start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm-started, and it is assumed that vocabularies and tf.Tensor names are unchanged.
train_batch_size – If not None, an int representing the global training batch size. This global batch size is transformed to a local batch size passed as params['batch_size'] to the input_fn and model_fn during training. Must be divisible by the number of replicas multiplied by the number of distributed workers.
eval_batch_size – If not None, an int representing the global evaluation batch size. Same behaviour as train_batch_size, only during evaluation.
predict_batch_size – If not None, an int representing the global prediction batch size. Same behaviour as train_batch_size, only during prediction.

class tensorflow.python.ipu.ipu_estimator.IPUEstimatorSpec(mode, predictions=None, loss=None, train_op=None, eval_metric_ops=None, eval_metrics=None, host_call=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None)

Ops and objects returned from a model_fn and passed to IPUEstimator.

This is very similar to EstimatorSpec, with the addition of two extra arguments: eval_metrics and host_call. If neither of those arguments is needed, an EstimatorSpec can be passed to the IPUEstimator instead.

eval_metrics is a tuple of (function, tensors), where tensors is either a list of tf.Tensor or a dict from strings to tf.Tensor, that is passed to the function. The function runs on the CPU and returns a dict of metrics. The tensors are transferred from the IPU to the CPU host and passed to the function.

Exactly one of eval_metrics and eval_metric_ops must be provided during evaluation. The major difference between the two is that while the eval_metric_ops will execute directly on the IPU, the eval_metrics will execute on the CPU host using the provided function. Example:

def my_metrics_fn(features, labels):
  return {
      "accuracy": tf.metrics.accuracy(labels, features),
      "precision": tf.metrics.precision(labels, features),
      "recall": tf.metrics.recall(labels, features),
  }

eval_metrics = (my_metrics_fn, [features, labels])
spec = IPUEstimatorSpec(mode, loss=loss, eval_metrics=eval_metrics)

host_call is a tuple of a function and a list of tensors to pass to that function. host_call only works for training and is executed on the CPU for every training step. The tensors are transferred from the IPU to the CPU host and passed to the function.

This functionality can be used for e.g. doing all-reduce of the gradients and weight updates on the host during distributed training with the IPUMultiWorkerStrategy. Example:

def my_host_fn(*host_gradients):
  # This will all-reduce the gradients and update the weights on the host.
  return optimizer.apply_gradients(zip(host_gradients, variables))

train_op = tf.identity(loss)
grads_and_vars = optimizer.compute_gradients(loss, var_list=variables)
gradients = [g for (g, _) in grads_and_vars]
host_call = (my_host_fn, gradients)

spec = IPUEstimatorSpec(mode=mode,
                        loss=loss,
                        train_op=train_op,
                        host_call=host_call)

See full example: Distributed training.

The various hooks (training_hooks, evaluation_hooks, prediction_hooks) support instances of tf.estimator.SessionRunHook. To log tensor values from within the model_fn, use the IPULoggingTensorHook.

For documentation of the remaining arguments, see EstimatorSpec.

class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None, train_batch_size=None, eval_batch_size=None, predict_batch_size=None)

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters: name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.
Returns: A string which is the path of the directory that contains evaluation metrics.

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters

input_fn –
A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where
- features is a tf.Tensor or a dictionary of string feature name to Tensor
- labels is a Tensor or a dictionary of string label name to Tensor
Both features and labels are consumed by model_fn.
steps – Number of steps for which to evaluate model.
hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.
checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.
name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.

Returns

A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.

experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)

Exports a SavedModel with tf.MetaGraphDefs for each requested mode.

For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensors. Next, this method calls the Estimator’s model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.

For the variables and tf.MetaGraphDefs, this method creates a timestamped export directory below export_dir_base, and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.

For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

For training and evaluation, the train_op is stored in an extra collection, and loss, metrics, and predictions are included in a SignatureDef for the mode in question.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.
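
To make the key/value convention of assets_extra concrete, the sketch below copies files according to such a dict. It is not the TensorFlow implementation: copy_extra_assets is a hypothetical helper, and the temporary directory layout is made up for the example. It only illustrates that keys are destinations relative to assets.extra and values are full source paths.

```python
import os
import shutil
import tempfile


def copy_extra_assets(assets_extra, export_dir):
    """Hypothetical helper: copy each source file to its place under assets.extra."""
    for dest_relative, source_path in assets_extra.items():
        dest_path = os.path.join(export_dir, "assets.extra", dest_relative)
        # Create intermediate directories for nested destination paths.
        os.makedirs(os.path.dirname(dest_path), exist_ok=True)
        shutil.copy(source_path, dest_path)


with tempfile.TemporaryDirectory() as tmp:
    # Create a small source file to copy.
    source = os.path.join(tmp, "my_asset_file.txt")
    with open(source, "w") as f:
        f.write("vocabulary\n")

    export_dir = os.path.join(tmp, "export")
    # Copy a single file without renaming it.
    copy_extra_assets({"my_asset_file.txt": source}, export_dir)
    copied = os.listdir(os.path.join(export_dir, "assets.extra"))

print(copied)
```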

Parameters

export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.
input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.
assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.
as_text – whether to write the SavedModel proto in text format.
checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.

Returns

The string path to the exported directory.

Raises

ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.

export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')

Exports inference graph as a SavedModel into the given dir.

For a detailed guide, see Using SavedModel with Estimators (https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for full docs.

Parameters

export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.
serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.
assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.
as_text – whether to write the SavedModel proto in text format.
checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.
experimental_mode – tf.estimator.ModeKeys value indicating which mode will be exported. Note that this feature is experimental.

Returns

The string path to the exported directory.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

export_savedmodel(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, strip_default_attrs=False)

Exports inference graph as a SavedModel into the given dir. (deprecated)

Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: This function has been renamed, use export_saved_model instead.

For a detailed guide, see Using SavedModel with Estimators (https://tensorflow.org/guide/saved_model#using_savedmodel_with_estimators).

This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensors, and then calling this Estimator’s model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.

The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutputs, and the inputs are always the input receivers provided by the serving_input_receiver_fn.

Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.

Parameters

export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModels.
serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.
assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.
as_text – whether to write the SavedModel proto in text format.
checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.
strip_default_attrs – Boolean. If True, default-valued attributes will be removed from the NodeDefs. For a detailed guide, see Stripping Default-Valued Attributes (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/saved_model/README.md#stripping-default-valued-attributes).

Returns

The string path to the exported directory.

Raises

ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.

get_variable_names()

Returns list of all variable names in this model.

Returns: List of names.
Raises: ValueError – If the Estimator has not produced a checkpoint yet.

get_variable_value(name)

Returns value of the variable given by name.

Parameters: name – string or a list of strings, name of the tensor.
Returns: Numpy array - value of the tensor.
Raises: ValueError – If the Estimator has not produced a checkpoint yet.

latest_checkpoint()

Finds the filename of the latest saved checkpoint file in model_dir.

Returns: The full path to the latest checkpoint or None if no checkpoint was found.

property model_fn

Returns the model_fn which is bound to self.params.

Returns: The model_fn with the following signature: def model_fn(features, labels, mode, config)

predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)

Yields predictions for given features.

Parameters

input_fn –
A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:
- features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.
- A tuple, in which case the first item is extracted as features.
predict_keys – list of str, names of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used, then the rest of the predictions will be filtered from the dictionary. If None, returns all.
hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.
checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.
yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.
num_predictions – If not None, the generator will raise StopIteration after yielding this number of predictions. This allows draining the generator by using list(predictions). If None, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than are actually generated, instead of raising the StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using list(predictions); you have to consume the expected number of elements yourself, e.g. using [next(predictions) for _ in range(num_predictions)].

Yields

Evaluated values of predictions tensors.
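
The two ways of consuming the generator described under num_predictions can be illustrated without an IPU. The following is a minimal sketch in plain Python. fake_predict is a hypothetical stand-in for the generator returned by predict and is not part of the IPU API. Note that the stand-in simply keeps yielding when unbounded, whereas the real generator triggers a fatal error if you over-consume it.

```python
import itertools


def fake_predict(num_predictions=None):
    """Hypothetical stand-in for the generator returned by predict().

    Yields dicts shaped like a predictions dictionary. When num_predictions
    is None the generator never raises StopIteration, mimicking the
    unbounded case described above.
    """
    counter = itertools.count()
    if num_predictions is not None:
        counter = itertools.islice(counter, num_predictions)
    for i in counter:
        yield {"prediction": i}


# Bounded generator: safe to drain with list().
bounded = list(fake_predict(num_predictions=3))

# Unbounded generator: consume exactly the number of elements you expect.
unbounded = fake_predict()
expected = 3
consumed = [next(unbounded) for _ in range(expected)]
```

Both patterns produce the same three elements here; the difference is only in who is responsible for stopping.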

train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)

Trains a model given training data input_fn.

Parameters

input_fn –
A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where
- features is a tf.Tensor or a dictionary of string feature name to Tensor
- labels is a Tensor or a dictionary of string label name to Tensor
Both features and labels are consumed by model_fn.
hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.
steps – Number of steps for which to train the model. steps works incrementally: if you call train(steps=10) twice, training runs for 20 steps in total. If you do not want incremental behaviour, set max_steps instead. If steps is set, max_steps must be None.
max_steps – Total number of steps for which to train the model. If set, steps must be None. Two calls to train(steps=100) mean 200 training iterations, whereas two calls to train(max_steps=100) mean that the second call does not run any iterations, since the first call already completed all 100 steps.
saving_listeners – List of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoints are saved.

Returns

self, for chaining.
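
The difference between steps and max_steps comes down to simple arithmetic on the global step. The sketch below models only that arithmetic in plain Python. StepCounter is a hypothetical class for illustration, not the Estimator implementation, and it ignores the case where neither argument is set.

```python
class StepCounter:
    """Hypothetical model of how train() advances the global step.

    This is not the Estimator implementation; it only reproduces the
    steps / max_steps arithmetic described above.
    """

    def __init__(self):
        self.global_step = 0

    def train(self, steps=None, max_steps=None):
        if steps is not None and max_steps is not None:
            raise ValueError("Only one of steps and max_steps may be set.")
        if steps is not None:
            # Incremental: always run this many additional steps.
            self.global_step += steps
        elif max_steps is not None:
            # Absolute: run only until the global step reaches max_steps.
            self.global_step = max(self.global_step, max_steps)
        return self  # Returned for chaining, as train() does.


# Two incremental calls accumulate.
incremental = StepCounter().train(steps=10).train(steps=10)

# The second absolute call has nothing left to do.
absolute = StepCounter().train(max_steps=100).train(max_steps=100)
```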

class tensorflow.python.ipu.ipu_estimator.IPUEstimatorSpec(mode, predictions=None, loss=None, train_op=None, eval_metric_ops=None, eval_metrics=None, host_call=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None)

Ops and objects returned from a model_fn and passed to IPUEstimator.

This is very similar to EstimatorSpec, with the addition of two extra arguments: eval_metrics and host_call. If neither of those arguments is needed, an EstimatorSpec can be passed to the IPUEstimator instead.

eval_metrics is a tuple (function, tensors), where tensors is either a list of tf.Tensor or a dict mapping strings to tf.Tensor. The tensors are transferred from the IPU to the CPU host and passed to the function, which runs on the CPU and returns a dict of metrics.

Exactly one of eval_metrics and eval_metric_ops must be provided during evaluation. The major difference between the two is that while the eval_metric_ops will execute directly on the IPU, the eval_metrics will execute on the CPU host using the provided function. Example:

def my_metrics_fn(features, labels):
  return {
      "accuracy": tf.metrics.accuracy(labels, features),
      "precision": tf.metrics.precision(labels, features),
      "recall": tf.metrics.recall(labels, features),
  }

eval_metrics = (my_metrics_fn, [features, labels])
spec = IPUEstimatorSpec(mode, loss=loss, eval_metrics=eval_metrics)
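
To make the shape of the eval_metrics tuple concrete, the following plain-Python sketch shows one way such a tuple could be dispatched on the host. It rests on an assumption that is not stated in this section: that a list of tensors is passed to the function positionally (as the example above suggests) and that a dict is passed by keyword. call_metrics_fn is a hypothetical helper, and ordinary Python lists stand in for tensors.

```python
from collections.abc import Mapping


def call_metrics_fn(eval_metrics):
    """Hypothetical dispatcher for an eval_metrics tuple.

    Assumption: a list of tensors is passed positionally and a dict is
    passed by keyword. Plain Python values stand in for tensors here.
    """
    function, tensors = eval_metrics
    if isinstance(tensors, Mapping):
        return function(**tensors)
    return function(*tensors)


def my_metrics_fn(features, labels):
    # Stand-in metric: fraction of positions where the value equals the label.
    matches = sum(1 for f, l in zip(features, labels) if f == l)
    return {"accuracy": matches / len(labels)}


features = [1, 0, 1, 1]
labels = [1, 0, 0, 1]

# List form: tensors are matched to parameters by position.
from_list = call_metrics_fn((my_metrics_fn, [features, labels]))

# Dict form: keys must match the parameter names of the function.
from_dict = call_metrics_fn(
    (my_metrics_fn, {"features": features, "labels": labels}))
```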

host_call is a tuple of a function and a list of tensors to pass to that function. host_call only works for training and is executed on the CPU for every training step. The tensors are transferred from the IPU to the CPU host and passed to the function.

This functionality can be used for e.g. doing all-reduce of the gradients and weight updates on the host during distributed training with the IPUMultiWorkerStrategy. Example:

def my_host_fn(*host_gradients):
  # This will all-reduce the gradients and update the weights on the host.
  return optimizer.apply_gradients(zip(host_gradients, variables))

train_op = tf.identity(loss)
grads_and_vars = optimizer.compute_gradients(loss, var_list=variables)
gradients = [g for (g, _) in grads_and_vars]
host_call = (my_host_fn, gradients)

spec = IPUEstimatorSpec(mode=mode,
                        loss=loss,
                        train_op=train_op,
                        host_call=host_call)

See full example: Distributed training.

The various hooks (training_hooks, evaluation_hooks, prediction_hooks) support instances of tf.estimator.SessionRunHook. To log tensor values from within the model_fn, use the IPULoggingTensorHook.

For documentation of the remaining arguments, see EstimatorSpec.

21.13.2. IPUPipelineEstimator

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(mode, computational_stages, gradient_accumulation_count=None, eval_metrics_fn=None, optimizer_function=None, device_mapping=None, loss_accumulator_dtype=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None, reduction_method=GradientAccumulationReductionMethod.SUM, **pipeline_op_kwargs)

Ops and objects returned from a model_fn and passed to IPUPipelineEstimator.

class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)

Estimator for pipelining on IPUs.

IPUPipelineEstimator, like IPUEstimator, handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. Additionally, it adds support for pipelined execution over multiple IPUs.

The major API difference from the IPUEstimator is that the provided model_fn must return an IPUPipelineEstimatorSpec that contains the information needed for pipelined execution.

Data parallelism based on graph replication is supported. Each replica will consume gradient_accumulation_count batches from the dataset returned by the input_fn and accumulate the gradients, giving an effective batch size of num_replicas * gradient_accumulation_count * batch_size. The optimizer in the model_fn should be wrapped in a CrossReplicaOptimizer in order to average the gradients across the replicas.

This can further be combined with distributed multi-worker training using the IPUMultiWorkerStrategy, giving a total effective batch size of num_workers * num_replicas * gradient_accumulation_count * batch_size.
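
The effective batch size formulas above can be checked with a few lines of arithmetic. This is a minimal sketch: effective_batch_size is a hypothetical helper, not part of the IPU API, and the numbers are purely illustrative.

```python
def effective_batch_size(batch_size,
                         gradient_accumulation_count,
                         num_replicas=1,
                         num_workers=1):
    """Number of samples contributing to a single weight update."""
    return (num_workers * num_replicas * gradient_accumulation_count *
            batch_size)


# One worker, two replicas, eight accumulated batches of four samples each.
single_worker = effective_batch_size(batch_size=4,
                                     gradient_accumulation_count=8,
                                     num_replicas=2)

# The same configuration spread over two workers.
multi_worker = effective_batch_size(batch_size=4,
                                    gradient_accumulation_count=8,
                                    num_replicas=2,
                                    num_workers=2)
```

Keeping this product in mind is useful when tuning the learning rate, since changing any one factor changes the number of samples behind each weight update.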

Refer to the pipelining_ops documentation for more details about pipelining.

Note: because the model_fn is compiled to run on the IPU, you must use the warm_start_from parameter for a warm start and not the tf.train.init_from_checkpoint method.

Parameters

model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.
model_dir – Directory in which to save model parameters, the graph, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a PathLike object is given, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be the same. If both are None, a temporary directory will be used.
config – A RunConfig object.
params – dict of hyperparameters that will be passed into model_fn. Keys are names of parameters, values are basic Python types.
warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm start from, or a tf.estimator.WarmStartSettings object to fully configure warm-starting. If the string filepath is provided instead of a tf.estimator.WarmStartSettings, then all variables are warm started, and it is assumed that vocabularies and tf.Tensor names are unchanged.

eval_dir(name=None)

Shows the directory name where evaluation metrics are dumped.

Parameters: name – Name of the evaluation, if you need to run multiple evaluations on different data sets, such as on training data and test data. Metrics for different evaluations are saved in separate folders and appear separately in TensorBoard.
Returns: A string which is the path of the directory that contains the evaluation metrics.
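
As an illustration of how name separates evaluation runs, the sketch below mirrors the directory naming used by the upstream tf.estimator.Estimator, on the assumption that the IPU estimators inherit it unchanged. The standalone eval_dir function and the /tmp/model path are hypothetical and used only for demonstration.

```python
import os


def eval_dir(model_dir, name=None):
    """Sketch of the naming scheme, assuming upstream tf.estimator behaviour.

    Unnamed evaluations go to "eval"; named ones go to "eval_<name>".
    """
    return os.path.join(model_dir, "eval" if not name else "eval_" + name)


default_dir = eval_dir("/tmp/model")
train_dir = eval_dir("/tmp/model", name="train")
test_dir = eval_dir("/tmp/model", name="test")
```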

evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)

Evaluates the model given evaluation data input_fn.

Parameters

input_fn –
A function that constructs the input data for evaluation. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where
- features is a tf.Tensor or a dictionary of string feature name to Tensor
- labels is a Tensor or a dictionary of string label name to Tensor
Both features and labels are consumed by model_fn.
steps – Number of steps for which to evaluate model.
hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the evaluation call.
checkpoint_path – Path of a specific checkpoint to evaluate. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, evaluation is run with newly initialized Variables instead of ones restored from checkpoint.
name – Name of the evaluation, if you need to run multiple evaluations on different data sets, such as on training data and test data. Metrics for different evaluations are saved in separate folders and appear separately in TensorBoard.