22. TensorFlow Python API
Remember to import the IPU API using:
from tensorflow.python import ipu
You cannot access the IPU API via the top-level tensorflow
namespace.
For example, this will not work:
import tensorflow as tf
cfg = tf.python.ipu.config.IPUConfig() ...
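The correct pattern is to import the ipu module directly and use it alongside the top-level API. A minimal sketch (the single-IPU configuration is illustrative):

import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU system before using any IPU devices.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1  # illustrative: request a single IPU
cfg.configure_ipu_system()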
Note
tensorflow.python.ipu.ipu_strategy.IPUStrategy is an alias of tensorflow.python.ipu.ipu_strategy.IPUStrategyV1.
22.2. Distribution strategy for a single system
- class tensorflow.python.ipu.ipu_strategy.IPUExtendedV1(container_strategy, ipu_device, cpu_device)
- __init__(container_strategy, ipu_device, cpu_device)
- non_slot_devices(var_list)
Device(s) for non-slot variables.
DEPRECATED: TF 1.x ONLY.
This method returns non-slot devices where non-slot variables are placed. Users can create non-slot variables on these devices by using a block:
with tf.distribute.StrategyExtended.colocate_vars_with(
    tf.distribute.StrategyExtended.non_slot_devices(...)):
  ...
- Parameters
var_list – The list of variables being optimized, needed with the default tf.distribute.Strategy.
- Returns
A sequence of devices for non-slot variables.
- property parameter_devices
Returns the tuple of all devices used to place variables.
- value_container(value)
Returns the container that this per-replica value belongs to.
- Parameters
value – A value returned by run() or a variable created in scope().
- Returns
A container that value belongs to. If value does not belong to any container (including the case of the container having been destroyed), returns the value itself. value in experimental_local_results(value_container(value)) will always be true.
- property worker_devices
Returns the tuple of all devices used for compute replica execution.
- tensorflow.python.ipu.ipu_strategy.IPUStrategy
alias of IPUStrategyV1
- class tensorflow.python.ipu.ipu_strategy.IPUStrategyV1(ipu_device='/device:IPU:0', cpu_device='/device:CPU:0', enable_dataset_iterators=True, enable_keras_extensions=True)
This is a distribution strategy for targeting a system with one or more IPUs.
Creating variables and Keras models within the scope of the IPUStrategyV1 will ensure that they are placed on the IPU.
A tf.function can be executed on the IPU by calling it from the run function.
Variables will automatically be placed onto the IPUs, but the initializers for the variables will be performed on the CPU device.
from tensorflow.python import ipu

# Create an IPU distribution strategy
strategy = ipu.ipu_strategy.IPUStrategyV1()

with strategy.scope():
    # Instantiate a keras model here
    m = MyModel()

    # And train it
    m.fit(...)

    # Or call a tf.function
    res = strategy.run(my_fn, [...])
- __init__(ipu_device='/device:IPU:0', cpu_device='/device:CPU:0', enable_dataset_iterators=True, enable_keras_extensions=True)
Create a new IPUStrategyV1.
- Parameters
ipu_device – The TensorFlow device representing the IPUs.
cpu_device – The TensorFlow device for the CPU.
enable_dataset_iterators – Whether to create IPUStrategy specific dataset iterators inside of this strategy scope or whether to use standard dataset iterators.
enable_keras_extensions – Whether to enable IPU specific Keras extensions to improve Keras performance when using IPUs.
- run(fn, args=(), kwargs=None, options=None)
Invokes fn on each replica, with the given arguments.
This method is the primary way to distribute your computation with a tf.distribute object. It invokes fn on each replica. If args or kwargs have tf.distribute.DistributedValues, such as those produced by a tf.distribute.DistributedDataset from tf.distribute.Strategy.experimental_distribute_dataset or tf.distribute.Strategy.distribute_datasets_from_function, then when fn is executed on a particular replica, it will be executed with the component of tf.distribute.DistributedValues that corresponds to that replica.
fn is invoked under a replica context. fn may call tf.distribute.get_replica_context() to access members such as all_reduce. See the module-level docstring of tf.distribute for the concept of replica context.
All arguments in args or kwargs can be nested structures of tensors, for example a list of tensors, in which case args and kwargs will be passed to the fn invoked on each replica. Or args or kwargs can be tf.distribute.DistributedValues containing tensors or composite tensors, that is tf.compat.v1.TensorInfo.CompositeTensor, in which case each fn call will get the component of a tf.distribute.DistributedValues corresponding to its replica. Note that arbitrary Python values that are not of the types above are not supported.
IMPORTANT: Depending on the implementation of tf.distribute.Strategy and whether eager execution is enabled, fn may be called one or more times. If fn is annotated with tf.function or tf.distribute.Strategy.run is called inside a tf.function (eager execution is disabled inside a tf.function by default), fn is called once per replica to generate a TensorFlow graph, which will then be reused for execution with new inputs. Otherwise, if eager execution is enabled, fn will be called once per replica every step just like regular Python code.
Example usage:
Constant tensor input.
>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> tensor_input = tf.constant(3.0)
>>> @tf.function
... def replica_fn(input):
...   return input*2.0
>>> result = strategy.run(replica_fn, args=(tensor_input,))
>>> result
PerReplica:{
  0: <tf.Tensor: shape=(), dtype=float32, numpy=6.0>,
  1: <tf.Tensor: shape=(), dtype=float32, numpy=6.0>
}
DistributedValues input.
>>> strategy = tf.distribute.MirroredStrategy(["GPU:0", "GPU:1"])
>>> @tf.function
... def run():
...   def value_fn(value_context):
...     return value_context.num_replicas_in_sync
...   distributed_values = (
...       strategy.experimental_distribute_values_from_function(
...           value_fn))
...   def replica_fn2(input):
...     return input*2
...   return strategy.run(replica_fn2, args=(distributed_values,))
>>> result = run()
>>> result
<tf.Tensor: shape=(), dtype=int32, numpy=4>
Use tf.distribute.ReplicaContext to allreduce values.
>>> strategy = tf.distribute.MirroredStrategy(["gpu:0", "gpu:1"])
>>> @tf.function
... def run():
...   def value_fn(value_context):
...     return tf.constant(value_context.replica_id_in_sync_group)
...   distributed_values = (
...       strategy.experimental_distribute_values_from_function(
...           value_fn))
...   def replica_fn(input):
...     return tf.distribute.get_replica_context().all_reduce("sum", input)
...   return strategy.run(replica_fn, args=(distributed_values,))
>>> result = run()
>>> result
PerReplica:{
  0: <tf.Tensor: shape=(), dtype=int32, numpy=1>,
  1: <tf.Tensor: shape=(), dtype=int32, numpy=1>
}
- Parameters
fn – The function to run on each replica.
args – Optional positional arguments to fn. Its element can be a tensor, a nested structure of tensors or a tf.distribute.DistributedValues.
kwargs – Optional keyword arguments to fn. Its element can be a tensor, a nested structure of tensors or a tf.distribute.DistributedValues.
options – An optional instance of tf.distribute.RunOptions specifying the options to run fn.
- Returns
Merged return value of fn across replicas. The structure of the return value is the same as the return value from fn. Each element in the structure can either be tf.distribute.DistributedValues or Tensor objects (for example, if running on a single replica).
22.3. Compiler interface
- tensorflow.python.ipu.ipu_compiler.compile(computation, inputs=None)
Builds an operator that compiles and runs computation with the Graphcore IPU XLA backend.
- Parameters
computation – A Python function that builds a computation to apply to the input. If the function takes n inputs, inputs should be a list of n tensors. computation may return a list of operations and tensors. Tensors must come before operations in the returned list. The return value of compile is a list of tensors corresponding to the tensors from the output of computation. All operations returned from computation will be executed when evaluating any of the returned output tensors.
inputs – A list of inputs or None (equivalent to an empty list). Each input can be a nested structure containing values that are convertible to tensors. Note that passing an N-dimension list of compatible values will result in an N-dimension list of scalar tensors rather than a single Rank-N tensor. If you need different behaviour, convert part of inputs to tensors with tf.convert_to_tensor.
- Returns
Same data structure as if computation(inputs) is called directly, with some exceptions for correctness:
None output: a NoOp would be returned which control-depends on computation.
Single value output: a tuple containing the value would be returned.
Operation-only outputs: a NoOp would be returned which control-depends on computation.
- Raises
Exception – If the computation was not compiled for an IPU device.
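For illustration, a minimal TF1-style sketch of compiling and running a computation with this function; the function body, shapes and one-IPU configuration are placeholders, not taken from the original text:

import numpy as np
import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()

# Configure the IPU system (illustrative: a single IPU).
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

def my_net(x):
    # The computation to be compiled for the IPU.
    return x * x

with tf.device('cpu'):
    x = tf.placeholder(np.float32, [4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
    [result] = ipu.ipu_compiler.compile(my_net, inputs=[x])

with tf.Session() as sess:
    print(sess.run(result, {x: np.ones([4], np.float32)}))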
22.4. Scoping contexts
- tensorflow.python.ipu.scopes.frontend_attribute(attribute_name, attribute_value, restore_to=None)
Sets the specified scope attribute to the specified value in the graph.
- Parameters
attribute_name – Name of the attribute.
attribute_value – Attribute’s value as a string.
restore_to – If, at the end of the scope, the attribute would otherwise become undefined, set it to this value instead.
- Returns
A context
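A short sketch of the scoping behaviour; the attribute name and both values below are hypothetical placeholders, not attributes defined by this API:

from tensorflow.python import ipu

# "MY_ATTRIBUTE" and both string values are hypothetical.
with ipu.scopes.frontend_attribute("MY_ATTRIBUTE", "inside_value",
                                   restore_to="outside_value"):
    pass  # Operations built here carry MY_ATTRIBUTE="inside_value".
# After the scope, the attribute is set to "outside_value" rather
# than becoming undefined.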
- tensorflow.python.ipu.scopes.ipu_jit_scope(ipu_scope)
Provides a scope for compilation of operations.
If you would like to compile several sets of operations together, then this can provide that mechanism.
- Parameters
ipu_scope – A name to differentiate between different JIT scopes
- Returns
A context
- tensorflow.python.ipu.scopes.ipu_scope(device)
Provides a scope for placing operations onto a particular IPU/IPU cluster.
- Parameters
device – The name of the TensorFlow device, such as ‘/device:IPU:0’
- Returns
A context
- tensorflow.python.ipu.scopes.ipu_shard(index)
Control sharding for a set of operations.
Provides a scope which targets operations onto a particular shard (IPU) of a multi-IPU sharded device. Gradients created from these operations will also be put onto the same shard. Consequently an ipu_shard scope enclosing a call to tf.gradients or tf.GradientTape.gradient won't change the sharding of the backwards ops.
- Parameters
index – The index of the IPU on which to place the enclosed operations.
- Returns
A context
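As a short sketch (mirroring the sharded_graph example used later in the SelectionOrder documentation), operations can be pinned to individual shards like this:

from tensorflow.python import ipu

def sharded_net(a, b, c):
    with ipu.scopes.ipu_shard(0):
        x = a + b      # Placed on shard (IPU) 0.
    with ipu.scopes.ipu_shard(1):
        return x * c   # Placed on shard (IPU) 1; gradients of this
                       # multiply would also land on shard 1.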
- tensorflow.python.ipu.scopes.outside_compilation_scope(name='outside')
Provides a scope for placing operations on the host, outside the current compilation scope. The operations will be placed on the default host device. This allows for offloading computations from the IPU to the host, which can be useful for operations that are not supported or suitable for execution on the IPU.
Example:
def my_net(a):
    with ipu_scope("/device:IPU:0"):
        b = a * a
        with outside_compilation_scope():
            c = b + 2  # Placed on the host.
        d = b + c
    return d
- Parameters
name – A name for the outside compilation scope.
- Returns
A context
- tensorflow.python.ipu.scopes.partials_type(override_type)
Override the default type used to store intermediate results by convolution and matrix multiply operations.
EXPERIMENTAL - there are no guarantees that the partials type provided will be used and therefore this should not be used.
- Parameters
override_type – Numpy type of the partials (float16 or float32)
- Returns
A context
- tensorflow.python.ipu.scopes.stochastic_rounding(override)
Control stochastic rounding for a set of operations.
EXPERIMENTAL - there are no guarantees that the stochastic rounding provided will be used and therefore this should not be used.
- Parameters
override – if True then stochastic rounding will be used, otherwise it will be disabled for this set of operations.
- Returns
A context
22.5. Infeed queue
- class tensorflow.python.ipu.ipu_infeed_queue.IPUInfeedQueue(dataset, device_ordinal=None, prefetch_depth=None, optimise_latency=False, **kwargs)
Wraps a tf.Dataset object with infeed operations specific to the IPU.
This class, along with tensorflow.python.ipu.loops, is used to create a data pipeline from a dataset into a training/inference loop on the IPU inside a single session.run, which reduces the overheads of calling session.run for each iteration of the loop.
You should pass the infeed queue as an argument to a loop from tensorflow.python.ipu.loops. These loops will then handle the dequeuing of the data to the device automatically.
The following skeleton shows how to use this method when building a training loop. Note how the body signature contains variables which correspond to the nested structure of tf.Tensor objects representing the next element in the infeed queue:

# Create an example dataset.
dataset = ...  # A `tf.data.Dataset` object.

def dataset_parser(value):
    features, labels = parse_record(value)
    return {"features": features, "labels": labels}

# The resulting dataset has a nested structure of: {features, labels}.
dataset = dataset.map(dataset_parser)

infeed_queue = ipu.ipu_infeed_queue.IPUInfeedQueue(dataset)

# dataset can no longer be used beyond this point.

def my_net():
    # Note how the nested structure forms part of the loop body signature.
    def body(loss, features, labels):
        with variable_scope.variable_scope("vs", use_resource=True):
            y = tf.conv2d(features, .....)
            ...
            ...
            logits = tf.nn.xw_plus_b(....)
            loss = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits_v2(
                    logits=logits, labels=labels))
            optimizer = gradient_descent.GradientDescentOptimizer(0.000001)
            train = optimizer.minimize(loss)
            with ops.control_dependencies([train]):
                return array_ops.identity(loss)

    loss = 0.0
    return ipu.loops.repeat(10000, body, [loss], infeed_queue)

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[])

with tf.Session() as sess:
    sess.run(infeed_queue.initializer)
    sess.run(variables.global_variables_initializer())
    result = sess.run(res)
- __init__(dataset, device_ordinal=None, prefetch_depth=None, optimise_latency=False, **kwargs)
Creates an IPUInfeedQueue object.
- Parameters
dataset – a tf.data.Dataset object; all transformations (e.g. shuffle, repeat, batch) must be applied prior to passing in to this function. This dataset can no longer be used after creating this queue.
device_ordinal – Integer ordinal of the IPU device on which this queue will be used. If not specified will try and deduce the IPU device from the current strategy and if that fails will default to "/device:IPU:0".
prefetch_depth – the number of elements Poplar will prefetch; that is, the depth of the Poplar datastream buffer which may be filled before being read by the device. By default the prefetch_depth is determined automatically (currently defaults to 3). Increasing the prefetch_depth allows multiple entries to be prefetched, increasing the probability that there will be a valid entry in the buffer for the device to read before it falls back to synchronously fetching the next entry. This value has to be greater than zero.
optimise_latency – Prioritise packet reduction to try to speed up the host transfer. This has the downside that it will introduce an extra copy and so should only be used on small exchanges that will produce lots of packets.
- Raises
ValueError – if all dimensions of shapes of dataset.output_shapes are not fully defined. The tf.data.batch function must be called with drop_remainder=True to ensure that the batch size is constant.
- property deleter
A tf.Operation that can be run to delete the resources owned by this IPUInfeedQueue. This allows creating a new IPUInfeedQueue with the same name afterwards.
- Returns
A tf.Operation that can be run to delete this IPUInfeedQueue.
- property dequeued
Returns whether this queue has been dequeued.
- Returns
A nested structure of tf.Tensor objects.
- get_next()
Obsolete function.
- property initializer
A tf.Operation that should be run to initialize this IPUInfeedQueue.
- Returns
A tf.Operation that should be run to initialize this IPUInfeedQueue.
- Raises
ValueError – if the function initializer has already been called.
- property number_of_tuple_elements
Returns the number of arguments supplied by this IPUInfeedQueue.
- class tensorflow.python.ipu.ipu_infeed_queue.IPUIterator(dataset=None, infeed_spec=None, element_spec=None, **kwargs)
An IPU specific iterator producing tf.Tensor objects from a tf.data.Dataset.
This iterator should be initially constructed in eager mode in order to make sure that the dataset is constructed on a compatible device.
Note that the infeed queue is not deleted.
The elements from the iterator can only be accessed inside of tf.functions, for maximum performance.
- __init__(dataset=None, infeed_spec=None, element_spec=None, **kwargs)
Creates a new iterator from the given dataset.
If dataset is not specified, the iterator will be created from the given infeed spec and element structure. In particular, this alternative for constructing the iterator is used when the iterator is reconstructed from its CompositeTensor representation.
- Parameters
dataset – A tf.data.Dataset object.
infeed_spec – An IPUInfeedQueue TypeSpec to construct the iterator from.
element_spec – A nested structure of TypeSpec objects that represents the type specification of elements of the iterator.
**kwargs – Arguments passed to the IPUInfeedQueue.
- Raises
ValueError – If dataset is not provided and either infeed_spec or element_spec is not provided. Or if dataset is provided and either infeed_spec or element_spec is provided.
- property element_spec
The type specification of an element of this iterator.
- Returns
A (nested) structure of tf.TypeSpec objects matching the structure of an element of this iterator and specifying the type of individual components.
- get_next()
Returns the next element.
>>> dataset = tf.data.Dataset.from_tensors(42)
>>> iterator = iter(dataset)
>>> print(iterator.get_next())
tf.Tensor(42, shape=(), dtype=int32)
- Returns
A (nested) structure of values matching tf.data.Iterator.element_spec.
- Raises
tf.errors.OutOfRangeError – If the end of the iterator has been reached.
- get_next_as_optional()
Returns the next element wrapped in tf.experimental.Optional.
If the iterator has reached the end of the sequence, the returned tf.experimental.Optional will have no value.

>>> dataset = tf.data.Dataset.from_tensors(42)
>>> iterator = iter(dataset)
>>> optional = iterator.get_next_as_optional()
>>> print(optional.has_value())
tf.Tensor(True, shape=(), dtype=bool)
>>> print(optional.get_value())
tf.Tensor(42, shape=(), dtype=int32)
>>> optional = iterator.get_next_as_optional()
>>> print(optional.has_value())
tf.Tensor(False, shape=(), dtype=bool)
- Returns
A tf.experimental.Optional object representing the next element.
- class tensorflow.python.ipu.ipu_infeed_queue.IPUOwnedIterator(dataset=None, infeed_spec=None, element_spec=None, **kwargs)
An IPU specific iterator producing tf.Tensor objects from a tf.data.Dataset.
The iterator resource created through IPUOwnedIterator is owned by the Python object, and the lifetime of the underlying resource is tied to the lifetime of the IPUOwnedIterator object. This makes IPUOwnedIterator appropriate for use inside of tf.functions.
This iterator should be initially constructed in eager mode in order to make sure that the dataset is constructed on a compatible device.
The elements from the iterator can only be accessed inside of tf.functions, for maximum performance.
- __init__(dataset=None, infeed_spec=None, element_spec=None, **kwargs)
Creates a new iterator from the given dataset.
If dataset is not specified, the iterator will be created from the given infeed spec and element structure. In particular, this alternative for constructing the iterator is used when the iterator is reconstructed from its CompositeTensor representation.
- Parameters
dataset – A tf.data.Dataset object.
infeed_spec – An IPUInfeedQueue TypeSpec to construct the iterator from.
element_spec – A nested structure of TypeSpec objects that represents the type specification of elements of the iterator.
**kwargs – Arguments passed to the IPUInfeedQueue.
- Raises
ValueError – If dataset is not provided and either infeed_spec or element_spec is not provided. Or if dataset is provided and either infeed_spec or element_spec is provided.
22.6. Outfeed queue
- class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedMode(value)
Types used to control the IPUOutfeedQueue modes.
Contains the following values:
ALL - When used with an IPUOutfeedQueue, all the elements which were enqueued to the queue will be returned by the outfeed.
LAST - When used with an IPUOutfeedQueue, only the last element which was enqueued to the queue will be returned by the outfeed.
- class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueue(outfeed_mode=None, device_ordinal=None, buffer_depth=3, optimise_latency=False)
Generates and adds outfeed enqueue/dequeue operations to the graph.
An outfeed is the counterpart to an infeed and manages the transfer of data (like tensors, tuples or dictionaries of tensors) from the IPU graph to the host.
The queue has two modes of operation - outfeed all or outfeed last. In outfeed all mode every element that is enqueued will be stored for a subsequent dequeue. All of the enqueued elements will be returned when the dequeue operation is run. This is the default behaviour.
In outfeed last mode only the last enqueued element is stored. The dequeue operation will in this case return a single element.
- __init__(outfeed_mode=None, device_ordinal=None, buffer_depth=3, optimise_latency=False)
Creates an IPUOutfeedQueue object.
- Parameters
outfeed_mode – ipu_outfeed_queue.IPUOutfeedMode type used to control the outfeed behaviour. If not specified then all elements will be returned by the outfeed when the dequeue operation is run.
device_ordinal – Integer ordinal of the IPU device on which this queue will be used. If not specified will try and deduce the IPU device from the current strategy and if that fails will default to "/device:IPU:0".
buffer_depth – The maximum number of elements Poplar can buffer in external memory before blocking the device.
optimise_latency – Prioritise packet reduction to try to speed up the host transfer. This has the downside that it will introduce an extra copy and so should only be used on small exchanges that will produce lots of packets.
- Raises
ValueError – if the types or values are incorrect
- property deleter
A tf.Operation that can be run to delete the resources owned by this IPUOutfeedQueue. This allows creating a new IPUOutfeedQueue with the same name afterwards. The behaviour is undefined if this op is executed concurrently with the dequeue op.
- Returns
A tf.Operation that can be run to delete this IPUOutfeedQueue.
- dequeue(wait_for_completion=False)
Generate host side operation to dequeue the outfeed values.
- Parameters
wait_for_completion – whether the dequeueing operation should wait for the current execution of a graph containing the outfeed enqueue to complete. Defaults to False, which means that only the tensors which have already been enqueued will be returned.
The return value of this operation depends on the enqueued tensors, the replication factor and the execution mode, where the replication factor is determined by the model.
Note: If the TF_POPLAR_FLAGS environment variable contains the flag --use_synthetic_data then no data will be returned to the host. If outfeed_mode is IPUOutfeedMode.ALL then empty arrays with the same element structure as the enqueued tensors are returned. If outfeed_mode is IPUOutfeedMode.LAST then running the dequeue operation will throw an exception (there is no last element in this case).
Examples:
Outfeed returning a single tensor:
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
    output = input + 1
    outfeed = outfeed_queue.enqueue(output)
    return (output, outfeed)

def my_net(input):
    r = loops.repeat(20, body, (input))
    return r

with ops.device('cpu'):
    v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()

with tf.Session() as sess:
    result = sess.run(res, {v: np.ones([4, 4], np.float32)})
    outfed = sess.run(outfeed)
In this example the tensor output is of shape [4, 4] and it is enqueued into the outfeed. If the outfeed_mode is IPUOutfeedMode.ALL, and the model has a replication factor of 2, then the shape of the resulting outfed tensor will be [20, 2, 4, 4], where the first dimension represents the number of times we have enqueued a tensor to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed. The second dimension is the replication factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is IPUOutfeedMode.LAST, then the shape of the resulting outfed tensor will be [2, 4, 4], which represents the value of the output tensor the last time it was enqueued during execution for each of the replicated graphs.
Outfeed returning a tuple of tensors:
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
    output = input + 1
    sum = tf.reduce_sum(output)
    outfeed = outfeed_queue.enqueue((output, sum))
    return (output, outfeed)

def my_net(input):
    r = loops.repeat(20, body, (input))
    return r

with ops.device('cpu'):
    v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()

with tf.Session() as sess:
    result = sess.run(res, {v: np.ones([4, 4], np.float32)})
    outfed = sess.run(outfeed)
In this example we outfeed a tuple of tensors, output and sum, where the former is of shape [4, 4] and the latter [1]. If the outfeed_mode is IPUOutfeedMode.ALL and the model has a replication factor of 1, then the resulting outfed is a two-tuple of tensors with shapes ([20, 4, 4], [20, 1]), where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 20 times, and therefore we get 20 values back from the outfeed for each of the tensors in the tuple. If the outfeed_mode is IPUOutfeedMode.LAST, then outfed is a two-tuple of tensors with shapes ([4, 4], [1]), which represents the values of the output and sum tensors the last time they were enqueued during execution.
Note that the replication factor here is 1, which means that the extra replication dimension is not added.
Outfeed returning a dictionary of tensors:
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(input):
    output = input + 1
    sum = tf.reduce_sum(output)
    outfeed = outfeed_queue.enqueue({"x": output, "y": sum})
    return (output, outfeed)

def my_net(input):
    r = loops.repeat(40, body, (input))
    return r

with ops.device('cpu'):
    v = tf.placeholder(np.float32, [4, 4])

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])

outfeed = outfeed_queue.dequeue()

with tf.Session() as sess:
    result = sess.run(res, {v: np.ones([4, 4], np.float32)})
    outfed = sess.run(outfeed)
In this example we outfeed a dictionary of tensors, output and sum, where the former is of shape [4, 4] and the latter [1]. If the outfeed_mode is IPUOutfeedMode.ALL and the model has a replication factor of 8, then the resulting outfed is a dictionary of tensors with shapes {"x": [40, 8, 4, 4], "y": [40, 8, 1]}, where the first dimension in each of the tensors represents the number of times we have enqueued these tensors to the outfeed - in this example the loop is repeated 40 times, and therefore we get 40 values back from the outfeed for each of the tensors in the tuple. The second dimension is the replication factor, which allows us to see the individual values from each replicated graph. If the outfeed_mode is IPUOutfeedMode.LAST, then outfed is a dictionary of tensors with shapes {"x": [8, 4, 4], "y": [8, 1]}, which represents the values of the output and sum tensors the last time they were enqueued during execution for each of the replicated graphs.
- enqueue(tensors)
Enqueue a tensor, tuple or a dictionary of tensors for being outfed from the IPU graph. This operation is placed on the IPU device. This function returns an Operation which needs to be executed (by either returning it or using tf.control_dependencies(…)).
Examples:
Outfeed returning a single tensor:
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
    v = v + 1
    outfeed = outfeed_queue.enqueue(v)
    return (v, outfeed)

def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])
...
...
Outfeed returning a tuple of tensors:
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
    v = v + 1
    x = v * 2
    outfeed = outfeed_queue.enqueue((v, x))
    return (v, outfeed)

def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])
...
...
Outfeed returning a dictionary of tensors:
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

def body(v):
    v = v + 1
    x = v * 2
    outfeed = outfeed_queue.enqueue({"output_1": v, "output_2": x})
    return (v, outfeed)

def my_net(v):
    r = loops.repeat(20, body, (v))
    return r

with ipu.scopes.ipu_scope("/device:IPU:0"):
    res = ipu_compiler.compile(my_net, inputs=[v])
...
...
- class tensorflow.python.ipu.ipu_outfeed_queue.IPUOutfeedQueueIterator(outfeed_queue)
An iterator producing tf.Tensor objects from a IPUOutfeedQueue.
- __init__(outfeed_queue)
Creates a new iterator from the given outfeed queue.
- Parameters
outfeed_queue – An ipu.ipu_outfeed_queue.IPUOutfeedQueue object.
- class tensorflow.python.ipu.ipu_outfeed_queue.ScopedIPUOutfeedQueue(outfeed_mode=None, device_ordinal=None)
A version of IPUOutfeedQueue which automatically calls delete when it goes out of scope.
Can only be created in eager mode.
- __init__(outfeed_mode=None, device_ordinal=None)
Creates an IPUOutfeedQueue object.
- Parameters
outfeed_mode – ipu_outfeed_queue.IPUOutfeedMode type used to control the outfeed behaviour. If not specified then all elements will be returned by the outfeed when the dequeue operation is run.
device_ordinal – Integer ordinal of the IPU device on which this queue will be used. If not specified will try and deduce the IPU device from the current strategy and if that fails will default to "/device:IPU:0".
- Raises
RuntimeError – if not running in eager mode.
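A minimal eager-mode sketch; constructing this queue outside of eager mode raises RuntimeError:

from tensorflow.python import ipu

# Deleted automatically when the Python object goes out of scope.
outfeed_queue = ipu.ipu_outfeed_queue.ScopedIPUOutfeedQueue(
    outfeed_mode=ipu.ipu_outfeed_queue.IPUOutfeedMode.LAST)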
22.7. General utilities
- tensorflow.python.ipu.utils.export_dataset_to_file(dataset_or_infeed, output_filename, num_elements, feed_name='', apply_debug_options=True)
Export as binary num_elements from the given infeed to the specified output_filename.
If the infeed elements are tuples then one file per tuple element will be created. For example, if dataset looks like

[{ "a": A_0, "b": B_0}, { "a": A_1, "b": B_1}, ...]

then export_dataset_to_file(dataset, "my_dataset.bin", 100) will generate:

my_dataset.0.bin  # Contains tensors [ A_0, A_1, ..., A_99]
my_dataset.1.bin  # Contains tensors [ B_0, B_1, ..., B_99]
- Parameters
dataset_or_infeed – A unary dataset with the same input and output structure, or an IPUInfeedQueue.
output_filename – Where to export the tensors to.
num_elements – Number of elements to export from the dataset.
feed_name – Specify the feed name.
apply_debug_options – Whether to apply debug options.
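A short sketch of exporting the first 100 elements of a small two-component dataset; the dataset contents are illustrative:

import tensorflow as tf
from tensorflow.python import ipu

dataset = tf.data.Dataset.range(1000).map(
    lambda i: {"a": tf.cast(i, tf.float32), "b": i})

# Writes my_dataset.0.bin and my_dataset.1.bin, one file per component.
ipu.utils.export_dataset_to_file(dataset, "my_dataset.bin", 100)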
- tensorflow.python.ipu.utils.export_inputs_to_file(inputs, output_filename, feed_dict)
Export as binary the list of inputs provided to the specified output_filename.
- Parameters
inputs – List of graph inputs to export.
output_filename – Where to export the tensors to.
feed_dict – Feed dictionary containing the inputs’ values.
- tensorflow.python.ipu.utils.get_num_of_ipus_in_device(ipu_device, device='cpu')
Get the number of physical IPUs
- Parameters
ipu_device – The IPU device for which to get the number of IPUs.
device – The CPU device which is local to the IPU hardware.
- Returns
A number of physical IPUs configured for a particular TF device.
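For example (assuming the IPU system has already been configured with a multi-IPU device):

from tensorflow.python import ipu

# Returns e.g. 2 if /device:IPU:0 was configured with two IPUs.
num_ipus = ipu.utils.get_num_of_ipus_in_device("/device:IPU:0")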
- tensorflow.python.ipu.utils.move_variable_initialization_to_cpu(graph=None)
For all variables in the VARIABLES collection, move any initialization ops onto the CPU.
- Parameters
graph – Operations are moved around on this graph. The default graph will be used if not specified.
- Returns
None
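A TF1-style sketch; the variable below is illustrative:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

tf.disable_v2_behavior()

with ipu.scopes.ipu_scope("/device:IPU:0"):
    w = tf.get_variable("w", shape=[4], use_resource=True)

# Rewrite the default graph so that the initializer for `w` runs on
# the CPU instead of the IPU.
ipu.utils.move_variable_initialization_to_cpu()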
- tensorflow.python.ipu.utils.reset_ipu_seed(seed, device='/device:IPU:0', cpu_device='cpu', experimental_identical_replicas=False)
Reset the seed used to generate stateful random numbers and perform stochastic rounding.
- Parameters
seed – The new random number generator seed.
device – The device to which the seed will be applied.
cpu_device – The CPU device which is on the same hardware as the IPU device.
experimental_identical_replicas – Whether to seed all the local replicas identically. Note that to generate identical sequences of random numbers on all replicas, the Poplar engine option "target.deterministicWorkers" must also be set to "portable". Also note that for multi-replica distribution with multiple processes, the same seed must be passed to each process to ensure that all the replicas globally get the same seed. WARNING: This flag is experimental and subject to change.
- Returns
None
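For example, to re-seed the default IPU device (the seed value is arbitrary):

from tensorflow.python import ipu

# Affects stateful random ops and stochastic rounding on /device:IPU:0.
ipu.utils.reset_ipu_seed(42)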
- tensorflow.python.ipu.utils.running_on_ipu_model()
Check if XLA is configured to run on the IPU model.
- Returns
True if XLA is configured to run on the IPU model. False if XLA is configured to run on real hardware.
- tensorflow.python.ipu.utils.use_synthetic_data_for(synthetic_data_category)
Get whether synthetic data is being used for the given category.
- Parameters
synthetic_data_category – A SyntheticDataCategory enum value.
- Returns
A bool indicating the result.
22.8. Configuration utilities
- class tensorflow.python.ipu.config.DeviceConnectionType(value)
Enumeration to describe the mechanism used to attach to the Poplar device.
ALWAYS indicates that the system will attach when configuring the device.
ON_DEMAND will defer connection to when the IPU is needed.
PRE_COMPILE will never try to attach to a device and anything which is meant to be executed on the device will return all zeros. Used to pre-compile Poplar programs on machines without IPUs. For more information, see Pre-compiling executables.
NEVER will never try to attach to a device.
- class tensorflow.python.ipu.config.ExecutionProfileType(value)
The execution profile type indicates the desired information in the execution profile.
NO_PROFILE indicates that there should be no execution profiling.
DEVICE_PROFILE indicates that the execution profile should contain only device wide events.
IPU_PROFILE indicates that the profile should contain IPU level execution events.
TILE_PROFILE indicates that the profile should contain Tile level execution events.
- class tensorflow.python.ipu.config.MergeRemoteBuffersBehaviour(value)
The remote buffers merging behaviour indicates when or if compatible remote buffers should be merged.
NO_MERGING indicates that there should be no merging.
MERGE indicates that all compatible remote buffers will be merged.
IF_BENEFICIAL indicates that compatible remote buffers will only be merged when it is considered beneficial for code re-use.
- class tensorflow.python.ipu.config.SchedulingAlgorithm(value)
Controls the algorithm that the scheduler uses.
CHOOSE_BEST compares several of the scheduling algorithms below and selects the one that leads to the lowest predicted overall peak liveness. This can sometimes produce incorrect results because the overall peak liveness isn't always a good measure for the maximum liveness on one tile of the processor.
CLUSTERING groups clusters of operations together in order to look through stretches of instructions with potentially high liveness.
POST_ORDER schedules the instructions in the order which is obtained by walking the graph in 'post order'.
LOOK_AHEAD looks ahead a number of operations from any schedulable one, as given by the maximum scheduler lookahead depth and maximum scheduler search space size options. It attempts to look through areas of high liveness.
SHORTEST_PATH gives priority to the shortest path to the root.
- class tensorflow.python.ipu.config.SelectionOrder(value)
Depending on the communication pattern of the model, the order in which the IPUs are selected and mapped to shards can impact the performance.
For example, given a model which executes on multiple IPUs:
def sharded_graph(pa, pb, pc, pd):
    with ipu.scopes.ipu_shard(0):
        o1 = pa + pb
    with ipu.scopes.ipu_shard(1):
        o2 = o1 + pc
    with ipu.scopes.ipu_shard(2):
        o3 = o2 + pd
    return o3
and a Graphcore Pod system with 16 IPUs:
 _______               _______
|       |             |       |
|  14   |=============|  15   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  12   |=============|  13   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|  10   |=============|  11   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   8   |=============|   9   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   6   |=============|   7   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   4   |=============|   5   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   2   |=============|   3   |
|_______|             |_______|
    ||                    ||
 _______               _______
|       |             |       |
|   0   |=============|   1   |
|_______|             |_______|
Here, each numbered square represents an IPU with the given device ID, and the == and || connections represent IPUs directly connected via IPU-Links.
We can see that ipu_shard(0) directly communicates with ipu_shard(1) and that ipu_shard(1) directly communicates with ipu_shard(2).
.If the shards 0, 1, 2 were mapped to IPUs 0, 1, 2 in that order, then the communication between shards 1 and 2 would not have a direct connection via an IPU-Link and would have to perform a “hop” through an intermediate IPU.
If the shards 0, 1, 2 were mapped to IPUs 0, 1, 3 in that order, then the communication between shards 1 and 2 would have a direct connection via an IPU-Link, which will reduce the communication cost.
This enumeration is used to control the order in which the IPUs are selected. Currently, the following IPU selection orderings are supported:
AUTO: automatically try and select the best selection given the network.
ZIGZAG: follow the natural ordering of IPUs. In the above example, the IPUs would be selected in the following order: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15.
SNAKE: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after. In the above example, the IPUs would be selected in the following order: 0, 1, 3, 2, 4, 5, 7, 6, 8, 9, 11, 10, 12, 13, 15, 14.
HOOF: select IPUs such that each consecutive shard is directly connected via IPU-Links to the shard before and after, and the last and first shards are on adjacent IPUs. In the above example, the IPUs would be selected in the following order: 0, 2, 4, 6, 8, 10, 12, 14, 15, 13, 11, 9, 7, 5, 3, 1.
The SNAKE and HOOF IPU selection orders are particularly beneficial for pipelined models.
- class tensorflow.python.ipu.config.StochasticRoundingBehaviour(value)
Controls how stochastic rounding is performed.
OFF disables stochastic rounding.
ON enables stochastic rounding.
REPLICA_IDENTICAL_ONLY enables stochastic rounding for portions of the graph which are identified as being replica identical - meaning that when executed with replication they produce the same result on each replica.
- tensorflow.python.ipu.config.configure_ipu_system(config, device='cpu', reset_configuration=True)
Configure an IPU system with an IPUConfig or IpuOptions instance.
- Parameters
config – An IPUConfig instance or IpuOptions configuration protobuf.
device – The TensorFlow virtual CPU device which is local to the IPU hardware.
reset_configuration – Whether to reset any existing IPU configurations.
- Returns
None
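A short sketch using an IPUConfig instance; the two-IPU selection is illustrative:

from tensorflow.python import ipu

config = ipu.config.IPUConfig()
config.auto_select_ipus = 2
ipu.config.configure_ipu_system(config)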
- tensorflow.python.ipu.config.get_ipu_config(session=None)
Get the configuration of an IPU system.
- Parameters
session – An optional session on which to execute.
- Returns
A list of IpuOption instances, one for each PoplarExecutor.
- tensorflow.python.ipu.config.reset_ipu_configuration()
Reset the IPU configuration in preparation for it to be reconfigured. This blocks until all currently configured IPU devices have finished executing.
Note that this function does not currently support resetting IPUs that are running in parallel Python threads.
- class tensorflow.python.ipu.config.IPUConfig
- allow_recompute: bool = False
Whether or not to recompute instructions during training. If this is enabled then we will attempt to pattern match instructions/pipeline stages in the forward pass and recompute them in the backward pass to avoid having to preserve activations which increase the maximum memory liveness. Enabling this option can reduce memory usage at the expense of extra computation. Stateful operations cannot be recomputed.
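For example, to trade extra computation for lower peak memory during training:

from tensorflow.python import ipu

config = ipu.config.IPUConfig()
config.allow_recompute = True
config.configure_ipu_system()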
- selection_order: SelectionOrder = SelectionOrder.AUTO
The order in which IPUs are selected and mapped to physical IPU devices when using multi-IPU devices. Must be one of SelectionOrder.
- serialization_output_folder: str = ""
Specifies the directory in which serialized Poplar executables will be saved. The value must be a valid path. The default (“”) disables executable serialization.
- compilation_poplar_options: dict = {}
Set the Poplar compilation options for the session. Must be a dictionary of valid Poplar compilation flags. See the Engine class in the Poplar API reference for the full list of options.
- gcl_poplar_options: dict = {}
Set the IPU options for the Graphcore Communication Library. Must be a dictionary of valid GCL options. See the allReduce function in the GCL API reference for the full list of options. The options will be applied to all applicable GCL collective operations in the graph during compilation.
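A hedged sketch; the option name below is an assumed illustration only, so check the allReduce documentation for the options valid in your GCL version:

from tensorflow.python import ipu

config = ipu.config.IPUConfig()
# "maxBytesPerTile" is an assumed example of a GCL option name.
config.gcl_poplar_options = {"maxBytesPerTile": "131072"}
config.configure_ipu_system()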
- auto_select_ipus: Union[int, List[int], Tuple[int, ...]] = []
Configure the IPUs to be used by the session. The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The devices will be labelled /device:IPU:0, /device:IPU:1 and so on.
Each device can control a specific number of IPUs, given by the num_ipus parameter. The system will automatically select IPU configurations from the available IPUs, where they match the desired number of IPUs.
Examples:
config = IPUConfig()

# Create a single TensorFlow device, with one IPU
config.auto_select_ipus = 1

# Create two TensorFlow devices, with two IPUs per device.
config.auto_select_ipus = [2, 2]

# Create two TensorFlow devices, with one IPU in the first device and two
# IPUs in the second device.
config.auto_select_ipus = [1, 2]
- select_ipus: Union[int, List[int], Tuple[int, ...]] = []
Configure the IPUs to be used by the session.
The configuration describes a system consisting of multiple TensorFlow devices, each with control of one or more IPUs. The TensorFlow devices will be labelled /device:IPU:0, /device:IPU:1 and so on.
Each TensorFlow device uses a specific configuration consisting of one or more IPUs from the list of devices. These can be found by running the Graphcore utility gc-info -l. For instance, the following listing shows the device configurations available on a system with 16 IPUs.

user@host:~$ gc-info -l
Graphcore device listing:

-+- Id: [0], type: [PCIe], PCI Domain: [0000:1a:00.0]
-+- Id: [1], type: [PCIe], PCI Domain: [0000:1b:00.0]
-+- Id: [2], type: [PCIe], PCI Domain: [0000:23:00.0]
-+- Id: [3], type: [PCIe], PCI Domain: [0000:24:00.0]
-+- Id: [4], type: [PCIe], PCI Domain: [0000:3d:00.0]
-+- Id: [5], type: [PCIe], PCI Domain: [0000:3e:00.0]
-+- Id: [6], type: [PCIe], PCI Domain: [0000:43:00.0]
-+- Id: [7], type: [PCIe], PCI Domain: [0000:44:00.0]
-+- Id: [8], type: [PCIe], PCI Domain: [0000:8b:00.0]
-+- Id: [9], type: [PCIe], PCI Domain: [0000:8c:00.0]
-+- Id: [10], type: [PCIe], PCI Domain: [0000:8e:00.0]
-+- Id: [11], type: [PCIe], PCI Domain: [0000:8f:00.0]
-+- Id: [12], type: [PCIe], PCI Domain: [0000:b8:00.0]
-+- Id: [13], type: [PCIe], PCI Domain: [0000:b9:00.0]
-+- Id: [14], type: [PCIe], PCI Domain: [0000:ba:00.0]
-+- Id: [15], type: [PCIe], PCI Domain: [0000:bb:00.0]
-+- Id: [16], type: [Multi IPU]
 |--- PCIe Id: [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
 |--- PCIe Id: [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
-+- Id: [17], type: [Multi IPU]
 |--- PCIe Id: [4], DNC Id: [0], PCI Domain: [0000:3d:00.0]
 |--- PCIe Id: [6], DNC Id: [1], PCI Domain: [0000:43:00.0]
-+- Id: [18], type: [Multi IPU]
 |--- PCIe Id: [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
-+- Id: [19], type: [Multi IPU]
 |--- PCIe Id: [2], DNC Id: [0], PCI Domain: [0000:23:00.0]
 |--- PCIe Id: [0], DNC Id: [1], PCI Domain: [0000:1a:00.0]
-+- Id: [20], type: [Multi IPU]
 |--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
 |--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
-+- Id: [21], type: [Multi IPU]
 |--- PCIe Id: [12], DNC Id: [0], PCI Domain: [0000:b8:00.0]
 |--- PCIe Id: [14], DNC Id: [1], PCI Domain: [0000:ba:00.0]
-+- Id: [22], type: [Multi IPU]
 |--- PCIe Id: [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
 |--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
-+- Id: [23], type: [Multi IPU]
 |--- PCIe Id: [10], DNC Id: [0], PCI Domain: [0000:8e:00.0]
 |--- PCIe Id: [8], DNC Id: [1], PCI Domain: [0000:8b:00.0]
-+- Id: [24], type: [Multi IPU]
 |--- PCIe Id: [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
 |--- PCIe Id: [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
 |--- PCIe Id: [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
 |--- PCIe Id: [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
-+- Id: [25], type: [Multi IPU]
 |--- PCIe Id: [3], DNC Id: [0], PCI Domain: [0000:24:00.0]
 |--- PCIe Id: [1], DNC Id: [1], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id: [2], DNC Id: [2], PCI Domain: [0000:23:00.0]
 |--- PCIe Id: [0], DNC Id: [3], PCI Domain: [0000:1a:00.0]
-+- Id: [26], type: [Multi IPU]
 |--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
 |--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
 |--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
 |--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
-+- Id: [27], type: [Multi IPU]
 |--- PCIe Id: [9], DNC Id: [0], PCI Domain: [0000:8c:00.0]
 |--- PCIe Id: [11], DNC Id: [1], PCI Domain: [0000:8f:00.0]
 |--- PCIe Id: [10], DNC Id: [2], PCI Domain: [0000:8e:00.0]
 |--- PCIe Id: [8], DNC Id: [3], PCI Domain: [0000:8b:00.0]
-+- Id: [28], type: [Multi IPU]
 |--- PCIe Id: [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
 |--- PCIe Id: [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
 |--- PCIe Id: [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
 |--- PCIe Id: [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
 |--- PCIe Id: [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
 |--- PCIe Id: [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id: [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
 |--- PCIe Id: [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
-+- Id: [29], type: [Multi IPU]
 |--- PCIe Id: [13], DNC Id: [0], PCI Domain: [0000:b9:00.0]
 |--- PCIe Id: [15], DNC Id: [1], PCI Domain: [0000:bb:00.0]
 |--- PCIe Id: [12], DNC Id: [2], PCI Domain: [0000:b8:00.0]
 |--- PCIe Id: [14], DNC Id: [3], PCI Domain: [0000:ba:00.0]
 |--- PCIe Id: [9], DNC Id: [4], PCI Domain: [0000:8c:00.0]
 |--- PCIe Id: [11], DNC Id: [5], PCI Domain: [0000:8f:00.0]
 |--- PCIe Id: [10], DNC Id: [6], PCI Domain: [0000:8e:00.0]
 |--- PCIe Id: [8], DNC Id: [7], PCI Domain: [0000:8b:00.0]
-+- Id: [30], type: [Multi IPU]
 |--- PCIe Id: [5], DNC Id: [0], PCI Domain: [0000:3e:00.0]
 |--- PCIe Id: [7], DNC Id: [1], PCI Domain: [0000:44:00.0]
 |--- PCIe Id: [4], DNC Id: [2], PCI Domain: [0000:3d:00.0]
 |--- PCIe Id: [6], DNC Id: [3], PCI Domain: [0000:43:00.0]
 |--- PCIe Id: [3], DNC Id: [4], PCI Domain: [0000:24:00.0]
 |--- PCIe Id: [1], DNC Id: [5], PCI Domain: [0000:1b:00.0]
 |--- PCIe Id: [2], DNC Id: [6], PCI Domain: [0000:23:00.0]
 |--- PCIe Id: [0], DNC Id: [7], PCI Domain: [0000:1a:00.0]
 |--- PCIe Id: [13], DNC Id: [8], PCI Domain: [0000:b9:00.0]
 |--- PCIe Id: [15], DNC Id: [9], PCI Domain: [0000:bb:00.0]
 |--- PCIe Id: [12], DNC Id: [10], PCI Domain: [0000:b8:00.0]
 |--- PCIe Id: [14], DNC Id: [11], PCI Domain: [0000:ba:00.0]
 |--- PCIe Id: [9], DNC Id: [12], PCI Domain: [0000:8c:00.0]
 |--- PCIe Id: [11], DNC Id: [13], PCI Domain: [0000:8f:00.0]
 |--- PCIe Id: [10], DNC Id: [14], PCI Domain: [0000:8e:00.0]
 |--- PCIe Id: [8], DNC Id: [15], PCI Domain: [0000:8b:00.0]
Examples based on the listing above:
config = IPUConfig()

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:1a:00.0 by using IPU configuration index 0
config.select_ipus = 0

# Create a single TensorFlow device with 1 IPU at PCI address
# 0000:8b:00.0 by using IPU configuration index 8
config.select_ipus = 8

# Create two TensorFlow devices, with one IPU each, being devices at
# indices 0 and 1
config.select_ipus = [0, 1]

# Create two TensorFlow devices, with four IPUs each. The device
# configurations at indices 24 (0000:3e:00.0, 0000:44:00.0,
# 0000:3d:00.0, 0000:43:00.0) and 25 (0000:24:00.0, 0000:1b:00.0,
# 0000:23:00.0, 0000:1a:00.0)
config.select_ipus = [24, 25]

# Create four TensorFlow devices each with one IPU, at addresses
# 0000:1a:00.0, 0000:1b:00.0, 0000:23:00.0, 0000:24:00.0.
config.select_ipus = [0, 1, 2, 3]
- convolutions
Sub-category containing configuration options that affect convolutions.
- convolutions.poplar_options: dict = {}
Set the PopLibs convolution options for the session. Must be a dictionary of valid PopLibs convolution options. See createWeights in the PopLibs API reference for the full list of options. The options will be applied to all convolution operations in the session graph during compilation.
Of particular note is the availableMemoryProportion parameter, which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.
See the technical note on Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU for more details and for some practical examples of using availableMemoryProportion.
Another important parameter is partialsType, which sets the type of the values of intermediate calculations (partials). This parameter can either be set to "float" (for float32) or "half" (for float16). Note the use of "float" or "half" and not "float32" or "float16" for the parameter values (this is because Poplar/PopLibs uses the IEEE definitions of what the datatypes should be called). An example showing how to use this parameter is shown below:

cfg = config.IPUConfig()
cfg.convolutions.poplar_options['partialsType'] = "half"
cfg.configure_ipu_system()
- device_connection
Sub-category containing configuration options to control when to attach to IPU devices.
- device_connection.type: DeviceConnectionType = DeviceConnectionType.ALWAYS
Configure when to attach to the device. For example, you can use this to compile and cache a program without attaching to an IPU, and then later run on a real IPU device without recompiling. Setting the connection type doesn't impact the ability to profile a model. For possible values, see DeviceConnectionType.

# Compile without attaching to the device.
config = IPUConfig()
config.device_connection.type = DeviceConnectionType.ON_DEMAND
If using DeviceConnectionType.PRE_COMPILE to compile models to run on C600 cards then the link topology will need to be set to "line" using the POPLAR_TARGET_OPTIONS environment variable. See Environment variables in the Poplar and PopLibs API Reference for more information.
- device_connection.version: str = ""
Version of the IPU architecture to use (string). Must be one of "ipu1", "ipu2", "ipu21" or "" (default). A specific version is required if the connection type is specified as DeviceConnectionType.PRE_COMPILE or DeviceConnectionType.NEVER. Do not specify a version otherwise.
- device_connection.enable_remote_buffers: bool = False
Defaults to False. When the connection type is DeviceConnectionType.PRE_COMPILE, DeviceConnectionType.NEVER or DeviceConnectionType.ON_DEMAND, this argument is used to indicate whether remote buffers are enabled and supported in the system which will eventually be used to execute the compiled programs. Set it to True if the system on which you will execute the compiled programs has remote buffers enabled and connection_type is not DeviceConnectionType.ALWAYS. If the connection_type is DeviceConnectionType.ALWAYS then the enable_remote_buffers parameter is ignored, because in that case it is possible to query the device and check if remote buffers are supported on it (if they are, they will be used automatically).
In order to check whether your target system supports remote buffers, you can run the command:

$ gc-info -d 0 -i | grep "remote buffers supported:"

If you see remote buffers supported: 1 in the output, remote buffers are supported on your system. For more information, see the gc-info documentation.
- slices
Sub-category containing configuration options that affect slice operations.
- slices.poplar_options: dict = {}
Set the PopLibs slice options for the session. Must be a dictionary of valid PopLibs slice options. See
embedding::plan
in the PopLibs API reference for the full list of options. The options will be passed to multiSlice, multiUpdate, and multiUpdateAdd poplibs calls. These are most commonly generated when using embeddings.Of particular note is the
availableMemoryProportion
parameter which is the amount of memory allocated for use for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.
- experimental
Sub-category containing experimental configuration options that may be changed or removed with short or no notice.
- experimental.always_rearrange_copies_on_the_host: bool = False
The data which is streamed to/from the device might be stored in different layouts on the device and on the host. If so, rearrangement is performed on the device by default. By enabling this option the rearrangement will be performed on the host at the expense of latency.
- experimental.enable_remote_buffer_embedding: bool = False
When set to true, HostEmbedding will make use of Poplar remote buffers. The creation of this remote buffer may take several minutes. The remote buffer will be synchronised with every IPU execution, so we recommend that you use high steps_per_execution with this option.
- experimental.enable_prng_stability: bool = False
Enable prng seed management. This aims to reduce divergence of weights when running models across multiple replicas with stochastic rounding.
- experimental.multi_replica_distribution
Sub-category containing configuration options controlling multi replica distribution. This will use the Poplar runtime replica subset feature to let multiple processes collaborate on executing the same Poplar program by executing a subset of the global replicas each.
The total global replication factor will be equal to the local replication factor multiplied by the process_count.
- floating_point_behaviour
Sub-category containing configuration options that affect the floating point behaviour of the IPU devices, including stochastic rounding and behaviour when an overflow is encountered during execution. For more information, see Controlling the half-precision floating-point unit.
- floating_point_behaviour.inv: bool = False
If True, a floating point invalid operation (defined by IEEE 754) will cause an exception.
- floating_point_behaviour.div0: bool = False
If True, a floating point divide by zero operation will cause an exception.
- floating_point_behaviour.oflo: bool = False
If True, a floating point overflow will cause an exception.
- floating_point_behaviour.esr: StochasticRoundingBehaviour = StochasticRoundingBehaviour.OFF
A StochasticRoundingBehaviour. If StochasticRoundingBehaviour.OFF (default) then stochastic rounding will be disabled. Otherwise it is enabled with the semantics of the particular option.
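For example, to enable all floating-point exceptions and turn on stochastic rounding (all attribute names are as documented above):

from tensorflow.python import ipu

config = ipu.config.IPUConfig()
config.floating_point_behaviour.inv = True
config.floating_point_behaviour.div0 = True
config.floating_point_behaviour.oflo = True
config.floating_point_behaviour.esr = (
    ipu.config.StochasticRoundingBehaviour.ON)
config.configure_ipu_system()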
- io_tiles
Sub-category containing configuration options that affect parallel I/O on a subset of tiles. For more information, see I/O Tiles.
- io_tiles.place_ops_on_io_tiles: bool = False
Whether to place TensorFlow I/O operations on the I/O tiles.
- io_tiles.available_memory_proportion: float = 0.9
Proportion of I/O tiles’ memory which can be used to store data in, with the remaining memory assumed to be used by code. If the size of data which is to be stored on I/O tiles exceeds the total I/O tiles memory multiplied by this proportion, then a warning message will appear and the operations will not be placed on I/O tiles.
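A sketch of reserving tiles for I/O; note that the num_io_tiles option is assumed from the wider IPUConfig API and is not documented in this excerpt:

from tensorflow.python import ipu

config = ipu.config.IPUConfig()
config.io_tiles.num_io_tiles = 128  # assumed option name
config.io_tiles.place_ops_on_io_tiles = True
config.io_tiles.available_memory_proportion = 0.8
config.configure_ipu_system()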
- ipu_model
Sub-category containing configuration options related to the IPU model. Note that these will only have an effect if you are running with the IPU model enabled. For more information, see TF_POPLAR_FLAGS environment variable.
- matmuls
Sub-category containing configuration options that affect matmuls.
- matmuls.clear_pass_type: bool = False
Controls whether or not the “Pass” type of the MatMul is passed to PopLibs. When set to True, PopLibs will not be told about the type of the MatMuls in the graph. This can save memory in some circumstances, such as large batch ResNet models. See `matMul` in the PopLibs API reference.
- matmuls.poplar_options: dict = {}
Set the PopLibs matrix multiplication options for the session. Must be a dictionary of valid PopLibs matrix multiplication options. See `matMul` in the PopLibs API reference for the full list of options. The options will be applied to all matmul operations in the session graph during compilation.
Of particular note is the `availableMemoryProportion` parameter, which is the amount of memory allocated for temporary data whilst the operation is executing (for example, for intermediate calculated values or temporary values passed between tiles on the IPU). The value is specified as a proportion of available memory on the IPU. So, for example, a value of 0.1 will constrain the library to use 10% of the total memory for temporary data.
See the technical note on Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU for more details and for some practical examples of using `availableMemoryProportion`.
Another important parameter is `partialsType`, which sets the type of the values of intermediate calculations (partials). This parameter can either be set to `"float"` (for float32) or `"half"` (for float16). Note the use of `"float"` or `"half"` and not `"float32"` or `"float16"` for the parameter values (this is because Poplar/PopLibs uses the IEEE definitions of what the datatypes should be called). An example showing how to use this parameter is shown below:

cfg = config.IPUConfig()
cfg.matmuls.poplar_options['partialsType'] = "half"
cfg.configure_ipu_system()
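`availableMemoryProportion` is set through the same dictionary. A minimal sketch (the value 0.3 is illustrative; note that Poplar option values are passed as strings):

from tensorflow.python.ipu import config

cfg = config.IPUConfig()
cfg.matmuls.poplar_options['availableMemoryProportion'] = '0.3'
cfg.configure_ipu_system()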
- norms
Sub-category containing configuration options that affect normalizations. Note that these options will be applied to all normalisation operations encountered (Fused Batch Norm, IPU Specific Group Norm, IPU Specific Layer Norm and IPU Specific Instance Norm).
- norms.use_stable_statistics: bool = False
If True, computes the mean first and subtracts the activations from it before computing the variance. The implementation with this flag set to True is slower than when set to False.
- norms.experimental
Sub-category containing experimental configuration options for normalizations that may be changed or removed with short or no notice.
- norms.experimental.distributed_batch_norm_replica_group_size: int = 1
When executing fused batch-norms for training, this option specifies how many replicas to aggregate the batch statistics across. For example, if a model is being executed across four replicas and this option is set to two, replicas 0 and 1 will be grouped together and replicas 2 and 3 will be grouped together, and the batch norm statistics will be synchronously all-reduced every time the layer is executed (including any recomputation) across the replicas within a group. This option should not be used when using model parallelism (pipelining) and it is not supported with I/O tiles. When recomputation is enabled and the training fused batch norm operation is recomputed, the statistics will have to be all-reduced again, unless the `RecomputeAndBackpropagateInterleaved` recomputation mode is used.
- optimizations
Sub-category containing configuration options that control a variety of optimizations made when lowering the TensorFlow graph to Poplar.
- optimizations.math
Sub-category containing configuration options related to simplifying algebraic mathematical expressions.
- optimizations.math.fast: bool = False
Enables optimizations which allow arbitrary re-associations and transformations of mathematical operations with no accuracy guarantees. Enabling this option can result in incorrect output for programs that depend on an exact implementation of IEEE floating point for maths functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
- optimizations.prefetch_data_streams: bool = True
If True (default), prefetching of data for data streams on the host will be overlapped with execution on the IPU.
- optimizations.combine_embedding_lookups: bool = False
If True, fuse embedding lookups which are on the same tensor. This might improve performance but increase memory usage.
- optimizations.combine_matmuls: bool = False
If True, fuse matmul operations if they share the same weights or the same input.
- optimizations.enable_graph_outlining: bool = True
If True (default), operations in the graph which are the same but with different input tensors may be outlined. This means the same code will be re-used to execute them, reducing the amount of program code, but their inputs will be exchanged into a common memory location to do so, increasing execution time. If you care more about speed than memory, these optimizations can be disabled by setting this option to False.
- optimizations.merge_infeed_io_copies: bool = True
If True, this flag will merge the streamed host to device input copies into one larger copy. This may reduce the time to copy data from the host, at the expense of increasing the live tensor memory on the device.
- optimizations.maximum_cross_replica_sum_buffer_size: int = 0
The maximum number of bytes that can be waiting before a cross-replica sum op is scheduled. 0 (default) means that they are scheduled immediately. This value represents an always-live vs not-always-live trade-off: increasing `maximum_cross_replica_sum_buffer_size` will lead to larger temporary buffers in the cross-replica sums, but fewer cross-replica sums overall and therefore less control code. If your model contains a lot of trainable variables, then it is strongly advised to consider adjusting this option.
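For example, to allow more gradients to accumulate before each cross-replica sum is scheduled (the 64 MB threshold is illustrative):

from tensorflow.python.ipu import config

cfg = config.IPUConfig()
# Schedule a cross-replica sum once 64 MB of data is waiting.
cfg.optimizations.maximum_cross_replica_sum_buffer_size = 64 * 1024 * 1024
cfg.configure_ipu_system()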
- optimizations.maximum_reduce_scatter_buffer_size: int = 0
The maximum number of bytes that can be waiting before a reduce scatter op is scheduled.
- optimizations.maximum_inter_ipu_copies_buffer_size: int = 0
The maximum number of bytes that can be waiting before an inter IPU copy between IPUs is scheduled.
- optimizations.maximum_send_recv_cluster_size: int = 0
The maximum number of bytes that can be waiting before a cluster of send/recv instructions to/from the host is scheduled. These are lowered to stream copies that can be merged by Poplar.
- optimizations.maximum_reduce_many_buffer_size: int = 0
The maximum size (in bytes) a cluster of reduce operations can reach before it is scheduled. These clusters are lowered to popops ReduceMany operations.
- optimizations.maximum_all_gather_buffer_size: int = 0
The maximum size (in bytes) a cluster of all gather operations can reach before it is scheduled. These clusters are lowered to popops AllGather operations.
- optimizations.minimum_remote_tensor_size: int = 128
The minimum size (in bytes) a tensor must be in order to be considered for being stored in remote memory.
- optimizations.merge_remote_buffers: MergeRemoteBuffersBehaviour = MergeRemoteBuffersBehaviour.IF_BENEFICIAL
Whether to merge compatible remote buffers. Merging of remote buffers can allow for more code re-use if the only difference between computations is the remote buffers being accessed. Must be a `MergeRemoteBuffersBehaviour`.
- optimizations.enable_gather_simplifier: bool = True
If True (default), more aggressive optimizations will be done on embedding lookups.
- optimizations.triangular_solve_expander_block_size: int = 0
Defines the block size for the triangular solver expander. The processing within each block is performed on a single tile. The control code for performing computations over blocks is unrolled on the device. For a matrix of rank `N` and block size `B`, there are `log2(N/B)` iterations of the control code. The choice of this parameter therefore has to balance between the amount of data in a tile (a lower value is better, giving better parallelism) and the amount of control code (a larger value is better, giving less control code). A value of 0 (default) selects an implementation-defined default.
- optimizations.cholesky_block_size: int = 0
Defines the block size for the Cholesky factoriser. The processing within each block is performed on a single tile. The control code for performing computations over blocks is unrolled on the device. For a matrix of rank `N` and block size `B`, there are `N/B` iterations of the control code. The choice of this parameter therefore has to balance between the amount of data in a tile (a lower value is better, giving better parallelism) and the amount of control code (a larger value is better, giving less control code). A value of 0 (default) selects an implementation-defined default.
- optimizations.enable_fast_math: bool = False
Note
DEPRECATED: ‘enable_fast_math’ has been moved to ‘optimizations.math.fast’. It will be removed from this location in a future release.
Enables optimizations which allow arbitrary re-associations and transformations of mathematical operations with no accuracy guarantees. Enabling this option can result in incorrect output for programs that depend on an exact implementation of IEEE floating point for maths functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
- pooling
Sub-category containing configuration options that affect pooling operations.
- pooling.poplar_options: dict = {}
Set the PopLibs pooling compilation options for the session. Must be a dictionary of valid PopLibs pooling options. See `pool` in the PopLibs API reference for the full list of options. The options will be applied to all pooling operations in the session graph during compilation.
- scheduling
Sub-category containing configuration options that affect the scheduling of operations in the graph during compilation.
- scheduling.algorithm: SchedulingAlgorithm = SchedulingAlgorithm.CHOOSE_BEST
A `SchedulingAlgorithm`. If `SchedulingAlgorithm.CHOOSE_BEST` (default), several schedules will be created and the one with the lowest predicted liveness chosen. Setting this to a specific scheduling algorithm forces the compiler to use that algorithm when ordering the instructions.
- scheduling.maximum_scheduler_lookahead_depth: int = 5
Controls how far the `LOOK_AHEAD` scheduling algorithm can look beyond a given scheduling decision to understand the max-liveness implications. This search space grows very quickly and can take an unacceptable amount of time for large values. Only used with `SchedulingAlgorithm.LOOK_AHEAD`.
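A minimal sketch of selecting a specific scheduling algorithm (the lookahead depth of 6 is illustrative):

from tensorflow.python.ipu import config

cfg = config.IPUConfig()
cfg.scheduling.algorithm = config.SchedulingAlgorithm.LOOK_AHEAD
cfg.scheduling.maximum_scheduler_lookahead_depth = 6
cfg.configure_ipu_system()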
- configure_ipu_system(device='cpu')
Configure the IPU system with this config.
- Parameters
device – The CPU device which is local to the IPU hardware.
- from_dict(dct)
Restore configuration from a dict object.
- Parameters
dct – A dictionary containing a configuration.
- from_json(json_cfg)
Restore configuration from a JSON string.
- Parameters
json_cfg – A JSON string containing a configuration.
- get_attribute_metadata(attr)
Get the attribute metadata for `attr`.
- Parameters
attr – required, a string which specifies which attribute to retrieve metadata for. Must be its full name relative to the category this method is being called on.
- Returns
An `AttributeMetadata` object containing the metadata for the attribute.
- to_dict()
Export the configuration stored within this configuration object to a dict.
- Returns
A dictionary containing the configuration.
- to_json()
Export the configuration stored within this configuration object as a JSON string.
- Returns
A JSON string containing the configuration.
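The serialization helpers above can be combined to snapshot and restore a configuration; a minimal sketch:

from tensorflow.python.ipu import config

cfg = config.IPUConfig()
json_cfg = cfg.to_json()      # export the configuration as a JSON string
restored = config.IPUConfig()
restored.from_json(json_cfg)  # restore it into a new configuration object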
22.9. Looping utilities
- tensorflow.python.ipu.loops.repeat(n, body, inputs=None, infeed_queue=None, use_while_v1=True)
Builds a loop that executes a fixed number of iterations.
The set of loop-carried tensors corresponds to `inputs`. `body` must be a function that takes and returns the values of the loop-carried tensors.
- Parameters
n – the number of loop iterations
body – a Python function that builds the loop body.
inputs – a list of initial values passed into the loop or None (equivalent to an empty list).
infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.
use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.
- Returns
The final values of the loop-carried tensors.
- Raises
ValueError – if there is a type error.
TypeError – if body has the wrong signature.
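A minimal sketch of `repeat`; the loop would typically be built inside a function compiled for the IPU (for example with `ipu_compiler.compile` or `IPUStrategyV1.run`), and the values here are illustrative:

from tensorflow.python.ipu import loops

def body(total, x):
    # Update the loop-carried tensors: a running total and its increment.
    return total + x, x

def my_net():
    # Ten iterations, starting from the loop-carried values (0.0, 1.0).
    return loops.repeat(10, body, inputs=[0.0, 1.0])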
- tensorflow.python.ipu.loops.while_loop(condition, body, inputs=None, infeed_queue=None, maximum_iterations=None, use_while_v1=True)
Builds a while loop for IPUs.
The set of loop-carried tensors corresponds to `inputs`. Both `condition` and `body` take the current value of the loop-carried tensors. `condition` must return a single boolean value that determines whether iteration continues. `body` must return an updated list of values for the loop-carried tensors.
- Parameters
condition – a Python function that builds the loop condition.
body – a Python function that builds the loop body.
inputs – a list of initial values passed into the loop, or None (equivalent to an empty list).
infeed_queue – if not None, the IPUInfeedQueue from which data is consumed.
use_while_v1 – if True, then use a TensorFlow v1.x dataflow while loop.
- Returns
The final values of the loop-carried tensors.
- Raises
TypeError – if body or condition has the wrong signature.
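A corresponding sketch for `while_loop`, again assuming it is built inside an IPU-compiled function:

from tensorflow.python.ipu import loops

def condition(i, total):
    # Continue iterating while the counter is below ten.
    return i < 10

def body(i, total):
    return i + 1, total + i

def my_net():
    return loops.while_loop(condition, body, inputs=[0, 0])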
22.10. Distribution using PopDist
- class tensorflow.python.ipu.distributed.popdist_strategy.PopDistStrategy(ipu_device='/device:IPU:0', add_ipu_cross_replica_reductions=True, enable_dataset_iterators=True, enable_keras_extensions=True)
This is a distribution strategy for multi-replica distribution that uses compiled communications with GCL for reductions over IPU-Links and GW-Links, across all the global replicas in the application. This is the recommended distribution strategy when using PopDist and PopRun.
PopDist is used for host communication, for example when broadcasting initial values of variables to all processes. Another example is when a reduction is requested with a CPU as the current device.
- __init__(ipu_device='/device:IPU:0', add_ipu_cross_replica_reductions=True, enable_dataset_iterators=True, enable_keras_extensions=True)
- update_ipu_config(config)
Update the given IPU configuration with the multi-replica distribution options.
- Parameters
config – The IPUConfig instance to update.
- Returns
The IPUConfig instance.
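A minimal sketch of using `PopDistStrategy` in a program launched with PopRun (the model-building code is elided):

from tensorflow.python.ipu import config
from tensorflow.python.ipu.distributed import popdist_strategy

strategy = popdist_strategy.PopDistStrategy()

cfg = config.IPUConfig()
# Add the multi-replica distribution options for this process.
cfg = strategy.update_ipu_config(cfg)
cfg.configure_ipu_system()

with strategy.scope():
    pass  # build and train the model here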
22.11. Serving utilities
- tensorflow.python.ipu.serving.export_keras(model, export_dir, batch_size=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False)
Export Keras model using the SavedModel format for TensorFlow serving.
Wrap the model's `call` function inside a `while` loop, add an infeed for the inputs and an outfeed for the outputs, convert any variables into constants and write a SavedModel containing an IPU runtime function and Poplar executable.
- Parameters
model (tf.keras.Model) – The Keras model to export.
export_dir (str) – The path to the directory where the SavedModel will be written.
batch_size (int, optional) – The batch size value to be used in the exported model. If not specified and the model was built with a specified batch size (different than None), the exported model will use the currently set batch size. This argument must be specified if the model's batch size is `None`.
output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel's SignatureDef. If not provided, outputs will be named: `output_0`, `output_1` and so on.
preprocessing_step (Callable or tf.function, optional) – Function that runs the preprocessing step on the CPU device. This function is called just before the Keras model. `preprocessing_step` and the Keras model are exported together. The `preprocessing_step` output is passed directly to the Keras model input queue.
preprocessing_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `preprocessing_step` function. If `preprocessing_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation then this argument can be None and the signature will be captured directly from `preprocessing_step`.
postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. This function is called after the Keras model. `postprocessing_step` and the Keras model are exported together. Tensors from the Keras model output queue are inputs to `postprocessing_step`.
postprocessing_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `postprocessing_step` function. If `postprocessing_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation then this argument can be None and the signature will be captured directly from `postprocessing_step`.
purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and, if the target directory is not empty, the function fails with an error.
- Returns
A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel's `assets` subfolder.
- Return type
tf.function
- Raises
ValueError – If `model` does not have the `export_for_ipu_serving` method.
ValueError – If `export_dir` is not an empty directory and `purge_export_dir` is not set to True.
TypeError – If `preprocessing_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
TypeError – If `postprocessing_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
ValueError – If `preprocessing_step_signature` is an empty tuple or list.
ValueError – If `postprocessing_step_signature` is an empty tuple or list.
ValueError – If `preprocessing_step` is provided, `preprocessing_step_signature` is not provided, and `preprocessing_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
ValueError – If `postprocessing_step` is provided, `postprocessing_step_signature` is not provided, and `postprocessing_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
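A minimal sketch of the export; `model` is assumed to be a Keras model built inside an `IPUStrategyV1` scope, and the path and batch size are illustrative:

from tensorflow.python.ipu import serving

predict_fn = serving.export_keras(
    model, "/tmp/keras_saved_model", batch_size=1)
# predict_fn can be called directly, or the SavedModel can be
# served with TensorFlow Serving.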
- tensorflow.python.ipu.serving.export_pipeline(computational_stages, export_dir, iterations, inputs=None, device_mapping=None, pipeline_schedule=None, poplar_options=None, name=None, predict_step_signature=None, input_dataset=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False)
Create a pipelined SavedModel in `export_dir` for TensorFlow Serving.
Create a pipeline op using `computational_stages`, add an infeed for the inputs and an outfeed for the outputs, freeze any variables into constants and write a SavedModel containing an IPU runtime function (preceded by an optional preprocessing step) and Poplar executable.
SavedModel flow (where predict_step = computational_stages[0]):
`preprocessing_step` (optional, CPU) -> predict_step (IPU) -> `postprocessing_step` (optional, CPU) -> result
- Parameters
computational_stages (list) – A list of Python functions or TensorFlow functions, where each function represents a computational stage in the pipeline. The function takes the outputs of the previous pipeline stage as its inputs.
export_dir (str) – Path to the directory where the SavedModel will be written.
iterations (int) – The number of times each computational stage will be executed during the execution of the pipeline. It can also be considered as the pipeline depth.
inputs (list, optional) – Arguments passed to the first computational stage without using the infeed queue.
device_mapping (list, optional) – If provided, a list of length equal to the number of computational stages. An element at index `i` in the list represents which IPU `computational_stages[i]` should reside on. This can be used to make sure computational stages which share `tf.Variable` objects are resident on the same IPU.
pipeline_schedule (PipelineSchedule, optional) – Which scheduling algorithm to use for pipeline lowering. Defaults to `PipelineSchedule.Grouped`.
poplar_options (list, optional) – If provided, a list of length equal to the number of computational stages. Each element is a `PipelineStageOptions` object which allows for fine-grained control of the Poplar options for a given forward propagation computational stage.
name (str, optional) – Name of this pipeline.
predict_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the first computational stage. If `preprocessing_step` is not provided and `input_dataset` is provided, this argument should be None. If `preprocessing_step` is provided, or `preprocessing_step` and `input_dataset` are not provided, and the first computational stage is a `tf.function` and `input_signature` was specified during `tf.function` creation, then this argument can be None and the signature will be captured directly from the first computational stage.
input_dataset (tf.Dataset, optional) – Dataset from which the SavedModel's `input_signature` will be inferred.
output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel's SignatureDef. If not provided, outputs will be named: `output_0`, `output_1` and so on.
preprocessing_step (Callable or tf.function, optional) – Function that runs the preprocessing step on the CPU device. The function is called just before the first computational stage. `preprocessing_step` and the compiled pipelined computational stages are exported together. The `preprocessing_step` output is passed directly to the input queue of the first computational stage.
preprocessing_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `preprocessing_step` function. If `preprocessing_step` and `input_dataset` are provided, this argument should be None. If `preprocessing_step` is provided and `input_dataset` is not provided, and `preprocessing_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation, then this argument can be None and the signature will be captured directly from `preprocessing_step`.
postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. The function is called after `predict_step`. `postprocessing_step` and `predict_step` are exported together. Tensors from the `predict_step` output queue are `postprocessing_step` inputs.
postprocessing_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `postprocessing_step` function. If `postprocessing_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation, then this argument can be None and the signature will be captured directly from `postprocessing_step`.
purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and, if the target directory is not empty, the function fails with an error.
- Returns
A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel's `assets` subfolder.
- Return type
tf.function
- Raises
ValueError – If `export_dir` is not an empty directory.
TypeError – If `input_dataset` is not a `tf.Dataset` or `NoneType`.
TypeError – If `predict_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
TypeError – If `preprocessing_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
TypeError – If `postprocessing_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
ValueError – If `predict_step_signature` is an empty tuple or list.
ValueError – If `preprocessing_step_signature` is an empty tuple or list.
ValueError – If `postprocessing_step_signature` is an empty tuple or list.
ValueError – If `preprocessing_step` is not provided and both `predict_step_signature` and `input_dataset` are provided.
ValueError – If `preprocessing_step`, `predict_step_signature` and `input_dataset` are not provided, and `predict_step` is not a `tf.function` or is a `tf.function` with no `input_signature` provided.
ValueError – If `preprocessing_step`, `preprocessing_step_signature` and `input_dataset` are all provided.
ValueError – If `preprocessing_step` is provided, both `preprocessing_step_signature` and `input_dataset` are not provided, and `preprocessing_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
ValueError – If `preprocessing_step` and `predict_step_signature` are not provided, and `predict_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
ValueError – If `postprocessing_step` is provided, `postprocessing_step_signature` is not provided, and `postprocessing_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
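A minimal sketch of exporting a two-stage pipeline; the stages, dataset and path are illustrative:

import tensorflow as tf
from tensorflow.python.ipu import serving

def stage1(x):
    return x * 2.0

def stage2(x):
    return x + 1.0

# Hypothetical dataset, used only to infer the input signature.
dataset = tf.data.Dataset.from_tensor_slices(
    tf.zeros([8, 4], tf.float32)).batch(1, drop_remainder=True)

predict_fn = serving.export_pipeline(
    [stage1, stage2], "/tmp/pipeline_saved_model",
    iterations=4, device_mapping=[0, 1], input_dataset=dataset)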
- tensorflow.python.ipu.serving.export_single_step(predict_step, export_dir, iterations, predict_step_signature=None, input_dataset=None, output_names=None, preprocessing_step=None, preprocessing_step_signature=None, postprocessing_step=None, postprocessing_step_signature=None, purge_export_dir=False)
Create a SavedModel in `export_dir` for TensorFlow Serving.
Wrap `predict_step` inside a while loop, add an infeed for the inputs and an outfeed for the outputs, freeze any variables into constants and write a SavedModel containing a compiled IPU runtime function (preceded by an optional preprocessing step) and Poplar executable.
SavedModel flow:
`preprocessing_step` (optional, CPU) -> `predict_step` (IPU) -> `postprocessing_step` (optional, CPU) -> result
- Parameters
predict_step (Callable or tf.function) – Function to compile into the IPU platform and export.
export_dir (str) – Path to the directory where the SavedModel will be written.
iterations (int) – Number of loop iterations.
predict_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `predict_step` function. If `preprocessing_step` is not provided and `input_dataset` is provided, this argument should be None. If `preprocessing_step` is provided, or `preprocessing_step` and `input_dataset` are not provided, and `predict_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation, then this argument can be None and the signature will be captured directly from `predict_step`.
input_dataset (tf.Dataset, optional) – Dataset from which the SavedModel's `input_signature` will be inferred. If `preprocessing_step` is not provided and `predict_step_signature` is provided, this argument should be None. If `preprocessing_step` and `preprocessing_step_signature` are provided, this argument should be None.
output_names (str or list, optional) – Output name or list of output names for the outputs in the SavedModel's SignatureDef. If not provided, outputs will be named: `output_0`, `output_1` and so on.
preprocessing_step (Callable or tf.function, optional) – Function that runs the preprocessing step on the CPU device. The function is called just before `predict_step`. `preprocessing_step` and `predict_step` are exported together. The `preprocessing_step` output is passed directly to the `predict_step` input queue.
preprocessing_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `preprocessing_step` function. If `preprocessing_step` and `input_dataset` are provided, this argument should be None. If `preprocessing_step` is provided and `input_dataset` is not provided, and `preprocessing_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation, then this argument can be None and the signature will be captured directly from `preprocessing_step`.
postprocessing_step (Callable or tf.function, optional) – Function that runs the postprocessing step on the CPU. The function is called after `predict_step`. `postprocessing_step` and `predict_step` are exported together. Tensors from the `predict_step` output queue are `postprocessing_step` inputs.
postprocessing_step_signature (list or tuple, optional) – A sequence of `tf.TensorSpec` objects that describe the input arguments of the `postprocessing_step` function. If `postprocessing_step` is a `tf.function` and `input_signature` was specified during `tf.function` creation, then this argument can be None and the signature will be captured directly from `postprocessing_step`.
purge_export_dir (Boolean, optional) – If True, before starting the export, the target directory is emptied. Otherwise no cleaning is performed and, if the target directory is not empty, the function fails with an error.
- Returns
A reference to the same predict function that was exported using the SavedModel format. This function uses the embedded runtime op to run the executable that was included in the SavedModel's `assets` subfolder.
- Return type
tf.function
- Raises
ValueError – If `export_dir` is not an empty directory.
TypeError – If `input_dataset` is not a `tf.Dataset` or `NoneType`.
TypeError – If `predict_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
TypeError – If `preprocessing_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
TypeError – If `postprocessing_step_signature` is neither a tuple, a list of `tf.TensorSpec` objects, nor a `NoneType`.
ValueError – If `predict_step_signature` is an empty tuple or list.
ValueError – If `preprocessing_step_signature` is an empty tuple or list.
ValueError – If `postprocessing_step_signature` is an empty tuple or list.
ValueError – If `preprocessing_step` is not provided and both `predict_step_signature` and `input_dataset` are provided.
ValueError – If `preprocessing_step`, `predict_step_signature` and `input_dataset` are not provided, and `predict_step` is not a `tf.function` or is a `tf.function` with no `input_signature` provided.
ValueError – If `preprocessing_step`, `preprocessing_step_signature` and `input_dataset` are all provided.
ValueError – If `preprocessing_step` is provided, both `preprocessing_step_signature` and `input_dataset` are not provided, and `preprocessing_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
ValueError – If `preprocessing_step` and `predict_step_signature` are not provided, and `predict_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
ValueError – If `postprocessing_step` is provided, `postprocessing_step_signature` is not provided, and `postprocessing_step` is not a `tf.function` or is a `tf.function` but no `input_signature` is provided.
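A minimal sketch; the function, signature and path are illustrative:

import tensorflow as tf
from tensorflow.python.ipu import serving

@tf.function(input_signature=[tf.TensorSpec([1, 4], tf.float32)])
def predict_step(x):
    # Illustrative computation to be compiled for the IPU.
    return x * 2.0

predict_fn = serving.export_single_step(
    predict_step, "/tmp/single_step_saved_model", iterations=10)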
22.12. Datasets
22.12.1. Dataset benchmarking
- tensorflow.python.ipu.dataset_benchmark.dataset_benchmark(dataset, number_of_epochs, elements_per_epochs, print_stats=True, apply_debug_options=True, do_memcpy=True)
Allows the user to benchmark performance of a `tf.data.Dataset`.
- Parameters
dataset – An instance of `tf.data.Dataset` which will be benchmarked.
number_of_epochs – The number of epochs this dataset will be run for.
elements_per_epochs – The number of elements there are in each epoch.
print_stats – Whether to print statistics about the performance to the console.
apply_debug_options – Whether to apply debug options.
do_memcpy – Whether to perform a `memcpy` operation which simulates a dataset buffer being copied to a Poplar managed buffer.
- Returns
A JSON string with performance statistics, which records the following metrics every epoch:
- `elements_processed` – number of elements processed.
- `total_bytes_processed` – total number of bytes which was processed.
- `time_elapsed` – the time it took (in seconds) for the epoch to complete.
- `elements_per_second` – number of elements processed per second.
- `bandwidth` – the bandwidth achieved, measured in GB/s.
The returned JSON string can be parsed with a native Python JSON library (see https://docs.python.org/3/library/json.html).
- Raises
TypeError – if `dataset` is not an instance of `tf.data.Dataset`.
ValueError – if `number_of_epochs` or `elements_per_epochs` is less than 1.
InvalidArgumentError – if `dataset` contains a shape with a dimension of size 0.
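A minimal sketch of benchmarking a dataset (shapes and counts are illustrative):

import tensorflow as tf
from tensorflow.python.ipu import dataset_benchmark

dataset = tf.data.Dataset.from_tensor_slices(
    tf.zeros([1024, 4], tf.float32)).repeat().batch(
        32, drop_remainder=True)

# Benchmark two epochs of 100 elements each; statistics are printed
# by default and returned as a JSON string.
json_stats = dataset_benchmark.dataset_benchmark(
    dataset, number_of_epochs=2, elements_per_epochs=100)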
- tensorflow.python.ipu.dataset_benchmark.infeed_benchmark(infeed_queue, number_of_epochs, elements_per_epochs, print_stats=True, do_memcpy=True)
Allows the user to benchmark performance of an `ipu.ipu_infeed_queue.IPUInfeedQueue`.
- Parameters
infeed_queue – An instance of `ipu.ipu_infeed_queue.IPUInfeedQueue` which will be benchmarked.
number_of_epochs – The number of epochs this infeed queue will be run for.
elements_per_epochs – The number of elements there are in each epoch.
print_stats – Whether to print statistics about the performance to the console.
do_memcpy – Whether to perform a `memcpy` operation which simulates a dataset buffer being copied to a Poplar managed buffer.
- Returns
A JSON string with performance statistics, which records the following metrics every epoch:
- `elements_processed` – number of elements processed.
- `total_bytes_processed` – total number of bytes which was processed.
- `time_elapsed` – the time it took (in seconds) for the epoch to complete.
- `elements_per_second` – number of elements processed per second.
- `bandwidth` – the bandwidth achieved, measured in GB/s.
The returned JSON string can be parsed with a native Python JSON library (see https://docs.python.org/3/library/json.html).
- Raises
TypeError – if `infeed_queue` is not an instance of `ipu.ipu_infeed_queue.IPUInfeedQueue`.
ValueError – if `number_of_epochs` or `elements_per_epochs` is less than 1.
InvalidArgumentError – if `infeed_queue` contains a shape with a dimension of size 0.
22.12.2. Dataset wrappers
- class tensorflow.python.ipu.data.ops.dataset_ops.BufferDataset(input_dataset, buffer_size)
A `Dataset` which makes sure there is a multiple of `buffer_size` number of elements available.
- __init__(input_dataset, buffer_size)
A `Dataset` which makes sure there is a multiple of `buffer_size` number of elements available.
- Parameters
input_dataset – The input dataset.
buffer_size – The number of dataset elements which will be available.
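A minimal sketch of wrapping a dataset (the buffer size of 16 is illustrative):

import tensorflow as tf
from tensorflow.python.ipu.data.ops import dataset_ops

dataset = tf.data.Dataset.range(100)
# Make elements available in multiples of 16.
dataset = dataset_ops.BufferDataset(dataset, 16)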
22.13. Estimators
22.13.1. IPUEstimator
- class tensorflow.python.ipu.ipu_estimator.IPUEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None, train_batch_size=None, eval_batch_size=None, predict_batch_size=None)
Estimator with IPU support.
IPUEstimator handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. It also provides a simple way to use multiple IPUs in the form of either data parallelism or model parallelism.
The data parallelism is based on graph replication. One batch from the dataset returned by the `input_fn` (of size `batch_size`) is sent to each replica, giving an effective batch size of `num_replicas * batch_size`. The only change needed to the `model_fn` is that the optimizer should be wrapped in a `CrossReplicaOptimizer` in order to average the gradients across the replicas.
This can also be combined with distributed multi-worker training using the `IPUMultiWorkerStrategyV1`, giving a total effective batch size of `num_workers * num_replicas * batch_size`.
The desired global batch size can be passed as `train_batch_size`, `eval_batch_size` and `predict_batch_size`, and the local batch size will be calculated based on the number of replicas and the number of distributed workers and passed to the `input_fn` and `model_fn` in `params['batch_size']`. If the `input_fn` returns a dataset batched with `dataset.batch(params['batch_size'], drop_remainder=True)`, the global batch size will be as desired.
The model parallelism supported by this class is basic sharding. Consider using the `IPUPipelineEstimator` to get pipelined execution.
For efficiency, it supports compiling a graph that contains multiple iterations of the training/prediction/evaluation loop, which will be fully executed on the IPU before yielding back to the TensorFlow Python runtime on the CPU.
See https://tensorflow.org/guide/estimators for general information about estimators.
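A schematic sketch of constructing and training an `IPUEstimator`; the model, data and hyperparameters are illustrative, and the default configuration is assumed:

import tensorflow as tf
from tensorflow.python.ipu import ipu_estimator

def my_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.zeros([32, 4]), tf.zeros([32, 1])))
    return dataset.repeat().batch(4, drop_remainder=True)

def my_model_fn(features, labels, mode):
    # A single dense layer as an illustrative model.
    predictions = tf.keras.layers.Dense(1)(features)
    loss = tf.compat.v1.losses.mean_squared_error(labels, predictions)
    optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.01)
    train_op = optimizer.minimize(
        loss, global_step=tf.compat.v1.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = ipu_estimator.IPUEstimator(model_fn=my_model_fn)
estimator.train(input_fn=my_input_fn, steps=100)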
- Parameters
model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.
model_dir – Directory to save model parameters, graph, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a `PathLike` object, the path will be resolved. If `None`, the model_dir in `config` will be used if set. If both are set, they must be the same. If both are `None`, a temporary directory will be used.
config – A `RunConfig` object.
params – `dict` of hyperparameters that will be passed into `model_fn`. Keys are names of parameters, values are basic Python types.
warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm-start from, or a `tf.estimator.WarmStartSettings` object to fully configure warm-starting. If a string filepath is provided instead of a `tf.estimator.WarmStartSettings`, then all variables are warm-started, and it is assumed that vocabularies and `tf.Tensor` names are unchanged.
train_batch_size – If not None, an int representing the global training batch size. This global batch size is transformed to a local batch size passed as `params['batch_size']` to the `input_fn` and `model_fn` during training. Must be divisible by the number of replicas multiplied by the number of distributed workers.
eval_batch_size – If not None, an int representing the global evaluation batch size. Same behaviour as train_batch_size, only during evaluation.
predict_batch_size – If not None, an int representing the global prediction batch size. Same behaviour as train_batch_size, only during prediction.
- eval_dir(name=None)
Shows the directory name where evaluation metrics are dumped.
- Parameters
name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.
- Returns
A string which is the path of the directory containing the evaluation metrics.
- evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)
Evaluates the model given evaluation data `input_fn`.
- Parameters
input_fn – A function that constructs the input data for evaluation. The function should return a `tf.data.Dataset` object. The outputs of the `Dataset` object must be a tuple `(features, labels)`, where `features` is a `tf.Tensor` or a dictionary of string feature name to `Tensor`, and `labels` is a `Tensor` or a dictionary of string label name to `Tensor`. Both `features` and `labels` are consumed by `model_fn`.
steps – Number of steps for which to evaluate the model.
hooks – List of `tf.train.SessionRunHook` subclass instances. Used for callbacks inside the evaluation call.
checkpoint_path – Path of a specific checkpoint to evaluate. If `None`, the latest checkpoint in `model_dir` is used. If there are no checkpoints in `model_dir`, evaluation is run with newly initialized `Variables` instead of ones restored from checkpoint.
name – Name of the evaluation if the user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in TensorBoard.
- Returns
A dict containing the evaluation metrics specified in `model_fn` keyed by name, as well as an entry `global_step` which contains the value of the global step for which this evaluation was performed.
- experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)
Exports a `SavedModel` with `tf.MetaGraphDefs` for each requested mode.
For each mode passed in via the `input_receiver_fn_map`, this method builds a new graph by calling the `input_receiver_fn` to obtain feature and label `Tensor` objects. Next, this method calls the `Estimator` object's `model_fn` in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the `SavedModel` (order of preference: `tf.estimator.ModeKeys.TRAIN`, `tf.estimator.ModeKeys.EVAL`, then `tf.estimator.ModeKeys.PREDICT`), such that up to three `tf.MetaGraphDefs` are saved with a single set of variables in a single `SavedModel` directory.
For the variables and `tf.MetaGraphDefs`, this method creates a timestamped export directory below `export_dir_base` and writes a `SavedModel` into it containing the `tf.MetaGraphDef` for the given mode and its associated signatures.
For prediction, the exported `MetaGraphDef` will provide one `SignatureDef` for each element of the `export_outputs` dict returned from the `model_fn`, named using the same keys. One of these keys is always `tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY`, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding `tf.estimator.export.ExportOutput` objects, and the inputs are always the input receivers provided by the `serving_input_receiver_fn`.
For training and evaluation, the `train_op` is stored in an extra collection. Loss, metrics, and predictions are included in a `SignatureDef` for the mode in question.
Extra assets may be written into the `SavedModel` via the `assets_extra` argument. This should be a dict, where each key gives a destination path (including the filename) relative to the `assets.extra` directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as `{'my_asset_file.txt': '/path/to/my_asset_file.txt'}`.
- Parameters
export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel` objects.
input_receiver_fn_map – dict of `tf.estimator.ModeKeys` to `input_receiver_fn` mappings, where the `input_receiver_fn` is a function that takes no arguments and returns the appropriate subclass of `InputReceiver`.
assets_extra – A dict specifying how to populate the `assets.extra` directory within the exported `SavedModel`, or `None` if no extra assets are needed.
as_text – whether to write the `SavedModel` proto in text format.
checkpoint_path – The checkpoint path to export. If `None` (the default), the most recent checkpoint found within the model directory is chosen.
- Returns
The path to the exported directory as a bytes object.
- Raises
ValueError – if any `input_receiver_fn` is `None`, no `export_outputs` are provided, or no checkpoint can be found.
- export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')
Exports inference graph as a `SavedModel` into the given directory.
For a detailed guide to using SavedModel, see Using the SavedModel format.
This method builds a new graph by first calling the `serving_input_receiver_fn` to obtain feature `Tensor` objects, and then calling this `Estimator` object's `model_fn` to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given `export_dir_base`, and writes a `SavedModel` into it containing a single `tf.MetaGraphDef` saved from this session.
The exported `MetaGraphDef` will provide one `SignatureDef` for each element of the `export_outputs` dict returned from the `model_fn`, named using the same keys. One of these keys is always `tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY`, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding `tf.estimator.export.ExportOutput` objects, and the inputs are always the input receivers provided by the `serving_input_receiver_fn`.
Extra assets may be written into the `SavedModel` via the `assets_extra` argument. This should be a dict, where each key gives a destination path (including the filename) relative to the `assets.extra` directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as `{'my_asset_file.txt': '/path/to/my_asset_file.txt'}`.
The `experimental_mode` parameter can be used to export a single train/eval/predict graph as a `SavedModel`. See `experimental_export_all_saved_models` for a full description.
- Parameters
export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported `SavedModel` objects.
serving_input_receiver_fn – A function that takes no argument and returns a `tf.estimator.export.ServingInputReceiver` or `tf.estimator.export.TensorServingInputReceiver`.
assets_extra – A dict specifying how to populate the `assets.extra` directory within the exported `SavedModel`, or `None` if no extra assets are needed.
as_text – whether to write the `SavedModel` proto in text format.
checkpoint_path – The checkpoint path to export. If `None` (the default), the most recent checkpoint found within the model directory is chosen.
experimental_mode – `tf.estimator.ModeKeys` value indicating which mode will be exported. Note that this feature is experimental.
- Returns
The path to the exported directory as a bytes object.
- Raises
ValueError – if no `serving_input_receiver_fn` is provided, no `export_outputs` are provided, or no checkpoint can be found.
- get_variable_names()
Returns list of all variable names in this model.
- Returns
List of names.
- Raises
ValueError – If the `Estimator` has not produced a checkpoint yet.
- get_variable_value(name)
Returns value of the variable given by name.
- Parameters
name – string or a list of string, name of the tensor.
- Returns
Numpy array - value of the tensor.
- Raises
ValueError – If the `Estimator` has not produced a checkpoint yet.
- latest_checkpoint()
Finds the filename of the latest saved checkpoint file in `model_dir`.
- Returns
The full path to the latest checkpoint, or `None` if no checkpoint was found.
- property model_fn
Returns the `model_fn` which is bound to `self.params`.
- Returns
The `model_fn` with the following signature: `def model_fn(features, labels, mode, config)`
- predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)
Yields predictions for given features.
- Parameters
input_fn – A function that constructs the features. The function should return a `tf.data.Dataset` object. The outputs of the `Dataset` object should be one of the following: features, a `Tensor` or a dictionary of string feature name to `Tensor` (features are consumed by `model_fn`); or a tuple, in which case the first item is extracted as features.
predict_keys – list of `str`, names of the keys to predict. It is used if `tf.estimator.EstimatorSpec.predictions` is a `dict`. If `predict_keys` is used then the rest of the predictions will be filtered from the dictionary. If `None`, returns all.
hooks – List of `tf.train.SessionRunHook` subclass instances. Used for callbacks inside the prediction call.
checkpoint_path – Path of a specific checkpoint to predict. If `None`, the latest checkpoint in `model_dir` is used. If there are no checkpoints in `model_dir`, prediction is run with newly initialized `Variables` instead of ones restored from checkpoint.
yield_single_examples – If `False`, yields the whole batch as returned by the `model_fn` instead of decomposing the batch into individual elements. This is useful if `model_fn` returns some tensors whose first dimension is not equal to the batch size.
num_predictions – If not `None`, the generator will raise `StopIteration` after yielding this number of predictions. This allows draining the generator by using `list(predictions)`. If `None`, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than what is actually generated, instead of raising the `StopIteration` exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using `list(predictions)`; you have to consume the expected number of elements yourself, for example using `[next(predictions) for _ in range(num_predictions)]`.
- Yields
Evaluated values of `predictions` tensors.
- train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)
Trains a model given training data `input_fn`.
- Parameters
input_fn – A function that provides input data for training as minibatches. The function should return a `tf.data.Dataset` object. The outputs of the `Dataset` object must be a tuple `(features, labels)`, where `features` is a `tf.Tensor` or a dictionary of string feature name to `Tensor`, and `labels` is a `Tensor` or a dictionary of string label name to `Tensor`. Both `features` and `labels` are consumed by `model_fn`.
hooks – List of `tf.train.SessionRunHook` subclass instances. Used for callbacks inside the training loop.
steps – Number of steps for which to train the model. `steps` works incrementally: if you call `train(steps=10)` twice, then training occurs in total 20 steps. If you don't want incremental behaviour, set `max_steps` instead. If set, `max_steps` must be `None`.
max_steps – Number of total steps for which to train the model. If set, `steps` must be `None`. Two calls to `train(steps=100)` means 200 training iterations. On the other hand, two calls to `train(max_steps=100)` means that the second call will not do any iterations since the first call did all 100 steps.
saving_listeners – list of `CheckpointSaverListener` objects. Used for callbacks that run immediately before or after checkpoint savings.
- Returns
`self`, for chaining.
- class tensorflow.python.ipu.ipu_estimator.IPUEstimatorSpec(mode, predictions=None, loss=None, train_op=None, eval_metric_ops=None, eval_metrics=None, host_call=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None)
Ops and objects returned from a `model_fn` and passed to `IPUEstimator`.
This is very similar to `EstimatorSpec`, with the addition of two extra arguments: `eval_metrics` and `host_call`. If neither of those arguments are needed, an `EstimatorSpec` can be passed to the `IPUEstimator` instead.
`eval_metrics` is a tuple of a (`function`, `tensors`), where `tensors` is either a list of `tf.Tensor` or a dict from strings to `tf.Tensor`, that is passed to the function. The function runs on the CPU and returns a dict of metrics. The tensors are transferred from the IPU to the CPU host and passed to the function.
Exactly one of `eval_metrics` and `eval_metric_ops` must be provided during evaluation. The major difference between the two is that while the `eval_metric_ops` will execute directly on the IPU, the `eval_metrics` will execute on the CPU host using the provided function. Example:

def my_metrics_fn(features, labels):
    return {
        "accuracy": tf.metrics.accuracy(labels, features),
        "precision": tf.metrics.precision(labels, features),
        "recall": tf.metrics.recall(labels, features),
    }

eval_metrics = (my_metrics_fn, [features, labels])
spec = IPUEstimatorSpec(mode, loss=loss, eval_metrics=eval_metrics)
`host_call` is a tuple of a function and a list of tensors to pass to that function. `host_call` only works for training and is executed on the CPU for every training step. The tensors are transferred from the IPU to the CPU host and passed to the function.
This functionality can be used, for example, for doing all-reduce of the gradients and weight updates on the host during distributed training with the `IPUMultiWorkerStrategyV1`. Example:

def my_host_fn(*host_gradients):
    # This will all-reduce the gradients and update the weights on the host.
    return optimizer.apply_gradients(zip(host_gradients, variables))

train_op = tf.identity(loss)
grads_and_vars = optimizer.compute_gradients(loss, var_list=variables)
gradients = [g for (g, _) in grads_and_vars]
host_call = (my_host_fn, gradients)
spec = IPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op, host_call=host_call)
See full example: Distributed training.
The various hooks (`training_hooks`, `evaluation_hooks`, `prediction_hooks`) support instances of `tf.estimator.SessionRunHook`. To log tensor values from within the `model_fn`, use the `IPULoggingTensorHook`.
For documentation of the remaining arguments, see `EstimatorSpec`.
- count(value, /)
Return number of occurrences of value.
- eval_metric_ops
Alias for field number 4
- eval_metrics
Alias for field number 5
- evaluation_hooks
Alias for field number 8
- host_call
Alias for field number 6
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- loss
Alias for field number 2
- mode
Alias for field number 0
- prediction_hooks
Alias for field number 9
- predictions
Alias for field number 1
- train_op
Alias for field number 3
- training_hooks
Alias for field number 7
22.13.2. IPUPipelineEstimator
- class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimator(model_fn, model_dir=None, config=None, params=None, warm_start_from=None)
Estimator for pipelining on IPUs.
`IPUPipelineEstimator`, like `IPUEstimator`, handles many of the details of running on IPUs, such as placement of operations and tensors, graph compilation and usage of data feeds. Additionally, it adds support for pipelined execution over multiple IPUs.
The major API difference from the IPUEstimator is that the provided `model_fn` must return an `IPUPipelineEstimatorSpec` that contains the information needed for pipelined execution.
Data parallelism based on graph replication is supported. Each replica will consume `gradient_accumulation_count` batches from the dataset returned by the `input_fn` and accumulate the gradients, giving an effective batch size of `num_replicas * gradient_accumulation_count * batch_size`. The optimizer in the `model_fn` should be wrapped in a `CrossReplicaOptimizer` in order to average the gradients across the replicas.
This can further be combined with distributed multi-worker training using the `IPUMultiWorkerStrategyV1`, giving a total effective batch size of `num_workers * num_replicas * gradient_accumulation_count * batch_size`.
Refer to the `pipelining_ops` documentation for more details about pipelining.
Note: because the `model_fn` is compiled to run on the IPU, you must use the `warm_start_from` parameter for a warm start and not the `tf.train.init_from_checkpoint` method.
- Parameters
model_fn – The model function. Refer to https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/custom_estimators.md#write-a-model-function for details on how to write this function.
model_dir – Directory to save model parameters, graph, etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If a `PathLike` object, the path will be resolved. If `None`, the model_dir in `config` will be used if set. If both are set, they must be the same. If both are `None`, a temporary directory will be used.
config – A `RunConfig` object.
params – `dict` of hyperparameters that will be passed into `model_fn`. Keys are names of parameters, values are basic Python types.
warm_start_from – Optional string filepath to a checkpoint or SavedModel to warm start from, or a `tf.estimator.WarmStartSettings` object to fully configure warm-starting. If a string filepath is provided instead of a `tf.estimator.WarmStartSettings`, then all variables are warm started, and it is assumed that vocabularies and `tf.Tensor` names are unchanged.
- eval_dir(name=None)
Shows the directory name where evaluation metrics are dumped.
- Parameters
name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.
- Returns
A string which is the path of the directory containing the evaluation metrics.
- evaluate(input_fn, steps=None, hooks=None, checkpoint_path=None, name=None)
Evaluates the model given evaluation data
input_fn
.- Parameters
input_fn –
A function that constructs the input data for evaluation. The function should return a
tf.data.Dataset
object. The outputs of theDataset
object must be a tuple(features, labels)
wherefeatures
is atf.Tensor
or a dictionary of string feature name toTensor
labels
is aTensor
or a dictionary of string label name toTensor
Both
features
andlabels
are consumed bymodel_fn
.steps – Number of steps for which to evaluate model.
hooks – List of
tf.train.SessionRunHook
subclass instances. Used for callbacks inside the evaluation call.checkpoint_path – Path of a specific checkpoint to evaluate. If
None
, the latest checkpoint inmodel_dir
is used. If there are no checkpoints inmodel_dir
, evaluation is run with newly initializedVariables
instead of ones restored from checkpoint.name – Name of the evaluation if user needs to run multiple evaluations on different data sets, such as on training data vs test data. Metrics for different evaluations are saved in separate folders, and appear separately in tensorboard.
- Returns
A dict containing the evaluation metrics specified in model_fn keyed by name, as well as an entry global_step which contains the value of the global step for which this evaluation was performed.
- experimental_export_all_saved_models(export_dir_base, input_receiver_fn_map, assets_extra=None, as_text=False, checkpoint_path=None)
Exports a SavedModel with tf.MetaGraphDefs for each requested mode.
For each mode passed in via the input_receiver_fn_map, this method builds a new graph by calling the input_receiver_fn to obtain feature and label Tensor objects. Next, this method calls the Estimator object's model_fn in the passed mode to generate the model graph based on those features and labels, and restores the given checkpoint (or, lacking that, the most recent checkpoint) into the graph. Only one of the modes is used for saving variables to the SavedModel (order of preference: tf.estimator.ModeKeys.TRAIN, tf.estimator.ModeKeys.EVAL, then tf.estimator.ModeKeys.PREDICT), such that up to three tf.MetaGraphDefs are saved with a single set of variables in a single SavedModel directory.
For the variables and tf.MetaGraphDefs, this method builds a timestamped export directory below export_dir_base, and writes a SavedModel into it containing the tf.MetaGraphDef for the given mode and its associated signatures.
For prediction, the exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput objects, and the inputs are always the input receivers provided by the serving_input_receiver_fn.
For training and evaluation, the train_op is stored in an extra collection. Loss, metrics, and predictions are included in a SignatureDef for the mode in question.
Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.
- Parameters
export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModel objects.
input_receiver_fn_map – dict of tf.estimator.ModeKeys to input_receiver_fn mappings, where the input_receiver_fn is a function that takes no arguments and returns the appropriate subclass of InputReceiver.
assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.
as_text – whether to write the SavedModel proto in text format.
checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.
- Returns
The path to the exported directory as a bytes object.
- Raises
ValueError – if any input_receiver_fn is None, no export_outputs are provided, or no checkpoint can be found.
- export_saved_model(export_dir_base, serving_input_receiver_fn, assets_extra=None, as_text=False, checkpoint_path=None, experimental_mode='infer')
Exports inference graph as a SavedModel into the given directory.
For a detailed guide to using SavedModel, see Using the SavedModel format.
This method builds a new graph by first calling the serving_input_receiver_fn to obtain feature Tensor objects, and then calling this Estimator object's model_fn to generate the model graph based on those features. It restores the given checkpoint (or, lacking that, the most recent checkpoint) into this graph in a fresh session. Finally it creates a timestamped export directory below the given export_dir_base, and writes a SavedModel into it containing a single tf.MetaGraphDef saved from this session.
The exported MetaGraphDef will provide one SignatureDef for each element of the export_outputs dict returned from the model_fn, named using the same keys. One of these keys is always tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY, indicating which signature will be served when a serving request does not specify one. For each signature, the outputs are provided by the corresponding tf.estimator.export.ExportOutput objects, and the inputs are always the input receivers provided by the serving_input_receiver_fn.
Extra assets may be written into the SavedModel via the assets_extra argument. This should be a dict, where each key gives a destination path (including the filename) relative to the assets.extra directory. The corresponding value gives the full path of the source file to be copied. For example, the simple case of copying a single file without renaming it is specified as {'my_asset_file.txt': '/path/to/my_asset_file.txt'}.
The experimental_mode parameter can be used to export a single train/eval/predict graph as a SavedModel. See experimental_export_all_saved_models for a full description.
- Parameters
export_dir_base – A string containing a directory in which to create timestamped subdirectories containing exported SavedModel objects.
serving_input_receiver_fn – A function that takes no argument and returns a tf.estimator.export.ServingInputReceiver or tf.estimator.export.TensorServingInputReceiver.
assets_extra – A dict specifying how to populate the assets.extra directory within the exported SavedModel, or None if no extra assets are needed.
as_text – whether to write the SavedModel proto in text format.
checkpoint_path – The checkpoint path to export. If None (the default), the most recent checkpoint found within the model directory is chosen.
experimental_mode – tf.estimator.ModeKeys value indicating which mode will be exported. Note that this feature is experimental.
- Returns
The path to the exported directory as a bytes object.
- Raises
ValueError – if no serving_input_receiver_fn is provided, no export_outputs are provided, or no checkpoint can be found.
- get_variable_names()
Returns list of all variable names in this model.
- Returns
List of names.
- Raises
ValueError – If the
Estimator
has not produced a checkpoint yet.
- get_variable_value(name)
Returns value of the variable given by name.
- Parameters
name – string or a list of strings, name of the tensor.
- Returns
Numpy array - value of the tensor.
- Raises
ValueError – If the
Estimator
has not produced a checkpoint yet.
- latest_checkpoint()
Finds the filename of the latest saved checkpoint file in
model_dir
.- Returns
The full path to the latest checkpoint or
None
if no checkpoint was found.
- property model_fn
Returns the model_fn which is bound to self.params.
- Returns
The model_fn with the following signature: def model_fn(features, labels, mode, config)
- predict(input_fn, predict_keys=None, hooks=None, checkpoint_path=None, yield_single_examples=True, num_predictions=None)
Yields predictions for given features.
- Parameters
input_fn –
A function that constructs the features. The function should return a tf.data.Dataset object. The outputs of the Dataset object should be one of the following:
features: A Tensor or a dictionary of string feature name to Tensor. features are consumed by model_fn.
A tuple, in which case the first item is extracted as features.
predict_keys – list of str, name of the keys to predict. It is used if the tf.estimator.EstimatorSpec.predictions is a dict. If predict_keys is used then the rest of the predictions will be filtered from the dictionary. If None, returns all.
hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the prediction call.
checkpoint_path – Path of a specific checkpoint to predict. If None, the latest checkpoint in model_dir is used. If there are no checkpoints in model_dir, prediction is run with newly initialized Variables instead of ones restored from checkpoint.
yield_single_examples – If False, yields the whole batch as returned by the model_fn instead of decomposing the batch into individual elements. This is useful if model_fn returns some tensors whose first dimension is not equal to the batch size.
num_predictions – If not None, the generator will raise StopIteration after yielding this number of predictions. This allows draining the generator by using list(predictions). If None, the returned generator is infinite and will trigger a fatal error if you try to consume more predictions from it than what is actually generated, instead of raising the StopIteration exception. This is caused by the current behaviour when requesting to run a loop on the IPU for more iterations than there are elements remaining in the dataset. In this case you cannot drain it by using list(predictions); you have to consume the expected number of elements yourself, e.g. using [next(predictions) for _ in range(num_predictions)].
- Yields
Evaluated values of
predictions
tensors.
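For example, a minimal sketch of draining the prediction generator, assuming an already constructed estimator and an input function my_input_fn (both hypothetical):

# With num_predictions set, the generator raises StopIteration and
# can be drained with list().
predictions = estimator.predict(input_fn=my_input_fn, num_predictions=100)
results = list(predictions)

# Without num_predictions, consume exactly the expected number instead:
# results = [next(predictions) for _ in range(expected_count)]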
- train(input_fn, hooks=None, steps=None, max_steps=None, saving_listeners=None)
Trains a model given training data
input_fn
.- Parameters
input_fn –
A function that provides input data for training as minibatches. The function should return a tf.data.Dataset object. The outputs of the Dataset object must be a tuple (features, labels) where:
features is a tf.Tensor or a dictionary of string feature name to Tensor
labels is a Tensor or a dictionary of string label name to Tensor
Both features and labels are consumed by model_fn.
hooks – List of tf.train.SessionRunHook subclass instances. Used for callbacks inside the training loop.
steps – Number of steps for which to train the model. steps works incrementally. If you call train(steps=10) two times, then training occurs in total 20 steps. If you don't want to have incremental behavior please set max_steps instead. If set, max_steps must be None.
max_steps – Number of total steps for which to train model. If set, steps must be None. Two calls to train(steps=100) means 200 training iterations. On the other hand, two calls to train(max_steps=100) means that the second call will not do any iteration since the first call did all 100 steps.
saving_listeners – list of CheckpointSaverListener objects. Used for callbacks that run immediately before or after checkpoint savings.
- Returns
self
, for chaining.
- class tensorflow.python.ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(mode, computational_stages, gradient_accumulation_count=None, eval_metrics_fn=None, optimizer_function=None, device_mapping=None, loss_accumulator_dtype=None, training_hooks=None, evaluation_hooks=None, prediction_hooks=None, reduction_method=GradientAccumulationReductionMethod.SUM, **pipeline_op_kwargs)
Ops and objects returned from a
model_fn
and passed toIPUPipelineEstimator
.- computational_stages
Alias for field number 1
- count(value, /)
Return number of occurrences of value.
- device_mapping
Alias for field number 5
- eval_metrics_fn
Alias for field number 3
- evaluation_hooks
Alias for field number 8
- gradient_accumulation_count
Alias for field number 2
- index(value, start=0, stop=9223372036854775807, /)
Return first index of value.
Raises ValueError if the value is not present.
- loss_accumulator_dtype
Alias for field number 6
- mode
Alias for field number 0
- optimizer_function
Alias for field number 4
- pipeline_op_kwargs
Alias for field number 11
- prediction_hooks
Alias for field number 9
- reduction_method
Alias for field number 10
- training_hooks
Alias for field number 7
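For example, a minimal sketch of a model_fn returning an IPUPipelineEstimatorSpec for a two-stage pipeline (the layer sizes, optimizer and gradient accumulation count are assumptions chosen for illustration):

import tensorflow as tf
from tensorflow import keras
from tensorflow.python import ipu

def model_fn(features, labels, mode):
  def stage1(features, labels):
    partial = keras.layers.Dense(128, activation=tf.nn.relu)(features)
    return partial, labels

  def stage2(partial, labels):
    logits = keras.layers.Dense(10)(partial)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=logits))
    return loss

  def optimizer_function(loss):
    opt = tf.train.GradientDescentOptimizer(0.01)
    return ipu.pipelining_ops.OptimizerFunctionOutput(opt, loss)

  return ipu.ipu_pipeline_estimator.IPUPipelineEstimatorSpec(
      mode,
      computational_stages=[stage1, stage2],
      gradient_accumulation_count=8,
      optimizer_function=optimizer_function)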
22.13.3. Run configs
- class tensorflow.python.ipu.ipu_run_config.IPURunConfig(iterations_per_loop=1, ipu_options=None, num_replicas=1, num_shards=1, ordinal=0, prefetch_depth=None)
IPU related configuration required by
IPUEstimator
.- static __new__(cls, iterations_per_loop=1, ipu_options=None, num_replicas=1, num_shards=1, ordinal=0, prefetch_depth=None)
Creates an
IPURunConfig
instance.- Parameters
iterations_per_loop – The number of mini-batches consumed on the IPU device before returning to the CPU host for each
Session.run
. The global step counter is increased byiterations_per_loop
for everySession.run
. The number of weight updates can be less than the number of iterations if gradient accumulation is used.ipu_options – An
IPUConfig
which you have populated with your desired configuration options before creating this IPURunConfig. TheIPUEstimator
will then configure the IPU system with thisipu_options
object when it builds your model.num_replicas – Number of replicated graphs (data parallelism)
num_shards – Number of IPU devices on which the graph is sharded (model parallelism)
ordinal – The IPU device ordinal to use. For instance, 0 corresponds to
/device:IPU:0
.prefetch_depth – Integer or
None
. Theprefetch_depth
to be used by theIPUInfeedQueue
that is created internally.
- class tensorflow.python.ipu.ipu_run_config.RunConfig(ipu_run_config=None, master=None, **kwargs)
RunConfig with IPU support.
- __init__(ipu_run_config=None, master=None, **kwargs)
Constructs a RunConfig with IPU support.
Below are the arguments specific to the RunConfig for IPUs. See the base class documentation for the remaining arguments.
- Parameters
ipu_run_config –
IPURunConfig
object for IPU-specific configuration.master – a string. The address of the distributed master to use for training.
22.13.4. Session run hooks
- class tensorflow.python.ipu.ipu_session_run_hooks.IPULoggingTensorHook(every_n_iter=None, every_n_secs=None, at_end=False, formatter=None, logging_mode=IPUOutfeedMode.LAST)
Prints the given tensors every N local steps, every N seconds, or at end.
This is a version of
tf.estimator.LoggingTensorHook
that supports logging from inside a function compiled for the IPU. The implementation uses an IPU outfeed in order to send the tensors from the compiled function to the host.The tensors will be printed to the log, with
INFO
severity.- LoggingMode
alias of
IPUOutfeedMode
- __init__(every_n_iter=None, every_n_secs=None, at_end=False, formatter=None, logging_mode=IPUOutfeedMode.LAST)
Initializes the hook.
- Parameters
every_n_iter –
int
, print the tensor values once every N steps.every_n_secs –
int
orfloat
, print the tensor values once every N seconds. Exactly one ofevery_n_iter
andevery_n_secs
should be provided (unlessat_end
is True).at_end –
bool
specifying whether to print the tensor values at the end of the run.formatter – function that takes a dict with tensor names and values and returns a string. If None, uses default formatting.
logging_mode –
IPULoggingTensorHook.LoggingMode
that determines the behaviour when enqueuing multiple tensor values between dequeues (e.g. print all of them or only the last one).
- after_run(run_context, run_values)
Called after each call to run().
The
run_values
argument contains results of requested ops/tensors bybefore_run()
The run_context argument is the same one sent to the before_run call. run_context.request_stop() can be called to stop the iteration.
If session.run() raises any exceptions then after_run() is not called.
run_context – A
SessionRunContext
object.run_values – A SessionRunValues object.
- begin()
Called once before using the session.
When called, the default graph is the one that will be launched in the session. The hook can modify the graph by adding new operations to it. After the begin() call the graph will be finalized and the other callbacks can not modify the graph anymore. A second call of begin() on the same graph should not change the graph.
- end(session)
Called at the end of session.
The session argument can be used in case the hook wants to run final ops, such as saving a last checkpoint.
If session.run() raises an exception other than OutOfRangeError or StopIteration then end() is not called. Note the difference between end() and after_run() behavior when session.run() raises OutOfRangeError or StopIteration. In that case end() is called but after_run() is not called.
session – A TensorFlow Session that will be soon closed.
- log(tensors)
Logs the given
tensors
.- Parameters
tensors – either a dict from string to
tf.Tensor
, a list/tuple oftf.Tensor
objects, or atf.Tensor
.- Returns
The logging operation. It might be necessary to add a control dependency on this operation, or include it in the training operation using tf.group(), to prevent it from being pruned from the graph.
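For example, a minimal sketch of logging a loss value from inside a compiled training function (the surrounding model code is illustrative):

import tensorflow as tf
from tensorflow.python import ipu

logging_hook = ipu.ipu_session_run_hooks.IPULoggingTensorHook(every_n_iter=10)

def training_step(loss):
  log_op = logging_hook.log({"loss": loss})
  # The control dependency prevents the logging op from being pruned.
  with tf.control_dependencies([log_op]):
    return tf.identity(loss)

The hook itself must also be passed in the hooks list of the estimator or session so that it can dequeue and print the values on the host.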
22.14. Operators
It is also possible to access the operators via the
tensorflow.python.ipu.ops
namespace, for example:
tensorflow.python.ipu.ops.normalization_ops.group_norm()
.
- tensorflow.python.ipu.application_compile_op.experimental_application_compile_op(func, inputs=None, output_path=None, freeze_variables=False, name=None)
An operation that compiles a function into an executable for the IPU. The operation itself should be placed on CPU, and it will compile for the default IPU device.
WARNING: This API is experimental and subject to change.
Example usage:
def model(x):
  return x * x

v = tf.placeholder(tf.float32, shape=(2,))
compile_model = experimental_application_compile_op(model, inputs=[v])

with tf.Session() as sess:
  executable_path = sess.run(compile_model, {v: np.zeros(v.shape)})
- Parameters
func – The Python function to compile.
inputs – The inputs passed to the function, as
func(*inputs)
.output_path – The path where the executable will be stored. If None, a temporary file is used.
freeze_variables – If True, any referenced variables will be captured by their values (when the compile op is executed) and embedded into the compiled executable as constants. If False, the referenced variables instead become implicit inputs that must be provided when executing the compiled executable.
name – Optional op name.
- Returns
A
Tensor
of type string with the path to the compiled executable.
22.14.1. Control flow operations.
- tensorflow.python.ipu.control_flow_ops.barrier(tensors, insert_barrier_for_gradients=False, name=None)
A control flow operation to force the scheduling of operations in the Poplar XLA backend.
For example given the following program:
def func(a, b, c, d):
  e = a + b
  f = c + d
  g = e + a
  return f, g
The operations f and g are independent of each other, meaning that either f or g can execute first. However, if we want to force f to execute first, we can insert a barrier operation:

def func(a, b, c, d):
  e = a + b
  f = c + d
  f, a = ipu.control_flow_ops.barrier([f, a])
  g = e + a
  return f, g

This will result in f executing before g as now there is a data dependency between the operations.
- Parameters
tensors – A tensor or a structure of tensors which all have to be executed before the outputs of the barrier operation can be used.
- Returns
A tensor or a structure of tensors which matches shape and type of the
tensors
arg.
22.14.2. Custom operations
- tensorflow.python.ipu.custom_ops.codelet_expression_op(vertex_expression, *args)
Add a custom fused elementwise expression operation to the graph.
The automatic gradient calculation in TensorFlow does not have visibility of the operations performed by this function and so this operation cannot be used for training.
In the following example, the Python function my_custom_op() provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.

def my_custom_op(x, y, z):
  return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)
- Parameters
vertex_expression – A Python function that defines the codelet expression.
args – The tensor inputs to the expression.
- Returns
The Tensor which is a result of applying the elementwise operation
- tensorflow.python.ipu.custom_ops.cpu_user_operation(inputs, library_path, outs=None, name='UserOp', op_name='Callback', separate_gradients=False, inputs_with_gradients=None, attributes=None, gradient_attributes=None)
Call the CPU function located in the shared library at library_path as part of the normal TensorFlow execution with the given inputs copied from the IPU to the CPU, and the outputs copied back to the IPU afterwards.
The shape and type of the outputs should be specified by outs. If it is None it will default to no output. outs should be a dictionary with two elements like so:

outs = {
    "output_types": [my_types_as_a_list],
    "output_shapes": [my_shapes_as_a_list],
}
- Parameters
inputs – The tensor inputs to the operation.
library_path – The path to the shared object that contains the functions to execute the operation.
outs – A dictionary describing the output tensor shapes and types.
name – The name of the operation.
op_name – The prefix of the functions inside the shared object file. This defaults to ‘Callback’.
separate_gradients – When set to
True
, multiple gradient ops will be generated, one for each input. WhenFalse
, a single gradient op will be generated, which should produce the partial derivatives for all inputs.inputs_with_gradients – A list of input indices. If this is defined then the op will only calculate derivatives for the specified inputs.
attributes – An optional string object which is passed as an argument to the Poplar function. Allows you to specify function attributes which were not known at the compile time of the C++ Poplar function. Can be used to pass a JSON or ProtoBuf serialized string to the Poplar function for ease of use. See the documentation for examples.
gradient_attributes – Same as attributes, however this is passed as the attributes argument to the gradient operations (if training).
- Returns
The array of tensor outputs.
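For example, a hypothetical sketch of calling a CPU callback from a shared library (the library path, shapes and types are placeholders):

import tensorflow as tf
from tensorflow.python import ipu

# Hypothetical shared object exporting functions with the 'Callback' prefix.
lib_path = "./libmy_cpu_op.so"

outs = {
    "output_types": [tf.float32],
    "output_shapes": [tf.TensorShape([128])],
}

def my_net(x):
  return ipu.custom_ops.cpu_user_operation([x], lib_path, outs=outs)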
- tensorflow.python.ipu.custom_ops.precompiled_user_op(inputs, library_path, gp_path='', outs=None, name='UserOp', op_name='Build', separate_gradients=False, inputs_with_gradients=None, attributes=None, gradient_attributes=None)
Call the Poplar function located in the shared library at library_path as part of the normal TensorFlow execution with the given inputs.
The shape and type of the output should be specified by outs. If it is None it will default to no output. outs should be a dictionary with two elements like this:

outs = {
    "output_types": [my_types_as_a_list],
    "output_shapes": [my_shapes_as_a_list],
}
- Parameters
inputs – The tensor inputs to the operation.
library_path – The path to the shared object file that contains the functions to build the Poplar operation in the graph.
gp_path – The path to a precompiled codelet file, if you have one.
outs – A dictionary describing the output tensor shapes and types.
name – The name of the operation in TensorFlow.
op_name – The prefix of the functions inside the shared object file. This defaults to ‘Build’.
separate_gradients – When set to true, multiple gradient ops will be generated, one for each input. When false, a single gradient op will be generated, which should produce the partial derivatives for all inputs (or all inputs specified in
inputs_with_gradients
).inputs_with_gradients – A list of input indices. If this is defined then the op will only calculate derivatives for the specified inputs.
attributes – An optional string object which is passed as an argument to the build function. Allows you to specify function attributes which were not known at the compile time of the C++ Poplar function. Can be used to pass a JSON or ProtoBuf serialized string to the Poplar function for ease of use. See the documentation for examples.
gradient_attributes – The same as
attributes
, however this is passed as theattributes
argument to the gradient operation (if training).
- Returns
The array of tensor outputs.
22.14.3. Functional operators
- tensorflow.python.ipu.functional_ops.outlined_function(func=None, unique_sharding=False, keep_input_layouts=None, name=None)
An outlined function is a block of organized, reusable code which is used to perform a single action. Functions provide better modularity for your application and a high degree of code reuse, which can decrease memory usage at the expense of passing the arguments around.
Arguments can be passed in two ways: as a parameter of the Python function func, or as a value defined in the enclosing scope and used within func. Arguments that are compile-time graph constants should be defined in the enclosing scope, as this makes them eligible for expression evaluation. Arguments passed via function parameters will always be treated as runtime values.
If the provided function contains any stateful operations, such as stateful random number generation, then the function cannot be reused and it will be inlined automatically.
See the documentation for more details and examples.
- Parameters
func – A python function which takes a list of positional arguments only. All the arguments must be
tf.Tensor
-like objects, or be convertible to them. See the documentation for examples of how to pass nontf.Tensor
-like objects to the functions. The function provided must return at least onetf.Tensor
-like object.unique_sharding – Makes sure that all function inputs are copied to a single device before the function call is executed. Enabling this can increase performance as any inter IPU communication can be more efficiently scheduled and any duplicated copies can be elided.
keep_input_layouts – A hint to decide whether to keep the layouts of the function inputs when calling the function, or re-allocate them based on the operations inside the function. Reallocating them can improve the performance, but it can also increase the IPU code size. When set to None, this option will be decided automatically.
name – The name of the function.
- Returns
An
Operation
that executes the function.
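For example, a minimal sketch of outlining a block that is called twice, so its code is emitted once and reused (the shapes are illustrative):

import tensorflow as tf
from tensorflow.python import ipu

@ipu.functional_ops.outlined_function
def block(x, w):
  # The body is compiled once and reused at every call site.
  return tf.matmul(x, w)

def my_net(x, w1, w2):
  x = block(x, w1)
  x = block(x, w2)
  return x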
22.14.4. Image operations
- tensorflow.python.ipu.image_ops.normalise_image(image, channel_offsets, channel_scales, scale=1, name=None)
Pad an image to have 4 channel dimensions and normalise it according to the following formula:
image = (image[c] * scale - channel_offsets[c]) * channel_scales[c]
for each of the
c
channels in the image.- Parameters
image – An
[X,Y,Z,3]
tensor, where the channels are the innermost dimension. Must beuint8
,float32
orfloat16
.channel_offsets – A
[3]
array or tensor of offsets for the channels.channel_scales – A
[3]
array or tensor of scales for the channels.scale – A scalar constant that will scale the image before channel normalization. Defaults to 1.
name – Optional op name.
- Returns
An [X,Y,Z,4] tensor with the same type as the input image, except uint8 inputs where the output is float16.
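For example, a minimal sketch normalising a batch of uint8 RGB images with illustrative per-channel constants (here the commonly used ImageNet statistics, which are an assumption for this example):

from tensorflow.python import ipu

def my_net(images):
  # images: [X, Y, Z, 3] uint8 tensor. The scale maps 0-255 to 0-1,
  # then the per-channel offsets and scales are applied.
  return ipu.image_ops.normalise_image(
      images,
      channel_offsets=[0.485, 0.456, 0.406],
      channel_scales=[1.0 / 0.229, 1.0 / 0.224, 1.0 / 0.225],
      scale=1.0 / 255)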
22.14.5. Graphcore utility operations
- tensorflow.python.ipu.internal_ops.fifo(x, depth, offload=False, name=None)
Introduces a first-in-first-out queue with a fixed depth.
- Parameters
x – The tensor to enqueue.
depth – The depth of the queue.
offload – Whether to offload the queue storage to Poplar remote buffers.
name – Optional op name.
- Returns
A
Tensor
which was dequeued from the fifo. This will bex
att - depth
. The firstdepth
iterations will have unspecified values.
- tensorflow.python.ipu.internal_ops.get_current_iteration_counter(name=None, **kwargs)
Returns which gradient accumulation iteration the pipeline is in.
- Returns
A scalar tensor with the iteration count.
- tensorflow.python.ipu.internal_ops.print_tensor(input, name='')
Print the specified input.
- Parameters
input – The tensor to print.
name – Optional op name.
- Returns
An operator that prints the specified input to the standard error. For the tensor to be printed, one must either return it as part of the XLA function which is consumed by ipu_compiler.compile, include the returned op in the input to session.run, or use the operator as a control dependency for executed ops by specifying with tf.control_dependencies([print_op]).
Examples
Returning the print operation as part of the XLA function:
import tensorflow as tf
from tensorflow.python.ipu import internal_ops
from tensorflow.python.ipu import scopes

def my_net(v):
  print_op = internal_ops.print_tensor(v)
  v = v + 1
  return v, print_op

with scopes.ipu_scope("/device:IPU:0"):
  res = ipu_compiler.compile(my_net, inputs=[v])
  ...
Including the print operation in session.run:
import numpy as np
import tensorflow as tf
from tensorflow.python.ipu import internal_ops
from tensorflow.python.ipu import scopes

with scopes.ipu_scope("/device:IPU:0"):
  pa = tf.placeholder(np.float32, [2, 2], name="a")
  print_op = internal_ops.print_tensor(pa)
  x = pa + 1

with tf.Session() as session:
  result = session.run([x, print_op], feed_dict={pa: np.ones([2, 2])})
  ...
Using control dependencies:
import numpy as np
import tensorflow as tf
from tensorflow.python.ipu import internal_ops
from tensorflow.python.ipu import scopes

with scopes.ipu_scope("/device:IPU:0"):
  pa = tf.placeholder(np.float32, [2, 2], name="a")
  print_op = internal_ops.print_tensor(pa)
  with tf.control_dependencies([print_op]):
    x = pa + 1

with tf.Session() as session:
  result = session.run(x, feed_dict={pa: np.ones([2, 2])})
  ...
- tensorflow.python.ipu.internal_ops.remap(x, name=None)
Clone and map the input linearly across the IPU.
- Parameters
x – The tensor to remap.
name – Optional op name.
- Returns
A Tensor which has been linearly mapped across the IPU.
- tensorflow.python.ipu.internal_ops.remap_deduce(x, name=None)
Clone the tensor and deduce the tile mapping.
- Parameters
x – The tensor to remap.
name – Optional op name.
- Returns
A Tensor which has been mapped across the IPU by deducing the tile layout from the input parameter.
22.14.6. IPU specific maths operations
- tensorflow.python.ipu.math_ops.segment_sum(data, segment_ids, num_segments, name=None)
Computes the sum along segments of a tensor, such that:
\[output_i = \sum_j data_j\]

where the sum is over j such that segment_ids[j] == i.

If the sum is empty for a given segment ID i then output[i] = 0.

Segments are partitions of a tensor along the first dimension indexed by a 1-D segment_ids tensor.
For example:
c = tf.constant([[1, 2, 3, 4], [4, 3, 2, 1], [5, 6, 7, 8]])
tf.segment_sum(c, tf.constant([0, 0, 1]), 2)
# ==> [[5, 5, 5, 5],
#      [5, 6, 7, 8]]
Caution
The segment_ids must be sorted in ascending order. If provided with an unsorted tensor, no exception will be raised and the behaviour of this operation is undefined.
num_segments must be specified and must be greater than 1 + max(segment_ids).
.- Parameters
data –
tf.Tensor
with rank >= 1.segment_ids – A sorted
tf.Tensor
ofint32
with rank == 1 and the same length as the 0th dimension ofdata
.num_segments – Number of segments to take within
data
.name – Name for the operation (optional).
- Returns
A
tf.Tensor
of the same type and rank as data but where the length of the 0th dimension is equal tonum_segments
, which comprises the sum of all the elements within the same segment in each cross-section.- Raises
ValueError – If the ranks of data and segment_ids are not fully defined.
ValueError – If the lengths of the 0th dimension of data and segment_ids are not equal.
ValueError – If data does not have at least rank 1.
ValueError – If segment_ids does not have a rank equal to 1.
- tensorflow.python.ipu.math_ops.serialized_matmul(a, b, serialization_factor, serialization_dimension, transpose_a=False, transpose_b=False, name=None)
Multiplies matrix a by matrix b, producing a * b, with the multiplication being serialized on one of the dimensions.
Serializing a matrix multiplication operation can reduce the code size of the multiplication at the expense of extra computation due to copying of tensors.
The inputs must, following any transpositions, be tensors of rank >= 2 where the inner 2 dimensions specify valid matrix multiplication dimensions, and any further outer dimensions specify matching batch size.
Either matrix can be transposed on the fly by setting the corresponding flag to True. These are False by default.
Given the tensor a with shape [..., m, k] and tensor b with shape [..., k, n] after the transpositions, the matrix multiplication can be serialized as follows:
a
(them
-dimension), by settingserialization_dimension
toa_columns
.Along the rows dimension of
a
and the columns dimension ofb
(thek
-dimension), by settingserialization_dimension
toa_rows_b_columns
.Along the rows dimension of
b
(then
-dimension), by settingserialization_dimension
tob_rows
.
Note that taking a gradient of a serialized matrix multiplication means that the backward propagation of the matrix multiply will also be serialized.
Note that adjoining and sparse matrices are not supported.
- Parameters
a –
tf.Tensor
of type float16, float32, int32 and rank >= 2.b –
tf.Tensor
with same type and rank as a.serialization_factor – An integer indicating the number of smaller matrix multiplies this operation is broken up into. Must divide the dimension along which the operation is serialized on.
serialization_dimension – A string, must be one of
a_columns
,a_rows_b_columns
orb_rows
. Indicates the dimension along which the operation is serialzed on.transpose_a – If True, a is transposed before multiplication.
transpose_b – If True, b is transposed before multiplication.
name – Name for the operation (optional).
- Returns
A tf.Tensor of the same type as a and b where each inner-most matrix is the product of the corresponding matrices in a and b, e.g. if all transpose attributes are False:
output[..., i, j] = sum_k (a[..., i, k] * b[..., k, j]), for all indices i, j.
22.14.7. Pipelining operators
- class tensorflow.python.ipu.pipelining_ops.OptimizerFunctionOutput(opt, loss, compute_gradients_args=None, compute_gradients_kwargs=None, apply_gradients_args=None, apply_gradients_kwargs=None, variables=None, tape=None, gradient_capture_context=None, captured_gradient_outfeed=None)
A helper class used for returning a structured output from an optimizer_function in a pipeline.
- __init__(opt, loss, compute_gradients_args=None, compute_gradients_kwargs=None, apply_gradients_args=None, apply_gradients_kwargs=None, variables=None, tape=None, gradient_capture_context=None, captured_gradient_outfeed=None)
Creates an OptimizerFunctionOutput object.
- Parameters
opt – An instance of
optimizer.Optimizer
which is used to generate the back-propagation and the weight update pipeline stages.loss – The loss which is passed to the optimizer when calling
compute_gradients
.compute_gradients_args – Positional arguments (not including loss) which are passed to the
compute_gradients
function.compute_gradients_kwargs – Keyword arguments (not including loss) which are passed to the
compute_gradients
function.apply_gradients_args – Positional arguments (not including grads_and_vars) which are passed to the
apply_gradients
function.apply_gradients_kwargs – Keyword arguments (not including grads_and_vars) which are passed to the
apply_gradients
function.variables – A list or tuple of variables to compute gradients with respect to when
opt
is an instance ofOptimizerV2
tape – A GradientTape for gradient computation when opt is an instance of OptimizerV2.
gradient_capture_context – An ipu.eager.backprop.GradientCaptureContext for accessing gradients captured by ipu.ops.grad_util_ops.capture_upstream_gradients.
captured_gradient_outfeed – An ipu.IPUOutfeedQueue to which any captured gradients are pushed.
- class tensorflow.python.ipu.pipelining_ops.PipelineSchedule(value)
The PipelineSchedule describes how stages are interleaved on the IPUs servicing the pipeline. The forward and backward passes of each stage will execute on the same IPUs. So, in the core of the pipeline there is a choice as to whether to run the forward stages together, or the backward stages and the forward stages together.
- Grouped
This groups the forward passes on multiple IPUs. This requires more memory since activations need to be stored until the backward stages run together. However, since forward passes tend to be smaller than backward passes, Grouped tends to improve the speed of the execution, as different IPUs don’t spend so much time waiting for each other.
- Interleaved
This schedules the backward passes whenever the forward passes have just generated some activations. Consequently fewer activations are required to be stored between the forward and backward pipeline stages, so less memory is required. However, since forward and backward stages tend to be very different in terms of execution cycles, the overall performance of the pipeline tends to be slower.
- Sequential
This is a debug mode, where the pipeline is scheduled in the same way as if it were a sharded model.
- class tensorflow.python.ipu.pipelining_ops.PipelineStageOptions(convolution_options=None, matmul_options=None, slice_options=None)
A helper class which can be used to configure Poplar compilation options (such as availableMemoryProportion or partialsType) inside a pipeline forward, backward and weight update stage. This will override the global options set by the convolution, matmul and slice Poplar options in the IPUConfig.
- __init__(convolution_options=None, matmul_options=None, slice_options=None)
Creates a PipelineStageOptions object.
- Parameters
convolution_options – If provided, a dictionary of Poplar option flags for all the convolution operations in the stage.
matmul_options – If provided, a dictionary of Poplar option flags for all the matmul operations in the stage.
slice_options – If provided, a dictionary of Poplar option flags for all the slice operations in the stage.
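For example, a minimal sketch lowering the available memory proportion of the convolutions in every forward-propagation stage of a two-stage pipeline (the option value is an assumption chosen for illustration):

from tensorflow.python import ipu

stage_options = ipu.pipelining_ops.PipelineStageOptions(
    convolution_options={"availableMemoryProportion": "0.2"})

# One entry per computational stage:
# pipelining_ops.pipeline(
#     computational_stages=[stage1, stage2],
#     forward_propagation_stages_poplar_options=[stage_options, stage_options],
#     ...)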
- class tensorflow.python.ipu.pipelining_ops.RecomputationMode(value)
When working with pipeline models for training, recomputation might be required in order to reduce the number of activations being stored on the device at any given time.
This Enum class is used to control the recomputation implementation, with the following approaches supported:
Auto
: automatically try and select the best recomputation strategy based on the provided model and pipeline schedule.RecomputeThenBackpropagate
: first recompute all the activations and then perform backpropagation. This mode allows for better code reuse as the corresponding forward propagation and the recomputation operations can share the exact same code. This recomputation mode is supported byPipelineSchedule.Grouped
andPipelineSchedule.Interleaved
pipeline schedules. This is the default recomputation mode forPipelineSchedule.Grouped
andPipelineSchedule.Interleaved
pipeline schedules.RecomputeAndBackpropagateInterleaved
: recompute and backpropagate operations are interleaved together. This mode can help reduce the maximum liveness compared toRecomputeThenBackpropagate
as the backpropagation operations can be scheduled as soon as possible, however less code reuse will be possible. This recomputation mode is supported byPipelineSchedule.Grouped
andPipelineSchedule.Sequential
pipeline schedules. This is the default recomputation mode for thePipelineSchedule.Sequential
pipeline schedule.
- tensorflow.python.ipu.pipelining_ops.pipeline(computational_stages, gradient_accumulation_count=None, gradient_accumulation_dtype=None, gradient_accumulation_for_captured_grads=True, repeat_count=1, batch_serialization_iterations=1, inputs=None, infeed_queue=None, outfeed_queue=None, optimizer_function=None, device_mapping=None, pipeline_schedule=None, recomputation_mode=None, forward_propagation_stages_poplar_options=None, backward_propagation_stages_poplar_options=None, weight_update_poplar_options=None, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, offload_activations=None, offload_gradient_accumulation_buffers=None, replicated_weight_sharding=None, offload_weights=None, continuous_weight_updates=False, outfeed_loss=False, accumulate_outfeed=False, accumulate_outfeed_dtype=None, outfeed_mask=None, reduction_method=GradientAccumulationReductionMethod.SUM, name=None)
Sets up a series of computational stages, where the outputs of one stage are the inputs to the next one. These stages are then executed in parallel across multiple IPUs. This approach can be used to split a model so that different layers are executed on different IPUs.
The first stage takes the
inputs
and theinfeed_queue
(if provided) as its inputs. If theinfeed_queue
is provided, it is automatically dequeued (similar to the ipu.loops API) therefore care needs to be taken to make sure the signature of the first pipeline stage matches both the arguments frominputs
and theinfeed_queue
, otherwise an error is thrown.All tensors which are used in the pipeline which are not TensorFlow Variables need to be explicitly passed as inputs to the pipeline. If an input does not change its value during the execution of the pipeline op (for example hyperparameters such as learning rate), it needs to be passed as part of
inputs
. Alternatively, if these values change during execution (for example the model processes different batches of data) the input should be passed through theinfeed_queue
(seeIPUInfeedQueue
).When training a model, an optional
optimizer_function
function can be provided. This function takes all the outputs from the last computational stage as inputs, and returns an instance ofOptimizerFunctionOutput
that is used to generate the backwards pass of the model using the TensorFlow Optimizer API. This will internally create corresponding backpropagation pipeline stages for each pipeline stage and colocate them such that the activations and weights required for the gradient calculation and application stay on the device in order to minimise the number of copies between IPUs.Note that the gradients, which are calculated by the
compute_gradients
function, will be accumulated automatically during the execution of the pipeline, unlesscontinuous_weight_updates
is enabled.If the last computational stage has any outputs, then an
outfeed_queue
(seeIPUOutfeedQueue
) is required and all the outputs from the last computational stage are enqueued to theoutfeed_queue
.Note that pipelining supports the recomputation of activations for stateless ops during the backwards pass. This reduces the number of activations that will be stored on the device, saving memory at the expense of additional computation. To enable recomputation, set the
tensorflow.python.ipu.config.IPUConfig.allow_recompute
attribute toTrue
when configuring the device.For example a simple inference network for the MNIST can be split across two IPUs:
from tensorflow import keras

# Create the dataset
#...

# Create the data queues from/to IPU.
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset)
outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

# Create a pipelined model which is split across two stages.
def stage1(image):
  partial = keras.layers.Dense(256, activation=tf.nn.relu)(image)
  partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial)
  return partial

def stage2(partial):
  logits = keras.layers.Dense(10)(partial)
  probabilities = tf.nn.softmax(logits)
  classes = tf.argmax(input=logits, axis=1)
  return probabilities, classes

def model():
  with variable_scope.variable_scope("vs", use_resource=True):
    pipeline_op = pipelining_ops.pipeline(
        computational_stages=[stage1, stage2],
        gradient_accumulation_count=250,
        repeat_count=2,
        inputs=[],
        infeed_queue=infeed_queue,
        outfeed_queue=outfeed_queue,
        device_mapping=[3, 1],
        name="Pipeline")
  return pipeline_op

with ops.device("/device:IPU:0"):
  compiled_model = ipu_compiler.compile(model, inputs=[])

outfeed_op = outfeed_queue.dequeue()

with tf.Session() as sess:
  result = sess.run(compiled_model)
  probabilities, classes = sess.run(outfeed_op)
In this set-up, the model is split across two IPUs. By default the first two layers would be executed on the first IPU and the third layer and the probabilities and classes on the second IPU, but here device_mapping is used to override the default IPU allocation so that the first two layers will be executed on the fourth IPU and the third layer and the probabilities and classes on the second IPU.
This creates a pipeline of depth 250 (specified by the gradient_accumulation_count), which means each pipeline stage is executed 250 times.
This pipeline is then executed 2 times (specified by the repeat_count). The results of the pipeline (probabilities and classes) are returned to the host by the outfeed queue.
optimizer_function
:from tensorflow import keras # Create the dataset #... # Create the data queues from/to IPU. infeed_queue = ipu_infeed_queue.IPUInfeedQueue(dataset) outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue() # Create a pipelined model which is split accross two stages. def stage1(lr, images, labels): partial = keras.layers.Dense(256, activation=tf.nn.relu)(images) partial = keras.layers.Dense(128, activation=tf.nn.relu)(partial) return lr, partial, labels def stage2(lr, partial, labels): logits = keras.layers.Dense(10)(partial) cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits( labels=labels, logits=logits) loss = tf.reduce_mean(cross_entropy) return lr, loss def optimizer_function(lr, loss): optimizer = tf.train.GradientDescentOptimizer(lr) return pipelining_ops.OptimizerFunctionOutput(optimizer, loss) def model(lr): with variable_scope.variable_scope("vs", use_resource=True): pipeline_op = pipelining_ops.pipeline( computational_stages=[stage1, stage2], gradient_accumulation_count=128, repeat_count=10, inputs=[lr], infeed_queue=infeed_queue, outfeed_queue=outfeed_queue, optimizer_function=optimizer_function, name="Pipeline") return pipeline_op with ops.device('cpu'): lr = tf.placeholder(np.float16, []) with ops.device("/device:IPU:0"): compiled_model = ipu_compiler.compile(model, inputs=[lr]) outfeed_op = outfeed_queue.dequeue() with tf.Session() as sess: result = sess.run(compiled_model, {lr: 0.01}) losses = sess.run(outfeed_op)
Here the tf.train.GradientDescentOptimizer generates the pipeline stages which calculate the gradients and apply them to the weights. Note how the loss is returned to the host by the outfeed queue.
If a model requires multiple computational pipeline stages to access the same tf.Variable, then all of these computational stages need to be placed on the same IPU using the device_mapping argument.
Note that modifying tf.Variable values in a pipeline stage and/or during the gradient calculation will result in undefined behavior. These variables can only be modified by the apply_gradients member function of the applied Optimizer.
Note that arguments marked with (EXPERIMENTAL) are under active development and might not provide representative performance.
- Parameters
computational_stages – a list of python functions, where each function represents a computational pipeline stage. The function takes the outputs of the previous pipeline stage as its inputs.
gradient_accumulation_count – the number of times each pipeline stage will be executed.
gradient_accumulation_dtype –
The data type used for the gradient accumulation buffer. One of:
None
: Use an accumulator of the same type as the variable type.A
DType
: Use this type for all the accumulators. For exampletf.float32
.A callable that takes the variable and returns a
DType
: Allows specifying the accumulator type on a per-variable basis.
The gradients passed to
Optimizer.apply_gradients
will have the dtype requested here. If that dtype is different from the variable dtype a cast is needed at some point to make them compatible. If you want to cast the gradients immediately, you can wrap your optimizer in theMapGradientOptimizer
with atf.cast
.gradient_accumulation_for_captured_grads – If
True
, any captured gradients are accumulated before being passed to the optimizer’sapply_gradients
method (via itscaptured_grads
keyword argument, if it exists). IfFalse
, the “raw”, unaccumulated gradients are passed instead.repeat_count – the number of times the pipeline will be executed.
batch_serialization_iterations – (EXPERIMENTAL) number of times a loop executes to compute a batch on each pipeline stage execution. Currently only supported with the
PipelineSchedule.Sequential
.inputs – arguments passed to the first pipeline stage.
infeed_queue – optional IPUInfeedQueue, if passed, it is dequeued and passed as an input in the first pipeline stage.
outfeed_queue – IPUOutfeedQueue, required if the last computational stage has any outputs. The outputs of these are enqueued to this queue and they can be accessed on the host.
optimizer_function – optional Python function which takes the output of the last computational stage as parameters and returns an instance of
pipelining_ops.OptimizerFunctionOutput
in order to generate the back-propagation and weight-update parts of the model suitable for training.device_mapping – If provided, a list of length equal to the number of computational stages. An element at index
i
in the list represents which IPU the computational stagecomputational_stages[i]
should reside on. This can be used to make sure computational stages which sharetf.Variable
are resident on the same IPU.pipeline_schedule – Which scheduling algorithm to use for pipeline lowering. Defaults to
PipelineSchedule.Grouped
.recomputation_mode – The recomputation mode to use for training pipeline models. Defaults to RecomputationMode.Auto. Only applies if recomputation is enabled. This must be done by setting the
tensorflow.python.ipu.config.IPUConfig.allow_recompute
attribute toTrue
when configuring the device.forward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grain control of the Poplar options for a given forward propagation computational stage.
backward_propagation_stages_poplar_options – If provided, a list of length equal to the number of computational stages. Each element is a PipelineStageOptions object which allows for fine grained control of the Poplar options for a given backward propagation computational stage.
weight_update_poplar_options – If provided, a PipelineStageOptions object which allows for fine grained control of the Poplar options for the weight update stage.
offload_weight_update_variables – When enabled, any
tf.Variable
which is only used by the weight update of the pipeline (for example the accumulator variable when using thetf.MomentumOptimizer
), will be stored in the remote memory. During the weight update this variable will be streamed onto the device and then streamed back to the remote memory after it has been updated. Requires the machine to be configured with support forPoplar remote buffers
. Offloading variables into remote memory can reduce maximum memory liveness, but can also increase the computation time of the weight update. When set toNone
the variables will be placed in either in-processor or remote memory automatically based on the current best placement strategy. Note that this option has no effect for inference only pipelines.replicated_optimizer_state_sharding – If True, any
tf.Variable
which is offloaded (for example the accumulator variable when using thetf.MomentumOptimizer
), will be partitioned across the replicas. This can exploit the additional bandwidth of the IPU-Links to improve overall throughput, however it might increase the code size and hence the model might need adjusting (for example the PopLibs optionavailableMemoryProportion
might need to be changed). Note that this option has no effect for inference only pipelines.offload_activations – When enabled, all the activations for the batches which are not being executed by the pipeline stages at the given time are stored in remote memory. Requires the machine to be configured with support for
Poplar remote buffers
. Offloading activations into remote memory can reduce maximum memory liveness, but can also increase the computation time as activations have to be copied from/to the device(s). When set toNone
, the activations might be offloaded when beneficial.offload_gradient_accumulation_buffers – (EXPERIMENTAL) When enabled, all the gradient accumulation buffers are stored in remote memory. Offloading gradient accumulation buffers into remote memory can reduce maximum memory liveness, but can also increase the computation time as the buffers have to be copied to the device, updated and the copied off the device. Requires the machine to be configured with support for
Poplar remote buffers
. When set toNone
, theoffload_gradient_accumulation_buffers
might be offloaded when beneficial. Note that this option has no effect for inference only pipelines.replicated_weight_sharding – (EXPERIMENTAL) When enabled and running a replicated model, any
tf.Variable
used by the pipeline stage computations (excluding those only used by the weight update), will be partitioned across the replicas. Whenever a partitioned tf.Variable is accessed, it will first be all-gathered across replicas to make sure each replica has access to the whole tf.Variable. This can exploit the additional bandwidth of the IPU-Links to improve overall throughput. When set to None, the variables might be partitioned when beneficial. This feature is enabled by default when the pipeline schedule is PipelineSchedule.Sequential
andbatch_serialization_iterations > 1
, where this option can reduce the memory usage at the cost of extra communication.offload_weights – (EXPERIMENTAL) When enabled and
replicated_weight_sharding
is enabled, anytf.Variable
which are partitioned across replicas will be stored inPoplar remote buffers
. Offloading variables into remote memory can further reduce maximum memory liveness, but can also increase the computation time due to extra communication. When set toNone
the variables will be placed in either in-processor or remote memory automatically based on the current best placement strategy.continuous_weight_updates – ** CURRENTLY UNIMPLEMENTED ** When training, this option will apply the gradients to the resource variables immediately, rather than accumulating the gradients and applying them at the end of each execution of the pipeline.
outfeed_loss – If True, the loss given by the
optimizer_function
will be enqueued on the outfeed, instead of the outputs from the last computational stage. Cannot be set whenoutfeed_mask
is set.accumulate_outfeed – Data (loss or outputs) is normally enqueued immediately after the last computational stage inside the pipeline. If this option is True, the data will instead be accumulated and only enqueued once at the end of pipeline execution. To use this option, the provided
outfeed_queue
must be in theIPUOutfeedMode
ALL mode (seeIPUOutfeedMode
).accumulate_outfeed_dtype –
The data type used for the outfeed accumulation buffers. One of:
None
: Use an accumulator of the same type as the variable type.A
DType
: Use this type for all the accumulators. For exampletf.float32
.A callable that takes the variable and returns a
DType
: Allows specifying the accumulator type on a per-variable basis.
outfeed_mask – If set, a list of booleans of same length as the same number of outputs from the last computational stage. If
outfeed_mask[i]
evaluates toFalse
, then the output at that index is enqueued to the outfeed queue, and if it is set toTrue
it is not enqueued. Cannot be set whenoutfeed_loss
is set. Can only be used whenoptimizer_function
has been set.reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation especially when using float16, but gives smaller gradients and might require adjusting the learning-rate accordingly. Defaults to
GradientAccumulationReductionMethod.SUM
(seeGradientAccumulationReductionMethod
).
name – name of this pipeline.
- Returns
An
Operation
that executes the pipeline.
- tensorflow.python.ipu.pipelining_ops.recomputation_checkpoint(tensors, name=None)
Operation for checkpointing values in a computational pipeline stage. When recomputation is enabled, these values will not be recomputed and they will be stored in memory instead.
This operation can reduce memory liveness peaks when using recomputation if there are too many activations which need to be recomputed before the backpropagation operations can be executed.
This operation should be used with the
RecomputationMode.RecomputeAndBackpropagateInterleaved
pipelining recomputation mode. Note that this operation has no effect when used with theRecomputationMode.RecomputeThenBackpropagate
pipelining recomputation mode.- Parameters
tensors – A tensor or a structure of tensors which should be checkpointed.
name – name of this operation.
- Returns
A tensor or a structure of tensors which matches shape and type of tensors.
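For example, a minimal sketch checkpointing an intermediate activation inside a pipeline stage (the layer sizes are illustrative):

import tensorflow as tf
from tensorflow import keras
from tensorflow.python import ipu

def stage1(x):
  x = keras.layers.Dense(256, activation=tf.nn.relu)(x)
  # Store this activation instead of recomputing it during the
  # backward pass (RecomputeAndBackpropagateInterleaved mode only).
  x = ipu.pipelining_ops.recomputation_checkpoint(x)
  return keras.layers.Dense(256, activation=tf.nn.relu)(x)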
- tensorflow.python.ipu.pipelining_ops.reduce(function, sequence[, initial]) → value
Apply a function of two arguments cumulatively to the items of a sequence, from left to right, so as to reduce the sequence to a single value. For example, reduce(lambda x, y: x+y, [1, 2, 3, 4, 5]) calculates ((((1+2)+3)+4)+5). If initial is present, it is placed before the items of the sequence in the calculation, and serves as a default when the sequence is empty.
22.14.8. Popnn primitive neural network operators
- tensorflow.python.ipu.nn_ops.ctc_beam_search_decoder(logits, logits_lengths, beam_width=100, top_paths=1, blank_index=-1, name=None)
Calculates and returns CTC (Connectionist Temporal Classification) predictions. This op is designed and optimized for the IPU and cannot be used with other systems.
# assuming batch_size = 1

# hyper-parameters
top_paths = 1
beam_width = 100

if mode == "predict":
    probs, lengths, predictions = ctc_beam_search_decoder(
        logits, logits_lengths, beam_width, top_paths)
    batch_index = 0  # as batch_size = 1, otherwise must iterate over the batch
    path_index = 0  # as top_paths = 1, otherwise argmin(probs[batch_index])
    vocab_predictions = [tokens[predictions[batch_index][path_index][l]]
                         for l in range(lengths[batch_index][path_index])]
    predicted_prob_of_correct_prediction = probs[batch_index][path_index]
    return vocab_predictions, predicted_prob_of_correct_prediction
Note: The TensorFlow op tf.nn.ctc_beam_search_decoder is not compatible with the IPU. This version also returns the predicted label lengths, in addition to the probabilities and decoded labels. Instead of returning a lengths tensor, the upstream version returns a list of dynamically sized tensors.
- Parameters
logits – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of logits.
logits_lengths – A tensor of shape [batch_size] containing the number of valid timesteps in each logits batch entry.

beam_width – The beam width to be passed to the beam search algorithm.
top_paths – The number of paths to keep track of in the beam search algorithm. This must be less than or equal to beam_width.

blank_index – The class index to use for the blank label.
name – A name for this op. Defaults to “ctc_beam_search”.
- Returns
A tensor of shape [batch_size, top_paths] containing the negative log probabilities of the top_paths most likely labels.

A tensor of shape [batch_size, top_paths] containing the length of the top_paths most likely labels.

A tensor of shape [batch_size, top_paths, max_time] containing the decoded top_paths most likely labels.
- tensorflow.python.ipu.nn_ops.ctc_beam_search_decoder_with_log_probs(log_probs, input_lengths, beam_width=100, top_paths=1, blank_index=-1, name=None)
Calculates and returns CTC (Connectionist Temporal Classification) predictions. This op is designed and optimized for the IPU and cannot be used with other systems. It is identical to the ctc_beam_search_decoder() operation except that it takes negative log probabilities instead of logits for the data input.

Note: The TensorFlow op tf.nn.ctc_beam_search_decoder is not compatible with the IPU. This version also returns the predicted label lengths, in addition to the probabilities and decoded labels.
- Parameters
log_probs – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of log probabilities.
input_lengths – A tensor of shape [batch_size] containing the number of valid timesteps in each log_probs batch entry.

beam_width – The beam width to be passed to the beam search algorithm.
top_paths – The number of paths to keep track of in the beam search algorithm. This must be less than or equal to beam_width.

blank_index – The class index to use for the blank label.
name – A name for this op. Defaults to “ctc_beam_search”.
- Returns
A tensor of shape [batch_size, top_paths] containing the negative log probabilities of the top_paths most likely labels.

A tensor of shape [batch_size, top_paths] containing the length of the top_paths most likely labels.

A tensor of shape [batch_size, top_paths, max_time] containing the decoded top_paths most likely labels.
- tensorflow.python.ipu.nn_ops.ctc_loss_v2(labels, logits, label_length, logit_length, blank_index, out_dtype=None, name=None)
Calculates and returns CTC (Connectionist Temporal Classification) loss. This op is designed and optimized for the IPU and cannot be used with other systems.
Note: The TensorFlow op tf.nn.ctc_loss is not compatible with the IPU.
- Parameters
labels – The labels input [batch_size, max_label_length] tensor.
logits – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of logits.
label_length – A tensor of shape [batch_size] containing the number of labels in each labels batch entry.

logit_length – A tensor of shape [batch_size] containing the number of timesteps in each logits batch entry.

blank_index – The class index to use for the blank label.
out_dtype – The dtype of the loss tensor (float16 or float32). Cannot be float16 if the dtype of logits is float32. Default: the same dtype as logits.

name – A name for this op. Defaults to “ctc_loss”.
- Returns
A loss tensor of shape [batch_size].
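For orientation, the loss might be computed as in the following sketch. The shapes, dtypes and blank index are illustrative assumptions, and the op must run on an IPU device, here via the IPUStrategyV1:

import tensorflow as tf
from tensorflow.python import ipu

max_time, batch_size, num_classes, max_label_len = 50, 4, 32, 20

strategy = ipu.ipu_strategy.IPUStrategyV1()
with strategy.scope():
    @tf.function
    def compute_loss(logits, labels):
        label_length = tf.fill([batch_size], max_label_len)
        logit_length = tf.fill([batch_size], max_time)
        # Returns a [batch_size] loss tensor; blank_index 0 is an assumption.
        return ipu.nn_ops.ctc_loss_v2(labels, logits, label_length,
                                      logit_length, blank_index=0)

    logits = tf.random.normal([max_time, batch_size, num_classes])
    labels = tf.random.uniform([batch_size, max_label_len],
                               minval=1, maxval=num_classes, dtype=tf.int32)
    loss = strategy.run(compute_loss, [logits, labels])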
- tensorflow.python.ipu.nn_ops.ctc_loss_with_log_probs(labels, data, label_length, data_length, blank_index, out_dtype=None, name=None)
Calculates and returns CTC (Connectionist Temporal Classification) loss. This op is designed and optimized for the IPU and cannot be used with other systems. It is identical to the ctc_loss_v2() operation except that it takes negative log probabilities instead of logits for the data input.

Note: The TensorFlow op tf.nn.ctc_loss is not compatible with the IPU.
- Parameters
labels – The labels input [batch_size, max_label_length] tensor.
data – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of log probabilities.
label_length – A tensor of shape [batch_size] containing the number of labels in each labels batch entry.

data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.

blank_index – The class index to use for the blank label.

out_dtype – The dtype of the loss tensor. Cannot be float16 if the dtype of data is float32. Default: the same dtype as data.

name – A name for this op. Defaults to “ctc_loss”.
- Returns
A loss tensor of shape [batch_size].
- tensorflow.python.ipu.nn_ops.gelu(x, approximate=True, name=None)
This targets the PopLibs Popnn gelu operation, optimised for execution on the IPU.
- Parameters
x – The input tensor.
approximate – Use the tanh()-based approximation if True, otherwise use erf().
name – Optional op name.
- Returns
A Tensor. Has the same type as the input tensor.
- tensorflow.python.ipu.nn_ops.hard_sigmoid(x, name=None)
IPU implementation of the hard sigmoid activation function.
- Parameters
x – The input tensor.

name – Optional op name.
- Returns
A Tensor. Has the same type as the input tensor.
- tensorflow.python.ipu.nn_ops.multi_conv(func=None, options=None)
A function decorator for generating multi-convolution operations. Multi-convolutions allow for a set of data-independent convolutions to be executed in parallel. Executing convolutions in parallel can lead to an increase in the data throughput.
The multi_conv function decorator is a convenient way to generate multi-convolutions: it detects all the convolution operations inside the decorated function and executes them in parallel.

For example:
from tensorflow import keras
from tensorflow.python import ipu

@ipu.nn_ops.multi_conv
def convs(x, y, z):
    x = keras.layers.DepthwiseConv2D(8, 2, depth_multiplier=2)(x)
    y = keras.layers.DepthwiseConv2D(16, 4, depth_multiplier=2)(y)
    z = keras.layers.Conv2D(8, 3)(z)
    return x, y, z
This will detect and execute the three convolutions x, y and z in parallel. Note that any operations which are not convolutions, such as bias add operations, will be executed in the same way as if they were not inside of a multi_conv decorated function.

It is also possible to set PopLibs multi-convolution options using this decorator. For example:
from tensorflow import keras
from tensorflow.python import ipu

@ipu.nn_ops.multi_conv(options={"perConvReservedTiles": "50"})
def convs(x, y, z):
    x = keras.layers.DepthwiseConv2D(8, 2, depth_multiplier=2)(x)
    y = keras.layers.DepthwiseConv2D(16, 4, depth_multiplier=2)(y)
    z = keras.layers.Conv2D(8, 3)(z)
    return x, y, z
See the PopLibs documentation for the list of all available flags. Note that these options will also be applied to the gradient operations generated during backpropagation.
- Parameters
func – A Python function which takes a list of positional arguments only. All the arguments must be tf.Tensor-like objects, or be convertible to them. The function provided must return at least one tf.Tensor-like object.

options – A dictionary of Poplar option flags for multi-convolution. See the multi-convolution PopLibs documentation for available flags.
- tensorflow.python.ipu.nn_ops.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, sampled_values=None, name='nce_loss')
Computes and returns the noise-contrastive estimation training loss.
This is a version of the nce_loss function in tensorflow/python/ops/nn_impl.py which targets the IPU-optimized embedding lookup.
See Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Also see the TensorFlow Candidate Sampling Algorithms Reference.
A common use case is to use this method for training, and calculate the full sigmoid loss for evaluation or inference, as in the following example:
if mode == "train": loss = tf.nn.nce_loss( weights=weights, biases=biases, labels=labels, inputs=inputs, ...) elif mode == "eval": logits = tf.matmul(inputs, tf.transpose(weights)) logits = tf.nn.bias_add(logits, biases) labels_one_hot = tf.one_hot(labels, n_classes) loss = tf.nn.sigmoid_cross_entropy_with_logits( labels=labels_one_hot, logits=logits) loss = tf.reduce_sum(loss, axis=1)
Note: By default this uses a log-uniform (Zipfian) distribution for sampling, so your labels must be sorted in order of decreasing frequency to achieve good results. For more details, see tf.random.log_uniform_candidate_sampler.

Note: In the case where num_true > 1, we assign to each target class the target probability 1 / num_true so that the target probabilities sum to 1 per example.

Note: It would be useful to allow a variable number of target classes per example. TensorFlow hopes to provide this functionality in a future release. For now, if you have a variable number of target classes, you can pad them out to a constant number by either repeating them or by padding with an otherwise unused class.
- Parameters
weights – A Tensor of shape [num_classes, dim], or a list of Tensor objects whose concatenation along dimension 0 has shape [num_classes, dim]. The (possibly-partitioned) class embeddings.

biases – A Tensor of shape [num_classes]. The class biases.

labels – A Tensor of type int64 and shape [batch_size, num_true]. The target classes.

inputs – A Tensor of shape [batch_size, dim]. The forward activations of the input network.

num_sampled – An int. The number of negative classes to randomly sample per batch. This single sample of negative classes is evaluated for each element in the batch.

num_classes – An int. The number of possible classes.

num_true – An int. The number of target classes per training example.

sampled_values – A tuple of (sampled_candidates, true_expected_count, sampled_expected_count) returned by a *_candidate_sampler function. (If None, we default to log_uniform_candidate_sampler.)

name – A name for the operation (optional).
- Returns
A batch_size 1-D tensor of per-example NCE losses.
- tensorflow.python.ipu.nn_ops.sampled_softmax_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, sampled_values=None, name='sampled_softmax_loss', seed=None)
Computes and returns the sampled softmax training loss.
This is a version of the sampled_softmax_loss function in tensorflow/python/ops/nn_impl.py which targets the IPU-optimized embedding lookup.
This is a faster way to train a softmax classifier over a huge number of classes.
This operation is for training only. It is generally an underestimate of the full softmax loss.
A common use case is to use this method for training, and calculate the full softmax loss for evaluation or inference, as in the following example:
if mode == "train": loss = tf.nn.sampled_softmax_loss( weights=weights, biases=biases, labels=labels, inputs=inputs, ...) elif mode == "eval": logits = tf.matmul(inputs, tf.transpose(weights)) logits = tf.nn.bias_add(logits, biases) labels_one_hot = tf.one_hot(labels, n_classes) loss = tf.nn.softmax_cross_entropy_with_logits( labels=labels_one_hot, logits=logits)
See the TensorFlow Candidate Sampling Algorithms Reference
Also see Section 3 of Jean et al., 2014 (pdf) for the maths.
- Parameters
weights – A Tensor of shape [num_classes, dim], or a list of Tensor objects whose concatenation along dimension 0 has shape [num_classes, dim]. The (possibly-sharded) class embeddings.

biases – A Tensor of shape [num_classes]. The class biases.

labels – A Tensor of type int64 and shape [batch_size, num_true]. The target classes. Note that this format differs from the labels argument of nn.softmax_cross_entropy_with_logits.

inputs – A Tensor of shape [batch_size, dim]. The forward activations of the input network.

num_sampled – An int. The number of classes to randomly sample per batch.

num_classes – An int. The number of possible classes.

num_true – An int. The number of target classes per training example.

sampled_values – A tuple of (sampled_candidates, true_expected_count, sampled_expected_count) returned by a *_candidate_sampler function. (If None, we default to log_uniform_candidate_sampler.)

name – A name for the operation (optional).
seed – Random seed for candidate sampling. Defaults to None, which doesn’t set the op-level random seed for candidate sampling.
- Returns
A batch_size 1-D tensor of per-example sampled softmax losses.
- tensorflow.python.ipu.nn_ops.softmax(x, stable=False, name=None)
IPU implementation of the softmax activation function.
- Parameters
x – The input tensor.
stable – A boolean to decide whether to use the stable softmax implementation. Defaults to False.

name – Optional op name.
- tensorflow.python.ipu.nn_ops.swish(x, name=None)
IPU implementation of the swish activation function.
- Parameters
x – The input tensor.

name – Optional op name.
- Returns
A Tensor. Has the same type as the input tensor.
22.14.9. Popnn normalization operators
- tensorflow.python.ipu.normalization_ops.group_norm(inputs, groups=2, channels_axis=-1, center=True, scale=True, epsilon=1.53e-05, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None, strided_channel_grouping=True)
Functional interface for the group normalization layer.
Reference: https://arxiv.org/abs/1803.08494.
“Group Normalization”, Yuxin Wu, Kaiming He
- Parameters
inputs – A Tensor with at least 2 dimensions, one of which is channels. All shape dimensions must be fully defined.
groups – Integer. Divide the channels into this number of groups over which normalization statistics are computed. This number must be commensurate with the number of channels in
inputs
channels_axis – An integer. Specifies the index of the channels axis, which will be broken into groups, across each of which statistics will be computed. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.

center – If True, add an offset of beta to the normalized tensor. If False, beta is ignored.

scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

epsilon – Small float added to variance to avoid dividing by zero.
param_initializers – Optional initializers for beta and gamma.
reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.
variables_collections – Optional collections for the variables.
training – Whether this operation is being used in a training network.
trainable – If True, also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

scope – Optional scope for variable_scope.

strided_channel_grouping – Selects whether to group the channels dimension for group normalisation with a stride between channels. Enabling this makes the PopLibs implementation more efficient but is unconventional. Among other things, this means that pre-trained weights cannot be used unless they were produced with this unconventional implementation.
- Returns
A Tensor representing the output of the operation.
- Raises
ValueError – If the rank of inputs is undefined.

ValueError – If the rank or the channels dimension of inputs is undefined.

ValueError – If the channels dimension is not 1 or 3.

ValueError – If the number of groups is not commensurate with the number of channels.
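A minimal usage sketch for group_norm; the input shape and group count are illustrative, and the op should run inside an IPU scope:

import tensorflow as tf
from tensorflow.python import ipu

x = tf.random.normal([8, 32, 32, 16])  # NHWC with 16 channels
# Normalize over 4 groups of 4 channels each; channels_axis=-1 matches NHWC.
y = ipu.normalization_ops.group_norm(x, groups=4, channels_axis=-1)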
- tensorflow.python.ipu.normalization_ops.instance_norm(inputs, channels_axis=-1, center=True, scale=True, epsilon=1.53e-05, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)
Functional interface for the instance normalization layer.
Reference: https://arxiv.org/abs/1607.08022.
“Instance Normalization: The Missing Ingredient for Fast Stylization” Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky
Instance normalization will generate normalization statistics across the spatial (X,Y,…) dimensions. Each slice along the feature channels dimension (C) is normalized independently. It is equivalent to a group normalization where the number of groups is the same as the size of the feature channels dimension.
- Parameters
inputs – A Tensor with at least 2 dimensions, one of which is channels. All shape dimensions must be fully defined.
channels_axis – An integer. Specifies index of channels axis. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.
center – If True, add an offset of beta to the normalized tensor. If False, beta is ignored.

scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

epsilon – Small float added to variance to avoid dividing by zero.
param_initializers – Optional initializers for beta and gamma.
reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.
variables_collections – Optional collections for the variables.
training – Whether this operation is being used in a training network.
trainable – If True, also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

scope – Optional scope for variable_scope.
- Returns
A Tensor representing the output of the operation.
- Raises
ValueError – If data_format is neither NHWC nor NCHW.

ValueError – If the rank of inputs is undefined.

ValueError – If the rank or the channels dimension of inputs is undefined.
- tensorflow.python.ipu.normalization_ops.layer_norm(inputs, channels_axis=-1, center=True, scale=True, epsilon=1.53e-05, param_initializers=None, reuse=None, variables_collections=None, training=True, trainable=True, scope=None)
Adds a Layer Normalization layer.
Based on the paper:
“Layer Normalization”
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
Layer normalization will generate normalization statistics across the spatial (X,Y,…) dimensions and the feature channels dimension (C). It is equivalent to a group normalization where all of the features in the feature channels dimension are put into a single group.
The shapes of beta and gamma are inputs.shape[begin_params_axis:], and this part of the inputs’ shape must be fully defined.
- Parameters
inputs – A Tensor with at least 2 dimensions, one of which is channels. All shape dimensions must be fully defined.
channels_axis – An integer. Specifies index of channels axis. Preferred usage is to specify negative integers to be agnostic as to whether a batch dimension is included.
center – If True, add an offset of beta to the normalized tensor. If False, beta is ignored.

scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling can be done by the next layer.

epsilon – Small float added to variance to avoid dividing by zero.
param_initializers – Optional initializers for beta and gamma.
reuse – Whether or not the layer and its variables should be reused. To be able to reuse the layer scope must be given.
variables_collections – Optional collections for the variables.
training – Whether this operation is being used in a training network.
trainable – If True, also add variables to the graph collection GraphKeys.TRAINABLE_VARIABLES (see tf.Variable).

scope – Optional scope for variable_scope.
- Returns
A Tensor representing the output of the operation, having the same shape and dtype as inputs.
- Raises
ValueError – If the rank of inputs is not known at graph build time, or if inputs.shape[begin_params_axis:] is not fully defined at graph build time.
22.14.10. Popops all to all and all gather operators
- tensorflow.python.ipu.all_to_all_op.all_gather(x, replication_factor, name=None)
Gather the data on all replicas to all other replicas. Each replica will have the exact same output.
- Parameters
x – The tensor or list of tensors to gather.

replication_factor – The number of replicas in each collective group. If less than the total number of replicas in the model, the replicas are divided into consecutive groups of the given size, and the collective operation is performed within each respective group. If there are N total replicas denoted {0, ... N-1} and replication_factor is k, then the groups are: {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}. Note that N must be evenly divisible by k, otherwise an exception will be thrown during compilation.

name – Optional op name.
- Returns
A tensor or list of tensors of shape [replication_factor][x] with each replica in the same group having the same tensor.
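As a sketch, inside a replicated program each replica contributes its local copy of a tensor and receives every replica's copy. The replication factor of 2 is an assumption, and the surrounding replicated-graph setup is omitted:

from tensorflow.python import ipu

def replicated_fn(x):
    # With 2 replicas, `gathered` has shape [2] + x.shape and is
    # identical on both replicas.
    gathered = ipu.all_to_all_op.all_gather(x, replication_factor=2)
    return gathered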
- tensorflow.python.ipu.all_to_all_op.all_to_all(x, split_dimension, concat_dimension, replication_factor, name=None)
Perform an XLA all to all operation across all replicas. (See https://www.tensorflow.org/xla/operation_semantics#alltoall)
- Parameters
x – The input tensor.

split_dimension – A value in the interval [0, n) that names the dimension along which the operand is split.
concat_dimension – A value in the interval [0,n) that names the dimension along which the split blocks are concatenated.
replication_factor – The replication factor of the model.
name – Optional op name.
- Returns
A tensor of the same size where each replica will have a different value.
22.14.11. Popops cross replica operators
- tensorflow.python.ipu.cross_replica_ops.assume_equal_across_replicas(tensors, inplace=False)
Mark the given tensors as equal across replicas to try and prevent divergent control flow compilation errors.
Divergent control flow describes the situation where program flow differs among replicas. This happens when the value of a conditional is not the same across all replicas. This is a problem if the conditional body requires a cross-replica sync, as only some replicas will reach it. If this happens, the execution will hang as the operation waits for all replicas to sync.
To warn the user about this, Poplar checks for divergent control flow during compilation. However since the values of tensors are unknown at compilation time it can’t be certain whether a tensor will lead to divergent control flow or not.
assume_equal_across_replicas can be used to mark tensors which are equal across all replicas, and in doing so prevents them from causing divergence errors if used in a conditional.
- Parameters
tensors – A tensor or a structure of tensors which will be marked as equal across replicas. Note that undefined behaviour will occur if these tensors are in fact not equal across replicas.
inplace – A bool controlling whether the given tensor(s) are copied or operated on in place. This is needed when using assume_equal_across_replicas with tensor slices.
- Returns
A tensor or a structure of tensors which matches the shape and type of the tensors argument. This should be used in place of the argument to prevent divergent control flow errors.
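The following sketch shows the intended pattern: a conditional predicate that is known (by the user) to be replica-identical is marked before use, so Poplar does not flag divergent control flow. The computation itself is illustrative:

import tensorflow as tf
from tensorflow.python import ipu

def replicated_fn(x):
    # `pred` is computed identically on every replica, but Poplar cannot
    # prove that, so mark it explicitly before using it in tf.cond.
    pred = ipu.cross_replica_ops.assume_equal_across_replicas(
        tf.reduce_sum(x) > 0.0)
    return tf.cond(pred, lambda: x + 1.0, lambda: x - 1.0)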
- tensorflow.python.ipu.cross_replica_ops.cross_replica_mean(x, replica_group_size=None, name=None)
Computes the mean of the input tensor across replicas.
- Parameters
x – The local tensor to contribute to the mean.
replica_group_size – The number of replicas in each collective group. If None, there is a single group containing all the replicas. If a number less than the total number of replicas in the model is provided, the replicas are divided into consecutive groups of the given size, and the collective operation is performed within each respective group. Given N total replicas denoted {0, ... N-1} and a replica_group_size of k, the groups are: {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}. Note that N must be evenly divisible by k, otherwise an exception will be thrown during compilation.

name – Optional op name.
- Returns
A Tensor which is averaged across the replicas in the same group.
- tensorflow.python.ipu.cross_replica_ops.cross_replica_sum(x, replica_group_size=None, name=None)
Sum the input tensor across replicas.
- Parameters
x – The local tensor to contribute to the sum.
replica_group_size – The number of replicas in each collective group. If None, there is a single group containing all the replicas. If a number less than the total number of replicas in the model is provided, the replicas are divided into consecutive groups of the given size, and the collective operation is performed within each respective group. Given N total replicas denoted {0, ... N-1} and a replica_group_size of k, the groups are: {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}. Note that N must be evenly divisible by k, otherwise an exception will be thrown during compilation.

name – Optional op name.
- Returns
A Tensor which is summed across the replicas in the same group.
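For example, with four replicas a grouped sum over pairs of replicas can be sketched as follows; the group size and the surrounding replicated execution are assumptions:

from tensorflow.python import ipu

def replicated_fn(grads):
    # Replicas {0, 1} and {2, 3} each reduce within their own group.
    return ipu.cross_replica_ops.cross_replica_sum(grads,
                                                   replica_group_size=2)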
22.14.12. Popops embedding operators
- class tensorflow.python.ipu.embedding_ops.HostEmbedding(name, embedding_tensor, partition_strategy='TOKEN', optimizer_spec=None)
Host Embedding wrapper.
HostEmbedding encapsulates the embedding tensor and the additional meta-data required to coordinate the host embedding and the device lookup. Through an instance of this class, an IPU can perform lookups on an embedding that resides on the host.
It is assumed that the given embedding will be rank two where the outermost dimension (dimension zero) is the token dimension, and the innermost dimension is the encoding dimension.
- __init__(name, embedding_tensor, partition_strategy='TOKEN', optimizer_spec=None)
Create a HostEmbedding.
- Parameters
name – The name which uniquely identifies the embedding.
embedding_tensor – The tensor which holds the embedding.
partition_strategy – The axis on which the embedding will be split when it is distributed across replicas. Options are “TOKEN” or “ENCODING”.

optimizer_spec – A description of how the embedding will be optimized. When None, the embedding is assumed to not be trainable.
- get_embedding_tensor()
Retrieve the CPU bound embedding tensor.
- Returns
The TF CPU tensor for the embedding.
- lookup(indices, clip_indices=True)
Perform a host embedding lookup on an IPU.
- Parameters
indices – The indices to lookup.
clip_indices – Whether to enforce a valid range on the lookup indices with clipping. When False, out-of-range values have undefined behaviour.
- Returns
A Tensor containing the elements requested by the user indices.
- register(session=None)
Creates a host embedding context manager bound to the given session.
- Parameters
session – The session to register the embedding to.
- Returns
A Python context manager object. This object manages the lifetime of the host embedding connection to the IPU.
- class tensorflow.python.ipu.embedding_ops.HostEmbeddingOptimizerSpec(learning_rate, optimizer_name=None)
Description of the Host Embedding optimizer.
Despite the embedding living on the host, we want to compute the gradients on the device. Additionally, the communication channel between the device and host is opaque to TensorFlow. For these reasons we need to describe the optimizer parameters separately.
Currently only supports SGD.
- __init__(learning_rate, optimizer_name=None)
Create a HostEmbeddingOptimizerSpec.
- Parameters
learning_rate – The SGD learning rate.
- create_deregister_instruction(embedding_tensor, slot_vars, name)
Create a deregister instruction.
This will be called when exiting the HostEmbedding context manager.
- Parameters
embedding_tensor – The TF embedding tensor bound to the CPU.
slot_vars – Any created slot variables.
name – The name of the host embedding.
- Returns
The deregister instruction.
- create_lookup_instruction(embedding_tensor, indices, slot_vars, partition_strategy, name)
Create a lookup instruction.
This will be called from the HostEmbedding wrapper class.
- Parameters
embedding_tensor – The TF embedding tensor bound to the CPU.
indices – The TF indices tensor bound to the IPU.
slot_vars – Any created slot variables.
partition_strategy – The user selected partition strategy.
name – The name of the host embedding.
- Returns
The result of the embedding lookup in an IPU tensor.
- create_register_instruction(embedding_tensor, slot_vars, name)
Create a register instruction.
This will be called when entering the HostEmbedding context manager.
- Parameters
embedding_tensor – The TF embedding tensor bound to the CPU.
slot_vars – Any created slot variables.
name – The name of the host embedding.
- Returns
The register instruction.
- create_slot_variables(embedding_tensor, name)
Create any required slot variables for this optimiser.
This will be called when exiting the HostEmbedding context manager.
- Parameters
embedding_tensor – The TF embedding tensor bound to the CPU.
name – The name of the host embedding.
- Returns
A list of TF tensors bound to the CPU.
- get_learning_rate()
Get the optimizer learning rate.
- Returns
The learning rate.
- class tensorflow.python.ipu.embedding_ops.HostEmbeddingSGDGAOptimizerSpec(learning_rate, accumulation_factor)
Description of the Host Embedding optimizer that uses SGD and gradient accumulation.
- __init__(learning_rate, accumulation_factor)
Create a HostEmbeddingSGDGAOptimizerSpec.
- Parameters
learning_rate – The SGD learning rate.
accumulation_factor – The gradient accumulation factor (number of mini-batches the gradients will be accumulated for).
- get_accumulation_factor()
Get the optimizer gradient accumulation factor.
- Returns
The gradient accumulation factor.
- tensorflow.python.ipu.embedding_ops.create_host_embedding(name, shape, dtype, partition_strategy='TOKEN', optimizer_spec=None, initializer=None)
Create a HostEmbedding.
- Parameters
name – The name which uniquely identifies the embedding.
shape – The shape for the tensor which will hold the embedding.
dtype – The dtype for the tensor which will hold the embedding.
partition_strategy – When the IPU system is configured with an IPUConfig instance that has its enable_remote_buffer_embedding option set to True and uses replication, the embedding must be distributed across the replicas. This option specifies on which axis the embedding will be split. Options are “TOKEN” or “ENCODING”.

optimizer_spec – A description of how the embedding will be optimized. When None, the embedding is assumed to not be trainable.

initializer – The initializer to use when creating the embedding tensor.
- Returns
A HostEmbedding object that wraps the created embedding tensor.
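A rough usage sketch, assuming a TF1-style session workflow; the embedding name, shape and learning rate are illustrative:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

host_embedding = ipu.embedding_ops.create_host_embedding(
    "vocab_embedding", shape=[10000, 64], dtype=tf.float32,
    optimizer_spec=ipu.embedding_ops.HostEmbeddingOptimizerSpec(0.01))

# Inside the IPU graph, look up rows of the host-resident embedding:
#     activations = host_embedding.lookup(indices)

# At runtime, keep the host embedding registered while running:
#     with tf.Session() as sess, host_embedding.register(sess):
#         sess.run(...)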
- tensorflow.python.ipu.embedding_ops.embedding_lookup(params, ids, serialization_factor=1, indices_are_sorted=False, name=None)
Looks up ids in a list of embedding tensors.

This is designed to be a drop-in replacement for the typical use cases of tf.nn.embedding_lookup for the IPU.
- Parameters
params – A single tensor representing the complete embedding tensor.
ids – A Tensor with type int32 containing the slices to be extracted from params.

serialization_factor – If greater than 1, the embedding lookup will be broken up into serialization_factor smaller lookups, serialized along the 0th dimension. This option should not be used unless params is used by another operation, such as matrix multiplication. If params has multiple users, then serialization can reduce the maximum memory at the cost of extra computation.

indices_are_sorted – An optional bool. Defaults to False. Allows Poplar to optimise for the case when the indices to look up are in order.

name – A name for the operation.
- Returns
A Tensor with the same type as the tensors in params.
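For example (the table size and serialization factor are illustrative):

import tensorflow as tf
from tensorflow.python import ipu

params = tf.random.normal([10000, 64])           # the embedding table
ids = tf.constant([1, 5, 42], dtype=tf.int32)    # rows to fetch
# Split the lookup into 2 serialized slices to reduce peak memory.
vectors = ipu.embedding_ops.embedding_lookup(params, ids,
                                             serialization_factor=2)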
22.14.13. F8 operations
- class tensorflow.python.ipu.f8_ops.Format(value)
Format describes the bit layout of the F8 type.
- F143
1 sign bit, 4 bits of significand and 3 bits of exponent.
- F152
1 sign bit, 5 bits of significand and 2 bits of exponent.
- class tensorflow.python.ipu.f8_ops.IntEnum(value)
Enum where members are also (and must be) ints
- class tensorflow.python.ipu.f8_ops.QuarterTensor(data, metadata)
Represents a tensor with data type fp8.
- assign(new_values, **kwargs)
Assigns new values to the tensor.
- Parameters
new_values – An array of the form [data, metadata], such as the output of QuarterTensor.numpy.

- numpy()
Returns a numpy representation of the tensor.
- tensorflow.python.ipu.f8_ops.cast(x, dtype, name=None)
Casts a tensor to a new type.
The operation casts x (in case of Tensor) or x.values (in case of SparseTensor or IndexedSlices) to dtype.

For example:
>>> x = tf.constant([1.8, 2.2], dtype=tf.float32)
>>> tf.cast(x, tf.int32)
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>
Notice tf.cast has an alias tf.dtypes.cast:

>>> x = tf.constant([1.8, 2.2], dtype=tf.float32)
>>> tf.dtypes.cast(x, tf.int32)
<tf.Tensor: shape=(2,), dtype=int32, numpy=array([1, 2], dtype=int32)>
The operation supports data types (for x and dtype) of uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float32, float64, complex64, complex128, bfloat16. In case of casting from complex types (complex64, complex128) to real types, only the real part of x is returned. In case of casting from real types to complex types (complex64, complex128), the imaginary part of the returned value is set to 0. The handling of complex types here matches the behavior of numpy.

Note casting nan and inf values to integral types has undefined behavior.
- Parameters
x – A Tensor or SparseTensor or IndexedSlices of numeric type. It could be uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float32, float64, complex64, complex128, bfloat16.

dtype – The destination type. The list of supported dtypes is the same as x.

name – A name for the operation (optional).
- Returns
A Tensor or SparseTensor or IndexedSlices with the same shape as x and the same type as dtype.
- Raises
TypeError – If x cannot be cast to the dtype.
- tensorflow.python.ipu.f8_ops.convert_from_f8(packed_input, dtype=tf.float16, name=None)
Converts a packed f8 representation to a tensor of type dtype.
- Parameters
packed_input – result of convert_to_f8 or any other f8 op.
dtype – The output tensor type. Defaults to half (float16) because it is hardware accelerated and does not require an extra cast.
name – Optional op name.
- Returns
A tensor of type dtype with the unpacked f8 values.
- tensorflow.python.ipu.f8_ops.convert_to_f8(values, metadata, name=None)
Converts the given values to an f8 representation.
- Parameters
values – Any tensor of any type.

metadata – Metadata created by create_metadata.
name – Optional op name.
- Returns
(output, metadata) tuple of uint8 output and metadata.
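A round-trip sketch, assuming the create_metadata helper referenced above accepts a Format and returns metadata with a default scale, and that the (output, metadata) tuple from convert_to_f8 can be passed back to convert_from_f8:

import tensorflow as tf
from tensorflow.python.ipu import f8_ops

x = tf.constant([0.5, 1.0, 2.0], dtype=tf.float16)
meta = f8_ops.create_metadata(f8_ops.Format.F143)  # assumed helper usage
packed, meta = f8_ops.convert_to_f8(x, meta)       # uint8 data + metadata
y = f8_ops.convert_from_f8((packed, meta), dtype=tf.float16)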
- tensorflow.python.ipu.f8_ops.f8_conv_1d(inputs, filters, strides, padding, data_format='NWC', dilations=[1], name='f8_conv_1d')
Performs a 1D convolution on the two input tensors, supporting element type fp8.
- Parameters
inputs – A Tensor or QuarterTensor of rank at least 3.

filters – A Tensor or QuarterTensor of rank at least 3.

strides – An int or list of ints that has length 1 or 3. The number of entries by which the filter is moved right at each step.

padding – ‘SAME’ or ‘VALID’.
data_format – An optional string from "NWC", "NCW". The data is stored in the order of batch_shape + [in_width, in_channels]. The "NCW" format stores data as batch_shape + [in_channels, in_width]. Defaults to "NWC".

dilations – An int or list of ints that has length 1 or 3, which defaults to 1. The dilation factor for each dimension of input. If set to k > 1, there will be k-1 skipped cells between each filter element on that dimension. Dilations in the batch and depth dimensions must be 1.

name – A name for the operation (optional). Defaults to ‘f8_conv_1d’.
- Returns
A Tensor of type float16.
- tensorflow.python.ipu.f8_ops.f8_conv_2d(inputs, filters, strides, padding, data_format='NHWC', dilations=[1, 1, 1, 1], name='f8_conv_2d')
Performs a 2D convolution on the two input tensors, supporting element type fp8.
- Parameters
inputs – A Tensor or QuarterTensor of rank at least 4. The dimension order is interpreted according to the value of data_format, with the all-but-inner-3 dimensions acting as batch dimensions. See below for details.

filters – A Tensor or QuarterTensor. A 4-D tensor of shape [filter_height, filter_width, in_channels, out_channels].

strides – An int or list of ints that has length 1, 2 or 4. The stride of the sliding window for each dimension of input. If a single value is given, it is replicated in the H and W dimensions. By default the N and C dimensions are set to 1. The dimension order is determined by the value of data_format; see below for details.

padding – Either the string "SAME" or "VALID" indicating the type of padding algorithm to use, or a list indicating the explicit paddings at the start and end of each dimension. See https://www.tensorflow.org/api_docs/python/tf/nn#notes_on_padding_2 for more information. When explicit padding is used and data_format is "NHWC", this should be in the form [[0, 0], [pad_top, pad_bottom], [pad_left, pad_right], [0, 0]]. When explicit padding is used and data_format is "NCHW", this should be in the form [[0, 0], [0, 0], [pad_top, pad_bottom], [pad_left, pad_right]].

data_format – An optional string from: "NHWC", "NCHW". Specify the data format of the input and output data. With the default format "NHWC", the data is stored in the order of batch_shape + [height, width, channels]. Alternatively, the format could be "NCHW", with the data storage order of batch_shape + [channels, height, width]. Defaults to "NHWC".

dilations – An int or list of ints that has length 1, 2 or 4. The dilation factor for each dimension of input. If a single value is given, it is replicated in the H and W dimensions. By default the N and C dimensions are set to 1. If set to k > 1, there will be k-1 skipped cells between each filter element on that dimension. The dimension order is determined by the value of data_format; see above for details. Dilations in the batch and depth dimensions of a 4-D tensor must be 1. Defaults to [1, 1, 1, 1].

name – A name for the operation (optional). Defaults to ‘f8_conv_2d’.
- Returns
A Tensor of type float16.
- tensorflow.python.ipu.f8_ops.f8_conv_3d(inputs, filters, strides, padding, data_format='NDHWC', dilations=[1, 1, 1, 1, 1], name='f8_conv_3d')
Performs a 3D convolution on the two input tensors, supporting element type fp8.
- Parameters
inputs – A Tensor or QuarterTensor of shape [batch, in_depth, in_height, in_width, in_channels].

filters – A Tensor or QuarterTensor. Must have the same type as input. Shape [filter_depth, filter_height, filter_width, in_channels, out_channels]. in_channels must match between input and filters.

strides – A list of ints that has length >= 5. A 1-D Tensor of length 5. The stride of the sliding window for each dimension of input. Must have strides[0] = strides[4] = 1.

padding – A string from: "SAME", "VALID". The type of padding algorithm to use.

data_format – An optional string from: "NDHWC", "NCDHW". The data format of the input and output data. With the default format "NDHWC", the data is stored in the order of [batch, in_depth, in_height, in_width, in_channels]. Alternatively, the format could be "NCDHW", with the data storage order of [batch, in_channels, in_depth, in_height, in_width]. Defaults to "NDHWC".

dilations – An optional list of ints. A 1-D tensor of length 5. The dilation factor for each dimension of input. If set to k > 1, there will be k-1 skipped cells between each filter element on that dimension. The dimension order is determined by the value of data_format; see above for details. Dilations in the batch and depth dimensions must be 1. Defaults to [1, 1, 1, 1, 1].

name – A name for the operation (optional). Defaults to ‘f8_conv_3d’.
- Returns
A Tensor of type float16.
- tensorflow.python.ipu.f8_ops.f8_matmul(lhs, rhs, name='f8_matmul')
Performs a matmul on the two input tensors, supporting element type fp8.
- Parameters
lhs – Left-hand side of the matmul; can be a Tensor or QuarterTensor.

rhs – Right-hand side of the matmul; can be a Tensor or QuarterTensor.
- Returns
A Tensor of type float16.
22.14.14. Popops reduce scatter operator
- tensorflow.python.ipu.reduce_scatter_op.reduce_scatter(x, replication_factor, op='COLLECTIVE_OP_ADD', name=None)
Reduce the given replicated tensor with the result scattered across the replicas. For an input of shape [num_elements], the output will have shape [ceil(num_elements / replication_factor)]. If replication_factor does not evenly divide num_elements, the result is zero-padded. Example:

Input:  Replica0: [x0, y0, z0]
        Replica1: [x1, y1, z1]
Output: Replica0: [x0 + x1, y0 + y1]
        Replica1: [z0 + z1, 0]
- Parameters
x – The input tensor or list of tensors. The tensors must have rank 1.
replication_factor – The number of replicas in each collective group. If less than the total number of replicas in the model, the replicas are divided into consecutive groups of the given size, and the collective operation is performed within each respective group. If there are N total replicas denoted {0, ... N-1} and replication_factor is k, then the groups are: {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}. Note that N must be evenly divisible by k, otherwise an exception will be thrown during compilation.

op – Reduce operation, valid ops are: COLLECTIVE_OP_ADD, COLLECTIVE_OP_MUL, COLLECTIVE_OP_MIN, COLLECTIVE_OP_MAX, COLLECTIVE_OP_LOGICAL_AND, COLLECTIVE_OP_LOGICAL_OR, COLLECTIVE_OP_SQUARE_ADD, COLLECTIVE_OP_LOCAL and COLLECTIVE_OP_MEAN.
name – Optional op name.
- Returns
A Tensor object or list of Tensor objects. The shape of each output will be [ceil(input_length / number_of_replicas)].
22.14.15. Popops within replica operators
- tensorflow.python.ipu.within_replica_ops.all_gather(input_shards, axis=0)
Perform an all gather for a list of sharded tensors within a replica.
- Parameters
input_shards – The sharded input tensors to gather. These are expected to be supplied in incrementing shard order, so that input_shards[0] is on shard 0 and input_shards[i] is on shard i. Additionally these tensors must all be of the same type and of the same rank.
axis – input_shards are flattened to rank 1 prior to being gathered and reshaped on return. This argument specifies the axis that the gathered elements should be added to.
- Returns
A tuple of tensors that contains a copy of the data for each shard. Element i is the tensor mapped to shard i. Each sub-tensor has the shape of tf.concat(input_shards, axis=axis).
- tensorflow.python.ipu.within_replica_ops.all_reduce(input_shards, op)
Perform a reduce_scatter using the given op, followed by an all_gather on the results, so each shard contains all the reduced results. Inputs are zero-padded to the same size. Example:

Input:  IPU0 [x0, y0]
        IPU1 [x1, y1, z1]
        IPU2 [x2, y2, z2]
        IPU3 [x3, y3, z3]
Output: IPU0 [op(x0, x1, x2, x3), op(y0, y1, y2, y3), op(0, z1, z2, z3)]
        IPU1 [op(x0, x1, x2, x3), op(y0, y1, y2, y3), op(0, z1, z2, z3)]
        IPU2 [op(x0, x1, x2, x3), op(y0, y1, y2, y3), op(0, z1, z2, z3)]
        IPU3 [op(x0, x1, x2, x3), op(y0, y1, y2, y3), op(0, z1, z2, z3)]
- Parameters
input_shards – The tensors to reduce. These are expected to be supplied in increasing shard order, so that input_shards[0] is on shard 0 and input_shards[i] is on shard i. Additionally these tensors must be of the same type and of rank 0 or 1.
op – Reduce operation, valid ops are: COLLECTIVE_OP_ADD, COLLECTIVE_OP_MUL, COLLECTIVE_OP_MIN, COLLECTIVE_OP_MAX, COLLECTIVE_OP_LOGICAL_AND, COLLECTIVE_OP_LOGICAL_OR, COLLECTIVE_OP_LOCAL.
- Returns
A tuple of tensors that contains a copy of all the reduced data. Element i is the Tensor mapped to shard i.
- tensorflow.python.ipu.within_replica_ops.reduce_scatter(input_shards, op)
Reduce the given sharded tensors with the results scattered across the shards. If the tensors contain fewer or more elements than there are shards, the results will be zero-padded. Example:
Input:  IPU0 [x0, y0, z0]
        IPU1 [x1, y1, z1]
        IPU2 [x2, y2, z2]
        IPU3 [x3, y3, z3]
Output: IPU0 [0]
        IPU1 [op(y0, y1, y2, y3)]
        IPU2 [op(z0, z1, z2, z3)]
        IPU3 [op(x0, x1, x2, x3)]
- Parameters
input_shards – The tensors to reduce. These are expected to be supplied in increasing shard order, so that input_shards[0] is on shard 0 and input_shards[i] is on shard i. Additionally these tensors must be of the same type and of rank 0 or 1.
op – Reduce operation, valid ops are: COLLECTIVE_OP_ADD, COLLECTIVE_OP_MUL, COLLECTIVE_OP_MIN, COLLECTIVE_OP_MAX, COLLECTIVE_OP_LOGICAL_AND, COLLECTIVE_OP_LOGICAL_OR, COLLECTIVE_OP_LOCAL.
- Returns
A tuple of tensors, where each tensor contains 0 or more reduction results. Element i is the Tensor mapped to shard i.
22.14.16. Poprand operators
- tensorflow.python.ipu.rand_ops.dropout(x, rate=0.5, noise_shape=None, seed=None, name=None, **kwargs)
This targets the PopLibs Poprand operation, optimized for execution on the IPU.
With probability rate, drops elements of x. Inputs which are kept are scaled up by 1 / (1 - rate) such that the expected sum is unchanged.
- Parameters
x – The input tensor.
rate – The probability that a given element will be zeroed out.
noise_shape – An optional parameter that determines the shape of the dropout. Regular, unshaped dropout is used if not specified.
seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple) containing a pair of 32-bit integers that will be used to seed the random number generator that generates the dropout mask.

name – Optional op name.
- Returns
A tensor with some elements set to zero, randomly selected according to the given parameters.
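A seeded usage sketch; the rate and seed values are arbitrary, and the function should run inside an IPU scope:

import tensorflow as tf
from tensorflow.python import ipu

@tf.function
def apply_dropout(x):
    # Zero roughly 10% of elements and scale the survivors by 1 / 0.9.
    return ipu.rand_ops.dropout(x, rate=0.1, seed=[42, 7])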
22.14.17. Utility operations to be used in replicated mode
- tensorflow.python.ipu.replication_ops.replication_index(name=None)
An operation which allows the user to get the replication index.
- Parameters
name – Optional op name.
- Returns
A
Tensor
initialized with the replication index.
22.14.18. Slicing operators
- tensorflow.python.ipu.slicing_ops.sequence_slice(dst, src, num_elems, src_offsets, dst_offsets, zero_unused)
This op targets the PopLibs SequenceSlice operation.
The SequenceSlice operation takes specified elements from the source tensor and inserts them at specified locations in the destination tensor.
The parameters of the slice operation are defined by the number of elements to take for each slice, num_elems, the offset in the source tensor from which to take them, src_offsets, and the offset in the destination tensor at which the elements should be placed, dst_offsets.

For each slice, an element count, source offset and destination offset must be provided. The i-th entry of num_elems corresponds to the i-th entry of src_offsets and the i-th entry of dst_offsets.

For example:
from tensorflow.python.ops import array_ops
from tensorflow.python.ipu.ops.slicing_ops import sequence_slice

src = [[0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5]]

num_elems = [2, 2]
src_offsets = [2, 1]
dst_offsets = [0, 4]

dst = array_ops.zeros([6, 6])
dst = sequence_slice(dst, src, num_elems,
                     src_offsets, dst_offsets, False)
Following which, the contents of the destination tensor dst are as follows:

[[2. 2. 2. 2. 2. 2.]
 [3. 3. 3. 3. 3. 3.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1.]
 [2. 2. 2. 2. 2. 2.]]
In this example, the first slice takes two elements from index 2 of the source tensor and inserts them at index 0 of the destination tensor. The second slice also takes two elements, but from index 1 of the source tensor, inserting them at index 4 in the destination tensor.
- Parameters
dst – The destination tensor which will be updated; must be of at least rank 2, with inner dimensions matching those of src.

src – The source tensor from which the values are accessed; must be of at least rank 2, with inner dimensions matching those of dst.

num_elems – A list (or rank 1 tensor) of the number of elements to copy.
src_offsets – A list (or rank 1 tensor) of first elements to read from src.
dst_offsets – A list (or rank 1 tensor) of first elements to write to dst.
zero_unused – Whether to zero unreferenced dst elements.
- Returns
The destination tensor dst.
- tensorflow.python.ipu.slicing_ops.sequence_slice_pack(dst, src, num_elems, dst_offsets, zero_unused)
This op specialises the PopLibs SequenceSlice operation for sequence packing.
The SequenceSlicePack operation takes a contiguous tensor of sequences (such as the output of sequence_slice_unpack) and packs its elements into specified locations in the destination tensor.

The parameters of the slice operation are defined by the number of elements to take for each slice, num_elems, and the offset in the destination tensor at which the elements should be placed, dst_offsets.

For each slice, an element count and destination offset must be provided. The i-th entry of num_elems corresponds to the i-th entry of dst_offsets.

For example:
from tensorflow.python.ops import array_ops
from tensorflow.python.ipu.ops.slicing_ops import sequence_slice_pack

src = [[2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2]]

num_elems = [2, 2]
dst_offsets = [2, 1]

dst = array_ops.zeros([6, 6])
dst = sequence_slice_pack(dst, src, num_elems, dst_offsets, False)
Following which, the contents of the destination tensor dst are as follows:

[[0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1.]
 [2. 2. 2. 2. 2. 2.]
 [3. 3. 3. 3. 3. 3.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
In this example, the first slice takes the first two elements of the source tensor and inserts them at index 2 in the destination tensor. The second slice takes the next two elements in the source tensor, and inserts them at index 1 of the destination tensor.
- Parameters
dst – The destination tensor which will be updated; must be of at least rank 2, with inner dimensions matching those of src.

src – The source tensor from which the values are accessed; must be of at least rank 2.
num_elems – A list (or rank 1 tensor) of the number of elements to copy.
dst_offsets – A list (or rank 1 tensor) of first elements to write to dst.
zero_unused – Whether to zero unreferenced dst elements.
- Returns
The packed sequences.
- tensorflow.python.ipu.slicing_ops.sequence_slice_unpack(src, num_elems, src_offsets, total_elements)
This op specialises the PopLibs SequenceSlice operation for sequence unpacking.
The SequenceSliceUnpack operation unpacks specified elements from the source tensor and inserts them contiguously into the resulting tensor.
The parameters of the slice operation are defined by the number of elements to take for each slice, num_elems, and the offset in the source tensor from which to take them, src_offsets.

For each slice, an element count and source offset must be provided. The i-th entry of num_elems corresponds to the i-th entry of src_offsets.

For example:
from tensorflow.python.ipu.ops.slicing_ops import sequence_slice_unpack

src = [[0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5]]

num_elems = [2, 2]
src_offsets = [2, 1]
total_elements = 4

dst = sequence_slice_unpack(src, num_elems, src_offsets, total_elements)
Following which, the contents of the destination tensor dst are as follows:

[[2. 2. 2. 2. 2. 2.]
 [3. 3. 3. 3. 3. 3.]
 [1. 1. 1. 1. 1. 1.]
 [2. 2. 2. 2. 2. 2.]]
In this example, the first slice takes two elements from index 2 of the source tensor and inserts them at index 0 of the output tensor. The second slice also takes two elements, but from index 1 of the source tensor, inserting them at index 2 in the output tensor.
- Parameters
src – The source tensor from which the values are accessed, must be of at least rank 2.
num_elems – A list (or rank 1 tensor) of the number of elements to copy.
src_offsets – A list (or rank 1 tensor) of first elements to read from src.
total_elements – Total number of elements to slice.
- Returns
The unpacked sequences.
22.14.19. Statistics operators
- tensorflow.python.ipu.statistics_ops.fixed_width_bins(inputs, n_bins)
This op generates evenly spaced levels for histogram binning, derived from the value range of inputs.
- Parameters
inputs – A rank-1 tensor of values over which to compute binning levels.
n_bins – The number of bins required.
- Returns
A rank-1 tensor of binning values.
- tensorflow.python.ipu.statistics_ops.histogram(inputs, levels, absolute_of_input=False)
This op generates a histogram of inputs over the fixed-width bins defined by levels.
- Parameters
inputs – A rank-1 tensor of values to bin.

levels – A rank-1 tensor of binning values, such as that returned by fixed_width_bins.
absolute_of_input – If True, bin on input magnitude (absolute value). Default is False.
- Returns
A rank-1 histogram tensor.
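The binning and histogram ops compose naturally; a sketch with an arbitrary bin count:

import tensorflow as tf
from tensorflow.python import ipu

x = tf.random.normal([1024])
# Derive 16 evenly spaced binning levels from the value range of x,
# then histogram the magnitudes of the values over those levels.
levels = ipu.statistics_ops.fixed_width_bins(x, n_bins=16)
hist = ipu.statistics_ops.histogram(x, levels, absolute_of_input=True)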
- tensorflow.python.ipu.statistics_ops.histogram_normalize(hist)
This op normalizes a histogram.
- Parameters
hist – The histogram to be normalized.
- Returns
The normalized histogram.
- tensorflow.python.ipu.statistics_ops.histogram_update(hist, inputs, levels, absolute_of_input=False)
This op updates the histogram hist over the fixed-width bins defined by levels for new inputs.
- Parameters
hist – The histogram tensor to update.

inputs – A rank-1 tensor of values to bin.

levels – A rank-1 tensor of binning values, such as that returned by fixed_width_bins.
absolute_of_input – If True, bin on input magnitude (absolute value). Default is False.
- Returns
The updated rank-1 histogram tensor, hist.
22.14.20. Embedded application runtime
- class tensorflow.python.ipu.embedded_runtime.RuntimeContext(name, executable_file, executable_proto, start_output)
Represents an instance of the application runtime.
This class must not be constructed directly; instead, call embedded_runtime_start or embedded_runtime_start_and_call.
- name()
Get the name of the application runtime instance.
- Returns
The name of the application runtime instance.
- output_types()
Get the output dtypes of the executable.
- Returns
A list of output dtypes for the TF poplar executable.
- signature()
Get the signature of the executable.
- Returns
The signature protobuf object for the TF poplar executable.
- start_output()
Get the output from the start op which will start the application runtime instance.
- Returns
The output tensor from the start op.
- tensorflow.python.ipu.embedded_runtime.embedded_runtime_call(inputs, context)
Call an application with a batch of input data.
- Parameters
inputs – A batch of data to pass to the application.
context – The application runtime context created with
embedded_runtime_start
.
- Returns
The output tensors from the application.
- tensorflow.python.ipu.embedded_runtime.embedded_runtime_start(executable_file, inputs, name, timeout=None)
Create and start an application runtime from a TF poplar executable.
- Parameters
executable_file – The path to the executable file (given as string or Tensor)
inputs – The initial input tensors.
name – The name of the application runtime instance.
timeout – An integer indicating how long (measured in microseconds) to allow an executable for a pipelined model or a model with IO tiles to wait for the next batch of data before forcing the execution to continue. This is required because pipelined models and models with IO tiles cannot proceed with execution until the next batch of data arrives. If not provided, defaults to 5000 microseconds.
- Returns
An embedded application runtime context instance.
- tensorflow.python.ipu.embedded_runtime.embedded_runtime_start_and_call(executable_file, startup_inputs, call_inputs, name)
Create and start an application runtime from a TF poplar executable.
- Parameters
executable_file – The path to the executable file.
startup_inputs – The initial input tensors.
call_inputs – A batch of data to pass to the application.
name – The name of the application runtime instance.
- Returns
A tuple of the batch results and the embedded application runtime context.
- tensorflow.python.ipu.embedded_runtime.embedded_runtime_stop(context)
Stop an application runtime from a TF poplar executable.
- Parameters
context – The application runtime context created with `embedded_runtime_start`.
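Taken together, the embedded runtime functions form a start/call/stop lifecycle. Below is a minimal sketch; the executable path, tensor shapes and instance name are hypothetical, and a previously compiled TF Poplar executable plus attached IPUs are assumed:
import tensorflow as tf
from tensorflow.python import ipu

# Hypothetical path to a previously compiled TF Poplar executable.
executable_file = "my_model.poplar_exec"
# Initial input tensors (for example trained weights); empty here.
startup_inputs = []

# Start the application runtime ...
context = ipu.embedded_runtime.embedded_runtime_start(
    executable_file, startup_inputs, name="my_app")

# ... call it with a batch of data (the shape is hypothetical) ...
batch = [tf.zeros([16, 224, 224, 3], dtype=tf.float32)]
outputs = ipu.embedded_runtime.embedded_runtime_call(batch, context)

# ... and stop it when done.
ipu.embedded_runtime.embedded_runtime_stop(context)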
- tensorflow.python.ipu.embedded_runtime.executing_eagerly()
Checks whether the current thread has eager execution enabled.
Eager execution is enabled by default and this API returns `True` in most cases. However, this API might return `False` in the following use cases:
- Executing inside `tf.function`, unless under `tf.init_scope` or after `tf.config.run_functions_eagerly(True)` has been called.
- Executing inside a transformation function for `tf.dataset`.
- `tf.compat.v1.disable_eager_execution()` is called.
General case:
>>> print(tf.executing_eagerly())
True
Inside `tf.function`:
>>> @tf.function
... def fn():
...   with tf.init_scope():
...     print(tf.executing_eagerly())
...   print(tf.executing_eagerly())
>>> fn()
True
False
Inside `tf.function` after `tf.config.run_functions_eagerly(True)` is called:
>>> tf.config.run_functions_eagerly(True)
>>> @tf.function
... def fn():
...   with tf.init_scope():
...     print(tf.executing_eagerly())
...   print(tf.executing_eagerly())
>>> fn()
True
True
>>> tf.config.run_functions_eagerly(False)
Inside a transformation function for `tf.dataset`:
>>> def data_fn(x):
...   print(tf.executing_eagerly())
...   return x
>>> dataset = tf.data.Dataset.range(100)
>>> dataset = dataset.map(data_fn)
False
- Returns
`True` if the current thread has eager execution enabled.
22.15. Optimisers
In addition to the `tensorflow.python.ipu.optimizers` namespace, it is also possible to access the optimizer classes via other namespaces, as shown in the following table:
Optimizer | Alternative namespaces
---|---
CrossReplicaOptimizer | tensorflow.python.ipu.cross_replica_optimizer, tensorflow.python.ipu.optimizers.cross_replica_optimizer
CrossReplicaGradientAccumulationOptimizer | tensorflow.python.ipu.gradient_accumulation_optimizer, tensorflow.python.ipu.optimizers.gradient_accumulation_optimizer
CrossReplicaGradientAccumulationOptimizerV2 | tensorflow.python.ipu.gradient_accumulation_optimizer, tensorflow.python.ipu.optimizers.gradient_accumulation_optimizer
GradientAccumulationOptimizer | tensorflow.python.ipu.gradient_accumulation_optimizer, tensorflow.python.ipu.optimizers.gradient_accumulation_optimizer
GradientAccumulationOptimizerV2 | tensorflow.python.ipu.gradient_accumulation_optimizer, tensorflow.python.ipu.optimizers.gradient_accumulation_optimizer
MapGradientOptimizer | tensorflow.python.ipu.map_gradient_optimizer, tensorflow.python.ipu.optimizers.map_gradient_optimizer
ShardedOptimizer | tensorflow.python.ipu.sharded_optimizer, tensorflow.python.ipu.optimizers.sharded_optimizer
Note
The `ipu.optimizers` optimizer classes can only be used with subclasses of `tensorflow.compat.v1.train.Optimizer`.
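As a quick illustration of the table, both of the following import paths are intended to expose the same class (a sketch; either should work in a correctly installed IPU TensorFlow wheel):
# Via the dedicated module namespace:
from tensorflow.python.ipu.cross_replica_optimizer import CrossReplicaOptimizer

# Or, equivalently, via the optimizers namespace:
from tensorflow.python.ipu.optimizers.cross_replica_optimizer import (
    CrossReplicaOptimizer)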
You can configure `GradientAccumulationOptimizerV2` and `CrossReplicaGradientAccumulationOptimizerV2` with an optional reduction method (see Table 22.2) defining how to accumulate gradients (see the enumerated class `GradientAccumulationReductionMethod`).
Reduction method | Behaviour
---|---
SUM | Sum gradients across the mini-batch.
MEAN | Sum gradients across the mini-batch after scaling them by (1 / mini-batch-size).
RUNNING_MEAN | Compute a running mean of gradients across the mini-batch using the expression acc*n/(n+1) + grad/(n+1) for the nth iteration.
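For example, to accumulate with a mean rather than a sum, pass the enum value to the wrapper (a minimal sketch; the learning rate and batch count are placeholders):
import tensorflow as tf
from tensorflow.python import ipu

base = tf.compat.v1.train.GradientDescentOptimizer(0.01)

# Accumulate 8 mini-batches and apply their mean instead of their sum.
opt = ipu.optimizers.GradientAccumulationOptimizerV2(
    base,
    num_mini_batches=8,
    reduction_method=ipu.optimizers.GradientAccumulationReductionMethod.MEAN)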
22.15.1. Helper classes and methods for gradient accumulation
- class tensorflow.python.ipu.gradient_accumulation.Enum(value)
Generic enumeration.
Derive from this class to define new enumerations.
- name
The name of the Enum member.
- value
The value of the Enum member.
- class tensorflow.python.ipu.gradient_accumulation.GradientAccumulationReductionMethod(value)
Reduction method to use when accumulating gradients. We perform `gradient_accumulation_count` iterations (forward and backward passes) in each optimizer step, at the end of which we update the optimizer with the gradients accumulated during the optimizer step. For each iteration within the optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly.
Note: The term `gradient_accumulation_count` is from the pipeline API and is referred to as `num_mini_batches` in `GradientAccumulationOptimizerV2` and `CrossReplicaGradientAccumulationOptimizerV2`.
- SUM: Performs a sum of gradients.
- MEAN: Performs a sum of gradients scaled by (`1/num_mini_batches`).
- RUNNING_MEAN: Performs a running mean of gradients (`acc*n/(n+1) + grad/(n+1)` for the nth iteration).
22.15.2. Optimizer classes for the Graphcore IPU
- class tensorflow.python.ipu.optimizers.CrossReplicaGradientAccumulationOptimizer(opt, num_mini_batches, verify_usage=True, name='CrossReplicaGradientAccumulationOptimizer')
An optimizer where instead of performing the weight update for every batch, gradients across multiple batches are accumulated. After multiple batches have been processed, their accumulated gradients are reduced across the replicas before being used to compute the weight update.
This allows us to simulate bigger batch sizes. For example, if we have a model with a batch size of 16 and we accumulate the gradients of 4 batches, this simulates an input batch of size 64.
This optimizer is similar to GradientAccumulationOptimizer; however, using this optimizer guarantees that the accumulated gradients will only be exchanged between IPUs when the accumulated gradients are back-propagated through the network.
- __init__(opt, num_mini_batches, verify_usage=True, name='CrossReplicaGradientAccumulationOptimizer')
Construct a Cross Replica Gradient Accumulation Optimizer.
- Parameters
opt – An existing `Optimizer` to encapsulate.
num_mini_batches – Number of mini-batches the gradients will be accumulated for.
verify_usage – The current gradient accumulation supports the `GradientDescentOptimizer` and `MomentumOptimizer` optimizers. Any other usage of this optimizer might result in incorrect results. This option can be used to disable this check.
name – Optional name prefix for the operations created when applying gradients. Defaults to “CrossReplicaGradientAccumulationOptimizer”.
- class tensorflow.python.ipu.optimizers.CrossReplicaGradientAccumulationOptimizerV2(opt, num_mini_batches, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, dtype=None, reduction_method=GradientAccumulationReductionMethod.SUM, name='CrossReplicaGradientAccumulationOptimizerV2')
An optimizer where instead of performing the weight update for every batch, gradients across multiple batches are accumulated. After multiple batches have been processed, their accumulated gradients are reduced across the replicas before being used to compute the weight update.
This allows us to simulate bigger batch sizes. For example, if we have a model with a batch size of 16 and we accumulate the gradients of 4 batches, this simulates an input batch of size 64.
This optimizer is similar to GradientAccumulationOptimizerV2; however, using this optimizer guarantees that the accumulated gradients will only be exchanged between IPUs when the gradients are applied to the weights, and hence reduces the number of cross-IPU gradient exchanges by a factor of `num_mini_batches`.
- __init__(opt, num_mini_batches, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, dtype=None, reduction_method=GradientAccumulationReductionMethod.SUM, name='CrossReplicaGradientAccumulationOptimizerV2')
Construct a Cross Replica Gradient Accumulation Optimizer V2.
- Parameters
opt – An existing `Optimizer` to encapsulate.
num_mini_batches – Number of mini-batches the gradients will be accumulated for.
offload_weight_update_variables – If True, any `tf.Variable` which is only used by the weight update of the model (for example the accumulator variable when using the `tf.MomentumOptimizer`) will be stored in remote memory. During the weight update this variable will be streamed onto the device and then streamed back to remote memory after it has been updated. Requires the machine to be configured with support for Poplar remote buffers. Offloading variables into remote memory can reduce maximum memory liveness, but can also increase the computation time of the weight update.
replicated_optimizer_state_sharding – If True, any `tf.Variable` which is offloaded will be partitioned across the replicas. A collective all-gather will be inserted to restore the tensor on each replica. If `None`, this value will match the value of `offload_weight_update_variables`.
dtype – The data type used for the gradient accumulation buffer. One of:
- `None`: Use an accumulator of the same type as the variable type.
- A `DType`: Use this type for all the accumulators. For example `tf.float32`.
- A callable that takes the variable and returns a `DType`: Allows specifying the accumulator type on a per-variable basis.
The gradients passed to `Optimizer.apply_gradients` will have the dtype requested here. If that dtype is different from the variable dtype, a cast is needed at some point to make them compatible. If you want to cast the gradients immediately, you can wrap your optimizer in the `MapGradientOptimizer` with a `tf.cast`.
reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to `GradientAccumulationReductionMethod.SUM` (see `GradientAccumulationReductionMethod`).
name – Optional name prefix for the operations created when applying gradients. Defaults to “CrossReplicaGradientAccumulationOptimizerV2”.
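A hedged sketch of wrapping a momentum optimizer with this class, with the offloading options set explicitly (the flag values here are illustrative, not recommendations):
import tensorflow as tf
from tensorflow.python import ipu

momentum = tf.compat.v1.train.MomentumOptimizer(0.01, momentum=0.9)

# Accumulate 4 mini-batches per replica; exchange gradients between
# replicas only at the weight update.
opt = ipu.optimizers.CrossReplicaGradientAccumulationOptimizerV2(
    momentum,
    num_mini_batches=4,
    offload_weight_update_variables=True,      # stream accumulators to remote memory
    replicated_optimizer_state_sharding=True,  # shard offloaded state across replicas
    dtype=tf.float32)                          # accumulate in float32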
- class tensorflow.python.ipu.optimizers.CrossReplicaOptimizer(opt, name='CrossReplicaOptimizer')
An optimizer that averages gradients across IPU replicas.
- __init__(opt, name='CrossReplicaOptimizer')
Construct a new cross-replica optimizer.
- Parameters
opt – An existing `Optimizer` to encapsulate.
name – Optional name prefix for the operations created when applying gradients. Defaults to “CrossReplicaOptimizer”.
- apply_gradients(grads_and_vars, global_step=None, name=None)
Apply gradients to variables.
Calls popops_cross_replica_sum.cross_replica_sum() to sum gradient contributions across replicas, and then applies the real optimizer.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
- Returns
An `Operation` that applies the gradients. If `global_step` was not None, that operation also increments `global_step`.
- Raises
ValueError – If `grads_and_vars` is malformed.
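For instance, in a replicated graph the wrapper can be dropped in around any v1 optimizer; the model function below is a hypothetical sketch (replication itself is configured elsewhere):
import tensorflow as tf
from tensorflow.python import ipu

# Sum gradient contributions across replicas before applying them.
opt = ipu.optimizers.CrossReplicaOptimizer(
    tf.compat.v1.train.GradientDescentOptimizer(0.01))

def training_step(features, labels):
  logits = tf.compat.v1.layers.dense(features, 10)
  loss = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
          labels=labels, logits=logits))
  return opt.minimize(loss)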
- class tensorflow.python.ipu.optimizers.GradientAccumulationOptimizer(opt, num_mini_batches, verify_usage=True, name='GradientAccumulationOptimizer')
An optimizer where instead of performing the weight update for every batch, gradients across multiple batches are accumulated. After multiple batches have been processed, their accumulated gradients are used to compute the weight update.
This allows us to simulate bigger batch sizes. For example, if we have a model with a batch size of 16 and we accumulate the gradients of 4 batches, this simulates an input batch of size 64.
This optimizer supports `tf.train.GradientDescentOptimizer` and `tf.train.MomentumOptimizer` only. All other optimizers should use `GradientAccumulationOptimizerV2`.
- __init__(opt, num_mini_batches, verify_usage=True, name='GradientAccumulationOptimizer')
Construct a Gradient Accumulation Optimizer.
- Parameters
opt – An existing `Optimizer` to encapsulate.
num_mini_batches – Number of mini-batches the gradients will be accumulated for.
verify_usage – The current gradient accumulation supports the `GradientDescentOptimizer` and `MomentumOptimizer` optimizers. Any other usage of this optimizer might result in incorrect results. This option can be used to disable this check.
name – Optional name prefix for the operations created when applying gradients. Defaults to “GradientAccumulationOptimizer”.
- apply_gradients(grads_and_vars, global_step=None, name=None)
Apply gradients to variables.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
- Returns
An `Operation` that applies the gradients. If `global_step` was not None, that operation also increments `global_step`.
- Raises
ValueError – If `grads_and_vars` is malformed.
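Because this class only verifies `GradientDescentOptimizer` and `MomentumOptimizer`, a typical use looks like the following sketch; any other optimizer should be wrapped in `GradientAccumulationOptimizerV2` instead:
import tensorflow as tf
from tensorflow.python import ipu

# Accumulate 4 mini-batches of gradients before each weight update.
opt = ipu.optimizers.GradientAccumulationOptimizer(
    tf.compat.v1.train.MomentumOptimizer(0.01, momentum=0.9),
    num_mini_batches=4)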
- class tensorflow.python.ipu.optimizers.GradientAccumulationOptimizerV2(opt, num_mini_batches, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, dtype=None, reduction_method=GradientAccumulationReductionMethod.SUM, name='GradientAccumulationOptimizerV2')
An optimizer where instead of performing the weight update for every batch, gradients across multiple batches are accumulated. After multiple batches have been processed, their accumulated gradients are used to compute the weight update.
This allows us to simulate bigger batch sizes. For example, if we have a model with a batch size of 16 and we accumulate the gradients of 4 batches, this simulates an input batch of size 64.
Unlike `GradientAccumulationOptimizer`, this optimizer can be used to wrap any other TensorFlow optimizer.
See the Gradient accumulation section in the documentation for more details.
- __init__(opt, num_mini_batches, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, dtype=None, reduction_method=GradientAccumulationReductionMethod.SUM, name='GradientAccumulationOptimizerV2')
Construct a Gradient Accumulation Optimizer V2.
- Parameters
opt – An existing `Optimizer` to encapsulate.
num_mini_batches – Number of mini-batches the gradients will be accumulated for.
offload_weight_update_variables – When enabled, any `tf.Variable` which is only used by the weight update of the pipeline (for example the accumulator variable when using the `tf.MomentumOptimizer`) will be stored in remote memory. During the weight update this variable will be streamed onto the device and then streamed back to remote memory after it has been updated. Requires the machine to be configured with support for Poplar remote buffers. Offloading variables into remote memory can reduce maximum memory liveness, but can also increase the computation time of the weight update. When set to `None`, the variables will be placed in either in-processor or remote memory automatically, based on the current best placement strategy.
replicated_optimizer_state_sharding – If True, any `tf.Variable` which is offloaded (for example the accumulator variable when using the `tf.MomentumOptimizer`) will be partitioned across the replicas. This can exploit the additional bandwidth of the IPU-Links to improve overall throughput; however, it might increase the code size and hence the model might need adjusting (for example the PopLibs option `availableMemoryProportion` might need to be changed).
dtype – The data type used for the gradient accumulation buffer. One of:
- `None`: Use an accumulator of the same type as the variable type.
- A `DType`: Use this type for all the accumulators. For example `tf.float32`.
- A callable that takes the variable and returns a `DType`: Allows specifying the accumulator type on a per-variable basis.
The gradients passed to `Optimizer.apply_gradients` will have the dtype requested here. If that dtype is different from the variable dtype, a cast is needed at some point to make them compatible. If you want to cast the gradients immediately, you can wrap your optimizer in the `MapGradientOptimizer` with a `tf.cast`.
reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to `GradientAccumulationReductionMethod.SUM` (see `GradientAccumulationReductionMethod`).
name – Optional name prefix for the operations created when applying gradients. Defaults to “GradientAccumulationOptimizerV2”.
- apply_gradients(grads_and_vars, global_step=None, name=None)
Apply gradients to variables.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
- Returns
An `Operation` that applies the gradients. If `global_step` was not None, that operation also increments `global_step`.
- Raises
ValueError – If `grads_and_vars` is malformed.
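The `dtype` argument also accepts a callable, which is useful for accumulating float16 gradients in a wider type. A hedged sketch:
import tensorflow as tf
from tensorflow.python import ipu

def accumulator_dtype(var):
  # Accumulate float16 variables in float32 to reduce overflow risk;
  # keep every other variable's own dtype.
  if var.dtype.base_dtype == tf.float16:
    return tf.float32
  return var.dtype.base_dtype

opt = ipu.optimizers.GradientAccumulationOptimizerV2(
    tf.compat.v1.train.AdamOptimizer(1e-3),
    num_mini_batches=16,
    dtype=accumulator_dtype)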
- class tensorflow.python.ipu.optimizers.GradientAccumulationReductionMethod(value)
Reduction method to use when accumulating gradients. We perform `gradient_accumulation_count` iterations (forward and backward passes) in each optimizer step, at the end of which we update the optimizer with the gradients accumulated during the optimizer step. For each iteration within the optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly.
Note: The term `gradient_accumulation_count` is from the pipeline API and is referred to as `num_mini_batches` in `GradientAccumulationOptimizerV2` and `CrossReplicaGradientAccumulationOptimizerV2`.
- SUM: Performs a sum of gradients.
- MEAN: Performs a sum of gradients scaled by (`1/num_mini_batches`).
- RUNNING_MEAN: Performs a running mean of gradients (`acc*n/(n+1) + grad/(n+1)` for the nth iteration).
- class tensorflow.python.ipu.optimizers.IpuOptimizer(opt, name=None)
The wrapper interface for `optimizer.Optimizer` optimizers. Custom wrappers written for the IPU can inherit from this class and override the appropriate functions.
This provides the convenience of automatically forwarding functions that have not been overridden to the wrapped optimizer, and also allows you to define custom APIs specifically for the IPU.
- __init__(opt, name=None)
Construct a new IpuOptimizer.
- Parameters
opt – The optimizer to be wrapped.
name – The name to be passed to Optimizer constructor.
- apply_gradients(grads_and_vars, global_step=None, name=None)
Apply gradients to variables.
Applies gradients from underlying optimizer.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
- Returns
An `Operation` that applies the gradients. If `global_step` was not None, that operation also increments `global_step`.
- Raises
ValueError – If `grads_and_vars` is malformed.
- compute_gradients(loss, var_list=None, **kwargs)
Compute gradients of “loss” for the variables in “var_list”.
This simply wraps the compute_gradients() from the real optimizer. The gradients will be aggregated in apply_gradients(), so that the user can modify the gradients, for example clipping with per-replica global norm, if needed.
- Parameters
loss – A Tensor containing the value to minimize.
var_list – Optional list or tuple of `tf.Variable` to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
**kwargs – Keyword arguments for compute_gradients().
- Returns
A list of (gradient, variable) pairs.
- get_name()
Return the name of the underlying optimizer.
- get_slot(*args, **kwargs)
Return a slot named “name” created for “var” by the Optimizer.
This simply wraps the get_slot() from the actual optimizer.
- Parameters
*args – Arguments for get_slot().
**kwargs – Keyword arguments for get_slot().
- Returns
The `Variable` for the slot if it was created, `None` otherwise.
- get_slot_names(*args, **kwargs)
Return a list of the names of slots created by the `Optimizer`.
This simply wraps the get_slot_names() from the actual optimizer.
- Parameters
*args – Arguments for get_slot_names().
**kwargs – Keyword arguments for get_slot_names().
- Returns
A list of strings.
- variables()
Forwards the variables from the underlying optimizer.
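Since `IpuOptimizer` forwards anything not overridden to the wrapped optimizer, a custom wrapper only needs to override the methods it changes. The gradient-clipping wrapper below is a hypothetical sketch, not an API of the library:
import tensorflow as tf
from tensorflow.python import ipu

class ClipByNormOptimizer(ipu.optimizers.IpuOptimizer):
  # A hypothetical wrapper that clips each gradient before applying it.

  def __init__(self, opt, clip_norm, name="ClipByNormOptimizer"):
    super().__init__(opt, name=name)
    self._clip_norm = clip_norm

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    clipped = [(tf.clip_by_norm(g, self._clip_norm), v)
               for g, v in grads_and_vars]
    # Delegate to IpuOptimizer, which applies via the wrapped optimizer.
    return super().apply_gradients(clipped, global_step, name)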
- class tensorflow.python.ipu.optimizers.MapGradientOptimizer(wrapped_optimizer, gradient_mapping_function, name='MapGradientOptimizer')
This class enables modification of the computed gradients, before they are passed to the final optimizer for application.
MapGradientOptimizer needs a map function that will modify the gradients, and an optimizer to which the modified gradients are passed.
The map function has two arguments: `gradient` and `variable`. The map function must return the modified gradient.
Example
import tensorflow as tf
from tensorflow.python import ipu

# Define a function which will modify the computed gradients.
# This is a gradient decay function.
WEIGHT_DECAY = 0.01  # 0.01 reproduces the gradients quoted below

def map_fn_decay(grad, var):
    return grad + (WEIGHT_DECAY * var)

# To run the code we need a session:
with tf.compat.v1.Session():
    optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.000001)
    # We define the MapGradientOptimizer.
    map_optimizer = ipu.optimizers.MapGradientOptimizer(
        optimizer, map_fn_decay)
    # Gradients are computed by compute_gradients(), where our map
    # function modifies the computed gradients. The arguments of
    # compute_gradients(loss, var_list) are the loss and the variable
    # list, so define them and call map_optimizer.compute_gradients().
    values = [1.0, 2.0, 3.0]
    vars_ = [tf.Variable([v], dtype=tf.float32) for v in values]
    grads_and_vars = map_optimizer.compute_gradients(
        vars_[0] * vars_[1] + vars_[0] * vars_[2] + vars_[1] * vars_[2],
        vars_)
    # grads_and_vars now contains the computed gradients modified by
    # the decay map function: 5.01, 4.02 and 3.03. Without
    # MapGradientOptimizer they would be 5, 4 and 3.
- __init__(wrapped_optimizer, gradient_mapping_function, name='MapGradientOptimizer')
Construct a MapGradientOptimizer.
- Parameters
wrapped_optimizer – TensorFlow (derived) optimizer.
gradient_mapping_function – The function to be applied on the gradients and variables which are provided by `wrapped_optimizer.compute_gradients()`.
- compute_gradients(*args, **kwargs)
Compute gradients of “loss” for the variables in “var_list”.
The gradients computed by the wrapped optimizer are modified using the `gradient_mapping_function` that was passed to the constructor.
- Parameters
loss – A Tensor containing the value to minimize.
var_list – Optional list or tuple of `tf.Variable` to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
**kwargs – Keyword arguments for compute_gradients().
- Returns
A list of (gradient, variable) pairs.
- class tensorflow.python.ipu.optimizers.ShardedOptimizer(optimizer)
- __init__(optimizer)
Construct a new sharded optimizer.
- Parameters
optimizer – The optimizer to wrap.
- apply_gradients(grads_and_vars, global_step=None, name=None)
Apply gradients to variables.
Applies gradients from underlying optimizer.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
- Returns
An `Operation` that applies the gradients. If `global_step` was not None, that operation also increments `global_step`.
- Raises
ValueError – If `grads_and_vars` is malformed.
- compute_gradients(loss, var_list=None, **kwargs)
Compute gradients of “loss” for the variables in “var_list”.
This simply wraps the compute_gradients() from the real optimizer. The gradients will be aggregated in apply_gradients(), so that the user can modify the gradients, for example clipping with per-replica global norm, if needed.
- Parameters
loss – A Tensor containing the value to minimize.
var_list – Optional list or tuple of `tf.Variable` to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
**kwargs – Keyword arguments for compute_gradients().
- Returns
A list of (gradient, variable) pairs.
22.16. Sharding
22.16.1. Utility functions for sharding graphs
- tensorflow.python.ipu.sharding.dependencies(roots)
Find a list of ancestor operations for a given set of root operations.
- Parameters
roots – The root operations from which to start.
- tensorflow.python.ipu.sharding.enable_sharded_gradient_tape()
Enable backward ops generated by `tf.GradientTape` to inherit the sharding of their forward op.
- tensorflow.python.ipu.sharding.get_shard_from_colocation(op)
Find the shard number from an op which shares co-location information with the given operation.
- Parameters
op – The operation to apply sharding to.
- tensorflow.python.ipu.sharding.get_sharding(op)
Get the sharding for the given op.
- Parameters
op – An operation.
- Returns
None if the operation has no sharding, otherwise the shard number.
- tensorflow.python.ipu.sharding.has_attr(o, attr_name)
Test for the presence of a specific attribute.
- Parameters
o – An operation.
attr_name – The name of an attribute to test for.
- Returns
True if the operation has the given attribute.
- tensorflow.python.ipu.sharding.propagate_sharding(g)
Move the sharding from the forward pass operations onto their co-located backward pass operations.
- Parameters
g – The graph.
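As a brief sketch of these utilities together, assuming the `ipu.scopes.ipu_shard` scope documented elsewhere in this chapter and a v1-style graph:
import tensorflow as tf
from tensorflow.python import ipu
from tensorflow.python.ipu import sharding

# Let backward ops created by tf.GradientTape inherit the sharding
# of their corresponding forward ops.
sharding.enable_sharded_gradient_tape()

g = tf.Graph()
with g.as_default():
  with ipu.scopes.ipu_shard(0):
    a = tf.constant([1.0, 2.0]) * 2.0

  # Shard number of the op, or None if it is unsharded.
  print(sharding.get_sharding(a.op))

  # Copy sharding from forward ops onto co-located backward ops.
  sharding.propagate_sharding(g)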