23. IPU TensorFlow Addons Python API
23.1. Keras Layers
23.1.1. Keras layers made for IPU TensorFlow

class ipu_tensorflow_addons.keras.layers.AssumeEqualAcrossReplicas(*args, **kwargs)
Layer for marking values as equal across replicas, to help prevent divergent control flow compilation errors.
Divergent control flow describes the situation where program flow differs among replicas. This happens when the value of a conditional is not the same across all replicas. This is a problem if the conditional body requires a cross-replica sync, as only some replicas will reach it. If this happens, the execution will hang as the operation waits for all replicas to sync.
To warn the user about this, Poplar checks for divergent control flow during compilation. However, since the values of tensors are unknown at compilation time, it cannot be certain whether a tensor will lead to divergent control flow or not.
assume_equal_across_replicas can be used to mark tensors which are equal across all replicas. Doing so prevents them from causing divergence errors if they are used in a conditional.
 Parameters
inplace – A bool controlling whether the given tensor(s) are copied or operated on in place. This is needed when using AssumeEqualAcrossReplicas with tensor slices.
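The hang described above can be illustrated without IPU hardware. The following is a minimal pure-Python sketch, not Poplar's actual compile-time check: each replica is modelled as one entry in a list, and the helpers replicas_reaching_sync and is_divergent are hypothetical names introduced only for this illustration.

```python
def replicas_reaching_sync(condition_values):
    """Return the indices of replicas whose condition is true.

    Hypothetical helper: each entry models the value one replica sees for
    the conditional that guards a cross-replica sync.
    """
    return [i for i, cond in enumerate(condition_values) if cond]


def is_divergent(condition_values):
    """Control flow diverges when only some of the replicas take the branch."""
    reached = replicas_reaching_sync(condition_values)
    return 0 < len(reached) < len(condition_values)


# Four replicas agree on the condition: all of them reach the sync.
uniform = [True, True, True, True]
# One replica disagrees: the other three would wait for it indefinitely.
mixed = [True, True, False, True]

print(is_divergent(uniform))
print(is_divergent(mixed))
```

A tensor marked as equal across replicas corresponds to the uniform case: every replica takes the same branch, so a sync inside the branch is reached by all of them or by none.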

call(inputs, **kwargs)
This is where the layer’s logic lives.
Note that the call() method in tf.keras is slightly different from the keras API. In the keras API, you can pass masking support for layers as additional arguments, whereas tf.keras has a compute_mask() method to support masking.
 Parameters
inputs – Input tensor, or list/tuple of input tensors.
**kwargs – Additional keyword arguments. Currently unused.
 Returns
A tensor or list/tuple of tensors.

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.
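The round trip described above can be sketched with a plain Python class. ScaleLayer is a hypothetical stand-in for a Keras layer, used so that the example runs without TensorFlow; the from_config class method mirrors the usual Keras convention but is written out here by hand.

```python
import json


class ScaleLayer:
    """Hypothetical stand-in for a Keras layer, used only for illustration."""

    def __init__(self, factor=1.0, name="scale"):
        self.factor = factor
        self.name = name

    def get_config(self):
        # Only constructor arguments are stored: no weights, no connectivity
        # and no class name.
        return {"factor": self.factor, "name": self.name}

    @classmethod
    def from_config(cls, config):
        return cls(**config)


original = ScaleLayer(factor=2.5, name="double_and_a_half")
config = original.get_config()

# The config is serializable, so it can be stored as JSON.
serialized = json.dumps(config)

# The caller must already know which class the config belongs to.
clone = ScaleLayer.from_config(json.loads(serialized))
```

Because the class name is not part of the config, whatever stores the config (the containing model, in Keras) has to record which class to reinstantiate.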

class ipu_tensorflow_addons.keras.layers.CTCInferenceLayer(*args, **kwargs)
Computes CTC (Connectionist Temporal Classification) predictions using a beam search. This implementation is designed and optimized for the IPU and cannot be used with other systems.
 Parameters
blank_index – The class index to use for the blank label.
beam_width – The beam width to use in the beam search.
top_paths – The number of paths to return.
from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.
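The two input forms selected by from_logits are related by a log-softmax over the class axis. The NumPy sketch below shows that relationship under the assumption that the class axis is the last one, as in the [max_time, batch_size, num_classes] layout; it does not reproduce the layer's own implementation, and log_softmax is a helper written for this example.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Convert unnormalised logits to log probabilities, stably."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))


# Shape [max_time, batch_size, num_classes].
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 2, 4))

log_probs = log_softmax(logits)

# Exponentiating log probabilities gives a distribution over classes
# for every timestep and batch entry.
class_sums = np.exp(log_probs).sum(axis=-1)
```

Data already in the log_probs form corresponds to from_logits=False; raw logits correspond to from_logits=True.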

call(data, data_length, **kwargs)
 Parameters
data – The data input [max_time, batch_size, num_classes] tensor.
data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.
 Returns
Label probabilities: Negative log probabilities that each path is correct.
Label lengths: Length of each path of predictions.
Decoded labels: The predictions made by the beam search.
 Return type
A tuple of values

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

class ipu_tensorflow_addons.keras.layers.CTCLoss(*args, **kwargs)
Computes CTC (Connectionist Temporal Classification) loss. This implementation is designed and optimized for the IPU and cannot be used with other systems.
Usage:
    labels = tf.keras.layers.Input((max_label_length), batch_size=batch_size,
                                   dtype=np.int32, name="labels")
    data = tf.keras.layers.Input((max_time, num_classes),
                                 batch_size=batch_size, dtype=np.float32,
                                 name="data")
    label_length = tf.keras.layers.Input((), batch_size=batch_size,
                                         dtype=np.int32, name="label_length")
    logit_length = tf.keras.layers.Input((), batch_size=batch_size,
                                         dtype=np.int32, name="logit_length")

    dense_layer = tf.keras.layers.Dense(num_classes)
    transpose_layer = tf.keras.layers.Lambda(
        lambda x: keras.backend.permute_dimensions(x, (1, 0, 2)))
    ctc_loss_layer = ipu.keras.losses.CTCLoss(from_logits=True)

    x = dense_layer(data)
    x = transpose_layer(x)
    loss = ctc_loss_layer(labels, x, label_length, logit_length)

    model = ipu.keras.Model((labels, data, label_length, logit_length), loss)
    get_loss_output = lambda y_true, y_pred: y_pred
    model.compile('sgd', loss=get_loss_output)
 Parameters
blank_index – The class index to use for the blank label.
from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.

call(labels, data, label_length, data_length, **kwargs)
 Parameters
labels – The labels input [batch_size, max_label_length] tensor.
data – The data input [max_time, batch_size, num_classes].
label_length – A tensor of shape [batch_size] containing the number of labels in each labels batch entry.
data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.
 Returns
The calculated loss.
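The data argument is time-major, whereas Keras layers usually produce batch-major tensors. The NumPy sketch below shows the axis permutation that brings a [batch_size, max_time, num_classes] tensor into the documented layout; the array contents and sizes are made up for illustration.

```python
import numpy as np

batch_size, max_time, num_classes = 2, 5, 4

# Batch-major activations, for example the output of a Dense layer.
batch_major = np.arange(batch_size * max_time * num_classes, dtype=np.float32)
batch_major = batch_major.reshape(batch_size, max_time, num_classes)

# Swap the first two axes to obtain [max_time, batch_size, num_classes].
time_major = np.transpose(batch_major, (1, 0, 2))

# Per-entry lengths, shape [batch_size]; the second entry is padded.
data_length = np.array([5, 3], dtype=np.int32)
```

Element [t, b, c] of the time-major tensor is element [b, t, c] of the batch-major one, so no values change, only their order.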

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

class ipu_tensorflow_addons.keras.layers.CTCPredictionsLayer(*args, **kwargs)
Computes CTC (Connectionist Temporal Classification) most probable predictions.
Returns the most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It also fills the values off the end with the blank index.
This layer performs many post-processing steps to create the predictions. If your model is close to its memory limit, it may be worth using CTCInferenceLayer instead, streaming its results off the device and performing the processing on the CPU. However, this will create a larger stream copy that may also cost memory.
 Parameters
blank_index – The class index to use for the blank label.
beam_width – The beam width to use in the beam search.
top_paths – The number of paths to return.
from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.

call(data, data_length, **kwargs)
 Parameters
data – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of log probabilities.
data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry. If not provided, the layer can only perform inference.
 Returns
The most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It fills the values off the end with the blank index.
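The post-processing described for this layer can be sketched on the host in plain Python. The example assumes the three per-path outputs documented for CTCInferenceLayer (negative log probabilities, path lengths and decoded labels) for a single batch entry; most_probable_prediction is a hypothetical helper, not part of the library.

```python
def most_probable_prediction(neg_log_probs, lengths, paths, blank_index):
    """Pick the most probable path and blank out everything past its length."""
    # Smallest negative log probability means highest probability.
    best = min(range(len(paths)), key=lambda i: neg_log_probs[i])
    valid = lengths[best]
    path = paths[best]
    return list(path[:valid]) + [blank_index] * (len(path) - valid)


# Three candidate paths (top_paths = 3) for one batch entry.
neg_log_probs = [2.3, 0.7, 1.5]
lengths = [2, 3, 1]
paths = [
    [5, 2, 7, 7],
    [3, 1, 4, 9],   # entries past the path length are unspecified
    [6, 8, 8, 8],
]

prediction = most_probable_prediction(neg_log_probs, lengths, paths, blank_index=0)
```

Here the second path has the smallest negative log probability, so its first three labels are kept and the remaining position is filled with the blank index.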

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

class ipu_tensorflow_addons.keras.layers.Dropout(*args, **kwargs)
Dropout layer optimized for running on the IPU.
The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the expected sum is unchanged.
Note that the Dropout layer only applies when training is set to True, so no values are dropped during inference.
 Parameters
rate – Float between 0 and 1. Fraction of the input units to drop.
noise_shape – 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input.
seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple) containing a pair of 32-bit integers that will be used to seed the random number generator that generates the dropout mask.

build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
Perform dropout.
 Parameters
inputs – Input tensor (of any rank).
training – Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (doing nothing).
 Returns
In training mode, a tensor which has some nodes set to zero, as randomly selected based on other parameters. In inference mode, a tensor that is identical to the input tensor.
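The scaling rule can be checked numerically. The following NumPy sketch implements the general inverted-dropout technique, not the IPU kernel: surviving inputs are divided by (1 - rate) in training mode and the input is returned unchanged in inference mode. The function name and the use of a NumPy generator for the mask are choices made for this example.

```python
import numpy as np


def dropout(inputs, rate, training, rng):
    """Inverted dropout: zero a fraction `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return inputs
    keep_mask = rng.random(inputs.shape) >= rate
    return np.where(keep_mask, inputs / (1.0 - rate), 0.0)


rng = np.random.default_rng(42)
x = np.ones(100_000)

train_out = dropout(x, rate=0.25, training=True, rng=rng)
infer_out = dropout(x, rate=0.25, training=False, rng=rng)

# The mean stays close to the input mean because survivors are scaled up.
print(train_out.mean())
```

With rate=0.25 every surviving element becomes 1/0.75, and roughly a quarter of the elements are zero, so the expected sum matches the sum of the input.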

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

class ipu_tensorflow_addons.keras.layers.EffectiveTransformer(*args, **kwargs)
EffectiveTransformer is an implementation of a multi-head attention network.
Transformers of this type are described in the following paper: https://arxiv.org/abs/1706.03762
This implementation is optimised for batches of padded sequences, by dynamically compressing the input sequences for computationally expensive parts of the algorithm. This compression is achieved by the removal of padding for those computations that do not rely on a 1:1 relationship between the input to and from sequences.
For an input sequence tensor X of shape [B, N], the algorithm will process X in compressed chunks of shape [B', N], where B' is less than or equal to max_batch_size. The algorithm output, however, keeps the input batch size B. Although each chunk of compressed sequences has shape [B', N], the parameter sequences_per_iter determines the upper limit on the total number of compressed sequences to be processed for each B'-sized batch.
The distinction between max_batch_size and sequences_per_iter is of importance when a corpus of data has much variance in the length of its sequences (the degree of padding in each row). max_batch_size determines the upper bound on the number of rows of data to be processed in each chunk and sequences_per_iter determines the upper bound on the number of sequences to be compressed into each chunk. This distinction is important to consider because a chunk of compressed sequences will need to be decompressed at points in the algorithm. This can incur large memory usage if the number of compressed sequences to process is high and the uncompressed shape is unbounded.
sequences_per_iter must be less than or equal to max_batch_size.
 Parameters
output_layer_size – The number of output units.
max_batch_size – The upper limit to which additional sequences will be compressed into a chunk of data. This is the maximum size of the uncompressed sequence tensor.
use_scale – If True, learn a scale parameter.
num_attention_heads – The number of attention heads to use for multi-head attention.
attention_head_size – The size of each attention head.
sequences_per_iter – The number of full-sequence equivalents to process in each data chunk. Must be less than or equal to max_batch_size.
qkv_activation – The activation function to use for the Query, Key and Value embeddings.
attention_dropout_prob – Dropout probability applied to the attention distribution.
output_activation – The activation function to use for the layer output.
output_dropout_prob – Dropout probability applied to the layer output.
layer_norm_output – Whether to apply Layer Normalisation to the output.
embedding_initializer – The initializer to be used for the QKV embeddings. Default is ‘glorot_uniform’.
embedding_bias_initializer – The initializer to be used for QKV embeddings additive bias. Defaults to ‘zeros’.
output_initializer – The initializer for the output layer. Defaults to ‘glorot_uniform’.
output_bias_initializer – The initializer for the output layer additive bias. Defaults to ‘zeros’.
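The idea of compressing padded sequences can be shown with a toy example. The NumPy sketch below is not the layer's algorithm, which this reference does not spell out; it only shows how removing padding lets the tokens of a [B, N] batch fit into fewer rows of width N, and how the original layout is restored afterwards. The helpers compress and decompress are hypothetical.

```python
import numpy as np


def compress(padded, lengths):
    """Pack the non-padding tokens of `padded` ([B, N]) into full rows."""
    _, n = padded.shape
    tokens = np.concatenate([row[:length] for row, length in zip(padded, lengths)])
    rows = -(-len(tokens) // n)              # ceiling division
    packed = np.zeros(rows * n, dtype=padded.dtype)
    packed[:len(tokens)] = tokens
    return packed.reshape(rows, n)


def decompress(packed, lengths, n):
    """Restore the original [B, N] layout, re-inserting zero padding."""
    flat = packed.reshape(-1)
    out = np.zeros((len(lengths), n), dtype=packed.dtype)
    start = 0
    for i, length in enumerate(lengths):
        out[i, :length] = flat[start:start + length]
        start += length
    return out


lengths = [2, 4, 1, 1]
padded = np.array([[1, 2, 0, 0],
                   [3, 4, 5, 6],
                   [7, 0, 0, 0],
                   [8, 0, 0, 0]])

packed = compress(padded, lengths)           # 8 tokens fit in 2 rows of width 4
restored = decompress(packed, lengths, n=4)
```

In this example four padded rows hold only two full-sequence equivalents of data, which is the kind of saving that grows with the amount of padding in a batch. Decompression needs the full [B, N] shape again, which is why the number of sequences packed into one chunk is bounded.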

build(input_shapes)
Builds an EffectiveTransformer Layer with respect to the provided input_shapes.
 Parameters
input_shapes – A list of Tensor shapes of length four or five. In the case of four elements provided in input_shapes, the Tensor shapes should correspond to the from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths Tensor arguments to the call method. In the case of five Tensor shapes provided in input_shapes, the fifth element should correspond to the optional q_mask input to the call method.

call(inputs, training=True)
Performs a single forward pass of an EffectiveTransformer layer instance.
As input, two sequence sets and their respective sequence lengths are required. The two sets of sequences are referred to as the ‘from’ sequences and ‘to’ sequences, referring to the computed attention relationship. In the case that the ‘from’ and ‘to’ sequence sets are equal, this layer will compute self-attention.
 Parameters
inputs – A list of input Tensors, of at least four elements containing from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths. Additionally, a fifth tensor q_mask for attention head masking can be provided.
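The ‘from’ and ‘to’ relationship can be illustrated with plain scaled dot-product attention. The NumPy sketch below is a single-head version with no learned projections, dropout or q_mask, so it is far simpler than the layer itself; the function name and argument layout are assumptions of this example. Sequence lengths are used to give padded ‘to’ positions zero weight.

```python
import numpy as np


def attention(from_seq, to_seq, to_lengths):
    """Single-head scaled dot-product attention with length masking.

    from_seq: [B, F, D], to_seq: [B, T, D], to_lengths: [B].
    Queries are from_seq; keys and values are to_seq.
    """
    d = from_seq.shape[-1]
    scores = from_seq @ np.swapaxes(to_seq, 1, 2) / np.sqrt(d)   # [B, F, T]
    # Mask out padded 'to' positions so they receive zero weight.
    positions = np.arange(to_seq.shape[1])
    valid = positions[None, None, :] < np.asarray(to_lengths)[:, None, None]
    scores = np.where(valid, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ to_seq, weights


rng = np.random.default_rng(0)
seqs = rng.normal(size=(2, 4, 8))
lengths = [4, 2]

# Passing the same sequences as 'from' and 'to' gives self-attention.
output, weights = attention(seqs, seqs, lengths)
```

The second batch entry has length 2, so the last two columns of its attention weights are exactly zero.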

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

class ipu_tensorflow_addons.keras.layers.Embedding(*args, **kwargs)
This is designed to be a replacement for the typical use cases of the Keras Embedding layer.
 Parameters
input_dim – int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
output_dim – int >= 0. Dimension of the dense embedding.
embeddings_initializer – Initializer for the embeddings matrix.
serialization_factor – If greater than 1, the embedding lookup will be broken up into serialization_factor smaller lookups, serialized along the 0th dimension. This option should not be used unless the parameters of this layer are used by another layer. If this is the case, then serialization can reduce the maximum memory at the cost of extra computation.
 Input shape:
2D tensor with shape: (batch_size, input_length).
 Output shape:
3D tensor with shape: (batch_size, input_length, output_dim).

call(inputs, training=None)
Perform an embedding lookup.
 Parameters
inputs – An integer tensor of indices into the embedding variable.
 Returns
The entries of the embedding tensor corresponding to the ids tensor indices.
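An embedding lookup selects rows of the embeddings matrix, and a serialized lookup can reach the same result through several smaller lookups over slices of dimension 0. The NumPy sketch below shows one way to do that; it is an illustration of the idea, not the IPU implementation, and the requirement that input_dim be divisible by serialization_factor is an assumption of this sketch.

```python
import numpy as np


def embedding_lookup(table, ids):
    """Plain lookup: one row of `table` per index."""
    return table[ids]


def serialized_lookup(table, ids, serialization_factor):
    """Same result, built from smaller lookups over slices of dimension 0."""
    input_dim, output_dim = table.shape
    assert input_dim % serialization_factor == 0
    slice_size = input_dim // serialization_factor
    result = np.zeros(ids.shape + (output_dim,), dtype=table.dtype)
    for start in range(0, input_dim, slice_size):
        in_slice = (ids >= start) & (ids < start + slice_size)
        # Clamp out-of-slice indices to a valid row, then mask them away.
        local_ids = np.where(in_slice, ids - start, 0)
        partial = table[start:start + slice_size][local_ids]
        result += partial * in_slice[..., None]
    return result


input_dim, output_dim = 8, 3
table = np.arange(input_dim * output_dim, dtype=np.float32)
table = table.reshape(input_dim, output_dim)
ids = np.array([[0, 7, 3], [5, 5, 1]])       # (batch_size, input_length)

full = embedding_lookup(table, ids)          # (batch_size, input_length, output_dim)
split = serialized_lookup(table, ids, serialization_factor=4)
```

Each partial lookup touches only a slice of the table, which is where the memory saving comes from; the extra masking and summation are the additional computation mentioned in the parameter description.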

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

ipu_tensorflow_addons.keras.layers.GRU
alias of ipu_tensorflow_addons.keras.layers.rnn.PopnnGRU

ipu_tensorflow_addons.keras.layers.GroupNorm
alias of ipu_tensorflow_addons.keras.layers.normalization.GroupNormalization

class ipu_tensorflow_addons.keras.layers.GroupNormalization(*args, **kwargs)
Group normalization layer optimized for running on the IPU.
This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.
Group normalization is described in this paper: https://arxiv.org/abs/1803.08494.
 Parameters
groups – The number of groups to use in the normalization.
channels_axis – Integer, the axis that should be normalized (typically the features axis).
center – If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale – If True, multiply by gamma. If False, gamma is not used.
epsilon – Small float added to variance to avoid dividing by zero.
beta_initializer – Initializer for the beta weight.
gamma_initializer – Initializer for the gamma weight.
strided_channel_grouping – Selects whether to group the channels dimension for group normalisation with a stride between channels. This makes the PopLibs implementation more efficient but is unconventional. Among other things, this means that pretrained weights cannot be used unless they were produced with this unconventional implementation.
trainable – Boolean, if True the variables will be marked as trainable.

build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
 Parameters
inputs – The tensor to apply normalization to.
 Returns
The tensor resulting from applying normalization.
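Group normalization can be written in a few lines of NumPy, which also makes the effect of channel grouping visible. The sketch below assumes a channels-last tensor and an arbitrary epsilon; in particular, the strided branch assumes that channel k is assigned to group k % groups, which is one plausible reading of "a stride between channels" and may differ from the PopLibs implementation.

```python
import numpy as np


def group_norm(x, groups, gamma, beta, epsilon=1e-5, strided=False):
    """Group normalization over a channels-last tensor of shape [N, ..., C]."""
    n, c = x.shape[0], x.shape[-1]
    assert c % groups == 0
    size = c // groups
    flat = x.reshape(n, -1, c)                       # [N, S, C]
    if strided:
        # Channel k belongs to group k % groups (channels taken with a stride).
        grouped = flat.reshape(n, -1, size, groups).transpose(0, 1, 3, 2)
    else:
        # Channel k belongs to group k // size (contiguous channels).
        grouped = flat.reshape(n, -1, groups, size)
    # Statistics are computed per sample and per group.
    mean = grouped.mean(axis=(1, 3), keepdims=True)
    var = grouped.var(axis=(1, 3), keepdims=True)
    normed = (grouped - mean) / np.sqrt(var + epsilon)
    if strided:
        normed = normed.transpose(0, 1, 3, 2)
    return normed.reshape(x.shape) * gamma + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 4, 8))
gamma, beta = np.ones(8), np.zeros(8)

contiguous = group_norm(x, groups=2, gamma=gamma, beta=beta)
strided = group_norm(x, groups=2, gamma=gamma, beta=beta, strided=True)
```

The two groupings normalize different sets of channels together, so their outputs differ, which is why weights trained with one grouping do not transfer to the other. No running statistics are kept: the mean and variance come from the input itself on every call.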

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

ipu_tensorflow_addons.keras.layers.InstanceNorm
alias of ipu_tensorflow_addons.keras.layers.normalization.InstanceNormalization

class ipu_tensorflow_addons.keras.layers.InstanceNormalization(*args, **kwargs)
Instance normalization layer optimized for use on the IPU.
This layer is used like the standard Keras InstanceNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.
Instance normalization is described in this paper: https://arxiv.org/abs/1607.08022.
 Parameters
channels_axis – Integer, the axis that should be normalized (typically the features axis).
center – If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale – If True, multiply by gamma. If False, gamma is not used.
epsilon – Small float added to variance to avoid dividing by zero.
beta_initializer – Initializer for the beta weight.
gamma_initializer – Initializer for the gamma weight.

build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
 Parameters
inputs – The tensor to apply normalization to.
 Returns
The tensor resulting from applying normalization.

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

ipu_tensorflow_addons.keras.layers.LSTM
alias of ipu_tensorflow_addons.keras.layers.rnn.PopnnLSTM

ipu_tensorflow_addons.keras.layers.LayerNorm
alias of ipu_tensorflow_addons.keras.layers.normalization.LayerNormalization

class ipu_tensorflow_addons.keras.layers.LayerNormalization(*args, **kwargs)
Layer normalization layer optimized for use on the IPU.
This layer is used like the standard Keras LayerNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.
Layer normalization is described in this paper: https://arxiv.org/abs/1607.06450.
 Parameters
axis – Integer or List/Tuple. The axis that should be normalized (typically the features axis).
epsilon – Small float added to variance to avoid dividing by zero.
center – If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
beta_initializer – Initializer for the beta weight.
gamma_initializer – Initializer for the gamma weight.
beta_regularizer – Optional regularizer for the beta weight.
gamma_regularizer – Optional regularizer for the gamma weight.
beta_constraint – Optional constraint for the beta weight.
gamma_constraint – Optional constraint for the gamma weight.
trainable – Boolean, if True the variables will be marked as trainable.

build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)
 Parameters
inputs – The tensor to apply normalization to.
 Returns
The tensor resulting from applying normalization.
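The normalization, together with the roles of gamma, beta and epsilon, can be sketched in NumPy. This follows the standard layer normalization formula rather than the IPU kernel; the default epsilon is an arbitrary choice for the example, and the axis argument accepts either an integer or a tuple, mirroring the axis parameter described above.

```python
import numpy as np


def layer_norm(x, gamma, beta, axis=-1, epsilon=1e-5):
    """Normalize `x` over `axis`, then scale by gamma and shift by beta."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    # epsilon keeps the division finite when the variance is zero.
    return (x - mean) / np.sqrt(var + epsilon) * gamma + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 3, 8))

# Normalize over the features axis only.
last_axis = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8), axis=-1)

# Normalize over the last two axes together.
two_axes = layer_norm(x, gamma=np.ones((3, 8)), beta=np.zeros((3, 8)), axis=(1, 2))
```

Setting gamma to ones and beta to zeros corresponds to scale=False and center=False. As with the other IPU normalization layers described here, the statistics come from the current input and nothing is accumulated between calls.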

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

class ipu_tensorflow_addons.keras.layers.PopnnGRU(*args, **kwargs)
Popnn implementation of the Gated Recurrent Unit (Cho et al. 2014), optimized for the IPU.
There are two variants of the GRU implementation. The default is based on v3 and has the reset gate applied to the hidden state before matrix multiplication. The other is based on the original version and has the order reversed. The first one is the default behaviour for this implementation, however the Keras equivalent can use the second variant. To use this variant, set reset_after=True.
Note that the Keras equivalent uses hard_sigmoid as the default recurrent activation, whereas this version uses sigmoid as the default.
 Parameters
units – Positive integer, dimensionality of the output space.
activation – Activation function to use. Default: hyperbolic tangent (“tanh”). Accepted activations: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).
recurrent_activation – Activation function to use for the recurrent step. Default: sigmoid (“sigmoid”). Accepted activations: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).
use_bias – Boolean. If True then the layer will use a bias vector.
kernel_initializer – Initializer for the kernel weights matrix, used for the linear transformation of the inputs.
recurrent_initializer – Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state.
bias_initializer – Initializer for the bias vector.
kernel_regularizer – Unsupported - Regularizer function applied to the kernel weights matrix.
recurrent_regularizer – Unsupported - Regularizer function applied to the recurrent_kernel weights matrix.
bias_regularizer – Unsupported - Regularizer function applied to the bias vector.
activity_regularizer – Unsupported - Regularizer function applied to the output of the layer (its “activation”).
kernel_constraint – Unsupported - Constraint function applied to the kernel weights matrix.
recurrent_constraint – Unsupported - Constraint function applied to the recurrent_kernel weights matrix.
bias_constraint – Unsupported - Constraint function applied to the bias vector.
dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
dropout_seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple), representing the random seed that will be used to create the distribution for dropout.
recurrent_dropout – Unsupported - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
implementation – Unsupported - Implementation mode.
return_sequences – Boolean. If True then the full output sequence will be returned. If False then only the last output in the output sequence will be returned.
return_state – Boolean. If True then the last state will be returned in addition to the last output or output sequence.
go_backwards – Unsupported - Boolean (default False). If True process the input sequence backwards and return the reversed sequence.
stateful – Boolean (default False). If True the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.
unroll – Unsupported - Boolean (default False). If True the network will be unrolled, else a symbolic loop will be used. Unrolling can speed up an RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.
time_major – The shape format of the inputs and outputs tensors. If True the shape of the inputs and outputs will be (timesteps, batch, ...), otherwise the shape will be (batch, timesteps, ...). Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.
seed – A Python integer. Used for the kernel_initializer and recurrent_initializer.
partials_dtype – The type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”.
available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.
available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.
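The difference between the two reset_after conventions lies only in where the reset gate is applied relative to the recurrent matrix multiplication. The NumPy sketch below implements one GRU step using the textbook equations with tanh and sigmoid activations; the weight layout, the omission of recurrent biases and the function names are simplifications chosen for this example and do not reflect how Popnn stores its weights.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x, h, W, U, b, reset_after):
    """One GRU step. W: [3, input, units], U: [3, units, units], b: [3, units].

    Index 0, 1, 2 selects the update gate, reset gate and candidate weights.
    Recurrent biases are omitted to keep the sketch short.
    """
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])          # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])          # reset gate
    if reset_after:
        recurrent = r * (h @ U[2])                   # reset applied after the matmul
    else:
        recurrent = (r * h) @ U[2]                   # reset applied before the matmul
    candidate = np.tanh(x @ W[2] + recurrent + b[2])
    return z * h + (1.0 - z) * candidate


rng = np.random.default_rng(0)
batch, input_size, units = 2, 3, 4
W = rng.normal(size=(3, input_size, units))
U = rng.normal(size=(3, units, units))
b = np.zeros((3, units))
x = rng.normal(size=(batch, input_size))
h = rng.normal(size=(batch, units))

h_before = gru_step(x, h, W, U, b, reset_after=False)
h_after = gru_step(x, h, W, U, b, reset_after=True)
```

The two variants give different results for the same weights whenever the previous state is non-zero, so weights trained with one convention should not be loaded into a layer configured with the other.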

build(input_shape)
Create variables of the PopnnGRU layer.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
 Parameters
input_shape – a TensorShape object with 3 dimensions.
 Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, mask=None, training=None, initial_state=None)
Runs the forward step for the GRU layer.
 Parameters
inputs – 3D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is True, the shape should be [seq_len, batch_size, input_size].
training – Set to False to use the layer in inference mode. This is only relevant if dropout or recurrent_dropout is used.
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
 Returns
If return_sequences is True then the GRU layer returns a tensor of shape [batch_size, seq_len, num_units], otherwise it returns a tensor of shape [batch_size, num_units]. If return_state is set to True then the output state of the last cell is also returned.
 Raises
ValueError – if initial_state is not valid.
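The input and output shapes for the different settings can be traced with a small recurrent loop. The NumPy sketch below uses a trivial tanh cell in place of the GRU so that only the shape handling is shown; run_rnn and step are hypothetical helpers, and the transposes in the batch-major path are the ones that time_major=True avoids.

```python
import numpy as np


def run_rnn(inputs, step, initial_state, time_major=False, return_sequences=False):
    """Apply `step` over the time axis of a 3D input tensor."""
    if not time_major:
        inputs = np.transpose(inputs, (1, 0, 2))     # to [seq_len, batch, input]
    state = initial_state
    outputs = []
    for x_t in inputs:                               # iterate over timesteps
        state = step(x_t, state)
        outputs.append(state)
    if not return_sequences:
        return state                                 # [batch, units]
    stacked = np.stack(outputs)                      # [seq_len, batch, units]
    return stacked if time_major else np.transpose(stacked, (1, 0, 2))


rng = np.random.default_rng(0)
batch_size, seq_len, input_size, num_units = 2, 5, 3, 4
kernel = rng.normal(size=(input_size, num_units))


def step(x_t, h):
    # Stand-in cell, not a GRU: mixes the input with the previous state.
    return np.tanh(x_t @ kernel + h)


x = rng.normal(size=(batch_size, seq_len, input_size))
h0 = np.zeros((batch_size, num_units))               # default initial state

sequences = run_rnn(x, step, h0, return_sequences=True)
last = run_rnn(x, step, h0)
time_major_out = run_rnn(np.transpose(x, (1, 0, 2)), step, h0,
                         time_major=True, return_sequences=True)
```

With return_sequences=False only the final output is returned, which equals the last timestep of the full sequence. The time-major result holds the same values with the first two axes swapped.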

get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.

state_shape(batch_size)
Shape of Popnn GRU state.
State shape is [batch_size, num_units].
 Parameters
batch_size – an int
 Returns
A Python array.

class ipu_tensorflow_addons.keras.layers.PopnnLSTM(*args, **kwargs)
Popnn implementation of Long Short-Term Memory layer (Hochreiter and Schmidhuber 1997), optimized for the IPU.
Note that the Keras equivalent uses hard_sigmoid as the default recurrent activation, whereas this version uses sigmoid as the default.
 Parameters
units – Positive integer, dimensionality of the output space.
activation – Activation function to use. Default: hyperbolic tangent (“tanh”). Accepted activations: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass
None
, no activation is applied (ie. “linear” activation:a(x) = x
).recurrent_activation – Activation function to use for the recurrent step. Default: sigmoid (“sigmoid”). Accepted activations: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass
None
, no activation is applied (ie. “linear” activation:a(x) = x
).use_bias – Boolean. If True then the layer will use a bias vector.
kernel_initializer – Initializer for the
kernel
weights matrix, used for the linear transformation of the inputs.recurrent_initializer – Initializer for the
recurrent_kernel
weights matrix, used for the linear transformation of the recurrent state.bias_initializer – Initializer for the bias vector.
unit_forget_bias – Boolean. If True then add 1 to the bias of the forget gate at initialization. Setting it to true will also force
bias_initializer="zeros"
. This is recommended in Jozefowicz et al.kernel_regularizer – Unsupported  Regularizer function applied to the
kernel
weights matrix.recurrent_regularizer – Unsupported  Regularizer function applied to the
recurrent_kernel
weights matrix.bias_regularizer – Unsupported  Regularizer function applied to the bias vector.
activity_regularizer – Unsupported  Regularizer function applied to the output of the layer (its “activation”).
kernel_constraint – Unsupported  Constraint function applied to the
kernel
weights matrix.recurrent_constraint – Unsupported  Constraint function applied to the
recurrent_kernel
weights matrix.bias_constraint – Unsupported  Constraint function applied to the bias vector.
dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
dropout_seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple), representing the random seed that will be used to create the distribution for dropout.
recurrent_dropout – Unsupported - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
implementation – Unsupported - Implementation mode.
return_sequences – Boolean. If True then the full output sequence will be returned. If False then only the last output in the output sequence will be returned.
return_state – Boolean. If True then the last state will be returned in addition to the last output or output sequence.
go_backwards – Unsupported - Boolean (default False). If True process the input sequence backwards and return the reversed sequence.
stateful – Boolean (default False). If True the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.
unroll – Unsupported - Boolean (default False). If True the network will be unrolled, else a symbolic loop will be used. Unrolling can speed up an RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.
seed – A Python integer. Used for the kernel_initializer and recurrent_initializer.
partials_dtype – The type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
time_major – The shape format of the inputs and outputs tensors. If True the shape of the inputs and outputs will be (timesteps, batch, ...), otherwise the shape will be (batch, timesteps, ...). Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.
available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of 1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.
available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of 1. or None indicates that the default in Popnn should be used.
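The difference in default recurrent activation noted for this layer matters when comparing its outputs with a stock Keras LSTM. As a rough illustration only, the sketch below compares the logistic sigmoid with hard_sigmoid, assuming the classic Keras definition max(0, min(1, 0.2 * x + 0.5)). The helper names are hypothetical and are not part of ipu_tensorflow_addons.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, the default recurrent activation of this layer."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x: float) -> float:
    """Piecewise-linear approximation (assumed classic Keras definition)."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))


def max_gap(points) -> float:
    """Largest absolute difference between the two activations on `points`."""
    return max(abs(sigmoid(x) - hard_sigmoid(x)) for x in points)


# Sample the range [-6, 6] in steps of 0.5 (an arbitrary illustrative grid).
grid = [i * 0.5 for i in range(-12, 13)]
gap = max_gap(grid)
```

The two functions agree at 0 but diverge elsewhere, so weights trained with one activation should not be expected to reproduce outputs exactly under the other.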

build
(input_shape)¶ Create variables of the PopnnLSTM layer.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
 Parameters
input_shape – a TensorShape object with 3 dimensions.
 Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call
(inputs, mask=None, training=None, initial_state=None)¶ Runs the forward step for the LSTM layer.
 Parameters
inputs – 3D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is set to True then the shape should be [seq_len, batch_size, input_size].
training – Set to False to use the layer in inference mode. This is only relevant if dropout or recurrent_dropout is set.
initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
 Returns
If return_sequences is True the LSTM layer returns a tensor of shape [batch_size, seq_len, num_units], otherwise it returns a tensor of shape [batch_size, num_units]. If return_state is True then the output state of the last cell is also returned.
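The shape rules for the main output can be summarised in a small helper. This is a sketch only: lstm_output_shape is a hypothetical name, the defaults shown for return_sequences and time_major are assumptions, and the time-major case is taken from the time_major parameter description rather than from the Returns text.

```python
from typing import Tuple


def lstm_output_shape(batch_size: int, seq_len: int, num_units: int,
                      return_sequences: bool = False,
                      time_major: bool = False) -> Tuple[int, ...]:
    """Shape of the main output tensor, following the rules described above."""
    if not return_sequences:
        # Only the last output in the sequence is returned.
        return (batch_size, num_units)
    if time_major:
        # Time-major layout: (timesteps, batch, ...).
        return (seq_len, batch_size, num_units)
    # Batch-major layout: (batch, timesteps, ...).
    return (batch_size, seq_len, num_units)


last_only = lstm_output_shape(8, 20, 64)
full_sequence = lstm_output_shape(8, 20, 64, return_sequences=True)
full_sequence_tm = lstm_output_shape(8, 20, 64, return_sequences=True,
                                     time_major=True)
```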

get_config
()¶ Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.
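The round trip described above can be illustrated with a toy class. This is a minimal sketch of the general Keras get_config / from_config pattern. ToyLayer is hypothetical and stands in for a real layer, so the example runs without TensorFlow.

```python
import json


class ToyLayer:
    """Hypothetical stand-in for a layer; not part of ipu_tensorflow_addons."""

    def __init__(self, units, use_bias=True):
        self.units = units
        self.use_bias = use_bias
        self.weights = None  # trained state, deliberately kept out of the config

    def get_config(self):
        # Only constructor arguments are stored: no weights, no connectivity
        # information and no class name.
        return {"units": self.units, "use_bias": self.use_bias}

    @classmethod
    def from_config(cls, config):
        # Rebuild an untrained instance from the stored arguments.
        return cls(**config)


original = ToyLayer(units=32, use_bias=False)
original.weights = [0.1, 0.2]  # pretend the layer has been trained

config = original.get_config()
restored_config = json.loads(json.dumps(config))  # the config is serializable
clone = ToyLayer.from_config(restored_config)
```

The clone has the same configuration as the original but none of its trained weights, which is the behaviour the description above refers to.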

state_shape
(batch_size)¶ Shape of Popnn LSTM states.
Shape is a 2-element tuple. Each element is [batch_size, num_units].
 Parameters
batch_size – an int
 Returns
A tuple of Python arrays.

class
ipu_tensorflow_addons.keras.layers.
RecomputationCheckpoint
(*args, **kwargs)¶ Layer for checkpointing values in a computational pipeline stage. When recomputation is enabled, these values will not be recomputed and they will be stored in memory instead.
This layer can reduce memory liveness peaks when using recomputation if there are too many activations which need to be recomputed before the backpropagation operations can be executed.
This layer should be used with the RecomputationMode.RecomputeAndBackpropagateInterleaved pipelining recomputation mode.
Note that this layer has no effect when used with the RecomputationMode.RecomputeThenBackpropagate pipelining recomputation mode.

call
(inputs, **kwargs)¶ Checkpoint the input tensors.
 Parameters
inputs – A tensor or a structure of tensors which should be checkpointed.
 Returns
A tensor or a structure of tensors which matches the shape and type of inputs.

get_config
()¶ Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.


class
ipu_tensorflow_addons.keras.layers.
SerialDense
(*args, **kwargs)¶ Densely-connected NN layer where the dot operation is serialized to reduce the size of this operation.
Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Given the input tensor with shape [..., m, k] and kernel tensor with shape [k, n], the matrix multiplication can be serialized as follows:
Along the m dimension of input, by setting serialization_dimension to input_columns.
Along the k dimension of input and kernel, by setting serialization_dimension to input_rows_kernel_columns.
Along the n dimension of kernel, by setting serialization_dimension to kernel_rows.
Example:
# as first layer in a sequential model:
model = Sequential()
model.add(SerialDense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(SerialDense(32))
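To make the three serialization options concrete, the sketch below splits a small matrix product along each dimension in pure Python and checks that the pieces recombine to the same result. It only illustrates the arithmetic and is not how the layer is implemented on the IPU. matmul, chunks and serial_matmul are hypothetical helpers, and only the dimension names are taken from the description above.

```python
def matmul(a, b):
    """Plain product of a (m x k) and b (k x n), both nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def chunks(size, factor):
    """Yield (start, stop) bounds that split `size` into `factor` equal parts."""
    if size % factor != 0:
        raise ValueError("serialization_factor must divide the serialized dimension")
    step = size // factor
    for start in range(0, size, step):
        yield start, start + step


def serial_matmul(a, b, factor, dimension):
    """Compute a @ b as `factor` smaller products along one dimension."""
    m, k, n = len(a), len(b), len(b[0])
    if dimension == "input_columns":
        # Slice the m dimension of the input; stack the partial outputs.
        out = []
        for lo, hi in chunks(m, factor):
            out.extend(matmul(a[lo:hi], b))
        return out
    if dimension == "input_rows_kernel_columns":
        # Slice the shared k dimension; sum the partial products.
        out = [[0] * n for _ in range(m)]
        for lo, hi in chunks(k, factor):
            part = matmul([row[lo:hi] for row in a], b[lo:hi])
            out = [[x + y for x, y in zip(r, p)] for r, p in zip(out, part)]
        return out
    if dimension == "kernel_rows":
        # Slice the n dimension of the kernel; concatenate output columns.
        out = [[] for _ in range(m)]
        for lo, hi in chunks(n, factor):
            part = matmul(a, [row[lo:hi] for row in b])
            for row, extra in zip(out, part):
                row.extend(extra)
        return out
    raise ValueError("unknown serialization_dimension: %r" % dimension)


a = [[1, 2, 3, 4], [5, 6, 7, 8]]      # input, m=2, k=4
b = [[1, 0], [0, 1], [2, 1], [1, 2]]  # kernel, k=4, n=2
full = matmul(a, b)
```

Each option trades one large multiplication for several smaller ones, which is why the serialization factor has to divide the chosen dimension.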
 Parameters
units – Positive integer, dimensionality of the output space.
serialization_factor – An integer indicating the number of smaller matrix multiplies this operation is broken up into. Must divide the dimension along which the operation is serialized.
serialization_dimension – A string, must be one of input_columns, input_rows_kernel_columns or kernel_rows. Indicates the dimension along which the operation is serialized.
activation – Activation function to use. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
use_bias – Boolean, whether the layer uses a bias vector.
kernel_initializer – Initializer for the kernel weights matrix.
bias_initializer – Initializer for the bias vector.
kernel_regularizer – Regularizer function applied to the kernel weights matrix.
bias_regularizer – Regularizer function applied to the bias vector.
activity_regularizer – Regularizer function applied to the output of the layer (its “activation”).
kernel_constraint – Constraint function applied to the kernel weights matrix.
bias_constraint – Constraint function applied to the bias vector.
 Input shape:
N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).
 Output shape:
N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).

build
(input_shape)¶ Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
 Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call
(inputs, **kwargs)¶
 Parameters
inputs – The tensor to apply the dense weights to.
 Returns
The tensor resulting from applying the dense weights.

compute_output_shape
(input_shape)¶ Computes the output shape of the layer.
If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.
 Parameters
input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
 Returns
An output shape tuple.

get_config
()¶ Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
 Returns
Python dictionary.
23.2. Keras Optimizers¶
23.2.1. Keras optimizers made for IPU TensorFlow¶

class
ipu_tensorflow_addons.keras.optimizers.
AdamIpuOptimizer
(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', m_dtype=None, v_dtype=None, vhat_dtype=None, debiasing=True, **kwargs)¶ Optimizer that implements the Adam algorithm.
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to the paper Adam: A Method for Stochastic Optimization. Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters”.
For AMSGrad see On The Convergence Of Adam And Beyond. Reddi et al., 58
This optimizer allows setting the optimizer state precisions independently and differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.

__init__
(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', m_dtype=None, v_dtype=None, vhat_dtype=None, debiasing=True, **kwargs)¶
 Parameters
learning_rate – A Tensor or a floating point value. The learning rate.
beta_1 – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta_2 – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon – A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
amsgrad – boolean. Whether to apply AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and beyond”.
name – Optional name for the operations created when applying gradients. Defaults to “Adam”.
m_dtype – Dtype of the optimizer state m. If None, will set to dtypes of the corresponding vars.
v_dtype – Dtype of the optimizer state v. If None, will set to dtypes of the corresponding vars.
vhat_dtype – Dtype of the optimizer state vhat. If None, will set to dtypes of the corresponding vars.
debiasing – Debias m and v to correct for initialisation.
**kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm is clip gradients by norm; clipvalue is clip gradients by value; decay is included for backward compatibility to allow time inverse decay of learning rate; lr is included for backward compatibility, recommended to use learning_rate instead.
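As a rough illustration of what the moment estimates and the debiasing flag do, here is a scalar Adam step in plain Python. It follows the “epsilon hat” formulation referred to above and assumes that debiasing=False simply skips the bias-correction factor. adam_step is a hypothetical helper, it does not cover amsgrad, and it is not the optimizer's actual implementation.

```python
import math


def adam_step(param, grad, m, v, step, learning_rate=0.001, beta_1=0.9,
              beta_2=0.999, epsilon=1e-07, debiasing=True):
    """One scalar Adam update; returns the new (param, m, v).

    `step` counts updates starting from 1.
    """
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad * grad
    if debiasing:
        # Correct for m and v having been initialised at zero.
        lr_t = learning_rate * math.sqrt(1.0 - beta_2 ** step) / (1.0 - beta_1 ** step)
    else:
        # Assumption: without debiasing the raw learning rate is used.
        lr_t = learning_rate
    param = param - lr_t * m / (math.sqrt(v) + epsilon)
    return param, m, v


# First update from zero-initialised state, with and without debiasing.
p_debiased, m1, v1 = adam_step(1.0, 0.5, 0.0, 0.0, step=1)
p_raw, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, step=1, debiasing=False)
```

With debiasing the first step has a magnitude close to the learning rate. Without it, the zero-initialised moments skew the early updates.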

get_config
()¶ Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
 Returns
Python dictionary.


class
ipu_tensorflow_addons.keras.optimizers.
LAMBIpuOptimizer
(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-06, weight_decay_rate=0.0, exclude_from_weight_decay=None, exclude_from_layer_adaptation=None, name='LAMB', debiasing=True, m_dtype=None, v_dtype=None, weight_norm_clip=None, optimizer_compute_precisions=(tf.float32, tf.float32), **kwargs)¶ Optimizer that implements Layer-wise Adaptive Moments (LAMB). See the paper Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
This optimizer allows setting the optimizer state precisions independently and differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.

__init__
(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-06, weight_decay_rate=0.0, exclude_from_weight_decay=None, exclude_from_layer_adaptation=None, name='LAMB', debiasing=True, m_dtype=None, v_dtype=None, weight_norm_clip=None, optimizer_compute_precisions=(tf.float32, tf.float32), **kwargs)¶
 Parameters
learning_rate – A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
beta_1 – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta_2 – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon – A small constant for numerical stability.
weight_decay_rate – Weight decay rate.
exclude_from_weight_decay – List of regex patterns of variables excluded from weight decay. Variables whose names contain a substring matching the pattern will be excluded.
exclude_from_layer_adaptation – List of regex patterns of variables excluded from layer adaptation. Variables whose names contain a substring matching the pattern will be excluded.
name – Optional name for the operations created when applying gradients. Defaults to “LAMB”.
debiasing – Debias m and v to correct for initialisation.
m_dtype – Dtype of the optimizer state m. If None, will set to dtypes of the vars.
v_dtype – Dtype of the optimizer state v. If None, will set to dtypes of the vars.
weight_norm_clip – Clip the weight norms by this value.
optimizer_compute_precisions – Tuple of TF dtypes that determine what precision the stages of optimizer compute are done in. This optimizer has two stages of compute precision so the tuple must be of size 2.
**kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm is clip gradients by norm; clipvalue is clip gradients by value; decay is included for backward compatibility to allow time inverse decay of learning rate; lr is included for backward compatibility, recommended to use learning_rate instead.
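The exclusion rule for exclude_from_weight_decay and exclude_from_layer_adaptation (a variable is excluded when its name contains a substring matching any pattern) can be sketched with re.search. The helper, variable names and patterns below are hypothetical and chosen only for illustration.

```python
import re


def is_excluded(var_name, patterns):
    """True if any regex in `patterns` matches a substring of `var_name`."""
    if not patterns:
        # None or an empty list excludes nothing.
        return False
    return any(re.search(p, var_name) is not None for p in patterns)


# Hypothetical variable names and patterns.
names = ["dense/kernel:0", "dense/bias:0", "LayerNorm/gamma:0"]
patterns = ["bias", "LayerNorm"]

# Variables that would still receive weight decay under these patterns.
decayed = [n for n in names if not is_excluded(n, patterns)]
```

Because the match is a substring search rather than a full match, a short pattern such as "bias" excludes every variable whose name contains it.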

get_config
()¶ Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
 Returns
Python dictionary.


class
ipu_tensorflow_addons.keras.optimizers.
SGDIpuOptimizer
(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', momentum_accum_dtype=None, **kwargs)¶ Optimizer that implements the gradient descent algorithm with momentum.
This optimizer allows setting the optimizer state precisions differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.
For nesterov=True, see Sutskever et al., 2013.

__init__
(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', momentum_accum_dtype=None, **kwargs)¶
 Parameters
learning_rate – A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
momentum – A float value or a constant float tensor that accelerates gradient descent in the relevant direction and dampens oscillations.
nesterov – Boolean. Whether to apply Nesterov momentum. Defaults to False.
name – Optional name prefix for the operations created when applying gradients. Defaults to "SGD".
momentum_accum_dtype – Dtype of the momentum accumulation optimizer state. If None, will set to dtypes of the corresponding vars.
**kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm is clip gradients by norm; clipvalue is clip gradients by value; decay is included for backward compatibility to allow time inverse decay of learning rate; lr is included for backward compatibility, recommended to use learning_rate instead.
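The effect of momentum and nesterov can be sketched with a scalar update. The rule below is the one documented for the standard Keras SGD optimizer. It is an assumption that this optimizer follows the same rule, and sgd_step is a hypothetical helper rather than part of the API.

```python
def sgd_step(param, grad, velocity, learning_rate=0.01, momentum=0.0,
             nesterov=False):
    """One scalar SGD update with momentum; returns (param, velocity)."""
    # The velocity accumulates past gradients, scaled by `momentum`.
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        # Nesterov: look ahead along the updated velocity.
        param = param + momentum * velocity - learning_rate * grad
    else:
        param = param + velocity
    return param, velocity


plain, vel_plain = sgd_step(1.0, 2.0, 0.0, learning_rate=0.1, momentum=0.9)
looked_ahead, vel_nesterov = sgd_step(1.0, 2.0, 0.0, learning_rate=0.1,
                                      momentum=0.9, nesterov=True)
```

From the same starting point the Nesterov variant takes a larger first step, because it adds the momentum-scaled velocity on top of the gradient step.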

get_config
()¶ Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
 Returns
Python dictionary.

23.3. Legacy TensorFlow Layers¶
23.3.1. TensorFlow layers made for IPU TensorFlow¶

class
ipu_tensorflow_addons.v1.layers.
PopnnAUGRU
(*args, **kwargs)¶ XLA compatible, time-major Popnn implementation of an AUGRU layer.
Below is a typical workflow:
with tf.Graph().as_default():
  augru = PopnnAUGRU(num_units, ...)
  # seq_len and attention_score are required by call(); see below.
  outputs, output_state = augru(inputs, seq_len, attention_score, initial_state, training=True)

__init__
(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, reset_after=False, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)¶ Creates a PopnnAUGRU model from model spec.
 Parameters
num_units – the number of units within the RNN model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weight (default is Glorot uniform initializer).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
bias_initializer – starting value to initialize the bias (default is all zeros).
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of 1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.
available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of 1. or None indicates that the default in Popnn should be used.

call
(inputs, seq_len, attention_score, initial_state=None, training=True, time_major=True)¶ Runs the forward step for the AUGRU model.
 Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
seq_len – 1D tensor with the sequence length of samples in each batch.
attention_score – The output of the attention layer, the score of samples in each batch, shaped [batch_size, max_seq_len].
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Whether this operation will be used in training or inference.
time_major – whether the time dimension is the first dimension.
 Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: The output state of the last cell.
 Raises
ValueError – if initial_state is not valid.


class
ipu_tensorflow_addons.v1.layers.
PopnnDynamicGRU
(*args, **kwargs)¶ XLA compatible, time-major Popnn implementation of a GRU layer, with a sequence length input.
Below is a typical workflow:
with tf.Graph().as_default():
  gru = PopnnDynamicGRU(num_units, ...)
  outputs, output_state = gru(inputs, seq_len, initial_state, training=True)

__init__
(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, reset_after=False, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)¶ Creates a PopnnDynamicGRU model from model spec.
 Parameters
num_units – the number of units within the RNN model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weight (default is Glorot uniform initializer).
bias_initializer – starting value to initialize the bias (default is all zeros).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”. Leave as default (False) to match the behaviour of the standard TensorFlow GRU.
available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of 1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.
available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of 1. or None indicates that the default in Popnn should be used.

call
(inputs, seq_len, initial_state=None, training=True, time_major=True)¶ Runs the forward step for the DynamicGRU model.
 Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size] if time_major is True (the default), otherwise [batch_size, time_len, input_size].
seq_len – 1D tensor with the sequence length of samples in each batch.
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Whether this operation will be used in training or inference.
time_major – Whether the time dimension is the first dimension.
 Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: The output state of the last cell.
 Raises
ValueError – if initial_state is not valid.


class
ipu_tensorflow_addons.v1.layers.
PopnnDynamicLSTM
(*args, **kwargs)¶ 
call
(inputs, seq_len, initial_state=None, training=True)¶ Runs the forward step for the LSTM model.
 Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
seq_len – 1D tensor with the sequence length of samples in each batch.
initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Set to False to use the LSTM model in inference mode.
 Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: An LSTMStateTuple of the same shape and structure as initial_state.
 Raises
ValueError – if initial_state is not valid.


class
ipu_tensorflow_addons.v1.layers.
PopnnGRU
(*args, **kwargs)¶ XLA compatible, time-major Popnn implementation of a GRU layer.
Below is a typical workflow:
with tf.Graph().as_default():
  gru = PopnnGRU(num_units, ...)
  outputs, output_state = gru(inputs, initial_state, training=True)

__init__
(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, reset_after=False, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)¶ Creates a PopnnGRU model from model spec.
 Parameters
num_units – the number of units within the GRU model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weights (default is Glorot uniform initializer).
bias_initializer – starting value to initialize the bias (default is all zeros).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”. Leave as default (False) to match the behaviour of the standard TensorFlow GRU.
available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of 1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.
available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of 1. or None indicates that the default in Popnn should be used.
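The two reset_after conventions differ only in where the reset gate is applied relative to the recurrent matrix multiplication. The sketch below shows that difference for the candidate state of a single GRU step in plain Python. Biases are omitted and the names (matvec, candidate_state, x_proj) are hypothetical, so this illustrates the idea rather than the Popnn implementation.

```python
import math


def matvec(mat, vec):
    """Matrix-vector product for nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def candidate_state(x_proj, h, r, recurrent_kernel, reset_after=False):
    """Candidate state of one GRU step (biases omitted).

    x_proj is the already-projected input, h the previous state and
    r the reset gate, all plain lists of the same length.
    """
    if reset_after:
        # "after": multiply by the recurrent kernel, then apply the gate.
        recurrent = [ri * ui for ri, ui in zip(r, matvec(recurrent_kernel, h))]
    else:
        # "before": gate the previous state, then multiply.
        recurrent = matvec(recurrent_kernel, [ri * hi for ri, hi in zip(r, h)])
    return [math.tanh(a + b) for a, b in zip(x_proj, recurrent)]


kernel = [[1.0, 2.0], [3.0, 4.0]]
h_prev = [1.0, 1.0]
gate = [0.5, 1.0]
before = candidate_state([0.0, 0.0], h_prev, gate, kernel, reset_after=False)
after = candidate_state([0.0, 0.0], h_prev, gate, kernel, reset_after=True)
```

The two results differ whenever the reset gate is not uniform, so weights trained under one convention are generally not interchangeable with the other.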

build
(input_shape)¶ Create variables of the PopnnGRU.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
 Parameters
input_shape – a TensorShape object with 3 dimensions.
 Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call
(inputs, initial_state=None, training=True)¶ Runs the forward step for the GRU model.
 Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Set to False to use the GRU model in inference mode.
 Returns
A tuple of output and output_state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: The output state of the last cell.
 Raises
ValueError – if initial_state is not valid.

state_shape
(batch_size)¶ Shape of Popnn GRU state.
State shape is [batch_size, num_units].
 Parameters
batch_size – an int
 Returns
A Python array.


class
ipu_tensorflow_addons.v1.layers.
PopnnLSTM
(*args, **kwargs)¶ XLA compatible, time-major Popnn implementation of an LSTM layer.
Below is a typical workflow:
with tf.Graph().as_default():
  lstm = PopnnLSTM(num_units, ...)
  outputs, output_states = lstm(inputs, initial_states, training=True)

__init__
(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)¶ Creates a PopnnLSTM model from model spec.
 Parameters
num_units – the number of units within the LSTM model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weights (default is Glorot uniform initializer).
bias_initializer – starting value to initialize the bias (default is all zeros).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of 1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.
available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of 1. or None indicates that the default in Popnn should be used.

build
(input_shape)¶ Create variables of the PopnnLSTM.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
 Parameters
input_shape – a TensorShape object with 3 dimensions.
 Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call
(inputs, initial_state=None, training=True)¶ Runs the forward step for the LSTM model.
 Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Set to False to use the LSTM model in inference mode.
 Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: An LSTMStateTuple of the same shape and structure as initial_state.
 Raises
ValueError – if initial_state is not valid.

state_shape
(batch_size)¶ Shape of Popnn LSTM states.
Shape is a 2-element tuple. Each element is [batch_size, num_units].
 Parameters
batch_size – an int
 Returns
a tuple of Python arrays.
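The state shapes described for the GRU and LSTM layers, and the zero initial state used when initial_state is not provided, can be sketched in plain Python. The helpers below are hypothetical and use nested lists in place of tensors.

```python
def gru_state_shape(batch_size, num_units):
    """A GRU carries a single state of shape [batch_size, num_units]."""
    return [batch_size, num_units]


def lstm_state_shape(batch_size, num_units):
    """An LSTM carries two states, each [batch_size, num_units]."""
    return ([batch_size, num_units], [batch_size, num_units])


def zeros(shape):
    """Nested list of zeros standing in for a zero-initialised state tensor."""
    rows, cols = shape
    return [[0.0] * cols for _ in range(rows)]


# Default initial states for a batch of 2 with 3 units.
gru_initial = zeros(gru_state_shape(2, 3))
lstm_initial = tuple(zeros(s) for s in lstm_state_shape(2, 3))
```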
