27. IPU TensorFlow Addons Python API
27.1. Keras Layers
27.1.1. Keras layers made for IPU TensorFlow
- class ipu_tensorflow_addons.keras.layers.AssumeEqualAcrossReplicas(*args, **kwargs)
Layer for marking values as equal across replicas, to try to prevent divergent control flow compilation errors.
Divergent control flow describes the situation where program flow differs among replicas. This happens when the value of a conditional is not the same across all replicas. This is a problem if the conditional body requires a cross-replica sync, as only some replicas will reach it. If this happens, the execution will hang as the operation waits for all replicas to sync.
To warn the user about this, Poplar checks for divergent control flow during compilation. However since the values of tensors are unknown at compilation time it can’t be certain whether a tensor will lead to divergent control flow or not.
assume_equal_across_replicas can be used to mark tensors which are equal across all replicas and, in doing so, prevents them causing divergency errors if used in a conditional.
- Parameters
inplace – A bool for controlling whether or not the given tensor(s) is copied or operated on inplace. This is needed when using AssumeEqualAcrossReplicas with tensor slices.
- call(inputs, **kwargs)
Prevent inputs from causing divergency errors by marking them equal across replicas.
- Parameters
inputs – Tensor to apply the layer to.
- Returns
The layer’s output tensor that will not cause divergency errors.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
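As an illustration only (the tensor name and model context below are assumptions, not part of the API above), a value that is known to hold the same value on every replica can be wrapped before it feeds any conditional that requires a cross-replica sync:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    # A scalar per-sample input that is assumed to be identical on every
    # replica (for example, a step counter broadcast to all replicas).
    counter = keras.Input(shape=(), dtype="int32", name="counter")

    # Mark it as replica-equal so Poplar does not flag divergent control
    # flow when it is later used in a conditional.
    safe_counter = ipu_layers.AssumeEqualAcrossReplicas()(counter)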
- class ipu_tensorflow_addons.keras.layers.CTCInferenceLayer(*args, **kwargs)
Computes CTC (Connectionist Temporal Classification) predictions using a beam search. This implementation is designed and optimized for the IPU and cannot be used with other systems.
- Parameters
blank_index – The class index to use for the blank label.
beam_width – The beam width to use in the beam search.
top_paths – The number of paths to return.
from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.
- call(data, data_length, **kwargs)
- Parameters
data – The data input [max_time, batch_size, num_classes] tensor.
data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.
- Returns
Label probabilities: Negative log probabilities that each path is correct.
Label lengths: Length of each path of predictions.
Decoded labels: The predictions made by the beam search.
- Return type
A tuple of values
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
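A minimal functional-model sketch, modelled on the CTCLoss example below; the shapes, beam settings and the Lambda transpose to time-major data are illustrative assumptions rather than part of the API:

    import numpy as np
    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    batch_size, max_time, num_classes = 4, 20, 5
    data = keras.Input((max_time, num_classes), batch_size=batch_size,
                       dtype=np.float32, name="data")
    data_length = keras.Input((), batch_size=batch_size, dtype=np.int32,
                              name="data_length")

    # The layer expects time-major data: [max_time, batch_size, num_classes].
    time_major = keras.layers.Lambda(
        lambda x: keras.backend.permute_dimensions(x, (1, 0, 2)))(data)

    probs, lengths, decoded = ipu_layers.CTCInferenceLayer(
        blank_index=0, beam_width=16, top_paths=1, from_logits=True)(
            time_major, data_length)

    model = keras.Model((data, data_length), (probs, lengths, decoded))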
- class ipu_tensorflow_addons.keras.layers.CTCLoss(*args, **kwargs)
Computes CTC (Connectionist Temporal Classification) loss. This implementation is designed and optimized for the IPU and cannot be used with other systems.
Usage:
    labels = tf.keras.layers.Input((max_label_length), batch_size=batch_size,
                                   dtype=np.int32, name="labels")
    data = tf.keras.layers.Input((max_time, num_classes),
                                 batch_size=batch_size, dtype=np.float32,
                                 name="data")
    label_length = tf.keras.layers.Input((), batch_size=batch_size,
                                         dtype=np.int32, name="label_length")
    logit_length = tf.keras.layers.Input((), batch_size=batch_size,
                                         dtype=np.int32, name="logit_length")

    dense_layer = tf.keras.layers.Dense(num_classes)
    transpose_layer = tf.keras.layers.Lambda(
        lambda x: keras.backend.permute_dimensions(x, (1, 0, 2)))
    loss = ipu_tensorflow_addons.keras.layers.CTCLoss(from_logits=True)

    x = dense_layer(data)
    x = transpose_layer(x)
    loss = loss(labels, x, label_length, logit_length)

    model = tf.keras.Model((labels, data, label_length, logit_length), loss)
    get_loss_output = lambda y_true, y_pred: y_pred
    model.compile('sgd', loss=get_loss_output)
- Parameters
blank_index – The class index to use for the blank label.
from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.
- call(labels, data, label_length, data_length, **kwargs)
- Parameters
labels – The labels input [batch_size, max_label_length] tensor.
data – The data input [max_time, batch_size, num_classes].
label_length – A tensor of shape [batch_size] containing the number of labels in each labels batch entry.
data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.
- Returns
The calculated loss.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- class ipu_tensorflow_addons.keras.layers.CTCPredictionsLayer(*args, **kwargs)
Computes CTC (Connectionist Temporal Classification) most probable predictions.
Returns the most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It also fills the values off the end with the blank index.
This layer performs a number of post-processing steps to create the predictions. If your model is close to its memory limit, it may be worth using the CTCInferenceLayer instead, streaming its results off the device and performing the post-processing on the CPU. However, this will create a larger stream copy that may also cost memory.
- Parameters
blank_index – The class index to use for the blank label.
beam_width – The beam width to use in the beam search.
top_paths – The number of paths to return.
from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.
- call(data, data_length, **kwargs)
- Parameters
data – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of log probabilities.
data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry. If not provided, the layer can only perform inference.
- Returns
The most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It fills the values off the end with the blank index.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- class ipu_tensorflow_addons.keras.layers.ConvertFromF8(*args, **kwargs)
Layer to convert from fp8 to another floating point datatype.
- __init__(dtype=tf.float16, **kwargs)
- Parameters
dtype – The dtype to convert to. Anything other than tf.half will incur an extra cast to tf.half first.
- call(inputs, **kwargs)
- Parameters
inputs – Output of a layer that returns an f8 tensor. More specifically, inputs should have the form [data, metadata].
- class ipu_tensorflow_addons.keras.layers.ConvertToF8(*args, **kwargs)
A wrapper layer around convert_to_f8.
This layer expects 2 inputs: (floating point) data and metadata, and returns the output of convert_to_f8 wrapped in a list instead of a QuarterTensor.
- call(data, metadata, **kwargs)
- Parameters
data – A floating point tensor.
metadata – Output of create_metadata().
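A hedged round-trip sketch (not taken from the original documentation): float16 data is converted to fp8 and back, reusing create_metadata and Format from tensorflow.python.ipu.ops.f8_ops as in the Dense examples below, and assuming eager execution inside an IPU strategy scope:

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.ipu.ops.f8_ops import create_metadata, Format
    from tensorflow.python.ipu.ipu_strategy import IPUStrategyV1
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    strategy = IPUStrategyV1()
    with strategy.scope():
      data = np.array([[1., 2.], [3., -1.]], dtype=np.float16)
      # ConvertToF8 returns a [data, metadata] list.
      f8 = ipu_layers.ConvertToF8()(data,
                                    metadata=create_metadata(Format.F143))
      # ConvertFromF8 expects that [data, metadata] list.
      back = ipu_layers.ConvertFromF8(dtype=tf.float16)(f8)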
- class ipu_tensorflow_addons.keras.layers.Dense(*args, **kwargs)
Dense layer with support for fp8 matrix multiplication.
The layer uses fp8 when it is passed a list as its input, in which case it expects this input to come from a ConvertToF8 layer.
Otherwise you should be able to pass most of the options available for the normal keras Dense layer.
Note: you should not pass the output of convert_to_f8 directly to this layer, as it returns a QuarterTensor instead of a list that this layer expects.
The default initializer for the kernel data is uniformly random in all possible uint8 values (other than the error value 0x80, which gets mapped to 0). You can change this by passing an initializer to the constructor through kernel_data_initializer. Keep in mind that this will need to return uint8 data, which you will most likely want to get from a call to convert_to_f8.
By default the kernel metadata will be set to a scale of 0 and F143 format. If you need a different kernel scale / format, you can achieve that by passing kernel_scale and kernel_format parameters to the constructor. The passed scale should be in the inclusive range [-32, 31], which multiplies the numeric value of the kernel by 2^kernel_scale. The format should be of type Format.
You can also use the get_weights / set_weights methods to manipulate the weights.
An example of using this layer eagerly:
    from tensorflow.python.ipu.ops.f8_ops import create_metadata, Format
    from keras.ipu.layers import Dense, ConvertToF8
    from tensorflow.python.ipu.ipu_strategy import IPUStrategyV1

    strategy = IPUStrategyV1()
    with strategy.scope():
      input_array = np.array([[1., 2.], [3., -1.]])
      f8_tensor = ConvertToF8()(input_array,
                                metadata=create_metadata(Format.F143))
      output = Dense(units=3)(f8_tensor)
An example of using this layer in a Functional model:
    from tensorflow.python.ipu.ops.f8_ops import create_metadata, Format
    from keras.ipu.layers import Dense, ConvertToF8
    from tensorflow.python.ipu.ipu_strategy import IPUStrategyV1

    strategy = IPUStrategyV1()
    with strategy.scope():
      inputs = Input(dtype="float16", shape=[2], batch_size=2)
      outputs = ConvertToF8()(inputs, metadata=create_metadata(Format.F143))
      outputs = Dense(units=3)(outputs)
      model = keras.Model(inputs, outputs)

      input_array = np.array([[1., 2.], [3., -1.]])
      model.predict(input_array, batch_size=2)
- Input shape:
N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim). In case of passing an fp8 input, the input should be the output of a ConvertToF8 layer.
- Output shape:
N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).
- Parameters
units – Positive integer, dimensionality of the output space.
kernel_format – Format of the kernel tensor when using fp8; one of Format.F143 or Format.F152. Format can be imported from tensorflow.python.ipu.ops.f8_ops.
kernel_scale – Scale for the kernel tensor when using fp8.
kernel_data_initializer – An initializer for the kernel data when using fp8.
- build(input_shape)
Stripped down version of keras.layers.Dense.build.
Defers weight construction to the call method so that we know if we’re dealing with fp8 matmul or not, depending on inputs.
- call(inputs)
Use fp8 MatMul if inputs is an instance of QuarterTensor. Otherwise behave like a normal Dense layer.
- class ipu_tensorflow_addons.keras.layers.Dropout(*args, **kwargs)
Dropout layer optimized for running on the IPU.
The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the expected sum is unchanged.
Note that the Dropout layer only applies when training is set to True, so no values are dropped during inference.
- Parameters
rate – Float between 0 and 1. Fraction of the input units to drop.
noise_shape – 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input.
seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple) containing a pair of 32-bit integers that will be used to seed the random number generator that generates the dropout mask.
- build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
- Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
- call(inputs, training=None)
Perform dropout.
- Parameters
inputs – Input tensor (of any rank).
training – Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (doing nothing).
- Returns
In training mode, a tensor which has some nodes set to zero, as randomly selected based on other parameters. In inference mode, a tensor that is identical to the input tensor.
- compute_output_shape(input_shape)
Computes the output shape of the layer.
If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.
- Parameters
input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
- Returns
An output shape tuple.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
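A brief sketch of the layer used as a drop-in replacement for keras.layers.Dropout; the rate and the seed pair shown are illustrative values, not requirements:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu"),
        # Drop 20% of units during training; the optional seed is a pair of
        # 32-bit integers.
        ipu_layers.Dropout(rate=0.2, seed=[42, 7]),
        keras.layers.Dense(10),
    ])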
- class ipu_tensorflow_addons.keras.layers.EffectiveTransformer(*args, **kwargs)
EffectiveTransformer is an implementation of a multihead attention network.
Transformers of this type are described in the following paper: https://arxiv.org/abs/1706.03762
This implementation is optimised for batches of padded sequences, by dynamically compressing the input sequences for computationally expensive parts of the algorithm. This compression is achieved by the removal of padding for those computations that do not rely on a 1:1 relationship between the input to and from sequences.
For an input sequence tensor X of shape [B, N], the algorithm will process X in compressed chunks of shape [B', N], where B' is less than or equal to max_batch_size. The algorithm output, however, keeps the input batch size B. Though the maximum batch size of compressed sequences to be processed in each chunk is of shape [B', N], the parameter sequences_per_iter determines the upper limit on the total number of compressed sequences to be processed for each B' sized batch.
The distinction between max_batch_size and sequences_per_iter is of importance when a corpus of data has much variance in the length of its sequences (the degree of padding in each row). max_batch_size determines the upper bound on the number of rows of data to be processed in each chunk and sequences_per_iter determines the upper bound on the number of sequences to be compressed into each chunk. This distinction is important to consider because a chunk of compressed sequences will need to be decompressed at points in the algorithm. This can incur large memory usage if the number of compressed sequences to process is high and the uncompressed shape unbounded. sequences_per_iter must be less than or equal to max_batch_size.
- Parameters
output_layer_size – The number of output units.
max_batch_size – The upper limit to which additional sequences will be compressed into a chunk of data. This is the maximum size of the uncompressed sequence tensor.
use_scale – If True, learn a scale parameter.
num_attention_heads – The number of attention heads to use for multihead attention.
attention_head_size – The size of each attention head.
sequences_per_iter – The number of full-sequence equivalents to process in each data chunk. Must be less than or equal to max_batch_size.
qkv_activation – The activation function to use for the Query, Key and Value embeddings.
attention_dropout_prob – Dropout probability applied to the attention distribution.
output_activation – The activation function to use for the layer output.
output_dropout_prob – Dropout probability applied to the layer output.
layer_norm_output – Whether to apply Layer Normalisation to the output.
embedding_initializer – The initializer to be used for the QKV embeddings. Default is ‘glorot_uniform’.
embedding_bias_initializer – The initializer to be used for QKV embeddings additive bias. Defaults to ‘zeros’.
output_initializer – The initializer for the output layer. Defaults to ‘glorot_uniform’.
output_bias_initializer – The initializer for the output layer additive bias. Defaults to ‘zeros’.
- build(input_shapes)
Builds an EffectiveTransformer layer with respect to the provided input_shapes.
- Parameters
input_shapes – A list of Tensor shapes of length four or five. In the case of four elements provided in input_shapes, the Tensor shapes should correspond to the from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths Tensor arguments to the call method. In the case of five Tensor shapes provided in input_shapes, the fifth element should correspond to the optional q_mask input to the call method.
- call(inputs, training=True)
Performs a single forward pass of an EffectiveTransformer layer instance.
As input, two sequence sets and their respective sequence lengths are required. The two sets of sequences are referred to as the ‘from’ sequences and ‘to’ sequences, referring to the computed attention relationship. In the case that the ‘from’ and ‘to’ sequence sets are equal, this layer will compute self-attention.
- Parameters
inputs – A list of input Tensors of at least four elements, containing from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths. Additionally, a fifth tensor q_mask for attention head masking can be provided.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
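As a rough self-attention sketch only: the shapes, hyper-parameter values and the decision to reuse the same sequences as both the ‘from’ and ‘to’ inputs are illustrative assumptions, not requirements of the layer:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    batch_size, seq_len, hidden = 8, 128, 64
    seqs = keras.Input((seq_len, hidden), batch_size=batch_size,
                       dtype="float32", name="sequences")
    seq_lens = keras.Input((), batch_size=batch_size, dtype="int32",
                           name="sequence_lengths")

    # Passing the same sequences as 'from' and 'to' computes self-attention.
    outputs = ipu_layers.EffectiveTransformer(
        output_layer_size=hidden,
        max_batch_size=batch_size,
        use_scale=False,
        num_attention_heads=4,
        attention_head_size=16,
        sequences_per_iter=batch_size)([seqs, seq_lens, seqs, seq_lens])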
- class ipu_tensorflow_addons.keras.layers.Embedding(*args, **kwargs)
This is designed to be a replacement for the typical use cases of the Keras Embedding layer.
- Parameters
input_dim – int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
output_dim – int >= 0. Dimension of the dense embedding.
embeddings_initializer – Initializer for the embeddings matrix.
serialization_factor – If greater than 1, the embedding lookup will be broken up into serialization_factor smaller lookups, serialized along the 0th dimension. This option should not be used unless the parameters of this layer are used by another layer. If this is the case, then serialization can reduce the maximum memory at the cost of extra computation.
- Input shape:
2D tensor with shape: (batch_size, input_length).
- Output shape:
3D tensor with shape: (batch_size, input_length, output_dim).
- build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
- Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
- call(inputs, inputs_are_sorted=False, training=None)
Perform an embedding lookup.
- Parameters
inputs – An integer tensor of indices into the embedding variable.
inputs_are_sorted – Set to True when indices are sorted; this allows Poplar to optimise for the case when the indices to look up are in order. Defaults to False.
- Returns
The entries of the embedding tensor corresponding to the ids tensor indices.
- compute_output_shape(input_shape)
Computes the output shape of the layer.
If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.
- Parameters
input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
- Returns
An output shape tuple.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
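A minimal sketch of the layer as a replacement for keras.layers.Embedding; the vocabulary and sequence sizes are illustrative:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    tokens = keras.Input(shape=(128,), dtype="int32", name="tokens")
    embedded = ipu_layers.Embedding(input_dim=10000, output_dim=64)(tokens)
    model = keras.Model(tokens, embedded)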
- ipu_tensorflow_addons.keras.layers.GroupNorm
alias of GroupNormalization
- class ipu_tensorflow_addons.keras.layers.GroupNormalization(*args, **kwargs)
Group normalization layer optimized for running on the IPU.
This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.
Group normalization is described in this paper: https://arxiv.org/abs/1803.08494.
- Parameters
groups – The number of groups to use in the normalization.
channels_axis – Integer, the axis that should be normalized (typically the features axis).
center – If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale – If True, multiply by gamma. If False, gamma is not used.
epsilon – Small float added to variance to avoid dividing by zero.
beta_initializer – Initializer for the beta weight.
gamma_initializer – Initializer for the gamma weight.
strided_channel_grouping – Selects whether to group the channels dimension for group normalisation with a stride between channels. This makes the PopLibs implementation more efficient but is unconventional. Among other things, this means that pre-trained weights cannot be used unless they were produced with this unconventional implementation.
trainable – Boolean, if True the variables will be marked as trainable.
- build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
- Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
- call(inputs, training=None)
- Parameters
inputs – The tensor to apply normalization to.
- Returns
The tensor resulting from applying normalization.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
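A short sketch of the layer after a convolution; the choice of four groups and the explicit channels axis for channels-last data are illustrative, and groups must divide the channel count:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    images = keras.Input(shape=(32, 32, 16))
    x = keras.layers.Conv2D(16, 3, padding="same")(images)
    # Normalise the 16 channels in 4 groups of 4; axis 3 is the channels
    # axis for NHWC data.
    x = ipu_layers.GroupNormalization(groups=4, channels_axis=3)(x)
    model = keras.Model(images, x)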
- ipu_tensorflow_addons.keras.layers.InstanceNorm
alias of InstanceNormalization
- class ipu_tensorflow_addons.keras.layers.InstanceNormalization(*args, **kwargs)
Instance normalization layer optimized for use on the IPU.
This layer is used like the standard Keras InstanceNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.
Instance normalization is described in this paper: https://arxiv.org/abs/1607.08022.
- Parameters
channels_axis – Integer, the axis that should be normalized (typically the features axis).
center – If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale – If True, multiply by gamma. If False, gamma is not used.
epsilon – Small float added to variance to avoid dividing by zero.
beta_initializer – Initializer for the beta weight.
gamma_initializer – Initializer for the gamma weight.
- build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
- Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
- call(inputs, training=None)
- Parameters
inputs – The tensor to apply normalization to.
- Returns
The tensor resulting from applying normalization.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- ipu_tensorflow_addons.keras.layers.LayerNorm
alias of LayerNormalization
- class ipu_tensorflow_addons.keras.layers.LayerNormalization(*args, **kwargs)
Layer normalization layer optimized for use on the IPU.
This layer is used like the standard Keras LayerNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.
Layer normalization is described in this paper: https://arxiv.org/abs/1607.06450.
- Parameters
axis – Integer or List/Tuple. The axis that should be normalized (typically the features axis).
epsilon – Small float added to variance to avoid dividing by zero.
center – If True, add offset of beta to normalized tensor. If False, beta is ignored.
scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
beta_initializer – Initializer for the beta weight.
gamma_initializer – Initializer for the gamma weight.
beta_regularizer – Optional regularizer for the beta weight.
gamma_regularizer – Optional regularizer for the gamma weight.
beta_constraint – Optional constraint for the beta weight.
gamma_constraint – Optional constraint for the gamma weight.
trainable – Boolean, if True the variables will be marked as trainable.
- build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
- Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
- call(inputs, training=None)
- Parameters
inputs – The tensor to apply normalization to.
- Returns
The tensor resulting from applying normalization.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- class ipu_tensorflow_addons.keras.layers.PopnnGRU(*args, **kwargs)
Popnn implementation of the Gated Recurrent Unit (Cho et al. 2014), optimized for the IPU.
There are two variants of the GRU implementation. The default is based on v3 and has the reset gate applied to the hidden state before matrix multiplication. The other is based on the original version and has the order reversed. The first one is the default behaviour for this implementation, however the Keras equivalent can use the second variant. To use this variant, set 'reset_after'=True.
Note that the Keras equivalent uses hard_sigmoid as the default recurrent activation, however this version uses sigmoid as the default.
- Parameters
units – Positive integer, dimensionality of the output space.
activation – Activation function to use. Default: hyperbolic tangent (“tanh”). Accepted activations: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).
recurrent_activation – Activation function to use for the recurrent step. Default: sigmoid (“sigmoid”). Accepted activations: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).
use_bias – Boolean. If True then the layer will use a bias vector.
kernel_initializer – Initializer for the kernel weights matrix, used for the linear transformation of the inputs.
recurrent_initializer – Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state.
bias_initializer – Initializer for the bias vector.
kernel_regularizer – Unsupported - Regularizer function applied to the kernel weights matrix.
recurrent_regularizer – Unsupported - Regularizer function applied to the recurrent_kernel weights matrix.
bias_regularizer – Unsupported - Regularizer function applied to the bias vector.
activity_regularizer – Unsupported - Regularizer function applied to the output of the layer (its “activation”).
kernel_constraint – Unsupported - Constraint function applied to the kernel weights matrix.
recurrent_constraint – Unsupported - Constraint function applied to the recurrent_kernel weights matrix.
bias_constraint – Unsupported - Constraint function applied to the bias vector.
dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
dropout_seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple), representing the random seed that will be used to create the distribution for dropout.
recurrent_dropout – Unsupported - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
implementation – Unsupported - Implementation mode.
return_sequences – Boolean. If True then the full output sequence will be returned. If False then only the last output in the output sequence will be returned.
return_state – Boolean. If True then the last state will be returned in addition to the last output or output sequence.
go_backwards – Unsupported - Boolean (default False). If True process the input sequence backwards and return the reversed sequence.
stateful – Boolean (default False). If True the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.
unroll – Unsupported - Boolean (default False). If True the network will be unrolled, else a symbolic loop will be used. Unrolling can speed-up a RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.
time_major – The shape format of the inputs and outputs tensors. If True the shape of the inputs and outputs will be (timesteps, batch, ...), otherwise the shape will be (batch, timesteps, ...). Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.
seed – A Python integer. Used for the kernel_initializer and recurrent_initializer.
partials_dtype – The type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before”, True = “after” (default).
options – A Python dictionary. Implementation or debug options for the forward GRU cell in PopLibs. See the GRU documentation in the PopLibs API reference for the full list of options.
options_bwd – A Python dictionary. Implementation or debug options for the backward GRU cell in PopLibs. See the GRU documentation in the PopLibs API reference for the full list of options.
- build(input_shape)
Create variables of the PopnnGRU layer.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
- Parameters
input_shape – a TensorShape object with 3 dimensions.
- Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.
- call(inputs, mask=None, training=None, initial_state=None)
Runs the forward step for the GRU layer.
- Parameters
inputs – 3D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is True, the shape should be [seq_len, batch_size, input_size].
training – Set to False to use the layer in inference mode. This is only relevant if dropout or recurrent_dropout is used.
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
- Returns
If return_sequences is True then the GRU layer returns a tensor of shape [batch_size, seq_len, num_units], otherwise it returns a tensor of shape [batch_size, num_units]. If return_state is set to True then the output state of the last cell is also returned.
- Raises
ValueError – if initial_state is not valid.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- state_shape(batch_size)
Shape of Popnn GRU state.
State shape is [batch_size, num_units].
- Parameters
batch_size – an int
- Returns
A Python array.
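A batch-major sketch returning only the final output; the input shape is illustrative, and building and compiling the model inside an IPU strategy scope (required to run on IPU hardware) is omitted here:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    # (timesteps, features) per sample; batch-major, since time_major
    # defaults to False.
    seqs = keras.Input(shape=(50, 32))
    last_output = ipu_layers.PopnnGRU(units=128)(seqs)
    model = keras.Model(seqs, last_output)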
- class ipu_tensorflow_addons.keras.layers.PopnnLSTM(*args, **kwargs)
Popnn implementation of Long Short-Term Memory layer (Hochreiter and Schmidhuber 1997), optimized for the IPU.
Note that the Keras equivalent uses hard_sigmoid as the default recurrent activation, however this version uses sigmoid as the default.
- Parameters
units – Positive integer, dimensionality of the output space.
activation – Activation function to use. Default: hyperbolic tangent (“tanh”). Accepted activations: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).
recurrent_activation – Activation function to use for the recurrent step. Default: sigmoid (“sigmoid”). Accepted activations: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).
use_bias – Boolean. If True then the layer will use a bias vector.
kernel_initializer – Initializer for the kernel weights matrix, used for the linear transformation of the inputs.
recurrent_initializer – Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state.
bias_initializer – Initializer for the bias vector.
unit_forget_bias – Boolean. If True then add 1 to the bias of the forget gate at initialization. Setting it to true will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.
kernel_regularizer – Unsupported - Regularizer function applied to the kernel weights matrix.
recurrent_regularizer – Unsupported - Regularizer function applied to the recurrent_kernel weights matrix.
bias_regularizer – Unsupported - Regularizer function applied to the bias vector.
activity_regularizer – Unsupported - Regularizer function applied to the output of the layer (its “activation”).
kernel_constraint – Unsupported - Constraint function applied to the kernel weights matrix.
recurrent_constraint – Unsupported - Constraint function applied to the recurrent_kernel weights matrix.
bias_constraint – Unsupported - Constraint function applied to the bias vector.
dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.
dropout_seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple), representing the random seed that will be used to create the distribution for dropout.
recurrent_dropout – Unsupported - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.
implementation – Unsupported - Implementation mode.
return_sequences – Boolean. If True then the full output sequence will be returned. If False then only the last output in the output sequence will be returned.
return_state – Boolean. If True then the last state will be returned in addition to the last output or output sequence.
go_backwards – Unsupported - Boolean (default False). If True process the input sequence backwards and return the reversed sequence.
stateful – Boolean (default False). If True the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.
unroll – Unsupported - Boolean (default False). If True the network will be unrolled, else a symbolic loop will be used. Unrolling can speed-up a RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.
seed – A Python integer. Used for the kernel_initializer and recurrent_initializer.
partials_dtype – The type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
time_major – The shape format of the inputs and outputs tensors. If True the shape of the inputs and outputs will be (timesteps, batch, ...), otherwise the shape will be (batch, timesteps, ...). Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.
options – A Python dictionary. Implementation or debug options for the forward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
options_bwd – A Python dictionary. Implementation or debug options for the backward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
- build(input_shape)
Create variables of the PopnnLSTM layer.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
- Parameters
input_shape – a TensorShape object with 3 dimensions.
- Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.
- call(inputs, mask=None, training=None, initial_state=None)
Runs the forward step for the LSTM layer.
- Parameters
inputs – 3D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is set to True then the shape should be [seq_len, batch_size, input_size].
training – Set to False to use the layer in inference mode. This is only relevant if dropout or recurrent_dropout is set.
initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
- Returns
If return_sequences is True the LSTM layer returns a tensor of shape [batch_size, seq_len, num_units], otherwise it returns a tensor of shape [batch_size, num_units]. If return_state is True then the output state of the last cell is also returned.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- state_shape(batch_size)
Shape of Popnn LSTM states.
Shape is a 2-element tuple. Each element is [batch_size, num_units].
- Parameters
batch_size – an int
- Returns
A tuple of Python arrays.
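A companion sketch for PopnnLSTM that keeps the full output sequence and applies input dropout (recurrent_dropout is unsupported); the shapes are illustrative and, as above, IPU strategy scope and compilation are omitted:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    seqs = keras.Input(shape=(50, 32))
    lstm_out = ipu_layers.PopnnLSTM(units=128, return_sequences=True,
                                    dropout=0.1)(seqs)
    model = keras.Model(seqs, lstm_out)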
- class ipu_tensorflow_addons.keras.layers.RecomputationCheckpoint(*args, **kwargs)
Layer for checkpointing values in a computational pipeline stage. When recomputation is enabled, these values will not be recomputed and they will be stored in memory instead.
This layer can reduce memory liveness peaks when using recomputation if there are too many activations which need to be recomputed before the backpropagation operations can be executed.
This layer should be used with the RecomputationMode.RecomputeAndBackpropagateInterleaved pipelining recomputation mode.
Note that this layer has no effect when used with the RecomputationMode.RecomputeThenBackpropagate pipelining recomputation mode.
- call(inputs, **kwargs)
Checkpoint the input tensors.
- Parameters
inputs – A tensor or a structure of tensors which should be checkpointed.
- Returns
A tensor or a structure of tensors which matches the shape and type of inputs.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
- class ipu_tensorflow_addons.keras.layers.SerialDense(*args, **kwargs)
Densely-connected NN layer where the dot operation is serialized to reduce the size of this operation.
Dense implements the operation: output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).
Given the input tensor with shape [..., m, k] and kernel tensor with shape [k, n], the matrix multiplication can be serialized as follows:
- Along the m dimension of input, by setting serialization_dimension to input_columns.
- Along the k dimension of input and kernel, by setting serialization_dimension to input_rows_kernel_columns.
- Along the n dimension of kernel, by setting serialization_dimension to kernel_rows.
Example:
    # as first layer in a sequential model:
    model = Sequential()
    model.add(SerialDense(32, input_shape=(16,)))
    # now the model will take as input arrays of shape (*, 16)
    # and output arrays of shape (*, 32)

    # after the first layer, you don't need to specify
    # the size of the input anymore:
    model.add(SerialDense(32))
- Parameters
units – Positive integer, dimensionality of the output space.
serialization_factor – An integer indicating the number of smaller matrix multiplies this operation is broken up into. Must divide the dimension along which the operation is serialized on.
serialization_dimension – A string, must be one of input_columns, input_rows_kernel_columns or kernel_rows. Indicates the dimension along which the operation is serialized.
activation – Activation function to use. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).
use_bias – Boolean, whether the layer uses a bias vector.
kernel_initializer – Initializer for the kernel weights matrix.
bias_initializer – Initializer for the bias vector.
kernel_regularizer – Regularizer function applied to the kernel weights matrix.
bias_regularizer – Regularizer function applied to the bias vector.
activity_regularizer – Regularizer function applied to the output of the layer (its “activation”).
kernel_constraint – Constraint function applied to the kernel weights matrix.
bias_constraint – Constraint function applied to the bias vector.
- Input shape:
N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).
- Output shape:
N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).
- build(input_shape)
Creates the variables of the layer (optional, for subclass implementers).
This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.
This is typically used to create the weights of Layer subclasses.
- Parameters
input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).
- call(inputs, **kwargs)
- Parameters
inputs – The tensor to apply the dense weights to.
- Returns
The tensor resulting from applying the dense weights.
- compute_output_shape(input_shape)
Computes the output shape of the layer.
If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.
- Parameters
input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.
- Returns
An output shape tuple.
- get_config()
Returns the config of the layer.
A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.
The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).
Note that get_config() does not guarantee to return a fresh copy of dict every time it is called. Callers should make a copy of the returned dict if they want to modify it.
- Returns
Python dictionary.
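A sketch that splits the matrix multiply into four smaller multiplies along the n (output) dimension of the kernel; the sizes are illustrative, and the serialization factor must divide the serialized dimension:

    from tensorflow import keras
    import ipu_tensorflow_addons.keras.layers as ipu_layers

    x = keras.Input(shape=(1024,))
    y = ipu_layers.SerialDense(units=2048,
                               serialization_factor=4,
                               serialization_dimension="kernel_rows")(x)
    model = keras.Model(x, y)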
27.2. Keras Optimizers
27.2.1. Keras optimizers made for IPU TensorFlow
- class ipu_tensorflow_addons.keras.optimizers.AdamIpuOptimizer(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', m_dtype=None, v_dtype=None, vhat_dtype=None, debiasing=True, **kwargs)
Optimizer that implements the Adam algorithm.
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to the paper Adam: A Method for Stochastic Optimization. Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters”.
For AMSGrad see On The Convergence Of Adam And Beyond. Reddi et al., 5-8
This optimizer allows setting the optimizer state precisions independently and differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.
- __init__(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', m_dtype=None, v_dtype=None, vhat_dtype=None, debiasing=True, **kwargs)
- Parameters
learning_rate – A Tensor or a floating point value. The learning rate.
beta_1 – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta_2 – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon – A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
amsgrad – boolean. Whether to apply AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and beyond”.
name – Optional name for the operations created when applying gradients. Defaults to “Adam”.
m_dtype – Dtype of the optimizer state m. If None, will set to dtypes of the corresponding vars.
v_dtype – Dtype of the optimizer state v. If None, will set to dtypes of the corresponding vars.
vhat_dtype – Dtype of the optimizer state vhat. If None, will set to dtypes of the corresponding vars.
debiasing – Debias m and v to correct for initialisation.
**kwargs – keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm is clip gradients by norm; clipvalue is clip gradients by value; decay is included for backward compatibility to allow time inverse decay of learning rate; lr is included for backward compatibility, recommended to use learning_rate instead.
- get_config()
Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
- Returns
Python dictionary.
- get_slot_dtype(var, slot_name)
Returns the slot dtype for var and slot_name.
- Parameters
var – a Variable object.
slot_name – name of the slot variable.
- Returns
The dtype of the slot variable.
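A hedged sketch of compiling a model with the Adam optimizer state held in float32 (useful when the model variables are float16); the model itself is a placeholder:

    import tensorflow as tf
    from tensorflow import keras
    from ipu_tensorflow_addons.keras.optimizers import AdamIpuOptimizer

    model = keras.Sequential([keras.layers.Dense(10)])
    # Keep the m, v and vhat states in float32 regardless of the variable
    # dtypes.
    opt = AdamIpuOptimizer(learning_rate=1e-3,
                           m_dtype=tf.float32,
                           v_dtype=tf.float32,
                           vhat_dtype=tf.float32)
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")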
- class ipu_tensorflow_addons.keras.optimizers.LAMBIpuOptimizer(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-06, weight_decay_rate=0.0, exclude_from_weight_decay=None, exclude_from_layer_adaptation=None, name='LAMB', debiasing=True, m_dtype=None, v_dtype=None, weight_norm_clip=None, optimizer_compute_precisions=(tf.float32, tf.float32), **kwargs)
Optimizer that implements the Layer-wise Adaptive Moments (LAMB). See paper Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
This optimizer allows setting the optimizer state precisions independently and differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.
- __init__(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-06, weight_decay_rate=0.0, exclude_from_weight_decay=None, exclude_from_layer_adaptation=None, name='LAMB', debiasing=True, m_dtype=None, v_dtype=None, weight_norm_clip=None, optimizer_compute_precisions=(tf.float32, tf.float32), **kwargs)
- Parameters
learning_rate – A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
beta_1 – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
beta_2 – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
epsilon – A small constant for numerical stability.
weight_decay_rate – weight decay rate.
exclude_from_weight_decay – List of regex patterns of variables excluded from weight decay. Variables whose name contain a substring matching the pattern will be excluded.
exclude_from_layer_adaptation – List of regex patterns of variables excluded from layer adaptation. Variables whose name contain a substring matching the pattern will be excluded.
name – Optional name for the operations created when applying gradients. Defaults to “LAMB”.
debiasing – Debias m and v to correct for initialisation.
m_dtype – Dtype of the optimizer state m. If None, will set to dtypes of the vars.
v_dtype – Dtype of the optimizer state v. If None, will set to dtypes of the vars.
weight_norm_clip – Clip the weight norms by this value.
optimizer_compute_precisions – Tuple of TF dtypes that determine what precision the stages of optimizer compute are done in. This optimizer has two stages of compute precision so the tuple must be of size 2.
**kwargs – keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm is clip gradients by norm; clipvalue is clip gradients by value; decay is included for backward compatibility to allow time inverse decay of learning rate; lr is included for backward compatibility, recommended to use learning_rate instead.
- get_config()
Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
- Returns
Python dictionary.
- get_slot_dtype(var, slot_name)
Returns the slot dtype for var and slot_name.
- Parameters
var – a Variable object.
slot_name – name of the slot variable.
- Returns
The dtype of the slot variable.
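A hedged sketch excluding biases and normalisation parameters from weight decay and layer adaptation, as is common for BERT-style training; the regex patterns are illustrative:

    from ipu_tensorflow_addons.keras.optimizers import LAMBIpuOptimizer

    opt = LAMBIpuOptimizer(
        learning_rate=2e-3,
        weight_decay_rate=0.01,
        # Variables whose names match these patterns are excluded from
        # decay and layer-wise scaling.
        exclude_from_weight_decay=["bias", "beta", "gamma"],
        exclude_from_layer_adaptation=["bias", "beta", "gamma"])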
- class ipu_tensorflow_addons.keras.optimizers.SGDIpuOptimizer(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', momentum_accum_dtype=None, **kwargs)
Optimizer that implements the gradient descent algorithm with momentum.
This optimizer allows setting the optimizer state precisions differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.
For nesterov=True, see Sutskever et al., 2013.
- __init__(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', momentum_accum_dtype=None, **kwargs)
- Parameters
learning_rate – A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.
momentum – A float value or a constant float tensor that accelerates gradient descent in the relevant direction and dampens oscillations.
nesterov – Boolean. Whether to apply Nesterov momentum. Defaults to False.
name – Optional name prefix for the operations created when applying gradients. Defaults to "SGD".
momentum_accum_dtype – Dtype of the momentum accumulation optimizer state. If None, will set to dtypes of the corresponding vars.
**kwargs – keyword arguments. Allowed to be {
clipnorm
,clipvalue
,lr
,decay
}.clipnorm
is clip gradients by norm;clipvalue
is clip gradients by value,decay
is included for backward compatibility to allow time inverse decay of learning rate.lr
is included for backward compatibility, recommended to uselearning_rate
instead.
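As a hedged illustration of momentum_accum_dtype, the sketch below keeps the momentum accumulator in fp32 while the model variables may be held in fp16; all values are illustrative and the module path is as documented above.

import tensorflow as tf
from ipu_tensorflow_addons.keras import optimizers

# Illustrative configuration: fp32 momentum accumulator to limit the
# numerical error introduced by low-precision optimizer state.
opt = optimizers.SGDIpuOptimizer(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True,
    momentum_accum_dtype=tf.float32)

# The optimizer is then used like any other Keras optimizer, for example:
# model.compile(optimizer=opt, loss="mse")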
- get_config()
Returns the config of the optimizer.
An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.
- Returns
Python dictionary.
- get_slot_dtype(var, slot_name)
Returns the slot dtype for var and slot_name.
- Parameters
var – a Variable object.
slot_name – name of the slot variable.
- Returns
The dtype of the slot variable.
27.3. Legacy TensorFlow Layers
27.3.1. TensorFlow layers made for IPU TensorFlow
- class ipu_tensorflow_addons.v1.layers.PopnnAUGRU(*args, **kwargs)
XLA compatible, time-major Popnn implementation of an AUGRU layer.
Below is a typical workflow:
with tf.Graph().as_default():
    augru = PopnnAUGRU(num_units, ...)
    outputs, output_state = augru(inputs, initial_state, training=True)
- __init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', return_state=True, name=None, reset_after=False, options=None, options_bwd=None)
Creates a PopnnAUGRU model from model spec.
- Parameters
num_units – the number of units within the RNN model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weight (default is Glorot uniform initializer).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
return_state – Boolean. Whether to return the last state in addition to the output. Default: True.
bias_initializer – starting value to initialize the bias (default is all zeros).
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
options – A Python dictionary. Implementation or debug options for the forward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
options_bwd – A Python dictionary. Implementation or debug options for the backward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
- call(inputs, seq_len, attention_score, initial_state=None, training=True, time_major=True)
Runs the forward step for the AUGRU model.
- Parameters
inputs – 3-D tensor with shape [time_len, batch_size, input_size].
seq_len – 1-D tensor with the sequence length of samples in each batch.
attention_score – The output of the attention layer, the score of samples in each batch, shaped [batch_size, max_seq_len].
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – whether this operation will be used in training or inference.
time_major – whether the time dimension is the first dimension.
- Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: The output state of the last cell.
- Raises
ValueError – if initial_state is not valid.
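A hedged sketch of the call() signature documented above, using constant tensors with illustrative shapes in place of real data and the scores a real attention layer would produce:

import tensorflow as tf
from ipu_tensorflow_addons.v1 import layers

# Illustrative shapes only.
time_len, batch_size, input_size, num_units = 20, 4, 16, 32

with tf.Graph().as_default():
    inputs = tf.zeros([time_len, batch_size, input_size], dtype=tf.float32)
    seq_len = tf.fill([batch_size], time_len)
    # Stand-in for the output of an attention layer.
    attention_score = tf.zeros([batch_size, time_len], dtype=tf.float32)

    augru = layers.PopnnAUGRU(num_units)
    outputs, output_state = augru(inputs, seq_len, attention_score,
                                  training=True)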
- class ipu_tensorflow_addons.v1.layers.PopnnDynamicGRU(*args, **kwargs)
XLA compatible, time-major Popnn implementation of a GRU layer, with a sequence length input.
Below is a typical workflow:
with tf.Graph().as_default():
    gru = PopnnDynamicGRU(num_units, ...)
    outputs, output_state = gru(inputs, seq_len, initial_state, training=True)
- __init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', return_state=True, name=None, reset_after=False, options=None, options_bwd=None)
Creates a PopnnDynamicGRU model from model spec.
- Parameters
num_units – the number of units within the RNN model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weight (default is Glorot uniform initializer).
bias_initializer – starting value to initialize the bias (default is all zeros).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
return_state – Boolean. Whether to return the last state in addition to the output. Default: True.
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”. Leave as default (False) to match the behaviour of the standard TensorFlow GRU.
options – A Python dictionary. Implementation or debug options for the forward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
options_bwd – A Python dictionary. Implementation or debug options for the backward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
- call(inputs, seq_len, initial_state=None, training=True, time_major=True)
Runs the forward step for the DynamicGRU model.
- Parameters
inputs – 3-D tensor with shape [batch_size, time_len, input_size].
seq_len – 1-D tensor with the sequence length of samples in each batch.
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – whether this operation will be used in training or inference.
time_major – whether the time dimension is the first dimension.
- Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: The output state of the last cell.
- Raises
ValueError – if initial_state is not valid.
- class ipu_tensorflow_addons.v1.layers.PopnnDynamicLSTM(*args, **kwargs)
- call(inputs, seq_len, initial_state=None, training=True)
Runs the forward step for the LSTM model.
- Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
seq_len – 1-D tensor with the sequence length of samples in each batch.
initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Set to False to use the LSTM model in inference mode.
- Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: An LSTMStateTuple of the same shape and structure as initial_state.
- Raises
ValueError – if initial_state is not valid.
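The PopnnDynamicLSTM entry above documents only call(). As a hedged sketch, mirroring the workflow shown for the other layers and assuming the constructor takes the same arguments as PopnnLSTM, usage might look like this:

import tensorflow as tf
from ipu_tensorflow_addons.v1 import layers

# Illustrative shapes only.
time_len, batch_size, input_size, num_units = 20, 4, 16, 32

with tf.Graph().as_default():
    inputs = tf.zeros([time_len, batch_size, input_size], dtype=tf.float32)
    seq_len = tf.fill([batch_size], time_len)

    # Constructor arguments assumed to match PopnnLSTM.
    lstm = layers.PopnnDynamicLSTM(num_units)
    outputs, output_state = lstm(inputs, seq_len, training=True)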
- class ipu_tensorflow_addons.v1.layers.PopnnGRU(*args, **kwargs)
XLA compatible, time-major Popnn implementation of a GRU layer.
Below is a typical workflow:
with tf.Graph().as_default():
    gru = PopnnGRU(num_units, ...)
    outputs, output_state = gru(inputs, initial_state, training=True)
- __init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', return_state=True, name=None, reset_after=False, options=None, options_bwd=None)
Creates a PopnnGRU model from model spec.
- Parameters
num_units – the number of units within the GRU model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weights (default is Glorot uniform initializer).
bias_initializer – starting value to initialize the bias (default is all zeros).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
return_state – Boolean. Whether to return the last state in addition to the output. Default: True.
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”. Leave as default (False) to match the behaviour of the standard TensorFlow GRU.
options – A Python dictionary. Implementation or debug options for the forward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
options_bwd – A Python dictionary. Implementation or debug options for the backward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
- build(input_shape)
Create variables of the PopnnGRU.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
- Parameters
input_shape – a TensorShape object with 3 dimensions.
- Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.
- call(inputs, initial_state=None, training=True)
Runs the forward step for the GRU model.
- Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Set to False to use the GRU model in inference mode.
- Returns
A tuple of output and output_state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: The output state of the last cell.
- Raises
ValueError – if initial_state is not valid.
- state_shape(batch_size)
Shape of Popnn GRU state.
State shape is [batch_size, num_units].
- Parameters
batch_size – an int
- Returns
A Python array.
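A hedged sketch of using state_shape() to build an explicit zero initial state for the first call; all shapes are illustrative:

import tensorflow as tf
from ipu_tensorflow_addons.v1 import layers

# Illustrative shapes only.
time_len, batch_size, input_size, num_units = 20, 4, 16, 32

with tf.Graph().as_default():
    inputs = tf.zeros([time_len, batch_size, input_size], dtype=tf.float32)

    gru = layers.PopnnGRU(num_units)
    # state_shape(batch_size) returns [batch_size, num_units], which can be
    # used to construct an explicit zero initial state.
    initial_state = tf.zeros(gru.state_shape(batch_size), dtype=tf.float32)
    outputs, output_state = gru(inputs, initial_state=initial_state,
                                training=True)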
- class ipu_tensorflow_addons.v1.layers.PopnnLSTM(*args, **kwargs)
XLA compatible, time-major Popnn implementation of an LSTM layer.
Below is a typical workflow:
with tf.Graph().as_default():
    lstm = PopnnLSTM(num_units, ...)
    outputs, output_states = lstm(inputs, initial_states, training=True)
- __init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', return_state=True, name=None, options=None, options_bwd=None)
Creates a PopnnLSTM model from model spec.
- Parameters
num_units – the number of units within the LSTM model.
dtype – tf.float16 or tf.float32
partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.
seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.
weights_initializer – starting value to initialize the weights (default is Glorot uniform initializer).
bias_initializer – starting value to initialize the bias (default is all zeros).
activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.
recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.
return_state – Boolean. Whether to return the last state in addition to the output. Default: True.
name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().
options – A Python dictionary. Implementation or debug options for the forward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
options_bwd – A Python dictionary. Implementation or debug options for the backward LSTM cell in PopLibs. See the LSTM documentation in the PopLibs API reference for the full list of options.
- build(input_shape)
Create variables of the PopnnLSTM.
It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.
- Parameters
input_shape – a TensorShape object with 3 dimensions.
- Raises
ValueError – if input_shape has wrong dimension or unknown 3rd dimension.
- call(inputs, initial_state=None, training=True)
Runs the forward step for the LSTM model.
- Parameters
inputs – 3D tensor with shape [time_len, batch_size, input_size].
initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.
training – Set to False to use the LSTM model in inference mode.
- Returns
A tuple of output and output state.
output: a tensor of shape [time_len, batch_size, num_units].
output_state: An LSTMStateTuple of the same shape and structure as initial_state.
- Raises
ValueError – if initial_state is not valid.
- state_shape(batch_size)
Shape of Popnn LSTM states.
Shape is a 2-element tuple. Each is [batch_size, num_units]
- Parameters
batch_size – an int
- Returns
a tuple of Python arrays.
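A hedged sketch of supplying an explicit initial state to PopnnLSTM as an LSTMStateTuple of the documented shape; the location of LSTMStateTuple may differ between TensorFlow versions, and all shapes are illustrative:

import tensorflow as tf
from ipu_tensorflow_addons.v1 import layers

# The LSTMStateTuple location may vary with the TensorFlow version in use.
LSTMStateTuple = tf.compat.v1.nn.rnn_cell.LSTMStateTuple

# Illustrative shapes only.
time_len, batch_size, input_size, num_units = 20, 4, 16, 32

with tf.Graph().as_default():
    inputs = tf.zeros([time_len, batch_size, input_size], dtype=tf.float32)

    lstm = layers.PopnnLSTM(num_units)
    # Each state tensor matches the documented shape [batch_size, num_units].
    initial_state = LSTMStateTuple(
        c=tf.zeros([batch_size, num_units]),
        h=tf.zeros([batch_size, num_units]))
    outputs, output_state = lstm(inputs, initial_state=initial_state,
                                 training=True)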