23. IPU TensorFlow Addons Python API

23.1. Keras Layers

23.1.1. Keras layers made for IPU TensorFlow

class ipu_tensorflow_addons.keras.layers.AssumeEqualAcrossReplicas(*args, **kwargs)

Layer for marking values as equal across replicas to try and prevent divergent control flow compilation errors.

Divergent control flow describes the situation where program flow differs among replicas. This happens when the value of a conditional is not the same across all replicas. This is a problem if the conditional body requires a cross-replica sync, as only some replicas will reach it. If this happens, the execution will hang as the operation waits for all replicas to sync.

To warn the user about this, Poplar checks for divergent control flow during compilation. However, since the values of tensors are unknown at compilation time, it can’t be certain whether a tensor will lead to divergent control flow or not. assume_equal_across_replicas can be used to mark tensors that are equal across all replicas, which prevents them from causing divergence errors when they are used in a conditional.
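The hang described above can be pictured without any IPU hardware. The following is a minimal pure-Python sketch, not the check that Poplar performs; the function names `replicas_reaching_sync` and `is_divergent` are hypothetical and exist only for this illustration.

```python
def replicas_reaching_sync(predicates):
    """Return the indices of the replicas that enter the conditional body,
    and so reach the cross-replica sync inside it."""
    return [index for index, taken in enumerate(predicates) if taken]


def is_divergent(predicates):
    """Control flow diverges when the replicas do not all agree."""
    return len(set(predicates)) > 1


# Every replica evaluates the conditional to the same value, so all of
# them reach the sync and it can complete.
uniform = [True, True, True, True]

# Replica 2 disagrees: only three of the four replicas reach the sync,
# which would leave those three waiting indefinitely.
mixed = [True, True, False, True]

print(is_divergent(uniform), replicas_reaching_sync(uniform))
print(is_divergent(mixed), replicas_reaching_sync(mixed))
```

Marking a tensor as equal across replicas amounts to asserting that the `uniform` case always holds for it.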

Parameters

inplace – A bool controlling whether the given tensor or tensors are copied or operated on in place. This is needed when using AssumeEqualAcrossReplicas with tensor slices.

call(inputs, **kwargs)

This is where the layer’s logic lives.

Note that the call() method in tf.keras is a little bit different from the Keras API. In the Keras API, you can pass masking support for layers as additional arguments, whereas tf.keras has the compute_mask() method to support masking.

Parameters
  • inputs – Input tensor, or list/tuple of input tensors.

  • **kwargs – Additional keyword arguments. Currently unused.

Returns

A tensor or list/tuple of tensors.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.
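The round trip that get_config() enables can be sketched as follows. `ToyLayer` is a hypothetical stand-in written for this example, not a class from this package; it mirrors the Keras convention of pairing get_config() with a from_config() class method.

```python
import json


class ToyLayer:
    """Hypothetical stand-in for a layer; not part of any library."""

    def __init__(self, units, use_bias=True):
        self.units = units
        self.use_bias = use_bias
        self.weights = None  # Created later; deliberately not in the config.

    def get_config(self):
        # Only constructor arguments go into the config, never weights.
        return {"units": self.units, "use_bias": self.use_bias}

    @classmethod
    def from_config(cls, config):
        return cls(**config)


original = ToyLayer(units=8, use_bias=False)
original.weights = [0.1, 0.2]

# The config is a plain dictionary, so it survives JSON serialization.
serialized = json.dumps(original.get_config())
clone = ToyLayer.from_config(json.loads(serialized))

print(clone.units, clone.use_bias, clone.weights)
```

The clone has the same configuration as the original but no weights, matching the "without its trained weights" wording above.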

class ipu_tensorflow_addons.keras.layers.CTCInferenceLayer(*args, **kwargs)

Computes CTC (Connectionist Temporal Classification) predictions using a beam search. This implementation is designed and optimized for the IPU and cannot be used with other systems.

Parameters
  • blank_index – The class index to use for the blank label.

  • beam_width – The beam width to use in the beam search.

  • top_paths – The number of paths to return.

  • from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.
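The difference between the two input forms accepted by from_logits can be sketched in plain Python. This assumes the usual relationship, in which log probabilities are the log-softmax of the logits over the class dimension; the reference does not show how the layer performs the conversion internally, and `log_softmax` below is a helper written for this example.

```python
import math


def log_softmax(logits):
    """Convert unnormalized logits for one timestep into log probabilities."""
    largest = max(logits)  # Subtracted first for numerical stability.
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return [v - log_sum for v in logits]


logits = [2.0, 1.0, 0.1]          # What from_logits=True would expect.
log_probs = log_softmax(logits)   # What from_logits=False would expect.

# Exponentiating log probabilities gives values that sum to one.
total = sum(math.exp(v) for v in log_probs)
print(log_probs, total)
```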

call(data, data_length, **kwargs)

Parameters
  • data – The data input [max_time, batch_size, num_classes] tensor.

  • data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.

Returns

  • Label probabilities: Negative log probabilities that each path is correct.

  • Label lengths: Length of each path of predictions.

  • Decoded labels: The predictions made by the beam search.

Return type

A tuple of values

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.CTCLoss(*args, **kwargs)

Computes CTC (Connectionist Temporal Classification) loss. This implementation is designed and optimized for the IPU and cannot be used with other systems.

Usage:

import numpy as np
import tensorflow as tf
import ipu_tensorflow_addons

# Example dimensions (illustrative values only).
batch_size, max_time, num_classes, max_label_length = 8, 50, 10, 20

labels = tf.keras.layers.Input((max_label_length,), batch_size=batch_size,
                               dtype=np.int32, name="labels")
data = tf.keras.layers.Input((max_time, num_classes),
                             batch_size=batch_size, dtype=np.float32,
                             name="data")
label_length = tf.keras.layers.Input((), batch_size=batch_size,
                                     dtype=np.int32, name="label_length")
logit_length = tf.keras.layers.Input((), batch_size=batch_size,
                                     dtype=np.int32, name="logit_length")

dense_layer = tf.keras.layers.Dense(num_classes)
# The loss layer expects time-major data: [max_time, batch_size, num_classes].
transpose_layer = tf.keras.layers.Lambda(
    lambda x: tf.keras.backend.permute_dimensions(x, (1, 0, 2)))
ctc_loss_layer = ipu_tensorflow_addons.keras.layers.CTCLoss(from_logits=True)

x = dense_layer(data)
x = transpose_layer(x)
loss = ctc_loss_layer(labels, x, label_length, logit_length)

model = tf.keras.Model((labels, data, label_length, logit_length), loss)
# The model output is already the loss, so pass it straight through.
get_loss_output = lambda y_true, y_pred: y_pred
model.compile('sgd', loss=get_loss_output)

Parameters
  • blank_index – The class index to use for the blank label.

  • from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.

call(labels, data, label_length, data_length, **kwargs)

Parameters
  • labels – The labels input [batch_size, max_label_length] tensor.

  • data – The data input [max_time, batch_size, num_classes].

  • label_length – A tensor of shape [batch_size] containing the number of labels in each labels batch entry.

  • data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry.

Returns

The calculated loss.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.CTCPredictionsLayer(*args, **kwargs)

Computes CTC (Connectionist Temporal Classification) most probable predictions.

Returns the most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It also fills the values off the end with the blank index.

This layer performs many post-processing steps to create the predictions. If your model is close to its memory limit, it may be worth using CTCInferenceLayer instead, streaming its results off the device and performing the processing on the CPU. However, this will create a larger stream copy that may also cost memory.

Parameters
  • blank_index – The class index to use for the blank label.

  • beam_width – The beam width to use in the beam search.

  • top_paths – The number of paths to return.

  • from_logits – Whether to expect the input data in the form of logits (True) or log probabilities (False). Default value is False.

call(data, data_length, **kwargs)

Parameters
  • data – The data input [max_time, batch_size, num_classes] tensor. The data is expected in the form of log probabilities.

  • data_length – A tensor of shape [batch_size] containing the number of timesteps in each data batch entry. If it is not provided, the layer can only perform inference.

Returns

The most probable predictions from the CTC decoder. This selects the most probable of all predictions returned. It fills the values off the end with the blank index.
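The selection and padding described above can be sketched on the host in plain Python. This is an illustration only: the nested-list layouts (one list of top_paths entries per batch item) and the function name `most_probable_predictions` are assumptions made for this example, not the layer's actual tensor layout.

```python
def most_probable_predictions(neg_log_probs, lengths, decoded, blank_index):
    """For each batch entry, keep the path with the smallest negative log
    probability and overwrite everything past its length with blank_index."""
    results = []
    for probs, path_lengths, paths in zip(neg_log_probs, lengths, decoded):
        # Smallest negative log probability means most probable path.
        best = min(range(len(probs)), key=lambda k: probs[k])
        path, length = paths[best], path_lengths[best]
        results.append(path[:length] + [blank_index] * (len(path) - length))
    return results


# Two batch entries, two candidate paths each, maximum path length four.
neg_log_probs = [[0.7, 0.2], [0.1, 1.5]]
lengths = [[3, 2], [1, 4]]
decoded = [
    [[5, 6, 7, 9], [4, 5, 9, 9]],
    [[3, 8, 8, 8], [1, 2, 3, 4]],
]

predictions = most_probable_predictions(
    neg_log_probs, lengths, decoded, blank_index=0)
print(predictions)  # [[4, 5, 0, 0], [3, 0, 0, 0]]
```

This is the kind of post-processing that could be moved to the CPU when the three outputs of CTCInferenceLayer are streamed off the device.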

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.Dropout(*args, **kwargs)

Dropout layer optimized for running on the IPU.

The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the expected sum is unchanged.

Note that the Dropout layer only applies when training is set to True, so no values are dropped during inference.
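The scaling rule above can be checked with a small pure-Python sketch. It illustrates the arithmetic only and is not the IPU implementation; the `dropout` function and its `rng` argument are written for this example, and rate is assumed to be strictly less than 1.

```python
import random


def dropout(inputs, rate, training, rng):
    """Zero each input with probability `rate` and scale the survivors by
    1 / (1 - rate). In inference mode the inputs pass through unchanged."""
    if not training:
        return list(inputs)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else x * keep_scale for x in inputs]


rng = random.Random(0)  # Fixed seed so the sketch is repeatable.
inputs = [1.0] * 10000

dropped = dropout(inputs, rate=0.25, training=True, rng=rng)
unchanged = dropout(inputs, rate=0.25, training=False, rng=rng)

# The mean stays close to the input mean because survivors are scaled up.
mean_after = sum(dropped) / len(dropped)
print(mean_after)
```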

Parameters
  • rate – Float between 0 and 1. Fraction of the input units to drop.

  • noise_shape – 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input.

  • seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple) containing a pair of 32-bit integers that will be used to seed the random number generator that generates the dropout mask.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)

Perform dropout.

Parameters
  • inputs – Input tensor (of any rank).

  • training – Python boolean indicating whether the layer should behave in training mode (adding dropout) or in inference mode (doing nothing).

Returns

In training mode, a tensor which has some nodes set to zero, as randomly selected based on other parameters. In inference mode, a tensor that is identical to the input tensor.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.EffectiveTransformer(*args, **kwargs)

EffectiveTransformer is an implementation of a multihead attention network.

Transformers of this type are described in the following paper: https://arxiv.org/abs/1706.03762

This implementation is optimised for batches of padded sequences, by dynamically compressing the input sequences for computationally expensive parts of the algorithm. This compression is achieved by the removal of padding for those computations that do not rely on a 1:1 relationship between the input to and from sequences.

For an input sequence tensor X of shape [B, N], the algorithm will process X in compressed chunks of shape [B', N], where B' is less than or equal to max_batch_size. The algorithm output, however, keeps the input batch size B. While each chunk has shape [B', N], the parameter sequences_per_iter sets the upper limit on the total number of compressed sequences to be processed in each chunk of B' rows.

The distinction between max_batch_size and sequences_per_iter is of importance when a corpus of data has much variance in the length of its sequences (the degree of padding in each row). max_batch_size determines the upper bound on the number of rows of data to be processed in each chunk and sequences_per_iter determines the upper bound on the number of sequences to be compressed into each chunk. This distinction is important to consider because a chunk of compressed sequences will need to be decompressed at points in the algorithm. This can incur large memory usage if the number of compressed sequences to process is high and the uncompressed shape unbounded.

sequences_per_iter must be less than or equal to max_batch_size.
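The idea of removing padding before the expensive computations and restoring it afterwards can be sketched as follows. This shows only the compress and decompress round trip on nested lists; it does not reproduce how the layer forms chunks from max_batch_size and sequences_per_iter, and the helper names are hypothetical.

```python
def compress(padded, lengths):
    """Drop padding: keep only the first lengths[i] tokens of each row."""
    return [token for row, n in zip(padded, lengths) for token in row[:n]]


def decompress(tokens, lengths, row_width, pad_value=0):
    """Rebuild the padded [B, N] layout from the flat token list."""
    rows, start = [], 0
    for n in lengths:
        rows.append(tokens[start:start + n] + [pad_value] * (row_width - n))
        start += n
    return rows


# B = 3 rows padded to N = 4 tokens, with very different true lengths.
padded = [[7, 8, 0, 0], [5, 0, 0, 0], [1, 2, 3, 0]]
lengths = [2, 1, 3]

tokens = compress(padded, lengths)
restored = decompress(tokens, lengths, row_width=4)

# Six real tokens occupy the space of 1.5 full-length sequences.
full_sequence_equivalents = len(tokens) / 4
print(tokens, full_sequence_equivalents)
```

The decompressed form is what can become large when many short sequences are packed together, which is why the two limits are set separately.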

Parameters
  • output_layer_size – The number of output units.

  • max_batch_size – The upper limit to which additional sequences will be compressed into a chunk of data. This is the maximum size of the uncompressed sequence tensor.

  • use_scale – If True, learn a scale parameter.

  • num_attention_heads – The number of attention heads to use for multihead attention.

  • attention_head_size – The size of each attention head.

  • sequences_per_iter – The number of full-sequence equivalents to process in each data chunk. Must be less than or equal to max_batch_size.

  • qkv_activation – The activation function to use for the Query, Key and Value embeddings.

  • attention_dropout_prob – Dropout probability applied to the attention distribution.

  • output_activation – The activation function to use for the layer output.

  • output_dropout_prob – Dropout probability applied to the layer output.

  • layer_norm_output – Whether to apply Layer Normalisation to the output.

  • embedding_initializer – The initializer to be used for the QKV embeddings. Default is ‘glorot_uniform’.

  • embedding_bias_initializer – The initializer to be used for QKV embeddings additive bias. Defaults to ‘zeros’.

  • output_initializer – The initializer for the output layer. Defaults to ‘glorot_uniform’.

  • output_bias_initializer – The initializer for the output layer additive bias. Defaults to ‘zeros’.

build(input_shapes)

Builds an EffectiveTransformer Layer with respect to the provided input_shapes.

Parameters
input_shapes – A list of Tensor shapes of length four or five. In the case of four elements provided in input_shapes, the Tensor shapes should correspond to the from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths Tensor arguments to the call method. In the case of five Tensor shapes provided in input_shapes, the fifth element should correspond to the optional q_mask input to the call method.

call(inputs, training=True)

Performs a single forward pass of an EffectiveTransformer layer instance.

As input, two sequence sets and their respective sequence lengths are required. The two sets of sequences are referred to as the ‘from’ sequences and ‘to’ sequences, referring to the computed attention relationship. In the case that the ‘from’ and ‘to’ sequence sets are equal, this layer will compute self-attention.

Parameters
  • inputs – A list of input Tensors, of at least four elements, containing from_sequences, from_sequence_lengths, to_sequences and to_sequence_lengths. Additionally, a fifth tensor q_mask for attention head masking can be provided.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.Embedding(*args, **kwargs)

This is designed to be a replacement for the typical use cases of the Keras Embedding layer.

Parameters
  • input_dim – int > 0. Size of the vocabulary, i.e. maximum integer index + 1.

  • output_dim – int >= 0. Dimension of the dense embedding.

  • embeddings_initializer – Initializer for the embeddings matrix.

  • serialization_factor – If greater than 1, the embedding lookup will be broken up into serialization_factor smaller lookups, serialized along the 0th dimension. This option should not be used unless the parameters of this layer are used by another layer. If this is the case, then serialization can reduce the maximum memory at the cost of extra computation.

Input shape:

2D tensor with shape: (batch_size, input_length).

Output shape:

3D tensor with shape: (batch_size, input_length, output_dim).
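What it means to break one lookup into serialization_factor smaller lookups along the 0th dimension can be sketched in pure Python. This is an illustration of the idea rather than the layer's implementation; the function names are hypothetical, and the requirement that the table splits evenly is an assumption of this sketch.

```python
def lookup(table, indices):
    """A single lookup over the whole embedding table."""
    return [table[index] for index in indices]


def serialized_lookup(table, indices, serialization_factor):
    """Perform the same lookup as a sequence of lookups into row slices."""
    rows = len(table)
    if rows % serialization_factor != 0:
        raise ValueError("this sketch assumes the table splits evenly")
    slice_rows = rows // serialization_factor
    output = [None] * len(indices)
    for s in range(serialization_factor):
        start = s * slice_rows
        # Only this slice of the table is needed during this step.
        table_slice = table[start:start + slice_rows]
        for position, index in enumerate(indices):
            if start <= index < start + slice_rows:
                output[position] = table_slice[index - start]
    return output


# input_dim = 6, output_dim = 2.
table = [[float(r), float(r) * 10] for r in range(6)]
indices = [4, 0, 5, 2]

direct = lookup(table, indices)
serialized = serialized_lookup(table, indices, serialization_factor=3)
print(serialized)
```

Each step touches a smaller slice of the table at the cost of scanning the indices once per slice, which mirrors the memory versus computation trade-off described for serialization_factor.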

call(inputs, training=None)

Perform an embedding lookup.

Parameters

inputs – An integer tensor of indices into the embedding variable.

Returns

The entries of the embedding tensor corresponding to the indices in inputs.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

ipu_tensorflow_addons.keras.layers.GRU

alias of ipu_tensorflow_addons.keras.layers.rnn.PopnnGRU

ipu_tensorflow_addons.keras.layers.GroupNorm

alias of ipu_tensorflow_addons.keras.layers.normalization.GroupNormalization

class ipu_tensorflow_addons.keras.layers.GroupNormalization(*args, **kwargs)

Group normalization layer optimized for running on the IPU.

This layer is used like the standard Keras BatchNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Group normalization is described in this paper: https://arxiv.org/abs/1803.08494.

Parameters
  • groups – The number of groups to use in the normalization.

  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • strided_channel_grouping – Selects whether to group the channels dimension for group normalisation with a stride between channels. This makes the PopLibs implementation more efficient but is unconventional. Among other things this will mean that using pre-trained weights would not be possible if not produced with this unconventional implementation.

  • trainable – Boolean, if True the variables will be marked as trainable.
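The effect of the groups and strided_channel_grouping parameters can be sketched for a single vector of channel values. This is a plain-Python illustration, not the PopLibs implementation. In particular, the strided layout shown (group g holding channels g, g + groups, g + 2 * groups, ...) is one plausible reading of "a stride between channels" and should be treated as an assumption, as should the epsilon value; beta and gamma are omitted.

```python
import math


def channel_groups(num_channels, groups, strided):
    """Return the channel indices that belong to each group."""
    size = num_channels // groups
    if strided:
        # Assumed strided layout: members are `groups` channels apart.
        return [[g + groups * k for k in range(size)] for g in range(groups)]
    # Conventional layout: each group is a contiguous block of channels.
    return [[g * size + k for k in range(size)] for g in range(groups)]


def group_norm(values, groups, strided=False, epsilon=1e-3):
    """Normalize each group of channels to zero mean and unit variance."""
    out = [0.0] * len(values)
    for members in channel_groups(len(values), groups, strided):
        mean = sum(values[c] for c in members) / len(members)
        var = sum((values[c] - mean) ** 2 for c in members) / len(members)
        for c in members:
            out[c] = (values[c] - mean) / math.sqrt(var + epsilon)
    return out


values = [1.0, 2.0, 3.0, 4.0]
contiguous_groups = channel_groups(4, 2, strided=False)
strided_groups = channel_groups(4, 2, strided=True)
normalized = group_norm(values, groups=2)
print(contiguous_groups, strided_groups, normalized)
```

Because the two layouts put different channels in the same group, weights trained with one layout would not carry over to the other, which is the caveat noted for strided_channel_grouping.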

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)

Parameters

inputs – The tensor to apply normalization to.

Returns

The tensor resulting from applying normalization.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

ipu_tensorflow_addons.keras.layers.InstanceNorm

alias of ipu_tensorflow_addons.keras.layers.normalization.InstanceNormalization

class ipu_tensorflow_addons.keras.layers.InstanceNormalization(*args, **kwargs)

Instance normalization layer optimized for use on the IPU.

This layer is used like the standard Keras InstanceNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Instance normalization is described in this paper: https://arxiv.org/abs/1607.08022.

Parameters
  • channels_axis – Integer, the axis that should be normalized (typically the features axis).

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used.

  • epsilon – Small float added to variance to avoid dividing by zero.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)

Parameters

inputs – The tensor to apply normalization to.

Returns

The tensor resulting from applying normalization.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

ipu_tensorflow_addons.keras.layers.LSTM

alias of ipu_tensorflow_addons.keras.layers.rnn.PopnnLSTM

ipu_tensorflow_addons.keras.layers.LayerNorm

alias of ipu_tensorflow_addons.keras.layers.normalization.LayerNormalization

class ipu_tensorflow_addons.keras.layers.LayerNormalization(*args, **kwargs)

Layer normalization layer optimized for use on the IPU.

This layer is used like the standard Keras LayerNormalization layer. However, it has beta and gamma trainable parameters, but no statistics gathering.

Layer normalization is described in this paper: https://arxiv.org/abs/1607.06450.

Parameters
  • axis – Integer or List/Tuple. The axis that should be normalized (typically the features axis).

  • epsilon – Small float added to variance to avoid dividing by zero.

  • center – If True, add offset of beta to normalized tensor. If False, beta is ignored.

  • scale – If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.

  • beta_initializer – Initializer for the beta weight.

  • gamma_initializer – Initializer for the gamma weight.

  • beta_regularizer – Optional regularizer for the beta weight.

  • gamma_regularizer – Optional regularizer for the gamma weight.

  • beta_constraint – Optional constraint for the beta weight.

  • gamma_constraint – Optional constraint for the gamma weight.

  • trainable – Boolean, if True the variables will be marked as trainable.
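The roles of center (beta), scale (gamma) and epsilon can be sketched for a single feature vector. This is a plain-Python illustration of the arithmetic and not the IPU implementation; the epsilon value used here is arbitrary rather than the layer's default.

```python
import math


def layer_norm(features, gamma=None, beta=None, epsilon=1e-3):
    """Normalize one feature vector, then optionally scale and shift it."""
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    # epsilon keeps the division well defined when the variance is zero.
    normed = [(x - mean) / math.sqrt(var + epsilon) for x in features]
    if gamma is not None:   # scale=True
        normed = [g * x for g, x in zip(gamma, normed)]
    if beta is not None:    # center=True
        normed = [x + b for x, b in zip(normed, beta)]
    return normed


features = [1.0, 2.0, 3.0]
plain = layer_norm(features)
shifted = layer_norm(features, beta=[1.0, 1.0, 1.0])
constant = layer_norm([5.0, 5.0, 5.0])  # Zero variance, still finite.
print(plain, shifted, constant)
```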

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, training=None)

Parameters

inputs – The tensor to apply normalization to.

Returns

The tensor resulting from applying normalization.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.PopnnGRU(*args, **kwargs)

Popnn implementation of the Gated Recurrent Unit (Cho et al. 2014), optimized for the IPU.

There are two variants of the GRU implementation. The default is based on v3 and has the reset gate applied to the hidden state before matrix multiplication. The other is based on the original version and has the order reversed. The first variant is the default behaviour for this implementation; however, the Keras equivalent can use the second variant. To use the second variant, set reset_after=True.

Note that the Keras equivalent uses the hard_sigmoid as the default recurrent activation, however this version uses sigmoid as the default.
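The difference between the two reset gate conventions can be sketched with small hand-written matrices. This is a simplified illustration of the candidate state calculation only: biases are omitted, the reset gate values are supplied directly instead of being computed, and `candidate_state` is a hypothetical helper, not part of this package.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def candidate_state(x, h, r, W, U, reset_after):
    """Compute tanh(W x + recurrent term) under either convention."""
    wx = matvec(W, x)
    if reset_after:
        # Reset gate applied after the recurrent matrix multiplication.
        uh = [ri * ui for ri, ui in zip(r, matvec(U, h))]
    else:
        # Reset gate applied to the hidden state before the multiplication.
        uh = matvec(U, [ri * hi for ri, hi in zip(r, h)])
    return [math.tanh(a + b) for a, b in zip(wx, uh)]


x = [1.0, -1.0]            # Input at this timestep.
h = [0.5, 0.25]            # Previous hidden state.
r = [1.0, 0.0]             # Reset gate: keep unit 0, reset unit 1.
W = [[0.1, 0.2], [0.3, 0.4]]
U = [[1.0, 2.0], [3.0, 4.0]]

before = candidate_state(x, h, r, W, U, reset_after=False)
after = candidate_state(x, h, r, W, U, reset_after=True)
print(before, after)
```

The two results differ whenever the reset gate is not all ones, so weights trained under one convention are not interchangeable with the other.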

Parameters
  • units – Positive integer, dimensionality of the output space.

  • activation – Activation function to use. Default: hyperbolic tangent (“tanh”). Accepted activations: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).

  • recurrent_activation – Activation function to use for the recurrent step. Default: sigmoid (“sigmoid”). Accepted activations: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).

  • use_bias – Boolean. If True then the layer will use a bias vector.

  • kernel_initializer – Initializer for the kernel weights matrix, used for the linear transformation of the inputs.

  • recurrent_initializer – Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state.

  • bias_initializer – Initializer for the bias vector.

  • kernel_regularizer – Unsupported - Regularizer function applied to the kernel weights matrix.

  • recurrent_regularizer – Unsupported - Regularizer function applied to the recurrent_kernel weights matrix.

  • bias_regularizer – Unsupported - Regularizer function applied to the bias vector.

  • activity_regularizer – Unsupported - Regularizer function applied to the output of the layer (its “activation”).

  • kernel_constraint – Unsupported - Constraint function applied to the kernel weights matrix.

  • recurrent_constraint – Unsupported - Constraint function applied to the recurrent_kernel weights matrix.

  • bias_constraint – Unsupported - Constraint function applied to the bias vector.

  • dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.

  • dropout_seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple), representing the random seed that will be used to create the distribution for dropout.

  • recurrent_dropout – Unsupported - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.

  • implementation – Unsupported - Implementation mode.

  • return_sequences – Boolean. If True then the full output sequence will be returned. If False then only the last output in the output sequence will be returned.

  • return_state – Boolean. If True then the last state will be returned in addition to the last output or output sequence.

  • go_backwards – Unsupported - Boolean (default False). If True process the input sequence backwards and return the reversed sequence.

  • stateful – Boolean (default False). If True the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.

  • unroll – Unsupported - Boolean (default False). If True the network will be unrolled, else a symbolic loop will be used. Unrolling can speed-up a RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.

  • time_major – The shape format of the inputs and outputs tensors. If True the shape of the inputs and outputs will be (timesteps, batch, ...), otherwise the shape will be (batch, timesteps, ...). Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.

  • seed – A Python integer. Used for the kernel_initializer and recurrent_initializer.

  • partials_dtype – The type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”.

  • available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.

  • available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.
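The two layouts selected by time_major differ only in the order of the first two dimensions. The following plain-Python sketch shows the transpose that a batch-major input implies; nested lists stand in for tensors and `swap_batch_and_time` is a helper written for this example.

```python
def swap_batch_and_time(tensor):
    """Swap the first two dimensions of a nested-list 3D tensor."""
    return [list(step) for step in zip(*tensor)]


# Batch-major: (batch=2, timesteps=3, features=2).
batch_major = [
    [[1, 1], [2, 2], [3, 3]],
    [[4, 4], [5, 5], [6, 6]],
]

# Time-major: (timesteps=3, batch=2, features=2).
time_major = swap_batch_and_time(batch_major)
print(time_major)

# Applying the swap again restores the batch-major layout.
round_trip = swap_batch_and_time(time_major)
```

With time_major=True the caller supplies the second layout directly, so this transpose is not needed at the start and end of the RNN calculation.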

build(input_shape)

Create variables of the PopnnGRU layer.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, mask=None, training=None, initial_state=None)

Runs the forward step for the GRU layer.

Parameters
  • inputs – 3D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is True, the shape should be [seq_len, batch_size, input_size].

  • training – Set to False to use the layer in inference mode. This is only relevant if dropout or recurrent_dropout is used.

  • initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

Returns

If return_sequences is True then the GRU layer returns a tensor of shape [batch_size, seq_len, num_units], otherwise it returns a tensor of shape [batch_size, num_units]. If return_state is set to True then the output state of the last cell is also returned.

Raises

ValueError – if initial_state is not valid.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

state_shape(batch_size)

Shape of Popnn GRU state.

State shape is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A Python array.

class ipu_tensorflow_addons.keras.layers.PopnnLSTM(*args, **kwargs)

Popnn implementation of Long Short-Term Memory layer (Hochreiter and Schmidhuber 1997), optimized for the IPU.

Note that the Keras equivalent uses the hard_sigmoid as the default recurrent activation, however this version uses sigmoid as the default.

Parameters
  • units – Positive integer, dimensionality of the output space.

  • activation – Activation function to use. Default: hyperbolic tangent (“tanh”). Accepted activations: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).

  • recurrent_activation – Activation function to use for the recurrent step. Default: sigmoid (“sigmoid”). Accepted activations: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”. If you pass None, no activation is applied (i.e. “linear” activation: a(x) = x).

  • use_bias – Boolean. If True then the layer will use a bias vector.

  • kernel_initializer – Initializer for the kernel weights matrix, used for the linear transformation of the inputs.

  • recurrent_initializer – Initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state.

  • bias_initializer – Initializer for the bias vector.

  • unit_forget_bias – Boolean. If True then add 1 to the bias of the forget gate at initialization. Setting it to true will also force bias_initializer="zeros". This is recommended in Jozefowicz et al.

  • kernel_regularizer – Unsupported - Regularizer function applied to the kernel weights matrix.

  • recurrent_regularizer – Unsupported - Regularizer function applied to the recurrent_kernel weights matrix.

  • bias_regularizer – Unsupported - Regularizer function applied to the bias vector.

  • activity_regularizer – Unsupported - Regularizer function applied to the output of the layer (its “activation”).

  • kernel_constraint – Unsupported - Constraint function applied to the kernel weights matrix.

  • recurrent_constraint – Unsupported - Constraint function applied to the recurrent_kernel weights matrix.

  • bias_constraint – Unsupported - Constraint function applied to the bias vector.

  • dropout – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs.

  • dropout_seed – An optional two-element tensor-like object (tf.Tensor, a numpy array or Python list/tuple), representing the random seed that will be used to create the distribution for dropout.

  • recurrent_dropout – Unsupported - Float between 0 and 1. Fraction of the units to drop for the linear transformation of the recurrent state.

  • implementation – Unsupported - Implementation mode.

  • return_sequences – Boolean. If True then the full output sequence will be returned. If False then only the last output in the output sequence will be returned.

  • return_state – Boolean. If True then the last state will be returned in addition to the last output or output sequence.

  • go_backwards – Unsupported - Boolean (default False). If True process the input sequence backwards and return the reversed sequence.

  • stateful – Boolean (default False). If True the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.

  • unroll – Unsupported - Boolean (default False). If True the network will be unrolled, else a symbolic loop will be used. Unrolling can speed up an RNN, although it tends to be more memory-intensive. Unrolling is only suitable for short sequences.

  • seed – A Python integer. Used for the kernel_initializer and recurrent_initializer.

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • time_major – The shape format of the inputs and outputs tensors. If True the shape of the inputs and outputs will be (timesteps, batch, ...), otherwise the shape will be (batch, timesteps, ...). Using time_major = True is a bit more efficient because it avoids transposes at the beginning and end of the RNN calculation. However, most TensorFlow data is batch-major, so by default this function accepts input and emits output in batch-major form.

  • available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.

  • available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.
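
The transposes that time_major = True avoids can be pictured with nested Python lists. The sketch below is only an illustration of the two layouts, (batch, timesteps, ...) and (timesteps, batch, ...); the helper name to_time_major is hypothetical and does not exist in the library.

```python
def to_time_major(batch_major):
    """Swap the first two axes of a nested list shaped (batch, timesteps, features)."""
    return [list(step) for step in zip(*batch_major)]


# Two samples (batch=2), three timesteps each, one feature per timestep.
batch_major = [
    [[1], [2], [3]],
    [[4], [5], [6]],
]

time_major = to_time_major(batch_major)
print(time_major)  # [[[1], [4]], [[2], [5]], [[3], [6]]]
```

Applying the same swap again restores the batch-major layout, which is the extra work done at the start and end of the RNN calculation when time_major is left as False.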

build(input_shape)

Create variables of the PopnnLSTM layer.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.
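
The two failure conditions can be sketched as a small stand-alone check. This is a hypothetical helper written for illustration, operating on a plain tuple instead of a TensorShape; it is not the layer's actual implementation.

```python
def check_input_shape(input_shape):
    """Hypothetical check: require rank 3 and a known 3rd (feature) dimension."""
    if len(input_shape) != 3:
        raise ValueError(
            f"Expected an input shape with 3 dimensions, got {len(input_shape)}.")
    if input_shape[2] is None:
        raise ValueError("The 3rd dimension of the input shape must be known.")
    # The feature dimension is what the kernel variables would be sized from.
    return input_shape[2]


# Unknown batch size is fine; only the last dimension has to be known.
print(check_input_shape((None, 10, 8)))  # 8
```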

call(inputs, mask=None, training=None, initial_state=None)

Runs the forward step for the LSTM layer.

Parameters
  • inputs – 3D tensor with shape [batch_size, seq_len, input_size]. If the time_major parameter is set to True then the shape should be [seq_len, batch_size, input_size].

  • training – Set to False to use the layer in inference mode. This is only relevant if dropout or recurrent_dropout is set.

  • initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

Returns

If return_sequences is True, the LSTM layer returns a tensor of shape [batch_size, seq_len, num_units]; otherwise it returns a tensor of shape [batch_size, num_units]. If return_state is True then the output state of the last cell is also returned.
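
The return shapes can be summarised with a small helper. This is a sketch with a hypothetical function name. It assumes that the time-major output is [seq_len, batch_size, num_units], as the time_major parameter description suggests, and that the returned state is a pair of [batch_size, num_units] tensors like initial_state; the exact structure in which the layer returns the state is not specified here.

```python
def lstm_output_shapes(batch_size, seq_len, num_units,
                       return_sequences=False, return_state=False,
                       time_major=False):
    """Describe the shapes returned by the layer (illustrative helper only)."""
    if return_sequences:
        if time_major:
            output = [seq_len, batch_size, num_units]
        else:
            output = [batch_size, seq_len, num_units]
    else:
        # Only the last output in the sequence is returned.
        output = [batch_size, num_units]
    if return_state:
        # Assumed structure: two state tensors, each [batch_size, num_units].
        state = ([batch_size, num_units], [batch_size, num_units])
        return output, state
    return output


print(lstm_output_shapes(4, 10, 32, return_sequences=True))
print(lstm_output_shapes(4, 10, 32, return_state=True))
```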

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

state_shape(batch_size)

Shape of Popnn LSTM states.

Shape is a 2-element tuple. Each element is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A tuple of Python arrays.

class ipu_tensorflow_addons.keras.layers.RecomputationCheckpoint(*args, **kwargs)

Layer for checkpointing values in a computational pipeline stage. When recomputation is enabled, these values will not be recomputed and they will be stored in memory instead.

This layer can reduce memory liveness peaks when using recomputation if there are too many activations which need to be recomputed before the backpropagation operations can be executed.

This layer should be used with the RecomputationMode.RecomputeAndBackpropagateInterleaved pipelining recomputation mode.

Note that this layer has no effect when used with the RecomputationMode.RecomputeThenBackpropagate pipelining recomputation mode.

call(inputs, **kwargs)

Checkpoint the input tensors.

Parameters

inputs – A tensor or a structure of tensors which should be checkpointed.

Returns

A tensor or a structure of tensors which matches shape and type of inputs.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.layers.SerialDense(*args, **kwargs)

Densely-connected NN layer where the dot operation is serialized to reduce the size of this operation.

Dense implements the operation: output = activation(dot(input, kernel) + bias) where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True).

Given the input tensor with shape [..., m, k] and kernel tensor with shape [k, n], the matrix multiplication can be serialized as follows:

  • Along the m dimension of input, by setting serialization_dimension to input_columns.

  • Along the k dimension of input and kernel by setting serialization_dimension to input_rows_kernel_columns.

  • Along the n dimension of kernel, by setting serialization_dimension to kernel_rows.
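
The three options above can be sketched in plain Python on nested lists. This is an illustration of the idea only, not the layer's implementation: the helper names are hypothetical, and the only things taken from the text are the three serialization_dimension names and the requirement that the factor divides the serialized dimension.

```python
def matmul(a, b):
    """Plain matrix multiply for nested lists: [m, k] x [k, n] -> [m, n]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def chunks(size, factor):
    """Split range(size) into `factor` equal, contiguous slices."""
    if size % factor != 0:
        raise ValueError(f"serialization_factor {factor} must divide {size}.")
    step = size // factor
    return [slice(i * step, (i + 1) * step) for i in range(factor)]


def serial_matmul(a, b, factor, dimension):
    """Compute a @ b as `factor` smaller multiplies along the chosen dimension."""
    m, k, n = len(a), len(b), len(b[0])
    if dimension == "input_columns":
        # Split the m dimension of the input; stack the partial outputs by row.
        out = []
        for s in chunks(m, factor):
            out.extend(matmul(a[s], b))
        return out
    if dimension == "input_rows_kernel_columns":
        # Split the shared k dimension; accumulate the partial products.
        out = [[0] * n for _ in range(m)]
        for s in chunks(k, factor):
            part = matmul([row[s] for row in a], b[s])
            out = [[o + p for o, p in zip(o_row, p_row)]
                   for o_row, p_row in zip(out, part)]
        return out
    if dimension == "kernel_rows":
        # Split the n dimension of the kernel; join the partial outputs by column.
        out = [[] for _ in range(m)]
        for s in chunks(n, factor):
            part = matmul(a, [row[s] for row in b])
            for o_row, p_row in zip(out, part):
                o_row.extend(p_row)
        return out
    raise ValueError(f"Unknown serialization_dimension: {dimension}")


a = [[1, 2, 3, 4], [5, 6, 7, 8]]        # input, shape [m=2, k=4]
b = [[1, 0], [0, 1], [1, 1], [2, 0]]    # kernel, shape [k=4, n=2]
full = matmul(a, b)
for dim in ("input_columns", "input_rows_kernel_columns", "kernel_rows"):
    print(dim, serial_matmul(a, b, 2, dim) == full)
```

Each variant produces the same result as the full multiply. What changes is the size of each individual multiply, which is the quantity the layer aims to reduce.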

Example:

# as first layer in a sequential model:
model = Sequential()
model.add(SerialDense(32, input_shape=(16,)))
# now the model will take as input arrays of shape (*, 16)
# and output arrays of shape (*, 32)

# after the first layer, you don't need to specify
# the size of the input anymore:
model.add(SerialDense(32))
Parameters
  • units – Positive integer, dimensionality of the output space.

  • serialization_factor – An integer indicating the number of smaller matrix multiplies this operation is broken up into. Must divide the size of the dimension along which the operation is serialized.

  • serialization_dimension – A string, must be one of input_columns, input_rows_kernel_columns or kernel_rows. Indicates the dimension along which the operation is serialized.

  • activation – Activation function to use. If you don’t specify anything, no activation is applied (i.e. “linear” activation: a(x) = x).

  • use_bias – Boolean, whether the layer uses a bias vector.

  • kernel_initializer – Initializer for the kernel weights matrix.

  • bias_initializer – Initializer for the bias vector.

  • kernel_regularizer – Regularizer function applied to the kernel weights matrix.

  • bias_regularizer – Regularizer function applied to the bias vector.

  • activity_regularizer – Regularizer function applied to the output of the layer (its “activation”).

  • kernel_constraint – Constraint function applied to the kernel weights matrix.

  • bias_constraint – Constraint function applied to the bias vector.

Input shape:

N-D tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).

Output shape:

N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).
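
The shape rule above replaces only the last dimension. A minimal sketch, using a hypothetical helper name and plain tuples instead of TensorShape objects:

```python
def dense_output_shape(input_shape, units):
    """Replace the last dimension (input_dim) with `units`; keep the rest."""
    if len(input_shape) < 2:
        raise ValueError("Expected at least (batch_size, input_dim).")
    return tuple(input_shape[:-1]) + (units,)


print(dense_output_shape((None, 16), 32))    # (None, 32)
print(dense_output_shape((8, 10, 16), 32))   # (8, 10, 32)
```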

build(input_shape)

Creates the variables of the layer (optional, for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call.

This is typically used to create the weights of Layer subclasses.

Parameters

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, **kwargs)
Parameters

inputs – The tensor to apply the dense weights to.

Returns

The tensor resulting from applying the dense weights.

compute_output_shape(input_shape)

Computes the output shape of the layer.

If the layer has not been built, this method will call build on the layer. This assumes that the layer will later be used with inputs that match the input shape provided here.

Parameters

input_shape – Shape tuple (tuple of integers) or list of shape tuples (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.

Returns

An output shape tuple.

get_config()

Returns the config of the layer.

A layer config is a Python dictionary (serializable) containing the configuration of a layer. The same layer can be reinstantiated later (without its trained weights) from this configuration.

The config of a layer does not include connectivity information, nor the layer class name. These are handled by Network (one layer of abstraction above).

Returns

Python dictionary.

23.2. Keras Optimizers

23.2.1. Keras optimizers made for IPU TensorFlow

class ipu_tensorflow_addons.keras.optimizers.AdamIpuOptimizer(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', m_dtype=None, v_dtype=None, vhat_dtype=None, debiasing=True, **kwargs)

Optimizer that implements the Adam algorithm.

Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to the paper Adam: A Method for Stochastic Optimization. Kingma et al., 2014, the method is “computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters”.

For AMSGrad see On The Convergence Of Adam And Beyond. Reddi et al., 2018.

This optimizer allows setting the optimizer state precisions independently and differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.

__init__(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False, name='Adam', m_dtype=None, v_dtype=None, vhat_dtype=None, debiasing=True, **kwargs)
Parameters
  • learning_rate – A Tensor or a floating point value. The learning rate.

  • beta_1 – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.

  • beta_2 – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.

  • epsilon – A small constant for numerical stability. This epsilon is “epsilon hat” in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.

  • amsgrad – Boolean. Whether to apply the AMSGrad variant of this algorithm from the paper “On the Convergence of Adam and Beyond”.

  • name – Optional name for the operations created when applying gradients. Defaults to “Adam”.

  • m_dtype – Dtype of the optimizer state m. If None, will set to dtypes of the corresponding vars.

  • v_dtype – Dtype of the optimizer state v. If None, will set to dtypes of the corresponding vars.

  • vhat_dtype – Dtype of the optimizer state vhat. If None, will set to dtypes of the corresponding vars.

  • debiasing – Debias m and v to correct for initialisation.

  • **kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm clips gradients by norm; clipvalue clips gradients by value; decay is included for backward compatibility to allow time inverse decay of the learning rate; lr is included for backward compatibility, and learning_rate is recommended instead.
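
The role of beta_1, beta_2, epsilon and debiasing can be seen in a scalar version of the update. The sketch below assumes the standard Keras Adam formulation, in which the bias correction is folded into the step size and epsilon is the “epsilon hat” described above. It also assumes that debiasing=False simply skips that correction. It is not taken from the AdamIpuOptimizer source, and it omits amsgrad and the state dtype options.

```python
import math


def adam_step(var, grad, m, v, step, learning_rate=0.001, beta_1=0.9,
              beta_2=0.999, epsilon=1e-7, debiasing=True):
    """One scalar Adam update; `step` counts from 1."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad * grad
    if debiasing:
        # Correct for m and v being initialised at zero.
        lr_t = learning_rate * math.sqrt(1.0 - beta_2 ** step) / (1.0 - beta_1 ** step)
    else:
        lr_t = learning_rate
    var = var - lr_t * m / (math.sqrt(v) + epsilon)
    return var, m, v


var, m, v = adam_step(var=1.0, grad=2.0, m=0.0, v=0.0, step=1)
print(var, m, v)
```

With debiasing enabled, the first step moves the variable by approximately learning_rate regardless of the gradient's magnitude, which is the behaviour the correction is designed to give.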

get_config()

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.optimizers.LAMBIpuOptimizer(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-06, weight_decay_rate=0.0, exclude_from_weight_decay=None, exclude_from_layer_adaptation=None, name='LAMB', debiasing=True, m_dtype=None, v_dtype=None, weight_norm_clip=None, optimizer_compute_precisions=(tf.float32, tf.float32), **kwargs)

Optimizer that implements the Layer-wise Adaptive Moments (LAMB) algorithm. See the paper Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.

This optimizer allows setting the optimizer state precisions independently and differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.

__init__(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-06, weight_decay_rate=0.0, exclude_from_weight_decay=None, exclude_from_layer_adaptation=None, name='LAMB', debiasing=True, m_dtype=None, v_dtype=None, weight_norm_clip=None, optimizer_compute_precisions=(tf.float32, tf.float32), **kwargs)
Parameters
  • learning_rate – A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.

  • beta_1 – A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.

  • beta_2 – A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.

  • epsilon – A small constant for numerical stability.

  • weight_decay_rate – weight decay rate.

  • exclude_from_weight_decay – List of regex patterns of variables excluded from weight decay. Variables whose names contain a substring matching the pattern will be excluded.

  • exclude_from_layer_adaptation – List of regex patterns of variables excluded from layer adaptation. Variables whose names contain a substring matching the pattern will be excluded.

  • name – Optional name for the operations created when applying gradients. Defaults to “LAMB”.

  • debiasing – Debias m and v to correct for initialisation.

  • m_dtype – Dtype of the optimizer state m. If None, will set to dtypes of the vars.

  • v_dtype – Dtype of the optimizer state v. If None, will set to dtypes of the vars.

  • weight_norm_clip – Clip the weight norms by this value.

  • optimizer_compute_precisions – Tuple of TF dtypes that determine what precision the stages of optimizer compute are done in. This optimizer has two stages of compute precision so the tuple must be of size 2.

  • **kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm clips gradients by norm; clipvalue clips gradients by value; decay is included for backward compatibility to allow time inverse decay of the learning rate; lr is included for backward compatibility, and learning_rate is recommended instead.
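
The matching rule for exclude_from_weight_decay and exclude_from_layer_adaptation (a variable is excluded if its name contains a substring matching any pattern) corresponds to a regex search, not a full match. The sketch below illustrates that rule with Python's re module. The helper and the variable names are hypothetical and are not taken from the optimizer's source.

```python
import re


def is_excluded(var_name, patterns):
    """True if any pattern matches a substring of the variable name."""
    if not patterns:
        return False
    return any(re.search(pattern, var_name) is not None for pattern in patterns)


# Hypothetical variable names, for illustration only.
names = [
    "encoder/dense/kernel",
    "encoder/dense/bias",
    "encoder/LayerNorm/gamma",
]
patterns = ["bias", "LayerNorm"]

decayed = [name for name in names if not is_excluded(name, patterns)]
print(decayed)  # ['encoder/dense/kernel']
```

Because the match is on substrings, a short pattern such as "bias" excludes every variable with that text anywhere in its name, so patterns are best kept as specific as the naming scheme allows.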

get_config()

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns

Python dictionary.

class ipu_tensorflow_addons.keras.optimizers.SGDIpuOptimizer(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', momentum_accum_dtype=None, **kwargs)

Optimizer that implements the gradient descent algorithm with momentum.

This optimizer allows setting the optimizer state precisions differently to the var precisions. It also supports outlining the optimizer update, which can save memory at the expense of passing variables around by making the optimizer update a reusable code block.

For nesterov=True, see Sutskever et al., 2013.

__init__(learning_rate=0.01, momentum=0.0, nesterov=False, name='SGD', momentum_accum_dtype=None, **kwargs)
Parameters
  • learning_rate – A Tensor, a floating point value, or a schedule that is a tf.keras.optimizers.schedules.LearningRateSchedule. The learning rate.

  • momentum – A float value or a constant float tensor that accelerates gradient descent in the relevant direction and dampens oscillations.

  • nesterov – boolean. Whether to apply Nesterov momentum. Defaults to False.

  • name – Optional name prefix for the operations created when applying gradients. Defaults to "SGD".

  • momentum_accum_dtype – Dtype of the momentum accumulation optimizer state. If None, will set to dtypes of the corresponding vars.

  • **kwargs – Keyword arguments. Allowed to be {clipnorm, clipvalue, lr, decay}. clipnorm clips gradients by norm; clipvalue clips gradients by value; decay is included for backward compatibility to allow time inverse decay of the learning rate; lr is included for backward compatibility, and learning_rate is recommended instead.
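
The effect of momentum and nesterov can be sketched with a scalar update. This assumes the update rule documented for the stock Keras SGD optimizer and is not taken from the SGDIpuOptimizer source. The momentum_accum_dtype option, which affects only the precision of the stored velocity, is not modelled.

```python
def sgd_step(var, grad, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """One scalar SGD-with-momentum update; returns (var, velocity)."""
    # The velocity accumulates past gradients, scaled down by the momentum.
    velocity = momentum * velocity - learning_rate * grad
    if nesterov:
        # Nesterov momentum looks ahead along the updated velocity.
        var = var + momentum * velocity - learning_rate * grad
    else:
        var = var + velocity
    return var, velocity


var, velocity = 1.0, 0.0
for _ in range(2):
    var, velocity = sgd_step(var, grad=1.0, velocity=velocity,
                             learning_rate=0.1, momentum=0.9)
print(var, velocity)
```

With a constant gradient the velocity grows from step to step, so the second update is larger than the first; with momentum=0.0 the rule reduces to plain gradient descent.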

get_config()

Returns the config of the optimizer.

An optimizer config is a Python dictionary (serializable) containing the configuration of an optimizer. The same optimizer can be reinstantiated later (without any saved state) from this configuration.

Returns

Python dictionary.

23.3. Legacy TensorFlow Layers

23.3.1. TensorFlow layers made for IPU TensorFlow

class ipu_tensorflow_addons.v1.layers.PopnnAUGRU(*args, **kwargs)

XLA compatible, time-major Popnn implementation of an AUGRU layer.

Below is a typical workflow:

with tf.Graph().as_default():
  augru = PopnnAUGRU(num_units, ...)

  outputs, output_state = augru(
    inputs, seq_len, attention_score, initial_state, training=True)
__init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, reset_after=False, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)

Creates a PopnnAUGRU model from model spec.

Parameters
  • num_units – the number of units within the RNN model.

  • dtype – tf.float16 or tf.float32

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.

  • weights_initializer – starting value to initialize the weight (default is Glorot uniform initializer).

  • activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • bias_initializer – starting value to initialize the bias (default is all zeros).

  • name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().

  • available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.

  • available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.

call(inputs, seq_len, attention_score, initial_state=None, training=True, time_major=True)

Runs the forward step for the AUGRU model.

Parameters
  • inputs – 3-D tensor with shape [time_len, batch_size, input_size].

  • seq_len – 1-D tensor with the sequence length of samples in each batch.

  • attention_score – The output of the attention layer, the score of samples in each batch, shaped [batch_size, max_seq_len].

  • initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – whether this operation will be used in training or inference.

  • time_major – whether the time dimension is the first dimension.

Returns

A tuple of output and output state.

  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_state: The output state of the last cell.

Raises

ValueError – if initial_state is not valid.
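
The shapes of the three required inputs have to agree with each other. The following sketch checks that consistency on plain shape tuples. It assumes that max_seq_len in the attention_score shape equals time_len of the time-major inputs; the helper name is hypothetical and the check is not part of the layer.

```python
def check_augru_shapes(inputs_shape, seq_len_shape, attention_score_shape):
    """Validate time-major AUGRU input shapes; returns (time_len, batch_size)."""
    if len(inputs_shape) != 3:
        raise ValueError("inputs must be 3-D: [time_len, batch_size, input_size].")
    time_len, batch_size, _ = inputs_shape
    if tuple(seq_len_shape) != (batch_size,):
        raise ValueError("seq_len must be 1-D with one entry per sample.")
    # Assumption: max_seq_len is the same as time_len.
    if tuple(attention_score_shape) != (batch_size, time_len):
        raise ValueError("attention_score must be [batch_size, max_seq_len].")
    return time_len, batch_size


# 20 timesteps, batch of 4, 16 input features.
print(check_augru_shapes((20, 4, 16), (4,), (4, 20)))  # (20, 4)
```

Note that attention_score is batch-major even though inputs is time-major, which is an easy place to introduce a silent transposition error.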

class ipu_tensorflow_addons.v1.layers.PopnnDynamicGRU(*args, **kwargs)

XLA compatible, time-major Popnn implementation of a GRU layer, with a sequence length input.

Below is a typical workflow:

with tf.Graph().as_default():
  gru = PopnnDynamicGRU(num_units, ...)

  outputs, output_state = gru(
    inputs, seq_len, initial_state, training=True)
__init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, reset_after=False, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)

Creates a PopnnDynamicGRU model from model spec.

Parameters
  • num_units – the number of units within the RNN model.

  • dtype – tf.float16 or tf.float32

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.

  • weights_initializer – starting value to initialize the weight (default is Glorot uniform initializer).

  • bias_initializer – starting value to initialize the bias (default is all zeros).

  • activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().

  • reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”. Leave as default (False) to match the behaviour of the standard TensorFlow GRU.

  • available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.

  • available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.

call(inputs, seq_len, initial_state=None, training=True, time_major=True)

Runs the forward step for the DynamicGRU model.

Parameters
  • inputs – 3-D tensor with shape [time_len, batch_size, input_size] when time_major is True (the default).

  • seq_len – 1-D tensor with the sequence length of samples in each batch.

  • initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – whether this operation will be used in training or inference.

  • time_major – whether the time dimension is the first dimension.

Returns

A tuple of output and output state.

  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_state: The output state of the last cell.

Raises

ValueError – if initial_state is not valid.
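
The seq_len input gives the number of valid timesteps for each sample in a padded batch. One way to picture it is as a time-major boolean mask, sketched below with a hypothetical helper. How the layer itself treats timesteps beyond a sample's sequence length is not described here, so the mask is only an illustration of what seq_len encodes.

```python
def sequence_mask_time_major(seq_len, time_len):
    """mask[t][b] is True while timestep t is within sample b's sequence."""
    return [[t < length for length in seq_len] for t in range(time_len)]


# Batch of two samples padded to 3 timesteps: lengths 3 and 1.
mask = sequence_mask_time_major(seq_len=[3, 1], time_len=3)
for row in mask:
    print(row)
```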

class ipu_tensorflow_addons.v1.layers.PopnnDynamicLSTM(*args, **kwargs)
call(inputs, seq_len, initial_state=None, training=True)

Runs the forward step for the LSTM model.

Parameters
  • inputs – 3D tensor with shape [time_len, batch_size, input_size].

  • seq_len – 1-D tensor with the sequence length of samples in each batch.

  • initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – Set to False to use the LSTM model in inference mode.

Returns

A tuple of output and output state.

  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_state: An LSTMStateTuple of the same shape and structure as initial_state.

Raises

ValueError – if initial_state is not valid.

class ipu_tensorflow_addons.v1.layers.PopnnGRU(*args, **kwargs)

XLA compatible, time-major Popnn implementation of a GRU layer.

Below is a typical workflow:

with tf.Graph().as_default():
  gru = PopnnGRU(num_units, ...)

  outputs, output_state = gru(inputs, initial_state, training=True)
__init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, reset_after=False, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)

Creates a PopnnGRU model from model spec.

Parameters
  • num_units – the number of units within the GRU model.

  • dtype – tf.float16 or tf.float32

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.

  • weights_initializer – starting value to initialize the weights (default is Glorot uniform initializer).

  • bias_initializer – starting value to initialize the bias (default is all zeros).

  • activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().

  • reset_after – GRU convention (whether to apply reset gate after or before matrix multiplication). False = “before” (default), True = “after”. Leave as default (False) to match the behaviour of the standard TensorFlow GRU.

  • available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.

  • available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.
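
The two reset_after conventions differ only in where the reset gate is applied when computing the candidate state. The sketch below shows this for a two-unit state in plain Python, with biases omitted and the input contribution passed in precomputed. The helper names and the matrix orientation are illustrative assumptions, not the Popnn implementation.

```python
import math


def matvec(mat, vec):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def candidate_state(x_proj, recurrent_kernel, h_prev, reset_gate, reset_after=False):
    """Candidate hidden state of a GRU cell (biases omitted)."""
    if reset_after:
        # "after": multiply first, then apply the reset gate to the product.
        rec = [r * u for r, u in zip(reset_gate, matvec(recurrent_kernel, h_prev))]
    else:
        # "before": apply the reset gate to the state, then multiply.
        gated = [r * h for r, h in zip(reset_gate, h_prev)]
        rec = matvec(recurrent_kernel, gated)
    return [math.tanh(a + b) for a, b in zip(x_proj, rec)]


kernel = [[0.0, 1.0], [1.0, 0.0]]   # swaps the two state entries
h_prev = [1.0, 2.0]
reset = [0.0, 1.0]                  # first unit fully reset
x_proj = [0.0, 0.0]

before = candidate_state(x_proj, kernel, h_prev, reset, reset_after=False)
after = candidate_state(x_proj, kernel, h_prev, reset, reset_after=True)
print(before)
print(after)
```

The two conventions give different results whenever the recurrent kernel mixes state entries, so weights trained under one convention should not be expected to transfer directly to the other.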

build(input_shape)

Create variables of the PopnnGRU.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the GRU model.

Parameters
  • inputs – 3D tensor with shape [time_len, batch_size, input_size].

  • initial_state – Initial state tensor, shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – Set to False to use the GRU model in inference mode.

Returns

A tuple of output and output_state.

  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_state: The output state of the last cell.

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn GRU state.

State shape is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A Python array.

class ipu_tensorflow_addons.v1.layers.PopnnLSTM(*args, **kwargs)

XLA compatible, time-major Popnn implementation of an LSTM layer.

Below is a typical workflow:

with tf.Graph().as_default():
  lstm = PopnnLSTM(num_units, ...)

  outputs, output_states = lstm(inputs, initial_states, training=True)
__init__(num_units, dtype=tf.float32, partials_dtype=tf.float32, seed=None, weights_initializer=None, bias_initializer=None, activation='tanh', recurrent_activation='sigmoid', name=None, available_memory_proportion_fwd=None, available_memory_proportion_bwd=None)

Creates a PopnnLSTM model from model spec.

Parameters
  • num_units – the number of units within the LSTM model.

  • dtype – tf.float16 or tf.float32

  • partials_dtype – the type used by Popnn to perform partial calculations. Either tf.float16 or tf.float32.

  • seed – A Python integer. Used to create the default Glorot uniform initializer weights_initializer.

  • weights_initializer – starting value to initialize the weights (default is Glorot uniform initializer).

  • bias_initializer – starting value to initialize the bias (default is all zeros).

  • activation – Activation function. Defaults to “tanh”. Accepted values: “tanh”, “relu”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • recurrent_activation – Recurrent activation function. Defaults to “sigmoid”. Must generate output in the [0,1] range. Accepted values: “tanh”, “softmax”, “sigmoid”, “hard_sigmoid”.

  • name – VariableScope for the created subgraph; defaults to class name. This only serves the default scope if later no scope is specified when invoking __call__().

  • available_memory_proportion_fwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the forward propagation layer. A value of -1. or None indicates that the default in Popnn should be used. If available_memory_proportion_bwd is set to None, then this value applies to both phases.

  • available_memory_proportion_bwd – Maximum fraction of IPU memory which can be used as temporary scratch space during computation, for the backward propagation layer. A value of -1. or None indicates that the default in Popnn should be used.
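
The interaction between the two available_memory_proportion settings can be sketched as a small resolver. This is one reading of the descriptions above, under stated assumptions: a None backward value inherits the forward value, and -1. or None then means the Popnn default (represented here as None). The helper is hypothetical and not part of the library.

```python
def resolve_memory_proportions(fwd=None, bwd=None):
    """Return (fwd, bwd), where None stands for 'use the Popnn default'."""
    if bwd is None:
        # The forward value applies to both phases.
        bwd = fwd

    def normalise(value):
        # -1. and None both select the default.
        return None if value is None or value == -1.0 else value

    return normalise(fwd), normalise(bwd)


print(resolve_memory_proportions(fwd=0.3))            # (0.3, 0.3)
print(resolve_memory_proportions(fwd=0.3, bwd=0.6))   # (0.3, 0.6)
print(resolve_memory_proportions())                   # (None, None)
```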

build(input_shape)

Create variables of the PopnnLSTM.

It can be called manually before __call__() or automatically through __call__(). In the former case, any subsequent __call__() will skip creating variables.

Parameters

input_shape – a TensorShape object with 3 dimensions.

Raises

ValueError – if input_shape has wrong dimension or unknown 3rd dimension.

call(inputs, initial_state=None, training=True)

Runs the forward step for the LSTM model.

Parameters
  • inputs – 3D tensor with shape [time_len, batch_size, input_size].

  • initial_state – An LSTMStateTuple of state tensors, each shaped [batch_size, num_units]. If not provided, the state is initialized to zeros.

  • training – Set to False to use the LSTM model in inference mode.

Returns

A tuple of output and output state.

  • output: a tensor of shape [time_len, batch_size, num_units].

  • output_state: An LSTMStateTuple of the same shape and structure as initial_state.

Raises

ValueError – if initial_state is not valid.

state_shape(batch_size)

Shape of Popnn LSTM states.

Shape is a 2-element tuple. Each element is [batch_size, num_units].

Parameters

batch_size – an int

Returns

A tuple of Python arrays.

23.4. Legacy TensorFlow Optimizers

23.4.1. Optimizers made for IPU TensorFlow