25. Keras Python API

For details on the standard Keras API, refer to the Keras documentation.

25.1. IPU specific Keras integration

class keras.ipu.ALSOptimizerGradientAccumulationWrapper(als_optimizer, num_mini_batches, offload_weight_update_variables=None, replicated_optimizer_state_sharding=False, dtype=None, reduction_method=GradientAccumulationReductionMethod.SUM, name='ALSOptimizerGradientAccumulationWrapper')
apply_gradients(grads_and_vars, global_step=None, captured_grads=None, name=None)

Apply gradients to variables.

Parameters
  • grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • captured_grads – An optional dictionary (indexed by tags) of captured gradients to be forwarded onto the wrapped keras.ipu.ALSOptimizer instance.

  • name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.

Returns

An Operation that applies the gradients. If global_step was not None, that operation also increments global_step.

Raises

ValueError – If the grads_and_vars is malformed.

compute_gradients(loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)

Compute gradients of “loss” for the variables in “var_list”.

This simply wraps the get_gradients method of the wrapped ALSOptimizer. The gradients will be aggregated in this wrapper's apply_gradients method so that the gradients may be modified with options such as clipping with per-replica global norm if needed.

Parameters
  • loss – A Tensor containing the value to minimize.

  • var_list – Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKey.TRAINABLE_VARIABLES.

  • **kwargs – Keyword arguments for compute_gradients().

Returns

A list of (gradient, variable) pairs.
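
A minimal usage sketch of the wrapper is shown below. The construction of the underlying keras.ipu.ALSOptimizer instance (als_opt), the loss tensor and the variable list are assumptions for illustration and are not part of the API documented above.

# Sketch only: als_opt, loss and var_list are assumed to already exist.
wrapped_opt = keras.ipu.ALSOptimizerGradientAccumulationWrapper(
    als_opt, num_mini_batches=4)

grads_and_vars = wrapped_opt.compute_gradients(loss, var_list=var_list)
train_op = wrapped_opt.apply_gradients(grads_and_vars)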

class keras.ipu.FunctionalLayerPipelineStageAssignment(layer, node_index, pipeline_stage=None)

A class to indicate at which pipeline stage a layer in a Functional model should be executed.

Keras layers can be called multiple times in order to share weights between layers. Each of these calls produces a tensor output which can be executed in different pipeline stages (as long as these stages are mapped to the same device).

property inbound_layers

Returns the input layers for the layer in this assignment. This can be useful for identifying which specific node_index this is.

property layer

Returns the Keras layer associated with this assignment.

property node_index

Returns the specific call to the layer that produced a tensor.

property pipeline_stage

Returns the pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.

class keras.ipu.ModelLayerPipelineStageAssignment(layer, node_index, pipeline_stage=None)

A class to indicate at which pipeline stage a layer in a Model subclass should be executed.

Keras layers can be called multiple times in order to share weights between layers. Each of these calls produces a tensor output which can be executed in different pipeline stages (as long as these stages are mapped to the same device).

property inbound_layers

Returns the input layers for the layer in this assignment. This can be useful for identifying which specific node_index this is.

property layer

Returns the Keras layer associated with this assignment.

property node_index

Returns the specific call to the layer that produced a tensor.

property pipeline_stage

Returns the pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.

class keras.ipu.PipelineStage(stage)

A scope within which Keras layers and/or calls to Keras layers can be assigned to pipeline stages.

Pipeline stages can be assigned to all calls of Layer by constructing the Layer within a PipelineStage scope as follows:

# Imports assumed for this example.
from keras.layers import Dense, Input
from keras.ipu import PipelineStage
from tensorflow.python import ipu

strategy = ipu.ipu_strategy.IPUStrategy()
input_layer = Input(2)
with strategy.scope():
  with PipelineStage(0):
    x = Dense(4)(input_layer)

  with PipelineStage(1):
    x = Dense(4)(x)

Pipeline stages can also be assigned to individual Layer calls, as follows:

strategy = ipu.ipu_strategy.IPUStrategy()
input_layer = Input(2)
l = Dense(4)
with strategy.scope():
  with PipelineStage(0):
    x = l(input_layer)

  with PipelineStage(1):
    x = l(x)

Pipeline stages assigned to Layer calls take precedence over those assigned when constructing the Layer.

class keras.ipu.SequentialLayerPipelineStageAssignment(layer, pipeline_stage=None)

A class used to indicate which pipeline stage a layer in a Sequential model should be executed in.

property layer

Returns the Keras layer associated with this assignment.

property pipeline_stage

Returns the pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.

25.2. IPU specific Keras extensions

class keras.ipu.extensions.FunctionalExtension(*args, **kwargs)
get_pipeline_stage_assignment()

Returns the pipeline stage assignment of all the layers in the model.

If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of FunctionalLayerPipelineStageAssignment for each invocation of each layer in the model (excluding input layers) in post order (which means that layers are returned in the order they are executed).

print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)

Prints a summary of the pipeline stage assignment of the model.

Parameters
  • line_length – Total length of printed lines (for example, set this to adapt the display to different terminal window sizes).

  • print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).

reset_pipeline_stage_assignment()

Resets the pipeline stage assignment so that the model is no longer pipelined.

set_asynchronous_callbacks(asynchronous=False)

Sets the asynchronous callback options when calling fit(), evaluate() and predict().

When running fit(), evaluate() and predict(), the callbacks the model is configured with are executed after steps_per_execution steps have executed. Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application. Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options()).

Parameters

asynchronous – If True, enables asynchronous callbacks. Defaults to False.
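
For example (a sketch; model is assumed to be an IPU Keras model already compiled with steps_per_execution > 1, and my_callback is a hypothetical keras.callbacks.Callback instance):

# Invoke callbacks after every step instead of once per execution.
model.set_asynchronous_callbacks(asynchronous=True)
model.fit(dataset, epochs=2, callbacks=[my_callback])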

set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, gradient_accumulation_reduction_method='sum', **gradient_accumulation_optimizer_kwargs)

Sets the gradient accumulation options for non-pipelined models which are to be used when training a model.

When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 with a single replica in the system, this simulates an input batch of size 64. With the same settings but 4 replicas in the system, this simulates an input batch of size 256.

See the Gradient accumulation section for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Note that the minimize API of the provided optimizer will not be called when gradient accumulation is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by the gradient_accumulation_steps_per_replica multiplied by the number of replicas. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options() again.
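
For example, gradient accumulation can be enabled on a non-pipelined model as follows (a sketch; model and dataset are assumed, and steps_per_execution must be divisible by gradient_accumulation_steps_per_replica multiplied by the number of replicas):

# 16 steps per execution, accumulating gradients over 4 steps per replica.
model.compile(optimizer='sgd', loss='mse', steps_per_execution=16)
model.set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=4)
model.fit(dataset, epochs=2)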

set_infeed_queue_options(**kwargs)

Sets the options for all instances of IPUInfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.
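
For example (a sketch; model is an assumed IPU Keras model, and prefetch_depth is the optional IPUInfeedQueue argument mentioned above):

model.set_infeed_queue_options(prefetch_depth=4)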

set_outfeed_queue_options(**kwargs)

Sets the options for all instances of IPUOutfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently transfer data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.
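
For example (a sketch; model is an assumed IPU Keras model, and buffer_depth is the optional IPUOutfeedQueue argument mentioned above):

model.set_outfeed_queue_options(buffer_depth=4)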

set_pipeline_stage_assignment(pipeline_stage_assignment)

Sets the pipeline stage assignment of all the invocations of all the layers in the model.

Sets the pipeline stage assignment of all the invocations of all the layers (excluding input layers) in the model, which is used to create a model-parallel execution of this model when calling fit(), evaluate() and predict(). Note that this pipeline stage assignment is ignored when using the call() function on this model.

Parameters

pipeline_stage_assignment – A list of the same length as the total number of invocations of all the layers in this model (excluding input layers). All elements have to be instances of FunctionalLayerPipelineStageAssignment which are used to indicate which pipeline stage a particular layer invocation should be assigned to.

Raises

ValueError – If pipeline_stage_assignment is not a valid assignment.
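
A typical workflow is to retrieve the current assignments, set a pipeline stage on each one and pass the list back, for example (a sketch; model is an assumed Functional IPU Keras model, the split point is arbitrary, and it is assumed that pipeline_stage can be assigned on each returned FunctionalLayerPipelineStageAssignment):

assignments = model.get_pipeline_stage_assignment()
for index, assignment in enumerate(assignments):
    # Arbitrary split: first half of the layer invocations in stage 0, the rest in stage 1.
    assignment.pipeline_stage = 0 if index < len(assignments) // 2 else 1
model.set_pipeline_stage_assignment(assignments)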

set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)

Sets the pipelining options, including gradient accumulation options, for pipelined models.

Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set as pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 with a single replica in the system, this simulates an input batch of size 64. With the same settings but 4 replicas in the system, this simulates an input batch of size 256.

When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.

See the Gradient accumulation section for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Note that the minimize API of the provided optimizer will not be called when pipelining is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by the gradient_accumulation_steps_per_replica multiplied by the number of replicas. This value is saved/loaded when the model is saved/loaded.

  • device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.

  • accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – (Experimental) Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation especially when using float16, but gives smaller gradients and might require adjusting the learning-rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options() again.
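
For example, a pipelined model split across two IPUs might be configured as follows (a sketch; model and dataset are assumed, and steps_per_execution must be divisible by gradient_accumulation_steps_per_replica multiplied by the number of replicas):

# Two pipeline stages, placed on IPU 0 and IPU 1 respectively.
model.set_pipelining_options(gradient_accumulation_steps_per_replica=8,
                             device_mapping=[0, 1])
model.compile(optimizer='sgd', loss='mse', steps_per_execution=16)
model.fit(dataset, epochs=2)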

class keras.ipu.extensions.ModelExtension(*args, **kwargs)
get_pipeline_stage_assignment()

Returns the pipeline stage assignment of the layers in the model.

If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of ModelLayerPipelineStageAssignment for each layer in the model in post order (which means that layers are returned in the order they are executed).

print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)

Prints a summary of the pipeline stage assignment of the model.

Parameters
  • line_length – Total length of printed lines (for example, set this to adapt the display to different terminal window sizes).

  • print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).

reset_pipeline_stage_assignment()

Resets the pipeline stage assignment so that the model is no longer pipelined.

set_asynchronous_callbacks(asynchronous=False)

Sets the asynchronous callback options when calling fit(), evaluate() and predict().

When running fit(), evaluate() and predict(), the callbacks the model is configured with are executed after steps_per_execution steps have executed. Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application. Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options()).

Parameters

asynchronous – If True, enables asynchronous callbacks. Defaults to False.

set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, gradient_accumulation_reduction_method='sum', **gradient_accumulation_optimizer_kwargs)

Sets the gradient accumulation options for non-pipelined models which are to be used when training a model.

When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 with a single replica in the system, this simulates an input batch of size 64. With the same settings but 4 replicas in the system, this simulates an input batch of size 256.

See the Gradient accumulation section for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Note that the minimize API of the provided optimizer will not be called when gradient accumulation is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by the gradient_accumulation_steps_per_replica multiplied by the number of replicas. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options again.

set_infeed_queue_options(**kwargs)

Sets the options for all instances of IPUInfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.

set_outfeed_queue_options(**kwargs)

Sets the options for all instances of IPUOutfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently transfer data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.

set_pipeline_stage_assignment(pipeline_stage_assignment)

Sets the pipeline stage assignment for all the layers in the model.

Sets the pipeline stage assignment of all the layers in the model, which is used to create a model-parallel execution of this Model when calling fit(), evaluate() and predict(). Note that this pipeline stage assignment is ignored when using the call() function on this model.

Parameters

pipeline_stage_assignment – A list of the same length as the number of layers in this model. All elements can be either integers or instances of ModelLayerPipelineStageAssignment. If all the elements are integers, then a layer in this model at index i is assigned to the pipeline stage pipeline_stage_assignment[i]. Otherwise, if all the elements are of type ModelLayerPipelineStageAssignment, then a layer in this model at index i is assigned to the pipeline stage indicated by pipeline_stage_assignment[i].pipeline_stage.

Raises

ValueError – If pipeline_stage_assignment is not a valid assignment.

set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)

Sets the pipelining options, including gradient accumulation options, for pipelined models.

Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set as pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 with a single replica in the system, this simulates an input batch of size 64. With the same settings but 4 replicas in the system, this simulates an input batch of size 256.

When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.

See the Gradient accumulation section for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Note that the minimize API of the provided optimizer will not be called when pipelining is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by the gradient_accumulation_steps_per_replica multiplied by the number of replicas. This value is saved/loaded when the model is saved/loaded.

  • device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.

  • accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – (Experimental) Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation especially when using float16, but gives smaller gradients and might require adjusting the learning-rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options again.

class keras.ipu.extensions.SequentialExtension(*args, **kwargs)
get_pipeline_stage_assignment()

Returns the pipeline stage assignment of the layers in the model.

If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of SequentialLayerPipelineStageAssignment for each layer in the model in post order (which means that layers are returned in the order they are executed).

print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)

Prints a summary of the pipeline stage assignment of the model.

Parameters
  • line_length – Total length of printed lines (for example, set this to adapt the display to different terminal window sizes).

  • print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).

reset_pipeline_stage_assignment()

Resets the pipeline stage assignment so that the model is no longer pipelined.

set_asynchronous_callbacks(asynchronous=False)

Sets the asynchronous callback options when calling fit(), evaluate() and predict().

When running fit(), evaluate() and predict(), the callback functions are called after executing the number of steps specified by steps_per_execution, where each step processes one batch.

Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application.

Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options()).

Parameters

asynchronous – If True, enables asynchronous callbacks. Defaults to False.

set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, gradient_accumulation_reduction_method='sum', **gradient_accumulation_optimizer_kwargs)

Sets the gradient accumulation options for non-pipelined models which are to be used when training a model.

When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 with a single replica in the system, this simulates an input batch of size 64. With the same settings but 4 replicas in the system, this simulates an input batch of size 256.

See the Gradient accumulation section for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Note that the minimize API of the provided optimizer will not be called when gradient accumulation is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by the gradient_accumulation_steps_per_replica multiplied by the number of replicas. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options again.

set_infeed_queue_options(**kwargs)

Sets the options for all instances of IPUInfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.

set_outfeed_queue_options(**kwargs)

Sets the options for all instances of IPUOutfeedQueue generated when executing the model.

When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently transfer data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.

Parameters

**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.

set_pipeline_stage_assignment(pipeline_stage_assignment)

Sets the pipeline stage assignment of all the layers in the model.

Sets the pipeline stage assignment of all the layers in the model, which is used to create a model-parallel execution of this Sequential model when calling fit(), evaluate() and predict(). Note that this pipeline stage assignment is ignored when using the call() function on this model.

Parameters

pipeline_stage_assignment – A list of the same length as the number of layers in this model. All elements can be either integers or instances of SequentialLayerPipelineStageAssignment. If all the elements are integers, then a layer in this model at index i is assigned to the pipeline stage pipeline_stage_assignment[i]. Otherwise, if all the elements are of type SequentialLayerPipelineStageAssignment, then a layer in this model at index i is assigned to the pipeline stage indicated by pipeline_stage_assignment[i].pipeline_stage.

Raises

ValueError – If pipeline_stage_assignment is not a valid assignment.
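
For example, a four-layer Sequential model could be split across two pipeline stages by passing a list of integers (a sketch; it is assumed the model is constructed inside an IPUStrategy scope so that the Sequential extension is active):

model = keras.Sequential([
    keras.layers.Dense(8),
    keras.layers.Dense(8),
    keras.layers.Dense(8),
    keras.layers.Dense(1),
])
# Layers 0 and 1 in pipeline stage 0; layers 2 and 3 in pipeline stage 1.
model.set_pipeline_stage_assignment([0, 0, 1, 1])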

set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)

Sets the pipelining options, including gradient accumulation options, for pipelined models.

Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set as pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.

Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 with a single replica in the system, this simulates an input batch of size 64. With the same settings but 4 replicas in the system, this simulates an input batch of size 256.

When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.

See the Gradient accumulation section for more details.

The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().

Note that the minimize API of the provided optimizer will not be called when pipelining is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.

Parameters
  • gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by the gradient_accumulation_steps_per_replica multiplied by the number of replicas. This value is saved/loaded when the model is saved/loaded.

  • device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.

  • accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.

  • gradient_accumulation_reduction_method – (Experimental) Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation especially when using float16, but gives smaller gradients and might require adjusting the learning-rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).

  • pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options again.