25. Keras Python API
For details on the standard Keras API, refer to the Keras documentation.
25.1. IPU specific Keras integration
- class keras.ipu.FunctionalLayerPipelineStageAssignment(layer, node_index, pipeline_stage=None)
A class to indicate at which pipeline stage a layer in a Functional model should be executed.
Keras layers can be called multiple times in order to share weights between layers. A new FunctionalLayerPipelineStageAssignment is required for every call. For example, node_index=0 will correspond to the first time the layer was called. As weights are shared, all stages a given layer is assigned to must be mapped to the same device.
- class keras.ipu.FunctionalNestedModelPipelineStageAssignment(nested_model, node_index, pipeline_stage_assignments)
A class containing the pipeline stage assignments for a nested model in a Functional model. These are separate from assignments set directly on the nested model, though any such existing assignments are used as defaults.
Nested models can be called multiple times. A new FunctionalNestedModelPipelineStageAssignment is required for every call. For example, node_index=0 will correspond to the first time the nested model was called. All stages a given layer is assigned to must be mapped to the same device.
- class keras.ipu.ModelLayerPipelineStageAssignment(layer, node_index, pipeline_stage=None)
A class to indicate at which pipeline stage a layer in a Model subclass should be executed.
Keras layers can be called multiple times in order to share weights between layers. A new ModelLayerPipelineStageAssignment is required for every call. For example, node_index=0 will correspond to the first time the layer was called. As weights are shared, all stages a given layer is assigned to must be mapped to the same device.
- property inbound_layers
The input layers for the layer in this assignment. This can be useful for identifying which specific node_index this is.
- property is_nested_model
Whether this assignment is for a nested model.
- property layer
The Keras layer associated with this assignment.
- property node_index
The specific call to the layer.
- property pipeline_stage
The pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.
- class keras.ipu.NestedModelPipelineStageAssignment(nested_model, node_index, pipeline_stage_assignments)
A class containing the pipeline stage assignments for a nested model in a Model subclass. These are separate from assignments set directly on the nested model, though any such existing assignments are used as defaults.
Nested models can be called multiple times. A new NestedModelPipelineStageAssignment is required for every call. For example, node_index=0 will correspond to the first time the nested model was called. All stages a given layer is assigned to must be mapped to the same device.
- property inbound_layers
The input layers for the nested model in this assignment. This can be useful for identifying which specific node_index this is.
- property is_nested_model
Whether this assignment is for a nested model.
- property nested_model
The nested model associated with this assignment.
- property node_index
The index of the specific call to the nested model.
- property pipeline_stage_assignments
The pipeline stage assignments for this nested model.
- class keras.ipu.PipelineStage(stage)
A scope within which Keras layers and/or calls to Keras layers can be assigned to pipeline stages.
Pipeline stages can be assigned to all calls of a Layer by constructing the Layer within a PipelineStage scope as follows:

    strategy = ipu.ipu_strategy.IPUStrategy()
    input_layer = Input(2)
    with strategy.scope():
        with PipelineStage(0):
            x = Dense(4)(input_layer)

        with PipelineStage(1):
            x = Dense(4)(x)

Pipeline stages can also be assigned to individual Layer calls, as follows:

    strategy = ipu.ipu_strategy.IPUStrategy()
    input_layer = Input(2)
    l = Dense(4)
    with strategy.scope():
        with PipelineStage(0):
            x = l(input_layer)

        with PipelineStage(1):
            x = l(x)

Pipeline stages assigned to Layer calls take precedence over those assigned when constructing the Layer.
- class keras.ipu.ReplicatedMetricReductionMethod(value)
Cross-replica reduction method to use when returning metrics which exist across multiple replicas.
- NONE: Do not perform any reduction. Return the metric values from the last replica.
- LIST: For each metric return a list containing the values from every replica. When using this option, the Keras progress bar output will show the mean of the list values.
- SUM: Return a sum of the metric values from each replica.
- MEAN: Return a sum of the metric values from each replica, scaled by (1/num_replicas).
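For example, a minimal sketch of selecting a reduction method on an IPU Keras model; the import path follows the class name listed here, and the model (with metrics configured) is assumed to already exist inside an IPUStrategy scope:

    from keras.ipu import ReplicatedMetricReductionMethod

    # Report the mean of each metric across replicas instead of the last replica's value.
    model.set_replication_options(
        replicated_metric_reduction_method=ReplicatedMetricReductionMethod.MEAN)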
- class keras.ipu.SequentialLayerPipelineStageAssignment(layer, pipeline_stage=None)
A class to indicate at which pipeline stage a layer in a Sequential model should be executed.
- property is_nested_model
Whether this assignment is for a nested model.
- property layer
The Keras layer associated with this assignment.
- property pipeline_stage
The pipeline stage this layer has been assigned to. If None, then this layer has not been assigned to a pipeline stage.
- class keras.ipu.SequentialNestedModelPipelineStageAssignment(nested_model, pipeline_stage_assignments)
A class containing the pipeline stage assignments for a nested model in a Sequential model. These are separate from assignments set directly on the nested model, though any such existing assignments are used as defaults.
Nested models can be called multiple times. A new SequentialNestedModelPipelineStageAssignment is required for every call. For example, node_index=0 will correspond to the first time the nested model was called. All stages a given layer is assigned to must be mapped to the same device.
- property is_nested_model
Whether this assignment is for a nested model.
- property nested_model
The nested Keras model associated with this assignment.
- property pipeline_stage_assignments
The pipeline stage assignments for this nested model.
25.2. IPU specific Keras extensions
- class keras.ipu.extensions.FunctionalExtension(*args, **kwargs)
- get_pipeline_stage_assignment()
Returns the pipeline stage assignment of the layers in the model.
If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of FunctionalLayerPipelineStageAssignment and FunctionalNestedModelPipelineStageAssignment for each layer invocation (excluding input layers) and nested model in the model. The list is in post order (execution order).
- print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)
Prints a summary of the pipeline stage assignment of the model.
- Parameters
line_length – Total length of printed lines (for example, set this to adapt the display to different terminal window sizes).
print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).
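A minimal sketch of capturing the summary instead of printing it, assuming a pipelined IPU Keras model named model:

    lines = []
    model.print_pipeline_stage_assignment_summary(
        line_length=120, print_fn=lines.append)  # collect each summary line
    summary = "\n".join(lines)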
- reset_pipeline_stage_assignment()
Resets the pipeline stage assignment so that the model is no longer pipelined.
- set_asynchronous_callbacks(asynchronous=False)
Sets the asynchronous callback options when calling fit(), evaluate() and predict().
When running fit(), evaluate() and predict(), the callbacks the model is configured with are executed after steps_per_execution steps have executed. Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application. Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options()).
- Parameters
asynchronous – If True, enables asynchronous callbacks. Defaults to False.
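A minimal sketch, assuming a model built inside an IPUStrategy scope and a hypothetical train_dataset:

    model.set_asynchronous_callbacks(asynchronous=True)
    model.compile(optimizer='sgd', loss='mse', steps_per_execution=16)
    # Callbacks (for example, the progress bar) now update after every step
    # rather than once per group of 16 steps.
    model.fit(train_dataset, epochs=2)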
- set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, gradient_accumulation_reduction_method='sum', use_v2_gradient_accumulation_optimizer=False, **gradient_accumulation_optimizer_kwargs)
Sets the gradient accumulation options for non-pipelined models which are to be used when training a model.
When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.
Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
See the Gradient accumulation section for more details.
The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().
Note that the minimize API of the provided optimizer will not be called when gradient accumulation is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.
- Parameters
gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by gradient_accumulation_steps_per_replica. This value is saved/loaded when the model is saved/loaded.
gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).
use_v2_gradient_accumulation_optimizer – When enabled, the OptimizerV2-based IPU Keras GradientAccumulationOptimizer (see GradientAccumulationOptimizer) is used in place of the default IPU TensorFlow GradientAccumulationOptimizerV2 (see GradientAccumulationOptimizerV2). Defaults to False.
gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options() again.
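A minimal sketch, assuming a non-pipelined model built inside an IPUStrategy scope, a single replica, and a hypothetical train_dataset with batches of size 16:

    model.compile(optimizer='sgd', loss='mse', steps_per_execution=16)
    # Accumulate gradients over 4 steps before each weight update; with batches
    # of size 16 this simulates an effective batch size of 64. Note that
    # steps_per_execution (16) is divisible by 4, as required.
    model.set_gradient_accumulation_options(
        gradient_accumulation_steps_per_replica=4)
    model.fit(train_dataset, epochs=2)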
- set_infeed_queue_options(**kwargs)
Sets the options for all instances of IPUInfeedQueue generated when executing the model.
When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.
- Parameters
**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.
- set_outfeed_queue_options(**kwargs)
Sets the options for all instances of IPUOutfeedQueue generated when executing the model.
When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently feed data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.
- Parameters
**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.
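A minimal sketch of tuning both queues; the keyword arguments are forwarded as-is to the underlying IPUInfeedQueue and IPUOutfeedQueue constructors:

    # Prefetch more batches on the host side to keep the device fed.
    model.set_infeed_queue_options(prefetch_depth=3)
    # buffer_depth is forwarded to the IPUOutfeedQueue constructor.
    model.set_outfeed_queue_options(buffer_depth=3)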
- set_pipeline_stage_assignment(pipeline_stage_assignment)
Sets the pipeline stage assignment of all the invocations of all the layers in the model.
Sets the pipeline stage assignment of all the invocations of all the layers (excluding input layers) in the model which is used to create a model-parallel execution of this model when calling fit(), evaluate() and predict(). Note that this pipelining stage assignment is ignored when using the call() function on this model.
- Parameters
pipeline_stage_assignment – A list of the same length as the total number of invocations of all the layers in this model (excluding input layers). All elements have to be instances of FunctionalLayerPipelineStageAssignment which are used to indicate which pipeline stage a particular layer invocation should be assigned to.
- Raises
ValueError – pipeline_stage_assignment is not a valid assignment.
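A minimal sketch of the usual workflow, assuming a Functional model built inside an IPUStrategy scope: fetch the default assignments, set a stage on each, then apply them:

    assignments = model.get_pipeline_stage_assignment()
    # Place the first half of the layer invocations on stage 0, the rest on stage 1.
    for i, assignment in enumerate(assignments):
        assignment.pipeline_stage = 0 if i < len(assignments) // 2 else 1
    model.set_pipeline_stage_assignment(assignments)
    model.print_pipeline_stage_assignment_summary()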
- set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)
Sets the pipelining options, including gradient accumulation options, for pipelined models.
Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set as pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.
Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.
See the Gradient accumulation section for more details.
The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().
Note that the minimize API of the provided optimizer will not be called when pipelining is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.
- Parameters
gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by gradient_accumulation_steps_per_replica. This value is saved/loaded when the model is saved/loaded.
device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.
accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.
gradient_accumulation_reduction_method – (Experimental) Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).
pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options() again.
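A minimal sketch, assuming a model whose layers have already been assigned to two pipeline stages inside an IPUStrategy scope:

    model.set_pipelining_options(
        gradient_accumulation_steps_per_replica=8,  # accumulate over 8 steps per replica
        device_mapping=[0, 1],                      # stage 0 on IPU 0, stage 1 on IPU 1
        accumulate_outfeed=True)                    # enqueue accumulated metrics once per execution
    # steps_per_execution must be divisible by gradient_accumulation_steps_per_replica.
    model.compile(optimizer='sgd', loss='mse', steps_per_execution=16)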
- set_replication_options(replicated_metric_reduction_method='NONE')
Configure behaviour when using this model with replication.
- Parameters
replicated_metric_reduction_method – Cross-replica reduction method to use when returning metrics which exist across multiple replicas. Defaults to ReplicatedMetricReductionMethod.NONE (see ReplicatedMetricReductionMethod).
- class keras.ipu.extensions.ModelExtension(*args, **kwargs)
- get_pipeline_stage_assignment()
Returns the pipeline stage assignment of the layers in the model.
If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of ModelLayerPipelineStageAssignment and NestedModelPipelineStageAssignment for each layer and nested model in the model. The list is in post order (execution order).
- print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)
Prints a summary of the pipeline stage assignment of the model.
- Parameters
line_length – Total length of printed lines (for example, set this to adapt the display to different terminal window sizes).
print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).
- reset_pipeline_stage_assignment()
Resets the pipeline stage assignment so that the model is no longer pipelined.
- set_asynchronous_callbacks(asynchronous=False)
Sets the asynchronous callback options when calling fit(), evaluate() and predict().
When running fit(), evaluate() and predict(), the callbacks the model is configured with are executed after steps_per_execution steps have executed. Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application. Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options()).
- Parameters
asynchronous – If True, enables asynchronous callbacks. Defaults to False.
- set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, gradient_accumulation_reduction_method='sum', use_v2_gradient_accumulation_optimizer=False, **gradient_accumulation_optimizer_kwargs)
Sets the gradient accumulation options for non-pipelined models which are to be used when training a model.
When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.
Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
See the Gradient accumulation section for more details.
The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().
Note that the minimize API of the provided optimizer will not be called when gradient accumulation is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.
- Parameters
gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by gradient_accumulation_steps_per_replica. This value is saved/loaded when the model is saved/loaded.
gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).
use_v2_gradient_accumulation_optimizer – When enabled, the OptimizerV2-based IPU Keras GradientAccumulationOptimizer (see GradientAccumulationOptimizer) is used in place of the default IPU TensorFlow GradientAccumulationOptimizerV2 (see GradientAccumulationOptimizerV2). Defaults to False.
gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options() again.
- set_infeed_queue_options(**kwargs)
Sets the options for all instances of IPUInfeedQueue generated when executing the model.
When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.
- Parameters
**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.
- set_outfeed_queue_options(**kwargs)
Sets the options for all instances of IPUOutfeedQueue generated when executing the model.
When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently feed data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.
- Parameters
**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.
- set_pipeline_stage_assignment(pipeline_stage_assignment)
Sets the pipeline stage assignment for all the layers in the model.
Sets the pipeline stage assignment of all the layers in the model which is used to create a model-parallel execution of this Model when calling fit(), evaluate() and predict(). Note that this pipelining stage assignment is ignored when using the call() function on this model.
- Parameters
pipeline_stage_assignment – A list of the same length as the number of layers in this model. All elements can be either integers or instances of ModelLayerPipelineStageAssignment. If all the elements are integers, then a layer in this model at index i is assigned to a pipeline stage pipeline_stage_assignment[i]. Otherwise, if all the elements are of type ModelLayerPipelineStageAssignment then a layer in this model at index i is assigned to a pipeline stage indicated by pipeline_stage_assignment[i].pipeline_stage.
- Raises
ValueError – pipeline_stage_assignment is not a valid assignment.
- set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)
Sets the pipelining options, including gradient accumulation options, for pipelined models.
Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set as pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.
Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.
See the Gradient accumulation section for more details.
The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().
Note that the minimize API of the provided optimizer will not be called when pipelining is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.
- Parameters
gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by gradient_accumulation_steps_per_replica. This value is saved/loaded when the model is saved/loaded.
device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.
accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.
gradient_accumulation_reduction_method – (Experimental) Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).
pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options() again.
- set_replication_options(replicated_metric_reduction_method='NONE')
Configure behaviour when using this model with replication.
- Parameters
replicated_metric_reduction_method – Cross-replica reduction method to use when returning metrics which exist across multiple replicas. Defaults to ReplicatedMetricReductionMethod.NONE (see ReplicatedMetricReductionMethod).
- class keras.ipu.extensions.ReplicatedMetricReductionMethod(value)
Cross-replica reduction method to use when returning metrics which exist across multiple replicas.
- NONE: Do not perform any reduction. Return the metric values from the last replica.
- LIST: For each metric return a list containing the values from every replica. When using this option, the Keras progress bar output will show the mean of the list values.
- SUM: Return a sum of the metric values from each replica.
- MEAN: Return a sum of the metric values from each replica, scaled by (1/num_replicas).
- class keras.ipu.extensions.SequentialExtension(*args, **kwargs)
- get_pipeline_stage_assignment()
Returns the pipeline stage assignment of the layers in the model.
If set_pipeline_stage_assignment() has been called before, then it returns a copy of the current assignment, otherwise returns a list of SequentialLayerPipelineStageAssignment and SequentialNestedModelPipelineStageAssignment for each layer and nested model in the model. The list is in post order (execution order).
- print_pipeline_stage_assignment_summary(line_length=None, print_fn=None)
Prints a summary of the pipeline stage assignment of the model.
- Parameters
line_length – Total length of printed lines (for example, set this to adapt the display to different terminal window sizes).
print_fn – Print function to use. It will be called on each line of the summary. You can set it to a custom function in order to capture the string summary. It defaults to print (prints to stdout).
- reset_pipeline_stage_assignment()
Resets the pipeline stage assignment so that the model is no longer pipelined.
- set_asynchronous_callbacks(asynchronous=False)
Sets the asynchronous callback options when calling fit(), evaluate() and predict().
When running fit(), evaluate() and predict(), the callback functions are called after executing the number of steps specified by steps_per_execution, where each step processes one batch.
Enabling asynchronous callbacks means that the callbacks are invoked after every step, even when steps_per_execution > 1. This can reduce the latency of receiving per-step results and metrics, at the cost of an extra thread running in the background of the application.
Note that this option is ignored for fit() and evaluate() when running a pipelined model and accumulate_outfeed=True (configured via set_pipelining_options()).
- Parameters
asynchronous – If True, enables asynchronous callbacks. Defaults to False.
- set_gradient_accumulation_options(gradient_accumulation_steps_per_replica=None, gradient_accumulation_reduction_method='sum', use_v2_gradient_accumulation_optimizer=False, **gradient_accumulation_optimizer_kwargs)
Sets the gradient accumulation options for non-pipelined models which are to be used when training a model.
When set, and gradient_accumulation_steps_per_replica > 1, the optimizer which the current model has been compiled with is wrapped in GradientAccumulationOptimizerV2. This means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.
Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
See the Gradient accumulation section for more details.
The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().
Note that the minimize API of the provided optimizer will not be called when gradient accumulation is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.
- Parameters
gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by gradient_accumulation_steps_per_replica. This value is saved/loaded when the model is saved/loaded.
gradient_accumulation_reduction_method – Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).
use_v2_gradient_accumulation_optimizer – When enabled, the OptimizerV2-based IPU Keras GradientAccumulationOptimizer (see GradientAccumulationOptimizer) is used in place of the default IPU TensorFlow GradientAccumulationOptimizerV2 (see GradientAccumulationOptimizerV2). Defaults to False.
gradient_accumulation_optimizer_kwargs – All remaining keyword arguments are forwarded to GradientAccumulationOptimizerV2. See the optimizer for all the available arguments. Must not contain opt or num_mini_batches as keys. Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_gradient_accumulation_options() again.
- set_infeed_queue_options(**kwargs)
Sets the options for all instances of IPUInfeedQueue generated when executing the model.
When using fit(), evaluate() and predict(), an instance of IPUInfeedQueue is created to efficiently feed data from the dataset to the device. Instances of IPUInfeedQueue can be created with optional arguments, such as prefetch_depth, which can increase the throughput of the model.
- Parameters
**kwargs – All keyword arguments are forwarded to IPUInfeedQueue.
- set_outfeed_queue_options(**kwargs)
Sets the options for all instances of IPUOutfeedQueue generated when executing the model.
When using fit(), evaluate() and predict(), an instance of IPUOutfeedQueue is created to efficiently feed data from the device to the host. Instances of IPUOutfeedQueue can be created with optional arguments, such as buffer_depth, which can increase the throughput of the model.
- Parameters
**kwargs – All keyword arguments are forwarded to IPUOutfeedQueue.
- set_pipeline_stage_assignment(pipeline_stage_assignment)
Sets the pipeline stage assignment of all the layers in the model.
Sets the pipeline stage assignment of all the layers in the model which is used to create a model-parallel execution of this Sequential model when calling fit(), evaluate() and predict(). Note that this pipelining stage assignment is ignored when using the call() function on this model.
- Parameters
pipeline_stage_assignment – A list of the same length as the number of layers in this model. All elements can be either integers or instances of SequentialLayerPipelineStageAssignment. If all the elements are integers, then a layer in this model at index i is assigned to a pipeline stage pipeline_stage_assignment[i]. Otherwise, if all the elements are of type SequentialLayerPipelineStageAssignment then a layer in this model at index i is assigned to a pipeline stage indicated by pipeline_stage_assignment[i].pipeline_stage.
- Raises
ValueError – pipeline_stage_assignment is not a valid assignment.
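A minimal sketch, assuming a four-layer Sequential model built inside an IPUStrategy scope:

    # The first two layers run in pipeline stage 0, the last two in stage 1.
    model.set_pipeline_stage_assignment([0, 0, 1, 1])
    # Pipelined models must also have gradient accumulation configured before training.
    model.set_pipelining_options(gradient_accumulation_steps_per_replica=8)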
- set_pipelining_options(gradient_accumulation_steps_per_replica=None, device_mapping=None, accumulate_outfeed=None, gradient_accumulation_reduction_method='sum', **pipelining_kwargs)
Sets the pipelining options, including gradient accumulation options, for pipelined models.
Before training a pipelined model, the gradient_accumulation_steps_per_replica argument needs to be set as pipelined models always perform gradient accumulation when training. Setting gradient_accumulation_steps_per_replica > 1 means that each replica will accumulate the gradients for gradient_accumulation_steps_per_replica steps. These accumulated gradients are then all-reduced across the replicas and the weight update is performed.
Gradient accumulation allows us to simulate bigger batch sizes. For example, if we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there is a single replica in the system, this simulates an input batch of size 64. If we have a model where each step is of batch size 16 and we set gradient_accumulation_steps_per_replica=4 and there are 4 replicas in the system, this simulates an input batch of size 256.
When training a data-parallel model, enabling gradient accumulation also reduces the communication overhead as the all-reduce of gradients is now performed after each replica has performed gradient_accumulation_steps_per_replica steps instead of after each step.
See the Gradient accumulation section for more details.
The value of gradient_accumulation_steps_per_replica has no effect when using evaluate() or predict().
Note that the minimize API of the provided optimizer will not be called when pipelining is enabled. As such, overriding minimize in a custom optimizer will cause a ValueError to be raised.
- Parameters
gradient_accumulation_steps_per_replica – An integer which indicates the number of steps the gradients will be accumulated for in each replica. The steps_per_execution value used when compiling the model must be divisible by gradient_accumulation_steps_per_replica. This value is saved/loaded when the model is saved/loaded.
device_mapping – If provided, a list of length equal to the number of pipeline stages assigned in this model. An element at index i in the list represents which IPU the i’th pipeline stage should reside on. This can be used to make sure computational stages which share Keras layers/tf.Variable objects are resident on the same IPU. This value is saved/loaded when the model is saved/loaded.
accumulate_outfeed – The metrics from the model are normally enqueued as soon as they’re available. If this option is True, the data will instead be accumulated when they’re available and enqueued at the end of pipeline execution, reducing the amount of host <-> device communication. When used with training, the accumulated metrics are normalised by gradient_accumulation_steps_per_replica. When used with evaluation, the accumulated metrics are normalised by steps_per_epoch. This option is ignored when doing prediction. When using accumulate_outfeed, model callbacks will be called with the same data for the batches which the data was accumulated for. This value is saved/loaded when the model is saved/loaded.
gradient_accumulation_reduction_method – (Experimental) Reduction method to use when accumulating gradients. During the iterations in each optimizer step, the computed gradients can either be directly summed up or scaled such that we compute a mean of all gradients for each variable. Computing a mean avoids potential issues with overflow during accumulation, especially when using float16, but gives smaller gradients and might require adjusting the learning rate accordingly. Defaults to GradientAccumulationReductionMethod.SUM (see GradientAccumulationReductionMethod).
pipelining_kwargs – All remaining keyword arguments are forwarded to pipeline(). Note that this dictionary is not serializable, which means that when the model is being saved, these values are not saved. When restoring/loading a model, please call set_pipelining_options() again.
- set_replication_options(replicated_metric_reduction_method='NONE')
Configure behaviour when using this model with replication.
- Parameters
replicated_metric_reduction_method – Cross-replica reduction method to use when returning metrics which exist across multiple replicas. Defaults to ReplicatedMetricReductionMethod.NONE (see ReplicatedMetricReductionMethod).
25.3. Keras Optimizer specializations for the Graphcore IPU
- class keras.ipu.optimizers.ALSGradientAccumulationOptimizer(opt, num_mini_batches, *nargs, **kwargs)
An optimizer that provides Gradient Accumulation functionality to keras.ipu.optimizers.ALSOptimizer and its derivatives (keras.ipu.optimizers.adam.ALSOptimizerAdam, keras.ipu.optimizers.rmsprop.ALSOptimizerRMSProp and keras.ipu.optimizers.gradient_descent.ALSOptimizerSGD).
- apply_gradients(grads_and_vars, captured_grads=None, name=None, experimental_aggregate_gradients=True)
Accumulate and apply gradients to variables and update the loss scale factor.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
captured_grads – A dictionary of captured gradients to be used for statistics collection when updating the ALS Loss Scale Factor.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
experimental_aggregate_gradients – Whether to sum gradients from different replicas in the presence of tf.distribute.Strategy. If False, it is the user's responsibility to aggregate the gradients. Defaults to True.
- Returns
An Operation that applies the gradients. If global_step was not None, that operation also increments global_step.
- Raises
ValueError – If the grads_and_vars is malformed.
- classmethod from_config(config, custom_objects=None)
Creates an ALSGradientAccumulationOptimizer from its config.
This method is the reverse of get_config (inherited from GradientAccumulationOptimizer), capable of instantiating the same optimizer from the config dictionary.
- Parameters
config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.
- Returns
An ALSGradientAccumulationOptimizer instance.
- get_gradients(loss, params)
Returns gradients of loss with respect to params.
Should be used only in legacy v1 graph mode.
- Parameters
loss – Loss tensor.
params – List of variables.
- Returns
List of gradient tensors.
- Raises
ValueError – In case any gradient cannot be computed (e.g. if gradient function not implemented).
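A minimal sketch of wrapping an ALSOptimizer (documented below) for gradient accumulation, mirroring the style of the other examples in this section (SGD and the ALS classes are assumed to be imported, and the wrapper is then passed to a model or used directly):

    opt = SGD(0.01)
    als_opt = ALSOptimizer(opt, initial_loss_scaling_factor=16.0)
    # Accumulate gradients over 4 mini-batches before each weight update
    # and loss scale factor update.
    opt_wrapper = ALSGradientAccumulationOptimizer(als_opt, num_mini_batches=4)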
- class keras.ipu.optimizers.ALSOptimizer(opt, initial_loss_scaling_factor=1, update_frequency=8, increase_factor=2, max_loss_scaling_factor=32768, accumulate_statistics_over_update_period=True, ratio_threshold=1e-05, captured_grads_only=False, lpf_alpha=0.0, histogram_bin_edge=8192, name='ALSOptimizer')
An optimizer that automatically computes and applies a loss scaling factor (LSF) prior to gradient computation.
The LSF is computed such that the magnitude of the loss is increased to reduce numerical underflow. If the magnitude of the loss becomes too great and overflow occurs, then the LSF is automatically decreased.
The automatic increase and decrease of the LSF is governed by sample statistics collected over computed gradients of type float16.
Gradient statistics are collected on each backward pass, irrespective of update_frequency. Every update_frequency passes, the LSF is scaled by either increase_factor or decrease_factor depending on the state of the gradient statistics collected up to that point. If there is minimal overflow, then the LSF is scaled by increase_factor, otherwise it is scaled by decrease_factor. At LSF update time, the gradient statistics are reset for the following update period.
Example using the Keras Functional API:
    strategy = IPUStrategy()
    with strategy.scope():
        opt = SGD(0.01)
        opt_wrapper = ALSOptimizer(
            opt,
            initial_loss_scaling_factor=10.0,
            update_frequency=3,
            increase_factor=2.0)

        x, t = some_dataset_fn()

        input_l = Input(x.shape[1])
        dense = Dense(t.shape[1], activation='relu', dtype=np.float16)(input_l)

        m = Model(inputs=input_l, outputs=dense, gradient_accumulation_count=2)
        m.compile(optimizer=opt_wrapper, loss='mse')
        m.fit(x, t)
Example using tf.function:

    strategy = IPUStrategy()
    opt = SGD(0.01)
    opt_wrapper = ALSOptimizer(
        opt,
        initial_loss_scaling_factor=10.0,
        update_frequency=3,
        increase_factor=2.0)

    x, t = some_dataset_fn()
    dense = Dense(t.shape[1], activation='relu', dtype=np.float16)

    @tf.function(jit_compile=True)
    def f(x, t):
        with GradientTape() as tape:
            y = dense(x)
            l = mean_squared_error(labels=t, predictions=y)
        opt_wrapper.minimize(l, dense.variables, tape=tape)
        return l

    loss = strategy.run(f, args=[x, t])
- apply_gradients(grads_and_vars, captured_grads=None, global_step=None, name=None)
Apply gradients to variables and update the loss scale factor.
- Parameters
grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
global_step – Optional Variable to increment by one after the variables have been updated.
captured_grads – A dictionary of captured gradients to be used for statistics collection when updating the ALS Loss Scale Factor.
name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.
- Returns
An Operation that applies the gradients. If global_step was not None, that operation also increments global_step.
- Raises
ValueError – If the grads_and_vars is malformed.
- classmethod from_config(config, custom_objects=None)
Creates an ALSOptimizer from its config.
This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.
- Parameters
config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.
- Returns
An ALSOptimizer instance.
- get_config()
Returns the config of the ALSOptimizer instance.
- get_gradients(loss, params)
Compute gradients of a scaled loss w.r.t. a given list of params.
- Parameters
loss – A loss tensor.
var_list – A list of variables to optimize.
- Returns
A list of LSF scaled gradients.
- get_scaled_loss(loss)
Applies the current loss scaling factor to a given loss.
- Parameters
loss – The loss to be scaled.
- Returns
The scaled loss.
- reset()
Reset loss scaling.
- class keras.ipu.optimizers.ALSOptimizerAdam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-07, initial_loss_scaling_factor=1, update_frequency=8, increase_factor=2, max_loss_scaling_factor=32768, accumulate_statistics_over_update_period=True, ratio_threshold=1e-05, captured_grads_only=False, lpf_alpha=0.0, histogram_bin_edge=8192, name='ALSOptimizerAdam')
An Adam optimizer that performs Automatic Loss Scaling, specifically handling moment updates.
- classmethod from_config(config, custom_objects=None)
Creates an ALSOptimizer from its config.
This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.
- Parameters
config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.
- Returns
An ALSOptimizer instance.
- get_config()
Returns the config of the ALSOptimizerAdam instance.
- class keras.ipu.optimizers.ALSOptimizerRMSProp(learning_rate=0.001, rho=0.9, momentum=0.0, epsilon=1e-07, centered=False, initial_loss_scaling_factor=1, update_frequency=8, increase_factor=2, max_loss_scaling_factor=32768, accumulate_statistics_over_update_period=True, ratio_threshold=1e-05, captured_grads_only=False, lpf_alpha=0.0, histogram_bin_edge=8192, name='ALSOptimizerRMSProp')
An RMSProp optimizer that performs Automatic Loss Scaling, specifically handling moment updates.
- classmethod from_config(config, custom_objects=None)
Creates an ALSOptimizer from its config.
This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.
- Parameters
config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.
- Returns
An ALSOptimizer instance.
- get_config()
Returns the config of the ALSOptimizerRMSProp instance.
- class keras.ipu.optimizers.ALSOptimizerSGD(learning_rate=0.01, momentum=0.0, initial_loss_scaling_factor=1, update_frequency=8, increase_factor=2, max_loss_scaling_factor=32768, accumulate_statistics_over_update_period=True, ratio_threshold=1e-05, captured_grads_only=False, lpf_alpha=0.0, histogram_bin_edge=8192, name='ALSOptimizerSGD')
An SGD optimizer that performs Automatic Loss Scaling, specifically handling moment updates.
- classmethod from_config(config, custom_objects=None)
Creates an ALSOptimizer from its config.
This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.
- Parameters
config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.
- Returns
An ALSOptimizer instance.
- get_config()
Returns the config of the ALSOptimizerSGD instance.
- class keras.ipu.optimizers.GradientAccumulationOptimizer(opt, num_mini_batches, *nargs, **kwargs)
An optimizer which performs the weight update after multiple batches have been accumulated.
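A minimal sketch, mirroring the example style used elsewhere in this section (SGD, the model m and the data x, t are assumed to exist inside an IPUStrategy scope):

    # Accumulate gradients over 8 mini-batches, then apply a single weight update.
    opt = GradientAccumulationOptimizer(SGD(0.01), num_mini_batches=8)
    m.compile(optimizer=opt, loss='mse', steps_per_execution=8)
    m.fit(x, t)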
- classmethod from_config(config, custom_objects=None)
Creates a GradientAccumulationOptimizer from its config.
This method is the reverse of get_config, capable of instantiating the same optimizer from the config dictionary.
- Parameters
config – A Python dictionary, typically the output of get_config.
custom_objects – A Python dictionary mapping names to additional Python objects used to create this optimizer, such as a function used for a hyperparameter.
- Returns
A GradientAccumulationOptimizer instance.
- get_config()
Returns the config of the GradientAccumulationOptimizer instance.