18. IPU TensorFlow Addons
18.1. Introduction
IPU TensorFlow Addons is a collection of add-ons created for IPU TensorFlow. These include TensorFlow layers and optimizers, and a command line tool for managing SavedModel execution.
18.2. IPU SavedModel CLI
The IPU TensorFlow Addons includes a preview of the SavedModel command line
interface (CLI) tool for IPUs called ipu_saved_model_cli.
Note
This tool is still in development and subject to change without notice. Not all functions will have been fully tested.
This section documents the IPU-specific functions of the SavedModel CLI for the run and convert subcommands.
For more information about the tool, see the TensorFlow SavedModel CLI documentation.
18.2.1. Run subcommand
ipu_saved_model_cli run [-h]
--dir DIR
--tag_set TAG_SET
--signature_def SIGNATURE_DEF_KEY
[--inputs INPUTS]
[--input_exprs INPUT_EXPRS]
[--input_examples INPUT_EXAMPLES]
[--outdir OUTDIR]
[--overwrite]
[--tf_debug]
[--worker WORKER]
[--init_ipu]
[--num_ipus NUM_IPUS]
[--matmul_amp MATMUL_AMP]
[--conv_amp CONV_AMP]
[--matmul_partial_type MATMUL_PARTIAL_TYPE]
[--conv_partial_type CONV_PARTIAL_TYPE]
The run subcommand supports the following IPU-specific command line options:
--conv_amp float
The “available memory proportion”: the proportion of memory to use for temporary values, intermediate sums and so on for convolutions. It must be a value between 0.0 and 1.0. If you want to change this value, you must pass it to the run command even if you have already specified it for the convert command, unless you are using the embedded application runtime (see Convert subcommand).
See convolutions.poplar_options for more information.
The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--conv_partial_type string
The type to use for intermediate values when doing a convolution. This can be “float” (the default) or “half”. If you want to change this type, you must pass it to the run command even if you have already specified it for the convert command, unless you are using the embedded application runtime (see Convert subcommand).
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--init_ipu
If specified, the SavedModel will call configure_ipu_system() when it starts execution. This option should only be used if the worker is an IPU job.
--matmul_amp float
The proportion of memory to use for temporary values, intermediate sums and so on for matrix multiplications. It must be a value between 0.0 and 1.0.
See matmuls.poplar_options for more information.
The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--matmul_partial_type string
The type to use for intermediate values when doing a matrix multiply. This can be “float” (the default) or “half”.
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--num_ipus integer
The number of IPUs that the SavedModel will use for inference. In most cases this must be a power of 2. The command line tool does not check this, but you may get an error from the application if any necessary constraints are not met.
Default: 1.
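As a rough illustration of how the IPU-specific options combine, the following Python sketch assembles an argument list for the run subcommand and checks the documented value ranges first. It only builds and prints the command; it does not invoke the tool. The helper name build_run_command, the SavedModel path, the tag set, the signature key and the input expression are all placeholders chosen for this example.

```python
import shlex


def build_run_command(saved_model_dir, tag_set, signature_def, input_exprs,
                      num_ipus=1, matmul_amp=0.6, conv_amp=0.6,
                      matmul_partial_type="float", conv_partial_type="float",
                      init_ipu=False):
    """Assemble an argv list for `ipu_saved_model_cli run` (hypothetical helper)."""
    # The available memory proportions must lie between 0.0 and 1.0.
    for name, amp in (("matmul_amp", matmul_amp), ("conv_amp", conv_amp)):
        if not 0.0 <= amp <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {amp}")
    # Partial types can only be "float" or "half".
    for name, ptype in (("matmul_partial_type", matmul_partial_type),
                        ("conv_partial_type", conv_partial_type)):
        if ptype not in ("float", "half"):
            raise ValueError(f"{name} must be 'float' or 'half', got {ptype!r}")

    argv = [
        "ipu_saved_model_cli", "run",
        "--dir", saved_model_dir,
        "--tag_set", tag_set,
        "--signature_def", signature_def,
        "--input_exprs", input_exprs,
        "--num_ipus", str(num_ipus),
        "--matmul_amp", str(matmul_amp),
        "--conv_amp", str(conv_amp),
        "--matmul_partial_type", matmul_partial_type,
        "--conv_partial_type", conv_partial_type,
    ]
    if init_ipu:
        # Only appropriate when the worker is an IPU job.
        argv.append("--init_ipu")
    return argv


# Placeholder values: adjust the path, tag set, signature and inputs to your model.
argv = build_run_command(
    "/path/to/saved_model", "serve", "serving_default",
    "x=np.ones((1, 4))", num_ipus=2, conv_amp=0.4, init_ipu=True,
)
print(shlex.join(argv))
```

The printed line can be pasted into a shell, or the list can be handed to a process launcher of your choice.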
18.2.2. Convert subcommand
Convert the SavedModel with IPU integration.
ipu_saved_model_cli convert ipu [-h]
[--excluded_nodes EXCLUDED_NODES [EXCLUDED_NODES ...]]
[--num_ipus NUM_IPUS]
[--matmul_amp MATMUL_AMP]
[--conv_amp CONV_AMP]
[--matmul_partial_type MATMUL_PARTIAL_TYPE]
[--conv_partial_type CONV_PARTIAL_TYPE]
[--batch_size BATCH_SIZE]
[--batch_per_step BATCH_PER_STEP]
[--precision_mode PRECISION_MODE]
[--gelu_replacement GELU_REPLACEMENT]
[--no_ipu_placement]
[--int64_to_int32_conversion]
[--precision_conversion_excluded_nodes PRECISION_CONVERSION_EXCLUDED_NODES [PRECISION_CONVERSION_EXCLUDED_NODES ...]]
[--remove_excluded_nodes]
[--manual_sharding MANUAL_SHARDING]
[--embedded_runtime_save_config EMBEDDED_RUNTIME_SAVE_CONFIG]
[--pipeline_cfg PIPELINE_CFG]
[--config_file CONFIG_FILE]
This has the following command line options:
--batch_per_step integer
Repeat count for repeat() or pipelining_ops. If 0, it will not turn off the loop repeat IPU wrapper. If the IPU embedded application runtime is enabled and batches per step is 0, then it will be changed to 1.
Default: 0
--batch_size integer
The micro batch size to be used by the model.
Default: 1
--config_file path
Path to a JSON-format configuration file that defines all the options to the command. See Example configuration file.
--conv_amp float
The “available memory proportion”: the proportion of memory to use for temporary values, intermediate sums and so on for convolutions. It must be a value between 0.0 and 1.0.
See IPUConfig.convolutions.poplar_options for more information.
The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--conv_partial_type string
The type to use for intermediate values when doing a convolution. This can be “float” (the default) or “half”.
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--embedded_runtime_save_config json
A JSON string defining the configuration for embedded application runtime compilation. For example:
{ "embedded_runtime_exec_cachedir": "/path/to/exec", "runtime_api_timeout_us": 5000 }
where:
embedded_runtime_exec_cachedir sets the directory where the compiled embedded application runtime file is created, and
runtime_api_timeout_us sets the limit (in microseconds) on the time the IPU will wait for data. See Timeout for more information.
--excluded_nodes string1, string2, string3, ...
A list of nodes that will not be placed on the IPU.
Default: none.
--gelu_replacement string
The nodes that define the GELU activation function will be replaced with the IPU-optimised GELU op (tensorflow.python.ipu.nn_ops.gelu()), which will reduce the amount of memory required and improve the throughput.
This is a JSON-format string. For example:
{
  "gelu_replacement": {
    "nodes": [
      // Nodes in GELU function (regular expressions)
      "intermediate/dense/Sqrt$",
      "intermediate/dense/truediv$",
      "intermediate/dense/Erf$",
      "intermediate/dense/add$",
      "intermediate/dense/mul$",
      "intermediate/dense/mul_1$",
      "intermediate/dense/Sqrt/x$",
      "intermediate/dense/add/x$",
      "intermediate/dense/mul/x$"
    ],
    "node_as_gelu_input": [
      // The names of GELU function inputs (regex)
      "encoder/layer_[0-9]*/intermediate/dense/BiasAdd"
    ],
    "node_use_gelu_output": [
      // The names of GELU function outputs (regex)
      "encoder/layer_[0-9]*/output/dense/MatMul"
    ]
  }
}
--int64_to_int32_conversion
Convert ops with int64 type to int32 type.
The IPU does not support int64. Ops placed on the IPU that have int64 inputs/outputs will be modified to use int32 instead. Prior to sending data to the IPU, any int64 values will be cast to int32 values.
--manual_sharding regex-for-ipu0, regex-for-ipu1, regex-for-ipu2, ...
A list of regular expression strings, one for each IPU. Nodes whose names match an expression will be placed on that IPU. Nodes which do not match any expression will be placed on IPU 0.
An error will be raised if the number of regular expressions is not equal to the number of IPUs.
Default: none.
--matmul_amp float
The proportion of memory to use for temporary values, intermediate sums and so on for matrix multiplications. It must be a value between 0.0 and 1.0.
See IPUConfig.matmuls.poplar_options for more information.
The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--matmul_partial_type string
The type to use for intermediate values when doing a matrix multiply. This can be “float” (the default) or “half”.
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--merge_subgraphs
Merge multiple IPU subgraphs into one with the IPU compile() function.
--no_ipu_placement
If set, no nodes will be placed on IPUs.
--num_ipus integer
The number of IPUs that the SavedModel will use for inference.
Default: 1.
--pipeline_cfg string
A JSON-format string that defines the configuration of the pipeline. See Pipeline configuration for more information.
--precision_conversion_excluded_nodes string1, string2, string3, ...
A list of nodes that will not have their precision changed by the --precision_mode option.
Default: none.
--precision_mode string
The precision of the output model. This can be either “FP16” or “FP32”.
Default: FP32
--remove_excluded_nodes
Remove the nodes listed in --excluded_nodes from the graph.
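Because every convert option can also be supplied through --config_file, one way to keep the settings reproducible is to generate that file from a script. The sketch below is a minimal example under stated assumptions: the helper name make_convert_config and the output file name are invented for illustration, the key names follow the example configuration file in this chapter, and the range checks simply restate the constraints listed above rather than reproducing the tool's own validation.

```python
import json
import os
import tempfile

VALID_PARTIAL_TYPES = ("float", "half")
VALID_PRECISION_MODES = ("FP16", "FP32")


def make_convert_config(num_ipus=1, batch_size=1, batch_per_step=0,
                        matmul_amp=0.6, conv_amp=0.6,
                        matmul_partial_type="float", conv_partial_type="float",
                        precision_mode="FP32", excluded_nodes=None):
    """Return a dict of convert options (hypothetical helper, not part of the tool)."""
    for name, amp in (("matmul_amp", matmul_amp), ("conv_amp", conv_amp)):
        if not 0.0 <= amp <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {amp}")
    for name, ptype in (("matmul_partial_type", matmul_partial_type),
                        ("conv_partial_type", conv_partial_type)):
        if ptype not in VALID_PARTIAL_TYPES:
            raise ValueError(f"{name} must be one of {VALID_PARTIAL_TYPES}")
    if precision_mode not in VALID_PRECISION_MODES:
        raise ValueError(f"precision_mode must be one of {VALID_PRECISION_MODES}")

    config = {
        "num_ipus": num_ipus,
        "batch_size": batch_size,
        "batch_per_step": batch_per_step,
        "matmul_amp": matmul_amp,
        "conv_amp": conv_amp,
        "matmul_partial_type": matmul_partial_type,
        "conv_partial_type": conv_partial_type,
        "precision_mode": precision_mode,
    }
    if excluded_nodes:
        config["excluded_nodes"] = list(excluded_nodes)
    return config


config = make_convert_config(
    num_ipus=2,
    precision_mode="FP16",
    conv_partial_type="half",
    excluded_nodes=["^strided_slice_1$"],
)

# Write the options to a temporary JSON file (the file name is arbitrary).
config_path = os.path.join(tempfile.mkdtemp(), "convert_config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

print(config_path)
```

The resulting path is what you would pass to --config_file. Options left out of the file are assumed here to fall back to the defaults documented above.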
18.2.3. Pipeline configuration
The pipeline configuration specifies how to distribute the nodes in the model across the IPUs. It has three options:
auto: Automatically splits the model into several pipeline stages to optimise performance, searching for the minimum number of IPUs that the model should use.
manual: Splits the model into several pipeline stages using user-specified regular expressions.
load: Loads the pipeline configuration from a file.
In general, you would use auto first, and then use load mode to adjust the configuration if you are not satisfied with the results.
The pipeline configuration is specified using the following options:
converter: Specifies the mode of the pipeline converter. This must be one of auto, manual or load. Each of these has a set of configuration options, described below.
auto
  fine_tune_iter (optional): The maximum number of iterations for fine tuning.
  ipu_model: Run on the IPU Model (default: true). If false, the pipelined model will be run on IPUs.
  profiling_root_dir (optional): The directory where the SavedModel tool will write the profiling file (default: ./pipeline-profiling).
  priority (optional): The priority of the balancing strategy. Must be “cycle” (default) or “memory”.
    “cycle”: balance the compute cycles for each pipeline stage
    “memory”: balance the memory use for each pipeline stage
  max_ipu_quantity (optional): The maximum number of IPUs that can be used (default: 64).
  min_ipu_quantity (optional): The minimum number of IPUs that can be used (default: 2).
  solution_dir (optional): The directory where the SavedModel tool will write the configuration file describing the pipeline it has created.
load
  ipu_model: Run on the IPU Model (default: true).
  profiling_root_dir (optional): The directory where the SavedModel tool will write the profiling file (default: ./pipeline-profiling).
  solution_path (required): The file containing the pipeline configuration to be read by the SavedModel tool.
  profiling_enable (optional): Enable profiling.
manual
  ipu_model: Run on the IPU Model (default: true).
  profiling_root_dir (optional): The directory where the SavedModel tool will write the profiling file (default: ./pipeline-profiling).
  manual_pipeline_config (required): A list of regular expressions, one for each IPU, to match nodes to the IPUs in the pipeline. Nodes whose names match an expression will be placed on the corresponding IPU. Nodes which do not match any expression will be placed on IPU 0.
  device_info (required): The mapping of pipeline stages to IPU targets (see the description of device mapping for more information).
  solution_dir (optional): The directory where the SavedModel tool will write the configuration file describing the pipeline.
  profiling_enable (optional): Enable profiling.
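To make the manual mode more concrete, the sketch below shows one plausible reading of how manual_pipeline_config and device_info interact: each node name is tested against each group of regular expressions in order, unmatched nodes fall back to index 0, and device_info then maps the group index to an IPU. This is an illustration only. The function name assign_stages, the node names, the use of re.search and the first-match-wins rule are assumptions, and the tool's actual matching logic may differ.

```python
import re


def assign_stages(node_names, manual_pipeline_config):
    """Map each node name to the index of the first regex group that matches it.

    Nodes that match no pattern fall back to index 0 (assumed behaviour,
    following the description of manual_pipeline_config).
    """
    stages = {}
    for name in node_names:
        stages[name] = 0
        for stage, patterns in enumerate(manual_pipeline_config):
            if any(re.search(pattern, name) for pattern in patterns):
                stages[name] = stage
                break
    return stages


# A shortened version of the groups used in the example configuration file.
manual_pipeline_config = [
    ["input_3", "^middle/unit_0", "^middle/unit_1"],
    ["^middle/unit_5", "input_1", "^right/unit_0"],
    ["^left/unit_1", "concat", "^res/unit_0/"],
    ["^res/unit_1", "^res/add"],
]
device_info = [0, 1, 1, 0]  # IPU for each group, in order

# Hypothetical node names, invented for this example.
nodes = ["input_3", "middle/unit_5/conv", "left/unit_1/relu", "res/add", "unlisted/node"]

stages = assign_stages(nodes, manual_pipeline_config)
placement = {name: device_info[stage] for name, stage in stages.items()}

for name in nodes:
    print(f"{name}: stage {stages[name]}, IPU {placement[name]}")
```

Running a check like this against your own node names before conversion can reveal patterns that match nothing, or nodes that silently fall through to index 0.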
18.2.4. Pipeline development
For the auto pipeline optimization option, the SavedModel CLI tool will:
1. Run the model
2. Generate and analyse the profiling information
3. Find an optimal solution to split the model and save the result to the solution file
For step 2, the tool needs to get cycle information from the profile. However, it is possible that the model will raise an out of memory error. The tool avoids this by running the model on the IPU Model, which does not generate out of memory errors.
The auto option has the following limitations:
The design does not consider other data types like tf.resource.
It does not support the INT64 data type.
The SavedModel input needs to be frozen.
The search starts from a minimum of 2 IPUs, because a model that fits on a single IPU does not need pipelining.
The input tensor shape needs to be fixed, excluding batch size.
It cannot handle control_dependency nodes.
The first dimension of input tensors must be micro_batch_size.
18.2.5. Pipeline solution file
The pipeline solution file is a JSON definition of how pipeline stages are mapped to IPUs. This is generated by the auto and manual options, and read by the load option.
{
  "device_mapping": [0, 1],
  "pipeline_mapping": {
    <node name>: <pipeline stage id>,
    <node name>: <pipeline stage id>
  }
}
Where:
device_mapping is a list of length equal to the number of computational stages. Each element in the list specifies the ID of the IPU that the corresponding pipeline stage will be placed on.
pipeline_mapping specifies which nodes will be mapped to each pipeline stage.
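When inspecting or hand-editing a solution file before using the load option, it can help to view it grouped by stage. The following sketch parses a solution in the layout described above and reports, for each pipeline stage, its IPU and its nodes. The function name summarise_solution and the node names are invented for illustration, and the code assumes the list key is spelled device_mapping, as in the description.

```python
import json
from collections import defaultdict


def summarise_solution(solution):
    """Group node names by pipeline stage and attach the IPU ID of each stage."""
    device_mapping = solution["device_mapping"]
    nodes_by_stage = defaultdict(list)
    for node, stage in solution["pipeline_mapping"].items():
        # Every stage ID must have a corresponding entry in device_mapping.
        if not 0 <= stage < len(device_mapping):
            raise ValueError(f"node {node!r} refers to unknown stage {stage}")
        nodes_by_stage[stage].append(node)
    return {
        stage: {"ipu": device_mapping[stage], "nodes": sorted(nodes_by_stage[stage])}
        for stage in range(len(device_mapping))
    }


# A small hand-written solution with hypothetical node names.
solution_text = """
{
  "device_mapping": [0, 1],
  "pipeline_mapping": {
    "encoder/layer_0/MatMul": 0,
    "encoder/layer_1/MatMul": 1,
    "encoder/layer_0/BiasAdd": 0
  }
}
"""

summary = summarise_solution(json.loads(solution_text))
for stage, info in summary.items():
    print(f"stage {stage} on IPU {info['ipu']}: {info['nodes']}")
```

A stage that appears in device_mapping but has no nodes shows up with an empty list, which is a quick way to spot an unbalanced or stale solution.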
18.2.6. Example configuration file
{
  "batch_size": 1,
  "num_ipus": 1,
  "batch_per_step": 1,
  "matmul_amp": 0.6,
  "matmul_partial_type": "half",
  "conv_amp": 0.6,
  "conv_partial_type": "half",
  "no_ipu_placement": false,
  "excluded_nodes": [
    "^strided_slice_1$",
    "NotEqual",
    "Assert"
  ],
  "remove_excluded_nodes": true,
  "merge_subgraphs": true,
  "precision_mode": "FP16",
  "precision_conversion_excluded_nodes": [
    "^add$"
  ],
  "int64_to_int32_conversion": true,
  "gelu_replacement": {
    "nodes": [
      "intermediate/dense/Sqrt$",
      "intermediate/dense/truediv$",
      "intermediate/dense/Erf$",
      "intermediate/dense/add$",
      "intermediate/dense/mul$",
      "intermediate/dense/mul_1$",
      "intermediate/dense/Sqrt/x$",
      "intermediate/dense/add/x$",
      "intermediate/dense/mul/x$"
    ],
    "node_as_gelu_input": [
      "encoder/layer_[0-9]*/intermediate/dense/BiasAdd"
    ],
    "node_use_gelu_output": [
      "encoder/layer_[0-9]*/output/dense/MatMul"
    ]
  },
  "manual_sharding": [
    [
      "^sharding0"
    ],
    [
      "^sharding1"
    ]
  ],
  "pipeline_cfg": {
    // auto pipeline configuration.
    "converter": "auto",
    "fine_tune_iter": 5,
    "ipu_model": true,
    "max_ipu_quantity": 64,
    "min_ipu_quantity": 2,
    "priority": "cycle",
    "profiling_root_dir": "/path/to/profiling_root_dir",
    "solution_dir": "/path/to/solution_dir",
    // pipeline configuration loader configuration.
    "converter": "load",
    "ipu_model": true,
    "profiling_root_dir": "profiling",
    "solution_path": "solution/greedy_search.pipeconfig",
    "profiling_enable": false,
    // manual pipeline configuration.
    "converter": "manual",
    "ipu_model": true,
    "profiling_root_dir": "profiling",
    "solution_dir": "solution",
    "manual_pipeline_config": [
      [
        "input_3",
        "^middle/unit_0",
        "^middle/unit_1",
        "^middle/unit_2/",
        "^middle/unit_3",
        "^middle/unit_4"
      ],
      [
        "^middle/unit_5",
        "input_1",
        "input_2",
        "^right/unit_0",
        "^right/unit_1",
        "^right/unit_2",
        "^left/unit_0"
      ],
      [
        "^left/unit_1",
        "^left/unit_2",
        "^left/unit_3",
        "^left/unit_4",
        "concat",
        "^res/unit_0/"
      ],
      [
        "^res/unit_1",
        "^res/unit_2",
        "^res/down/",
        "^res/add"
      ]
    ],
    "device_info": [
      0,
      1,
      1,
      0
    ],
    "profiling_enable": true
  },
  "embedded_runtime_save_config": {
    "runtime_api_timeout_us": 5000,
    "embedded_runtime_exec_cachedir": "bert_poplar_exec"
  }
}