19. IPU TensorFlow Addons
19.1. Introduction
IPU TensorFlow Addons is a collection of add-ons created for IPU TensorFlow. These include TensorFlow layers and optimizers, and a command line tool for managing SavedModel execution.
19.2. IPU SavedModel CLI
IPU TensorFlow Addons includes a preview of the SavedModel command line interface (CLI) tool for IPUs, called ipu_saved_model_cli.
Note
This tool is still in development and subject to change without notice. Not all functions will have been fully tested.
This section documents the IPU-specific functions of the SavedModel CLI for the run and convert subcommands.
For more information about the tool, see the TensorFlow SavedModel CLI documentation.
19.2.1. Run subcommand
ipu_saved_model_cli run [-h]
--dir DIR
--tag_set TAG_SET
--signature_def SIGNATURE_DEF_KEY
[--inputs INPUTS]
[--input_exprs INPUT_EXPRS]
[--input_examples INPUT_EXAMPLES]
[--outdir OUTDIR]
[--overwrite]
[--tf_debug]
[--worker WORKER]
[--init_ipu]
[--num_ipus NUM_IPUS]
[--matmul_amp MATMUL_AMP]
[--conv_amp CONV_AMP]
[--matmul_partial_type MATMUL_PARTIAL_TYPE]
[--conv_partial_type CONV_PARTIAL_TYPE]
The run subcommand supports the following IPU-specific command line options:
--conv_amp float
The “available memory proportion”: the proportion of memory to use for temporary values, intermediate sums and so on for convolutions. It must be a value between 0.0 and 1.0. If you want to change this value, you need to specify it to the run command even if you have already specified it for the convert command, unless you are using the embedded application runtime (see Convert subcommand).
See convolutions.poplar_options for more information. The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--conv_partial_type string
The type to use for intermediate values when doing a convolution. This can be “float” (the default) or “half”. If you want to change this type, you need to specify it to the run command even if you have already specified it for the convert command, unless you are using the embedded application runtime (see Convert subcommand).
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--init_ipu
If specified, then the SavedModel will call configure_ipu_system() when it starts execution. This option should only be used if the worker is an IPU job.
--matmul_amp float
The proportion of memory to use for temporary values, intermediate sums and so on for matrix multiplications. Must be a value between 0.0 and 1.0.
See matmuls.poplar_options for more information. The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--matmul_partial_type string
The type to use for intermediate values when doing a matrix multiply. This can be “float” (the default) or “half”.
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--num_ipus integer
The number of IPUs that the SavedModel will use for inference. In most cases this must be a power of 2. The command line tool does not check this, but you may get an error from the application if any necessary constraints are not met.
Default: 1.
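As an illustration of how the IPU-specific run options combine, the following Python sketch assembles a command line and checks the documented value ranges before anything is run. The model directory, the tag set "serve", the signature key "serving_default" and the input specification are placeholder assumptions (the --inputs format follows the upstream TensorFlow SavedModel CLI), and build_run_command is a hypothetical helper, not part of the tool.

```python
import shlex


def build_run_command(model_dir, inputs, num_ipus=1, matmul_amp=0.6,
                      conv_amp=0.6, partial_type="float", init_ipu=True):
    """Return the argument list for an `ipu_saved_model_cli run` call."""
    # The available memory proportions must lie between 0.0 and 1.0.
    for name, amp in (("matmul_amp", matmul_amp), ("conv_amp", conv_amp)):
        if not 0.0 <= amp <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {amp}")
    # Partial types can only be "float" or "half".
    if partial_type not in ("float", "half"):
        raise ValueError(f"partial type must be 'float' or 'half', got {partial_type}")

    cmd = [
        "ipu_saved_model_cli", "run",
        "--dir", model_dir,
        "--tag_set", "serve",                  # assumed tag set
        "--signature_def", "serving_default",  # assumed signature key
        "--inputs", inputs,
        "--num_ipus", str(num_ipus),
        "--matmul_amp", str(matmul_amp),
        "--conv_amp", str(conv_amp),
        "--matmul_partial_type", partial_type,
        "--conv_partial_type", partial_type,
    ]
    if init_ipu:
        cmd.append("--init_ipu")
    return cmd


cmd = build_run_command("/path/to/saved_model",
                        "input_1=/path/to/input.npy", num_ipus=2)
print(shlex.join(cmd))
```

The list can be passed to subprocess.run() on a machine where the tool is installed. The sketch does not check that num_ipus is a power of 2, which mirrors the behaviour described for the tool itself.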
19.2.2. Convert subcommand
Convert the SavedModel with IPU integration.
ipu_saved_model_cli convert ipu [-h]
[--excluded_nodes EXCLUDED_NODES [EXCLUDED_NODES ...]]
[--num_ipus NUM_IPUS]
[--matmul_amp MATMUL_AMP]
[--conv_amp CONV_AMP]
[--matmul_partial_type MATMUL_PARTIAL_TYPE]
[--conv_partial_type CONV_PARTIAL_TYPE]
[--batch_size BATCH_SIZE]
[--batch_per_step BATCH_PER_STEP]
[--precision_mode PRECISION_MODE]
[--gelu_replacement GELU_REPLACEMENT]
[--no_ipu_placement]
[--int64_to_int32_conversion]
[--precision_conversion_excluded_nodes PRECISION_CONVERSION_EXCLUDED_NODES [PRECISION_CONVERSION_EXCLUDED_NODES ...]]
[--remove_excluded_nodes]
[--manual_sharding MANUAL_SHARDING]
[--embedded_runtime_save_config EMBEDDED_RUNTIME_SAVE_CONFIG]
[--pipeline_cfg PIPELINE_CFG]
[--config_file CONFIG_FILE]
This has the following command line options:
--batch_per_step integer
Repeat count for repeat() or pipelining_ops. If 0, it will not turn off the loop repeat IPU wrapper. If the IPU embedded application runtime is enabled and batches per step is 0, then it will be changed to 1.
Default: 0
--batch_size integer
The micro batch size to be used by the model.
Default: 1
--config_file path
Path to a JSON-format configuration file that defines all the options to the command. See Example configuration file.
--conv_amp float
The “available memory proportion”: the proportion of memory to use for temporary values, intermediate sums and so on for convolutions. Must be a value between 0.0 and 1.0.
See IPUConfig.convolutions.poplar_options for more information. The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--conv_partial_type string
The type to use for intermediate values when doing a convolution. This can be “float” (the default) or “half”.
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--embedded_runtime_save_config json
A JSON string defining the configuration for embedded application runtime compilation. For example:
{ "embedded_runtime_exec_cachedir": "/path/to/exec", "runtime_api_timeout_us": 5000 }
where:
embedded_runtime_exec_cachedir sets the directory where the compiled embedded application runtime file is created, and
runtime_api_timeout_us sets the limit (in microseconds) on the time the IPU will wait for data. See Timeout for more information.
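Because this option takes a JSON string on the command line, quoting is easy to get wrong. A minimal sketch, assuming a POSIX shell, is to serialise a dictionary with the standard json module and quote the result:

```python
import json
import shlex

# The two keys described above; the directory is a placeholder path.
save_config = {
    "embedded_runtime_exec_cachedir": "/path/to/exec",
    "runtime_api_timeout_us": 5000,
}

# Serialise to a JSON string, then quote it so the shell passes it
# to the tool as a single argument.
option_value = json.dumps(save_config)
print("--embedded_runtime_save_config", shlex.quote(option_value))
```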
--excluded_nodes string1, string2, string3, ...
A list of nodes that will not be placed on the IPU.
Default: none.
--gelu_replacement string
The nodes that define the GELU activation function will be replaced with the IPU-optimised GELU op (tensorflow.python.ipu.nn_ops.gelu()), which will reduce the amount of memory required and improve the throughput.
This is a JSON-format string. For example:
{
  "gelu_replacement": {
    "nodes": [
      // Nodes in GELU function (regular expressions)
      "intermediate/dense/Sqrt$",
      "intermediate/dense/truediv$",
      "intermediate/dense/Erf$",
      "intermediate/dense/add$",
      "intermediate/dense/mul$",
      "intermediate/dense/mul_1$",
      "intermediate/dense/Sqrt/x$",
      "intermediate/dense/add/x$",
      "intermediate/dense/mul/x$"
    ],
    "node_as_gelu_input": [
      // The names of GELU function inputs (regex)
      "encoder/layer_[0-9]*/intermediate/dense/BiasAdd"
    ],
    "node_use_gelu_output": [
      // The names of GELU function outputs (regex)
      "encoder/layer_[0-9]*/output/dense/MatMul"
    ]
  }
}
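Before running a conversion it can help to check which node names a set of GELU patterns would select. The sketch below uses Python's re module with re.search semantics; how the tool itself applies the expressions is an assumption here, and the node names are hypothetical.

```python
import re

# Patterns taken from the "nodes" list in the example above.
gelu_node_patterns = [
    "intermediate/dense/Sqrt$",
    "intermediate/dense/truediv$",
    "intermediate/dense/Erf$",
    "intermediate/dense/add$",
    "intermediate/dense/mul$",
    "intermediate/dense/mul_1$",
    "intermediate/dense/Sqrt/x$",
    "intermediate/dense/add/x$",
    "intermediate/dense/mul/x$",
]

# Hypothetical node names from a BERT-like graph.
node_names = [
    "encoder/layer_0/intermediate/dense/Erf",
    "encoder/layer_0/intermediate/dense/mul_1",
    "encoder/layer_0/intermediate/dense/mul_10",
    "encoder/layer_0/intermediate/dense/BiasAdd",
]


def matches_any(name, patterns):
    """Return True if any pattern is found somewhere in the node name."""
    return any(re.search(pattern, name) for pattern in patterns)


matched = [name for name in node_names if matches_any(name, gelu_node_patterns)]
print(matched)
```

The trailing $ in each pattern is what keeps mul_10 from being selected by the mul_1 pattern.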
--int64_to_int32_conversion
Convert ops with int64 type to int32 type.
The IPU does not support int64. Ops placed on the IPU that have int64 inputs/outputs will be modified to use int32 instead. Prior to sending data to the IPU, any int64 values will be cast to int32 values.
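A cast from int64 to int32 cannot represent values outside the 32-bit range. The source does not say how the tool treats such values, so a cautious approach is to check the input data beforehand. A minimal sketch in plain Python, with made-up sample values:

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def out_of_int32_range(values):
    """Return the values that a 32-bit signed integer cannot hold."""
    return [v for v in values if not INT32_MIN <= v <= INT32_MAX]


# Hypothetical input data that would be fed to an int64 placeholder.
sample_ids = [0, 101, 2**31 - 1, 2**31, -2**31 - 1]
print(out_of_int32_range(sample_ids))
```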
--manual_sharding regex-for-ipu0, regex-for-ipu1, regex-for-ipu2, ...
A list of regular expression strings, one for each IPU. Nodes whose names match an expression will be placed on that IPU. Nodes which do not match any expression will be placed on IPU 0.
An error will be raised if the number of regular expressions is not equal to the number of IPUs.
Default: none.
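The placement rule described above can be sketched in a few lines of Python. This is an illustration of the documented behaviour, not the tool's implementation: the list-of-lists layout follows the example configuration file, while the use of re.search, the "first matching IPU wins" tie-break and the node names are assumptions.

```python
import re


def assign_shards(node_names, shard_patterns, num_ipus):
    """Map each node name to an IPU index using one pattern list per IPU."""
    if len(shard_patterns) != num_ipus:
        raise ValueError(
            f"expected {num_ipus} pattern lists, got {len(shard_patterns)}")
    placement = {}
    for name in node_names:
        placement[name] = 0  # nodes that match nothing go to IPU 0
        for ipu, patterns in enumerate(shard_patterns):
            if any(re.search(pattern, name) for pattern in patterns):
                placement[name] = ipu
                break  # assumption: the first matching IPU wins
    return placement


nodes = ["sharding0/dense/MatMul", "sharding1/dense/MatMul", "softmax"]
placement = assign_shards(nodes, [["^sharding0"], ["^sharding1"]], num_ipus=2)
print(placement)
```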
--matmul_amp float
The proportion of memory to use for temporary values, intermediate sums and so on for matrix multiplications. Must be a value between 0.0 and 1.0.
See IPUConfig.matmuls.poplar_options for more information. The technical note Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU provides more details and some practical examples of using availableMemoryProportion.
Default: 0.6.
--matmul_partial_type string
The type to use for intermediate values when doing a matrix multiply. This can be “float” (the default) or “half”.
See the Memory and Performance Optimisation on the IPU technical notes for some practical tips on selecting the type for partial results.
Default: “float”.
--merge_subgraphs
Merge multiple IPU subgraphs into one with the IPU compile() function.
--no_ipu_placement
If set, no nodes will be placed on IPUs.
--num_ipus integer
The number of IPUs that the SavedModel will use for inference. Default: 1.
--pipeline_cfg string
A JSON-format string that defines the configuration of the pipeline. See Pipeline configuration for more information.
--precision_conversion_excluded_nodes string1, string2, string3, ...
A list of nodes that will not have their precision changed by the --precision_mode option.
Default: none.
--precision_mode string
The precision of the output model. This can be either “FP16” or “FP32”.
Default: FP32
--remove_excluded_nodes
Remove the nodes listed in --excluded_nodes from the graph.
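When many convert options are needed, it is usually easier to keep them in a configuration file than on the command line. The sketch below writes a small configuration with Python's json module and assembles a matching command. The option names come from this section. The --dir, --output_dir and --tag_set arguments placed before the ipu keyword are assumed to follow the upstream TensorFlow SavedModel CLI, and all paths are placeholders.

```python
import json
import os
import shlex
import tempfile

# A subset of the convert options, using the names documented above.
config = {
    "batch_size": 1,
    "num_ipus": 2,
    "batch_per_step": 1,
    "matmul_amp": 0.6,
    "conv_amp": 0.6,
    "matmul_partial_type": "half",
    "conv_partial_type": "half",
    "precision_mode": "FP16",
    "precision_conversion_excluded_nodes": ["^add$"],
    "int64_to_int32_conversion": True,
}

# Write the configuration to a temporary directory.
config_path = os.path.join(tempfile.mkdtemp(), "ipu_convert.json")
with open(config_path, "w") as config_file:
    json.dump(config, config_file, indent=2)

cmd = [
    "ipu_saved_model_cli", "convert",
    "--dir", "/path/to/saved_model",         # assumed upstream argument
    "--output_dir", "/path/to/ipu_model",    # assumed upstream argument
    "--tag_set", "serve",                    # assumed upstream argument
    "ipu",
    "--config_file", config_path,
]
print(shlex.join(cmd))
```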
19.2.3. Pipeline configuration
The pipeline configuration specifies how to distribute the nodes in the model across the IPUs. It has three options:
auto: This automatically splits the model into several pipeline stages to optimise performance, searching for the minimum number of IPUs that the model should use.
manual: Specifies how to split the model into several pipeline stages with user-specified regular expressions.
load: Load the pipeline configuration from a file.
In general, you would use auto first, and then use load mode to adjust the configuration if you are not satisfied with the results.
The pipeline configuration is specified using the following options:
converter: specifies the mode of the pipeline converter. This must be one of auto, manual or load. Each of these has a set of configuration options, described below.
auto
fine_tune_iter (optional): The maximum number of iterations for fine tuning.
ipu_model: Run on the IPU Model (default: true). If false, the pipelined model will be run on IPUs.
profiling_root_dir (optional): The directory where the SavedModel tool will write the profiling file (default: ./pipeline-profiling).
priority (optional): The priority of the balancing strategy. Must be “cycle” (default) or “memory”.
“cycle”: balance the compute cycles for each pipeline stage
“memory”: balance the memory use for each pipeline stage
max_ipu_quantity (optional): The maximum number of IPUs that can be used (default: 64).
min_ipu_quantity (optional): The minimum number of IPUs that can be used (default: 2).
solution_dir (optional): The directory where the SavedModel tool will write the configuration file describing the pipeline it has created.
load
ipu_model: Run on the IPU Model (default: true).
profiling_root_dir (optional): The directory where the SavedModel tool will write the profiling file (default: ./pipeline-profiling).
solution_path (required): The file containing the pipeline configuration to be read by the SavedModel tool.
profiling_enable (optional): Enable profiling.
manual
ipu_model: Run on the IPU Model (default: true).
profiling_root_dir (optional): The directory where the SavedModel tool will write the profiling file (default: ./pipeline-profiling).
manual_pipeline_config (required): A list of regular expressions, one for each IPU, to match nodes to the IPUs in the pipeline. Nodes whose names match an expression will be placed on the corresponding IPU. Nodes which do not match any expression will be placed on IPU 0.
device_info (required): The mapping of pipeline stages to IPU targets (see the description of device mapping for more information).
solution_dir (optional): The directory where the SavedModel tool will write the configuration file describing the pipeline.
profiling_enable (optional): Enable profiling.
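The three converter modes take different option sets, and only some options are marked as required. The following sketch builds one pipeline configuration per mode as a Python dictionary and checks the required keys before serialising it for --pipeline_cfg. The check_pipeline_cfg helper is hypothetical, and the paths and regular expressions are placeholders.

```python
import json

# Required keys per converter mode, as listed above.
REQUIRED_KEYS = {
    "auto": set(),
    "load": {"solution_path"},
    "manual": {"manual_pipeline_config", "device_info"},
}


def check_pipeline_cfg(cfg):
    """Raise ValueError if the mode is unknown or a required key is missing."""
    mode = cfg.get("converter")
    if mode not in REQUIRED_KEYS:
        raise ValueError(f"converter must be one of {sorted(REQUIRED_KEYS)}")
    missing = REQUIRED_KEYS[mode] - set(cfg)
    if missing:
        raise ValueError(f"{mode} mode is missing {sorted(missing)}")
    return cfg


auto_cfg = {"converter": "auto", "priority": "cycle",
            "min_ipu_quantity": 2, "max_ipu_quantity": 64,
            "solution_dir": "/path/to/solution_dir"}
load_cfg = {"converter": "load",
            "solution_path": "/path/to/solution_file"}
manual_cfg = {"converter": "manual",
              "manual_pipeline_config": [["^left/"], ["^right/"]],
              "device_info": [0, 1]}

for cfg in (auto_cfg, load_cfg, manual_cfg):
    print(json.dumps(check_pipeline_cfg(cfg)))
```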
19.2.4. Pipeline development
For the auto pipeline optimization option, the SavedModel CLI tool will:
1. Run the model
2. Generate and analyse the profiling information
3. Find an optimal solution to split the model and save the result to the solution file
For step 2, the tool needs to get cycle information from the profile. However, it is possible that the model will raise an out of memory error. The tool avoids this by running the model on the IPU Model, which does not generate out of memory errors.
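The source does not describe how the tool searches for a split. To give a feel for what balancing compute cycles across pipeline stages means, here is an illustrative sketch that splits a sequence of per-node cycle counts into contiguous stages so that the most expensive stage is as cheap as possible. It is not the tool's algorithm, and the cycle counts are invented.

```python
def balance_stages(cycles, num_stages):
    """Split `cycles` into contiguous stages, minimising the largest stage sum.

    Returns a list of (start, end) index pairs, one per stage.
    """
    n = len(cycles)
    if not 1 <= num_stages <= n:
        raise ValueError("num_stages must be between 1 and len(cycles)")

    # prefix[i] is the total cost of the first i nodes.
    prefix = [0]
    for cost in cycles:
        prefix.append(prefix[-1] + cost)

    inf = float("inf")
    # best[k][i]: smallest achievable maximum stage cost when the first
    # i nodes are split into k stages; cut[k][i]: start of the last stage.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0
    for k in range(1, num_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j

    # Walk the cut table backwards to recover the stage boundaries.
    bounds = []
    end = n
    for k in range(num_stages, 0, -1):
        start = cut[k][end]
        bounds.append((start, end))
        end = start
    bounds.reverse()
    return bounds


node_cycles = [4, 3, 2, 6, 1]  # made-up per-node cycle estimates
stages = balance_stages(node_cycles, num_stages=2)
stage_costs = [sum(node_cycles[start:end]) for start, end in stages]
print(stages, stage_costs)
```

Balancing for memory would follow the same pattern with per-node memory estimates in place of cycle counts.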
The auto option has the following limitations:
The design does not consider other data types like tf.resource.
It does not support the INT64 data type.
The SavedModel input needs to be frozen.
The search for the number of IPUs starts at 2, because a model that fits on a single IPU does not need pipelining.
The input tensor shape needs to be fixed, excluding batch size.
It cannot handle control_dependency nodes.
The first dimension of input tensors must be micro_batch_size.
19.2.5. Pipeline solution file
The pipeline solution file is a JSON definition of how pipeline stages are mapped to IPUs. This is generated by the auto and manual options, and read by the load option.
{
"device_mapping": [0, 1],
"pipeline_mapping": {
<node name>: <pipeline stage id>,
<node name>: <pipeline stage id>
}
}
Where:
device_mapping is a list of length equal to the number of computational stages. Each element in the list specifies the ID of the IPU that the corresponding pipeline stage will be placed on.
pipeline_mapping specifies which nodes will be mapped to each pipeline stage.
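To see how the two fields combine, the following sketch parses a small solution in this format and resolves each node to the IPU it would run on. The node names are hypothetical, and the JSON is embedded as a string so that the example is self-contained.

```python
import json

# A hypothetical solution with two pipeline stages on two IPUs.
solution_text = """
{
  "device_mapping": [0, 1],
  "pipeline_mapping": {
    "dense_0/MatMul": 0,
    "dense_1/MatMul": 1
  }
}
"""

solution = json.loads(solution_text)

# Node -> pipeline stage -> IPU ID.
node_to_ipu = {
    node: solution["device_mapping"][stage]
    for node, stage in solution["pipeline_mapping"].items()
}
print(node_to_ipu)
```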
19.2.6. Example configuration file
{
  "batch_size": 1,
  "num_ipus": 1,
  "batch_per_step": 1,
  "matmul_amp": 0.6,
  "matmul_partial_type": "half",
  "conv_amp": 0.6,
  "conv_partial_type": "half",
  "no_ipu_placement": false,
  "excluded_nodes": [
    "^strided_slice_1$",
    "NotEqual",
    "Assert"
  ],
  "remove_excluded_nodes": true,
  "merge_subgraphs": true,
  "precision_mode": "FP16",
  "precision_conversion_excluded_nodes": [
    "^add$"
  ],
  "int64_to_int32_conversion": true,
  "gelu_replacement": {
    "nodes": [
      "intermediate/dense/Sqrt$",
      "intermediate/dense/truediv$",
      "intermediate/dense/Erf$",
      "intermediate/dense/add$",
      "intermediate/dense/mul$",
      "intermediate/dense/mul_1$",
      "intermediate/dense/Sqrt/x$",
      "intermediate/dense/add/x$",
      "intermediate/dense/mul/x$"
    ],
    "node_as_gelu_input": [
      "encoder/layer_[0-9]*/intermediate/dense/BiasAdd"
    ],
    "node_use_gelu_output": [
      "encoder/layer_[0-9]*/output/dense/MatMul"
    ]
  },
  "manual_sharding": [
    [
      "^sharding0"
    ],
    [
      "^sharding1"
    ]
  ],
  "pipeline_cfg": {
    // auto pipeline configuration.
    "converter": "auto",
    "fine_tune_iter": 5,
    "ipu_model": true,
    "max_ipu_quantity": 64,
    "min_ipu_quantity": 2,
    "priority": "cycle",
    "profiling_root_dir": "/path/to/profiling_root_dir",
    "solution_dir": "/path/to/solution_dir",
    // pipeline configuration loader configuration.
    "converter": "load",
    "ipu_model": true,
    "profiling_root_dir": "profiling",
    "solution_path": "solution/greedy_search.pipeconfig",
    "profiling_enable": false,
    // manual pipeline configuration.
    "converter": "manual",
    "ipu_model": true,
    "profiling_root_dir": "profiling",
    "solution_dir": "solution",
    "manual_pipeline_config": [
      [
        "input_3",
        "^middle/unit_0",
        "^middle/unit_1",
        "^middle/unit_2/",
        "^middle/unit_3",
        "^middle/unit_4"
      ],
      [
        "^middle/unit_5",
        "input_1",
        "input_2",
        "^right/unit_0",
        "^right/unit_1",
        "^right/unit_2",
        "^left/unit_0"
      ],
      [
        "^left/unit_1",
        "^left/unit_2",
        "^left/unit_3",
        "^left/unit_4",
        "concat",
        "^res/unit_0/"
      ],
      [
        "^res/unit_1",
        "^res/unit_2",
        "^res/down/",
        "^res/add"
      ]
    ],
    "device_info": [
      0,
      1,
      1,
      0
    ],
    "profiling_enable": true
  },
  "embedded_runtime_save_config": {
    "runtime_api_timeout_us": 5000,
    "embedded_runtime_exec_cachedir": "bert_poplar_exec"
  }
}
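The pipeline_cfg block in the example shows the three converter modes side by side, separated by // comments, so it repeats keys such as converter. Standard JSON parsers reject the comments, and most keep only the last value of a repeated key. Whether the tool itself accepts a file in this form is not stated here, so a real configuration file should probably keep a single mode. The sketch below shows what Python's json module makes of such a fragment once full-line comments are stripped. The load_commented_json helper is hypothetical.

```python
import json


def load_commented_json(text):
    """Parse JSON after dropping full-line // comments.

    Returns the parsed object and a list of keys that appeared more
    than once within a single JSON object.
    """
    lines = [line for line in text.splitlines()
             if not line.lstrip().startswith("//")]
    duplicates = []

    def record_pairs(pairs):
        seen = {}
        for key, value in pairs:
            if key in seen:
                duplicates.append(key)
            seen[key] = value  # the last occurrence wins
        return seen

    parsed = json.loads("\n".join(lines), object_pairs_hook=record_pairs)
    return parsed, duplicates


fragment = """
{
  "pipeline_cfg": {
    // auto pipeline configuration.
    "converter": "auto",
    "fine_tune_iter": 5,
    // pipeline configuration loader configuration.
    "converter": "load",
    "solution_path": "solution/greedy_search.pipeconfig"
  }
}
"""

parsed, duplicates = load_commented_json(fragment)
print(parsed["pipeline_cfg"]["converter"], duplicates)
```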