11. Custom IPU operations
There are three mechanisms for providing custom operations to the IPU through the TensorFlow interface. The first uses a fully custom codelet and a host-side build function. The second is a custom operation which is executed on the CPU. The third is a custom, fused elementwise arithmetic operation. In this last case, the gradient creation in the optimisers will not produce a gradient operation for the custom operation.
11.1. Fully customised IPU operations
You can provide a custom operation to be compiled into the Poplar executable and run on the IPU hardware. You must provide a host-side shared object library that implements the action of adding vertices to a Poplar graph, given some Poplar tensor inputs. You can optionally provide a Poplar source code or binary file containing one or more “codelets” (code that runs on the IPU).
For more information about writing codelets, please refer to the Poplar and PopLibs User Guide.
These operations are added with ipu.custom_ops.precompiled_user_op. See tensorflow.python.ipu.custom_ops.precompiled_user_op() for details.
An example of this is shown below.
The shared object file must contain an undecorated symbol that should be declared as shown below. It should add vertices to the graph that perform the custom operation. The name of the symbol should match the name of the operation in the graph. By default these operations are called Build.
1extern "C"
2poplar::program::Program Build(
3 poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
4 std::vector<poplar::Tensor>& outputs, const std::string &attributes,
5 const std::string &debug_prefix)
The arguments are:
graph
: the Poplar graph into which to add tensors and vertices.

inputs
: a vector of Poplar tensors which are inputs to the operation.

outputs
: a vector into which to store the outputs of the operation. The vector will contain zero entries when the Build function is called.

attributes
: a string which was passed as the attributes argument to the Python operation. See Operation attributes for more details.

debug_prefix
: the debug name that has been given to the operation in the TensorFlow graph.
If the operation can have its gradient taken, then the shared object can contain a separate function that builds the gradient operation. This function must be given the same name as the forward operation with _grad appended. The signature of the gradient builder is slightly different, as it takes the forward pass inputs and outputs as arguments, as well as the gradient outputs. Gradient builders have their own metadata functions, named in the same way: for example, the metadata function for the gradient builder below is Build_grad_metadata.
1extern "C"
2poplar::program::Program Build_grad(
3 poplar::Graph& graph, int input_grad_index,
4 const std::vector<poplar::Tensor>& gradients,
5 const std::vector<poplar::Tensor>& fwd_inputs,
6 const std::vector<poplar::Tensor>& fwd_outputs,
7 std::vector<poplar::Tensor>& outputs,
8 const std::string& attributes, const std::string& debug_prefix)
The arguments are:
graph
: the Poplar graph into which to add tensors and vertices.

input_grad_index
: the index of the input for which this operation is producing the partial derivative. If the gradient operation calculates all of the partial derivatives, then this input should be ignored.

gradients
: the inputs to the gradient operation, from the previous gradient operation or loss.

fwd_inputs
: the tensors which are the inputs to the forward operation.

fwd_outputs
: the tensors which are the outputs of the forward operation.

outputs
: the outputs of this gradient operation. There must be one per input of the original forward operation. Inputs which are not differentiable can have a null Poplar tensor.

attributes
: a string which was passed as the gradient_attributes argument to the Python operation. See Operation attributes for more details.

debug_prefix
: the name of the operation.
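For illustration, here is a minimal sketch of a gradient builder for an elementwise add operation (such as the in-place add example later in this section), assuming the separate_gradients parameter described in Gradient operations is set to False:

#include <poplar/Graph.hpp>

extern "C" poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix) {
  // One output per forward-pass input. For x + y, both partial
  // derivatives are simply the incoming gradient.
  outputs.push_back(gradients[0]);
  outputs.push_back(gradients[0]);
  // No vertices are needed, so return an empty program.
  return poplar::program::Sequence();
}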
11.1.1. Metadata
The shared object file can optionally contain an undecorated symbol that is the same as the builder function name with _metadata appended. This function must have the following signature:
1extern "C"
2void Build_metadata(std::vector<std::int64_t>& allocating_indices,
3 std::uint32_t& num_inplace, bool& is_elementwise,
4 bool& is_stateless, std::uint32_t num_inputs)
The arguments are:
allocating_indices
: indicates which of the inputs should be allocated using the tensor allocation function. See the description in Tensor allocation.

num_inplace
: indicates the number of inputs which are ‘in place’. The first num_inplace of the inputs will be considered to be in-place.

is_elementwise
: indicates that this operation is elementwise.

is_stateless
: indicates that this operation is stateless. Custom operations are stateful by default.

num_inputs
: indicates how many inputs the operation has.
The function should fill in the values of the first four arguments, which are all reference types.
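For illustration, a metadata function for a hypothetical two-input elementwise operation, which updates its first input in place and holds no internal state, might look like this sketch:

#include <cstdint>
#include <vector>

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();  // No inputs need the tensor allocation function.
  num_inplace = 1;             // The first input is updated in place.
  is_elementwise = true;       // The output matches the layout of input 0.
  is_stateless = true;         // The outputs depend only on the inputs.
}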
11.1.2. In-place operations
If an operation does an in-place modification of an input tensor, as opposed to creating a new output tensor, then num_inplace can be used to indicate that this is the case. The system will ensure that when a tensor is updated in place, any other uses of that tensor are complete before the operation is run.

If a tensor is not marked as in place then the operation must not modify it. If it is modified, then other operations which consume it may see an incorrect value on their input.
When updating tensors in place, you need to ensure that TensorFlow sees an assignment of the tensor, otherwise the modified input tensor update will not “stick”. This means that the in-place inputs must always be returned as outputs of the custom operation, and if a tf.Variable was modified in place it has to be assigned back to itself with tf.assign. This might look something like the following:
import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config()
# cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])


def add_op(x, y):
  outputs = {
      "output_types": [tf.float32],
      "output_shapes": [tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_add_inplace.so")

  o = ipu.custom_ops.precompiled_user_op([x, y], lib_path, outs=outputs)
  return o


def my_net(x):
  inplace = tf.get_variable("weights",
                            shape=[4],
                            initializer=tf.zeros_initializer())

  # Even though the custom op is in place, TF still needs to see an assignment.
  inplace_add = tf.assign(inplace, add_op(inplace, x)[0])
  with tf.control_dependencies([inplace_add]):
    return inplace


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)
And the associated custom op:
/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <poplar/Graph.hpp>
#include <popops/Cast.hpp>
#include <popops/ScaledAdd.hpp>
#include <poputil/exceptions.hpp>

extern "C" {
int32_t custom_op_api_level = 2;
}

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 1;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("add requires 2 inputs.");
  }

  auto left = inputs[0];
  auto right = inputs[1];

  if (left.shape() != right.shape()) {
    throw poputil::poplibs_error("Inputs must have identical shapes.");
  }

  poplar::program::Sequence prog;
  popops::scaledAddTo(graph, left, right, 1.0, prog,
                      debug_prefix + "/custom_add_inplace");
  outputs.push_back(left);
  return prog;
}
11.1.3. Elementwise operations
The IPU driver can do a better job of allocating the layout of Poplar tensors if it can associate them with specific operations. If the output of an operation is the same shape and layout as its first input, then it should be marked as elementwise.
Typically, the graph building code for the operation will clone the input in order to generate the output Poplar tensor.
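As a minimal sketch of this pattern (the negate operation and the popops call here are illustrative, not part of the interface), an elementwise builder might clone its first input, copy the data and then operate on the clone in place:

#include <poplar/Graph.hpp>
#include <popops/ElementWise.hpp>

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debug_prefix) {
  poplar::program::Sequence prog;
  // Clone the input so the output inherits its tile mapping, which is
  // what marking the operation as elementwise promises to the backend.
  poplar::Tensor out = graph.clone(inputs[0], debug_prefix + "/out");
  prog.add(poplar::program::Copy(inputs[0], out));
  popops::mapInPlace(graph, popops::expr::UnaryOpType::NEGATE, out, prog,
                     debug_prefix + "/negate");
  outputs.push_back(out);
  return prog;
}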
11.1.4. Tensor allocation
When generating the Poplar graph, sometimes the backend has the freedom to allocate an input to an operation. This happens when an input to an operation is also an input to the graph, or when previous operations do not put constraints on the input tensor.
If this condition occurs, then by default the backend will create the Poplar tensor with linear mapping. See the section on tile mapping in the Poplar and PopLibs API Reference.
To override this behaviour and allocate a tensor using a specific layout mapping, the custom operation can provide a function with the following signature:
1extern "C" poplar::Tensor Build_allocator(
2 poplar::Graph& graph, std::uint32_t operand,
3 const std::vector<size_t>& shape, poplar::Type type,
4 const std::string& attributes, const std::string& debug_prefix)
The arguments are:
graph
: the Poplar graph where the tensor should be created.

operand
: the operand number of the input to allocate.

shape
: the shape of the tensor.

type
: the Poplar data type for the tensor.

attributes
: a string which was passed as the attributes or gradient_attributes argument to the Python operation (depending on whether this function corresponds to the forward or gradient operation). See Operation attributes for more details.

debug_prefix
: the name of the operation.
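As a minimal sketch, an allocator might create the tensor and choose a mapping for it. The linear mapping here is a placeholder; a real operation would pick a layout suited to the vertices it adds, as the matmul example in Operation attributes does:

#include <poplar/Graph.hpp>
#include <poputil/TileMapping.hpp>

extern "C" poplar::Tensor Build_allocator(
    poplar::Graph& graph, std::uint32_t operand,
    const std::vector<size_t>& shape, poplar::Type type,
    const std::string& attributes, const std::string& debug_prefix) {
  // Create the input tensor and spread it linearly across the tiles.
  poplar::Tensor t = graph.addVariable(type, shape, debug_prefix + "/input");
  poputil::mapTensorLinearly(graph, t);
  return t;
}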
11.1.5. Gradient operations
As described above, when the gradient of the forward operation is generated, either a single operation or multiple operations can be inserted into the graph.

You can use the separate_gradients parameter on the precompiled_user_op function to select which of the two options is required. The compiled code must match this setting.

If the separate_gradients parameter is set to False, then the compiled function for generating the gradient operation should fill in one output for each of the inputs of the forward pass function. Each output should be the partial derivative with respect to one of the inputs.

If the separate_gradients parameter is True, then the gradient operation building function should produce an operation with a single output, which is the partial derivative with respect to only one of the forward pass inputs. The specific input will be given by the input_grad_index argument of the call to the shared object's Build_grad function.
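For example, a minimal sketch of the separate_gradients=True case for an elementwise add operation might look like the following. It produces a single output, and because both partial derivatives of an add equal the incoming gradient, this sketch does not need to branch on input_grad_index (a real operation usually would):

#include <poplar/Graph.hpp>

extern "C" poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix) {
  // Produce only the partial derivative selected by input_grad_index.
  // For x + y, d/dx and d/dy are both the incoming gradient, so the
  // index does not change the result here.
  outputs.push_back(gradients[0]);
  return poplar::program::Sequence();
}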
11.1.6. Stateless operations
If an operation’s outputs depend only on the values of its inputs, and not on any internally stored state, then the operation is said to be stateless. Marking an operation as stateless in the metadata function will allow the TensorFlow backend to perform optimisations which would otherwise be disallowed, such as common code removal.
11.1.7. Operation attributes
If an operation requires some data which is not available when compiling the C++ Poplar function, then the string attributes argument can be used to pass such information from the Python level operation to the C++ function. Since the attributes argument is a string object, any data format which can be serialized/deserialized to/from a string, such as JSON, can be used.

In the following example we add a custom operation which performs a serialized matrix-matrix multiplication, where we use the attributes argument to pass information about the serialization, encoded using the JSON data format, to the C++ function.
/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <cassert>
#include <memory>

#include <poplar/Graph.hpp>
#include <poplin/MatMul.hpp>
#include <popops/ElementWise.hpp>
#include <poputil/exceptions.hpp>

// Use the https://github.com/open-source-parsers/jsoncpp JsonCpp parser
#include "include/json/json.h"

extern "C" {
int32_t custom_op_api_level = 2;
}

namespace {
Json::Value ParseAttributes(const std::string& attributes) {
  // Parse JSON.
  Json::CharReaderBuilder builder;
  std::string errs;
  Json::Value parsed_json;
  std::unique_ptr<Json::CharReader> reader(builder.newCharReader());
  bool parsed =
      reader->parse(attributes.c_str(), attributes.c_str() + attributes.size(),
                    &parsed_json, &errs);
  assert(parsed && errs.empty());
  return parsed_json;
}

std::vector<size_t> GetVectorFromJson(Json::Value& val) {
  std::vector<size_t> result;
  result.reserve(val.size());
  for (auto a : val) {
    result.push_back(a.asUInt64());
  }
  return result;
}
}  // namespace

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices = {0, 1};
  num_inplace = 0;
  is_elementwise = false;
}

extern "C" poplar::Tensor Build_allocator(poplar::Graph& graph,
                                          std::uint32_t operand,
                                          const std::vector<size_t>& shape,
                                          poplar::Type type,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  assert(operand < 2);
  // Parse JSON and get the expected attributes.
  Json::Value json = ParseAttributes(attributes);
  const int serialization_factor = json["serialization_factor"].asInt();
  std::vector<std::size_t> lhs_shape = GetVectorFromJson(json["lhs_shape"]);
  std::vector<std::size_t> rhs_shape = GetVectorFromJson(json["rhs_shape"]);

  // Verify shapes and adjust them to be slice shapes.
  assert(lhs_shape.size() == 2);
  assert(rhs_shape.size() == 2);

  assert(lhs_shape[1] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of LHS shape");
  lhs_shape[1] /= serialization_factor;

  assert(rhs_shape[0] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of RHS shape");
  rhs_shape[0] /= serialization_factor;

  // Allocate the slice.
  poplar::Tensor slice;
  if (operand == 0) {
    // Allocating for lhs - allocate the slice.
    slice = poplin::createMatMulInputLHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/LHS");
  } else {
    assert(operand == 1);
    slice = poplin::createMatMulInputRHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/RHS");
  }

  // Clone the slice for each serialized matrix multiply.
  std::vector<poplar::Tensor> slices(serialization_factor);
  slices[0] = slice;
  for (int i = 1; i != serialization_factor; ++i) {
    slices[i] = graph.clone(slice);
  }

  // Concatenate the slices into a single tensor - the concatenation dimension
  // depends on the operand which is being allocated.
  poplar::Tensor t = poplar::concat(slices, operand == 0 ? 1 : 0);
  return t;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("Serialized matmul requires 2 inputs.");
  }
  Json::Value json = ParseAttributes(attributes);
  poplar::program::Sequence seq;
  poplar::Tensor lhs = inputs[0];
  poplar::Tensor rhs = inputs[1];
  poplar::Tensor output;

  const int serialization_factor = json["serialization_factor"].asInt();
  const int slice_size = lhs.dim(1) / serialization_factor;
  for (int i = 0; i != serialization_factor; ++i) {
    // Slice out the parts of the matmul.
    poplar::Tensor lhs_slice =
        lhs.slice(i * slice_size, (i + 1) * slice_size, 1);
    poplar::Tensor rhs_slice =
        rhs.slice(i * slice_size, (i + 1) * slice_size, 0);
    // Do the partial matmul.
    poplar::Tensor partial_matmul = poplin::matMul(
        graph, lhs_slice, rhs_slice, seq, debug_prefix + "/Slice");

    // Accumulate the results from partial matmuls.
    if (i == 0) {
      output = partial_matmul;
    } else {
      popops::addInPlace(graph, output, partial_matmul, seq,
                         debug_prefix + "/Add");
    }
  }
  outputs = {output};
  return seq;
}
This is then executed with:
import os
import json
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
lib_path = os.path.join(base_path, "libtutorial_attributes_example.so")


def my_net(x, y):
  x_shape = x.get_shape().as_list()
  y_shape = y.get_shape().as_list()
  outputs = {
      "output_types": [x.dtype],
      "output_shapes": [tf.TensorShape([x_shape[0], y_shape[1]])],
  }

  # We create a matmul operation, which we want to perform as two serialized
  # matmuls. We also record all the input shapes.
  attributes = {
      "serialization_factor": 2,
      "lhs_shape": x_shape,
      "rhs_shape": y_shape
  }
  attributes_json = json.dumps(attributes)

  o = ipu.custom_ops.precompiled_user_op([x, y],
                                         lib_path,
                                         attributes=attributes_json,
                                         outs=outputs)

  return o


with tf.device("cpu"):
  x_ph = tf.placeholder(np.float32, [128, 1024])
  y_ph = tf.placeholder(np.float32, [1024, 64])

with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_ph, y_ph])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_ph: np.full(x_ph.shape, 10.0),
                        y_ph: np.full(y_ph.shape, 12.0),
                    })

  print(result)
11.1.8. Example
This example shows the source file for a rotate operation, which takes three vectors and rotates the x and y ones by the angle one:
/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 2;
}

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 0;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor. We will map the vertices so
   * that they match the layout of the 'x' input tensor (input[0]). If the 'x'
   * tensor was laid out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which describes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the inputs tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               {{"x_out", xOutputFlat.slices(regions)},
                                {"y_out", yOutputFlat.slices(regions)},
                                {"x_in", xFlat.slices(regions)},
                                {"y_in", yFlat.slices(regions)},
                                {"angle", aFlat.slices(regions)}});

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setCycleEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}
This is the associated codelet file:
#include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate the tensors 'x' and 'y', by the angle (radians) in the
 * tensor 'angle', around the origin.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;
This is an example of it in use:
import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])
  y_data = tf.placeholder(np.float32, [4])
  p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
  outputs = {
      "output_types": [tf.float32, tf.float32],
      "output_shapes": [tf.TensorShape([4]),
                        tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
  gp_path = os.path.join(base_path, "custom_codelet.gp")

  o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                         lib_path,
                                         gp_path,
                                         outs=outputs)
  return o


def my_net(x, y, a):
  return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_data: [2., 4., 6., -1.],
                        y_data: [2., 3., 8., -1.],
                        p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                    })

  print(result)
When compiling the host-side shared object file, it is not necessary to include or link against any TensorFlow header or library files. Only the Poplar headers and link libraries should be necessary.
11.2. Fully customised CPU operations
The framework also allows a custom operation that executes code on the CPU instead of on the IPU. You must write a shared object containing a callback function, much like the builder function of the device-side custom operation. The signature of this function should be:
1extern "C" void Callback(const std::vector<void*>& data,
2 const std::vector<std::uint32_t>& number_of_elements,
3 std::vector<void*>& outputs,
4 const std::string& name);
The arguments are:
data
: the input data. The function should be written to expect a certain data type, so the void pointer can be cast into the expected type.

number_of_elements
: indicates the number of elements in the input data.

outputs
: should be filled in by the operation.

name
: the name of the operation within the XLA/HLO graph.
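For illustration, a minimal sketch of a callback which adds two float vectors might look like the following. It assumes the operation was declared with two float inputs and a single float output, and that the output buffers have already been allocated by the framework:

#include <cstdint>
#include <string>
#include <vector>

extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name) {
  // Cast the untyped buffers to the type the operation was declared with.
  const float* lhs = static_cast<const float*>(data[0]);
  const float* rhs = static_cast<const float*>(data[1]);
  float* out = static_cast<float*>(outputs[0]);
  for (std::uint32_t i = 0; i != number_of_elements[0]; ++i) {
    out[i] = lhs[i] + rhs[i];
  }
}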
11.3. Custom elementwise expressions
The Python function ipu.custom_ops.codelet_expression_op provides an interface for giving a custom fused expression to the compiler. This will be encoded into a single compute set. See tensorflow.python.ipu.custom_ops.codelet_expression_op() for details.
The arguments to the Python function are a callable Python function which encodes the arithmetic expression, and the tensor arguments to the operation.
For instance:
def my_custom_op(x, y, z):
  return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)
In this example, the Python function my_custom_op provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.
Python operators which are supported in the function are +, -, * and abs.
11.4. API level versioning
// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 2;
}
You must include the code above in your source code. The custom op loader checks the API level of the custom op and refuses to load if it does not match the current API level. A different API level normally means that it is binary incompatible with the previous version.
11.4.1. Changes in API level
API level 2:

is_stateless
: has been added to the metadata function.

attributes
: a string argument has been added to the allocation and build functions, which allows passing of user-defined attributes to the operation (and its gradient operation, if present).