14. Writing custom operations
If TensorFlow for the IPU does not implement an operation that you need, there are two ways you can add a custom operation to the TensorFlow graph.
You can implement the operation in C++ using the Poplar graph programming framework. See Custom operation on the IPU.
This provides the highest performance because the operation runs on the IPU.
The second possibility is to execute the custom operation on the host CPU. See Custom host CPU operations.
This may be easier to implement because you only need to write host code, without needing to get to grips with Poplar. However, the performance will be lower because it does not exploit the parallelism available on the IPU, and because the data has to be moved from the IPUs to the host and back.
Note
In the rest of this chapter, “custom op” or “op” will be used to refer specifically to the new custom operation made available in the TensorFlow code. The word “operation” will be used more generally to talk about the implementation of this custom op.
14.1. Custom operation on the IPU
To create a custom op on the IPU, you need to write a Poplar program that performs the required functions on the input tensors. After compiling this code, you can load it into your TensorFlow program to create a custom op, which can then be used in your TensorFlow model in the same way as any other op.
The following sections provide more detail on these steps.
14.1.1. Building the Poplar graph
The custom op is defined in a C++ program that populates a graph with a
poplar::program::Program object containing the operations to be performed on the
input tensors.
The Poplar and PopLibs libraries provide a rich set of functions optimised for
the IPU. You can also add your own functionality as “codelets”, which contain
C++ code compiled for, and executed on, the IPU.
For more information about writing Poplar graph programs and codelets, refer to the Poplar and PopLibs User Guide and the Poplar tutorials on the Graphcore GitHub tutorials repository.
Your program must contain a function to build the graph, which will be called from TensorFlow when you instantiate the custom op. This has the following signature:
extern "C"
poplar::program::Program Build(
    poplar::Graph& graph,
    const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes,
    const std::string& debug_prefix)
The default name for the function is Build(). If you want to use a different
name (because you have multiple custom ops, for example), you can specify the name of
the function when importing the program into TensorFlow. See the definition of
the tensorflow.python.ipu.custom_ops.precompiled_user_op() function for details.
Note
The extern "C" declaration is required to ensure that the compiler does
not change the function name (C++ compilers will normally modify, or
“decorate”, function names to encode extra information about the function).
The parameters to Build() are:
graph: A Poplar graph to add the Program object and tensors to, in order to implement the operation.
inputs: A vector of tensors which are inputs to the operation. These are passed as the input arguments to the custom op when it is called in TensorFlow.
outputs: A vector of tensors that are the outputs of the operation. These will be returned as the result of the custom op in TensorFlow. This vector will initially be empty, so you will need to add result tensors to it.
attributes: A string which is passed as the attributes argument to the custom op in TensorFlow. See Operation attributes for more details.
debug_prefix: The debug name that is passed to the custom op in TensorFlow.
The Build() function returns the program object that it added to the graph.
14.1.2. Gradient builders
If the op is required for training, then you must also implement a function that
builds a Poplar graph for the gradient operation. This has the same name as the
forward-operation builder with _grad appended.
The signature of the gradient builder function is:
extern "C"
poplar::program::Program Build_grad(
    poplar::Graph& graph,
    int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes,
    const std::string& debug_prefix)
The parameters to Build_grad() are:
graph: A Poplar graph to add the Program object and tensors to, in order to implement the operation.
input_grad_index: The index of the input tensor to calculate the partial derivative for.
  You can choose to implement a gradient operation that calculates the partial derivatives for all tensors or for one tensor at a time. In the latter case, you need to set separate_gradients to True when you call precompiled_user_op().
  There may be advantages in calculating all the gradients at the same time; for example, if there are common sub-expressions. On the other hand, this removes the ability for TensorFlow to do some optimisations, such as dead-code elimination if not all of the gradients are required.
  If the separate_gradients parameter is set to False, then your function for generating the gradient operation must populate one output tensor for each of the inputs of the forward pass function. Each output must be the partial derivative with respect to one of the inputs.
  If the separate_gradients parameter is True, then the gradient operation building function must produce an operation with a single output, which is the partial derivative with respect to only one of the forward pass inputs. The specific input will be given by the input_grad_index argument to the Build_grad() function.
  If your gradient operation calculates all of the partial derivatives, then you can ignore the input_grad_index parameter.
gradients: The inputs to the gradient operation, from the previous gradient operation or loss.
fwd_inputs: The input tensors to the forward-pass operation.
fwd_outputs: The output tensors from the forward-pass operation.
outputs: The outputs from this gradient operation. Unless separate_gradients is True, there must be one per input of the forward operation. Inputs which are not differentiable can be assigned a “null” Poplar tensor (that is, one created with the default Tensor constructor and containing no data).
attributes: A string which is passed as the gradient_attributes argument to the custom op when called from TensorFlow. See Operation attributes for more details.
debug_prefix: The name of the operation.
The Build_grad() function returns the program object that it added to the graph.
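The two gradient modes can be illustrated with a small host-side model. This is only a sketch in plain Python, not Poplar code: the function build_grad_model and the choice of forward operation (an elementwise product z = x * y) are assumptions made for illustration. It shows that a builder called once with separate_gradients set to False produces one output per forward input, while a builder called once per input with separate_gradients set to True produces the single output selected by input_grad_index.

```python
def partial_derivative(index, g, x, y):
    """Partial derivative of z = x * y with respect to input `index`,
    scaled by the incoming gradient g (all values are plain lists)."""
    other = y if index == 0 else x
    return [gi * oi for gi, oi in zip(g, other)]


def build_grad_model(input_grad_index, gradients, fwd_inputs, separate_gradients):
    """Hypothetical host-side model of a gradient builder (not a Poplar API)."""
    (g,) = gradients
    x, y = fwd_inputs
    if separate_gradients:
        # Single output: only the derivative selected by input_grad_index.
        return [partial_derivative(input_grad_index, g, x, y)]
    # One output per forward input; input_grad_index is ignored.
    return [partial_derivative(i, g, x, y) for i in range(len(fwd_inputs))]


g, x, y = [1.0, 1.0], [2.0, 3.0], [5.0, 7.0]
all_at_once = build_grad_model(0, [g], [x, y], separate_gradients=False)
one_by_one = [
    build_grad_model(i, [g], [x, y], separate_gradients=True)[0] for i in range(2)
]
print(all_at_once == one_by_one)
```

Both modes yield the same derivatives; the difference is only in how many outputs each call must populate.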
14.1.3. Metadata
You can also specify extra information about the custom op by including a
metadata function in the object file. This has the same name as the builder
function with _metadata appended.
This function has the following signature:
extern "C"
void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise,
    bool& is_stateless,
    bool& is_hashable,
    std::uint32_t num_inputs)
The parameters are used to return the following information about the operation:
allocating_indices: Use this to specify which input tensors will be allocated using the tensor-allocation function described in Tensor allocation.
replica_identical_output_indices: Experimental. Use this to specify which output tensors are identical across replicas. The compiler uses this to help provide deterministic behaviour when running with replication and performing stochastic rounding.
  An empty vector means that no tensors are identical across replicas.
input_to_output_tensor_aliasing: Use this map to indicate if any of the input and output tensors alias. The keys and values in the map are the vector indexes of the tensors. For example, a mapping from 1 to 0 indicates that input tensor 1 is aliased with output tensor 0. This means that poplar::Tensor::intersectsWith() would return true when called for these tensors.
  Providing information about whether an input tensor aliases an output tensor allows the TensorFlow graph compiler to perform more optimisation. It also ensures that if an input tensor is updated in-place and used as an output, then any other uses of that tensor will be completed before this operation is run, to ensure correct behaviour. See In-place operations for an example of using this for an in-place operation.
  If an input tensor is not mapped to an output tensor, then the operation must not modify that input tensor. If it is modified, then other operations which use it as an input may be passed incorrect values.
is_elementwise: Set this to true if the output of an operation is the same shape and layout as its first input. (This parameter was originally used to tell the compiler that an operation was elementwise. However, its meaning has changed to indicate any operation where the compiler can perform optimisations based on matching the input and output tensors.)
  In this case, your graph-building code for the operation will typically clone the input in order to generate the output tensor.
is_stateless: Set this to true if this operation is “stateless”.
  If an operation’s outputs depend only on the value of their inputs, and not any internally stored state, then the operation is said to be stateless. Marking an operation as stateless will allow the TensorFlow backend to perform optimisations which would otherwise not be possible, such as common code removal. It also allows the custom op to be used with recomputation.
  Custom ops are stateful by default.
is_hashable: Set this to true if this operation can be uniquely hashed.
  In order to detect when code changes and needs to be recompiled, the TensorFlow compiler will generate a hash value for the TensorFlow graph. If all ops in the graph are hashable then the executable will be saved in the cache (if enabled). This allows the graph to be run multiple times without needing to recompile it. See Caching of compiled executables for more information.
  However, because the TensorFlow compiler does not have any information about the implementation of the custom operation or its dependencies, the compiler will by default treat it as non-hashable, so the TensorFlow program will be recompiled every time it is run.
  If you can guarantee that the custom operation and its dependencies will not change then you can set this parameter to true.
  This attribute must be set to true if you intend to pre-compile your TensorFlow program (see Pre-compiling executables).
num_inputs: This is the number of input tensors that the operation is called with.
If you use the metadata function to specify some information about the custom operation, then you must set the values of all the parameters even if you are using the default values.
Gradient builders have their own metadata functions. These are named after the
gradient builder function with _metadata appended. For example:
Build_grad_metadata().
14.1.4. Compiling the IPU code
API level
You need to specify the API level that your operation code is compatible with. The custom op loader checks the API level and will not load it if it does not match the current API level. A change in API level normally means that the file is not compatible with previous versions. See API level changes for information about the changes in the API.
You must include the following code in your builder program to specify the API level.
// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 5;
}
API level | Changes to the API
---|---
1 |
2 | The
3 |
4 |
5 |
PopLibs library code
You need to explicitly add the IPU code for any PopLibs libraries that you use.
For example, if your code uses the popops and poprand
libraries, then you need to include the following in your builder code:
#include <popops/codelets.hpp>
#include <poprand/codelets.hpp>

extern "C"
poplar::program::Program Build_grad(poplar::Graph& graph,
                                    int input_grad_index,
                                    const std::vector<poplar::Tensor>& gradients,
                                    const std::vector<poplar::Tensor>& fwd_inputs,
                                    const std::vector<poplar::Tensor>& fwd_outputs,
                                    std::vector<poplar::Tensor>& outputs,
                                    const std::string& attributes,
                                    const std::string& debug_prefix) {

  ... // create the program object in the graph

  popops::addCodelets(graph);
  poprand::addCodelets(graph);
}
Compiling the library file
The code has to be compiled to create a shared-library object file.
For example, if you have a source file called poplar_code.cpp that contains
the Build() function, you can use the following command line to generate
a library file called libcustom_op.so:
$ g++ poplar_code.cpp -shared -fpic -o libcustom_op.so -lpoplar -lpoputil -lpoprand
Note that you also need to link the Poplar and PopLibs libraries that you use
(in this example poplar, poputil and poprand). See the Poplar and
PopLibs API Reference for more information.
It is not necessary to include or link against any TensorFlow header or library files. Only the Poplar and PopLibs headers, and the corresponding libraries are required.
14.1.5. Using the custom op in TensorFlow
You can call the custom operation from TensorFlow with
precompiled_user_op(). This
specifies the library file containing the custom operation code, the input and
output tensors, and other information needed to use the op in TensorFlow. See
precompiled_user_op() in the API documentation for more information.
14.1.6. Tensor allocation
If the input tensors to the operation have not already been allocated to tiles because of their use by other operations, then the TensorFlow compiler will, by default, allocate the tensors with linear mapping.
You can override this behaviour by defining a function that allocates tensors in a way that is most efficient for your operation. See the section on variable mapping in the Poplar and PopLibs API Reference for more information.
To do this, define a function with the same name as the builder function with
_allocator appended. It has the following signature:
extern "C" poplar::Tensor Build_allocator(
    poplar::Graph& graph,
    std::uint32_t operand,
    const std::vector<size_t>& shape,
    poplar::Type type,
    const std::string& attributes,
    const std::string& debug_prefix)
The parameters to the function are:
graph: The graph to add the tensor to.
operand: The index of the input tensor to allocate.
shape: The shape of the tensor.
type: The Poplar data type for the tensor.
attributes: A string which is passed as the attributes or gradient_attributes argument to the custom op in TensorFlow (depending on whether this function corresponds to the forward or gradient operation). See Operation attributes for more details.
debug_prefix: The name of the operation.
The allocator function returns the tensor that it has allocated.
If the input tensor has already been allocated, then this function will not be called.
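The naming conventions described so far can be summarised with a short sketch. The helper companion_symbols below is hypothetical (it is not part of TensorFlow or Poplar); it only derives, from a builder function name, the companion function names that this chapter mentions.

```python
def companion_symbols(builder_name="Build"):
    """Return the companion function names described in this chapter.

    Hypothetical helper for illustration only; the custom op loader
    looks these symbols up in the shared library itself.
    """
    grad = builder_name + "_grad"
    return {
        "builder": builder_name,
        "gradient builder": grad,
        "metadata": builder_name + "_metadata",
        "gradient metadata": grad + "_metadata",
        "allocator": builder_name + "_allocator",
    }


for role, symbol in companion_symbols().items():
    print(f"{role}: {symbol}")
```

If you export a builder under a different name, for example a hypothetical Rotate, the same suffixes would apply to that name.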
14.1.7. Examples
Some examples of using a custom op in TensorFlow are shown in the following sections. There are further examples in the Graphcore GitHub tutorials repository.
Note
From Poplar SDK 3.1, TensorFlow 1 will only be supported in CentOS 7. In addition, Examples and Tutorials for TensorFlow 1 are only available up to version 3.0 of the SDK. There has been limited testing of the 3.0 versions of the TensorFlow 1 tutorials and examples with Poplar SDK 3.1.
In-place operations
An operation can use the same tensor as an input and output, modifying the tensor in-place as opposed to creating a new output tensor.
You can use the input_to_output_tensor_aliasing map in the metadata to indicate this to the TensorFlow compiler by specifying that the input tensor is aliased with an output tensor.
When you update tensors in-place, the TensorFlow compiler must see an assignment
to the tensor, otherwise the changes to the input tensor will be optimised away.
This means that the in-place inputs always need to be returned as outputs of the
custom operation. If a tf.Variable object is modified in-place then it has to be
assigned back to itself with tf.assign.
Listing 14.1 shows an example of adding an in-place custom op to a TensorFlow model. The implementation of the operation is shown in Listing 14.2.
import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

with tf.device("cpu"):
    x_data = tf.placeholder(np.float32, [4])


def add_op(x, y):
    outputs = {
        "output_types": [tf.float32],
        "output_shapes": [tf.TensorShape([4])],
    }

    base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
    lib_path = os.path.join(base_path, "libcustom_add_inplace.so")

    o = ipu.custom_ops.precompiled_user_op([x, y], lib_path, outs=outputs)
    return o


def my_net(x):
    inplace = tf.get_variable("weights",
                              shape=[4],
                              initializer=tf.zeros_initializer())

    # Even though the custom op is in place, TF still needs to see an assignment.
    inplace_add = tf.assign(inplace, add_op(inplace, x)[0])
    with tf.control_dependencies([inplace_add]):
        return inplace


with ipu_scope("/device:IPU:0"):
    xla_result = ipu.ipu_compiler.compile(my_net, [x_data])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
    print(result)

    result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
    print(result)
Download custom_add_inplace.py
#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include <poplar/Graph.hpp>
#include <popops/Cast.hpp>
#include <popops/ScaledAdd.hpp>
#include <poputil/exceptions.hpp>

extern "C" {
int32_t custom_op_api_level = 5;
}

extern "C" void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise, bool& is_stateless, bool& is_hashable,
    std::uint32_t num_inputs) {
  allocating_indices.clear();
  replica_identical_output_indices.clear();
  // Input 0 is updated in place and returned as output 0.
  input_to_output_tensor_aliasing = {
      {/*input tensor index=*/0, /*output tensor index=*/0}};
  is_elementwise = true;
  // Set the remaining values explicitly, keeping the documented defaults.
  is_stateless = false;
  is_hashable = false;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("add requires 2 inputs.");
  }

  auto left = inputs[0];
  auto right = inputs[1];

  if (left.shape() != right.shape()) {
    throw poputil::poplibs_error("Inputs must have identical shapes.");
  }

  // Compute left += 1.0 * right, writing the result into the input tensor.
  poplar::program::Sequence prog;
  popops::scaledAddTo(graph, left, right, 1.0, prog,
                      debug_prefix + "/custom_add_inplace");
  // The in-place input must also be returned as an output.
  outputs.push_back(left);
  return prog;
}
Download custom_add_inplace.cc
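As a rough host-side model of what the in-place listings compute, the sketch below mimics the add in plain Python. It is not IPU code: the function scaled_add_to is a hypothetical stand-in for the custom op, and the model assumes that the variable keeps its value between the two session runs.

```python
def scaled_add_to(left, right, scale=1.0):
    """Hypothetical model of an in-place scaled add: left += scale * right."""
    for i, r in enumerate(right):
        left[i] += scale * r
    # The modified input is also the output, mirroring the aliasing of
    # input 0 with output 0 declared in the metadata function.
    return left


weights = [0.0, 0.0, 0.0, 0.0]  # models the zero-initialised variable
x = [2.0, 4.0, 6.0, -1.0]

first_output = scaled_add_to(weights, x)
first = list(first_output)   # snapshot after the first run
second = list(scaled_add_to(weights, x))  # snapshot after the second run

print(first)   # [2.0, 4.0, 6.0, -1.0]
print(second)  # [4.0, 8.0, 12.0, -2.0]
```

Because the update accumulates into the same storage, the second call sees the result of the first, which is why the values double.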
Operation attributes
If an operation requires some data which is not available when compiling the C++
builder function, then the string attributes argument can be used to pass
such information from the TensorFlow op to the C++ function.
Since the attributes argument is a string object, any data format which can
be serialized/deserialized as a string, such as JSON, can be used.
In Listing 14.3, we implement a custom operation which
performs a serialized matrix-matrix multiplication where the attributes
argument passes information about serialization, encoded in JSON data format, to
the C++ function. Listing 14.4 shows how this custom
op is called from TensorFlow.
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

#include <poplar/Graph.hpp>
#include <poplin/MatMul.hpp>
#include <popops/ElementWise.hpp>
#include <poputil/exceptions.hpp>

// Use the https://github.com/open-source-parsers/jsoncpp JsonCpp parser
#include "include/json/json.h"

extern "C" {
int32_t custom_op_api_level = 5;
}

namespace {
Json::Value ParseAttributes(const std::string& attributes) {
  // Parse Json.
  Json::CharReaderBuilder builder;
  std::string errs;
  Json::Value parsed_json;
  std::unique_ptr<Json::CharReader> reader(builder.newCharReader());
  bool parsed =
      reader->parse(attributes.c_str(), attributes.c_str() + attributes.size(),
                    &parsed_json, &errs);
  // Parsing must succeed without reporting any errors.
  assert(parsed && errs.empty());
  return parsed_json;
}

std::vector<size_t> GetVectorFromJson(Json::Value& val) {
  std::vector<size_t> result;
  result.reserve(val.size());
  for (auto a : val) {
    result.push_back(a.asUInt64());
  }
  return result;
}
}  // namespace

extern "C" void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise, bool& is_stateless, bool& is_hashable,
    std::uint32_t num_inputs) {
  // Both inputs are allocated by Build_allocator().
  allocating_indices = {0, 1};
  replica_identical_output_indices.clear();
  input_to_output_tensor_aliasing.clear();
  is_elementwise = false;
  // Set the remaining values explicitly, keeping the documented defaults.
  is_stateless = false;
  is_hashable = false;
}

extern "C" poplar::Tensor Build_allocator(poplar::Graph& graph,
                                          std::uint32_t operand,
                                          const std::vector<size_t>& shape,
                                          poplar::Type type,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  assert(operand < 2);
  // Parse JSON and get the expected attributes.
  Json::Value json = ParseAttributes(attributes);
  const int serialization_factor = json["serialization_factor"].asInt();
  std::vector<std::size_t> lhs_shape = GetVectorFromJson(json["lhs_shape"]);
  std::vector<std::size_t> rhs_shape = GetVectorFromJson(json["rhs_shape"]);

  // Verify shapes and adjust them to be slice shapes.
  assert(lhs_shape.size() == 2);
  assert(rhs_shape.size() == 2);

  assert(lhs_shape[1] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of LHS shape");
  lhs_shape[1] /= serialization_factor;

  assert(rhs_shape[0] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of RHS shape");
  rhs_shape[0] /= serialization_factor;

  // Allocate the slice.
  poplar::Tensor slice;
  if (operand == 0) {
    // Allocating for lhs - allocate the slice.
    slice = poplin::createMatMulInputLHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/LHS");
  } else {
    assert(operand == 1);
    slice = poplin::createMatMulInputRHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/RHS");
  }

  // Clone the slice for each serialized matrix multiply.
  std::vector<poplar::Tensor> slices(serialization_factor);
  slices[0] = slice;
  for (int i = 1; i != serialization_factor; ++i) {
    slices[i] = graph.clone(slice);
  }

  // Concatenate the slices into a single tensor - the concatenation dimension
  // depends on the operand which is being allocated.
  poplar::Tensor t = poplar::concat(slices, operand == 0 ? 1 : 0);
  return t;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("matmul requires 2 inputs.");
  }
  Json::Value json = ParseAttributes(attributes);
  poplar::program::Sequence seq;
  poplar::Tensor lhs = inputs[0];
  poplar::Tensor rhs = inputs[1];
  poplar::Tensor output;

  const int serialization_factor = json["serialization_factor"].asInt();
  const int slice_size = lhs.dim(1) / serialization_factor;
  for (int i = 0; i != serialization_factor; ++i) {
    // Slice out the parts of the matmul.
    poplar::Tensor lhs_slice =
        lhs.slice(i * slice_size, (i + 1) * slice_size, 1);
    poplar::Tensor rhs_slice =
        rhs.slice(i * slice_size, (i + 1) * slice_size, 0);
    // Do the partial matmul.
    poplar::Tensor partial_matmul = poplin::matMul(
        graph, lhs_slice, rhs_slice, seq, debug_prefix + "/Slice");

    // Accumulate the results from partial matmuls.
    if (i == 0) {
      output = partial_matmul;
    } else {
      popops::addInPlace(graph, output, partial_matmul, seq,
                         debug_prefix + "/Add");
    }
  }
  outputs = {output};
  return seq;
}
Download tutorial_attributes_example.cc
import os
import json
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
lib_path = os.path.join(base_path, "libtutorial_attributes_example.so")


def my_net(x, y):
    x_shape = x.get_shape().as_list()
    y_shape = y.get_shape().as_list()
    outputs = {
        "output_types": [x.dtype],
        "output_shapes": [tf.TensorShape([x_shape[0], y_shape[1]])],
    }

    # We create a matmul operation, which we want to perform as two serialized
    # matmuls. We also record all the input shapes.
    attributes = {
        "serialization_factor": 2,
        "lhs_shape": x_shape,
        "rhs_shape": y_shape
    }
    attributes_json = json.dumps(attributes)

    o = ipu.custom_ops.precompiled_user_op([x, y],
                                           lib_path,
                                           attributes=attributes_json,
                                           outs=outputs)

    return o


with tf.device("cpu"):
    x_ph = tf.placeholder(np.float32, [128, 1024])
    y_ph = tf.placeholder(np.float32, [1024, 64])

with ipu_scope("/device:IPU:0"):
    xla_result = ipu.ipu_compiler.compile(my_net, [x_ph, y_ph])

with tf.Session() as sess:
    # Base run
    result = sess.run(xla_result,
                      feed_dict={
                          x_ph: np.full(x_ph.shape, 10.0),
                          y_ph: np.full(y_ph.shape, 12.0),
                      })

    print(result)
Download tutorial_attributes_example.py
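The serialization scheme used by these listings can be checked on the host with a small reference model. The sketch below is plain Python, not IPU code, and the helpers matmul and serialized_matmul are hypothetical names introduced for illustration. It reads serialization_factor from a JSON string in the same way the attributes argument carries it, slices the left-hand side along dimension 1 and the right-hand side along dimension 0, and accumulates the partial products.

```python
import json


def matmul(a, b):
    """Plain nested-list matrix multiply."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def serialized_matmul(lhs, rhs, attributes_json):
    """Hypothetical host-side model of the serialized matmul."""
    factor = json.loads(attributes_json)["serialization_factor"]
    inner = len(rhs)
    if inner % factor != 0:
        raise ValueError("serialization_factor must divide the inner dimension")
    slice_size = inner // factor
    output = None
    for i in range(factor):
        lo, hi = i * slice_size, (i + 1) * slice_size
        lhs_slice = [row[lo:hi] for row in lhs]  # slice along dimension 1
        rhs_slice = rhs[lo:hi]                   # slice along dimension 0
        partial = matmul(lhs_slice, rhs_slice)
        if output is None:
            output = partial
        else:
            # Accumulate the partial product into the running result.
            for out_row, part_row in zip(output, partial):
                for j, value in enumerate(part_row):
                    out_row[j] += value
    return output


lhs = [[1, 2, 3, 4], [5, 6, 7, 8]]
rhs = [[1, 0, 2], [0, 1, 3], [4, 5, 6], [7, 8, 9]]
attrs = json.dumps({"serialization_factor": 2})
print(serialized_matmul(lhs, rhs, attrs) == matmul(lhs, rhs))
```

The sum of the partial products equals the full product because the inner dimension is simply split into contiguous chunks. For the inputs in the TensorFlow listing (every element 10.0 and 12.0, with an inner dimension of 1024), the mathematically exact value of each output element is 10 × 12 × 1024 = 122880.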
Custom codelet
Listing 14.5 shows the source file for a custom rotate
operation, which takes three vectors and rotates x and y by the values
in angle. The vertex code for the custom codelet is shown in
Listing 14.6. The TensorFlow program that calls the custom op is
shown in Listing 14.7.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 5;
}

extern "C" void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise, bool& is_stateless, bool& is_hashable,
    std::uint32_t num_inputs) {
  allocating_indices.clear();
  replica_identical_output_indices.clear();
  input_to_output_tensor_aliasing.clear();
  is_elementwise = true;
  // Set the remaining values explicitly, keeping the documented defaults.
  is_stateless = false;
  is_hashable = false;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor. We will map the vertices so
   * that they match the layout of the 'x' input tensor (inputs[0]). If the 'x'
   * tensor was laid out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which describes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the input tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               {{"x_out", xOutputFlat.slices(regions)},
                                {"y_out", yOutputFlat.slices(regions)},
                                {"x_in", xFlat.slices(regions)},
                                {"y_in", yFlat.slices(regions)},
                                {"angle", aFlat.slices(regions)}});

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setPerfEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}
Download custom_rotate_op.cc
#include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate the tensors 'x' and 'y', by the angle (radians) in the
 * tensor 'angle', around the origin.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;
Download custom_codelet.cpp
import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

with tf.device("cpu"):
    x_data = tf.placeholder(np.float32, [4])
    y_data = tf.placeholder(np.float32, [4])
    p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
    outputs = {
        "output_types": [tf.float32, tf.float32],
        "output_shapes": [tf.TensorShape([4]),
                          tf.TensorShape([4])],
    }

    base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
    lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
    gp_path = os.path.join(base_path, "custom_codelet.gp")

    o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                           lib_path,
                                           gp_path,
                                           outs=outputs)
    return o


def my_net(x, y, a):
    return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
    xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
    # Base run
    result = sess.run(xla_result,
                      feed_dict={
                          x_data: [2., 4., 6., -1.],
                          y_data: [2., 3., 8., -1.],
                          p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                      })

    print(result)
Download tutorial_custom_codelet.py
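To sanity-check the values returned by the rotate op, you could compare them against a host-side reference. The sketch below is plain Python using the same formulas as the codelet's compute() method; the function rotate_reference is a hypothetical helper, and results from the IPU may differ slightly because the op runs in single precision.

```python
import math


def rotate_reference(xs, ys, angles):
    """Rotate each point (x, y) about the origin by the matching angle (radians)."""
    x_out, y_out = [], []
    for x, y, a in zip(xs, ys, angles):
        x_out.append(x * math.cos(a) - y * math.sin(a))
        y_out.append(x * math.sin(a) + y * math.cos(a))
    return x_out, y_out


# Same inputs as the TensorFlow listing above.
x_out, y_out = rotate_reference(
    [2.0, 4.0, 6.0, -1.0],
    [2.0, 3.0, 8.0, -1.0],
    [math.pi, math.pi / 2.0, 3.0 * math.pi / 2.0, 0.0],
)
print([round(v, 5) for v in x_out])
print([round(v, 5) for v in y_out])
```

For these inputs the reference values are approximately [-2, -3, 8, -1] for x and [-2, 4, -6, -1] for y.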
14.2. Custom host CPU operations
You can write a custom operation as a function that executes code on the host
CPU instead of on the IPU. The default name for this function is Callback().
As with the builder functions described previously, this must be compiled into a
shared library file.
The signature of the callback function is:
extern "C"
void Callback(
    const std::vector<const void*>& data,
    const std::vector<std::uint32_t>& number_of_elements,
    const std::vector<void*>& outputs,
    const std::string& attributes,
    const std::string& name);
The parameters are:
data: The input data passed to the custom op in TensorFlow. The function must be written to expect a specific data type, so that each void pointer can be cast to the expected type.
number_of_elements: This indicates the number of elements in the input data.
outputs: The results returned by the operation.
attributes: A string which is passed as the attributes argument to the custom op in TensorFlow. See Operation attributes for more details.
name: This is the name of the operation within the XLA graph.
You can call the host code from your TensorFlow program using
tensorflow.python.ipu.custom_ops.cpu_user_operation().
This specifies the input object file to load, the input and output tensors, and other parameters to the
operation.
14.2.1. Gradient callback
If the op is required for training, then you must also implement a function for
the gradient operation. This has the same name as the callback with _grad
appended.
The signature of the gradient callback function is:
extern "C" void Callback_grad(
    const std::vector<void*>& data,
    const std::vector<uint32_t>& number_of_elements,
    std::vector<void*>& outputs,
    const std::string& attributes,
    const std::string& name);
The parameters are:
data: The input data passed to the custom op in TensorFlow. The function must be written to expect a specific data type so the void pointer can be cast into the expected type.
number_of_elements: This indicates the number of elements in the input data.
outputs: The results returned by the operation.
attributes: A string which is passed as the gradient_attributes argument to the Python op in TensorFlow. See Operation attributes for more details.
name: This is the name of the operation within the XLA graph.