11. Custom IPU operations

There are three mechanisms for providing custom operations to the IPU through the TensorFlow interface. The first uses a fully custom codelet and a host-side build function.

The second is a custom operation which is executed on the CPU.

The third is a custom, fused elementwise arithmetic operation. In this last case, the gradient creation in the optimisers will not produce a gradient operation for the custom operation.

11.1. Fully customised IPU operations

You can provide a custom operation to be compiled into the Poplar executable and run on the IPU hardware. You must provide a host-side shared object library that implements the action of adding vertices to a Poplar graph, given some Poplar tensor inputs. You can optionally provide a Poplar source code or binary file containing one or more “codelets” (code that runs on the IPU).

For more information about writing codelets, please refer to the Poplar and PopLibs User Guide.

These operations are added with ipu.custom_ops.precompiled_user_op. See tensorflow.python.ipu.custom_ops.precompiled_user_op() for details. An example of this is shown below.

The shared object file must contain an undecorated symbol, declared as shown below, that adds the vertices which perform the custom operation to the graph. The name of the symbol must match the name of the operation in the graph; by default, operations of this type are called Build.

extern "C"
poplar::program::Program Build(
  poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
  std::vector<poplar::Tensor>& outputs, const std::string &attributes,
  const std::string &debug_prefix)

The arguments are:

  • graph: the Poplar graph into which to add tensors and vertices.

  • inputs: a vector of Poplar tensors which are inputs to the operation.

  • outputs: a vector into which to store the outputs of the operation. The vector will contain zero entries when the Build function is called.

  • attributes: a string which was passed as the attributes argument to the Python operation. See Operation attributes for more details.

  • debug_prefix: the debug name that has been given to the operation in the TensorFlow graph.
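
As an illustration, here is a minimal, hypothetical sketch of a Build function which simply copies its single input to a cloned output tensor. Complete, working examples are given later in this section.

#include <poplar/Graph.hpp>
#include <string>
#include <vector>

// Hypothetical sketch: clone the input to create the output tensor, then
// copy the input data into it.
extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debug_prefix) {
  poplar::program::Sequence prog;
  outputs.push_back(graph.clone(inputs[0], debug_prefix + "/out"));
  prog.add(poplar::program::Copy(inputs[0], outputs[0]));
  return prog;
}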

If the operation can have its gradient taken, then the shared object can contain a separate function that builds the gradient operation. This function must be given the same name as the forward operation with _grad appended. The signature of the builder function is slightly different: it takes the forward-pass inputs and outputs as arguments, as well as the incoming gradients. Gradient builders have their own metadata functions; for example, the metadata function for the builder below is called Build_grad_metadata.

extern "C"
poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix)

The arguments are:

  • graph: the Poplar graph into which to add tensors and vertices.

  • input_grad_index: the index of the forward-pass input for which this operation is producing the partial derivative. If the gradient operation calculates all of the partial derivatives, then this argument should be ignored.

  • gradients: the inputs to the gradient operation, from the previous gradient operation or loss.

  • fwd_inputs: the tensors which are the inputs to the forward operation.

  • fwd_outputs: the tensors which are the outputs of the forward operation.

  • outputs: the outputs of this gradient operation. There must be one per input of the original forward operation. Inputs which are not differentiable can be given a null Poplar tensor.

  • attributes: a string which was passed as the gradient_attributes argument to the Python operation. See Operation attributes for more details.

  • debug_prefix: the name of the operation.
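
As an illustration, here is a minimal, hypothetical gradient builder for an element-wise add, written for the case where one gradient operation produces all of the partial derivatives (see Gradient operations below). Since the partial derivative of x + y with respect to each input is just the incoming gradient, the builder forwards it to both outputs and adds no vertices.

#include <poplar/Graph.hpp>
#include <string>
#include <vector>

// Hypothetical sketch: for z = x + y the partial derivative with respect to
// both x and y is the incoming gradient, so forward it to both outputs.
extern "C" poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix) {
  // One output per forward-pass input.
  outputs.push_back(gradients[0]);
  outputs.push_back(gradients[0]);
  // No vertices are needed, so return an empty program.
  return poplar::program::Sequence();
}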

11.1.1. Metadata

The shared object file can optionally contain an undecorated symbol with the same name as the builder function but with _metadata appended. This function must have the following signature:

extern "C"
void Build_metadata(std::vector<std::int64_t>& allocating_indices,
  std::uint32_t& num_inplace, bool& is_elementwise,
  bool& is_stateless, std::uint32_t num_inputs)

The arguments are:

  • allocating_indices: indicates which of the inputs should be allocated using the tensor allocation function. See the description in Tensor allocation.

  • num_inplace: indicates the number of inputs which are ‘in place’. The first num_inplace of the inputs will be considered to be in-place.

  • is_elementwise: indicates that this operation is element-wise.

  • is_stateless: indicates that this operation is stateless. Custom ops are stateful by default.

  • num_inputs: indicates how many inputs are on the operation.

The function should fill in the values of the first four arguments, which are all reference types.
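
For example, a metadata function for a hypothetical two-input, element-wise operation whose first input is updated in place, and which has no internal state, might be filled in like this:

#include <cstdint>
#include <vector>

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();  // let the backend lay out all inputs
  num_inplace = 1;             // the first input is modified in place
  is_elementwise = true;       // output matches the shape/layout of input 0
  is_stateless = true;         // outputs depend only on the inputs
}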

11.1.2. In-place operations

If an operation does an in-place modification of an input tensor, as opposed to creating a new output tensor, then num_inplace can be used to indicate that this is the case. The system will ensure that, when a tensor is updated in place, any other uses of that tensor are complete before the operation is run.

If a tensor is not marked as in place then the operation must not modify it. If it is modified then other operations which consume it may see an incorrect value on their input.

When updating tensors in place, you need to ensure that TensorFlow sees an assignment of the tensor, otherwise the in-place update will not “stick”. This means that the in-place inputs must always be returned as outputs of the custom operation, and if a tf.Variable was modified in place it has to be assigned back to itself with tf.assign. This might look something like the following:

import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config()
# cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])


def add_op(x, y):
  outputs = {
      "output_types": [tf.float32],
      "output_shapes": [tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_add_inplace.so")

  o = ipu.custom_ops.precompiled_user_op([x, y], lib_path, outs=outputs)
  return o


def my_net(x):
  inplace = tf.get_variable("weights",
                            shape=[4],
                            initializer=tf.zeros_initializer())

  # Even though the custom op is in place, TF still needs to see an assignment.
  inplace_add = tf.assign(inplace, add_op(inplace, x)[0])
  with tf.control_dependencies([inplace_add]):
    return inplace


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)

And the associated custom op:

/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <poplar/Graph.hpp>
#include <popops/Cast.hpp>
#include <popops/ScaledAdd.hpp>
#include <poputil/exceptions.hpp>

extern "C" {
int32_t custom_op_api_level = 2;
}

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 1;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("add requires 2 inputs.");
  }

  auto left = inputs[0];
  auto right = inputs[1];

  if (left.shape() != right.shape()) {
    throw poputil::poplibs_error("Inputs must have identical shapes.");
  }

  poplar::program::Sequence prog;
  popops::scaledAddTo(graph, left, right, 1.0, prog,
                      debug_prefix + "/custom_add_inplace");
  outputs.push_back(left);
  return prog;
}

11.1.3. Elementwise operations

The IPU driver can do a better job of allocating the layout of Poplar tensors if it can associate them with specific operations. If the output of an operation is the same shape and layout as its first input, then it should be marked as elementwise.

Typically, the graph building code for the operation will clone the input in order to generate the output Poplar tensor.

11.1.4. Tensor allocation

When generating the Poplar graph, the backend sometimes has the freedom to allocate an input to an operation. This happens when an input to the operation is also an input to the graph, or when previous operations do not put constraints on the input tensor.

If this condition occurs, then by default the backend will create the Poplar tensor with linear mapping. See the section on tile mapping in the Poplar and PopLibs API Reference.

To override this behaviour and allocate a tensor using a specific layout mapping, the custom operation can provide a function with the following signature:

extern "C" poplar::Tensor Build_allocator(
  poplar::Graph& graph, std::uint32_t operand,
  const std::vector<size_t>& shape, poplar::Type type,
  const std::string& attributes, const std::string& debug_prefix)

The arguments are:

  • graph: the Poplar graph where the tensor should be created.

  • operand: the operand number of the input to allocate.

  • shape: the shape of the tensor.

  • type: the Poplar data type for the tensor.

  • attributes: a string which was passed as the attributes or gradient_attributes argument to the Python operation (depending on whether this function corresponds to the forward or gradient operation). See Operation attributes for more details.

  • debug_prefix: the name of the operation.
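
As a minimal sketch, an allocator could simply create the operand and spread it linearly over the tiles. A real operation would normally use a PopLibs allocation function that matches how the tensor is consumed, as in the serialized matmul example later in this chapter.

#include <poplar/Graph.hpp>
#include <poputil/TileMapping.hpp>
#include <string>
#include <vector>

// Hypothetical sketch: create the input variable and map it linearly across
// the tiles (which is what the default behaviour would do anyway).
extern "C" poplar::Tensor Build_allocator(
    poplar::Graph& graph, std::uint32_t operand,
    const std::vector<size_t>& shape, poplar::Type type,
    const std::string& attributes, const std::string& debug_prefix) {
  poplar::Tensor t = graph.addVariable(type, shape, debug_prefix + "/input");
  poputil::mapTensorLinearly(graph, t);
  return t;
}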

11.1.5. Gradient operations

As described above, when the gradient of the forward operation is generated, either a single operation or multiple operations can be inserted into the graph.

You can use the separate_gradients parameter of the precompiled_user_op function to select which of the two options is required. The compiled code must match this setting.

If the separate_gradients parameter is set to False, then the compiled function for generating the gradient operation should fill in one output for each of the inputs of the forward pass function. Each output should be the partial derivative with respect to one of the inputs.

If the separate_gradients parameter is set to True, then the gradient operation building function should produce an operation with a single output, which is the partial derivative with respect to only one of the forward pass inputs.

The specific input will be given by the input_grad_index argument of the call to the shared object's Build_grad function.
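
For example, a hypothetical gradient builder written for separate_gradients=True produces only the single partial derivative selected by input_grad_index. For the element-wise add sketched earlier, both partial derivatives are simply the incoming gradient:

#include <poplar/Graph.hpp>
#include <string>
#include <vector>

// Hypothetical sketch for separate_gradients=True: produce one output, the
// partial derivative for the input selected by input_grad_index.
extern "C" poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix) {
  (void)input_grad_index;  // d(x + y)/dx == d(x + y)/dy == incoming gradient
  outputs.push_back(gradients[0]);
  return poplar::program::Sequence();
}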

11.1.6. Stateless operations

If an operation’s outputs depend only on the values of its inputs, and not on any internally stored state, then the operation is said to be stateless. Marking an operation as stateless in the metadata function allows the TensorFlow backend to perform optimisations which would otherwise be disallowed, such as common code removal.

11.1.7. Operation attributes

If an operation requires some data which is not available when compiling the C++ Poplar function, then the string attributes argument can be used to pass such information from the Python level operation to the C++ function. Since the attributes argument is a string object, any data format which can be serialized/deserialized to/from a string, such as JSON, can be used.

In the following example we add a custom operation which performs a serialized matrix-matrix multiplication, where we use the attributes argument to pass information about the serialization, encoded in the JSON data format, to the C++ function.

/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <cassert>
#include <memory>

#include <poplar/Graph.hpp>
#include <poplin/MatMul.hpp>
#include <popops/ElementWise.hpp>
#include <poputil/exceptions.hpp>

// Use the https://github.com/open-source-parsers/jsoncpp JsonCpp parser
#include "include/json/json.h"

extern "C" {
int32_t custom_op_api_level = 2;
}

namespace {
Json::Value ParseAttributes(const std::string& attributes) {
  // Parse Json.
  Json::CharReaderBuilder builder;
  std::string errs;
  Json::Value parsed_json;
  std::unique_ptr<Json::CharReader> reader(builder.newCharReader());
  bool parsed =
      reader->parse(attributes.c_str(), attributes.c_str() + attributes.size(),
                    &parsed_json, &errs);
  assert(parsed && errs.empty());
  return parsed_json;
}

std::vector<size_t> GetVectorFromJson(Json::Value& val) {
  std::vector<size_t> result;
  result.reserve(val.size());
  for (auto a : val) {
    result.push_back(a.asUInt64());
  }
  return result;
}
}  // namespace

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               std::uint32_t num_inputs) {
  allocating_indices = {0, 1};
  num_inplace = 0;
  is_elementwise = false;
}

extern "C" poplar::Tensor Build_allocator(poplar::Graph& graph,
                                          std::uint32_t operand,
                                          const std::vector<size_t>& shape,
                                          poplar::Type type,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  assert(operand < 2);
  // Parse JSON and get the expected attributes.
  Json::Value json = ParseAttributes(attributes);
  const int serialization_factor = json["serialization_factor"].asInt();
  std::vector<std::size_t> lhs_shape = GetVectorFromJson(json["lhs_shape"]);
  std::vector<std::size_t> rhs_shape = GetVectorFromJson(json["rhs_shape"]);

  // Verify shapes and adjust them to be slice shapes.
  assert(lhs_shape.size() == 2);
  assert(rhs_shape.size() == 2);

  assert(lhs_shape[1] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of LHS shape");
  lhs_shape[1] /= serialization_factor;

  assert(rhs_shape[0] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of RHS shape");
  rhs_shape[0] /= serialization_factor;

  // Allocate the slice.
  poplar::Tensor slice;
  if (operand == 0) {
    // Allocating for lhs - allocate the slice.
    slice = poplin::createMatMulInputLHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/LHS");
  } else {
    assert(operand == 1);
    slice = poplin::createMatMulInputRHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/RHS");
  }

  // Clone the slice for each serialized matrix multiply.
  std::vector<poplar::Tensor> slices(serialization_factor);
  slices[0] = slice;
  for (int i = 1; i != serialization_factor; ++i) {
    slices[i] = graph.clone(slice);
  }

  // Concatenate the slices into a single tensor - the concatenation dimension
  // depends on the operand which is being allocated.
  poplar::Tensor t = poplar::concat(slices, operand == 0 ? 1 : 0);
  return t;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("matmul requires 2 inputs.");
  }
  Json::Value json = ParseAttributes(attributes);
  poplar::program::Sequence seq;
  poplar::Tensor lhs = inputs[0];
  poplar::Tensor rhs = inputs[1];
  poplar::Tensor output;

  const int serialization_factor = json["serialization_factor"].asInt();
  const int slice_size = lhs.dim(1) / serialization_factor;
  for (int i = 0; i != serialization_factor; ++i) {
    // Slice out the parts of the matmul.
    poplar::Tensor lhs_slice =
        lhs.slice(i * slice_size, (i + 1) * slice_size, 1);
    poplar::Tensor rhs_slice =
        rhs.slice(i * slice_size, (i + 1) * slice_size, 0);
    // Do the partial matmul.
    poplar::Tensor partial_matmul = poplin::matMul(
        graph, lhs_slice, rhs_slice, seq, debug_prefix + "/Slice");

    // Accumulate the results from partial matmuls.
    if (i == 0) {
      output = partial_matmul;
    } else {
      popops::addInPlace(graph, output, partial_matmul, seq,
                         debug_prefix + "/Add");
    }
  }
  outputs = {output};
  return seq;
}

This operation is then executed with:

import os
import json
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
lib_path = os.path.join(base_path, "libtutorial_attributes_example.so")


def my_net(x, y):
  x_shape = x.get_shape().as_list()
  y_shape = y.get_shape().as_list()
  outputs = {
      "output_types": [x.dtype],
      "output_shapes": [tf.TensorShape([x_shape[0], y_shape[1]])],
  }

  # We create a matmul operation, which we want to perform as two serialized
  # matmuls. We also record all the input shapes.
  attributes = {
      "serialization_factor": 2,
      "lhs_shape": x_shape,
      "rhs_shape": y_shape
  }
  attributes_json = json.dumps(attributes)

  o = ipu.custom_ops.precompiled_user_op([x, y],
                                         lib_path,
                                         attributes=attributes_json,
                                         outs=outputs)

  return o


with tf.device("cpu"):
  x_ph = tf.placeholder(np.float32, [128, 1024])
  y_ph = tf.placeholder(np.float32, [1024, 64])

with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_ph, y_ph])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_ph: np.full(x_ph.shape, 10.0),
                        y_ph: np.full(y_ph.shape, 12.0),
                    })

  print(result)

11.1.8. Example

This example shows the source file for a rotate operation, which takes three vectors and rotates the points given by the x and y vectors by the corresponding angles in the third vector:

/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 2;
}

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 0;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor.  We will map the vertices so
   * that they match the layout of the 'x' input tensor (input[0]).  If the 'x'
   * tensor was laid out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which describes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the inputs tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               {{"x_out", xOutputFlat.slices(regions)},
                                {"y_out", yOutputFlat.slices(regions)},
                                {"x_in", xFlat.slices(regions)},
                                {"y_in", yFlat.slices(regions)},
                                {"angle", aFlat.slices(regions)}});

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setCycleEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}

This is the associated codelet file:

#include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate the tensors 'x' and 'y', by the angle (radians) in the
 * tensor 'angle', around the origin.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;

This is an example of it in use:

import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])
  y_data = tf.placeholder(np.float32, [4])
  p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
  outputs = {
      "output_types": [tf.float32, tf.float32],
      "output_shapes": [tf.TensorShape([4]),
                        tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
  gp_path = os.path.join(base_path, "custom_codelet.gp")

  o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                         lib_path,
                                         gp_path,
                                         outs=outputs)
  return o


def my_net(x, y, a):
  return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_data: [2., 4., 6., -1.],
                        y_data: [2., 3., 8., -1.],
                        p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                    })

  print(result)

When compiling the host-side shared object file, it is not necessary to include or link against any TensorFlow header or library files. Only the Poplar headers and link libraries should be necessary.

11.2. Fully customised CPU operations

The framework also allows a custom operation that executes code on the CPU instead of on the IPU. As with the device-side custom operation, you must provide a shared object, but in this case it contains a callback function that performs the computation on the host. The signature of this function should be:

extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name);

The arguments are:

  • data: the input data. The function should be written to expect a certain data type, so that the void pointers can be cast into the expected type.

  • number_of_elements: indicates the number of elements in each of the input data buffers.

  • outputs: should be filled in by the operation.

  • name: the name of the operation within the XLA/HLO graph.
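
For example, a hypothetical CPU callback implementing an element-wise add of two float32 inputs might look like the following sketch (this assumes the output buffers have already been allocated by the framework and that both inputs have the same length):

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: add two float32 input buffers element by element and
// write the result into the first output buffer.
extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name) {
  const float* in0 = static_cast<const float*>(data[0]);
  const float* in1 = static_cast<const float*>(data[1]);
  float* out = static_cast<float*>(outputs[0]);
  for (std::uint32_t i = 0; i < number_of_elements[0]; ++i) {
    out[i] = in0[i] + in1[i];
  }
}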

11.3. Custom elementwise expressions

The Python function ipu.custom_ops.codelet_expression_op provides an interface for giving a custom fused expression to the compiler. This will be encoded into a single compute set. See tensorflow.python.ipu.custom_ops.codelet_expression_op() for details.

The arguments to codelet_expression_op are a callable Python function which encodes the arithmetic expression, and the tensor arguments to the operation.

For instance:

def my_custom_op(x, y, z):
    return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)

In this example, the Python function my_custom_op provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.

Python operators which are supported in the function are +, -, *, and abs.

11.4. API Level Versioning

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 2;
}

You must include the code above in your source code. The custom op loader checks the API level of the custom op and refuses to load it if it does not match the current API level. A different API level normally means that the interface is binary incompatible with the previous version.

11.4.1. Changes in API Level

The changes introduced at each API level are:

  1. is_stateless has been added to the metadata function.

  2. attributes: a string argument has been added to the allocation and build functions, which allows passing user-defined attributes to the operation (and its gradient operation, if present).