11. Custom IPU operations

There are three mechanisms for providing custom operations to the IPU through the TensorFlow interface. The first uses a fully custom codelet and a host-side build function.

The second is a custom operation which is executed on the CPU.

The third is a custom, fused elementwise arithmetic operation. In this last case, the gradient creation in the optimisers will not produce a gradient operation for the custom operation.

11.1. Fully customised IPU operations

You can provide a custom operation to be compiled into the Poplar executable and run on the IPU hardware. You must provide a host-side shared object library that implements the action of adding vertices to a Poplar graph, given some Poplar tensor inputs. You can optionally provide a Poplar source code or binary file containing one or more “codelets” (code that runs on the IPU).

For more information about writing codelets, please refer to the Poplar and PopLibs User Guide.

These operations are added with ipu.custom_ops.precompiled_user_op. See tensorflow.python.ipu.custom_ops.precompiled_user_op() for details. An example of this is shown below.

The shared object file must contain an undecorated symbol, declared as shown below, that adds the vertices which perform the custom operation to the graph. The name of the symbol must match the name of the operation in the graph; by default, operations of this type are called Build.

extern "C"
poplar::program::Program Build(
  poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
  std::vector<poplar::Tensor>& outputs, const std::string &attributes,
  const std::string &debug_prefix)

The arguments are:

  • graph: the Poplar graph into which to add tensors and vertices.

  • inputs: a vector of Poplar tensors which are inputs to the operation.

  • outputs: a vector into which to store the outputs of the operation. The vector will contain zero entries when the Build function is called.

  • attributes: a string which was passed as the attributes argument to the Python operation. See Operation attributes for more details.

  • debug_prefix: the debug name that has been given to the operation in the TensorFlow graph.
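
As an illustration, here is a minimal, hypothetical sketch of a Build function which simply copies its single input to a cloned output tensor. Complete, working examples are given later in this section.

#include <poplar/Graph.hpp>
#include <string>
#include <vector>

// Hypothetical sketch: clone the input to create the output tensor, then
// copy the input data into it.
extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debug_prefix) {
  poplar::program::Sequence prog;
  outputs.push_back(graph.clone(inputs[0], debug_prefix + "/out"));
  prog.add(poplar::program::Copy(inputs[0], outputs[0]));
  return prog;
}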

If the operation can have its gradient taken, then the shared object can contain a separate function that builds the gradient operation. This function must be given the same name as the forward operation with _grad appended. The signature of the builder function is slightly different: it takes the forward-pass inputs and outputs as arguments, as well as the incoming gradients. Gradient builders have their own metadata functions; for example, the metadata function for the builder below is called Build_grad_metadata.

extern "C"
poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix)

The arguments are:

  • graph: the Poplar graph into which to add tensors and vertices.

  • input_grad_index: the index of the forward-pass input for which this operation is producing the partial derivative. If the gradient operation calculates all of the partial derivatives, then this argument should be ignored.

  • gradients: the inputs to the gradient operation, from the previous gradient operation or loss.

  • fwd_inputs: the tensors which are the inputs to the forward operation.

  • fwd_outputs: the tensors which are the outputs of the forward operation.

  • outputs: the outputs of this gradient operation. There must be one per input of the original forward operation. Inputs which are not differentiable can be given a null Poplar tensor.

  • attributes: a string which was passed as the gradient_attributes argument to the Python operation. See Operation attributes for more details.

  • debug_prefix: the name of the operation.
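
As an illustration, here is a minimal, hypothetical gradient builder for an element-wise add, written for the case where one gradient operation produces all of the partial derivatives (see Gradient operations below). Since the partial derivative of x + y with respect to each input is just the incoming gradient, the builder forwards it to both outputs and adds no vertices.

#include <poplar/Graph.hpp>
#include <string>
#include <vector>

// Hypothetical sketch: for z = x + y the partial derivative with respect to
// both x and y is the incoming gradient, so forward it to both outputs.
extern "C" poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix) {
  // One output per forward-pass input.
  outputs.push_back(gradients[0]);
  outputs.push_back(gradients[0]);
  // No vertices are needed, so return an empty program.
  return poplar::program::Sequence();
}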

11.1.1. Metadata

The shared object file can optionally contain an undecorated symbol with the same name as the builder function but with _metadata appended. This function must have the following signature:

extern "C"
void Build_metadata(std::vector<std::int64_t>& allocating_indices,
  std::uint32_t& num_inplace, bool& is_elementwise,
  bool& is_stateless, std::uint32_t num_inputs)

The arguments are:

  • allocating_indices: indicates which of the inputs should be allocated using the tensor allocation function. See the description in Tensor allocation.

  • num_inplace: indicates the number of inputs which are ‘in place’. The first num_inplace of the inputs will be considered to be in-place.

  • is_elementwise: indicates that this operation is element-wise.

  • is_stateless: indicates that this operation is stateless. Custom ops are stateful by default.

  • num_inputs: indicates how many inputs are on the operation.

The function should fill in the values of the first four arguments, which are all reference types.
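
For example, a metadata function for a hypothetical two-input, element-wise operation whose first input is updated in place, and which has no internal state, might be filled in like this:

#include <cstdint>
#include <vector>

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();  // let the backend lay out all inputs
  num_inplace = 1;             // the first input is modified in place
  is_elementwise = true;       // output matches the shape/layout of input 0
  is_stateless = true;         // outputs depend only on the inputs
}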

11.1.2. In-place operations

If an operation does an in-place modification of an input tensor, as opposed to creating a new output tensor, then num_inplace can be used to indicate that this is the case. The system will ensure that, when a tensor is updated in place, any other uses of that tensor are complete before the operation is run.

If a tensor is not marked as in place then the operation must not modify it. If it is modified then other operations which consume it may see an incorrect value on their input.

When updating tensors in place, you need to ensure that TensorFlow sees an assignment of the tensor, otherwise the in-place update will not “stick”. This means that the in-place inputs must always be returned as outputs of the custom operation, and if a tf.Variable was modified in place it has to be assigned back to itself with tf.assign. This might look something like the following:

import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config()
# cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])


def add_op(x, y):
  outputs = {
      "output_types": [tf.float32],
      "output_shapes": [tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_add_inplace.so")

  o = ipu.custom_ops.precompiled_user_op([x, y], lib_path, outs=outputs)
  return o


def my_net(x):
  inplace = tf.get_variable("weights",
                            shape=[4],
                            initializer=tf.zeros_initializer())

  # Even though the custom op is in place, TF still needs to see an assignment.
  inplace_add = tf.assign(inplace, add_op(inplace, x)[0])
  with tf.control_dependencies([inplace_add]):
    return inplace


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)

And the associated custom op:

/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <poplar/Graph.hpp>
#include <popops/Cast.hpp>
#include <popops/ScaledAdd.hpp>
#include <poputil/exceptions.hpp>

extern "C" {
int32_t custom_op_api_level = 2;
}

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 1;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("add requires 2 inputs.");
  }

  auto left = inputs[0];
  auto right = inputs[1];

  if (left.shape() != right.shape()) {
    throw poputil::poplibs_error("Inputs must have identical shapes.");
  }

  poplar::program::Sequence prog;
  popops::scaledAddTo(graph, left, right, 1.0, prog,
                      debug_prefix + "/custom_add_inplace");
  outputs.push_back(left);
  return prog;
}

11.1.3. Elementwise operations

The IPU driver can do a better job of allocating the layout of Poplar tensors if it can associate them with specific operations. If the output of an operation is the same shape and layout as its first input, then it should be marked as elementwise.

Typically, the graph building code for the operation will clone the input in order to generate the output Poplar tensor.

11.1.4. Tensor allocation

When generating the Poplar graph, the backend sometimes has the freedom to allocate an input to an operation. This happens when an input to the operation is also an input to the graph, or when previous operations do not put constraints on the input tensor.

If this condition occurs, then by default the backend will create the Poplar tensor with linear mapping. See the section on tile mapping in the Poplar and PopLibs API Reference.

To override this behaviour and allocate a tensor using a specific layout mapping, the custom operation can provide a function with the following signature:

extern "C" poplar::Tensor Build_allocator(
  poplar::Graph& graph, std::uint32_t operand,
  const std::vector<size_t>& shape, poplar::Type type,
  const std::string& attributes, const std::string& debug_prefix)

The arguments are:

  • graph: the Poplar graph where the tensor should be created.

  • operand: the operand number of the input to allocate.

  • shape: the shape of the tensor.

  • type: the Poplar data type for the tensor.

  • attributes: a string which was passed as the attributes or gradient_attributes argument to the Python operation (depending on whether this function corresponds to the forward or gradient operation). See Operation attributes for more details.

  • debug_prefix: the name of the operation.
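
As a minimal sketch, an allocator could simply create the operand and spread it linearly over the tiles. A real operation would normally use a PopLibs allocation function that matches how the tensor is consumed, as in the serialized matmul example later in this chapter.

#include <poplar/Graph.hpp>
#include <poputil/TileMapping.hpp>
#include <string>
#include <vector>

// Hypothetical sketch: create the input variable and map it linearly across
// the tiles (which is what the default behaviour would do anyway).
extern "C" poplar::Tensor Build_allocator(
    poplar::Graph& graph, std::uint32_t operand,
    const std::vector<size_t>& shape, poplar::Type type,
    const std::string& attributes, const std::string& debug_prefix) {
  poplar::Tensor t = graph.addVariable(type, shape, debug_prefix + "/input");
  poputil::mapTensorLinearly(graph, t);
  return t;
}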

11.1.5. Gradient operations

As described above, when the gradient of the forward operation is generated, either a single operation or multiple operations can be inserted into the graph.

You can use the separate_gradients parameter of the precompiled_user_op function to select which of the two options is required. The compiled code must match this setting.

If the separate_gradients parameter is set to False, then the compiled function for generating the gradient operation should fill in one output for each of the inputs of the forward pass function. Each output should be the partial derivative with respect to one of the inputs.

If the separate_gradients parameter is set to True, then the gradient operation building function should produce an operation with a single output, which is the partial derivative with respect to only one of the forward pass inputs.

The specific input will be given by the input_grad_index argument of the call to the shared object's Build_grad function.
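
For example, a hypothetical gradient builder written for separate_gradients=True produces only the single partial derivative selected by input_grad_index. For the element-wise add sketched earlier, both partial derivatives are simply the incoming gradient:

#include <poplar/Graph.hpp>
#include <string>
#include <vector>

// Hypothetical sketch for separate_gradients=True: produce one output, the
// partial derivative for the input selected by input_grad_index.
extern "C" poplar::program::Program Build_grad(
    poplar::Graph& graph, int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes, const std::string& debug_prefix) {
  (void)input_grad_index;  // d(x + y)/dx == d(x + y)/dy == incoming gradient
  outputs.push_back(gradients[0]);
  return poplar::program::Sequence();
}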

11.1.6. Stateless operations

If an operation’s outputs depend only on the values of its inputs, and not on any internally stored state, then the operation is said to be stateless. Marking an operation as stateless in the metadata function allows the TensorFlow backend to perform optimisations which would otherwise be disallowed, such as common code removal.

11.1.7. Operation attributes

If an operation requires some data which is not available when compiling the C++ Poplar function, then the string attributes argument can be used to pass such information from the Python level operation to the C++ function. Since the attributes argument is a string object, any data format which can be serialized/deserialized to/from a string, such as JSON, can be used.

In the following example we add a custom operation which performs a serialized matrix-matrix multiplication, where we use the attributes argument to pass information about the serialization, encoded in the JSON data format, to the C++ function.

/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <cassert>
#include <memory>

#include <poplar/Graph.hpp>
#include <poplin/MatMul.hpp>
#include <popops/ElementWise.hpp>
#include <poputil/exceptions.hpp>

// Use the https://github.com/open-source-parsers/jsoncpp JsonCpp parser
#include "include/json/json.h"

extern "C" {
int32_t custom_op_api_level = 2;
}

namespace {
Json::Value ParseAttributes(const std::string& attributes) {
  // Parse Json.
  Json::CharReaderBuilder builder;
  std::string errs;
  Json::Value parsed_json;
  std::unique_ptr<Json::CharReader> reader(builder.newCharReader());
  bool parsed =
      reader->parse(attributes.c_str(), attributes.c_str() + attributes.size(),
                    &parsed_json, &errs);
  assert(parsed && errs.empty());
  return parsed_json;
}

std::vector<size_t> GetVectorFromJson(Json::Value& val) {
  std::vector<size_t> result;
  result.reserve(val.size());
  for (auto a : val) {
    result.push_back(a.asUInt64());
  }
  return result;
}
}  // namespace

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               std::uint32_t num_inputs) {
  allocating_indices = {0, 1};
  num_inplace = 0;
  is_elementwise = false;
}

extern "C" poplar::Tensor Build_allocator(poplar::Graph& graph,
                                          std::uint32_t operand,
                                          const std::vector<size_t>& shape,
                                          poplar::Type type,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  assert(operand < 2);
  // Parse JSON and get the expected attributes.
  Json::Value json = ParseAttributes(attributes);
  const int serialization_factor = json["serialization_factor"].asInt();
  std::vector<std::size_t> lhs_shape = GetVectorFromJson(json["lhs_shape"]);
  std::vector<std::size_t> rhs_shape = GetVectorFromJson(json["rhs_shape"]);

  // Verify shapes and adjust them to be slice shapes.
  assert(lhs_shape.size() == 2);
  assert(rhs_shape.size() == 2);

  assert(lhs_shape[1] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of LHS shape");
  lhs_shape[1] /= serialization_factor;

  assert(rhs_shape[0] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of RHS shape");
  rhs_shape[0] /= serialization_factor;

  // Allocate the slice.
  poplar::Tensor slice;
  if (operand == 0) {
    // Allocating for lhs - allocate the slice.
    slice = poplin::createMatMulInputLHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/LHS");
  } else {
    assert(operand == 1);
    slice = poplin::createMatMulInputRHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/RHS");
  }

  // Clone the slice for each serialized matrix multiply.
  std::vector<poplar::Tensor> slices(serialization_factor);
  slices[0] = slice;
  for (int i = 1; i != serialization_factor; ++i) {
    slices[i] = graph.clone(slice);
  }

  // Concatenate the slices into a single tensor - the concatenation dimension
  // depends on the operand which is being allocated.
  poplar::Tensor t = poplar::concat(slices, operand == 0 ? 1 : 0);
  return t;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("matmul requires 2 inputs.");
  }
  Json::Value json = ParseAttributes(attributes);
  poplar::program::Sequence seq;
  poplar::Tensor lhs = inputs[0];
  poplar::Tensor rhs = inputs[1];
  poplar::Tensor output;

  const int serialization_factor = json["serialization_factor"].asInt();
  const int slice_size = lhs.dim(1) / serialization_factor;
  for (int i = 0; i != serialization_factor; ++i) {
    // Slice out the parts of the matmul.
    poplar::Tensor lhs_slice =
        lhs.slice(i * slice_size, (i + 1) * slice_size, 1);
    poplar::Tensor rhs_slice =
        rhs.slice(i * slice_size, (i + 1) * slice_size, 0);
    // Do the partial matmul.
    poplar::Tensor partial_matmul = poplin::matMul(
        graph, lhs_slice, rhs_slice, seq, debug_prefix + "/Slice");

    // Accumulate the results from partial matmuls.
    if (i == 0) {
      output = partial_matmul;
    } else {
      popops::addInPlace(graph, output, partial_matmul, seq,
                         debug_prefix + "/Add");
    }
  }
  outputs = {output};
  return seq;
}

This operation is then executed with:

import os
import json
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config()
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
lib_path = os.path.join(base_path, "libtutorial_attributes_example.so")


def my_net(x, y):
  x_shape = x.get_shape().as_list()
  y_shape = y.get_shape().as_list()
  outputs = {
      "output_types": [x.dtype],
      "output_shapes": [tf.TensorShape([x_shape[0], y_shape[1]])],
  }

  # We create a matmul operation, which we want to perform as two serialized
  # matmuls. We also record all the input shapes.
  attributes = {
      "serialization_factor": 2,
      "lhs_shape": x_shape,
      "rhs_shape": y_shape
  }
  attributes_json = json.dumps(attributes)

  o = ipu.custom_ops.precompiled_user_op([x, y],
                                         lib_path,
                                         attributes=attributes_json,
                                         outs=outputs)

  return o


with tf.device("cpu"):
  x_ph = tf.placeholder(np.float32, [128, 1024])
  y_ph = tf.placeholder(np.float32, [1024, 64])

with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_ph, y_ph])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_ph: np.full(x_ph.shape, 10.0),
                        y_ph: np.full(y_ph.shape, 12.0),
                    })

  print(result)

11.1.8. Example

This example shows the source file for a rotate operation, which takes three vectors and rotates the points given by the x and y vectors by the corresponding angles in the third vector:

/* Copyright 2020 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 2;
}

extern "C" void Build_metadata(std::vector<std::int64_t>& allocating_indices,
                               std::uint32_t& num_inplace, bool& is_elementwise,
                               bool& is_stateless, std::uint32_t num_inputs) {
  allocating_indices.clear();
  num_inplace = 0;
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor.  We will map the vertices so
   * that they match the layout of the 'x' input tensor (input[0]).  If the 'x'
   * tensor was laid out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which describes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the inputs tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               {{"x_out", xOutputFlat.slices(regions)},
                                {"y_out", yOutputFlat.slices(regions)},
                                {"x_in", xFlat.slices(regions)},
                                {"y_in", yFlat.slices(regions)},
                                {"angle", aFlat.slices(regions)}});

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setCycleEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}

This is the associated codelet file:

#include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate the tensors 'x' and 'y', by the angle (radians) in the
 * tensor 'angle', around the origin.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;

This is an example of it in use:

import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.utils.create_ipu_config(profiling=True, use_poplar_text_report=True)
cfg = ipu.utils.set_ipu_model_options(cfg, compile_ipu_code=False)
cfg = ipu.utils.auto_select_ipus(cfg, 1)
ipu.utils.configure_ipu_system(cfg)

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])
  y_data = tf.placeholder(np.float32, [4])
  p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
  outputs = {
      "output_types": [tf.float32, tf.float32],
      "output_shapes": [tf.TensorShape([4]),
                        tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
  gp_path = os.path.join(base_path, "custom_codelet.gp")

  o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                         lib_path,
                                         gp_path,
                                         outs=outputs)
  return o


def my_net(x, y, a):
  return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_data: [2., 4., 6., -1.],
                        y_data: [2., 3., 8., -1.],
                        p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                    })

  print(result)

When compiling the host-side shared object file, it is not necessary to include or link against any TensorFlow header or library files. Only the Poplar headers and link libraries should be necessary.

11.2. Fully customised CPU operations

The framework also allows a custom operation that executes code on the CPU instead of on the IPU. As with the device-side custom operation, you must provide a shared object, but in this case it contains a callback function that performs the computation on the host. The signature of this function should be:

extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name);

The arguments are:

  • data: the input data. The function should be written to expect a certain data type, so that the void pointers can be cast into the expected type.

  • number_of_elements: indicates the number of elements in each of the input data buffers.

  • outputs: should be filled in by the operation.

  • name: the name of the operation within the XLA/HLO graph.
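
For example, a hypothetical CPU callback implementing an element-wise add of two float32 inputs might look like the following sketch (this assumes the output buffers have already been allocated by the framework and that both inputs have the same length):

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: add two float32 input buffers element by element and
// write the result into the first output buffer.
extern "C" void Callback(const std::vector<void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         std::vector<void*>& outputs,
                         const std::string& name) {
  const float* in0 = static_cast<const float*>(data[0]);
  const float* in1 = static_cast<const float*>(data[1]);
  float* out = static_cast<float*>(outputs[0]);
  for (std::uint32_t i = 0; i < number_of_elements[0]; ++i) {
    out[i] = in0[i] + in1[i];
  }
}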

11.3. Custom elementwise expressions

The Python function ipu.custom_ops.codelet_expression_op provides an interface for giving a custom fused expression to the compiler. This will be encoded into a single compute set. See tensorflow.python.ipu.custom_ops.codelet_expression_op() for details.

The arguments to codelet_expression_op are a callable Python function which encodes the arithmetic expression, and the tensor arguments to the operation.

For instance:

def my_custom_op(x, y, z):
    return x * x + y * z

ipu.custom_ops.codelet_expression_op(my_custom_op, a, b, c)

In this example, the Python function my_custom_op provides the expression, and the arguments a, b and c are the three inputs from other parts of the TensorFlow graph.

Python operators which are supported in the function are +, -, *, and abs.

11.4. API Level Versioning

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 2;
}

You must include the code above in your source code. The custom op loader checks the API level of the custom op and refuses to load it if it does not match the current API level. A different API level normally means that the interface is binary incompatible with the previous version.

11.4.1. Changes in API Level

The changes introduced at each API level are:

  1. is_stateless has been added to the metadata function.

  2. attributes: a string argument has been added to the allocation and build functions, which allows passing user-defined attributes to the operation (and its gradient operation, if present).