14. Writing custom operations

If TensorFlow for the IPU does not implement an operation that you need, then there are two ways you can add a custom operation to the TensorFlow graph.

The first is to implement the operation in C++ using the Poplar graph programming framework. See Section 14.1, Custom operation on the IPU. This provides the highest performance because the operation runs on the IPU.

The second is to execute the custom operation on the host CPU. See Section 14.2, Custom host CPU operations. This may be easier to implement because you only need to write host code, without needing to learn Poplar. However, the performance will be lower because it does not exploit the parallelism available on the IPU, and because the data has to be moved from the IPUs to the host and back.

Note

In the rest of this chapter, “custom op” or “op” will be used to refer specifically to the new custom operation made available in the TensorFlow code. The word “operation” will be used more generally to talk about the implementation of this custom op.

14.1. Custom operation on the IPU

To create a custom op on the IPU, you need to write a Poplar program that performs the required functions on the input tensors. After compiling this code, you can load it into your TensorFlow program to create a custom op, which can then be used in your TensorFlow model in the same way as any other op.

The following sections provide more detail on these steps.

14.1.1. Building the Poplar graph

The custom op is defined in a C++ program that populates a graph with a poplar::Program object containing the operations to be performed on the input tensors. The Poplar and PopLibs libraries provide a rich set of functions optimised for the IPU. You can also add your own functionality as “codelets”, which contain C++ code compiled for, and executed on, the IPU.

For more information about writing Poplar graph programs and codelets, refer to the Poplar and PopLibs User Guide and the Poplar tutorials in the Graphcore GitHub examples repository.

Your program must contain a function to build the graph, which will be called from TensorFlow when you instantiate the custom op. This has the following signature:

extern "C"
poplar::program::Program Build(
    poplar::Graph& graph,
    const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string &attributes,
    const std::string &debug_prefix)

The default name for the function is Build(). If you want to use a different name (because you have multiple custom ops, for example), you can specify the name of the function when importing the program into TensorFlow. See the definition of the tensorflow.python.ipu.custom_ops.precompiled_user_op() function for details.

Note

The extern "C" declaration is required to ensure that the compiler does not change the function name (C++ compilers will normally modify, or “decorate”, function names to encode extra information about the function).

The parameters to Build() are:

graph: A Poplar graph to add the Program object and tensors to, in order to implement the operation.
inputs: A vector of tensors which are inputs to the operation. These are passed as the input arguments to the custom op when it is called in TensorFlow.
outputs: A vector of tensors that are the outputs of the operation. These will be returned as the result of the custom op in TensorFlow. This vector will initially be empty, so you will need to add result tensors to it.
attributes: A string which is passed as the attributes argument to the custom op in TensorFlow. See Operation attributes for more details.
debug_prefix: The debug name that is passed to the custom op in TensorFlow.

The Build() function returns the program object that it added to the graph.

14.1.2. Gradient builders

If the op is required for training, then you must also implement a function that builds a Poplar graph for the gradient operation. This has the same name as the forward-operation builder with _grad appended.

The signature of the gradient builder function is:

extern "C"
poplar::program::Program Build_grad(
    poplar::Graph& graph,
    int input_grad_index,
    const std::vector<poplar::Tensor>& gradients,
    const std::vector<poplar::Tensor>& fwd_inputs,
    const std::vector<poplar::Tensor>& fwd_outputs,
    std::vector<poplar::Tensor>& outputs,
    const std::string& attributes,
    const std::string& debug_prefix)

The parameters to Build_grad() are:

graph: A Poplar graph to add the Program object and tensors to, in order to implement the operation.
input_grad_index: The index of the input tensor to calculate the partial derivative for.

You can choose to implement a gradient operation that calculates the partial derivatives for all tensors or for one tensor at a time. In the latter case, you need to set separate_gradients to True when you call precompiled_user_op().

There may be advantages in calculating all the gradients at the same time; for example, if there are common sub-expressions. On the other hand, this prevents TensorFlow from performing some optimisations, such as dead-code elimination when not all of the gradients are required.

If the separate_gradients parameter is set to False, then your function for generating the gradient operation must populate one output tensor for each of the inputs of the forward pass function. Each output must be the partial derivative with respect to one of the inputs.

If the separate_gradients parameter is True, then the gradient operation building function must produce an operation with a single output, which is the partial derivative with respect to only one of the forward pass inputs. The specific input will be given by the input_grad_index argument to the Build_grad() function.

If your gradient operation calculates all of the partial derivatives, then you can ignore the input_grad_index parameter.
gradients: The inputs to the gradient operation, from the previous gradient operation or loss.
fwd_inputs: The input tensors to the forward-pass operation.
fwd_outputs: The output tensors from the forward-pass operation.
outputs: The outputs from this gradient operation. There must be one per input of the forward operation. Inputs which are not differentiable can be assigned a “null” Poplar tensor (that is, one created with the default Tensor constructor and containing no data).
attributes: A string which is passed as the gradient_attributes argument to the custom op when called from TensorFlow. See Operation attributes for more details.
debug_prefix: The name of the operation.

The Build_grad() function returns the program object that it added to the graph.
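
The difference between the two gradient modes can be modelled without Poplar. The following is a plain-Python sketch for a hypothetical op z = x * y. The function names grad_all and grad_one are invented for this illustration and are not part of the Poplar or TensorFlow APIs; they only mirror how many outputs the gradient operation is expected to produce in each mode.

```python
def grad_all(gradients, fwd_inputs):
    """Model of separate_gradients=False: one output per forward input."""
    (dz,) = gradients
    x, y = fwd_inputs
    # Partial derivatives of z = x * y, scaled by the incoming gradient.
    return [dz * y, dz * x]


def grad_one(input_grad_index, gradients, fwd_inputs):
    """Model of separate_gradients=True: a single output chosen by index."""
    (dz,) = gradients
    x, y = fwd_inputs
    if input_grad_index == 0:
        return [dz * y]
    if input_grad_index == 1:
        return [dz * x]
    raise IndexError("the forward op only has two inputs")


print(grad_all([1.0], [3.0, 5.0]))
print(grad_one(1, [1.0], [3.0, 5.0]))
```

In the first mode every partial derivative is always built, whereas in the second mode a gradient that is never requested is simply never built.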

14.1.3. Metadata

You can also specify extra information about the custom op by including a metadata function in the object file. This has the same name as the builder function with _metadata appended.

This function has the following signature:

extern "C"
void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise,
    bool& is_stateless,
    bool& is_hashable,
    std::uint32_t num_inputs)

The parameters are used to return the following information about the operation:

allocating_indices: Use this to specify which input tensors will be allocated using the tensor-allocation function described in Section 14.1.6, Tensor allocation.
replica_identical_output_indices: Experimental. Use this to specify which output tensors are identical across replicas. The compiler uses this to help provide deterministic behaviour when running with replication and performing stochastic rounding.

An empty vector means that no tensors are identical across replicas.

input_to_output_tensor_aliasing: Use this map to indicate if any of the input and output tensors alias. The keys and values in the map are the vector indexes of the input and output tensors, respectively. For example, a mapping from 1 to 0 indicates that input tensor 1 is aliased with output tensor 0. This means that poplar::Tensor::intersectsWith() would return true when called for these tensors.

Providing information about whether an input tensor aliases an output tensor allows the TensorFlow graph compiler to perform more optimisation. It also ensures that if an input tensor is updated in-place and used as an output, then any other uses of that tensor will be completed before this operation is run, to ensure correct behaviour. See In-place operations for an example of using this for an in-place operation.

If an input tensor is not mapped to an output tensor, then the operation must not modify that input tensor. If it is modified, then other operations which use it as an input may be passed incorrect values.
is_elementwise: Set this to true if the output of an operation is the same shape and layout as its first input. (This parameter was originally used to tell the compiler that an operation was elementwise. However, its meaning has changed to indicate any operation where the compiler can perform optimisations based on matching the input and output tensors.)

In this case, your graph-building code for the operation will typically clone the input in order to generate the output tensor.
is_stateless: Set this to true if this operation is “stateless”.

If an operation’s outputs depend only on the value of their inputs, and not any internally stored state, then the operation is said to be stateless. Marking an operation as stateless will allow the TensorFlow backend to perform optimisations which would otherwise not be possible, such as common code removal. It also allows the custom op to be used with recomputation (see Section 6.6, Recomputation).

Custom ops are stateful by default.
is_hashable: Set this to true if this operation can be uniquely hashed.

In order to detect when code changes and needs to be recompiled, the TensorFlow compiler will generate a hash value for the TensorFlow graph. If all ops in the graph are hashable then the executable will be saved in the cache (if enabled). This allows the graph to be run multiple times without needing to recompile it. See Section 5.1, Caching of compiled executables for more information.

However, because the TensorFlow compiler does not have any information about the implementation of the custom operation or its dependencies, the compiler will treat it as non-hashable. As a result, the TensorFlow program will be recompiled every time it is run.

If you can guarantee that the custom operation and its dependencies will not change, then you can set this parameter to true.

This attribute must be set to true if you intend to pre-compile your TensorFlow program (see Section 5.2, Pre-compiling executables).
num_inputs: This is the number of input tensors that the operation is called with.

If you use the metadata function to specify some information about the custom operation, then you must set the values of all the parameters even if you are using the default values.

Gradient builders have their own metadata functions. These are named after the gradient builder function with _metadata appended. For example: Build_grad_metadata().

14.1.4. Compiling the IPU code

API level

You need to specify the API level that your operation code is compatible with. The custom op loader checks the API level of the library file and will not load the file if its level does not match the current API level. A change in API level normally means that the file is not compatible with previous versions. See Table 14.1 for information about the changes in the API.

You must include the following code in your builder program to specify the API level.

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 5;
}

Table 14.1 API level changes
API level	Changes to the API
1	`is_stateless` was added to the metadata function.
2	The `attributes` parameter was added to the allocation and the build functions to allow user-defined attributes to be passed to the operation (and its gradient operation, if present).
3	`input_to_output_tensor_aliasing` replaced `num_inplace` to allow finer-grain description of the operation performed in order to allow more optimisations.
4	`is_hashable` was added to the metadata builder function.
5	`replica_identical_output_indices` was added to the metadata builder function.

PopLibs library code

You need to explicitly add the IPU code for any PopLibs libraries that you use. For example, if your code uses the popops and poprand libraries, then you need to include the following in your builder code:

#include <popops/codelets.hpp>
#include <poprand/codelets.hpp>

extern "C"
poplar::program::Program Build_grad(poplar::Graph& graph,
                                    int input_grad_index,
                                    const std::vector<poplar::Tensor>& gradients,
                                    const std::vector<poplar::Tensor>& fwd_inputs,
                                    const std::vector<poplar::Tensor>& fwd_outputs,
                                    std::vector<poplar::Tensor>& outputs,
                                    const std::string& attributes,
                                    const std::string& debug_prefix) {

    ... // create the program object in the graph

    popops::addCodelets(graph);
    poprand::addCodelets(graph);
}

Compiling the library file

The code has to be compiled to create a shared-library object file. For example, if you have a source file called poplar_code.cpp that contains the Build() function, you can use the following command line to generate a library file called libcustom_op.so:

$ g++ poplar_code.cpp -shared -fpic -o libcustom_op.so -lpoplar -lpoputil -lpoprand

Note that you need to link the Poplar and PopLibs libraries that you use (in this example poplar, poputil and poprand). See the Poplar and PopLibs API Reference for more information.

It is not necessary to include or link against any TensorFlow header or library files. Only the Poplar and PopLibs headers, and the corresponding libraries are required.

You can add -g to the above command to compile the custom operation with debugging symbols. This allows you to debug the C++ code with gdb.

14.1.5. Using the custom op in TensorFlow

You can call the custom operation from TensorFlow with precompiled_user_op(). This specifies the library file containing the custom operation code, the input and output tensors, and other information needed to use the op in TensorFlow. See precompiled_user_op() in the API documentation for more information.

14.1.6. Tensor allocation

If the input tensors to the operation have not already been allocated to tiles because of their use by other operations, then the TensorFlow compiler will, by default, allocate the tensors with linear mapping.

You can override this behaviour by defining a function that allocates tensors in a way that is most efficient for your operation. See the section on variable mapping in the Poplar and PopLibs API Reference for more information.

To do this, define a function with the suffix _allocator with the following signature:

extern "C" poplar::Tensor Build_allocator(
    poplar::Graph& graph,
    std::uint32_t operand,
    const std::vector<size_t>& shape,
    poplar::Type type,
    const std::string& attributes,
    const std::string& debug_prefix)

The parameters to the function are:

graph: The graph to add the tensor to.
operand: The index of the input tensor to allocate.
shape: The shape of the tensor.
type: The Poplar data type for the tensor.
attributes: A string which is passed as the attributes or gradient_attributes argument to the custom op in TensorFlow (depending on whether this function corresponds to the forward or gradient operation). See Operation attributes for more details.
debug_prefix: the name of the operation.

The allocator function returns the tensor that it has allocated.

If the input tensor has already been allocated, then this function will not be called.
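
The naming conventions described so far can be summarised in a few lines. The sketch below is a hypothetical helper (related_symbols is not part of any Graphcore API) that derives the function names from the builder name using the suffixes given in this chapter. Only the builder function is always required; the others are needed only if you use the corresponding feature.

```python
def related_symbols(builder_name="Build"):
    """Map each role to the function name derived from the builder name."""
    gradient = builder_name + "_grad"
    return {
        "builder": builder_name,
        "metadata": builder_name + "_metadata",
        "allocator": builder_name + "_allocator",
        "gradient": gradient,
        "gradient_metadata": gradient + "_metadata",
    }


for role, name in related_symbols("Rotate").items():
    print(role, "->", name)
```

A check like this can be useful when a library file holds several custom ops and each one uses a different builder name.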

14.1.7. Examples

Some examples of using a custom op in TensorFlow are shown in the following sections. There are further examples in the Graphcore GitHub tutorials repository.

Note

These examples are only for TensorFlow 1.

From Poplar SDK 3.1, TensorFlow 1 will only be supported in CentOS 7.

In addition, Examples and Tutorials for TensorFlow 1 are only available up to version 3.0 of the SDK. There has been limited testing of the 3.0 versions of the TensorFlow 1 tutorials and examples with later versions of the Poplar SDK.

In-place operations

An operation can use the same tensor as an input and output, modifying the tensor in-place as opposed to creating a new output tensor.

You can use the input_to_output_tensor_aliasing map in the metadata to indicate this to the TensorFlow compiler by specifying that the input tensor is aliased with an output tensor.

When you update tensors in-place, the TensorFlow compiler must see an assignment to the tensor, otherwise the changes to the input tensor will be optimised away. This means that the in-place inputs always need to be returned as outputs of the custom operation. If a tf.Variable object is modified in-place then it has to be assigned back to itself with tf.assign.

Listing 14.1 shows an example of adding an in-place custom op to a TensorFlow model. The implementation of the operation is shown in Listing 14.2.
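
As a rough guide to the values you might expect from the two runs in Listing 14.1, the following plain-Python sketch models the in-place update. It assumes that the variable keeps its value between the two sess.run() calls. The function scaled_add_to here is a stand-in written for this illustration, not the PopLibs function of a similar name.

```python
def scaled_add_to(left, right, scale=1.0):
    """Stand-in for the in-place update: left += scale * right."""
    for i, value in enumerate(right):
        left[i] += scale * value
    return left


weights = [0.0, 0.0, 0.0, 0.0]   # the variable is initialised to zeros
feed = [2.0, 4.0, 6.0, -1.0]     # the same values are fed on each run

first = list(scaled_add_to(weights, feed))
second = list(scaled_add_to(weights, feed))
print(first)
print(second)
```

Because the update is applied to the same storage each time, the second run accumulates on top of the first.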

Listing 14.1 custom_add_inplace.py

import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])


def add_op(x, y):
  outputs = {
      "output_types": [tf.float32],
      "output_shapes": [tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_add_inplace.so")

  o = ipu.custom_ops.precompiled_user_op([x, y], lib_path, outs=outputs)
  return o


def my_net(x):
  inplace = tf.get_variable("weights",
                            shape=[4],
                            initializer=tf.zeros_initializer())

  # Even though the custom op is in place, TF still needs to see an assignment.
  inplace_add = tf.assign(inplace, add_op(inplace, x)[0])
  with tf.control_dependencies([inplace_add]):
    return inplace


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data])

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)

  result = sess.run(xla_result, feed_dict={x_data: [2., 4., 6., -1.]})
  print(result)


Listing 14.2 custom_add_inplace.cc

#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include <poplar/Graph.hpp>
#include <popops/Cast.hpp>
#include <popops/ScaledAdd.hpp>
#include <poputil/exceptions.hpp>

extern "C" {
int32_t custom_op_api_level = 5;
}

extern "C" void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise, bool& is_stateless, bool& is_hashable,
    std::uint32_t num_inputs) {
  allocating_indices.clear();
  input_to_output_tensor_aliasing = {
      {/*input tensor index=*/0, /*output tensor index=*/0}};
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          const std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("add requires 2 inputs.");
  }

  auto left = inputs[0];
  auto right = inputs[1];

  if (left.shape() != right.shape()) {
    throw poputil::poplibs_error("Inputs must have identical shapes.");
  }

  poplar::program::Sequence prog;
  popops::scaledAddTo(graph, left, right, 1.0, prog,
                      debug_prefix + "/custom_add_inplace");
  outputs.push_back(left);
  return prog;
}


Operation attributes

If an operation requires some data which is not available when compiling the C++ builder function, then the string attributes argument can be used to pass such information from the TensorFlow op to the C++ function. Since the attributes argument is a string object, any data format which can be serialized/deserialized as a string, such as JSON, can be used.
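
On the Python side, building such a string can be as simple as the sketch below. The helper encode_attributes is hypothetical, and the keys mirror those used by the matrix-multiplication example later in this section. Validating the values before serialising them is a design choice made for this illustration, so that mistakes are reported in Python rather than inside the C++ builder.

```python
import json


def encode_attributes(serialization_factor, lhs_shape, rhs_shape):
    """Serialise the op attributes to a JSON string (hypothetical helper)."""
    if lhs_shape[1] != rhs_shape[0]:
        raise ValueError("inner dimensions of LHS and RHS must match")
    if lhs_shape[1] % serialization_factor != 0:
        raise ValueError("serialization_factor must divide the inner dimension")
    return json.dumps({
        "serialization_factor": serialization_factor,
        "lhs_shape": list(lhs_shape),
        "rhs_shape": list(rhs_shape),
    })


attributes_json = encode_attributes(2, [128, 1024], [1024, 64])
decoded = json.loads(attributes_json)
print(decoded["serialization_factor"], decoded["lhs_shape"])
```

The resulting string is what would be passed as the attributes argument of the custom op.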

In Listing 14.3, we implement a custom operation which performs a serialized matrix-matrix multiplication where the attributes argument passes information about serialization, encoded in JSON data format, to the C++ function. Listing 14.4 shows how this custom op is called from TensorFlow.
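
To make the idea of a serialized multiplication concrete, here is a minimal plain-Python model of the arithmetic. It is only a sketch: it uses nested lists rather than Poplar tensors, and the function names matmul and serialized_matmul are chosen for this illustration. The shared dimension is split into serialization_factor slices and the partial products are accumulated, which gives the same result as a single multiplication.

```python
def matmul(a, b):
    """Naive matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def serialized_matmul(lhs, rhs, serialization_factor):
    """Accumulate partial products over slices of the shared dimension."""
    inner = len(rhs)
    assert inner % serialization_factor == 0
    slice_size = inner // serialization_factor
    output = None
    for s in range(serialization_factor):
        begin, end = s * slice_size, (s + 1) * slice_size
        lhs_slice = [row[begin:end] for row in lhs]  # columns of the LHS
        rhs_slice = rhs[begin:end]                   # rows of the RHS
        partial = matmul(lhs_slice, rhs_slice)
        if output is None:
            output = partial
        else:
            output = [[o + p for o, p in zip(out_row, part_row)]
                      for out_row, part_row in zip(output, partial)]
    return output


lhs = [[1, 2, 3, 4], [5, 6, 7, 8]]
rhs = [[1, 0], [0, 1], [2, 0], [0, 2]]
print(serialized_matmul(lhs, rhs, 2) == matmul(lhs, rhs))
```

Each partial product only needs one slice of each operand at a time, which is the motivation for serializing a large multiplication.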

Listing 14.3 tutorial_attributes_example.cc

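#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

#include <poplar/Graph.hpp>
#include <poplin/MatMul.hpp>
#include <popops/ElementWise.hpp>
#include <poputil/exceptions.hpp>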

// Use the https://github.com/open-source-parsers/jsoncpp JsonCpp parser
#include "include/json/json.h"

extern "C" {
int32_t custom_op_api_level = 5;
}

namespace {
Json::Value ParseAttributes(const std::string& attributes) {
  // Parse Json.
  Json::CharReaderBuilder builder;
  std::string errs;
  Json::Value parsed_json;
  std::unique_ptr<Json::CharReader> reader(builder.newCharReader());
  bool parsed =
      reader->parse(attributes.c_str(), attributes.c_str() + attributes.size(),
                    &parsed_json, &errs);
  // Parsing must succeed and must not have reported any errors.
  assert(parsed && errs.empty());
  return parsed_json;
}

std::vector<size_t> GetVectorFromJson(Json::Value& val) {
  std::vector<size_t> result;
  result.reserve(val.size());
  for (auto a : val) {
    result.push_back(a.asUInt64());
  }
  return result;
}
}  // namespace

extern "C" void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise, bool& is_stateless, bool& is_hashable,
    std::uint32_t num_inputs) {
  allocating_indices = {0, 1};
  is_elementwise = false;
}

extern "C" poplar::Tensor Build_allocator(poplar::Graph& graph,
                                          std::uint32_t operand,
                                          const std::vector<size_t>& shape,
                                          poplar::Type type,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  assert(operand < 2);
  // Parse JSON and get the expected attributes.
  Json::Value json = ParseAttributes(attributes);
  const int serialization_factor = json["serialization_factor"].asInt();
  std::vector<std::size_t> lhs_shape = GetVectorFromJson(json["lhs_shape"]);
  std::vector<std::size_t> rhs_shape = GetVectorFromJson(json["rhs_shape"]);

  // Verify shapes and adjust them to be slice shapes.
  assert(lhs_shape.size() == 2);
  assert(rhs_shape.size() == 2);

  assert(lhs_shape[1] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of LHS shape");
  lhs_shape[1] /= serialization_factor;

  assert(rhs_shape[0] % serialization_factor == 0 &&
         "serialization_factor must divide the dimension of RHS shape");
  rhs_shape[0] /= serialization_factor;

  // Allocate the slice.
  poplar::Tensor slice;
  if (operand == 0) {
    // Allocating for lhs - allocate the slice.
    slice = poplin::createMatMulInputLHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/LHS");
  } else {
    assert(operand == 1);
    slice = poplin::createMatMulInputRHS(graph, type, lhs_shape, rhs_shape,
                                         debug_prefix + "/RHS");
  }

  // Clone the slice for each serialized matrix multiply.
  std::vector<poplar::Tensor> slices(serialization_factor);
  slices[0] = slice;
  for (int i = 1; i != serialization_factor; ++i) {
    slices[i] = graph.clone(slice);
  }

  // Concatenate the slices into a single tensor - the concatenation dimension
  // depends on the operand which is being allocated.
  poplar::Tensor t = poplar::concat(slices, operand == 0 ? 1 : 0);
  return t;
}

extern "C" poplar::program::Program Build(poplar::Graph& graph,
                                          const std::vector<poplar::Tensor>& inputs,
                                          std::vector<poplar::Tensor>& outputs,
                                          const std::string& attributes,
                                          const std::string& debug_prefix) {
  if (inputs.size() != 2) {
    throw poputil::poplibs_error("matmul requires 2 inputs.");
  }
  Json::Value json = ParseAttributes(attributes);
  poplar::program::Sequence seq;
  poplar::Tensor lhs = inputs[0];
  poplar::Tensor rhs = inputs[1];
  poplar::Tensor output;

  const int serialization_factor = json["serialization_factor"].asInt();
  const int slice_size = lhs.dim(1) / serialization_factor;
  for (int i = 0; i != serialization_factor; ++i) {
    // Slice out the parts of the matmul.
    poplar::Tensor lhs_slice =
        lhs.slice(i * slice_size, (i + 1) * slice_size, 1);
    poplar::Tensor rhs_slice =
        rhs.slice(i * slice_size, (i + 1) * slice_size, 0);
    // Do the partial matmul.
    poplar::Tensor partial_matmul = poplin::matMul(
        graph, lhs_slice, rhs_slice, seq, debug_prefix + "/Slice");

    // Accumulate the results from partial matmuls.
    if (i == 0) {
      output = partial_matmul;
    } else {
      popops::addInPlace(graph, output, partial_matmul, seq,
                         debug_prefix + "/Add");
    }
  }
  outputs = {output};
  return seq;
}


Listing 14.4 tutorial_attributes_example.py

import os
import json
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure argument for targeting the IPU
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
lib_path = os.path.join(base_path, "libtutorial_attributes_example.so")


def my_net(x, y):
  x_shape = x.get_shape().as_list()
  y_shape = y.get_shape().as_list()
  outputs = {
      "output_types": [x.dtype],
      "output_shapes": [tf.TensorShape([x_shape[0], y_shape[1]])],
  }

  # We create a matmul operation, which we want to perform as two serialized
  # matmuls. We also record all the input shapes.
  attributes = {
      "serialization_factor": 2,
      "lhs_shape": x_shape,
      "rhs_shape": y_shape
  }
  attributes_json = json.dumps(attributes)

  o = ipu.custom_ops.precompiled_user_op([x, y],
                                         lib_path,
                                         attributes=attributes_json,
                                         outs=outputs)

  return o


with tf.device("cpu"):
  x_ph = tf.placeholder(np.float32, [128, 1024])
  y_ph = tf.placeholder(np.float32, [1024, 64])

with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_ph, y_ph])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_ph: np.full(x_ph.shape, 10.0),
                        y_ph: np.full(y_ph.shape, 12.0),
                    })

  print(result)


Custom codelet

Listing 14.5 shows the source file for a custom rotate operation, which takes three vectors and rotates x and y by the values in angle. The vertex code for the custom codelet is shown in Listing 14.6. The TensorFlow program that calls the custom op is shown in Listing 14.7.
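
When testing an op like this, it helps to have a host-side reference to compare against. The following plain-Python sketch applies the same rotation formula as the vertex code in Listing 14.6. The function name rotate_reference is hypothetical and the code does not use the IPU.

```python
import math


def rotate_reference(x, y, angle):
    """Rotate each point (x[i], y[i]) about the origin by angle[i] radians."""
    x_out, y_out = [], []
    for xi, yi, a in zip(x, y, angle):
        x_out.append(xi * math.cos(a) - yi * math.sin(a))
        y_out.append(xi * math.sin(a) + yi * math.cos(a))
    return x_out, y_out


# Rotate the unit vectors along x and y by a quarter turn.
x_out, y_out = rotate_reference([1.0, 0.0], [0.0, 1.0], [math.pi / 2] * 2)
print([round(v, 6) for v in x_out], [round(v, 6) for v in y_out])
```

Results from the IPU should be compared with a tolerance rather than for exact equality, particularly when the op is run with half-precision inputs.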

Listing 14.5 custom_rotate_op.cc

#include <cstdint>
#include <map>
#include <string>
#include <vector>

#include <poplar/Graph.hpp>
#include <poplar/Tensor.hpp>
#include <poputil/Util.hpp>
#include <poputil/VertexTemplates.hpp>
#include <poputil/exceptions.hpp>

// Export the API level symbol
extern "C" {
int32_t custom_op_api_level = 5;
}

extern "C" void Build_metadata(
    std::vector<std::int64_t>& allocating_indices,
    std::vector<std::int64_t>& replica_identical_output_indices,
    std::map<std::int64_t, std::int64_t>& input_to_output_tensor_aliasing,
    bool& is_elementwise, bool& is_stateless, bool& is_hashable,
    std::uint32_t num_inputs) {
  allocating_indices.clear();
  is_elementwise = true;
}

extern "C" poplar::program::Program Build(
    poplar::Graph& graph, const std::vector<poplar::Tensor>& inputs,
    std::vector<poplar::Tensor>& outputs, const std::string& attributes,
    const std::string& debugPrefix) {
  if (inputs.size() != 3) {
    throw poputil::poplibs_error("Rotate requires 3 inputs");
  }

  if (inputs[0].numElements() == 0) {
    return poplar::program::Sequence();
  }

  if (inputs[0].rank() != 1 || inputs[1].rank() != 1 || inputs[2].rank() != 1) {
    throw poputil::poplibs_error("All inputs must be rank 1");
  }

  if (inputs[0].dim(0) != inputs[1].dim(0) ||
      inputs[0].dim(0) != inputs[2].dim(0)) {
    throw poputil::poplibs_error(
        "Length of rotate vector and data vectors must match");
  }

  if (inputs[0].elementType() != inputs[1].elementType() ||
      inputs[0].elementType() != inputs[2].elementType()) {
    throw poputil::poplibs_error(
        "Data types of angle vector and data vectors must match");
  }

  auto dType = inputs[0].elementType();

  /*
   * Create a ComputeSet which will be executed, and contains the vertices
   */
  auto cs = graph.addComputeSet(debugPrefix + "/rotate");

  /*
   * Get the tile mapping for the complete tensor.  We will map the vertices so
   * that they match the layout of the 'x' input tensor (inputs[0]).  If the 'x'
   * tensor was laid out differently to the other ones, then Poplar will
   * insert code to move the data in the other tensors to the mapped tile. So
   * ideally we would choose the best mapping for the vertices by analysing
   * all of the tensor mappings.
   */
  auto tileMapping = graph.getTileMapping(inputs[0]);

  /*
   * Get the target, which describes properties of the hardware.
   */
  auto target = graph.getTarget();

  /*
   * Get the vector width of the particular data type, so that later we can
   * divide the tensor up between workers in an appropriate way.
   */
  const auto vectorWidth = target.getVectorWidth(dType);

  /*
   * Create the output tensors
   */
  outputs.push_back(graph.clone(inputs[0]));
  outputs.push_back(graph.clone(inputs[1]));

  auto xFlat = inputs[0].flatten();
  auto yFlat = inputs[1].flatten();
  auto aFlat = inputs[2].flatten();
  auto xOutputFlat = outputs[0].flatten();
  auto yOutputFlat = outputs[1].flatten();

  for (unsigned tile = 0; tile != tileMapping.size(); ++tile) {
    /*
     * If a tile contains no elements of the tensor then do not create any
     * vertices for it.
     */
    if (tileMapping[tile].empty()) {
      continue;
    }

    /*
     * Split up the regions of the input tensors so that they are evenly
     * distributed between the workers on the tile.
     */
    auto vertexRegions = poputil::splitRegionsBetweenWorkers(
        target, tileMapping[tile], vectorWidth, 2 * vectorWidth);

    for (const auto& regions : vertexRegions) {
      /*
       * If a region has no elements, then there is no need to add a vertex for
       * it.
       */
      if (regions.empty()) {
        continue;
      }

      /*
       * Add codelets to tiles which work over the regions in the input
       * tensors.
       */
      auto v = graph.addVertex(cs, poputil::templateVertex("Rotate", dType),
                               {{"x_out", xOutputFlat.slices(regions)},
                                {"y_out", yOutputFlat.slices(regions)},
                                {"x_in", xFlat.slices(regions)},
                                {"y_in", yFlat.slices(regions)},
                                {"angle", aFlat.slices(regions)}});

      /* Map the vertex onto the appropriate tile. */
      graph.setTileMapping(v, tile);

      /* Provide a bogus cycle count estimate for the profiler. */
      graph.setPerfEstimate(v, 1);
    }
  }

  return poplar::program::Execute(cs);
}

Download custom_rotate_op.cc
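
The listing above relies on poputil::splitRegionsBetweenWorkers() to divide each tile's share of the tensor between workers. In that call, the third and fourth arguments act as a grain size and a minimum number of elements per vertex. The following host-only sketch illustrates the general idea on a single contiguous range. It is not the PopLibs algorithm: the function name splitBetweenWorkers and its exact policy are hypothetical, chosen only to show why a grain size of vectorWidth keeps each worker's chunk aligned to whole vectors.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Poplar or PopLibs): split the half-open
// range [0, numElements) into at most numWorkers contiguous chunks. Every
// chunk except possibly the last is a multiple of grainSize, and workers are
// dropped when there is too little data to give each one minPerWorker.
std::vector<std::pair<std::size_t, std::size_t>> splitBetweenWorkers(
    std::size_t numElements, std::size_t numWorkers, std::size_t grainSize,
    std::size_t minPerWorker) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (numElements == 0 || numWorkers == 0 || grainSize == 0) {
    return ranges;
  }

  // Work in units of whole grains, rounding the element count up.
  const std::size_t numGrains = (numElements + grainSize - 1) / grainSize;
  const std::size_t minGrains =
      std::max<std::size_t>(1, (minPerWorker + grainSize - 1) / grainSize);

  // Use only as many workers as can each receive the minimum amount of work.
  const std::size_t usedWorkers =
      std::max<std::size_t>(1, std::min(numWorkers, numGrains / minGrains));
  const std::size_t chunk =
      ((numGrains + usedWorkers - 1) / usedWorkers) * grainSize;

  for (std::size_t begin = 0; begin < numElements; begin += chunk) {
    ranges.emplace_back(begin, std::min(numElements, begin + chunk));
  }
  return ranges;
}
```

For example, calling splitBetweenWorkers(100, 6, 2, 4) yields six chunks starting at multiples of 18, while splitBetweenWorkers(5, 6, 2, 4) yields a single chunk because there is not enough data to justify more than one worker. The worker count of six is just an example value here.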

Listing 14.6 custom_codelet.cpp

#include <cmath>

#include <poplar/HalfFloat.hpp>
#include <poplar/Vertex.hpp>

using namespace poplar;

/*
 * A codelet to rotate the tensors 'x' and 'y' around the origin by the angle
 * (in radians) given in the tensor 'angle'.
 */
template <typename FPType>
class Rotate : public Vertex {
 public:
  Vector<Output<Vector<FPType>>> x_out;
  Vector<Output<Vector<FPType>>> y_out;
  Vector<Input<Vector<FPType>>> x_in;
  Vector<Input<Vector<FPType>>> y_in;
  Vector<Input<Vector<FPType>>> angle;

  bool compute() {
    for (unsigned i = 0; i < angle.size(); ++i) {
      for (unsigned j = 0; j != angle[i].size(); ++j) {
        float a = angle[i][j];
        float x = x_in[i][j];
        float y = y_in[i][j];
        x_out[i][j] = x * cos(a) - y * sin(a);
        y_out[i][j] = x * sin(a) + y * cos(a);
      }
    }
    return true;
  }
};

template class Rotate<float>;
template class Rotate<half>;

Download custom_codelet.cpp

Listing 14.7 tutorial_custom_codelet.py

import os
import numpy as np

from tensorflow.python import ipu
from tensorflow.python.ipu.scopes import ipu_scope
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Configure arguments for targeting the IPU
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

with tf.device("cpu"):
  x_data = tf.placeholder(np.float32, [4])
  y_data = tf.placeholder(np.float32, [4])
  p_angle = tf.placeholder(np.float32, [4])


def rotate_op(x, y, a):
  outputs = {
      "output_types": [tf.float32, tf.float32],
      "output_shapes": [tf.TensorShape([4]),
                        tf.TensorShape([4])],
  }

  base_path = os.path.join(os.getcwd(), "tensorflow/compiler/plugin/poplar")
  lib_path = os.path.join(base_path, "libcustom_rotate_op.so")
  gp_path = os.path.join(base_path, "custom_codelet.gp")

  o = ipu.custom_ops.precompiled_user_op([x, y, a],
                                         lib_path,
                                         gp_path,
                                         outs=outputs)
  return o


def my_net(x, y, a):
  return rotate_op(x, y, a)


with ipu_scope("/device:IPU:0"):
  xla_result = ipu.ipu_compiler.compile(my_net, [x_data, y_data, p_angle])

with tf.Session() as sess:
  # Base run
  result = sess.run(xla_result,
                    feed_dict={
                        x_data: [2., 4., 6., -1.],
                        y_data: [2., 3., 8., -1.],
                        p_angle: [np.pi, np.pi / 2., 3. * np.pi / 2., 0]
                    })

  print(result)

Download tutorial_custom_codelet.py
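
When checking the output of the custom op, it can help to have a plain host-side reference for the same arithmetic. The following is a minimal sketch that applies the rotation formula from the Rotate codelet on the CPU. The names Rotated and rotateReference are hypothetical and are not part of the example files.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the two rotated coordinate vectors.
struct Rotated {
  std::vector<float> x;
  std::vector<float> y;
};

// Hypothetical host-side reference: rotate each point (x[i], y[i]) around the
// origin by angle[i] radians, using the same formula as the Rotate codelet.
// Assumes all three vectors have the same length.
Rotated rotateReference(const std::vector<float>& x,
                        const std::vector<float>& y,
                        const std::vector<float>& angle) {
  Rotated out{std::vector<float>(x.size()), std::vector<float>(y.size())};
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float c = std::cos(angle[i]);
    const float s = std::sin(angle[i]);
    out.x[i] = x[i] * c - y[i] * s;
    out.y[i] = x[i] * s + y[i] * c;
  }
  return out;
}
```

For the values fed in by tutorial_custom_codelet.py, this formula gives approximately (-2, -2), (-3, 4), (8, -6) and (-1, -1) for the four points, up to floating-point rounding.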

14.2. Custom host CPU operations

You can write a custom operation as a function that executes code on the host CPU instead of on the IPU. The default name for this function is Callback(). As with the builder functions described previously, this must be compiled into a shared library file.

The signature of the callback function is:

extern "C"
void Callback(
    const std::vector<const void*>& data,
    const std::vector<std::uint32_t>& number_of_elements,
    const std::vector<void*>& outputs,
    const std::string& attributes,
    const std::string& name);

The parameters are:

data: The input data passed to the custom op in TensorFlow. The function must be written to expect a specific data type, so that each void pointer can be cast to the expected type.
number_of_elements: This indicates the number of elements in the input data.
outputs: The results returned by the operation.
attributes: A string which is passed as the attributes argument to the custom op in TensorFlow. See Operation attributes for more details.
name: This is the name of the operation within the XLA graph.

You can call the host code from your TensorFlow program using tensorflow.python.ipu.custom_ops.cpu_user_operation(). This specifies the input object file to load, the input and output tensors, and other parameters to the operation.
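
As an illustration of the signature above, the following is a minimal sketch of a host callback that adds two inputs element by element. It assumes, purely for the example, that the op is given two float32 inputs of equal length and one float32 output of the same length whose buffer has already been allocated by the caller. None of these assumptions come from the API itself.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative host callback: element-wise sum of two inputs.
// Assumptions (for this sketch only): both inputs and the single output are
// float32 buffers of the same length, and outputs[0] is already allocated.
extern "C" void Callback(const std::vector<const void*>& data,
                         const std::vector<std::uint32_t>& number_of_elements,
                         const std::vector<void*>& outputs,
                         const std::string& attributes,
                         const std::string& name) {
  // This sketch does not use the attributes string or the op name.
  (void)attributes;
  (void)name;

  // Cast the untyped pointers to the data type the op is written to expect.
  const float* lhs = static_cast<const float*>(data[0]);
  const float* rhs = static_cast<const float*>(data[1]);
  float* out = static_cast<float*>(outputs[0]);

  // number_of_elements[0] is the element count of the first input.
  for (std::uint32_t i = 0; i < number_of_elements[0]; ++i) {
    out[i] = lhs[i] + rhs[i];
  }
}
```

Because the pointers are untyped, a mismatch between the data type assumed here and the type of the tensors passed from TensorFlow would not be caught by the compiler, so the two need to be kept in step by hand.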

14.2.1. Gradient callback

If the op is required for training, then you must also implement a function for the gradient operation. This has the same name as the callback with _grad appended.

The signature of the gradient callback function is:

extern "C" void Callback_grad(
    const std::vector<void*>& data,
    const std::vector<uint32_t>& number_of_elements,
    std::vector<void*>& outputs,
    const std::string& attributes,
    const std::string& name);

The parameters are:

data: The input data passed to the custom op in TensorFlow. The function must be written to expect a specific data type so the void pointer can be cast into the expected type.
number_of_elements: This indicates the number of elements in the input data.
outputs: The results returned by the operation.
attributes: A string which is passed as the gradient_attributes argument to the Python op in TensorFlow. See Operation attributes for more details.
name: This is the name of the operation within the XLA graph.
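
The following sketch shows what a gradient callback might look like for a hypothetical element-wise sum op with two float32 inputs. The layout of data and outputs is an assumption made for this example only: data[0] is taken to hold the gradient of the loss with respect to the op's output, and outputs[0] and outputs[1] are taken to receive the gradients for the first and second input. Check the actual ordering for your op before relying on it.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative gradient callback for a hypothetical element-wise sum op.
// Assumed layout (not confirmed by the guide):
//   data[0]    - float32 gradient of the loss with respect to the op output
//   outputs[0] - float32 gradient with respect to the first input
//   outputs[1] - float32 gradient with respect to the second input
extern "C" void Callback_grad(
    const std::vector<void*>& data,
    const std::vector<std::uint32_t>& number_of_elements,
    std::vector<void*>& outputs,
    const std::string& attributes,
    const std::string& name) {
  // This sketch does not use the attributes string or the op name.
  (void)attributes;
  (void)name;

  const float* upstream = static_cast<const float*>(data[0]);
  float* grad_lhs = static_cast<float*>(outputs[0]);
  float* grad_rhs = static_cast<float*>(outputs[1]);

  // For z = x + y, dz/dx = dz/dy = 1, so the upstream gradient passes
  // through unchanged to both inputs.
  for (std::uint32_t i = 0; i < number_of_elements[0]; ++i) {
    grad_lhs[i] = upstream[i];
    grad_rhs[i] = upstream[i];
  }
}
```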
