3. Understanding vertices

This chapter describes the Vertex class, including how vertices are run, parameters are passed and data types are stored.

You can also follow a practical walkthrough of writing vertex code in the Poplar Vertices tutorial from the Graphcore tutorials.

3.1. The Vertex class

Vertices in Poplar are subclasses of the Vertex or MultiVertex base classes. Each has a compute() method that runs on the tile and returns either void or a bool value. The Vertex::compute() method runs in a single worker thread. The MultiVertex::compute(unsigned) method runs in multiple worker threads.

For example:

#include <poplar/Vertex.hpp>

using namespace poplar;

class AdderVertex : public Vertex {
public:
  Input<float> x;
  Input<float> y;
  Output<float> sum;

  void compute() {
    *sum = x + y;
  }
};

The Input and Output fields are edges that connect the vertex to the tensor data that it reads and writes. Writing to an Input field or reading from an Output field gives undefined results. If you need a field that is both read and written, define it as InOut.

These fields have begin, end, operator[] and operator* methods so they can be iterated over and accessed like other C++ containers. For Input fields all of these methods are const.

The Output field can be successfully updated even if the corresponding tensor is on another tile. This is because the data is not transferred to the destination tile until the compute is complete. However, reading an Output field is not guaranteed to return the expected value. If you need to both write to and read from a field, then it should be declared as an InOut type.

In order to run the vertex’s compute() method, you need to add the vertex to a compute set. When compiled, the compute set is reduced to a single function that calls the compute() functions of all the vertices it contains.

3.2. Vertex state

Any vertex state required is provided as member fields inside the vertex class.

Note

Field names cannot begin with an underscore (_). Names beginning with an underscore are reserved for use by Poplar.

3.2.1. Vector and VectorList types

Two vector types are used: Vector and VectorList, representing one- and two-dimensional blocks of data respectively. VectorList is a “jagged” 2D list; in other words, the sub-lists need not be the same length.

Each of these vector types can be represented in memory in different ways. See Section 5, Vertex vector types for details.

3.2.2. Allowed field types as vertex state

Type support for fields is limited to the types in Types.hpp.

The following types are not allowed as vertex fields:

  • User-defined types, either as element types in a Vector or as the type of a field itself

  • Input, Output and InOut edges with user-defined types

  • Standard containers, including those whose size is statically defined

  • A non-static pointer to a supported type

  • A static pointer to a supported type is allowed, but dynamic memory allocation with malloc() is not, which makes such a pointer ineffectual

The code example below shows which field types are allowed and not allowed in vertices. The types that are not allowed are marked NOT allowed in the comments.

// User defined type
struct ComplexFloat {
  float a;
  float b;
};

// A vertex definition
class ProcessComplexFloat : public MultiVertex {
public:
  // User defined types are NOT allowed either as
  // types in Vectors or as types.
  Vector<Input<ComplexFloat>> x;
  ComplexFloat z;

  // Supported types are allowed
  Vector<Input<int>> y;

  // Standard containers are NOT allowed including those
  // whose size is statically defined
  unsigned aArray[10];
  std::vector<bool> bVector;
  std::array<unsigned, 10> cArray;

  // A single element of a supported type is allowed
  unsigned singleValue;

  // A non-static pointer to a supported type is NOT allowed
  short *ptrToShort;

  // A static pointer to supported type is allowed but dynamic
  // memory allocation via malloc is NOT allowed, making it
  // ineffectual.
  static int *staticPtrToInt;

  // An Input/Output/InOut edge with user defined type is NOT allowed
  Input<ComplexFloat> in;

  // An Input/Output/InOut edge with a supported type is allowed
  Output<float> out;


  bool compute() {
    return true;
  }
};

See Section 4, Supported types for more information about the types supported on the IPU.

3.2.3. Specifying memory constraints

The poplar::constraint attribute can be applied to vertices to restrict where vertex state is placed in memory. This takes one or more string parameters.

See the IPU Programmer’s Guide for a description of the memory architecture in the IPU.

The parameters available are described in the table below, where src and dst are names of vectors in your vertex state.

Table 3.1 Poplar constraints

Type: "elem(*src)!=elem(*dst)"

Description: This constraint places the vertex field src in a different memory element from the vertex field dst. This means you can load from the src pointer and store to the dst pointer in the same cycle without causing a memory clash. If you find that this constraint doesn’t give much performance benefit, remove it, as it can be costly in terms of total memory use.

Type: "region(*src)!=region(*dst)"

Description: This constraint places the two fields in different regions. This implies that one of them will be placed in interleaved memory, although it doesn’t matter which one. As above, this means you can use load-store instructions to do a simultaneous load from src and store to dst.

3.2.4. Stack allocation

When C++ functions are compiled, the compiler is usually able to determine the stack required. This is not possible if you use recursion, function calls via pointers or variable-length arrays (array variables that have a size that is not a compile time constant).

If you must use these techniques, then you must explicitly specify the stack used by your functions. Macros are provided for this purpose. These are defined in the Poplar header file StackSizeDefs.hpp. See the runtime API section of the Poplar API Reference for more information.

  • DEF_STACK_USAGE size function

    This defines the total stack usage (in bytes) for the function specified and any functions that it calls. This means that Poplar will not traverse the call graph of the function to determine the total stack usage of the function.

    If you use recursion, this macro must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.

  • DEF_FUNC_CALL_PTRS

    This defines a list of other functions that may be called via pointers. Note that this creates a maintainability problem as the macro use must be updated every time the code changes its use of function pointers.

3.3. MultiVertex worker threads

Vertices with the Vertex base class run in a single worker thread and can access all the information they need to run from their vertex state alone. The compute() method of vertices with the MultiVertex base class is run multiple times in different worker threads. The Poplar compiler generates code to run the compute method of a multi-vertex multiple times and passes a single argument to the multi-vertex compute method which is the thread ID of the running worker.

You can obtain the total number of invocations of the multi-vertex compute method for a given vertex in vertex code using the MultiVertex::numWorkers() method and in host code using the poplar::Target::getNumWorkerContexts() method.

The worker thread IDs given to the multi-vertex compute method will always be in the range [0, MultiVertex::numWorkers()) and the same ID will never be given twice in the same compute set for the same multi-vertex.

The worker thread ID allows a multi-vertex to have precise control over the split of work to be performed in parallel for a single vertex. For example, you could split the work of adding two vectors of numbers together using a multi-vertex like so:

class AddTwoVectors : public MultiVertex {
public:
  Input<Vector<unsigned>> a;
  Input<Vector<unsigned>> b;
  Output<Vector<unsigned>> c;

  void compute(unsigned workerId) {
    const auto numElements = a.size();
    for (std::size_t i = workerId;
         i < numElements;
         i += MultiVertex::numWorkers()) {
      c[i] = a[i] + b[i];
    }
  }
};
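To see how this strided split distributes elements, the following host-side sketch mimics the loop in AddTwoVectors using standard C++ only. It does not use Poplar: the number of workers is passed in as an ordinary parameter, and the helper names indicesForWorker and coverage are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the strided loop in AddTwoVectors::compute().
// Returns the element indices that the worker with ID workerId would process.
// This is a hypothetical helper, not part of the Poplar API.
std::vector<std::size_t> indicesForWorker(unsigned workerId,
                                          unsigned numWorkers,
                                          std::size_t numElements) {
  assert(numWorkers > 0); // a zero stride would never terminate
  std::vector<std::size_t> indices;
  for (std::size_t i = workerId; i < numElements; i += numWorkers) {
    indices.push_back(i);
  }
  return indices;
}

// Counts how many workers touch each element. A count of 1 everywhere means
// the split covers every element and no two workers share an element.
std::vector<unsigned> coverage(unsigned numWorkers, std::size_t numElements) {
  std::vector<unsigned> hits(numElements, 0);
  for (unsigned w = 0; w < numWorkers; ++w) {
    for (std::size_t i : indicesForWorker(w, numWorkers, numElements)) {
      ++hits[i];
    }
  }
  return hits;
}
```

For example, assuming 6 workers and 20 elements, indicesForWorker(1, 6, 20) yields the indices 1, 7, 13 and 19, and coverage(6, 20) contains only ones. A worker whose ID is not less than the number of elements simply does no work.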

3.3.1. Thread safety

An ordinary vertex provides compile-time thread safety checking because the regions of memory that are read and written by each worker in the compute set are defined. In a multi-vertex, you control which regions of memory each worker thread reads and writes. Consequently, no compile-time thread safety checking is available and you must take care to avoid thread safety issues yourself.

You may safely read from the same region of memory from multiple threads.

You may safely write to the same region of memory from multiple threads but the order of those writes is undefined.

However, you must also consider the atomic write size of the target in use. The atomic write size in bytes is available in host code as poplar::Target::getAtomicStoreGranularity(). This value gives the smallest alignment and size in bytes that can be written to memory atomically. When writing data where the number of bytes or the address is not a multiple of the atomic write size, multiple instructions are required to perform the write. This is because the existing memory contents need to be read, partially modified with the new data, and then re-written to memory. This means that the write is not atomic. Consequently, two threads writing to the same atom could overwrite each other’s data.
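The condition described above can be written as a small predicate. The sketch below is plain host-side C++ and does not call Poplar. The helper name isWholeAtomWrite is hypothetical, and the granularity is an ordinary parameter; on a real target you would obtain it from poplar::Target::getAtomicStoreGranularity().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true if a write of sizeInBytes bytes starting at the given byte
// address covers only whole atoms, that is, both the address and the size are
// multiples of the atomic store granularity. Writes that fail this check need
// a read-modify-write sequence and are therefore not atomic.
// This is a hypothetical helper, not part of the Poplar API.
bool isWholeAtomWrite(std::uintptr_t address, std::size_t sizeInBytes,
                      std::size_t granularity) {
  assert(granularity > 0);
  return address % granularity == 0 && sizeInBytes % granularity == 0;
}
```

Assuming a granularity of 4, a 4-byte write at address 0x1000 passes the check, whereas a 2-byte write at 0x1000 or a 4-byte write at 0x1002 does not.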

To demonstrate this, let’s write a simple vertex exhibiting undefined behaviour due to non-atomic writes. Let us assume that poplar::Target::getAtomicStoreGranularity() returns 4 and poplar::Target::getTypeSize(poplar::Type) returns 2 for poplar::UNSIGNED_SHORT:

class WriteOne : public MultiVertex {
public:
  InOut<Vector<unsigned short>> data;

  void compute(unsigned workerId) {
    for (std::size_t i = workerId;
         i < data.size();
         i += MultiVertex::numWorkers()) {
      data[i] = 1;
    }
  }
};

The undefined behaviour in this vertex arises from the combination of two facts:

  1. Each worker thread writes to a destination that has size 2 bytes (the size of an unsigned short). This is less than the atomic store granularity, meaning the value will be written non-atomically by reading 4 bytes, modifying 2 bytes, and writing 4 bytes back to memory.

  2. Another worker writes non-atomically to the other 2 bytes within the same 4-byte region.

Assume our target has just 2 worker threads and the field data is connected to a tensor with 2 elements and both elements have value 0 in tile memory before this vertex runs. An illustration of what each thread may do when executing the WriteOne vertex follows:

Table 3.2 Possible execution with 2 worker threads producing incorrect result.

  1. Read

     Thread 0: Read 4 bytes from tile memory at address &data[0] into thread 0 registers, registers now have unsigned short values {0, 0}

     Thread 1: Read 4 bytes from tile memory at address &data[0] into thread 1 registers, registers now have unsigned short values {0, 0}

  2. Modify

     Thread 0: Modify value read for data[0], registers now have unsigned short values {1, 0}

     Thread 1: Modify value read for data[1], registers now have unsigned short values {0, 1}

  3. Write

     Thread 0: Write 4 bytes to tile memory at address &data[0] from thread 0 registers which have unsigned short values {1, 0}

     Thread 1: Write 4 bytes to tile memory at address &data[0] from thread 1 registers which have unsigned short values {0, 1}

Depending on the order that thread 0 and 1 perform their writes, the values {1, 0} or {0, 1} are left in tile memory after this vertex runs and neither result is the intended one.
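The interleaving described above can be replayed deterministically on a host. The following sketch is a single-threaded model in standard C++, not IPU code: it assumes one 4-byte atom holding two unsigned short values, and the names Atom, readAndModify, runRacyInterleaving and runSerialised are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 4-byte atom of tile memory holding two 2-byte elements (assumed sizes).
using Atom = std::array<std::uint16_t, 2>;

// Models the first two steps of a non-atomic 2-byte store: read the whole
// atom into registers, then modify only the element this thread owns.
Atom readAndModify(const Atom &memory, std::size_t element) {
  assert(element < 2);
  Atom registers = memory; // read 4 bytes
  registers[element] = 1;  // modify 2 bytes
  return registers;
}

// Both threads read and modify before either writes back, so the later
// write overwrites the earlier one.
Atom runRacyInterleaving(bool thread0WritesLast) {
  Atom memory = {0, 0};
  const Atom regs0 = readAndModify(memory, 0); // thread 0
  const Atom regs1 = readAndModify(memory, 1); // thread 1
  if (thread0WritesLast) {
    memory = regs1;
    memory = regs0;
  } else {
    memory = regs0;
    memory = regs1;
  }
  return memory;
}

// Each thread finishes its read-modify-write before the other starts.
Atom runSerialised() {
  Atom memory = {0, 0};
  memory = readAndModify(memory, 0); // thread 0
  memory = readAndModify(memory, 1); // thread 1
  return memory;
}
```

In this model runRacyInterleaving leaves {1, 0} or {0, 1} in memory, depending on which thread writes last, while runSerialised leaves the intended {1, 1}. On real hardware the interleaving is not under your control, which is why the result is undefined.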

In this example we can fix the undefined behaviour by aligning the address of the data to at least the atomic store granularity and by having each worker thread write a number of consecutive elements with total size that is a multiple of the atomic store granularity:

class WriteOne : public MultiVertex {
public:
  // The input data is aligned to at least 4 bytes.
  InOut<Vector<unsigned short, 4>> data;

  void compute(unsigned workerId) {
    for (std::size_t i = workerId * 2;
         i < data.size();
         i += MultiVertex::numWorkers() * 2) {
      // Each worker is now guaranteed to always write 4 consecutive bytes at
      // a 4 byte aligned address. Non-atomic writes no longer matter because
      // no other thread will write non-atomically to the same 4 byte region.
      data[i] = 1;
      data[i + 1] = 1;
    }
  }
};
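One way to convince yourself that a given split is safe is to list the atoms each worker writes and check that no atom appears twice. The sketch below does this on the host in standard C++. It assumes 4-byte atoms, 2-byte elements and data aligned to 4 bytes, so that elements 2k and 2k+1 share atom k. The names kElementsPerAtom, atomsWrittenByWorker and atomsAreDisjoint are hypothetical. Unlike the vertex above, the sketch also guards against an odd number of elements.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed sizes from the example: a 4-byte atom holds two unsigned shorts.
constexpr std::size_t kElementsPerAtom = 2;

// Returns the indices of the atoms one worker writes when each loop step
// covers elementsPerStep consecutive elements (1 models the original
// WriteOne vertex, 2 models the corrected one).
std::set<std::size_t> atomsWrittenByWorker(unsigned workerId,
                                           unsigned numWorkers,
                                           std::size_t numElements,
                                           std::size_t elementsPerStep) {
  assert(numWorkers > 0 && elementsPerStep > 0);
  std::set<std::size_t> atoms;
  for (std::size_t i = workerId * elementsPerStep; i < numElements;
       i += numWorkers * elementsPerStep) {
    for (std::size_t k = 0; k < elementsPerStep && i + k < numElements; ++k) {
      atoms.insert((i + k) / kElementsPerAtom);
    }
  }
  return atoms;
}

// Returns true if no atom is written by more than one worker.
bool atomsAreDisjoint(unsigned numWorkers, std::size_t numElements,
                      std::size_t elementsPerStep) {
  std::set<std::size_t> seen;
  for (unsigned w = 0; w < numWorkers; ++w) {
    for (std::size_t atom :
         atomsWrittenByWorker(w, numWorkers, numElements, elementsPerStep)) {
      if (!seen.insert(atom).second) {
        return false; // a second worker writes the same atom
      }
    }
  }
  return true;
}
```

With 2 workers and 8 elements, atomsAreDisjoint(2, 8, 1) is false because both workers touch every atom, whereas atomsAreDisjoint(2, 8, 2) is true because worker 0 writes atoms 0 and 2 and worker 1 writes atoms 1 and 3.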

3.4. Calling conventions

There is a vertex calling convention (see Section 14, Application binary interface (ABI)) that is used by vertices. However, the compute() method itself does not use this calling convention. Because of this, when Poplar compiles a vertex it will create a new function that does use the calling convention, which then calls the compute() method and propagates the return value.

The name of this wrapper function is __runCodelet_XXXX where XXXX is the mangled name of the class that contains the compute method (see Section 3.5, Vertex name mangling). The wrapper for a Vertex::compute() method looks like this:

int __runCodelet_MyVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyVertex*>(vertexPtr);
  return v->compute();
}

The wrapper for a MultiVertex::compute(unsigned) method looks like this:

int __runCodelet_MyMultiVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyMultiVertex*>(vertexPtr);
  auto w = __builtin_ipu_get(CSR_W_WSR__INDEX) & CSR_W_WSR__CTXTID_M1__MASK;
  return v->compute(w);
}

The compute() method of a vertex with base class Vertex takes no parameters. The MultiVertex::compute(unsigned) method takes a single parameter, which is the thread ID of the worker running the method.

3.4.1. External codelets

When you write an assembly, or external, implementation of a vertex you need to inform Poplar that you are providing the __runCodelet_XXXX function so it does not generate the wrapper itself. You do this by adding a static bool isExternalCodelet to the Vertex or MultiVertex class. When this exists and is set to true, Poplar will assume that the __runCodelet_XXXX function is defined, and will call that, ignoring the compute() method.

You can use this to define, at compile time, whether you have provided assembly code definitions for some or all possible template instantiations of a vertex. For example, consider a vertex like this:

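#include <type_traits>

template <typename FPType>
class Foo : public Vertex {
public:
  // An in-class initializer requires the static member to be const.
  static const bool isExternalCodelet = std::is_same<FPType, float>::value;
  bool compute() { return true; }
};

template class Foo<half>;
template class Foo<float>;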

This states that Poplar should only use the compiled compute method for the Foo<half> vertex and that we will provide an assembly implementation of the Foo<float> vertex.
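The compile-time selection can be checked on a host without the Poplar SDK. The sketch below is only an illustration: Vertex and half are empty stand-in types defined locally so that the example compiles on its own, and the flag is declared static const so that the in-class initializer is valid C++.

```cpp
#include <type_traits>

// Stand-ins so that the sketch compiles without Poplar. In real vertex code,
// Vertex comes from <poplar/Vertex.hpp> and half is the IPU half-precision
// type.
struct Vertex {};
struct half {};

template <typename FPType>
class Foo : public Vertex {
public:
  // True only for the float instantiation, which is the one assumed to have
  // an assembly implementation.
  static const bool isExternalCodelet = std::is_same<FPType, float>::value;
  bool compute() { return true; }
};

// The flag is a compile-time constant, so it can be verified at build time.
static_assert(Foo<float>::isExternalCodelet,
              "Foo<float> is expected to use the assembly implementation");
static_assert(!Foo<half>::isExternalCodelet,
              "Foo<half> is expected to use the compiled compute() method");
```

If you later add an assembly implementation for another type, the condition in the initializer and the corresponding static_assert are the only places that need to change in this sketch.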

3.4.2. Recursion and function pointers

You should avoid the use of recursion and function calls via pointers. Using these will prevent the Poplar runtime from correctly computing the stack usage.

If you must use recursion, the DEF_STACK_USAGE macro (see Section 10.3.1, Specifying stack size) must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.

If you need to call via function pointers, you can use the DEF_FUNC_CALL_PTRS macro to specify a list of other functions that may be called via pointers. Note that this creates a maintainability problem as the macro use must be updated every time the code changes its use of function pointers. See the API documentation for details.

3.5. Vertex name mangling

The Poplar name mangling is designed to be easy to write. All mangled vertices begin with __runCodelet_ followed by the full class name, including namespace, with the following changes made:

  • Replace __ (two underscores) with _Z

  • Replace :: with __ (two underscores)

  • Replace < with ___ (three underscores)

  • Replace , in the template argument list with _ (one underscore)

  • Discard >

Some examples are shown in the table below.

Table 3.3 Name mangling

Before                                         After
popnn::NonLinearity2D<float, 2>                __runCodelet_popnn__NonLinearity2D___float_2
popops::UnaryOp<popops::expr::ABSOLUTE, half>  __runCodelet_popops__UnaryOp___popops__expr__ABSOLUTE_half
popconv::ConvPartial1x1<float, half, true>     __runCodelet_popconv__ConvPartial1x1___float_half_true
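The substitution rules listed in this section can be sketched as a single pass over the class name. The function below is a host-side illustration in standard C++, not part of Poplar, and the names mangleVertexName and checkTableExamples are hypothetical. Dropping spaces is an assumption inferred from the examples, where "float, 2" becomes "float_2"; the listed rules do not mention whitespace.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Applies the substitutions to a fully qualified vertex class name and
// prepends the __runCodelet_ prefix. Working on the input one character at a
// time means the underscores produced for "::" and "<" are never re-mangled.
std::string mangleVertexName(const std::string &className) {
  std::string out = "__runCodelet_";
  for (std::size_t i = 0; i < className.size(); ++i) {
    const char c = className[i];
    const bool hasNext = i + 1 < className.size();
    if (c == '_' && hasNext && className[i + 1] == '_') {
      out += "_Z"; // "__" becomes "_Z"
      ++i;
    } else if (c == ':' && hasNext && className[i + 1] == ':') {
      out += "__"; // "::" becomes two underscores
      ++i;
    } else if (c == '<') {
      out += "___"; // "<" becomes three underscores
    } else if (c == ',') {
      out += '_'; // "," becomes one underscore
    } else if (c == '>' || c == ' ') {
      // ">" is discarded by the rules. Discarding spaces is an assumption
      // based on the documented examples.
    } else {
      out += c;
    }
  }
  return out;
}

// Usage example: reproduces the entries of Table 3.3.
void checkTableExamples() {
  assert(mangleVertexName("popnn::NonLinearity2D<float, 2>") ==
         "__runCodelet_popnn__NonLinearity2D___float_2");
  assert(mangleVertexName("popops::UnaryOp<popops::expr::ABSOLUTE, half>") ==
         "__runCodelet_popops__UnaryOp___popops__expr__ABSOLUTE_half");
  assert(mangleVertexName("popconv::ConvPartial1x1<float, half, true>") ==
         "__runCodelet_popconv__ConvPartial1x1___float_half_true");
}
```

A helper like this can be handy when naming the entry point of an external codelet, but the names generated by Poplar itself remain the authority.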