3. Understanding vertices

This chapter describes the Vertex class, including how vertices are run, parameters are passed and data types are stored.

You can also follow a practical walkthrough of writing vertex code in the Poplar Vertices tutorial from the Graphcore tutorials repository.

3.1. The Vertex class

Vertices in Poplar are subclasses of the Vertex or MultiVertex base classes. They each have a compute() method that is run on the tile and returns a bool value. The Vertex::compute() method runs in a single worker thread. The MultiVertex::compute(unsigned) method runs in multiple worker threads.

In order to run the vertex’s compute() method, you need to add the vertex to a compute set. When compiled, the compute set is reduced to a single function that calls the compute() functions of all the vertices it contains.

3.2. Vertex state

Any vertex state required is provided as member fields inside the vertex class.

3.2.1. Vector and VectorList types

Two vector types are used: Vector and VectorList, representing one-dimensional and two-dimensional blocks of data respectively. VectorList is a “jagged” 2D list: the sub-lists need not be the same length.

Each of these vector types can be represented in memory in different ways. See Vector types for details.

3.2.2. Specifying memory constraints

The poplar::constraint attribute can be applied to vertices to restrict where vertex state is placed in memory. This takes one or more string parameters.

The parameters available are described in the table below, where src and dst are names of vectors in your vertex state.

Table 3.1 Poplar constraints

Type: "elem(*src)!=elem(*dst)"
Description: The vertex field src is placed in a different memory element from the vertex field dst. This means you can load from the src pointer and store to the dst pointer in the same cycle without causing a memory clash. If you find that this constraint does not give much performance benefit, remove it, because it can be costly in terms of total memory use.

Type: "region(*src)!=region(*dst)"
Description: The two fields are placed in different regions. This implies that one of them will be placed in interleaved memory, although it doesn’t matter which one. As above, this means you can use load-store instructions to do a simultaneous load from src and store to dst.
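As an illustration of where such a constraint might be written, the sketch below attaches one to a simple copy vertex. It assumes the constraint is placed on the class declaration using standard C++ attribute syntax. The Vertex, Vector, Input and Output definitions here are stand-ins so that the sketch compiles without the Poplar SDK, and CopyVertex is a hypothetical name. A standard compiler will ignore the attribute (possibly with a warning).

```cpp
#include <cstddef>
#include <vector>

// Stand-ins (assumptions) so this sketch compiles without the Poplar SDK.
// Real vertex code would use the types from the Poplar vertex headers.
struct Vertex {};
template <typename T> using Vector = std::vector<T>;
template <typename T> using Input = T;
template <typename T> using Output = T;

// Hypothetical vertex: asks for src and dst to be placed in different
// memory elements so a load and a store can happen in the same cycle.
class [[poplar::constraint("elem(*src)!=elem(*dst)")]] CopyVertex
    : public Vertex {
public:
  Input<Vector<float>> src;
  Output<Vector<float>> dst;

  bool compute() {
    for (std::size_t i = 0; i < src.size(); ++i) {
      dst[i] = src[i];
    }
    return true;
  }
};
```

The constraint strings name the vertex fields, so they need to be kept in step with the field names if the vertex state is renamed.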

3.3. MultiVertex worker threads

Vertices with the Vertex base class run in a single worker thread and can access all the information they need from their vertex state alone. The compute() method of vertices with the MultiVertex base class is run multiple times, in different worker threads. The Poplar compiler generates the code that makes these repeated calls and passes each one a single argument: the thread ID of the worker running that call.

You can obtain the total number of invocations of the multi-vertex compute method for a given vertex in vertex code using the MultiVertex::numWorkers() method and in host code using the poplar::Target::getNumWorkerContexts() method.

The worker thread IDs given to the multi-vertex compute method will always be in the range [0, MultiVertex::numWorkers()) and the same ID will never be given twice in the same compute set for the same multi-vertex.

The worker thread ID gives a multi-vertex precise control over how the work for a single vertex is split between parallel workers. For example, you might split the work of adding two vectors of numbers together using a multi-vertex like so:

#include <cstddef>
#include <poplar/Vertex.hpp>

using namespace poplar;

class AddTwoVectors : public MultiVertex {
public:
  Input<Vector<unsigned>> a;
  Input<Vector<unsigned>> b;
  Output<Vector<unsigned>> c;

  // Each worker handles elements workerId, workerId + numWorkers(), ...
  bool compute(unsigned workerId) {
    const auto numElements = a.size();
    for (std::size_t i = workerId;
         i < numElements;
         i += MultiVertex::numWorkers()) {
      c[i] = a[i] + b[i];
    }
    return true;
  }
};
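The strided split used by AddTwoVectors can be checked on the host with plain C++. The sketch below is not Poplar code: addStrided stands in for one call to compute(workerId), and addWithAllWorkers calls it once for every worker ID in the range [0, numWorkers). Both function names are hypothetical, and the worker count is passed as a parameter rather than obtained from MultiVertex::numWorkers().

```cpp
#include <cstddef>
#include <vector>

// Stand-in for one MultiVertex::compute(workerId) call: processes
// elements workerId, workerId + numWorkers, workerId + 2 * numWorkers, ...
void addStrided(const std::vector<unsigned> &a,
                const std::vector<unsigned> &b,
                std::vector<unsigned> &c,
                unsigned workerId, unsigned numWorkers) {
  for (std::size_t i = workerId; i < a.size(); i += numWorkers) {
    c[i] = a[i] + b[i];
  }
}

// Runs the stand-in once per worker ID. On the device these calls would
// run in separate worker threads; here they run one after another.
std::vector<unsigned> addWithAllWorkers(const std::vector<unsigned> &a,
                                        const std::vector<unsigned> &b,
                                        unsigned numWorkers) {
  std::vector<unsigned> c(a.size(), 0);
  for (unsigned workerId = 0; workerId < numWorkers; ++workerId) {
    addStrided(a, b, c, workerId, numWorkers);
  }
  return c;
}
```

Because each index belongs to exactly one worker ID, every element of c is written once, whatever the worker count. Note that neighbouring elements are written by different workers, which matters for the atomic write size discussed under thread safety.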

3.3.1. Thread safety

An ordinary vertex provides compile-time thread safety checking because the regions of memory that are read and written by each worker in the compute set are defined. In a multi-vertex, you control which regions of memory each worker thread reads and writes. Consequently, no compile-time thread safety checking is available and you must take care to avoid any such issues yourself.

You may safely read from the same region of memory from multiple threads.

You may safely write to the same region of memory from multiple threads but the order of those writes is undefined.

However, you must also consider the atomic write size of the target in use. The atomic write size in bytes is available in host code as poplar::Target::getAtomicStoreGranularity(). This value gives the smallest alignment and size in bytes that can be written to memory atomically. When the number of bytes written, or the address written to, is not a multiple of the atomic write size, multiple instructions are required to perform the write. This is because the existing memory contents need to be read, partially modified with the new data, and then written back to memory. This means that the write is not atomic. Consequently, two threads writing to the same atom could overwrite each other’s data.
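To see how a write smaller than one atom can lose data, the sketch below models memory as a single 32-bit atom in plain C++ and replays one unlucky interleaving of two workers by hand. The 4-byte atom size and the names withByte, interleavedByteStores and sequentialByteStores are assumptions made for illustration; use poplar::Target::getAtomicStoreGranularity() to find the real value for your target.

```cpp
#include <cstdint>

// Returns `word` with byte `index` replaced by `value`. This is the
// "partially modify" step of a read-modify-write.
std::uint32_t withByte(std::uint32_t word, unsigned index,
                       std::uint8_t value) {
  const unsigned shift = index * 8;
  const std::uint32_t mask = std::uint32_t{0xFF} << shift;
  return (word & ~mask) | (std::uint32_t{value} << shift);
}

// Two workers write different bytes of the same atom, and both read the
// atom before either has written it back.
std::uint32_t interleavedByteStores() {
  std::uint32_t memory = 0;
  const std::uint32_t readByA = memory; // worker A reads the atom
  const std::uint32_t readByB = memory; // worker B reads the same atom
  memory = withByte(readByA, 0, 0xAA);  // worker A writes back
  memory = withByte(readByB, 1, 0xBB);  // worker B writes back a stale copy
  return memory;                        // worker A's byte has been lost
}

// The same two writes with no overlap between the read-modify-writes.
std::uint32_t sequentialByteStores() {
  std::uint32_t memory = 0;
  memory = withByte(memory, 0, 0xAA);
  memory = withByte(memory, 1, 0xBB);
  return memory;
}
```

In this model the interleaved version ends with only worker B's byte set, while the sequential version keeps both. One way to rule out the problem is to give each worker a region whose start address and size are multiples of the atomic write size, so that no two workers ever write to the same atom.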

3.4. Calling conventions

There is a vertex calling convention (see Application binary interface (ABI)) that is used by vertices. However, the compute() method itself does not use this calling convention. Because of this, when Poplar compiles a vertex it will create a new function that does use the calling convention, which then calls the compute() method and propagates the return value.

The name of this wrapper function is __runCodelet_XXXX where XXXX is the mangled name of the class that contains the compute method (see Vertex name mangling). The wrapper for a Vertex::compute() method looks like this:

int __runCodelet_MyVertex() {

  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyVertex*>(vertexPtr);
  return v->compute();
}

The wrapper for a MultiVertex::compute(unsigned) method looks like this:

int __runCodelet_MyMultiVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyMultiVertex*>(vertexPtr);
  auto w = __builtin_ipu_get(CSR_W_WSR__INDEX) & CSR_W_WSR__CTXTID_M1__MASK;
  return v->compute(w);
}

The compute() method of a vertex with base class Vertex takes no parameters. The MultiVertex::compute(unsigned) method of a vertex with base class MultiVertex takes a single parameter, which is the thread ID of the worker running the method.

3.4.1. External codelets

When you write an assembly, or external, implementation of a vertex you need to inform Poplar that you are providing the __runCodelet_XXXX function so it does not generate the wrapper itself. You do this by adding a static bool isExternalCodelet to the Vertex or MultiVertex class. When this exists and is set to true, Poplar will assume that the __runCodelet_XXXX function is defined, and will call that, ignoring the compute() method.

You can use this to define, at compile time, whether you have provided assembly code definitions for some or all possible template instantiations of a vertex. For example, consider a vertex like this:

template <typename FPType>
class Foo : public Vertex {
public:
  // True only for Foo<float>, which has an assembly implementation.
  static const bool isExternalCodelet = std::is_same<FPType, float>::value;
  bool compute() { return true; }
};

template class Foo<half>;
template class Foo<float>;

This states that Poplar should only use the compiled compute method for the Foo<half> vertex and that we will provide an assembly implementation of the Foo<float> vertex.
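The same pattern extends to vertices with more than one template parameter. The sketch below assumes a hypothetical Cast vertex for which an assembly implementation exists only for the float-to-half instantiation. The Vertex and half definitions are stand-ins so that the sketch compiles without the Poplar SDK.

```cpp
#include <type_traits>

// Stand-ins (assumptions) so this sketch compiles without the Poplar SDK.
struct Vertex {};
struct half {};

// Hypothetical vertex with two template parameters.
template <typename InType, typename OutType>
class Cast : public Vertex {
public:
  // Assembly is assumed to be provided only for Cast<float, half>.
  static const bool isExternalCodelet =
      std::is_same<InType, float>::value &&
      std::is_same<OutType, half>::value;

  bool compute() { return true; }
};

template class Cast<float, half>; // expects a hand-written __runCodelet_ function
template class Cast<half, float>; // uses the compiled compute() method
```

Keeping the condition in one place means that adding an assembly implementation for another instantiation only requires extending this expression.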

3.4.2. Recursion and function pointers

You should avoid the use of recursion and function calls via pointers. Using these will prevent the Poplar runtime from correctly computing the stack usage.

If you must use recursion, the DEF_STACK_USAGE macro (see Section 7.3.1, Specifying stack size) must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.

If you need to call via function pointers, you can use the DEF_FUNC_CALL_PTRS macro to specify a list of other functions that may be called via pointers. Note that this creates a maintainability problem, as the macro use must be updated every time the code changes its use of function pointers. See the API documentation for details.
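Where the algorithm allows it, the simplest way to avoid these macros is to rewrite the recursion as a loop. The sketch below uses a greatest common divisor function purely as an illustration; gcdRecursive and gcdIterative are hypothetical helpers written in plain C++, not part of Poplar.

```cpp
// Recursive form: unless the compiler removes the recursion, the stack
// depth depends on the input values.
unsigned gcdRecursive(unsigned a, unsigned b) {
  return b == 0 ? a : gcdRecursive(b, a % b);
}

// Iterative form: the same computation with one fixed-size stack frame,
// so the stack usage does not depend on the inputs.
unsigned gcdIterative(unsigned a, unsigned b) {
  while (b != 0) {
    const unsigned remainder = a % b;
    a = b;
    b = remainder;
  }
  return a;
}
```

The iterative form needs no manual stack annotation and stays correct when the code around it changes.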

3.5. Vertex name mangling

The Poplar name mangling is designed to be easy to write. All mangled vertex names begin with __runCodelet_ followed by the full class name, including namespace, with the following changes made:

  • Replace __ (two underscores) with _Z

  • Replace :: with __ (two underscores)

  • Replace < with ___ (three underscores)

  • Replace , in the template argument list with _ (one underscore)

  • Discard >

Some examples are shown in the table below.

Table 3.2 Name mangling

Before: popnn::NonLinearity2D<float, 2>
After: __runCodelet_popnn__NonLinearity2D___float_2

Before: popops::UnaryOp<popops::expr::ABSOLUTE, half>
After: __runCodelet_popops__UnaryOp___popops__expr__ABSOLUTE_half

Before: popconv::ConvPartial1x1<float, half, true>
After: __runCodelet_popconv__ConvPartial1x1___float_half_true
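The mangling rules can be sketched as a small plain C++ function, which is handy for checking the name an assembly implementation must export. mangleVertexName is a hypothetical helper, not part of the Poplar API. It also drops spaces, which is an assumption inferred from the examples in Table 3.2 rather than one of the listed rules.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: applies the mangling rules to a full class name.
std::string mangleVertexName(const std::string &name) {
  std::string out = "__runCodelet_";
  for (std::size_t i = 0; i < name.size(); ++i) {
    const char c = name[i];
    const bool hasNext = i + 1 < name.size();
    if (c == '_' && hasNext && name[i + 1] == '_') {
      out += "_Z"; // "__" becomes "_Z"
      ++i;
    } else if (c == ':' && hasNext && name[i + 1] == ':') {
      out += "__"; // "::" becomes "__"
      ++i;
    } else if (c == '<') {
      out += "___"; // "<" becomes "___"
    } else if (c == ',') {
      out += "_"; // "," becomes "_"
    } else if (c == '>' || c == ' ') {
      // ">" is discarded; dropping spaces is an assumption (see above)
    } else {
      out += c;
    }
  }
  return out;
}
```

For example, passing "popnn::NonLinearity2D<float, 2>" produces the first mangled name in Table 3.2.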