3. Understanding vertices

This chapter describes the `Vertex` class, including how vertices are run, parameters are passed and data types are stored.

You can also follow a practical walkthrough of writing vertex code in the Poplar Vertices tutorial from the Graphcore tutorials.

3.1. The Vertex class

Vertices in Poplar are subclasses of the `Vertex` or `MultiVertex` base classes. Each has a `compute()` method that runs on the tile and returns either `void` or a `bool` value. The `Vertex::compute()` method runs in a single worker thread. The `MultiVertex::compute(unsigned)` method runs in multiple worker threads.

For example:

```
#include <poplar/Vertex.hpp>

using namespace poplar;

class AdderVertex : public Vertex {
public:
  Input<float> x;
  Input<float> y;
  Output<float> sum;

  void compute() {
    *sum = x + y;
  }
};
```

The `Input` and `Output` fields are edges that connect the vertex to the tensor data that it reads and writes. An `Input` field should not be written and an `Output` field should not be read; the results are undefined. If you need a field that is read and written, then it should be defined as `InOut`.

These fields have `begin`, `end`, `operator[]` and `operator*` methods so they can be iterated over and accessed like other C++ containers. For `Input` fields all of these methods are `const`.

The `Output` field can be successfully updated even if the corresponding tensor is on another tile. This is because the data is not transferred to the destination tile until the compute is complete. However, reading an `Output` field is not guaranteed to return the expected value. If you need to both write to and read from a field, then it should be declared as an `InOut` type.

To run the vertex’s `compute()` method, you need to add the vertex to a compute set. When compiled, the compute set is reduced to a single function that calls the `compute()` functions of all the vertices it contains.

3.2. Vertex state

Any vertex state required is provided as member fields inside the vertex class.

Note

Field names cannot begin with an underscore (`_`). Names beginning with an underscore are reserved for use by Poplar.

3.2.1. Vector and VectorList types

There are two vector types: `Vector` and `VectorList`, representing one- and two-dimensional blocks of data respectively. `VectorList` is a “jagged” 2D list; in other words, the sub-lists need not all be the same length.

Each of these vector types can be represented in memory in different ways. See Section 5, Vertex vector types for details.

3.2.2. Allowed field types as vertex state

Type support for fields is limited to the types in Types.hpp.

The following types are not allowed as vertex fields:

• User-defined types, either as element types in a `Vector` or as plain fields; `Input`, `Output` and `InOut` edges with user-defined types are also not allowed

• Standard containers, including those whose size is statically defined

• Non-static pointers to a supported type (a static pointer to a supported type is allowed, but because dynamic memory allocation with `malloc()` is not available, this is of little practical use)

The code example below shows which field types are allowed and not allowed in vertices. The types that are not allowed are marked in the comments.

```
// User-defined type
struct ComplexFloat {
  float a;
  float b;
};

// A vertex definition
class ProcessComplexFloat : public MultiVertex {
public:
  // User-defined types are NOT allowed, either as element
  // types in Vectors or as plain fields
  Vector<Input<ComplexFloat>> x;
  ComplexFloat z;

  // Supported types are allowed
  Vector<Input<int>> y;

  // Standard containers are NOT allowed, including those
  // whose size is statically defined
  unsigned aArray[10];
  std::vector<bool> bVector;
  std::array<unsigned, 10> cArray;

  // A single element of a supported type is allowed
  unsigned singleValue;

  // A non-static pointer to a supported type is NOT allowed
  short *ptrToShort;

  // A static pointer to a supported type is allowed, but dynamic
  // memory allocation via malloc is NOT allowed, making it
  // ineffectual.
  static int *staticPtrToInt;

  // An Input/Output/InOut edge with a user-defined type is NOT allowed
  Input<ComplexFloat> in;

  // An Input/Output/InOut edge with a supported type is allowed
  Output<float> out;

  bool compute(unsigned workerId) {
    return true;
  }
};
```

3.2.3. Specifying memory constraints

The `poplar::constraint` attribute can be applied to vertices to restrict where vertex state is placed in memory. This takes one or more string parameters.

See the IPU Programmer’s Guide for a description of the memory architecture in the IPU.

The parameters available are described in the table below, where `src` and `dst` are names of vectors in your vertex state.

Table 3.1 Poplar constraints

`"elem(*src)!=elem(*dst)"`

The vertex field `src` will be placed in a different memory element to the vertex field `dst`. This means you can do a load from the `src` pointer and a store to the `dst` pointer in the same cycle without causing a memory clash. If you find that this constraint doesn’t give much performance benefit, it should be removed, as it can be costly in terms of total memory use.

`"region(*src)!=region(*dst)"`

The two fields will be placed in different memory regions. This implies that one of them will be placed in interleaved memory, although it doesn’t matter which one. As above, this means you can use load-store instructions to do a simultaneous load from `src` and store to `dst`.

3.2.4. Stack allocation

When C++ functions are compiled, the compiler is usually able to determine the stack required. This is not possible if you use recursion, function calls via pointers or variable-length arrays (array variables that have a size that is not a compile time constant).

If you must use these techniques, then you must explicitly specify the stack used by your functions. Macros are provided for this purpose. These are defined in the Poplar header file `StackSizeDefs.hpp`. See the runtime API section of the Poplar API Reference for more information.

• `DEF_STACK_USAGE size function`

This defines the total stack usage (in bytes) for the function specified and any functions that it calls. This means that Poplar will not traverse the call graph of the function to determine the total stack usage of the function.

If you use recursion, this macro must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.

• `DEF_FUNC_CALL_PTRS`

This defines a list of other functions that may be called via pointers. Note that this creates a maintainability problem as the macro use must be updated every time the code changes its use of function pointers.

3.3. MultiVertex

Vertices with the `Vertex` base class run in a single worker thread and can access all the information they need from their vertex state alone. The `compute()` method of vertices with the `MultiVertex` base class is run multiple times, in different worker threads. The Poplar compiler generates code to run the compute method of a multi-vertex multiple times, passing it a single argument: the thread ID of the running worker.

You can obtain the total number of invocations of the multi-vertex compute method for a given vertex in vertex code using the `MultiVertex::numWorkers()` method and in host code using the `poplar::Target::getNumWorkerContexts()` method.

The worker thread IDs given to the multi-vertex compute method will always be in the range `[0, MultiVertex::numWorkers())` and the same ID will never be given twice in the same compute set for the same multi-vertex.

The worker thread ID allows a multi-vertex to have precise control over the split of work to be performed in parallel for a single vertex. For example, you could split the work of adding two vectors of numbers together using a multi-vertex like so:

```
class AddTwoVectors : public MultiVertex {
public:
  Input<Vector<unsigned>> a;
  Input<Vector<unsigned>> b;
  Output<Vector<unsigned>> c;

  void compute(unsigned workerId) {
    const auto numElements = a.size();
    for (std::size_t i = workerId;
         i < numElements;
         i += MultiVertex::numWorkers()) {
      c[i] = a[i] + b[i];
    }
  }
};
```

An ordinary vertex provides compile-time thread-safety checking because the regions of memory read and written by each vertex in the compute set are defined. In a multi-vertex, you control which regions of memory each worker thread reads and writes, so no compile-time thread-safety checking is available and you must take care to avoid any such issues yourself.

You may safely read from the same region of memory from multiple threads.

You may safely write to the same region of memory from multiple threads but the order of those writes is undefined.

However, you must also consider the atomic write size of the target in use. The atomic write size in bytes is available in host code as `poplar::Target::getAtomicStoreGranularity()`. This value gives the smallest alignment and size in bytes that can be written to memory atomically. When the number of bytes to write or the address is not a multiple of the atomic write size, multiple instructions are required to perform the write: the existing memory contents must be read, partially modified with the new data, and then re-written to memory. This means the write is not atomic, so two threads writing to the same atom could overwrite each other’s data.

To demonstrate this, let’s write a simple vertex exhibiting undefined behaviour due to non-atomic writes. Let us assume that `poplar::Target::getAtomicStoreGranularity()` returns 4 and `poplar::Target::getTypeSize(poplar::Type)` returns 2 for `poplar::UNSIGNED_SHORT`:

```
class WriteOne : public MultiVertex {
public:
  InOut<Vector<unsigned short>> data;

  void compute(unsigned workerId) {
    for (std::size_t i = workerId;
         i < data.size();
         i += MultiVertex::numWorkers()) {
      data[i] = 1;
    }
  }
};
```

The undefined behaviour in this vertex arises from the combination of two facts:

1. Each worker thread writes to a destination of size 2 bytes (the size of an unsigned short). This is less than the atomic store granularity, so the value is written non-atomically by reading 4 bytes, modifying 2 bytes, and writing 4 bytes back to memory.

2. Another worker writes non-atomically to the other 2 bytes within the same 4-byte region.

Assume our target has just 2 worker threads and the field `data` is connected to a tensor with 2 elements, both of which have value 0 in tile memory before this vertex runs. An illustration of what each thread _may_ do when executing the `WriteOne` vertex follows:

Table 3.2 Possible execution with 2 worker threads producing an incorrect result.

| Thread 0 | Thread 1 |
| --- | --- |
| Read 4 bytes from tile memory at address `&data[0]` into registers; registers now hold unsigned short values {0, 0} | Read 4 bytes from tile memory at address `&data[0]` into registers; registers now hold unsigned short values {0, 0} |
| Modify the value read for `data[0]`; registers now hold {1, 0} | Modify the value read for `data[1]`; registers now hold {0, 1} |
| Write 4 bytes to tile memory at address `&data[0]` from registers holding {1, 0} | Write 4 bytes to tile memory at address `&data[0]` from registers holding {0, 1} |

Depending on the order that thread 0 and 1 perform their writes, the values {1, 0} or {0, 1} are left in tile memory after this vertex runs and neither result is the intended one.

In this example we can fix the undefined behaviour by aligning the address of the data to at least the atomic store granularity and by having each worker thread write a number of consecutive elements with total size that is a multiple of the atomic store granularity:

```
class WriteOne : public MultiVertex {
public:
  // The input data is aligned to at least 4 bytes.
  InOut<Vector<unsigned short, 4>> data;

  void compute(unsigned workerId) {
    for (std::size_t i = workerId * 2;
         i < data.size();
         i += MultiVertex::numWorkers() * 2) {
      // Each worker is now guaranteed to always write 4 consecutive bytes at
      // a 4 byte aligned address. Non-atomic writes no longer matter because
      // no other thread will write non-atomically to the same 4 byte region.
      data[i] = 1;
      data[i + 1] = 1;
    }
  }
};
```

3.4. Calling conventions

Vertices use a dedicated calling convention (see Section 14, Application binary interface (ABI)). However, the `compute()` method itself does not use this convention, so when Poplar compiles a vertex it creates a new function that does use the calling convention, which then calls the `compute()` method and propagates the return value.

The name of this wrapper function is `__runCodelet_XXXX` where `XXXX` is the mangled name of the class that contains the compute method (see Section 3.5, Vertex name mangling). The wrapper for a `Vertex::compute()` method looks like this:

```
int __runCodelet_MyVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyVertex*>(vertexPtr);
  return v->compute();
}
```

The wrapper for a `MultiVertex::compute(unsigned)` method looks like this:

```
int __runCodelet_MyMultiVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyMultiVertex*>(vertexPtr);
  auto w = __builtin_ipu_get(CSR_W_WSR__INDEX) & CSR_W_WSR__CTXTID_M1__MASK;
  return v->compute(w);
}
```

Vertices with base class `Vertex` have no parameters. Vertices with base class `MultiVertex` have a single parameter to the `MultiVertex::compute(unsigned)` method which is the thread ID of the worker running the method.

3.4.1. External codelets

When you write an assembly, or external, implementation of a vertex you need to inform Poplar that you are providing the `__runCodelet_XXXX` function so it does not generate the wrapper itself. You do this by adding a `static const bool isExternalCodelet` field to the `Vertex` or `MultiVertex` class. When this exists and is set to `true`, Poplar will assume that the `__runCodelet_XXXX` function is defined, and will call that, ignoring the `compute()` method.

You can use this to define, at compile time, whether you have provided assembly code definitions for some or all possible template instantiations of a vertex. For example, consider a vertex like this:

```
template <typename FPType>
class Foo : public Vertex {
public:
  static const bool isExternalCodelet = std::is_same<FPType, float>::value;
  bool compute() { return true; }
};

template class Foo<half>;
template class Foo<float>;
```

This states that Poplar should only use the compiled `compute` method for the `Foo<half>` vertex and that we will provide an assembly implementation of the `Foo<float>` vertex.

3.4.2. Recursion and function pointers

You should avoid the use of recursion and function calls via pointers. Using these will prevent the Poplar runtime from correctly computing the stack usage.

If you must use recursion, the `DEF_STACK_USAGE` macro (see Section 10.3.1, Specifying stack size) must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.

If you need to call via function pointers, you can use the `DEF_FUNC_CALL_PTRS` macro to specify a list of other functions that may be called via pointers. Note that this creates a maintainability problem, as the macro use must be updated every time the code changes its use of function pointers. See the API documentation for details.

3.5. Vertex name mangling

The Poplar name mangling is designed to be easy to write. All mangled vertices begin with `__runCodelet_` followed by the full class name, including namespace, with the following changes made:

• Replace `__` (two underscores) with `_Z`

• Replace `::` with `__` (two underscores)

• Replace `<` with `___` (three underscores)

• Replace `,` in the template argument list with `_` (one underscore)

• Discard `>`

Some examples are shown in the table below.

Table 3.3 Name mangling

| Before | After |
| --- | --- |
| `popnn::NonLinearity2D<float, 2>` | `__runCodelet_popnn__NonLinearity2D___float_2` |
| `popops::UnaryOp<popops::expr::ABSOLUTE, half>` | `__runCodelet_popops__UnaryOp___popops__expr__ABSOLUTE_half` |
| `popconv::ConvPartial1x1<float, half, true>` | `__runCodelet_popconv__ConvPartial1x1___float_half_true` |