3. Understanding vertices
This chapter describes the Vertex class, including how vertices are run, how parameters are passed and how data types are stored.
You can also follow a practical walkthrough of writing vertex code in the Poplar Vertices tutorial from the Graphcore tutorials.
3.1. The Vertex class
Vertices in Poplar are subclasses of the Vertex or MultiVertex base classes. Each has a compute() method that runs on the tile and returns either void or a bool value. The Vertex::compute() method runs in a single worker thread. The MultiVertex::compute(unsigned) method runs in multiple worker threads.
For example:
#include <poplar/Vertex.hpp>
using namespace poplar;
class AdderVertex : public Vertex {
public:
  Input<float> x;
  Input<float> y;
  Output<float> sum;
  void compute() {
    *sum = x + y;
  }
};
The Input and Output fields are edges that connect the vertex to the tensor data that it reads and writes. An Input field should not be written and an Output field should not be read; the results are undefined. If you need a field that is read and written, then it should be defined as InOut.
These fields have begin, end, operator[] and operator* methods so they can be iterated over and accessed like other C++ containers. For Input fields all of these methods are const.
The Output field can be successfully updated even if the corresponding tensor is on another tile. This is because the data is not transferred to the destination tile until the compute is complete. However, reading an Output field is not guaranteed to return the expected value. If you need to both write to and read from a field, then it should be declared as an InOut type.
In order to run the vertex’s compute() method, you need to add the vertex to a compute set. When compiled, the compute set is reduced to a single function that calls the compute() functions of all the vertices it contains.
3.2. Vertex state
Any vertex state required is provided as member fields inside the vertex class.
Note
Field names cannot begin with an underscore (_). Names beginning with an underscore are reserved for use by Poplar.
3.2.1. Vector and VectorList types
There are two vector types used: Vector and VectorList, representing one and two dimensional blocks of data respectively. VectorList is a “jagged” 2D list, in other words the sub-lists need not be the same length.
Each of these vector types can be represented in memory in different ways. See Section 5, Vertex vector types for details.
3.2.2. Allowed field types as vertex state
Type support for fields is limited to the types in Types.hpp.
The following types are not allowed as vertex fields:
User-defined types, either as element types in a Vector or as the type of a field
Input, Output and InOut edges with user-defined types
Standard containers, including those whose size is statically defined
A non-static pointer to a supported type
A static pointer to a supported type is allowed, but dynamic memory allocation with malloc() is not allowed, making it ineffectual
The code example below shows which field types are allowed and not allowed in vertices. The types that are not allowed are marked NOT allowed in the comments.
// User defined type
struct ComplexFloat {
  float a;
  float b;
};
// A vertex definition
class ProcessComplexFloat : public MultiVertex {
public:
  // User defined types are NOT allowed either as
  // element types in Vectors or as field types.
  Vector<Input<ComplexFloat>> x;
  ComplexFloat z;
  // Supported types are allowed
  Vector<Input<int>> y;
  // Standard containers are NOT allowed including those
  // whose size is statically defined
  unsigned aArray[10];
  std::vector<bool> bVector;
  std::array<unsigned, 10> cArray;
  // A single element of a supported type is allowed
  unsigned singleValue;
  // A non-static pointer to a supported type is NOT allowed
  short *ptrToShort;
  // A static pointer to supported type is allowed but dynamic
  // memory allocation via malloc is NOT allowed, making it
  // ineffectual.
  static int *staticPtrToInt;
  // An Input/Output/InOut edge with user defined type is NOT allowed
  Input<ComplexFloat> in;
  // An Input/Output/InOut edge with a supported type is allowed
  Output<float> out;
  // A MultiVertex compute method takes the worker thread ID.
  bool compute(unsigned workerId) {
    return true;
  }
};
See Section 4, Supported types for more information about the types supported on the IPU.
3.2.3. Specifying memory constraints
The poplar::constraint attribute can be applied to vertices to restrict where vertex state is placed in memory. This takes one or more string parameters.
See the IPU Programmer’s Guide for a description of the memory architecture in the IPU.
The parameters available are described in the table below, where src and dst are names of vectors in your vertex state.
Type |
Description |
---|---|
|
This constraint means that the vertex field |
|
This constraint means that two fields will be placed in different
regions. This implies that one of them will be placed in
interleaved memory, although it doesn’t matter which one.
As above, this means you can use load-store instructions to do
a simultaneous load from |
3.2.4. Stack allocation
When C++ functions are compiled, the compiler is usually able to determine the stack required. This is not possible if you use recursion, function calls via pointers or variable-length arrays (array variables that have a size that is not a compile time constant).
If you must use these techniques, then you must explicitly specify the stack used by your functions. Macros are provided for this purpose. These are defined in the Poplar header file StackSizeDefs.hpp. See the runtime API section of the Poplar API Reference for more information.
DEF_STACK_USAGE size function
This defines the total stack usage (in bytes) for the function specified and any functions that it calls. This means that Poplar will not traverse the call graph of the function to determine the total stack usage of the function.
If you use recursion, this macro must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.
DEF_FUNC_CALL_PTRS
This defines a list of other functions that may be called via pointers. Note that this creates a maintainability problem as the macro use must be updated every time the code changes its use of function pointers.
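The total that DEF_STACK_USAGE expects for a recursive function can be estimated by hand. The following is a rough sketch of that arithmetic in plain C++, not part of Poplar. The helper name recursiveStackBound and all three inputs are hypothetical, and it assumes every recursive call uses the same frame size.

```cpp
#include <cstddef>

// Rough upper bound, in bytes, on the total stack usage of a recursive
// function. This helper is hypothetical and is not provided by Poplar.
//   bytesPerFrame:      assumed stack used by one call of the function
//   maxDepth:           maximum recursion depth you can guarantee
//   deepestCalleeBytes: assumed largest stack usage of any other function
//                       that the recursive function can call
constexpr std::size_t recursiveStackBound(std::size_t bytesPerFrame,
                                          std::size_t maxDepth,
                                          std::size_t deepestCalleeBytes) {
  return bytesPerFrame * maxDepth + deepestCalleeBytes;
}
```

For instance, 16 bytes per frame, a maximum depth of 10 and 32 bytes for the deepest callee gives a bound of 192 bytes. The real frame size depends on the compiled code, so treat the result as a starting point to be checked against the generated assembly.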
3.3. MultiVertex worker threads
Vertices with the Vertex base class run in a single worker thread and can access all the information they need to run from their vertex state alone. The compute() method of vertices with the MultiVertex base class is run multiple times in different worker threads. The Poplar compiler generates code to run the compute method of a multi-vertex multiple times and passes a single argument to the multi-vertex compute method, which is the thread ID of the running worker.
You can obtain the total number of invocations of the multi-vertex compute method for a given vertex in vertex code using the MultiVertex::numWorkers() method and in host code using the poplar::Target::getNumWorkerContexts() method.
The worker thread IDs given to the multi-vertex compute method will always be in the range [0, MultiVertex::numWorkers()) and the same ID will never be given twice in the same compute set for the same multi-vertex.
The worker thread ID allows a multi-vertex to have precise control over the split of work to be performed in parallel for a single vertex. For example, you could split the work of adding two vectors of numbers together using a multi-vertex like so:
class AddTwoVectors : public MultiVertex {
public:
  Input<Vector<unsigned>> a;
  Input<Vector<unsigned>> b;
  Output<Vector<unsigned>> c;
  void compute(unsigned workerId) {
    const auto numElements = a.size();
    for (std::size_t i = workerId;
         i < numElements;
         i += MultiVertex::numWorkers()) {
      c[i] = a[i] + b[i];
    }
  }
};
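To see why this strided loop gives each element to exactly one worker, the split can be modelled on the host in plain C++ without Poplar. This is only an illustrative sketch: the function names indicesForWorker and coverage are hypothetical, numWorkers stands in for MultiVertex::numWorkers(), and numWorkers is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the loop in AddTwoVectors: returns the element indices
// that the worker with ID `workerId` would process.
// Assumes numWorkers > 0 (a zero stride would never terminate).
std::vector<std::size_t> indicesForWorker(unsigned workerId,
                                          unsigned numWorkers,
                                          std::size_t numElements) {
  std::vector<std::size_t> indices;
  for (std::size_t i = workerId; i < numElements; i += numWorkers) {
    indices.push_back(i);
  }
  return indices;
}

// Counts how many workers process each element. With the strided split,
// every count is expected to be exactly 1.
std::vector<unsigned> coverage(unsigned numWorkers, std::size_t numElements) {
  std::vector<unsigned> hits(numElements, 0);
  for (unsigned w = 0; w < numWorkers; ++w) {
    for (std::size_t i : indicesForWorker(w, numWorkers, numElements)) {
      ++hits[i];
    }
  }
  return hits;
}
```

With 6 workers and 14 elements (example values only), worker 1 processes elements 1, 7 and 13, and every element is processed by exactly one worker.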
3.3.1. Thread safety
An ordinary vertex provides compile-time thread safety checking because the regions of memory that are read and written by each worker in the compute set are defined. In a multi-vertex, you control which regions of memory each worker thread reads and writes. Consequently, no compile-time thread safety checking is available and you must take care to avoid any such issues yourself.
You may safely read from the same region of memory from multiple threads.
You may safely write to the same region of memory from multiple threads but the order of those writes is undefined.
However, you must also consider the atomic write size of the target in use. The atomic write size in bytes is available in host code as poplar::Target::getAtomicStoreGranularity(). This value gives the smallest alignment and size in bytes that can be written to memory atomically.
When writing data where the number of bytes or the address is not a multiple of the atomic write size, multiple instructions are required to perform the write. This is because the existing memory contents need to be read, partially modified with the new data, and then re-written to memory. This means that the write is not atomic. Consequently, two threads writing to the same atom could each overwrite the other’s data.
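The condition described above can be written as a small check. This is a sketch in plain C++ rather than a Poplar API: the function name isWholeAtomWrite is hypothetical, and the granularity argument stands in for the value reported by poplar::Target::getAtomicStoreGranularity().

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when a write of `sizeBytes` starting at `address` consists of
// whole atoms only, that is, both the address and the size are multiples of
// the atomic store granularity. Otherwise the write needs a
// read-modify-write sequence and is not atomic.
bool isWholeAtomWrite(std::uintptr_t address, std::size_t sizeBytes,
                      std::size_t granularity) {
  return granularity != 0 && address % granularity == 0 &&
         sizeBytes % granularity == 0;
}
```

With a granularity of 4, a 4-byte write at address 0 passes the check, while a 2-byte write at address 0 or a 4-byte write at address 2 does not.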
To demonstrate this, let’s write a simple vertex exhibiting undefined behaviour due to non-atomic writes. Let us assume that poplar::Target::getAtomicStoreGranularity() returns 4 and poplar::Target::getTypeSize(poplar::Type) returns 2 for poplar::UNSIGNED_SHORT:
class WriteOne : public MultiVertex {
public:
  InOut<Vector<unsigned short>> data;
  void compute(unsigned workerId) {
    for (std::size_t i = workerId;
         i < data.size();
         i += MultiVertex::numWorkers()) {
      data[i] = 1;
    }
  }
};
The undefined behaviour in this vertex arises from the combination of two facts:
Each worker thread writes to a destination that has a size of 2 bytes (the size of an unsigned short). This is less than the atomic store granularity, meaning the value will be written non-atomically by reading 4 bytes, modifying 2 bytes, and writing 4 bytes back to memory.
Another worker writes non-atomically to the other 2 bytes within the same 4-byte region.
Assume our target has just 2 worker threads and the field data is connected to a tensor with 2 elements and both elements have value 0 in tile memory before this vertex runs. An illustration of what each thread may do when executing the WriteOne vertex follows:
Thread 0 | Thread 1 |
---|---|
Read 4 bytes from tile memory at address &data[0] into thread 0 registers, registers now have unsigned short values {0, 0} | Read 4 bytes from tile memory at address &data[0] into thread 1 registers, registers now have unsigned short values {0, 0} |
Modify value read for data[0], registers now have unsigned short values {1, 0} | Modify value read for data[1], registers now have unsigned short values {0, 1} |
Write 4 bytes to tile memory at address &data[0] from thread 0 registers which have unsigned short values {1, 0} | Write 4 bytes to tile memory at address &data[0] from thread 1 registers which have unsigned short values {0, 1} |
Depending on the order in which threads 0 and 1 perform their writes, the values {1, 0} or {0, 1} are left in tile memory after this vertex runs. Neither is the intended result, {1, 1}.
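The interleaving in the table can be replayed deterministically on the host. The following plain C++ sketch does not use real threads or Poplar; it only models the read, modify and write-back steps described above. The names Atom, modify and replayWriteOne are hypothetical.

```cpp
#include <array>
#include <cstdint>

// One 4-byte atom holding two unsigned short elements, as in the example.
using Atom = std::array<std::uint16_t, 2>;

// Models the "modify" step of one worker: it changes one element in the
// worker's private copy of the atom (its registers).
Atom modify(Atom registers, unsigned index, std::uint16_t value) {
  registers[index] = value;
  return registers;
}

// Replays the interleaving from the table. `thread0WritesLast` selects which
// thread's write-back lands second and therefore determines the final value.
Atom replayWriteOne(bool thread0WritesLast) {
  Atom memory = {0, 0};

  // Both threads read the whole atom before either writes it back.
  Atom regs0 = memory;
  Atom regs1 = memory;

  // Each thread modifies only its own element in its private copy.
  regs0 = modify(regs0, 0, 1); // thread 0 registers: {1, 0}
  regs1 = modify(regs1, 1, 1); // thread 1 registers: {0, 1}

  // Both threads write all 4 bytes back; the later write overwrites the
  // earlier one.
  if (thread0WritesLast) {
    memory = regs1;
    memory = regs0;
  } else {
    memory = regs0;
    memory = regs1;
  }
  return memory;
}
```

In this model, whichever thread writes last wins, so the result is {1, 0} or {0, 1} and never {1, 1}.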
In this example we can fix the undefined behaviour by aligning the address of the data to at least the atomic store granularity and by having each worker thread write a number of consecutive elements with total size that is a multiple of the atomic store granularity:
class WriteOne : public MultiVertex {
public:
  // The input data is aligned to at least 4 bytes.
  InOut<Vector<unsigned short, 4>> data;
  void compute(unsigned workerId) {
    // Note: writing data[i + 1] assumes that data.size() is even.
    for (std::size_t i = workerId * 2;
         i < data.size();
         i += MultiVertex::numWorkers() * 2) {
      // Each worker is now guaranteed to always write 4 consecutive bytes at
      // a 4 byte aligned address. Non-atomic writes no longer matter because
      // no other thread will write non-atomically to the same 4 byte region.
      data[i] = 1;
      data[i + 1] = 1;
    }
  }
};
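One way to convince yourself that this split keeps workers in separate atoms is to model the index arithmetic on the host. The sketch below is plain C++ under stated assumptions: the function name atomsAreDisjoint is hypothetical, an atom is taken to hold two unsigned short elements, the data is assumed to start at a 4-byte aligned address, and numWorkers and elementsPerStep are assumed to be greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Models a WriteOne-style loop in which each worker writes `elementsPerStep`
// consecutive elements per iteration (1 for the original vertex, 2 for the
// fixed one). Returns false if two different workers ever write to the same
// 4-byte atom, where one atom holds two unsigned short elements.
bool atomsAreDisjoint(unsigned numWorkers, std::size_t numElements,
                      std::size_t elementsPerStep) {
  const std::size_t elementsPerAtom = 2;
  const std::size_t numAtoms =
      (numElements + elementsPerAtom - 1) / elementsPerAtom;
  const int unowned = -1;
  std::vector<int> owner(numAtoms, unowned);

  for (unsigned w = 0; w < numWorkers; ++w) {
    for (std::size_t i = w * elementsPerStep; i < numElements;
         i += numWorkers * elementsPerStep) {
      // Visit every element this worker writes in this iteration.
      for (std::size_t e = i; e < i + elementsPerStep && e < numElements;
           ++e) {
        const std::size_t atom = e / elementsPerAtom;
        if (owner[atom] != unowned && owner[atom] != static_cast<int>(w)) {
          return false; // Two workers share this atom.
        }
        owner[atom] = static_cast<int>(w);
      }
    }
  }
  return true;
}
```

For 2 workers and 2 elements, a step of 1 element (the original vertex) puts both workers in the same atom, while a step of 2 elements (the fixed vertex) does not.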
3.4. Calling conventions
There is a vertex calling convention (see Section 14, Application binary interface (ABI)) that is used by vertices. However, the compute() method itself does not use this calling convention. Because of this, when Poplar compiles a vertex it will create a new function that does use the calling convention, which then calls the compute() method and propagates the return value.
The name of this wrapper function is __runCodelet_XXXX where XXXX is the mangled name of the class that contains the compute method (see Section 3.5, Vertex name mangling). The wrapper for a Vertex::compute() method looks like this:
int __runCodelet_MyVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyVertex*>(vertexPtr);
  return v->compute();
}
The wrapper for a MultiVertex::compute(unsigned) method looks like this:
int __runCodelet_MyMultiVertex() {
  void *vertexPtr = __builtin_colossus_get_vertex_base();
  auto v = static_cast<MyMultiVertex*>(vertexPtr);
  // Read the ID of the worker thread running this function.
  auto w = __builtin_ipu_get(CSR_W_WSR__INDEX) & CSR_W_WSR__CTXTID_M1__MASK;
  return v->compute(w);
}
The compute() method of vertices with base class Vertex takes no parameters. The MultiVertex::compute(unsigned) method of vertices with base class MultiVertex takes a single parameter, which is the thread ID of the worker running the method.
3.4.1. External codelets
When you write an assembly, or external, implementation of a vertex you need to inform Poplar that you are providing the __runCodelet_XXXX function so it does not generate the wrapper itself. You do this by adding a static bool isExternalCodelet member to your class derived from Vertex or MultiVertex. When this exists and is set to true, Poplar will assume that the __runCodelet_XXXX function is defined, and will call that, ignoring the compute() method.
You can use this to define, at compile time, whether you have provided assembly code definitions for some or all possible template instantiations of a vertex. For example, consider a vertex like this:
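#include <type_traits>
template <typename FPType>
class Foo : public Vertex {
public:
  // True only for the float instantiation, which has an assembly implementation.
  static const bool isExternalCodelet = std::is_same<FPType, float>::value;
  bool compute() { return true; }
};
template class Foo<half>;
template class Foo<float>;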
This states that Poplar should only use the compiled compute method for the Foo<half> vertex and that we will provide an assembly implementation of the Foo<float> vertex.
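The compile-time selection can be checked on a host compiler without Poplar. In the sketch below, FooModel and HalfStandIn are hypothetical stand-ins for the vertex class and the IPU half type; FooModel does not derive from poplar::Vertex and only reproduces the isExternalCodelet expression.

```cpp
#include <type_traits>

// Stand-in for the IPU half type so that this sketch compiles on a host.
struct HalfStandIn {};

// Stand-in for a templated vertex class. Only the flag is modelled.
template <typename FPType>
struct FooModel {
  static constexpr bool isExternalCodelet =
      std::is_same<FPType, float>::value;
};

// The float instantiation is marked as having an external implementation.
static_assert(FooModel<float>::isExternalCodelet,
              "float instantiation should be external");
// Any other instantiation would fall back to the compiled compute() method.
static_assert(!FooModel<HalfStandIn>::isExternalCodelet,
              "non-float instantiation should not be external");
```

Because the flag is a constant expression, the choice between the assembly implementation and the compiled compute() method is fixed per template instantiation at compile time.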
3.4.2. Recursion and function pointers
You should avoid the use of recursion and function calls via pointers. Using these will prevent the Poplar runtime from correctly computing the stack usage.
If you must use recursion, the DEF_STACK_USAGE macro (see Section 10.3.1, Specifying stack size) must be used to specify the total stack usage of the recursive function itself, taking into account the maximum depth of the recursion and any other functions that can be called.
If you need to call via function pointers, you can use the DEF_FUNC_CALL_PTRS macro to specify a list of other functions that may be called via pointers. Note that this creates a maintainability problem as the macro use must be updated every time the code changes its use of function pointers. See the API documentation for details.
3.5. Vertex name mangling
The Poplar name mangling is designed to be easy to write.
All mangled vertices begin with __runCodelet_ followed by the full class name, including namespace, with the following changes made:
Replace __ (two underscores) with _Z
Replace :: with __ (two underscores)
Replace < with ___ (three underscores)
Replace , in the template argument list with _ (one underscore)
Discard >
Some examples are shown in the table below.
Before |
After |
---|---|
|
|
|
|
|
|
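The rules listed above can be sketched as a small host-side function. This is an illustration in plain C++, not the Poplar implementation: the function name mangleVertexName is hypothetical, the rules are applied in a single left-to-right pass over the input (an assumption about ordering), and the class name is assumed to contain no whitespace.

```cpp
#include <cstddef>
#include <string>

// Applies the mangling rules listed above to a fully qualified class name
// and prepends the __runCodelet_ prefix.
std::string mangleVertexName(const std::string &className) {
  std::string out = "__runCodelet_";
  for (std::size_t i = 0; i < className.size(); ++i) {
    const char c = className[i];
    const bool hasNext = i + 1 < className.size();
    if (c == '_' && hasNext && className[i + 1] == '_') {
      out += "_Z"; // "__" becomes "_Z"
      ++i;         // skip the second underscore
    } else if (c == ':' && hasNext && className[i + 1] == ':') {
      out += "__"; // "::" becomes two underscores
      ++i;         // skip the second colon
    } else if (c == '<') {
      out += "___"; // "<" becomes three underscores
    } else if (c == ',') {
      out += "_"; // "," becomes one underscore
    } else if (c == '>') {
      // ">" is discarded
    } else {
      out += c;
    }
  }
  return out;
}
```

Under these assumptions, a hypothetical class name ns::Foo<float,int> maps to __runCodelet_ns__Foo___float_int.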