11. Writing efficient C++
The Poplar graph programming framework uses the C/C++ compiler from the LLVM open-source infrastructure for building codelets. By using LLVM, Poplar is able to provide a fast compiler that generates high-performance code and supports all the latest language features required for easier development.
This chapter describes some of the optimisations, features and techniques you can use to allow the compiler to generate more efficient code. In order to understand the optimisations in this chapter, you should be familiar with Section 10, Writing vertices in assembly.
11.1. Inspecting the generated code
You can use the Poplar codelet compiler popc to precompile the C++ codelets and use the -S option to emit the generated assembly:
$ popc --S --target ipu1 example.c -o example.s
In the rest of this chapter, the examples will be written as minimal C/C++ functions for clarity. The assembly shown is the generated output with all assembler directives removed.
11.2. Optimisation levels
The compiler provides a number of optimisation levels that can be specified with the option -O<level>.
The optimisation level -O2 generally provides the best overall code generation in terms of code size and performance. -O3 performs more aggressive optimisations and delivers better runtime performance, but it may also generate larger code.
By default, Poplar uses -O3, but if code size is an issue, it might be desirable to use the -Os option. This performs optimisations to minimize code size but potentially at a cost to runtime performance.
More information on optimisation levels is available from the popc command line help option (-h). You can also refer to the Clang documentation for Code Generation Options.
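For example, to compare the effect of two optimisation levels on a codelet, you could compile it at each level and inspect the generated assembly (the file names here are just placeholders):
$ popc --S --target ipu1 -O2 example.cpp -o example_O2.s
$ popc --S --target ipu1 -Os example.cpp -o example_Os.s
Comparing the two outputs is a quick way to check whether -Os saves enough code size to justify any loss of runtime performance for a particular codelet.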
11.3. IPU hardware loops
There are two hardware loops defined in the IPU architecture:
rpt: The rpt instruction allows a scoped loop to be defined. This allows the branch back to the start of the loop body to be omitted.
brnzdec: The brnzdec instruction is shorthand for ‘branch if non-zero, decrement’. This means a value can be evaluated to be zero or not; if it is not zero, it will be decremented, which is roughly the sequence of instructions required for a (decrementing) loop.
These hardware loops are automatically generated where possible by the compiler. However, each comes with constraints that may inhibit generation of the hardware loop. This section describes the constraints and some of the basics of the IPU hardware loop generation in LLVM.
11.3.1. Prioritisation between hardware loops
Generally, within the IPU LLVM backend the rpt loop is preferred over the brnzdec loop. If, due to the rpt constraints, LLVM is unable to generate the rpt loop, it will attempt to generate a brnzdec loop. If that also fails, a regular loop consisting of a subtraction operation (sub) and a branch-if-non-zero (brnz) operation is generated.
In pseudocode, this would be:
if (rpt_constraints_are_met()) {
generate_rpt()
} else if (brnzdec_constraints_are_met()) {
generate_brnzdec()
} else {
generate_regular_loop()
}
Some additional internal checks are in place which may result in a suboptimal hardware loop. However, most often a brnzdec loop is generated instead of an rpt loop because of unmet hardware loop constraints. The generation of a regular loop (for example, a brnz loop) is often due to unmet internal constraints that are not easily controlled at the C/C++ level.
11.3.2. Hardware loop constraints
This section describes the constraints for the generation of hardware loops.
11.3.3. rpt
The rpt loops have the following constraints:
A limit on the number of iterations. This number depends on the architecture type. As a result, the compiler will try to infer the induction variable, but will conservatively consider the constraint not met if the induction variable either cannot be analysed or is known to be greater than the limit.
The scoped rpt loop body may only contain bundles. As a result, the loop body may only contain instructions that can be issued together. If the scheduler cannot put two instructions into a bundle, a nop or an fnop will be generated to complete the bundle, which will impact code size.
There must not be any system or control instructions in the loop body (for example, put, br, call and brnz). As a result, no C/C++ function calls can be made within the loop body, and the rpt loop must be the innermost loop of a loop nest, or be a standalone loop with no parent loop.
Cost analysis: the rpt loop body contains either 4 or fewer instruction bundles, or less than half of the instruction bundles contain nop or fnop instructions. You can override the instruction bundle count constraint, but without user intervention the hard limit on the loop body bundle count is 4, and beyond 4 bundles generation depends on the number of bundles containing (f)nops.
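As an illustration of the control-instruction constraint, consider a loop whose body makes a function call. The sketch below is a hypothetical example (the function names and file name are placeholders); because an rpt loop body may not contain a call, this loop cannot become an rpt loop, and based on the constraints above the compiler would fall back to a brnzdec or regular loop:
// Compiled with popc --S --target ipu1 -O2 example_call_in_loop.c -o out.s
int transform(int x); // defined elsewhere, so the call cannot be removed

void call_in_loop(int *out, const int *in, unsigned size) {
  for (unsigned i = 0; i < size; ++i) {
    // The call instruction in the loop body violates the rpt constraints,
    // so the compiler must fall back to a brnzdec or regular loop.
    out[i] = transform(in[i]);
  }
}
If the called function can be inlined or the call hoisted out of the loop, the remaining constraints above determine whether an rpt loop can then be generated.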
11.3.4. brnzdec
Since the brnzdec loops have few constraints, they are more often not generated because of unmet internal constraints on the shape of a loop’s basic blocks.
For more information, refer to the talk “Hardware loops in the IPU backend” from the 2022 EuroLLVM Developers’ Meeting.
11.4. Guiding the C/C++ compiler for better loop code generation
This section describes what you can do to guide the C/C++ compiler to improve the code generation.
The examples in this section show two cases: one without the optimisation and one with the optimisation.
11.4.1. Hardware loop generation
There are multiple ways to guide the compiler to emit an IPU hardware loop. Each IPU hardware loop must adhere to a set of constraints. One of these is the limited iteration count for the rpt hardware loops. Effectively, this hardware loop can only be emitted if, among other constraints, the iteration count is within a particular range. However, the iteration count is not always available at compile time, which results in a conservative approach to generating this IPU hardware loop.
As a developer, you may be aware of which C/C++ loops may, and which may not, generate this rpt hardware loop. This section describes a few options to hint the compiler into generating the rpt hardware loop. Note that although the iteration count constraint is a common reason why the rpt loop is not emitted, it is not the only one.
11.4.2. __builtin_assume
The __builtin_assume builtin function (see the Clang Language Extensions documentation for more information) is available in C and C++ and provides the compiler with boolean invariants that it may assume to be true. One of these assumptions could be the number of iterations a loop will run: this number may be unknown or unanalysable for the compiler, but an obvious fact for the developer. In such cases, __builtin_assume can be used to limit the range of the loop induction variable and so denote a limited range for the iteration count.
For example:
// Compiled with popc --S --target ipu1 -O2 example_assume.c -o out.s
void foo (int *out, int *in, unsigned size) {
for (int i = 0; i < size; ++i) {
*out += in[i];
}
}
void assume_foo (int *out, int *in, unsigned size) {
__builtin_assume(size < 4095);
for (int i = 0; i < size; ++i) {
*out += in[i];
}
}
will result in:
foo: # @foo
brz $m2, .LBB0_3
ld32 $m3, $m0, $m15, 0
add $m2, $m2, -1
ld32step $m4, $m15, $m1+=, 1
add $m3, $m3, $m4
st32 $m3, $m0, $m15, 0
brnzdec $m2, .LBB0_2
br $m10
assume_foo: # @assume_foo
ld32 $m3, $m0, $m15, 0
{
rpt $m2, 2
fnop
}
{
ld32step $m4, $m15, $m1+=, 1
fnop
}
{
add $m3, $m3, $m4
fnop
}
{
st32 $m3, $m0, $m15, 0
fnop
}
br $m10
11.4.3. rptsize_t
rptsize_t provides an unsigned integer type with the allowed range of the rpt hardware loop. This type’s size is guaranteed to be equal to the size of the rpt hardware loop iteration count limit. It can be used in the same way as any regular int in C++. Note that rptsize_t can, and should, only be treated as an unsigned integer type. Refer to the Tile Vertex Instruction Set Architecture document to get the repeat size (TREG_REPEAT_COUNT_WIDTH) for your specific IPU.
For example:
// Compiled with popc --S --target ipu1 -O3 example_rptsize_t.cpp -o out.s
#include <ipudef.h>
void bar (int *out, int *a, unsigned n) {
for (int i = 0; i < n; ++i) {
*out += a[i];
}
}
void rptsizebar1 (int *out, int *a, rptsize_t n) {
for (int i = 0; i < n; ++i) {
*out += a[i];
}
}
void rptsizebar2 (int *out, int *a, rptsize_t n) {
for (rptsize_t i = 0; i < n; ++i) {
*out += a[i];
}
}
will result in:
_Z3barPiS_j: # @_Z3barPiS_j
brz $m2, .LBB0_3
ld32 $m3, $m0, $m15, 0
add $m2, $m2, -1
ld32step $m4, $m15, $m1+=, 1
add $m3, $m3, $m4
st32 $m3, $m0, $m15, 0
brnzdec $m2, .LBB0_2
br $m10
_Z11rptsizebar1PiS_6ap_intILj12ELb0EE: # @_Z11rptsizebar1PiS_6ap_intILj12ELb0EE
ldz16 $m2, $m2, $m15, 0
ld32 $m3, $m0, $m15, 0
{
rpt $m2, 2
fnop
}
{
ld32step $m4, $m15, $m1+=, 1
fnop
}
{
add $m3, $m3, $m4
fnop
}
{
st32 $m3, $m0, $m15, 0
fnop
}
br $m10
_Z11rptsizebar2PiS_6ap_intILj12ELb0EE: # @_Z11rptsizebar2PiS_6ap_intILj12ELb0EE
ldz16 $m2, $m2, $m15, 0
ld32 $m3, $m0, $m15, 0
{
rpt $m2, 2
fnop
}
{
ld32step $m4, $m15, $m1+=, 1
fnop
}
{
add $m3, $m3, $m4
fnop
}
{
st32 $m3, $m0, $m15, 0
fnop
}
br $m10
11.4.4. Pragma vectorize
You may want to disable vectorization in a loop if it results in less optimal code. With the IPU, such cases may arise from incorrect assumptions by LLVM about what may and may not be vectorized. For example, LLVM does not make a distinction between float and int types during vectorization and will in some cases choose to vectorize a loop of integers to a 2-element vector despite 2-element vector operations only being available for float types. As a result both the code size and execution time will be worse.
You can guide vectorization using a number of pragmas (see the Clang Language Extensions documentation for more information). This section only covers how to disable vectorization.
The Clang pragma for disabling vectorization is:
#pragma clang loop vectorize(disable)
This can be inserted into C/C++ code immediately before the loop whose vectorization you want to disable. Targeting a single loop avoids disabling vectorization for the whole compilation unit.
For example:
// compile with popc --S --target ipu1 example_pragma_vectorize.c -o out_O2.s -O2
void false_pos_vectorize (unsigned long *a, unsigned long *or_arr) {
unsigned long ret;
int i, j;
for (i = 0; i < 32; ++i) {
for (j = 0; j < 32; ++j) {
int b;
ret = 0;
for (b = i; b < j; ++b) {
ret |= or_arr[b];
}
}
}
*a = ret;
}
void no_vec_false_pos_vectorize (unsigned long *a, unsigned long *or_arr) {
unsigned long ret;
int i, j;
for (i = 0; i < 32; ++i) {
for (j = 0; j < 32; ++j) {
int b;
ret = 0;
#pragma clang loop vectorize(disable)
for (b = i; b < j; ++b) {
ret |= or_arr[b];
}
}
}
*a = ret;
}
will result in:
false_pos_vectorize: # @false_pos_vectorize
add $m11, $m11, -48
st32 $m8, $m11, $m15, 4 # 4-byte Folded Spill
st32 $m9, $m11, $m15, 3 # 4-byte Folded Spill
st32 $m10, $m11, $m15, 2 # 4-byte Folded Spill
st32 $m7, $m11, $m15, 1 # 4-byte Folded Spill
st32 $m0, $m11, $m15, 5 # 4-byte Folded Spill
mov $m2, $m15
setzi $m0, 32
st32 $m0, $m11, $m15, 6 # 4-byte Folded Spill
st32 $m1, $m11, $m15, 10 # 4-byte Folded Spill
st32 $m1, $m11, $m15, 9 # 4-byte Folded Spill
add $m0, $m0, -1
.LBB0_1:
st32 $m0, $m11, $m15, 7 # 4-byte Folded Spill
mov $m7, $m15
ld32 $m9, $m11, $m15, 6 # 4-byte Folded Reload
sub $m0, 0, $m2
st32 $m0, $m11, $m15, 8 # 4-byte Folded Spill
add $m9, $m9, -1
.LBB0_2:
cmpult $m0, $m2, $m7
brz $m0, .LBB0_3
sub $m0, $m7, $m2
mov $m8, $m15
cmpult $m1, $m0, 2
brz $m1, .LBB0_6
mov $m10, $m2
bri .LBB0_9
.LBB0_3:
mov $m8, $m15
bri .LBB0_11
.LBB0_6:
mov $m5, $m15
ld32 $m1, $m11, $m15, 8 # 4-byte Folded Reload
add $m1, $m1, $m7
andc $m1, $m1, 1
add $m1, $m1, -2
shr $m1, $m1, 1
add $m4, $m1, 1
add $m4, $m4, -1
andc $m1, $m0, 1
st32 $m1, $m11, $m15, 11 # 4-byte Folded Spill
add $m10, $m2, $m1
ld32 $m1, $m11, $m15, 9 # 4-byte Folded Reload
.LBB0_7:
ld32 $m3, $m1, $m15, 1
ld32step $m6, $m15, $m1+=, 2
or $m5, $m3, $m5
or $m8, $m6, $m8
brnzdec $m4, .LBB0_7
ld32 $m1, $m11, $m15, 11 # 4-byte Folded Reload
cmpeq $m0, $m0, $m1
mov $m4, $m5
or $m1, $m5, $m0
or $m8, $m8, $m4
brnz $m0, .LBB0_11
.LBB0_9:
add $m0, $m10, 1
cmpult $m1, $m0, $m7
movnz $m0, $m1, $m7
sub $m0, $m0, $m10
add $m0, $m0, -1
shl $m1, $m10, 2
ld32 $m3, $m11, $m15, 10 # 4-byte Folded Reload
add $m1, $m3, $m1
.LBB0_10:
ld32step $m3, $m15, $m1+=, 1
or $m8, $m3, $m8
brnzdec $m0, .LBB0_10
.LBB0_11:
add $m7, $m7, 1
brnzdec $m9, .LBB0_2
ld32 $m0, $m11, $m15, 7 # 4-byte Folded Reload
add $m2, $m2, 1
ld32 $m1, $m11, $m15, 9 # 4-byte Folded Reload
add $m1, $m1, 4
st32 $m1, $m11, $m15, 9 # 4-byte Folded Spill
brnzdec $m0, .LBB0_1
ld32 $m0, $m11, $m15, 5 # 4-byte Folded Reload
st32 $m8, $m0, $m15, 0
ld32 $m7, $m11, $m15, 1 # 4-byte Folded Reload
ld32 $m10, $m11, $m15, 2 # 4-byte Folded Reload
ld32 $m9, $m11, $m15, 3 # 4-byte Folded Reload
ld32 $m8, $m11, $m15, 4 # 4-byte Folded Reload
add $m11, $m11, 48
br $m10
no_vec_false_pos_vectorize: # @no_vec_false_pos_vectorize
add $m11, $m11, -24
st32 $m8, $m11, $m15, 4 # 4-byte Folded Spill
st32 $m9, $m11, $m15, 3 # 4-byte Folded Spill
st32 $m10, $m11, $m15, 2 # 4-byte Folded Spill
st32 $m7, $m11, $m15, 1 # 4-byte Folded Spill
st32 $m0, $m11, $m15, 5 # 4-byte Folded Spill
mov $m2, $m15
setzi $m0, 32
mov $m4, $m0
add $m4, $m4, -1
.LBB1_1:
mov $m5, $m15
mov $m6, $m0
sub $m7, 0, $m2
.LBB1_2:
cmpult $m8, $m2, $m5
brz $m8, .LBB1_3
mov $m8, $m15
add $m9, $m7, $m5
add $m9, $m9, -1
mov $m10, $m1
.LBB1_5:
ld32step $m3, $m15, $m10+=, 1
or $m8, $m3, $m8
brnzdec $m9, .LBB1_5
add $m6, $m6, -1
add $m5, $m5, 1
brnz $m6, .LBB1_2
bri .LBB1_7
.LBB1_3:
mov $m8, $m15
add $m6, $m6, -1
add $m5, $m5, 1
brnz $m6, .LBB1_2
.LBB1_7:
add $m2, $m2, 1
add $m1, $m1, 4
brnzdec $m4, .LBB1_1
ld32 $m0, $m11, $m15, 5 # 4-byte Folded Reload
st32 $m8, $m0, $m15, 0
ld32 $m7, $m11, $m15, 1 # 4-byte Folded Reload
ld32 $m10, $m11, $m15, 2 # 4-byte Folded Reload
ld32 $m9, $m11, $m15, 3 # 4-byte Folded Reload
ld32 $m8, $m11, $m15, 4 # 4-byte Folded Reload
add $m11, $m11, 24
br $m10
11.5. Restrict
The restrict qualifier in C/C++ can be applied to pointers. It aids in the scheduling of function parameters and class members. Currently, the alias analysis in LLVM is not perfect and is very conservative: if it cannot prove there is no aliasing between pointers, it will assume the worst and transform the code assuming that aliasing will occur.
However, you may know that no pointers alias within a certain (function) scope and want the compiler to generate code under that assumption. By hinting that a pointer does not alias, you notify the compiler that it can be less conservative with loads and stores through that pointer.
Warning
This puts the burden on you to ensure that pointers do not alias. Applying the restrict qualifier to a pointer that aliases another pointer within the same scope results in undefined behaviour. Refer to the C and C++ specifications for more detail on the language-level semantics of the restrict qualifier.
For example:
// Compiled with popc --S --target ipu1 -O2 example_restrict.c -o out.s
int may_alias_foo(int *a, int *b)
{
*a = *b;
*b = 5;
return *b * *a;
}
int restricted_foo(int *restrict a, int *restrict b)
{
*a = *b;
*b = 5;
return *b * *a;
}
will result in:
may_alias_foo: # @may_alias_foo
{
ld32 $m2, $m1, $m15, 0
setzi $a0, 5
}
st32 $m2, $m0, $m15, 0
st32 $a0, $m1, $m15, 0
ld32 $m0, $m0, $m15, 0 // Re-load a[0], just in case b == a
mul $m0, $m0, 5
br $m10
restricted_foo: # @restricted_foo
{
ld32 $m2, $m1, $m15, 0
setzi $a0, 5
}
st32 $m2, $m0, $m15, 0
mul $m0, $m2, 5
st32 $a0, $m1, $m15, 0
br $m10
11.6. Alignment
The IPU load and store instructions are constrained in terms of alignment: a load or store of an n-bit value requires an n-bit aligned address. For example, a 64-bit float load or store instruction requires 64-bit (8-byte) alignment. Alignments of values are not enforced in LLVM and must come from whatever generated the LLVM IR code (in our case, the C/C++ code). In practice, this often means the C/C++ code will have to explicitly set the alignment of variables if you intend to use the 64-bit load and store instructions.
You can guide the compiler to use the best load and store instructions available either by explicitly using a vector float type defined in ipudef.h, or by explicitly setting the alignment of floats in a consecutive chain of float load and store instructions to the maximum supported alignment.
For example:
// Compiled with popc --S --target ipu1 -O2 example_align.c -o out.s
#include <ipudef.h>
void floot2 (float2 *a, float2 *b, unsigned size) {
for (int i = 0; i < size; ++i) {
a[i] += b[i];
}
}
void floot (float *a, float *b, unsigned size) {
for (int i = 0; i < size; ++i) {
a[i] += b[i];
}
}
void floot_align (float *a, float *b, unsigned size) {
a = __builtin_assume_aligned(a, 8);
b = __builtin_assume_aligned(b, 8);
for (int i = 0; i < size; ++i) {
a[i] += b[i];
}
}
will result in:
floot2: # @floot2
brz $m2, .LBB0_3
add $m2, $m2, -1
.LBB0_2:
ld64step $a0:1, $m15, $m1+=, 1
ld64 $a2:3, $m0, $m15, 0
f32v2add $a0:1, $a0:1, $a2:3
st64step $a0:1, $m15, $m0+=, 1
brnzdec $m2, .LBB0_2
.LBB0_3:
br $m10
floot: # @floot
brz $m2, .LBB1_11
cmpeq $m3, $m2, 1
brnz $m3, .LBB1_2
shl $m3, $m2, 2
add $m4, $m1, $m3
cmpult $m4, $m0, $m4
brz $m4, .LBB1_6
add $m3, $m0, $m3
cmpult $m3, $m1, $m3
brz $m3, .LBB1_6
.LBB1_2:
mov $m3, $m15
.LBB1_9:
sub $m2, $m2, $m3
add $m2, $m2, -1
shl $m3, $m3, 2
add $m0, $m0, $m3
add $m1, $m1, $m3
.LBB1_10:
ld32step $a0, $m15, $m1+=, 1
ld32 $a1, $m0, $m15, 0
f32add $a0, $a0, $a1
st32step $a0, $m15, $m0+=, 1
brnzdec $m2, .LBB1_10
.LBB1_11:
br $m10
.LBB1_6:
andc $m3, $m2, 1
add $m4, $m3, -2
shr $m4, $m4, 1
mov $m5, $m0
mov $m6, $m1
.LBB1_7:
ld32 $a1, $m6, $m15, 1
ld32step $a0, $m15, $m6+=, 2
ld32 $a2, $m5, $m15, 0
ld32 $a3, $m5, $m15, 1
f32v2add $a0:1, $a0:1, $a2:3
st32 $a1, $m5, $m15, 1
st32step $a0, $m15, $m5+=, 2
brnzdec $m4, .LBB1_7
cmpeq $m4, $m3, $m2
brnz $m4, .LBB1_11
bri .LBB1_9
floot_align: # @floot_align
brz $m2, .LBB2_11
cmpeq $m3, $m2, 1
brnz $m3, .LBB2_2
shl $m3, $m2, 2
add $m4, $m1, $m3
cmpult $m4, $m0, $m4
brz $m4, .LBB2_6
add $m3, $m0, $m3
cmpult $m3, $m1, $m3
brz $m3, .LBB2_6
.LBB2_2:
mov $m3, $m15
.LBB2_9:
sub $m2, $m2, $m3
add $m2, $m2, -1
shl $m3, $m3, 2
add $m0, $m0, $m3
add $m1, $m1, $m3
.LBB2_10:
ld32step $a0, $m15, $m1+=, 1
ld32 $a1, $m0, $m15, 0
f32add $a0, $a0, $a1
st32step $a0, $m15, $m0+=, 1
brnzdec $m2, .LBB2_10
.LBB2_11:
br $m10
.LBB2_6:
andc $m3, $m2, 1
add $m4, $m3, -2
shr $m4, $m4, 1
mov $m5, $m0
mov $m6, $m1
.LBB2_7:
ld64step $a0:1, $m15, $m6+=, 1
ld64 $a2:3, $m5, $m15, 0
f32v2add $a0:1, $a0:1, $a2:3
st64step $a0:1, $m15, $m5+=, 1
brnzdec $m4, .LBB2_7
cmpeq $m4, $m3, $m2
brnz $m4, .LBB2_11
bri .LBB2_9
11.7. Vector math functions
A number of common math functions are provided for types that are otherwise not available in the standard library, including half and vector types. A complete list can be found in the header file <ipu_vector_math>.
For example:
// compile with popc --S --target ipu1 vector_math.cpp -o out.s -O2
#include <ipu_vector_math>
// will emit single f16v2ln instruction for ipu::log(half2)
half2 get_log(half2 x) {
return ipu::log(x);
}
// will emit multiple f16v2ln instructions for ipu::log(half4)
half4 get_log(half4 x) {
return ipu::log(x);
}
will result in:
_Z7get_logDv2_Dh:
{
br $m10
f16v2ln $a0, $a0
}
_Z7get_logDv4_Dh:
f16v2ln $a0, $a0
{
br $m10
f16v2ln $a1, $a1
}
11.8. Memory intrinsics
You can reduce the overhead of performing pointer arithmetic when loading or storing values by using the post-incrementing load and store memory intrinsics located in the header file <ipu_memory_intrinsics>. Refer to the IPU C++ memory intrinsics chapter in the Poplar and PopLibs User Guide for more information.
For example:
// compile with popc --S --target ipu1 memory_intrinsics.cpp -o out.s -O3
#include <ipu_memory_intrinsics>
#include <tuple>
std::tuple<int const*, int> postinc_load(int const * x, int stride) {
int load1 = ipu::load_postinc(&x, stride);
int load2 = ipu::load_postinc(&x, stride);
return std::make_tuple(x, load1 + load2);
}
std::tuple<int const*, int> no_postinc_load(int const * x, int stride) {
int load1 = *x;
x += stride;
int load2 = *x;
x += stride;
return std::make_tuple(x, load1 + load2);
}
will result in:
_Z12postinc_loadPKii:
ld32step $m3, $m15, $m1+=, $m2
ld32step $m2, $m15, $m1+=, $m2
add $m2, $m2, $m3
st32 $m1, $m0, $m15, 0
st32 $m2, $m0, $m15, 1
br $m10
_Z15no_postinc_loadPKii:
shl $m3, $m2, 2
ld32 $m4, $m1, $m15, 0
add $m5, $m1, $m3
ld32 $m1, $m1, $m15, $m2
add $m2, $m5, $m3
add $m1, $m1, $m4
st32 $m2, $m0, $m15, 0
st32 $m1, $m0, $m15, 1
br $m10
11.9. Builtins
Clang supports a number of builtin library functions with the same syntax as GCC, as well as some additional functions. Poplar also provides builtin functions that specifically target the IPU.
More information on builtin functions can be found in the LLVM compiler documentation.
A full list of builtins that target the IPU can be found in the Poplar and PopLibs API documentation.
For example:
// compile with popc --S --target ipu1 example_builtins.cpp -o out.s -O2
#include <ipu_vector_math>
// Generic builtin provided by clang
void shufflevector(half2 x1, half2 x2, half2 *x3) {
*x3 = __builtin_shufflevector(x1, x2, 0, 0);
}
// IPU target specific builtin
uint2 get_packed_ptr (const float *a, const float *b, const float *c) {
return __builtin_ipu_tapack(a, b, c);
}
will result in:
_Z13shufflevectorDv2_DhS_PS_:
sort4x16lo $a0, $a0, $a0
st32 $a0, $m0, $m15, 0
br $m10
_Z14get_packed_ptrPKfS0_S0_:
tapack $m0:1, $m0, $m1, $m2
br $m10
11.10. Inline assembly
In some instances you may be unable to get the desired code, for example if there are architecture-specific instructions that cannot easily be represented in C/C++, or if the compiler just cannot produce the desired code despite our best efforts. In this case, you can write inline assembly within your C/C++ functions instead of manually creating the entire function in assembly.
The inline assembler can inhibit further optimisations within a function, so take care when using inline assembly.
For example:
// Compiled with popc --S --target ipu1 -O2 example_inline_asm.c -o out.s
#include <ipudef.h>
// Write 16 bits to memory assuming a 32 bit aligned destination pointer
void write16Aligned32(half in, half2 *outPtr) {
// Ensure that the operand that is put into a 32 register is 32 bits in size
half2 result = {in, in};
asm volatile(" ldb16 $a0, $mzero, %[h2Out], 1\n"
" sort4x16lo $a0, %[result], $a0\n"
" st32 $a0, $mzero, %[h2Out],0\n"
:
: [result] "r"(result), [h2Out] "r"(outPtr)
: "$a0", "memory");
}
// Combine four 8bit values in the 8 lsbs of each input into a single 32
// bit result. bits 8..31 of the inputs are ignored
unsigned combine8bit(unsigned in0, unsigned in1, unsigned in2, unsigned in3) {
unsigned out;
asm volatile(" shuf8x8lo $m1, %[in0], %[in1]\n"
" shuf8x8lo $m0, %[in2], %[in3]\n"
" sort4x16lo %[out], $m1, $m0\n"
: [out] "+r"(out)
: [in0] "r"(in0), [in1] "r"(in1), [in2] "r"(in2), [in3] "r"(in3)
: "$m0", "$m1");
return out;
}
will result in:
write16Aligned32: # @write16Aligned32
sort4x16lo $a1, $a0, $a0
ldb16 $a0, $m15, $m0, 1
sort4x16lo $a0, $a1, $a0
st32 $a0, $m15, $m0, 0
br $m10
combine8bit: # @combine8bit
mov $m4, $m1
mov $m5, $m0
shuf8x8lo $m1, $m5, $m4
shuf8x8lo $m0, $m2, $m3
sort4x16lo $m2, $m1, $m0
mov $m0, $m2
br $m10
11.11. Intrinsics
Inline assembly can be avoided, to an extent, by using intrinsics that are guaranteed to map to single instructions. These can be found in (and included from) the header file ipu_intrinsics. The long-term aim of this feature is to give you the ability to target as much of the instruction set as possible via C/C++ functions, and therefore relieve you of the need to write blocks of inline assembly.
For a simple use case, suppose you wanted to target the andc instruction, specifically the andc $aDst0, $aSrc0, $aSrc1 variant, given some float arguments you would like to apply this operation to. This would not be straightforward to write using the existing &~ operator with float operands. For instance, the attempt below would result in a compiler error:
// Compiled with popc --S --target ipu1 -O2 wont_work.cpp -o out.s
float wont_work(float x, float y) {
return x &~ y; // Can't just do this - invalid argument error.
}
A potential workaround for this would be:
// Compiled with popc --S --target ipu1 -O2 not_great.cpp -o out.s
float not_great(float x, float y) {
int x_temp = (int) x;
int y_temp = (int) y;
return (float) (x_temp &~ y_temp);
}
which would result in:
not_great: # @ not_great
f32int $a0, $a0, 3
f32int $a1, $a1, 3
f32toi32 $a0, $a0
f32toi32 $a1, $a1
andc $a0, $a0, $a1
{
br $m10
f32fromi32 $a0, $a0
}
Since you know that you want a simple andc instruction given some float arguments, this output seems cluttered with unnecessary float conversions. The next logical attempt might be to write this using inline assembly as:
// Compiled with popc --S --target ipu1 -O2 with_inline_asm.cpp -o out.s
float better_but_with_inline_asm(float x, float y) {
float result;
asm volatile(" andc %[result], %[x], %[y]\n"
: [result] "+r"(result)
: [x] "r"(x), [y] "r"(y));
return result;
}
This results in:
better_but_with_inline_asm: # @ better_but_with_inline_asm
#APP
andc $a0, $a0, $a1
#NO_APP
br $m10
Although this does produce the desired andc instruction, the use of an intrinsic is even simpler:
// Compiled with popc --S --target ipu1 -O2 best.cpp -o out.s
#include <ipu_intrinsics>
float best(float x, float y) {
return ipu::andc(x, y);
}
as it results in:
best: # @best
{
br $m10
andc $a0, $a0, $a1
}