11. Writing efficient C++
The Poplar graph programming framework uses the C/C++ compiler from the LLVM open-source infrastructure for building codelets. By using LLVM, Poplar is able to provide a fast compiler that generates high-performance code and supports all the latest language features required for easier development.
This chapter describes some of the optimisations, features and techniques you can use to allow the compiler to generate more efficient code. In order to understand the optimisations in this chapter, you should be familiar with Section 10, Writing vertices in assembly.
11.1. Inspecting the generated code
You can use the Poplar codelet compiler popc to precompile the C++ codelets and use the -S option to emit the generated assembly:
$ popc --S --target ipu1 example.c -o example.s
In the rest of this chapter, the examples will be written as minimal C/C++ functions for clarity. The assembly shown is the generated output with all assembler directives removed.
11.2. Optimisation levels
The compiler provides a number of optimisation levels that can be specified with the option -O<level>.
The optimisation level -O2 generally provides the best overall code generation in terms of code size and performance. -O3 performs more aggressive optimisations and delivers better runtime performance, but it may also generate larger code.
By default, Poplar uses -O3, but if code size is an issue, it might be desirable to use the -Os option. This performs optimisations to minimize code size but potentially at a cost to runtime performance.
More information on optimisation levels is available from the popc command line help option (-h). You can also refer to the Clang documentation for Code Generation Options.
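For example, to compare the effect of two optimisation levels on a codelet, you could compile it at each level and inspect the generated assembly (the file names here are just placeholders):
$ popc --S --target ipu1 -O2 example.cpp -o example_O2.s
$ popc --S --target ipu1 -Os example.cpp -o example_Os.s
Comparing the two outputs is a quick way to check whether -Os saves enough code size to justify any loss of runtime performance for a particular codelet.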
11.3. IPU hardware loops
There are two hardware loops defined in the IPU architecture:
rpt: The rpt instruction allows a scoped loop to be defined. This allows the branch back to the start of the loop body to be omitted.
brnzdec: The brnzdec instruction is shorthand for ‘branch if non-zero, decrement’. This means a value can be evaluated to be zero or not; if it is not zero, it will be decremented, which is roughly the sequence of instructions required for a (decrementing) loop.
These hardware loops are automatically generated where possible by the compiler. However, each comes with constraints that may inhibit generation of the hardware loop. This section describes the constraints and some of the basics of the IPU hardware loop generation in LLVM.
11.3.1. Prioritisation between hardware loops
Generally, within the IPU LLVM backend the rpt loop is preferred over the brnzdec loop. If, due to the rpt constraints, LLVM is unable to generate the rpt loop, it will attempt to generate a brnzdec loop. If that also fails, a regular loop consisting of a subtraction operation (sub) and a branch-if-non-zero (brnz) operation is generated.
In pseudocode, this would be:
if (rpt_constraints_are_met()) {
generate_rpt()
} else if (brnzdec_constraints_are_met()) {
generate_brnzdec()
} else {
generate_regular_loop()
}
Some additional internal checks are in place which may result in a suboptimal hardware loop. However, most often a brnzdec loop is generated instead of an rpt loop because of unmet hardware loop constraints. The generation of a regular loop (for example, a brnz loop) is often due to unmet internal constraints that are not easily controlled at the C/C++ level.
11.3.2. Hardware loop constraints
This section describes the constraints for the generation of hardware loops.
11.3.3. rpt
The rpt loops have the following constraints:
A limit on the number of iterations. This number depends on the architecture type. As a result, the compiler will try to infer the induction variable, but will conservatively consider the constraint not met if the induction variable either cannot be analysed or is known to be greater than the limit.
The scoped rpt loop body may only contain bundles. As a result, the loop body may only contain instructions that can be issued together. If the scheduler cannot put two instructions into a bundle, a nop or an fnop will be generated to complete the bundle, which will impact code size.
There must not be any system or control instructions in the loop body (for example, put, br, call and brnz). As a result, no C/C++ function calls can be made within the loop body, and the rpt loop must be the innermost loop of a loop nest, or be a standalone loop with no parent loop.
Cost analysis: the rpt loop body contains either 4 or fewer instruction bundles, or less than half of the instruction bundles contain nop or fnop instructions. You can override the instruction bundle count constraint, but without user intervention the hard limit on the loop body bundle count is 4, and beyond 4 bundles generation depends on the number of bundles containing (f)nops.
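As an illustration of the control-instruction constraint, consider a loop whose body makes a function call. The sketch below is a hypothetical example (the function names and file name are placeholders); because an rpt loop body may not contain a call, this loop cannot become an rpt loop, and based on the constraints above the compiler would fall back to a brnzdec or regular loop:
// Compiled with popc --S --target ipu1 -O2 example_call_in_loop.c -o out.s
int transform(int x); // defined elsewhere, so the call cannot be removed

void call_in_loop(int *out, const int *in, unsigned size) {
  for (unsigned i = 0; i < size; ++i) {
    // The call instruction in the loop body violates the rpt constraints,
    // so the compiler must fall back to a brnzdec or regular loop.
    out[i] = transform(in[i]);
  }
}
If the called function can be inlined or the call hoisted out of the loop, the remaining constraints above determine whether an rpt loop can then be generated.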
11.3.4. brnzdec
Since the brnzdec loops have few constraints, they are more often not generated because of unmet internal constraints on the shape of a loop’s basic blocks.
For more information, refer to the talk “Hardware loops in the IPU backend” from the 2022 EuroLLVM Developers’ Meeting.
11.4. Guiding the C/C++ compiler for better loop code generation
This section describes what you can do to guide the C/C++ compiler to improve the code generation.
The examples in this section show two cases: one without the optimisation and one with the optimisation.
11.4.1. Hardware loop generation
There are multiple ways to guide the compiler to emit an IPU hardware loop. Each IPU hardware loop must adhere to a set of constraints. One of these is the limited iteration count for the rpt hardware loops. Effectively, this hardware loop can only be emitted if, among other constraints, the iteration count is within a particular range. However, the iteration count is not always available at compile time, which results in a conservative approach to generating this IPU hardware loop.
As a developer, you may be aware of which C/C++ loops may, and which may not, generate this rpt hardware loop. This section describes a few options to hint the compiler into generating the rpt hardware loop. Note that although the iteration count constraint is a common reason why the rpt loop is not emitted, it is not the only one.
11.4.2. __builtin_assume
The __builtin_assume builtin function (see the Clang Language Extensions documentation for more information) is available in C and C++ and provides the compiler with boolean invariants that it may assume to be true. One of these assumptions could be the number of iterations a loop will run: this number may be unknown or unanalysable for the compiler, but an obvious fact for the developer. In such cases, __builtin_assume can be used to limit the range of the loop induction variable and so denote a limited range for the iteration count.
For example:
// Compiled with popc --S --target ipu1 -O2 example_assume.c -o out.s
void foo (int *out, int *in, unsigned size) {
for (int i = 0; i < size; ++i) {
*out += in[i];
}
}
void assume_foo (int *out, int *in, unsigned size) {
__builtin_assume(size < 4095);
for (int i = 0; i < size; ++i) {
*out += in[i];
}
}
will result in:
foo: # @foo
brz $m2, .LBB0_3
ld32 $m3, $m0, $m15, 0
add $m2, $m2, -1
ld32step $m4, $m15, $m1+=, 1
add $m3, $m3, $m4
st32 $m3, $m0, $m15, 0
brnzdec $m2, .LBB0_2
br $m10
assume_foo: # @assume_foo
ld32 $m3, $m0, $m15, 0
{
rpt $m2, 2
fnop
}
{
ld32step $m4, $m15, $m1+=, 1
fnop
}
{
add $m3, $m3, $m4
fnop
}
{
st32 $m3, $m0, $m15, 0
fnop
}
br $m10
11.4.3. rptsize_t
rptsize_t provides an unsigned integer type with the allowed range of the rpt hardware loop. This type’s size is guaranteed to be equal to the size of the rpt hardware loop iteration count limit. It can be used in the same way as any regular int in C++. Note that rptsize_t can, and should, only be treated as an unsigned integer type. Refer to the Tile Vertex Instruction Set Architecture document to get the repeat size (TREG_REPEAT_COUNT_WIDTH) for your specific IPU.
For example:
// Compiled with popc --S --target ipu1 -O3 example_rptsize_t.cpp -o out.s
#include <ipudef.h>
void bar (int *out, int *a, unsigned n) {
for (int i = 0; i < n; ++i) {
*out += a[i];
}
}
void rptsizebar1 (int *out, int *a, rptsize_t n) {
for (int i = 0; i < n; ++i) {
*out += a[i];
}
}
void rptsizebar2 (int *out, int *a, rptsize_t n) {
for (rptsize_t i = 0; i < n; ++i) {
*out += a[i];
}
}
will result in:
_Z3barPiS_j: # @_Z3barPiS_j
brz $m2, .LBB0_3
ld32 $m3, $m0, $m15, 0
add $m2, $m2, -1
ld32step $m4, $m15, $m1+=, 1
add $m3, $m3, $m4
st32 $m3, $m0, $m15, 0
brnzdec $m2, .LBB0_2
br $m10
_Z11rptsizebar1PiS_6ap_intILj12ELb0EE: # @_Z11rptsizebar1PiS_6ap_intILj12ELb0EE
ldz16 $m2, $m2, $m15, 0
ld32 $m3, $m0, $m15, 0
{
rpt $m2, 2
fnop
}
{
ld32step $m4, $m15, $m1+=, 1
fnop
}
{
add $m3, $m3, $m4
fnop
}
{
st32 $m3, $m0, $m15, 0
fnop
}
br $m10
_Z11rptsizebar2PiS_6ap_intILj12ELb0EE: # @_Z11rptsizebar2PiS_6ap_intILj12ELb0EE
ldz16 $m2, $m2, $m15, 0
ld32 $m3, $m0, $m15, 0
{
rpt $m2, 2
fnop
}
{
ld32step $m4, $m15, $m1+=, 1
fnop
}
{
add $m3, $m3, $m4
fnop
}
{
st32 $m3, $m0, $m15, 0
fnop
}
br $m10
11.4.4. Pragma vectorize
You may want to disable vectorization in a loop if it results in less optimal code. With the IPU, such cases may arise from incorrect assumptions by LLVM about what may and may not be vectorized. For example, LLVM does not make a distinction between float and int types during vectorization and will in some cases choose to vectorize a loop of integers to a 2-element vector despite 2-element vector operations only being available for float types. As a result both the code size and execution time will be worse.
You can guide vectorization using a number of pragmas (see the Clang Language Extensions documentation for more information). This section only covers how to disable vectorization.
The Clang pragma for disabling vectorization is:
#pragma clang loop vectorize(disable)
This can be inserted into C/C++ code immediately before the loop whose vectorization you want to disable. Targeting a single loop avoids disabling vectorization for the whole compilation unit.
For example:
// compile with popc --S --target ipu1 example_pragma_vectorize.c -o out_O2.s -O2
void false_pos_vectorize (unsigned long *a, unsigned long *or_arr) {
unsigned long ret;
int i, j;
for (i = 0; i < 32; ++i) {
for (j = 0; j < 32; ++j) {
int b;
ret = 0;
for (b = i; b < j; ++b) {
ret |= or_arr[b];
}
}
}
*a = ret;
}
void no_vec_false_pos_vectorize (unsigned long *a, unsigned long *or_arr) {
unsigned long ret;
int i, j;
for (i = 0; i < 32; ++i) {
for (j = 0; j < 32; ++j) {
int b;
ret = 0;
#pragma clang loop vectorize(disable)
for (b = i; b < j; ++b) {
ret |= or_arr[b];
}
}
}
*a = ret;
}
will result in:
false_pos_vectorize: # @false_pos_vectorize
add $m11, $m11, -48
st32 $m8, $m11, $m15, 4 # 4-byte Folded Spill
st32 $m9, $m11, $m15, 3 # 4-byte Folded Spill
st32 $m10, $m11, $m15, 2 # 4-byte Folded Spill
st32 $m7, $m11, $m15, 1 # 4-byte Folded Spill
st32 $m0, $m11, $m15, 5 # 4-byte Folded Spill
mov $m2, $m15
setzi $m0, 32
st32 $m0, $m11, $m15, 6 # 4-byte Folded Spill
st32 $m1, $m11, $m15, 10 # 4-byte Folded Spill
st32 $m1, $m11, $m15, 9 # 4-byte Folded Spill
add $m0, $m0, -1
.LBB0_1:
st32 $m0, $m11, $m15, 7 # 4-byte Folded Spill
mov $m7, $m15
ld32 $m9, $m11, $m15, 6 # 4-byte Folded Reload
sub $m0, 0, $m2
st32 $m0, $m11, $m15, 8 # 4-byte Folded Spill
add $m9, $m9, -1
.LBB0_2:
cmpult $m0, $m2, $m7
brz $m0, .LBB0_3
sub $m0, $m7, $m2
mov $m8, $m15
cmpult $m1, $m0, 2
brz $m1, .LBB0_6
mov $m10, $m2
bri .LBB0_9
.LBB0_3:
mov $m8, $m15
bri .LBB0_11
.LBB0_6:
mov $m5, $m15
ld32 $m1, $m11, $m15, 8 # 4-byte Folded Reload
add $m1, $m1, $m7
andc $m1, $m1, 1
add $m1, $m1, -2
shr $m1, $m1, 1
add $m4, $m1, 1
add $m4, $m4, -1
andc $m1, $m0, 1
st32 $m1, $m11, $m15, 11 # 4-byte Folded Spill
add $m10, $m2, $m1
ld32 $m1, $m11, $m15, 9 # 4-byte Folded Reload
.LBB0_7:
ld32 $m3, $m1, $m15, 1
ld32step $m6, $m15, $m1+=, 2
or $m5, $m3, $m5
or $m8, $m6, $m8
brnzdec $m4, .LBB0_7
ld32 $m1, $m11, $m15, 11 # 4-byte Folded Reload
cmpeq $m0, $m0, $m1
mov $m4, $m5
or $m1, $m5, $m0
or $m8, $m8, $m4
brnz $m0, .LBB0_11
.LBB0_9:
add $m0, $m10, 1
cmpult $m1, $m0, $m7
movnz $m0, $m1, $m7
sub $m0, $m0, $m10
add $m0, $m0, -1
shl $m1, $m10, 2
ld32 $m3, $m11, $m15, 10 # 4-byte Folded Reload
add $m1, $m3, $m1
.LBB0_10:
ld32step $m3, $m15, $m1+=, 1
or $m8, $m3, $m8
brnzdec $m0, .LBB0_10
.LBB0_11:
add $m7, $m7, 1
brnzdec $m9, .LBB0_2
ld32 $m0, $m11, $m15, 7 # 4-byte Folded Reload
add $m2, $m2, 1
ld32 $m1, $m11, $m15, 9 # 4-byte Folded Reload
add $m1, $m1, 4
st32 $m1, $m11, $m15, 9 # 4-byte Folded Spill
brnzdec $m0, .LBB0_1
ld32 $m0, $m11, $m15, 5 # 4-byte Folded Reload
st32 $m8, $m0, $m15, 0
ld32 $m7, $m11, $m15, 1 # 4-byte Folded Reload
ld32 $m10, $m11, $m15, 2 # 4-byte Folded Reload
ld32 $m9, $m11, $m15, 3 # 4-byte Folded Reload
ld32 $m8, $m11, $m15, 4 # 4-byte Folded Reload
add $m11, $m11, 48
br $m10
no_vec_false_pos_vectorize: # @no_vec_false_pos_vectorize
add $m11, $m11, -24
st32 $m8, $m11, $m15, 4 # 4-byte Folded Spill
st32 $m9, $m11, $m15, 3 # 4-byte Folded Spill
st32 $m10, $m11, $m15, 2 # 4-byte Folded Spill
st32 $m7, $m11, $m15, 1 # 4-byte Folded Spill
st32 $m0, $m11, $m15, 5 # 4-byte Folded Spill
mov $m2, $m15
setzi $m0, 32
mov $m4, $m0
add $m4, $m4, -1
.LBB1_1:
mov $m5, $m15
mov $m6, $m0
sub $m7, 0, $m2
.LBB1_2:
cmpult $m8, $m2, $m5
brz $m8, .LBB1_3
mov $m8, $m15
add $m9, $m7, $m5
add $m9, $m9, -1
mov $m10, $m1
.LBB1_5:
ld32step $m3, $m15, $m10+=, 1
or $m8, $m3, $m8
brnzdec $m9, .LBB1_5
add $m6, $m6, -1
add $m5, $m5, 1
brnz $m6, .LBB1_2
bri .LBB1_7
.LBB1_3:
mov $m8, $m15
add $m6, $m6, -1
add $m5, $m5, 1
brnz $m6, .LBB1_2
.LBB1_7:
add $m2, $m2, 1
add $m1, $m1, 4
brnzdec $m4, .LBB1_1
ld32 $m0, $m11, $m15, 5 # 4-byte Folded Reload
st32 $m8, $m0, $m15, 0
ld32 $m7, $m11, $m15, 1 # 4-byte Folded Reload
ld32 $m10, $m11, $m15, 2 # 4-byte Folded Reload
ld32 $m9, $m11, $m15, 3 # 4-byte Folded Reload
ld32 $m8, $m11, $m15, 4 # 4-byte Folded Reload
add $m11, $m11, 24
br $m10
11.5. Restrict
The restrict qualifier in C/C++ can be applied to pointers. It aids in the scheduling of function parameters and class members. Currently, the alias analysis in LLVM is not perfect and is very conservative: if it cannot prove there is no aliasing between pointers, it will assume the worst and transform the code assuming that aliasing will occur.
However, you may know that no pointers alias within a certain (function) scope and want the compiler to generate code under that assumption. By hinting that a pointer does not alias, you notify the compiler that it can be less conservative with loads and stores through that pointer.
Warning
This puts the burden on you to ensure that pointers do not alias. Applying the restrict qualifier to a pointer that aliases another pointer within the same scope results in undefined behaviour. Refer to the C and C++ specifications for more detail on the language-level semantics of the restrict qualifier.
For example:
// Compiled with popc --S --target ipu1 -O2 example_restrict.c -o out.s
int may_alias_foo(int *a, int *b)
{
*a = *b;
*b = 5;
return *b * *a;
}
int restricted_foo(int *restrict a, int *restrict b)
{
*a = *b;
*b = 5;
return *b * *a;
}
will result in:
may_alias_foo: # @may_alias_foo
{
ld32 $m2, $m1, $m15, 0
setzi $a0, 5
}
st32 $m2, $m0, $m15, 0
st32 $a0, $m1, $m15, 0
ld32 $m0, $m0, $m15, 0 // Re-load a[0], just in case b == a
mul $m0, $m0, 5
br $m10
restricted_foo: # @restricted_foo
{
ld32 $m2, $m1, $m15, 0
setzi $a0, 5
}
st32 $m2, $m0, $m15, 0
mul $m0, $m2, 5
st32 $a0, $m1, $m15, 0
br $m10
11.6. Alignment
The IPU load and store instructions are constrained in terms of alignment: a load or store of an n-bit value requires an n-bit aligned address. For example, a 64-bit float load or store instruction requires 64-bit (8-byte) alignment. Alignments of values are not enforced in LLVM and must come from whatever generated the LLVM IR code (in our case, the C/C++ code). In practice, this often means the C/C++ code will have to explicitly set the alignment of variables if you intend to use the 64-bit load and store instructions.
You can guide the compiler to use the best load and store instructions available either by explicitly using a vector float type defined in ipudef.h, or by explicitly setting the alignment of floats in a consecutive chain of float load and store instructions to the maximum supported alignment.
For example:
// Compiled with popc --S --target ipu1 -O2 example_align.c -o out.s
#include <ipudef.h>
void floot2 (float2 *a, float2 *b, unsigned size) {
for (int i = 0; i < size; ++i) {
a[i] += b[i];
}
}
void floot (float *a, float *b, unsigned size) {
for (int i = 0; i < size; ++i) {
a[i] += b[i];
}
}
void floot_align (float *a, float *b, unsigned size) {
a = __builtin_assume_aligned(a, 8);
b = __builtin_assume_aligned(b, 8);
for (int i = 0; i < size; ++i) {
a[i] += b[i];
}
}
will result in:
floot2: # @floot2
brz $m2, .LBB0_3
add $m2, $m2, -1
.LBB0_2:
ld64step $a0:1, $m15, $m1+=, 1
ld64 $a2:3, $m0, $m15, 0
f32v2add $a0:1, $a0:1, $a2:3
st64step $a0:1, $m15, $m0+=, 1
brnzdec $m2, .LBB0_2
.LBB0_3:
br $m10
floot: # @floot
brz $m2, .LBB1_11
cmpeq $m3, $m2, 1
brnz $m3, .LBB1_2
shl $m3, $m2, 2
add $m4, $m1, $m3
cmpult $m4, $m0, $m4
brz $m4, .LBB1_6
add $m3, $m0, $m3
cmpult $m3, $m1, $m3
brz $m3, .LBB1_6
.LBB1_2:
mov $m3, $m15
.LBB1_9:
sub $m2, $m2, $m3
add $m2, $m2, -1
shl $m3, $m3, 2
add $m0, $m0, $m3
add $m1, $m1, $m3
.LBB1_10:
ld32step $a0, $m15, $m1+=, 1
ld32 $a1, $m0, $m15, 0
f32add $a0, $a0, $a1
st32step $a0, $m15, $m0+=, 1
brnzdec $m2, .LBB1_10
.LBB1_11:
br $m10
.LBB1_6:
andc $m3, $m2, 1
add $m4, $m3, -2
shr $m4, $m4, 1
mov $m5, $m0
mov $m6, $m1
.LBB1_7:
ld32 $a1, $m6, $m15, 1
ld32step $a0, $m15, $m6+=, 2
ld32 $a2, $m5, $m15, 0
ld32 $a3, $m5, $m15, 1
f32v2add $a0:1, $a0:1, $a2:3
st32 $a1, $m5, $m15, 1
st32step $a0, $m15, $m5+=, 2
brnzdec $m4, .LBB1_7
cmpeq $m4, $m3, $m2
brnz $m4, .LBB1_11
bri .LBB1_9
floot_align: # @floot_align
brz $m2, .LBB2_11
cmpeq $m3, $m2, 1
brnz $m3, .LBB2_2
shl $m3, $m2, 2
add $m4, $m1, $m3
cmpult $m4, $m0, $m4
brz $m4, .LBB2_6
add $m3, $m0, $m3
cmpult $m3, $m1, $m3
brz $m3, .LBB2_6
.LBB2_2:
mov $m3, $m15
.LBB2_9:
sub $m2, $m2, $m3
add $m2, $m2, -1
shl $m3, $m3, 2
add $m0, $m0, $m3
add $m1, $m1, $m3
.LBB2_10:
ld32step $a0, $m15, $m1+=, 1
ld32 $a1, $m0, $m15, 0
f32add $a0, $a0, $a1
st32step $a0, $m15, $m0+=, 1
brnzdec $m2, .LBB2_10
.LBB2_11:
br $m10
.LBB2_6:
andc $m3, $m2, 1
add $m4, $m3, -2
shr $m4, $m4, 1
mov $m5, $m0
mov $m6, $m1
.LBB2_7:
ld64step $a0:1, $m15, $m6+=, 1
ld64 $a2:3, $m5, $m15, 0
f32v2add $a0:1, $a0:1, $a2:3
st64step $a0:1, $m15, $m5+=, 1
brnzdec $m4, .LBB2_7
cmpeq $m4, $m3, $m2
brnz $m4, .LBB2_11
bri .LBB2_9
11.7. Vector math functions
A number of common math functions are provided for types that are otherwise not available in the standard library, including half and vector types. A complete list can be found in the header file <ipu_vector_math>.
For example:
// compile with popc --S --target ipu1 vector_math.cpp -o out.s -O2
#include <ipu_vector_math>
// will emit single f16v2ln instruction for ipu::log(half2)
half2 get_log(half2 x) {
return ipu::log(x);
}
// will emit multiple f16v2ln instructions for ipu::log(half4)
half4 get_log(half4 x) {
return ipu::log(x);
}
will result in:
_Z7get_logDv2_Dh:
{
br $m10
f16v2ln $a0, $a0
}
_Z7get_logDv4_Dh:
f16v2ln $a0, $a0
{
br $m10
f16v2ln $a1, $a1
}
11.8. Memory intrinsics
You can reduce the overhead of performing pointer arithmetic when loading or storing values by using the post-incrementing load and store memory intrinsics located in the header file <ipu_memory_intrinsics>. Refer to the IPU C++ memory intrinsics chapter in the Poplar and PopLibs User Guide for more information.
For example:
// compile with popc --S --target ipu1 memory_intrinsics.cpp -o out.s -O3
#include <ipu_memory_intrinsics>
#include <tuple>
std::tuple<int const*, int> postinc_load(int const * x, int stride) {
int load1 = ipu::load_postinc(&x, stride);
int load2 = ipu::load_postinc(&x, stride);
return std::make_tuple(x, load1 + load2);
}
std::tuple<int const*, int> no_postinc_load(int const * x, int stride) {
int load1 = *x;
x += stride;
int load2 = *x;
x += stride;
return std::make_tuple(x, load1 + load2);
}
will result in:
_Z12postinc_loadPKii:
ld32step $m3, $m15, $m1+=, $m2
ld32step $m2, $m15, $m1+=, $m2
add $m2, $m2, $m3
st32 $m1, $m0, $m15, 0
st32 $m2, $m0, $m15, 1
br $m10
_Z15no_postinc_loadPKii:
shl $m3, $m2, 2
ld32 $m4, $m1, $m15, 0
add $m5, $m1, $m3
ld32 $m1, $m1, $m15, $m2
add $m2, $m5, $m3
add $m1, $m1, $m4
st32 $m2, $m0, $m15, 0
st32 $m1, $m0, $m15, 1
br $m10
11.9. Builtins
Clang supports a number of builtin library functions with the same syntax as GCC, as well as some additional functions. Poplar also provides builtin functions that specifically target the IPU.
More information on builtin functions can be found in the LLVM compiler documentation.
A full list of builtins that target the IPU can be found in the Poplar and PopLibs API documentation.
For example:
// compile with popc --S --target ipu1 example_builtins.cpp -o out.s -O2
#include <ipu_vector_math>
// Generic builtin provided by clang
void shufflevector(half2 x1, half2 x2, half2 *x3) {
*x3 = __builtin_shufflevector(x1, x2, 0, 0);
}
// IPU target specific builtin
uint2 get_packed_ptr (const float *a, const float *b, const float *c) {
return __builtin_ipu_tapack(a, b, c);
}
will result in:
_Z13shufflevectorDv2_DhS_PS_:
sort4x16lo $a0, $a0, $a0
st32 $a0, $m0, $m15, 0
br $m10
_Z14get_packed_ptrPKfS0_S0_:
tapack $m0:1, $m0, $m1, $m2
br $m10
11.10. Inline assembly
In some instances you may be unable to get the desired code, for example if there are architecture-specific instructions that cannot easily be represented in C/C++, or if the compiler just cannot produce the desired code despite our best efforts. In this case, you can write inline assembly within your C/C++ functions instead of manually creating the entire function in assembly.
The inline assembler can inhibit further optimisations within a function, so take care when using inline assembly.
For example:
// Compiled with popc --S --target ipu1 -O2 example_inline_asm.c -o out.s
#include <ipudef.h>
// Write 16 bits to memory assuming a 32 bit aligned destination pointer
void write16Aligned32(half in, half2 *outPtr) {
// Ensure that the operand that is put into a 32 register is 32 bits in size
half2 result = {in, in};
asm volatile(" ldb16 $a0, $mzero, %[h2Out], 1\n"
" sort4x16lo $a0, %[result], $a0\n"
" st32 $a0, $mzero, %[h2Out],0\n"
:
: [result] "r"(result), [h2Out] "r"(outPtr)
: "$a0", "memory");
}
// Combine four 8bit values in the 8 lsbs of each input into a single 32
// bit result. bits 8..31 of the inputs are ignored
unsigned combine8bit(unsigned in0, unsigned in1, unsigned in2, unsigned in3) {
unsigned out;
asm volatile(" shuf8x8lo $m1, %[in0], %[in1]\n"
" shuf8x8lo $m0, %[in2], %[in3]\n"
" sort4x16lo %[out], $m1, $m0\n"
: [out] "+r"(out)
: [in0] "r"(in0), [in1] "r"(in1), [in2] "r"(in2), [in3] "r"(in3)
: "$m0", "$m1");
return out;
}
will result in:
write16Aligned32: # @write16Aligned32
sort4x16lo $a1, $a0, $a0
ldb16 $a0, $m15, $m0, 1
sort4x16lo $a0, $a1, $a0
st32 $a0, $m15, $m0, 0
br $m10
combine8bit: # @combine8bit
mov $m4, $m1
mov $m5, $m0
shuf8x8lo $m1, $m5, $m4
shuf8x8lo $m0, $m2, $m3
sort4x16lo $m2, $m1, $m0
mov $m0, $m2
br $m10
11.11. Intrinsics
Inline assembly can be avoided, to an extent, by using intrinsics that are guaranteed to map to single instructions. These can be found in (and included from) the header file ipu_intrinsics. The long-term aim of this feature is to give you the ability to target as much of the instruction set as possible via C/C++ functions, and therefore relieve you of the need to write blocks of inline assembly.
For a simple use case, suppose you wanted to target the andc instruction, specifically the andc $aDst0, $aSrc0, $aSrc1 variant, given some float arguments you would like to apply this operation to. This would not be straightforward to write using the existing &~ operator with float operands. For instance, the attempt below would result in a compiler error:
// Compiled with popc --S --target ipu1 -O2 wont_work.cpp -o out.s
float wont_work(float x, float y) {
return x &~ y; // Can't just do this - invalid argument error.
}
A potential workaround for this would be:
// Compiled with popc --S --target ipu1 -O2 not_great.cpp -o out.s
float not_great(float x, float y) {
int x_temp = (int) x;
int y_temp = (int) y;
return (float) (x_temp &~ y_temp);
}
which would result in:
not_great: # @ not_great
f32int $a0, $a0, 3
f32int $a1, $a1, 3
f32toi32 $a0, $a0
f32toi32 $a1, $a1
andc $a0, $a0, $a1
{
br $m10
f32fromi32 $a0, $a0
}
Since you know that you want a simple andc instruction given some float arguments, this output seems cluttered with unnecessary float conversions. The next logical attempt might be to write this using inline assembly as:
// Compiled with popc --S --target ipu1 -O2 with_inline_asm.cpp -o out.s
float better_but_with_inline_asm(float x, float y) {
float result;
asm volatile(" andc %[result], %[x], %[y]\n"
: [result] "+r"(result)
: [x] "r"(x), [y] "r"(y));
return result;
}
This results in:
better_but_with_inline_asm: # @ better_but_with_inline_asm
#APP
andc $a0, $a0, $a1
#NO_APP
br $m10
Although this does produce the desired andc instruction, the use of an intrinsic is even simpler:
// Compiled with popc --S --target ipu1 -O2 best.cpp -o out.s
#include <ipu_intrinsics>
float best(float x, float y) {
return ipu::andc(x, y);
}
as it results in:
best: # @best
{
br $m10
andc $a0, $a0, $a1
}