IPU C/C++ builtins

The following IPU-specific builtin functions can be used in C/C++ code. Some of them reference the Tile Vertex Instruction Set Architecture. Refer to that document for more detailed information on the instructions and the control and status registers (CSRs) that these builtins target.

Note

For many of these builtins, it is possible to omit the __builtin_ipu prefix by using the corresponding C++ intrinsic. See IPU C++ intrinsics for more information.

Note

Use #include <ipudef.h> for the IPU native types mentioned throughout this section, such as half, half2, float2 and more.

For information on non-IPU, generic Clang builtins, refer to the Clang documentation on builtin functions and this comprehensive document for GCC builtins, which Clang also aims to support.

IPU functionality and memory

Get lower half of cycle count from CSR

unsigned __builtin_ipu_get_scount_l(): Get the value of the CSR $COUNT_L, which is the lower 32 bits of the tile cycle counter value.

Get upper half of cycle count from CSR

unsigned __builtin_ipu_get_scount_u(): Get the value of the CSR $COUNT_U, which is the upper 32 bits of the tile cycle counter value.
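
The two halves are read by separate builtin calls, so a full 64-bit count has to be assembled in software. The following is a minimal sketch of one common way to do that, not a documented procedure: it re-reads the upper half and retries if it changed, on the assumption that the hardware does not already guarantee a consistent pair. The names count_reader, combine_count and read_cycle_count are hypothetical. The readers are passed in as function pointers so that the logic can be exercised off-device; on the IPU they would be thin wrappers around __builtin_ipu_get_scount_u() and __builtin_ipu_get_scount_l().

```c
#include <stdint.h>

/* Hypothetical reader type. On the IPU the two readers would wrap
   __builtin_ipu_get_scount_u() and __builtin_ipu_get_scount_l(). */
typedef unsigned (*count_reader)(void);

/* Join the upper and lower 32-bit halves into one 64-bit value. */
uint64_t combine_count(uint32_t upper, uint32_t lower) {
  return ((uint64_t)upper << 32) | (uint64_t)lower;
}

/* Read a consistent 64-bit count. If the upper half changed between the
   two reads, the lower half wrapped in between, so try again. */
uint64_t read_cycle_count(count_reader read_upper, count_reader read_lower) {
  uint32_t hi, lo, hi_again;
  do {
    hi = read_upper();
    lo = read_lower();
    hi_again = read_upper();
  } while (hi != hi_again);
  return combine_count(hi, lo);
}
```

Subtracting two such 64-bit values gives an elapsed cycle count that is not affected by a wrap of the lower half.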

Get vertex base from CSR

void *__builtin_ipu_get_vertex_base(): Get vertex data structure pointer from the $VERTEX_BASE CSR.

Get tile ID from CSR

unsigned __builtin_ipu_get_tile_id()

Get the ID of the current tile from the $TILE_ID CSR.

Check for worker mode

bool __builtin_ipu_is_worker_mode(): Check for worker mode.

Example

#include <stdbool.h> // needed in C

bool example() {
  bool res = __builtin_ipu_is_worker_mode();
  return res;
}

Triple-pack three addresses

uint2 __builtin_ipu_tapack(const void *addr1, const void *addr2, const void *addr3)

Convert three absolute addresses to the triple-packed address format.

Targets the tapack instruction.

See the Tile Vertex Instruction Set Architecture for more details about the tapack instruction.

Write to a CSR

void __builtin_ipu_put(unsigned val, unsigned char csr_index)

Write to a control and status register.

Targets the put instruction.

See the Tile Vertex Instruction Set Architecture for more details about:

the put instruction

Control and Status registers

Example

Write the value x to the CSR at index 32.

void example(unsigned x) {
  __builtin_ipu_put(x, 32);
}

Write to an upper CSR

void __builtin_ipu_uput(unsigned val, unsigned char csr_index)

void __builtin_ipu_uput(float val, unsigned char csr_index)

Write to a control register in the upper CSR address space.

Targets the uput instruction.

See the Tile Vertex Instruction Set Architecture for more details about:

the uput instruction

Control and Status registers

Example

Write the value x to the CSR at index 2 in the upper CSR space.

void example(unsigned x) {
  __builtin_ipu_uput(x, 2);
}

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_uput and __builtin_ipu_uputf are available without this header.

Read from a CSR

unsigned __builtin_ipu_get(unsigned char csr_index)

Read the value of a control and status register into a general purpose register.

Targets the get instruction.

See the Tile Vertex Instruction Set Architecture for more details about:

the get instruction

Control and Status registers

Example

Set res to the value of the CSR at index 1.

unsigned example() {
  unsigned res = __builtin_ipu_get(1);
  return res;
}

Read from an upper CSR

unsigned __builtin_ipu_uget(unsigned char csr_index)

Read the value of a control and status register in the upper CSR space into a general purpose register.

Targets the uget instruction.

See the Tile Vertex Instruction Set Architecture for more details about:

the uget instruction

Control and Status registers

Example

Set res to the value of the CSR at index 4 in the upper CSR space.

unsigned example() {
  unsigned res = __builtin_ipu_uget(4);
  return res;
}

Read a float from an upper CSR

float __builtin_ipu_ugetf(unsigned char csr_index)

Read the value of a control and status register in the upper CSR space into a general purpose register.

Targets the uget instruction.

See the Tile Vertex Instruction Set Architecture for more details about:

the uget instruction

Control and Status registers

Load and write 64-bit value to the common configuration space

void __builtin_ipu_ld64putcs(const unsigned imm)

Load a naturally-aligned 64-bit value and write it to the common compute configuration space. The load address is provided by the CSR $CCCSLOAD, which is automatically post-incremented by 8.

Targets the ld64putcs instruction.

See the Tile Vertex Instruction Set Architecture for more details about the ld64putcs instruction.

Load and write 128-bit value to the common configuration space

void __builtin_ipu_ld128putcs(const unsigned imm)

Load a naturally-aligned 128-bit value and write it to the common compute configuration space. The load address is provided by the CSR $CCCSLOAD, which is automatically post-incremented by 16.

Targets the ld128putcs instruction.

See the Tile Vertex Instruction Set Architecture for more details about the ld128putcs instruction.

64-bit load and 64-bit store, with post-incrementing addresses

float2 __builtin_ipu_ldst64pace(float2 src, uint2 addr, uint stride, const unsigned imm)

Load a naturally aligned 64-bit value and simultaneously store a 64-bit value src, with two independent post-incrementing addresses. The two addresses are packed into the register pair addr.

The post-increment of the two addresses is determined by the stride and the 4-bit immediate imm.

Targets the ldst64pace instruction.

See the Tile Vertex Instruction Set Architecture for more details about the ldst64pace instruction, specifically how stride and imm are configured and how the addresses are packed into addr.

Note

This builtin may be used in conjunction with __builtin_ipu_tapack.

Bit operations

And operation

int __builtin_ipu_and(int x, int y)

float __builtin_ipu_and(float x, float y)

float2 __builtin_ipu_and(float2 x, float2 y)

Get the result of the and bit operation of two values.

Targets the and instruction.

See the Tile Vertex Instruction Set Architecture for more details about the and instruction.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_and_i32, __builtin_ipu_and_f32 and __builtin_ipu_and_v2f32 are available without this header.
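
The float overloads operate on bit patterns rather than numeric values. As an illustration only, the sketch below is a host-side model of the float variant, assuming IEEE 754 binary32 floats and assuming that the builtin combines the raw 32-bit patterns of its two arguments. The names f32_bits, bits_f32, and_f32_model and magnitude_via_and are hypothetical and are not part of the builtin set.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back (memcpy avoids
   aliasing problems). */
static uint32_t f32_bits(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

static float bits_f32(uint32_t u) {
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

/* Host-side model (an assumption): AND the raw bit patterns. */
float and_f32_model(float x, float y) {
  return bits_f32(f32_bits(x) & f32_bits(y));
}

/* Example use: clear the sign bit (bit 31) to obtain the magnitude. */
float magnitude_via_and(float x) {
  return and_f32_model(x, bits_f32(0x7FFFFFFFu));
}
```

Masking a float in this way is a typical reason to prefer the float overload over converting to an integer and back.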

Andc operation

int __builtin_ipu_andc(int x, int y)

float __builtin_ipu_andc(float x, float y)

float2 __builtin_ipu_andc(float2 x, float2 y)

Get the result of the andc bit operation of two values.

Targets the andc instruction.

See the Tile Vertex Instruction Set Architecture for more details about the andc instruction.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_andc_i32, __builtin_ipu_andc_f32 and __builtin_ipu_andc_v2f32 are available without this header.

Or operation

int __builtin_ipu_or(int x, int y)

float __builtin_ipu_or(float x, float y)

float2 __builtin_ipu_or(float2 x, float2 y)

Get the result of the or bit operation of two values.

Targets the or instruction.

See the Tile Vertex Instruction Set Architecture for more details about the or instruction.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_or_i32, __builtin_ipu_or_f32 and __builtin_ipu_or_v2f32 are available without this header.

Not operation

float __builtin_ipu_not(float x)

float2 __builtin_ipu_not(float2 x)

Get the result of the not bit operation of a value.

Targets the not instruction.

See the Tile Vertex Instruction Set Architecture for more details about the not instruction.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_not_f32 and __builtin_ipu_not_v2f32 are available without this header.

Reverse bits in each byte

unsigned __builtin_ipu_bitrev8(unsigned x)

Reverses the bit order of each byte in x.

Targets the bitrev8 instruction.

See the Tile Vertex Instruction Set Architecture for more details about the bitrev8 instruction.
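
To make the description concrete, here is a small host-side reference model of "reverse the bit order of each byte". It is a sketch for checking expectations off-device; the name bitrev8_model is hypothetical and the code says nothing about how the hardware implements the instruction.

```c
#include <stdint.h>

/* Reverse the bits inside each of the four bytes of x. Byte positions are
   unchanged; only the bit order within every byte is mirrored. */
uint32_t bitrev8_model(uint32_t x) {
  /* Swap adjacent bits, then adjacent bit pairs, then the two nibbles. */
  x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
  x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
  x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
  return x;
}
```

For example, the input 0x01020380 gives 0x8040C001: each byte is mirrored in place. Applying the function twice returns the original value.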

Count matching sign bits

unsigned __builtin_ipu_cms(int x)

Calculates number of higher order bits that match the sign bit in x.

Targets the cms instruction.

See the Tile Vertex Instruction Set Architecture for more details about the cms instruction.

SIMD roll permutation on 4x32-bit values

float2 __builtin_ipu_roll32(float2 x, float2 y)

Performs SIMD roll permutation on the 4 32-bit values across x and y.

x            y             ->      Result
| 3 | 2 |    | 1 | 0 |             | 2 | 1 |

Targets the roll32 instruction.

See the Tile Vertex Instruction Set Architecture for more details about the roll32 instruction.

SIMD roll-left permutation on 8x8-bit values

unsigned __builtin_ipu_roll8l(unsigned x, unsigned y)

Performs SIMD roll-left permutation on the 8 8-bit values across x and y.

x                    y                     ->      Result
| 7 | 6 | 5 | 4 |    | 3 | 2 | 1 | 0 |             | 6 | 5 | 4 | 3 |

Targets the roll8l instruction.

See the Tile Vertex Instruction Set Architecture for more details about the roll8l instruction.

SIMD roll-right permutation on 8x8-bit values

unsigned __builtin_ipu_roll8r(unsigned x, unsigned y)

Performs SIMD roll-right permutation on the 8 8-bit values across x and y.

x                    y                     ->      Result
| 7 | 6 | 5 | 4 |    | 3 | 2 | 1 | 0 |             | 4 | 3 | 2 | 1 |

Targets the roll8r instruction.

See the Tile Vertex Instruction Set Architecture for more details about the roll8r instruction.
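
The roll8l and roll8r diagrams can be read as shifts over the 64-bit concatenation of x and y. The sketch below is a host-side model under one assumption that the text here does not state: that the leftmost byte in each diagram is the most significant byte of the 32-bit word. The function names are hypothetical.

```c
#include <stdint.h>

/* roll8l: result bytes | 6 | 5 | 4 | 3 |.
   That is the low three bytes of x followed by the top byte of y. */
uint32_t roll8l_model(uint32_t x, uint32_t y) {
  return (x << 8) | (y >> 24);
}

/* roll8r: result bytes | 4 | 3 | 2 | 1 |.
   That is the low byte of x followed by the top three bytes of y. */
uint32_t roll8r_model(uint32_t x, uint32_t y) {
  return (x << 24) | (y >> 8);
}
```

Using x = 0x07060504 and y = 0x03020100 makes each byte carry its own index, so the results 0x06050403 and 0x04030201 can be compared directly with the diagrams.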

Upper half of SIMD shuffle permutation on 8x8-bit values

unsigned __builtin_ipu_shuf8x8hi(unsigned x, unsigned y)

Performs SIMD shuffle permutation on the 8 8-bit values across x and y, and returns the upper word of the result.

x                    y                     ->      Result
| 7 | 6 | 5 | 4 |    | 3 | 2 | 1 | 0 |             | 7 | 3 | 6 | 2 |

Targets the shuf8x8hi instruction.

See the Tile Vertex Instruction Set Architecture for more details about the shuf8x8hi instruction.

Lower half of SIMD shuffle permutation on 8x8-bit values

unsigned __builtin_ipu_shuf8x8lo(unsigned x, unsigned y)

Performs SIMD shuffle permutation on the 8 8-bit values across x and y, and returns the lower word of the result.

x                    y                     ->      Result
| 7 | 6 | 5 | 4 |    | 3 | 2 | 1 | 0 |             | 5 | 1 | 4 | 0 |

Targets the shuf8x8lo instruction.

See the Tile Vertex Instruction Set Architecture for more details about the shuf8x8lo instruction.
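
The two shuffle diagrams interleave the bytes of x and y. The following host-side model reproduces them, assuming (as the diagrams suggest but the text does not state) that the leftmost byte is the most significant and that bytes 0 to 3 live in y and bytes 4 to 7 in x. The helper and function names are hypothetical.

```c
#include <stdint.h>

/* Extract byte i of v, where byte 0 is the least significant. */
static uint32_t byte_at(uint32_t v, unsigned i) {
  return (v >> (8u * i)) & 0xFFu;
}

/* Pack four bytes, most significant first. */
static uint32_t pack4(uint32_t b3, uint32_t b2, uint32_t b1, uint32_t b0) {
  return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
}

/* Upper word of the shuffle: | 7 | 3 | 6 | 2 |. */
uint32_t shuf8x8hi_model(uint32_t x, uint32_t y) {
  return pack4(byte_at(x, 3), byte_at(y, 3), byte_at(x, 2), byte_at(y, 2));
}

/* Lower word of the shuffle: | 5 | 1 | 4 | 0 |. */
uint32_t shuf8x8lo_model(uint32_t x, uint32_t y) {
  return pack4(byte_at(x, 1), byte_at(y, 1), byte_at(x, 0), byte_at(y, 0));
}
```

Taken together, the two results hold the bytes of x and y interleaved pairwise, which is the usual purpose of a shuffle of this shape.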

Upper half of SIMD sort permutation on 4x32-bit values

float2 __builtin_ipu_sort4x32hi(float2 x, float2 y)

Performs SIMD sort permutation on the 4 32-bit values across x and y, and returns the upper two words of the result.

x            y             ->      Result
| 3 | 2 |    | 1 | 0 |             | 3 | 1 |

Targets the sort4x32hi instruction.

See the Tile Vertex Instruction Set Architecture for more details about the sort4x32hi instruction.

Lower half of SIMD sort permutation on 4x32-bit values

float2 __builtin_ipu_sort4x32lo(float2 x, float2 y)

Performs SIMD sort permutation on the 4 32-bit values across x and y, and returns the lower two words of the result.

x            y             ->      Result
| 3 | 2 |    | 1 | 0 |             | 2 | 0 |

Targets the sort4x32lo instruction.

See the Tile Vertex Instruction Set Architecture for more details about the sort4x32lo instruction.
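
The sort4x32hi and sort4x32lo diagrams can be modelled on the host by treating each float2 as a pair of 32-bit lanes. This is a sketch under stated assumptions: the hypothetical struct pair32 stands in for float2, lane 0 is taken to be the rightmost (lower) word in the diagrams, and the permutation is assumed to move raw 32-bit words without interpreting them.

```c
#include <stdint.h>

/* Hypothetical stand-in for float2: w[0] is the lower word, w[1] the upper. */
typedef struct {
  uint32_t w[2];
} pair32;

/* Upper two words of the sort: | 3 | 1 |, the upper lanes of x and y. */
pair32 sort4x32hi_model(pair32 x, pair32 y) {
  pair32 r;
  r.w[1] = x.w[1];
  r.w[0] = y.w[1];
  return r;
}

/* Lower two words of the sort: | 2 | 0 |, the lower lanes of x and y. */
pair32 sort4x32lo_model(pair32 x, pair32 y) {
  pair32 r;
  r.w[1] = x.w[0];
  r.w[0] = y.w[0];
  return r;
}
```

In other words, the hi variant gathers the upper lane of each input and the lo variant gathers the lower lane of each input.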

SIMD sort8 permutation on 4x8-bit values

unsigned __builtin_ipu_sort8(unsigned x)

Performs SIMD sort8 permutation on the 4 8-bit values in x.

x                     ->      Result
| 3 | 2 | 1 | 0 |             | 3 | 1 | 2 | 0 |

Targets the sort8 instruction.

See the Tile Vertex Instruction Set Architecture for more details about the sort8 instruction.

SIMD swap8 permutation on 4x8-bit values

unsigned __builtin_ipu_swap8(unsigned x)

Performs SIMD swap8 permutation on the 4 8-bit values in x.

x                     ->      Result
| 3 | 2 | 1 | 0 |             | 2 | 3 | 0 | 1 |

Targets the swap8 instruction.

See the Tile Vertex Instruction Set Architecture for more details about the swap8 instruction.
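
The sort8 and swap8 diagrams are both byte permutations within a single 32-bit word. A host-side model follows, again assuming that the leftmost byte in the diagrams is the most significant; the function names are hypothetical.

```c
#include <stdint.h>

/* sort8: | 3 | 2 | 1 | 0 | -> | 3 | 1 | 2 | 0 |.
   The outer bytes stay in place and the two middle bytes trade places. */
uint32_t sort8_model(uint32_t x) {
  return (x & 0xFF0000FFu) | ((x & 0x00FF0000u) >> 8) |
         ((x & 0x0000FF00u) << 8);
}

/* swap8: | 3 | 2 | 1 | 0 | -> | 2 | 3 | 0 | 1 |.
   The two bytes inside each 16-bit half trade places. */
uint32_t swap8_model(uint32_t x) {
  return ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
}
```

Both permutations are their own inverse, so applying either model twice returns the input.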

Conditional ternary operator

half __builtin_ipu_select_half(half condition, half a, half b)

half2 __builtin_ipu_select_half2(half2 condition, half2 a, half2 b)

half4 __builtin_ipu_select_half4(half4 condition, half4 a, half4 b)

float __builtin_ipu_select_float(float condition, float a, float b)

float2 __builtin_ipu_select_float2(float2 condition, float2 a, float2 b): Builtins that calculate condition ? a : b for float types. For the scalar variants, result will be a if condition is all 1s, b if condition is all 0s. For the vector variants, the element at an index i of the output vector will similarly depend on the ith element of condition.
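
The scalar rule above (all 1s selects a, all 0s selects b) can be modelled on the host as a bitwise blend. This is a sketch, not a statement of how the builtin is implemented: the blend is an assumption that happens to reduce to the two documented cases, and the behaviour for any other condition value is not described here. IEEE 754 binary32 floats are assumed, and all names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret between a float and its 32-bit pattern. */
uint32_t select_bits_of(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

float select_float_of(uint32_t u) {
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

/* Bitwise blend: take bits of a where the condition is 1, bits of b where
   it is 0. All-ones yields a exactly, all-zeros yields b exactly. */
float select_float_model(float condition, float a, float b) {
  uint32_t c = select_bits_of(condition);
  uint32_t r = (c & select_bits_of(a)) | (~c & select_bits_of(b));
  return select_float_of(r);
}
```

Note that an all-ones condition is a bit mask, not a meaningful number, so it has to be built from a bit pattern (as select_float_of(0xFFFFFFFFu) does in this model) rather than from a numeric literal.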

Float operations

Operations are supported on a number of floating-point number formats, for scalar and vector variables. This support is based on 754-2008 - IEEE Standard for Floating-Point Arithmetic.

For details, see the section Floating Point Unit in the Tile Vertex Instruction Set Architecture.

Absolute addition of two values

half2 __builtin_ipu_absadd(half2 x, half2 y)

half4 __builtin_ipu_absadd(half4 x, half4 y)

float __builtin_ipu_absadd(float x, float y)

float2 __builtin_ipu_absadd(float2 x, float2 y)

Sum of two absolute values.

Targets the f16v2absadd, f16v4absadd, f32v2absadd and f32absadd instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2absadd

f16v4absadd

f32v2absadd

f32absadd

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2absadd, __builtin_ipu_f16v4absadd, __builtin_ipu_f32v2absadd and __builtin_ipu_f32absadd are available without this header.

Absolute maximum of two values

half2 __builtin_ipu_absmax(half2 x, half2 y)

half4 __builtin_ipu_absmax(half4 x, half4 y)

float __builtin_ipu_absmax(float x, float y)

float2 __builtin_ipu_absmax(float2 x, float2 y)

The maximum of two absolute values.

Targets the f16v2absmax, f16v4absmax, f32v2absmax and f32absmax instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2absmax

f16v4absmax

f32v2absmax

f32absmax

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2absmax, __builtin_ipu_f16v4absmax, __builtin_ipu_f32v2absmax and __builtin_ipu_f32absmax are available without this header.

Maximum of two values

half2 __builtin_ipu_max(half2 x, half2 y)

half4 __builtin_ipu_max(half4 x, half4 y)

float __builtin_ipu_max(float x, float y)

float2 __builtin_ipu_max(float2 x, float2 y)

The maximum of two values.

Targets the f16v2max, f16v4max, f32v2max and f32max instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2max

f16v4max

f32v2max

f32max

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2max, __builtin_ipu_f16v4max, __builtin_ipu_f32v2max and __builtin_ipu_f32max are available without this header.

Lateral maximum of two values

half2 __builtin_ipu_maxc(half2 x, half2 y)

half4 __builtin_ipu_maxc(half4 x, half4 y)

float __builtin_ipu_maxc(float x, float y)

float2 __builtin_ipu_maxc(float2 x, float2 y)

The lateral maximum of two variables.

Targets the f16v2maxc and f16v4maxc instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2maxc

f16v4maxc

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2maxc and __builtin_ipu_f16v4maxc are available without this header.

Minimum of two values

half2 __builtin_ipu_min(half2 x, half2 y)

half4 __builtin_ipu_min(half4 x, half4 y)

float __builtin_ipu_min(float x, float y)

float2 __builtin_ipu_min(float2 x, float2 y)

The minimum of two variables.

Targets the f16v2min, f16v4min, f32v2min and f32min instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2min

f16v4min

f32v2min

f32min

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2min, __builtin_ipu_f16v4min, __builtin_ipu_f32v2min and __builtin_ipu_f32min are available without this header.

Min-of-maximum of two values

half2 __builtin_ipu_clamp(half2 x, half2 y)

half4 __builtin_ipu_clamp(half4 x, half2 y)

float __builtin_ipu_clamp(float x, float2 y)

float2 __builtin_ipu_clamp(float2 x, float2 y)

The min-of-maximum of each of the elements in x, compared with the two elements in y.

Targets the f16v2clamp, f16v4clamp, f32v2clamp and f32clamp instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2clamp

f16v4clamp

f32v2clamp

f32clamp

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2clamp, __builtin_ipu_f16v4clamp, __builtin_ipu_f32v2clamp and __builtin_ipu_f32clamp are available without this header.
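
As a rough illustration of "min-of-maximum", the scalar case can be modelled as min(max(x, lower), upper). The sketch below takes the two bounds as separate parameters because this section does not say which element of y holds which bound; that mapping, and the handling of NaN inputs, should be checked in the Tile Vertex Instruction Set Architecture. The name clamp_f32_model is hypothetical.

```c
/* Host-side model of a scalar clamp: first raise x to at least `lower`,
   then cap the result at `upper`. NaN handling is not modelled. */
float clamp_f32_model(float x, float lower, float upper) {
  float raised = (x > lower) ? x : lower;   /* max(x, lower) */
  return (raised < upper) ? raised : upper; /* min(raised, upper) */
}
```

For the vector variants, the same computation would apply to every element of x against the same pair of bounds.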

CMAC operation

void __builtin_ipu_cmac(half2 x, half2 y)

void __builtin_ipu_cmac(half4 x, half4 y)

Performs the CMAC operation on two values.

Targets the f16v2cmac and f16v4cmac instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmac

f16v4cmac

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmac and __builtin_ipu_f16v4cmac are available without this header.

Natural exponential

half2 __builtin_ipu_exp(half2 x)

float __builtin_ipu_exp(float x)

The natural exponential function.

Targets the f16v2exp and f32exp instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2exp

f32exp

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtin __builtin_ipu_f16v2exp is available without this header.

2-to-the-power-of

half2 __builtin_ipu_exp2(half2 x)

float __builtin_ipu_exp2(float x)

Calculates 2^x.

Targets the f16v2exp2 and f32exp2 instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2exp2

f32exp2

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtin __builtin_ipu_f16v2exp2 is available without this header.

Natural logarithm

half2 __builtin_ipu_ln(half2 x)

float __builtin_ipu_ln(float x)

The natural logarithm function.

Targets the f16v2ln and f32ln instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2ln

f32ln

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtin __builtin_ipu_f16v2ln is available without this header.

Base-2 logarithm

half2 __builtin_ipu_log2(half2 x)

float __builtin_ipu_log2(float x)

Base-2 logarithm function.

Targets the f16v2log2 and f32log2 instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2log2

f32log2

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtin __builtin_ipu_f16v2log2 is available without this header.

Probabilistic mask function

half4 __builtin_ipu_rmask(half4 x, float y)

float2 __builtin_ipu_rmask(float2 x, float y)

Returns a masked version of the first argument. See the Tile Vertex Instruction Set Architecture for more information.

Targets the f16v4rmask and f32v2rmask instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v4rmask

f32v2rmask

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v4rmask and __builtin_ipu_f32v2rmask are available without this header.

Sigmoid function

half2 __builtin_ipu_sigm(half2 x)

float __builtin_ipu_sigm(float x)

Returns the result of the sigmoid function of a value.

Targets the f16v2sigm and f32sigm instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2sigm

f32sigm

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2sigm and __builtin_ipu_f32sigm are available without this header.

Lateral sum

float __builtin_ipu_sum(half2 x)

float2 __builtin_ipu_sum(half4 x)

Returns the lateral summation of the elements in x.

Targets the f16v2sum and f16v4sum instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2sum

f16v4sum

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2sum and __builtin_ipu_f16v4sum are available without this header.

Tanh

half2 __builtin_ipu_tanh(half2 x)

float __builtin_ipu_tanh(float x)

Returns the result of the hyperbolic tangent function of x.

Targets the f16v2tanh and f32tanh instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2tanh

f32tanh

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtin __builtin_ipu_f16v2tanh is available without this header.

Vector product

void __builtin_ipu_f32v2aop(float2 x, float2 y, unsigned char z)

Calculates vector product of the first two arguments.

Targets the f32v2aop instruction.

See the Tile Vertex Instruction Set Architecture for more details about the f32v2aop instruction.

Vector sum with scalar multiplicand

float2 __builtin_ipu_f32v2axpy(float2 x, float2 y)

Calculates vector result of ax + y where a is the value of the CSR $TAS.

Targets the f32v2axpy instruction.

See the Tile Vertex Instruction Set Architecture for more details about the f32v2axpy instruction.

Get and initialise accumulators

half2 __builtin_ipu_gina(half2 x, unsigned int y)

float2 __builtin_ipu_gina(float2 x, unsigned int y)

Get and initialise accumulators.

Targets the f16v2gina and f32v2gina instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2gina

f32v2gina

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2gina and __builtin_ipu_f32v2gina are available without this header.

Float comparisons

A number of comparison instructions are provided.

For details, see the section Comparisons in the Tile Vertex Instruction Set Architecture.

Equality test

half2 __builtin_ipu_cmpeq(half2 x, half2 y)

half4 __builtin_ipu_cmpeq(half4 x, half4 y)

float __builtin_ipu_cmpeq(float x, float y)

float2 __builtin_ipu_cmpeq(float2 x, float2 y)

Element-wise equality comparison of two arguments.

Targets the f16v2cmpeq, f16v4cmpeq, f32cmpeq and f32v2cmpeq instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmpeq

f16v4cmpeq

f32cmpeq

f32v2cmpeq

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmpeq, __builtin_ipu_f16v4cmpeq, __builtin_ipu_f32cmpeq and __builtin_ipu_f32v2cmpeq are available without this header.

Greater-than-or-equal-to test

half2 __builtin_ipu_cmpge(half2 x, half2 y)

half4 __builtin_ipu_cmpge(half4 x, half4 y)

float __builtin_ipu_cmpge(float x, float y)

float2 __builtin_ipu_cmpge(float2 x, float2 y)

Element-wise greater-than-or-equal-to test of two arguments.

Targets the f16v2cmpge, f16v4cmpge, f32cmpge and f32v2cmpge instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmpge

f16v4cmpge

f32cmpge

f32v2cmpge

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmpge, __builtin_ipu_f16v4cmpge, __builtin_ipu_f32cmpge and __builtin_ipu_f32v2cmpge are available without this header.

Greater-than test

half2 __builtin_ipu_cmpgt(half2 x, half2 y)

half4 __builtin_ipu_cmpgt(half4 x, half4 y)

float __builtin_ipu_cmpgt(float x, float y)

float2 __builtin_ipu_cmpgt(float2 x, float2 y)

Element-wise greater-than test of two arguments.

Targets the f16v2cmpgt, f16v4cmpgt, f32cmpgt and f32v2cmpgt instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmpgt

f16v4cmpgt

f32cmpgt

f32v2cmpgt

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmpgt, __builtin_ipu_f16v4cmpgt, __builtin_ipu_f32cmpgt and __builtin_ipu_f32v2cmpgt are available without this header.

Less-than-or-equal-to test

half2 __builtin_ipu_cmple(half2 x, half2 y)

half4 __builtin_ipu_cmple(half4 x, half4 y)

float __builtin_ipu_cmple(float x, float y)

float2 __builtin_ipu_cmple(float2 x, float2 y)

Element-wise less-than-or-equal-to test of two arguments.

Targets the f16v2cmple, f16v4cmple, f32cmple and f32v2cmple instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmple

f16v4cmple

f32cmple

f32v2cmple

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmple, __builtin_ipu_f16v4cmple, __builtin_ipu_f32cmple and __builtin_ipu_f32v2cmple are available without this header.

Less-than test

half2 __builtin_ipu_cmplt(half2 x, half2 y)

half4 __builtin_ipu_cmplt(half4 x, half4 y)

float __builtin_ipu_cmplt(float x, float y)

float2 __builtin_ipu_cmplt(float2 x, float2 y)

Element-wise less-than test of two arguments.

Targets the f16v2cmplt, f16v4cmplt, f32cmplt and f32v2cmplt instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmplt

f16v4cmplt

f32cmplt

f32v2cmplt

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmplt, __builtin_ipu_f16v4cmplt, __builtin_ipu_f32cmplt and __builtin_ipu_f32v2cmplt are available without this header.

Inequality test

half2 __builtin_ipu_cmpne(half2 x, half2 y)

half4 __builtin_ipu_cmpne(half4 x, half4 y)

float __builtin_ipu_cmpne(float x, float y)

float2 __builtin_ipu_cmpne(float2 x, float2 y)

Element-wise inequality test of two arguments.

Targets the f16v2cmpne, f16v4cmpne, f32cmpne and f32v2cmpne instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2cmpne

f16v4cmpne

f32cmpne

f32v2cmpne

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2cmpne, __builtin_ipu_f16v4cmpne, __builtin_ipu_f32cmpne and __builtin_ipu_f32v2cmpne are available without this header.

Float classification

Classify float

short2 __builtin_ipu_class(half2 num)

short4 __builtin_ipu_class(half4 num)

int __builtin_ipu_class(float num)

short2 __builtin_ipu_class(float2 num)

Floating-point number classifier.

The result will be one of the float class identifiers.

Targets the f16v2class, f16v4class, f32class and f32v2class instructions.

See the Tile Vertex Instruction Set Architecture for more details about these instructions:

f16v2class

f16v4class

f32class

f32v2class

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_f16v2class, __builtin_ipu_f16v4class, __builtin_ipu_f32class and __builtin_ipu_f32v2class are available without this header.

Check whether floating-point value is finite

int __builtin_ipu_isfinite(float val)

short2 __builtin_ipu_isfinite(half2 val)

int2 __builtin_ipu_isfinite(float2 val)

short4 __builtin_ipu_isfinite(half4 val): Check whether a floating-point value, whether scalar or vector, is finite and return the boolean result value as an integer type of same shape and size as the input parameter. This builtin expands to a sequence of instructions with vector floating-point values handled by vector code.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_isfinite_f32, __builtin_ipu_isfinite_v2f16, __builtin_ipu_isfinite_v2f32 and __builtin_ipu_isfinite_v4f16 are available without this header.

Check whether floating-point value is infinite

int __builtin_ipu_isinf(float val)

short2 __builtin_ipu_isinf(half2 val)

int2 __builtin_ipu_isinf(float2 val)

short4 __builtin_ipu_isinf(half4 val): Check whether a floating-point value, whether scalar or vector, is -inf or +inf and return the boolean result value as an integer type of same shape and size as the input parameter. This builtin expands to a sequence of instructions with vector floating-point values handled by vector code.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_isinf_f32, __builtin_ipu_isinf_v2f16, __builtin_ipu_isinf_v2f32 and __builtin_ipu_isinf_v4f16 are available without this header.

Check whether floating-point value is NaN

int __builtin_ipu_isnan(float val)

short2 __builtin_ipu_isnan(half2 val)

int2 __builtin_ipu_isnan(float2 val)

short4 __builtin_ipu_isnan(half4 val): Check whether a floating-point value, whether scalar or vector, is not a number (NaN) and return the boolean result value in an integer type of same shape and size as the input parameter. This builtin expands to a sequence of instructions with vector floating-point values handled by vector code.

Note

The function prototypes shown above are the overloaded aliases that can be used by including <ipu_builtins.h>. The pure IPU builtins __builtin_ipu_isnan_f32, __builtin_ipu_isnan_v2f16, __builtin_ipu_isnan_v2f32 and __builtin_ipu_isnan_v4f16 are available without this header.
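
For reference when testing off-device, the three scalar checks can be modelled from the IEEE 754 binary32 bit layout (8 exponent bits, 23 fraction bits). This is a host-side sketch with hypothetical names. It returns 1 for true and 0 for false, which is an assumption: this section does not specify how the builtins encode a true result.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret between a float and its 32-bit pattern. */
uint32_t class_bits_of(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

float class_float_of(uint32_t u) {
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

/* Finite: the exponent field is not all ones. */
int isfinite_f32_model(float v) {
  return (class_bits_of(v) & 0x7F800000u) != 0x7F800000u;
}

/* Infinite: exponent all ones and fraction zero, either sign. */
int isinf_f32_model(float v) {
  return (class_bits_of(v) & 0x7FFFFFFFu) == 0x7F800000u;
}

/* NaN: exponent all ones and fraction non-zero. */
int isnan_f32_model(float v) {
  uint32_t u = class_bits_of(v);
  return (u & 0x7F800000u) == 0x7F800000u && (u & 0x007FFFFFu) != 0u;
}
```

Every binary32 value satisfies exactly one of the three predicates, which is a convenient property to assert in tests.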

Random number generation

The IPU hardware includes a pseudorandom number generator (PRNG) and allows for the generation of random values sampled from both the discrete uniform distribution and a quantized 12th degree Irwin-Hall distribution (an approximation to the Normal distribution). The PRNG algorithm used is described in "A Fast Hardware Pseudorandom Number Generator Based on xoroshiro128" (https://ieeexplore.ieee.org/document/9875973).

The period of the IPU PRNG, which is the length of the unique sequence produced, 2 ¹²⁸-1.

For more detail, see the section Pseudorandom number generator in the Tile Vertex Instruction Set Architecture.
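To illustrate why a 12th degree Irwin-Hall distribution approximates the Normal distribution, the following host-side sketch sums 12 uniform samples from [0, 1) and subtracts 6, which gives a value with mean 0 and variance 1. This is a conceptual model only: xorshift32 is a stand-in uniform source rather than the IPU PRNG, all function names are hypothetical, and the result lies in [-6, 6] instead of matching the quantized hardware output.

```c
#include <stdint.h>

/* Stand-in uniform source (xorshift32). This is NOT the IPU PRNG.
   The state must be non-zero. */
static uint32_t xorshift32(uint32_t *state) {
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  *state = x;
  return x;
}

/* Uniform float in [0, 1) built from the top 24 bits. */
static float uniform01(uint32_t *state) {
  return (float)(xorshift32(state) >> 8) * (1.0f / 16777216.0f);
}

/* Sum of 12 uniforms has mean 6 and variance 12 * (1/12) = 1,
   so subtracting 6 centres the result on 0. */
static float irwin_hall12(uint32_t *state) {
  float sum = 0.0f;
  for (int i = 0; i < 12; ++i)
    sum += uniform01(state);
  return sum - 6.0f;
}

/* Mean of n samples drawn from a generator seeded with seed. */
static double sample_mean(uint32_t seed, int n) {
  uint32_t state = seed;
  double total = 0.0;
  for (int i = 0; i < n; ++i)
    total += irwin_hall12(&state);
  return total / n;
}
```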

Generate half2 vector using Gaussian distribution

half2 __builtin_ipu_f16v2grand()

Generate a two-element, half-precision random vector sampled from a Gaussian distribution, in the range [-5 $\frac{13}{16}$, 5 $\frac{13}{16}$].

Targets the f16v2grand instruction.

See the Tile Vertex Instruction Set Architecture for more details about the f16v2grand instruction.

Generate float2 vector using Gaussian distribution

float2 __builtin_ipu_f32v2grand()

Generate a two-element, single-precision random vector sampled from a Gaussian distribution, in the range [-5 $\frac{13}{16}$, 5 $\frac{13}{16}$].

Targets the f32v2grand instruction.

See the Tile Vertex Instruction Set Architecture for more details about the f32v2grand instruction.

Generate random 32-bit integer

unsigned __builtin_ipu_urand32()

Generate a 32-bit random integer sampled from a uniform distribution, in the range [0, 2³²-1].

Targets the urand32 instruction.

See the Tile Vertex Instruction Set Architecture for more details about the urand32 instruction.
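A common follow-up step is mapping the full-range 32-bit value onto a smaller range [0, n). The sketch below does this with a widening multiply and shift. The helper names scale_to_range and random_index are hypothetical, and the wrapper is compiled only when the __IPU__ macro (assumed to be defined by the IPU compiler) is present. The mapping is slightly biased when n does not divide 2³², which may or may not matter for a given use.

```c
#include <stdint.h>

/* Hypothetical helper: map r, uniform over [0, 2^32 - 1], onto [0, n)
   by computing floor(r * n / 2^32). Returns 0 when n is 0. */
static uint32_t scale_to_range(uint32_t r, uint32_t n) {
  return (uint32_t)(((uint64_t)r * n) >> 32);
}

#ifdef __IPU__
/* Hypothetical wrapper: random index in [0, n) from the hardware PRNG. */
static uint32_t random_index(uint32_t n) {
  return scale_to_range(__builtin_ipu_urand32(), n);
}
#endif
```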

Generate random 64-bit integer

unsigned long long __builtin_ipu_urand64()

Generate a 64-bit random integer sampled from a uniform distribution, in the range [0, 2⁶⁴-1].

Targets the urand64 instruction.

See the Tile Vertex Instruction Set Architecture for more details about the urand64 instruction.

Generate random 16-bit float

half __builtin_ipu_urand_f16(): Generate a 16-bit random float (half) sampled from a uniform distribution, in the range [-0.5, 0.5].

Generate random 32-bit float

float __builtin_ipu_urand_f32(): Generate a 32-bit random float sampled from a uniform distribution, in the range [-0.5, 0.5].
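Because the uniform float builtins return values centred on zero, code that needs a different interval has to shift and scale the result. The following is a minimal sketch of that step. The helper names rescale_centered and random_in are hypothetical, and the wrapper is compiled only when the __IPU__ macro (assumed to be defined by the IPU compiler) is present.

```c
/* Hypothetical helper: map u from [-0.5, 0.5] onto [lo, hi].
   Adding 0.5 moves u into [0, 1] before scaling. */
static float rescale_centered(float u, float lo, float hi) {
  return lo + (u + 0.5f) * (hi - lo);
}

#ifdef __IPU__
/* Hypothetical wrapper: uniform random float in [lo, hi]. */
static float random_in(float lo, float hi) {
  return rescale_centered(__builtin_ipu_urand_f32(), lo, hi);
}
#endif
```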
