4. Writing vertices in assembly

This chapter introduces the general concepts required for writing vertices in assembly code.

4.1. Notation

Table 4.1 Notation used in this document

  • A value written using hexadecimal notation

  • A value written using binary notation

  • A value written using decimal notation

  • The value of individual bit i of value

  • The value of bits i through j of value

  • Refers to tile architectural state, such as a register

4.2. Instruction set overview

The instruction set architecture (ISA) of the tile processor, including the execution pipeline and registers, is described in detail in the ISA reference manual, available on request.

This section introduces some of the concepts that are referred to in later chapters. For a high-level introduction to the IPU, refer to the IPU Programmer’s Guide.

The tile is a highly deterministic, asymmetric, dual-pipeline, long instruction word (LIW) processor. It supports multiple hardware-resident execution contexts. These contexts are time multiplexed onto shared hardware resources to achieve high utilisation by hiding local instruction latencies, including memory access and branch latencies.

Each tile includes tightly-coupled local memory which is used to store all code and data required by the tile.

4.2.1. Supervisors and workers

An IPU tile has two types of hardware execution contexts: a supervisor context and six worker contexts. There are six execution slots that can run these contexts. A round-robin schedule is used to time multiplex the execution slots (and therefore active contexts) onto the shared hardware resources.

Initially, there is only a single supervisor thread, which runs in all execution slots. Each running worker thread occupies a single execution slot. When six workers are running, the supervisor is suspended until the termination of a worker makes an execution slot available.
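The slot sharing described above can be modelled with a minimal sketch. The names `slotOccupancy` and `supervisorSlots` are hypothetical, and the assumption that workers fill slots from index 0 upwards is made only for illustration; the actual slot assignment is not specified here.

```cpp
#include <array>
#include <cassert>

// Number of execution slots on a tile, as stated in the text.
constexpr int kNumSlots = 6;

enum class Context { Supervisor, Worker };

// Returns the kind of context occupying each slot when `runningWorkers`
// workers are active. Assumption: workers fill slots from index 0 upwards.
inline std::array<Context, kNumSlots> slotOccupancy(int runningWorkers) {
  assert(runningWorkers >= 0 && runningWorkers <= kNumSlots);
  std::array<Context, kNumSlots> slots{};
  for (int i = 0; i < kNumSlots; ++i)
    slots[i] = (i < runningWorkers) ? Context::Worker : Context::Supervisor;
  return slots;
}

// Number of slots per round-robin rotation still running the supervisor.
inline int supervisorSlots(int runningWorkers) {
  int count = 0;
  for (Context c : slotOccupancy(runningWorkers))
    if (c == Context::Supervisor) ++count;
  return count;
}
```

With no workers running, `supervisorSlots(0)` is 6; with all six workers running it is 0, which corresponds to the supervisor being suspended.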

The supervisor can only perform certain operations. For example, it cannot execute floating-point instructions.

Supervisor code is used for overall control of execution, synchronisation and exchanges, but all floating-point processing must be done in a worker context.

Workers can execute instructions individually or in parallel with another instruction as part of an execution bundle.

4.2.2. Execution pipelines

The tile has a pair of asymmetric execution pipelines, main and aux:

  • main is designed primarily to perform control flow, address manipulation, integer arithmetic and load/store operations

  • aux is designed primarily to perform floating-point based compute

A supervisor thread cannot use the aux pipeline and its associated state.

Each pipeline has an associated register file. The main execution pipeline is associated (and tightly coupled) with the main register file (MRF) and the aux pipeline with the auxiliary register file (ARF).

These register files, as well as control and status registers and some internal state, are replicated for each context.

For full details of all the registers, see the ISA reference manual.

4.3. Memory architecture

The architectural size of the tile memory is limited to 21 address bits (2 MB). The tile memory is the only memory directly accessible by tile instructions. It contains both the code and data used by that tile. There is no shared memory access between tiles.

The tile uses a contiguous unsigned 21-bit address space, beginning at address 0x00000. Every context, both worker and supervisor, has visibility of the entire address space. In practice, only a part of this memory space is populated with memory. The physical memory has a non-zero start address as a simple way to prevent invalid zero-valued addresses from being accessed. Attempting to access an unpopulated memory address will cause an exception.

The memory is organised as two regions, each made up of a number of 64-bit wide banks. Concurrent accesses can be made to addresses in different banks. This allows, for example, a 64-bit instruction fetch and two 64-bit data accesses to occur simultaneously (one may be a write).

Accesses to the banks in region 1 are interleaved, with bit 3 of the address selecting 64-bit words from alternating odd and even banks. A pair of banks containing interleaved addresses form a single 128-bit wide memory element.

Fig. 4.1 shows the layout of non-interleaved and interleaved memory on Mk1. The organisation on the Mk2 IPU is similar, but the number and addressing of the memory banks are different. See Mk2 Colossus (GC200).

Fig. 4.1 Interleaved and non-interleaved memory regions on Mk1 Colossus

Interleaving allows for two 64-bit aligned addresses, a and a+8, to be accessed simultaneously. This enables, for example, a 128-bit load and a simultaneous 64-bit load or store. Such simultaneous accesses would cause a memory clash exception in the non-interleaved memory region (unless the two addresses happened to straddle the boundary between two banks).
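As a rough illustration of the interleaving rule, the sketch below extracts bit 3 of an address and checks whether two 64-bit aligned addresses land in banks of different parity. The helper names are hypothetical. Different parity is only a sufficient condition for the two words being in different banks; the full bank selection depends on other address bits that are not modelled here.

```cpp
#include <cassert>
#include <cstdint>

// Bit 3 of an interleaved (region 1) address selects the even (0) or
// odd (1) bank of a pair.
inline unsigned bankParity(std::uint32_t addr) { return (addr >> 3) & 0x1u; }

// True if two 64-bit aligned addresses select banks of different parity,
// which guarantees they are in different banks.
inline bool inAlternateBanks(std::uint32_t a, std::uint32_t b) {
  assert(a % 8 == 0 && b % 8 == 0);  // both must be 64-bit aligned
  return bankParity(a) != bankParity(b);
}
```

For any 64-bit aligned address `a`, `inAlternateBanks(a, a + 8)` is true, because adding 8 always flips bit 3.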

Instructions can only be fetched from region 0. Attempting to execute code from interleaved memory will cause an exception.

4.3.1. Getting information about the memory

The Poplar API provides details of the hardware that the software is actually executing on.

  • Target::getBytesPerTile() returns the size of tile memory in bytes.

  • getMemoryElementOffsets() returns an array containing the offsets, from the start of memory, of each of the elements (not banks). For the Mk1 Colossus, for example, this will return 12 values, corresponding to the eight 16 KB elements and the four 32 KB elements.

  • getInterleavedMemoryElementIndex() returns the index of the first element in interleaved memory (for Mk1 Colossus, for example, this will return 8).

Details of these and other related functions can be found in the Poplar and PopLibs API Reference.
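The sketch below shows one way the values returned by these functions might be combined to decide whether a byte offset lies in interleaved memory. It does not call the Poplar API: `elementIndexFor`, `isInterleaved` and `exampleMk1Offsets` are hypothetical helpers, and the offsets in `exampleMk1Offsets` are an assumption derived from the element sizes quoted above (eight 16 KB elements followed by four 32 KB elements), not captured output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the memory element containing byte `offset`, given the element
// start offsets in ascending order.
inline std::size_t elementIndexFor(const std::vector<unsigned>& elementOffsets,
                                   unsigned offset) {
  assert(!elementOffsets.empty() && offset >= elementOffsets.front());
  std::size_t index = 0;
  while (index + 1 < elementOffsets.size() &&
         elementOffsets[index + 1] <= offset)
    ++index;
  return index;
}

// True if `offset` falls in an element at or after the first interleaved one.
inline bool isInterleaved(const std::vector<unsigned>& elementOffsets,
                          std::size_t firstInterleavedIndex, unsigned offset) {
  return elementIndexFor(elementOffsets, offset) >= firstInterleavedIndex;
}

// Assumed Mk1-like layout: eight 16 KB elements, then four 32 KB elements.
inline std::vector<unsigned> exampleMk1Offsets() {
  std::vector<unsigned> offsets;
  unsigned next = 0;
  for (int i = 0; i < 8; ++i) { offsets.push_back(next); next += 16 * 1024; }
  for (int i = 0; i < 4; ++i) { offsets.push_back(next); next += 32 * 1024; }
  return offsets;
}
```

In real code, the vector and the index would come from getMemoryElementOffsets() and getInterleavedMemoryElementIndex() rather than from the example layout.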

4.3.2. Mk1 Colossus (GC2)

In the Mk1 Colossus, each tile has 256 kilobytes of SRAM, made up of two regions each of 128 KB, as shown in Fig. 4.2. This means that an IPU with 1,216 tiles has about 300 MB of memory in total.

The available memory starts at address 0x40000 and ends at 0x7FFFF.
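The figures above can be cross-checked with a little arithmetic. This is a minimal sketch; the constant and function names are hypothetical and only the numbers come from the text.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kAddressBits = 21;
constexpr std::uint32_t kAddressSpaceBytes = 1u << kAddressBits;
constexpr std::uint32_t kMk1FirstAddr = 0x40000;
constexpr std::uint32_t kMk1LastAddr = 0x7FFFF;
constexpr std::uint32_t kMk1BytesPerTile = kMk1LastAddr - kMk1FirstAddr + 1;

static_assert(kAddressSpaceBytes == 2u * 1024 * 1024, "21 bits address 2 MB");
static_assert(kMk1BytesPerTile == 256u * 1024, "256 KB populated per tile");

// True if `addr` lies in the populated part of the Mk1 address space.
inline bool isPopulatedMk1(std::uint32_t addr) {
  return addr >= kMk1FirstAddr && addr <= kMk1LastAddr;
}

// Byte offset of a populated address from the start of physical memory.
inline std::uint32_t mk1OffsetInTile(std::uint32_t addr) {
  assert(isPopulatedMk1(addr));
  return addr - kMk1FirstAddr;
}

// Total tile memory, in bytes, for an IPU with `tiles` tiles.
inline std::uint64_t mk1TotalBytes(unsigned tiles) {
  return std::uint64_t{tiles} * kMk1BytesPerTile;
}
```

For 1,216 tiles, `mk1TotalBytes(1216)` is exactly 304 × 1024 × 1024 bytes, consistent with the "about 300 MB" figure.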

Table 4.2 Memory organization for Mk1

Region    Size      Banks (16 KB)    Elements (size)
0         128 KB    8                8 (16 KB)
1         128 KB    8                4 (32 KB)

Region 0 is selected when bit 17 of the address is 0, and is addressed with bits [16:3]. Bits [16:14] select the bank, or memory element, and bits [13:3] select a 64-bit word from that bank.
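The bit fields described above can be extracted with shifts and masks. The following is a sketch only; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Fields of a Mk1 region 0 address, following the bit ranges given above.
struct Mk1Region0Fields {
  unsigned bank;  // bits [16:14]: one of the eight 16 KB banks
  unsigned word;  // bits [13:3]: 64-bit word within the bank
};

// Bit 17 selects the region: 0 for region 0, 1 for region 1.
inline unsigned mk1Region(std::uint32_t addr) { return (addr >> 17) & 0x1u; }

// Splits a region 0 address into its bank and word indices.
inline Mk1Region0Fields decodeMk1Region0(std::uint32_t addr) {
  assert(mk1Region(addr) == 0);
  return Mk1Region0Fields{(addr >> 14) & 0x7u, (addr >> 3) & 0x7FFu};
}
```

For example, `decodeMk1Region0(0x44008)` gives bank 1, word 1, and the last 64-bit word of region 0, at 0x5FFF8, decodes to bank 7, word 2047.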

Fig. 4.2 Memory architecture for Mk1 Colossus

4.3.3. Mk2 Colossus (GC200)

In the Mk2 Colossus, each tile has 624 kilobytes of SRAM (see Fig. 4.3). This means that an IPU with 1,472 tiles has just under 900 MB of memory in total.

The available memory starts at address 0x4C000 and ends at 0xE7FFF.

Table 4.3 Memory organization for Mk2

Region    Size      Banks (16 KB)    Elements (size)
0         208 KB    13               13 (16 KB)
1         416 KB    26               13 (32 KB)

Region 0 is selected when bit 19 of the address is 0, and is addressed with bits [18:3]. Bits [18:14] select the bank, or memory element, and bits [13:3] select a 64-bit word from that bank.
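The region sizes on Mk2 follow from the populated address range and the region-select bit. The sketch below derives them and extracts the raw bank field; the names are hypothetical, and how the raw value of bits [18:14] maps onto a bank number is not stated here, so the helper returns the field unmodified.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kMk2FirstAddr = 0x4C000;
constexpr std::uint32_t kMk2LastAddr = 0xE7FFF;
constexpr std::uint32_t kMk2Region1Start = 1u << 19;  // first address with bit 19 set

constexpr std::uint32_t kMk2Region0Bytes = kMk2Region1Start - kMk2FirstAddr;
constexpr std::uint32_t kMk2Region1Bytes = kMk2LastAddr + 1 - kMk2Region1Start;

static_assert(kMk2Region0Bytes == 208u * 1024, "region 0: 13 x 16 KB");
static_assert(kMk2Region1Bytes == 416u * 1024, "region 1: 13 x 32 KB");
static_assert(kMk2Region0Bytes + kMk2Region1Bytes == 624u * 1024,
              "624 KB per tile");

// Bit 19 selects the region: 0 for region 0, 1 for region 1.
inline unsigned mk2Region(std::uint32_t addr) { return (addr >> 19) & 0x1u; }

// Raw value of bits [18:14] for a populated region 0 address.
inline unsigned mk2Region0BankField(std::uint32_t addr) {
  assert(addr >= kMk2FirstAddr && mk2Region(addr) == 0);
  return (addr >> 14) & 0x1Fu;
}
```

Across the populated part of region 0 the raw field runs from 19 (at 0x4C000) to 31 (at 0x7FFFF), which is 13 distinct values, one per 16 KB bank.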

Fig. 4.3 Memory architecture for Mk2 Colossus

4.3.4. Load and store instructions

There are load instructions for the following data sizes:

  • 8 bit

  • 16 bit

  • 32 bit

  • 64 bit

  • 128 bit (only from region 1)

There are store instructions for the following data sizes:

  • 32 bit

  • 64 bit

There are instructions that can perform multiple simultaneous loads, as well as instructions that do a simultaneous load and store.

If you try to make more than one access to a memory bank in one cycle, you will get a memory conflict. Interleaved memory places sequential 64-bit words in alternating banks, allowing you to use more efficient load-store instructions such as ld128 and ld2xst64pace. See Vertex pipelines for an example of their use.

All loads (including instruction fetches) and stores must be naturally aligned. Misaligned accesses will result in an exception.
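The size and alignment rules in this section can be summarised as a simple check. This is a sketch, not a model of the hardware: the names are hypothetical, and it covers only the access widths, the region 1 restriction on 128-bit loads and natural alignment.

```cpp
#include <cassert>
#include <cstdint>

enum class Access { Load, Store };

// True if `bits` is one of the access widths listed in this section.
inline bool isSupportedWidth(Access kind, unsigned bits) {
  return kind == Access::Store
             ? (bits == 32 || bits == 64)
             : (bits == 8 || bits == 16 || bits == 32 || bits == 64 ||
                bits == 128);
}

// Natural alignment: the address is a multiple of the access size in bytes.
inline bool isNaturallyAligned(std::uint32_t addr, unsigned bits) {
  assert(bits >= 8 && bits % 8 == 0);
  return addr % (bits / 8) == 0;
}

// Combines the width, region and alignment rules. `inRegion1` says whether
// the address is in the interleaved region.
inline bool isValidAccess(Access kind, unsigned bits, std::uint32_t addr,
                          bool inRegion1) {
  return isSupportedWidth(kind, bits) && (bits != 128 || inRegion1) &&
         isNaturallyAligned(addr, bits);
}
```

For example, a 64-bit load from 0x40004 is rejected because the address is not a multiple of 8, and a 128-bit load is rejected anywhere outside region 1.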