A Dictionary of Graphcore Terminology

AI-Float

AI-Float consists of three core technologies:

  • Industry-standard 16-bit and 32-bit IEEE floating-point arithmetic

  • Hardware support for stochastic rounding (illustrated in the sketch after this list)

  • A configurable, dot-product AI-Float arithmetic block in the tile
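
As a general illustration of stochastic rounding (not of the IPU's half-precision hardware implementation), the following Python sketch rounds a value up or down with probability proportional to its distance from the two nearest points of a fixed grid, so the rounding is unbiased on average:

import random

def stochastic_round(value, step=1.0):
    # Round 'value' to the grid of multiples of 'step'. The probability of
    # rounding up equals the fractional distance from the lower grid point.
    lower = (value // step) * step
    frac = (value - lower) / step
    return lower + step if random.random() < frac else lower

# 0.3 rounds up to 1.0 about 30% of the time and down to 0.0 otherwise,
# so the mean of many rounded samples stays close to 0.3.
samples = [stochastic_round(0.3) for _ in range(10000)]
print(sum(samples) / len(samples))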

Batch serialisation

A batch of samples is normally processed in parallel. With batch serialisation, the batch is divided into sub-batches (based on a batch serialisation factor) and only one sub-batch at a time is processed in parallel; the sequence of sub-batches is processed serially.
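
A minimal Python sketch of the idea (process, batch and serialisation_factor are illustrative names, not part of any Graphcore API):

def process(sub_batch):
    # Stands in for the parallel computation applied to one sub-batch.
    return [x * 2 for x in sub_batch]

def batch_serialised(batch, serialisation_factor):
    # Split the batch into sub-batches and process them one after another
    # instead of processing the whole batch in parallel at once.
    size = len(batch) // serialisation_factor
    results = []
    for i in range(serialisation_factor):
        results.extend(process(batch[i * size:(i + 1) * size]))
    return results

print(batch_serialised(list(range(8)), serialisation_factor=4))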

BSP
Bulk-synchronous parallel

A programming methodology for parallel algorithms which is used on the IPU. Execution for the IPU consists of supersteps, each made up of three phases: synchronization, communication and local compute.

Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (August 1990), 103-111. DOI=10.1145/79173.79181
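
The following toy Python sketch shows the shape of a BSP program: the three phases repeat in order, and local compute only operates on data delivered by the preceding exchange. It illustrates the model only and is not IPU code:

class TileProgram:
    # Toy stand-in for a graph program running on a set of tiles.
    def __init__(self, tiles):
        self.tiles = tiles          # local state, one value per tile
    def sync(self):
        pass                        # all tiles reach the barrier
    def exchange(self):
        # pass each tile's data to a neighbour (toy communication)
        self.tiles = self.tiles[-1:] + self.tiles[:-1]
    def compute(self):
        # each tile works on its own local data
        self.tiles = [x + 1 for x in self.tiles]

prog = TileProgram([0, 10, 20, 30])
for _ in range(3):                  # three supersteps
    prog.sync()
    prog.exchange()
    prog.compute()
print(prog.tiles)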

C2

Graphcore’s dual IPU PCIe card with two GC2 Colossus IPUs. Provides 250 teraFLOPS of mixed-precision compute, with 192 GB/s IPU-Link bandwidth between the two IPUs and 128 GB/s card-to-card IPU-Links. Maximum power consumption is 300 W.

Cluster

A logical grouping of IPUs.

Codelet

A piece of code that defines the inputs, outputs and internal state of a vertex. Contains a compute() function that defines the behaviour of the vertex.

Colossus

The current architecture of the Graphcore IPU. It consists of an array of thousands of IPU-Tiles with In-Processor-Memory and IPU-Links for chip-to-chip communication. It is designed for parallel processing using the BSP model.

Colossus is available as the GC2 and GC200. It is named in honour of the Colossus computer used for code breaking at Bletchley Park.

Compute batch

The number of samples for which activations/gradients are computed in parallel.

Compute set

A set of vertices that are executed in parallel during the BSP compute phase.

Direct attach

One or more IPU-Machines can be used in “direct attach” mode where the IPU-Machines are directly controlled from the user’s computer. The IPU-specific part of the V-IPU software runs on an IPU-Machine, rather than on a separate server.

Dynamic graph

A dynamic graph can be a graph where the shapes of tensors within the graph are determined at runtime (see Dynamic shape) or a graph where execution is via dynamic dispatch (see Eager execution).

Dynamic shape

The input to a graph can have variable length or shape. For models such as BERT, the maximum sequence size is known but the actual input is dynamic within that range.

Eager execution

Dynamic graph execution (or dispatch) where each operation is individually compiled, dispatched and executed when required.

Edge

The edges of a computation graph define the connections between elements of tensors and the vertices of the graph.

Exchange

Communication phase of a superstep, where data is communicated between tiles and between IPUs. An exchange can be internal to an IPU (see IPU-Exchange) or external (see External exchange).

Exchange fabric

The communication network used to transfer data between tiles within an IPU.

Exchange Memory

All memory that can be used for transferring data to and from IPUs. This can be In-Processor-Memory, Streaming Memory or Host memory.

External exchange

An exchange phase where data is communicated between tiles and memory outside the IPU. This can be Host exchange, Inter-IPU exchange, or a transfer to or from Streaming Memory.

See also IPU-Fabric.

GC2

First generation of the Colossus IPU with 1,216 tiles, each with 256 KB of In-Processor-Memory.

GC200

Second generation Colossus IPU with 1,472 tiles, each with 624 KB of In-Processor-Memory and micro-architectural improvements to increase performance and reduce power consumption.

GCD
Graph compile domain

A subset of IPUs within a GSD which are controlled by a single Poplar instance. “GCD size” is the number of IPUs in the GCD. When a program is executed, the Poplar instance binaries may be replicated and loaded on to multiple GCDs (one Poplar instance per GCD).

GCL
Graphcore Communication Library

A software library for managing communication and synchronization between IPUs, supporting ML at scale. GCL-based IPU communication and synchronization can be established across any IPU-Fabric, supporting IPU-POD topologies such as the Mesh configuration and Torus configuration.

Global batch

The number of samples that contribute to a weight update across all replicas.

Global exchange

See Inter-IPU exchange.

Graph Analyser

See PopVision.

Graph compiler

See Poplar graph compiler.

Graph engine

See Poplar graph engine.

Graph streaming

A set of techniques that allows IPUs to make efficient use of Exchange Memory. This includes the intelligent placement of variables and weights, and the use of Streaming Memory for scatter/gather and reductions.

GSD
Graph scaleout domain

The set of IPUs used to execute a program, consisting of one or more GCDs. All the GCDs together form the GSD, with GCL managing all the communication and synchronization between the separate IPUs and GCDs. The “GSD size” is the number of IPUs in the GSD.

GSD is equivalent to the whole partition and GSD size is the partition size.

See also vPOD.

GW-Link

A networking interface implemented by an IPU gateway that can provide IPU-IPU connectivity through another IPU gateway, either via a directly connected link or via a switching infrastructure.

A set of IPU-Machines in an IPU-POD that can communicate only via GW-Links.

Half
Half-precision

A 16-bit floating-point value.

Head node

The Poplar host server (see Poplar server).

Host exchange

Communication between an IPU and the server running the host-side part of the Poplar program.

Host memory

Memory on the host server that can be accessed by the IPU (see Exchange Memory).

Host-Link

The communication path between the host computer and the IPUs. This may be a direct connection, such as PCIe, or a high-speed network, such as 100 Gigabit Ethernet (100 GbE).

ILD

An IPU-Link Domain (ILD) is a set of IPUs that are connected with IPU-Links. The IPUs have to be within a single IPU-POD. The maximum size of a single ILD is 64 IPUs. Multiple ILDs are used to form multi-ILD clusters. The term “multi-ILD partition” can also be used, meaning a partition that spans multiple ILDs and uses GW-Links.

There must be at least one GCD per ILD.

In-Processor-Memory

The tile memory. This memory can be directly accessed by worker threads during the compute phase of a program.

Inter-IPU exchange

Communication between tiles on different IPUs.

IPU
Intelligence Processing Unit

An Intelligence Processing Unit (IPU) is a massively parallel accelerator pioneered by Graphcore for machine learning (ML) and artificial intelligence applications.

Graphcore’s current implementation of the IPU is Colossus.

IPU gateway

The IPU gateway manages communication on and off the IPU-Machine board via the IPU-Links that connect IPU-Machines. It also manages transfers between the IPUs and local Streaming Memory on the IPU-Machine.

IPU-Core

The tile’s processing unit.

IPU-Exchange

Communication on the Exchange fabric internal to the IPU.

IPU-Fabric

The communication network used to transfer data between tiles in an IPU, and between IPUs in a system. The IPU-Fabric is made up of IPU-Links, GW-Links, Sync-Links and Host-Links.

IPU-Link

Communication links between IPUs.

IPU-M
IPU-Machine

A rack mountable compute platform with a number of interconnected IPUs, management logic, Exchange Memory, and external networking and IPU-Link interfaces.

IPU-Machine: M2000
IPU-M2000

A 1U IPU-Machine containing four Colossus GC200 IPUs providing 1 petaFLOPS of compute, up to 450 GB Exchange Memory, 2.8 Tbps low-latency IPU-Fabric interconnect, and an IPU gateway controller supporting host disaggregation. One or more IPU-M2000s can work as a direct attached system, or they can be built into a rack system as an IPU-POD.

IPU-POD

A collection of interconnected IPU-Machines. An IPU-POD DA (Direct Attach) system has no switches and runs the management software on one of the IPU-Machines. A switched IPU-POD has one or more servers and networking switches. An IPU-POD allows all the IPUs in the IPU-Machines to communicate and synchronize using IPU-IPU connections. The IPUs can be partitioned into “virtual IPU-PODs” using the V-IPU software.

IPU-POD management services

The software and firmware components on IPU-Machines, Poplar servers and management servers that together implement all the required management, provisioning and monitoring functions for an IPU-POD.

IPU-Tile

A tile containing the IPU-Core and In-Processor-Memory.

IPUoF
IPU over Fabric

Software that allows a Poplar server to control and feed data to a program executing on one or more IPUs using remote DMA (RDMA). The IPUoF software has components on both the server and the IPU-Machine.

Management server

A physical server, virtual machine or container implementing the higher, component-independent layers of the IPU-POD management services.

Mesh configuration

IPUs can be connected in a 2D array with their IPU-Links. This is normally a 2 x N array, rather like a ladder with a pair of IPUs on either side of each “rung”.

See also Torus configuration.

Micro batch

The number of samples calculated in one full forward/backward pass of the algorithm.

Partials

Partials are the intermediate values in a computation. For example, four numbers (or tensors) a, b, c and d might be added like this:

tmp1 = a + b
tmp2 = c + d
total = tmp1 + tmp2

In this case, tmp1 and tmp2 are the partials.

Partition

An isolated group of one or more IPUs within a single vPOD that can be controlled by one or more Poplar hosts within the vPOD, and that can be used for single or multiple ML workloads.

The creation and management of partitions is done by using the V-IPU software. Partitions can be reconfigurable or non-reconfigurable.

Ping pong

(Deprecated)

An execution model where two groups, each consisting of one or more IPUs, alternate between processing and host exchange. This allows larger models to be handled that would otherwise not fit in IPU memory.

Pipeline depth

The normal meaning is the number of stages in a processing pipeline. It is also used, in some of our APIs, to refer to the number of samples passed through the pipeline before a weight update.

Pipelining

Pipelining is a way of parallelising execution by splitting a model across multiple IPUs. Each stage of processing is mapped to a different IPU, each of which handles a different batch of samples.
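
A toy Python sketch of the scheduling idea only, with two stages standing in for two IPUs; this is not how Poplar, PopART or PopTorch express pipelining:

def stage1(x):                  # would run on IPU 0
    return x + 1

def stage2(x):                  # would run on IPU 1
    return x * 10

def pipeline(micro_batches):
    # In the steady state, stage1 works on micro-batch N while stage2 works
    # on micro-batch N-1, so both stages (IPUs) are kept busy.
    in_flight = None
    outputs = []
    for mb in micro_batches + [None]:           # one extra step to drain
        if in_flight is not None:
            outputs.append(stage2(in_flight))
        in_flight = stage1(mb) if mb is not None else None
    return outputs

print(pipeline([1, 2, 3]))                      # [20, 30, 40]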

PopART

The Poplar advanced run-time (PopART) provides support for importing, creating and running ONNX graphs on the IPU.

PopDist
Poplar Distributed Configuration Library

PopDist provides a set of APIs which are used to make applications ready for distributed execution. Command line parameters passed to PopRun are exposed and can be used to distribute the input/output data or other parts of the application. PopDist is also bundled with the Poplar SDK.

Poplar

The Graphcore software tools and libraries for graph programming on IPUs. Enables the programmer to write a single program that defines the graph to be executed on the IPU devices and the controlling code that runs on the host. The device code is compiled and loaded onto the IPUs ready for execution.

Poplar graph compiler

The component of Poplar that compiles graph programs for the IPU. This is run implicitly when a Poplar program is executed.

Poplar graph engine

The run-time component of Poplar that provides support for running graph programs on the IPU.

Poplar instance

Poplar translates a framework program into code that runs on IPUs. Each process running a single version of Poplar (the SDK) is a Poplar instance. For example, a program that runs on 4 GCDs requires 4 Poplar instances, one per GCD.

Poplar SDK

The package of software development tools for the Graphcore IPU. It includes the Poplar graph compiler and graph engine, PopLibs, framework support such as PopART and PopTorch, and tools for distributed execution such as PopDist and PopRun.

Poplar server

A server that runs the Poplar graph engine and communicates with IPUs using Host exchange.

PopLibs

A set of libraries in Poplar for the IPU that provide common operations required in machine learning frameworks and applications.

PopRun

A command line utility to launch distributed applications on IPU-PODs. PopRun creates multiple instances, each of which can run on a single host server or multiple host servers. This includes remote host servers in larger IPU-POD configurations such as an IPU-POD128, where the remote host servers are physically located in an interconnected IPU-POD. PopRun is required for any IPU-POD system larger than an IPU-POD64 and is bundled with the Poplar SDK.

PopTorch

Provides a simple wrapper around PyTorch programs to enable them to be run on the IPU.
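
A minimal sketch of the wrapping step, assuming the public poptorch.Options and poptorch.inferenceModel API; see the PopTorch user guide for training and for the full set of options:

import torch
import poptorch

class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

# Wrap the standard PyTorch model so it is compiled for and run on the IPU.
opts = poptorch.Options()
ipu_model = poptorch.inferenceModel(Model(), opts)
out = ipu_model(torch.randn(8, 4))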

PopVision

A suite of graphical analysis and debugging tools. The first is the PopVision Graph Analyser for profiling and performance analysis.

Recomputation

In a multi-layer network, activations are computed from layer to layer and are typically saved as intermediate results. Recomputation optimises the use of IPU memory by recomputing some values required on the backward pass. This can massively reduce the amount of In-Processor-Memory used, at the cost of some extra computation.
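
A toy Python sketch of the idea (not the Poplar recomputation mechanism): only some activations are kept as checkpoints during the forward pass, and the rest are recomputed from the nearest checkpoint when the backward pass needs them:

def forward(layers, x, keep_every=2):
    # Keep only every 'keep_every'-th activation as a checkpoint.
    checkpoints = {0: x}
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % keep_every == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def activation_after(layers, checkpoints, i):
    # Recompute the activation after layer i from the nearest earlier
    # checkpoint instead of having stored it during the forward pass.
    start = max(k for k in checkpoints if k <= i + 1)
    x = checkpoints[start]
    for layer in layers[start:i + 1]:
        x = layer(x)
    return x

layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * 5]
y, cps = forward(layers, 1.0)
print(y, activation_after(layers, cps, 2))      # output and a recomputed activation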

Remote buffer

A remote buffer is the software representation of Streaming Memory in Poplar. Data is transferred to and from remote buffers using data streams.

Replica batch

The number of samples that contribute to a weight update from a single replica.
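
The batch terms in this glossary are related as in the following worked Python example. The relationship between the global batch, replica batch and number of replicas follows directly from the definitions; the gradient accumulation count is an assumption used here only to give the replica batch a concrete value:

micro_batch = 4         # samples in one forward/backward pass (Micro batch)
accumulation_count = 8  # passes accumulated before a weight update (assumed)
num_replicas = 16       # identical replicas of the graph

replica_batch = micro_batch * accumulation_count   # per-replica samples per update
global_batch = replica_batch * num_replicas        # samples per update, all replicas

print(replica_batch, global_batch)                 # 32 512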

Replicated graph

A replicated graph creates a number of identical copies, or replicas, of the same graph. Each replica targets a different subset of the available tiles (all subsets are the same size). Any change made to the replicated graph, such as adding variables or vertices, will affect all the replicas.

Note: This is not the same as the TensorFlow use of the term.

See also Virtual graph.

Replication

See Replicated graph and Replicated tensor sharding.

RTS
Replicated tensor sharding

Storing tensors across replicas by slicing them into equal per-replica “shards” to reduce the memory required.

If a tensor in a replicated graph with replication factor R has the same value on each replica (which is not necessarily the case), you can save memory by storing just a fraction (1/R) of the tensor on each replica. When the full tensor is required all R shards can be broadcast to all the replicas.
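
A toy Python sketch of the idea; the shard and all_gather helpers are illustrative only, not the GCL or Poplar API:

def shard(tensor, num_replicas):
    # Each replica keeps only a 1/R slice of a tensor whose value is
    # identical on every replica.
    size = len(tensor) // num_replicas
    return [tensor[i * size:(i + 1) * size] for i in range(num_replicas)]

def all_gather(shards):
    # When the full tensor is needed, every replica receives all the shards
    # and reassembles the original tensor.
    full = [x for s in shards for x in s]
    return [full for _ in shards]               # one full copy per replica

shards = shard(list(range(8)), num_replicas=4)  # each replica stores 2 elements
print(all_gather(shards)[0])                    # [0, 1, 2, 3, 4, 5, 6, 7]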

Session

Within frameworks such as TensorFlow and PopART, the software interface to code running on an engine. It defines the runtime state consisting of the compiled graph and the values of any variables used by the graph program.

Sharding

The process of dividing a model up by placing parts of the model on separate IPUs. This is a method of distributing a model that is too large to fit on an IPU. Since there are usually data dependencies between the parts of the model, execution will not be very efficient unless pipelining is used.

Standalone

See Direct attach.

Streaming Memory

External memory used by an IPU for data storage. This could be host memory reserved for use by the IPU or dedicated IPU memory.

See also Remote buffer.

Superstep

A sequence of execution phases of a graph program consisting of: system-wide synchronisation, global communication (exchange) and local compute.

Sometimes just referred to as a “step”.

Supervisor code

The code responsible for initiating worker threads and performing the exchange phase and synchronisation phases of a step. Supervisor code cannot perform floating point operations.

Sync
Synchronisation

A system-wide synchronisation; the first phase in a superstep, following which it is safe to perform an exchange phase. Synchronisation can be internal (between all of the tiles on a single IPU) or external (between all tiles on every IPU). External sync is done via dedicated Sync-Link connections.

Sync-Link

A connection provided between IPU-M2000s to allow synchronisation between all the IPUs.

Tensor

A tensor is a variable that contains a multi-dimensional array of values. In the IPU, the storage of a tensor can be distributed across the tiles. The data is then operated on, in parallel, by the vertex code running on the tiles.

Tile

An individual processor core in the IPU consisting of a processing unit and memory. All tiles are connected to the exchange fabric.

Torus configuration

A mesh connection where the IPU-Links at each end are connected back to form a closed loop.

See also Mesh configuration.

V-IPU
Virtual-IPU

The IPU-Machine or IPU-POD parts of the IPU-POD management services that implement the allocation, provisioning and monitoring of IPUs and related infrastructure for machine-learning workloads in the IPU-POD.

The IPU-specific part of the V-IPU software can run on an IPU-Machine, when used in direct attach mode in an IPU-POD DA system, or it can run on a server in a switched IPU-POD.

Vertex

A unit of computation in the graph; consists of code that runs on a tile. Vertices have inputs and outputs that are connected to tensors, and are associated with a codelet that defines the processing performed on the tensor data. Each vertex is stored and executed on a single tile.

Virtual graph

A graph is normally created for a physical target with a specific number of tiles. It is possible to create a new graph from that, which is a virtual graph for a subset of the tiles. This is effectively a new view onto the parent graph for a virtual target, which has a subset of the real target’s tiles and can be treated like a new graph.

See also Replicated graph.

vPOD

A “Virtual IPU-POD” is a subset of IPUs and servers in an IPU-POD that is securely isolated from access by other IPUs or servers. A vPOD can contain multiple GSDs.

By default, the complete IPU-POD is also a single vPOD, hence any operations that relate to a vPOD can, in general, also apply to a complete IPU-POD.

The V-IPU software is used to create and manage vPODs.

Worker

Code that can perform floating point operations and is typically responsible for performing the compute phase of a step. A tile has hardware support for multiple worker contexts.

Note: This should not be confused with the TensorFlow definition of worker (processes that can make use of multiple hardware resources).