A Dictionary of Graphcore Terminology

Glossary

Batch serialisation

A batch or micro batch of samples is normally processed in parallel. With batch serialisation, the (micro) batch is divided into sub-batches (based on a batch serialisation factor) and only a sub-batch of samples is processed in parallel. A sequence of these sub-batches is processed serially.
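
As an illustration only, the splitting described above can be sketched in plain Python. The function name serialise_batch and the requirement that the micro batch divides evenly by the factor are assumptions made for this sketch, not part of any Graphcore API.

```python
def serialise_batch(micro_batch, serialisation_factor):
    """Split a micro batch into equally sized sub-batches, preserving order."""
    if serialisation_factor < 1 or not micro_batch:
        raise ValueError("need a non-empty micro batch and a factor of at least 1")
    if len(micro_batch) % serialisation_factor != 0:
        raise ValueError("micro batch size must be divisible by the factor")
    sub_batch_size = len(micro_batch) // serialisation_factor
    return [
        micro_batch[start:start + sub_batch_size]
        for start in range(0, len(micro_batch), sub_batch_size)
    ]


micro_batch = list(range(8))  # a micro batch of 8 samples
sub_batches = serialise_batch(micro_batch, 4)

# The samples inside one sub-batch would be processed in parallel;
# the sub-batches themselves are visited one after another.
for sub_batch in sub_batches:
    print(sub_batch)
```

With a factor of 4, only two samples are in flight at any one time rather than eight.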

Batch size

See Compute batch size, Global batch size, Replica batch size and Micro batch size.

Bow

Next generation IPU using a 3D wafer-on-wafer design to improve performance with increased power delivery and clock speed. The Bow IPU has 1,472 tiles, each with 624 KB of In-Processor-Memory (900 MB total In-Processor-Memory) and FP16.16 AI compute of 350 teraFLOPS.
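
The total memory figure follows from the per-tile figure. The short check below assumes that "KB" here means 1,024 bytes; under that assumption the total is 897 MiB, which the entry rounds to 900 MB.

```python
TILES = 1472
KB_PER_TILE = 624      # In-Processor-Memory per tile, from the entry above
BYTES_PER_KB = 1024    # assumption: KB is read as 1,024 bytes

total_bytes = TILES * KB_PER_TILE * BYTES_PER_KB
total_mib = total_bytes / (1024 * 1024)
print(f"{total_mib:.0f} MiB")
```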

Bow Pod

A collection of interconnected Bow-2000 IPU-Machines. A Bow Pod16 is in direct attach mode and has no switches and runs the management software on one of the IPU-Machines. Larger Bow Pod systems (Bow Pod64 onwards) are switched systems with one or more servers and networking switches. A Bow Pod allows all the IPUs in the Bow-2000 IPU-Machines to communicate and synchronize using IPU-to-IPU connections. The IPUs can be partitioned into “virtual Pods” using the V-IPU software.

Bow-2000
IPU-Machine: Bow-2000

A 1U IPU-Machine containing four Bow IPUs providing 1.39 petaFLOPS of compute, up to 260 GB memory, 2.8 Tbps low-latency IPU-Fabric interconnect, and an IPU-Gateway that supports host disaggregation. Up to 4 Bow-2000s can work as a direct attached system, or larger numbers of Bow-2000s can be built into a switched rack system as a Bow Pod.

BSP
Bulk-synchronous parallel

A programming methodology for parallel algorithms which is used on the IPU. Execution for the IPU consists of supersteps, each made up of three phases: synchronization, communication and local compute.

Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (August 1990), 103-111. DOI=10.1145/79173.79181.

For a more general introduction see Bulk synchronous parallel on Wikipedia.

C2

Graphcore’s dual IPU PCIe card with two GC2 Colossus IPUs. Provides performance of 250 teraFLOPS of mixed precision compute with 192 GB/s IPU-Link bandwidth between IPUs, 128 GB/s card to card IPU-Links. Maximum power consumption is 300 W.

Cluster

A logical grouping of IPUs.

Codelet

A piece of code that defines the inputs, outputs and internal state of a vertex. Contains a compute() function that defines the behaviour of the vertex.

Colossus

The current architecture of the Graphcore IPU. It consists of an array of thousands of IPU Tiles with In-Processor-Memory and IPU-Links for IPU-to-IPU communication. It is designed for parallel processing using the BSP model.

Colossus is available as the GC2 and GC200. It is named in honour of the Colossus computer used for code breaking at Bletchley Park.

Compute batch size

The number of samples for which activations/gradients are computed in parallel. This will be the same as the micro batch size unless batch serialisation is used.

Compute set

A set of vertices that are executed in parallel during the BSP compute phase.

Direct attach

One or more IPU-Machines can be used in “direct attach” mode where the IPU-Machines are directly controlled from the user’s computer. The IPU-specific part of the V-IPU software runs on an IPU-Machine, rather than on a separate server.

Dynamic graph

A dynamic graph can be a graph where the shapes of tensors within the graph are determined at runtime (see Dynamic shape) or a graph where execution is via dynamic dispatch (see Eager execution).

Dynamic shape

The input to a graph can have variable length or shape. For models such as BERT, the maximum sequence size is known but the actual input is dynamic within that range.

Eager execution

Dynamic graph execution (or dispatch) where each operation is individually compiled, dispatched and executed when required.

Edge

The edges of a computation graph define the connections between elements of tensors and the vertices of the graph.

Exchange

Communication phase of a superstep, where data is communicated between tiles and between IPUs. An exchange can be internal to the IPU (see IPU-Exchange) or external (see External exchange).

Exchange fabric

The communication network used to transfer data between tiles within an IPU.

External exchange

An exchange phase where data is communicated between tiles and memory outside the IPU. This can be an Inter-IPU exchange or a Host exchange.

See also IPU-Fabric.

GC2

First generation of the Colossus IPU with 1,216 tiles, each with 256 KB of In-Processor-Memory.

GC200

Second generation Colossus IPU with 1,472 tiles, each with 624 KB of In-Processor-Memory (900 MB total In-Processor-Memory) and micro-architectural improvements to increase performance and reduce power consumption. FP16.16 AI compute of 250 teraFLOPS.

GCD
Graph compile domain

A subset of IPUs within a GSD which are controlled by a single Poplar instance. “GCD size” is the number of IPUs in the GCD. When a program is executed, the Poplar instance binaries may be replicated and loaded on to multiple GCDs (with one Poplar instance per GCD). All the GCDs together form the GSD, with GCL managing all the communication and synchronization between the separate IPUs and GCDs.

One or more GCDs form a GSD, which is equivalent to the whole partition.

GCL
Graphcore Communication Library

A software library for managing communication and synchronization between IPUs, supporting ML at scale. GCL-based IPU communication and synchronization can be established across any IPU-Fabric, supporting Pod topologies such as mesh configuration and torus configuration.

Global batch size

The number of samples that contribute to a weight update across all replicas. This is equal to the replica batch size multiplied by the number of replicas.
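
The relationships between the batch-size terms in this glossary can be written out as simple arithmetic. The numbers below are made up for illustration, and the division by the batch serialisation factor is an assumption about how sub-batches are sized.

```python
micro_batch_size = 4
batch_serialisation_factor = 2          # 1 would mean no batch serialisation
gradient_accumulation_iterations = 8
num_replicas = 16

# Samples processed in parallel (assumes the micro batch divides evenly).
compute_batch_size = micro_batch_size // batch_serialisation_factor

# Samples contributing to a weight update from one replica.
replica_batch_size = micro_batch_size * gradient_accumulation_iterations

# Samples contributing to a weight update across all replicas.
global_batch_size = replica_batch_size * num_replicas

print(compute_batch_size, replica_batch_size, global_batch_size)
```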

Global exchange

See Inter-IPU exchange.

Gradient accumulation

Gradient accumulation is a technique for increasing the batch size used for a weight update step. The gradients from processing multiple micro batches are accumulated and used in a single weight update step. Any normalisation will be done within each micro batch. This means that batch normalisation will not give a mathematically equivalent result when using gradient accumulation compared to just using a larger batch size, but for a large enough micro batch size the statistics will give a sufficiently good approximation.
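
A minimal sketch of the idea, using a one-parameter model y = w * x with a mean squared error loss. It assumes that the accumulated gradients are averaged over equally sized micro batches (frameworks may sum or average) and it involves no normalisation layers, so in this case the accumulated update matches the large-batch update.

```python
def mean_gradient(w, samples):
    """Gradient of the mean squared error of y = w * x over the samples."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [samples[0:2], samples[2:4]]
w = 0.0
learning_rate = 0.01

# Accumulate the gradient over the micro batches without updating w.
accumulated = 0.0
for micro_batch in micro_batches:
    accumulated += mean_gradient(w, micro_batch)
accumulated /= len(micro_batches)

# A single weight update uses the accumulated gradient.
w_accumulated = w - learning_rate * accumulated

# For comparison: one update computed from the whole batch at once.
w_large_batch = w - learning_rate * mean_gradient(w, samples)

print(w_accumulated, w_large_batch)
```

If the model computed per-batch statistics, as batch normalisation does, each micro batch would see different statistics and the two results would no longer agree exactly.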

Graph Analyser

See PopVision.

Graph compiler

See Poplar graph compiler.

Graph engine

See Poplar graph engine.

Graph streaming

A set of techniques that allows IPUs to make efficient use of In-Processor-Memory and Streaming Memory. This includes the intelligent placement of variables and weights, and the use of Streaming Memory for scatter/gather and reductions.

GSD
Graph scaleout domain

The set of IPUs used to execute a program, consisting of one or more GCDs. All the GCDs together form the GSD, with GCL managing all the communication and synchronization between the separate IPUs and GCDs.

The “GSD size” is the number of IPUs in the GSD.

GSD is equivalent to the whole partition and GSD size is the partition size.

See also vPOD.

GW-Link

A networking interface implemented by an IPU-Gateway that can provide IPU-to-IPU connectivity through another IPU-Gateway, either via a directly connected link or via a switching infrastructure.

A set of IPU-Machines in a Pod that can communicate only via GW-Links.

Half
Half-precision

A 16-bit floating-point value.

Head node

See Poplar server.

Host exchange

Communication between an IPU and the server running the host-side part of the Poplar program.

Host memory

Memory on the host server that can be accessed by the IPU via the IPU-Fabric.

Host-Link

The communication path between the host computer and the IPUs. This may be a direct connection, such as PCIe, or a high-speed network, such as 100 Gigabit Ethernet (100 GbE).

ILD

An IPU-Link Domain (ILD) is a set of IPUs that are connected with IPU-Links. The IPUs have to be within a single Pod. The maximum size of a single ILD is 64 IPUs. Multiple ILDs are used to form multi-ILD clusters. The term “multi-ILD partition” refers to a partition that spans multiple ILDs and uses GW-Links.

There has to be at least one GCD per ILD.

In-Processor-Memory

The tile memory. This memory can be directly accessed by worker threads during the compute phase of a program.

Inter-IPU exchange

Communication between tiles on different IPUs.

IPU
Intelligence Processing Unit

An Intelligence Processing Unit (IPU) is a massively parallel processor pioneered by Graphcore for machine learning (ML) and artificial intelligence applications.

Graphcore’s current implementation of the IPU is Colossus.

IPU-Core

The tile’s processing unit.

IPU-Exchange

Communication on the Exchange fabric internal to the IPU.

IPU-Fabric

The communication network used to transfer data between tiles in an IPU, and between IPUs in a system. The IPU-Fabric is made up of IPU-Links, GW-Links, Sync-Links and Host-Links.

IPU-Gateway

The IPU-Gateway manages communication on and off the IPU-Machine board via the IPU-Links that connect IPU-Machines. It also manages transfers between the IPUs and local Streaming Memory on the IPU-Machine.

IPU-Link

Communication links between IPUs.

IPU-M
IPU-Machine

A rack mountable compute platform with a number of interconnected IPUs, management logic, In-Processor-Memory, Streaming Memory, and external networking and IPU-Link interfaces. General term for IPU-M2000 and Bow-2000 blades.

IPU-Machine: M2000
IPU-M2000

A 1U IPU-Machine containing four Colossus GC200 IPUs providing 1 petaFLOPS of compute, up to 260 GB memory, 2.8 Tbps low-latency IPU-Fabric interconnect, and an IPU-Gateway that supports host disaggregation. One or more IPU-Machines can be built into a Pod system. This can be a direct attached or switched system.

IPU-POD

A collection of interconnected IPU-M2000 IPU-Machines. An IPU-POD DA (Direct Attach) system has no switches and runs the management software on one of the IPU-Machines. A switched IPU-POD has one or more servers and networking switches. An IPU-POD allows all the IPUs in the IPU-M2000 IPU-Machines to communicate and synchronize using IPU-to-IPU connections. The IPUs can be partitioned into “virtual Pods” using the V-IPU software.

IPU-Tile

A tile containing the IPU-Core and In-Processor-Memory.

IPUoF
IPU over Fabric

Software that allows a Poplar server to control and feed data to a program executing on one or more IPUs using remote DMA (RDMA). The IPUoF software has components on both the server and the IPU-Machine.

Management server

A physical server, virtual machine or container implementing the higher, component-independent layers of the Pod management services.

Mesh configuration

IPUs can be connected in a 2D array with their IPU-Links. This is normally a 2 x N array, rather like a ladder with a pair of IPUs on either side of each “rung”.

See also Torus configuration.

Micro batch size

The number of samples for which activations are calculated in one full forward pass of the algorithm in a single replica, and for which gradients are calculated in one full backward pass of the algorithm (when training) in a single replica. If gradient accumulation is used then there will not be a weight update after every backward pass.

See also Replica batch size and Global batch size.

Partials

Partials are the intermediate values in a computation. For example, four numbers (or tensors) a, b, c, d, might be added like this:

tmp1 = a + b
tmp2 = c + d
total = tmp1 + tmp2

In this case, tmp1 and tmp2 are the partials.
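
The same pattern extends to longer inputs. The sketch below uses a hypothetical helper, tree_sum, which adds values pairwise and returns both the total and the partials produced at each level along the way.

```python
def tree_sum(values):
    """Sum values pairwise; return (total, partials produced at each level)."""
    if not values:
        raise ValueError("need at least one value")
    levels = []
    level = list(values)
    while len(level) > 1:
        # Add neighbouring pairs; an odd element is carried forward unchanged.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        levels.append(level)
    # The last level holds only the total, so it is not a partial.
    return level[0], levels[:-1]


total, partials = tree_sum([1, 2, 3, 4])
print(total, partials)
```

For the four inputs above, the single level of partials corresponds to tmp1 and tmp2.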

Partition

An isolated group of one or more IPUs within a single vPOD that can be controlled by one or more Poplar hosts within the vPOD, and that can be used for single or multiple ML workloads.

The creation and management of partitions is done by using the V-IPU software. Partitions can be reconfigurable or non-reconfigurable.

Ping pong

(Deprecated)

An execution model where two groups, each consisting of one or more IPUs, alternate between processing and host exchange. This allows larger models to be handled that would not otherwise fit in IPU memory.

Pipeline depth

The normal meaning is the number of stages in a processing pipeline. It is also used, in some of our APIs, to refer to the number of micro batches passed through the pipeline before a weight update.

Pipelining

Pipelining is a way of parallelising execution by splitting a model across multiple IPUs. Each stage of processing is mapped to a different IPU, each of which handles a different micro batch of samples.

See also micro batch size.
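
To illustrate how stages overlap, the sketch below builds a simplified forward-only schedule. It assumes that every stage takes one time step and that stage s works on micro batch t - s at time step t; real schedules also interleave backward passes and weight updates.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Return, per time step, the micro batch index each stage works on."""
    schedule = []
    for step in range(num_stages + num_micro_batches - 1):
        row = []
        for stage in range(num_stages):
            micro_batch = step - stage
            # None marks a stage that is idle while the pipeline fills or drains.
            row.append(micro_batch if 0 <= micro_batch < num_micro_batches else None)
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(num_stages=3, num_micro_batches=4)
for step, row in enumerate(schedule):
    print(step, row)
```

At time step 2 all three stages are busy, each with a different micro batch.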

Pod

A Graphcore Pod is a general term for IPU-POD and Bow Pod systems.

Pod management services

The software and firmware components on IPU-Machines, Poplar servers and management servers that together implement all the required management, provisioning and monitoring functions for a Pod.

PopART

The Poplar advanced run-time (PopART) provides support for importing, creating and running ONNX graphs on the IPU.

PopDist
Poplar Distributed Configuration Library

PopDist provides an API to make applications ready for distributed execution. Command line parameters passed to PopRun are exposed and can be used to distribute the input/output data or other parts of the applications. PopDist is bundled with the Poplar SDK.

Poplar

The Graphcore software tools and libraries for graph programming on IPUs. Enables the programmer to write a single program that defines the graph to be executed on the IPU devices and the controlling code that runs on the host. The device code is compiled and loaded onto the IPUs ready for execution.

Poplar graph compiler

The component of Poplar that compiles graph programs for the IPU. This is run implicitly when a Poplar program is executed.

Poplar graph engine

The run-time component of Poplar that provides support for running graph programs on the IPU.

Poplar instance

Poplar translates a framework program into code that runs on IPUs. Each process using a single version of Poplar/the SDK is a Poplar instance. For a program that would need to run on 4 GCDs, there would be 4 Poplar instances, one per GCD.

Poplar SDK

The package of software development tools for the Graphcore IPU. It includes:

Poplar server

A server that runs the Poplar graph engine and communicates with IPUs using Host exchange.

PopLibs

A set of libraries in Poplar for the IPU that provide common operations required in machine learning frameworks and applications.

PopRun

A command line utility to launch distributed applications on Pods. PopRun creates multiple instances, each of which can run on a single host server or multiple host servers. This includes remote host servers with larger Pod configurations such as an IPU‑POD256, where the remote host servers are physically located in an interconnected Pod. PopRun is required for any Pod system larger than an IPU‑POD64 or Bow Pod64 and is bundled with the Poplar SDK.

See also Poplar instance.

PopTorch

Provides a simple wrapper around PyTorch programs to enable them to be run on the IPU.

PopVision

A suite of graphical analysis and debugging tools. For more information see the PopVision Tools web page.

Recomputation

In a multi-layer network, activations are computed from layer to layer and are typically saved as intermediate results. Recomputation optimises the use of IPU memory by recomputing some values required on the backward pass. This can massively reduce the amount of In-Processor-Memory used, at the cost of some extra computation.
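
One common way to realise this trade-off is to keep only some activations as checkpoints and rebuild the rest from the nearest earlier checkpoint. The sketch below is a generic illustration with hypothetical helper names, not a description of how any Graphcore framework implements recomputation.

```python
def forward(layers, x, checkpoint_every):
    """Run the layers, storing only every checkpoint_every-th activation."""
    checkpoints = {0: x}  # index 0 is the network input
    for index, layer in enumerate(layers, start=1):
        x = layer(x)
        if index % checkpoint_every == 0:
            checkpoints[index] = x
    return x, checkpoints


def activation_at(layers, checkpoints, index):
    """Recompute the activation after layer `index` from the nearest checkpoint."""
    start = max(i for i in checkpoints if i <= index)
    x = checkpoints[start]
    for layer in layers[start:index]:
        x = layer(x)
    return x


layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * v]
output, checkpoints = forward(layers, 1, checkpoint_every=2)

# Activations 1 and 3 were not stored, so they are recomputed on demand.
print(sorted(checkpoints), activation_at(layers, checkpoints, 3))
```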

Remote buffer

A remote buffer is the software representation of Streaming Memory in Poplar. Data is transferred to and from remote buffers using data streams.

Replica batch size

The number of samples that contribute to a weight update from a single replica. The replica batch size equals the micro batch size multiplied by the number of gradient accumulation iterations.

See also Replicated graph.

Replicated graph

A replicated graph creates a number of identical copies, or replicas, of the same graph. Each replica targets a different subset of the available tiles (all subsets are the same size). Any change made to the replicated graph, such as adding variables or vertices, will affect all the replicas.

See also Virtual graph.
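
As a simple picture of equally sized tile subsets, the sketch below assigns contiguous tile ranges to replicas. The contiguous layout and the helper name replica_tiles are assumptions for illustration; the actual mapping is decided by the software.

```python
def replica_tiles(num_tiles, replication_factor):
    """Divide tile indices into equally sized, non-overlapping ranges."""
    if num_tiles % replication_factor != 0:
        raise ValueError("tiles must divide evenly between replicas")
    per_replica = num_tiles // replication_factor
    return [
        range(replica * per_replica, (replica + 1) * per_replica)
        for replica in range(replication_factor)
    ]


subsets = replica_tiles(num_tiles=1472, replication_factor=4)
for replica, tiles in enumerate(subsets):
    print(replica, tiles.start, tiles.stop - 1)
```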

Replication

See Replicated graph and Replicated tensor sharding.

RTS
Replicated tensor sharding

Storing tensors across replicas by slicing them into equal per-replica “shards” to reduce the memory required.

If a tensor in a replicated graph with replication factor R has the same value on each replica (which is not necessarily the case), you can save memory by storing just a fraction (1/R) of the tensor on each replica. When the full tensor is required, all R shards can be broadcast to all the replicas.
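
The slicing and reassembly can be sketched with ordinary Python lists. The helper names shard and gather are hypothetical, and the tensor length is assumed to be a multiple of R.

```python
def shard(tensor, replication_factor):
    """Slice a flat tensor into equal shards, one per replica."""
    if len(tensor) % replication_factor != 0:
        raise ValueError("tensor length must be a multiple of the replication factor")
    size = len(tensor) // replication_factor
    return [tensor[r * size:(r + 1) * size] for r in range(replication_factor)]


def gather(shards):
    """Rebuild the full tensor from all shards, in replica order."""
    return [value for one_shard in shards for value in one_shard]


weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shards = shard(weights, replication_factor=4)

# Each replica stores 2 of the 8 values, a fraction 1/R of the tensor.
print(shards)
print(gather(shards) == weights)
```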

Session

Within frameworks such as TensorFlow and PopART, the software interface to code running on an engine. It defines the runtime state consisting of the compiled graph and the values of any variables used by the graph program.

Sharding

The process of dividing a model up by placing parts of the model on separate IPUs. This is a method of distributing a model that is too large to fit on an IPU. Since there are usually data dependencies between the parts of the model, execution will not be very efficient unless pipelining is used.

Standalone

See Direct attach.

Streaming Memory

Memory external to the IPUs used by ML applications for data storage. This could be host memory reserved for use by the IPUs or dedicated IPU memory.

See also Remote buffer.

Superstep

A sequence of execution phases of a graph program consisting of: system-wide synchronisation, global communication (exchange) and local compute.

Sometimes just referred to as a “step”.
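
The three phases can be pictured with a small sequential simulation. This is a sketch only: the ring-shaped exchange pattern and the addition performed in the compute phase are arbitrary choices, and on real hardware the tiles run concurrently.

```python
def superstep(tile_values):
    """Run one sync, exchange and compute cycle over simulated tiles."""
    num_tiles = len(tile_values)

    # Phase 1, sync: in this sequential sketch the barrier is implicit,
    # because every tile has finished the previous step before we get here.

    # Phase 2, exchange: each tile receives the value held by its left
    # neighbour (tile 0 receives from the last tile).
    received = [tile_values[(tile - 1) % num_tiles] for tile in range(num_tiles)]

    # Phase 3, local compute: each tile uses only its own and received data.
    return [own + incoming for own, incoming in zip(tile_values, received)]


values = [1, 2, 3, 4]
after_one = superstep(values)
after_two = superstep(after_one)
print(after_one, after_two)
```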

Supervisor code

The code responsible for initiating worker threads and performing the exchange phase and synchronisation phases of a step. Supervisor code cannot perform floating point operations.

Sync
Synchronisation

A system-wide synchronisation; the first phase in a superstep, following which it is safe to perform an exchange phase. Synchronisation can be internal (between all of the tiles on a single IPU) or external (between all tiles on every IPU). External sync is done via dedicated Sync-Link connections.

Sync-Link

A connection provided between IPU-Machines to allow synchronisation between all the IPUs.

System Analyser

See PopVision.

Tensor

A tensor is a variable that contains a multi-dimensional array of values. In the IPU, the storage of a tensor can be distributed across the tiles. The data is then operated on, in parallel, by the vertex code running on the tiles.

Tile

An individual processor core in the IPU consisting of a processing unit and memory. All tiles are connected to the exchange fabric.

Torus configuration

A mesh connection where the IPU-Links at each end are connected back to form a closed loop.

See also Mesh configuration.
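
The difference between the two configurations can be shown by listing links in a simplified 2 x N ladder. The wiring below (one link per rung, one link along each side, and wrap-around links for the torus) is an assumption for illustration and does not describe the physical cabling of any particular Pod.

```python
def ladder_links(num_rungs, torus=False):
    """Return the set of links in a simplified 2 x N ladder of IPUs."""
    links = set()
    for col in range(num_rungs):
        # The rung joins the two IPUs in this column.
        links.add(((0, col), (1, col)))
        # The side rails join each IPU to the next one along.
        if col + 1 < num_rungs:
            for row in (0, 1):
                links.add(((row, col), (row, col + 1)))
    if torus and num_rungs > 2:
        # Close the loop by joining the last column back to the first.
        for row in (0, 1):
            links.add(((row, num_rungs - 1), (row, 0)))
    return links


mesh = ladder_links(4)
torus = ladder_links(4, torus=True)
print(len(mesh), len(torus))
```

The torus contains every link of the mesh plus the two that close the loop.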

V-IPU
Virtual-IPU

The IPU-Machine or Pod parts of the Pod management services that implement the allocation, provisioning and monitoring of IPUs and related infrastructure for machine-learning workloads in the Pod.

The IPU-specific part of the V-IPU software can run on an IPU-Machine, when used in direct attach mode in a Pod DA system, or it can run on a server in a switched Pod.

Vertex

A unit of computation in the graph; consists of code that runs on a tile. Vertices have inputs and outputs that are connected to tensors, and are associated with a codelet that defines the processing performed on the tensor data. Each vertex is stored and executed on a single tile.

Virtual graph

A graph is normally created for a physical target with a specific number of tiles. It is possible to create a new graph from that, which is a virtual graph for a subset of the tiles. This is effectively a new view onto the parent graph for a virtual target, which has a subset of the real target’s tiles and can be treated like a new graph.

See also Replicated graph.

vPOD

A “Virtual Pod” is a subset of IPUs and servers in a Pod that is securely isolated from access by other IPUs or servers. A vPOD can contain multiple GSDs.

By default, the complete Pod is also a single vPOD, hence any operations that relate to a vPOD can, in general, also apply to a complete Pod.

The V-IPU software is used to create and manage vPODs.

Worker

Code that can perform floating point operations and is typically responsible for performing the compute phase of a step. A tile has hardware support for multiple worker contexts.

Note: this should not be confused with the TensorFlow definition of worker (processes that can make use of multiple hardware resources).