A Dictionary of Graphcore Terminology
- Batch serialisation
A batch or micro batch of samples is normally processed in parallel. With batch serialisation, the (micro) batch is divided into sub-batches (based on a batch serialisation factor) and only a sub-batch of samples is processed in parallel. A sequence of these sub-batches is processed serially.
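As a sketch, the serial loop over sub-batches can be written in plain Python with NumPy (illustrative only; `process_sub_batch` and the other names here are stand-ins, not Poplar API):

```python
import numpy as np

def process_sub_batch(sub_batch):
    # Stand-in for the parallel computation applied to one sub-batch of samples.
    return sub_batch * 2.0

def batch_serialised(micro_batch, serialisation_factor):
    """Split a micro batch into sub-batches and process them serially."""
    sub_batches = np.array_split(micro_batch, serialisation_factor)
    results = [process_sub_batch(sb) for sb in sub_batches]  # serial loop
    return np.concatenate(results)

micro_batch = np.arange(8, dtype=np.float32)   # micro batch of 8 samples
out = batch_serialised(micro_batch, serialisation_factor=4)  # 4 sub-batches of 2
```

Only one sub-batch's worth of activations is live at a time, which is the memory saving the technique trades for serial execution.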
- Batch size
- Bow IPU
Next generation (Mk2) Colossus IPU using a 3D wafer-on-wafer design to improve performance with increased power delivery and clock speed. The Bow IPU has 1,472 tiles, each with 624 KB of In-Processor-Memory (900 MB total In-Processor-Memory) and FP16.16 AI compute of 350 teraFLOPS.
- Bow Pod
A collection of interconnected Bow-2000 IPU-Machines. A Bow Pod16 runs in direct attach mode: it has no switches, and the management software runs on one of the IPU-Machines. Larger Bow Pod systems (Bow Pod64 onwards) are switched systems with one or more servers and networking switches. A Bow Pod allows all the IPUs in the Bow-2000 IPU-Machines to communicate and synchronize using IPU-to-IPU connections. The IPUs can be partitioned into “virtual Pods” using the V-IPU software.
- IPU-Machine: Bow-2000
A 1U IPU-Machine containing four Bow IPUs providing 1.39 petaFLOPS of compute, up to 260 GB memory, 2.8 Tbps low-latency IPU-Fabric interconnect, and an IPU-Gateway that supports host disaggregation. Up to 4 Bow-2000s can work as a direct attached system, or larger numbers of Bow-2000s can be built into a switched rack system as a Bow Pod.
- Bulk-synchronous parallel
A programming methodology for parallel algorithms which is used on the IPU. Execution for the IPU consists of supersteps, each made up of three phases: synchronization, communication and local compute.
Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (August 1990), 103-111. DOI=10.1145/79173.79181.
For a more general introduction see Bulk synchronous parallel on Wikipedia.
- C2 card
Graphcore’s dual IPU PCIe card with two GC2 Colossus IPUs. Provides performance of 250 teraFLOPS of mixed precision compute with 192 GB/s IPU-Link bandwidth between IPUs, 128 GB/s card to card IPU-Links. Maximum power consumption is 300 W.
- C600
An IPU-Processor PCIe card targeted at machine-learning inference applications. Has a single Mk2 IPU with FP8 support.
- Cluster
A logical grouping of IPUs.
- Codelet
A piece of code that defines the inputs, outputs and internal state of a vertex. Contains a compute() function that defines the behaviour of the vertex.
- Colossus
The current architecture of the Graphcore IPU. It consists of an array of thousands of IPU Tiles with In-Processor-Memory and IPU-Links for IPU-to-IPU communication. It is designed for parallel processing using the BSP model.
- Compute batch size
- Compute set
- Direct attach
One or more IPU-Machines can be used in “direct attach” mode where the IPU-Machines are directly controlled from the user’s computer. The IPU-specific part of the V-IPU software runs on an IPU-Machine, rather than on a separate server.
- Dynamic graph
- Dynamic shape
The input to a graph can have variable length or shape. For models such as BERT, the maximum sequence size is known but the actual input is dynamic within that range.
- Eager execution
Dynamic graph execution (or dispatch) where each operation is individually compiled, dispatched and executed when required.
- Edge
The edges of a computation graph define the connections between elements of tensors and the vertices of the graph.
- Exchange
Communication phase of a superstep, where data is communicated between tiles and between IPUs. An exchange can be an internal exchange (within a single IPU) or an external exchange.
- Exchange fabric
- External exchange
An exchange phase where data is communicated between tiles and memory outside the IPU. This can be:
Inter-IPU exchange (between IPUs; also known as global exchange)
Host exchange (between IPUs and the host)
Streaming Memory exchange (between IPUs and Streaming Memory)
See also IPU-Fabric.
- GC200 IPU
Second generation (Mk2) Colossus IPU with 1,472 tiles, each with 624 KB of In-Processor-Memory (900 MB total In-Processor-Memory) and micro-architectural improvements to increase performance and reduce power consumption. FP16.16 AI compute of 250 teraFLOPS.
- Global batch size
The number of samples that contribute to a weight update across all replicas. This is equal to the replica batch size multiplied by the number of replicas.
- Global exchange
See Inter-IPU exchange.
- Gradient accumulation
Gradient accumulation is a technique for increasing the batch size used for a weight update step. The gradients from processing multiple micro batches are accumulated and used in a single weight update step. Any normalisation will be done within each micro batch. This means that batch normalisation will not give a mathematically equivalent result when using gradient accumulation compared to just using a larger batch size, but for a large enough micro batch size the statistics will give a sufficiently good approximation.
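The technique can be sketched in plain NumPy (not Poplar or any framework API): a linear model is trained, with the gradients from several micro batches summed and applied in a single weight update.

```python
import numpy as np

def grad_mse(w, x, y):
    # d/dw mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return np.mean(2.0 * (w * x - y) * x)

rng = np.random.default_rng(0)
x = rng.normal(size=16).astype(np.float32)
y = 3.0 * x                        # target weight is 3.0
w, lr, accum_steps = 0.0, 0.1, 4

for _ in range(100):               # training iterations
    accum = 0.0
    for micro in np.array_split(np.arange(16), accum_steps):
        accum += grad_mse(w, x[micro], y[micro])   # one micro batch
    w -= lr * accum / accum_steps  # single update per accumulated batch
```

Because the micro batches are equal-sized, the averaged accumulated gradient here equals the full-batch gradient exactly; with batch normalisation in the model that equivalence would not hold, as noted above.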
- Graph Analyser
- Graph compile domain
A subset of IPUs within a GSD which are controlled by a single Poplar instance. “GCD size” is the number of IPUs in the GCD. When a program is executed, the Poplar instance binaries may be replicated and loaded on to multiple GCDs (with one Poplar instance per GCD). All the GCDs together form the GSD, with GCL managing all the communication and synchronization between the separate IPUs and GCDs.
One or more GCDs form a GSD, which is equivalent to the whole partition.
- Graph compiler
See Poplar graph compiler.
- Graph engine
See Poplar graph engine.
- Graph scaleout domain
The set of IPUs used to execute a program, consisting of one or more GCDs. All the GCDs together form the GSD, with GCL managing all the communication and synchronization between the separate IPUs and GCDs.
The “GSD size” is the number of IPUs in the GSD.
GSD is equivalent to the whole partition and GSD size is the partition size.
See also vPOD.
- Graph streaming
A set of techniques that allows IPUs to make efficient use of In-Processor-Memory and Streaming Memory. This includes the intelligent placement of variables and weights, and the use of Streaming Memory for scatter/gather and reductions.
- Graphcore Communication Library
A software library for managing communication and synchronization between IPUs, supporting ML at scale. GCL-based IPU communication and synchronization can be established across any IPU-Fabric, supporting Pod topologies such as mesh configuration and torus configuration.
- GW-Link
A networking interface implemented by an IPU-Gateway that can provide IPU-to-IPU connectivity through another IPU-Gateway, either via a directly connected link or via a switching infrastructure.
- GW-Link cluster
- Half
A 16-bit floating-point value.
- Head node
See Poplar host server.
- Host exchange
Communication between an IPU and the server running the host-side part of the Poplar program.
- Host memory
Memory on the host server that can be accessed by the IPU via the IPU-Fabric.
- Host-Link
The communication path between the host computer and the IPUs. This may be a direct connection, such as PCIe, or a high-speed network, such as 100 Gigabit Ethernet (100 GbE).
- IPU control unit
A microcontroller that performs system management functions for the IPU.
- In-Processor-Memory
The tile memory. This memory can be directly accessed by worker threads during the compute phase of a program.
- Intelligence Processing Unit
An Intelligence Processing Unit (IPU) is a massively parallel processor pioneered by Graphcore for machine learning (ML) and artificial intelligence applications.
Graphcore’s current implementation of the IPU is Colossus.
- Inter-IPU exchange
Communication between tiles on different IPUs.
- IPU-Core
The tile’s processing unit.
- Internal exchange
Communication on the Exchange fabric internal to the IPU.
- IPU-Gateway
The IPU-Gateway manages communication on and off the IPU-Machine board via the IPU-Links that connect IPU-Machines. It also manages transfers between the IPUs and local Streaming Memory on the IPU-Machine.
- IPU-Link
Communication links between IPUs.
- IPU-Link Domain
An IPU-Link Domain (ILD) is a set of IPUs that are connected with IPU-Links. The IPUs have to be within a single Pod. The maximum size of a single ILD is 64 IPUs. Multiple ILDs are used to form multi-ILD clusters. The term “multi-ILD partition” is also used to mean a partition that spans multiple ILDs and uses GW-Links.
There must be at least one GCD per ILD.
- IPU-Machine
A rack mountable compute platform with a number of interconnected IPUs, management logic, In-Processor-Memory, Streaming Memory, and external networking and IPU-Link interfaces. General term for IPU-M2000 and Bow-2000 blades.
- IPU-Machine: M2000
A 1U IPU-Machine containing four Colossus GC200 IPUs providing 1 petaFLOPS of compute, up to 260 GB memory, 2.8 Tbps low-latency IPU-Fabric interconnect, and an IPU-Gateway that supports host disaggregation. One or more IPU-Machines can be built into a Pod system. This can be a direct attached or switched system.
- IPU-POD
A collection of interconnected IPU-M2000 IPU-Machines. An IPU-POD DA (Direct Attach) system has no switches and runs the management software on one of the IPU-Machines. A switched IPU-POD has one or more servers and networking switches. An IPU-POD allows all the IPUs in the IPU-M2000 IPU-Machines to communicate and synchronize using IPU-to-IPU connections. The IPUs can be partitioned into “virtual Pods” using the V-IPU software.
- IPU over Fabric
Software that allows a Poplar server to control and feed data to a program executing on one or more IPUs using remote DMA (RDMA). The IPUoF software has components on both the server and the IPU-Machine.
- Logical rack
A logical rack is a space in one or more physical racks occupied by a single Pod. Since the standard racking of a single Pod may not be possible within one physical rack, we use the term “logical rack” to refer to the set of components making up the single Pod, regardless of where they may be physically installed. A logical rack is also referred to as an IPU-Link Domain (ILD).
- Management server
A physical server, virtual machine or container implementing the higher, component-independent layers of the Pod management services.
- Mesh configuration
IPUs can be connected in a 2D array with their IPU-Links. This is normally a 2 x N array, rather like a ladder with a pair of IPUs on either side of each “rung”.
See also Torus configuration.
- Micro batch size
The number of samples for which activations are calculated in one full forward pass of the algorithm in a single replica, and for which gradients are calculated in one full backward pass of the algorithm (when training) in a single replica. If gradient accumulation is used then there will not be a weight update after every backward pass.
- Partials
Partials are the intermediate values in a computation. For example, four numbers (or tensors) a, b, c and d might be added like this:
tmp1 = a + b
tmp2 = c + d
total = tmp1 + tmp2
In this case, tmp1 and tmp2 are the partials.
- Partition
An isolated group of one or more IPUs within a single vPOD that can be controlled by one or more Poplar hosts within the vPOD, and that can be used for single or multiple ML workloads.
The creation and management of partitions is done by using the V-IPU software. Partitions can be reconfigurable or non-reconfigurable.
- Ping pong
An execution model where two groups, each consisting of one or more IPUs, alternate between processing and host exchange. This allows larger models, which would not otherwise fit in IPU memory, to be handled.
- Pipeline depth
The normal meaning is the number of stages in a processing pipeline. It is also used, in some of our APIs, to refer to the number of micro batches passed through the pipeline before a weight update.
Pipelining is a way of parallelising execution by splitting a model across multiple IPUs. Each stage of processing is mapped to a different IPU, each of which handles a different micro batch of samples.
See also micro batch size.
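The pipelined schedule can be sketched in plain Python (illustrative only, not Poplar API): each stage stands in for one IPU, and at each time step every stage works on a different micro batch, so the stages run in parallel once the pipeline has filled.

```python
# Three stages (one per IPU) and four micro batches.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
micro_batches = [0, 1, 2, 3]

# Build the schedule: which micro batch occupies each stage at each time step.
schedule = []                        # (time_step, stage, micro_batch) trace
for t in range(len(micro_batches) + len(stages) - 1):
    for s in range(len(stages)):
        m = t - s                    # micro batch in stage s at time t
        if 0 <= m < len(micro_batches):
            schedule.append((t, s, m))

# The values each micro batch produces after passing through all stages.
results = []
for x in micro_batches:
    for f in stages:
        x = f(x)
    results.append(x)
```

At time step 2 the trace shows all three stages busy at once, each on a different micro batch, which is the fill-then-parallel behaviour pipelining provides.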
- Pod management services
- PopART
The Poplar advanced run-time (PopART) provides support for importing, creating and running ONNX graphs on the IPU.
- Poplar
The Graphcore software tools and libraries for graph programming on IPUs. Enables the programmer to write a single program that defines the graph to be executed on the IPU devices and the controlling code that runs on the host. The device code is compiled and loaded onto the IPUs ready for execution.
- Poplar Distributed Configuration Library
PopDist provides an API to make applications ready for distributed execution. Command line parameters passed to PopRun are exposed and can be used to distribute the input/output data or other parts of the applications. PopDist is bundled with the Poplar SDK.
- Poplar graph compiler
The component of Poplar that compiles graph programs for the IPU. This is run implicitly when a Poplar program is executed.
- Poplar graph engine
The run-time component of Poplar that provides support for running graph programs on the IPU.
- Poplar instance
Poplar translates a framework program into code that runs on IPUs. Each process using a single version of the Poplar SDK is a Poplar instance. A program that runs on four GCDs, for example, requires four Poplar instances, one per GCD.
- Poplar SDK
The package of software development tools for the Graphcore IPU. It includes Poplar, PopLibs, PopART, PopTorch, PopDist and PopRun, among other components.
- Poplar server
- PopLibs
A set of libraries in Poplar for the IPU that provide common operations required in machine learning frameworks and applications.
- PopRun
A command line utility to launch distributed applications on Pods. PopRun creates multiple instances, each of which can run on a single host server or multiple host servers. This includes remote host servers with larger Pod configurations such as an IPU‑POD256, where the remote host servers are physically located in an interconnected Pod. PopRun is required for any Pod system larger than an IPU‑POD64 or Bow Pod64 and is bundled with the Poplar SDK.
See also Poplar instance.
- PopTorch
Provides a simple wrapper around PyTorch programs to enable them to be run on the IPU.
- PopVision
A suite of graphical analysis and debugging tools. For more information see the PopVision Tools web page.
- Recomputation
In a multi-layer network, activations are computed from layer to layer and are typically saved as intermediate results. Recomputation optimises the use of IPU memory by recomputing some values required on the backward pass. This can massively reduce the amount of In-Processor-Memory used, at the cost of some extra computation.
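A minimal sketch of the idea in plain Python (not Poplar; `layer`, `forward` and `recompute` are illustrative names): only every k-th activation is checkpointed, and the rest are recomputed from the nearest checkpoint when needed.

```python
def layer(x):
    # Stand-in for one layer's forward computation.
    return x + 1

def forward(x, n_layers, k):
    """Run n_layers, keeping only checkpoints (every k-th activation)."""
    checkpoints = {0: x}
    for i in range(n_layers):
        x = layer(x)
        if (i + 1) % k == 0:
            checkpoints[i + 1] = x
    return x, checkpoints

def recompute(checkpoints, i, k):
    """Recover the activation after layer i from the nearest checkpoint."""
    base = (i // k) * k
    x = checkpoints[base]
    for _ in range(i - base):   # extra compute traded for memory
        x = layer(x)
    return x

out, ckpts = forward(0, n_layers=8, k=4)   # stores 3 values instead of 9
act5 = recompute(ckpts, 5, k=4)            # recomputed, not stored
```

Here memory drops from one activation per layer to one per k layers, at the cost of up to k-1 extra forward steps per recovered activation.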
- Remote buffer
- Replica batch size
The number of samples that contribute to a weight update from a single replica. The replica batch size equals the micro batch size multiplied by the number of gradient accumulation iterations.
See also Replicated graph.
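The relationships between the batch-size terms in this glossary can be checked with a few lines of arithmetic (the numbers are illustrative, not from any particular system):

```python
micro_batch_size = 8                 # samples per forward/backward pass
gradient_accumulation_count = 4      # micro batches per weight update
num_replicas = 16                    # data-parallel replicas

# Replica batch size: samples contributing to a weight update per replica.
replica_batch_size = micro_batch_size * gradient_accumulation_count

# Global batch size: samples contributing to a weight update overall.
global_batch_size = replica_batch_size * num_replicas
```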
- Replicated graph
A replicated graph creates a number of identical copies, or replicas, of the same graph. Each replica targets a different subset of the available tiles (all subsets are the same size). Any change made to the replicated graph, such as adding variables or vertices, will affect all the replicas.
See also Virtual graph.
- Replicated tensor sharding
Storing tensors across replicas by slicing them into equal per-replica “shards” to reduce the memory required.
If a tensor in a replicated graph with replication factor R has the same value on each replica (which is not necessarily the case), you can save memory by storing just a fraction (1/R) of the tensor on each replica. When the full tensor is required all R shards can be broadcast to all the replicas.
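The memory arithmetic can be sketched with NumPy (illustrative only; this is not the GCL API): a tensor that is identical on every replica is stored as 1/R per-replica shards and gathered back (an all-gather) when the full tensor is needed.

```python
import numpy as np

R = 4                                   # replication factor
full = np.arange(16, dtype=np.float32)  # identical on every replica

shards = np.array_split(full, R)        # replica r stores only shards[r]
per_replica_bytes = shards[0].nbytes    # 1/R of the full tensor's memory

gathered = np.concatenate(shards)       # all-gather reconstructs the tensor
```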
- Session
Within frameworks such as TensorFlow and PopART, the software interface to code running on an engine. It defines the runtime state consisting of the compiled graph and the values of any variables used by the graph program.
- Sharding
The process of dividing a model up by placing parts of the model on separate IPUs. This is a method of distributing a model that is too large to fit on an IPU. Since there are usually data dependencies between the parts of the model, execution will not be very efficient unless pipelining is used.
See Direct attach.
- Streaming Memory
Memory external to the IPUs used by ML applications for data storage. This could be host memory reserved for use by the IPUs or dedicated IPU memory.
See also Remote buffer.
- Superstep
One iteration of the bulk-synchronous parallel execution model, consisting of synchronisation, communication (exchange) and compute phases. Sometimes just referred to as a “step”.
- Supervisor code
The code responsible for initiating worker threads and for performing the exchange and synchronisation phases of a step. Supervisor code cannot perform floating point operations.
- Sync
A system-wide synchronisation; the first phase in a superstep, following which it is safe to perform an exchange phase. Synchronisation can be internal (between all of the tiles on a single IPU) or external (between all tiles on every IPU). External sync is done via dedicated Sync-Link connections.
- Sync-Link
A connection provided between IPU-Machines to allow synchronisation between all the IPUs.
- System Analyser
- Tensor
A tensor is a variable that contains a multi-dimensional array of values. In the IPU, the storage of a tensor can be distributed across the tiles. The data is then operated on, in parallel, by the vertex code running on the tiles.
- Tile
An individual processor core in the IPU consisting of a processing unit and memory. All tiles are connected to the exchange fabric.
- Torus configuration
A mesh connection where the IPU-Links at each end are connected back to form a closed loop.
See also Mesh configuration.
- Vertex
A unit of computation in the graph; consists of code that runs on a tile. Vertices have inputs and outputs that are connected to tensors, and are associated with a codelet that defines the processing performed on the tensor data. Each vertex is stored and executed on a single tile.
- Virtual graph
A graph is normally created for a physical target with a specific number of tiles. It is possible to create a new graph from that, which is a virtual graph for a subset of the tiles. This is effectively a new view onto the parent graph for a virtual target, which has a subset of the real target’s tiles and can be treated like a new graph.
See also Replicated graph.
- Virtual-IPU
The IPU-specific part of the V-IPU software can run on an IPU-Machine, when used in direct attach mode in a Pod DA system, or it can run on a server in a switched Pod.
- vPOD
By default, the complete Pod is also a single vPOD, hence any operations that relate to a vPOD can, in general, also apply to a complete Pod.
The V-IPU software is used to create and manage vPODs.
- Worker
Code that can perform floating point operations and is typically responsible for performing the compute phase of a step. A tile has hardware support for multiple worker contexts.
Note: this should not be confused with the TensorFlow definition of worker (processes that can make use of multiple hardware resources).