- AI-Float
AI-Float consists of three core technologies:
Industry-standard 16- and 32-bit IEEE floating-point arithmetic
Hardware support for stochastic rounding
A configurable, dot-product AI-Float arithmetic block in the tile
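The second of these, stochastic rounding, rounds a value up or down with probability proportional to its distance from each representable neighbour, so rounding errors average out instead of accumulating as a bias. A minimal sketch of the idea, rounding to integers rather than float16 for simplicity (this is illustrative, not Graphcore's hardware implementation):

```python
import math
import random

def stochastic_round(x: float) -> int:
    """Round x up or down with probability proportional to proximity:
    2.3 rounds to 2 with p = 0.7 and to 3 with p = 0.3."""
    lower = math.floor(x)
    frac = x - lower                 # distance above the lower neighbour
    return lower + (1 if random.random() < frac else 0)

# Over many operations the expected value equals x itself, so the
# rounding error does not accumulate as a systematic bias.
samples = [stochastic_round(2.3) for _ in range(100_000)]
mean = sum(samples) / len(samples)
```

Averaged over the 100,000 trials above, `mean` comes out close to 2.3, which is the property that makes stochastic rounding useful for low-precision training.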
- Batch serialisation
A batch of samples is normally processed in parallel. With batch serialisation, the batch is divided into sub-batches (based on a batch serialisation factor) and only one sub-batch is processed in parallel at a time. The sub-batches are processed serially, one after another.
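For example, with a hypothetical batch of 8 samples and a batch serialisation factor of 4, each sub-batch of 2 samples is processed in turn (a sketch of the idea only; this is not a Poplar API):

```python
def serialise_batch(batch, factor):
    """Split a batch into `factor` sub-batches and yield them in sequence."""
    sub_size = len(batch) // factor
    for i in range(factor):
        yield batch[i * sub_size:(i + 1) * sub_size]

batch = list(range(8))
# Each sub-batch is processed in parallel; the four sub-batches run serially.
sub_batches = list(serialise_batch(batch, factor=4))
```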
- Bulk-synchronous parallel
A programming methodology for parallel algorithms which is used on the IPU. Execution for the IPU consists of supersteps, each made up of three phases: synchronization, communication and local compute.
Leslie G. Valiant. 1990. A bridging model for parallel computation. Commun. ACM 33, 8 (August 1990), 103-111. DOI=10.1145/79173.79181
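The superstep structure can be sketched with ordinary Python threads; the tiles, inbox and barrier below are stand-ins for illustration, not Poplar constructs:

```python
from threading import Barrier, Thread

N_TILES = 4
STEPS = 3
barrier = Barrier(N_TILES)     # models the synchronisation phase
inbox = [0] * N_TILES          # models the exchange fabric

def tile(tile_id, results):
    value = tile_id
    for _ in range(STEPS):
        barrier.wait()                            # 1. sync: all tiles arrive
        inbox[(tile_id + 1) % N_TILES] = value    # 2. exchange: send to neighbour
        barrier.wait()                            #    sync again before reading
        value = inbox[tile_id] + 1                # 3. local compute on received data
    results[tile_id] = value

results = [0] * N_TILES
threads = [Thread(target=tile, args=(i, results)) for i in range(N_TILES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# After STEPS supersteps, tile j holds the initial value of tile
# (j - STEPS) % N_TILES plus STEPS.
```

The barriers make the phase boundaries explicit: no tile reads the exchanged data until every tile has finished writing, which is exactly the guarantee BSP provides.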
- C2 card
Graphcore’s dual-IPU PCIe card with two GC2 Colossus IPUs. It provides 250 teraFLOPS of mixed-precision compute, with 192 GB/s IPU-Link bandwidth between the IPUs and 128 GB/s card-to-card IPU-Links. Maximum power consumption is 300 W.
- Cluster
A logical grouping of IPUs.
- Codelet
A piece of code that defines the inputs, outputs and internal state of a vertex. It contains a compute() function that defines the behaviour of the vertex.
- Colossus
The current architecture of the Graphcore IPU. It consists of an array of thousands of IPU tiles with In-Processor-Memory, and IPU-Links for chip-to-chip communication. It is designed for parallel processing using the BSP model.
- Compute batch
The number of samples for which activations/gradients are computed in parallel.
- Compute set
A set of vertices in a graph that can be executed in parallel.
- Direct attach
One or more IPU-Machines can be used in “direct attach” mode where the IPU-Machines are directly controlled from the user’s computer. The IPU-specific part of the V-IPU software runs on an IPU-Machine, rather than on a separate server.
- Dynamic graph
- Dynamic shape
The input to a graph can have variable length or shape. For models such as BERT, the maximum sequence size is known but the actual input is dynamic within that range.
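One common way to handle a dynamic input with a compiled, fixed-shape graph is to pad each input up to the known maximum. A minimal sketch (the pad token value and the token IDs are made up for illustration):

```python
MAX_SEQ_LEN = 8
PAD_TOKEN = 0

def pad_to_max(tokens, max_len=MAX_SEQ_LEN, pad=PAD_TOKEN):
    """Pad a variable-length token sequence to the fixed maximum length."""
    if len(tokens) > max_len:
        raise ValueError("input longer than the compiled maximum")
    return tokens + [pad] * (max_len - len(tokens))

# A 4-token input becomes a fixed 8-element tensor the graph can accept.
padded = pad_to_max([101, 2023, 2003, 102])
```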
- Eager execution
Dynamic graph execution (or dispatch) where each operation is individually compiled, dispatched and executed when required.
- Edge
The edges of a computation graph define the connections between elements of tensors and the vertices of the graph.
- Exchange
The communication phase of a superstep, where data is communicated between tiles and between IPUs. An exchange can be internal (between tiles on a single IPU) or external (see External exchange).
- Exchange fabric
- Exchange Memory
The memory available to an IPU program: the In-Processor-Memory on the tiles plus external Streaming Memory.
- External exchange
An exchange phase where data is communicated between tiles and memory outside the IPU. This can be:
Inter-IPU exchange (between IPUs; also known as global exchange)
Host exchange (between IPUs and the host)
Streaming Memory exchange (between IPUs and Streaming Memory)
See also IPU-Fabric.
- Graph compile domain (GCD)
A subset of IPUs within a GSD which are controlled by a single Poplar instance. “GCD size” is the number of IPUs in the GCD. When a program is executed, the Poplar instance binaries may be replicated and loaded on to multiple GCDs (one Poplar instance per GCD).
- Graphcore Communication Library (GCL)
A software library for managing communication and synchronization between IPUs, supporting ML at scale. GCL-based IPU communication and synchronization can be established across any IPU-Fabric, supporting IPU-POD topologies such as mesh configuration and Torus configuration.
- Global batch
The number of samples that contribute to a weight update across all replicas.
- Global exchange
See Inter-IPU exchange.
- Graph Analyser
See PopVision Graph Analyser.
- Graph compiler
See Poplar graph compiler.
- Graph engine
See Poplar graph engine.
- Graph streaming
A set of techniques that allows IPUs to make efficient use of Exchange Memory. This includes the intelligent placement of variables and weights, and the use of Streaming Memory for scatter/gather and reductions.
- Graph scaleout domain (GSD)
The set of IPUs used to execute a program, consisting of one or more GCDs. All the GCDs together form the GSD, with GCL managing all the communication and synchronization between the separate IPUs and GCDs. The “GSD size” is the number of IPUs in the GSD.
GSD is equivalent to the whole partition and GSD size is the partition size.
See also vPOD.
- GW-Link
A networking interface implemented by an IPU gateway that can provide IPU-IPU connectivity through another IPU gateway, either via a directly connected link or via a switching infrastructure.
- GW-Link cluster
- Half precision
A 16-bit floating-point value.
- Head node
See Poplar server.
- Host exchange
Communication between an IPU and the server running the host-side part of the Poplar program.
- Host memory
Memory on the host server that can be accessed by the IPU (see Exchange Memory).
- Host-Link
The communication path between the host computer and the IPUs. This may be a direct connection, such as PCIe, or a high-speed network, such as 100 Gigabit Ethernet (100 GbE).
- IPU-Link Domain
An IPU-Link Domain (ILD) is a set of IPUs that are connected with IPU-Links. The IPUs must be within a single IPU-POD. The maximum size of a single ILD is 64 IPUs; multiple ILDs are used to form multi-ILD clusters. The term “multi-ILD partition” can also be used to mean a partition that spans multiple ILDs and uses GW-Links.
There has to be at least one GCD per ILD.
- In-Processor-Memory
The tile memory. This memory can be directly accessed by worker threads during the compute phase of a program.
- Inter-IPU exchange
Communication between tiles on different IPUs.
- Intelligence Processing Unit
An Intelligence Processing Unit (IPU) is a massively parallel accelerator pioneered by Graphcore for machine learning (ML) and artificial intelligence applications.
Graphcore’s current implementation of the IPU is Colossus.
- IPU gateway
The IPU gateway manages communication on and off the IPU-Machine board via the IPU-Links that connect IPU-Machines. It also manages transfers between the IPUs and local Streaming Memory on the IPU-Machine.
- IPU-Core
The tile’s processing unit.
- Internal exchange
Communication on the exchange fabric internal to the IPU.
- IPU-Link
Communication links between IPUs.
- IPU-Machine: M2000
A 1U IPU-Machine containing four Colossus GC200 IPUs, providing 1 petaFLOPS of compute, up to 260 GB of Exchange Memory, 2.8 Tbps low-latency IPU-Fabric interconnect, and an IPU gateway controller supporting host disaggregation. One or more IPU-M2000s can work as a direct attached system, or they can be built into a rack system as an IPU-POD.
- IPU-POD
A collection of interconnected IPU-Machines. An IPU-POD DA (Direct Attach) system has no switches and runs the management software on one of the IPU-Machines. A switched IPU-POD has one or more servers and networking switches. An IPU-POD allows all the IPUs in the IPU-Machines to communicate and synchronize using IPU-IPU connections. The IPUs can be partitioned into “virtual IPU-PODs” using the V-IPU software.
- IPU-POD management services
- IPU over Fabric
Software that allows a Poplar server to control and feed data to a program executing on one or more IPUs using remote DMA (RDMA). The IPUoF software has components on both the server and the IPU-Machine.
- Management server
A physical server, virtual machine or container implementing the higher, component-independent layers of the IPU-POD management services.
- Mesh configuration
IPUs can be connected in a 2D array with their IPU-Links. This is normally a 2 x N array, rather like a ladder with a pair of IPUs on either side of each “rung”.
See also Torus configuration.
- Micro batch
The number of samples calculated in one full forward/backward pass of the algorithm.
- Partials
Partials are the intermediate values in a computation. For example, four numbers (or tensors) a, b, c and d might be added like this:
tmp1 = a + b
tmp2 = c + d
total = tmp1 + tmp2
In this case, tmp1 and tmp2 are the partials.
- Partition
An isolated group of one or more IPUs within a single vPOD that can be controlled by one or more Poplar hosts within the vPOD, and that can be used for single or multiple ML workloads.
The creation and management of partitions is done by using the V-IPU software. Partitions can be reconfigurable or non-reconfigurable.
- Ping pong
An execution model where two groups, each consisting of one or more IPUs, alternate between processing and host exchange. This allows larger models, which would otherwise not fit in IPU memory, to be handled.
- Pipeline depth
The normal meaning is the number of stages in a processing pipeline. It is also used, in some of our APIs, to refer to the number of samples passed through the pipeline before a weight update.
- Pipelining
Pipelining is a way of parallelising execution by splitting a model across multiple IPUs. Each stage of processing is mapped to a different IPU, and each IPU works on a different batch of samples.
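The resulting schedule can be sketched as follows: at each time step, every stage (IPU) that has work processes a different micro batch; the pipeline ramps up, runs with all stages busy, then drains. The function name is made up for illustration:

```python
def pipeline_schedule(num_stages, num_batches):
    """Return, for each time step, the (batch, stage) pairs running in parallel."""
    steps = []
    for t in range(num_batches + num_stages - 1):
        active = [(b, t - b) for b in range(num_batches)
                  if 0 <= t - b < num_stages]
        steps.append(active)
    return steps

# With 2 stages (IPUs) and 3 micro batches: fill, full, full, drain.
schedule = pipeline_schedule(num_stages=2, num_batches=3)
```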
- PopART
The Poplar advanced run-time (PopART) provides support for importing, creating and running ONNX graphs on the IPU.
- Poplar Distributed Configuration Library (PopDist)
PopDist provides a set of APIs used to make applications ready for distributed execution. Command line parameters passed to PopRun are exposed and can be used to distribute the input/output data or other parts of the application. PopDist is bundled with the Poplar SDK.
- Poplar
The Graphcore software tools and libraries for graph programming on IPUs. Poplar enables the programmer to write a single program that defines both the graph to be executed on the IPU devices and the controlling code that runs on the host. The device code is compiled and loaded onto the IPUs ready for execution.
- Poplar graph compiler
The component of Poplar that compiles graph programs for the IPU. This is run implicitly when a Poplar program is executed.
- Poplar graph engine
The run-time component of Poplar that provides support for running graph programs on the IPU.
- Poplar instance
Poplar translates a framework program into code that runs on IPUs. Each process using a single version of the Poplar SDK is a Poplar instance. For a program that needs to run on four GCDs, there would be four Poplar instances, one per GCD.
- Poplar SDK
The package of software development tools for the Graphcore IPU.
- Poplar server
- PopLibs
A set of libraries in Poplar for the IPU that provide common operations required in machine learning frameworks and applications.
- PopRun
A command line utility to launch distributed applications on IPU-PODs. PopRun creates multiple instances, each of which can run on a single host server or on multiple host servers. This includes remote host servers in larger IPU-POD configurations such as an IPU‑POD128, where the remote host servers are physically located in an interconnected IPU-POD. PopRun is required for any IPU-POD system larger than an IPU‑POD64 and is bundled with the Poplar SDK.
- PopTorch
Provides a simple wrapper around PyTorch programs to enable them to be run on the IPU.
- PopVision
A suite of graphical analysis and debugging tools. The first of these is the PopVision Graph Analyser, for profiling and performance analysis.
- Recomputation
In a multi-layer network, activations are computed layer by layer and are typically saved as intermediate results. Recomputation optimises the use of IPU memory by recomputing some of the values required on the backward pass instead of storing them. This can massively reduce the amount of In-Processor-Memory used, at the cost of some extra computation.
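The trade-off can be sketched with a toy network: instead of storing every activation on the forward pass, only a checkpoint is kept, and the intermediate activations are recomputed from it when the backward pass needs them (illustrative only, not the Poplar implementation):

```python
def layer(x):
    """A stand-in for one network layer."""
    return 2 * x + 1

def forward_with_recompute(x0, n_layers):
    """Store only the input checkpoint; recompute activations on demand."""
    checkpoint = x0                     # the only value kept in memory
    x = x0
    for _ in range(n_layers):
        x = layer(x)                    # intermediate activations discarded
    output = x

    # Backward pass: recompute the activations from the checkpoint
    # instead of having stored them, trading extra compute for memory.
    acts = [checkpoint]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    return output, acts

output, acts = forward_with_recompute(1, 3)
```

Each layer is evaluated twice, but at no point are all activations and the forward pass live in memory at once.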
- Remote buffer
A buffer in Streaming Memory that the IPU can copy data to and from.
- Replica batch
The number of samples that contribute to a weight update from a single replica.
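Assuming each replica accumulates gradients over several micro batches before a weight update (a common setup, though not required by these definitions), the batch sizes are related as follows; the numbers are made up for illustration:

```python
micro_batch = 4            # samples per forward/backward pass
gradient_accumulation = 8  # micro batches accumulated before a weight update
num_replicas = 16          # data-parallel copies of the model

# Replica batch: samples contributing to a weight update on one replica.
replica_batch = micro_batch * gradient_accumulation
# Global batch: samples contributing to a weight update across all replicas.
global_batch = replica_batch * num_replicas
```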
- Replicated graph
A replicated graph creates a number of identical copies, or replicas, of the same graph. Each replica targets a different subset of the available tiles (all subsets are the same size). Any change made to the replicated graph, such as adding variables or vertices, will affect all the replicas.
Note: This is not the same as the TensorFlow use of the term.
See also Virtual graph.
- Replicated tensor sharding
Storing tensors across replicas by slicing them into equal per-replica “shards” to reduce the memory required.
If a tensor in a replicated graph with replication factor R has the same value on each replica (which is not necessarily the case), you can save memory by storing just a fraction (1/R) of the tensor on each replica. When the full tensor is required all R shards can be broadcast to all the replicas.
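A sketch of the idea with plain Python lists (in practice the broadcast is a collective across replicas, e.g. via GCL; the helper names here are made up):

```python
def shard(tensor, num_replicas):
    """Each replica stores only its 1/R slice of an identical tensor."""
    size = len(tensor) // num_replicas
    return [tensor[r * size:(r + 1) * size] for r in range(num_replicas)]

def all_gather(shards):
    """When the full tensor is needed, every replica receives all shards."""
    full = [x for s in shards for x in s]
    return [full for _ in shards]      # one reassembled copy per replica

tensor = list(range(8))
shards = shard(tensor, num_replicas=4)  # each replica now holds 2 elements
gathered = all_gather(shards)           # every replica again sees all 8
```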
- Session
Within frameworks such as TensorFlow and PopART, the software interface to code running on an engine. It defines the run-time state, consisting of the compiled graph and the values of any variables used by the graph program.
- Sharding
The process of dividing a model by placing parts of it on separate IPUs. This is a method of distributing a model that is too large to fit on a single IPU. Since there are usually data dependencies between the parts of the model, execution will not be very efficient unless pipelining is used.
See Direct attach.
- Streaming Memory
External memory used by an IPU for data storage. This could be host memory reserved for use by the IPU or dedicated IPU memory.
See also Remote buffer.
- Superstep
One unit of execution in the BSP model, made up of a synchronisation phase, an exchange (communication) phase and a compute phase. Sometimes just referred to as a “step”.
- Supervisor code
The code responsible for initiating worker threads and for performing the exchange and synchronisation phases of a step. Supervisor code cannot perform floating-point operations.
- Sync
A system-wide synchronisation; the first phase in a superstep, after which it is safe to perform an exchange phase. Synchronisation can be internal (between all the tiles on a single IPU) or external (between all tiles on every IPU). External sync is done via dedicated Sync-Link connections.
- Sync-Link
A connection provided between IPU-M2000s to allow synchronisation between all the IPUs.
- Tensor
A tensor is a variable that contains a multi-dimensional array of values. In the IPU, the storage of a tensor can be distributed across the tiles. The data is then operated on, in parallel, by the vertex code running on the tiles.
- Tile
An individual processor core in the IPU, consisting of a processing unit and memory. All tiles are connected to the exchange fabric.
- Torus configuration
A mesh connection where the IPU-Links at each end are connected back to form a closed loop.
See also Mesh configuration.
- V-IPU
The IPU-Machine or IPU-POD parts of the IPU-POD management services that implement the allocation, provisioning and monitoring of IPUs and related infrastructure for machine-learning workloads in the IPU-POD.
The IPU-specific part of the V-IPU software can run on an IPU-Machine, when used in direct attach mode in an IPU-POD DA system, or it can run on a server in a switched IPU-POD.
- Vertex
A unit of computation in the graph, consisting of code that runs on a tile. Vertices have inputs and outputs that are connected to tensors, and are associated with a codelet that defines the processing performed on the tensor data. Each vertex is stored and executed on a single tile.
- Virtual graph
A graph is normally created for a physical target with a specific number of tiles. It is possible to create a new graph from that, which is a virtual graph for a subset of the tiles. This is effectively a new view onto the parent graph for a virtual target, which has a subset of the real target’s tiles and can be treated like a new graph.
See also Replicated graph.
- vPOD
By default, the complete IPU-POD is also a single vPOD; hence any operations that relate to a vPOD can, in general, also apply to a complete IPU-POD.
The V-IPU software is used to create and manage vPODs.
- Worker
Code that can perform floating-point operations and is typically responsible for performing the compute phase of a step. A tile has hardware support for multiple worker contexts.
Note: this should not be confused with the TensorFlow definition of worker (processes that can make use of multiple hardware resources).