2. Understanding the IPU programming model
Before optimising performance on the IPU, it is necessary to understand the IPU programming model. This section summarises its core concepts; further details can be found in the IPU Programmer’s Guide.
2.1. Core concepts for IPU programming
The way the IPU is programmed is determined by the features of the IPU hardware and by the software used to develop machine learning models.
The IPU is organised into multiple processing cores called tiles. Each tile can be viewed as an independent processor that executes a tile-specific program and has access to local SRAM on the tile (called In-Processor Memory). All tiles are connected to the exchange fabric, the communication network used to transfer data between tiles within an IPU; exchanges can also be distributed over multiple IPUs. The communication bandwidth during exchange is very high: approximately 47 TB/s within an IPU and 8 TB/s between IPUs.
The IPU is programmed through software abstractions provided by the Poplar graph programming framework.
Two core concepts for programming IPUs are:
The bulk-synchronous parallel (BSP) model of execution. This decomposes execution into three phases: local compute, global synchronisation, and data exchange.
The graph representation of computations. The Poplar graph programming framework operates on a computational graph in which vertices represent operations and edges represent their input and output data; a minimal sketch of building and running such a graph is shown below.
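To make these two concepts concrete, the following sketch builds a small computational graph with the Poplar C++ API and runs it. It is a minimal illustration only: the tensor shapes, variable names and the use of the IPUModel simulator are choices made for this example, and copying data to or from the host is omitted.

    #include <poplar/Engine.hpp>
    #include <poplar/Graph.hpp>
    #include <poplar/IPUModel.hpp>
    #include <poplar/Program.hpp>
    #include <popops/ElementWise.hpp>
    #include <popops/codelets.hpp>
    #include <poputil/TileMapping.hpp>

    int main() {
      // Create a simulated IPU device (no hardware is required for this sketch).
      poplar::IPUModel ipuModel;
      poplar::Device device = ipuModel.createDevice();

      // The graph holds the vertices (operations) and edges (tensors).
      poplar::Graph graph(device.getTarget());
      popops::addCodelets(graph);

      // Declare two variables and spread them evenly over the tiles.
      poplar::Tensor a = graph.addVariable(poplar::FLOAT, {1024}, "a");
      poplar::Tensor b = graph.addVariable(poplar::FLOAT, {1024}, "b");
      poputil::mapTensorLinearly(graph, a);
      poputil::mapTensorLinearly(graph, b);

      // Adding an operation extends the graph. The compute, synchronisation and
      // exchange phases of the BSP schedule are generated by the graph compiler
      // rather than written by hand.
      poplar::program::Sequence prog;
      poplar::Tensor c = popops::add(graph, a, b, prog, "a_plus_b");
      (void)c; // copying the result back to the host is omitted for brevity

      // Constructing the Engine compiles the whole graph into per-tile programs.
      poplar::Engine engine(graph, prog);
      engine.load(device);
      engine.run(0);
      return 0;
    }

The same pattern applies to real hardware: acquiring a device through poplar::DeviceManager replaces the IPUModel step.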
2.2. Main differences from GPU programming
For engineers coming from a GPU programming background, the main differences in IPU programming are:
The graph is compiled statically. This means that dynamic tensor access is not easily performed and comes with a memory cost.
Model parallelism (such as sharding a model) is often required for large-scale models.
The IPU programmer is more involved in controlling how the IPU memory is used, for example how it is partitioned across tiles and how tensors are allocated (see the sketch after this list).
Because the graph and tensor layouts are static, the code generated to perform communication during the exchange phase grows as the total number of tensors increases.
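As an illustration of this explicit memory control, the short fragment below partitions a tensor across tiles by hand instead of relying on a helper such as poputil::mapTensorLinearly. It is a sketch, not a complete program: it assumes a poplar::Graph named graph as in the earlier example, and the tensor size, tile count and chunk size are arbitrary choices.

    // Partition a 1024-element tensor over the first 16 tiles, 64 contiguous
    // elements per tile, instead of using a default linear mapping.
    poplar::Tensor t = graph.addVariable(poplar::FLOAT, {1024}, "t");
    const unsigned numTiles = 16;
    const std::size_t chunk = t.numElements() / numTiles; // 64 elements per tile
    for (unsigned tile = 0; tile < numTiles; ++tile) {
      graph.setTileMapping(t.slice(tile * chunk, (tile + 1) * chunk), tile);
    }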
2.3. Factors affecting model performance
Model performance is driven by:
Efficient use of compute cycles: the aim is to maximise the use of the compute capabilities of the IPU.
Efficient use of memory: the aim is to minimise the total memory required and to maximise the bandwidth at which the used memory is accessed.
Efficient use of communication with the host: the aim is to maximise the bandwidth and minimise the latency of data transfers between the host and the IPUs, as sketched below.
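Host-device transfers are expressed in Poplar through data streams. The fragment below is a sketch under the assumptions of the earlier examples (it reuses graph, prog, engine and the tensor t); the stream handle "input" and the buffer size are arbitrary choices for illustration.

    // Feed the device tensor t from a host buffer through a FIFO data stream.
    std::vector<float> hostBuffer(1024, 0.0f);
    poplar::DataStream inStream =
        graph.addHostToDeviceFIFO("input", poplar::FLOAT, 1024);
    prog.add(poplar::program::Copy(inStream, t));
    // After the Engine has been constructed and loaded:
    engine.connectStream("input", hostBuffer.data());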
2.4. PopVision tools
Profiling the compiled program and its execution can provide insights into possible optimisations of memory use, compute cycles and communication. The PopVision Graph Analyser helps with understanding how the graph is distributed and run on the IPU. The PopVision System Analyser breaks down the host-side activity around execution on the IPU, such as graph compilation and host-IPU communication.
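One common way to capture a profile for these tools is to enable Poplar's auto-report engine options when the Engine is created, as in the hedged fragment below; the report directory name is an arbitrary choice, and the same options can also be supplied through the POPLAR_ENGINE_OPTIONS environment variable without changing the code.

    // Ask Poplar to write profiling reports that the PopVision tools can open.
    // Report files are written to ./report (directory name chosen for this example).
    poplar::OptionFlags engineOptions{
        {"autoReport.all", "true"},
        {"autoReport.directory", "./report"}};
    poplar::Engine engine(graph, prog, engineOptions);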
For more information about the profiling tools, see the PopVision tools web page.