2. Model Runtime overview

This chapter introduces the general concepts of running a model on the IPU, and describes the main Model Runtime components and their roles in the process. The high-level ModelRunner API wraps the functionality described in this chapter.

2.1. Finding hardware

You can run any computation model whose full definition is stored in a PopEF file. The first step is to find out what hardware resources are needed (namely the number of IPUs, their architecture version and supported features) and whether these resources are available in your system. The Model Runtime DeviceManager class lets you specify the exact hardware requirements for running the model, and then selects matching hardware from the available resources.
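
A minimal sketch of acquiring a device could look like the following. The header path and the getDevice() call are assumptions made for illustration; consult the Model Runtime API reference for the exact interface.

  // Sketch: asking DeviceManager for hardware that satisfies the
  // model's requirements. getDevice() and the header path are
  // assumed names, used for illustration only.
  #include <model_runtime/DeviceManager.hpp>

  int main() {
    model_runtime::DeviceManager deviceManager;

    // Request hardware matching the model's needs (number of IPUs,
    // architecture version, supported features).
    auto device = deviceManager.getDevice(/*numIpus=*/1);

    // `device` can now be passed on to a Session (Section 2.2).
    return device != nullptr ? 0 : 1;
  }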

2.2. Data transfers

Your model, compiled into a Poplar executable (stored as PopEF), needs to be transferred to the IPU. This is the first data transfer step (Section 2.3, Executable upload).

During model execution, the model inputs have to be sent from the host to the IPU, and the model outputs have to be returned from the IPU to the host.

To manage all these data streaming activities, Model Runtime provides a set of tools in the Session class. Session also supports verification of the model parameters.
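
A minimal sketch of constructing a Session is shown below; the constructor signature, the Device type and the header paths are assumptions, not the documented API.

  // Sketch: creating a Session that manages all transfers for a
  // model stored in PopEF. The constructor arguments are assumptions.
  #include <memory>
  #include <string>
  #include <vector>
  #include <model_runtime/Session.hpp>

  void makeSession(std::shared_ptr<model_runtime::Device> device) {
    // PopEF file(s) containing the compiled model.
    const std::vector<std::string> popefPaths{"my_model.popef"};

    // The Session deserializes the model, uploads the executable
    // (Section 2.3) and manages host-to-IPU streaming.
    model_runtime::Session session(popefPaths, device);
  }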

2.3. Executable upload

A Session object deserializes the model from PopEF and loads it onto the IPU.

In most cases a model consists of at least three types of Poplar program (a typical call order is sketched after this list):

  • A Load program uploads weights or other constants to the IPU memory. This is executed once before the first Main program call.

  • A Main program contains the model computational graph implementation.

  • A Save program is used to transfer the selected tensor data back to the host. This type of program is not used in most inference use cases.
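
A hedged sketch of that call order follows; runLoadPrograms() and runMainPrograms() are assumed method names, used only to illustrate the sequencing.

  // Sketch of the usual program order; method names are assumptions.
  void runInference(model_runtime::Session &session, int numBatches) {
    // Load: upload weights and constants once, before the first Main.
    session.runLoadPrograms();

    // Main: execute the model's computational graph, usually many times.
    for (int i = 0; i < numBatches; ++i) {
      session.runMainPrograms();
    }

    // A Save program would go here if selected tensors had to be
    // read back; most inference workloads skip it.
  }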

For more information about how a model is run, see Section 4.4, Running programs.

2.4. Tensor data

Every model operates on data, which can be constants, variables, inputs or outputs. These are stored as tensors and can be made available to the running model in several ways (a sketch follows the list):

  • In the case of model weights, or other model constants, values may be compiled into the executable. This means that all information is transferred to the IPU hardware in one step.

  • The data can be stored as PopEF TensorData. In this case, Model Runtime needs to set up the transfer of data to the IPU hardware before the first model execution (to be precise, a Poplar data stream is connected to the PopEF TensorData, creating a direct data-transfer channel operated by Poplar).

  • You can provide the data explicitly. This can be used to override TensorData and is the most general way of providing model inputs.
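
As a rough illustration of the third option, the sketch below prepares a user-owned buffer for an input tensor. The shape is made up, and the interfaces that bind such a buffer to a tensor are covered in Sections 2.5 and 2.6.

  // Sketch: user-managed memory intended to override a tensor's
  // PopEF TensorData. The shape is illustrative only.
  #include <vector>

  std::vector<float> makeInputBuffer() {
    // Compiled-in constants (first option) and TensorData (second
    // option) need no action here; explicit data (third option) is
    // allocated and filled by your program, then bound to the
    // tensor by name.
    std::vector<float> input(3 * 224 * 224, 0.0f);
    return input;
  }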

2.5. Managing data sources and targets

To create data transfer channels between your host program and the IPU, Model Runtime uses Poplar's mechanism for asynchronous data transfer, which is based on host callback functions.

To transfer data during model execution, the IPU communicates with the Poplar runtime to initiate the transfer. Poplar then calls the callback function, passing a pointer to the memory from which it will read, or to which it will write, the tensor data. This call is blocking, so Poplar waits for the callback to complete before continuing with execution.
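
For reference, this is roughly what the underlying mechanism looks like when poplar::Engine::connectStreamToCallback() is used directly. Model Runtime normally sets this up for you, and the stream handle used here is made up.

  // Sketch of the raw Poplar callback mechanism that Model Runtime
  // wraps. The stream handle "input_x" is a made-up example.
  #include <cstddef>
  #include <cstring>
  #include <poplar/Engine.hpp>

  void connectInput(poplar::Engine &engine, const float *hostData,
                    std::size_t numBytes) {
    engine.connectStreamToCallback("input_x", [=](void *ipuBuffer) {
      // Poplar calls this when the IPU requests the next input and
      // blocks execution until the copy completes.
      std::memcpy(ipuBuffer, hostData, numBytes);
    });
  }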

There are generally two sources of data for a tensor:

  • data stored in a PopEF TensorData

  • data you provide directly

The Session tensor manager prepares and connects all the callbacks responsible for transferring the PopEF TensorData (if there is any, and you have not excluded some or all of it from this auto-binding mechanism using predicates).

For other data (both inputs and outputs) that you transfer explicitly, and that does not come directly from PopEF, Session provides an interface to set up the connections. QueueManager is responsible for queuing your data so as to achieve the best possible performance in the general case. To simplify the creation and basic configuration of a QueueManager, Session provides a factory method, createQueueManager(), which constructs a QueueManager object and registers the session with it.
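
A sketch of obtaining a QueueManager is shown below; the factory method is described above, but its return type (assumed here to be a smart pointer) is an assumption.

  // Sketch: obtaining a QueueManager from a Session via the factory
  // method. The return type is assumed to be a smart pointer.
  void setUpQueues(model_runtime::Session &session) {
    // Constructs a QueueManager with basic configuration and
    // registers the session with it.
    auto queueManager = session.createQueueManager();

    // queueManager now exposes the input and output queues
    // (Section 2.6).
  }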

2.6. Queues of data

A naive strategy of serial model execution (prepare the input, send the input to the IPU, wait for the output, fetch the output from the IPU, repeat) is not sufficient to achieve good performance in real-life applications.

Because data preparation and model execution on the IPU can be pipelined, QueueManager provides a mechanism for buffering inputs and outputs. Tensors may be added to and fetched from the queues asynchronously, so that your program can prepare inputs in parallel (where possible) and add them to the queue. At the same time, and independently, the IPU keeps fetching consecutive inputs and fills the output queues with the computed results.

The QueueManager class provides access to the lists of input and output queues. You enqueue transfers by providing pointers to the user-managed memory where the input data resides or to which the output data is to be written. This means that Model Runtime doesn't have to create extra copies of this data. Data is transferred from user memory to the IPU (or the other way round) exactly when the IPU needs it (or when a result is ready to be written back to the host). In this process, the only blocking factor is a lack of data in a queue, which means that your application is not able to provide data at a fast enough rate to keep the IPU busy all the time.
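
The sketch below models this pointer-only contract with a stand-in Queue type; it is not the QueueManager interface, only an illustration of the behaviour described above.

  // Illustrative stand-in, not the Model Runtime API: a queue that
  // stores pointers to user memory rather than copies of the data.
  #include <cstddef>
  #include <utility>
  #include <vector>

  struct Queue {
    std::vector<std::pair<void *, std::size_t>> slots;

    void enqueue(void *data, std::size_t bytes) {
      // Only the pointer and size are stored; the caller's buffer
      // must stay alive until the IPU has consumed it.
      slots.push_back({data, bytes});
    }
  };

  void pumpInput(Queue &inputQueue, std::vector<float> &batch) {
    // `batch` is user-managed memory holding the next input; no
    // extra copy is made when it is enqueued.
    inputQueue.enqueue(batch.data(), batch.size() * sizeof(float));
  }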

2.7. Buffers

QueueManager does not store the tensor data buffers themselves, but only pointers to them (together with some other metadata). Even though this is a much smaller amount of data, operations on the queues are very frequent, so the cost of adding and removing elements should be as low as possible and, preferably, independent of the queue's state.

Because of these requirements, QueueManager organizes tensor queues as ring buffers. Each Tensor object has a dedicated constant-size buffer that is filled with data up to its capacity and drained in the same order (FIFO). When the last slot is filled, writing wraps around to the beginning of the buffer, reusing slots that have already been drained. This is an asynchronous process: while your program enqueues more tensors, the IPU keeps consuming them.

The RingBuffer class is a single-producer, single-consumer thread-safe structure. As QueueManager is the buffer's only user, with exactly one producer and one consumer accessing it asynchronously, this profile provides a suitable level of safety without undue impact on performance.
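
To make the idea concrete, here is a minimal single-producer, single-consumer ring buffer. It illustrates the general technique only and is not the Model Runtime RingBuffer implementation.

  // Illustrative SPSC ring buffer, not the Model Runtime RingBuffer.
  // One thread calls push(), another calls pop(); the atomics make
  // this safe without locks. One slot is kept empty to tell a full
  // buffer apart from an empty one.
  #include <array>
  #include <atomic>
  #include <cstddef>

  template <typename T, std::size_t Capacity>
  class SpscRingBuffer {
  public:
    bool push(const T &item) {          // producer thread only
      const auto head = head_.load(std::memory_order_relaxed);
      const auto next = (head + 1) % Capacity;
      if (next == tail_.load(std::memory_order_acquire))
        return false;                   // full: caller retries later
      slots_[head] = item;
      head_.store(next, std::memory_order_release);
      return true;
    }

    bool pop(T &item) {                 // consumer thread only
      const auto tail = tail_.load(std::memory_order_relaxed);
      if (tail == head_.load(std::memory_order_acquire))
        return false;                   // empty: caller retries later
      item = slots_[tail];
      tail_.store((tail + 1) % Capacity, std::memory_order_release);
      return true;
    }

  private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
  };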