2. Model Runtime under the hood
This chapter briefly introduces the general concepts of running a model on the IPU, describing the main Model Runtime components and their responsibilities in the process.
2.1. Finding hardware
The user can provide any computation model (with its full definition stored in the PopEF file) and the first step is to figure out what hardware resources need to be provided (namely the number of IPU accelerators, their generation and supported features) and whether such hardware is available in the user's system. The Model Runtime model_runtime::DeviceManager class lets the user pick a model_runtime::Device suitable for the model and specify the exact characteristics the device must satisfy.
2.2. Data transfers
The user model compiled into the Poplar executable (stored as PopEF) needs to be transferred to the IPU. This is the first data transfer step. For regular model execution, the model inputs have to be sent from the host to the IPU and the model outputs have to be fetched back from the IPU to the host.
To gather and control all the data streaming duties, Model Runtime provides a set of tools dedicated to these tasks, contained in the Session class. The Session API hides several complex operations that take place under the hood; it also verifies the model parameters by providing wrappers around the components' functionality.
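A minimal sketch of creating a Session is shown below; the constructor argument (a list of PopEF file paths) is an assumption for illustration only.

```cpp
// Sketch: constructing a Session bound to a PopEF model. The exact
// constructor signature is an assumption, not the verbatim API.
#include <model_runtime/Session.hpp>

#include <string>
#include <vector>

void buildSession(const std::vector<std::string> &popef_paths) {
  model_runtime::Session session(popef_paths);
  // From this point, the Session manages executable upload and all
  // host/IPU data transfers for the model.
}
```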
2.3. Executable upload
To upload the model executable onto the device, Session uses the Model Runtime Executable object to deserialize the model from the PopEF and finally load it onto the IPU.
In most cases the user model consists of at least three types of programs:
Load: for uploading weights or other constants to the IPU memory. This is supposed to be executed once, before the first Main program call.
Main: the model computational graph implementation.
Save: used to transfer the selected tensors' data back to the host. This program is unused in most inference use cases.
Executable provides an API for controlling the execution (run, stop) of these programs.
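The sketch below illustrates this Load/Main/Save flow; the runProgram() and stop() names are hypothetical placeholders for the actual Executable interface.

```cpp
// Hypothetical sketch of program control through Executable;
// runProgram() and stop() are illustrative names only.
void runInference(model_runtime::Executable &executable, int num_batches) {
  executable.runProgram("Load"); // upload weights once, before first Main
  for (int i = 0; i < num_batches; ++i)
    executable.runProgram("Main"); // execute the computational graph
  // The Save program is typically skipped in inference-only use cases.
  executable.stop();
}
```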
2.4. Tensor data
Every model operates on data, which can be provided in different ways:
In the case of model weights, or other model constants, values may be compiled into the executable. This means that all information is transferred to the IPU in one step.
Another option is to store the data in PopEF's TensorData. In this case, Model Runtime needs to set up the transfer of data to the device before the first model execution (to be precise: a Poplar Engine stream gets connected to the proper PopEF TensorData, which creates a direct data transfer channel operated by Poplar).
The last option is data provided explicitly by the user. This may be used for overriding TensorData and is the general way to provide regular model inputs.
Each logical model constant, variable, input and output is organized in the form of a tensor, and its transfer can be managed by the internal tensor manager object.
2.5. Managing the data sources/targets
To create data transfer channels between the user program and the IPU device, Model Runtime uses Poplar's mechanism of asynchronous data-feed functions: callbacks.
During model execution, when a data transfer step is to be triggered, the IPU communicates with the Poplar runtime to initiate the transfer. The runtime then calls the proper callback and passes it a pointer to the memory where the target tensor bytes are to be written (input tensor) or from which they can be fetched (output tensor). This call is blocking, so Poplar needs to receive the data to continue with execution.
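Model Runtime sets these callbacks up internally, but the underlying Poplar mechanism looks roughly like the sketch below; the stream handle "input-x" is a hypothetical name.

```cpp
// Sketch of the Poplar callback mechanism that Model Runtime builds on.
#include <poplar/Engine.hpp>

#include <cstring>
#include <vector>

void connectInput(poplar::Engine &engine,
                  const std::vector<float> &host_data) {
  // "input-x" is a hypothetical stream handle. Poplar invokes the lambda
  // when the IPU requests the data; `dst` points to the memory the bytes
  // must be copied into, and execution blocks until the callback returns.
  engine.connectStreamToCallback("input-x", [&host_data](void *dst) {
    std::memcpy(dst, host_data.data(), host_data.size() * sizeof(float));
  });
}
```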
There are generally two sources of data for a model tensor: PopEF's TensorData, which stores the values, and data provided directly by the user.
The Session's internal tensor manager prepares and connects all the callbacks responsible for transferring the PopEF TensorData (if there is any, and if the user did not exclude some or all of it from this auto-binding using the predicates mechanism).
For other data (both inputs and outputs) that is transferred explicitly by the user and does not come directly from PopEF itself, Session provides an interface to set up the connections. How the user data is queued to provide the best possible performance (in the general case) is the responsibility of QueueManager. To simplify the creation and basic configuration of QueueManager, Session provides a factory method (createQueueManager()) that constructs a QueueManager object and registers itself in that object.
2.6. Queues of data
A naive strategy that simply serializes the model execution stages (prepare an input, send it to the IPU, wait for the output, fetch it from the device, repeat) is not sufficient to achieve good model execution performance in real-life applications.
As the processes of preparing user data and executing the model on the IPU may be pipelined, QueueManager provides a mechanism for buffering user inputs and outputs. The tensors in the queues may be enqueued and fetched asynchronously, so the user program can prepare inputs in parallel (if possible) and add them to the queue. At the same time, but independently, the IPU keeps fetching the consecutive inputs and fills the output queues with computation results.
The QueueManager API provides access to the lists of input and output queues. The user enqueues data to and from these queues by providing pointers to the user-managed memory where the data resides or is to be written, so Model Runtime avoids extra copying. Data gets transferred to the device from the user memory (or the other way round) exactly when it is needed by the IPU (or when it is ready to be written back to the user). In this process, the only blocking factor is a lack of data in a queue, which means that the user is not able to provide data at a fast enough rate to keep the IPU busy all the time.
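A hypothetical sketch of this flow follows; createQueueManager() is the factory method described above, while inputQueue(), outputQueue() and enqueue() are illustrative names standing in for the actual QueueManager interface.

```cpp
// Hypothetical sketch of the zero-copy queueing flow; the queue-access
// and enqueue method names are illustrative, not the verbatim API.
#include <cstddef>
#include <vector>

void pipeline(model_runtime::Session &session,
              std::size_t input_size, std::size_t output_size) {
  auto queue_manager = session.createQueueManager();

  // User-managed memory: Model Runtime stores only pointers to it,
  // so no extra copies are made.
  std::vector<float> input(input_size);
  std::vector<float> output(output_size);

  // ... fill `input` with the next batch ...
  queue_manager->inputQueue("x").enqueue(input.data());   // data to send
  queue_manager->outputQueue("y").enqueue(output.data()); // space to fill

  // The IPU consumes `input` and writes `output` asynchronously,
  // exactly when the corresponding transfer steps are reached.
}
```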
2.7. Buffers
QueueManager does not store the tensor data buffers, but instead stores pointers to them (plus some other metadata). Even though these are a sort of "light data", operations on the queues are very frequent, so the extra cost of adding and removing elements should be as low as possible, and preferably independent of the queue's state.
Given these requirements, QueueManager organizes tensor queues in the form of ring buffers. Each tensor has a dedicated constant-size buffer that gets filled with data up to its capacity and drained in the same order (FIFO). When the last cell in the queue gets filled, the algorithm wraps around and continues from the very beginning of the queue. Since this is an asynchronous process, the IPU keeps consuming tensors while the user enqueues more of them.
The RingBuffer class is a single-producer, single-consumer thread-safe structure. As there is only one user of the buffer (QueueManager) that accesses it in an asynchronous manner, this profile provides a suitable level of safety without too much impact on performance.
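To make the structure concrete, here is a simplified, self-contained single-producer/single-consumer ring buffer. It illustrates the general technique only and is not Model Runtime's actual RingBuffer implementation.

```cpp
// Simplified SPSC ring buffer: the producer and consumer may run on
// different threads without locks, matching the profile described above.
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>
class SpscRingBuffer {
public:
  // Producer side: returns false when the buffer is full.
  bool tryPush(const T &item) {
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire))
      return false; // full
    slots_[head] = item;
    head_.store(next, std::memory_order_release);
    return true;
  }

  // Consumer side: returns std::nullopt when the buffer is empty.
  std::optional<T> tryPop() {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire))
      return std::nullopt; // empty
    T item = slots_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return item;
  }

private:
  std::array<T, Capacity> slots_{};
  std::atomic<std::size_t> head_{0}; // written only by the producer
  std::atomic<std::size_t> tail_{0}; // written only by the consumer
};
```

One slot is deliberately left unused so that head == tail unambiguously means "empty"; this common SPSC trade-off keeps both operations lock-free and constant-time, which matches the state-invariant cost requirement above.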