2. Model Runtime overview
This chapter introduces the general concepts of running a model on the IPU, and describes the main Model Runtime components and their roles in the process. The high-level ModelRunner API wraps the functionality described in this chapter.
2.1. Finding hardware
You can run any computation model whose full definition is stored in a PopEF file. The first step is to find out what hardware resources the model needs (namely the number of IPUs, their architecture version and supported features) and whether those resources are available in your system. The Model Runtime DeviceManager class lets you specify the exact hardware requirements for running the model, and selects matching hardware from the available resources.
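As a sketch, acquiring a matching device could look like the following. The header path and the getDevice overload taking PopEF paths are assumptions about the API, not confirmed signatures; consult the DeviceManager reference for the exact interface.

    #include <memory>
    #include <string>
    #include <vector>

    #include <model_runtime/DeviceManager.hpp>  // assumed header path

    int main() {
      const std::vector<std::string> popefPaths = {"my_model.popef"};

      // DeviceManager reads the hardware requirements (number of IPUs,
      // architecture version, supported features) and picks a matching
      // device from the available resources.
      model_runtime::DeviceManager deviceManager;

      // Hypothetical overload: request a device that satisfies the
      // requirements recorded in the given PopEF file.
      std::shared_ptr<model_runtime::Device> device =
          deviceManager.getDevice(popefPaths);

      return 0;
    }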
2.2. Data transfers
Your model, compiled into a Poplar executable (stored as PopEF), needs to be transferred to the IPU. This is the first data transfer step (Section 2.3, Executable upload).
During model execution, the model inputs have to be sent from the host to the IPU, and the model outputs have to be returned from the IPU to the host.
To manage all these data streaming activities, Model Runtime provides a set of tools in the Session class. Session also provides support for verification of the model parameters.
2.3. Executable upload
A Session object deserializes the model from PopEF and loads it onto the IPU.
In most cases a model consists of at least three types of Poplar program:
- A Load program uploads weights or other constants to the IPU memory. It is executed once, before the first Main program call.
- A Main program contains the implementation of the model's computational graph.
- A Save program transfers selected tensor data back to the host. This type of program is not used in most inference use cases.
For more information about how a model is run, see Section 4.4, Running programs.
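Putting Sections 2.1 to 2.3 together, a minimal flow might look like the sketch below. The Session constructor, bindToDevice, runLoadPrograms and runMainPrograms names are assumptions about the API; only the overall sequence (acquire a device, load the executable, run Load once, then run Main) is taken from the text above.

    #include <string>
    #include <vector>

    #include <model_runtime/DeviceManager.hpp>  // assumed header paths
    #include <model_runtime/Session.hpp>

    int main() {
      const std::vector<std::string> popefPaths = {"my_model.popef"};

      model_runtime::DeviceManager deviceManager;
      auto device = deviceManager.getDevice(popefPaths);  // hypothetical

      // Deserialize the model from PopEF and load it onto the device.
      model_runtime::Session session(popefPaths);  // hypothetical ctor
      session.bindToDevice(device);                // hypothetical method

      // Load program: upload weights/constants once, before the first Main.
      session.runLoadPrograms();  // hypothetical name

      // Main program: execute the model's computational graph.
      session.runMainPrograms();  // hypothetical name

      return 0;
    }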
2.4. Tensor data
Every model operates on data which can be constants, variables, inputs or outputs. These are stored as tensors, and can be made available to the running model in several ways:
- In the case of model weights, or other model constants, the values may be compiled into the executable. This means that all the information is transferred to the IPU hardware in one step.
- The data can be stored as PopEF TensorData (see the sketch after this list). In this case, Model Runtime needs to set up the transfer of the data to the IPU hardware before the first model execution (to be precise, a Poplar data stream gets connected to the PopEF TensorData, which creates a direct data-transfer channel operated by Poplar).
- You can provide the data explicitly. This can be used to override TensorData and is the most general way of providing model inputs.
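For instance, you can inspect which tensors a PopEF file carries data for using the popef reader library. The parse entry point and accessor names below are assumptions about the popef API, used here only for illustration.

    #include <iostream>

    #include <popef/Reader.hpp>  // assumed header path

    int main() {
      // Assumed parse entry point and accessors; check the popef
      // reference for the real names.
      popef::Reader reader;
      reader.parseFile("my_model.popef");

      // Tensors listed here carry their data as PopEF TensorData; anything
      // else must be compiled into the executable or provided explicitly.
      for (const auto &tensor : reader.tensorDatas()) {
        std::cout << tensor.info.name() << "\n";
      }
      return 0;
    }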
2.5. Managing data sources and targets
To create data transfer channels between your host program and the IPU, Model Runtime uses the Poplar mechanism of asynchronous data transfer using a host callback function.
To transfer data during model execution, the IPU communicates with the Poplar runtime to initiate the transfer. Poplar then calls the callback function, passing a pointer to the memory where it will read or write the tensor data. This call is blocking, so Poplar waits until the callback completes before continuing with execution.
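This is Poplar's stream-callback mechanism. As a rough illustration (Model Runtime normally sets these callbacks up for you), connecting a host callback to an input stream on a Poplar engine looks like this; the stream handle is a placeholder.

    #include <cstring>
    #include <vector>

    #include <poplar/Engine.hpp>

    // Illustration only: "input_stream" is a placeholder for a real stream
    // handle taken from the compiled graph.
    void connectInput(poplar::Engine &engine,
                      const std::vector<float> &hostData) {
      engine.connectStreamToCallback(
          "input_stream", [&hostData](void *ipuBuffer) {
            // Poplar invokes this when the IPU requests the next input.
            // The call blocks execution until the data is in place.
            std::memcpy(ipuBuffer, hostData.data(),
                        hostData.size() * sizeof(float));
          });
    }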
There are generally two sources of data for a tensor:
- data stored in a PopEF TensorData
- data you provide directly
The Session tensor manager prepares and connects all the callbacks responsible for transferring the PopEF TensorData (if there are any, and you did not exclude some or all of them from this auto-binding mechanism using predicates).
When it comes to other data (both inputs and outputs) that you transfer explicitly, and that does not come directly from PopEF, Session provides an interface to set up the connections. Queuing your data to achieve the best possible performance (in the general case) is the responsibility of QueueManager. To simplify the creation and basic configuration of QueueManager, Session provides a factory method (createQueueManager()) that constructs a QueueManager object and registers itself in that object.
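In code, obtaining the queues from an existing session is a single call. The factory method createQueueManager() appears in the text above; its exact return type is an assumption here.

    #include <model_runtime/QueueManager.hpp>  // assumed header paths
    #include <model_runtime/Session.hpp>

    // Sketch: the factory method constructs a QueueManager and registers
    // the session in it, so the queues come already bound to the session's
    // data streams.
    auto makeQueueManager(model_runtime::Session &session) {
      return session.createQueueManager();
    }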
2.6. Queues of data
A naive strategy of serial model execution (prepare the input, send the input to the IPU, wait for the output, fetch the output from the IPU, repeat) is not sufficient to achieve good performance in real-life applications.
Because data preparation and model execution on the IPU can be pipelined, QueueManager provides a mechanism for buffering inputs and outputs. Tensors may be enqueued and fetched asynchronously, so your program can prepare inputs in parallel (if possible) and add them to the queue. At the same time, and independently, the IPU keeps fetching consecutive inputs and fills the output queues with the computed results.
The QueueManager class provides access to the lists of input and output queues. This allows you to enqueue transfers to or from the queues, providing pointers to the user-managed memory where the data resides or is to be written. This means that Model Runtime doesn't have to create extra copies of the data. Data is transferred to the IPU from the user memory (or the other way round) exactly when the IPU needs it (or when the result is ready to be written back to your program). In this process, the only blocking factor is a lack of data in the queue, which means that your program is not able to provide data at a fast enough rate to keep the IPU busy all the time.
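A sketch of enqueuing one input and one output against user-managed buffers follows. The queue accessor and enqueue method names are assumptions used for illustration; the key point, that only pointers are stored and no extra copies are made, comes from the text above.

    #include <cstddef>
    #include <vector>

    #include <model_runtime/QueueManager.hpp>  // assumed header path

    // Assumed names: inputQueue()/outputQueue() accessors and an
    // enqueue() taking a raw pointer plus size.
    void pumpOneBatch(model_runtime::QueueManager &queueManager,
                      std::vector<float> &input,
                      std::vector<float> &output) {
      // Only the pointer is stored: `input` is user-managed memory and
      // must stay valid until the IPU has consumed it.
      queueManager.inputQueue("input_tensor")
          .enqueue(input.data(), input.size() * sizeof(float));

      // The result is written directly into `output` when the IPU
      // produces it, with no extra copies made by Model Runtime.
      queueManager.outputQueue("output_tensor")
          .enqueue(output.data(), output.size() * sizeof(float));
    }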
2.7. Buffers
QueueManager does not store the tensor data buffers themselves, but instead stores pointers to them (together with some other metadata). Even though this is a much smaller amount of data, operations on the queues are very frequent, so the extra cost of adding elements to them, and removing elements from them, should be as low as possible and preferably state-invariant.
Because of these requirements, QueueManager organizes tensor queues as ring buffers. Each Tensor object has a dedicated constant-size buffer that gets filled with data up to its capacity and drained in the same order (FIFO). When the last slot in the queue gets filled, the algorithm wraps around and starts reusing slots from the beginning of the queue. This is an asynchronous process so, while your program enqueues more tensors, the IPU keeps consuming them.
The RingBuffer class is a single-producer, single-consumer thread-safe structure. As there is only one user of the buffer (QueueManager) that can access it asynchronously, this profile provides a suitable level of safety without too much impact on performance.
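To illustrate the concept (this is not Model Runtime's actual RingBuffer implementation), a minimal single-producer, single-consumer ring buffer in C++ might look like this:

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <optional>

    // Minimal SPSC ring buffer sketch. One thread calls push(), another
    // calls pop(); the atomic indices make this safe without locks.
    template <typename T, std::size_t Capacity>
    class SpscRingBuffer {
    public:
      bool push(const T &value) {
        const auto head = head_.load(std::memory_order_relaxed);
        const auto next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
          return false;  // full: the producer must wait
        slots_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
      }

      std::optional<T> pop() {
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
          return std::nullopt;  // empty: the consumer must wait
        T value = slots_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return value;
      }

    private:
      std::array<T, Capacity> slots_{};
      std::atomic<std::size_t> head_{0};  // next slot the producer writes
      std::atomic<std::size_t> tail_{0};  // next slot the consumer reads
    };

The constant capacity and the fixed producer/consumer roles are what keep push and pop cheap and state-invariant, matching the requirements described above.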