2. API reference

Knowing the details of the Triton backend API is crucial to understanding the comments and implementation below.

2.1. BackendState

class BackendState

Backend state: this state is shared amongst all the models and model states.

Public Functions

BackendState() = default
BackendState(const char *options_buf, size_t options_len)

Public Members

std::mutex poplar_init_mutex

The mutex that the model states need to acquire before using the non-thread-safe parts of Poplar (for example, poplar::Engine creation).

bool check_package_hash = true

Flag controlling whether the loaded PopEF is checked for hash compatibility with the active Poplar SDK.
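The mutex serialises the non-thread-safe Poplar calls made by different model instances. Below is a minimal sketch of that pattern. It assumes a hypothetical `BackendStateSketch` struct that mirrors only the `poplar_init_mutex` and `check_package_hash` members, and a hypothetical `CreateEngineSerialized()` that stands in for real `poplar::Engine` construction.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for BackendState: only the members relevant here.
struct BackendStateSketch {
  std::mutex poplar_init_mutex;
  bool check_package_hash = true;
  int engines_created = 0;  // hypothetical counter standing in for Poplar work
};

// Hypothetical helper: model instances call this instead of using the
// non-thread-safe Poplar API directly.
void CreateEngineSerialized(BackendStateSketch &backend) {
  // Hold the shared mutex for the whole non-thread-safe section.
  std::lock_guard<std::mutex> lock(backend.poplar_init_mutex);
  // ... poplar::Engine construction would go here ...
  ++backend.engines_created;
}

// Simulate several model instances initialising concurrently.
int InitInstances(BackendStateSketch &backend, int num_instances) {
  std::vector<std::thread> threads;
  for (int i = 0; i < num_instances; ++i) {
    threads.emplace_back([&backend] { CreateEngineSerialized(backend); });
  }
  for (auto &t : threads) t.join();
  return backend.engines_created;
}
```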

2.2. ModelState

class ModelState : public BackendModel

State associated with a model that is using this backend.

An object of this class is created and associated with each TRITONBACKEND_Model.

A model state is shared among all the instances of a model.

Public Functions

~ModelState()
void ValidateModelConfig()

Validate that the model configuration is supported by this backend.

Note

This method must be called before calling any of the other methods from this class.

inline uint64_t ExecDelay() const

For Triton testing: delay in milliseconds to wait before processing the requests.

Value comes from the Triton ‘execute_delay_ms’ parameter.

Returns

Execution delay in ms [milliseconds]

Pre

Only available after ValidateModelConfig() has been called.

inline uint64_t DelayMultiplier() const

For Triton testing: delay multiplier per instance.

That is, total_delay = max(DelayMultiplier() * InstanceId(), 1) * ExecDelay().

Value comes from the Triton ‘instance_wise_delay_multiplier’ parameter.

Returns

Delay multiplier value

Pre

Only available after ValidateModelConfig() has been called.
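Here is a concrete case of the delay formula. With execute_delay_ms = 10 and a multiplier of 2, instance 3 waits max(2 × 3, 1) × 10 = 60 ms, and instance 0 still waits 10 ms. The sketch below uses a hypothetical free function `TotalDelayMs()`. In the backend, the three values come from ExecDelay(), DelayMultiplier() and InstanceId().

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper implementing the documented formula:
//   total_delay = max(DelayMultiplier() * InstanceId(), 1) * ExecDelay()
uint64_t TotalDelayMs(uint64_t exec_delay_ms, uint64_t delay_multiplier,
                      uint64_t instance_id) {
  // max(..., 1) applies the base delay at least once, even for
  // instance 0 or a multiplier of 0.
  return std::max<uint64_t>(delay_multiplier * instance_id, 1) * exec_delay_ms;
}
```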

const std::map<std::string, popef::TensorInfo> &ModelInputs() const

Map of input tensor details.

Returns

Map:

  • key (std::string): input tensor name

  • value (popef::TensorInfo): input tensor details

Pre

Only available after ValidateModelConfig() has been called.

const std::map<std::string, popef::TensorInfo> &ModelOutputs() const

Map of output tensor details.

Returns

Map:

  • key (std::string): output tensor name

  • value (popef::TensorInfo): output tensor details

Pre

Only available after ValidateModelConfig() has been called.

const std::set<std::string> &OutputNames() const

Names of the model outputs.

Returns

Set of output name strings

Pre

Only available after ValidateModelConfig() has been called.

std::unique_ptr<popef::Model> LoadModel() const

Create a new instance of the popef::Model.

Note

PopEF models are not thread safe: each model instance should own its own popef::Model.

Returns

Pointer (unique_ptr) to model

Pre

Only available after ValidateModelConfig() has been called.

void CreationDelay()

For Triton testing: block the thread based on the value of the Triton ‘creation_delay_sec’ parameter.

BackendState *Backend()

Return the backend associated with this model.

pvti::Graph &GetGraph(const std::string &name, const std::string &unit = "")

Get or create a pvti graph.

Parameters
  • name[in] The graph name. This name will be prefixed by the model name.

  • unit[in] The graph unit.

Returns

Reference to Graph
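"Get or create" with a model-name prefix amounts to a lookup in a map keyed by the prefixed name. The minimal sketch below has two assumptions. It uses a hypothetical `GraphStub` type in place of `pvti::Graph`, and it uses `"/"` as the separator between the model name and the graph name. The actual prefix format is not documented here.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for pvti::Graph.
struct GraphStub {
  std::string name;
  std::string unit;
};

// Hypothetical per-model registry illustrating get-or-create semantics.
class GraphRegistry {
 public:
  explicit GraphRegistry(std::string model_name)
      : model_name_(std::move(model_name)) {}

  // Return the graph for `name`, creating it on first use.
  GraphStub &GetGraph(const std::string &name, const std::string &unit = "") {
    const std::string full_name = model_name_ + "/" + name;  // assumed format
    auto it = graphs_.find(full_name);
    if (it == graphs_.end()) {
      it = graphs_.emplace(full_name, GraphStub{full_name, unit}).first;
    }
    return it->second;
  }

 private:
  std::string model_name_;
  std::map<std::string, GraphStub> graphs_;  // std::map keeps references stable
};
```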

std::chrono::nanoseconds TimeoutInNanoseconds() const

Time to wait before flushing the pipeline when there are no inputs left in the queues but some requests are still in flight.

Returns

Timeout in ns [nanoseconds]
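One common way to implement this kind of timeout is `std::condition_variable::wait_for` on the input queue. If nothing arrives within the timeout and requests are still in flight, the caller flushes the pipeline. The minimal sketch below works under that assumption. `InputQueue` and `WaitForInputOrTimeout()` are hypothetical and are not the backend's actual code.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>

// Hypothetical queue of pending inputs, guarded by a mutex.
struct InputQueue {
  std::mutex mutex;
  std::condition_variable cv;
  std::deque<int> items;  // placeholder for real inputs
};

// Returns true if the caller should flush the pipeline: the queue stayed
// empty for the whole timeout while requests were still in flight.
bool WaitForInputOrTimeout(InputQueue &queue, std::chrono::nanoseconds timeout,
                           bool requests_in_flight) {
  std::unique_lock<std::mutex> lock(queue.mutex);
  const bool got_input = queue.cv.wait_for(
      lock, timeout, [&queue] { return !queue.items.empty(); });
  return !got_input && requests_in_flight;
}
```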

void IncrementInstanceCount()

Increment the number of instances.

int InstanceCount() const

Return the current number of instances.

Returns

Number of instances

bool UseSynchronousExecution() const

If true, the backend should wait until all the requests’ responses have been sent before returning from TRITONBACKEND_ModelInstanceExecute.

Returns

Synchronous execution usage flag

Public Static Functions

static void Create(TRITONBACKEND_Model *triton_model, BackendState *backend, ModelState **state)

Static factory method that creates a ModelState.

Parameters
  • triton_model[in] Model the state is associated with.

  • backend[in] The backend the state is associated with.

  • state[out] The state to initialise.

2.3. InputBuffers

using triton::backend::poplar::InputBuffers = std::map<std::string, const void*>

Map input names to data buffers.

2.4. OutputBuffers

using triton::backend::poplar::OutputBuffers = std::map<std::string, void*>

Map output names to data buffers.

2.5. ModelInstanceState

class ModelInstanceState : public BackendModelInstance

State associated with an instance of a model.

An object of this class is created and associated with each TRITONBACKEND_ModelInstance.

Public Functions

~ModelInstanceState()
inline ModelState *StateForModel() const

Get the state of the model that corresponds to this instance.

Returns

Pointer to model’s state

inline int InstanceId() const

Get the ID of the instance.

Returns

ID

void EnqueueRequest(const InputBuffers &inputs, size_t batch_size, const std::function<void(void)> &callback = nullptr)

Enqueue a request.

Note

The callback might be called multiple times (if the input is prefetched and the prefetch is later discarded).

Parameters
  • inputs[in] Buffers to use for the inputs. (Must contain a buffer for each input.)

  • batch_size[in] Batch size of the request.

  • callback[in] Optional callback to call when the first input is being read.

Pre

The mutex returned by QueuesMutex() must be held.

Post

For each call to EnqueueRequest() a corresponding call to SetRequestOutputs() must be made.

std::mutex &QueuesMutex()

Mutex protecting the queues: must be acquired before calling EnqueueRequest().

Returns

Mutex

void SetRequestOutputs(const OutputBuffers &outputs, size_t batch_size, const std::function<void(void)> &callback = nullptr)

Set the corresponding output buffers’ pointers for the last enqueued request.

Parameters
  • outputs[in] Buffers to use for the outputs. (Must contain a pointer for each output; if a pointer is nullptr, that output is discarded.)

  • batch_size[in] Batch size of the request.

  • callback[in] Optional callback to call once after the last output has been written.

void AddValuesToGraph(int64_t batch_size, double exec_ns, double compute_ns)

Add a set of measurements for a collection of requests to the PVTI graph.

Parameters
  • batch_size[in] Total batch size of the collection.

  • exec_ns[in] Average execution time for a single batch.

  • compute_ns[in] Average compute time for a single batch.

void UnlockSessionThread()

Unlock the session thread, which is blocked after the timeout callback.

void UpdateLastCollection(std::weak_ptr<RequestCollection> requests)

Keep track of the last collection: as long as this weak_ptr has not expired, some requests are still in flight.
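The in-flight check relies on standard `std::weak_ptr` semantics. The weak pointer expires once the last `std::shared_ptr` that owns the collection is released. The minimal sketch below uses two hypothetical stand-ins: `CollectionStub` for RequestCollection, and `InFlightTracker` for the instance-side bookkeeping.

```cpp
#include <memory>
#include <utility>

struct CollectionStub {};  // hypothetical stand-in for RequestCollection

// Hypothetical tracker mirroring UpdateLastCollection().
class InFlightTracker {
 public:
  void UpdateLastCollection(std::weak_ptr<CollectionStub> requests) {
    last_collection_ = std::move(requests);
  }
  // True while some shared_ptr to the last collection is still alive,
  // i.e. some request in it has not yet released its reference.
  bool RequestsInFlight() const { return !last_collection_.expired(); }

 private:
  std::weak_ptr<CollectionStub> last_collection_;
};
```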

Public Static Functions

static void CreateAndRun(ModelState *model_state, TRITONBACKEND_ModelInstance *triton_model_instance, ModelInstanceState **state)

Allocate a model instance state and start the SessionRunThread.

Parameters
  • model_state[in] The model that the instance is associated with.

  • triton_model_instance[in] Triton instance of the model.

  • state[out] Backend instance of the model.

static void CreateEmptyState(ModelState *model_state, TRITONBACKEND_ModelInstance *triton_model_instance, ModelInstanceState **state)

Allocate an empty model instance state, that is, one for which no SessionRunThread is created.

Parameters
  • model_state[in] The model that the instance is associated with.

  • triton_model_instance[in] Triton instance of the model.

  • state[out] Backend instance of the model.

2.6. RequestCollection

class RequestCollection

Represents a collection of requests received by one call to TRITONBACKEND_ModelInstanceExecute.

Each lambda pushed to the Session queue to execute a request holds a reference to the collection and releases it once that request has sent its response.

Once all the requests are completed, the collection releases all the memory associated with them and reports statistics for the entire batch to the server.

Public Functions

explicit RequestCollection(uint64_t request_count, uint64_t exec_start_ns, ModelInstanceState *instance_state)

Initialise a collection of requests.

Parameters
  • request_count[in] Reserve memory for this number of requests.

  • exec_start_ns[in] Timestamp for when the backend received this collection of requests. If 0: don’t report stats for this collection.

  • instance_state[in] Instance state this collection of requests is associated with.

Post

addRequest() and addBatch() must both be called request_count times.

template<typename ...Args>
inline Request &addRequest(Args&&... args)
void addBatch(uint64_t batch_size) noexcept

Must be called once per request in the collection to indicate the batch size of each individual request.

~RequestCollection()

Implicitly called when the last request in the collection releases its reference to the collection.

Release the memory used by the requests and responses in the collection and report the statistics for this collection of requests to the server.
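The lifetime described above follows standard `std::shared_ptr` reference counting. Each queued lambda captures a copy of the pointer, and the collection's destructor runs when the last copy is released. The minimal sketch below uses a hypothetical `CollectionSketch` whose destructor only sets a flag. The real class frees the requests and reports statistics to the Triton server at that point.

```cpp
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in: the destructor marks when "statistics" are reported.
struct CollectionSketch {
  explicit CollectionSketch(bool *reported) : reported_(reported) {}
  ~CollectionSketch() { *reported_ = true; }  // real class: free + report stats
  bool *reported_;
};

// Hypothetical helper: push one lambda per request; each holds a reference.
std::vector<std::function<void()>> MakeQueue(bool *reported, int request_count) {
  auto collection = std::make_shared<CollectionSketch>(reported);
  std::vector<std::function<void()>> queue;
  for (int i = 0; i < request_count; ++i) {
    // Destroying the lambda (after its response is sent) drops its reference.
    queue.emplace_back([collection] { /* execute request, send response */ });
  }
  return queue;  // the local shared_ptr is released; the lambdas keep it alive
}
```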

2.7. Request

class Request

Store a Triton request and its associated response.

Note

Ownership of both the Triton request and the Triton response is transferred to this object, which releases them once the response has been sent.

Public Functions

Request(TRITONBACKEND_Response *response, TRITONBACKEND_Request *request, uint32_t request_index, Latch *latch = nullptr)

Constructor

Parameters
  • response[in] Triton response associated with this request. Ownership is transferred to this object.

  • request[in] Incoming Triton request. Ownership is transferred to this object.

  • request_index[in] Index of this request in the batch received by the backend.

  • latch[in] Optional latch to notify once all the requests’ responses have been sent.

~Request()
Request(const Request&) = delete

Prevent requests from being copied as they contain raw pointers which are deleted on destruction.

Request &operator=(const Request &other) = delete
Request &operator=(Request &&other)
Request(Request&&)
bool errorResponseSent() const noexcept

Return true if an error response has been sent.

void getInput(const std::string &name, const popef::TensorInfo &info, int64_t *batch_size, const void **input_buffer)

Get the buffer and batch size of the requested input.

Parameters
  • name[in] Name of the input to retrieve.

  • info[in] Tensor shape of the requested input.

  • batch_size[inout] Batch size. If batch_size[0] == 0, it is populated with the input buffer’s batch size; otherwise, an error is sent if the batch size doesn’t match the input buffer’s batch size.

  • input_buffer[out] Will be set to the buffer containing the input data.

Post

errorResponseSent() will return true if an error occurred.
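The in/out contract of `batch_size` lets the first input fix the batch size, and every later input is checked against it. The sketch below shows that logic with two hypothetical helpers, `CheckBatchSize()` and `AllInputsConsistent()`. They return `false` where the real method would send an error response.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper mirroring getInput()'s batch_size[inout] contract.
// Returns false where the real method would send an error response.
bool CheckBatchSize(int64_t input_batch_size, int64_t *batch_size) {
  if (batch_size[0] == 0) {
    batch_size[0] = input_batch_size;  // first input: adopt its batch size
    return true;
  }
  return batch_size[0] == input_batch_size;  // later inputs must match
}

// Hypothetical caller: apply the check to every input of a request.
bool AllInputsConsistent(const std::vector<int64_t> &input_batch_sizes,
                         int64_t *batch_size) {
  for (int64_t b : input_batch_sizes) {
    if (!CheckBatchSize(b, batch_size)) return false;
  }
  return true;
}
```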

void parseRequestInfo(uint32_t num_inputs_expected, const std::set<std::string> &allowed_output_names)

Parse the request and ensure it has the correct number of inputs / outputs.

Parameters
  • num_inputs_expected[in] Number of inputs the model expects.

  • allowed_output_names[in] The set of outputs the user is allowed to request.

Post

errorResponseSent() will return true if the number of inputs or the requested outputs are invalid.
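The sketch below covers only the two checks named in this reference: the input count, and whether each requested output is allowed. It assumes the request has already been reduced to an input count and a list of requested output names, which are passed to a hypothetical `RequestInfoValid()`. The real method reads these from the TRITONBACKEND_Request and sends an error response instead of returning `false`.

```cpp
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical validation mirroring parseRequestInfo()'s two checks.
bool RequestInfoValid(uint32_t num_inputs, uint32_t num_inputs_expected,
                      const std::vector<std::string> &requested_outputs,
                      const std::set<std::string> &allowed_output_names) {
  if (num_inputs != num_inputs_expected) return false;
  for (const auto &name : requested_outputs) {
    // Each requested output must be one the user is allowed to request.
    if (allowed_output_names.count(name) == 0) return false;
  }
  return true;
}
```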

int64_t batchSize() const

Return the request’s batch size.

Returns

Batch size

void *getOutputBuffer(const std::string &name, const popef::TensorInfo &info, int64_t batch_size)

Allocate a buffer for the named output of this request and return a pointer to it.

Note

The response owns the returned buffer and therefore the returned pointer becomes invalid once the response is sent.

Parameters
  • name[in] Name of the output we need a buffer for.

  • info[in] Tensor shape of the output to allocate.

  • batch_size[in] Batch size of the output to allocate.

Returns

A pointer to a buffer or nullptr if an error occurred.

void sendResponseSuccess()

Send a response indicating the request was successfully processed.

void startComputeTimer()

Start the compute timer.

Must be called just before the IPU starts processing the request.

void stopComputeTimer()

Stop the compute timer.

Must be called immediately after the IPU is done processing the request.

Pre

startComputeTimer() must have been called.

void reportStatistics(TRITONBACKEND_ModelInstance *instance)

Report the statistics related to this request.

Pre

startComputeTimer(), stopComputeTimer() and sendResponseSuccess() must have been called.

uint64_t getStartComputeTime() const noexcept

Return the start compute timestamp.

Returns

Start compute timestamp

Pre

startComputeTimer() must have been called.

uint64_t getEndComputeTime() const noexcept

Return the end compute timestamp.

Returns

End compute timestamp

Pre

stopComputeTimer() must have been called.
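The timer methods record a pair of nanosecond timestamps around the IPU work. The minimal sketch below uses a hypothetical `ComputeTimer` class with `std::chrono::steady_clock` as the time source. The backend's actual clock is not specified in this reference, and `ComputeDurationNs()` is an illustrative addition.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical class mirroring the Request timer methods.
class ComputeTimer {
 public:
  void startComputeTimer() noexcept { start_ns_ = NowNs(); }
  void stopComputeTimer() noexcept { end_ns_ = NowNs(); }
  uint64_t getStartComputeTime() const noexcept { return start_ns_; }
  uint64_t getEndComputeTime() const noexcept { return end_ns_; }
  // Hypothetical convenience: time spent between start and stop.
  uint64_t ComputeDurationNs() const noexcept { return end_ns_ - start_ns_; }

 private:
  // steady_clock is monotonic, so end >= start is guaranteed.
  static uint64_t NowNs() noexcept {
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now().time_since_epoch())
            .count());
  }
  uint64_t start_ns_ = 0;
  uint64_t end_ns_ = 0;
};
```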

2.8. Tracepoint

class Tracepoint : public pvti::Tracepoint

RAII class to create pvti tracepoints.

Public Functions

explicit Tracepoint(const std::string &label)
~Tracepoint() = default
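With RAII, the tracepoint begins in the constructor and ends in the destructor. A scope is therefore traced exactly once, even if it exits early or throws. The generic sketch below uses a hypothetical `ScopedTrace` that appends to an in-memory log instead of calling the pvti library, whose API is not shown in this reference. `TracedWork()` is likewise illustrative.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical RAII tracer: logs "begin" on construction, "end" on destruction.
class ScopedTrace {
 public:
  ScopedTrace(std::vector<std::string> &log, std::string label)
      : log_(log), label_(std::move(label)) {
    log_.push_back("begin " + label_);  // tracepoint starts on construction
  }
  ~ScopedTrace() { log_.push_back("end " + label_); }  // ends on scope exit
  ScopedTrace(const ScopedTrace &) = delete;
  ScopedTrace &operator=(const ScopedTrace &) = delete;

 private:
  std::vector<std::string> &log_;
  std::string label_;
};

// Hypothetical traced function with nested scopes.
void TracedWork(std::vector<std::string> &log) {
  ScopedTrace outer(log, "execute");
  {
    ScopedTrace inner(log, "copy_inputs");
  }  // inner ends here, before outer
}
```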

2.9. Latch

class Latch

Public Functions

explicit Latch(int count)
Latch(const Latch&) = delete
Latch(Latch&&) = delete
Latch &operator=(const Latch &other) = delete
Latch &operator=(Latch &&other) = delete
~Latch() = default
void notify()
void wait()
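This reference lists the Latch interface without describing it. The Request constructor's "latch to notify once all the requests’ responses have been sent" and UseSynchronousExecution() suggest a count-down latch similar to C++20 `std::latch`. Under that reading, notify() decrements the count and wait() blocks until the count reaches zero. The minimal sketch below works under that assumption. It is named `LatchSketch` to make clear it is not the backend's implementation, and `SendAndWait()` is a hypothetical usage helper.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical count-down latch with the same notify()/wait() interface.
class LatchSketch {
 public:
  explicit LatchSketch(int count) : count_(count) {}
  LatchSketch(const LatchSketch &) = delete;
  LatchSketch &operator=(const LatchSketch &) = delete;

  // Decrement the count; wake all waiters when it reaches zero.
  void notify() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (count_ > 0 && --count_ == 0) cv_.notify_all();
  }

  // Block until notify() has been called `count` times.
  void wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return count_ == 0; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  int count_;
};

// Hypothetical usage: block until `n` simulated responses have been sent,
// as synchronous execution would before returning from ModelInstanceExecute.
int SendAndWait(int n) {
  LatchSketch latch(n);
  std::atomic<int> sent{0};
  std::vector<std::thread> senders;
  for (int i = 0; i < n; ++i) {
    senders.emplace_back([&latch, &sent] {
      sent.fetch_add(1);  // "send" one response
      latch.notify();     // then signal the latch
    });
  }
  latch.wait();  // every response has been sent once this returns
  const int observed = sent.load();
  for (auto &t : senders) t.join();
  return observed;
}
```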