2. API reference
Knowing the details of the Triton backend API is crucial to understanding the comments and implementation below.
2.1. BackendState
-
class BackendState
Backend state: this state is shared amongst all the models and model states.
Public Functions
-
BackendState() = default
-
BackendState(const char *options_buf, size_t options_len)
Public Members
-
std::mutex poplar_init_mutex
The mutex that the model states need to acquire before using the non-thread safe parts of Poplar (for example, poplar::Engine creation).
-
bool check_package_hash = true
Flag controlling whether the hash of the loaded PopEF is checked for compatibility with the active Poplar SDK.
2.2. ModelState
-
class ModelState : public BackendModel
State associated with a model that is using this backend.
An object of this class is created and associated with each TRITONBACKEND_Model.
A model state is shared among all the instances of a model.
Public Functions
-
~ModelState()
-
void ValidateModelConfig()
Validate that model configuration is supported by this backend.
Note
This method must be called before calling any of the other methods from this class.
-
inline uint64_t ExecDelay() const
For Triton testing: delay in milliseconds to wait before processing the requests.
Value comes from the Triton ‘execute_delay_ms’ parameter.
- Returns
Execution delay in ms [milliseconds]
- Pre
Only available after ValidateModelConfig() has been called.
-
inline uint64_t DelayMultiplier() const
For Triton testing: delay multiplier per instance.
That is, total_delay = max(DelayMultiplier() * InstanceId(), 1) * ExecDelay().
Value comes from the Triton ‘instance_wise_delay_multiplier’ parameter.
- Returns
Delay multiplier value
- Pre
Only available after ValidateModelConfig() has been called.
-
const std::map<std::string, popef::TensorInfo> &ModelInputs() const
Map of input tensor details.
- Returns
Map:
key (std::string): input tensor name
value (popef::TensorInfo): input tensor details
- Pre
Only available after ValidateModelConfig() has been called.
-
const std::map<std::string, popef::TensorInfo> &ModelOutputs() const
Map of output tensor details.
- Returns
Map:
key (std::string): output tensor name
value (popef::TensorInfo): output tensor details
- Pre
Only available after ValidateModelConfig() has been called.
-
const std::set<std::string> &OutputNames() const
Model outputs’ names.
- Returns
Set of output name strings
- Pre
Only available after ValidateModelConfig() has been called.
-
std::unique_ptr<popef::Model> LoadModel() const
Create a new instance of the popef::Model.
Note
PopEF models are not thread safe; each model instance should own its own model.
- Returns
Pointer (unique_ptr) to model
- Pre
Only available after ValidateModelConfig() has been called.
-
void CreationDelay()
For Triton testing: block the thread based on the value of the Triton ‘creation_delay_sec’ parameter.
-
BackendState *Backend()
Return the backend associated with this model.
-
pvti::Graph &GetGraph(const std::string &name, const std::string &unit = "")
Get or create a pvti graph.
- Parameters
name – [in] The graph name. This name will be prefixed by the model name.
unit – [in] The graph unit.
- Returns
Reference to Graph
-
std::chrono::nanoseconds TimeoutInNanoseconds() const
Time to wait before flushing the pipeline when there are no inputs left in the queues but some requests are still in flight.
- Returns
Timeout in ns [nanoseconds]
-
void IncrementInstanceCount()
Increment the number of instances.
-
int InstanceCount() const
Return the current number of instances.
- Returns
Number of instances
-
bool UseSynchronousExecution() const
If true, the backend should wait until all the requests’ responses have been sent before returning from TRITONBACKEND_ModelInstanceExecute.
- Returns
Synchronous execution usage flag
Public Static Functions
-
static void Create(TRITONBACKEND_Model *triton_model, BackendState *backend, ModelState **state)
Static method to use to create a ModelState.
- Parameters
triton_model – [in] Model the state is associated with.
backend – [in] The backend the state is associated with.
state – [out] The state to initialise.
2.3. InputBuffers
-
using triton::backend::poplar::InputBuffers = std::map<std::string, const void*>
Map input names to data buffers.
2.4. OutputBuffers
-
using triton::backend::poplar::OutputBuffers = std::map<std::string, void*>
Map output names to data buffers.
2.5. ModelInstanceState
-
class ModelInstanceState : public BackendModelInstance
State associated with an instance of a model.
An object of this class is created and associated with each TRITONBACKEND_ModelInstance.
Public Functions
-
~ModelInstanceState()
-
inline ModelState *StateForModel() const
Get the state of the model that corresponds to this instance.
- Returns
Pointer to model’s state
-
inline int InstanceId() const
Get the ID of the instance.
- Returns
ID
-
void EnqueueRequest(const InputBuffers &inputs, size_t batch_size, const std::function<void(void)> &callback = nullptr)
Enqueue a request.
Note
The callback might be called multiple times (if the input is prefetched and the prefetch is later discarded).
- Parameters
inputs – [in] Buffers to use for the inputs. (Must contain a buffer for each input.)
batch_size – [in] Batch size of the request.
callback – [in] Optional callback to call when the first input is being read.
- Pre
Acquire the EnqueueRequest lock.
- Post
For each call to EnqueueRequest() a corresponding call to SetRequestOutputs() must be made.
-
std::mutex &QueuesMutex()
Mutex protecting the queues: must be acquired before calling EnqueueRequest().
- Returns
Mutex
-
void SetRequestOutputs(const OutputBuffers &outputs, size_t batch_size, const std::function<void(void)> &callback = nullptr)
Set the corresponding output buffers’ pointers for the last enqueued request.
- Parameters
outputs – [in] Buffers to use for the outputs. (Must contain a pointer for each output, if a pointer is nullptr then that output will be discarded.)
batch_size – [in] Batch size of the request.
callback – [in] Optional callback to call once after the last output has been written.
-
void AddValuesToGraph(int64_t batch_size, double exec_ns, double compute_ns)
Add a set of measurements for a collection of requests to the PVTI graph.
- Parameters
batch_size – [in] Total batch size of the collection.
exec_ns – [in] Average execution time for a single batch.
compute_ns – [in] Average compute time for a single batch.
-
void UnlockSessionThread()
Unlock the session thread, which gets blocked after the timeout callback.
-
void UpdateLastCollection(std::weak_ptr<RequestCollection> requests)
Keep track of the last collection: as long as this weak_ptr is valid it means some requests are still in flight.
Public Static Functions
-
static void CreateAndRun(ModelState *model_state, TRITONBACKEND_ModelInstance *triton_model_instance, ModelInstanceState **state)
Allocate a model instance state and start SessionRunThread.
- Parameters
model_state – [in] The model that the instance is associated with.
triton_model_instance – [in] Triton instance of the model.
state – [out] Backend instance of the model.
-
static void CreateEmptyState(ModelState *model_state, TRITONBACKEND_ModelInstance *triton_model_instance, ModelInstanceState **state)
Allocate an empty model instance state. Empty means that no SessionRunThread is created.
- Parameters
model_state – [in] The model that the instance is associated with.
triton_model_instance – [in] Triton instance of the model.
state – [out] Backend instance of the model.
2.6. RequestCollection
-
class RequestCollection
Represents a collection of requests received by one call to TRITONBACKEND_ModelInstanceExecute.
Each lambda object pushed to the Session queue that is executing a request holds a reference to the collection and releases it after the request has sent a response.
Once all the requests are completed, the collection releases all the memory associated with them and reports statistics for the entire batch to the server.
Public Functions
-
explicit RequestCollection(uint64_t request_count, uint64_t exec_start_ns, ModelInstanceState *instance_state)
Initialise a collection of requests.
- Parameters
request_count – [in] Reserve memory for this number of requests.
exec_start_ns – [in] Timestamp for when the backend received this collection of requests. If 0: don’t report stats for this collection.
instance_state – [in] Instance state this collection of requests is associated with.
- Post
addRequest() and addBatch() must both be called request_count times.
-
void addBatch(uint64_t batch_size) noexcept
Must be called once per request in the collection to indicate the batch size of each individual request.
-
~RequestCollection()
Implicitly called when the last request in the collection releases its reference to the collection.
Release the memory used by the requests and responses in the collection and report the statistics for this collection of requests to the server.
2.7. Request
-
class Request
Store a Triton request and its associated response.
Note
The ownership of both the Triton request and the Triton response are transferred to this object which will take care of releasing them once the response has been sent.
Public Functions
-
Request(TRITONBACKEND_Response *response, TRITONBACKEND_Request *request, uint32_t request_index, Latch *latch = nullptr)
Constructor.
- Parameters
response – [in] Triton response associated to this request. Ownership is transferred to this object.
request – [in] Incoming Triton request. Ownership is transferred to this object.
request_index – [in] Index of this request in the batch received by the backend.
latch – [in] Optional latch to notify once all the requests’ responses have been sent.
-
~Request()
-
Request(const Request&) = delete
Prevent requests from being copied as they contain raw pointers which are deleted on destruction.
-
bool errorResponseSent() const noexcept
Return true if an error response has been sent.
-
void getInput(const std::string &name, const popef::TensorInfo &info, int64_t *batch_size, const void **input_buffer)
Get the buffer and batch size of the requested input.
- Parameters
name – [in] Name of the input to retrieve.
info – [in] Tensor shape of the requested input.
batch_size – [inout] Batch size, if batch_size[0] == 0, then it will be populated with the input buffer’s batch size. Otherwise an error will be sent if the batch size doesn’t match the input buffer’s batch size.
input_buffer – [out] Will be set to the buffer containing the input data.
- Post
errorResponseSent() will return true if an error occurred.
-
void parseRequestInfo(uint32_t num_inputs_expected, const std::set<std::string> &allowed_output_names)
Parse the request and ensure it has the correct number of inputs / outputs.
- Parameters
num_inputs_expected – [in] Number of inputs the model expects.
allowed_output_names – [in] The list of outputs the user is allowed to request.
- Post
errorResponseSent() will return true if the number of inputs or the requested outputs are invalid.
-
int64_t batchSize() const
Return the request’s batch size.
- Returns
Batch size
-
void *getOutputBuffer(const std::string &name, const popef::TensorInfo &info, int64_t batch_size)
Allocate and return a pointer to a buffer for the given request.
Note
The response owns the returned buffer and therefore the returned pointer becomes invalid once the response is sent.
- Parameters
name – [in] Name of the output we need a buffer for.
info – [in] Tensor shape of the output to allocate.
batch_size – [in] Batch size of the output to allocate.
- Returns
A pointer to a buffer or nullptr if an error occurred.
-
void sendResponseSuccess()
Send a response indicating the request was successfully processed.
-
void startComputeTimer()
Start the compute timer.
Must be called just before the IPU starts processing the request.
-
void stopComputeTimer()
Stop the compute timer.
Must be called immediately after the IPU is done processing the request.
- Pre
startComputeTimer() must have been called.
-
void reportStatistics(TRITONBACKEND_ModelInstance *instance)
Report the statistics related to this request.
- Pre
startComputeTimer(), stopComputeTimer() and sendResponseSuccess() must have been called.
-
uint64_t getStartComputeTime() const noexcept
Return the start compute timestamp.
- Returns
Start compute timestamp
- Pre
startComputeTimer() must have been called.
-
uint64_t getEndComputeTime() const noexcept
Return the end compute timestamp.
- Returns
End compute timestamp
- Pre
stopComputeTimer() must have been called.
2.8. Tracepoint
-
class Tracepoint : public pvti::Tracepoint
RAII class to create pvti tracepoints.
2.9. Latch
-
class Latch