1. Introduction
This section describes what the Poplar Triton backend is and what it is used for. It also provides basic installation information and the information you need to start using the backend.
The following instructions assume that you are familiar with machine learning and with using TensorFlow or PopART to create and run models on the IPU. You should also understand what the NVIDIA Triton Inference Server offers and how to use it.
1.1. Overview
The Poplar Triton backend is part of the Poplar SDK. It is packaged as a single libtriton_poplar.so plugin for the Triton Inference Server. For more information, see the NVIDIA Triton Inference Server documentation.
For more information about installing the Poplar SDK, see the relevant “Getting Started” guide for your IPU system on the Graphcore documentation portal.
1.2. Setting the environment variables
You need the Poplar runtime libraries to use the Poplar Triton backend, so, as described in the SDK installation instructions, you also need to set the library search paths using the scripts provided in the SDK:
$ # Add the Poplar runtime libraries to the search path
$ source <path to poplar installation>/enable.sh
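You can check that the Poplar tools are now on your path, for example by querying the Poplar compiler version (any command-line tool from the SDK would do):
$ # Check that the Poplar SDK is enabled
$ popc --version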
1.3. Building the Triton server
The instructions below describe how to configure an Ubuntu 20.04 container with the standard Triton Inference Server, version 21.05.
If you already have your own Triton server up and running you can skip this step.
Note
On Pod systems, some extra parameters might need to be passed for the partitions to be visible from inside the Docker container. For more information, see the documentation on using IPUs from Docker.
Download the Dockerfile, which will allow you to create a container with a configured Triton Inference Server.
$ # Build a container from the downloaded Dockerfile
$ docker build -t graphcore_triton - < ./Dockerfile
$ # Start a configured container
$ gc-docker -- \
$ --shm-size=1g \
$ --ulimit memlock=-1 \
$ -p 8000:8000 \
$ -p 8001:8001 \
$ -p 8002:8002 \
$ --ulimit stack=67108864 \
$ --privileged \
$ -v `pwd`:/mnt/host \
$ -ti \
$ graphcore_triton:latest
$ # From this point you can use the configured server
$ tritonserver --help
1.4. Configuring the model repository
See the Triton model repository documentation for more information about how to set up a model repository.
Here is an example of a model repository:
models
├── modelA
│   ├── 1
│   │   ├── executable.popef
│   │   └── name_map.json
│   └── config.pbtxt
├── ModelB
│   ├── 1
│   │   ├── executable.popef
│   │   ├── weights.popef (optional)
│   │   └── name_map.json
│   └── config.pbtxt
└── ModelC
    ├── 1
    │   ├── executable.popef
    │   └── name_map.json
    └── config.pbtxt
1.4.1. PopEF files
PopEF is an exchange format used by Graphcore’s frameworks to store compiled Poplar executables and their associated metadata. For more information, see the PopEF User Guide.
There are several ways to generate these files; for more information, see the Model handling chapter in the PopEF User Guide.
By setting Poplar’s executable caching path in TensorFlow you can determine the location of the PopEF file, as presented in the following example:
$ # Enable executable caching
$ export TF_POPLAR_FLAGS="--executable_cache_path=modelA"
$ # Run your TensorFlow model
$ python3 run_modelA.py
$ ls modelA/
a40887cbc9973426.poplar_exec
One effect of the above commands is to create a <random_name>.poplar_exec file. It contains the compiled Poplar executable for the TensorFlow model and is used as a cache to avoid re-compiling the model. The instructions below assume that you have renamed this file to executable.popef.
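For example, using the cached executable produced above (the hash in the file name and the models/modelA/1 destination are taken from the examples in this guide; your file name will differ):
$ # Rename the cached executable and place it in the model repository
$ mv modelA/a40887cbc9973426.poplar_exec models/modelA/1/executable.popef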
If you froze your graph then the weights are embedded directly inside the executable.popef file and you can skip the weights export step. Freezing the model means producing a single file containing all information about the graph: all parameters (for example, the weights) are captured as constants within the graph structure.
If you didn’t freeze the weights then you will need to export them from TensorFlow:
import tensorflow.compat.v1 as tf
from tensorflow.python import ipu

...

with tf.Session() as sess:
    ...  # Initialise the session here

    # Exports the weights from the session to file.
    ipu.utils.export_variables_from_live_session(sess, "weights.popef")
1.4.2. Input / output name mapping
TensorFlow doesn't preserve the names of the model's inputs and outputs: all the inputs are called infeed_<number> and all the outputs are called outfeed_<number>. These names can be remapped to more user-friendly names by creating a JSON file named name_map.json in the same folder as executable.popef.
The Poplar SDK provides a tool called popef_dump to print the PopEF metadata:
$ popef_dump executable.popef
[12:00:12.403] [popef:cpp] [debug] Read header: {"version": "", "type": "opaque", "size": "432"}
[12:00:12.403] [popef:cpp] [debug] start end: 0 432
[12:00:12.403] [popef:cpp] [debug] Read OpaqueBlob: {"name": "tensorflow", "executable": "UNHASHED_EXECUTABLE"} Structure size: 72
[12:00:12.403] [popef:cpp] [debug] Read header: {"version": "", "type": "poplarExecutable", "size": "38172251"}
[12:00:12.403] [popef:cpp] [debug] start end: 432 38172683
[12:00:12.403] [popef:cpp] [debug] Read Executable: {"name": "UNHASHED_EXECUTABLE", "uncompressedSize": "361890868"} header size: 56
[12:00:12.403] [popef:cpp] [debug] Read header: {"version": "", "type": "metadata", "size": "568"}
[12:00:12.403] [popef:cpp] [debug] start end: 38172683 38173251
[12:00:12.404] [popef:cpp] [debug] Read Metadata: { "replicationFactor": "1",
"numIpus": "1",
"seedHandle": "__seed_stream",
"anchors": [
{ "name": "infeed_1.0",
"handle": "infeed_1.0",
"tensorInfo": {"shape": ["256", "8", "8", "1"], "dataType": "f16"},
"type": "input",
"isPerReplica": true },
{ "name": "outfeed_2.0",
"handle": "outfeed_2.0",
"tensorInfo": {"shape": ["256", "3"], "dataType": "f16"},
"type": "output",
"isPerReplica": true },
{ "name": "outfeed_2.1",
"handle": "outfeed_2.1",
"tensorInfo": {"shape": ["256", "1"], "dataType": "f16"},
"type": "output",
"isPerReplica": true } ],
"executable": "UNHASHED_EXECUTABLE",
"numProcesses": "1",
"ipuVersion": "2",
"isPOD": false,
"isInference": true,
"engineOptions": [],
"deviceOptions": [],
"programFlow": {"load": ["0"], "main": ["1"], "save": ["2"]} }
name_map.json should contain a single string-to-string dictionary where the keys are the PopEF names and the values are the user-friendly names.
{
"infeed_1.0": "input",
"outfeed_2.0": "my_output",
"outfeed_2.1": "my_other_output"
}
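If you prefer to generate the file programmatically, here is a minimal sketch; the mapping and the output path follow the examples above:
import json

# Map the PopEF anchor names reported by popef_dump to user-friendly names.
name_map = {
    "infeed_1.0": "input",
    "outfeed_2.0": "my_output",
    "outfeed_2.1": "my_other_output",
}

# Write the mapping next to executable.popef in the model repository.
with open("models/modelA/1/name_map.json", "w") as f:
    json.dump(name_map, f, indent=2)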
1.4.3. Triton model configuration
See the model configuration documentation for more information about how to configure models.
The Poplar Triton backend extends this configuration with the following optional parameters:
executable_path: path to the model executable PopEF file. If this parameter is not defined, the model repository is searched for executable.popef.
weights_path: path to the model weights PopEF file. If this parameter is not defined, the model repository is searched for weights.popef.
name_map_file_path: path to the names remapping file. If this parameter is not defined, the model repository is searched for name_map.json.
check_package_hash: boolean flag controlling the Poplar version compatibility check. By default, this parameter is set to true, which means that during model loading the Poplar Triton backend verifies that the Poplar version the model was compiled against is the same as the version of Poplar enabled in the user environment.
Note
Setting the parameter check_package_hash to false turns the Poplar version compatibility check off. This is not recommended for production environments.
Note
If you created a name_map.json file to rename the model's inputs and outputs then you should use the user-friendly names in config.pbtxt.
Note
Shapes do not include the batch dimension.
name: "my_fp16_model"
backend: "poplar"
max_batch_size: 256000
dynamic_batching {
  preferred_batch_size: [2560]
  max_queue_delay_microseconds: 500000
}
input [
  {
    name: "input"
    data_type: TYPE_FP16
    dims: [ 8, 8, 1 ]
  }
]
output [
  {
    name: "my_output"
    data_type: TYPE_FP16
    dims: [ 3 ]
  },
  {
    name: "my_other_output"
    data_type: TYPE_FP16
    dims: [ 1 ]
  }
]

parameters [
  {
    key: "synchronous_execution"
    value:{string_value: "1"}
  },
  {
    key: "timeout_ns"
    value:{string_value: "500000"}
  },
  {
    key: "executable_path"
    value:{string_value: "/path/to/executable.popef"}
  },
  {
    key: "weights_path"
    value:{string_value: "/path/to/weights.popef"}
  },
  {
    key: "name_map_file_path"
    value:{string_value: "/path/to/name_map.json"}
  },
  {
    key: "check_package_hash"
    value:{string_value: "true"}
  }
]

instance_group [{ kind: KIND_CPU },{ kind: KIND_CPU }]
Backend
Models targeting the Poplar backend must have:
backend: "poplar"
Instance groups
You can run several instances of the same model by having several CPU instances:
instance_group [{ kind: KIND_CPU },{ kind: KIND_CPU }]
Batching
The Poplar backend supports dynamic batching. There is no limit to the batch size the backend can handle, so the max_batch_size can be set to a large multiple of the model's batch size.
We recommend using a multiple of the model's batch size for the preferred_batch_size. (You might need to experiment with different values to find out which one works best for your model.)
For example, for a model compiled with a batch size of 256:
max_batch_size: 256000
dynamic_batching {
  preferred_batch_size: [2560]
  max_queue_delay_microseconds: 500000
}
For more information about max_queue_delay_microseconds, see delayed batching.
Timeouts
The timeout_ns parameter is an optional amount of time, in nanoseconds, that the backend will wait before flushing an incomplete batch through.
parameters [
  {
    key: "timeout_ns"
    value:{string_value: "500000"}
  }
]
Synchronous execution
By default, the backend runs asynchronously: each call to TRITONBACKEND_ModelInstanceExecute will enqueue the jobs to process the received requests and return before the requests have actually been processed. To make the backend block until all the requests have been processed, set synchronous_execution to "1":
parameters [
  {
    key: "synchronous_execution"
    value:{string_value: "1"}
  }
]
1.5. Configuring the Poplar backend
Create a poplar folder in your backends folder. Copy libtriton_poplar.so and the lib folder from the Poplar SDK into that folder:
backends
└── poplar
    ├── lib -> copy of <poplar_sdk>/lib
    └── libtriton_poplar.so
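For example (a minimal sketch; the exact location of libtriton_poplar.so and the lib folder inside your Poplar SDK installation may differ):
$ # Create the backend folder and copy the plugin and runtime libraries into it
$ mkdir -p backends/poplar
$ cp <poplar_sdk>/libtriton_poplar.so backends/poplar/
$ cp -r <poplar_sdk>/lib backends/poplar/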
1.6. Starting the server
To use the functionality provided by the Triton Inference Server, run the command shown below. It starts a Triton server that can answer queries from client programs.
$ tritonserver \
$ --model-repository /path/to/models_repository \
$ --backend-directory /path/to/backends
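Once the server is running you can send it requests from a client program. Below is a minimal sketch using the tritonclient Python package against the my_fp16_model example from Section 1.4.3; the endpoint, model name and shapes are assumptions based on the examples in this guide, so adjust them for your setup.
import numpy as np
import tritonclient.http as httpclient

# Connect to the HTTP endpoint exposed by the server started above.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build one request matching the example model's batch size and input shape.
data = np.random.rand(256, 8, 8, 1).astype(np.float16)
inputs = [httpclient.InferInput("input", list(data.shape), "FP16")]
inputs[0].set_data_from_numpy(data)

# Run inference and read back one of the outputs by its user-friendly name.
result = client.infer(model_name="my_fp16_model", inputs=inputs)
print(result.as_numpy("my_output").shape)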
1.7. Profiling
The Poplar backend is instrumented with PVTI tracepoints.
The PopVision trace instrumentation library (libpvti) provides functions to control the capture of profiling information for the host code of your IPU application. This data can then be explored with the PopVision System Analyser. For more information, refer to the PopVision Trace Instrumentation Library documentation.
To capture the pvti reports needed for the PopVision System Analyser, you need to set PVTI_OPTIONS='{"enable":"true"}' before starting the Triton server:
$ PVTI_OPTIONS='{"enable":"true"}' tritonserver \
$ --model-repository /path/to/models_repository \
$ --backend-directory /path/to/backends
For more options, refer to the PopVision System Analyser User Guide.
1.7.1. Triton performance analyzer and metrics
For each batch of requests, the Poplar backend provides compute and execution times to the server. To see how to access these metrics, see the Triton server documentation.
The Poplar backend is also compatible with the Triton performance analyzer; for more information, see the performance analyzer documentation.
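For example, a possible perf_analyzer invocation against the my_fp16_model example from Section 1.4.3 (the endpoint, protocol and batch size are assumptions; adjust them for your setup):
$ # Measure latency and throughput for the example model over gRPC
$ perf_analyzer -m my_fp16_model -u localhost:8001 -i grpc -b 256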
1.8. Limitations
The types supported by the Triton Inference Server are listed in the available data types documentation. The Poplar Triton backend supports a subset of the types defined in the model configuration documentation. These are listed in Table 1.1.
Table 1.1: Data type support, mapping the Triton model configuration types to the corresponding TensorFlow, ONNX Runtime, PyTorch, Triton API and NumPy types.