1. Introduction

This section describes what the Poplar Triton backend is and what it is used for. It also covers how to install the backend and the basic information you need to use it.

The following instructions assume that you are familiar with machine learning and with using TensorFlow or PopART to create and run models on the IPU. You should also understand what the NVIDIA Triton Inference Server offers and how to use it.

1.1. Overview

The Poplar Triton backend is part of the Poplar SDK. It is packaged as a single libtriton_poplar.so plugin for the Triton Inference Server. For more information, see the NVIDIA Triton Inference Server documentation.

Note

The Poplar Triton Backend is currently a preview version.

For more information about installing the Poplar SDK, see the relevant “Getting Started” guide for your IPU system on the Graphcore documentation portal.

1.2. Setting the environment variables

You need the Poplar runtime libraries to use the Poplar Triton backend. As described in the SDK installation instructions, set the library search paths using the scripts provided in the SDK:

$ # Add the Poplar runtime libraries to the search path
$ source <path to poplar installation>/enable.sh

1.3. Building the Triton server

The instructions below describe how to configure an Ubuntu 20.04 container with the standard Triton Inference Server, version 21.05.

If you already have your own Triton server up and running, you can skip this step.

Note

On Pod systems, some extra parameters might need to be passed for the partitions to be visible from inside the Docker container. For more information, see the documentation on using IPUs from Docker.

Download the Dockerfile, which will allow you to create a container with a configured Triton Inference Server.

$ # Build a container from the downloaded Dockerfile
$ docker build -t graphcore_triton - < ./Dockerfile

$ # Start a configured container
$ gc-docker -- \
$       --shm-size=1g \
$       --ulimit memlock=-1 \
$       -p 8000:8000 \
$       -p 8001:8001 \
$       -p 8002:8002 \
$       --ulimit stack=67108864 \
$       --privileged \
$       -v `pwd`:/mnt/host \
$       -ti \
$       graphcore_triton:latest

$ # From this point on, you can use the configured server
$ tritonserver --help

1.4. Configuring the model repository

See the Triton model repository documentation for more information about how to set up a model repository.

Here is an example of a model repository:

models
├── modelA
│   ├── 1
│   │   ├── executable.popef
│   │   └── name_map.json
│   └── config.pbtxt
├── modelB
│   ├── 1
│   │   ├── executable.popef
│   │   ├── weights.popef (optional)
│   │   └── name_map.json
│   └── config.pbtxt
└── modelC
    ├── 1
    │   ├── executable.popef
    │   └── name_map.json
    └── config.pbtxt

1.4.1. PopEF files

PopEF is an exchange format used by Graphcore’s frameworks to store compiled Poplar executables and their associated metadata. For more information, see the PopEF User Guide.

There are several ways to generate these files; for more information, see the Model handling chapter in the PopEF User Guide.

By setting Poplar's executable caching path in TensorFlow, you can control where the PopEF file is written, as shown in the following example:

$ # Enable executable caching
$ export TF_POPLAR_FLAGS="--executable_cache_path=modelA"
$ # Run your TensorFlow model
$ python3 run_modelA.py
$ ls modelA/
  a40887cbc9973426.poplar_exec

These steps create a file with a randomly generated name and the extension .poplar_exec. It contains the compiled Poplar executable for the TensorFlow model and is used as a cache to avoid recompiling the model. The instructions below assume that you have renamed this file to executable.popef.
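Copying and renaming the cached file into the model repository can be scripted. The sketch below is illustrative only (the function name and directory layout are our assumptions); it picks the newest cached executable and installs it under the name the backend expects:

```python
import glob
import os
import shutil

def install_cached_executable(cache_dir, model_version_dir):
    """Copy the newest cached Poplar executable into a model repository
    version directory, renamed to executable.popef."""
    candidates = sorted(glob.glob(os.path.join(cache_dir, "*.poplar_exec")),
                        key=os.path.getmtime)
    if not candidates:
        raise FileNotFoundError(f"no .poplar_exec file found in {cache_dir}")
    os.makedirs(model_version_dir, exist_ok=True)
    dest = os.path.join(model_version_dir, "executable.popef")
    shutil.copy(candidates[-1], dest)
    return dest
```

For example, `install_cached_executable("modelA", "models/modelA/1")` would place the cached executable from the TensorFlow run above into the repository layout shown earlier.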

If you froze your graph, the weights are embedded directly inside the executable.popef file and you can skip the weights export step. Freezing a model means producing a single file containing all the information about the graph: all parameters (for example, weights) are captured as constants within the graph structure.

If you didn't freeze the weights, you will need to export them from TensorFlow:

import tensorflow.compat.v1 as tf
from tensorflow.python import ipu
...
with tf.Session() as sess:
    ... # Initialise the session here

    # Export the weights from the session to a file.
    ipu.utils.export_variables_from_live_session(sess, "weights.popef")

1.4.2. Input / output name mapping

TensorFlow doesn't preserve the names of the model's inputs and outputs. Instead, all inputs are given names of the form infeed_<number> and all outputs names of the form outfeed_<number>.

These names can be remapped to more user-friendly names by creating a JSON file named name_map.json in the same folder as executable.popef.

The Poplar SDK provides a tool called popef_dump to print the PopEF metadata:

Listing 1.1 Inspecting the PopEF metadata
$ popef_dump executable.popef
[12:00:12.403] [popef:cpp] [debug] Read header: {"version": "", "type": "opaque", "size": "432"}
[12:00:12.403] [popef:cpp] [debug] start end: 0 432
[12:00:12.403] [popef:cpp] [debug] Read OpaqueBlob: {"name": "tensorflow", "executable": "UNHASHED_EXECUTABLE"} Structure size: 72
[12:00:12.403] [popef:cpp] [debug] Read header: {"version": "", "type": "poplarExecutable", "size": "38172251"}
[12:00:12.403] [popef:cpp] [debug] start end: 432 38172683
[12:00:12.403] [popef:cpp] [debug] Read Executable: {"name": "UNHASHED_EXECUTABLE", "uncompressedSize": "361890868"} header size: 56
[12:00:12.403] [popef:cpp] [debug] Read header: {"version": "", "type": "metadata", "size": "568"}
[12:00:12.403] [popef:cpp] [debug] start end: 38172683 38173251
[12:00:12.404] [popef:cpp] [debug] Read Metadata: { "replicationFactor": "1",
    "numIpus": "1",
    "seedHandle": "__seed_stream",
    "anchors": [
        { "name": "infeed_1.0",
            "handle": "infeed_1.0",
            "tensorInfo": {"shape": ["256", "8", "8", "1"], "dataType": "f16"},
            "type": "input",
            "isPerReplica": true },
        { "name": "outfeed_2.0",
            "handle": "outfeed_2.0",
            "tensorInfo": {"shape": ["256", "3"], "dataType": "f16"},
            "type": "output",
            "isPerReplica": true },
        { "name": "outfeed_2.1",
            "handle": "outfeed_2.1",
            "tensorInfo": {"shape": ["256", "1"], "dataType": "f16"},
            "type": "output",
            "isPerReplica": true } ],
    "executable": "UNHASHED_EXECUTABLE",
    "numProcesses": "1",
    "ipuVersion": "2",
    "isPOD": false,
    "isInference": true,
    "engineOptions": [],
    "deviceOptions": [],
    "programFlow": {"load": ["0"], "main": ["1"], "save": ["2"]} }

name_map.json should contain a single string-to-string dictionary where the keys are the PopEF names and the values are the user-friendly names.

Listing 1.2 name_map.json example
{
    "infeed_1.0": "input",
    "outfeed_2.0": "my_output",
    "outfeed_2.1": "my_other_output"
}
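Since the mapping file is plain JSON, it can also be written programmatically. A minimal sketch using Python's json module, with the anchor names taken from the popef_dump output above:

```python
import json

# PopEF anchor names (from popef_dump) mapped to user-friendly names.
name_map = {
    "infeed_1.0": "input",
    "outfeed_2.0": "my_output",
    "outfeed_2.1": "my_other_output",
}

# Write the file next to executable.popef in the model's version directory.
with open("name_map.json", "w") as f:
    json.dump(name_map, f, indent=4)
```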

1.4.3. Triton model configuration

See the model configuration documentation for more information about how to configure models.

Note

If you created a name_map.json to rename the model’s inputs and outputs then you should use the user-friendly names in config.pbtxt.

Note

Shapes do not include the batch dimension.

Listing 1.3 Example of a full config file
name: "my_fp16_model"
backend: "poplar"
max_batch_size: 256000
dynamic_batching {
    preferred_batch_size: [2560]
    max_queue_delay_microseconds: 500000
}
input [
    {
        name: "input"
        data_type: TYPE_FP16
        dims: [ 8, 8, 1 ]
    }
]
output [
    {
        name: "my_output"
        data_type: TYPE_FP16
        dims: [ 3 ]
    },
    {
        name: "my_other_output"
        data_type: TYPE_FP16
        dims: [ 1 ]
    }
]

parameters [
    {
        key: "synchronous_execution"
        value: {string_value: "1"}
    },
    {
        key: "timeout_ns"
        value: {string_value: "500000"}
    }
]

instance_group [{ kind: KIND_CPU },{ kind: KIND_CPU }]

Backend

Models targeting the Poplar backend must have:

backend: "poplar"

Instance groups

You can run several instances of the same model by having several CPU instances:

instance_group [{ kind: KIND_CPU },{ kind: KIND_CPU }]

Batching

The Poplar backend supports dynamic batching. There is no limit to the batch size the backend can handle, so max_batch_size can be set to a large multiple of the model's batch size.

We recommend using a multiple of the model’s batch size for the preferred_batch_size. (You might need to experiment with different values to find out which one works best for your model.)

For example, for a model compiled with a batch size of 256:

max_batch_size: 256000
dynamic_batching {
    preferred_batch_size: [2560]
    max_queue_delay_microseconds: 500000
}
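Because the IPU executable runs at a fixed batch size, choosing server-side values that are multiples of it avoids padding. A minimal sketch (the helper name and multipliers are ours, not part of the backend) that reproduces the values in the example above:

```python
def batching_config(model_batch_size, preferred_multiple=10, max_multiple=1000):
    """Suggest dynamic-batching values as multiples of the model's
    compiled batch size. The multipliers are starting points to tune."""
    return {
        "max_batch_size": model_batch_size * max_multiple,
        "preferred_batch_size": [model_batch_size * preferred_multiple],
    }

# For a model compiled with batch size 256 this reproduces the example above:
# max_batch_size 256000 and preferred_batch_size [2560].
```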

For more information about max_queue_delay_microseconds see delayed batching.

Timeouts

The timeout_ns parameter sets an optional amount of time, in nanoseconds, that the backend waits before flushing an incomplete batch through.

parameters [
    {
        key: "timeout_ns"
        value:{string_value: "500000"}
    }
]

Synchronous execution

By default, the backend will run asynchronously: each call to TRITONBACKEND_ModelInstanceExecute will enqueue the jobs to process the received requests and return before the requests have actually been processed.

To make the backend block until all the requests have been processed, set synchronous_execution to "1":

parameters [
    {
        key: "synchronous_execution"
        value:{string_value: "1"}
    }
]

1.5. Configuring the Poplar backend

Create a poplar folder in your backends folder. Copy libtriton_poplar.so and the lib folder from the Poplar SDK into that folder.

Listing 1.4 Content of the backend folder
backends
└── poplar
    ├── lib -> copy of <poplar_sdk>/lib
    └── libtriton_poplar.so

1.6. Starting the server

To use the functionality provided by the Triton Inference Server, run the command below. This starts a Triton server that can answer queries from client programs.

Listing 1.5 Starting the Triton server
$ tritonserver \
$   --model-repository /path/to/models_repository \
$   --backend-directory /path/to/backends
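Once the server is running, you can talk to its HTTP endpoint (port 8000) using only the Python standard library. The sketch below is our own illustration: it polls the standard KServe v2 readiness endpoint and builds a v2 inference request for the model and tensor names used in the earlier configuration example:

```python
import json
import urllib.error
import urllib.request

def server_ready(host="localhost", port=8000, timeout=2.0):
    """Poll the Triton server's KServe v2 readiness endpoint."""
    url = f"http://{host}:{port}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def build_infer_request(input_name, shape, datatype, data):
    """Build a KServe v2 REST inference request body, as sent to
    POST /v2/models/<model>/infer on the Triton HTTP endpoint."""
    return {"inputs": [{"name": input_name,
                        "shape": list(shape),
                        "datatype": datatype,
                        "data": data}]}

# Example request for the "my_fp16_model" configuration shown earlier:
# one sample of shape [8, 8, 1], so the batched shape is [1, 8, 8, 1].
body = json.dumps(build_infer_request("input", [1, 8, 8, 1],
                                      "FP16", [0.0] * 64)).encode()
# urllib.request.urlopen("http://localhost:8000/v2/models/my_fp16_model/infer",
#                        data=body)  # uncomment when the server is running
```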

1.7. Profiling

The Poplar backend is instrumented with PVTI tracepoints.

The PopVision trace instrumentation library (libpvti) provides functions to control the capture of profiling information for the host-code of your IPU application. This data can then be explored with the PopVision System Analyser. For more options, refer to the PopVision Trace Instrumentation Library.

To capture the PVTI reports needed for the PopVision System Analyser, you need to set PVTI_OPTIONS='{"enable":"true"}' before starting the Triton server:

Listing 1.6 Starting the Triton server with PVTI capture enabled.
$ PVTI_OPTIONS='{"enable":"true"}' tritonserver \
$   --model-repository /path/to/models_repository \
$   --backend-directory /path/to/backends

For more options, refer to the PopVision System Analyser User Guide.

1.7.1. Triton performance analyzer and metrics

For each batch of requests the Poplar backend will provide compute and execution times to the server.

To see how to access these metrics see the Triton server documentation.

The Poplar backend is also compatible with the Triton performance analyzer, for more information see the performance analyzer documentation.

1.8. Limitations

The types supported by the Triton Inference Server are listed in the available data types documentation.

The Poplar Triton Backend supports a subset of the types defined in the model configuration documentation. These are listed in Table 1.1.

Table 1.1 Types supported by the Poplar Triton Backend

| Model Config | TensorFlow | ONNX Runtime | PyTorch | API    | NumPy   |
|--------------|------------|--------------|---------|--------|---------|
| TYPE_UINT32  | DT_UINT32  | UINT32       |         | UINT32 | uint32  |
| TYPE_INT32   | DT_INT32   | INT32        | kInt    | INT32  | int32   |
| TYPE_FP16    | DT_HALF    | FLOAT16      |         | FP16   | float16 |
| TYPE_FP32    | DT_FLOAT   | FLOAT        | kFloat  | FP32   | float32 |
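When writing a client, the model-config types in Table 1.1 must be matched to the byte layout of the request data. The sketch below is our own illustration using Python's struct module (the format character "e" is IEEE 754 half precision, matching TYPE_FP16):

```python
import struct

# Triton model-config types supported by the backend, mapped to
# struct format characters (illustrative; "e" is IEEE 754 half).
TRITON_TO_STRUCT = {
    "TYPE_UINT32": "I",
    "TYPE_INT32": "i",
    "TYPE_FP16": "e",
    "TYPE_FP32": "f",
}

def pack_tensor(config_type, values):
    """Pack a flat list of values into the little-endian byte layout
    of the given model-config type."""
    fmt = "<" + TRITON_TO_STRUCT[config_type] * len(values)
    return struct.pack(fmt, *values)
```

For example, two TYPE_FP16 values occupy four bytes, which is the size a raw binary request payload for that tensor would need.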