3. Model runtime
3.1. Run with PopRT Runtime
PopRT Runtime is the runtime environment included in PopRT for loading and running PopEF models. It provides Python and C++ APIs, which can be used for rapid verification of PopEF files or for integration with machine learning frameworks and model serving frameworks. This section uses the executable.popef file generated in Section 2.2, Model conversion and compilation, as an example to describe how to use the PopRT Runtime API to load and run PopEF models.
3.1.1. Environment preparation
Switch to the directory containing the executable.popef file created in Section 2.2, Model conversion and compilation, and check that you are in the correct directory with ls:
$ ls `pwd -P` | grep bertsquad-12.onnx.optimized.onnx
The following output shows that the current directory is correct.
bertsquad-12.onnx.optimized.onnx
Start the Docker container with the following command:
$ gc-docker -- --rm -it \
-v `pwd -P`:/model_runtime_test \
-w /model_runtime_test \
--entrypoint /bin/bash \
graphcorecn/poprt-staging:latest
3.1.2. Run with PopRT Runtime Python API
The sample code is shown in Listing 3.1. Save it as model_runner_quick_start.py.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import numpy as np
from poprt import runtime
# Load the popef
runner = runtime.ModelRunner('executable.popef')
# Get the input and output information of the model
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()
# Create the random inputs and zero outputs
inputs_dict = {x.name:np.random.randint(2, size=x.shape).astype(x.numpy_data_type()) for x in inputs}
outputs_dict = {x.name:np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs}
# Execute the inference
runner.execute(inputs_dict, outputs_dict)
# Check the output values
for name, value in outputs_dict.items():
    print(f'{name} : {value}')
Run the saved sample code:
$ python3 model_runner_quick_start.py
A successful run will produce an output similar to the following:
unstack:1 : [[-0.9604 -1.379 -2.01 ... -1.814 -1.78 -1.626 ]
[-1.051 -1.977 -1.913 ... -1.435 -1.681 -1.251 ]
[-3.67 -2.71 -2.78 ... -3.951 -4.027 -3.959 ]
...
[-0.0919 -0.6445 -0.3125 ... -0.384 -0.54 -0.3152]
[-0.69 -1.071 -1.421 ... -1.533 -1.456 -1.389 ]
[-3.56 -2.99 -3.23 ... -4.05 -3.977 -3.955 ]]
unstack:0 : [[-1.437 -1.645 -2.17 ... -2.139 -2.379 -2.281 ]
[-1.259 -1.8545 -1.915 ... -1.804 -1.8955 -1.671 ]
[-2.832 -2.057 -2.104 ... -3.29 -3.34 -3.36 ]
...
[-0.4673 -0.8716 -0.8545 ... -1.253 -1.287 -1.289 ]
[-1.288 -1.481 -1.928 ... -2.158 -2.146 -2.129 ]
[-2.762 -2.43 -2.6 ... -3.418 -3.23 -3.324 ]]
unique_ids:0 : [1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1]
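As noted in the introduction, PopRT Runtime can also be integrated into model serving frameworks; in that setting the same ModelRunner instance is reused across requests. The following is a minimal sketch (not part of the original example, and using only the calls shown above) that loads the PopEF file once and runs several consecutive inferences with fresh random inputs:

import numpy as np
from poprt import runtime

# Load the PopEF file once and reuse the runner for every request
runner = runtime.ModelRunner('executable.popef')
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()

for step in range(4):
    # Rebuild the input feed for each request
    inputs_dict = {
        x.name: np.random.randint(2, size=x.shape).astype(x.numpy_data_type())
        for x in inputs
    }
    outputs_dict = {
        x.name: np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs
    }
    runner.execute(inputs_dict, outputs_dict)
    print(f'step {step}: received {len(outputs_dict)} output tensors')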
3.1.3. Run with PopRT Runtime C++ API
PopRT Runtime also provides a C++ API; sample code is shown in Listing 3.2. Save it as model_runner_quick_start.cpp.
// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
#include <iostream>
#include <memory>
#include <vector>

#include "poprt/runtime/model_runner.hpp"

int main(int argc, char* argv[]) {
  // Load the PopEF file
  auto runner = poprt::runtime::ModelRunner("executable.popef");

  // Get the input and output information of the model
  auto inputs = runner.getModelInputs();
  auto outputs = runner.getModelOutputs();

  // Create the input and output buffers
  poprt::runtime::InputMemoryView in;
  poprt::runtime::OutputMemoryView out;
  std::vector<std::shared_ptr<unsigned char[]>> memories;
  int i = 0;
  for (const auto& input : inputs) {
    memories.push_back(
        std::shared_ptr<unsigned char[]>(new unsigned char[input.sizeInBytes]));
    in.emplace(input.name, poprt::runtime::ConstTensorMemoryView(
                               memories[i++].get(), input.sizeInBytes));
  }
  for (const auto& output : outputs) {
    memories.push_back(std::shared_ptr<unsigned char[]>(
        new unsigned char[output.sizeInBytes]));
    out.emplace(output.name, poprt::runtime::TensorMemoryView(
                                 memories[i++].get(), output.sizeInBytes));
  }

  // Execute the inference
  runner.execute(in, out);

  // Print the result information
  std::cout << "Successfully executed. The outputs are: " << std::endl;
  for (const auto& output : outputs)
    std::cout << "name: " << output.name << ", dataType: " << output.dataType
              << ", sizeInBytes: " << output.sizeInBytes << std::endl;
}
Compile the sample code.
$ apt-get update && \
apt-get install g++ -y && \
g++ model_runner_quick_start.cpp -o model_runner_quick_start \
--std=c++14 -I/usr/local/lib/python3.8/dist-packages/poprt/include \
-L/usr/local/lib/python3.8/dist-packages/poprt/lib \
-lpoprt_runtime -lpopef
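The -I and -L paths passed to g++ above assume that poprt is installed under /usr/local/lib/python3.8/dist-packages. If your container uses a different Python version or installation prefix, you can locate the package directory first; the following small helper (an illustrative script, not part of the original example) prints the directory whose include/ and lib/ subdirectories should be used:

# print_poprt_path.py: print the installation directory of the poprt package
import os

import poprt

print(os.path.dirname(poprt.__file__))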
Run the compiled example program:
$ LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/poprt/lib:$LD_LIBRARY_PATH ./model_runner_quick_start
A successful run will result in the following output:
Successfully executed. The outputs are:
name: unstack:1, dataType: F16, sizeInBytes: 8192
name: unstack:0, dataType: F16, sizeInBytes: 8192
name: unique_ids:0, dataType: S32, sizeInBytes: 64
Note
After completing the above example, exit the current container and return to the host environment.
3.2. Deploy with Triton Inference Server
To deploy and run the compiled bertsquad-12.onnx model, create a directory named model_repository inside the directory containing the executable.popef file:
mkdir -p model_repository/bertsquad-12/1/
cp executable.popef model_repository/bertsquad-12/1/
touch model_repository/bertsquad-12/config.pbtxt
cd model_repository
This will give the following directory structure:
$ tree .
.
└── bertsquad-12
├── 1
│ └── executable.popef
└── config.pbtxt
- bertsquad-12 is the name of the model
- 1 represents the version of the model
- executable.popef is the PopEF file generated by model compilation
- config.pbtxt is the Triton configuration file described in Section 3.2.1, Configuration of the generated model
3.2.1. Configuration of the generated model
To deploy the model to the Triton Inference Server, you need to create a configuration file config.pbtxt for the model. This file contains the name of the model, the backend used, the batching information, the input and output information, and so on. For more information about model configuration, see the example config file in the Graphcore Poplar Triton Backend: User Guide and the Triton model configuration documentation.
The config.pbtxt used in this example is shown below. Copy this content into the empty config.pbtxt file created above.
name: "bertsquad-12"
backend: "poplar"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [16]
  max_queue_delay_microseconds: 5000
}
input [
  {
    name: "input_ids:0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "input_mask:0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "segment_ids:0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "unique_ids_raw_output___9:0"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "unique_ids:0"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "unstack:0"
    data_type: TYPE_FP16
    dims: [ 256 ]
  },
  {
    name: "unstack:1"
    data_type: TYPE_FP16
    dims: [ 256 ]
  }
]
parameters [
  {
    key: "synchronous_execution"
    value: { string_value: "1" }
  },
  {
    key: "timeout_ns"
    value: { string_value: "500000" }
  }
]
3.2.2. Activation of model service
gc-docker is used to start the Triton Inference Server container; executable.popef is then loaded and run through the Poplar Triton Backend.
gc-docker -- --rm \
--network=host \
-v `pwd -P`:/models \
graphcorecn/toolkit-triton-staging:latest
Note
Remove the --network=host parameter when running gc-docker if you are testing on an IPU-M2000 or Bow-2000 system.
You should then get the following output, indicating that the model is successfully deployed and ready to accept gRPC and HTTP requests.
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
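Optionally, before sending inference requests, you can confirm that the server and the model are ready. The following is a minimal sketch using the Triton HTTP client on port 8000 (it assumes the tritonclient package, which is installed in the virtual environment described in the next section, and that the port is reachable from the host):

import tritonclient.http as httpclient

# Query the readiness endpoints of the server started above
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("bertsquad-12"))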
3.2.3. Verification of the service via gRPC
Below is an example of testing the deployed model with the gRPC API of the Triton client. Refer to the Triton documentation and code examples for more detailed API information.
import sys

import numpy as np
import tritonclient.grpc as gc

# Create the Triton client
triton_client = gc.InferenceServerClient(url="localhost:8001")

model_name = 'bertsquad-12'
inputs = []
outputs = []
inputs.append(gc.InferInput('input_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('input_mask:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('segment_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"))

# Create random input data
input0_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16, 1)).astype(np.int32)
for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs_names = ['unique_ids:0', 'unstack:0', 'unstack:1']
for name in outputs_names:
    outputs.append(gc.InferRequestedOutput(name))

results = triton_client.infer(
    model_name=model_name, inputs=inputs, outputs=outputs
)

statistics = triton_client.get_inference_statistics(model_name=model_name)
if len(statistics.model_stats) != 1:
    print("FAILED: Inference Statistics")
    sys.exit(1)
print(statistics)

for name in outputs_names:
    print(f'{name} = {results.as_numpy(name)}')
Open a new terminal connected to the host, save the above code as grpc_test.py, then create a Python virtual environment and run the test:
virtualenv -p python3 venv
source venv/bin/activate
pip install tritonclient[all]
python grpc_test.py
deactivate
If executed correctly, the model statistics and inference results are returned, for example:
model_stats {
  name: "bertsquad-12"
  version: "1"
  last_inference: 1667439772895
  inference_count: 64
  execution_count: 4
  inference_stats {
    success {
      count: 4
      ns: 170377440
    }
...
unique_ids:0 = [[0]
...
unstack:0 = [[-0.991 -1.472 -1.571 ... -1.738 -1.77 -1.803]
...
unstack:1 = [[-0.9023 -1.285 -1.325 ... -1.419 -1.441 -1.452 ]
...
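Since the server also accepts HTTP requests on port 8000 (see Section 3.2.2), the same request can be sent with the Triton HTTP client as an optional cross-check. The following is a minimal sketch (not part of the original example) that reuses the tensor names and shapes from config.pbtxt and can be run in the same virtual environment:

import numpy as np
import tritonclient.http as hc

# Connect to the HTTP endpoint of the Triton Inference Server
client = hc.InferenceServerClient(url="localhost:8000")

# Build the same four inputs as in grpc_test.py
inputs = [
    hc.InferInput('input_ids:0', [16, 256], "INT32"),
    hc.InferInput('input_mask:0', [16, 256], "INT32"),
    hc.InferInput('segment_ids:0', [16, 256], "INT32"),
    hc.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"),
]
token_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
unique_ids = np.zeros((16, 1), dtype=np.int32)
for inp in inputs[:3]:
    inp.set_data_from_numpy(token_data)
inputs[3].set_data_from_numpy(unique_ids)

outputs = [
    hc.InferRequestedOutput(name)
    for name in ('unique_ids:0', 'unstack:0', 'unstack:1')
]
results = client.infer(model_name='bertsquad-12', inputs=inputs, outputs=outputs)
print(results.as_numpy('unstack:0'))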