3. Model runtime
3.1. Run with PopRT Runtime
PopRT Runtime is the runtime environment included in PopRT for loading and running PopEF models. It provides Python and C++ APIs, which can be used for rapid verification of PopEF files or for integration with machine learning frameworks and model serving frameworks. This section uses the executable.popef file generated in Section 2.2, Model conversion and compilation, as an example to describe how to use the PopRT Runtime API to load and run PopEF models.
3.1.1. Environment preparation
Switch to the directory containing the executable.popef file created in Section 2.2, Model conversion and compilation, and check that you are in the correct directory with ls:
$ ls `pwd -P` | grep bertsquad-12.onnx.optimized.onnx
The following output shows that the current directory is correct.
bertsquad-12.onnx.optimized.onnx
Start the Docker container with the following command:
$ gc-docker -- --rm -it \
-v `pwd -P`:/model_runtime_test \
-w /model_runtime_test \
--entrypoint /bin/bash \
graphcorecn/poprt-staging:latest
3.1.2. Run with PopRT Runtime Python API
The sample code is shown in Listing 3.1. Save it as model_runner_quick_start.py.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import numpy as np
from poprt import runtime
# Load the popef
runner = runtime.ModelRunner('executable.popef')
# Get the input and output information of the model
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()
# Create the random inputs and zero outputs
inputs_dict = {x.name:np.random.randint(2, size=x.shape).astype(x.numpy_data_type()) for x in inputs}
outputs_dict = {x.name:np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs}
# Execute the inference
runner.execute(inputs_dict, outputs_dict)
# Check the output values
for name, value in outputs_dict.items():
    print(f'{name} : {value}')
Run the saved sample code:
$ python3 model_runner_quick_start.py
A successful run will produce an output similar to the following:
unstack:1 : [[-0.9604 -1.379 -2.01 ... -1.814 -1.78 -1.626 ]
[-1.051 -1.977 -1.913 ... -1.435 -1.681 -1.251 ]
[-3.67 -2.71 -2.78 ... -3.951 -4.027 -3.959 ]
...
[-0.0919 -0.6445 -0.3125 ... -0.384 -0.54 -0.3152]
[-0.69 -1.071 -1.421 ... -1.533 -1.456 -1.389 ]
[-3.56 -2.99 -3.23 ... -4.05 -3.977 -3.955 ]]
unstack:0 : [[-1.437 -1.645 -2.17 ... -2.139 -2.379 -2.281 ]
[-1.259 -1.8545 -1.915 ... -1.804 -1.8955 -1.671 ]
[-2.832 -2.057 -2.104 ... -3.29 -3.34 -3.36 ]
...
[-0.4673 -0.8716 -0.8545 ... -1.253 -1.287 -1.289 ]
[-1.288 -1.481 -1.928 ... -2.158 -2.146 -2.129 ]
[-2.762 -2.43 -2.6 ... -3.418 -3.23 -3.324 ]]
unique_ids:0 : [1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1]
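As noted in the introduction, PopRT Runtime can also be integrated into model serving frameworks; in that setting the same ModelRunner instance is reused across requests. The following is a minimal sketch (not part of the original example, and using only the calls shown above) that loads the PopEF file once and runs several consecutive inferences with fresh random inputs:

import numpy as np
from poprt import runtime

# Load the PopEF file once and reuse the runner for every request
runner = runtime.ModelRunner('executable.popef')
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()

for step in range(4):
    # Rebuild the input feed for each request
    inputs_dict = {
        x.name: np.random.randint(2, size=x.shape).astype(x.numpy_data_type())
        for x in inputs
    }
    outputs_dict = {
        x.name: np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs
    }
    runner.execute(inputs_dict, outputs_dict)
    print(f'step {step}: received {len(outputs_dict)} output tensors')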
3.1.3. Run with PopRT Runtime C++ API
PopRT Runtime also provides a C++ API; sample code is shown in Listing 3.2. Save it as model_runner_quick_start.cpp.
// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
#include <iostream>
#include <memory>
#include <vector>

#include "poprt/runtime/model_runner.hpp"

int main(int argc, char* argv[]) {
  // Load the PopEF file
  auto runner = poprt::runtime::ModelRunner("executable.popef");

  // Get the input and output information of the model
  auto inputs = runner.getModelInputs();
  auto outputs = runner.getModelOutputs();

  // Create the input and output buffers
  poprt::runtime::InputMemoryView in;
  poprt::runtime::OutputMemoryView out;
  std::vector<std::shared_ptr<unsigned char[]>> memories;
  int i = 0;
  for (const auto& input : inputs) {
    memories.push_back(
        std::shared_ptr<unsigned char[]>(new unsigned char[input.sizeInBytes]));
    in.emplace(input.name, poprt::runtime::ConstTensorMemoryView(
                               memories[i++].get(), input.sizeInBytes));
  }
  for (const auto& output : outputs) {
    memories.push_back(std::shared_ptr<unsigned char[]>(
        new unsigned char[output.sizeInBytes]));
    out.emplace(output.name, poprt::runtime::TensorMemoryView(
                                 memories[i++].get(), output.sizeInBytes));
  }

  // Execute the inference
  runner.execute(in, out);

  // Print the result information
  std::cout << "Successfully executed. The outputs are: " << std::endl;
  for (const auto& output : outputs)
    std::cout << "name: " << output.name << ", dataType: " << output.dataType
              << ", sizeInBytes: " << output.sizeInBytes << std::endl;
}
Compile the sample code.
$ apt-get update && \
apt-get install g++ -y && \
g++ model_runner_quick_start.cpp -o model_runner_quick_start \
--std=c++14 -I/usr/local/lib/python3.8/dist-packages/poprt/include \
-L/usr/local/lib/python3.8/dist-packages/poprt/lib \
-lpoprt_runtime -lpopef
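The -I and -L paths passed to g++ above assume that poprt is installed under /usr/local/lib/python3.8/dist-packages. If your container uses a different Python version or installation prefix, you can locate the package directory first; the following small helper (an illustrative script, not part of the original example) prints the directory whose include/ and lib/ subdirectories should be used:

# print_poprt_path.py: print the installation directory of the poprt package
import os

import poprt

print(os.path.dirname(poprt.__file__))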
Run the compiled example program:
$ LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/poprt/lib:$LD_LIBRARY_PATH ./model_runner_quick_start
A successful run will result in the following output:
Successfully executed. The outputs are:
name: unstack:1, dataType: F16, sizeInBytes: 8192
name: unstack:0, dataType: F16, sizeInBytes: 8192
name: unique_ids:0, dataType: S32, sizeInBytes: 64
Note
After completing the above example, exit the current container and return to the host environment.
3.2. Deploy with Triton Inference Server
To deploy and run the compiled bertsquad-12.onnx model, create a directory named model_repository inside the directory containing the executable.popef file:
mkdir -p model_repository/bertsquad-12/1/
cp executable.popef model_repository/bertsquad-12/1/
touch model_repository/bertsquad-12/config.pbtxt
cd model_repository
This will give the following directory structure:
$ tree .
.
└── bertsquad-12
├── 1
│ └── executable.popef
└── config.pbtxt
- bertsquad-12 is the name of the model
- 1 represents the version of the model
- executable.popef is the PopEF file generated by model compilation
- config.pbtxt is the Triton configuration file described in Section 3.2.1, Configuration of the generated model
3.2.1. Configuration of the generated model
To deploy the model to the Triton Inference Server, you need to create a configuration file config.pbtxt for the model. This file contains the name of the model, the backend used, the batching information, the input and output information, and so on. For more information about model configuration, see the example config file in the Graphcore Poplar Triton Backend: User Guide and the Triton model configuration documentation.
The config.pbtxt used in this example is shown below. Copy this content into the empty config.pbtxt file created above.
name: "bertsquad-12"
backend: "poplar"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [16]
  max_queue_delay_microseconds: 5000
}
input [
  {
    name: "input_ids:0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "input_mask:0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "segment_ids:0"
    data_type: TYPE_INT32
    dims: [ 256 ]
  },
  {
    name: "unique_ids_raw_output___9:0"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  }
]
output [
  {
    name: "unique_ids:0"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "unstack:0"
    data_type: TYPE_FP16
    dims: [ 256 ]
  },
  {
    name: "unstack:1"
    data_type: TYPE_FP16
    dims: [ 256 ]
  }
]
parameters [
  {
    key: "synchronous_execution"
    value: { string_value: "1" }
  },
  {
    key: "timeout_ns"
    value: { string_value: "500000" }
  }
]
3.2.2. Activation of model service
gc-docker is used to start the Triton Inference Server container; executable.popef is then loaded and run through the Poplar Triton Backend.
gc-docker -- --rm \
--network=host \
-v `pwd -P`:/models \
graphcorecn/toolkit-triton-staging:latest
Note
Remove the --network=host parameter when running gc-docker if you are testing on an IPU-M2000 or Bow-2000 system.
You should then get the following output, indicating that the model is successfully deployed and ready to accept gRPC and HTTP requests.
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
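Optionally, before sending inference requests, you can confirm that the server and the model are ready. The following is a minimal sketch using the Triton HTTP client on port 8000 (it assumes the tritonclient package, which is installed in the virtual environment described in the next section, and that the port is reachable from the host):

import tritonclient.http as httpclient

# Query the readiness endpoints of the server started above
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("bertsquad-12"))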
3.2.3. Verification of the service via gRPC
Below is an example of testing the deployed model with the gRPC API of the Triton client. Refer to the Triton documentation and code examples for more detailed API information.
import sys

import numpy as np
import tritonclient.grpc as gc

# Create the Triton client
triton_client = gc.InferenceServerClient(url="localhost:8001")

model_name = 'bertsquad-12'
inputs = []
outputs = []
inputs.append(gc.InferInput('input_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('input_mask:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('segment_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"))

# Create random input data
input0_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16, 1)).astype(np.int32)
for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs_names = ['unique_ids:0', 'unstack:0', 'unstack:1']
for name in outputs_names:
    outputs.append(gc.InferRequestedOutput(name))

results = triton_client.infer(
    model_name=model_name, inputs=inputs, outputs=outputs
)

statistics = triton_client.get_inference_statistics(model_name=model_name)
if len(statistics.model_stats) != 1:
    print("FAILED: Inference Statistics")
    sys.exit(1)
print(statistics)

for name in outputs_names:
    print(f'{name} = {results.as_numpy(name)}')
Open a new terminal connected to the host, save the above code as grpc_test.py, then create a Python virtual environment and run the test:
virtualenv -p python3 venv
source venv/bin/activate
pip install tritonclient[all]
python grpc_test.py
deactivate
If executed correctly, the model statistics and inference results are returned, for example:
model_stats {
  name: "bertsquad-12"
  version: "1"
  last_inference: 1667439772895
  inference_count: 64
  execution_count: 4
  inference_stats {
    success {
      count: 4
      ns: 170377440
    }
...
unique_ids:0 = [[0]
...
unstack:0 = [[-0.991 -1.472 -1.571 ... -1.738 -1.77 -1.803]
...
unstack:1 = [[-0.9023 -1.285 -1.325 ... -1.419 -1.441 -1.452 ]
...
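Since the server also accepts HTTP requests on port 8000 (see Section 3.2.2), the same request can be sent with the Triton HTTP client as an optional cross-check. The following is a minimal sketch (not part of the original example) that reuses the tensor names and shapes from config.pbtxt and can be run in the same virtual environment:

import numpy as np
import tritonclient.http as hc

# Connect to the HTTP endpoint of the Triton Inference Server
client = hc.InferenceServerClient(url="localhost:8000")

# Build the same four inputs as in grpc_test.py
inputs = [
    hc.InferInput('input_ids:0', [16, 256], "INT32"),
    hc.InferInput('input_mask:0', [16, 256], "INT32"),
    hc.InferInput('segment_ids:0', [16, 256], "INT32"),
    hc.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"),
]
token_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
unique_ids = np.zeros((16, 1), dtype=np.int32)
for inp in inputs[:3]:
    inp.set_data_from_numpy(token_data)
inputs[3].set_data_from_numpy(unique_ids)

outputs = [
    hc.InferRequestedOutput(name)
    for name in ('unique_ids:0', 'unstack:0', 'unstack:1')
]
results = client.infer(model_name='bertsquad-12', inputs=inputs, outputs=outputs)
print(results.as_numpy('unstack:0'))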