3. Running the model
3.1. Running with PopRT Runtime
PopRT Runtime is the component of PopRT that loads and runs PopEF models. It provides Python and C++ APIs, which can be used to quickly validate PopEF files or to integrate with machine learning frameworks and model serving frameworks.
This section uses the executable.popef file generated in Section 2.2, Model conversion and compilation, as an example to describe how to load and run a model with the PopRT Runtime API.
3.1.1. Environment setup
Change to the directory containing the executable.popef file generated in Section 2.2, Model conversion and compilation, and verify it with the ls command:
$ ls `pwd -P` | grep bertsquad-12.onnx.optimized.onnx
If the directory is correct, you will see the following output:
bertsquad-12.onnx.optimized.onnx
Start the container with the following command:
$ gc-docker -- --rm -it \
-v `pwd -P`:/model_runtime_test \
-w /model_runtime_test \
--entrypoint /bin/bash \
graphcorecn/poprt-staging:latest
3.1.2. Running the model with the Python API
Save the example code in Listing 3.1 as model_runner_quick_start.py.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import numpy as np
from poprt import runtime
# Load the popef
runner = runtime.ModelRunner('executable.popef')
# Get the input and output information of the model
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()
# Create the random inputs and zero outputs
inputs_dict = {x.name:np.random.randint(2, size=x.shape).astype(x.numpy_data_type()) for x in inputs}
outputs_dict = {x.name:np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs}
# Execute the inference
runner.execute(inputs_dict, outputs_dict)
# Check the output values
for name, value in outputs_dict.items():
    print(f'{name} : {value}')
Run the saved example code:
$ python3 model_runner_quick_start.py
A successful run produces output similar to the following:
unstack:1 : [[-0.9604 -1.379 -2.01 ... -1.814 -1.78 -1.626 ]
[-1.051 -1.977 -1.913 ... -1.435 -1.681 -1.251 ]
[-3.67 -2.71 -2.78 ... -3.951 -4.027 -3.959 ]
...
[-0.0919 -0.6445 -0.3125 ... -0.384 -0.54 -0.3152]
[-0.69 -1.071 -1.421 ... -1.533 -1.456 -1.389 ]
[-3.56 -2.99 -3.23 ... -4.05 -3.977 -3.955 ]]
unstack:0 : [[-1.437 -1.645 -2.17 ... -2.139 -2.379 -2.281 ]
[-1.259 -1.8545 -1.915 ... -1.804 -1.8955 -1.671 ]
[-2.832 -2.057 -2.104 ... -3.29 -3.34 -3.36 ]
...
[-0.4673 -0.8716 -0.8545 ... -1.253 -1.287 -1.289 ]
[-1.288 -1.481 -1.928 ... -2.158 -2.146 -2.129 ]
[-2.762 -2.43 -2.6 ... -3.418 -3.23 -3.324 ]]
unique_ids:0 : [1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1]
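To get a rough idea of inference latency, you can call execute repeatedly and time it. The following sketch reuses only the ModelRunner calls shown above; the warm-up and iteration counts are arbitrary values chosen for illustration.

import time

# Warm up so that first-call overhead is excluded from the measurement
for _ in range(5):
    runner.execute(inputs_dict, outputs_dict)

# Time a fixed number of inferences and report the mean latency
iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    runner.execute(inputs_dict, outputs_dict)
elapsed = time.perf_counter() - start
print(f'Mean latency: {elapsed / iterations * 1000:.2f} ms')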
3.1.3. Running the model with the C++ API
Save the example code in Listing 3.2 as model_runner_quick_start.cpp.
// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
#include <iostream>
#include <memory>
#include <vector>

#include "poprt/runtime/model_runner.hpp"

int main(int argc, char* argv[]) {
  // Load the PopEF file
  auto runner = poprt::runtime::ModelRunner("executable.popef");

  // Get the inputs and outputs information of the model
  auto inputs  = runner.getModelInputs();
  auto outputs = runner.getModelOutputs();

  // Create the inputs and outputs
  poprt::runtime::InputMemoryView in;
  poprt::runtime::OutputMemoryView out;
  std::vector<std::shared_ptr<unsigned char[]>> memories;
  int i = 0;
  for (const auto& input : inputs) {
    memories.push_back(
        std::shared_ptr<unsigned char[]>(new unsigned char[input.sizeInBytes]));
    in.emplace(input.name,
               poprt::runtime::ConstTensorMemoryView(memories[i++].get(),
                                                     input.sizeInBytes));
  }
  for (const auto& output : outputs) {
    memories.push_back(
        std::shared_ptr<unsigned char[]>(new unsigned char[output.sizeInBytes]));
    out.emplace(output.name,
                poprt::runtime::TensorMemoryView(memories[i++].get(),
                                                 output.sizeInBytes));
  }

  // Execute the inference
  runner.execute(in, out);

  // Print the result information
  std::cout << "Successfully executed. The outputs are: " << std::endl;
  for (const auto& output : outputs)
    std::cout << "name: " << output.name << ", dataType: " << output.dataType
              << ", sizeInBytes: " << output.sizeInBytes << std::endl;
}
Compile the code:
$ apt-get update && \
apt-get install g++ -y && \
g++ model_runner_quick_start.cpp -o model_runner_quick_start \
--std=c++14 -I/usr/local/lib/python3.8/dist-packages/poprt/include \
-L/usr/local/lib/python3.8/dist-packages/poprt/lib \
-lpoprt_runtime -lpopef
Run the compiled program:
$ LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/poprt/lib:$LD_LIBRARY_PATH ./model_runner_quick_start
A successful run produces the following output:
Successfully executed. The outputs are:
name: unstack:1, dataType: F16, sizeInBytes: 8192
name: unstack:0, dataType: F16, sizeInBytes: 8192
name: unique_ids:0, dataType: S32, sizeInBytes: 64
Note
After completing the example above, exit the current container and return to the host environment.
3.2. Deploying the model to Triton Inference Server
In the directory containing executable.popef, create a directory named model_repository:
mkdir -p model_repository/bertsquad-12/1/
cp executable.popef model_repository/bertsquad-12/1/
touch model_repository/bertsquad-12/config.pbtxt
cd model_repository
The resulting directory structure is as follows:
$ tree .
.
└── bertsquad-12
├── 1
│ └── executable.popef
└── config.pbtxt
bertsquad-12 is the name of the model
1 is the version of the model
executable.popef is the PopEF file produced by compiling the model
config.pbtxt is the Triton configuration file described in Section 3.2.1, Generating the model configuration
3.2.1. Generating the model configuration
To deploy a model to Triton Inference Server, you need to create a configuration file, config.pbtxt, for the model. It mainly contains the model name, the backend to use, batching information, and the model inputs and outputs. For more details about model configuration, refer to the
Triton model configuration
documentation.
The config.pbtxt used in this example is shown below; copy this content into the empty config.pbtxt created above.
name: "bertsquad-12"
backend: "poplar"
max_batch_size: 16
dynamic_batching {
preferred_batch_size: [16]
max_queue_delay_microseconds: 5000
}
input [
{
name: "input_ids:0"
data_type: TYPE_INT32
dims: [ 256 ]
},
{
name: "input_mask:0"
data_type: TYPE_INT32
dims: [ 256 ]
},
{
name: "segment_ids:0"
data_type: TYPE_INT32
dims: [ 256 ]
},
{
name: "unique_ids_raw_output___9:0"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
}
]
output [
{
name: "unique_ids:0"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "unstack:0"
data_type: TYPE_FP16
dims: [ 256 ]
},
{
name: "unstack:1"
data_type: TYPE_FP16
dims: [ 256 ]
}
]
parameters [
{
key: "synchronous_execution"
value:{string_value: "1"}
},
{
key: "timeout_ns"
value:{string_value: "500000"}
}
]
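Note that dims in config.pbtxt describe a single sample: because max_batch_size is 16, client-side tensors carry an additional leading batch dimension. For example (a hypothetical snippet, with shapes inferred from this configuration):

import numpy as np

# dims: [ 256 ] plus the batch dimension implied by max_batch_size: 16
# gives a client-side tensor of shape [16, 256]
input_ids = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)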
3.2.2. Starting the model server
Use gc-docker to start the Triton Inference Server container, which loads and runs executable.popef through the Poplar Triton Backend:
gc-docker -- --rm \
--network=host \
-v `pwd -P`:/models \
graphcorecn/toolkit-triton-staging:latest
Note
If you are testing on an IPU-M2000 or Bow-2000, remove the --network=host argument from the command.
When the following messages are printed, the model has been deployed successfully and the server is ready to accept gRPC and HTTP requests:
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
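Before sending inference requests, you can optionally check that the server is live and that the model has been loaded. The following is a minimal sketch using the Triton gRPC client (installed in the virtual environment described in Section 3.2.3); the URL and model name match the deployment above.

import tritonclient.grpc as gc

# Connect to the gRPC endpoint started above
client = gc.InferenceServerClient(url="localhost:8001")

# Check that the server is live and the bertsquad-12 model is loaded
print("server live:", client.is_server_live())
print("model ready:", client.is_model_ready("bertsquad-12"))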
3.2.3. Verifying the service with gRPC
The following example uses the Triton Client gRPC API to test the deployed model. For more detailed API information, refer to the Triton Client documentation and code examples.
import sys

import numpy as np
import tritonclient.grpc as gc

# Create the triton client
triton_client = gc.InferenceServerClient(url="localhost:8001")

model_name = 'bertsquad-12'
inputs = []
outputs = []
inputs.append(gc.InferInput('input_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('input_mask:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('segment_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"))

# Create data
input0_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16, 1)).astype(np.int32)
for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs_names = ['unique_ids:0', 'unstack:0', 'unstack:1']
for name in outputs_names:
    outputs.append(gc.InferRequestedOutput(name))

results = triton_client.infer(
    model_name=model_name, inputs=inputs, outputs=outputs
)

statistics = triton_client.get_inference_statistics(model_name=model_name)
if len(statistics.model_stats) != 1:
    print("FAILED: Inference Statistics")
    sys.exit(1)
print(statistics)

for name in outputs_names:
    print(f'{name} = {results.as_numpy(name)}')
Open a new terminal connected to the host, save the code above as grpc_test.py, then create a Python virtual environment and run the test:
virtualenv -p python3 venv
source venv/bin/activate
pip install tritonclient[all]
python grpc_test.py
deactivate
If it executes correctly, the model statistics and inference results are returned:
model_stats {
name: "bertsquad-12"
version: "1"
last_inference: 1667439772895
inference_count: 64
execution_count: 4
inference_stats {
success {
count: 4
ns: 170377440
}
...
unique_ids:0 = [[0]
...
unstack:0 = [[-0.991 -1.472 -1.571 ... -1.738 -1.77 -1.803]
...
unstack:1 = [[-0.9023 -1.285 -1.325 ... -1.419 -1.441 -1.452 ]
...
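Since the server also listens for HTTP requests on port 8000, the same inference can be sent with the Triton HTTP client. The following is a minimal sketch mirroring the gRPC example above; it assumes the same model name and input shapes.

import numpy as np
import tritonclient.http as httpclient

# Create the HTTP client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")
model_name = 'bertsquad-12'

# Build the same random inputs as in the gRPC example
input0_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16, 1)).astype(np.int32)

inputs = [
    httpclient.InferInput('input_ids:0', [16, 256], "INT32"),
    httpclient.InferInput('input_mask:0', [16, 256], "INT32"),
    httpclient.InferInput('segment_ids:0', [16, 256], "INT32"),
    httpclient.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"),
]
for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs = [httpclient.InferRequestedOutput(name)
           for name in ['unique_ids:0', 'unstack:0', 'unstack:1']]

# Run the inference and print the outputs
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
for name in ['unique_ids:0', 'unstack:0', 'unstack:1']:
    print(f'{name} = {results.as_numpy(name)}')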