3. Running the model
3.1. Running with PopRT Runtime
PopRT Runtime is the component of PopRT that loads and runs PopEF models. It provides Python and C++ APIs, which can be used to quickly validate PopEF files or to integrate with machine learning frameworks and model serving frameworks.
This section uses the executable.popef file generated in Section 2.2, Model conversion and compilation, as an example to describe how to load and run a model with the PopRT Runtime API.
3.1.1. Environment setup
Change to the directory containing the executable.popef file generated in Section 2.2, Model conversion and compilation, and verify it with the ls command:
$ ls `pwd -P` | grep bertsquad-12.onnx.optimized.onnx
If the directory is correct, you will see the following output:
bertsquad-12.onnx.optimized.onnx
Start the container with the following command:
$ gc-docker -- --rm -it \
-v `pwd -P`:/model_runtime_test \
-w /model_runtime_test \
--entrypoint /bin/bash \
graphcorecn/poprt-staging:latest
3.1.2. Running the model with the Python API
Save the example code in Listing 3.1 as model_runner_quick_start.py.
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import numpy as np
from poprt import runtime
# Load the popef
runner = runtime.ModelRunner('executable.popef')
# Get the input and output information of the model
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()
# Create the random inputs and zero outputs
inputs_dict = {x.name:np.random.randint(2, size=x.shape).astype(x.numpy_data_type()) for x in inputs}
outputs_dict = {x.name:np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs}
# Execute the inference
runner.execute(inputs_dict, outputs_dict)
# Check the output values
for name, value in outputs_dict.items():
    print(f'{name} : {value}')
Run the saved example code:
$ python3 model_runner_quick_start.py
A successful run produces output similar to the following:
unstack:1 : [[-0.9604 -1.379 -2.01 ... -1.814 -1.78 -1.626 ]
[-1.051 -1.977 -1.913 ... -1.435 -1.681 -1.251 ]
[-3.67 -2.71 -2.78 ... -3.951 -4.027 -3.959 ]
...
[-0.0919 -0.6445 -0.3125 ... -0.384 -0.54 -0.3152]
[-0.69 -1.071 -1.421 ... -1.533 -1.456 -1.389 ]
[-3.56 -2.99 -3.23 ... -4.05 -3.977 -3.955 ]]
unstack:0 : [[-1.437 -1.645 -2.17 ... -2.139 -2.379 -2.281 ]
[-1.259 -1.8545 -1.915 ... -1.804 -1.8955 -1.671 ]
[-2.832 -2.057 -2.104 ... -3.29 -3.34 -3.36 ]
...
[-0.4673 -0.8716 -0.8545 ... -1.253 -1.287 -1.289 ]
[-1.288 -1.481 -1.928 ... -2.158 -2.146 -2.129 ]
[-2.762 -2.43 -2.6 ... -3.418 -3.23 -3.324 ]]
unique_ids:0 : [1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1]
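To get a rough idea of inference latency, you can call execute repeatedly and time it. The following sketch reuses only the ModelRunner calls shown above; the warm-up and iteration counts are arbitrary values chosen for illustration.

import time

# Warm up so that first-call overhead is excluded from the measurement
for _ in range(5):
    runner.execute(inputs_dict, outputs_dict)

# Time a fixed number of inferences and report the mean latency
iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    runner.execute(inputs_dict, outputs_dict)
elapsed = time.perf_counter() - start
print(f'Mean latency: {elapsed / iterations * 1000:.2f} ms')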
3.1.3. Running the model with the C++ API
Save the example code in Listing 3.2 as model_runner_quick_start.cpp.
// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
#include <iostream>
#include <memory>
#include <vector>

#include "poprt/runtime/model_runner.hpp"

int main(int argc, char* argv[]) {
  // Load the PopEF file
  auto runner = poprt::runtime::ModelRunner("executable.popef");

  // Get the inputs and outputs information of the model
  auto inputs  = runner.getModelInputs();
  auto outputs = runner.getModelOutputs();

  // Create the inputs and outputs
  poprt::runtime::InputMemoryView in;
  poprt::runtime::OutputMemoryView out;
  std::vector<std::shared_ptr<unsigned char[]>> memories;
  int i = 0;
  for (const auto& input : inputs) {
    memories.push_back(
        std::shared_ptr<unsigned char[]>(new unsigned char[input.sizeInBytes]));
    in.emplace(input.name,
               poprt::runtime::ConstTensorMemoryView(memories[i++].get(),
                                                     input.sizeInBytes));
  }
  for (const auto& output : outputs) {
    memories.push_back(
        std::shared_ptr<unsigned char[]>(new unsigned char[output.sizeInBytes]));
    out.emplace(output.name,
                poprt::runtime::TensorMemoryView(memories[i++].get(),
                                                 output.sizeInBytes));
  }

  // Execute the inference
  runner.execute(in, out);

  // Print the result information
  std::cout << "Successfully executed. The outputs are: " << std::endl;
  for (const auto& output : outputs)
    std::cout << "name: " << output.name << ", dataType: " << output.dataType
              << ", sizeInBytes: " << output.sizeInBytes << std::endl;
}
Compile the code:
$ apt-get update && \
apt-get install g++ -y && \
g++ model_runner_quick_start.cpp -o model_runner_quick_start \
--std=c++14 -I/usr/local/lib/python3.8/dist-packages/poprt/include \
-L/usr/local/lib/python3.8/dist-packages/poprt/lib \
-lpoprt_runtime -lpopef
Run the compiled program:
$ LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/poprt/lib:$LD_LIBRARY_PATH ./model_runner_quick_start
A successful run produces the following output:
Successfully executed. The outputs are:
name: unstack:1, dataType: F16, sizeInBytes: 8192
name: unstack:0, dataType: F16, sizeInBytes: 8192
name: unique_ids:0, dataType: S32, sizeInBytes: 64
Note
After completing the example above, exit the current container and return to the host environment.
3.2. Deploying the model to Triton Inference Server
In the directory containing executable.popef, create a directory named model_repository:
mkdir -p model_repository/bertsquad-12/1/
cp executable.popef model_repository/bertsquad-12/1/
touch model_repository/bertsquad-12/config.pbtxt
cd model_repository
The resulting directory structure is as follows:
$ tree .
.
└── bertsquad-12
├── 1
│ └── executable.popef
└── config.pbtxt
bertsquad-12 is the name of the model
1 is the version of the model
executable.popef is the PopEF file produced by compiling the model
config.pbtxt is the Triton configuration file described in Section 3.2.1, Generating the model configuration
3.2.1. Generating the model configuration
To deploy a model to Triton Inference Server, you need to create a configuration file, config.pbtxt, for the model. It mainly contains the model name, the backend to use, batching information, and the model inputs and outputs. For more details about model configuration, refer to the
Triton model configuration
documentation.
The config.pbtxt used in this example is shown below; copy this content into the empty config.pbtxt created above.
name: "bertsquad-12"
backend: "poplar"
max_batch_size: 16
dynamic_batching {
preferred_batch_size: [16]
max_queue_delay_microseconds: 5000
}
input [
{
name: "input_ids:0"
data_type: TYPE_INT32
dims: [ 256 ]
},
{
name: "input_mask:0"
data_type: TYPE_INT32
dims: [ 256 ]
},
{
name: "segment_ids:0"
data_type: TYPE_INT32
dims: [ 256 ]
},
{
name: "unique_ids_raw_output___9:0"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
}
]
output [
{
name: "unique_ids:0"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "unstack:0"
data_type: TYPE_FP16
dims: [ 256 ]
},
{
name: "unstack:1"
data_type: TYPE_FP16
dims: [ 256 ]
}
]
parameters [
{
key: "synchronous_execution"
value:{string_value: "1"}
},
{
key: "timeout_ns"
value:{string_value: "500000"}
}
]
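Note that dims in config.pbtxt describe a single sample: because max_batch_size is 16, client-side tensors carry an additional leading batch dimension. For example (a hypothetical snippet, with shapes inferred from this configuration):

import numpy as np

# dims: [ 256 ] plus the batch dimension implied by max_batch_size: 16
# gives a client-side tensor of shape [16, 256]
input_ids = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)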
3.2.2. Starting the model server
Use gc-docker to start the Triton Inference Server container, which loads and runs executable.popef through the Poplar Triton Backend:
gc-docker -- --rm \
--network=host \
-v `pwd -P`:/models \
graphcorecn/toolkit-triton-staging:latest
Note
If you are testing on an IPU-M2000 or Bow-2000, remove the --network=host argument from the command.
When the following messages are printed, the model has been deployed successfully and the server is ready to accept gRPC and HTTP requests:
Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
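Before sending inference requests, you can optionally check that the server is live and that the model has been loaded. The following is a minimal sketch using the Triton gRPC client (installed in the virtual environment described in Section 3.2.3); the URL and model name match the deployment above.

import tritonclient.grpc as gc

# Connect to the gRPC endpoint started above
client = gc.InferenceServerClient(url="localhost:8001")

# Check that the server is live and the bertsquad-12 model is loaded
print("server live:", client.is_server_live())
print("model ready:", client.is_model_ready("bertsquad-12"))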
3.2.3. Verifying the service with gRPC
The following example uses the Triton Client gRPC API to test the deployed model. For more detailed API information, refer to the Triton Client documentation and code examples.
import sys

import numpy as np
import tritonclient.grpc as gc

# Create the triton client
triton_client = gc.InferenceServerClient(url="localhost:8001")

model_name = 'bertsquad-12'
inputs = []
outputs = []
inputs.append(gc.InferInput('input_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('input_mask:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('segment_ids:0', [16, 256], "INT32"))
inputs.append(gc.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"))

# Create data
input0_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16, 1)).astype(np.int32)
for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs_names = ['unique_ids:0', 'unstack:0', 'unstack:1']
for name in outputs_names:
    outputs.append(gc.InferRequestedOutput(name))

results = triton_client.infer(
    model_name=model_name, inputs=inputs, outputs=outputs
)

statistics = triton_client.get_inference_statistics(model_name=model_name)
if len(statistics.model_stats) != 1:
    print("FAILED: Inference Statistics")
    sys.exit(1)
print(statistics)

for name in outputs_names:
    print(f'{name} = {results.as_numpy(name)}')
Open a new terminal connected to the host, save the code above as grpc_test.py, then create a Python virtual environment and run the test:
virtualenv -p python3 venv
source venv/bin/activate
pip install tritonclient[all]
python grpc_test.py
deactivate
If it executes correctly, the model statistics and inference results are returned:
model_stats {
name: "bertsquad-12"
version: "1"
last_inference: 1667439772895
inference_count: 64
execution_count: 4
inference_stats {
success {
count: 4
ns: 170377440
}
...
unique_ids:0 = [[0]
...
unstack:0 = [[-0.991 -1.472 -1.571 ... -1.738 -1.77 -1.803]
...
unstack:1 = [[-0.9023 -1.285 -1.325 ... -1.419 -1.441 -1.452 ]
...
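Since the server also listens for HTTP requests on port 8000, the same inference can be sent with the Triton HTTP client. The following is a minimal sketch mirroring the gRPC example above; it assumes the same model name and input shapes.

import numpy as np
import tritonclient.http as httpclient

# Create the HTTP client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")
model_name = 'bertsquad-12'

# Build the same random inputs as in the gRPC example
input0_data = np.random.randint(0, 1000, size=(16, 256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16, 1)).astype(np.int32)

inputs = [
    httpclient.InferInput('input_ids:0', [16, 256], "INT32"),
    httpclient.InferInput('input_mask:0', [16, 256], "INT32"),
    httpclient.InferInput('segment_ids:0', [16, 256], "INT32"),
    httpclient.InferInput('unique_ids_raw_output___9:0', [16, 1], "INT32"),
]
for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs = [httpclient.InferRequestedOutput(name)
           for name in ['unique_ids:0', 'unstack:0', 'unstack:1']]

# Run the inference and print the outputs
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
for name in ['unique_ids:0', 'unstack:0', 'unstack:1']:
    print(f'{name} = {results.as_numpy(name)}')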