5. Model runtime

As mentioned in Section 2.3, Using the IPU Inference Toolkit, the IPU Inference Toolkit workflow is divided into two phases: model compilation and model runtime. This chapter describes how to deploy and run models with PopRT, Triton Inference Server or TensorFlow Serving after the model has been converted and compiled to a PopEF model as described in Section 4, Model compilation.

Note

  • Since the commands and code in the examples in this chapter are relatively long, the HTML version of the document is recommended for copying.

  • When copying from a PDF document, ensure that commands are not truncated by line breaks or page breaks.

5.1. Run with PopRT Runtime

PopRT Runtime is a runtime environment included in PopRT for loading and running PopEF models. PopRT Runtime provides Python and C++ APIs, which can be used for rapid verification of PopEF files or integration with machine learning frameworks and model service frameworks. This section will take the executable.popef generated in Section 4.1.4, Model conversion and compilation as an example to describe how to use the PopRT Runtime API to load and run PopEF models.

5.1.1. Environment preparation

Switch to the directory containing the executable.popef generated in Section 4.1.4, Model conversion and compilation, and check that you are in the correct directory with ls.

$ ls `pwd -P` | grep bertsquad-12_fp16_bs_16.onnx

The following output shows that the current directory is correct.

bertsquad-12_fp16_bs_16.onnx

Start the Docker container with the following command:

$ gc-docker -- --rm -it \
    -v `pwd -P`:/model_runtime_test \
    -w /model_runtime_test \
    --entrypoint /bin/bash \
    graphcorecn/poprt-staging:latest

5.1.2. Run with PopRT Runtime Python API

The sample code is shown in Listing 5.1. Save it as model_runner_quick_start.py.

Listing 5.1 Sample PopRT Runtime code (Python API)
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.

import numpy as np
from poprt import runtime

# Load the popef
runner = runtime.ModelRunner('executable.popef')

# Get the input and output information of the model
inputs = runner.get_model_inputs()
outputs = runner.get_model_outputs()

# Create the random inputs and zero outputs
inputs_dict = {
    x.name: np.random.randint(2, size=x.shape).astype(x.numpy_data_type())
    for x in inputs
}
outputs_dict = {
    x.name: np.zeros(x.shape).astype(x.numpy_data_type()) for x in outputs
}

# Execute the inference
runner.execute(inputs_dict, outputs_dict)

# Check the output values
for name, value in outputs_dict.items():
    print(f'{name} : {value}')


Run the saved sample code:

$ python3 model_runner_quick_start.py

A successful run will produce an output similar to the following:

unstack:1 : [[-0.9604 -1.379  -2.01   ... -1.814  -1.78   -1.626 ]
[-1.051  -1.977  -1.913  ... -1.435  -1.681  -1.251 ]
[-3.67   -2.71   -2.78   ... -3.951  -4.027  -3.959 ]
...
[-0.0919 -0.6445 -0.3125 ... -0.384  -0.54   -0.3152]
[-0.69   -1.071  -1.421  ... -1.533  -1.456  -1.389 ]
[-3.56   -2.99   -3.23   ... -4.05   -3.977  -3.955 ]]
unstack:0 : [[-1.437  -1.645  -2.17   ... -2.139  -2.379  -2.281 ]
[-1.259  -1.8545 -1.915  ... -1.804  -1.8955 -1.671 ]
[-2.832  -2.057  -2.104  ... -3.29   -3.34   -3.36  ]
...
[-0.4673 -0.8716 -0.8545 ... -1.253  -1.287  -1.289 ]
[-1.288  -1.481  -1.928  ... -2.158  -2.146  -2.129 ]
[-2.762  -2.43   -2.6    ... -3.418  -3.23   -3.324 ]]
unique_ids:0 : [1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1]
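
Since PopRT Runtime is intended for rapid verification of PopEF files, it can also be useful to get a rough latency figure. The following sketch is an illustrative addition (not part of the original example) that reuses only the APIs shown in Listing 5.1 to time repeated executions:

import time

import numpy as np
from poprt import runtime

# Load the PopEF file and prepare random inputs / zero outputs as in Listing 5.1
runner = runtime.ModelRunner('executable.popef')
inputs_dict = {
    x.name: np.random.randint(2, size=x.shape).astype(x.numpy_data_type())
    for x in runner.get_model_inputs()
}
outputs_dict = {
    x.name: np.zeros(x.shape).astype(x.numpy_data_type())
    for x in runner.get_model_outputs()
}

# Warm up once, then time a number of executions
runner.execute(inputs_dict, outputs_dict)
n = 10
start = time.time()
for _ in range(n):
    runner.execute(inputs_dict, outputs_dict)
print(f'Average latency over {n} runs: {(time.time() - start) / n * 1000:.2f} ms')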

5.1.3. Run with PopRT Runtime C++ API

Sample code using the PopRT Runtime C++ API is shown in Listing 5.2. Save it as model_runner_quick_start.cpp.

Listing 5.2 Sample PopRT Runtime code (C++ API)
// Copyright (c) 2022 Graphcore Ltd. All rights reserved.

#include <iostream>
#include "poprt/runtime/model_runner.hpp"

int main(int argc, char* argv[]) {
    // Load the PopEF file
    auto runner = poprt::runtime::ModelRunner("executable.popef");

    // Get the input and output information of the model
    auto inputs = runner.getModelInputs();
    auto outputs = runner.getModelOutputs();

    // Create the inputs and outputs
    poprt::runtime::InputMemoryView in;
    poprt::runtime::OutputMemoryView out;
    std::vector<std::shared_ptr<unsigned char[]>> memories;
    int i=0;
    for(const auto& input : inputs){
        memories.push_back(std::shared_ptr<unsigned char[]>(new unsigned char[input.sizeInBytes]));
        in.emplace(input.name, poprt::runtime::ConstTensorMemoryView(memories[i++].get(), input.sizeInBytes));
    }
    for(const auto& output : outputs){
        memories.push_back(std::shared_ptr<unsigned char[]>(new unsigned char[output.sizeInBytes]));
        out.emplace(output.name, poprt::runtime::TensorMemoryView(memories[i++].get(), output.sizeInBytes));
    }

    // Execute the inference
    runner.execute(in, out);

    // Print the result information
    std::cout << "Sucessfully executed. The outputs are: " << std::endl;
    for(const auto& output: outputs)
        std::cout << "name: " << output.name
                << ", dataType: " << output.dataType
                << ", sizeInBytes: " << output.sizeInBytes
                << std::endl;
}


Compile the sample code.

$ apt-get update && \
  apt-get install g++ -y && \
  g++ model_runner_quick_start.cpp -o model_runner_quick_start \
    --std=c++14 -I/usr/local/lib/python3.8/dist-packages/poprt/include \
    -L/usr/local/lib/python3.8/dist-packages/poprt/lib \
    -lpoprt_runtime -lpoprt_compiler -lpopef

Run the example program obtained from compilation.

$ LD_LIBRARY_PATH=/usr/local/lib/python3.8/dist-packages/poprt/lib:$LD_LIBRARY_PATH ./model_runner_quick_start

A successful run will result in the following output:

Successfully executed. The outputs are:
name: unstack:1, dataType: F16, sizeInBytes: 8192
name: unstack:0, dataType: F16, sizeInBytes: 8192
name: unique_ids:0, dataType: S32, sizeInBytes: 64

Note

After completing the above example, exit the current container and return to the host environment.

5.2. Deploy to Triton Inference Server

The Poplar SDK includes a plugin (libtriton_poplar.so) for the Triton Inference Server, implemented with Poplar Model Runtime. This is responsible for loading and running PopEF files. For more information about the Poplar Triton Backend, refer to the Poplar Triton Backend: User Guide.

This section will use the executable.popef file generated in Section 4.1.4, Model conversion and compilation as an example to explain how to deploy the PopEF file from compilation to a Triton Inference Server.

5.2.1. Environment preparation

First, switch to the directory containing the executable.popef generated in Section 4.1.4, Model conversion and compilation, and check that the directory is correct with the following command.

$ ls `pwd -P` | grep bertsquad-12_fp16_bs_16.onnx

The following output shows that the current directory is correct:

bertsquad-12_fp16_bs_16.onnx

Create a directory for model_repository.

$ mkdir -p model_repository/bertsquad-12/1/ && \
  cp executable.popef model_repository/bertsquad-12/1/ && \
  touch model_repository/bertsquad-12/config.pbtxt && \
  cd model_repository

The directory structure is as follows.

$ tree .
.
└── bertsquad-12
    ├── 1
    │   └── executable.popef
    └── config.pbtxt

  • bertsquad-12 is the name of the model.

  • 1 indicates the version of the model.

  • executable.popef is the PopEF file generated by model compilation.

  • config.pbtxt is the Triton configuration file described in Configuration of generated model.

Note

Sometimes the names of a model’s inputs and outputs contain special characters which are not accepted by the Triton Inference Server. In this case, the Poplar Triton Backend can remap these names for you. Refer to the section on Input/Output Name Mapping in the Poplar Triton Backend: User Guide.

5.2.2. Configuration of generated model

To deploy the model to a Triton Inference Server, you need to create a configuration file config.pbtxt for the model, which contains for example the name of the model, the backend to be used, batching information and input and output information. For more information about model configuration, refer to the Triton Model Configuration documentation.

The configuration of config.pbtxt used in this example is shown in Listing 5.3. Copy this configuration to the empty config.pbtxt generated above.

Listing 5.3 Configuration of config.pbtxt used in this example
name: "bertsquad-12"
backend: "poplar"
max_batch_size: 16
dynamic_batching {
   preferred_batch_size: [16]
   max_queue_delay_microseconds: 5000
}
input [
   {
      name: "input_ids:0"
      data_type: TYPE_INT32
      dims: [ 256 ]
   },
   {
      name: "input_mask:0"
      data_type: TYPE_INT32
      dims: [ 256 ]
   },
   {
      name: "segment_ids:0"
      data_type: TYPE_INT32
      dims: [ 256 ]
   },
   {
      name: "unique_ids_raw_output___9:0"
      data_type: TYPE_INT32
      dims: [ 1 ]
      reshape: { shape: [ ] }
   }
]
output [
   {
      name: "unique_ids:0"
      data_type: TYPE_INT32
      dims: [ 1 ]
      reshape: { shape: [ ] }
   },
   {
      name: "unstack:0"
      data_type: TYPE_FP16
      dims: [ 256 ]
   },
   {
      name: "unstack:1"
      data_type: TYPE_FP16
      dims: [ 256 ]
   }
]
parameters [
   {
      key: "synchronous_execution"
      value:{string_value: "1"}
   },
   {
      key: "timeout_ns"
      value:{string_value: "500000"}
   }
]

Model name

The model name, bertsquad-12, is usually the same as the directory name where the model is located.

Backend

poplar indicates that we are using the Poplar Triton Backend.

Batching

Since the Poplar Triton Backend supports dynamic batching, we recommend setting the values of max_batch_size and preferred_batch_size to integer multiples of the batch size of the model. The batch size of the model in this example is 16. For simplicity, you can set both parameters to the batch size.

Input and output

The input and output names, types and dimension information can be viewed with the popef_dump tool. popef_dump is included in the Poplar SDK and it allows you to analyze a PopEF file without using a C++ or Python API. The output shows the file structure and gives basic information. popef_dump does not allow you to view any binary content. For more information about popef_dump, refer to PopEF file analysis in the PopEF: User Guide.

$ gc-docker -- --rm \
    -v `pwd -P`:/models \
    --entrypoint popef_dump \
    graphcorecn/toolkit-triton-staging:latest \
    /models/bertsquad-12/1/executable.popef

The following is an excerpt of the output of the popef_dump command:

PopEF file: executable.popef
Metadata:
...
Anchors:
Inputs (User-provided):
Name: "input_ids:0":
   TensorInfo: { dtype: S32, sizeInBytes: 16384, shape [16, 256] }
   Programs: [5]
   Handle: h2d_input_ids:0
   IsPerReplica: False
...
Outputs (User-provided):
Name: "unique_ids:0":
   TensorInfo: { dtype: S32, sizeInBytes: 64, shape [16] }
   Programs: [5]
   Handle: anchor_d2h_unique_ids:0
   IsPerReplica: False
...

From the above excerpt of the popef_dump output, you can see how the model inputs and outputs in the PopEF file correspond to the input and output entries in the model configuration file. For the mapping of dtype values to Triton data types, refer to PopEF Tensor and Feed Data and Types supported by Poplar Triton Backend.

When max_batch_size is not set to 0, the dims field in the model configuration file does not include the batch-size dimension; for example, the shape of input_ids:0 is [16, 256], which is configured as dims: [ 256 ] in the model configuration file. For inputs and outputs that only have the batch-size dimension, such as unique_ids:0, you need to set dims: [ 1 ] and use reshape: { shape: [ ] } so that the underlying tensor keeps only the batch dimension. For more information about dimension settings, refer to the Triton Model Configuration documentation.

For more description about the fields in config.pbtxt, refer to Triton model configuration in the Poplar Triton Backend documentation.
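
As an alternative to reading the popef_dump output by hand, the tensor information needed for config.pbtxt can also be printed with the PopRT Runtime Python API described in Section 5.1. The following is an illustrative sketch, assuming the PopRT container from Section 5.1.1 with executable.popef in the current directory; the data-type mapping only covers the types used in this example:

import numpy as np
from poprt import runtime

# Mapping from numpy dtypes to Triton data types (only the types used here)
NUMPY_TO_TRITON = {'int32': 'TYPE_INT32', 'float16': 'TYPE_FP16'}

runner = runtime.ModelRunner('executable.popef')

for tensor in runner.get_model_inputs() + runner.get_model_outputs():
    # With a non-zero max_batch_size, dims excludes the leading batch dimension,
    # e.g. [16, 256] becomes dims: [ 256 ]; batch-only tensors need dims: [ 1 ]
    dims = list(tensor.shape)[1:] or [1]
    dtype = NUMPY_TO_TRITON.get(np.dtype(tensor.numpy_data_type()).name, 'unknown')
    print(f'{tensor.name}: data_type: {dtype}, dims: {dims}')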

5.2.3. Start model service

Start the container with gc-docker. If the Poplar SDK is not installed on the host, refer to Section 3.7.2, Run a Docker container.

$ gc-docker -- --rm \
    --network=host \
    -v `pwd -P`:/models \
    graphcorecn/toolkit-triton-staging:latest

Note

In the case of testing on an IPU-M2000 or Bow-2000, omit the --network=host parameter when running gc-docker.

The following information is printed to indicate that the service has been started and can accept gRPC and HTTP requests.

Started GRPCInferenceService at 0.0.0.0:8001
Started HTTPService at 0.0.0.0:8000
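
Before sending inference requests, you can optionally check that the server and the model are ready. The following is an illustrative sketch using the Triton client gRPC API (assuming tritonclient is installed, as in the verification steps below):

import tritonclient.grpc as gc

# Connect to the gRPC endpoint started above
triton_client = gc.InferenceServerClient(url="localhost:8001")

# Both calls return True once the server and the model can accept requests
print("server ready:", triton_client.is_server_ready())
print("model ready:", triton_client.is_model_ready("bertsquad-12"))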

Verify the service with gRPC

The following is an example of testing the deployment of a model using the Triton Client gRPC API. For more detailed API information, refer to its documentation and code examples.

import sys
import numpy as np
import tritonclient.grpc as gc

# create the triton client
triton_client = gc.InferenceServerClient(
    url = "localhost:8001")

model_name = 'bertsquad-12'

inputs = []
outputs = []

inputs.append(gc.InferInput('input_ids:0', [16,256], "INT32"))
inputs.append(gc.InferInput('input_mask:0', [16,256], "INT32"))
inputs.append(gc.InferInput('segment_ids:0', [16,256], "INT32"))
inputs.append(gc.InferInput('unique_ids_raw_output___9:0', [ 16,1 ], "INT32"))

# create data
input0_data = np.random.randint(0, 1000, size=(16,256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16,1)).astype(np.int32)

for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs_names = ['unique_ids:0', 'unstack:0', 'unstack:1']

for name in outputs_names:
    outputs.append(gc.InferRequestedOutput(name))

results = triton_client.infer(
    model_name = model_name,
    inputs = inputs,
    outputs = outputs)

statistics = triton_client.get_inference_statistics(model_name=model_name)
if len(statistics.model_stats) != 1:
    print("FAILED: Inference Statistics")
    sys.exit(1)
print(statistics)

for name in outputs_names:
    print(f'{name} = {results.as_numpy(name)}')


Open a new terminal and connect to the host. Save the above code as grpc_test.py, then create a Python virtual environment and run the test.

$ virtualenv -p python3 venv && \
  source venv/bin/activate && \
  pip install tritonclient[all] && \
  python grpc_test.py && \
  deactivate

If executed correctly, model statistics and inference results will be returned.

model_stats {
name: "bertsquad-12"
version: "1"
last_inference: 1667439772895
inference_count: 64
execution_count: 4
inference_stats {
   success {
   count: 4
   ns: 170377440
   }
...
unique_ids:0 = [[0]
...
unstack:0 = [[-0.991 -1.472 -1.571 ... -1.738 -1.77 -1.803]
...
unstack:1 = [[-0.9023 -1.285 -1.325 ... -1.419 -1.441 -1.452 ]
...

Verify the service with HTTP

The following is an example of testing the deployment of a model using the Triton Client HTTP API. For more detailed API information, refer to its documentation and code examples.

import sys
import numpy as np
import tritonclient.http as hc

# create the triton client
triton_client = hc.InferenceServerClient(
    url = "localhost:8000")

model_name = 'bertsquad-12'

inputs = []
outputs = []

inputs.append(hc.InferInput('input_ids:0', [16,256], "INT32"))
inputs.append(hc.InferInput('input_mask:0', [16,256], "INT32"))
inputs.append(hc.InferInput('segment_ids:0', [16,256], "INT32"))
inputs.append(hc.InferInput('unique_ids_raw_output___9:0', [ 16,1 ], "INT32"))

# create data
input0_data = np.random.randint(0, 1000, size=(16,256)).astype(np.int32)
input1_data = np.random.randint(0, 1, size=(16,1)).astype(np.int32)

for i in range(3):
    inputs[i].set_data_from_numpy(input0_data)
inputs[3].set_data_from_numpy(input1_data)

outputs_names = ['unique_ids:0', 'unstack:0', 'unstack:1']

for name in outputs_names:
    outputs.append(hc.InferRequestedOutput(name, binary_data=True))

results = triton_client.infer(
    model_name = model_name,
    inputs = inputs,
    outputs = outputs)

statistics = triton_client.get_inference_statistics(model_name=model_name, headers=None)
if len(statistics['model_stats']) != 1:
    print("FAILED: Inference Statistics")
    sys.exit(1)
print(statistics)

for name in outputs_names:
    print(f'{name} = {results.as_numpy(name)}')


Open a new terminal and connect to the host. Save the above code as http_test.py, then create a Python virtual environment and run the test.

# Execute the following virtualenv command if the python virtual environment has not been created
# virtualenv -p python3 venv

$ source venv/bin/activate && \
  pip install tritonclient[all] && \
  python http_test.py && \
  deactivate

If executed correctly, model statistics and inference results will be returned.

{'model_stats': [{'name': 'bertsquad-12', 'version': '1', 'last_inference': 1667440001420, 'inference_count': 80, ... {'count': 5, 'ns': 462978}}]}]}
unique_ids:0 = [[0]
...
unstack:0 = [[-0.753  -1.183  -1.296  ... -1.595  -1.599  -1.65  ]
...
unstack:1 = [[-0.6206 -0.9683 -1.031  ... -1.222  -1.221  -1.241 ]
...

Note

This completes the example. Press Ctrl+C to exit the Triton container and return to the host environment.

5.3. Deploy to TensorFlow Serving

This section will use the executable.popef file generated in Section 4.2.2, Model conversion and compilation as an example to explain how to deploy the compiled PopEF file to TensorFlow Serving.

5.3.1. Environment preparation

In the following example, the container needs to be started with gc-docker. If the Poplar SDK is not installed on the host, refer to Section 3.7.2, Run a Docker container. Before starting Docker, define the MODEL_PATH environment variable to be the path to the directory containing resnet_v2_50_optimized.onnx:

$ export MODEL_PATH=/path/to/your/models
$ ls $MODEL_PATH

If the MODEL_PATH environment variable is specified correctly, the following information will be displayed:

resnet_v2_50_optimized.onnx executable.popef ...

5.3.2. Generate SavedModel model

The input model format of TensorFlow Serving is SavedModel, so we encapsulate the PopEF file in a TensorFlow custom op named model_runtime. This custom op has been written using the Poplar Model Runtime API.

$ gc-docker -- --rm \
   -v $MODEL_PATH:/model_path \
   graphcorecn/toolkit-tfserving-staging:latest \
   /bin/bash -c "python3 -m popef2tf.convert \
   --model /model_path/executable.popef \
   --name resnet_v2_50_serving \
   --version 001 \
   --output /model_path"

Note

The TensorFlow included in this Docker image is compiled from the official source code for TensorFlow version 2.6.5. This is not the same as the Graphcore port of TensorFlow that is provided in the Poplar SDK.

The --model parameter of popef2tf specifies the path of the input PopEF file, and --output specifies the directory where the SavedModel is written.

Use tree to list the contents of the generated SavedModel directory:

$ tree $MODEL_PATH/resnet_v2_50_serving
resnet_v2_50_serving
└── 001
    ├── assets
    │   └── executable.popef
    ├── saved_model.pb
    └── variables
        ├── variables.data-00000-of-00001
        └── variables.index

5.3.3. Start model service

TensorFlow Serving can batch requests to improve performance, but the IPU uses static graphs, so the batch size and input shape must be fixed in advance. For this model, we use a fixed input tensor shape of [4, 3, 224, 224], which corresponds to a batch_size of 4.

We also need to set allowed_batch_sizes and max_batch_size, and in this example we set both parameters to 4 in the config file during deployment. This means that the client can send input data to the host server with batch_size set to any integer value between 1 and 4. More information about the effect of allowed_batch_sizes is given in Running with and without batching below.

Create the config file in the directory specified by MODEL_PATH and open it for editing:

$ touch $MODEL_PATH/resnet_v2_50_serving/001/resnet_bs4_3_224_224.conf
$ vim $MODEL_PATH/resnet_v2_50_serving/001/resnet_bs4_3_224_224.conf

Add the following to the config file:

allowed_batch_sizes: 4
max_batch_size {value: 4}

Start the TensorFlow Serving service:

$ gc-docker -- --rm \
   -v $MODEL_PATH:/model_path \
   --network=host \
   graphcorecn/toolkit-tfserving-staging:latest \
   /bin/bash -c "tensorflow_model_server \
   --rest_api_port=8501 \
   --model_name=resnet_v2_50 \
   --enable_batching \
   --batching_parameters_file=/model_path/resnet_v2_50_serving/001/resnet_bs4_3_224_224.conf \
   --model_base_path=/model_path/resnet_v2_50_serving"

where --enable_batching is used to enable batching, and --batching_parameters_file specifies the path of the config file.

Note

In the case of testing on an IPU-M2000 or Bow-2000, remove the --network=host parameter when running gc-docker.

Note

The TensorFlow Serving included in this Docker image supports the TensorFlow custom operator for running PopEF files. This TensorFlow Serving is compiled from the official source code of TensorFlow Serving version 2.6.5 (TensorFlow-2.6.5).

If you do not need to enable batching, remove --enable_batching and --batching_parameters_file when starting the container. In this case, the client can only process input data with batch_size=4. Any other values for batch_size will raise an error.

After the container is started, the output log is as follows:

Building single TensorFlow model file config: model_name: resnet_v2_50 model_base_path: /model_path/resnet_v2_50_serving
Adding/updating models.
(Re-)adding model: resnet_v2_50
Successfully reserved resources to load servable {name: resnet_v2_50 version: 1}
Approving load for servable version {name: resnet_v2_50 version: 1}
Loading servable version {name: resnet_v2_50 version: 1}
Reading SavedModel from: /model_path/resnet_v2_50_serving/001
Reading meta graph with tags { serve }
Reading SavedModel debug info (if present) from: /model_path/resnet_v2_50_serving/001
...
Successfully loaded servable version {name: resnet_v2_50 version: 1}
...
Exporting HTTP/REST API at:localhost:8501 ...

The resnet_v2_50 model service is now available on port 8501 and can be called by clients using the RESTful API.
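
Before running the full client example in Section 5.3.4, you can query the model status over the RESTful API as a quick check. This is an illustrative sketch, assuming the service is reachable on localhost:8501:

import json
import urllib3

http = urllib3.PoolManager()
# TensorFlow Serving exposes the model status at /v1/models/<model_name>
r = http.request('GET', 'http://localhost:8501/v1/models/resnet_v2_50')
print(json.dumps(json.loads(r.data), indent=2))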

Running with and without batching

It is possible to run the model with or without batching enabled in TensorFlow Serving. This section describes in more detail how the value of allowed_batch_sizes affects the service with and without batching enabled.

  • Batching enabled and allowed_batch_sizes is set to 4

    Clients can send input data with batch_size of any integer in the range [1-4] to the host server. If the batch size of the input data is less than 4, then the host server will pad the input data up to allowed_batch_sizes or max_batch_size, which is 4 in our case. For example, if a client sends input data with a shape of [1, 3, 224, 224], the host server will pad the input data to [4, 3, 224, 224] with dummy data.

  • Batching enabled and allowed_batch_sizes is not set

    The host server would accept any batch_size in the range [1-4], but PopEF will not. For example, if a client sends input data with a shape of [2, 3, 224, 224] to the server, the server won’t pad this input data, but will only check if it exceeds max_batch_size. Since batch_size for this data is less than max_batch_size, the server will send this [2, 3, 224, 224] tensor to the TensorFlow backend. This will raise an error about the allocated memory size.

    Similarly, sending input data with a shape of [5, 3, 224, 224] where batch_size exceeds max_batch_size, will raise an error that the submitted batch size is larger than the maximum input batch size 4.

  • Batching disabled

    Clients are only allowed to send input data with batch_size=4 to the server. For example, if a client sends input data with a shape of [1, 3, 224, 224], the server won't pad the data, and this will raise an error about the allocated memory size. A client-side padding sketch for this situation is shown after this list.

    In this case, the check that batch_size exceeds max_batch_size is not done. For example, if a client sends input data with a shape of [5, 3, 224, 224], this will also raise an error about the allocated memory size.
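
When batching is disabled (or when you want to avoid relying on server-side padding), the client can pad a partial batch itself and discard the outputs for the padded entries. The following is an illustrative sketch of this idea; the sample data and helper function are hypothetical and not part of the original example:

import numpy as np

BATCH_SIZE = 4  # fixed batch size compiled into the PopEF model

def pad_to_batch(x, batch_size=BATCH_SIZE):
    # Pad the leading (batch) dimension with zeros up to batch_size
    n = x.shape[0]
    if n > batch_size:
        raise ValueError(f'batch size {n} exceeds the fixed batch size {batch_size}')
    pad = [(0, batch_size - n)] + [(0, 0)] * (x.ndim - 1)
    return np.pad(x, pad), n

# Example: two real samples padded up to a full batch of four
real = np.random.rand(2, 3, 224, 224).astype(np.float32)
padded, n_real = pad_to_batch(real)
# ... send `padded` to the server, then keep only the first n_real results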

5.3.4. Verify the service with HTTP

In this example, a set of images is downloaded to demonstrate how to call the model service using the RESTful API:

import numpy as np
import json, urllib3
import tempfile, wget
from PIL import Image

def read_labels_file(fname):
    outs = []
    with open(fname, 'r') as fin:
        for line in fin.readlines():
            outs.append(line.strip())
    return outs

def read_images(files, img_h, img_w):
    images = []
    for file in files:
        image = Image.open(file)
        image = image.resize((img_h, img_w))
        image = (np.array(image) / 255).astype(np.float32)
        image = image.transpose((2, 0, 1))
        images.append(image[np.newaxis, :])
    return images

def main():
    # image urls
    urls = [
            'http://images.cocodataset.org/test-stuff2017/000000024309.jpg',
            'http://images.cocodataset.org/test-stuff2017/000000028117.jpg',
            'http://images.cocodataset.org/test-stuff2017/000000006149.jpg',
            'http://images.cocodataset.org/test-stuff2017/000000004954.jpg',
        ]

    # download images
    image_files = []
    with tempfile.TemporaryDirectory() as tmpdir:
        for url in urls:
            image_files.append(wget.download(url, tmpdir))
        images = read_images(image_files, 224, 224)

        label_file = wget.download(
            "https://storage.googleapis.com/download.tensorflow.org/data/ImageNetLabels.txt", tmpdir)
        labels = read_labels_file(label_file)

    # get input json
    input_data = np.concatenate(images, axis=0)
    input_data_list = input_data.tolist()
    postData = {'inputs':input_data_list}
    jPostData = json.dumps(postData)

    http = urllib3.PoolManager()
    r = http.request('POST','http://localhost:8501/v1/models/resnet_v2_50:predict',body=jPostData)

    return_data = json.loads(r.data)
    output = np.array(list(return_data.values()))

    # get real images top5
    k = 5
    idx = output.argsort()[:,:,-1:-k-1:-1]
    for b,idx_bs in enumerate(idx[0]):
        print(f'\nimage{b}:')
        for i in idx_bs:
            print(labels[i], output[0, b, i])

if __name__ == "__main__":
    main()


Open a new terminal and connect to the host. Save the above code as restful_http_test.py, then create a Python virtual environment and run the test.

$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip install pillow numpy urllib3 wget
$ python restful_http_test.py
$ deactivate

If executed correctly, the returned results are shown as follows:

image0:
laptop 18.2867374
notebook 15.9842319
desk 13.6191549
web site 11.1951714
mouse 11.1430635

image1:
mashed potato 14.4165163
guacamole 11.8748741
meat loaf 10.7978802
cheeseburger 9.09055805
plate 8.93529892

image2:
knee pad 8.52720547
volleyball 8.52627659
racket 8.31885242
ski 7.84297752
horizontal bar 7.11924124

image3:
hare 17.0646706
fountain 15.7413292
tennis ball 13.2553062
wallaby 12.6554518
wood rabbit 11.5797682