3. Model Runner deep dive through examples

The ModelRunner class is a lightweight wrapper around the classes and mechanisms provided by the Model Runtime library. It allows a model stored in a PopEF file to be loaded and executed with minimal client effort.

Running a PopEF model with the ModelRunner class consists of two steps:

  1. Create a ModelRunner object by providing either a list of PopEF files or an instance of Model. During construction, an IPU device partition is acquired, the given model is loaded onto the device, and all necessary threads and classes are created and stored in the ModelRunner internal state. The user can pass a ModelRunnerConfig to the constructor to set several configuration options, for example the replication factor (replication_factor).

  2. Use one of the available execution modes to send the request to the IPU (both steps are sketched below).

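For example, both steps can be written in a few lines of Python (a minimal sketch; "model.popef" is a placeholder path and the random input data merely illustrates a request):

    from datetime import timedelta

    import numpy as np
    import model_runtime

    # Step 1: acquire an IPU partition and load the model from PopEF files.
    config = model_runtime.ModelRunnerConfig()
    config.device_wait_config = model_runtime.DeviceWaitConfig(
        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
        timeout=timedelta(seconds=600),
        sleepTime=timedelta(seconds=1))
    runner = model_runtime.ModelRunner(model_runtime.PopefPaths(["model.popef"]),
                                       config=config)

    # Step 2: build an input view (keeping the arrays alive) and send a request.
    input_tensors = {desc.name: np.random.randn(*desc.shape).astype(desc.numpy_data_type())
                     for desc in runner.getExecuteInputs()}
    input_view = model_runtime.InputMemoryView()
    for name, tensor in input_tensors.items():
        input_view[name] = tensor
    result = runner.execute(input_view)
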
Note

The ModelRunner object must stay alive until the last request sent to it has returned a result. Destroying the object stops execution and unloads the model from the IPU device. The state of requests that are still being processed when the object is destroyed is undefined.

Note

The files included by the examples presented in this chapter can be found in Section 9, Appendix. They contain helper functions, for example for processing command line arguments.

3.1. Execution modes

The ModelRunner class provides two execution modes: synchronous (execute()) and asynchronous (executeAsync()). In the synchronous mode, the calling thread is blocked until the result is available. In the asynchronous mode, the request is queued and a std::future object is returned; the calling thread is not blocked and the result can be accessed as soon as the IPU finishes its computations. In both modes, the user is responsible for allocating the memory of the input tensors.

All execute() and executeAsync() overloads take an InputMemoryView parameter that contains pointers to all input data. The user has to ensure that the input data exists and that the passed pointers remain valid until a result is returned. Each of execute() and executeAsync() comes in two flavors: execute(const InputMemoryView &, unsigned), execute(const InputMemoryView &, const OutputMemoryView &output_data, unsigned), executeAsync(const InputMemoryView &, unsigned) and executeAsync(const InputMemoryView &, const OutputMemoryView &output_data, unsigned). The difference between the overloads comes down to who is responsible for allocating the memory of the output tensors. There are two options available (both call shapes are sketched after the list below):

  1. ModelRunner allocates memory for the output and returns a TensorMemory instance for each output tensor.

  2. The user allocates the output tensor memory and passes OutputMemoryView to the particular execution mode. ModelRunner will place the result in the memory provided by the user.

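In Python, the two call shapes look like this (a minimal sketch; runner and input_view are assumed to have been created as in the construction sketch above):

    import numpy as np
    import model_runtime

    # Option 1: ModelRunner allocates the output and returns it.
    result = runner.execute(input_view)          # or runner.executeAsync(input_view)

    # Option 2: the user pre-allocates the output tensors and keeps them alive.
    output_tensors = {desc.name: np.zeros(desc.shape, dtype=desc.numpy_data_type())
                      for desc in runner.getExecuteOutputs()}
    output_view = model_runtime.OutputMemoryView()
    for name, tensor in output_tensors.items():
        output_view[name] = tensor
    runner.execute(input_view, output_view)      # results are written into output_tensors
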
The client can find out which tensors the model accepts and returns by calling the ModelRunner methods getExecuteInputs() and getExecuteOutputs(). These methods return a collection of DataDesc objects, each containing basic information about a tensor: its name, shape, data type and size in bytes.

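For example, the descriptors can be inspected like this in Python (a short sketch; runner is a ModelRunner instance, and the accessors used are the ones that appear in the listings below):

    for desc in runner.getExecuteInputs():
        print("input: ", desc.name, "shape:", desc.shape, "dtype:", desc.numpy_data_type())
    for desc in runner.getExecuteOutputs():
        print("output:", desc.name, "shape:", desc.shape, "dtype:", desc.numpy_data_type())
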
This C++ example sends inference requests to the IPU using all available execution modes.

Listing 3.1 model_runner_execution_modes.cpp
  1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  2#include <string>
  3#include <vector>
  4
  5#include <boost/program_options.hpp>
  6
  7#include "model_runtime/ModelRunner.hpp"
  8#include "model_runtime/Tensor.hpp"
  9#include "utils.hpp"
 10
 11namespace examples {
 12
 13void synchronousExecutionModeLibraryAllocatedOutput(
 14    model_runtime::ModelRunner &model_runner);
 15void synchronousExecutionModeUserAllocatedOutput(
 16    model_runtime::ModelRunner &model_runner);
 17void asynchronousExecutionModeLibraryAllocatedOutput(
 18    model_runtime::ModelRunner &model_runner);
 19void asynchronousExecutionModeUserAllocatedOutput(
 20    model_runtime::ModelRunner &model_runner);
 21
 22} // namespace examples
 23
 24/* The example shows loading a model from PopEF files and sending
 25 * inference requests using all available ModelRunner execution modes.
 26 */
 27int main(int argc, char *argv[]) {
 28  using namespace std::chrono_literals;
 29  static const char *example_desc = "Model runner execution modes example.";
 30  const boost::program_options::variables_map vm =
 31      examples::parsePopefProgramOptions(example_desc, argc, argv);
 32  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
 33
 34  model_runtime::ModelRunnerConfig config;
 35  config.device_wait_config =
 36      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
 37  model_runtime::ModelRunner model_runner(popef_paths, config);
 38
 39  examples::print("Running synchronous execution mode. The memory of the "
 40                  "output tensors is allocated by the ModelRunner object.");
 41  examples::synchronousExecutionModeLibraryAllocatedOutput(model_runner);
 42
 43  examples::print("Running synchronous execution mode. The memory of the "
 44                  "output tensors is allocated by the user.");
 45  examples::synchronousExecutionModeUserAllocatedOutput(model_runner);
 46
 47  examples::print("Running asynchronous execution mode. The memory of the "
 48                  "output tensors is allocated by the ModelRunner object.");
 49  examples::asynchronousExecutionModeLibraryAllocatedOutput(model_runner);
 50
 51  examples::print("Running asynchronous execution mode. The memory of the "
 52                  "output tensors is allocated by the user.");
 53  examples::asynchronousExecutionModeUserAllocatedOutput(model_runner);
 54
 55  examples::print("Success: exiting");
 56  return EXIT_SUCCESS;
 57}
 58
 59namespace examples {
 60
 61void synchronousExecutionModeLibraryAllocatedOutput(
 62    model_runtime::ModelRunner &model_runner) {
 63  examples::print("Allocating input tensors");
 64  const model_runtime::InputMemory input_memory =
 65      examples::allocateHostInputData(model_runner.getExecuteInputs());
 66
 67  examples::printInputMemory(input_memory);
 68
 69  examples::print("Sending single synchronous request with empty data. Output "
 70                  "allocated by ModelRunner.");
 71
 72  const model_runtime::OutputMemory output_memory =
 73      model_runner.execute(examples::toInputMemoryView(input_memory));
 74
 75  examples::print("Received output allocated by ModelRunner:");
 76  using ValueType = std::pair<const std::string, model_runtime::TensorMemory>;
 77
 78  for (const ValueType &name_with_memory : output_memory) {
 79    auto &&[name, memory] = name_with_memory;
 80    examples::print(fmt::format("Output tensor {}, {} bytes", name,
 81                                memory.data_size_bytes));
 82  }
 83}
 84
 85void synchronousExecutionModeUserAllocatedOutput(
 86    model_runtime::ModelRunner &model_runner) {
 87  examples::print("Allocating input tensors");
 88  const model_runtime::InputMemory input_memory =
 89      examples::allocateHostInputData(model_runner.getExecuteInputs());
 90
 91  examples::printInputMemory(input_memory);
 92
 93  examples::print("Allocating output tensors");
 94  model_runtime::OutputMemory output_memory =
 95      examples::allocateHostOutputData(model_runner.getExecuteOutputs());
 96
 97  examples::print("Sending single synchronous request with empty data.");
 98
 99  model_runner.execute(examples::toInputMemoryView(input_memory),
100                       examples::toOutputMemoryView(output_memory));
101
102  examples::print("Received output allocated by the user:");
103
104  using ValueType = std::pair<const std::string, model_runtime::TensorMemory>;
105
106  for (const ValueType &name_with_memory : output_memory) {
107    auto &&[name, memory] = name_with_memory;
108    examples::print(fmt::format("Output tensor {}, {} bytes", name,
109                                memory.data_size_bytes));
110  }
111}
112
113void asynchronousExecutionModeLibraryAllocatedOutput(
114    model_runtime::ModelRunner &model_runner) {
115  examples::print("Allocating input tensors");
116  const model_runtime::InputMemory input_memory =
117      examples::allocateHostInputData(model_runner.getExecuteInputs());
118
119  examples::printInputMemory(input_memory);
120
121  examples::print("Sending single asynchronous request with empty data. Output "
122                  "allocated by ModelRunner.");
123
124  const model_runtime::OutputFutureMemory output_future_memory =
125      model_runner.executeAsync(examples::toInputMemoryView(input_memory));
126
127  examples::print("Waiting for output allocated by ModelRunner:");
128
129  using ValueType = std::pair<const std::string,
130                              std::shared_future<model_runtime::TensorMemory>>;
131
132  for (const ValueType &name_with_future_memory : output_future_memory) {
133    auto &&[name, future_memory] = name_with_future_memory;
134    examples::print(fmt::format("Waiting for the result: tensor {}", name));
135    future_memory.wait();
136    const model_runtime::TensorMemory &memory = future_memory.get();
137    examples::print(fmt::format("Output tensor {} available, received {} bytes",
138                                name, memory.data_size_bytes));
139  }
140}
141
142void asynchronousExecutionModeUserAllocatedOutput(
143    model_runtime::ModelRunner &model_runner) {
144  examples::print("Allocating input tensors");
145  const model_runtime::InputMemory input_memory =
146      examples::allocateHostInputData(model_runner.getExecuteInputs());
147
148  examples::printInputMemory(input_memory);
149
150  examples::print("Allocating output tensors");
151  model_runtime::OutputMemory output_memory =
152      examples::allocateHostOutputData(model_runner.getExecuteOutputs());
153
154  examples::print("Sending single asynchronous request with empty data.");
155
156  const model_runtime::OutputFutureMemoryView output_future_memory_view =
157      model_runner.executeAsync(examples::toInputMemoryView(input_memory),
158                                examples::toOutputMemoryView(output_memory));
159
160  examples::print("Waiting for the output");
161
162  using ValueType =
163      std::pair<const std::string,
164                std::shared_future<model_runtime::TensorMemoryView>>;
165
166  for (const ValueType &name_with_future_memory_view :
167       output_future_memory_view) {
168    auto &&[name, future_memory_view] = name_with_future_memory_view;
169    examples::print(fmt::format("Waiting for the result: tensor {}", name));
170    future_memory_view.wait();
171    const model_runtime::TensorMemoryView &memory_view =
172        future_memory_view.get();
173    examples::print(fmt::format("Output tensor {} available, received {} bytes",
174                                name, memory_view.data_size_bytes));
175  }
176}
177
178} // namespace examples

Download model_runner_execution_modes.cpp

This Python example sends inference requests to the IPU using all available execution modes.

Listing 3.2 model_runner_execution_modes.py
  1#!/usr/bin/env python3
  2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  3
  4import argparse
  5from datetime import timedelta
  6
  7import numpy as np
  8import model_runtime
  9import popef
 10"""
 11The example shows loading a model from PopEF files and sending
 12inference requests using all available ModelRunner execution modes.
 13"""
 14
 15
 16def main():
 17    parser = argparse.ArgumentParser("Model runner simple example.")
 18    parser.add_argument(
 19        "-p",
 20        "--popef",
 21        type=str,
 22        metavar='popef_file_path',
 23        help="A collection of PopEF files containing the model.",
 24        nargs='+',
 25        required=True)
 26    args = parser.parse_args()
 27
 28    # Create model runner
 29    config = model_runtime.ModelRunnerConfig()
 30    config.device_wait_config = model_runtime.DeviceWaitConfig(
 31        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
 32        timeout=timedelta(seconds=600),
 33        sleepTime=timedelta(seconds=1))
 34
 35    print("Creating ModelRunner with", config)
 36    model_runner = model_runtime.ModelRunner(model_runtime.PopefPaths(
 37        args.popef),
 38                                             config=config)
 39
 40    print("Preparing input tensors:")
 41    input_descriptions = model_runner.getExecuteInputs()
 42    input_tensors = [
 43        np.random.randn(*input_desc.shape).astype(input_desc.numpy_data_type())
 44        for input_desc in input_descriptions
 45    ]
 46    input_view = model_runtime.InputMemoryView()
 47
 48    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
 49        print("\tname:", input_desc.name, "shape:", input_tensor.shape,
 50              "dtype:", input_tensor.dtype)
 51        input_view[input_desc.name] = input_tensor
 52
 53    print("Running synchronous execution mode. The memory of the output "
 54          "tensors is allocated by the ModelRunner object.")
 55    synchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view)
 56
 57    print("Running synchronous execution mode. The memory of the output "
 58          "tensors is allocated by the user.")
 59    synchronousExecutionModeUserAllocatedOutput(model_runner, input_view)
 60
 61    print("Running asynchronous execution mode. The memory of the output "
 62          "tensors is allocated by the ModelRunner object.")
 63    asynchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view)
 64
 65    print("Running asynchronous execution mode. The memory of the output "
 66          "tensors is allocated by the user.")
 67    asynchronousExecutionModeUserAllocatedOutput(model_runner, input_view)
 68
 69    input_numpy = dict()
 70    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
 71        input_numpy[input_desc.name] = input_tensor
 72
 73    print("Running synchronous execution mode. The input is a numpy array. "
 74          "The memory of the output tensors is allocated by the ModelRunner "
 75          "object.")
 76    synchronousExecutionModeLibraryAllocatedNumpyInputOutput(
 77        model_runner, input_numpy)
 78
 79    print("Running synchronous execution mode. The input and the output are "
 80          "numpy arrays. The memory of the output tensors is allocated by the "
 81          "user. ")
 82    synchronousExecutionModeUserAllocatedNumpyInputOutput(
 83        model_runner, input_numpy)
 84
 85    print(
 86        "Running asynchronous execution mode. The input and the output are "
 87        "numpy arrays. The memory of the output tensors is allocated by the "
 88        "ModelRunner object.")
 89    asynchronousExecutionModeLibraryAllocatedNumpyOutput(
 90        model_runner, input_numpy)
 91
 92    print(
 93        "Running asynchronous execution mode. The input and the output are "
 94        "numpy arrays. The memory of the output tensors is allocated by the "
 95        "user.")
 96    asynchronousExecutionModeUserAllocatedNumpyOutput(model_runner,
 97                                                      input_numpy)
 98
 99    print("Success: exiting")
100    return 0
101
102
103def synchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view):
104    print("Sending single synchronous request with random data. Output "
105          "allocated by ModelRunner.")
106    result = model_runner.execute(input_view)
107
108    output_descriptions = model_runner.getExecuteOutputs()
109    print("Processing output tensors:")
110    for output_desc in output_descriptions:
111        output_tensor = np.frombuffer(
112            result[output_desc.name],
113            dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
114        print("\tname:", output_desc.name, "shape:", output_tensor.shape,
115              "dtype:", output_tensor.dtype, "\n", output_tensor)
116
117
118def synchronousExecutionModeUserAllocatedOutput(model_runner, input_view):
119
120    output_descriptions = model_runner.getExecuteOutputs()
121    print("Preparing memory for output tensors")
122    output_tensors = [
123        np.zeros(output_desc.shape, dtype=output_desc.numpy_data_type())
124        for output_desc in output_descriptions
125    ]
126
127    print("Creating model_runtime.OutputMemoryView()")
128    output_view = model_runtime.OutputMemoryView()
129    for desc, tensor in zip(output_descriptions, output_tensors):
130        print("\tname:", desc.name, "shape:", tensor.shape, "dtype:",
131              tensor.dtype)
132        output_view[desc.name] = tensor
133
134    print("Sending single synchronous request with random data")
135    model_runner.execute(input_view, output_view)
136    print("Processing output tensors:")
137    for desc, tensor in zip(output_descriptions, output_tensors):
138        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
139              tensor.dtype, "\n", tensor)
140
141
142def synchronousExecutionModeLibraryAllocatedNumpyInputOutput(
143        model_runner, numpy_input):
144
145    output_descriptions = model_runner.getExecuteOutputs()
146
147    print("Sending single synchronous request with random data (numpy array)")
148    output_tensors = model_runner.execute(numpy_input)
149    print("Processing output tensors (numpy dict):")
150    for desc in output_descriptions:
151        tensor = output_tensors[desc.name]
152        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
153              tensor.dtype, "\n", tensor)
154
155
156def synchronousExecutionModeUserAllocatedNumpyInputOutput(
157        model_runner, numpy_input):
158
159    output_descriptions = model_runner.getExecuteOutputs()
160    print("Preparing memory for output tensors")
161    numpy_output = {}
162    for output_desc in output_descriptions:
163        numpy_output[output_desc.name] = np.zeros(
164            output_desc.shape, dtype=output_desc.numpy_data_type())
165
166    print("Sending single synchronous request with random data")
167    model_runner.execute(numpy_input, numpy_output)
168    print("Processing output tensors (numpy dict):")
169    for desc in output_descriptions:
170        tensor = numpy_output[desc.name]
171        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
172              tensor.dtype, "\n", tensor)
173
174
175def asynchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view):
176
177    print("Sending single asynchronous request with random data. Output "
178          "allocated by ModelRunner.")
179    result = model_runner.executeAsync(input_view)
180
181    print("Waiting for output allocated by ModelRunner:")
182    result.wait()
183    print("Results available")
184
185    output_descriptions = model_runner.getExecuteOutputs()
186    print("Processing output tensors:")
187    for output_desc in output_descriptions:
188        output_tensor = np.frombuffer(
189            result[output_desc.name],
190            dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
191        print("\tname:", output_desc.name, "shape:", output_tensor.shape,
192              "dtype:", output_tensor.dtype, "\n", output_tensor)
193
194
195def asynchronousExecutionModeUserAllocatedOutput(model_runner, input_view):
196    output_descriptions = model_runner.getExecuteOutputs()
197    print("Preparing memory for output tensors")
198    output_tensors = [
199        np.zeros(output_desc.shape, dtype=output_desc.numpy_data_type())
200        for output_desc in output_descriptions
201    ]
202
203    print("Creating model_runtime.OutputMemoryView()")
204    output_view = model_runtime.OutputMemoryView()
205    for desc, tensor in zip(output_descriptions, output_tensors):
206        print("\tname:", desc.name, "shape:", tensor.shape, "dtype:",
207              tensor.dtype)
208        output_view[desc.name] = tensor
209
210    print("Sending single asynchronous request with random data")
211    future = model_runner.executeAsync(input_view, output_view)
212
213    print("Waiting for the output.")
214    future.wait()
215    print("Results available.")
216    print("Processing output tensors:")
217    for desc, tensor in zip(output_descriptions, output_tensors):
218        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
219              tensor.dtype, "\n", tensor)
220
221
222def asynchronousExecutionModeLibraryAllocatedNumpyOutput(
223        model_runner, numpy_input):
224    print("Sending single asynchronous request with random data")
225    future = model_runner.executeAsync(numpy_input)
226
227    print("Waiting for the output.")
228    future.wait()
229    for desc in model_runner.getExecuteOutputs():
230        future_py_array = future[desc.name]
231
232        # Create a np.array copy from the future_py_array buffer
233        # using numpy() method.
234        tensor = future_py_array.numpy()
235        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
236              tensor.dtype, "tensor id", id(tensor), "\n", tensor)
237
238        # Create a np.array copy from the future_py_array buffer
239        # (allocated by ModelRunner instance).
240        tensor_copy = np.array(future_py_array, copy=True)
241        print("Tensor copy", tensor_copy, "tensor id", id(tensor_copy))
242
243        # Avoid copying. Create a np.array view from the future_py_array buffer
244        # (allocated by ModelRunner instance).
245        tensor_view = np.array(future_py_array, copy=False)
246        print("Tensor view", tensor_view, "tensor id", id(tensor_view))
247
248        assert not np.shares_memory(tensor_view, tensor_copy)
249        assert not np.shares_memory(tensor, tensor_copy)
250        assert not np.shares_memory(tensor, tensor_view)
251
252
253def asynchronousExecutionModeUserAllocatedNumpyOutput(model_runner,
254                                                      numpy_input):
255
256    output_descriptions = model_runner.getExecuteOutputs()
257    print("Preparing memory for output tensors")
258    numpy_output = {}
259    for output_desc in output_descriptions:
260        numpy_output[output_desc.name] = np.zeros(
261            output_desc.shape, dtype=output_desc.numpy_data_type())
262
263    print("Sending single asynchronous request with random data")
264    future = model_runner.executeAsync(numpy_input, numpy_output)
265
266    print("Waiting for the output.")
267    future.wait()
268    print("Results available.")
269    print("Processing output tensors:")
270    for desc in output_descriptions:
271        output_tensor = numpy_output[desc.name]
272        future_py_array_view = future[desc.name]
273
274        # Create a np.array view from the future_py_array_view using numpy()
275        # method, view points to np.array present in numpy_output dict
276        tensor_from_future_object = future_py_array_view.numpy()
277        print("\tname:", desc.name, "shape", tensor_from_future_object.shape,
278              "dtype", tensor_from_future_object.dtype, "\n",
279              tensor_from_future_object)
280        assert np.shares_memory(output_tensor, tensor_from_future_object)
281
282        # Create a np.array view from the future_py_array_view buffer, view
283        # points to np.array present in numpy_output dict
284        tensor_view = np.array(future_py_array_view, copy=False)
285        assert np.shares_memory(output_tensor, tensor_view)
286        assert np.shares_memory(tensor_from_future_object, tensor_view)
287
288        # Create a np.array copy from the future_py_array_view buffer
289        tensor_copy = np.array(future_py_array_view, copy=True)
290        assert not np.shares_memory(tensor_from_future_object, tensor_copy)
291        assert not np.shares_memory(output_tensor, tensor_copy)
292
293
294if __name__ == "__main__":
295    main()

Download model_runner_execution_modes.py

3.2. Replication

The ModelRunner class allows you to specify the replication factor in the ModelRunnerConfig passed to its constructor. When this option is set, the ModelRunner object creates as many replicas of the model on the IPU as requested, provided that the required number of devices is available. Each execution mode accepts an unsigned replica_id as its last parameter, which determines the replica to which the request is sent.

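A minimal Python sketch of this flow (popef_paths is a placeholder list of PopEF file paths, input_view is a populated InputMemoryView, and enough IPUs for two replicas are assumed to be available):

    import model_runtime

    num_replicas = 2
    config = model_runtime.ModelRunnerConfig()
    config.replication_factor = num_replicas
    runner = model_runtime.ModelRunner(model_runtime.PopefPaths(popef_paths),
                                       config=config)

    # Send one synchronous request to each replica.
    for replica_id in range(num_replicas):
        result = runner.execute(input_view, replica_id=replica_id)
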
This example creates two replicas and sends inference requests to each of them using the C++ API.

Listing 3.3 model_runner_replication.cpp
 1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 2#include <string>
 3#include <unordered_map>
 4#include <vector>
 5
 6#include <boost/program_options.hpp>
 7
 8#include "model_runtime/ModelRunner.hpp"
 9#include "model_runtime/Tensor.hpp"
10#include "utils.hpp"
11
12/* The example shows loading a model from PopEF files, creating 2 model replicas
13 * and sending inference requests to each of them.
14 */
15int main(int argc, char *argv[]) {
16  static constexpr unsigned num_replicas = 2;
17
18  using namespace std::chrono_literals;
19  static const char *example_desc = "Model runner simple example.";
20  const boost::program_options::variables_map vm =
21      examples::parsePopefProgramOptions(example_desc, argc, argv);
22  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
23
24  model_runtime::ModelRunnerConfig config;
25  config.device_wait_config =
26      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
27  examples::print(fmt::format(
28      "Setting model_runtime::ModelRunnerConfig replication_factor={}",
29      num_replicas));
30
31  config.replication_factor = num_replicas;
32  model_runtime::ModelRunner model_runner(popef_paths, config);
33
34  for (unsigned replica_id = 0; replica_id < num_replicas; ++replica_id) {
35    examples::print("Allocating input tensors");
36    const model_runtime::InputMemory input_memory =
37        examples::allocateHostInputData(model_runner.getExecuteInputs());
38    examples::printInputMemory(input_memory);
39
40    examples::print(fmt::format(
41        "Sending single synchronous request with empty data - replica {}",
42        replica_id));
43
44    const model_runtime::OutputMemory output_memory = model_runner.execute(
45        examples::toInputMemoryView(input_memory), replica_id);
46
47    examples::print(fmt::format("Received output - replica {}", replica_id));
48
49    using OutputValueType =
50        std::pair<const std::string, model_runtime::TensorMemory>;
51
52    for (const OutputValueType &name_with_memory : output_memory) {
53      auto &&[name, memory] = name_with_memory;
54      examples::print(fmt::format("Output tensor {}, {} bytes", name,
55                                  memory.data_size_bytes));
56    }
57  }
58  examples::print("Success: exiting");
59  return EXIT_SUCCESS;
60}

Download model_runner_replication.cpp

This example creates two replicas and sends inference requests to each of them using the Python API.

Listing 3.4 model_runner_replication.py
 1#!/usr/bin/env python3
 2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 3
 4import argparse
 5from datetime import timedelta
 6import numpy as np
 7import model_runtime
 8import popef
 9"""
10The example shows loading a model from PopEF files, creating 2 model replicas
11and sending inference requests to each of them.
12"""
13
14
15def main():
16    parser = argparse.ArgumentParser("Model runner simple example.")
17    parser.add_argument(
18        "-p",
19        "--popef",
20        type=str,
21        metavar='popef_file_path',
22        help="A collection of PopEF files containing the model.",
23        nargs='+',
24        required=True)
25    args = parser.parse_args()
26
27    num_replicas = 2
28    # Create model runner
29    config = model_runtime.ModelRunnerConfig()
30    config.replication_factor = num_replicas
31    config.device_wait_config = model_runtime.DeviceWaitConfig(
32        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
33        timeout=timedelta(seconds=600),
34        sleepTime=timedelta(seconds=1))
35
36    print("Creating ModelRunner with", config)
37    runner = model_runtime.ModelRunner(model_runtime.PopefPaths(args.popef),
38                                       config=config)
39
40    input_descriptions = runner.getExecuteInputs()
41
42    input = model_runtime.InputMemoryView()
43
44    print("Preparing input tensors:")
45    input_descriptions = runner.getExecuteInputs()
46    input_tensors = [
47        np.random.randn(*input_desc.shape).astype(input_desc.numpy_data_type())
48        for input_desc in input_descriptions
49    ]
50    input_view = model_runtime.InputMemoryView()
51
52    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
53        print("\tname:", input_desc.name, "shape:", input_tensor.shape,
54              "dtype:", input_tensor.dtype)
55        input_view[input_desc.name] = input_tensor
56
57    for replica_id in range(num_replicas):
58        print("Sending single synchronous request with random data - replica",
59              replica_id, ".")
60        result = runner.execute(input_view, replica_id=replica_id)
61        output_descriptions = runner.getExecuteOutputs()
62
63        print("Processing output tensors - replica", replica_id, ":")
64        for output_desc in output_descriptions:
65            output_tensor = np.frombuffer(
66                result[output_desc.name],
67                dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
68            print("\tname:", output_desc.name, "shape:", output_tensor.shape,
69                  "dtype:", output_tensor.dtype, "\n", output_tensor)
70
71    print("Success: exiting")
72    return 0
73
74
75if __name__ == "__main__":
76    main()

Download model_runner_replication.py

3.3. Multithreading

By default, ModelRunner is not thread-safe: if several threads call execute() or executeAsync() concurrently, race conditions and undefined behavior may result. When using ModelRunner from multiple threads, the user must therefore either apply appropriate synchronization between the threads, or set thread_safe in ModelRunnerConfig to true, in which case every call to execute() or executeAsync() locks an internal std::mutex.

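If thread_safe is left at its default, one way to provide that synchronization yourself in Python is to guard every call on the shared ModelRunner with a single lock (a sketch, not part of the library API; runner and input_view are assumed to exist):

    import threading

    runner_lock = threading.Lock()

    def send_request(runner, input_view):
        # Serialize the executeAsync() call on the shared ModelRunner; this is
        # the locking that config.thread_safe = True would otherwise provide.
        with runner_lock:
            future = runner.executeAsync(input_view)
        # Each thread waits on its own future, as in the listings below.
        future.wait()
        return future
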
This example creates several threads and each one sends inference requests to the IPU using the C++ API.

Listing 3.5 model_runner_multithreading.cpp
 1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 2#include <array>
 3#include <string>
 4#include <vector>
 5
 6#include <boost/program_options.hpp>
 7
 8#include "model_runtime/ModelRunner.hpp"
 9#include "model_runtime/Tensor.hpp"
10#include "utils.hpp"
11
12namespace examples {
13
14void workerMain(model_runtime::ModelRunner &model_runner);
15
16} // namespace examples
17
18/* The example shows loading a model from PopEF files and sending inference
19 * requests to the same model by multiple threads.
20 */
21int main(int argc, char *argv[]) {
22  using namespace std::chrono_literals;
23  static const char *example_desc =
24      "Model runner multithreading client example.";
25  const boost::program_options::variables_map vm =
26      examples::parsePopefProgramOptions(example_desc, argc, argv);
27  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
28
29  model_runtime::ModelRunnerConfig config;
30  config.device_wait_config =
31      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
32  examples::print(
33      "Setting model_runtime::ModelRunnerConfig: thread safe = true");
34  config.thread_safe = true;
35  model_runtime::ModelRunner model_runner(popef_paths, config);
36
37  static constexpr unsigned num_workers = 4;
38  std::vector<std::thread> threads;
39  threads.reserve(num_workers);
40
41  examples::print(fmt::format("Starting {} worker threads", num_workers));
42  for (unsigned i = 0; i < num_workers; i++) {
43    threads.emplace_back(examples::workerMain, std::ref(model_runner));
44  }
45
46  for (auto &worker : threads) {
47    worker.join();
48  };
49
50  examples::print("Success: exiting");
51  return EXIT_SUCCESS;
52}
53
54namespace examples {
55
56void workerMain(model_runtime::ModelRunner &model_runner) {
57  examples::print("Starting workerMain()");
58
59  static constexpr unsigned num_requests = 5;
60  std::array<model_runtime::InputMemory, num_requests> requests_input_data;
61
62  for (unsigned req_id = 0; req_id < num_requests; req_id++) {
63    examples::print(
64        fmt::format("Allocating input tensors - request id {}", req_id));
65    requests_input_data[req_id] =
66        examples::allocateHostInputData(model_runner.getExecuteInputs());
67  }
68
69  std::vector<model_runtime::OutputFutureMemory> results;
70
71  for (unsigned req_id = 0; req_id < num_requests; req_id++) {
72    examples::print(
73        fmt::format("Sending asynchronous request. Request id {}", req_id));
74    results.emplace_back(model_runner.executeAsync(
75        examples::toInputMemoryView(requests_input_data[req_id])));
76  }
77
78  examples::print("Waiting for output:");
79  for (unsigned req_id = 0; req_id < num_requests; req_id++) {
80    auto &output_future_memory = results[req_id];
81
82    using OutputValueType =
83        std::pair<const std::string,
84                  std::shared_future<model_runtime::TensorMemory>>;
85    for (const OutputValueType &name_with_future_memory :
86         output_future_memory) {
87      auto &&[name, future_memory] = name_with_future_memory;
88      examples::print(fmt::format(
89          "Waiting for the result: tensor {}, request_id {}", name, req_id));
90      future_memory.wait();
91      const model_runtime::TensorMemory &memory = future_memory.get();
92      examples::print(fmt::format(
93          "Output tensor {} available, request_id {} received {} bytes", name,
94          req_id, memory.data_size_bytes));
95    }
96  }
97}
98
99} // namespace examples

Download model_runner_multithreading.cpp

This example creates several threads and each one sends inference requests to the IPU using the Python API.

Listing 3.6 model_runner_multithreading.py
 1#!/usr/bin/env python3
 2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 3
 4import argparse
 5import threading
 6from datetime import timedelta
 7import numpy as np
 8import model_runtime
 9import popef
10"""
11The example shows loading a model from PopEF files and sending inference
12requests to the same model by multiple threads.
13"""
14
15
16def main():
17    parser = argparse.ArgumentParser("Model runner simple example.")
18    parser.add_argument(
19        "-p",
20        "--popef",
21        type=str,
22        metavar='popef_file_path',
23        help="A collection of PopEF files containing the model.",
24        nargs='+',
25        required=True)
26    args = parser.parse_args()
27
28    config = model_runtime.ModelRunnerConfig()
29    config.thread_safe = True
30    config.device_wait_config = model_runtime.DeviceWaitConfig(
31        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
32        timeout=timedelta(seconds=600),
33        sleepTime=timedelta(seconds=1))
34
35    print("Creating ModelRunner with", config)
36    model_runner = model_runtime.ModelRunner(model_runtime.PopefPaths(
37        args.popef),
38                                             config=config)
39    num_workers = 4
40    print("Starting", num_workers, "worker threads.")
41    threads = [
42        threading.Thread(target=workerMain, args=(model_runner, worker_id))
43        for worker_id in range(num_workers)
44    ]
45
46    for thread in threads:
47        thread.start()
48
49    for thread in threads:
50        thread.join()
51
52    print("Success: exiting")
53    return 0
54
55
56def workerMain(model_runner, worker_id):
57    print("Worker", worker_id, "Starting workerMain()")
58    num_requests = 5
59
60    input_descriptions = model_runner.getExecuteInputs()
61    input_requests = []
62
63    print("Worker", worker_id, "Allocating input tensors for", num_requests,
64          "requests", input_descriptions)
65    for _ in range(num_requests):
66        input_requests.append([
67            np.random.randn(*input_desc.shape).astype(
68                input_desc.numpy_data_type())
69            for input_desc in input_descriptions
70        ])
71
72    futures = []
73
74    for req_id in range(num_requests):
75        print("Worker", worker_id, "Sending asynchronous request. Request id",
76              req_id)
77        input_view = model_runtime.InputMemoryView()
78        for input_desc, input_tensor in zip(input_descriptions,
79                                            input_requests[req_id]):
80            input_view[input_desc.name] = input_tensor
81        futures.append(model_runner.executeAsync(input_view))
82
83    print("Worker", worker_id, "Processing outputs.")
84    for req_id, future in enumerate(futures):
85        print("Worker", worker_id, "Waiting for the result - request", req_id)
86        future.wait()
87        print("Worker", worker_id, "Result available - request", req_id)
88
89
90if __name__ == "__main__":
91    main()

Download model_runner_multithreading.py

3.4. Frozen inputs

The ModelRunner class allows constant tensors to be bound to inputs by setting frozen_inputs in ModelRunnerConfig. frozen_inputs is an instance of InputMemoryView: the user allocates the data for the selected input tensors and passes pointers to it. If a frozen tensor is an input that would normally be required during an execution call, it no longer needs to be provided; the tensor from frozen_inputs is added to every request automatically. If a frozen tensor is an input stored as PopEF tensor data or feed data, it is overridden by the tensor from frozen_inputs.

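In Python, freezing an input amounts to registering it in an InputMemoryView before the ModelRunner is constructed (a sketch; "tensor_B" is the input name used by the example model below, model is a popef model object loaded as in Listing 3.8, and constant_data is a placeholder numpy array with the tensor's shape and dtype):

    import model_runtime

    frozen_inputs = model_runtime.InputMemoryView()
    frozen_inputs["tensor_B"] = constant_data

    config = model_runtime.ModelRunnerConfig()
    config.frozen_inputs = frozen_inputs

    runner = model_runtime.ModelRunner(model, config=config)
    # "tensor_B" no longer has to be supplied when calling execute(); the frozen
    # data is added to every request instead.
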
Note

Examples Listing 3.7 and Listing 3.8 rely on a PopEF file generated by the code Listing 9.2.

This example binds a constant value to one of the inputs and sends inference requests to the IPU using the C++ API.

Listing 3.7 model_runner_frozen_inputs.cpp
  1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  2#include <algorithm>
  3#include <string>
  4#include <vector>
  5
  6#include <boost/program_options.hpp>
  7
  8#include <popef/Model.hpp>
  9#include <popef/Reader.hpp>
 10#include <popef/Types.hpp>
 11
 12#include "model_runtime/ModelRunner.hpp"
 13#include "model_runtime/Tensor.hpp"
 14#include "utils.hpp"
 15
 16namespace examples {
 17
 18std::shared_ptr<popef::Model>
 19createPopefModel(const std::vector<std::string> &popef_paths);
 20const popef::Anchor *findAnchor(const std::string &name, popef::Model *model);
 21std::vector<float> createFrozenTensorData(const popef::Anchor *anchor);
 22
 23} // namespace examples
 24
 25/* The example shows loading a model from PopEF file and binding constant tensor
 26 * value to one of the inputs. The example is based on the PopEF file generated
 27 * by `model_runtime_example_generate_simple_popef` example. Generated PopEF
 28 * file contains a simple model:
 29 *
 30 * output = (A * weights) + B
 31 *
 32 * where A and B are stream inputs, weights is a tensor saved as popef::TensorData
 33 * and output is the resulting stream output tensor.
 34 */
 35int main(int argc, char *argv[]) {
 36  using namespace std::chrono_literals;
 37  static const char *example_desc = "Model runner frozen inputs example.";
 38  const boost::program_options::variables_map vm =
 39      examples::parsePopefProgramOptions(example_desc, argc, argv);
 40  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
 41
 42  std::shared_ptr<popef::Model> model = examples::createPopefModel(popef_paths);
 43
 44  static const std::string frozen_input_name = "tensor_B";
 45
 46  examples::print(fmt::format("Looking for tensor {} inside PopEF model.",
 47                              frozen_input_name));
 48  const popef::Anchor *tensor_b_anchor =
 49      examples::findAnchor(frozen_input_name, model.get());
 50  examples::print(fmt::format("Found {}.", *tensor_b_anchor));
 51
 52  examples::print("Creating frozen input tensor data.");
 53  const std::vector<float> tensor_b_data =
 54      examples::createFrozenTensorData(tensor_b_anchor);
 55
 56  examples::print("Creating ModelRunnerConfig.");
 57  model_runtime::ModelRunnerConfig config;
 58
 59  examples::print(fmt::format("Tensor {} is frozen - will be treated as "
 60                              "constant in each execution request.",
 61                              frozen_input_name));
 62  const uint64_t tensor_b_size_in_bytes =
 63      tensor_b_anchor->tensorInfo().sizeInBytes();
 64
 65  config.frozen_inputs = {
 66      {frozen_input_name, model_runtime::ConstTensorMemoryView{
 67                              tensor_b_data.data(), tensor_b_size_in_bytes}}};
 68
 69  config.device_wait_config =
 70      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
 71
 72  model_runtime::ModelRunner model_runner(model, config);
 73
 74  examples::print("Allocating input tensors");
 75
 76  const model_runtime::InputMemory input_memory =
 77      examples::allocateHostInputData(model_runner.getExecuteInputs());
 78
 79  examples::printInputMemory(input_memory);
 80
 81  examples::print("Sending single synchronous request with empty data.");
 82  const model_runtime::OutputMemory output_memory =
 83      model_runner.execute(examples::toInputMemoryView(input_memory));
 84
 85  examples::print("Received output:");
 86
 87  using ValueType = std::pair<const std::string, model_runtime::TensorMemory>;
 88
 89  for (const ValueType &name_with_memory : output_memory) {
 90    auto &&[name, memory] = name_with_memory;
 91    examples::print(fmt::format("Output tensor {}, {} bytes", name,
 92                                memory.data_size_bytes));
 93  }
 94
 95  examples::print("Success: exiting");
 96  return EXIT_SUCCESS;
 97}
 98
 99namespace examples {
100
101std::shared_ptr<popef::Model>
102createPopefModel(const std::vector<std::string> &popef_paths) {
103  auto reader = std::make_shared<popef::Reader>();
104  for (const auto &path : popef_paths)
105    reader->parseFile(path);
106
107  return popef::ModelBuilder(reader).createModel();
108}
109
110const popef::Anchor *findAnchor(const std::string &name, popef::Model *model) {
111  const auto &anchors = model->metadata.anchors();
112
113  const auto anchor_it = std::find_if(
114      anchors.cbegin(), anchors.cend(),
115      [&](const popef::Anchor &anchor) { return anchor.name() == name; });
116
117  if (anchor_it == anchors.cend()) {
118    throw std::runtime_error(fmt::format(
119        "Anchor {} not found in given model. Please make sure that PopEF was "
120        "generated by `model_runtime_example_generate_simple_popef`.",
121        name));
122  }
123
124  if (auto anchorDataType = anchor_it->tensorInfo().dataType();
125      anchorDataType != popef::DataType::F32) {
126    throw std::runtime_error(fmt::format(
127        "Example expects anchor {} with popef::DataType::F32. Received {}",
128        name, anchorDataType));
129  }
130
131  return &(*anchor_it);
132}
133
134std::vector<float> createFrozenTensorData(const popef::Anchor *anchor) {
135  const auto size_in_bytes = anchor->tensorInfo().sizeInBytes();
136  const auto num_elements = size_in_bytes / sizeof(float);
137
138  return std::vector<float>(num_elements, 11.0f);
139}
140
141} // namespace examples

Download model_runner_frozen_inputs.cpp

This example binds a constant value to one of the inputs and sends inference requests to the IPU using the Python API.

Listing 3.8 model_runner_frozen_inputs.py
  1#!/usr/bin/env python3
  2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  3
  4import os
  5import argparse
  6from datetime import timedelta
  7import numpy as np
  8import model_runtime
  9import popef
 10"""
 11The example shows loading a model from PopEF file and binding constant tensor
 12value to one of the inputs. The example is based on the PopEF file generated
 13by `model_runtime_example_generate_simple_popef` example. Generated PopEF
 14file contains a simple model:
 15
 16output = (A * weights) + B
 17
 18where A and B are stream inputs, weights is a tensor saved as popef::TensorData
 19and output is the resulting stream output tensor.
 20"""
 21
 22
 23def main():
 24    parser = argparse.ArgumentParser("Model runner simple example.")
 25    parser.add_argument(
 26        "-p",
 27        "--popef",
 28        type=str,
 29        metavar='popef_file_path',
 30        help="A collection of PopEF files containing the model.",
 31        nargs='+',
 32        required=True)
 33    args = parser.parse_args()
 34    model = load_model(args.popef)
 35
 36    frozen_input_name = "tensor_B"
 37    print("Looking for tensor", frozen_input_name, "inside PopEF model.")
 38    tensor_b_anchor = popef.Anchor()
 39
 40    for anchor in model.metadata.anchors():
 41        if anchor.name() == frozen_input_name:
 42            tensor_b_anchor = anchor
 43            break
 44    else:
 45        raise Exception(f'Anchor {frozen_input_name} not found inside given '
 46                        'model. Please make sure that PopEF was generated by '
 47                        '`model_runtime_example_generate_simple_popef`')
 48
 49    print("Generating", frozen_input_name, "random values")
 50    tensor_b_info = tensor_b_anchor.tensorInfo()
 51    tensor_b = np.random.randn(*tensor_b_info.shape()).astype(
 52        tensor_b_info.numpyDType())
 53
 54    config = model_runtime.ModelRunnerConfig()
 55
 56    frozen_inputs = model_runtime.InputMemoryView()
 57    frozen_inputs[frozen_input_name] = tensor_b
 58    config.frozen_inputs = frozen_inputs
 59
 60    print(
 61        "Tensor", frozen_input_name, "is frozen - will be treated as "
 62        "constant in each execution request.")
 63    config.device_wait_config = model_runtime.DeviceWaitConfig(
 64        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
 65        timeout=timedelta(seconds=600),
 66        sleepTime=timedelta(seconds=1))
 67
 68    model_runner = model_runtime.ModelRunner(model, config=config)
 69
 70    print("Preparing input tensors:")
 71    input_descriptions = model_runner.getExecuteInputs()
 72    input_tensors = [
 73        np.random.randn(*input_desc.shape).astype(input_desc.numpy_data_type())
 74        for input_desc in input_descriptions
 75    ]
 76    input_view = model_runtime.InputMemoryView()
 77
 78    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
 79        print("\tname:", input_desc.name, "shape:", input_tensor.shape,
 80              "dtype:", input_tensor.dtype)
 81        input_view[input_desc.name] = input_tensor
 82
 83    print("Sending single synchronous request with random data.")
 84    result = model_runner.execute(input_view)
 85    output_descriptions = model_runner.getExecuteOutputs()
 86
 87    print("Processing output tensors:")
 88    for output_desc in output_descriptions:
 89        output_tensor = np.frombuffer(
 90            result[output_desc.name],
 91            dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
 92        print("\tname:", output_desc.name, "shape:", output_tensor.shape,
 93              "dtype:", output_tensor.dtype, "\n", output_tensor)
 94
 95    print("Success: exiting")
 96
 97    return 0
 98
 99
100def load_model(popef_paths):
101    for model_file in popef_paths:
102        assert os.path.isfile(model_file) is True
103        reader = popef.Reader()
104        reader.parseFile(model_file)
105
106        meta = reader.metadata()
107        exec = reader.executables()
108        return popef.ModelBuilder(reader).createModel()
109
110
111if __name__ == "__main__":
112    main()

Download model_runner_frozen_inputs.py