3. Model Runner deep dive through examples

The ModelRunner class is a lightweight wrapper around the classes and mechanisms provided by the Model Runtime library. It allows a model stored in a PopEF file to be loaded and executed with minimal client effort.

Running a PopEF model with the ModelRunner class consists of two steps:

  1. Create a ModelRunner object by providing either a list of PopEF files or an instance of Model. During construction, an IPU device partition is acquired, the given model is loaded onto the device, and all necessary threads and classes are created and stored in the ModelRunner internal state. The user can pass a ModelRunnerConfig to the constructor to set several configuration options, for example the replication factor (replication_factor).

  2. Use one of the available execution modes to send the request to the IPU (both steps are sketched below).

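For example, both steps can be written in a few lines of Python (a minimal sketch; "model.popef" is a placeholder path and the random input data merely illustrates a request):

    from datetime import timedelta

    import numpy as np
    import model_runtime

    # Step 1: acquire an IPU partition and load the model from PopEF files.
    config = model_runtime.ModelRunnerConfig()
    config.device_wait_config = model_runtime.DeviceWaitConfig(
        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
        timeout=timedelta(seconds=600),
        sleepTime=timedelta(seconds=1))
    runner = model_runtime.ModelRunner(model_runtime.PopefPaths(["model.popef"]),
                                       config=config)

    # Step 2: build an input view (keeping the arrays alive) and send a request.
    input_tensors = {desc.name: np.random.randn(*desc.shape).astype(desc.numpy_data_type())
                     for desc in runner.getExecuteInputs()}
    input_view = model_runtime.InputMemoryView()
    for name, tensor in input_tensors.items():
        input_view[name] = tensor
    result = runner.execute(input_view)
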
Note

The ModelRunner object must stay alive until the last request sent to it has returned a result. Destroying the object stops execution and unloads the model from the IPU device. The state of requests that are still being processed when the object is destroyed is undefined.

Note

The files included by the examples presented in this chapter can be found in Section 9, Appendix. They contain helper functions, for example for processing command line arguments.

3.1. Execution modes

The ModelRunner class provides two execution modes: synchronous (execute()) and asynchronous (executeAsync()). In the synchronous mode, the calling thread is blocked until the result is available. In the asynchronous mode, the request is queued and a std::future object is returned; the calling thread is not blocked and the result can be accessed as soon as the IPU finishes its computations. In both modes, the user is responsible for allocating the memory of the input tensors.

All execute() and executeAsync() overloads take an InputMemoryView parameter that contains pointers to all input data. The user has to ensure that the input data exists and that the passed pointers remain valid until a result is returned. Each of execute() and executeAsync() comes in two flavors: execute(const InputMemoryView &, unsigned), execute(const InputMemoryView &, const OutputMemoryView &output_data, unsigned), executeAsync(const InputMemoryView &, unsigned) and executeAsync(const InputMemoryView &, const OutputMemoryView &output_data, unsigned). The difference between the overloads comes down to who is responsible for allocating the memory of the output tensors. There are two options available (both call shapes are sketched after the list below):

  1. ModelRunner allocates memory for the output and returns a TensorMemory instance for each output tensor.

  2. The user allocates the output tensor memory and passes OutputMemoryView to the particular execution mode. ModelRunner will place the result in the memory provided by the user.

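In Python, the two call shapes look like this (a minimal sketch; runner and input_view are assumed to have been created as in the construction sketch above):

    import numpy as np
    import model_runtime

    # Option 1: ModelRunner allocates the output and returns it.
    result = runner.execute(input_view)          # or runner.executeAsync(input_view)

    # Option 2: the user pre-allocates the output tensors and keeps them alive.
    output_tensors = {desc.name: np.zeros(desc.shape, dtype=desc.numpy_data_type())
                      for desc in runner.getExecuteOutputs()}
    output_view = model_runtime.OutputMemoryView()
    for name, tensor in output_tensors.items():
        output_view[name] = tensor
    runner.execute(input_view, output_view)      # results are written into output_tensors
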
The client can find out which tensors the model accepts and returns by calling the ModelRunner methods getExecuteInputs() and getExecuteOutputs(). These methods return a collection of DataDesc objects, each containing basic information about a tensor: its name, shape, data type and size in bytes.

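For example, the descriptors can be inspected like this in Python (a short sketch; runner is a ModelRunner instance, and the accessors used are the ones that appear in the listings below):

    for desc in runner.getExecuteInputs():
        print("input: ", desc.name, "shape:", desc.shape, "dtype:", desc.numpy_data_type())
    for desc in runner.getExecuteOutputs():
        print("output:", desc.name, "shape:", desc.shape, "dtype:", desc.numpy_data_type())
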
This C++ example sends inference requests to the IPU using all available execution modes.

Listing 3.1 model_runner_execution_modes.cpp
  1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  2#include <string>
  3#include <vector>
  4
  5#include <boost/program_options.hpp>
  6
  7#include "model_runtime/ModelRunner.hpp"
  8#include "model_runtime/Tensor.hpp"
  9#include "utils.hpp"
 10
 11namespace examples {
 12
 13void synchronousExecutionModeLibraryAllocatedOutput(
 14    model_runtime::ModelRunner &model_runner);
 15void synchronousExecutionModeUserAllocatedOutput(
 16    model_runtime::ModelRunner &model_runner);
 17void asynchronousExecutionModeLibraryAllocatedOutput(
 18    model_runtime::ModelRunner &model_runner);
 19void asynchronousExecutionModeUserAllocatedOutput(
 20    model_runtime::ModelRunner &model_runner);
 21
 22} // namespace examples
 23
 24/* The example shows loading a model from PopEF files and sending
 25 * inference requests using all available ModelRunner execution modes.
 26 */
 27int main(int argc, char *argv[]) {
 28  using namespace std::chrono_literals;
 29  static const char *example_desc = "Model runner execution modes example.";
 30  const boost::program_options::variables_map vm =
 31      examples::parsePopefProgramOptions(example_desc, argc, argv);
 32  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
 33
 34  model_runtime::ModelRunnerConfig config;
 35  config.device_wait_config =
 36      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
 37  model_runtime::ModelRunner model_runner(popef_paths, config);
 38
 39  examples::print("Running synchronous execution mode. The memory of the "
 40                  "output tensors is allocated by the ModelRunner object.");
 41  examples::synchronousExecutionModeLibraryAllocatedOutput(model_runner);
 42
 43  examples::print("Running synchronous execution mode. The memory of the "
 44                  "output tensors is allocated by the user.");
 45  examples::synchronousExecutionModeUserAllocatedOutput(model_runner);
 46
 47  examples::print("Running asynchronous execution mode. The memory of the "
 48                  "output tensors is allocated by the ModelRunner object.");
 49  examples::asynchronousExecutionModeLibraryAllocatedOutput(model_runner);
 50
 51  examples::print("Running asynchronous execution mode. The memory of the "
 52                  "output tensors is allocated by the user.");
 53  examples::asynchronousExecutionModeUserAllocatedOutput(model_runner);
 54
 55  examples::print("Success: exiting");
 56  return EXIT_SUCCESS;
 57}
 58
 59namespace examples {
 60
 61void synchronousExecutionModeLibraryAllocatedOutput(
 62    model_runtime::ModelRunner &model_runner) {
 63  examples::print("Allocating input tensors");
 64  const model_runtime::InputMemory input_memory =
 65      examples::allocateHostInputData(model_runner.getExecuteInputs());
 66
 67  examples::printInputMemory(input_memory);
 68
 69  examples::print("Sending single synchronous request with empty data. Output "
 70                  "allocated by ModelRunner.");
 71
 72  const model_runtime::OutputMemory output_memory =
 73      model_runner.execute(examples::toInputMemoryView(input_memory));
 74
 75  examples::print("Received output allocated by ModelRunner:");
 76  using ValueType = std::pair<const std::string, model_runtime::TensorMemory>;
 77
 78  for (const ValueType &name_with_memory : output_memory) {
 79    auto &&[name, memory] = name_with_memory;
 80    examples::print(fmt::format("Output tensor {}, {} bytes", name,
 81                                memory.data_size_bytes));
 82  }
 83}
 84
 85void synchronousExecutionModeUserAllocatedOutput(
 86    model_runtime::ModelRunner &model_runner) {
 87  examples::print("Allocating input tensors");
 88  const model_runtime::InputMemory input_memory =
 89      examples::allocateHostInputData(model_runner.getExecuteInputs());
 90
 91  examples::printInputMemory(input_memory);
 92
 93  examples::print("Allocating output tensors");
 94  model_runtime::OutputMemory output_memory =
 95      examples::allocateHostOutputData(model_runner.getExecuteOutputs());
 96
 97  examples::print("Sending single synchronous request with empty data.");
 98
 99  model_runner.execute(examples::toInputMemoryView(input_memory),
100                       examples::toOutputMemoryView(output_memory));
101
102  examples::print("Received output allocated by the user:");
103
104  using ValueType = std::pair<const std::string, model_runtime::TensorMemory>;
105
106  for (const ValueType &name_with_memory : output_memory) {
107    auto &&[name, memory] = name_with_memory;
108    examples::print(fmt::format("Output tensor {}, {} bytes", name,
109                                memory.data_size_bytes));
110  }
111}
112
113void asynchronousExecutionModeLibraryAllocatedOutput(
114    model_runtime::ModelRunner &model_runner) {
115  examples::print("Allocating input tensors");
116  const model_runtime::InputMemory input_memory =
117      examples::allocateHostInputData(model_runner.getExecuteInputs());
118
119  examples::printInputMemory(input_memory);
120
121  examples::print("Sending single asynchronous request with empty data. Output "
122                  "allocated by ModelRunner.");
123
124  const model_runtime::OutputFutureMemory output_future_memory =
125      model_runner.executeAsync(examples::toInputMemoryView(input_memory));
126
127  examples::print("Waiting for output allocated by ModelRunner:");
128
129  using ValueType = std::pair<const std::string,
130                              std::shared_future<model_runtime::TensorMemory>>;
131
132  for (const ValueType &name_with_future_memory : output_future_memory) {
133    auto &&[name, future_memory] = name_with_future_memory;
134    examples::print(fmt::format("Waiting for the result: tensor {}", name));
135    future_memory.wait();
136    const model_runtime::TensorMemory &memory = future_memory.get();
137    examples::print(fmt::format("Output tensor {} available, received {} bytes",
138                                name, memory.data_size_bytes));
139  }
140}
141
142void asynchronousExecutionModeUserAllocatedOutput(
143    model_runtime::ModelRunner &model_runner) {
144  examples::print("Allocating input tensors");
145  const model_runtime::InputMemory input_memory =
146      examples::allocateHostInputData(model_runner.getExecuteInputs());
147
148  examples::printInputMemory(input_memory);
149
150  examples::print("Allocating output tensors");
151  model_runtime::OutputMemory output_memory =
152      examples::allocateHostOutputData(model_runner.getExecuteOutputs());
153
154  examples::print("Sending single asynchronous request with empty data.");
155
156  const model_runtime::OutputFutureMemoryView output_future_memory_view =
157      model_runner.executeAsync(examples::toInputMemoryView(input_memory),
158                                examples::toOutputMemoryView(output_memory));
159
160  examples::print("Waiting for the output");
161
162  using ValueType =
163      std::pair<const std::string,
164                std::shared_future<model_runtime::TensorMemoryView>>;
165
166  for (const ValueType &name_with_future_memory_view :
167       output_future_memory_view) {
168    auto &&[name, future_memory_view] = name_with_future_memory_view;
169    examples::print(fmt::format("Waiting for the result: tensor {}", name));
170    future_memory_view.wait();
171    const model_runtime::TensorMemoryView &memory_view =
172        future_memory_view.get();
173    examples::print(fmt::format("Output tensor {} available, received {} bytes",
174                                name, memory_view.data_size_bytes));
175  }
176}
177
178} // namespace examples

Download model_runner_execution_modes.cpp

This Python example sends inference requests to the IPU using all available execution modes.

Listing 3.2 model_runner_execution_modes.py
  1#!/usr/bin/env python3
  2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  3
  4import argparse
  5from datetime import timedelta
  6
  7import numpy as np
  8import model_runtime
  9import popef
 10"""
 11The example shows loading a model from PopEF files and sending
 12inference requests using all available ModelRunner execution modes.
 13"""
 14
 15
 16def main():
 17    parser = argparse.ArgumentParser("Model runner simple example.")
 18    parser.add_argument(
 19        "-p",
 20        "--popef",
 21        type=str,
 22        metavar='popef_file_path',
 23        help="A collection of PopEF files containing the model.",
 24        nargs='+',
 25        required=True)
 26    args = parser.parse_args()
 27
 28    # Create model runner
 29    config = model_runtime.ModelRunnerConfig()
 30    config.device_wait_config = model_runtime.DeviceWaitConfig(
 31        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
 32        timeout=timedelta(seconds=600),
 33        sleepTime=timedelta(seconds=1))
 34
 35    print("Creating ModelRunner with", config)
 36    model_runner = model_runtime.ModelRunner(model_runtime.PopefPaths(
 37        args.popef),
 38                                             config=config)
 39
 40    print("Preparing input tensors:")
 41    input_descriptions = model_runner.getExecuteInputs()
 42    input_tensors = [
 43        np.random.randn(*input_desc.shape).astype(input_desc.numpy_data_type())
 44        for input_desc in input_descriptions
 45    ]
 46    input_view = model_runtime.InputMemoryView()
 47
 48    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
 49        print("\tname:", input_desc.name, "shape:", input_tensor.shape,
 50              "dtype:", input_tensor.dtype)
 51        input_view[input_desc.name] = input_tensor
 52
 53    print("Running synchronous execution mode. The memory of the output "
 54          "tensors is allocated by the ModelRunner object.")
 55    synchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view)
 56
 57    print("Running synchronous execution mode. The memory of the output "
 58          "tensors is allocated by the user.")
 59    synchronousExecutionModeUserAllocatedOutput(model_runner, input_view)
 60
 61    print("Running asynchronous execution mode. The memory of the output "
 62          "tensors is allocated by the ModelRunner object.")
 63    asynchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view)
 64
 65    print("Running asynchronous execution mode. The memory of the output "
 66          "tensors is allocated by the user.")
 67    asynchronousExecutionModeUserAllocatedOutput(model_runner, input_view)
 68
 69    input_numpy = dict()
 70    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
 71        input_numpy[input_desc.name] = input_tensor
 72
 73    print("Running synchronous execution mode. The input is a numpy array. "
 74          "The memory of the output tensors is allocated by the ModelRunner "
 75          "object.")
 76    synchronousExecutionModeLibraryAllocatedNumpyInputOutput(
 77        model_runner, input_numpy)
 78
 79    print("Running synchronous execution mode. The input and the output are "
 80          "numpy arrays. The memory of the output tensors is allocated by the "
 81          "user. ")
 82    synchronousExecutionModeUserAllocatedNumpyInputOutput(
 83        model_runner, input_numpy)
 84
 85    print(
 86        "Running asynchronous execution mode. The input and the output are "
 87        "numpy arrays. The memory of the output tensors is allocated by the "
 88        "ModelRunner object.")
 89    asynchronousExecutionModeLibraryAllocatedNumpyOutput(
 90        model_runner, input_numpy)
 91
 92    print(
 93        "Running asynchronous execution mode. The input and the output are "
 94        "numpy arrays. The memory of the output tensors is allocated by the "
 95        "user.")
 96    asynchronousExecutionModeUserAllocatedNumpyOutput(model_runner,
 97                                                      input_numpy)
 98
 99    print("Success: exiting")
100    return 0
101
102
103def synchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view):
104    print("Sending single synchronous request with random data. Output "
105          "allocated by ModelRunner.")
106    result = model_runner.execute(input_view)
107
108    output_descriptions = model_runner.getExecuteOutputs()
109    print("Processing output tensors:")
110    for output_desc in output_descriptions:
111        output_tensor = np.frombuffer(
112            result[output_desc.name],
113            dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
114        print("\tname:", output_desc.name, "shape:", output_tensor.shape,
115              "dtype:", output_tensor.dtype, "\n", output_tensor)
116
117
118def synchronousExecutionModeUserAllocatedOutput(model_runner, input_view):
119
120    output_descriptions = model_runner.getExecuteOutputs()
121    print("Preparing memory for output tensors")
122    output_tensors = [
123        np.zeros(output_desc.shape, dtype=output_desc.numpy_data_type())
124        for output_desc in output_descriptions
125    ]
126
127    print("Creating model_runtime.OutputMemoryView()")
128    output_view = model_runtime.OutputMemoryView()
129    for desc, tensor in zip(output_descriptions, output_tensors):
130        print("\tname:", desc.name, "shape:", tensor.shape, "dtype:",
131              tensor.dtype)
132        output_view[desc.name] = tensor
133
134    print("Sending single synchronous request with random data")
135    model_runner.execute(input_view, output_view)
136    print("Processing output tensors:")
137    for desc, tensor in zip(output_descriptions, output_tensors):
138        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
139              tensor.dtype, "\n", tensor)
140
141
142def synchronousExecutionModeLibraryAllocatedNumpyInputOutput(
143        model_runner, numpy_input):
144
145    output_descriptions = model_runner.getExecuteOutputs()
146
147    print("Sending single synchronous request with random data (numpy array)")
148    output_tensors = model_runner.execute(numpy_input)
149    print("Processing output tensors (numpy dict):")
150    for desc in output_descriptions:
151        tensor = output_tensors[desc.name]
152        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
153              tensor.dtype, "\n", tensor)
154
155
156def synchronousExecutionModeUserAllocatedNumpyInputOutput(
157        model_runner, numpy_input):
158
159    output_descriptions = model_runner.getExecuteOutputs()
160    print("Preparing memory for output tensors")
161    numpy_output = {}
162    for output_desc in output_descriptions:
163        numpy_output[output_desc.name] = np.zeros(
164            output_desc.shape, dtype=output_desc.numpy_data_type())
165
166    print("Sending single synchronous request with random data")
167    model_runner.execute(numpy_input, numpy_output)
168    print("Processing output tensors (numpy dict):")
169    for desc in output_descriptions:
170        tensor = numpy_output[desc.name]
171        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
172              tensor.dtype, "\n", tensor)
173
174
175def asynchronousExecutionModeLibraryAllocatedOutput(model_runner, input_view):
176
177    print("Sending single asynchronous request with random data. Output "
178          "allocated by ModelRunner.")
179    result = model_runner.executeAsync(input_view)
180
181    print("Waiting for output allocated by ModelRunner:")
182    result.wait()
183    print("Results available")
184
185    output_descriptions = model_runner.getExecuteOutputs()
186    print("Processing output tensors:")
187    for output_desc in output_descriptions:
188        output_tensor = np.frombuffer(
189            result[output_desc.name],
190            dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
191        print("\tname:", output_desc.name, "shape:", output_tensor.shape,
192              "dtype:", output_tensor.dtype, "\n", output_tensor)
193
194
195def asynchronousExecutionModeUserAllocatedOutput(model_runner, input_view):
196    output_descriptions = model_runner.getExecuteOutputs()
197    print("Preparing memory for output tensors")
198    output_tensors = [
199        np.zeros(output_desc.shape, dtype=output_desc.numpy_data_type())
200        for output_desc in output_descriptions
201    ]
202
203    print("Creating model_runtime.OutputMemoryView()")
204    output_view = model_runtime.OutputMemoryView()
205    for desc, tensor in zip(output_descriptions, output_tensors):
206        print("\tname:", desc.name, "shape:", tensor.shape, "dtype:",
207              tensor.dtype)
208        output_view[desc.name] = tensor
209
210    print("Sending single asynchronous request with random data")
211    future = model_runner.executeAsync(input_view, output_view)
212
213    print("Waiting for the output.")
214    future.wait()
215    print("Results available.")
216    print("Processing output tensors:")
217    for desc, tensor in zip(output_descriptions, output_tensors):
218        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
219              tensor.dtype, "\n", tensor)
220
221
222def asynchronousExecutionModeLibraryAllocatedNumpyOutput(
223        model_runner, numpy_input):
224    print("Sending single asynchronous request with random data")
225    future = model_runner.executeAsync(numpy_input)
226
227    print("Waiting for the output.")
228    future.wait()
229    for desc in model_runner.getExecuteOutputs():
230        future_py_array = future[desc.name]
231
232        # Create a np.array copy from the future_py_array buffer
233        # using numpy() method.
234        tensor = future_py_array.numpy()
235        print("\tname:", desc.name, "shape", tensor.shape, "dtype",
236              tensor.dtype, "tensor id", id(tensor), "\n", tensor)
237
238        # Create a np.array copy from the future_py_array buffer
239        # (allocated by ModelRunner instance).
240        tensor_copy = np.array(future_py_array, copy=True)
241        print("Tensor copy", tensor_copy, "tensor id", id(tensor_copy))
242
243        # Avoid copying. Create a np.array view from the future_py_array buffer
244        # (allocated by ModelRunner instance).
245        tensor_view = np.array(future_py_array, copy=False)
246        print("Tensor view", tensor_view, "tensor id", id(tensor_view))
247
248        assert not np.shares_memory(tensor_view, tensor_copy)
249        assert not np.shares_memory(tensor, tensor_copy)
250        assert not np.shares_memory(tensor, tensor_view)
251
252
253def asynchronousExecutionModeUserAllocatedNumpyOutput(model_runner,
254                                                      numpy_input):
255
256    output_descriptions = model_runner.getExecuteOutputs()
257    print("Preparing memory for output tensors")
258    numpy_output = {}
259    for output_desc in output_descriptions:
260        numpy_output[output_desc.name] = np.zeros(
261            output_desc.shape, dtype=output_desc.numpy_data_type())
262
263    print("Sending single asynchronous request with random data")
264    future = model_runner.executeAsync(numpy_input, numpy_output)
265
266    print("Waiting for the output.")
267    future.wait()
268    print("Results available.")
269    print("Processing output tensors:")
270    for desc in output_descriptions:
271        output_tensor = numpy_output[desc.name]
272        future_py_array_view = future[desc.name]
273
274        # Create a np.array view from the future_py_array_view using numpy()
275        # method, view points to np.array present in numpy_output dict
276        tensor_from_future_object = future_py_array_view.numpy()
277        print("\tname:", desc.name, "shape", tensor_from_future_object.shape,
278              "dtype", tensor_from_future_object.dtype, "\n",
279              tensor_from_future_object)
280        assert np.shares_memory(output_tensor, tensor_from_future_object)
281
282        # Create a np.array view from the future_py_array_view buffer, view
283        # points to np.array present in numpy_output dict
284        tensor_view = np.array(future_py_array_view, copy=False)
285        assert np.shares_memory(output_tensor, tensor_view)
286        assert np.shares_memory(tensor_from_future_object, tensor_view)
287
288        # Create a np.array copy from the future_py_array_view buffer
289        tensor_copy = np.array(future_py_array_view, copy=True)
290        assert not np.shares_memory(tensor_from_future_object, tensor_copy)
291        assert not np.shares_memory(output_tensor, tensor_copy)
292
293
294if __name__ == "__main__":
295    main()

Download model_runner_execution_modes.py

3.2. Replication

The ModelRunner class allows you to specify the replication factor in the ModelRunnerConfig passed to its constructor. When this option is set, the ModelRunner object creates as many replicas of the model on the IPU as requested, provided that the required number of devices is available. Each execution mode accepts an unsigned replica_id as its last parameter, which determines the replica to which the request is sent.

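A minimal Python sketch of this flow (popef_paths is a placeholder list of PopEF file paths, input_view is a populated InputMemoryView, and enough IPUs for two replicas are assumed to be available):

    import model_runtime

    num_replicas = 2
    config = model_runtime.ModelRunnerConfig()
    config.replication_factor = num_replicas
    runner = model_runtime.ModelRunner(model_runtime.PopefPaths(popef_paths),
                                       config=config)

    # Send one synchronous request to each replica.
    for replica_id in range(num_replicas):
        result = runner.execute(input_view, replica_id=replica_id)
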
This example creates two replicas and sends inference requests to each of them using the C++ API.

Listing 3.3 model_runner_replication.cpp
 1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 2#include <string>
 3#include <unordered_map>
 4#include <vector>
 5
 6#include <boost/program_options.hpp>
 7
 8#include "model_runtime/ModelRunner.hpp"
 9#include "model_runtime/Tensor.hpp"
10#include "utils.hpp"
11
12/* The example shows loading a model from PopEF files, creating 2 model replicas
13 * and sending inference requests to each of them.
14 */
15int main(int argc, char *argv[]) {
16  static constexpr unsigned num_replicas = 2;
17
18  using namespace std::chrono_literals;
19  static const char *example_desc = "Model runner simple example.";
20  const boost::program_options::variables_map vm =
21      examples::parsePopefProgramOptions(example_desc, argc, argv);
22  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
23
24  model_runtime::ModelRunnerConfig config;
25  config.device_wait_config =
26      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
27  examples::print(fmt::format(
28      "Setting model_runtime::ModelRunnerConfig replication_factor={}",
29      num_replicas));
30
31  config.replication_factor = num_replicas;
32  model_runtime::ModelRunner model_runner(popef_paths, config);
33
34  for (unsigned replica_id = 0; replica_id < num_replicas; ++replica_id) {
35    examples::print("Allocating input tensors");
36    const model_runtime::InputMemory input_memory =
37        examples::allocateHostInputData(model_runner.getExecuteInputs());
38    examples::printInputMemory(input_memory);
39
40    examples::print(fmt::format(
41        "Sending single synchronous request with empty data - replica {}",
42        replica_id));
43
44    const model_runtime::OutputMemory output_memory = model_runner.execute(
45        examples::toInputMemoryView(input_memory), replica_id);
46
47    examples::print(fmt::format("Received output - replica {}", replica_id));
48
49    using OutputValueType =
50        std::pair<const std::string, model_runtime::TensorMemory>;
51
52    for (const OutputValueType &name_with_memory : output_memory) {
53      auto &&[name, memory] = name_with_memory;
54      examples::print(fmt::format("Output tensor {}, {} bytes", name,
55                                  memory.data_size_bytes));
56    }
57  }
58  examples::print("Success: exiting");
59  return EXIT_SUCCESS;
60}

Download model_runner_replication.cpp

This example creates two replicas and sends inference requests to each of them using the Python API.

Listing 3.4 model_runner_replication.py
 1#!/usr/bin/env python3
 2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 3
 4import argparse
 5from datetime import timedelta
 6import numpy as np
 7import model_runtime
 8import popef
 9"""
10The example shows loading a model from PopEF files, creating 2 model replicas
11and sending inference requests to each of them.
12"""
13
14
15def main():
16    parser = argparse.ArgumentParser("Model runner simple example.")
17    parser.add_argument(
18        "-p",
19        "--popef",
20        type=str,
21        metavar='popef_file_path',
22        help="A collection of PopEF files containing the model.",
23        nargs='+',
24        required=True)
25    args = parser.parse_args()
26
27    num_replicas = 2
28    # Create model runner
29    config = model_runtime.ModelRunnerConfig()
30    config.replication_factor = num_replicas
31    config.device_wait_config = model_runtime.DeviceWaitConfig(
32        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
33        timeout=timedelta(seconds=600),
34        sleepTime=timedelta(seconds=1))
35
36    print("Creating ModelRunner with", config)
37    runner = model_runtime.ModelRunner(model_runtime.PopefPaths(args.popef),
38                                       config=config)
39
40    input_descriptions = runner.getExecuteInputs()
41
42    input = model_runtime.InputMemoryView()
43
44    print("Preparing input tensors:")
45    input_descriptions = runner.getExecuteInputs()
46    input_tensors = [
47        np.random.randn(*input_desc.shape).astype(input_desc.numpy_data_type())
48        for input_desc in input_descriptions
49    ]
50    input_view = model_runtime.InputMemoryView()
51
52    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
53        print("\tname:", input_desc.name, "shape:", input_tensor.shape,
54              "dtype:", input_tensor.dtype)
55        input_view[input_desc.name] = input_tensor
56
57    for replica_id in range(num_replicas):
58        print("Sending single synchronous request with random data - replica",
59              replica_id, ".")
60        result = runner.execute(input_view, replica_id=replica_id)
61        output_descriptions = runner.getExecuteOutputs()
62
63        print("Processing output tensors - replica", replica_id, ":")
64        for output_desc in output_descriptions:
65            output_tensor = np.frombuffer(
66                result[output_desc.name],
67                dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
68            print("\tname:", output_desc.name, "shape:", output_tensor.shape,
69                  "dtype:", output_tensor.dtype, "\n", output_tensor)
70
71    print("Success: exiting")
72    return 0
73
74
75if __name__ == "__main__":
76    main()

Download model_runner_replication.py

3.3. Multithreading

By default, ModelRunner is not thread-safe: if several threads call execute() or executeAsync() concurrently, race conditions and undefined behavior may result. When using ModelRunner from multiple threads, the user must therefore either apply appropriate synchronization between the threads, or set thread_safe in ModelRunnerConfig to true, in which case every call to execute() or executeAsync() locks an internal std::mutex.

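If thread_safe is left at its default, one way to provide that synchronization yourself in Python is to guard every call on the shared ModelRunner with a single lock (a sketch, not part of the library API; runner and input_view are assumed to exist):

    import threading

    runner_lock = threading.Lock()

    def send_request(runner, input_view):
        # Serialize the executeAsync() call on the shared ModelRunner; this is
        # the locking that config.thread_safe = True would otherwise provide.
        with runner_lock:
            future = runner.executeAsync(input_view)
        # Each thread waits on its own future, as in the listings below.
        future.wait()
        return future
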
This example creates several threads and each one sends inference requests to the IPU using the C++ API.

Listing 3.5 model_runner_multithreading.cpp
 1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 2#include <array>
 3#include <string>
 4#include <vector>
 5
 6#include <boost/program_options.hpp>
 7
 8#include "model_runtime/ModelRunner.hpp"
 9#include "model_runtime/Tensor.hpp"
10#include "utils.hpp"
11
12namespace examples {
13
14void workerMain(model_runtime::ModelRunner &model_runner);
15
16} // namespace examples
17
18/* The example shows loading a model from PopEF files and sending inference
19 * requests to the same model by multiple threads.
20 */
21int main(int argc, char *argv[]) {
22  using namespace std::chrono_literals;
23  static const char *example_desc =
24      "Model runner multithreading client example.";
25  const boost::program_options::variables_map vm =
26      examples::parsePopefProgramOptions(example_desc, argc, argv);
27  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
28
29  model_runtime::ModelRunnerConfig config;
30  config.device_wait_config =
31      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
32  examples::print(
33      "Setting model_runtime::ModelRunnerConfig: thread safe = true");
34  config.thread_safe = true;
35  model_runtime::ModelRunner model_runner(popef_paths, config);
36
37  static constexpr unsigned num_workers = 4;
38  std::vector<std::thread> threads;
39  threads.reserve(num_workers);
40
41  examples::print(fmt::format("Starting {} worker threads", num_workers));
42  for (unsigned i = 0; i < num_workers; i++) {
43    threads.emplace_back(examples::workerMain, std::ref(model_runner));
44  }
45
46  for (auto &worker : threads) {
47    worker.join();
48  };
49
50  examples::print("Success: exiting");
51  return EXIT_SUCCESS;
52}
53
54namespace examples {
55
56void workerMain(model_runtime::ModelRunner &model_runner) {
57  examples::print("Starting workerMain()");
58
59  static constexpr unsigned num_requests = 5;
60  std::array<model_runtime::InputMemory, num_requests> requests_input_data;
61
62  for (unsigned req_id = 0; req_id < num_requests; req_id++) {
63    examples::print(
64        fmt::format("Allocating input tensors - request id {}", req_id));
65    requests_input_data[req_id] =
66        examples::allocateHostInputData(model_runner.getExecuteInputs());
67  }
68
69  std::vector<model_runtime::OutputFutureMemory> results;
70
71  for (unsigned req_id = 0; req_id < num_requests; req_id++) {
72    examples::print(
73        fmt::format("Sending asynchronous request. Request id {}", req_id));
74    results.emplace_back(model_runner.executeAsync(
75        examples::toInputMemoryView(requests_input_data[req_id])));
76  }
77
78  examples::print("Waiting for output:");
79  for (unsigned req_id = 0; req_id < num_requests; req_id++) {
80    auto &output_future_memory = results[req_id];
81
82    using OutputValueType =
83        std::pair<const std::string,
84                  std::shared_future<model_runtime::TensorMemory>>;
85    for (const OutputValueType &name_with_future_memory :
86         output_future_memory) {
87      auto &&[name, future_memory] = name_with_future_memory;
88      examples::print(fmt::format(
89          "Waiting for the result: tensor {}, request_id {}", name, req_id));
90      future_memory.wait();
91      const model_runtime::TensorMemory &memory = future_memory.get();
92      examples::print(fmt::format(
93          "Output tensor {} available, request_id {} received {} bytes", name,
94          req_id, memory.data_size_bytes));
95    }
96  }
97}
98
99} // namespace examples

Download model_runner_multithreading.cpp

This example creates several threads and each one sends inference requests to the IPU using the Python API.

Listing 3.6 model_runner_multithreading.py
 1#!/usr/bin/env python3
 2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
 3
 4import argparse
 5import threading
 6from datetime import timedelta
 7import numpy as np
 8import model_runtime
 9import popef
10"""
11The example shows loading a model from PopEF files and sending inference
12requests to the same model by multiple threads.
13"""
14
15
16def main():
17    parser = argparse.ArgumentParser("Model runner simple example.")
18    parser.add_argument(
19        "-p",
20        "--popef",
21        type=str,
22        metavar='popef_file_path',
23        help="A collection of PopEF files containing the model.",
24        nargs='+',
25        required=True)
26    args = parser.parse_args()
27
28    config = model_runtime.ModelRunnerConfig()
29    config.thread_safe = True
30    config.device_wait_config = model_runtime.DeviceWaitConfig(
31        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
32        timeout=timedelta(seconds=600),
33        sleepTime=timedelta(seconds=1))
34
35    print("Creating ModelRunner with", config)
36    model_runner = model_runtime.ModelRunner(model_runtime.PopefPaths(
37        args.popef),
38                                             config=config)
39    num_workers = 4
40    print("Starting", num_workers, "worker threads.")
41    threads = [
42        threading.Thread(target=workerMain, args=(model_runner, worker_id))
43        for worker_id in range(num_workers)
44    ]
45
46    for thread in threads:
47        thread.start()
48
49    for thread in threads:
50        thread.join()
51
52    print("Success: exiting")
53    return 0
54
55
56def workerMain(model_runner, worker_id):
57    print("Worker", worker_id, "Starting workerMain()")
58    num_requests = 5
59
60    input_descriptions = model_runner.getExecuteInputs()
61    input_requests = []
62
63    print("Worker", worker_id, "Allocating input tensors for", num_requests,
64          "requests", input_descriptions)
65    for _ in range(num_requests):
66        input_requests.append([
67            np.random.randn(*input_desc.shape).astype(
68                input_desc.numpy_data_type())
69            for input_desc in input_descriptions
70        ])
71
72    futures = []
73
74    for req_id in range(num_requests):
75        print("Worker", worker_id, "Sending asynchronous request. Request id",
76              req_id)
77        input_view = model_runtime.InputMemoryView()
78        for input_desc, input_tensor in zip(input_descriptions,
79                                            input_requests[req_id]):
80            input_view[input_desc.name] = input_tensor
81        futures.append(model_runner.executeAsync(input_view))
82
83    print("Worker", worker_id, "Processing outputs.")
84    for req_id, future in enumerate(futures):
85        print("Worker", worker_id, "Waiting for the result - request", req_id)
86        future.wait()
87        print("Worker", worker_id, "Result available - request", req_id)
88
89
90if __name__ == "__main__":
91    main()

Download model_runner_multithreading.py

3.4. Frozen inputs

The ModelRunner class allows constant tensors to be bound to inputs by setting frozen_inputs in ModelRunnerConfig. frozen_inputs is an instance of InputMemoryView: the user allocates the data for the selected input tensors and passes pointers to it. If a frozen tensor is an input that would normally be required during an execution call, it no longer needs to be provided; the tensor from frozen_inputs is added to every request automatically. If a frozen tensor is an input stored as PopEF tensor data or feed data, it is overridden by the tensor from frozen_inputs.

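In Python, freezing an input amounts to registering it in an InputMemoryView before the ModelRunner is constructed (a sketch; "tensor_B" is the input name used by the example model below, model is a popef model object loaded as in Listing 3.8, and constant_data is a placeholder numpy array with the tensor's shape and dtype):

    import model_runtime

    frozen_inputs = model_runtime.InputMemoryView()
    frozen_inputs["tensor_B"] = constant_data

    config = model_runtime.ModelRunnerConfig()
    config.frozen_inputs = frozen_inputs

    runner = model_runtime.ModelRunner(model, config=config)
    # "tensor_B" no longer has to be supplied when calling execute(); the frozen
    # data is added to every request instead.
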
Note

Examples Listing 3.7 and Listing 3.8 rely on a PopEF file generated by the code Listing 9.2.

This example binds a constant value to one of the inputs and sends inference requests to the IPU using the C++ API.

Listing 3.7 model_runner_frozen_inputs.cpp
  1// Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  2#include <algorithm>
  3#include <string>
  4#include <vector>
  5
  6#include <boost/program_options.hpp>
  7
  8#include <popef/Model.hpp>
  9#include <popef/Reader.hpp>
 10#include <popef/Types.hpp>
 11
 12#include "model_runtime/ModelRunner.hpp"
 13#include "model_runtime/Tensor.hpp"
 14#include "utils.hpp"
 15
 16namespace examples {
 17
 18std::shared_ptr<popef::Model>
 19createPopefModel(const std::vector<std::string> &popef_paths);
 20const popef::Anchor *findAnchor(const std::string &name, popef::Model *model);
 21std::vector<float> createFrozenTensorData(const popef::Anchor *anchor);
 22
 23} // namespace examples
 24
 25/* The example shows loading a model from PopEF file and binding constant tensor
 26 * value to one of the inputs. The example is based on the PopEF file generated
 27 * by `model_runtime_example_generate_simple_popef` example. Generated PopEF
 28 * file contains a simple model:
 29 *
 30 * output = (A * weights) + B
 31 *
 32 * where A and B are stream inputs, weights is a tensor saved as popef::TensorData
 33 * and output is the resulting stream output tensor.
 34 */
 35int main(int argc, char *argv[]) {
 36  using namespace std::chrono_literals;
 37  static const char *example_desc = "Model runner frozen inputs example.";
 38  const boost::program_options::variables_map vm =
 39      examples::parsePopefProgramOptions(example_desc, argc, argv);
 40  const auto popef_paths = vm["popef"].as<std::vector<std::string>>();
 41
 42  std::shared_ptr<popef::Model> model = examples::createPopefModel(popef_paths);
 43
 44  static const std::string frozen_input_name = "tensor_B";
 45
 46  examples::print(fmt::format("Looking for tensor {} inside PopEF model.",
 47                              frozen_input_name));
 48  const popef::Anchor *tensor_b_anchor =
 49      examples::findAnchor(frozen_input_name, model.get());
 50  examples::print(fmt::format("Found {}.", *tensor_b_anchor));
 51
 52  examples::print("Creating frozen input tensor data.");
 53  const std::vector<float> tensor_b_data =
 54      examples::createFrozenTensorData(tensor_b_anchor);
 55
 56  examples::print("Creating ModelRunnerConfig.");
 57  model_runtime::ModelRunnerConfig config;
 58
 59  examples::print(fmt::format("Tensor {} is frozen - will be treated as "
 60                              "constant in each execution request.",
 61                              frozen_input_name));
 62  const uint64_t tensor_b_size_in_bytes =
 63      tensor_b_anchor->tensorInfo().sizeInBytes();
 64
 65  config.frozen_inputs = {
 66      {frozen_input_name, model_runtime::ConstTensorMemoryView{
 67                              tensor_b_data.data(), tensor_b_size_in_bytes}}};
 68
 69  config.device_wait_config =
 70      model_runtime::DeviceWaitConfig{600s /*timeout*/, 1s /*sleep_time*/};
 71
 72  model_runtime::ModelRunner model_runner(model, config);
 73
 74  examples::print("Allocating input tensors");
 75
 76  const model_runtime::InputMemory input_memory =
 77      examples::allocateHostInputData(model_runner.getExecuteInputs());
 78
 79  examples::printInputMemory(input_memory);
 80
 81  examples::print("Sending single synchronous request with empty data.");
 82  const model_runtime::OutputMemory output_memory =
 83      model_runner.execute(examples::toInputMemoryView(input_memory));
 84
 85  examples::print("Received output:");
 86
 87  using ValueType = std::pair<const std::string, model_runtime::TensorMemory>;
 88
 89  for (const ValueType &name_with_memory : output_memory) {
 90    auto &&[name, memory] = name_with_memory;
 91    examples::print(fmt::format("Output tensor {}, {} bytes", name,
 92                                memory.data_size_bytes));
 93  }
 94
 95  examples::print("Success: exiting");
 96  return EXIT_SUCCESS;
 97}
 98
 99namespace examples {
100
101std::shared_ptr<popef::Model>
102createPopefModel(const std::vector<std::string> &popef_paths) {
103  auto reader = std::make_shared<popef::Reader>();
104  for (const auto &path : popef_paths)
105    reader->parseFile(path);
106
107  return popef::ModelBuilder(reader).createModel();
108}
109
110const popef::Anchor *findAnchor(const std::string &name, popef::Model *model) {
111  const auto &anchors = model->metadata.anchors();
112
113  const auto anchor_it = std::find_if(
114      anchors.cbegin(), anchors.cend(),
115      [&](const popef::Anchor &anchor) { return anchor.name() == name; });
116
117  if (anchor_it == anchors.cend()) {
118    throw std::runtime_error(fmt::format(
119        "Anchor {} not found in given model. Please make sure that PopEF was "
120        "generated by `model_runtime_example_generate_simple_popef`.",
121        name));
122  }
123
124  if (auto anchorDataType = anchor_it->tensorInfo().dataType();
125      anchorDataType != popef::DataType::F32) {
126    throw std::runtime_error(fmt::format(
127        "Example expects anchor {} with popef::DataType::F32. Received {}",
128        name, anchorDataType));
129  }
130
131  return &(*anchor_it);
132}
133
134std::vector<float> createFrozenTensorData(const popef::Anchor *anchor) {
135  const auto size_in_bytes = anchor->tensorInfo().sizeInBytes();
136  const auto num_elements = size_in_bytes / sizeof(float);
137
138  return std::vector<float>(num_elements, 11.0f);
139}
140
141} // namespace examples

Download model_runner_frozen_inputs.cpp

This example binds a constant value to one of the inputs and sends inference requests to the IPU using the Python API.

Listing 3.8 model_runner_frozen_inputs.py
  1#!/usr/bin/env python3
  2# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
  3
  4import os
  5import argparse
  6from datetime import timedelta
  7import numpy as np
  8import model_runtime
  9import popef
 10"""
 11The example shows loading a model from PopEF file and binding constant tensor
 12value to one of the inputs. The example is based on the PopEF file generated
 13by `model_runtime_example_generate_simple_popef` example. Generated PopEF
 14file contains a simple model:
 15
 16output = (A * weights) + B
 17
 18where A and B are stream inputs, weights is a tensor saved as popef::TensorData
 19and output is the resulting stream output tensor.
 20"""
 21
 22
 23def main():
 24    parser = argparse.ArgumentParser("Model runner simple example.")
 25    parser.add_argument(
 26        "-p",
 27        "--popef",
 28        type=str,
 29        metavar='popef_file_path',
 30        help="A collection of PopEF files containing the model.",
 31        nargs='+',
 32        required=True)
 33    args = parser.parse_args()
 34    model = load_model(args.popef)
 35
 36    frozen_input_name = "tensor_B"
 37    print("Looking for tensor", frozen_input_name, "inside PopEF model.")
 38    tensor_b_anchor = popef.Anchor()
 39
 40    for anchor in model.metadata.anchors():
 41        if anchor.name() == frozen_input_name:
 42            tensor_b_anchor = anchor
 43            break
 44    else:
 45        raise Exception(f'Anchor {frozen_input_name} not found inside given '
 46                        'model. Please make sure that PopEF was generated by '
 47                        '`model_runtime_example_generate_simple_popef`')
 48
 49    print("Generating", frozen_input_name, "random values")
 50    tensor_b_info = tensor_b_anchor.tensorInfo()
 51    tensor_b = np.random.randn(*tensor_b_info.shape()).astype(
 52        tensor_b_info.numpyDType())
 53
 54    config = model_runtime.ModelRunnerConfig()
 55
 56    frozen_inputs = model_runtime.InputMemoryView()
 57    frozen_inputs[frozen_input_name] = tensor_b
 58    config.frozen_inputs = frozen_inputs
 59
 60    print(
 61        "Tensor", frozen_input_name, "is frozen - will be treated as "
 62        "constant in each execution request.")
 63    config.device_wait_config = model_runtime.DeviceWaitConfig(
 64        model_runtime.DeviceWaitStrategy.WAIT_WITH_TIMEOUT,
 65        timeout=timedelta(seconds=600),
 66        sleepTime=timedelta(seconds=1))
 67
 68    model_runner = model_runtime.ModelRunner(model, config=config)
 69
 70    print("Preparing input tensors:")
 71    input_descriptions = model_runner.getExecuteInputs()
 72    input_tensors = [
 73        np.random.randn(*input_desc.shape).astype(input_desc.numpy_data_type())
 74        for input_desc in input_descriptions
 75    ]
 76    input_view = model_runtime.InputMemoryView()
 77
 78    for input_desc, input_tensor in zip(input_descriptions, input_tensors):
 79        print("\tname:", input_desc.name, "shape:", input_tensor.shape,
 80              "dtype:", input_tensor.dtype)
 81        input_view[input_desc.name] = input_tensor
 82
 83    print("Sending single synchronous request with random data.")
 84    result = model_runner.execute(input_view)
 85    output_descriptions = model_runner.getExecuteOutputs()
 86
 87    print("Processing output tensors:")
 88    for output_desc in output_descriptions:
 89        output_tensor = np.frombuffer(
 90            result[output_desc.name],
 91            dtype=output_desc.numpy_data_type()).reshape(output_desc.shape)
 92        print("\tname:", output_desc.name, "shape:", output_tensor.shape,
 93              "dtype:", output_tensor.dtype, "\n", output_tensor)
 94
 95    print("Success: exiting")
 96
 97    return 0
 98
 99
100def load_model(popef_paths):
101    for model_file in popef_paths:
102        assert os.path.isfile(model_file) is True
103        reader = popef.Reader()
104        reader.parseFile(model_file)
105
106        meta = reader.metadata()
107        exec = reader.executables()
108        return popef.ModelBuilder(reader).createModel()
109
110
111if __name__ == "__main__":
112    main()

Download model_runner_frozen_inputs.py