5.3. Overlap I/O
The IPU execution of an inference model is generally divided into three stages:
Load: Copy input data from the host to the IPU
Compute: Run the model computation
Store: Copy result data from the IPU to the host
These three stages are executed serially, which means that the computing resources of the IPU are idle while data is transferred in the Load and Store stages. In some models with large input and output data, the performance of the whole model is limited by I/O. For such models, enabling overlap I/O allows the Compute stage to overlap with the Load and Store stages, which improves the utilisation of the IPU's computing resources.
5.3.1. Principle
The principle of overlap I/O is to divide the tiles on the IPU into two groups: compute tiles and I/O tiles. Compute tiles perform all computation, while I/O tiles only handle data transfer with the host. In this way, the Load, Compute and Store stages form a three-stage pipeline in the computational flow, which overlaps compute with I/O and improves the utilisation of the IPU's computing resources.
5.3.2. Configuring I/O tiles
To enable overlap I/O, you only need to set one parameter: the number of I/O tiles. The number of I/O tiles can be adjusted to optimise transfer throughput. To calculate the number of I/O tiles, divide the sum of the sizes of all input and output tensors by the SRAM size available per tile, and round up to the next power of 2.
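As a rough illustration of this calculation (the figures below are assumptions, not measurements), a model with 32 MiB of input and output data and roughly 600 KiB of usable SRAM per tile would need about 55 tiles, which rounds up to 64 I/O tiles:

import math

# Illustrative figures only: assumed total I/O tensor size and usable SRAM per tile
total_io_bytes = 32 * 1024 * 1024  # sum of all input and output tensor sizes
sram_per_tile = 600 * 1024         # assumed usable SRAM per tile

ratio = total_io_bytes / sram_per_tile           # ~54.6
num_io_tiles = 2 ** math.ceil(math.log2(ratio))  # round up to the next power of 2
print(num_io_tiles)                              # 64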
To configure I/O tiles with the PopRT CLI, use the --num_io_tiles parameter:
poprt \
--input_model model.onnx \
--export_popef \
--output_dir model \
--num_io_tiles 128
To configure I/O tiles with the poprt.compiler.CompilerOptions API, set num_io_tiles:
opts = poprt.compiler.CompilerOptions()
opts.num_io_tiles = 128
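For reference, the resulting options object is then passed to Compiler.compile together with the serialized ONNX model and its output names, as in the full example in Section 5.3.5 (the model path below is only a placeholder):

import onnx

from poprt.compiler import Compiler, CompilerOptions

opts = CompilerOptions()
opts.num_io_tiles = 128

model = onnx.load("model.onnx")  # placeholder path
output_names = [o.name for o in model.graph.output]
executable = Compiler.compile(model.SerializeToString(), output_names, opts)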
5.3.3. Debugging
You can use the PopVision Graph Analyser to display the overlap between I/O and Compute stages to determine whether overlap I/O would improve your throughput. Fig. 5.3 shows an example of the output of the PopVision Graph Analyser.
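The profile displayed by the Graph Analyser has to be generated when the model runs. A minimal sketch of one way to do this, assuming Poplar's standard auto-report mechanism, is to set the POPLAR_ENGINE_OPTIONS environment variable before the model is compiled and executed (the report directory name is arbitrary):

import os

# Assumption: enabling Poplar's auto-report options makes the run write a
# profile that the PopVision Graph Analyser can open. Set this before the
# model is compiled and executed; "./report" is an arbitrary directory.
os.environ["POPLAR_ENGINE_OPTIONS"] = (
    '{"autoReport.all":"true", "autoReport.directory":"./report"}'
)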
5.3.4. Concurrent requests
Since the three stages of inference form a three-stage pipeline when overlap I/O is enabled, enough concurrent requests must be fed to the IPU to keep the pipeline full. At least three threads are required to feed data to the IPU concurrently.
5.3.5. Example
The following is a simple example using overlap I/O:
# Copyright (c) 2022 Graphcore Ltd. All rights reserved.
import argparse
import threading

import numpy as np
import onnx

from onnx import helper

from poprt import runtime
from poprt.compiler import Compiler, CompilerOptions
from poprt.runtime import RuntimeConfig

'''
PopRT uses OverlapInnerLoop as the default exchange strategy.
There are two loops in the main program: an outer loop and an inner loop.
Each batch of data is processed in three pipeline stages: load/compute/store.
Therefore, in order for the pipeline to run normally, at least three threads
are required to feed data to the pipeline at the same time.
==============================================================
OverlapInnerLoop:
- Boxes denote subgraphs / subgraph Ops / loops
- Inputs/outputs are loop carried in order

.- outer loop ----------------------------------------.
|                  .- inner loop -.                   |
| load - compute - | - store      |                   |
|           load - | - compute -- | - store           |
|                  | load ------- | - compute - store |
|                  '--------------'                   |
'-----------------------------------------------------'
         ^^^^^^^       ^^^^^^^        ^^^^^^^
         overlap       overlap        overlap

==============================================================
'''


def compile(model: onnx.ModelProto, args):
    """Compile ONNX to PopEF."""
    model_bytes = model.SerializeToString()
    outputs = [o.name for o in model.graph.output]

    options = CompilerOptions()
    options.batches_per_step = args.batches_per_step
    options.num_io_tiles = args.num_io_tiles

    executable = Compiler.compile(model_bytes, outputs, options)
    return executable


def run(executable, args):
    """Run PopEF."""
    # Create the runtime configuration
    config = RuntimeConfig()
    config.timeout_ns = 0
    # Create model runner
    model_runner = runtime.Runner(executable, config)

    inputs_info = model_runner.get_execute_inputs()
    outputs_info = model_runner.get_execute_outputs()

    # Run in multiple threads
    def execute(bps, inputs_info, outputs_info):
        inputs = {}
        outputs = {}

        for input in inputs_info:
            inputs[input.name] = np.random.uniform(0, 1, input.shape).astype(
                input.numpy_data_type()
            )
        for output in outputs_info:
            outputs[output.name] = np.zeros(
                output.shape, dtype=output.numpy_data_type()
            )

        # To correctly generate the PopVision report, iteration must be a
        # multiple of batches_per_step and greater than 2 * batches_per_step.
        # There are 3 threads, so the total number of batches fed into the
        # IPU is 3 * iteration.
        iteration = bps
        for _ in range(iteration):
            model_runner.execute(inputs, outputs)

    threads = []
    num_threads = 3
    print(f"Run PopEF with {num_threads} threads.")
    for _ in range(num_threads):
        threads.append(
            threading.Thread(
                target=execute, args=(args.batches_per_step, inputs_info, outputs_info)
            )
        )

    for t in threads:
        t.start()

    for t in threads:
        t.join()
    print("Complete.")


def default_model():
    TensorProto = onnx.TensorProto

    nodes = []
    num_matmuls = 4
    nodes.append(helper.make_node("Expand", ["input", "shape"], ["Act0"]))
    for i in range(num_matmuls):
        nodes.append(helper.make_node("MatMul", [f"Act{i}", "Weight"], [f"Act{i+1}"]))
    nodes.append(
        helper.make_node("ReduceMean", [f"Act{num_matmuls}"], ["output"], axes=[0, 1])
    )

    graph = helper.make_graph(
        nodes,
        "matmul_test",
        [
            helper.make_tensor_value_info("input", TensorProto.FLOAT, (256, 256)),
        ],
        [helper.make_tensor_value_info("output", TensorProto.FLOAT, (256, 256))],
        [
            helper.make_tensor(
                "shape",
                TensorProto.INT64,
                [4],
                np.array([4, 4, 256, 256], dtype=np.int64),
            ),
            helper.make_tensor(
                "Weight",
                TensorProto.FLOAT,
                (4, 4, 256, 256),
                np.random.randn(4, 4, 256, 256),
            ),
        ],
    )
    opset_imports = [helper.make_opsetid("", 11)]
    original_model = helper.make_model(graph, opset_imports=opset_imports)
    return original_model


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Convert onnx model and run it on IPU.'
    )
    parser.add_argument(
        '--batches_per_step',
        type=int,
        default=16,
        help="The number of on-chip loop iterations.",
    )
    parser.add_argument(
        '--num_io_tiles',
        type=int,
        default=192,
        help="The number of I/O tiles.",
    )
    args = parser.parse_args()
    model = default_model()
    exec = compile(model, args)
    run(exec, args)