

6.4. Tutorial: Instrumenting applications

In this tutorial you will learn to use:

  • the PopVision System Analyser, a desktop tool for profiling the execution of IPU-targeted software on your host system processors;

  • the libpvti module in Python, which can be used to profile, time, and log information from your IPU applications and plot it directly in the PopVision System Analyser.

How to run this tutorial

To run the Jupyter notebook version of this tutorial:

  1. Enable a Poplar SDK environment (see the Getting Started guide for your IPU system)

  2. In the same environment, install the Jupyter notebook server: python -m pip install jupyter

  3. Launch a Jupyter Server on a specific port: jupyter-notebook --no-browser --port <port number>

  4. Connect via SSH to your remote machine, forwarding your chosen port: ssh -NL <port number>:localhost:<port number> <your username>@<remote machine>

For more details about this process, or if you need troubleshooting, see our guide on using IPUs from Jupyter notebooks.

Introduction

The Graphcore PopVision™ System Analyser is a desktop tool for analysing the execution of IPU-targeted software on your host system processors. It shows an interactive timeline visualisation of the execution steps involved, helping you to identify any bottlenecks between the CPUs and IPUs. This is particularly useful when you are scaling models to run on multiple CPUs and IPUs.

For this tutorial we are going to use a PopART MNIST example and add instrumentation that can be viewed using the PopVision System Analyser. Make sure the PopVision System Analyser is installed on your local machine; it can be downloaded from the Downloads Portal. The PopART MNIST example is from the simple_applications/popart/mnist directory.

Run the MNIST example with profiling enabled

import subprocess
import os

mnist_path = "./popart_mnist.py"
os.environ["PVTI_OPTIONS"] = '{"enable":"true", "directory": "mydirectory"}'
subprocess.run(["sh", "./get_data.sh"])
output = subprocess.run(
    ["python3", mnist_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
print(output.stdout.decode("utf-8"))
Creating ONNX model.
Compiling the training graph.
Compiling the validation graph.
Running training loop.
Epoch #1
   Loss=16.2605
   Accuracy=88.88%
Epoch #2
   Loss=13.9930
   Accuracy=89.63%
Epoch #3
   Loss=13.1049
   Accuracy=89.83%
Epoch #4
   Loss=12.5232
   Accuracy=90.01%
Epoch #5
   Loss=12.1029
   Accuracy=90.12%
Epoch #6
   Loss=11.7830
   Accuracy=90.22%
Epoch #7
   Loss=11.5327
   Accuracy=90.40%
Epoch #8
   Loss=11.3332
   Accuracy=90.59%
Epoch #9
   Loss=11.1712
   Accuracy=90.65%
Epoch #10
   Loss=11.0370
   Accuracy=90.70%

When this has completed you will find a pvti file in the output directory (mydirectory, as set through PVTI_OPTIONS in the cell above), e.g. “Tue_Nov_24_11:59:17_2022_GMT_4532.pvti”.

Note: You can specify an output directory for the pvti files to be written to:

PVTI_OPTIONS='{"enable":"true", "directory": "tommyFlowers"}' python3 popart_mnist.py
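
The PVTI_OPTIONS value is a JSON object, so it can also be built with json.dumps instead of being written by hand, which avoids quoting mistakes. The sketch below assumes only the two keys used in this tutorial (enable and directory); the helper name pvti_env is illustrative and not part of libpvti.

```python
import json
import os


def pvti_env(directory, base_env=None):
    """Return a copy of the environment with PVTI_OPTIONS set.

    Only the "enable" and "directory" keys shown in this tutorial are used.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["PVTI_OPTIONS"] = json.dumps({"enable": "true", "directory": directory})
    return env


env = pvti_env("tommyFlowers")
print(env["PVTI_OPTIONS"])
```

Passing the result as the env argument of subprocess.run limits profiling to that child process and leaves the notebook's own environment unchanged.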

Open the PopVision System Analyser and then select “Open a report” and select the pvti file generated. You may need to copy the pvti file to your local machine.
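
If a directory holds several reports, the most recent one is usually the file to copy. A minimal sketch using only the standard library, assuming reports carry the .pvti extension shown above; newest_pvti is an illustrative helper, not part of the SDK.

```python
from pathlib import Path


def newest_pvti(directory="."):
    """Return the most recently modified .pvti file in `directory`, or None."""
    reports = sorted(
        Path(directory).glob("*.pvti"), key=lambda p: p.stat().st_mtime
    )
    return reports[-1] if reports else None


# Prints None when the current directory contains no reports.
print(newest_pvti())
```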

You should then see the following profile information.

PopVision System Analyser screenshot of mnist

Profiling execution of epochs

We are now going to modify the MNIST example to add instrumentation that clearly shows the epochs. (You can find the completed tutorial in the complete directory.)

Firstly, we need to import the libpvti library.

Add the import statement at the top of popart_mnist.py:

import libpvti as pvti

Next we will need to create a trace channel. Add the mnistPvtiChannel as a global object.

mnistPvtiChannel = pvti.createTraceChannel("MNIST Application")

We are going to instrument the epoch loop with a Python with statement, using pvti.Tracepoint as the context manager.

Note: You will need to indent the contents of the loop.

print("Running training loop.")
for i in range(opts.epochs):
    with pvti.Tracepoint(mnistPvtiChannel, f"Epoch:{i}"):
        ...

Run the MNIST example again with profiling enabled

output = subprocess.run(
    ["python3", mnist_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
print(output.stdout.decode("utf-8"))
Creating ONNX model.
Compiling the training graph.
Compiling the validation graph.
Running training loop.
Epoch #1
   Loss=16.2605
   Accuracy=88.88%
Epoch #2
   Loss=13.9930
   Accuracy=89.63%
Epoch #3
   Loss=13.1049
   Accuracy=89.83%
Epoch #4
   Loss=12.5232
   Accuracy=90.01%
Epoch #5
   Loss=12.1029
   Accuracy=90.12%
Epoch #6
   Loss=11.7830
   Accuracy=90.22%
Epoch #7
   Loss=11.5327
   Accuracy=90.40%
Epoch #8
   Loss=11.3332
   Accuracy=90.59%
Epoch #9
   Loss=11.1712
   Accuracy=90.65%
Epoch #10
   Loss=11.0370
   Accuracy=90.70%

We leave it as an exercise for the reader to add instrumentation of the training and evaluation phases. Once added, you will see the following profile in the PopVision System Analyser.

Note: You can nest the Tracepoint statements.
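
One possible shape for nested tracepoints is sketched below. It is not taken from the completed tutorial: the phase labels "Training" and "Evaluation", the trace helper, and run_epochs are illustrative names. The except branch is a do-nothing stand-in, added here as an assumption so the sketch also runs on a machine without the Poplar SDK.

```python
from contextlib import contextmanager

try:
    import libpvti as pvti  # shipped with the Poplar SDK

    channel = pvti.createTraceChannel("MNIST Application")

    def trace(name):
        # Same call as in the tutorial: a Tracepoint used as a context manager.
        return pvti.Tracepoint(channel, name)

except ImportError:
    # Stand-in (not libpvti): records nothing, keeps the script runnable.
    @contextmanager
    def trace(name):
        yield


def run_epochs(epochs, train_step, eval_step):
    """Run each epoch with an outer tracepoint and one inner tracepoint per phase."""
    for i in range(epochs):
        with trace(f"Epoch:{i}"):
            with trace("Training"):
                train_step(i)
            with trace("Evaluation"):
                eval_step(i)


run_epochs(1, lambda i: print(f"train {i}"), lambda i: print(f"eval {i}"))
```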

PopVision System Analyser screenshot of instrumented mnist

Logging the training and validation losses

In addition to displaying function profiling, the System Analyser can plot numerical data captured by the libpvti library.

In this section, we are going to add instrumentation to our Python script to allow the System Analyser to plot the loss reported by PopART (this is a Poplar SDK 2.1 feature).

We added the libpvti import in the previous section, so we first need to create a pvti Graph object and then create series in the graph.

To create the graph we call the pvti.Graph constructor, passing the name of the graph and its units (an empty string here):

loss_graph = pvti.Graph("Loss", "")

Then create the series to which we will add the data:

training_loss_series = loss_graph.addSeries("Training Loss")
validation_loss_series = loss_graph.addSeries("Validation Loss")

Finally, after each call to the PopART session.run method we will record the training and validation loss. We take the loss from the anchors (which is an array) and compute the mean value:

training.session.run(stepio, "Epoch " + str(i) + " training step " + str(step))

# Record the training loss
training_loss_series.add(np.mean(training.anchors[loss]).item())

...

validation.session.run(stepio, "Epoch " + str(i) + " evaluation step " + str(step))

# Record the validation loss
validation_loss_series.add(np.mean(validation.anchors[loss]).item())
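
The recording step can be wrapped in a small helper so that training and validation share the same code. This sketch is not from the completed tutorial: record_mean is a hypothetical helper, it uses statistics.fmean on a flat sequence in place of np.mean, and the ListSeries class only stands in for a series returned by addSeries so that the example runs without the SDK.

```python
from statistics import fmean


def record_mean(series, values):
    """Add the mean of `values` to `series` and return it."""
    mean = fmean(values)  # fmean always returns a built-in float
    series.add(mean)
    return mean


class ListSeries:
    """Stand-in (not libpvti) exposing the add() method used in this tutorial."""

    def __init__(self):
        self.points = []

    def add(self, value):
        self.points.append(value)


training_loss_series = ListSeries()
record_mean(training_loss_series, [0.5, 0.25, 0.75])
print(training_loss_series.points)
```

Because the helper only relies on the add method, the same function should accept the training_loss_series and validation_loss_series objects created above.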

Run the MNIST example again with profiling enabled

output = subprocess.run(
    ["python3", mnist_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
print(output.stdout.decode("utf-8"))
Creating ONNX model.
Compiling the training graph.
Compiling the validation graph.
Running training loop.
Epoch #1
   Loss=16.2605
   Accuracy=88.88%
Epoch #2
   Loss=13.9930
   Accuracy=89.63%
Epoch #3
   Loss=13.1049
   Accuracy=89.83%
Epoch #4
   Loss=12.5232
   Accuracy=90.01%
Epoch #5
   Loss=12.1029
   Accuracy=90.12%
Epoch #6
   Loss=11.7830
   Accuracy=90.22%
Epoch #7
   Loss=11.5327
   Accuracy=90.40%
Epoch #8
   Loss=11.3332
   Accuracy=90.59%
Epoch #9
   Loss=11.1712
   Accuracy=90.65%
Epoch #10
   Loss=11.0370
   Accuracy=90.70%

When we view the resulting pvti report in the System Analyser (you may need to scroll to the bottom of the page) it will show the loss graph looking something like this:

PopVision System Analyser screenshot of instrumented mnist loss

Note: The option to merge all charts has been enabled to combine all threads into a single row, to make it easier to align the flame graph with the line graph.

Generating and profiling instant events

You can get insight into when particular sequences in the host code are executed by adding ‘instant events’. These can log events that occur during the execution of the application, such as receiving a message, errors or warnings, or a change in a parameter such as the epoch or learning rate.

Instant events are like checkpoints: each one adds a trace point corresponding to a single point in time rather than a block.

For example, we are going to log the epoch number each time a new epoch begins, by using instant events:

print("Running training loop.")
mnistInstantEventsChannel = pvti.createTraceChannel("Instant Events")
for i in range(opts.epochs):
    pvti.Tracepoint.event(mnistInstantEventsChannel, f"Epoch {i} begin")
    ...

Run the MNIST example again with profiling enabled

output = subprocess.run(
    ["python3", mnist_path], stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
print(output.stdout.decode("utf-8"))
Creating ONNX model.
Compiling the training graph.
Compiling the validation graph.
Running training loop.
Epoch #1
   Loss=16.2605
   Accuracy=88.88%
Epoch #2
   Loss=13.9930
   Accuracy=89.63%
Epoch #3
   Loss=13.1049
   Accuracy=89.83%
Epoch #4
   Loss=12.5232
   Accuracy=90.01%
Epoch #5
   Loss=12.1029
   Accuracy=90.12%
Epoch #6
   Loss=11.7830
   Accuracy=90.22%
Epoch #7
   Loss=11.5327
   Accuracy=90.40%
Epoch #8
   Loss=11.3332
   Accuracy=90.59%
Epoch #9
   Loss=11.1712
   Accuracy=90.65%
Epoch #10
   Loss=11.0370
   Accuracy=90.70%

You can use an existing trace channel to capture instant events, but we are using a separate one for the purposes of this tutorial.

Once added, you will see the following profile in the PopVision System Analyser. Instant events are represented by flags at the top of the profile:

PopVision System Analyser screenshot of instrumented mnist loss
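
For parameters such as the learning rate, it can be useful to emit an instant event only when the value actually changes. The sketch below is an illustration under stated assumptions: ChangeEvents is a hypothetical helper, not part of libpvti, and the emitter here simply appends labels to a list so the example runs without the SDK.

```python
class ChangeEvents:
    """Emit an event only when a tracked parameter takes a new value."""

    def __init__(self, emit):
        self._emit = emit  # callable that receives the event label
        self._last = {}    # last value seen for each parameter name

    def update(self, name, value):
        """Emit "<name> = <value>" if the value changed; return True if emitted."""
        if name in self._last and self._last[name] == value:
            return False
        self._last[name] = value
        self._emit(f"{name} = {value}")
        return True


events = []
tracker = ChangeEvents(events.append)  # stand-in emitter: collect labels
for lr in [0.1, 0.1, 0.01]:
    tracker.update("learning rate", lr)
print(events)
```

In the MNIST script the emitter could instead be a function that calls pvti.Tracepoint.event with the instant events channel and the label, as in the epoch example above.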

Going further

We leave it as an exercise for the reader to add additional instrumentation. The completed example also calculates the accuracy of the model and the CPU load (using the psutil library), and plots both.

PopVision System Analyser screenshot of instrumented mnist loss, accuracy & cpuload

This is a very simple use case for adding instrumentation. The PopVision trace instrumentation library (libpvti) provides other functions, classes & methods to instrument your Python and C++ code. For more information please see the PVTI library documentation.

Generated:2022-12-02T16:18 Source:walkthrough.py SDK:3.1.0-EA.1+1166 SST:0.0.9