3. Capturing IPU reports

This section describes how to generate the files that the PopVision® Graph Analyser can analyse. The PopVision® Graph Analyser uses report files generated during compilation and execution by the Poplar SDK.

When you first open the application, there is a link on the opening page to a Getting Started with PopVision video.

The sections below describe the files supported by the PopVision® Graph Analyser. These files can be created using the POPLAR_ENGINE_OPTIONS environment variable or the Poplar API. At a minimum you need either the archive.a or the profile.pop for the PopVision® Graph Analyser to present reports.

Poplar SDK 1.2 added a new entry to POPLAR_ENGINE_OPTIONS that makes capturing reports easier. To capture the reports needed for the PopVision® Graph Analyser, you only need to set POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}' before you run a program. This enables instrumentation and captures all the required reports. For more information, please read the description of the Poplar Engine options in the Poplar and PopLibs API Reference.

By default, report files are output to the current working directory. You can specify a different output directory by using, for example:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.directory":"./tommyFlowers"}'
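When the program is launched from Python, the same options can be set from inside the script instead of the shell. The following is a minimal sketch, assuming the variable is set before the framework creates the Poplar Engine; the directory name simply mirrors the example above.

```python
import json
import os

# Engine options from the example above; values are strings, as in the shell form.
engine_options = {
    "autoReport.all": "true",
    "autoReport.directory": "./tommyFlowers",
}

# POPLAR_ENGINE_OPTIONS holds a JSON object serialised as a string.
# Assumption: this runs before the framework creates the Poplar Engine.
os.environ["POPLAR_ENGINE_OPTIONS"] = json.dumps(engine_options)
```

Building the value with json.dumps avoids quoting mistakes that are easy to make when writing the JSON by hand.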

If your application has multiple Poplar programs (for example, if you build and run a training and a validation model), a subdirectory named after the Engine will be created in autoReport.directory, and the profile information will be written there. This allows users of the Poplar API to make sure reports are written to different locations. (If no name is provided, the profile information will continue to be written in autoReport.directory.) Further information can be found in the Using TensorFlow, Using PopART and Using PyTorch sections.
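Because each Engine may write to its own subdirectory, it can be handy to list every location that received a profile after a run. The sketch below is illustrative only: find_profiles is a hypothetical helper, and it assumes nothing about the subdirectory names beyond the fixed file name profile.pop.

```python
from pathlib import Path


def find_profiles(report_dir):
    """Return the directories at or below report_dir that contain a profile.pop file."""
    root = Path(report_dir)
    if not root.is_dir():
        return []
    # rglob also matches a profile.pop written directly into report_dir.
    return sorted(p.parent for p in root.rglob("profile.pop"))


if __name__ == "__main__":
    # "./tommyFlowers" mirrors the autoReport.directory example above.
    for directory in find_profiles("./tommyFlowers"):
        print(directory)
```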

If you are profiling cached executables using PopART, you must use popart::SessionOptions to provide a directory name for your reports. Using autoReport.directory in POPLAR_ENGINE_OPTIONS will not work.

3.1. Unsupported file versions

As of Graph Analyser version 3.7, support for the old JSON and CBOR graph profile formats has been removed. This means that the following files that were generated before Poplar SDK 2.0 can no longer be read:

  • graph.json

  • graph.cbor

  • execution.json

  • execution.cbor

  • profile_info.json

At a minimum you need either the profile.pop or the archive.a file for the PopVision® Graph Analyser to generate its reports. If neither of these is found, you will see a ‘No graph profile found’ warning when trying to open a report.

3.2. Profiling Overhead

Profiling a model may add a memory and computation overhead to its compilation and execution phases. Typically, the highest performance impact is due to the execution overhead.

If you are not interested in the execution profile (execution trace) it is best to deactivate it by setting "autoReport.outputExecutionProfile": "false" or "debug.instrument": "false". This will implicitly disable "debug.instrumentControlFlow", "debug.instrumentExternalExchange", and "debug.instrumentCompute" (unless explicitly enabled by the user). Note that a profile can omit its execution part but not its compilation part. In other words, setting "autoReport.outputExecutionProfile" to true will automatically set "autoReport.outputGraphProfile" to true too.

The following sections show which options can be used to reduce the overhead in the compilation and execution profiler parts.

3.2.1. Compilation

During compilation, the profiler generates the graph profile (also known as the memory profile). This profile contains information that Poplar knows or estimates at compilation time, such as the programs that form the model and its variables. The contents of the graph profile are sufficient to analyse memory issues.

The environment variable POPLAR_PROFILER_LOG_LEVEL can be set to generate a log of the steps performed by the profiler during compilation and detect any possible time overhead.

Poplar engine options can be used to include or exclude the profiling of certain information. This will reduce the time taken to create the profile and the size of the generated files. Please refer to Report Files for a description of the files.

The option POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}' outputs a full report. If you wish to exclude a part of it, set this option and explicitly disable the undesired information. For example: POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.outputArchive":"false"}'

Similarly, if you wish to include only a certain part of the report, just set the specific autoReport option. For example: POPLAR_ENGINE_OPTIONS='{"autoReport.outputArchive":"true"}'

Options to tune the graph profile (in order of expected impact):

  • "autoReport.outputLoweredVars": "false": This option can be useful if profile.pop is too large. However, this will deactivate some functionality in the Graph Analyser such as the variables memory graph. Note that excluding lowered variables from the report will not speed up its visualisation by the Graph Analyser.

  • "autoReport.outputDebugInfo": "false": This option can be useful if debug.cbor is too large. However, this will deactivate some functionalities in the Graph Analyser such as the operations graph. Note that no meaningful speed-up is expected in the rest of Graph Analyser functionalities.

  • "autoReport.outputArchive": "false": This option can be set to avoid generating archive.a. If it is set to false, you must generate profile.pop and some minor functionalities of the Memory Report will be disabled in the Graph Analyser.
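The "full report minus selected parts" pattern described above can be sketched as a small helper. This is an illustration, not part of the Poplar SDK: full_report_without and KNOWN_OUTPUTS are hypothetical names, and the accepted output names are limited to those mentioned in this section.

```python
import json

# Output names taken from this section; any other name is rejected.
KNOWN_OUTPUTS = {
    "outputArchive",
    "outputDebugInfo",
    "outputLoweredVars",
    "outputExecutionProfile",
}


def full_report_without(*excluded):
    """Build a POPLAR_ENGINE_OPTIONS value: a full report minus the named outputs."""
    unknown = set(excluded) - KNOWN_OUTPUTS
    if unknown:
        raise ValueError(f"unknown autoReport outputs: {sorted(unknown)}")
    options = {"autoReport.all": "true"}
    for name in excluded:
        options[f"autoReport.{name}"] = "false"
    return json.dumps(options)


# Example: drop lowered variables and debug information from a full report.
print(full_report_without("outputLoweredVars", "outputDebugInfo"))
```

Rejecting unknown names catches spelling mistakes early, before a long compilation is started with options that have no effect.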

3.2.2. Execution

Profiling the execution means measuring and recording the cycles spent on each of the programs of the model. The result can be visualised in the Graph Analyser execution trace. However, this instrumentation can lead to the following main overheads:

IPU memory overhead: some memory in the IPU will be devoted to storing program cycles and branch records. Some more will be used by the instrumentation code itself, although this is usually negligible.

You can set "debug.computeInstrumentationLevel": "ipu" to reduce the memory needed for program cycles. In this mode, only one tile (debug.profilingTile) will record cycles. The drawback is that per-tile cycles - the BSP trace - will not be available in the Graph Analyser. Another inconvenience is that this instrumentation level may slightly disrupt the normal execution of the model. This is because some artificial synchronisations may be introduced in order to measure the cycles of the longest-running tile.

Regarding branch records, you cannot reduce the memory needed to store them, but you can choose which tile keeps them. Thus, by using "debug.branchRecordTile" you can pick a tile with low memory pressure. Note that the last tile in the IPU is selected by default, which is usually a good choice. Also, branch recording may introduce artificial synchronisation points to flush the records to the host. This can disrupt the normal execution, especially for pipelined models with a high number of conditional branches, such as If programs.

Because of all these extra memory requirements, a model with high memory consumption may run out of memory when profiling is enabled. Depending on the model, you can adjust its parameters to leave space for the instrumentation. For example, you can try decreasing the batch size. In TensorFlow BERT you can adjust the --micro-batch-size.

Host computing overhead: Poplar processes the cycle measurements after each run to create a trace that can be visualised in the Graph Analyser. This can take a considerable amount of time if the run executed many programs. This overhead may reveal itself in the Graph Analyser if the execution took multiple runs. At the beginning of each run the IPU waits for the host in a StreamCopyBegin program. After the first run, the host may be busy processing the cycles measured in the previous run. This causes a long StreamCopyBegin as the IPU waits for the host to finish this processing. Because of this overhead, measuring the throughput of a profiled model is strongly discouraged.

To reduce this overhead you can reduce the number of programs profiled. By default, only the first two runs of the execution are captured. This can be increased or decreased by setting "autoReport.executionProfileProgramRunCount" as follows:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.executionProfileProgramRunCount":"10"}'

It is also important to reduce the number of iterations in each run. For instance, by reducing the number of steps or the number of batches per step you can get a lighter execution profile. This will not only reduce the host computation overhead but will also speed up the visualisation in the Graph Analyser. The public examples, such as TensorFlow BERT, contain some hints on how to shorten an execution for profiling.

Finally, the report size of multi-replica executions can be reduced by focusing on a single replica. You can select a replica with the "profiler.replicaToProfile" option as follows:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "profiler.replicaToProfile":"0"}'
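Several of the options in this section are often combined, and some may already be set in the environment. The sketch below merges new options into an existing POPLAR_ENGINE_OPTIONS value rather than overwriting it. merge_engine_options is a hypothetical helper, and the option values shown are illustrative choices, not recommendations.

```python
import json
import os


def merge_engine_options(extra, environ=os.environ):
    """Merge extra options into POPLAR_ENGINE_OPTIONS, keeping existing entries."""
    # An unset or empty variable is treated as an empty JSON object.
    current = json.loads(environ.get("POPLAR_ENGINE_OPTIONS") or "{}")
    current.update(extra)
    environ["POPLAR_ENGINE_OPTIONS"] = json.dumps(current)
    return current


# Options named in this section; the values are illustrative.
lighter_execution_profile = {
    "autoReport.all": "true",
    "autoReport.executionProfileProgramRunCount": "1",
    "profiler.replicaToProfile": "0",
    "debug.computeInstrumentationLevel": "ipu",
}

merge_engine_options(lighter_execution_profile)
```

Entries in extra take precedence over entries that are already present, so an earlier "autoReport.directory" setting survives while the profiling options are updated.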

3.3. Reloading reports

The folder (or folders, if you’re comparing reports) that contain the individual report files are monitored by the application in case any of the files changes, for example, if you’ve re-run your Poplar program and generated a new version.

If the application detects that any of the files have changed, a dialog box appears telling you what files have changed, and prompting you to reload the report files.

Make sure that your Poplar program has finished executing (in particular that the profile.pop file has been completely written to disk before clicking on the Reload button), otherwise you may see inconsistent information displayed in the application.

3.4. Profile troubleshooting

Very occasionally, the Graph Analyser may not be able to open a profile. This can happen for a number of reasons detailed below, with explanations on how to remedy the issue.

3.4.1. Reducing the size of profile reports

Large models and programs with many iterations within them can generate large reports which can take a while to process and display in the PopVision tools. You can reduce the size of the profiles generated when instrumenting your IPU programs by:

  • Adjusting the number of steps being profiled

  • Reducing the number of batches per step

  • Changing the instrumentation level

  • Changing the branch record tile

  • Selecting a single replica

  • Reducing the gradient accumulation factor (if you’re using it). This reduces the size of a single Engine run.

There are some additional suggestions for reducing profile size in the Profiling overhead section above.

3.4.2. Missing or corrupted report files

Sometimes the Python script running the training or inference program exits too early, through some other fault, and the profile isn’t written correctly.

  • Set POPLAR_PROFILER_LOG_LEVEL to get more information about your script’s execution.

  • The profiles are written in SQLite format. Check that they open in an SQLite client, and also using libpva.
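The SQLite check can be scripted with Python's standard sqlite3 module. The following is a rough sketch: opens_as_sqlite is a hypothetical helper, it only confirms that the file is a readable, non-empty SQLite database, and it makes no assumption about the tables Poplar writes. It does not replace opening the report with libpva.

```python
import sqlite3
from pathlib import Path


def opens_as_sqlite(path):
    """Return True if path is a non-empty SQLite database that passes a quick check."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    try:
        # Open read-only so that the report is never modified.
        uri = file_path.resolve().as_uri() + "?mode=ro"
        connection = sqlite3.connect(uri, uri=True)
        try:
            (check,) = connection.execute("PRAGMA quick_check").fetchone()
            (objects,) = connection.execute(
                "SELECT count(*) FROM sqlite_master"
            ).fetchone()
        finally:
            connection.close()
    except sqlite3.DatabaseError:
        # Raised for files that are not SQLite databases or are malformed.
        return False
    return check == "ok" and objects > 0


if __name__ == "__main__":
    print(opens_as_sqlite("profile.pop"))
```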

3.4.3. Compilation fails with OOM

Sometimes execution may be prevented if there is not sufficient memory to execute the program. Here are some actions you can take to reduce memory usage in your model.

  • Reduce your model size. This will reduce the number of parameter variables that need to be stored in the IPU memory.

  • Only use the Memory report, not the Execution Trace report.

  • Change your instrumentation level so that you are storing less information.

  • Change the branch record tile.

Additional ways to optimise your memory and throughput are detailed in the section on the Insights report.

3.5. Poplar report files

The PopVision® Graph Analyser only supports fixed names for each of the files. If you save them with different names they will not be opened. When you are browsing directories to open, the PopVision® Graph Analyser will highlight which of the following files are present in that directory.
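Before opening a directory in the application, the presence of the fixed file names can be checked with a short script. This is a sketch only: the helper names are hypothetical, and the list of names is taken from the file names mentioned in this document.

```python
from pathlib import Path

# Fixed file names mentioned in this document.
REPORT_FILES = ("archive.a", "profile.pop", "debug.cbor", "framework.json", "app.json")


def report_files_present(directory):
    """Map each known report file name to whether it exists in directory."""
    root = Path(directory)
    # is_file() follows symbolic links, so linked files count as present.
    return {name: (root / name).is_file() for name in REPORT_FILES}


def can_open_report(directory):
    """At a minimum, either archive.a or profile.pop must be present."""
    present = report_files_present(directory)
    return present["archive.a"] or present["profile.pop"]


if __name__ == "__main__":
    for name, found in report_files_present(".").items():
        print(f"{name}: {'present' if found else 'missing'}")
```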

3.5.1. Binary archive (archive.a)

This is an archive of ELF executable files, one for each tile. With this you can see the total memory usage for each tile on the Memory Report.

  • Poplar Engine Options

    • POPLAR_ENGINE_OPTIONS='{"autoReport.outputArchive":"true"}'

  • Using Poplar API

    • Set the Poplar Engine option “autoReport.outputArchive” to true

3.5.2. Poplar Profile (profile.pop)

This file contains compile-time and execution information about the Poplar graph. This file is used to show memory, liveness and program tree views and also the execution trace view.

  • Poplar Engine Options

    • POPLAR_ENGINE_OPTIONS='{"autoReport.outputGraphProfile":"true"}' and/or POPLAR_ENGINE_OPTIONS='{"autoReport.outputExecutionProfile":"true"}'

  • Using Poplar API

    • Set the Poplar Engine options “autoReport.outputGraphProfile” to true and/or “autoReport.outputExecutionProfile” to true

3.5.3. Lowered Vars Information

Poplar can generate lowered vars information, which contains details about the allocation of variables on each tile, and is used to generate the variable layout in the Memory Report. IPU memory is statically allocated and this file contains the size, location, name and other details about every variable on every tile.

This information is not generated by default, as it can be quite large, and not useful to some users. However, there are engine options to collect the data and save it either into the profile.pop file, or as a stand-alone file.

  • Poplar Engine Options

    • When you use POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true"}' the lowered vars information will be captured in the profile.pop file. You can switch that functionality on separately with: POPLAR_ENGINE_OPTIONS='{"autoReport.outputLoweredVars":"true"}'

3.5.4. Frameworks Information (framework.json & app.json)

You can use Poplar to create two more ‘custom’ files into which you can put your own data from frameworks or your application. See the Framework and Application JSON files section for more details.

3.5.5. Debug Information (debug.cbor)

This file contains additional debug information collected from the Poplar software. From this information you can understand the source of variables, Poplar programs and compute sets. The debug information is viewable in the Liveness report and the Program Tree.

  • Poplar Engine Options

    • Automatically, when using POPLAR_ENGINE_OPTIONS='{"autoReport.enable":"true"}' and manually using {"autoReport.outputDebugInfo":"true"}

  • Using Poplar API

    • Automatically created

Collecting the enhanced debug information will not increase the memory footprint of your IPU application. The enhanced debug information is generated and streamed as the model is compiled.

See the two ‘Debug information’ sections in the Liveness Report and the Program Tree for details of what’s included in the debug information, and where to find it in the Graph Analyser reports.

3.6. Using TensorFlow

If you use TensorFlow, the separate reports for each Poplar program compiled and executed will be placed in a subdirectory of autoReport.directory that contains the ISO date/time and process ID in its name.

The debug.cbor file will be placed in the autoReport.directory, and symbolic links to it are created in the subdirectories.

The cluster name can now be found in details loaded from the framework.json.

For more details please see the guides Targeting the IPU from TensorFlow 2 and Targeting the IPU from TensorFlow 1.

3.7. Using PopART

For PopART, the name of the Engine is set by default to inference or training, depending on whether you are using the InferenceSession or the TrainingSession. You also have the option of providing your own Engine name when creating the session.

training_session = popart.TrainingSession(fnModel=builder.getModelProto(),
  ...
  deviceInfo=device,
  name="tommyFlowers")

The profile.pop will be written out to:

autoReport.directory/tommyFlowers

If your application has two inference sessions, by default the second will overwrite the first.

For more details please see the PopART User Guide.

If you are profiling cached executables using PopART, you must use popart::SessionOptions to provide a directory name for your reports. Using autoReport.directory in POPLAR_ENGINE_OPTIONS will not work.

3.8. Using PyTorch

PyTorch for the IPU (which builds on top of PopART) also names the Poplar Engine “inference” or “training” by default, depending on whether you create the model with inferenceModel or trainingModel respectively. You also have the option to name the Poplar Engine yourself by specifying it in the Options object:

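import poptorch

# `model` (a torch.nn.Module) and `dirname` (the report directory) are defined elsewhere.
opts = poptorch.Options()
opts.modelName("tommyFlowers")   # name given to the Poplar Engine
opts.enableProfiling(dirname)    # enable profiling and set the report directory

poptorch_model = poptorch.inferenceModel(model, opts)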

In this example, the profile.pop file will be written to the directory:

autoReport.directory/tommyFlowers

For more details please see the PyTorch User Guide.