14. Viewing an Execution Trace

14.1. Detailed Execution Trace vs Lightweight Profiling

The Execution Trace Report shows the output of instrumenting a Poplar model in order to capture information about how long various parts of the model take to run. There are two different ways of capturing this information: the detailed Execution Trace, and Lightweight Profiling. Each has advantages and disadvantages. Which one is captured can be chosen by modifying the Poplar Engine options before running the model.
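As a hedged sketch of how this selection is made: Poplar Engine options can be supplied at runtime through the POPLAR_ENGINE_OPTIONS environment variable. The option names below ("autoReport.all", "debug.instrument") come from the Poplar SDK documentation; check the Engine option reference for your SDK version to confirm them and to find the Lightweight Profiling equivalents.

```shell
# Generate a report and enable detailed execution instrumentation.
# Option names may vary between SDK versions - consult your SDK's
# Engine option reference before relying on them.
export POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "debug.instrument":"true"}'
```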

The detailed Execution Trace works primarily by simulating execution of the application on a simplified IPU model. The simulation data is augmented by capturing information at runtime about how many cycles each compute set takes to execute. The advantage of the detailed Execution Trace is that the data captured is very detailed: every single program step is included and the trace covers every tile on every IPU. This also means we can show the BSP trace and provide information about how a single program step may execute for different periods of time on different tiles. In addition to the cycle count for each step in the model, the detailed Execution Trace also displays statistics like the tile balance and cycle proportions, and compute-set details.

The detailed Execution Trace has several disadvantages. Simulating the execution of every program step on every tile can add a significant amount of time to the compilation of the Poplar application. Recording BSP data means that profiling code must be added to every tile. Finally, only a single cycle-duration is recorded for each compute set. If the compute set executes multiple times in the model but the executions take different amounts of time, the detailed Execution Trace will not show these differences.

Lightweight Profiling (LWP) does not involve any simulation and uses data captured at runtime from the IPU. Instead of capturing information about individual program steps, LWP records the start and end time of Blocks, which are programs added in order to instrument part of an application. Blocks may contain only a single program step, or may contain high-level operations consisting of a large number of program steps.

Instead of capturing information on every tile, LWP only records Blocks on a small number of selected tiles. This means that BSP and tile-balance information are not available from a LWP execution trace. However, as well as Blocks, LWP also records StreamCopy programs that take place on the selected tiles. Because these programs span many tiles on an IPU, an LWP trace also includes blocks for StreamCopy programs on tiles that were not selected, but which participated in a StreamCopy on a tile that was selected. These blocks can be aggregated at the top of an IPU’s trace by enabling the Aggregate StreamCopy Blocks option.

LWP captures far less data than the detailed Execution Trace and so is suitable for instrumenting large, long-running models with minimal overhead. The disadvantage of LWP is that the information available is less detailed than the information in the detailed Execution Trace.

When you open an Execution Trace, a cache file named profile.pop_cache is created, which makes it much quicker to load the report when it’s opened a second time. If this file becomes corrupted for any reason, you can delete it by using the Delete Cache option in the Help menu.

14.2. Execution Trace options

14.2.1. Selecting IPUs

The menu bar above the graph contains a drop-down list named Select IPU which allows you to select all IPUs or an individual IPU.

14.2.2. View options

There are several options for viewing Execution Trace graphs, which you can select from the Options drop-down menu in the top right-hand corner of the report screen. Which options are available depends on whether the report uses detailed Execution Trace or Lightweight Profiling.

Detailed Execution Trace Options

  • Show Tile Balance: Shows what percentage of tiles are in use during each program step, visible as shading in each step.

  • Show Terminals: Include a line at the right-hand end of each process in the graph.

  • Group Executions: Group together executions that are part of the same higher process further up the call stack. Grouping is determined by the slash-delimited function calls that are logged to the execution trace profile output.

  • Group Syncs: Group multiple successive Sync steps together in the graph.

  • Show Text: Show or hide the name of each step in the graph.

  • Separate Overlapping Steps: Split up overlapping program steps into separate, non-overlapping lanes in the graph so that they can all be seen at once.

  • Show External Syncs: Display or hide the External Sync steps in the graph. There are often many of these, and hiding them may make the execution graph plot easier to understand in some cases.

  • Show SyncAns: Display SyncAns steps in the graph. As for External Syncs, hiding these steps may simplify the graph plot.

  • Use new BSP stats: Enable the new BSP-based statistics, which are more accurate than the standard step-based statistics but may be slower to calculate for particularly large reports.

Lightweight Profiling Options

  • Show Text: When enabled, the name of each block is displayed on the graph.

  • Show Buffer Flushes: LWP needs to periodically flush profiling data from each IPU. By default, time spent carrying out these flushes is hidden. When this option is enabled, all buffer flushes are shown.

  • Aggregate StreamCopy Blocks: LWP includes blocks that represent StreamCopy programs on the selected tiles, as well as any tiles that also participated. When this option is enabled, StreamCopy blocks on tiles that were not selected are aggregated and displayed in a separate I/O Events lane at the top of each IPU’s trace.

  • Show Relative Cycles: The first Block that is recorded in a LWP trace may start at a non-zero number of cycles, which can make it more difficult to see block durations at a glance. When this option is enabled, the minimum cycle count is subtracted from the axis ticks, so the ticks become relative to the start of the first Block.
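The Show Relative Cycles transformation can be sketched as subtracting the minimum recorded cycle from every axis tick. This is an illustrative example, not the tool's actual implementation:

```python
def to_relative_cycles(ticks):
    """Shift absolute cycle ticks so the first recorded Block starts at 0."""
    origin = min(ticks)
    return [t - origin for t in ticks]

# Blocks recorded starting at cycle 1,000,000 become easy to read at a glance:
print(to_relative_cycles([1_000_000, 1_000_480, 1_002_336]))
# → [0, 480, 2336]
```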

14.2.3. Colour Key

You can view the colour key that the execution trace uses by clicking the key icon in the top right-hand corner of the graph.

If your report is very large (typically over 40 million cycles), the smaller steps will be too small to render, and a small pop-up message is displayed just under the mini-map indicating that you should zoom in further to see the relevant detail in the graph trace.

14.2.4. Change Layout

When comparing the Execution Trace of two reports, a button in the top right-hand corner allows you to arrange the Execution Traces side by side or one on top of the other.

14.3. Detailed Execution Trace

Detailed Execution Trace profiling has an overhead that can be prohibitive in some cases. Please refer to Profiling Overhead for details on how to prepare your execution to minimise that overhead.

There are two halves to the Execution Trace Report:

  • the Execution Trace graph, in the top half of the screen, showing either a flat compute set execution trace, or a flame graph of the stack,

  • a details report, in the bottom half of the screen, showing statistics about the cycle proportions, tile balance and compute sets present in the portion of the graph currently displayed. The details report has two tabs, the Summary tab, which shows statistics, cycle proportions and tile balance (experimental), and the Details tab, which shows details of a selected process from the execution trace graph.

14.3.1. The Execution Trace graph

The top half of the Execution Trace Report shows, by default, a flat graph, showing a set of consecutive blocks for each IPU, identifying what program steps were executed (as shown in the Program Tree report), and how many cycles they took. You can also view it as a flame graph, where the Poplar debug information is used to group compute sets together as part of the same operation, or application layer.

  • Select features that you want in the graph from the Graph View drop-down, including Flame Graph, BSP Trace, and displaying separate runs of the program.

  • Hover your mouse pointer over the graph to see a callout containing the compute set ID, together with the amount of memory used by that compute set.

  • Click on a run at the top of the trace, and you can see the run-specific execution parameters displayed in the Details tab.

  • Click on a compute set in the graph to see its stack details below the graph, including its name and memory size.

  • Double-clicking on a layer when the flame graph option is selected expands that layer to the full width of the graph. You can select two layers at once by clicking on the first one, then holding down the Shift key and clicking on the second one. The graph then expands to contain just those two layers. This makes it possible to inspect the cycle proportions of only the visible section of the execution trace, making it easier to understand the proportion of cycles spent in each type of process.

  • Clicking on a region in the BSP trace also selects the corresponding step in the Execution Trace graph.

  • You can magnify the BSP trace vertically to help identify individual tiles. Clicking the + and – buttons to the left of the trace zooms in and out vertically. Click the recycle button to return to the default magnification.

  • Click the icon at the left-hand edge of a lane in the graph to collapse it such that only the top row of blocks is shown.

14.3.2. Execution View

Use the Execution View drop-down list in the top left-hand corner of the Execution report to control what features you’d like to see on the graph. These include:

  • Runs: Toggle the display of program run information on the graph. When enabled, a set of dark grey markers just below the mini-map indicates when each program run starts and ends. On the graph itself, a bar at the top of each IPU lane shows the name of each program that runs. See Defining program names.

  • Flat & Flame: These settings toggle between a flat view, where all the program steps are compressed into a single lane on the graph, or a flame view, where the call structure of all steps is displayed. Note that you can also control how overlapping steps are displayed with the Separate overlapping steps control, in the View Options control.

  • BSP: Include a graphical depiction of the BSP activity, showing where the patterns of the IPU internal Sync, Compute and Exchange steps occur.

The current combination of settings is displayed on the drop-down button itself.

Defining program names

You can specify program names in your programs by using the Poplar or PopART APIs. The Execution Trace graph can display this information by enabling the Runs option in the Execution View drop-down control above the graph.

  • The Poplar Engine::run API now takes a debug string for the name of the run.

  • The PopART Session::run API allows you to specify a string for the name of the run, as well as additional strings for internal program runs, for example: WeightsToHost.

If you enable the display of runs in the graph, but no run name was provided for a run, a sequentially numbered default name is generated, for example: Engine::run #5.

As well as the execution parameters displayed in the Summary report, you can also display run-specific execution parameters, as described next.

Run-specific execution parameters

Poplar has the ability to tune runtime-only parameters for a specific run, which may modify its behaviour (including, but not limited to, how fast it executes). You can view these parameters on the Summary report, but also view them for a specific run:

  • With the execution trace displayed, click on one of the named runs displayed at the top of the trace, and then click on the Details tab to see the parameters used to compile the programs for that run. If no name was supplied for a run, it will be called, for example, Engine::run #1.

14.3.3. Filtering steps

You can concentrate on the steps you’re interested in by filtering on a particular search term.

  • Type a term into the Search box at the top and press Enter. Steps that don’t match in the execution trace graph are greyed-out.

  • Cycle through the matching steps by pressing Enter repeatedly, or clicking the arrows below the Search box. The single arrows move backwards and forwards through the matching steps, one at a time, and the double arrows move ten at a time.

  • To cancel the filtering, empty the search box by clicking on the small x at its right-hand end.

14.3.4. Summary tab

This tab provides an overview of the portion of the execution trace currently displayed in the graph in the top half of the page.

Statistics

The statistics displayed are:

  • Cycles: The number of cycles in the visible section of the execution trace. Note that a warning is displayed if this number overflows the 32-bit memory limit.

  • Rx / Tx: For StreamCopy and GlobalExchange steps, the amount of data transmitted and received during the operation.

Cycle proportions

A bar is displayed for each IPU that shows a graphical representation of the proportion of cycles that are taken executing each type of compute set. If you hover your mouse over these bars, you’ll see a key that shows what process each colour represents, as follows:

  • Internal Sync: The sync process that occurs between each tile on an IPU as part of the BSP process.

  • External Sync: The sync process that occurs between each IPU as part of the BSP process. External syncs are also used in some host-device communication situations, where the IPUs all need to synchronise with an event outside their boundaries, for example a flow control step in the host program.

  • OnTileExecute: The vertex compute code executed on each tile.

  • DoExchange: The tile-to-tile data exchange within an IPU.

  • GlobalExchange: An IPU-to-IPU data exchange.

  • StreamCopy: A data exchange between an IPU and the host machine over PCI.
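The per-IPU proportion bars described above can be sketched as a simple weighted tally of cycles by step type. This is an illustrative calculation only; the step names and data layout are assumptions for the example:

```python
from collections import defaultdict

def cycle_proportions(steps):
    """steps: iterable of (step_type, cycles) pairs for the visible window
    on one IPU. Returns the fraction of total cycles spent in each type."""
    totals = defaultdict(int)
    for step_type, cycles in steps:
        totals[step_type] += cycles
    grand_total = sum(totals.values())
    return {t: c / grand_total for t, c in totals.items()}

props = cycle_proportions([
    ("OnTileExecute", 600),
    ("DoExchange", 300),
    ("Internal Sync", 100),
])
print(props["OnTileExecute"])  # → 0.6
```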

Tile balance

A bar is displayed for each IPU that shows a graphical representation of the percentage of tiles utilised by steps in the current viewport.

It is calculated by averaging the tile balance for all Steps (excluding Syncs) that were executed on the IPU. In previous versions of the Graph Analyser this value was derived from the percentage of cycles executed by a step, weighted by the number of tiles the step used.
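The current averaging scheme can be sketched as follows. The Sync-exclusion test and the data layout are illustrative assumptions, not the Graph Analyser's actual code:

```python
def viewport_tile_balance(steps):
    """steps: (step_type, tile_balance) for each step on one IPU in view.
    Averages the tile balance over all non-Sync steps, per the current
    calculation described above."""
    balances = [b for step_type, b in steps if "Sync" not in step_type]
    return sum(balances) / len(balances) if balances else 0.0

print(viewport_tile_balance([
    ("OnTileExecute", 0.75),
    ("DoExchange", 0.25),
    ("Internal Sync", 1.0),  # Sync steps are excluded from the average
]))
# → 0.5
```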

New BSP statistics

The standard execution trace statistics are calculated by taking into account which steps are currently in view and averaging values across those steps, weighting by the length of the step. This does not take into account that the tiles on each IPU do not in general all execute the same step at the same time.

The new BSP statistics are calculated based on BSP data and take into account what every tile is executing at every cycle. This can be slower to calculate but gives more accurate information about the cycle window currently in view.

When the New BSP statistics option is enabled, the Tile Balance is replaced with Tile Utilisation. This measures the proportion of tile-cycles currently in view which are spent on processing or transfer steps, for example OnTileExecute, DoExchange, GlobalExchange, StreamCopyMid, as opposed to idle states like Sync, SyncAns, and StreamCopyBegin.
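Tile Utilisation can be sketched as the ratio of active tile-cycles to all tile-cycles in the window. The set of active step types is taken from the examples above; the data layout is an assumption for illustration:

```python
# Processing/transfer steps named above; Sync, SyncAns and
# StreamCopyBegin count as idle.
ACTIVE_STEPS = {"OnTileExecute", "DoExchange", "GlobalExchange", "StreamCopyMid"}

def tile_utilisation(tile_cycles):
    """tile_cycles: (step_type, cycles) entries covering every tile's
    activity in the visible cycle window. Returns the fraction of
    tile-cycles spent in processing or transfer steps."""
    active = sum(c for step_type, c in tile_cycles if step_type in ACTIVE_STEPS)
    total = sum(c for _, c in tile_cycles)
    return active / total if total else 0.0

print(tile_utilisation([
    ("OnTileExecute", 700),    # tile 0
    ("Sync", 300),             # tile 0
    ("DoExchange", 500),       # tile 1
    ("StreamCopyBegin", 500),  # tile 1
]))
# → 0.6
```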

14.3.5. Details tab

When a program step is selected in the flat graph, or a layer is selected in the flame graph, a list of program steps, with further details, is shown here. Many of the details are the same across the different types:

  • Cycles: The number of cycles on active tiles that the program step used to execute,

  • Active Tiles: The number of tiles involved in executing that program step,

  • All Cycles: The number of cycles on all tiles, with additional statistics.

  • Tile Balance: A measure of how efficiently the program step is spread across the tiles. See View Options for more details.

  • Active Tile Balance: A recalculation of the tile balance measurement above, but excluding those tiles that do nothing.

A warning is displayed if any of the cycle counts overflow the 32-bit memory limit.

Internal Sync

This is a sync process between tiles on an IPU.

External Sync

This is a sync process between IPUs.

SyncAns

This is an internal Automatic, Non-participatory Sync process. A tile can pre-acknowledge a number of internal/external syncs using the sans instruction. The Sync ANS instruction will wait until all those pre-acknowledged syncs actually happen.

OnTileExecute

This is a piece of vertex code being executed in a tile. In addition to the common information listed above, the following is displayed:

  • By Vertex Type: Shows what vertices are involved in the process execution.

Below these details, an interactive graph plot is displayed that shows how the selected program step makes use of cycles on each tile as it executes. For DoExchange programs, there is also a graph of the data received and transmitted by the program during its execution.

DoExchange

This is an exchange process, where data is exchanged between IPU tiles. In addition to the common information listed above, the following is displayed:

  • Total Data: The total amount of data transferred during the exchange,

  • Data Transmitted: The amount of data transmitted during the exchange,

  • Data Received: The amount of data received during the exchange,

  • Data Balance: The mean amount of data exchanged divided by the maximum amount of data exchanged,

  • Exchange Code: How large the variable is that holds the code for performing the exchange,

  • Source Variables: A truncated list of the variables from which data was sent in the exchange,

  • Destination Variables: A truncated list of the variables to which data was sent in the exchange.
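The Data Balance figure above is simply the mean divided by the maximum. As a sketch (the per-tile granularity of the input is an assumption; the report does not specify it):

```python
def data_balance(bytes_per_tile):
    """Mean amount of data exchanged divided by the maximum amount,
    so 1.0 means every tile moved the same amount of data."""
    mean = sum(bytes_per_tile) / len(bytes_per_tile)
    return mean / max(bytes_per_tile)

print(data_balance([400, 400, 400]))  # → 1.0 (perfectly balanced)
print(data_balance([100, 200, 600]))  # → 0.5 (mean 300 / max 600)
```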

GlobalExchange operations

GlobalExchange is the process by which data is exchanged between IPUs. In addition to the common information listed above, the following is displayed:

  • Total Data: The total amount of data transferred during the exchange,

  • Data Balance: The mean amount of data exchanged divided by the maximum amount of data exchanged,

  • Source Variables: A truncated list of the variables from which data was sent in the exchange (with temporary variables given basic integer names),

  • Destination Variables: A truncated list of the variables to which data was sent in the exchange (with temporary variables given basic integer names).

A tile’s physical location on an IPU, and how far away it is from the main exchange block, determines how quickly data can be moved between it and other tiles. Also, the highest-numbered tiles on an IPU are linked directly back to the lowest-numbered tiles in a ring-type topology. The combination of these two factors is what generates the typically triangular and curved shapes seen in these exchange graphs.

StreamCopy

This process copies data between tensors and streams, allowing data to be transferred between the IPUs and the host machine over PCI. The execution trace shows these program steps as three separate phases, StreamCopyBegin, the Copy itself (StreamCopyMid), and StreamCopyEnd. StreamCopyMid is further divided into Host, RemoteBuffer and Mixed categories to show the direction of data flow.

In addition to the common information listed above, the following is displayed:

  • Total Data: The total amount of data transferred during the exchange,

  • Data Balance: The mean amount of data exchanged divided by the maximum amount of data exchanged,

  • Copies from host: How many copy instructions transferred data from the host machine,

  • Copies to host: How many copy instructions transferred data to the host machine.

14.4. Lightweight Profiling

Lightweight profiling allows you to choose which steps in your program you want to profile, instead of profiling everything (which is the default when using Poplar’s fine-grained instrumentation). Lightweight profiling adds less overhead to your program when running, and also makes for smaller report profiles for PopVision to open.

See the Lightweight Profiling tutorial in the GitHub tutorials repository.

14.4.1. The Lightweight Profiling graph

The LWP report shows a graph indicating when LWP Blocks executed and how long they took, in cycles. Blocks are shown for each tile that was profiled, and the tiles are grouped by IPU. Blocks can overlap, in which case they stack up as a flame graph. There may also be periods of time that are not accounted for by any Block; these periods appear as empty gaps on the graph.

  • Move around the main report graph using your mouse, as described in Adjusting report size.

  • Hover your mouse pointer over a block to see a tooltip containing the block name and how many cycles that block executed for.

14.4.2. Lightweight Profiling block types

Several different block types are seen in the lightweight profiling graph.

Common

These blocks correspond to the portions of a Poplar application which are instrumented. For each, we know the precise cycle count when it began and finished. You can manually add Block programs when using Poplar, or use the Poplar Engine options to automatically add Blocks.

Stream Copy

Stream Copies copy data between tensors and streams, allowing data to be transferred between the IPUs and the host machine over PCI. The execution trace shows these program steps as three separate phases, StreamCopyBegin, the Copy itself (StreamCopyMid), and StreamCopyEnd. StreamCopyMid is further divided into Host, RemoteBuffer and Mixed categories to show the direction of data flow.

Buffer Flush (or Block Flush)

On profiled tiles, a small buffer is used to collect the start and finish times of blocks. This buffer needs to be periodically flushed otherwise it will overflow. Poplar automatically determines appropriate times to flush the buffer and inserts buffer flush operations. When the Show Buffer Flushes option is enabled, these operations can be seen as blocks on the LWP graph.

Overflow

If the LWP profiling buffer is not flushed regularly enough it can overflow. When this occurs, no more data is recorded until the next flush occurs. The time between the buffer overflowing and the following buffer flush is referred to as an overflow region. Any block which both begins and ends in an overflow region will not be recorded and is not shown on the LWP graph. For blocks which start before an overflow but finish in the overflow region, the end time of the block will be unknown. For blocks which start in an overflow region but finish after the buffer flush, both the start time and the type of the block are unknown - these are shown as Unknown blocks.
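The classification rules above can be sketched as a comparison of a block's start and end cycles against the overflow region. The half-open interval and the function itself are illustrative assumptions:

```python
def classify_block(start, end, overflow_start, flush):
    """Classify an LWP block against an overflow region
    [overflow_start, flush), per the rules described above."""
    started_in_overflow = overflow_start <= start < flush
    ended_in_overflow = overflow_start <= end < flush
    if started_in_overflow and ended_in_overflow:
        return "dropped"        # never recorded, absent from the graph
    if ended_in_overflow:
        return "end unknown"    # start was recorded, end was lost
    if started_in_overflow:
        return "Unknown block"  # only the end was recorded
    return "recorded"

print(classify_block(90, 150, 100, 200))   # → end unknown
print(classify_block(150, 250, 100, 200))  # → Unknown block
```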

Unknown

As described above, if a block starts in an overflow region then the type of the block is not recorded. We still know the block exists because we record its finish, so the block is displayed as an Unknown block.