5. Profiling
You can use Poplar to instrument your code to produce detailed compile-time information about the graph program and run-time information about execution, including how memory is used, and where memory and processor cycles are consumed.
This information can be output to files and viewed with the PopVision Graph Analyser. This provides a graphical view of the profile data and includes full documentation as context-sensitive help. See the document Poplar Profile Data for a detailed description of the contents of these files.
You can also generate a summary of the profiling information for quick review. See Profile summary.
The profiling information available depends on the target:
Target |
Memory Profiling |
Execution Profiling |
---|---|---|
IPU |
Exact |
Hardware measurement with limitations |
IPU Model |
Exact (optional) |
Detailed but based on estimates |
CPU |
None |
None |
The IPUModel::compileIPUCode
option, described below, can be used to
generate exact memory profiling information for an IPU Model.
Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.
5.1. Generating profiling information
After you have loaded your Graph
into an Engine
, you can get static
profile information about the graph and the resources required. This includes
cycle estimates (for an IPU Model) and memory information.
ProfileValue graphProfile = engine.getGraphProfile();
After you have run the program one or more times you can get dynamic profiling information (what code was run, cycle counts, and so on).
ProfileValue executionProfile = engine.getExecutionProfile();
You can save the profiling information to files for use by the Graph Analyser. For example:
poplar::serializeToJSON(graphFile, graphProfile, true);
poplar::serializeToJSON(executionFile, executionProfile, true);
The last parameter of serializeToJSON()
is optional and controls whether or
not to pretty print the data.
5.1.1. Profiling options
There are some options for controlling profiling on IPU targets (hardware or simulator).
By default, profiling is disabled. The instrumentation of compute cycles and external exchange cycles can be enabled with the following options:
debug.instrument
Set to “true” enable instrumentation.debug.instrumentCompute
Set to “true” or “false” to enable or disable instrumentation of compute cycles.debug.instrumentExternalExchange
Set to “true” or “false” to enable or disable instrumentation of cycles used for exchanges between IPUs, or between the IPU and the host.
Note that there is no option to instrument internal exchanges because all internal exchange is statically scheduled.
If the instrumentation of compute is enabled, then the compute cycles counted
can be specified with the debug.computeInstrumentationLevel
option.
This can have the following values:
“tile” Store the cycle count for the last execution of each compute set on every tile (default).
“vertex” Store the cycle count for the last execution of each vertex on every tile.
“ipu” Store the cycle count for the last execution of each compute set on a single tile per IPU. This measures the execution time of the longest-running tile in the compute set. This saves memory compared to “tile” but loses all the per-tile cycle information.
These can be specified when the Engine
constructor is called. For example:
Engine engine(graph, prog,
OptionFlags({{"debug.instrumentCompute", "true"},
{"debug.instrumentExternalExchange", "false"},
{"debug.computeInstrumentationLevel", "tile"}}));
These options can also be defined in the environment variable
POPLAR_ENGINE_OPTIONS
. For example:
export POPLAR_ENGINE_OPTIONS='{"debug.instrument": "true",
"debug.computeInstrumentationLevel": "vertex"}'
For complete information about the available options, see the Poplar and PopLibs API Reference.
For IPU Model targets you can optionally tell Poplar to compile code for the
IPU (in addition to the CPU code that is actually executed by the model). If
this is not done, then the reported memory usage will not include memory used
for code. If this is enabled then the memory profile should give the same
results as an IPU target. This option is specified by setting the
compileIPUCode
members of the model, for example:
// Create the IPU Model device
IPUModel ipuModel;
ipuModel.compileIPUCode = true;
5.2. Profile summary
You can generate a readable summary of the information, which may be useful for a quick overview of the resource usage and performance of your program. This summary report can be printed by the program or generated from the profile data.
5.2.1. Printing from a Poplar program
To print a summary of both the graph and execution profiling information you can call the function:
poplar::printProfileSummary(std::cout, graphProfile,
executionProfile, options);
You can also print the graph and execution summaries separately by calling
printGraphSummary
or printExecutionSummary
(see the
Poplar and PopLibs API Reference for more information).
The following options are available:
colours
: Control the use of colours in the summary outputshowVarStorage
: Show liveness information for each programshowOptimizations
: Show compile optimisation detailsshowExecutionSteps
: Show the execution stepsshowPerIpuMemoryUsage
: Show memory usage for each IPU
Colours can be used to highlight different sections of the output, to make
it easier to understand. By default, colour is enabled when output is to a
supported terminal. By setting the colours
option to "true"
, colour
output will be generated even if outputting to a file or piping the output to
another program.
5.2.2. Command line conversion
A program is included with the Poplar SDK to convert the generated profile data
into a readable summary. This has options --graph-profile
and
--execution-profile
to specify the graph and execution profile files to use.
For example:
$ poplar_profile_summary --graph-profile graph_profile.json
The following command line options can be used to control the summary output:
--show-execution-steps
: Show execution steps--show-optimizations
: Show optimisation information--show-per-ipu-memory-usage
: Report the memory usage for each IPU--show-var-storage
: Show liveness information for each program
5.2.3. Summary report format
There are two environment variables that can also be used to control colour output. These are:
CLICOLOR
: If set to 0, then colour output will not be generated. This overrides thecolours
option value.CLICOLOR_FORCE
: If set to 1, then colour output will always be generated, even if output is not to the terminal. This overrides thecolours
option value and theCLICOLOR
environment variable.
If there are replicated graphs then the memory usage will only be shown for one replica, as it will be the same for all of them.
If any tiles are out of memory, the memory used a few tiles with the largest usage will be shown.
The output with showVarStorage
and showExecutionSteps
will be similar
to that shown below.
If any tiles are out of memory, the memory used by a few tiles with the largest usage will be shown.
Target:
Number of IPUs: 1
Tiles per IPU: 1,216
Total Tiles: 1,216
Memory Per-Tile: 256.0 kB
Total Memory: 304.0 MB
Clock Speed (approx): 1,600.0 MHz
Number of Replicas: 1
IPUs per Replica: 1
Tiles per Replica: 1,216
Memory per Replica: 304.0 MB
Graph:
Number of vertices: 5,564
Number of edges: 18,564
Number of variables: 47,973
Number of compute sets: 43
Memory on all IPUs:
Memory Usage:
Total:
Including Gaps: 67,725,024 B
Excluding Gaps:
By Memory Region:
Non-interleaved: 6,522,641 B
Interleaved: 197,696 B
Overflowed: 0 B
Total: 6,720,337 B
By Data Type:
Not Overlapped
Variables: 254,280 B
Internal Exchange Message Buffers: 109,788 B
Data Rearrangement Buffers: 96 B
Host Exchange Packet Headers: 10,720 B
Stack: 3,852,288 B
Vertex Instances: 65,768 B
Copy Descriptors: 16,161 B
VectorList Descriptors: 528 B
Vertex Field Data: 24,052 B
Control Code: 467,696 B
Vertex Code: 1,097,868 B
Internal Exchange Code: 215,364 B
Host Exchange Code: 211,132 B
Total: 6,325,741 B
Overlapped
Variables: 340,056 B
Internal Exchange Message Buffers: 697,128 B
Data Rearrangement Buffers: 15,424 B
Total: 1,052,608 B
Total After Overlapping: 394,596 B
Vertex Data (106,509B):
By Category:
Internal vertex state: 34,114 B
Edge pointers: 38,876 B
Copy pointers: 16,668 B
Padding: 346 B
Descriptors: 16,505 B
By Type:
poplin::ConvPartial1x1Out<float,float,true,false> 185,792 B
poplin::ReduceAdd<float,float> 117,728 B
poprand::SetSeedSupervisor 34,048 B
poplar_rt::Memcpy64BitSupervisor 31,468 B
poplar_rt::DstStridedCopyDA32 30,242 B
poplin::Transpose2d<float> 23,784 B
popops::ScaledAddSupervisor<float,float,true> 18,336 B
popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID> 13,024 B
popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID> 12,512 B
poplar_rt::StridedCopyDA32 9,043 B
popops::EncodeOneHot<unsigned int,float> 3,120 B
poplar_rt::DstStridedCopy64BitMultiAccess 2,677 B
popops::Reduce<popops::ReduceAdd,float,float,false,2> 1,632 B
poplar_rt::ShortMemcpy 1,324 B
popops::ScaledAdd2D<float,true> 1,164 B
popops::ScaledReduce<popops::ReduceAdd,float,float,true,1> 888 B
popnn::LossSumSquaredTransform<float> 816 B
poplar_rt::StridedCopy64BitMultiAccess 719 B
poplar_rt::DstStridedMemsetZero64Bit 416 B
popnn::ReduceMaxClassGather<float,unsigned int> 384 B
popops::Reduce<popops::ReduceAdd,float,float,false,1> 368 B
poplar_rt::MemsetZero 172 B
popops::Reduce<popops::ReduceAdd,float,float,false,3> 152 B
popnn::CalcAccuracy<unsigned int> 72 B
poplar_rt::MemsetZero64Bit 68 B
popops::ScaledReduce<popops::ReduceAdd,float,float,true,0> 56 B
By Tile (Excluding Gaps):
Range (KB) Histogram (Excluding Gaps) Count (tiles)
3 - 4 **************************************** 824
4 - 5 0
5 - 6 0
6 - 7 0
7 - 8 ******** 168
8 - 9 *** 44
9 - 10 ****** 126
10 - 11 ** 30
11 - 12 * 16
12 - 13 0
13 - 14 0
14 - 15 * 2
15 - 16 * 2
16 - 17 * 3
17 - 18 0
18 - 19 0
19 - 20 * 1
Maximum (Including Gaps): 131,608 (128.5 K) on tile 0
Maximum (Excluding Gaps): 20,294 (19.8 K) on tile 0
0 tile(s) out of memory
Variable Storage Liveness: All tiles
Always-live bytes: 5,976,425
Always-live variables:
<anon> 48
<const> 120
Layer1/Fwd/Conv_1/worklists 14,112
Layer3/Bwd/Conv_1/worklists 72
Layer3/Fwd/Conv_1/worklists 144
Layer3/Wu/ReduceFinalStage/IntermediateToOutput/numPartials 60
Layer5/Bwd/LossSumSquared/offset 112
Layer5/Bwd/LossSumSquared/reduce_loss/ReduceOnTile/InToIntermediateNoExchange/numPartials 32
Layer5/Bwd/LossSumSquared/sliceLen 128
ValuePadder/padding 16
controlCode 467,696
copyDescriptor 16,161
hostExchangeCode 211,132
hostExchangePacketHeader 10,720
internalExchangeCode 215,364
numCorrect 4
stack 3,852,288
vectorListDescriptor 528
vertexCode 1,097,868
vertexFieldData 24,052
vertexInstanceState 65,768
Maximum live bytes (including always-live): 6,556,737
Sequence
StreamCopy
Live Bytes (excluding always-live): 95,452
Live Vars (excluding always-live):
<anon> 12
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
StreamCopy
Live Bytes (excluding always-live): 95,456
Live Vars (excluding always-live):
<anon> 12
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
programId 4
DoExchange: switchControlBroadcast13/ExchangePre
Live Bytes (excluding always-live): 100,312
Live Vars (excluding always-live):
<anon> 8
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
broadcastProgramId 4,860
programId 4
Switch
Live Bytes (excluding always-live): 100,312
Live Vars (excluding always-live):
<anon> 8
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
broadcastProgramId 4,860
programId 4
Sequence
DoExchange: init/setMasterSeed/ExchangePre
Live Bytes (excluding always-live): 105,168
Live Vars (excluding always-live):
<anon> 8
<message:anon> 9,720
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
OnTileExecute: init/setMasterSeed
Live Bytes (excluding always-live): 105,168
Live Vars (excluding always-live):
<anon> 8
<message:anon> 9,720
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
...
Execution:
Total cycles: 54,202 (approx 33.9 microseconds)
Total compute cycles (including idle threads): 1,933,324
Total compute cycles (excluding idle threads): 1,769,242
Total IPU exchange cycles: 482,221
Total global exchange cycles: 0
Total host exchange cycles: 4,194,423
Total shared structure copy cycles: 0
Total sync cycles: 58,852,900
Total tile balance: 2.9%
Total thread balance: 91.5%
Cycles by vertex type:
poplin::ConvPartial1x1Out<float,float,true,false> (808 instances): 764,028
poplar_rt::DstStridedCopyDA32 (1393 instances): 509,173
poprand::SetSeedSupervisor (1216 instances): 177,536
poplin::ReduceAdd<float,float> (376 instances): 154,116
popops::ScaledAddSupervisor<float,float,true> (460 instances): 139,812
poplar_rt::StridedCopyDA32 (455 instances): 83,995
poplar_rt::Memcpy64BitSupervisor (203 instances): 59,048
popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID> (68 instances): 15,780
popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID> (68 instances): 14,628
poplar_rt::ShortMemcpy (73 instances): 5,256
poplin::Transpose2d<float> (99 instances): 4,770
popops::EncodeOneHot<unsigned int,float> (20 instances): 4,408
popops::Reduce<popops::ReduceAdd,float,float,false,2> (112 instances): 3,584
poplar_rt::DstStridedCopy64BitMultiAccess (41 instances): 2,437
popops::ScaledReduce<popops::ReduceAdd,float,float,true,1> (32 instances): 1,600
poplar_rt::StridedCopy64BitMultiAccess (33 instances): 1,388
popops::ScaledAdd2D<float,true> (39 instances): 1,371
popnn::ReduceMaxClassGather<float,unsigned int> (16 instances): 1,056
popops::Reduce<popops::ReduceAdd,float,float,false,1> (16 instances): 912
popnn::LossSumSquaredTransform<float> (8 instances): 792
poplar_rt::DstStridedMemsetZero64Bit (18 instances): 740
popops::Reduce<popops::ReduceAdd,float,float,false,3> (4 instances): 144
popops::ScaledReduce<popops::ReduceAdd,float,float,true,0> (2 instances): 100
popnn::CalcAccuracy<unsigned int> (1 instances): 78
poplar_rt::MemsetZero64Bit (2 instances): 26
poplar_rt::MemsetZero (1 instances): 15
Steps
--- External Sync ---
StreamCopy (cycles 118 - 118)
Cycles: 7
Active Tiles: 1 / 1,216
Tile Balance: 0.1%
Active Tile Balance: 100.0%
Total Data: 2
Data Balance (mean / max data per tile): 0.1%
--- Internal Sync ---
--- Internal Sync ---
--- External Sync ---
StreamCopy (cycles 472 - 472)
Cycles: 9
Active Tiles: 1 / 1,216
Tile Balance: 0.1%
Active Tile Balance: 100.0%
Total Data: 4
Data Balance (mean / max data per tile): 0.1%
--- Internal Sync ---
--- Internal Sync ---
--- Internal Sync ---
DoExchange (cycles 826 - 946): switchControlBroadcast13/ExchangePre
Cycles: 120 (0 overlapped with previous)
Active Tiles: 1,216 / 1,216
Tile Balance: 86.6%
Active Tile Balance: 86.6%
Total Data: 2,432
Data Balance (mean / max data per tile): 100.0%
--- External Sync ---
StreamCopy (cycles 1,064 - 1,064)
Cycles: 7
Active Tiles: 1 / 1,216
Tile Balance: 0.1%
Active Tile Balance: 100.0%
Total Data: 2
Data Balance (mean / max data per tile): 0.1%
--- Internal Sync ---
--- Internal Sync ---
--- External Sync ---
StreamCopy (cycles 1,418 - 12,098)
Cycles: 10,700
Active Tiles: 392 / 1,216
Tile Balance: 32.2%
Active Tile Balance: 100.0%
Total Data: 25,122
Data Balance (mean / max data per tile): 21.1%
--- Internal Sync ---
--- Internal Sync ---
--- Internal Sync ---
DoExchange (cycles 12,432 - 12,581): Layer1/Fwd/Conv_1/Convolve/ExchangePre
Cycles: 169 (20 overlapped with previous)
Active Tiles: 392 / 1,216
Tile Balance: 27.9%
Active Tile Balance: 86.6%
Total Data: 125,440
Data Balance (mean / max data per tile): 32.2%
OnTileExecute (cycles 12,581 - 13,580): Layer1/Fwd/Conv_1/Convolve
Cycles: 999 (0 overlapped with previous)
Active Tiles: 392 / 1,216
Thread Balance: 100.0%
Tile Balance: 31.7%
Active Tile Balance: 98.3%
By vertex type:
poplin::ConvPartial1x1Out<float,float,true,false> (392 instances): 384,944
....