11. Profiling
You can use Poplar to instrument your code to produce detailed compile-time information about the graph program and run-time information about execution, including how memory is used, and where memory and processor cycles are consumed.
A tutorial on how to profile code is available in the Graphcore tutorials repository. This information can be output to files and read programatically using PopVision Analysis Library (libpva), or viewed graphically with the PopVision Graph Analyser which can be found on the PopVision Tools page. You can also generate a summary of the profiling information for quick review. See Section 11.2, Profile summary.
The profiling information available depends on the target:
Target |
Memory Profiling |
Execution Profiling |
---|---|---|
IPU |
Exact |
Hardware measurement with limitations |
IPU Model |
Exact (optional) |
Detailed but based on estimates |
CPU |
None |
None |
The IPUModel::compileIPUCode
option, described below, can be used to
generate exact memory profiling information for an IPU Model.
Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.
11.1. Profiling options
There are some options for controlling profiling on IPU targets (hardware or simulator).
By default, profiling is disabled. The instrumentation of compute cycles and external exchange cycles can be enabled with the following options:
debug.instrument
Set to “true” enable instrumentation.debug.instrumentCompute
Set to “true” or “false” to enable or disable instrumentation of compute cycles.debug.instrumentExternalExchange
Set to “true” or “false” to enable or disable instrumentation of cycles used for exchanges between IPUs, or between the IPU and the host.
Note that there is no option to instrument internal exchanges because all internal exchange is statically scheduled.
If the instrumentation of compute is enabled, then the compute cycles counted
can be specified with the debug.computeInstrumentationLevel
option.
This can have the following values:
“tile” Store the cycle count for the last execution of each compute set on every tile (default).
“vertex” Store the cycle count for the last execution of each vertex on every tile.
“ipu” Store the cycle count for the last execution of each compute set on a single tile per IPU. This measures the execution time of the longest-running tile in the compute set. This saves memory compared to “tile” but loses all the per-tile cycle information.
These can be specified when the Engine
constructor is called. For example:
Engine engine(graph, prog,
OptionFlags({{"debug.instrumentCompute", "true"},
{"debug.instrumentExternalExchange", "false"},
{"debug.computeInstrumentationLevel", "tile"}}));
These options can also be defined in the environment variable
POPLAR_ENGINE_OPTIONS
. For example:
export POPLAR_ENGINE_OPTIONS='{"debug.instrument": "true",
"debug.computeInstrumentationLevel": "vertex"}'
For complete information about the available options, see the Poplar and PopLibs API Reference.
For IPU Model targets you can optionally tell Poplar to compile code for the
IPU (in addition to the CPU code that is actually executed by the model). If
this is not done, then the reported memory usage will not include memory used
for code. If this is enabled then the memory profile should give the same
results as an IPU target. This option is specified by setting the
compileIPUCode
members of the model, for example:
// Create the IPU Model device
IPUModel ipuModel;
ipuModel.compileIPUCode = true;
11.2. Profile summary
You can generate a readable summary of the information, which may be useful for a quick overview of the resource usage and performance of your program. This summary report can be printed by the program or generated from the profile data.
See Section 11.5, Summary report format for an example of the output generated.
11.2.1. Printing from a Poplar program
Note
To use the printProfileSummary()
function, you must set the Engine
option “debug.retainDebugInformation”.
For example:
Engine engine(graph, trainProg, OptionFlags{{"debug.retainDebugInformation", "true"}});
To print a summary of both the graph and execution profiling information you can call printProfileSummary(). This has options to control the information included and the use of colour.
Colours can be used to highlight different sections of the output, to make
it easier to understand. By default, colour is enabled when output is to a
supported terminal. By setting the colours
option to "true"
, colour
output will be generated even if outputting to a file or piping the output to
another program.
11.2.2. Command line conversion
A program is included with the Poplar SDK to convert the generated profile data
into a readable summary. This has the option --profile
to specify the profile file to use.
For example:
$ poplar_profile_summary --profile profile.pop
The following command line options can be used to control the summary output:
--show-execution-steps
: Show execution steps--show-optimizations
: Show optimisation information--show-per-ipu-memory-usage
: Report the memory usage for each IPU--show-var-storage
: Show liveness information for each program
11.3. Storage categories
Profiling reports show the “category” of data storage. These categories are defined in the table below.
Category |
Description |
---|---|
constant |
Constants added by the user. Variables added by the compiler that happen to be constant will be categorised as “variable”. |
controlCode |
Code for |
controlId |
Variables that are used in switch programs or variables that store a sync ID for tracking host/device synchronisation points. |
controlTable |
A table that lists the vertices to run in each compute set. Only used if the table scheduler is enabled. |
copyDescriptor |
Copy descriptors are special variable-sized fields used by copy vertices. |
globalExchangeCode |
Code for performing exchange operations between IPUs. |
globalExchangePacketHeader |
Packet headers for exchange operations between IPUs. |
globalMessage |
Message variables holding data being sent between IPUs. |
hostExchangeCode |
Code for performing exchange operations to and from the host. |
hostExchangePacketHeader |
Data used as packet headers for host exchange. |
hostMessage |
Message variables holding data being sent or received from the host. |
instrumentationResults |
Storage for profiling information. |
internalExchangeCode |
Code for performing internal exchanges. |
message |
Message data for internal exchanges. |
multiple |
Space shared by variables from multiple different categories. |
outputEdge |
Storage for output edge data before an exchange takes place. |
rearrangement |
Variables holding rearranged versions of tensor data. A rearranged variable will never be always-live as it is only required in the context of a specific compute set. |
sharedCodeStorage |
Code shared bey vertices. |
sharedDataStorage |
Data shared by vertices. |
stack |
The work and supervisor stacks allocated on the specified tile. For more information about worker stack allocation see the Vertex Assembly Programming Guide. |
variable |
Variables created in the program (for example, created by the Poplar
|
vectorListDescriptor |
The data for |
vertexCode |
Code for vertex functions (codelets). |
vertexFieldData |
Variable-sized fields, e.g. the data for |
vertexInstanceState |
An instance of a |
11.4. Variable liveness
The concept of “liveness” is important in order to understand the use of memory by variables through the lifetime of a program. If a variable is “live”, it means it holds a value that may need to be read in future. Therefore, that memory cannot be reused until the variable is no longer live.
Poplar does liveness analysis using a standard algorithm but with the exception that Poplar’s variables are not scalar values, they are arrays.
In the standard analysis a variable is “killed” when a new value is written to it. This means that it is “dead” immediately before that point because its value there can never be read.
int a = 1;
// a is dead here because its current value (1) can never be read.
a = 2; // a is killed here, which makes it dead on the line above.
In Poplar, on the other hand, a variable is killed when all of its elements are written in the same compute set. Consider the pseudo-code:
var = graph.addVariable(FLOAT, {2}, ...);
seq.add(Execute( var[0] = 1, var[1] = 2 ));
// var is dead here (it is killed on the line below) because none of its
// element values (1, 2) can ever be read.
seq.add(Execute( var[0] = 3, var[1] = 4 ));
If only some of the elements are written, then the entire variable is considered to be live before the write because we may still need the value of an element that was not written to.
seq.add(Execute( var[0] = 1, var[1] = 2 ));
// var is alive here because the value 2 might be read later.
seq.add(Execute( var[0] = 3 ));
Here, var
is still alive because no compute set writes to every element. If the
entire variable is overwritten but in separate compute sets, then it will
still be considered to be live because Poplar does not track the liveness
of each variable element, only the entire variable.
seq.add(Execute( var[0] = 1, var[1] = 2 ));
// var is alive here even though 1 and 2 can never be read.
seq.add(Execute( var[0] = 3 ));
seq.add(Execute( var[1] = 4 ));
This may mean that var
is considered to be alive longer than necessary,
which could lead to increased memory use. The WriteUndef
program can be used
to explicitly mark a variable as no longer alive.
For more information about liveness analysis see:
11.5. Summary report format
There are two environment variables that can also be used to control colour output. These are:
CLICOLOR
: If set to 0, then colour output will not be generated. This overrides thecolours
option value.CLICOLOR_FORCE
: If set to 1, then colour output will always be generated, even if output is not to the terminal. This overrides thecolours
option value and theCLICOLOR
environment variable.
If there are replicated graphs then the memory usage will only be shown for one replica, as it will be the same for all of them.
If any tiles are out of memory, the memory used a few tiles with the largest usage will be shown.
The output with showVarStorage
and showExecutionSteps
will be similar
to that shown below.
If any tiles are out of memory, the memory used by a few tiles with the largest usage will be shown.
Target:
Number of IPUs: 1
Tiles per IPU: 1,216
Total Tiles: 1,216
Memory Per-Tile: 256.0 kB
Total Memory: 304.0 MB
Clock Speed (approx): 1,600.0 MHz
Number of Replicas: 1
IPUs per Replica: 1
Tiles per Replica: 1,216
Memory per Replica: 304.0 MB
Graph:
Number of vertices: 5,564
Number of edges: 18,564
Number of variables: 47,973
Number of compute sets: 43
Memory on all IPUs:
Memory Usage:
Total:
Including Gaps: 67,725,024 B
Excluding Gaps:
By Memory Region:
Non-interleaved: 6,522,641 B
Interleaved: 197,696 B
Overflowed: 0 B
Total: 6,720,337 B
By Data Type:
Not Overlapped
Variables: 254,280 B
Internal Exchange Message Buffers: 109,788 B
Data Rearrangement Buffers: 96 B
Host Exchange Packet Headers: 10,720 B
Stack: 3,852,288 B
Vertex Instances: 65,768 B
Copy Descriptors: 16,161 B
VectorList Descriptors: 528 B
Vertex Field Data: 24,052 B
Control Code: 467,696 B
Vertex Code: 1,097,868 B
Internal Exchange Code: 215,364 B
Host Exchange Code: 211,132 B
Total: 6,325,741 B
Overlapped
Variables: 340,056 B
Internal Exchange Message Buffers: 697,128 B
Data Rearrangement Buffers: 15,424 B
Total: 1,052,608 B
Total After Overlapping: 394,596 B
Vertex Data (106,509B):
By Category:
Internal vertex state: 34,114 B
Edge pointers: 38,876 B
Copy pointers: 16,668 B
Padding: 346 B
Descriptors: 16,505 B
By Type:
poplin::ConvPartial1x1Out<float,float,true,false> 185,792 B
poplin::ReduceAdd<float,float> 117,728 B
poprand::SetSeedSupervisor 34,048 B
poplar_rt::Memcpy64BitSupervisor 31,468 B
poplar_rt::DstStridedCopyDA32 30,242 B
poplin::Transpose2d<float> 23,784 B
popops::ScaledAddSupervisor<float,float,true> 18,336 B
popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID> 13,024 B
popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID> 12,512 B
poplar_rt::StridedCopyDA32 9,043 B
popops::EncodeOneHot<unsigned int,float> 3,120 B
poplar_rt::DstStridedCopy64BitMultiAccess 2,677 B
popops::Reduce<popops::ReduceAdd,float,float,false,2> 1,632 B
poplar_rt::ShortMemcpy 1,324 B
popops::ScaledAdd2D<float,true> 1,164 B
popops::ScaledReduce<popops::ReduceAdd,float,float,true,1> 888 B
popnn::LossSumSquaredTransform<float> 816 B
poplar_rt::StridedCopy64BitMultiAccess 719 B
poplar_rt::DstStridedMemsetZero64Bit 416 B
popnn::ReduceMaxClassGather<float,unsigned int> 384 B
popops::Reduce<popops::ReduceAdd,float,float,false,1> 368 B
poplar_rt::MemsetZero 172 B
popops::Reduce<popops::ReduceAdd,float,float,false,3> 152 B
popnn::CalcAccuracy<unsigned int> 72 B
poplar_rt::MemsetZero64Bit 68 B
popops::ScaledReduce<popops::ReduceAdd,float,float,true,0> 56 B
By Tile (Excluding Gaps):
Range (KB) Histogram (Excluding Gaps) Count (tiles)
3 - 4 **************************************** 824
4 - 5 0
5 - 6 0
6 - 7 0
7 - 8 ******** 168
8 - 9 *** 44
9 - 10 ****** 126
10 - 11 ** 30
11 - 12 * 16
12 - 13 0
13 - 14 0
14 - 15 * 2
15 - 16 * 2
16 - 17 * 3
17 - 18 0
18 - 19 0
19 - 20 * 1
Maximum (Including Gaps): 131,608 (128.5 K) on tile 0
Maximum (Excluding Gaps): 20,294 (19.8 K) on tile 0
0 tile(s) out of memory
Variable Storage Liveness: All tiles
Always-live bytes: 5,976,425
Always-live variables:
<anon> 48
<const> 120
Layer1/Fwd/Conv_1/worklists 14,112
Layer3/Bwd/Conv_1/worklists 72
Layer3/Fwd/Conv_1/worklists 144
Layer3/Wu/ReduceFinalStage/IntermediateToOutput/numPartials 60
Layer5/Bwd/LossSumSquared/offset 112
Layer5/Bwd/LossSumSquared/reduce_loss/ReduceOnTile/InToIntermediateNoExchange/numPartials 32
Layer5/Bwd/LossSumSquared/sliceLen 128
ValuePadder/padding 16
controlCode 467,696
copyDescriptor 16,161
hostExchangeCode 211,132
hostExchangePacketHeader 10,720
internalExchangeCode 215,364
numCorrect 4
stack 3,852,288
vectorListDescriptor 528
vertexCode 1,097,868
vertexFieldData 24,052
vertexInstanceState 65,768
Maximum live bytes (including always-live): 6,556,737
Sequence
StreamCopy
Live Bytes (excluding always-live): 95,452
Live Vars (excluding always-live):
<anon> 12
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
StreamCopy
Live Bytes (excluding always-live): 95,456
Live Vars (excluding always-live):
<anon> 12
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
programId 4
DoExchange: switchControlBroadcast13/ExchangePre
Live Bytes (excluding always-live): 100,312
Live Vars (excluding always-live):
<anon> 8
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
broadcastProgramId 4,860
programId 4
Switch
Live Bytes (excluding always-live): 100,312
Live Vars (excluding always-live):
<anon> 8
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
broadcastProgramId 4,860
programId 4
Sequence
DoExchange: init/setMasterSeed/ExchangePre
Live Bytes (excluding always-live): 105,168
Live Vars (excluding always-live):
<anon> 8
<message:anon> 9,720
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
OnTileExecute: init/setMasterSeed
Live Bytes (excluding always-live): 105,168
Live Vars (excluding always-live):
<anon> 8
<message:anon> 9,720
Layer1/Fwd/biases 120
Layer1/Fwd/weights 94,080
Layer3/Fwd/biases 40
Layer3/Fwd/weights 1,200
...
Execution:
Total cycles: 54,202 (approx 33.9 microseconds)
Total compute cycles (including idle threads): 1,933,324
Total compute cycles (excluding idle threads): 1,769,242
Total IPU exchange cycles: 482,221
Total global exchange cycles: 0
Total host exchange cycles: 4,194,423
Total shared structure copy cycles: 0
Total sync cycles: 58,852,900
Total tile balance: 2.9%
Total thread balance: 91.5%
Cycles by vertex type:
poplin::ConvPartial1x1Out<float,float,true,false> (808 instances): 764,028
poplar_rt::DstStridedCopyDA32 (1393 instances): 509,173
poprand::SetSeedSupervisor (1216 instances): 177,536
poplin::ReduceAdd<float,float> (376 instances): 154,116
popops::ScaledAddSupervisor<float,float,true> (460 instances): 139,812
poplar_rt::StridedCopyDA32 (455 instances): 83,995
poplar_rt::Memcpy64BitSupervisor (203 instances): 59,048
popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID> (68 instances): 15,780
popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID> (68 instances): 14,628
poplar_rt::ShortMemcpy (73 instances): 5,256
poplin::Transpose2d<float> (99 instances): 4,770
popops::EncodeOneHot<unsigned int,float> (20 instances): 4,408
popops::Reduce<popops::ReduceAdd,float,float,false,2> (112 instances): 3,584
poplar_rt::DstStridedCopy64BitMultiAccess (41 instances): 2,437
popops::ScaledReduce<popops::ReduceAdd,float,float,true,1> (32 instances): 1,600
poplar_rt::StridedCopy64BitMultiAccess (33 instances): 1,388
popops::ScaledAdd2D<float,true> (39 instances): 1,371
popnn::ReduceMaxClassGather<float,unsigned int> (16 instances): 1,056
popops::Reduce<popops::ReduceAdd,float,float,false,1> (16 instances): 912
popnn::LossSumSquaredTransform<float> (8 instances): 792
poplar_rt::DstStridedMemsetZero64Bit (18 instances): 740
popops::Reduce<popops::ReduceAdd,float,float,false,3> (4 instances): 144
popops::ScaledReduce<popops::ReduceAdd,float,float,true,0> (2 instances): 100
popnn::CalcAccuracy<unsigned int> (1 instances): 78
poplar_rt::MemsetZero64Bit (2 instances): 26
poplar_rt::MemsetZero (1 instances): 15
Steps
--- External Sync ---
StreamCopy (cycles 118 - 118)
Cycles: 7
Active Tiles: 1 / 1,216
Tile Balance: 0.1%
Active Tile Balance: 100.0%
Total Data: 2
Data Balance (mean / max data per tile): 0.1%
--- Internal Sync ---
--- Internal Sync ---
--- External Sync ---
StreamCopy (cycles 472 - 472)
Cycles: 9
Active Tiles: 1 / 1,216
Tile Balance: 0.1%
Active Tile Balance: 100.0%
Total Data: 4
Data Balance (mean / max data per tile): 0.1%
--- Internal Sync ---
--- Internal Sync ---
--- Internal Sync ---
DoExchange (cycles 826 - 946): switchControlBroadcast13/ExchangePre
Cycles: 120 (0 overlapped with previous)
Active Tiles: 1,216 / 1,216
Tile Balance: 86.6%
Active Tile Balance: 86.6%
Total Data: 2,432
Data Balance (mean / max data per tile): 100.0%
--- External Sync ---
StreamCopy (cycles 1,064 - 1,064)
Cycles: 7
Active Tiles: 1 / 1,216
Tile Balance: 0.1%
Active Tile Balance: 100.0%
Total Data: 2
Data Balance (mean / max data per tile): 0.1%
--- Internal Sync ---
--- Internal Sync ---
--- External Sync ---
StreamCopy (cycles 1,418 - 12,098)
Cycles: 10,700
Active Tiles: 392 / 1,216
Tile Balance: 32.2%
Active Tile Balance: 100.0%
Total Data: 25,122
Data Balance (mean / max data per tile): 21.1%
--- Internal Sync ---
--- Internal Sync ---
--- Internal Sync ---
DoExchange (cycles 12,432 - 12,581): Layer1/Fwd/Conv_1/Convolve/ExchangePre
Cycles: 169 (20 overlapped with previous)
Active Tiles: 392 / 1,216
Tile Balance: 27.9%
Active Tile Balance: 86.6%
Total Data: 125,440
Data Balance (mean / max data per tile): 32.2%
OnTileExecute (cycles 12,581 - 13,580): Layer1/Fwd/Conv_1/Convolve
Cycles: 999 (0 overlapped with previous)
Active Tiles: 392 / 1,216
Thread Balance: 100.0%
Tile Balance: 31.7%
Active Tile Balance: 98.3%
By vertex type:
poplin::ConvPartial1x1Out<float,float,true,false> (392 instances): 384,944
....