11. Profiling

You can use Poplar to instrument your code to produce detailed compile-time information about the graph program and run-time information about execution, including how memory is used, and where memory and processor cycles are consumed.

A tutorial on how to profile code is available in the Graphcore tutorials repository. This information can be output to files and read programatically using PopVision Analysis Library (libpva), or viewed graphically with the PopVision Graph Analyser which can be found on the PopVision Tools page. You can also generate a summary of the profiling information for quick review. See Section 11.2, Profile summary.

The profiling information available depends on the target:

Target	Memory Profiling	Execution Profiling
IPU	Exact	Hardware measurement with limitations
IPU Model	Exact (optional)	Detailed but based on estimates
CPU	None	None

The IPUModel::compileIPUCode option, described below, can be used to generate exact memory profiling information for an IPU Model.

Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.

11.1. Profiling options

There are some options for controlling profiling on IPU targets (hardware or simulator).

By default, profiling is disabled. The instrumentation of compute cycles and external exchange cycles can be enabled with the following options:

debug.instrument Set to “true” enable instrumentation.
debug.instrumentCompute Set to “true” or “false” to enable or disable instrumentation of compute cycles.
debug.instrumentExternalExchange Set to “true” or “false” to enable or disable instrumentation of cycles used for exchanges between IPUs, or between the IPU and the host.

Note that there is no option to instrument internal exchanges because all internal exchange is statically scheduled.

If the instrumentation of compute is enabled, then the compute cycles counted can be specified with the debug.computeInstrumentationLevel option. This can have the following values:

“tile” Store the cycle count for the last execution of each compute set on every tile (default).
“vertex” Store the cycle count for the last execution of each vertex on every tile.
“ipu” Store the cycle count for the last execution of each compute set on a single tile per IPU. This measures the execution time of the longest-running tile in the compute set. This saves memory compared to “tile” but loses all the per-tile cycle information.

These can be specified when the Engine constructor is called. For example:

Engine engine(graph, prog,
              OptionFlags({{"debug.instrumentCompute", "true"},
                           {"debug.instrumentExternalExchange", "false"},
                           {"debug.computeInstrumentationLevel", "tile"}}));

These options can also be defined in the environment variable POPLAR_ENGINE_OPTIONS. For example:

export POPLAR_ENGINE_OPTIONS='{"debug.instrument": "true",
                               "debug.computeInstrumentationLevel": "vertex"}'

For complete information about the available options, see the Poplar and PopLibs API Reference.

For IPU Model targets you can optionally tell Poplar to compile code for the IPU (in addition to the CPU code that is actually executed by the model). If this is not done, then the reported memory usage will not include memory used for code. If this is enabled then the memory profile should give the same results as an IPU target. This option is specified by setting the compileIPUCode members of the model, for example:

// Create the IPU Model device
IPUModel ipuModel;
ipuModel.compileIPUCode = true;

11.2. Profile summary

You can generate a readable summary of the information, which may be useful for a quick overview of the resource usage and performance of your program. This summary report can be printed by the program or generated from the profile data.

See Section 11.5, Summary report format for an example of the output generated.

11.2.1. Printing from a Poplar program

Note

To use the printProfileSummary() function, you must set the Engine option “debug.retainDebugInformation”.

For example:

Engine engine(graph, trainProg, OptionFlags{{"debug.retainDebugInformation", "true"}});

To print a summary of both the graph and execution profiling information you can call printProfileSummary(). This has options to control the information included and the use of colour.

Colours can be used to highlight different sections of the output, to make it easier to understand. By default, colour is enabled when output is to a supported terminal. By setting the colours option to "true", colour output will be generated even if outputting to a file or piping the output to another program.

11.2.2. Command line conversion

A program is included with the Poplar SDK to convert the generated profile data into a readable summary. This has the option --profile to specify the profile file to use. For example:

$ poplar_profile_summary --profile profile.pop

The following command line options can be used to control the summary output:

--show-execution-steps: Show execution steps
--show-optimizations: Show optimisation information
--show-per-ipu-memory-usage: Report the memory usage for each IPU
--show-var-storage: Show liveness information for each program

11.3. Storage categories

Profiling reports show the “category” of data storage. These categories are defined in the table below.

Table 11.1 Storage categories
Category	Description
constant	Constants added by the user. Variables added by the compiler that happen to be constant will be categorised as “variable”.
controlCode	Code for `Program` objects and running compute sets.
controlId	Variables that are used in switch programs or variables that store a sync ID for tracking host/device synchronisation points.
controlTable	A table that lists the vertices to run in each compute set. Only used if the table scheduler is enabled.
copyDescriptor	Copy descriptors are special variable-sized fields used by copy vertices.
globalExchangeCode	Code for performing exchange operations between IPUs.
globalExchangePacketHeader	Packet headers for exchange operations between IPUs.
globalMessage	Message variables holding data being sent between IPUs.
hostExchangeCode	Code for performing exchange operations to and from the host.
hostExchangePacketHeader	Data used as packet headers for host exchange.
hostMessage	Message variables holding data being sent or received from the host.
instrumentationResults	Storage for profiling information.
internalExchangeCode	Code for performing internal exchanges.
message	Message data for internal exchanges.
multiple	Space shared by variables from multiple different categories.
outputEdge	Storage for output edge data before an exchange takes place.
rearrangement	Variables holding rearranged versions of tensor data. A rearranged variable will never be always-live as it is only required in the context of a specific compute set.
sharedCodeStorage	Code shared bey vertices.
sharedDataStorage	Data shared by vertices.
stack	The work and supervisor stacks allocated on the specified tile. For more information about worker stack allocation see the Vertex Assembly Programming Guide.
variable	Variables created in the program (for example, created by the Poplar `graph.addVariable()` function).
vectorListDescriptor	The data for `VectorList<Input<...>, DeltaN>` fields.
vertexCode	Code for vertex functions (codelets).
vertexFieldData	Variable-sized fields, e.g. the data for `Vector<float>`, `Vector<Input<...>>` and `InputSet<...>` fields.
vertexInstanceState	An instance of a `Vertex` class object. This will be `sizeof(VertexName)` for each vertex.

11.4. Variable liveness

The concept of “liveness” is important in order to understand the use of memory by variables through the lifetime of a program. If a variable is “live”, it means it holds a value that may need to be read in future. Therefore, that memory cannot be reused until the variable is no longer live.

Poplar does liveness analysis using a standard algorithm but with the exception that Poplar’s variables are not scalar values, they are arrays.

In the standard analysis a variable is “killed” when a new value is written to it. This means that it is “dead” immediately before that point because its value there can never be read.

int a = 1;
// a is dead here because its current value (1) can never be read.
a = 2; // a is killed here, which makes it dead on the line above.

In Poplar, on the other hand, a variable is killed when all of its elements are written in the same compute set. Consider the pseudo-code:

var = graph.addVariable(FLOAT, {2}, ...);

seq.add(Execute( var[0] = 1, var[1] = 2 ));
// var is dead here (it is killed on the line below) because none of its
// element values (1, 2) can ever be read.
seq.add(Execute( var[0] = 3, var[1] = 4 ));

If only some of the elements are written, then the entire variable is considered to be live before the write because we may still need the value of an element that was not written to.

seq.add(Execute( var[0] = 1, var[1] = 2 ));
// var is alive here because the value 2 might be read later.
seq.add(Execute( var[0] = 3 ));

Here, var is still alive because no compute set writes to every element. If the entire variable is overwritten but in separate compute sets, then it will still be considered to be live because Poplar does not track the liveness of each variable element, only the entire variable.

seq.add(Execute( var[0] = 1, var[1] = 2 ));
// var is alive here even though 1 and 2 can never be read.
seq.add(Execute( var[0] = 3 ));
seq.add(Execute( var[1] = 4 ));

This may mean that var is considered to be alive longer than necessary, which could lead to increased memory use. The WriteUndef program can be used to explicitly mark a variable as no longer alive.

For more information about liveness analysis see:

11.5. Summary report format

There are two environment variables that can also be used to control colour output. These are:

CLICOLOR: If set to 0, then colour output will not be generated. This overrides the colours option value.
CLICOLOR_FORCE: If set to 1, then colour output will always be generated, even if output is not to the terminal. This overrides the colours option value and the CLICOLOR environment variable.

If there are replicated graphs then the memory usage will only be shown for one replica, as it will be the same for all of them.

If any tiles are out of memory, the memory used a few tiles with the largest usage will be shown.

The output with showVarStorage and showExecutionSteps will be similar to that shown below.

If any tiles are out of memory, the memory used by a few tiles with the largest usage will be shown.

Target:
  Number of IPUs:         1
  Tiles per IPU:          1,216
  Total Tiles:            1,216
  Memory Per-Tile:        256.0 kB
  Total Memory:           304.0 MB
  Clock Speed (approx):   1,600.0 MHz
  Number of Replicas:     1
  IPUs per Replica:       1
  Tiles per Replica:      1,216
  Memory per Replica:     304.0 MB

Graph:
  Number of vertices:            5,564
  Number of edges:              18,564
  Number of variables:          47,973
  Number of compute sets:           43

Memory on all IPUs:
Memory Usage:
  Total:
    Including Gaps:         67,725,024 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:     6,522,641 B
        Interleaved:           197,696 B
        Overflowed:                  0 B
      Total:                 6,720,337 B
      By Data Type:
        Not Overlapped
            Variables:                              254,280 B
            Internal Exchange Message Buffers:      109,788 B
            Data Rearrangement Buffers:                  96 B
            Host Exchange Packet Headers:            10,720 B
            Stack:                                3,852,288 B
            Vertex Instances:                        65,768 B
            Copy Descriptors:                        16,161 B
            VectorList Descriptors:                     528 B
            Vertex Field Data:                       24,052 B
            Control Code:                           467,696 B
            Vertex Code:                          1,097,868 B
            Internal Exchange Code:                 215,364 B
            Host Exchange Code:                     211,132 B
          Total:                                  6,325,741 B
        Overlapped
            Variables:                              340,056 B
            Internal Exchange Message Buffers:      697,128 B
            Data Rearrangement Buffers:              15,424 B
          Total:                                  1,052,608 B
          Total After Overlapping:                  394,596 B
      Vertex Data (106,509B):
        By Category:
          Internal vertex state:         34,114 B
          Edge pointers:                 38,876 B
          Copy pointers:                 16,668 B
          Padding:                          346 B
          Descriptors:                   16,505 B
        By Type:
          poplin::ConvPartial1x1Out<float,float,true,false>                                185,792 B
          poplin::ReduceAdd<float,float>                                                   117,728 B
          poprand::SetSeedSupervisor                                                        34,048 B
          poplar_rt::Memcpy64BitSupervisor                                                  31,468 B
          poplar_rt::DstStridedCopyDA32                                                     30,242 B
          poplin::Transpose2d<float>                                                        23,784 B
          popops::ScaledAddSupervisor<float,float,true>                                     18,336 B
          popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID>         13,024 B
          popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID>             12,512 B
          poplar_rt::StridedCopyDA32                                                         9,043 B
          popops::EncodeOneHot<unsigned int,float>                                           3,120 B
          poplar_rt::DstStridedCopy64BitMultiAccess                                          2,677 B
          popops::Reduce<popops::ReduceAdd,float,float,false,2>                              1,632 B
          poplar_rt::ShortMemcpy                                                             1,324 B
          popops::ScaledAdd2D<float,true>                                                    1,164 B
          popops::ScaledReduce<popops::ReduceAdd,float,float,true,1>                           888 B
          popnn::LossSumSquaredTransform<float>                                                816 B
          poplar_rt::StridedCopy64BitMultiAccess                                               719 B
          poplar_rt::DstStridedMemsetZero64Bit                                                 416 B
          popnn::ReduceMaxClassGather<float,unsigned int>                                      384 B
          popops::Reduce<popops::ReduceAdd,float,float,false,1>                                368 B
          poplar_rt::MemsetZero                                                                172 B
          popops::Reduce<popops::ReduceAdd,float,float,false,3>                                152 B
          popnn::CalcAccuracy<unsigned int>                                                     72 B
          poplar_rt::MemsetZero64Bit                                                            68 B
          popops::ScaledReduce<popops::ReduceAdd,float,float,true,0>                            56 B


  By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
        3 -  4 ****************************************    824
        4 -  5                                               0
        5 -  6                                               0
        6 -  7                                               0
        7 -  8 ********                                    168
        8 -  9 ***                                          44
        9 - 10 ******                                      126
      10 - 11 **                                           30
      11 - 12 *                                            16
      12 - 13                                               0
      13 - 14                                               0
      14 - 15 *                                             2
      15 - 16 *                                             2
      16 - 17 *                                             3
      17 - 18                                               0
      18 - 19                                               0
      19 - 20 *                                             1

    Maximum (Including Gaps): 131,608 (128.5 K) on tile 0
    Maximum (Excluding Gaps): 20,294 (19.8 K) on tile 0
    0 tile(s) out of memory



  Variable Storage Liveness: All tiles

    Always-live bytes: 5,976,425
    Always-live variables:
      <anon>                                                                                               48
      <const>                                                                                             120
      Layer1/Fwd/Conv_1/worklists                                                                      14,112
      Layer3/Bwd/Conv_1/worklists                                                                          72
      Layer3/Fwd/Conv_1/worklists                                                                         144
      Layer3/Wu/ReduceFinalStage/IntermediateToOutput/numPartials                                          60
      Layer5/Bwd/LossSumSquared/offset                                                                    112
      Layer5/Bwd/LossSumSquared/reduce_loss/ReduceOnTile/InToIntermediateNoExchange/numPartials            32
      Layer5/Bwd/LossSumSquared/sliceLen                                                                  128
      ValuePadder/padding                                                                                  16
      controlCode                                                                                     467,696
      copyDescriptor                                                                                   16,161
      hostExchangeCode                                                                                211,132
      hostExchangePacketHeader                                                                         10,720
      internalExchangeCode                                                                            215,364
      numCorrect                                                                                            4
      stack                                                                                         3,852,288
      vectorListDescriptor                                                                                528
      vertexCode                                                                                    1,097,868
      vertexFieldData                                                                                  24,052
      vertexInstanceState                                                                              65,768

    Maximum live bytes (including always-live): 6,556,737

    Sequence
      StreamCopy
        Live Bytes (excluding always-live): 95,452
        Live Vars (excluding always-live):
          <anon>                        12
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
      StreamCopy
        Live Bytes (excluding always-live): 95,456
        Live Vars (excluding always-live):
          <anon>                        12
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          programId                      4
      DoExchange: switchControlBroadcast13/ExchangePre
        Live Bytes (excluding always-live): 100,312
        Live Vars (excluding always-live):
          <anon>                         8
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          broadcastProgramId         4,860
          programId                      4
      Switch
        Live Bytes (excluding always-live): 100,312
        Live Vars (excluding always-live):
          <anon>                         8
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          broadcastProgramId         4,860
          programId                      4
        Sequence
          DoExchange: init/setMasterSeed/ExchangePre
            Live Bytes (excluding always-live): 105,168
            Live Vars (excluding always-live):
              <anon>                         8
              <message:anon>             9,720
              Layer1/Fwd/biases            120
              Layer1/Fwd/weights        94,080
              Layer3/Fwd/biases             40
              Layer3/Fwd/weights         1,200
          OnTileExecute: init/setMasterSeed
            Live Bytes (excluding always-live): 105,168
            Live Vars (excluding always-live):
              <anon>                         8
              <message:anon>             9,720
              Layer1/Fwd/biases            120
              Layer1/Fwd/weights        94,080
              Layer3/Fwd/biases             40
              Layer3/Fwd/weights         1,200
...

Execution:

  Total cycles:                                  54,202 (approx 33.9 microseconds)
  Total compute cycles (including idle threads): 1,933,324
  Total compute cycles (excluding idle threads): 1,769,242
  Total IPU exchange cycles:                     482,221
  Total global exchange cycles:                  0
  Total host exchange cycles:                    4,194,423
  Total shared structure copy cycles:            0
  Total sync cycles:                             58,852,900
  Total tile balance:                            2.9%
  Total thread balance:                          91.5%

  Cycles by vertex type:
    poplin::ConvPartial1x1Out<float,float,true,false>                                (808 instances):      764,028
    poplar_rt::DstStridedCopyDA32                                                   (1393 instances):      509,173
    poprand::SetSeedSupervisor                                                      (1216 instances):      177,536
    poplin::ReduceAdd<float,float>                                                   (376 instances):      154,116
    popops::ScaledAddSupervisor<float,float,true>                                    (460 instances):      139,812
    poplar_rt::StridedCopyDA32                                                       (455 instances):       83,995
    poplar_rt::Memcpy64BitSupervisor                                                 (203 instances):       59,048
    popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID>             (68 instances):       15,780
    popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID>         (68 instances):       14,628
    poplar_rt::ShortMemcpy                                                            (73 instances):        5,256
    poplin::Transpose2d<float>                                                        (99 instances):        4,770
    popops::EncodeOneHot<unsigned int,float>                                          (20 instances):        4,408
    popops::Reduce<popops::ReduceAdd,float,float,false,2>                            (112 instances):        3,584
    poplar_rt::DstStridedCopy64BitMultiAccess                                         (41 instances):        2,437
    popops::ScaledReduce<popops::ReduceAdd,float,float,true,1>                        (32 instances):        1,600
    poplar_rt::StridedCopy64BitMultiAccess                                            (33 instances):        1,388
    popops::ScaledAdd2D<float,true>                                                   (39 instances):        1,371
    popnn::ReduceMaxClassGather<float,unsigned int>                                   (16 instances):        1,056
    popops::Reduce<popops::ReduceAdd,float,float,false,1>                             (16 instances):          912
    popnn::LossSumSquaredTransform<float>                                              (8 instances):          792
    poplar_rt::DstStridedMemsetZero64Bit                                              (18 instances):          740
    popops::Reduce<popops::ReduceAdd,float,float,false,3>                              (4 instances):          144
    popops::ScaledReduce<popops::ReduceAdd,float,float,true,0>                         (2 instances):          100
    popnn::CalcAccuracy<unsigned int>                                                  (1 instances):           78
    poplar_rt::MemsetZero64Bit                                                         (2 instances):           26
    poplar_rt::MemsetZero                                                              (1 instances):           15
Steps
  --- External Sync ---
  StreamCopy (cycles 118 - 118)
    Cycles:                             7
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         2
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- External Sync ---
  StreamCopy (cycles 472 - 472)
    Cycles:                             9
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         4
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- Internal Sync ---
  DoExchange (cycles 826 - 946): switchControlBroadcast13/ExchangePre
    Cycles:                             120 (0 overlapped with previous)
    Active Tiles:                       1,216 / 1,216
    Tile Balance:                       86.6%
    Active Tile Balance:                86.6%
    Total Data:                         2,432
    Data Balance (mean / max data per tile): 100.0%
  --- External Sync ---
  StreamCopy (cycles 1,064 - 1,064)
    Cycles:                             7
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         2
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- External Sync ---
  StreamCopy (cycles 1,418 - 12,098)
    Cycles:                             10,700
    Active Tiles:                       392 / 1,216
    Tile Balance:                       32.2%
    Active Tile Balance:                100.0%
    Total Data:                         25,122
    Data Balance (mean / max data per tile): 21.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- Internal Sync ---
  DoExchange (cycles 12,432 - 12,581): Layer1/Fwd/Conv_1/Convolve/ExchangePre
    Cycles:                             169 (20 overlapped with previous)
    Active Tiles:                       392 / 1,216
    Tile Balance:                       27.9%
    Active Tile Balance:                86.6%
    Total Data:                         125,440
    Data Balance (mean / max data per tile): 32.2%
  OnTileExecute (cycles 12,581 - 13,580): Layer1/Fwd/Conv_1/Convolve
    Cycles:                             999 (0 overlapped with previous)
    Active Tiles:                       392 / 1,216
    Thread Balance:                     100.0%
    Tile Balance:                       31.7%
    Active Tile Balance:                98.3%
    By vertex type:
      poplin::ConvPartial1x1Out<float,float,true,false>        (392 instances):      384,944
....