5. Profiling

You can use Poplar to instrument your code to produce detailed compile-time information about the graph program and run-time information about execution, including how memory is used, and where memory and processor cycles are consumed.

This information can be output to files and viewed with the PopVision Graph Analyser. This provides a graphical view of the profile data and includes full documentation as context-sensitive help. See the document Poplar Profile Data for a detailed description of the contents of these files.

You can also generate a summary of the profiling information for quick review. See Profile summary.

The profiling information available depends on the target:

Target

Memory Profiling

Execution Profiling

IPU

Exact

Hardware measurement with limitations

IPU Model

Exact (optional)

Detailed but based on estimates

CPU

None

None

The IPUModel::compileIPUCode option, described below, can be used to generate exact memory profiling information for an IPU Model.

Because profiling adds code and extra variables to extract the profiling information, it can change the performance and memory usage of your program.

5.1. Generating profiling information

After you have loaded your Graph into an Engine, you can get static profile information about the graph and the resources required. This includes cycle estimates (for an IPU Model) and memory information.

ProfileValue graphProfile = engine.getGraphProfile();

After you have run the program one or more times you can get dynamic profiling information (what code was run, cycle counts, and so on).

ProfileValue executionProfile = engine.getExecutionProfile();

You can save the profiling information to files for use by the Graph Analyser. For example:

poplar::serializeToJSON(graphFile, graphProfile, true);
poplar::serializeToJSON(executionFile, executionProfile, true);

The last parameter of serializeToJSON() is optional and controls whether or not to pretty print the data.

5.1.1. Profiling options

There are some options for controlling profiling on IPU targets (hardware or simulator).

By default, profiling is disabled. The instrumentation of compute cycles and external exchange cycles can be enabled with the following options:

  • debug.instrument Set to “true” enable instrumentation.

  • debug.instrumentCompute Set to “true” or “false” to enable or disable instrumentation of compute cycles.

  • debug.instrumentExternalExchange Set to “true” or “false” to enable or disable instrumentation of cycles used for exchanges between IPUs, or between the IPU and the host.

Note that there is no option to instrument internal exchanges because all internal exchange is statically scheduled.

If the instrumentation of compute is enabled, then the compute cycles counted can be specified with the debug.computeInstrumentationLevel option. This can have the following values:

  • “tile” Store the cycle count for the last execution of each compute set on every tile (default).

  • “vertex” Store the cycle count for the last execution of each vertex on every tile.

  • “ipu” Store the cycle count for the last execution of each compute set on a single tile per IPU. This measures the execution time of the longest-running tile in the compute set. This saves memory compared to “tile” but loses all the per-tile cycle information.

These can be specified when the Engine constructor is called. For example:

Engine engine(graph, prog,
              OptionFlags({{"debug.instrumentCompute", "true"},
                           {"debug.instrumentExternalExchange", "false"},
                           {"debug.computeInstrumentationLevel", "tile"}}));

These options can also be defined in the environment variable POPLAR_ENGINE_OPTIONS. For example:

export POPLAR_ENGINE_OPTIONS='{"debug.instrument": "true",
                               "debug.computeInstrumentationLevel": "vertex"}'

For complete information about the available options, see the Poplar and PopLibs API Reference.

For IPU Model targets you can optionally tell Poplar to compile code for the IPU (in addition to the CPU code that is actually executed by the model). If this is not done, then the reported memory usage will not include memory used for code. If this is enabled then the memory profile should give the same results as an IPU target. This option is specified by setting the compileIPUCode members of the model, for example:

// Create the IPU Model device
IPUModel ipuModel;
ipuModel.compileIPUCode = true;

5.2. Profile summary

You can generate a readable summary of the information, which may be useful for a quick overview of the resource usage and performance of your program. This summary report can be printed by the program or generated from the profile data.

5.2.1. Printing from a Poplar program

To print a summary of both the graph and execution profiling information you can call the function:

poplar::printProfileSummary(std::cout, graphProfile,
                            executionProfile, options);

You can also print the graph and execution summaries separately by calling printGraphSummary or printExecutionSummary (see the Poplar and PopLibs API Reference for more information).

The following options are available:

  • colours: Control the use of colours in the summary output

  • showVarStorage: Show liveness information for each program

  • showOptimizations: Show compile optimisation details

  • showExecutionSteps: Show the execution steps

  • showPerIpuMemoryUsage: Show memory usage for each IPU

Colours can be used to highlight different sections of the output, to make it easier to understand. By default, colour is enabled when output is to a supported terminal. By setting the colours option to "true", colour output will be generated even if outputting to a file or piping the output to another program.

5.2.2. Command line conversion

A program is included with the Poplar SDK to convert the generated profile data into a readable summary. This has options --graph-profile and --execution-profile to specify the graph and execution profile files to use. For example:

$ poplar_profile_summary --graph-profile graph_profile.json

The following command line options can be used to control the summary output:

  • --show-execution-steps: Show execution steps

  • --show-optimizations: Show optimisation information

  • --show-per-ipu-memory-usage: Report the memory usage for each IPU

  • --show-var-storage: Show liveness information for each program

5.2.3. Summary report format

There are two environment variables that can also be used to control colour output. These are:

  • CLICOLOR: If set to 0, then colour output will not be generated. This overrides the colours option value.

  • CLICOLOR_FORCE: If set to 1, then colour output will always be generated, even if output is not to the terminal. This overrides the colours option value and the CLICOLOR environment variable.

If there are replicated graphs then the memory usage will only be shown for one replica, as it will be the same for all of them.

If any tiles are out of memory, the memory used a few tiles with the largest usage will be shown.

The output with showVarStorage and showExecutionSteps will be similar to that shown below.

If any tiles are out of memory, the memory used by a few tiles with the largest usage will be shown.

Target:
  Number of IPUs:         1
  Tiles per IPU:          1,216
  Total Tiles:            1,216
  Memory Per-Tile:        256.0 kB
  Total Memory:           304.0 MB
  Clock Speed (approx):   1,600.0 MHz
  Number of Replicas:     1
  IPUs per Replica:       1
  Tiles per Replica:      1,216
  Memory per Replica:     304.0 MB

Graph:
  Number of vertices:            5,564
  Number of edges:              18,564
  Number of variables:          47,973
  Number of compute sets:           43

Memory on all IPUs:
Memory Usage:
  Total:
    Including Gaps:         67,725,024 B
    Excluding Gaps:
      By Memory Region:
        Non-interleaved:     6,522,641 B
        Interleaved:           197,696 B
        Overflowed:                  0 B
      Total:                 6,720,337 B
      By Data Type:
        Not Overlapped
            Variables:                              254,280 B
            Internal Exchange Message Buffers:      109,788 B
            Data Rearrangement Buffers:                  96 B
            Host Exchange Packet Headers:            10,720 B
            Stack:                                3,852,288 B
            Vertex Instances:                        65,768 B
            Copy Descriptors:                        16,161 B
            VectorList Descriptors:                     528 B
            Vertex Field Data:                       24,052 B
            Control Code:                           467,696 B
            Vertex Code:                          1,097,868 B
            Internal Exchange Code:                 215,364 B
            Host Exchange Code:                     211,132 B
          Total:                                  6,325,741 B
        Overlapped
            Variables:                              340,056 B
            Internal Exchange Message Buffers:      697,128 B
            Data Rearrangement Buffers:              15,424 B
          Total:                                  1,052,608 B
          Total After Overlapping:                  394,596 B
      Vertex Data (106,509B):
        By Category:
          Internal vertex state:         34,114 B
          Edge pointers:                 38,876 B
          Copy pointers:                 16,668 B
          Padding:                          346 B
          Descriptors:                   16,505 B
        By Type:
          poplin::ConvPartial1x1Out<float,float,true,false>                                185,792 B
          poplin::ReduceAdd<float,float>                                                   117,728 B
          poprand::SetSeedSupervisor                                                        34,048 B
          poplar_rt::Memcpy64BitSupervisor                                                  31,468 B
          poplar_rt::DstStridedCopyDA32                                                     30,242 B
          poplin::Transpose2d<float>                                                        23,784 B
          popops::ScaledAddSupervisor<float,float,true>                                     18,336 B
          popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID>         13,024 B
          popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID>             12,512 B
          poplar_rt::StridedCopyDA32                                                         9,043 B
          popops::EncodeOneHot<unsigned int,float>                                           3,120 B
          poplar_rt::DstStridedCopy64BitMultiAccess                                          2,677 B
          popops::Reduce<popops::ReduceAdd,float,float,false,2>                              1,632 B
          poplar_rt::ShortMemcpy                                                             1,324 B
          popops::ScaledAdd2D<float,true>                                                    1,164 B
          popops::ScaledReduce<popops::ReduceAdd,float,float,true,1>                           888 B
          popnn::LossSumSquaredTransform<float>                                                816 B
          poplar_rt::StridedCopy64BitMultiAccess                                               719 B
          poplar_rt::DstStridedMemsetZero64Bit                                                 416 B
          popnn::ReduceMaxClassGather<float,unsigned int>                                      384 B
          popops::Reduce<popops::ReduceAdd,float,float,false,1>                                368 B
          poplar_rt::MemsetZero                                                                172 B
          popops::Reduce<popops::ReduceAdd,float,float,false,3>                                152 B
          popnn::CalcAccuracy<unsigned int>                                                     72 B
          poplar_rt::MemsetZero64Bit                                                            68 B
          popops::ScaledReduce<popops::ReduceAdd,float,float,true,0>                            56 B


  By Tile (Excluding Gaps):
    Range (KB) Histogram (Excluding Gaps)               Count (tiles)
        3 -  4 ****************************************    824
        4 -  5                                               0
        5 -  6                                               0
        6 -  7                                               0
        7 -  8 ********                                    168
        8 -  9 ***                                          44
        9 - 10 ******                                      126
      10 - 11 **                                           30
      11 - 12 *                                            16
      12 - 13                                               0
      13 - 14                                               0
      14 - 15 *                                             2
      15 - 16 *                                             2
      16 - 17 *                                             3
      17 - 18                                               0
      18 - 19                                               0
      19 - 20 *                                             1

    Maximum (Including Gaps): 131,608 (128.5 K) on tile 0
    Maximum (Excluding Gaps): 20,294 (19.8 K) on tile 0
    0 tile(s) out of memory



  Variable Storage Liveness: All tiles

    Always-live bytes: 5,976,425
    Always-live variables:
      <anon>                                                                                               48
      <const>                                                                                             120
      Layer1/Fwd/Conv_1/worklists                                                                      14,112
      Layer3/Bwd/Conv_1/worklists                                                                          72
      Layer3/Fwd/Conv_1/worklists                                                                         144
      Layer3/Wu/ReduceFinalStage/IntermediateToOutput/numPartials                                          60
      Layer5/Bwd/LossSumSquared/offset                                                                    112
      Layer5/Bwd/LossSumSquared/reduce_loss/ReduceOnTile/InToIntermediateNoExchange/numPartials            32
      Layer5/Bwd/LossSumSquared/sliceLen                                                                  128
      ValuePadder/padding                                                                                  16
      controlCode                                                                                     467,696
      copyDescriptor                                                                                   16,161
      hostExchangeCode                                                                                211,132
      hostExchangePacketHeader                                                                         10,720
      internalExchangeCode                                                                            215,364
      numCorrect                                                                                            4
      stack                                                                                         3,852,288
      vectorListDescriptor                                                                                528
      vertexCode                                                                                    1,097,868
      vertexFieldData                                                                                  24,052
      vertexInstanceState                                                                              65,768

    Maximum live bytes (including always-live): 6,556,737

    Sequence
      StreamCopy
        Live Bytes (excluding always-live): 95,452
        Live Vars (excluding always-live):
          <anon>                        12
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
      StreamCopy
        Live Bytes (excluding always-live): 95,456
        Live Vars (excluding always-live):
          <anon>                        12
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          programId                      4
      DoExchange: switchControlBroadcast13/ExchangePre
        Live Bytes (excluding always-live): 100,312
        Live Vars (excluding always-live):
          <anon>                         8
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          broadcastProgramId         4,860
          programId                      4
      Switch
        Live Bytes (excluding always-live): 100,312
        Live Vars (excluding always-live):
          <anon>                         8
          Layer1/Fwd/biases            120
          Layer1/Fwd/weights        94,080
          Layer3/Fwd/biases             40
          Layer3/Fwd/weights         1,200
          broadcastProgramId         4,860
          programId                      4
        Sequence
          DoExchange: init/setMasterSeed/ExchangePre
            Live Bytes (excluding always-live): 105,168
            Live Vars (excluding always-live):
              <anon>                         8
              <message:anon>             9,720
              Layer1/Fwd/biases            120
              Layer1/Fwd/weights        94,080
              Layer3/Fwd/biases             40
              Layer3/Fwd/weights         1,200
          OnTileExecute: init/setMasterSeed
            Live Bytes (excluding always-live): 105,168
            Live Vars (excluding always-live):
              <anon>                         8
              <message:anon>             9,720
              Layer1/Fwd/biases            120
              Layer1/Fwd/weights        94,080
              Layer3/Fwd/biases             40
              Layer3/Fwd/weights         1,200
...

Execution:

  Total cycles:                                  54,202 (approx 33.9 microseconds)
  Total compute cycles (including idle threads): 1,933,324
  Total compute cycles (excluding idle threads): 1,769,242
  Total IPU exchange cycles:                     482,221
  Total global exchange cycles:                  0
  Total host exchange cycles:                    4,194,423
  Total shared structure copy cycles:            0
  Total sync cycles:                             58,852,900
  Total tile balance:                            2.9%
  Total thread balance:                          91.5%

  Cycles by vertex type:
    poplin::ConvPartial1x1Out<float,float,true,false>                                (808 instances):      764,028
    poplar_rt::DstStridedCopyDA32                                                   (1393 instances):      509,173
    poprand::SetSeedSupervisor                                                      (1216 instances):      177,536
    poplin::ReduceAdd<float,float>                                                   (376 instances):      154,116
    popops::ScaledAddSupervisor<float,float,true>                                    (460 instances):      139,812
    poplar_rt::StridedCopyDA32                                                       (455 instances):       83,995
    poplar_rt::Memcpy64BitSupervisor                                                 (203 instances):       59,048
    popnn::NonLinearitySupervisor<float,popnn::NonLinearityType::SIGMOID>             (68 instances):       15,780
    popnn::NonLinearityGradSupervisor<float,popnn::NonLinearityType::SIGMOID>         (68 instances):       14,628
    poplar_rt::ShortMemcpy                                                            (73 instances):        5,256
    poplin::Transpose2d<float>                                                        (99 instances):        4,770
    popops::EncodeOneHot<unsigned int,float>                                          (20 instances):        4,408
    popops::Reduce<popops::ReduceAdd,float,float,false,2>                            (112 instances):        3,584
    poplar_rt::DstStridedCopy64BitMultiAccess                                         (41 instances):        2,437
    popops::ScaledReduce<popops::ReduceAdd,float,float,true,1>                        (32 instances):        1,600
    poplar_rt::StridedCopy64BitMultiAccess                                            (33 instances):        1,388
    popops::ScaledAdd2D<float,true>                                                   (39 instances):        1,371
    popnn::ReduceMaxClassGather<float,unsigned int>                                   (16 instances):        1,056
    popops::Reduce<popops::ReduceAdd,float,float,false,1>                             (16 instances):          912
    popnn::LossSumSquaredTransform<float>                                              (8 instances):          792
    poplar_rt::DstStridedMemsetZero64Bit                                              (18 instances):          740
    popops::Reduce<popops::ReduceAdd,float,float,false,3>                              (4 instances):          144
    popops::ScaledReduce<popops::ReduceAdd,float,float,true,0>                         (2 instances):          100
    popnn::CalcAccuracy<unsigned int>                                                  (1 instances):           78
    poplar_rt::MemsetZero64Bit                                                         (2 instances):           26
    poplar_rt::MemsetZero                                                              (1 instances):           15
Steps
  --- External Sync ---
  StreamCopy (cycles 118 - 118)
    Cycles:                             7
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         2
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- External Sync ---
  StreamCopy (cycles 472 - 472)
    Cycles:                             9
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         4
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- Internal Sync ---
  DoExchange (cycles 826 - 946): switchControlBroadcast13/ExchangePre
    Cycles:                             120 (0 overlapped with previous)
    Active Tiles:                       1,216 / 1,216
    Tile Balance:                       86.6%
    Active Tile Balance:                86.6%
    Total Data:                         2,432
    Data Balance (mean / max data per tile): 100.0%
  --- External Sync ---
  StreamCopy (cycles 1,064 - 1,064)
    Cycles:                             7
    Active Tiles:                       1 / 1,216
    Tile Balance:                       0.1%
    Active Tile Balance:                100.0%
    Total Data:                         2
    Data Balance (mean / max data per tile): 0.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- External Sync ---
  StreamCopy (cycles 1,418 - 12,098)
    Cycles:                             10,700
    Active Tiles:                       392 / 1,216
    Tile Balance:                       32.2%
    Active Tile Balance:                100.0%
    Total Data:                         25,122
    Data Balance (mean / max data per tile): 21.1%
  --- Internal Sync ---
  --- Internal Sync ---
  --- Internal Sync ---
  DoExchange (cycles 12,432 - 12,581): Layer1/Fwd/Conv_1/Convolve/ExchangePre
    Cycles:                             169 (20 overlapped with previous)
    Active Tiles:                       392 / 1,216
    Tile Balance:                       27.9%
    Active Tile Balance:                86.6%
    Total Data:                         125,440
    Data Balance (mean / max data per tile): 32.2%
  OnTileExecute (cycles 12,581 - 13,580): Layer1/Fwd/Conv_1/Convolve
    Cycles:                             999 (0 overlapped with previous)
    Active Tiles:                       392 / 1,216
    Thread Balance:                     100.0%
    Tile Balance:                       31.7%
    Active Tile Balance:                98.3%
    By vertex type:
      poplin::ConvPartial1x1Out<float,float,true,false>        (392 instances):      384,944
....