7. Profiling data format¶
Warning
Profiling is a rapidly changing part of Poplar, so this information is subject to change without notice.
This section describes the format of the profiling data produced by Poplar. See Profiling for information on how to generate the profiling information.
The data is produced in JSON (JavaScript Object Notation) format, with separate files containing information about the graph and the execution of the program. The graph profile can be created as soon as the graph has been compiled. The execution profile can be produced after the graph program has been run.
7.1. Graph profile¶
The structure of the graph profile is organised in the following areas:
Target information
Optimisation information
Graph information
Vertex types
Compute sets
Exchanges
Program structure
Memory use
These are described in detail in the following sections.
7.1.1. Target information¶
The target object contains some useful information about the target hardware.
- type: The target type, which is one of CPU, IPU or IPU_MODEL.
- bytesPerIPU: The number of bytes of memory on an IPU.
- bytesPerTile: The number of bytes of memory on a tile.
- clockFrequency: The tile clock frequency in Hertz.
- numIPUs: The number of IPU chips in the system.
- tilesPerIPU: The number of tiles on each IPU chip.
- numTiles: The total number of tiles. This is the product of numIPUs and tilesPerIPU. It is stored redundantly for convenience.
- totalMemory: The total memory. This is the product of bytesPerTile and numTiles (or bytesPerIPU and numIPUs). It is stored redundantly for convenience.
- relativeSyncDelayByTile: The sync delay for each tile (relative to the minimum value).
- minSyncDelay: The minimum sync delay for any tile.
The sync delay for a tile is the number of cycles that it takes for the tile to
send a sync request to the sync controller and receive a sync release signal
back from the sync controller. It is smaller for tiles closer to the sync
controller. This can be used for calculating how long a sync takes. The values
are given for each tile on one IPU. In other words, there are tilesPerIPU
values, not numTiles, because the sync delay values are the same on every
IPU. The sync delay for each tile is given by minSyncDelay +
relativeSyncDelayByTile[tile].
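The calculation above can be sketched in Python. This assumes the graph profile has been loaded from JSON into a dict (for example with json.load); the helper name and the sample values are ours, not part of the format.

```python
def sync_delay(target, tile):
    """Absolute sync delay, in cycles, for a tile.

    `target` is the "target" object from the graph profile. The values
    repeat for every IPU, so index modulo tilesPerIPU.
    """
    relative = target["relativeSyncDelayByTile"]
    return target["minSyncDelay"] + relative[tile % target["tilesPerIPU"]]

# Hand-made target for illustration (not real hardware values).
target = {
    "minSyncDelay": 10,
    "tilesPerIPU": 4,
    "relativeSyncDelayByTile": [0, 1, 2, 3],
}
print(sync_delay(target, 0))  # tile closest to the sync controller -> 10
print(sync_delay(target, 6))  # tile 2 on a second IPU -> 12
```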
7.1.2. Optimisation information¶
optimizationInfo contains a map<string, double> of internal metrics
related to compilation. The keys may change but this will always be a map from
strings to doubles.
7.1.3. Graph information¶
graph includes some basic information about the graph, such as the number of
compute sets.
"graph":{
"numComputeSets":9,
"numEdges":24,
"numVars":111,
"numVertices":16
}
7.1.4. Vertex types¶
vertexTypes lists the vertex types that are actually used in the graph.
There may be many more vertex types but unused ones are ignored. In the rest of
the profile data, references to vertex types are specified as an index into
these arrays.
- names: Lists the names of the vertex types. This includes built-in vertices like poplar_rt::LongMemcpy.
- sizes: Contains the size of the vertex state (the class members) of each vertex type. For example Doubler might have 4 bytes of state.
7.1.5. Compute sets¶
computeSets contains the names of, and the number of vertices in, each
compute set. For the IPU_MODEL target it also includes a
cycleEstimates field.
- names: The name of each compute set. These are mainly for debugging purposes and are not necessarily unique. This includes compute sets generated during compilation.
- vertexCounts and vertexTypes: The number of each type of vertex in the compute set. For each compute set there are vertexCounts[compute_set][i] vertices of type vertexTypes[compute_set][i]. The type is an index into the top-level "vertexTypes" array.
- cycleEstimates: A cycle estimate is calculated for each vertex and then the vertices are scheduled in the same way that they would be run on real hardware. This results in three cycle estimate arrays:
  - activeCyclesByTile: The number of cycles during which a vertex was being run. Tiles have six hardware threads that are serviced in a round-robin fashion. If only one vertex is running then out of every six cycles only one cycle is "active", and the other five cycles are idle. activeCyclesByTile counts the total number of active cycles in each compute set for each tile. It is indexed as [compute_set][tile].
  - activeCyclesByVertexType: This is the total number of active cycles in each compute set, by vertex type. It is indexed as [compute_set][vertex_type] where vertex_type is an index into "vertexTypes".
  - cyclesByTile: This is similar to activeCyclesByTile but it also counts idle cycles where a thread is not executing. This therefore gives the actual number of cycles that each tile takes running this compute set.
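The relationship between the two per-tile arrays can be illustrated with a small sketch. This assumes an IPU_MODEL profile (so cycleEstimates is present); the function name and sample numbers are ours.

```python
def tile_occupancy(compute_sets, cs, tile):
    """Fraction of a tile's cycles in compute set `cs` spent actively
    running vertices: activeCyclesByTile / cyclesByTile."""
    est = compute_sets["cycleEstimates"]
    active = est["activeCyclesByTile"][cs][tile]
    total = est["cyclesByTile"][cs][tile]
    return active / total if total else 0.0

# Hand-made data: tile 0 runs a single vertex (1 active cycle in 6),
# tile 1 keeps all six hardware threads busy.
compute_sets = {
    "cycleEstimates": {
        "activeCyclesByTile": [[10, 60]],
        "cyclesByTile": [[60, 60]],
    }
}
print(tile_occupancy(compute_sets, 0, 1))  # fully occupied tile -> 1.0
```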
7.1.6. Exchanges¶
exchanges lists some basic information about internal exchanges.
- bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
- bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
- cyclesByTile: The number of cycles that each tile used for internal exchanges. It is indexed as [internal_exchange_id][tile]. This is known exactly for internal exchanges, which are statically scheduled.
externalExchanges lists the same information for IPU-to-IPU exchanges.
- bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [external_exchange_id][tile].
- bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [external_exchange_id][tile].
- estimatedCyclesByTile: The estimated number of cycles that each tile used for exchanges with other IPUs. It is indexed as [external_exchange_id][tile].
hostExchanges lists the same information for exchanges between the host and
IPU.
- bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [host_exchange_id][tile].
- bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [host_exchange_id][tile].
- estimatedCyclesByTile: The estimated number of cycles that each tile used for exchanges to or from the host. It is indexed as [host_exchange_id][tile].
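Because all three exchange lists share the same [exchange_id][tile] layout, one helper can aggregate traffic for any of them. A sketch, with hand-made sample data and a helper name of our choosing:

```python
def total_bytes_received(exchange_list, tile):
    """Sum the bytes received by one tile over every exchange in one of
    the "exchanges", "externalExchanges" or "hostExchanges" objects,
    each of which is indexed [exchange_id][tile]."""
    return sum(ex[tile] for ex in exchange_list["bytesReceivedByTile"])

# Two exchanges over two tiles (illustrative numbers only).
exchanges = {"bytesReceivedByTile": [[0, 128], [64, 64]]}
print(total_bytes_received(exchanges, 1))  # 128 + 64 -> 192
```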
7.1.7. Program structure¶
The graph profile includes a serialisation of the program structure. This can include some programs generated during compilation, such as exchange and sync operations, in addition to the programs explicitly specified in the source code.
programs is a flattened array of all the programs given to the engine. This
includes control programs (programs the user has provided) and functions
(internally generated programs to reduce code duplication).
The arrays controlPrograms and functionPrograms contain the indexes of
control and function programs in the programs array. Normally user programs
are wrapped in a single control program so controlPrograms will nearly
always contain only [0].
"controlPrograms":[0],
"functionPrograms":[31, 45],
Each entry in the programs array is a tagged union. The tag is type and
has to be one of the following values, to indicate the type of the program. The
following table summarises the tags generated by each program class.
| Program class | Program type tags |
|---|---|
| Execute | OnTileExecute. This may be preceded or followed by DoExchange or GlobalExchange if exchanges are needed before/after execution. |
| Copy | DoExchange, GlobalExchange or StreamCopy, corresponding to internal exchange, inter-IPU exchange and host exchange respectively. This may be preceded or followed by OnTileExecute or DoExchange if data rearrangement is needed before/after the copy. |
The type determines which other fields are present. The most useful are
described below.
Programs that have sub-programs encode this with the children field (even
those with a fixed number of children like If). The sub-programs are
specified as indexes into the programs array.
{
"children":[4,5,6,7,8,9],
"type":"Sequence"
}
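Since sub-programs are plain indexes into the flattened programs array, the program structure can be walked recursively. A minimal sketch (function name and sample data are ours; Call targets index functionPrograms rather than children, so this sketch does not follow them):

```python
def walk(programs, index, depth=0, out=None):
    """Depth-first walk of the flattened `programs` array, following
    `children` indexes. Returns a list of (depth, type) pairs."""
    if out is None:
        out = []
    prog = programs[index]
    out.append((depth, prog["type"]))
    for child in prog.get("children", []):
        walk(programs, child, depth + 1, out)
    return out

# A tiny hand-made programs array; real profiles are much larger.
programs = [
    {"type": "Sequence", "children": [1, 2]},
    {"type": "DoExchange", "exchange": 0},
    {"type": "OnTileExecute", "computeSet": 0},
]
print(walk(programs, 0))
# -> [(0, 'Sequence'), (1, 'DoExchange'), (1, 'OnTileExecute')]
```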
The exchange programs (DoExchange, GlobalExchange and StreamCopy)
reference the exchange ID, which is an index into exchanges,
externalExchanges or hostExchanges respectively.
{
"exchange":1,
"type":"StreamCopy"
}
DoExchange also includes a breakdown of the exchange code memory by type
of exchange instruction.
{
"exchange":3,
"memoryByInstruction":{
"delay":24,
"exchange":0,
"other":0,
"receiveAddress":16,
"receiveFormat":0,
"receiveMux":32,
"send":8,
"sendWithPutAddress":0,
"sendWithPutFormat":0,
"sendWithPutMux":0,
"total":80
},
"name":"/ExchangePre",
"type":"DoExchange"
}
OnTileExecute contains the compute set ID, which is an index into the arrays
in computeSets.
{
"computeSet":3,
"type":"OnTileExecute"
}
Programs can have a name field:
{
"exchange":2,
"name":"progIdCopy/GlobalPre/GlobalExchange",
"type":"GlobalExchange"
}
Call programs call a sub-graph as a function. They contain an index into the
functionPrograms array that identifies the function called.
{
"target":1,
"type":"Call"
}
7.1.8. Memory use¶
The memory object contains a lot of information about memory use. All memory
is statically allocated so you don’t need to run the program to gather this
data.
The memory usage is reported for each tile, and also by category (what the memory is used for), by compute set and by vertex type. There is also a report of variable liveness, including a tree of the liveness for all possible call stacks (this is a finite list because recursion is not allowed).
There are two memory regions on each tile, interleaved and non-interleaved; the use of each of these is reported separately. If the memory requirement is greater than the available memory, then this is reported as overflowed. The memory usage in each region is provided, both with and without gaps. Gaps arise because of memory allocation constraints, such as alignment requirements. For more information on the tile memory architecture, refer to the IPU Programmer's Guide.
The memory used by some variables can be overlapped with others, because they
are not live at the same time. Hence, the usage is split into overlapped and
nonOverlapped components.
For top-level replicated graphs (those created by Graph(target,
replication_factor)) the memory use will be reported for a single replica (the
memory used by all replicas will be identical).
Memory per tile¶
"byTile": {
"interleaved": [ 536, 408 ],
"interleavedIncludingGaps": [ 536, 408 ],
"nonInterleaved": [ 19758, 3896 ],
"nonInterleavedIncludingGaps": [ 65568, 19596 ],
"overflowed": [ 0, 0 ],
"overflowedIncludingGaps": [ 0, 0 ],
"total": [ 20294, 3896 ],
"totalIncludingGaps": [ 131608, 19596 ]
}
- total: The sum of interleaved, nonInterleaved and overflowed. This is the total amount of memory used for data (not including padding) on each tile. However, due to memory constraints leading to padding, more memory may actually be required. Therefore this is usually not the number you want.
- totalIncludingGaps: The actual amount of memory that is required on each tile. This is not simply the sum of the previous "including gaps" figures because adding those up does not take account of the gaps between the regions.
If any of these numbers is larger than the number of bytes per tile then the program will not fit on the hardware.
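That fit check can be written directly against the profile data. A sketch, assuming the memory object has been loaded from JSON; the function name and numbers are ours (bytesPerTile comes from the target information):

```python
def tiles_over_budget(memory, bytes_per_tile):
    """Indexes of tiles whose required memory (totalIncludingGaps)
    exceeds the physical memory per tile, meaning the program will not
    fit on the hardware."""
    usage = memory["byTile"]["totalIncludingGaps"]
    return [t for t, used in enumerate(usage) if used > bytes_per_tile]

# Values from the two-tile example above, with a hypothetical budget.
memory = {"byTile": {"totalIncludingGaps": [131608, 19596]}}
print(tiles_over_budget(memory, 262144))  # both tiles fit -> []
```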
Memory by category¶
byCategory is a breakdown of memory usage across the whole system by the
type of data, and the region it is in.
"byCategory":{
"controlCode": {
"interleaved": {
"nonOverlapped": [ 0, 0 ],
"overlapped": [ 0, 0 ]
},
"nonInterleaved": {
"nonOverlapped": [ 1216, 356 ],
"overlapped": [ 0, 0 ]
},
"overflowed": {
"nonOverlapped": [ 0, 0 ],
"overlapped": [ 0, 0 ]
},
"total": [ 1216, 356 ]
}
}
The categories are:

- constant: Constants added by the user. Variables added by the compiler that happen to be constant will be in variable.
- controlCode: Code for Programs and running compute sets.
- controlId: Program and sync IDs.
- controlTable: A table that lists the vertices to run in each compute set. Only used if the table scheduler is enabled.
- copyDescriptor: Copy descriptors are special variable-sized fields used by copy vertices.
- globalExchangeCode: Code for performing exchange operations between IPUs.
- globalExchangePacketHeader: Packet headers for inter-IPU exchanges.
- globalMessage: Message data for inter-IPU exchanges.
- hostExchangeCode: Code for performing exchange operations to and from the host.
- hostExchangePacketHeader: Packet headers for host exchanges.
- hostMessage: Message data for host exchanges.
- instrumentationResults: Variables to store profiling information.
- internalExchangeCode: Code for performing internal exchange operations.
- message: Message data for internal exchanges.
- multiple: Space shared by variables from multiple different categories.
- outputEdge: Storage for output edge data before an exchange takes place.
- rearrangement: Variables holding rearranged versions of tensor data.
- sharedCodeStorage: Code shared by vertices.
- sharedDataStorage: Data shared by vertices.
- stack: Worker and supervisor stacks.
- variable: Space allocated for variables in worker and supervisor code.
- vectorListDescriptor: The data for VectorList<Input<...>, DeltaN> fields.
- vertexCode: Code for vertex functions (codelets).
- vertexFieldData: Variable-sized fields. For example, the data for Vector<float> or Vector<Input<...>> fields.
- vertexInstanceState: Vertex class instances. This will be sizeof(VertexName) for each vertex.
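A system-wide summary per category can be produced by summing the per-tile total arrays shown above. A sketch with hand-made data (the helper name is ours):

```python
def category_totals(by_category):
    """System-wide bytes per category, summing each category's
    per-tile `total` array."""
    return {name: sum(entry["total"]) for name, entry in by_category.items()}

# Two categories over two tiles (illustrative numbers only).
by_category = {
    "controlCode": {"total": [1216, 356]},
    "vertexCode": {"total": [4000, 0]},
}
print(category_totals(by_category))  # {'controlCode': 1572, 'vertexCode': 4000}
```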
Memory by compute set¶
byComputeSet is a breakdown of memory usage across the whole system. It
includes several 2D arrays indexed by compute set, then by tile.
"byComputeSet": {
"codeBytes": [[0, ...],[0, ...], ...],
"copyPtrBytes": [[0, ...],[0, ...], ...],
"descriptorBytes": [[0, ...],[0, ...], ...],
"edgePtrBytes": [[0, ...],[0, ...], ...],
"paddingBytes": [[0, ...],[0, ...], ...],
"vertexDataBytes": [[0, ...],[0, ...], ...],
"totalBytes": [[0, ...],[0, ...], ...]
}
- codeBytes: The amount of memory used for code by a compute set. Because that code may be shared by several compute sets, these numbers cannot be added in a meaningful way.
- totalBytes: The sum of the above, for convenience. Because it includes codeBytes it cannot be added in a meaningful way.
Memory by vertex type¶
byVertexType is a breakdown of memory usage across the whole system, like
byComputeSet but for vertex types instead. The index into these arrays is
also an index into the top level vertexTypes object.
"byVertexType":{
"codeBytes": [[0, ...],[0, ...], ...],
"copyPtrBytes": [[0, ...],[0, ...], ...],
"descriptorBytes": [[0, ...],[0, ...], ...],
"edgePtrBytes": [[0, ...],[0, ...], ...],
"paddingBytes": [[0, ...],[0, ...], ...],
"totalBytes": [[0, ...],[0, ...], ...],
"vertexDataBytes": [[0, ...],[0, ...], ...]
}
7.2. Execution profile¶
The execution profile contains information about the programs that have been run since the execution profile was last reset. Because the profiling data varies for different target types and profiling methods, the entire object is a tagged union.
7.2.1. Profiler mode¶
The profilerMode is the tag for this object. It can be one of the following:
- NONE
- CPU
- IPU_MODEL
- COMPUTE_SETS
- SINGLE_TILE_COMPUTE_SETS
- VERTICES
- EXTERNAL_EXCHANGES
- HOST_EXCHANGES
It has the following fields, some of which are only present for certain modes.
COMPUTE_SETS
computeSetCyclesByTile: A 2D array indexed by compute set id, then tile, that gives the total number of cycles taken to execute that compute set on that tile.
SINGLE_TILE_COMPUTE_SETS
computeSetCycles: A 1D array indexed by compute set ID that gives the total number of cycles taken to execute that compute set on all tiles. For this mode an internal sync is inserted before and after the compute set.
VERTICES
- vertexCycles: A 1D array indexed by vertex ID that contains the number of cycles each vertex took the last time it was run.
- vertexComputeSet: A 1D array indexed by vertex ID giving the compute set the vertex is in.
- vertexType: A 1D array indexed by vertex ID giving an index into the list of vertex types.
EXTERNAL_EXCHANGES
externalExchangeCycles: A 2D array indexed by external exchange ID, and then tile, that gives the number of cycles used for each external (that is, from one IPU to another) exchange on each tile.
HOST_EXCHANGES
hostExchangeCycles: This is the same as externalExchangeCycles but for host<->IPU exchanges.
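The VERTICES-mode arrays can be combined, for example to total the measured cycles per compute set. A sketch with hand-made data (the helper name is ours; vertexCycles and vertexComputeSet are only present when profilerMode is VERTICES):

```python
def cycles_by_compute_set(exec_profile):
    """Aggregate per-vertex cycles by compute set, using the parallel
    vertexCycles and vertexComputeSet arrays (both indexed by vertex ID)."""
    totals = {}
    for cycles, cs in zip(exec_profile["vertexCycles"],
                          exec_profile["vertexComputeSet"]):
        totals[cs] = totals.get(cs, 0) + cycles
    return totals

# Three vertices: two in compute set 0, one in compute set 1.
exec_profile = {
    "profilerMode": "VERTICES",
    "vertexCycles": [120, 80, 40],
    "vertexComputeSet": [0, 0, 1],
}
print(cycles_by_compute_set(exec_profile))  # {0: 200, 1: 40}
```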
Additionally, for all modes except NONE and CPU the profile contains
program trace and simulation information.
7.2.2. Program trace information¶
programTrace is a 1D array of the program IDs that were run. These are indexes into programs in the graph profile.
7.2.3. Simulation information¶
simulation has a list of execution steps based on the simulation of the programs that are listed in programTrace. This information is redundant. It is calculated entirely from the graph profile and the programTrace, but it is included for convenience.
The fields of simulation are as follows.
- cycles: The total number of cycles it took to execute all of the programs in programTrace.
- tileCycles: The number of cycles spent doing each kind of activity. Unlike cycles this counts cycles from different tiles as distinct. That is, if two tiles both do a computation that takes 10 cycles in parallel, then cycles will be 10, but tileCycles.compute will be 20. activeCompute counts cycles where the active thread is computing, while compute counts cycles where the active thread or any of the other threads is computing.
"tileCycles":{
"activeCompute":1349,
"compute":8094,
"copySharedStructure":0,
"doExchange":2070,
"globalExchange":0,
"streamCopy":16,
"sync":26238
}
steps lists the compute, sync and exchange steps that are run. Each entry is a tagged union based on the type field, which may be one of OnTileExecute, StreamCopy, CopySharedStructure, Sync, DoExchange or GlobalExchange.
When running on actual hardware, the simulation uses computeSetCycles or
computeSetCyclesByTile for the compute set cycles. If hardware cycles are
not available (for example, under IPU_MODEL) then cycle estimates are used.
The other fields in each step depend on its type. Sync only contains the
sync type, External or Internal:
{
"syncType":"External",
"type":"sync"
}
All other types contain the following fields:
- type: The step type as described above.
- program: The program ID for this step (an index into programs).
- name: This field may be present if the program has a name. If the program has no name this field is omitted.
- tileBalance: A fraction from 0 to 1 that indicates how balanced computation was between the tiles. It is calculated as the total number of compute cycles used / (cycles * numTiles). If all tiles take the same number of cycles to finish then this will be 1.0. If, for example, you have one tile that takes 10 cycles and one that takes 5 then this will be 0.75.
- activeTiles: The number of tiles that are computing (or exchanging, for exchanges).
- activeTileBalance: The same as tileBalance but it ignores completely idle tiles.
- cycles: The number of cycles taken by the longest running tile. Because OnTileExecute calls can overlap with each other and with exchanges this may be non-zero even if the execution doesn't actually take any extra time.
- cyclesFrom: The first cycle number where this program was executing on any tile.
- cyclesTo: The last cycle number where this program was executing on any tile.
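The tileBalance formula can be checked with a short sketch (the function name is ours; the input is the per-tile compute cycles for one step):

```python
def tile_balance(compute_cycles_by_tile):
    """tileBalance as defined above: total compute cycles used divided
    by (longest tile's cycles * number of tiles)."""
    cycles = max(compute_cycles_by_tile)       # longest running tile
    num_tiles = len(compute_cycles_by_tile)
    return sum(compute_cycles_by_tile) / (cycles * num_tiles)

print(tile_balance([10, 5]))  # the worked example: 15 / (10 * 2) = 0.75
```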
The exchange types (DoExchange, StreamCopy, GlobalExchange and
CopySharedStructure) also contain these fields:
- totalData: The total amount of data transferred during the exchange.
- dataBalance: Exactly like tileBalance but for the amount of data sent and received by each tile, instead of cycles.
OnTileExecute also contains these fields:
- threadBalance: Similar in concept to tileBalance except it measures how well-utilised the hardware threads are. If you always run 6 threads or 0 threads this will be 1.0, even if the total computation on each tile takes a different amount of time.
- computeSet: The ID of the compute set executed by this step.
DoExchange, GlobalExchange and StreamCopy contain a field that is an
index into the corresponding exchange lists, called exchange,
externalExchange or hostExchange respectively.
Finally, OnTileExecute, DoExchange and CopySharedStructure contain
this field:
cyclesOverlapped: How many cycles were overlapped with previous steps.