7. Profiling data format¶
Warning
Profiling is a rapidly changing part of Poplar, so this information is subject to change without notice.
This section describes the format of the profiling data produced by Poplar. See Profiling for information on how to generate the profiling information.
The data is produced in JSON (JavaScript Object Notation) format, with separate files containing information about the graph and the execution of the program. The graph profile can be created as soon as the graph has been compiled. The execution profile can be produced after the graph program has been run.
7.1. Graph profile¶
The structure of the graph profile is organised in the following areas:
Target information
Optimisation information
Graph information
Vertex types
Compute sets
Exchanges
Program structure
Memory use
These are described in detail in the following sections.
7.1.1. Target information¶
The target object contains some useful information about the target hardware.
- type: The target type, which is one of CPU, IPU or IPU_MODEL.
- bytesPerIPU: The number of bytes of memory on an IPU.
- bytesPerTile: The number of bytes of memory on a tile.
- clockFrequency: The tile clock frequency in Hertz.
- numIPUs: The number of IPU chips in the system.
- tilesPerIPU: The number of tiles on each IPU chip.
- numTiles: The total number of tiles. This is the product of numIPUs and tilesPerIPU. It is stored redundantly for convenience.
- totalMemory: The total memory. This is the product of bytesPerTile and numTiles (or bytesPerIPU and numIPUs). It is stored redundantly for convenience.
- relativeSyncDelayByTile: The sync delay for each tile (relative to the minimum value).
- minSyncDelay: The minimum sync delay for any tile.
The sync delay for a tile is the number of cycles that it takes for the tile to
send a sync request to the sync controller and receive a sync release signal
back from the sync controller. It is smaller for tiles closer to the sync
controller. This can be used for calculating how long a sync takes. The values
are given for each tile on one IPU. In other words, there are tilesPerIPU
values, not numTiles, because the sync delay values are the same on every
IPU. The sync delay for each tile is given by minSyncDelay +
relativeSyncDelayByTile[tile].
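The calculation above can be sketched in Python. This assumes the graph profile has been loaded from JSON into a dict (for example with json.load); the helper name and the sample values are ours, not part of the format.

```python
def sync_delay(target, tile):
    """Absolute sync delay, in cycles, for a tile.

    `target` is the "target" object from the graph profile. The values
    repeat for every IPU, so index modulo tilesPerIPU.
    """
    relative = target["relativeSyncDelayByTile"]
    return target["minSyncDelay"] + relative[tile % target["tilesPerIPU"]]

# Hand-made target for illustration (not real hardware values).
target = {
    "minSyncDelay": 10,
    "tilesPerIPU": 4,
    "relativeSyncDelayByTile": [0, 1, 2, 3],
}
print(sync_delay(target, 0))  # tile closest to the sync controller -> 10
print(sync_delay(target, 6))  # tile 2 on a second IPU -> 12
```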
7.1.2. Optimisation information¶
optimizationInfo contains a map<string, double> of internal metrics
related to compilation. The keys may change but this will always be a map from
strings to doubles.
7.1.3. Graph information¶
graph includes some basic information about the graph, such as the number of
compute sets.
"graph":{
"numComputeSets":9,
"numEdges":24,
"numVars":111,
"numVertices":16
}
7.1.4. Vertex types¶
vertexTypes lists the vertex types that are actually used in the graph.
There may be many more vertex types but unused ones are ignored. In the rest of
the profile data, references to vertex types are specified as an index into
these arrays.
- names: Lists the names of the vertex types. This includes built-in vertices like poplar_rt::LongMemcpy.
- sizes: Contains the size of the vertex state (the class members) of each vertex type. For example Doubler might have 4 bytes of state.
7.1.5. Compute sets¶
computeSets contains the names of, and the number of vertices in, each
compute set. For the IPU_MODEL target it also includes a
cycleEstimates field.
- names: The name of each compute set. These are mainly for debugging purposes and are not necessarily unique. This includes compute sets generated during compilation.
- vertexCounts and vertexTypes: The number of each type of vertex in the compute set. For each compute set there are vertexCounts[compute_set][i] vertices of type vertexTypes[compute_set][i]. The type is an index into the top-level "vertexTypes" array.
- cycleEstimates: A cycle estimate is calculated for each vertex and then the vertices are scheduled in the same way that they would be run on real hardware. This results in three cycle estimate arrays:
  - activeCyclesByTile: The number of cycles during which a vertex was being run. Tiles have six hardware threads that are serviced in a round-robin fashion. If only one vertex is running then out of every six cycles only one cycle is "active", and the other five cycles are idle. activeCyclesByTile counts the total number of active cycles in each compute set for each tile. It is indexed as [compute_set][tile].
  - activeCyclesByVertexType: This is the total number of active cycles in each compute set, by vertex type. It is indexed as [compute_set][vertex_type] where vertex_type is an index into "vertexTypes".
  - cyclesByTile: This is similar to activeCyclesByTile but it also counts idle cycles where a thread is not executing. This therefore gives the actual number of cycles that each tile takes running this compute set.
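The relationship between the two per-tile arrays can be illustrated with a small sketch. This assumes an IPU_MODEL profile (so cycleEstimates is present); the function name and sample numbers are ours.

```python
def tile_occupancy(compute_sets, cs, tile):
    """Fraction of a tile's cycles in compute set `cs` spent actively
    running vertices: activeCyclesByTile / cyclesByTile."""
    est = compute_sets["cycleEstimates"]
    active = est["activeCyclesByTile"][cs][tile]
    total = est["cyclesByTile"][cs][tile]
    return active / total if total else 0.0

# Hand-made data: tile 0 runs a single vertex (1 active cycle in 6),
# tile 1 keeps all six hardware threads busy.
compute_sets = {
    "cycleEstimates": {
        "activeCyclesByTile": [[10, 60]],
        "cyclesByTile": [[60, 60]],
    }
}
print(tile_occupancy(compute_sets, 0, 1))  # fully occupied tile -> 1.0
```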
7.1.6. Exchanges¶
exchanges lists some basic information about internal exchanges.
- bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
- bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
- cyclesByTile: The number of cycles that each tile used for internal exchanges. It is indexed as [internal_exchange_id][tile]. This is known exactly for internal exchanges, which are statically scheduled.
externalExchanges lists the same information for IPU-to-IPU exchanges.
- bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [external_exchange_id][tile].
- bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [external_exchange_id][tile].
- estimatedCyclesByTile: The estimated number of cycles that each tile used for exchanges with other IPUs. It is indexed as [external_exchange_id][tile].
hostExchanges lists the same information for exchanges between the host and
IPU.
- bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [host_exchange_id][tile].
- bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [host_exchange_id][tile].
- estimatedCyclesByTile: The estimated number of cycles that each tile used for exchanges to or from the host. It is indexed as [host_exchange_id][tile].
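Because all three exchange lists share the same [exchange_id][tile] layout, one helper can aggregate traffic for any of them. A sketch, with hand-made sample data and a helper name of our choosing:

```python
def total_bytes_received(exchange_list, tile):
    """Sum the bytes received by one tile over every exchange in one of
    the "exchanges", "externalExchanges" or "hostExchanges" objects,
    each of which is indexed [exchange_id][tile]."""
    return sum(ex[tile] for ex in exchange_list["bytesReceivedByTile"])

# Two exchanges over two tiles (illustrative numbers only).
exchanges = {"bytesReceivedByTile": [[0, 128], [64, 64]]}
print(total_bytes_received(exchanges, 1))  # 128 + 64 -> 192
```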
7.1.7. Program structure¶
The graph profile includes a serialisation of the program structure. This can include some programs generated during compilation, such as exchange and sync operations, in addition to the programs explicitly specified in the source code.
programs is a flattened array of all the programs given to the engine. This
includes control programs (programs the user has provided) and functions
(internally generated programs to reduce code duplication).
The arrays controlPrograms and functionPrograms contain the indexes of
control and function programs in the programs array. Normally user programs
are wrapped in a single control program so controlPrograms will nearly
always contain only [0].
"controlPrograms":[0],
"functionPrograms":[31, 45],
Each entry in the programs array is a tagged union. The tag is type and
has to be one of the following values, to indicate the type of the program. The
following table summarises the tags generated by each program class.
| Program class | Program type tags |
|---|---|
| Execute | OnTileExecute. This may be preceded or followed by DoExchange or GlobalExchange if exchanges are needed before/after execution. |
| Copy | DoExchange, GlobalExchange or StreamCopy, corresponding to internal exchange, inter-IPU exchange and host exchange respectively. This may be preceded or followed by OnTileExecute or DoExchange if data rearrangement is needed before/after the copy. |
The type determines which other fields are present. The most useful are
described below.
Programs that have sub-programs encode this with the children field (even
those with a fixed number of children like If). The sub-programs are
specified as indexes into the programs array.
{
"children":[4,5,6,7,8,9],
"type":"Sequence"
}
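Since sub-programs are plain indexes into the flattened programs array, the program structure can be walked recursively. A minimal sketch (function name and sample data are ours; Call targets index functionPrograms rather than children, so this sketch does not follow them):

```python
def walk(programs, index, depth=0, out=None):
    """Depth-first walk of the flattened `programs` array, following
    `children` indexes. Returns a list of (depth, type) pairs."""
    if out is None:
        out = []
    prog = programs[index]
    out.append((depth, prog["type"]))
    for child in prog.get("children", []):
        walk(programs, child, depth + 1, out)
    return out

# A tiny hand-made programs array; real profiles are much larger.
programs = [
    {"type": "Sequence", "children": [1, 2]},
    {"type": "DoExchange", "exchange": 0},
    {"type": "OnTileExecute", "computeSet": 0},
]
print(walk(programs, 0))
# -> [(0, 'Sequence'), (1, 'DoExchange'), (1, 'OnTileExecute')]
```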
The exchange programs (DoExchange, GlobalExchange and StreamCopy)
reference the exchange ID, which is an index into exchanges,
externalExchanges or hostExchanges respectively.
{
"exchange":1,
"type":"StreamCopy"
}
DoExchange also includes a breakdown of the exchange code memory by type
of exchange instruction.
{
"exchange":3,
"memoryByInstruction":{
"delay":24,
"exchange":0,
"other":0,
"receiveAddress":16,
"receiveFormat":0,
"receiveMux":32,
"send":8,
"sendWithPutAddress":0,
"sendWithPutFormat":0,
"sendWithPutMux":0,
"total":80
},
"name":"/ExchangePre",
"type":"DoExchange"
}
OnTileExecute contains the compute set ID, which is an index into the arrays
in computeSets.
{
"computeSet":3,
"type":"OnTileExecute"
}
Programs can have a name field:
{
"exchange":2,
"name":"progIdCopy/GlobalPre/GlobalExchange",
"type":"GlobalExchange"
}
Call programs call a sub-graph as a function. They contain an index into the
functionPrograms array that identifies the function called.
{
"target":1,
"type":"Call"
}
7.1.8. Memory use¶
The memory object contains a lot of information about memory use. All memory
is statically allocated so you don’t need to run the program to gather this
data.
The memory usage is reported for each tile, and also by category (what the memory is used for), by compute set and by vertex type. There is also a report of variable liveness, including a tree of the liveness for all possible call stacks (this is a finite list because recursion is not allowed).
There are two memory regions on each tile, interleaved and non-interleaved; the use of each of these is reported separately. If the memory requirement is greater than the available memory, then this is reported as overflowed. The memory usage in each region is provided, both with and without gaps. Gaps arise because of memory allocation constraints, such as alignment requirements. For more information on the tile memory architecture, refer to the IPU Programmer's Guide.
The memory used by some variables can be overlapped with others, because they
are not live at the same time. Hence, the usage is split into overlapped and
nonOverlapped components.
For top-level replicated graphs (those created by Graph(target,
replication_factor)) the memory use will be reported for a single replica (the
memory used by all replicas will be identical).
Memory per tile¶
"byTile": {
"interleaved": [ 536, 408 ],
"interleavedIncludingGaps": [ 536, 408 ],
"nonInterleaved": [ 19758, 3896 ],
"nonInterleavedIncludingGaps": [ 65568, 19596 ],
"overflowed": [ 0, 0 ],
"overflowedIncludingGaps": [ 0, 0 ],
"total": [ 20294, 3896 ],
"totalIncludingGaps": [ 131608, 19596 ]
}
- total: The sum of interleaved, nonInterleaved and overflowed. This is the total amount of memory used for data (not including padding) on each tile. However, due to memory constraints leading to padding, more memory may actually be required. Therefore this is usually not the number you want.
- totalIncludingGaps: The actual amount of memory that is required on each tile. This is not simply the sum of the previous "including gaps" figures because adding those up does not take account of the gaps between the regions.
If any of these numbers is larger than the number of bytes per tile then the program will not fit on the hardware.
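That fit check can be written directly against the profile data. A sketch, assuming the memory object has been loaded from JSON; the function name and numbers are ours (bytesPerTile comes from the target information):

```python
def tiles_over_budget(memory, bytes_per_tile):
    """Indexes of tiles whose required memory (totalIncludingGaps)
    exceeds the physical memory per tile, meaning the program will not
    fit on the hardware."""
    usage = memory["byTile"]["totalIncludingGaps"]
    return [t for t, used in enumerate(usage) if used > bytes_per_tile]

# Values from the two-tile example above, with a hypothetical budget.
memory = {"byTile": {"totalIncludingGaps": [131608, 19596]}}
print(tiles_over_budget(memory, 262144))  # both tiles fit -> []
```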
Memory by category¶
byCategory is a breakdown of memory usage across the whole system by the
type of data, and the region it is in.
"byCategory":{
"controlCode": {
"interleaved": {
"nonOverlapped": [ 0, 0 ],
"overlapped": [ 0, 0 ]
},
"nonInterleaved": {
"nonOverlapped": [ 1216, 356 ],
"overlapped": [ 0, 0 ]
},
"overflowed": {
"nonOverlapped": [ 0, 0 ],
"overlapped": [ 0, 0 ]
},
"total": [ 1216, 356 ]
}
}
The categories are:

- constant: Constants added by the user. Variables added by the compiler that happen to be constant will be in variable.
- controlCode: Code for Programs and running compute sets.
- controlId: Program and sync IDs.
- controlTable: A table that lists the vertices to run in each compute set. Only used if the table scheduler is enabled.
- copyDescriptor: Copy descriptors are special variable-sized fields used by copy vertices.
- globalExchangeCode: Code for performing exchange operations between IPUs.
- globalExchangePacketHeader: Packet headers for inter-IPU exchanges.
- globalMessage: Message data for inter-IPU exchanges.
- hostExchangeCode: Code for performing exchange operations to and from the host.
- hostExchangePacketHeader: Packet headers for host exchanges.
- hostMessage: Message data for host exchanges.
- instrumentationResults: Variables to store profiling information.
- internalExchangeCode: Code for performing internal exchange operations.
- message: Message data for internal exchanges.
- multiple: Space shared by variables from multiple different categories.
- outputEdge: Storage for output edge data before an exchange takes place.
- rearrangement: Variables holding rearranged versions of tensor data.
- sharedCodeStorage: Code shared by vertices.
- sharedDataStorage: Data shared by vertices.
- stack: Worker and supervisor stacks.
- variable: Space allocated for variables in worker and supervisor code.
- vectorListDescriptor: The data for VectorList<Input<...>, DeltaN> fields.
- vertexCode: Code for vertex functions (codelets).
- vertexFieldData: Variable-sized fields. For example, the data for Vector<float> or Vector<Input<...>> fields.
- vertexInstanceState: Vertex class instances. This will be sizeof(VertexName) for each vertex.
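A system-wide summary per category can be produced by summing the per-tile total arrays shown above. A sketch with hand-made data (the helper name is ours):

```python
def category_totals(by_category):
    """System-wide bytes per category, summing each category's
    per-tile `total` array."""
    return {name: sum(entry["total"]) for name, entry in by_category.items()}

# Two categories over two tiles (illustrative numbers only).
by_category = {
    "controlCode": {"total": [1216, 356]},
    "vertexCode": {"total": [4000, 0]},
}
print(category_totals(by_category))  # {'controlCode': 1572, 'vertexCode': 4000}
```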
Memory by compute set¶
byComputeSet is a breakdown of memory usage across the whole system. It
includes several 2D arrays indexed by compute set, then by tile.
"byComputeSet": {
"codeBytes": [[0, ...],[0, ...], ...],
"copyPtrBytes": [[0, ...],[0, ...], ...],
"descriptorBytes": [[0, ...],[0, ...], ...],
"edgePtrBytes": [[0, ...],[0, ...], ...],
"paddingBytes": [[0, ...],[0, ...], ...],
"vertexDataBytes": [[0, ...],[0, ...], ...],
"totalBytes": [[0, ...],[0, ...], ...]
}
- codeBytes: The amount of memory used for code by a compute set. Because that code may be shared by several compute sets, these numbers cannot be added in a meaningful way.
- totalBytes: The sum of the above, for convenience. Because it includes codeBytes it cannot be added in a meaningful way.
Memory by vertex type¶
byVertexType is a breakdown of memory usage across the whole system, like
byComputeSet but for vertex types instead. The index into these arrays is
also an index into the top level vertexTypes object.
"byVertexType":{
"codeBytes": [[0, ...],[0, ...], ...],
"copyPtrBytes": [[0, ...],[0, ...], ...],
"descriptorBytes": [[0, ...],[0, ...], ...],
"edgePtrBytes": [[0, ...],[0, ...], ...],
"paddingBytes": [[0, ...],[0, ...], ...],
"totalBytes": [[0, ...],[0, ...], ...],
"vertexDataBytes": [[0, ...],[0, ...], ...]
}
7.2. Execution profile¶
The execution profile contains information about the programs that have been run since the execution profile was last reset. Because the profiling data varies for different target types and profiling methods, the entire object is a tagged union.
7.2.1. Profiler mode¶
The profilerMode is the tag for this object. It can be one of the following:
- NONE
- CPU
- IPU_MODEL
- COMPUTE_SETS
- SINGLE_TILE_COMPUTE_SETS
- VERTICES
- EXTERNAL_EXCHANGES
- HOST_EXCHANGES
It has the following fields, some of which are only present for certain modes.
COMPUTE_SETS
computeSetCyclesByTile: A 2D array indexed by compute set id, then tile, that gives the total number of cycles taken to execute that compute set on that tile.
SINGLE_TILE_COMPUTE_SETS
computeSetCycles: A 1D array indexed by compute set ID that gives the total number of cycles taken to execute that compute set on all tiles. For this mode an internal sync is inserted before and after the compute set.
VERTICES
- vertexCycles: A 1D array indexed by vertex ID that contains the number of cycles each vertex took the last time it was run.
- vertexComputeSet: A 1D array indexed by vertex ID giving the compute set the vertex is in.
- vertexType: A 1D array indexed by vertex ID giving an index into the list of vertex types.
EXTERNAL_EXCHANGES
externalExchangeCycles: A 2D array indexed by external exchange ID, and then tile, that gives the number of cycles used for each external (that is, from one IPU to another) exchange on each tile.
HOST_EXCHANGES
hostExchangeCycles: This is the same as externalExchangeCycles but for host<->IPU exchanges.
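The VERTICES-mode arrays can be combined, for example to total the measured cycles per compute set. A sketch with hand-made data (the helper name is ours; vertexCycles and vertexComputeSet are only present when profilerMode is VERTICES):

```python
def cycles_by_compute_set(exec_profile):
    """Aggregate per-vertex cycles by compute set, using the parallel
    vertexCycles and vertexComputeSet arrays (both indexed by vertex ID)."""
    totals = {}
    for cycles, cs in zip(exec_profile["vertexCycles"],
                          exec_profile["vertexComputeSet"]):
        totals[cs] = totals.get(cs, 0) + cycles
    return totals

# Three vertices: two in compute set 0, one in compute set 1.
exec_profile = {
    "profilerMode": "VERTICES",
    "vertexCycles": [120, 80, 40],
    "vertexComputeSet": [0, 0, 1],
}
print(cycles_by_compute_set(exec_profile))  # {0: 200, 1: 40}
```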
Additionally, for all modes except NONE and CPU the profile contains
program trace and simulation information.
7.2.2. Program trace information¶
programTrace is a 1D array of the program IDs that were run. These are indexes into programs in the graph profile.
7.2.3. Simulation information¶
simulation has a list of execution steps based on the simulation of the programs that are listed in programTrace. This information is redundant. It is calculated entirely from the graph profile and the programTrace, but it is included for convenience.
The fields of simulation are as follows.
- cycles: The total number of cycles it took to execute all of the programs in programTrace.
- tileCycles: The number of cycles spent doing each kind of activity. Unlike cycles this counts cycles from different tiles as distinct. That is, if two tiles both do a computation that takes 10 cycles in parallel, then cycles will be 10, but tileCycles.compute will be 20. activeCompute counts cycles where the active thread is computing, while compute counts cycles where the active thread or any of the other threads is computing.
"tileCycles":{
"activeCompute":1349,
"compute":8094,
"copySharedStructure":0,
"doExchange":2070,
"globalExchange":0,
"streamCopy":16,
"sync":26238
}
steps lists the compute, sync and exchange steps that are run. Each entry is a tagged union based on the type field, which may be one of OnTileExecute, StreamCopy, CopySharedStructure, Sync, DoExchange or GlobalExchange.
When running on actual hardware, the simulation uses computeSetCycles or
computeSetCyclesByTile for the compute set cycles. If hardware cycles are
not available (for example, under IPU_MODEL) then cycle estimates are used.
The other fields in each step depend on its type. Sync only contains the
sync type, External or Internal:
{
"syncType":"External",
"type":"sync"
}
All other types contain the following fields:
- type: The step type as described above.
- program: The program ID for this step (an index into programs).
- name: This field may be present if the program has a name. If the program has no name this field is omitted.
- tileBalance: A fraction from 0 to 1 that indicates how balanced computation was between the tiles. It is calculated as the total number of compute cycles used / (cycles * numTiles). If all tiles take the same number of cycles to finish then this will be 1.0. If, for example, you have one tile that takes 10 cycles and one that takes 5 then this will be 0.75.
- activeTiles: The number of tiles that are computing (or exchanging, for exchanges).
- activeTileBalance: The same as tileBalance but it ignores completely idle tiles.
- cycles: The number of cycles taken by the longest running tile. Because OnTileExecute calls can overlap with each other and with exchanges this may be non-zero even if the execution doesn't actually take any extra time.
- cyclesFrom: The first cycle number where this program was executing on any tile.
- cyclesTo: The last cycle number where this program was executing on any tile.
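The tileBalance formula can be checked with a short sketch (the function name is ours; the input is the per-tile compute cycles for one step):

```python
def tile_balance(compute_cycles_by_tile):
    """tileBalance as defined above: total compute cycles used divided
    by (longest tile's cycles * number of tiles)."""
    cycles = max(compute_cycles_by_tile)       # longest running tile
    num_tiles = len(compute_cycles_by_tile)
    return sum(compute_cycles_by_tile) / (cycles * num_tiles)

print(tile_balance([10, 5]))  # the worked example: 15 / (10 * 2) = 0.75
```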
The exchange types (DoExchange, StreamCopy, GlobalExchange and
CopySharedStructure) also contain these fields:
- totalData: The total amount of data transferred during the exchange.
- dataBalance: Exactly like tileBalance but for the amount of data sent and received by each tile, instead of cycles.
OnTileExecute also contains these fields:
- threadBalance: Similar in concept to tileBalance except it measures how well-utilised the hardware threads are. If you always run 6 threads or 0 threads this will be 1.0, even if the total computation on each tile takes a different amount of time.
- computeSet: The ID of the compute set executed by this step.
DoExchange, GlobalExchange and StreamCopy contain a field that is an
index into the corresponding exchange lists, called exchange,
externalExchange or hostExchange respectively.
Finally, OnTileExecute, DoExchange and CopySharedStructure contain
this field:
cyclesOverlapped: How many cycles were overlapped with previous steps.