7. Profiling data format
Warning
Profiling is a rapidly changing part of Poplar, so this information is subject to change without notice.
This section describes the format of the profiling data produced by Poplar. See Profiling for information on how to generate the profiling information.
The data is produced in JSON (JavaScript Object Notation) format, as separate files containing information about the graph and about the execution of the program. The graph profile can be created as soon as the graph has been compiled. The execution profile can be produced after the graph program has been run.
7.1. Graph profile
The graph profile is organised into the following areas:
Target information
Optimisation information
Graph information
Vertex types
Compute sets
Exchanges
Program structure
Memory use
These are described in detail in the following sections.
7.1.1. Target information

The target object contains some useful information about the target hardware.

type: The target type, which is one of CPU, IPU or IPU_MODEL.
bytesPerIPU: The number of bytes of memory on an IPU.
bytesPerTile: The number of bytes of memory on a tile.
clockFrequency: The tile clock frequency in Hertz.
numIPUs: The number of IPU chips in the system.
tilesPerIPU: The number of tiles on each IPU chip.
numTiles: The total number of tiles. This is the product of numIPUs and tilesPerIPU. It is stored redundantly for convenience.
totalMemory: The total memory. This is the product of bytesPerTile and numTiles (or bytesPerIPU and numIPUs). It is stored redundantly for convenience.
relativeSyncDelayByTile: The sync delay for each tile (relative to the minimum value).
minSyncDelay: The minimum sync delay for any tile.

The sync delay for a tile is the number of cycles it takes for the tile to send a sync request to the sync controller and receive a sync release signal back from the sync controller. It is smaller for tiles closer to the sync controller, and can be used to calculate how long a sync takes. The values are given for each tile on one IPU. In other words, there are tilesPerIPU values, not numTiles, because the sync delay values are the same on every IPU. The sync delay for each tile is given by minSyncDelay + relativeSyncDelayByTile[tile].
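As a concrete sketch of that formula (the values below are invented for a hypothetical 4-tile IPU; the field names are as documented above):

```python
# Hypothetical excerpt of the "target" section of a graph profile.
target = {
    "tilesPerIPU": 4,
    "minSyncDelay": 30,
    "relativeSyncDelayByTile": [0, 2, 5, 9],  # one entry per tile on an IPU
}

# Absolute sync delay for each tile:
#   minSyncDelay + relativeSyncDelayByTile[tile]
sync_delay = [target["minSyncDelay"] + rel
              for rel in target["relativeSyncDelayByTile"]]
```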
7.1.2. Optimisation information

optimizationInfo contains a map<string, double> of internal metrics related to compilation. The keys may change, but this will always be a map from strings to doubles.
7.1.3. Graph information

graph includes some basic information about the graph, such as the number of compute sets.

"graph": {
  "numComputeSets": 9,
  "numEdges": 24,
  "numVars": 111,
  "numVertices": 16
}
7.1.4. Vertex types

vertexTypes lists the vertex types that are actually used in the graph. There may be many more vertex types, but unused ones are ignored. In the rest of the profile data, references to vertex types are specified as an index into these arrays.

names: The names of the vertex types. This includes built-in vertices like poplar_rt::LongMemcpy.
sizes: The size of the vertex state (the class members) of each vertex type. For example, Doubler might have 4 bytes of state.
7.1.5. Compute sets

computeSets contains the names of, and the number of vertices in, each compute set. For the IPU_MODEL target it also includes a cycleEstimates field.

names: The name of each compute set. These are mainly for debugging purposes and are not necessarily unique. This includes compute sets generated during compilation.
vertexCounts and vertexTypes: The number of each type of vertex in the compute set. For each compute set there are vertexCounts[compute_set][i] vertices of type vertexTypes[compute_set][i]. The type is an index into the top-level "vertexTypes" array.
cycleEstimates: A cycle estimate is calculated for each vertex and then the vertices are scheduled in the same way that they would be run on real hardware. This results in three cycle estimates:
  activeCyclesByTile: The number of cycles during which a vertex was being run. Tiles have six hardware threads that are serviced in a round-robin fashion. If only one vertex is running then, out of every six cycles, only one cycle is "active" and the other five are idle. activeCyclesByTile counts the total number of active cycles in each compute set for each tile. It is indexed as [compute_set][tile].
  activeCyclesByVertexType: The total number of active cycles in each compute set, by vertex type. It is indexed as [compute_set][vertex_type] where vertex_type is an index into "vertexTypes".
  cyclesByTile: Similar to activeCyclesByTile but it also counts idle cycles where a thread is not executing. This therefore gives the actual number of cycles that each tile takes running this compute set.
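For example, a hypothetical cycleEstimates fragment (values invented for illustration) can be aggregated like this:

```python
# Hypothetical cycleEstimates fragment: two compute sets on two tiles.
cycle_estimates = {
    "activeCyclesByTile": [[120, 80], [40, 40]],  # [compute_set][tile]
    "cyclesByTile":       [[300, 300], [90, 60]], # includes idle cycles
}

# Total active cycles per compute set, summed over tiles.
active_per_cs = [sum(tiles) for tiles in cycle_estimates["activeCyclesByTile"]]

# A compute set lasts as long as its slowest tile, so the per-set
# duration is the per-tile maximum of cyclesByTile.
slowest_per_cs = [max(tiles) for tiles in cycle_estimates["cyclesByTile"]]
```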
7.1.6. Exchanges

exchanges lists some basic information about internal exchanges.

bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [internal_exchange_id][tile].
cyclesByTile: The number of cycles that each tile used for internal exchanges. It is indexed as [internal_exchange_id][tile]. This is known exactly for internal exchanges, which are statically scheduled.
externalExchanges lists the same information for IPU-to-IPU exchanges.

bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [external_exchange_id][tile].
bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [external_exchange_id][tile].
estimatedCyclesByTile: The estimated number of cycles that each tile used for exchanges with other IPUs. It is indexed as [external_exchange_id][tile].
hostExchanges lists the same information for exchanges between the host and IPU.

bytesReceivedByTile: The number of bytes received by each tile in the exchange. It is indexed as [host_exchange_id][tile].
bytesSentByTile: The number of bytes sent by each tile in the exchange. It is indexed as [host_exchange_id][tile].
estimatedCyclesByTile: The estimated number of cycles that each tile used for exchanges to or from the host. It is indexed as [host_exchange_id][tile].
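These per-tile arrays are straightforward to aggregate. As a sketch (with invented values), the total traffic of each internal exchange:

```python
# Hypothetical "exchanges" fragment: two internal exchanges, two tiles.
exchanges = {
    "bytesReceivedByTile": [[64, 0], [128, 256]],  # [internal_exchange_id][tile]
    "bytesSentByTile":     [[0, 64], [256, 128]],
}

# Total traffic per exchange. For internal exchanges the sent and
# received totals should match, since every byte sent lands on some tile.
total_received = [sum(tiles) for tiles in exchanges["bytesReceivedByTile"]]
total_sent = [sum(tiles) for tiles in exchanges["bytesSentByTile"]]
```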
7.1.7. Program structure
The graph profile includes a serialisation of the program structure. This can include some programs generated during compilation, such as exchange and sync operations, in addition to the programs explicitly specified in the source code.
programs is a flattened array of all the programs given to the engine. This includes control programs (programs the user has provided) and functions (internally generated programs that reduce code duplication).

The arrays controlPrograms and functionPrograms contain the indexes of control and function programs in the programs array. Normally user programs are wrapped in a single control program, so controlPrograms will nearly always contain only [0].
"controlPrograms":[0],
"functionPrograms":[31, 45],
Each entry in the programs array is a tagged union. The tag is type and indicates the type of the program. Each Program class generates one or more of these program type tags. In particular:

An Execute program produces an OnTileExecute tag. This may be preceded or followed by DoExchange or GlobalExchange if exchanges are needed before/after execution.

Copies produce DoExchange, GlobalExchange or StreamCopy tags, corresponding to internal exchange, inter-IPU exchange and host exchange respectively. These may be preceded or followed by OnTileExecute or DoExchange if data rearrangement is needed before/after the copy.
The type determines which other fields are present. The most useful are described below.

Programs that have sub-programs encode this with the children field (even those with a fixed number of children, like If). The sub-programs are specified as indexes into the programs array.
{
  "children": [4, 5, 6, 7, 8, 9],
  "type": "Sequence"
}
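Since sub-programs are plain indexes, the whole program tree can be walked with a small recursive helper. This sketch uses a hypothetical three-entry programs array:

```python
# Hypothetical flattened "programs" array: a Sequence with two children.
programs = [
    {"type": "Sequence", "children": [1, 2]},
    {"type": "OnTileExecute", "computeSet": 0},
    {"type": "Sync"},
]

def walk(programs, index, depth=0, out=None):
    """Depth-first walk over the program tree. Children are indexes
    into the flat programs array, so recursion follows those indexes."""
    if out is None:
        out = []
    prog = programs[index]
    out.append("  " * depth + prog["type"])
    for child in prog.get("children", []):
        walk(programs, child, depth + 1, out)
    return out
```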
The exchange programs (DoExchange, GlobalExchange and StreamCopy) reference the exchange ID, which is an index into exchanges, externalExchanges or hostExchanges respectively.

{
  "exchange": 1,
  "type": "StreamCopy"
}
DoExchange also includes a breakdown of the memory used by each type of exchange instruction.
{
  "exchange": 3,
  "memoryByInstruction": {
    "delay": 24,
    "exchange": 0,
    "other": 0,
    "receiveAddress": 16,
    "receiveFormat": 0,
    "receiveMux": 32,
    "send": 8,
    "sendWithPutAddress": 0,
    "sendWithPutFormat": 0,
    "sendWithPutMux": 0,
    "total": 80
  },
  "name": "/ExchangePre",
  "type": "DoExchange"
}
OnTileExecute contains the compute set ID, which is an index into the arrays in computeSets.

{
  "computeSet": 3,
  "type": "OnTileExecute"
}
Programs can have a name field:

{
  "exchange": 2,
  "name": "progIdCopy/GlobalPre/GlobalExchange",
  "type": "GlobalExchange"
}
Call programs call a sub-graph as a function. They contain an index into the functionPrograms array that identifies the function called.

{
  "target": 1,
  "type": "Call"
}
7.1.8. Memory use

The memory object contains a lot of information about memory use. All memory is statically allocated, so you don't need to run the program to gather this data.

The memory usage is reported for each tile, and also by category (what the memory is used for), by compute set and by vertex type. There is also a report of variable liveness, including a tree of the liveness for all possible call stacks (this is a finite list because recursion is not allowed).
There are two memory regions on each tile, interleaved and non-interleaved, and the use of each is reported separately. If the memory requirement is greater than the available memory, the excess is reported as overflowed. The memory usage in each region is provided both with and without gaps. Gaps arise because of memory allocation constraints, such as alignment requirements. For more information on the tile memory architecture, refer to the IPU Programmer's Guide.
The memory used by some variables can be overlapped with others, because they are not live at the same time. Hence, the usage is split into overlapped and nonOverlapped components.
For top-level replicated graphs (those created by Graph(target, replication_factor)) the memory use will be reported for a single replica (the memory used by all replicas will be identical).
Memory per tile

"byTile": {
  "interleaved": [ 536, 408 ],
  "interleavedIncludingGaps": [ 536, 408 ],
  "nonInterleaved": [ 19758, 3896 ],
  "nonInterleavedIncludingGaps": [ 65568, 19596 ],
  "overflowed": [ 0, 0 ],
  "overflowedIncludingGaps": [ 0, 0 ],
  "total": [ 20294, 3896 ],
  "totalIncludingGaps": [ 131608, 19596 ]
}
total: The sum of interleaved, nonInterleaved and overflowed. This is the total amount of memory used for data (not including padding) on each tile. However, memory constraints lead to padding, so more memory may actually be required; this is usually not the number you want.
totalIncludingGaps: The actual amount of memory required on each tile. This is not simply the sum of the previous "including gaps" figures, because adding those up does not take account of the gaps between the regions.

If any of these numbers is larger than the number of bytes per tile, then the program will not fit on the hardware.
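A quick sketch of that fit check, using the per-tile figures from the example above and an invented bytesPerTile value:

```python
# Hypothetical per-tile memory report for a two-tile target.
by_tile = {
    "totalIncludingGaps": [131608, 19596],
}
bytes_per_tile = 262144  # hypothetical value of target.bytesPerTile

# A tile overflows if its required memory (including gaps) exceeds
# the physical memory available on the tile.
overflowing = [tile for tile, used in enumerate(by_tile["totalIncludingGaps"])
               if used > bytes_per_tile]
```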
Memory by category

byCategory is a breakdown of memory usage across the whole system by the type of data and the region it is in.

"byCategory": {
  "controlCode": {
    "interleaved": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    },
    "nonInterleaved": {
      "nonOverlapped": [ 1216, 356 ],
      "overlapped": [ 0, 0 ]
    },
    "overflowed": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    },
    "total": [ 1216, 356 ]
  }
}
The categories are:

constant: Constants added by the user. Variables added by the compiler that happen to be constant will be in variable.
controlCode: Code for Programs and for running compute sets.
controlId: Program and sync IDs.
controlTable: A table that lists the vertices to run in each compute set. Only used if the table scheduler is enabled.
copyDescriptor: Copy descriptors are special variable-sized fields used by copy vertices.
globalExchangeCode: Code for performing exchange operations between IPUs.
globalExchangePacketHeader: Packet headers for inter-IPU exchanges.
globalMessage: Message data for inter-IPU exchanges.
hostExchangeCode: Code for performing exchange operations to and from the host.
hostExchangePacketHeader: Packet headers for host exchanges.
hostMessage: Message data for host exchanges.
instrumentationResults: Variables that store profiling information.
internalExchangeCode: Code for performing internal exchange operations.
message: Message data for internal exchanges.
multiple: Space shared by variables from multiple different categories.
outputEdge: Storage for output edge data before an exchange takes place.
rearrangement: Variables holding rearranged versions of tensor data.
sharedCodeStorage: Code shared by vertices.
sharedDataStorage: Data shared by vertices.
stack: Worker and supervisor stacks.
variable: Space allocated for variables in worker and supervisor code.
vectorListDescriptor: The data for VectorList<Input<...>, DeltaN> fields.
vertexCode: Code for vertex functions (codelets).
vertexFieldData: Variable-sized fields. For example, the data for Vector<float> or Vector<Input<...>> fields.
vertexInstanceState: Vertex class instances. This will be sizeof(VertexName) for each vertex.
Memory by compute set

byComputeSet is a breakdown of memory usage across the whole system. It includes several 2D arrays indexed by compute set, then by tile.

"byComputeSet": {
  "codeBytes": [[0, ...], [0, ...], ...],
  "copyPtrBytes": [[0, ...], [0, ...], ...],
  "descriptorBytes": [[0, ...], [0, ...], ...],
  "edgePtrBytes": [[0, ...], [0, ...], ...],
  "paddingBytes": [[0, ...], [0, ...], ...],
  "vertexDataBytes": [[0, ...], [0, ...], ...],
  "totalBytes": [[0, ...], [0, ...], ...]
}
codeBytes: The amount of memory used for code by a compute set. Because that code may be shared by several compute sets, these numbers cannot be added in a meaningful way.
totalBytes: The sum of the above, for convenience. Because it includes codeBytes it cannot be added in a meaningful way.
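The [compute_set][tile] indexing can be illustrated with a small fragment (values invented). Note that only the non-shared fields, such as vertexDataBytes, can be meaningfully summed across compute sets:

```python
# Hypothetical byComputeSet fragment: two compute sets, two tiles.
by_compute_set = {
    "vertexDataBytes": [[100, 50], [20, 0]],  # [compute_set][tile]
    "codeBytes": [[400, 400], [400, 400]],    # may be shared; do not sum
}

# Per-tile vertex data, summed over compute sets. This is meaningful
# because vertex data is not shared between compute sets.
vertex_data_per_tile = [
    sum(cs[tile] for cs in by_compute_set["vertexDataBytes"])
    for tile in range(2)
]
```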
Memory by vertex type

byVertexType is a breakdown of memory usage across the whole system, like byComputeSet but for vertex types instead. The index into these arrays is also an index into the top-level vertexTypes object.
"byVertexType": {
  "codeBytes": [[0, ...], [0, ...], ...],
  "copyPtrBytes": [[0, ...], [0, ...], ...],
  "descriptorBytes": [[0, ...], [0, ...], ...],
  "edgePtrBytes": [[0, ...], [0, ...], ...],
  "paddingBytes": [[0, ...], [0, ...], ...],
  "totalBytes": [[0, ...], [0, ...], ...],
  "vertexDataBytes": [[0, ...], [0, ...], ...]
}
7.2. Execution profile
The execution profile contains information about the programs that have been run since the execution profile was last reset. Because the profiling data varies for different target types and profiling methods, the entire object is a tagged union.
7.2.1. Profiler mode

The profilerMode is the tag for this object. It can be one of the following:
NONE
CPU
IPU_MODEL
COMPUTE_SETS
SINGLE_TILE_COMPUTE_SETS
VERTICES
EXTERNAL_EXCHANGES
HOST_EXCHANGES
It has the following fields, some of which are only present for certain modes.
COMPUTE_SETS

computeSetCyclesByTile: A 2D array indexed by compute set ID, then tile, that gives the total number of cycles taken to execute that compute set on that tile.

SINGLE_TILE_COMPUTE_SETS

computeSetCycles: A 1D array indexed by compute set ID that gives the total number of cycles taken to execute that compute set on all tiles. For this mode an internal sync is inserted before and after the compute set.

VERTICES

vertexCycles: A 1D array indexed by vertex ID that contains the number of cycles each vertex took the last time it was run.
vertexComputeSet: A 1D array indexed by vertex ID giving the compute set the vertex is in.
vertexType: A 1D array indexed by vertex ID giving an index into the list of vertex types.

EXTERNAL_EXCHANGES

externalExchangeCycles: A 2D array indexed by external exchange ID, then tile, that gives the number of cycles used for each external (that is, IPU-to-IPU) exchange on each tile.

HOST_EXCHANGES

hostExchangeCycles: The same as externalExchangeCycles but for host<->IPU exchanges.
Additionally, for all modes except NONE and CPU, the profile contains program trace and simulation information.
7.2.2. Program trace information

programTrace is a 1D array of the program IDs that were run. These are indexes into programs in the graph profile.
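For example, each trace entry can be resolved back to the program that ran (the programs array below is a hypothetical three-entry example):

```python
# Hypothetical graph-profile "programs" array and execution "programTrace".
programs = [
    {"type": "Sequence", "children": [1, 2]},
    {"type": "OnTileExecute", "computeSet": 0},
    {"type": "StreamCopy", "exchange": 0},
]
program_trace = [1, 2, 1]

# Each trace entry is an index into programs; resolve it to the
# type of the program that was run.
trace_types = [programs[i]["type"] for i in program_trace]
```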
7.2.3. Simulation information

simulation has a list of execution steps based on the simulation of the programs listed in programTrace. This information is redundant: it is calculated entirely from the graph profile and the programTrace, but it is included for convenience.

The fields of simulation are as follows.

cycles: The total number of cycles it took to execute all of the programs in programTrace.
tileCycles: The number of cycles spent doing each kind of activity. Unlike cycles, this counts cycles from different tiles as distinct. That is, if two tiles each do a computation that takes 10 cycles in parallel, then cycles will be 10 but tileCycles.compute will be 20. activeCompute counts compute cycles where the active thread is computing; compute counts compute cycles where the active thread or any of the other threads is computing.
"tileCycles": {
  "activeCompute": 1349,
  "compute": 8094,
  "copySharedStructure": 0,
  "doExchange": 2070,
  "globalExchange": 0,
  "streamCopy": 16,
  "sync": 26238
}
steps: Lists the compute, sync and exchange steps that are run. Each entry is a tagged union based on the type field, which may be one of OnTileExecute, StreamCopy, CopySharedStructure, Sync, DoExchange or GlobalExchange.
When running on actual hardware, the simulation uses computeSetCycles
or
computeSetCyclesByTile
for the compute set cycles. If hardware cycles are
not available (for example, under IPU_MODEL
) then cycle estimates are used.
The other fields in each step depend on its type. Sync only contains the sync type, External or Internal:

{
  "syncType": "External",
  "type": "sync"
}
All other types contain the following fields:

type: The step type as described above.
program: The program ID for this step (an index into programs).
name: This field may be present if the program has a name. If the program has no name this field is omitted.
tileBalance: A fraction from 0 to 1 which indicates how balanced the computation was between the tiles. It is calculated as the total number of compute cycles used / (cycles * numTiles). If all tiles take the same number of cycles to finish then this will be 1.0. If, for example, one tile takes 10 cycles and another takes 5, then this will be 0.75.
activeTiles: The number of tiles that are computing (or exchanging, for exchanges).
activeTileBalance: The same as tileBalance but it ignores completely idle tiles.
cycles: The number of cycles taken by the longest-running tile. Because OnTileExecute calls can overlap with each other and with exchanges, this may be non-zero even if the execution doesn't actually take any extra time.
cyclesFrom: The first cycle number where this program was executing on any tile.
cyclesTo: The last cycle number where this program was executing on any tile.
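The tileBalance calculation from the worked example above (one tile taking 10 cycles, another taking 5) can be sketched directly:

```python
# Hypothetical per-tile cycle counts for one step.
cycles_by_tile = [10, 5]

num_tiles = len(cycles_by_tile)
cycles = max(cycles_by_tile)  # the step lasts as long as the slowest tile

# tileBalance = total compute cycles / (cycles * numTiles).
# 1.0 means perfectly balanced computation across tiles.
tile_balance = sum(cycles_by_tile) / (cycles * num_tiles)
```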
The exchange types (DoExchange, StreamCopy, GlobalExchange and CopySharedStructure) also contain these fields:

totalData: The total amount of data transferred during the exchange.
dataBalance: Exactly like tileBalance but for the amount of data sent and received by each tile, instead of cycles.
OnTileExecute also contains these fields:

threadBalance: Similar in concept to tileBalance, except it measures how well utilised the hardware threads are. If you always run 6 threads or 0 threads this will be 1.0, even if the total computation on each tile takes a different amount of time.
computeSet: The ID of the compute set executed by this step.
DoExchange, GlobalExchange and StreamCopy contain a field that is an index into the corresponding exchange lists, called exchange, externalExchange or hostExchange respectively.
Finally, OnTileExecute, DoExchange and CopySharedStructure contain this field:

cyclesOverlapped: How many cycles were overlapped with previous steps.