3. Graph profile

The graph profile is organised into the following areas:

  • Target information

  • Optimisation information

  • Graph information

  • Vertex types

  • Compute sets

  • Exchanges

  • Program structure

  • Memory use

These are described in detail below.

3.1. Generating the report

After you have loaded your Graph into an Engine, you can get static profile information about the graph and the resources required. This includes cycle estimates (for an IPU Model) and memory information.

You can save the profiling information to a file for use by the Graph Analyser. For example:

#include <fstream>

// Save the static profile as JSON for the Graph Analyser.
poplar::ProfileValue graphProfile = engine.getGraphProfile();
std::ofstream graphFile;
graphFile.open("graph.json");
poplar::serializeToJSON(graphFile, graphProfile);
graphFile.close();

3.2. Contents of the report

3.2.1. Target information

target contains information about the target hardware.

Table 3.1 Target information

  type
      The target type, which is one of CPU, IPU or IPU_MODEL.

  bytesPerIPU
      The number of bytes of memory on an IPU.

  bytesPerTile
      The number of bytes of memory on a tile.

  clockFrequency
      The tile clock frequency in Hz.

  numIPUs
      The number of IPU chips in the system.

  tilesPerIPU
      The number of tiles on each IPU chip.

  numTiles
      The total number of tiles. This is the product of numIPUs and
      tilesPerIPU. It is stored redundantly for convenience.

  totalMemory
      The total memory. This is the product of bytesPerTile and numTiles
      (or bytesPerIPU and numIPUs). It is stored redundantly for
      convenience.

  relativeSyncDelayByTile
      The sync delay for each tile (relative to the minimum value).

  minSyncDelay
      The minimum sync delay for any tile.

The sync delay for a tile is the number of cycles it takes for the tile to send a sync request to the sync controller and receive a sync release signal back. It is smaller for tiles closer to the sync controller, and can be used to calculate how long a sync takes. The values are given for each tile on one IPU; in other words, there are tilesPerIPU values, not numTiles, because the sync delay values are the same on every IPU. The sync delay for a given tile is minSyncDelay + relativeSyncDelayByTile[tile].
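
For example, here is a minimal sketch that derives the absolute sync delay of every tile from the saved profile. It assumes the graph.json file written in Section 3.1, and uses the third-party nlohmann/json library (not part of Poplar) to parse it:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// Derive the absolute sync delay of each tile on one IPU:
// minSyncDelay + relativeSyncDelayByTile[tile].
std::vector<std::uint64_t> syncDelayByTile(const std::string &path) {
  std::ifstream in(path);
  nlohmann::json profile;
  in >> profile;
  const auto &target = profile["target"];
  const auto minDelay = target["minSyncDelay"].get<std::uint64_t>();
  std::vector<std::uint64_t> delays;
  for (const auto &rel : target["relativeSyncDelayByTile"]) {
    delays.push_back(minDelay + rel.get<std::uint64_t>());
  }
  return delays;
}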

3.2.2. Optimisation information

optimizationInfo contains a map<string, double> of internal metrics related to compilation. The keys may change but this will always be a map from strings to doubles.
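
As an illustrative sketch (again assuming the saved graph.json and the nlohmann/json library), the metrics can be listed like this:

#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  std::ifstream in("graph.json");
  nlohmann::json profile;
  in >> profile;
  // optimizationInfo is a flat map from metric name to double.
  for (const auto &[name, value] : profile["optimizationInfo"].items()) {
    std::cout << name << " = " << value.get<double>() << "\n";
  }
  return 0;
}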

3.2.3. Graph information

graph includes some basic information about the graph, such as the number of compute sets.

"graph":{
  "numComputeSets":9,
  "numEdges":24,
  "numVars":111,
  "numVertices":16
}

3.2.4. Vertex types

vertexTypes lists the vertex types that are actually used in the graph. There may be many more vertex types but unused ones are ignored. In the rest of the profile data, references to vertex types are specified as an index into these arrays (see the sketch after Table 3.2).

Table 3.2 Vertex types

  names
      Lists the names of the vertex types. This includes built-in vertices
      like poplar_rt::LongMemcpy.

  sizes
      Contains the size of the vertex state (the class members) of each
      vertex type. For example Doubler might have 4 bytes of state.
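
To illustrate how the parallel names and sizes arrays are indexed, here is a hypothetical helper (same graph.json and nlohmann/json assumptions as above):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Print each vertex type's index, name and vertex-state size. The index
// is how the rest of the profile refers to the type.
void printVertexTypes(const nlohmann::json &profile) {
  const auto &names = profile["vertexTypes"]["names"];
  const auto &sizes = profile["vertexTypes"]["sizes"];
  for (std::size_t i = 0; i < names.size(); ++i) {
    std::cout << i << ": " << names[i].get<std::string>() << " ("
              << sizes[i].get<std::uint64_t>() << " bytes of state)\n";
  }
}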

3.2.5. Compute sets

computeSets contains the names of the compute sets and the number of vertices in each. For the IPU_MODEL target it also includes a cycleEstimates field.

Table 3.3 Compute sets

  names
      The name of each compute set. These are mainly for debugging purposes
      and are not necessarily unique. This includes compute sets generated
      during compilation.

  vertexCounts, vertexTypes
      The number of each type of vertex in the compute set. For each
      compute set there are vertexCounts[compute_set][i] vertices of type
      vertexTypes[compute_set][i]. The type is an index into the top-level
      vertexTypes array (see the sketch after this table).

  cycleEstimates
      A cycle estimate is calculated for each vertex and then the vertices
      are scheduled in the same way that they would be run on real
      hardware. This results in three cycle estimates:

      • activeCyclesByTile: This is the number of cycles during which a
        vertex was being run. Tiles have six hardware threads that are
        serviced in a round-robin fashion. If only one vertex is running
        then out of every six cycles only one cycle is “active”, and the
        other five cycles are idle. activeCyclesByTile counts the total
        number of active cycles in each compute set for each tile. It is
        indexed as [compute_set][tile].

      • activeCyclesByVertexType: This is the total number of active cycles
        in each compute set, by vertex type. It is indexed as
        [compute_set][vertex_type] where vertex_type is an index into
        vertexTypes.

      • cyclesByTile: This is similar to activeCyclesByTile but it also
        counts idle cycles where a thread is not executing. This therefore
        gives the actual number of cycles that each tile takes running this
        compute set.
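
As a sketch of how these parallel arrays combine (hypothetical helper; same graph.json and nlohmann/json assumptions), the vertices in one compute set can be listed as follows:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Print how many vertices of each type a compute set contains, resolving
// each per-compute-set type index into the top-level vertexTypes names.
void printComputeSetVertices(const nlohmann::json &profile, std::size_t cs) {
  const auto &typeNames = profile["vertexTypes"]["names"];
  const auto &computeSets = profile["computeSets"];
  const auto &counts = computeSets["vertexCounts"][cs];
  const auto &types = computeSets["vertexTypes"][cs];
  std::cout << computeSets["names"][cs].get<std::string>() << ":\n";
  for (std::size_t i = 0; i < counts.size(); ++i) {
    std::cout << "  " << counts[i].get<std::uint64_t>() << " x "
              << typeNames[types[i].get<std::size_t>()].get<std::string>()
              << "\n";
  }
}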

3.2.6. Exchanges

exchanges lists some basic information about internal exchanges.

Table 3.4 Exchange information

  bytesReceivedByTile
      The number of bytes received by each tile in the exchange. It is
      indexed as [internal_exchange_id][tile].

  bytesSentByTile
      The number of bytes sent by each tile in the exchange. It is indexed
      as [internal_exchange_id][tile].

  cyclesByTile
      The number of cycles that each tile used for internal exchanges. It
      is indexed as [internal_exchange_id][tile]. This is known exactly for
      internal exchanges, which are statically scheduled.
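
For instance, a hypothetical helper (same graph.json and nlohmann/json assumptions) can total the traffic of each internal exchange across all tiles:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <nlohmann/json.hpp>

// Sum bytesReceivedByTile over all tiles for every internal exchange.
void printExchangeTotals(const nlohmann::json &profile) {
  const auto &received = profile["exchanges"]["bytesReceivedByTile"];
  for (std::size_t id = 0; id < received.size(); ++id) {
    std::uint64_t total = 0;
    for (const auto &bytes : received[id]) {
      total += bytes.get<std::uint64_t>();
    }
    std::cout << "exchange " << id << ": " << total << " bytes received\n";
  }
}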

externalExchanges lists the same information for IPU-to-IPU exchanges.

Table 3.5 External exchange information

  bytesReceivedByTile
      The number of bytes received by each tile in the exchange. It is
      indexed as [external_exchange_id][tile].

  bytesSentByTile
      The number of bytes sent by each tile in the exchange. It is indexed
      as [external_exchange_id][tile].

  estimatedCyclesByTile
      The estimated number of cycles that each tile used for exchanges with
      other IPUs. It is indexed as [external_exchange_id][tile].

hostExchanges lists the same information for exchanges between the host and IPU.

Table 3.6 Host exchange information

  bytesReceivedByTile
      The number of bytes received by each tile in the exchange. It is
      indexed as [host_exchange_id][tile].

  bytesSentByTile
      The number of bytes sent by each tile in the exchange. It is indexed
      as [host_exchange_id][tile].

  estimatedCyclesByTile
      The estimated number of cycles that each tile used for exchanges to
      or from the host. It is indexed as [host_exchange_id][tile].

3.2.7. Program structure

The graph profile includes a serialisation of the program structure. This can include some programs generated during compilation, such as exchange and sync operations, in addition to the programs explicitly specified in the source code.

programs is a flattened array of all the programs given to the engine. This includes control programs (programs the user has provided) and functions (internally generated programs to reduce code duplication).

The arrays controlPrograms and functionPrograms contain the indexes of control and function programs in the programs array. Normally user programs are wrapped in a single control program so controlPrograms will nearly always contain only [0].

"controlPrograms":[0],
"functionPrograms":[31, 45],

Each entry in the programs array is a tagged union: the type field indicates the type of the program. The following table summarises the tags generated by each program class.

Table 3.7 Programs

  Execute
      OnTileExecute. This may be preceded or followed by DoExchange or
      GlobalExchange if exchanges are needed before/after execution.

  Repeat
      Repeat

  RepeatWhileTrue, RepeatWhileFalse
      RepeatWhile

  If
      One of the following program types:

        1. SetLocalConsensusFromVar

        2. Sync

        3. If or IfElse

  Switch
      Switch

  Sequence
      Sequence

  Copy
      DoExchange, GlobalExchange or StreamCopy, corresponding to internal
      exchange, inter-IPU exchange and host exchange respectively. This may
      be preceded or followed by OnTileExecute or DoExchange if data
      rearrangement is needed before/after the copy.

  WriteUndef
      WriteUndef

  Sync
      Sync

  Call
      Call

  PrintTensor
      StreamCopy

The type determines which other fields are present. The most useful are described below.

Programs that have sub-programs encode this with the children field (even those with a fixed number of children like If). The sub-programs are specified as indexes into the programs array.

{
  "children":[4,5,6,7,8,9],
  "type":"Sequence"
}
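
A minimal sketch (hypothetical helper; same graph.json and nlohmann/json assumptions) that walks the program tree by following these child indexes:

#include <cstddef>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Recursively print the program tree. `programs` is the flat top-level
// array; each entry in `children` is an index back into that array.
void printProgram(const nlohmann::json &programs, std::size_t index,
                  int depth = 0) {
  const auto &prog = programs[index];
  std::cout << std::string(depth * 2, ' ')
            << prog["type"].get<std::string>() << "\n";
  if (prog.contains("children")) {
    for (const auto &child : prog["children"]) {
      printProgram(programs, child.get<std::size_t>(), depth + 1);
    }
  }
}

// Typically called once per control program, for example:
//   printProgram(profile["programs"], 0);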

The exchange programs (DoExchange, GlobalExchange and StreamCopy) reference the exchange ID, which is an index into exchanges, externalExchanges or hostExchanges respectively.

{
  "exchange":1,
  "type":"StreamCopy"
}

DoExchange also includes a breakdown of the memory used by each type of exchange instruction.

{
  "exchange":3,
  "memoryByInstruction":{
    "delay":24,
    "exchange":0,
    "other":0,
    "receiveAddress":16,
    "receiveFormat":0,
    "receiveMux":32,
    "send":8,
    "sendWithPutAddress":0,
    "sendWithPutFormat":0,
    "sendWithPutMux":0,
    "total":80
  },
  "name":"/ExchangePre",
  "type":"DoExchange"
}

OnTileExecute contains the compute set ID, which is an index into the arrays in computeSets.

{
  "computeSet":3,
  "type":"OnTileExecute"
}

Programs can have a name field:

{
  "exchange":2,
  "name":"progIdCopy/GlobalPre/GlobalExchange",
  "type":"GlobalExchange"
}

Call programs call a sub-graph as a function. They contain an index into the functionPrograms array that identifies the function called.

{
  "target":1,
  "type":"Call"
}

3.2.8. Memory use

The memory object contains detailed information about memory use. All memory is statically allocated, so you don’t need to run the program to gather this data.

The memory usage is reported for each tile, and also by category (what the memory is used for), by compute set and by vertex type. There is also a report of variable liveness, including a tree of the liveness for all possible call stacks (this is a finite list because recursion is not allowed).

See the section on variable liveness for more information about the liveness of variables in Poplar.

There are two memory regions on each tile, interleaved and non-interleaved; the use of each is reported separately. If the memory requirement is greater than the available memory, this is reported as overflowed. The memory usage in each region is provided both with and without gaps. Gaps arise from memory allocation constraints, such as alignment requirements. For more information on the tile memory architecture, refer to the IPU Programmer’s Guide.

The memory used by some variables can be overlapped with others, because they are not live at the same time. Hence, the usage is split into overlapped and nonOverlapped components.

For top-level replicated graphs (those created by Graph(target, replication_factor)) the memory use will be reported for a single replica (the memory used by all replicas will be identical).

Memory per tile

"byTile": {
  "interleaved": [ 536, 408 ],
  "interleavedIncludingGaps": [ 536, 408 ],
  "nonInterleaved": [ 19758, 3896 ],
  "nonInterleavedIncludingGaps": [ 65568, 19596 ],
  "overflowed": [ 0, 0 ],
  "overflowedIncludingGaps": [ 0, 0 ],
  "total": [ 20294, 3896 ],
  "totalIncludingGaps": [ 131608, 19596 ]
}

Table 3.8 Memory use per tile

  total
      The sum of interleaved, nonInterleaved and overflowed. This is the
      total amount of memory used for data (not including padding) on each
      tile. However, due to memory constraints leading to padding, more
      memory may actually be required, so this is usually not the number
      you want.

  totalIncludingGaps
      The actual amount of memory that is required on each tile. This is
      not simply the sum of the previous “including gaps” figures because
      adding those up does not take account of the gaps between the
      regions.

If any of these numbers is larger than the number of bytes per tile then the program will not fit on the hardware.
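
That check can be sketched as follows (hypothetical helper; same graph.json and nlohmann/json assumptions):

#include <cstdint>
#include <nlohmann/json.hpp>

// Returns true if every tile's memory requirement (including gaps) fits
// within the physical memory available on a tile.
bool fitsOnHardware(const nlohmann::json &profile) {
  const auto bytesPerTile =
      profile["target"]["bytesPerTile"].get<std::uint64_t>();
  for (const auto &used : profile["memory"]["byTile"]["totalIncludingGaps"]) {
    if (used.get<std::uint64_t>() > bytesPerTile) {
      return false;
    }
  }
  return true;
}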

Memory by category

byCategory is a breakdown of memory usage across the whole system by the type of data, and the region it is in.

"byCategory":{
  "controlCode": {
    "interleaved": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    },
    "nonInterleaved": {
      "nonOverlapped": [ 1216, 356 ],
      "overlapped": [ 0, 0 ]
    },
    "overflowed": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    },
    "total": [ 1216, 356 ]
  }
}

See Storage categories for the full list of categories.

Memory by compute set

byComputeSet is a breakdown of memory usage across the whole system. It includes several 2D arrays indexed by compute set, then by tile.

"byComputeSet": {
  "codeBytes": [[0, ...],[0, ...], ...],
  "copyPtrBytes": [[0, ...],[0, ...], ...],
  "descriptorBytes": [[0, ...],[0, ...], ...],
  "edgePtrBytes": [[0, ...],[0, ...], ...],
  "paddingBytes": [[0, ...],[0, ...], ...],
  "vertexDataBytes": [[0, ...],[0, ...], ...],
  "totalBytes": [[0, ...],[0, ...], ...]
}

Table 3.9 Memory use by compute set

  codeBytes
      The amount of memory used for code by a compute set. Because that
      code may be shared by several compute sets, these numbers cannot be
      added in a meaningful way.

  totalBytes
      The sum of the above for convenience. Because it includes codeBytes,
      it cannot be added in a meaningful way.

Memory by vertex type

byVertexType is a breakdown of memory usage across the whole system, like byComputeSet but for vertex types instead. The index into these arrays is also an index into the top-level vertexTypes object.

"byVertexType":{
  "codeBytes": [[0, ...],[0, ...], ...],
  "copyPtrBytes": [[0, ...],[0, ...], ...],
  "descriptorBytes": [[0, ...],[0, ...], ...],
  "edgePtrBytes": [[0, ...],[0, ...], ...],
  "paddingBytes": [[0, ...],[0, ...], ...],
  "totalBytes": [[0, ...],[0, ...], ...],
  "vertexDataBytes": [[0, ...],[0, ...], ...]
}