3. Graph profile

The graph profile is organised into the following areas:

  • Target information

  • Optimisation information

  • Graph information

  • Vertex types

  • Compute sets

  • Exchanges

  • Program structure

  • Memory use

These are described in detail below.

3.1. Generating the report

After you have loaded your Graph into an Engine, you can get static profile information about the graph and the resources required. This includes cycle estimates (for an IPU Model) and memory information.

You can save the profiling information to a file for use by the Graph Analyser. For example:

#include <fstream>

// Save the static profile as JSON for the Graph Analyser.
poplar::ProfileValue graphProfile = engine.getGraphProfile();
std::ofstream graphFile;
graphFile.open("graph.json");
poplar::serializeToJSON(graphFile, graphProfile);
graphFile.close();

3.2. Contents of the report

3.2.1. Target information

target contains information about the target hardware.

Table 3.1 Target information

  type
      The target type, which is one of CPU, IPU or IPU_MODEL.

  bytesPerIPU
      The number of bytes of memory on an IPU.

  bytesPerTile
      The number of bytes of memory on a tile.

  clockFrequency
      The tile clock frequency in Hz.

  numIPUs
      The number of IPU chips in the system.

  tilesPerIPU
      The number of tiles on each IPU chip.

  numTiles
      The total number of tiles. This is the product of numIPUs and
      tilesPerIPU. It is stored redundantly for convenience.

  totalMemory
      The total memory. This is the product of bytesPerTile and numTiles
      (or bytesPerIPU and numIPUs). It is stored redundantly for
      convenience.

  relativeSyncDelayByTile
      The sync delay for each tile (relative to the minimum value).

  minSyncDelay
      The minimum sync delay for any tile.

The sync delay for a tile is the number of cycles it takes for the tile to send a sync request to the sync controller and receive a sync release signal back. It is smaller for tiles closer to the sync controller, and can be used to calculate how long a sync takes. The values are given for each tile on one IPU; in other words, there are tilesPerIPU values, not numTiles, because the sync delay values are the same on every IPU. The sync delay for a given tile is minSyncDelay + relativeSyncDelayByTile[tile].
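
For example, here is a minimal sketch that derives the absolute sync delay of every tile from the saved profile. It assumes the graph.json file written in Section 3.1, and uses the third-party nlohmann/json library (not part of Poplar) to parse it:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

// Derive the absolute sync delay of each tile on one IPU:
// minSyncDelay + relativeSyncDelayByTile[tile].
std::vector<std::uint64_t> syncDelayByTile(const std::string &path) {
  std::ifstream in(path);
  nlohmann::json profile;
  in >> profile;
  const auto &target = profile["target"];
  const auto minDelay = target["minSyncDelay"].get<std::uint64_t>();
  std::vector<std::uint64_t> delays;
  for (const auto &rel : target["relativeSyncDelayByTile"]) {
    delays.push_back(minDelay + rel.get<std::uint64_t>());
  }
  return delays;
}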

3.2.2. Optimisation information

optimizationInfo contains a map<string, double> of internal metrics related to compilation. The keys may change but this will always be a map from strings to doubles.
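
As an illustrative sketch (again assuming the saved graph.json and the nlohmann/json library), the metrics can be listed like this:

#include <fstream>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  std::ifstream in("graph.json");
  nlohmann::json profile;
  in >> profile;
  // optimizationInfo is a flat map from metric name to double.
  for (const auto &[name, value] : profile["optimizationInfo"].items()) {
    std::cout << name << " = " << value.get<double>() << "\n";
  }
  return 0;
}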

3.2.3. Graph information

graph includes some basic information about the graph, such as the number of compute sets.

"graph":{
  "numComputeSets":9,
  "numEdges":24,
  "numVars":111,
  "numVertices":16
}

3.2.4. Vertex types

vertexTypes lists the vertex types that are actually used in the graph. There may be many more vertex types but unused ones are ignored. In the rest of the profile data, references to vertex types are specified as an index into these arrays (see the sketch after Table 3.2).

Table 3.2 Vertex types

  names
      Lists the names of the vertex types. This includes built-in vertices
      like poplar_rt::LongMemcpy.

  sizes
      Contains the size of the vertex state (the class members) of each
      vertex type. For example Doubler might have 4 bytes of state.
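
To illustrate how the parallel names and sizes arrays are indexed, here is a hypothetical helper (same graph.json and nlohmann/json assumptions as above):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Print each vertex type's index, name and vertex-state size. The index
// is how the rest of the profile refers to the type.
void printVertexTypes(const nlohmann::json &profile) {
  const auto &names = profile["vertexTypes"]["names"];
  const auto &sizes = profile["vertexTypes"]["sizes"];
  for (std::size_t i = 0; i < names.size(); ++i) {
    std::cout << i << ": " << names[i].get<std::string>() << " ("
              << sizes[i].get<std::uint64_t>() << " bytes of state)\n";
  }
}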

3.2.5. Compute sets

computeSets contains the names of the compute sets and the number of vertices in each. For the IPU_MODEL target it also includes a cycleEstimates field.

Table 3.3 Compute sets

  names
      The name of each compute set. These are mainly for debugging purposes
      and are not necessarily unique. This includes compute sets generated
      during compilation.

  vertexCounts, vertexTypes
      The number of each type of vertex in the compute set. For each
      compute set there are vertexCounts[compute_set][i] vertices of type
      vertexTypes[compute_set][i]. The type is an index into the top-level
      vertexTypes array (see the sketch after this table).

  cycleEstimates
      A cycle estimate is calculated for each vertex and then the vertices
      are scheduled in the same way that they would be run on real
      hardware. This results in three cycle estimates:

      • activeCyclesByTile: This is the number of cycles during which a
        vertex was being run. Tiles have six hardware threads that are
        serviced in a round-robin fashion. If only one vertex is running
        then out of every six cycles only one cycle is “active”, and the
        other five cycles are idle. activeCyclesByTile counts the total
        number of active cycles in each compute set for each tile. It is
        indexed as [compute_set][tile].

      • activeCyclesByVertexType: This is the total number of active cycles
        in each compute set, by vertex type. It is indexed as
        [compute_set][vertex_type] where vertex_type is an index into
        vertexTypes.

      • cyclesByTile: This is similar to activeCyclesByTile but it also
        counts idle cycles where a thread is not executing. This therefore
        gives the actual number of cycles that each tile takes running this
        compute set.
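
As a sketch of how these parallel arrays combine (hypothetical helper; same graph.json and nlohmann/json assumptions), the vertices in one compute set can be listed as follows:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Print how many vertices of each type a compute set contains, resolving
// each per-compute-set type index into the top-level vertexTypes names.
void printComputeSetVertices(const nlohmann::json &profile, std::size_t cs) {
  const auto &typeNames = profile["vertexTypes"]["names"];
  const auto &computeSets = profile["computeSets"];
  const auto &counts = computeSets["vertexCounts"][cs];
  const auto &types = computeSets["vertexTypes"][cs];
  std::cout << computeSets["names"][cs].get<std::string>() << ":\n";
  for (std::size_t i = 0; i < counts.size(); ++i) {
    std::cout << "  " << counts[i].get<std::uint64_t>() << " x "
              << typeNames[types[i].get<std::size_t>()].get<std::string>()
              << "\n";
  }
}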

3.2.6. Exchanges

exchanges lists some basic information about internal exchanges.

Table 3.4 Exchange information

  bytesReceivedByTile
      The number of bytes received by each tile in the exchange. It is
      indexed as [internal_exchange_id][tile].

  bytesSentByTile
      The number of bytes sent by each tile in the exchange. It is indexed
      as [internal_exchange_id][tile].

  cyclesByTile
      The number of cycles that each tile used for internal exchanges. It
      is indexed as [internal_exchange_id][tile]. This is known exactly for
      internal exchanges, which are statically scheduled.
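
For instance, a hypothetical helper (same graph.json and nlohmann/json assumptions) can total the traffic of each internal exchange across all tiles:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <nlohmann/json.hpp>

// Sum bytesReceivedByTile over all tiles for every internal exchange.
void printExchangeTotals(const nlohmann::json &profile) {
  const auto &received = profile["exchanges"]["bytesReceivedByTile"];
  for (std::size_t id = 0; id < received.size(); ++id) {
    std::uint64_t total = 0;
    for (const auto &bytes : received[id]) {
      total += bytes.get<std::uint64_t>();
    }
    std::cout << "exchange " << id << ": " << total << " bytes received\n";
  }
}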

externalExchanges lists the same information for IPU-to-IPU exchanges.

Table 3.5 External exchange information

  bytesReceivedByTile
      The number of bytes received by each tile in the exchange. It is
      indexed as [external_exchange_id][tile].

  bytesSentByTile
      The number of bytes sent by each tile in the exchange. It is indexed
      as [external_exchange_id][tile].

  estimatedCyclesByTile
      The estimated number of cycles that each tile used for exchanges with
      other IPUs. It is indexed as [external_exchange_id][tile].

hostExchanges lists the same information for exchanges between the host and IPU.

Table 3.6 Host exchange information

  bytesReceivedByTile
      The number of bytes received by each tile in the exchange. It is
      indexed as [host_exchange_id][tile].

  bytesSentByTile
      The number of bytes sent by each tile in the exchange. It is indexed
      as [host_exchange_id][tile].

  estimatedCyclesByTile
      The estimated number of cycles that each tile used for exchanges to
      or from the host. It is indexed as [host_exchange_id][tile].

3.2.7. Program structure

The graph profile includes a serialisation of the program structure. This can include some programs generated during compilation, such as exchange and sync operations, in addition to the programs explicitly specified in the source code.

programs is a flattened array of all the programs given to the engine. This includes control programs (programs the user has provided) and functions (internally generated programs to reduce code duplication).

The arrays controlPrograms and functionPrograms contain the indexes of control and function programs in the programs array. Normally user programs are wrapped in a single control program so controlPrograms will nearly always contain only [0].

"controlPrograms":[0],
"functionPrograms":[31, 45],

Each entry in the programs array is a tagged union: the type field indicates the type of the program. The following table summarises the tags generated by each program class.

Table 3.7 Programs

  Execute
      OnTileExecute. This may be preceded or followed by DoExchange or
      GlobalExchange if exchanges are needed before/after execution.

  Repeat
      Repeat

  RepeatWhileTrue, RepeatWhileFalse
      RepeatWhile

  If
      One of the following program types:

        1. SetLocalConsensusFromVar

        2. Sync

        3. If or IfElse

  Switch
      Switch

  Sequence
      Sequence

  Copy
      DoExchange, GlobalExchange or StreamCopy, corresponding to internal
      exchange, inter-IPU exchange and host exchange respectively. This may
      be preceded or followed by OnTileExecute or DoExchange if data
      rearrangement is needed before/after the copy.

  WriteUndef
      WriteUndef

  Sync
      Sync

  Call
      Call

  PrintTensor
      StreamCopy

The type determines which other fields are present. The most useful are described below.

Programs that have sub-programs encode this with the children field (even those with a fixed number of children like If). The sub-programs are specified as indexes into the programs array.

{
  "children":[4,5,6,7,8,9],
  "type":"Sequence"
}
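
A minimal sketch (hypothetical helper; same graph.json and nlohmann/json assumptions) that walks the program tree by following these child indexes:

#include <cstddef>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

// Recursively print the program tree. `programs` is the flat top-level
// array; each entry in `children` is an index back into that array.
void printProgram(const nlohmann::json &programs, std::size_t index,
                  int depth = 0) {
  const auto &prog = programs[index];
  std::cout << std::string(depth * 2, ' ')
            << prog["type"].get<std::string>() << "\n";
  if (prog.contains("children")) {
    for (const auto &child : prog["children"]) {
      printProgram(programs, child.get<std::size_t>(), depth + 1);
    }
  }
}

// Typically called once per control program, for example:
//   printProgram(profile["programs"], 0);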

The exchange programs (DoExchange, GlobalExchange and StreamCopy) reference the exchange ID, which is an index into exchanges, externalExchanges or hostExchanges respectively.

{
  "exchange":1,
  "type":"StreamCopy"
}

DoExchange also includes a breakdown of the memory used by each type of exchange instruction.

{
  "exchange":3,
  "memoryByInstruction":{
    "delay":24,
    "exchange":0,
    "other":0,
    "receiveAddress":16,
    "receiveFormat":0,
    "receiveMux":32,
    "send":8,
    "sendWithPutAddress":0,
    "sendWithPutFormat":0,
    "sendWithPutMux":0,
    "total":80
  },
  "name":"/ExchangePre",
  "type":"DoExchange"
}

OnTileExecute contains the compute set ID, which is an index into the arrays in computeSets.

{
  "computeSet":3,
  "type":"OnTileExecute"
}

Programs can have a name field:

{
  "exchange":2,
  "name":"progIdCopy/GlobalPre/GlobalExchange",
  "type":"GlobalExchange"
}

Call programs call a sub-graph as a function. They contain an index into the functionPrograms array that identifies the function called.

{
  "target":1,
  "type":"Call"
}

3.2.8. Memory use

The memory object contains detailed information about memory use. All memory is statically allocated, so you don’t need to run the program to gather this data.

The memory usage is reported for each tile, and also by category (what the memory is used for), by compute set and by vertex type. There is also a report of variable liveness, including a tree of the liveness for all possible call stacks (this is a finite list because recursion is not allowed).

See the section on variable liveness for more information about the liveness of variables in Poplar.

There are two memory regions on each tile, interleaved and non-interleaved; the use of each is reported separately. If the memory requirement is greater than the available memory, this is reported as overflowed. The memory usage in each region is provided both with and without gaps. Gaps arise from memory allocation constraints, such as alignment requirements. For more information on the tile memory architecture, refer to the IPU Programmer’s Guide.

The memory used by some variables can be overlapped with others, because they are not live at the same time. Hence, the usage is split into overlapped and nonOverlapped components.

For top-level replicated graphs (those created by Graph(target, replication_factor)) the memory use will be reported for a single replica (the memory used by all replicas will be identical).

Memory per tile

"byTile": {
  "interleaved": [ 536, 408 ],
  "interleavedIncludingGaps": [ 536, 408 ],
  "nonInterleaved": [ 19758, 3896 ],
  "nonInterleavedIncludingGaps": [ 65568, 19596 ],
  "overflowed": [ 0, 0 ],
  "overflowedIncludingGaps": [ 0, 0 ],
  "total": [ 20294, 3896 ],
  "totalIncludingGaps": [ 131608, 19596 ]
}

Table 3.8 Memory use per tile

  total
      The sum of interleaved, nonInterleaved and overflowed. This is the
      total amount of memory used for data (not including padding) on each
      tile. However, due to memory constraints leading to padding, more
      memory may actually be required, so this is usually not the number
      you want.

  totalIncludingGaps
      The actual amount of memory that is required on each tile. This is
      not simply the sum of the previous “including gaps” figures because
      adding those up does not take account of the gaps between the
      regions.

If any of these numbers is larger than the number of bytes per tile then the program will not fit on the hardware.
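
That check can be sketched as follows (hypothetical helper; same graph.json and nlohmann/json assumptions):

#include <cstdint>
#include <nlohmann/json.hpp>

// Returns true if every tile's memory requirement (including gaps) fits
// within the physical memory available on a tile.
bool fitsOnHardware(const nlohmann::json &profile) {
  const auto bytesPerTile =
      profile["target"]["bytesPerTile"].get<std::uint64_t>();
  for (const auto &used : profile["memory"]["byTile"]["totalIncludingGaps"]) {
    if (used.get<std::uint64_t>() > bytesPerTile) {
      return false;
    }
  }
  return true;
}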

Memory by category

byCategory is a breakdown of memory usage across the whole system by the type of data, and the region it is in.

"byCategory":{
  "controlCode": {
    "interleaved": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    },
    "nonInterleaved": {
      "nonOverlapped": [ 1216, 356 ],
      "overlapped": [ 0, 0 ]
    },
    "overflowed": {
      "nonOverlapped": [ 0, 0 ],
      "overlapped": [ 0, 0 ]
    },
    "total": [ 1216, 356 ]
  }
}

See Storage categories for the full list of categories.

Memory by compute set

byComputeSet is a breakdown of memory usage across the whole system. It includes several 2D arrays indexed by compute set, then by tile.

"byComputeSet": {
  "codeBytes": [[0, ...],[0, ...], ...],
  "copyPtrBytes": [[0, ...],[0, ...], ...],
  "descriptorBytes": [[0, ...],[0, ...], ...],
  "edgePtrBytes": [[0, ...],[0, ...], ...],
  "paddingBytes": [[0, ...],[0, ...], ...],
  "vertexDataBytes": [[0, ...],[0, ...], ...],
  "totalBytes": [[0, ...],[0, ...], ...]
}

Table 3.9 Memory use by compute set

  codeBytes
      The amount of memory used for code by a compute set. Because that
      code may be shared by several compute sets, these numbers cannot be
      added in a meaningful way.

  totalBytes
      The sum of the above for convenience. Because it includes codeBytes,
      it cannot be added in a meaningful way.

Memory by vertex type

byVertexType is a breakdown of memory usage across the whole system, like byComputeSet but for vertex types instead. The index into these arrays is also an index into the top-level vertexTypes object.

"byVertexType":{
  "codeBytes": [[0, ...],[0, ...], ...],
  "copyPtrBytes": [[0, ...],[0, ...], ...],
  "descriptorBytes": [[0, ...],[0, ...], ...],
  "edgePtrBytes": [[0, ...],[0, ...], ...],
  "paddingBytes": [[0, ...],[0, ...], ...],
  "totalBytes": [[0, ...],[0, ...], ...],
  "vertexDataBytes": [[0, ...],[0, ...], ...]
}