15. gc-monitor

Use the gc-monitor command to monitor IPU activity without affecting users of the IPUs. You can use it to:

  • Check and monitor what’s currently running on which IPU in shared systems.

  • Make sure code is correctly running on an IPU.

  • Monitor performance: the power and temperature will increase, and the clock rate will drop, when an IPU is heavily loaded.

15.1. Output

gc-monitor displays a device information table followed by information about any processes that are using these IPUs. The structure of the device information table is different depending on whether your system uses IPU-Machines or PCIe cards.

By default, gc-monitor only displays information about the current active partition. To see the IPUs in other partitions, and the processes using those IPUs, use the --all-partitions command line option.

15.1.1. IPU-Machine device information

The example below is from an IPU-Machine installation that has been configured with a single 8-IPU partition.

+---------------+--------------------------------------------------------------------------------+
|  gc-monitor   |            Partition: p1 (gcd:0) [active] has 8 reconfigurable IPUs            |
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
|    IPU-M    |       Serial       |IPU-M SW|Server version|  ICU FW  | Type | ID | IPU# |Routing|
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
|  10.1.5.10  | 0024.0002.8203321  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 0  |  3   |  DNC  |
|  10.1.5.10  | 0024.0002.8203321  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 1  |  2   |  DNC  |
|  10.1.5.10  | 0024.0001.8203321  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 2  |  1   |  DNC  |
|  10.1.5.10  | 0024.0001.8203321  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 3  |  0   |  DNC  |
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
|  10.1.5.11  | 0013.0002.8204921  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 4  |  3   |  DNC  |
|  10.1.5.11  | 0013.0002.8204921  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 5  |  2   |  DNC  |
|  10.1.5.11  | 0013.0001.8204921  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 6  |  1   |  DNC  |
|  10.1.5.11  | 0013.0001.8204921  | 2.5.0  |    1.9.0     |  2.4.4   |M2000 | 7  |  0   |  DNC  |
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
Table 15.1 IPU-Machine device table fields

Label           Description
IPU-M           IPU-Machine address.
Serial          Each IPU-Machine has two serial numbers, one for each pair of IPUs.
IPU-M SW        IPU-Machine overall software version.
Server version  Fabric server software version.
ICU FW          ICU firmware revision.
Type            IPU-Machine type (M2000 in this case).
ID              Device ID of the IPU, used by applications and other tools to address an IPU in a partition.
IPU#            Index from 0-3 that identifies the IPU chip on an IPU-Machine.
Routing         Link configuration for partition. In this case, “DNC”.

Partition information is displayed in the table header.

Note: in the Type field, M2000 is the abbreviation for the IPU-M2000 IPU-Machine and M2000W refers to the Bow-2000 IPU-Machine.

15.1.2. PCIe card device information

The example below is from a system built from 8 C600 PCIe cards:

+---------------+---------------------------------------------------------------------------+
|  gc-monitor   |                          Installed driver: 1.1.7                          |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| Slot |       Serial       |  ICU FW  | Type  |  Speed   | Ln | ID |    PCIe ID     | IPU# |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  10  |  0057.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 0  |  0000:1b:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  11  |  0041.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 1  |  0000:1c:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  12  |  0014.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 2  |  0000:48:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  13  |  0045.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 3  |  0000:49:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  7   |  0053.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 4  |  0000:8a:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  6   |  0029.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 5  |  0000:8c:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  8   |  0062.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 6  |  0000:c4:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
|  9   |  0016.0063.822351  |  2.6.7   | C600  | 8.0 GT/s | 8  | 7  |  0000:c5:00.0  |  0   |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
+--------------------------------------------------------------------------------------------------+
|                                      No attached processes                                       |
+--------------------------------------------------------------------------------------------------+
Table 15.2 PCIe device table fields

Label           Description
Slot            Physical PCIe slot location, if available.
Serial          Serial number of the board.
ICU FW          ICU firmware revision.
Type            IPU system type (C600 PCIe cards in this example).
Speed           PCIe speed.
Ln              Number of PCIe lanes in use.
ID              Device ID of the IPU. Used by applications and other tools to address an IPU.
PCIe ID         PCI domain:bus:device.function
IPU#            Index from 0-1 that identifies the IPU chip on a two-IPU PCIe card.

The kernel driver version is displayed in the table header.

15.1.3. Process information

The example below shows the default process information table displaying a gc-hosttraffictest process running across 4 IPUs.

+------------------------------------------------------------------------+------------------------+-----------------+
|                     Attached processes in partition p1                 |          IPU           |      Board      |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
|  PID   |                 Command                 |  Time  |    User    | ID |  Clock   |  Temp  |  Temp  | Power  |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
| 81816  |           gc-hosttraffictest            |  10s   |  ipuuser   | 14 | 1300MHz  | 33.7 C | 35.0 C |113.7 W |
| 81816  |           gc-hosttraffictest            |  10s   |  ipuuser   | 15 | 1300MHz  | 37.4 C |        |        |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
| 81816  |           gc-hosttraffictest            |  10s   |  ipuuser   | 12 | 1300MHz  | 31.4 C | 32.8 C |107.1 W |
| 81816  |           gc-hosttraffictest            |  10s   |  ipuuser   | 13 | 1300MHz  | 34.7 C |        |        |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
Table 15.3 Default process information fields

Label           Description
PID             The process ID of the program using the IPU.
Command         The program using the IPU.
Time            How long the program has been running. Note that this is not necessarily how long it has been using IPUs.
User            The user that started the process.
ID              Device ID of the IPU.
Clock           The current measured IPU clock rate.
IPU Temp        IPU temperature in degrees Celsius measured from on-chip sensors. May be less accurate than board temperature.
Board Temp      Board temperature in degrees Celsius.
Board Power     Board power consumption in watts. This is the total power of the IPUs on the board. It does not include the power used by fans and other components on the board. Note that the power is sampled and may not accurately reflect either peak or average power.
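For quick scripting, the data rows of the default process table can be split on the "|" separators. The sketch below is only an illustration based on the example output above: it assumes the nine-column layout shown there, and it assumes that blank Board Temp and Board Power cells belong to the second IPU on the same board, so it carries the previous values forward. The function name parse_process_rows and the dictionary keys are hypothetical. For anything more robust, the JSON, XML or CSV output options are likely a better starting point.

```python
def parse_process_rows(table_text):
    """Parse data rows of the default gc-monitor process table into dicts."""
    keys = ["pid", "command", "time", "user", "id",
            "clock", "ipu_temp", "board_temp", "board_power"]
    rows = []
    last_board = {"board_temp": "", "board_power": ""}
    for line in table_text.splitlines():
        line = line.strip()
        # Skip borders ("+---+") and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Data rows have nine cells and a numeric PID in the first one;
        # this check also skips the two header rows.
        if len(cells) != len(keys) or not cells[0].isdigit():
            continue
        row = dict(zip(keys, cells))
        for key in ("board_temp", "board_power"):
            if row[key]:
                last_board[key] = row[key]
            else:
                # Assumption: a blank cell means "same board as the row above".
                row[key] = last_board[key]
        rows.append(row)
    return rows
```

Each returned dictionary holds the cell text unchanged (for example "1300MHz" or "113.7 W"), so any unit conversion is left to the caller.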

Additional process information fields can be enabled with the -f option. If -f is specified without a parameter, all process information fields are displayed, as below:

+------+---------------+---------------+------------------+----------+----------+----------+----------+---------------+----------------+----------------+--------+--------+------------+
|  ID  |  Board Power  |  Board Temp   |     Command      | Exch Mem | IPU Code | IPU Data |IPU Stack |   IPU Temp    |    IPU-util    | IPU-util-sess  |  PID   |  Time  |    User    |
+------+---------------+---------------+------------------+----------+----------+----------+----------+---------------+----------------+----------------+--------+--------+------------+
|  0   |    188.1 W    |    38.2 C     |     python3      | 0.00 GB  |  40.90%  | 265.12%  |  0.79%   |    39.1 C     |     98.75%     |     95.68%     | 22844  | 1m22s  |  ipuuser   |
+------+---------------+---------------+------------------+----------+----------+----------+----------+---------------+----------------+----------------+--------+--------+------------+

The complete set of fields can be unwieldy, so -f accepts a comma-separated list of the fields to display. Use -f list to display all available fields.

Table 15.4 All process fields

Label           Description
ID              Device ID of the IPU.
Board Power     Board power consumption in watts. This is the total power of the IPUs on the board. It does not include the power used by fans and other components on the board. Note that the power is sampled and may not accurately reflect either peak or average power.
Board Temp      Board temperature in degrees Celsius.
Command         The program using the IPU.
Exch Mem        Total external memory usage in GB.
IPU Code        Total application code size expressed as a percentage of tile memory on a single IPU.
IPU Data        Total application data size expressed as a percentage of tile memory on a single IPU.
IPU Stack       Total application stack size expressed as a percentage of tile memory on a single IPU.
IPU Temp        IPU temperature in degrees Celsius measured from on-chip sensors. May be less accurate than board temperature.
IPU-util        A rough indication of “IPU utilisation” based on the amount of time spent waiting for one or more IPU syncs in the last second. Not intended to be used for performance measurement.
IPU-util-sess   A rough indication of “IPU utilisation” based on the amount of time spent waiting for one or more IPU syncs since the HSPs were set up. Not intended to be used for performance measurement.
PID             The process ID of the program using the IPU.
Time            How long the program has been running. Note that this is not necessarily how long it has been using IPUs.
User            The user that started the process.

Usage

15.2. Allowed options

--no-card-info

Don’t display card information

--all-partitions

Show information about all partitions (default is active partition only)

-s, --sensors

Show sensor data even if no process is attached

-j, --json-output

Emit JSON output

-x, --xml-output

Emit XML output

-c, --csv-output

Emit CSV output

-f {arg}, --fields {arg}

Comma-separated field names to be displayed in the attached process information. Use -f list to see the choices. If empty, all supported fields are displayed.

--no-headers

Don’t display headers (CSV only)

--version

Version number

-h, --help

Produce help message
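When gc-monitor is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the long options listed above. The function name build_gc_monitor_command and its parameters are hypothetical, and the field names passed to --fields must be ones reported by -f list on your system (the names in the usage comment are placeholders, not real field names).

```python
def build_gc_monitor_command(output_format=None, fields=None,
                             all_partitions=False, sensors=False):
    """Build a gc-monitor argument list from the documented options."""
    format_flags = {
        "json": "--json-output",
        "xml": "--xml-output",
        "csv": "--csv-output",
    }
    cmd = ["gc-monitor"]
    if all_partitions:
        cmd.append("--all-partitions")
    if sensors:
        cmd.append("--sensors")
    if output_format is not None:
        # Raises KeyError for anything other than json, xml or csv.
        cmd.append(format_flags[output_format])
    if fields:
        cmd += ["--fields", ",".join(fields)]
    return cmd

# Possible use (placeholder field names, gc-monitor must be on the PATH):
#   import subprocess
#   cmd = build_gc_monitor_command("csv", fields=["field_a", "field_b"])
#   result = subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Returning a list rather than a single string lets the command be passed to subprocess.run without going through a shell.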

15.3. Notes

  • By default, IPU applications do not read power and temperature sensors, so this information will not be available in gc-monitor. To enable sensor reading, the application must be launched with the GCDA_MONITOR environment variable set.
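If the application is started from another Python program rather than from a shell, the variable can be set in the child process environment. This is a minimal sketch: the helper names monitored_env and run_with_monitoring are hypothetical, main.py stands in for your own application, and the value "1" simply mirrors the shell example in the Examples section.

```python
import os
import subprocess
import sys

def monitored_env(base_env=None):
    """Return a copy of the environment with GCDA_MONITOR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GCDA_MONITOR"] = "1"
    return env

def run_with_monitoring(script="main.py"):
    """Launch a Python application so that it reads the IPU sensors."""
    # The parent environment is copied, not modified.
    return subprocess.run([sys.executable, script],
                          env=monitored_env(), check=True)
```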

15.4. Examples

### Run main.py, enabling gc-monitor power/temp output
$ GCDA_MONITOR=1 python main.py
### Refresh gc-monitor output every second
$ watch -n1 gc-monitor
### Append one JSON gc-monitor reading to a file
$ gc-monitor -j >> data.json
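Appending several raw JSON readings to one file produces a file that is not itself a single JSON document. One option, sketched below, is to store each reading on its own line together with a timestamp. The sketch assumes that gc-monitor -j writes one JSON document to standard output; it makes no assumption about the fields inside that document. The function names and the "timestamp" and "gc_monitor" keys are hypothetical.

```python
import json
import subprocess
import time

def read_gc_monitor_json(run=subprocess.run):
    """Run gc-monitor -j once and return the decoded JSON document."""
    result = run(["gc-monitor", "-j"],
                 capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

def append_reading(path, reading, timestamp=None):
    """Append one reading to path as a single line of JSON."""
    record = {
        "timestamp": time.time() if timestamp is None else timestamp,
        "gc_monitor": reading,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def log_readings(path, count, interval_s=1.0):
    """Take count readings, interval_s seconds apart."""
    for _ in range(count):
        append_reading(path, read_gc_monitor_json())
        time.sleep(interval_s)
```

Each line of the resulting file can then be decoded independently with json.loads.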