15. gc-monitor
You can use this command to monitor IPU activity without affecting other users of the IPUs. It can be used to:
- Check and monitor what is currently running on which IPU in shared systems.
- Make sure code is running correctly on an IPU.
- Monitor performance: the power and temperature will increase, and the clock rate will drop, when an IPU is heavily loaded.
15.1. Output
gc-monitor displays a device information table followed by information about any processes that are using these IPUs.
The structure of the device information table is different depending on whether your system uses IPU-Machines or PCIe cards.
By default, gc-monitor only displays information about the current active partition.
To see the IPUs in other partitions, and the processes using those IPUs, use the --all-partitions command line option.
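As a minimal sketch of how a script might choose between the default view and the all-partitions view, the Python helper below builds the command line and runs it. The function names are hypothetical. The only flag used is --all-partitions, as described above, and the sketch assumes gc-monitor is on the PATH.

```python
import subprocess


def gc_monitor_command(all_partitions=False):
    """Build the argument list for a gc-monitor invocation."""
    command = ["gc-monitor"]
    if all_partitions:
        # Include IPUs and processes from every partition,
        # not only the current active one.
        command.append("--all-partitions")
    return command


def run_gc_monitor(all_partitions=False):
    """Run gc-monitor and return its text output.

    Assumes gc-monitor is installed and on the PATH.
    """
    result = subprocess.run(
        gc_monitor_command(all_partitions),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `run_gc_monitor(all_partitions=True)` would return the tables for every partition as a single string.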
15.1.1. IPU-Machine device information
The example below is from an IPU-Machine installation that has been configured with a single 8-IPU partition.
+---------------+--------------------------------------------------------------------------------+
| gc-monitor | Partition: p1 (gcd:0) [active] has 8 reconfigurable IPUs |
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
| IPU-M | Serial |IPU-M SW|Server version| ICU FW | Type | ID | IPU# |Routing|
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
| 10.1.5.10 | 0024.0002.8203321 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 0 | 3 | DNC |
| 10.1.5.10 | 0024.0002.8203321 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 1 | 2 | DNC |
| 10.1.5.10 | 0024.0001.8203321 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 2 | 1 | DNC |
| 10.1.5.10 | 0024.0001.8203321 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 3 | 0 | DNC |
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
| 10.1.5.11 | 0013.0002.8204921 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 4 | 3 | DNC |
| 10.1.5.11 | 0013.0002.8204921 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 5 | 2 | DNC |
| 10.1.5.11 | 0013.0001.8204921 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 6 | 1 | DNC |
| 10.1.5.11 | 0013.0001.8204921 | 2.5.0 | 1.9.0 | 2.4.4 |M2000 | 7 | 0 | DNC |
+-------------+--------------------+--------+--------------+----------+------+----+------+-------+
| Label | Description |
|---|---|
| IPU-M | IPU-Machine address. |
| Serial | Each IPU-Machine has two serial numbers, one for each pair of IPUs. |
| IPU-M SW | IPU-Machine overall software version. |
| Server version | Fabric server software version. |
| ICU FW | ICU firmware revision. |
| Type | IPU-Machine type (M2000 in this case). |
| ID | Device ID of the IPU, used by applications and other tools to address an IPU in a partition. |
| IPU# | Index from 0 to 3 that identifies the IPU chip on an IPU-Machine. |
| Routing | Link configuration for the partition. In this case, “DNC”. |
Partition information is displayed in the table header.
Note: M2000 in the Type field is the abbreviation for both the IPU-M2000 and Bow-2000 IPU-Machines.
15.1.2. PCIe card device information
The example below is from a system built from 8 C600 PCIe cards:
+---------------+---------------------------------------------------------------------------+
| gc-monitor | Installed driver: 1.1.7 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| Slot | Serial | ICU FW | Type | Speed | Ln | ID | PCIe ID | IPU# |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 10 | 0057.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 0 | 0000:1b:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 11 | 0041.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 1 | 0000:1c:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 12 | 0014.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 2 | 0000:48:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 13 | 0045.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 3 | 0000:49:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 7 | 0053.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 4 | 0000:8a:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 6 | 0029.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 5 | 0000:8c:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 8 | 0062.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 6 | 0000:c4:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
| 9 | 0016.0063.822351 | 2.6.7 | C600 | 8.0 GT/s | 8 | 7 | 0000:c5:00.0 | 0 |
+------+--------------------+----------+-------+----------+----+----+----------------+------+
+--------------------------------------------------------------------------------------------------+
| No attached processes |
+--------------------------------------------------------------------------------------------------+
| Label | Description |
|---|---|
| Slot | Physical PCIe slot location, if available. |
| Serial | Serial number of the board. |
| ICU FW | ICU firmware revision. |
| Type | IPU system type (C600 PCIe cards in this example). |
| Speed | PCIe speed. |
| Ln | Number of PCIe lanes in use. |
| ID | Device ID of the IPU. Used by applications and other tools to address an IPU. |
| PCIe ID | PCI address of the device (0000:1b:00.0 for the first card in the example). |
| IPU# | Index from 0 to 1 that identifies the IPU chip on a two-IPU PCIe card. |
The kernel driver version is displayed in the table header.
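If a script has to work from the text tables rather than the JSON output, the rows can be split on the `|` separators. The Python sketch below is one way to do that. The helper names are hypothetical, and it assumes the layout shown in the examples above: data rows start with `|` and border lines start with `+`. The machine-readable output formats are likely to be a more robust choice for anything long-lived.

```python
def parse_table_row(line):
    """Split one "| a | b |" table line into a list of stripped cells.

    Returns None for lines that are not data rows, such as the
    "+-----+" borders.
    """
    stripped = line.strip()
    if not stripped.startswith("|"):
        return None
    # Drop the outer pipes, then split on the inner ones.
    return [cell.strip() for cell in stripped.strip("|").split("|")]


def row_to_record(header_line, row_line):
    """Pair each column label in the header with the matching cell."""
    header = parse_table_row(header_line)
    row = parse_table_row(row_line)
    if header is None or row is None or len(header) != len(row):
        return None
    return dict(zip(header, row))
```

For the first C600 row in the example, `row_to_record` gives a dictionary in which `"Slot"` maps to `"10"` and `"PCIe ID"` maps to `"0000:1b:00.0"`.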
15.1.3. Process information
The example below shows the default process information table displaying a gc-hosttraffictest process running across 4 IPUs.
+------------------------------------------------------------------------+------------------------+-----------------+
| Attached processes in partition p1 | IPU | Board |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
| PID | Command | Time | User | ID | Clock | Temp | Temp | Power |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
| 81816 | gc-hosttraffictest | 10s | ipuuser | 14 | 1300MHz | 33.7 C | 35.0 C |113.7 W |
| 81816 | gc-hosttraffictest | 10s | ipuuser | 15 | 1300MHz | 37.4 C | | |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
| 81816 | gc-hosttraffictest | 10s | ipuuser | 12 | 1300MHz | 31.4 C | 32.8 C |107.1 W |
| 81816 | gc-hosttraffictest | 10s | ipuuser | 13 | 1300MHz | 34.7 C | | |
+--------+-----------------------------------------+--------+------------+----+----------+--------+--------+--------+
| Label | Description |
|---|---|
| PID | The process ID of the program using the IPU. |
| Command | The program using the IPU. |
| Time | How long the program has been running. Note that this is not necessarily how long it has been using IPUs. |
| User | The user that started the process. |
| ID | Device ID of the IPU. |
| IPU Clock | The current measured IPU clock rate. |
| IPU Temp | IPU temperature in degrees Celsius measured from on-chip sensors. May be less accurate than board temperature. |
| Board Temp | Board temperature in degrees Celsius. |
| Board Power | Board power consumption in watts. This is the total power of the IPUs on the board. It does not include the power used by fans and other components on the board. Note that the power is sampled and may not accurately reflect either peak or average power. |
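The clock, temperature and power cells in the example table combine a number with a unit, such as 1300MHz, 33.7 C and 113.7 W. To compare readings over time, a script first has to separate the two. The Python sketch below shows one way to do this. The helper name is hypothetical and the format is inferred only from the example table above.

```python
import re

# A number followed by an optional unit, e.g. "1300MHz", "33.7 C", "113.7 W".
_READING = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]*)\s*$")


def parse_reading(cell):
    """Split a table cell such as "33.7 C" into (value, unit).

    Returns None when the cell is empty, as the board columns are for
    the second IPU of each board in the example table.
    """
    match = _READING.match(cell)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)
```

A script could, for example, flag a heavily loaded IPU when the parsed clock value falls below an earlier reading while the temperature rises.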
A number of additional process information fields can be enabled with the -f option.
If -f is specified without a parameter, all process information fields are displayed, as below:
+------+---------------+---------------+------------------+----------+----------+----------+----------+---------------+----------------+----------------+--------+--------+------------+
| ID | Board Power | Board Temp | Command | Exch Mem | IPU Code | IPU Data |IPU Stack | IPU Temp | IPU-util | IPU-util-sess | PID | Time | User |
+------+---------------+---------------+------------------+----------+----------+----------+----------+---------------+----------------+----------------+--------+--------+------------+
| 0 | 188.1 W | 38.2 C | python3 | 0.00 GB | 40.90% | 265.12% | 0.79% | 39.1 C | 98.75% | 95.68% | 22844 | 1m22s | ipuuser |
+------+---------------+---------------+------------------+----------+----------+----------+----------+---------------+----------------+----------------+--------+--------+------------+
The complete set of fields can be unwieldy, so -f also accepts a comma-separated list of the fields to display.
Use -f list to display all available fields.
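The two forms of -f can be sketched as a small Python helper that builds the arguments. The function name is hypothetical, and the field names passed to it in the usage note are placeholders: the names that gc-monitor actually accepts should be taken from the output of -f list.

```python
def fields_arguments(fields=None):
    """Return the -f arguments for a gc-monitor command line.

    With no fields, -f is passed on its own, which displays all
    process information fields. Otherwise the field names are joined
    into a single comma-separated argument.
    """
    if not fields:
        return ["-f"]
    return ["-f", ",".join(fields)]
```

For example, `["gc-monitor"] + fields_arguments(["PID", "User"])` gives `["gc-monitor", "-f", "PID,User"]`, assuming PID and User are valid field names on the installed version.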
| Label | Description |
|---|---|
| ID | Device ID of the IPU. |
| Board Power | Board power consumption in watts. This is the total power of the IPUs on the board. It does not include the power used by fans and other components on the board. Note that the power is sampled and may not accurately reflect either peak or average power. |
| Board Temp | Board temperature in degrees Celsius. |
| Command | The program using the IPU. |
| Exch Mem | Total external memory usage in GB. |
| IPU Code | Total application code size expressed as a percentage of tile memory on a single IPU. |
| IPU Data | Total application data size expressed as a percentage of tile memory on a single IPU. |
| IPU Stack | Total application stack size expressed as a percentage of tile memory on a single IPU. |
| IPU Temp | IPU temperature in degrees Celsius measured from on-chip sensors. May be less accurate than board temperature. |
| IPU-util | A rough indication of “IPU utilisation” based on the amount of time spent waiting for one or more IPU syncs in the last second. Not intended to be used for performance measurement. |
| IPU-util-sess | A rough indication of “IPU utilisation” based on the amount of time spent waiting for one or more IPU syncs since the HSPs were set up. Not intended to be used for performance measurement. |
| PID | The process ID of the program using the IPU. |
| Time | How long the program has been running. Note that this is not necessarily how long it has been using IPUs. |
| User | The user that started the process. |
15.2. Usage
15.2.1. Allowed options
| Option | Description |
|---|---|
| | Don’t display card information |
| --all-partitions | Show information about all partitions (default is active partition only) |
| | Show sensor data even if no process is attached |
| -j | Emit JSON output |
| | Emit XML output |
| | Emit CSV output |
| -f | Comma-separated field names to be displayed in attached process info. Use -f list to see the choices. If empty, all supported fields are displayed. |
| | Don’t display headers (CSV only) |
| | Version number |
| | Produce help message |
15.2.2. Notes
By default, IPU applications do not read power and temperature sensors, so this information will not be available in gc-monitor. To enable sensor reading, launch the application with the GCDA_MONITOR environment variable set.
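When the application is started from another Python program rather than from the shell, the variable can be set in the child's environment. The sketch below is a minimal illustration with hypothetical helper names. It sets GCDA_MONITOR to 1, on the assumption that this value is sufficient to enable sensor reading.

```python
import os
import subprocess


def monitored_environment(base_env=None):
    """Return a copy of the environment with GCDA_MONITOR set.

    The value "1" is an assumption, not a documented requirement.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GCDA_MONITOR"] = "1"
    return env


def launch_with_monitoring(command):
    """Start an IPU application with sensor reading enabled."""
    return subprocess.Popen(command, env=monitored_environment())
```

For example, `launch_with_monitoring(["python", "main.py"])` starts main.py with the variable set, leaving the parent process's own environment unchanged.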
15.2.3. Examples
### Run main.py, enabling gc-monitor power/temp output
$ GCDA_MONITOR=1 python main.py
### Refresh gc-monitor output every second
$ watch -n1 gc-monitor
### Append one JSON gc-monitor reading to a file
$ gc-monitor -j >> data.json
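Appending several readings with `>>` leaves a file of concatenated JSON documents, which many JSON parsers will not read in a single call. As a hedged alternative, the Python sketch below wraps each reading in a one-line record with a timestamp, so the log can be read back line by line. The function names and the record layout are hypothetical. The sketch assumes only that `gc-monitor -j` writes one JSON document to standard output, and it makes no assumption about the structure of that document.

```python
import json
import subprocess
import time


def as_json_line(raw_output, timestamp):
    """Wrap one gc-monitor JSON document in a single-line record."""
    record = {"timestamp": timestamp, "reading": json.loads(raw_output)}
    return json.dumps(record)


def append_reading(path, raw_output, timestamp):
    """Append one record to the log file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(as_json_line(raw_output, timestamp) + "\n")


def log_once(path):
    """Take one reading from gc-monitor and append it to the log.

    Assumes gc-monitor is installed and on the PATH.
    """
    raw = subprocess.run(
        ["gc-monitor", "-j"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    append_reading(path, raw, time.time())
```

Calling `log_once("data.jsonl")` in a loop with a sleep between calls gives a log where each line can be parsed independently with `json.loads`.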