13. gc-monitor

The gc-monitor command monitors IPU activity without affecting the users of the IPUs. You can use it to:

  • Check and monitor what’s currently running on which IPU in shared systems.

  • Make sure code is correctly running on an IPU.

  • Monitor performance: when an IPU is heavily loaded, its power and temperature increase and its clock rate drops.

To get a continuously updated display, use it with the watch command:

$ watch gc-monitor

13.1. Output

The output shows information about the IPUs in the system and information on the processes running on the machine. The fields available vary slightly depending on the system.

The card information table for a 4-IPU partition shows:

  1. Information about the partition.

  2. IPU-Machine address.

  3. Serial number of the board.

  4. ICU firmware revision.

  5. IPU card type.

  6. Fabric server version.

  7. ID of the IPU as used by other tools to address the IPUs.

  8. PCIe ID.

  9. Routing type.

  10. Gateway software version.

Typical output is shown below.

+---------------+-------------------------------------------------------------------------------------+
|  gc-monitor   |                  Partition: 'pa12' has 4 reconfigurable IPUs                        |
+-------------+--------------------+----------+------+--------------+----+---------+-------+----------+
|    IPU-M    |       Serial       |  ICU FW  | Type |Server version| ID | PCIe ID |Routing|   GWSW   |
+-------------+--------------------+----------+------+--------------+----+---------+-------+----------+
|  10.1.5.12  | 0024.0002.8203321  |  2.1.3   |M2000 |    1.5.4     | 0  |    3    |  DNC  |  2.0.10  |
|  10.1.5.12  | 0024.0002.8203321  |  2.1.3   |M2000 |    1.5.4     | 1  |    2    |  DNC  |  2.0.10  |
+-------------+--------------------+----------+------+--------------+----+---------+-------+----------+
|  10.1.5.12  | 0024.0001.8203321  |  2.1.3   |M2000 |    1.5.4     | 2  |    1    |  DNC  |  2.0.10  |
|  10.1.5.12  | 0024.0001.8203321  |  2.1.3   |M2000 |    1.5.4     | 3  |    0    |  DNC  |  2.0.10  |
+-------------+--------------------+----------+------+--------------+----+---------+-------+----------+
+----------------------------------------------------------+------------------------+-----------------+
|                    Attached processes                    |          IPU           |      Board      |
+--------+---------------------------+--------+------------+----+----------+--------+--------+--------+
|  PID   |          Command          |  Time  |    User    | ID |  Clock   |  Temp  |  Temp  | Power  |
+--------+---------------------------+--------+------------+----+----------+--------+--------+--------+
| 10629  |       gc-powertest        |   5s   |   emmaf    | 0  | 1330MHz  |  N/A   |  N/A   |  N/A   |
+--------+---------------------------+--------+------------+----+----------+--------+--------+--------+

The card information for a PCIe card shows:

  1. Physical PCIe slot location, if available.

  2. Serial number of the board.

  3. ICU firmware revision.

  4. Installed kernel module driver version number.

  5. IPU card type.

  6. PCIe width and speed.

  7. ID of the IPU as used by other tools to address the IPUs.

  8. Which IPUs are on the same card.

  9. PCIe ID.

  10. IPU number (which IPU on a card it is).


The process information displayed includes:

  1. The process ID (PID) using the IPU.

  2. The process name using the IPU.

  3. The time it has been running.

  4. The username of the user using the IPU.

  5. The ID of the IPU in use.

  6. The current measured IPU clock rate.

  7. The current IPU temperature.

  8. The current board temperature.

  9. The current board power consumption.

By default, the temperature and power data are not available. To enable them, the process running on the IPU must be launched with the GCDA_MONITOR environment variable set to 1. For example, to add the temperature and power data to the output when monitoring a program called test, run the program in one terminal and gc-monitor in another (the ./ prefix stops the shell from running its built-in test command instead):

$ GCDA_MONITOR=1 ./test
$ watch gc-monitor
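If a second terminal is not convenient, the same steps can be sketched as one shell function that starts the workload in the background and polls gc-monitor until it exits. This is a minimal sketch, not part of gc-monitor: the function name monitor_while_running and its arguments are hypothetical, and it assumes gc-monitor is on the PATH.

```shell
# Hypothetical helper (not part of gc-monitor):
#   monitor_while_running INTERVAL COMMAND [ARGS...]
# Runs COMMAND with GCDA_MONITOR=1 in the background and prints one
# gc-monitor reading every INTERVAL seconds until COMMAND exits.
monitor_while_running() {
    interval="$1"
    shift

    # Launch the workload with temperature and power reporting enabled.
    GCDA_MONITOR=1 "$@" &
    workload_pid=$!

    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$workload_pid" 2>/dev/null; do
        gc-monitor
        sleep "$interval"
    done

    # Return the workload's exit status to the caller.
    wait "$workload_pid"
}
```

For example, `monitor_while_running 1 ./test` would print a reading about once per second while test runs.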

13.1.1. Usage

13.2. Allowed options

--no-card-info

Don’t display card information

-j, --json-output

Emit JSON output

-x, --xml-output

Emit XML output

-c, --csv-output

Emit CSV output

-f {arg}, --fields {arg}

Comma-separated field names to display in the attached process information. Use -f list to see the choices. If empty, all supported fields are displayed.

--no-headers

Don’t display headers (CSV only)

--version

Version number

-h, --help

Produce help message
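
The options above can be combined to build a simple CSV log of attached processes. The following is a sketch under stated assumptions: the helper name log_csv and its arguments are hypothetical, and it assumes that --no-card-info, -c and --no-headers can be used together and that gc-monitor is on the PATH. The header row is written only when the log file is empty.

```shell
# Hypothetical helper (not part of gc-monitor):
#   log_csv FILE COUNT INTERVAL
# Appends COUNT CSV readings to FILE, INTERVAL seconds apart.
log_csv() {
    file="$1"
    count="$2"
    interval="$3"

    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -s "$file" ]; then
            # FILE already has content, so suppress the header row.
            gc-monitor --no-card-info -c --no-headers >> "$file"
        else
            # First reading: keep the header row.
            gc-monitor --no-card-info -c >> "$file"
        fi
        i=$((i + 1))
        # Pause between readings, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `log_csv ipu-log.csv 60 1` would collect roughly one minute of readings. Running `gc-monitor -f list` first shows which field names can be passed to -f to narrow the columns.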

13.3. Notes

  • The number of link errors (LnErr) is printed using scientific notation.

  • By default, gc-monitor shows the processes running on attached IPUs, the users running them, the processes’ PIDs, and the IPUs’ IDs and clock speeds.

  • Also by default, the temperature and power data are not available. To enable them for a specific process running on the IPU, launch that process with the GCDA_MONITOR environment variable set to 1. For example, to add the temperature and power data to the output when monitoring a gc-powertest process, the following commands could be used:

    $ GCDA_MONITOR=1 gc-powertest -d 0
    $ watch gc-monitor

13.4. Examples

$ GCDA_MONITOR=1 python main.py # Run main.py, enabling gc-monitor power and temp output for the IPUs it uses
$ watch -n1 gc-monitor          # Continually monitor IPUs every second, visually
$ gc-monitor -j >> data.json    # Append one JSON gc-monitor reading to a data file
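
Appending several readings to one file leaves a sequence of JSON documents back to back, which many JSON parsers will not accept as a single document (assuming each gc-monitor -j call emits one complete document). One possible alternative, sketched here, writes each reading to its own file. The helper name snapshot_json, its arguments and the file naming scheme are hypothetical.

```shell
# Hypothetical helper (not part of gc-monitor):
#   snapshot_json DIR COUNT INTERVAL
# Writes COUNT JSON readings into DIR, one file per reading.
snapshot_json() {
    dir="$1"
    count="$2"
    interval="$3"

    mkdir -p "$dir" || return 1

    i=1
    while [ "$i" -le "$count" ]; do
        # A timestamp plus a counter keeps names unique within one second.
        out="$dir/gc-monitor-$(date +%Y%m%dT%H%M%S)-$i.json"
        gc-monitor -j > "$out"
        # Pause between readings, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
}
```

For example, `snapshot_json readings 10 5` would store ten readings taken five seconds apart in the readings directory.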