11. gc-monitor

You can use this command to monitor IPU activity without affecting users of the IPUs. This can be used to:

  • Check and monitor what’s currently running on which IPU in shared systems.

  • Make sure code is correctly running on an IPU.

  • Monitor performance: power and temperature increase, and the clock rate drops, when an IPU is heavily loaded.

To get a continuously updated display, use it with the watch command:

$ watch gc-monitor

The output shows information about the IPU cards in the system and information on the processes running on the machine.

The card information table shows:

  1. Physical PCIe slot location, if available.

  2. Serial number of the board.

  3. ICU firmware revision.

  4. Installed kernel module driver version number.

  5. IPU card type.

  6. PCIe width and speed.

  7. ID of the IPU as used by other tools to address the IPUs.

  8. Which IPUs are on the same card.

  9. PCIe ID.

  10. IPU number (which IPU on a card it is).

The process information displayed includes:

  1. The PID using the IPU.

  2. The process name using the IPU.

  3. The username of the user using the IPU.

  4. The ID of the IPU in use.

  5. The current measured IPU clock rate.

  6. The current average IPU temperature.

  7. The current average board temperature.

  8. The current average board power consumption.

Typical output from the command is shown below.

+---------------+-----------------------------------------------------------+
| gc-monitor    | Installed driver: 1.0.27                                  |
+------+--------+--------+------+-------+-------+----+--------------+-------+
| Slot | Serial | ICU FW | Type | Speed | Width | ID | PCIe ID      | IPU # |
+------+--------+--------+------+-------+-------+----+--------------+-------+
|6     |0174.   |1.0.26  |C2    |8 GT/s |8      |0   |0000:8e:00.0  |0      |
|      |0004.   |        |      |       |       +---------------------------+
|      |919052  |        |      |       |       |1   |0000:8b:00.0  |1      |
+------+--------+--------+------+-------+-------+----+--------------+-------+
+----------------------------------+----------------------+-----------------+
| Attached processes               | IPU                  | Board           |
|------+----------------+----------|----------------------|-----------------|
| PID  | Command        | User     | ID | Clock  | Temp   | Temp   | Power  |
+------+----------------+----------+----+--------+--------+--------+--------+
|55988 |gc-powertest    |dave      |0   |1600MHz |27.8 C  |30.0 C  |110.6 W |
+------+----------------+----------+----+--------+--------+--------+--------+
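The process table can also be read from the text output by a script. The following Python sketch assumes the column layout matches the sample above: eight cells separated by | in each process row, following the header line that starts with "| PID". The function name parse_process_rows and the field names are illustrative and are not part of gc-monitor; rows may look different when temperature and power data are unavailable.

```python
# Sample copied from the process table shown above.
SAMPLE = """\
+----------------------------------+----------------------+-----------------+
| Attached processes               | IPU                  | Board           |
|------+----------------+----------|----------------------|-----------------|
| PID  | Command        | User     | ID | Clock  | Temp   | Temp   | Power  |
+------+----------------+----------+----+--------+--------+--------+--------+
|55988 |gc-powertest    |dave      |0   |1600MHz |27.8 C  |30.0 C  |110.6 W |
+------+----------------+----------+----+--------+--------+--------+--------+
"""

# Field names are assumptions chosen for this sketch, in column order.
FIELDS = ["pid", "command", "user", "ipu_id",
          "clock", "ipu_temp", "board_temp", "board_power"]


def parse_process_rows(output):
    """Return one dict per process row found after the '| PID' header line."""
    rows = []
    in_process_table = False
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("| PID"):
            in_process_table = True
            continue
        # Skip everything before the header and all border lines ('+---').
        if not in_process_table or not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Keep only rows with the expected cell count and a numeric PID.
        if len(cells) == len(FIELDS) and cells[0].isdigit():
            rows.append(dict(zip(FIELDS, cells)))
    return rows


rows = parse_process_rows(SAMPLE)
print(rows[0]["command"], rows[0]["clock"])  # prints: gc-powertest 1600MHz
```

Values are kept as strings, units included, so that nothing is assumed about how missing readings are displayed.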

By default, the temperature and power data are not available. To enable these, the process running on the IPU must be launched with the GCDA_MONITOR environment variable set. For example, to add the temperature and power data to the output when monitoring a program called test, the following commands could be used:

$ GCDA_MONITOR=1 ./test
$ watch gc-monitor
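When the workload is started from a script rather than an interactive shell, the same variable can be placed in the child process's environment. This is a minimal Python sketch that relies only on what is stated above (the variable is named GCDA_MONITOR and is set to 1 for the monitored process); the helper name run_with_monitoring is hypothetical.

```python
import os
import subprocess


def run_with_monitoring(command, **kwargs):
    """Run `command` (a list of arguments) with GCDA_MONITOR=1 in its environment.

    The parent's environment is copied so that PATH and other settings are
    preserved; only GCDA_MONITOR is added or overridden.
    """
    env = dict(os.environ, GCDA_MONITOR="1")
    return subprocess.run(command, env=env, **kwargs)


# Example (not executed here): the equivalent of `GCDA_MONITOR=1 python main.py`
# run_with_monitoring(["python", "main.py"], check=True)
```

The variable is set only for the launched process, so other programs started from the same script are unaffected.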

11.1. Usage

11.1.1. Allowed options

--no-card-info

Don’t display card information

-j, --json-output

Emit JSON output

--version

Version number

-h, --help

Produce help message

11.1.2. Notes

  • By default, gc-monitor shows the processes running on attached IPUs, the users running them, the processes’ PIDs, and the IPUs’ IDs and clock speeds.

  • Also by default, the temperature and power data is not available. To enable it for a specific process running on the IPU, that process must be launched with the GCDA_MONITOR environment variable set to 1. For example, to add the temperature and power data to the output when monitoring a gc-powertest process, the following commands could be used:

    $ GCDA_MONITOR=1 gc-powertest -d 0
    $ watch gc-monitor

11.1.3. Examples

$ GCDA_MONITOR=1 python main.py # Run main.py, enabling gc-monitor power and temp output for the IPUs it uses
$ watch -n1 gc-monitor          # Continually monitor IPUs every second, visually
$ gc-monitor -j >> data.json    # Append one JSON gc-monitor reading to a data file
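For longer runs, the JSON mode can be polled from a script. The Python sketch below is an illustration under stated assumptions, not a documented workflow: it assumes each gc-monitor -j call writes a single JSON document to standard output, makes no assumption about the fields inside that document, and stores each reading with a timestamp as one line of a JSON Lines file. The names read_monitor and log_readings are hypothetical.

```python
import json
import subprocess
import time


def read_monitor(command=("gc-monitor", "-j")):
    """Run the command once and parse its stdout as one JSON document.

    Assumption: the command prints exactly one JSON document per call.
    """
    completed = subprocess.run(
        list(command), capture_output=True, text=True, check=True
    )
    return json.loads(completed.stdout)


def log_readings(path, count, interval_s=1.0, command=("gc-monitor", "-j")):
    """Append `count` timestamped readings to `path`, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for i in range(count):
            record = {"timestamp": time.time(), "reading": read_monitor(command)}
            f.write(json.dumps(record) + "\n")
            if i < count - 1:
                time.sleep(interval_s)


# Example (not executed here): take 60 readings, one per second
# log_readings("data.jsonl", count=60, interval_s=1.0)
```

Writing one record per line keeps the file parseable line by line, whereas appending raw output with >> concatenates separate JSON documents into a file that is not itself a single valid JSON document.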