5. Device health check

gcipuinfo contains a “device health check” function, which allows a monitoring system to rapidly detect when any of the configured IPUs or IPU-Machines are non-functional and should be taken out of use. This mechanism is not intended to provide detailed diagnostic information; it is concerned only with identifying which devices have failed.

The types of failure detected include:

  • Network misconfiguration, preventing a Poplar server from communicating with an IPU-Machine.

  • Hardware failure of an IPU, an RNIC, or another component of an IPU-Machine.

The API for the device health check is a single function, gcipuinfo::checkHealthOfDevices(), which returns a JSON string. If no problems were detected, this is an empty object "{}"; otherwise, it lists the failing IPUs. If the health check itself failed to run, the string contains a description of the error instead. A monitoring system would call this function repeatedly to poll for failures.
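The polling pattern described above can be sketched as follows. This is a minimal illustration rather than part of gcipuinfo: poll_health, interval_s and max_polls are hypothetical names, and check_health is a stand-in for whatever call returns the JSON string from gcipuinfo::checkHealthOfDevices(). Canned strings are used here so that the sketch runs without any IPU hardware.

```python
import json
import time
from typing import Callable, Dict, Iterator


def poll_health(check_health: Callable[[], str],
                interval_s: float,
                max_polls: int) -> Iterator[Dict]:
    """Yield each non-empty health report returned by check_health.

    check_health is a hypothetical stand-in for a call that returns the
    JSON string produced by gcipuinfo::checkHealthOfDevices().
    """
    for i in range(max_polls):
        report = json.loads(check_health())
        if report:  # "{}" parses to an empty dict: no problems detected
            yield report
        if i + 1 < max_polls:
            time.sleep(interval_s)  # wait before the next poll


# Canned responses take the place of a real gcipuinfo call.
canned = iter(['{}', '{"error": "no devices found"}', '{}'])
problems = list(poll_health(lambda: next(canned), interval_s=0.0, max_polls=3))
print(problems)
```

A real monitoring system would typically poll indefinitely and forward each non-empty report to its alerting mechanism; max_polls is only there to keep the sketch finite.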

By default, gcipuinfo only checks devices in the currently active partition. You can configure gcipuinfo to run health checks on devices in other partitions by supplying DiscoverAllPartitionIPUs as the discoveryMode argument of the gcipuinfo::gcipuinfo() object constructor. This is demonstrated in the gc_health_check example program.

Note

It is not possible to use both DiscoverAllPartitionIPUs and DiscoverActivePartitionIPUs modes in the same process.

Note

Information about available partitions is retrieved only once, at program startup. If partitions are added or removed, any monitoring programs will need to be restarted to obtain the updated partition information.

5.1. Health check results

Example JSON health check report with two failing IPUs in partition p1 on IPU-Machine 10.1.5.10:

{
  "hosts": {
    "10.1.5.10": [
      {
        "error": "device",
        "id": "2",
        "partition": "p1",
        "board ipu index": "1"
      },
      {
        "error": "connection",
        "id": "3",
        "partition": "p1",
        "board ipu index": "0"
      }
    ]
  }
}

Device ID 2 has failed due to a “device” error, which may suggest a hardware or device driver problem. Device ID 3 has failed due to a “connection” error, which may indicate a network problem or a disabled IPUoF server.
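A monitoring system usually needs one record per failing device rather than the nested report. The sketch below flattens a report of the shape shown above. It assumes the key names in the example ("hosts", "error", "id", "partition", "board ipu index") are representative; failing_devices is a hypothetical helper, not a gcipuinfo function.

```python
import json
from typing import Dict, List


def failing_devices(report_json: str) -> List[Dict[str, str]]:
    """Flatten the "hosts" mapping into one record per failing IPU."""
    report = json.loads(report_json)
    failures = []
    for host, devices in report.get("hosts", {}).items():
        for dev in devices:
            # Copy the device entry and record which IPU-Machine it is on.
            failures.append({"host": host, **dev})
    return failures


# The example report from this section.
example = """
{
  "hosts": {
    "10.1.5.10": [
      {"error": "device", "id": "2", "partition": "p1", "board ipu index": "1"},
      {"error": "connection", "id": "3", "partition": "p1", "board ipu index": "0"}
    ]
  }
}
"""

for rec in failing_devices(example):
    print(f'{rec["host"]} partition {rec["partition"]} '
          f'device {rec["id"]}: {rec["error"]} error')
```

An empty report "{}" yields an empty list, so the same helper can be used on every poll result.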

Example JSON report where the health check failed to run:

{
  "description": "configuration error\n",
  "error": "no devices found"
}

In this case, the health check could not be run because no IPU devices were found. The description shows that this is due to a problem with the configuration; for example, a partition may not have been set up.
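The three kinds of result described in this section (no problems, failing devices, health check failed to run) can be told apart by inspecting the top-level keys of the report. The following is a sketch under the assumption that the two examples above are representative, that is, device failures always appear under "hosts" and any other non-empty report means the check did not run. classify_report and its return labels are hypothetical.

```python
import json


def classify_report(report_json: str) -> str:
    """Return a coarse label for a health check report."""
    report = json.loads(report_json)
    if not report:
        return "healthy"         # empty object: no problems detected
    if "hosts" in report:
        return "devices_failed"  # one or more IPUs listed as failing
    # Assumption: anything else is a top-level "error"/"description"
    # report, meaning the health check itself could not run.
    return "check_failed"


print(classify_report('{}'))
print(classify_report('{"hosts": {"10.1.5.10": []}}'))
print(classify_report(
    '{"description": "configuration error\\n", "error": "no devices found"}'))
```

Separating "check_failed" from "devices_failed" matters for alerting: the former points at the monitoring setup or partition configuration, while the latter identifies specific devices to take out of use.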