4. Application event record

When the application event recording system is enabled and a Poplar application encounters certain types of events that affect or prevent correct operation, it stores an entry in the “application event record”. These entries can be retrieved through the gcipuinfo API.

By polling Poplar servers for event record entries, a monitoring system can detect when applications have stopped running correctly, whether because of a hardware or configuration problem or an issue in the application itself.

4.1. Event record storage

Event records are written to the Poplar server filesystem as a JSON file. The storage location of that file (a directory) is specified by the value of the IPU_APP_EVENT_RECORD_PATH environment variable. To enable the event recording system, you must set this variable to a path that is accessible to the Poplar application. Event recording is disabled if IPU_APP_EVENT_RECORD_PATH is not set.

You can set this variable on a per-application, per-user or per-container basis. Note that if multiple applications are running on the same Poplar server with the environment variable set to the same value, they can potentially overwrite each other’s event records.
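One way to reduce the risk of applications overwriting each other’s records is to give each application its own directory. The following is a minimal sketch of that idea. The helper name event_record_env, the base directory and the application name are all hypothetical and are not part of any Graphcore tool; the only name taken from this document is the IPU_APP_EVENT_RECORD_PATH variable.

```python
import os
from pathlib import Path


def event_record_env(base_dir, app_name, environ=None):
    """Return a copy of the environment with a per-application record path.

    base_dir and app_name are illustrative. Any directory that the Poplar
    application can access will do.
    """
    record_dir = Path(base_dir) / app_name
    # Create the directory up front so the application can write to it.
    record_dir.mkdir(parents=True, exist_ok=True)
    # Copy the environment rather than modifying the caller's mapping.
    env = dict(os.environ if environ is None else environ)
    env["IPU_APP_EVENT_RECORD_PATH"] = str(record_dir)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the application, so that the setting applies to that application only.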

Events have a severity level. If there is already an entry in the event record, an application is only allowed to write a new entry if its severity is equal to or higher than that of the existing entry.
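The overwrite rule can be illustrated with a small sketch. It is not the library's implementation: the function name may_write is hypothetical, and severities are modelled as plain integers where a larger number means a more severe event.

```python
from typing import Optional


def may_write(existing_rank: Optional[int], new_rank: int) -> bool:
    """Return True if a new entry may replace the existing one.

    Ranks are illustrative integers (larger means more severe).
    None means the event record currently holds no entry.
    """
    if existing_rank is None:
        return True
    # Equal or higher severity is allowed to overwrite; lower is not.
    return new_rank >= existing_rank
```

Under this rule a less severe event that occurs after a more severe one is not recorded, so the record always holds the most recent entry at the highest severity seen so far.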

4.2. Accessing event records

gcipuinfo provides the gcipuinfo::getLastAppEventRecord() and gcipuinfo::getLastAppEventRecordAsJSON() functions to retrieve the last recorded event record entry. These functions take an eventRecordPath parameter to specify the location of the event record.

You must provide the program that uses gcipuinfo with the locations of the event records that it will monitor. If the program is monitoring multiple event records then it will need to check each one individually.

Note

For backwards compatibility with previous versions of the library, eventRecordPath is an optional parameter in C++ and Python. If it is not specified, the library will attempt to use the value in IPU_APP_EVENT_RECORD_PATH. This usage is deprecated; eventRecordPath should always be specified.

gcipuinfo can return an event record either as a JSON-formatted string or as a string-keyed dictionary/map. If there is no event record set, gcipuinfo::getLastAppEventRecordAsJSON() will return "{}" and gcipuinfo::getLastAppEventRecord() will return an empty dictionary.

Note that the event record paths must be accessible to both the Poplar application and the program using gcipuinfo to read the event record. This may require some special system configuration if, for example, they are running from different containers.
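A program that monitors several event records has to check each path individually and treat an empty result as "no event". The sketch below shows one way to structure that loop. The function poll_event_records and its fetch_json parameter are hypothetical: fetch_json stands for any callable that takes an event record path and returns the record as a JSON string, for example a thin wrapper around getLastAppEventRecordAsJSON(). It is passed in so that the sketch makes no assumption about how gcipuinfo is imported or constructed.

```python
import json


def poll_event_records(paths, fetch_json):
    """Check each event record path once and return the non-empty entries.

    paths is an iterable of event record locations. fetch_json is a
    caller-supplied callable mapping a path to a JSON string.
    """
    found = {}
    for path in paths:
        record = json.loads(fetch_json(path))
        # "{}" parses to an empty dict, which means no event is recorded.
        if record:
            found[path] = record
    return found
```

A monitoring program could call this function periodically and raise an alert for each path that appears in the result.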

4.3. Event record contents

Example JSON event record:

{
  "event record path": "/tmp/ipu/ipu_app_event_record",
  "pid": "74647",
  "command line": "/home/justina/gbnwp-pipudc019/poplar/Build/build/poplar/tests/ExecutableTest -- --device-type Hw",
  "timestamp": "2021-09-29T12:14:34.901051Z",
  "severity": "nonrecoverable",
  "partition": "",
  "attached ipus": [
    3,
    2,
    1,
    0
  ],
  "specific ipus": [],
  "attached ipu hosts": [
    "10.1.5.10"
  ],
  "specific ipu hosts": [],
  "description": "Link training failed - At least one link failed to train at 2021-09-29T12:14:34.900795Z"
}

The contents of an event record entry are described in the table below.

Table 4.1 Event record attributes

| C/Python key symbol | Key string | Description |
|---|---|---|
| keyRecordPath | “event record path” | The path the record was stored at. |
| keyTimestamp | “timestamp” | When the event was recorded. Uses extended ISO8601 format. |
| keySeverity | “severity” | Severity of the event. |
| keyCommandLine | “command line” | Command line of the process that recorded the event. |
| keyPid | “pid” | Process ID of the process that recorded the event. |
| keyAttachedIPUs | “attached ipus” | List of all attached IPUs (device IDs). |
| keyAttachedIPUHosts | “attached ipu hosts” | List of the IPU-Machine hosts that contain the devices listed in “attached ipus”. Not used on PCIe-based systems. |
| keySpecificIPUs | “specific ipus” | List of specific IPUs named in this event (device IDs). Only used by some events. |
| keySpecificIPUHosts | “specific ipu hosts” | List of the IPU-Machine hosts that contain the devices in “specific ipus”. Not used on PCIe-based systems. |
| keyPartition | “partition” | The partition in use by the application, if applicable. |
| keyDescription | “description” | A description of the event. |
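As an illustration of how these attributes might be consumed, the sketch below converts a JSON event record into typed Python values. The function name summarize_event and the shape of its return value are hypothetical. It assumes the timestamp always has fractional seconds and a trailing "Z", and that "pid" is a numeric string, as in the example record above; other forms would need extra handling.

```python
import json
from datetime import datetime, timezone


def summarize_event(record_json):
    """Turn a JSON event record into typed values, or None if it is empty."""
    record = json.loads(record_json)
    if not record:
        return None
    # Assumed timestamp layout, taken from the example record:
    # 2021-09-29T12:14:34.901051Z (UTC, microsecond precision).
    when = datetime.strptime(
        record["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "when": when,
        "pid": int(record["pid"]),  # stored as a string in the example
        "severity": record["severity"],
        "ipus": sorted(record["attached ipus"]),
        "description": record["description"],
    }
```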

4.4. Event severity

Event severity levels are described in gcipuinfo::EventSeverity.

An event with a higher severity than EventSevWarning means that the application encountered a fatal error.
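The fatal-error rule can be sketched as a comparison against the warning level. The real levels and their order are defined by gcipuinfo::EventSeverity and are not reproduced here. The function is_fatal is hypothetical, and the caller supplies the severity names ordered from lowest to highest; the name "warning" in the usage note is a placeholder assumption.

```python
def is_fatal(severity, ordered_levels, warning_level):
    """Return True if severity ranks above warning_level.

    ordered_levels lists severity names from lowest to highest and must be
    supplied by the caller. Raises ValueError for a name not in the list.
    """
    return ordered_levels.index(severity) > ordered_levels.index(warning_level)
```

For example, with the placeholder ordering ["warning", "nonrecoverable"], an event whose severity is "nonrecoverable" is classed as fatal and one whose severity is "warning" is not.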