4. Application event record
When the application event recording system is enabled and a Poplar application encounters certain types of events that affect or prevent correct operation, it stores an entry in the “application event record”. These entries can be retrieved through the gcipuinfo API.
By polling Poplar servers for event record entries, a monitoring system can detect when applications have stopped running correctly, whether because of a hardware or configuration problem or an issue in the application itself.
4.1. Event record storage
Event records are written to the Poplar server filesystem as a JSON file.
The storage location of that file (a directory) is specified by the value of the IPU_APP_EVENT_RECORD_PATH environment variable.
To enable the event recording system, you must set this variable to a path that is accessible to the Poplar application.
Event recording is disabled if IPU_APP_EVENT_RECORD_PATH is not set.
You can set this variable on a per-application, per-user or per-container basis. Note that if multiple applications are running on the same Poplar server with the environment variable set to the same value, they can overwrite each other’s event records.
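One way to avoid applications overwriting each other's records is to give each launch its own directory. The following is a minimal sketch, not part of gcipuinfo or Poplar: the helper name `launch_with_event_record` is hypothetical, and creating the directory before launch is an assumption made here so that the path is accessible to the application.

```python
import os
import subprocess


def launch_with_event_record(command, record_dir):
    """Run `command` with event recording enabled for this process only.

    `record_dir` is the directory that IPU_APP_EVENT_RECORD_PATH will point to.
    Creating it here is an assumption of this sketch, not a documented requirement.
    """
    os.makedirs(record_dir, exist_ok=True)

    # Copy the parent environment so only the child process sees the variable.
    env = dict(os.environ)
    env["IPU_APP_EVENT_RECORD_PATH"] = record_dir

    return subprocess.run(command, env=env, check=False)
```

Calling the helper with a different `record_dir` for each application keeps their event records separate, at the cost of the monitoring program having to know every directory in use.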
Events have a severity level. If the event record already contains an entry, an application can only write a new entry whose severity is equal to or higher than that of the existing entry.
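The overwrite rule can be illustrated with a small sketch. It models severities as plain integers where a larger number means a more severe event. This numeric ranking and the function name `may_write` are assumptions for illustration only and do not reflect how the library stores severities.

```python
from typing import Optional


def may_write(existing: Optional[int], new: int) -> bool:
    """Return True if an entry of severity rank `new` may be written.

    `existing` is the rank of the entry already in the record,
    or None if the record is empty.
    """
    if existing is None:
        return True  # nothing to protect, so any entry may be written
    return new >= existing  # equal or higher severity replaces the entry
```

A consequence of this rule is that a less severe event recorded later does not hide an earlier, more severe one.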
4.2. Accessing event records
gcipuinfo provides the gcipuinfo::getLastAppEventRecord() and gcipuinfo::getLastAppEventRecordAsJSON() functions to retrieve the last recorded event record entry.
These functions take an eventRecordPath parameter to specify the location of the event record.
You must provide the program that uses gcipuinfo with the locations of the event records that it will monitor. If the program is monitoring multiple event records then it will need to check each one individually.
Note
For backwards compatibility with previous versions of the library, eventRecordPath is an optional parameter in C++ and Python.
If it is not specified, the library will attempt to use the value in IPU_APP_EVENT_RECORD_PATH.
This usage is deprecated: eventRecordPath should always be specified.
gcipuinfo can return an event record either as a JSON-formatted string or as a string-keyed dictionary/map.
If there is no event record set, gcipuinfo::getLastAppEventRecordAsJSON() will return "{}" and gcipuinfo::getLastAppEventRecord() will return an empty dictionary.
Note that the event record paths must be accessible to both the Poplar application and the program using gcipuinfo to read the event record. This may require some special system configuration if, for example, they are running from different containers.
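A monitoring program that watches several event records has to query each path in turn and ignore the ones with no entry. The sketch below shows that loop under stated assumptions: `collect_events` is a hypothetical helper, and `read_json` is a stand-in for whatever call the program uses to obtain the JSON string for one path (for example, a wrapper around getLastAppEventRecordAsJSON() with eventRecordPath set). The library itself is not imported here.

```python
import json
from typing import Callable, Dict, Iterable


def collect_events(
    paths: Iterable[str],
    read_json: Callable[[str], str],
) -> Dict[str, dict]:
    """Return the last event record for every path that has one.

    `read_json(path)` must return the record as a JSON string,
    or "{}" when no event record is set for that path.
    """
    events = {}
    for path in paths:
        record = json.loads(read_json(path))
        if record:  # "{}" parses to an empty dict, meaning no event
            events[path] = record
    return events
```

Passing the reader in as a parameter keeps the loop independent of how the program constructs its gcipuinfo object, and lets the loop be exercised with canned JSON strings.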
4.3. Event record contents
Example JSON event record:
    {
        "event record path": "/tmp/ipu/ipu_app_event_record",
        "pid": "74647",
        "command line": "/home/justina/gbnwp-pipudc019/poplar/Build/build/poplar/tests/ExecutableTest -- --device-type Hw",
        "timestamp": "2021-09-29T12:14:34.901051Z",
        "severity": "nonrecoverable",
        "partition": "",
        "attached ipus": [
            3,
            2,
            1,
            0
        ],
        "specific ipus": [],
        "attached ipu hosts": [
            "10.1.5.10"
        ],
        "specific ipu hosts": [],
        "description": "Link training failed - At least one link failed to train at 2021-09-29T12:14:34.900795Z"
    }
The contents of an event record entry are described in the table below.
| C/Python key symbol | Key string | Description |
|---|---|---|
| | “event record path” | The path the record was stored at. |
| | “timestamp” | When the event was recorded. Uses extended ISO8601 format. |
| | “severity” | Severity of the event. |
| | “command line” | Command line of the process that recorded the event. |
| | “pid” | Process ID of the process that recorded the event. |
| | “attached ipus” | List of all attached IPUs (device IDs). |
| | “attached ipu hosts” | List of the IPU-Machine hosts that contain the devices listed in “attached ipus”. Not used on PCIe-based systems. |
| | “specific ipus” | List of specific IPUs named in this event (device IDs). Only used by some events. |
| | “specific ipu hosts” | List of the IPU-Machine hosts that contain the devices in “specific ipus”. Not used on PCIe-based systems. |
| | “partition” | The partition in use by the application, if applicable. |
| | “description” | A description of the event. |
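The key strings in the table can be mapped onto a typed structure once the JSON string has been retrieved. The following is a minimal sketch using only the Python standard library. The `EventRecord` class and `parse_event_record` function are hypothetical names. Converting "pid" to an integer reflects the example record, where it appears as a string.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class EventRecord:
    path: str
    timestamp: datetime
    severity: str
    command_line: str
    pid: int
    attached_ipus: List[int]
    attached_ipu_hosts: List[str]
    specific_ipus: List[int]
    specific_ipu_hosts: List[str]
    partition: str
    description: str


def parse_event_record(text: str) -> Optional[EventRecord]:
    """Parse a JSON event record; return None for an empty record ("{}")."""
    raw = json.loads(text)
    if not raw:
        return None
    # Replace the trailing "Z" so fromisoformat() also works before Python 3.11.
    stamp = datetime.fromisoformat(raw["timestamp"].replace("Z", "+00:00"))
    return EventRecord(
        path=raw["event record path"],
        timestamp=stamp,
        severity=raw["severity"],
        command_line=raw["command line"],
        pid=int(raw["pid"]),  # the example record stores the pid as a string
        attached_ipus=list(raw["attached ipus"]),
        attached_ipu_hosts=list(raw["attached ipu hosts"]),
        specific_ipus=list(raw["specific ipus"]),
        specific_ipu_hosts=list(raw["specific ipu hosts"]),
        partition=raw["partition"],
        description=raw["description"],
    )
```

The sketch assumes every key in the table is present in a non-empty record, as in the example. A record missing a key would raise `KeyError`.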
4.4. Event severity
Event severity levels are described in gcipuinfo::EventSeverity.
An event with a higher severity than EventSevWarning means that the application encountered a fatal error.
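A monitoring program can apply this rule to a record's “severity” string. The sketch below is illustrative only. Of the severity strings it uses, only "nonrecoverable" appears in the example record. The "warning" string, the numeric ranks in `ASSUMED_RANK`, and the function name `is_fatal` are assumptions. The authoritative values are those defined by gcipuinfo::EventSeverity.

```python
# Assumed ranking for illustration; consult gcipuinfo::EventSeverity for real values.
ASSUMED_RANK = {"warning": 1, "nonrecoverable": 2}


def is_fatal(record, rank=ASSUMED_RANK, warning_key="warning"):
    """Return True if the record's severity ranks above the warning level."""
    if not record:
        return False  # an empty record means no event was recorded
    severity = record["severity"]
    if severity not in rank:
        raise ValueError("unknown severity: " + repr(severity))
    return rank[severity] > rank[warning_key]
```

Raising on an unknown severity string is a deliberate choice in this sketch, so that a monitoring system does not silently treat an unrecognised level as harmless.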