6. API reference
-
enum DeviceDiscoveryMode
Values:
-
enumerator DiscoverActivePartitionIPUs
Discover all devices in current partition.
This is the default mode.
-
enumerator DiscoverAllPartitionIPUs
Discover all devices across all partitions.
-
class gcipuinfo
Application Event Record key names
-
static constexpr const char *keyRecordPath = "event record path"
The path the record was stored at.
-
static constexpr const char *keyTimestamp = "timestamp"
When the event was recorded.
-
static constexpr const char *keySeverity = "severity"
Severity level - EventSeverity.
-
static constexpr const char *keyCommandLine = "command line"
Command line of the process that recorded the event.
-
static constexpr const char *keyPid = "pid"
Process id of the process that recorded the event.
-
static constexpr const char *keyAttachedIPUs = "attached ipus"
List of any attached IPUs (device ids)
-
static constexpr const char *keySpecificIPUs = "specific ipus"
List of any specific IPUs named in this event (device ids)
-
static constexpr const char *keyAttachedIPUHosts = "attached ipu hosts"
List of all currently attached IPU-Machine hostnames.
-
static constexpr const char *keySpecificIPUHosts = "specific ipu hosts"
List of any IPU-Machine hostnames associated with the devices named in “specific ipus”.
-
static constexpr const char *keyPartition = "partition"
The partition in use by the application, if applicable.
-
static constexpr const char *keyDescription = "description"
A description of the event.
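As an illustration of how these keys might be used, the following minimal sketch builds a one-line summary from an event record map of the kind returned by getLastAppEventRecord(). The helper names fieldOr and summariseEvent are hypothetical and not part of the library. The key strings are copied from the constants above so that the sketch compiles without the library header; with the header available, gcipuinfo::keySeverity and friends would be used instead.

```cpp
#include <map>
#include <string>

// Key strings copied from the documented constants (keySeverity,
// keyTimestamp, keyDescription). Duplicated here only so that the
// sketch is standalone.
static const char *const kSeverity = "severity";
static const char *const kTimestamp = "timestamp";
static const char *const kDescription = "description";

// Hypothetical helper: value stored under `key`, or "(unset)" if absent.
inline std::string fieldOr(const std::map<std::string, std::string> &record,
                           const std::string &key) {
  auto it = record.find(key);
  return it == record.end() ? std::string("(unset)") : it->second;
}

// Hypothetical helper: one-line summary of an event record map.
// An empty map is treated as "no event", matching the documented
// behaviour of getLastAppEventRecord().
inline std::string summariseEvent(
    const std::map<std::string, std::string> &record) {
  if (record.empty()) {
    return "no event recorded";
  }
  return "[" + fieldOr(record, kSeverity) + "] " +
         fieldOr(record, kTimestamp) + ": " + fieldOr(record, kDescription);
}
```

A caller would pass the map returned by getLastAppEventRecord() straight to summariseEvent().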
Device attribute query methods.
-
bool updateData()
Updates the attribute info.
Calling this function updates the data. Alternatively, you can call setUpdateMode() to ensure the data is always updated.
- Returns
true if the device attributes have been successfully updated, false if an error occurred.
-
void setUpdateMode(bool autoUpdate)
Sets the attribute update mode.
By default, the attribute data is queried on construction of this class. This function selects between the default behaviour and updating the data on every API call.
- Parameters
autoUpdate – true to enable auto-update mode, false to use the original data.
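The manual update path can be sketched as follows: refresh the data with updateData(), and only read the device list if the refresh succeeded. This is a sketch under stated assumptions, not the library's recommended pattern. The function refreshedDevices is hypothetical, and it is written as a template so that it compiles without the library header; FakeInfo is a stand-in that only mimics the documented signatures of updateData() and getDevices().

```cpp
#include <map>
#include <string>
#include <vector>

using DeviceList = std::vector<std::map<std::string, std::string>>;

// Hypothetical helper: refresh, then fetch. Works with any object that
// exposes updateData() and getDevices() with the documented signatures.
// On failure `out` is left untouched and false is returned.
template <typename Info>
bool refreshedDevices(Info &info, DeviceList &out) {
  if (!info.updateData()) {
    return false;
  }
  out = info.getDevices();
  return true;
}

// Stand-in for gcipuinfo, for illustration only.
struct FakeInfo {
  bool healthy = true;
  bool updateData() { return healthy; }
  DeviceList getDevices() {
    std::map<std::string, std::string> dev;
    dev["id"] = "0";
    return DeviceList(1, dev);
  }
};
```

With setUpdateMode(true) the explicit updateData() call would be unnecessary, at the cost of a refresh on every query.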
-
std::vector<std::map<std::string, std::string>> getDevices()
Get all attributes, for all devices.
- Returns
A std::vector containing a std::map for each device.
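Because each device is a plain map of attribute name to value, the result of getDevices() can be filtered with standard containers. The sketch below selects devices whose attribute matches a given value. The helper devicesWhere is hypothetical, and the attribute values used in any call (for example a board type of M2000) are taken as opaque strings, since the exact value formats are not specified here.

```cpp
#include <map>
#include <string>
#include <vector>

using Device = std::map<std::string, std::string>;

// Hypothetical helper: devices whose attribute `key` equals `value`.
// Devices that do not carry the attribute at all are skipped.
inline std::vector<Device> devicesWhere(const std::vector<Device> &devices,
                                        const std::string &key,
                                        const std::string &value) {
  std::vector<Device> matches;
  for (const auto &device : devices) {
    auto it = device.find(key);
    if (it != device.end() && it->second == value) {
      matches.push_back(device);
    }
  }
  return matches;
}
```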
-
std::string getDevicesAsJSON()
Return a JSON representation of all device attributes.
- Returns
A JSON-formatted tree of devices and device attributes as a std::string.
-
std::map<std::string, std::string> getAttributesForDevice(unsigned deviceId)
Get all attributes for a specific device.
- Parameters
deviceId – Device ID to query.
- Returns
A std::map containing the attributes of the device.
-
std::string getNamedAttributeForDevice(unsigned deviceId, const std::string key)
Get a specific attribute for a specific device.
- Parameters
deviceId – Device ID to query.
key – Device attribute name.
- Returns
The attribute.
-
std::vector<std::string> getNamedAttributeForAll(const std::string key)
Get a specific attribute for all devices.
- Parameters
key – Device attribute name.
- Returns
A
std::vector
containing the attribute for all devices.
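One use of the vector returned by getNamedAttributeForAll() is to count how many devices share each value, for example how many devices report each board type. The sketch below is a minimal illustration; the helper name tally is hypothetical and the values are treated as opaque strings.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: number of occurrences of each distinct value.
inline std::map<std::string, int> tally(
    const std::vector<std::string> &values) {
  std::map<std::string, int> counts;
  for (const auto &value : values) {
    ++counts[value];  // operator[] value-initialises a new entry to 0
  }
  return counts;
}
```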
Application Event Record query methods.
-
std::string getLastAppEventRecordAsJSON(EventSeverity minimumSeverity = EventSevNone, const std::string &eventRecordPath = "")
Return a JSON-formatted string describing the last recorded application event.
If there is no recorded event, or the event has an EventSeverity below minimumSeverity, an empty JSON dictionary “{}” is returned.
- Parameters
minimumSeverity – Events with an EventSeverity below this level are not returned.
eventRecordPath – Path to look in for an event record. If this is empty, falls back to the path in IPU_APP_EVENT_RECORD_PATH. If neither is set, throws an exception.
-
std::map<std::string, std::string> getLastAppEventRecord(EventSeverity minimumSeverity = EventSevNone, const std::string &eventRecordPath = "")
Return a map describing the last recorded application event.
If there is no recorded event, or the event has an EventSeverity below minimumSeverity, an empty map is returned.
- Parameters
minimumSeverity – Events with an EventSeverity below this level are not returned.
eventRecordPath – Path to look in for an event record. If this is empty, falls back to the path in IPU_APP_EVENT_RECORD_PATH. If neither is set, throws an exception.
-
EventSeverity getLastAppEventRecordSeverity(const std::string &eventRecordPath = "")
Return the EventSeverity of the last event in the event record.
If there is no recorded event, EventSevNone is returned.
- Parameters
eventRecordPath – Path to look in for an event record. If this is empty, falls back to the path in IPU_APP_EVENT_RECORD_PATH. If neither is set, throws an exception.
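Since the event record queries throw when no record path can be determined, a caller may want to wrap them. The sketch below shows one way to do so. It assumes the exception derives from std::exception, which this reference does not state. The function tryReadLastEvent and the stand-in FakeEventSource are hypothetical; the template lets the sketch compile without the library header, and the call relies on the documented default arguments of getLastAppEventRecord().

```cpp
#include <exception>
#include <map>
#include <stdexcept>
#include <string>

using EventRecord = std::map<std::string, std::string>;

// Hypothetical helper: returns true and fills `out` on success. If the
// query throws (assumed here to be a std::exception), returns false and
// stores the message in `error`.
template <typename Info>
bool tryReadLastEvent(Info &info, EventRecord &out, std::string &error) {
  try {
    out = info.getLastAppEventRecord();
    return true;
  } catch (const std::exception &e) {
    error = e.what();
    return false;
  }
}

// Stand-in for gcipuinfo, for illustration only.
struct FakeEventSource {
  bool pathSet = true;
  EventRecord getLastAppEventRecord() {
    if (!pathSet) {
      throw std::runtime_error("no event record path set");
    }
    EventRecord record;
    record["severity"] = "warning";
    return record;
  }
};
```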
Device health check methods.
-
std::string checkHealthOfDevices(unsigned timeoutMS, bool checkActiveIPUs = false)
Run basic ‘health checks’ on all currently configured IPUs.
If all devices appear to be operating normally, returns an empty JSON object:
{}
If any malfunctioning devices were discovered, returns a JSON tree identifying the affected IPUs and the IPU-Machine host and partition they belong to, for example:
{ "hosts": { "10.1.5.10": [ { "error": "device", "id": "2", "partition": "p1" }, { "error": "connection", "id": "3", "partition": "p1" } ] } }
There are two error types defined:
“connection” - the client was unable to communicate with the IPUoF server (either because of network issues or server error) within the specified timeout.
“device” - the IPUoF server discovered a problem with the IPU or RNIC device.
Each IPU health check must complete within timeoutMS, or else a “connection” error will be recorded.
By default, devices which are currently in use by applications are not checked, unless checkActiveIPUs is true.
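A caller that only needs a pass/fail answer can test whether the returned string is the empty JSON object. The sketch below strips whitespace before comparing, because the exact formatting of the returned string is an assumption here. The helper name allDevicesHealthy is hypothetical; a real application that needs the per-host details would parse the report with a JSON library instead.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: true if `report` is an empty JSON object,
// ignoring any whitespace. Takes the string by value so it can be
// modified locally.
inline bool allDevicesHealthy(std::string report) {
  report.erase(std::remove_if(report.begin(), report.end(),
                              [](unsigned char c) {
                                return std::isspace(c) != 0;
                              }),
               report.end());
  return report == "{}";
}
```

The argument would be the string returned by checkHealthOfDevices().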
Public Types
-
enum EventSeverity
Application Event Record severity level.
The severity level of an application event indicates how serious it is and potential ways of resolving the issue.
Values:
-
enumerator EventSevNone = 0
-
enumerator EventSevWarning = 1
An event which may indicate a system problem.
text: “warning”
-
enumerator EventSevApplicationError = 2
An error likely in the application code or configuration.
poplar::poplar_error, poplar::application_runtime_error
text: “application”
-
enumerator EventSevUndeterminedError = 3
It is not known if this is a system error, or an error in the application code or configuration.
poplar::unknown_runtime_error
text: “undetermined”
-
enumerator EventSevRequiresUserReset = 4
The error may be resolvable by an IPU reset or a partition reset (for Pod systems) or a link reset (for non-Pod systems).
poplar::recoverable_runtime_error + IPU_RESET or PARTITION_RESET or LINK_RESET
text: “requires_user_reset”
-
enumerator EventSevRequiresSystemReset = 5
The error may be resolvable by a full reboot of the IPU-M system or Poplar server.
poplar::recoverable_runtime_error + FULL_RESET
text: “requires_system_reset”
-
enumerator EventSevNonRecoverable = 6
The error may require admin-level system reconfiguration or hardware replacement.
poplar::unrecoverable_runtime_error
text: “nonrecoverable”
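The text forms listed for each severity level can be mapped back to a numeric level, for example when reading the “severity” field of an event record. The sketch below assumes that field holds one of the text forms shown above, which this reference implies but does not state outright. The enum Severity is a local mirror of EventSeverity so that the sketch compiles without the library header, and severityFromText and suggestsReset are hypothetical helpers.

```cpp
#include <string>

// Local mirror of the documented EventSeverity values, for illustration.
enum Severity {
  SevNone = 0,
  SevWarning = 1,
  SevApplicationError = 2,
  SevUndeterminedError = 3,
  SevRequiresUserReset = 4,
  SevRequiresSystemReset = 5,
  SevNonRecoverable = 6
};

// Hypothetical helper: maps a documented text form to its level.
// Unrecognised text maps to SevNone.
inline Severity severityFromText(const std::string &text) {
  if (text == "warning") return SevWarning;
  if (text == "application") return SevApplicationError;
  if (text == "undetermined") return SevUndeterminedError;
  if (text == "requires_user_reset") return SevRequiresUserReset;
  if (text == "requires_system_reset") return SevRequiresSystemReset;
  if (text == "nonrecoverable") return SevNonRecoverable;
  return SevNone;
}

// Hypothetical helper: true for the two levels whose description says
// the error may be resolvable by a reset or reboot.
inline bool suggestsReset(Severity severity) {
  return severity == SevRequiresUserReset ||
         severity == SevRequiresSystemReset;
}
```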
Public Functions
-
gcipuinfo(DeviceDiscoveryMode = DiscoverActivePartitionIPUs)
6.1. Attribute labels
-
const std::string IPUAttributeLabels::DeviceId
Unique identifier of a single-IPU or multi-IPU device.
text: “id”
-
const std::string IPUAttributeLabels::AverageBoardTemp
Average temperature in degrees Celsius as read by the sensors on the board.
text: “average board temp”
-
const std::string IPUAttributeLabels::AverageDieTemp
Average temperature in degrees Celsius as read by IPU sensors.
text: “average die temp”
-
const std::string IPUAttributeLabels::BoardIpuIndex
IPU number on board (0-1 for PCIe cards, 0-3 for IPU-Machines).
text: “board ipu index”
-
const std::string IPUAttributeLabels::BoardType
The IPU board type ‘family’, for example C600 or M2000.
Note: M2000 includes IPU-M2000 and Bow-2000.
text: “board type”
-
const std::string IPUAttributeLabels::ClockFrequency
Current clock frequency.
text: “clock”
-
const std::string IPUAttributeLabels::DriverVersion
PCIe driver version, specified as a <major.minor.patch> triple.
text: “driver version”
-
const std::string IPUAttributeLabels::GatewaySoftwareVersion
(IPUoF) IPU-Gateway software version, specified as a <major.minor.patch> triple.
text: “gateway software version”
-
const std::string IPUAttributeLabels::GcdId
(IPUoF) Graph Compile Domain ID.
text: “gcd id”
-
const std::string IPUAttributeLabels::HexoattTotalSize
Total remote-buffer memory available.
text: “hexoatt total size (bytes)”
-
const std::string IPUAttributeLabels::HexoattActiveSize
Total remote-buffer memory in use by the IPU.
text: “hexoatt active size (bytes)”
-
const std::string IPUAttributeLabels::HexoptTotalSize
Total host exchange memory available.
text: “hexopt total size (bytes)”
-
const std::string IPUAttributeLabels::HexoptActiveSize
Total host exchange memory in use by the IPU.
text: “hexopt active size (bytes)”
-
const std::string IPUAttributeLabels::IpuArchitecture
IPU hardware architecture version.
text: “ipu architecture”
-
const std::string IPUAttributeLabels::IpuofHost
(IPUoF) IP address of IPU-Gateway.
text: “ipuof host”
-
const std::string IPUAttributeLabels::IpuofServerVersion
(IPUoF) Fabric server version.
text: “ipuof server version”
-
const std::string IPUAttributeLabels::IpuUtilisation
Percentage of time spent waiting for one or more IPU syncs, measured in the last second.
text: “ipu utilisation”
-
const std::string IPUAttributeLabels::IpuUtilisationSession
Percentage of time spent waiting for one or more IPU syncs since the HSPs were set up.
text: “ipu utilisation (session)”
-
const std::string IPUAttributeLabels::LinkCorrectableErrorCount
IPU Link correctable error count.
text: “link correctable error count”
-
const std::string IPUAttributeLabels::LinkSpeed
(PCIe) PCIe link speed available.
text: “link speed”
-
const std::string IPUAttributeLabels::LinkWidth
(PCIe) Number of PCIe lanes available.
text: “link width”
-
const std::string IPUAttributeLabels::MaxActiveCodeSize
Maximum active code size (bytes).
text: “max active code size (bytes)”
-
const std::string IPUAttributeLabels::MaxActiveDataSize
Maximum active data size (bytes).
text: “max active data size (bytes)”
-
const std::string IPUAttributeLabels::MaxActiveStackSize
Maximum active stack size (bytes).
text: “max active stack size (bytes)”
-
const std::string IPUAttributeLabels::MultiIpuDeviceId
Multi-IPU device the IPU belongs to.
text: “multi-ipu device id”
-
const std::string IPUAttributeLabels::MultiIpuDiscoveryMethod
Method used to discover multi-IPU groups.
text: “multi-ipu discovery method”
-
const std::string IPUAttributeLabels::NumaNode
NUMA node the IPU is on.
text: “numa node”
-
const std::string IPUAttributeLabels::NumIpuLinkSegments
(IPUoF) Number of IPU-Link segments.
text: “number of ipu-link segments”
-
const std::string IPUAttributeLabels::NumReplicas
(IPUoF) Number of replicas in the partition.
text: “number of replicas”
-
const std::string IPUAttributeLabels::PartitionId
(IPUoF) partition ID.
text: “ipuof partition id”
-
const std::string IPUAttributeLabels::PartitionSyncType
(IPUoF) sync configuration type, for example ‘c2-compatible’.
text: “partition sync type”
-
const std::string IPUAttributeLabels::PciId
PCIe device identifier.
text: “pci id”
-
const std::string IPUAttributeLabels::PhysicalSlot
PCIe physical slot.
text: “pcie physical slot”
-
const std::string IPUAttributeLabels::ProcessStartTime
The start time of the process currently using the IPU.
text: “process start time”
-
const std::string IPUAttributeLabels::ReconfigurablePartition
(IPUoF) Set to 1 if the IPU is part of a reconfigurable partition.
text: “reconfigurable partition”
-
const std::string IPUAttributeLabels::RemoteBuffersSupported
Set to 1 if remote buffers are supported.
text: “remote buffers supported”
-
const std::string IPUAttributeLabels::SerialNumber
Serial number of the board.
text: “board serial number”
-
const std::string IPUAttributeLabels::TotalBoardPower
Total current power consumption as read by board level sensors.
Not used on IPU-Machines.
text: “total board power”
-
const std::string IPUAttributeLabels::UserExecutable
The name of the process using the device.
text: “user executable”
-
const std::string IPUAttributeLabels::UserName
The username of the user using the device.
text: “user name”
-
const std::string IPUAttributeLabels::UserProcessId
The process ID of the process using the device.
text: “user process id”
-
const std::string IPUAttributeLabels::GatewayRoutingType
(IPUoF) GW-Link routing type.
text: “gateway routing type”
-
const std::string IPUAttributeLabels::IpuLinkSegmentId
(IPUoF) Identifier of IPU-Link segment.
text: “ipu link segment id”
-
const std::string IPUAttributeLabels::NumGcds
Number of Graph Compile Domains.
text: “number of gcds”
-
const std::string IPUAttributeLabels::FirmwareVersion
ICU Firmware version, specified as a <major.minor.patch> triple.
In development builds, this will be suffixed with branch and build information.
text: “firmware version”
-
const std::string IPUAttributeLabels::IpuofServerError
(IPUoF) Set if an error occurred while attempting to communicate with the IPUoF server (a ‘connection’ error), or if the IPUoF server was unable to use the device (a ‘device’ error).
text: “ipuof server error”
-
const std::string IPUAttributeLabels::HostLinkCorrectableErrorCount
(PCIe) Host Link correctable error count.
text: “host link correctable error count”
-
const std::string IPUAttributeLabels::ApplicationHost
(IPUoF) IP address of the headnode where the application using this IPU is running.
text: “application host”
-
const std::string IPUAttributeLabels::IpuErrorState
Error state of the IPU.
Set to ‘ipu memory failure’ if the tile parity error thresholds have been exceeded.
text: “ipu error state”
-
const std::string IPUAttributeLabels::ParityErrorCountThreshold
Threshold for the number of parity errors to promote to an unrecoverable error.
text: “parity error count threshold”
-
const std::string IPUAttributeLabels::ParityErrorThresholdInterval
Threshold in seconds at which ‘num parity errors’ are promoted to an uncorrectable error.
text: “parity error threshold interval”
-
const std::string IPUAttributeLabels::IpumSoftwareVersion
(IPUoF) IPU-M software version.
text: “ipum software version”
-
const std::string IPUAttributeLabels::IpuPower
Power consumption of a single IPU.
Only available on IPU-Machines.
text: “ipu power”
-
const std::string IPUAttributeLabels::LinkCorrectableErrorCountSession
IPU Link correctable error count since device was last reset.
text: “link correctable error count (session)”
-
const std::string IPUAttributeLabels::HostLinkCorrectableErrorCountSession
(PCIe) Host-Link correctable error count since device was last reset.
text: “host link correctable error count (session)”
-
const std::string IPUAttributeLabels::BoardVariant
IPU board model name.
This will be identical to BoardType if this product only has a single variant.
text: “board variant”
-
const std::string IPUAttributeLabels::GatewayWriteCombining
Gateway write combining status.
text: “gateway write combining”
-
const std::string IPUAttributeLabels::SecondaryPcieInterfaceSupported
Set to 1 if the secondary interface is supported.
text: “secondary pcie interface supported”
-
const std::string IPUAttributeLabels::ICUBootloaderVersion
ICU bootloader version, specified as a <major.minor.patch> triple.
In development builds, this will be suffixed with branch and build information.
text: “icu bootloader version”
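Several labels above are marked as (PCIe) or (IPUoF) only, so a given device may not report every attribute. The sketch below checks which of a set of wanted labels are absent from a device's attribute map, such as one returned by getAttributesForDevice(). It assumes that an attribute which does not apply is simply missing from the map; the library might instead report it with an empty or placeholder value, which this reference does not specify. The helper name missingLabels is hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: the subset of `wanted` labels that have no entry
// in `attributes`, in the order they were requested.
inline std::vector<std::string> missingLabels(
    const std::map<std::string, std::string> &attributes,
    const std::vector<std::string> &wanted) {
  std::vector<std::string> missing;
  for (const auto &label : wanted) {
    if (attributes.find(label) == attributes.end()) {
      missing.push_back(label);
    }
  }
  return missing;
}
```

With the library header available, the wanted labels would be taken from the IPUAttributeLabels constants rather than written as string literals.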