6. API reference

enum DeviceDiscoveryMode

Values:

enumerator DiscoverActivePartitionIPUs

Discover all devices in current partition.

This is the default mode.

enumerator DiscoverAllPartitionIPUs

Discover all devices across all partitions.

class gcipuinfo

Application Event Record key names

static constexpr const char *keyRecordPath = "event record path"

The path the record was stored at.

static constexpr const char *keyTimestamp = "timestamp"

When the event was recorded.

static constexpr const char *keySeverity = "severity"

Severity level - EventSeverity.

static constexpr const char *keyCommandLine = "command line"

Command line of the process that recorded the event.

static constexpr const char *keyPid = "pid"

Process id of the process that recorded the event.

static constexpr const char *keyAttachedIPUs = "attached ipus"

List of any attached IPUs (device ids).

static constexpr const char *keySpecificIPUs = "specific ipus"

List of any specific IPUs named in this event (device ids).

static constexpr const char *keyAttachedIPUHosts = "attached ipu hosts"

List of all currently attached IPU-Machine hostnames.

static constexpr const char *keySpecificIPUHosts = "specific ipu hosts"

List of any IPU-Machine hostnames associated with the devices named in “specific ipus”.

static constexpr const char *keyPartition = "partition"

The partition in use by the application, if applicable.

static constexpr const char *keyDescription = "description"

A description of the event.
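As a rough illustration of how these key names might be used, the sketch below builds a one-line summary from an event record held as a std::map<std::string, std::string>. The three constants are local copies of the key names listed above so that the sketch compiles on its own; fieldOr and summariseEvent are hypothetical helpers, not part of the library.

```cpp
#include <map>
#include <sstream>
#include <string>

// Local copies of three key names listed above, so this sketch is
// self-contained. Real code would use the members of gcipuinfo instead.
static constexpr const char *keySeverity = "severity";
static constexpr const char *keyPid = "pid";
static constexpr const char *keyDescription = "description";

using EventRecord = std::map<std::string, std::string>;

// Hypothetical helper: returns the value stored under `key`, or `fallback`
// if the record has no such entry.
std::string fieldOr(const EventRecord &record, const std::string &key,
                    const std::string &fallback) {
  EventRecord::const_iterator it = record.find(key);
  return it == record.end() ? fallback : it->second;
}

// Hypothetical helper: builds a one-line summary of an event record.
// An empty map is treated as "no event recorded".
std::string summariseEvent(const EventRecord &record) {
  if (record.empty()) return "no event recorded";
  std::ostringstream out;
  out << "[" << fieldOr(record, keySeverity, "?") << "] pid "
      << fieldOr(record, keyPid, "?") << ": "
      << fieldOr(record, keyDescription, "");
  return out.str();
}
```

Looking keys up with find() rather than operator[] keeps the record unmodified and copes with records that omit a field.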

Device attribute query methods.

bool updateData()

Updates the attribute info.

Calling this function updates the data. Alternatively, you can call setUpdateMode() to ensure the data is always updated.

Returns

true if the device attributes have been successfully updated, false if an error occurred.

void setUpdateMode(bool autoUpdate)

Sets the attribute update mode.

By default, the attribute data is queried on construction of this class. This function selects between that default behaviour and updating the data on every API call.

Parameters

autoUpdate – true to enable auto-update mode, false to use the original data.

std::vector<std::map<std::string, std::string>> getDevices()

Get all attributes, for all devices.

Returns

A std::vector containing a std::map for each device.
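The returned structure can be filtered with ordinary standard-library code. The following is a minimal sketch, assuming each map uses the attribute text labels from the "Attribute labels" section (for example "id" and "board type") as keys; deviceIdsWhere is a hypothetical helper, not part of the library.

```cpp
#include <map>
#include <string>
#include <vector>

using DeviceAttributes = std::map<std::string, std::string>;

// Hypothetical helper: returns the "id" attribute of every device whose
// attribute `key` equals `value`. Devices that lack `key` are skipped.
std::vector<std::string> deviceIdsWhere(
    const std::vector<DeviceAttributes> &devices, const std::string &key,
    const std::string &value) {
  std::vector<std::string> ids;
  for (const DeviceAttributes &device : devices) {
    DeviceAttributes::const_iterator attr = device.find(key);
    if (attr == device.end() || attr->second != value) continue;
    // "id" is the text of IPUAttributeLabels::DeviceId.
    DeviceAttributes::const_iterator id = device.find("id");
    if (id != device.end()) ids.push_back(id->second);
  }
  return ids;
}
```

With a gcipuinfo instance named info (an assumed variable name), a call such as deviceIdsWhere(info.getDevices(), "board type", "M2000") would list the matching device IDs.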

std::string getDevicesAsJSON()

Return a JSON representation of all device attributes.

Returns

A JSON-formatted tree of devices and device attributes as a std::string.

std::map<std::string, std::string> getAttributesForDevice(unsigned deviceId)

Get all attributes for a specific device.

Parameters

deviceId – Device ID to query.

Returns

A std::map containing the attributes of the specified device.

std::string getNamedAttributeForDevice(unsigned deviceId, const std::string key)

Get a specific attribute for a specific device.

Parameters
  • deviceId – Device ID to query.

  • key – Device attribute name.

Returns

The attribute.

std::vector<std::string> getNamedAttributeForAll(const std::string key)

Get a specific attribute for all devices.

Parameters

key – Device attribute name.

Returns

A std::vector containing the attribute for all devices.
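Because attributes are returned as strings, numeric comparisons need a parsing step. The sketch below finds the largest numeric value in such a vector. It assumes, without confirmation from this reference, that numeric attributes are strings beginning with a decimal number (possibly followed by a unit) and that unavailable values do not begin with a number; leadingNumber and maxNumericAttribute are hypothetical helpers.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parses the number at the start of `value` into `out`.
// Returns false if `value` does not begin with a number.
bool leadingNumber(const std::string &value, double &out) {
  const char *begin = value.c_str();
  char *end = nullptr;
  double parsed = std::strtod(begin, &end);
  if (end == begin) return false;  // nothing was consumed
  out = parsed;
  return true;
}

// Hypothetical helper: stores the largest parsable value in `best`.
// Returns false if no entry could be parsed.
bool maxNumericAttribute(const std::vector<std::string> &values,
                         double &best) {
  bool found = false;
  for (const std::string &value : values) {
    double n = 0.0;
    if (!leadingNumber(value, n)) continue;  // skip unparsable entries
    if (!found || n > best) best = n;
    found = true;
  }
  return found;
}
```

For example, the result of getNamedAttributeForAll("average die temp") could be passed to maxNumericAttribute to find the hottest device, provided the value format matches the assumption above.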

Application Event Record query methods.

std::string getLastAppEventRecordAsJSON(EventSeverity minimumSeverity = EventSevNone, const std::string &eventRecordPath = "")

Return a JSON-formatted string describing the last recorded application event.

If there is no recorded event, or the event has an EventSeverity below minimumSeverity, an empty JSON dictionary “{}” is returned.

Looks in the path specified by eventRecordPath for an event record. If this is empty, falls back to the path in IPU_APP_EVENT_RECORD_PATH. If neither is set, throws an exception.

Parameters

eventRecordPath – Path in which to look for an event record.

std::map<std::string, std::string> getLastAppEventRecord(EventSeverity minimumSeverity = EventSevNone, const std::string &eventRecordPath = "")

Return a map describing the last recorded application event.

If there is no recorded event, or the event has an EventSeverity below minimumSeverity, an empty map is returned.

Looks in the path specified by eventRecordPath for an event record. If this is empty, falls back to the path in IPU_APP_EVENT_RECORD_PATH. If neither is set, throws an exception.

Parameters

eventRecordPath – Path in which to look for an event record.

EventSeverity getLastAppEventRecordSeverity(const std::string &eventRecordPath = "")

Return the EventSeverity of the last event in the event record.

If there is no recorded event, EventSevNone is returned.

Looks in the path specified by eventRecordPath for an event record. If this is empty, falls back to the path in IPU_APP_EVENT_RECORD_PATH. If neither is set, throws an exception.

Parameters

eventRecordPath – Path in which to look for an event record.
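The path lookup order shared by the three event record methods can be sketched as follows. This is an illustration of the documented fallback, not the library's implementation: resolveEventRecordPath is a hypothetical helper, the exception type is an assumption (the reference only says that an exception is thrown), and treating an empty environment variable as unset is also an assumption.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring the documented lookup order:
//   1. the eventRecordPath argument, if non-empty;
//   2. the IPU_APP_EVENT_RECORD_PATH environment variable;
//   3. otherwise an exception (std::runtime_error is an assumption).
std::string resolveEventRecordPath(const std::string &eventRecordPath) {
  if (!eventRecordPath.empty()) return eventRecordPath;
  const char *fromEnv = std::getenv("IPU_APP_EVENT_RECORD_PATH");
  // Assumption: an empty environment variable counts as unset.
  if (fromEnv != nullptr && fromEnv[0] != '\0') return std::string(fromEnv);
  throw std::runtime_error(
      "no event record path: argument is empty and "
      "IPU_APP_EVENT_RECORD_PATH is not set");
}
```

Callers that rely on the environment variable may want to wrap calls to the query methods in a try block so that a missing path is reported rather than terminating the program.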

Device health check methods.

std::string checkHealthOfDevices(unsigned timeoutMS, bool checkActiveIPUs = false)

Run basic ‘health checks’ on all currently configured IPUs.

If all devices appear to be operating normally, returns an empty JSON object:

{}

If any malfunctioning devices were discovered, returns a JSON tree identifying the affected IPUs and the IPU-Machine host and partition they belong to. For example:

{
 "hosts": {
   "10.1.5.10": [
     {
       "error": "device",
       "id": "2",
       "partition": "p1"
     },
     {
       "error": "connection",
       "id": "3",
       "partition": "p1"
     }
   ]
 }
}

There are two error types defined:
  • “connection” - the client was unable to communicate with the IPUoF server (either because of network issues or server error) within the specified timeout.

  • “device” - the IPUoF server discovered a problem with the IPU or RNIC device.

Each IPU health check must complete within timeoutMS, or else a “connection” error will be recorded.

By default, devices which are currently in use by applications are not checked, unless checkActiveIPUs is true.
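Since a healthy system is reported as an empty JSON object, a caller can distinguish "all healthy" from "problems found" without a JSON parser. The sketch below does this by ignoring whitespace; isEmptyJsonObject is a hypothetical helper, and tolerating whitespace is a defensive assumption rather than documented behaviour. Extracting the affected hosts and IPUs from a non-empty result would need a JSON library.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns true if `json` is the empty object "{}"
// once all whitespace has been removed.
bool isEmptyJsonObject(const std::string &json) {
  std::string compact;
  for (unsigned char c : json) {
    if (!std::isspace(c)) compact.push_back(static_cast<char>(c));
  }
  return compact == "{}";
}
```

With a gcipuinfo instance named info (an assumed variable name), isEmptyJsonObject(info.checkHealthOfDevices(1000)) would be true when no malfunctioning devices were found within a 1000 ms timeout per check.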

Public Types

enum EventSeverity

Application Event Record severity level.

The severity level of an application event indicates how serious it is and potential ways of resolving the issue.

Values:

enumerator EventSevNone = 0

enumerator EventSevWarning = 1

An event which may indicate a system problem.

text: “warning”

enumerator EventSevApplicationError = 2

An error likely in the application code or configuration.

poplar::poplar_error, poplar::application_runtime_error

text: “application”

enumerator EventSevUndeterminedError = 3

It is not known if this is a system error, or an error in the application code or configuration.

poplar::unknown_runtime_error

text: “undetermined”

enumerator EventSevRequiresUserReset = 4

The error may be resolvable by an IPU reset, a partition reset (for Pod systems) or a link reset (for non-Pod systems).

poplar::recoverable_runtime_error + IPU_RESET or PARTITION_RESET or LINK_RESET

text: “requires_user_reset”

enumerator EventSevRequiresSystemReset = 5

The error may be resolvable by a full reboot of the IPU-M system or Poplar server.

poplar::recoverable_runtime_error + FULL_RESET

text: “requires_system_reset”

enumerator EventSevNonRecoverable = 6

The error may require admin-level system reconfiguration or hardware replacement.

poplar::unrecoverable_runtime_error

text: “nonrecoverable”
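The mapping between enumerators and their text labels can be sketched as a simple switch. The enum below is a local mirror of the values listed above so that the sketch compiles on its own; real code would use the library's EventSeverity instead. severityText and mayBeResolvedByReset are hypothetical helpers, and returning an empty string for EventSevNone is an assumption, since no text label is listed for it.

```cpp
#include <string>

// Local mirror of the enumerators documented above (for illustration only).
enum EventSeverity {
  EventSevNone = 0,
  EventSevWarning = 1,
  EventSevApplicationError = 2,
  EventSevUndeterminedError = 3,
  EventSevRequiresUserReset = 4,
  EventSevRequiresSystemReset = 5,
  EventSevNonRecoverable = 6
};

// Hypothetical helper: returns the documented text label for a severity.
std::string severityText(EventSeverity severity) {
  switch (severity) {
    case EventSevNone: return "";  // assumption: no label is documented
    case EventSevWarning: return "warning";
    case EventSevApplicationError: return "application";
    case EventSevUndeterminedError: return "undetermined";
    case EventSevRequiresUserReset: return "requires_user_reset";
    case EventSevRequiresSystemReset: return "requires_system_reset";
    case EventSevNonRecoverable: return "nonrecoverable";
  }
  return "";
}

// Hypothetical helper: true for the two severities described above as
// possibly resolvable by some form of reset.
bool mayBeResolvedByReset(EventSeverity severity) {
  return severity == EventSevRequiresUserReset ||
         severity == EventSevRequiresSystemReset;
}
```

Because the enumerators are ordered by increasing seriousness, a comparison such as severity >= EventSevRequiresUserReset is another way to select events that need operator action.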

Public Functions

gcipuinfo(DeviceDiscoveryMode = DiscoverActivePartitionIPUs)

6.1. Attribute labels

const std::string IPUAttributeLabels::DeviceId

Unique identifier of a single-IPU or multi-IPU device.

text: “id”

const std::string IPUAttributeLabels::AverageBoardTemp

Average temperature in degrees Celsius as read by the sensors on the board.

text: “average board temp”

const std::string IPUAttributeLabels::AverageDieTemp

Average temperature in degrees Celsius as read by IPU sensors.

text: “average die temp”

const std::string IPUAttributeLabels::BoardIpuIndex

IPU number on board (0-1 for PCIe cards, 0-3 for IPU-Machines).

text: “board ipu index”

const std::string IPUAttributeLabels::BoardType

The IPU board type ‘family’, for example C600 or M2000.

Note: M2000 includes IPU-M2000 and Bow-2000.

text: “board type”

const std::string IPUAttributeLabels::ClockFrequency

Current clock frequency.

text: “clock”

const std::string IPUAttributeLabels::DriverVersion

PCIe driver version, specified as a <major.minor.patch> triple.

text: “driver version”

const std::string IPUAttributeLabels::GatewaySoftwareVersion

(IPUoF) IPU-Gateway software version, specified as a <major.minor.patch> triple.

text: “gateway software version”

const std::string IPUAttributeLabels::GcdId

(IPUoF) Graphcore Compile Domain ID.

text: “gcd id”

const std::string IPUAttributeLabels::HexoattTotalSize

Total remote-buffer memory available.

text: “hexoatt total size (bytes)”

const std::string IPUAttributeLabels::HexoattActiveSize

Total remote-buffer memory in use by the IPU.

text: “hexoatt active size (bytes)”

const std::string IPUAttributeLabels::HexoptTotalSize

Total host exchange memory available.

text: “hexopt total size (bytes)”

const std::string IPUAttributeLabels::HexoptActiveSize

Total host exchange memory in use by the IPU.

text: “hexopt active size (bytes)”

const std::string IPUAttributeLabels::IpuArchitecture

IPU hardware architecture version.

text: “ipu architecture”

const std::string IPUAttributeLabels::IpuofHost

(IPUoF) IP address of IPU-Gateway.

text: “ipuof host”

const std::string IPUAttributeLabels::IpuofServerVersion

(IPUoF) Fabric server version.

text: “ipuof server version”

const std::string IPUAttributeLabels::IpuUtilisation

Percentage of time spent waiting for one or more IPU syncs, measured in the last second.

text: “ipu utilisation”

const std::string IPUAttributeLabels::IpuUtilisationSession

Percentage of time spent waiting for one or more IPU syncs since the HSPs were set up.

text: “ipu utilisation (session)”

const std::string IPUAttributeLabels::LinkCorrectableErrorCount

IPU Link correctable error count.

text: “link correctable error count”

const std::string IPUAttributeLabels::LinkSpeed

(PCIe) PCIe link speed available.

text: “link speed”

const std::string IPUAttributeLabels::LinkWidth

(PCIe) Number of PCIe lanes available.

text: “link width”

const std::string IPUAttributeLabels::MaxActiveCodeSize

Maximum active code size (bytes).

text: “max active code size (bytes)”

const std::string IPUAttributeLabels::MaxActiveDataSize

Maximum active data size (bytes).

text: “max active data size (bytes)”

const std::string IPUAttributeLabels::MaxActiveStackSize

Maximum active stack size (bytes).

text: “max active stack size (bytes)”

const std::string IPUAttributeLabels::MultiIpuDeviceId

Multi-IPU device the IPU belongs to.

text: “multi-ipu device id”

const std::string IPUAttributeLabels::MultiIpuDiscoveryMethod

Method used to discover multi-IPU groups.

text: “multi-ipu discovery method”

const std::string IPUAttributeLabels::NumaNode

NUMA node the IPU is on.

text: “numa node”

const std::string IPUAttributeLabels::NumIpuLinkSegments

(IPUoF) Number of IPU-Link segments.

text: “number of ipu-link segments”

const std::string IPUAttributeLabels::NumReplicas

(IPUoF) Number of replicas in the partition.

text: “number of replicas”

const std::string IPUAttributeLabels::PartitionId

(IPUoF) partition ID.

text: “ipuof partition id”

const std::string IPUAttributeLabels::PartitionSyncType

(IPUoF) sync configuration type, for example ‘c2-compatible’.

text: “partition sync type”

const std::string IPUAttributeLabels::PciId

PCIe device identifier.

text: “pci id”

const std::string IPUAttributeLabels::PhysicalSlot

PCIe physical slot.

text: “pcie physical slot”

const std::string IPUAttributeLabels::ProcessStartTime

The start time of the process currently using the IPU.

text: “process start time”

const std::string IPUAttributeLabels::ReconfigurablePartition

(IPUoF) Set to 1 if the IPU is part of a reconfigurable partition.

text: “reconfigurable partition”

const std::string IPUAttributeLabels::RemoteBuffersSupported

Set to 1 if remote buffers are supported.

text: “remote buffers supported”

const std::string IPUAttributeLabels::SerialNumber

Serial number of the board.

text: “board serial number”

const std::string IPUAttributeLabels::TotalBoardPower

Total current power consumption as read by board level sensors.

Not used on IPU-Machines.

text: “total board power”

const std::string IPUAttributeLabels::UserExecutable

The name of the process using the device.

text: “user executable”

const std::string IPUAttributeLabels::UserName

The username of the user using the device.

text: “user name”

const std::string IPUAttributeLabels::UserProcessId

The process ID of the process using the device.

text: “user process id”

const std::string IPUAttributeLabels::GatewayRoutingType

(IPUoF) GW-Link routing type.

text: “gateway routing type”

const std::string IPUAttributeLabels::IpuLinkSegmentId

(IPUoF) Identifier of IPU-Link segment.

text: “ipu link segment id”

const std::string IPUAttributeLabels::NumGcds

Number of Graph Compile Domains.

text: “number of gcds”

const std::string IPUAttributeLabels::FirmwareVersion

ICU Firmware version, specified as a <major.minor.patch> triple.

In development builds, this will be suffixed with branch and build information.

text: “firmware version”

const std::string IPUAttributeLabels::IpuofServerError

(IPUoF) Set if an error occurred while attempting to communicate with the IPUoF server (a ‘connection’ error), or if the IPUoF server was unable to use the device (a ‘device’ error).

text: “ipuof server error”

const std::string IPUAttributeLabels::HostLinkCorrectableErrorCount

(PCIe) Host Link correctable error count.

text: “host link correctable error count”

const std::string IPUAttributeLabels::ApplicationHost

(IPUoF) IP address of the headnode where the application using this IPU is running.

text: “application host”

const std::string IPUAttributeLabels::IpuErrorState

Error state of the IPU.

Set to ‘ipu memory failure’ if the tile parity error thresholds have been exceeded.

text: “ipu error state”

const std::string IPUAttributeLabels::ParityErrorCountThreshold

Threshold for the number of parity errors to promote to an unrecoverable error.

text: “parity error count threshold”

const std::string IPUAttributeLabels::ParityErrorThresholdInterval

Threshold in seconds at which ‘num parity errors’ are promoted to an uncorrectable error.

text: “parity error threshold interval”

const std::string IPUAttributeLabels::IpumSoftwareVersion

(IPUoF) IPU-M software version.

text: “ipum software version”

const std::string IPUAttributeLabels::IpuPower

Power consumption of a single IPU.

Only available on IPU-Machines.

text: “ipu power”

const std::string IPUAttributeLabels::LinkCorrectableErrorCountSession

IPU Link correctable error count since device was last reset.

text: “link correctable error count (session)”

const std::string IPUAttributeLabels::HostLinkCorrectableErrorCountSession

(PCIe) Host-Link correctable error count since device was last reset.

text: “host link correctable error count (session)”

const std::string IPUAttributeLabels::BoardVariant

IPU board model name.

This will be identical to BoardType if this product only has a single variant.

text: “board variant”

const std::string IPUAttributeLabels::GatewayWriteCombining

Gateway write combining status.

text: “gateway write combining”

const std::string IPUAttributeLabels::SecondaryPcieInterfaceSupported

Set to 1 if the secondary interface is supported.

text: “secondary pcie interface supported”

const std::string IPUAttributeLabels::ICUBootloaderVersion

ICU bootloader version, specified as a <major.minor.patch> triple.

In development builds, this will be suffixed with branch and build information.

text: “icu bootloader version”