5.13. Error handling
This section describes how PopRT handles errors raised by Poplar while an inference application is running. These Poplar exceptions can be raised either by the application or by the IPU hardware. A full description of all Poplar errors can be found in the Exceptions section of the Poplar and PopLibs API Reference.
Note
To keep your system stable and reliable, configure the PopRT runtime to match your requirements and your error-handling strategy.
PopRT handles Poplar errors as follows:
application_runtime_error
If config.auto_reset is true, then the IPU is automatically reset before the next execution. This reset doesn’t affect other running IPUs.
Any new requests will be processed after the IPU reset is complete.
If config.auto_reset is false, then an error is raised. The error message contains the reason why the error occurred.
All requests which have already been enqueued before the exception occurred will return the error.
recoverable_runtime_error
If poplar::RecoveryAction is IPU_RESET and config.auto_reset is true, then the IPU is automatically reset before the next execution. This reset doesn’t affect other running IPUs.
Any new requests will be processed after the IPU reset is complete.
If poplar::RecoveryAction is not IPU_RESET or config.auto_reset is false, then an error is raised. The error message contains the reason why the error occurred.
All requests which have already been enqueued before the exception occurred will return the error.
Unknown runtime errors
An error is raised.
The error message might contain the reason why the error occurred.
When these errors occur, manual intervention is required before the system is operational again.
The IPU will not be reset and all requests will return the error.
All other runtime errors
An error is raised.
The error message might contain the reason why the error occurred.
When these errors occur, manual intervention might be required before the system is operational again.
The error message might contain a required recovery action.
The IPU will not be reset and all requests will return the error.
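The rules above can be summarised as a small decision function. The following Python sketch is only a model of the behaviour described in this section, not the PopRT API: the names ErrorKind, RecoveryAction, Response, Outcome and decide are hypothetical, and RecoveryAction.OTHER is a stand-in for any poplar::RecoveryAction value other than IPU_RESET. Where this section does not say whether manual intervention is needed, the sketch records None.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(Enum):
    """Error categories listed in this section (hypothetical names)."""
    APPLICATION_RUNTIME = "application_runtime_error"
    RECOVERABLE_RUNTIME = "recoverable_runtime_error"
    UNKNOWN_RUNTIME = "unknown runtime error"
    OTHER_RUNTIME = "other runtime error"


class RecoveryAction(Enum):
    """Simplified stand-in for poplar::RecoveryAction."""
    IPU_RESET = "IPU_RESET"
    OTHER = "OTHER"  # assumption: placeholder for any action that is not IPU_RESET


class Response(Enum):
    RESET_IPU = "IPU is reset before the next execution"
    RAISE_ERROR = "an error is raised"


@dataclass(frozen=True)
class Outcome:
    response: Response
    # "required", "might be required", or None when the text does not say.
    manual_intervention: Optional[str] = None


def decide(kind: ErrorKind,
           auto_reset: bool,
           recovery_action: Optional[RecoveryAction] = None) -> Outcome:
    """Return the documented response for one error, given config.auto_reset."""
    # Unknown and all other runtime errors never trigger a reset.
    if kind is ErrorKind.UNKNOWN_RUNTIME:
        return Outcome(Response.RAISE_ERROR, "required")
    if kind is ErrorKind.OTHER_RUNTIME:
        return Outcome(Response.RAISE_ERROR, "might be required")

    # A recoverable error is only reset when the recovery action is IPU_RESET.
    if (kind is ErrorKind.RECOVERABLE_RUNTIME
            and recovery_action is not RecoveryAction.IPU_RESET):
        return Outcome(Response.RAISE_ERROR)

    # Remaining cases: application_runtime_error, or a recoverable error
    # whose recovery action is IPU_RESET. Both depend on auto_reset alone.
    if auto_reset:
        return Outcome(Response.RESET_IPU)
    return Outcome(Response.RAISE_ERROR)


if __name__ == "__main__":
    # Print the response for every error kind with auto_reset enabled.
    for kind in ErrorKind:
        outcome = decide(kind, auto_reset=True,
                         recovery_action=RecoveryAction.IPU_RESET)
        print(kind.value, "->", outcome.response.value)
```

Modelling the policy as a pure function makes it easy to check an application's own error-handling strategy against each combination of error type and config.auto_reset before deploying.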