5.13. Error handling

This section describes how PopRT handles errors raised by Poplar while running an inference application. These Poplar exceptions can be raised by either the application or the IPU hardware. For a full description of all Poplar errors, see the Exceptions section of the Poplar and PopLibs API Reference.

Note

To ensure the stability and reliability of your system, configure the PopRT runtime according to your requirements and error handling strategy.

PopRT handles Poplar errors as follows:

  • application_runtime_error

    • If config.auto_reset is true, then the IPU is automatically reset before the next inference.

      • The reset doesn’t affect other running IPUs.

      • Any new requests will be processed after the IPU reset is complete.

    • If config.auto_reset is false, then an error is raised.

      • The error message contains the reason why the error occurred.

      • All requests that were enqueued before the exception occurred return the error.

  • recoverable_runtime_error

    • If poplar::RecoveryAction is IPU_RESET and config.auto_reset is true, then the IPU is automatically reset before the next inference.

      • The reset doesn’t affect other running IPUs.

      • Any new requests will be processed after the IPU reset is complete.

    • If poplar::RecoveryAction is not IPU_RESET or config.auto_reset is false, then an error is raised.

      • The error message contains the reason why the error occurred.

      • All requests that were enqueued before the exception occurred return the error.

  • Unknown runtime errors

    • An error is raised.

    • The error message might contain the reason why the error occurred.

    • When these errors occur, manual intervention is required before the system is operational again.

    • The IPU will not be reset and all requests will return the error.

  • All other runtime errors

    • An error is raised.

    • The error message might contain the reason why the error occurred.

    • When these errors occur, manual intervention might be required before the system is operational again.

    • The error message might contain a required recovery action.

    • The IPU will not be reset and all requests will return the error.
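
The rules in the list above can be read as a small decision table. The following Python sketch models only that table and does not call PopRT or Poplar. The names ErrorKind, Outcome and handle_error are hypothetical and exist only for this illustration, the recovery action is passed as a plain string, and the auto_reset argument stands in for config.auto_reset.

```python
from enum import Enum


class ErrorKind(Enum):
    # Hypothetical labels for the four error categories in the list above.
    APPLICATION_RUNTIME_ERROR = "application_runtime_error"
    RECOVERABLE_RUNTIME_ERROR = "recoverable_runtime_error"
    UNKNOWN_RUNTIME_ERROR = "unknown runtime error"
    OTHER_RUNTIME_ERROR = "other runtime error"


class Outcome(Enum):
    # Hypothetical labels for the two outcomes the list describes.
    RESET_IPU = "IPU is reset before the next inference"
    RAISE_ERROR = "an error is raised and requests return the error"


def handle_error(kind, auto_reset, recovery_action=None):
    """Return the outcome described above for one Poplar error.

    auto_reset stands in for config.auto_reset. recovery_action is a
    string standing in for poplar::RecoveryAction and is only consulted
    for recoverable_runtime_error.
    """
    if kind is ErrorKind.APPLICATION_RUNTIME_ERROR:
        return Outcome.RESET_IPU if auto_reset else Outcome.RAISE_ERROR
    if kind is ErrorKind.RECOVERABLE_RUNTIME_ERROR:
        # Both conditions must hold for an automatic reset.
        if recovery_action == "IPU_RESET" and auto_reset:
            return Outcome.RESET_IPU
        return Outcome.RAISE_ERROR
    # Unknown and all other runtime errors: the IPU is never reset.
    return Outcome.RAISE_ERROR


if __name__ == "__main__":
    # Print the outcome for every category with auto_reset on and off.
    for kind in ErrorKind:
        for auto_reset in (True, False):
            outcome = handle_error(kind, auto_reset, "IPU_RESET")
            print(f"{kind.value}, auto_reset={auto_reset}: {outcome.value}")
```

In this model, setting auto_reset to true changes the outcome only for application_runtime_error and for recoverable_runtime_error whose recovery action is IPU_RESET. Every other case ends with an error being raised regardless of the setting.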