7. Error handling

This section describes how Model Runtime handles Poplar recoverable errors that are raised while a model is executing. A recoverable error is raised when a running program fails because of a system error that is likely to be transient.

A full description of all Poplar errors can be found in the Exceptions section of the Poplar and PopLibs API Reference.

Model Runtime handles errors as follows:

  • application_runtime_error

    • If auto_reset is true, then the IPU is automatically reset before the next inference is executed.

      • Any new requests will be processed after the IPU reset is complete.

    • If auto_reset is false, then an exception is raised.

      • The error message contains the reason why the error occurred.

      • All requests that were already enqueued when the exception occurred raise the same error.

  • recoverable_runtime_error

    • If poplar::RecoveryAction is IPU_RESET and auto_reset is true, then the IPU is automatically reset before the next inference is executed.

      • Any new requests will be processed after the IPU reset is complete.

    • If poplar::RecoveryAction is not IPU_RESET or if auto_reset is false, then an exception is raised.

      • The error message contains the reason why the error occurred.

      • All requests that were already enqueued when the exception occurred raise the same error.

  • Unknown runtime errors

    • An exception is raised.

    • The error message might contain the reason why the error occurred.

    • When these errors occur, manual intervention is required before the system is operational again.

    • The IPU will not be reset and all requests will raise the same error.

  • All other runtime errors

    • An exception is raised.

    • The error message might contain the reason why the error occurred.

    • When these errors occur, manual intervention might be required before the system is operational again.

    • The error message might contain a required recovery action.

    • The IPU will not be reset and all requests will raise the same error.
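
The rules above can be summarised as a small decision function. The following is a minimal sketch for illustration only: ErrorKind, RecoveryAction, Outcome and decide() are hypothetical names that model the list above, and they are not part of the Model Runtime or Poplar API. The real poplar::RecoveryAction may have further values; here everything other than IPU_RESET is collapsed into a single Other value.

```cpp
// Hypothetical stand-ins used only to illustrate the rules in this section.
// None of these types or functions exist in Model Runtime or Poplar.
enum class ErrorKind {
  ApplicationRuntime,  // models application_runtime_error
  RecoverableRuntime,  // models recoverable_runtime_error
  UnknownRuntime,      // models unknown runtime errors
  OtherRuntime         // models all other runtime errors
};

// Simplified model of poplar::RecoveryAction: only "is it IPU_RESET or not"
// matters for the rules described above.
enum class RecoveryAction { IpuReset, Other };

enum class Outcome {
  ResetBeforeNextExecution,  // IPU is reset, new requests run afterwards
  RaiseException             // error is propagated to the caller
};

// Returns what this section says happens for a given error and configuration.
Outcome decide(ErrorKind kind, RecoveryAction action, bool auto_reset) {
  switch (kind) {
    case ErrorKind::ApplicationRuntime:
      // Reset only when auto_reset is enabled.
      return auto_reset ? Outcome::ResetBeforeNextExecution
                        : Outcome::RaiseException;
    case ErrorKind::RecoverableRuntime:
      // Reset only when the recovery action is IPU_RESET and auto_reset is enabled.
      return (action == RecoveryAction::IpuReset && auto_reset)
                 ? Outcome::ResetBeforeNextExecution
                 : Outcome::RaiseException;
    case ErrorKind::UnknownRuntime:
    case ErrorKind::OtherRuntime:
      // The IPU is never reset automatically for these errors.
      return Outcome::RaiseException;
  }
  return Outcome::RaiseException;  // unreachable for valid enum values
}
```

In this sketch, the recovery action argument is ignored for every error kind except the recoverable one, which mirrors the fact that only the recoverable_runtime_error rules mention poplar::RecoveryAction.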