1. Overview

As a fail-safe mechanism for reducing the number of “bricked” devices in the field, the ICU bootloader employs a recovery mode called the “911 state”. This ensures that the ICU can be fully recovered in the field should a substantial ICU failure occur.

1.1. What can cause the ICU to enter the recovery mode

  1. Flash corruption. The bootloader validates both firmware images stored in internal flash memory. If they are both rendered invalid, then the ICU will enter the 911 state. This will then allow repair of the firmware images and allow them to boot successfully.

  2. Failed firmware update. A system software update is a complex process with many moving parts, and it requires multiple system restarts. Occasionally, an update can go wrong, for example if the ICU loses power midway through the update process or during its internal bank-swapping process. Because the validation step then fails, this appears as flash corruption. Again, the ICU will enter the 911 state.

  3. Other unknown repeated ICU errors can also cause it to enter the 911 state.

2. Recovery modes

The 911 recovery can be activated in two modes:

  • Soft recovery. For example, caused by repeated firmware crashes

    Soft recovery is a non-permanent bootloader state. The ICU will return to normal operation after a power-cycle.

  • Hard recovery. For example, caused by flash corruption of both firmware images

    Hard recovery occurs when the bootloader detects that both firmware images are in a corrupt state. In this mode the bootloader erases all flash areas used by the firmware. This creates a clean state for firmware recovery.

    Note

    Hard recovery is really a last resort. Should any single firmware image become corrupted then the bootloader can always recover the ICU from this state.

2.1. How do I check if the ICU is in recovery mode

There are two methods to check if the ICU is in recovery mode, using either the dfu-util or lsusb commands. You must be logged in to the BMC via SSH or the serial console to use these commands. See the BMC User Guide for more information.

2.1.1. Use the dfu-util command

The dfu-util command implements the host side of the USB device firmware upgrade (DFU) protocol. It is used by the IPU-M software to upgrade the ICU firmware. You can also use it, with the -l option, to check the state of the ICU.

In normal operation the output of dfu-util should look like the following example:

# dfu-util -l
Found Runtime: [bbbc:0100] ver=0205, devnum=91, cfg=1, intf=3, path="3-1", alt=0, name="UNKNOWN", serial="0033.0001.8203921.E"
Found Runtime: [bbbc:0100] ver=0205, devnum=92, cfg=1, intf=3, path="3-2", alt=0, name="UNKNOWN", serial="0033.0002.8203921.E"

If one ICU is in recovery mode the output will look like this, with the product ID set to 0911 and the serial number set to 911:

$ dfu-util -l
Found Runtime: [bbbc:0911] ver=0205, devnum=91, cfg=1, intf=0, path="3-1", alt=0, name="UNKNOWN", serial="911"
Found Runtime: [bbbc:0100] ver=0205, devnum=92, cfg=1, intf=3, path="3-2", alt=0, name="UNKNOWN", serial="0033.0002.8203921.E"
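
The check above can be scripted. The following is a minimal sketch, assuming the output format shown in the examples; count_recovery_icus is a hypothetical helper name and is not part of the IPU-M software. It reads dfu-util output on standard input and counts the lines that carry either recovery marker (a product ID of 0911 or a serial number of 911).

```shell
#!/bin/sh
# Sketch only: count_recovery_icus is a hypothetical helper, not an IPU-M tool.
# It reads "dfu-util -l" output on stdin and prints the number of lines that
# show a product ID of 0911 or a serial number of "911".
count_recovery_icus() {
    # grep -c prints the count even when it is 0; "|| true" hides the
    # non-zero exit status that grep returns when nothing matches.
    grep -c -e ':0911]' -e 'serial="911"' || true
}
```

On the BMC it could be used as "dfu-util -l | count_recovery_icus"; a result of 0 would suggest that no listed ICU is in the 911 state.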

2.1.2. Use the lsusb command

The lsusb command displays information about USB buses in the system and the devices connected to them.

In normal operation the output of lsusb should look like the following example:

# lsusb
Bus 003 Device 034: ID bbbc:0100
Bus 003 Device 033: ID bbbc:0100
Bus 003 Device 001: ID 1d6b:0001
Bus 002 Device 001: ID 1d6b:0002
Bus 001 Device 001: ID 1d6b:0002

The ID for the ICUs in normal operating state is 0100; see the top two devices in the example above.

If one ICU is in recovery mode it will look like this:

# lsusb
Bus 003 Device 034: ID ffff:0911
Bus 003 Device 033: ID bbbc:0100
Bus 003 Device 001: ID 1d6b:0001
Bus 002 Device 001: ID 1d6b:0002
Bus 001 Device 001: ID 1d6b:0002

The ID for the ICU in recovery mode is 0911.
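
The same check can be sketched for lsusb output. This example assumes the line layout shown above ("Bus NNN Device NNN: ID vendor:product") and matches on the product ID only; find_recovery_usb_devices is a hypothetical helper name, not part of the IPU-M software.

```shell
#!/bin/sh
# Sketch only: find_recovery_usb_devices is a hypothetical helper.
# It reads "lsusb" output on stdin and prints "bus/device" for every
# entry whose product ID (the part after the colon) is 0911.
find_recovery_usb_devices() {
    awk '$5 == "ID" && $6 ~ /:0911$/ {
        dev = $4
        sub(/:$/, "", dev)      # strip the trailing colon from "034:"
        print $2 "/" dev
    }'
}
```

On the BMC it could be used as "lsusb | find_recovery_usb_devices"; no output would suggest that no ICU is reporting the recovery ID.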

3. How do I exit the 911 state

Note

Both the condition that caused the 911 state and the recovery process are likely to cause kernel events and error messages to be logged on the BMC.

Before trying any of these recovery methods, confirm that the IPU-Gateway has finished booting. You can do this by logging on to the IPU-Gateway:

$ ssh itadmin@<gateway_address>

3.1. Power cycling

For the majority of 911 cases that we have observed, simply power cycling the ICU will clear the 911 state.

You must be logged in to the BMC via SSH or the serial console to use the commands described below. See the BMC User Guide for more information.

You can use bfpga.py (part of the IPU-M software package) to power-cycle the ICUs using the following commands:

$ bfpga.py -w 0 0x0f   # Power off the ICUs
$ bfpga.py -w 0 0x7f   # Power on the ICUs

Then check whether any ICU is still in recovery mode. Only move on to the next step if the ICU has successfully recovered.

$ dfu-util -l

If two ICU devices aren’t present, then run the command again.

Note

The ICUs should generally appear in a few seconds, but when exiting from recovery mode, it can take up to 20 seconds.
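
Rather than re-running the command by hand, the wait can be scripted. The following is a rough sketch, assuming the dfu-util output format shown earlier; wait_for_icus and the LIST_CMD variable are hypothetical names introduced here for illustration. It polls once per second until the expected number of ICUs is listed outside the 911 state, or until the timeout expires.

```shell
#!/bin/sh
# Sketch only: wait_for_icus and LIST_CMD are hypothetical names.
# LIST_CMD is the command used to list DFU devices; it can be overridden.
LIST_CMD="${LIST_CMD:-dfu-util -l}"

wait_for_icus() {
    expected="${1:-2}"   # number of ICUs expected (two in the examples above)
    timeout="${2:-20}"   # seconds; exiting recovery mode can take up to 20 s
    elapsed=0
    while :; do
        # Count "Found" lines that do not carry a 911 marker.
        count=$($LIST_CMD 2>/dev/null | grep 'Found' \
            | grep -v -c -e ':0911]' -e 'serial="911"' || true)
        if [ "$count" -ge "$expected" ]; then
            return 0     # all expected ICUs are present and not in 911
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1     # gave up: an ICU is missing or still in 911
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

For example, "wait_for_icus 2 20 && echo recovered" would print "recovered" only once both ICUs are listed in their normal state.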

Finally, power-cycle the IPU-Gateway to restore the links:

$ ipum-utils power_softoff && ipum-utils power_on

3.2. Force upgrade

The ICU upgrade script can also be run with the force flag set. In this case, it will not check the version of the ICU before upgrading. Again, this should normally fix the 911 state.

To find the firmware, go to the installation directory of the IPU-M software package, then execute the following commands:

$ cd bmc/mcu/
$ scp -r icu_firmwares_package_* root@BMC_address:~/
$ ssh root@BMC_address
$ cd  icu_firmwares_package_*

You will need to use slightly different commands to update the firmware, depending on the type of the IPU-Machines in the system. See Section 4, Determine system type for instructions on how to do this.

  • For an IPU-M2000:

    $ ./install_on_target_m2000.sh --force firmware
    
  • For a Bow-2000:

    $ ./install_on_target_m2000.sh --force wow_firmware