6. Bow-2000 software and firmware upgrade

This section describes how to upgrade the software and firmware on each Bow-2000.

Note

You can check the version of the software components currently running on the Bow-2000 IPU-Machines with the rack_tool option --show-running-version (Section 5.4, Rack tool).

6.1. What happens during the upgrade

When you upgrade the software and firmware on a Bow-2000, the following is upgraded:

  • BMC OS

  • The firmware of the IPU Control Unit and the system FPGA

  • IPU-Gateway OS

  • V-IPU agent

    • The version of the V-IPU agent must match the version of the V-IPU server that runs on the management server

Note

Graphcore has only qualified the Bow-2000 software release for the versions of the software sub-components that are listed in the release notes. Other combinations of software component versions are not guaranteed to work.

The IPU-Gateway and IPU Control Unit (ICU) on the Bow-2000 support booting from one of two persistent software images, the active image or the standby image. The active image is the running image. Upgrading only affects the standby image, which is the image that is not running.

When the upgrade of the standby image completes successfully, the Bow Pod64 system is immediately instructed to switch to the upgraded standby image, making it the active image. The previously running image now becomes the standby image.
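The active/standby behaviour described above can be illustrated with a small toy model. This is only a sketch of the general dual-image idea, not the Bow-2000 implementation: the class name, the `write_ok` flag and the version strings are all hypothetical, and the model assumes that a failed write leaves both images unchanged.

```python
from dataclasses import dataclass


@dataclass
class DualImageDevice:
    """Toy model of a device with two persistent boot images."""
    active: str   # version held by the running image
    standby: str  # version held by the image that is not running

    def upgrade(self, new_version: str, write_ok: bool = True) -> bool:
        """Write new_version to the standby image, then swap on success."""
        if not write_ok:
            # Simplifying assumption: a failed write changes nothing.
            return False
        self.standby = new_version
        # Switch images: the upgraded standby becomes the active image,
        # and the previously running image becomes the standby image.
        self.active, self.standby = self.standby, self.active
        return True


# Hypothetical version strings, for illustration only.
device = DualImageDevice(active="1.0", standby="0.9")
device.upgrade("1.1")
print(device)  # DualImageDevice(active='1.1', standby='1.0')
```

In this model the previous version remains available in the standby image after the switch, which is why the upgrade never overwrites the image that is currently running.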

If you want to revert to the previous software version, then simply perform the “upgrade” using the previous version of the IPU-M software.

Note

Keep at least the previous version of the IPU-M software on the management server in case a downgrade is required.

Upgrading the IPU-Gateway will overwrite all files in the standby image with default OS files. This means that any site-specific config files on the IPU-Gateway are also replaced with default config files during the upgrade.

To preserve site-specific settings, rack_tool uses overlay config files: site-specific config files that are stored on the management server and copied (“overlaid”) onto the IPU-Gateway root file system after an upgrade. In this way, a site-specific configuration is easily maintained on the IPU-Gateway even after the upgrade.

In an installation with multiple Bow Pod64 racks, the overlay config files are then copied to all Bow-2000 IPU-Machines on all racks (as defined in the JSON rack_tool configuration file) which ensures that all Bow-2000 IPU-Machines are identically configured, as they are required to be.
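The overlay idea can be sketched locally as copying a directory tree on top of a root directory, so that each overlay file replaces the default file at the same relative path. This is an illustrative sketch only: rack_tool copies the files to remote machines, whereas the hypothetical `apply_overlay` helper below works on local directories, and the file name `etc/ntp.conf` is just an example.

```python
import shutil
import tempfile
from pathlib import Path


def apply_overlay(overlay_dir, root_dir):
    """Copy every file under overlay_dir to the same relative path in root_dir."""
    copied = []
    for src in sorted(Path(overlay_dir).rglob("*")):
        if src.is_file():
            rel = src.relative_to(overlay_dir)
            dst = Path(root_dir) / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # overwrites the default file if present
            copied.append(rel.as_posix())
    return copied


# Demonstration in a temporary directory; no real system is touched.
with tempfile.TemporaryDirectory() as tmp:
    overlay = Path(tmp) / "overlay"
    root = Path(tmp) / "rootfs"
    (overlay / "etc").mkdir(parents=True)
    (root / "etc").mkdir(parents=True)
    # Default file, as it would look straight after an upgrade.
    (root / "etc" / "ntp.conf").write_text("default\n")
    # Site-specific replacement kept in the overlay tree.
    (overlay / "etc" / "ntp.conf").write_text("site-specific\n")
    demo_copied = apply_overlay(overlay, root)
    demo_result = (root / "etc" / "ntp.conf").read_text()

print(demo_copied, repr(demo_result))
```

Because the overlay tree mirrors the layout of the root file system, applying the same tree to every machine yields the same configuration on each of them.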

6.2. Upgrade instructions

The Rack tool utility that was installed on the management server (Section 5.3, IPU-M software) is used to upgrade the software and firmware on the Bow-2000 IPU-Machines.

Refer to the rack_tool man page for details on the specific upgrade commands.

Note

You cannot upgrade Bow-2000 IPU-Machines while ML jobs are running on them, because the upgrade process reboots the Bow-2000 IPU-Machines.

  1. Log in to the management server as ipuuser.

  2. Ensure you have downloaded and installed the latest version of the IPU-M software onto the management server (Section 5.3, IPU-M software).

  3. Change to the directory that contains the specific IPU-M release version you are upgrading to:

    $ cd $HOME/IPU-M_releases/IPU-M_<release version>
    
  4. Run the commands to do the upgrade:

    $ virtualenv -p python3 venv
    $ source venv/bin/activate
    $ pip3 install -r requirements.txt
    $ ./rack_tool.py upgrade

rack_tool uses a default config file containing information on how to access the Bow-2000 IPU-Machines. The default location and name of this config file is:

$HOME/ipuuser/.rack_tool/rack_config.json

Note

It is possible to specify your own JSON configuration file with the --config-file option to rack_tool.

A site administrator who integrates the Bow Pod64 into a site-specific network can edit rack_config.json if the default Bow Pod64 IP address plan collides with that network. The DHCP server configuration must then match the IP addresses used in the JSON configuration file.
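Before editing the address plan, it can help to check which planned addresses fall inside networks that are already in use on site. The sketch below uses Python's standard ipaddress module for that check. The function name `find_collisions` is hypothetical, and the addresses are reserved documentation ranges, not the Bow Pod64 defaults; how the addresses are read from rack_config.json is deliberately left out because its schema is not described here.

```python
import ipaddress


def find_collisions(pod_addresses, site_networks):
    """Return the pod addresses that fall inside any of the site networks."""
    networks = [ipaddress.ip_network(n) for n in site_networks]
    return [
        addr
        for addr in pod_addresses
        if any(ipaddress.ip_address(addr) in net for net in networks)
    ]


# Placeholder values from the documentation address ranges.
planned = ["192.0.2.10", "198.51.100.7"]
in_use = ["192.0.2.0/24"]
print(find_collisions(planned, in_use))  # ['192.0.2.10']
```

Any address that the check reports would need to be changed in both the JSON configuration file and the DHCP server configuration, so that the two stay consistent.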

The upgrade process will take several minutes and all the Bow-2000 IPU-Machines will be upgraded in parallel to make this time as short as possible. Part of the upgrade process is to reboot each Bow-2000 a few times to activate the new software.

After the reboots, rack_tool verifies that the upgrade has completed successfully by checking that all sub-components have been upgraded to the same version and that this version corresponds to what is defined in the release notes.

Note

If this verification fails because the reported versions do not match the expected versions, run the upgrade again.
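The kind of version check described above can be sketched as a comparison between the expected component versions and the versions each machine reports. This is not the rack_tool implementation: the function `find_mismatches`, the machine names, the component names and the version strings are all hypothetical placeholders.

```python
def find_mismatches(expected, reported):
    """Map each machine to its components whose version differs from expected.

    expected: {component: version}
    reported: {machine: {component: version}}
    Returns {machine: {component: (reported_version, expected_version)}}.
    """
    mismatches = {}
    for machine, versions in reported.items():
        bad = {
            name: (versions.get(name), want)
            for name, want in expected.items()
            if versions.get(name) != want
        }
        if bad:
            mismatches[machine] = bad
    return mismatches


# Placeholder data, for illustration only.
expected = {"bmc": "2.0", "gateway": "2.0"}
reported = {
    "machine-1": {"bmc": "2.0", "gateway": "2.0"},
    "machine-2": {"bmc": "2.0", "gateway": "1.9"},
}
print(find_mismatches(expected, reported))
# {'machine-2': {'gateway': ('1.9', '2.0')}}
```

An empty result means every machine reports the expected versions; a non-empty result identifies the machines and components for which the upgrade should be run again.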