5. IPU‑POD64 software installation and configuration

This section describes the software installation and configuration required for each IPU‑POD64.

5.1. Management server

One server in the IPU‑POD64 is nominated as the management server. By default this is server 1. The following Graphcore software packages need to be installed on the management server:

  • V-IPU software: contains management and control software for IPU resource control, Built-In Self-Test (BIST) and monitoring of the IPU-M2000s and IPUs. Both a V-IPU Admin Guide and a V-IPU User Guide are available.

Note

In an IPU‑POD128 only one management server will be running the V-IPU software (vipu-server) so you will only have to install it once, on the management server you have designated as the main IPU‑POD128 management server. The role of the management servers in an IPU‑POD128 is described in more detail in Section 8, IPU‑POD128 network configuration.

  • IPU-M2000 software: contains the latest IPU-M2000 resident software. It also includes the server-resident rack_tool utility, which is required for operations performed from the management server on the IPU-M2000s.

Note

For large deployments, the management functions can be provided by a separate high-availability server cluster outside the IPU‑POD64. Please contact Graphcore for more details.

5.2. V-IPU software installation and configuration

Read the release notes for the V-IPU software release carefully before starting any software upgrade. The release notes are available from the Graphcore download portal as a separate download on the same page as the V-IPU software release itself.

The release notes give the following details of the release:

  • Software version numbers.

  • Compatibility changes that may need to be understood before upgrading the V-IPU software.

  • Details of any special upgrade handling for this specific release.

  • An overview of fixed problems.

  • An overview of remaining known issues with proposed workarounds, if any.

The software release bundle contains a set of server-resident software components.

V-IPU should be installed by a user with root privileges (that is, with sudo ./install.sh). The install script sets up the V-IPU controller to run as a service in the context of the root user. You may need to switch to the itadmin user to do this, since ipuuser does not have root access.

Note

Remember to install the V-IPU software on only one management server in the IPU‑POD128.

5.3. IPU-M2000 software installation and configuration

Read the release notes for the IPU-M2000 software release carefully before starting any software upgrade. The release notes are available from the Graphcore download portal as a separate download on the same page as the software release itself.

The release notes give the following details of the release:

  • Software sub-component version numbers.

  • Compatibility statements for Poplar SDK versions.

  • Compatibility changes from earlier releases that may need to be understood before upgrading the IPU-M2000s.

  • Details of any special upgrade handling for this specific release.

  • An overview of fixed problems.

  • An overview of remaining known issues with proposed workarounds, if any.

The IPU-M2000 software release bundle contains a set of upgradable software and FPGA sub-components that are executed on the IPU-M2000. The release also contains the rack_tool utility, which is used for the software upgrade and for other rack-related tasks targeting the IPU-M2000s.

The rack_tool upgrade command performs the software upgrade. See Section 5.3.2, Software update of all IPU-M2000s, for how to run it. The IPU-M2000 IPU-Gateway and ICU support booting from one of two persistent software images: the active image or the standby image. When updating the software, the system always updates the standby image, which is the one that is not running.

If an upgrade operation fails for one of the components, do not try to force a boot from the standby image(s) of the various CPU systems inside the IPU-M2000, since these images are now inconsistently upgraded.

The software install process cannot currently be run at the same time as ML jobs, because the install process reboots the IPU-M2000 once complete.

When the update of the standby image has completed successfully, the IPU‑POD64 is immediately instructed to switch to the updated standby image, making it the active one. The previously running image then becomes the standby image. The switch is made via a service-affecting reboot of each of the upgraded CPU/FPGA systems.

If you want to revert to the previous software version, the standby image can be upgraded to the previous version in the same way as described above.
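
The active/standby behaviour described above can be pictured with a small model. The following Python sketch is purely illustrative: the ImagePair class and the version labels are hypothetical and are not part of the IPU-M2000 software or rack_tool. It only captures the rules stated in this section: an upgrade writes to the standby image, a successful upgrade swaps the two images, and a failed upgrade leaves the active image untouched.

```python
from dataclasses import dataclass


@dataclass
class ImagePair:
    """Toy model of the two persistent images (hypothetical, for illustration)."""
    active: str   # version currently running
    standby: str  # version held in the image that is not running

    def upgrade(self, new_version: str, succeeded: bool = True) -> None:
        # The upgrade always writes to the standby image, never the running one.
        self.standby = new_version if succeeded else "inconsistent"
        if succeeded:
            # On success the images swap roles (done via a reboot on real hardware).
            self.active, self.standby = self.standby, self.active


pair = ImagePair(active="A", standby="B")
pair.upgrade("C")   # standby B is overwritten with C, then C becomes active
print(pair)         # ImagePair(active='C', standby='A')
pair.upgrade("A")   # reverting: install the previous version in the same way
print(pair)         # ImagePair(active='A', standby='C')
```

In this model a revert is just another upgrade, which matches the procedure described above: the previous version is written to the standby image and then activated.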

Note

Graphcore has qualified the IPU-M2000 software release only with exactly the documented set of software sub-component versions. Any other combination of software component versions is not guaranteed.

Note that there is an OpenBMC User Guide available.

5.3.1. Download IPU-M2000 software update bundle

The management server needs to be loaded with the correct IPU-M2000 software update bundle before the software update of the IPU-M2000s can be performed. To perform the download, follow these steps:

  1. Log in as ipuuser

  2. Go to the Graphcore download portal and download the latest release into the $HOME/IPU-M_releases directory

  3. Follow the install instructions in the release notes, and unpack the bundle: tar xvfz <tar-ball.tgz>

Unpacking the IPU-M2000 software onto the management server’s file system automatically creates a directory tree whose leading directory name contains the release version number. This allows several releases to be kept on the management server, in case there is a need to revert to a previous release on the IPU-M2000s. If you do not need to keep them, the older releases (both the unpacked files and the downloaded tar file) can be removed from the management server.
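
Before unpacking, it can be useful to confirm which leading directory a bundle will create. The sketch below uses Python's standard tarfile module to list the top-level entries of an archive without extracting it. The function name top_level_entries is illustrative only, and the tarball path is whatever file you downloaded.

```python
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tarball: str) -> list[str]:
    """Return the distinct first path components of all members of a tar archive."""
    with tarfile.open(tarball, "r:*") as archive:  # "r:*" auto-detects the compression
        names = archive.getnames()
    tops = set()
    for name in names:
        # Drop any "." component so "./dir/file" and "dir/file" are treated alike.
        parts = [part for part in PurePosixPath(name).parts if part != "."]
        if parts:
            tops.add(parts[0])
    return sorted(tops)
```

Calling top_level_entries on the downloaded tar file should return a single directory name if the bundle follows the layout described above.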

5.3.2. Software update of all IPU-M2000s

Having unpacked the software release onto the server file system, follow these steps:

  1. Change directory: cd $HOME/IPU-M_releases/IPU-M_SW_<release version>

  2. Run the commands:

$ virtualenv -p python3 venv
$ source venv/bin/activate
$ pip3 install -r requirements.txt
$ ./rack_tool.py upgrade

rack_tool will read a default config file to learn how to access the IPU-M2000s. The default location of this file is: $HOME/ipuuser/.rack_tool/rack_config.json.

This file can be edited by a site administrator who integrates the IPU‑POD64 into the site-specific network in cases where the default IPU‑POD64 IP address plan collides with the site-specific network.
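
Whether an address plan collides with the site network can be checked with Python's standard ipaddress module. The sketch below is illustrative only: the subnets are placeholder values, not the actual IPU‑POD64 defaults, and should be replaced with the ranges used in your rack_tool config file and in your site's network plan. The function name plans_collide is hypothetical.

```python
import ipaddress


def plans_collide(pod_subnet: str, site_subnet: str) -> bool:
    """Return True if the two networks share any addresses."""
    pod = ipaddress.ip_network(pod_subnet)
    site = ipaddress.ip_network(site_subnet)
    return pod.overlaps(site)


# Placeholder subnets for illustration only (not the real IPU-POD64 defaults).
print(plans_collide("10.1.0.0/16", "10.1.4.0/24"))     # True: the /24 lies inside the /16
print(plans_collide("10.1.0.0/16", "192.168.0.0/24"))  # False: the ranges are disjoint
```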

The upgrade process will take several minutes and all the IPU-M2000s will be upgraded in parallel to make this time as short as possible.

At the end of the upgrade process, a few reboots are performed in order to activate the new software.

Finally, rack_tool verifies that the upgrade has completed with all sub-components upgraded to the same version. Check that this version matches the one given in the release notes to confirm that the upgrade procedure has been followed correctly.
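
The final check against the release notes can be expressed as a simple comparison. The sketch below assumes you have already collected the reported and the expected versions into dictionaries by hand. The component names, version strings, and the function mismatched_components are hypothetical, and the code does not parse rack_tool output.

```python
def mismatched_components(reported: dict[str, str], expected: dict[str, str]) -> list[str]:
    """Return the names of sub-components that are missing or at the wrong version."""
    return sorted(
        name for name, version in expected.items() if reported.get(name) != version
    )


# Hypothetical values for illustration; take the real ones from rack_tool
# and from the release notes.
expected = {"component-a": "1.2.3", "component-b": "1.2.3", "component-c": "1.2.3"}
reported = {"component-a": "1.2.3", "component-b": "1.2.3", "component-c": "1.2.2"}
print(mismatched_components(reported, expected))  # ['component-c']
```

An empty list means every listed sub-component is at the version given in the release notes.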

5.3.3. IPU-M2000 IPU-Gateway root file system config files

The IPU‑POD64 and IPU-M2000 support a simple mechanism in which a set of IPU-M2000 IPU-Gateway OS config files residing on the management server is copied to the IPU-M2000 IPU-Gateway after an update. This feature is optional.

With multiple IPU‑POD64 racks installed, this overwriting (overlay) approach provides site-specific config control on top of the standard IPU-Gateway distribution, which always ships with default config files. The current implementation requires all IPU-M2000s to be identically configured.

Any config file in the Debian OS on the IPU-Gateway can be overwritten by these overlay files, which are maintained on the server.

These files reside under ipuuser at $HOME/.rack_tool/root-overlay/.

Note

Make sure the IP addresses in these files are correct and point to the management server’s IP address.

The IPU-M2000 IPU-Gateway upgrade itself is destructive in the sense that it will overwrite all executables and corresponding file systems for the standby image. The content of the overlay config files is maintained and stored persistently on the management server. A single central management server can function as a central repository for a site-wide setup of all IPU-M2000s.
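
To see which files an overlay would replace, the overlay directory can be walked and each file mapped to a path on the IPU-Gateway. The sketch below assumes that the overlay tree mirrors the layout of the root file system, so that a file at etc/example.conf under the overlay directory maps to /etc/example.conf. That mapping and the function name overlay_targets are assumptions made for illustration, not documented rack_tool behaviour.

```python
from pathlib import Path, PurePosixPath


def overlay_targets(overlay_root: Path) -> list[str]:
    """List the absolute paths that files under overlay_root would map to."""
    targets = []
    for path in sorted(overlay_root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(overlay_root)
            # Assumption: the overlay tree mirrors the root file system layout.
            targets.append(str(PurePosixPath("/", *relative.parts)))
    return targets


overlay = Path.home() / ".rack_tool" / "root-overlay"
if overlay.is_dir():
    for target in overlay_targets(overlay):
        print(target)
```

Reviewing this list before an update is a quick way to confirm that only the intended config files are present in the overlay.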

5.4. Rack tool

rack_tool is a utility that is supplied with the IPU-M2000 release bundle when installed onto a management server. It is always found under the account that is performing IPU resource management (the default is the ipuuser account), as ~/IPU-M_releases/IPU-M_<release-version>/rack_tool.py.

Information about rack_tool and how to use it can be found on the rack_tool man page.

A suggested naming scheme for IPU-M2000s is: lr1_ipum<n> (logical rack 1, IPU-M2000 #n).
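
As a small illustration of this scheme, the names for one logical rack can be generated as follows. The helper ipum_names is hypothetical. The default count of 16 assumes a fully populated IPU‑POD64 (64 IPUs at four IPUs per IPU-M2000), and numbering from 1 is also an assumption; adjust both to match your system.

```python
def ipum_names(rack: int = 1, count: int = 16) -> list[str]:
    """Generate names of the form lr<rack>_ipum<n>, with n starting at 1 (assumed)."""
    return [f"lr{rack}_ipum{n}" for n in range(1, count + 1)]


names = ipum_names()
print(names[0], names[-1])  # lr1_ipum1 lr1_ipum16
```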

rack_tool is used for the following:

  1. Querying the version of all IPU-M2000s

  2. Testing connectivity on all IPU-M2000 RDMA data-plane ports, and on the IPU-Gateway and BMC management-plane ports, across all IPU-M2000s that are listed in rack_tool’s default config file

  3. Restarting IPU-M2000 IPU-Gateway and BMC in different ways (power cycling or OS reboot)

  4. Controlling power on/off for the IPU-Gateway part of the IPU-M2000

  5. Running commands on several IPU-M2000s for troubleshooting

  6. Updating the “root overlay” files onto all IPU-M2000s if the NTP or syslog server has changed

  7. Running hardware and connectivity tests (see Section 6, IPU‑POD64 manual installation tests for the built-in test capabilities)

Note that the supported options will evolve over time, so please check the latest rack_tool documentation before using it.