System hardware testing
rack_tool
rack_tool is a utility that is supplied with the IPU-M2000 system software pack. It is always installed in the account that is performing IPU resource management.
rack_tool is used for the following:
- BIST: Performing hardware tests on one or more IPU-M2000s (see Running system self-tests for the built-in self-test capabilities)
- Querying the running software version of all IPU-M2000s
- Testing connectivity to the RNIC, GW and BMC ports on all IPU-M2000s listed in rack_tool's config file. This config file is set up by the IPU-POD DA install script
- Restarting the GW and BMC CPU systems of the IPU-M2000s in different ways (power cycling or OS reboot)
- Controlling power on/off for the GW part of the IPU-M2000(s)
- Running commands on several IPU-M2000s for troubleshooting
The supported options will evolve over time, so you should refer to the built-in help menu by running rack_tool --help. This also identifies the installed version; more detailed help for that version can be obtained from the rack_tool online documentation page. Select the version that your tool reports.
The baseline supported options are listed below.
Synopsis
rack_tool.py [--version] [--help] <command> [<args>]
    [--ipum <name bmc-ip gw-ip bmc-username bmc-password gw-username gw-password>]
    [--global-credentials <bmc-username bmc-password gw-username gw-password>]
rack_tool.py bist [--help]
rack_tool.py status [--help] [--no-color]
rack_tool.py run-command [--help] -c <command> -d <device>
    where <device> is: gw|bmc
rack_tool.py power-off [--help] [--hard]
rack_tool.py power-on [--help]
rack_tool.py power-cycle [--help]
Options
- --ipum, --global-credentials
  These options have to be provided after the command parameter.
- -v, --version
  Prints the version of the tool.
- -h, --help
  Prints the synopsis and a list of all available commands. --help can also be given after a command to show individual help for that command.
- --ipum <name bmc-ip gw-ip bmc-username bmc-password gw-username gw-password>
  Manually defines which machines to operate on instead of using a config file. Several machines can be selected by passing the --ipum option several times. Example:
  rack_tool.py upgrade --ipum machine1 10.1.1.1 10.1.1.2 root password itadmin password
- --global-credentials <bmc-username bmc-password gw-username gw-password>
  Sets a common set of login details for the machines selected with the --ipum option. If this option is used, the username and password parameters for the --ipum option can be omitted. Example:
  rack_tool.py upgrade --ipum machine1 10.1.1.1 10.1.1.2 --ipum machine2 10.1.2.1 10.1.2.2 --global-credentials root password itadmin password
Commands
upgrade | Start upgrade of all machines. NOTE: This should not be used directly from rack_tool. Instead, always use the IPU-POD DA install script.
bist | Run a built-in self-test per machine that checks that most components on the board are available and functional.
status | Show network connectivity status and software versions for IPU-M2000s in a rack.
install-key | Install the current user's public SSH key on all machines.
run-command | Run a command on a device on all machines.
power-off | Power off GW and IPUs on all machines.
power-on | Power on GW and IPUs on all machines.
power-cycle | Power cycle GW and IPUs on all machines.
Exit status
0 | Successful program execution
1 | Unsuccessful program execution
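A script that wraps rack_tool can rely on this exit status. The following is a minimal Python sketch, not part of the tool: run_and_check is a hypothetical helper, and the Python interpreter is used as a stand-in command so that the example runs on a machine without rack_tool installed.

```python
import subprocess
import sys
from typing import Sequence


def run_and_check(argv: Sequence[str]) -> bool:
    """Run a command and return True only if its exit status is 0."""
    completed = subprocess.run(argv, check=False)
    return completed.returncode == 0


# Stand-in commands that exit with status 0 and 1. In practice argv
# would be a rack_tool invocation such as ["rack_tool.py", "status"].
ok = run_and_check([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_and_check([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)
```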
Files and directories
rack_config.json file format
rack_config.json is a JSON file that rack_tool uses to determine how to connect to all the machines in a rack. It consists of the objects global_credentials and machines.
global_credentials
This is an object that holds the login details of the BMC and GW. The object has the following key/value pairs:
"global_credentials": {
    "bmc_username": "<username>",
    "bmc_passwd": "<password>",
    "gw_username": "<username>",
    "gw_passwd": "<password>"
}
machines
This is an array of machine objects that holds information about each machine in the rack. Each machine object consists of the following key/value pairs (listed here in reverse physical order):
"machines": [
    { "name": "8204721-0065", "gw_ip": "10.44.44.75", "bmc_ip": "10.44.44.74", "mx_ip": "10.44.44.253" },
    { "name": "8204721-0092", "gw_ip": "10.44.44.62", "bmc_ip": "10.44.44.61", "mx_ip": "10.44.44.237" },
    { "name": "8204721-0084", "gw_ip": "10.44.44.99", "bmc_ip": "10.44.44.98", "mx_ip": "10.44.44.245" },
    { "name": "8204721-0071", "gw_ip": "10.44.44.93", "bmc_ip": "10.44.44.92", "mx_ip": "10.44.44.226" }
]
Note
The machines section of this JSON file for an IPU-POD16 DA system should look as shown in the text above. An IPU-POD4 DA system should only include the first IPU-M2000 entry. The IP addresses will be filled in by the IPU-POD DA install script as they are controlled by the admin user.
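As an illustration of the file format described above, the following Python sketch checks that a parsed configuration contains the documented keys and that the address fields parse as IP addresses. find_config_problems is a hypothetical helper and not part of rack_tool. Treating every listed key as required is an assumption made for this sketch; the tool itself may be more lenient.

```python
import ipaddress
from typing import Any, Dict, List

# Key names taken from the rack_config.json description above.
CREDENTIAL_KEYS = ("bmc_username", "bmc_passwd", "gw_username", "gw_passwd")
MACHINE_KEYS = ("name", "gw_ip", "bmc_ip", "mx_ip")


def find_config_problems(config: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems (empty if none found)."""
    problems: List[str] = []
    credentials = config.get("global_credentials", {})
    for key in CREDENTIAL_KEYS:
        if key not in credentials:
            problems.append(f"global_credentials: missing {key}")
    for index, machine in enumerate(config.get("machines", [])):
        for key in MACHINE_KEYS:
            if key not in machine:
                problems.append(f"machines[{index}]: missing {key}")
            elif key.endswith("_ip"):
                try:
                    ipaddress.ip_address(machine[key])
                except ValueError:
                    problems.append(f"machines[{index}]: {key} is not an IP address")
    return problems


# One-machine configuration, as described for an IPU-POD4 DA system.
# The credential values are the placeholders used in this section.
pod4_config = {
    "global_credentials": {
        "bmc_username": "<username>", "bmc_passwd": "<password>",
        "gw_username": "<username>", "gw_passwd": "<password>",
    },
    "machines": [
        {"name": "8204721-0065", "gw_ip": "10.44.44.75",
         "bmc_ip": "10.44.44.74", "mx_ip": "10.44.44.253"},
    ],
}
print(find_config_problems(pod4_config))
```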
~/.rack_tool/
This directory is the default location for configuration files.
~/.rack_tool/rack_config.json
This is the rack configuration file for the rack.
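For scripts that need to read the same file, a minimal Python sketch follows. load_rack_config and machine_names are hypothetical helpers, not part of rack_tool. The demonstration writes a temporary file with a single machine name so that it runs without an installed configuration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional

# Default location described above.
DEFAULT_CONFIG = Path.home() / ".rack_tool" / "rack_config.json"


def load_rack_config(path: Optional[Path] = None) -> Dict[str, Any]:
    """Parse the rack configuration, using the default path if none is given."""
    config_path = path if path is not None else DEFAULT_CONFIG
    with config_path.open(encoding="utf-8") as handle:
        return json.load(handle)


def machine_names(config: Dict[str, Any]) -> List[str]:
    """Return the name of every machine listed in the configuration."""
    return [machine["name"] for machine in config.get("machines", [])]


# Demonstration with a temporary file instead of the real configuration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "rack_config.json"
    sample.write_text(
        json.dumps({"machines": [{"name": "8204721-0065"}]}), encoding="utf-8"
    )
    names = machine_names(load_rack_config(sample))
print(names)
```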
Running system self-tests
There are two parts to testing the IPU-POD DA system: running built-in hardware self-tests (BIST) and checking the system using the V-IPU management tool. The commands for doing so are:
$ rack_tool bist
performs BMC BIST hardware testing
$ sudo ~/IPU_M_SW-<version>/direct-attach install
performs V-IPU self-tests again as part of the install
Warning
Running any of these commands will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.
Troubleshooting
This section contains useful information about what to do if you encounter problems while installing and testing the rack. If you can’t find the answer to your query here and are still experiencing problems, then contact your Graphcore representative or use the resources on the Graphcore support portal: https://www.graphcore.ai/support.
$ ./rack_tool.py bist
This test will generate a very low-level hardware verification report/log that will need to be analyzed by Graphcore support if any errors are reported. The logs are located at ./maintenance_tools/logs relative to the directory from which the command is executed.
The command prints “Done BIST on …” if the test is successful.
The command prints “Failed BIST on …” if the test fails.
In both cases, the command points to the name of the generated log.
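When BIST is run across many machines, the two messages above can be counted from captured output. The following Python sketch is illustrative only: summarize_bist_output is a hypothetical helper, and only the phrases “Done BIST on” and “Failed BIST on” come from this section. The rest of each sample line is a placeholder, since the full message format is not given here.

```python
from typing import Dict, Iterable


def summarize_bist_output(lines: Iterable[str]) -> Dict[str, int]:
    """Count lines that report a completed or a failed BIST run."""
    summary = {"done": 0, "failed": 0}
    for line in lines:
        if "Done BIST on" in line:
            summary["done"] += 1
        elif "Failed BIST on" in line:
            summary["failed"] += 1
    return summary


# Placeholder lines; real output contains more detail, including the log name.
sample_lines = [
    "Done BIST on <machine>",
    "Failed BIST on <machine>",
    "some unrelated line",
]
summary = summarize_bist_output(sample_lines)
print(summary)
```

Any failure counted this way still requires the referenced log to be sent to Graphcore support for analysis.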
Warning
Running this command will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing this test.
V-IPU cluster tests
The cluster tests check that the cables are properly inserted, that the cabling is correct with respect to the expected topology, and perform traffic tests on the links in the topology.
V-IPU-based cluster tests are only supported as an integral test step when running the IPU-POD DA install script. If this install script is run again the previous configuration input will be retained and the script will allow a new cluster test to be executed.
Warning
Running these tests will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.