7. System hardware testing
7.1. rack_tool
rack_tool is a utility supplied with the Bow-2000 system software pack. It is always installed in the account that performs IPU resource management (the default is the ipuuser account) as ~/IPU-M_releases/IPU-M_<release-version>/rack_tool.py.
rack_tool is used for the following:
BIST: Performing hardware tests on one or more Bow-2000s (see Running system self-tests for the built-in self-test capabilities)
Querying the running software version of all Bow-2000s
Testing connectivity to the RNIC, IPU-Gateway and BMC ports on all Bow-2000s listed in rack_tool’s config file. This config file is set up by the Bow Pod DA install script
Restarting the Bow-2000’s IPU-Gateway and BMC CPU systems in different ways (power cycling or OS reboot)
Controlling power on/off for the IPU-Gateway part of the Bow-2000s
Running commands on several Bow-2000s for troubleshooting
Updating the “root overlay” files on all Bow-2000s if the NTP or syslog server has changed
The supported options will evolve over time, so refer to the official help menu by running rack_tool --help. This also identifies the installed version; more detailed help for that version is available from the rack_tool online documentation page. Select the version that your tool reports.
7.2. Running system self-tests
There are two parts to testing the Bow Pod DA system: running built-in hardware self-tests (BIST) and checking the system using the V-IPU management tool. The commands for doing so are:
$ rack_tool bist
Performs the BMC BIST hardware tests.
$ sudo ~/IPU_M_SW-<version>/direct-attach-setup install
Runs the V-IPU self-tests again as part of the install.
Warning
Running any of these commands will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.
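The warning above can be enforced with a small wrapper that refuses to start the BIST until the operator confirms that no Poplar workloads are running. The following is a minimal sketch, not part of rack_tool: the function names `confirmed` and `run_bist` are hypothetical, the only command used is the `rack_tool bist` invocation shown above, and the sketch assumes `rack_tool` is reachable on the PATH. The meaning of the tool's exit code is not documented here, so the wrapper simply passes it through.

```python
import subprocess
from typing import Callable

# Command taken from the text above; adjust the path if rack_tool is not on PATH.
BIST_COMMAND = ["rack_tool", "bist"]


def confirmed(ask: Callable[[str], str] = input) -> bool:
    """Return True only if the operator explicitly answers 'yes'."""
    answer = ask("Are all Poplar workloads stopped? Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"


def run_bist(ask: Callable[[str], str] = input, runner=subprocess.run) -> int:
    """Run the BIST after confirmation and return its exit code (1 if declined)."""
    if not confirmed(ask):
        print("Aborted: BIST not started.")
        return 1
    # check=False: a non-zero exit code is returned to the caller, not raised.
    completed = runner(BIST_COMMAND, check=False)
    return completed.returncode
```

Calling `run_bist()` from an interactive Python session prompts once and then runs the test. The `ask` and `runner` parameters exist only so that the prompt and the command execution can be replaced when trying the wrapper without hardware.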
7.3. Troubleshooting
This section contains useful information about what to do if you encounter problems while installing and testing the rack. If you can’t find the answer to your query here and are still experiencing problems, then contact your Graphcore representative or use the resources on the Graphcore support portal: https://www.graphcore.ai/support.
7.3.1. BMC built-in hardware test
$ ./rack_tool.py BIST
This test generates a very low-level hardware verification report/log, which needs to be analyzed by Graphcore support if any errors are reported. The logs are located at ./maintenance_tools/logs relative to the directory from which the command is executed.
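To find the most recent logs before passing them to Graphcore support, the log directory can be listed by modification time. This is a minimal sketch that assumes only the ./maintenance_tools/logs location given above; the log file naming scheme is not described here, so the helper (a hypothetical name, `newest_logs`) returns every regular file regardless of name.

```python
from pathlib import Path
from typing import List

# Directory named in the text, relative to where rack_tool was executed.
LOG_DIR = Path("maintenance_tools") / "logs"


def newest_logs(log_dir: Path = LOG_DIR, limit: int = 5) -> List[Path]:
    """Return up to `limit` regular files in log_dir, most recently modified first."""
    if not log_dir.is_dir():
        # No BIST has been run from this directory yet, or the path is wrong.
        return []
    files = [p for p in log_dir.iterdir() if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]
```

Run it from the same directory in which the BIST command was executed, for example `for path in newest_logs(): print(path)`.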
The command reports “Done BIST on …” if the test is successful.
The command reports “Failed BIST on …” if the test fails.
In both cases, the command points to the name of the log it generated.
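When several Bow-2000s are tested at once, it can help to sort the captured command output into passes and failures. The sketch below relies only on the two message prefixes quoted above; everything following "on" is treated as opaque text because its exact format is not specified here. The function name `classify_bist_output` and the unit names in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List

# Message prefixes quoted in the text above.
DONE_PREFIX = "Done BIST on"
FAILED_PREFIX = "Failed BIST on"


def classify_bist_output(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group output lines into 'done' and 'failed'; other lines are ignored."""
    results: Dict[str, List[str]] = {"done": [], "failed": []}
    for line in lines:
        text = line.strip()
        # Substring match, in case the tool prefixes lines with extra text.
        if DONE_PREFIX in text:
            results["done"].append(text)
        elif FAILED_PREFIX in text:
            results["failed"].append(text)
    return results
```

For example, `classify_bist_output(["Done BIST on unit-a", "Failed BIST on unit-b"])` places the first line under `"done"` and the second under `"failed"`. Any entry under `"failed"` means the corresponding log should be sent to Graphcore support.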
Warning
Running this command will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing this test.
7.3.2. V-IPU cluster tests
The cluster tests check that the cables are properly inserted and that the cabling is correct with respect to the expected topology, and perform traffic tests on the links in the topology.
V-IPU-based cluster tests are only supported as an integral test step of the Bow Pod DA install script. If the install script is run again, it retains the previous configuration input and allows a new cluster test to be executed.
Warning
Running these tests will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.