7. System hardware testing

7.1. rack_tool

rack_tool is a utility supplied with the Bow-2000 system software pack. It is always installed in the account that performs IPU resource management (by default, the ipuuser account) as ~/IPU-M_releases/IPU-M_<release-version>/rack_tool.py.

rack_tool is used for the following:

  1. Running built-in self-tests (BIST) on one or more Bow-2000s (see Running system self-tests for the built-in self-test capabilities)

  2. Querying the running software version of all Bow-2000s

  3. Testing connectivity to the RNIC, IPU-Gateway and BMC ports on all Bow-2000s listed in rack_tool’s config file. This config file is set up by the Bow Pod DA install script

  4. Restarting the IPU-Gateway and BMC CPU systems of a Bow-2000 in different ways (power cycling or OS reboot)

  5. Controlling power on/off for the IPU-Gateway part of the Bow-2000s

  6. Running commands on several Bow-2000s for troubleshooting

  7. Updating the “root overlay” files on all Bow-2000s if the NTP or syslog server has changed

The supported options will evolve over time, so refer to the built-in help by running rack_tool --help. This helps identify the installed version. More detailed help for that version is available on the rack_tool online documentation page; select the version that your tool reports.

7.2. Running system self-tests

There are two parts to testing the Bow Pod DA system: running built-in hardware self-tests (BIST) and checking the system using the V-IPU management tool. The commands for doing so are:

$ rack_tool bist

  • performs BMC BIST hardware testing

$ sudo ~/IPU_M_SW-<version>/direct-attach-setup install

  • performs V-IPU self-tests again as part of the install

Warning

Running any of these commands will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.

7.3. Troubleshooting

This section contains useful information about what to do if you encounter problems while installing and testing the rack. If you can’t find the answer to your query here and are still experiencing problems, then contact your Graphcore representative or use the resources on the Graphcore support portal: https://www.graphcore.ai/support.

7.3.1. BMC built-in hardware test

$ ./rack_tool.py BIST

This test generates a very low-level hardware verification report/log that needs to be analyzed by Graphcore support if any errors are reported. The logs are located at ./maintenance_tools/logs, relative to the directory from which the command is executed.
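When collecting a log to send to Graphcore support, it can help to pick out the newest file in the log directory. The following is a minimal Python sketch under stated assumptions: only the ./maintenance_tools/logs location comes from this section, while the helper name latest_bist_log and the idea of selecting by modification time are illustrative and not part of rack_tool.

```python
from pathlib import Path
from typing import Optional


def latest_bist_log(run_dir: str = ".") -> Optional[Path]:
    """Return the most recently modified file in <run_dir>/maintenance_tools/logs.

    run_dir is the directory from which the BIST command was executed.
    Returns None if the log directory is missing or contains no files.
    Hypothetical helper: log file naming is not assumed, only the directory.
    """
    log_dir = Path(run_dir) / "maintenance_tools" / "logs"
    if not log_dir.is_dir():
        return None
    files = [p for p in log_dir.iterdir() if p.is_file()]
    if not files:
        return None
    # Pick the file with the newest modification timestamp.
    return max(files, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    log = latest_bist_log()
    print(log if log else "No logs found under ./maintenance_tools/logs")
```

Run the script from the same directory in which the BIST command was executed, because the log path is relative to that directory.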

  • The command reports “Done BIST on …” if the test is successful.

  • The command reports “Failed BIST on …” if the test fails.

  • In both cases, the command points to the name of the log that was generated.
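If the BIST output is captured to a file or a variable, the two result messages above can be separated mechanically. The following Python sketch assumes only the “Done BIST on” and “Failed BIST on” prefixes quoted in this section; the text that follows each prefix is treated as opaque because its exact format is not specified here, and the function name classify_bist_output is hypothetical.

```python
from typing import Dict, List

# Message prefixes as quoted in this section; everything after them is opaque.
DONE_PREFIX = "Done BIST on"
FAILED_PREFIX = "Failed BIST on"


def classify_bist_output(output: str) -> Dict[str, List[str]]:
    """Split captured BIST output into successful and failed result lines.

    Returns a dict with keys "done" and "failed". Each value lists the text
    that followed the corresponding prefix on a matching line.
    """
    results: Dict[str, List[str]] = {"done": [], "failed": []}
    for line in output.splitlines():
        for key, prefix in (("done", DONE_PREFIX), ("failed", FAILED_PREFIX)):
            idx = line.find(prefix)
            if idx != -1:
                # Keep the remainder of the line unparsed.
                results[key].append(line[idx + len(prefix):].strip())
    return results


if __name__ == "__main__":
    import sys

    # Read captured output from standard input, for example a saved console log.
    summary = classify_bist_output(sys.stdin.read())
    print(f"successful: {len(summary['done'])}, failed: {len(summary['failed'])}")
    for entry in summary["failed"]:
        print("FAILED:", entry)
```

Any entry under "failed" indicates that the corresponding log should be passed to Graphcore support for analysis.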

Warning

Running this command will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing this test.

7.3.2. V-IPU cluster tests

The cluster tests check that the cables are properly inserted and that the cabling is correct with respect to the expected topology, and perform traffic tests on the links in the topology.

V-IPU-based cluster tests are only supported as an integral test step when running the Bow Pod DA install script. If this install script is run again, the previous configuration input is retained and the script allows a new cluster test to be executed.

Warning

Running these tests will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.