System hardware testing

rack_tool

rack_tool is a utility that is supplied with the IPU-M2000 system software pack. It is always installed in the account that is performing IPU resource management.

rack_tool is used for the following:

  1. Performing built-in self-tests (BIST) on one or more IPU-M2000s (see Running system self-tests for the built-in self-test capabilities).

  2. Querying the running software version of all IPU-M2000s.

  3. Testing connectivity to the RNIC, GW and BMC ports on all IPU-M2000s listed in rack_tool’s config file. This config file is set up by the IPU-POD DA install script.

  4. Restarting the GW and BMC CPU systems of an IPU-M2000 in different ways (power cycling or OS reboot).

  5. Controlling power on/off for the GW part of the IPU-M2000(s).

  6. Running commands on several IPU-M2000s for troubleshooting.

The supported options will evolve over time, so refer to the built-in help by running rack_tool --help. The help output also identifies the installed version. More detailed help for that version is available from the rack_tool online documentation page, where you should select the version that your tool reports. The baseline supported options are listed below.

Synopsis

rack_tool.py [--version] [--help] <command> [<args>]
    [--ipum <name bmc-ip gw-ip bmc-username bmc-password gw-username gw-password>]
    [--global-credentials <bmc-username bmc-password gw-username gw-password>]

rack_tool.py bist [--help]

rack_tool.py status [--help] [--no-color]

rack_tool.py run-command [--help] -c <command> -d <device>
    where <device> is: gw|bmc

rack_tool.py power-off [--help] [--hard]

rack_tool.py power-on [--help]

rack_tool.py power-cycle [--help]

Options

--ipum, --global-credentials

These options have to be provided after the command parameter.

-v, --version

Print the version of the tool

-h, --help

Print the synopsis and a list of all available commands. --help can also be given after a command to show the help for that command.

--ipum <name bmc-ip gw-ip bmc-username bmc-password gw-username gw-password>

Manually define which machines to operate on, instead of using a config file. Several machines can be selected by passing the --ipum option several times. Example: rack_tool.py upgrade --ipum machine1 10.1.1.1 10.1.1.2 root password itadmin password

--global-credentials <bmc-username bmc-password gw-username gw-password>

Set a common set of login details for the machines selected with the --ipum option. If this option is used, the username and password parameters of the --ipum option can be omitted. Example: rack_tool.py upgrade --ipum machine1 10.1.1.1 10.1.1.2 --ipum machine2 10.1.2.1 10.1.2.2 --global-credentials root password itadmin password
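
As a minimal sketch of how these two options combine, the following Python helper assembles an argument list with --ipum and --global-credentials placed after the command, as required above. The helper build_rack_tool_args and the Machine tuple are hypothetical names introduced for illustration and are not part of rack_tool. The addresses and credentials are the placeholder values from the examples above.

```python
from typing import List, NamedTuple, Sequence


class Machine(NamedTuple):
    """Hypothetical container for one --ipum entry (not part of rack_tool)."""
    name: str
    bmc_ip: str
    gw_ip: str


def build_rack_tool_args(
    command: str,
    machines: Sequence[Machine],
    global_credentials: Sequence[str],
) -> List[str]:
    """Build an argv list with the options placed after the command."""
    if len(global_credentials) != 4:
        raise ValueError(
            "expected: bmc-username bmc-password gw-username gw-password"
        )
    args = ["rack_tool.py", command]
    for machine in machines:
        # One --ipum option per machine. Per-machine credentials are omitted
        # because they are supplied once through --global-credentials.
        args += ["--ipum", machine.name, machine.bmc_ip, machine.gw_ip]
    args += ["--global-credentials", *global_credentials]
    return args


example_args = build_rack_tool_args(
    "status",
    [
        Machine("machine1", "10.1.1.1", "10.1.1.2"),
        Machine("machine2", "10.1.2.1", "10.1.2.2"),
    ],
    ("root", "password", "itadmin", "password"),
)
print(" ".join(example_args))
```

The resulting list can be passed to a process launcher as is. Building a list rather than a single string avoids shell quoting problems if a password contains spaces or special characters.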

Commands

upgrade

Start an upgrade of all machines. NOTE: Do not run this command directly from rack_tool. Instead, always use the IPU-POD DA install script.

bist

Run a built-in self-test on each machine, checking that most components on the board are available and functional

status

Show network connectivity status and software versions for IPU-M2000s in a rack

install-key

Install the current user's public SSH key on all machines

run-command

Run a command on a device on all machines

power-off

Power off GW and IPUs on all machines

power-on

Power on GW and IPUs on all machines

power-cycle

Power cycle GW and IPUs on all machines

Exit status

0

Successful program execution

1

Unsuccessful program execution
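
A wrapper script can rely on these two exit codes to decide whether to continue. The following is a minimal Python sketch. The function run_and_check is a hypothetical helper, and the example uses the Python interpreter as a stand-in command so that it runs without rack_tool installed.

```python
import subprocess
import sys
from typing import Sequence


def run_and_check(argv: Sequence[str]) -> bool:
    """Run a command and return True only if it exits with status 0."""
    completed = subprocess.run(argv)
    return completed.returncode == 0


# Stand-in commands that exit with 0 and 1 respectively. In practice argv
# would be a rack_tool invocation such as ["rack_tool.py", "status"].
ok = run_and_check([sys.executable, "-c", "raise SystemExit(0)"])
failed = run_and_check([sys.executable, "-c", "raise SystemExit(1)"])
print(ok, failed)  # prints: True False
```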

Files and directories

rack_config.json file format

rack_config.json is a JSON file that rack_tool uses to determine how to connect to all the machines in a rack. It consists of the objects global_credentials and machines.

global_credentials

This is an object that holds the login details of the BMC and GW. The object has the following key/value pairs:

"global_credentials": {
    "bmc_username": "<username>",
    "bmc_passwd": "<password>",
    "gw_username": "<username>",
    "gw_passwd": "<password>"
}

machines

This is an array of machine objects that holds information about each machine in the rack. Each machine object consists of the following key/value pairs (listed here in reverse physical order):

"machines": [
    {
        "name": "8204721-0065",
        "gw_ip": "10.44.44.75",
        "bmc_ip": "10.44.44.74",
        "mx_ip": "10.44.44.253"
    },
    {
        "name": "8204721-0092",
        "gw_ip": "10.44.44.62",
        "bmc_ip": "10.44.44.61",
        "mx_ip": "10.44.44.237"
    },
    {
        "name": "8204721-0084",
        "gw_ip": "10.44.44.99",
        "bmc_ip": "10.44.44.98",
        "mx_ip": "10.44.44.245"
    },
    {
        "name": "8204721-0071",
        "gw_ip": "10.44.44.93",
        "bmc_ip": "10.44.44.92",
        "mx_ip": "10.44.44.226"
    }
]

Note

The machines section of this JSON file for an IPU‑POD16 DA system should look like the example above. For an IPU‑POD4 DA system, it should include only the first IPU-M2000 entry. The IP addresses are filled in by the IPU-POD DA install script, as they are controlled by the admin user.
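
When troubleshooting connection problems, it can help to sanity-check the configuration file before running rack_tool. The following Python sketch checks a parsed configuration for the keys shown above. It assumes that global_credentials and machines are top-level keys of a single JSON object and that every key shown above is required; neither assumption is confirmed by rack_tool itself. The function check_rack_config is a hypothetical helper.

```python
import ipaddress
import json

REQUIRED_CREDENTIAL_KEYS = {"bmc_username", "bmc_passwd", "gw_username", "gw_passwd"}
REQUIRED_MACHINE_KEYS = {"name", "gw_ip", "bmc_ip", "mx_ip"}
IP_KEYS = ("gw_ip", "bmc_ip", "mx_ip")


def check_rack_config(config: dict) -> list:
    """Return a list of problems found in a parsed rack_config.json."""
    problems = []
    credentials = config.get("global_credentials", {})
    missing = REQUIRED_CREDENTIAL_KEYS - set(credentials)
    if missing:
        problems.append(f"global_credentials is missing: {sorted(missing)}")
    machines = config.get("machines", [])
    if not machines:
        problems.append("machines is empty or absent")
    for index, machine in enumerate(machines):
        missing = REQUIRED_MACHINE_KEYS - set(machine)
        if missing:
            problems.append(f"machines[{index}] is missing: {sorted(missing)}")
        for key in IP_KEYS:
            if key not in machine:
                continue
            try:
                # Raises ValueError if the string is not an IPv4/IPv6 address.
                ipaddress.ip_address(machine[key])
            except ValueError:
                problems.append(f"machines[{index}].{key} is not a valid IP address")
    return problems


# Single-machine configuration, using the first entry from the example above.
sample = json.loads("""
{
    "global_credentials": {
        "bmc_username": "<username>",
        "bmc_passwd": "<password>",
        "gw_username": "<username>",
        "gw_passwd": "<password>"
    },
    "machines": [
        {"name": "8204721-0065", "gw_ip": "10.44.44.75",
         "bmc_ip": "10.44.44.74", "mx_ip": "10.44.44.253"}
    ]
}
""")
print(check_rack_config(sample))  # prints: []
```

To check the real file, load it with json.load from the default location and pass the result to the same function.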

~/.rack_tool/

  • This directory is the default location for configuration files.

~/.rack_tool/rack_config.json

  • This is the rack configuration file for the rack.

Running system self-tests

There are two parts to testing the IPU-POD DA system: running built-in hardware self-tests (BIST) and checking the system using the V-IPU management tool. The commands for doing so are:

$ rack_tool bist

  • performs BMC BIST hardware testing

$ sudo ~/IPU_M_SW-<version>/direct-attach install

  • performs V-IPU self-tests again as part of the install

Warning

Running any of these commands will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.

Troubleshooting

This section contains useful information about what to do if you encounter problems while installing and testing the rack. If you can’t find the answer to your query here and are still experiencing problems, then contact your Graphcore representative or use the resources on the Graphcore support portal: https://www.graphcore.ai/support.

$ ./rack_tool.py bist

This test generates a very low-level hardware verification report/log, which needs to be analyzed by Graphcore support if any errors are reported. The logs are written to ./maintenance_tools/logs, relative to the directory from which the command is executed.

  • The command prints “Done BIST on …” if the test is successful.

  • The command prints “Failed BIST on …” if the test fails.

  • In both cases, the command points to the name of the log that was generated.
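
When collecting logs for Graphcore support, it can be convenient to list the most recent files in the log directory. The following Python sketch does this for the maintenance_tools/logs directory described above. The function newest_bist_logs is a hypothetical helper, and the file names used in the demonstration are invented, because the naming scheme of the real logs is not specified here.

```python
import os
import tempfile
from pathlib import Path


def newest_bist_logs(working_dir, limit=5):
    """Return up to `limit` files under maintenance_tools/logs, newest first."""
    log_dir = Path(working_dir) / "maintenance_tools" / "logs"
    if not log_dir.is_dir():
        return []
    files = [path for path in log_dir.iterdir() if path.is_file()]
    files.sort(key=lambda path: path.stat().st_mtime, reverse=True)
    return files[:limit]


# Demonstration in a temporary directory with two invented log files.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "maintenance_tools" / "logs"
    log_dir.mkdir(parents=True)
    (log_dir / "older.log").write_text("placeholder")
    (log_dir / "newer.log").write_text("placeholder")
    # Force an old modification time so the ordering is unambiguous.
    os.utime(log_dir / "older.log", (1, 1))
    names = [path.name for path in newest_bist_logs(tmp)]
print(names)  # prints: ['newer.log', 'older.log']
```

In normal use, pass the directory from which rack_tool was executed, for example newest_bist_logs(".").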

Warning

Running this command will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing this test.

V-IPU cluster tests

The cluster tests check that the cables are properly inserted, that the cabling is correct with respect to the expected topology, and perform traffic tests on the links in the topology.

V-IPU-based cluster tests are only supported as an integral test step when running the IPU-POD DA install script. If this install script is run again, the previous configuration input is retained and the script allows a new cluster test to be executed.

Warning

Running these tests will affect any ML jobs that are running and will cause undefined behaviour. Therefore Poplar workloads should NOT be running when performing these tests.