6. IPU‑POD64 manual installation tests

You will need to run these tests on each IPU‑POD64.

BMC management and V-IPU management are responsible for testing different parts of the system.

BMC has support for chassis management, which means that it can verify correct hardware behaviour on most functional blocks within the chassis.

V-IPU management has support for connectivity tests. It focuses on verifying that the correct cables are used and that the IPU-Links and GW-Links are cabled correctly, and on verifying the cabled IPU-Link network.

In combination, these two sets of built-in self-tests (BISTs) cover most of the needs for system installation verification.

6.1. Running system tests

The rack_tool utility is included as part of the IPU-M2000 software release bundle. See the rack_tool man page for details on how to use rack_tool before running these tests. This is especially important if the tests are to be run on a system that has active users.

  • rack_tool bist: performs BMC chassis hardware testing

  • rack_tool vipu-test: performs V-IPU connectivity related tests

6.2. Troubleshooting

This section describes what to do if you encounter problems while installing and testing the rack. If you can’t find the answer to your query here and are still experiencing problems, please contact your Graphcore representative or use the resources on the Graphcore support portal.

6.2.1. BMC BISTs

To run the BMC BIST tests, run the following command:

$ ./rack_tool.py bist

This test generates a very low-level hardware verification report. If any errors are reported, the report will need to be analysed by Graphcore support. The logs are located in “./logs”, relative to the directory in which the command is executed.

  • The command reports “Done BIST on …” if the test is successful.

  • The command reports “Failed BIST on …” if the test fails.

  • The command will point to the log file name generated in both cases.
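The pass and fail messages above can be checked from a script. The following is a minimal sketch, assuming that the “Done BIST on …” and “Failed BIST on …” messages appear verbatim in the console output of the command. The function name classify_bist_output is hypothetical and is not part of rack_tool.

```shell
# classify_bist_output (hypothetical helper, not part of rack_tool):
# reads BIST console output on stdin and prints PASS, FAIL or UNKNOWN.
# Assumption: the messages "Done BIST on" / "Failed BIST on" appear verbatim.
classify_bist_output() {
    bist_output="$(cat)"
    case "$bist_output" in
        *"Failed BIST on"*)
            # Any failure wins, even if other units reported success.
            echo "FAIL"
            return 1
            ;;
        *"Done BIST on"*)
            echo "PASS"
            return 0
            ;;
        *)
            # Neither message was found; inspect the files in ./logs manually.
            echo "UNKNOWN"
            return 2
            ;;
    esac
}

# Example use (commented out so that the sketch has no side effects):
#   ./rack_tool.py bist 2>&1 | classify_bist_output
```

A FAIL or UNKNOWN verdict only indicates that the log file named by the command should be inspected or passed to Graphcore support; it does not replace the report itself.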

6.2.2. V-IPU built-in self-tests

To run the V-IPU connectivity tests, run the following command:

$ ./rack_tool.py vipu-test

The following section is based on excerpts from the V-IPU Admin Guide, which should be consulted for a detailed and up-to-date overview of the BISTs. The V-IPU User Guide is also useful. The collection of V-IPU connectivity tests can be invoked with the ./rack_tool.py vipu-test command or by using V-IPU CLI commands directly, as described below.

The V-IPU Controller implements a cluster testing suite that runs a series of tests to verify installation correctness. A V-IPU cluster test can be executed against a cluster entity before any partitions are created. It is strongly recommended to run all the test types provided by the cluster testing suite before deploying any applications in a cluster.

Assume we have created a cluster named “cluster1” formed by four IPU-M2000s using the command:

$ vipu-admin create cluster cluster1 --agents ipum1,ipum2,ipum3,ipum4 --mesh

The simplest way to run a complete cluster test for this cluster is to run ./rack_tool.py vipu-test. The test performs the V-IPU self-tests shown below.

$ vipu-admin test cluster cluster1
Showing test results for cluster ``cluster1``:
+---------------+----------+--------+---------------------------------------+
|   Test Type   | Duration | Passed |                Summary                |
+===============+==========+========+=======================================+
| Version       | 0.00s    | 4/4    | All component versions are consistent |
+---------------+----------+--------+---------------------------------------+
| Cabling       | 8.76s    | 4/4    | All cables connected as expected      |
+---------------+----------+--------+---------------------------------------+
| Sync-Link     | 0.35s    | 8/8    | Sync Link test passed                 |
+---------------+----------+--------+---------------------------------------+
| Link-Training | 20.16s   | 76/76  | All Links Passed                      |
+---------------+----------+--------+---------------------------------------+
| Traffic       | 42.00s   | 1/1    | Traffic test passed                   |
+---------------+----------+--------+---------------------------------------+
| GW-Link       | 0.00s    | 0/0    | GW-Link test skipped                  |
+---------------+----------+--------+---------------------------------------+

The output above shows a successful test with no errors reported.

As the test results show, six test types are listed for “cluster1”: five were executed and one (GW-Link) was skipped. The results for each test type are printed one per line in the output. Each test type tested zero or more elements of the cluster, as can be seen from the “Passed” column. Each test type is explained in detail in the rest of this section.
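The “Passed” column can also be checked mechanically. The following is a minimal sketch, assuming the table layout shown in the sample output above. The function name check_vipu_results is hypothetical and is not part of rack_tool or the V-IPU CLI. Treating a 0/0 entry as a skipped test is an assumption based on the GW-Link row in the sample.

```shell
# check_vipu_results (hypothetical helper): reads a cluster test results
# table on stdin and prints one verdict per test type.
# Assumption: rows look like "| Name | Duration | n/m | Summary |".
check_vipu_results() {
    awk -F'|' '
        # Only data rows have a Passed cell of the form n/m in field 4;
        # border rows and the header row are ignored.
        NF >= 5 && $4 ~ /[0-9]+\/[0-9]+/ {
            name = $2; gsub(/^ +| +$/, "", name)
            cell = $4; gsub(/ /, "", cell)
            split(cell, parts, "/")
            if (parts[1] != parts[2]) {
                printf "FAIL %s (%s)\n", name, cell
                failed = 1
            } else if (parts[2] == 0) {
                # Assumption: 0/0 means the test type was skipped.
                printf "SKIP %s\n", name
            } else {
                printf "OK   %s (%s)\n", name, cell
            }
        }
        # Exit status is non-zero if any row had fewer passes than elements.
        END { exit (failed ? 1 : 0) }
    '
}

# Example use (commented out so that the sketch has no side effects):
#   vipu-admin test cluster cluster1 | check_vipu_results
```

The exit status lets the check be used in an installation script, while the per-row output shows which test type needs attention.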

Note that the vipu-test command blocks the CLI until the cluster test is completed, and may take several minutes to complete. To avoid blocking the CLI for prolonged periods of time, cluster tests can be executed asynchronously with the --start, --status and --stop options.
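The asynchronous options can be wrapped in small helper functions. The following is a minimal sketch, assuming that --start, --status and --stop are appended to the vipu-admin test cluster command after the cluster name; consult the V-IPU Admin Guide for the exact syntax. The function names and the VIPU_ADMIN and CLUSTER variables are hypothetical and are used only for illustration.

```shell
# Hypothetical wrappers around the asynchronous cluster test options.
# Assumption: the options follow the cluster name on the command line.
VIPU_ADMIN="${VIPU_ADMIN:-vipu-admin}"   # path to the V-IPU admin CLI
CLUSTER="${CLUSTER:-cluster1}"           # cluster name used in this section

# Start the cluster test and return immediately.
start_cluster_test()  { "$VIPU_ADMIN" test cluster "$CLUSTER" --start; }

# Retrieve the results of the running or most recent test.
cluster_test_status() { "$VIPU_ADMIN" test cluster "$CLUSTER" --status; }

# Stop a test that is in progress.
stop_cluster_test()   { "$VIPU_ADMIN" test cluster "$CLUSTER" --stop; }
```

A typical sequence would be to call start_cluster_test, continue with other work, and call cluster_test_status from time to time until the results are complete.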

Depending on how the cluster is created, some of the link tests will be omitted.

In the above example, the V-IPU GW-Link test is skipped because the GW-Link is not used in single IPU‑POD64 installation testing. It only makes sense to test the GW-Links when interconnecting several IPU-PODs.

Errors discovered during testing can be like the ones shown below. Where possible, the error text indicates which ports are relevant to the problem detected. The port numbers used are aligned with the connector numbering scheme described earlier in this document.

While a cluster test is running, some restrictions are imposed on the actions an administrator can perform on the system:

  • Partition creation in a cluster where a test is in progress is forbidden.

  • Removal of a cluster where a test is in progress is forbidden.

  • Only one cluster test can be running at any given time on a V-IPU server, even if the V-IPU server controls more than one cluster.

  • There is no persistence to the cluster test results. Only the results of the last test can be retrieved with the --status option, as long as the V-IPU server has not been restarted.