7. Manual installation tests
Responsibility for testing is divided between BMC management and V-IPU management, each of which covers different parts of the system.
BMC has support for chassis management, which means that it can verify correct hardware behaviour on most functional blocks within the chassis.
V-IPU management supports connectivity tests. It focuses on verifying that the correct cables are used, that IPU-Links and GW-Links are cabled correctly, and that the topology is as expected.
Together, these two sets of tests cover most system installation verification needs.
7.1. Running system tests
The rack_tool utility is included as part of the Bow-2000 software release bundle. See the rack_tool man page for details on how to use rack_tool before running these tests. This is especially important if the tests are to be run on a system that has active users.
rack_tool bist: performs BMC chassis hardware testing
rack_tool vipu-test: performs V-IPU connectivity related tests
7.2. Troubleshooting
This section describes what to do if you encounter problems while installing and testing the rack. If you can’t find the answer to your query here and are still experiencing problems, please contact your Graphcore representative or use the resources on the Graphcore support portal.
7.2.1. BMC BISTs
To run the BMC BIST tests, run the following command:
$ ./rack_tool.py bist
This test generates a very low-level hardware verification report, which Graphcore support will need to analyse if any errors are reported. The logs are written to “./logs”, relative to the directory in which the command is executed.
The command reports “Done BIST on …” if the test is successful.
The command reports “Failed BIST on …” if the test fails.
In both cases, the command reports the name of the log file it generated.
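The pass and fail markers described above can be checked mechanically once the command output has been captured as text. The following is a minimal sketch, not part of rack_tool: the function name summarise_bist and the sample target names are hypothetical, and the text following “on” is treated as an opaque string because its exact format is not specified in this section.

```python
import re

# Marker strings quoted in this section. Whatever follows "on" is captured
# as an opaque string, since its format is not documented here.
BIST_LINE = re.compile(r"(Done|Failed) BIST on\s+(.*)")


def summarise_bist(output: str) -> dict:
    """Group BIST targets by outcome, based on the reported marker lines."""
    summary = {"passed": [], "failed": []}
    for line in output.splitlines():
        match = BIST_LINE.search(line)
        if match is None:
            continue  # not a result line
        key = "passed" if match.group(1) == "Done" else "failed"
        summary[key].append(match.group(2).strip())
    return summary


# Hypothetical captured output; the target names are made up for illustration.
sample = """Done BIST on example-target-1
Failed BIST on example-target-2
"""
result = summarise_bist(sample)
print(result)
```

Any entry under “failed” is a prompt to collect the corresponding log file from “./logs” for Graphcore support.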
7.2.2. V-IPU built-in self tests
$ ./rack_tool.py vipu-test
The following section is based on excerpts from the V-IPU Admin Guide, which should be consulted for a detailed and up-to-date overview of BISTs. The V-IPU User Guide is also useful. The collection of V-IPU connectivity tests can be invoked by the ./rack_tool.py vipu-test command or by directly using V-IPU CLI commands as described below.
The V-IPU Controller implements a cluster testing suite that runs a series of tests to verify installation correctness. A V-IPU cluster test can be executed against a cluster entity before any partitions are created. It is strongly recommended to run all the test types provided by the cluster testing suite before deploying any applications in a cluster.
Assume we have created a cluster named “cluster1” formed by four Bow-2000 IPU-Machines using the command:
vipu-admin create cluster cluster1 --agents ipum1,ipum2,ipum3,ipum4 --mesh
The simplest way to run a complete cluster test for this cluster is to run ./rack_tool.py vipu-test. The test performs the V-IPU self-tests shown below.
$ vipu-admin test cluster cluster1
Showing test results for cluster cluster1:
+---------------+----------+--------+---------------------------------------+
| Test Type | Duration | Passed | Summary |
+===============+==========+========+=======================================+
| Version | 0.00s | 4/4 | All component versions are consistent |
+---------------+----------+--------+---------------------------------------+
| Cabling | 8.76s | 4/4 | All cables connected as expected |
+---------------+----------+--------+---------------------------------------+
| Sync-Link | 0.35s | 8/8 | Sync Link test passed |
+---------------+----------+--------+---------------------------------------+
| Link-Training | 20.16s | 76/76 | All Links Passed |
+---------------+----------+--------+---------------------------------------+
| Traffic | 42.00s | 1/1 | Traffic test passed |
+---------------+----------+--------+---------------------------------------+
| GW-Link | 0.00s | 0/0 | GW-Link test skipped |
+---------------+----------+--------+---------------------------------------+
The output above shows a successful test with no errors reported.
As the test results show, five test types were executed on “cluster1” and one (GW-Link) was skipped. The results for each test type are printed one per row in the output. Each test type tested zero or more elements of the cluster, as can be seen from the “Passed” column. Each test type is explained in detail in the rest of this section.
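When the results are captured by a script, the “Passed” column can be used to separate passed, failed and skipped test types. The sketch below assumes the table layout shown in this section (pipe-separated cells, a “Passed” value of the form n/m); the helper names parse_results and classify are illustrative and are not part of V-IPU.

```python
def parse_results(table: str) -> list:
    """Return one dict per test type from a results table as printed above."""
    rows = []
    for line in table.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border lines and headings outside the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 4 or cells[0] in ("", "Test Type"):
            continue  # header row or continuation row of a longer summary
        passed, _, total = cells[2].partition("/")
        if not (passed.isdigit() and total.isdigit()):
            continue  # "Passed" cell is not of the form n/m
        rows.append({"test": cells[0], "passed": int(passed),
                     "total": int(total), "summary": cells[3]})
    return rows


def classify(row: dict) -> str:
    """0/0 is treated as skipped; otherwise all elements must pass."""
    if row["total"] == 0:
        return "skipped"
    return "ok" if row["passed"] == row["total"] else "failed"


sample = """
+---------------+----------+--------+---------------------------------------+
| Test Type | Duration | Passed | Summary |
+===============+==========+========+=======================================+
| Version | 0.00s | 4/4 | All component versions are consistent |
+---------------+----------+--------+---------------------------------------+
| Link-Training | 20.16s | 76/76 | All Links Passed |
+---------------+----------+--------+---------------------------------------+
| GW-Link | 0.00s | 0/0 | GW-Link test skipped |
+---------------+----------+--------+---------------------------------------+
"""
statuses = {row["test"]: classify(row) for row in parse_results(sample)}
print(statuses)
```

Treating 0/0 as “skipped” is an interpretation based on the GW-Link row in the example; it is not a documented rule.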
Note that the vipu-test command blocks the CLI until the cluster test is completed, and may take several minutes to complete. To avoid blocking the CLI for prolonged periods of time, cluster tests can be executed asynchronously with the --start, --status and --stop options.
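For scripted use, the asynchronous options can be wrapped in a small helper that builds the command line. This is only a sketch: it assumes that --start, --status and --stop are appended to vipu-admin test cluster in the same position as the other options shown in this section, which should be confirmed against the V-IPU Admin Guide. The function name cluster_test_command is hypothetical.

```python
def cluster_test_command(cluster: str, action: str = "") -> list:
    """Build an argument list for a blocking or asynchronous cluster test.

    An empty action gives the blocking form; "start", "status" and "stop"
    map to the options named in this section. Option placement is an
    assumption, not taken from the V-IPU documentation.
    """
    allowed = {"", "start", "status", "stop"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action!r}")
    argv = ["vipu-admin", "test", "cluster", cluster]
    if action:
        argv.append(f"--{action}")
    return argv


print(cluster_test_command("cluster1", "start"))
print(cluster_test_command("cluster1", "status"))
```

The resulting list can be passed to subprocess.run on a host where vipu-admin is installed; the sketch deliberately does not execute anything itself.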
Depending on how the cluster is created, some of the link tests will be omitted.
In the above example the V-IPU GW-Link test is skipped since GW-Links are not used in single Bow Pod64 systems. It only makes sense to test the GW-Links when interconnecting several Bow Pod64 racks.
Errors discovered during testing resemble the examples shown in the rest of this section. Where possible, the error text indicates which ports are involved in the problem detected. The port numbers follow the connector numbering scheme described earlier in this document.
When a cluster test is running, some restrictions are imposed on the actions an administrator can perform on the system:
Partition creation in a cluster where a test is in progress is forbidden.
Removal of a cluster where a test is in progress is forbidden.
Only one cluster test can be running at any given time on a V-IPU server, even if the V-IPU server controls more than one cluster.
Cluster test results are not persistent. Only the results of the last test can be retrieved with the --status option, and only as long as the V-IPU server has not been restarted.
IPU-Link cabling test
The cabling test verifies that external IPU-Link cables are connected as expected and properly inserted. It reads the serial ID of the OSFP cable from each end of each link and verifies that the cable connects the expected ports.
Cabling tests are invoked by passing the --cabling flag to the test cluster command.
If the test fails, details of the failed connections are displayed. This indicates which cables to physically inspect and correct. Very often, a loose cable is the root cause of problems. Below is an example of a test run when the 4 OSFP cables between ipum1 and ipum2 in the cluster are not connected.
$ vipu-admin test cluster cluster1 --cabling
Showing test results for cluster cluster1:
+-----------+----------+--------+-----------------------------------------------------------------------------------+
| Test Type | Duration | Passed | Summary |
+===========+==========+========+===================================================================================+
| Cabling | 21.77s | 8/12 | ipum1 (IPU-Cluster Port 5) x--> ipum2 (IPU-Cluster port 11) (cable not connected) |
+-----------+----------+--------+-----------------------------------------------------------------------------------+
| | | | ipum1 (IPU-Cluster Port 6) x--> ipum2 (IPU-Cluster port 12) (cable not connected) |
+-----------+----------+--------+-----------------------------------------------------------------------------------+
| | | | ipum1 (IPU-Cluster Port 7) x--> ipum2 (IPU-Cluster port 13) (cable not connected) |
+-----------+----------+--------+-----------------------------------------------------------------------------------+
| | | | ipum1 (IPU-Cluster Port 8) x--> ipum2 (IPU-Cluster port 14) (cable not connected) |
+-----------+----------+--------+-----------------------------------------------------------------------------------+
This is an indication of either faulty cabling or an incorrect cluster definition that doesn’t reflect the intended cabling.
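To turn a failed cabling report into a list of cables to inspect, the failure lines can be extracted with a regular expression. The sketch below assumes the line format shown in the example above, including the mixed “Port”/“port” capitalisation; the function name failed_cables is illustrative and is not a V-IPU API.

```python
import re

# Matches lines such as:
#   ipum1 (IPU-Cluster Port 5) x--> ipum2 (IPU-Cluster port 11) (cable not connected)
CABLING_FAILURE = re.compile(
    r"(\S+) \(IPU-Cluster [Pp]ort (\d+)\) x--> "
    r"(\S+) \(IPU-Cluster [Pp]ort (\d+)\) \(([^)]*)\)"
)


def failed_cables(report: str) -> list:
    """Return one dict per failed connection found in a cabling report."""
    failures = []
    for src, src_port, dst, dst_port, reason in CABLING_FAILURE.findall(report):
        failures.append({
            "from": src, "from_port": int(src_port),
            "to": dst, "to_port": int(dst_port),
            "reason": reason,
        })
    return failures


sample = """\
| Cabling | 21.77s | 8/12 | ipum1 (IPU-Cluster Port 5) x--> ipum2 (IPU-Cluster port 11) (cable not connected) |
| | | | ipum1 (IPU-Cluster Port 6) x--> ipum2 (IPU-Cluster port 12) (cable not connected) |
"""
cables = failed_cables(sample)
for cable in cables:
    print(f"Inspect {cable['from']} port {cable['from_port']} "
          f"to {cable['to']} port {cable['to_port']}: {cable['reason']}")
```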
Sync-Link test
The Sync-Link test verifies the external Sync-Link cabling between Bow-2000 IPU-Machines. You can run a Sync-Link test by passing the --sync option to the test cluster command.
A failing Sync-Link test reports the cables that do not match the cluster topology being tested, identifying each failing Sync-Link by its Bow-2000 IPU-Machines and Sync-Link port numbers. In the example below, two Sync-Link cables between “ipum1” and “ipum2” fail:
The link between “ipum1” Sync-Link port 6 and “ipum2” Sync-Link port 2
The link between “ipum1” Sync-Link port 7 and “ipum2” Sync-Link port 3
This is an indication of either faulty cabling or an incorrect cluster definition that doesn’t reflect the intended cabling.
$ vipu-admin test cluster cluster1 --sync
Showing test results for cluster cluster1:
+-----------+----------+--------+----------------------+
| Test Type | Duration | Passed | Summary |
+===========+==========+========+======================+
| Sync-Link | 0.90s | x/y | Failed Sync Links: |
+-----------+----------+--------+----------------------+
| | | | ipum1:6 <--> 2:ipum2 |
+-----------+----------+--------+----------------------+
| | | | ipum1:7 <--> 3:ipum2 |
+-----------+----------+--------+----------------------+
test (cluster): failed: Some tests failed.
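The failed Sync-Link entries use an agent:port form on the left and a port:agent form on the right, which is easy to misread. The following sketch extracts both ends into a uniform shape. It assumes the line format shown in the example above; failed_sync_links is a hypothetical helper name.

```python
import re

# Matches entries such as "ipum1:6 <--> 2:ipum2".
# Note the order: agent:port on the left, port:agent on the right.
SYNC_FAILURE = re.compile(r"(\S+):(\d+) <--> (\d+):(\S+)")


def failed_sync_links(report: str) -> list:
    """Return (agent_a, port_a, agent_b, port_b) for each failed Sync-Link."""
    return [
        (left_agent, int(left_port), right_agent, int(right_port))
        for left_agent, left_port, right_port, right_agent
        in SYNC_FAILURE.findall(report)
    ]


sample = """\
| Sync-Link | 0.90s | x/y | Failed Sync Links: |
| | | | ipum1:6 <--> 2:ipum2 |
| | | | ipum1:7 <--> 3:ipum2 |
"""
links = failed_sync_links(sample)
for agent_a, port_a, agent_b, port_b in links:
    print(f"{agent_a} Sync-Link port {port_a} <-> {agent_b} Sync-Link port {port_b}")
```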
IPU-Link training test
The IPU-Link training test verifies IPU-Link readiness for IPU-Links between and within Bow-2000 IPU-Machines (OSFP cables). An IPU-Link test can be invoked with the --ipulink option in the test cluster command. A failing test indicates which IPU-Links are failing by pointing to the agent and cluster port enumeration of the failing IPU-Link. In the following example, we test a cluster where the IPU-Links have been disconnected between the first and second Bow-2000 IPU-Machines.
$ vipu-admin test cluster cluster1 --ipulink
Showing test results for cluster cluster1:
+-----------+----------+--------+----------------------------------------------------+
| Test Type | Duration | Passed | Summary |
+===========+==========+========+====================================================+
| IPU-Link | 34.57s | x/y | Failed Links |
+-----------+----------+--------+----------------------------------------------------+
| | | | ipum1:4 [pending g1x1] <--> ipum2:8 [pending g1x1] |
+-----------+----------+--------+----------------------------------------------------+
| | | | ipum1:3 [pending g1x1] <--> ipum2:7 [pending g1x1] |
+-----------+----------+--------+----------------------------------------------------+
| | | | ipum1:2 [pending g1x1] <--> ipum2:6 [pending g1x1] |
+-----------+----------+--------+----------------------------------------------------+
| | | | ipum1:1 [pending g1x1] <--> ipum2:5 [pending g1x1] |
+-----------+----------+--------+----------------------------------------------------+
test (cluster): failed: Some tests failed.
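When many links fail at once, grouping the failures by machine pair makes it easier to see which physical connection to check first. The sketch below assumes the entry format shown in the example above and treats the bracketed text (for example “pending g1x1”) as an opaque state string, since its meaning is not defined in this section. The name group_failed_links is illustrative.

```python
import re
from collections import defaultdict

# Matches entries such as "ipum1:4 [pending g1x1] <--> ipum2:8 [pending g1x1]".
LINK_FAILURE = re.compile(
    r"(\S+):(\d+) \[([^\]]*)\] <--> (\S+):(\d+) \[([^\]]*)\]"
)


def group_failed_links(report: str) -> dict:
    """Map (agent_a, agent_b) to a list of (port_a, port_b) failing pairs."""
    groups = defaultdict(list)
    for a, a_port, _a_state, b, b_port, _b_state in LINK_FAILURE.findall(report):
        groups[(a, b)].append((int(a_port), int(b_port)))
    return dict(groups)


sample = """\
| IPU-Link | 34.57s | x/y | Failed Links |
| | | | ipum1:4 [pending g1x1] <--> ipum2:8 [pending g1x1] |
| | | | ipum1:3 [pending g1x1] <--> ipum2:7 [pending g1x1] |
| | | | ipum1:2 [pending g1x1] <--> ipum2:6 [pending g1x1] |
| | | | ipum1:1 [pending g1x1] <--> ipum2:5 [pending g1x1] |
"""
groups = group_failed_links(sample)
for (agent_a, agent_b), ports in groups.items():
    print(f"{agent_a} <-> {agent_b}: {len(ports)} failing links {sorted(ports)}")
```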
IPU-Link traffic test
The traffic test acts as a smoke test for all IPU-Links of a cluster before deploying applications.
The traffic test can be invoked with the --traffic option. Note that for a traffic test to pass, a prerequisite is that the IPU-Link and IPU-Link training tests have passed.
$ vipu-admin test cluster cluster1 --traffic
+---------+----------+--------+---------------------------------------------------+
| Test | Duration | Passed | Summary |
+=========+==========+========+===================================================+
| Traffic | 92.23s | 3/4 | Traffic test failed |
+---------+----------+--------+---------------------------------------------------+
| | | | Errors encountered in traffic test 1 |
+---------+----------+--------+---------------------------------------------------+
| | | | corrected link errors: 460 |
+---------+----------+--------+---------------------------------------------------+
| | | | error counter IPU-Link 1 in ipum1, IPU '1' is 250 |
+---------+----------+--------+---------------------------------------------------+
| | | | error counter IPU-Link 1 in ipum4, IPU '1' is 210 |
+---------+----------+--------+---------------------------------------------------+
test cluster (cluster1): failed: Some tests failed.
This example shows a situation where the IPU-Link traffic test has failed because too many correctable errors were detected. Should this occur, try reseating the IPU-Link cables associated with the referenced Bow-2000 IPU-Machines. If that does not resolve the issue, please contact Graphcore support.
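To decide which cables to reseat first, the per-link error counters can be totalled for each IPU-Machine. The sketch below assumes the counter line format shown in the example above; the helper names are hypothetical. In this example the per-link counters (250 and 210) add up to the reported figure of 460, but that relationship is an observation from this one report rather than a documented guarantee.

```python
import re
from collections import defaultdict

# Matches lines such as: error counter IPU-Link 1 in ipum1, IPU '1' is 250
COUNTER = re.compile(r"error counter IPU-Link (\d+) in (\S+), IPU '(\d+)' is (\d+)")
TOTAL = re.compile(r"corrected link errors: (\d+)")


def errors_by_machine(report: str) -> dict:
    """Sum the reported error counters for each IPU-Machine."""
    per_machine = defaultdict(int)
    for _link, machine, _ipu, count in COUNTER.findall(report):
        per_machine[machine] += int(count)
    return dict(per_machine)


def reported_total(report: str):
    """Return the 'corrected link errors' figure, or None if absent."""
    match = TOTAL.search(report)
    return int(match.group(1)) if match else None


sample = """\
| Traffic | 92.23s | 3/4 | Traffic test failed |
| | | | Errors encountered in traffic test 1 |
| | | | corrected link errors: 460 |
| | | | error counter IPU-Link 1 in ipum1, IPU '1' is 250 |
| | | | error counter IPU-Link 1 in ipum4, IPU '1' is 210 |
"""
per_machine = errors_by_machine(sample)
total = reported_total(sample)
# List machines with the highest error counts first.
for machine, count in sorted(per_machine.items(), key=lambda item: -item[1]):
    print(f"{machine}: {count} errors")
print(f"reported total: {total}")
```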