2. Launching applications with PopRun
To understand how PopRun works, it is important to consider the following:
The application to be distributed using the PopDist library
Input data required by the application
Output data generated by the application
Virtual-IPU (V-IPU) utility for allocating IPUs on IPU-PODs
This guide assumes that you are familiar with the concepts mentioned above. More information on V-IPU can be found in the V-IPU User Guide.
The typical workflow of launching a distributed application on an IPU-POD involves the following steps:
Login: All IPU-PODs are equipped with at least one host server. All application launches are made from a host server. Thus, the starting point for any application launch is logging on to a host server.
Distribution: The application to be launched must be implemented so that it can take advantage of multiple IPUs, either within a single IPU-M2000 or across multiple IPU-M2000s. The PopDist API is used to make applications distributed. Typically, this involves dividing the computation so that it can be executed in parallel. The IPUs involved in the computation can either be located within an IPU-POD or in another interconnected IPU-POD.
Launch: The actual application launch is made by calling PopRun from the command line on a host server.
Allocation: Depending on the resources available to you, including both IPUs and host servers, PopRun will automatically interface with V-IPU to allocate and reserve the resources you need to execute your application.
Execution: The final step is the execution itself, where the application runs on the IPUs.
Some IPU applications may involve the host as part of the computation or for host-side collective operations.
2.1. Launch modes
Applications distributed with PopDist can be launched in several ways. Below are the most common PopRun launch modes:
Single instance: The simplest launch mode. Here, a single instance is launched on the host server to run an application on one or more IPU-M2000s located in one IPU-POD. Each instance runs on a single graph compile domain (GCD), which is a subset of the available IPUs connected by IPU-Links (an IPU-Link domain).
Multi-instance/Single host: In this launch mode, multiple instances are launched on the same host server. Typically, the application targets multiple IPU-M2000s. This mode is recommended for applications where the host CPU is used for pre-processing or other I/O tasks. Each instance can run on a separate GCD.
Multi-instance/Multiple hosts: This mode applies to IPU-PODs that are equipped with multiple host servers. In this mode, multiple instances are launched on the multiple host servers located within a single IPU-POD. This mode is recommended, for example, for applications with special pre- or post-processing needs that must take place on the host CPU.
Multi-instance/Multiple IPU-PODs: The most extensive mode, where multiple instances are launched on one or more host servers across multiple IPU-PODs connected via GW-Links (known as a graph scaleout domain). This mode is recommended for highly scalable IPU applications. Each IPU-POD contains one or more IPU-Link domains (sets of IPUs connected via IPU-Links). The number of IPU-Link domains is specified with the --num-ilds option.
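The four modes above differ only in how many instances, host servers and IPU-PODs take part. The following is a minimal sketch of that decision, not part of PopRun or PopDist; the function name `launch_mode` and its parameters are hypothetical and exist only for illustration.

```python
def launch_mode(num_instances: int, num_hosts: int = 1, num_pods: int = 1) -> str:
    """Name the Section 2.1 launch mode that matches the given counts.

    Hypothetical helper for illustration only; PopRun itself has no such call.
    """
    if min(num_instances, num_hosts, num_pods) < 1:
        raise ValueError("all counts must be at least 1")
    if num_instances == 1:
        # One instance on one host server, whatever the number of IPU-M2000s.
        return "Single instance"
    if num_pods > 1:
        # Instances spread across IPU-PODs connected via GW-Links.
        return "Multi-instance/Multiple IPU-PODs"
    if num_hosts > 1:
        # Several host servers inside a single IPU-POD.
        return "Multi-instance/Multiple hosts"
    return "Multi-instance/Single host"


print(launch_mode(num_instances=2, num_hosts=2))
```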
2.2. Multi-host setup
Launching applications on multiple hosts requires that you have an SSH key pair
to authenticate the connections between the hosts. We recommend using the
ssh-keygen tool to generate a new key pair for this purpose, or copying
a key pair from another machine.
The steps shown in this section are only required if your user home directory is not located on the same host.
To authorize this key pair on all the hosts, use the ssh-copy-id tool.
In this example, we assume that we have four hosts in the following
IP address range:
10.1.3.10[1-4]. To copy the SSH key to all of these
hosts, issue the following commands:
$ ssh-copy-id 10.1.3.101
$ ssh-copy-id 10.1.3.102
$ ssh-copy-id 10.1.3.103
$ ssh-copy-id 10.1.3.104
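With more hosts, typing one command per address becomes tedious. As a rough sketch, the bracketed range notation used above can be expanded programmatically. The helper names `expand_hosts` and `ssh_copy_id_commands` are hypothetical, and the sketch only builds the command lines; it does not run them.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand a bracketed range such as '10.1.3.10[1-4]' into host addresses."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no range present: treat the pattern as one host
    prefix, first, last, suffix = match.groups()
    return [f"{prefix}{n}{suffix}" for n in range(int(first), int(last) + 1)]


def ssh_copy_id_commands(pattern: str) -> list[list[str]]:
    """Build one ssh-copy-id argument list per host in the range."""
    return [["ssh-copy-id", host] for host in expand_hosts(pattern)]


for command in ssh_copy_id_commands("10.1.3.10[1-4]"):
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run` if you prefer to run the commands from Python; ssh-copy-id will still prompt for the password of each host.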
In order to verify that you have successfully copied your SSH key to all hosts,
you can try to
ssh into each of them. If you get access without being
prompted for a password, you are ready to start using PopRun with that host.
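This check can also be scripted. The sketch below assumes the OpenSSH client is installed and uses its BatchMode option, which makes ssh fail instead of prompting for a password. The function names are hypothetical and the five-second timeout is an arbitrary choice.

```python
import subprocess


def ssh_check_command(host: str, timeout_s: int = 5) -> list[str]:
    """Build an ssh command that fails rather than asking for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # never prompt interactively
        "-o", f"ConnectTimeout={timeout_s}",   # give up on unreachable hosts
        host,
        "true",                                # trivial remote command
    ]


def can_login_without_password(host: str) -> bool:
    """Return True if key-based login to the host succeeds."""
    result = subprocess.run(ssh_check_command(host), capture_output=True, text=True)
    return result.returncode == 0
```

Calling `can_login_without_password("10.1.3.101")` for each host should return True once the key has been copied successfully.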
Should you encounter problems accessing a host when using PopRun, an error will be reported. There are two typical errors that you could encounter:
Host key verification failed: This indicates that the key of the remote host was not accepted by the local host. The key of the host typically needs to be placed in
~/.ssh/known_hosts to be automatically verified. Note that when using
ssh in interactive mode, you will be asked if you want to add the remote key the first time you connect. However, when using PopRun, this is not the case as PopRun uses
ssh in non-interactive mode. So the easiest way to get the remote host key added is typically to use
ssh to log into it the first time.
Permission denied: This indicates that the remote host did not grant access to the local host. The key of the local host typically needs to be placed in
~/.ssh/authorized_keys on the remote host in order to be automatically accepted. This can be done by using the
ssh-copy-id command as explained above. Note that interactive password authentication is not supported by PopRun.
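The mapping from each error message to its usual remedy can be written down as a small lookup. This is only an illustrative sketch: the function name `diagnose_ssh_error` is hypothetical, and the exact text surrounding the two messages in real PopRun output may differ.

```python
from typing import Optional

# Error message fragment -> remedy, as described in the text above.
REMEDIES = {
    "Host key verification failed": (
        "Log into the remote host once with ssh so that its key is added "
        "to ~/.ssh/known_hosts."
    ),
    "Permission denied": (
        "Copy your public key to ~/.ssh/authorized_keys on the remote host, "
        "for example with ssh-copy-id."
    ),
}


def diagnose_ssh_error(error_text: str) -> Optional[str]:
    """Return the suggested remedy for a known SSH error, or None."""
    for fragment, remedy in REMEDIES.items():
        if fragment in error_text:
            return remedy
    return None


print(diagnose_ssh_error("Host key verification failed."))
```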
2.3. Application launches
In this section we explain how a sample IPU application can be
launched with PopRun in each of the launch modes described in
Section 2.1, Launch modes. For simplicity, we assume that the
application is a Python application called
train.py. The program
arguments are not shown, as they are irrelevant here. The
IP address passed to
--vipu-server-host is fictional and used
only for the sake of the example. Replace it with the IP address of your
own V-IPU server.
2.3.1. Single instance
$ poprun --numa-aware=yes --vipu-server-host=10.3.7.150 --vipu-partition=P8 \
    --vipu-cluster=A8 --mpi-global-args="--tag-output" \
    --num-replicas=4 --num-instances=1 python3 train.py
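If you launch from a script rather than by hand, the same command line can be assembled as an argument list. The sketch below uses only the options that appear in the command above. The function name `poprun_command` is hypothetical, and the addresses and partition and cluster names are the fictional ones from this example.

```python
import shlex


def poprun_command(application, num_replicas, num_instances,
                   vipu_server_host, vipu_partition, vipu_cluster):
    """Return the argument list for a single-host PopRun launch."""
    return [
        "poprun",
        "--numa-aware=yes",
        f"--vipu-server-host={vipu_server_host}",
        f"--vipu-partition={vipu_partition}",
        f"--vipu-cluster={vipu_cluster}",
        "--mpi-global-args=--tag-output",
        f"--num-replicas={num_replicas}",
        f"--num-instances={num_instances}",
        *application,  # the program to launch and its own arguments
    ]


single = poprun_command(["python3", "train.py"], 4, 1, "10.3.7.150", "P8", "A8")
print(shlex.join(single))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues.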
2.3.2. Multi instance / Single host
$ poprun --numa-aware=yes --vipu-server-host=10.3.7.150 --vipu-partition=A8 \
    --vipu-cluster=A8 --mpi-global-args="--tag-output" \
    --num-replicas=8 --num-instances=2 python3 train.py
2.3.3. Multi instance / Multi host
$ poprun --host 10.3.7.153,10.3.7.154 --numa-aware=yes \
    --vipu-server-host=10.3.7.150 --vipu-partition=P8 \
    --vipu-cluster=A8 --num-ilds=1 --mpi-global-args="--tag-output" \
    --mpi-local-args="-x PYTHONPATH" \
    --num-replicas=8 --num-instances=2 python3 train.py
The key takeaway from the command line shown above is that the IP addresses
of the hosts involved are passed to PopRun, along with any environment
variables that you want to export from the local host to the remote hosts,
in this example the PYTHONPATH variable.
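The two multi-host additions, the host list and the exported environment variables, can be sketched as a small helper. The function name `multi_host_args` is hypothetical. The sketch assumes that several variables can be exported by repeating the -x option, as Open MPI's mpirun allows; this guide only shows a single variable.

```python
import shlex


def multi_host_args(hosts, exported_env=()):
    """Build the extra PopRun arguments used for a multi-host launch."""
    # PopRun takes the hosts as one comma-separated value.
    args = ["--host", ",".join(hosts)]
    if exported_env:
        # "-x NAME" exports the variable NAME to the remote processes.
        exports = " ".join(f"-x {name}" for name in exported_env)
        args.append(f"--mpi-local-args={exports}")
    return args


extra = multi_host_args(["10.3.7.153", "10.3.7.154"], ["PYTHONPATH"])
print(shlex.join(extra))
```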
2.3.4. Multi instance / Multi host / Multi IPU-PODs
$ poprun --host 10.3.7.153,10.3.8.153 --numa-aware=yes \
    --vipu-server-host=10.3.7.150 --vipu-partition=P8 --vipu-cluster=P8 \
    --num-ilds=2 --mpi-global-args="--tag-output" \
    --mpi-local-args="-x PYTHONPATH" \
    --num-replicas=8 --num-instances=2 python3 train.py
The command line shown above is similar to the one shown in Section 2.3.3, Multi instance / Multi host.
One difference is that the number of IPU-Link domains (--num-ilds)
is set to two, as opposed to one.