1. IPU-POD overview
The IPU-POD™ is designed to make training and inference of very large, demanding machine-learning models faster, more efficient, and more scalable, so that large and emerging models can be run effectively.
The IPU-POD is constructed from a number of IPU-M2000s, each containing four IPUs. For example, the IPU-POD16 has four IPU-M2000s (16 IPUs), and the IPU-POD64 is built from 16 IPU-M2000s (64 IPUs). The number of IPUs in an IPU-POD must be a power of 2, and greater than or equal to 16.
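The sizing rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic and is not part of the V-IPU or Poplar software. The function name `m2000_count` and the two constants are hypothetical names introduced for this example.

```python
# Illustrative sketch only: these names are hypothetical, not a Graphcore API.
IPUS_PER_M2000 = 4   # from the text: each IPU-M2000 contains four IPUs
MIN_POD_IPUS = 16    # from the text: the smallest IPU-POD has 16 IPUs


def m2000_count(num_ipus: int) -> int:
    """Return the number of IPU-M2000s in an IPU-POD with num_ipus IPUs.

    Raises ValueError if num_ipus is not a power of 2 or is less than 16.
    """
    # A positive integer is a power of 2 when exactly one bit is set.
    is_power_of_two = num_ipus > 0 and (num_ipus & (num_ipus - 1)) == 0
    if not is_power_of_two or num_ipus < MIN_POD_IPUS:
        raise ValueError(f"{num_ipus} is not a valid IPU-POD size")
    return num_ipus // IPUS_PER_M2000


if __name__ == "__main__":
    # The two sizes named in the text.
    for size in (16, 64):
        print(f"IPU-POD{size}: {m2000_count(size)} IPU-M2000s")
```

Under these assumptions, a size such as 24 (not a power of 2) or 8 (below the minimum) is rejected with a `ValueError`.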
IPU-Links provide communication between the IPUs in an IPU-M2000 and also between the IPU-M2000s in an IPU-POD. A gateway controller in the IPU-M2000 provides links (GW-Links) for high-speed, low-latency communication between IPU-POD racks.
1.1. V-IPU software
The Virtual-IPU™ (V-IPU™) management software is used to allocate and configure IPUs in the IPU-POD. It provides command-line support for both the Poplar user role and the IPU admin role.
It consists of the following components:
V-IPU agents: An agent resides on each IPU-M2000 in an IPU system and manages the IPU-M2000 hardware.
V-IPU controller: The V-IPU controller runs on a management node. It is responsible for managing V-IPU agents.
V-IPU command-line interface: Command-line tools provide access to the administration and user functions of the V-IPU controller.
This document assumes you have access to an IPU-POD that has the V-IPU agents and controller installed and configured.
The V-IPU software should already be installed on the IPU-POD. Check this with your system administrator, who can also give you the information you need to connect to the IPU-POD (see Section 3, Getting started with V-IPU).
If you are the system administrator, refer to the V-IPU Admin Guide for information on installing and using the V-IPU software.
1.2. Poplar SDK
The IPU-POD is fully supported by Graphcore’s Poplar SDK to provide a complete, scalable platform for accelerated machine intelligence development.
The Poplar SDK contains tools for creating and running programs on IPU hardware using standard machine-learning frameworks such as PyTorch and TensorFlow. It also includes command-line tools for managing IPU hardware.