1. Pod overview

This getting-started guide is aimed at users who are ready to start using a Pod system. It describes what you need to do in order to run a simple program on an IPU. This includes:

  • installing the necessary software

  • establishing a connection with the management server on your Pod

  • allocating IPUs to run your software on

It also includes an overview of the tools that you can use to monitor the hardware and software while running machine-learning jobs on your Pod system.

Note

This document assumes you have access to a Pod system that has been built and tested according to the relevant Pod build and test guide. In particular, it assumes that the V-IPU management software (Section 1.3, V-IPU software) has been installed on the management server and on the IPU-Machines.

1.1. About Pod systems

Pod systems are designed to make training and inference of very large machine learning models scalable, fast, and efficient. This makes Pod systems ideal for running large-scale models, including large language models such as GPT-J.

Pod systems are built with IPU-Machines. Each IPU-Machine contains four Graphcore IPUs and supporting components. The name of a Pod indicates the number of IPUs it contains. For example, a Bow Pod16 has four IPU-Machines and 16 IPUs. The different types of Pod system are described below.

Graphcore Bow™ Pod systems combine Bow-2000 IPU-Machines with switches and a host server in a pre-qualified rack configuration.

Each Bow-2000 contains four Bow IPUs. For example, the Bow Pod16 has four Bow-2000s (16 IPUs), and the Bow Pod64 is built from 16 Bow-2000s (64 IPUs).

The smallest Bow Pod system, a Bow Pod16, delivers 5.6 petaFLOPS of AI compute.

IPU-Links provide communication between the IPUs in an IPU-Machine and also between the IPU-Machines in a Pod. The IPU-Gateway in the IPU-Machine uses GW-Links for high-speed, low-latency communication between Pod racks; this is required for multi-rack systems such as the Bow Pod256.

Multi-rack Pod systems are built from multiple Pod racks. For example, a Bow Pod256 can be built from four Bow Pod64 racks and contains 256 IPUs. The number of IPUs in a Pod must be a power of 2, and greater than or equal to 16.
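The sizing rule above can be sketched as a small check. This is an illustrative snippet, not part of any Graphcore tool; the function name is hypothetical.

```python
# Sketch of the Pod sizing rule stated above: the number of IPUs
# must be a power of 2 and at least 16.
def is_valid_pod_size(n_ipus: int) -> bool:
    """Return True if n_ipus is a power of 2 and >= 16."""
    # A power of 2 has exactly one bit set, so n & (n - 1) == 0.
    return n_ipus >= 16 and (n_ipus & (n_ipus - 1)) == 0

# 16, 64 and 256 are valid Pod sizes; 48 is not a power of 2.
print([n for n in (16, 48, 64, 256) if is_valid_pod_size(n)])  # → [16, 64, 256]
```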

Virtualization and provisioning software allows the AI compute resources to be elastically allocated to users and grouped for both model-parallel and data-parallel AI compute.

1.2. Poplar SDK

The Pod is fully supported by Graphcore’s Poplar SDK to provide a complete, scalable platform for accelerated machine intelligence development.

The Poplar SDK contains tools for creating and running programs on IPU hardware using standard machine-learning frameworks such as PyTorch and TensorFlow. The SDK contains PopTorch and PopTorch Geometric, a set of extensions for PyTorch and PyTorch Geometric to enable models to run directly on Graphcore IPU hardware. It also contains a Graphcore distribution of TensorFlow 2.

The SDK also includes command line tools for managing IPU hardware.
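As a sketch of those tools, the snippet below checks whether two of the SDK's hardware utilities, gc-info (device listing) and gc-monitor (device monitoring), are available on your PATH. It assumes you have already sourced the SDK's enable script; it is safe to run on a machine without IPUs.

```shell
# Check that the Poplar SDK hardware tools are on PATH.
# (Assumes the SDK's enable script has been sourced.)
for tool in gc-info gc-monitor; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: not found (source the Poplar SDK enable script first)"
  fi
done
```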

1.3. V-IPU software

Note

A Pod DA system does not need the V-IPU software. If you are using a Pod DA, you can skip this section.

The Virtual-IPU™ (V-IPU™) management software is used for allocating and configuring IPUs in the Pod. The full V-IPU software consists of the following components:

  • V-IPU agents: An agent resides on each IPU-Machine in a Pod system and manages the IPU-Machine hardware.

  • V-IPU controller: The V-IPU controller runs on a management node. It is responsible for managing V-IPU agents.

  • V-IPU command-line interface: Command line tools provide access to the administration and user functions of the V-IPU controller.

This document describes the installation of the command-line interface (Section 2.2, Installing the V-IPU command-line tools). For more information about using V-IPU in a Poplar user role (data centre users), refer to the V-IPU User Guide; for using V-IPU in an admin role (data centre administrators), refer to the V-IPU Administrator Guide.
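A first session with the V-IPU command-line tools might look like the sketch below. It is guarded so it is safe to run on a machine where vipu is not yet installed; on a Pod, the two vipu commands query the controller and list the partitions available to you.

```shell
# Sketch of a first check with the V-IPU CLI.
if command -v vipu >/dev/null 2>&1; then
  vipu --server-version   # confirm the V-IPU controller is reachable
  vipu list partitions    # list the IPU partitions available to you
else
  echo "vipu not installed (see Section 2.2)"
fi
```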