1. Introduction

The Graphcore® Virtual-IPU™ (V-IPU) is a software layer for allocating and configuring Graphcore Intelligence Processing Units (IPUs) in Graphcore Pods. The IPU devices are accessed using IPU over Fabric (IPUoF) network connectivity, based on 100G RDMA over Converged Ethernet (RoCE).

V-IPU provides a command line interface that you use to request IPU resources for running Poplar®-based machine learning applications.

This guide walks you through the steps needed to start running applications on an IPU system.

1.1. Terminology and concepts

Table 1.1 defines the terminology and concepts used in the rest of this document. For other Graphcore-specific terminology, refer to A Dictionary of Graphcore Terminology.

Table 1.1 Terminology

Term

Description

IPU

The Graphcore Intelligence Processing Unit (IPU) is a new type of processor aimed at artificial intelligence and machine learning applications.

IPU-Machine

A rack-mountable compute platform with a number of interconnected IPUs, management logic, In-Processor-Memory, Streaming Memory, and external networking and IPU-Link interfaces. This is the general term for IPU-M2000 and Bow-2000 blades.

Pod

A Graphcore Pod is a set of IPU-Machines interconnected with the IPU-Fabric, for example an IPU-POD system or a Bow Pod system.

IPU-Fabric™

The IPU-Fabric is the set of connections that allows data to be communicated between IPUs in the system. The IPU-Fabric is made up of IPU-Links, GW-Links, Sync-Links and Host-Links.

IPU-Link™

IPU-Links provide direct connections between IPUs.

GW-Links

GW-Links provide a second tier of interconnect for the IPUs interconnected via IPU-Links. Each gateway controller on an IPU-M2000 exposes GW-Links externally to the chassis via OSFP ports.

Synchronisation (Sync) Links

Sync-Links provide the global synchronisation network that allows a group of IPU devices to coordinate program execution and communication.

Host-Links

Host-Links communicate between the IPUs and the host system, using the IPU over Fabric (IPUoF) protocol.

Cluster

A cluster is a set of IPUs connected via both IPU-Links and GW-Links. An example of a cluster is a number of IPU‑POD64 or Bow Pod64 systems.

A cluster is represented by a software entity, also called a cluster, in the V-IPU software.

IPU-Link Domain (ILD)

An ILD is a set of IPUs in a cluster that are connected solely using IPU-Links. An ILD can be all of the IPUs in a Pod, or some subset of them. An ILD cannot span multiple racks, because racks are connected to each other by GW-Links.

vPOD

A vPOD is an isolated set of IPUs within a cluster that can be used to run applications. This enables secure multi-tenancy by isolating the IPUs allocated to different users from one another. A vPOD can span either a single ILD, or multiple ILDs connected via GW-Links. Supported vPOD sizes are 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs.

A vPOD is represented by the partition entity in the V-IPU software.

Allocation

An allocation is a fixed subset of IPU-Machines in a cluster which have been granted to some users for creating partitions. An allocation can span either a single ILD, or multiple ILDs connected via GW-Links. Supported allocation sizes are 4, 8, 16, 32, 64 and multiples of 64 IPUs.

Partition

A partition is a software entity that defines a set of IPUs (a vPOD) within a cluster that are available for running user applications.

A “reconfigurable partition” is a special case that supports multiple Poplar users or applications.

Replica

A replica is one instance of an application in data-parallel training, which runs multiple instances (replicas) of the application. Replica sizes are limited to 1, 2 or 4 IPUs.

Graph compile domain (GCD)

A GCD is the set of IPUs within a partition or vPOD that are targeted by the Poplar graph compiler. The IPUs in a GCD are in a single IPU-Link domain (for example, within a single IPU-POD). Valid GCD sizes are: 1, 2, 4, 8, 16 or 64 IPUs. A GCD is associated with a single server running the Poplar SDK and graph engine.

Graph scaleout domain (GSD)

A GSD is a set of IPUs within a partition or vPOD that are used to run user applications. A GSD consists of one or more GCDs and can extend over multiple Pods. Valid GSD sizes are 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs.
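The size rules listed in Table 1.1 can be summarised as a few simple checks. The following Python sketch is illustrative only: the function and constant names are hypothetical and are not part of the V-IPU software or the vipu utility. It encodes just the sizes stated in the table, reading "multiples of 64" as 64, 128, 192 and so on.

```python
# Illustrative helpers only: these names are hypothetical and do not
# correspond to any V-IPU or Poplar API. Sizes are taken from Table 1.1.

VPOD_FIXED_SIZES = {1, 2, 4, 8, 16, 64}
ALLOCATION_FIXED_SIZES = {4, 8, 16, 32, 64}
GCD_SIZES = {1, 2, 4, 8, 16, 64}
REPLICA_SIZES = {1, 2, 4}


def _fixed_or_multiple_of_64(num_ipus: int, fixed_sizes: set) -> bool:
    """Return True for a listed size or a larger multiple of 64 IPUs."""
    if num_ipus in fixed_sizes:
        return True
    return num_ipus > 64 and num_ipus % 64 == 0


def is_valid_vpod_size(num_ipus: int) -> bool:
    """vPOD: 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs."""
    return _fixed_or_multiple_of_64(num_ipus, VPOD_FIXED_SIZES)


def is_valid_allocation_size(num_ipus: int) -> bool:
    """Allocation: 4, 8, 16, 32, 64 and multiples of 64 IPUs."""
    return _fixed_or_multiple_of_64(num_ipus, ALLOCATION_FIXED_SIZES)


def is_valid_gcd_size(num_ipus: int) -> bool:
    """GCD: 1, 2, 4, 8, 16 or 64 IPUs (a GCD stays within one ILD)."""
    return num_ipus in GCD_SIZES


def is_valid_gsd_size(num_ipus: int) -> bool:
    """GSD: same size list as a vPOD in Table 1.1."""
    return _fixed_or_multiple_of_64(num_ipus, VPOD_FIXED_SIZES)


def is_valid_replica_size(num_ipus: int) -> bool:
    """Replica: 1, 2 or 4 IPUs."""
    return num_ipus in REPLICA_SIZES


if __name__ == "__main__":
    for size in (16, 32, 64, 96, 128):
        print(size, is_valid_vpod_size(size), is_valid_allocation_size(size))
```

For example, under these rules a request for 128 IPUs is a valid vPOD size and a valid allocation size, whereas 96 IPUs is neither.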

1.2. Scope of the document

This document is intended for users of V-IPU-based data centre clusters.

1.3. Structure of the document

The rest of this document is structured as follows. Section 2, Concepts and architecture gives a brief overview of the components and architecture of the V-IPU management software. Installation instructions are provided in Section 3, Getting started. Section 5, Partitions describes how to allocate IPUs to a “partition” that can then be used to run application code. The following chapters describe how to integrate with standard cluster management tools (Section 6, Integration with Slurm and Section 7, Integration with Kubernetes). Finally, a complete command-line reference for the vipu utility is provided in Section 8, Command line reference.