1. Introduction

The Graphcore® Virtual-IPU™ (V-IPU) is a software layer for allocating and configuring Graphcore Intelligence Processing Units (IPUs) in IPU-Machine™ and IPU-POD™ systems. The IPU devices are accessed using IPU over Fabric (IPUoF) network connectivity, based on 100G RDMA over Converged Ethernet (RoCE).

V-IPU provides a command-line interface that you use to request IPU resources for running Poplar®-based machine-learning applications.

This guide walks you through the steps needed to start running applications on an IPU system.

1.1. Terminology and concepts

Table 1.1 defines the terminology and concepts used in the rest of this document.

Table 1.1 Terminology

Term

Description

IPU

The Graphcore Intelligence Processing Unit (IPU) is a new type of processor aimed at artificial intelligence and machine learning applications.

IPU-M2000™

An IPU-Machine: M2000 contains four Colossus™ Mk2 GC200 IPUs and a gateway chip in a 1U chassis. One or more IPU-M2000s can be directly attached to the host system, or they can be built into a rack system as an IPU-POD.

IPU-POD™

An IPU-POD is a set of IPU-M2000s interconnected with the IPU-Fabric.

IPU-Fabric™

The IPU-Fabric is the set of connections that allows data to be communicated between IPUs in the system. The IPU-Fabric is made up of IPU-Links, GW-Links, Sync-Links and Host-Links.

IPU-Link™

IPU-Links provide direct connection between IPUs.

GW-Links

GW-Links provide a second tier of interconnect for the IPUs interconnected via IPU-Links. Each gateway controller on an IPU-M2000 exposes GW-Links externally to the chassis via OSFP ports.

Synchronisation (Sync) Links

Sync-Links provide the global synchronisation network that allows a group of IPU devices to coordinate program execution and communication.

Host-Links

Host-Links communicate between the IPUs and the host system, using the IPU over Fabric (IPUoF) protocol.

Cluster

A cluster is a set of IPUs connected via both IPU-Links and GW-Links. An example of a cluster is N IPU-POD64s, where N can range from 1 to 1024.

A cluster is represented by a software entity, also called a cluster, in the V-IPU software.

IPU-Link Domain (ILD)

An ILD is a set of IPUs in a cluster that are connected solely using IPU-Links. An ILD can be all of the IPUs in an IPU-POD, or some subset of them.

vPOD

A vPOD is an isolated set of IPUs within a cluster that can be used to run applications. This enables secure multi-tenancy by isolating IPUs allocated to different users from one another. A vPOD can span either a single ILD, or multiple ILDs connected via GW-Links. Supported vPOD sizes are 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs.

A vPOD is represented by the partition entity in the V-IPU software.

Partition

A partition is a software entity that defines a set of IPUs (a vPOD) within a cluster that are available for running user applications.

A “reconfigurable partition” is a special case that supports multiple Poplar users or applications.

Replica

A replica is one instance of an application in data-parallel training, which runs multiple instances (replicas) of the application. Replica sizes are limited to 1, 2 or 4 IPUs.

Graph compile domain (GCD)

A GCD is the set of IPUs within a partition or vPOD that are targeted by the Poplar graph compiler. The IPUs in a GCD are in a single IPU-Link domain (for example, within a single IPU-POD). Valid GCD sizes are: 1, 2, 4, 8, 16 or 64 IPUs. A GCD is associated with a single server running the Poplar SDK and graph engine.

Graph scaleout domain (GSD)

A GSD is a set of IPUs within a partition or vPOD that are used to run user applications. A GSD consists of one or more GCDs and can extend over multiple IPU-PODs. Valid GSD sizes are 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs.
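The size rules in Table 1.1 can be summarised as a short Python sketch. This is only an illustration of the sizes listed above: the function and constant names are hypothetical and are not part of the V-IPU software or the Poplar SDK.

```python
# Illustrative only: these helpers mirror the sizes listed in Table 1.1.
# None of these names exist in the V-IPU software or the Poplar SDK.

# Sizes (in IPUs) listed as valid for a GCD, and as the non-multiple
# sizes for a vPOD or GSD.
BASE_SIZES = (1, 2, 4, 8, 16, 64)

# Sizes (in IPUs) listed as valid for a replica.
REPLICA_SIZES = (1, 2, 4)


def is_valid_gcd_size(num_ipus: int) -> bool:
    """A GCD is 1, 2, 4, 8, 16 or 64 IPUs."""
    return num_ipus in BASE_SIZES


def is_valid_vpod_size(num_ipus: int) -> bool:
    """A vPOD is 1, 2, 4, 8, 16, 64 or a multiple of 64 IPUs."""
    return num_ipus in BASE_SIZES or (num_ipus > 64 and num_ipus % 64 == 0)


def is_valid_gsd_size(num_ipus: int) -> bool:
    """A GSD has the same listed sizes as a vPOD."""
    return is_valid_vpod_size(num_ipus)


def is_valid_replica_size(num_ipus: int) -> bool:
    """A replica is limited to 1, 2 or 4 IPUs."""
    return num_ipus in REPLICA_SIZES
```

For example, under these rules 128 IPUs is a listed vPOD or GSD size but not a GCD size, so a 128-IPU GSD would consist of more than one GCD.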

1.2. Scope of the document

This document is intended for users of V-IPU-based data centre clusters.

1.3. Structure of the document

The rest of this document is structured as follows. Section 2, Concepts and architecture gives a brief overview of the components and architecture of the V-IPU management software. Installation instructions are provided in Section 6.3.2, Installation. Section 4, Partitions describes how to allocate IPUs to a “partition” that can then be used to run application code. The following chapters describe how to integrate with standard cluster management tools (Section 5, Integration with Slurm and Section 6, Integration with Kubernetes). Finally, a complete command-line reference for the vipu utility is provided in Section 7, Command line reference.