1. Introduction

The Graphcore® Virtual-IPU™ (V-IPU) management software provides a control plane for large-scale multi-tenanted deployments of Graphcore Intelligence Processing Units (IPUs). It enables management of IPU-based clusters and integration into both legacy data centre environments and emerging software-defined data centres (SDDCs).

The high-level functionality provided by the V-IPU management software is as follows.

  • Hardware introspection and configuration: Introspects and configures hardware components in an IPU cluster. Hardware configuration includes IPU management and configuration, link configuration and training, cluster and routing setup, IPU gateway setup, and sync configurations for IPU applications.

  • Performance isolation: Configures sub-clusters, in the form of partitions, isolated from one another for running workloads in a shared IPU cluster environment.

  • Management controller: Provides a single point of administration for the remote management of hardware and software components in an IPU cluster.

  • Monitoring: Supports comprehensive monitoring, troubleshooting, and performance profiling.

1.1. Terminology and concepts

Table 1.1 defines the terminology and concepts used in the rest of this document. For other Graphcore-specific terminology, refer to A Dictionary of Graphcore Terminology.

Table 1.1 Terminology

Term

Description

IPU

The Graphcore Intelligence Processing Unit (IPU) is a new type of processor aimed at artificial intelligence and machine learning applications.

IPU-Machine

A rack mountable compute platform with a number of interconnected IPUs, management logic, In-Processor-Memory, Streaming Memory, and external networking and IPU-Link interfaces. General term for IPU-M2000 and Bow-2000 blades.

Pod

A Graphcore Pod is a set of IPU-Machines interconnected with the IPU-Fabric, for example an IPU-POD system or a Bow Pod system.

IPU-Fabric™

The IPU-Fabric is the set of connections that allows data to be communicated between IPUs in the system. The IPU-Fabric is made up of IPU-Links, GW-Links, Sync-Links and Host-Links.

IPU-Link™

IPU-Links provide direct connection between IPUs.

GW-Links

GW-Links provide a second tier of interconnect for the IPUs interconnected via IPU-Links. Each gateway controller on an IPU-M2000 exposes GW-Links externally to the chassis via OSFP ports.

Synchronisation (Sync) Links

Sync-Links provide the global synchronisation network that allows a group of IPU devices to coordinate program execution and communication.

Host-Links

Host-Links communicate between the IPUs and the host system, using the IPU over Fabric (IPUoF) protocol.

Cluster

A cluster is a set of IPUs connected via both IPU-Links and GW-Links. An example of a cluster is a number of IPU‑POD64 or Bow Pod64 systems.

A cluster is represented by a software entity, also called a cluster, in the V-IPU software.

IPU-Link Domain (ILD)

An ILD is a set of IPUs in a cluster that are connected solely using IPU-Links. An ILD can be all of the IPUs in a Pod, or some subset of them. An ILD cannot span multiple racks, which are connected by GW-Links.

vPOD

A vPOD is an isolated set of IPUs within a cluster that can be used to run applications. This enables secure multi-tenancy by isolating IPUs allocated to different users from one another. A vPOD can span either a single ILD, or multiple ILDs connected via GW-Links. Supported vPOD sizes are 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs.

A vPOD is represented by the partition entity in the V-IPU software.

Allocation

An allocation is a fixed subset of IPU-Machines in a cluster which have been granted to some users for creating partitions. An allocation can span either a single ILD, or multiple ILDs connected via GW-Links. Supported allocation sizes are 4, 8, 16, 32, 64 and multiples of 64 IPUs.

Partition

A partition is a software entity that defines a set of IPUs (a vPOD) within a cluster that are available for running user applications.

A “reconfigurable partition” is a special case that supports multiple Poplar users or applications.

Replica

A replica is one instance of an application in data-parallel training, where multiple instances (replicas) of the application are run. Replica sizes are limited to 1, 2 or 4 IPUs.

Graph compile domain (GCD)

A GCD is the set of IPUs within a partition or vPOD that are targeted by the Poplar graph compiler. The IPUs in a GCD are in a single IPU-Link domain (for example, within a single IPU-POD). Valid GCD sizes are: 1, 2, 4, 8, 16 or 64 IPUs. A GCD is associated with a single server running the Poplar SDK and graph engine.

Graph scaleout domain (GSD)

A GSD is a set of IPUs within a partition or vPOD that are used to run user applications. A GSD consists of one or more GCDs and can extend over multiple Pods. Valid GSD sizes are 1, 2, 4, 8, 16, 64 and multiples of 64 IPUs.
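The size rules listed in Table 1.1 can be summarised in a short Python sketch. This is only an illustration of the constraints as stated in the table: the constant and function names below are hypothetical and are not part of the V-IPU software or its command-line tools, and the V-IPU software may apply further checks (for example, on topology) that are not modelled here.

```python
# Illustrative sketch only: these names are hypothetical and are NOT part of
# the V-IPU software. The sets simply restate the sizes given in Table 1.1.

VPOD_BASE_SIZES = {1, 2, 4, 8, 16, 64}       # plus multiples of 64
ALLOCATION_BASE_SIZES = {4, 8, 16, 32, 64}   # plus multiples of 64
REPLICA_SIZES = {1, 2, 4}
GCD_SIZES = {1, 2, 4, 8, 16, 64}
GSD_BASE_SIZES = {1, 2, 4, 8, 16, 64}        # plus multiples of 64


def _in_set_or_multiple_of_64(num_ipus, base_sizes):
    """Return True if num_ipus is a listed size or a positive multiple of 64."""
    if num_ipus <= 0:
        return False
    return num_ipus in base_sizes or num_ipus % 64 == 0


def is_valid_vpod_size(num_ipus):
    return _in_set_or_multiple_of_64(num_ipus, VPOD_BASE_SIZES)


def is_valid_allocation_size(num_ipus):
    return _in_set_or_multiple_of_64(num_ipus, ALLOCATION_BASE_SIZES)


def is_valid_replica_size(num_ipus):
    return num_ipus in REPLICA_SIZES


def is_valid_gcd_size(num_ipus):
    # A GCD sits within a single IPU-Link domain, so no multiples of 64.
    return num_ipus in GCD_SIZES


def is_valid_gsd_size(num_ipus):
    return _in_set_or_multiple_of_64(num_ipus, GSD_BASE_SIZES)


if __name__ == "__main__":
    for size in (1, 4, 32, 64, 128):
        print(size,
              "vPOD:", is_valid_vpod_size(size),
              "allocation:", is_valid_allocation_size(size),
              "GCD:", is_valid_gcd_size(size))
```

Note that, going by the table, 32 IPUs is a listed allocation size but not a listed vPOD size, and a GCD cannot exceed 64 IPUs whereas a GSD can.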

1.2. Scope of the document

This document is intended for the administrators of V-IPU-based data centre clusters.

1.3. Structure of the document

The rest of this document is structured as follows. Section 2, Concepts and architecture, gives a brief overview of the components and architecture of the V-IPU management software. Installation instructions are provided in Section 3, Installation. Security features of the V-IPU software are discussed in Section 4, Securing the installation. Section 5, Users and allocations, discusses users and allocations, while different cluster topologies and tests are detailed in Section 6.2.2, Clusters. Finally, complete command-line references for the vipu-admin utility and the vipu-server are provided in Section 9, Admin command line reference, and Section 10, Server command line reference.