1. Overview

OpenStack can be deployed in a variety of different configurations, and Graphcore does not endorse or support any particular configuration. OpenStack is also not a pre-requisite for using Graphcore technology, however, OpenStack is often used as an underlying infrastructure in datacentres. The reference design described in this document can be used as a guideline to derive your own configuration of a Graphcore IPU solution on your existing implementation of OpenStack (or alternative cloud platform).

1.1. Goal

The aim of this document is to show a high-level example of configuring an IPU Pod (IPU-POD₆₄) with OpenStack. This includes capturing all the details of the design and implementation of the installation: networking, deployment, administration and monitoring features.

The broader, more general details of the OpenStack deployment are not described; however, the fundamentals which it implements are covered such that it could be replicated either manually or with a different automation scheme.

1.2. Use cases

IPU Pods support multiple tenants with multiple users.

1.2.1. Use case 1:

Provision of multiple user tenancies which each provide CPU and IPU compute in a discrete, secure and flexible virtual IPU Pod (vPOD). Each vPOD requires:

One or more Virtual Machines (Poplar hosts) to execute Poplar workloads

One or more IPU-Machines (IPU-M2000 or Bow-2000)

Access to IPU-Machines from Poplar hosts to execute Poplar workloads on IPUs

A secure network providing ssh access to Poplar hosts

Access to VIPU and IPU-M management services for administrators

Access to VIPU user services for Poplar hosts

Secure network storage with secure network connections (optional)

1.3. Generic cloud overview

The overall architecture of components is shown below. Virtual IPU Pods are constructed with resources from several physical hardware units:

CPU from a cluster of COTS (commercial-off-the-shelf) compute servers

CPU from high-specification Poplar servers installed in the IPU Pod (IPU-POD₆₄) clusters

IPUs from IPU-Machines in the IPU Pod (IPU-POD₆₄) clusters

Network storage volumes from dedicated high-performance NFS appliances

Virtual disk volumes mounted from Ceph storage cluster

Control and data planes overlaid on dedicated 1 Gb (or 10 Gb) and 100 Gb network infrastructures, respectively.

_images/generic-cloud-overview.png — Fig. 1.1 Generic cloud overview - control and data planes

1.4. Cloud physical overview

The reference IPU cloud consists of a number of high-power racks containing IPU Pods, switches, storage and other servers. Additional rack(s) contain the OpenStack infrastructure hosts and storage hardware.

A typical layout is shown in Fig. 1.2. Note that each IPU-POD₆₄ is referred to as a logical rack in other sections of this document.

All 1G switches are connected to a dedicated data centre network (not shown) for switch management.

The data centre management network also provides internet connectivity.

_images/cloud-physical-overview.png — Fig. 1.2 Cloud physical overview

1.5. Acronyms and abbreviations

Term	Description
IPU-Machine	An IPU-Machine is a blade with 4 IPUs such as the IPU-M2000 or Bow-2000
IPU Pod	A rack solution containing IPU-Machines, one or more host servers (also called Poplar servers), network switches and IPU Pod software
Poplar server	Host server that is used by end-users to run machine learning workloads on IPU-Machines
RDMA	Remote DMA
RNIC	RDMA Network Interface Controller: 100GbE interface used to communicate between Poplar servers and IPU-Machines for fast data transfers
RoCE	RDMA over Converged Ethernet: protocol used to transfer data between Poplar servers and IPU-Machines
ToR	Top of Rack: term used in combination with the ToR RDMA switch that is placed on top of the IPU-Machines
VIRM	V-IPU Resource Manager: V-IPU agent running on an IPU-Machine
V-IPU Controller	A service that is managing IPU-Machines using VIRMs
VLAN	Virtual LAN - subnetwork grouping devices on separate physical local area networks
VxLAN	Virtual Extensible LAN - network virtualization that improves scalability associated with large cloud computing deployments
vPOD	Virtual POD. A group of IPU-Machines with a dedicated V-IPU Controller and one or more Poplar servers. One vPOD is used by one tenant. A vPOD can have 1, 2, 4, 8, 16 or more IPU-Machines.

Search help