1. Overview

This document illustrates a reference configuration of a Graphcore IPU-POD64 rack deployed with OpenStack, the open-source cloud computing infrastructure management software. The IPU-POD64 rack contains 64 IPUs (Intelligence Processing Units) and OpenStack is used to manage these IPU resources via its API and UI.

OpenStack can be deployed in many different configurations, and Graphcore does not endorse or support any particular one. OpenStack is also not a prerequisite for using Graphcore technology. However, it is often used as the underlying infrastructure in datacentres, so you should treat this sample description as a set of guidelines from which you can derive your own configuration of a Graphcore IPU solution on your existing OpenStack implementation.

1.1. Goal

The aim of this document is to show a high-level example of configuring an IPU-POD64 with OpenStack.

1.2. Use cases

The basic scenario is outlined below. Note that the host server, the persistent storage, and all the IPU-Machines in the IPU-POD64 are connected to the ToR switch. IPU-PODs support multiple tenants with multiple users.

  • User requests a new VM on the host server

  • User requests additional volumes to be mounted to the VM for persistent storage

  • User connects to a shell in the VM

  • User creates and deletes IPU partitions

  • User runs machine learning training or inference workloads

  • User deletes a VM after a workload is finished
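The scenario above can be sketched as a dry-run plan of command lines. This is a minimal sketch and not a Graphcore-endorsed procedure: it only builds and prints commands, it executes nothing, and every name in it (flavor, image, network, VM, volume, partition, login address, and the train.py script) is a hypothetical placeholder. The openstack subcommands are standard OpenStackClient commands; the vipu partition commands are assumed to follow the V-IPU command line tool and would be run from the shell inside the VM.

```python
import shlex


def build_workflow(vm="ipu-vm-1", volume="ipu-data-1", partition="part-1",
                   flavor="poplar.large", image="ubuntu-poplar",
                   network="tenant-net", volume_gb=100, partition_ipus=4,
                   login="ubuntu@ipu-vm-1.example.com"):
    """Return the use-case steps as (description, argv) pairs.

    All default values are hypothetical placeholders. Nothing is executed.
    """
    return [
        ("request a new VM on the host server",
         ["openstack", "server", "create", "--flavor", flavor,
          "--image", image, "--network", network, "--wait", vm]),
        ("request a volume for persistent storage",
         ["openstack", "volume", "create", "--size", str(volume_gb), volume]),
        ("attach the volume to the VM",
         ["openstack", "server", "add", "volume", vm, volume]),
        ("connect to a shell in the VM",
         ["ssh", login]),
        # The next three commands are assumed to run inside the VM.
        ("create an IPU partition",
         ["vipu", "create", "partition", partition,
          "--size", str(partition_ipus)]),
        ("run a training or inference workload",
         ["python3", "train.py"]),
        ("delete the IPU partition",
         ["vipu", "remove", "partition", partition]),
        ("delete the VM once the workload is finished",
         ["openstack", "server", "delete", "--wait", vm]),
    ]


# Print the plan as shell-quoted command lines for review.
for description, argv in build_workflow():
    print(f"# {description}")
    print(shlex.join(argv))
```

Keeping each step as an argument list rather than a single string avoids quoting mistakes if the plan is later passed to a process runner.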

1.3. Acronyms and abbreviations

Term
  Description

IPU-M, IPU-Machine
  An IPU-Machine is a blade with 4 IPUs such as the IPU-M2000

IPU-POD
  A rack solution containing IPU-Machines, one or more host servers (also called Poplar servers), network switches and IPU-POD software

Poplar server
  Host server that is used by end-users to run machine learning workloads on IPU-Machines

RDMA
  Remote DMA

RNIC
  RDMA Network Interface Controller: 100GbE interface used to communicate between Poplar servers and IPU-Machines for fast data transfers

RoCE
  RDMA over Converged Ethernet: protocol used to transfer data between Poplar servers and IPU-Machines

ToR
  Top of Rack: term used in combination with the ToR RDMA switch that is placed on top of the IPU-Machines

VIRM
  V-IPU Resource Manager: V-IPU agent running on an IPU-Machine

V-IPU Controller
  A service that is managing IPU-Machines using VIRMs

VLAN
  Virtual LAN - subnetwork grouping devices on separate physical local area networks

VxLAN
  Virtual Extensible LAN - network virtualization that improves scalability associated with large cloud computing deployments

vPOD
  Virtual POD. A group of IPU-Machines with dedicated V-IPU Controller and one or more Poplar servers. One vPOD is used by one tenant. A vPOD can have 1, 2, 4, 8, 16 or more IPU-Machines.
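The sizes in these definitions can be related with some simple arithmetic. The sketch below uses only two figures from this document (4 IPUs per IPU-Machine and 64 IPUs in an IPU-POD64) and derives the rest. It assumes, purely for illustration, that every IPU-Machine in the rack is available to tenants and that the rack is split into equal-sized vPODs; the function names are hypothetical.

```python
IPUS_PER_MACHINE = 4   # an IPU-Machine is a blade with 4 IPUs
POD64_IPUS = 64        # an IPU-POD64 contains 64 IPUs

# Derived: number of IPU-Machines in one IPU-POD64 rack.
POD64_MACHINES = POD64_IPUS // IPUS_PER_MACHINE


def vpod_ipus(machines):
    """Number of IPUs in a vPOD built from the given number of IPU-Machines."""
    return machines * IPUS_PER_MACHINE


def max_vpods(machines_per_vpod, total_machines=POD64_MACHINES):
    """How many equal-sized vPODs fit in the rack (illustrative assumption)."""
    if machines_per_vpod < 1:
        raise ValueError("a vPOD needs at least one IPU-Machine")
    return total_machines // machines_per_vpod


# Tabulate the vPOD sizes listed in the vPOD definition.
for machines in (1, 2, 4, 8, 16):
    print(f"{machines:>2} IPU-Machines -> {vpod_ipus(machines):>2} IPUs, "
          f"up to {max_vpods(machines):>2} vPODs per IPU-POD64")
```

Under these assumptions, a single IPU-POD64 could host anything from sixteen one-machine vPODs to one vPOD spanning the whole rack.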