3. Physical installation

3.1. Graphcore IPU Pod installation

All the IPU Pods in this reference design are physically built as standard IPU-Machine (IPU-M2000 or Bow-2000) IPU Pod64 configurations as described in the IPU-POD64 Reference Design: Build and Test Guide and Bow Pod64 Reference Design: Build and Test Guide, respectively.

Note that the following may be subject to change in subsequent revisions of the build and test guides:

  • 4x R6525 AMD servers are used with 512 GB RAM each. The RAM is installed with 8x DIMM modules to provide 8 symmetrical NUMA nodes.

  • Both 100 Gb RNIC ports on each Poplar server are connected to the ToR switch.

  • Both management network ports on each IPU-Machine are connected to the management switch, although only one of them is currently used. This port is shared by the BMC and IPU-Gateway virtual interfaces via a micro-switch built into the IPU-Machine. The IPU-Machine firmware includes an option to split these onto separate interfaces, but this has not been tested for this design.

  • Each R6525 server is configured to PXE boot from a 1 Gb (or 10 Gb) interface. The physical connection is described in the relevant build and test guide. There are several 1 Gb interfaces on these hosts, and the network cards vary, so the logical interface name presented to the OS can differ between hosts.

  • Dell iDRAC settings (a verification sketch follows this list):

    • Firmware upgrades:

      • BIOS: 3.9.3

      • iDRAC: 6.10.00.00

      • Mellanox ConnectX-5: 16.26.1040

    • Hyperthreading is enabled

    • NUMA nodes per socket (NPS) = 4

    • Optimised PCIe settings:

      • PCIe preferred I/O bus: enabled

      • PCIe preferred I/O bus value: 161 (PCIe slot that contains the RNIC card)

    • RAID for boot and OS drives

    • IDRAC service account for Kayobe connection, and IPMI enabled:

      • Role: administrator

      • IPMI LAN privilege: administrator

    • Tuned for maximum performance:

      • CPU power management: maximum performance

      • Turbo boost: enabled

      • C states: disabled
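
The iDRAC settings listed above can be checked programmatically over the Redfish API. The following is a minimal sketch only: it assumes an iDRAC9 Redfish endpoint, the Dell BIOS attribute names used (LogicalProc, NumaNodesPerSocket, ProcTurboMode, ProcCStates) may differ between BIOS versions, and the address and credentials are placeholders.

    # Sketch: read the BIOS attributes from the iDRAC Redfish endpoint and compare
    # a few of them against the values this design expects. The attribute names and
    # the iDRAC address/credentials are assumptions, not part of the reference design.
    import requests

    IDRAC = "https://192.0.2.10"        # placeholder iDRAC address
    AUTH = ("svc-kayobe", "password")   # placeholder service account

    resp = requests.get(
        f"{IDRAC}/redfish/v1/Systems/System.Embedded.1/Bios",
        auth=AUTH,
        verify=False,                   # iDRACs commonly present self-signed certificates
        timeout=30,
    )
    resp.raise_for_status()
    attrs = resp.json().get("Attributes", {})

    expected = {
        "LogicalProc": "Enabled",       # hyperthreading
        "NumaNodesPerSocket": "4",      # NPS = 4
        "ProcTurboMode": "Enabled",     # turbo boost
        "ProcCStates": "Disabled",      # C states
    }

    for name, want in expected.items():
        got = attrs.get(name, "<missing>")
        flag = "OK" if str(got) == want else "CHECK"
        print(f"{flag:5} {name}: got {got!r}, expected {want!r}")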

The MAC addresses of all the IPU-Machine interfaces need to be gathered as part of the installation so that DHCP can be configured. The 1 GbE interface MACs are listed on labels, but the RNIC interface MACs can only be found after network access has been set up. This means that there are several options for the first install (a sketch for generating DHCP host entries from the gathered MACs follows these options):

  • Connect the IPU-Machines to the 100 GbE ToR switch and power them on. The MAC addresses can then be seen by the switch and gathered from the switch management interface.

  • Install and configure as a standalone bare-metal IPU Pod with factory IP addressing (as per the build and test guides), then query the RNIC MAC addresses. You can then redeploy the IPU Pod under the OpenStack Ironic automation. You can also perform hardware tests on the IPU-Machines at this stage.

  • Deploy the IPU Pod with the OpenStack Ironic and Neutron projects using the 1 GbE interface MACs gathered from the labels, but without the RNIC information. After deployment, gather the RNIC interface MACs, update the DHCP configuration and redeploy.
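
Whichever option is chosen, the end result is a mapping from MAC address to hostname and IP address that feeds the DHCP configuration. The sketch below is illustrative only: it assumes a hypothetical CSV file (hostname,mac,ip) assembled from the labels and/or the ToR switch MAC table, and emits dnsmasq-style dhcp-host entries as one possible output format; in this design the reservations are ultimately carried in the Kayobe/Neutron configuration rather than a hand-maintained dnsmasq file.

    # Sketch: normalise gathered MAC addresses and print DHCP host reservations.
    # The CSV layout (hostname,mac,ip) and the dnsmasq output format are assumptions.
    import csv
    import re
    import sys

    def normalise(mac: str) -> str:
        """Accept aa:bb:cc:dd:ee:ff, aa-bb-..., or aabb.ccdd.eeff forms."""
        digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
        if len(digits) != 12:
            raise ValueError(f"bad MAC address: {mac!r}")
        return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

    with open(sys.argv[1], newline="") as f:    # e.g. ipu_machines.csv (hypothetical)
        for row in csv.DictReader(f):
            mac = normalise(row["mac"])
            print(f'dhcp-host={mac},{row["hostname"]},{row["ip"]}')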

3.2. Infrastructure hosts

3.2.1. Ceph nodes (x5)

For the reference design these are Dell R640 Intel 96-core servers with 768 GB RAM, 2x 100 GbE, 7x 1 TB NVMe and 2x 500 GB SSD (as hardware RAID1).

Each server should provide at least 7 TB of NVMe storage spread across multiple devices, since one Ceph OSD per device gives better parallelism and failure isolation than a single large device.

These nodes also operate as Hypervisors for general compute purposes.

Section 6.2, Ceph clusters provides more details.

3.2.2. OpenStack control nodes (x3)

For the reference design these are Dell R640 Intel 96-core servers with 768 GB RAM, 2x 100 GbE, 7x 1 TB NVMe and 2x 500 GB SSD (as hardware RAID1). These may be over-specified in terms of memory and CPU for your system.

These nodes run the OpenStack Overcloud core control services (as bare-metal containers).

Three servers are needed to provide HA.

3.2.3. Seed and Ansible control nodes (x2)

You will require servers with a minimum of:

  • 8 core CPU

  • 12 GB memory

  • 100 GB free disk space (partitioned if possible to separate the Docker volume storage from the host OS)

  • 1x 1 Gb Ethernet

In the reference design the servers are Dell PowerEdge R440 with a Xeon Silver 4210 2.2 GHz CPU, 128 GB RAM, dual 25 GbE Ethernet and 2x 480 GB SATA SSD.

Note

Only one server is required for a single cloud installation. In our example two were used to allow for an independent development setup.

This host runs a pre-built seed VM in which OpenStack Docker containers run to manage the OpenStack cloud; this is also referred to as the Undercloud. The host must have access to the power management and provisioning/control networks of the remote servers that are to be provisioned into the OpenStack cloud.
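
As an illustration of what the seed host does with the gathered hardware details, the sketch below enrols a single server into OpenStack Ironic using openstacksdk. This is only a sketch of the underlying API calls: in practice Kayobe generates and applies the equivalent configuration from its inventory, and the cloud name, node name, IPMI credentials and MAC address shown are placeholders.

    # Sketch: enrol one bare-metal node and its provisioning NIC with Ironic.
    # All names, addresses and credentials below are placeholders.
    import openstack

    conn = openstack.connect(cloud="overcloud")     # placeholder clouds.yaml entry

    node = conn.baremetal.create_node(
        name="poplar-host-01",                      # placeholder node name
        driver="ipmi",
        driver_info={
            "ipmi_address": "192.0.2.21",           # iDRAC address on the OOB network
            "ipmi_username": "svc-kayobe",
            "ipmi_password": "password",
        },
    )

    # Register the 1 GbE provisioning NIC so Ironic/Neutron can PXE boot the host.
    conn.baremetal.create_port(node_id=node.id, address="aa:bb:cc:dd:ee:01")

    # Walk the node through the Ironic state machine: enrolled -> manageable -> available.
    conn.baremetal.set_node_provision_state(node, "manage", wait=True)
    conn.baremetal.set_node_provision_state(node, "provide", wait=True)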

3.3. Core networking

3.3.1. Multi-rack interconnect

The Pod64 racks (described in Section 3.1, Graphcore IPU Pod installation) and the OpenStack infrastructure racks have 8x 100 GbE uplinks to a shared 400 GbE spine. Ideally this would be connected as 4x 100 GbE to each of a pair of MLAG 400 GbE switches to provide redundant connectivity with a minimum of 400 Gb/s throughput. Fig. 3.1 shows a general network architecture for multi-rack connectivity. This includes options for storage, disaggregated compute and other services including the OpenStack infrastructure.

_images/multi-rack-connect.png

Fig. 3.1 Multi-rack connectivity

3.4. Power and cooling

Each Pod64 rack requires power at:

  • 1500 W per IPU-M2000 (1700 W for Bow-2000)

  • 1400 W per Poplar host

  • 333 W (max) for Arista 7060 switch

  • 52 W (max) for Arista 7010T switch
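
As an indicative worked total, assuming a fully populated rack of 16 IPU-Machines and 4 Poplar hosts (as in Section 3.1): 16 x 1500 W + 4 x 1400 W + 333 W + 52 W ≈ 30 kW for an IPU-M2000 rack, or 16 x 1700 W + 5600 W + 385 W ≈ 33 kW for a Bow-2000 rack, excluding any infrastructure hosts installed in the same rack.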

3.5. Network components

This section describes networks, endpoints, physical devices and the relationships between them.

3.5.1. 1 GbE management network

This network has the following devices and services connected:

  • Poplar servers: 2x 1 GbE interfaces and IPMI interface

  • Infrastructure servers: 2x 1 GbE interfaces and IPMI interface

  • IPU-Machines: will have either

    • A single connection to the 1 GbE network presenting two MAC addresses, one for the IPU BMC and one for the IPU-Gateway; or

    • Two 1 GbE connections, one for BMC and one for the IPU-Gateway (untested in this reference architecture)

  • PDUs in each rack: management interface

  • 100 GbE switch: management interface

The purpose of this network is to provide connections for managing devices.

3.5.2. 100 GbE data network

This network has the following devices and services connected:

  • Poplar servers (Hypervisors): 2x RNIC 100 GbE interfaces (configured as a bonded MLAG pair)

  • IPU-Machines: 100 GbE interfaces

  • OpenStack control plane with virtual networks plus block, object and file storage

  • High performance 3rd party storage (optional)

3.5.3. Networks and VLANs

The physical switches (1 GbE management switch and 100 GbE data switch) provide physical network access to the IPU-Machines, infrastructure hosts and Poplar servers. To separate data and management traffic, and to separate the various tenants, VLANs are added as provider networks within OpenStack Neutron. Neutron will then provide IPAM and DHCP for these provider networks, along with any required routing.
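
As a hedged illustration, the sketch below defines one such VLAN (a storage network, for example) as a Neutron provider network with Neutron-managed IPAM and DHCP, using openstacksdk. The physical network label, VLAN ID and CIDR are placeholders, and in practice the equivalent definitions would normally be applied through the deployment tooling rather than ad hoc scripts.

    # Sketch: create a VLAN provider network and subnet in Neutron.
    # The cloud name, physical network label, VLAN ID and CIDR are assumptions.
    import openstack

    conn = openstack.connect(cloud="overcloud")     # placeholder clouds.yaml entry

    net = conn.network.create_network(
        name="storage",
        provider_network_type="vlan",
        provider_physical_network="physnet1",       # maps to the 100 GbE data fabric
        provider_segmentation_id=1234,              # placeholder VLAN ID
    )

    conn.network.create_subnet(
        network_id=net.id,
        name="storage-subnet",
        ip_version=4,
        cidr="10.10.30.0/24",                       # placeholder CIDR
        is_dhcp_enabled=True,                       # Neutron provides DHCP/IPAM
    )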

A typical VLAN set would be:

  • Overcloud networks

    • Provisioning networks (for deploying and managing the host OS)

    • Out-of-band management network (for IPMI power management)

    • ‘Internal’ network for OpenStack control plane communication

    • ‘Tunnel’ network for VXLAN tunnels between Open vSwitch instances

    • ‘Storage’ network for Ceph connectivity

    • Additional networks for other OpenStack services if enabled, for example:

      • Octavia (LBaaS)

      • Manila (NFSaaS)

  • Public storage (if exposing Ceph to user workloads)

  • External network (for exposing workloads)

  • External high-performance storage

  • VLAN range for VLAN based provider networks to be used in user tenancies