5. Server requirements

This chapter describes the reference baseline configuration for the server in an IPU-POD DA system before running the IPU-M2000 system software installation script. The high-level steps for configuring the server are as follows:

  1. Select appropriate server specifications

  2. Configure the server storage and RAID arrays

  3. Install the server operating system and packages

  4. Install required Python packages

5.1. Server hardware and storage

5.1.1. Hardware requirements

This document describes building a system using a Dell PowerEdge R6525 server. For details of other supported server types, contact Graphcore support or check the Approved Servers list. Other servers may have different installation requirements.

The recommended configuration of the Dell R6525 is as follows:

  • Dell R6525 containing dual AMD EPYC 7742 processors

  • 16x 32GB RDIMM PC4-25600 ECC registered dual-rank x4 1.2V

  • 2x 480GB SSD-SATA 6Gbps 2.5 inch hot-swap

  • 7x 1TB NVME SSD PCIe 4x 3.1

  • Dual port Gigabit BASE-T PCIe

  • 1x single port Mellanox ConnectX-5 EN 100Gb/s Ethernet (IPU-POD4 DA system)

  • 2x dual port Mellanox ConnectX-5 EN 100Gb/s Ethernet (IPU-POD16 DA system)

5.1.2. Storage configuration

We recommend configuring two types of server storage: SATA SSDs for the operating system and NVMe SSDs for data storage.

Operating system:

  • 2x 480GB SSD-SATA drives as a RAID 1 via hardware controller

  • Partitioned to use ext4 file system

Data storage:

  • 7x 1TB NVMe SSDs as a software RAID 6 array managed with mdadm

  • Partitioned to use xfs file system
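
As a rough guide to what these two arrays provide, the nominal usable capacity can be worked out from the drive counts above. The following is a minimal sketch: the function name usable_capacity_gb is hypothetical, and the figures ignore file system and RAID metadata overhead.

```python
def usable_capacity_gb(drive_count: int, drive_size_gb: float, raid_level: int) -> float:
    """Return the nominal usable capacity of an array in GB.

    Ignores file system and RAID metadata overhead.
    """
    if raid_level == 1:
        if drive_count < 2:
            raise ValueError("RAID 1 needs at least two drives")
        # Every drive holds a full copy, so capacity equals one drive.
        return drive_size_gb
    if raid_level == 6:
        if drive_count < 4:
            raise ValueError("RAID 6 needs at least four drives")
        # Two drives' worth of space is used for parity.
        return (drive_count - 2) * drive_size_gb
    raise ValueError(f"unsupported RAID level: {raid_level}")


if __name__ == "__main__":
    print("OS array (GB):  ", usable_capacity_gb(2, 480, 1))
    print("Data array (GB):", usable_capacity_gb(7, 1000, 6))
```

With the recommended drives this gives 480GB for the operating system array and 5TB for the data array. The RAID 6 array can tolerate the loss of any two drives.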

5.1.3. Memory configuration

The DIMMs should be installed in a fully symmetric configuration, as recommended by Dell for maximum performance. The recommended configuration has 8 DIMMs per processor, as shown below.

_images/DIMM_config.png

Fig. 5.1 Recommended DIMM configuration

5.1.4. BIOS configuration

Various BIOS settings can impact the performance of the system. The recommended settings are shown in the table below.

Table 5.1 BIOS settings

Setting                 BIOS 1.2.11    BIOS 1.4.8 or later
LogicalProc             Enabled        Enabled
ProcVirtualization      Enabled        Enabled
IommuSupport            Disabled       See below
L1StreamHwPrefetcher    Enabled        Enabled
L2StreamHwPrefetcher    Enabled        Enabled
MadtCoreEnumeration     Linear         Linear
NumaNodesPerSocket      4              4
CcxAsNumaDomain         Disabled       Disabled
CpuMinSevAsid           1              1
ProcCcds                All            All
CcdCores                All            All
EmbSata                 AhciMode       AhciMode

With BIOS 1.4.8 or later, IOMMU is not configured directly in the BIOS. Instead, add iommu=off to the kernel boot parameters in /etc/default/grub and then run update-grub.
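
The edit to /etc/default/grub can be scripted. The following is a minimal sketch, not the install script's own method. It assumes the parameter belongs in the GRUB_CMDLINE_LINUX variable (this document does not name the variable), and the function name add_kernel_param is hypothetical. The function only transforms text: writing the file back and running update-grub are still required.

```python
import re


def add_kernel_param(grub_text: str, param: str = "iommu=off",
                     variable: str = "GRUB_CMDLINE_LINUX") -> str:
    """Return grub_text with param added to variable, unless already present.

    Assumption: the variable is written on one line as VARIABLE="...".
    """
    pattern = re.compile(
        rf'^({re.escape(variable)}=")([^"]*)(")[ \t]*$', re.MULTILINE
    )
    match = pattern.search(grub_text)
    if match is None:
        # Variable not set at all: append a new assignment.
        sep = "" if grub_text == "" or grub_text.endswith("\n") else "\n"
        return f'{grub_text}{sep}{variable}="{param}"\n'

    params = match.group(2).split()
    if param in params:
        return grub_text  # Nothing to do; keeps the edit idempotent.
    params.append(param)
    new_line = match.group(1) + " ".join(params) + match.group(3)
    return grub_text[:match.start()] + new_line + grub_text[match.end():]


if __name__ == "__main__":
    with open("/etc/default/grub") as f:
        print(add_kernel_param(f.read()))
```

Printing the result rather than writing it in place allows the change to be reviewed before it is applied. After a reboot, /proc/cmdline should contain iommu=off.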

5.1.5. User application memory usage

To ensure that user application memory usage does not cause system services on the host to be killed or otherwise affected, we strongly recommend putting some form of out-of-memory protection in place. Its purpose is to kill user processes before system processes (such as the V-IPU server, DHCPD or SSHD) are affected, because losing those services would impact wider use or accessibility of the system.

Modern Linux kernels implement the oom-killer, which kills processes as a last resort when all memory is exhausted. However, scientific computing and MPI workloads can often knock a system offline before the oom-killer has an opportunity to react.

Although the mitigation will depend on the environment, some suggested options are:

  • Install and configure earlyoom: intended to overcome the shortcomings of the kernel oom-killer, this service monitors memory usage and kills processes when configurable memory and swap limits are reached.

  • Make use of cgroup memory limiting – this can be used to restrict certain users/groups to a proportion of the system’s RAM. The kernel oom-killer operates on each cgroup independently, meaning that the oom-killer will react to memory pressure in a user’s cgroup, while a system or default cgroup (containing essential system services) will be left unaffected.
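
For the cgroup option, a limit has to be chosen that leaves headroom for system services. The following is a minimal sketch of one way to derive such a limit from /proc/meminfo. The function names and the 16 GiB reserve are illustrative assumptions, not Graphcore recommendations.

```python
def mem_total_bytes(meminfo_text: str) -> int:
    """Extract MemTotal from the contents of /proc/meminfo, in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like "MemTotal:       527951236 kB".
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def user_memory_limit(total_bytes: int, reserved_bytes: int = 16 * 1024**3) -> int:
    """Return a memory cap for user processes.

    reserved_bytes is the headroom kept for system services
    (16 GiB here is an arbitrary example value).
    """
    if reserved_bytes >= total_bytes:
        raise ValueError("reserve must be smaller than total memory")
    return total_bytes - reserved_bytes


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        total = mem_total_bytes(f.read())
    print("Suggested user cgroup limit (bytes):", user_memory_limit(total))
```

The resulting value could then be applied to the cgroup that contains user processes, for example through memory.limit_in_bytes on cgroup v1 or memory.max on cgroup v2.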

5.2. Operating system and packages

Please contact your Graphcore representative or use the support portal for information about operating system support. This document describes the following operating systems:

  • Ubuntu 18.04

  • CentOS 7.6 / 8.2

5.2.1. Ubuntu OS installation and packages

For a stable system on which IPU-related software can run, the packages listed in Table 5.2 must be installed. The IPU-POD DA install script installs all of these packages for you: it checks which packages are present and downloads any that are missing. If you are running an offline install without access to the package repository, you must install these packages before running the IPU-POD DA install script.

Table 5.2 Required Ubuntu 18.04 packages

librdmacm1

lldpad

mlx5_ib kernel module

ntp (or other NTP server running on host)

pigz

python3-pip

python3-venv

rdma-core
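
Before an offline install, it can be useful to check which of these packages are already present. The following is a minimal sketch that queries dpkg. The function names are hypothetical. The NTP service and the mlx5_ib kernel module are left out of the list because they are not simple package checks (any NTP server is acceptable, and a loaded module shows up in /proc/modules rather than in dpkg).

```python
import subprocess
from typing import Callable, Iterable, List

# Package names from Table 5.2 (NTP server and kernel module excluded).
UBUNTU_PACKAGES = [
    "librdmacm1", "lldpad", "pigz", "python3-pip", "python3-venv", "rdma-core",
]


def dpkg_installed(package: str) -> bool:
    """Return True if dpkg reports the package as fully installed."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Status}", package],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,  # Works on the Python 3.6 shipped with Ubuntu 18.04.
    )
    return result.returncode == 0 and result.stdout.strip() == "install ok installed"


def missing_packages(packages: Iterable[str],
                     is_installed: Callable[[str], bool] = dpkg_installed) -> List[str]:
    """Return the packages for which is_installed() is False, in input order."""
    return [p for p in packages if not is_installed(p)]


if __name__ == "__main__":
    missing = missing_packages(UBUNTU_PACKAGES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The is_installed argument is injectable so that the same function can be reused with a different package manager or exercised without touching the host.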

5.2.2. CentOS OS installation and packages

For a stable system on which IPU-related software can run, the packages listed in Table 5.3 must be installed. The IPU-POD DA install script installs all of these packages for you: it checks which packages are present and downloads any that are missing. If you are running an offline install without access to the package repository, you must install these packages before running the IPU-POD DA install script.

Table 5.3 Required CentOS packages

policycoreutils-python (RHEL 7 versions)

python3-policycoreutils (RHEL 8 versions)

lldpad

rdma-core

librdmacm

python3 (CentOS 7 versions)

python36 (CentOS 8 versions)

chrony (or other NTP server running on host)
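
Because some package names in Table 5.3 differ between major releases, the list to install depends on the OS version. The following is a minimal sketch that selects the right names from the contents of /etc/os-release. The function names are hypothetical, and chrony is omitted because any NTP server running on the host is acceptable.

```python
from typing import List

# Package names from Table 5.3.
COMMON_PACKAGES = ["lldpad", "rdma-core", "librdmacm"]
PACKAGES_BY_MAJOR = {
    7: ["policycoreutils-python", "python3"],
    8: ["python3-policycoreutils", "python36"],
}


def os_major_version(os_release_text: str) -> int:
    """Parse the major version from the contents of /etc/os-release."""
    for line in os_release_text.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "VERSION_ID":
            # The value looks like "7" or "8", optionally with a minor part.
            return int(value.strip().strip('"').split(".")[0])
    raise ValueError("VERSION_ID not found")


def required_packages(major: int) -> List[str]:
    """Return the Table 5.3 package names for the given major version."""
    if major not in PACKAGES_BY_MAJOR:
        raise ValueError(f"unsupported major version: {major}")
    return COMMON_PACKAGES + PACKAGES_BY_MAJOR[major]


if __name__ == "__main__":
    with open("/etc/os-release") as f:
        major = os_major_version(f.read())
    print(" ".join(required_packages(major)))
```

Each name in the resulting list can then be checked on the host with rpm -q before running the install script offline.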

5.2.3. Python packages

All Python dependencies are built into the IPU-POD DA install script.