5. IPU-POD deployment best practices
5.1. Physical
5.1.1. Power, loading, PSUs and wiring
Two diverse power trains are recommended, with each power train being fed from a pure sine wave UPS which is supplied from the utility mains connection. Within the data hall, Graphcore recommends that you establish a colour-coding scheme such that each power cable is the colour of the power train from which it is sourced.
5.1.2. Cooling and air flow
We recommend that the provisioned cooling meets or exceeds the ASHRAE TC 9.9 specification.
Aisle containment is recommended.
N+1 redundancy is recommended for all critical equipment.
5.2. Central services
5.2.1. DNS service
The DNS service is an authoritative name service that resolves fully qualified domain names to IP addresses, and the reverse.
For best practice, and to simplify access to hosts and appliances, SSL certificates should be used and an authoritative name service deployed for the internal IPU-POD networks.
Fully qualified domain names for all devices should be resolvable via DNS with both forward and reverse records.
When deploying a name service, we recommend the following:
Highly available configuration
Authoritative for all private address ranges
Forwarding for public address ranges
BIND is regularly patched
DNS zone transfers are disabled
Only port 53 TCP/UDP exposed to the IPU-POD network
When assigning hostnames, we recommend they contain the following information:
Location
Purpose
Instance number
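As a quick way to confirm that forward and reverse records agree, a check along the lines of the sketch below can be run from any host. It uses only the Python standard library; the hostnames are placeholders that follow the location/purpose/instance convention above and should be replaced with your own FQDNs.

# Verify that forward (A) and reverse (PTR) DNS records agree for a set of hosts.
# The hostnames below are placeholders; substitute your own IPU-POD FQDNs.
import socket

hosts = ["ldn-pod64-ctrl-01.example.internal", "ldn-pod64-ipum-01.example.internal"]

for fqdn in hosts:
    try:
        ip = socket.gethostbyname(fqdn)            # forward lookup
        reverse, _, _ = socket.gethostbyaddr(ip)   # reverse lookup
        status = "OK" if reverse.rstrip(".") == fqdn else "MISMATCH"
        print(f"{fqdn} -> {ip} -> {reverse}: {status}")
    except socket.error as exc:
        print(f"{fqdn}: lookup failed ({exc})")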
5.2.2. DHCP service
The DHCP service dynamically assigns an IP address and other network configuration parameters.
In order to simplify IP address allocation and network configuration updates, all non-critical server infrastructure should be configured to obtain its network configuration by DHCP on the internal IPU-POD networks.
When deploying a DHCP service, we recommend the following:
Load balanced
Highly available
Configurable via API
IPU-POD networks access the DHCP server via the DHCP helper address
Only ports 67 and 68 (UDP) exposed to the IPU-POD network
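As an illustration of driving DHCP configuration programmatically rather than editing it by hand, the sketch below renders static host reservations from a small inventory. The inventory values and the ISC dhcpd-style reservation format are assumptions for illustration, not part of the IPU-POD tooling.

# Render ISC dhcpd-style host reservations from a small inventory.
# The inventory values are placeholders for illustration only.
inventory = [
    {"name": "pod64-host-01", "mac": "aa:bb:cc:00:00:01", "ip": "10.1.10.11"},
    {"name": "pod64-host-02", "mac": "aa:bb:cc:00:00:02", "ip": "10.1.10.12"},
]

template = (
    "host {name} {{\n"
    "  hardware ethernet {mac};\n"
    "  fixed-address {ip};\n"
    "  option host-name \"{name}\";\n"
    "}}\n"
)

with open("ipu-pod-reservations.conf", "w") as conf:
    for entry in inventory:
        conf.write(template.format(**entry))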
5.2.3. Storage
For best results, we recommend the best storage you can afford, with high throughput/IOPS and low read/write latency. For example:
A storage vendor appliance with multiple aggregated network ports and read/write caching. Consider flash/SAS hybrid solutions if an all-flash system is not possible.
Distributed/clustered storage achieving high throughput through a high number of nodes.
Considerations
Improve performance by keeping storage networks within the same broadcast domain as the compute to avoid routing storage traffic. A dedicated storage network can help with this.
Splitting batch work away from code and project work to allow different data protection policies for each.
For example, snapshots are expensive so should be limited to filesystems with changes to high-value work, such as code development areas. Higher delta areas (like batch/regression/simulation areas) could have reduced or no snapshots.
Using the automounter will allow easy splitting and migration of data into multiple file servers or shares under a shared path. This also allows for central management of your mounts, which is less fragile than managing lots of mounts on many machines.
Storage tiering: often not all storage workloads require the same performance. Consider slower "archive" storage for long-term archiving of results, while still being online. Areas requiring particularly high load can exist on smaller but much faster storage. For example, /home/usera might live on one class of file server, and /home/usera-scratch-projA might exist on much faster storage.
Pay particular attention to mount options to optimize data and attribute caching to reduce filesystem load. The defaults are often inappropriate for your use.
If using a clustered filesystem, you can attain a marked improvement in performance by using client-side caching.
Separate user home areas from work areas and dissuade users from working in their home areas.
Use GIDs to group storage areas by project/team, to allow collaboration and security.
Consider implementing Kerberized storage, if you have complex permissions requirements.
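Because mount options have a large effect on attribute caching and filesystem load, a small audit such as the sketch below (Linux only; it simply reads /proc/mounts) can help confirm that shared storage is mounted with the options you intended.

# Report the mount options in use for NFS shares so they can be compared
# against the options you intended (attribute caching, rsize/wsize, and so on).
def nfs_mounts(path="/proc/mounts"):
    with open(path) as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith("nfs"):
                yield mountpoint, device, options.split(",")

for mountpoint, device, options in nfs_mounts():
    print(f"{mountpoint} ({device}):")
    for option in options:
        print(f"  {option}")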
5.2.4. Disposal
All items of equipment containing storage media should be checked to ensure that any sensitive data and licensed software have been removed or securely overwritten prior to disposal.
5.2.5. Network Time Protocol (NTP)
NTP is used to synchronize the time between hosts.
NTP synchronization should be configured correctly and enabled on each host to ensure accurate time for system event logs.
The time sources used should be in sync with an agreed-upon time standard such as Coordinated Universal Time (UTC). There should be a minimum of three NTP sources of at least stratum 3, the offset should be less than 1 second, and the time since the last synchronization should be less than 60 seconds.
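The thresholds above can be verified from a host with a check along the lines of the sketch below. It assumes the third-party ntplib package is available; the server names are placeholders for your own time sources.

# Check at least three NTP sources against the stratum and offset thresholds.
# Requires the third-party ntplib package; server names are placeholders.
import ntplib

servers = ["ntp1.example.internal", "ntp2.example.internal", "ntp3.example.internal"]
client = ntplib.NTPClient()

for server in servers:
    try:
        response = client.request(server, version=3, timeout=5)
        ok = response.stratum <= 3 and abs(response.offset) < 1.0
        print(f"{server}: stratum={response.stratum} offset={response.offset:.3f}s "
              f"{'OK' if ok else 'OUT OF SPEC'}")
    except ntplib.NTPException as exc:
        print(f"{server}: no response ({exc})")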
5.2.6. Directory services
Directory services are used to provide a single source of truth for accounting, authentication and authorisation.
In order to provide a single source of truth regarding account information, access and authorisation for IPU-POD systems, a central directory service should be configured in place of local accounts on the IPU-POD servers.
When deploying a directory service, we recommend the following:
Load balanced
Highly available
LDAP protocol
Enforced TLS encryption
Authenticated binds only
Filtering of data by IPU-POD network
Only port 389 TCP exposed to the IPU-POD network
The directory server should be configured to provide different views of the following data dependent on the IPU-POD network:
Unique user IDs
Primary group
User ssh keys
Mount information for shared storage
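As an example of an authenticated, TLS-protected lookup against such a directory, the sketch below uses the third-party ldap3 package. The server name, bind DN, password handling and search base are placeholders, and the sshPublicKey attribute assumes an OpenSSH LDAP schema is loaded on the directory server.

# Authenticated LDAP search over StartTLS using the third-party ldap3 package.
# Server name, bind DN, password handling and search base are placeholders.
from ldap3 import Server, Connection

server = Server("ldap.example.internal", port=389)
conn = Connection(server,
                  user="cn=podreader,ou=service,dc=example,dc=internal",
                  password="change-me")

conn.start_tls()   # upgrade the connection to TLS before sending credentials
conn.bind()

conn.search("ou=people,dc=example,dc=internal",
            "(uid=usera)",
            attributes=["uidNumber", "gidNumber", "sshPublicKey"])
for entry in conn.entries:
    print(entry)

conn.unbind()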
5.3. Deployment
5.3.1. Network
A redundant Top of Rack or End of Row switch topology is recommended, providing 10GbE optical connectivity to each rack.
We recommend that networking should be configured as follows:
Network segmentation ensures that services and data can be protected in accordance with their classification and limits the scope of an attack.
We recommend having separate networks dedicated to the IPU-POD service networks and the IPU-POD management network. Traffic to and from these networks can be controlled according to the need to transmit/receive information. Gateways, like firewalls and routers, must enforce and monitor this separation.
In a multi-tenant environment, the IPU-POD service networks should be separated either physically or by the use of VLANs.
A network intrusion detection system should be deployed to provide alerts for any unusual activity and known network-based attacks. This will normally report to a SIEM solution or automated analysis and reporting.
Separate VLANs are used for the following:
Core infrastructure should have the following dedicated VLANs:
Management data network
Management BMC network
Management DMZ network for servers exposed to IPU-POD networks
Each IPU-POD should have the following dedicated VLANs:
IPU-POD host data network
IPU-POD host BMC network
IPU-POD management network
We recommend internal traffic to be routed as follows:
Management data network → public internet → Management DMZ network
Management BMC network → no access to the public internet → dedicated management jump box
Management DMZ network → public internet
IPU-POD host data network → public internet → Management DMZ network
IPU-POD host BMC network → no access to the public internet → IPU-POD management network → Management DMZ network
IPU-POD management network → public internet → Management DMZ network
For external traffic we recommend:
Incoming traffic only allowed from whitelisted IP addresses
Outgoing traffic restricted to whitelisted IP addresses and monitored
We also recommend:
All data and inter-switch links are trunked
5.3.2. Installation of a base operating system
When installing host or appliance updates, we recommend:
Installation of qualified software
Installation should be fully automated via API and require zero manual interaction from administrators
Installation should deploy a consistent versioned image
Configuration management should correct drift over time
We also recommend the following:
Local versioned repositories for appliance and operating system packages
Initial switch configuration via DHCP options
Server BIOS configuration via DHCP options
Base server installs via PXE
Disk layout:
RAID 1 operating system disks (/), SSDs
RAID 1 home directory (/home), SSDs
RAID 6 data directory (/localdata), NVMe
Configuration management should be used to correct drift, and ensure changes are logged in a central location
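A post-install check along the lines of the sketch below can confirm that each filesystem was actually created and mounted before a server is handed over. The mount points follow the disk layout listed above and should be adapted to your own layout.

# Confirm that the filesystems from the expected disk layout are mounted.
import os

expected = ["/", "/home", "/localdata"]

for mountpoint in expected:
    state = "mounted" if os.path.ismount(mountpoint) else "MISSING"
    print(f"{mountpoint}: {state}")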
5.3.3. Configuration of system management
When configuring any server or appliance we recommend:
Configuration should be fully automated
Configuration should not require an administrator to log in to the host
Configuration changes should be tracked in version control, with an approval process for production changes
Configuration should be continuously monitored, and any drift corrected
User accounts should only be created on the directory server
Access control should be in place to restrict login access to authorised users and networks
The software should be installed from internal versioned repositories
Shared accounts should not be used
Root access should be restricted to local console only
Root commands should be audited and logged
A copy of system logs should be stored in a central location
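As a minimal illustration of the last point, the sketch below forwards a copy of automation logs to a central syslog host using the Python standard library's SysLogHandler. The collector address and port are placeholders for your own central syslog or SIEM collector.

# Forward a copy of administration/automation logs to a central syslog host.
# The collector address is a placeholder.
import logging
import logging.handlers

logger = logging.getLogger("ipu-pod-admin")
logger.setLevel(logging.INFO)

central = logging.handlers.SysLogHandler(address=("loghost.example.internal", 514))
central.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(central)

logger.info("configuration run completed with no drift detected")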
5.3.4. User provisioning lifecycle
By default, each host has a single root admin account that is used for local administration and to connect to the host console. The use of this account should be limited, and named (non-root) user accounts with sudo privileges should be used instead.
Some measures to protect user accounts include:
Automatic account logout
Monitor and automatically block IPs with too many failed login attempts
Disable SSH for the root account or set it to key only
Implement 2FA
Role-based access control
Users should be created in the central directory service, not on the local server. Home directories should be located on central shared storage, provisioned automatically when the account is created, and with access restricted to a single user account.
User and IPU-POD allocation processes should be tested to ensure that:
The user's request for the IPU-POD is approved and the user provides an SSH key
Deployment reports the IPU-POD ready for use, without the administrator needing to run any commands manually
Users can log in and start to work
Project directories should be on shared storage and access restricted to named groups of users.
Each user account in the directory service should be provisioned with the following attributes:
Unique username
Unique UID
Primary GID
Additional group membership
SSH public key
Full username
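The attributes above map naturally onto a small record that provisioning automation can validate before writing an entry to the directory. The sketch below is illustrative only; the LDIF attribute names assume a standard posixAccount schema, with sshPublicKey from an OpenSSH LDAP schema, and the example values are placeholders.

# A minimal record holding the user attributes listed above, rendered as LDIF
# for a posixAccount entry. Attribute names assume standard schemas.
from dataclasses import dataclass

@dataclass
class PodUser:
    username: str        # unique username
    uid: int             # unique UID
    gid: int             # primary GID
    groups: list         # additional group membership
    ssh_public_key: str  # SSH public key
    full_name: str       # full username

    def to_ldif(self, base_dn="ou=people,dc=example,dc=internal"):
        return "\n".join([
            f"dn: uid={self.username},{base_dn}",
            "objectClass: posixAccount",
            f"uid: {self.username}",
            f"cn: {self.full_name}",
            f"uidNumber: {self.uid}",
            f"gidNumber: {self.gid}",
            f"homeDirectory: /home/{self.username}",
            f"sshPublicKey: {self.ssh_public_key}",
        ])

user = PodUser("usera", 20001, 20001, ["proj-a"], "ssh-ed25519 AAAA... usera", "User A")
print(user.to_ldif())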
5.4. Security
5.4.1. Host controls
Remote logging to a central log host can be implemented to provide a secure, centralised store for logs.
System auditing tools can be deployed that allow system administrators to detect unauthorised access or modification of data.
Auditing of user activity can be enabled.
File integrity and host intrusion detection tools can be deployed to detect unauthorised changes and breaches. These would typically report to a security information and event management tool for automated analysis and reporting.
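A very small file-integrity check in the spirit of the tools described above can be built from the standard library alone. The sketch below hashes a set of watched files and compares them against a previously stored baseline; the watched paths and baseline location are placeholders.

# Compare SHA-256 hashes of watched files against a stored baseline to flag
# unauthorised changes. Watched paths and the baseline file are placeholders.
import hashlib
import json
import os

WATCHED = ["/etc/passwd", "/etc/ssh/sshd_config", "/etc/sudoers"]
BASELINE = "/var/lib/integrity/baseline.json"

def digest(path):
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()

current = {path: digest(path) for path in WATCHED if os.path.exists(path)}

if os.path.exists(BASELINE):
    with open(BASELINE) as handle:
        baseline = json.load(handle)
    for path, value in current.items():
        if baseline.get(path) != value:
            print(f"CHANGED: {path}")
else:
    with open(BASELINE, "w") as handle:
        json.dump(current, handle, indent=2)
    print("baseline created")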
5.4.2. Network applications
When access to network applications through the firewall is required, rules can be set up to allow only the minimum access required for the application.
The classification of the data being allowed through the network must be considered and protection mechanisms put in place accordingly. These should include encryption and source/destination IP address restrictions.
5.5. Monitoring
The host server(s) and the IPU-M2000 devices can be monitored using industry-standard monitoring tools.
5.5.1. IPU-M2000
In-band monitoring
The V-IPU exporter is an agent that collects metrics from the IPU-M2000 (temperature, power consumption, fan speed, IPU error counters, and so on) and exports them in the OpenMetrics text format: https://openmetrics.io/. One V-IPU exporter instance runs in each IPU-M2000 alongside the V-IPU agent. These metrics can be collected by a Prometheus instance.
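A Prometheus server is the intended consumer, but the OpenMetrics text output can also be fetched directly for a quick check. In the sketch below the exporter URL (host and port) is an assumption and should be replaced with the address of your V-IPU exporter.

# Fetch and print the OpenMetrics text exposed by a V-IPU exporter instance.
# The URL is a placeholder; substitute your exporter's address and port.
import urllib.request

EXPORTER_URL = "http://ipum-01.example.internal:2112/metrics"  # placeholder

with urllib.request.urlopen(EXPORTER_URL, timeout=10) as response:
    for raw in response:
        line = raw.decode("utf-8").rstrip()
        if line and not line.startswith("#"):   # skip HELP/TYPE comment lines
            print(line)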
Out-of-band monitoring
OpenBMC firmware running within each IPU-M2000 supports out-of-band management of the IPU-M2000 machines. It provides an OpenBMC RESTful API and also supports the Redfish RESTful API. For further details on OpenBMC firmware, see the BMC User Guide.
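Because Redfish is a standard DMTF API, the BMC can be queried with plain HTTPS. The sketch below reads the Redfish service root; the BMC address and credentials are placeholders, and certificate verification is relaxed here purely for illustration and should be enabled against your own CA in production.

# Query the Redfish service root on an IPU-M2000 BMC over HTTPS.
# BMC address and credentials are placeholders.
import base64
import json
import ssl
import urllib.request

BMC = "https://ipum-01-bmc.example.internal"
credentials = base64.b64encode(b"admin:change-me").decode()

context = ssl.create_default_context()
context.check_hostname = False          # illustration only; verify certificates
context.verify_mode = ssl.CERT_NONE     # against your own CA in production

request = urllib.request.Request(BMC + "/redfish/v1/",
                                 headers={"Authorization": "Basic " + credentials})

with urllib.request.urlopen(request, context=context, timeout=10) as response:
    root = json.load(response)

print(root.get("RedfishVersion"))
print(root.get("Chassis", {}).get("@odata.id"))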
5.5.2. Host servers
In-band monitoring
There is currently no monitoring agent installed by default on the host server(s). The Prometheus node exporter agent can be installed to stay within the Prometheus ecosystem and combine the host server metrics with the metrics exposed by the V-IPU exporter.
Out-of-band monitoring
The default server is a Dell PowerEdge R6525 which runs iDRAC firmware. This supports the iDRAC admin tool, RESTful API, and Redfish API as per Dell specifications. For further details see the documentation at the Dell Support website.
5.6. Alerting
The following basic alerting is recommended:
5.6.1. IPU-M2000
In-band alerting
We recommend alerting on basic OS disk usage and time drift.
Out-of-band alerting
We recommend alerting on the failure of critical hardware such as system fans, power supply units, and temperature sensors crossing thresholds obtained from OpenBMC.
5.6.2. Host servers
In-band alerting
We recommend alerting on basic OS disk usage and time drift.
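A minimal in-band disk usage check can be built from the standard library, as sketched below; the filesystem paths and usage threshold are assumptions to adapt to your own layout and alerting pipeline. For time drift, the NTP offset check sketched in Section 5.2.5 can be reused.

# Flag filesystems that exceed a usage threshold (printing here; hook into
# your alerting pipeline in practice). Paths and threshold are placeholders.
import shutil

THRESHOLD = 0.85  # alert at 85% used
for path in ["/", "/home", "/localdata"]:
    usage = shutil.disk_usage(path)
    used_fraction = (usage.total - usage.free) / usage.total
    if used_fraction >= THRESHOLD:
        print(f"ALERT {path}: {used_fraction:.0%} used")
    else:
        print(f"ok    {path}: {used_fraction:.0%} used")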
Out-of-band alerting
We recommend alerting on the failure of critical hardware such as system fans, power supply units, and temperature sensors crossing thresholds obtained from iDRAC.