5. IPU-POD deployment best practices

5.1. Physical

5.1.1. Power, loading, PSUs and wiring

Two diverse power trains are recommended, with each power train fed from a pure sine wave UPS that is supplied from the utility mains connection. Within the data hall, Graphcore recommends that you establish a colour-coding scheme so that each power cable is coloured according to the power train from which it is sourced.

5.1.2. Cooling and air flow

We recommend that the provisioned cooling meets or exceeds the ASHRAE TC 9.9 specification.

Aisle containment is recommended.

N+1 redundancy is recommended for all critical equipment.

5.2. Central services

5.2.1. DNS service

The DNS service is an authoritative name service that resolves fully qualified domain names to IP addresses, and IP addresses back to names.

To simplify access to hosts and appliances, SSL certificates and an authoritative name service should be deployed for the internal IPU-POD networks.

Fully qualified domain names for all devices should be resolvable via DNS with both forward and reverse records.
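
As an illustration, the following minimal Python sketch checks that forward and reverse records agree for a set of hosts; the hostnames are placeholders for your own FQDNs:

    import socket

    # Hypothetical internal hostnames; replace with your own FQDNs.
    HOSTS = ["pod1-host01.example.internal", "pod1-ipum01.example.internal"]

    for fqdn in HOSTS:
        try:
            addr = socket.gethostbyname(fqdn)         # forward (A) lookup
            name, _, _ = socket.gethostbyaddr(addr)   # reverse (PTR) lookup
            status = "OK" if name.lower() == fqdn.lower() else "MISMATCH"
            print(f"{fqdn} -> {addr} -> {name}: {status}")
        except (socket.gaierror, socket.herror) as exc:
            print(f"{fqdn}: lookup failed ({exc})")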

When deploying a name service, we recommend the following:

  • Highly available configuration

  • Authoritative for all private address ranges

  • Forwarding for public address ranges

  • BIND is regularly patched

  • DNS zone transfers are disabled

  • Only port 53 TCP/UDP exposed to the IPU-POD network

When assigning hostnames, we recommend they contain the following information:

  • Location

  • Purpose

  • Instance number

5.2.2. DHCP service

The DHCP service dynamically assigns an IP address and other network configuration parameters.

In order to simplify IP address allocation and network configuration updates, all non-critical server infrastructure should be configured to allocate network configuration by DHCP on the internal IPU-POD networks.

When deploying a DHCP service, we recommend the following:

  • Load balanced

  • Highly available

  • Configurable via API

  • IPU-POD networks access the DHCP server via the DHCP helper address

  • Only ports 67 and 68 (UDP) exposed to the IPU-POD network

5.2.3. Storage

We recommend the best storage you can afford, with high throughput/IOPS and low read/write latency. For example:

  • A storage vendor appliance with multiple aggregated network ports and read/write caching. Consider flash/SAS hybrid solutions if an all-flash system is not possible.

  • Distributed/clustered storage achieving high throughput through a high number of nodes.

Considerations

  • Improve performance by keeping storage networks within the same broadcast domain as the compute to avoid routing storage traffic. A dedicated storage network can help with this.

  • Splitting batch work away from code and project work to allow different data protection policies for each.

    • For example, snapshots are expensive so should be limited to filesystems with changes to high-value work, such as code development areas. Higher delta areas (like batch/regression/simulation areas) could have reduced or no snapshots.

  • Using the automounter will allow easy splitting and migration of data into multiple file servers or shares under a shared path. This also allows for central management of your mounts, which is less fragile than managing lots of mounts on many machines.

  • Storage tiering: often not all storage workloads require the same performance. Consider slower “archive” storage for long term archiving of results, while still being online. Areas requiring particularly high load can exist on smaller but much faster storage. For example, /home/usera might live on one class of file server, and /home/usera-scratch-projA might exist on much faster storage.

  • Pay particular attention to mount options to optimize data and attribute caching and so reduce filesystem load; the defaults are often inappropriate for your use (see the sketch after this list).

  • If using a clustered filesystem, you can attain a marked improvement in performance by using client-side caching.

  • Separate user home areas from work areas and dissuade users from working in their home areas.

  • Use GIDs to group storage areas by project/team, to allow collaboration and security.

  • Consider implementing Kerberized storage, if you have complex permissions requirements.
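
As a minimal sketch of auditing the NFS mount options currently in effect on a Linux client (the /proc/mounts path is Linux-specific), the following Python snippet lists each NFS mount and its options:

    # List NFS mounts and the options in effect, read from /proc/mounts.
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype, options, _, _ = line.split()
            if fstype.startswith("nfs"):
                # Options such as rsize, wsize, actimeo and noatime are the
                # ones most often tuned for data and attribute caching.
                print(f"{mountpoint} ({device}, {fstype}): {options}")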

5.2.4. Disposal

All items of equipment containing storage media should be checked to ensure that any sensitive data and licensed software have been removed or securely overwritten prior to disposal.

5.2.5. Network Time Protocol (NTP)

NTP is used to synchronize the time between hosts.

NTP synchronization should be configured correctly and enabled on each host to ensure accurate time for system event logs.

The time sources used should be in sync with an agreed-upon time standard such as Coordinated Universal Time (UTC). There should be a minimum of three NTP sources of at least stratum 3, the offset should be less than 1 second, and the time since the last successful synchronization should be less than 60 seconds.
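
As a rough illustration, the following Python sketch checks stratum and offset against those limits; it assumes the third-party ntplib package, and the pool servers are placeholders for your agreed time sources:

    import ntplib  # third-party package: pip install ntplib

    # Placeholder sources; replace with your agreed internal NTP servers.
    SOURCES = ["0.pool.ntp.org", "1.pool.ntp.org", "2.pool.ntp.org"]
    client = ntplib.NTPClient()
    for server in SOURCES:
        try:
            response = client.request(server, version=3, timeout=5)
            ok = response.stratum <= 3 and abs(response.offset) < 1.0
            print(f"{server}: stratum {response.stratum}, "
                  f"offset {response.offset:.3f}s -> {'OK' if ok else 'CHECK'}")
        except ntplib.NTPException as exc:
            print(f"{server}: query failed ({exc})")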

5.2.6. Directory services

Directory services are used to provide a single source of truth for accounting, authentication and authorisation.

In order to provide a single source of truth regarding account information, access and authorisation for IPU-POD systems, a central directory service should be configured in place of local accounts on the IPU-POD servers.

When deploying a directory service, we recommend the following:

  • Load balanced

  • Highly available

  • LDAP protocol

  • Enforced TLS encryption

  • Authenticated binds only

  • Filtering of data by IPU-POD network

  • Only port 389 TCP exposed to the IPU-POD network

The directory server should be configured to provide different views of the following data depending on the IPU-POD network:

  • Unique user IDs

  • Primary group

  • User ssh keys

  • Mount information for shared storage
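
A minimal sketch of such a lookup is shown below; it assumes the third-party ldap3 package, and the server name, bind DN, base DN and user are placeholders (the sshPublicKey attribute also assumes a suitable schema extension such as openssh-lpk):

    import ssl
    from ldap3 import Connection, Server, Tls  # third-party: pip install ldap3

    # Placeholder server and DNs; STARTTLS on port 389 matches the
    # recommendation of enforced TLS with only 389/TCP exposed.
    tls = Tls(validate=ssl.CERT_REQUIRED)
    server = Server("ldap.example.internal", port=389, use_ssl=False, tls=tls)
    conn = Connection(server, user="cn=reader,dc=example,dc=internal",
                      password="secret", auto_bind=False)
    conn.start_tls()
    conn.bind()
    # Look up the POSIX account details for one user.
    conn.search(search_base="ou=people,dc=example,dc=internal",
                search_filter="(uid=usera)",
                attributes=["uidNumber", "gidNumber", "sshPublicKey"])
    for entry in conn.entries:
        print(entry)
    conn.unbind()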

5.3. Deployment

5.3.1. Network

Redundant Top of Rack or End of Row switch topology is recommended, providing 10GbE optical connectivity to each rack.

We recommend that networking should be configured as follows:

  • Network segmentation ensures that services and data can be protected in accordance with their classification and limits the scope of an attack.

  • We recommend having separate networks dedicated to the IPU-POD service networks and the IPU-POD management network. Traffic to and from these networks can be controlled according to the need to transmit/receive information. Gateways, like firewalls and routers, must enforce and monitor this separation.

  • In a multi-tenant environment, the IPU-POD service networks should be separated either physically or by the use of VLANs.

  • A network intrusion detection system should be deployed to provide alerts for any unusual activity and known network-based attacks. This will normally report to a security information and event management (SIEM) solution for automated analysis and reporting.

Separate VLANs are used for the following:

  • Core infrastructure should have the following dedicated VLANs:

    • Management data network

    • Management BMC network

    • Management DMZ network for servers exposed to IPU-POD networks

  • Each IPU-POD should have the following dedicated VLANs:

    • IPU-POD host data network

    • IPU-POD host BMC network

    • IPU-POD management network

We recommend internal traffic to be routed as follows:

  • Management data network: access to the public internet via the Management DMZ network

  • Management BMC network: no access to the public internet; reachable only via a dedicated management jump box

  • Management DMZ network: direct access to the public internet

  • IPU-POD data network: access to the public internet via the Management DMZ network

  • IPU-POD BMC network: no access to the public internet; reachable via the IPU-POD management network and the Management DMZ network

  • IPU-POD management network: access to the public internet via the Management DMZ network

For external traffic we recommend:

  • Incoming traffic only allowed from whitelisted IP addresses

  • Outgoing traffic restricted to whitelisted IP addresses and monitored

We also recommend:

  • All data and inter-switch links are trunked

5.3.2. Installation of a base operating system

When installing host or appliance updates, we recommend:

  • Installation of qualified software

  • Installation should be fully automated via API and require zero manual interaction from administrators

  • Installation should deploy a consistent versioned image

  • Configuration management should correct drift over time

We also recommend the following:

  • Local versioned repositories for appliance and operating system packages

  • Initial switch configuration via DHCP options

  • Server BIOS configuration via DHCP options

  • Base server installs via PXE

  • Disk layout:

    • RAID 1 operating system disks ( / ), SSDs

    • RAID 1 home directory ( /home ), SSDs

    • RAID 6 data directory ( /localdata ), NVMe

  • Configuration management should be used to correct drift, and ensure changes are logged in a central location

5.3.3. Configuration of system management

When configuring any server or appliance we recommend:

  • Configuration should be fully automated

  • Configuration should not require an administrator to log in to the host

  • Configuration changes are tracked in version control, with an approval process for production changes

  • Configuration should be continuously monitored, and any drift corrected

  • User accounts should only be created on the directory server

  • Access control should be in place to restrict login access to authorised users and networks

  • The software should be installed from internal versioned repositories

  • Shared accounts should not be used

  • Root access should be restricted to local console only

  • Root commands should be audited and logged

  • A copy of system logs should be stored in a central location

5.3.4. User provisioning lifecycle

By default, each host has a single root admin account that is used for local administration and to connect to the host console. The use of this account should be limited, and named (non-root) user accounts with sudo privileges should be used instead.

Some measures to protect user accounts include:

  • Automatic account logout

  • Monitor and automatically block IPs with too many failed login attempts (see the sketch after this list)

  • Disable SSH for the root account or set it to key only

  • Implement 2FA

  • Role-based access control
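
The sketch below illustrates the failed-login monitoring point above; it assumes an sshd log at /var/log/auth.log (the path and format vary by distribution), and in practice a tool such as fail2ban provides this behaviour out of the box:

    import re
    from collections import Counter

    LOG = "/var/log/auth.log"  # e.g. /var/log/secure on RHEL-based systems
    THRESHOLD = 5              # failed attempts before an address is flagged
    pattern = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")
    failures = Counter()
    with open(LOG) as log:
        for line in log:
            match = pattern.search(line)
            if match:
                failures[match.group(1)] += 1
    for address, count in failures.most_common():
        if count >= THRESHOLD:
            print(f"{address}: {count} failed logins - candidate for blocking")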

Users should be created in the central directory service, not on the local server. Home directories should be located on central shared storage, provisioned automatically when the account is created, and with access restricted to the owning user account.

User and IPU-POD allocation processes should be tested to ensure that:

  • The user's request for the IPU-POD is approved and the user provides an SSH key

  • Deployment reports that the IPU-POD is ready for use, without the administrator having to run any commands manually

  • Users can log in and start to work

Project directories should be on shared storage and access restricted to named groups of users.

Each user account created in the directory service should include the following attributes (a sketch of creating such an entry follows the list):

  • Unique username

  • Unique UID

  • Primary GID

  • Additional group membership

  • SSH public key

  • Full username
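
A minimal sketch of creating such an entry with the third-party ldap3 package is shown below; the DNs, numeric IDs and the sshPublicKey attribute (which requires a schema extension such as openssh-lpk) are illustrative assumptions:

    from ldap3 import Connection, Server  # third-party: pip install ldap3

    # Placeholder connection details; in practice reuse the authenticated,
    # TLS-protected bind shown in the directory services section.
    server = Server("ldap.example.internal")
    conn = Connection(server, user="cn=admin,dc=example,dc=internal",
                      password="secret", auto_bind=True)
    # Attributes mirror the list above; ldapPublicKey/sshPublicKey assume an
    # OpenSSH public key schema extension on the directory server.
    conn.add("uid=usera,ou=people,dc=example,dc=internal",
             object_class=["inetOrgPerson", "posixAccount", "ldapPublicKey"],
             attributes={
                 "uid": "usera",                  # unique username
                 "uidNumber": 10001,              # unique UID
                 "gidNumber": 5000,               # primary GID
                 "cn": "User A", "sn": "A",       # full username
                 "homeDirectory": "/home/usera",
                 "sshPublicKey": "ssh-ed25519 AAAA... usera@example",
             })
    print(conn.result)
    conn.unbind()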

5.4. Security

5.4.1. Host controls

Remote logging to a central log host can be implemented to provide a secure, centralised store for logs.

System auditing tools can be deployed that allow system administrators to detect unauthorised access or modification of data.

Auditing of user activity can be enabled.

File integrity and host intrusion detection tools can be deployed to detect unauthorised changes and breaches. These would typically report to a security information and event management tool for automated analysis and reporting.

5.4.2. Network applications

When access to network applications through the firewall is required, rules can be set up to allow only the minimum access required for the application.

The classification of the data being allowed through the network must be considered and protection mechanisms put in place accordingly. These should include encryption and source/destination IP address restrictions.

5.5. Monitoring

The host server(s) and the IPU-M2000 devices can be monitored using industry-standard monitoring tools.

5.5.1. IPU-M2000

In-band monitoring

The V-IPU exporter is an agent that collects metrics from the IPU-M2000 (temperature, power consumption, fan speed, IPU error counters, and so on) and exports them in OpenMetrics text format: https://openmetrics.io/. One V-IPU exporter instance runs in each of the IPU-M2000s alongside the V-IPU agent. These metrics can be collected by a Prometheus instance.
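
As an illustration of consuming these metrics outside Prometheus, the following Python sketch fetches and parses the exporter output; the hostname, port and path are placeholders (consult the V-IPU documentation for the actual endpoint), and it assumes the third-party requests and prometheus-client packages:

    import requests  # third-party: pip install requests prometheus-client
    from prometheus_client.parser import text_string_to_metric_families

    # Placeholder endpoint; substitute the address and port of the V-IPU
    # exporter on your IPU-M2000.
    EXPORTER_URL = "http://ipum01.example.internal:2112/metrics"
    response = requests.get(EXPORTER_URL, timeout=5)
    response.raise_for_status()
    for family in text_string_to_metric_families(response.text):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)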

Out-of-band monitoring

OpenBMC firmware running within each IPU-M2000 supports out-of-band management of the IPU-M2000 machines. It provides an OpenBMC RESTful API and also supports the Redfish RESTful API. For further details on the OpenBMC firmware, see the BMC User Guide.
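
A minimal Redfish query is sketched below; the BMC address and credentials are placeholders, /redfish/v1 is the standard Redfish service root, and the resources beneath it may differ between firmware versions:

    import requests  # third-party: pip install requests

    # Placeholder BMC address and credentials; verify=False is used here only
    # because many BMCs ship with self-signed certificates.
    BMC = "https://ipum01-bmc.example.internal"
    AUTH = ("admin", "password")
    root = requests.get(f"{BMC}/redfish/v1", auth=AUTH, verify=False,
                        timeout=10).json()
    chassis = requests.get(f"{BMC}{root['Chassis']['@odata.id']}",
                           auth=AUTH, verify=False, timeout=10).json()
    for member in chassis.get("Members", []):
        print("Chassis resource:", member["@odata.id"])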

5.5.2. Host servers

In-band monitoring

There is currently no monitoring agent installed by default on the host server(s). The Prometheus node exporter agent can be installed to integrate with the Prometheus ecosystem and combine the host server metrics with the metrics exposed by the V-IPU exporter.

Out-of-band monitoring

The default server is a Dell PowerEdge R6525 which runs iDRAC firmware. This supports the iDRAC admin tool, RESTful API, and Redfish API as per Dell specifications. For further details see the documentation at the Dell Support website.

5.6. Alerting

The following basic alerting is recommended:

5.6.1. IPU-M2000

In-band alerting

We recommend alerting on basic OS disk usage and time drift.

Out-of-band alerting

We recommend alerting on the failure of critical hardware, such as system fans and power supply units, and on temperature sensors crossing thresholds, using data obtained from OpenBMC.

5.6.2. Host servers

In-band alerting

We recommend alerting on basic OS disk usage and time drift.
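
As a rough sketch of such a check on a host server (the thresholds are illustrative, and the time-drift part assumes chrony is the NTP client), the following Python snippet could be run from cron or an existing alerting pipeline:

    import shutil
    import subprocess

    DISK_THRESHOLD = 0.90    # alert above 90% used
    DRIFT_THRESHOLD = 1.0    # alert above 1 second of clock offset
    total, used, _ = shutil.disk_usage("/")
    if used / total > DISK_THRESHOLD:
        print(f"ALERT: / is {used / total:.0%} full")
    # Assumes chrony; 'chronyc tracking' reports the current offset.
    tracking = subprocess.run(["chronyc", "tracking"],
                              capture_output=True, text=True)
    for line in tracking.stdout.splitlines():
        if line.startswith("Last offset"):
            offset = abs(float(line.split(":")[1].split()[0]))
            if offset > DRIFT_THRESHOLD:
                print(f"ALERT: clock offset {offset:.3f}s exceeds threshold")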

Out-of-band alerting

We recommend alerting on the failure of critical hardware, such as system fans and power supply units, and on temperature sensors crossing thresholds, using data obtained from iDRAC.