5. IPU-POD deployment best practices
5.1. Physical
5.1.1. Power, loading, PSUs and wiring
Two diverse power trains are recommended, with each power train being fed from a pure sine wave UPS which is supplied from the utility mains connection. Within the data hall, Graphcore recommends that you establish a colour-coding scheme such that each power cable is the colour of the power train from which it is sourced.
5.1.2. Cooling and air flow
We recommend that the provisioned cooling meets or exceeds the ASHRAE TC 9.9 specification.
Aisle containment is recommended.
N+1 redundancy is recommended for all critical equipment.
5.2. Central services
5.2.1. DNS service
The DNS service is an authoritative name service that resolves fully qualified domain names to IP addresses, and the reverse.
For best practice, and to simplify access to hosts and appliances, SSL certificates should be used and an authoritative name service deployed for the internal IPU-POD networks.
Fully qualified domain names for all devices should be resolvable via DNS with both forward and reverse records.
When deploying a name service, we recommend the following:
Highly available configuration
Authoritative for all private address ranges
Forwarding for public address ranges
BIND is regularly patched
DNS zone transfers are disabled
Only port 53 TCP/UDP exposed to the IPU-POD network
When assigning hostnames, we recommend they contain the following information:
Location
Purpose
Instance number
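As a quick way to confirm that forward and reverse records agree, a check along the lines of the sketch below can be run from any host. It uses only the Python standard library; the hostnames are placeholders that follow the location/purpose/instance convention above and should be replaced with your own FQDNs.

# Verify that forward (A) and reverse (PTR) DNS records agree for a set of hosts.
# The hostnames below are placeholders; substitute your own IPU-POD FQDNs.
import socket

hosts = ["ldn-pod64-ctrl-01.example.internal", "ldn-pod64-ipum-01.example.internal"]

for fqdn in hosts:
    try:
        ip = socket.gethostbyname(fqdn)            # forward lookup
        reverse, _, _ = socket.gethostbyaddr(ip)   # reverse lookup
        status = "OK" if reverse.rstrip(".") == fqdn else "MISMATCH"
        print(f"{fqdn} -> {ip} -> {reverse}: {status}")
    except socket.error as exc:
        print(f"{fqdn}: lookup failed ({exc})")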
5.2.2. DHCP service
The DHCP service dynamically assigns an IP address and other network configuration parameters.
In order to simplify IP address allocation and network configuration updates, all non-critical server infrastructure should be configured to obtain its network configuration by DHCP on the internal IPU-POD networks.
When deploying a DHCP service, we recommend the following:
Load balanced
Highly available
Configurable via API
IPU-POD networks access the DHCP server via the DHCP helper address
Only ports 67 and 68 (UDP) exposed to the IPU-POD network
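As an illustration of driving DHCP configuration programmatically rather than editing it by hand, the sketch below renders static host reservations from a small inventory. The inventory values and the ISC dhcpd-style reservation format are assumptions for illustration, not part of the IPU-POD tooling.

# Render ISC dhcpd-style host reservations from a small inventory.
# The inventory values are placeholders for illustration only.
inventory = [
    {"name": "pod64-host-01", "mac": "aa:bb:cc:00:00:01", "ip": "10.1.10.11"},
    {"name": "pod64-host-02", "mac": "aa:bb:cc:00:00:02", "ip": "10.1.10.12"},
]

template = (
    "host {name} {{\n"
    "  hardware ethernet {mac};\n"
    "  fixed-address {ip};\n"
    "  option host-name \"{name}\";\n"
    "}}\n"
)

with open("ipu-pod-reservations.conf", "w") as conf:
    for entry in inventory:
        conf.write(template.format(**entry))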
5.2.3. Storage
For best results, we recommend the best storage you can afford, with high throughput/IOPS and low read/write latency. For example:
A storage vendor appliance with multiple aggregated network ports and read/write caching. Consider flash/SAS hybrid solutions if an all-flash system is not possible.
Distributed/clustered storage achieving high throughput through a high number of nodes.
Considerations
Improve performance by keeping storage networks within the same broadcast domain as the compute to avoid routing storage traffic. A dedicated storage network can help with this.
Splitting batch work away from code and project work to allow different data protection policies for each.
For example, snapshots are expensive so should be limited to filesystems with changes to high-value work, such as code development areas. Higher delta areas (like batch/regression/simulation areas) could have reduced or no snapshots.
Using the automounter will allow easy splitting and migration of data into multiple file servers or shares under a shared path. This also allows for central management of your mounts, which is less fragile than managing lots of mounts on many machines.
Storage tiering: often not all storage workloads require the same performance. Consider slower "archive" storage for long-term archiving of results, while still being online. Areas requiring particularly high load can exist on smaller but much faster storage. For example, /home/usera might live on one class of file server, and /home/usera-scratch-projA might exist on much faster storage.
Pay particular attention to mount options to optimize data and attribute caching to reduce filesystem load. The defaults are often inappropriate for your use.
If using a clustered filesystem, you can attain a marked improvement in performance by using client-side caching.
Separate user home areas from work areas and dissuade users from working in their home areas.
Use GIDs to group storage areas by project/team, to allow collaboration and security.
Consider implementing Kerberized storage, if you have complex permissions requirements.
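Because mount options have a large effect on attribute caching and filesystem load, a small audit such as the sketch below (Linux only; it simply reads /proc/mounts) can help confirm that shared storage is mounted with the options you intended.

# Report the mount options in use for NFS shares so they can be compared
# against the options you intended (attribute caching, rsize/wsize, and so on).
def nfs_mounts(path="/proc/mounts"):
    with open(path) as mounts:
        for line in mounts:
            device, mountpoint, fstype, options = line.split()[:4]
            if fstype.startswith("nfs"):
                yield mountpoint, device, options.split(",")

for mountpoint, device, options in nfs_mounts():
    print(f"{mountpoint} ({device}):")
    for option in options:
        print(f"  {option}")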
5.2.4. Disposal
All items of equipment containing storage media should be checked to ensure that any sensitive data and licensed software have been removed or securely overwritten prior to disposal.
5.2.5. Network Time Protocol (NTP)
NTP is used to synchronize the time between hosts.
NTP synchronization should be configured correctly and enabled on each host to ensure accurate time for system event logs.
The time sources used should be in sync with an agreed-upon time standard such as Coordinated Universal Time (UTC). There should be a minimum of three NTP sources of at least stratum 3, the offset should be less than 1 second, and the time since the last synchronization should be less than 60 seconds.
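The thresholds above can be verified from a host with a check along the lines of the sketch below. It assumes the third-party ntplib package is available; the server names are placeholders for your own time sources.

# Check at least three NTP sources against the stratum and offset thresholds.
# Requires the third-party ntplib package; server names are placeholders.
import ntplib

servers = ["ntp1.example.internal", "ntp2.example.internal", "ntp3.example.internal"]
client = ntplib.NTPClient()

for server in servers:
    try:
        response = client.request(server, version=3, timeout=5)
        ok = response.stratum <= 3 and abs(response.offset) < 1.0
        print(f"{server}: stratum={response.stratum} offset={response.offset:.3f}s "
              f"{'OK' if ok else 'OUT OF SPEC'}")
    except ntplib.NTPException as exc:
        print(f"{server}: no response ({exc})")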
5.2.6. Directory services
Directory services are used to provide a single source of truth for accounting, authentication and authorisation.
In order to provide a single source of truth regarding account information, access and authorisation for IPU-POD systems, a central directory service should be configured in place of local accounts on the IPU-POD servers.
When deploying a directory service, we recommend the following:
Load balanced
Highly available
LDAP protocol
Enforced TLS encryption
Authenticated binds only
Filtering of data by IPU-POD network
Only port 389 TCP exposed to the IPU-POD network
The directory server should be configured to provide different views of the following data dependent on the IPU-POD network:
Unique user IDs
Primary group
User ssh keys
Mount information for shared storage
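As an example of an authenticated, TLS-protected lookup against such a directory, the sketch below uses the third-party ldap3 package. The server name, bind DN, password handling and search base are placeholders, and the sshPublicKey attribute assumes an OpenSSH LDAP schema is loaded on the directory server.

# Authenticated LDAP search over StartTLS using the third-party ldap3 package.
# Server name, bind DN, password handling and search base are placeholders.
from ldap3 import Server, Connection

server = Server("ldap.example.internal", port=389)
conn = Connection(server,
                  user="cn=podreader,ou=service,dc=example,dc=internal",
                  password="change-me")

conn.start_tls()   # upgrade the connection to TLS before sending credentials
conn.bind()

conn.search("ou=people,dc=example,dc=internal",
            "(uid=usera)",
            attributes=["uidNumber", "gidNumber", "sshPublicKey"])
for entry in conn.entries:
    print(entry)

conn.unbind()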
5.3. Deployment
5.3.1. Network
A redundant Top of Rack or End of Row switch topology is recommended, providing 10GbE optical connectivity to each rack.
We recommend that networking should be configured as follows:
Network segmentation ensures that services and data can be protected in accordance with their classification and limits the scope of an attack.
We recommend having separate networks dedicated to the IPU-POD service networks and the IPU-POD management network. Traffic to and from these networks can be controlled according to the need to transmit/receive information. Gateways, like firewalls and routers, must enforce and monitor this separation.
In a multi-tenant environment, the IPU-POD service networks should be separated either physically or by the use of VLANs.
A network intrusion detection system should be deployed to provide alerts for any unusual activity and known network-based attacks. This will normally report to a SIEM solution or automated analysis and reporting.
Separate VLANs are used for the following:
Core infrastructure should have the following dedicated VLANs:
Management data network
Management BMC network
Management DMZ network for servers exposed to IPU-POD networks
Each IPU-POD should have the following dedicated VLANs:
IPU-POD host data network
IPU-POD host BMC network
IPU-POD management network
We recommend internal traffic to be routed as follows:
Management data network → public internet → Management DMZ network
Management BMC network → no access to the public internet → dedicated management jump box
Management DMZ network → public internet
IPU-POD host data network → public internet → Management DMZ network
IPU-POD host BMC network → no access to the public internet → IPU-POD management network → Management DMZ network
IPU-POD management network → public internet → Management DMZ network
For external traffic we recommend:
Incoming traffic only allowed from whitelisted IP addresses
Outgoing traffic restricted to whitelisted IP addresses and monitored
We also recommend:
All data and inter-switch links are trunked
5.3.2. Installation of a base operating system
When installing host or appliance updates, we recommend:
Installation of qualified software
Installation should be fully automated via API and require zero manual interaction from administrators
Installation should deploy a consistent versioned image
Configuration management should correct drift over time
We also recommend the following:
Local versioned repositories for appliance and operating system packages
Initial switch configuration via DHCP options
Server BIOS configuration via DHCP options
Base server installs via PXE
Disk layout:
RAID 1 operating system disks (/), SSDs
RAID 1 home directory (/home), SSDs
RAID 6 data directory (/localdata), NVMe
Configuration management should be used to correct drift, and ensure changes are logged in a central location
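A post-install check along the lines of the sketch below can confirm that each filesystem was actually created and mounted before a server is handed over. The mount points follow the disk layout listed above and should be adapted to your own layout.

# Confirm that the filesystems from the expected disk layout are mounted.
import os

expected = ["/", "/home", "/localdata"]

for mountpoint in expected:
    state = "mounted" if os.path.ismount(mountpoint) else "MISSING"
    print(f"{mountpoint}: {state}")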
5.3.3. Configuration of system management
When configuring any server or appliance we recommend:
Configuration should be fully automated
Configuration should not require an administrator to log in to the host
Configuration changes should be tracked in version control, with an approval process for production changes
Configuration should be continuously monitored, and any drift corrected
User accounts should only be created on the directory server
Access control should be in place to restrict login access to authorised users and networks
The software should be installed from internal versioned repositories
Shared accounts should not be used
Root access should be restricted to local console only
Root commands should be audited and logged
A copy of system logs should be stored in a central location
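As a minimal illustration of the last point, the sketch below forwards a copy of automation logs to a central syslog host using the Python standard library's SysLogHandler. The collector address and port are placeholders for your own central syslog or SIEM collector.

# Forward a copy of administration/automation logs to a central syslog host.
# The collector address is a placeholder.
import logging
import logging.handlers

logger = logging.getLogger("ipu-pod-admin")
logger.setLevel(logging.INFO)

central = logging.handlers.SysLogHandler(address=("loghost.example.internal", 514))
central.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(central)

logger.info("configuration run completed with no drift detected")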
5.3.4. User provisioning lifecycle
By default, each host has a single root admin account that is used for local administration and to connect to the host console. The use of this account should be limited, and named (non-root) user accounts with sudo privileges should be used instead.
Some measures to protect user accounts include:
Automatic account logout
Monitor and automatically block IPs with too many failed login attempts
Disable SSH for the root account or set it to key only
Implement 2FA
Role-based access control
Users should be created in the central directory service, not on the local server. Home directories should be located on central shared storage, provisioned automatically when the account is created, and with access restricted to a single user account.
User and IPU-POD allocation processes should be tested to ensure that:
The user's request for the IPU-POD is approved and the user provides an SSH key
Deployment reports the IPU-POD ready for use, without the administrator needing to run any commands manually
Users can log in and start to work
Project directories should be on shared storage and access restricted to named groups of users.
Each user account in the directory service should be provisioned with the following attributes:
Unique username
Unique UID
Primary GID
Additional group membership
SSH public key
Full username
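The attributes above map naturally onto a small record that provisioning automation can validate before writing an entry to the directory. The sketch below is illustrative only; the LDIF attribute names assume a standard posixAccount schema, with sshPublicKey from an OpenSSH LDAP schema, and the example values are placeholders.

# A minimal record holding the user attributes listed above, rendered as LDIF
# for a posixAccount entry. Attribute names assume standard schemas.
from dataclasses import dataclass

@dataclass
class PodUser:
    username: str        # unique username
    uid: int             # unique UID
    gid: int             # primary GID
    groups: list         # additional group membership
    ssh_public_key: str  # SSH public key
    full_name: str       # full username

    def to_ldif(self, base_dn="ou=people,dc=example,dc=internal"):
        return "\n".join([
            f"dn: uid={self.username},{base_dn}",
            "objectClass: posixAccount",
            f"uid: {self.username}",
            f"cn: {self.full_name}",
            f"uidNumber: {self.uid}",
            f"gidNumber: {self.gid}",
            f"homeDirectory: /home/{self.username}",
            f"sshPublicKey: {self.ssh_public_key}",
        ])

user = PodUser("usera", 20001, 20001, ["proj-a"], "ssh-ed25519 AAAA... usera", "User A")
print(user.to_ldif())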
5.4. Security
5.4.1. Host controls
Remote logging to a central log host can be implemented to provide a secure, centralised store for logs.
System auditing tools can be deployed that allow system administrators to detect unauthorised access or modification of data.
Auditing of user activity can be enabled.
File integrity and host intrusion detection tools can be deployed to detect unauthorised changes and breaches. These would typically report to a security information and event management tool for automated analysis and reporting.
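A very small file-integrity check in the spirit of the tools described above can be built from the standard library alone. The sketch below hashes a set of watched files and compares them against a previously stored baseline; the watched paths and baseline location are placeholders.

# Compare SHA-256 hashes of watched files against a stored baseline to flag
# unauthorised changes. Watched paths and the baseline file are placeholders.
import hashlib
import json
import os

WATCHED = ["/etc/passwd", "/etc/ssh/sshd_config", "/etc/sudoers"]
BASELINE = "/var/lib/integrity/baseline.json"

def digest(path):
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()

current = {path: digest(path) for path in WATCHED if os.path.exists(path)}

if os.path.exists(BASELINE):
    with open(BASELINE) as handle:
        baseline = json.load(handle)
    for path, value in current.items():
        if baseline.get(path) != value:
            print(f"CHANGED: {path}")
else:
    with open(BASELINE, "w") as handle:
        json.dump(current, handle, indent=2)
    print("baseline created")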
5.4.2. Network applications
When access to network applications through the firewall is required, rules can be set up to allow only the minimum access required for the application.
The classification of the data being allowed through the network must be considered and protection mechanisms put in place accordingly. These should include encryption and source/destination IP address restrictions.
5.5. Monitoring
The host server(s) and the IPU-M2000 devices can be monitored using industry-standard monitoring tools.
5.5.1. IPU-M2000
In-band monitoring
The V-IPU exporter is an agent that collects metrics from the IPU-M2000 (temperature, power consumption, fan speed, IPU error counters, and so on) and exports them in the OpenMetrics text format: https://openmetrics.io/. One V-IPU exporter instance runs in each IPU-M2000 alongside the V-IPU agent. These metrics can be collected by a Prometheus instance.
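A Prometheus server is the intended consumer, but the OpenMetrics text output can also be fetched directly for a quick check. In the sketch below the exporter URL (host and port) is an assumption and should be replaced with the address of your V-IPU exporter.

# Fetch and print the OpenMetrics text exposed by a V-IPU exporter instance.
# The URL is a placeholder; substitute your exporter's address and port.
import urllib.request

EXPORTER_URL = "http://ipum-01.example.internal:2112/metrics"  # placeholder

with urllib.request.urlopen(EXPORTER_URL, timeout=10) as response:
    for raw in response:
        line = raw.decode("utf-8").rstrip()
        if line and not line.startswith("#"):   # skip HELP/TYPE comment lines
            print(line)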
Out-of-band monitoring
OpenBMC firmware running within each IPU-M2000 supports out-of-band management of the IPU-M2000 machines. It provides an OpenBMC RESTful API and also supports the Redfish RESTful API. For further details on OpenBMC firmware, see the BMC User Guide.
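Because Redfish is a standard DMTF API, the BMC can be queried with plain HTTPS. The sketch below reads the Redfish service root; the BMC address and credentials are placeholders, and certificate verification is relaxed here purely for illustration and should be enabled against your own CA in production.

# Query the Redfish service root on an IPU-M2000 BMC over HTTPS.
# BMC address and credentials are placeholders.
import base64
import json
import ssl
import urllib.request

BMC = "https://ipum-01-bmc.example.internal"
credentials = base64.b64encode(b"admin:change-me").decode()

context = ssl.create_default_context()
context.check_hostname = False          # illustration only; verify certificates
context.verify_mode = ssl.CERT_NONE     # against your own CA in production

request = urllib.request.Request(BMC + "/redfish/v1/",
                                 headers={"Authorization": "Basic " + credentials})

with urllib.request.urlopen(request, context=context, timeout=10) as response:
    root = json.load(response)

print(root.get("RedfishVersion"))
print(root.get("Chassis", {}).get("@odata.id"))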
5.5.2. Host servers
In-band monitoring
There is currently no monitoring agent installed by default on the host server(s). The Prometheus node exporter agent can be installed to stay within the Prometheus ecosystem and combine the host server metrics with the metrics exposed by the V-IPU exporter.
Out-of-band monitoring
The default server is a Dell PowerEdge R6525 which runs iDRAC firmware. This supports the iDRAC admin tool, RESTful API, and Redfish API as per Dell specifications. For further details see the documentation at the Dell Support website.
5.6. Alerting
The following basic alerting is recommended:
5.6.1. IPU-M2000
In-band alerting
We recommend alerting on basic OS disk usage and time drift.
Out-of-band alerting
We recommend alerting on the failure of critical hardware such as system fans, power supply units, and temperature sensors crossing thresholds obtained from OpenBMC.
5.6.2. Host servers
In-band alerting
We recommend alerting on basic OS disk usage and time drift.
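A minimal in-band disk usage check can be built from the standard library, as sketched below; the filesystem paths and usage threshold are assumptions to adapt to your own layout and alerting pipeline. For time drift, the NTP offset check sketched in Section 5.2.5 can be reused.

# Flag filesystems that exceed a usage threshold (printing here; hook into
# your alerting pipeline in practice). Paths and threshold are placeholders.
import shutil

THRESHOLD = 0.85  # alert at 85% used
for path in ["/", "/home", "/localdata"]:
    usage = shutil.disk_usage(path)
    used_fraction = (usage.total - usage.free) / usage.total
    if used_fraction >= THRESHOLD:
        print(f"ALERT {path}: {used_fraction:.0%} used")
    else:
        print(f"ok    {path}: {used_fraction:.0%} used")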
Out-of-band alerting
We recommend alerting on the failure of critical hardware such as system fans, power supply units, and temperature sensors crossing thresholds obtained from iDRAC.