2. Product description

2.1. IPU‑POD16 Direct Attach

Graphcore’s IPU‑POD16 Direct Attach system combines four IPU-M2000s, delivering nearly 4 petaFLOPS of AI compute, directly attached to a pre-approved host server from a choice of technology providers including Dell and Supermicro. The IPU‑POD16 Direct Attach is a compact and powerful platform suited to exploration, experimentation, and concept and pilot development. It is designed to scale to a full IPU‑POD64 by adding IPU-M2000s, host servers and a top-of-rack switch. The IPU‑POD64 in turn is a building block for larger systems with up to 64K GC200 IPU processors, delivering nearly 16 exaFLOPS of AI compute.

A high-level view of the IPU‑POD16 Direct Attach cabling is shown in the figure below.

_images/POD16-cabling.png

Fig. 2.1 IPU‑POD16 Direct Attach cabling

2.2. Software

IPU-POD systems are fully supported by Graphcore’s Poplar® software development environment, providing a complete and mature platform for ML development and deployment. Standard ML frameworks including TensorFlow, Keras, ONNX, Halo, PaddlePaddle, HuggingFace, PyTorch and PyTorch Lightning are fully supported, along with access to PopLibs through the Poplar C++ API. PopLibs, PopART and TensorFlow are available as open source in the Graphcore GitHub repo https://github.com/graphcore. PopTorch provides a simple wrapper around PyTorch programs so that they run seamlessly on IPUs. The Poplar SDK also includes the PopVision™ visualisation and analysis tools, which provide performance monitoring for IPUs. Their graphical analysis enables detailed inspection of all processing activities.

In addition to these Poplar development tools, IPU-POD systems are enabled with software support for industry standard converged infrastructure management tools including OpenBMC, Redfish, Docker containers, and orchestration with Slurm and Kubernetes.

_images/software.png

Fig. 2.2 IPU-POD software

Table 2.1 Poplar SDK

Complete end-to-end software stack for developing, deploying and monitoring AI model training jobs as well as inference applications on the Graphcore IPU

ML frameworks

TensorFlow, Keras, PyTorch, PyTorch Lightning, HuggingFace, PaddlePaddle, Halo, and ONNX

Deployment options

Bare metal (Linux), VM (Hyper-V), containers (Docker)

Host-Links

RDMA-based disaggregation between a host and IPU over 100 Gbps RoCEv2 NIC, using the IPU over Fabric (IPUoF) protocol

Host-to-IPU ratios supported: 1:16 up to 1:64

Graphcore Communication Library (GCL)

IPU-optimized communication and collective library integrated with the Poplar SDK stack

Supports all-reduce (sum, max), all-gather, reduce and broadcast

Scales with near-linear performance to 64K IPUs

PopVision

Visualization and analysis tools

To see a full list of supported OS, VM and container options go to the Graphcore support portal https://www.graphcore.ai/support
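As a rough illustration of what the GCL collectives listed in Table 2.1 compute, the following pure-Python sketch models each rank's buffer as a list and returns the buffers every rank would hold afterwards. It is a reference for the semantics only: the function names and signatures are hypothetical and are not the GCL or Poplar API.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
ReduceOp = Callable[[Sequence[float]], float]  # e.g. the built-ins sum or max


def all_reduce(per_rank: Sequence[Vector], op: ReduceOp = sum) -> List[Vector]:
    """Every rank receives the element-wise reduction of all inputs."""
    reduced = [op(column) for column in zip(*per_rank)]
    return [list(reduced) for _ in per_rank]


def reduce(per_rank: Sequence[Vector], root: int = 0,
           op: ReduceOp = sum) -> List[Optional[Vector]]:
    """Only the root rank receives the element-wise reduction."""
    reduced = [op(column) for column in zip(*per_rank)]
    return [reduced if rank == root else None for rank in range(len(per_rank))]


def all_gather(per_rank: Sequence[Vector]) -> List[Vector]:
    """Every rank receives all inputs concatenated in rank order."""
    gathered = [value for vector in per_rank for value in vector]
    return [list(gathered) for _ in per_rank]


def broadcast(per_rank: Sequence[Vector], root: int = 0) -> List[Vector]:
    """Every rank receives a copy of the root rank's input."""
    return [list(per_rank[root]) for _ in per_rank]


# Three hypothetical ranks, each holding a two-element buffer.
ranks = [[1.0, 5.0], [2.0, 3.0], [4.0, 0.5]]
print(all_reduce(ranks))          # sum variant
print(all_reduce(ranks, op=max))  # max variant
print(all_gather(ranks)[0])
print(broadcast(ranks, root=1)[0])
```

The sketch ignores how the data moves between IPUs, which is where an optimized library such as GCL does its work.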

Table 2.2 Graphcore Virtual IPU SW

IPU-Fabric topology discovery and validation

Provisioning

gRPC and SSH/CLI for IPU allocation/de-allocation into isolated domains (vPods)

Plug-ins for Slurm and Kubernetes (K8s)

Resource monitoring

gRPC and SSH/CLI for accessing the IPU-M2000 monitoring service

Prometheus node exporter and Grafana (visualization) support

Table 2.3 Lights-out management

Baseboard Management Controller (OpenBMC)

Dual-image firmware with local rollback support

Console support, CLI/SSH based

Serial-over-LAN and Redfish REST API
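The Redfish REST API listed in Table 2.3 can be queried with any HTTP client. The sketch below uses only the Python standard library and relies on the `/redfish/v1/` service root defined by the DMTF Redfish standard. The host name, the credentials and the assumption that the BMC accepts HTTP basic authentication are placeholders for illustration, not documented IPU-M2000 behaviour.

```python
import base64
import json
import ssl
import urllib.request
from typing import Optional

REDFISH_ROOT = "/redfish/v1"  # service root path from the DMTF Redfish standard


def redfish_url(bmc_host: str, resource: str = "") -> str:
    """Build the HTTPS URL of a Redfish resource below the service root."""
    resource = resource.strip("/")
    path = f"{REDFISH_ROOT}/{resource}" if resource else f"{REDFISH_ROOT}/"
    return f"https://{bmc_host}{path}"


def redfish_get(bmc_host: str, resource: str = "",
                user: Optional[str] = None, password: Optional[str] = None,
                verify_tls: bool = True, timeout: float = 10.0) -> dict:
    """GET a Redfish resource and return the decoded JSON body."""
    request = urllib.request.Request(
        redfish_url(bmc_host, resource),
        headers={"Accept": "application/json"},
    )
    if user is not None and password is not None:
        # Assumption: basic authentication is enabled on the BMC.
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        request.add_header("Authorization", f"Basic {token}")
    context = ssl.create_default_context()
    if not verify_tls:
        # Only for lab setups where the BMC uses a self-signed certificate.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return json.load(response)


# Placeholder host name; prints the URL without contacting any device.
print(redfish_url("bmc.example.net", "Chassis"))
```

Calling `redfish_get("bmc.example.net", "Chassis", user=..., password=...)` against a reachable BMC would return the chassis collection as a dictionary. Which resources are populated depends on the firmware.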

2.3. Technical specifications

Table 2.4 Graphcore IPU‑POD16 hardware

IPU-Machines

4x IPU-M2000 blades

IPUs

16 GC200 IPU processors (4 in each IPU-M2000)

IPU-Cores™

23,552

Worker threads

141,312

AI compute

3.994 petaFLOPS AI (FP16.16) compute

0.998 petaFLOPS FP32 compute

Memory

Up to 526.4 GB: 14.4 GB In-Processor-Memory (4x IPU-M2000, 3.6 GB each) plus 512 GB Streaming Memory (4x IPU-M2000, 2x 64 GB DIMM each)
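The totals in Table 2.4 can be cross-checked with a few lines of arithmetic. In the sketch below the system totals are taken from the table, while the per-IPU and per-core figures are derived from them by division and are not quoted from a GC200 specification.

```python
# System-level figures from Table 2.4.
IPU_MACHINES = 4
IPUS_PER_MACHINE = 4
TOTAL_CORES = 23_552
TOTAL_THREADS = 141_312
FP16_PFLOPS = 3.994
FP32_PFLOPS = 0.998
IN_PROCESSOR_GB_PER_MACHINE = 3.6
DIMMS_PER_MACHINE = 2
DIMM_GB = 64

total_ipus = IPU_MACHINES * IPUS_PER_MACHINE

# Derived values (not quoted from the table).
cores_per_ipu, cores_left_over = divmod(TOTAL_CORES, total_ipus)
threads_per_core, threads_left_over = divmod(TOTAL_THREADS, TOTAL_CORES)
fp16_tflops_per_ipu = FP16_PFLOPS * 1000 / total_ipus
fp32_tflops_per_ipu = FP32_PFLOPS * 1000 / total_ipus

# Memory totals rebuilt from the per-machine figures.
in_processor_gb = IPU_MACHINES * IN_PROCESSOR_GB_PER_MACHINE
streaming_gb = IPU_MACHINES * DIMMS_PER_MACHINE * DIMM_GB
total_memory_gb = in_processor_gb + streaming_gb

print(f"IPUs: {total_ipus}")
print(f"Cores per IPU: {cores_per_ipu} (remainder {cores_left_over})")
print(f"Threads per core: {threads_per_core} (remainder {threads_left_over})")
print(f"FP16.16 per IPU: {fp16_tflops_per_ipu:.1f} TFLOPS")
print(f"FP32 per IPU: {fp32_tflops_per_ipu:.1f} TFLOPS")
print(f"Memory: {in_processor_gb:.1f} GB + {streaming_gb} GB = {total_memory_gb:.1f} GB")
```

Both remainders come out as zero, so the core and thread counts divide evenly across the 16 IPUs.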

Table 2.5 Other IPU‑POD16 hardware

IPU‑POD16 host server(s)

Default: 1 x Dell PowerEdge R6525 server

Options: 1 – 4 Graphcore approved server/OS options. Contact Graphcore sales for details

IPU‑POD16 switches

Optional: IPU‑POD16 systems can also be implemented in a switched configuration. Please contact Graphcore sales

Table 2.6 IPU-M2000 thermal characteristics

Air cooled with built-in N+1 hot-plug fan cooling system

Airflow

Mounted for airflow direction front of rack (single door, cold aisle side) to back of rack (split door, hot aisle side)

Airflow rate

103 CFM (measured) per IPU-M2000
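For rack planning, the per-unit airflow figure can be scaled to the four IPU-M2000s of an IPU‑POD16 and converted to metric units. This is a minimal sketch: it covers the IPU-M2000s only and assumes the measured 103 CFM applies equally to each unit. The host server's airflow is not part of the table and is left out.

```python
# 1 cubic foot per minute in cubic metres per hour (1 ft = 0.3048 m exactly).
CFM_TO_M3_PER_HOUR = 0.3048 ** 3 * 60


def pod_airflow(cfm_per_machine: float = 103.0, machines: int = 4):
    """Return (total CFM, total m^3/h) for the given number of IPU-M2000s."""
    total_cfm = cfm_per_machine * machines
    return total_cfm, total_cfm * CFM_TO_M3_PER_HOUR


total_cfm, total_m3_per_hour = pod_airflow()
print(f"{total_cfm:.0f} CFM, about {total_m3_per_hour:.0f} m^3/h")
```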

Table 2.7 IPU‑POD16 power

PDU

PDU implementation can be customized for target workload and rack power density goals. Contact Graphcore sales for information

Input power (Vac)

100 - 240 Vac (115 - 230 Vac nominal)

Power cap

1700 W with programmable power cap

Redundancy

1+1 redundancy

2.4. Environmental characteristics

Table 2.8 IPU‑POD16 environmental characteristics

Operating temperature and humidity (inlet air)

10 to 32°C (50 to 90°F) at 20% to 80% RH (*)

Operating altitude

0 to 3,048 m (0 to 10,000 ft) (**)

  • (*) Altitude less than 900m/3000ft and non-condensing environment

  • (**) Max. ambient temperature is de-rated by 1°C per 300m above 900m

For power caps higher than 1700W per IPU-M2000 please contact Graphcore sales for environmental guidance.
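The altitude de-rating in the footnotes to Table 2.8 can be expressed as a small function. The sketch below assumes the de-rating is applied linearly (1°C per 300 m, pro rata) rather than in 300 m steps, which the table does not specify. The function name and parameters are illustrative.

```python
def max_inlet_temp_c(altitude_m: float,
                     base_max_c: float = 32.0,
                     derate_start_m: float = 900.0,
                     derate_c_per_300m: float = 1.0,
                     max_altitude_m: float = 3048.0) -> float:
    """Maximum operating inlet temperature (deg C) at a given altitude."""
    if not 0.0 <= altitude_m <= max_altitude_m:
        raise ValueError("altitude outside the specified operating range")
    # No de-rating at or below 900 m; linear de-rating above it (assumption).
    excess_m = max(0.0, altitude_m - derate_start_m)
    return base_max_c - derate_c_per_300m * excess_m / 300.0


for altitude in (0, 900, 1500, 3048):
    print(f"{altitude:>5} m: {max_inlet_temp_c(altitude):.2f} C")
```

Under this linear reading, the limit falls from 32°C at 900 m to just under 25°C at the 3,048 m ceiling.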

2.5. Standards compliance for IPU-M2000s

Table 2.9 IPU-M2000 standards compliance

EMC standards

Emissions: FCC CFR 47, ICES-003, EN55032, EN61000-3-2, EN61000-3-3, VCCI 32-1

Immunity: EN55035, EN61000-4-2, EN61000-4-3, EN61000-4-4, EN61000-4-5, EN61000-4-6, EN61000-4-8, EN61000-4-11

Safety standards

IEC62368-1 2nd Edition, IEC60950-1, UL62368-1 2nd Edition

Certifications

North America (FCC, UL), Europe (CE), UK (UKCA), Australia (RCM), Taiwan (BSMI), Japan (VCCI)

South Korea (KC), China (CQC)

CB-62368, CB-60950

Environmental standards

EU 2011/65/EU RoHS Directive, XVII REACH 1907/2006, 2012/19/EU WEEE Directive

The European Directive 2012/19/EU on Waste Electrical and Electronic Equipment (WEEE) states that these appliances should not be disposed of as part of the routine solid urban waste cycle, but collected separately in order to optimise the recovery and recycling flow of the materials they contain, while also preventing potential damage to human health and the environment arising from the presence of potentially hazardous substances.

The crossed-out bin symbol is printed on all products as a reminder that they must not be disposed of with other household waste.

Owners of electrical and electronic equipment (EEE) should contact their local government agencies to identify local WEEE collection and treatment systems for the environmental recycling and/or disposal of their end-of-life computer products. For more information on proper disposal of these devices, refer to the public utility service.

_images/WEEE-bin-plus-triman-logo.png

2.6. Ordering information

IPU-POD systems are available to order from Graphcore channel partners – see https://www.graphcore.ai/partners for details of your nearest Graphcore partner.