1. Overview
The IPU‑POD64 reference design is a rack solution containing 16 IPU-M2000s, one host server, network switches and the IPU-POD system software pack. Depending on workload requirements, the IPU‑POD64 can optionally be configured with up to four host servers. Each IPU-M2000 contains four Mk2 GC200 IPUs, giving 64 IPUs in total.
For more information on the IPU-POD systems available from Graphcore, see https://www.graphcore.ai/products.
This document provides the requirements and a high-level best-practice guide for deploying the IPU‑POD64 in a data centre environment.
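The component counts given above can be checked with a short Python sketch. This is only an illustration of the arithmetic (16 IPU-M2000s with four IPUs each, and one to four host servers); the class name `PodConfig`, its field names and the constant `MAX_HOST_SERVERS` are hypothetical and are not part of any Graphcore software.

```python
from dataclasses import dataclass

# Upper limit on host servers stated in the overview (assumed constant name).
MAX_HOST_SERVERS = 4


@dataclass(frozen=True)
class PodConfig:
    """Hypothetical description of an IPU-POD64 build, for illustration only."""

    ipu_m2000_units: int = 16  # IPU-M2000s in the rack
    ipus_per_unit: int = 4     # Mk2 GC200 IPUs in each IPU-M2000
    host_servers: int = 1      # the standard reference design has one server

    def __post_init__(self) -> None:
        # The overview allows between one and four host servers.
        if not 1 <= self.host_servers <= MAX_HOST_SERVERS:
            raise ValueError(
                f"host_servers must be between 1 and {MAX_HOST_SERVERS}"
            )

    @property
    def total_ipus(self) -> int:
        # Total IPUs = number of IPU-M2000s x IPUs per IPU-M2000.
        return self.ipu_m2000_units * self.ipus_per_unit


if __name__ == "__main__":
    standard = PodConfig()
    print(standard.total_ipus)  # 64
```

Changing the number of host servers does not change the IPU count; `PodConfig(host_servers=4).total_ipus` is still 64.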
Note
The pictures show a version with four servers. The standard reference design contains a single server.
1.1. Acronyms and abbreviations
The following list defines the most commonly used terms in this document.
BMC | Baseboard Management Controller: a service processor in the standby power domain that performs system hardware management
BOM | Bill of Materials
GW | Short for IPU-Gateway, a co-processor to the four IPUs in the IPU-M2000. It enables scaling across multiple IPU-M2000 units
GW-Link | High-speed communication link(s) that interconnect IPUs and IPU-GWs horizontally between IPU-M2000 units. Special cables are required for GW-Links between IPU-M2000 units
IPU-Link | High-speed communication links that interconnect IPUs within and between IPU-M2000 units. Special cables are required for IPU-Links between IPU-M2000 units
PDU | Power Distribution Unit
RDMA | Remote Direct Memory Access
RNIC | RDMA Network Interface Controller
RoCE | RDMA over Converged Ethernet
ToR | Top of Rack. Often used in combination with the ToR RDMA switch, which is placed on top of the stacked IPU-M2000 units