4. Low-level architecture

4.1. Hardware

The list of hardware is as follows. The specific components used in the reference design are given in brackets.

  • 16x IPU-Machines (IPU-M2000)

  • 1-4x host server(s), (PowerEdge R6525)

  • 1x 100GbE RoCE/RDMA ToR switch (Arista DCS-7060CX-32S-F)

  • 1x 1GbE management switch (Arista DCS-7010T-48-F)

  • 1x PDU (APC AP8886)

_images/pod64.png

Fig. 4.1 Reference hardware connections (IPU‑POD64)

4.2. Network design

This section describes the networks, endpoints, switches and VLANs, and the relationships between them.

4.2.1. 1GbE management network

This network has the following devices connected to a 1GbE switch:

  • Poplar servers: 2x 1GbE interfaces and IPMI interface

  • IPU-Machines: BMC+GW and BMC interfaces

  • PDU: management interface

  • 100GbE switch: management interface

The purpose of this network is to provide connections for managing devices.

Poplar servers have a dedicated IPMI (iDRAC) NIC and two management NICs on the 1GbE network. IPU-Machines have a single connection to the 1GbE network, although each IPU-Machine presents two MAC addresses: one for the IPU BMC and one for the IPU-Gateway.

4.2.2. 100GbE data network

This network has the following devices connected to a 100GbE switch:

  • Poplar servers: 2x RNIC 100GbE interfaces

  • IPU-Machines: 100GbE interfaces

  • NFS high performance storage

  • VM images and volumes storage

Poplar servers have a bonded pair of 100GbE interfaces, which are used to communicate with IPU-Machines and with storage (described in Section 4.3).

IPU-Machines have a single 100GbE connection that is used to communicate with Poplar servers.

4.2.3. Networks and VLANs

The physical switches (1GbE management switch and 100GbE data switch) provide network access to the IPU-Machines and Poplar servers. To separate data and management traffic, and to separate the various tenants, VLANs are added as provider networks within OpenStack Neutron. It is expected that Neutron will provide IPAM and DHCP for these provider networks, along with any required routing.

For each tenant IPU partition, there will be a dedicated set of networks, typically:

  • IPU-M BMC

  • IPU-M GW

  • vPOD Management

  • IPU Data

  • Storage

Importantly, the OpenStack project given to end users to create VMs that access IPU-Machines will have access to the IPU Data, Storage and vPOD Management networks only. Such projects do not get access to the privileged IPU-M BMC and IPU-M GW networks. The IPU Data network will be created by the provider and shared with the tenant that is given access to those IPUs. It is expected that the virtual router for the IPU Data network will allow access to the IPU Management network and provide access to the internet via SNAT.

OpenStack Ironic’s integration with Neutron does allow for dynamic API driven reconfiguration of the VLANs assigned to the physical switch access ports.

| Name            | Switch | VLAN   | Notes |
|-----------------|--------|--------|-------|
| IPU-M BMC       | 1GbE   | Global | BMC management traffic. Used by datacentre admins. Shared by all vPODs. |
| IPU-M GW        | 1GbE   | Global | IPU-M management traffic (GW). Used by the V-IPU Server to manage IPU-Ms via the VIRM agent running on the GW. Shared by all vPODs. |
| vPOD Management | 1GbE   | Tenant | Management traffic. Used to control the ToR switch, the Poplar server via IPMI (for example iDRAC), the 100GbE switch via its management port (MGT), and the PDU. |
| IPU Data        | 100GbE | Tenant | Data plane for traffic between Poplar server(s) and IPU-Machines (IPU-over-Fabric). Requires VLAN isolation for tenants. |
| Storage         | 100GbE | Tenant | Traffic to storage (NAS). Requires VLAN isolation for tenants. |

Fig. 4.2 below shows how all the devices are connected and which VLANs are used.

_images/vlans.png

Fig. 4.2 Logical network diagram with VLANs
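
As an illustration of this provider-network model, the following sketch uses the openstacksdk Python library to create a VLAN provider network for IPU Data traffic and a virtual router that provides outbound access via SNAT. The cloud name, physical network label, CIDR and external network name are illustrative assumptions only (the VLAN ID 11 matches the IPU-over-Fabric VLAN mentioned in Section 4.4).

```python
import openstack

# Connect using credentials from clouds.yaml (the cloud name is an assumption).
conn = openstack.connect(cloud="ipu-pod")

# VLAN provider network for IPU Data traffic. The physical network label and
# CIDR are placeholders; VLAN 11 matches the IPU-over-Fabric VLAN in Section 4.4.
ipu_data = conn.network.create_network(
    name="ipu-data",
    provider_network_type="vlan",
    provider_physical_network="physnet-100g",
    provider_segmentation_id=11,
)
ipu_data_subnet = conn.network.create_subnet(
    name="ipu-data-subnet",
    network_id=ipu_data.id,
    ip_version=4,
    cidr="10.10.11.0/24",
)

# Virtual router giving the IPU Data network outbound internet access via SNAT.
# The external network name ("external") is an assumption.
external = conn.network.find_network("external")
router = conn.network.create_router(
    name="ipu-data-router",
    external_gateway_info={"network_id": external.id, "enable_snat": True},
)
conn.network.add_interface_to_router(router, subnet_id=ipu_data_subnet.id)
```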

4.2.4. Security groups

The following traffic restrictions are enforced for VMs on the networks. This is configured in OpenStack Neutron.

vPOD Controller VM interface in IPU-M BMC network

| Direction | Protocol | Port | Source/Destination | Reason |
|-----------|----------|------|--------------------|--------|
| Egress  | ICMP     | –           | IPU-M BMC addresses | Check connectivity |
| Egress  | TCP      | 22 (ssh)    | IPU-M BMC addresses | Remote access to shell |
| Egress  | TCP      | 443 (https) | TBC                 | TBC |
| Egress  | TCP, UDP | 53 (dns)    | TBC                 | To DNS server on 10.2 subnet |
| Ingress | UDP      | 514         | IPU-M BMC addresses | Collect incoming logs from all IPU-M BMC |

vPOD Controller VM interface in IPU-M GW network

| Direction | Protocol | Port | Source/Destination | Reason |
|-----------|----------|------|--------------------|--------|
| Egress  | ICMP     | –        | Any                      | Check connectivity |
| Egress  | TCP      | 22 (ssh) | IPU-M GW addresses       | Remote access to shell |
| Egress  | TCP      | 2112     | IPU-M GW addresses       | Collect metrics from Prometheus exporter |
| Egress  | TCP      | 8080     | IPU-M GW addresses       | Manage HW using V-IPU agent (VIRM) from V-IPU server |
| Egress  | TCP, UDP | 53       | TBC                      | To DNS server on 10.2 subnet |
| Ingress | TCP      | 22       | Global management server | For remote administration |
| Ingress | TCP      | 2113     | Global management server | Expose metrics by Prometheus exporter |
| Ingress | TCP      | 3000     | Global management server | Expose Grafana |
| Ingress | TCP      | 9090     | Global management server | Expose Prometheus for federation |
| Ingress | UDP      | 514      | IPU-M GW addresses       | Collect incoming logs from all IPU-M GW |

vPOD Controller VM interface in vPOD management network

| Direction | Protocol | Port | Source/Destination | Reason |
|-----------|----------|------|--------------------|--------|
| Egress  | Any | Any  | Any        | Egress to any allowed |
| Ingress | TCP | 22   | Any        | Remote access to shell |
| Ingress | TCP | 8090 | Poplar VMs | Access for Poplar VMs to V-IPU server |

Poplar VM interface in larger vPOD management network

| Direction | Protocol | Port | Source/Destination | Reason |
|-----------|----------|------|--------------------|--------|
| Egress  | Any | Any  | Any              | Egress to any allowed |
| Ingress | TCP | 22   | Any              | Remote access to shell |
| Ingress | TCP | 9100 | V-IPU Controller | Expose node exporter for Prometheus |

Poplar VM in IPU Data network

No filtering.

Poplar VM in Storage network

| Direction | Protocol | Port | Source/Destination | Reason |
|-----------|----------|------|--------------------|--------|
| Egress | TCP, UDP | NFS | NFS server | Allow access to NFS storage server |

No ingress.
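
A sketch of how a subset of these rules could be expressed with the openstacksdk Python library is shown below. Only a few of the rules for the vPOD Controller interface in the IPU-M BMC network are shown; the cloud name, security group name and BMC subnet CIDR are placeholder assumptions.

```python
import openstack

conn = openstack.connect(cloud="ipu-pod")  # cloud name is an assumption

# Security group for the vPOD Controller VM interface in the IPU-M BMC network.
sg = conn.network.create_security_group(
    name="vpod-ctrl-ipum-bmc",
    description="vPOD Controller VM interface in IPU-M BMC network",
)

# Placeholder subnet covering the IPU-M BMC addresses.
ipum_bmc_cidr = "10.1.1.0/24"

# Egress: ICMP to the IPU-M BMC addresses (connectivity checks).
conn.network.create_security_group_rule(
    security_group_id=sg.id, direction="egress", protocol="icmp",
    remote_ip_prefix=ipum_bmc_cidr,
)

# Egress: SSH (TCP 22) to the IPU-M BMC addresses.
conn.network.create_security_group_rule(
    security_group_id=sg.id, direction="egress", protocol="tcp",
    port_range_min=22, port_range_max=22, remote_ip_prefix=ipum_bmc_cidr,
)

# Ingress: syslog (UDP 514) from the IPU-M BMC addresses.
conn.network.create_security_group_rule(
    security_group_id=sg.id, direction="ingress", protocol="udp",
    port_range_min=514, port_range_max=514, remote_ip_prefix=ipum_bmc_cidr,
)

# Note: a newly created security group contains allow-all egress rules by
# default; to strictly enforce the egress restrictions in the table, those
# default egress rules would also need to be deleted.
```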

4.2.5. RDMA in VMs using SR-IOV

The VMs that need to connect to the IPUs using IPU-over-Fabric need access to an RDMA-enabled network. To achieve this from within an OpenStack VM, SR-IOV will be used. The virtual function (VF) that is passed through to a VM will only have access to a single IPU VLAN. Moreover, because the Mellanox ConnectX-5 is used, the VF can be connected via a bond (called VF LAG) to the RDMA-enabled network. In this case, both members of the bond are connected to a single NIC and a single switch.
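
A sketch of how such an SR-IOV port might be requested from Neutron and attached to a VM using the openstacksdk Python library is shown below; the cloud, network, flavor and image names are placeholder assumptions.

```python
import openstack

conn = openstack.connect(cloud="ipu-pod")  # cloud name is an assumption

# The IPU Data VLAN provider network (the name is an assumption).
ipu_data = conn.network.find_network("ipu-data")

# Request an SR-IOV virtual function by asking Neutron for a port with
# vnic_type=direct. The VF is bound to the VLAN of this network only.
sriov_port = conn.network.create_port(
    name="poplar-vm-ipu-data",
    network_id=ipu_data.id,
    binding_vnic_type="direct",
)

# Boot a VM with the SR-IOV port attached; flavor and image names are placeholders.
server = conn.compute.create_server(
    name="poplar-vm-0",
    flavor_id=conn.compute.find_flavor("poplar.xlarge").id,
    image_id=conn.compute.find_image("ubuntu-20.04-poplar").id,
    networks=[{"port": sriov_port.id}],
)
conn.compute.wait_for_server(server)
```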

4.2.6. Configuration in OpenStack

Before VMs can communicate with the IPU-Machines, the IPU-Machines must be configured into the OpenStack network.

First, the physical switches need to be configured:

  • Allocate new VLAN IDs for the defined networks.

  • Set the switch ports to the appropriate VLANs according to the devices attached to them.

With the access VLANs updated, we can use OpenStack ports to provide DHCP and DNS:

  • Create VLAN networks in Neutron. Subnets are created with appropriate CIDRs and allocation pools.

  • Create Neutron ports for all MAC addresses exposed by each IPU-Machine: one for the RNIC on the 100GbE network; one for the BMC MAC; and another for the GW MAC on the 1GbE network. Currently the GW and BMC share a single VLAN but are on different subnets.

  • Appropriate DNS names are applied to each of the IPU-Machine ports. This means any OpenStack VM in the appropriate network can resolve the IPU-Machine DNS names and communicate with them over that network.

Note that all the OpenStack steps can be automated using tools such as Terraform. The main inputs to the scripts are the VLANs and the mapping between MAC address, DNS name and (optionally) IP address. Typically, the latter is specified using a CSV file or similar.
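
As an alternative illustration of these steps, the sketch below uses the openstacksdk Python library to create the 1GbE management network, its BMC and GW subnets, and per-IPU-Machine ports with fixed MAC addresses and DNS names. The cloud name, VLAN ID, physical network label, CIDRs and the CSV layout are assumptions for illustration only; the 100GbE RNIC ports are omitted for brevity.

```python
import csv
import openstack

conn = openstack.connect(cloud="ipu-pod")  # cloud name is an assumption

# 1GbE management network carrying the IPU-M BMC and GW subnets. The VLAN ID
# and physical network label are placeholders.
mgmt_net = conn.network.create_network(
    name="ipum-mgmt",
    provider_network_type="vlan",
    provider_physical_network="physnet-1g",
    provider_segmentation_id=13,
)

# BMC and GW currently share a VLAN but use separate subnets.
bmc_subnet = conn.network.create_subnet(
    name="ipum-bmc", network_id=mgmt_net.id, ip_version=4, cidr="10.1.1.0/24")
gw_subnet = conn.network.create_subnet(
    name="ipum-gw", network_id=mgmt_net.id, ip_version=4, cidr="10.1.2.0/24")

# Hypothetical CSV mapping: dns_name,mac_address,subnet,ip_address
# e.g. "ipum1-bmc,aa:bb:cc:00:00:01,ipum-bmc,10.1.1.11"
subnets = {"ipum-bmc": bmc_subnet, "ipum-gw": gw_subnet}
with open("ipu_machines.csv") as f:
    for row in csv.DictReader(f):
        subnet = subnets[row["subnet"]]
        conn.network.create_port(
            network_id=mgmt_net.id,
            mac_address=row["mac_address"],
            dns_name=row["dns_name"],  # requires the Neutron DNS extension
            fixed_ips=[{"subnet_id": subnet.id,
                        "ip_address": row["ip_address"]}],
        )
```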

4.3. Storage design

4.3.1. NFS high performance storage for IPU workloads

High performance external shared storage should be provided for IPU-POD processing. Generally, the appliance needs to be very fast and durable.

The folders exported over NFS can be mounted in a VM using OpenStack Manila or cloud-init scripting.
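
As an illustration of the cloud-init approach, the sketch below uses the openstacksdk Python library to boot a VM whose cloud-config mounts an NFS export on first boot. The NFS server address, export path, flavor, image and network names are placeholder assumptions.

```python
import base64
import openstack

conn = openstack.connect(cloud="ipu-pod")  # cloud name is an assumption

# cloud-config that installs the NFS client and mounts a hypothetical export.
cloud_config = """#cloud-config
packages:
  - nfs-common
mounts:
  - ["nfs.example.local:/export/datasets", "/mnt/datasets", "nfs",
     "defaults,nofail", "0", "0"]
"""

server = conn.compute.create_server(
    name="poplar-vm-0",
    flavor_id=conn.compute.find_flavor("poplar.xlarge").id,      # placeholder
    image_id=conn.compute.find_image("ubuntu-20.04-poplar").id,  # placeholder
    networks=[{"uuid": conn.network.find_network("storage").id}],
    # Nova expects user data to be base64 encoded.
    user_data=base64.b64encode(cloud_config.encode()).decode(),
)
conn.compute.wait_for_server(server)
```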

4.3.2. VM images and volumes storage

The external storage appliance should be provided as a backend for OpenStack Glance VM images and Cinder volumes. This allows users to provision VMs using selected images and then to attach additional, durable storage to their VMs. This storage can be detached from one VM and moved to another VM and snapshotted as required. It can optionally be used as a boot volume for VMs, although local storage is expected to be the most scalable option for the root disk.

Access to the storage should be provided over the 100GbE Ethernet physical network to enable high-performance access (up to 4GB/s for a virtual volume mounted on a VM).
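
A sketch of this volume lifecycle (create, attach, detach, snapshot, re-attach) using the openstacksdk Python library is shown below; the cloud name, volume names, sizes and server names are illustrative assumptions.

```python
import openstack

conn = openstack.connect(cloud="ipu-pod")  # cloud name is an assumption

# Create a durable 500 GB volume backed by the external storage appliance.
volume = conn.block_storage.create_volume(name="results-vol", size=500)
conn.block_storage.wait_for_status(volume, status="available")

# Attach it to a running VM (the server name is a placeholder).
server = conn.compute.find_server("poplar-vm-0")
conn.attach_volume(server, volume)

# Later: detach, snapshot, and re-attach the volume to another VM.
conn.detach_volume(server, volume)
snapshot = conn.block_storage.create_snapshot(
    volume_id=volume.id, name="results-vol-snap")
conn.block_storage.wait_for_status(snapshot, status="available")
conn.attach_volume(conn.compute.find_server("poplar-vm-1"), volume)
```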

4.3.3. Local storage

The Poplar servers have a substantial amount of NVMe storage available. It is expected that the NVMe disks will be combined using software RAID (for example MD RAID 6). This will be exposed to the VMs as the root filesystem and as an extra volume using OpenStack Nova ephemeral disks. Generally, it should serve as very fast but non-durable storage; durable storage is provided by the external appliance described in Section 4.3.2.
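
One way the local, non-durable storage can be exposed to VMs is through the flavor definition, as sketched below with the openstacksdk Python library; the cloud name, flavor name, sizes and project ID are illustrative assumptions.

```python
import openstack

conn = openstack.connect(cloud="ipu-pod")  # cloud name is an assumption

# Flavor exposing local NVMe-backed storage to the VM: a root disk plus a
# large ephemeral disk carved out of the RAID set. All sizes are placeholders.
flavor = conn.compute.create_flavor(
    name="poplar.xlarge",
    ram=192 * 1024,   # MB
    vcpus=48,
    disk=100,         # GB root disk on local NVMe
    ephemeral=2000,   # GB ephemeral disk on local NVMe (non-durable)
)

# Make the flavor available to the tenant project (project ID is a placeholder).
conn.compute.flavor_add_tenant_access(flavor, "TENANT_PROJECT_ID")
```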

4.4. Poplar server design

VMs on the Poplar server will use SR-IOV to connect to the 100GbE network. Poplar's IPU-over-Fabric driver requires RDMA and RoCE on the 100GbE interface. The external storage appliance does not require RDMA but uses the 100GbE interface as well. VMs will also have access to local storage provided by locally attached NVMe disks.

The 100GbE interfaces in the Poplar server are bonded to improve reliability and throughput, as shown in Fig. 4.3.

_images/poplar-server.png

Fig. 4.3 Poplar Server with VM and 100GbE Switch

VMs are connected to the interfaces using SR-IOV. The virtual function (VF) that is passed through to a VM will only have access to VLANs 11 and 12 (11 carries IPU-over-Fabric data transfers to the IPUs, and 12 carries NAS storage traffic). Moreover, because the Mellanox ConnectX-5 is used, the VF can be connected via a bond (called VF LAG) to the RDMA-enabled network.

Management of the IPU-POD64 is handled by the V-IPU Controller which has both an admin API and a user API. Connection information is fed to the Poplar SDK from the V-IPU Controller via the user APIs. Each tenant will have its own V-IPU Controller that is reachable from the VMs running the Poplar SDK. It is expected that the V-IPU Controller will run as a managed service on the Poplar server.

The V-IPU Controller needs access to the VIRM agent service on each IPU-Machine. This access is provided by the IPU-GW network using NIC P1 and VLAN 13.

NIC P2 in the Poplar server is used for server management. The server can also be accessed via its IPMI interface. Both ports operate in VLAN 99.

4.5. OpenStack Controller

How the OpenStack Controller is set up will depend on the user implementation. Controller nodes and services should not be placed on Poplar servers, as their full capabilities should be dedicated to end-user workloads.

OpenStack services largely adopt a scale-out design pattern, and controller nodes run many Python OpenStack controller processes. Controller nodes are configured to be optimised for high-concurrency and high-throughput workloads.

In addition, controller nodes typically also operate the message bus, which is usually RabbitMQ. RabbitMQ implements highly available, clustered storage for AMQP message queues. Although the services are co-resident, communications from OpenStack control plane services to RabbitMQ are addressed to an HAProxy-managed virtual IP, because RabbitMQ's cluster membership, despite its widespread adoption, is not completely resilient.

4.5.1. Authentication and secret management

Keystone services will run on the OpenStack Controllers.

Keystone

API and services for OpenStack authentication. Keystone has an HTTP REST API and runs as a service within Apache. Keystone persistent state is stored by Galera/MySQL. Transient token state is stored by Memcached.

OpenStack system accounts will be stored locally within the SQL database. Customer domain users will be stored in external LDAP servers, which may be managed by the customer.

4.5.2. Software images

Glance has only one service, Glance API, which runs on the OpenStack Controllers. Glance will store software images in Ceph RBD. During the initial phase of the deployment, the local disk of the first controller can be used instead, to avoid a hard dependency on Ceph being set up.

Glance API

An HTTP REST API for software image services.

4.5.3. Open vSwitch (OVS)

OpenStack SDN is conventionally implemented as a set of Neutron services. Our recommendation is to use the OVS Neutron ML2 driver (https://docs.openstack.org/networking-ovn/latest/).

In a virtualised OVS-based OpenStack deployment, instance network configuration is terminated in the associated hypervisor using Open vSwitch. OpenStack networking for VM instances and other software-defined infrastructure is expressed as flow rules for the Open vSwitch bridges.

Neutron also operates a number of centralised services, which perform DHCP on virtual tenant networks and routing between networks. When routing between external and internal networks, the Neutron SDN configuration also performs NAT to manage the association of floating IPs.

Neutron server

Controller process for OpenStack networking.

4.5.4. OpenStack Compute

Nova API, Conductor, Scheduler and Placement services will run on the OpenStack Controllers. Nova Compute, LibVirt and NoVNC Proxy services will run on OpenStack Compute hypervisors.

Nova API

An HTTP REST API for OpenStack compute services.

Nova Conductor

The Nova conductor process implements the logic at the heart of OpenStack compute.

Nova Scheduler

The Nova scheduler selects the hypervisor resources to which a new instance will be assigned. A configurable set of filters is used to eliminate ineligible resources.

Nova Placement API

A new service (introduced in Newton but more tightly integrated in Ocata), providing a more generic and flexible method of managing heterogeneous compute resources.

Nova NoVNC Proxy

Proxy to libvirt consoles via websockets.

4.5.5. Storage services

OpenStack has components which provide filesystem and block storage as a service through Manila and Cinder respectively. These components allow users to attach additional storage to their instances in a self-managed fashion. The services that manage these resources will run on the controllers. The controllers therefore need access to the backend storage.

Cinder will be configured to use the Ceph RBD driver. Glance will also be configured to store images using its Ceph RBD driver.

Manila will only be used if the external storage appliance is supported by Manila. CephFS can be used with Manila, but it is not expected to deliver the required performance.

Manila API

An HTTP REST API for managing file systems as a service.

Manila Share

The intermediary between the storage backend and the client; it also tracks the status of each share. It is also possible to mount shares out-of-band if the backend pool supports this mode of operation, as is the case when using CephFS.

Manila Scheduler

Routes requests to the appropriate Manila Share service. The selection can be controlled via filters, which look at attributes such as the amount of remaining capacity in the pool that a given Manila Share service manages.

Manila Data

Certain operations, such as backing up, copying or migrating shares, require data to be moved or copied from one place to another. This is the service that does the work for those operations.

Cinder API

An HTTP REST API for managing block storage as a service.

Cinder Scheduler

Selects an appropriate Cinder volume service for a given request.

Cinder Volume

Worker service that coordinates with the Ceph clusters to create and destroy RBD devices on demand.

4.5.6. Web interface

The Horizon web interface runs on the OpenStack controllers.

Horizon

The OpenStack web interface.

4.5.7. Supporting services

RabbitMQ

A stateful service, providing reliable, clustered AMQP messaging used for internal communication between OpenStack services.

Galera Cluster

A stateful service, providing SQL databases to all OpenStack components.

Memcached

A low-latency cache for ephemeral data, such as Keystone authentication tokens.

Keepalived

A service implementing the Virtual Router Redundancy Protocol (VRRP), which allows services to redundantly share a common virtual IP (VIP).

HAProxy

The TCP proxy server used for supporting high availability by interposing between client connections and one or more servers for a service. HAProxy uses VRRP (via Keepalived) to implement high availability in an active-passive pairing.

Cron

Periodic management activities, such as the flushing of expired Keystone tokens.

Fluentd

A framework for collecting logs from the control plane servers and OpenStack services and shipping to Monasca for storage and analysis.

Open vSwitch daemon

Open vSwitch is used by Neutron for the dynamic management of OpenStack’s software-defined networking.

Open vSwitch OVSDB

The Open vSwitch database (OVSDB) protocol is the interface between Neutron/OVN and Open vSwitch.

4.6. Security

Security best practices should be followed, in particular:

  • Public API traffic secured using Let’s Encrypt SSL certificate

  • Firewall applied to public API network

  • Internal OpenStack APIs and services on a separate isolated network

  • Access using SSH key pairs