1. Introduction
IPU workloads require access to the RDMA network interface, but this is not available by default in Kubernetes. One quick workaround is to run a Pod in privileged mode with access to the host network. This is not an acceptable solution for a production environment, because it removes the isolation between Pods and the host.
This document describes how to configure a Kubernetes cluster to use MACVLAN to avoid this workaround. MACVLAN is a virtual network driver that lets you create several virtual interfaces, each with its own MAC and IP address, on top of a single physical network interface.
Access to the RDMA interface is made available as a secondary network using the Multus plugin and the NVidia Network Operator.
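As an illustration of what MACVLAN provides at the host level, the commands below create a MACVLAN sub-interface by hand. This is only a sketch: the cluster setup described in this document does this automatically through Multus, and the interface name and address here are examples, not values used later.
# create a macvlan sub-interface on top of eth1 and give it its own IP address
ip link add macvlan0 link eth1 type macvlan mode bridge
ip addr add 10.254.151.230/24 dev macvlan0
ip link set macvlan0 up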
2. Overview
This document describes how to install and configure a Kubernetes cluster where:
A Kubernetes node is running on the Poplar server
Pods running on this node can access the RDMA NIC that provides access to IPU-Machines
It doesn’t use the host network and privileged mode in Pods
It uses MACVLAN to access RDMA NIC
The following components need to be installed and configured, as described in the following sections:
- Gcore network settings: if you are using the Gcore Cloud service and want to configure a private network, follow the steps in Section 3, Gcore network settings; otherwise you can skip that section.
- Kubernetes: we have tested with Kubernetes versions 1.24 and 1.25.
- Mellanox RDMA NIC drivers: this requires Mellanox driver version 5 (see the example check after this list).
- The NVidia Network Operator: this includes Multus, Whereabouts and some NVidia and Mellanox Kubernetes extensions.
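For example, the installed Mellanox driver version can be checked on the Poplar server with commands such as the following; note that ofed_info is only present if Mellanox OFED is installed:
ofed_info -s                       # reports the Mellanox OFED version, e.g. MLNX_OFED_LINUX-5.x
modinfo mlx5_core | grep ^version  # kernel mlx5 driver version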
The Graphcore Kubernetes IPU Operator will also need to be installed, if it isn’t already, to test that workloads can run and get access to RDMA. The IPU Operator needs to be configured to run worker Pods without a host network. See Section 7, Usage examples.
3. Gcore network settings
If you are using the Gcore Cloud service and want to configure a private network, then you need to configure the network as described below.
In the Gcore Cloud portal, when creating a new AI cluster open Network settings and select Private network. Then click Add a new network.
Fig. 3.1 Adding a new private network
In the Create network dialog, enter a name for your private network and select Baremetal Network.
Fig. 3.2 Creating a network
The next step is to create a subnetwork. Click on Add a new subnetwork. In the Create Subnetwork dialog, enter a name and CIDR for your subnetwork and click Create subnetwork.
Fig. 3.3 Creating a subnetwork
Finally, complete the process by selecting Use floating IP in the Network settings dialog.
Fig. 3.4 Selecting floating IP
4. Kubernetes installation
Execute the following commands as the root user:
# this is based on https://computingforgeeks.com/deploy-kubernetes-cluster-on-ubuntu-with-kubeadm/
# https://blog.knoldus.com/how-to-install-kubernetes-on-ubuntu-20-04-kubeadm-and-minikube/
# prepare repos for Kubernetes
apt -y install curl apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list
# install Kubernetes
apt update
apt -y install vim git curl wget kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl
# check versions
kubectl version --client
kubeadm version
# disable swap, probably already off
swapoff -a
# enable kernel modules
modprobe overlay
modprobe br_netfilter
# add some settings to sysctl
tee /etc/sysctl.d/kubernetes.conf<<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
# reload sysctl
sysctl --system
# add Docker repo and install its packages
apt update
apt install -y curl gnupg2 software-properties-common apt-transport-https ca-certificates
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt update
apt install -y containerd.io docker-ce docker-ce-cli
# create required directories
mkdir -p /etc/systemd/system/docker.service.d
# create daemon json config file
tee /etc/docker/daemon.json <<EOF
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2",
"insecure-registries":["10.0.2.15:5000"]
}
EOF
# start and enable docker services
systemctl daemon-reload
systemctl restart docker
systemctl enable docker
# make sure that the br_netfilter module is loaded
lsmod | grep br_netfilter
# configure the systemd cgroup setting in containerd's CRI plugin; needed for k8s 1.25:
# https://serverfault.com/questions/1074008/containerd-1-4-9-unimplemented-desc-unknown-service-runtime-v1alpha2-runtimese
cat > /etc/containerd/config.toml <<EOF
[plugins."io.containerd.grpc.v1.cri"]
systemd_cgroup = true
EOF
systemctl restart containerd
# enable kubelet service.
systemctl enable kubelet
# pull container images
kubeadm config images pull
# bootstrap a cluster without using DNS endpoint
kubeadm init --pod-network-cidr=192.168.0.0/16
# point path to kube config
export KUBECONFIG=/etc/kubernetes/admin.conf
# check status
kubectl cluster-info
# install network plugin Calico
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml # this file also uses 192.168.0.0/16
kubectl get pods -n calico-system
# confirm master node is ready
kubectl get nodes -o wide
# setup kube config for ubuntu user
mkdir -p /home/ubuntu/.kube
cp -i /etc/kubernetes/admin.conf /home/ubuntu/.kube/config
chown -R ubuntu:ubuntu /home/ubuntu/.kube
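At this point it can be useful to confirm that the container runtime, kubelet and the cluster itself are healthy. A minimal set of checks (run as root, with KUBECONFIG pointing at /etc/kubernetes/admin.conf) might look like this:
# confirm the runtime and kubelet are running
systemctl is-active containerd kubelet
# confirm all system Pods come up and note any taints on the node
kubectl get pods -A
kubectl describe node $(hostname) | grep -i taint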
5. Mellanox RDMA NIC setup
Follow the instructions on the NVidia Network Operator page to install the Network Operator.
Note
For Kubernetes version 1.25 or later, use Network Operator version 1.4.0
For Kubernetes version 1.24, use Network Operator version 1.3.0
5.1. Install the NVidia Network Operator
Execute the following commands as a standard user, for example “ubuntu”:
# install Helm
sudo snap install helm --classic
# add nvidia Helm charts repo
helm repo add nvidia https://mellanox.github.io/network-operator
helm repo update
# install NVidia Network Operator
cat > network-operator-values.yaml <<EOF
nfd:
  enabled: true
# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      drivers: ["mlx5_core"]
nvPeerDriver:
  deploy: false
secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  multus:
    deploy: true
  ipamPlugin:
    deploy: true
# to remove node affinity in multus ds
nodeAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 1
      preference:
        matchExpressions:
          - key: node-role.kubernetes.io/master
            operator: DoesNotExist
          - key: node-role.kubernetes.io/control-plane
            operator: DoesNotExist
EOF
helm install -n network-operator --create-namespace \
-f network-operator-values.yaml --wait --version 1.4.0 \
network-operator nvidia/network-operator
# check network operator pods
kubectl -n network-operator get pods
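Once the operator Pods are running, you can also check that the shared RDMA device resource is advertised on the node. This is a sketch; the resource name matches the rdma_shared_device_a value configured above:
# the allocatable resources should include rdma/rdma_shared_device_a
kubectl get node <node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep rdma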
5.2. Configure MACVLAN
If you run everything on one node, then you should remove any NoSchedule taints from the node:
$ kubectl edit node <node-name>
Then look for any NoSchedule taints and remove them (an equivalent command-line approach is shown below).
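As a sketch of the command-line equivalent, the common control-plane taints can be removed as follows; the taint keys that are actually present depend on your Kubernetes version:
kubectl taint node <node-name> node-role.kubernetes.io/control-plane:NoSchedule-
kubectl taint node <node-name> node-role.kubernetes.io/master:NoSchedule-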
Configure MACVLAN as shown in Listing 5.2. You will need to set two fields:
- Replace <RDMA_INTERFACE> with the name of the RDMA interface on the host (the example commands after this list show ways to find it).
- Replace <RANGE_START-RANGE_END> with a safe range of IP addresses that are available in the network that the RDMA interface is connected to, for example “10.254.151.230-10.254.151.240/24”.
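If you are not sure which host interface is the RDMA NIC, commands such as the following can help; ibdev2netdev is part of Mellanox OFED and rdma is part of iproute2, so availability depends on what is installed on the host:
ibdev2netdev      # maps Mellanox devices (mlx5_X) to network interface names
rdma link show    # lists RDMA links and their netdev names
ip -br addr       # brief listing of all interfaces and their addresses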
Then apply this configuration:
$ kubectl apply -f macvlan.yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net
spec:
  networkNamespace: "default"
  master: "<RDMA_INTERFACE>"
  mode: "bridge"
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info",
      "range": "<RANGE_START-RANGE_END>"
    }
To check this has worked, look in /etc/cni/net.d/ and check that files and folders are present for Multus and Whereabouts. You should see the file 00-multus.conf and the directories multus.d and whereabouts.d:
$ sudo ls -al /etc/cni/net.d/
drwx------ 4 root root 4096 Dec 15 14:13 .
drwx------ 3 root root 4096 Dec 15 12:18 ..
-rw------- 1 root root 861 Dec 15 14:13 00-multus.conf
-rw-r--r-- 1 root root 808 Dec 15 13:39 10-calico.conflist
-rw------- 1 root root 2721 Jan 10 23:33 calico-kubeconfig
drwxr-xr-x 2 root root 4096 Dec 15 14:13 multus.d
drwxr-xr-x 2 root root 4096 Dec 15 14:13 whereabouts.d
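You can also confirm on the Kubernetes side that the MacvlanNetwork was accepted and that the corresponding NetworkAttachmentDefinition was generated. This is a sketch; the resource name follows the rdma-net example above:
kubectl get macvlannetworks
kubectl get network-attachment-definitions -n default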
6. Test the Pod
Now we can verify that the applied networking configuration can be used. The following Pod definition uses the “rdma-net” network and RDMA resources configured previously.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net
spec:
  containers:
  - image: mellanox/rping-test
    name: rdma-test-ctr
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        rdma/rdma_shared_device_a: 1
      requests:
        rdma/rdma_shared_device_a: 1
    command:
    - sh
    - -c
    - sleep infinity
Access the Pod created by this definition and check that the RDMA interface is present and how the routing table looks. For example:
$ kubectl exec -it rdma-test-pod -- bash
$ ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
4: eth0@...: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default link-netnsid 0
inet 192.168.34.155/32 scope global eth0
valid_lft forever preferred_lft forever
5: net1@...: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link-netnsid 0
inet 10.193.242.225/24 brd 10.193.242.255 scope global net1
valid_lft forever preferred_lft forever
In the Pod, there should be 3 network interfaces:
Loopback
The regular Kubernetes Pod interface, “eth0” in this example
“net1”, the extra RDMA interface configured by Multus
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         169.254.1.1     0.0.0.0         UG    0      0        0 eth0
10.193.242.0    0.0.0.0         255.255.255.0   U     0      0        0 net1
169.254.1.1     0.0.0.0         255.255.255.255 UH    0      0        0 eth0
Main traffic goes upstream to the Internet via “eth0”. “net1” is an interface for the RDMA network where IPU-Machines reside.
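If the test image provides the standard libibverbs utilities (the mellanox/rping-test image is expected to include them), the RDMA device itself can also be checked from inside the Pod. A minimal sketch:
kubectl exec rdma-test-pod -- ibv_devices   # list RDMA devices visible in the Pod
kubectl exec rdma-test-pod -- ibv_devinfo   # show device details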
7. Usage examples
7.1. Kubernetes native examples with IPU
Note
In these examples, IPUOF_VIPU_API_HOST is set to “localhost”. However, in general, the value of IPUOF_VIPU_API_HOST is specific to the system being used. It will usually be the IP address or hostname of a remote machine (not “localhost”).
On Gcore systems, the correct value can be determined by looking at the value of the IPUOF_VIPU_API_HOST environment variable on the Gcore host that you log into.
apiVersion: batch/v1
kind: Job
metadata:
  name: simple-training
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: rdma-net
    spec:
      nodeSelector:
        vipu-ctrl: vipu-controller
      securityContext:
        runAsUser: 0
      containers:
      - name: mnist-training
        image: graphcore/pytorch-tutorials:latest
        workingDir: "/opt/tutorials/simple_applications/pytorch/mnist"
        command:
        - "bash"
        args:
        - "-c"
        - "pip3 install -r requirements.txt && python3 mnist_poptorch.py --epochs=1"
        env:
        - name: IPUOF_VIPU_API_HOST
          value: "localhost"
        - name: IPUOF_VIPU_API_PORT
          value: "8090"
        - name: IPUOF_VIPU_API_PARTITION_ID
          value: "training_partition"
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/rdma_shared_device_a: 1
          requests:
            rdma/rdma_shared_device_a: 1
      restartPolicy: OnFailure
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-inference
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: rdma-net
      labels:
        app: simple-inference
    spec:
      nodeSelector:
        vipu-ctrl: vipu-controller
      securityContext:
        runAsUser: 0
      containers:
      - name: mnist-inference
        image: graphcore/pytorch-tutorials:latest
        command:
        - "sleep"
        - "infinity"
        env:
        - name: IPUOF_VIPU_API_HOST
          value: "localhost"
        - name: IPUOF_VIPU_API_PORT
          value: "8090"
        - name: IPUOF_VIPU_API_PARTITION_ID
          value: "inference_partition"
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/rdma_shared_device_a: 1
          requests:
            rdma/rdma_shared_device_a: 1
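To run these examples, save the manifests to files and apply them; the file names used here (simple-training.yaml, simple-inference.yaml) are only illustrative:
kubectl apply -f simple-training.yaml
kubectl logs -f job/simple-training
kubectl apply -f simple-inference.yaml
kubectl get pods -l app=simple-inference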
7.2. Using the IPU operator
We can use the IPU Operator to schedule jobs to IPUs. For more information about the IPU Operator see Kubernetes IPU Operator User Guide.
Prepare a values file for the IPU Operator Helm chart, for example ipu-operator-values.yaml.
global:
  vipuControllers: "<VIPU_SERVER_ADDRESS>:ipunode=true"
worker:
  hostNetwork: false
You will need to replace <VIPU_SERVER_ADDRESS> with the V-IPU server address. You can find the server address with the command vipu --server-version. An example, with the server address “10.5.212.116:8090”, is shown below:
$ vipu --server-version
version: 1.18.0
host: 10.5.212.116:8090
The hostNetwork setting must be disabled (set to false) so that the host network will not be used and privileged mode is avoided.
Now install the IPU Operator. Run this as a regular user:
# install IPUJob CRD
$ curl -s https://api.github.com/repos/graphcore/helm-charts/releases/latest | \
grep -wo "https.*ipujobs.*yaml" | wget -qi -
$ kubectl apply -f graphcore_ai_ipujobs_*.yaml
# install IPU operator
$ helm repo add gc https://helm-charts.graphcore.ai/
$ helm repo update
$ helm install -n ipu-operator --create-namespace --wait --version 1.1.0 \
    -f ipu-operator-values.yaml ipu-operator gc/ipu-operator
$ for node in $(kubectl get nodes -o name | grep -v control-plane); do \
kubectl label $node ipunode=true; \
done
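To confirm that the operator is running and that the nodes have been labelled, checks such as the following can be used; the namespace and label names follow the values used above:
kubectl -n ipu-operator get pods
kubectl get nodes -L ipunode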
Now let’s try running an IPUJob. Save the code shown in Listing 7.5 and run it:
$ kubectl apply -f ipujob.yaml
ipujob/mnist-training created
Note
The Pod has to be run as user 0 (root). This is required by the RDMA libraries; otherwise, access to the RDMA device will be rejected.
apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  name: mnist-training
spec:
  # jobInstances defines the number of job instances.
  # More than 1 job instance is usually useful for inference jobs only.
  jobInstances: 1
  # ipusPerJobInstance refers to the number of IPUs required per job instance.
  # A separate IPU partition of this size will be created by the IPU Operator
  # for each job instance.
  ipusPerJobInstance: "1"
  workers:
    template:
      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: rdma-net
      spec:
        securityContext:
          runAsUser: 0
        containers:
        - name: mnist-training
          image: graphcore/pytorch-tutorials:latest
          workingDir: "/opt/tutorials/simple_applications/pytorch/mnist"
          command: ["bash"]
          args: ["-c", "pip3 install -r requirements.txt && python3 mnist_poptorch.py --epochs=1"]
          securityContext:
            capabilities:
              add: [ "IPC_LOCK" ]
          resources:
            limits:
              rdma/rdma_shared_device_a: 1
            requests:
              rdma/rdma_shared_device_a: 1
        restartPolicy: Never
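Once the IPUJob has been submitted, its progress can be followed with standard kubectl commands. This is a sketch; the exact worker Pod name is generated by the IPU Operator and will differ on your cluster:
kubectl get ipujobs
kubectl describe ipujob mnist-training
kubectl get pods
kubectl logs -f <mnist-training-worker-pod>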