1. Introduction

IPU workloads require access to the RDMA network interface, but this is not available by default in Kubernetes. One quick workaround is to run a Pod in privileged mode with access to the host network. This is not acceptable in a production environment because it removes the isolation between the Pod and the host.

This document describes how to configure a Kubernetes cluster to use MACVLAN so that this workaround is not needed. MACVLAN is a virtual network driver that creates multiple virtual interfaces, each with its own MAC and IP address, on top of a single physical network interface.
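
As an illustration of the underlying mechanism (outside Kubernetes, using standard iproute2 commands; the interface name and address below are placeholders, not part of the setup described in this document), a MACVLAN interface can be created like this:

# create a macvlan interface on top of a physical NIC (names are examples)
$ sudo ip link add macvlan0 link eth0 type macvlan mode bridge
$ sudo ip addr add 10.0.0.50/24 dev macvlan0
$ sudo ip link set macvlan0 up

In the setup described here, the Multus and Whereabouts plugins create and address such interfaces automatically for each Pod.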

Access to the RDMA interface is made available as a secondary network thanks to the Multus plugin and the NVidia Network Operator.

2. Overview

This document describes how to install and configure a Kubernetes cluster where:

  1. A Kubernetes node is running on the Poplar server

  2. Pods running on this node can access the RDMA NIC that provides access to IPU-Machines

  3. Pods do not use the host network or run in privileged mode

  4. MACVLAN is used to access the RDMA NIC

The components that need to be installed and configured are described in the following sections.

The Graphcore Kubernetes IPU Operator will also need to be installed, if it isn’t already, to test that workloads can run and get access to RDMA. The IPU Operator needs to be configured to run worker Pods without a host network. See Section 7, Usage examples.

3. Gcore network settings

If you are using the Gcore Cloud service and want to configure a private network, then you need to configure the network as described below.

  • In the Gcore Cloud portal, when creating a new AI cluster, open Network settings and select Private network. Then click Add a new network.

    _images/add-new-network.png

    Fig. 3.1 Adding a new private network

  • In the Create network dialog, enter a name for your private network and select Baremetal Network.

    _images/create-network.png

    Fig. 3.2 Creating a network

  • The next step is to create a subnetwork. Click on Add a new subnetwork. In the Create Subnetwork dialog, enter a name and CIDR for your subnetwork and click Create subnetwork.

    _images/subnetwork.png

    Fig. 3.3 Creating a subnetwork

  • Finally, complete the process by selecting Use floating IP in the Network settings dialog.

    _images/use-floating-ip.png

    Fig. 3.4 Selecting floating IP

4. Kubernetes installation

Execute the following commands as the root user:

Listing 4.1 Installing Kubernetes
    # this is based on https://computingforgeeks.com/deploy-kubernetes-cluster-on-ubuntu-with-kubeadm/
    # https://blog.knoldus.com/how-to-install-kubernetes-on-ubuntu-20-04-kubeadm-and-minikube/

    # prepare repos for Kubernetes
    apt -y install curl apt-transport-https
    curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
    echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | tee /etc/apt/sources.list.d/kubernetes.list

    # install Kubernetes
    apt update
    apt -y install vim git curl wget kubelet kubeadm kubectl
    apt-mark hold kubelet kubeadm kubectl

    # check versions
    kubectl version --client
    kubeadm version

    # disable swap, probably already off
    swapoff -a

    # enable kernel modules
    modprobe overlay
    modprobe br_netfilter

    # add some settings to sysctl
    tee /etc/sysctl.d/kubernetes.conf<<EOF
    net.bridge.bridge-nf-call-ip6tables = 1
    net.bridge.bridge-nf-call-iptables = 1
    net.ipv4.ip_forward = 1
    EOF

    # reload sysctl
    sysctl --system

    # add Docker repo and install its packages
    apt update
    apt install -y curl gnupg2 software-properties-common apt-transport-https ca-certificates
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
    add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
    apt update
    apt install -y containerd.io docker-ce docker-ce-cli

    # create required directories
    mkdir -p /etc/systemd/system/docker.service.d

    # create daemon json config file
    tee /etc/docker/daemon.json <<EOF
    {
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m"
        },
        "storage-driver": "overlay2",
        "insecure-registries":["10.0.2.15:5000"]
    }
    EOF

    # start and enable the Docker service
    systemctl daemon-reload
    systemctl restart docker
    systemctl enable docker

    # make sure that the br_netfilter module is loaded
    lsmod | grep br_netfilter

    # enable the CRI plugin that is needed for Kubernetes 1.25:
    # https://serverfault.com/questions/1074008/containerd-1-4-9-unimplemented-desc-unknown-service-runtime-v1alpha2-runtimese
    cat > /etc/containerd/config.toml <<EOF
    [plugins."io.containerd.grpc.v1.cri"]
        systemd_cgroup = true
    EOF
    systemctl restart containerd

    # enable kubelet service.
    systemctl enable kubelet

    # pull container images
    kubeadm config images pull

    # bootstrap a cluster without using DNS endpoint
    kubeadm init --pod-network-cidr=192.168.0.0/16

    # point path to kube config
    export KUBECONFIG=/etc/kubernetes/admin.conf

    # check status
    kubectl cluster-info

    # install network plugin Calico
    kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
    kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml  # this file also uses 192.168.0.0/16
    kubectl get pods -n calico-system

    # confirm master node is ready
    kubectl get nodes -o wide

    # setup kube config for ubuntu user
    mkdir -p /home/ubuntu/.kube
    cp -i /etc/kubernetes/admin.conf /home/ubuntu/.kube/config
    chown -R ubuntu:ubuntu /home/ubuntu/.kube
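
As a quick sanity check (assuming the kubeconfig copy above succeeded), switch to the ubuntu user and confirm that the cluster responds:

$ su - ubuntu
$ kubectl get nodes
$ kubectl get pods -A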

5. Mellanox RDMA NIC setup

Follow the instructions on the NVidia Network Operator page to install the Network Operator.

Note

  • For Kubernetes version 1.25 or later, use Network Operator version 1.4.0

  • For Kubernetes version 1.24, use Network Operator version 1.3.0

5.1. Install the NVidia Network Operator

Execute the following commands as a standard user, for example “ubuntu”:

Listing 5.1 Installing the NVidia Network Operator
# install Helm
sudo snap install helm --classic

# add nvidia Helm charts repo
helm repo add nvidia https://mellanox.github.io/network-operator
helm repo update

# install NVidia Network Operator
cat > network-operator-values.yaml <<EOF
nfd:
  enabled: true

# NicClusterPolicy CR values:
deployCR: true
ofedDriver:
  deploy: false

rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      drivers: ["mlx5_core"]

nvPeerDriver:
  deploy: false

secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  multus:
    deploy: true
  ipamPlugin:
    deploy: true

# to remove the default node affinity so the Multus daemonset can run on all nodes
nodeAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - weight: 1
    preference:
      matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
EOF

helm install -n network-operator --create-namespace \
  -f network-operator-values.yaml --wait --version 1.4.0 \
  network-operator nvidia/network-operator

# check network operator pods
kubectl -n network-operator get pods
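
Once the operator Pods are running, one way to confirm that the rdma_shared_device_a resource defined in the values file is being advertised is to inspect the node (the node name is a placeholder):

$ kubectl describe node <node-name> | grep rdma

The rdma/rdma_shared_device_a resource should appear under both Capacity and Allocatable.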

5.2. Configure MACVLAN

If you run everything on one node, then you should remove any NoSchedule taints from the node:

$ kubectl edit node <node-name>

Then look for any NoSchedule taints and remove them.
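
Alternatively, taints can be removed without editing the node object. A minimal sketch using kubectl taint (the exact taint keys depend on your Kubernetes version, so check them first):

$ kubectl describe node <node-name> | grep -A3 Taints
$ kubectl taint node <node-name> node-role.kubernetes.io/control-plane:NoSchedule-
$ kubectl taint node <node-name> node-role.kubernetes.io/master:NoSchedule-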

Configure MACVLAN as shown in Listing 5.2. You will need to set two fields:

  • Replace <RDMA_INTERFACE> with the name of the RDMA interface on the host (one way to find this is shown after this list)

  • Replace <RANGE_START-RANGE_END> with a safe range of IP addresses that are available in the network that the RDMA interface is connected to, for example “10.254.151.230-10.254.151.240/24”.
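
One way to find the RDMA interface name is sketched below. ibdev2netdev is shipped with the Mellanox OFED tools and may not be installed on every system; rdma link show from iproute2 is an alternative:

# list RDMA devices and the network interfaces they map to
$ ibdev2netdev
# or, using iproute2
$ rdma link show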

Then apply this configuration:

$ kubectl apply -f macvlan.yaml
Listing 5.2 Configuring MACVLAN
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net
spec:
  networkNamespace: "default"
  master: "<RDMA_INTERFACE>"
  mode: "bridge"
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info",
      "range": "<RANGE_START-RANGE_END>"
    }

To check that this has worked, look in /etc/cni/net.d/ and confirm that the Multus and Whereabouts files and directories are present. You should see the file 00-multus.conf and the directories multus.d and whereabouts.d:

$ sudo ls -al /etc/cni/net.d/
drwx------ 4 root root 4096 Dec 15 14:13 .
drwx------ 3 root root 4096 Dec 15 12:18 ..
-rw------- 1 root root  861 Dec 15 14:13 00-multus.conf
-rw-r--r-- 1 root root  808 Dec 15 13:39 10-calico.conflist
-rw------- 1 root root 2721 Jan 10 23:33 calico-kubeconfig
drwxr-xr-x 2 root root 4096 Dec 15 14:13 multus.d
drwxr-xr-x 2 root root 4096 Dec 15 14:13 whereabouts.d
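
You can also check from the Kubernetes side that the MacvlanNetwork was processed. The operator should create a matching NetworkAttachmentDefinition in the namespace given by networkNamespace ("default" here), so a check along these lines is reasonable:

$ kubectl get macvlannetworks
$ kubectl get network-attachment-definitions -n default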

6. Test the Pod

Now we can verify that the applied networking configuration can be used. The following Pod definition uses the “rdma-net” network and RDMA resources configured previously.

Listing 6.1 Testing RDMA access
    apiVersion: v1
    kind: Pod
    metadata:
      name: rdma-test-pod
      annotations:
        k8s.v1.cni.cncf.io/networks: rdma-net
    spec:
      containers:
      - image: mellanox/rping-test
        name: rdma-test-ctr
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/rdma_shared_device_a: 1
          requests:
            rdma/rdma_shared_device_a: 1
        command:
        - sh
        - -c
        - sleep infinity

Open a shell in the Pod created by this manifest and check that the RDMA interface is present and what the routing table looks like. For example:

$ kubectl exec -it rdma-test-pod -- bash
$ ip -4 a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
4: eth0@if…: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1480 qdisc noqueue state UP group default  link-netnsid 0
    inet 192.168.34.155/32 scope global eth0
       valid_lft forever preferred_lft forever
5: net1@if…: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default  link-netnsid 0
    inet 10.193.242.225/24 brd 10.193.242.255 scope global net1
       valid_lft forever preferred_lft forever

In the Pod, there should be 3 network interfaces:

  1. Loopback

  2. A regular Kubernetes interface, “eth0” in this example

  3. “net1”, the additional RDMA interface configured by Multus

$ route -n
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         169.254.1.1     0.0.0.0         UG    0      0        0 eth0
10.193.242.0    0.0.0.0         255.255.255.0   U     0      0        0 net1
169.254.1.1     0.0.0.0         255.255.255.255 UH    0      0        0 eth0

Normal traffic goes out to the Internet via “eth0”, while “net1” is the interface for the RDMA network where the IPU-Machines reside.
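
Beyond the interface and routes, you can also check that an RDMA device is visible inside the Pod. Assuming the mellanox/rping-test image ships the standard ibverbs utilities (an assumption, not something this document verifies), the following commands should list the device:

$ kubectl exec -it rdma-test-pod -- ibv_devices
$ kubectl exec -it rdma-test-pod -- ibv_devinfo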

7. Usage examples

7.1. Kubernetes native examples with IPU

Note

In these examples, IPUOF_VIPU_API_HOST is set to “localhost”.

However, in general, the value of IPUOF_VIPU_API_HOST is specific to the system being used. It will usually be the IP address or hostname of a remote machine (not “localhost”).

On Gcore systems, the correct value can be determined by looking at the value of the IPUOF_VIPU_API_HOST environment variable on the Gcore host that you log into.
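
For example, on the Gcore host you log into, the value can be printed directly:

$ echo $IPUOF_VIPU_API_HOST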

Listing 7.1 Simple native training example with IPU
apiVersion: batch/v1
kind: Job
metadata:
  name: simple-training
spec:
  backoffLimit: 1
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: rdma-net
    spec:
      nodeSelector:
        vipu-ctrl: vipu-controller
      securityContext:
        runAsUser: 0
      containers:
      - name: mnist-training
        image: graphcore/pytorch-tutorials:latest
        workingDir: "/opt/tutorials/simple_applications/pytorch/mnist"
        command:
          - "bash"
        args:
          - "-c"
          - "pip3 install -r requirements.txt && python3 mnist_poptorch.py --epochs=1"
        env:
          - name: IPUOF_VIPU_API_HOST
            value: "localhost"
          - name: IPUOF_VIPU_API_PORT
            value: "8090"
          - name: IPUOF_VIPU_API_PARTITION_ID
            value: "training_partition"
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/rdma_shared_device_a: 1
          requests:
            rdma/rdma_shared_device_a: 1
      restartPolicy: OnFailure
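
To run the training example, save Listing 7.1 to a file (the filename below is arbitrary), submit it and follow the Job logs:

$ kubectl apply -f simple-training.yaml
$ kubectl get jobs
$ kubectl logs -f job/simple-training
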
Listing 7.2 Simple native inference example with IPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-inference
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: rdma-net
      labels:
        app: simple-inference
    spec:
      nodeSelector:
        vipu-ctrl: vipu-controller
      securityContext:
        runAsUser: 0
      containers:
      - name: mnist-inference
        image: graphcore/pytorch-tutorials:latest
        command:
          - "sleep"
          - "infinity"
        env:
          - name: IPUOF_VIPU_API_HOST
            value: "localhost"
          - name: IPUOF_VIPU_API_PORT
            value: "8090"
          - name: IPUOF_VIPU_API_PARTITION_ID
            value: "inference_partition"
        securityContext:
          capabilities:
            add: [ "IPC_LOCK" ]
        resources:
          limits:
            rdma/rdma_shared_device_a: 1
          requests:
            rdma/rdma_shared_device_a: 1
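
Similarly, the inference Deployment can be applied and its Pod inspected (the filename is arbitrary):

$ kubectl apply -f simple-inference.yaml
$ kubectl get pods -l app=simple-inference
$ kubectl exec -it deploy/simple-inference -- ip -4 a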

7.2. Using the IPU operator

We can use the IPU Operator to schedule jobs to IPUs. For more information about the IPU Operator see Kubernetes IPU Operator User Guide.

Prepare a values file for the IPU Operator Helm chart, for example ipu-operator-values.yaml.

Listing 7.3 ipu-operator-values.yaml
global:
  vipuControllers: "<VIPU_SERVER_ADDRESS>:ipunode=true"
worker:
  hostNetwork: false

  • You will need to replace <VIPU_SERVER_ADDRESS> with the V-IPU server address. You can find the server address with the command vipu --server-version. An example, with the server address “10.5.212.116:8090”, is shown below.

    $ vipu --server-version
    version: 1.18.0
    host: 10.5.212.116:8090
    
  • The hostNetwork setting must be disabled (set to false) so that worker Pods do not use the host network and do not need to run in privileged mode.

Now install the IPU Operator. Run this as a regular user:

Listing 7.4 Installing the IPU Operator
# install IPUJob CRD
$ curl -s https://api.github.com/repos/graphcore/helm-charts/releases/latest | \
  grep -wo "https.*ipujobs.*yaml" | wget -qi -
$ kubectl apply -f graphcore_ai_ipujobs_*.yaml

# install IPU operator
$ helm repo add gc https://helm-charts.graphcore.ai/
$ helm repo update
$ helm install -n ipu-operator --create-namespace --wait --version 1.1.0 \
  -f ipu-operator-values.yaml ipu-operator gc/ipu-operator
$ for node in $(kubectl get nodes -o name | grep -v control-plane); do \
  kubectl label $node ipunode=true; \
done
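
As with the Network Operator, a quick check that the IPU Operator Pods are running (assuming the ipu-operator namespace used above):

$ kubectl -n ipu-operator get pods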

Now let’s try running an IPUJob. Save the code shown in Listing 7.5 as ipujob.yaml and apply it:

$ kubectl apply -f ipujob.yaml
ipujob/mnist-training created

Note

The Pod has to be run as user 0 (root). This is required by the RDMA libraries; otherwise, access to the RDMA device will be rejected.

Listing 7.5 Running an IPUJob
apiVersion: graphcore.ai/v1alpha1
kind: IPUJob
metadata:
  name: mnist-training
spec:
  # jobInstances defines the number of job instances.
  # More than 1 job instance is usually useful for inference jobs only.
  jobInstances: 1
  # ipusPerJobInstance refers to the number of IPUs required per job instance.
  # A separate IPU partition of this size will be created by the IPU Operator
  # for each job instance.
  ipusPerJobInstance: "1"
  workers:
    template:
      metadata:
        annotations:
          k8s.v1.cni.cncf.io/networks: rdma-net
      spec:
        securityContext:
          runAsUser: 0
        containers:
        - name: mnist-training
          image: graphcore/pytorch-tutorials:latest
          workingDir: "/opt/tutorials/simple_applications/pytorch/mnist"
          command: ["bash"]
          args: ["-c", "pip3 install -r requirements.txt && python3 mnist_poptorch.py --epochs=1"]
          securityContext:
            capabilities:
              add: [ "IPC_LOCK" ]
          resources:
            limits:
              rdma/rdma_shared_device_a: 1
            requests:
              rdma/rdma_shared_device_a: 1
        restartPolicy: Never
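
Once applied, the IPUJob and the worker Pods it creates can be monitored with standard kubectl commands. The exact Pod names vary, so this is only a sketch:

$ kubectl get ipujobs
$ kubectl get pods | grep mnist-training
$ kubectl logs -f <mnist-training-worker-pod>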