6. IPU usage statistics

The V-IPU Proxy in the IPU Operator tracks which IPU partitions are in use by the IPUJobs running in the cluster. It stores this data in a ConfigMap that links each IPUJob to the partition it is using. On top of that tracker ConfigMap, the V-IPU Proxy exposes two read-only REST endpoints that you can query.

By default, these endpoints are exposed only within the cluster, so you can query them from any container that has curl, using the following commands:

  • /stats

    Start a Pod with curl:

    $ kubectl run mycurlpod -it --rm --image=dwdraju/alpine-curl-jq -- sh
    

    And then from inside the Pod run:

    $ curl ipu-operator-vipu-proxy/stats | jq .
    {
      "default": { # the default namespace
        "used": 4,
        "available": 28
      },
      "total": {
        "used": 4,
        "available": 28
      }
    }
    
  • /query

    Again, from inside the Pod, run:

    $ curl --request POST -H "Content-Type: application/json" --data '{"size":2}' \
      ipu-operator-vipu-proxy/query | jq .
    {
      "available": true,
      "numOfPartitions": 14, # 14 possible partitions are available
      "message": ""
    }
    
    $ curl --request POST -H "Content-Type: application/json" --data '{"size":64}' \
      ipu-operator-vipu-proxy/query | jq .
    {
      "available": true,
      "numOfPartitions": 0, # no partitions of the requested size are available
      "message": ""
    }
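
The two requests above can be wrapped in small shell functions for repeated use from inside the Pod. The following is a minimal sketch, not part of the IPU Operator: the function names and the VIPU_PROXY variable are hypothetical, and it assumes that curl and jq are available and that the Service is reachable as ipu-operator-vipu-proxy, as in the examples above.

```shell
#!/bin/sh
# Sketch only: thin wrappers around the /stats and /query endpoints.
# Assumption: the Service name matches the examples above; override
# VIPU_PROXY if your installation uses a different name.
VIPU_PROXY="${VIPU_PROXY:-ipu-operator-vipu-proxy}"

# Build the JSON body for /query.
# Rejects empty input, non-digits, zero and leading zeros.
vipu_query_payload() {
  case "$1" in
    ''|*[!0-9]*|0*) echo "size must be a positive integer" >&2; return 1 ;;
  esac
  printf '{"size":%s}' "$1"
}

# Print one entry of the /stats response: "total" by default,
# or a namespace such as "default".
vipu_stats() {
  curl -sS "$VIPU_PROXY/stats" | jq --arg key "${1:-total}" '.[$key]'
}

# Print the numOfPartitions field that /query returns for the given size.
vipu_query() {
  _payload=$(vipu_query_payload "$1") || return 1
  curl -sS --request POST -H "Content-Type: application/json" \
    --data "$_payload" "$VIPU_PROXY/query" | jq -r '.numOfPartitions'
}
```

After sourcing the file, `vipu_stats default` prints the used and available counts for the default namespace, and `vipu_query 2` prints only the number of partitions of size 2 that the proxy reports.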
    

Note

The IPU Operator doesn’t track manually created partitions, so resources allocated outside the Operator are not included in the reported metrics.

6.1. Operator metrics

The IPU Operator exposes a set of Prometheus metrics behind a protected endpoint. It also creates a ClusterRole that grants permission to scrape them. To allow your Prometheus server to scrape the metrics, bind that ClusterRole to the service account that the Prometheus server uses.

# Find the ClusterRole you must use. Note that the name can be different in your installation.

$ kubectl get clusterrole -l component=metrics-reader
NAME                          CREATED AT
ipu-operator-metrics-reader   2022-03-24T11:13:07Z

# Create a ClusterRoleBinding. The service account must be the one
# that the Prometheus server uses.

$ kubectl create clusterrolebinding metrics --clusterrole=ipu-operator-metrics-reader \
  --serviceaccount=<namespace>:<service-account-name>
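
Once the binding exists, it can be useful to check that it took effect before pointing Prometheus at the endpoint. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the ClusterRole and ClusterRoleBinding names are the ones used in the commands above, and the last check assumes that the ClusterRole grants "get" on the non-resource URL /metrics, which you should confirm from the output of the describe command.

```shell
#!/bin/sh
# Sketch only: inspect the ClusterRole and binding created above.

# Kubernetes identifies a service account by this user name.
sa_username() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Usage: check_metrics_access <namespace> <service-account-name>
check_metrics_access() {
  # Show the rules granted by the ClusterRole (the name may differ
  # in your installation).
  kubectl describe clusterrole ipu-operator-metrics-reader
  # Show the binding and the service accounts it applies to.
  kubectl get clusterrolebinding metrics -o wide
  # Assumption: the role covers "get" on the non-resource URL /metrics.
  kubectl auth can-i get /metrics --as="$(sa_username "$1" "$2")"
}
```

For example, if Prometheus runs as the service account prometheus in the namespace monitoring (both names are placeholders), run `check_metrics_access monitoring prometheus`.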