6. IPU usage statistics
The V-IPU Proxy in the IPU Operator keeps track of the IPU partitions used by IPUJobs running inside the cluster. The data is stored in a ConfigMap that links each IPUJob with the partition it is using. On top of that tracker ConfigMap, the V-IPU Proxy exposes a couple of read-only REST endpoints that you can use.
By default, these endpoints are exposed only within the cluster, so you can query them from any container that has curl, using the following commands:
/stats
Start a Pod with curl:
$ kubectl run mycurlpod -it --rm --image=dwdraju/alpine-curl-jq -- sh
And then from inside the Pod run:
$ curl ipu-operator-vipu-proxy/stats | jq .
{
  "default": {    # the default namespace
    "used": 4,
    "available": 28
  },
  "total": {
    "used": 4,
    "available": 28
  }
}
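The totals reported by /stats can be turned into a single utilization figure. The following is a minimal sketch, not part of the IPU Operator: the function name ipu_utilization is hypothetical, and it assumes that "available" counts IPUs that are still free, so that the pool size is used + available.

```shell
# Hypothetical helper: print the integer percentage of IPUs in use.
# Assumption: "available" means free IPUs, so pool size = used + available.
ipu_utilization() {
  used=$1
  available=$2
  total=$((used + available))
  # Avoid dividing by zero when no IPUs are reported at all.
  if [ "$total" -eq 0 ]; then
    echo 0
    return
  fi
  # Integer division, so the result is rounded down.
  echo $((100 * used / total))
}

# Inside the curl Pod, feed it the cluster-wide totals from /stats:
# stats=$(curl -s ipu-operator-vipu-proxy/stats)
# ipu_utilization "$(echo "$stats" | jq '.total.used')" \
#                 "$(echo "$stats" | jq '.total.available')"
```

The same function works for a single namespace if you read that namespace's key (for example .default) instead of .total.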
/query
Again, from inside the Pod, run:
$ curl --request POST -H "Content-Type: application/json" --data '{"size":2}' \
  ipu-operator-vipu-proxy/query | jq .
{
  "available": true,
  "numOfPartitions": 14,    # 14 possible partitions are available
  "message": ""
}
$ curl --request POST -H "Content-Type: application/json" --data '{"size":64}' \
  ipu-operator-vipu-proxy/query | jq .
{
  "available": true,
  "numOfPartitions": 0,    # no partitions of the requested size are available
  "message": ""
}
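If you query /query for several sizes, it can help to wrap the request in a small function. This is a sketch under assumptions: query_payload and partitions_for are hypothetical names, and the only response field relied on is numOfPartitions, as it appears in the output of the query commands.

```shell
# Hypothetical helpers, not part of the IPU Operator.

# query_payload SIZE: print the JSON body for /query.
# Rejects empty or non-numeric sizes so that malformed JSON is never sent.
query_payload() {
  case "$1" in
    ''|*[!0-9]*)
      echo "size must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf '{"size":%s}\n' "$1"
}

# partitions_for SIZE: POST the query and print only numOfPartitions.
partitions_for() {
  body=$(query_payload "$1") || return 1
  curl --silent --fail --request POST \
    -H "Content-Type: application/json" \
    --data "$body" \
    ipu-operator-vipu-proxy/query | jq '.numOfPartitions'
}

# Inside the curl Pod:
# partitions_for 2
```

Separating the payload builder from the request keeps the input check usable on its own, for example in a script that validates a requested size before it submits an IPUJob.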
Note
The IPU Operator doesn’t track manually created partitions, so resources allocated outside of the Operator are not included in the reported metrics.
6.1. Operator metrics
The IPU Operator exposes a set of Prometheus metrics that you can use. However, these metrics are exposed behind a protected endpoint. The IPU Operator creates a ClusterRole that grants the permissions to scrape the metrics. To allow your Prometheus server to scrape those metrics, you need to bind that ClusterRole to the service account that the Prometheus server uses.
# Find the ClusterRole you must use. Note that the name can be different in your installation.
$ kubectl get clusterrole -l component=metrics-reader
NAME CREATED AT
ipu-operator-metrics-reader 2022-03-24T11:13:07Z
# Create a ClusterRoleBinding. The service account must be the one
# that the Prometheus server uses.
$ kubectl create clusterrolebinding metrics --clusterrole=ipu-operator-metrics-reader \
--serviceaccount=<namespace>:<service-account-name>
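If you keep cluster configuration in version control, the same binding can be written as a manifest instead of being created imperatively. The sketch below is one way to do that; render_metrics_binding is a hypothetical helper, the binding name metrics and the ClusterRole name are copied from the commands above, and the namespace and service account in the example call are placeholders.

```shell
# Hypothetical helper: print a ClusterRoleBinding manifest that binds the
# metrics-reader ClusterRole to the given service account.
# Usage: render_metrics_binding NAMESPACE SERVICE_ACCOUNT
render_metrics_binding() {
  namespace=$1
  service_account=$2
  cat <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ipu-operator-metrics-reader
subjects:
  - kind: ServiceAccount
    name: ${service_account}
    namespace: ${namespace}
EOF
}

# Example with placeholder values; replace them with your Prometheus
# server's namespace and service account:
# render_metrics_binding monitoring prometheus-server | kubectl apply -f -
```

If the ClusterRole has a different name in your installation, change the roleRef name to match the output of the kubectl get clusterrole command.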