5. Debugging problems
When something does not work as expected, you need to debug the problem to understand how to fix it.
Before looking at how to debug a failed IPUJob, it helps to understand how the IPU Operator executes an IPUJob.
5.1. How does the IPU Operator work?
The IPU Operator consists of a few components:
Controller – this runs the reconcile loop:
It makes sure the current state of the IPUJob matches the desired state you define.
It also creates the required IPU partition by interacting with the V-IPU Proxy. If the V-IPU Proxy sees that a partition already exists and is already owned by the given IPUJob, it will reset the partition. This makes it possible to restart the job with a clean IPU partition.
Admission webhooks – there are two webhooks:
Defaulting (Mutation) webhook – adds some default values to the submitted IPUJob.
Validating webhook – validates the IPUJob.
V-IPU Proxy – proxies IPU partition operations to the V-IPU Controller and keeps track of the partitions created for jobs running inside the cluster.
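As a quick check that the components listed above are present, the following sketch lists the Pods in the operator namespace and the admission webhook registrations in the cluster. The namespace name `ipu-operator` and the `KUBECTL` override variable are assumptions made for illustration; substitute the namespace where the operator was actually deployed.

```shell
# Assumption: "ipu-operator" is a hypothetical namespace name.
OPERATOR_NS="${OPERATOR_NS:-ipu-operator}"
# Allows a different kubectl binary (or a stub) to be substituted.
KUBECTL="${KUBECTL:-kubectl}"

list_operator_components() {
  # Pods running in the operator namespace (Controller, V-IPU Proxy)
  "$KUBECTL" get pods -n "$OPERATOR_NS" -o wide
  # Webhook registrations are cluster-scoped, so no namespace is given
  "$KUBECTL" get mutatingwebhookconfigurations,validatingwebhookconfigurations
}
```

Run `list_operator_components` and confirm that the Controller and V-IPU Proxy Pods are in the Running state before investigating an individual job.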
When an IPUJob is created in the cluster, the IPU Operator is notified and creates the following Kubernetes resources:
A ConfigMap which holds two things:
A kubeexec script, which MPI uses to trigger remote execution on the Worker Pods from the Launcher Pod.
A hostfile, which lists the Worker Pods that MPI will use for remote execution.
A Kubernetes RBAC role, role binding and service account for the Launcher Pod, which allow the Launcher to list and watch Worker Pods and exec into them.
A set of Worker Pods which participate in the job execution. These Pods are placed into a 365-day sleep as the main background process until the Launcher triggers job processes on them.
A Launcher Pod which contains two components:
An init-container which runs a small application we provide with the IPU Operator. This program watches the Worker Pods until they are all available and in the Ready state.
The main container, which uses the image you provide and runs the user-defined command to trigger the job execution in the Worker Pods.
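To see the resources described above for a particular job, something like the following sketch can help. It lists every resource of the relevant types in the job's namespace rather than guessing at individual resource names, since only the `<ipujob-name>-launcher` Pod name is known here. The `JOB_NS` default and the init-container name argument are assumptions; the real container name can be read from `kubectl describe pod`.

```shell
# Assumption: "default" stands in for the namespace the job was deployed to.
JOB_NS="${JOB_NS:-default}"
KUBECTL="${KUBECTL:-kubectl}"

list_job_resources() {
  # ConfigMap, RBAC objects and Pods created alongside the IPUJob
  "$KUBECTL" get configmaps,roles,rolebindings,serviceaccounts,pods -n "$JOB_NS"
}

launcher_init_logs() {
  # $1: IPUJob name
  # $2: init-container name (hypothetical; look it up with `kubectl describe pod`)
  "$KUBECTL" logs "${1}-launcher" -c "$2" -n "$JOB_NS"
}
```

If the Launcher Pod never leaves the init phase, the init-container logs are a reasonable first place to look, because that container waits for all Worker Pods to become Ready.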
The IPU Operator also sets environment variables on the Worker Pods that allow Poplar to see and use the IPUs when running the AI/ML program.
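One way to confirm that these environment variables are present is to print the environment of a Worker Pod. The sketch below assumes you already know the Worker Pod name (for example from `kubectl get pods`); the variable names set by the operator are not listed here, so the output is printed unfiltered and sorted.

```shell
# Assumption: "default" stands in for the namespace the job was deployed to.
JOB_NS="${JOB_NS:-default}"
KUBECTL="${KUBECTL:-kubectl}"

worker_env() {
  # $1: name of a Worker Pod
  # Runs `env` inside the Pod and sorts the result locally for easier reading
  "$KUBECTL" exec "$1" -n "$JOB_NS" -- env | sort
}
```

Because Worker Pods sit in a long sleep until the Launcher triggers the job, `worker_env <worker-pod-name>` should work as soon as the Pod is Running, even before the job processes start.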
There are a few places to look for debug info:
The status updates for the IPUJob, which you can find by using kubectl to describe the job.
Launcher Pod logs, which can be found by running:
$ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>
The Controller logs which can be found by running:
$ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>
V-IPU Proxy logs, which can be found by running:
$ kubectl logs <vipu-proxy-pod-name> -n <the-namespace-where-the-operator-was-deployed>
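The four sources above can be gathered in one pass with a small helper such as the sketch below. It is only an illustration: the function name, argument order and output file names are made up for this example, and the resource name `ipujob` passed to `kubectl describe` is an assumption about how the custom resource is registered in your cluster.

```shell
KUBECTL="${KUBECTL:-kubectl}"

collect_ipujob_debug() {
  # $1: IPUJob name            $2: namespace of the job
  # $3: operator namespace     $4: controller-manager Pod name
  # $5: V-IPU Proxy Pod name   $6: output directory (optional)
  job="$1"; job_ns="$2"; op_ns="$3"; controller_pod="$4"; proxy_pod="$5"
  out_dir="${6:-ipujob-debug}"
  mkdir -p "$out_dir"

  # Status updates (assumption: the resource is addressable as "ipujob")
  "$KUBECTL" describe ipujob "$job" -n "$job_ns" > "$out_dir/ipujob-describe.txt" 2>&1
  # Launcher Pod logs
  "$KUBECTL" logs "${job}-launcher" -n "$job_ns" > "$out_dir/launcher.log" 2>&1
  # Controller logs
  "$KUBECTL" logs "$controller_pod" -n "$op_ns" > "$out_dir/controller.log" 2>&1
  # V-IPU Proxy logs
  "$KUBECTL" logs "$proxy_pod" -n "$op_ns" > "$out_dir/vipu-proxy.log" 2>&1

  # Print where the files were written
  echo "$out_dir"
}
```

Each command writes both its output and any error message to its own file, so a failure in one step (for example a mistyped Pod name) does not prevent the remaining logs from being collected.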