5. Debugging problems
When something does not work as expected, you need to debug the problem to understand how to fix it.
Before looking at how to debug a failed IPUJob, it helps to
understand how the IPU Operator executes an IPUJob.
5.1. How does the IPU Operator work?
The IPU Operator consists of a few components:
Controller – this runs the reconcile loop. It makes sure the current state of the IPUJob matches the desired state you define.
It also creates the required IPU partition by interacting with the V-IPU Proxy. If the V-IPU Proxy sees that a partition already exists and is already owned by the given IPUJob, it will reset the partition. This makes it possible to restart the job with a clean IPU partition.
Admission webhooks – there are two webhooks:
    Defaulting (Mutation) webhook – adds some default values to the submitted IPUJob specification.
    Validating webhook – validates the IPUJob specification.
V-IPU Proxy – proxies IPU partition operations to the V-IPU Controller and keeps track of the partitions created for jobs running inside the cluster.
When an IPUJob is created in the cluster, the IPU Operator gets notified and
creates the following Kubernetes resources:
A ConfigMap to hold a couple of things:
    A kubeexec script which is used by MPI to trigger remote execution with the Worker Pods from the Launcher Pod.
    A hostfile which lists the Worker Pods that MPI will use for remote execution.
A Kubernetes RBAC role, role-binding and service account for the Launcher Pod which allows the Launcher to list and watch Worker Pods and exec into them.
A set of Worker Pods which participate in the job execution. These Pods are placed into a 365-day sleep as the main background process until the Launcher triggers job processes on them.
A Launcher Pod which contains two components:
    An init-container which runs a small application we provide with the IPU Operator. This program watches the Worker Pods until they are all available and in the Ready state.
    The main container which uses the image you provide and runs the user-defined command to trigger the job execution in Worker Pods.
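To check that these resources exist for a submitted job, one option is to list the relevant resource types in the job's namespace. The following is a minimal sketch, not something shipped with the IPU Operator: the function name list_ipujob_resources and the KUBECTL override variable are hypothetical. It lists everything of those types in the namespace, because the names and labels the operator gives each resource are not covered here.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the IPU Operator.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command
# instead of running it against a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# List the resource types that the IPU Operator creates for an IPUJob.
# $1: namespace where the job was deployed
list_ipujob_resources() {
    job_namespace="$1"
    "$KUBECTL" get configmaps,roles,rolebindings,serviceaccounts,pods \
        -n "$job_namespace"
}

# Usage:
#   list_ipujob_resources <the-namespace-where-the-job-was-deployed>
```

If the Launcher Pod stays in its init phase, the Pod list from this command is a quick way to see which Worker Pods are not yet in the Ready state.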
The IPU Operator also sets environment variables on the Worker Pods that allow
Poplar to see and use the IPUs when running the AI/ML program.
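If a job cannot see its IPUs, it can help to inspect the environment inside a Worker Pod. The sketch below assumes the Worker image provides the env command and that you already know the Worker Pod name (for example from kubectl get pods). The function name show_worker_env and the KUBECTL override variable are hypothetical, and the exact variable names the operator sets are not listed here, so the output is printed in full rather than filtered.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the IPU Operator.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

# Print the environment seen by a process started in a Worker Pod, sorted by name.
# Assumes the container image includes the `env` command.
# $1: Worker Pod name
# $2: namespace where the job was deployed
show_worker_env() {
    worker_pod="$1"
    job_namespace="$2"
    "$KUBECTL" exec "$worker_pod" -n "$job_namespace" -- env | sort
}

# Usage:
#   show_worker_env <worker-pod-name> <the-namespace-where-the-job-was-deployed>
```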
5.2. Debugging
There are a few places to look for debug info:
The status updates for the IPUJob, which you can find by using kubectl to describe the job.
The Launcher Pod logs which can be found by running:
$ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>
The Controller logs which can be found by running:
$ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>
The V-IPU Proxy logs which can be found by running:
$ kubectl logs <vipu-proxy-pod-name> -n <the-namespace-where-the-operator-was-deployed>
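When reporting a problem it is often convenient to gather all four sources in one go. The following is a minimal sketch under a few assumptions: the function name collect_ipujob_debug and the KUBECTL override variable are hypothetical, and the resource name ipujob passed to kubectl describe is assumed from the IPUJob kind. The logs commands are the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of the IPU Operator.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Gather the debug sources for one IPUJob. Each command runs even if an
# earlier one fails, so a missing Pod does not hide the other outputs.
# $1: IPUJob name
# $2: namespace where the job was deployed
# $3: controller manager Pod name
# $4: V-IPU Proxy Pod name
# $5: namespace where the operator was deployed
collect_ipujob_debug() {
    job="$1"
    job_ns="$2"
    controller_pod="$3"
    proxy_pod="$4"
    operator_ns="$5"

    echo "== IPUJob status =="
    # Assumption: kubectl accepts "ipujob" as the resource name for the IPUJob kind.
    "$KUBECTL" describe ipujob "$job" -n "$job_ns"

    echo "== Launcher Pod logs =="
    "$KUBECTL" logs "${job}-launcher" -n "$job_ns"

    echo "== Controller logs =="
    "$KUBECTL" logs "$controller_pod" -n "$operator_ns"

    echo "== V-IPU Proxy logs =="
    "$KUBECTL" logs "$proxy_pod" -n "$operator_ns"
}

# Usage, writing everything to a single file:
#   collect_ipujob_debug <ipujob-name> <job-namespace> \
#       <controller-manager-pod-name> <vipu-proxy-pod-name> \
#       <operator-namespace> > ipujob-debug.txt 2>&1
```

The controller manager and V-IPU Proxy Pod names can be looked up by listing the Pods in the namespace where the operator was deployed, using kubectl get pods -n with that namespace.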