5. Debugging problems
When something does not work as expected, you need to debug the problem to understand how to fix it.
Before we talk about how to debug a failed IPUJob, it is probably good to understand how the IPU Operator executes an IPUJob.
5.1. How does the IPU Operator work?
The IPU Operator consists of a few components:
- Controller: runs the reconcile loop. It makes sure the current state of the IPUJob matches the desired state you define. It also creates the required IPU partition by interacting with the V-IPU Proxy. If the V-IPU Proxy sees that a partition already exists and is already owned by the given IPUJob, it resets the partition. This makes it possible to restart the job with a clean IPU partition.
- Admission webhooks: there are two webhooks:
  - Defaulting (mutating) webhook: adds default values to the submitted IPUJob specification.
  - Validating webhook: validates the IPUJob specification.
- V-IPU Proxy: proxies IPU partition operations to the V-IPU Controller and keeps track of the partitions created for jobs running inside the cluster.
When an IPUJob is created in the cluster, the IPU Operator gets notified and creates the following Kubernetes resources:
- A ConfigMap that holds two items:
  - A kubeexec script, which MPI uses to trigger remote execution on the Worker Pods from the Launcher Pod.
  - A hostfile, which lists the Worker Pods that MPI will use for remote execution.
- A Kubernetes RBAC role, role binding and service account for the Launcher Pod, which allow the Launcher to list and watch Worker Pods and exec into them.
- A set of Worker Pods which participate in the job execution. These Pods are placed into a 365-day sleep as the main background process until the Launcher triggers job processes on them.
- A Launcher Pod which contains two components:
  - An init container which runs a small application we provide with the IPU Operator. This program watches the Worker Pods until they are all available and in the Ready state.
  - The main container, which uses the image you provide and runs the user-defined command to trigger the job execution in the Worker Pods.
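To see these resources for a running job, you can list them with kubectl. The following is a minimal sketch and not part of the IPU Operator: the function name list_job_resources and the KUBECTL override variable are hypothetical. It lists every resource of these kinds in the namespace, because the exact names the operator gives to the ConfigMap, role, role binding and service account are not stated here.

```shell
# Hypothetical helper, not shipped with the IPU Operator.
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# List the kinds of resources the IPU Operator creates for an IPUJob.
# $1: namespace where the IPUJob was deployed.
list_job_resources() {
    namespace="$1"
    if [ -z "$namespace" ]; then
        echo "usage: list_job_resources <namespace>" >&2
        return 1
    fi
    "$KUBECTL" get configmaps,roles,rolebindings,serviceaccounts,pods -n "$namespace"
}
```

For example, list_job_resources my-namespace shows the Launcher and Worker Pods alongside the supporting resources, assuming nothing else of those kinds is deployed in that namespace.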
The IPU Operator also sets environment variables on the Worker Pods that allow Poplar to see and use the IPUs when running the AI/ML program.
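One way to check which variables were set is to print the environment inside a Worker Pod. This is a minimal sketch, assuming the Worker image provides an env binary. The names worker_env and KUBECTL are hypothetical, and the Worker Pod name has to be looked up first (for example with kubectl get pods), since the naming scheme for Worker Pods is not described here.

```shell
# Hypothetical helper, not shipped with the IPU Operator.
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Print the environment variables that a new process started in the
# Worker Pod's container would see.
# $1: Worker Pod name, $2: namespace where the IPUJob was deployed.
worker_env() {
    pod="$1"
    namespace="$2"
    if [ -z "$pod" ] || [ -z "$namespace" ]; then
        echo "usage: worker_env <worker-pod-name> <namespace>" >&2
        return 1
    fi
    "$KUBECTL" exec "$pod" -n "$namespace" -- env
}
```

Piping the output through grep -i ipu is one way to narrow it down, assuming the relevant variable names contain "IPU".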
5.2. Debugging
There are a few places to look for debug info:
- The status updates for the IPUJob, which you can find by using kubectl to describe the job.
- The Launcher Pod logs, which can be found by running:
    $ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>
- The Controller logs, which can be found by running:
    $ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>
- The V-IPU Proxy logs, which can be found by running:
    $ kubectl logs <vipu-proxy-pod-name> -n <the-namespace-where-the-operator-was-deployed>
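The first two checks can be combined into a small helper, and the Pod names needed for the last two can be found by listing the Pods in the operator namespace. This is a minimal sketch under stated assumptions: debug_ipujob, list_operator_pods and KUBECTL are hypothetical names, and it assumes the IPUJob custom resource can be addressed as ipujob on the kubectl command line.

```shell
# Hypothetical helpers, not shipped with the IPU Operator.
# KUBECTL can be overridden (for example with "echo") for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Show the IPUJob status updates, then the Launcher Pod logs.
# $1: IPUJob name, $2: namespace where the job was deployed.
debug_ipujob() {
    job="$1"
    namespace="$2"
    if [ -z "$job" ] || [ -z "$namespace" ]; then
        echo "usage: debug_ipujob <ipujob-name> <namespace>" >&2
        return 1
    fi
    # Assumption: the IPUJob resource is addressable as "ipujob".
    "$KUBECTL" describe ipujob "$job" -n "$namespace"
    # The Launcher Pod is named <ipujob-name>-launcher.
    "$KUBECTL" logs "${job}-launcher" -n "$namespace"
}

# List the Pods in the operator namespace so you can read off the
# controller manager and V-IPU Proxy Pod names.
# $1: namespace where the operator was deployed.
list_operator_pods() {
    if [ -z "$1" ]; then
        echo "usage: list_operator_pods <operator-namespace>" >&2
        return 1
    fi
    "$KUBECTL" get pods -n "$1"
}
```

With the Pod names from list_operator_pods, the kubectl logs commands for the Controller and the V-IPU Proxy can be run as shown in the list.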