5. Debugging problems

When something does not work as expected, you need to debug the problem to understand how to fix it.

Before looking at how to debug a failed IPUJob, it helps to understand how the IPU Operator executes an IPUJob.

5.1. How does the IPU Operator work?

The IPU Operator consists of a few components:

  • Controller – this runs the reconcile loop

    • It makes sure the current state of IPUJob matches the desired state you define.

    • It also creates the required IPU partition by interacting with the V-IPU Proxy. If the V-IPU Proxy sees that a partition already exists and is already owned by the given IPUJob, it resets the partition. This makes it possible to restart the job with a clean IPU partition.

  • Admission webhooks – there are two webhooks:

    • Defaulting (Mutation) webhook – adds some default values to the submitted IPUJob specification.

    • Validating webhook – validates the IPUJob specification.

  • V-IPU Proxy – proxies IPU partition operations to the V-IPU Controller and keeps track of the partitions created for jobs running inside the cluster.

When an IPUJob is created in the cluster, the IPU Operator gets notified and creates the following Kubernetes resources:

  • A ConfigMap to hold a couple of things:

    • A kubeexec script which MPI uses to trigger remote execution on the Worker Pods from the Launcher Pod.

    • A hostfile which lists the Worker Pods that MPI will use for remote execution.

  • A Kubernetes RBAC role, role-binding and service account for the Launcher Pod which allows the Launcher to list and watch Worker Pods and exec into them.

  • A set of Worker Pods which participate in the job execution. Each of these Pods runs a 365-day sleep as its main process, which keeps the Pod alive until the Launcher triggers job processes on it.

  • A Launcher Pod which contains two components:

    • An init-container which runs a small application we provide with the IPU Operator. This program watches the Worker Pods until they are all available and in the Ready state.

    • The main container which uses the image you provide and runs the user-defined command to trigger the job execution in Worker Pods.
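To check which of these resources exist for a particular job, something like the following sketch may help. It lists the resource kinds named above in the job's namespace and keeps the entries whose name contains the IPUJob name. That naming convention is an assumption (only the <ipujob-name>-launcher Pod name is documented here), and list_ipujob_resources and the KUBECTL variable are hypothetical helpers, not part of the IPU Operator.

```shell
# KUBECTL can be overridden, for example to try the helper without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# List ConfigMaps, RBAC objects, service accounts and Pods in a namespace,
# keeping only the entries whose name contains the IPUJob name.
# Assumption: the operator embeds the job name in each resource name.
list_ipujob_resources() {
  job="$1"
  ns="$2"
  $KUBECTL get configmap,role,rolebinding,serviceaccount,pod -n "$ns" \
    | grep -F -- "$job"
}
```

Usage: list_ipujob_resources <ipujob-name> <the-namespace-where-the-job-was-deployed>. If the Launcher Pod is missing from the output while the Worker Pods are present, the Controller logs are a reasonable next place to look.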

The IPU Operator also sets environment variables on the Worker Pods that allow Poplar to see and use the IPUs when running the AI/ML program.
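One way to confirm that these variables reached a Worker Pod is to print the Pod's environment and filter it. The sketch below assumes the IPU-related variables start with IPUOF_; this prefix is not stated in this section, so drop the filter if nothing matches. show_worker_ipu_env and the KUBECTL variable are hypothetical helpers.

```shell
# KUBECTL can be overridden, for example to try the helper without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Print the IPU-related environment variables of one Worker Pod.
# Assumption: the variables of interest are prefixed with IPUOF_.
show_worker_ipu_env() {
  worker_pod="$1"
  ns="$2"
  $KUBECTL exec "$worker_pod" -n "$ns" -- env | grep '^IPUOF_'
}
```

Usage: show_worker_ipu_env <worker-pod-name> <the-namespace-where-the-job-was-deployed>. An empty result suggests either that the prefix assumption is wrong or that the variables were not set on the Pod.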

5.2. Debugging

There are a few places to look for debug info:

  • The status updates for the IPUJob, which you can find by using kubectl to describe the job

  • The Launcher Pod logs which can be found by running:

    $ kubectl logs <ipujob-name>-launcher -n <the-namespace-where-the-job-was-deployed>

  • The Controller logs which can be found by running:

    $ kubectl logs <controller-manager-pod-name> -n <the-namespace-where-the-operator-was-deployed>

  • The V-IPU Proxy logs which can be found by running:

    $ kubectl logs <vipu-proxy-pod-name> -n <the-namespace-where-the-operator-was-deployed>
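When reporting a problem it can be convenient to gather all four sources in one pass. The following is a minimal sketch, assuming the IPUJob custom resource can be addressed as ipujob in kubectl and that you already know the Controller and V-IPU Proxy Pod names. collect_ipujob_debug and the KUBECTL variable are hypothetical helpers, not something shipped with the IPU Operator.

```shell
# KUBECTL can be overridden; setting it to "echo" prints the commands
# instead of running them (a dry run).
KUBECTL="${KUBECTL:-kubectl}"

# Print the IPUJob status and the Launcher, Controller and V-IPU Proxy logs.
# Assumption: the custom resource is addressable as "ipujob".
collect_ipujob_debug() {
  job="$1"             # IPUJob name
  job_ns="$2"          # namespace where the job was deployed
  controller_pod="$3"  # controller-manager Pod name
  proxy_pod="$4"       # V-IPU Proxy Pod name
  operator_ns="$5"     # namespace where the operator was deployed

  echo "== IPUJob status =="
  $KUBECTL describe ipujob "$job" -n "$job_ns"
  echo "== Launcher logs =="
  $KUBECTL logs "${job}-launcher" -n "$job_ns"
  echo "== Controller logs =="
  $KUBECTL logs "$controller_pod" -n "$operator_ns"
  echo "== V-IPU Proxy logs =="
  $KUBECTL logs "$proxy_pod" -n "$operator_ns"
}
```

Redirecting the output to a file, for example by appending "> ipujob-debug.txt 2>&1" to the call, keeps the status and logs together for later inspection.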