8. Release notes

8.1. Version 1.2.0

8.1.1. New features

  • IPU Operator handling partitions of 32 IPUs: The Kubernetes integration for IPUs now supports IPU partitions of size 32. This allows larger IPU clusters to be allocated and managed for IPU-accelerated workloads;

  • Bundled IPUJob CRD in Helm Chart: The IPUJob Custom Resource Definition (CRD) is now bundled in the Helm chart, so the CRD no longer needs to be installed separately before the Helm chart is deployed. All necessary resources are set up automatically during installation.

8.1.2. Bug fixes

  • IPU Resource Cleanup after IPUJob Completion: The Kubernetes integration for IPUs now cleans up IPU resources automatically after job completion. When an IPU-accelerated workload finishes, the allocated IPU resources are released and made available to other workloads. This prevents IPUs from remaining allocated to finished jobs and improves utilization of the IPU resources in the cluster.

8.1.3. Other improvements

  • CVE Vulnerability Fixes: This release addresses several Common Vulnerabilities and Exposures (CVEs) identified in the previous version. Most of them were related to Go version 1.18.6 and have been addressed by upgrading the Go packages to version 1.19.9. The integration has been updated to mitigate the security risks associated with the following CVEs:

    • CVE-2022-1996: Authorization Bypass Through User-Controlled Key in GitHub repository emicklei/go-restful prior to v3.8.0;

    • CVE-2021-38561: Out-of-bounds Read in golang.org/x/text;

    • CVE-2022-21698: Uncontrolled Resource Consumption in client_golang;

    • CVE-2021-43565: The x/crypto/ssh package before 0.0.0-20211202192323-5770296d904e of golang.org/x/crypto allows an attacker to panic an SSH server;

    • CVE-2022-27191: The golang.org/x/crypto/ssh package before 0.0.0-20220314234659-1baeb1ce4c0b for Go allows an attacker to crash a server in certain circumstances involving AddHostKey;

    • CVE-2021-44716: net/http in Go before 1.16.12 and 1.17.x before 1.17.5 allows uncontrolled memory consumption in the header canonicalization cache via HTTP/2 requests;

    • CVE-2022-27664: In net/http in Go before 1.18.6 and 1.19.x before 1.19.1, attackers can cause a denial of service because an HTTP/2 connection can hang during closing if shutdown were preempted by a fatal error;

    • CVE-2022-41723: A maliciously crafted HTTP/2 stream could cause excessive CPU consumption in the HPACK decoder, sufficient to cause a denial of service from a small number of small requests;

    • CVE-2022-32149: An attacker may cause a denial of service by crafting an Accept-Language header which ParseAcceptLanguage will take significant time to parse;

    • CVE-2022-28948: An issue in the Unmarshal function in Go-Yaml v3 causes the program to crash when attempting to deserialize invalid input;

    • CVE-2023-24540: Not all valid JavaScript whitespace characters are considered to be whitespace;

    • CVE-2023-24538: Templates do not properly consider backticks (`) as JavaScript string delimiters, and do not escape them as expected;

    • CVE-2022-2879: Allocation of Resources Without Limits or Throttling;

    • CVE-2022-30630: Uncontrolled Recursion;

    • CVE-2022-41724: Uncontrolled Resource Consumption;

    • CVE-2022-30632: Uncontrolled Recursion;

    • CVE-2022-41716: Improper Neutralization of Special Elements in Output Used by a Downstream Component (‘Injection’);

    • CVE-2022-2880: Inconsistent Interpretation of HTTP Requests (‘HTTP Request Smuggling’);

    • CVE-2022-30631: Uncontrolled Recursion;

    • CVE-2022-41725: Uncontrolled Resource Consumption;

    • CVE-2022-30633: Uncontrolled Recursion;

    • CVE-2022-41715: Programs which compile regular expressions from untrusted sources may be vulnerable to memory exhaustion or denial of service;

    • CVE-2022-41722: Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’);

    • CVE-2022-28131: Uncontrolled Recursion;

    • CVE-2022-41720: Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’);

    • CVE-2022-32189: A too-short encoded message can cause a panic in Float.GobDecode and Rat.GobDecode in math/big in Go before 1.17.13 and 1.18.5, potentially allowing a denial of service;

    • CVE-2022-30635: Uncontrolled Recursion;

    • CVE-2023-24537: Integer Overflow or Wraparound;

    • CVE-2023-24534: Uncontrolled Resource Consumption;

    • CVE-2023-24536: Allocation of Resources Without Limits or Throttling;

    • CVE-2023-29400: Improper Neutralization of Special Elements in Output Used by a Downstream Component (‘Injection’);

    • CVE-2023-24539: Improper Neutralization of Special Elements in Output Used by a Downstream Component (‘Injection’).

8.1.4. Known issues

  • K8s IPUJob Pods must be run on a Kubernetes node with IPU access, meaning that at least one worker node must be an IPU-POD head node;

  • For parallel IPUJobs with more than one worker Pod, you must specify the network interface which will be used for MPI communication using the mpirun --mca btl_tcp_if_include option;

  • IPU partitions larger than 64 IPUs are currently not supported.
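
As an illustration of the MPI network interface requirement above, the following minimal sketch builds an mpirun command line with the --mca btl_tcp_if_include option set. It assumes Open MPI; the helper name mpirun_cmd_for_iface, the interface name eth0, the process count and the program train.py are hypothetical and must be replaced with values that match your cluster. The helper only prints the command line and does not execute anything.

```shell
# Print an mpirun command line that restricts Open MPI's TCP transport to one
# network interface. The interface name and the remaining mpirun arguments are
# supplied by the caller; nothing is executed.
mpirun_cmd_for_iface() {
    iface="$1"
    shift
    printf '%s\n' "mpirun --mca btl_tcp_if_include $iface $*"
}

# Example with hypothetical values: interface eth0, 2 processes, train.py.
mpirun_cmd_for_iface eth0 -np 2 python3 train.py
```

The printed command line can then be used as the command of the IPUJob, once the interface name has been checked on the worker Pods.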

8.1.5. Upgrade guidelines

  • Before triggering the upgrade, delete the ipu-operator-admission secret;

  • To upgrade the IPU Operator, run the command: $ helm upgrade [RELEASE_NAME] [CHART].
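
The two upgrade steps can be sketched as a small shell helper. This is a minimal sketch, not an official upgrade script: the function name upgrade_ipu_operator, the DRY_RUN variable and the namespace argument are assumptions, and the release name, chart and namespace in the example are hypothetical values. By default the helper only prints the commands; set DRY_RUN=0 to execute them.

```shell
# Sketch of the two upgrade steps. The release name, chart reference and
# namespace are values you supply (assumptions, not taken from these notes).
# Commands are only printed unless DRY_RUN=0 is set.
upgrade_ipu_operator() {
    release="$1"
    chart="$2"
    namespace="$3"

    run="echo"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        run=""
    fi

    # Step 1: delete the admission secret before triggering the upgrade.
    $run kubectl delete secret ipu-operator-admission --namespace "$namespace" || return 1
    # Step 2: upgrade the Helm release.
    $run helm upgrade "$release" "$chart" --namespace "$namespace"
}

# Example with hypothetical values; prints the two commands without running them.
upgrade_ipu_operator my-release ./my-chart my-namespace
```

Printing the commands first makes it possible to check the release name and namespace before anything in the cluster is changed.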

8.2. Version 1.1.0

8.2.1. New features

Initial public release of the IPU Operator, consisting of the following containerised components:

  • controller: executes the main control loop of the operator;

  • vipu-proxy: performs IPU partition operations (create, reset, delete). The proxy also tracks IPU usage and exposes endpoints that provide usage statistics and report IPU availability;

  • launcher: sends IPU partition creation requests to vipu-proxy and waits for workers to be ready before allowing job execution.

8.2.2. Bug fixes

N/A

8.2.3. Other improvements

N/A

8.2.4. Known issues

  • K8s IPUJob Pods must be run on a Kubernetes node with IPU access, meaning that at least one worker node must be an IPU-POD head node;

  • In order to access the RDMA network interface on the head node, the IPUJob Pods must use host networking and run in privileged mode;

  • For parallel IPUJobs with more than one worker Pod, you must specify the network interface which will be used for MPI communication using the mpirun --mca btl_tcp_if_include option;

  • IPU partitions larger than 64 IPUs are currently not supported.

8.2.5. Compatibility changes

N/A