Kubernetes IPU Operator User Guide
Version: 1.1.0
1. The IPU Operator
1.1. Components and design
2. Installation
2.1. Prerequisites
2.2. Installation methods
2.3. Installation using Helm chart hosted on GitHub
2.3.1. Installing the IPUJob CRD
2.3.2. Installing the IPU Operator
2.4. Installation using Helm chart from Graphcore’s SDP
2.4.1. Download package
2.4.2. Installing the IPUJob CRD
2.4.3. Installing the IPU Operator from a local container repository
2.5. Basic installation
2.6. Multiple V-IPU Controllers
2.7. Verify the installation is successful
2.8. Uninstall the IPU Operator
2.9. Upgrade the IPU Operator
3. Configurations
4. Creating an IPUJob
4.1. Training job
4.1.1. Simple training
4.1.2. Distributed training
4.2. Inference job
4.2.1. Scale up or down operations
4.3. Automatic restarts
4.4. Clean up resources and IPU partitions
5. Debugging problems
5.1. How does the IPU Operator work?
5.2. Debugging
6. IPU usage statistics
6.1. Operator metrics
7. Known limitations
8. Release notes
8.1. Version 1.1.0
8.1.1. New features
8.1.2. Bug fixes
8.1.3. Other improvements
8.1.4. Known issues
8.1.5. Compatibility changes
9. Legal notices
Index