Memory and Performance Optimisation on the IPU
Version: 3.1.0
1. Overview
2. Understanding the IPU programming model
2.1. Core concepts for IPU programming
2.2. Main differences from GPU programming
2.3. Factors affecting model performance
2.4. PopVision Tools
3. Mapping a model to an IPU system
3.1. Computational graph of ML model training
3.2. Understanding the memory mapping of a computational graph
3.2.1. Machine-learning models' use of memory
3.2.2. IPU memory used by a Poplar computational graph
3.3. Always-live and not-always-live memory
3.4. Tensor variables' memory use
3.4.1. Number of model parameters
3.4.2. Number of activations
3.4.3. Number of optimiser states (training only)
3.4.4. Number of backpropagation variables (training only)
3.4.5. Determining the total memory used by variables
3.5. Vertex code and exchange memory use
4. Optimising for performance
4.1. Memory
4.2. Pipeline execution scheme
4.3. Data parallelism
4.3.1. Graph replication
4.3.2. Multiple SDK instances and replication: PopDist
4.4. Host-IPU I/O optimisation
4.4.1. Prefetch and prefetch depth
4.4.2. Overlapping I/O with compute
4.4.3. Data size reduction
4.4.4. Disabling variable offloading
4.5. Host-side processing optimisations
4.5.1. Host and IPU preprocessing
4.5.2. The one-computational-graph concept
4.5.3. Looping
4.6. Optimising numerical precision
4.7. Replicated tensor sharding (RTS)
4.8. Tile mapping
4.9. Other execution schemes
5. Common memory optimisations
5.1. Available memory proportion tuning
5.2. Partials type
5.3. Activation recomputations
5.3.1. Activation recomputation and memory use
5.3.2. Recomputation checkpoints
5.4. Variable offloading
5.5. Graph outlining
5.6. Reducing the batch size
5.7. Writing a custom operation
6. Debugging an out-of-memory exception
6.1. Identifying when you’ve run out of IPU memory
6.2. Memory limits on the IPU
6.3. Profiling the model
6.3.1. Enabling profiling
6.3.2. Using offline compilation to reduce IPU usage when profiling
6.3.3. Using the PopVision Graph Analyser
6.4. Deciding what to do
6.4.1. Tile and IPU memory balance
6.4.2. Techniques by liveness of memory
6.4.2.1. Reducing not-always-live memory
6.4.2.2. Reducing always-live memory
7. Scaling an application over multiple replicas
7.1. Quick guide to scaling
7.2. Analysing your scaling behaviour
7.2.1. Estimating the theoretical throughput
7.3. Constant or slowed-down processes (Amdahl’s law)
7.4. Graph compilation and executable loading
7.5. Host-I/O optimisation
7.6. Batch size and gradient accumulation count
7.7. Memory optimisation for more replicas
7.8. Pipeline optimisation and replicated tensor sharding
7.9. Technical background
7.9.1. Scale-out IPU hardware architecture
7.9.2. GCL allreduce
7.9.2.1. GCL allreduce on a single Pod
7.9.2.2. GCL allreduce with many Pods
8. Reducing graph compilation time
8.1. Finding the malloc implementation in use
8.2. Using LD_PRELOAD to change the malloc implementation
8.3. Different malloc implementations
8.3.1. tbbmalloc
8.3.2. jemalloc
8.3.3. tcmalloc
9. Trademarks & copyright
Index