Memory and Performance Optimisation on the IPU
Version: 3.1.0
  • 1. Overview
  • 2. Understanding the IPU programming model
    • 2.1. Core concepts for IPU programming
    • 2.2. Main differences from GPU programming
    • 2.3. Factors affecting model performance
    • 2.4. PopVision Tools
  • 3. Mapping a model to an IPU system
    • 3.1. Computational graph of ML model training
    • 3.2. Understanding the memory mapping of a computational graph
      • 3.2.1. Machine-learning models' use of memory
      • 3.2.2. IPU memory used by a Poplar computational graph
    • 3.3. Always-live and not-always-live memory
    • 3.4. Tensor variables' memory use
      • 3.4.1. Number of model parameters
      • 3.4.2. Number of activations
      • 3.4.3. Number of optimiser states (training only)
      • 3.4.4. Number of backpropagation variables (training only)
      • 3.4.5. Determining the total memory used by variables
    • 3.5. Vertex code and exchange memory use
  • 4. Optimising for performance
    • 4.1. Memory
    • 4.2. Pipeline execution scheme
    • 4.3. Data parallelism
      • 4.3.1. Graph replication
      • 4.3.2. Multiple SDK instances and replication: PopDist
    • 4.4. Host-IPU I/O optimisation
      • 4.4.1. Prefetch and prefetch depth
      • 4.4.2. Overlapping I/O with compute
      • 4.4.3. Data size reduction
      • 4.4.4. Disabling variable offloading
    • 4.5. Host-side processing optimisations
      • 4.5.1. Host and IPU preprocessing
      • 4.5.2. The one-computational-graph concept
      • 4.5.3. Looping
    • 4.6. Optimising numerical precision
    • 4.7. Replicated tensor sharding (RTS)
    • 4.8. Tile mapping
    • 4.9. Other execution schemes
  • 5. Common memory optimisations
    • 5.1. Available memory proportion tuning
    • 5.2. Partials type
    • 5.3. Activation recomputations
      • 5.3.1. Activation recomputation and memory use
      • 5.3.2. Recomputation checkpoints
    • 5.4. Variable offloading
    • 5.5. Graph outlining
    • 5.6. Reducing the batch size
    • 5.7. Writing a custom operation
  • 6. Debugging an out-of-memory exception
    • 6.1. Identifying when you’ve run out of IPU memory
    • 6.2. Memory limits on the IPU
    • 6.3. Profiling the model
      • 6.3.1. Enabling profiling
      • 6.3.2. Using offline compilation to reduce IPU usage when profiling
      • 6.3.3. Using the PopVision Graph Analyser
    • 6.4. Deciding what to do
      • 6.4.1. Tile and IPU memory balance
      • 6.4.2. Techniques by liveness of memory
        • 6.4.2.1. Reducing not-always-live memory
        • 6.4.2.2. Reducing always-live memory
  • 7. Scaling an application over multiple replicas
    • 7.1. Quick guide to scaling
    • 7.2. Analyse your scaling behaviour
      • 7.2.1. Estimating the theoretical throughput
    • 7.3. Constant or slowed-down processes (Amdahl’s law)
    • 7.4. Graph compilation and executable loading
    • 7.5. Host-I/O optimisation
    • 7.6. Batch size and gradient accumulation count
    • 7.7. Memory optimisation for more replicas
    • 7.8. Pipeline optimisation and replicated tensor sharding
    • 7.9. Technical background
      • 7.9.1. Scale-out IPU hardware architecture
      • 7.9.2. GCL allreduce
        • 7.9.2.1. GCL allreduce on a single Pod
        • 7.9.2.2. GCL allreduce with many Pods
  • 8. Reducing graph compilation time
    • 8.1. Finding the malloc implementation in use
    • 8.2. Using LD_PRELOAD to change the malloc implementation
    • 8.3. Different malloc implementations
      • 8.3.1. tbbmalloc
      • 8.3.2. jemalloc
      • 8.3.3. tcmalloc
  • 9. Trademarks & copyright