Memory and Performance Optimisation on the IPU
Version: latest
  • 1. Overview
  • 2. Understanding the IPU programming model
    • 2.1. Core concepts for IPU programming
    • 2.2. Main differences from GPU programming
    • 2.3. Factors affecting model performance
    • 2.4. PopVision Tools
  • 3. Mapping a model to an IPU system
    • 3.1. Computational graph of ML model training
    • 3.2. Understanding the memory mapping of a computational graph
      • 3.2.1. Machine-learning models' use of memory
      • 3.2.2. IPU memory used by a Poplar computational graph
    • 3.3. Always-live and not-always-live memory
    • 3.4. Tensor variables' memory use
      • 3.4.1. Number of model parameters
      • 3.4.2. Number of activations
      • 3.4.3. Number of optimiser states (training only)
      • 3.4.4. Number of backpropagation variables (training only)
      • 3.4.5. Determine total memory used by variables
    • 3.5. Vertex code and exchange memory use
  • 4. Optimising for performance
    • 4.1. Memory
    • 4.2. Pipeline execution scheme
    • 4.3. Data parallelism
      • 4.3.1. Graph replication
      • 4.3.2. Multiple SDK instances and replication: PopDist
    • 4.4. Host-IPU I/O optimisation
      • 4.4.1. Prefetch and prefetch depth
      • 4.4.2. Overlapping I/O with compute
      • 4.4.3. Data size reduction
      • 4.4.4. Disabling variable offloading
    • 4.5. Host-side processing optimisations
      • 4.5.1. Host and IPU preprocessing
      • 4.5.2. The one-computational-graph concept
      • 4.5.3. Looping
    • 4.6. Optimising numerical precision
    • 4.7. Replicated tensor sharding (RTS)
    • 4.8. Tile mapping
    • 4.9. Other execution schemes
  • 5. Common memory optimisations
    • 5.1. Available memory proportion tuning
    • 5.2. Partials type
    • 5.3. Activation recomputations
      • 5.3.1. Activations recomputation and memory use
      • 5.3.2. Recomputation checkpoints
    • 5.4. Variable offloading
    • 5.5. Graph outlining
    • 5.6. Reducing the batch size
    • 5.7. Writing a custom operation
  • 6. Debugging an out-of-memory exception
    • 6.1. Identifying when you’ve run out of IPU memory
    • 6.2. Memory limits on the IPU
    • 6.3. Profiling the model
      • 6.3.1. Enabling profiling
      • 6.3.2. Using offline compilation to reduce IPU usage when profiling
      • 6.3.3. Using the PopVision Graph Analyser
    • 6.4. Deciding what to do
      • 6.4.1. Tile and IPU memory balance
      • 6.4.2. Techniques by liveness of memory
        • 6.4.2.1. Reducing not-always-live memory
        • 6.4.2.2. Reducing always-live memory
  • 7. Scaling an application over multiple replicas
    • 7.1. Quick guide to scaling
    • 7.2. Analyse your scaling behaviour
      • 7.2.1. Estimating the theoretical throughput
    • 7.3. Constant or slowed-down processes (Amdahl’s law)
    • 7.4. Graph compilation and executable loading
    • 7.5. Host-I/O optimisation
    • 7.6. Batch size and gradient accumulation count
    • 7.7. Memory optimisation for more replicas
    • 7.8. Pipeline optimisation and replicated tensor sharding
    • 7.9. Technical background
      • 7.9.1. Scale-out IPU hardware architecture
      • 7.9.2. GCL allreduce
        • 7.9.2.1. GCL allreduce on a single Pod
        • 7.9.2.2. GCL allreduce with many Pods
  • 8. Reducing graph compilation time
    • 8.1. Finding malloc implementation in use
    • 8.2. Using LD_PRELOAD to change the malloc implementation
    • 8.3. Different malloc implementations
      • 8.3.1. tbbmalloc
      • 8.3.2. jemalloc
      • 8.3.3. tcmalloc
  • 9. Trademarks & copyright

Search help

Note: Searching from the top-level index page will search all documents. Searching from a specific document will search only that document.

  • Find an exact phrase: Wrap your search phrase in "" (double quotes) to only get results where the phrase is exactly matched. For example "PyTorch for the IPU" or "replicated tensor sharding"
  • Prefix query: Add an * (asterisk) at the end of any word to indicate a prefix query. This will return results containing all words with the specific prefix. For example tensor*
  • Fuzzy search: Use ~N (tilde followed by a number) at the end of any word for a fuzzy search. This will return results that are similar to the search word. N specifies the “edit distance” (fuzziness) of the match. For example Polibs~1
  • Words close to each other: ~N (tilde followed by a number) after a phrase (in quotes) returns results where the words are close to each other. N is the maximum number of positions allowed between matching words. For example "ipu version"~2
  • Logical operators: You can use the following logical operators in a search:
    • + signifies AND operation
    • | signifies OR operation
    • - negates a single word or phrase (returns results without that word or phrase)
    • () controls operator precedence
