Targeting the IPU from TensorFlow 2
- 1. Introduction
- 2. Targeting the Poplar XLA device
- 3. Compiling and pre-compiling executables
- 4. Support for TensorFlow 2
- 5. TensorFlow 2 examples
- 6. Training a model
- 7. Efficient IPU I/O
- 8. Example using IPUEstimator
- 9. Example using IPUPipelineEstimator
- 10. Distributed training
- 11. Half-precision floating point and stochastic rounding
- 12. IPU-optimised operations
- 13. IPU outlined functions
- 14. Writing custom operations
- 15. IPU host embeddings
- 16. Retrieving information about compilation and execution
- 17. API changes
- 18. Deprecated profiling functionality
  - 18.1. Adding an operation to get compilation and execution events
  - 18.2. Enabling tracing in the hardware configuration options
  - 18.3. Extracting the reports from the returned events
  - 18.4. Producing reports for use with the PopVision Graph Analyser
  - 18.5. Using the IPU Model device for debugging
  - 18.6. Reading the Poplar textual summary report
  - 18.7. Producing an ELF image of the compilation
- 19. Python API
  - 19.1. Operations and utilities related to the Graphcore IPU
  - 19.2. Distribution strategy for a single system
  - 19.3. Compiler interface
  - 19.4. Scoping contexts
  - 19.5. Infeed queue
  - 19.6. Outfeed queue
  - 19.7. General utilities
  - 19.8. Configuration utilities
  - 19.9. Looping utilities
  - 19.10. Distributed training
  - 19.11. Horovod
  - 19.12. Datasets
  - 19.13. Estimators
  - 19.14. Keras
  - 19.15. Keras layers
  - 19.16. Keras losses
  - 19.17. Keras optimizers
  - 19.18. Operators
    - 19.18.1. Custom operations
    - 19.18.2. Functional operators
    - 19.18.3. Image operations
    - 19.18.4. Graphcore utility operations
    - 19.18.5. IPU-specific maths operations
    - 19.18.6. Pipelining operators
    - 19.18.7. Popnn primitive neural network operators
    - 19.18.8. Popnn normalization operators
    - 19.18.9. Popnn recurrent neural network operators
    - 19.18.10. Popops all-to-all and all-gather operators
    - 19.18.11. Popops cross-replica operators
    - 19.18.12. Popops embedding operators
    - 19.18.13. Popops reduce scatter operator
    - 19.18.14. Poprand operators
    - 19.18.15. Utility operations to be used in replicated mode
    - 19.18.16. Slicing operators
    - 19.18.17. Statistics operators
    - 19.18.18. Summary operations for IPUs
  - 19.19. Optimisers
  - 19.20. Sharding
- 20. TensorFlow operators supported by the IPU
- 21. Resources
- 22. Index
- 23. Trademarks & copyright