Targeting the IPU from TensorFlow 2
- 1. Introduction
- 2. Targeting the Poplar XLA device
- 3. Support for TensorFlow 2
- 4. Keras with IPUs
- 5. Compiling and pre-compiling executables
- 6. Training a model
- 7. Efficient IPU I/O
- 8. Example using IPUEstimator
- 9. Example using IPUPipelineEstimator
- 10. Distributed training
- 11. Half-precision floating point and stochastic rounding
- 12. IPU-optimised operations
- 13. IPU outlined functions
- 14. Writing custom operations
- 15. IPU host embeddings
- 16. Retrieving information about compilation and execution
- 17. API changes
- 18. Python API
- 18.1. Operations and utilities related to the Graphcore IPU
- 18.2. Distribution strategy for a single system
- 18.3. Compiler interface
- 18.4. Scoping contexts
- 18.5. Infeed queue
- 18.6. Outfeed queue
- 18.7. General utilities
- 18.8. Configuration utilities
- 18.9. Looping utilities
- 18.10. Distributed training
- 18.11. Horovod
- 18.12. Datasets
- 18.13. Estimators
- 18.14. Keras
- 18.15. Keras layers
- 18.16. Keras losses
- 18.17. Keras optimizers
- 18.18. Operators
- 18.18.1. Custom operations
- 18.18.2. Functional operators
- 18.18.3. Image operations
- 18.18.4. Graphcore utility operations
  - 18.18.5. IPU-specific maths operations
- 18.18.6. Pipelining operators
- 18.18.7. Popnn primitive neural network operators
- 18.18.8. Popnn normalization operators
- 18.18.9. Popnn recurrent neural network operators
- 18.18.10. Popops all to all and all gather operators
- 18.18.11. Popops cross replica operators
- 18.18.12. Popops embedding operators
- 18.18.13. Popops reduce scatter operator
- 18.18.14. Poprand operators
- 18.18.15. Utility operations to be used in replicated mode
- 18.18.16. Slicing operators
- 18.18.17. Statistics operators
- 18.18.18. Summary operations for IPUs
- 18.19. Optimisers
- 18.20. Sharding
- 19. TensorFlow operators supported by the IPU
- 20. Resources
- 21. Index
- 22. Trademarks & copyright