Targeting the IPU from TensorFlow 2
- 1. Introduction
- 2. Targeting the Poplar XLA device
- 3. Compiling and pre-compiling executables
- 4. Support for TensorFlow 2
- 5. TensorFlow 2 examples
- 6. Training a model
- 7. Efficient IPU I/O
- 8. Example using IPUEstimator
- 9. Example using IPUPipelineEstimator
- 10. Distributed training
- 11. Half-precision floating point and stochastic rounding
- 12. IPU-optimised operations
- 13. IPU outlined functions
- 14. Writing custom operations
- 15. IPU host embeddings
- 16. Retrieving information about compilation and execution
- 16.1. Adding an operation to get compilation and execution events
- 16.2. Enabling tracing in the hardware configuration options
  - 16.3. Extracting the reports from the returned events
- 16.4. Producing reports for use with the PopVision Graph Analyser
- 16.5. Using the IPU Model device for debugging
- 16.6. TensorFlow options for reporting
- 16.7. Reading the Poplar textual summary report
- 16.8. Producing an ELF image of the compilation
- 16.9. Dumping auxiliary Poplar information
- 16.10. XLA graph file naming
- 17. API changes
- 18. Python API
- 18.1. Operations and utilities related to the Graphcore IPU
- 18.2. Distribution strategy for a single system
- 18.3. Compiler interface
- 18.4. Scoping contexts
- 18.5. Infeed queue
- 18.6. Outfeed queue
- 18.7. General utilities
- 18.8. Looping utilities
- 18.9. Distributed training
- 18.10. Horovod
- 18.11. Datasets
- 18.12. Estimators
- 18.13. Keras
- 18.14. Keras layers
- 18.15. Keras losses
- 18.16. Operators
- 18.16.1. Custom operations
- 18.16.2. Functional operators
- 18.16.3. Graphcore utility operations
- 18.16.4. IPU specific maths operations
- 18.16.5. Pipelining operators
- 18.16.6. Popnn primitive neural network operators
- 18.16.7. Popnn normalization operators
- 18.16.8. Popnn recurrent neural network operators
    - 18.16.9. Popops all-to-all and all-gather operators
    - 18.16.10. Popops cross-replica operators
- 18.16.11. Popops embedding operators
    - 18.16.12. Popops reduce-scatter operator
- 18.16.13. Poprand operators
- 18.16.14. Utility operations to be used in replicated mode
- 18.16.15. Summary operations for IPUs
- 18.17. Optimisers
- 18.18. Sharding
- 19. TensorFlow operators supported by the IPU
- 20. Resources
- 21. Index
- 22. Trademarks & copyright