Targeting the IPU from TensorFlow 1
- 1. Introduction
- 2. Tutorial
- 3. Targeting the Poplar XLA device
- 4. Compiling and pre-compiling executables
- 5. Training a model
- 6. Efficient IPU I/O
- 7. Example using IPUEstimator
- 8. Example using IPUPipelineEstimator
- 9. Distributed training
- 10. Half-precision floating point and stochastic rounding
- 11. IPU-optimised operations
- 12. IPU outlined functions
- 13. Writing custom operations
- 14. IPU host embeddings
- 15. Retrieving information about compilation and execution
- 15.1. Adding an operation to get compilation and execution events
- 15.2. Enabling tracing in the hardware configuration options
    - 15.3. Extracting the reports from the returned events
- 15.4. Producing reports for use with the PopVision Graph Analyser
- 15.5. Using the IPU Model device for debugging
- 15.6. TensorFlow options for reporting
- 15.7. Reading the Poplar textual summary report
- 15.8. Producing an ELF image of the compilation
- 15.9. Dumping auxiliary Poplar information
- 15.10. XLA graph file naming
- 16. API changes
- 17. Python API
- 17.1. Operations and utilities related to the Graphcore IPU
- 17.2. Compiler interface
- 17.3. Scoping contexts
- 17.4. Infeed queue
- 17.5. Outfeed queue
- 17.6. General utilities
- 17.7. Looping utilities
- 17.8. Distributed training
- 17.9. Horovod
- 17.10. Datasets
- 17.11. Estimators
- 17.12. Keras layers
- 17.13. Operators
- 17.13.1. Custom operations
- 17.13.2. Functional operators
- 17.13.3. Graphcore utility operations
- 17.13.4. IPU specific maths operations
- 17.13.5. Pipelining operators
- 17.13.6. Popnn primitive neural network operators
- 17.13.7. Popnn normalization operators
- 17.13.8. Popnn recurrent neural network operators
- 17.13.9. Popops all to all and all gather operators
- 17.13.10. Popops cross replica operators
- 17.13.11. Popops embedding operators
- 17.13.12. Popops reduce scatter operator
- 17.13.13. Poprand operators
- 17.13.14. Utility operations to be used in replicated mode
- 17.13.15. Summary operations for IPUs
- 17.14. Optimisers
- 17.15. Sharding
- 18. TensorFlow operators supported by the IPU
- 19. Resources
- 20. Index
- 21. Trademarks & copyright