Technical Notes and White Papers

Technical Notes

Implementation

Creating Custom Operations for the IPU

This technical note provides an overview of the steps for implementing a custom op in each of the frameworks available in the Poplar SDK, with links to sources of more information.

PopSparse Matrix Multiplication (Dynamic Pattern)

The Graphcore PopLibs library for the IPU includes PopSparse, a library of functions for sparse matrix operations.

This document is a high-level description of the algorithmic design of the dynamic sparse matrix multiplication in the Graphcore PopSparse library. It provides a guide to the code and some pointers to the implementation.

Porting TensorFlow models to the IPU

This document is a practical guide to porting TensorFlow models to the IPU using the Poplar SDK. It assumes that you are already familiar with Targeting the IPU from TensorFlow 1, which serves as the primary introduction to developing TensorFlow models for the IPU. That document provides a conceptual introduction to developing models at the framework level and describes a number of specific facets of the TensorFlow-to-Poplar API that are pivotal to running models on the IPU.

This document focuses on some of the practical considerations for developing a model for the IPU and provides guidance on best practices. In doing so, it identifies the key elements that help developers transition to using TensorFlow on the IPU.

Model parallelism with TensorFlow: sharding and pipelining

This technical note describes how to parallelise TensorFlow models on IPU hardware.

If a deep learning network has too many layers and parameters to fit on one IPU, we need to divide it into pieces and distribute those pieces across multiple IPUs. This is called the model parallelism approach, and it enables us to train large models that exceed the memory capacity of a single IPU. Currently, we support two types of model parallelism: sharding and pipelining.
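As a rough, hedged illustration of the sharding approach, the sketch below places the two layers of a small network on different IPUs using the ipu_shard scopes from Graphcore's TensorFlow port. The module path tensorflow.python.ipu, the ipu.scopes.ipu_shard context manager and the IPUConfig attributes are quoted from memory and may differ between SDK versions, so treat this as an assumption-laden sketch rather than reference code.

    import tensorflow as tf
    from tensorflow.python import ipu  # Graphcore TensorFlow port (module path assumed)

    def sharded_model(x):
        # First layer placed on shard 0 (the first IPU).
        with ipu.scopes.ipu_shard(0):
            h = tf.keras.layers.Dense(256, activation="relu")(x)
        # Second layer placed on shard 1 (the second IPU).
        with ipu.scopes.ipu_shard(1):
            return tf.keras.layers.Dense(10)(h)

    # Request two IPUs for the sharded graph (configuration API names assumed;
    # older SDK releases used ipu.utils.create_ipu_config instead of IPUConfig).
    cfg = ipu.config.IPUConfig()
    cfg.auto_select_ipus = 2
    cfg.configure_ipu_system()

Pipelining splits the model into stages in a similar way, but also overlaps the execution of successive mini-batches across the IPUs to improve utilisation.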

Optimisation

Memory and Performance Optimisation

The goal of this document is to help Graphcore AI engineers and customers to optimise high-performance machine learning models running on the IPU.

There are many factors that affect model performance. This document covers memory optimisation, execution schemes, and optimisations specific to the IPU and Poplar.

Although this document focuses on performance, it is worth bearing in mind that numerical stability and convergence properties may limit the design options when optimising a model for performance.

Optimising Temporary Memory Usage for Convolutions and Matmuls on the IPU

In many of our example applications for the IPU, you will see an option called availableMemoryProportion. This technical note describes what this option does and when you may need to tune it in order to make a model fit onto the IPU or to optimise its performance.

All of the frameworks for the IPU, such as TensorFlow and PyTorch, make use of the facilities provided by the Poplar and PopLibs libraries. So, for example, when a TensorFlow program needs to perform a matrix multiply (matmul), it will call the matmul functions in PopLibs.

availableMemoryProportion is used by PopLibs when deciding how to implement operations on the IPU; in other words, how to convert the framework-level operations into the low-level functions that execute on the IPU.

This document discusses availableMemoryProportion in relation to convolutions and matmuls, which are the most common use cases at the time of writing, but it may also apply to other PopLibs functions not covered here.
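As a hedged illustration of where this option is set at the framework level, the snippet below uses the IPUConfig API from Graphcore's TensorFlow port. The attribute names (convolutions.poplar_options, matmuls.poplar_options) and the default value of 0.6 are assumptions recalled from the SDK documentation, so check them against the release you are using.

    from tensorflow.python import ipu  # Graphcore TensorFlow port (module path assumed)

    cfg = ipu.config.IPUConfig()
    # Ask PopLibs to plan convolutions and matmuls so that their temporary
    # memory stays within roughly 30% of each tile's memory, rather than the
    # assumed default proportion of 0.6. Option and attribute names are assumptions.
    cfg.convolutions.poplar_options = {"availableMemoryProportion": "0.3"}
    cfg.matmuls.poplar_options = {"availableMemoryProportion": "0.3"}
    cfg.configure_ipu_system()

Lowering the value trades execution speed for a smaller temporary-memory footprint, which can be what makes a large layer fit on the IPU.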

Optimising for the IPU: Computational Graph Recompilation and Executable Switching in TensorFlow

When code is executed on an IPU, a multi-operation computational graph is compiled to run efficiently on the device. This technical note describes how to minimise recompilation.

This compilation ensures that the code running on the IPU is optimal: as many tiles as possible are used, as little device memory as possible is used, and the number of execution cycles is kept short. Note that, in contrast to some other platforms, the graph to be compiled isn't just a single matmul operation but many consecutive operations, so almost every graph is different and will need to be compiled and optimised.

The compilation process performs many optimisations, and so it can take some time. It is therefore important to know when compilation of the graph will happen, and to avoid it occurring at inconvenient times or too often. This is especially relevant when running benchmarks, since compilation can add significant overhead.

As a result, it is important to avoid recompilations as far as possible. This technical note provides some strategies that can help you with this.
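One common mitigation is to cache compiled executables on disk so that an identical graph does not have to be recompiled on every run. The minimal sketch below assumes the TF_POPLAR_FLAGS environment variable and its --executable_cache_path option from the Graphcore TensorFlow port; verify the exact flag against the SDK documentation for your release.

    import os

    # Point the Poplar executable cache at a persistent directory before the
    # IPU system is configured; later runs that build an identical graph can
    # load the cached executable instead of recompiling it.
    os.environ["TF_POPLAR_FLAGS"] = "--executable_cache_path=/tmp/ipu_cache"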

Pre-Training and Fine-Tuning BERT for the IPU

This technical note provides an insight into the implementation of BERT-Large on Graphcore IPU-POD systems, using both TensorFlow and PyTorch. This should help users better understand some of the key optimisation techniques for model development on the IPU.

Hardware

Graphcore OpenStack reference design for IPU-POD systems

This document illustrates a reference configuration of a Graphcore IPU-POD64 deployed with OpenStack, the open-source cloud computing infrastructure management software. The IPU-POD64 contains 64 IPUs (Intelligence Processing Units), and OpenStack is used to manage these IPU resources via its API and UI.

OpenStack can be deployed in a variety of different configurations, and Graphcore does not endorse or support any particular configuration. OpenStack is not a prerequisite for using Graphcore technology; however, it is often used as the underlying infrastructure in data centres. You should therefore treat this description as a set of guidelines from which you can derive your own configuration of a Graphcore IPU solution on your existing OpenStack deployment.

Scaling AI with Graphcore and Pure Storage

This technical note describes an example reference architecture, developed with Pure Storage, for using FlashBlade storage with the IPU-POD.

Data and AI teams today need simple yet powerful infrastructure to take ideas from experimentation to production rapidly. They need end-to-end infrastructure that provides a performant platform, is easy to set up, and does not impede the work of data scientists and machine-learning engineers. Graphcore and Pure Storage® have brought intelligent compute and storage together to create a converged infrastructure solution that serves machine-learning workloads of all sizes while maintaining simplicity and performance at any scale.

Switched GW-Links in large scale IPU-POD systems

As IPU-POD systems get larger (from IPU-POD128 onwards), connecting all the IPU-Machines in the individual racks horizontally becomes increasingly complicated and error-prone. There is also the risk of increased downtime if an individual rack within the system needs to be taken out of service. Therefore, a different approach is required for large-scale IPU-PODs. The solution is to use switched GW-Links, where the GW-Links do not connect IPU-Machines directly but pass through switches instead.

This technical note describes how switched GW-Links are used to connect IPU-Machines in large-scale switched IPU-PODs. This is in contrast to our current approach for switched IPU-PODs, which uses directly attached GW-Links to connect IPU-Machines horizontally.

White Papers

AI-Float™ - Mixed Precision Arithmetic for AI: A Hardware Perspective

This white paper describes how the IPU’s hardware and software architectures support fast and efficient training and inference of deep learning models using mixed precision arithmetic.

The increasing complexity of AI workloads has driven interest in dedicated processors for AI applications. Because the slow-down of Moore's law and Dennard scaling means that processing speed no longer doubles with every new silicon generation at constant power, today's AI processors and accelerators need to make more efficient use of the available power.

As the model and dataset sizes of deep learning applications continue to increase, the scalability of machine learning systems becomes indispensable in order to cope with the considerable compute requirements of these applications. Training large models across distributed systems, however, creates several challenges: it relies on the effective use of the compute, memory and networking resources shared among the nodes, all within a limited power budget.

The choice of computer arithmetic is an important part of any accelerator or processor design, as it has a direct impact on hardware complexity and compute efficiency, as well as on memory and communication bandwidth utilisation. For these reasons, efficient numerical representations are of critical importance: they increase power efficiency through improved compute throughput and better use of the available communication bandwidth.
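To make the precision trade-off concrete, the short NumPy example below (an illustration added here, not part of the white paper) shows the limited range and precision of IEEE FP16 relative to FP32; it is this behaviour that motivates mixed-precision techniques such as FP32 accumulation and loss scaling.

    import numpy as np

    # FP16 has 1 sign bit, 5 exponent bits and 10 mantissa bits, giving a
    # maximum finite value of 65504 and roughly 3 decimal digits of precision.
    print(np.finfo(np.float16).max)            # 65504.0
    print(np.float16(70000.0))                 # inf  (overflow: outside the FP16 range)
    print(np.float16(1.0) + np.float16(1e-4))  # 1.0  (the small update is lost)

    # FP32 represents the same sum with no visible loss.
    print(np.float32(1.0) + np.float32(1e-4))  # 1.0001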

Graphcore’s AI Software Stack is Now Customer-Driven

A new white paper from Cambrian AI Research examines the growing momentum of the Poplar software stack and ecosystem, detailing how our customer-centric focus is driving software enhancements, supporting developers and benefiting from open-source contributions across the wider AI ecosystem.

“Developers require fast, scalable accelerators to handle the massive computational loads of larger models. But AI developers also need a robust development environment that meets them in their AI development journey,” writes Analyst Karl Freund.

Research papers

You can find research papers by Graphcore and others on our Developer Portal.