2. How to transfer an existing code base

2.1. Preliminary analysis

Before porting a TensorFlow model to the IPU, it is helpful to assess the problem at hand.

What is the model size? A single 2nd generation IPU has approximately 900 MB of on-chip memory (SRAM); a 1st generation IPU has about 300 MB. Will you use the IPU for training or inference? Training requires more memory. It is also good to get an idea of the minimum batch size needed for the model to converge correctly, for example when batch normalization is used. Based on these estimates, you can judge whether the whole model will fit onto one IPU or whether a setup with multiple IPUs is necessary.
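To make this concrete, a rough estimate of the parameter memory of an existing TF1 graph can be computed from the trainable variables. The sketch below (the helper function name is only illustrative) is a back-of-envelope check: during training, activations, gradients and optimizer state (for example, Adam keeps two extra tensors per parameter) add substantially to this figure.

    import numpy as np
    import tensorflow as tf

    def parameter_memory_mb(bytes_per_element=4):
        # Sum the elements of all trainable variables and convert to MB.
        # bytes_per_element is 4 for float32, 2 for float16.
        total = sum(int(np.prod(v.shape.as_list()))
                    for v in tf.trainable_variables())
        return total * bytes_per_element / (1024 ** 2)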

An analysis of the existing code can indicate the kind of parallelism desired. Some programs distribute the same task across multiple devices while feeding each with a different batch of data (data parallelism); others spread one task and one batch of data across multiple devices (model parallelism). There are also cases, typically seen in reinforcement learning applications, where inference and training are distributed in parallel across different devices. IPUs support all of these schemes.

Additionally, it is important to assess the suitability of the existing code for porting. On one hand, if the code is mainly based on the estimator's train, evaluate, and predict functions, replacing the estimator with the IPUEstimator might be all that is required. On the other hand, if the tfcompile tool is used for code optimization together with feeds and fetches instead of the estimator approach, replacing the compiler and the data streams with their respective IPU counterparts can ease the transition.
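As a hedged sketch of the estimator path, assuming an existing model_fn and a recent SDK in which IPURunConfig accepts an IPUConfig (older SDKs use the create_ipu_config utility functions instead), the swap looks roughly like this:

    import tensorflow as tf
    from tensorflow.python import ipu

    def model_fn(features, labels, mode):
        # Stand-in for your existing model function.
        logits = tf.layers.dense(features, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    ipu_options = ipu.config.IPUConfig()
    ipu_options.auto_select_ipus = 1

    config = ipu.ipu_run_config.RunConfig(
        ipu_run_config=ipu.ipu_run_config.IPURunConfig(
            iterations_per_loop=100, ipu_options=ipu_options))

    estimator = ipu.ipu_estimator.IPUEstimator(model_fn=model_fn, config=config)
    # estimator.train(input_fn=..., steps=...) then works as with a
    # standard tf.estimator.Estimator.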

If your code is highly specialized for a particular device (for example, it uses the TPU estimator or other TPU-specific functions) and not general at all, it may be better to start from the existing model code and refactor the surrounding code. Graphcore provides a GitHub repository with plenty of tutorials and code examples. It is worth exploring these to check whether a similar problem has already been addressed.

The IPU documentation includes a list of supported operations and the data types they support — see the supported operators chapter of the TensorFlow User Guide. If your code contains operations that are not yet supported, either a new IPU implementation will be required, or the respective part has to be processed on the CPU, which is achieved by scoping. See the custom operators chapter of the TensorFlow User Guide for more information.
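As an illustration of such scoping, the following sketch moves a single operation onto the host from inside an IPU-compiled graph. It assumes an SDK version that provides outside_compilation_scope, and tf.exp merely stands in for a genuinely unsupported operation:

    import tensorflow as tf
    from tensorflow.python import ipu
    from tensorflow.python.ipu import scopes

    def my_net(x):
        x = tf.nn.relu(x)  # runs on the IPU
        with scopes.outside_compilation_scope():
            x = tf.exp(x)  # executed on the host CPU instead
        return tf.reduce_sum(x)  # back on the IPU

    with scopes.ipu_scope("/device:IPU:0"):
        result = ipu.ipu_compiler.compile(my_net, inputs=[tf.ones([4])])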

2.2. Planning the transfer

The following guidelines give you some ideas about how to transfer your model. There are multiple, non-exclusive paths you can take.

If your model is sufficiently small and a code transfer looks straightforward, it makes sense to start with the full model. If the model is very large or contains components that are not supported on the IPU, it makes sense to start with a smaller version of the model.

A similar approach is useful when looking at the code base. If the implementation is fairly simple, a direct change will probably work. Otherwise, it is better to break the transfer down into smaller development stages. Since more complex code should come with meaningful unit tests, those tests are probably a good starting point.

A different approach is to use our repository of examples. Instead of changing an existing code base, it might be easier to just transfer the model of interest to an existing example.

2.3. Next steps

The first step after planning is to port the model and work through common initial issues, such as operations that have to be mapped onto the CPU (see Section 5, Scoping and determining unsupported operations), or wrong input data types for the estimator because the original code provides iterators instead of datasets or feeds. If no estimator is used, it is recommended to use infeed and outfeed queues to enable optimal communication between the host and the IPU (see Section 3, ResNeXt inference example).
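A minimal sketch of such feeds, assuming the Graphcore TF1 port (in some SDK versions the queue constructors also require a feed_name argument), might look like this:

    import tensorflow as tf
    from tensorflow.python import ipu

    # A toy dataset; in practice this is your input pipeline.
    dataset = tf.data.Dataset.from_tensor_slices(
        tf.random.uniform([64, 4])).batch(8, drop_remainder=True).repeat()

    infeed = ipu.ipu_infeed_queue.IPUInfeedQueue(dataset)
    outfeed = ipu.ipu_outfeed_queue.IPUOutfeedQueue()

    def body(x):
        # One on-device step: consume a batch, enqueue a result.
        return outfeed.enqueue(tf.reduce_sum(x, axis=1))

    def my_net():
        # Run several iterations on the device per host interaction.
        return ipu.loops.repeat(10, body, infeed_queue=infeed)

    with ipu.scopes.ipu_scope("/device:IPU:0"):
        run_loop = ipu.ipu_compiler.compile(my_net, inputs=[])

    cfg = ipu.config.IPUConfig()
    cfg.auto_select_ipus = 1
    cfg.configure_ipu_system()

    with tf.Session() as sess:
        sess.run(infeed.initializer)
        sess.run(run_loop)
        print(sess.run(outfeed.dequeue()))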

The second step is to profile the code to understand potential bottlenecks that might impact processing speed or memory consumption. One approach to exploring bottlenecks outside of the IPU is Python's built-in profiler, cProfile. In Section 3.4, Generating a report, we describe how the IPU itself can be profiled efficiently.
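For the host side, a minimal cProfile sketch (training_loop is a placeholder for your own run loop) could be:

    import cProfile
    import pstats

    def training_loop():
        ...  # placeholder for your session.run loop

    # Alternatively: python -m cProfile -o host_profile.stats my_script.py
    cProfile.run("training_loop()", "host_profile.stats")
    pstats.Stats("host_profile.stats").sort_stats("cumulative").print_stats(20)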

Thirdly, if a smaller model was used for getting started, it is now time to scale the model up, with further profiling as needed. There are two approaches to scaling: if the model easily fits on one IPU, the batch size can be increased and multiple IPUs can process the data in parallel (data parallelism); if the model does not fit on a single IPU, it is recommended to spread the model across multiple IPUs (model parallelism). See the document on Model Parallelism and the TensorFlow 1 Pipelining Tutorial for more details about how to split your model over multiple IPUs.
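As a configuration sketch using the IPUConfig API of recent SDKs (older SDKs use the create_ipu_config/auto_select_ipus utility functions instead), requesting devices for either scaling path could look like this:

    from tensorflow.python import ipu

    cfg = ipu.config.IPUConfig()

    # Data parallel: select several IPUs and replicate the model across
    # them; gradients are then combined with a cross-replica optimizer.
    cfg.auto_select_ipus = 4

    # Model parallel would instead shard or pipeline one model across
    # the selected IPUs; see the pipelining tutorial for details.
    cfg.configure_ipu_system()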

Note

From Poplar SDK 3.1, TensorFlow 1 will only be supported on CentOS 7. In addition, examples and tutorials for TensorFlow 1 are only available up to version 3.0 of the SDK. There has been limited testing of the 3.0 versions of the TensorFlow 1 tutorials and examples with Poplar SDK 3.1.