Source (GitHub)

6.1. Using Infeed and Outfeed Queues in TensorFlow 2

In this tutorial, you will write code for training a simple fully connected network on the MNIST dataset to illustrate how to use infeed and outfeed queues with a custom training function.

If you are not familiar with how to run TensorFlow 2 models on the IPU, you may wish to read our MNIST starter tutorial .


  • Installed and enabled Poplar

  • Installed the Graphcore port of TensorFlow 2

Refer to the Getting Started guide for your IPU system for instructions.

Directory Structure

  • completed_code: Completed versions of the scripts described in this tutorial

  • execution_trace: Execution traces of the scripts described in this tutorial

  • This file

  • tests: A directory that contains test scripts

Table of Contents


The combination of TensorFlow, XLA and Poplar provides the ability to combine an entire training graph into a single operation in a TensorFlow graph. This accelerates training by removing the need to pass control to the IPU for each individual operation in the graph.

However, if the Python code with the training pass is called for each batch in the training dataset, then there is still the overhead of passing control to the IPU for each batch. You can deal with this by placing the training operations inside a loop, so that they can be executed multiple times on the IPU without returning control to the host.

Normally when TensorFlow runs, operations which are not inside a loop will take a fixed number of tensors as input. However, when a training operation is placed into a loop, the inputs need to take the form of streams of values. Standard TensorFlow Python feed dictionaries cannot provide data in this form, so when training in a loop, data must be fed from a TensorFlow dataset. Similarly, the outputs must be represented as a stream. TensorFlow for IPU provides two mechanisms, infeed queues and outfeed queues, for handling this when using the IPU.

You need to do the following in order to construct a system that trains in a loop:

  • Create a TensorFlow dataset to provide data to the input queue

  • Create a model

  • Create a custom training loop

  • Create an IPUInfeedQueue to feed the data to the loop

  • Create an IPUOutfeedQueue to take results out of the loop


This example illustrates how to build pipelines manually with custom functions. However, the IPU-specific versions of the Keras Model and Sequential classes automatically add an infeed and outfeed to the model’s fit, evaluate and predict methods. These methods call the run function internally.

In this tutorial, you will write code for training a simple fully connected network on the MNIST dataset to illustrate how to use infeed and outfeed queues with a custom training function. Copy the code snippets below into a new Python file. The completed code is available for comparison in the folder completed_code.

Import the necessary APIs

First of all, you need to import APIs that will be used in the example.

import time
import tensorflow as tf
from tensorflow import keras

from tensorflow.python.ipu import ipu_outfeed_queue
from tensorflow.python.ipu import config
from tensorflow.python.ipu import ipu_strategy

Define hyperparameters

You need to define the number of steps to train the model.

step_count = 10_000
steps_per_execution = 10

Prepare the dataset

Next, you need to prepare the dataset:

def create_dataset():
    mnist = keras.datasets.mnist

    (x_train, y_train), (_, _) = mnist.load_data()
    x_train = x_train / 255.0

    train_ds =, y_train))
    train_ds = train_ds.cache()
    train_ds = train_ds.shuffle(len(x_train)).batch(32, drop_remainder=True)
    train_ds = d, l: (tf.cast(d, tf.float32), tf.cast(l, tf.int32)))
    train_ds = train_ds.prefetch(

    return train_ds.repeat()

ds = create_dataset()

The TensorFlow documentation on optimizing input pipelines with the API describes how caching and prefetching work.

Define the model

Now you need to define your model. The code example provided uses this simple fully connected network:

# A simple fully-connected network model using the standard Keras Sequential API.
def create_model():
    m = keras.Sequential(
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(10, activation="softmax"),
    return m

model = create_model()

Define the custom training loop

Now that you have the dataset and the model, you can construct the device-side part of the model. In the code example provided, the body of the loop is a function which takes MNIST images and their labels as input, performs a single step of a training loop, and returns the loss. The results are then inserted into the outfeed queue using the enqueue method.

def training_loop(iterator, outfeed_queue, model, optimizer, num_iterations):
    for _ in tf.range(num_iterations):
        # Get the data for the step.
        features, labels = next(iterator)

        # Perform the training step.
        with tf.GradientTape() as tape:
            predictions = model(features, training=True)
            prediction_loss = keras.losses.sparse_categorical_crossentropy(
                labels, predictions
            loss = tf.reduce_mean(prediction_loss)

        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        # Store the loss in the outfeed queue.

The tf.function decorator compiles the Python function into a callable TensorFlow graph. We set experimental_compile=True because this means the function will be compiled with XLA, which is essential for running TensorFlow programs on the IPU.

Configure the hardware

You must create an IPU configuration to use the IPU. You can use cfg.auto_select_ipus = 1 to automatically select one IPU.

cfg = config.IPUConfig()
cfg.auto_select_ipus = 1

Create data pipeline and execute the training loop

tf.distribute.Strategy is a TensorFlow API for distributing training across multiple devices. IPUStrategy is a subclass which targets a system with one or more IPUs attached.

Creating variables within the scope of the IPUStrategy will ensure that they are placed on the IPU, but the initialization for the variables will be performed on the CPU device.

Before you execute the training loop, you need to create a data pipeline from the dataset into a training or inference loop on the IPU. This is done by creating an infeed queue and an outfeed queue.

You may notice that IPUInfeedQueue is not explicitly declared like IPUOutfeedQueue. This is because calling iter on a dataset inside an IPUStrategy creates an infeed queue internally.

An IPUOutfeedQueue is used to generate and add outfeed enqueue/dequeue operations to the graph. The infeed and outfeed queues together manage the transfer of data between the IPU graph and the host.

IPUOutfeedQueue objects have extra options to control how they collect and output the data sent to them. These options are not used in this example. Refer to the IPUOutfeedQueue documentation for details.

To execute the training loop, you should do so within the scope of an ipu_strategy.IPUStrategy using run like so:

strategy = ipu_strategy.IPUStrategy()

with strategy.scope():
    # An optimizer for updating the trainable variables.
    opt = keras.optimizers.SGD(0.01)

    # Create an instance of the model.
    model = create_model()

    # Create an iterator for the dataset.
    iterator = iter(create_dataset())

    # Create an IPUOutfeedQueue to collect results from each on device step.
    outfeed_queue = ipu_outfeed_queue.IPUOutfeedQueue()

    start_time = time.time()
    # Train the model
    for _ in range(0, step_count, steps_per_execution):

        # Run `steps_per_execution` at a time.
            args=[iterator, outfeed_queue, model, opt, steps_per_execution],
    result = outfeed_queue.dequeue()
        "Time taken using infeed/outfeed queues:", time.time() - start_time, "seconds"

steps_per_execution refers to the total number of batches of data processed by each replica (if replication is enabled) each time an engine is executed. Usually, it is passed as an argument when compiling a model in TF2 (see Keras with IPUs documentation). In this example, steps_per_execution is used here to simulate the same behaviour with a custom training loop – it essentially plays the same role as the known steps_per_execution parameter that is passed to model.compile.

Additional notes

You can also use more than one outfeed queue. This can be useful for diagnosing numerical issues. See the inspecting tensors example for details.

The same model without infeed or outfeed queues is used to train the model on the MNIST dataset in, so you can run both examples and compare the difference in efficiency.

Both models are trained on 10,000 batches so that the difference in time efficiency is apparent.

Run the model that uses the infeed and outfeed queues:

python completed_code/

Example output:

Time taken using infeed/outfeed: 14.621165752410889 seconds

Run the version without infeed and outfeed queues and compare the output:

python completed_code/

Example output:

Time taken without infeed/outfeed: 23.766914129257202 seconds

You can see that training the model with infeed and outfeed queues is much more efficient. The difference in time is even larger when more epochs are needed to train a model.

Moreover, you can visualise the execution traces of both models with the PopVision Graph Analyser. You can run the following command to enable graph profiling while executing the program:

POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "":"with_feeds", "autoReport.executionProfileProgramRunCount":"10", "autoReport.outputDebugInfo":"true", "debug.allowOutOfMemory":"true", "autoReport.outputSerializedGraph":"false"}' python completed_code/

The engine option specifies where you would like to save the profile. You can download the PopVision Graph Analyser from the Graphcore developer page and view the profile you have saved with it. Similarly, you can save the profile of in a different folder for comparison.

The engine option autoReport.executionProfileProgramRunCount specifies how many runs of the Poplar program you would like to capture. If you increase the number of runs to be captured, the execution report gets larger and so takes longer to generate. For the purpose of illustration, this option is set to 10 in this case. The default value is 2. For more information on the available options, please visit PopVision Graph Analyser User Guide.

The execution traces for both and are shown below and can be found in the execution_trace folder as well. PopVision Graph Analyser 2.4.2 is used here to generate the images.

Execution trace for program with feeds

Execution trace for program without feeds

Note that each Engine::run call for contains multiple steps while each step takes one Engine::run call for This verifies that placing the training operations inside a loop, as you did using feeds, allows them to be executed multiple times on the IPU without returning control to the host.


This example is licensed under the Apache License 2.0 - see the LICENSE file in this directory.

Copyright (c) 2021 Graphcore Ltd. All rights reserved.

It contains derived work licensed under the Apache License 2.0. See the headers in the source code for details.