2. Targeting the Poplar XLA device

The Poplar XLA devices are named /device:IPU:X, where X is an integer that identifies the logical device. A logical device can consist of one or more physical IPU devices, as described below.

An IPU-specific TensorFlow distribution strategy, the IPUStrategy, is available for setting up all appropriate scoping when creating a model. The IPUStrategy should always be used to target the Poplar XLA device.

If you are using Keras, you must instantiate your Keras model inside the strategy scope:

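import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU system before using the strategy.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # Create a Keras model inside the strategy.
  # create_model() and dataset stand for your own model and tf.data pipeline.
  model = create_model()

  # Compile the model for training.
  model.compile(
      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
      optimizer=tf.keras.optimizers.RMSprop(),
      metrics=["accuracy"],
  )

  model.fit(dataset, epochs=2, steps_per_epoch=100)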

If you are not using Keras, you must use the @tf.function decorator along with strategy.run. This causes all operations created by the Python function passed to strategy.run to be placed on the IPU system and compiled together into a single Poplar executable:

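import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU system with a single IPU.
cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 1
cfg.configure_ipu_system()

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])


@tf.function(experimental_compile=True)
def matmul_fn(x, y):
  z = tf.matmul(x, y)
  return z


strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
  # The (2x3) @ (3x2) product is computed on the IPU, giving a 2x2 tensor.
  c = strategy.run(matmul_fn, args=(a, b))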

2.1. Supported types

Poplar and the PopLibs libraries support the following data types:

  • tf.float32

  • tf.float16

  • tf.int32

  • tf.bool

2.2. Device selection

Hardware configuration options enable you to select the number of IPU devices. By default, TensorFlow will create one virtual device (/device:IPU:0) with a single IPU. The first available single IPU will be used.

Two API options on the IPUConfig are available for controlling which or how many IPUs this virtual device will use:

  • auto_select_ipus allows you to specify a quantity of IPUs to use. The virtual device will be given that many IPUs.

  • select_ipus allows you to choose a specific IPU hardware device using its ID. The device IDs can be seen with the gc-info command line tool. An ID can represent a single IPU device or a larger “multi-IPU” device that contains a group of closely connected single IPU devices.

    For example, in a 16-IPU system the single IPU device with the highest ID has the ID 15, while the largest multi-IPU device, which contains all 16 IPUs, has the ID 30. (See the IPU Command Line Tools document for details of how device IDs map to available IPUs.)
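As a rough illustration of the ID numbering, the sketch below enumerates the devices of a system under an assumed layout: IDs are given first to the single IPUs, then to groups of 2, 4, 8 and so on, up to one group containing every IPU. This assumption is consistent with the IDs 15 and 30 quoted for a 16-IPU system, but the mapping reported by gc-info on your own system is the authority. The helper multi_ipu_device_layout is hypothetical and is not part of the IPU API.

```python
def multi_ipu_device_layout(num_ipus):
    """Hypothetical sketch: map device IDs to the IPUs they contain.

    Assumes single IPUs come first, followed by groups of 2, 4, 8, ...
    up to one group holding every IPU. num_ipus must be a power of two.
    """
    if num_ipus < 1 or num_ipus & (num_ipus - 1):
        raise ValueError("num_ipus must be a power of two")
    layout = {}
    next_id = 0
    group_size = 1
    while group_size <= num_ipus:
        # Split the IPUs into consecutive groups of the current size.
        for first in range(0, num_ipus, group_size):
            layout[next_id] = list(range(first, first + group_size))
            next_id += 1
        group_size *= 2
    return layout


layout = multi_ipu_device_layout(16)
print(layout[15])  # [15]: the last single IPU device
print(len(layout[30]))  # 16: the device containing every IPU
```

Under this assumed layout a 16-IPU system exposes 31 devices (16 + 8 + 4 + 2 + 1), which is why the largest multi-IPU device ends up with ID 30.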

You can also pass a list (or tuple) to either of these options. This will configure a separate TensorFlow virtual device for each value in the list (/device:IPU:0, /device:IPU:1, and so on).

Once the hardware structure has been specified, the configure_ipu_system() method of the config object must be used to attach to and initialise the hardware:

from tensorflow.python import ipu

cfg = ipu.config.IPUConfig()
# Select multi-IPU with device ID 30, which contains all 16 IPUs of a 16-IPU system.
cfg.select_ipus = 30
cfg.configure_ipu_system()

This example will allocate a single TensorFlow virtual device (/device:IPU:0) which will use all 16 IPUs in a 16-IPU system.

For more examples, see the documentation for the options in the Python API: auto_select_ipus, select_ipus.

2.3. Configuring system options

In addition to auto_select_ipus and select_ipus, the IPUConfig class has many other options for system configuration. The class is a nested structure of attributes which organises these options into several categories and sub-categories. To use it, instantiate it and treat its attributes like ordinary Python variables:

# Initialize an IPUConfig instance
cfg = ipu.config.IPUConfig()

# Ask for 2 IPUs on /device:IPU:0
cfg.auto_select_ipus = 2

# Change our mind and decide we need 4 IPUs instead. This is fine since
# setting any config attribute has no effect until the config is used to
# configure the IPU system
cfg.auto_select_ipus = 4

# Configure the system with the config, creating /device:IPU:0 with 4 IPUs
cfg.configure_ipu_system()

Some attributes are not configuration options themselves, but rather names for general categories of grouped options. Categories cannot be set. You can access an arbitrarily nested attribute with chained dot notation, and an attribute’s full name indicates exactly where it is in the IPUConfig nested structure. For example:

cfg = ipu.config.IPUConfig()

# Set the IPU Model version, which is in the "ipu_model" category
# Its full name is "ipu_model.version"
cfg.ipu_model.version = "ipu2"
print(cfg.ipu_model.version)  # ipu2

# Set the multi replica distribution process count, which is in the
# "multi_replica_distribution" sub-category of the "experimental" category
# of the config
cfg.experimental.multi_replica_distribution.process_count = 2
print(cfg.experimental.multi_replica_distribution.process_count)  # 2

# You cannot assign to a category, since it is a grouping of options and not
# an option itself
cfg.experimental = 2  # Will error

cfg.configure_ipu_system()

Options are type checked when they are set, and if an option cannot be found, a similarly spelled one may be suggested. Additionally, setting a deprecated option will give you a warning:

cfg = ipu.config.IPUConfig()

# Try to set an option to an incorrect type
cfg.ipu_model.version = True  # Will error immediately asking for a string
# Make a spelling mistake when writing the option name
cfg.ipu_model.vrsion = "ipu2"  # Will ask if you meant "version"
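To make the behaviour of nested categories, type checking and spelling suggestions concrete, here is a toy re-creation in plain Python. It is only a sketch of the observable behaviour described in this section, not how IPUConfig is implemented: the ToyCategory class is hypothetical, and the suggestions come from the standard library's difflib.get_close_matches, which IPUConfig may or may not use.

```python
import difflib


class ToyCategory:
    """Hypothetical stand-in for a config category (not IPUConfig's code).

    options maps an option name to (type, default); categories maps a
    sub-category name to another ToyCategory.
    """

    def __init__(self, options, categories=None):
        # Bypass our own __setattr__ while storing internal state.
        object.__setattr__(self, "_types", {n: t for n, (t, _) in options.items()})
        object.__setattr__(self, "_values", {n: d for n, (_, d) in options.items()})
        object.__setattr__(self, "_categories", dict(categories or {}))

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name in self._categories:
            return self._categories[name]
        if name in self._values:
            return self._values[name]
        raise AttributeError(self._unknown(name))

    def __setattr__(self, name, value):
        if name in self._categories:
            raise AttributeError(f"'{name}' is a category and cannot be set")
        if name not in self._types:
            raise AttributeError(self._unknown(name))
        expected = self._types[name]
        if not isinstance(value, expected):
            raise TypeError(
                f"'{name}' expects {expected.__name__}, got {type(value).__name__}")
        self._values[name] = value

    def _unknown(self, name):
        # Suggest the closest known option or category name, if any.
        known = list(self._types) + list(self._categories)
        matches = difflib.get_close_matches(name, known, n=1)
        hint = f" Did you mean '{matches[0]}'?" if matches else ""
        return f"No option or category named '{name}'.{hint}"


cfg = ToyCategory({}, {"ipu_model": ToyCategory({"version": (str, "ipu2")})})
cfg.ipu_model.version = "ipu2"  # accepted: correct type
```

Because each category is itself an object that validates its own attributes, the same rules hold at every level of nesting.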

The metadata for any attribute, including its default, allowed types, docstring, full name and whether or not it is deprecated can all be accessed through the get_attribute_metadata() function, which takes a string representing the attribute’s full name, relative to the category you are calling the function on. For example:

cfg = ipu.config.IPUConfig()

# Access by full name from the base config:
metadata = cfg.get_attribute_metadata("ipu_model.version")
# Access by name relative to the "ipu_model" sub-category:
metadata = cfg.ipu_model.get_attribute_metadata("version")

# Use the metadata
print(metadata.types)  # [str]
print(metadata.default)  # "ipu2"
print(metadata.deprecated)  # False indicates it is not deprecated
print(metadata.__doc__)  # "Specify the ipu version to be used by the..."

# Check a value against the option's type
metadata.check_type(5)  # Will fail, since this option needs a string.

# Print a deprecation message if the option is deprecated
metadata.warn_if_deprecated()

This is useful for forwarding IPU configuration options to command line interfaces in applications. Note that you can also access the metadata of categories themselves, but the types and default fields will be empty. You can see a full description of the available metadata in the AttributeMetadata class.
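As a sketch of forwarding options to a command line interface, the helpers below create an argparse flag from an option's metadata (its types, default and docstring) and then copy the parsed values back onto the config by full name. The helper names add_config_option and apply_config_options are hypothetical. The sketch assumes options whose first allowed type can be parsed from a string (str, int, float or bool); options with enum or list types would need extra handling.

```python
import argparse


def _parse_bool(text):
    # type=bool would treat any non-empty string, even "false", as True.
    lowered = text.lower()
    if lowered in ("1", "true", "yes"):
        return True
    if lowered in ("0", "false", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {text!r}")


def add_config_option(parser, cfg, full_name):
    """Add a --<full_name> flag whose type, default and help come from metadata."""
    metadata = cfg.get_attribute_metadata(full_name)
    # Simplification: use only the first allowed type.
    arg_type = metadata.types[0] if metadata.types else str
    if arg_type is bool:
        arg_type = _parse_bool
    parser.add_argument(
        "--" + full_name,
        dest=full_name.replace(".", "__"),
        type=arg_type,
        default=metadata.default,
        # argparse %-formats help strings, so escape any literal "%".
        help=(metadata.__doc__ or "").replace("%", "%%"),
    )


def apply_config_options(cfg, args, full_names):
    """Set each (possibly nested) config option from the parsed arguments."""
    for full_name in full_names:
        *categories, leaf = full_name.split(".")
        target = cfg
        for category in categories:
            target = getattr(target, category)
        setattr(target, leaf, getattr(args, full_name.replace(".", "__")))
```

With a real config you would call, for example, add_config_option(parser, cfg, "ipu_model.version") for each option you want to expose, then apply_config_options(cfg, parser.parse_args(), names) before cfg.configure_ipu_system().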

When you are finished adjusting the IPUConfig instance, use it to configure the IPU system by calling its configure_ipu_system() method. The options set on an instance will not have any effect until this is done. Note that configuring the system does not alter the instance. For example:

cfg = ipu.config.IPUConfig()
cfg.auto_select_ipus = 4
cfg.ipu_model.compile_ipu_code = False
cfg.ipu_model.version = "ipu2"
cfg.scheduling.algorithm = ipu.config.SchedulingAlgorithm.Clustering
...

cfg.configure_ipu_system()
# The IPU system can now be used.

# The config can still be accessed after configuration.
print(cfg.ipu_model.version)  # ipu2

In addition to auto_select_ipus and select_ipus, some other options on the IPUConfig which can be used to configure the hardware and compiler are highlighted below:

  • allow_recompute turns on recomputation, to reduce the memory requirement of the model at the expense of speed.

  • selection_order controls the mapping between the “virtual” IPUs and physical IPUs of a multi-IPU device.

  • compilation_poplar_options sets general options to be passed to the Poplar compiler.

  • convolutions.poplar_options, matmuls.poplar_options and pooling.poplar_options pass specific options directly to the PopLibs convolution, matmul and pooling operations.

  • ipu_model is a category containing options that control the Poplar IPU Model device type.

  • floating_point_behaviour is a category containing options that allow you to configure the IPU device’s floating point control register.

  • optimizations is a category containing options that can toggle various optimizations, which generally have a performance or memory use trade-off.

  • scheduling contains options that specify and control the scheduling algorithm the Poplar XLA backend uses to schedule the operations in the graph before it is lowered to Poplar.

To view the full list, see IPUConfig.

2.3.1. TF_POPLAR_FLAGS environment variable

The options passed through the IPUConfig are tied to the application that uses that config to configure the IPU system. Some configuration options are instead provided by an environment variable called TF_POPLAR_FLAGS.

If you set TF_POPLAR_FLAGS=--help and run a TensorFlow program, it will output help for each option. The available options are described below:

Table 2.1 TensorFlow configuration options

Option

Description

--dump_schedule_as_dot

Dump the schedule of the XLA graph to the user console.

--executable_cache_path=path

Enables the Poplar executable cache. See Caching of compiled executables.

--fallback_scheduler

Uses the standard TensorFlow scheduler, instead of the Graphcore specific one.

--help

Print information for all the options.

--log_cycle_count=int

Log the number of cycles used in evaluating the main graph. The numeric argument indicates the tile on which the cycle count operation will be created. This may be used as an alternative to profiling for graphs with dynamic control flow.

--max_compilation_threads=int

Sets the maximum number of threads which Poplar is allowed to use for compiling the executable.

--max_infeed_threads=int

Sets the maximum number of threads which each infeed queue is allowed to use when accessing data from datasets.

--null_data_feed

Cause any infeed queues to copy garbage data to the IPU rather than real data. This option can be used to determine whether the dataset provided to the infeed queue is the bottleneck during execution.

--save_interval_report=path

Dumps the Poplar interval report to the given directory.

--save_vertex_graph=path

Dumps the Poplar vertex graph (as a DOT file) to the given directory.

--tensor_map_file_path=path

Cause a JSON file containing the tile mapping of all tensors to be written to the given directory.

--use_ipu_model

Use the Poplar IPU Model for graph compilation and execution.

--use_synthetic_data

Prevent the system from downloading or uploading data to the IPU when executing code. This can be useful for testing performance without the overhead of data transfer.

Using this option, all data transfer is prevented. You can instead use --synthetic_data_categories to prevent the transfer of specific categories of tensor data.

When using this option, the graph’s transferred input tensors will never be initialized and can therefore have undefined content. You can avoid this with the --synthetic_data_initializer option.

The outputs from any outfeeds will also be uninitialized tensors on the host which may also contain undefined content.

This option cannot be used when dequeuing an IPUOutfeedQueue which is in IPUOutfeedMode.LAST mode.

--synthetic_data_categories

Prevent the system from downloading or uploading data of the given types to the IPU when executing code. This can be useful for testing performance without the overhead of data transfer.

The values can be any of: infeed, outfeed, seed, hostembedding or parameters.

For example, --synthetic_data_categories='infeed,outfeed' will use synthetic data just for infeeds and outfeeds.

When using this option, the graph’s transferred input tensors will never be initialized and can therefore have undefined content. You can avoid this with the --synthetic_data_initializer option.

This option is a more selective alternative to --use_synthetic_data; you shouldn’t specify both.

--synthetic_data_initializer

When using synthetic data, the graph’s transferred input tensors will never be initialized and can therefore have undefined content. You can use this option to prevent this by initializing these tensors on the device. The tensors can be initialized with a constant value X: --synthetic_data_initializer=X or random values: --synthetic_data_initializer=random.

For this option to have an effect, you must also specify --use_synthetic_data or --synthetic_data_categories.

--while_loop_brute_force_max_trip_count=int

Sets the upper bound on the number of iterations for which a while loop is simulated when trying to determine, by brute force, how many times it will execute.

--show_progress_bar=true|false|auto

Whether to show the compilation progress bar: true, false or auto. When set to auto, the progress bar is only enabled when attached to a console, VLOG logging is disabled, and the graph being compiled can take more than a few seconds to compile. Defaults to auto.

--on_demand_device_poll_time=int

When using the ‘ON_DEMAND’ connection type, sets how often (in milliseconds) to poll for the device when a device is not available. Defaults to 1000 ms; the minimum is 100 ms.

--on_demand_device_timeout=int

When using the ‘ON_DEMAND’ connection type, sets how long to wait (in milliseconds) for a device before timing out. Defaults to 3600000 ms (1 hour).

--sync_replica_start

Add a global synchronisation point at the start of each replica’s main Poplar program. This can be used to force each replica to not execute until all replicas have started.
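The synthetic data options in Table 2.1 interact with one another. As a sketch, the hypothetical function check_synthetic_data_flags below parses a TF_POPLAR_FLAGS string and reports the combinations the table advises against: setting both --use_synthetic_data and --synthetic_data_categories, passing --synthetic_data_initializer without either of them, and naming a category outside the listed values. It encodes only the rules stated in the table, not any other checks Poplar may perform.

```python
import shlex

# Values listed for --synthetic_data_categories in Table 2.1.
SYNTHETIC_DATA_CATEGORIES = {"infeed", "outfeed", "seed", "hostembedding", "parameters"}


def check_synthetic_data_flags(flags):
    """Return warnings about synthetic-data options in a TF_POPLAR_FLAGS string."""
    options = {}
    for token in shlex.split(flags):  # shlex also strips quotes around values
        name, _, value = token.lstrip("-").partition("=")
        options[name] = value

    warnings = []
    uses_all = "use_synthetic_data" in options
    uses_categories = "synthetic_data_categories" in options
    if uses_all and uses_categories:
        warnings.append(
            "--use_synthetic_data and --synthetic_data_categories should not both be set")
    if "synthetic_data_initializer" in options and not (uses_all or uses_categories):
        warnings.append(
            "--synthetic_data_initializer has no effect without synthetic data enabled")
    if uses_categories:
        requested = set(filter(None, options["synthetic_data_categories"].split(",")))
        unknown = requested - SYNTHETIC_DATA_CATEGORIES
        if unknown:
            warnings.append(f"unknown synthetic data categories: {sorted(unknown)}")
    return warnings


print(check_synthetic_data_flags("--synthetic_data_initializer=random"))
```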

Multiple options can be specified at the same time by separating them with spaces, like command line switches. Quote the value so that the shell passes it as a single string, for example: TF_POPLAR_FLAGS="--executable_cache_path=/tmp/cache --log_cycle_count=123".
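If an application assembles these flags programmatically, a small helper keeps the formatting consistent. The sketch below is hypothetical and is not part of the IPU API: build_poplar_flags turns a dictionary into a TF_POPLAR_FLAGS string, which is then set from Python. It assumes that values contain no spaces and, to be safe, that the variable is set before TensorFlow is imported.

```python
import os


def build_poplar_flags(options):
    """Join options into a TF_POPLAR_FLAGS string.

    True gives a bare switch (for example --use_ipu_model), False or None
    omits the option, and any other value gives --name=value. Pass "false"
    as a string for options such as --show_progress_bar that take an
    explicit true/false value.
    """
    parts = []
    for name, value in options.items():
        if value is True:
            parts.append(f"--{name}")
        elif value is None or value is False:
            continue
        else:
            parts.append(f"--{name}={value}")
    return " ".join(parts)


# Set the variable before importing TensorFlow (assumed to be the safe order).
os.environ["TF_POPLAR_FLAGS"] = build_poplar_flags({
    "executable_cache_path": "/tmp/cache",
    "log_cycle_count": 123,
})
print(os.environ["TF_POPLAR_FLAGS"])
```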

2.4. Supported operations

A list of supported TensorFlow operations is provided in TensorFlow operators supported by the IPU.

2.5. Unsupported operations

TensorFlow core operations that use variable buffers or strings are not supported; JpegDecode is one example.

Unsupported operations will cause the compilation to fail.

By including tf.debugging.set_log_device_placement(True) in your script, you can check whether the operations in your graph are targeting the Poplar XLA device.

2.6. Error Handling

The error and exception handling by TensorFlow is divided into two categories:

  • Poplar graph construction and compilation errors which occur during construction and compilation of TensorFlow programs.

  • Poplar runtime errors which occur during the execution of the compiled program.

The following sections describe the actions you need to take when these errors occur.

2.6.1. Construction and compilation errors

These errors are reported to the user using the TensorFlow Status error classes. The error messages contain information about why the error occurred and what action the user is required to take in order to stop the error from occurring.

2.6.2. Runtime errors

These errors and exceptions occur when running a Poplar program. The full list of all the exceptions and their meanings can be found in the Poplar documentation in the Exceptions section of the Poplar API reference manual.

These runtime errors are handled in the following manner:

  • application_runtime_error - a tensorflow.errors.InternalError error is raised. The error message contains the reason why the error occurred. An IPU reset will be performed before the next execution of a Poplar program.

  • recoverable_runtime_error with a recovery action poplar::RecoveryAction::IPU_RESET - a tensorflow.errors.InternalError error is raised. The error message contains the reason why the error occurred. An IPU reset will be performed before the next execution of a Poplar program.

  • All other runtime errors - the process executing the Poplar program is terminated and the full error message is logged to the console. When these errors occur manual intervention might be required before the system is operational again. The error message might contain a required recovery action.
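For the two error kinds that surface as tensorflow.errors.InternalError, an IPU reset is performed before the next execution, so an application may choose to retry the failed step. The sketch below is a generic, hypothetical retry helper in plain Python; in a real program you would pass (tf.errors.InternalError,) as the retryable exceptions. Whether a retry is safe depends on your application (for example, whether a partially completed training step can simply be repeated). Errors in the third category cannot be handled this way, because the process is terminated.

```python
import logging


def run_with_retry(step_fn, retryable_errors, max_attempts=2):
    """Call step_fn(), retrying when it raises one of retryable_errors.

    The last error is re-raised once max_attempts calls have failed.
    Any other exception propagates immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return step_fn()
        except retryable_errors as err:
            logging.warning("Attempt %d of %d failed: %s", attempt, max_attempts, err)
            if attempt == max_attempts:
                raise
```

For example, run_with_retry(lambda: strategy.run(step_fn, args=(x,)), (tf.errors.InternalError,)), where step_fn and x are placeholders for your own step function and inputs.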