Scope of this document

This document contains the release notes for the Poplar SDK 2.5.1 for Graphcore’s IPU product family. The software deliverables covered by this document are the following:

Driver & Utilities

Driver and associated utilities needed by the Graphcore IPU.

PopART

The Poplar Advanced Runtime is a flexible, ONNX-compatible runtime supporting both training and inference.

PopTorch

The PopTorch library provides a set of extensions for PyTorch to enable it to run on the Graphcore IPU hardware.

Poplar

A graph programming framework for the IPU.

Poplar Libraries

The PopLibs library provides a range of higher level functions commonly used in machine learning applications.

GCL

The Graphcore Communication Library enables high-performance scale-out for IPU systems.

PopDist

Poplar Distributed Configuration Library (PopDist) is a library for configuring and coordinating distributed execution of (large-scale) machine learning applications.

PopRun

PopRun is a command line utility to launch distributed applications on Graphcore Pod systems.

Libpva

The PopVision analysis library (libpva) allows programmatic analysis of the IPU profiling information used by the PopVision Graph Analyser.

TensorFlow

An implementation of the TensorFlow framework for the Graphcore IPU.

IPU TensorFlow Addons

A collection of Graphcore IPU-specific features for the TensorFlow framework.

Poplar Triton Backend

A backend for the Triton Inference Server that supports models saved as Poplar executables.

Release overview

Drivers & Utilities

  • gc-monitor and gc-info can now display information for IPUs that are not included in the active partition. The functionality is also available in the gcipuinfo library.

  • gc-monitor shows IPUs that are in use by other hosts.
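
    For example, IPUs outside the active partition can be included in the gc-monitor output (an illustrative invocation using the --all-partitions option referenced later in the changelog; the exact output depends on your system):

      $ gc-monitor --all-partitions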

PopART

  • Added support for RNN operator (preview).

  • Improved Automatic Loss Scaling support (experimental).

PopTorch

Compatible with PyTorch 1.10

  • Increased operator support, including support for torch.nn.RNN.

  • Improved Automatic Loss Scaling support (experimental).
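
As an illustration of the newly added operator support, the following minimal sketch wraps a torch.nn.RNN module for IPU inference with PopTorch. The model definition and tensor sizes are arbitrary examples, not taken from these release notes:

    import torch
    import poptorch

    # A small RNN wrapped for IPU inference; sizes are illustrative only.
    class RNNModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.rnn = torch.nn.RNN(input_size=16, hidden_size=32, batch_first=True)

        def forward(self, x):
            output, _ = self.rnn(x)
            return output

    opts = poptorch.Options()
    ipu_model = poptorch.inferenceModel(RNNModel(), opts)
    result = ipu_model(torch.randn(4, 10, 16))  # executed on the IPU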

Poplar

  • Significantly reduced the amount of host memory needed when compiling very large models. Most of these optimisations are enabled by default. There is a new experimental Poplar Engine option that allows compilation to be done for a subset of IPUs at a time.

GCL

  • Collective optimisations for improved scale-out performance.

  • Extended collective support to include broadcast/oneToAll.

TensorFlow

TensorFlow 1.15.5 and TensorFlow 2.5.2

  • Keras support in TensorFlow 2 now includes Keras Model subclasses.

  • Optimisations have been made for loop based models (such as RNNs) to improve compile time, memory usage and runtime performance.
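
The following minimal sketch shows a subclassed Keras model compiled and used inside an IPUStrategy scope. The layer sizes, optimiser and loss are illustrative assumptions, not part of these release notes:

    import tensorflow as tf
    from tensorflow.python import ipu

    # Configure a single IPU for this process.
    config = ipu.config.IPUConfig()
    config.auto_select_ipus = 1
    config.configure_ipu_system()

    # A subclassed Keras model (illustrative only).
    class MyModel(tf.keras.Model):
        def __init__(self):
            super().__init__()
            self.dense = tf.keras.layers.Dense(8)

        def call(self, inputs):
            return self.dense(inputs)

    strategy = ipu.ipu_strategy.IPUStrategy()
    with strategy.scope():
        model = MyModel()
        model.compile(optimizer="sgd", loss="mse", steps_per_execution=4)
        # model.fit(dataset, epochs=1)  # dataset assumed to be a tf.data.Dataset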

Poplar Triton Backend

  • Preview version of a backend for the Triton Inference Server.

Package contents

The downloaded unified Poplar SDK will contain the following packages:

Ubuntu 18.04

Package                  Version
Driver & Utilities       1.1.1
PopART                   2.5.1
PopTorch                 2.5.0 (for PyTorch 1.10)
Poplar                   2.5.0
PopDist/PopRun           2.5.0
TensorFlow 1.15.5        Graphcore TensorFlow 2.5.1
TensorFlow 2.5.2         Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons    2.5.1
Poplar Triton Backend    2.5.1

Ubuntu 20.04

Package                  Version
Driver & Utilities       1.1.1
PopART                   2.5.1
PopTorch                 2.5.0 (for PyTorch 1.10)
Poplar                   2.5.0
PopDist/PopRun           2.5.0
TensorFlow 2.5.2         Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons    2.5.1
Poplar Triton Backend    2.5.1

CentOS 7.6

Package                  Version
Driver & Utilities       1.1.1
PopART                   2.5.1
PopTorch                 2.5.0 (for PyTorch 1.10)
Poplar                   2.5.0
PopDist/PopRun           2.5.0
TensorFlow 1.15.5        Graphcore TensorFlow 2.5.1
TensorFlow 2.5.2         Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons    2.5.1
Poplar Triton Backend    2.5.1

Debian 10

Package                  Version
Driver & Utilities       1.1.1
PopART                   2.5.1
PopTorch                 2.5.0 (for PyTorch 1.10)
Poplar                   2.5.0
PopDist/PopRun           2.5.0
TensorFlow 2.5.2         Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons    2.5.1
Poplar Triton Backend    2.5.1

Note

See Appendix A for additional TensorFlow requirements.

Product support and compatibility matrix

SUPPORTED
These products are actively worked on: they will receive new features, general updates and security updates.
Notice of deprecation will be sent in advance for supported products.
DEPRECATED
These products will only receive security updates.
These products are expected to work with the indicated products; however, correctness is not guaranteed.
It is advised not to upgrade to this software version unless strictly necessary.
In the future, these products may move to a Not Supported state without further notice.
The support level will reflect the deprecated status.
NOT SUPPORTED
These products are not expected to work with this release.
No support will be provided.

Important

Deprecated products can be moved to a Not supported status without further notice.

IPU-Machine System Software compatibility matrix

IPU-Machine Model   IPU-M Software Version   Support level   Notes
IPU-M2000           2.5.0                    Supported       N/A
Bow-2000            2.5.0                    Supported       N/A

IPU PCIe Hardware Support level

Model         Revision        ICU Firmware version   Driver version   Support level   Notes
C2 300-0004   All revisions   1.4.14                 1.0.57           Deprecated      N/A

Note

Use the firmware revision that corresponds to the IPU revision.

Important

For Firmware revision, compatibility is only enforced for patch versions.

Driver Support level

OS                  Support level   Supported Kernel Version   Notes
CentOS 7.4/7.5      Supported       3.10                       CentOS LTS kernel.
CentOS 7.6          Supported       3.10                       CentOS LTS kernel.
Microsoft Windows   Supported       Windows Server 2019
Ubuntu 18.04        Supported       5.4                        Ubuntu LTS kernel.
Ubuntu 20.04        Supported       5.4                        Ubuntu LTS kernel.
Debian 10           Supported       4.19                       Debian LTS kernel.

Warning

It is strongly recommended to update the kernel module of the driver to the version included with this 2.5.1 release, to avoid incompatibilities with the non-kernel components of this SDK.

SDK 2.5.1 Support level

OS                  Support level   Notes
Microsoft Windows   Not Supported
CentOS 7.6          Supported
Ubuntu 18.04        Supported
Ubuntu 20.04        Supported
Debian 10           Supported

Supported tools

Ubuntu 18.04

Tool            Support level   Version   Notes
GCC/G++         Supported       7.2.0
libstdc++       Supported       6.0.24
libc            Supported       2.27
binutils        Supported       2.30
Python          Supported       3.6
Boost library   Deprecated      1.70

Ubuntu 20.04

Tool            Support level   Version   Notes
GCC/G++         Supported       9.3.0
libstdc++       Supported       10.3.0
libc            Supported       2.31
binutils        Supported       2.34
Python          Supported       3.8
Boost library   Deprecated      1.71

CentOS 7.6

Tool            Support level   Version   Notes
GCC/G++         Supported       7.3.1
libstdc++       Supported       6.0.24
libc            Supported       2.17
binutils        Supported       2.28
Python          Supported       3.6
Boost library   Deprecated      1.70

Debian 10

Tool            Support level   Version   Notes
GCC/G++         Supported       8.3
libstdc++       Supported       6.0.24
libc            Supported       2.28
binutils        Supported       2.28
Python          Supported       3.7.3
Boost library   Deprecated      1.70

List of changes

The following sections list the changes in version 2.5.1, as well as in older releases, for all products contained in the Poplar SDK.
There are three main sections, divided by topic:
Changelogs

The Changelogs section lists important bug fixes and relevant functionality that has been added. Minor fixes or features are not listed.

Known issues

The Known issues section lists all important issues known to date that impact Poplar functionality.

Compatibility changes

The Compatibility changes section captures any changes that must be made to existing code for it to remain compatible with this version of the SDK.

Changelogs

Product                 Changelog
Driver & Utilities      Changelog Driver & Utilities
PopART                  Changelog PopART
PopTorch                Changelog PopTorch
Poplar                  Changelog Poplar
Poplar Libraries        Changelog Poplar Libraries
GCL                     Changelog GCL
PopDist/PopRun          Changelog PopRun/PopDist
Libpva Library          Changelog Libpva Library
TensorFlow              Changelog TensorFlow
IPU TensorFlow Addons   Changelog IPU TensorFlow Addons
Poplar Triton Backend   Changelog Poplar Triton Backend

Driver & Utilities Changelog

Kernel Module

1.1.1

  • T50828: Remove deprecated sync utilisation code.

  • T52353: Removed the need for GCDA_MONITOR to get power/temperature values.

  • T52721: Preserve mark counts in non POSTED modes on exit.

  • T52775: Implemented detection of multiple tile parity errors.

  • T52776: If an IPU memory failure occurs, record unrecoverable error and mark device as unusable.

  • T55372: Added new correctable error counters that clear on IPU reset.

  • T55565: Improved power and temperature reporting.

  • T56430: Detect when processes are attached to a device from another namespace.

1.0.57

  • T45456: PCIe driver uses pin_user_pages API with Linux kernels 5.8.0+.

  • T47498: Added Host Link Correctable Errors.

  • T48270: Update the IPU PCIe driver to correctly use the DMA API.

  • T48616: Driver scripts improvements.

  • T49874: Clear allocated PL-DDR memory prior to use on native PCIe.

Low level libraries and tools

2.5.0

  • T38729: Added gc-hostlatencytest.

  • T39431: Added GCDA API to allow querying the available PL-DDR on an IPU-M.

  • T40698: Updated gc-hosttraffictest to provide performance statistics for host memory transfers.

  • T41646: Added IPU-M version info to gc-monitor.

  • T43803: Generate libpvti documentation with sphinx_resources.

  • T44446: Force gRPC to not use a proxy server.

  • T48984: Refactor conversion of fabric exceptions to graphcore_target_access exceptions to improve maintainability.

  • T49018: Extended the PVTI API to allow setting of user thread names.

  • T49170: Add IPU power profile query option to gc-info.

  • T49902: Removed PCIe ID field from gc-monitor for Fabric devices.

  • T49958: Updated gc-info -l to return an error code if no devices are found.

  • T50043: gcipuinfo: add path parameter to application event record retrieval API.

  • T50828: Remove deprecated sync utilisation code.

  • T51093: gcipuinfo: add attributes to application event record listing IPUoF hosts.

  • T51249: GC tools report device discovery errors when no IPUs found.

  • T51264: Fix issues when attach is aborted at an early stage.

  • T51460: Add support for static partitions with varying sync types for the hardware testing command line tools.

  • T51503: Avoid a segmentation fault when using legacy environment variable IPUOF_CONFIG_PATH with an empty value.

  • T51526: gc-monitor: track IPUs that are in use by other headnodes.

  • T51527: gc-monitor: when IPUs are in use by other headnodes, display hostname.

  • T51694: Add error checking in GCDA when requesting invalid buffers.

  • T51744: Add option to set the duration for --host tests in gc-hosttraffictest.

  • T51774: Extend internal interface used by V-IPU to support the enabling and disabling of NLC links.

  • T51832: Fix a rare issue in hgwio_server that can temporarily cause failure to attach.

  • T51974: IPUoF client calls ibv_fork_init() during RDMA client initialisation.

  • T52102: Improve error handling on attach in IPUoF.

  • T52132: Ensure all buffers are detached during IPUoF device detach.

  • T52248: Add sync group configuration debug information to host sync timeout exceptions.

  • T52249: The bootloader now throws GraphcoreDeviceAccessExceptions::ipu_bootloader_missing_sync for any bootloader sync errors so that they can be caught for sync debug reporting.

  • T52458: Added option to gc-monitor and gc-info to view IPUs in other partitions.

  • T52459: gcipuinfo can return device attributes and run health checks on devices in other partitions.

  • T52606: Improve IPUoF client HSP debug logging messages.

  • T52609: Fix fallback strategies used in IPUoF client HSP polling.

  • T52721: Preserve mark counts in non POSTED modes on exit.

  • T52775: Implemented detection of multiple tile parity errors.

  • T52776: If an IPU memory failure occurs, record unrecoverable error and mark device as unusable.

  • T53084: Updated IPUoF to allow tools to see IPU devices outside of the current partition.

  • T53170: Prevent the increment of marks on devices that have a GSP pin configuration that does not support HSP. This improves IPUoF performance for the bootloader and avoids confusing debug messages.

  • T53188: Fix docker images not working with Broadcom RNIC.

  • T53326: Make gcipuinfo report no IPU devices found as an error.

  • T53422: Fixed HSP update race between IPUoF client and server.

  • T53451: Make several attempts at checking if PL DDR is cleared at startup.

  • T53537: Order IPU-M devices numerically and by IPU Id in PCIe in gc-monitor display.

  • T53741: Fixed popc --version deadlock when PVTI is enabled.

  • T53755: Avoid RPC timeouts after first attach.

  • T53822: Added support for detection and handling of multiple parity errors.

  • T53884: Updated IPUoF RDMA QP retry count to improve link reliability.

  • T53895: Optimise IPUoF behavior on first attach.

  • T53977: Fix IPUoF race condition when receiving an attach request during detach.

  • T54030: Remove connection disconnect when get_device_info call fails.

  • T54110: Improve IPUoF mirror fence logging.

  • T54119: IPUoF server enables memory error checking.

  • T54468: Improved the IPUoF error message when there’s been an issue creating the connection.

  • T54615: Improve recovery and debug in gc-hosttraffictest when a test times out.

  • T54685: gc-monitor --all-partitions now ignores partitions in an error state rather than terminating with an error.

  • T55364: Improve availability on IPUoF server start.

  • T55389: Added link to tutorials in the PVTI user guide.

  • T55407: Reset IPU upon any gc-hosttraffictest failure to recover host interface.

  • T55411: Reduce excessive output for gc-memorytest in verbose mode.

  • T55426: Fixed PVTI PopRun exception when the trace file cannot be created or if the tables already exist in the database.

  • T55565: Improved power and temperature reporting.

  • T55629: Extended gc-binary API to support the creation of tile IPU archives in incremental steps rather than at once.

  • T55942: Fixed a rare double free of allocated memory in the IPUoF server when the IPUoF connection fails.

  • T56150: Remove unnecessary files from the release packages.

2.4.0

  • T29027: Add GCDA_OPTIONS environment variable to allow setting runtime options as json.

  • T30646: Extended gc-iputraffictest to support testing of more than 16 IPUs.

  • T37217: gc-monitor extended to support multi GCD partitions.

  • T38068: Add single IPU mode for iputraffictest.

  • T43718: Added Python documentation to tracing library.

  • T45122: Allow Poplar to reconfigure links in static partitions.

  • T45371: Added APIs to attach/RDMA-write to IPU tile memory and simple peer-to-peer RDMA write to tile tests to measure the P2P bandwidth and latency.

  • T45594: Query the IPU for the architecture during device discovery rather than using the architecture defined by the VIRM configuration.

  • T45785: Added API to query the last error status.

  • T46259: Updated PVTI to support binary meta data.

  • T46401: Gc-monitor support for multi-GCD partitions.

  • T46855: Error if both an IPUoF configuration file and the IPUOF_VIPU_API_* environment variable are used.

  • T47225: Improve SERDES link training to allow auto link negotiation.

  • T47348: Avoid printing a driver version warning in gc-monitor when no IPUs are found.

  • T47414: When invoked without a device id, gc-reset will now correctly choose the largest device for partitions greater than 16 IPUs.

  • T47498: Added Host Link Correctable Errors.

  • T47619: Fix segfault in gcipuinfo when no devices are found.

  • T47640: Initialise the IPU code/data/stack size attributes and the IPU utilisation attributes prior to attach.

  • T47727: Fix failure to start if port of RDMA device is UP but no IP address configured.

  • T47913: Improve handling of IPUoF configuration errors.

  • T48317: SoC configuration code tidy up.

  • T48377: Added documentation for GCDA attributes.

  • T48434: Added gRPC health check in IPUoF client and server.

  • T48435: Add device health check API to gcipuinfo.

  • T48437: Return getDevices result by value.

  • T48553: A new GCDA_OPTIONS feature to simulate SoC errors.

  • T48907: Set gRPC deadline in all IPUoF client requests.

  • T48911: Fast fabric error reporting during PORT_DOWN or connection unreachable.

  • T48939: Increase server robustness to link down.

  • T48947: Catch fabric exceptions when storing the sensor value in sensor loop.

  • T48956: Fixing missing error propagation in some cases.

  • T49126: Fix bug affecting gc-monitor on non-reconfigurable partitions.

  • T49134: Log rather than throw when automatically detaching during object destruction.

  • T49205: Prevent potential long delay when read_config_register calls time out.

  • T49448: Reduce timeout on CM QP failure.

  • T49477: Added gc-podman and container support package.

  • T49802: Improve shutdown time when using GCDA_MONITOR.

  • T49853: Fixed device ID initialisation in IPUoF server constructor.

  • T50043: gcipuinfo: add path parameter to application event record retrieval API.

  • T50044: Extend the timeout for attach during clearing of memory at IPUoF server startup.

  • T50404: Fixed some error messages when server is killed early.

  • T50424: Fixed some error messages when PL DDR clearing is not complete while shutting down the server.

  • T50857: Fix data race in multithreaded link training when using partial link training config.

  • T51093: gcipuinfo: add attributes to application event record listing IPUoF hosts.

  • T51526: gc-monitor: track IPUs that are in use by other headnodes.

  • T5764: Add documentation for runtime options.

PopART Changelog

2.5.1

New features

  • Add PopXL API (experimental)

  • Add support for RNN operator (preview)

  • Improvements to automatic loss scaling (experimental)

  • Add improved ability to manage PRNG behaviour across replicas (experimental)

  • Add ability to retrieve random seed

  • Add an overload of Builder.setAvailableMemoryProportion which can target multi-output nodes

  • Ensure initial inputs of gradient graphs match any user-specified provided grads

  • Ensure outputs of gradient graphs match any user-specified required grads

  • Add ability to run exported models using the Poplar Triton Backend via PopEF integration

  • Add visualisations for inplace modified and aliased tensors and graph inputs and outputs to Dot visualizer

  • DynamicSliceOp and DynamicUpdateOp can drop the first dimension of the slice if it is 1

  • Support AnchorReturnType::Final in MainLoops transform

  • Improved replicated tensor sharding (RTS) compatibility for operations

  • Make gradient clipping compatible with replicated tensor sharding (RTS)

  • Improved linter support

  • Add ability to show ONNX model proto in human readable text

  • Various improvements to executable caching

  • Add ability to perform per-replica reads and writes of variable values

  • Improved quality of debug information

  • Use slice plan in SparseAccumulateOpx

  • Add ability to merge collective operations

  • Add ability to dynamically switch off the backwards pass when using implicit pipelining

  • Add ability to refresh engine cache on-the-fly

Bug Fixes

  • Fix the logic that replaces DropoutOp with IdentityOp

  • Improve device handling in tests

  • Fix for potential deadlock condition in test runner

  • Fix in lowering logic for trailing subgraph parts that contain only calls to child subgraph parts

  • IdentityLossOpx will no longer attempt to unwind (resulting in an error) when there is a reduction

  • Fix subgraph autodiff logic

  • Allow CallOp to not have outputs connected for all of its callee outputs

  • Fix Python binding for DeviceManager::tryAttachUntilTimeout

  • Correctly promote inplace aliased and modified tensors through the Loop operation

  • Fix unwinding through multiple consecutive slice operations

  • Fix unwinding issue in MaxOpx

  • Enable bufferingDepth to be used when SessionOptions::enablePrefetchDatastreams isn’t set

  • Fix dtype clone in SparseAccumulateOpx::createInputTensor

  • Fix bug in ReplicatedTensorShardingTracer

  • Fix compile error if accl2 type is not FLOAT

  • Fix PowArg0GradOpPattern for fp16

Optimisations

  • Allow non-broadcasted indices as an input to the scatterreduce operation

  • Add ExpandCast pattern to reverse the order of an expand followed by cast to reduce memory footprint

  • Add inplace versions of WhereOp

  • Allow IdentityInplaceOp to unwind, reducing memory use when it cannot be made inplace

  • Split operators_test in two

  • Add TensorRemapOp for point-fixes of bad tensor layouts

  • Explicit recomputation support for pipelining

  • Alias zero copy tracks variables and multi-context tensors less conservatively

  • Improve graph traversal through loop-carried tensors

Logging and documentation

  • Add compile-time option to log device access events to a file

  • Improved CommGroupType::None comments

  • Fix code listings

  • Update to internal documentation build system

  • Various small user guide and API improvements

  • Add how to execute imported model to documentation

2.4.0

New features

  • Remove optional downcasting of ‘gs’ in the OptimizerDecompose Pattern, so the atomic scalar tensor is always in FP32

  • Add a new SessionOption ‘ensureFp32LossScaleTensor’. If your optimizer uses loss scaling and your model produces an FP16 loss tensor, enabling this SessionOption means that the loss scale tensor will be an FP32 tensor, and will be combined with FP16 activations as late as possible to produce the first FP16 gradients

  • Implement IncrementModOp which does y = (x + i) % m efficiently

  • Add DynamicSliceInplaceOp to update an existing slice from a larger tensor

  • Add Ir::removeIsolatedGraphs method to prune unused graphs

  • Add outplace version of RemoteLoadOp (the original version is now called RemoteLoadInplaceOp)

  • Add a way to connect Poplar HostFunction callbacks to a session. These HostFunction programs can be added via custom ops

  • Add new API methods DeviceManager::tryAcquireAvailableDevice and DeviceManager::tryAcquireDeviceById that return a nullptr if no device is acquired

  • Make the MatMulPattern, MatMulLhsGradPattern and MatMulRhsGradPattern patterns mandatory (they cannot be disabled)

  • Remove use of Poplar’s ‘planMinimisationTarget’ option

  • Set Poplar engine option ‘target.deterministicWorkers’ based on session options

  • Improvements to RNG state handling

  • Update PyTorch version in requirement files

  • Add additional test graphs

  • Add support for updating the available_memory_proportion of an operator

  • Use the PopLibs slice planner across PopART operators: Gather, Scatter, and ScatterReduce and their gradients

  • The environment variable POPART_CACHE_DIR can be used to enable model caching and set the cache directory (a usage sketch follows this list)

  • Implement constant folding for ReduceProd operator

  • Use buffering depth settings for device-to-host streams

  • Implement executeOpNTimesEveryMTimes

  • Add accessor for optimiser state tensors

  • Adding outlining information to debug context of Call operations

  • Make topk return an int32
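
A brief, hedged illustration of two of the items above (executable caching via POPART_CACHE_DIR and the ensureFp32LossScaleTensor session option); the cache directory is a placeholder and the rest of the session setup is omitted:

    import os
    import popart

    # Enable executable caching; the directory path is a placeholder.
    os.environ["POPART_CACHE_DIR"] = "./popart_cache"

    opts = popart.SessionOptions()
    # Keep the loss scale tensor in FP32 even when the model produces an FP16 loss.
    opts.ensureFp32LossScaleTensor = True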

Bug Fixes

  • Fixed an issue where gradient clipping introduced cycles in the graph

  • Fix loading from a serialized executable when the Ir object passed to popx::serialization::deserializeExecutable has already called its addAdditionalModelProtoTensors method

  • Allow a ReduceGradOp to change its output tensor type after construction

  • Enable and fix dependency-free fallback for tensor layout creators

  • Add missing updaterScaleOp→settings.optimizerOp for TensorFlow-like RMSProp in PopART

  • Fix ElementWiseBinaryBaseOp::getReplicatedTensorShardingIndices() for broadcast case where one tensor is already sharded

  • Fix Regions::flatIndex and dimIndex for non-full shapes

  • Change debug names of tensors when lowering to Poplar so that PopVision displays them correctly

  • Add missing ResizeGradOp::clone() implementation

  • Change final to override where required by custom Ops

  • Add missing clone function to AddArg*Grad Ops

  • Fix bug in AliasZeroCopy::disableDeadCodeNodes where disabled nodes were still considered as live

  • Remove cast in SparseAccumulate allowing PopLibs to select a specialisation based on dtype

  • During build, force FindPython to always pick virtualenv Python, if there is one

  • Assign output of cloneNcopy to a variable

  • Add owned_attributes to Attributes

  • Fix ReduceOp::setup to not accept indices outside the specified range

  • Fix get loss scale in loss scale update op

  • ConvTranspose Op now has a valid gradient: models using transpose convolution now train correctly

  • Convolution now supports a truncated kernel which can occur when calculating a gradient of a convolution in some cases

  • CopyVarUpdate Op now succeeds in obscure cases in which the tensor inputs are not parallel writable

  • Regenerate generated files on new build

  • Fix for LeakyReLU not working in FP16

  • Robustness improvements to remote tensor sharding

  • Add missing accumulatorPrefs to reservedPrefixes()

Optimisations

  • Prevent recomputation of ops in the final forward PipelineStage along one ‘path to the loss’ when an op along another path is set to RecomputeType::Checkpoint

  • Clean up LoopOp and loop body graph input/output indexing

  • Improve inheritPlacementAttributes to extend searching Op attributes across graphs

  • Add connectInTensorLike function to simplify connecting of IpuCopyOps

  • Speed up topocons with large graphs, improving overlapped IO graph compilation time

  • Custom op example compiles faster after removing unnecessary compiler option from Makefile

  • Use LossScaleUpdateOp with sum operation

  • Use updated Poprithms scheduling API

Logging and documentation

  • Document getCollectiveLinkedGroup

  • Fix doc identifier for IncrementModOp

  • Document Shape type

  • Document Region type

  • Document RemoteLoad operation

  • Document RemoteStore operation

  • Updated documentation of dataflow, loop, mainloops and subgraphoutline

  • Improve formatting of Python documentation

  • Improve documentation for ReductionType and MeanReductionStrategy enum types

  • Add sections for documenting limitations and added current Clip-11 limitation

  • Improved error message when not providing constant min/max thresholds for Clip11 Op

  • Minor corrections to PopART C++ API documentation

  • PopART C++ API Doc: Fixing availableMemoryProportion reference documentation

PopTorch Changelog

2.5.0

New features

  • Support for torch.var

  • Support for torch.std

  • Support for torch.var_mean

  • Support for torch.std_mean

  • Support for col2im (used by torch.nn.Fold)

  • Support for torch.argsort

  • Support for torch.nn.RNN

  • Support for torch.nn.utils.weight_norm

  • Support for torch.randperm

  • Support for torch.nn.functional.cosine_similarity and torch.nn.CosineSimilarity

  • Support for torch.all, torch.any, torch.Tensor.all and torch.Tensor.any

  • Support for torch.Tensor.exponential_ and torch.distributions.Exponential

Bug fixes

  • Fix thread safety issue in LogContext

  • Fix torch.clamp with integer tensors

  • Fix in-place modification of slices

  • Fix torch.index_put_ when operating on slices

  • Fix torch.chunk when dim size is indivisible by the specified number of chunks

  • Fix cases where tensor.half() was in-place

  • Fix tracing with half buffers

  • Fix for loops with in-place ops

  • Fix torch.flip with negative indices

  • Fix masked fill when using tensor indexing syntax

  • Fix some cases where use of serializedMatMul was ignored or resulted in errors

Other improvements

  • Ignore missing values when reloading an Optimizer state

  • Support saving Optimizer states when compiling offline

  • Also save the random number generator’s state and the seed when saving a model

  • Improve error message of aten::index, aten::index_put_ when indexing with boolean tensor masks

  • Add support for repr in PoplarExecutor

  • For models annotated with BeginBlock, show the IPU blocks in repr(model)

  • Improve implementation of torch.scatter_add

2.4.0

  • Support for deepcopy functionality in poptorch.Options class

  • Added functionality to add a name scope for each operator present in the module. This function is enabled by default. It can be disabled using poptorch.Options.disableModuleNamescope.

  • Support for a greater number of convolution and transpose convolution parameters including those which result in input/kernel/output truncation, either for inference (transpose) or gradient calculation.

  • Migrated to PyTorch version 1.10.0

  • Support for gradient clipping by norm in poptorch.optim optimizers

  • Support saving and restoring internal optimiser state with PopTorch optimisers via optimizer.state_dict() and optimizer.load_state_dict() (see the sketch after this list)

  • Add removeBlocks function to remove block annotations from a Model / Layer.

  • Support for CPU ops using poptorch.CPU.

  • Support for im2col.

  • Make optimizers work with LR schedulers.

  • Switched to gold linker by default.
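
A minimal sketch of the optimiser state save/restore flow mentioned above, using a toy model; the layer and learning rate are illustrative assumptions:

    import torch
    import poptorch

    model = torch.nn.Linear(4, 2)  # toy model, illustrative only
    optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01)

    state = optimizer.state_dict()     # capture the internal optimiser state
    optimizer.load_state_dict(state)   # restore it, e.g. after reloading a checkpoint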

Poplar Changelog

2.5.0

New features

  • Added support for storing code off-chip during model execution (initial implementation only supports internal exchange code)

  • Compile time improvements

  • Drastically reduced the amount of host memory needed when compiling very large models. Most of these optimisations are enabled by default. There is a new experimental Poplar Engine option that allows compilation to be serialised - you can specify the number of tiles for which lowering is done concurrently.

  • Added support for gp files to contain different configurations for the same architecture (for example, debug and release codelets)

Bug fixes

  • Fixed some private symbols leaking from libpoplar.so

  • Fixed a deadlock that can happen when stream callbacks don’t progress

  • Fixed an issue where pipeline stages would sync and run serially when profiling

  • Fixed a crash that could happen when creating the profile file

  • Fixed an issue where DELTANELEMENTS would cause a codelet to be mistakenly identified as a recursive function

  • Fixed a liveness issue from stream copy splitting that caused a variable to be always live

  • Fixed an issue where PrintTensor programs did not work for multi-ILD targets

  • Fixed an issue where unused constants could still be allocated on the device

  • Provided error handling for missing stream callbacks rather than crashing

  • Provided error handling for invalid codelet types (e.g. 3D vectors) rather than crashing

  • Stopped the worker register dump from being logged twice on an exception

  • Fixed broken links in the user guide and API documentation

  • Changed the permissions of the archive to allow it to be read by the tools

Other improvements

  • Removed the old and deprecated profile formats

  • Better error handling when passing a null pointer into Graph::addConstant

  • Added an option to log the Poplar log to the system log

  • Attached user source location to Poplar exceptions

  • Added methods to hash the environment and engine options for a compilation

  • Always output symbols in the ELF when the user is saving the archive

  • Compressed the final executable to drastically reduce the size

  • Add an option to write NaNs into dead tensors to help debug WriteUndef issues

  • Improved the codelet codegen from the compiler

  • Added documentation for engine options that control which exceptions are enabled

  • Better error message when POPLAR_ENGINE_OPTIONS is an invalid JSON string

  • Improved documentation for which types are supported

  • Improved documentation on MultiVertex and, in particular, a race condition that is possible if it is used incorrectly

  • Improved explanation of different syncConfiguration options in the user guide

2.4.0

New features

  • Extended memory (greater than 16GB) for remote buffers in Poplar

  • Allow users to create a target for predefined Graphcore machines (e.g. IPU-M2000)

  • Compile time improvements for key models

  • Compressed the Poplar executable

  • Added “Host Function” program: a new type of host exchange for embeddings

Bug fixes

  • Host memory at the end of compilation was not the same as it was at the start

  • Fixed segmentation fault when using host-to-device ring buffer with rearrangement on host

  • Fixed bug where findUnbroadcastTensor gives incorrect result for a concatenation of a broadcast tensor and a non-broadcast tensor

  • No exception was thrown when reconfigurable partition and Poplar config mismatch with many instances

  • Fixed a bug which limited the size of the GP files

  • Fixed a bug when creating a GP file from vertices in separate source files with the same field name

  • Made load time relocations deterministic to avoid a race condition

  • Fixed a bug where contiguous PrintTensor statements were being printed in reverse order

  • Use GCDA when handling multiple HSPs so that the PVTI events are generated correctly

  • Removed a case of undefined behaviour in merge variables when there are no merge candidates

  • Fixed a bug where you would get a non-const pointer for an Input field in a codelet

  • Fixed error in code example in Poplar User Guide

  • Fixed an error in the Poplar User Guide where wrong values were used for size/alignment of float vectors

Other improvements

  • Added an optimisation to inline nested calls

  • Added support for source and destination tensors with different layouts in CrossReplicaCopy

  • Generate Graph report after compilation

  • Add support for a new LOOP program in Poplar for an endless loop on the device

  • Extend NextSyncId analysis to build a nextSyncId table for each programId

  • Added support for safely stopping an Engine that has not finished running a program

  • Provided a way to set the host sync timeout at a smaller granularity than 1 second.

  • Improvements to the new Poplar backtraces

  • Outline MultiVertex supervisor stubs

  • Added an optimisation pass to eliminate no-op WriteUndefs during lowering

  • Added mirrorFence(N) support to Poplar

  • Optimised the overhead for code copies when groups of exchanges in a sequence are all outlined

  • Included Poplar hash in the executable

  • Changed the default for deterministicWorkers to always work across replicas

  • Allow Poplar to trivially look ahead and process future sync points before the IPU reaches them

  • Documented which options can be changed at runtime via POPLAR_RUNTIME_OPTIONS

  • Lots of improvements reducing the host memory needed and the number of allocations during a compilation

  • Log all exceptions leaving Poplar

  • Added documentation for which kinds of vertex members are valid

  • Documented the restrictions on creating remote buffers to Poplar users on IPU-M2000 platforms

  • Added float16 and float32 as type aliases in Poplar

Poplar Libraries Changelog

2.5.0

New features

  • Added support for the ROIAlign layer

  • Added support for a stable sort using the new bitonic sort algorithm

  • Extended embedding layer to support groups

Bug fixes

  • Fixed a segfault that could happen for reductions

  • Fixed incorrect documentation of the return type of the random functions

  • Fixed incorrect documentation for building the third-party dependencies in the README

  • Fixed an issue in the CTC planner where it used the wrong memory estimate for the reduction

  • Added DebugContext in the fill operation

Other improvements

  • Optimised the scaled add codelets to utilise interleaved memory

  • Improved support for parallelising a transpose across workers

  • Prevent the partials type from being smaller than the output type in all layers

  • Attached user source location to PopLibs exceptions

  • Optimisations to the ERF layer

  • Added int32 support to the power elementwise operation

  • Improvements for MultiSlice when given a single offset

  • Added a default memory proportion to the embedding planner

2.4.0

New features

  • A new slice planner for faster embeddings

  • Extended popops to support embeddings where the indices are known at compile time

  • Added support for the Error Function (ERF) to PopLibs

Bug fixes

  • Fixed all compiler warnings that were in the public headers

  • Fixed a bug where only a single MultiVertex instance was generated for some elementwise operations

  • Avoided possible overread in CTC Inference codelet

Other improvements

  • Added a method to validate convolution and matmul options

  • Removed zeroing of output for input channel serial splits

  • Added structured rearrangements for fwd/gradA layers

  • Improved the documentation of the normalisation functions

  • Added an option to allow runtime bounds checking of embedding indices

  • Documented the partial type for convolutions

  • Added an optimisation to try to fuse the constituent parts of a mean function into a scaled reduce

  • Added new SLIC and VMAC vertices that generate more efficient exchange code

  • Specialised map expressions with a scalar multiply of type float and a tensor of type half to scaledAdd

  • Incorporated identity operations into element-wise expression optimisations

  • Added a partials type to ADD operation in multiUpdate

  • Optimised the memory overhead of the Reduce vertex state and improved the speed by creating fused vertices for scalar operations

  • Use the new rptsize_t type in the elementwise codelets

  • Dither reductions across tiles that are created with the reduceMany API

  • Improved the performance of the log1p vertex

GCL Changelog

2.5.0

New features

  • Extended GCL group API to include interleaved groups

  • Added a broadcast/oneToAll collective

  • Added handling for GCL_OPTIONS environment variable

  • Added support for many tensor multi phase reductions

  • Several latency improvements for GW-Links traffic

Bug fixes

  • Fixed grain size used in Collective Balanced Reorder API for multi phase AllReduce

  • Fixed SQUARE_ADD operation for multi phase AllReduce

  • Fixed uneven use of GW-Links on IPU-POD128 system

Other improvements

  • Added syncful.useOptimisedLayout GCL option

  • Multiple improvements to GCL’s memory footprint

  • Added support for n-phased cycle counts

  • Parallelised host side result validation

  • Relaxed mapping requirements for non-replicated collectives

  • Exposed concatChunks in the Collectives API

  • Added guards preventing modifications of input tensor

  • Added a GCL code example to the Poplar and PopLibs User Guide

2.4.0

New features

  • Added two-phase AllGather support (AllGather over GW-Links)

  • Exposed ReduceScatter and AllGather with many input tensors

  • Added support for non-commutative SQUARE_ADD reduction operator

  • Added handling for wide-only AllReduces

Bug fixes

  • Added a check for IPU number when running on IPU-POD16

  • Fixed multiple narrowing bugs

  • Fixed warning about serial reductions

  • Invalid CommGroup::replicaGroupSize now throws an exception

Other improvements

  • Fixed zero padding for Collective Balanced Reorder tensors mapped to only one tile

  • Added CommGroup to log messages

  • Introduced logging modules

  • Zero-padding the Collective Balanced Reorder tensor before using it for reductions

  • Added grain size to each replica in tensor created for Collective Balanced Reorder class

  • Input tensor is now checked for optimised layout

PopDist Changelog

2.5.0

New features

  • Ability to specify autoReport.directory for each instance

2.4.0

New features

None.

PopRun Changelog

2.5.0

New features

  • Ability to specify V-IPU allocation from the command line

  • Fixed incorrect resource allocation when launching applications from SLURM

  • Auto-completion functions for bash and zsh shells

  • Passing --autoreport-dir to PopRun will set the autoReport.directory for each instance (see the example after this list)

  • Skip exporting command line options that are not useful
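
An illustrative PopRun invocation using the new option; the instance and replica counts and the script name are placeholders:

    $ poprun --num-instances 2 --num-replicas 4 --autoreport-dir ./profile_reports python train.py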

2.4.0

New features

  • Added support for automatically generating executable cache path when multiple hosts are specified. Generated cache path will be removed when the process exits or fails

  • Enabled --tag-output by default. This option can now be omitted from --mpi-global-args. To turn the feature off, pass --tag-output=no.

  • Enabled --allow-run-as-root by default. This option can now be omitted from --mpi-global-args. To turn the feature off, specify --allow-run-as-root=no.

  • Passed POPLAR_ENGINE_OPTIONS to all instances by default. This feature cannot be turned off.

  • PopRun now unsets IPUOF_CONFIG_PATH before launching instances

Libpva Library Changelog

2.5.0

New features

  • Handle empty ipusToProfile when using profiler.replicaToProfile in a distributed execution

  • Allow variables to be optional for CodeCopy programs

  • Added option to inline calls when retrieving programs from debug context

  • Allow access to absolute markers that will be included in the execution profile, and expose them in the C++ API

  • Allow gaps in a sequence of program IDs

2.4.0

New features

  • Added Python str support to all the libpva objects (see the sketch after this list).

  • Added C++ operator<< methods to all libpva objects.

  • CodeCopy program has a new property to get the list of variables copied.

  • Added new API to get the Poplar Engine options for compilation and execution.

  • Added the id, name and parent properties to the DebugContext.
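
A minimal sketch of using libpva from Python, assuming a profile file named profile.pop has already been generated:

    import pva

    # Open a Poplar profile and inspect it ("profile.pop" is a placeholder path).
    report = pva.openReport("profile.pop")
    print(report.compilation.target.numTiles)
    print(report)  # libpva objects provide a readable str representation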

Bug fixes

  • None

TensorFlow Changelog

2.5.1

New features

  • Migrated codebase from TensorFlow 2.4 to TensorFlow 2.5.

  • Added efficient support for Keras Model subclasses, see the documentation for full details.

  • Added ipu.ops.within_replica_ops module which provides within replica variants of all gather, all reduce and reduce scatter operations.

  • Added optimise_latency option to IPUInfeedQueue and IPUOutfeedQueue, which when enabled can speedup small data transfers.

  • Expanded interface for ipu.ops.reduce_scatter and ipu.ops.all_gather to support multiple inputs in a single operation.

  • Improved integration with TensorBoard for TensorFlow 2 Keras models.

  • Added support for passing tf.function to ipu.application_compile_op.experimental_application_compile_op in TensorFlow 2.

  • Added ipu.control_flow_ops.barrier for forcing the scheduling of operations, see the documentation for full details.

Bug fixes

  • Optimisations for loop based models (such as RNNs) to improve compile time, memory usage and runtime performance.

  • Memory usage optimisations for dynamic slices/update operations. This optimisation is on by default, but can be disabled with IPUConfig.optimizations.enable_dynamic_slice_replacement.
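
A brief sketch of disabling that optimisation through the option path named above (shown only as an illustration; all other configuration is omitted):

    from tensorflow.python import ipu

    config = ipu.config.IPUConfig()
    # Turn off the dynamic slice/update replacement optimisation (enabled by default).
    config.optimizations.enable_dynamic_slice_replacement = False
    config.configure_ipu_system()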

2.4.0

New features

  • Added an implementation of ipu.cross_replica_ops.cross_replica_mean() to provide better numerical stability.

  • Exposed set_infeed_queue_options and set_outfeed_queue_options functions for Sequential and Functional Keras models to allow configuration of IPUInfeedQueue and IPUOutfeedQueue.

  • Performance improvements for scatter and gather operations with static indices.

  • Added an IPU optimised implementation ipu.math_ops.segment_sum to perform a sorted segment sum with a fixed number of segments.

  • Exposed available_memory_proportion for Keras RNN Layers.

  • Allowed the gradient_accumulation_count parameter of ipu.pipelining_ops.pipeline to be a runtime value instead of a constant to allow dynamic batch sizes.

  • Added support for TensorFlow 2 Keras API using popdist and poprun.

  • Optimisations for the tf.random.shuffle operation.

Bug fixes

  • Reduced the runtime overhead when iteratively calling fit(), evaluate() or predict() on a Keras model.

  • Compile time improvements.

IPU TensorFlow Addons Changelog

2.5.1

New features

  • Add options and options_bwd arguments to RNN Keras layers and RNN TensorFlow 1 layers which get passed to their corresponding PopLibs implementations.

Bug fixes

None.

2.4.0

New features

  • Initial release.

  • Implementation of the SGD, Adam and LAMB optimizers with IPU specific features to improve model performance.

Bug fixes

None.

Poplar Triton Backend Changelog

2.5.1

New features

  • Preview version of a backend for the Triton Inference Server.

Bug fixes

None.

Known issues

The following section will detail known issues in v2.5.1.
Each product will be detailed separately.

Product                 Section
Driver & Utilities      Driver & Utilities known issues
PopART                  PopART known issues
PopTorch                PopTorch known issues
Poplar                  Poplar known issues
Poplar Libraries        Poplar Libraries known issues
GCL                     GCL known issues
PopDist/PopRun          PopRun/PopDist known issues
Libpva Library          Libpva Library known issues
TensorFlow              TensorFlow known issues
IPU TensorFlow Addons   IPU TensorFlow Addons known issues
Poplar Triton Backend   Poplar Triton Backend known issues

Driver & Utilities known issues

1.1.1

None.

1.0.55

None.

PopART known issues

2.5.1

None.

2.4.0

None.

PopTorch known issues

2.5.0

None.

2.4.0

None.

Poplar known issues

2.5.0

None.

2.4.0

None.

Poplar Libraries known issues

2.5.0

None.

2.4.0

None.

GCL known issues

2.5.0

None.

2.4.0

None.

PopDist known issues

2.5.0

None.

2.4.0

None.

PopRun known issues

2.5.0

None.

2.4.0

None.

Libpva Library known issues

2.5.0

None.

2.4.0

None.

TensorFlow known issues

2.5.1

Warning

The versions of TensorFlow included in Poplar SDK 2.5.1 and earlier are not compatible with protobuf version 4 (see TensorFlow issue #56077). When you install a TensorFlow wheel from the Poplar SDK, you must ensure you have a compatible version of protobuf, downgrading if necessary.

  • For TensorFlow 2:

    $ python -m pip install "protobuf>=3.9.2,<3.20" --force-reinstall
    
  • For TensorFlow 1:

    $ python -m pip install "protobuf>=3.8.0,<3.20" --force-reinstall
    

You can do this before or after installing the Graphcore TensorFlow wheel.

  • Wrapping Keras layers in ipu.outlined_function can cause compilation errors.

  • The gradient_accumulation_reduction_method feature of Keras models can cause an increase in memory usage when the non-default option is used.

  • Using mixed_precision.Policy('mixed_float16') with pipelined Keras models results in compilation errors.

  • The experimental_normalize_gradients feature of TensorFlow 2 can produce unstable results when the number of replicas or the gradient_accumulation_steps_per_replica is large.

2.4.0

  • Using mixed_precision.Policy('mixed_float16') with pipelined Keras models results in compilation errors.

  • The experimental_normalize_gradients feature of TensorFlow 2 can produce unstable results when the number of replicas or the gradient_accumulation_steps_per_replica is large.

IPU TensorFlow Addons known issues

2.5.1

None.

2.4.0

None.

Poplar Triton Backend known issues

2.5.1

None.

Compatibility changes

The following section will detail compatibility changes in v2.5.1.

Product                 Section
Driver & Utilities      Driver & Utilities compatibility changes
PopART                  PopART compatibility changes
PopTorch                PopTorch compatibility changes
Poplar                  Poplar compatibility changes
Poplar Libraries        Poplar Libraries compatibility changes
GCL                     GCL compatibility changes
PopDist/PopRun          PopRun/PopDist compatibility changes
Libpva Library          Libpva Library compatibility changes
TensorFlow              TensorFlow compatibility changes
IPU TensorFlow Addons   IPU TensorFlow Addons compatibility changes
Poplar Triton Backend   Poplar Triton Backend compatibility changes

Driver & Utilities Compatibility changes

1.1.1

None.

1.0.55

None.

PopART Compatibility changes

2.5.1

  • [API] Following deprecation in the previous release, DeviceManager methods acquireDeviceById and acquireAvailableDevice now error if unable to attach to a device

  • [API] Change argument type for loadExecutableFromStream

2.4.0

  • [API] Deprecate behaviour whereby methods DeviceManager::acquireAvailableDevice and DeviceManager::acquireDeviceById return a nullptr if no device is acquired

  • [API] Remove debugPrefix methods

  • [API] Remove use of GCL_NUM_IO_TILES

  • [API] Remove use of deprecated method snap::program::Sequence::add(poplar::program::Program)

  • [API] Remove deprecated MeanReductionStrategy::PostAndLoss option

  • [API] Remove setting perExecutionStreamCopyCycles

PopTorch Compatibility changes

2.5.0

  • Removed poptorch.AnchorMode, poptorch.Options.anchorMode which were deprecated in favour of poptorch.OutputMode and poptorch.Options.outputMode respectively.

2.4.0

  • Deprecated poptorch.Options.anchorMode in favour of poptorch.Options.outputMode

  • Deprecated poptorch.Options.defaultAnchorMode in favour of poptorch.Options.defaultOutputMode

  • Deprecated poptorch.AnchorMode in favour of poptorch.OutputMode

Poplar Compatibility changes

2.5.0

None.

2.4.0

None.

Poplar Libraries Compatibility changes

2.5.0

  • Deprecated the non-GCL collectives

  • Removed support for multi-IPU convolutions

2.4.0

None.

GCL Compatibility changes

2.5.0

  • All methods that consume popops::CollectiveOperator are deprecated (popops::CollectiveOperator is replaced by gcl::CollectiveOperator)

2.4.0

  • The following methods have been removed from the public API:

    • popops::allReduce(), popops::allGather() and popops::reduceScatter() (replaced by gcl::allReduceWithinReplica(), gcl::allGatherWithinReplica() and gcl::reduceScatterWithinReplica())

PopDist Compatibility changes

2.5.0

None.

2.4.0

None.

PopRun Compatibility changes

2.5.0

None.

2.4.0

None.

Libpva Library Compatibility changes

2.5.0

  • None

2.4.0

  • None

TensorFlow Compatibility changes

2.5.1

  • See the API changes section in the TensorFlow documentation for full details.

2.4.0

  • IPUMultiReplicaStrategy has been renamed to PopDistStrategy.

  • See the API changes section in the TensorFlow documentation for full details.

IPU TensorFlow Addons Compatibility changes

2.5.1

  • See the IPU TensorFlow Addons API changes section in the TensorFlow documentation for full details.

2.4.0

None.

Poplar Triton Backend Compatibility changes

2.5.1

None.

Appendix

Appendix A: Additional requirements

PopVision Graph Analyser

  • To be able to view profiling reports generated by SDK v2.5.1, PopVision Graph Analyser v3.7.0 or later and PopVision System Analyser v2.7.0 or later are required.

TensorFlow

To correctly execute TensorFlow code, please ensure the following requirements are met:

Intel platforms

  • Python 3.6 is the minimum required version

  • A CPU compatible with the AVX-512 instruction set is needed.

AMD platforms

  • Python 3.6 is the minimum required version

  • A CPU compatible with the Znver1 instruction set is needed.