Legal notice
Graphcloud®, Graphcore® and Poplar® are registered trademarks of Graphcore Ltd.
Bow™, Bow-2000™, Colossus™, In-Processor-Memory™, IPU-Core™, IPU-Exchange™, IPU-Fabric™, IPU-Link™, IPU-M2000™, IPU-Machine™, IPU-POD™, IPU-Tile™, PopART™, PopDist™, PopLibs™, PopRun™, PopTorch™, PopVision™, Streaming Memory™ and Virtual-IPU™ are trademarks of Graphcore Ltd.
All other trademarks are the property of their respective owners.
Copyright © 2022 Graphcore Ltd. All rights reserved.
Scope of this document
This document contains the release notes for Poplar SDK 2.5.1 for Graphcore's IPU product family. The software deliverables covered by this document are the following:
- Driver & Utilities
Driver and associated utilities needed by the Graphcore IPU.
- PopART
The Poplar Advanced Run Time is a flexible ONNX-compatible runtime supporting both training and inference.
- PopTorch
The PopTorch library provides a set of extensions for PyTorch to enable it to run on the Graphcore IPU hardware.
- Poplar
A graph programming framework for the IPU.
- Poplar Libraries
The PopLibs library provides a range of higher level functions commonly used in machine learning applications.
- GCL
The Graphcore Communication Library enables high-performance scale-out for IPU systems.
- PopDist
Poplar Distributed Configuration Library (PopDist) is a library for configuring and coordinating distributed execution of (large-scale) machine learning applications.
- PopRun
PopRun is a command line utility to launch distributed applications on Graphcore Pod systems.
- Libpva
The PopVision analysis library (libpva) allows programmatic analysis of the IPU profiling information used by the PopVision Graph Analyser.
- TensorFlow
An implementation of the TensorFlow framework for the Graphcore IPU.
- IPU TensorFlow Addons
A collection of Graphcore IPU-specific features for the TensorFlow framework.
- Poplar Triton Backend
A backend for the Triton Inference Server that supports models saved as Poplar executables.
Release overview
Driver & Utilities
gc-monitor and gc-info can now display information for IPUs that are not included in the active partition. The same functionality is available in the gcipuinfo library. gc-monitor also shows IPUs that are in use by other hosts.
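For example, the new visibility features can be exercised from the command line (a sketch; see the command line tools documentation for the full set of options):

$ gc-monitor --all-partitions   # include IPUs outside the active partition
$ gc-info -l                    # list devices; returns an error code if none are found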
PopART
Added support for RNN operator (preview).
Improved Automatic Loss Scaling support (experimental).
PopTorch
Compatible with PyTorch 1.10.
Increased operator support, including support for torch.nn.RNN.
Improved Automatic Loss Scaling support (experimental).
Poplar
Significantly reduced the amount of host memory needed when compiling very large models. Most of these optimisations are enabled by default. There is a new experimental Poplar Engine option that allows compilation to be done for a subset of IPUs at a time.
GCL
Collective optimisations for improved scale-out performance.
Extended collective support to include broadcast/oneToAll.
TensorFlow
TensorFlow 1.15.5 and TensorFlow 2.5.2
Keras support in TensorFlow 2 now includes Keras Model subclasses.
Optimisations have been made for loop based models (such as RNNs) to improve compile time, memory usage and runtime performance.
Poplar Triton Backend
Preview version of a backend for the Triton Inference Server.
Package contents
The downloaded unified Poplar SDK will contain the following packages:
Ubuntu 18.04
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 1.15.5 | Graphcore TensorFlow 2.5.1
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
Ubuntu 20.04
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
CentOS 7.6
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 1.15.5 | Graphcore TensorFlow 2.5.1
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
Debian 10
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
Note
See Appendix A for TensorFlow additional requirements.
Product support and compatibility matrix
- SUPPORTED
- These products are actively worked on: they will receive new features, general updates and security updates. Notice of deprecation will be sent in advance for supported products.
- DEPRECATED
- These products will only receive security updates. They are expected to work with the indicated products; however, correctness is not guaranteed. It is advised not to upgrade to this software version unless strictly necessary. In the future, these products can move to a Not Supported state without further notice. The support level will reflect the deprecated status.
- NOT SUPPORTED
- These products are not expected to work with this release. No support will be provided.
Important
Deprecated products can be moved to a Not supported status without further notice.
IPU-Machine System Software compatibility matrix
IPU-Machine Model | IPU-M Software Version | Support level | Notes
---|---|---|---
IPU-M2000 | 2.5.0 | Supported | N/A
Bow-2000 | 2.5.0 | Supported | N/A
IPU PCIe Hardware Support level
Model | Revision | ICU Firmware version | Driver version | Support level | Notes
---|---|---|---|---|---
C2 300-0004 | All revisions | 1.4.14 | 1.0.57 | Deprecated | N/A
Note
Use the firmware revision that corresponds to the IPU revision.
Important
For the firmware revision, compatibility is only enforced for patch versions.
Driver Support level
OS | Support level | Supported Kernel Version | Notes
---|---|---|---
CentOS 7.4/7.5 | Supported | 3.10 | CentOS LTS kernel.
CentOS 7.6 | Supported | 3.10 | CentOS LTS kernel.
Microsoft Windows | Supported | Windows Server 2019 |
Ubuntu 18.04 | Supported | 5.4 | Ubuntu LTS kernel.
Ubuntu 20.04 | Supported | 5.4 | Ubuntu LTS kernel.
Debian 10 | Supported | 4.19 | Debian LTS kernel.
SDK 2.5.1 Support level
OS | Support level | Notes
---|---|---
Microsoft Windows | Not Supported |
CentOS 7.6 | Supported |
Ubuntu 18.04 | Supported |
Ubuntu 20.04 | Supported |
Debian 10 | Supported |
Supported tools
Ubuntu 18.04
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 7.2.0 |
libstdc++ | Supported | 6.0.24 |
libc | Supported | 2.27 |
binutils | Supported | 2.30 |
Python | Supported | 3.6 |
Boost library | Deprecated | 1.70 |
Ubuntu 20.04
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 9.3.0 |
libstdc++ | Supported | 10.3.0 |
libc | Supported | 2.31 |
binutils | Supported | 2.34 |
Python | Supported | 3.8 |
Boost library | Deprecated | 1.71 |
CentOS 7.6
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 7.3.1 |
libstdc++ | Supported | 6.0.24 |
libc | Supported | 2.17 |
binutils | Supported | 2.28 |
Python | Supported | 3.6 |
Boost library | Deprecated | 1.70 |
Debian 10
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 8.3 |
libstdc++ | Supported | 6.0.24 |
libc | Supported | 2.28 |
binutils | Supported | 2.28 |
Python | Supported | 3.7.3 |
Boost library | Deprecated | 1.70 |
List of changes
- Changelogs
The Changelogs section lists important bug fixes and relevant functionality that has been added. Minor fixes and features are not listed.
- Known issues
The Known issues section lists all important issues known to date, that is, issues that impact Poplar functionality.
- Compatibility changes
The Compatibility changes section captures any changes that must be applied to existing code for it to remain compatible with this version of the SDK.
Changelogs
Changelogs are provided for the following products: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library, TensorFlow, IPU TensorFlow Addons and Poplar Triton Backend.
Driver & Utilities Changelog
Kernel Module
1.1.1
T50828: Remove deprecated sync utilisation code.
T52353: Removed the need for GCDA_MONITOR to get power/temperature values.
T52721: Preserve mark counts in non-POSTED modes on exit.
T52775: Implemented detection of multiple tile parity errors.
T52776: If an IPU memory failure occurs, record unrecoverable error and mark device as unusable.
T55372: Added new correctable error counters that clear on IPU reset.
T55565: Improved power and temperature reporting.
T56430: Detect when processes are attached to a device from another namespace.
1.0.57
T45456: PCIe driver uses the pin_user_pages API with Linux kernels 5.8.0+.
T47498: Added Host Link Correctable Errors.
T48270: Update the IPU PCIe driver to correctly use the DMA API.
T48616: Driver scripts improvements.
T49874: Clear allocated PL-DDR memory prior to use on native PCIe.
Low level libraries and tools
2.5.0
T38729: Added gc-hostlatencytest.
T39431: Added GCDA API to allow querying the available PL-DDR on an IPU-M.
T40698: Updated gc-hosttraffictest to provide performance statistics for host memory transfers.
T41646: Added IPU-M version info to gc-monitor.
T43803: Generate libpvti documentation with sphinx_resources.
T44446: Force gRPC to not use a proxy server.
T48984: Refactor conversion of fabric exceptions to graphcore_target_access exceptions to improve maintainability.
T49018: Extended the PVTI API to allow setting of user thread names.
T49170: Add IPU power profile query option to gc-info.
T49902: Removed the PCIe ID field from gc-monitor for Fabric devices.
T49958: Updated gc-info -l to return an error code if no devices are found.
T50043: gcipuinfo: add path parameter to application event record retrieval API.
T50828: Remove deprecated sync utilisation code.
T51093: gcipuinfo: add attributes to application event record listing IPUoF hosts.
T51249: GC tools report device discovery errors when no IPUs are found.
T51264: Fix issues when attach is aborted at an early stage.
T51460: Add support for static partitions with varying sync types for the hardware testing command line tools.
T51503: Avoid a segmentation fault when using the legacy environment variable IPUOF_CONFIG_PATH with an empty value.
T51526: gc-monitor: track IPUs that are in use by other headnodes.
T51527: gc-monitor: when IPUs are in use by other headnodes, display the hostname.
T51694: Add error checking in GCDA when requesting invalid buffers.
T51744: Add option to set the duration for --host tests in gc-hosttraffictest.
T51774: Extend internal interface used by V-IPU to support the enabling and disabling of NLC links.
T51832: Fix a rare issue in hgwio_server that can temporarily cause failure to attach.
T51974: IPUoF client calls ibv_fork_init() during RDMA client initialisation.
T52102: Improve error handling on attach in IPUoF.
T52132: Ensure all buffers are detached during IPUoF device detach.
T52248: Add sync group configuration debug information to host sync timeout exceptions.
T52249: The bootloader now throws GraphcoreDeviceAccessExceptions::ipu_bootloader_missing_sync for any bootloader sync errors so that they can be caught for sync debug reporting.
T52458: Added option to gc-monitor and gc-info to view IPUs in other partitions.
T52459: gcipuinfo can return device attributes and run health checks on devices in other partitions.
T52606: Improve IPUoF client HSP debug logging messages.
T52609: Fix fallback strategies used in IPUoF client HSP polling.
T52721: Preserve mark counts in non-POSTED modes on exit.
T52775: Implemented detection of multiple tile parity errors.
T52776: If an IPU memory failure occurs, record unrecoverable error and mark device as unusable.
T53084: Updated IPUoF to allow tools to see IPU devices outside of the current partition.
T53170: Prevent the increment of marks on devices that have a GSP pin configuration that does not support HSP. This improves IPUoF performance for the bootloader and avoids confusing debug messages.
T53188: Fix docker images not working with Broadcom RNIC.
T53326: Make gcipuinfo report no IPU devices found as an error.
T53422: Fixed HSP update race between IPUoF client and server.
T53451: Make several attempts at checking if PL-DDR is cleared at startup.
T53537: Order IPU-M devices numerically and by IPU ID in PCIe in the gc-monitor display.
T53741: Fixed popc --version deadlock when PVTI is enabled.
T53755: Avoid RPC timeouts after first attach.
T53822: Added support for detection and handling of multiple parity errors.
T53884: Updated IPUoF RDMA QP retry count to improve link reliability.
T53895: Optimise IPUoF behaviour on first attach.
T53977: Fix IPUoF race condition when receiving an attach request during detach.
T54030: Remove connection disconnect when the get_device_info call fails.
T54110: Improve IPUoF mirror fence logging.
T54119: IPUoF server enables memory error checking.
T54468: Improved the IPUoF error message when there's been an issue creating the connection.
T54615: Improve recovery and debug in gc-hosttraffictest when a test times out.
T54685: gc-monitor --all-partitions now ignores partitions in an error state rather than terminating with an error.
T55364: Improve availability on IPUoF server start.
T55389: Added link to tutorials in the PVTI user guide.
T55407: Reset the IPU upon any gc-hosttraffictest failure to recover the host interface.
T55411: Reduce excessive output for gc-memorytest in verbose mode.
T55426: Fixed PVTI PopRun exception when the trace file cannot be created or if the tables already exist in the database.
T55565: Improved power and temperature reporting.
T55629: Extended gc-binary API to support the creation of tile IPU archives in incremental steps rather than all at once.
T55942: Fixed a rare double free of allocated memory in the IPUoF server when the IPUoF connection fails.
T56150: Remove unnecessary files from the release packages.
2.4.0
T29027: Add GCDA_OPTIONS environment variable to allow setting runtime options as json.
T30646: Extended gc-iputraffictest to support testing of more than 16 IPUs.
T37217: gc-monitor extended to support multi GCD partitions.
T38068: Add single IPU mode for iputraffictest.
T43718: Added Python documentation to tracing library.
T45122: Allow Poplar to reconfigure links in static partitions.
T45371: Added APIs to attach/RDMA-write to IPU tile memory and simple peer-to- peer RDMA write to tile tests to measure the P2P bandwidth and latency.
T45594: Query the IPU for the architecture during device discovery rather than using the architecture defined by the VIRM configuration.
T45785: Added API to query the last error status.
T46259: Updated PVTI to support binary meta data.
T46401: Gc-monitor support for multi-GCD partitions.
T46855: Error if both an IPUoF configuration file and the IPUOF_VIPU_API_* environment variables are used.
T47225: Improve SERDES link training to allow auto link negotiation.
T47348: Avoid printing a driver version warning in gc-monitor when no IPUs are found.
T47414: When invoked without a device ID, gc-reset will now correctly choose the largest device for partitions greater than 16 IPUs.
T47498: Added Host Link Correctable Errors.
T47619: Fix segfault in gcipuinfo when no devices are found.
T47640: Initialise the IPU code/data/stack size attributes and the IPU utilisation attributes prior to attach.
T47727: Fix failure to start if port of RDMA device is UP but no IP address configured.
T47913: Improve handling of IPUoF configuration errors.
T48317: SoC configuration code tidy up.
T48377: Added documentation for GCDA attributes.
T48434: Added gRPC health check in IPUoF client and server.
T48435: Add device health check API to gcipuinfo.
T48437: Return getDevices result by value.
T48553: A new GCDA_OPTIONS feature to simulate SoC errors.
T48907: Set gRPC deadline in all IPUoF client requests.
T48911: Fast fabric error reporting during PORT_DOWN or connection unreachable.
T48939: Increase server robustness to link down.
T48947: Catch fabric exceptions when storing the sensor value in sensor loop.
T48956: Fixing missing error propagation in some cases.
T49126: Fix bug affecting gc-monitor on non-reconfigurable partitions.
T49134: Log rather than throw when automatically detaching during object destruction.
T49205: Prevent potential long delay when read_config_register calls times out.
T49448: Reduce timeout on CM QP failure.
T49477: Added gc-podman and container support package.
T49802: Improve shutdown time when using GCDA_MONITOR.
T49853: Fixed device ID initialisation in IPUoF server constructor.
T50043: gcipuinfo: add path parameter to application event record retrieval API.
T50044: Extend the timeout for attach during clearing of memory at IPUoF server startup.
T50404: Fixed some error messages when server is killed early.
T50424: Fixed some error message when PL DDR clearing is not complete when shutting down server.
T50857: Fix data race in multithreaded link training when using partial link training config.
T51093: gcipuinfo: add attributes to application event record listing IPUoF hosts.
T51526: gc-monitor: track IPUs that are in use by other headnodes.
T5764: Add documentation for runtime options.
PopART Changelog
2.5.1
New features
Add PopXL API (experimental)
Add support for RNN operator (preview)
Improvements to automatic loss scaling (experimental)
Add improved ability to manage PRNG behaviour across replicas (experimental)
Add ability to retrieve random seed
Add an overload of Builder.setAvailableMemoryProportion which can target multi-output nodes
Ensure initial inputs of gradient graphs match any user-specified provided grads
Ensure outputs of gradient graphs match any user-specified required grads
Add ability to run exported models using the Poplar Triton Backend via PopEF integration
Add visualisations for inplace modified and aliased tensors and graph inputs and outputs to Dot visualizer
DynamicSliceOp and DynamicUpdateOp can drop the first dimension of the slice if it is 1
Support AnchorReturnType::Final in MainLoops transform
Improved replicated tensor sharding (RTS) compatibility for operations
Make gradient clipping compatible with replicated tensor sharding (RTS)
Improved linter support
Add ability to show ONNX model proto in human readable text
Various improvements to executable caching
Add ability to perform per-replica reads and writes of variable values
Improved quality of debug information
Use slice plan in SparseAccumulateOpx
Add ability to merge collective operations
Add ability to dynamically switch off the backwards pass when using implicit pipelining
Add ability to refresh engine cache on-the-fly
Bug Fixes
Fix the logic that replaces DropoutOp with IdentityOp
Improve device handling in tests
Fix for potential deadlock condition in test runner
Fix in lowering logic for trailing subgraph parts that contain only calls to child subgraph parts
IdentityLossOpx will no longer attempt to unwind (resulting in an error) when there is a reduction
Fix subgraph autodiff logic
Allow CallOp to not have outputs connected for all of its callee outputs
Fix Python binding for DeviceManager::tryAttachUntilTimeout
Correctly promote inplace aliased and modified tensors through the Loop operation
Fix unwinding through multiple consecutive slice operations
Fix unwinding issue in MaxOpx
Enable bufferingDepth to be used when SessionOptions::enablePrefetchDatastreams isn’t set
Fix dtype clone in SparseAccumulateOpx::createInputTensor
Fix bug in ReplicatedTensorShardingTracer
Fix compile error if accl2 type is not FLOAT
Fix PowArg0GradOpPattern for fp16
Optimisations
Allow non-broadcasted indices as an input to the scatterreduce operation
Add ExpandCast pattern to reverse the order of an expand followed by cast to reduce memory footprint
Add inplace versions of WhereOp
Allow IdentityInplaceOp to unwind, reducing memory use when it cannot be made inplace
Split operators_test in two
Add TensorRemapOp for point-fixes of bad tensor layouts
Explicit recomputation support for pipelining
Alias zero copy tracks variables and multi-context tensors less conservatively
Improve graph traversal through loop-carried tensors
Logging and documentation
Add compile-time option to log device access events to a file
Improved CommGroupType::None comments
Fix code listings
Update to internal documentation build system
Various small user guide and API improvements
Added documentation on how to execute an imported model
2.4.0
New features
Remove optional downcasting of ‘gs’ in the OptimizerDecompose Pattern, so the atomic scalar tensor is always in FP32
Add a new SessionOption ‘ensureFp32LossScaleTensor’. If your optimizer uses loss scaling and your model produces an FP16 loss tensor, enabling this SessionOption means that the loss scale tensor will be an FP32 tensor, and will be combined with FP16 activations as late as possible to produce the first FP16 gradients
Implement IncrementModOp which does y = (x + i) % m efficiently
Add DynamicSliceInplaceOp to update an existing slice from a larger tensor
Add Ir::removeIsolatedGraphs method to prune unused graphs
Add outplace version of RemoteLoadOp (the original version is now called RemoteLoadInplaceOp)
Add a way to connect Poplar HostFunction callbacks to a session. These HostFunction programs can be added via custom ops
Add new API methods DeviceManager::tryAcquireAvailableDevice and DeviceManager::tryAcquireDeviceById that return a nullptr if no device is acquired
Make the MatMulPattern, MatMulLhsGradPattern and MatMulRhsGradPattern patterns mandatory (they cannot be disabled)
Remove use of Poplar’s ‘planMinimisationTarget’ option
Set Poplar engine option ‘target.deterministicWorkers’ based on session options
Improvements to RNG state handling
Update PyTorch version in requirement files
Add additional test graphs
Add support for updating the available_memory_proportion of an operator
Use the PopLibs slice planner across PopART operators: Gather, Scatter, and ScatterReduce and their gradients
The environment variable POPART_CACHE_DIR can be used to enable model caching and set the cache directory (see the sketch after this list)
Implement constant folding for ReduceProd operator
Use buffering depth settings for device-to-host streams
Implement executeOpNTimesEveryMTimes
Add accessor for optimiser state tensors
Adding outlining information to debug context of Call operations
Make topk return an int32
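As a minimal sketch of the POPART_CACHE_DIR workflow mentioned above (the script name is a placeholder):

$ export POPART_CACHE_DIR=/path/to/cache
$ python train_popart_model.py   # first run compiles and populates the cache
$ python train_popart_model.py   # later runs reload the cached executable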
Bug Fixes
Fixed an issue where gradient clipping introduced cycles in the graph
Fix loading from a serialized executable when the Ir object passed to popx::serialization::deserializeExecutable has already called its addAdditionalModelProtoTensors method
Allow a ReduceGradOp to change its output tensor type after construction
Enable and fix dependency-free fallback for tensor layout creators
Add missing updaterScaleOp→settings.optimizerOp for TensorFlow-like RMSProp in PopART
Fix ElementWiseBinaryBaseOp::getReplicatedTensorShardingIndices() for broadcast case where one tensor is already sharded
Fix Regions::flatIndex and dimIndex for non-full shapes
Change debug names of tensors when lowering to Poplar so that PopVision displays them correctly
Add missing ResizeGradOp::clone() implementation
Change final to override where required by custom Ops
Add missing clone function to AddArg*Grad Ops
Fix bug in AliasZeroCopy::disableDeadCodeNodes where disabled nodes were still considered as live
Remove cast in SparseAccumulate allowing PopLibs to select a specialisation based on dtype
During build, force FindPython to always pick virtualenv Python, if there is one
Assign output of cloneNcopy to a variable
Add owned_attributes to Attributes
Fix ReduceOp::setup to not accept indices outside the specified range
Fix get loss scale in loss scale update op
ConvTranspose Op now has a valid gradient: models using transpose convolution now train correctly
Convolution now supports a truncated kernel which can occur when calculating a gradient of a convolution in some cases
CopyVarUpdate Op now succeeds in obscure cases in which the tensor inputs are not parallel writable
Regenerate generated files on new build
Fix for LeakyReLU not working in FP16
Robustness improvements to remote tensor sharding
Add missing accumulatorPrefs to reservedPrefixes()
Optimisations
Prevent recomputation of ops in the final forward PipelineStage along one ‘path to the loss’ when an op along another path is set to RecomputeType::Checkpoint
Clean up LoopOp and loop body graph input/output indexing
Improve inheritPlacementAttributes to extend searching Op attributes across graphs
Add connectInTensorLike function to simplify connecting of IpuCopyOps
Speed up topocons with large graphs, improving overlapped IO graph compilation time
Custom op example compiles faster after removing unnecessary compiler option from Makefile
Use LossScaleUpdateOp with sum operation
Use updated Poprithms scheduling API
Logging and documentation
Document getCollectiveLinkedGroup
Fix doc identifier for IncrementModOp
Document Shape type
Document Region type
Document RemoteLoad operation
Document RemoteStore operation
Updated documentation of dataflow, loop, mainloops and subgraphoutline
Improve formatting of Python documentation
Improve documentation for ReductionType and MeanReductionStrategy enum types
Add sections for documenting limitations and added current Clip-11 limitation
Improved error message when not providing constant min/max thresholds for Clip11 Op
Minor corrections to PopART C++ API documentation
PopART C++ API Doc: Fixing availableMemoryProportion reference documentation
PopTorch Changelog
2.5.0
New features
Support for torch.var
Support for torch.std
Support for torch.var_mean
Support for torch.std_mean
Support for col2im (used by torch.nn.Fold)
Support for torch.argsort
Support for torch.nn.RNN (see the sketch after this list)
Support for torch.nn.utils.weight_norm
Support for torch.randperm
Support for torch.nn.functional.cosine_similarity and torch.nn.CosineSimilarity
Support for torch.all, torch.any, torch.Tensor.all and torch.Tensor.any
Support for torch.Tensor.exponential_ and torch.distributions.Exponential
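A minimal sketch of the new torch.nn.RNN support (assumes a working PopTorch installation; the tensor sizes are arbitrary):

```python
import torch
import poptorch

# Wrap a standard PyTorch RNN for inference on the IPU.
rnn = torch.nn.RNN(input_size=16, hidden_size=32, batch_first=True)
ipu_rnn = poptorch.inferenceModel(rnn)

x = torch.randn(4, 10, 16)   # (batch, sequence length, features)
output, hidden = ipu_rnn(x)  # compiled for and executed on the IPU
```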
Bug fixes
Fix thread safety issue in LogContext
Fix torch.clamp with integer tensors
Fix in-place modification of slices
Fix torch.index_put_ when operating on slices
Fix torch.chunk when dim size is indivisible by the specified number of chunks
Fix cases where tensor.half() was in-place
Fix tracing with half buffers
Fix for loops with in-place ops
Fix torch.flip with negative indices
Fix masked fill when using tensor indexing syntax
Fix some cases where use of serializedMatMul was ignored or resulted in errors
Other improvements
Ignore missing values when reloading an Optimizer state
Support saving Optimizer states when compiling offline
Also save the random number generator's state and the seed when saving a model
Improve error message of aten::index and aten::index_put_ when indexing with boolean tensor masks
Add support for repr in PoplarExecutor
For models annotated with BeginBlock, show the IPU blocks in repr(model)
Improve implementation of torch.scatter_add
2.4.0
Support for deepcopy functionality in the poptorch.Options class
Added functionality to add a name scope for each operator present in the module. This function is enabled by default. It can be disabled using poptorch.Options.disableModuleNamescope.
Support for a greater number of convolution and transpose convolution parameters, including those which result in input/kernel/output truncation, either for inference (transpose) or gradient calculation.
Migrated to PyTorch version 1.10.0
Support for gradient clipping by norm in poptorch.optim optimizers
Support saving and restoring internal optimiser state with PopTorch optimisers via optimizer.state_dict() and optimizer.load_state_dict() (see the sketch after this list)
Add removeBlocks function to remove block annotations from a Model / Layer.
Support for CPU ops using poptorch.CPU.
Support for im2col.
Make optimizers work with LR schedulers.
Switched to the gold linker by default.
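A minimal sketch of the optimiser state round-trip described in the list above (the model and training loop are placeholders):

```python
import torch
import poptorch

model = torch.nn.Linear(8, 2)
optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01)
training_model = poptorch.trainingModel(model, optimizer=optimizer)

# ... run some training steps with training_model ...

state = optimizer.state_dict()    # capture the internal optimiser state
# ... later, for example after a restart ...
optimizer.load_state_dict(state)  # restore the saved state
```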
Poplar Changelog
2.5.0
New features
Added support for storing code off-chip during model execution (initial implementation only supports internal exchange code)
Compile time improvements
Drastically reduced the amount of host memory needed when compiling very large models. Most of these optimisations are enabled by default. There is a new experimental Poplar Engine option that allows compilation to be serialised - you can specify the number of tiles for which lowering is done concurrently.
Added support for gp files to contain different configurations for the same architecture (for example, debug and release codelets)
Bug fixes
Fixed some private symbols leaking from libpoplar.so
Fixed a deadlock that can happen when stream callbacks don’t progress
Fixed an issue where pipeline stages would sync and run serially when profiling
Fixed a crash that could happen when creating the profile file
Fixed an issue where DELTANELEMENTS would cause a codelet to be mistakenly identified as a recursive function
Fixed a liveness issue from stream copy splitting that caused a variable to be always live
Fixed an issue where PrintTensor programs did not work for multi-ILD targets
Fixed an issue where unused constants could still be allocated on the device
Provided error handling for missing stream callbacks rather than crashing
Provided error handling for invalid codelet types (eg. 3D vectors) rather than crashing
Stopped the worker register dump from being logged twice on an exception
Fixed broken links in the user guide and API documentation
Changed the permissions of the archive to allow it to be read by the tools
Other improvements
Removed the old and deprecated profile formats
Better error handling when passing a null pointer into Graph::addConstant
Attached user source location to Poplar exceptions
Added methods to hash the environment and engine options for a compilation
Always output symbols in the ELF when user is saving the archive
Compressed the final executable to drastically reduce the size
Add an option to write NaNs into dead tensors to help debug WriteUndef issues
Improved the codelet codegen from the compiler
Added documentation for engine options that control which exceptions are enabled
Better error message when POPLAR_ENGINE_OPTIONS is an invalid JSON string (see the sketch after this list)
Improved documentation for which types are supported
Improved documentation on MultiVertex and, in particular, a race condition that is possible if it is used incorrectly
Improved explanation of different syncConfiguration options in the user guide
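As a minimal sketch of passing engine options as JSON from the environment (autoReport.all is one commonly documented option; the program name is a placeholder):

$ POPLAR_ENGINE_OPTIONS='{"autoReport.all": "true"}' python run_model.py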
2.4.0
New features
Extended memory (greater than 16GB) for remote buffers in Poplar
Allow users to create a target for predefined Graphcore machines (eg. IPU-M2000)
Compile time improvements for key models
Compressed the Poplar executable
Added “Host Function” program: a new type of host exchange for embeddings
Bug fixes
Host memory at the end of compilation was not the same as it was at the start
Fixed segmentation fault when using host-to-device ring buffer with rearrangement on host
Fixed bug where findUnbroadcastTensor gives incorrect result for a concatenation of a broadcast tensor and a non-broadcast tensor
No exception was thrown when the reconfigurable partition and the Poplar configuration mismatched with many instances
Fixed a bug which limited the size of the GP files
Fixed a bug when creating a GP file from vertices in separate source files with the same field name
Made load time relocations deterministic to avoid a race condition
Fixed a bug where contiguous PrintTensor statements were being printed in reverse order
Use GCDA when handling multiple HSPs so that the PVTI events are generated correctly
Removed a case of undefined behaviour in merge variables when there are no merge candidates
Fixed a bug where you would get a non-const pointer for an Input field in a codelet
Fixed error in code example in Poplar User Guide
Fixed an error in the Poplar User Guide where wrong values were used for size/alignment of float vectors
Other improvements
Added an optimisation to inline nested calls
Added support for source and destination tensors with different layouts in CrossReplicaCopy operations
Generate Graph report after compilation
Add support for a new LOOP program in Poplar for an endless loop on the device
Extend NextSyncId analysis to build a nextSyncId table for each programId
Added support for safely stopping an Engine that has not finished running a program
Provided a way to set the host sync timeout at a smaller granularity than 1 second.
Improvements to the new Poplar backtraces
Outline MultiVertex supervisor stubs
Added an optimisation pass to eliminate no-op WriteUndefs during lowering
Added mirrorFence(N) support to Poplar
Optimised the overhead for code copies when groups of exchanges in a sequence are all outlined
Included Poplar hash in the executable
Changed the default for deterministicWorkers to always work across replicas
Allow Poplar to trivially look ahead and process future sync points before the IPU reaches them
Documented which options can be changed at runtime via POPLAR_RUNTIME_OPTIONS
Lots of improvements reducing the host memory needed and the number of allocations during a compilation
Log all exceptions leaving Poplar
Added documentation for which kinds of vertex members are valid
Documented the restrictions on creating remote buffers to Poplar users on IPU-M2000 platforms
Added float16 and float32 as type aliases in Poplar
Poplar Libraries Changelog
2.5.0
New features
Added support for the ROIAlign layer
Added support for a stable sort using the new bitonic sort algorithm
Extended embedding layer to support groups
Bug fixes
Fixed a segfault that could happen for reductions
Fixed incorrect documentation of the return type of the random functions
Fixed incorrect documentation for building the third-party dependencies in the README
Fixed an issue in the CTC planner where it used the wrong memory estimate for the reduction
Added DebugContext in the fill operation
Other improvements
Optimised the scaled add codelets to utilise interleaved memory
Improved support for parallelising a transpose across workers
Prevent the partials type from being smaller than the output type in all layers
Attached user source location to PopLibs exceptions
Optimisations to the ERF layer
Added int32 support to the power elementwise operation
Improvements for MultiSlice when given a single offset
Added a default memory proportion to the embedding planner
2.4.0
New features
A new slice planner for faster embeddings
Extended popops to support embeddings where the indices are known at compile time
Added support for the Error Function (ERF) to PopLibs
Bug fixes
Fixed all compiler warnings that were in the public headers
Fixed a bug where only a single MultiVertex instance was generated for some elementwise operations
Avoided possible overread in CTC Inference codelet
Other improvements
Added a method to validate convolution and matmul options
Removed zeroing of output for input channel serial splits
Added structured rearrangements for fwd/gradA layers
Improved the documentation of the normalisation functions
Added an option to allow runtime bounds checking of embedding indices
Documented the partial type for convolutions
Added an optimisation to try to fuse the constituent parts of a mean function into a scaled reduce
Added new SLIC and VMAC vertices that generate more efficient exchange code
Specialised map expressions with a scalar multiply of type float and a tensor of type half to scaledAdd
Incorporated identity operations into element-wise expression optimisations
Added a partials type to ADD operation in multiUpdate
Optimised the memory overhead of the Reduce vertex state and improved the speed by creating fused vertices for scalar operations
Use the new rptsize_t type in the elementwise codelets
Dither reductions across tiles that are created with the reduceMany API
Improved the performance of the log1p vertex
GCL Changelog
2.5.0
New features
Extended GCL group API to include interleaved groups
Added a broadcast/oneToAll collective
Added handling for GCL_OPTIONS environment variable
Added support for many tensor multi phase reductions
Several latency improvements for GW-Links traffic
Bug fixes
Fixed grain size used in Collective Balanced Reorder API for multi phase AllReduce
Fixed SQUARE_ADD operation for multi phase AllReduce
Fixed uneven use of GW-Links on IPU-POD128 system
Other improvements
Added syncful.useOptimisedLayout GCL option
Multiple improvements to GCL’s memory footprint
Added support for n-phased cycle counts
Parallelised host side result validation
Relaxed mapping requirements for non-replicated collectives
Exposed concatChunks in the Collectives API
Added guards preventing modifications of input tensor
Added a GCL code example to the Poplar and PopLibs User Guide
2.4.0
New features
Added two-phase AllGather support (AllGather over GW-Links)
Exposed ReduceScatter and AllGather with many input tensors
Added support for non-commutative SQUARE_ADD reduction operator
Added handling for wide-only AllReduces
Bug fixes
Added a check for IPU number when running on IPU-POD16
Fixed multiple narrowing bugs
Fixed warning about serial reductions
Invalid CommGroup::replicaGroupSize now throws an exception
Other improvements
Fixed zero padding for Collective Balanced Reorder tensors mapped to only one tile
Added CommGroup to log messages
Introduced logging modules
Zero-padding the Collective Balanced Reorder tensor before using it for reductions
Added grain size to each replica in tensor created for Collective Balanced Reorder class
Input tensor is now checked for optimised layout
PopDist Changelog
2.5.0
New features
Ability to specify autoReport.directory for each instance
2.4.0
New features
None.
PopRun Changelog
2.5.0
New features
Ability to specify the V-IPU allocation from the command line
Fixed incorrect resource allocation when launching applications from SLURM
Auto-completion functions for bash and zsh shells
Passing --autoreport-dir to PopRun will set the autoReport.directory for each instance (see the sketch after this list)
Skip exporting command line options that are not useful
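A minimal sketch of the new option (the instance and replica counts and the script name are placeholders):

$ poprun --num-instances 2 --num-replicas 4 --autoreport-dir ./profiles python train.py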
2.4.0
New features
Added support for automatically generating the executable cache path when multiple hosts are specified. The generated cache path will be removed when the process exits or fails.
Enabled --tag-output by default. This option can now be omitted from --mpi-global-args. To turn the feature off, pass --tag-output=no.
Enabled --allow-run-as-root by default. This option can now be omitted from --mpi-global-args. To turn the feature off, specify --allow-run-as-root=no.
Passed POPLAR_ENGINE_OPTIONS to all instances by default. This feature cannot be turned off.
PopRun now unsets IPUOF_CONFIG_PATH before launching instances.
Libpva Library Changelog
2.5.0
New features
Handle empty ipusToProfile when using profiler.replicaToProfile in a distributed execution
Allow variables to be optional for CodeCopy programs
Added option to inline calls when retrieving programs from debug context
Allow access to absolute markers that will be included in the execution profile, and expose them in the C++ API
Allow gaps in a sequence of program IDs
2.4.0
New features
Added the Python __str__ method to all the libpva objects.
Added C++ operator<< methods to all libpva objects.
CodeCopy program has a new property to get the list of variables copied.
Added new API to get the Poplar Engine options for compilation and execution.
Added the id, name and parent properties to the DebugContext.
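A minimal sketch of programmatic profile access with libpva (assumes a profile.pop file produced by a profiled run; see the libpva user guide for the full property set):

```python
import pva

# Open a profile and print a basic fact about the compilation.
report = pva.openReport("profile.pop")
print(report.compilation.target.numIPUs)
```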
Bug fixes
None
TensorFlow Changelog
2.5.1
New features
Migrated codebase from TensorFlow 2.4 to TensorFlow 2.5.
Added efficient support for Keras Model subclasses; see the documentation for full details and the sketch after this list.
Added ipu.ops.within_replica_ops module which provides within-replica variants of all gather, all reduce and reduce scatter operations.
Added optimise_latency option to IPUInfeedQueue and IPUOutfeedQueue, which when enabled can speed up small data transfers.
Expanded interface for ipu.ops.reduce_scatter and ipu.ops.all_gather to support multiple inputs in a single operation.
Improved integration with TensorBoard for TensorFlow 2 Keras models.
Added support for passing tf.function to ipu.application_compile_op.experimental_application_compile_op in TensorFlow 2.
Added ipu.control_flow_ops.barrier for forcing the scheduling of operations; see the documentation for full details.
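A minimal sketch of running a subclassed Keras model under the IPU strategy (assumes an available IPU; the layer sizes are arbitrary):

```python
import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU system (one IPU, automatically selected).
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(10)

    def call(self, inputs):
        return self.dense(inputs)

# Build and compile the subclassed model inside the IPU strategy scope.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
    model = MyModel()
    model.compile(optimizer="adam", loss="mse")
```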
Bug fixes
Optimisations for loop based models (such as RNNs) to improve compile time, memory usage and runtime performance.
Memory usage optimisations for dynamic slice/update operations. This optimisation is on by default, but can be disabled with IPUConfig.optimizations.enable_dynamic_slice_replacement (see the sketch below).
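A minimal sketch of disabling the dynamic slice optimisation (assuming the rest of the IPU configuration happens elsewhere):

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
# Turn off the dynamic slice/update memory optimisation.
config.optimizations.enable_dynamic_slice_replacement = False
config.configure_ipu_system()
```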
2.4.0
New features
Added an implementation of ipu.cross_replica_ops.cross_replica_mean() to provide better numerical stability.
Exposed set_infeed_queue_options and set_outfeed_queue_options functions for Sequential and Functional Keras models to allow configuration of IPUInfeedQueue and IPUOutfeedQueue.
Performance improvements for scatter and gather operations with static indices.
Added an IPU optimised implementation ipu.math_ops.segment_sum to perform a sorted segment sum with a fixed number of segments.
Exposed available_memory_proportion for Keras RNN layers.
Allowed the gradient_accumulation_count parameter of ipu.pipelining_ops.pipeline to be a runtime value instead of a constant to allow dynamic batch sizes.
Added support for the TensorFlow 2 Keras API using popdist and poprun.
Optimisations for the tf.random.shuffle operation.
Bug fixes
Reduced the runtime overhead when iteratively calling fit(), evaluate() or predict() on a Keras model.
Compile time improvements.
IPU TensorFlow Addons Changelog
2.5.1
New features
Add options and options_bwd arguments to RNN Keras layers and RNN TensorFlow 1 layers, which get passed to their corresponding PopLibs implementations.
Bug fixes
None.
2.4.0
New features
Initial release.
Implementation of the SGD, Adam and LAMB optimizers with IPU specific features to improve model performance.
Bug fixes
None.
Poplar Triton Backend Changelog
2.5.1
New features
Preview version of a backend for the Triton Inference Server.
Bug fixes
None.
Known issues
Known issues are listed for the following products: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library, TensorFlow, IPU TensorFlow Addons and Poplar Triton Backend.
Driver & Utilities known issues
1.1.1
None.
1.0.55
None.
PopART known issues
2.5.1
None.
2.4.0
None.
PopTorch known issues
2.5.0
None.
2.4.0
None.
Poplar known issues
2.5.0
None.
2.4.0
None.
Poplar Libraries known issues
2.5.0
None.
2.4.0
None.
GCL known issues
2.5.0
None.
2.4.0
None.
PopDist known issues
2.5.0
None.
2.4.0
None.
PopRun known issues
2.5.0
None.
2.4.0
None.
Libpva Library known issues
2.5.0
None.
2.4.0
None.
TensorFlow known issues
2.5.1
Warning
The versions of TensorFlow included in Poplar SDK 2.5.1 and earlier are not compatible with protobuf version 4 (see TensorFlow issue #56077). When you install a TensorFlow wheel from the Poplar SDK, you must ensure you have a compatible version of protobuf, downgrading if necessary.
For TensorFlow 2:
$ python -m pip install "protobuf>=3.9.2,<3.20" --force-reinstall
For TensorFlow 1:
$ python -m pip install "protobuf>=3.8.0,<3.20" --force-reinstall
You can do this before or after installing the Graphcore TensorFlow wheel.
Wrapping Keras layers in ipu.outlined_function can cause compilation errors.
The gradient_accumulation_reduction_method feature of Keras models can cause an increase in memory usage when the non-default option is used.
Using mixed_precision.Policy('mixed_float16') with pipelined Keras models results in compilation errors.
The experimental_normalize_gradients feature of TensorFlow 2 can produce unstable results when the number of replicas or the gradient_accumulation_steps_per_replica value is large.
2.4.0
Using mixed_precision.Policy('mixed_float16') with pipelined Keras models results in compilation errors.
The experimental_normalize_gradients feature of TensorFlow 2 can produce unstable results when the number of replicas or the gradient_accumulation_steps_per_replica value is large.
IPU TensorFlow Addons known issues
2.5.1
None.
2.4.0
None.
Poplar Triton Backend known issues
2.5.1
None.
Compatibility changes
This section details the compatibility changes in v2.5.1.
Compatibility changes are listed for the following products: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library, TensorFlow, IPU TensorFlow Addons and Poplar Triton Backend.
Driver & Utilities Compatibility changes
1.1.1
None.
1.0.55
None.
PopART Compatibility changes
2.5.1
[API] Following deprecation in the previous release, DeviceManager methods acquireDeviceById and acquireAvailableDevice now error if unable to attach to a device
[API] Change argument type for loadExecutableFromStream
2.4.0
[API] Deprecate behaviour whereby methods DeviceManager::acquireAvailableDevice and DeviceManager::acquireDeviceById return a nullptr if no device is acquired
[API] Remove debugPrefix methods
[API] Remove use of GCL_NUM_IO_TILES
[API] Remove use of deprecated method snap::program::Sequence::add(poplar::program::Program)
[API] Remove deprecated MeanReductionStrategy::PostAndLoss option
[API] Remove setting perExecutionStreamCopyCycles
PopTorch Compatibility changes
2.5.0
Removed poptorch.AnchorMode and poptorch.Options.anchorMode, which were deprecated in favour of poptorch.OutputMode and poptorch.Options.outputMode respectively (see the sketch below).
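A minimal sketch of migrating to the new names (the chosen output mode is illustrative):

```python
import poptorch

opts = poptorch.Options()
# Previously: opts.anchorMode(poptorch.AnchorMode.All)
opts.outputMode(poptorch.OutputMode.All)
```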
2.4.0
Deprecated poptorch.Options.anchorMode in favour of poptorch.Options.outputMode
Deprecated poptorch.Options.defaultAnchorMode in favour of poptorch.Options.defaultOutputMode
Deprecated poptorch.AnchorMode in favour of poptorch.OutputMode
Poplar Compatibility changes
2.5.0
None.
2.4.0
None.
Poplar Libraries Compatibility changes
2.5.0
Deprecated the non-GCL collectives
Removed support for multi-IPU convolutions
2.4.0
None.
GCL Compatibility changes
2.5.0
All methods that consume popops::CollectiveOperator are deprecated (popops::CollectiveOperator is replaced by gcl::CollectiveOperator).
2.4.0
The following methods have been removed from the public API: popops::allReduce(), popops::allGather() and popops::reduceScatter() (replaced by gcl::allReduceWithinReplica(), gcl::allGatherWithinReplica() and gcl::reduceScatterWithinReplica()).
PopDist Compatibility changes
2.5.0
None.
2.4.0
None.
PopRun Compatibility changes
2.5.0
None.
2.4.0
None.
Libpva Library Compatibility changes
2.5.0
None
2.4.0
None
TensorFlow Compatibility changes
2.5.1
See the API changes section in the TensorFlow documentation for full details.
2.4.0
IPUMultiReplicaStrategy has been renamed to PopDistStrategy.
See the API changes section in the TensorFlow documentation for full details.
IPU TensorFlow Addons Compatibility changes
2.5.1
See the IPU TensorFlow Addons API changes section in the TensorFlow documentation for full details.
2.4.0
None.
Poplar Triton Backend Compatibility changes
2.5.1
None.
Appendix
Appendix A: Additional requirements
PopVision Graph Analyser
To be able to view profiling reports generated by SDK v2.5.1, PopVision Graph Analyser v3.7.0 or later and PopVision System Analyser v2.7.0 or later are required.
TensorFlow
To correctly execute TensorFlow code, ensure the following:
Intel platforms
Python 3.6 is the minimum supported version.
A CPU compatible with the AVX-512 instruction set is needed.
AMD platforms
Python 3.6 is the minimum supported version.
A CPU compatible with the Znver1 instruction set is needed.