Legal notice
Graphcloud®, Graphcore® and Poplar® are registered trademarks of Graphcore Ltd.
Bow™, Bow-2000™, Colossus™, In-Processor-Memory™, IPU-Core™, IPU-Exchange™, IPU-Fabric™, IPU-Link™, IPU-M2000™, IPU-Machine™, IPU-POD™, IPU-Tile™, PopART™, PopDist™, PopLibs™, PopRun™, PopTorch™, PopVision™, Streaming Memory™ and Virtual-IPU™ are trademarks of Graphcore Ltd.
All other trademarks are the property of their respective owners.
Copyright © 2022 Graphcore Ltd. All rights reserved.
Scope of this document
This document contains the release notes for Poplar SDK 2.5.1 for Graphcore's IPU product family. The software deliverables covered by this document are the following:
- Driver & Utilities
Driver and associated utilities needed by the Graphcore IPU.
- PopART
The Poplar Advanced Run Time is a flexible ONNX-compatible runtime supporting both training and inference.
- PopTorch
The PopTorch library provides a set of extensions for PyTorch to enable it to run on the Graphcore IPU hardware.
- Poplar
A graph programming framework for the IPU.
- Poplar Libraries
The PopLibs library provides a range of higher level functions commonly used in machine learning applications.
- GCL
The Graphcore Communication Library enables high-performance scale-out for IPU systems.
- PopDist
Poplar Distributed Configuration Library (PopDist) is a library for configuring and coordinating distributed execution of (large-scale) machine learning applications.
- PopRun
PopRun is a command line utility to launch distributed applications on Graphcore Pod systems.
- Libpva
The PopVision analysis library (libpva) allows programmatic analysis of the IPU profiling information used by the PopVision Graph Analyser.
- TensorFlow
An implementation of the TensorFlow framework for the Graphcore IPU.
- IPU TensorFlow Addons
A collection of Graphcore IPU-specific features for the TensorFlow framework.
- Poplar Triton Backend
A backend for the Triton Inference Server that supports models saved as Poplar executables.
Release overview
Driver & Utilities
gc-monitor and gc-info can now display information for IPUs that are not included in the active partition. The same functionality is available in the gcipuinfo library. gc-monitor also shows IPUs that are in use by other hosts.
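For example, the new visibility features can be exercised from the command line (a sketch; see the command line tools documentation for the full set of options):

$ gc-monitor --all-partitions   # include IPUs outside the active partition
$ gc-info -l                    # list devices; returns an error code if none are found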
PopART
Added support for RNN operator (preview).
Improved Automatic Loss Scaling support (experimental).
PopTorch
Compatible with PyTorch 1.10.
Increased operator support, including support for torch.nn.RNN.
Improved Automatic Loss Scaling support (experimental).
Poplar
Significantly reduced the amount of host memory needed when compiling very large models. Most of these optimisations are enabled by default. There is a new experimental Poplar Engine option that allows compilation to be done for a subset of IPUs at a time.
GCL
Collective optimisations for improved scale-out performance.
Extended collective support to include broadcast/oneToAll.
TensorFlow
TensorFlow 1.15.5 and TensorFlow 2.5.2
Keras support in TensorFlow 2 now includes Keras Model subclasses.
Optimisations have been made for loop based models (such as RNNs) to improve compile time, memory usage and runtime performance.
Poplar Triton Backend
Preview version of a backend for the Triton Inference Server.
Package contents
The downloaded unified Poplar SDK will contain the following packages:
Ubuntu 18.04
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 1.15.5 | Graphcore TensorFlow 2.5.1
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
Ubuntu 20.04
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
CentOS 7.6
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 1.15.5 | Graphcore TensorFlow 2.5.1
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
Debian 10
Package | Version
---|---
Driver & Utilities | 1.1.1
PopART | 2.5.1
PopTorch | 2.5.0 (for PyTorch 1.10)
Poplar | 2.5.0
PopDist/PopRun | 2.5.0
TensorFlow 2.5.2 | Graphcore TensorFlow 2.5.1
IPU TensorFlow Addons | 2.5.1
Poplar Triton Backend | 2.5.1
Note
See Appendix A for TensorFlow additional requirements.
Product support and compatibility matrix
- SUPPORTED
- These products are actively worked on: they will receive new features, general updates and security updates. Notice of deprecation will be sent in advance for supported products.
- DEPRECATED
- These products will only receive security updates. They are expected to work with the indicated products; however, correctness is not guaranteed. It is advised not to upgrade to this software version unless strictly necessary. In the future, these products can move to a Not Supported state without further notice. The support level will reflect the deprecated status.
- NOT SUPPORTED
- These products are not expected to work with this release. No support will be provided.
Important
Deprecated products can be moved to a Not supported status without further notice.
IPU-Machine System Software compatibility matrix
IPU-Machine Model | IPU-M Software Version | Support level | Notes
---|---|---|---
IPU-M2000 | 2.5.0 | Supported | N/A
Bow-2000 | 2.5.0 | Supported | N/A
IPU PCIe Hardware Support level
Model | Revision | ICU Firmware version | Driver version | Support level | Notes
---|---|---|---|---|---
C2 300-0004 | All revisions | 1.4.14 | 1.0.57 | Deprecated | N/A
Note
Use the firmware revision that corresponds to the IPU revision.
Important
For the firmware revision, compatibility is only enforced for patch versions.
Driver Support level
OS | Support level | Supported Kernel Version | Notes
---|---|---|---
CentOS 7.4/7.5 | Supported | 3.10 | CentOS LTS kernel.
CentOS 7.6 | Supported | 3.10 | CentOS LTS kernel.
Microsoft Windows | Supported | Windows Server 2019 |
Ubuntu 18.04 | Supported | 5.4 | Ubuntu LTS kernel.
Ubuntu 20.04 | Supported | 5.4 | Ubuntu LTS kernel.
Debian 10 | Supported | 4.19 | Debian LTS kernel.
SDK 2.5.1 Support level
OS | Support level | Notes
---|---|---
Microsoft Windows | Not Supported |
CentOS 7.6 | Supported |
Ubuntu 18.04 | Supported |
Ubuntu 20.04 | Supported |
Debian 10 | Supported |
Supported tools
Ubuntu 18.04
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 7.2.0 |
libstdc++ | Supported | 6.0.24 |
libc | Supported | 2.27 |
binutils | Supported | 2.30 |
Python | Supported | 3.6 |
Boost library | Deprecated | 1.70 |
Ubuntu 20.04
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 9.3.0 |
libstdc++ | Supported | 10.3.0 |
libc | Supported | 2.31 |
binutils | Supported | 2.34 |
Python | Supported | 3.8 |
Boost library | Deprecated | 1.71 |
CentOS 7.6
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 7.3.1 |
libstdc++ | Supported | 6.0.24 |
libc | Supported | 2.17 |
binutils | Supported | 2.28 |
Python | Supported | 3.6 |
Boost library | Deprecated | 1.70 |
Debian 10
Tool | Support level | Version | Notes
---|---|---|---
GCC/G++ | Supported | 8.3 |
libstdc++ | Supported | 6.0.24 |
libc | Supported | 2.28 |
binutils | Supported | 2.28 |
Python | Supported | 3.7.3 |
Boost library | Deprecated | 1.70 |
List of changes
- Changelogs
The Changelogs section lists important bug fixes and relevant functionality that has been added. Minor fixes and features are not listed.
- Known issues
The Known issues section lists all important issues known to date, that is, issues that impact Poplar functionality.
- Compatibility changes
The Compatibility changes section captures any changes that must be applied to existing code for it to remain compatible with this version of the SDK.
Changelogs
Changelogs are provided for the following products: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library, TensorFlow, IPU TensorFlow Addons and Poplar Triton Backend.
Driver & Utilities Changelog
Kernel Module
1.1.1
T50828: Remove deprecated sync utilisation code.
T52353: Removed the need for GCDA_MONITOR to get power/temperature values.
T52721: Preserve mark counts in non-POSTED modes on exit.
T52775: Implemented detection of multiple tile parity errors.
T52776: If an IPU memory failure occurs, record unrecoverable error and mark device as unusable.
T55372: Added new correctable error counters that clear on IPU reset.
T55565: Improved power and temperature reporting.
T56430: Detect when processes are attached to a device from another namespace.
1.0.57
T45456: PCIe driver uses the pin_user_pages API with Linux kernels 5.8.0+.
T47498: Added Host Link Correctable Errors.
T48270: Update the IPU PCIe driver to correctly use the DMA API.
T48616: Driver scripts improvements.
T49874: Clear allocated PL-DDR memory prior to use on native PCIe.
Low level libraries and tools
2.5.0
T38729: Added gc-hostlatencytest.
T39431: Added GCDA API to allow querying the available PL-DDR on an IPU-M.
T40698: Updated gc-hosttraffictest to provide performance statistics for host memory transfers.
T41646: Added IPU-M version info to gc-monitor.
T43803: Generate libpvti documentation with sphinx_resources.
T44446: Force gRPC to not use a proxy server.
T48984: Refactor conversion of fabric exceptions to graphcore_target_access exceptions to improve maintainability.
T49018: Extended the PVTI API to allow setting of user thread names.
T49170: Add IPU power profile query option to gc-info.
T49902: Removed the PCIe ID field from gc-monitor for Fabric devices.
T49958: Updated gc-info -l to return an error code if no devices are found.
T50043: gcipuinfo: add path parameter to application event record retrieval API.
T50828: Remove deprecated sync utilisation code.
T51093: gcipuinfo: add attributes to application event record listing IPUoF hosts.
T51249: GC tools report device discovery errors when no IPUs are found.
T51264: Fix issues when attach is aborted at an early stage.
T51460: Add support for static partitions with varying sync types for the hardware testing command line tools.
T51503: Avoid a segmentation fault when using the legacy environment variable IPUOF_CONFIG_PATH with an empty value.
T51526: gc-monitor: track IPUs that are in use by other headnodes.
T51527: gc-monitor: when IPUs are in use by other headnodes, display the hostname.
T51694: Add error checking in GCDA when requesting invalid buffers.
T51744: Add option to set the duration for --host tests in gc-hosttraffictest.
T51774: Extend internal interface used by V-IPU to support the enabling and disabling of NLC links.
T51832: Fix a rare issue in hgwio_server that can temporarily cause failure to attach.
T51974: IPUoF client calls ibv_fork_init() during RDMA client initialisation.
T52102: Improve error handling on attach in IPUoF.
T52132: Ensure all buffers are detached during IPUoF device detach.
T52248: Add sync group configuration debug information to host sync timeout exceptions.
T52249: The bootloader now throws GraphcoreDeviceAccessExceptions::ipu_bootloader_missing_sync for any bootloader sync errors so that they can be caught for sync debug reporting.
T52458: Added option to gc-monitor and gc-info to view IPUs in other partitions.
T52459: gcipuinfo can return device attributes and run health checks on devices in other partitions.
T52606: Improve IPUoF client HSP debug logging messages.
T52609: Fix fallback strategies used in IPUoF client HSP polling.
T52721: Preserve mark counts in non-POSTED modes on exit.
T52775: Implemented detection of multiple tile parity errors.
T52776: If an IPU memory failure occurs, record unrecoverable error and mark device as unusable.
T53084: Updated IPUoF to allow tools to see IPU devices outside of the current partition.
T53170: Prevent the increment of marks on devices that have a GSP pin configuration that does not support HSP. This improves IPUoF performance for the bootloader and avoids confusing debug messages.
T53188: Fix docker images not working with Broadcom RNIC.
T53326: Make gcipuinfo report no IPU devices found as an error.
T53422: Fixed HSP update race between IPUoF client and server.
T53451: Make several attempts at checking if PL-DDR is cleared at startup.
T53537: Order IPU-M devices numerically and by IPU ID in PCIe in the gc-monitor display.
T53741: Fixed popc --version deadlock when PVTI is enabled.
T53755: Avoid RPC timeouts after first attach.
T53822: Added support for detection and handling of multiple parity errors.
T53884: Updated IPUoF RDMA QP retry count to improve link reliability.
T53895: Optimise IPUoF behaviour on first attach.
T53977: Fix IPUoF race condition when receiving an attach request during detach.
T54030: Remove connection disconnect when the get_device_info call fails.
T54110: Improve IPUoF mirror fence logging.
T54119: IPUoF server enables memory error checking.
T54468: Improved the IPUoF error message when there's been an issue creating the connection.
T54615: Improve recovery and debug in gc-hosttraffictest when a test times out.
T54685: gc-monitor --all-partitions now ignores partitions in an error state rather than terminating with an error.
T55364: Improve availability on IPUoF server start.
T55389: Added link to tutorials in the PVTI user guide.
T55407: Reset the IPU upon any gc-hosttraffictest failure to recover the host interface.
T55411: Reduce excessive output for gc-memorytest in verbose mode.
T55426: Fixed PVTI PopRun exception when the trace file cannot be created or if the tables already exist in the database.
T55565: Improved power and temperature reporting.
T55629: Extended gc-binary API to support the creation of tile IPU archives in incremental steps rather than all at once.
T55942: Fixed a rare double free of allocated memory in the IPUoF server when the IPUoF connection fails.
T56150: Remove unnecessary files from the release packages.
2.4.0
T29027: Add GCDA_OPTIONS environment variable to allow setting runtime options as json.
T30646: Extended gc-iputraffictest to support testing of more than 16 IPUs.
T37217: gc-monitor extended to support multi GCD partitions.
T38068: Add single IPU mode for iputraffictest.
T43718: Added Python documentation to tracing library.
T45122: Allow Poplar to reconfigure links in static partitions.
T45371: Added APIs to attach/RDMA-write to IPU tile memory and simple peer-to- peer RDMA write to tile tests to measure the P2P bandwidth and latency.
T45594: Query the IPU for the architecture during device discovery rather than using the architecture defined by the VIRM configuration.
T45785: Added API to query the last error status.
T46259: Updated PVTI to support binary meta data.
T46401: Gc-monitor support for multi-GCD partitions.
T46855: Error if both an IPUoF configuration file and the IPUOF_VIPU_API_* environment variables are used.
T47225: Improve SERDES link training to allow auto link negotiation.
T47348: Avoid printing a driver version warning in gc-monitor when no IPUs are found.
T47414: When invoked without a device ID, gc-reset will now correctly choose the largest device for partitions greater than 16 IPUs.
T47498: Added Host Link Correctable Errors.
T47619: Fix segfault in gcipuinfo when no devices are found.
T47640: Initialise the IPU code/data/stack size attributes and the IPU utilisation attributes prior to attach.
T47727: Fix failure to start if port of RDMA device is UP but no IP address configured.
T47913: Improve handling of IPUoF configuration errors.
T48317: SoC configuration code tidy up.
T48377: Added documentation for GCDA attributes.
T48434: Added gRPC health check in IPUoF client and server.
T48435: Add device health check API to gcipuinfo.
T48437: Return getDevices result by value.
T48553: A new GCDA_OPTIONS feature to simulate SoC errors.
T48907: Set gRPC deadline in all IPUoF client requests.
T48911: Fast fabric error reporting during PORT_DOWN or connection unreachable.
T48939: Increase server robustness to link down.
T48947: Catch fabric exceptions when storing the sensor value in sensor loop.
T48956: Fixing missing error propagation in some cases.
T49126: Fix bug affecting gc-monitor on non-reconfigurable partitions.
T49134: Log rather than throw when automatically detaching during object destruction.
T49205: Prevent potential long delay when read_config_register calls times out.
T49448: Reduce timeout on CM QP failure.
T49477: Added gc-podman and container support package.
T49802: Improve shutdown time when using GCDA_MONITOR.
T49853: Fixed device ID initialisation in IPUoF server constructor.
T50043: gcipuinfo: add path parameter to application event record retrieval API.
T50044: Extend the timeout for attach during clearing of memory at IPUoF server startup.
T50404: Fixed some error messages when server is killed early.
T50424: Fixed some error message when PL DDR clearing is not complete when shutting down server.
T50857: Fix data race in multithreaded link training when using partial link training config.
T51093: gcipuinfo: add attributes to application event record listing IPUoF hosts.
T51526: gc-monitor: track IPUs that are in use by other headnodes.
T5764: Add documentation for runtime options.
PopART Changelog
2.5.1
New features
Add PopXL API (experimental)
Add support for RNN operator (preview)
Improvements to automatic loss scaling (experimental)
Add improved ability to manage PRNG behaviour across replicas (experimental)
Add ability to retrieve random seed
Add an overload of Builder.setAvailableMemoryProportion which can target multi-output nodes
Ensure initial inputs of gradient graphs match any user-specified provided grads
Ensure outputs of gradient graphs match any user-specified required grads
Add ability to run exported models using the Poplar Triton Backend via PopEF integration
Add visualisations for inplace modified and aliased tensors and graph inputs and outputs to Dot visualizer
DynamicSliceOp and DynamicUpdateOp can drop the first dimension of the slice if it is 1
Support AnchorReturnType::Final in MainLoops transform
Improved replicated tensor sharding (RTS) compatibility for operations
Make gradient clipping compatible with replicated tensor sharding (RTS)
Improved linter support
Add ability to show ONNX model proto in human readable text
Various improvements to executable caching
Add ability to perform per-replica reads and writes of variable values
Improved quality of debug information
Use slice plan in SparseAccumulateOpx
Add ability to merge collective operations
Add ability to dynamically switch off the backwards pass when using implicit pipelining
Add ability to refresh engine cache on-the-fly
Bug Fixes
Fix the logic that replaces DropoutOp with IdentityOp
Improve device handling in tests
Fix for potential deadlock condition in test runner
Fix in lowering logic for trailing subgraph parts that contain only calls to child subgraph parts
IdentityLossOpx will no longer attempt to unwind (resulting in an error) when there is a reduction
Fix subgraph autodiff logic
Allow CallOp to not have outputs connected for all of its callee outputs
Fix Python binding for DeviceManager::tryAttachUntilTimeout
Correctly promote inplace aliased and modified tensors through the Loop operation
Fix unwinding through multiple consecutive slice operations
Fix unwinding issue in MaxOpx
Enable bufferingDepth to be used when SessionOptions::enablePrefetchDatastreams isn’t set
Fix dtype clone in SparseAccumulateOpx::createInputTensor
Fix bug in ReplicatedTensorShardingTracer
Fix compile error if accl2 type is not FLOAT
Fix PowArg0GradOpPattern for fp16
Optimisations
Allow non-broadcasted indices as an input to the scatterreduce operation
Add ExpandCast pattern to reverse the order of an expand followed by cast to reduce memory footprint
Add inplace versions of WhereOp
Allow IdentityInplaceOp to unwind, reducing memory use when it cannot be made inplace
Split operators_test in two
Add TensorRemapOp for point-fixes of bad tensor layouts
Explicit recomputation support for pipelining
Alias zero copy tracks variables and multi-context tensors less conservatively
Improve graph traversal through loop-carried tensors
Logging and documentation
Add compile-time option to log device access events to a file
Improved CommGroupType::None comments
Fix code listings
Update to internal documentation build system
Various small user guide and API improvements
Added documentation on how to execute an imported model
2.4.0
New features
Remove optional downcasting of ‘gs’ in the OptimizerDecompose Pattern, so the atomic scalar tensor is always in FP32
Add a new SessionOption ‘ensureFp32LossScaleTensor’. If your optimizer uses loss scaling and your model produces an FP16 loss tensor, enabling this SessionOption means that the loss scale tensor will be an FP32 tensor, and will be combined with FP16 activations as late as possible to produce the first FP16 gradients
Implement IncrementModOp which does y = (x + i) % m efficiently
Add DynamicSliceInplaceOp to update an existing slice from a larger tensor
Add Ir::removeIsolatedGraphs method to prune unused graphs
Add outplace version of RemoteLoadOp (the original version is now called RemoteLoadInplaceOp)
Add a way to connect Poplar HostFunction callbacks to a session. These HostFunction programs can be added via custom ops
Add new API methods DeviceManager::tryAcquireAvailableDevice and DeviceManager::tryAcquireDeviceById that return a nullptr if no device is acquired
Make the MatMulPattern, MatMulLhsGradPattern and MatMulRhsGradPattern patterns mandatory (they cannot be disabled)
Remove use of Poplar’s ‘planMinimisationTarget’ option
Set Poplar engine option ‘target.deterministicWorkers’ based on session options
Improvements to RNG state handling
Update PyTorch version in requirement files
Add additional test graphs
Add support for updating the available_memory_proportion of an operator
Use the PopLibs slice planner across PopART operators: Gather, Scatter, and ScatterReduce and their gradients
The environment variable POPART_CACHE_DIR can be used to enable model caching and set the cache directory (see the sketch after this list)
Implement constant folding for ReduceProd operator
Use buffering depth settings for device-to-host streams
Implement executeOpNTimesEveryMTimes
Add accessor for optimiser state tensors
Adding outlining information to debug context of Call operations
Make topk return an int32
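As a minimal sketch of the POPART_CACHE_DIR workflow mentioned above (the script name is a placeholder):

$ export POPART_CACHE_DIR=/path/to/cache
$ python train_popart_model.py   # first run compiles and populates the cache
$ python train_popart_model.py   # later runs reload the cached executable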
Bug Fixes
Fixed an issue where gradient clipping introduced cycles in the graph
Fix loading from a serialized executable when the Ir object passed to popx::serialization::deserializeExecutable has already called its addAdditionalModelProtoTensors method
Allow a ReduceGradOp to change its output tensor type after construction
Enable and fix dependency-free fallback for tensor layout creators
Add missing updaterScaleOp→settings.optimizerOp for TensorFlow-like RMSProp in PopART
Fix ElementWiseBinaryBaseOp::getReplicatedTensorShardingIndices() for broadcast case where one tensor is already sharded
Fix Regions::flatIndex and dimIndex for non-full shapes
Change debug names of tensors when lowering to Poplar so that PopVision displays them correctly
Add missing ResizeGradOp::clone() implementation
Change final to override where required by custom Ops
Add missing clone function to AddArg*Grad Ops
Fix bug in AliasZeroCopy::disableDeadCodeNodes where disabled nodes were still considered as live
Remove cast in SparseAccumulate allowing PopLibs to select a specialisation based on dtype
During build, force FindPython to always pick virtualenv Python, if there is one
Assign output of cloneNcopy to a variable
Add owned_attributes to Attributes
Fix ReduceOp::setup to not accept indices outside the specified range
Fix get loss scale in loss scale update op
ConvTranspose Op now has a valid gradient: models using transpose convolution now train correctly
Convolution now supports a truncated kernel which can occur when calculating a gradient of a convolution in some cases
CopyVarUpdate Op now succeeds in obscure cases in which the tensor inputs are not parallel writable
Regenerate generated files on new build
Fix for LeakyReLU not working in FP16
Robustness improvements to remote tensor sharding
Add missing accumulatorPrefs to reservedPrefixes()
Optimisations
Prevent recomputation of ops in the final forward PipelineStage along one ‘path to the loss’ when an op along another path is set to RecomputeType::Checkpoint
Clean up LoopOp and loop body graph input/output indexing
Improve inheritPlacementAttributes to extend searching Op attributes across graphs
Add connectInTensorLike function to simplify connecting of IpuCopyOps
Speed up topocons with large graphs, improving overlapped IO graph compilation time
Custom op example compiles faster after removing unnecessary compiler option from Makefile
Use LossScaleUpdateOp with sum operation
Use updated Poprithms scheduling API
Logging and documentation
Document getCollectiveLinkedGroup
Fix doc identifier for IncrementModOp
Document Shape type
Document Region type
Document RemoteLoad operation
Document RemoteStore operation
Updated documentation of dataflow, loop, mainloops and subgraphoutline
Improve formatting of Python documentation
Improve documentation for ReductionType and MeanReductionStrategy enum types
Add sections for documenting limitations and added current Clip-11 limitation
Improved error message when not providing constant min/max thresholds for Clip11 Op
Minor corrections to PopART C++ API documentation
PopART C++ API Doc: Fixing availableMemoryProportion reference documentation
PopTorch Changelog
2.5.0
New features
Support for torch.var
Support for torch.std
Support for torch.var_mean
Support for torch.std_mean
Support for col2im (used by torch.nn.Fold)
Support for torch.argsort
Support for torch.nn.RNN (see the sketch after this list)
Support for torch.nn.utils.weight_norm
Support for torch.randperm
Support for torch.nn.functional.cosine_similarity and torch.nn.CosineSimilarity
Support for torch.all, torch.any, torch.Tensor.all and torch.Tensor.any
Support for torch.Tensor.exponential_ and torch.distributions.Exponential
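A minimal sketch of the new torch.nn.RNN support (assumes a working PopTorch installation; the tensor sizes are arbitrary):

```python
import torch
import poptorch

# Wrap a standard PyTorch RNN for inference on the IPU.
rnn = torch.nn.RNN(input_size=16, hidden_size=32, batch_first=True)
ipu_rnn = poptorch.inferenceModel(rnn)

x = torch.randn(4, 10, 16)   # (batch, sequence length, features)
output, hidden = ipu_rnn(x)  # compiled for and executed on the IPU
```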
Bug fixes
Fix thread safety issue in LogContext
Fix torch.clamp with integer tensors
Fix in-place modification of slices
Fix torch.index_put_ when operating on slices
Fix torch.chunk when dim size is indivisible by the specified number of chunks
Fix cases where tensor.half() was in-place
Fix tracing with half buffers
Fix for loops with in-place ops
Fix torch.flip with negative indices
Fix masked fill when using tensor indexing syntax
Fix some cases where use of serializedMatMul was ignored or resulted in errors
Other improvements
Ignore missing values when reloading an Optimizer state
Support saving Optimizer states when compiling offline
Also save the random number generator's state and the seed when saving a model
Improve error message of aten::index and aten::index_put_ when indexing with boolean tensor masks
Add support for repr in PoplarExecutor
For models annotated with BeginBlock, show the IPU blocks in repr(model)
Improve implementation of torch.scatter_add
2.4.0
Support for deepcopy functionality in the poptorch.Options class
Added functionality to add a name scope for each operator present in the module. This function is enabled by default. It can be disabled using poptorch.Options.disableModuleNamescope.
Support for a greater number of convolution and transpose convolution parameters, including those which result in input/kernel/output truncation, either for inference (transpose) or gradient calculation.
Migrated to PyTorch version 1.10.0
Support for gradient clipping by norm in poptorch.optim optimizers
Support saving and restoring internal optimiser state with PopTorch optimisers via optimizer.state_dict() and optimizer.load_state_dict() (see the sketch after this list)
Add removeBlocks function to remove block annotations from a Model / Layer.
Support for CPU ops using poptorch.CPU.
Support for im2col.
Make optimizers work with LR schedulers.
Switched to the gold linker by default.
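A minimal sketch of the optimiser state round-trip described in the list above (the model and training loop are placeholders):

```python
import torch
import poptorch

model = torch.nn.Linear(8, 2)
optimizer = poptorch.optim.SGD(model.parameters(), lr=0.01)
training_model = poptorch.trainingModel(model, optimizer=optimizer)

# ... run some training steps with training_model ...

state = optimizer.state_dict()    # capture the internal optimiser state
# ... later, for example after a restart ...
optimizer.load_state_dict(state)  # restore the saved state
```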
Poplar Changelog
2.5.0
New features
Added support for storing code off-chip during model execution (initial implementation only supports internal exchange code)
Compile time improvements
Drastically reduced the amount of host memory needed when compiling very large models. Most of these optimisations are enabled by default. There is a new experimental Poplar Engine option that allows compilation to be serialised - you can specify the number of tiles for which lowering is done concurrently.
Added support for gp files to contain different configurations for the same architecture (for example, debug and release codelets)
Bug fixes
Fixed some private symbols leaking from libpoplar.so
Fixed a deadlock that can happen when stream callbacks don’t progress
Fixed an issue where pipeline stages would sync and run serially when profiling
Fixed a crash that could happen when creating the profile file
Fixed an issue where DELTANELEMENTS would cause a codelet to be mistakenly identified as a recursive function
Fixed a liveness issue from stream copy splitting that caused a variable to be always live
Fixed an issue where PrintTensor programs did not work for multi-ILD targets
Fixed an issue where unused constants could still be allocated on the device
Provided error handling for missing stream callbacks rather than crashing
Provided error handling for invalid codelet types (eg. 3D vectors) rather than crashing
Stopped the worker register dump from being logged twice on an exception
Fixed broken links in the user guide and API documentation
Changed the permissions of the archive to allow it to be read by the tools
Other improvements
Removed the old and deprecated profile formats
Better error handling when passing a null pointer into Graph::addConstant
Attached user source location to Poplar exceptions
Added methods to hash the environment and engine options for a compilation
Always output symbols in the ELF when user is saving the archive
Compressed the final executable to drastically reduce the size
Add an option to write NaNs into dead tensors to help debug WriteUndef issues
Improved the codelet codegen from the compiler
Added documentation for engine options that control which exceptions are enabled
Better error message when POPLAR_ENGINE_OPTIONS is an invalid JSON string (see the sketch after this list)
Improved documentation for which types are supported
Improved documentation on MultiVertex and, in particular, a race condition that is possible if it is used incorrectly
Improved explanation of different syncConfiguration options in the user guide
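As a minimal sketch of passing engine options as JSON from the environment (autoReport.all is one commonly documented option; the program name is a placeholder):

$ POPLAR_ENGINE_OPTIONS='{"autoReport.all": "true"}' python run_model.py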
2.4.0
New features
Extended memory (greater than 16GB) for remote buffers in Poplar
Allow users to create a target for predefined Graphcore machines (eg. IPU-M2000)
Compile time improvements for key models
Compressed the Poplar executable
Added “Host Function” program: a new type of host exchange for embeddings
Bug fixes
Host memory at the end of compilation was not the same as it was at the start
Fixed segmentation fault when using host-to-device ring buffer with rearrangement on host
Fixed bug where findUnbroadcastTensor gives incorrect result for a concatenation of a broadcast tensor and a non-broadcast tensor
No exception was thrown when the reconfigurable partition and the Poplar configuration mismatched with many instances
Fixed a bug which limited the size of the GP files
Fixed a bug when creating a GP file from vertices in separate source files with the same field name
Made load time relocations deterministic to avoid a race condition
Fixed a bug where contiguous PrintTensor statements were being printed in reverse order
Use GCDA when handling multiple HSPs so that the PVTI events are generated correctly
Removed a case of undefined behaviour in merge variables when there are no merge candidates
Fixed a bug where you would get a non-const pointer for an Input field in a codelet
Fixed error in code example in Poplar User Guide
Fixed an error in the Poplar User Guide where wrong values were used for size/alignment of float vectors
Other improvements
Added an optimisation to inline nested calls
Added support for source and destination tensors with different layouts in CrossReplicaCopy operations
Generate Graph report after compilation
Add support for a new LOOP program in Poplar for an endless loop on the device
Extend NextSyncId analysis to build a nextSyncId table for each programId
Added support for safely stopping an Engine that has not finished running a program
Provided a way to set the host sync timeout at a smaller granularity than 1 second.
Improvements to the new Poplar backtraces
Outline MultiVertex supervisor stubs
Added an optimisation pass to eliminate no-op WriteUndefs during lowering
Added mirrorFence(N) support to Poplar
Optimised the overhead for code copies when groups of exchanges in a sequence are all outlined
Included Poplar hash in the executable
Changed the default for deterministicWorkers to always work across replicas
Allow Poplar to trivially look ahead and process future sync points before the IPU reaches them
Documented which options can be changed at runtime via POPLAR_RUNTIME_OPTIONS
Lots of improvements reducing the host memory needed and the number of allocations during a compilation
Log all exceptions leaving Poplar
Added documentation for which kinds of vertex members are valid
Documented the restrictions on creating remote buffers to Poplar users on IPU-M2000 platforms
Added float16 and float32 as type aliases in Poplar
Poplar Libraries Changelog
2.5.0
New features
Added support for the ROIAlign layer
Added support for a stable sort using the new bitonic sort algorithm
Extended embedding layer to support groups
Bug fixes
Fixed a segfault that could happen for reductions
Fixed incorrect documentation of the return type of the random functions
Fixed incorrect documentation for building the third-party dependencies in the README
Fixed an issue in the CTC planner where it used the wrong memory estimate for the reduction
Added DebugContext in the fill operation
Other improvements
Optimised the scaled add codelets to utilise interleaved memory
Improved support for parallelising a transpose across workers
Prevent the partials type from being smaller than the output type in all layers
Attached user source location to PopLibs exceptions
Optimisations to the ERF layer
Added int32 support to the power elementwise operation
Improvements for MultiSlice when given a single offset
Added a default memory proportion to the embedding planner
2.4.0
New features
A new slice planner for faster embeddings
Extended popops to support embeddings where the indices are known at compile time
Added support for the Error Function (ERF) to PopLibs
Bug fixes
Fixed all compiler warnings that were in the public headers
Fixed a bug where only a single MultiVertex instance was generated for some elementwise operations
Avoided possible overread in CTC Inference codelet
Other improvements
Added a method to validate convolution and matmul options
Removed zeroing of output for input channel serial splits
Added structured rearrangements for fwd/gradA layers
Improved the documentation of the normalisation functions
Added an option to allow runtime bounds checking of embedding indices
Documented the partial type for convolutions
Added an optimisation to try to fuse the constituent parts of a mean function into a scaled reduce
Added new SLIC and VMAC vertices that generate more efficient exchange code
Specialised map expressions with a scalar multiply of type float and a tensor of type half to scaledAdd
Incorporated identity operations into element-wise expression optimisations
Added a partials type to ADD operation in multiUpdate
Optimised the memory overhead of the Reduce vertex state and improved the speed by creating fused vertices for scalar operations
Use the new rptsize_t type in the elementwise codelets
Dither reductions across tiles that are created with the reduceMany API
Improved the performance of the log1p vertex
GCL Changelog
2.5.0
New features
Extended GCL group API to include interleaved groups
Added a broadcast/oneToAll collective
Added handling for GCL_OPTIONS environment variable
Added support for many tensor multi phase reductions
Several latency improvements for GW-Links traffic
Bug fixes
Fixed grain size used in Collective Balanced Reorder API for multi phase AllReduce
Fixed SQUARE_ADD operation for multi phase AllReduce
Fixed uneven use of GW-Links on IPU-POD128 system
Other improvements
Added syncful.useOptimisedLayout GCL option
Multiple improvements to GCL’s memory footprint
Added support for n-phased cycle counts
Parallelised host side result validation
Relaxed mapping requirements for non-replicated collectives
Exposed concatChunks in the Collectives API
Added guards preventing modifications of input tensor
Added a GCL code example to the Poplar and PopLibs User Guide
2.4.0
New features
Added two-phase AllGather support (AllGather over GW-Links)
Exposed ReduceScatter and AllGather with many input tensors
Added support for non-commutative SQUARE_ADD reduction operator
Added handling for wide-only AllReduces
Bug fixes
Added a check for IPU number when running on IPU-POD16
Fixed multiple narrowing bugs
Fixed warning about serial reductions
Invalid CommGroup::replicaGroupSize now throws an exception
Other improvements
Fixed zero padding for Collective Balanced Reorder tensors mapped to only one tile
Added CommGroup to log messages
Introduced logging modules
Zero-padding the Collective Balanced Reorder tensor before using it for reductions
Added grain size to each replica in tensor created for Collective Balanced Reorder class
Input tensor is now checked for optimised layout
PopDist Changelog
2.5.0
New features
Ability to specify autoReport.directory for each instance
2.4.0
New features
None.
PopRun Changelog
2.5.0
New features
Ability to specify the V-IPU allocation from the command line
Fixed incorrect resource allocation when launching applications from SLURM
Auto-completion functions for bash and zsh shells
Passing --autoreport-dir to PopRun will set the autoReport.directory for each instance (see the sketch after this list)
Skip exporting command line options that are not useful
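A minimal sketch of the new option (the instance and replica counts and the script name are placeholders):

$ poprun --num-instances 2 --num-replicas 4 --autoreport-dir ./profiles python train.py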
2.4.0
New features
Added support for automatically generating the executable cache path when multiple hosts are specified. The generated cache path will be removed when the process exits or fails.
Enabled --tag-output by default. This option can now be omitted from --mpi-global-args. To turn the feature off, pass --tag-output=no.
Enabled --allow-run-as-root by default. This option can now be omitted from --mpi-global-args. To turn the feature off, specify --allow-run-as-root=no.
Passed POPLAR_ENGINE_OPTIONS to all instances by default. This feature cannot be turned off.
PopRun now unsets IPUOF_CONFIG_PATH before launching instances.
Libpva Library Changelog
2.5.0
New features
Handle empty ipusToProfile when using profiler.replicaToProfile in a distributed execution
Allow variables to be optional for CodeCopy programs
Added option to inline calls when retrieving programs from debug context
Allow access to absolute markers that will be included in the execution profile, and expose them in the C++ API
Allow gaps in a sequence of program IDs
2.4.0
New features
Added the Python __str__ method to all the libpva objects.
Added C++ operator<< methods to all libpva objects.
CodeCopy program has a new property to get the list of variables copied.
Added new API to get the Poplar Engine options for compilation and execution.
Added the id, name and parent properties to the DebugContext.
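A minimal sketch of programmatic profile access with libpva (assumes a profile.pop file produced by a profiled run; see the libpva user guide for the full property set):

```python
import pva

# Open a profile and print a basic fact about the compilation.
report = pva.openReport("profile.pop")
print(report.compilation.target.numIPUs)
```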
Bug fixes
None
TensorFlow Changelog
2.5.1
New features
Migrated codebase from TensorFlow 2.4 to TensorFlow 2.5.
Added efficient support for Keras Model subclasses; see the documentation for full details and the sketch after this list.
Added ipu.ops.within_replica_ops module which provides within-replica variants of all gather, all reduce and reduce scatter operations.
Added optimise_latency option to IPUInfeedQueue and IPUOutfeedQueue, which when enabled can speed up small data transfers.
Expanded interface for ipu.ops.reduce_scatter and ipu.ops.all_gather to support multiple inputs in a single operation.
Improved integration with TensorBoard for TensorFlow 2 Keras models.
Added support for passing tf.function to ipu.application_compile_op.experimental_application_compile_op in TensorFlow 2.
Added ipu.control_flow_ops.barrier for forcing the scheduling of operations; see the documentation for full details.
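A minimal sketch of running a subclassed Keras model under the IPU strategy (assumes an available IPU; the layer sizes are arbitrary):

```python
import tensorflow as tf
from tensorflow.python import ipu

# Configure the IPU system (one IPU, automatically selected).
config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
config.configure_ipu_system()

class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(10)

    def call(self, inputs):
        return self.dense(inputs)

# Build and compile the subclassed model inside the IPU strategy scope.
strategy = ipu.ipu_strategy.IPUStrategy()
with strategy.scope():
    model = MyModel()
    model.compile(optimizer="adam", loss="mse")
```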
Bug fixes
Optimisations for loop based models (such as RNNs) to improve compile time, memory usage and runtime performance.
Memory usage optimisations for dynamic slice/update operations. This optimisation is on by default, but can be disabled with IPUConfig.optimizations.enable_dynamic_slice_replacement (see the sketch below).
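A minimal sketch of disabling the dynamic slice optimisation (assuming the rest of the IPU configuration happens elsewhere):

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
# Turn off the dynamic slice/update memory optimisation.
config.optimizations.enable_dynamic_slice_replacement = False
config.configure_ipu_system()
```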
2.4.0
New features
Added an implementation of ipu.cross_replica_ops.cross_replica_mean() to provide better numerical stability.
Exposed set_infeed_queue_options and set_outfeed_queue_options functions for Sequential and Functional Keras models to allow configuration of IPUInfeedQueue and IPUOutfeedQueue.
Performance improvements for scatter and gather operations with static indices.
Added an IPU optimised implementation ipu.math_ops.segment_sum to perform a sorted segment sum with a fixed number of segments.
Exposed available_memory_proportion for Keras RNN layers.
Allowed the gradient_accumulation_count parameter of ipu.pipelining_ops.pipeline to be a runtime value instead of a constant to allow dynamic batch sizes.
Added support for the TensorFlow 2 Keras API using popdist and poprun.
Optimisations for the tf.random.shuffle operation.
Bug fixes
Reduced the runtime overhead when iteratively calling fit(), evaluate() or predict() on a Keras model.
Compile time improvements.
IPU TensorFlow Addons Changelog
2.5.1
New features
Add options and options_bwd arguments to RNN Keras layers and RNN TensorFlow 1 layers, which get passed to their corresponding PopLibs implementations.
Bug fixes
None.
2.4.0
New features
Initial release.
Implementation of the SGD, Adam and LAMB optimizers with IPU specific features to improve model performance.
Bug fixes
None.
Poplar Triton Backend Changelog
2.5.1
New features
Preview version of a backend for the Triton Inference Server.
Bug fixes
None.
Known issues
Known issues are listed for the following products: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library, TensorFlow, IPU TensorFlow Addons and Poplar Triton Backend.
Driver & Utilities known issues
1.1.1
None.
1.0.55
None.
PopART known issues
2.5.1
None.
2.4.0
None.
PopTorch known issues
2.5.0
None.
2.4.0
None.
Poplar known issues
2.5.0
None.
2.4.0
None.
Poplar Libraries known issues
2.5.0
None.
2.4.0
None.
GCL known issues
2.5.0
None.
2.4.0
None.
PopDist known issues
2.5.0
None.
2.4.0
None.
PopRun known issues
2.5.0
None.
2.4.0
None.
Libpva Library known issues
2.5.0
None.
2.4.0
None.
TensorFlow known issues
2.5.1
Warning
The versions of TensorFlow included in Poplar SDK 2.5.1 and earlier are not compatible with protobuf version 4 (see TensorFlow issue #56077). When you install a TensorFlow wheel from the Poplar SDK, you must ensure you have a compatible version of protobuf, downgrading if necessary.
For TensorFlow 2:
$ python -m pip install "protobuf>=3.9.2,<3.20" --force-reinstall
For TensorFlow 1:
$ python -m pip install "protobuf>=3.8.0,<3.20" --force-reinstall
You can do this before or after installing the Graphcore TensorFlow wheel.
Wrapping Keras layers in ipu.outlined_function can cause compilation errors.
The gradient_accumulation_reduction_method feature of Keras models can cause an increase in memory usage when the non-default option is used.
Using mixed_precision.Policy('mixed_float16') with pipelined Keras models results in compilation errors.
The experimental_normalize_gradients feature of TensorFlow 2 can produce unstable results when the number of replicas or the gradient_accumulation_steps_per_replica value is large.
2.4.0
Using mixed_precision.Policy('mixed_float16') with pipelined Keras models results in compilation errors.
The experimental_normalize_gradients feature of TensorFlow 2 can produce unstable results when the number of replicas or the gradient_accumulation_steps_per_replica value is large.
IPU TensorFlow Addons known issues
2.5.1
None.
2.4.0
None.
Poplar Triton Backend known issues
2.5.1
None.
Compatibility changes
This section details the compatibility changes in v2.5.1.
Compatibility changes are listed for the following products: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library, TensorFlow, IPU TensorFlow Addons and Poplar Triton Backend.
Driver & Utilities Compatibility changes
1.1.1
None.
1.0.55
None.
PopART Compatibility changes
2.5.1
[API] Following deprecation in the previous release, DeviceManager methods acquireDeviceById and acquireAvailableDevice now error if unable to attach to a device
[API] Change argument type for loadExecutableFromStream
2.4.0
[API] Deprecate behaviour whereby methods DeviceManager::acquireAvailableDevice and DeviceManager::acquireDeviceById return a nullptr if no device is acquired
[API] Remove debugPrefix methods
[API] Remove use of GCL_NUM_IO_TILES
[API] Remove use of deprecated method snap::program::Sequence::add(poplar::program::Program)
[API] Remove deprecated MeanReductionStrategy::PostAndLoss option
[API] Remove setting perExecutionStreamCopyCycles
PopTorch Compatibility changes
2.5.0
Removed poptorch.AnchorMode and poptorch.Options.anchorMode, which were deprecated in favour of poptorch.OutputMode and poptorch.Options.outputMode respectively (see the sketch below).
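A minimal sketch of migrating to the new names (the chosen output mode is illustrative):

```python
import poptorch

opts = poptorch.Options()
# Previously: opts.anchorMode(poptorch.AnchorMode.All)
opts.outputMode(poptorch.OutputMode.All)
```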
2.4.0
Deprecated poptorch.Options.anchorMode in favour of poptorch.Options.outputMode
Deprecated poptorch.Options.defaultAnchorMode in favour of poptorch.Options.defaultOutputMode
Deprecated poptorch.AnchorMode in favour of poptorch.OutputMode
Poplar Compatibility changes
2.5.0
None.
2.4.0
None.
Poplar Libraries Compatibility changes
2.5.0
Deprecated the non-GCL collectives
Removed support for multi-IPU convolutions
2.4.0
None.
GCL Compatibility changes
2.5.0
All methods that consume popops::CollectiveOperator are deprecated (popops::CollectiveOperator is replaced by gcl::CollectiveOperator).
2.4.0
The following methods have been removed from the public API: popops::allReduce(), popops::allGather() and popops::reduceScatter() (replaced by gcl::allReduceWithinReplica(), gcl::allGatherWithinReplica() and gcl::reduceScatterWithinReplica()).
PopDist Compatibility changes
2.5.0
None.
2.4.0
None.
PopRun Compatibility changes
2.5.0
None.
2.4.0
None.
Libpva Library Compatibility changes
2.5.0
None
2.4.0
None
TensorFlow Compatibility changes
2.5.1
See the API changes section in the TensorFlow documentation for full details.
2.4.0
IPUMultiReplicaStrategy has been renamed to PopDistStrategy.
See the API changes section in the TensorFlow documentation for full details.
IPU TensorFlow Addons Compatibility changes
2.5.1
See the IPU TensorFlow Addons API changes section in the TensorFlow documentation for full details.
2.4.0
None.
Poplar Triton Backend Compatibility changes
2.5.1
None.
Appendix
Appendix A: Additional requirements
PopVision Graph Analyser
To be able to view profiling reports generated by SDK v2.5.1, PopVision Graph Analyser v3.7.0 or later and PopVision System Analyser v2.7.0 or later are required.
TensorFlow
To correctly execute TensorFlow code, ensure the following:
Intel platforms
Python 3.6 is the minimum supported version.
A CPU compatible with the AVX-512 instruction set is needed.
AMD platforms
Python 3.6 is the minimum supported version.
A CPU compatible with the Znver1 instruction set is needed.