Legal notice
Graphcore® and Poplar® are registered trademarks of Graphcore Ltd.
Copyright © 2021 Graphcore Ltd. All rights reserved.
Scope of this document
This document contains the release notes for the Poplar SDK 2.3.0 for Graphcore's IPU product family. The software deliverables covered by this document are the following:
- Driver & Utilities
Driver and associated utilities needed by the Graphcore IPU.
- PopART
The Poplar Advanced Run Time is a flexible, ONNX-compatible runtime supporting both training and inference.
- PopTorch
The PopTorch library provides a set of extensions for PyTorch to enable it to run on the Graphcore IPU hardware.
- Poplar
A graph programming framework for the IPU.
- PopDist/PopRun
Poplar Distributed Configuration Library (PopDist) is a library for configuring and coordinating distributed execution of (large-scale) machine learning applications.
- TensorFlow
An implementation of the TensorFlow framework for the Graphcore IPU.
Package contents
The downloaded unified Poplar SDK will contain the following packages:
Ubuntu 18.04
Package | Version
---|---
Driver & Utilities | 1.0.55
PopART | 2.3.0+1367
PopTorch | 2.3.0+30608
Poplar | 2.3.0+1367
PopDist/PopRun | 2.3.0
TensorFlow 1 | Graphcore TensorFlow 2.3.0
TensorFlow 2 | Graphcore TensorFlow 2.3.0
Ubuntu 20.04
Important
The Ubuntu 20.04 SDK is a preview release and is not yet fully qualified on all hardware platforms.
Package | Version
---|---
Driver & Utilities | 1.0.55
PopART | 2.3.0+1367
PopTorch | 2.3.0+30608
Poplar | 2.3.0+1367
PopDist/PopRun | 2.3.0
TensorFlow 2 | Graphcore TensorFlow 2.3.0
CentOS 7.6
Package | Version
---|---
Driver & Utilities | 1.0.55
PopART | 2.3.0+1367
PopTorch | 2.3.0+30608
Poplar | 2.3.0+1367
PopDist/PopRun | 2.3.0
TensorFlow 1 | Graphcore TensorFlow 2.3.0
TensorFlow 2 | Graphcore TensorFlow 2.3.0
Note
See Appendix A for additional TensorFlow requirements.
Product support and compatibility matrix
- SUPPORTED
- These products are actively worked on: they will receive new features, general updates and security updates. Notice of deprecation will be sent in advance for supported products.
- DEPRECATED
- These products will only receive security updates. These products are expected to work with the indicated products; however, correctness is not guaranteed. It is advised not to upgrade to this software version unless strictly necessary. In the future, these products can move to a Not Supported state without further notice. The support level will reflect the deprecated status.
- NOT SUPPORTED
- These products are not expected to work with this release. No support will be provided.
Important
Deprecated products can be moved to a Not Supported status without further notice.
IPU-M2000 System Software compatibility matrix
IPUM Model | Version | Support level | Notes
---|---|---|---
IPU-M2000 300-0024 | 2.3.0 | Supported | N/A
IPU PCIe Hardware Support level
Model | Revision | ICU Firmware version | Driver version | Support level | Notes
---|---|---|---|---|---
C2 300-0004 | All revisions | 1.4.14 | 1.0.55 | Supported |
Note
Use the firmware revision that corresponds to your IPU revision.
Important
For the firmware revision, compatibility is only enforced for patch versions.
Driver Support level
OS | Support level | Supported Kernel Version | Notes
---|---|---|---
CentOS 7.4/7.5 | Supported | 3.10 | CentOS LTS kernel.
CentOS 7.6 | Supported | 3.10 | CentOS LTS kernel.
Microsoft Windows | Supported | Windows Server 2019 |
Ubuntu 18.04 | Supported | 5.4 | Ubuntu LTS kernel.
Ubuntu 20.04 | Supported | 5.4 | Ubuntu LTS kernel.
SDK 2.3.0 Support level
OS | Support level | Notes
---|---|---
Microsoft Windows | Not Supported |
CentOS 7.6 | Supported | In some specific instances we have encountered longer than expected model compilation times; investigations are ongoing to address the problem.
Ubuntu 18.04 | Supported |
Ubuntu 20.04 | Supported |
Supported tools
Ubuntu 18.04
Tool | Support level | Version
---|---|---
GCC/G++ | Supported | 7.2.0
libstdc++ | Supported | 6.0.24
libc | Supported | 2.27
binutils | Supported | 2.30
Python | Supported | 3.6
Boost library | Deprecated | 1.70
Ubuntu 20.04
Tool | Support level | Version
---|---|---
GCC/G++ | Supported | 9.3.0
libstdc++ | Supported | 10.3.0
libc | Supported | 2.31
binutils | Supported | 2.34
Python | Supported | 3.8
Boost library | Deprecated | 1.71
CentOS 7.6
Tool | Support level | Version
---|---|---
GCC/G++ | Supported | 7.3.1
libstdc++ | Supported | 6.0.24
libc | Supported | 2.17
binutils | Supported | 2.28
Python | Supported | 3.6
Boost library | Deprecated | 1.70
List of changes
- Changelogs
The Changelogs section lists important bug fixes and relevant new functionality. Minor fixes and features are not listed.
- Known issues
The Known issues section lists all important issues known to date that impact Poplar functionality.
- Compatibility changes
The Compatibility changes section captures any changes that must be applied to existing code for it to remain compatible with this version of the SDK.
Changelogs
Changelogs are provided below for: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library and TensorFlow.
Driver & Utilities Changelog
Kernel Module
1.0.55
T38724: Implement PCIe P2P support in IPU device driver.
T42607: Sensor data (power and temperature) is displayed between containers / namespaces.
T42657: Changes to support Linux kernel 5.10.
T42745: Fix error handling when IPU is removed.
T43360: Improve the error message when the PCIe driver fails to load due to a missing device file.
T45666: Provide an API to enable the Multi Read Service Table in the Gateway.
1.0.52
T38126: Improved error handling when IPU cards are disconnected.
T36583: Avoid clearing PL DDR on docker restart.
T36158: Add a mechanism for measuring IPU utilisation and mark count monitoring.
T34827: Detect attachBuffer() failures.
T30346: Display AER error counts in gc-hosttraffictest on IPU-M2000.
Low level libraries and tools
2.3.0+1367
T30646: Extended `gc-iputraffictest` to support testing of more than 16 IPUs.
T38726: Implemented IPUoF/RDMA to IPU P2P APIs.
T39915: Fixed GCDA link configuration memory leak.
T40012: Provide more information in device attach error messages.
T40561: gc-hosttraffictest: disable data checking in read-only mode.
T41038: Add metadata support to PVTI’s tracepoints.
T41365: PVTI documentation improvements.
T42308: gc-flops prints help by default. Documentation added.
T42478: Update physical slot attribute to contain sensible values for cases where slot information cannot be read from system.
T42557: Improved classification of different logging levels in GCDA and IPUoF.
T42582: gc-monitor documentation updated.
T42607: Sensor data (power and temperature) is displayed between containers / namespaces.
T42745: Fix error handling when IPU is removed.
T43014: Fix lockup when running with GCDA_MONITOR over IPUoF.
T43383: Improved `gc-links` error message when attempting to train links on IPU-POD systems.
T43404: gc-iputraffictest: reduce time required to initialise tiles.
T43575: Add ForceParityReset option to reset API.
T43694: Throw exception if attempting to start an application when the IPU bootloader has not loaded anything.
T43910: Remove the assumption that all images in graphcore_binary are tile images.
T43943: Add `IGNORE_ICU_ATTACH_ERROR` to suppress exception upon ICU attach failure.
T44256: ipuof server: disable HSP interrupt handlers.
T44349: gwlinkstraffictest: fix GiB and GiB/s output.
T44539: Fix IPU-POD Kubernetes IPU connection failure due to invalid GID.
T44573: Log IPU sync groups to aid debug.
T44725: Preserve the exception type when catching and rethrowing exceptions.
T44959: Add ‘Hardware’ target type used as argument to GCDA's `getDevices` API, which will return either PCIe or IPUoF devices.
T44970: Added `GCDA_LOG_MASK` environment variable to filter GCDA debug log messages.
T45026: Add missing gcipuinfo golang support files.
T45090: Add gcipuinfo C wrapper library.
T45150: Use null metadata id for trace events without metadata.
T45327: Added IPUoF mirror fence API.
T45493: Added support for RDMA NICs without RoCEv1 support.
T45592: IPU cycle counter is not enabled during reset if ICU firmware version is 2.0.0 or later.
T45666: Provide an API to enable the Multi Read Service Table in the Gateway.
T45740: PVTI database schema updated.
T45785: Added API to query the last error status.
T45879: Update to report an error if `$IPUOF_VIPU_API_PARTITION_ID` is defined but the partition ID is invalid.
T46139: Bind added metadata with bindBlob instead of bindText in PVTI.
T46212: `GCDA_LOGGING` and `GCDA_LOG_LEVEL` environment variables no longer enable IPUoF logging.
T46213: gc-info --tile-overview will now check for exceptions in workers.
T46259: Updated PVTI to support binary metadata (see the sketch after this list).
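For reference, the PVTI tracepoints mentioned above can be emitted from Python roughly as follows (a minimal sketch, assuming the `libpvti` Python module shipped with the SDK; the channel and region names are illustrative):

```python
import libpvti as pvti

# Create a named trace channel; events on it show up in the
# PopVision System Analyser timeline.
channel = pvti.createTraceChannel("Application")

def train_step():
    pass  # placeholder for real work

# Wrap a region of interest in begin/end tracepoints.
pvti.Tracepoint.begin(channel, "train_step")
train_step()
pvti.Tracepoint.end(channel, "train_step")
```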
2.2.0
T25931: Support reading from the middle of a stream by limiting the bytes read when reading binaries.
T30865: Logging now uses ISO8601 UTC timestamps, and `%p` in GCDA_LOG_DEST will be replaced with the process ID.
T32069: Improved consistency of `PCIe Id`/`PCI id` terminology within `gc-info`.
T36207: `gc-hostsynclatencytest` fixes for native IPU-M2000.
T38422: Fixed boost exception on application shutdown when generating PVTI and using `GCDA_MONITOR`.
T39043: Fix missing power and temperature fields when requesting all fields in `gc-monitor`.
T39458: Bind IPUoF-servers cq_handler to dedicated CPU and use CQ polling mode for all IPUoF-servers.
T39680: Improved IPUoF latency and avoid spikes in IPUoF HSP update.
T39698: Improved performance when checking for SoC errors via GCDA.
T39887: Enhanced `gc-hostsynclatencytest` to output additional statistics.
T39891: Fixed GCDA device discovery from multiple threads.
T39956: IPUoF client detaches from device after RDMA fabric error if IPUoF server is reachable.
T40048: Enhanced `gc-binary` error information on failure.
T40067: GCDA environment variables are now ignored if present but set to zero or empty.
T40430: Fixed ICU comms lockup during multi-threaded use of `GCDA_LOGGING`.
T40567: Reduced the thresholds for IPU clock throttling log messages.
T40717: Improve handling of IPUoF exceptions within GCDA.
T41042: Removed the call to `gc-inventory` from the GCIPUINFO library.
T41365: PVTI documentation improvements.
T41418: Fix a race condition between RDMA disconnect and HSP update.
T41427: Fix incorrect statistics in gc-iputraffictest when using a large number of iterations.
T41556: The LIBPVTI and LIBPVA documentation has been split out from the Poplar documents to separate documents for each library.
T41641: gc-powertest now supports finer power level control with the `-p` option.
T41779: Added gc-flops, a tool to measure floating-point chip performance (Mk2).
T41868: Fixed invalid register field check when checking for errors via `$CMGMTEVVR`.
T41882: Reduce IPUoF initial latency in HSP update.
T41936: Support JSON output for gc-info device status commands.
T42053: Show state key in gc-info --tile-overview.
T42097: Replace gc-inventory-based gcipuinfo library with interface to GCDA.
T42099: Added Python and Go support to the pure API variant of `gcipuinfo`.
T42190: Avoid printing duplicate board temperature and power in process table for gc-monitor.
T42369: Extended `gc-flops` timeout.
T42445: Added an interface to GCDA to expose the IPU chip ID.
T42502: Ensure PVTI generation is completely disabled when `PVTI_OPTIONS={"enable":"false"}`.
T42557: Improved classification of different logging levels in GCDA and IPUoF.
T42560: Added JSON support to `gc-flops`.
T42632: Improved IPUoF logging format.
T42708: When an IPU-Link configuration is supplied during attach, the IPU will be reset.
T42822: Link training failures report “Link Training Error” rather than “setupChassis failed”.
T43014: Fix lockup when running with GCDA_MONITOR over IPUoF.
PopART Changelog
2.3.0+1367
New features
Add requirement files
Support AnchorReturnType::Sum in MainLoops transform
Modify addLoop(Input|Output) to automatically adjust the “modifies” & “aliases” maps when a new input/output shifts indices
Add a runtime_error class to PopART & change errors that lead to Poplar engine calls to this new error type
Add constructors for errors with IDs
Add swish activation function op
Add ability to use copyvarupdate ops from the builder
Allow truncation support for conv ops to permit certain convtranspose ops
Support convtranspose when the calculated padding is negative
Add enableConvDithering convolution option that might help tile balance
Add support for setting available_memory_proportion for Gather, Scatter, and ScatterReduce
Changed `Builder::setAvailableMemoryProportion` to allow setting the available_memory_proportion on any operator (see the sketch after this list).
Changed Scatter to use popops::multiSlice which improves tile utilisation
Add Python bindings for Builder::virtualGraph(const std::set<TensorId> &, int64_t) and Builder::getVirtualGraph(const std::set<TensorId> &)
Return Mk2 device by default when creating offline device.
MeanReductionType: Specify how gradient should be mean reduced across gradient_accumulation and replication
Support RTS with CommGroups for RTS-128
Add support for overlapped host to device IO
Add shape dimension check for when no InputShapeTensor is provided
Add lossScaleUpdateFactor to onnx model definition
Add RecomputeAll mode
Add support for RunningMean in TiedGatherPattern
Add compatibility with zsh shell
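For illustration, the per-operator available memory proportion can be set from the Python builder roughly as follows (a minimal sketch; the shapes and the 0.3 value are arbitrary examples, not from the release notes):

```python
import numpy as np
import popart

builder = popart.Builder()
x = builder.addInputTensor(popart.TensorInfo("FLOAT", [8, 16]))
w = builder.addInitializedInputTensor(np.ones([16, 32], dtype=np.float32))
y = builder.aiOnnx.matmul([x, w])

# As of this release, the option can be set on any operator
# (including Gather, Scatter and ScatterReduce), not just matmuls/convs.
builder.setAvailableMemoryProportion(y, 0.3)
```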
Bug Fixes
Replace unnecessary custom functions with library calls
Make the error in Tensor::setTensorData() an internal_error & improve the error message
Remove unused forward declarations in popart.cpp
Fix saving/restoration of conv parameters which could affect ops in some instances
Fix dynamicslice gradient calculation in situations where there is just one gradient input (no sum)
Fix missing gradient in SoftSign operator
Fix for DynamicSlice when axis_size % slice_size != 0
Fix for DynamicUpdate when axis_size % slice_size != 0
Fix log module under which Ir preparation compile time breakdown is logged (it is now logged under the Ir module)
Fix Softplus gradient calculation
Fix various typos
Allow outplace round op.
LSTMOp, only create new pass through tensor if not already present.
Fix segfault issue in ReverseOpx.
Fix potential bug in LSTMGradOp::gradInputInfo.
Guarantee that an onnx node’s output tensor id exists in the model’s value info field by the end of shape inference
Correct some shape methods to use the correct type alias
Make Ir in Session a shared_ptr for easier use in Python
Do not allow matrix multiplication serialization factor to be <= 0 when doing serialization.
Improve speed of Pipeline::setFinalFwdStageRecomputation
Fix maxWeightNorm=0 behaviour for Adam based optimizers
Fix replication_factor >1 issue when explicit host copy ops are enabled
Fix random seed compatibility with useHostCopyOps=True
Fix use of internal aliases in pipelined IpuCopyOpx
Fix replicaGroupSize for the ReplicatedAllGatherTest_CommGroup_All test
Fix to ClipWeightGradientsByNorm with sharding
Add Clip_11 to the verification list for ElementWiseUnaryOutplaceOpx
Optimisations
Avoid the need for an extra ‘Scale’ operation in the graph with non-constant loss scaling in the optimizer
Logging and documentation
Update tests/popart/README.md info on setting test devices to have correct paths
Correct tensor location log for optimizerStateTensorLocationSettings.location
Describe shared_ptr usage for session.ir
Document overlapped IO and RTS
Add documentation for PyStepIO, PyStepIOCallback and StepIOCallback
Add documentation for popart.ir to Python API docs
Reduce the amount of logging at level ‘info’
Align PopART's log format with Poplar, GCDA, vipu, etc.
Add mode options to `setSerializeMatMul()` docstring
2.2.1
New features
None.
Bug Fixes
Revert compilation optimisations for `SumOp`.
Optimisations
None.
2.2.0
New features
Added explicit pipelining IR support (experimental).
Added overlapping device side IO support (experimental).
Script added to summarise op constructor info for use in upcoming API additions.
Add low-level Python bindings for upcoming API additions.
Improve type-hints for the PopART Python module.
Add demos for creating training and inference model directly in the PopART IR.
Add a transform which will enable the tracking of user-specified tensor gradients when training with automatic loss scaling.
Add static factory function to `ClipNormSettings` to allow clipping all weights in a model.
Add a way to globally set the matmul options. Use `SessionOptions::matmulOptions = std::map<std::string, std::string>` and refer to the `matmul()` section in the Poplar documentation for available options (see the sketch after this list).
Add `PackedDataBlockOp` for working with packed sequences of data.
`ResizeOp`: support for sizes input.
`ResizeOp`: support for modes `linear` and `cubic`.
`ResizeOp`: support for coordinate_transformation_mode attribute.
`ResizeOp`: extend support for nearest_mode attribute.
Support gradient clipping for Lamb optimizer.
Remove ONNX and Protobuf from public headers and their targets from CMake export.
Support using Lamb optimiser on weights that have been serialised, using LambSerialisedWeight pattern.
Add Python bindings for poplar_recoverable_runtime_error, poplar_unrecoverable_runtime_error, and poplar_application_runtime_error.
Implement (without an ONNX builder binding) `TiedGatherOp` and `TiedGatherGradOp`; see `TiedGatherPattern` for details.
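The global matmul options described above can be set like this (a sketch assuming the Python binding mirrors the C++ `SessionOptions::matmulOptions` member; the key/value shown is one example of a Poplar `matmul()` option):

```python
import popart

opts = popart.SessionOptions()
# Applied to every matmul in the session. Valid keys are the Poplar
# matmul() options; this particular pair is just an illustrative choice.
opts.matmulOptions = {"availableMemoryProportion": "0.3"}
```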
Bug Fixes
Check input intersection in `ConcatOp::bwdRegMap`.
Add checks to ensure RTS propagation through `RemoteLoad`/`RemoteStore` is safe.
Support `MultiExchange` partial lowering to avoid circular task dependencies.
Avoid inplacing of ops which might result in race conditions.
Use the correct data type in pooling ops when using fp16. When doing so the partials were left in float32 which is now handled by Poplar.
Fix auto-diff transform bug.
Avoid annotating priorities in graph scheduler when non-optimal schedule is requested.
Clear `PathFromLoss` on cloned op when op sharding.
Modify current pybind11 bindings to allow for upcoming API additions.
Change `CastOp::getGradOps()` to only add gradients to the backward graph if input type is floating point.
Allow StepIO runtime assertions when an IR has been built without an ONNX model.
Fix NaN-loss when training with automatic loss scaling.
Organise source and test directories for upcoming API additions.
Fix type mismatch with float16 optimizer state.
Fix false positive matches in executable cache with different optimizer hyper-parameters.
Fix issues found when using gradient clipping with serialized matmuls.
In CMake, fix bug where there was a missing target dependency.
Add `schedulePriority` to Op attributes when serialising Ir with JSON.
Optimisations
Use embedding planner in `ScatterReduceGrad`.
Improve performance of aliasing checks.
Improve recompute pruning for final forward pipeline stage.
Remove gradCast from `SGD2Decompose`.
`SessionOptions::delayVarUpdates` only takes effect if options `explicitRecomputation` and `explicitMainLoops` are both off, as otherwise the optimisation is not needed.
Add `SessionOptions::scheduleNonWeightUpdateGradientConsumersEarly` which, if `VarUpdates` are being delayed, ensures that Ops which consume gradients but are not for updating weights (like gradient accumulation `AccumulateOp`, automatic loss scaling `HistogramOp`) are still scheduled as early as possible.
`TiedGather` and `TiedGatherAccumulate` patterns, which apply various optimisations to scenarios where a weight is consumed by both a `Gather` and a `MatMul`, but transposed on one side; the Gather and MatMul are on different pipeline stages; and an optimiser with extra state tensors is being used (like SGD with momentum). These patterns:
- Disable the PopLibs fully_connected_pass on MatMul, as the resulting tile layout leads to less exchange in this scenario
- Elide the grad sum accumulator tensor for the weight by accumulating directly into the optimiser state tensor. The optimiser state tensor is now consumed by two ops on different pipeline stages, so a stash and restore is introduced; but since both ops are on the same virtual graph (as they consume the same weight), we can elide this too
- Replace the `GatherGrad` → `Accumulate` with a single `SparseAccumulate`, eliding the extra dense tensor in between
- Ensure the tile layouts of the weight and the gradient accumulator tensors are such that exchange is minimised during the weight update
Logging and documentation
Create a reserved prefix for `AutomaticLossScaleProxy`.
Add a script for measuring test coverage locally.
Updated readme to use correct version of pybind11.
Amend license information for suffixtree.
Fix compilation progress log messages.
Enhanced, comprehensive documentation of `SGD` optimizer and its implementation.
PopTorch Changelog
2.3.0+30608
Support for `torch.bitwise_and`, `torch.bitwise_or`, `torch.bitwise_xor`
Support for `torch.logical_and`, `torch.logical_or`
Support K-dimensional NLLLoss, K-dimensional CrossEntropyLoss
Support for non-default affine parameter flags in normalisation ops
Support for `torch.Tensor.T`
Support for `torch.bool` in `torch.zeros`, `torch.zeros_like`, `torch.ones`, `torch.ones_like`
Support for `torch.scatter` and its in-place variant
Support for in-place modification to buffers on IPU
Support for taking slices of scalars
Support version of bilinear upsampling specifying intended output size instead of scale factors
Add support for overlapping host IO on inputs via `poptorch.set_overlap_for_input` (see the sketch after this list)
Add option for setting number of IO tiles via `numIOTiles` in `poptorch.Options` (required for `poptorch.TensorLocationSettings.useIOTilesToLoad` and `poptorch.set_overlap_for_input`)
.)Improve PopTorch’s parity with PyTorch’s Softplus
Improve implementation of torch.SiLU by using Poplar’s Swish operator
Additional support for operation overloads
Fix issue where PopTorch recalculated upsampling scales in fp16
Fix issue where the last use of `poptorch.set_available_memory` would be pruned
Add documentation on available memory proportion to incorporate embeddings and indexing operations
Add documentation on how users can generate debug information
Support replicated tensor sharding when running on multiple processes
Allow selection for a non-constant x input.
Support for `enableConvDithering` convolution option
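A minimal sketch of the overlapped host IO options referenced above (the tile count, overlap mode and model are illustrative; the `poptorch.OverlapMode` enum name is an assumption based on the PopTorch documentation):

```python
import torch
import poptorch

class Model(torch.nn.Module):
    def forward(self, x):
        # Mark this input as eligible for overlapping host-to-device IO.
        x = poptorch.set_overlap_for_input(
            x, poptorch.OverlapMode.OverlapAccumulationLoop)
        return x + 1

opts = poptorch.Options()
opts.TensorLocations.numIOTiles(32)  # IO tiles are required for overlap
model = poptorch.inferenceModel(Model(), opts)
```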
2.2.0
Migrated to PyTorch version 1.9.0
Support for `torch.roll`
Support for `torch.clone`
Add modelName session option that can be passed to PopART
Support List inputs to a model
Tuples/Lists of constants can now be returned by a model
Add `enableProfiling` convenience method in `poptorch.Options` to enable profile report generation (see the sketch after this list)
Fix bug with `torch.Tensor.repeat` when applied to an input during training
Fix bug with `aten::to` when applied to a constant used as an input to another node
Improved error message when encountering untraceable types during compilation
Support for `torch.gather`. Please note: this operator is known to cause long compilation times. Consider using a one-hot-based solution instead, or `torch.index_select` if appropriate.
Using a convolution layer op with the value of `padding` greater than or equal to `kernel_size` is now supported.
Support for `torch.Tensor.new_ones` and `torch.Tensor.new_zeros`
Support for `torch.flip`
Support for PopART ops attributes
Support for exception categories
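The `enableProfiling` convenience method can be used roughly as follows (a sketch; the directory argument is an assumption, and the path is hypothetical):

```python
import torch
import poptorch

opts = poptorch.Options()
opts.enableProfiling("./profile_report")  # hypothetical output directory

# Compiling and running the wrapped model produces the profile report.
model = poptorch.inferenceModel(torch.nn.Linear(4, 4), opts)
```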
Poplar Changelog
2.3.0+1367
New features
Output a backtrace on tile errors that identifies the program the error occurred in
Add initial support for having more than 16 GiBs of remote buffers per IPU, controlled by the target.extendedMemory option
Add support for UNSIGNED_LONGLONG and LONGLONG data types
Add optional needAlignWorkers vertex field to specify whether a vertex requires worker alignment when target.deterministicWorkers is enabled
Bug fixes
Fix TEXCPT_INVALID_ADDR exception when using remote buffers
Fix host sync timeout error with large stream copy
Fix divergent control flow error when profiling a program containing a switch
Fix incorrect copy code generation when an Input field and an InOut field are connected to the same tensor
Add additional system analyser trace operations to ensure all host I/O operations are included in the trace
Fix `poplar::Graph::findUnbroadcastTensor` returning an incorrect result for a partially broadcast multi-dimensional tensor
Other improvements
Add optimisation to improve performance of stream copies in loops
Reduce host exchange code size
Improve variable allocator so more applications fit in memory
Reduce code size and runtime overhead of enabling target.deterministicWorkers
Improve graph compilation speed
Optimise compiler data structures to reduce host memory usage
Improve efficiency of stream copies on IPU-M2000 systems in applications with many IPUs per replica
Update the format of Poplar logging messages to be consistent with tools such as VIPU
Improve speed of writing profile information to disk, particularly on network file systems
Extend optimisation that duplicates compute to avoid exchange so it works in more cases
Include the tile a memory parity error occurred on in the error message attached to the exception
Extend the Abort program to take an optional message
Make `poplar::Graph::getTileMapping` optionally return incomplete tile mappings for tensors that may lie outside the virtual graph
Optimise remote buffer and stream read and write bandwidth on IPU-M2000 systems
Support setting some engine options at runtime so they can be changed without recompiling the graph
2.2.0
New features
Split Poplar’s runtime errors into categories to allow for automatic recovery
Bug fixes
Fix deadlock when the number of worker threads is high in comparison with the amount of stream callbacks to handle
Fixed occasions when in-place binary elementwise operations would return the wrong results when both inputs alias each other
Fixed typo in the profiler for GetGlobalConsensus programs
Changed host exchange to use the correct sync type
Fixed a foreign key issue when generating the profile
Avoid initialisation of already allocated remote buffers
Fix error during compilation when putting both MultiVertex and Vertex codelets into the same compute set
Updated some broken documentation links
Reduced overly verbose trace logging
Removed imprecise supervisor stack check that caused false positives
Other improvements
Improved the performance of the `deterministicWorkers: portable` engine option
Support unbuffered completion mode in the PCI complex
Removed unnecessary copy of the binary from a serialised executable
Optimised the codegen when patching remote buffer copy headers
Removed the V1 and experimental profiler formats
Improved the latency (including random spikes) during model execution
Store lowered vars in the main profile file
Optimize away starting syncId copies by feeding the program’s dataflow result back to itself
Extend program analysis to eliminate syncId copies inside repeat loops when possible
General documentation improvements to Graph.hpp and Engine.hpp
Update Loop/Repeat* descriptions with cross references and a bit more info
Improved host memory usage during compilation
Error messages now provide enough information to identify the IPU at fault when there are multiple Poplar processes
Better error checking when extracting the archive from the executable
Standardise printing of StreamCopy programs
Allow engines to have names
Add support for serialising the executable from the engine
Set profiler.perExecutionStreamCopyCycles as default
Add a symlink to the debug.cbor in the parent directory
Print compute set name when add vertex fails
Extend logging to include which IPUs are in each sync group
Better document what types are supported by Hardware, poplar and frameworks
Updated IPU Programmer’s Guide
Updated the tutorials links in the user guide
Updated LLVM to version 13.0.0
Improvements made to the codegen produced when compiling C++ device code
Poplar Libraries Changelog
2.3.0+1367
New features
Improve performance of LSTM and GRU operations with a variable sequence length
Introduce a new variant of `popnn::lstmBwd` and `popnn::lstmBwdWithWU` that can output both the gradient of the output and the gradient of the cell state.
Many performance improvements for multiSlice, multiUpdate, multiUpdateMax operations
Add popops::regroupIfPossible function which only regroups a tensor if it can be done efficiently
Bug fixes
Fix bug that caused element-wise operations to sometimes unnecessarily use a less efficient implementation that took more memory
Fix bug in pooling that gave incorrect results when the stride was larger than the kernel size
Emit an error if the value of the availableMemoryProportion option is less than 0.0
Fix bug in mixed precision popops::mulInPlace that caused it to error when passed a scalar tensor
Other improvements
Extend unary and binary element-wise operations to support long long and unsigned long long types
Extend popops fill operation to support long long and unsigned long long types
Extend map expressions to support long long and unsigned long long types
Extend dynamic slice and update to support long long and unsigned long long types
Support cast operations on char types
Dither the tile mapping of different RNN operations to improve memory balance across tiles
Update the format of PopLibs logging messages to be consistent with logs produced by other tools such as VIPU
Reduce memory usage of code generated for fused popops map expressions
Update popops::multiUpdateAdd to support a scale tensor of type float when the tensor to update has type half
Reduce code size by sharing some common code between different popops codelets
Dither the tile mapping of temporaries used in reduce and convolution operations to improve memory balance across tiles
Dither the tile mapping of convolution weights and biases to improve memory balance across tiles
Add variant of dynamic slice that writes the result to a tensor passed in as an argument
Improve performance of element-wise negate operation
Improve memory usage of popops scaled add operation by reducing vertex state
Reduce vertex state required for large convolutions
Use faster bitonic sort implementation for sortKeyValue and sortKeyValueInPlace APIs in more cases
2.2.0
New features
Added support to the embedding planner to optimise for cycles
Bug fixes
Fixed some compile time issues when planning large 3D convolutions
Fixes for elementwise ops that could fail with big inputs
Fix copyright notice in ConvPartialsStridesPacking.hpp
Added a missing alignment for a non-linearity vertex that could cause a tile exception
Updated popfloat to match the 1-byte data type used by TensorFlow
Fixed a NaN exception when using Gfloat types in popfloat
Added missing debug context for call to Fill in the embedding layer
Other improvements
Added popops/NormaliseImage.hpp to PopLibs API document
Added support for user design priorities in popsolver
Added an optimisation to hoist the broadcasting of non-sliced operands out of the loop during serialised convolutions
Use the new builtins for isfinite and isnan
Updated some of the existing vertices to use MultiVertex
Extended the range of the nx1 convolution vertex
Optimise the 1x1 convolution vertex inner loop
Allow expressions in popops to be compared for equality
Add error function (erf) in poplibs
Optimisations for LSTM variable time steps
Estimate time step interval for WU matmuls inside LSTMs
Sped up the unit tests
Use the 64-bit load store instructions in the dynamic slice 1D vertex
Improved error messages related to elementwise map expressions
Optimised the performance of the GELU vertices
Add support for doing a MAX operation during a multi-update
Improved the documentation for pooling
Add faster versions of innermost loop of exponent unary op
Allow for 32-bit partials during pooling for a 16-bit input
GCL Changelog
2.3.0
New features
Extended exceptions and error handling
Added parallel multi-tensor collectives
Non-replicated collectives (TensorCollectives) moved from PopLibs to GCL
Three-phase orthogonal allReduce support
Bug fixes
Fix for unaligned tensor splitting
Fix for building with gcc -Og/-O1 flags
Fix for replicatedReduceScatter Local reduction with CommGroup
Other improvements
Change bdcast to broadcast in debug strings
Added internal_tests component for exporting collective tests
Unified all CMakeLists to use CMake 3.18.4
Syncless topologies are now verified before use
Single tensor replicatedAllReduceWithOutput() dispatches the multi-tensor one
Replicated collectives functions get CrossReplica suffix
Added error code linter
Added multi-tensor allReduceInPlace()
Step counter code is made optional
Described the logical GCL topologies in the documentation
Using optimised broadcast reduction on SWNC when possible
Various improvements to the test framework
2.2.0
Bug fixes
IO tile allocator now returns tile pairs if requested by the caller
Fix for multi-ILD Collective::MEAN operator
Other improvements
Improve host rearrangement performance for transposed tensors
PopDist Changelog
2.3.0
New features
None.
2.2.0
New features
Improved error reporting in case of a missing IPU device.
PopRun Changelog
2.3.0
New features
Export some commonly used environment variables by default. The environment variables `PATH`, `LD_LIBRARY_PATH` and `PYTHONPATH` are exported by default to all instances. Passing them to `--mpi-local-args="-x ENV_VAR"` is no longer needed.
Add support for Slurm hostlists. The `--host` argument now supports the Slurm hostlist syntax. For example, `host[1-3,5]` will expand to `host1,host2,host3,host5`.
Pick up configuration options from Slurm. The number of instances, replicas, IPUs per replica and the available hosts are picked up from Slurm environment variables if they exist. If an option is provided both by a command-line argument and by Slurm, the command-line argument takes precedence.
Allow disabling executable caching. The executable cache can be disabled by passing an empty string using `--executable-cache-path ""`.
If there is only a single V-IPU partition available, it will now be used automatically without the need for specifying its name using `--vipu-partition`.
Increase default V-IPU server timeout. The default value of `--vipu-server-timeout` is now 120 seconds.
The new argument `--only-stdout-from-instance` allows suppressing the standard output from all instances except the given one. This is different from the existing `--only-output-from-instance` in that it allows standard error from all instances.
2.2.0
New features
Added checks for IPU/GW-Link routing and sync type of existing partitions. The existing partition is checked against the values passed to `--ipu-link-routing-type`, `--gw-link-routing-type` and `--sync-type`. In case of a mismatch, the partition will be updated if `--update-partition=yes` is provided.
Improved error message when the application was terminated by SIGKILL.
Libpva Library Changelog
2.3.0
New features
Added equality operators for Programs so they can be used as keys in maps or sets
Added new APIs to query the vertex instances by tile. Previously we reported the number of vertices; now you can determine which tiles they are on.
Added support to show the `dwarf` memory category.
Added an API to query which variables are associated with a debug context.
Added CodeCopy as a new Program type.
The ProgramVisitor now has a default handler `visitProgram`.
Added an API function on the `Program` to report how much control code is used on each tile.
Bug fixes
None
2.2.0
New features
Add APIs to get liveness information from the compilation report.
Add APIs to get the lowered variable information from the compilation report. See `LoweredVariable`.
The `openReport` API now optionally takes the `debug.cbor` input file.
Add APIs to read the DebugContext information from the `debug.cbor` and associate it with programs and variables (see the sketch after this list).
The documentation for libpva has been moved from the Poplar user guide to a standalone user guide.
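For reference, opening a report together with its debug information looks roughly like this from Python (a sketch; the file paths are hypothetical and the attribute access assumes the documented report structure):

```python
import pva

# The debug.cbor argument is optional as of this release.
report = pva.openReport("profile.pop", "debug.cbor")
print("Tiles on target:", report.compilation.target.numTiles)
```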
Bug fixes
Fix issue with lists with more than 2^16 elements being truncated.
Fixed issue in Python binding that prevented access to `VertexInstances` and `ComputeSets`.
TensorFlow Changelog
2.3.0
New features
Improved performance of concurrent pipeline stages.
Migrated codebase from TensorFlow 2.4.1 to TensorFlow 2.4.3.
Performance optimisations when using the `replicated_optimizer_state_sharding` option with pipelining or `GradientAccumulationOptimizerV2`.
Added `IPUConfig.optimizations.math` for controlling arithmetic optimisations of the model compilation (see the sketch after this list).
Improved integration with the Graphcore PopVision System Analyser.
Add support for hooks with `IPUPipelineEstimator`.
Compile-time and run-time optimisations.
PopLibs options can be specified for slice operations via the `IPUConfig.slices.poplar_options` config option and via `PipelineStageOptions`.
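A minimal configuration sketch touching the new option namespaces (only basic `IPUConfig` usage is shown; the exact attributes under `optimizations.math` and `slices.poplar_options` are not reproduced here - check the API documentation for the supported keys):

```python
from tensorflow.python import ipu

config = ipu.config.IPUConfig()
config.auto_select_ipus = 1
# New in this release (names from the changelog above):
#   config.optimizations.math      - arithmetic optimisations
#   config.slices.poplar_options   - PopLibs options for slice operations
config.configure_ipu_system()
```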
Bug fixes
Pipelined Keras models now correctly set the `training` argument passed to the Keras layers in the model.
Improved performance (latency and throughput) of callbacks when using `asynchronous_callbacks`.
`EffectiveTransformer` weight initialisers have been exposed to the user.
IPU-specific Keras layers can now be serialised to allow the model to be saved and restored.
2.2.2
New features
Added support for an embedded runtime for integrating into inference systems - see the "IPU embedded application runtime" section in the documentation for full details.
Bug fixes
None.
2.2.0
New features
Migrated codebase from TensorFlow 2.1 to TensorFlow 2.4.
Improved Keras integration, see the documentation for full details.
Added support for concurrent pipeline stages - see the "Concurrent pipeline stages" section in the documentation for full details.
Improved operation scheduling to reduce memory usage.
Performance optimisations when using the experimental `replicated_optimizer_state_sharding` option with pipelining.
Compile-time and run-time optimisations.
Added `EffectiveTransformer` Keras layer to efficiently handle transformers without padding the input sequences.
Added `AssumeEqualAcrossReplicas` Keras layer and `assume_equal_across_replicas` operator for marking operations in the graph as equal across replicas to aid with divergent control flow (see the sketch after this list).
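A sketch of the `AssumeEqualAcrossReplicas` Keras layer (the `ipu.keras.layers` module path is an assumption, matching where the other IPU-specific Keras layers live; the surrounding model is illustrative):

```python
import tensorflow as tf
from tensorflow.python import ipu

inputs = tf.keras.Input(shape=(16,))
# Mark the tensor as identical on every replica to avoid
# divergent control flow across replicas.
x = ipu.keras.layers.AssumeEqualAcrossReplicas()(inputs)
outputs = tf.keras.layers.Dense(1)(x)
model = tf.keras.Model(inputs, outputs)
```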
Bug fixes
Fixed a memory leak which caused host memory usage increase when using Keras.
Known issues
Known issues are listed below for: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library and TensorFlow.
Driver & Utilities known issues
1.0.55
None.
1.0.52
None.
PopART known issues
2.3.0+1367
None.
2.2.1
None.
PopTorch known issues
2.3.0+30608
None.
2.2.0
None.
Poplar known issues
2.3.0+1367
None.
2.2.0
None.
Poplar Libraries known issues
2.3.0+1367
None.
2.2.0
None.
GCL known issues
2.3.0
None.
2.2.0
None.
PopDist known issues
2.3.0
None.
2.2.0
None.
PopRun known issues
2.3.0
None.
2.2.0
None.
Libpva Library known issues
2.3.0
None.
2.2.0
None.
TensorFlow known issues
2.3.0
Using `mixed_precision.Policy('mixed_float16')` with pipelined Keras models results in compilation errors.
The `experimental_normalize_gradients` feature of TensorFlow 2 can produce unstable results when the number of replicas or the `gradient_accumulation_steps_per_replica` is large.
Using TensorFlow operations instead of Keras layers inside a pipelined Keras model definition can result in a compilation error when one of the inputs is a constant.
2.2.2
None.
2.2.0
The `experimental_normalise_gradients` feature of TensorFlow 2 can produce unstable results when the number of replicas or the `gradient_accumulation_steps_per_replica` is large.
Compatibility changes
This section details the compatibility changes in v2.3.0.
Compatibility changes are listed below for: Driver & Utilities, PopART, PopTorch, Poplar, Poplar Libraries, GCL, PopDist/PopRun, Libpva Library and TensorFlow.
Driver & Utilities Compatibility changes
1.0.55
None.
1.0.52
None.
PopART Compatibility changes
2.3.0+1367
[API] Remove HostReduce
[API] Remove PreAliasPatternType
[API] Remove deprecated grouped matrix multiplication option
[API] Deprecate some pattern constructors
[API] Deprecate unused environment variables
[API] Remove explicit pipelining flag
2.2.1
None.
2.2.0
[API] Remove deprecated `Patterns::Patterns(std::vector<PreAliasPatternType> types)` constructor. Use `Patterns::Patterns(std::vector<std::string> types)` instead.
[API] Remove deprecated `bool Patterns::isPatternEnabled(PreAliasPatternType t)` method. Use `bool Patterns::isPatternEnabled(std::string t)` instead.
[API] Remove warnings for `GCL_REAL_COLLECTIVES` and `GCL_MAX_BYTES_PER_TILE`; use session options `useSynclessCollectives` and `gclOptions["maxBytesPerTile"]`, respectively, instead.
[API] Remove deprecated if op constructor. See `willow/include/popart/op/if.hpp` for the replacement constructor.
PopTorch Compatibility changes
2.3.0+30608
Default mean reduction strategies have changed from the deprecated PostAndLoss strategy to Post or Running based on optimiser accumulation type
Mean reduction strategy can now be set via `poptorch.Options.Training.setMeanAccumulationAndReplicationReductionStrategy` (see the sketch after this list).
Add warning that IPU-specific optimiser states cannot be read from the host when calling `get_state()` on poptorch.optim optimisers.
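The new setting can be applied as follows (a sketch; the `poptorch.MeanReductionStrategy` member names follow the strategies listed above):

```python
import poptorch

opts = poptorch.Options()
# Running: reduce as values arrive; Post: reduce after accumulation.
opts.Training.setMeanAccumulationAndReplicationReductionStrategy(
    poptorch.MeanReductionStrategy.Running)
```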
2.2.0
Removed `accumulationReductionType`, which was deprecated in 2.1 in favour of `accumulationAndReplicationReductionType` in `poptorch.Options.Training`.
Removed `runningVarianceAlwaysFloat`, which was deprecated in 2.1 and replaced by `runningStatisticsAlwaysFloat` in `poptorch.Options.Precision`.
Poplar Compatibility changes
2.3.0+1367
Support for non-top-level replicated graphs has been removed
The `opt.enableSwSyncs` option has been removed
2.2.0
The “device” value for the engine option debug.computeInstrumentationLevel has been deprecated
The methods `poplar::Graph::createReplicatedGraph` and `poplar::Graph::getNonReplicatedTensor` have been deprecated. Use the top-level replication API instead.
Poplar Libraries Compatibility changes
2.3.0+1367
The following methods from `TensorCollectives.hpp` have been deprecated:
- `popops::allReduce()`: instead use `gcl::allReduceWithinReplica()`
- `popops::allGather()`: instead use `gcl::allGatherWithinReplica()`
- `popops::reduceScatter()`: instead use `gcl::reduceScatterWithinReplica()`
The versions of `popnn::lstmBwd` and `popnn::lstmBwdWithWU` with an optional parameter to output only the gradient of the cell state have been deprecated. Use the new versions that output both the gradient of the output and the gradient of the cell state instead.
2.2.0
None.
GCL Compatibility changes
2.3.0
New methods have been added to replace the deprecated ones in PopLibs: `gcl::allReduceWithinReplica()`, `gcl::allGatherWithinReplica()` and `gcl::reduceScatterWithinReplica()` should be used instead of `popops::allReduce()`, `popops::allGather()` and `popops::reduceScatter()` from `TensorCollectives.hpp`.
The following APIs are deprecated:
- `gcl::allReduce()`: instead use `gcl::allReduceCrossReplica()`
- `gcl::allGather()`: instead use `gcl::allGatherCrossReplica()`
- `gcl::reduceScatter()`: instead use `gcl::reduceScatterCrossReplica()`
Internal function `getNumXBsUsed()` was removed from the public API
2.2.0
The following APIs have been removed:
- `gcl::allReduce` methods using `popops::Operation` - use `gcl::allReduce` with `popops::CollectiveOperator` instead
- `gcl::allReduceToDestination` methods using `popops::Operation` - use `gcl::allReduceToDestination` with `popops::CollectiveOperator` instead
- `gcl::allReduceInPlace` methods using `popops::Operation` - use `gcl::allReduceInPlace` with `popops::CollectiveOperator` instead
- `gcl::reduceScatter` methods using `popops::Operation` - use `gcl::reduceScatter` with `popops::CollectiveOperator` instead
The `gcl::perIPUTiles` argument list was extended and it can now return IO tiles that are tile pairs if requested by the caller
PopDist Compatibility changes
2.3.0
None.
2.2.0
None.
PopRun Compatibility changes
2.3.0
None.
2.2.0
None.
Libpva Library Compatibility changes
2.3.0
None
2.2.0
There has been a change to the classes for liveness (see the sketch after this list):
- Instead of `programStep.notAlwaysLiveBytes` you now have to use `programStep.notAlwaysLiveMemory.bytes`
- Instead of `programStep.notAlwaysLiveVariables[x].name` you now have to use `programStep.notAlwaysLiveMemory.variables[x].name`
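In Python the migration looks like this (a sketch; the profile path is hypothetical and the `livenessProgramSteps` attribute used to obtain a program step is an assumption):

```python
import pva

report = pva.openReport("profile.pop")  # hypothetical profile
step = report.compilation.livenessProgramSteps[0]  # attribute name assumed

# Before SDK 2.2:
#   step.notAlwaysLiveBytes
#   step.notAlwaysLiveVariables[0].name
# From SDK 2.2:
print(step.notAlwaysLiveMemory.bytes)
print(step.notAlwaysLiveMemory.variables[0].name)
```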
TensorFlow Compatibility changes
2.3.0
Custom user op metadata interface update - the `metadata` interface for custom user ops has been updated with an additional parameter. See the "API changes" section in the TensorFlow documentation for full details.
2.2.2
None.
2.2.0
IPU-specific Keras API for building models has been removed. See the TensorFlow documentation for full details.
C++ Poplar TensorFlow libraries are private by default.
`feed_name` does not need to be specified for `IPUInfeedQueue` and `IPUOutfeedQueue`.
See the "API changes" section in the TensorFlow documentation for full details.
Appendix
Appendix A : Additional requirements
PopVision Graph Analyser
To be able to view profiling reports generated by SDK v2.3.0, PopVision Graph Analyser v3.0 or later and PopVision System Analyser v2.0 or later are required.
TensorFlow
To correctly execute TensorFlow code, please ensure the following:
Intel platforms
Use Python 3.6 as the minimum version
A CPU compatible with the AVX-512 instruction set is needed.
AMD platforms
Use Python 3.6 as the minimum version
A CPU compatible with the Znver1 instruction set is needed.
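As a quick host check for the Intel requirement above, you can look for the AVX-512 foundation flag in /proc/cpuinfo (a minimal sketch, assuming a Linux host; `avx512f` is the baseline AVX-512 flag):

```python
# Minimal host check (assumes Linux): verify the CPU advertises the AVX-512
# foundation flag before installing the Graphcore TensorFlow wheels.
with open("/proc/cpuinfo") as f:
    flags = f.read()
print("AVX-512 available:", "avx512f" in flags)
```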