5.1.6. PopART changelog
2.6.0+5997
New features
Improvements for explicit pipelining (support for overlapped IO)
Support half-precision tensors in scaledVarUpdate
Allow AddArg0GradOp to change its output tensor type after construction
Improved support for gradient clipping in accumulate outer fragment paralleliser transform
Add execution context constraints in AliasModelGrower
Add RoiAlign operation
Add init_type to ops.init to allow for uninitialised tensors
Allow AiGraphcoreOpset1::Reshape to use a -1 dim (inferred from the remaining dimensions, as in ONNX Reshape)
Add AiGraphcoreOpset1::slice to mimic Slice-1
Improved implementation of resize gradient reduceDimension
Separate load and store landing pad tensors for remote exchanges when required
Improved support for int16/uint16
Use the LeakyRelu output tensor instead of the input to compute the gradient
Add transform to backup inplace updated tensors when they are required for recomputation
Add step to verify that users aren’t using modifying (inplace) operations in autodiff
Add support for custom programs, and introduce a special custom program for implicit-pipelining forward-only mode (experimental)
Add pass argument to in_sequence in PopXL, to allow transforms to add topological constraints after an operation is created
Add shape inference to tensor remap operation
Enable profiling of cached executables. See PopVision documentation
Add code loading to PopXL. See Graphs
Add custom operation support to PopXL. See Custom operations
Add tanh, conv, averagepool, argmin, argmax, exp, histogram, sqrt, maximum, log, onehot and roialign to PopXL. See Supported operations
Add negative log likelihood loss in PopXL
Add per-replica variable initialisation and retrieval to PopXL. See Replication
Add support for torch input tensors in PopXL
Improved device management in PopXL
Add .vscode workspace file
Add argument type check to popxl.Session.get_tensors_data in PopXL
Add support for enabling engine caching via the POPXL_CACHE_DIR environment variable (see the sketch after this list)
Avoid use of deprecated variables in GCL
Add “zeroInfinity” option and plan option flag “enableReducedClassesInLabel” support to CTC operation
Add ability to run CTC operation in validation mode
Switch to new collective balanced reorder API
Add “disableOptimizerStateTensorStreams” option to selectively disable streaming and storing of optimizer tensors
Improved handling of writing weights from the host and reading weights back to the host in PopXL via context managers
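As a minimal sketch of two of the features above, the snippet below enables executable caching through the POPXL_CACHE_DIR environment variable and uses the newly added tanh op in a small PopXL program. The cache directory is a placeholder, and the assumption that setting the variable before running is sufficient should be checked against the PopXL documentation.

    import os

    # Assumption: pointing POPXL_CACHE_DIR at a writable directory is enough
    # to enable engine caching; the path here is a placeholder.
    os.environ["POPXL_CACHE_DIR"] = "/tmp/popxl_cache"

    import popxl
    import popxl.ops as ops

    ir = popxl.Ir()
    with ir.main_graph:
        x = popxl.variable(2.0, popxl.float32, name="x")
        y = ops.tanh(x)  # tanh is among the ops added to PopXL in this release
        # Stream the result back to the host.
        y_d2h = popxl.d2h_stream(y.shape, y.dtype, name="y_stream")
        ops.host_store(y_d2h, y)

    with popxl.Session(ir, "ipu_model") as session:
        outputs = session.run()

    print(outputs[y_d2h])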
Bug fixes
Tidy up linting issues
Fix subgraph pruner used by autodiff
Fix allreduce logic
Fix autodiff bugs
Fix bug related to change in collective balanced reorder padding behaviour
Fix missing RTSGroup error
Fix bug in equal in PopXL
Fix partialTypeMatMuls support in PopXL
Fix for torch linear mode test
Fix for missing pipelineStage attribute
Reload engine and connect streams on every re-attach through popxl.Session context manager
Fix bug where VariableSettings CommGroup is not respected by AllReduce in gradient and accumulator reduction
Fix for random number state management when replicated graph option is set
Fix alias zero copy setting verification
Fix non-determinism bugs in accumulate outer fragment paralleliser and multi collective transforms
Fix explicit recompute (annotation issue and recompute to recompute connections)
Fix for resize gradient operation
Fix mechanism to write variable data when using a cached binary in PopXL
Various fixes for executable caching (including previously missing tensors, the random seed and anchors)
Fix incorrect state-tensor initial vector dimensions
Revert inplace WhereOp to outplace when it is not parallel writable
Don’t overwrite the number of tiles to 4 when a custom IPU model config is used
Print before erasing to avoid a memory error
Optimisations
Reduce number of dummy graph objects constructed for lowering MatMul operations
Remove cases of a clone followed by a mapTensorLinearly
Remove unnecessary uses of mapTensorLinearly
Optimisations in parsing ONNX protobuf files
Speed up applyInplacePattern by ignoring graph.isSchedulable
Faster and more memory-efficient implementation of cubic resize
Take compilation-affecting engine options into account when calculating hashes for the purpose of executable caching
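A short illustration of the last point, using PopART’s Python session options: with executable caching enabled, engine options that affect compilation now feed into the cache hash, so changing one should trigger a fresh compilation rather than reuse of a stale binary. The specific engine option below is only an example, and the cache path is a placeholder.

    import popart

    opts = popart.SessionOptions()
    opts.enableEngineCaching = True   # cache compiled executables
    opts.cachePath = "popart_cache"   # placeholder cache directory

    # A compilation-affecting Poplar engine option; per the item above,
    # changing it should now change the executable-cache hash.
    opts.engineOptions = {"opt.internalExchangeOptimisationTarget": "memory"}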
Logging and documentation
General improvement of the PopXL user guide (including sessions, remote variables, replica grouping, code loading, custom operations, links to related PyTorch, NumPy and ONNX operations, and an MNIST example)
General improvements to PopART API documentation
Error message improvements
2.5.1
New features
Add PopXL API (experimental)
Add support for RNN operator (preview)
Improvements to automatic loss scaling (experimental)
Add improved ability to manage PRNG behaviour across replicas (experimental)
Add ability to retrieve random seed
Add an overload of Builder.setAvailableMemoryProportion which can target multi-output nodes (see the sketch after this list)
Ensure initial inputs of gradient graphs match any user-specified provided grads
Ensure outputs of gradient graphs match any user-specified required grads
Add ability to run exported models using the Poplar Triton Backend via PopEF integration
Add visualisations for inplace modified and aliased tensors and graph inputs and outputs to Dot visualizer
DynamicSliceOp and DynamicUpdateOp can drop the first dimension of the slice if it is 1
Support AnchorReturnType::Final in MainLoops transform
Improved replicated tensor sharding (RTS) compatibility for operations
Make gradient clipping compatible with replicated tensor sharding (RTS)
Improved linter support
Add ability to show the ONNX model proto as human-readable text
Various improvements to executable caching
Add ability to perform per-replica reads and writes of variable values
Improved quality of debug information
Use slice plan in SparseAccumulateOpx
Add ability to merge collective operations
Add ability to dynamically switch off the backwards pass when using implicit pipelining
Add ability to refresh engine cache on-the-fly
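The snippet below sketches the Builder.setAvailableMemoryProportion item from the list above. The single-output form is the long-standing API; the commented multi-output form shows how the new overload is assumed to be called and should be checked against the PopART API reference.

    import popart

    builder = popart.Builder()
    x = builder.addInputTensor(popart.TensorInfo("FLOAT", [64, 64]))
    w = builder.addInputTensor(popart.TensorInfo("FLOAT", [64, 64]))
    y = builder.aiOnnx.matmul([x, w])

    # Existing form: target a single output tensor of a node.
    builder.setAvailableMemoryProportion(y, 0.3)

    # Assumed usage of the new overload: pass the full set of output
    # tensor names of a multi-output node (e.g. an LSTM's outputs).
    # builder.setAvailableMemoryProportion({y0, y1}, 0.3)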
Bug fixes
Fix the logic that replaces DropoutOp with IdentityOp
Improve device handling in tests
Fix for potential deadlock condition in test runner
Fix in lowering logic for trailing subgraph parts that contain only calls to child subgraph parts
IdentityLossOpx will no longer attempt to unwind (resulting in an error) when there is a reduction
Fix subgraph autodiff logic
Allow CallOp to not have outputs connected for all of its callee outputs
Fix Python binding for DeviceManager::tryAttachUntilTimeout
Correctly promote inplace aliased and modified tensors through the Loop operation
Fix unwinding through multiple consecutive slice operations
Fix unwinding issue in MaxOpx
Enable bufferingDepth to be used when SessionOptions::enablePrefetchDatastreams isn’t set
Fix dtype clone in SparseAccumulateOpx::createInputTensor
Fix bug in ReplicatedTensorShardingTracer
Fix compile error if accl2 type is not FLOAT
Fix PowArg0GradOpPattern for fp16
Optimisations
Allow non-broadcasted indices as an input to the scatterreduce operation
Add ExpandCast pattern to reverse the order of an expand followed by cast to reduce memory footprint
Add inplace versions of WhereOp
Allow IdentityInplaceOp to unwind, reducing memory use when it cannot be made inplace
Split operators_test into two
Add TensorRemapOp for point-fixes of bad tensor layouts
Explicit recomputation support for pipelining
Alias zero copy tracks variables and multi-context tensors less conservatively
Improve graph traversal through loop-carried tensors
Logging and documentation
Add compile-time option to log device access events to a file
Improved CommGroupType::None comments
Fix code listings
Update to internal documentation build system
Various small user guide and API improvements
Add documentation on how to execute an imported model
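For context on the last item, here is a minimal sketch of executing an imported ONNX model with a PopART inference session; the model path, input dtype and zero-filled input data are placeholders.

    import numpy as np
    import popart

    model_path = "model.onnx"  # placeholder path to an imported ONNX model
    builder = popart.Builder(model_path)
    input_id = builder.getInputTensorIds()[0]
    output_id = builder.getOutputTensorIds()[0]

    # Anchor the model output and run on the IPU Model (simulator).
    dataflow = popart.DataFlow(1, {output_id: popart.AnchorReturnType("ALL")})
    device = popart.DeviceManager().createIpuModelDevice({})

    session = popart.InferenceSession(
        fnModel=model_path, dataFlow=dataflow, deviceInfo=device
    )
    session.prepareDevice()

    anchors = session.initAnchorArrays()
    # Placeholder input: zeros of the model's input shape, assumed float32.
    data = np.zeros(builder.getTensorShape(input_id), dtype=np.float32)
    stepio = popart.PyStepIO({input_id: data}, anchors)
    session.run(stepio)
    print(anchors[output_id])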