2. PopART C++ API
2.1. Sessions
#include <popart/session.hpp>
-
class popart::Session
Session is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware.
Subclassed by popart::InferenceSession, popart::TrainingSession
Public Functions
-
std::vector<uint32_t> getRNGState()
Get state of the random number generator.
-
void setRNGState(const std::vector<uint32_t>)
Set state of the random number generator.
-
void setRandomSeed(uint64_t seedValue)
Set the value of the random number generator seed.
This method explicitly seeds all random operations. Additionally, this method derives a new state for the random number generator (RNG) from the seed and sets it on the device. This RNG state is used to resolve stochastic rounding. To deterministically store and restore the combined random state for a session, do the following:
C++:
    // Store random state (session s0).
    auto seed = s0.getRandomSeed();
    auto rngState = s0.getRNGState();
    // Restore random state (session s1).
    s1.setRandomSeed(seed); // <-- affects RNG state, order important
    s1.setRNGState(rngState);
Python:
    # Store random state (session s0).
    seed = s0.getRandomSeed()
    rngState = s0.getRNGState()
    # Restore random state (session s1).
    s1.setRandomSeed(seed)  # <-- affects RNG state, order important
    s1.setRNGState(rngState)
- Parameters
seedValue – The value of the seed.
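The ordering constraint described for setRandomSeed() can be illustrated with a toy stand-in. ToySession and restoreRandomState below are hypothetical and are not part of PopART; they only mimic the documented behaviour that setting the seed also overwrites the RNG state. The way the state is derived from the seed is invented for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a session; NOT the PopART class.
class ToySession {
public:
  void setRandomSeed(uint64_t seedValue) {
    seed_ = seedValue;
    // Assumption for illustration only: derive a fresh RNG state from the seed.
    rngState_ = {static_cast<uint32_t>(seedValue),
                 static_cast<uint32_t>(seedValue >> 32)};
  }
  uint64_t getRandomSeed() const { return seed_; }
  void setRNGState(std::vector<uint32_t> state) { rngState_ = std::move(state); }
  std::vector<uint32_t> getRNGState() const { return rngState_; }

private:
  uint64_t seed_ = 0;
  std::vector<uint32_t> rngState_;
};

// Restore in the documented order: seed first, then RNG state.
inline void restoreRandomState(ToySession &s, uint64_t seed,
                               const std::vector<uint32_t> &rngState) {
  s.setRandomSeed(seed);   // overwrites the RNG state
  s.setRNGState(rngState); // so the saved state must be applied afterwards
  assert(s.getRNGState() == rngState);
}
```

Swapping the two calls in restoreRandomState would leave the toy session with the seed-derived state instead of the saved one, which is the failure mode the note on ordering warns about.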
-
uint64_t getRandomSeed()
Get the value of the random number generator seed.
Calling setRandomSeed() with this value (at a later stage) reinstates the random state logic that seeds random operations.
- Returns
The value used to seed current random operations.
-
void compileAndExport(const std::string &filename)
Compile the graph and export it to a file.
This method will first create a snap::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the file. The exported file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
- Parameters
filename – The name of the file where the compiled executable and metadata will be saved.
-
void compileAndExport(std::ostream &out)
Compile the graph and export it to a stream.
This method will first create a snap::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the stream. The data will be streamed in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
This method automatically creates folders as needed if filename is located in a folder which does not exist.
- Parameters
out – The stream that the compiled executable and metadata will be written to.
-
void saveExecutableToFile(const std::string &filename)
Save a compiled graph to a file.
The file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
This method automatically creates folders as needed if filename is located in a folder which does not exist.
- Parameters
filename – The name of the file where the compiled executable and metadata will be saved.
- Pre
prepareDevice() must have been called.
-
void saveExecutableToStream(std::ostream &out)
Save a compiled graph to a stream.
The data will be streamed in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
- Parameters
out – The stream where the compiled executable and metadata will be written to.
- Pre
prepareDevice() must have been called.
-
void checkInplacingAmbiguity() const
Check for potential inplacing ambiguities.
This method creates an AliasModel object for each graph and runs the Poprithms ambiguity checker on it.
Throws an error if the graph has an inplacing ambiguity and will prompt the user to check the inplacing.
See poprithms::memory::inplace::Graph::AmbiguityStatus on the Poprithms GitHub repo for more on what constitutes an ambiguity.
-
void loadExecutableFromFile(const std::string &filename)
Load the compiled executable and metadata from a file.
The file must have been created with compileAndExport(const std::string).
- Parameters
filename – The name of the file to load the executable and metadata from.
Load the compiled executable from a stream.
The stream must have been created with compileAndExport(std::ostream).
- Parameters
in – The shared pointer to the stream to load the executable from.
-
void prepareDevice(bool loadEngine = true)
Prepare the network for execution.
This will create the snap::Graph and poplar::Engine.
- Parameters
loadEngine – If true, load the engine and connect the streams once the device is ready.
-
void loadEngineAndConnectStreams()
Load the engine on the device and connect the streams.
This will set up the poplar::Streams.
Note: This call is optional. The engine will implicitly be loaded on the device when required.
-
void weightsFromHost()
Copy weights from the host to the device.
-
void weightsToHost()
Copy the weights from the device to the host stream memory.
-
uint64_t getCycleCount(std::string id = "")
Copy the cycle count tensor from the device to the host.
- Parameters
id – The identifier of the cycle count tensor.
-
void connectStreamToCallback(const std::string &streamHandle, std::function<void(void*)> callback, unsigned index = 0)
Connect a Poplar stream with a callback.
The callback will be called whenever the stream is about to be read or has been written to by the device. The memory location passed to the callback is only valid for reading or writing for the duration of the callback.
- Parameters
streamHandle – The name of the stream to connect to.
callback – The callback to be called whenever the stream is to be read or was written to by the device.
index – The replica index to connect to, when using replicated graphs. Default=0.
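As a minimal sketch of a callback with the std::function<void(void*)> shape shown in the signature, the helper below copies a fixed set of values into whatever memory it is handed. makeFillCallback is a hypothetical name, and the assumption that the buffer holds at least as many floats as the source vector is ours; the callback itself receives no size information.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: builds a callback that copies `source` into the
// memory passed to it. The caller must guarantee that the destination
// holds at least source.size() floats (assumption of this sketch).
inline std::function<void(void *)> makeFillCallback(std::vector<float> source) {
  return [source = std::move(source)](void *dst) {
    assert(dst != nullptr);
    // The destination is only touched inside the callback, since it is
    // only valid for the duration of the call.
    std::memcpy(dst, source.data(), source.size() * sizeof(float));
  };
}
```

With a real session, an object like this would be passed as the callback argument of connectStreamToCallback(); the stream handle to use depends on the model.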
-
void connectStream(const std::string &streamHandle, void *buffer)
Connect a Poplar stream with a fixed location in memory.
Each time data is copied to the stream, this location will be read and each time data is copied from the stream, this location will be written.
- Parameters
streamHandle – The handle of the stream to connect to.
buffer – The pointer to the memory location.
-
void connectHostFunction(const std::string &functionHandle, std::function<void(const void*const*, size_t, void*const*, size_t)> callback, unsigned index = 0)
Connect a host function to a callback.
The callback is passed pointers to the locations in memory of each of the function’s input and output arguments, respectively; in the signature above, each pointer array is followed by a size_t argument. During a host function call, first the device transfers the input data to the host, then the callback is invoked, and finally the output data is copied back to the device. The memory pointed to by the callback arguments must only be accessed for the duration of the callback.
- Parameters
functionHandle – The name of the host function.
callback – The function to be called whenever new input data is available.
index – The replica index to connect to, when using replicated graphs. Default=0.
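The following sketch shows one way to write a callable with the four-argument shape from the signature. It assumes, without confirmation from this reference, that the two size_t arguments are the numbers of input and output pointers, and that there is exactly one float input and one float output of a known length. HostFunctionCallback and makeDoublingCallback are names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Alias matching the callback type in the connectHostFunction() signature.
using HostFunctionCallback = std::function<void(
    const void *const *, std::size_t, void *const *, std::size_t)>;

// Hypothetical helper: returns a callback that writes 2 * input into the
// output. Assumes one input and one output, each holding `numElements`
// floats, and that the size_t arguments are pointer counts.
inline HostFunctionCallback makeDoublingCallback(std::size_t numElements) {
  return [numElements](const void *const *inputs, std::size_t numInputs,
                       void *const *outputs, std::size_t numOutputs) {
    assert(numInputs == 1 && numOutputs == 1);
    const float *in = static_cast<const float *>(inputs[0]);
    float *out = static_cast<float *>(outputs[0]);
    // Access the memory only while the callback is running.
    for (std::size_t i = 0; i < numElements; ++i) {
      out[i] = 2.0f * in[i];
    }
  };
}
```

The element count is captured at construction time because the callback arguments carry no shape or type information in this sketch.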
-
void run(IStepIO &stepIO, std::string debugName = "")
Run one step.
Read the input data from the addresses in stepIO.in.
Write the output data to the addresses in stepIO.out.
- Parameters
stepIO – The input and output data.
debugName – A debug string to identify this run in logs.
-
void run(std::string programHandle, IStepIO &stepIO, std::string debugName = "")
Run one step of a custom program.
Read the input data from the addresses in stepIO.in.
Write the output data to the addresses in stepIO.out.
- Parameters
programHandle – The handle of the custom program to run.
stepIO – The input and output data.
debugName – A debug string to identify this run in logs.
-
void updateExternallySavedTensorLocations(const std::string &fromLocation, const std::string &toLocation)
Update the tensor locations of tensors in the session’s ONNX model.
A new file will be created at this point, and written to when the ONNX model is saved with a subsequent call to modelToHost().
- Parameters
fromLocation – All externally saved tensors with location fromLocation will have their location updated to toLocation.
toLocation – The updated tensor locations. This must not already exist.
-
void modelToHost(const std::string &fn)
Write the current model to an ONNX file.
- Parameters
fn – The path to the file. The path can be absolute or relative. If you plan to run your program in multiple processes simultaneously, you should avoid possible race conditions by writing to different files, for example by using temporary files.
-
TensorInfo getInfo(TensorId) const
Get the tensor information for a tensor.
- Parameters
TensorId – The identifier of the tensor to get the tensor information for.
- Returns
The tensor information for the tensor.
-
bool hasInfo(TensorId) const
Check whether a tensor has information.
- Parameters
TensorId – The identifier of the tensor to get the tensor information for.
- Returns
true if the tensor with identifier TensorId has tensor information and false if not.
-
std::string getSummaryReport(bool resetProfile = true) const
Retrieve the summary report from the poplar::Engine.
The options which were passed to the Session constructor will influence the information in the report.
This method may only be called after prepareDevice() has been called.
- Parameters
resetProfile – If true, resets the execution profile. Default = true.
- Returns
A string containing the report.
-
std::string getSerializedGraph() const
Retrieve the serialized graph from the poplar::Engine.
A JSON format report is produced.
This method may only be called after prepareDevice() has been called.
- Returns
A string containing the serialized graph.
-
pva::Report getReport() const
Retrieve the graph report from the poplar::Engine.
The options which were passed to the Session constructor will influence the information in the report.
This method may only be called after prepareDevice() has been called.
- Returns
The PopVision Analysis report object.
-
void resetHostWeights(const std::string &model, const bool ignoreWeightsInModelWithoutCorrespondingHostWeight = false)
Reset weights with weights in an ONNX model.
Note that the only differences between the ONNX model and the current model must be the weights. No other differences are allowed.
This method only updates the weights on the host. weightsFromHost() must be called after this method to update the weights on the device.
- Parameters
model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.
ignoreWeightsInModelWithoutCorrespondingHostWeight – If true, do not throw an error if there are initializers in the ONNX model without corresponding initializer tensor(s) in the session’s IR.
-
void readWeights(const IWeightsIO &weightsIo)
Read the weights from the host stream memory and write to the host.
This method may only be called after weightsToHost() has been called.
- Parameters
weightsIo – The weight data that is read from the host stream memory is written to the addresses in weightsIo.out.
-
void writeWeights(const IWeightsIO &weightsIo)
Write the weights from the host to the IR tensor memory.
This method may only be called after weightsFromHost() has been called.
- Parameters
weightsIo – The weight data is written to the addresses in weightsIo.out.
-
std::string serializeIr(IrSerializationFormat format)
Serialize the IR graph to a string.
- Parameters
format – The format to use for serializing.
-
inline const popx::IrLowering &getIrLowering() const
Get the IR lowering associated with the Session.
-
inline popx::Executablex &getExecutable()
Get the executable associated with the Session.
-
inline const popx::Executablex &getExecutable() const
Get the executable associated with the Session.
-
void updateEngineCache()
Update cacheEntries from engine cache directory and update ir::hashMatched_ with the updated cacheEntries.
Set the DeviceInfo of the Session.
2.1.1. Training session
#include <popart/session.hpp>
-
class popart::TrainingSession : public popart::Session
TrainingSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware with training provided by optimizing a loss tensor using an optimizer and automatic differentiation (backpropagation).
Public Functions
-
~TrainingSession() override
Destructor for the TrainingSession class.
-
void updateOptimizerFromHost(const Optimizer *optimizer)
Update the optimizer from the host.
This method updates the optimizer and the associated hyperparameters but not the optimizer state tensors.
NOTE: The optimizer parameter has to be compatible with the optimizer passed to the TrainingSession constructor. For example, you cannot call this function with an SGD1 optimizer if you created the session with an SGD0 optimizer. This is because it is not possible to change the IR after a session has been constructed.
- Parameters
optimizer – A pointer to a popart::Optimizer.
-
void copyFromRemoteBuffer(const std::string &buffer, void *w, int repeat_index, unsigned replication_index = 0)
Copy from a remote buffer into a user buffer.
This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.
- Parameters
buffer – The name of the remote buffer to copy from.
w – Pointer to a user buffer to copy to.
repeat_index – The index in the remote buffer to copy from.
replication_index – The replicated graph index when using replicated graphs. Default=0.
-
void copyToRemoteBuffer(void *w, const std::string &buffer, int repeat_index, unsigned replication_index = 0)
Copy from a user buffer to a remote buffer.
This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.
- Parameters
w – Pointer to a user buffer to copy from.
buffer – The remote buffer to copy to.
repeat_index – The index in the remote buffer to copy to.
replication_index – The replicated graph index when using replicated graphs. Default=0.
Public Static Functions
Create a session for training from an IR.
- Parameters
ir – The IR to create the session from.
deviceInfo – The type of device that this session uses.
name – The name of this training session. Default: “training”.
Create a session for training from an ONNX model.
- Parameters
model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.
dataFlow – Configuration for the data feeds and fetches.
loss – The identifier of the final scalar loss tensor for training.
optimizer – The name of an optimizer to use when training.
deviceInfo – The type of device that this session uses.
inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().
userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().
patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().
name – (Optional) The name of this training session. Default: “training”.
2.1.2. Inference session
#include <popart/session.hpp>
-
class popart::InferenceSession : public popart::Session
InferenceSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware, without any automatic differentiation (backpropagation) or optimization.
Public Functions
-
~InferenceSession() override
Destructor for the InferenceSession class.
-
void popxlSetEngineIsLoaded(bool isLoaded)
Public Static Functions
Create a session for inference from an IR.
- Parameters
ir – The IR to create the session from.
deviceInfo – The type of device that this session uses.
name – The name of this inference session. Default: “inference”.
Create a session for inference from an ONNX model.
- Parameters
model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.
dataFlow – Configuration for the data feeds and fetches.
deviceInfo – The type of device that this session uses.
inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().
userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().
patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().
name – (Optional) The name of this inference session. Default: “inference”.
2.1.3. Session options
#include <popart/sessionoptions.hpp>
-
enum popart::AccumulateOuterFragmentSchedule
Enum type that determines how the operations in the accumulate outer fragment will be scheduled across virtual graphs (only relevant to pipelined modes).
Values:
-
enumerator Scheduler = 0
Don’t add additional constraints and let the scheduler work it out.
-
enumerator Serial
Add constraints that ensure ops are executed in virtual graph ID order.
-
enumerator OverlapCycleOptimized
Try and parallelise ops with different virtual graph IDs as much as possible.
-
enumerator OverlapMemoryOptimized
Try and parallelise ops with different virtual graph IDs but avoid certain steps that are costly in terms of memory usage.
-
enum popart::AutodiffStitchStrategy
Enum type representing a strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph.
Strategies may expose tensors that would otherwise have been internal to the forward graph as outputs of this forward graph.
Values:
-
enumerator RecomputeMinimal = 0
Recompute any backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph.
-
enumerator RecomputeAllNonInputs
Recompute any backward graph inputs associated with non-gradient forward graph tensors that are not inputs in the forward graph.
-
enumerator AddFwdOutputs
For backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph, add them as outputs to the forward graph.
-
enumerator SafeAddFwdOutputs
Like AutodiffStitchStrategy::AddFwdOutputs except that those backward graph inputs that can’t be stitched with AutodiffStitchStrategy::AddFwdOutputs (that is, by adding outputs to the forward graph) are stitched using the AutodiffStitchStrategy::RecomputeMinimal strategy instead.
This means that this is a safe strategy to use as an Autodiff default.
-
enumerator N
The number of AutodiffStitchStrategy values.
-
enum popart::BatchSerializationBatchSchedule
Enum type that describes how to change the batch serialisation subgraph schedule before outlining.
Note
This setting is experimental and may change.
Values:
-
enumerator Scheduler = 0
Don’t encourage any particular scheduling for ops within batch subgraphs (leave it to the scheduler) but tell the scheduler to schedule subgraphs in sequence.
-
enumerator Isomorphic
Encourage all ops within batch subgraphs to be scheduled identically and for each subgraph to be scheduled in sequence (good for outlineability).
-
enumerator OverlapOnIo
Attempt to put the remote load op for batch N+1 right after the compute phase of batch N.
-
enumerator OverlapOnCompute
Attempt to put the remote load op for batch N+1 right before the compute phase of batch N.
-
enumerator N
The number of BatchSerializationBatchSchedule values.
-
enum popart::BatchSerializationMethod
Enum type that describes how to apply the batch serialization.
Note
This setting is experimental and may change.
Values:
-
enumerator UnrollDynamic = 0
Unroll the batch with dynamic slicing.
-
enumerator UnrollStatic
Unroll the batch with static slicing.
-
enumerator Loop
Loop over the batch dimension.
-
enumerator N
The number of BatchSerializationMethod values.
-
enum popart::BatchSerializationTransformContext
Enum type that describes when to apply batch serialization.
Note
This setting is experimental and may change.
Values:
-
enumerator Fwd = 0
Apply batch serialization before growing the backward pass.
-
enumerator Bwd
Apply batch serialization after growing the backward pass.
-
enumerator N
The number of BatchSerializationTransformContext values.
-
enum popart::ExecutionPhaseIOSchedule
Enum type to specify when to load tensors.
Values:
-
enumerator Preload = 0
Preload tensors in previous phase for use in current phase.
-
enumerator OnDemand
Load tensors just before they are required.
-
enumerator N
The number of ExecutionPhaseIOSchedule values.
-
enum popart::ExecutionPhaseSchedule
Enum type to specify the order of processing optimizer operations for different weights of the same execution phase.
The steps for phased execution are:
1. Copy to IO tiles if necessary.
2. Run collective operations if necessary.
3. Load optimizer state.
4. Update optimizer state.
5. Apply optimizer.
6. Store updated tensor if necessary.
Values:
-
enumerator Interleaving = 0
Process above steps for one weight at a time (for example: 123456, 123456, 123456).
The scheduler may interleave these steps.
-
enumerator Batch
Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange (for example: 333, 111, 222, 444, 555, 666).
-
enumerator BatchClusteredIO
Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange, and maximise stream copy merges by keeping RemoteLoad/RemoteStore operations clustered (for example: 333, 111, 222, 444, 555, 666).
-
enumerator N
The number of ExecutionPhaseSchedule values.
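The digit notation used in the Interleaving and Batch examples can be reproduced with a small sketch, reading each digit as the position of a step in the list of phased execution steps. ToySchedule and stepOrder are hypothetical names, not PopART types, and the Batch step order (3, 1, 2, 4, 5, 6) is copied from the example text rather than derived from the scheduler. The real scheduler may interleave steps further.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Local stand-in mirroring two of the documented enumerators.
enum class ToySchedule { Interleaving, Batch };

// Return the groups of step digits for `numWeights` weights.
inline std::vector<std::string> stepOrder(ToySchedule schedule,
                                          std::size_t numWeights) {
  std::vector<std::string> groups;
  if (schedule == ToySchedule::Interleaving) {
    // All six steps for one weight, then all six for the next weight.
    for (std::size_t w = 0; w < numWeights; ++w) {
      groups.push_back("123456");
    }
  } else {
    // One step for all weights at a time; order taken from the Batch example.
    for (char step : std::string("312456")) {
      groups.push_back(std::string(numWeights, step));
    }
  }
  return groups;
}
```

For three weights this yields three groups of "123456" for Interleaving, and the groups "333", "111", "222", "444", "555", "666" for Batch, matching the examples in the enumerator descriptions.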
-
enum popart::GradientTensorTrackingMethod
Enum type to specify the method for selecting gradient tensors whose statistics are to be tracked for the AutomaticLossScale transform.
Values:
-
enumerator AllNonViewChangingGradientTensors = 0
Track all gradients of non-view-changing gradient tensors.
-
enumerator ConvAndMatmulGradients
Track all gradients of inputs to MatMul and Convolution ops.
-
enumerator GradientsOfUserSpecifiedTensors
Track gradients of user-specified tensors.
-
enumerator N
The number of GradientTensorTrackingMethod values.
-
enum popart::Instrumentation
Enum type used to specify an instrumentation type.
Values:
-
enumerator Outer = 0
Outer loop instrumentation, graph over all IPUs.
-
enumerator Inner
Inner loop instrumentation, graph per IPU.
-
enumerator N
The number of Instrumentation values.
-
enum popart::IrSerializationFormat
Enum type used to specify a serialization format.
Values:
-
enumerator JSON
JavaScript Object Notation (JSON).
-
enum popart::MeanReductionStrategy
Enum type that specifies when to divide by a mean reduction factor, when doing mean reduction over a sequence of tensors \(t_1, t_2, ..., t_k\).
Values:
-
enumerator Running = 0
Keep the reduction buffer as the mean of the tensors accumulated so far.
If \(t_1, ..., t_f\) has just been processed, the current accumulator \(s\) is the mean of these values, and the next accumulator update is \(s = \frac{f}{f+1} * s + \frac{1}{f+1} * t_{f+1}\) to keep \(s\) a running mean.
This strategy guarantees \(s \le \max(t_1, ..., t_k)\) throughout the accumulation, therefore it will not overflow, but it is generally slower than MeanReductionStrategy::Post.
-
enumerator Post
Keep the accumulation factor as the running sum, and divide once by \(k\) at the end of the accumulation.
This strategy will generally be faster than MeanReductionStrategy::Running, but is prone to overflow (especially when using fp16).
-
enumerator N
The number of MeanReductionStrategy values.
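The two accumulation rules described for Running and Post can be compared with a short host-side sketch. meanRunning and meanPost are illustrative names and plain float arithmetic is assumed; this is not how PopART implements the reduction on the device.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running: after each update the accumulator s is the mean of the values
// seen so far, using s = f/(f+1) * s + 1/(f+1) * t_{f+1}.
inline float meanRunning(const std::vector<float> &t) {
  assert(!t.empty());
  float s = 0.0f;
  for (std::size_t f = 0; f < t.size(); ++f) {
    const float n = static_cast<float>(f + 1);
    s = (static_cast<float>(f) / n) * s + t[f] / n;
  }
  return s;
}

// Post: keep a running sum and divide once by k at the end.
inline float meanPost(const std::vector<float> &t) {
  assert(!t.empty());
  float sum = 0.0f;
  for (float v : t) {
    sum += v;
  }
  return sum / static_cast<float>(t.size());
}
```

Both functions return the same mean up to rounding. The intermediate sum in meanPost grows with the number of tensors, which is where the overflow risk mentioned for Post comes from, while the accumulator in meanRunning stays within the range of the inputs.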
-
enum popart::MergeVarUpdateType
Enum type used to specify which VarUpdateOp ops to merge.
Values:
-
enumerator None = 0
Do not merge VarUpdateOp ops.
-
enumerator All
Merge all VarUpdateOp ops into as few groups as possible.
This is a good choice when memory is not a constraint.
-
enumerator AutoLoose
Merge into groups while attempting not to increase maximum variable liveness, and also not slice tensor variables so they will need to be processed by different VarUpdateOp ops.
-
enumerator AutoTight
Merge into groups, so that VarUpdateOp ops process tensors of exactly SessionOptions::mergeVarUpdateMemThreshold in size.
-
enumerator N
The number of MergeVarUpdateType values.
-
enum popart::RecomputationType
Enum type to specify which ops to recompute in the backward pass when doing auto-recomputation.
Values:
-
enumerator None = 0
No ops are recomputed (Default).
-
enumerator Standard
Recompute using an algorithm that picks checkpoints to try to minimise maximum liveness.
-
enumerator NormOnly
Only Norm ops (+ non-linearities, if following) are recomputed.
-
enumerator Pipeline
Recompute all forward pipeline stages.
-
enumerator RecomputeAll
Recompute all ops.
-
enumerator N
The number of RecomputationType values.
-
enum popart::SubgraphCopyingStrategy
Enum type that describes how copies for inputs and outputs for subgraphs are lowered.
Currently this only affects subgraphs associated with CallOp ops.
Values:
-
enumerator OnEnterAndExit = 0
Copy all inputs before the start of the subgraph, copy all outputs after all ops in the subgraph.
With this strategy, subgraphs will always map to a single Poplar function.
-
enumerator JustInTime
Copy inputs just before they are consumed and copy outputs as soon as they are produced.
With this strategy, subgraphs may be lowered into multiple Poplar functions.
-
enumerator N
The number of SubgraphCopyingStrategy values.
-
enum popart::SyntheticDataMode
Enum type used to specify the data source for input tensors.
Values:
-
enumerator Off = 0
Use real data.
-
enumerator Zeros
Input tensors are initialised to all zeros.
-
enumerator RandomNormal
Input tensors are initialised with a random normal distribution ~N(0,1).
-
enumerator N
The number of SyntheticDataMode values.
-
enum popart::VirtualGraphMode
Enum type used to specify a virtual graph mode.
Values:
-
enumerator Off = 0
Virtual graphs are not enabled.
-
enumerator Manual
User must set the popart::Op::virtualGraph attribute on all ops.
-
enumerator Auto
Use the AutoVirtualGraph transform.
-
enumerator ExecutionPhases
Virtual graphs are tied to execution phases.
-
enumerator N
The number of VirtualGraphMode values.
-
struct popart::AccumulateOuterFragmentSettings
A structure containing accumulate outer fragment settings.
Public Functions
-
AccumulateOuterFragmentSettings() = default
-
inline AccumulateOuterFragmentSettings(AccumulateOuterFragmentSchedule schedule_, const std::vector<int> &excludedVirtualGraphs_)
Constructor for AccumulateOuterFragmentSettings.
- Parameters
schedule_ – Indicate how to schedule the accumulate outer fragment. This setting is experimental and may change. Default: AccumulateOuterFragmentSchedule::Serial
excludedVirtualGraphs_ – Indicate to explicitly avoid parallelising the virtual graph IDs. This setting is experimental and may change.
Public Members
-
AccumulateOuterFragmentSchedule schedule = AccumulateOuterFragmentSchedule::Serial
Indicate how to schedule the accumulate outer fragment.
Note
This setting is experimental and may change.
-
std::vector<int> excludedVirtualGraphs = {}
Indicate to explicitly avoid parallelising the virtual graph IDs.
Note
This setting is experimental and may change.
-
struct popart::AutodiffSettings
The settings for the Autodiff transform.
Public Functions
-
AutodiffSettings() = default
Default constructor for the AutodiffSettings struct.
-
inline AutodiffSettings(AutodiffStitchStrategy stitchStrategy_)
Constructor for the AutodiffSettings struct.
- Parameters
stitchStrategy_ – The strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph. Default: AutodiffStitchStrategy::RecomputeAllNonInputs.
Public Members
-
AutodiffStitchStrategy stitchStrategy = AutodiffStitchStrategy::RecomputeAllNonInputs
The strategy PopART should use to ensure that all graph inputs of a backward graph are available as either inputs or outputs of the forward graph or gradients of outputs of the forward graph.
Note
This is an experimental option and may change.
-
struct popart::AutomaticLossScalingSettings
A structure containing user configuration for automatic loss scaling settings.
Note
Automatic loss scaling is currently experimental and under active development. Recommendation: Set the loss scale manually.
Public Functions
-
AutomaticLossScalingSettings() = default
Default constructor for AutomaticLossScalingSettings.
-
AutomaticLossScalingSettings(bool enabled_, const nonstd::optional<std::vector<TensorId>> &toTrackTensors_, float binEdgeLocation_, float thresholdUpperCountProportion_, int updatePeriod_, GradientTensorTrackingMethod gradientTensorTrackingMethod_)
Constructor for AutomaticLossScalingSettings.
- Parameters
enabled_ – Indicate whether to keep track (true) or not (false) of the distribution of gradient tensor elements over the floating point range. Default: false.
toTrackTensors_ – An optional list of model tensor names, for which gradient statistics will be collected. If not set, the gradients of all tensors produced by default operations (matmul, conv) will be used.
binEdgeLocation_ – The location of the bin edge as a proportion of the absolute numerical range of the tracked gradient tensor elements, in the range [0, 1]. 0 represents the smallest representable value, and 1 the maximum. This is the single bin edge of the histogram that is an input to the loss scale updater algorithm. Default: 0.125.
thresholdUpperCountProportion_ – The proportion of the elements in the upper bin above which the loss scale is increased, and below which the loss scale is decreased. Should be in the range [0, 1]. Default: 1e-7.
updatePeriod_ – Indicate how often the loss scale update factor should be updated with respect to optimizer steps. Default: 1
gradientTensorTrackingMethod_ – The method for selecting gradient tensors whose statistics are to be tracked. Default: GradientTensorTrackingMethod::AllNonViewChangingGradientTensors.
-
std::size_t hash() const
Public Members
-
bool enabled = false
-
float binEdgeLocation = 0.125f
-
float thresholdUpperCountProportion = 1e-7
-
int updatePeriod = 1
-
GradientTensorTrackingMethod gradientTensorTrackingMethod = GradientTensorTrackingMethod::AllNonViewChangingGradientTensors
-
struct popart::BatchSerializationSettings
A structure containing batch serialization settings.
Public Functions
-
BatchSerializationSettings() = default
Default constructor for BatchSerializationSettings.
-
BatchSerializationSettings(int factor_, bool concatOnVirtualGraphChange_, bool concatOnExecutionPhaseChange_, bool concatOnPipelineStageChange_, BatchSerializationTransformContext transformContext_ = BatchSerializationTransformContext::Fwd, BatchSerializationMethod method_ = BatchSerializationMethod::UnrollDynamic, BatchSerializationBatchSchedule batchSchedule_ = BatchSerializationBatchSchedule::Isomorphic)
Constructor for BatchSerializationSettings.
- Parameters
factor_ – The number of compute batches to split operations into. Default: 0.
concatOnVirtualGraphChange_ – Indicate to break batch serialization chains (true) when the virtual graph changes (by concatenating the compute batches to the local batch). Default: true.
concatOnExecutionPhaseChange_ – Indicate to break batch serialization chains (true) when the execution phase changes (by concatenating the compute batches to the local batch). Default: true.
concatOnPipelineStageChange_ – Indicate to break batch serialization chains (true) when the pipeline stage changes (by concatenating the compute batches to the local batch). Default: true.
transformContext_ – An experimental value to control when batch serialization is applied. Default: BatchSerializationTransformContext::Fwd.
method_ – An experimental value to control how batch serialization is applied. Default: BatchSerializationMethod::UnrollDynamic.
batchSchedule_ – An experimental value that changes how operations are scheduled. Default: BatchSerializationBatchSchedule::Isomorphic.
Public Members
-
int factor = 0
The number of compute batches to split operations into.
-
bool concatOnVirtualGraphChange = true
Break batch serialization chains when the virtual graph changes (by concatenating the compute batches to the local batch).
-
bool concatOnExecutionPhaseChange = true
Break batch serialization chains when the execution phase changes (by concatenating the compute batches to the local batch).
-
bool concatOnPipelineStageChange = true
Break batch serialization chains when the pipeline stage changes (by concatenating the compute batches to the local batch).
-
BatchSerializationTransformContext transformContext = BatchSerializationTransformContext::Fwd
Experimental value to control when batch serialization is applied.
-
BatchSerializationMethod method = BatchSerializationMethod::UnrollDynamic
Experimental value to control how batch serialization is applied.
-
BatchSerializationBatchSchedule batchSchedule = BatchSerializationBatchSchedule::Isomorphic
Experimental value that changes how operations are scheduled.
-
struct popart::ExecutionPhaseSettings
A structure containing ExecutionPhase settings.
Public Functions
-
ExecutionPhaseSettings() = default
Default constructor for ExecutionPhaseSettings.
-
inline ExecutionPhaseSettings(int phases_, bool stages_, ExecutionPhaseIOSchedule weightIOSchedule_, ExecutionPhaseIOSchedule activationIOSchedule_, ExecutionPhaseIOSchedule optimizerStateIOSchedule_, ExecutionPhaseIOSchedule accumulatorIOSchedule_, ExecutionPhaseSchedule schedule_)
Constructor for ExecutionPhaseSettings.
- Parameters
phases_ – The number of execution phases for the whole model. Default=0.
stages_ – The number of overlapping stages:
1: Parallel streaming memory, default for 1 IPU per replica.
2: PingPong between 2 IPUs, default for 2 or more IPUs per replica (Default).
weightIOSchedule_ – The execution phase IO schedule for weight tensors. Default: ExecutionPhaseIOSchedule::Preload.
activationIOSchedule_ – The execution phase IO schedule for activation and gradient tensors. Default: ExecutionPhaseIOSchedule::Preload.
optimizerStateIOSchedule_ – The execution phase IO schedule for optimizer state tensors. Default: ExecutionPhaseIOSchedule::OnDemand.
accumulatorIOSchedule_ – The execution phase IO schedule for accumulator tensors. Default: ExecutionPhaseIOSchedule::Preload.
schedule_ – An experimental value that changes how operations are scheduled. Default: ExecutionPhaseSchedule::Interleaving.
Public Members
-
int phases = 0
Number of ExecutionPhases for the whole model.
-
int stages = 2
Number of overlapping stages.
1: Parallel streaming memory, default for 1 IPU per replica.
2: PingPong between 2 IPUs, default for 2 or more IPUs per replica.
-
ExecutionPhaseIOSchedule weightIOSchedule = ExecutionPhaseIOSchedule::Preload
The execution phase IO schedule for weight tensors.
-
ExecutionPhaseIOSchedule activationIOSchedule = ExecutionPhaseIOSchedule::Preload
The execution phase IO schedule for activation and gradient tensors.
-
ExecutionPhaseIOSchedule optimizerStateIOSchedule = ExecutionPhaseIOSchedule::OnDemand
-
ExecutionPhaseIOSchedule accumulatorIOSchedule = ExecutionPhaseIOSchedule::Preload
-
struct popart::ReplicatedCollectivesSettings
A structure containing settings for replicated collective operations.
Public Functions
-
ReplicatedCollectivesSettings(bool prepareScheduleForMergingCollectives = false, bool mergeAllReduceCollectives = false)
Constructor for the ReplicatedCollectivesSettings struct.
- Parameters
prepareScheduleForMergingCollectives – Insert constraints into the schedule such that collectives which can be merged occur one right after the other. true to insert constraints, false otherwise. Default: false.
mergeAllReduceCollectives – Identify allreduce operations which can be scheduled at the same time, and perform them as one larger operation to better utilize the bandwidth between replicas. true to identify operations, false otherwise. Default: false.
-
std::size_t hash() const
-
struct popart::SessionOptions
A structure containing user configuration options for the Session class.
Public Members
-
std::string logDir
A directory for log traces to be written into.
-
std::set<std::string> dotChecks = {}
When to write .dot files during IR construction.
-
int firstDotOp = 0
The ops written to the .dot file will be a part of the schedule, controlled by firstDotOp and finalDotOp.
In particular, it will be [max(0, firstDotOp), min(N ops in IR, finalDotOp)).
-
int finalDotOp = 10000
See firstDotOp.
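The half-open range selected by firstDotOp and finalDotOp can be written out as a small stand-alone function. This is a sketch only: dotOpRange is a hypothetical name, not a PopART function, and clamping an inverted range to an empty one is an assumption added here for safety.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical helper (not part of PopART): returns the half-open range
// [first, last) of scheduled ops that end up in the .dot file, following
// [max(0, firstDotOp), min(N ops in IR, finalDotOp)).
std::pair<int, int> dotOpRange(int firstDotOp, int finalDotOp, int numOpsInIr) {
  assert(numOpsInIr >= 0);
  const int first = std::max(0, firstDotOp);
  const int last = std::min(numOpsInIr, finalDotOp);
  // Assumption: an inverted range is treated as empty.
  return {first, std::max(first, last)};
}
```

With the defaults (firstDotOp = 0, finalDotOp = 10000) and an IR of 250 ops, the range is [0, 250), so every op is written.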
-
bool dotOpNames = false
Enable inclusion of the op name in the .dot file (the op type is always exported).
Enabled when true. Default: false.
-
bool exportPoplarComputationGraph = false
Enable export of Poplar computational graph.
Enabled when true. Default: false.
-
bool exportPoplarVertexGraph = false
Enable export of Poplar vertex graph.
Enabled when true. Default: false.
-
bool separateCallOpPdfs = true
Enable creation of separate PDFs for each subgraph when generating PDFs of IR graphs.
Enabled when true. Default: true.
-
bool enableOutlining = true
Enable outlining.
This identifies and extracts repeated parts of the computational graph into subgraphs. Enabled when true. Default: true.
-
bool enableOutliningCopyCostPruning = true
Enable inclusion of the cost of copying cached sections in the outlining cost model.
Enabled when true. Default: true.
-
float outlineThreshold = 1.0f
Specify the incremental value that a sub-graph requires, relative to its nested sub-graphs (if any), to be eligible for outlining.
A high threshold results in fewer sub-graphs being outlined; a negative value results in all being outlined. The gross value of a sub-graph is the sum of its constituent ops’ Op::getSubgraphValue() values. To disable outlining, it is better to set enableOutlining to false than to set this value to infinity. The default value of 1.0f results in all high-value operations such as convolution being cached, while standalone low-value operations such as ReLU are not.
Default: 1.0f.
-
float outlineSequenceBreakCost = 10000.0f
Specify the penalty applied to outlining potential sub-graphs if the sub-graph to be created breaks up a sequence of operations that are more efficient (for example for overlapping compute and exchange) when outlined together.
Default: 10000.0f.
-
SubgraphCopyingStrategy subgraphCopyingStrategy = SubgraphCopyingStrategy::OnEnterAndExit
Specify how copies for inputs and outputs for subgraphs are lowered.
Setting this value to SubgraphCopyingStrategy::JustInTime may save memory at the cost of fragmenting subgraphs into multiple Poplar functions. This may be particularly useful when a number of weight updates are outlined in one subgraph, as it may prevent multiple weight tensors from being live at the same time inside the subgraph.
Default: SubgraphCopyingStrategy::OnEnterAndExit.
-
RecomputationType autoRecomputation = RecomputationType::None
Enable recomputation of operations in the graph in the backward pass.
This reduces the memory the model requires at the cost of additional computation cycles.
Default: RecomputationType::None (no recomputation).
-
MergeVarUpdateType mergeVarUpdate = MergeVarUpdateType::None
Enable merging of VarUpdates into groups of VarUpdates, by flattening and concatenating variable tensors and updating tensors.
Default: MergeVarUpdateType::None (no merging).
-
int64_t mergeVarUpdateMemThreshold = 1000000
Specify the memory threshold for VarUpdateOp merging algorithms.
The MergeVarUpdateType::AutoLoose and MergeVarUpdateType::AutoTight VarUpdateOp merging algorithms have a threshold on the total memory of variable tensors to merge for updating. Defined as total memory in bytes.
Default: 1000000.
-
struct popart::TensorLocationSettings
A structure containing user configuration for cache/offloading settings.
Public Functions
-
TensorLocationSettings() = default
Constructor.
-
TensorLocationSettings(TensorLocation location_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)
Constructor.
- Parameters
location_ – The tensor location information.
minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.
minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.
-
TensorLocationSettings(TensorStorage storage_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)
Constructor.
- Parameters
storage_ – The tensor storage information.
minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.
minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.
Public Members
-
TensorLocation location = TensorLocation()
The default tensor location for this tensor type.
-
int minElementsForOffChip = 2
The minimum number of elements below which offloading won’t be considered.
-
int minElementsForReplicatedTensorSharding = 8192
A minimum number of elements below which replicated tensor sharding won’t be considered.
#include <popart/variablesettings.hpp>
-
class popart::VariableSettings
A class to dictate behaviour of variables and reductions of such across multiple graphs.
Public Functions
-
void verify()
Runs test to see if the VariableSettings are invalid, and throws an error if so.
- Returns
the CommGroup sharedVariableDomain of this VariableSettings.
-
inline VariableRetrievalMode getRetrievalMode() const
- Returns
the VariableRetrievalMode retrievalMode of this VariableSettings.
-
VariableSettings()
“Default” constructor, defaults CommGroup to [All, 0] and retrievalMode to OnePerGroup.
-
VariableSettings(CommGroup sharedVariableDomain_, VariableRetrievalMode retrievalMode_)
Entirely custom VariableSettings.
-
unsigned numReplicasReturningVariable(unsigned replicaCount) const
Calculate the number of replicas that will return this variable.
- Parameters
replicaCount – Number of global replicas.
- Returns
Number of variables returned.
-
unsigned groupCount(unsigned replicaCount) const
- Parameters
replicaCount – The replicationFactor of the graph.
- Returns
The number of groups given the replication factor and the VariableSettings.
-
unsigned getRealGroupSize(unsigned replicaCount) const
Because a CommGroup does not have a defined group size if its type is All or None, this function returns a group size that is always accurate, based on the number of replicas.
- Parameters
replicaCount – The replication factor
- Returns
The actual number of replicas in a group
-
unsigned getGroupRepresentative(unsigned group) const
Get the default first member of a group.
- Parameters
group – The group to return the representative for.
- Returns
The representative replica of this group.
-
Shape shapeOnReplica(Shape full_shape, unsigned replicaCount, const TensorId name) const
The shape Onnx reads holds an extra outer dimension in certain cases, where the outer dimension represents the number of returning replica variables.
This function takes an Onnx full shape and removes the outer dimension safely (that is, it checks that the outer dimension matches the expected outer dimension). It is a convenience function to avoid duplicate code.
- Parameters
full_shape – The shape as presented by Onnx.
replicaCount – The local replication factor, used to calculate the return factor.
name – The TensorId of the tensor, used to give good error feedback.
- Returns
The shape of the data on the replica.
-
Shape shapeOnHost(Shape replica_shape, unsigned replicaCount) const
Takes the shape of a tensor on a replica and returns its full ONNX shape.
This is the inverse operation to shapeOnReplica.
- Parameters
replica_shape – The shape of the data on a replica.
replicaCount – The local replication factor, used to calculate the return factor.
- Returns
The shape as presented by Onnx.
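The relationship between shapeOnReplica and shapeOnHost can be sketched without PopART as a pair of free functions. The sketch assumes that the extra outer dimension is present only when more than one replica variable is returned, and takes that expected outer dimension as an explicit, hypothetical parameter (the real functions derive it from the VariableSettings and replicaCount).

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

using Shape = std::vector<std::int64_t>;

// Sketch of shapeOnHost: prepend the outer dimension when more than one
// replica variable is returned. expectedOuterDim is a hypothetical parameter.
Shape shapeOnHost(const Shape &replicaShape, std::int64_t expectedOuterDim) {
  assert(expectedOuterDim >= 1);
  if (expectedOuterDim == 1) {
    return replicaShape;
  }
  Shape full;
  full.reserve(replicaShape.size() + 1);
  full.push_back(expectedOuterDim);
  full.insert(full.end(), replicaShape.begin(), replicaShape.end());
  return full;
}

// Sketch of shapeOnReplica: strip the outer dimension, but only after
// checking that it matches the expected value.
Shape shapeOnReplica(const Shape &fullShape, std::int64_t expectedOuterDim) {
  assert(expectedOuterDim >= 1);
  if (expectedOuterDim == 1) {
    return fullShape;
  }
  if (fullShape.empty() || fullShape.front() != expectedOuterDim) {
    throw std::invalid_argument("unexpected outer dimension");
  }
  return Shape(fullShape.begin() + 1, fullShape.end());
}
```

Under these assumptions the two functions are inverses: a replica shape of {2, 3} with four returned variables becomes {4, 2, 3} on the host and maps back to {2, 3}.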
-
std::vector<std::vector<std::int64_t>> groups(unsigned replicaCount) const
This function returns a set of vectors where each vector contains the replica IDs of all replicas that share one sharedVariableDomain, given the VariableSettings and the replicaCount.
- Parameters
replicaCount – The local replication factor
- Returns
A set of sets, such that set.at(a).at(b) is member number b of group a, set.size() is the number of groups, and set.at(a).size() is the size of group a.
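To make the grouping concrete, the following stand-alone sketch enumerates replica groups for each group type. It is not the PopART implementation: GroupType and replicaGroups are hypothetical names, and the layout assumed for Orthogonal groups (members strided by the number of groups) is an assumption that should be checked against the library.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the group type; not the PopART enum.
enum class GroupType { All, Consecutive, Orthogonal, None };

using Groups = std::vector<std::vector<std::int64_t>>;

// Sketch: list the replica IDs in each group.
Groups replicaGroups(GroupType type, unsigned groupSize, unsigned replicaCount) {
  assert(replicaCount > 0);
  // Resolve a group size that is meaningful for every type
  // (compare getRealGroupSize): All is one big group, None is groups of one.
  unsigned realSize = groupSize;
  if (type == GroupType::All) {
    realSize = replicaCount;
  } else if (type == GroupType::None) {
    realSize = 1;
  }
  if (realSize == 0 || replicaCount % realSize != 0) {
    throw std::invalid_argument("group size must divide replica count");
  }
  const unsigned numGroups = replicaCount / realSize;
  Groups groups(numGroups);
  for (unsigned g = 0; g < numGroups; ++g) {
    for (unsigned m = 0; m < realSize; ++m) {
      // Assumption: orthogonal groups stride across the replica ordering,
      // consecutive groups take adjacent replicas.
      const unsigned replica = (type == GroupType::Orthogonal)
                                   ? g + m * numGroups
                                   : g * realSize + m;
      groups[g].push_back(replica);
    }
  }
  return groups;
}
```

For four replicas and a group size of two, this yields {0, 1} and {2, 3} for Consecutive, and {0, 2} and {1, 3} for Orthogonal.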
-
bool operator==(VariableSettings other)
Compare two variable-settings.
- Parameters
other – VariableSettings to compare these settings to.
- Returns
True if all internal elements are the same
-
bool operator!=(VariableSettings other)
Compare two variable-settings.
- Parameters
other – VariableSettings to compare these settings to.
- Returns
False if all internal elements are the same
#include <popart/commgroup.hpp>
-
class popart::CommGroup
Class to specify sub-groups of replicas.
Examples of derived sub-groups:
IPU-link domain sub-rack:
type == Consecutive && replicaGroupSize == 64/replica-size/N
where N is a power of two and replicaGroupSize > 1.
Complete IPU-link domain / full rack:
type == Consecutive && replicaGroupSize == 64/replica-size
Using GW-links only:
type == Orthogonal && replicaGroupSize == 64/replica-size
Public Functions
-
CommGroup()
Default CommGroup constructor.
Sets type to CommGroupType::All and replicaGroupSize to 0.
-
inline CommGroup(CommGroupType type, unsigned groupSize)
Construct CommGroup.
- Parameters
type – The replica group type.
groupSize – The replica group size.
Public Members
-
CommGroupType type = CommGroupType::All
Replica group type.
-
unsigned replicaGroupSize = 0
Replica group size.
2.2. Data input and output (IStepIO)
#include <popart/istepio.hpp>
-
class popart::IStepIO
An abstract base class through which input and output data is passed to a Session (see Session::run).
Data is passed via buffers. In the case of buffers returned by IStepIO::in, PopART reads from these buffers. In the case of IStepIO::out, PopART writes to these buffers. The IStepIO::inComplete() and IStepIO::outComplete() functions are called by PopART to signal it is done with an input or output buffer.
An IStepIO implementation should conceptually implement a rolling queue of active buffers for each input and output tensor. Every successful call to IStepIO::in should yield a new data buffer for PopART to read from and add it to the head of the conceptual queue. Conversely, every call to IStepIO::inComplete() should be taken to mean that the buffer at the tail-end of the queue is no longer being used by PopART. This buffer is removed from the conceptual queue.
Note that an IStepIO::in call with the prefetch flag set is only considered successful when it returns data.
Output works analogously to input.
The expected total number of input (or output) buffers that are ‘completed’ for a tensor in one Session::run call is bps × SessionOptions::accumulationFactor × SessionOptions::replicatedGraphCount, where bps is the number of batches per call to Session::run (this is a value captured by the DataFlow instance passed to the Session instance).
Note, however, that there may be additional ‘incomplete’ calls to IStepIO::in and IStepIO::out.
Furthermore, the number of input (or output) buffers that may be ‘incomplete’ at a given time for a given tensor should not normally be more than SessionOptions::bufferingDepth × SessionOptions::replicatedGraphCount, but this bound is not guaranteed.
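The two buffer counts described above are simple products and can be written as a stand-alone sketch. The function names are hypothetical and do not exist in PopART; the second value is only the usual bound, which the text notes is not guaranteed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: expected number of 'completed' buffers for one tensor
// in one Session::run call
// (bps x accumulationFactor x replicatedGraphCount).
std::int64_t expectedCompletedBuffers(std::int64_t batchesPerStep,
                                      std::int64_t accumulationFactor,
                                      std::int64_t replicatedGraphCount) {
  assert(batchesPerStep >= 0 && accumulationFactor >= 0 &&
         replicatedGraphCount >= 0);
  return batchesPerStep * accumulationFactor * replicatedGraphCount;
}

// Hypothetical helper: the usual (not guaranteed) upper bound on buffers that
// are 'incomplete' at one time (bufferingDepth x replicatedGraphCount).
std::int64_t usualMaxIncompleteBuffers(std::int64_t bufferingDepth,
                                       std::int64_t replicatedGraphCount) {
  assert(bufferingDepth >= 0 && replicatedGraphCount >= 0);
  return bufferingDepth * replicatedGraphCount;
}
```

For instance, three batches per step, an accumulation factor of two and a single replica give six completed buffers per call; with a buffering depth of three, at most three of them would normally be outstanding at once.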
EXAMPLE: Suppose a session is configured such that the total expected number of input buffers is 6 and these are input buffers for a tensor with ID t with 100 elements. The associated input calls in IStepIO may look like this if SessionOptions::bufferingDepth is 3:
in("t", 100, false) -> Give buffer[0] to PopART.
in("t", 100, true) -> Give buffer[1] to PopART.
in("t", 100, true) -> Give buffer[2] to PopART.
inComplete("t", 100) -> buffer[0] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[3] to PopART.
inComplete("t", 100) -> buffer[1] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[4] to PopART.
inComplete("t", 100) -> buffer[2] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[5] to PopART.
inComplete("t", 100) -> buffer[3] is no longer required and can be reused.
in("t", 100, true) -> No data available, return nullptr.
inComplete("t", 100) -> buffer[4] is no longer required and can be reused.
inComplete("t", 100) -> buffer[5] is no longer required and can be reused.
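The rolling queue of active buffers can be modelled with a small stand-alone class. This is a sketch of the bookkeeping for a single input tensor only: RollingInputQueue is a hypothetical name, it does not derive from IStepIO, and it returns raw pointers rather than ConstVoidData.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for the input side of an IStepIO implementation,
// tracking one tensor. Not part of PopART.
class RollingInputQueue {
public:
  explicit RollingInputQueue(std::vector<std::vector<float>> buffers)
      : buffers_(std::move(buffers)) {}

  // Mirrors in(): hand out the next buffer and add it to the head of the
  // queue. When no data is left, return nullptr if prefetch is set and
  // treat it as an error otherwise.
  const float *in(bool prefetch) {
    if (next_ == buffers_.size()) {
      if (!prefetch) {
        throw std::runtime_error("no input data available");
      }
      return nullptr;
    }
    active_.push_back(next_);
    return buffers_[next_++].data();
  }

  // Mirrors inComplete(): the buffer at the tail of the queue is no longer
  // in use. Returns its index so the caller can reuse it.
  std::size_t inComplete() {
    assert(!active_.empty());
    const std::size_t released = active_.front();
    active_.pop_front();
    return released;
  }

  // Number of buffers handed out but not yet completed.
  std::size_t numActive() const { return active_.size(); }

private:
  std::vector<std::vector<float>> buffers_;
  std::deque<std::size_t> active_; // front = tail (oldest), back = head
  std::size_t next_ = 0;
};
```

Replaying the call sequence from the example against six buffers of 100 elements releases the buffers in order 0 to 5, and the seventh in call with prefetch set returns nullptr.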
Subclassed by popart::StepIOCallback, popart::StepIOGeneric< ARRAY_TYPE, ACCESSOR_TYPE, ArrayInfoT >, popart::StepIOGeneric< IArray, StepIONS::IArrayAccessor, IArray &>
Public Functions
-
virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch) = 0
Request a new input data buffer.
The memory in this buffer is available for use in PopART until the corresponding inComplete() call.
Note
Failing to provide a valid data buffer will result in a runtime failure if prefetch is set to false.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
prefetch – If set to true, the inability to provide data is not considered an error. If false, it is considered an error if no data can be provided.
- Returns
The input buffer for this tensor (or nullptr on failure) returned as a ConstVoidData object.
-
virtual void inComplete(TensorId id, int64_t numElements) = 0
Notify the user (running a PopART program) that a previously retrieved input data buffer is no longer used by PopART.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
-
virtual MutableVoidData out(TensorId id, int64_t numElements) = 0
Request a new output data buffer.
The memory in this buffer is available for use in PopART until the corresponding outComplete() call and will be modified in-place.
Note
Failing to provide a valid data buffer will result in a runtime failure.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
- Returns
The output buffer for this tensor returned as a MutableVoidData object.
-
inline virtual void outComplete(TensorId)
Notify the user (running a PopART program) that a previously retrieved output data buffer is no longer used by PopART.
- Parameters
id – The ID of the tensor to return data for.
-
inline void enableRuntimeAsserts(bool b)
Enable or disable runtime asserts.
If runtime asserts are enabled, then a check that the input and output buffers have the correct number of elements is performed. As Session::run() is called multiple times during a user’s session, the check is only performed in the first call to Session::run(), under the assumption that the user is unlikely to change the size of buffers between runs.
- Parameters
b – The setting to enable runtime asserts (true) or disable runtime asserts (false).
-
inline bool runtimeAssertsEnabled() const
Check if runtime asserts are enabled.
- Returns
true if runtime asserts are enabled, otherwise false.
-
virtual void assertNumElements(const popx::Executablex&) const = 0
Check number of elements.
This check is performed when runtimeAssertsEnabled() is true.
- Parameters
Executablex – The executable against which the number of elements in the input and output buffers is checked.
#include <popart/stepio.hpp>
-
class popart::StepIO : public popart::StepIOGeneric<IArray, StepIONS::IArrayAccessor, IArray&>
Class to provide a Session object with input and output data.
-
class popart::StepIOCallback : public popart::IStepIO
Class that implements the IStepIO interface using user-provided callback functions.
The IStepIO interface contains a number of pure virtual member functions through which PopART receives buffers to read data from and buffers to write data to. StepIOCallback inherits from IStepIO and implements those member functions by delegating the logic to the callback functions passed in the constructor. This gives the user full control as to how data buffers are provisioned.
See IStepIO for more details on the expected behaviour of the callbacks.
Public Types
-
using InputCallback = std::function<ConstVoidData(TensorId, bool)>
Callable object that implements IStepIO::in().
-
using InputCompleteCallback = std::function<void(TensorId)>
Callable object that implements IStepIO::inComplete().
-
using OutputCallback = std::function<MutableVoidData(TensorId)>
Callable object that implements IStepIO::out().
-
using OutputCompleteCallback = std::function<void(TensorId)>
Callable object that implements IStepIO::outComplete().
Public Functions
-
inline StepIOCallback(InputCallback inputCallback, InputCompleteCallback inputCompleteCallback, OutputCallback outputCallback, OutputCompleteCallback outputCompleteCallback)
Construct a StepIOCallback object.
- Parameters
inputCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::in() is called. See IStepIO for details on how to implement this method.
inputCompleteCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::inComplete() is called. See IStepIO for details on how to implement this method.
outputCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::out() is called. See IStepIO for details on how to implement this method.
outputCompleteCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::outComplete() is called. See IStepIO for details on how to implement this method.
-
inline virtual void assertNumElements(const popx::Executablex&) const
Check number of elements.
This check is performed when IStepIO::runtimeAssertsEnabled() is true.
- Parameters
Executablex – The executable against which the number of elements in the input and output buffers is checked.
-
virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCallback parameter passed to the constructor.
This function should not be called directly.
-
virtual void inComplete(TensorId id, int64_t numElements) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCompleteCallback parameter passed to the constructor.
This function should not be called directly.
-
virtual MutableVoidData out(TensorId id, int64_t numElements) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCallback parameter passed to the constructor.
This function should not be called directly.
-
virtual void outComplete(TensorId id) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCompleteCallback parameter passed to the constructor.
This function should not be called directly.
-
class popart::IWeightsIO
A virtual class for accessing pointers to the data required to perform a training step.
Subclassed by popart::WeightsIO
Public Functions
-
virtual ~IWeightsIO() = default
Destructor for IWeightsIO.
-
virtual bool contains(TensorId) const = 0
Check if the WeightsIO instance contains the weights for a specific tensor.
- Parameters
TensorId – The ID of the tensor to look for weights for.
- Returns
true if the WeightsIO instance contains weights for the tensor, false otherwise.
-
virtual MutableVoidData weight(TensorId) const = 0
Retrieve weights for a specific tensor.
- Parameters
TensorId – The ID of the tensor to retrieve weights for.
- Returns
The weights.
-
class popart::WeightsIO : public popart::IWeightsIO
Class representing weights.
Public Functions
-
virtual bool contains(TensorId) const final
Check if the WeightsIO instance contains the weights for a specific tensor.
- Parameters
TensorId – The ID of the tensor to look for weights for.
- Returns
true if the WeightsIO instance contains weights for the tensor, false otherwise.
-
virtual MutableVoidData weight(TensorId) const final
Retrieve weights for a specific tensor from the WeightsIO object.
- Parameters
TensorId – The ID of the tensor to retrieve weights for.
- Returns
The weights.
-
void insert(TensorId, MutableVoidData)
Insert weights for a specific tensor into the WeightsIO object.
- Parameters
TensorId – The ID of the tensor to insert weights for.
MutableVoidData – The weights to insert.
#include <popart/stepio_generic.hpp>
-
template<typename ARRAY_TYPE, typename ACCESSOR_TYPE, typename ArrayInfoT>
class popart::StepIOGeneric : public popart::IStepIO
Subclassed by popart::StepIO
Public Functions
-
inline void assertNumElements(const popx::Executablex &exe) const final
-
inline TensorInfo getTensorInfo(ARRAY_TYPE &array) const
-
template<typename T>
inline T get(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, bool advance_, std::string mapName)
-
template<typename T>
inline void advance(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, std::string mapName)
-
inline ConstVoidData in(TensorId id, int64_t numElements, bool) final
-
inline MutableVoidData out(TensorId id, int64_t numElements) final
#include <popart/iarray.hpp>
2.3. Tensors
#include <popart/tensor.hpp>
-
class popart::Tensor : public popart::Vertex
Public Functions
-
Tensor(TensorId, TensorType, Graph&, const DebugContext& = {})
-
Tensor(TensorId, VariableSettings, Graph&, const DebugContext& = {})
-
Tensor(TensorId, TensorType, VariableSettings, Graph&, const DebugContext& = {})
-
inline std::string str() const final
-
TensorType tensorType() const
-
std::string tensor_type() const
-
void setTensorType(TensorType)
-
inline ReplicatedStreamMode getReplicatedStreamMode() const
-
inline void setReplicatedStreamMode(const ReplicatedStreamMode &mode)
-
void setTensorLocationInfo(TensorLocation&, std::pair<RemoteBufferId, RemoteBufferIndex> &remoteBufferInfo)
-
std::set<PipelineStage> getPipelineStages() const
-
bool hasProducer() const
-
bool isGraphInput() const
-
bool isGraphOutput() const
-
bool isLoopInput() const
-
bool isImplicitLoopInput() const
-
bool isExplicitLoopInput() const
-
bool isLoopTripCounter() const
-
bool isUnmodifiable() const
-
bool isCheckpointTensor() const
-
bool isImplicitRecomputeTensor() const
-
bool isRestoreInplaceTensor() const
-
bool idIncludesPrefix(const std::vector<std::string>&) const
-
bool isOptimizerTensor() const
-
bool isRemoteArgTensor() const
-
bool isRandomSeedTensor() const
-
bool isOptimizerStateTensor() const
-
bool isAccumulatorTensor() const
-
bool isHostLoadTensor() const
Is this tensor produced by a HostLoad Op or MultiExchangeOp with HostLoad descriptor?
- Returns
true if the producer is a HostLoad Op or a MultiExchangeOp with a HostLoad descriptor, false otherwise.
-
bool isWeightTensor() const
-
bool isAnchored() const
-
bool isRootAnchor() const
-
bool hasTensorData() const
-
TensorData *tensorData()
-
const TensorData *tensorData() const
-
bool hasVirtualGraphId() const
-
VGraphIdAndTileSet getVirtualGraphIdAndTileSet(std::set<OpId> &visited) const
-
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe() const
-
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe(std::set<OpId> &visited) const
-
int getBatchAxis() const
-
bool consumersAllPreLoss() const
-
bool isModified(bool considerLoopInput = true) const
Check if any of the consumers modify this tensor.
- Parameters
considerLoopInput – If explicit loop inputs should be considered as being modified. If false, only operations modifying the tensor inplace will be considered.
- Returns
True if the tensor is modified, otherwise false.
-
bool isAliased() const
Check if any of the consumers alias this tensor.
- Returns
True if the tensor is aliased to any output, otherwise false.
-
std::set<Op*, POpCmp> getInplaceModifiers() const
Find operations that modify a tensor.
- Returns
All operations that (directly or indirectly) modify this tensor.
-
std::vector<char> getDataViaGraphTraversal() const
-
inline void setVariableUpdateType(VariableUpdateType type)
Set the variable update type. This member originates from the former VariableTensor subclass.
-
inline VariableUpdateType getVariableUpdateType() const
-
inline VariableSettings getVariableSettings() const
- Returns
The VariableSettings of this Variable
-
std::vector<int64_t> returnedShape(unsigned replicationFactor)
Returns the shape necessitated by IO.
- Parameters
replicationFactor – The replication factor
- Returns
the shape of the tensor, considering replica groups
-
void verifyMutableVoidInfo(const TensorInfo mutableVoidInfo, unsigned replicationFactor)
Check that the info of a mutableVoidData object matches the expectations set by the TensorInfo and VariableSettings.
Throws an error if there is a mismatch.
- Parameters
mutableVoidInfo – The data of the MutableVoidInfo with the same id as this tensor
replicationFactor – The replicationFactor of this instance
Public Members
-
Consumers consumers
-
TensorInfo info
-
TensorLocationInfo tensorLocationInfo
-
InputSettings inputSettings
-
enum popart::TensorType
Values:
-
enumerator ActGrad = 0
-
enumerator Const
-
enumerator Stream
-
enumerator Unknown
-
enumerator Variable
-
enumerator N
#include <popart/tensorinfo.hpp>
-
enum popart::DataType
There is a one-to-one correspondence between popart::DataTypes and ONNX_NAMESPACE::TensorProto_DataTypes, which is equivalent to decltype(ONNX_NAMESPACE::TensorProto().data_type()).
Values:
-
enumerator UINT8 = 0
-
enumerator INT8
-
enumerator UINT16
-
enumerator INT16
-
enumerator INT32
-
enumerator INT64
-
enumerator UINT32
-
enumerator UINT64
-
enumerator BOOL
-
enumerator FLOAT
-
enumerator FLOAT16
-
enumerator BFLOAT16
-
enumerator DOUBLE
-
enumerator COMPLEX64
-
enumerator COMPLEX128
-
enumerator STRING
-
enumerator UNDEFINED
-
class popart::TensorInfo
Public Functions
-
TensorInfo(DataType, const Shape&)
Create TensorInformation based on data type and shape.
- Parameters
data_type – The data type.
shape – The actual shape of the tensor.
-
TensorInfo(DataType data_type, const Shape &shape, const Shape &meta_shape)
Create TensorInformation based on data type, shape and meta shape.
- Parameters
data_type – The data type.
shape – The actual shape of the tensor.
meta_shape – The meta shape of the tensor, which can, for example, be used to store the original tensor shape before replicated tensor sharding was applied.
-
TensorInfo(std::string data_type, std::string shape)
-
explicit TensorInfo(const ONNX_NAMESPACE::TensorProto&)
-
explicit TensorInfo(const ONNX_NAMESPACE::TypeProto&)
-
void set(const ONNX_NAMESPACE::TensorProto&)
-
void set(const ONNX_NAMESPACE::TypeProto&)
-
TensorInfo() = default
-
std::vector<size_t> shape_szt() const
-
inline int64_t nelms() const
-
int64_t nbytes() const
-
inline int64_t dim(int i) const
-
inline std::vector<int> strides(const std::vector<long> &shape)
Get the strides of the tensor, that is, the number of bytes to step in each dimension when traversing an array in memory.
See https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html
- Parameters
shape – The on-host ONNX shape of a tensor. This is different from this->shape(), which gives the on-replica shape of a tensor
- Returns
std::vector<int> The strides vector.
-
const std::string &data_type() const
-
const std::string &data_type_lcase() const
-
void append(std::ostream&) const
-
bool isSet() const
-
bool operator==(const TensorInfo&) const
-
bool operator!=(const TensorInfo&) const
-
ONNX_NAMESPACE::TypeProto getOnnxTypeProto() const
-
const DataTypeInfo *getDataTypeInfo() const
Public Static Functions
-
static std::string npOutDataTypeExceptionMessage(const TensorInfo &i0, const TensorInfo &i1, const std::string &debugName)
#include <popart/tensorindex.hpp>
-
class popart::TensorIndexMap
Public Functions
-
TensorIndexMap() = default
-
~TensorIndexMap()
-
void erase(int)
-
void clear()
-
bool hasIndex(int) const
-
const std::map<Tensor*, std::vector<int>, PTensorCmp> &indicesMap() const
-
int n() const
-
void append(std::stringstream&, std::string prefix, int max_id_length) const
-
void setInfoIfIndex(const TensorInfo&, int index)
-
int maxIdLength() const
-
int minIndex() const
-
int maxIndex() const
#include <popart/tensorlocation.hpp>
-
enum popart::ReplicatedTensorSharding
Enum type to specify whether to shard tensors over replicas.
Values:
-
enumerator Off = 0
Don’t shard tensors over replicas.
-
enumerator On = 1
Do shard tensors over replicas.
-
enumerator N = 2
Number of values.
-
class popart::TensorLocation
Class that describes the memory characteristics of one or multiple tensors.
See also: SessionOptions.
Public Functions
-
TensorLocation()
Equivalent to calling TensorLocation(TensorStorage::Undefined, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)
-
TensorLocation(TensorStorage storage)
Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)
-
TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding)
Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding)
-
TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)
Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding, shardingDomain)
-
TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding)
Construct a TensorLocation from parameters.
- Parameters
storage – The memory location of the tensor(s).
loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.
storageTileSet – The tiles on which the tensor(s) are stored.
replicatedTensorSharding – Whether to apply replicated tensor sharding.
-
TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)
Construct a TensorLocation from parameters.
- Parameters
storage – The memory location of the tensor(s).
loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.
storageTileSet – The tiles on which the tensor(s) are stored.
replicatedTensorSharding – Whether to apply replicated tensor sharding.
shardingDomain – GCL communication group across which to shard the tensor. Perpendicular replicas will not shard, and reduce gradients normally (via AllReduce). Defaults to sharding across all replicas.
-
TensorLocation(std::vector<int64_t> serialized)
-
bool operator==(const TensorLocation &rhs) const
-
bool operator!=(const TensorLocation &rhs) const
-
std::vector<int64_t> serialize() const
-
bool isRemote() const
Public Members
-
TensorStorage storage
The memory location of the tensor(s).
-
ReplicatedTensorSharding replicatedTensorSharding
Whether to apply replicated tensor sharding (RTS) or not.
2.4. Optimizers
#include <popart/optimizer.hpp>
-
class popart::Optimizer
Interface for describing an Optimizer and, internally, how to grow the optimiser step for each weight.
This is the end-user-facing interface, constructed by the user to describe what kind of optimiser to use.
It is then also used internally by the Ir to grow the optimiser step for each weight.
Stores OptimizerValues for optimizer parameters like learning rate, loss scaling, etc.
See also
OptimizerValue.
Optimizer stores the values for each weight - they can have different values. There is a “default” for all weights, then you can specify specific values for specific weights. This is encapsulated by an OptimizerValueMap, which is a sparse map from weight to value, with unspecified values implying the default.
See also
OptimizerValueMap.
At runtime, the user can dynamically update the Optimizer, for example by setting new OptimizerValues. validReplacement determines whether the new Optimizer is interchangeable with the one the Ir was built for. For example, trying to replace an SGD Optimizer with an Adam Optimizer would throw.
Subclassed by popart::Adam, popart::Adaptive, popart::SGD
Public Functions
-
virtual ~Optimizer() = default
The Optimizer class has a two-part initialisation: the constructor, used by the end-user, and setFactorsFromOptions, called by the Ir to finish initialisation once all the relevant information is available during Ir preparation.
Some key methods used by the Ir to grow the optimiser step for each weight are createOp, getInputIds, optimizerInputs.
If the OptimizerValue is const, no Ir tensor for that value is created and the VarUpdateOp created for that weight will not have the optional input for that tensor. The Opx of the VarUpdateOp will emit poplar code that uses the provided value directly.
If the OptimizerValue is not const, an Ir tensor for that value is created and the VarUpdateOp created for that weight will have the optional input for that tensor. The tensor will be a stream tensor, so that it can be updated later from host. The tensor will be streamed an initial value of the OptimizerValue’s value.
It is common for Optimizer implementations to make use of “compound scalars”. Take for example the SGD0 weight update equation:
w <- w * (1 - lr * (1 - dm) * wd) - g * (lr * (1 - dm) / ls)
w is the weights and g is the grads. lr, dm, wd, ls are all the “atomic scalars”. These are the scalars/hyperparameters of the Optimizer that the user can set using OptimizerValues, as described above.
Multiple atomic scalars appear in expressions together, and will be operated on together before being used by an Op that also consumes a tensor (in this case the weights or grads). For SGD0, they can be grouped as follows:
w <- w * {1 - lr * (1 - dm) * wd} - g * {lr * (1 - dm) / ls}
where the first braced term is the weight decay scale factor 0 (wdsf0) and the second braced term is the scaled learning rate 0 (slr0).
We call wdsf0 and slr0 the “compound scalars”.
We can statically precompute the OptimizerValues for these compound scalars using the OptimizerValues of the atomic scalars. This makes the Ir simpler, as we now have only:
w <- w * wdsf0 - g * slr0
The CompoundScalarHelpers are used to precompute the compound scalar values.
If any of the constituent atomic scalars is non-const, the compound scalar is non-const.
See also
compoundscalarhelper.hpp
-
Optimizer(OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings, const DebugContext &debugContext)
-
virtual OptimizerType type() const = 0
-
virtual std::string type_s() const = 0
-
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const = 0
Returns the TensorIds of the input tensors to the VarUpdateOp this optimiser will create for the given weight.
Specifically, the TensorId at index i will be the id of the input tensor at InIndex i of the VarUpdateOp. If the input is an OptimizerValue and it is const, then “” will be returned; otherwise the relevant reserved prefix for that OptimizerValue will be used, followed by the weight id. The prefixes are defined in tensornames.hpp, for example reservedDefaultWeightDecayScaleFactor0Prefix or reservedSpecificScaledLearningRate1Prefix (note there are different prefixes depending on whether the weight has a specific or default value for that OptimizerValue).
-
virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const = 0
-
inline const OptimizerValue &lossScaling() const
-
inline float getLossScalingVal() const
-
float getFinalLossScalingVal() const
-
virtual void setFactorsFromOptions(const SessionOptions&)
-
bool gradientAccumulationEnabled() const
-
bool meanReductionEnabled() const
-
bool postMeanAccumulationEnabled() const
-
bool postMeanReplicationEnabled() const
-
int64_t getReplicatedGraphCount() const
-
int64_t getAccumulationFactor() const
-
bool meanGradientAccumulationEnabled() const
-
inline const std::vector<ClipNormSettings> &getClipNormSettings() const
-
virtual bool hasSpecific() const = 0
-
virtual size_t hash() const
-
inline DebugContext getDebugContext() const
-
enum popart::OptimizerType
Types of optimizers.
Values:
-
enumerator SGD = 0
-
enumerator Adam
-
enumerator Adaptive
-
enumerator NTYPES
-
enum popart::OptimizerReductionType
Reduction mode when doing data-parallel training over replicated graphs.
Depending on the optimizer used and its configuration, this option describes how the reduction of gradients over replicas will occur. For example, directly on the gradient, on the gradient accumulator, or on the momentum. See the documentation of individual optimizers for more information.
Values:
-
enumerator None = 0
No replicated graph reduction.
-
enumerator GradReduce
Gradient reduction (every iteration, after a weight’s gradient is produced)
-
enumerator AcclReduce
Momentum reduction (SGD1, after the gradient accumulation loop, if applicable)
-
enumerator AccumReduce
Accumulator reduction (Adam/SGD2 + gradient accumulation, after the gradient accumulation loop)
#include <popart/optimizervalue.hpp>
-
class popart::OptimizerValue
A class used to represent values of hyper parameters.
Public Functions
-
OptimizerValue() = default
Equivalent to OptimizerValue(0, false).
-
inline OptimizerValue(float v)
Equivalent to OptimizerValue(v, true).
-
inline OptimizerValue(float v, bool c)
Constructor.
- Parameters
v – The current value of the hyper parameter.
c – A boolean flag to indicate whether the parameter will remain at this value forever (true) or may change over time (false).
-
inline OptimizerValue(std::pair<float, bool> x)
-
inline float val() const
-
inline bool isConst() const
-
void validReplacement(const OptimizerValue &rhs) const
-
bool operator==(const OptimizerValue &rhs) const
#include <popart/optimizervaluemap.hpp>
-
class popart::OptimizerValueMap
Public Functions
-
inline OptimizerValueMap(OptimizerValue g)
-
OptimizerValue get(const TensorId &id) const
-
void insertSpecific(const TensorId&, OptimizerValue)
-
inline bool hasSpecific() const
-
inline OptimizerValue getDefault() const
-
void validReplacement(const OptimizerValueMap &rhs) const
-
inline const std::map<TensorId, OptimizerValue> &getSpecifics() const
2.4.1. Stochastic Gradient Descent (SGD)
#include <popart/clipnormsettings.hpp>
-
class popart::ClipNormSettings
A data structure used to represent a maximum value constraint on one or more weights.
This is passed to the optimizer on construction.
Public Functions
-
ClipNormSettings(const std::vector<TensorId> &weightIds_, float maxNorm_)
DEPRECATED: This will be removed in a future release.
Constructor.
- Parameters
weightIds_ – The weight tensor IDs that this constraint applies to.
maxNorm_ – The maximum permissible value.
-
float getMaxNorm() const
-
bool operator==(const ClipNormSettings&) const
-
bool operator!=(const ClipNormSettings &other) const
Public Static Functions
-
static ClipNormSettings clipWeights(const std::vector<TensorId> &weightIds_, float maxNorm_)
-
static ClipNormSettings clipAllWeights(float maxNorm_)
#include <popart/sgd.hpp>
-
class popart::SGD : public popart::Optimizer
Stochastic Gradient Descent (SGD) optimizer.
Akin to any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.
The SGD optimizer has the following state for each weight:
velocity ( \(v\))
The SGD optimizer has the following hyper parameters:
learning rate ( \(\text{lr}\))
momentum ( \(\text{mm}\))
weight decay ( \(\text{wd}\))
dampening ( \(\text{dm}\))
velocity scaling ( \(\text{vs}\))
loss scaling ( \(\text{ls}\))
clip norm settings
The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see SGD::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.
In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.
When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first updates the optimizer state as follows:
\[ v' := v * \text{mm} + (1 - \text{dm}) * (g + \text{wd} * w) \text{ \ . } \]
Following the update of the optimizer state the optimizer uses said state to update the weight:
\[ w' := w - \text{lr} * v' \text{ \ . } \]
In addition to the above, the velocity scaling hyper parameter is a scaling factor that can provide improved numerical stability by ensuring the values stored in the optimizer state, \(v\), are scaled by this value. When using this parameter PopART will automatically deal with the artificially scaled velocity value during the weight update, and other hyper parameters do not need to be adjusted.
In addition, the loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass and, at the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight with the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability in some cases.
Finally, it is possible to add clip norm settings for this optimizer. These clip norms compute the L2 norm for a group of weights and add a scalar term to the weight update that effectively divides it by the norm (or a constant value that is provided as part of the clip norm, whichever is greater).
See the SGD notes in optimizer.hpp for a more detailed and comprehensive derivation of the SGD optimizer step in PopART.
Subclassed by popart::ConstSGD
Public Functions
-
SGD(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultMomentum, OptimizerValue defaultDampening, OptimizerValue defaultVelocityScaling, OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})
Constructor.
See also
SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.
- Parameters
defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultDampening – The dampening value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultVelocityScaling – The velocity scaling value to use for weights for which no weight-specific hyper parameter has been inserted.
lossScaling – The loss scaling value to use.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.
accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
debugContext – Optional debug context.
-
SGD(const std::map<std::string, std::pair<float, bool>> &params, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})
Constructor.
EXAMPLE:
SGD({{"defaultLearningRate", {0.02, false}}, {"defaultMomentum", {0.6, true}}});
This will create an SGD Optimizer which has a constant momentum of 0.6 and a changeable learning rate initially of 0.02. All OptimizerValues not present in the map will take values from the getUnset* functions.
See also
SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.
- Parameters
params – A parameter map where the keys are one or more of "defaultLearningRate", "defaultWeightDecay", "defaultMomentum", "defaultDampening", "defaultVelocityScaling" or "lossScaling". The map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter because default values will be used where parameters are missing.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.
accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
debugContext – Optional debug context.
-
inline SGD()
Default constructor. Creates an SGD optimizer with default scalars (equivalent to the getUnset<scalar>() methods) and the other default parameters of the main constructor.
-
~SGD() = default
-
inline virtual OptimizerType type() const final
-
inline virtual std::string type_s() const final
-
inline SGDAccumulatorAndMomentum getSGDAccumulatorAndMomentum() const
-
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const final
Returns the VarUpdateOp for the given weight.
If there is no gradient accumulation or momentum, this will be an SGD0VarUpdateOp. Else, if getSGDAccumulatorAndMomentum() == Combined, this will be an SGD1ComboOp; else, if getSGDAccumulatorAndMomentum() == Separate, an SGD2ComboOp.
The required compound scalar OptimizerValues for the VarUpdateOp will be computed and passed to the Op. See the SGD notes above this class for how they are derived. Recall that if non-const, the VarUpdateOp will take an input Tensor for the compound scalar.
See also
Optimizer::createOp
The OptimizerReductionType of the Op is derived as follows:
No replication => None
Replication, no grad acc => GradReduce
Replication, grad acc, SGD1 => AcclReduce
Replication, grad acc, SGD2 => AccumReduce
See the SGD notes above this class for why this is.
If SGD2, the DataType of the accum and accl1 tensors passed to the SGD2ComboOp will be as set in the SGD constructor. Recall DataType::UNDEFINED means use the same as the weight.
An SGD1ComboOp will later be decomposed by the SGD1Decompose pattern into a series of Ops and Tensors that implement the SGD1 optimiser step.
An SGD2ComboOp will later be decomposed by the SGD2Decompose pattern into a series of Ops and Tensors that implement the SGD2 optimiser step.
-
virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final
smm1 and wdsf0 have the same data type as the weight. Everything else
-
float getStoredValue(const TensorId &optId) const
Tensor “opt” has an id, which it uses to match a compound scalar which this object can compute from the atomic scalars.
-
void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue momentum, OptimizerValue dampening, OptimizerValue velocityScaling)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
learningRate – The learning rate value to use for this specific weight.
weightDecay – The weight decay value to use for this specific weight.
momentum – The momentum value to use for this specific weight.
dampening – The dampening value to use for this specific weight.
velocityScaling – The velocity scaling value to use for this specific weight.
-
void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
params – A parameter map where the keys are one or more of "learningRate", "weightDecay", "momentum", "dampening" or "velocityScaling", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.
-
virtual bool hasSpecific() const final
-
inline const OptimizerValueMap &learningRates() const
-
inline const OptimizerValueMap &weightDecays() const
-
inline const OptimizerValueMap &momentums() const
-
inline const OptimizerValueMap &dampenings() const
-
inline const OptimizerValueMap &velocityScalings() const
-
virtual size_t hash() const
Public Static Functions
-
static inline OptimizerValue getUnsetLearningRate()
Default learning rate value.
-
static inline OptimizerValue getUnsetWeightDecay()
Default weight decay value.
-
static inline OptimizerValue getUnsetMomentum()
Default momentum value.
-
static inline OptimizerValue getUnsetDampening()
Default dampening value.
-
static inline OptimizerValue getUnsetVelocityScaling()
Default velocity scaling value.
-
static inline OptimizerValue getUnsetLossScaling()
Default loss scaling value.
-
static SGD fromDefaultMap(const std::map<std::string, OptimizerValue>&, const DebugContext &debugContext = {})
-
class popart::ConstSGD : public popart::SGD
Stochastic Gradient Descent (SGD) optimizer with constant learning rate, weight decay, loss scaling and clip norm settings (and default values for momentum, dampening and velocity scaling).
NOTE: See SGD for detailed meaning for these parameters.
NOTE: This class exists for backwards compatibility with the Python API and may be removed at some point in the future.
Public Functions
-
inline ConstSGD(float learningRate, float weightDecay = 0, float lossScaling = 1, const std::vector<ClipNormSettings> &clipNormSettings = {})
Constructor.
- Parameters
learningRate – A constant learning rate.
weightDecay – A constant weight decay value.
lossScaling – A constant loss scaling value.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
2.4.2. Adam, AdaMax & Lamb
#include <popart/adam.hpp>
-
enum popart::AdamMode
Enum type describing the mode of an Adam optimizer instance.
Values:
-
enumerator Adam = 0
Adam or AdamW mode, depending on weight decay setting (see Kingma & Ba, 2015 and Loshchilov & Hutter, 2018).
-
enumerator AdaMax
AdaMax mode.
-
enumerator Lamb
Lamb mode (see You et al., 2020).
-
enumerator LambNoBias
Like Lamb but without bias correction.
-
class popart::Adam : public popart::Optimizer
AdamW, Lamb and AdaMax optimizer implementation.
Akin to any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.
The optimizer has the following state for each weight:
first-order momentum ( \(m\))
second-order momentum ( \(v\))
time step ( \(t\))
The optimizer has the following hyper parameters:
learning rate ( \(\text{lr}\))
weight decay ( \(\text{wd}\))
beta1 ( \(\beta_1\))
beta2 ( \(\beta_2\))
epsilon ( \(\epsilon\))
loss scaling ( \(\text{ls}\))
maximum weight norm ( \(\text{mwn}\))
The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see Adam::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.
The values of AdamMode and WeightDecayMode passed to the constructor determine how weights are updated (see below).
In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.
When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first computes a term \(g_\text{tmp}\), which is effectively \(g\) with L2 regularization applied if the WeightDecayMode is set to WeightDecayMode::L2Regularization, as follows:
\[\begin{split} g_\text{tmp} := \left\{\begin{aligned} g & \text{ \; (Decay) } \\ (g + \text{wd} * w) & \text{ \; (L2Regularization) \; . } \\ \end{aligned}\right.\\ \end{split}\]
Secondly, the optimizer updates the optimizer state as follows:
\[\begin{split} m' &:= \beta_1 * m + (1 - \beta_1) * g_\text{tmp} \\ v' &:= \left\{\begin{aligned} \beta_2 * v + (1 - \beta_2) * g_\text{tmp}^2 & \text{ \; (Adam/AdamNoBias) } \\ \beta_2 * v + (1 - \beta_2) * g_\text{tmp}^2 & \text{ \; (Lamb/LambNoBias) } \\ \text{max}(\beta_2 * v, |g_\text{tmp}|) & \text{ \; (AdaMax) } \\ \end{aligned}\right.\\ t' &:= t + 1 \\ \end{split}\]
Next, it computes the following terms:
\[\begin{split} m_\text{tmp} &:= \left\{\begin{aligned} m' & \text{ \; (AdamNoBias/LambNoBias) } \\ \frac{m'}{(1 - \beta_1^{t'})} & \text{ \; (Adam/Lamb/AdaMax) } \\ \end{aligned}\right.\\ v_\text{tmp} &:= \left\{\begin{aligned} v' & \text{ \; (AdamNoBias/LambNoBias) } \\ \frac{v'}{(1 - \beta_2^{t'})} & \text{ \; (Adam/Lamb/AdaMax) } \\ \end{aligned}\right.\\ u_\text{tmp} &:= \left\{\begin{aligned} \frac{m_\text{tmp}}{(\sqrt{v_\text{tmp}} + \epsilon)} + \text{wd} * w &\text{ \; (Decay) } \\ \frac{m_\text{tmp}}{(\sqrt{v_\text{tmp}} + \epsilon)} &\text{ \; (L2Regularization) } \\ \end{aligned}\right. \end{split}\]
Finally, the optimizer updates the weight as follows:
\[\begin{split} w' := \left\{\begin{aligned} w - \text{lr} * u_\text{tmp} &\text{ \; (Adam/AdamNoBias/AdaMax) } \\ w - \biggl(\frac{\text{min}(\lVert{w}\rVert, \text{mwn})}{\lVert{u_\text{tmp}}\rVert}\biggr) * \text{lr} * u_\text{tmp} &\text{ \; (Lamb/LambNoBias) } \\ \end{aligned}\right. \end{split}\]
In addition to the above, the loss scaling hyper parameter is a scaling value that is applied to the loss gradient at the start of the backwards pass and, at the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight with the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability of the gradient calculations. If scaledOptimizerState is enabled then the lossScaling will not be removed before updating the optimizer state. This can improve the numerical stability when accl1Type is set to FLOAT16.
NOTE: The maximum weight norm is referred to as \(\phi\) in You et al., 2020.
Public Functions
-
virtual bool hasSpecific() const final
-
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, OptimizerValue maxWeightNorm, AdamMode adamMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
Constructor.
- Parameters
defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultBeta1 – The beta1 value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultBeta2 – The beta2 value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultEps – The epsilon value to use for weights for which no weight-specific hyper parameters have been inserted.
lossScaling – The loss scaling value to use.
maxWeightNorm – The maxWeightNorm value to use.
adamMode – The AdamMode value to use.
weightDecayMode – The WeightDecayMode value to use.
accumType – Data type to use for gradient accumulation.
accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.
accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
scaledOptimizerState – Experimental option. If enabled, lossScaling is not removed before the optimizer state is updated. This should have no effect on the update equation, but it gives a more numerically stable implementation when accl1Type is set to DataType::FLOAT16. Note: When loading a model that includes initialised optimizer state, ensure that accl1 and accl2 are scaled by lossScaling and lossScaling^2 respectively.
debugContext – Optional debug context.
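The note on scaledOptimizerState follows from accl1 being linear in the gradient and accl2 being quadratic in it. A minimal sketch of the rescaling, assuming the loaded state was saved unscaled and is held in plain float vectors (the helper `scaleLoadedState` is hypothetical and not part of PopART):

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: rescale unscaled optimizer state so that it matches
// what a session with scaledOptimizerState enabled would have stored.
inline void scaleLoadedState(std::vector<float>& accl1,
                             std::vector<float>& accl2,
                             float lossScaling) {
  assert(lossScaling > 0.0f);
  for (float& m : accl1) {
    m *= lossScaling;  // first-order state scales with the gradient
  }
  for (float& v : accl2) {
    v *= lossScaling * lossScaling;  // second-order state scales with g^2
  }
}
```

With a loss scaling of 4, a first-order entry of 2 becomes 8 and a second-order entry of 0.5 becomes 8.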
-
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, AdamMode adamMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
-
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, OptimizerValue maxWeightNorm, AdamMode adamMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
-
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, AdamMode adamMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
-
Adam(const std::map<std::string, std::pair<float, bool>> &params, AdamMode adamMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
Constructor.
EXAMPLE:
Adam({{"defaultLearningRate", {0.02, false}}, {"defaultBeta1", {0.9, true}}, {"defaultBeta2", {0.999, true}}}, AdamMode::Adam, WeightDecayMode::Decay, DataType::FLOAT, DataType::FLOAT, DataType::FLOAT);
- Parameters
params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultBeta1", "defaultBeta2", "defaultEps", "lossScaling" or "maxWeightNorm", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter, as default values will be used where parameters are missing.
adamMode – The AdamMode value to use.
weightDecayMode – The WeightDecayMode value to use.
accumType – Data type to use for gradient accumulation.
accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.
accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
scaledOptimizerState – Experimental option. If enabled, lossScaling is not removed before the optimizer state is updated. This should have no effect on the update equation, but it gives a more numerically stable implementation when accl1Type is set to DataType::FLOAT16. Note: When loading a model that includes initialised optimizer state, ensure that accl1 and accl2 are scaled by lossScaling and lossScaling^2 respectively.
debugContext – Optional debug context.
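The fallback behaviour of the parameter map (missing keys take default values) can be pictured with a small lookup helper. This is only a sketch under stated assumptions: `ParamMap` and `lookupOr` are hypothetical names, the boolean is assumed to be the constness flag of the OptimizerValue, and the real defaults come from the getUnset...() functions rather than from a caller-supplied fallback.

```cpp
#include <map>
#include <string>
#include <utility>

// Same shape as the constructor's params argument.
using ParamMap = std::map<std::string, std::pair<float, bool>>;

// Hypothetical helper: return the (value, flag) pair stored under key,
// or the supplied fallback when the key is missing from the map.
inline std::pair<float, bool> lookupOr(const ParamMap& params,
                                       const std::string& key,
                                       std::pair<float, bool> fallback) {
  const auto it = params.find(key);
  return it != params.end() ? it->second : fallback;
}
```

A map that only sets "defaultLearningRate" would therefore resolve every other key to its fallback.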
-
~Adam() = default
-
inline virtual OptimizerType type() const final
-
inline virtual std::string type_s() const final
-
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const final
The names of the inputs for the VarUpdateOp for the Variable Tensor “weight”.
In the returned vector, an empty string (“”) is used as a placeholder for constant inputs.
-
virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final
The names and infos of the optimizer tensors.
-
float getStoredValue(const TensorId &optId) const
The optimizer tensor identified by optId is matched, by its id, to a compound scalar that this object can compute from the atomic scalars.
-
void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue beta1, OptimizerValue beta2, OptimizerValue eps, OptimizerValue mwn)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
learningRate – The learning rate value to use for this specific weight.
weightDecay – The weight decay value to use for this specific weight.
beta1 – The beta1 value to use for this specific weight.
beta2 – The beta2 value to use for this specific weight.
eps – The epsilon value to use for this specific weight.
mwn – The max weight norm value to use for this specific weight.
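The relationship between default and weight-specific values can be sketched with a small container that falls back to the default when no specific value has been inserted. `ValueMap` is a hypothetical stand-in written for illustration. It is not the PopART OptimizerValueMap, and it stores plain floats rather than OptimizerValue objects.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for one hyper parameter: a default value plus
// optional overrides keyed by weight id.
class ValueMap {
public:
  explicit ValueMap(float defaultValue) : defaultValue_(defaultValue) {}

  // Record a value that applies only to the given weight.
  void insertSpecific(const std::string& weightId, float value) {
    specifics_[weightId] = value;
  }

  bool hasSpecific(const std::string& weightId) const {
    return specifics_.count(weightId) > 0;
  }

  // Weight-specific value if present, otherwise the shared default.
  float get(const std::string& weightId) const {
    const auto it = specifics_.find(weightId);
    return it != specifics_.end() ? it->second : defaultValue_;
  }

private:
  float defaultValue_;
  std::map<std::string, float> specifics_;
};
```

After inserting a specific learning rate for one weight, lookups for that weight return the override while all other weights keep the default.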
-
void setStep(int64_t step)
-
void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultBeta1", "defaultBeta2", "defaultEps", "lossScaling" or "maxWeightNorm", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter, as default values will be used where parameters are missing.
-
inline const OptimizerValueMap &learningRates() const
-
inline const OptimizerValueMap &weightDecays() const
-
inline const OptimizerValueMap &beta1s() const
-
inline const OptimizerValueMap &beta2s() const
-
inline const OptimizerValueMap &epss() const
-
inline const OptimizerValueMap &maxWeightNorms() const
-
inline const WeightDecayMode &getWeightDecayMode() const
-
inline bool useScaledOptimizerState() const
-
virtual size_t hash() const final
-
virtual void setFactorsFromOptions(const SessionOptions&) final
Public Static Functions
-
static inline OptimizerValue getUnsetLearningRate()
Default learning rate value.
-
static inline OptimizerValue getUnsetWeightDecay()
Default weight decay value.
-
static inline OptimizerValue getUnsetBeta1()
Default beta1 value.
-
static inline OptimizerValue getUnsetBeta2()
Default beta2 value.
-
static inline OptimizerValue getUnsetEps()
Default epsilon value.
-
static inline OptimizerValue getUnsetLossScaling()
Default loss scaling value.
-
static inline OptimizerValue getUnsetMaxWeightNorm()
Default maximum weight norm value.
-
static Adam fromDefaultMap(const std::map<std::string, OptimizerValue>&, AdamMode adamMode_, WeightDecayMode decayMode_, DataType accumType_, DataType accl1Type_, DataType accl2Type_, const DebugContext &debugContext = {})
2.4.3. AdaDelta, RMSProp & AdaGrad
#include <popart/adaptive.hpp>
-
enum popart::AdaptiveMode
Enum class representing a type of adaptive optimizer.
Values:
-
enumerator AdaGrad = 0
AdaGrad optimizer.
-
enumerator RMSProp
RMSProp optimizer.
-
enumerator CenteredRMSProp
CenteredRMSProp optimizer.
-
enumerator AdaDelta
AdaDelta optimizer.
-
class popart::Adaptive : public popart::Optimizer
AdaDelta, RMSProp and AdaGrad optimizer implementation.
Akin to any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.
The optimizer has the following state for each weight:
first-order momentum ( \(v_1\))
second-order momentum ( \(v_2\)) (only for CenteredRMSProp/AdaDelta)
third-order momentum ( \(v_3\))
The optimizer has the following hyper parameters:
learning rate ( \(\text{lr}\))
weight decay ( \(\text{wd}\))
alpha ( \(\alpha\))
momentum ( \(\text{m}\))
epsilon ( \(\epsilon\))
loss scaling ( \(\text{ls}\))
The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see Adaptive::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.
The values of AdaptiveMode and WeightDecayMode passed to the constructor determine how weights are updated (see below).
In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description, the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.
When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first computes a term \(g_\text{tmp}\), which is effectively \(g\) with L2 regularization applied if the WeightDecayMode is set to WeightDecayMode::L2Regularization, as follows:
\[\begin{split} g_\text{tmp} := \left\{\begin{aligned} g & \text{ \; (Decay) } \\ (g + \text{wd} * w) & \text{ \; (L2Regularization) \; . } \\ \end{aligned}\right.\\ \end{split}\]Secondly, the optimizer updates the optimizer state \(v_1\) as follows:
\[\begin{split} v_1' &:= \left\{\begin{aligned} \alpha * v_1 + (1 - \alpha) * g_\text{tmp}^2 & \text{ \; (RMSProp/AdaDelta) } \\ \alpha * v_1 + (1 - \alpha) * g_\text{tmp}^2 & \text{ \; (CenteredRMSProp) } \\ v_1 + g_\text{tmp}^2 & \text{ \; (AdaGrad) } \\ \end{aligned}\right.\\ \end{split}\]Next, \(v_2\) is updated, but only for CenteredRMSProp:
\[\begin{split} v_2' &:= \alpha * v_2 + (1 - \alpha) * g_\text{tmp} \text{ \; (CenteredRMSProp) } \\ \end{split}\]Next, it computes the update term \(u_\text{tmp}\):
\[\begin{split} u_\text{tmp} &:= \left\{\begin{aligned} \frac{g_\text{tmp}}{\sqrt{v_1'} + \epsilon} & \text{ \; (AdaGrad/RMSProp) } \\ \frac{g_\text{tmp}}{\sqrt{v_1' - v_2'^2} + \epsilon} & \text{ \; (CenteredRMSProp) } \\ \frac{g_\text{tmp} * \sqrt{v_2 + \epsilon}}{\sqrt{v_1' + \epsilon}} & \text{ \; (AdaDelta) } \\ \end{aligned}\right. \end{split}\]Next, \(v_2\) is updated, but only for AdaDelta:
\[\begin{split} v_2' := \alpha * v_2 + (1 - \alpha) * u_\text{tmp}^2 \text{ \; (AdaDelta) } \\ \end{split}\]Next the third momentum is updated for all modes:
\[ v_3' := m * v_3 + u_\text{tmp} \]Finally, the optimizer updates the weight as follows:
\[\begin{split} w' := \left\{\begin{aligned} w - \text{lr} * (v_3' + \text{wd} * w) &\text{ \; (Decay) } \\ w - \text{lr} * v_3' &\text{ \; (L2Regularization) } \\ \end{aligned}\right. \end{split}\]In addition to the above, the loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass. At the end of the backwards pass, this scaling is reversed by multiplying the gradient for each weight by the inverse of the loss scaling value before the optimizer state is updated. Using loss scaling can also improve numerical stability in some cases.
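The update sequence described above can be sketched for a single scalar weight as follows. This is an illustration under stated assumptions, not the PopART implementation: `AdaptiveVariant`, `WdVariant`, `AdaptiveState` and `adaptiveStep` are hypothetical names, the decayed term in the \(v_1\) update is taken to be \(v_1\) itself, and the gradient is assumed to have had loss scaling removed already.

```cpp
#include <cassert>
#include <cmath>

// Local stand-ins for illustration only; these are not the PopART enums.
enum class AdaptiveVariant { AdaGrad, RMSProp, CenteredRMSProp, AdaDelta };
enum class WdVariant { Decay, L2Regularization };

// Per-weight optimizer state (v_1, v_2, v_3), all starting at zero.
struct AdaptiveState {
  float v1 = 0.0f;
  float v2 = 0.0f;
  float v3 = 0.0f;
};

// Applies one update to the scalar weight w and returns w'.
inline float adaptiveStep(float w, float g, AdaptiveState& s, float lr,
                          float wd, float alpha, float momentum, float eps,
                          AdaptiveVariant mode, WdVariant decay) {
  assert(eps >= 0.0f);
  // g_tmp: fold weight decay into the gradient for L2 regularization.
  const float gTmp = (decay == WdVariant::L2Regularization) ? g + wd * w : g;

  // v_1: running sum (AdaGrad) or moving average of squared gradients.
  if (mode == AdaptiveVariant::AdaGrad) {
    s.v1 += gTmp * gTmp;
  } else {
    s.v1 = alpha * s.v1 + (1.0f - alpha) * gTmp * gTmp;
  }

  // v_2 before u_tmp: moving average of the gradient (CenteredRMSProp only).
  if (mode == AdaptiveVariant::CenteredRMSProp) {
    s.v2 = alpha * s.v2 + (1.0f - alpha) * gTmp;
  }

  float uTmp = 0.0f;
  if (mode == AdaptiveVariant::AdaDelta) {
    uTmp = gTmp * std::sqrt(s.v2 + eps) / std::sqrt(s.v1 + eps);
  } else if (mode == AdaptiveVariant::CenteredRMSProp) {
    uTmp = gTmp / (std::sqrt(s.v1 - s.v2 * s.v2) + eps);
  } else {
    uTmp = gTmp / (std::sqrt(s.v1) + eps);
  }

  // v_2 after u_tmp: moving average of squared updates (AdaDelta only).
  if (mode == AdaptiveVariant::AdaDelta) {
    s.v2 = alpha * s.v2 + (1.0f - alpha) * uTmp * uTmp;
  }

  // v_3: momentum buffer, updated for all modes.
  s.v3 = momentum * s.v3 + uTmp;

  return (decay == WdVariant::Decay) ? w - lr * (s.v3 + wd * w)
                                     : w - lr * s.v3;
}
```

Starting from zero state with w = 1, g = 2, lr = 0.1 and no weight decay, momentum or epsilon, the AdaGrad branch gives \(v_1' = 4\), an update term of 1 and a new weight of 0.9.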
Public Functions
-
virtual bool hasSpecific() const
-
Adaptive(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultAlpha, OptimizerValue defaultMomentum, OptimizerValue defaultEps, OptimizerValue lossScaling, AdaptiveMode adaptiveMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, DataType accl3Type, bool rmspropTFVariant = false, const DebugContext &debugContext = {})
Constructor.
- Parameters
defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultAlpha – The alpha value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameters have been inserted.
defaultEps – The epsilon value to use for weights for which no weight-specific hyper parameters have been inserted.
lossScaling – The loss scaling value to use.
adaptiveMode – The AdaptiveMode value to use.
weightDecayMode – The WeightDecayMode value to use.
accumType – Data type to use for gradient accumulation.
accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.
accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.
accl3Type – Data type to use for tensor that stores third-order momentum optimizer state.
debugContext – Optional debug context.
-
Adaptive(const std::map<std::string, std::pair<float, bool>> &params, AdaptiveMode adaptiveMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, DataType accl3Type, bool rmspropTFVariant = false, const DebugContext &debugContext = {})
Constructor.
EXAMPLE:
Adaptive({{"defaultLearningRate", {0.02, false}}, {"defaultAlpha", {0.99, true}}}, AdaptiveMode::RMSProp, WeightDecayMode::Decay, DataType::FLOAT, DataType::FLOAT, DataType::FLOAT, DataType::FLOAT);
- Parameters
params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultAlpha", "defaultMomentum", "defaultEps" or "lossScaling", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter, as default values will be used where parameters are missing.
adaptiveMode – The AdaptiveMode value to use.
weightDecayMode – The WeightDecayMode value to use.
accumType – Data type to use for gradient accumulation.
accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.
accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.
accl3Type – Data type to use for tensor that stores third-order momentum optimizer state.
debugContext – Optional debug context.
-
~Adaptive() = default
-
inline virtual OptimizerType type() const final
-
inline virtual std::string type_s() const final
-
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const final
The names of the inputs for the VarUpdateOp for the Variable Tensor “weight”.
In the returned vector, an empty string (“”) is used as a placeholder for constant inputs.
-
virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final
The names and infos of the optimizer tensors.
-
float getStoredValue(const TensorId &optId) const
The optimizer tensor identified by optId is matched, by its id, to a compound scalar that this object can compute from the atomic scalars.
-
void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue alpha, OptimizerValue momentum, OptimizerValue eps)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
learningRate – The learning rate value to use for this specific weight.
weightDecay – The weight decay value to use for this specific weight.
alpha – The alpha value to use for this specific weight.
momentum – The momentum value to use for this specific weight.
eps – The epsilon value to use for this specific weight.
-
void setStep(int64_t step)
-
void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultAlpha", "defaultMomentum", "defaultEps" or "lossScaling", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter, as default values will be used where parameters are missing.
-
inline const OptimizerValueMap &learningRates() const
-
inline const OptimizerValueMap &weightDecays() const
-
inline const OptimizerValueMap &alphas() const
-
inline const OptimizerValueMap &momentums() const
-
inline const OptimizerValueMap &epss() const
-
virtual size_t hash() const
Public Static Functions
-
static inline OptimizerValue getUnsetLearningRate()
Default learning rate value.
-
static inline OptimizerValue getUnsetWeightDecay()
Default weight decay value.
-
static inline OptimizerValue getUnsetAlpha()
Default alpha value.
-
static inline OptimizerValue getUnsetMomentum()
Default momentum value.
-
static inline OptimizerValue getUnsetEps()
Default epsilon value.
-
static inline OptimizerValue getUnsetLossScaling()
Default loss scaling value.
-
static Adaptive fromDefaultMap(const std::map<std::string, OptimizerValue>&, AdaptiveMode adaptiveMode_, WeightDecayMode decayMode_, DataType accumType_, DataType accl1Type_, DataType accl2Type_, DataType accl3Type_, const DebugContext &debugContext = {})
2.5. Builder
#include <popart/builder.hpp>
-
class popart::Builder
A builder interface for creating ONNX graphs.
ONNX defines a specification for describing graphs and serialising them as protobuf files. This class provides a builder interface for creating such a graph.
Note, in ONNX, all Ops belong to an “Opset”. The Builder itself does not have methods for creating Ops in the ONNX graph, but instead has accessors to Opsets, like AiGraphcoreOpset1, which contain the methods for creating Ops in the graph.
Public Functions
-
Builder &createSubgraphBuilder()
Create a builder for a graph which is nested inside this builder’s graph.
-
TensorId addInputTensor(const TensorInfo &tensorInfo, const popart::DebugContext &debugContext = {})
Add a new input tensor to the model.
- Parameters
tensorInfo – The shape and data type of the input tensor.
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
TensorId addInputTensor(const std::string &dataType, const Shape &shape, const popart::DebugContext &debugContext = {})
Add a new input tensor to the model.
- Parameters
dataType – The data type of the input tensor.
shape – The shape of the input tensor.
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
TensorId addInputTensor(const TensorInfo &tensorInfo, const InputSettings &settings, const popart::DebugContext &debugContext = {})
Add a new input tensor to the model.
- Parameters
tensorInfo – The shape and data type of the input tensor.
settings – Settings for TileSet and ExchangeStrategy.
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
TensorId addInputTensor(const std::string &dataType, const Shape &shape, const InputSettings &settings, const popart::DebugContext &debugContext = {})
Add a new input tensor to the model.
- Parameters
dataType – The data type of the input tensor.
shape – The shape of the input tensor.
settings – Settings for TileSet and ExchangeStrategy.
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
TensorId addUntypedInputTensor(const popart::DebugContext &debugContext = {})
Add a new input tensor without a type or shape to the model.
- Parameters
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
void addInputTensorFromParentGraph(const TensorId &tensorId)
Add a new named input tensor (from the parent graph) to the model.
- Parameters
tensorId – The identifier string of the input tensor. This identifier must already exist in the name scope of the parent GraphProto and must appear topologically before this sub-graph.
-
TensorId addInitializedInputTensor(const ConstVoidData &initData, const popart::DebugContext &debugContext = {})
Add a new pre-initialized input tensor to the model.
- Parameters
initData – The initial data of the input tensor.
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
TensorId addInitializedInputTensor(const ConstVoidData &initData, const VariableSettings &variableSettings, const popart::DebugContext &debugContext = {})
Add a new pre-initialized input tensor to the model.
- Parameters
initData – The initial data of the input tensor.
variableSettings – The settings that determine how variables are retrieved from replicas.
debugContext – Optional debug information.
- Returns
The tensor id of the input tensor.
-
void addOutputTensor(const TensorId &arg0)
Add an output tensor from a node in the graph into the list of output tensors.
- Parameters
arg0 – The tensor id of the output tensor to be added.
-
inline AiOnnxOpset6 aiOnnxOpset6()
Return the builder interface for ai.onnx opset 6.
-
inline AiOnnxOpset7 aiOnnxOpset7()
Return the builder interface for ai.onnx opset 7.
-
inline AiOnnxOpset8 aiOnnxOpset8()
Return the builder interface for ai.onnx opset 8.
-
inline AiOnnxOpset9 aiOnnxOpset9()
Return the builder interface for ai.onnx opset 9.
-
inline AiOnnxOpset10 aiOnnxOpset10()
Return the builder interface for ai.onnx opset 10.
-
inline AiOnnxOpset11 aiOnnxOpset11()
Return the builder interface for ai.onnx opset 11.
-
inline AiOnnxMlOpset1 aiOnnxMlOpset1()
Return the builder interface for ai.onnx.ml opset 1.
-
inline AiGraphcoreOpset1 aiGraphcoreOpset1()
Return the builder interface for ai.graphcore opset 1.
-
std::vector<TensorId> customOp(const OperatorIdentifier &opid, int opsetVersion, const std::vector<TensorId> &inputs, const unsigned numOutputs, const std::map<std::string, popart::any> &attributes, const DebugContext &debugContext = {})
Return the output tensors from a custom op added to the model.
- Parameters
opid – The id of the operator.
opsetVersion – The version of the opset.
inputs – A vector of input tensor ids.
numOutputs – The number of output tensors.
attributes – The map of attributes and their values to be added.
debugContext – Optional debug information.
- Returns
The output tensors.
-
void customOp(const OperatorIdentifier &opid, int opsetVersion, const std::vector<TensorId> &inputs, const std::vector<TensorId> &outputs, const std::map<std::string, popart::any> &attributes, const DebugContext &debugContext =
-
Builder &createSubgraphBuilder()