14. PopART C++ API
This chapter describes the PopART C++ API.
14.1. Sessions
#include <popart/session.hpp>
-
class Session
Session is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware.
Subclassed by popart::InferenceSession, popart::TrainingSession
Public Functions
-
std::vector<uint32_t> getRNGState()
Get state of the random number generator.
-
void setRNGState(const std::vector<uint32_t>)
Set state of the random number generator.
-
void setRandomSeed(uint64_t seedValue)
Set the value of the random number generator seed.
This method explicitly seeds all random operations. Additionally, this method derives a new state for the random number generator (RNG) from the seed and sets it on the device. This RNG state is used to resolve stochastic rounding. Note that to deterministically store and restore the combined random state for a session, do the following:
C++:
// Store random state (session s0).
auto seed = s0.getRandomSeed();
auto rngState = s0.getRNGState();
// Restore random state (session s1).
s1.setRandomSeed(seed); // <-- affects RNG state, order important
s1.setRNGState(rngState);
Python:
# Store random state (session s0).
seed = s0.getRandomSeed()
rngState = s0.getRNGState()
# Restore random state (session s1).
s1.setRandomSeed(seed) # <-- affects RNG state, order important
s1.setRNGState(rngState)
- Parameters
seedValue – The value of the seed.
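The ordering constraint on restoring the random state can be illustrated with a toy model. The sketch below is not PopART code: `MockSession` and `restoreRandomState` are hypothetical names, and the way the mock derives an RNG state from the seed is an arbitrary placeholder. It only shows why `setRNGState()` has to be called after `setRandomSeed()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a session; NOT the PopART implementation.
class MockSession {
public:
  void setRandomSeed(uint64_t seedValue) {
    seed_ = seedValue;
    // Placeholder derivation (assumption): any function of the seed makes
    // the point that seeding overwrites the RNG state.
    rngState_ = {static_cast<uint32_t>(seedValue),
                 static_cast<uint32_t>(seedValue >> 32)};
  }
  uint64_t getRandomSeed() const { return seed_; }
  void setRNGState(const std::vector<uint32_t> &state) { rngState_ = state; }
  std::vector<uint32_t> getRNGState() const { return rngState_; }

private:
  uint64_t seed_ = 0;
  std::vector<uint32_t> rngState_{0, 0};
};

// Restore in the documented order: seed first, RNG state second.
inline void restoreRandomState(MockSession &target, uint64_t seed,
                               const std::vector<uint32_t> &rngState) {
  target.setRandomSeed(seed);   // overwrites the RNG state
  target.setRNGState(rngState); // so the saved state is applied afterwards
  assert(target.getRNGState() == rngState);
}
```

In this model, swapping the two calls would leave the target with the state derived from the seed rather than the saved one.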
-
uint64_t getRandomSeed()
Get the value of the random number generator seed.
Calling setRandomSeed() with this value (at a later stage) reinstates the random state logic that seeds random operations.
- Returns
The value used to seed current random operations.
-
void compileAndExport(const std::string &filename)
Compile the graph and export it to a file.
This method will first create a poplar::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the file. The exported file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
- Parameters
filename – The name of the file where the compiled executable and metadata will be saved.
-
void compileAndExport(std::ostream &out)
Compile the graph and export it to a stream.
This method will first create a poplar::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the stream. The data will be streamed in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
This method automatically creates folders as needed if filename is located in a folder which does not exist.
- Parameters
out – The stream that the compiled executable and metadata will be written to.
-
void saveExecutableToFile(const std::string &filename)
Save a compiled graph to a file.
The file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
This method automatically creates folders as needed if filename is located in a folder which does not exist.
- Parameters
filename – The name of the file where the compiled executable and metadata will be saved.
- Pre
prepareDevice() must have been called.
-
void saveExecutableToStream(std::ostream &out)
Save a compiled graph to a stream.
The data will be streamed in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.
- Parameters
out – The stream that the compiled executable and metadata will be written to.
- Pre
prepareDevice() must have been called.
-
void saveExecutable(const std::string &path, bool savePopartMetadata = true, bool saveVariables = true)
Save a compiled graph with additional data to a file.
PopART is able to save its state after the model compilation is complete, so that it can be restored at a later time. To make this possible, it is necessary to save such elements as:
a serialised Poplar executable,
its associated metadata,
tensor data blobs if model parameters have not been frozen (refer to SessionOptions::constantWeights for more information),
a PopART-specific opaque blob to store information only relevant to PopART. This is needed to restore PopART state.
The file will be in the PopEF format. This means that the file can be used to restore the state of the PopART program without recompiling the graph, or run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information. To analyze the structure of the file saved by this function, refer to the PopEF dump tool.
- Parameters
path – The name of the file or directory where the compiled executable, metadata and variables will be saved. If path is a directory, the function will write the data to the file “<path>/executable.popef”. If the file exists, the function will overwrite the old data with the new data.
savePopartMetadata – If you do not need the option to restore the PopART state later, you can set the flag to false to reduce the disk space taken up by the file.
saveVariables – Set this flag to false if you do not need to save the state of the variables (tensors) now, for example because you want to save them later or in a different location. The function will save data consistent with the variables contained within the model.
- Pre
prepareDevice() must have been called.
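The rule for the path parameter (a directory maps to a default file name inside it, anything else is used as the file name) can be sketched with `std::filesystem`. The helper name `resolvePopefPath` is hypothetical and is not part of PopART; it only mirrors the behaviour described above and requires C++17.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper, NOT part of PopART: a directory path maps to
// "<path>/<defaultName>", any other path is returned unchanged.
inline fs::path resolvePopefPath(const fs::path &path,
                                 const std::string &defaultName) {
  assert(!defaultName.empty());
  if (fs::is_directory(path)) {
    return path / defaultName;
  }
  return path;
}
```

Calling the helper with "executable.popef" as the default name models saveExecutable(); the same helper with "variables.popef" would model the rule documented for saveVariables().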
-
void saveVariables(const std::string &path)
Save all variables to a file.
The function will save data consistent with the variables contained within the model.
The file will be in the PopEF format. To analyze the tensors saved by this function, refer to the PopEF dump tool.
- Parameters
path – The name of the file or directory where the compiled variables will be saved. If path is a directory, the function will write the data to the file “<path>/variables.popef”. If the file exists, the function will overwrite the old data with the new data.
- Pre
prepareDevice() must have been called.
-
void checkInplacingAmbiguity() const
Check for potential inplacing ambiguities.
This method creates an AliasModel object for each graph and runs the Poprithms ambiguity checker on it.
Throws an error if the graph has an inplacing ambiguity and will prompt the user to check the inplacing.
See poprithms::memory::inplace::Graph::AmbiguityStatus on the Poprithms GitHub repo for more on what constitutes an ambiguity.
-
void loadExecutableFromFile(const std::string &filename)
Load the compiled executable and metadata from a file.
The file must have been created with compileAndExport(const std::string).
- Parameters
filename – The name of the file to load the executable and metadata from.
Load the compiled executable and metadata from a stream.
The stream must have been created with compileAndExport(std::ostream).
- Parameters
in – The shared pointer to the stream to load the executable from.
-
void prepareDevice(bool loadEngine = true)
Prepare the network for execution.
This will create the poplar::Graph and poplar::Engine.
- Parameters
loadEngine – If true, load the engine and connect the streams once the device is ready.
-
void loadEngineAndConnectStreams()
Load the engine on the device and connect the streams.
This will set up the poplar::Streams.
Note: This call is optional. The engine will implicitly be loaded on the device when required.
-
void weightsFromHost()
Copy weights from the host to the device.
-
void buffersFromHost()
Copy buffers from the host to the device.
-
void weightsToHost()
Copy the weights from the device to the host stream memory.
-
uint64_t getCycleCount(std::string id = "")
Copy the cycle count tensor from the device to the host.
- Parameters
id – The identifier of the cycle count tensor.
-
void connectStreamToCallback(const std::string &streamHandle, std::function<void(void*)> callback, unsigned index = 0)
Connect a Poplar stream with a callback.
The callback will be called whenever the stream is about to be read or has been written to by the device. The memory location will only be valid for reading or writing for the duration of the callback.
- Parameters
streamHandle – The name of the stream to connect to.
callback – The callback to be called whenever the stream is to be read or was written to by the device.
index – The replica index to connect to, when using replicated graphs. Default=0.
-
void connectStream(const std::string &streamHandle, void *buffer)
Connect a Poplar stream with a fixed location in memory.
Each time data is copied to the stream, this location will be read and each time data is copied from the stream, this location will be written.
- Parameters
streamHandle – The handle of the stream to connect to.
buffer – The pointer to the memory location.
-
void connectHostFunction(const std::string &functionHandle, std::function<void(const void*const*, size_t, void*const*, size_t)> callback, unsigned index = 0)
Connect a host function to a callback.
The callback takes two arguments, which point to the locations in memory for each of the function’s input and output arguments, respectively. During a host function call, first the device transfers the input data to the host, then the callback is invoked, and finally the output data is copied back to the device. The memory pointed to by the callback arguments must only be accessed during the duration of the callback.
- Parameters
functionHandle – The name of the host function.
callback – The function to be called whenever new input data is available.
index – The replica index to connect to, when using replicated graphs. Default=0.
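As an illustration of the callback type accepted by connectHostFunction(), the sketch below builds a callable with the same signature. Several details are assumptions rather than documented facts: that each `size_t` argument is the number of pointers in the preceding array, that the tensors hold `float` data, and that there is exactly one input and one output. The names `HostFunctionCallback` and `makeDoublingCallback` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Same callable type as the `callback` parameter of connectHostFunction().
using HostFunctionCallback = std::function<void(
    const void *const *, std::size_t, void *const *, std::size_t)>;

// Builds a callback that writes 2*x for each of `numElements` floats.
// `numElements` is captured because the signature carries no element counts
// (assumption: the size_t arguments count pointers, not elements).
inline HostFunctionCallback makeDoublingCallback(std::size_t numElements) {
  return [numElements](const void *const *inputs, std::size_t numInputs,
                       void *const *outputs, std::size_t numOutputs) {
    assert(numInputs == 1 && numOutputs == 1);
    const float *in = static_cast<const float *>(inputs[0]);
    float *out = static_cast<float *>(outputs[0]);
    // The pointers are only used inside the callback, as required.
    for (std::size_t i = 0; i < numElements; ++i) {
      out[i] = 2.0f * in[i];
    }
  };
}
```

The returned object could then be passed as the `callback` argument; the function handle it is registered under depends on the model and is not shown here.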
-
void run(IStepIO &stepIO, std::string debugName = "")
Run one step.
Read input data from addresses in stepIO.in.
Write the output data to addresses in stepIO.out.
- Parameters
stepIO – The input and output data.
debugName – A debug string to identify this run in logs.
-
void run(std::string programHandle, IStepIO &stepIO, std::string debugName = "")
Run one step of a custom program.
Read input data from addresses in stepIO.in.
Write the output data to addresses in stepIO.out.
- Parameters
programHandle – The handle of the custom program to run.
stepIO – The input and output data.
debugName – A debug string to identify this run in logs.
-
void updateExternallySavedTensorLocations(const std::string &fromLocation, const std::string &toLocation)
Update the tensor locations of tensors in the session’s ONNX model.
A new file will be created at this point, and written to when the ONNX model is saved with a subsequent call to modelToHost().
- Parameters
fromLocation – All externally saved tensors with location fromLocation will have their location updated to toLocation.
toLocation – The updated tensor locations. This must not already exist.
-
void modelToHost(const std::string &fn)
Write the current model to an ONNX file.
- Parameters
fn – The path to the file. The path can be absolute or relative. If you plan to run your program in multiple processes simultaneously, you should avoid possible race conditions by writing to different files, for example by using temporary files.
-
TensorInfo getInfo(TensorId) const
Get the tensor information for a tensor.
- Parameters
TensorId – The identifier of the tensor to get the tensor information for.
- Returns
The tensor information for the tensor.
-
bool hasInfo(TensorId) const
Check whether a tensor has information.
- Parameters
TensorId – The identifier of the tensor to check for tensor information.
- Returns
true if the tensor with identifier TensorId has tensor information and false if not.
-
std::set<TensorId> getAllTensorIds() const
Returns the ids of all tensors in the model.
- Pre
prepareDevice() must have been called.
-
std::string getSummaryReport(bool resetProfile = true) const
Retrieve the summary report from the poplar::Engine.
The options which were passed to the Session constructor will influence the information in the report.
This method may only be called after prepareDevice() has been called.
- Parameters
resetProfile – If true, resets the execution profile. Default = true.
- Returns
A string containing the report.
-
std::string getSerializedGraph() const
Retrieve the serialized graph from the poplar::Engine.
A JSON format report is produced.
This method may only be called after prepareDevice() has been called.
- Returns
A string containing the serialized graph.
-
pva::Report getReport() const
Retrieve the graph report from the poplar::Engine.
The options which were passed to the Session constructor will influence the information in the report.
This method may only be called after prepareDevice() has been called.
- Returns
The PopVision Analysis report object.
-
void resetHostWeights(const std::string &model, const bool ignoreWeightsInModelWithoutCorrespondingHostWeight = false)
Reset weights with weights in an ONNX model.
Note that the only differences between the ONNX model and the current model must be the weights. No other differences are allowed.
This method only updates the weights on the host. weightsFromHost() must be called after this method to update the weights on the device.
- Parameters
model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.
ignoreWeightsInModelWithoutCorrespondingHostWeight – If true, do not throw an error if there are initializers in the ONNX model without corresponding initializer tensor(s) in the session’s IR.
-
void readWeights(const IWeightsIO &weightsIo)
Read the weights from the host stream memory and write to the host.
This method may only be called after weightsToHost() has been called.
- Parameters
weightsIo – The weight data that is read from the host stream memory is written to the addresses in weightsIo.out.
-
void writeWeights(const IWeightsIO &weightsIo)
Write the weights from the host to the IR tensor memory.
This method may only be called after weightsFromHost() has been called.
- Parameters
weightsIo – The weight data is written to the addresses in weightsIo.out.
-
std::string serializeIr(IrSerializationFormat format)
Serialize the IR graph to a string.
- Parameters
format – The format to use for serializing.
-
inline const popx::IrLowering &getIrLowering() const
Get the IR lowering associated with the Session.
-
inline const popx::Executablex &getExecutable() const
Get the executable associated with the Session.
-
void broadcastWeights(int rootRank = 0)
Broadcasts the weights from the PopRun instance with index rootRank to all other instances.
- Parameters
rootRank – The index of the PopRun instance from which the weights should be broadcast.
-
void updateEngineCache()
Update cacheEntries from engine cache directory and update ir::hashMatched_ with the updated cacheEntries.
Set the DeviceInfo of the Session.
14.1.1. Training session
#include <popart/session.hpp>
-
class TrainingSession : public popart::Session
TrainingSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware with training provided by optimizing a loss tensor using an optimizer and automatic differentiation (backpropagation).
Public Functions
-
~TrainingSession() override
Destructor for the TrainingSession class.
-
void updateOptimizerFromHost(const Optimizer *optimizer)
Update the optimizer from the host.
This method updates the optimizer and the associated hyperparameters but not the optimizer state tensors.
NOTE: The optimizer parameter has to be compatible with the optimizer passed to the TrainingSession constructor. For example, you cannot call this function with an SGD1 optimizer if you created the session with an SGD0 optimizer. This is because it is not possible to change the IR after a session has been constructed.
- Parameters
optimizer – A pointer to a popart::Optimizer.
-
void copyFromRemoteBuffer(const std::string &buffer, void *w, int repeat_index, unsigned replication_index = 0)
Copy from a remote buffer into a user buffer.
This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.
- Parameters
buffer – The name of the remote buffer to copy from.
w – Pointer to a user buffer to copy to.
repeat_index – The index in the remote buffer to copy from.
replication_index – The replicated graph index when using replicated graphs. Default=0.
-
void copyToRemoteBuffer(void *w, const std::string &buffer, int repeat_index, unsigned replication_index = 0)
Copy from a user buffer to a remote buffer.
This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.
- Parameters
w – Pointer to a user buffer to copy from.
buffer – The remote buffer to copy to.
repeat_index – The index in the remote buffer to copy to.
replication_index – The replicated graph index when using replicated graphs. Default=0.
Public Static Functions
Create a session for training from an IR.
- Parameters
ir – The IR to create the session from.
deviceInfo – The type of device that this session uses.
name – The name of this training session. Default: “training”.
Create a session for training from an ONNX model.
- Parameters
model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.
dataFlow – Configuration for the data feeds and fetches.
loss – The identifier of the final scalar loss tensor for training.
optimizer – The name of an optimizer to use when training.
deviceInfo – The type of device that this session uses.
inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().
userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().
patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().
name – (Optional) The name of this training session. Default: “training”.
14.1.2. Inference session
#include <popart/session.hpp>
-
class InferenceSession : public popart::Session
InferenceSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware, without any automatic differentiation (backpropagation) or optimization.
Public Functions
-
~InferenceSession() override
Destructor for the InferenceSession class.
-
void popxlSetEngineIsLoaded(bool isLoaded)
Public Static Functions
Create a session for inference from an IR.
- Parameters
ir – The IR to create the session from.
deviceInfo – The type of device that this session uses.
name – The name of this inference session. Default: “inference”.
Create a session for inference from an ONNX model.
- Parameters
model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.
dataFlow – Configuration for the data feeds and fetches.
deviceInfo – The type of device that this session uses.
inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().
userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().
patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().
name – (Optional) The name of this inference session. Default: “inference”.
14.1.3. Session options
#include <popart/sessionoptions.hpp>
-
enum class popart::AccumulateOuterFragmentSchedule
Enum type that determines how the operations in the accumulate outer fragment will be scheduled across virtual graphs (only relevant to pipelined modes).
Values:
-
enumerator Scheduler = 0
Don’t add additional constraints and let the scheduler work it out.
-
enumerator Serial
Add constraints that ensure ops are executed in virtual graph ID order.
-
enumerator OverlapCycleOptimized
Try and parallelise ops with different virtual graph IDs as much as possible.
-
enumerator OverlapMemoryOptimized
Try and parallelise ops with different virtual graph IDs but avoid certain steps that are costly in terms of memory usage.
-
enum class popart::AutodiffStitchStrategy
Enum type representing a strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph.
Strategies may expose tensors that would otherwise have been internal to the forward graph as outputs of this forward graph.
Values:
-
enumerator RecomputeMinimal = 0
Recompute any backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph.
-
enumerator RecomputeAllNonInputs
Recompute any backward graph inputs associated with non-gradient forward graph tensors that are not inputs in the forward graph.
-
enumerator AddFwdOutputs
For backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph, add them as outputs to the forward graph.
-
enumerator SafeAddFwdOutputs
Like AutodiffStitchStrategy::AddFwdOutputs except that those backward graph inputs that can’t be stitched with AutodiffStitchStrategy::AddFwdOutputs (that is, by adding outputs to the forward graph) are stitched using the AutodiffStitchStrategy::RecomputeMinimal strategy instead.
This means that this is a safe strategy to use as an Autodiff default.
-
enumerator N
Number of AutodiffStitchStrategy values.
-
enum class popart::BatchSerializationBatchSchedule
Enum type that describes how to change the batch serialisation subgraph schedule before outlining.
Note
This setting is experimental and may change.
Values:
-
enumerator Scheduler = 0
Don’t encourage any particular scheduling for ops within batch subgraphs (leave it to the scheduler) but tell the scheduler to schedule subgraphs in sequence.
-
enumerator Isomorphic
Encourage all ops within batch subgraphs to be scheduled identically and for each subgraph to be scheduled in sequence (good for outlineability).
-
enumerator OverlapOnIo
Attempt to put the remote load op for batch N+1 right after the compute phase of batch N.
-
enumerator OverlapOnCompute
Attempt to put the remote load op for batch N+1 right before the compute phase of batch N.
-
enumerator N
The number of BatchSerializationBatchSchedule values.
-
enum class popart::BatchSerializationMethod
Enum type that describes how to apply the batch serialization.
Note
This setting is experimental and may change.
Values:
-
enumerator UnrollDynamic = 0
Unroll the batch with dynamic slicing.
-
enumerator UnrollStatic
Unroll the batch with static slicing.
-
enumerator Loop
Loop over the batch dimension.
-
enumerator N
The number of BatchSerializationMethod values.
-
enum class popart::BatchSerializationTransformContext
Enum type that describes when to apply batch serialization.
Note
This setting is experimental and may change.
Values:
-
enumerator Fwd = 0
Apply batch serialisation before growing the backward pass.
-
enumerator Bwd
Apply batch serialisation after growing the backward pass.
-
enumerator N
The number of BatchSerializationTransformContext values.
-
enum class popart::ExecutionPhaseIOSchedule
Enum type to specify when to load tensors.
Values:
-
enumerator Preload = 0
Preload tensors in previous phase for use in current phase.
-
enumerator OnDemand
Load tensors just before they are required.
-
enumerator N
The number of ExecutionPhaseIOSchedule values.
-
enum class popart::ExecutionPhaseSchedule
Enum type to specify the order of processing optimizer operations for different weights of the same execution phase.
The steps for phased execution are:
1. Copy to IO tiles if necessary.
2. Run collective operations if necessary.
3. Load optimizer state.
4. Update optimizer state.
5. Apply optimizer.
6. Store updated tensor if necessary.
Values:
-
enumerator Interleaving = 0
Process above steps for one weight at a time (for example: 123456, 123456, 123456).
The scheduler may interleave these steps.
-
enumerator Batch
Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange (for example: 333, 111, 222, 444, 555, 666).
-
enumerator BatchClusteredIO
Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange, and maximise stream copy merges by keeping RemoteLoad/RemoteStore operations clustered (for example: 333, 111, 222, 444, 555, 666).
-
enumerator N
The number of ExecutionPhaseSchedule values.
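The digit notation used in the enumerator descriptions (each digit is one of the six steps, repeated once per weight) can be reproduced with a small sketch. This is only an illustration of the notation, not of how the PopART scheduler works; the function names are hypothetical, and the step order used for the batch case is copied from the example given for ExecutionPhaseSchedule::Batch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Interleaving: all six steps for one weight, then the next weight.
inline std::string interleavingOrder(int numWeights) {
  assert(numWeights >= 0);
  std::string order;
  for (int w = 0; w < numWeights; ++w) {
    if (w > 0) {
      order += ", ";
    }
    order += "123456";
  }
  return order;
}

// Batch: one step for all weights, then the next step. The step order
// 3, 1, 2, 4, 5, 6 is taken from the documented example.
inline std::string batchOrder(int numWeights) {
  assert(numWeights >= 0);
  const std::vector<int> stepOrder = {3, 1, 2, 4, 5, 6};
  std::string order;
  for (std::size_t i = 0; i < stepOrder.size(); ++i) {
    if (i > 0) {
      order += ", ";
    }
    order += std::string(static_cast<std::size_t>(numWeights),
                         static_cast<char>('0' + stepOrder[i]));
  }
  return order;
}
```

For three weights, the two functions return the same strings as the examples in the enumerator descriptions.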
-
enum class popart::GradientTensorTrackingMethod
Enum type to specify the method for selecting gradient tensors whose statistics are to be tracked for the AutomaticLossScale transform.
Values:
-
enumerator AllNonViewChangingGradientTensors = 0
Track all gradients of non-view-changing gradient tensors.
-
enumerator ConvAndMatmulGradients
Track all gradients of inputs to MatMul and Convolution ops.
-
enumerator GradientsOfUserSpecifiedTensors
Track gradients of user-specified tensors.
-
enumerator N
The number of GradientTensorTrackingMethod values.
-
enum class popart::Instrumentation
Enum type used to specify an instrumentation type.
Values:
-
enumerator Outer = 0
Outer loop instrumentation, graph over all IPUs.
-
enumerator Inner
Inner loop instrumentation, graph per IPU.
-
enumerator N
The number of Instrumentation values.
-
enum class popart::IrSerializationFormat
Enum type used to specify a serialization format.
Values:
-
enumerator JSON
JavaScript Object Notation (JSON).
-
enum class popart::MeanReductionStrategy
Enum type that specifies when to divide by a mean reduction factor, when doing mean reduction over a sequence of tensors \(t_1, t_2, ..., t_k\).
Values:
-
enumerator Running = 0
Keep the reduction buffer as the mean of the tensors accumulated so far.
If \(t_1, ..., t_f\) has just been processed, the current accumulator \(s\) is the mean of these values, and the next accumulator update is \(s = \frac{f}{f+1} * s + \frac{1}{f+1} * t_{f+1}\) to keep \(s\) a running mean.
This strategy guarantees \(s \le \max(t_1, ..., t_k)\) throughout the accumulation, therefore it will not overflow, but it is generally slower than MeanReductionStrategy::Post.
-
enumerator Post
Keep the accumulation factor as the running sum, and divide once by \(k\) at the end of the accumulation.
This strategy will generally be faster than MeanReductionStrategy::Running, but is prone to overflow (especially when using fp16).
-
enumerator N
The number of MeanReductionStrategy values.
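The difference between the two strategies can be seen in a minimal numeric sketch. It uses scalar `float` values in place of tensors, and the function names `runningMean` and `postMean` are hypothetical; the update rule in `runningMean` is the one given for MeanReductionStrategy::Running.

```cpp
#include <cassert>
#include <vector>

// Running: s = f/(f+1) * s + 1/(f+1) * t_{f+1}, so s is always a mean.
inline float runningMean(const std::vector<float> &ts) {
  assert(!ts.empty());
  float s = 0.0f;
  float f = 0.0f; // number of values processed so far
  for (float t : ts) {
    s = (f / (f + 1.0f)) * s + (1.0f / (f + 1.0f)) * t;
    f += 1.0f;
  }
  return s;
}

// Post: accumulate the plain sum, then divide once by k at the end.
inline float postMean(const std::vector<float> &ts) {
  assert(!ts.empty());
  float sum = 0.0f;
  for (float t : ts) {
    sum += t;
  }
  return sum / static_cast<float>(ts.size());
}
```

For small inputs both functions agree up to rounding. For inputs close to the largest finite `float`, for example two values of 3e38, the intermediate sum in `postMean` overflows to infinity while `runningMean` stays finite, which is the trade-off described above.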
-
enum class popart::MergeVarUpdateType
Enum type used to specify which VarUpdateOp ops to merge.
Values:
-
enumerator None = 0
Do not merge VarUpdateOp ops.
-
enumerator All
Merge all VarUpdateOp ops into as few groups as possible.
This is a good choice when memory is not a constraint.
-
enumerator AutoLoose
Merge into groups while attempting not to increase maximum variable liveness, and also not slice tensor variables so they will need to be processed by different VarUpdateOp ops.
-
enumerator AutoTight
Merge into groups, so that VarUpdateOp ops process tensors of exactly SessionOptions::mergeVarUpdateMemThreshold in size.
-
enumerator N
The number of MergeVarUpdateType values.
-
enum class popart::RecomputationType
Enum type to specify which ops to recompute in the backward pass when doing auto-recomputation.
Values:
-
enumerator None = 0
No ops are recomputed (Default).
-
enumerator Standard
Recompute using an algorithm that picks checkpoints to try and minimise max liveness.
-
enumerator NormOnly
Only Norm ops (+ non-linearities, if following) are recomputed.
-
enumerator Pipeline
Recompute all forward pipeline stages.
-
enumerator RecomputeAll
Recompute all ops.
-
enumerator N
The number of RecomputationType values.
-
enum class popart::SubgraphCopyingStrategy
Enum type that describes how copies for inputs and outputs for subgraphs are lowered.
Currently this only affects subgraphs associated with CallOp ops.
Values:
-
enumerator OnEnterAndExit = 0
Copy all inputs before the start of the subgraph, copy all outputs after all ops in the subgraph.
With this strategy, subgraphs will always map to a single Poplar function.
-
enumerator JustInTime
Copy inputs just before they are consumed and copy outputs as soon as they are produced.
With this strategy, subgraphs may be lowered into multiple Poplar functions.
-
enumerator N
The number of SubgraphCopyingStrategy values.
-
enum class popart::SyntheticDataMode
Enum type used to specify the data source for input tensors.
Values:
-
enumerator Off = 0
Use real data.
-
enumerator Zeros
Input tensors are initialised to all zeros.
-
enumerator RandomNormal
Input tensors are initialised with a random normal distribution ~N(0,1).
-
enumerator RandomUniform
Input tensors are initialised with a uniform distribution.
-
enumerator N
The number of SyntheticDataMode values.
-
enum class popart::VirtualGraphMode
Enum type used to specify a virtual graph mode.
Values:
-
enumerator Off = 0
Virtual graphs are not enabled.
-
enumerator Manual
User must set the popart::Op::virtualGraph attribute on all ops.
-
enumerator Auto
Use the AutoVirtualGraph transform.
-
enumerator ExecutionPhases
Virtual graphs are tied to execution phases.
-
enumerator N
The number of VirtualGraphMode values.
-
struct AccumulateOuterFragmentSettings
A structure containing accumulate outer fragment settings.
Public Functions
-
AccumulateOuterFragmentSettings() = default
-
inline AccumulateOuterFragmentSettings(AccumulateOuterFragmentSchedule schedule_, const std::vector<int> &excludedVirtualGraphs_)
Constructor for AccumulateOuterFragmentSettings.
- Parameters
schedule_ – Indicate how to schedule the accumulate outer fragment. This setting is experimental and may change. Default: AccumulateOuterFragmentSchedule::Serial
excludedVirtualGraphs_ – The IDs of the virtual graphs to explicitly exclude from parallelisation. This setting is experimental and may change.
Public Members
-
AccumulateOuterFragmentSchedule schedule = AccumulateOuterFragmentSchedule::Serial
Indicate how to schedule the accumulate outer fragment.
Note
This setting is experimental and may change.
-
std::vector<int> excludedVirtualGraphs = {}
The IDs of the virtual graphs to explicitly exclude from parallelisation.
Note
This setting is experimental and may change.
-
struct AutodiffSettings
The settings for the Autodiff transform.
Public Functions
-
AutodiffSettings() = default
Default constructor for the AutodiffSettings struct.
-
inline AutodiffSettings(AutodiffStitchStrategy stitchStrategy_)
Constructor for the AutodiffSettings struct.
- Parameters
stitchStrategy_ – The strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph. Default: AutodiffStitchStrategy::RecomputeAllNonInputs.
Public Members
-
AutodiffStitchStrategy stitchStrategy = AutodiffStitchStrategy::RecomputeAllNonInputs
The strategy PopART should use to ensure that all graph inputs of a backward graph are available as either inputs or outputs of the forward graph or gradients of outputs of the forward graph.
Note
This is an experimental option and may change.
-
struct AutomaticLossScalingSettings
A structure containing user configuration for automatic loss scaling settings.
Note
Automatic loss scaling is in preview. It is well tested and enabled in some of our example applications, but may not behave as expected in all models. Recommendation: if your model with automatic loss scaling enabled does not converge or triggers a compilation error, then you will need to set the loss scale manually.
Public Functions
-
AutomaticLossScalingSettings() = default
Default constructor for AutomaticLossScalingSettings.
-
AutomaticLossScalingSettings(bool enabled_, const nonstd::optional<std::vector<TensorId>> &toTrackTensors_, float binEdgeLocation_, float thresholdUpperCountProportion_, int updatePeriod_, GradientTensorTrackingMethod gradientTensorTrackingMethod_)
Constructor for AutomaticLossScalingSettings.
- Parameters
enabled_ – Indicate whether to keep track (true) or not (false) of the distribution of gradient tensor elements over the floating point range. Default: false.
toTrackTensors_ – An optional list of model tensor names, for which gradient statistics will be collected. If not set, the gradients of all tensors produced by default operations (matmul, conv) will be used.
binEdgeLocation_ – The location of the bin edge as a proportion of the absolute numerical range of the tracked gradient tensor elements, in the range [0, 1]. 0 represents the smallest representable value, and 1 the maximum. This is the single bin edge of the histogram that is an input to the loss scale updater algorithm. Default: 0.125.
thresholdUpperCountProportion_ – The proportion of the elements in the upper bin above which the loss scale is increased, and below which the loss scale is decreased. Should be in the range [0, 1]. Default: 1e-7.
updatePeriod_ – Indicate how often the loss scale update factor should be updated with respect to optimizer steps. Default: 1
gradientTensorTrackingMethod_ – The method for selecting gradient tensors whose statistics are to be tracked. Default: GradientTensorTrackingMethod::AllNonViewChangingGradientTensors.
-
std::size_t hash() const
Public Members
-
bool enabled = false
-
float binEdgeLocation = 0.125f
-
float thresholdUpperCountProportion = 1e-7
-
int updatePeriod = 1
-
GradientTensorTrackingMethod gradientTensorTrackingMethod = GradientTensorTrackingMethod::AllNonViewChangingGradientTensors
-
struct BatchSerializationSettings
A structure containing batch serialization settings.
Public Functions
-
BatchSerializationSettings() = default
Default constructor for BatchSerializationSettings.
-
BatchSerializationSettings(int factor_, bool concatOnVirtualGraphChange_, bool concatOnExecutionPhaseChange_, bool concatOnPipelineStageChange_, BatchSerializationTransformContext transformContext_ = BatchSerializationTransformContext::Fwd, BatchSerializationMethod method_ = BatchSerializationMethod::UnrollDynamic, BatchSerializationBatchSchedule batchSchedule_ = BatchSerializationBatchSchedule::Isomorphic)
Constructor for BatchSerializationSettings.
- Parameters
factor_ – The number of compute batches to split operations into. Default: 0.
concatOnVirtualGraphChange_ – Indicate whether to break batch serialization chains (true) when the virtual graph changes (by concatenating the compute batches to the local batch). Default: true.
concatOnExecutionPhaseChange_ – Indicate whether to break batch serialization chains (true) when the execution phase changes (by concatenating the compute batches to the local batch). Default: true.
concatOnPipelineStageChange_ – Indicate whether to break batch serialization chains (true) when the pipeline stage changes (by concatenating the compute batches to the local batch). Default: true.
transformContext_ – An experimental value to control when batch serialization is applied. Default: BatchSerializationTransformContext::Fwd.
method_ – An experimental value to control how batch serialization is applied. Default: BatchSerializationMethod::UnrollDynamic.
batchSchedule_ – An experimental value that changes how operations are scheduled. Default: BatchSerializationBatchSchedule::Isomorphic.
Public Members
-
int factor = 0
The number of compute batches to split operations into.
-
bool concatOnVirtualGraphChange = true
Break batch serialization chains when the virtual graph changes (by concatenating the compute batches to the local batch).
-
bool concatOnExecutionPhaseChange = true
Break batch serialization chains when the execution phase changes (by concatenating the compute batches to the local batch).
-
bool concatOnPipelineStageChange = true
Break batch serialization chains when the pipeline stage changes (by concatenating the compute batches to the local batch).
-
BatchSerializationTransformContext transformContext = BatchSerializationTransformContext::Fwd
Experimental value to control when batch serialization is applied.
-
BatchSerializationMethod method = BatchSerializationMethod::UnrollDynamic
Experimental value to control how batch serialization is applied.
-
BatchSerializationBatchSchedule batchSchedule = BatchSerializationBatchSchedule::Isomorphic
Experimental value that changes how operations are scheduled.
-
struct ExecutionPhaseSettings
A structure containing ExecutionPhase settings.
Public Functions
-
ExecutionPhaseSettings() = default
Default constructor for ExecutionPhaseSettings.
-
inline ExecutionPhaseSettings(int phases_, bool stages_, ExecutionPhaseIOSchedule weightIOSchedule_, ExecutionPhaseIOSchedule activationIOSchedule_, ExecutionPhaseIOSchedule optimizerStateIOSchedule_, ExecutionPhaseIOSchedule accumulatorIOSchedule_, ExecutionPhaseSchedule schedule_)
Constructor for ExecutionPhaseSettings.
- Parameters
phases_ – The number of execution phases for the whole model. Default: 0.
stages_ – The number of overlapping stages:
1: Parallel streaming memory, default for 1 IPU per replica.
2: PingPong between 2 IPUs, default for 2 or more IPUs per replica (Default).
weightIOSchedule_ – The execution phase IO schedule for weight tensors. Default: ExecutionPhaseIOSchedule::Preload.
activationIOSchedule_ – The execution phase IO schedule for activation and gradient tensors. Default: ExecutionPhaseIOSchedule::Preload.
optimizerStateIOSchedule_ – The execution phase IO schedule for optimizer state tensors. Default: ExecutionPhaseIOSchedule::OnDemand.
accumulatorIOSchedule_ – The execution phase IO schedule for gradient accumulator tensors. Default: ExecutionPhaseIOSchedule::Preload.
schedule_ – An experimental value that changes how operations are scheduled. Default: ExecutionPhaseSchedule::Interleaving.
Public Members
-
int phases = 0
Number of ExecutionPhases for the whole model.
-
int stages = 2
Number of overlapping stages.
1: Parallel streaming memory, default for 1 IPU per replica.
2: PingPong between 2 IPUs, default for 2 or more IPUs per replica.
-
ExecutionPhaseIOSchedule weightIOSchedule = ExecutionPhaseIOSchedule::Preload
The execution phase IO schedule for weight tensors.
-
ExecutionPhaseIOSchedule activationIOSchedule = ExecutionPhaseIOSchedule::Preload
The execution phase IO schedule for activation and gradient tensors.
-
ExecutionPhaseIOSchedule optimizerStateIOSchedule = ExecutionPhaseIOSchedule::OnDemand
-
ExecutionPhaseIOSchedule accumulatorIOSchedule = ExecutionPhaseIOSchedule::Preload
-
struct ReplicatedCollectivesSettings
A structure containing settings for replicated collective operations.
Public Functions
-
ReplicatedCollectivesSettings(bool prepareScheduleForMergingCollectives = false, bool mergeAllReduceCollectives = false, bool mergeReduceScatterCollectives = false, bool mergeAllGatherCollectives = false)
Constructor for the ReplicatedCollectivesSettings struct.
- Parameters
prepareScheduleForMergingCollectives – Insert constraints into the schedule such that collectives which can be merged occur one right after the other. true to insert constraints, false otherwise. Default: false.
mergeAllReduceCollectives – Identify allreduce operations which can be scheduled at the same time, and perform them as one larger operation to better utilize the bandwidth between replicas. true to identify operations, false otherwise. Default: false.
-
std::size_t hash() const
Public Members
-
bool prepareScheduleForMergingCollectives = false
-
bool mergeAllReduceCollectives = false
-
bool mergeReduceScatterCollectives = false
Identifies reduce-scatter operations which can be scheduled at the same time, and performs them as one larger operation so as to better utilize the bandwidth between replicas.
-
bool mergeAllGatherCollectives = false
Identifies allgather operations which can be scheduled at the same time, and performs them as one larger operation so as to better utilize the bandwidth between replicas.
-
struct SessionOptions
A structure containing user configuration options for the Session class.
Public Functions
-
inline bool explicitPipeliningEnabled() const
Check whether explicit pipelining is enabled.
Determined from the values of enablePipelining, useHostCopyOps and enableExplicitMainLoops.
-
inline bool implicitPipeliningEnabled() const
Check whether implicit pipelining is enabled.
Determined from the values of enablePipelining, useHostCopyOps and enableExplicitMainLoops.
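This reference does not spell out how the three flags combine. The following is a minimal sketch of one plausible combination. The struct PipelineFlags and both boolean expressions are assumptions made for illustration, not PopART's confirmed implementation.

```cpp
// Hypothetical stand-in for the three SessionOptions flags named above.
struct PipelineFlags {
  bool enablePipelining = false;
  bool useHostCopyOps = false;
  bool enableExplicitMainLoops = false;
};

// Assumption: explicit pipelining requires all three flags to be set.
inline bool explicitPipeliningEnabled(const PipelineFlags &f) {
  return f.enablePipelining && f.useHostCopyOps && f.enableExplicitMainLoops;
}

// Assumption: any other configuration with pipelining on is implicit.
inline bool implicitPipeliningEnabled(const PipelineFlags &f) {
  return f.enablePipelining &&
         !(f.useHostCopyOps && f.enableExplicitMainLoops);
}
```

Under these assumptions the two predicates are mutually exclusive, and both are false when enablePipelining is false.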
-
inline void enableExplicitIR(bool enable)
Enable explicit representations in the IR (code paths).
Enabled if true, otherwise not.
-
bool shouldDelayVarUpdates() const
-
int64_t getGlobalReplicationFactor() const
Get the global replication factor.
- Returns
If enableDistributedReplicatedGraphs is true, then return globalReplicationFactor. If enableReplicatedGraphs is true, then return replicatedGraphCount. Otherwise return 1.
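The precedence described in the return value can be sketched as follows. ReplicationOptions is a hypothetical stand-in that holds only the four fields this getter reads. It is not the real SessionOptions type.

```cpp
#include <cstdint>

// Hypothetical stand-in for the relevant SessionOptions members.
struct ReplicationOptions {
  bool enableDistributedReplicatedGraphs = false;
  bool enableReplicatedGraphs = false;
  std::int64_t globalReplicationFactor = 1;
  std::int64_t replicatedGraphCount = 1;
};

// Follows the documented order: distributed first, then local, else 1.
inline std::int64_t getGlobalReplicationFactor(const ReplicationOptions &o) {
  if (o.enableDistributedReplicatedGraphs) {
    return o.globalReplicationFactor;
  }
  if (o.enableReplicatedGraphs) {
    return o.replicatedGraphCount;
  }
  return 1;
}
```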
-
unsigned getAccumulationFactor() const
Get the gradient accumulation factor.
Throws an error if gradient accumulation is not enabled (enableGradientAccumulation is false) and the factor (accumulationFactor) is set to >1.
- Returns
The accumulation factor.
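A minimal sketch of the check described above. AccumulationOptions is a hypothetical stand-in for the two relevant members, and std::runtime_error stands in for whatever error type PopART actually throws.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for the relevant SessionOptions members.
struct AccumulationOptions {
  bool enableGradientAccumulation = false;
  std::int64_t accumulationFactor = 1;
};

// Rejects a factor > 1 when gradient accumulation is disabled.
inline unsigned getAccumulationFactor(const AccumulationOptions &o) {
  if (!o.enableGradientAccumulation && o.accumulationFactor > 1) {
    // Assumption: the real error type and message differ.
    throw std::runtime_error(
        "accumulationFactor > 1 but enableGradientAccumulation is false");
  }
  return static_cast<unsigned>(o.accumulationFactor);
}
```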
-
bool autoRecomputationEnabled() const
Returns true if auto-recomputation is enabled, false otherwise.
-
inline SessionOptions()
Constructor for SessionOptions.
Public Members
-
std::string logDir
A directory for log traces to be written into.
-
std::set<std::string> dotChecks = {}
When to write .dot files during IR construction.
-
int firstDotOp = 0
The ops written to the .dot file will be a part of the schedule, controlled by firstDotOp and finalDotOp. In particular, it will be [max(0, firstDotOp), min(N ops in IR, finalDotOp)).
-
int finalDotOp = 10000
See firstDotOp.
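The half-open range selected by firstDotOp and finalDotOp can be computed as in the sketch below. The helper dotOpRange and its parameter nOpsInIr are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <utility>

// Returns the half-open range [first, last) of schedule positions that
// would be written to the .dot file. The range is empty if first >= last.
inline std::pair<int, int> dotOpRange(int firstDotOp, int finalDotOp,
                                      int nOpsInIr) {
  const int first = std::max(0, firstDotOp);
  const int last = std::min(nOpsInIr, finalDotOp);
  return {first, last};
}
```

With the default values (0 and 10000) and an IR of 250 ops, every op in the schedule falls inside the range.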
-
bool dotOpNames = false
Enable inclusion of the op name in the .dot file (the op type is always exported).
Enabled when true. Default: false.
-
bool exportPoplarComputationGraph = false
Enable export of Poplar computational graph.
Enabled when true. Default: false.
-
bool exportPoplarVertexGraph = false
Enable export of Poplar vertex graph.
Enabled when true. Default: false.
-
bool separateCallOpPdfs = true
Enable creation of separate PDFs for each subgraph when generating PDFs of IR graphs.
Enabled when true. Default: true.
-
bool enableOutlining = true
Enable outlining.
This identifies and extracts repeated parts of computational graph into subgraphs. Enabled when true. Default: true.
-
bool enableOutliningCopyCostPruning = true
Enable inclusion of the cost of copying cached sections in the outlining cost model.
Enabled when true. Default: true.
-
float outlineThreshold = 1.0f
Specify the incremental value that a sub-graph requires, relative to its nested sub-graphs (if any), to be eligible for outlining.
A high threshold results in fewer sub-graphs being outlined, a negative value results in all being outlined. The gross value of a sub-graph is the sum of its constituent ops’ Op::getSubgraphValue() values. To disable outlining, it is better to set enableOutlining to false than to set this value to infinity. The default value of 1.0f results in all high value operations such as convolution being cached, but standalone low value operations such as ReLU will not be.
Default: 1.0f.
-
float outlineSequenceBreakCost = 10000.0f
Specify the penalty applied to outlining potential sub-graphs if the sub-graph to be created breaks up a sequence of operations that are more efficient (for example for overlapping compute and exchange) when outlined together.
Default: 10000.0f.
-
SubgraphCopyingStrategy subgraphCopyingStrategy = SubgraphCopyingStrategy::OnEnterAndExit
Specify how copies for inputs and outputs for subgraphs are lowered.
Setting this value to SubgraphCopyingStrategy::JustInTime may save memory at the cost of fragmenting subgraphs into multiple Poplar functions. This may be particularly useful when a number of weight updates are outlined in one subgraph, as it may prevent multiple weight tensors from being live at the same time inside the subgraph.
Default: SubgraphCopyingStrategy::OnEnterAndExit.
-
RecomputationType autoRecomputation = RecomputationType::None
Enable recomputation of operations in the graph in the backward pass.
This will reduce model size at the cost of computation cycles.
Default: RecomputationType::None (no recomputation).
-
MergeVarUpdateType mergeVarUpdate = MergeVarUpdateType::None
Enable merging of VarUpdates into groups of VarUpdates, by flattening and concatenating variable tensors and updating tensors.
Default: MergeVarUpdateType::None (no merging).
-
int64_t mergeVarUpdateMemThreshold = 1000000
Specify the memory threshold for VarUpdateOp merging algorithms.
The MergeVarUpdateType::AutoLoose and MergeVarUpdateType::AutoTight VarUpdateOp merging algorithms have a threshold on the total memory of variable tensors to merge for updating. Defined as total memory in bytes.
Default: 1000000.
-
int64_t looseThresholdAtPeak = 8000
Specify the threshold at peak used in the calculation of the absolute threshold in the MergeVarUpdateType::AutoLoose VarUpdateOp merging algorithm.
min(mergeVarUpdateMemThreshold, liveAtPeak - liveCurrently + looseThresholdAtPeak)
where:
liveAtPeak is an estimate of the maximum live memory of the computation; and
liveCurrently is an estimate of the live memory where the threshold is being used to determine whether to schedule or postpone a VarUpdateOp.
Default: 8000.
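As a worked illustration of the formula, the sketch below evaluates the absolute threshold for given memory estimates. The function name looseAbsoluteThreshold is hypothetical. Only the expression inside it comes from the description above.

```cpp
#include <algorithm>
#include <cstdint>

// Evaluates
// min(mergeVarUpdateMemThreshold,
//     liveAtPeak - liveCurrently + looseThresholdAtPeak).
inline std::int64_t looseAbsoluteThreshold(
    std::int64_t mergeVarUpdateMemThreshold, std::int64_t liveAtPeak,
    std::int64_t liveCurrently, std::int64_t looseThresholdAtPeak) {
  return std::min(mergeVarUpdateMemThreshold,
                  liveAtPeak - liveCurrently + looseThresholdAtPeak);
}
```

With mergeVarUpdateMemThreshold = 1000000 and looseThresholdAtPeak = 8000 (the declared defaults), liveAtPeak = 500000 and liveCurrently = 450000 give 58000. Far from the peak (for example liveAtPeak = 5000000, liveCurrently = 0) the result is capped at 1000000.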
-
bool rearrangeAnchorsOnHost = true
Enable rearrangement (in memory) of anchor tensors to be done on the host.
Before anchor tensors are streamed from device to host, they are not necessarily arranged in memory as required when they are to be copied from host stream to host. This can be done on the device or on the host.
Default: true (rearrangement done on host to save memory, but often at the expense of cycles, especially for larger anchor tensors).
-
bool rearrangeStreamsOnHost = false
Enable rearrangement (in memory) of stream tensors to be done on the host.
Before stream tensors are streamed from host to device, they are not necessarily arranged in memory as required when they are to be copied from host stream to device. This can be done on the device or on the host.
Default: false (rearrangement done on device).
-
bool enablePrefetchDatastreams = true
Enable prefetching for input data streams.
Poplar will speculatively read data for a stream before it is required in order to allow the ‘preparation’ of the data to occur in parallel with compute. Enabled when true. Default: true.
-
unsigned defaultBufferingDepth = 1
Specify the default buffering depth value used for streams that are not re-arranged on the host.
For tensors that are rearranged on the host, a buffering depth of 1 will always be used. This default value can be overridden via bufferingDepthMap.
-
unsigned defaultPrefetchBufferingDepth = initialDefaultPrefetchBufferingDepthValue
- Deprecated:
This session option name has been deprecated and will be removed in a future release.
-
std::map<TensorId, unsigned> bufferingDepthMap
This mapping can be used to set stream-specific buffering depths.
The buffering depth could be thought of as being the size of a circular buffer that feeds data to and from Poplar. A buffering depth greater than 1 may improve the performance due to increased parallelisation but comes at the cost of increasing the memory footprint. Streams for tensors that have no entry in this map will default to 1 (if a tensor is rearranged on host) or defaultBufferingDepth (if a tensor is not rearranged on host). Specifying a tensor that gets rearranged on host in this map will throw an error.
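The lookup rules above can be summarised in a short sketch. The function resolveBufferingDepth, the rearrangedOnHost flag and the use of std::invalid_argument are hypothetical illustrations. TensorId is assumed here to be a string alias.

```cpp
#include <map>
#include <stdexcept>
#include <string>

using TensorId = std::string; // assumption: stand-in for PopART's TensorId

// Resolves the buffering depth for one stream following the documented rules.
inline unsigned resolveBufferingDepth(
    const TensorId &id, bool rearrangedOnHost, unsigned defaultBufferingDepth,
    const std::map<TensorId, unsigned> &bufferingDepthMap) {
  const auto it = bufferingDepthMap.find(id);
  if (it != bufferingDepthMap.end()) {
    if (rearrangedOnHost) {
      // A map entry for a tensor rearranged on host is an error.
      throw std::invalid_argument("buffering depth set for tensor " + id +
                                  " which is rearranged on host");
    }
    return it->second;
  }
  // No entry: 1 if rearranged on host, otherwise the default depth.
  return rearrangedOnHost ? 1u : defaultBufferingDepth;
}
```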
-
std::map<TensorId, unsigned> prefetchBufferingDepthMap
- Deprecated:
This session option name has been deprecated and will be removed in a future release.
-
bool enableNonStableSoftmax = false
Enable the non-stable softmax Poplar function.
By default, the stable softmax Poplar function is used. The input tensor to softmax, \(x\), is preprocessed by subtracting \(max(x)\) from each element before computing the exponentials, ensuring numerical stability. If the inputs to the softmax operations are small enough to not cause overflow when computing the exponential, then the non-stable version can be enabled instead, to increase the speed.
Default: false (not enabled).
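The difference between the two variants can be seen in a plain host-side sketch. This is not the Poplar implementation. stableSoftmax and nonStableSoftmax are hypothetical reference functions written only to show the max-subtraction step.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Stable variant: subtract max(x) so every exponent is <= 0.
inline std::vector<float> stableSoftmax(const std::vector<float> &x) {
  std::vector<float> y(x.size());
  if (x.empty()) {
    return y;
  }
  const float maxX = *std::max_element(x.begin(), x.end());
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = std::exp(x[i] - maxX);
    sum += y[i];
  }
  for (float &v : y) {
    v /= sum;
  }
  return y;
}

// Non-stable variant: exponentiate directly. Large inputs overflow.
inline std::vector<float> nonStableSoftmax(const std::vector<float> &x) {
  std::vector<float> y(x.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = std::exp(x[i]);
    sum += y[i];
  }
  for (float &v : y) {
    v /= sum;
  }
  return y;
}
```

For small inputs both functions agree. For inputs such as 1000.0f the direct exponential overflows in single precision, which is the case the stable version guards against.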
-
bool enableReplicatedGraphs = false
Enable replication of graphs. Default: false (not enabled).
-
bool enableGradientAccumulation = false
Enable gradient accumulation. Default: false (not enabled).
-
ReductionType accumulationAndReplicationReductionType = ReductionType::Sum
Specify how gradients are reduced when using gradient accumulation and graph replication.
Default: ReductionType::Sum.
-
MeanReductionStrategy meanAccumulationAndReplicationReductionStrategy = MeanReductionStrategy::Post
Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType::Mean.
Default: MeanReductionStrategy::Post.
-
int64_t replicatedGraphCount = 1
Specify the number of model replications.
If enableReplicatedGraphs is true, replicatedGraphCount will set the number of model replications. For example, if the model uses 1 IPU, a replicatedGraphCount of 2 will use 2 IPUs. If the model is pipelined across 4 IPUs, a replicatedGraphCount of 4 will use 16 IPUs in total. Therefore, the number of IPUs requested must be a multiple of replicatedGraphCount. If the training is done across multiple instances of the program then the replicatedGraphCount is the number of replicas for this instance.
-
int64_t accumulationFactor = 1
Specify the number of micro-batches to accumulate before applying the varUpdate.
-
VirtualGraphMode virtualGraphMode = VirtualGraphMode::Off
Specify how to place ops on virtual graphs to achieve model parallelism, either manually using model annotations, or automatically.
Default: VirtualGraphMode::Off.
-
std::vector<float> virtualGraphSplitRatios
Specify split ratios when VirtualGraphMode::Auto is enabled.
These values represent split ratios in each device and each of the values is in range (0, 1).
For example, to uniformly split the whole graph on 4 IPUs, the value should be [0.25, 0.25, 0.25, 0.25].
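A uniform split such as [0.25, 0.25, 0.25, 0.25] for 4 IPUs can be generated as sketched below. The helper uniformSplitRatios is hypothetical. The requirement of at least 2 IPUs is an assumption derived from each ratio having to lie in the open range (0, 1).

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Builds a vector of equal split ratios, one entry per IPU.
inline std::vector<float> uniformSplitRatios(std::size_t numIpus) {
  if (numIpus < 2) {
    // Assumption: with a single IPU the ratio would be 1, outside (0, 1).
    throw std::invalid_argument("need at least 2 IPUs for a split");
  }
  return std::vector<float>(numIpus, 1.0f / static_cast<float>(numIpus));
}
```

The result has the same element type as virtualGraphSplitRatios, so it could be assigned to that member directly.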
-
bool enablePipelining = false
Enable pipelining of virtual graphs. Default: false (not enabled).
-
SyntheticDataMode syntheticDataMode = SyntheticDataMode::Off
Specify whether to use real or synthetic data to initialize input tensors.
Streaming to/from the host is only enabled for SyntheticDataMode::Off which indicates that real data is being used.
Default: SyntheticDataMode::Off.
-
bool instrumentWithHardwareCycleCounter = false
Add instrumentation to the program to count the number of device cycles (of a single tile, on a single IPU) that the main program takes to execute.
Expect this to have a small detrimental impact on performance.
-
std::set<Instrumentation> hardwareInstrumentations = {Instrumentation::Outer}
-
bool disableGradAccumulationTensorStreams = false
Disable saving of weight gradient tensors off the device.
If true, the weight gradient tensors are not saved off the device when devicex.weightsFromHost() is called.
Note
This option is overridden if syntheticDataMode is not SyntheticDataMode::Off.
Note
Weight gradient tensors that are also optimiser tensors will only be disabled if both disableGradAccumulationTensorStreams and disableOptimizerStateTensorStreams are true.
-
bool disableOptimizerStateTensorStreams = false
Disable streaming of optimizer tensors.
If true, streaming of optimizer tensors is disabled. This setting can be used to conserve memory if you are not interested in checkpointing the optimizer state.
Note
Weight gradient tensors that are also optimiser tensors will only be disabled if both disableGradAccumulationTensorStreams and disableOptimizerStateTensorStreams are true.
-
bool compileEngine = true
Specify whether to compile the Poplar graph after building it.
If false, the backend will build the Poplar graph but not compile it into an Engine. In this case, no execution can be performed, and nothing can be transferred to the device. API calls which retrieve information from the graph building stage, such as tile mapping introspection, can still be used.
-
bool constantWeights = true
Specify an optimization for an inference session to have constant weights.
Set this option to false in order to change the weights with a call to Session::resetHostWeights() after the session has been prepared. This option has no effect on a training session.
Default: true.
-
bool enableEngineCaching = false
Enable Poplar executable caching.
The file is saved to the location defined with cachePath. The file will be in the PopEF format. This means that it can be used to run inference using the Triton Inference Server because Graphcore provides a backend to it. See the Poplar Triton Backend user guide for more information.
Default: false (not enabled).
-
bool enableVariablesCaching = true
Enable variable caching.
This means that the caching process will save variables as additional PopEF blobs to the file location defined with cachePath. If PopART requires data for variables (during the cache reading process), they will be automatically read from the cache file.
Note that turning this off allows a PopART Session to optimise the host memory it consumes during model runtime. Specifically, weightsToHost() can write directly to the IR tensor data buffers. If the option were on, this would not be safe and the session would have to create separate buffers to write the fetched data to.
Default: true (enabled).
-
std::string cachePath = "session_cache"
Folder to save the poplar::Executable to.
-
bool enableFloatingPointChecks = false
Enable throwing of exceptions when floating point errors occur.
Default: false (not enabled).
-
bool enableStochasticRounding = false
Enable stochastic rounding.
PopART will set the Poplar engine option target.deterministicWorkers to true if this option is set and to false if it is not set. Adding a value for “target.deterministicWorkers” to SessionOptions::engineOptions overrides this behaviour.
Default: false (not enabled).
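The override rule for target.deterministicWorkers can be sketched as follows. The function resolveDeterministicWorkers is a hypothetical illustration of the precedence described above, not PopART code.

```cpp
#include <map>
#include <string>

// Returns the value that would be used for target.deterministicWorkers.
inline std::string resolveDeterministicWorkers(
    bool enableStochasticRounding,
    const std::map<std::string, std::string> &engineOptions) {
  const auto it = engineOptions.find("target.deterministicWorkers");
  if (it != engineOptions.end()) {
    // An explicit entry in engineOptions takes precedence.
    return it->second;
  }
  // Otherwise the value follows enableStochasticRounding.
  return enableStochasticRounding ? "true" : "false";
}
```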
-
bool _enableRngStateManagement = false
-
ExecutionPhaseSettings executionPhaseSettings
Configuration settings for execution phases.
-
AccumulateOuterFragmentSettings accumulateOuterFragmentSettings
Configuration setting for operations in the accumulate outer fragment.
-
bool explicitRecomputation = false
Enable explicit recomputation.
Default: false (not enabled).
-
NumIOTiles numIOTiles
Number of IPU tiles dedicated to IO.
-
bool aliasZeroCopy = false
Enable zero-copy for subgraphs.
-
BatchSerializationSettings batchSerializationSettings
Configuration setting for batch serialization.
-
AutodiffSettings autodiffSettings
Configuration settings for the autodiff transform.
-
bool delayVarUpdates = true
Option to delay variable updates as much as possible.
-
bool scheduleNonWeightUpdateGradientConsumersEarly = false
-
bool enableFullyConnectedPass = true
Enable the global fullyConnectedPass option for matmuls.
See also
poplin::matMul(poplar::Graph, poplar::Tensor, poplar::Tensor, poplar::program::Sequence, poplar::Type, poplar::DebugContext, poplar::OptionFlags, matmul::PlanningCache).
-
bool enableSerializedMatmuls = true
Enable/disable the serializing of matmuls.
-
std::string partialsTypeMatMuls
Set the partials type globally for matmuls.
Can be overridden individually with Builder.setPartialsType(). Valid values are "float" and "half". By default, this is not set, so no global partials type is imposed.
-
bool enableStableNorm = false
If true, computes the mean first and subtracts it from the activations before computing the variance.
The implementation with this flag set to true is slower than when set to false. The stable version requires the first order moment to be estimated and applied to the sample set before the second order central moment is calculated.
-
std::map<std::string, std::string> engineOptions
Poplar engine options.
-
std::map<std::string, std::string> convolutionOptions
Poplar convolution options.
-
std::map<std::string, std::string> lstmOptions
Poplar LSTM options.
-
std::map<std::string, std::string> matmulOptions
Poplar matmul options.
-
std::map<std::string, std::string> reportOptions
Poplar reporting options.
-
std::map<std::string, std::string> gclOptions
GCL options.
-
ExperimentalSettings experimentalSettings
Configuration setting for custom transform applier.
-
std::vector<std::string> customCodelets
List of codelet files (with file extension) to be added to the Poplar graph.
See the Poplar documentation for poplar::Graph for more information.
-
std::vector<TensorId> updatableNamedBuffers
List of model named buffers that can be updated with a call to copyNamedBuffersToDevice().
This allows updating just a subset of the model weights instead of all of them, as happens with a copyWeightsToDevice() call.
-
std::string customCodeletCompileFlags
Compile flags for the custom codelets.
For example -g to generate debug info. See the Poplar documentation for poplar::Engine for more information.
-
double timeLimitScheduler = 1e9
The maximum allowed time (in seconds) that can be spent searching for a good graph schedule before a solution must be returned.
-
int64_t swapLimitScheduler = static_cast<int64_t>(1e9)
The maximum number of improving steps allowed by the scheduling algorithm before a solution must be returned.
-
std::string serializedPoprithmsShiftGraphsDir = {}
The directory to serialize Poprithms graphs to.
PopART uses Poprithms for scheduling PopART graphs. The Poprithms graphs created for scheduling can be optionally serialised (written to file). If serializedPoprithmsShiftGraphsDir is empty, then the graphs will not be serialised. The names of serialization files will be poprithms_shift_graph_i.json for the lowest non-existing values of i. The directory must already exist; PopART will not create it.
-
std::string kahnTieBreaker = "greedy"
Specify which method is used to control how ops are scheduled.
The initial scheduling is done with Kahn’s algorithm. When several ops are free to be scheduled, this controls which method is used.
Options are described in the Poprithms KahnTieBreaker enum.
-
size_t transitiveClosureOptimizationThreshold = {100000}
Specify the transitive closure optimization threshold.
The transitive closure optimization pass can significantly accelerate the scheduler. It does not, in general, affect the final schedule returned. It is run between initialization with Kahn’s algorithm and the shifting swaps. The transitive closure optimization pass is O(nOps^2) and so should not be used for extremely large graphs. If a graph is above this threshold, the transitive closure optimization pass is not run.
-
bool decomposeGradSum = false
Enable replacement of single sums of partial gradients with a tree of additions.
This can reduce max liveness at the cost of extra cycles. A typical use case for this would be if a large weight tensor is used as an input to many operations.
Default: false (not enabled).
-
ReplicatedCollectivesSettings replicatedCollectivesSettings
Control the behavior of different collective operations.
-
bool enableDistributedReplicatedGraphs = false
Enable training with Poplar replicated graphs across multiple PopART instances.
Default: false (not enabled).
-
int64_t globalReplicationFactor = 1
The total number of replicas in a multi-instance, replicated-graph training session (this should be left as the default value (1) if distributed replicated graphs are disabled).
This value includes local replication.
-
int64_t globalReplicaOffset = 0
The first replica index that this PopART instance is running.
-
bool groupHostSync = false
Specify to group the streams from the host to the device at the beginning of the schedule, and the streams from the device to the host at the end of the schedule.
This trades off memory usage for speed.
When true, tensors will stay live for longer.
Default: false (not enabled).
Note
This setting has no effect when useHostCopyOps is enabled (true).
-
bool strictOpVersions = true
Enable strict op version checks.
Strict op version checks will throw an error if the exact version of an op required for the model opset is not supported. Turning this check off will cause PopART to fall back to the latest implementation of the op that is supported.
Default: true (enabled).
Warning
Turning off these checks may cause undefined behaviour.
-
bool opxAliasChecking = false
Enable running Opx checks to verify that IR tensor aliasing information corresponds to the lowered Poplar tensor aliasing.
Default: false (not enabled).
-
bool opxModifyChecking = false
Enable running Opx checks to verify that IR tensor modification information corresponds to the lowered Poplar tensor modifications.
Default: false (not enabled).
-
bool useHostCopyOps = false
Enable use of IR graph operations for data and anchor streams.
Default: false (not enabled).
-
bool enableLoadAndOffloadRNGState = false
Enable load and offload of device RNG state from host.
Default: false (not enabled).
-
TensorLocationSettings activationTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}
Tensor location settings for activation/gradient tensors.
-
TensorLocationSettings weightTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}
Tensor location for weight tensors.
-
TensorLocationSettings optimizerStateTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}
Tensor location for optimizer state tensors.
-
TensorLocationSettings accumulatorTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}
Tensor location for gradient accumulator tensors.
-
std::map<TensorId, TensorLocation> tensorLocationSettingsOverride
Override tensor location for specific tensors by setting tensor locations for specific tensor ID values.
-
AutomaticLossScalingSettings automaticLossScalingSettings
Settings to enable and configure the automatic loss scaling behaviour when training.
Note
Automatic loss scaling is in preview. It is well tested and enabled in some of our example applications, but may not behave as expected in all models. Recommendation: if your model with automatic loss scaling enabled does not converge or triggers a compilation error, then you will need to set the loss scale manually.
-
DeveloperSettings developerSettings
Settings for developers to configure testing and benchmarking.
-
bool enableSupportedDataTypeCasting = true
Enable casting to supported data types.
If enabled (true), casts any tensor of unsupported data types to supported data types when lowering to Poplar. Currently, this implies casting:
INT64 -> INT32
UINT64 -> UINT32
The cast will throw an error for incompatible data types and over/underflows, and will warn about narrowing casts.
Default: true (enabled).
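The checked narrowing described above can be pictured with a small host-side sketch. This is not PopART's implementation: `narrowToInt32` is a hypothetical helper that only illustrates an INT64 -> INT32 cast that rejects values outside the INT32 range.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of PopART): cast INT64 data to INT32,
// throwing if any value cannot be represented in the narrower type.
std::vector<int32_t> narrowToInt32(const std::vector<int64_t> &values) {
  std::vector<int32_t> result;
  result.reserve(values.size());
  for (int64_t v : values) {
    if (v < std::numeric_limits<int32_t>::min() ||
        v > std::numeric_limits<int32_t>::max()) {
      throw std::out_of_range("value does not fit in INT32");
    }
    result.push_back(static_cast<int32_t>(v));
  }
  return result;
}
```

Values that fit are copied unchanged, while a value such as 2^40 raises `std::out_of_range` instead of silently wrapping.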
-
bool enableExplicitMainLoops = false
Enable explicit main loop transformation, and disable implicit training loops.
Note
This will be deprecated and enabled by default.
-
bool groupNormStridedChannelGrouping = false
Enable fast math mode for group norms.
Group norms have a fast math mode that changes the implementation to run faster on the IPU. As a consequence, this mode is incompatible with other implementations, for example when running trained weights on the host. The default (false) is to use the correct, but slightly slower, mode.
-
std::function<void(int, int)> compilationProgressLogger
Callback function used to indicate PopART compilation progress.
The function should not block. All calls to the callback function will be made from the main thread so blocking in the callback will block compilation from progressing.
If this logger is not set then compilation progress will be printed on the info channel.
- Param int
The progress value.
- Param int
The maximum value for the progress.
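A callback with the `void(int, int)` signature might look like the sketch below. `ProgressRecorder` is a hypothetical helper, not a PopART type. It only appends to a vector, so it returns quickly and does not hold up compilation.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper (not part of PopART): records (progress, total) pairs.
struct ProgressRecorder {
  std::vector<std::pair<int, int>> ticks;

  // Build a callable with the same signature as compilationProgressLogger.
  // The recorder must outlive the returned callable, which captures `this`.
  std::function<void(int, int)> makeLogger() {
    return [this](int progress, int total) {
      ticks.emplace_back(progress, total);
    };
  }
};

// Assumed usage, given a SessionOptions instance named opts:
//   ProgressRecorder recorder;
//   opts.compilationProgressLogger = recorder.makeLogger();
```

Keeping the callback to a cheap append follows the requirement that it should not block. Any slow processing of the recorded ticks can happen after compilation finishes.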
-
int compilationProgressTotal = 100
Total progress ticks until compilation complete.
-
bool enableMergeExchange = true
Enable merging remote and host IO operations to facilitate IO overlap.
true to enable, otherwise false.
Default: true.
-
bool ensureFp32LossScaleTensor = false
Ensure that the loss scale tensor is fp32 and that this is combined with fp16 activations as late as possible to produce the first fp16 activation gradients.
This makes it possible to choose a loss scale value greater than max(fp16). This is also recommended when automatic loss scaling is enabled. Only compatible with models that have an fp16 loss scale tensor.
true ensures that the loss scale tensor is fp32.
Default: false.
-
bool enableInplaceAmbiguityChecking = false
Enable creation of an AliasModel object for each graph and run the Poprithms ambiguity checker on it.
This throws an error if the graph has a potential inplacing ambiguity.
See poprithms::memory::inplace::Graph::AmbiguityStatus for more information on what constitutes an ambiguity.
If set to true, an AliasModel object is created for each graph and the Poprithms ambiguity checker is run on it. No ambiguity checking is performed if this option is set to false (default). However, inplace fallbacks will occur if necessary.
-
bool createImplicitPipeliningFwdOnlyProgram = false
- Deprecated:
Create a custom program containing the forward pipeline only.
-
bool throwIfLog2ScaleTensorNotInRange = true
If set to true, throw a Poplar error if any fused ops that consume a log2 scale tensor receive a log2 scale tensor value not in the integer range [-32, 32).
If set to false, no error is thrown. However, note that this may lead to undefined behaviour if the value of the log2 scale is outside the range.
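The range [-32, 32) is half-open: -32 is accepted but 32 is not. A minimal sketch of the check follows, using a hypothetical helper name that is not part of PopART.

```cpp
#include <cassert>

// Hypothetical helper (not part of PopART): true if a log2 scale value lies
// in the half-open integer range [-32, 32).
bool log2ScaleInRange(int log2Scale) {
  return log2Scale >= -32 && log2Scale < 32;
}
```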
-
bool enableConstantFoldingOfMultipleConsumers = true
If set to false, disable constant folding on ops if any input has multiple consumers.
Default: true.
-
bool useLoopCandidateCreator = false
Use loop candidate creator for constant if one exists.
Default: false.
-
bool stashAllTensorsInferencePipeline = false
Stash all tensors in an inference pipeline.
Default: false.
-
struct ExperimentalSettings
Public Members
-
std::map<std::string, std::vector<std::string>> customTransformApplierSettings
Custom transform applier settings.
Enables insertion of a custom transform sequence at a predefined checkpoint. Multiple checkpoint names and transform names can be passed for different model configurations.
The predefined checkpoint names are:
FWD0: Initial IR immediately after lowering from ONNX to the IR.
FWD1: After the pre-alias patterns have been applied to FWD0.
BWD0: After growing the backward pass (including the optimiser step). Note this happens before optimiser decomposition, so the optimiser will appear as a single special op rather than the many ops that implement it.
PREALIAS: After pre-alias transforms have been applied to BWD0.
MAINLOOPS: After the MainLoops transform has been applied. This transform adds explicit loop ops to the IR for device iterations (batches per step) and gradient accumulation.
FINAL: The final IR after preparation.
The transform names are defined by PopART and users.
For example, to execute ‘Transform A’ and ‘Transform B’ at the ‘Fwd0’ checkpoint and execute ‘Transform C’ at the ‘Fwd1’ checkpoint:
{ "Fwd0": [ "Transform A", "Transform B" ], "Fwd1": [ "Transform C" ] }
Note
This setting is experimental for inference and may change.
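In C++ the example setting above is a plain `std::map`. The sketch below builds it. The transform names are the placeholders from the example, and `makeExampleTransformSettings` is a hypothetical helper name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Same type as ExperimentalSettings::customTransformApplierSettings.
using TransformSettings = std::map<std::string, std::vector<std::string>>;

// Build the settings from the example in the text. "Transform A",
// "Transform B" and "Transform C" are placeholder transform names.
TransformSettings makeExampleTransformSettings() {
  TransformSettings settings;
  settings["Fwd0"] = {"Transform A", "Transform B"};
  settings["Fwd1"] = {"Transform C"};
  return settings;
}
```

The returned map can then be assigned to the customTransformApplierSettings member of an ExperimentalSettings instance.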
-
class NumIOTiles
A wrapper class for the SessionOptions::numIOTiles option that permits any int value and has an ‘unassigned’ state.
Public Functions
-
NumIOTiles()
Constructor.
-
NumIOTiles(int numIOTiles)
Constructor.
- Parameters
numIOTiles – The number of IPU tiles dedicated to IO.
-
bool operator==(const int &rhs) const
Compare with int.
-
operator int() const
Auto convert to int.
-
NumIOTiles &operator=(const int &x)
Assign value using int.
-
inline bool explicitPipeliningEnabled() const
-
struct TensorLocationSettings
A structure containing user configuration for cache/offloading settings.
Public Functions
-
TensorLocationSettings() = default
Constructor.
-
TensorLocationSettings(TensorLocation location_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)
Constructor.
- Parameters
location_ – The tensor location information.
minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.
minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.
-
TensorLocationSettings(TensorStorage storage_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)
Constructor.
- Parameters
storage_ – The tensor storage information.
minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.
minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.
Public Members
-
TensorLocation location = TensorLocation()
The default tensor location for this tensor type.
-
int minElementsForOffChip = 2
The minimum number of elements below which offloading won’t be considered.
-
int minElementsForReplicatedTensorSharding = 8192
A minimum number of elements below which replicated tensor sharding won’t be considered.
#include <popart/variablesettings.hpp>
-
class VariableSettings
A class to dictate behaviour of variables and reductions of such across multiple graphs.
Public Functions
-
void verify()
Checks whether the VariableSettings are invalid and throws an error if so.
-
ReplicaGrouping getReplicaGrouping(unsigned numReplicas) const
- Parameters
numReplicas – The number of replicas in the IR this is used in.
- Returns
the ReplicaGrouping domain of this VariableSettings.
-
bool isUsingCommGroup() const
- Returns
whether the VariableSettings were initialised using a CommGroup or a stride.
-
CommGroupType getCommGroupType() const
- Returns
the CommGroupType. The value of this is invalid if VariableSettings::isUsingCommGroup returns false.
-
unsigned getStride() const
- Returns
the stride. The value of this is invalid if VariableSettings::isUsingCommGroup returns true.
-
unsigned getGroupSize() const
- Returns
the replica group size.
-
inline VariableRetrievalMode getRetrievalMode() const
- Returns
the VariableRetrievalMode retrievalMode of this VariableSettings.
-
VariableSettings()
“Default” constructor, defaults CommGroup to [All, 0] and retrievalMode to OnePerGroup.
-
VariableSettings(VariableRetrievalMode retrievalMode_)
Defaults CommGroup to [All, 0].
-
VariableSettings(CommGroup sharedVariableDomain_, VariableRetrievalMode retrievalMode_)
Entirely custom VariableSettings.
-
VariableSettings(unsigned stride, unsigned groupSize)
-
VariableSettings(unsigned stride, unsigned groupSize, VariableRetrievalMode retrievalMode)
-
unsigned numReplicasReturningVariable(unsigned replicaCount) const
Calculate the number of replicas that will return this variable.
- Parameters
replicaCount – Number of global replicas.
- Returns
Number of variables returned.
-
unsigned getGroupCount(unsigned replicaCount) const
- Parameters
replicaCount – The replicationFactor of the graph.
- Returns
The number of groups given the replicaFactor and the VariableSettings.
-
unsigned getStride(unsigned replicaCount) const
- Parameters
replicaCount – The replicationFactor of the graph.
- Returns
The stride between each member of a group.
-
unsigned getRealGroupSize(unsigned replicaCount) const
Because a CommGroup does not have a defined group size if its type is All or None, this function returns a group size that is always accurate, based on the number of replicas.
- Parameters
replicaCount – The replication factor
- Returns
The actual number of replicas in a group
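The relationship between group type, group size and replica count can be sketched as follows. This is an illustration under stated assumptions, not PopART's code. The enums are stand-ins for popart::CommGroupType and popart::VariableRetrievalMode. The rule in `numReturned` (AllReplicas returns one variable per replica, the other modes one per group) is inferred from the enum descriptions in this chapter.

```cpp
#include <cassert>

// Stand-ins for popart::CommGroupType and popart::VariableRetrievalMode so
// that the sketch compiles on its own.
enum class GroupType { All, Consecutive, Orthogonal, None };
enum class RetrievalMode { OnePerGroup, AllReduceReplicas, AllReplicas };

// Group size that is meaningful for every group type.
unsigned realGroupSize(GroupType type, unsigned groupSize,
                       unsigned replicaCount) {
  switch (type) {
  case GroupType::All:
    return replicaCount; // one group holding every replica
  case GroupType::None:
    return 1; // each replica is its own group
  default:
    return groupSize; // Consecutive and Orthogonal use the given size
  }
}

// Number of groups, derived from the real group size.
unsigned groupCount(GroupType type, unsigned groupSize,
                    unsigned replicaCount) {
  const unsigned size = realGroupSize(type, groupSize, replicaCount);
  assert(size > 0 && replicaCount % size == 0);
  return replicaCount / size;
}

// Assumed rule: AllReplicas returns one variable per replica, while the
// other modes return one variable per group.
unsigned numReturned(RetrievalMode mode, GroupType type, unsigned groupSize,
                     unsigned replicaCount) {
  if (mode == RetrievalMode::AllReplicas) {
    return replicaCount;
  }
  return groupCount(type, groupSize, replicaCount);
}
```

With 8 replicas, the default settings (type All, OnePerGroup) give a single returned variable, while Consecutive groups of size 2 give four.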
-
unsigned getGroupRepresentative(unsigned group) const
Get the default first member of a group.
- Parameters
group – The group to return the representative for.
- Returns
The representative replica of this group.
-
Shape shapeOnReplica(Shape full_shape, unsigned replicaCount, const TensorId name) const
The shape Onnx reads holds an extra outer dimension in certain cases, where the outer dimension represents the number of returning replica variables.
This function takes an ONNX full shape and safely removes the outer dimension (that is, it checks that the outer dimension matches the expected outer dimension). It is a convenience function to avoid duplicate code.
- Parameters
full_shape – The shape as presented by Onnx.
replicaCount – The local replication factor, used to calculate the return factor.
name – The TensorId of the function, used to give good error feedback.
- Returns
The shape of the data on the replica.
-
Shape shapeOnHost(Shape replica_shape, unsigned replicaCount) const
Takes the shape of a tensor on a replica and returns its full ONNX shape.
This is the inverse operation to shapeOnReplica.
- Parameters
replica_shape – The shape of the data on a replica.
replicaCount – The local replication factor, used to calculate the return factor.
- Returns
The shape as presented by Onnx.
-
std::vector<std::vector<std::int64_t>> groups(unsigned replicaCount) const
Returns a vector of vectors, where each inner vector contains the replica IDs of the replicas that share a sharedVariableDomain, given the VariableSettings and the replicaCount.
- Parameters
replicaCount – The local replication factor
- Returns
A vector of vectors, groups, such that groups.at(a).at(b) is member number b of group a, groups.size() is the number of groups and groups.at(a).size() is the size of group a.
-
bool operator==(const VariableSettings &other) const
Compare two variable-settings.
- Parameters
other – VariableSettings to compare these settings to.
- Returns
True if all internal elements are the same
-
bool operator!=(const VariableSettings &other) const
Compare two variable-settings.
- Parameters
other – VariableSettings to compare these settings to.
- Returns
False if all internal elements are the same
-
enum class popart::VariableRetrievalMode
Enum type that describes how to retrieve variables from the replicas.
Each replica is in a group defined by the VariableSettings::sharedVariableDomain. Replicas within a group have variables initialized with the same values.
Values:
-
enumerator OnePerGroup = 0
Returns one variable per group (defined by the VariableSettings::sharedVariableDomain CommGroup). Automatically returns the first replica of each group, where first means the one with the lowest replica ID.
-
enumerator AllReduceReplicas
As OnePerGroup, but performs an AllReduce among the replicas in the same group according to VariableSettings::sharedVariableDomain.
Note: this mode is currently unsupported.
-
enumerator AllReplicas
Returns the weights from all replicas.
#include <popart/commgroup.hpp>
-
class CommGroup
Class to specify sub-groups of replicas.
Examples of derived sub-groups:
IPU-link domain sub-rack: type == Consecutive && replicaGroupSize == 64/replica-size/N, where N is a power of two and replicaGroupSize > 1.
Complete IPU-link domain / full rack: type == Consecutive && replicaGroupSize == 64/replica-size
Using GW-links only: type == Orthogonal && replicaGroupSize == numberOfIpuLinkDomains
Public Functions
-
CommGroup()
Default CommGroup constructor.
Sets type to CommGroupType::All and replicaGroupSize to 0.
-
inline CommGroup(CommGroupType type, unsigned groupSize)
Construct CommGroup.
- Parameters
type – The replica group type.
groupSize – The replica group size.
-
explicit CommGroup(const ReplicaGrouping &grouping)
Construct CommGroup from a ReplicaGrouping.
- Parameters
grouping – The replica grouping.
Public Members
-
CommGroupType type = CommGroupType::All
Replica group type.
-
unsigned replicaGroupSize = 0
Replica group size.
-
enum class popart::CommGroupType
PopART equivalent of GCL CommGroupType.
Each of these enumeration constants has a corresponding GCL CommGroupType value.
Values:
-
enumerator All = 0
All replicas viewed as one group, replica group size is ignored.
-
enumerator Consecutive
Groups are consecutive in replicas.
If there are N replicas denoted {0, ... N-1} and the group size is k, then there are N/k groups of size k as {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}.
-
enumerator Orthogonal
Groups are sliced orthogonal to the replica ordering.
If there are N replicas denoted {0, ... N-1} and the group size is k, then there are m = N/k groups of size k as {0, m, 2m, ...}, {1, m+1, 2m+1, ...} ... {m-1, 2m-1, ... N-1}.
-
enumerator None
Each replica is in its own group; the replica group size is ignored.
-
enumerator N
Number of values.
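The Consecutive and Orthogonal groupings can be enumerated with a short sketch. The function names are hypothetical and the code only follows the set definitions given for the two enumerators. It assumes the group size divides the number of replicas.

```cpp
#include <cassert>
#include <vector>

using Groups = std::vector<std::vector<unsigned>>;

// Consecutive: {0, ..., k-1}, {k, ..., 2k-1}, ...
Groups consecutiveGroups(unsigned numReplicas, unsigned groupSize) {
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  Groups groups(numReplicas / groupSize);
  for (unsigned r = 0; r < numReplicas; ++r) {
    groups[r / groupSize].push_back(r);
  }
  return groups;
}

// Orthogonal: with m = N/k groups, {0, m, 2m, ...}, {1, m+1, 2m+1, ...}, ...
Groups orthogonalGroups(unsigned numReplicas, unsigned groupSize) {
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  const unsigned numGroups = numReplicas / groupSize;
  Groups groups(numGroups);
  for (unsigned r = 0; r < numReplicas; ++r) {
    groups[r % numGroups].push_back(r);
  }
  return groups;
}
```

For N = 8 and k = 4, the Consecutive groups are {0, 1, 2, 3} and {4, 5, 6, 7}, and the Orthogonal groups (m = 2) are {0, 2, 4, 6} and {1, 3, 5, 7}.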
14.2. Data input and output (IStepIO)
#include <popart/istepio.hpp>
-
class IStepIO
An abstract base class through which input and output data is passed to a Session (see Session::run).
Data is passed via buffers. In the case of buffers returned by IStepIO::in, PopART reads from these buffers. In the case of IStepIO::out, PopART writes to these buffers. The IStepIO::inComplete() and IStepIO::outComplete() functions are called by PopART to signal it is done with an input or output buffer.
An IStepIO implementation should conceptually implement a rolling queue of active buffers for each input and output tensor. Every successful call to IStepIO::in should yield a new data buffer for PopART to read from and add it to the head of the conceptual queue. Conversely, every call to IStepIO::inComplete() should be taken to mean that the buffer at the tail-end of the queue is no longer being used by PopART. This buffer is removed from the conceptual queue.
Note that an IStepIO::in call with the prefetch flag set is only considered successful when it returns data.
Output works analogously to input.
The expected total number of input (or output) buffers that are ‘completed’ for a tensor in one Session::run call is bps \(\times\) SessionOptions::accumulationFactor \(\times\) SessionOptions::replicatedGraphCount, where bps is the number of batches per call to Session::run (this is a value captured by the DataFlow instance passed to the Session instance).
Note, however, that there may be additional ‘incomplete’ calls to IStepIO::in and IStepIO::out.
Furthermore, the number of input (or output) buffers that may be ‘incomplete’ at a given time for a given tensor should not normally be more than SessionOptions::bufferingDepth \(\times\) SessionOptions::replicatedGraphCount, but this bound is not guaranteed.
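The two buffer counts above are simple products. The sketch below spells them out with hypothetical function names; the parameters mirror the quantities named in the text.

```cpp
#include <cassert>
#include <cstdint>

// Expected number of 'completed' buffers per tensor in one Session::run call.
int64_t expectedCompletedBuffers(int64_t batchesPerStep,
                                 int64_t accumulationFactor,
                                 int64_t replicatedGraphCount) {
  return batchesPerStep * accumulationFactor * replicatedGraphCount;
}

// Usual (not guaranteed) upper bound on buffers 'incomplete' at one time.
int64_t usualMaxIncompleteBuffers(int64_t bufferingDepth,
                                  int64_t replicatedGraphCount) {
  return bufferingDepth * replicatedGraphCount;
}
```

For instance, 3 batches per step, an accumulation factor of 1 and 2 replicas give 6 completed buffers per tensor. A buffering depth of 3 with 2 replicas normally means at most 6 buffers are incomplete at once.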
EXAMPLE: Suppose a session is configured such that the total expected number of input buffers is 6 and these are input buffers for a tensor with ID t with 100 elements. The associated input calls in IStepIO may look like this if SessionOptions::bufferingDepth is 3:
in("t", 100, false) -> Give buffer[0] to PopART.
in("t", 100, true) -> Give buffer[1] to PopART.
in("t", 100, true) -> Give buffer[2] to PopART.
inComplete("t", 100) -> buffer[0] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[3] to PopART.
inComplete("t", 100) -> buffer[1] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[4] to PopART.
inComplete("t", 100) -> buffer[2] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[5] to PopART.
inComplete("t", 100) -> buffer[3] is no longer required and can be reused.
in("t", 100, true) -> No data available, return nullptr.
inComplete("t", 100) -> buffer[4] is no longer required and can be reused.
inComplete("t", 100) -> buffer[5] is no longer required and can be reused.
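The conceptual rolling queue for a single input tensor can be modelled as below. `RollingInputQueue` is a hypothetical class, not a PopART type. It tracks which buffers have been handed out by an in-style call and releases the oldest one on each inComplete-style call. It does not implement the IStepIO interface itself.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical model of the per-tensor rolling queue (not a PopART class).
class RollingInputQueue {
public:
  explicit RollingInputQueue(std::vector<std::vector<float>> buffers)
      : buffers_(std::move(buffers)) {}

  // Models IStepIO::in: hand out the next buffer, or nullptr if none is left.
  // A real implementation must treat this as an error when prefetch is false.
  const float *in() {
    if (next_ == buffers_.size()) {
      return nullptr;
    }
    active_.push_back(next_);
    return buffers_[next_++].data();
  }

  // Models IStepIO::inComplete: the oldest active buffer is released.
  void inComplete() {
    assert(!active_.empty());
    active_.pop_front();
  }

  // Number of buffers handed out but not yet completed.
  std::size_t numActive() const { return active_.size(); }

private:
  std::vector<std::vector<float>> buffers_;
  std::deque<std::size_t> active_; // indices of buffers still in use
  std::size_t next_ = 0;
};
```

Replaying the call sequence from the example against six buffers leaves at most three buffers active at any time, which matches a buffering depth of 3.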
Subclassed by popart::StepIOCallback, popart::StepIOGeneric< ARRAY_TYPE, ACCESSOR_TYPE, ArrayInfoT >, popart::StepIOGeneric< IArray, StepIONS::IArrayAccessor, IArray & >
Public Functions
-
virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch, const bool isBroadcast = false) = 0
Request a new input data buffer.
The memory in this buffer is available for use in PopART until the corresponding inComplete() call.
Note
Failing to provide a valid data buffer will result in a runtime failure if prefetch is set to false.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
prefetch – If set to true, the inability to provide data is not considered an error. If false, it is considered an error if no data can be provided.
- Returns
The input buffer for this tensor (or nullptr on failure) returned as a ConstVoidData object.
-
virtual void inComplete(TensorId id, int64_t numElements, const bool isBroadcast = false) = 0
Notify the user (running a PopART program) that a previously retrieved input data buffer is no longer used by PopART.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
-
virtual MutableVoidData out(TensorId id, int64_t numElements) = 0
Request a new output data buffer.
The memory in this buffer is available for use in PopART until the corresponding outComplete() call and will be modified in-place.
Note
Failing to provide a valid data buffer will result in a runtime failure.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
- Returns
The output buffer for this tensor returned as a MutableVoidData object.
-
inline virtual void outComplete(TensorId)
Notify the user (running a PopART program) that a previously retrieved output data buffer is no longer used by PopART.
- Parameters
id – The ID of the tensor to return data for.
numElements – The number of elements in the tensor.
-
inline void enableRuntimeAsserts(bool b)
Enable or disable runtime asserts.
If runtime asserts are enabled, then a check that the input and output buffers have the correct number of elements is performed. As Session.run() is called multiple times during a user’s session, the check is only performed in the first call to Session.run(), under the assumption that the user is unlikely to change the size of buffers between runs.
- Parameters
b – The setting to enable runtime asserts (true) or disable runtime asserts (false).
-
inline bool runtimeAssertsEnabled() const
Check if runtime asserts are enabled.
- Returns
true if runtime asserts are enabled, otherwise false.
-
virtual void assertNumElements(const popx::Executablex&) const = 0
Check number of elements.
This check is performed when runtimeAssertsEnabled() is true.
- Parameters
Executablex – The input executable to be checked that the input and output buffers have the correct number of elements.
#include <popart/stepio.hpp>
-
class StepIO : public popart::StepIOGeneric<IArray, StepIONS::IArrayAccessor, IArray&>
Class to provide a Session object with input and output data.
-
class StepIOCallback : public popart::IStepIO
Class that implements the IStepIO interface using user-provided callback functions.
The IStepIO interface contains a number of pure virtual member functions through which PopART receives buffers to read data from and buffers to write data to. StepIOCallback inherits from IStepIO and implements those member functions by delegating the logic to the callback functions passed in the constructor. This gives the user full control as to how data buffers are provisioned.
See IStepIO for more details on the expected behaviour of the callbacks.
Public Types
-
using InputCallback = std::function<ConstVoidData(TensorId, bool)>
Callable object that implements IStepIO::in().
-
using InputCompleteCallback = std::function<void(TensorId)>
Callable object that implements IStepIO::inComplete().
-
using OutputCallback = std::function<MutableVoidData(TensorId)>
Callable object that implements IStepIO::out().
-
using OutputCompleteCallback = std::function<void(TensorId)>
Callable object that implements IStepIO::outComplete().
Public Functions
-
inline StepIOCallback(InputCallback inputCallback, InputCompleteCallback inputCompleteCallback, OutputCallback outputCallback, OutputCompleteCallback outputCompleteCallback)
Construct a StepIOCallback object.
- Parameters
inputCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::in() is called. See IStepIO for details on how to implement this method.
inputCompleteCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::inComplete() is called. See IStepIO for details on how to implement this method.
outputCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::out() is called. See IStepIO for details on how to implement this method.
outputCompleteCallback – The callback function the constructed StepIOCallback instance will use when IStepIO::outComplete() is called. See IStepIO for details on how to implement this method.
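An input callback might be built as in the following sketch. `TensorId`, `ConstVoidData` and `InputCallback` are redefined here as simplified stand-ins so the example compiles without PopART. The real types differ, and only the shape of the callback is illustrated. `makeInputCallback` is a hypothetical helper.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Stand-ins so the sketch compiles without PopART. The real popart::TensorId
// and popart::ConstVoidData types differ; only the callback shape matters.
using TensorId = std::string;
struct ConstVoidData {
  const void *data = nullptr;
};
using InputCallback = std::function<ConstVoidData(TensorId, bool)>;

// Build an input callback that serves one fixed buffer per tensor ID and
// returns a null buffer for unknown IDs. The map must outlive the callback,
// because it is captured by reference.
InputCallback
makeInputCallback(const std::map<TensorId, std::vector<float>> &inputs) {
  return [&inputs](TensorId id, bool /*prefetch*/) {
    ConstVoidData result;
    auto found = inputs.find(id);
    if (found != inputs.end()) {
      result.data = found->second.data();
    }
    return result;
  };
}
```

The other three callbacks follow the same pattern, each wrapping whatever buffer bookkeeping the application already has.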
-
inline virtual void assertNumElements(const popx::Executablex&) const
Check number of elements.
This check is performed when IStepIO::runtimeAssertsEnabled() is true.
- Parameters
Executablex – The input executable to be checked that the input and output buffers have the correct number of elements.
-
virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch, bool) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCallback parameter passed to the constructor.
This function should not be called directly.
-
virtual void inComplete(TensorId id, int64_t numElements, bool) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCompleteCallback parameter passed to the constructor.
This function should not be called directly.
-
virtual MutableVoidData out(TensorId id, int64_t numElements) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCallback parameter passed to the constructor.
This function should not be called directly.
-
virtual void outComplete(TensorId id) final
This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCompleteCallback parameter passed to the constructor.
This function should not be called directly.
-
class IWeightsIO
A virtual class for accessing pointers to the data required to perform a training step.
Subclassed by popart::WeightsIO
Public Functions
-
virtual ~IWeightsIO() = default
Destructor for IWeightsIO.
-
virtual bool contains(TensorId) const = 0
Check if the WeightsIO instance contains the weights for a specific tensor.
- Parameters
TensorId – The ID of the tensor to look for weights for.
- Returns
true if the WeightsIO instance contains weights for the tensor, false otherwise.
-
virtual MutableVoidData weight(TensorId) const = 0
Retrieve weights for a specific tensor.
- Parameters
TensorId – The ID of the tensor to retrieve weights for.
- Returns
The weights.
-
class WeightsIO : public popart::IWeightsIO
Class representing weights.
Public Functions
-
virtual bool contains(TensorId) const final
Check if the WeightsIO instance contains the weights for a specific tensor.
- Parameters
TensorId – The ID of the tensor to look for weights for.
- Returns
true if the WeightsIO instance contains weights for the tensor, false otherwise.
-
virtual MutableVoidData weight(TensorId) const final
Retrieve weights for a specific tensor from the WeightsIO object.
- Parameters
TensorId – The ID of the tensor to retrieve weights for.
- Returns
The weights.
-
void insert(TensorId, MutableVoidData)
Insert weights for a specific tensor into the WeightsIO object.
- Parameters
TensorId – The ID of the tensor to insert weights for.
MutableVoidData – The weights to insert.
-
struct IArrayAccessor
Structure to help with accessing the data in IArray objects.
Public Static Functions
-
static inline void *getDataPointer(IArray &array)
Get pointer to the data.
- Parameters
array – The IArray object.
- Returns
A pointer to the data contained in the IArray object.
-
static inline size_t getArraySize(const IArray &array)
Get the number of data elements.
- Parameters
array – The IArray object.
- Returns
The number of data elements.
-
static inline DataType getArrayDataType(IArray &array)
Get the data type of the data.
- Parameters
array – The IArray object.
- Returns
The data type of the data.
#include <popart/stepio_generic.hpp>
-
template<typename ARRAY_TYPE, typename ACCESSOR_TYPE, typename ArrayInfoT>
class StepIOGeneric : public popart::IStepIO
Subclassed by popart::StepIO
Public Functions
-
inline void assertNumElements(const popx::Executablex &exe) const final
-
inline TensorInfo getTensorInfo(ARRAY_TYPE &array) const
-
template<typename T>
inline T get(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, bool advance_, std::string mapName)
-
template<typename T>
inline void advance(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, std::string mapName)
-
inline ConstVoidData in(TensorId id, int64_t numElements, bool, bool) final
-
inline MutableVoidData out(TensorId id, int64_t numElements) final
-
struct ArrayInfo
#include <popart/iarray.hpp>
-
class IArray
Subclassed by popart::NDArrayWrapper< T >
14.3. Tensors
#include <popart/tensor.hpp>
-
class Tensor : public popart::Vertex
Public Functions
-
Tensor(TensorId, TensorType, Graph&, const DebugContext& = {})
-
Tensor(TensorId, VariableSettings, Graph&, const DebugContext& = {})
-
Tensor(TensorId, TensorType, VariableSettings, Graph&, const DebugContext& = {})
-
inline std::string str() const final
-
TensorType tensorType() const
-
std::string tensor_type() const
-
void setTensorType(TensorType)
-
inline ReplicatedStreamMode getReplicatedStreamMode() const
-
inline void setReplicatedStreamMode(const ReplicatedStreamMode &mode)
-
void setTensorLocationInfo(TensorLocation&, std::pair<RemoteBufferId, RemoteBufferIndex> &remoteBufferInfo)
-
std::set<PipelineStage> getPipelineStages() const
-
bool hasProducer() const
-
bool isGraphInput() const
-
bool isGraphOutput() const
-
bool isLoopInput() const
-
bool isImplicitLoopInput() const
-
bool isExplicitLoopInput() const
-
bool isLoopTripCounter() const
-
bool isUnmodifiable() const
-
bool isCheckpointTensor() const
-
bool isImplicitRecomputeTensor() const
-
bool isRestoreInplaceTensor() const
-
bool idIncludesPrefix(const std::vector<std::string>&) const
-
bool isOptimizerTensor() const
-
bool isRemoteArgTensor() const
-
bool isRandomSeedTensor() const
-
bool isOptimizerStateTensor() const
-
bool isAccumulatorTensor() const
-
bool isHostLoadTensor() const
Check if this tensor is produced by a HostLoad Op or by a MultiExchangeOp with a HostLoad descriptor.
- Returns
true if the producer is a HostLoad Op or a MultiExchangeOp with a HostLoad descriptor, false otherwise.
-
bool isWeightTensor() const
-
bool isAnchored() const
-
bool isRootAnchor() const
-
bool hasTensorData() const
-
TensorData *tensorData()
-
const TensorData *tensorData() const
-
void setTensorDataFromCopyOf(const void *src, std::size_t size)
-
void setTensorDataFromViewOf(void *src, std::size_t size)
-
void setTensorDataByEmplaceOf(std::vector<char> &&data)
-
void setTensorData(const TensorData &td)
-
void setTensorData(TensorData &&td)
-
bool hasVirtualGraphId() const
-
VGraphIdAndTileSet getVirtualGraphIdAndTileSet(std::set<OpId> &visited) const
-
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe() const
-
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe(std::set<OpId> &visited) const
-
int getBatchAxis() const
-
bool consumersAllPreLoss() const
-
bool isModified(bool considerLoopInput = true) const
Check if any of the consumers modify this tensor.
- Parameters
considerLoopInput – If explicit loop inputs should be considered as being modified. If false, only operations modifying the tensor inplace will be considered.
- Returns
True if the tensor is modified, otherwise false.
-
bool isAliased() const
Check if any of the consumers alias this tensor.
- Returns
True if the tensor is aliased to any output, otherwise false.
-
std::set<Op*, POpCmp> getInplaceModifiers() const
Find operations that modify a tensor.
- Returns
All operations that (directly or indirectly) modify this tensor.
-
std::vector<char> getDataViaGraphTraversal() const
-
inline void setVariableUpdateType(VariableUpdateType type)
Members of the old subclass VariableTensor.
-
inline VariableUpdateType getVariableUpdateType() const
-
inline VariableSettings getVariableSettings() const
- Returns
The VariableSettings of this Variable
-
std::vector<int64_t> returnedShape(unsigned replicationFactor)
Returns the shape necessitated by IO.
- Parameters
replicationFactor – The replication factor
- Returns
the shape of the tensor, considering replica groups
-
void verifyMutableVoidInfo(const TensorInfo mutableVoidInfo, unsigned replicationFactor)
Check that the info of a mutableVoidData object matches the expectations set by the TensorInfo and VariableSettings.
Throws an error if there is a mismatch.
- Parameters
mutableVoidInfo – The data of the MutableVoidInfo with the same id as this tensor
replicationFactor – The replicationFactor of this instance
-
void setPreparedVGraphIdAndTileSet()
Set the preparedVGraphIdAndTileSet.
Public Members
-
Consumers consumers
-
TensorInfo info
-
TensorLocationInfo tensorLocationInfo
-
InputSettings inputSettings
-
enum class popart::TensorType
Values:
-
enumerator ActGrad = 0
-
enumerator Const
-
enumerator Stream
-
enumerator Unknown
-
enumerator Variable
-
enumerator N
-
enum class popart::VariableUpdateType
Values:
-
enumerator None = 0
-
enumerator Gradient
-
enumerator Copy
#include <popart/tensorinfo.hpp>
-
enum class popart::DataType
There is a one-to-one correspondence between popart::DataTypes and ONNX_NAMESPACE::TensorProto_DataTypes, which is equivalent to decltype(ONNX_NAMESPACE::TensorProto().data_type()).
Values:
-
enumerator UINT8 = 0
-
enumerator INT8
-
enumerator FLOAT8_143
-
enumerator FLOAT8_152
-
enumerator UINT16
-
enumerator INT16
-
enumerator INT32
-
enumerator INT64
-
enumerator UINT32
-
enumerator UINT64
-
enumerator BOOL
-
enumerator FLOAT
-
enumerator FLOAT16
-
enumerator BFLOAT16
-
enumerator DOUBLE
-
enumerator COMPLEX64
-
enumerator COMPLEX128
-
enumerator STRING
-
enumerator UNDEFINED
-
class DataTypeInfo
-
class TensorInfo
Public Functions
-
TensorInfo(DataType, const Shape&)
Create a TensorInfo based on data type and shape.
- Parameters
data_type – The data type.
shape – The actual shape of the tensor.
-
TensorInfo(DataType data_type, const Shape &shape, const Shape &meta_shape)
Create a TensorInfo based on data type, shape and meta shape.
- Parameters
data_type – The data type.
shape – The actual shape of the tensor.
meta_shape – The meta shape of the tensor, which can for example be used to store the original tensor shape before replicated tensor sharding was applied.
-
TensorInfo(std::string data_type, std::string shape)
-
explicit TensorInfo(const ONNX_NAMESPACE::TensorProto&)
-
explicit TensorInfo(const ONNX_NAMESPACE::TypeProto&)
-
void set(const ONNX_NAMESPACE::TensorProto&)
-
void set(const ONNX_NAMESPACE::TypeProto&)
-
TensorInfo() = default
-
std::vector<size_t> shape_szt() const
-
inline int64_t nelms() const
-
int64_t nbytes() const
-
inline int64_t dim(int i) const
-
inline std::vector<int> strides(const std::vector<long> &shape)
Get the strides of the tensor, that is the number of bytes to step in each dimension when traversing an array in memory.
See https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html
- Parameters
shape – The on-host ONNX shape of a tensor. This is different from this->shape(), which gives the on-replica shape of a tensor
- Returns
std::vector<int> The strides vector.
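The byte-stride convention referenced above can be sketched as follows. This is a minimal illustration for a densely packed, row-major layout, not the PopART implementation: the function name rowMajorByteStrides and the explicit elemBytes parameter are assumptions made for this example (TensorInfo would take the element size from its own data type).

```cpp
#include <cstddef>
#include <vector>

// Byte strides for a densely packed, row-major tensor.
// elemBytes is the size of one element in bytes (hypothetical parameter,
// introduced only for this sketch).
std::vector<int> rowMajorByteStrides(const std::vector<long> &shape,
                                     int elemBytes) {
  std::vector<int> strides(shape.size());
  long step = elemBytes;
  // Walk from the innermost (last) dimension outwards: the innermost
  // dimension steps by one element, each outer one by a whole inner block.
  for (std::size_t i = shape.size(); i-- > 0;) {
    strides[i] = static_cast<int>(step);
    step *= shape[i];
  }
  return strides;
}
```

For a shape of {2, 3, 4} with 4-byte elements, this yields strides of {48, 16, 4}, matching the NumPy convention linked above.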
-
const std::string &data_type() const
-
const std::string &data_type_lcase() const
-
void append(std::ostream&) const
-
bool isSet() const
-
bool operator==(const TensorInfo&) const
-
bool operator!=(const TensorInfo&) const
-
ONNX_NAMESPACE::TypeProto getOnnxTypeProto() const
-
const DataTypeInfo *getDataTypeInfo() const
Public Static Functions
-
static std::string npOutDataTypeExceptionMessage(const TensorInfo &i0, const TensorInfo &i1, const std::string &debugName)
#include <popart/tensorindex.hpp>
-
class TensorIndexMap
Public Functions
-
TensorIndexMap() = default
-
~TensorIndexMap()
-
void erase(int)
-
void clear()
-
bool hasIndex(int) const
-
const std::map<Tensor*, std::vector<int>, PTensorCmp> &indicesMap() const
-
int n() const
-
void append(std::stringstream&, std::string prefix, int max_id_length) const
-
void setInfoIfIndex(const TensorInfo&, int index)
-
int maxIdLength() const
-
int minIndex() const
-
int maxIndex() const
#include <popart/tensorlocation.hpp>
-
enum class popart::ReplicatedTensorSharding
Enum type to specify whether to shard tensors over replicas.
Values:
-
enumerator Off = 0
Don’t shard tensors over replicas.
-
enumerator On = 1
Do shard tensors over replicas.
-
enumerator N = 2
Number of values.
-
class TensorLocation
Class that describes the memory characteristics of one or multiple tensors.
See also: SessionOptions.
Public Functions
-
TensorLocation()
Equivalent to calling TensorLocation(TensorStorage::Undefined, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)
-
TensorLocation(TensorStorage storage)
Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)
-
TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding)
Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding)
-
TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)
Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding, shardingDomain)
-
TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding)
Construct a TensorLocation from parameters.
- Parameters
storage – The memory location of the tensor(s).
loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.
storageTileSet – The tiles on which the tensor(s) are stored.
replicatedTensorSharding – Whether to apply replicated tensor sharding.
-
TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)
Construct a TensorLocation from parameters.
- Parameters
storage – The memory location of the tensor(s).
loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.
storageTileSet – The tiles on which the tensor(s) are stored.
replicatedTensorSharding – Whether to apply replicated tensor sharding.
shardingDomain – GCL communication group across which to shard the tensor. Perpendicular replicas will not shard, and reduce gradients normally (via AllReduce). Defaults to sharding across all replicas.
-
TensorLocation(std::vector<int64_t> serialized)
-
bool operator==(const TensorLocation &rhs) const
-
bool operator!=(const TensorLocation &rhs) const
-
std::vector<int64_t> serialize() const
-
bool isRemote() const
Public Members
-
TensorStorage storage
The memory location of the tensor(s).
-
ReplicatedTensorSharding replicatedTensorSharding
Whether to apply replicated tensor sharding (RTS) or not.
-
enum class popart::TensorStorage
Enum type that determines where a tensor is stored.
Values:
-
enumerator OnChip = 0
Store the tensor in on-chip memory.
-
enumerator OffChip = 1
Store the tensor in streaming memory.
-
enumerator N = 2
Number of values.
-
enum class popart::TileSet
Enum type to specify a set of tiles.
Values:
-
enumerator Compute = 0
The set of tiles designated for compute operations.
-
enumerator IO = 1
The set of tiles designated for IO operations.
-
enumerator Undefined = 2
Undefined (no) tile set.
-
enumerator N = 3
Number of values.
14.4. Optimizers
#include <popart/optimizer.hpp>
-
class Optimizer
Interface for describing an Optimizer and, internally, how to grow the optimiser step for each weight.
This is the end-user facing interface, constructed by the user to describe what kind of optimiser to use.
It is then also used internally by the Ir to grow the optimiser step for each weight.
Stores OptimizerValues for optimizer parameters like learning rate, loss scaling, etc.
See also
OptimizerValue.
Optimizer stores the values for each weight - they can have different values. There is a “default” for all weights, then you can specify specific values for specific weights. This is encapsulated by an OptimizerValueMap, which is a sparse map from weight to value, with unspecified values implying the default.
See also
OptimizerValueMap.
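The "default plus weight-specific overrides" idea can be illustrated with a small stand-alone sketch. The types ValueSketch and ValueMapSketch below are hypothetical stand-ins written for this example, not the PopART classes; they only mirror the behaviour described above (a sparse map where a missing entry implies the default).

```cpp
#include <map>
#include <string>

// Stand-in for a hyperparameter value: a float plus an "is constant" flag.
struct ValueSketch {
  float val;
  bool isConst;
};

// Sparse map from weight id to value; ids without an entry use the default.
class ValueMapSketch {
public:
  explicit ValueMapSketch(ValueSketch defaultValue) : default_(defaultValue) {}

  // Record a value that applies to one weight only.
  void insertSpecific(const std::string &weightId, ValueSketch v) {
    specifics_[weightId] = v;
  }

  // Return the weight-specific value if present, otherwise the default.
  ValueSketch get(const std::string &weightId) const {
    auto it = specifics_.find(weightId);
    return it == specifics_.end() ? default_ : it->second;
  }

  bool hasSpecific() const { return !specifics_.empty(); }

private:
  ValueSketch default_;
  std::map<std::string, ValueSketch> specifics_;
};
```

With a default learning rate of 0.1 and a specific value of 0.01 inserted for a weight "w0", get("w0") returns 0.01 while any other weight id returns 0.1.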
At runtime, the user can dynamically update the Optimizer, for example by setting new OptimizerValues. validReplacement determines whether the new Optimizer is interchangeable with the one the Ir was built for. For example, trying to replace an SGD Optimizer with an Adam Optimizer would throw.
Subclassed by popart::Adam, popart::Adaptive, popart::SGD
Public Functions
-
virtual ~Optimizer() = default
The Optimizer class has a two-part initialisation: the constructor, used by the end user, and setFactorsFromOptions, called by the Ir to finish initialisation once all the relevant information is available during Ir preparation.
Some key methods used by the Ir to grow the optimiser step for each weight are createOp, getInputIds and optimizerInputs.
If the OptimizerValue is const, no Ir tensor for that value is created and the VarUpdateOp created for that weight will not have the optional input for that tensor. The Opx of the VarUpdateOp will emit poplar code that uses the provided value directly.
If the OptimizerValue is not const, an Ir tensor for that value is created and the VarUpdateOp created for that weight will have the optional input for that tensor. The tensor will be a stream tensor, so that it can be updated later from host. The tensor will be streamed an initial value of the OptimizerValue’s value.
It is common for Optimizer implementations to make use of “compound scalars”. Take for example the SGD0 weight update equation:
w <- w * (1 - lr * (1 - dm) * wd) - g * (lr * (1 - dm) / ls)
w is the weights and g is the grads. lr, dm, wd, ls are all the “atomic scalars”. These are the scalars/hyperparameters of the Optimizer that the user can set using OptimizerValues, as described above.
Multiple atomic scalars appear in expressions together, and will be operated on together before being used by an Op that also consumes a tensor (in this case the weights or grads). For SGD0, they can be grouped as follows:
w <- w * {1 - lr * (1 - dm) * wd} - g * { lr * (1 - dm) / ls }
         ^^^^^^^^^^^^^^^^^^^^^^^^^      ~~~~~~~~~~~~~~~~~~~~~~
                     |                             |
        weight decay scale factor 0     scaled learning rate 0
We call wdsf0 and slr0 the “compound scalars”.
We can statically precompute the OptimizerValues for these compound scalars using the OptimizerValues of the atomic scalars. This makes the Ir simpler, as we now have only:
w <- w * wdsf0 - g * slr0
The CompoundScalarHelpers are used to precompute the compound scalar values.
If any of the composite atomic scalars are non-const, the compound scalar is non-const.
See also
compoundscalarhelper.hpp
-
Optimizer(OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings, const DebugContext &debugContext)
-
virtual OptimizerType type() const = 0
-
virtual std::string type_s() const = 0
-
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const = 0
Returns the TensorIds of the input tensors to the VarUpdateOp this optimiser will create for the given weight.
Specifically, the TensorId at index i will be the id of the input tensor at InIndex i of the VarUpdateOp. If the input is an OptimizerValue and it is const, then “” will be returned; otherwise the relevant reserved prefix for that OptimizerValue will be used, followed by the weight id. The prefixes are defined in tensornames.hpp, for example reservedDefaultWeightDecayScaleFactor0Prefix or reservedSpecificScaledLearningRate1Prefix (note there are different prefixes depending on whether the weight has a specific or default value for that OptimizerValue).
-
virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const = 0
-
inline const OptimizerValue &lossScaling() const
-
inline float getLossScalingVal() const
-
float getFinalLossScalingVal() const
-
virtual void setFactorsFromOptions(const SessionOptions&)
-
bool gradientAccumulationEnabled() const
-
bool meanReductionEnabled() const
-
bool postMeanAccumulationEnabled() const
-
bool postMeanReplicationEnabled() const
-
int64_t getReplicatedGraphCount() const
-
int64_t getAccumulationFactor() const
-
bool meanGradientAccumulationEnabled() const
-
inline const std::vector<ClipNormSettings> &getClipNormSettings() const
-
virtual bool hasSpecific() const = 0
-
virtual size_t hash() const
-
inline DebugContext getDebugContext() const
-
enum class popart::OptimizerType
Types of optimizers.
Values:
-
enumerator SGD = 0
-
enumerator Adam
-
enumerator Adaptive
-
enumerator NTYPES
-
enum class popart::OptimizerReductionType
Reduction mode when doing data-parallel training over replicated graphs.
Depending on the optimizer used and its configuration, this option describes how the reduction of gradients over replicas will occur. For example, directly on the gradient, on the gradient accumulator, or on the momentum. See the documentation of individual optimizers for more information.
Values:
-
enumerator None = 0
No replicated graph reduction.
-
enumerator GradReduce
Gradient reduction (every iteration, after a weight’s gradient is produced)
-
enumerator AcclReduce
Momentum reduction (SGD1, after the gradient accumulation loop, if applicable)
-
enumerator AccumReduce
Accumulator reduction (Adam/SGD2 + gradient accumulation, after the gradient accumulation loop)
#include <popart/optimizervalue.hpp>
-
class OptimizerValue
A class used to represent values of hyper parameters.
Public Functions
-
OptimizerValue() = default
Equivalent to OptimizerValue(0, false).
-
inline OptimizerValue(float v)
Equivalent to OptimizerValue(v, true).
-
inline OptimizerValue(float v, bool c)
Constructor.
- Parameters
v – The current value of the hyper parameter.
c – A boolean flag to indicate whether the parameter will remain at this value forever (true) or may change over time (false).
-
inline OptimizerValue(std::pair<float, bool> x)
-
inline float val() const
-
inline bool isConst() const
-
void validReplacement(const OptimizerValue &rhs) const
-
bool operator==(const OptimizerValue &rhs) const
#include <popart/optimizervaluemap.hpp>
-
class OptimizerValueMap
Public Functions
-
inline OptimizerValueMap(OptimizerValue g)
-
OptimizerValue get(const TensorId &id) const
-
void insertSpecific(const TensorId&, OptimizerValue)
-
inline bool hasSpecific() const
-
inline OptimizerValue getDefault() const
-
void validReplacement(const OptimizerValueMap &rhs) const
-
inline const std::map<TensorId, OptimizerValue> &getSpecifics() const
14.4.1. Stochastic Gradient Descent (SGD)
#include <popart/clipnormsettings.hpp>
-
class ClipNormSettings
A data structure used to represent a maximum value constraint on one or more weights.
This is passed to the optimizer on construction.
Public Functions
-
ClipNormSettings(const std::vector<TensorId> &weightIds_, float maxNorm_)
DEPRECATED: This will be removed in a future release.
Constructor.
- Parameters
weightIds_ – The weight tensor IDs that this constraint applies to.
maxNorm_ – The maximum permissible value.
-
float getMaxNorm() const
-
bool operator==(const ClipNormSettings&) const
-
bool operator!=(const ClipNormSettings &other) const
Public Static Functions
-
static ClipNormSettings clipWeights(const std::vector<TensorId> &weightIds_, float maxNorm_)
-
static ClipNormSettings clipAllWeights(float maxNorm_)
#include <popart/sgd.hpp>
-
class SGD : public popart::Optimizer
Stochastic Gradient Descent (SGD) optimizer.
Like any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.
The SGD optimizer has the following state for each weight:
velocity ( \(v\))
The SGD optimizer has the following hyper parameters:
learning rate ( \(\text{lr}\))
momentum ( \(\text{mm}\))
weight decay ( \(\text{wd}\))
dampening ( \(\text{dm}\))
velocity scaling ( \(\text{vs}\))
loss scaling ( \(\text{ls}\))
nesterov
clip norm settings
The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see SGD::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.
In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description, the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.
When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first updates the optimizer state as follows:
\[ v' := v * \text{mm} + (1 - \text{dm}) * (g + \text{wd} * w) \text{ \ . } \]
Following the update of the optimizer state the optimizer uses said state to update the weight:
if nesterov is True:
\[ g' := g + \text{wd} * w + \text{mm} * v' \text{ \ . } \]
\[ w' := w - \text{lr} * g' \text{ \ . } \]
else:
\[ w' := w - \text{lr} * v' \text{ \ . } \]
In addition to the above, the velocity scaling hyper parameter is a scaling factor that can provide improved numerical stability by ensuring the values stored in the optimizer state, \(v\), are scaled by this value. When using this parameter PopART will automatically deal with the artificially scaled velocity value during the weight update and other hyper parameters do not need to be adjusted.
In addition, the loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass and, at the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight with the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability in some cases.
Finally, it is possible to add clip norm settings for this optimizer. These clip norms compute the L2 norm for a group of weights and add a scalar term to the weight update that effectively divides it by the norm (or a constant value that is provided as part of the clip norm, whichever is greater).
See the SGD notes in optimizer.hpp for a more detailed and comprehensive derivation of the SGD optimizer step in PopART.
Subclassed by popart::ConstSGD
Public Functions
-
SGD(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultMomentum, OptimizerValue defaultDampening, OptimizerValue defaultVelocityScaling, OptimizerValue lossScaling, OptimizerValue nesterov, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})
Constructor.
See also
SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.
- Parameters
defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultDampening – The dampening value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultVelocityScaling – The velocity scaling value to use for weights for which no weight-specific hyper parameter has been inserted.
lossScaling – The loss scaling value to use.
nesterov – Option to enable Nesterov momentum. Defaults to false.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.
accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
debugContext – Optional debug context.
-
SGD(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultMomentum, OptimizerValue defaultDampening, OptimizerValue defaultVelocityScaling, OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})
Constructor.
See also
SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.
- Parameters
defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultDampening – The dampening value to use for weights for which no weight-specific hyper parameter has been inserted.
defaultVelocityScaling – The velocity scaling value to use for weights for which no weight-specific hyper parameter has been inserted.
lossScaling – The loss scaling value to use.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.
accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
debugContext – Optional debug context.
-
SGD(const std::map<std::string, std::pair<float, bool>> &params, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})
Constructor.
EXAMPLE:
SGD({{"defaultLearningRate", {0.02, false}}, {"defaultMomentum", {0.6, true}}});
See also
SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.
This will create an SGD Optimizer which has a constant momentum of 0.6 and a changeable learning rate initially of 0.02. All OptimizerValues not present in the map will take values from the getUnset* functions.
- Parameters
params – A parameter map where the keys are one or more of "defaultLearningRate", "defaultWeightDecay", "defaultMomentum", "defaultDampening", "defaultVelocityScaling", "lossScaling" or "nesterov". The map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter because default values will be used where parameters are missing.
clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.
accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.
debugContext – Optional debug context.
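The "missing keys fall back to defaults" behaviour of the parameter map can be sketched as below. This is a stand-alone illustration, not the SGD constructor itself: ParamMap and resolveParam are hypothetical names, and the fallback pair passed by the caller stands in for whatever the corresponding getUnset* function returns.

```cpp
#include <map>
#include <string>
#include <utility>

// Same shape as the constructor's parameter map: name -> (value, isConst).
using ParamMap = std::map<std::string, std::pair<float, bool>>;

// Look up a (value, isConst) pair, using the fallback when the key is absent.
// The fallback is a placeholder for the relevant getUnset* default.
std::pair<float, bool> resolveParam(const ParamMap &params,
                                    const std::string &key,
                                    std::pair<float, bool> fallback) {
  auto it = params.find(key);
  return it == params.end() ? fallback : it->second;
}
```

Using the map from the example above, resolving "defaultLearningRate" returns the non-constant value 0.02 from the map, while resolving "defaultWeightDecay" returns whatever fallback the caller supplies.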
-
inline SGD()
Default constructor. Creates an SGD optimizer with default scalars (equivalent to the getUnset<scalar>() methods) and the other default parameters of the main constructor.
-
~SGD() = default
-
inline virtual OptimizerType type() const final
-
inline virtual std::string type_s() const final
-
inline SGDAccumulatorAndMomentum getSGDAccumulatorAndMomentum() const
-
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const final
Returns the VarUpdateOp for the given weight.
If there is no gradient accumulation or momentum, this will be a SGD0VarUpdateOp. Else, if getSGDAccumulatorAndMomentum() == ::Combined, this will be an SGD1ComboOp, else if getSGDAccumulatorAndMomentum() == ::Separate, an SGD2ComboOp.
The required compound scalar OptimizerValues for the VarUpdateOp will be computed and passed to the Op. See the SGD notes above this class for how they are derived. Recall that if non-const, the VarUpdateOp will take an input Tensor for the compound scalar.
See also
Optimizer::createOp
The OptimizerReductionType of the Op is derived as follows:
No replication => None
Replication, no grad acc => GradReduce
Replication, grad acc, SGD1 => AcclReduce
Replication, grad acc, SGD2 => AccumReduce
See the SGD notes above this class for why this is.
If SGD2, the DataType of the accum and accl1 tensors passed to the SGD2ComboOp will be as set in the SGD constructor. Recall DataType::UNDEFINED means use the same as the weight.
An SGD1ComboOp will later be decomposed by the SGD1Decompose pattern into a series of Ops and Tensors that implement the SGD1 optimiser step.
An SGD2ComboOp will later be decomposed by the SGD2Decompose pattern into a series of Ops and Tensors that implement the SGD2 optimiser step.
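The selection rules for createOp can be summarised as a small decision function. This is a sketch of the rules as stated in the description, not the PopART source: the enums AccMm, UpdateOpKind and Reduction and both function names are stand-ins introduced here, and the sketch assumes that SGD2 corresponds to SGDAccumulatorAndMomentum::Separate.

```cpp
// Stand-ins for the PopART enums and op types named in the text.
enum class AccMm { Combined, Separate };
enum class UpdateOpKind { SGD0VarUpdate, SGD1Combo, SGD2Combo };
enum class Reduction { None, GradReduce, AcclReduce, AccumReduce };

// Which kind of update op is created for a weight.
UpdateOpKind chooseOp(bool gradAccOrMomentum, AccMm accMm) {
  if (!gradAccOrMomentum) {
    return UpdateOpKind::SGD0VarUpdate;
  }
  return accMm == AccMm::Combined ? UpdateOpKind::SGD1Combo
                                  : UpdateOpKind::SGD2Combo;
}

// Which replica reduction applies, following the four listed cases.
// With gradAcc set, chooseOp never yields SGD0VarUpdate, so only the
// SGD1/SGD2 distinction matters in the last branch.
Reduction chooseReduction(bool replicated, bool gradAcc, UpdateOpKind op) {
  if (!replicated) {
    return Reduction::None;
  }
  if (!gradAcc) {
    return Reduction::GradReduce;
  }
  return op == UpdateOpKind::SGD1Combo ? Reduction::AcclReduce
                                       : Reduction::AccumReduce;
}
```

For instance, a replicated run with gradient accumulation and the Combined strategy selects an SGD1ComboOp with AcclReduce, while the same run with the Separate strategy selects an SGD2ComboOp with AccumReduce.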
-
virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final
smm1 and wdsf0 have the same data type as the weight. Everything else
-
float getStoredValue(const TensorId &optId) const
Tensor “opt” has an id, which it uses to match a compound scalar which this object can compute from the atomic scalars.
-
void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue momentum, OptimizerValue dampening, OptimizerValue velocityScaling, OptimizerValue nesterov)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
learningRate – The learning rate value to use for this specific weight.
weightDecay – The weight decay value to use for this specific weight.
momentum – The momentum value to use for this specific weight.
dampening – The dampening value to use for this specific weight.
velocityScaling – The velocity scaling value to use for this specific weight.
nesterov – Option to enable Nesterov momentum. Defaults to false.
-
void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)
Insert a weight-specific set of hyper parameters.
- Parameters
weight – The TensorId of the weight.
params – A parameter map where keys are one of "learningRate", "weightDecay", "momentum", "dampening", or "velocityScaling" and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.
-
virtual bool hasSpecific() const final
-
inline const OptimizerValueMap &learningRates() const
-
inline const OptimizerValueMap &weightDecays() const
-
inline const OptimizerValueMap &momentums() const
-
inline const OptimizerValueMap &dampenings() const
-
inline const OptimizerValueMap &velocityScalings() const
-
inline const OptimizerValueMap &nesterov() const
-
virtual size_t hash() const
Public Static Functions
-
static inline OptimizerValue getUnsetLearningRate()
Default learning rate value.
-
static inline OptimizerValue getUnsetWeightDecay()
Default weight decay value.
-
static inline OptimizerValue getUnsetMomentum()
Default momentum value.
-
static inline OptimizerValue getUnsetDampening()
Default dampening value.
-
static inline OptimizerValue getUnsetVelocityScaling()
Default velocity scaling value.
-
static inline OptimizerValue getUnsetLossScaling()
Default loss scaling value.
-
static inline OptimizerValue getUnsetNesterov()
Default nesterov.