2. PopART C++ API

2.1. Sessions

#include <popart/session.hpp>
class popart::Session

Session is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware.

Subclassed by popart::InferenceSession, popart::TrainingSession

Public Functions

virtual ~Session() = 0

Destructor for the Session class.

std::vector<uint32_t> getRNGState()

Get state of the random number generator.

void setRNGState(const std::vector<uint32_t>)

Set state of the random number generator.

void setRandomSeed(uint64_t seedValue)

Set the value of the random number generator seed.

This method explicitly seeds all random operations. Additionally, this method derives a new state for the random number generator (RNG) from the seed and sets it on the device. This RNG state is used to resolve stochastic rounding. Note that to deterministically store and restore the combined random state for a session, do the following:

C++:

// Store random state (session s0).
auto seed = s0.getRandomSeed();
auto rngState = s0.getRNGState();

// Restore random state (session s1).
s1.setRandomSeed(seed);   // <-- affects RNG state, order important
s1.setRNGState(rngState);

Python:

# Store random state (session s0).
seed = s0.getRandomSeed()
rngState = s0.getRNGState()

# Restore random state (session s1).
s1.setRandomSeed(seed)   # <-- affects RNG state, order important
s1.setRNGState(rngState)

Parameters

seedValue – The value of the seed.

uint64_t getRandomSeed()

Get the value of the random number generator seed.

Calling setRandomSeed() with this value (at a later stage) reinstates the random state logic that seeds random operations.

Returns

The value used to seed current random operations.

void compileAndExport(const std::string &filename)

Compile the graph and export it to a file.

This method will first create a snap::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the file. The exported file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

This method automatically creates folders as needed if filename is located in a folder which does not exist.

Parameters

filename – The name of the file where the compiled executable and metadata will be saved.
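
A minimal sketch of the export flow (not taken from the reference itself): the model file name is illustrative, and acquiring a device via popart::DeviceManager and the single-batch DataFlow are assumptions about the surrounding setup.

#include <popart/dataflow.hpp>
#include <popart/devicemanager.hpp>
#include <popart/session.hpp>

// Acquire a device and build an inference session (illustrative setup).
auto device =
    popart::DeviceManager::createDeviceManager().acquireAvailableDevice(1);
popart::DataFlow dataFlow(/*batchesPerStep=*/1);
auto session = popart::InferenceSession::createFromOnnxModel(
    "model.onnx", dataFlow, device);

// Compile and write the executable plus PopART metadata as a PopEF file.
session->compileAndExport("model.popef");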

void compileAndExport(std::ostream &out)

Compile the graph and export it to a stream.

This method will first create a snap::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the stream. The data will be streamed in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

Parameters

out – The stream that the compiled executable and metadata will be written to.

void saveExecutableToFile(const std::string &filename)

Save a compiled graph to a file.

The file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

This method automatically creates folders as needed if filename is located in a folder which does not exist.

Parameters

filename – The name of the file where the compiled executable and metadata will be saved.

Pre

prepareDevice() must have been called.

void saveExecutableToStream(std::ostream &out)

Save a compiled graph to a stream.

The data will be streamed in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

Parameters

out – The stream that the compiled executable and metadata will be written to.

Pre

prepareDevice() must have been called.

void checkInplacingAmbiguity() const

Check for potential inplacing ambiguities.

This method creates an AliasModel object for each graph and runs the Poprithms ambiguity checker on it.

This method throws an error if the graph has an inplacing ambiguity and prompts the user to check the inplacing.

See poprithms::memory::inplace::Graph::AmbiguityStatus on the Poprithms GitHub repo for more on what constitutes an ambiguity.

void loadExecutableFromFile(const std::string &filename)

Load the compiled executable and metadata from a file.

The file must have been created with compileAndExport(const std::string).

Parameters

filename – The name of the file to load the executable and metadata from.
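
A sketch of the matching load path, under the same assumptions as the export sketch above. Loading the previously exported executable lets the session skip graph compilation:

auto session = popart::InferenceSession::createFromOnnxModel(
    "model.onnx", dataFlow, device);

// Load the executable exported earlier instead of recompiling the graph.
session->loadExecutableFromFile("model.popef");
session->prepareDevice();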

void loadExecutableFromStream(std::shared_ptr<std::istream> in)

Load the compiled executable and metadata from a stream.

The stream must have been created with compileAndExport(std::ostream).

Parameters

in – The shared pointer to the stream to load the executable from.

void prepareDevice(bool loadEngine = true)

Prepare the network for execution.

This will create the snap::Graph and poplar::Engine.

Parameters

loadEngine – If true, load the engine and connect the streams once the device is ready.

void loadEngineAndConnectStreams()

Load the engine on the device and connect the streams.

This will set up the poplar::Streams.

Note: This call is optional. The engine will implicitly be loaded on the device when required.

void weightsFromHost()

Copy weights from the host to the device.

void weightsToHost()

Copy the weights from the device to the host stream memory.

uint64_t getCycleCount(std::string id = "")

Copy the cycle count tensor from the device to the host.

Parameters

id – The identifier of the cycle count tensor.

void connectStreamToCallback(const std::string &streamHandle, std::function<void(void*)> callback, unsigned index = 0)

Connect a Poplar stream with a callback.

The callback will be called whenever the stream is to be read from, or has been written to, by the device. The memory location will only be valid for reading or writing for the duration of the callback.

Parameters
  • streamHandle – The name of the stream to connect to.

  • callback – The callback to be called whenever the stream is to be read or was written to by the device.

  • index – The replica index to connect to, when using replicated graphs. Default=0.
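
A hypothetical sketch, assuming a prepared session and a stream handle "in_t" (actual handles depend on how the graph was built):

#include <cstring>

static float hostData[100]; // host-side staging buffer (illustrative)

session->connectStreamToCallback("in_t", [](void *ptr) {
  // ptr is only valid for the duration of the callback, so copy data
  // rather than retaining the pointer.
  std::memcpy(ptr, hostData, sizeof(hostData));
});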

void connectStream(const std::string &streamHandle, void *buffer)

Connect a Poplar stream with a fixed location in memory.

Each time data is copied to the stream, this location will be read and each time data is copied from the stream, this location will be written.

Parameters
  • streamHandle – The handle of the stream to connect to.

  • buffer – The pointer to the memory location.
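
A hypothetical alternative to the callback approach above, assuming the same stream handle. The buffer must remain valid for as long as the stream is in use:

static float fixedBuffer[100];
session->connectStream("in_t", fixedBuffer);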

void connectHostFunction(const std::string &functionHandle, std::function<void(const void*const*, size_t, void*const*, size_t)> callback, unsigned index = 0)

Connect a host function to a callback.

The callback takes four arguments: an array of pointers to the input buffers, the number of inputs, an array of pointers to the output buffers, and the number of outputs. During a host function call, first the device transfers the input data to the host, then the callback is invoked, and finally the output data is copied back to the device. The memory pointed to by the callback arguments must only be accessed during the duration of the callback.

Parameters
  • functionHandle – The name of the host function.

  • callback – The function to be called whenever new input data is available.

  • index – The replica index to connect to, when using replicated graphs. Default=0.
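
A hypothetical sketch of a host function that adds one to a single float value; the handle "hostAddOne" and the tensor layout are assumptions:

session->connectHostFunction(
    "hostAddOne",
    [](const void *const *inputs, size_t numInputs,
       void *const *outputs, size_t numOutputs) {
      // Only access these pointers for the duration of the callback.
      const float *in = static_cast<const float *>(inputs[0]);
      float *out = static_cast<float *>(outputs[0]);
      out[0] = in[0] + 1.0f;
    });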

void run(IStepIO &stepIO, std::string debugName = "")

Run one step.

Read input data from address in stepIO.in.

Write the output data to addresses in stepIO.out.

Parameters
  • stepIO – The input and output data.

  • debugName – A debug string to identify this run in logs.
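
A minimal sketch of one step, assuming a session whose DataFlow anchors a 2x2 float tensor "output" and whose graph has a 2x2 float input "input". NDArrayWrapper (an IArray implementation assumed available from popart/ndarraywrapper.hpp) wraps the host arrays; StepIO is described in Section 2.2:

#include <popart/ndarraywrapper.hpp>
#include <popart/stepio.hpp>

std::vector<float> inData(4, 1.0f), outData(4);
popart::NDArrayWrapper<float> inWrap(inData.data(), {2, 2});
popart::NDArrayWrapper<float> outWrap(outData.data(), {2, 2});

std::map<popart::TensorId, popart::IArray &> inputs = {{"input", inWrap}};
std::map<popart::TensorId, popart::IArray &> anchors = {{"output", outWrap}};

popart::StepIO stepIo(inputs, anchors);
session->prepareDevice();
session->run(stepIo, "step0"); // outData now holds the anchored results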

void run(std::string programHandle, IStepIO &stepIO, std::string debugName = "")

Run one step of a custom program.

Read input data from address in stepIO.in.

Write the output data to addresses in stepIO.out.

Parameters
  • programHandle – The handle of the custom program to run.

  • stepIO – The input and output data.

  • debugName – A debug string to identify this run in logs.

void updateExternallySavedTensorLocations(const std::string &fromLocation, const std::string &toLocation)

Update the tensor locations of tensors in the session’s ONNX model.

A new file will be created at this point, and written to when the ONNX model is saved with a subsequent call to modelToHost().

Parameters
  • fromLocation – All externally saved tensors with location fromLocation will have their location updated to toLocation.

  • toLocation – The updated tensor locations. This must not already exist.
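
A hypothetical sketch; both file names are illustrative:

// Redirect externally saved tensors to a new file, then save the model so
// that it references the new location.
session->updateExternallySavedTensorLocations("weights.bin",
                                              "weights_updated.bin");
session->modelToHost("model_updated.onnx");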

void modelToHost(const std::string &fn)

Write the current model to an ONNX file.

Parameters

fn – The path to file. The path can be absolute or relative. If you plan to run your program in multiple processes simultaneously, you should avoid possible race conditions by writing to different files, for example by using temporary files.

TensorInfo getInfo(TensorId) const

Get the tensor information for a tensor.

Parameters

TensorId – The identifier of the tensor to get the tensor information for.

Returns

The tensor information for the tensor.

bool hasInfo(TensorId) const

Check whether a tensor has information.

Parameters

TensorId – The identifier of the tensor to get the tensor information for.

Returns

true if the tensor with identifier TensorId has tensor information and false if not.
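
For example, the tensor information can be used to size a host buffer before running; the tensor ID "output" is an assumption:

if (session->hasInfo("output")) {
  popart::TensorInfo info = session->getInfo("output");
  // nelms() gives the total element count, used here to size the buffer.
  std::vector<float> hostBuffer(info.nelms());
}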

std::string getSummaryReport(bool resetProfile = true) const

Retrieve the summary report from the poplar::Engine.

The options which were passed to the Session constructor will influence the information in the report.

This method may only be called after prepareDevice() has been called.

Parameters

resetProfile – If true, resets the execution profile. Default = true.

Returns

A string containing the report.
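
A minimal sketch:

#include <iostream>

session->prepareDevice();
std::cout << session->getSummaryReport() << std::endl;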

std::string getSerializedGraph() const

Retrieve the serialized graph from the poplar::Engine.

A JSON format report is produced.

This method may only be called after prepareDevice() has been called.

Returns

A string containing the serialized graph.

pva::Report getReport() const

Retrieve the graph report from the poplar::Engine.

The options which were passed to the Session constructor will influence the information in the report.

This method may only be called after prepareDevice() has been called.

Returns

The PopVision Analysis report object.

void resetHostWeights(const std::string &model, const bool ignoreWeightsInModelWithoutCorrespondingHostWeight = false)

Reset weights with weights in an ONNX model.

Note that the only differences between the ONNX model and the current model must be the weights. No other differences are allowed.

This method only updates the weights on the host. weightsFromHost() must be called after this method to update the weights on the device.

Parameters
  • model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.

  • ignoreWeightsInModelWithoutCorrespondingHostWeight – If true, do not throw an error if there are initializers in the ONNX model without corresponding initializer tensor(s) in the session’s IR.

void readWeights(const IWeightsIO &weightsIo)

Read the weights from the host stream memory and write to the host.

This method may only be called after weightsToHost() has been called.

Parameters

weightsIo – The weight data that is read from the host stream memory is written to the addresses in weightsIo.out.
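
A sketch of reading back trained weights, assuming a weight tensor with ID "w0". MutableVoidData (with data and info fields) and WeightsIO are described in Section 2.2:

std::vector<float> w0Host(session->getInfo("w0").nelms());

popart::MutableVoidData w0Data;
w0Data.data = w0Host.data();
w0Data.info = session->getInfo("w0");

popart::WeightsIO weightsIo;
weightsIo.insert("w0", w0Data);

session->weightsToHost();        // device -> host stream memory
session->readWeights(weightsIo); // host stream memory -> w0Host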

void writeWeights(const IWeightsIO &weightsIo)

Write the weights from the host to the IR tensor memory.

This method may only be called after weightsFromHost() has been called.

Parameters

weightsIo – The weight data is written to the addresses in weightsIo.out.

std::string serializeIr(IrSerializationFormat format)

Serialize the IR graph to a string.

Parameters

format – The format to use for serializing.
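
For example:

std::string irJson = session->serializeIr(popart::IrSerializationFormat::JSON);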

inline const Ir &getIr() const

Get the IR associated with the Session.

inline const popx::Devicex &getDevice() const

Get the device associated with the Session.

inline popx::Devicex &getDevice()

Get the device associated with the Session.

inline const popx::IrLowering &getIrLowering() const

Get the IR lowering associated with the Session.

inline popx::Executablex &getExecutable()

Get the executable associated with the Session.

inline const popx::Executablex &getExecutable() const

Get the executable associated with the Session.

void updateEngineCache()

Update cacheEntries from the engine cache directory and update ir::hashMatched_ with the updated cacheEntries.

void setDeviceInfo(std::shared_ptr<DeviceInfo> deviceInfo)

Set the DeviceInfo of the Session.

2.1.1. Training session

#include <popart/session.hpp>
class popart::TrainingSession : public popart::Session

TrainingSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware with training provided by optimizing a loss tensor using an optimizer and automatic differentiation (backpropagation).

Public Functions

~TrainingSession() override

Destructor for the TrainingSession class.

void updateOptimizerFromHost(const Optimizer *optimizer)

Update the optimizer from the host.

This method updates the optimizer and the associated hyperparameters but not the optimizer state tensors.

NOTE: The optimizer parameter has to be compatible with the optimizer passed to the TrainingSession constructor. For example, you cannot call this function with an SGD1 optimizer if you created the session with an SGD0 optimizer. This is because it is not possible to change the IR after a session has been constructed.

Parameters

optimizer – A pointer to a popart::Optimizer.
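
A sketch of lowering the learning rate mid-training, assuming the session was constructed with an SGD optimizer whose learning rate was not marked const:

#include <popart/sgd.hpp>

// Same SGD variant as at construction; only the hyperparameter changes.
popart::SGD newOptimizer({{"defaultLearningRate", {0.001f, false}}});
session->updateOptimizerFromHost(&newOptimizer);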

void copyFromRemoteBuffer(const std::string &buffer, void *w, int repeat_index, unsigned replication_index = 0)

Copy from a remote buffer into a user buffer.

This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.

Parameters
  • buffer – The name of the remote buffer to copy from.

  • w – Pointer to a user buffer to copy to.

  • repeat_index – The index in the remote buffer to copy from.

  • replication_index – The replicated graph index when using replicated graphs. Default=0.

void copyToRemoteBuffer(void *w, const std::string &buffer, int repeat_index, unsigned replication_index = 0)

Copy from a user buffer to a remote buffer.

This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.

Parameters
  • w – Pointer to a user buffer to copy from.

  • buffer – The remote buffer to copy to.

  • repeat_index – The index in the remote buffer to copy to.

  • replication_index – The replicated graph index when using replicated graphs. Default=0.

Public Static Functions

static std::unique_ptr<TrainingSession> createFromIr(std::shared_ptr<Ir> ir, std::shared_ptr<DeviceInfo> deviceInfo, const std::string name = DefaultTrainingSessionName)

Create a session for training from an IR.

Parameters
  • ir – The IR to create the session from.

  • deviceInfo – The type of device that this session uses.

  • name – The name of this training session. Default: “training”.

static std::unique_ptr<TrainingSession> createFromOnnxModel(const std::string &model, const DataFlow &dataFlow, const TensorId &loss, const Optimizer &optimizer, std::shared_ptr<DeviceInfo> deviceInfo, const InputShapeInfo &inputShapeInfo = InputShapeInfo(), const SessionOptions &userOptions = SessionOptions(), const Patterns &patterns = Patterns(), const std::string name = DefaultTrainingSessionName)

Create a session for training from an ONNX model.

Parameters
  • model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.

  • dataFlow – Configuration for the data feeds and fetches.

  • loss – The identifier of the final scalar loss tensor for training.

  • optimizer – The optimizer to use when training.

  • deviceInfo – The type of device that this session uses.

  • inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().

  • userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().

  • patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().

  • name – (Optional) The name of this training session. Default: “training”.
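
A minimal sketch, where the model file, the loss tensor ID "loss", and the optimizer settings are illustrative assumptions:

#include <popart/dataflow.hpp>
#include <popart/devicemanager.hpp>
#include <popart/session.hpp>
#include <popart/sgd.hpp>

auto device =
    popart::DeviceManager::createDeviceManager().acquireAvailableDevice(1);
popart::DataFlow dataFlow(/*batchesPerStep=*/1);
popart::SGD optimizer({{"defaultLearningRate", {0.01f, false}}});

auto session = popart::TrainingSession::createFromOnnxModel(
    "model.onnx", dataFlow, "loss", optimizer, device);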

2.1.2. Inference session

#include <popart/session.hpp>
class popart::InferenceSession : public popart::Session

InferenceSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware, without any automatic differentiation (backpropagation) or optimization.

Public Functions

~InferenceSession() override

Destructor for the InferenceSession class.

void popxlSetEngineIsLoaded(bool isLoaded)

Public Static Functions

static std::unique_ptr<InferenceSession> createFromIr(std::shared_ptr<Ir> ir, std::shared_ptr<DeviceInfo> deviceInfo, const std::string name = DefaultInferenceSessionName)

Create a session for inference from an IR.

Parameters
  • ir – The IR to create the session from.

  • deviceInfo – The type of device that this session uses.

  • name – The name of this inference session. Default: “inference”.

static std::unique_ptr<InferenceSession> createFromOnnxModel(const std::string &model, const DataFlow &dataFlow, std::shared_ptr<DeviceInfo> deviceInfo, const InputShapeInfo &inputShapeInfo = InputShapeInfo(), const SessionOptions &userOptions = SessionOptions(), const Patterns &patterns = Patterns(), const std::string name = DefaultInferenceSessionName)

Create a session for inference from an ONNX model.

Parameters
  • model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.

  • dataFlow – Configuration for the data feeds and fetches.

  • deviceInfo – The type of device that this session uses.

  • inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().

  • userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().

  • patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().

  • name – (Optional) The name of this inference session. Default: “inference”.

2.1.3. Session options

#include <popart/sessionoptions.hpp>
enum popart::AccumulateOuterFragmentSchedule

Enum type that determines how the operations in the accumulate outer fragment will be scheduled across virtual graphs (only relevant to pipelined modes).

Values:

enumerator Scheduler = 0

Don’t add additional constraints and let the scheduler work it out.

enumerator Serial

Add constraints that ensure ops are executed in virtual graph ID order.

enumerator OverlapCycleOptimized

Try and parallelise ops with different virtual graph IDs as much as possible.

enumerator OverlapMemoryOptimized

Try and parallelise ops with different virtual graph IDs but avoid certain steps that are costly in terms of memory usage.

enum popart::AutodiffStitchStrategy

Enum type representing a strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph.

Strategies may expose tensors that would otherwise have been internal to the forward graph as outputs of this forward graph.

Values:

enumerator RecomputeMinimal = 0

Recompute any backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph.

enumerator RecomputeAllNonInputs

Recompute any backward graph inputs associated with non-gradient forward graph tensors that are not inputs in the forward graph.

enumerator AddFwdOutputs

For backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph, add them as outputs to the forward graph.

Note

This strategy is not guaranteed to work for all circumstances. In particular, it is unable to deal with subgraphs of IfOp. Using this setting may therefore result in subsequent exceptions in the Autodiff transform and it is therefore inadvisable to use this as an Autodiff default.

enumerator SafeAddFwdOutputs

Like AutodiffStitchStrategy::AddFwdOutputs except that those backward graph inputs that can’t be stitched with AutodiffStitchStrategy::AddFwdOutputs (that is, by adding outputs to the forward graph) are stitched using the AutodiffStitchStrategy::RecomputeMinimal strategy instead.

This means that this is a safe strategy to use as an Autodiff default.

enumerator N

Number of AutodiffStitchStrategy values.

enum popart::BatchSerializationBatchSchedule

Enum type that describes how to change the batch serialisation subgraph schedule before outlining.

Note

This setting is experimental and may change.

Values:

enumerator Scheduler = 0

Don’t encourage any particular scheduling for ops within batch subgraphs (leave it to the scheduler) but tell the scheduler to schedule subgraphs in sequence.

enumerator Isomorphic

Encourage all ops within batch subgraphs to be scheduled identically and for each subgraph to be scheduled in sequence (good for outlineability).

enumerator OverlapOnIo

Attempt to put the remote load op for batch N+1 right after the compute phase of batch N.

enumerator OverlapOnCompute

Attempt to put the remote load op for batch N+1 right before the compute phase of batch N.

enumerator N

The number of BatchSerializationBatchSchedule values.

enum popart::BatchSerializationMethod

Enum type that describes how to apply the batch serialization.

Note

This setting is experimental and may change.

Values:

enumerator UnrollDynamic = 0

Unroll the batch with dynamic slicing.

enumerator UnrollStatic

Unroll the batch with static slicing.

enumerator Loop

Loop over the batch dimension.

enumerator N

The number of BatchSerializationMethod values.

enum popart::BatchSerializationTransformContext

Enum type that describes when to apply batch serialization.

Note

This setting is experimental and may change.

Values:

enumerator Fwd = 0

Apply batch serialization before growing the backward pass.

enumerator Bwd

Apply batch serialization after growing the backward pass.

enumerator N

The number of BatchSerializationTransformContext values.

enum popart::ExecutionPhaseIOSchedule

Enum type to specify when to load tensors.

Values:

enumerator Preload = 0

Preload tensors in previous phase for use in current phase.

enumerator OnDemand

Load tensors just before they are required.

enumerator N

The number of ExecutionPhaseIOSchedule values.

enum popart::ExecutionPhaseSchedule

Enum type to specify the order of processing optimizer operations for different weights of the same execution phase.

The steps for phased execution are:

  1. Copy to IO tiles if necessary.

  2. Run collective operations if necessary.

  3. Load optimizer state.

  4. Update optimizer state.

  5. Apply optimizer.

  6. Store updated tensor if necessary.

Values:

enumerator Interleaving = 0

Process above steps for one weight at a time (for example: 123456, 123456, 123456).

The scheduler may interleave these steps.

enumerator Batch

Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange (for example: 333, 111, 222, 444, 555, 666).

enumerator BatchClusteredIO

Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange, and maximises stream copy merges by keeping RemoteLoad/RemoteStore operations clustered (for example: 333, 111, 222, 444, 555, 666).

enumerator N

The number of ExecutionPhaseSchedule values.

enum popart::GradientTensorTrackingMethod

Enum type to specify the method for selecting gradient tensors whose statistics are to be tracked for the AutomaticLossScale transform.

Values:

enumerator AllNonViewChangingGradientTensors = 0

Track all gradients of non-view-changing gradient tensors.

enumerator ConvAndMatmulGradients

Track all gradients of inputs to MatMul and Convolution ops.

enumerator GradientsOfUserSpecifiedTensors

Track gradients of user-specified tensors.

enumerator N

The number of GradientTensorTrackingMethod values.

enum popart::Instrumentation

Enum type used to specify an instrumentation type.

Values:

enumerator Outer = 0

Outer loop instrumentation, graph over all IPUs.

enumerator Inner

Inner loop instrumentation, graph per IPU.

enumerator N

The number of Instrumentation values.

enum popart::IrSerializationFormat

Enum type used to specify a serialization format.

Values:

enumerator JSON

JavaScript Object Notation (JSON).

enum popart::MeanReductionStrategy

Enum type that specifies when to divide by a mean reduction factor, when doing mean reduction over a sequence of tensors \(t_1, t_2, ..., t_k\).

Values:

enumerator Running = 0

Keep the reduction buffer as the mean of the tensors accumulated so far.

If \(t_1, ..., t_f\) have just been processed, the current accumulator \(s\) is the mean of these values, and the next accumulator update is \(s = \frac{f}{f+1} * s + \frac{1}{f+1} * t_{f+1}\) to keep \(s\) a running mean.

This strategy guarantees \(s \le \max(t_1, ..., t_k)\) throughout the accumulation, therefore it will not overflow, but it is generally slower than MeanReductionStrategy::Post.
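
To see that the update preserves the running mean, substitute \(s = \frac{1}{f} \sum_{i=1}^{f} t_i\) into the update rule:

\[ s \leftarrow \frac{f}{f+1} \cdot \frac{1}{f} \sum_{i=1}^{f} t_i + \frac{1}{f+1} t_{f+1} = \frac{1}{f+1} \sum_{i=1}^{f+1} t_i \]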

enumerator Post

Keep the reduction buffer as the running sum, and divide once by \(k\) at the end of the accumulation.

This strategy will generally be faster than MeanReductionStrategy::Running, but is prone to overflow (especially when using fp16).

enumerator N

The number of MeanReductionStrategy values.

enum popart::MergeVarUpdateType

Enum type used to specify which VarUpdateOp ops to merge.

Values:

enumerator None = 0

Do not merge VarUpdateOp ops.

enumerator All

Merge all VarUpdateOp ops into as few groups as possible.

This is a good choice when memory is not a constraint.

enumerator AutoLoose

Merge into groups while attempting not to increase maximum variable liveness, and also without slicing tensor variables in a way that would require them to be processed by different VarUpdateOp ops.

enumerator AutoTight

Merge into groups, so that VarUpdateOp ops process tensors of exactly SessionOptions::mergeVarUpdateMemThreshold in size.

enumerator N

The number of MergeVarUpdateType values.

enum popart::RecomputationType

Enum type to specify which ops to recompute in the backward pass when doing auto-recomputation.

Values:

enumerator None = 0

No ops are recomputed (Default).

enumerator Standard

Recompute using an algorithm that picks checkpoints to try to minimise maximum liveness.

enumerator NormOnly

Only Norm ops (+ non-linearities, if following) are recomputed.

enumerator Pipeline

Recompute all forward pipeline stages.

enumerator RecomputeAll

Recompute all ops.

enumerator N

The number of RecomputationType values.

enum popart::SubgraphCopyingStrategy

Enum type that describes how copies for inputs and outputs for subgraphs are lowered.

Currently this only affects subgraphs associated with CallOp ops.

Values:

enumerator OnEnterAndExit = 0

Copy all inputs before the start of the subgraph, copy all outputs after all ops in the subgraph.

With this strategy, subgraphs will always map to a single Poplar function.

enumerator JustInTime

Copy inputs just before they are consumed and copy outputs as soon as they are produced.

With this strategy, subgraphs may be lowered into multiple Poplar functions.

enumerator N

The number of SubgraphCopyingStrategy values.

enum popart::SyntheticDataMode

Enum type used to specify the data source for input tensors.

Values:

enumerator Off = 0

Use real data.

enumerator Zeros

Input tensors are initialised to all zeros.

enumerator RandomNormal

Input tensors are initialised with a random normal distribution ~N(0,1).

enumerator N

The number of SyntheticDataMode values.

enum popart::VirtualGraphMode

Enum type used to specify a virtual graph mode.

Values:

enumerator Off = 0

Virtual graphs are not enabled.

enumerator Manual

User must set the popart::Op::virtualGraph attribute on all ops.

enumerator Auto

Use the AutoVirtualGraph transform.

enumerator ExecutionPhases

Virtual graphs are tied to execution phases.

enumerator N

The number of VirtualGraphMode values.

struct popart::AccumulateOuterFragmentSettings

A structure containing accumulate outer fragment settings.

Public Functions

AccumulateOuterFragmentSettings() = default
inline AccumulateOuterFragmentSettings(AccumulateOuterFragmentSchedule schedule_, const std::vector<int> &excludedVirtualGraphs_)

Constructor for AccumulateOuterFragmentSettings.

Parameters
  • schedule_ – Indicate how to schedule the accumulate outer fragment. This setting is experimental and may change. Default: AccumulateOuterFragmentSchedule::Serial

  • excludedVirtualGraphs_ – Indicate to explicitly avoid parallelising the virtual graph IDs. This setting is experimental and may change.

Public Members

AccumulateOuterFragmentSchedule schedule = AccumulateOuterFragmentSchedule::Serial

Indicate how to schedule the accumulate outer fragment.

Note

This setting is experimental and may change.

std::vector<int> excludedVirtualGraphs = {}

Indicate to explicitly avoid parallelising the virtual graph IDs.

Note

This setting is experimental and may change.

struct popart::AutodiffSettings

The settings for the Autodiff transform.

Public Functions

AutodiffSettings() = default

Default constructor for the AutodiffSettings struct.

inline AutodiffSettings(AutodiffStitchStrategy stitchStrategy_)

Constructor for the AutodiffSettings struct.

Parameters

stitchStrategy_ – The strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph. Default: AutodiffStitchStrategy::RecomputeAllNonInputs.

Public Members

AutodiffStitchStrategy stitchStrategy = AutodiffStitchStrategy::RecomputeAllNonInputs

The strategy PopART should use to ensure that all graph inputs of a backward graph are available as either inputs or outputs of the forward graph or gradients of outputs of the forward graph.

Note

This is an experimental option and may change.

struct popart::AutomaticLossScalingSettings

A structure containing user configuration for automatic loss scaling settings.

Note

Automatic loss scaling is currently experimental and under active development. Recommendation: Set the loss scale manually.

Public Functions

AutomaticLossScalingSettings() = default

Default constructor for AutomaticLossScalingSettings.

AutomaticLossScalingSettings(bool enabled_, const nonstd::optional<std::vector<TensorId>> &toTrackTensors_, float binEdgeLocation_, float thresholdUpperCountProportion_, int updatePeriod_, GradientTensorTrackingMethod gradientTensorTrackingMethod_)

Constructor for AutomaticLossScalingSettings.

Parameters
  • enabled_ – Indicate whether to keep track (true) or not (false) of the distribution of gradient tensor elements over the floating point range. Default: false.

  • toTrackTensors_ – An optional list of model tensor names, for which gradient statistics will be collected. If not set, the gradients of all tensors produced by default operations (matmul, conv) will be used.

  • binEdgeLocation_ – The location of the bin edge as a proportion of the absolute numerical range of the tracked gradient tensor elements, in the range [0, 1]. 0 represents the smallest representable value, and 1 the maximum. This is the single bin edge of the histogram that is an input to the loss scale updater algorithm. Default: 0.125.

  • thresholdUpperCountProportion_ – The proportion of the elements in the upper bin above which the loss scale is increased, and below which the loss scale is decreased. Should be in the range [0, 1]. Default: 1e-7.

  • updatePeriod_ – Indicate how often the loss scale update factor should be updated with respect to optimizer steps. Default: 1

  • gradientTensorTrackingMethod_ – The method for selecting gradient tensors whose statistics are to be tracked. Default: GradientTensorTrackingMethod::AllNonViewChangingGradientTensors.

std::size_t hash() const

Public Members

bool enabled = false
float binEdgeLocation = 0.125f
float thresholdUpperCountProportion = 1e-7
nonstd::optional<std::vector<TensorId>> toTrackTensors
int updatePeriod = 1
GradientTensorTrackingMethod gradientTensorTrackingMethod = GradientTensorTrackingMethod::AllNonViewChangingGradientTensors
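
A hypothetical sketch, assuming SessionOptions exposes these settings through a member named automaticLossScalingSettings:

popart::SessionOptions opts;
opts.automaticLossScalingSettings.enabled = true;
opts.automaticLossScalingSettings.binEdgeLocation = 0.125f;
opts.automaticLossScalingSettings.gradientTensorTrackingMethod =
    popart::GradientTensorTrackingMethod::ConvAndMatmulGradients;
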
struct popart::BatchSerializationSettings

A structure containing batch serialization settings.

Public Functions

BatchSerializationSettings() = default

Default constructor for BatchSerializationSettings.

BatchSerializationSettings(int factor_, bool concatOnVirtualGraphChange_, bool concatOnExecutionPhaseChange_, bool concatOnPipelineStageChange_, BatchSerializationTransformContext transformContext_ = BatchSerializationTransformContext::Fwd, BatchSerializationMethod method_ = BatchSerializationMethod::UnrollDynamic, BatchSerializationBatchSchedule batchSchedule_ = BatchSerializationBatchSchedule::Isomorphic)

Constructor for BatchSerializationSettings.

Parameters
  • factor_ – The number of compute batches to split operations into. Default: 0.

  • concatOnVirtualGraphChange_ – Indicate to break batch serialization chains (true) when the virtual graph changes (by concatenating the compute batches to the local batch). Default: true.

  • concatOnExecutionPhaseChange_ – Indicate to break batch serialization chains (true) when the execution phase changes (by concatenating the compute batches to the local batch). Default: true.

  • concatOnPipelineStageChange_ – Indicate to break batch serialization chains (true) when the pipeline stage changes (by concatenating the compute batches to the local batch). Default: true.

  • transformContext_ – An experimental value to control when batch serialization is applied. Default: Fwd.

  • method_ – An experimental value to control how batch serialization is applied. Default: BatchSerializationMethod::UnrollDynamic.

  • batchSchedule_ – An experimental value that changes how operations are scheduled. Default: BatchSerializationBatchSchedule::Isomorphic.

Public Members

int factor = 0

The number of compute batches to split operations into.

bool concatOnVirtualGraphChange = true

Break batch serialization chains when the virtual graph changes (by concatenating the compute batches to the local batch).

bool concatOnExecutionPhaseChange = true

Break batch serialization chains when the execution phase changes (by concatenating the compute batches to the local batch).

bool concatOnPipelineStageChange = true

Break batch serialization chains when the pipeline stage changes (by concatenating the compute batches to the local batch).

BatchSerializationTransformContext transformContext = BatchSerializationTransformContext::Fwd

Experimental value to control when batch serialization is applied.

BatchSerializationMethod method = BatchSerializationMethod::UnrollDynamic

Experimental value to control how batch serialization is applied.

BatchSerializationBatchSchedule batchSchedule = BatchSerializationBatchSchedule::Isomorphic

Experimental value that changes how operations are scheduled.

struct popart::ExecutionPhaseSettings

A structure containing ExecutionPhase settings.

Public Functions

ExecutionPhaseSettings() = default

Default constructor for ExecutionPhaseSettings.

inline ExecutionPhaseSettings(int phases_, bool stages_, ExecutionPhaseIOSchedule weightIOSchedule_, ExecutionPhaseIOSchedule activationIOSchedule_, ExecutionPhaseIOSchedule optimizerStateIOSchedule_, ExecutionPhaseIOSchedule accumulatorIOSchedule_, ExecutionPhaseSchedule schedule_)

Constructor for ExecutionPhaseSettings.

Parameters
  • phases_ – The number of execution phases for the whole model. Default=0.

  • stages_ – The number of overlapping stages:

    • 1: Parallel streaming memory, default for 1 IPU per replica.

    • 2: PingPong between 2 IPUs, default for 2 or more IPUs per replica (Default).

  • weightIOSchedule_ – The execution phase IO schedule for weight tensors. Default: ExecutionPhaseIOSchedule::Preload.

  • activationIOSchedule_ – The execution phase IO schedule for activation and gradient tensors. Default: ExecutionPhaseIOSchedule::Preload.

  • optimizerStateIOSchedule_ – The execution phase IO schedule for optimizer state tensors. Default: ExecutionPhaseIOSchedule::OnDemand.

  • accumulatorIOSchedule_ – The execution phase IO schedule for accumulator tensors. Default: ExecutionPhaseIOSchedule::Preload.

  • schedule_ – An experimental value that changes how operations are scheduled. Default: ExecutionPhaseSchedule::Interleaving.

Public Members

int phases = 0

Number of ExecutionPhases for the whole model.

int stages = 2

Number of overlapping stages.

  • 1: Parallel streaming memory, default for 1 IPU per replica.

  • 2: PingPong between 2 IPUs, default for 2 or more IPUs per replica.

ExecutionPhaseIOSchedule weightIOSchedule = ExecutionPhaseIOSchedule::Preload

The execution phase IO schedule for weight tensors.

ExecutionPhaseIOSchedule activationIOSchedule = ExecutionPhaseIOSchedule::Preload

The execution phase IO schedule for activation and gradient tensors.

ExecutionPhaseIOSchedule optimizerStateIOSchedule = ExecutionPhaseIOSchedule::OnDemand
ExecutionPhaseIOSchedule accumulatorIOSchedule = ExecutionPhaseIOSchedule::Preload
ExecutionPhaseSchedule schedule = ExecutionPhaseSchedule::Interleaving
struct popart::ReplicatedCollectivesSettings

A structure containing settings for replicated collective operations.

Public Functions

ReplicatedCollectivesSettings(bool prepareScheduleForMergingCollectives = false, bool mergeAllReduceCollectives = false)

Constructor for the ReplicatedCollectivesSettings struct.

Parameters
  • prepareScheduleForMergingCollectives – Insert constraints into the schedule such that collectives which can be merged occur one right after the other. true to insert constraints, false otherwise. Default: false.

  • mergeAllReduceCollectives – Identify allreduce operations which can be scheduled at the same time, and perform them as one larger operation to better utilize the bandwidth between replicas. true to identify operations, false otherwise. Default: false.

std::size_t hash() const

Public Members

bool prepareScheduleForMergingCollectives = false
bool mergeAllReduceCollectives = false
struct popart::SessionOptions

A structure containing user configuration options for the Session class.

Public Members

std::string logDir

A directory for log traces to be written into.

std::set<std::string> dotChecks = {}

When to write .dot files during IR construction.

int firstDotOp = 0

The ops written to the .dot file will be a part of the schedule, controlled by firstDotOp and finalDotOp.

In particular, it will be [max(0, firstDotOp), min(N ops in IR, finalDotOp)).

int finalDotOp = 10000

See firstDotOp.

bool dotOpNames = false

Enable inclusion of the op name in the .dot file (the op type is always exported).

Enabled when true. Default: false.

bool exportPoplarComputationGraph = false

Enable export of Poplar computational graph.

Enabled when true. Default: false.

bool exportPoplarVertexGraph = false

Enable export of Poplar vertex graph.

Enabled when true. Default: false.

bool separateCallOpPdfs = true

Enable creation of separate PDFs for each subgraph when generating PDFs of IR graphs.

Enabled when true. Default: true.

bool enableOutlining = true

Enable outlining.

This identifies and extracts repeated parts of computational graph into subgraphs. Enabled when true. Default: true.

bool enableOutliningCopyCostPruning = true

Enable inclusion of the cost of copying cached sections in the outlining cost model.

Enabled when true. Default: true.

float outlineThreshold = 1.0f

Specify the incremental value that a sub-graph requires, relative to its nested sub-graphs (if any), to be eligible for outlining.

A high threshold results in fewer sub-graphs being outlined, while a negative value results in all sub-graphs being outlined. The gross value of a sub-graph is the sum of its constituent ops’ Op::getSubgraphValue() values. To disable outlining, it is better to set enableOutlining to false than to set this value to infinity. The default value of 1.0f results in all high-value operations, such as convolution, being cached, but standalone low-value operations, such as ReLU, will not be.

Default: 1.0f.

float outlineSequenceBreakCost = 10000.0f

Specify the penalty applied to outlining potential sub-graphs if the sub-graph to be created breaks up a sequence of operations that are more efficient (for example for overlapping compute and exchange) when outlined together.

Default: 10000.0f.

SubgraphCopyingStrategy subgraphCopyingStrategy = SubgraphCopyingStrategy::OnEnterAndExit

Specify how copies for inputs and outputs for subgraphs are lowered.

Setting this value to SubgraphCopyingStrategy::JustInTime may save memory at the cost of fragmenting subgraphs into multiple Poplar functions. This may be particularly useful when a number of weight updates are outlined in one subgraph, as it may prevent multiple weight tensors from being live at the same time inside the subgraph.

Default: SubgraphCopyingStrategy::OnEnterAndExit.

RecomputationType autoRecomputation = RecomputationType::None

Enable recomputation of operations in the graph in the backward pass.

This will reduce model size at the cost of computation cycles.

Default: RecomputationType::None (no recomputation).

MergeVarUpdateType mergeVarUpdate = MergeVarUpdateType::None

Enable merging of VarUpdates into groups of VarUpdates, by flattening and concatenating variable tensors and updating tensors.

Default: MergeVarUpdateType::None (no merging).

int64_t mergeVarUpdateMemThreshold = 1000000

Specify the memory threshold for VarUpdateOp merging algorithms.

The MergeVarUpdateType::AutoLoose and MergeVarUpdateType::AutoTight VarUpdateOp merging algorithms have a threshold on the total memory of variable tensors to merge for updating. Defined as total memory in bytes.

Default: 1000000.
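
A sketch of setting a few of these fields; the values are illustrative only, and the resulting object would be passed as the userOptions argument of a session factory (see createFromOnnxModel above):

popart::SessionOptions opts;
opts.enableOutlining = true;
opts.autoRecomputation = popart::RecomputationType::Standard;
opts.mergeVarUpdate = popart::MergeVarUpdateType::AutoLoose;
opts.mergeVarUpdateMemThreshold = 31457280; // 30 MB, in bytes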

struct popart::TensorLocationSettings

A structure containing user configuration for cache/offloading settings.

Public Functions

TensorLocationSettings() = default

Constructor.

TensorLocationSettings(TensorLocation location_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)

Constructor.

Parameters
  • location_ – The tensor location information.

  • minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.

  • minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.

TensorLocationSettings(TensorStorage storage_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)

Constructor.

Parameters
  • storage_ – The tensor storage information.

  • minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.

  • minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.

Public Members

TensorLocation location = TensorLocation()

The default tensor location for this tensor type.

int minElementsForOffChip = 2

The minimum number of elements below which offloading won’t be considered.

int minElementsForReplicatedTensorSharding = 8192

A minimum number of elements below which replicated tensor sharding won’t be considered.

#include <popart/variablesettings.hpp>
class popart::VariableSettings

A class to dictate the behaviour of variables, and their reductions, across multiple graphs.

Public Functions

void verify()

Runs a test to see if the VariableSettings are invalid, and throws an error if so.

inline const CommGroup getSharedVariableDomain() const
Returns

the CommGroup sharedVariableDomain of this VariableSettings.

inline VariableRetrievalMode getRetrievalMode() const
Returns

the VariableRetrievalMode retrievalMode of this VariableSettings.

VariableSettings()

“Default” constructor, defaults CommGroup to [All, 0] and retrievalMode to OnePerGroup.

VariableSettings(CommGroup sharedVariableDomain_)

Defaults VariableRetrievalMode to OnePerGroup.

VariableSettings(VariableRetrievalMode retrievalMode_)

Defaults CommGroup to [All, 0].

VariableSettings(CommGroup sharedVariableDomain_, VariableRetrievalMode retrievalMode_)

Entirely custom VariableSettings.

unsigned numReplicasReturningVariable(unsigned replicaCount) const

Calculate the number of replicas that will return this variable.

Parameters

replicaCount – Number of global replicas.

Returns

The number of replicas that will return this variable.

unsigned groupCount(unsigned replicaCount) const
Parameters

replicaCount – The replicationFactor of the graph.

Returns

The number of groups, given the replication factor and the VariableSettings.

unsigned getRealGroupSize(unsigned replicaCount) const

Because CommGroup’s don’t have a defined group-size if the type is All or None, this function will return a group-size that is always accurate, based on replicas.

Parameters

replicaCount – The replication factor.

Returns

The actual number of replicas in a group.

unsigned getGroupRepresentative(unsigned group) const

Get the default first member of a group.

Parameters

group – The group to return the representative for.

Returns

The representative replica of this group.

Shape shapeOnReplica(Shape full_shape, unsigned replicaCount, const TensorId name) const

The shape Onnx reads holds an extra outer dimension in certain cases, where the outer dimension represents the number of returning replica variables.

This function takes an Onnx full-shape and removes the outer dimension safely (ie. checks if the outer dimension matches an expected outer dimension). A quick-function to avoid duplicate code.

Parameters
  • full_shape – The shape as presented by ONNX.

  • replicaCount – The local replication factor, used to calculate the return factor.

  • name – The TensorId of the tensor, used to give good error feedback.

Returns

The shape of the data on the replica.

Shape shapeOnHost(Shape replica_shape, unsigned replicaCount) const

Takes the shape of a tensor on a replica and returns its full ONNX shape.

This is the inverse operation to shapeOnReplica().

Parameters
  • replica_shape – The shape of the data on a replica.

  • replicaCount – The local replication factor, used to calculate the return factor.

Returns

The shape as presented by ONNX.

std::vector<std::vector<std::int64_t>> groups(unsigned replicaCount) const

This function returns a vector of vectors, where each inner vector contains the replica IDs of the replicas that share a variable domain, given the VariableSettings and the replica count.

Parameters

replicaCount – The local replication factor

Returns

A vector of vectors, such that groups.at(a).at(b) is member number b of group a, groups.size() is the number of groups, and groups.at(a).size() is the size of group a.

bool operator==(VariableSettings other)

Compare two variable-settings.

Parameters

other – The VariableSettings to compare these settings to.

Returns

true if all internal elements are the same, otherwise false.

bool operator!=(VariableSettings other)

Compare two variable-settings.

Parameters

other – The VariableSettings to compare these settings to.

Returns

false if all internal elements are the same, otherwise true.

#include <popart/commgroup.hpp>
class popart::CommGroup

Class to specify sub-groups of replicas.

Examples of derived sub-groups:

  • IPU-link domain sub-rack:

    type == Consecutive && replicaGroupSize == 64/replica-size/N

    where N is a power of two and replicaGroupSize > 1.

  • Complete IPU-link domain / full rack:

    type == Consecutive && replicaGroupSize == 64/replica-size

  • Using GW-links only:

    type == Orthogonal && replicaGroupSize == 64/replica-size

Public Functions

CommGroup()

Default CommGroup constructor.

Sets type to CommGroupType::All and replicaGroupSize to 0.

inline CommGroup(CommGroupType type, unsigned groupSize)

Construct CommGroup.

Parameters
  • type – The replica group type.

  • groupSize – The replica group size.

bool operator==(const CommGroup &other) const
bool operator!=(const CommGroup &other) const

Public Members

CommGroupType type = CommGroupType::All

Replica group type.

unsigned replicaGroupSize = 0

Replica group size.
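
A sketch combining the two classes: group every four consecutive replicas into one variable domain and retrieve one copy of the variable per group. The grouping values are illustrative:

#include <popart/commgroup.hpp>
#include <popart/variablesettings.hpp>

popart::CommGroup group(popart::CommGroupType::Consecutive, /*groupSize=*/4);
popart::VariableSettings settings(group,
                                  popart::VariableRetrievalMode::OnePerGroup);
settings.verify(); // throws if the settings are invalid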

2.2. Data input and output (IStepIO)

#include <popart/istepio.hpp>
class popart::IStepIO

An abstract base class through which input and output data is passed to a Session (see Session::run).

Data is passed via buffers. In the case of buffers returned by IStepIO::in, PopART reads from these buffers. In the case of IStepIO::out, PopART writes to these buffers. The IStepIO::inComplete() and IStepIO::outComplete() functions are called by PopART to signal it is done with an input or output buffer.

An IStepIO implementation should conceptually implement a rolling queue of active buffers for each input and output tensor. Every successful call to IStepIO::in should yield a new data buffer for PopART to read from and add it to the head of the conceptual queue. Conversely, every call to IStepIO::inComplete() should be taken to mean that the buffer at the tail-end of the queue is no longer being used by PopART. This buffer is removed from the conceptual queue.

Note that an IStepIO::in call with the prefetch flag set is only considered successful when it returns data.

Output works analogously to input.

The expected total number of input (or output) buffers that are ‘completed’ for a tensor in one Session::run call is bps \(\times\) SessionOptions::accumulationFactor \(\times\) SessionOptions::replicatedGraphCount, where bps is the number of batches per call to Session::run (this is a value captured by the DataFlow instance passed to the Session instance).

Note, however, that there may be additional ‘incomplete’ calls to IStepIO::in and IStepIO::out.

Furthermore, the number of input (or output) buffers that may be ‘incomplete’ at a given time for a given tensor should not normally be more than SessionOptions::bufferingDepth \(\times\) SessionOptions::replicatedGraphCount, but this bound is not guaranteed.

EXAMPLE: Suppose a session is configured such that the total expected number of input buffers is 6 and these are input buffers for a tensor with ID t with 100 elements. The associated input calls in IStepIO may look like this if SessionOptions::bufferingDepth is 3:

in("t", 100, false) -> Give buffer[0] to PopART.
in("t", 100, true) -> Give buffer[1] to PopART.
in("t", 100, true) -> Give buffer[2] to PopART.
inComplete("t", 100) -> buffer[0] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[3] to PopART.
inComplete("t", 100) -> buffer[1] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[4] to PopART.
inComplete("t", 100) -> buffer[2] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[5] to PopART.
inComplete("t", 100) -> buffer[3] is no longer required and can be reused.
in("t", 100, true) -> No data available, return nullptr.
inComplete("t", 100) -> buffer[4] is no longer required and can be reused.
inComplete("t", 100) -> buffer[5] is no longer required and can be reused.

Subclassed by popart::StepIOCallback, popart::StepIOGeneric< ARRAY_TYPE, ACCESSOR_TYPE, ArrayInfoT >, popart::StepIOGeneric< IArray, StepIONS::IArrayAccessor, IArray &>

Public Functions

virtual ~IStepIO() = default

Destructor for IStepIO.

virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch) = 0

Request a new input data buffer.

The memory in this buffer is available for use in PopART until the corresponding inComplete() call.

Note

Failing to provide a valid data buffer will result in a runtime failure if prefetch is set to false.

Parameters
  • id – The ID of the tensor to return data for.

  • numElements – The number of elements in the tensor.

  • prefetch – If set to true the inability to provide data is not considered an error. If false, it is considered an error if no data can be provided.

Returns

The input buffer for this tensor (or nullptr on failure) returned as a ConstVoidData object.

virtual void inComplete(TensorId id, int64_t numElements) = 0

Notify the user (running a PopART program) that a previously retrieved input data buffer is no longer used by PopART.

Parameters
  • id – The ID of the tensor to return data for.

  • numElements – The number of elements in the tensor.

virtual MutableVoidData out(TensorId id, int64_t numElements) = 0

Request a new output data buffer.

The memory in this buffer is available for use in PopART until the corresponding outComplete() call and will be modified in-place.

Note

Failing to provide a valid data buffer will result in a runtime failure.

Parameters
  • id – The ID of the tensor to return data for.

  • numElements – The number of elements in the tensor.

Returns

The output buffer for this tensor returned as a MutableVoidData object.

inline virtual void outComplete(TensorId)

Notify the user (running a PopART program) that a previously retrieved output data buffer is no longer used by PopART.

Parameters

TensorId – The ID of the tensor whose output data buffer is no longer used.

inline void enableRuntimeAsserts(bool b)

Enable or disable runtime asserts.

If runtime asserts are enabled, then a check that the input and output buffers have the correct number of elements is performed. As Session::run() is called multiple times during a user’s session, the check is only performed in the first call to Session::run(), under the assumption that the user is unlikely to change the size of buffers between runs.

Parameters

b – The setting to enable runtime asserts (true) or disable runtime asserts (false).

inline bool runtimeAssertsEnabled() const

Check if runtime asserts are enabled.

Returns

true if runtime asserts are enabled, otherwise false.

virtual void assertNumElements(const popx::Executablex&) const = 0

Check number of elements.

This check is performed when runtimeAssertsEnabled() is true.

Parameters

Executablex – The executable against which to check that the input and output buffers have the correct number of elements.

#include <popart/stepio.hpp>
class popart::StepIO : public popart::StepIOGeneric<IArray, StepIONS::IArrayAccessor, IArray&>

Class to provide a Session object with input and output data.

Public Functions

inline StepIO(std::map<TensorId, IArray&> inputs, std::map<TensorId, IArray&> outputs)

Constructor for StepIO.

Parameters
  • inputs – The input data.

  • outputs – The output data.

class popart::StepIOCallback : public popart::IStepIO

Class that implements the IStepIO interface using user-provided callback functions.

The IStepIO interface contains a number of pure virtual member functions through which PopART receives buffers to read data from and buffers to write data to. StepIOCallback inherits from IStepIO and implements those member functions by delegating the logic to the callback functions passed in the constructor. This gives the user full control as to how data buffers are provisioned.

See IStepIO for more details on the expected behaviour of the callbacks.

Public Types

using InputCallback = std::function<ConstVoidData(TensorId, bool)>

Callable object that implements IStepIO::in().

using InputCompleteCallback = std::function<void(TensorId)>

Callable object that implements IStepIO::inComplete().

using OutputCallback = std::function<MutableVoidData(TensorId)>

Callable object that implements IStepIO::out().

using OutputCompleteCallback = std::function<void(TensorId)>

Callable object that implements IStepIO::outComplete().

Public Functions

inline StepIOCallback(InputCallback inputCallback, InputCompleteCallback inputCompleteCallback, OutputCallback outputCallback, OutputCompleteCallback outputCompleteCallback)

Construct a StepIOCallback object.

Parameters
  • inputCallback – The callback to use when IStepIO::in() is called.

  • inputCompleteCallback – The callback to use when IStepIO::inComplete() is called.

  • outputCallback – The callback to use when IStepIO::out() is called.

  • outputCompleteCallback – The callback to use when IStepIO::outComplete() is called.
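
As an illustrative sketch (the host-side buffer maps are assumptions; see IStepIO for the required semantics of each callback), the four callbacks can be supplied as C++ lambdas:

// Hypothetical host-side buffer maps populated by the application.
std::map<popart::TensorId, popart::ConstVoidData> inputBuffers;
std::map<popart::TensorId, popart::MutableVoidData> outputBuffers;

popart::StepIOCallback stepio(
    // in(): PopART requests an input buffer for tensor `id`.
    [&](popart::TensorId id, bool prefetch) -> popart::ConstVoidData {
      return inputBuffers.at(id);
    },
    // inComplete(): PopART has finished reading the input buffer.
    [&](popart::TensorId id) { /* the buffer may now be reused */ },
    // out(): PopART requests a writable output buffer for tensor `id`.
    [&](popart::TensorId id) -> popart::MutableVoidData {
      return outputBuffers.at(id);
    },
    // outComplete(): the output buffer now holds valid results.
    [&](popart::TensorId id) { /* consume the results here */ });
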
inline virtual void assertNumElements(const popx::Executablex&) const

Check number of elements.

This check is performed when IStepIO::runtimeAssertsEnabled() is true.

Parameters

Executablex – The executable whose input and output buffers are checked for the correct number of elements.

virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCallback parameter passed to the constructor.

This function should not be called directly.

virtual void inComplete(TensorId id, int64_t numElements) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCompleteCallback parameter passed to the constructor.

This function should not be called directly.

virtual MutableVoidData out(TensorId id, int64_t numElements) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCallback parameter passed to the constructor.

This function should not be called directly.

virtual void outComplete(TensorId id) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCompleteCallback parameter passed to the constructor.

This function should not be called directly.

class popart::IWeightsIO

A virtual class for accessing pointers to the data required to perform a training step.

Subclassed by popart::WeightsIO

Public Functions

virtual ~IWeightsIO() = default

Destructor for IWeightsIO.

virtual bool contains(TensorId) const = 0

Check if the WeightsIO instance contains the weights for a specific tensor.

Parameters

TensorId – The ID of the tensor to look for weights for.

Returns

true if the WeightsIO instance contains weights for the tensor, false otherwise.

virtual MutableVoidData weight(TensorId) const = 0

Retrieve weights for a specific tensor.

Parameters

TensorId – The ID of the tensor to retrieve weights for.

Returns

The weights.

class popart::WeightsIO : public popart::IWeightsIO

Class representing weights.

Public Functions

virtual ~WeightsIO() override = default

Destructor for WeightsIO.

virtual bool contains(TensorId) const final

Check if the WeightsIO instance contains the weights for a specific tensor.

Parameters

TensorId – The ID of the tensor to look for weights for.

Returns

true if the WeightsIO instance contains weights for the tensor, false otherwise.

virtual MutableVoidData weight(TensorId) const final

Retrieve weights for a specific tensor from the WeightsIO object.

Parameters

TensorId – The ID of the tensor to retrieve weights for.

Returns

The weights.

void insert(TensorId, MutableVoidData)

Insert weights for a specific tensor into the WeightsIO object.

Parameters
  • TensorId – The ID of the tensor to insert weights for.

  • MutableVoidData – The weights to insert.
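
As an illustrative sketch (the weight ID "w0", its shape and the prepared session are assumptions), WeightsIO is typically used to read trained weights back to the host:

// Assumed: a weight tensor "w0" of shape {100} and a prepared `session`.
popart::TensorInfo info(popart::DataType::FLOAT, {100});
std::vector<float> hostWeights(info.nelms());

popart::MutableVoidData data;
data.data = hostWeights.data();
data.info = info;

popart::WeightsIO weightsIo;
weightsIo.insert("w0", data);

session.weightsToHost();        // copy weights from the device
session.readWeights(weightsIo); // fill the buffers registered above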


#include <popart/stepio_generic.hpp>
template<typename ARRAY_TYPE, typename ACCESSOR_TYPE, typename ArrayInfoT>
class popart::StepIOGeneric : public popart::IStepIO

Subclassed by popart::StepIO

Public Functions

inline void assertNumElements(const popx::Executablex &exe) const final
inline TensorInfo getTensorInfo(ARRAY_TYPE &array) const
template<typename T>
inline T get(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, bool advance_, std::string mapName)
template<typename T>
inline void advance(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, std::string mapName)
inline ConstVoidData in(TensorId id, int64_t numElements, bool) final
inline void inComplete(TensorId id, int64_t numElements) final
inline MutableVoidData out(TensorId id, int64_t numElements) final


#include <popart/iarray.hpp>
class popart::IArray

Subclassed by popart::NDArrayWrapper< T >

Public Functions

inline virtual ~IArray()
virtual void *data() = 0
virtual DataType dataType() const = 0
virtual std::size_t rank() const = 0
virtual int64_t dim(size_t index) const = 0
virtual std::size_t nelms() const = 0
virtual const Shape shape() const = 0

2.3. Tensors

#include <popart/tensor.hpp>
class popart::Tensor : public popart::Vertex

Public Functions

Tensor(TensorId, TensorType, Graph&, const DebugContext& = {})
Tensor(TensorId, VariableSettings, Graph&, const DebugContext& = {})
Tensor(TensorId, TensorType, VariableSettings, Graph&, const DebugContext& = {})
inline std::string str() const final
virtual std::unique_ptr<Tensor> clone(Graph &graph_) const
TensorType tensorType() const
std::string tensor_type() const
void setTensorType(TensorType)
inline ReplicatedStreamMode getReplicatedStreamMode() const
inline void setReplicatedStreamMode(const ReplicatedStreamMode &mode)
void setTensorLocationInfo(TensorLocation&, std::pair<RemoteBufferId, RemoteBufferIndex> &remoteBufferInfo)
std::set<PipelineStage> getPipelineStages() const
Op *getProducerUnsafe() const
Op *getProducer() const
void setProducer(Op*)
void resetProducer(Op*)
bool hasProducer() const
bool isGraphInput() const
InIndex getGraphInputIndex() const
bool isGraphOutput() const
OutIndex getGraphOutputIndex() const
bool isLoopInput() const
bool isImplicitLoopInput() const
bool isExplicitLoopInput() const
bool isLoopTripCounter() const
bool isUnmodifiable() const
bool isCheckpointTensor() const
bool isImplicitRecomputeTensor() const
bool isRestoreInplaceTensor() const
bool idIncludesPrefix(const std::vector<std::string>&) const
bool isOptimizerTensor() const
bool isRemoteArgTensor() const
bool isRandomSeedTensor() const
bool isOptimizerStateTensor() const
bool isAccumulatorTensor() const
bool isHostLoadTensor() const

Is this tensor produced by a HostLoad Op or MultiExchangeOp with HostLoad descriptor?

Returns

true if the producer is a HostLoad Op or a MultiExchangeOp with a HostLoad descriptor, false otherwise.

bool isWeightTensor() const
bool isAnchored() const
bool isRootAnchor() const
bool hasTensorData() const
TensorData *tensorData()
const TensorData *tensorData() const
bool anyAlias(std::function<bool(Tensor*)> predicate) const
template<typename ...Args>
inline void setTensorData(Args&&... args)
std::vector<Op*> associatedOps() const
inline Graph &getGraph()
inline const Graph &getGraph() const
Ir &getIr()
const Ir &getIr() const
bool hasVirtualGraphId() const
VGraphId getVirtualGraphId() const
VGraphId getVirtualGraphIdUnsafe() const
VGraphIdAndTileSet getVirtualGraphIdAndTileSet(std::set<OpId> &visited) const
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe() const
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe(std::set<OpId> &visited) const
int getBatchAxis() const
bool consumersAllPreLoss() const
bool isModified(bool considerLoopInput = true) const

Check if any of the consumers modify this tensor.

Parameters

considerLoopInput – If explicit loop inputs should be considered as being modified. If false, only operations modifying the tensor inplace will be considered.

Returns

True if the tensor is modified, otherwise false.

bool isAliased() const

Check if any of the consumers alias this tensor.

Returns

True if the tensor is aliased to any output, otherwise false.

view::Regions modifiedRegionsByOps(std::vector<Op*> ops, Aliases &aliases) const
view::Regions modifiedRegionsByOps(std::vector<OpId> opIds, Aliases &aliases) const
std::set<Op*, POpCmp> getInplaceModifiers() const

Find operations that modify a tensor.

Returns

All operations that (directly and indirectly) modify this tensor.

std::vector<char> getDataViaGraphTraversal() const
inline const popart::DebugInfo &getDebugInfo() const
inline void setVariableUpdateType(VariableUpdateType type)

Members of the old subclass VariableTensor (formerly class VariableTensor : public Tensor).

inline VariableUpdateType getVariableUpdateType() const
inline void setCopyFromTensor(TensorId value)
inline TensorId getCopyFromTensor()
inline VariableSettings getVariableSettings() const
Returns

The VariableSettings of this Variable.

std::vector<int64_t> returnedShape(unsigned replicationFactor)

Returns the shape necessitated by IO.

Parameters

replicationFactor – The replication factor

Returns

The shape of the tensor, considering replica groups.

void verifyMutableVoidInfo(const TensorInfo mutableVoidInfo, unsigned replicationFactor)

Check that the info of a mutableVoidData object matches the expectations set by the TensorInfo and VariableSettings.

Throws an error if there is a mismatch.

Parameters
  • mutableVoidInfo – The TensorInfo of the MutableVoidData object with the same ID as this tensor.

  • replicationFactor – The replication factor of this instance.

Public Members

TensorId id
Consumers consumers
TensorInfo info
TensorLocationInfo tensorLocationInfo
InputSettings inputSettings
enum popart::TensorType

Values:

enumerator ActGrad = 0
enumerator Const
enumerator Stream
enumerator Unknown
enumerator Variable
enumerator N
enum popart::VariableUpdateType

Values:

enumerator None = 0
enumerator Gradient
enumerator Copy
#include <popart/tensorinfo.hpp>
enum popart::DataType

There is a one-to-one correspondence between popart::DataTypes and ONNX_NAMESPACE::TensorProto_DataTypes, which is equivalent to decltype(ONNX_NAMESPACE::TensorProto().data_type()).

Values:

enumerator UINT8 = 0
enumerator INT8
enumerator UINT16
enumerator INT16
enumerator INT32
enumerator INT64
enumerator UINT32
enumerator UINT64
enumerator BOOL
enumerator FLOAT
enumerator FLOAT16
enumerator BFLOAT16
enumerator DOUBLE
enumerator COMPLEX64
enumerator COMPLEX128
enumerator STRING
enumerator UNDEFINED
class popart::DataTypeInfo

Public Functions

DataTypeInfo(DataType type__, int nbytes__, bool isFixedPoint__, std::string name__, std::string lcasename__)
DataType type() const
const int &nbytes() const
const std::string &name() const
const std::string &lcasename() const
bool isFixedPoint() const
class popart::TensorInfo

Public Functions

TensorInfo(DataType, const Shape&)

Create TensorInformation based on data type and shape.

Parameters
  • data_type – The data type.

  • shape – The actual shape of the tensor.
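
For example, a minimal sketch of constructing and querying a TensorInfo:

popart::TensorInfo info(popart::DataType::FLOAT, {4, 8});
// info.rank()   == 2
// info.nelms()  == 32
// info.nbytes() == 128 (32 elements * 4 bytes per FLOAT element)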

TensorInfo(DataType data_type, const Shape &shape, const Shape &meta_shape)

Create TensorInformation based on data type, shape and meta shape.

Parameters
  • data_type – The data type.

  • shape – The actual shape of the tensor.

  • meta_shape – The meta shape of the tensor, which can for example be used to store the original tensor shape before replicated tensor sharding was applied.

TensorInfo(std::string data_type, std::string shape)
TensorInfo(std::string data_type, const Shape&)
explicit TensorInfo(const ONNX_NAMESPACE::TensorProto&)
explicit TensorInfo(const ONNX_NAMESPACE::TypeProto&)
void set(const ONNX_NAMESPACE::TensorProto&)
void set(const ONNX_NAMESPACE::TypeProto&)
TensorInfo() = default
void set(DataType)
void set(DataType, const Shape&)
void set(DataType, const Shape&, const Shape&)
const Shape &shape() const
const Shape &metaShape() const
std::vector<size_t> shape_szt() const
inline Rank rank() const
inline int64_t nelms() const
int64_t nbytes() const
inline int64_t dim(int i) const
inline std::vector<int> strides(const std::vector<long> &shape)

Get the strides of the tensor, that is the number of bytes to step in each dimension when traversing an array in memory.

See https://numpy.org/doc/stable/reference/generated/numpy.ndarray.strides.html

Parameters

shape – The on-host ONNX shape of a tensor. This is different from this->shape(), which gives the on-replica shape of a tensor.

Returns

std::vector<int> The strides vector.
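
For instance, assuming a contiguous row-major layout, a FLOAT tensor of shape {2, 3} would be expected to yield byte strides {12, 4}:

popart::TensorInfo info(popart::DataType::FLOAT, {2, 3});
// 3 * 4 = 12 bytes to advance one row; 4 bytes to advance one column.
std::vector<int> byteStrides = info.strides({2, 3}); // expected: {12, 4}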

DataType dataType() const
const std::string &data_type() const
const std::string &data_type_lcase() const
void append(std::ostream&) const
bool isSet() const
bool operator==(const TensorInfo&) const
bool operator!=(const TensorInfo&) const
Shape shapeFromString(const std::string &s) const
ONNX_NAMESPACE::TypeProto getOnnxTypeProto() const
const DataTypeInfo *getDataTypeInfo() const

Public Static Functions

static std::string npOutDataTypeExceptionMessage(const TensorInfo &i0, const TensorInfo &i1, const std::string &debugName)
#include <popart/tensorindex.hpp>
class popart::TensorIndexMap

Public Functions

TensorIndexMap() = default
~TensorIndexMap()
void insert(int, Tensor*)
void reset(int, Tensor*)
void erase(int)
void clear()
bool contains(Tensor*) const
Tensor *tensor(int)
const Tensor *tensor(int) const
TensorId id(int) const
bool hasIndex(int) const
const std::vector<int> &indices(Tensor*) const
const std::map<Tensor*, std::vector<int>, PTensorCmp> &indicesMap() const
const std::map<int, Tensor*> &tensorMap() const
const std::vector<Tensor*> tensors() const
std::map<int, TensorId> tensorIdMap() const
int n() const
void append(std::stringstream&, std::string prefix, int max_id_length) const
void setInfoIfIndex(const TensorInfo&, int index)
std::vector<TensorId> getSerialised() const
int maxIdLength() const
std::map<int, Shape> getIndexShapeMap()
int minIndex() const
int maxIndex() const
#include <popart/tensorlocation.hpp>
enum popart::ReplicatedTensorSharding

Enum type to specify whether to shard tensors over replicas.

Values:

enumerator Off = 0

Don’t shard tensors over replicas.

enumerator On = 1

Do shard tensors over replicas.

enumerator N = 2

Number of values.

class popart::TensorLocation

Class that describes the memory characteristics of one or multiple tensors.

See also: SessionOptions.

Public Functions

TensorLocation()

Equivalent to calling TensorLocation(TensorStorage::Undefined, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)

TensorLocation(TensorStorage storage)

Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)

TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding)

Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding)

TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)

Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding, shardingDomain)

TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding)

Construct a TensorLocation from parameters.

Parameters
  • storage – The memory location of the tensor(s).

  • loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.

  • storageTileSet – The tiles on which the tensor(s) are stored.

  • replicatedTensorSharding – Whether to apply replicated tensor sharding.

TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)

Construct a TensorLocation from parameters.

Parameters
  • storage – The memory location of the tensor(s).

  • loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.

  • storageTileSet – The tiles on which the tensor(s) are stored.

  • replicatedTensorSharding – Whether to apply replicated tensor sharding.

  • shardingDomain – GCL communication group across which to shard the tensor. Perpendicular replicas will not shard, and reduce gradients normally (via AllReduce). Defaults to sharding across all replicas.

TensorLocation(std::vector<int64_t> serialized)
bool operator==(const TensorLocation &rhs) const
bool operator!=(const TensorLocation &rhs) const
std::vector<int64_t> serialize() const
bool isRemote() const
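
For example, a sketch of describing a tensor kept in streaming memory with replicated tensor sharding enabled:

popart::TensorLocation location(popart::TensorStorage::OffChip,
                                popart::ReplicatedTensorSharding::On);
// location.isRemote() reports whether the location refers to
// off-chip (streaming) memory.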

Public Members

TensorStorage storage

The memory location of the tensor(s).

TileSet loadTileSet

The tiles through which the tensor(s) are loaded onto the chip.

TileSet storageTileSet

The tiles on which the tensor(s) are stored.

ReplicatedTensorSharding replicatedTensorSharding

Whether to apply replicated tensor sharding (RTS) or not.

CommGroup shardingDomain

The GCL comm groups across which to shard the tensor.

enum popart::TensorStorage

Enum type that determines where a tensor is stored.

Values:

enumerator OnChip = 0

Store the tensor in on-chip memory.

enumerator OffChip = 1

Store the tensor in streaming memory.

enumerator N = 2

Number of values.

enum popart::TileSet

Enum type to specify a set of tiles.

Values:

enumerator Compute = 0

The set of tiles designated for compute operations.

enumerator IO = 1

The set of tiles designated for IO operations.

enumerator Undefined = 2

Undefined (no) tile set.

enumerator N = 3

Number of values.

2.4. Optimizers

#include <popart/optimizer.hpp>
class popart::Optimizer

Interface for describing an Optimizer and, internally, how to grow the optimiser step for each weight.

  • The end-user facing interface constructed by the user to describe what kind of optimiser to use.

  • Then also used internally by the Ir to grow the optimiser step for each weight.

  • Stores OptimizerValues for optimizer parameters like learning rate, loss scaling, etc.

    See also

OptimizerValue.

  • Optimizer stores the values for each weight - they can have different values. There is a “default” for all weights, then you can specify specific values for specific weights. This is encapsulated by an OptimizerValueMap, which is a sparse map from weight to value, with unspecified values implying the default.

    See also

    OptimizerValueMap.

  • At runtime, the user can dynamically update the Optimizer, e.g. by setting new OptimizerValues. validReplacement determines whether the new Optimizer is interchangeable with the one the Ir was built for. For example, trying to replace an SGD Optimizer with an Adam Optimizer would throw.

Subclassed by popart::Adam, popart::Adaptive, popart::SGD

Public Functions

virtual ~Optimizer() = default

  • The Optimizer class has a two-part initialisation: the constructor, used by the end-user, and setFactorsFromOptions(), called by the Ir to finish initialisation once all the relevant information is available during Ir preparation.

  • Some key methods used by the Ir to grow the optimiser step for each weight are createOp, getInputIds and optimizerInputs.

  • If the OptimizerValue is const, no Ir tensor for that value is created and the VarUpdateOp created for that weight will not have the optional input for that tensor. The Opx of the VarUpdateOp will emit poplar code that uses the provided value directly.

If the OptimizerValue is not const, an Ir tensor for that value is created and the VarUpdateOp created for that weight will have the optional input for that tensor. The tensor will be a stream tensor, so that it can be updated later from the host. The tensor will be streamed the OptimizerValue’s value as its initial value.

  • It is common for Optimizer implementations to make use of “compound scalars”. Take for example the SGD0 weight update equation:

    w <- w * (1 - lr * (1 - dm) * wd) - g * (lr * (1 - dm) / ls)

    Here w is the weights and g is the grads. lr, dm, wd and ls are the “atomic scalars”: the scalars/hyperparameters of the Optimizer that the user can set using OptimizerValues, as described above.

    Multiple atomic scalars appear in expressions together, and will be operated on together before being used by an Op that also consumes a tensor (in this case the weights or grads). For SGD0, they can be grouped as follows:

    w <- w * {1 -  lr * (1 - dm) * wd} -  g * { lr * (1 - dm) / ls }
             ^^^^^^^^^^^^^^^^^^^^^^^^^        ~~~~~~~~~~~~~~~~~~~~~~
                        |                               |
       weight decay scale factor 0                      |
                                               scaled learning rate 0

    We call wdsf0 and slr0 the “compound scalars”.

    We can statically precompute the OptimizerValues for these compound scalars using the OptimizerValues of the atomic scalars. This makes the Ir simpler, as we now have only:

    w <- w * wdsf0 - g * slr0
    

    The CompoundScalarHelpers are used to precompute the compound scalar values.

    If any of the composite atomic scalars are non-const, the compound scalar is non-const.

    See also

    compoundscalarhelper.hpp

Optimizer(OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings, const DebugContext &debugContext)
Optimizer(const Optimizer&) = default
virtual void validReplacement(const Optimizer &other) const
virtual OptimizerType type() const = 0
virtual std::string type_s() const = 0
virtual std::unique_ptr<Optimizer> clone() const = 0
virtual void resetTensorData(Tensor&) const = 0
virtual void setTensorData(Tensor&) const = 0
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const = 0
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const = 0

Returns the TensorIds of the input tensors to the VarUpdateOp this optimiser will create for the given weight.

Specifically, the TensorId at index i will be the ID of the input tensor at InIndex i of the VarUpdateOp. If the input is an OptimizerValue and it is const, then “” will be returned; otherwise the relevant reserved prefix for that OptimizerValue will be used, followed by the weight ID. The prefixes are defined in tensornames.hpp, for example reservedDefaultWeightDecayScaleFactor0Prefix or reservedSpecificScaledLearningRate1Prefix (note there are different prefixes depending on whether the weight has a specific or default value for that OptimizerValue).

virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const = 0
inline const OptimizerValue &lossScaling() const
inline float getLossScalingVal() const
float getFinalLossScalingVal() const
virtual TensorId getInverseLossScalingTensorId(const Tensor &weight) const = 0
virtual void setFactorsFromOptions(const SessionOptions&)
bool gradientAccumulationEnabled() const
bool meanReductionEnabled() const
bool postMeanAccumulationEnabled() const
bool postMeanReplicationEnabled() const
int64_t getReplicatedGraphCount() const
int64_t getAccumulationFactor() const
bool meanGradientAccumulationEnabled() const
inline const std::vector<ClipNormSettings> &getClipNormSettings() const
virtual bool hasSpecific(const Tensor &w) const = 0
virtual bool hasSpecific() const = 0
virtual size_t hash() const
inline DebugContext getDebugContext() const

Public Static Functions

static TensorId getLossScalingTensorId(DataType)
enum popart::OptimizerType

Types of optimizers.

Values:

enumerator SGD = 0
enumerator Adam
enumerator Adaptive
enumerator NTYPES
enum popart::OptimizerReductionType

Reduction mode when doing data-parallel training over replicated graphs.

Depending on the optimizer used and its configuration, this option describes how the reduction of gradients over replicas will occur. For example, directly on the gradient, on the gradient accumulator, or on the momentum. See the documentation of individual optimizers for more information.

Values:

enumerator None = 0

No replicated graph reduction.

enumerator GradReduce

Gradient reduction (every iteration, after a weight’s gradient is produced)

enumerator AcclReduce

Momentum reduction (SGD1, after the gradient accumulation loop, if applicable)

enumerator AccumReduce

Accumulator reduction (Adam/SGD2 + gradient accumulation, after the gradient accumulation loop)

enum popart::WeightDecayMode

Values:

enumerator Decay

Weight decay (e.g. AdamW)

enumerator L2Regularization

L2 regularization (e.g. PyTorch-like Adam)

#include <popart/optimizervalue.hpp>
class popart::OptimizerValue

A class used to represent values of hyper parameters.

Public Functions

OptimizerValue() = default

Equivalent to OptimizerValue(0, false).

inline OptimizerValue(float v)

Equivalent to OptimizerValue(v, true).

inline OptimizerValue(float v, bool c)

Constructor.

Parameters
  • v – The current value of the hyper parameter.

  • c – A boolean flag to indicate whether the parameter will remain at this value forever (true) or may change over time (false).

inline OptimizerValue(std::pair<float, bool> x)
inline float val() const
inline bool isConst() const
void validReplacement(const OptimizerValue &rhs) const
bool operator==(const OptimizerValue &rhs) const
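
For example, the two common ways of constructing an OptimizerValue:

// Constant hyper parameter: fixed at compile time, cannot change later.
popart::OptimizerValue constLr(0.1f);          // isConst() == true
// Mutable hyper parameter: may be updated from the host between runs.
popart::OptimizerValue mutableLr(0.1f, false); // isConst() == false
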
#include <popart/optimizervaluemap.hpp>
class popart::OptimizerValueMap

Public Functions

inline OptimizerValueMap(OptimizerValue g)
OptimizerValue get(const TensorId &id) const
void insertSpecific(const TensorId&, OptimizerValue)
inline bool hasSpecific(const TensorId &id) const
inline bool hasSpecific() const
inline OptimizerValue getDefault() const
void validReplacement(const OptimizerValueMap &rhs) const
inline const std::map<TensorId, OptimizerValue> &getSpecifics() const
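
A minimal sketch of how the default and weight-specific values interact (the weight ID "w0" is an assumption):

// Default value used for any weight without a specific entry.
popart::OptimizerValueMap lrs(popart::OptimizerValue(0.1f, false));
lrs.insertSpecific("w0", popart::OptimizerValue(0.01f, true));
// lrs.get("w0").val() == 0.01f; any other TensorId falls back to 0.1f.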

2.4.1. Stochastic Gradient Descent (SGD)

#include <popart/clipnormsettings.hpp>
class popart::ClipNormSettings

A data structure used to represent a maximum value constraint on one or more weights.

This is passed to the optimizer on construction.

Public Types

enum Mode

Values:

enumerator ClipSpecifiedWeights
enumerator ClipAllWeights

Public Functions

ClipNormSettings(const std::vector<TensorId> &weightIds_, float maxNorm_)

DEPRECATED: This will be removed in a future release.

Constructor.

Parameters
  • weightIds_ – The weight tensor IDs that this constraint applies to.

  • maxNorm_ – The maximum permissible value.

const std::vector<TensorId> &getWeightIds() const
float getMaxNorm() const
Mode getMode() const
bool operator==(const ClipNormSettings&) const
bool operator!=(const ClipNormSettings &other) const

Public Members

std::vector<TensorId> weightIds
float maxNorm

Public Static Functions

static ClipNormSettings clipWeights(const std::vector<TensorId> &weightIds_, float maxNorm_)
static ClipNormSettings clipAllWeights(float maxNorm_)
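
For example, a sketch of clipping the norm of all weights and passing the setting to an optimizer:

// Clip the L2 norm of all weights at 10.0.
auto clip = popart::ClipNormSettings::clipAllWeights(10.0f);
popart::SGD optimizer({{"defaultLearningRate", {0.1f, false}}}, {clip});
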
#include <popart/sgd.hpp>
class popart::SGD : public popart::Optimizer

Stochastic Gradient Descent (SGD) optimizer.

Akin to any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.

The SGD optimizer has the following state for each weight:

  • velocity ( \(v\))

The SGD optimizer has the following hyper parameters:

  • learning rate ( \(\text{lr}\))

  • momentum ( \(\text{mm}\))

  • weight decay ( \(\text{wd}\))

  • dampening ( \(\text{dm}\))

  • velocity scaling ( \(\text{vs}\))

  • loss scaling ( \(\text{ls}\))

  • clip norm settings

The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see SGD::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.

In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description, the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.

When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first updates the optimizer state as follows:

\[ v' := v * \text{mm} + (1 - \text{dm}) * (g + \text{wd} * w) \text{ \ . } \]

Following the update of the optimizer state the optimizer uses said state to update the weight:

\[ w' := w - \text{lr} * v' \text{ \ . } \]

In addition to the above, the velocity scaling hyper parameter is a scaling factor that can provide improved numerical stability by ensuring the values stored in the optimizer state, \(v\), are scaled by this value. When using this parameter, PopART will automatically deal with the artificially scaled velocity value during the weight update, and other hyper parameters do not need to be adjusted.

In addition, the loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass and, at the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight with the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability in some cases.

Finally, it is possible to add clip norm settings for this optimizer. These clip norms compute the L2 norm for a group of weights and add a scalar term to the weight update that effectively divides it by the norm (or a constant value that is provided as part of the clip norm, whichever is greater).

See the SGD notes in optimizer.hpp for a more detailed and comprehensive derivation of the SGD optimizer step in PopART.

Subclassed by popart::ConstSGD

Public Functions

SGD(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultMomentum, OptimizerValue defaultDampening, OptimizerValue defaultVelocityScaling, OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})

Constructor.

See also

SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.

Parameters
  • defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultDampening – The dampening value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultVelocityScaling – The velocity scaling value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • lossScaling – The loss scaling value to use.

  • clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

  • sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.

  • accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

  • accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

  • debugContext – Optional debug context.

SGD(const std::map<std::string, std::pair<float, bool>> &params, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})

Constructor.

EXAMPLE:

SGD({{"defaultLearningRate", {0.02, false}},
    {"defaultMomentum", {0.6, true}}});
This will create an SGD Optimizer which has a constant momentum of 0.6 and a changeable learning rate initially of 0.02. All OptimizerValues not present in the map will take values from the getUnset* functions.

See also

SGDAccumulatorAndMomentum. Defaults to SGDAccumulatorAndMomentum::Combined.

Parameters
  • params – A parameter map where the keys are one or more of "defaultLearningRate", "defaultWeightDecay", "defaultMomentum", "defaultDampening", "defaultVelocityScaling" or "lossScaling". The map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter because default values will be used where parameters are missing.

  • clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

  • sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.

  • accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

  • accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

  • debugContext – Optional debug context.

inline SGD()

Default constructor. Creates SGD with default scalars (equivalent to the getUnset<scalar>() methods) and the other default parameters of the main constructor.

SGD(const SGD&) = default

Copy constructor.

~SGD() = default
inline virtual OptimizerType type() const final
inline virtual std::string type_s() const final
inline SGDAccumulatorAndMomentum getSGDAccumulatorAndMomentum() const
virtual std::unique_ptr<Optimizer> clone() const final
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const final

Returns the VarUpdateOp for the given weight.

If there is no gradient accumulation or momentum, this will be an SGD0VarUpdateOp. Otherwise, if getSGDAccumulatorAndMomentum() == Combined, this will be an SGD1ComboOp; if getSGDAccumulatorAndMomentum() == Separate, an SGD2ComboOp.

The required compound scalar OptimizerValues for the VarUpdateOp will be computed and passed to the Op. See the SGD notes above this class for how they are derived. Recall that if non-const, the VarUpdateOp will take an input Tensor for the compound scalar.

See also

Optimizer::createOp

The OptimizerReductionType of the Op is derived as follows:

  • No replication => None

  • Replication, no gradient accumulation => GradReduce

  • Replication, gradient accumulation, SGD1 => AcclReduce

  • Replication, gradient accumulation, SGD2 => AccumReduce

See the SGD notes above this class for why this is.

If SGD2, the DataType of the accum and accl1 tensors passed to the SGD2ComboOp will be as set in the SGD constructor. Recall DataType::UNDEFINED means use the same as the weight.

An SGD1ComboOp will later be decomposed by the SGD1Decompose pattern into a series of Ops and Tensors that implement the SGD1 optimiser step.

An SGD2ComboOp will later be decomposed by the SGD2Decompose pattern into a series of Ops and Tensors that implement the SGD2 optimiser step.

See also

SGD1Decompose

See also

SGD2Decompose

virtual std::vector<TensorId> getInputIds(const Tensor &weight) const final

virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final

smm1 and wdsf0 have the same data type as the weight.

virtual void validReplacement(const Optimizer &other) const final
virtual void resetTensorData(Tensor&) const final
virtual void setTensorData(Tensor&) const final
float getStoredValue(const TensorId &optId) const

Tensor “opt” has an ID, based on which it matches a compound scalar which this object can compute from the atomic scalars.

void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue momentum, OptimizerValue dampening, OptimizerValue velocityScaling)

Insert a weight-specific set of hyper parameters.

Parameters
  • weight – The TensorId of the weight.

  • learningRate – The learning rate value to use for this specific weight.

  • weightDecay – The weight decay value to use for this specific weight.

  • momentum – The momentum value to use for this specific weight.

  • dampening – The dampening value to use for this specific weight.

  • velocityScaling – The velocity scaling value to use for this specific weight.

void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)

Insert a weight-specific set of hyper parameters.

Parameters
  • weight – The TensorId of the weight.

  • params – A parameter map where keys are one of "learningRate", "weightDecay", "momentum", "dampening", or "velocityScaling" and the map’s values pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.
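
A sketch of overriding hyper parameters for a single weight (the weight ID "w0" is an assumption):

popart::SGD optimizer({{"defaultLearningRate", {0.1f, false}}});
// Use a smaller, mutable learning rate and a constant momentum for "w0".
optimizer.insertSpecific("w0", {{"learningRate", {0.02f, false}},
                                {"momentum", {0.9f, true}}});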

virtual bool hasSpecific(const Tensor &w) const final
virtual bool hasSpecific() const final
virtual TensorId getInverseLossScalingTensorId(const Tensor &weight) const
inline const OptimizerValueMap &learningRates() const
inline const OptimizerValueMap &weightDecays() const
inline const OptimizerValueMap &momentums() const
inline const OptimizerValueMap &dampenings() const
inline const OptimizerValueMap &velocityScalings() const
virtual size_t hash() const

Public Static Functions

static inline OptimizerValue getUnsetLearningRate()

Default learning rate value.

static inline OptimizerValue getUnsetWeightDecay()

Default weight decay value.

static inline OptimizerValue getUnsetMomentum()

Default momentum value.

static inline OptimizerValue getUnsetDampening()

Default dampening value.

static inline OptimizerValue getUnsetVelocityScaling()

Default velocity scaling value.

static inline OptimizerValue getUnsetLossScaling()

Default loss scaling value.

static SGD fromDefaultMap(const std::map<std::string, OptimizerValue>&, const DebugContext &debugContext = {})
class popart::ConstSGD : public popart::SGD

Stochastic Gradient Descent (SGD) optimizer with constant learning rate, weight decay, loss scaling and clip norm settings (and default values for momentum, dampening or velocity scaling).

NOTE: See SGD for the detailed meaning of these parameters.

NOTE: This class exists for backwards compatibility with the Python API and may be removed at some point in the future.

Public Functions

inline ConstSGD(float learningRate, float weightDecay = 0, float lossScaling = 1, const std::vector<ClipNormSettings> &clipNormSettings = {})

Constructor.

Parameters
  • learningRate – A constant learning rate.

  • weightDecay – A constant weight decay value.

  • lossScaling – A constant loss scaling value.

  • clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).
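
For example, a minimal sketch:

// Constant learning rate of 0.01; weight decay and loss scaling
// take their default values (0 and 1 respectively).
popart::ConstSGD optimizer(0.01f);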

enum popart::SGDAccumulatorAndMomentum

Strategy for implementing SGD with momentum and/or gradient accumulation.

Values:

enumerator Combined = 0

Implement SGD using a single tensor for the gradient accumulator (accum) and momentum (accl) tensors.

enumerator Separate

Implement SGD using separate tensors for the gradient accumulator (accum) and momentum (accl) tensors.

2.4.2. Adam, AdaMax & Lamb

#include <popart/adam.hpp>
enum popart::AdamMode

Enum type describing the mode of an Adam optimizer instance.

Values:

enumerator Adam = 0

Adam or AdamW mode, depending on weight decay setting (see Kingma & Ba, 2015 and Loshchilov & Hutter, 2018).

enumerator AdamNoBias

Like Adam but without bias correction.

enumerator AdaMax

Adamax mode.

enumerator Lamb

Lamb mode (see You et al., 2020).

enumerator LambNoBias

Like Lamb but without bias correction.

class popart::Adam : public popart::Optimizer

AdamW, Lamb and AdaMax optimizer implementation.

Akin to any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.

The optimizer has the following state for each weight:

  • first-order momentum ( \(m\))

  • second-order momentum ( \(v\))

  • time step ( \(t\))

The optimizer has the following hyper parameters:

  • learning rate ( \(\text{lr}\))

  • weight decay ( \(\text{wd}\))

  • beta1 ( \(\beta_1\))

  • beta2 ( \(\beta_2\))

  • epsilon ( \(\epsilon\))

  • loss scaling ( \(\text{ls}\))

  • maximum weight norm ( \(\text{mwn}\))

The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see Adam::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.

The values of AdamMode and WeightDecayMode passed to the constructor determine how weights are updated (see below).

In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description, the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.

When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first computes a term \(g_\text{tmp}\), which is effectively \(g\) with L2 regularization applied if the WeightDecayMode is set to WeightDecayMode::L2Regularization, as follows:

\[\begin{split} g_\text{tmp} := \left\{\begin{aligned} g & \text{ \; (Decay) } \\ (g + \text{wd} * w) & \text{ \; (L2Regularization) \; . } \\ \end{aligned}\right.\\ \end{split}\]

Secondly, the optimizer updates the optimizer state as follows:

\[\begin{split} m' &:= \beta_1 * m + (1 - \beta_1) * g_\text{tmp} \\ v' &:= \left\{\begin{aligned} \beta_2 * v + (1 - \beta_2) * g_\text{tmp}^2 & \text{ \; (Adam/AdamNoBias) } \\ \beta_2 * v + (1 - \beta_2) * g_\text{tmp}^2 & \text{ \; (Lamb/LambNoBias) } \\ \text{max}(\beta_2 * v, |g_\text{tmp}|) & \text{ \; (AdaMax) } \\ \end{aligned}\right.\\ t' &:= t + 1 \\ \end{split}\]

Next, it computes the following terms:

\[\begin{split} m_\text{tmp} &:= \left\{\begin{aligned} m' & \text{ \; (AdamNoBias/LambNoBias) } \\ \frac{m'}{(1 - \beta_1^{t'})} & \text{ \; (Adam/Lamb/AdaMax) } \\ \end{aligned}\right.\\ v_\text{tmp} &:= \left\{\begin{aligned} v' & \text{ \; (AdamNoBias/LambNoBias) } \\ \frac{v'}{(1 - \beta_2^{t'})} & \text{ \; (Adam/Lamb/AdaMax) } \\ \end{aligned}\right.\\ u_\text{tmp} &:= \left\{\begin{aligned} \frac{m_\text{tmp}}{(\sqrt{v_\text{tmp}} + \epsilon)} + \text{wd} * w &\text{ \; (Decay) } \\ \frac{m_\text{tmp}}{(\sqrt{v_\text{tmp}} + \epsilon)} &\text{ \; (L2Regularization) } \\ \end{aligned}\right. \end{split}\]

Finally, the optimizer updates the weight as follows:

\[\begin{split} w' := \left\{\begin{aligned} w - \text{lr} * u_\text{tmp} &\text{ \; (Adam/AdamNoBias/AdaMax) } \\ w - \biggl(\frac{\text{min}(\lVert{w}\rVert, \text{mwn})}{\lVert{u_\text{tmp}}\rVert}\biggr) * \text{lr} * u_\text{tmp} &\text{ \; (Lamb/LambNoBias) } \\ \end{aligned}\right. \end{split}\]

In addition to the above, the loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass and, at the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight with the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability of the gradient calculations. If scaledOptimizerState is enabled then the lossScaling will not be removed before updating the optimizer state. This can improve the numerical stability when accl1_type is set to FLOAT16.

NOTE: The maximum weight norm is referred to as \(\phi\) in You et al., 2020.

Public Functions

virtual bool hasSpecific(const Tensor &w) const final
virtual bool hasSpecific() const final
virtual TensorId getInverseLossScalingTensorId(const Tensor &weight) const final
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, OptimizerValue maxWeightNorm, AdamMode adamMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})

Constructor.

Parameters
  • defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultBeta1 – The beta1 value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultBeta2 – The beta2 value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • defaultEps – The epsilon value to use for weights for which no weight-specific hyper parameter value has been inserted.

  • lossScaling – The loss scaling value to use.

  • maxWeightNorm – The maxWeightNorm value to use.

  • adamMode – The AdamMode value to use.

  • weightDecayMode – The WeightDecayMode value to use.

  • accumType – Data type to use for gradient accumulation.

  • accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.

  • accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.

  • clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

  • scaledOptimizerState – Experimental Option. Does not remove lossScaling before updating the optimizer state. This should have no effect on the update equation. However, it does ensure a more numerically stable implementation when accl1_type is set to DataType::FLOAT16. Note: When loading a model that includes initialised optimizer state, ensure that accl1 and accl2 are scaled by lossScaling and lossScaling^2 respectively.

  • debugContext – Optional debug context.

Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, AdamMode adamMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, OptimizerValue maxWeightNorm, AdamMode adamMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
Adam(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultBeta1, OptimizerValue defaultBeta2, OptimizerValue defaultEps, OptimizerValue lossScaling, AdamMode adamMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})
Adam(const std::map<std::string, std::pair<float, bool>> &params, AdamMode adamMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, const std::vector<ClipNormSettings> &clipNormSettings = {}, bool scaledOptimizerState = false, const DebugContext &debugContext = {})

Constructor.

EXAMPLE:

Adam({{"defaultLearningRate", {0.02, False}},
      {"defaultBeta1", {0.9, True}},
      {"defaultBeta2":{0.999, True}}},
      AdamMode::Adam,
      WeightDecayMode::Decay,
      DataType::FLOAT,
      DataType::FLOAT,
      DataType::FLOAT);

Parameters
  • params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultBeta1", "defaultBeta2", "defaultEps", "lossScaling" or "maxWeightNorm", and the map’s values pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.

  • adamMode – The AdamMode value to use.

  • weightDecayMode – The WeightDecayMode value to use.

  • accumType – Data type to use for gradient accumulation.

  • accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.

  • accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.

  • clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

  • scaledOptimizerState – Experimental Option. Does not remove lossScaling before updating the optimizer state. This should have no effect on the update equation. However, it does ensure a more numerically stable implementation when accl1_type is set to DataType::FLOAT16. Note: When loading a model that includes initialised optimizer state, ensure that accl1 and accl2 are scaled by lossScaling and lossScaling^2 respectively.

  • debugContext – Optional debug context.

Adam(const Adam&) = default
~Adam() = default
inline virtual OptimizerType type() const final
inline virtual std::string type_s() const final
virtual std::unique_ptr<Optimizer> clone() const final
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const final
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const final

The names of the inputs for the VarUpdateOp for the Variable Tensor “weight”.

In the returned vector, an empty string (“”) is used as a placeholder for constant inputs.

virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final

The names and infos of the optimizer tensors.

virtual void validReplacement(const Optimizer &other) const final
virtual void resetTensorData(Tensor&) const final
virtual void setTensorData(Tensor&) const final
float getStoredValue(const TensorId &optId) const

Tensor “opt” has an id, based on which it matches a compound scalar which this object can compute from the atomic scalars.

void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue beta1, OptimizerValue beta2, OptimizerValue eps, OptimizerValue mwn)

Insert a weight-specific set of hyper parameters.

Parameters
  • weight – The TensorId of the weight.

  • learningRate – The learning rate value to use for this specific weight.

  • weightDecay – The weight decay value to use for this specific weight.

  • beta1 – The beta1 value to use for this specific weight.

  • beta2 – The beta2 value to use for this specific weight.

  • eps – The epsilon value to use for this specific weight.

  • mwn – The max weight norm value to use for this specific weight.

void setStep(int64_t step)
void setStep(const TensorId&, int64_t step)
void setStep(std::map<TensorId, int64_t> steps)
void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)

Insert a weight-specific set of hyper parameters.

Parameters
  • weight – The TensorId of the weight.

  • params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultBeta1", "defaultBeta2", "defaultEps", "lossScaling" or "maxWeightNorm" and the map’s values pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.

inline const OptimizerValueMap &learningRates() const
inline const OptimizerValueMap &weightDecays() const
inline const OptimizerValueMap &beta1s() const
inline const OptimizerValueMap &beta2s() const
inline const OptimizerValueMap &epss() const
inline const OptimizerValueMap &maxWeightNorms() const
inline const WeightDecayMode &getWeightDecayMode() const
inline bool useScaledOptimizerState() const
virtual size_t hash() const final
virtual void setFactorsFromOptions(const SessionOptions&) final

Public Static Functions

static inline OptimizerValue getUnsetLearningRate()

Default learning rate value.

static inline OptimizerValue getUnsetWeightDecay()

Default weight decay value.

static inline OptimizerValue getUnsetBeta1()

Default beta1 value.

static inline OptimizerValue getUnsetBeta2()

Default beta2 value.

static inline OptimizerValue getUnsetEps()

Default epsilon value.

static inline OptimizerValue getUnsetLossScaling()

Default loss scaling value.

static inline OptimizerValue getUnsetMaxWeightNorm()

Default maximum weight norm value.

static Adam fromDefaultMap(const std::map<std::string, OptimizerValue>&, AdamMode adamMode_, WeightDecayMode decayMode_, DataType accumType_, DataType accl1Type_, DataType accl2Type_, const DebugContext &debugContext = {})

2.4.3. AdaDelta, RMSProp & AdaGrad

#include <popart/adaptive.hpp>
enum popart::AdaptiveMode

Enum class representing a type of adaptive optimizer.

Values:

enumerator AdaGrad = 0

AdaGrad optimizer.

enumerator RMSProp

RMSProp optimizer.

enumerator CenteredRMSProp

CenteredRMSProp optimizer.

enumerator AdaDelta

AdaDelta optimizer.

class popart::Adaptive : public popart::Optimizer

AdaDelta, RMSProp and AdaGrad optimizer implementation.

Akin to any optimizer implementation, this class is responsible for updating each weight tensor ( \(w\)) in the model using the gradient ( \(g\)) of the loss function with respect to the weight as calculated during the backwards pass.

The optimizer has the following state for each weight:

  • first-order momentum ( \(v_1\))

  • second-order momentum ( \(v_2\)) (only for CenteredRMSProp/AdaDelta)

  • third-order momentum ( \(v_3\))

The optimizer has the following hyper parameters:

  • learning rate ( \(\text{lr}\))

  • weight decay ( \(\text{wd}\))

  • alpha ( \(\alpha\))

  • momentum ( \(\text{m}\))

  • epsilon ( \(\epsilon\))

  • loss scaling ( \(\text{ls}\))

The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see Adaptive::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.

The values of AdaptiveMode and WeightDecayMode passed to the constructor determine how weights are updated (see below).

In the following we will describe how this optimizer updates a weight using a gradient. In the context of this description, the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.

When the optimizer needs to update a weight, \(w\), using a gradient, \(g\), it first computes a term \(g_\text{tmp}\), which is effectively \(g\) with L2 regularization applied if the WeightDecayMode is set to WeightDecayMode::L2Regularization, as follows:

\[\begin{split} g_\text{tmp} := \left\{\begin{aligned} g & \text{ \; (Decay) } \\ (g + \text{wd} * w) & \text{ \; (L2Regularization) \; . } \\ \end{aligned}\right.\\ \end{split}\]

Secondly, the optimizer updates the optimizer state \(v_1\) as follows:

\[\begin{split} v_1' &:= \left\{\begin{aligned} \alpha * v_1 + (1 - \alpha) * g_\text{tmp}^2 & \text{ \; (RMSProp/AdaDelta) } \\ \alpha * v_1 + (1 - \alpha) * g_\text{tmp}^2 & \text{ \; (CenteredRMSProp) } \\ v_1 + g_\text{tmp}^2 & \text{ \; (AdaGrad) } \\ \end{aligned}\right.\\ \end{split}\]

Next, \(v_2\) is updated, but only for CenteredRMSProp:

\[\begin{split} v_2' &:= \alpha * v_2 + (1 - \alpha) * g_\text{tmp} \text{ \; (CenteredRMSProp) } \\ \end{split}\]

Next, it computes the update term \(u_\text{tmp}\):

\[\begin{split} u_\text{tmp} &:= \left\{\begin{aligned} \frac{g_\text{tmp}}{\sqrt{v_1'} + \epsilon} & \text{ \; (AdaGrad/RMSProp) } \\ \frac{g_\text{tmp}}{\sqrt{v_1' - v_2'^2} + \epsilon} & \text{ \; (CenteredRMSProp) } \\ \frac{g_\text{tmp} * \sqrt{v_2 + \epsilon}}{\sqrt{v_1' + \epsilon}} & \text{ \; (AdaDelta) } \\ \end{aligned}\right. \end{split}\]

Next, \(v_2\) is updated, but only for AdaDelta:

\[\begin{split} v_2' := \alpha * v_2 + (1 - \alpha) * u_\text{tmp}^2 \text{ \; (AdaDelta) } \\ \end{split}\]

Next, the third-order momentum \(v_3\) is updated for all modes:

\[ v_3' := m * v_3 + u_\text{tmp} \]

Finally, the optimizer updates the weight as follows:

\[\begin{split} w' := \left\{\begin{aligned} w - \text{lr} * (v_3' + \text{wd} * w) &\text{ \; (Decay) } \\ w - \text{lr} * v_3' &\text{ \; (L2Regularization) } \\ \end{aligned}\right. \end{split}\]
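
To make the composition of these steps concrete, here is a minimal scalar sketch of one update in RMSProp mode with WeightDecayMode::Decay, following the equations above. PopART applies these updates on-device across whole tensors; the function and names here are hypothetical and for illustration only.

#include <cmath>

struct AdaptiveState {
  float v1 = 0.0f; // first-order momentum (running mean of squared gradients)
  float v3 = 0.0f; // third-order momentum (momentum accumulator)
};

// One RMSProp/Decay step for a single scalar weight w with gradient g.
void rmspropStep(float &w, float g, AdaptiveState &s, float lr, float wd,
                 float alpha, float m, float eps) {
  float gTmp = g;                                     // Decay mode: g unchanged
  s.v1 = alpha * s.v1 + (1.0f - alpha) * gTmp * gTmp; // v1' update
  float uTmp = gTmp / (std::sqrt(s.v1) + eps);        // update term
  s.v3 = m * s.v3 + uTmp;                             // v3' update
  w -= lr * (s.v3 + wd * w);                          // Decay-mode weight update
}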

In addition to the above, the loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass and, at the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight with the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability in some cases.

Public Functions

virtual bool hasSpecific(const Tensor &w) const
virtual bool hasSpecific() const
virtual TensorId getInverseLossScalingTensorId(const Tensor &weight) const
Adaptive(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultAlpha, OptimizerValue defaultMomentum, OptimizerValue defaultEps, OptimizerValue lossScaling, AdaptiveMode adaptiveMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, DataType accl3Type, bool rmspropTFVariant = false, const DebugContext &debugContext = {})

Constructor.

Parameters
  • defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameters have been inserted.

  • defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameters have been inserted.

  • defaultAlpha – The alpha value to use for weights for which no weight-specific hyper parameters have been inserted.

  • defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameters have been inserted.

  • defaultEps – The epsilon value to use for weights for which no weight-specific hyper parameters have been inserted.

  • lossScaling – The loss scaling value to use.

  • adaptiveMode – The AdaptiveMode value to use.

  • weightDecayMode – The WeightDecayMode value to use.

  • accumType – Data type to use for gradient accumulation.

  • accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.

  • accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.

  • accl3Type – Data type to use for tensor that stores third-order momentum optimizer state.

  • rmspropTFVariant – If true, use the TensorFlow variant of the RMSProp update.

  • debugContext – Optional debug context.
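
A minimal sketch of this constructor in use. The hyper parameter values are illustrative only; marking the learning rate non-constant keeps it adjustable later:

#include <popart/adaptive.hpp>

using namespace popart;

Adaptive optimizer(
    OptimizerValue(0.01f, false), // defaultLearningRate (non-const, adjustable)
    OptimizerValue(0.0f, true),   // defaultWeightDecay
    OptimizerValue(0.99f, true),  // defaultAlpha
    OptimizerValue(0.9f, true),   // defaultMomentum
    OptimizerValue(1e-6f, true),  // defaultEps
    OptimizerValue(1.0f, true),   // lossScaling
    AdaptiveMode::RMSProp,
    WeightDecayMode::Decay,
    DataType::FLOAT,  // accumType
    DataType::FLOAT,  // accl1Type
    DataType::FLOAT,  // accl2Type
    DataType::FLOAT); // accl3Type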

Adaptive(const std::map<std::string, std::pair<float, bool>> &params, AdaptiveMode adaptiveMode, WeightDecayMode weightDecayMode, DataType accumType, DataType accl1Type, DataType accl2Type, DataType accl3Type, bool rmspropTFVariant = false, const DebugContext &debugContext = {})

Constructor.

EXAMPLE:

Adaptive({{"defaultLearningRate", {0.02, false}},
          {"defaultAlpha", {0.99, true}}},
         AdaptiveMode::RMSProp,
         WeightDecayMode::Decay,
         DataType::FLOAT,
         DataType::FLOAT,
         DataType::FLOAT,
         DataType::FLOAT);

Parameters
  • params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultAlpha", "defaultMomentum", "defaultEps" or "lossScaling", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.

  • adaptiveMode – The AdaptiveMode value to use.

  • weightDecayMode – The WeightDecayMode value to use.

  • accumType – Data type to use for gradient accumulation.

  • accl1Type – Data type to use for tensor that stores first-order momentum optimizer state.

  • accl2Type – Data type to use for tensor that stores second-order momentum optimizer state.

  • accl3Type – Data type to use for tensor that stores third-order momentum optimizer state.

  • rmspropTFVariant – If true, use the TensorFlow variant of the RMSProp update.

  • debugContext – Optional debug context.

Adaptive(const Adaptive&) = default
~Adaptive() = default
inline virtual OptimizerType type() const final
inline virtual std::string type_s() const final
virtual std::unique_ptr<Optimizer> clone() const final
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const final
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const final

The names of the inputs for the VarUpdateOp for the Variable Tensor “weight”.

In the returned vector, an empty string (“”) is used as a placeholder for constant inputs.

virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final

The names and infos of the optimizer tensors.

virtual void validReplacement(const Optimizer &other) const final
virtual void resetTensorData(Tensor&) const final
virtual void setTensorData(Tensor&) const final
float getStoredValue(const TensorId &optId) const

Get the value stored for the optimizer tensor with ID optId. The ID identifies a compound scalar which this object can compute from the atomic scalars.

void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue alpha, OptimizerValue momentum, OptimizerValue eps)

Insert a weight-specific set of hyper parameters.

Parameters
  • weight – The TensorId of the weight.

  • learningRate – The learning rate value to use for this specific weight.

  • weightDecay – The weight decay value to use for this specific weight.

  • alpha – The alpha value to use for this specific weight.

  • momentum – The momentum value to use for this specific weight.

  • eps – The epsilon value to use for this specific weight.
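
For example, to give a single weight its own hyper parameters (a sketch; "w0" is a hypothetical TensorId of a weight already in the model, and optimizer is an Adaptive instance):

optimizer.insertSpecific(
    "w0",
    OptimizerValue(0.005f, false), // learningRate for w0 only
    OptimizerValue(0.0f, true),    // weightDecay
    OptimizerValue(0.95f, true),   // alpha
    OptimizerValue(0.9f, true),    // momentum
    OptimizerValue(1e-6f, true));  // eps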

void setStep(int64_t step)
void setStep(const TensorId&, int64_t step)
void setStep(std::map<TensorId, int64_t> steps)
void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)

Insert a weight-specific set of hyper parameters.

Parameters
  • weight – The TensorId of the weight.

  • params – A parameter map where keys are one of "defaultLearningRate", "defaultWeightDecay", "defaultAlpha", "defaultMomentum", "defaultEps" or "lossScaling", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter as default values will be used where parameters are missing.
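
A sketch of the map-based overload; unspecified hyper parameters keep the optimizer-wide values ("w0" is hypothetical, as above):

// Override only the learning rate for "w0"; the {value, isConst} pairs
// mirror the OptimizerValue constructor arguments.
optimizer.insertSpecific("w0", {{"defaultLearningRate", {0.005f, false}}});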

inline const OptimizerValueMap &learningRates() const
inline const OptimizerValueMap &weightDecays() const
inline const OptimizerValueMap &alphas() const
inline const OptimizerValueMap &momentums() const
inline const OptimizerValueMap &epss() const
virtual size_t hash() const

Public Static Functions

static inline OptimizerValue getUnsetLearningRate()

Default learning rate value.

static inline OptimizerValue getUnsetWeightDecay()

Default weight decay value.

static inline OptimizerValue getUnsetAlpha()

Default alpha value.

static inline OptimizerValue getUnsetMomentum()

Default momentum value.

static inline OptimizerValue getUnsetEps()

Default epsilon value.

static inline OptimizerValue getUnsetLossScaling()

Default loss scaling value.

static Adaptive fromDefaultMap(const std::map<std::string, OptimizerValue>&, AdaptiveMode adaptiveMode_, WeightDecayMode decayMode_, DataType accumType_, DataType accl1Type_, DataType accl2Type_, DataType accl3Type_, const DebugContext &debugContext = {})

2.5. Builder

#include <popart/builder.hpp>
class popart::Builder

An interface for a Builder, used for creating ONNX graphs.

ONNX defines a specification for describing graphs and serialising them as protobuf files. This class provides a builder interface for creating such a graph.

Note, in ONNX, all Ops belong to an “Opset”. The Builder itself does not have methods for creating Ops in the ONNX graph, but instead has accessors to Opsets, like AiGraphcoreOpset1, which contain the methods for creating Ops in the graph.
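
A minimal end-to-end sketch of the builder flow (shapes, names and the choice of opset 11 are illustrative, and the static Builder::create() factory is assumed): obtain an opset accessor, add an input, create an op through the accessor, and mark its result as a graph output.

#include <popart/builder.hpp>

using namespace popart;

auto builder = Builder::create();
auto aiOnnx  = builder->aiOnnxOpset11(); // ops are created via the opset

// A 2x2 float input tensor, a Relu op, and a graph output.
TensorInfo info{"FLOAT", std::vector<int64_t>{2, 2}};
TensorId x = builder->addInputTensor(info);
TensorId y = aiOnnx.relu({x});
builder->addOutputTensor(y);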

Public Functions

Builder &createSubgraphBuilder()

Create a builder for a graph which is nested inside this builder’s graph.

~Builder()

Destructor for the Builder class.

TensorId addInputTensor(const TensorInfo &tensorInfo, const popart::DebugContext &debugContext = {})

Add a new input tensor to the model.

Parameters
  • tensorInfo – The shape and data type of the input tensor.

  • debugContext – Optional debug information.

Returns

The tensor id of the input tensor.
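
Continuing the builder sketch above, a hedged example of this overload (the shape and debug name are illustrative):

// A rank-4 float input, e.g. an NCHW image batch.
TensorInfo imageInfo{"FLOAT", std::vector<int64_t>{1, 3, 224, 224}};
TensorId image = builder->addInputTensor(imageInfo, {"image"});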

TensorId addInputTensor(const std::string &dataType, const Shape &shape, const popart::DebugContext &debugContext = {})

Add a new input tensor to the model.

Parameters
  • dataType – The data type of the input tensor.

  • shape – The shape of the input tensor.

  • debugContext – Optional debug information.

Returns

The tensor id of the input tensor.
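
The same input expressed with the string-and-shape overload (again continuing the sketch):

// Data type and shape passed directly, without a TensorInfo object.
TensorId image2 = builder->addInputTensor("FLOAT", {1, 3, 224, 224});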

TensorId addInputTensor(const TensorInfo &tensorInfo, const InputSettings &settings, const popart::DebugContext &debugContext = {})

Add a new input tensor to the model.

Parameters
  • tensorInfo – The shape and data type of the input tensor.

  • settings – Settings for TileSet and ExchangeStrategy.

  • debugContext – Optional debug information.

Returns

The tensor id of the input tensor.

TensorId addInputTensor(const std::string &dataType, const Shape &shape, const InputSettings &settings, const popart::DebugContext &debugContext = {})

Add a new input tensor to the model.

Parameters
  • dataType – The data type of the input tensor.

  • shape – The shape of the input tensor.

  • settings – Settings for TileSet and ExchangeStrategy.

  • debugContext – Optional debug information.

Returns

The tensor id of the input tensor.

TensorId addUntypedInputTensor(const popart::DebugContext &debugContext = {})

Add a new input tensor without a type or shape to the model.

Parameters

debugContext – Optional debug information.

Returns

The tensor id of the input tensor.

void addInputTensorFromParentGraph(const TensorId &tensorId)

Add a new named input tensor (from the parent graph) to the model.

Parameters

tensorId – The identifier string of the input tensor. This identifier must already exist in the name scope of the parent GraphProto and must appear topologically before this sub-graph.

TensorId addInitializedInputTensor(const ConstVoidData &initData, const popart::DebugContext &debugContext = {})

Add a new pre-initialized input tensor to the model.

Parameters
  • initData – The initial data of the input tensor.

  • debugContext – Optional debug information.

Returns

The tensor id of the input tensor.
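
A hedged sketch of registering an initialized weight (shape and values illustrative; it also assumes the host buffer must remain valid until PopART has consumed it):

// Zero-initialized 8x8 float weight held in host memory.
std::vector<float> wHost(64, 0.0f);

ConstVoidData initData;
initData.data = wHost.data();
initData.info = TensorInfo{"FLOAT", std::vector<int64_t>{8, 8}};

TensorId w = builder->addInitializedInputTensor(initData);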

TensorId addInitializedInputTensor(const ConstVoidData &initData, const VariableSettings &variableSettings, const popart::DebugContext &debugContext = {})

Add a new pre-initialized input tensor to the model.

Parameters
  • initData – The initial data of the input tensor.

  • variableSettings – The settings that determine how variables are retrieved from replicas.

  • debugContext – Optional debug information.

Returns

The tensor id of the input tensor.

void addOutputTensor(const TensorId &arg0)

Add an output tensor from a node in the graph into the list of output tensors.

Parameters

arg0 – The tensor id of the output tensor to be added.

inline AiOnnxOpset6 aiOnnxOpset6()

Return the builder interface for ai.onnx opset 6.

inline AiOnnxOpset7 aiOnnxOpset7()

Return the builder interface for ai.onnx opset 7.

inline AiOnnxOpset8 aiOnnxOpset8()

Return the builder interface for ai.onnx opset 8.

inline AiOnnxOpset9 aiOnnxOpset9()

Return the builder interface for ai.onnx opset 9.

inline AiOnnxOpset10 aiOnnxOpset10()

Return the builder interface for ai.onnx opset 10.

inline AiOnnxOpset11 aiOnnxOpset11()

Return the builder interface for ai.onnx opset 11.

inline AiOnnxMlOpset1 aiOnnxMlOpset1()

Return the builder interface for ai.onnx.ml opset 1.

inline AiGraphcoreOpset1 aiGraphcoreOpset1()

Return the builder interface for ai.graphcore opset 1.

std::vector<TensorId> customOp(const OperatorIdentifier &opid, int opsetVersion, const std::vector<TensorId> &inputs, const unsigned numOutputs, const std::map<std::string, popart::any> &attributes, const DebugContext &debugContext = {})

Return the output tensors from a custom op added to the model.

Parameters
  • opid – The id of the operator.

  • opsetVersion – The version of the opset.

  • inputs – A vector of input tensor ids.

  • numOutputs – The number of output tensors.

  • attributes – The map of attributes and their values to be added.

  • debugContext – Optional debug information.

Returns

The output tensors.
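
A hedged sketch of invoking a custom op, continuing the builder sketch. The domain, op name and attribute are hypothetical, and the op must have been registered with PopART (for example via a custom-op shared library) before the graph can be compiled:

// One input, one output, and a single float attribute.
const OperatorIdentifier myOpId("com.acme.custom", "MyOp", 1);
std::map<std::string, popart::any> attrs = {{"alpha", 0.5f}};

std::vector<TensorId> outs =
    builder->customOp(myOpId, 1, {x}, 1, attrs);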

void customOp(const OperatorIdentifier &opid, int opsetVersion, const std::vector<TensorId> &inputs, const std::vector<TensorId> &outputs, const std::map<std::string, popart::any> &attributes, const DebugContext &debugContext = {})