# 13. PopART C++ API

This chapter describes the PopART C++ API.

## 13.1. Sessions

#include <popart/session.hpp>

class Session

Session is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware.

Subclassed by popart::InferenceSession, popart::TrainingSession

Public Functions

virtual ~Session() = 0

Destructor for the Session class.

std::vector<uint32_t> getRNGState()

Get state of the random number generator.

void setRNGState(const std::vector<uint32_t>)

Set state of the random number generator.

void setRandomSeed(uint64_t seedValue)

Set the value of the random number generator seed.

This method explicitly seeds all random operations. Additionally, this method derives a new state for the random number generator (RNG) from the seed and sets it on the device. This RNG state is used to resolve stochastic rounding. Note that to deterministically store and restore the combined random state for a session, do the following:

C++:

// Store random state (session s0).
auto seed = s0.getRandomSeed();
auto rngState = s0.getRNGState();

// Restore random state (session s1).
s1.setRandomSeed(seed);   // <-- affects RNG state, order important
s1.setRNGState(rngState);


Python:

# Store random state (session s0).
seed = s0.getRandomSeed()
rngState = s0.getRNGState()

# Restore random state (session s1).
s1.setRandomSeed(seed)   # <-- affects RNG state, order important
s1.setRNGState(rngState)


Parameters

seedValue – The value of the seed.
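
The reason the restore order matters can be sketched with a small stand-in class. This is only an illustration: `MockSession` is hypothetical and is not `popart::Session`, and the way it derives the RNG state from the seed is invented. It models just the documented side effect that seeding also overwrites the RNG state.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in, NOT popart::Session. It models only the documented
// behaviour that setRandomSeed() also overwrites the RNG state.
class MockSession {
public:
  void setRandomSeed(std::uint64_t seedValue) {
    seed_ = seedValue;
    // Invented derivation; the real derivation is internal to PopART.
    rngState_ = {static_cast<std::uint32_t>(seedValue),
                 static_cast<std::uint32_t>(seedValue >> 32)};
  }
  std::uint64_t getRandomSeed() const { return seed_; }
  void setRNGState(const std::vector<std::uint32_t> &state) {
    rngState_ = state;
  }
  std::vector<std::uint32_t> getRNGState() const { return rngState_; }

private:
  std::uint64_t seed_ = 0;
  std::vector<std::uint32_t> rngState_;
};

// Restore in the documented order: seed first, then RNG state, so that the
// state derived from the seed is replaced by the stored state.
void restoreRandomState(MockSession &s, std::uint64_t seed,
                        const std::vector<std::uint32_t> &rngState) {
  s.setRandomSeed(seed);
  s.setRNGState(rngState);
}
```

If the two calls were swapped, the stored RNG state would be overwritten by the state derived from the seed.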

uint64_t getRandomSeed()

Get the value of the random number generator seed.

Calling setRandomSeed() with this value (at a later stage) reinstates the random state logic that seeds random operations.

Returns

The value used to seed current random operations.

void compileAndExport(const std::string &filename)

Compile the graph and export it to a file.

This method will first create a poplar::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the file. The exported file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

Parameters

filename – The name of the file where the compiled executable and metadata will be saved.

void compileAndExport(std::ostream &out)

Compile the graph and export it to a stream.

This method will first create a poplar::Graph and compile the poplar::Executable. Next, it will export the executable and PopART metadata to the stream. The data will be streamed in the PopEF format. This means that the streamed data can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

This method automatically creates folders as needed if filename is located in a folder which does not exist.

Parameters

out – The stream that the compiled executable and metadata will be written to.

void saveExecutableToFile(const std::string &filename)

Save a compiled graph to a file.

The file will be in the PopEF format. This means that the file can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

This method automatically creates folders as needed if filename is located in a folder which does not exist.

Parameters

filename – The name of the file where the compiled executable and metadata will be saved.

Pre

prepareDevice() must have been called.

void saveExecutableToStream(std::ostream &out)

Save a compiled graph to a stream.

The data will be streamed in the PopEF format. This means that the streamed data can be used to run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information.

Parameters

out – The stream where the compiled executable and metadata will be written to.

Pre

prepareDevice() must have been called.

void saveExecutable(const std::string &path, bool savePopartMetadata = true, bool saveVariables = true)

Save a compiled graph with additional data to a file.

PopART is able to save its state after the model compilation is complete, so that it can be restored at a later time. To make this possible, it is necessary to save such elements as:

• a serialised Poplar executable,

• tensor data blobs if model parameters have not been frozen (refer to the SessionOptions::constantWeights for more information),

• a PopART-specific opaque blob to store information only relevant to PopART. This is needed to restore PopART state.

The file will be in the PopEF format. This means that the file can be used to restore the state of the PopART program without recompiling the graph, or run inference using the Triton Inference Server with the Graphcore Triton backend. See the Poplar Triton Backend User Guide for more information. To analyze the structure of the saved file, refer to the PopEF dump tool.

Parameters
• path – The name of the file or directory where the compiled executable, metadata and variables will be saved. If you specify a path to a directory, the function will write the data to the file “<path>/executable.popef”. If the file exists, the function will overwrite the existing data.

• savePopartMetadata – If you do not need the option to restore the PopART state later, you can set the flag to false to reduce disk space taken up by the file.

• saveVariables – Set this flag to false if you do not need to save the state of the variables (tensors) now, for example because you want to save them later or in a different location. The function will save data consistent with the variables contained within the model.

Pre

prepareDevice() must have been called.

void saveVariables(const std::string &path)

Save all variables to a file.

The function will save data consistent with the variables contained within the model.

The file will be in the PopEF format. If you want to analyze tensors saved by the function refer to the PopEF dump tool.

Parameters

path – The name of the file or directory where the compiled variables will be saved. If you specify a path to a directory, the function will write the data to the file “<path>/variables.popef”. If the file exists, the function will overwrite the existing data.

Pre

prepareDevice() must have been called.

void checkInplacingAmbiguity() const

Check for potential inplacing ambiguities.

This method creates an AliasModel object for each graph and runs the Poprithms ambiguity checker on it.

Throws an error if the graph has an inplacing ambiguity and will prompt the user to check the inplacing.

See poprithms::memory::inplace::Graph::AmbiguityStatus on the Poprithms GitHub repo for more on what constitutes an ambiguity.

The file must have been created with compileAndExport(const std::string).

Parameters

filename – The name of the file to load the executable and metadata from.

Load the compiled executable and metadata from a stream.

The stream must have been created with compileAndExport(std::ostream).

Parameters

in – The shared pointer to the stream to load the executable from.

Prepare the network for execution.

This will create the poplar::Graph and poplar::Engine.

Parameters

loadEngine – If true, load the engine and connect the streams once the device is ready.

Load the engine on the device and connect the streams.

This will set up the poplar::Streams.

Note: This call is optional. The engine will implicitly be loaded on the device when required.

void weightsFromHost()

Copy weights from the host to the device.

void weightsToHost()

Copy the weights from the device to the host stream memory.

uint64_t getCycleCount(std::string id = "")

Copy the cycle count tensor from the device to the host.

Parameters

id – The identifier of the cycle count tensor.

void connectStreamToCallback(const std::string &streamHandle, std::function<void(void*)> callback, unsigned index = 0)

Connect a Poplar stream with a callback.

The callback will be called whenever the stream will be read or was written to by the device. The memory location passed to the callback will only be valid for reading or writing for the duration of the callback.

Parameters
• streamHandle – The name of the stream to connect to.

• callback – The callback to be called whenever the stream is to be read or was written to by the device.

• index – The replica index to connect to, when using replicated graphs. Default=0.
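
Because the memory location is only valid for the duration of the callback, a callback would typically copy the data out rather than keep the pointer. A minimal sketch of that pattern follows. The helper name `makeCopyingCallback` and the assumption that the caller knows the stream size in bytes are illustrative and not part of the PopART API; only the `std::function<void(void*)>` type comes from the signature above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Build a callback of the type accepted by connectStreamToCallback().
// It copies `numBytes` from the stream buffer into `sink` while the pointer
// is still valid, instead of storing the pointer for later use.
// ASSUMPTION: the caller knows the size of the stream data in bytes.
std::function<void(void *)> makeCopyingCallback(std::vector<char> &sink,
                                                std::size_t numBytes) {
  return [&sink, numBytes](void *data) {
    sink.resize(numBytes);
    if (numBytes > 0) {
      std::memcpy(sink.data(), data, numBytes);
    }
  };
}
```

The returned object could then be passed as the `callback` argument, with `sink` kept alive for as long as the callback stays connected.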

void connectStream(const std::string &streamHandle, void *buffer)

Connect a Poplar stream with a fixed location in memory.

Each time data is copied to the stream, this location will be read and each time data is copied from the stream, this location will be written.

Parameters
• streamHandle – The handle of the stream to connect to.

• buffer – The pointer to the memory location.

void connectHostFunction(const std::string &functionHandle, std::function<void(const void*const*, size_t, void*const*, size_t)> callback, unsigned index = 0)

Connect a host function to a callback.

The callback takes two arguments, which point to the locations in memory for each of the function’s input and output arguments, respectively. During a host function call, first the device transfers the input data to the host, then the callback is invoked, and finally the output data is copied back to the device. The memory pointed to by the callback arguments must only be accessed during the duration of the callback.

Parameters
• functionHandle – The name of the host function.

• callback – The function to be called whenever new input data is available.

• index – The replica index to connect to, when using replicated graphs. Default=0.
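
A callback of the documented type might look like the following sketch. It rests on two assumptions that the text above does not confirm: that each `size_t` argument is the number of pointers in the array preceding it, and that the host function has one float input and one float output of equal length. The helper name `makeDoublingCallback` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Build a callback matching the connectHostFunction() callback type.
// ASSUMPTION: numInputs/numOutputs count the pointers in each array.
// ASSUMPTION: one float input and one float output, each holding
// `numElements` values; the callback writes 2 * input into the output.
std::function<void(const void *const *, std::size_t, void *const *,
                   std::size_t)>
makeDoublingCallback(std::size_t numElements) {
  return [numElements](const void *const *inputs, std::size_t numInputs,
                       void *const *outputs, std::size_t numOutputs) {
    if (numInputs < 1 || numOutputs < 1) {
      return; // Nothing to do without one input and one output.
    }
    const float *in = static_cast<const float *>(inputs[0]);
    float *out = static_cast<float *>(outputs[0]);
    // The pointers are only used here, inside the callback.
    for (std::size_t i = 0; i < numElements; ++i) {
      out[i] = 2.0f * in[i];
    }
  };
}
```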

void run(IStepIO &stepIO, std::string debugName = "")

Run one step.

Read input data from the addresses in stepIO.in.

Write the output data to addresses in stepIO.out.

Parameters
• stepIO – The input and output data.

• debugName – A debug string to identify this run in logs.

void run(std::string programHandle, IStepIO &stepIO, std::string debugName = "")

Run one step of a custom program.

Read input data from the addresses in stepIO.in.

Write the output data to addresses in stepIO.out.

Parameters
• programHandle – The handle of the custom program to run.

• stepIO – The input and output data.

• debugName – A debug string to identify this run in logs.

void updateExternallySavedTensorLocations(const std::string &fromLocation, const std::string &toLocation)

Update the tensor locations of tensors in the session’s ONNX model.

A new file will be created at this point, and written to when the ONNX model is saved with a subsequent call to modelToHost().

Parameters
• fromLocation – All externally saved tensors with location fromLocation will have their location updated to toLocation.

• toLocation – The updated tensor locations. This must not already exist.

void modelToHost(const std::string &fn)

Write the current model to an ONNX file.

Parameters

fn – The path to the file. The path can be absolute or relative. If you plan to run your program in multiple processes simultaneously, you should avoid possible race conditions by writing to different files, for example by using temporary files.

TensorInfo getInfo(TensorId) const

Get the tensor information for a tensor.

Parameters

TensorId – The identifier of the tensor to get the tensor information for.

Returns

The tensor information for the tensor.

bool hasInfo(TensorId) const

Check whether a tensor has information.

Parameters

TensorId – The identifier of the tensor to check for tensor information.

Returns

true if the tensor with identifier TensorId has tensor information and false if not.

std::set<TensorId> getAllTensorIds() const

Returns the ids of all tensors in the model.

Pre

prepareDevice() must have been called.

std::string getSummaryReport(bool resetProfile = true) const

Retrieve the summary report from the poplar::Engine.

The options which were passed to the Session constructor will influence the information in the report.

This method may only be called after prepareDevice() has been called.

Parameters

resetProfile – If true, resets the execution profile. Default = true.

Returns

A string containing the report.

std::string getSerializedGraph() const

Retrieve the serialized graph from the poplar::Engine.

A JSON format report is produced.

This method may only be called after prepareDevice() has been called.

Returns

A string containing the serialized graph.

pva::Report getReport() const

Retrieve the graph report from the poplar::Engine.

The options which were passed to the Session constructor will influence the information in the report.

This method may only be called after prepareDevice() has been called.

Returns

The PopVision Analysis report object.

void resetHostWeights(const std::string &model, const bool ignoreWeightsInModelWithoutCorrespondingHostWeight = false)

Reset weights with weights in an ONNX model.

Note that the only differences between the ONNX model and the current model must be the weights. No other differences are allowed.

This method only updates the weights on the host. weightsFromHost() must be called after this method to update the weights on the device.

Parameters
• model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.

• ignoreWeightsInModelWithoutCorrespondingHostWeight – If true, do not throw an error if there are initializers in the ONNX model without corresponding initializer tensor(s) in the session’s IR.

Read the weights from the host stream memory and write to the host.

This method may only be called after weightsToHost() has been called.

Parameters

weightsIo – The weight data that is read from the host stream memory is written to the addresses in weightsIo.out.

void writeWeights(const IWeightsIO &weightsIo)

Write the weights from the host to the IR tensor memory.

This method may only be called after weightsFromHost() has been called.

Parameters

weightsIo – The weight data is written to the addresses in weightsIo.out.

std::string serializeIr(IrSerializationFormat format)

Serialize the IR graph to a string.

Parameters

format – The format to use for serializing.

inline const Ir &getIr() const

Get the IR associated with the Session.

inline const popx::Devicex &getDevice() const

Get the device associated with the Session.

inline popx::Devicex &getDevice()

Get the device associated with the Session.

inline const popx::IrLowering &getIrLowering() const

Get the IR lowering associated with the Session.

inline const popx::Executablex &getExecutable() const

Get the executable associated with the Session.

Broadcasts the weight from the PopRun instance with index rootRank to all other instances.

Parameters

rootRank – The index of the PopRun instance from which the weights should be broadcast.

void updateEngineCache()

Update cacheEntries from engine cache directory and update ir::hashMatched_ with the updated cacheEntries.

void setDeviceInfo(std::shared_ptr<DeviceInfo> deviceInfo)

Set the DeviceInfo of the Session.

### 13.1.1. Training session

#include <popart/session.hpp>

class TrainingSession : public popart::Session

TrainingSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware with training provided by optimizing a loss tensor using an optimizer and automatic differentiation (backpropagation).

Public Functions

~TrainingSession() override

Destructor for the TrainingSession class.

void updateOptimizerFromHost(const Optimizer *optimizer)

Update the optimizer from the host.

This method updates the optimizer and the associated hyperparameters but not the optimizer state tensors.

NOTE: The optimizer parameter has to be compatible with the optimizer passed to the TrainingSession constructor. For example, you cannot call this function with an SGD1 optimizer if you created the session with an SGD0 optimizer. This is because it is not possible to change the IR after a session has been constructed.

Parameters

optimizer – A pointer to a popart::Optimizer.

void copyFromRemoteBuffer(const std::string &buffer, void *w, int repeat_index, unsigned replication_index = 0)

Copy from a remote buffer into a user buffer.

This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.

Parameters
• buffer – The name of the remote buffer to copy from.

• w – Pointer to a user buffer to copy to.

• repeat_index – The index in the remote buffer to copy from.

• replication_index – The replicated graph index when using replicated graphs. Default=0.

void copyToRemoteBuffer(void *w, const std::string &buffer, int repeat_index, unsigned replication_index = 0)

Copy from a user buffer to a remote buffer.

This can be useful when we run larger models with host side reductions since HEXOPT is currently limited to 128 MB.

Parameters
• w – Pointer to a user buffer to copy from.

• buffer – The remote buffer to copy to.

• repeat_index – The index in the remote buffer to copy to.

• replication_index – The replicated graph index when using replicated graphs. Default=0.

Public Static Functions

static std::unique_ptr<TrainingSession> createFromIr(std::shared_ptr<Ir> ir, std::shared_ptr<DeviceInfo> deviceInfo, const std::string name = DefaultTrainingSessionName)

Create a session for training from an IR.

Parameters
• ir – The IR to create the session from.

• deviceInfo – The type of device that this session uses.

• name – The name of this training session. Default: “training”.

static std::unique_ptr<TrainingSession> createFromOnnxModel(const std::string &model, const DataFlow &dataFlow, const TensorId &loss, const Optimizer &optimizer, std::shared_ptr<DeviceInfo> deviceInfo, const InputShapeInfo &inputShapeInfo = InputShapeInfo(), const SessionOptions &userOptions = SessionOptions(), const Patterns &patterns = Patterns(), const std::string name = DefaultTrainingSessionName)

Create a session for training from an ONNX model.

Parameters
• model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.

• dataFlow – Configuration for the data feeds and fetches.

• loss – The identifier of the final scalar loss tensor for training.

• optimizer – The optimizer to use when training.

• deviceInfo – The type of device that this session uses.

• inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().

• userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().

• patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().

• name – (Optional) The name of this training session. Default: “training”.

### 13.1.2. Inference session

#include <popart/session.hpp>

class InferenceSession : public popart::Session

InferenceSession is a runtime instance that provides an interface for executing ONNX graphs on IPU hardware, without any automatic differentiation (backpropagation) or optimization.

Public Functions

~InferenceSession() override

Destructor for the InferenceSession class.

Public Static Functions

static std::unique_ptr<InferenceSession> createFromIr(std::shared_ptr<Ir> ir, std::shared_ptr<DeviceInfo> deviceInfo, const std::string name = DefaultInferenceSessionName)

Create a session for inference from an IR.

Parameters
• ir – The IR to create the session from.

• deviceInfo – The type of device that this session uses.

• name – The name of this inference session. Default: “inference”.

static std::unique_ptr<InferenceSession> createFromOnnxModel(const std::string &model, const DataFlow &dataFlow, std::shared_ptr<DeviceInfo> deviceInfo, const InputShapeInfo &inputShapeInfo = InputShapeInfo(), const SessionOptions &userOptions = SessionOptions(), const Patterns &patterns = Patterns(), const std::string name = DefaultInferenceSessionName)

Create a session for inference from an ONNX model.

Parameters
• model – An ONNX model protobuf, or the name of a file containing an ONNX model protobuf.

• dataFlow – Configuration for the data feeds and fetches.

• deviceInfo – The type of device that this session uses.

• inputShapeInfo – (Optional) The sizes and dtypes of the input tensors. This is used to specify the sizes of the input tensors in the case that the ONNX model does not include this information. The Poplar graph programming framework uses statically allocated memory buffers and so it needs to know the size of tensors before the compilation. Default: InputShapeInfo().

• userOptions – (Optional) The user configuration options for the Session class. Default: SessionOptions().

• patterns – (Optional) A user-selected set of graph transformation patterns which will be applied to the graph. If this is not specified, a default set of optimisation transformations will be applied. Default: Patterns().

• name – (Optional) The name of this inference session. Default: “inference”.

### 13.1.3. Session options

#include <popart/sessionoptions.hpp>

enum class popart::AccumulateOuterFragmentSchedule

Enum type that determines how the operations in the accumulate outer fragment will be scheduled across virtual graphs (only relevant to pipelined modes).

Values:

enumerator Scheduler = 0

enumerator Serial

Add constraints that ensure ops are executed in virtual graph ID order.

enumerator OverlapCycleOptimized

Try and parallelise ops with different virtual graph IDs as much as possible.

enumerator OverlapMemoryOptimized

Try and parallelise ops with different virtual graph IDs but avoid certain steps that are costly in terms of memory usage.

enum class popart::AutodiffStitchStrategy

Enum type representing a strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph.

Strategies may expose tensors that would otherwise have been internal to the forward graph as outputs of this forward graph.

Values:

enumerator RecomputeMinimal = 0

Recompute any backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph.

enumerator RecomputeAllNonInputs

Recompute any backward graph inputs associated with non-gradient forward graph tensors that are not inputs in the forward graph.

enumerator AddFwdOutputs

For backward graph inputs associated with non-gradient forward graph tensors that are neither inputs nor outputs in the forward graph, add them as outputs to the forward graph.

Note

This strategy is not guaranteed to work for all circumstances. In particular, it is unable to deal with subgraphs of IfOp. Using this setting may therefore result in subsequent exceptions in the Autodiff transform and it is therefore inadvisable to use this as an Autodiff default.

enumerator SafeAddFwdOutputs

Like AutodiffStitchStrategy::AddFwdOutputs except that those backward graph inputs that can’t be stitched with AutodiffStitchStrategy::AddFwdOutputs (that is, by adding outputs to the forward graph) are stitched using the AutodiffStitchStrategy::RecomputeMinimal strategy instead.

This means that this is a safe strategy to use as an Autodiff default.

enumerator N

Number of AutodiffStitchStrategy values.

enum class popart::BatchSerializationBatchSchedule

Enum type that describes how to change the batch serialisation subgraph schedule before outlining.

Note

This setting is experimental and may change.

Values:

enumerator Scheduler = 0

Don’t encourage any particular scheduling for ops within batch subgraphs (leave it to the scheduler) but tell the scheduler to schedule subgraphs in sequence.

enumerator Isomorphic

Encourage all ops within batch subgraphs to be scheduled identically and for each subgraph to be scheduled in sequence (good for outlineability).

enumerator OverlapOnIo

Attempt to put the remote load op for batch N+1 right after the compute phase of batch N.

enumerator OverlapOnCompute

Attempt to put the remote load op for batch N+1 right before the compute phase of batch N.

enumerator N

The number of BatchSerializationBatchSchedule values.

enum class popart::BatchSerializationMethod

Enum type that describes how to apply the batch serialization.

Note

This setting is experimental and may change.

Values:

enumerator UnrollDynamic = 0

Unroll the batch with dynamic slicing.

enumerator UnrollStatic

Unroll the batch with static slicing.

enumerator Loop

Loop over the batch dimension.

enumerator N

The number of BatchSerializationMethod values.
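
The trailing `N` enumerator used by these enums gives the number of real values, which is convenient for sizing lookup tables. The sketch below shows that idiom on a local copy of the enum as listed above. The copy and the `toString` helper exist only for illustration and are not taken from the PopART headers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Local copy of the enum as listed above, for illustration only.
enum class BatchSerializationMethod { UnrollDynamic = 0, UnrollStatic, Loop, N };

// N is not a real setting; it equals the number of enumerators before it.
constexpr std::size_t kNumMethods =
    static_cast<std::size_t>(BatchSerializationMethod::N);

// Hypothetical helper: map an enumerator to its name using a table whose
// size is tied to N, so the table cannot silently fall out of step.
std::string toString(BatchSerializationMethod m) {
  static const std::array<const char *, kNumMethods> names = {
      "UnrollDynamic", "UnrollStatic", "Loop"};
  const std::size_t i = static_cast<std::size_t>(m);
  return i < kNumMethods ? names[i] : "Unknown";
}
```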

enum class popart::BatchSerializationTransformContext

Enum type that describes when to apply batch serialization.

Note

This setting is experimental and may change.

Values:

enumerator Fwd = 0

Apply batch serialisation before growing the backward pass.

enumerator Bwd

Apply batch serialisation after growing the backward pass.

enumerator N

The number of BatchSerializationTransformContext values.

enum class popart::ExecutionPhaseIOSchedule

Enum type to specify when to load tensors.

Values:

enumerator Preload = 0

Preload tensors in the previous phase for use in the current phase.

enumerator OnDemand

Load tensors just before they are required.

enumerator N

The number of ExecutionPhaseIOSchedule values.

enum class popart::ExecutionPhaseSchedule

Enum type to specify the order of processing optimizer operations for different weights of the same execution phase.

The steps for phased execution are:

1. Copy to IO tiles if necessary.

2. Run collective operations if necessary.

3. Load optimizer state.

4. Update optimizer state.

5. Apply optimizer.

6. Store updated tensor if necessary.

Values:

enumerator Interleaving = 0

Process above steps for one weight at a time (for example: 123456, 123456, 123456).

The scheduler may interleave these steps.

enumerator Batch

Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange (for example: 333, 111, 222, 444, 555, 666).

enumerator BatchClusteredIO

Process above steps for all weights together, in a way that maximises overlap potential between compute and exchange, and maximise stream copy merges by keeping RemoteLoad/RemoteStore operations clustered (for example: 333, 111, 222, 444, 555, 666).

enumerator N

The number of ExecutionPhaseSchedule values.
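
The example orderings quoted for Interleaving and Batch can be reproduced with a short sketch. The two helper functions are hypothetical and only generate the digit strings used in the descriptions above, where each digit is a step number and each group covers the weights of one phase. They do not model what the PopART scheduler does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Interleaving: all six steps are processed for one weight at a time,
// giving one "123456" group per weight.
std::vector<std::string> interleavingOrder(std::size_t numWeights) {
  return std::vector<std::string>(numWeights, "123456");
}

// Batch: each step is processed for all weights together, in the step order
// 3, 1, 2, 4, 5, 6 quoted in the description above.
std::vector<std::string> batchOrder(std::size_t numWeights) {
  std::vector<std::string> groups;
  for (char step : std::string("312456")) {
    groups.emplace_back(numWeights, step); // e.g. "333" for three weights
  }
  return groups;
}
```

For three weights, `interleavingOrder(3)` yields 123456, 123456, 123456 and `batchOrder(3)` yields 333, 111, 222, 444, 555, 666.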

enum class popart::GradientTensorTrackingMethod

Enum type to specify the method for selecting gradient tensors whose statistics are to be tracked for the AutomaticLossScale transform.

Values:

Track all gradients of inputs to MatMul and Convolution ops.

enumerator N

The number of GradientTensorTrackingMethod values.

enum class popart::Instrumentation

Enum type used to specify an instrumentation type.

Values:

enumerator Outer = 0

Outer loop instrumentation, graph over all IPUs.

enumerator Inner

Inner loop instrumentation, graph per IPU.

enumerator N

The number of Instrumentation values.

enum class popart::IrSerializationFormat

Enum type used to specify a serialization format.

Values:

enumerator JSON

JavaScript Object Notation (JSON).

enum class popart::MeanReductionStrategy

Enum type that specifies when to divide by a mean reduction factor, when doing mean reduction over a sequence of tensors $$t_1, t_2, ..., t_k$$.

Values:

enumerator Running = 0

Keep the reduction buffer as the mean of the tensors accumulated so far.

If $$t_1, ..., t_f$$ has just been processed, the current accumulator $$s$$ is the mean of these values, and the next accumulator update is $$s = \frac{f}{f+1} * s + \frac{1}{f+1} * t_{f+1}$$ to keep $$s$$ a running mean.

This strategy guarantees $$s \le \max(t_1, ..., t_k)$$ throughout the accumulation, therefore it will not overflow, but it is generally slower than MeanReductionStrategy::Post.

enumerator Post

Keep the accumulation factor as the running sum, and divide once by $$k$$ at the end of the accumulation.

This strategy will generally be faster than MeanReductionStrategy::Running, but is prone to overflow (especially when using fp16).

enumerator N

The number of MeanReductionStrategy values.
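
The difference between the two strategies can be sketched on plain scalars. The functions below are illustrative stand-ins operating on `float` values rather than tensors, and their names are not part of PopART. `meanRunning` applies the update $$s = \frac{f}{f+1} * s + \frac{1}{f+1} * t_{f+1}$$, while `meanPost` sums first and divides once at the end.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Running strategy: after f values, s holds their mean. The next update is
// s = f/(f+1) * s + 1/(f+1) * t_{f+1}, written here with w = 1/(f+1).
float meanRunning(const std::vector<float> &ts) {
  float s = 0.0f;
  for (std::size_t f = 0; f < ts.size(); ++f) {
    const float w = 1.0f / static_cast<float>(f + 1);
    s = (1.0f - w) * s + w * ts[f];
  }
  return s;
}

// Post strategy: keep a running sum and divide once by k at the end.
// The intermediate sum can overflow even when the mean is representable.
float meanPost(const std::vector<float> &ts) {
  float s = 0.0f;
  for (float t : ts) {
    s += t;
  }
  return ts.empty() ? 0.0f : s / static_cast<float>(ts.size());
}
```

Both functions agree on small inputs such as {1, 2, 3, 4}. For two values close to the largest representable float, the sum in `meanPost` overflows to infinity while `meanRunning` stays finite, which is the trade-off described above.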

enum class popart::MergeVarUpdateType

Enum type used to specify which VarUpdateOp ops to merge.

Values:

enumerator None = 0

Do not merge VarUpdateOp ops.

enumerator All

Merge all VarUpdateOp ops into as few groups as possible.

This is a good choice when memory is not a constraint.

enumerator AutoLoose

Merge into groups while attempting not to increase maximum variable liveness, and also not slice tensor variables so they will need to be processed by different VarUpdateOp ops.

enumerator AutoTight

Merge into groups, so that VarUpdateOp ops process tensors of exactly SessionOptions::mergeVarUpdateMemThreshold in size.

enumerator N

The number of MergeVarUpdateType values.

enum class popart::RecomputationType

Enum type to specify which ops to recompute in the backward pass when doing auto-recomputation.

Values:

enumerator None = 0

No ops are recomputed (Default).

enumerator Standard

Recompute using an algorithm that picks checkpoints to try and minimise max liveness.

enumerator NormOnly

Only Norm ops (+ non-linearities, if following) are recomputed.

enumerator Pipeline

Recompute all forward pipeline stages.

enumerator RecomputeAll

Recompute all ops.

enumerator N

The number of RecomputationType values.

enum class popart::SubgraphCopyingStrategy

Enum type that describes how copies for inputs and outputs for subgraphs are lowered.

Currently this only affects subgraphs associated with CallOp ops.

Values:

enumerator OnEnterAndExit = 0

Copy all inputs before the start of the subgraph, copy all outputs after all ops in the subgraph.

With this strategy, subgraphs will always map to a single Poplar function.

enumerator JustInTime

Copy inputs just before they are consumed and copy outputs as soon as they are produced.

With this strategy, subgraphs may be lowered into multiple Poplar functions.

enumerator N

The number of SubgraphCopyingStrategy values.

enum class popart::SyntheticDataMode

Enum type used to specify the data source for input tensors.

Values:

enumerator Off = 0

Use real data.

enumerator Zeros

Input tensors are initialised to all zeros.

enumerator RandomNormal

Input tensors are initialised with a random normal distribution ~N(0,1).

enumerator RandomUniform

Input tensors are initialised with a uniform distribution.

enumerator N

The number of SyntheticDataMode values.

enum class popart::VirtualGraphMode

Enum type used to specify a virtual graph mode.

Values:

enumerator Off = 0

Virtual graphs are not enabled.

enumerator Manual

User must set the popart::Op::virtualGraph attribute on all ops.

enumerator Auto

Use the AutoVirtualGraph transform.

enumerator ExecutionPhases

Virtual graphs are tied to execution phases.

enumerator N

The number of VirtualGraphMode values.

struct AccumulateOuterFragmentSettings

A structure containing accumulate outer fragment settings.

Public Functions

AccumulateOuterFragmentSettings() = default
inline AccumulateOuterFragmentSettings(AccumulateOuterFragmentSchedule schedule_, const std::vector<int> &excludedVirtualGraphs_)

Constructor for AccumulateOuterFragmentSettings.

Parameters
• schedule_ – Indicate how to schedule the accumulate outer fragment. This setting is experimental and may change. Default: AccumulateOuterFragmentSchedule::Serial

• excludedVirtualGraphs_ – Indicate to explicitly avoid parallelising the virtual graph IDs. This setting is experimental and may change.

Public Members

AccumulateOuterFragmentSchedule schedule = AccumulateOuterFragmentSchedule::Serial

Indicate how to schedule the accumulate outer fragment.

Note

This setting is experimental and may change.

std::vector<int> excludedVirtualGraphs = {}

Indicate to explicitly avoid parallelising the virtual graph IDs.

Note

This setting is experimental and may change.

struct AutodiffSettings

The settings for the Autodiff transform.

Public Functions

AutodiffSettings() = default

Default constructor for the AutodiffSettings struct.

inline AutodiffSettings(AutodiffStitchStrategy stitchStrategy_)

Constructor for the AutodiffSettings struct.

Parameters

stitchStrategy_ – The strategy to ensure a backward graph’s inputs are either inputs of the forward graph, outputs of the forward graph or gradients of outputs of the forward graph. Default: AutodiffStitchStrategy::RecomputeAllNonInputs.

Public Members

AutodiffStitchStrategy stitchStrategy = AutodiffStitchStrategy::RecomputeAllNonInputs

The strategy PopART should use to ensure that all graph inputs of a backward graph are available as either inputs or outputs of the forward graph or gradients of outputs of the forward graph.

Note

This is an experimental option and may change.

struct AutomaticLossScalingSettings

A structure containing user configuration for automatic loss scaling settings.

Note

Automatic loss scaling is in preview. It is well tested and enabled in some of our example applications, but may not behave as expected in all models. Recommendation: if your model with automatic loss scaling enabled does not converge or triggers a compilation error, then you will need to set the loss scale manually.

Public Functions

AutomaticLossScalingSettings() = default

Default constructor for AutomaticLossScalingSettings.

AutomaticLossScalingSettings(bool enabled_, const nonstd::optional<std::vector<TensorId>> &toTrackTensors_, float binEdgeLocation_, float thresholdUpperCountProportion_, int updatePeriod_, GradientTensorTrackingMethod gradientTensorTrackingMethod_)

Constructor for AutomaticLossScalingSettings.

Parameters
• enabled_ – Indicate whether to keep track (true) or not (false) of the distribution of gradient tensor elements over the floating point range. Default: false.

• toTrackTensors_ – An optional list of model tensor names, for which gradient statistics will be collected. If not set, the gradients of all tensors produced by default operations (matmul, conv) will be used.

• binEdgeLocation_ – The location of the bin edge as a proportion of the absolute numerical range of the tracked gradient tensor elements, in the range [0, 1]. 0 represents the smallest representable value, and 1 the maximum. This is the single bin edge of the histogram that is an input to the loss scale updater algorithm. Default: 0.125.

• thresholdUpperCountProportion_ – The proportion of the elements in the upper bin above which the loss scale is increased, and below which the loss scale is decreased. Should be in the range [0, 1]. Default: 1e-7.

• updatePeriod_ – Indicate how often the loss scale update factor should be updated with respect to optimizer steps. Default: 1

std::size_t hash() const

Public Members

bool enabled = false

float binEdgeLocation = 0.125f

float thresholdUpperCountProportion = 1e-7

nonstd::optional<std::vector<TensorId>> toTrackTensors

int updatePeriod = 1

struct BatchSerializationSettings

A structure containing batch serialization settings.

Public Functions

BatchSerializationSettings() = default

Default constructor for BatchSerializationSettings.

BatchSerializationSettings(int factor_, bool concatOnVirtualGraphChange_, bool concatOnExecutionPhaseChange_, bool concatOnPipelineStageChange_, BatchSerializationTransformContext transformContext_ = BatchSerializationTransformContext::Fwd, BatchSerializationMethod method_ = BatchSerializationMethod::UnrollDynamic, BatchSerializationBatchSchedule batchSchedule_ = BatchSerializationBatchSchedule::Isomorphic)

Constructor for BatchSerializationSettings.

Parameters
• factor_ – The number of compute batches to split operations into. Default: 0.

• concatOnVirtualGraphChange_ – Indicate to break batch serialization chains (true) when the virtual graph changes (by concatenating the compute batches to the local batch). Default: true.

• concatOnExecutionPhaseChange_ – Indicate to break batch serialization chains (true) when the execution phase changes (by concatenating the compute batches to the local batch). Default: true.

• concatOnPipelineStageChange_ – Indicate to break batch serialization chains (true) when the pipeline stage changes (by concatenating the compute batches to the local batch). Default: true.

• transformContext_ – An experimental value to control when batch serialization is applied. Default: BatchSerializationTransformContext::Fwd.

• method_ – An experimental value to control how batch serialization is applied. Default: BatchSerializationMethod::UnrollDynamic.

• batchSchedule_ – An experimental value that changes how operations are scheduled. Default: BatchSerializationBatchSchedule::Isomorphic.

Public Members

int factor = 0

The number of compute batches to split operations into.

bool concatOnVirtualGraphChange = true

Break batch serialization chains when the virtual graph changes (by concatenating the compute batches to the local batch).

bool concatOnExecutionPhaseChange = true

Break batch serialization chains when the execution phase changes (by concatenating the compute batches to the local batch).

bool concatOnPipelineStageChange = true

Break batch serialization chains when the pipeline stage changes (by concatenating the compute batches to the local batch).

BatchSerializationTransformContext transformContext = BatchSerializationTransformContext::Fwd

Experimental value to control when batch serialization is applied.

BatchSerializationMethod method = BatchSerializationMethod::UnrollDynamic

Experimental value to control how batch serialization is applied.

BatchSerializationBatchSchedule batchSchedule = BatchSerializationBatchSchedule::Isomorphic

Experimental value that changes how operations are scheduled.

struct ExecutionPhaseSettings

A structure containing ExecutionPhase settings.

Public Functions

ExecutionPhaseSettings() = default

Default constructor for ExecutionPhaseSettings.

inline ExecutionPhaseSettings(int phases_, bool stages_, ExecutionPhaseIOSchedule weightIOSchedule_, ExecutionPhaseIOSchedule activationIOSchedule_, ExecutionPhaseIOSchedule optimizerStateIOSchedule_, ExecutionPhaseIOSchedule accumulatorIOSchedule_, ExecutionPhaseSchedule schedule_)

Constructor for ExecutionPhaseSettings.

Parameters
• phases_ – The number of execution phases for the whole model. Default: 0.

• stages_ – The number of overlapping stages:

• 1: Parallel streaming memory, default for 1 IPU per replica.

• 2: PingPong between 2 IPUs, default for 2 or more IPUs per replica (Default).

• weightIOSchedule_ – The execution phase IO schedule for weight tensors. Default: ExecutionPhaseIOSchedule::Preload.

• activationIOSchedule_ – The execution phase IO schedule for activation and gradient tensors. Default: ExecutionPhaseIOSchedule::Preload.

• optimizerStateIOSchedule_ – The execution phase IO schedule for optimizer state tensors. Default: ExecutionPhaseIOSchedule::OnDemand.

• accumulatorIOSchedule_ – The execution phase IO schedule for gradient accumulator tensors. Default: ExecutionPhaseIOSchedule::Preload.

• schedule_ – An experimental value that changes how operations are scheduled. Default: ExecutionPhaseSchedule::Interleaving.

Public Members

int phases = 0

Number of ExecutionPhases for the whole model.

int stages = 2

Number of overlapping stages.

• 1: Parallel streaming memory, default for 1 IPU per replica.

• 2: PingPong between 2 IPUs, default for 2 or more IPUs per replica.

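ExecutionPhaseIOSchedule weightIOSchedule = ExecutionPhaseIOSchedule::Preload

The execution phase IO schedule for weight tensors.

ExecutionPhaseIOSchedule activationIOSchedule = ExecutionPhaseIOSchedule::Preload

The execution phase IO schedule for activation and gradient tensors.

ExecutionPhaseIOSchedule optimizerStateIOSchedule = ExecutionPhaseIOSchedule::OnDemand

ExecutionPhaseIOSchedule accumulatorIOSchedule = ExecutionPhaseIOSchedule::Preload

ExecutionPhaseSchedule schedule = ExecutionPhaseSchedule::Interleaving

struct ReplicatedCollectivesSettings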

A structure containing settings for replicated collective operations.

Public Functions

ReplicatedCollectivesSettings(bool prepareScheduleForMergingCollectives = false, bool mergeAllReduceCollectives = false, bool mergeReduceScatterCollectives = false, bool mergeAllGatherCollectives = false)

Constructor for the ReplicatedCollectivesSettings struct.

Parameters
• prepareScheduleForMergingCollectives – Insert constraints into the schedule such that collectives which can be merged occur one right after the other. true to insert constraints, false otherwise. Default: false.

• mergeAllReduceCollectives – Identify allreduce operations which can be scheduled at the same time, and perform them as one larger operation to better utilize the bandwidth between replicas. true to identify operations, false otherwise. Default: false.

std::size_t hash() const

Public Members

bool prepareScheduleForMergingCollectives = false

bool mergeAllReduceCollectives = false

bool mergeReduceScatterCollectives = false

Identifies reduce-scatter operations which can be scheduled at the same time, and performs them as one larger operation so as to better utilize the bandwidth between replicas.

bool mergeAllGatherCollectives = false

Identifies allgather operations which can be scheduled at the same time, and performs them as one larger operation so as to better utilize the bandwidth between replicas.

struct SessionOptions

A structure containing user configuration options for the Session class.

Public Functions

inline bool explicitPipeliningEnabled() const

Check whether explicit pipelining is enabled.

Determined from values for enablePipelining, useHostCopyOps and enableExplicitMainLoops.

inline bool implicitPipeliningEnabled() const

Check whether implicit pipelining is enabled.

Determined from values for enablePipelining, useHostCopyOps and enableExplicitMainLoops.

inline void enableExplicitIR(bool enable)

Enable explicit representations in the IR (code paths).

Enabled if true, otherwise not.

int64_t getGlobalReplicationFactor() const

Get the global replication factor.

Returns

• If enableDistributedReplicatedGraphs is true, then return globalReplicationFactor.

• If enableReplicatedGraphs is true, then return replicatedGraphCount.

• otherwise return 1.
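The selection rule above can be sketched in standalone C++. This is a minimal illustration, not PopART's implementation: `ReplicationOptionsSketch` and `globalReplicationFactorOf` are hypothetical stand-ins that only mirror the four option fields named in the text.

```cpp
#include <cstdint>

// Hypothetical stand-in for the relevant SessionOptions fields.
struct ReplicationOptionsSketch {
  bool enableDistributedReplicatedGraphs = false;
  bool enableReplicatedGraphs            = false;
  std::int64_t globalReplicationFactor   = 1;
  std::int64_t replicatedGraphCount      = 1;
};

// Mirrors the documented return rule, checked in the documented order.
std::int64_t globalReplicationFactorOf(const ReplicationOptionsSketch &o) {
  if (o.enableDistributedReplicatedGraphs) {
    return o.globalReplicationFactor;
  }
  if (o.enableReplicatedGraphs) {
    return o.replicatedGraphCount;
  }
  return 1;
}
```

Note that in this sketch the distributed setting takes precedence when both flags are true, which follows the order of the bullets above.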

unsigned getAccumulationFactor() const

Get the gradient accumulation factor.

Throws an error if gradient accumulation is not enabled (enableGradientAccumulation is false) and the factor (accumulationFactor) is set to >1.

Returns

The accumulation factor.
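A minimal sketch of the documented check, assuming only the behaviour stated above. The free function `accumulationFactorOf` and the use of `std::runtime_error` are hypothetical; PopART's own error type and message are not shown here.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical free-function version of the documented check.
unsigned accumulationFactorOf(bool enableGradientAccumulation,
                              std::int64_t accumulationFactor) {
  // A factor above 1 is only meaningful when accumulation is enabled.
  if (!enableGradientAccumulation && accumulationFactor > 1) {
    throw std::runtime_error(
        "accumulationFactor > 1 but enableGradientAccumulation is false");
  }
  return static_cast<unsigned>(accumulationFactor);
}
```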

unsigned getBufferingDepth(const TensorId &id, bool rearrangedOnHost)

bool autoRecomputationEnabled() const

Returns true if auto-recomputation is enabled, false otherwise.

inline SessionOptions()

Constructor for SessionOptions.

Public Members

std::string logDir

A directory for log traces to be written into.

std::set<std::string> dotChecks = {}

When to write .dot files during IR construction.

int firstDotOp = 0

The ops written to the .dot file will be a part of the schedule, controlled by firstDotOp and finalDotOp.

In particular, it will be [max(0, firstDotOp), min(N ops in IR, finalDotOp)).

int finalDotOp = 10000

See firstDotOp.
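The half-open range described for firstDotOp and finalDotOp can be sketched as follows. `dotOpRange` is a hypothetical helper, not a PopART function, and reporting an empty range as `{first, first}` is a choice made for this sketch only.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: returns the half-open range [first, last) of schedule
// indices written to the .dot file, following the documented formula
// [max(0, firstDotOp), min(N ops in IR, finalDotOp)).
std::pair<int, int> dotOpRange(int firstDotOp, int finalDotOp, int nOpsInIr) {
  const int first = std::max(0, firstDotOp);
  const int last  = std::min(nOpsInIr, finalDotOp);
  // Sketch-only choice: an empty range is reported as {first, first}.
  return {first, std::max(first, last)};
}
```

With the default values (0 and 10000), an IR of 250 ops yields the range [0, 250), so every op in the schedule is written.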

bool dotOpNames = false

Enable inclusion of the op name in the .dot file (the op type is always exported).

Enabled when true. Default: false.

bool exportPoplarComputationGraph = false

Enable export of Poplar computational graph.

Enabled when true. Default: false.

bool exportPoplarVertexGraph = false

Enable export of Poplar vertex graph.

Enabled when true. Default: false.

bool separateCallOpPdfs = true

Enable creation of separate PDFs for each subgraph when generating PDFs of IR graphs.

Enabled when true. Default: true.

bool enableOutlining = true

Enable outlining.

This identifies and extracts repeated parts of computational graph into subgraphs. Enabled when true. Default: true.

bool enableOutliningCopyCostPruning = true

Enable inclusion of the cost of copying cached sections in the outlining cost model.

Enabled when true. Default: true.

float outlineThreshold = 1.0f

Specify the incremental value that a sub-graph requires, relative to its nested sub-graphs (if any), to be eligible for outlining.

A high threshold results in fewer sub-graphs being outlined, a negative value results in all being outlined. The gross value of a sub-graph is the sum of its constituent ops’ Op::getSubgraphValue() values. To disable outlining, it is better to set enableOutlining to false than to set this value to infinity. The default value of 1.0f results in all high value operations such as convolution being cached, but standalone low value operations such as ReLU will not be.

Default: 1.0f.

float outlineSequenceBreakCost = 10000.0f

Specify the penalty applied to outlining potential sub-graphs if the sub-graph to be created breaks up a sequence of operations that are more efficient (for example for overlapping compute and exchange) when outlined together.

Default: 10000.0f.

SubgraphCopyingStrategy subgraphCopyingStrategy = SubgraphCopyingStrategy::OnEnterAndExit

Specify how copies for inputs and outputs for subgraphs are lowered.

Setting this value to SubgraphCopyingStrategy::JustInTime may save memory at the cost of fragmenting subgraphs into multiple Poplar functions. This may be particularly useful when a number of weight updates are outlined in one subgraph, as it may prevent multiple weight tensors from being live at the same time inside the subgraph.

Default: SubgraphCopyingStrategy::OnEnterAndExit.

RecomputationType autoRecomputation = RecomputationType::None

Enable recomputation of operations in the graph in the backward pass.

This will reduce model size at the cost of computation cycles.

Default: RecomputationType::None (no recomputation).

MergeVarUpdateType mergeVarUpdate = MergeVarUpdateType::None

Enable merging of VarUpdates into groups of VarUpdates, by flattening and concatenating variable tensors and updating tensors.

Default: MergeVarUpdateType::None (no merging).

int64_t mergeVarUpdateMemThreshold = 1000000

Specify the memory threshold for VarUpdateOp merging algorithms.

The MergeVarUpdateType::AutoLoose and MergeVarUpdateType::AutoTight VarUpdateOp merging algorithms have a threshold on the total memory of variable tensors to merge for updating. Defined as total memory in bytes.

Default: 1000000.

int64_t looseThresholdAtPeak = 8000

Specify the threshold at peak used in the calculation of the absolute threshold in the MergeVarUpdateType::AutoLoose VarUpdateOp merging algorithm.

min(mergeVarUpdateMemThreshold, liveAtPeak - liveCurrently + looseThresholdAtPeak)

where:

• liveAtPeak is an estimate of the maximum live memory of the computation; and

• liveCurrently is an estimate of the live memory where the threshold is being used to determine whether to schedule or postpone a VarUpdateOp.

Default: 8000.
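The absolute threshold formula above can be written out as a small function. `autoLooseThreshold` is a hypothetical name used only for this sketch; the parameters correspond one-to-one to the symbols in the formula.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper that evaluates the documented formula:
//   min(mergeVarUpdateMemThreshold,
//       liveAtPeak - liveCurrently + looseThresholdAtPeak)
// All quantities are in bytes.
std::int64_t autoLooseThreshold(std::int64_t mergeVarUpdateMemThreshold,
                                std::int64_t liveAtPeak,
                                std::int64_t liveCurrently,
                                std::int64_t looseThresholdAtPeak) {
  return std::min(mergeVarUpdateMemThreshold,
                  liveAtPeak - liveCurrently + looseThresholdAtPeak);
}
```

For example, with mergeVarUpdateMemThreshold at 1000000, looseThresholdAtPeak at 8000, and estimates of 500000 bytes live at peak and 450000 bytes live currently, the second term (58000) is the smaller one and becomes the threshold. When current liveness is far below the peak, the result is capped by mergeVarUpdateMemThreshold.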

bool rearrangeAnchorsOnHost = true

Enable rearrangement (in memory) of anchor tensors to be done on the host.

Before anchor tensors are streamed from device to host, they are not necessarily arranged in memory as required when they are to be copied from host stream to host. This can be done on the device or on the host.

Default: true (rearrangement is done on the host to save memory, but often at the expense of cycles, especially for larger anchor tensors).

bool rearrangeStreamsOnHost = false

Enable rearrangement (in memory) of stream tensors to be done on the host.

Before stream tensors are streamed from host to device, they are not necessarily arranged in memory as required when they are to be copied from host stream to device. This can be done on the device or on the host.

Default: false (Rearrangement done on device).

bool enablePrefetchDatastreams = true

Enable prefetching for input data streams.

Poplar will speculatively read data for a stream before it is required in order to allow the ‘preparation’ of the data to occur in parallel with compute. Enabled when true. Default: true.

unsigned defaultBufferingDepth = 1

Specify the default buffering depth value used for streams that are not re-arranged on the host.

For tensors that are rearranged on the host, a buffering depth of 1 will always be used. This default value can be overridden via bufferingDepthMap.

unsigned defaultPrefetchBufferingDepth = initialDefaultPrefetchBufferingDepthValue

Deprecated:

This session option name has been deprecated and will be removed in a future release.

std::map<TensorId, unsigned> bufferingDepthMap

This mapping can be used to set stream-specific buffering depths.

The buffering depth could be thought of as being the size of a circular buffer that feeds data to and from Poplar. A buffering depth greater than 1 may improve the performance due to increased parallelisation but comes at the cost of increasing the memory footprint. Streams for tensors that have no entry in this map will default to 1 (if a tensor is rearranged on host) or defaultBufferingDepth (if a tensor is not rearranged on host). Specifying a tensor that gets rearranged on host in this map will throw an error.
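The lookup rules in the paragraph above can be sketched as follows. This is an illustration under stated assumptions: `resolveBufferingDepth` is a hypothetical helper, `std::string` stands in for TensorId, and `std::invalid_argument` stands in for whatever error PopART raises.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical helper that applies the documented rules for one stream.
unsigned resolveBufferingDepth(
    const std::map<std::string, unsigned> &bufferingDepthMap,
    unsigned defaultBufferingDepth,
    const std::string &id,
    bool rearrangedOnHost) {
  const auto it = bufferingDepthMap.find(id);
  if (it != bufferingDepthMap.end()) {
    // An explicit entry for a tensor rearranged on host is an error.
    if (rearrangedOnHost) {
      throw std::invalid_argument(
          "buffering depth set for a tensor rearranged on host: " + id);
    }
    return it->second;
  }
  // No entry: 1 if rearranged on host, otherwise the session default.
  return rearrangedOnHost ? 1u : defaultBufferingDepth;
}
```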

std::map<TensorId, unsigned> prefetchBufferingDepthMap

Deprecated:

This session option name has been deprecated and will be removed in a future release.

bool enableNonStableSoftmax = false

Enable the non-stable softmax Poplar function.

By default, the stable softmax Poplar function is used. The input tensor to softmax, $$x$$, is preprocessed by subtracting $$max(x)$$ from each element before computing the exponentials, ensuring numerical stability. If the inputs to the softmax operations are small enough to not cause overflow when computing the exponential, then the non-stable version can be enabled instead, to increase the speed.

Default: false (not enabled).
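The preprocessing step described above (subtracting max(x) before exponentiating) can be illustrated with a plain C++ sketch. This is the textbook technique on a host-side vector, not the Poplar function itself, and `stableSoftmax` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Textbook numerically stable softmax on a host-side vector.
std::vector<double> stableSoftmax(const std::vector<double> &x) {
  if (x.empty()) {
    return {};
  }
  // Subtracting max(x) keeps every exponent <= 0, so exp() cannot overflow.
  const double maxValue = *std::max_element(x.begin(), x.end());
  std::vector<double> y(x.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = std::exp(x[i] - maxValue);
    sum += y[i];
  }
  for (double &v : y) {
    v /= sum;
  }
  return y;
}
```

Without the subtraction, inputs such as 1000.0 would overflow in `std::exp` for double precision, which is the situation the stable version guards against.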

bool enableReplicatedGraphs = false

Enable replication of graphs. Default: false (not enabled).

bool enableGradientAccumulation = false

Enable gradient accumulation. Default: false (not enabled).

ReductionType accumulationAndReplicationReductionType = ReductionType::Sum

Specify how gradients are reduced when using gradient accumulation and graph replication.

Default: ReductionType::Sum.

MeanReductionStrategy meanAccumulationAndReplicationReductionStrategy = MeanReductionStrategy::Post

Specify when to divide by a mean reduction factor when accumulationAndReplicationReductionType is set to ReductionType::Mean.

Default: MeanReductionStrategy::Post.

int64_t replicatedGraphCount = 1

Specify the number of model replications.

If enableReplicatedGraphs is true, replicatedGraphCount will set the number of model replications. For example, if the model uses 1 IPU, a replicatedGraphCount of 2 will use 2 IPUs. If the model is pipelined across 4 IPUs, a replicatedGraphCount of 4 will use 16 IPUs in total. Therefore, the number of IPUs requested must be a multiple of replicatedGraphCount. If the training is done across multiple instances of the program then the replicatedGraphCount is the number of replicas for this instance.
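The IPU arithmetic in the examples above can be checked with a short sketch. Both helpers are hypothetical and exist only to make the relationship explicit; they are not part of PopART.

```cpp
#include <cstdint>

// Hypothetical helper: total IPUs used by one PopART instance when each
// replica spans ipusPerReplica IPUs.
std::int64_t totalIpus(std::int64_t ipusPerReplica,
                       std::int64_t replicatedGraphCount) {
  return ipusPerReplica * replicatedGraphCount;
}

// Hypothetical helper: the requested IPU count must be a multiple of
// replicatedGraphCount.
bool isValidIpuRequest(std::int64_t requestedIpus,
                       std::int64_t replicatedGraphCount) {
  return replicatedGraphCount > 0 &&
         requestedIpus % replicatedGraphCount == 0;
}
```

This reproduces the two cases in the text: 1 IPU with 2 replicas uses 2 IPUs, and a 4-IPU pipeline with 4 replicas uses 16 IPUs.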

int64_t accumulationFactor = 1

Specify the number of micro-batches to accumulate before applying the varUpdate.

VirtualGraphMode virtualGraphMode = VirtualGraphMode::Off

Specify how to place ops on virtual graphs to achieve model parallelism, either manually using model annotations, or automatically.

Default: VirtualGraphMode::Off.

bool enablePipelining = false

Enable pipelining of virtual graphs. Default: false (not enabled).

SyntheticDataMode syntheticDataMode = SyntheticDataMode::Off

Specify whether to use real or synthetic data to initialize input tensors.

Streaming to/from the host is only enabled for SyntheticDataMode::Off which indicates that real data is being used.

Default: SyntheticDataMode::Off.

bool instrumentWithHardwareCycleCounter = false

Add instrumentation to the program to count the number of device cycles (of a single tile, on a single IPU) that the main program takes to execute.

Expect this to have a small detrimental impact on performance.

std::set<Instrumentation> hardwareInstrumentations = {Instrumentation::Outer}

bool disableGradAccumulationTensorStreams = false

Disable saving of weight gradient tensors off the device.

If true, the weight gradient tensors are not saved off the device when devicex.weightsFromHost() is called.

Note

This option is overridden if syntheticDataMode is not SyntheticDataMode::Off.

Note

Weight gradient tensors that are also optimiser tensors will only be disabled if both disableGradAccumulationTensorStreams and disableOptimizerStateTensorStreams are true.

bool disableOptimizerStateTensorStreams = false

Disable streaming of optimizer tensors.

If true, streaming of optimizer tensors is disabled. This setting can be used to conserve memory if you are not interested in checkpointing the optimizer state.

Note

Weight gradient tensors that are also optimiser tensors will only be disabled if both disableGradAccumulationTensorStreams and disableOptimizerStateTensorStreams are true.

bool compileEngine = true

Setting to only build the Poplar graph but not compile it.

If false, the backend will build the Poplar graph but not compile it into an Engine. In this case, no execution can be performed, and nothing can be transferred to the device. API calls which retrieve information from the graph building stage, such as tile mapping introspection, can still be used.

bool constantWeights = true

Specify an optimization for an inference session to have constant weights.

Set this option to false in order to change the weights with a call to Session::resetHostWeights() after the session has been prepared. This option has no effect on a training session.

Default: true.

bool enableEngineCaching = false

Enable Poplar executable caching.

The file is saved to the location defined with cachePath. The file will be in the PopEF format. This means that it can be used to run inference using the Triton Inference Server because Graphcore provides a backend to it. See the Poplar Triton Backend user guide for more information.

Default: false (not enabled).

bool enableVariablesCaching = true

Enable variable caching.

This means that the caching process will save variables as additional PopEF blobs to the file location defined with cachePath. If PopART requires data for variables while reading the cache, the data is read automatically from the cache file.

Note, turning this off allows a PopART Session to optimise the host memory it consumes during model runtime. Specifically, weightsToHost() can write directly to the IR tensor data buffers. If the option were on, this would not be safe and the session would have to create separate buffers to write the fetched data to.

Default: true (enabled).

std::string cachePath = "session_cache"

Folder to save the poplar::Executable to.

bool enableFloatingPointChecks = false

Enable throwing of exceptions when floating point errors occur.

Default: false (not enabled).

bool enableStochasticRounding = false

Enable stochastic rounding.

PopART will set the Poplar engine option target.deterministicWorkers to true if this option is set and to false if it is not set. Adding a value for “target.deterministicWorkers” to SessionOptions::engineOptions overrides this behaviour.

Default: false (not enabled).

bool _enableRngStateManagement = false

ExecutionPhaseSettings executionPhaseSettings

Configuration settings for execution phases.

AccumulateOuterFragmentSettings accumulateOuterFragmentSettings

Configuration setting for operations in the accumulate outer fragment.

bool explicitRecomputation = false

Enable explicit recomputation.

Default: false (not enabled).

NumIOTiles numIOTiles

Number of IPU tiles dedicated to IO.

bool aliasZeroCopy = false

Enable zero-copy for subgraphs.

BatchSerializationSettings batchSerializationSettings

Configuration setting for batch serialization.

AutodiffSettings autodiffSettings

Configuration settings for the autodiff transform.

bool delayVarUpdates = true

Options to delay variable updates as much as possible.

bool enableFullyConnectedPass = true

Enable the global fullyConnectedPass option for matmuls.

See poplin::matMul(poplar::Graph, poplar::Tensor, poplar::Tensor, poplar::program::Sequence, poplar::Type, poplar::DebugContext, poplar::OptionFlags, matmul::PlanningCache).

bool enableSerializedMatmuls = true

Enable/disable the serializing of matmuls.

std::string partialsTypeMatMuls

Set the partials type globally for matmuls.

Can be overridden individually with Builder.setPartialsType(). Valid values are "float" and "half". By default, this is not set, so no global partials type is imposed.

bool enableStableNorm = false

If true, computes the mean first and subtracts it from the activations before computing the variance.

The implementation with this flag set to true is slower than when set to false. The stable version requires the first order moment to be estimated and applied to the sample set before the second order central moment is calculated.

std::map<std::string, std::string> engineOptions

Poplar engine options.

std::map<std::string, std::string> convolutionOptions

Poplar convolution options.

std::map<std::string, std::string> lstmOptions

Poplar LSTM options.

std::map<std::string, std::string> matmulOptions

Poplar matmul options.

std::map<std::string, std::string> reportOptions

Poplar reporting options.

std::map<std::string, std::string> gclOptions

GCL options.

ExperimentalSettings experimentalSettings

Configuration setting for custom transform applier.

std::vector<std::string> customCodelets

List of codelet files (with file extension) to be added to the Poplar graph.

std::string customCodeletCompileFlags

Compile flags for the custom codelets.

For example, -g to generate debug information. See the Poplar documentation for poplar::Engine for more information.

double timeLimitScheduler = 1e9

The maximum allowed time (in seconds) that can be spent searching for a good graph schedule before a solution must be returned.

int64_t swapLimitScheduler = static_cast<int64_t>(1e9)

The maximum number of improving steps allowed by the scheduling algorithm before a solution must be returned.

std::string serializedPoprithmsShiftGraphsDir = {}

The directory to serialize Poprithms graphs to.

PopART uses Poprithms for scheduling PopART graphs. The Poprithms graphs created for scheduling can be optionally serialised (written to file). If serializedPoprithmsShiftGraphsDir is empty, then the graphs will not be serialised. The names of serialization files will be poprithms_shift_graph_i.json for the lowest non-existing values of i. The directory must already exist; PopART will not create it.
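The file naming rule ("the lowest non-existing values of i") can be sketched as below. This is an illustration under assumptions: `nextShiftGraphName` is a hypothetical helper, the set of existing names stands in for a directory listing, and starting the search at i = 0 is an assumption that the text does not confirm.

```cpp
#include <set>
#include <string>

// Hypothetical helper: picks the first file name of the documented form
// that is not already present. Assumes the index starts at 0.
std::string nextShiftGraphName(const std::set<std::string> &existingNames) {
  for (int i = 0;; ++i) {
    const std::string name =
        "poprithms_shift_graph_" + std::to_string(i) + ".json";
    if (existingNames.count(name) == 0) {
      return name;
    }
  }
}
```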

std::string kahnTieBreaker = "greedy"

Specify which method is used to control how ops are scheduled.

The initial scheduling is done with Kahn’s algorithm. When several ops are free to be scheduled, this controls which method is used.

Options are described in the Poprithms KahnTieBreaker enum.

size_t transitiveClosureOptimizationThreshold = {100000}

Specify the transitive closure optimization threshold.

The transitive closure optimization pass can significantly accelerate the scheduler. It does not, in general, affect the final schedule returned. It is run between initialization with Kahn’s algorithm and the shifting swaps. The transitive closure optimization pass is O(nOps^2) and so should not be used for extremely large graphs. If a graph is above this threshold, the transitive closure optimization pass is not run.

bool decomposeGradSum = false

Enable replacement of single sums of partial gradients with a tree of additions.

This can reduce max liveness at the cost of extra cycles. A typical use case for this would be if a large weight tensor is used as an input to many operations.

Default: false (not enabled).

ReplicatedCollectivesSettings replicatedCollectivesSettings

Control the behavior of different collective operations.

bool enableDistributedReplicatedGraphs = false

Enable training with Poplar replicated graphs across multiple PopART instances.

Default: false (not enabled).

int64_t globalReplicationFactor = 1

The total number of replicas in a multi-instance, replicated-graph training session (this should be left as the default value (1) if distributed replicated graphs are disabled).

This value includes local replication.

int64_t globalReplicaOffset = 0

The first replica index that this PopART instance is running.

bool groupHostSync = false

Specify to group the streams from the host to the device at the beginning of the schedule, and the streams from the device to the host at the end of the schedule.

This trades off memory usage for speed.

When true, tensors will stay live for longer.

Default: false (not enabled).

Note

This setting has no effect when useHostCopyOps is enabled (true).

bool strictOpVersions = true

Enable strict op version checks.

Strict op version checks will throw an error if the exact version of an op required for the model opset is not supported. Turning this check off will cause PopART to fall back to the latest implementation of the op that is supported.

Default: true (enabled).

Warning

Turning off these checks may cause undefined behaviour.

bool opxAliasChecking = false

Enable running Opx checks to verify that IR tensor aliasing information corresponds to the lowered Poplar tensor aliasing.

Default: false (not enabled).

bool opxModifyChecking = false

Enable running Opx checks to verify that IR tensor modification information corresponds to the lowered Poplar tensor modifications.

Default: false (not enabled).

bool useHostCopyOps = false

Enable use of IR graph operations for data and anchor streams.

Default: false (not enabled).

Default: false (not enabled).

TensorLocationSettings activationTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}

Tensor location settings for activation/gradient tensors.

TensorLocationSettings weightTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}

Tensor location for weight tensors.

TensorLocationSettings optimizerStateTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}

Tensor location for optimizer state tensors.

TensorLocationSettings accumulatorTensorLocationSettings = TensorLocationSettings{TensorLocation(), 2, 8192}

Tensor location for gradient accumulator tensors.

std::map<TensorId, TensorLocation> tensorLocationSettingsOverride

Override tensor location for specific tensors by setting tensor locations for specific tensor ID values.

AutomaticLossScalingSettings automaticLossScalingSettings

Settings to enable and configure the automatic loss scaling behaviour when training.

Note

Automatic loss scaling is in preview. It is well tested and enabled in some of our example applications, but may not behave as expected in all models. Recommendation: if your model with automatic loss scaling enabled does not converge or triggers a compilation error, then you will need to set the loss scale manually.

DeveloperSettings developerSettings

Settings for developers to configure testing and benchmarking.

bool enableSupportedDataTypeCasting = true

Enable casting to supported data types.

If enabled (true), casts any tensor of unsupported data types to supported data types when lowering to Poplar. Currently, this implies casting:

• INT64 -> INT32

• UINT64 -> UINT32

The cast will throw an error for incompatible data types and over/underflows, and will warn about narrowing casts.

Default: true (enabled).
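The overflow and underflow behaviour described above can be illustrated with a checked narrowing cast on the host. `narrowToInt32` is a hypothetical helper and `std::overflow_error` is a stand-in; this sketch does not reproduce how PopART performs the cast when lowering to Poplar.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical checked cast for the INT64 -> INT32 case.
std::int32_t narrowToInt32(std::int64_t value) {
  // Reject values that INT32 cannot represent instead of wrapping silently.
  if (value < std::numeric_limits<std::int32_t>::min() ||
      value > std::numeric_limits<std::int32_t>::max()) {
    throw std::overflow_error("INT64 value does not fit in INT32");
  }
  return static_cast<std::int32_t>(value);
}
```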

bool enableExplicitMainLoops = false

Enable explicit main loop transformation, and disable implicit training loops.

Note

This will be deprecated and enabled by default.

bool groupNormStridedChannelGrouping = false

Enable fast math mode for group norms.

Group norms have a fast math mode which changes the implementation to run faster on IPU but, as a consequence, is incompatible with other implementations (for example, when running trained weights on the host). The default (false) is to use the correct, but slightly slower mode.

std::function<void(int, int)> compilationProgressLogger

Callback function used to indicate PopART compilation progress.

The function should not block. All calls to the callback function will be made from the main thread so blocking in the callback will block compilation from progressing.

If this logger is not set then compilation progress will be printed on the info channel.

Param int

The progress value.

Param int

The maximum value for the progress.
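Because the callback must not block, one possible approach is to record the latest progress values and return immediately. The following is a minimal sketch under that assumption; ProgressState, makeProgressLogger and percentComplete are hypothetical names introduced for illustration, and only the std::function<void(int, int)> signature is taken from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical shared state: written by the callback, read elsewhere.
// Atomics keep the callback cheap and non-blocking.
struct ProgressState {
  std::atomic<int> current{0};
  std::atomic<int> total{0};
};

// Build a callable with the signature std::function<void(int, int)>.
// `state` must outlive the returned callable.
inline std::function<void(int, int)> makeProgressLogger(ProgressState &state) {
  return [&state](int progress, int total) {
    state.current.store(progress);
    state.total.store(total);
  };
}

// Percentage complete, or 0 if no progress has been reported yet.
inline int percentComplete(const ProgressState &state) {
  const int total = state.total.load();
  return total > 0 ? (100 * state.current.load()) / total : 0;
}
```

A callable built this way could be assigned to compilationProgressLogger; any slow work, such as updating a display, would then read ProgressState from another thread.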

int compilationProgressTotal = 100

Total progress ticks until compilation complete.

bool enableMergeExchange = true

Enable merging remote and host IO operations to facilitate IO overlap.

true to enable, otherwise false.

Default=true.

bool ensureFp32LossScaleTensor = false

Ensure that the loss scale tensor is fp32 and that this is combined with fp16 activations as late as possible to produce the first fp16 activation gradients.

This makes it possible to choose a loss scale value greater than max(fp16). This is also recommended when automatic loss scaling is enabled. Only compatible with models that have an fp16 loss scale tensor. true ensures that the loss scale tensor is fp32.

Default: false.

bool enableInplaceAmbiguityChecking = false

Enable creation of an AliasModel object for each graph and run the Poprithms ambiguity checker on it.

This throws an error if the graph has a potential inplacing ambiguity.

See poprithms::memory::inplace::Graph::AmbiguityStatus for more info on what constitutes an ambiguity.

If set to true, an AliasModel object is created for each graph and the Poprithms ambiguity checker is run on it. No ambiguity checking is performed if this option is set to false (default). However, inplace fallbacks will occur if necessary.

bool createImplicitPipeliningFwdOnlyProgram = false

Deprecated:

Create a custom program containing the forward pipeline only.

bool throwIfLog2ScaleTensorNotInRange = true

If set to true, throw a Poplar error if any fused ops that consume a log2 scale tensor receive a log2 scale tensor value not in the integer range [-32, 32).

If set to false, no error is thrown. However, note that this may lead to undefined behaviour if the value of the log2 scale is outside the range.
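The half-open range [-32, 32) can be checked on the host before a value is fed to the device. The sketch below is illustrative only: log2ScaleInRange and checkLog2Scale are hypothetical helpers, not PopART functions, and the exception type is an assumption.

```cpp
#include <cassert>
#include <stdexcept>

// Bounds of the half-open range [-32, 32) quoted in the option description.
constexpr int kLog2ScaleMin = -32;
constexpr int kLog2ScaleMax = 32; // exclusive upper bound

// True when the value lies in [-32, 32).
constexpr bool log2ScaleInRange(int log2Scale) {
  return log2Scale >= kLog2ScaleMin && log2Scale < kLog2ScaleMax;
}

// Hypothetical host-side guard that mimics the "throw if not in range"
// behaviour; with the flag set to false, no check is made.
inline void checkLog2Scale(int log2Scale, bool throwIfNotInRange) {
  if (throwIfNotInRange && !log2ScaleInRange(log2Scale)) {
    throw std::out_of_range("log2 scale outside [-32, 32)");
  }
}
```

Note that -32 is accepted and 32 is rejected, since the upper bound is exclusive.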

bool enableConstantFoldingOfMultipleConsumers = true

If set to false, disable constant folding on ops if any input has multiple consumers.

Default=true.

bool useLoopCandidateCreator = false

Use the loop candidate creator for a constant if one exists.

Default=false.

struct ExperimentalSettings

Public Members

std::map<std::string, std::vector<std::string>> customTransformApplierSettings

Custom transform applier settings.

Enables insertion of a custom transform sequence at a predefined checkpoint. Multiple checkpoint names and transform names can be passed for different model configurations.

The predefined checkpoint names are:

FWD0: Initial IR immediately after lowering from ONNX to the IR.

FWD1: After the pre-alias patterns have been applied to FWD0.

BWD0: After growing the backward pass (including the optimiser step). Note this happens before optimiser decomposition, so the optimiser will appear as a single special op rather than the many ops that implement it.

PREALIAS: After pre-alias transforms have been applied to BWD0.

MAINLOOPS: After the MainLoops transform has been applied. This transform adds explicit loop ops to the IR for device iterations (batches per step) and gradient accumulation.

FINAL: The final IR after preparation.

The transform names are defined by PopART and users.

For example, to execute ‘Transform A’ and ‘Transform B’ at the ‘Fwd0’ checkpoint and execute ‘Transform C’ at the ‘Fwd1’ checkpoint:

{ "Fwd0": [ "Transform A", "Transform B" ], "Fwd1": [ "Transform C" ] }

Note

This setting is experimental for inference and may change.
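In C++, the mapping in the example above corresponds to a std::map from checkpoint name to a list of transform names. The following sketch builds that value; TransformSettings and makeExampleTransformSettings are hypothetical names, and ‘Transform A’, ‘Transform B’ and ‘Transform C’ are the placeholder transform names from the text.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Same type as customTransformApplierSettings.
using TransformSettings = std::map<std::string, std::vector<std::string>>;

// Build the checkpoint -> transform-list mapping from the example in the
// text. The transform names are placeholders, not real transforms.
inline TransformSettings makeExampleTransformSettings() {
  return {
      {"Fwd0", {"Transform A", "Transform B"}},
      {"Fwd1", {"Transform C"}},
  };
}
```

The transforms listed for a checkpoint are kept in the order given, which is the order in which the example expects them to be executed.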

class NumIOTiles

A wrapper class for the SessionOptions::numIOTiles option that permits any int value and has an ‘unassigned’ state.

Public Functions

NumIOTiles()

Constructor.

NumIOTiles(int numIOTiles)

Constructor.

Parameters

numIOTiles – The number of IPU tiles dedicated to IO.

bool operator==(const int &rhs) const

Compare with int.

operator int() const

Auto convert to int.

NumIOTiles &operator=(const int &x)

Assign value using int.

struct TensorLocationSettings

Public Functions

TensorLocationSettings() = default

Constructor.

TensorLocationSettings(TensorLocation location_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)

Constructor.

Parameters
• location_ – The tensor location information.

• minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.

• minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.

TensorLocationSettings(TensorStorage storage_, int minElementsForOffChip_ = 2, int minElementsForReplicatedTensorSharding_ = 8192)

Constructor.

Parameters
• storage_ – The tensor storage information.

• minElementsForOffChip_ – The minimum number of elements below which offloading won’t be considered.

• minElementsForReplicatedTensorSharding_ – The minimum number of elements necessary for replicated tensor sharding.

Public Members

TensorLocation location = TensorLocation()

The default tensor location for this tensor type.

int minElementsForOffChip = 2

The minimum number of elements below which offloading won’t be considered.

int minElementsForReplicatedTensorSharding = 8192

A minimum number of elements below which replicated tensor sharding won’t be considered.
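The two thresholds can be read as simple size tests on a tensor. The sketch below assumes that a tensor whose element count equals a threshold is still considered (only counts strictly below it are excluded); LocationThresholdsSketch, offChipConsidered and shardingConsidered are hypothetical names and do not reproduce PopART's decision logic, which also depends on the TensorLocation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in holding the two thresholds with their defaults.
struct LocationThresholdsSketch {
  int minElementsForOffChip = 2;
  int minElementsForReplicatedTensorSharding = 8192;
};

// Offloading is not considered below minElementsForOffChip.
inline bool offChipConsidered(std::int64_t numElements,
                              const LocationThresholdsSketch &t) {
  return numElements >= t.minElementsForOffChip;
}

// Replicated tensor sharding is not considered below its threshold.
inline bool shardingConsidered(std::int64_t numElements,
                               const LocationThresholdsSketch &t) {
  return numElements >= t.minElementsForReplicatedTensorSharding;
}
```

With the default values, a scalar is never a candidate for offloading, and a tensor needs at least 8192 elements to be a candidate for replicated tensor sharding.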

#include <popart/variablesettings.hpp>

class VariableSettings

A class to dictate behaviour of variables and reductions of such across multiple graphs.

Public Functions

void verify()

Checks whether the VariableSettings are invalid, and throws an error if so.

const CommGroup getSharedVariableDomain() const
Returns

the CommGroup sharedVariableDomain of this VariableSettings.

ReplicaGrouping getReplicaGrouping(unsigned numReplicas) const
Parameters

numReplicas – The number of replicas in the IR this is used in.

Returns

the ReplicaGrouping domain of this VariableSettings.

bool isUsingCommGroup() const
Returns

whether the VariableSettings were initialised using a CommGroup or a stride.

CommGroupType getCommGroupType() const
Returns

the CommGroupType. The value of this is invalid if VariableSettings::isUsingCommGroup returns false.

unsigned getStride() const
Returns

the stride. The value of this is invalid if VariableSettings::isUsingCommGroup returns true.

unsigned getGroupSize() const
Returns

the replica group size.

inline VariableRetrievalMode getRetrievalMode() const
Returns

the VariableRetrievalMode retrievalMode of this VariableSettings.

VariableSettings()

“Default” constructor, defaults CommGroup to [All, 0] and retrievalMode to OnePerGroup.

VariableSettings(CommGroup sharedVariableDomain_)

Defaults VariableRetrievalMode to OnePerGroup.

VariableSettings(VariableRetrievalMode retrievalMode_)

Defaults CommGroup to [All, 0].

VariableSettings(CommGroup sharedVariableDomain_, VariableRetrievalMode retrievalMode_)

Entirely custom VariableSettings.

VariableSettings(unsigned stride, unsigned groupSize)
VariableSettings(unsigned stride, unsigned groupSize, VariableRetrievalMode retrievalMode)
unsigned numReplicasReturningVariable(unsigned replicaCount) const

Calculate the number of replicas that will return this variable.

Parameters

replicaCount – Number of global replicas.

Returns

Number of variables returned.

unsigned getGroupCount(unsigned replicaCount) const
Parameters

replicaCount – The replicationFactor of the graph.

Returns

The number of groups given the replicaFactor and the VariableSettings.

unsigned getStride(unsigned replicaCount) const
Parameters

replicaCount – The replicationFactor of the graph.

Returns

The stride between each member of a group.

unsigned getRealGroupSize(unsigned replicaCount) const

Because a CommGroup does not have a defined group size when its type is All or None, this function returns a group size that is always accurate, based on the number of replicas.

Parameters

replicaCount – The replication factor

Returns

The actual number of replicas in a group
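The idea can be sketched as follows, assuming that a group of type All spans every replica and that a group of type None contains exactly one replica. GroupTypeSketch, realGroupSizeSketch and groupCountSketch are hypothetical names; this is not the PopART implementation.

```cpp
#include <cassert>

// Hypothetical mirror of the four group types.
enum class GroupTypeSketch { All, Consecutive, Orthogonal, None };

// Group size that is meaningful for every group type.
// Assumption: All -> every replica, None -> one replica per group.
inline unsigned realGroupSizeSketch(GroupTypeSketch type,
                                    unsigned storedGroupSize,
                                    unsigned replicaCount) {
  switch (type) {
  case GroupTypeSketch::All:
    return replicaCount;
  case GroupTypeSketch::None:
    return 1;
  case GroupTypeSketch::Consecutive:
  case GroupTypeSketch::Orthogonal:
    return storedGroupSize;
  }
  return storedGroupSize; // not reached; keeps compilers quiet
}

// Number of groups, assuming the group size divides the replica count.
inline unsigned groupCountSketch(GroupTypeSketch type,
                                 unsigned storedGroupSize,
                                 unsigned replicaCount) {
  const unsigned size = realGroupSizeSketch(type, storedGroupSize, replicaCount);
  assert(size > 0 && replicaCount % size == 0);
  return replicaCount / size;
}
```

Under these assumptions, eight replicas form one group for All, eight groups for None, and two groups for a Consecutive or Orthogonal grouping with a group size of 4.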

unsigned getGroupRepresentative(unsigned group) const

Get the default first member of a group.

Parameters

group – The group to return the representative for.

Returns

The representative replica of this group.

Shape shapeOnReplica(Shape full_shape, unsigned replicaCount, const TensorId name) const

The shape that ONNX reads holds an extra outer dimension in certain cases, where the outer dimension represents the number of returning replica variables.

This function takes an ONNX full shape and safely removes the outer dimension (that is, it checks that the outer dimension matches the expected outer dimension). It is a convenience function to avoid duplicated code.

Parameters
• full_shape – The shape as presented by ONNX.

• replicaCount – The local replication factor, used to calculate the return factor.

• name – The TensorId of the tensor, used to give good error feedback.

Returns

The shape of the data on the replica.

Shape shapeOnHost(Shape replica_shape, unsigned replicaCount) const

Takes the shape of a tensor on a replica and returns its full ONNX shape.

This is the inverse operation to shapeOnReplica.

Parameters
• replica_shape – The shape of the data on a replica.

• replicaCount – The local replication factor, used to calculate the return factor.

Returns

The shape as presented by ONNX.
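The relationship between the two shapes can be sketched as a pair of inverse helpers. The sketch assumes that the outer dimension is only present when more than one replica returns the variable, and it takes that count directly as a parameter rather than deriving it from the retrieval mode. ShapeSketch, hostShapeSketch and replicaShapeSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

using ShapeSketch = std::vector<std::int64_t>;

// Replica shape -> host shape: prepend the number of returned variables.
// Assumption: no outer dimension is added when only one is returned.
inline ShapeSketch hostShapeSketch(const ShapeSketch &replicaShape,
                                   unsigned numReturned) {
  if (numReturned <= 1) {
    return replicaShape;
  }
  ShapeSketch host;
  host.reserve(replicaShape.size() + 1);
  host.push_back(static_cast<std::int64_t>(numReturned));
  host.insert(host.end(), replicaShape.begin(), replicaShape.end());
  return host;
}

// Host shape -> replica shape: check, then strip, the outer dimension.
inline ShapeSketch replicaShapeSketch(const ShapeSketch &fullShape,
                                      unsigned numReturned) {
  if (numReturned <= 1) {
    return fullShape;
  }
  if (fullShape.empty() ||
      fullShape.front() != static_cast<std::int64_t>(numReturned)) {
    throw std::invalid_argument("unexpected outer dimension");
  }
  return ShapeSketch(fullShape.begin() + 1, fullShape.end());
}
```

Applying hostShapeSketch and then replicaShapeSketch with the same count returns the original replica shape, which mirrors the inverse relationship stated above.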

std::vector<std::vector<std::int64_t>> groups(unsigned replicaCount) const

This function returns a set of vectors where each vector contains the replica IDs of all replicas that share a sharedVariableDomain, given the VariableSettings and the replicaCount.

Parameters

replicaCount – The local replication factor

Returns

A set of sets, such that set.at(a).at(b) is member number b of group a, set.size() is the number of groups, and set.at(a).size() is the size of group a.

bool operator==(const VariableSettings &other) const

Compare two variable-settings.

Parameters

other – VariableSettings to compare these settings to.

Returns

True if all internal elements are the same

bool operator!=(const VariableSettings &other) const

Compare two variable-settings.

Parameters

other – VariableSettings to compare these settings to.

Returns

False if all internal elements are the same

enum class popart::VariableRetrievalMode

Enum type that describes how to retrieve variables from the replicas.

Each replica is in a group defined by the VariableSettings::sharedVariableDomain. Replicas within a group have variables initialized with the same values.

Values:

enumerator OnePerGroup = 0

Returns one variable per group (defined by the VariableSettings::sharedVariableDomain CommGroup). The variable is automatically taken from the first replica of each group, where first means the one with the lowest replica ID.

enumerator AllReduceReplicas

As OnePerGroup, but performs an AllReduce among the replicas in the same group according to VariableSettings::sharedVariableDomain. Currently unsupported.

enumerator AllReplicas

Returns all replica Weights.

#include <popart/commgroup.hpp>

class CommGroup

Class to specify sub-groups of replicas.

Examples of derived sub-groups:

type == Consecutive && replicaGroupSize == 64/replica-size/N


where N is a power of two and replicaGroupSize > 1.

• Complete IPU-link domain / full rack:

type == Consecutive && replicaGroupSize == 64/replica-size


type == Orthogonal && replicaGroupSize == numberOfIpuLinkDomains


Public Functions

CommGroup()

Default CommGroup constructor.

Sets type to CommGroupType::All and replicaGroupSize to 0.

inline CommGroup(CommGroupType type, unsigned groupSize)

Construct CommGroup.

Parameters
• type – The replica group type.

• groupSize – The replica group size.

explicit CommGroup(const ReplicaGrouping &grouping)

Construct CommGroup from a ReplicaGrouping.

Parameters

grouping – The replica grouping.

ReplicaGrouping toReplicaGrouping(unsigned numReplicas) const

Convert this CommGroup to a ReplicaGrouping.

Parameters

numReplicas – The number of replicas with which to create the replica grouping.

Returns

The replica grouping.

bool operator==(const CommGroup &other) const
bool operator!=(const CommGroup &other) const

Public Members

CommGroupType type = CommGroupType::All

Replica group type.

unsigned replicaGroupSize = 0

Replica group size.

enum class popart::CommGroupType

PopART equivalent of GCL CommGroupType.

Each of these enumeration constants has a corresponding GCL CommGroupType value.

Values:

enumerator All = 0

All replicas viewed as one group, replica group size is ignored.

enumerator Consecutive

Groups are consecutive in replicas.

If there are N replicas denoted {0, ... N-1} and the group size is k, then there are N/k groups of size k as {0, 1, ... k-1}, {k, ... 2k-1} ... {N-k, ... N-1}.

enumerator Orthogonal

Groups are sliced orthogonal to the replica ordering.

If there are N replicas denoted {0, ... N-1} and the group size is k, then there are m = N/k groups of size k as {0, m, 2m, ...}, {1, m+1, 2m+1, ...} ... {m-1, 2m-1, ... N-1}.

enumerator None

Each replica is in its own group; the replica group size is ignored.

enumerator N

Number of values.
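The Consecutive and Orthogonal groupings described above can be enumerated with a short sketch. GroupsSketch, consecutiveGroups and orthogonalGroups are hypothetical helper names; the code assumes that the group size k divides the number of replicas N, and it is not the PopART or GCL implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using GroupsSketch = std::vector<std::vector<std::int64_t>>;

// Consecutive: {0, ... k-1}, {k, ... 2k-1}, ...
// Replica r belongs to group r / k.
inline GroupsSketch consecutiveGroups(unsigned numReplicas, unsigned groupSize) {
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  GroupsSketch groups(numReplicas / groupSize);
  for (unsigned r = 0; r < numReplicas; ++r) {
    groups[r / groupSize].push_back(r);
  }
  return groups;
}

// Orthogonal: with m = N / k groups, {0, m, 2m, ...}, {1, m+1, ...}, ...
// Replica r belongs to group r % m.
inline GroupsSketch orthogonalGroups(unsigned numReplicas, unsigned groupSize) {
  assert(groupSize > 0 && numReplicas % groupSize == 0);
  const unsigned m = numReplicas / groupSize;
  GroupsSketch groups(m);
  for (unsigned r = 0; r < numReplicas; ++r) {
    groups[r % m].push_back(r);
  }
  return groups;
}
```

For N = 8 and k = 4, the consecutive groups are {0, 1, 2, 3} and {4, 5, 6, 7}, while the orthogonal groups are {0, 2, 4, 6} and {1, 3, 5, 7}.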

## 13.2. Data input and output (IStepIO)

#include <popart/istepio.hpp>

class IStepIO

An abstract base class through which input and output data is passed to a Session (see Session::run).

Data is passed via buffers. In the case of buffers returned by IStepIO::in, PopART reads from these buffers. In the case of IStepIO::out, PopART writes to these buffers. The IStepIO::inComplete() and IStepIO::outComplete() functions are called by PopART to signal it is done with an input or output buffer.

An IStepIO implementation should conceptually implement a rolling queue of active buffers for each input and output tensor. Every successful call to IStepIO::in should yield a new data buffer for PopART to read from and add it to the head of the conceptual queue. Conversely, every call to IStepIO::inComplete() should be taken to mean that the buffer at the tail-end of the queue is no longer being used by PopART. This buffer is removed from the conceptual queue.

Note that an IStepIO::in call with the prefetch flag set is only considered successful when it returns data.

Output works analogously to input.

The expected total number of input (or output) buffers that are ‘completed’ for a tensor in one Session::run call is bps $$\times$$ SessionOptions::accumulationFactor $$\times$$ SessionOptions::replicatedGraphCount, where bps is the number of batches per call to Session::run (this is a value captured by the DataFlow instance passed to the Session instance).

Note, however, that there may be additional ‘incomplete’ calls to IStepIO::in and IStepIO::out.

Furthermore, the number of input (or output) buffers that may be ‘incomplete’ at a given time for a given tensor should not normally be more than SessionOptions::bufferingDepth $$\times$$ SessionOptions::replicatedGraphCount, but this bound is not guaranteed.
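The two quantities above are plain products and can be computed as in the following sketch. RunConfigSketch and the function names are hypothetical; the fields stand for the batches per step, SessionOptions::accumulationFactor, SessionOptions::replicatedGraphCount and SessionOptions::bufferingDepth values described in the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the values used in the formulas above.
struct RunConfigSketch {
  std::int64_t batchesPerStep;       // bps, captured by the DataFlow instance
  std::int64_t accumulationFactor;   // SessionOptions::accumulationFactor
  std::int64_t replicatedGraphCount; // SessionOptions::replicatedGraphCount
  std::int64_t bufferingDepth;       // SessionOptions::bufferingDepth
};

// Buffers expected to be completed per tensor in one Session::run call.
inline std::int64_t expectedCompletedBuffers(const RunConfigSketch &c) {
  return c.batchesPerStep * c.accumulationFactor * c.replicatedGraphCount;
}

// Usual (not guaranteed) upper bound on simultaneously incomplete buffers.
inline std::int64_t typicalMaxIncompleteBuffers(const RunConfigSketch &c) {
  return c.bufferingDepth * c.replicatedGraphCount;
}
```

For instance, 3 batches per step, an accumulation factor of 2 and a single replica give 6 completed buffers per tensor; with a buffering depth of 3, at most 3 buffers would normally be incomplete at once.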

EXAMPLE: Suppose a session is configured such that the total expected number of input buffers is 6 and these are input buffers for a tensor with ID t with 100 elements. The associated input calls in IStepIO may look like this if SessionOptions::bufferingDepth is 3:

in("t", 100, false) -> Give buffer[0] to PopART.
in("t", 100, true) -> Give buffer[1] to PopART.
in("t", 100, true) -> Give buffer[2] to PopART.
inComplete("t", 100) -> buffer[0] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[3] to PopART.
inComplete("t", 100) -> buffer[1] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[4] to PopART.
inComplete("t", 100) -> buffer[2] is no longer required and can be reused.
in("t", 100, true) -> Give buffer[5] to PopART.
inComplete("t", 100) -> buffer[3] is no longer required and can be reused.
in("t", 100, true) -> No data available, return nullptr.
inComplete("t", 100) -> buffer[4] is no longer required and can be reused.
inComplete("t", 100) -> buffer[5] is no longer required and can be reused.
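The rolling queue in the trace above can be modelled with a small stand-alone class. This is a sketch for a single input tensor only: RollingInputQueueSketch is a hypothetical class that does not derive from IStepIO, uses raw float pointers instead of ConstVoidData, and ignores the tensor ID, element count and prefetch arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical model of the conceptual buffer queue for one input tensor.
class RollingInputQueueSketch {
public:
  explicit RollingInputQueueSketch(std::vector<std::vector<float>> buffers)
      : buffers_(std::move(buffers)) {}

  // Analogue of IStepIO::in: hand out the next buffer and add it to the
  // head of the queue, or return nullptr when no data is available.
  const float *in() {
    if (next_ == buffers_.size()) {
      return nullptr;
    }
    active_.push_back(next_);
    return buffers_[next_++].data();
  }

  // Analogue of IStepIO::inComplete: the buffer at the tail of the queue
  // is no longer in use. Returns the index of the released buffer.
  std::size_t inComplete() {
    assert(!active_.empty());
    const std::size_t released = active_.front();
    active_.pop_front();
    return released;
  }

  // Number of buffers handed out but not yet completed.
  std::size_t numActive() const { return active_.size(); }

private:
  std::vector<std::vector<float>> buffers_;
  std::deque<std::size_t> active_; // indices of buffers in flight
  std::size_t next_ = 0;           // next buffer to hand out
};
```

Replaying the trace with six buffers releases them in the order 0 to 5, and the seventh request returns nullptr, which is only acceptable for a prefetch request.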


Public Functions

virtual ~IStepIO() = default

Destructor for IStepIO.

virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch, const bool isBroadcast = false) = 0

Request a new input data buffer.

The memory in this buffer is available for use in PopART until the corresponding inComplete() call.

Note

Failing to provide a valid data buffer will result in a runtime failure if prefetch is set to false.

Parameters
• id – The ID of the tensor to return data for.

• numElements – The number of elements in the tensor.

• prefetch – If set to true the inability to provide data is not considered an error. If false, it is considered an error if no data can be provided.

Returns

The input buffer for this tensor (or nullptr on failure) returned as a ConstVoidData object.

virtual void inComplete(TensorId id, int64_t numElements, const bool isBroadcast = false) = 0

Notify the user (running a PopART program) that a previously retrieved input data buffer is no longer used by PopART.

Parameters
• id – The ID of the tensor to return data for.

• numElements – The number of elements in the tensor.

virtual MutableVoidData out(TensorId id, int64_t numElements) = 0

Request a new output data buffer.

The memory in this buffer is available for use in PopART until the corresponding outComplete() call and will be modified in-place.

Note

Failing to provide a valid data buffer will result in a runtime failure.

Parameters
• id – The ID of the tensor to return data for.

• numElements – The number of elements in the tensor.

Returns

The output buffer for this tensor returned as a MutableVoidData object.

inline virtual void outComplete(TensorId)

Notify the user (running a PopART program) that a previously retrieved output data buffer is no longer used by PopART.

Parameters
• id – The ID of the tensor to return data for.

• numElements – The number of elements in the tensor.

inline void enableRuntimeAsserts(bool b)

Enable or disable runtime asserts.

If runtime asserts are enabled, then a check that the input and output buffers have the correct number of elements is performed. As Session.run() is called multiple times during a user’s session, the check is only performed in the first call to Session.run(), under the assumption that the user is unlikely to change the size of buffers between runs.

Parameters

b – The setting to enable runtime asserts (true) or disable runtime asserts (false).

inline bool runtimeAssertsEnabled() const

Check if runtime asserts are enabled.

Returns

true if runtime asserts are enabled, otherwise false.

virtual void assertNumElements(const popx::Executablex&) const = 0

Check number of elements.

This check is performed when runtimeAssertsEnabled() is true.

Parameters

Executablex – The executable used to check that the input and output buffers have the correct number of elements.

#include <popart/stepio.hpp>

class StepIO : public popart::StepIOGeneric<IArray, StepIONS::IArrayAccessor, IArray&>

Class to provide a Session object with input and output data.

Public Functions

inline StepIO(std::map<TensorId, IArray&> inputs, std::map<TensorId, IArray&> outputs)

Constructor for StepIO.

Parameters
• inputs – The input data.

• outputs – The output data.

class StepIOCallback : public popart::IStepIO

Class that implements the IStepIO interface using user-provided callback functions.

The IStepIO interface contains a number of pure virtual member functions through which PopART receives buffers to read data from and buffers to write data to. StepIOCallback inherits from IStepIO and implements those member functions by delegating the logic to the callback functions passed in the constructor. This gives the user full control as to how data buffers are provisioned.

See IStepIO for more details on the expected behaviour of the callbacks.

Public Types

using InputCallback = std::function<ConstVoidData(TensorId, bool)>

Callable object that implements IStepIO::in().

using InputCompleteCallback = std::function<void(TensorId)>

Callable object that implements IStepIO::inComplete().

using OutputCallback = std::function<MutableVoidData(TensorId)>

Callable object that implements IStepIO::out().

using OutputCompleteCallback = std::function<void(TensorId)>

Callable object that implements IStepIO::outComplete().

Public Functions

inline StepIOCallback(InputCallback inputCallback, InputCompleteCallback inputCompleteCallback, OutputCallback outputCallback, OutputCompleteCallback outputCompleteCallback)

Construct a StepIOCallback object.

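Parameters
• inputCallback – The callable object that implements IStepIO::in().

• inputCompleteCallback – The callable object that implements IStepIO::inComplete().

• outputCallback – The callable object that implements IStepIO::out().

• outputCompleteCallback – The callable object that implements IStepIO::outComplete().

inline virtual void assertNumElements(const popx::Executablex&) const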

Check number of elements.

This check is performed when IStepIO::runtimeAssertsEnabled() is true.

Parameters

Executablex – The executable used to check that the input and output buffers have the correct number of elements.

virtual ConstVoidData in(TensorId id, int64_t numElements, bool prefetch, bool) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCallback parameter passed to the constructor.

This function should not be called directly.

virtual void inComplete(TensorId id, int64_t numElements, bool) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the inputCompleteCallback parameter passed to the constructor.

This function should not be called directly.

virtual MutableVoidData out(TensorId id, int64_t numElements) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCallback parameter passed to the constructor.

This function should not be called directly.

virtual void outComplete(TensorId id) final

This function is called by PopART when a StepIOCallback instance is passed to Session::run() and will internally call the outputCompleteCallback parameter passed to the constructor.

This function should not be called directly.

class IWeightsIO

A virtual class for accessing pointers to the data required to perform a training step.

Subclassed by popart::WeightsIO

Public Functions

virtual ~IWeightsIO() = default

Destructor for IWeightsIO.

virtual bool contains(TensorId) const = 0

Check if the WeightsIO instance contains the weights for a specific tensor.

Parameters

TensorId – The ID of the tensor to look for weights for.

Returns

true if the WeightsIO instance contains weights for the tensor, false otherwise.

virtual MutableVoidData weight(TensorId) const = 0

Retrieve weights for a specific tensor.

Parameters

TensorId – The ID of the tensor to retrieve weights for.

Returns

The weights.

class WeightsIO : public popart::IWeightsIO

Class representing weights.

Public Functions

~WeightsIO() override = default

Destructor for WeightsIO.

virtual bool contains(TensorId) const final

Check if the WeightsIO instance contains the weights for a specific tensor.

Parameters

TensorId – The ID of the tensor to look for weights for.

Returns

true if the WeightsIO instance contains weights for the tensor, false otherwise.

virtual MutableVoidData weight(TensorId) const final

Retrieve weights for a specific tensor from the WeightsIO object.

Parameters

TensorId – The ID of the tensor to retrieve weights for.

Returns

The weights.

void insert(TensorId, MutableVoidData)

Insert weights for a specific tensor into the WeightsIO object.

Parameters
• TensorId – The ID of the tensor to insert weights for.

• MutableVoidData – The weights to insert.

struct IArrayAccessor

Structure to help with accessing the data in IArray objects.

Public Static Functions

static inline void *getDataPointer(IArray &array)

Get pointer to the data.

Parameters

array – The IArray object.

Returns

A pointer to the data contained in the IArray object.

static inline size_t getArraySize(const IArray &array)

Get the number of data elements.

Parameters

array – The IArray object.

Returns

The number of data elements.

static inline DataType getArrayDataType(IArray &array)

Get the data type of the data.

Parameters

array – The IArray object.

Returns

The data type of the data.

static inline size_t getArrayRank(IArray &array)

Get the rank of the data array.

Parameters

array – The IArray object.

Returns

The rank of the data array.

static inline int64_t getArrayDim(IArray &array, size_t index)

Get the size of the data array along a specific dimension.

Parameters
• array – The IArray object.

• index – The index of the dimension in the IArray object.

Returns

The size of the data array along the specified dimension.

#include <popart/stepio_generic.hpp>

template<typename ARRAY_TYPE, typename ACCESSOR_TYPE, typename ArrayInfoT>
class StepIOGeneric : public popart::IStepIO

Subclassed by popart::StepIO

Public Functions

inline void assertNumElements(const popx::Executablex &exe) const final
inline TensorInfo getTensorInfo(ARRAY_TYPE &array) const
template<typename T>
inline T get(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, bool advance_, std::string mapName)
template<typename T>
inline void advance(TensorId id, std::map<TensorId, ArrayInfo> &M, int64_t numElements, std::string mapName)
inline ConstVoidData in(TensorId id, int64_t numElements, bool, bool) final
inline void inComplete(TensorId id, int64_t numElements, bool) final
inline MutableVoidData out(TensorId id, int64_t numElements) final
struct ArrayInfo

Public Members

ArrayInfoT array
int64_t offset
#include <popart/iarray.hpp>

class IArray

Subclassed by popart::NDArrayWrapper< T >

Public Functions

inline virtual ~IArray()
virtual void *data() = 0
virtual DataType dataType() const = 0
virtual std::size_t rank() const = 0
virtual int64_t dim(size_t index) const = 0
virtual std::size_t nelms() const = 0
virtual const Shape shape() const = 0

## 13.3. Tensors

#include <popart/tensor.hpp>

class Tensor : public popart::Vertex

Public Functions

Tensor(TensorId, TensorType, Graph&, const DebugContext& = {})
Tensor(TensorId, VariableSettings, Graph&, const DebugContext& = {})
Tensor(TensorId, TensorType, VariableSettings, Graph&, const DebugContext& = {})
inline std::string str() const final
virtual std::unique_ptr<Tensor> clone(Graph &graph_) const
TensorType tensorType() const
std::string tensor_type() const
void setTensorType(TensorType)
inline ReplicatedStreamMode getReplicatedStreamMode() const
inline void setReplicatedStreamMode(const ReplicatedStreamMode &mode)
void setTensorLocationInfo(TensorLocation&, std::pair<RemoteBufferId, RemoteBufferIndex> &remoteBufferInfo)
std::set<PipelineStage> getPipelineStages() const
Op *getProducerUnsafe() const
Op *getProducer() const
void setProducer(Op*)
void resetProducer(Op*)
bool hasProducer() const
bool isGraphInput() const
InIndex getGraphInputIndex() const
bool isGraphOutput() const
OutIndex getGraphOutputIndex() const
bool isLoopInput() const
bool isImplicitLoopInput() const
bool isExplicitLoopInput() const
bool isLoopTripCounter() const
bool isUnmodifiable() const
bool isCheckpointTensor() const
bool isImplicitRecomputeTensor() const
bool isRestoreInplaceTensor() const
bool idIncludesPrefix(const std::vector<std::string>&) const
bool isOptimizerTensor() const
bool isRemoteArgTensor() const
bool isRandomSeedTensor() const
bool isOptimizerStateTensor() const
bool isAccumulatorTensor() const

Is this tensor produced by a HostLoad Op or MultiExchangeOp with HostLoad descriptor?

Returns

true if the producer is a HostLoad Op or MultiExchangeOp with HostLoad descriptor, false otherwise.

bool isWeightTensor() const
bool isAnchored() const
bool isRootAnchor() const
bool hasTensorData() const
TensorData *tensorData()
const TensorData *tensorData() const
bool anyAlias(std::function<bool(Tensor*)> predicate) const
void setTensorDataFromCopyOf(const void *src, std::size_t size)
void setTensorDataFromViewOf(void *src, std::size_t size)
void setTensorDataByEmplaceOf(std::vector<char> &&data)
void setTensorData(const TensorData &td)
void setTensorData(TensorData &&td)
std::vector<Op*> associatedOps() const
inline Graph &getGraph()
inline const Graph &getGraph() const
Ir &getIr()
const Ir &getIr() const
bool hasVirtualGraphId() const
VGraphId getVirtualGraphId() const
VGraphId getVirtualGraphIdUnsafe() const
VGraphIdAndTileSet getVirtualGraphIdAndTileSet(std::set<OpId> &visited) const
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe() const
VGraphIdAndTileSet getVirtualGraphIdAndTileSetUnsafe(std::set<OpId> &visited) const
int getBatchAxis() const
bool consumersAllPreLoss() const
bool isModified(bool considerLoopInput = true) const

Check if any of the consumers modify this tensor.

Parameters

considerLoopInput – If explicit loop inputs should be considered as being modified. If false, only operations modifying the tensor inplace will be considered.

Returns

True if the tensor is modified, otherwise false.

bool isAliased() const

Check if any of the consumers alias this tensor.

Returns

True if the tensor is aliased to any output, otherwise false.

view::Regions modifiedRegionsByOps(std::vector<Op*> ops, Aliases &aliases) const
view::Regions modifiedRegionsByOps(std::vector<OpId> opIds, Aliases &aliases) const
std::set<Op*, POpCmp> getInplaceModifiers() const

Find operations that modify a tensor.

Returns

All operations that (directly and indirectly) modify this tensor.

std::vector<char> getDataViaGraphTraversal() const
inline const popart::DebugInfo &getDebugInfo() const
inline void setVariableUpdateType(VariableUpdateType type)

Members of the old VariableTensor subclass (class VariableTensor : public Tensor).

inline VariableUpdateType getVariableUpdateType() const
inline void setCopyFromTensor(TensorId value)
inline TensorId getCopyFromTensor()
inline VariableSettings getVariableSettings() const
Returns

The VariableSettings of this Variable

std::vector<int64_t> returnedShape(unsigned replicationFactor)

Returns the shape necessitated by IO.

Parameters

replicationFactor – The replication factor

Returns

the shape of the tensor, considering replica groups

void verifyMutableVoidInfo(const TensorInfo mutableVoidInfo, unsigned replicationFactor)

Check that the info of a mutableVoidData object matches the expectations set by the TensorInfo and VariableSettings.

Throws an error if there is a mismatch.

Parameters
• mutableVoidInfo – The data of the MutableVoidInfo with the same id as this tensor

• replicationFactor – The replicationFactor of this instance

void setPreparedVGraphIdAndTileSet()

Set the preparedVGraphIdAndTileSet.

Public Members

TensorId id
Consumers consumers
TensorInfo info
TensorLocationInfo tensorLocationInfo
InputSettings inputSettings
enum class popart::TensorType

Values:

enumerator Const
enumerator Stream
enumerator Unknown
enumerator Variable
enumerator N
enum class popart::VariableUpdateType

Values:

enumerator None = 0
enumerator Copy
#include <popart/tensorinfo.hpp>

enum class popart::DataType

There is a one-to-one correspondence between popart::DataTypes and ONNX_NAMESPACE::TensorProto_DataTypes, which is equivalent to decltype(ONNX_NAMESPACE::TensorProto().data_type()).

Values:

enumerator UINT8 = 0
enumerator INT8
enumerator FLOAT8_143
enumerator FLOAT8_152
enumerator UINT16
enumerator INT16
enumerator INT32
enumerator INT64
enumerator UINT32
enumerator UINT64
enumerator BOOL
enumerator FLOAT
enumerator FLOAT16
enumerator BFLOAT16
enumerator DOUBLE
enumerator COMPLEX64
enumerator COMPLEX128
enumerator STRING
enumerator UNDEFINED
class DataTypeInfo

Public Functions

DataTypeInfo(DataType type__, int nbytes__, bool isFixedPoint__, std::string name__, std::string lcasename__)
DataType type() const
const int &nbytes() const
const std::string &name() const
const std::string &lcasename() const
bool isFixedPoint() const
class TensorInfo

Public Functions

TensorInfo(DataType, const Shape&)

Create a TensorInfo from a data type and shape.

Parameters
• data_type – The data type.

• shape – The actual shape of the tensor.

TensorInfo(DataType data_type, const Shape &shape, const Shape &meta_shape)

Create a TensorInfo from a data type, shape and meta shape.

Parameters
• data_type – The data type.

• shape – The actual shape of the tensor.

• meta_shape – The meta shape of the tensor, which can, for example, be used to store the original tensor shape before replicated tensor sharding was applied.

TensorInfo(std::string data_type, std::string shape)
TensorInfo(std::string data_type, const Shape&)
explicit TensorInfo(const ONNX_NAMESPACE::TensorProto&)
explicit TensorInfo(const ONNX_NAMESPACE::TypeProto&)
void set(const ONNX_NAMESPACE::TensorProto&)
void set(const ONNX_NAMESPACE::TypeProto&)
TensorInfo() = default
void set(DataType)
void set(DataType, const Shape&)
void set(DataType, const Shape&, const Shape&)
const Shape &shape() const
const Shape &metaShape() const
std::vector<size_t> shape_szt() const
inline Rank rank() const
inline int64_t nelms() const
int64_t nbytes() const
inline int64_t dim(int i) const
inline std::vector<int> strides(const std::vector<long> &shape)

Get the strides of the tensor, that is the number of bytes to step in each dimension when traversing an array in memory.

Parameters

shape – The on-host ONNX shape of a tensor. This is different from this->shape(), which gives the on-replica shape of a tensor

Returns

std::vector<int> The strides vector.
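As a rough illustration of what byte strides mean, the following self-contained sketch computes them for a contiguous, row-major array. The function name rowMajorByteStrides and its elemBytes parameter are hypothetical and not part of PopART; the sketch assumes a dense row-major layout, and the real strides() may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a PopART function): byte strides of a
// contiguous, row-major array with the given shape and element size.
std::vector<int64_t> rowMajorByteStrides(const std::vector<int64_t> &shape,
                                         int64_t elemBytes) {
  assert(elemBytes > 0);
  std::vector<int64_t> strides(shape.size());
  int64_t step = elemBytes; // the innermost dimension steps one element
  for (std::size_t i = shape.size(); i-- > 0;) {
    strides[i] = step;
    step *= shape[i]; // the next outer dimension skips a whole inner block
  }
  return strides;
}
```

For a shape of {2, 3, 4} with 4-byte elements this gives {48, 16, 4}: moving one step along the outermost dimension skips 3 * 4 elements of 4 bytes each.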

DataType dataType() const
const std::string &data_type() const
const std::string &data_type_lcase() const
void append(std::ostream&) const
bool isSet() const
bool operator==(const TensorInfo&) const
bool operator!=(const TensorInfo&) const
Shape shapeFromString(const std::string &s) const
ONNX_NAMESPACE::TypeProto getOnnxTypeProto() const
const DataTypeInfo *getDataTypeInfo() const

Public Static Functions

static std::string npOutDataTypeExceptionMessage(const TensorInfo &i0, const TensorInfo &i1, const std::string &debugName)
#include <popart/tensorindex.hpp>

class TensorIndexMap

Public Functions

TensorIndexMap() = default
~TensorIndexMap()
void insert(int, Tensor*)
void reset(int, Tensor*)
void erase(int)
void clear()
bool contains(Tensor*) const
Tensor *tensor(int)
const Tensor *tensor(int) const
TensorId id(int) const
bool hasIndex(int) const
const std::vector<int> &indices(Tensor*) const
const std::map<Tensor*, std::vector<int>, PTensorCmp> &indicesMap() const
const std::map<int, Tensor*> &tensorMap() const
const std::vector<Tensor*> tensors() const
std::map<int, TensorId> tensorIdMap() const
int n() const
void append(std::stringstream&, std::string prefix, int max_id_length) const
void setInfoIfIndex(const TensorInfo&, int index)
std::vector<TensorId> getSerialised() const
int maxIdLength() const
std::map<int, Shape> getIndexShapeMap()
int minIndex() const
int maxIndex() const
#include <popart/tensorlocation.hpp>

enum class popart::ReplicatedTensorSharding

Enum type to specify whether to shard tensors over replicas.

Values:

enumerator Off = 0

Don’t shard tensors over replicas.

enumerator On = 1

Do shard tensors over replicas.

enumerator N = 2

Number of values.

class TensorLocation

Class that describes the memory characteristics of one or multiple tensors.

Public Functions

TensorLocation()

Equivalent to calling TensorLocation(TensorStorage::Undefined, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)

TensorLocation(TensorStorage storage)

Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, ReplicatedTensorSharding::Off)

TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding)

Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding)

TensorLocation(TensorStorage storage, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)

Equivalent to calling TensorLocation(storage, TileSet::Compute, TileSet::Compute, replicatedTensorSharding, shardingDomain)

TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding)

Construct a TensorLocation from parameters.

Parameters
• storage – The memory location of the tensor(s).

• loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.

• storageTileSet – The tiles on which the tensor(s) are stored.

• replicatedTensorSharding – Whether to apply replicated tensor sharding.

TensorLocation(TensorStorage storage, TileSet loadTileSet, TileSet storageTileSet, ReplicatedTensorSharding replicatedTensorSharding, CommGroup shardingDomain)

Construct a TensorLocation from parameters.

Parameters
• storage – The memory location of the tensor(s).

• loadTileSet – The tiles through which the tensor(s) are loaded onto the chip.

• storageTileSet – The tiles on which the tensor(s) are stored.

• replicatedTensorSharding – Whether to apply replicated tensor sharding.

• shardingDomain – GCL communication group across which to shard the tensor. Perpendicular replicas will not shard, and reduce gradients normally (via AllReduce). Defaults to sharding across all replicas.

TensorLocation(std::vector<int64_t> serialized)
bool operator==(const TensorLocation &rhs) const
bool operator!=(const TensorLocation &rhs) const
std::vector<int64_t> serialize() const
bool isRemote() const

Public Members

TensorStorage storage

The memory location of the tensor(s).

TileSet loadTileSet

The tiles through which the tensor(s) are loaded onto the chip.

TileSet storageTileSet

The tiles on which the tensor(s) are stored.

ReplicatedTensorSharding replicatedTensorSharding

Whether to apply replicated tensor sharding (RTS) or not.

CommGroup shardingDomain

The GCL comm groups across which to shard the tensor.

enum class popart::TensorStorage

Enum type that determines where a tensor is stored.

Values:

enumerator OnChip = 0

Store the tensor in on-chip memory.

enumerator OffChip = 1

Store the tensor in streaming memory.

enumerator N = 2

Number of values.

enum class popart::TileSet

Enum type to specify a set of tiles.

Values:

enumerator Compute = 0

The set of tiles designated for compute operations.

enumerator IO = 1

The set of tiles designated for IO operations.

enumerator Undefined = 2

Undefined (no) tile set.

enumerator N = 3

Number of values.

## 13.4. Optimizers

#include <popart/optimizer.hpp>

class Optimizer

Interface for describing an Optimizer and, internally, how to grow the optimiser step for each weight.

• The end-user facing interface constructed by the user to describe what kind of optimiser to use.

• Then also used internally by the Ir to grow the optimiser step for each weight.

• Stores OptimizerValues for optimizer parameters like learning rate, loss scaling, etc.

See also OptimizerValue.

• Optimizer stores the values for each weight - they can have different values. There is a “default” for all weights, then you can specify specific values for specific weights. This is encapsulated by an OptimizerValueMap, which is a sparse map from weight to value, with unspecified values implying the default.

See also OptimizerValueMap.

• At runtime, the user can dynamically update the Optimizer, for example by setting new OptimizerValues. validReplacement determines whether the new Optimizer is interchangeable with the one the Ir was built for. For example, trying to replace an SGD Optimizer with an Adam Optimizer would throw.

Public Functions

virtual ~Optimizer() = default

• Optimizer class has a two-part initialisation. The ctor, used by the end-user, and setFactorsFromOptions called by the Ir to finish initialisation once we have all the relevant information during Ir preparation.

• Some key methods used by the Ir to grow optimiser step for each weight are createOp, getInputIds, optimizerInputs.

• If the OptimizerValue is const, no Ir tensor for that value is created and the VarUpdateOp created for that weight will not have the optional input for that tensor. The Opx of the VarUpdateOp will emit poplar code that uses the provided value directly.

If the OptimizerValue is not const, an Ir tensor for that value is created and the VarUpdateOp created for that weight will have the optional input for that tensor. The tensor will be a stream tensor, so that it can be updated later from host. The tensor will be streamed an initial value of the OptimizerValue’s value.

• It is common for Optimizer implementations to make use of “compound scalars”. Take for example the SGD0 weight update equation:

w <- w * (1 - lr * (1 - dm) * wd) - g * (lr * (1 - dm) / ls)

where w is the weights and g is the grads. lr, dm, wd, ls are all the “atomic scalars”. These are the scalars/hyperparameters of the Optimizer that the user can set using OptimizerValues, as described above.

Multiple atomic scalars appear in expressions together, and will be operated on together before being used by an Op that also consumes a tensor (in this case the weights or grads). For SGD0, they can be grouped as follows:

w <- w * {1 -  lr * (1 - dm) * wd} -  g * { lr * (1 - dm) / ls }
         ^^^^^^^^^^^^^^^^^^^^^^^^^        ~~~~~~~~~~~~~~~~~~~~~~
                     |                               |
        weight decay scale factor 0                  |
                                          scaled learning rate 0


We call wdsf0 and slr0 the “compound scalars”.

We can statically precompute the OptimizerValues for these compound scalars using the OptimizerValues of the atomic scalars. This makes the Ir simpler, as we now have only:

w <- w * wdsf0 - g * slr0


The CompoundScalarHelpers are used to precompute the compound scalar values.

If any of the composite atomic scalars are non-const, the compound scalar is non-const.

See also compoundscalarhelper.hpp.
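The precomputation of compound scalars described above can be sketched as follows. This is a minimal scalar illustration, not the CompoundScalarHelper implementation: the struct and function names (Sgd0Atomic, Sgd0Compound, precompute, sgd0Step) are hypothetical, and only the two SGD0 compound scalars from the text are modelled.

```cpp
#include <cassert>

// Atomic scalars of SGD0, named as in the text above.
struct Sgd0Atomic {
  float lr; // learning rate
  float dm; // dampening
  float wd; // weight decay
  float ls; // loss scaling
};

// Compound scalars precomputed from the atomic ones.
struct Sgd0Compound {
  float wdsf0; // weight decay scale factor 0
  float slr0;  // scaled learning rate 0
};

// Hypothetical helper, illustrative only.
Sgd0Compound precompute(const Sgd0Atomic &a) {
  assert(a.ls != 0.0f); // loss scaling is used as a divisor
  return {1.0f - a.lr * (1.0f - a.dm) * a.wd,
          a.lr * (1.0f - a.dm) / a.ls};
}

// The simplified update: w <- w * wdsf0 - g * slr0
float sgd0Step(float w, float g, const Sgd0Compound &c) {
  return w * c.wdsf0 - g * c.slr0;
}
```

Because the compound scalars depend only on the atomic scalars, they can be computed once on the host whenever the atomic values change, rather than per weight element.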

Optimizer(OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings, const DebugContext &debugContext)
Optimizer(const Optimizer&) = default
virtual void validReplacement(const Optimizer &other) const
virtual OptimizerType type() const = 0
virtual std::string type_s() const = 0
virtual std::unique_ptr<Optimizer> clone() const = 0
virtual void resetTensorData(Tensor&) const = 0
virtual void setTensorData(Tensor&) const = 0
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const = 0
virtual std::vector<TensorId> getInputIds(const Tensor &weight) const = 0

Returns the TensorIds of the input tensors to the VarUpdateOp this optimiser will create for the given weight.

Specifically, the TensorId at index i will be the id of the input tensor at InIndex i of the VarUpdateOp. If the input is an OptimizerValue and it is const, then “” will be returned; otherwise the relevant reserved prefix for that OptimizerValue will be used, followed by the weight id. The prefixes are defined in tensornames.hpp, for example reservedDefaultWeightDecayScaleFactor0Prefix or reservedSpecificScaledLearningRate1Prefix (note there are different prefixes depending on whether the weight has a specific or default value for that OptimizerValue).

virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const = 0
inline const OptimizerValue &lossScaling() const
inline float getLossScalingVal() const
float getFinalLossScalingVal() const
virtual TensorId getInverseLossScalingTensorId(const Tensor &weight) const = 0
virtual void setFactorsFromOptions(const SessionOptions&)
bool meanReductionEnabled() const
bool postMeanAccumulationEnabled() const
bool postMeanReplicationEnabled() const
int64_t getReplicatedGraphCount() const
int64_t getAccumulationFactor() const
inline const std::vector<ClipNormSettings> &getClipNormSettings() const
virtual bool hasSpecific(const Tensor &w) const = 0
virtual bool hasSpecific() const = 0
virtual size_t hash() const
inline DebugContext getDebugContext() const

Public Static Functions

static TensorId getLossScalingTensorId(DataType)
enum class popart::OptimizerType

Types of optimizers.

Values:

enumerator SGD = 0
enumerator NTYPES
enum class popart::OptimizerReductionType

Reduction mode when doing data-parallel training over replicated graphs.

Depending on the optimizer used and its configuration, this option describes how the reduction of gradients over replicas will occur. For example, directly on the gradient, on the gradient accumulator, or on the momentum. See the documentation of individual optimizers for more information.

Values:

enumerator None = 0

No replicated graph reduction.

enumerator AcclReduce

Momentum reduction (SGD1, after the gradient accumulation loop, if applicable)

enumerator AccumReduce

enum class popart::WeightDecayMode

Values:

enumerator Decay

enumerator L2Regularization

#include <popart/optimizervalue.hpp>

class OptimizerValue

A class used to represent values of hyper parameters.

Public Functions

OptimizerValue() = default

Equivalent to OptimizerValue(0, false).

inline OptimizerValue(float v)

Equivalent to OptimizerValue(v, true).

inline OptimizerValue(float v, bool c)

Constructor.

Parameters
• v – The current value of the hyper parameter.

• c – A boolean flag to indicate whether the parameter will remain at this value forever (true) or may change over time (false).

inline OptimizerValue(std::pair<float, bool> x)
inline float val() const
inline bool isConst() const
void validReplacement(const OptimizerValue &rhs) const
bool operator==(const OptimizerValue &rhs) const
#include <popart/optimizervaluemap.hpp>

class OptimizerValueMap

Public Functions

inline OptimizerValueMap(OptimizerValue g)
OptimizerValue get(const TensorId &id) const
void insertSpecific(const TensorId&, OptimizerValue)
inline bool hasSpecific(const TensorId &id) const
inline bool hasSpecific() const
inline OptimizerValue getDefault() const
void validReplacement(const OptimizerValueMap &rhs) const
inline const std::map<TensorId, OptimizerValue> &getSpecifics() const
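The "default plus weight-specific overrides" behaviour of such a map can be illustrated with a small standalone model. The class below, SparseValueMap, is hypothetical and stores plain floats keyed by strings; it is not the PopART OptimizerValueMap, which stores OptimizerValue objects keyed by TensorId.

```cpp
#include <cassert>
#include <map>
#include <string>

// Illustrative model only: not the PopART OptimizerValueMap implementation.
class SparseValueMap {
public:
  explicit SparseValueMap(float defaultValue) : defaultValue_(defaultValue) {}

  // Record a weight-specific value, overriding the default for that weight.
  void insertSpecific(const std::string &weightId, float value) {
    assert(!weightId.empty());
    specifics_[weightId] = value;
  }

  // True if a weight-specific value was inserted for this weight.
  bool hasSpecific(const std::string &weightId) const {
    return specifics_.count(weightId) != 0;
  }

  // The weight-specific value if present, otherwise the default.
  float get(const std::string &weightId) const {
    auto it = specifics_.find(weightId);
    return it == specifics_.end() ? defaultValue_ : it->second;
  }

private:
  float defaultValue_;
  std::map<std::string, float> specifics_;
};
```

Only weights that deviate from the default take up an entry, which is why the map is described as sparse.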

### 13.4.1. Stochastic Gradient Descent (SGD)

#include <popart/clipnormsettings.hpp>

class ClipNormSettings

A data structure used to represent a maximum value constraint on one or more weights.

This is passed to the optimizer on construction.

Public Types

enum class Mode

Values:

enumerator ClipSpecifiedWeights
enumerator ClipAllWeights

Public Functions

ClipNormSettings(const std::vector<TensorId> &weightIds_, float maxNorm_)

DEPRECATED: This will be removed in a future release.

Constructor.

Parameters
• weightIds_ – The weight tensor IDs that this constraint applies to.

• maxNorm_ – The maximum permissible value.

const std::vector<TensorId> &getWeightIds() const
float getMaxNorm() const
Mode getMode() const
bool operator==(const ClipNormSettings&) const
bool operator!=(const ClipNormSettings &other) const

Public Members

std::vector<TensorId> weightIds
float maxNorm

Public Static Functions

static ClipNormSettings clipWeights(const std::vector<TensorId> &weightIds_, float maxNorm_)
static ClipNormSettings clipAllWeights(float maxNorm_)
#include <popart/sgd.hpp>

class SGD : public popart::Optimizer

Like any optimizer implementation, this class is responsible for updating each weight tensor ( $$w$$) in the model using the gradient ( $$g$$) of the loss function with respect to the weight as calculated during the backwards pass.

The SGD optimizer has the following state for each weight:

• velocity ( $$v$$)

The SGD optimizer has the following hyper parameters:

• learning rate ( $$\text{lr}$$)

• momentum ( $$\text{mm}$$)

• weight decay ( $$\text{wd}$$)

• dampening ( $$\text{dm}$$)

• velocity scaling ( $$\text{vs}$$)

• loss scaling ( $$\text{ls}$$)

• nesterov

• clip norm settings

The values of these parameters can be shared between all weights but some can be overridden with weight-specific values (see SGD::insertSpecific). Hyper parameters are captured using OptimizerValue objects and therefore can be either a constant value or a non-constant value that can be adjusted by the user.

The following describes how this optimizer updates a weight using a gradient. In the context of this description, the gradient is the value of the gradient after any gradient accumulation has been performed and after the application of a loss scaling factor to the gradient has been corrected for.

When the optimizer needs to update a weight, $$w$$, using a gradient, $$g$$, it first updates the optimizer state as follows:

$v' := v * \text{mm} + (1 - \text{dm}) * (g + \text{wd} * w) \text{ \ . }$

Following the update of the optimizer state the optimizer uses said state to update the weight:

if nesterov is True:

$g' := g + \text{wd} * w + \text{mm} * v' \text{ \ . }$
$w' := w - \text{lr} * g' \text{ \ . }$
else:
$w' := w - \text{lr} * v' \text{ \ . }$
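The update equations above can be written out for a single scalar weight as a reference. This is a sketch under simplifying assumptions: velocity scaling, loss scaling and clip norms are left out, and the names SgdHyper, SgdState and sgdStep are hypothetical rather than part of the PopART API.

```cpp
#include <cassert>

// Hyper parameters named as in the equations above.
struct SgdHyper {
  float lr;      // learning rate
  float mm;      // momentum
  float wd;      // weight decay
  float dm;      // dampening
  bool nesterov; // use the Nesterov form of the weight update
};

struct SgdState {
  float w; // weight
  float v; // velocity
};

// Hypothetical scalar reference step, illustrative only.
SgdState sgdStep(SgdState s, float g, const SgdHyper &h) {
  assert(h.lr >= 0.0f);
  // v' := v * mm + (1 - dm) * (g + wd * w)
  const float vNew = s.v * h.mm + (1.0f - h.dm) * (g + h.wd * s.w);
  float wNew;
  if (h.nesterov) {
    // g' := g + wd * w + mm * v';  w' := w - lr * g'
    const float gNew = g + h.wd * s.w + h.mm * vNew;
    wNew = s.w - h.lr * gNew;
  } else {
    // w' := w - lr * v'
    wNew = s.w - h.lr * vNew;
  }
  return {wNew, vNew};
}
```

With lr = 0.1, mm = 0.9, wd = 0, dm = 0, a starting state of w = 1, v = 0 and a gradient of 0.5, the velocity becomes 0.5 in both modes; the weight moves to about 0.95 without Nesterov momentum and to about 0.905 with it.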

In addition to the above, the velocity scaling hyper parameter is a scaling factor that can provide improved numerical stability by ensuring the values stored in the optimizer state, $$v$$, are scaled by this value. When using this parameter, PopART will automatically deal with the artificially scaled velocity value during the weight update, and other hyper parameters do not need to be adjusted.

The loss scaling hyper parameter is similar in nature to the velocity scaling parameter. It is a scaling value that is applied to the loss gradient at the start of the backwards pass. At the end of the backwards pass, this scaling is reversed by multiplying the gradients for each weight by the inverse of the loss scaling value prior to updating the optimizer state. Using loss scaling can also improve numerical stability in some cases.

Finally, it is possible to add clip norm settings for this optimizer. These clip norms compute the L2 norm for a group of weights and add a scalar term to the weight update that effectively divides it by the norm (or a constant value that is provided as part of the clip norm, whichever is greater).

See the SGD notes in optimizer.hpp for a more detailed and comprehensive derivation of the SGD optimizer step in PopART.

Subclassed by popart::ConstSGD

Public Functions

SGD(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultMomentum, OptimizerValue defaultDampening, OptimizerValue defaultVelocityScaling, OptimizerValue lossScaling, OptimizerValue nesterov, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})

Constructor.

See also SGDAccumulatorAndMomentum. sgdAccMm defaults to SGDAccumulatorAndMomentum::Combined.

Parameters
• defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultDampening – The dampening value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultVelocityScaling – The velocity scaling value to use for weights for which no weight-specific hyper parameters have been inserted.

• lossScaling – The loss scaling value to use.

• nesterov – Option to enable Nesterov momentum. Defaults to false.

• clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

• sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.

• accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

• accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

• debugContext – Optional debug context.

SGD(OptimizerValue defaultLearningRate, OptimizerValue defaultWeightDecay, OptimizerValue defaultMomentum, OptimizerValue defaultDampening, OptimizerValue defaultVelocityScaling, OptimizerValue lossScaling, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})

Constructor.

See also SGDAccumulatorAndMomentum. sgdAccMm defaults to SGDAccumulatorAndMomentum::Combined.

Parameters
• defaultLearningRate – The learning rate value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultWeightDecay – The weight decay value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultMomentum – The momentum value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultDampening – The dampening value to use for weights for which no weight-specific hyper parameters have been inserted.

• defaultVelocityScaling – The velocity scaling value to use for weights for which no weight-specific hyper parameters have been inserted.

• lossScaling – The loss scaling value to use.

• clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

• sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.

• accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

• accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

• debugContext – Optional debug context.

SGD(const std::map<std::string, std::pair<float, bool>> &params, const std::vector<ClipNormSettings> &clipNormSettings = {}, SGDAccumulatorAndMomentum sgdAccMm = SGDAccumulatorAndMomentum::Combined, DataType accumType = DataType::UNDEFINED, DataType accl1Type = DataType::UNDEFINED, const DebugContext &debugContext = {})

Constructor.

EXAMPLE:

SGD({{"defaultLearningRate", {0.02, false}},
     {"defaultMomentum", {0.6, true}}});


See also SGDAccumulatorAndMomentum. sgdAccMm defaults to SGDAccumulatorAndMomentum::Combined.

This will create an SGD Optimizer which has a constant momentum of 0.6 and a changeable learning rate initially of 0.02. All OptimizerValues not present in the map will take values from the getUnset* functions.

Parameters
• params – A parameter map where the keys are one or more of "defaultLearningRate", "defaultWeightDecay", "defaultMomentum", "defaultDampening", "defaultVelocityScaling", "lossScaling" or "nesterov". The map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter because default values will be used where parameters are missing.

• clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

• sgdAccMm – The implementation strategy to use when gradient accumulation and/or momentum are used, otherwise ignored.

• accumType – The DataType of the accum tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

• accl1Type – The DataType of the accl1 tensor, when gradient accumulation is used and sgdAccMm = SGDAccumulatorAndMomentum::Separate, otherwise ignored. Only FLOAT, FLOAT16 and UNDEFINED are supported. Defaults to UNDEFINED. If UNDEFINED, the same type as the weights will be used. If accumType is FLOAT16 and accl1Type is FLOAT, this parameter causes accum to be upcasted before being passed to the op that updates accl1.

• debugContext – Optional debug context.

inline SGD()

Default constructor. Creates an SGD optimizer with default scalars (equivalent to the getUnset<scalar>() methods) and the other default parameters of the main constructor.

SGD(const SGD&) = default

Copy constructor.

~SGD() = default
inline virtual OptimizerType type() const final
inline virtual std::string type_s() const final
inline SGDAccumulatorAndMomentum getSGDAccumulatorAndMomentum() const
virtual std::unique_ptr<Optimizer> clone() const final
virtual std::unique_ptr<Op> createOp(const Tensor &weight, Graph&) const final

Returns the VarUpdateOp for the given weight.

If there is no gradient accumulation or momentum, this will be an SGD0VarUpdateOp. Else, if getSGDAccumulatorAndMomentum() == ::Combined, this will be an SGD1ComboOp; else, if getSGDAccumulatorAndMomentum() == ::Separate, an SGD2ComboOp.

The required compound scalar OptimizerValues for the VarUpdateOp will be computed and passed to the Op. See the SGD notes above this class for how they are derived. Recall that if non-const, the VarUpdateOp will take an input Tensor for the compound scalar.

See also Optimizer::createOp.

The OptimizerReductionType of the Op is derived as follows:

• No replication => None

• Replication, no grad acc => GradReduce

• Replication, grad acc, SGD1 => AcclReduce

• Replication, grad acc, SGD2 => AccumReduce

See the SGD notes above this class for why this is.
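The derivation rules above can be sketched as a small decision function. The enum and function here are local stand-ins, not PopART declarations. The sketch assumes that "replication" means a replication factor greater than 1, that "grad acc" means an accumulation factor greater than 1, and that SGD2 corresponds to SGDAccumulatorAndMomentum::Separate.

```cpp
#include <cassert>

// Local stand-in for the reduction modes named in the text; this is not
// the popart::OptimizerReductionType declaration.
enum class ReductionMode { None, GradReduce, AcclReduce, AccumReduce };

// Hypothetical helper mirroring the rules listed above.
ReductionMode chooseReduction(int replicationFactor, int accumulationFactor,
                              bool separateAccumAndMomentum) {
  assert(replicationFactor >= 1 && accumulationFactor >= 1);
  if (replicationFactor == 1) {
    return ReductionMode::None; // no replication
  }
  if (accumulationFactor == 1) {
    return ReductionMode::GradReduce; // replication, no grad acc
  }
  // Replication with gradient accumulation: SGD2 (separate tensors)
  // reduces the accumulator, SGD1 (combined tensor) reduces the momentum.
  return separateAccumAndMomentum ? ReductionMode::AccumReduce
                                  : ReductionMode::AcclReduce;
}
```

For example, two replicas with an accumulation factor of 4 select AcclReduce for the combined strategy and AccumReduce for the separate one.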

If SGD2, the DataType of the accum and accl1 tensors passed to the SGD2ComboOp will be as set in the SGD constructor. Recall DataType::UNDEFINED means use the same as the weight.

An SGD1ComboOp will later be decomposed by the SGD1Decompose pattern into a series of Ops and Tensors that implement the SGD1 optimiser step.

An SGD2ComboOp will later be decomposed by the SGD2Decompose pattern into a series of Ops and Tensors that implement the SGD2 optimiser step.

See also SGD1Decompose.

virtual std::vector<TensorId> getInputIds(const Tensor &weight) const final

virtual std::vector<std::tuple<TensorId, TensorInfo>> getOptimizerInputs(const Tensor &weight) const final

smm1 and wdsf0 have the same data type as the weight . Everything else

virtual void validReplacement(const Optimizer &other) const final
virtual void resetTensorData(Tensor&) const final
virtual void setTensorData(Tensor&) const final
float getStoredValue(const TensorId &optId) const

Tensor “opt” has an id, which it uses to match a compound scalar which this object can compute from the atomic scalars.

void insertSpecific(const TensorId &weight, OptimizerValue learningRate, OptimizerValue weightDecay, OptimizerValue momentum, OptimizerValue dampening, OptimizerValue velocityScaling, OptimizerValue nesterov)

Insert a weight-specific set of hyper parameters.

Parameters
• weight – The TensorId of the weight.

• learningRate – The learning rate value to use for this specific weight.

• weightDecay – The weight decay value to use for this specific weight.

• momentum – The momentum value to use for this specific weight.

• dampening – The dampening value to use for this specific weight.

• velocityScaling – The velocity scaling value to use for this specific weight.

• nesterov – Option to enable Nesterov momentum. Defaults to false.

void insertSpecific(const TensorId &weight, const std::map<std::string, std::pair<float, bool>> &params)

Insert a weight-specific set of hyper parameters.

Parameters
• weight – The TensorId of the weight.

• params – A parameter map where the keys are one or more of "learningRate", "weightDecay", "momentum", "dampening", or "velocityScaling", and the map’s values are pairs of floats and booleans representing OptimizerValue constructor arguments. The map does not have to specify each hyper parameter because default values will be used where parameters are missing.

virtual bool hasSpecific(const Tensor &w) const final
virtual bool hasSpecific() const final
virtual TensorId getInverseLossScalingTensorId(const Tensor &weight) const
inline const OptimizerValueMap &learningRates() const
inline const OptimizerValueMap &weightDecays() const
inline const OptimizerValueMap &momentums() const
inline const OptimizerValueMap &dampenings() const
inline const OptimizerValueMap &velocityScalings() const
inline const OptimizerValueMap &nesterov() const
virtual size_t hash() const

Public Static Functions

static inline OptimizerValue getUnsetLearningRate()

Default learning rate value.

static inline OptimizerValue getUnsetWeightDecay()

Default weight decay value.

static inline OptimizerValue getUnsetMomentum()

Default momentum value.

static inline OptimizerValue getUnsetDampening()

Default dampening value.

static inline OptimizerValue getUnsetVelocityScaling()

Default velocity scaling value.

static inline OptimizerValue getUnsetLossScaling()

Default loss scaling value.

static inline OptimizerValue getUnsetNesterov()

Default nesterov.

static SGD fromDefaultMap(const std::map<std::string, OptimizerValue>&, const DebugContext &debugContext = {})
class ConstSGD : public popart::SGD

Stochastic Gradient Descent (SGD) optimizer with constant learning rate, weight decay, loss scaling and clip norm settings (and default values for momentum, dampening or velocity scaling).

NOTE: See SGD for detailed meaning for these parameters.

NOTE: This class exists for backwards compatibility with the Python API and may be removed at some point in the future.

Public Functions

inline ConstSGD(float learningRate, float weightDecay = 0, float lossScaling = 1, const std::vector<ClipNormSettings> &clipNormSettings = {})

Constructor.

Parameters
• learningRate – A constant learning rate.

• weightDecay – A constant weight decay value.

• lossScaling – A constant loss scaling value.

• clipNormSettings – A vector of ClipNormSettings (this can be used to set maximum values for weights).

enum class popart::SGDAccumulatorAndMomentum

Strategy for implementing SGD with momentum and/or gradient accumulation.

Values:

enumerator Combined = 0

Implement SGD using a single tensor for the gradient accumulator (accum) and momentum (accl) tensors.