Engine

#include <poplar/Engine.hpp>

namespace pva

namespace poplar

Poplar classes and functions.

Typedefs

using ProgressFunc = std::function<void(int, int)>

Functions

Executable compileGraph(const Graph &graph, ArrayRef<program::Program> progs, const OptionFlags &opt = {}, ProgressFunc progressCallBack = ProgressFunc(), const DebugContext &debugContext = {})

Compile the given graph and programs to make an executable that can be executed using a poplar::Engine.

Parameters

graph – The graph to compile.
progs – The list of programs to run over the graph. Each program can be run separately by calling Engine::run() and passing the index, in this list, of the program to run.
opt – Options that can be used to control compilation and execution. The available options are listed under Engine.
progressCallBack – A function that will be called to indicate engine compilation progress. See Engine::ProgressFunc for more information.
debugContext – Optional DebugId and debug name.
profileWriter – Optional parameter to manage profiler writing.

Throws

invalid_option – If any of the options passed in opt were not recognised or improperly formatted.
link_error – If program linking fails; for example, due to undefined symbols or lack of memory on a tile.

Variables

static const unsigned WORKER_SCRATCH_SIZE = 48: Size in bytes of the scratch space available to each worker thread.

class Engine

#include <poplar/Engine.hpp>

A graph compute engine.

The Engine class provides the ability to execute a graph program.

Engine creation options

Options can be overridden with the environment variable POPLAR_ENGINE_OPTIONS. For example:

POPLAR_ENGINE_OPTIONS='{"target.deterministicWorkers":"true"}'
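As a rough illustration of the value format shown above, the following sketch builds such a JSON string from a map of option names to values. The helper name `toEngineOptionsJson` is hypothetical and not part of Poplar; it assumes every value is written as a quoted string, as in the example, and that no key or value needs JSON escaping.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper (not a Poplar API): serialise option names and values
// into a flat JSON object in which every value is a quoted string.
// Assumption: keys and values contain no characters that need JSON escaping.
std::string toEngineOptionsJson(const std::map<std::string, std::string> &opts) {
  std::string json = "{";
  bool first = true;
  for (const auto &kv : opts) {
    // Guard the "no escaping needed" assumption for the quote character.
    assert(kv.first.find('"') == std::string::npos);
    assert(kv.second.find('"') == std::string::npos);
    if (!first) {
      json += ",";
    }
    json += "\"" + kv.first + "\":\"" + kv.second + "\"";
    first = false;
  }
  json += "}";
  return json;
}
```

The resulting string could then be exported as the value of POPLAR_ENGINE_OPTIONS before the engine is constructed.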

Engine creation options: Debug

debug.allowOutOfMemory (true, false) [=false]

If true, allow out-of-memory while compiling and linking. This is automatically set to true if autoReport.outputGraphProfile is set to true (directly or indirectly).
debug.computeInstrumentationLevel (vertex, tile, ipu) [=tile]

The granularity of compute instrumentation. This option has no effect unless debug.instrumentCompute is true.
- vertex: Store the last cycle count of each vertex on every tile.
- tile: Store the last cycle count of each compute set on every tile.
- ipu: Store the last cycle count of each compute set on one tile per IPU. This saves memory compared to tile (since the cycle counts are always live and this needs to store them on only one tile), but it loses all per-tile cycle information. It works by adding a sync after each compute set and timing how long it takes to get to that sync. So, effectively, it measures the cycle time of the longest-running tile in the compute set.
- device: Deprecated. Similar to ipu, but instead of storing the cycle counts on one tile per IPU, it stores them on one single tile across all IPUs which adds the need for global syncs.
debug.retainDebugInformation (true, false) [=true] Retain compilation information to help with debugging. Must be true if profiling is enabled.
debug.cpuMultiThreadExecution (true, false) [=true] If true, operations are executed using multiple host threads for a CPU or IPU Model target. Setting to false may simplify debugging at the cost of reduced performance.
debug.instrument (true, false) [=false]

If true, enable all instrument options (below). This will instruct the engine to add cycle counters to the compiled program to enable the execution profile to be retrieved after the program is run. This is only available for an IPU target (not an IPU Model target). Note that the more specific instrumentation options may override the default. For example,
```
{"debug.instrument":"true",
 "debug.instrumentExternalExchange":"false"}
```
will instrument everything apart from external exchange.
debug.instrumentCompute (true, false) [=false]

If true, enable instrumentation of compute sets. See debug.instrument.
debug.instrumentExternalExchange (true, false) [=false]

If true, enable instrumentation of external exchanges. See debug.instrument.
debug.instrumentControlFlow (true, false) [=false]

If true, enable instrumentation of loops and conditionals. See debug.instrument.
debug.outputAllSymbols (true, false) [=false]

If true, output additional symbols to the ELF files that are not required but aid debugging.
debug.profilingTile Integer [=Tiles per IPU - 1]

The tile on which to store the cycle counter for every compute set. This has no effect unless debug.computeInstrumentationLevel is set to ipu.
debug.branchRecordTile Integer [=NTILES-1]

The tile on which to store the branch record. This has no effect unless debug.instrumentControlFlow flag is set. In a CPU target, this option has no effect. In an IPU Model, it only affects the memory profile.
debug.runtimeVerify (true, false) [=false]

If true, expensive verification steps are enabled at runtime.
debug.trace (true, false) [=false]

If true, a trace is printed to the error stream with the state of every edge before and after the execution of a compute set or exchange.
debug.traceFile String

Only used if debug.trace is true. If set, the debug trace is output to the specified file instead of the error stream.
debug.verify (true, false) [=false]

If true, expensive verification steps are enabled at compile time. The checks mostly focus on exchange code, including the following:
- ensuring variables have been set,
- ensuring section/instruction alignment is correct,
- and ensuring the total number of bytes received is as expected.
In addition, after laying out memory we verify the memory constraints on variables are satisfied.
debug.supervisorStackSizeInBytes Integer

If set, the automatically computed stack size for supervisor threads will be overridden with the specified value (in bytes) for all tiles.
debug.workerStackSizeInBytes Integer

If set, the automatically computed stack size for worker threads will be overridden with the specified value (in bytes) for all tiles.

Engine creation options: Optimisations

opt.maxCompilationThreads Integer [=0]

The maximum number of threads to use during compilation. A value of 0 means the hardware will be fully utilised.
opt.maxLinkerThreads Integer [=0]

The maximum number of threads to use during linking. A value of 0 means the same number will be used as were used for compilation.
opt.internalExchangeOptimisationTarget (balanced, cycles, memory) [=cycles]

What balance of heuristics to use when generating exchange code. Can be used to balance exchange memory usage against speed.
- cycles: Focus completely on speed, at the expense of always-live memory.
- memory: Focus completely on minimising the memory footprint, at the expense of speed.
- balanced: Sacrifice some speed to reduce the amount of always-live memory produced.
opt.enableMultiAccessCopies (true, false) [=true]

Enable this option to make some of the copies faster at the expense of adding more constraints on variables used in the copies.
opt.limitVertexStateToLower256K (true, false) [=false]

Enable this option to optimise the control code by allocating all of the vertex state in the first 256KB of memory. This has a disadvantage that this is the same range of memory that the code must be put in, so if the sum of the two is larger than 256KB then the model will fail to compile.
opt.useAutoloader (true, false) [=true on Mk2 IPU, false otherwise]

If true, use the secondary loading mechanism to load the executable. This option is ignored on non-IPU targets.

Engine creation options: Target

target.deterministicWorkers (true, false, portable) [=true]

Ensure that the mapping of vertices to worker threads is the same for repeated execution either on the same IPU (true), or on every IPU (portable). This guarantee does not hold following breakpoints or exceptions.
target.saveArchive String

If set, the binary archive will be saved to the specified filename during graph compilation. This archive contains the ELF files for each tile. No archive will be saved unless this option is set.
target.gatewayWriteCombining (true, false) [=target option gatewayMode]

Optimise write-to-host code to use IPU-Machine gateway write combining.
target.maxStreamCallbackThreadsPerNumaNode Integer or “auto” [=0] (deprecated)

The maximum number of threads per NUMA node to use to execute stream callbacks. A value of 0 means the main thread will execute all of the callbacks, which is the default because a non-zero number of threads requires thread-safe callbacks.

A value of auto means the hardware will be fully utilised; this typically means that up to one thread per CPU core is used.

Note that this is the maximum number of threads in addition to the main thread. For example, on a system with two NUMA nodes setting this option to 1 would mean that a total of three threads could execute callbacks, with one thread pinned to each NUMA node and the main thread operating on one of the two nodes as well (assuming the main thread is free to execute callbacks).
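The thread-count arithmetic in the note above can be sketched as follows. The function name is hypothetical (it is not a Poplar API), it covers only integer settings rather than auto, and it assumes the main thread is free to execute callbacks.

```cpp
#include <cassert>

// Hypothetical helper (not a Poplar API): upper bound on the number of
// threads that could execute stream callbacks for an integer setting of
// target.maxStreamCallbackThreadsPerNumaNode.
// Assumption: the main thread is free to execute callbacks, so it is
// counted in addition to the per-NUMA-node worker threads.
unsigned maxCallbackThreads(unsigned numaNodes, unsigned maxThreadsPerNumaNode) {
  const unsigned mainThread = 1;
  return numaNodes * maxThreadsPerNumaNode + mainThread;
}
```

With two NUMA nodes and the option set to 1 this gives three threads, matching the example in the text; with the default of 0 only the main thread remains.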
target.extendedMemory (true, false) [=false]

When enabled, supports >16GiB for remote buffers. Only supported on M2000 systems. Requires opt.enableStreamCopyMerging=false.

Engine creation options: Report generation

The report generation options will automatically output the Poplar reports that can be viewed in the PopVision Graph Analyser.

These options provide a basic ability to capture the reports. For more complex use cases the reports should be generated programmatically via functions in the framework (TensorFlow, PopTorch, PopART or Poplar) in which the application is written.

autoReport.all (true, false) [=false]

Output all the available reports described below.

You can exclude individual reports by combining options. For example, this will generate all reports apart from the serialized graph:
```
{"autoReport.all":"true",
 "autoReport.outputSerializedGraph":"false"}
```
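One way to read the override rule above is that a report-specific option, when present, takes precedence over autoReport.all. The sketch below models that reading only; `reportEnabled` is a hypothetical helper rather than Poplar code, and it assumes that report options default to false when neither key is set.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model (not Poplar code) of how a report-specific option
// combines with autoReport.all: an explicit setting for the report wins,
// otherwise the report follows autoReport.all.
// Assumption: with neither key present, the report is disabled.
bool reportEnabled(const std::map<std::string, std::string> &opts,
                   const std::string &name) {
  const auto specific = opts.find(name);
  if (specific != opts.end()) {
    return specific->second == "true";
  }
  const auto all = opts.find("autoReport.all");
  return all != opts.end() && all->second == "true";
}
```

Applied to the options in the example, this model reports the serialized graph as disabled and every other report as enabled.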
autoReport.outputGraphProfile (true, false) [=false]

Output the graph profile report to profile.pop.
autoReport.outputLoweredVars (true, false) [=false]

Generate lowered variables info in profile.pop. This is equivalent to using the debug.loweredVarDumpFile option with the filename set to profile.pop.

To generate the old capnp format, set debug.loweredVarDumpFile to vars.capnp.
autoReport.outputArchive (true, false) [=false]

Output the archive report: archive.a. This is equivalent to using the target.saveArchive option with the filename set to archive.a.
autoReport.outputSerializedGraph (true, false) [=false]

Output the serialized graph: serialized_graph.capnp.
autoReport.outputExecutionProfile (true, false) [=false]

Output the execution profile report to profile.pop.

By default this setting will also set debug.instrument to true. If you do not want instrumentation enabled you can set autoReport.outputExecutionProfile or debug.instrument to false.
autoReport.streamAtEachRun (true, false) [=true]

Applies to profiler format V3 or higher. Enable or disable the streaming of the execution profile to disk at each run. If false, the whole execution will be written to disk on Engine destruction (note, some frameworks like TensorFlow may not properly destroy the Engine).
autoReport.outputDebugInfo (true, false) [=false]

Output debug info: debug.json. This file gathers the data in every DebugInfo object created. Elements in the graph report with debugIds can be related to these DebugInfo objects.
autoReport.executionProfileProgramRunCount Integer [=2]

Specify how many runs of each program to capture in the execution profile.
autoReport.directory String [=./]

Specify which directory you want the reports to be written to. By default they will be written to the current working directory.

Engine creation options: Other

prng.enableStochasticRounding (true, false) [=false]

If true, stochastic rounding is enabled.

You can also enable or disable stochastic rounding using the functions setFloatingPointBehaviour() and setStochasticRounding(). For setFloatingPointBehaviour() the default behaviour is to enable stochastic rounding.
prng.seed Integer [=0]

Base seed for PRNG initialisation.

Engine runtime options:

Any OptionFlags parameters defined for RuntimeOptions construction can also be used to construct an engine. These options will be used as defaults. However, they can be overridden by passing a RuntimeOptions argument to the member functions that accept one, or by defining the POPLAR_RUNTIME_OPTIONS environment variable.

Public Types

using ProgressFunc = std::function<void(int, int)>

Callback function used to indicate engine compilation progress.

The function is passed two integers. The first is the progress value and the second is the maximum value for the progress.

If a progress callback is used, the function should not block. All calls to the callback function will be made in a single dedicated thread so blocking in the callback will block the receipt of further notifications (but will not block compilation from progressing). The callback should not use Poplar objects or functions relating to the Graph, Engine or Device that are being compiled.
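A minimal sketch of a non-blocking callback follows. It redeclares the ProgressFunc alias locally so that it compiles without Poplar headers, and `makeProgressRecorder` is a hypothetical helper: the callback only stores the latest percentage in an atomic, leaving any printing or UI update to another thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Local mirror of the ProgressFunc alias documented above, so that the
// sketch is self-contained.
using ProgressFunc = std::function<void(int, int)>;

// Hypothetical helper (not a Poplar API): returns a callback that records
// the latest progress as a percentage and returns immediately, so it never
// blocks the notification thread. `percent` must outlive the callback.
ProgressFunc makeProgressRecorder(std::atomic<int> &percent) {
  return [&percent](int progress, int total) {
    if (total > 0) {
      percent.store(static_cast<int>(100LL * progress / total),
                    std::memory_order_relaxed);
    }
  };
}
```

The returned object could be passed as the progressCallBack argument, while the application polls the atomic from its own thread.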

Public Functions

Engine(const Graph &graph, ArrayRef<program::Program> progs, const OptionFlags &opt = {}, ProgressFunc progressCallBack = ProgressFunc(), const DebugContext &debugContext = {})

Construct the engine from a graph and a list of programs.

Parameters

graph – The graph to compile into the engine.
progs – The list of programs to run over the graph. Each program can be run separately by calling the run() method of the Engine with the argument being the index of the program to run in this list.
opt – Options that can be used to control compilation and execution. The available options are listed under Engine.
progressCallBack – A function that will be called to indicate engine compilation progress. See Engine::ProgressFunc for more information.
debugContext – Optional Engine name and Debug Id.

Throws

invalid_option – If any of the options passed in opt were not recognised or improperly formatted.
link_error – If program linking fails; for example, due to undefined symbols or lack of memory on a tile.

Engine(const Graph &graph, program::Program prog, const OptionFlags &opt = {}, ProgressFunc progressCallBack = ProgressFunc(), const DebugContext &debugContext = {})

Construct the engine from a graph and a program.

Parameters

graph – The graph to compile into the engine.
prog – The program to run over the graph. This program is run when the run() method is called on the Engine.
opt – Options that can be used to control compilation and execution. The available options are listed under Engine.
progressCallBack – A function that will be called to indicate engine compilation progress. See Engine::ProgressFunc for more information.
debugContext – Optional Engine name and Debug Id.

Throws

invalid_option – If any of the options passed in opt were not recognised or improperly formatted.
link_error – If the program linking fails; for example, due to undefined symbols or lack of memory on a tile.

Engine(Executable &&exe, const OptionFlags &opt = {})

Construct the engine from a precompiled executable.

Parameters

exe – The precompiled executable. This can be created using compileGraph().
opt – Options that can be used to control execution. These must be the same as the options passed to compileGraph(). The available options are listed under Engine.

Throws

invalid_option – If any of the options passed in opt were not recognised or improperly formatted.

Engine(Engine&&) noexcept

~Engine()

void prepare(const Device &device)

Prepare the device for loading.

This configures the device ready for loading binary code, which is done by calling deploy().

Parameters: device – The device to load onto.

void prepare(const Device &device, const RuntimeOptions &runOptions)

Prepare the device for loading.

This configures the device ready for loading binary code, which is done by calling deploy().

Parameters

device – The device to load onto.
runOptions – Set of parameters to adjust runtime behaviour.

void deploy()

Load the engine.

This loads binary code. The device must have been prepared previously by calling prepare().

void load(const Device &device)

Load the compiled program/graph onto a device.

This function will load all binary code and data onto the device ready for execution. This is a shortcut to call the prepare() and deploy() functions in succession.

Parameters: device – The device to load onto.

void run(unsigned prog = 0, const std::string &debugName = "")

Run the graph program.

This function will execute the graph program. Note that the program needs to have already been loaded onto a device otherwise an exception will occur.

Parameters

prog – The index of the program to run. If this is greater than or equal to the number of programs given in the constructor then an exception is thrown.
debugName – Run name (for debugging/analysis).

void run(unsigned prog, const std::string &debugName, const RuntimeOptions &options)

Run the graph program.

This function is similar to run(prog, debugName) and allows the user to override runtime parameters via an instance of RuntimeOptions.

Parameters

prog – The index of the program to run. If this is greater than or equal to the number of programs given in the constructor then an exception is thrown.
debugName – Run name (for debugging/analysis).
options – Multiple parameters that alter the execution behaviour.

void loadAndRun(const Device &device, unsigned prog = 0)

Run the graph program.

This function will load the program/graph onto the device and then execute the graph program.

Parameters

device – The device to load onto.
prog – The index of the program to run. If this is greater than or equal to the number of programs given in the constructor then an exception is thrown.

TimerTimePoint getTimeStamp()

Get a record of the current host and device time.

Details depend on the underlying device used.

void resetExecutionProfile()

Reset execution profile.

When programs are run their profiles are appended to the execution profile. This discards profiling information for previously executed programs.

Deprecated: Use the PVA library instead.

pva::Report getReport(bool reportExecution = true)

Get a PVA Report object that allows access to profiling data for the graph and the execution with this engine.

Subsequent Engine::run() executions may not be accessible through the returned report due to caching.

Parameters: reportExecution – Enables access to the execution report (since this engine was constructed/the execution report was last reset). Otherwise, only the graph profile is available.
Throws: profiling_disabled – If the device is not an IPU or IPU Model.
Returns: A PVA Report object (declared in libpva <pva/pva.hpp>).

void disableExecutionProfiling()

Pause execution profiling.

Subsequent Engine::run() calls are executed without being profiled until a subsequent call to enableExecutionProfiling().

For example, you can exclude individual programs from a profile like this:

 engine.disableExecutionProfiling();
 engine.run(...);
 engine.enableExecutionProfiling();

void enableExecutionProfiling()

Enable execution profiling.

Subsequent Engine::run() calls are profiled when executed.

void printProfileSummary(std::ostream &outputStream, const OptionFlags &opt = {})

Get and print the summary of a report with the given options.

This is equivalent to getting and printing the summary of both the graph and execution reports using poplar::printProfileSummary().

Parameters

outputStream – A stream to write the summary to.
opt –
A set of option flags configuring the contents of the report. All can be “true” or “false”. The default is “false”.

The available options are:
- showVarStorage (true, false)
- showOptimizations (true, false)
- showExecutionSteps (true, false)

Throws

profiling_disabled – If the device is not an IPU or IPU Model.
invalid_option – If any of the options passed in opt were not recognised or improperly formatted.

void reportIntervals(std::ostream &outputStream)

Write a CSV data file to a specified output stream.

The data files contain the number of tiles active over time in cycles for compute, synchronisation and exchange phases.

Each row contains the following entries:

- begin time in cycles
- end time in cycles
- number of tiles participating in compute
- number of tiles participating in exchange
- number of tiles participating in synchronisation

Because tiles execute a number of threads (up to 6) in parallel, a single “thread cycle” may only be executed every 6 tile clock cycles. The cycles reported by this function are tile clock cycles rather than thread cycles.

Parameters: outputStream – An output stream for the CSV data to be written to.
Throws: profiling_disabled – If the device has no profiling enabled.
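For post-processing the CSV output, a row could be parsed along the following lines. This is a sketch under assumptions: fields are taken to be comma-separated unsigned integers in the order listed above, and the struct and function names are hypothetical. Lines that do not match, such as a possible header row, are rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record for one row of the interval report, with fields in
// the order listed in the description above.
struct IntervalRow {
  std::uint64_t beginCycles;
  std::uint64_t endCycles;
  unsigned computeTiles;
  unsigned exchangeTiles;
  unsigned syncTiles;
};

// Parse one line into `row`. Returns false if the line does not consist of
// five comma-separated unsigned integers (assumed format).
bool parseIntervalRow(const std::string &line, IntervalRow &row) {
  std::istringstream in(line);
  char c1 = 0, c2 = 0, c3 = 0, c4 = 0;
  if (!(in >> row.beginCycles >> c1 >> row.endCycles >> c2 >>
        row.computeTiles >> c3 >> row.exchangeTiles >> c4 >> row.syncTiles)) {
    return false;
  }
  return c1 == ',' && c2 == ',' && c3 == ',' && c4 == ',';
}
```

The difference endCycles - beginCycles then gives the interval length in tile clock cycles.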

void readTensor(StringRef handle, void *buf, void *bufEnd)

Synchronous copy of a buffer of data from a specific tensor in the device into a host-side buffer.

The tensor must have been marked as an output tensor. The buffer must have room for all of the tensor data. The buffer end address is required for size verification. The handle should match the one passed to Graph::createHostRead().

See also

Graph::createHostRead()

Parameters

handle – The source host copy handle.
buf – The destination of the read.
bufEnd – The end address of the destination buffer.

void connectStream(StringRef handle, void *begin, void *end)

Connect a stream to a circular buffer in memory.

Each time data is copied to/from the stream the pointer for the next transfer is incremented within the bounds given.

Parameters

handle – The name of the stream to connect to.
begin – Pointer to the start of the circular buffer.
end – Pointer to the end of the circular buffer.
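The pointer behaviour described above can be pictured with a small host-side model. This is not Poplar's implementation: `CircularCursor` is a hypothetical class, and it assumes the buffer length is a whole multiple of the transfer size, so the pointer wraps back to begin exactly when it reaches end.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side model (not Poplar's implementation) of a transfer
// pointer that advances through [begin, end) and wraps around.
// Assumption: the buffer length is a multiple of the transfer size.
class CircularCursor {
public:
  CircularCursor(char *begin, char *end)
      : begin_(begin), end_(end), next_(begin) {
    assert(begin < end);
  }

  // Returns the address used for this transfer, then moves the cursor on by
  // `bytes`, wrapping to the start when the end of the buffer is reached.
  char *advance(std::size_t bytes) {
    assert(bytes <= static_cast<std::size_t>(end_ - next_));
    char *current = next_;
    next_ += bytes;
    if (next_ == end_) {
      next_ = begin_;
    }
    return current;
  }

private:
  char *begin_;
  char *end_;
  char *next_;
};
```

With an 8-byte buffer and 4-byte transfers, successive transfers in this model use offsets 0, 4, 0, 4 and so on.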

void connectStream(StringRef handle, void *p)

Connect a stream to a fixed location in memory.

Each time data is copied to/from the stream this location will be read/written.

Parameters

handle – The name of the stream to connect to.
p – The pointer to the memory buffer.

void connectStreamToCallback(StringRef handle, StreamCallbackHandle f)

Connect a stream to a callback taking a pointer to the location in memory to copy into/from.

This will be called whenever the stream will be read or was written by the device. The given memory location will only be valid to read from or write to for the duration of the callback.

Parameters

handle – The name of the stream to connect to.
f – Callback to be called whenever the stream is to be read or was written by the device.

void connectStreamToCallback(StringRef handle, unsigned index, StreamCallbackHandle f)

Connect a replicated stream to a callback taking a pointer to the location in memory to copy into/from.

This will be called whenever the stream will be read or was written by the device. The given memory location will only be valid to read from or write to for the duration of the callback.

Parameters

handle – The name of the stream to connect to.
index – The replicated index to connect to.
f – Callback to be called whenever the stream is to be read or was written by the device.

void copyFromRemoteBuffer(StringRef handle, void *w, int repeatIndex, unsigned replicationIndex = 0)

Copy from a remote buffer to a user buffer w.

Parameters

handle – The name of the remote buffer to copy from.
w – The user buffer to copy to.
repeatIndex – The index in the remote buffer to copy from.
replicationIndex – The replicated graph index.

void copyToRemoteBuffer(void *w, StringRef handle, int repeatIndex, unsigned replicationIndex = 0)

Copy to a remote buffer from a user buffer w.

Parameters

w – The user buffer to copy from.
handle – The remote buffer to copy to.
repeatIndex – The index in the remote buffer to copy to.
replicationIndex – The replicated graph index.

std::vector<std::string> listStreams() const

Return a list of all streams in the engine.

Returns: Vector of strings, each of which is a stream’s handle postfixed with ‘+’ or ‘-’ indicating whether the stream is a host-write or a host-read respectively.
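A caller might split each returned string into the handle and its direction, as in this sketch. The enum, struct and function names are hypothetical and not part of Poplar; entries without a recognised trailing character are returned unchanged and marked as unknown.

```cpp
#include <cassert>
#include <string>

// Hypothetical types (not part of Poplar) describing one listStreams() entry.
enum class StreamKind { HostWrite, HostRead, Unknown };

struct StreamEntry {
  std::string handle;
  StreamKind kind;
};

// Split an entry into the handle and the direction encoded by its final
// character: '+' for a host-write stream, '-' for a host-read stream.
StreamEntry splitStreamEntry(const std::string &entry) {
  if (entry.empty()) {
    return {entry, StreamKind::Unknown};
  }
  const std::string handle = entry.substr(0, entry.size() - 1);
  switch (entry.back()) {
  case '+':
    return {handle, StreamKind::HostWrite};
  case '-':
    return {handle, StreamKind::HostRead};
  default:
    return {entry, StreamKind::Unknown};
  }
}
```

Applied to each element of the returned vector, this yields the bare handles together with their direction.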

void setPrintStream(std::ostream &stream)

Set output stream for printf commands.

Parameters: stream – The output stream to use.

void setPrintTensorStream(std::ostream &stream)

Set the output stream for PrintTensor programs.

By default, tensors are printed to stderr.

Parameters: stream – The output stream to use.

OptionFlags getEngineOptions() const: Returns the options the engine was created with.

void serializeExecutable(std::ostream &out) const: Serialize the executable used by the engine.

void insertSimulatedError(ErrorCode error, ErrorLocation const &location)

Simulate an error.

This function causes the program to generate an error when it gets to the specified location. This can be useful for failure testing.

This function must be called after Engine::load() and before Engine::run().

You can simulate many errors at the same time, provided each error is simulated at a unique location. However, only a subset of the errors may be reported because the execution of the program will stop as soon as any one of the errors is detected.

Example usage:

const unsigned tile = 0;
const poplar::ErrorCode error = poplar::ErrorCode::MEMORY_ERROR;
const std::vector<poplar::ErrorLocation> locations =
    engine.getSimulatedErrorLocations("MyComputeVertex", tile);

engine.load(device);
engine.insertSimulatedError(error, locations[0]);
try {
  engine.run();
} catch (poplar::runtime_error const& exception) {
  assert(exception.errors.size() == 1);
  assert(exception.errors[0].isSimulated);
  assert(exception.errors[0].code == error);
  assert(exception.errors[0].location == locations[0]);
  std::cerr << "ErrorCode: " << exception.what() << '\n';
}

Parameters

error – The type of error to simulate. See poplar::ErrorCode for a list of possible errors to simulate.
location – Where to simulate the error. The location of an error must be unique from other simulated errors. See Engine::getSimulatedErrorLocations for how to specify an error location based on some program information.

Throws

poplar::poplar_error – If Engine::load() has not been called yet.
poplar::poplar_error – If an error is already being simulated at location.

void eraseSimulatedError(ErrorLocation const &location)

Undo the effects of Engine::insertSimulatedError().

Parameters: location – Any one of the locations passed to Engine::insertSimulatedError().
Throws: poplar::poplar_error – If there is no simulated error at location.

void clearSimulatedErrors()

Undo the effects of all Engine::insertSimulatedError() calls.

To remove the effects of the simulated errors and run the program from the beginning again you can call Engine::load() after clearing the simulated errors:

engine.clearSimulatedErrors();
engine.load(device);
engine.run(); // won't hit any simulated errors.

std::vector<ErrorLocation> getSimulatedErrorLocations(unsigned programId, unsigned tile = ~0) const

Return the locations of a program from a program ID.

It’s possible a program exists on multiple tiles so tile can be used to disambiguate the tile on which the error should occur. If tile is not specified then a location will be returned for each tile the program exists on.

Note

If programId specifies a poplar::Vertex then this will return the location of the launch routine for the poplar::Vertex and not the location of the code that comprises the poplar::Vertex. Use the vertexName overload of this function to specify an error inside a poplar::Vertex.

Parameters

programId – The program id of the program. This can be looked up in the Graph Analyser tool (look for the “Id” field under “Details” in the “Program Tree” tab).
tile – The tile in the device. For devices containing multiple IPUs the range of valid tiles is [0, numIpus * numTilesPerIpu).

Returns

An object that uniquely identifies the location of the error within the device. An ErrorLocation is only valid for the program for which it is generated.
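The valid range for tile can be checked before the call, as sketched below. The helper is hypothetical (not a Poplar API) and treats the default argument value from the signature, ~0, as the "no specific tile" case described above.

```cpp
#include <cassert>

// The default `tile` argument in the signature above; interpreted here as
// "no specific tile", in which case locations are returned for every tile.
constexpr unsigned kAllTiles = ~0u;

// Hypothetical check (not a Poplar API): is `tile` either the default value
// or inside the documented range [0, numIpus * numTilesPerIpu)?
bool isValidTileArgument(unsigned tile, unsigned numIpus,
                         unsigned numTilesPerIpu) {
  return tile == kAllTiles || tile < numIpus * numTilesPerIpu;
}
```

For example, with 2 IPUs of 1472 tiles each, tiles 0 to 2943 pass this check and tile 2944 does not.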

std::vector<ErrorLocation> getSimulatedErrorLocations(StringRef vertexName, unsigned tile = ~0) const

Return the locations of a program from a vertex name.

Similar to Engine::getSimulatedErrorLocations(programId, tile), but looks up the error locations using the name of a Vertex.

Returns: The locations of the vertex code. Any error simulated at these locations may be executed by any worker thread on the tiles.

Engine(std::unique_ptr<core::Engine>)

inline const core::Engine &getImpl() const

Public Static Functions

static std::string reportTiming(const TimerTimePoint &start, const TimerTimePoint &end)

Get a timing report for the measured interval.

Details depend on the underlying device used.

Parameters

start – Start time of report
end – End time of report

Private Members

std::unique_ptr<core::Engine> impl

class TimerTimePoint

#include <poplar/Engine.hpp>

PImpl interface to core timing information.

The constructor is private. Obtain an object by calling getTimeStamp().

Public Functions

TimerTimePoint() = default

Private Functions

explicit TimerTimePoint(Engine &e)

Private Members

std::shared_ptr<core::TimerTimePoint> impl

Friends

friend class Engine

namespace core