11. Glossary

11.1. Sample

The smallest division of a data set.

11.2. Micro-batch size

The number of samples processed in a single execution of a graph on a single device. Also referred to as the machine batch size. The micro-batch shape, or the shape of input data as defined in the ONNX model, is therefore [micro_batch_size, *sample_shape].

11.3. Replication factor

The number of graphs to be run in parallel over multiple devices. The weight gradients from each device will be accumulated before a weight update. Also referred to as “device replication factor” or “spatial replication factor”. This is sometimes called data-parallel execution.

11.4. Accumulation factor

The weight gradients will be accumulated over this number of micro-batches in series before a weight update. Also referred to as “temporal replication factor”.

Accumulation can be thought of as doing replication on a single device.
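As an illustration, here is a minimal NumPy sketch of gradient accumulation on a toy one-parameter model (all names and values are hypothetical, not PopART code):

import numpy as np

# Toy model: scalar weight w, per-micro-batch loss = mean((w * x - y)^2).
rng = np.random.default_rng(0)
w, lr = 0.0, 0.1
accumulation_factor = 4
micro_batches = [(rng.normal(size=8), rng.normal(size=8))
                 for _ in range(accumulation_factor)]

grad_accum = 0.0
for x, y in micro_batches:                       # micro-batches run in series
    grad_accum += np.mean(2 * (w * x - y) * x)   # accumulate the gradient

w -= lr * grad_accum                             # single weight update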

11.5. Batch size

This is defined as micro-batch size * replication factor * accumulation factor, and is the number of samples that contribute to each weight update.
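For example, with assumed values (not defaults):

micro_batch_size    = 16   # samples per graph execution on one device
replication_factor  = 4    # graphs running in parallel
accumulation_factor = 8    # micro-batches accumulated before each update

batch_size = micro_batch_size * replication_factor * accumulation_factor
print(batch_size)          # 512 samples per weight update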

11.6. Batches per step

The number of batches to run in a single call to Session::run.

11.7. Step size

This is defined as batch size * batches per step, and is the number of samples consumed per step.
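Continuing the assumed example values from above:

batch_size       = 512   # micro_batch_size * replication_factor * accumulation_factor
batches_per_step = 10    # batches per Session::run call

step_size = batch_size * batches_per_step
print(step_size)         # 5120 samples per step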

11.8. Input data shape

Inputs to a session.run() call are read in with the assumption that data is arranged in the shape:

[batches_per_step, accl_factor, repl_factor, micro_batch_size, *sample_shape]

However, there is no constraint on the shape of the input array, other than that it contains the correct number of elements.
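For example, the following NumPy sketch (with assumed example values) arranges the data for one step in this layout and shows that a flat array with the same number of elements is equally valid:

import numpy as np

batches_per_step, accl_factor, repl_factor = 10, 8, 4   # example values only
micro_batch_size, sample_shape = 16, (3, 32, 32)

# Data laid out in the order the session consumes it.
data = np.zeros(
    (batches_per_step, accl_factor, repl_factor, micro_batch_size, *sample_shape),
    dtype=np.float32,
)

# A flat array with the same total number of elements is also accepted.
flat = data.reshape(-1)
assert flat.size == (batches_per_step * accl_factor * repl_factor
                     * micro_batch_size * int(np.prod(sample_shape)))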

11.9. Virtual graph

Subdivision of a graph onto a subset of IPU tiles. While virtualGraphId in PopART refers to the graph associated with an IPU, a virtual graph can be subdivided further into the IO and Compute tile sets.

11.10. Off-chip streaming memory

Large pool of memory not located on the IPU that can be used to offload tensors from the IPU. Tensor location settings can be used to specify which tensors should be offloaded. Decreases on-chip memory usage.

11.11. RTS (replicated tensor sharding)

Eliminates storage and compute redundancy by sharding a weight, optimizer state or accumulator tensor equally across N data-parallel replicas. When a replica requires the full tensor, it is reassembled with ReplicatedAllGatherOp. Increases performance, especially in conjunction with off-chip streaming memory, and decreases on-chip memory usage.
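As a conceptual illustration only (plain NumPy, not the PopART implementation; across real replicas the gather is performed by ReplicatedAllGatherOp):

import numpy as np

N = 4                                  # number of data-parallel replicas
optimizer_state = np.arange(16.0)      # tensor that would otherwise be stored N times

# Shard: each replica stores only 1/N of the tensor.
shards = np.split(optimizer_state, N)  # replica i holds shards[i]

# Gather: when a replica needs the full tensor, all shards are collected
# and concatenated.
gathered = np.concatenate(shards)
assert np.array_equal(gathered, optimizer_state)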