11. Glossary
11.1. Sample
The smallest division of a data set.
11.2. Micro-batch size
The number of samples processed in a single execution of a graph on a single device.
Also referred to as the machine batch size.
The micro-batch shape, or the shape of input data as defined in the ONNX model, is therefore [micro_batch_size, *sample_shape].
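For example, with a micro-batch size of 4 and a sample shape of [3, 32, 32], the ONNX input has shape [4, 3, 32, 32]. A minimal sketch of declaring such an input with the PopART Python builder; the tensor name and dimensions are illustrative, not taken from this text:

    import popart

    micro_batch_size = 4
    sample_shape = [3, 32, 32]  # one sample: channels x height x width

    builder = popart.Builder()
    # The ONNX input is declared with the micro-batch shape,
    # [micro_batch_size, *sample_shape].
    x = builder.addInputTensor(
        popart.TensorInfo("FLOAT", [micro_batch_size] + sample_shape))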
11.3. Replication factor
The number of graphs to be run in parallel over multiple devices. The weight gradients from each device will be accumulated before a weight update. Also referred to as “device replication factor” or “spatial replication factor”. This is sometimes called data-parallel execution.
11.4. Accumulation factor
The weight gradients will be accumulated over this number of micro-batches in series before a weight update. Also referred to as “temporal replication factor”.
Accumulation can be thought of as doing replication on a single device.
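As a rough sketch, both factors are configured through popart.SessionOptions; the option names below are assumed from the PopART Python API and should be checked against the version in use:

    import popart

    opts = popart.SessionOptions()

    # Replication factor: run two copies of the graph in parallel (data parallel).
    opts.enableReplicatedGraphs = True
    opts.replicatedGraphCount = 2

    # Accumulation factor: accumulate gradients over four micro-batches in series
    # before each weight update.
    opts.enableGradientAccumulation = True
    opts.accumulationFactor = 4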
11.5. Batch size
This is defined as micro-batch size * replication factor * accumulation factor.
This is the number of samples per weight update.
11.6. Batches per step
The number of batches to run in a single call to Session::run.
11.7. Step size
This is defined as batch size * batches per step.
This is the number of samples per step.
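A small worked example tying the last three definitions together (the numbers are arbitrary):

    micro_batch_size = 4      # samples per device per graph execution
    replication_factor = 2    # graphs running in parallel
    accumulation_factor = 8   # micro-batches accumulated before a weight update
    batches_per_step = 10     # batches per call to Session::run

    # Batch size: samples contributing to a single weight update.
    batch_size = micro_batch_size * replication_factor * accumulation_factor  # 64

    # Step size: samples consumed by a single call to Session::run.
    step_size = batch_size * batches_per_step  # 640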
11.8. Input data shape
Inputs to a session.run() call are read in with the assumption that data is arranged in the shape:
[batches_per_step, accl_factor, repl_factor, micro_batch_size, *sample_shape]
However, there is no constraint on the shape of the input array, except that it has the correct number of elements.
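For example, the data for one step can be prepared as a NumPy array in that layout; the dimension values below are illustrative, and feeding the array through popart.PyStepIO assumes a session and its anchors have already been created:

    import numpy as np

    batches_per_step = 10
    accl_factor = 8
    repl_factor = 2
    micro_batch_size = 4
    sample_shape = [3, 32, 32]

    # Samples laid out in the order PopART reads them in.
    data = np.zeros(
        [batches_per_step, accl_factor, repl_factor, micro_batch_size]
        + sample_shape,
        dtype=np.float32)

    # Only the total number of elements matters, so a flat array of the
    # same size is equally valid.
    flat_data = data.reshape(-1)

    # The array would then be passed to session.run via
    # popart.PyStepIO({"x": data}, anchors).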
11.9. Virtual graph
Subdivision of a graph onto a subset of IPU tiles. While virtualGraphId in PopART refers to the graph associated with an IPU, the virtual graph can be subdivided further into the tile sets IO and Compute.
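A sketch of manual placement onto virtual graphs with the PopART Python builder; the op names, tensor shapes and values are illustrative, and builder.virtualGraph / VirtualGraphMode.Manual are assumed from the PopART Python API:

    import numpy as np
    import popart

    builder = popart.Builder()
    x = builder.addInputTensor(popart.TensorInfo("FLOAT", [4, 16]))
    w = builder.addInitializedInputTensor(np.zeros([16, 16], np.float32))

    y = builder.aiOnnx.matmul([x, w])
    builder.virtualGraph(y, 0)   # place this op on the virtual graph for IPU 0

    z = builder.aiOnnx.relu([y])
    builder.virtualGraph(z, 1)   # place this op on the virtual graph for IPU 1

    # Manual placement also requires, when creating the session:
    # opts = popart.SessionOptions()
    # opts.virtualGraphMode = popart.VirtualGraphMode.Manual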
11.10. Off-chip streaming memory
Large pool of memory not located on the IPU that can be used to offload tensors from the IPU. Tensor location settings can be used to specify which tensors should be offloaded. Decreases on-chip memory usage.
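As a rough sketch, the tensor location settings on popart.SessionOptions can be used to offload, for example, the optimizer state; the class and attribute names below are assumed from the PopART Python API and should be verified against the version in use:

    import popart

    opts = popart.SessionOptions()

    # Keep optimizer state in off-chip streaming memory rather than in IPU memory.
    opts.optimizerStateTensorLocationSettings = popart.TensorLocationSettings(
        popart.TensorLocation(popart.TensorStorage.OffChip))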
11.11. RTS (replicated tensor sharding)
Eliminates storage and compute redundancy by sharding a weight, optimizer state or accumulator tensor equally across N data-parallel replicas. When the replicas require the full tensor, a ReplicatedAllGatherOp is used.
Increases performance, especially in conjunction with off-chip streaming memory, and decreases on-chip memory usage.
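As a rough sketch, RTS is requested through the same tensor location settings, with replicated tensor sharding enabled; the names below are assumed from the PopART Python API and should be verified against the version in use:

    import popart

    opts = popart.SessionOptions()
    opts.enableReplicatedGraphs = True
    opts.replicatedGraphCount = 4   # shard across N = 4 data-parallel replicas

    # Store the optimizer state off chip and shard it equally across replicas;
    # a ReplicatedAllGatherOp is inserted where the full tensor is required.
    opts.optimizerStateTensorLocationSettings = popart.TensorLocationSettings(
        popart.TensorLocation(popart.TensorStorage.OffChip,
                              popart.ReplicatedTensorSharding.On))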