12. Tensor locations

The memory in an IPU-machine is made up of In-Processor-Memory (memory on the IPU) and Streaming Memory (memory not on the IPU). For more details about the memory architecture of the IPU hardware, refer to the IPU Programmer’s Guide.

By default, tensors reside in the In-Processor-Memory of the IPU, but tensor location settings allow tensors to be offloaded to Streaming Memory when required, and allow tensors to be sharded across replicas in data parallel training.

Setting the tensor location does not interfere with overlapped IO settings (Section 11, Overlapping IO with compute), even though both of them can specify a tile set (TileSet) on which the tensor should reside when being loaded onto the In-Processor-Memory.

12.1. Streaming Memory

Streaming Memory is used to temporarily store tensors not immediately required by IPU computations. It allows larger models or batch sizes to fit on the IPU, but access to this larger and slower memory pool has to be infrequent and balanced with computation.

Whether a tensor is located in Streaming Memory (off-chip) or in In-Processor-Memory (on-chip) can be controlled by various options in SessionOptions:

import popart

opts = popart.SessionOptions()
opts.weightTensorLocationSettings.location = popart.TensorLocation(...)
opts.optimizerStateTensorLocationSettings.location = popart.TensorLocation(...)
opts.accumulatorTensorLocationSettings.location = popart.TensorLocation(...)
opts.activationTensorLocationSettings.location = popart.TensorLocation(...)

The class popart.TensorLocation can also be used to customise location settings for individual tensors.

opts.tensorLocationSettingsOverride[name] = popart.TensorLocation(...)

The TensorLocation(storage, loadTileSet, storageTileSet) settings object takes up to three arguments relevant for off-chip tensors:

  • TensorStorage storage:

    1. OnChip: Store the tensor in on-chip In-Processor-Memory. The default setting for all tensors. The tensor remains on the IPU.

    2. OffChip: Store the tensor in off-chip Streaming Memory when not immediately required by IPU computations. This option may not have any effect if the PopART IR decides that there is no sensible time-frame when the tensor could be scheduled for being copied off-chip.

  • TileSet loadTileSet: The set of tiles that stream the data from and to Streaming Memory.

    1. IO: Load data from Streaming Memory to the IO tiles first.

    2. Compute: Load data from Streaming Memory directly to the compute tiles.

  • TileSet storageTileSet: The set of tiles on which the tensor preferentially resides when on-chip. Does not have any effect if the loadTileSet is Compute.

    1. IO: Data should stay on IO tiles whenever possible.

    2. Compute: Data should move to compute tiles as soon as possible.

PopART will intelligently decide, based on the provided settings, when exactly a tensor will be moved between IO tiles, compute tiles and off-chip Streaming Memory.

If TileSet::IO is used in any location setting, a subset of the IPU tiles has to be set aside as IO tiles:

opts.numIOTiles = 128

12.2. Replicated tensor sharding

Replicated tensor sharding (RTS) is applicable to tensors that usually contain the same information on each replica. RTS eliminates redundant data storage when the full (unsharded) tensor does not need to be present on the IPU. If the full tensor is needed, a replicated AllGather operation is used to recombine the sharded tensor. Fully updated tensors that need to be sharded (and reduced) again require a replicated ReduceScatter operation.

RTS modifies existing optimizers in the model, and modifies or replaces the ReplicatedAllReduce which is typically applied to gradients in data parallel training.
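The relationship between these collectives can be illustrated with a small host-side model. The sketch below uses plain Python lists in place of per-replica tensors; the function names are hypothetical and it is not PopART or GCL code. It assumes the tensor length divides evenly by the number of replicas, and shows that a ReduceScatter followed by an AllGather yields the same result as an AllReduce.

```python
def all_reduce(per_replica):
    """Every replica ends up with the full element-wise sum."""
    total = [sum(vals) for vals in zip(*per_replica)]
    return [list(total) for _ in per_replica]


def reduce_scatter(per_replica):
    """Replica r ends up with only shard r of the element-wise sum."""
    n = len(per_replica)
    total = [sum(vals) for vals in zip(*per_replica)]
    shard = len(total) // n  # assumption: length divides evenly
    return [total[r * shard:(r + 1) * shard] for r in range(n)]


def all_gather(shards):
    """Every replica ends up with the concatenation of all shards."""
    full = [x for s in shards for x in s]
    return [list(full) for _ in shards]


# One gradient list per replica (two replicas, four elements each).
grads = [[1, 2, 3, 4], [10, 20, 30, 40]]
sharded = reduce_scatter(grads)
print(sharded)  # [[11, 22], [33, 44]]
assert all_gather(sharded) == all_reduce(grads)
```

With RTS, each replica only needs to keep its own shard between the two steps, which is where the memory saving comes from.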

In PopART, collective ReplicatedAllReduce operations are present in the transformed IR graph when the user has set an optimizer for the model and replication is enabled:

opts.enableReplicatedGraphs = True
opts.replicatedGraphCount = num_replicas

Only variable tensors that are assumed to be equal across replicas can be sharded. This includes the model weights and the optimizer states (for example momentums of stochastic gradient descent) in data parallel training configurations.

If only weights should be sharded, then you can set:

opts.weightTensorLocationSettings.minElementsForReplicatedTensorSharding = num_replicas
opts.weightTensorLocationSettings.location.replicatedTensorSharding = popart.ReplicatedTensorSharding.On

If optimizer states should be sharded in addition, then you can set:

opts.optimizerStateTensorLocationSettings.minElementsForReplicatedTensorSharding = num_replicas
opts.optimizerStateTensorLocationSettings.location.replicatedTensorSharding = popart.ReplicatedTensorSharding.On

The size of a sharded tensor on each IPU is smaller than that of the full tensor, but sharded tensors can be used normally on the host. For example, take a tensor with a shape of [5,2,3], which has 30 elements in total. If we shard it across four replicas, each replica holds a shard of size \(\lceil \frac{5 \cdot 2 \cdot 3}{4} \rceil = 8\). However, since there are only 30 elements, two replicas will contain 8 elements and the other two will contain 7 elements each, with the remaining element of their shards padded with a 0. Since all replicas share the same compiled binary, padded and unpadded sharded tensors are handled in the same way. When loading sharded tensors from the IPUs to the host, the sharded tensors are concatenated and the padding is removed (see gcl::CollectiveBalancedReorder).
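The arithmetic in this example can be reproduced with a short sketch. The helper below is hypothetical and only mirrors the numbers in the text; it is not how GCL lays out shards internally, and the choice of which replicas receive the shorter shards is an assumption made for illustration.

```python
import math


def shard_layout(shape, num_replicas):
    """Return (padded_shard_size, real_elements_per_replica, total_padding)."""
    num_elements = math.prod(shape)
    # Every replica allocates the same, rounded-up shard size.
    shard_size = math.ceil(num_elements / num_replicas)
    # Spread the real elements as evenly as possible; the first
    # `remainder` replicas hold one more real element than the rest
    # (assumed ordering, for illustration only).
    base, remainder = divmod(num_elements, num_replicas)
    real = [base + 1 if r < remainder else base for r in range(num_replicas)]
    # Padding is whatever is needed to fill all shards to shard_size.
    padding = shard_size * num_replicas - num_elements
    return shard_size, real, padding


print(shard_layout([5, 2, 3], 4))  # (8, [8, 8, 7, 7], 2)
```

For the [5,2,3] tensor this gives a shard size of 8 on each of the four replicas, 30 real elements in total, and 2 padding elements.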

12.3. RTS sharding domains and distributed instances

For distributed instances of a PopART program, it is recommended to launch the training application with PopRun. PopDist can then be used to configure the per-instance replication settings automatically:

import popdist.popart

# Let popdist handle distributed settings, such as:
# opts.enableReplicatedGraphs
# opts.replicatedGraphCount
# opts.enableDistributedReplicatedGraphs
# opts.globalReplicaOffset
# opts.globalReplicationFactor
popdist.popart.configureSessionOptions(opts)

For more information about PopRun and PopDist, refer to the PopDist and PopRun: User Guide, which includes details about the installation of Horovod if you are using the MPI communication protocol.

When using distributed instances across two or more Pods, the GW-Link transfer speeds (for both the IPU Mk1 and Mk2 architectures) are slower than the IPU-Link speed within the Pod. It is therefore beneficial to load replica sharded tensors from Streaming Memory and AllGather across the replicated instances within a Pod rather than across all replicas.

The sharding domain can be applied to types of tensors or individual tensors. Tensors that are linked together (for example the optimizer state, accumulator and weight being consumed by the same optimizer instance) should be configured with the same replicated tensor sharding domain.


The recommended configuration for sharding optimizer states with multiple Pods is:

# Number of local replicas
num_local_replicas = popdist.getNumLocalReplicas()
# Number of replicas in total
num_total_replicas = popdist.getNumTotalReplicas()

if num_total_replicas > num_local_replicas:
    # It would not make sense to shard fewer elements
    opts.optimizerStateTensorLocationSettings.minElementsForReplicatedTensorSharding = num_local_replicas
    # Only enable sharding on the optimizer state
    opts.optimizerStateTensorLocationSettings.location.replicatedTensorSharding = popart.ReplicatedTensorSharding.On

    # Set the sharding domain
    sharding_domain = popart.CommGroup(
        popart.CommGroupType.Consecutive, num_local_replicas)

    # Ensure all related tensors have the same sharding domain set
    opts.weightTensorLocationSettings.location.shardingDomain = sharding_domain
    opts.optimizerStateTensorLocationSettings.location.shardingDomain = sharding_domain
    opts.accumulatorTensorLocationSettings.location.shardingDomain = sharding_domain

These settings will apply to all weights, optimizer states and accumulators in the model.

CommGroup is used to set the sharding domain. The CommGroup class is composed of the CommGroupType enum, and the size of each group. Examples of CommGroup settings are:

  • popart.CommGroup(popart.CommGroupType.All, 0): Default, shard the tensor across all replicas and all instances. Currently not supported for multiple program instances, since each host instance requires the full tensor. If sharding across two instances, each host would only have access to half the (sharded) tensor.

  • popart.CommGroup(popart.CommGroupType.Consecutive, num_local_replicas): Shard the tensor across all replicas owned by a single instance. Each host instance has access to the complete variable tensor. The size of the domain currently has to match num_local_replicas, which means sharding across, for example, half the replicas managed by an instance is not supported.
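To make the Consecutive grouping concrete, the following sketch lists which global replica indices would fall into each sharding domain. It is a plain Python illustration with a hypothetical helper, not a PopART call, and it assumes replicas are numbered globally from 0 with each instance owning a contiguous range.

```python
def consecutive_groups(num_total_replicas, group_size):
    """Partition replica indices 0..N-1 into consecutive groups of group_size."""
    if group_size <= 0 or num_total_replicas % group_size != 0:
        raise ValueError("group_size must be positive and divide the replica count")
    return [
        list(range(start, start + group_size))
        for start in range(0, num_total_replicas, group_size)
    ]


# Two instances with four local replicas each (eight replicas in total).
print(consecutive_groups(8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In this example each sharding domain coincides with the replicas of one instance, so each host can reassemble the complete tensor from its own shards.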