Sharding: a model parallel approach

A commonly used technique for porting models to the IPU is sharding. Sharding comes from the model parallel paradigm: the graph is partitioned into separate components, or shards, each deployed to a separate compute device. In the current context, the shards are deployed to separate IPUs. A model that oversubscribes the memory of a single IPU can therefore be deployed across multiple IPUs by breaking the network into smaller components and passing the output activations of one shard to the input tensors of the next.

The diagram below is taken from the Targeting the IPU from TensorFlow guide:

Figure: The sharding concept (_images/sharding_concept.png)

There are a number of ways to apply sharding, and the Targeting the IPU from TensorFlow guide should be the first point of reference. Given here is a much abridged code example based on the script simple_sharding.py from Graphcore’s examples repository on GitHub. It shows both manual and automatic sharding:

import numpy as np
import tensorflow as tf
from tensorflow.python.ipu import autoshard, scopes, utils
...

# With sharding, all placeholders MUST be explicitly placed on the CPU device:
with tf.device("cpu"):
    pa = tf.placeholder(np.float32, [2], name="a")
    pb = tf.placeholder(np.float32, [2], name="b")
    pc = tf.placeholder(np.float32, [2], name="c")


# Put part of the computation on shard 0 and part on shard 1 for manual sharding
def manual_sharding(pa, pb, pc):
    with scopes.ipu_shard(0):
        o1 = pa + pb
    with scopes.ipu_shard(1):
        o2 = pa + pc
        out = o1 + o2
        return out


# Use automatic sharding to determine the best points at which to split the network
def auto_sharding(pa, pb, pc):
    with autoshard.ipu_autoshard():
        o1 = pa + pb
        o2 = pa + pc
        out = o1 + o2
        return out


def my_graph(pa, pb, pc):
    with tf.device("/device:IPU:0"):
        # opts comes from the command-line argument parsing elided above
        if opts.autoshard:
            result = auto_sharding(pa, pb, pc)
        else:
            result = manual_sharding(pa, pb, pc)
        return result

...
cfg = utils.auto_select_ipus(cfg, NUMBER_OF_SHARDS)

As the code sample shows, in the manual sharding example the graph is split by placing scopes.ipu_shard contexts at the chosen split points. In the automatic sharding example, the graph as a whole is wrapped in the ipu_autoshard context and the API determines where to split it. Please refer to the simple_sharding.py script for further implementation details.
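To show how this fits together end to end, below is a minimal sketch of how the sharded graph could be configured and run. It reuses the placeholders, my_graph and utils from the snippet above and follows the same TF1-era Graphcore API (utils.create_ipu_config and utils.configure_ipu_system); exact names may vary between SDK versions, and NUMBER_OF_SHARDS and the feed values are illustrative assumptions rather than part of the original script:

# A minimal sketch: request one IPU per shard, configure the system and
# run the sharded graph. NUMBER_OF_SHARDS and the feed values are
# illustrative assumptions.
NUMBER_OF_SHARDS = 2

# Build the graph; my_graph places the computation on the IPU device and
# applies manual or automatic sharding internally.
out = my_graph(pa, pb, pc)

# Request enough IPUs for all shards and configure the IPU system.
cfg = utils.create_ipu_config()
cfg = utils.auto_select_ipus(cfg, NUMBER_OF_SHARDS)
utils.configure_ipu_system(cfg)

# Feed the CPU-resident placeholders and run the sharded graph.
with tf.Session() as sess:
    fd = {pa: [1.0, 1.0], pb: [0.0, 1.0], pc: [1.0, 5.0]}
    print(sess.run(out, feed_dict=fd))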

Limitations of sharding

Sharding simply partitions the model and distributes it across multiple IPUs. The shards execute in series, so the computing resources of the IPUs cannot be fully utilised. As shown in the figure below, the model is partitioned into four parts and executed on four IPUs in series, with only one IPU working at any one time.

Figure: Sharding profile (_images/sharding_profile.png)

Sharding is a valuable method for use cases that need more control when combined with replication, for example random forests, federated averaging and co-distillation. It can also be useful when developing or debugging a model. However, for the best performance, pipelining should be used in most cases.