15. IPU host embeddings
An embedding table is a table in a model or compute graph that supports a lookup operation. For more details see the TensorFlow documentation on tf.nn.embedding_lookup.
On the IPU, large embeddings can be stored in host memory with the CPU performing the lookup operations (and update operations during training) in conjunction with the IPU. This functionality supports both inference and training.
During execution the IPU will synchronize with the host and send indices (and possibly update values) to the host CPU. The CPU will then perform the lookup or update operation in a callback operation before returning the result to the IPU. The IPU will then carry on execution.
Applications access this functionality through the
tensorflow.python.ipu.embedding_ops.HostEmbedding class and the
tensorflow.python.ipu.embedding_ops.create_host_embedding() helper
function. Optimisation of the host embedding is described in the
tensorflow.python.ipu.embedding_ops.HostEmbeddingOptimizerSpec
class, which currently only supports SGD with a constant learning rate.
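The host-side work described above can be pictured with a small, self-contained sketch. This is not the actual callback implementation: the function names `host_lookup` and `host_sgd_update` are hypothetical, the table is a plain Python list, and accumulating the gradients of repeated indices is an assumption made for illustration.

```python
def host_lookup(table, indices):
    """Return a copy of the rows of `table` selected by `indices`."""
    return [list(table[i]) for i in indices]


def host_sgd_update(table, indices, grads, learning_rate):
    """Apply a constant-learning-rate SGD step to the selected rows in place.

    Assumption: repeated indices accumulate, one gradient per occurrence.
    """
    for i, grad in zip(indices, grads):
        row = table[i]
        for j, g in enumerate(grad):
            row[j] -= learning_rate * g


# A toy table with 4 tokens and 2 elements per token.
table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Forward pass: the device sends indices, the host returns the rows.
rows = host_lookup(table, [3, 1, 3])

# Backward pass: the device sends the indices and one gradient per index.
host_sgd_update(table, [3, 1, 3], [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0]], 0.5)
```

In the real system these two steps run on the host CPU while the IPU waits at a synchronization point, which is why the lookup cost appears directly in the step time.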
Note
IPU host embeddings are not recommended for use in pipelines and will likely decrease the pipeline’s parallel efficiency.
15.1. Usage
IPU host embeddings rely on instances of the HostEmbedding class to
coordinate the communication between the host and device. This object is created
with a call to
tensorflow.python.ipu.embedding_ops.create_host_embedding(). The
created object is then passed to the user model where the
tensorflow.python.ipu.embedding_ops.HostEmbedding.lookup() method can
be called with a similar API to tf.nn.embedding_lookup.
Once the IPU host embedding has been created and used within the model, the
object must be “registered” with the session using the context manager created
by tensorflow.python.ipu.embedding_ops.HostEmbedding.register().
If the TensorFlow session is not run within this context, TensorFlow will not
configure the underlying Poplar engine correctly and the model execution will
fail.
15.2. Example
import numpy as np
import tensorflow as tf

from tensorflow.python.ipu import embedding_ops
from tensorflow.python.ipu import ipu_compiler
from tensorflow.python.ipu import ipu_infeed_queue
from tensorflow.python.ipu import loops
from tensorflow.python.ipu import cross_replica_optimizer
from tensorflow.python.ipu import scopes
from tensorflow.python.ipu import utils
from tensorflow.python.ipu import rnn_ops
from tensorflow.python import keras

path_to_file = keras.utils.get_file(
    'shakespeare.txt',
    'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
)

# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

# The unique characters in the file
vocab = sorted(set(text))

# Creating a mapping from unique characters to indices
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in text]).astype(np.int32)

sequence_length = 100
batch_size = 16
replication_factor = 2

#  Create training examples / targets
ds = tf.data.Dataset.from_tensor_slices(text_as_int)
ds = ds.batch(sequence_length, drop_remainder=True)
ds = ds.shuffle(batch_size * batch_size)
ds = ds.batch(batch_size, drop_remainder=True)
ds = ds.repeat()

# The host side queues
infeed_queue = ipu_infeed_queue.IPUInfeedQueue(
    ds, feed_name="infeed", replication_factor=replication_factor)

# Set the learning rate
lr = 0.0001

# Create a momentum optimiser for replication
optimizer = cross_replica_optimizer.CrossReplicaOptimizer(
    tf.train.MomentumOptimizer(lr, 0.99))

# Create a host embedding object
embedding = embedding_ops.create_host_embedding(
    "char_embedding",
    shape=[256, 256],
    dtype=tf.float32,
    partition_strategy="TOKEN",
    optimizer_spec=embedding_ops.HostEmbeddingOptimizerSpec(lr))


# PopnnGRU is time-major
def gru(partials):
  gru_ = rnn_ops.PopnnGRU(256)
  partial_t = tf.transpose(partials, [1, 0, 2])
  gru_outputs_t, _ = gru_(partial_t)
  return tf.transpose(gru_outputs_t, [1, 0, 2])


# The main model
def model(sequence):
  # Perform a lookup on the embedding
  partial = embedding.lookup(sequence)

  partial = gru(partial)
  partial = tf.reshape(partial, [partial.shape[0], -1])
  partial = tf.layers.dense(partial, 256)
  return tf.nn.softmax(partial)


# Compute the loss for a given batch of examples
def evaluation(sequence):
  # Use the last element of the sequence as the label to predict
  label = tf.slice(sequence, [0, sequence_length - 1], [-1, 1])
  sequence = tf.slice(sequence, [0, 0], [-1, sequence_length - 1])
  logits = model(sequence)
  return keras.losses.sparse_categorical_crossentropy(label, logits)


# Minimise the loss
def training(loss, sequence):
  loss = evaluation(sequence)
  mean_loss = tf.math.reduce_mean(loss)
  train = optimizer.minimize(loss)
  return mean_loss, train


num_iterations = 1000


# Loop over our infeed queue, training the model
def my_net():
  loss = tf.constant(0.0, shape=[])
  r = loops.repeat(num_iterations, training, [loss], infeed_queue)
  return r


# Compile the model
with scopes.ipu_scope('/device:IPU:0'):
  run_loop = ipu_compiler.compile(my_net, inputs=[])

# Configure the hardware
config = utils.create_ipu_config()
config = utils.auto_select_ipus(config, replication_factor)
utils.configure_ipu_system(config)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(infeed_queue.initializer)

  # Train the model for some iterations
  with embedding.register(sess):
    for i in range(25):
      l = sess.run(run_loop)
      print("Step " + str(i) + ", loss = " + str(l))
15.3. Experimental functionality: IPU embeddings in remote buffers
As an alternative to host embeddings, there is experimental functionality to store embedding tables in remote buffer memory (i.e. off-chip memory directly accessed by the IPU). In this case the IPU performs the lookup/update operations directly on the remote buffer memory and the host CPU is not involved.
In tensorflow.python.ipu.utils.create_ipu_config() there is an option
enable_experimental_remote_buffer_embedding. When this option is set to
True (defaults to False), the IPU host embedding implementation will be
globally changed to use remote buffer embeddings instead.
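As a configuration sketch, the option can be passed as a keyword argument when the IPU configuration is created. The fragment below reuses the configuration calls from the example in section 15.2. The replica count of 2 is only an illustrative value, and the fragment needs the Graphcore TensorFlow distribution to run.

```python
from tensorflow.python.ipu import utils

# Switch every IPU host embedding in the program to the experimental
# remote buffer implementation (the option defaults to False).
config = utils.create_ipu_config(
    enable_experimental_remote_buffer_embedding=True)

# Illustrative value: select 2 IPUs, one per replica.
config = utils.auto_select_ipus(config, 2)
utils.configure_ipu_system(config)
```

Because the switch is global, the model code that calls create_host_embedding() and lookup() stays the same.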
Note
This option is experimental, and may be changed or removed in future releases.
15.3.1. Partitioning strategies
When using IPU embeddings in remote buffers together with data-parallel replication, the embedding table is not duplicated for each replica. Instead, a single copy of the table is shared between replicas to make the most of available memory. However, each replica only has access to a distinct memory space so the table is partitioned into chunks between the replicas (this holds even on hardware platforms like the DSS-8440 server where IPUs share physical external memory).
The way the table is split between the memory attached to each replica
is determined by the partitioning strategy. Two partitioning strategies
are available: the token strategy and the encoding strategy. Each has
trade-offs and the choice of strategy will depend on the application.
The partition strategy is set via the partition_strategy keyword argument of
tensorflow.python.ipu.embedding_ops.create_host_embedding().
Token strategy
The token strategy partitions the embedding on the
token axis. There will be ceil(t/r) whole tokens on each replica,
where t is the token count and r is the replica count.
 
When this strategy is used, cross-replica operations are required to allow each replica to perform a lookup or update across the whole table (each replica’s portion of the whole embedding table is private to that replica). Below is the pseudo-code, with explicit types and static shapes, for how this is implemented:
// Pseudo-code assuming we have table size `t`, and replica count `r`.
f16[14, 64] global_lookup(
  local_table : f16[ceil(t/r), 64],
  global_indices : i32[14]
):
  // The unique replica ID for "this" replica.
  replica_id = i32[] get-replica-id

  // Distribute the indices to all devices.
  indices = all-gather(global_indices) : i32[r, 14]

  // Scale the indices down by the replication factor. Indices not meant for
  // this replica will map to a valid, but incorrect index.
  local_indices = indices / r : i32[r, 14]

  // Gather on the local embedding region.
  result = lookup(local_table, local_indices) : f16[r, 14, 64]

  // The mask of which indices are valid.
  mask = (indices % r) == replica_id : bool[r, 14]

  // Zero out the invalid regions of the result.
  result = select(result, 0, mask) : f16[r, 14, 64]

  // Reduce scatter sum the masked result tensor. The zeroed regions of the
  // result tensor ensure that invalid values are ignored and each replica has
  // the correct result.
  result = reduce-scatter-sum(result) : f16[1, 14, 64]

  // Reshape to the expected shape.
  return reshape(result), shape=[14, 64] : f16[14, 64]
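The pseudo-code above can be checked with a single-process simulation in plain Python, where nested lists stand in for the per-replica tensors and the collectives. This is a sketch under assumptions rather than the Poplar implementation: the function names `shard_by_token` and `token_strategy_lookup` are hypothetical, and the interleaved layout (token i on replica i % r at row i // r) is inferred from the `indices / r` and `indices % r` steps of the pseudo-code.

```python
import math


def shard_by_token(table, r):
    """Split `table` so that token i lives on replica i % r at row i // r."""
    rows_per_replica = math.ceil(len(table) / r)
    width = len(table[0])
    shards = [[[0.0] * width for _ in range(rows_per_replica)]
              for _ in range(r)]
    for i, row in enumerate(table):
        shards[i % r][i // r] = list(row)
    return shards


def token_strategy_lookup(shards, per_replica_indices):
    """Simulate the masked local lookup and reduce-scatter for all replicas."""
    r = len(shards)
    width = len(shards[0][0])
    # all-gather: every replica sees the indices of every replica.
    gathered = per_replica_indices

    # Each replica looks up every gathered index in its own shard, then
    # zeroes the rows for indices that it does not own.
    partial = []
    for replica_id in range(r):
        local = []
        for indices in gathered:
            rows = []
            for i in indices:
                row = list(shards[replica_id][i // r])
                if i % r != replica_id:
                    row = [0.0] * width
                rows.append(row)
            local.append(rows)
        partial.append(local)

    # reduce-scatter-sum: replica k receives the sum over replicas of slice k.
    results = []
    for k in range(r):
        summed = [[sum(partial[src][k][j][c] for src in range(r))
                   for c in range(width)]
                  for j in range(len(gathered[k]))]
        results.append(summed)
    return results


# 5 tokens with 3 elements each, shared between 2 replicas.
table = [[float(10 * i + c) for c in range(3)] for i in range(5)]
shards = shard_by_token(table, 2)
indices = [[0, 4, 3], [1, 1, 2]]
results = token_strategy_lookup(shards, indices)
```

Each entry of `results` matches a direct lookup of that replica's indices in the full table, because exactly one replica contributes a non-zero row for every index.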
Encoding strategy
The encoding strategy partitions the embedding on the encoding
axis. Each replica stores a 1/r share of every token, where r is the
replica count. This means that for a given token every replica stores
ceil(e/r) elements, where e is the element count for a single token.
 
When this strategy is used, cross-replica operations are required to allow each replica to perform a lookup or update across the whole table (each replica’s portion of the whole embedding table is private to that replica). Below is the pseudo-code, with explicit types and static shapes, for how this is implemented:
// Pseudo-code assuming we have table size `t`, replica count `r`, and
// encoding size `e`.
f16[14, e] global_lookup(
  local_table : f16[t, ceil(e/r)],
  global_indices : i32[14]
):
  // Distribute the indices to all devices.
  indices = all-gather(global_indices) : i32[r, 14]

  // Gather on the local embedding.
  result = lookup(local_table, indices) : f16[r, 14, ceil(e/r)]

  // Communicate the relevant parts of the embedding to their respective
  // replicas. This distributes the ith slice in the outermost dimension to
  // the ith replica.
  result = all-to-all(result, slice_dim=2, concat_dim=3) : f16[r, 14, ceil(e/r)]

  // Transpose the dimensions back into the correct order.
  result = transpose(result), permutation=[1, 0, 2] : f16[14, r, ceil(e/r)]

  // Flatten the innermost dimensions.
  result = flatten(result), begin=1, end=2 : f16[14, r*ceil(e/r)]

  // Slice off the excess padding on the encoding.
  return slice(result), dim=1, begin=0, end=e : f16[14, e]
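The same kind of single-process simulation illustrates the encoding strategy. Again this is only a sketch: `shard_by_encoding` and `encoding_strategy_lookup` are hypothetical names, nested lists stand in for the per-replica tensors, and zero-padding the last column chunk is an assumption that follows from the ceil(e/r) shard width and the final slice in the pseudo-code.

```python
import math


def shard_by_encoding(table, r):
    """Give each replica ceil(e/r) columns of every token, zero-padded."""
    e = len(table[0])
    chunk = math.ceil(e / r)
    shards = []
    for replica_id in range(r):
        begin = replica_id * chunk
        shard = []
        for row in table:
            piece = list(row[begin:begin + chunk])
            piece += [0.0] * (chunk - len(piece))  # pad the final chunk
            shard.append(piece)
        shards.append(shard)
    return shards


def encoding_strategy_lookup(shards, per_replica_indices, e):
    """Simulate the local lookup, all-to-all and final slice for all replicas."""
    r = len(shards)
    # all-gather: every replica sees the indices of every replica.
    gathered = per_replica_indices

    # Each replica looks up every gathered index in its own column slice.
    partial = [[[shards[src][i] for i in indices] for indices in gathered]
               for src in range(r)]

    # all-to-all: replica k receives slice k from every replica, then joins
    # the column chunks in replica order and slices off the padding.
    results = []
    for k in range(r):
        rows = []
        for j in range(len(gathered[k])):
            full = []
            for src in range(r):
                full.extend(partial[src][k][j])
            rows.append(full[:e])
        results.append(rows)
    return results


# 4 tokens with 5 elements each, shared between 3 replicas.
table = [[float(10 * i + c) for c in range(5)] for i in range(4)]
shards = shard_by_encoding(table, 3)
indices = [[0, 3], [2, 2], [1, 0]]
results = encoding_strategy_lookup(shards, indices, e=5)
```

Here every replica contributes a piece of every row, so the volume of data exchanged depends on the encoding width rather than on which tokens are requested.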
Choosing a strategy for your application
The choice of partitioning strategy is application dependent, and the most reliable way to choose is to profile the application with each strategy.
As a general rule, the token strategy is used when the encoding is much smaller than the token count. An example application for this would be language models where the vocabulary size is much larger than the encoding.
Conversely, the encoding strategy is used when the token count is small and the encoding is large enough to be split. This avoids a large number of very small communications. An example application for this would be game playing models, where a small number of available actions are encoded in an embedding.
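To make the rule of thumb concrete, the per-replica shard shapes can be computed from the table shapes used in the pseudo-code (ceil(t/r) by e for the token strategy, t by ceil(e/r) for the encoding strategy). The helpers `shard_shapes` and `padding_fraction` are hypothetical, and the table sizes are made-up illustrative values rather than measurements. Shapes alone do not predict performance, so profiling remains the deciding step.

```python
import math


def shard_shapes(t, e, r):
    """Return the per-replica table shape for each partitioning strategy."""
    return {
        "token": (math.ceil(t / r), e),
        "encoding": (t, math.ceil(e / r)),
    }


def padding_fraction(t, e, r):
    """Fraction of the total stored elements that is padding, per strategy."""
    fractions = {}
    for name, (rows, cols) in shard_shapes(t, e, r).items():
        stored = rows * cols * r
        fractions[name] = (stored - t * e) / stored
    return fractions


# Illustrative language-model-like table: large vocabulary, small encoding.
language_model = shard_shapes(t=50000, e=64, r=4)

# Illustrative game-playing-like table: few actions, wide encoding.
game_model = shard_shapes(t=18, e=1024, r=4)
game_padding = padding_fraction(t=18, e=1024, r=4)
```

In the second case the token strategy leaves each replica with only a handful of rows and some padding, while the encoding strategy splits the wide rows evenly.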