8. PopLibs examples

8.1. Dynamic slicing and updating

Poplar accesses data through multidimensional tensor views onto underlying variables. Static slices of tensors can be viewed through various tensor methods, for example:
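
(A sketch for illustration, not part of the original example; t is assumed to be an existing 2048x4x10 tensor.)

Tensor row   = t[0];             // index dim 0: shape 4x10
Tensor inner = t.slice(2, 4, 1); // elements [2, 4) of dim 1: shape 2048x2x10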

When a slice is required at a position that is not known until run time, dynamicSlice() and dynamicUpdate() can be used. These operate on statically sized base and slice tensors and allow the slice tensor to be selected from an arbitrary offset in the base tensor in the specified dimensions. The sub-tensor will have the same rank as the original tensor, but the sliced dimensions will be smaller.

When a number of slices/updates are required from a single base tensor, multiSlice() and multiUpdate() can be used instead. This is equivalent to a loop performing multiple slice or update operations, but is more efficient as parallel operations can be performed.
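
For illustration, the equivalent explicit loop might look like the sketch below (not taken from the PopLibs code; numSlices and tOffsets are hypothetical, with tOffsets holding one row of offsets per wanted slice, and tBase, sliceDims and sliceSizes as in the examples that follow):

// Conceptual loop of single dynamicSlice() calls, one per wanted offset.
// multiSlice() performs the same lookups as one parallel operation and
// returns them stacked in a new outermost dimension.
std::vector<Tensor> slices;
for (unsigned i = 0; i != numSlices; ++i) {
  // tOffsets[i] is a rank-1 tensor holding the offset(s) for slice i.
  slices.push_back(popops::dynamicSlice(graph, tBase, tOffsets[i], sliceDims,
                                        sliceSizes, prog, {dc, "slice"}));
}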

All these functions operate on physically separate source and destination Variables; each slice/update transfers data between the two. This can lead to exchange of a large amount of data between tiles as part of the operation.

The examples below allocate tensors using the PopLibs createSliceableTensor() and createSliceTensor() functions, which generally map tensors so that slicing operations are efficient. Tensors created by any other method can also be sliced/updated, but care is required to avoid expensive exchange of data.

All these slicing functions support multidimensional tensors and allow slicing in multiple dimensions. It is rare for more than one dimension to be sliced in practice; the examples slice a single dimension for simplicity (a two-dimension sketch is shown below for reference).
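
For reference, a slice over two dimensions might look like this sketch (not part of the original examples; tBase is a 2048x4x10 tensor as in the next section and tOffsets2 is assumed to be a 2-element UNSIGNED_INT offsets tensor, one element per sliced dimension):

const std::vector<std::size_t> dims2{0, 2};
const std::vector<std::size_t> sizes2{2, 4};
// Slice dims 0 and 2 in one call; dim 1 is untouched, so for a 2048x4x10 base
// the result is a 2x4x4 sub-tensor.
Tensor tSub2 = popops::dynamicSlice(graph, tBase, tOffsets2, dims2, sizes2,
                                    prog, {dc, "slice2d"});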

The examples use INT data for illustration; the principle is the same for any type.

8.1.1. Dynamic slice

Here we extract a sub-tensor from a larger tensor. In this example we’re slicing the outermost dimension, [0]. The base tensor is an input to dynamicSlice(); in this example it has shape 2048x4x10. The slice is in dimension 0 and of size 2, so the output sub-tensor will be 2x4x10.

const std::vector<std::size_t> baseShape{2048, 4, 10};
// We're going to take a slice in dim0, of size 2. So the slice will be shape
// 2x4x10.
const std::vector<std::size_t> sliceDims{0};
const std::vector<std::size_t> sliceSizes{2};
Tensor tBase = popops::createSliceableTensor(graph, INT, baseShape, sliceDims,
                                             sliceSizes, 4, {dc, "tBase"});
Tensor tWantedOffsets = graph.addVariable(UNSIGNED_INT, {sliceDims.size()},
                                          VariableMappingMethod::LINEAR);
Tensor tSub = popops::dynamicSlice(graph, tBase, tWantedOffsets, sliceDims,
                                   sliceSizes, prog, {dc, "slice"});
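
The offset tensor needs a value before the program runs; one way, as used in the dynamic update example below, is to give it an initial value (this line is not part of the original listing):

// Assumed setup: slice from offset 1, so elements [1] and [2] of tBase are
// selected in dimension 0.
graph.setInitialValue(tWantedOffsets, 1u);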

Running the above program with tWantedOffsets[0] set to 1, tSub would contain the same elements as this static slice (where hOffset is 1):

Tensor staticSlice =
    tBase.slice(hOffset, hOffset + sliceSizes.front(), sliceDims.front());

The shapes can be seen via

std::cout << "Base shape: " << tBase.shapeToString() << "\n";
std::cout << "Slice shape: " << tSub.shapeToString() << "\n";

And the code exercised by adding

prog.add(PrintTensor("wantedOffsets", tWantedOffsets));
prog.add(PrintTensor("slice", tSub));

// Initialise the base tensor so that each index occupies its own decimal
// digits:
//   digits [0:1] are the index in dim(2), the innermost dimension,
//   digit [2] is the index in dim(1),
//   digits [3:] are the index in dim(0).
std::vector<int> hIn;
hIn.reserve(tBase.numElements());
for (unsigned i = 0; i != tBase.dim(0); ++i) {
  for (unsigned j = 0; j != tBase.dim(1); ++j) {
    for (unsigned k = 0; k != tBase.dim(2); ++k) {
      hIn.emplace_back(i * 1000 + j * 100 + k);
    }
  }
}

graph.createHostWrite("in", tBase);
graph.createHostRead("out", tSub);
Engine e(graph, prog);
e.load(device);

e.writeTensor("in", ArrayRef(hIn));
e.run();

For an offset of 1 this gives the following output, which contains elements [1] and [2] of tBase:

Base shape: {2048,4,10}
wantedOffsets: [1]
Slice shape: {2,4,10}
slice: [
 [
  [1000 1001 1002 1003 1004 1005 1006 1007 1008 1009]
  [1100 1101 1102 1103 1104 1105 1106 1107 1108 1109]
  [1200 1201 1202 1203 1204 1205 1206 1207 1208 1209]
  [1300 1301 1302 1303 1304 1305 1306 1307 1308 1309]
 ]
 [
  [2000 2001 2002 2003 2004 2005 2006 2007 2008 2009]
  [2100 2101 2102 2103 2104 2105 2106 2107 2108 2109]
  [2200 2201 2202 2203 2204 2205 2206 2207 2208 2209]
  [2300 2301 2302 2303 2304 2305 2306 2307 2308 2309]
 ]
]

8.1.2. Dynamic update

Here we update part of a large base tensor from a smaller sub-tensor. Again the base tensor is 2048x4x10; in this case it is updated, so it is both an input and an output. As above we’re slicing in dimension 0 with a slice of size 2, so the slice tensor shape is 2x4x10. We’re going to zero those elements.

const std::vector<std::size_t> baseShape{2048, 4, 10};
const std::vector<std::size_t> sliceDims{0};
const std::vector<std::size_t> sliceSizes{2};
Tensor tBase = popops::createSliceableTensor(graph, INT, baseShape, sliceDims,
                                             sliceSizes, 4, {dc, "tBase"});

createSliceTensor() creates tensors that can be used both by dynamicUpdate() and by multiUpdate(). This means that it creates tensors with an additional outer dimension. In this example we’re only doing a single slice so we discard the outer dimension.

Tensor tSub = popops::createSliceTensor(graph, tBase, sliceDims, sliceSizes,
                                        1, {dc, "tSub"}).squeeze({0});
Tensor tUpdateOffsets = graph.addVariable(UNSIGNED_INT, {sliceDims.size()},
                                          VariableMappingMethod::LINEAR);
popops::zero(graph, tSub, prog, {dc, "zero tSub"});

// Set the dynamic offset to 1. The slice is size 2, so we're zeroing elements
// 1 and 2 in dim 0.
const unsigned hOffset = 1;
graph.setInitialValue(tUpdateOffsets, hOffset);
popops::dynamicUpdate(graph, tBase, tSub, tUpdateOffsets, sliceDims,
                      sliceSizes, prog, {dc, "update"});

Running the above program would give tBase the same elements as doing:

prog.add(Copy(tSub,
              tBase.slice(hOffset, hOffset + sliceSizes.back(),
                          sliceDims.back()),
              false, {dc, "static zero"}));

The shapes can be seen via

std::cout << "Base shape: " << tBase.shapeToString() << "\n";
std::cout << "Slice shape: " << tSub.shapeToString() << "\n";

prog.add(PrintTensor("updateOffsets", tUpdateOffsets));
prog.add(PrintTensor("updated base begins:", tBase));

And the code exercised by adding

std::vector<int> hIn;
hIn.reserve(tBase.numElements());
for (unsigned i = 0; i != tBase.dim(0); ++i) {
  for (unsigned j = 0; j != tBase.dim(1); ++j) {
    for (unsigned k = 0; k != tBase.dim(2); ++k) {
      hIn.emplace_back(i * 1000 + j * 100 + k);
    }
  }
}

graph.createHostWrite("in", tBase);
graph.createHostRead("out", tSub);
Engine e(graph, prog);
e.load(device);

e.writeTensor("in", ArrayRef(hIn));
e.run();

For an offset of 1 this zeroes elements 1 and 2 of base:

Base shape: {2048,4,10}
Slice shape: {2,4,10}
updateOffsets: [1]
updated base begins: [
 [
  [      0       1       2 ...       7       8       9]
  [    100     101     102 ...     107     108     109]
  [    200     201     202 ...     207     208     209]
  [    300     301     302 ...     307     308     309]
 ]
 [
  [      0       0       0 ...       0       0       0]
  [      0       0       0 ...       0       0       0]
  [      0       0       0 ...       0       0       0]
  [      0       0       0 ...       0       0       0]
 ]
 [
  [      0       0       0 ...       0       0       0]
  [      0       0       0 ...       0       0       0]
  [      0       0       0 ...       0       0       0]
  [      0       0       0 ...       0       0       0]
 ]
 ...

8.1.3. MultiSlice (embedding lookup)

multiSlice() (and multiUpdate()) are functions that have been optimised to perform multiple slice and update operations efficiently. They only support slicing and updating 2D tensors in the outermost dimension, and are much more memory- and cycle-efficient than an explicit loop around the basic functions. Some of the improved performance is due to a planner which analyses different layout possibilities.

This example considers an embedding lookup as commonly used by language models, where each individual word (token) has a multidimensional representation. It uses a dictionary to convert each incoming token to an internal representation. In this example we’re using a dictionary of 32,768 words of 512 values each, giving a base tensor of 32768x512 elements. Each token selects a slice shaped 1x512 from the base. We look up 100 tokens (the maximum phrase length), so the sliced tensor is 100x1x512. The extra ‘1’ dimension is present because the interface allows each slice’s size in dim 0 to be bigger than 1, so in this case we can squeeze it out to give a final 100x512 tensor. (Note that the current implementation only supports a size of 1 here.)

The planner makes use of the known parameters to optimise the layout of the tensors involved in the slicing operation. We’re going to look up numWords slices in dim 0, each of size 1, so the sliced tensor will be shape 100x1x512.

const std::vector<std::size_t> baseShape{32768, 512};
const std::vector<std::size_t> sliceDims{0};
const std::vector<std::size_t> sliceSizes{1};
const unsigned numWords = 100;
auto plan = popops::embedding::plan(graph, INT, baseShape[0], baseShape[1],
                                    {numWords}, {});
Tensor tBase = popops::createSliceableTensor(
    graph, INT, baseShape, sliceDims, sliceSizes, plan, {}, {dc, "tBase"});
Tensor tTokens =
    popops::createIndicesTensor(graph, sliceDims, numWords, plan, {}, {dc, "tIndices"});

Tensor tSubRaw =
    popops::multiSlice(graph, tBase, tTokens, sliceDims, sliceSizes, prog,
                       plan, {}, {dc, "multiSlice"});

The multiSlice() interface allows each slice to have multiple elements in the sliced dimension, and this dimension is present in the sliced tensor. In this case our slices are size 1, so that dimension can be squeezed out.

Tensor tSub = tSubRaw.squeeze({1});

The shapes can be seen via

std::cout << "Base shape: " << tBase.shapeToString() << "\n";
std::cout << "Raw Slice shape: " << tSubRaw.shapeToString() << "\n";
std::cout << "Slice shape: " << tSub.shapeToString() << "\n";

And the code exercised by adding

prog.add(PrintTensor("initial token indices", tTokens.slice({0, 5})));
prog.add(PrintTensor("word embedding", tSub));

graph.createHostWrite("in", tBase);
graph.createHostWrite("tokens", tTokens);
graph.createHostRead("out", tSub);

std::vector<int> hIn;
hIn.reserve(tBase.numElements());
// Initialise the embedding table so that the token index is visible in every
// element: element j of entry i holds the value i * 1000 + j.
for (unsigned i = 0; i != baseShape[0]; ++i) {
  for (unsigned j = 0; j != baseShape[1]; ++j) {
    hIn.emplace_back(i * 1000 + j);
  }
}
Engine e(graph, prog);
e.load(device);
e.writeTensor("in", ArrayRef(hIn));

// Initialise the tokens to be the first n odd numbers.
std::vector<unsigned> hTokens;
for (unsigned i = 0; i != numWords; i++)
  hTokens.emplace_back(1 + 2 * i);
e.writeTensor("tokens", ArrayRef(Tokens));
e.run();

This gives the output:

Building graph for device with 1472 tiles
Base shape: {32768,512}
Raw Slice shape: {100,1,512}
Slice shape: {100,512}
initial token indices: [
 [ 1]
 [ 3]
 [ 5]
 [ 7]
 [ 9]
]
word embedding: [
 [  1000   1001   1002 ...   1509   1510   1511]
 [  3000   3001   3002 ...   3509   3510   3511]
 [  5000   5001   5002 ...   5509   5510   5511]
 ...
 [195000 195001 195002 ... 195509 195510 195511]
 [197000 197001 197002 ... 197509 197510 197511]
 [199000 199001 199002 ... 199509 199510 199511]
]
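
For completeness, the update direction mentioned at the start of this section follows the same pattern. A minimal sketch (not part of the original example), reusing tBase, tSubRaw, tTokens, the slice parameters and the plan from above:

// Write the looked-up rows back into the dictionary at the same token
// indices. multiUpdate() expects the slice tensor to keep the extra size-1
// slice dimension, so the un-squeezed tSubRaw is used rather than tSub.
popops::multiUpdate(graph, tBase, tSubRaw, tTokens, sliceDims, sliceSizes,
                    prog, plan, {}, {dc, "multiUpdate"});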