8. PopLibs examples
8.1. Dynamic slicing and updating
Poplar accesses data through multidimensional tensor views onto underlying variables. Static slices of tensors can be viewed through various tensor methods, for example:

- The operator poplar::Tensor::operator[]()
- The function poplar::Tensor::slice()
- The function poplar::Tensor::index()
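As a brief illustration, these static views do not move any data. A minimal sketch (the variable name and shape are chosen for illustration only):

Tensor t = graph.addVariable(INT, {4, 6}, "t");
Tensor row = t[1];               // operator[]: shape {6}, row 1 of t
Tensor cols = t.slice(2, 5, 1);  // slice(): shape {4, 3}, columns [2, 5) of t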
When a slice at a variable position is required, dynamicSlice() and dynamicUpdate() can be used. These operate on statically sized base and slice tensors and allow the slice tensor to be selected from an arbitrary offset in the base tensor in the specified dimensions. The sub-tensor will have the same rank as the original tensor, but the sliced dimensions will be smaller.
When a number of slices/updates are required from a single base tensor, multiSlice() and multiUpdate() can be used instead. This is equivalent to a loop performing multiple slice or update operations, but is more efficient as parallel operations can be performed.
All these functions operate on physically separate source and destination Variables; each slice/update transfers data between the two. This can lead to exchange of a large amount of data between tiles as part of the operation.
The examples below allocate tensors using the PopLibs createSliceableTensor() and createSliceTensor() functions, which are generally able to map tensors for efficient slicing operations. Tensors created by any other method can also be sliced/updated, but care is required to avoid expensive exchange of data.
All these slicing functions support multidimensional tensors, and allow slicing in multiple dimensions. It is rare for more than one dimension to be sliced in practice; the examples below slice a single dimension for simplicity.
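For reference, slicing two dimensions at once follows the same pattern, with one offset per sliced dimension. A minimal sketch (the shapes and names here are illustrative and not used by the examples below):

// Slice dims 0 and 2 of a {16, 4, 32} base: 2 elements in dim 0 and 8 in
// dim 2, giving a {2, 4, 8} sub-tensor.
const std::vector<std::size_t> shape2d{16, 4, 32};
const std::vector<std::size_t> dims2d{0, 2};
const std::vector<std::size_t> sizes2d{2, 8};
Tensor base2d = popops::createSliceableTensor(graph, INT, shape2d, dims2d,
                                              sizes2d, 4, {dc, "base2d"});
// One offset per sliced dimension.
Tensor offsets2d = graph.addVariable(UNSIGNED_INT, {dims2d.size()},
                                     VariableMappingMethod::LINEAR);
Tensor sub2d = popops::dynamicSlice(graph, base2d, offsets2d, dims2d, sizes2d,
                                    prog, {dc, "slice2d"});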
The examples use INT data for illustration purposes; the principle is the same for any type.
8.1.1. Dynamic slice
Here we extract a sub-tensor from a larger tensor. In this example we're slicing the outermost dimension, [0]. The base tensor is an input to dynamicSlice(); in this example it has shape 2048x4x10. The slice is in dimension 0 and of size 2, so the output sub-tensor will be 2x4x10.
const std::vector<std::size_t> baseShape{2048, 4, 10};
// We're going to take a slice in dim0, of size 2. So the slice will be shape
// 2x4x10.
const std::vector<std::size_t> sliceDims{0};
const std::vector<std::size_t> sliceSizes{2};
Tensor tBase = popops::createSliceableTensor(graph, INT, baseShape, sliceDims,
sliceSizes, 4, {dc, "tBase"});
Tensor tWantedOffsets = graph.addVariable(UNSIGNED_INT, {sliceDims.size()},
VariableMappingMethod::LINEAR);
Tensor tSub = popops::dynamicSlice(graph, tBase, tWantedOffsets, sliceDims,
sliceSizes, prog, {dc, "slice"});
Running the above program with tWantedOffsets[0] set to an offset hOffset = 1, tSub would contain the same elements as:
Tensor staticSlice =
    tBase.slice(hOffset, hOffset + sliceSizes.front(), sliceDims.front());
The shapes can be seen via
std::cout << "Base shape: " << tBase.shapeToString() << "\n";
std::cout << "Slice shape: " << tSub.shapeToString() << "\n";
And the code exercised by adding
prog.add(PrintTensor("wantedOffsets", tWantedOffsets));
prog.add(PrintTensor("slice", tSub));
// Initialise the base tensor so that each decimal digit group encodes an index:
// digits [0:1] (units and tens) are the index in dim(2), the innermost dimension,
// digit [2] (hundreds) is the index in dim(1),
// digits [3:] (thousands and above) are the index in dim(0).
std::vector<int> hIn;
hIn.reserve(tBase.numElements());
for (unsigned i = 0; i != tBase.dim(0); ++i) {
  for (unsigned j = 0; j != tBase.dim(1); ++j) {
    for (unsigned k = 0; k != tBase.dim(2); ++k) {
      hIn.emplace_back(i * 1000 + j * 100 + k);
    }
  }
}
graph.createHostWrite("in", tBase);
graph.createHostRead("out", tSub);
Engine e(graph, prog);
e.load(device);
e.writeTensor("in", ArrayRef(hIn));
e.run();
For an offset of 1 this will give the following output, which is elements [1] and [2] of tBase:
Base shape: {2048,4,10}
wantedOffsets: [1]
Slice shape: {2,4,10}
slice: [
[
[1000 1001 1002 1003 1004 1005 1006 1007 1008 1009]
[1100 1101 1102 1103 1104 1105 1106 1107 1108 1109]
[1200 1201 1202 1203 1204 1205 1206 1207 1208 1209]
[1300 1301 1302 1303 1304 1305 1306 1307 1308 1309]
]
[
[2000 2001 2002 2003 2004 2005 2006 2007 2008 2009]
[2100 2101 2102 2103 2104 2105 2106 2107 2108 2109]
[2200 2201 2202 2203 2204 2205 2206 2207 2208 2209]
[2300 2301 2302 2303 2304 2305 2306 2307 2308 2309]
]
]
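Because the program also creates a host read for tSub ("out"), the sliced data can be copied back to the host after e.run(). A sketch, assuming the bounds-checked readTensor() overload that takes a pointer to the end of the destination buffer:

std::vector<int> hOut(tSub.numElements());
e.readTensor("out", hOut.data(), hOut.data() + hOut.size());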
8.1.2. Dynamic update
Here we update part of a large base tensor from a smaller sub-tensor. Again the base tensor is 2048x4x10; in this case it is updated, so it is both an input and an output. As above we're slicing in dimension 0 with a slice of size 2, so the slice tensor shape is 2x4x10. We're going to zero those elements.
const std::vector<std::size_t> baseShape{2048, 4, 10};
const std::vector<std::size_t> sliceDims{0};
const std::vector<std::size_t> sliceSizes{2};
Tensor tBase = popops::createSliceableTensor(graph, INT, baseShape, sliceDims,
sliceSizes, 4, {dc, "tBase"});
createSliceTensor() creates tensors that can be used both by dynamicUpdate() and by multiUpdate(). This means that it creates tensors with an additional outer dimension. In this example we're only doing a single slice, so we discard the outer dimension.
Tensor tSub = popops::createSliceTensor(graph, tBase, sliceDims, sliceSizes,
1, {dc, "tSub"}).squeeze({0});
Tensor tUpdateOffsets = graph.addVariable(UNSIGNED_INT, {sliceDims.size()},
VariableMappingMethod::LINEAR);
popops::zero(graph, tSub, prog, {dc, "zero tSub"});
// Set the dynamic offset to 1. The slice is of size 2 so we're zeroing
// elements [1] and [2].
const unsigned hOffset = 1;
graph.setInitialValue(tUpdateOffsets, hOffset);
popops::dynamicUpdate(graph, tBase, tSub, tUpdateOffsets, sliceDims,
sliceSizes, prog, {dc, "update"});
Running the above program would give tBase the same elements as doing:
prog.add(Copy(tSub,
tBase.slice(hOffset, hOffset + sliceSizes.back(),
sliceDims.back()),
false, {dc, "static zero"}));
The shapes can be seen via
std::cout << "Base shape: " << tBase.shapeToString() << "\n";
std::cout << "Slice shape: " << tSub.shapeToString() << "\n";
prog.add(PrintTensor("updateOffsets", tUpdateOffsets));
prog.add(PrintTensor("updated base begins:", tBase));
And the code exercised by adding
// Initialise the base tensor with the same per-dimension digit encoding as in
// the previous example.
std::vector<int> hIn;
hIn.reserve(tBase.numElements());
for (unsigned i = 0; i != tBase.dim(0); ++i) {
  for (unsigned j = 0; j != tBase.dim(1); ++j) {
    for (unsigned k = 0; k != tBase.dim(2); ++k) {
      hIn.emplace_back(i * 1000 + j * 100 + k);
    }
  }
}
graph.createHostWrite("in", tBase);
graph.createHostRead("out", tSub);
Engine e(graph, prog);
e.load(device);
e.writeTensor("in", ArrayRef(hIn));
e.run();
For an offset of 1 this zeroes elements [1] and [2] of the base tensor:
Base shape: {2048,4,10}
Slice shape: {2,4,10}
updateOffsets: [1]
updated base begins:: [
[
[ 0 1 2 ... 7 8 9]
[ 100 101 102 ... 107 108 109]
[ 200 201 202 ... 207 208 209]
[ 300 301 302 ... 307 308 309]
]
[
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
]
[
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
[ 0 0 0 ... 0 0 0]
]
...
8.1.3. MultiSlice (embedding lookup)
multiSlice() (and multiUpdate()) are functions that have been optimised to perform multiple slice and update operations efficiently. They only support slicing and updating 2D tensors in the inner dimension, and are much more memory- and cycle-efficient than an explicit loop around the basic functions. Some of the improved performance is due to a planner which analyses different layout possibilities.
This example considers an embedding lookup as commonly used by language models, where each individual word (token) has a multidimensional representation. It uses a dictionary to convert each incoming token to an internal representation. In this example we're using a dictionary of 32,768 words of 512 values each, giving a base tensor of 32768x512 elements. Each token selects a slice shaped 1x512 from the base. We look up 100 tokens (the maximum phrase length), so the sliced tensor is 100x1x512. The extra '1' dimension is present because the interface allows each slice's size in dim0 to be bigger than 1; in this case we can squeeze it out to give a final 100x512 tensor. (Note that the current implementation only supports a size of 1 here.)
The planner makes use of the known parameters to optimise the layout of the tensors involved in the slicing operation. We're going to look up numWords slices in dim 0, each of size 1, so the sliced tensor will be shape 100x1x512.
const std::vector<std::size_t> baseShape{32768, 512};
const std::vector<std::size_t> sliceDims{0};
const std::vector<std::size_t> sliceSizes{1};
const unsigned numWords = 100;
auto plan = popops::embedding::plan(graph, INT, baseShape[0], baseShape[1],
{numWords}, {});
Tensor tBase = popops::createSliceableTensor(
graph, INT, baseShape, sliceDims, sliceSizes, plan, {}, {dc, "tBase"});
Tensor tTokens =
popops::createIndicesTensor(graph, sliceDims, numWords, plan, {}, {dc, "tIndices"});
Tensor tSubRaw = popops::multiSlice(graph, tBase, tTokens, sliceDims,
sliceSizes, prog, plan, {}, {dc, "update"});
The multislice interface allows each slice to have multiple elements in the sliced dimension, and this dimension is present in the sliced tensor. In this case our slices are size 1 so that dimension can be squeezed out.
Tensor tSub = tSubRaw.squeeze({1});
The shapes can be seen via
std::cout << "Base shape: " << tBase.shapeToString() << "\n";
std::cout << "Raw Slice shape: " << tSubRaw.shapeToString() << "\n";
std::cout << "Slice shape: " << tSub.shapeToString() << "\n";
And the code exercised by adding
prog.add(PrintTensor("initial token indices", tTokens.slice({0, 5})));
prog.add(PrintTensor("word embedding", tSub));
graph.createHostWrite("in", tBase);
graph.createHostWrite("tokens", tTokens);
graph.createHostRead("out", tSub);
std::vector<int> hIn;
hIn.reserve(tBase.numElements());
// Initialise the embedding so that each element is its token (row) index
// multiplied by 1000, plus its position within the embedding.
for (unsigned i = 0; i != baseShape[0]; ++i) {
  for (unsigned j = 0; j != baseShape[1]; ++j) {
    hIn.emplace_back(i * 1000 + j);
  }
}
Engine e(graph, prog);
e.load(device);
e.writeTensor("in", ArrayRef(hIn));
// Initialise the tokens to be the first n odd numbers.
std::vector<unsigned> hTokens;
for (unsigned i = 0; i != numWords; i++)
  hTokens.emplace_back(1 + 2 * i);
e.writeTensor("tokens", ArrayRef(Tokens));
e.run();
This gives the output:
Building graph for device with 1472 tiles
Base shape: {32768,512}
Raw Slice shape: {100,1,512}
Slice shape: {100,512}
initial token indices: [
[ 1]
[ 3]
[ 5]
[ 7]
[ 9]
]
word embedding: [
[ 1000 1001 1002 ... 1509 1510 1511]
[ 3000 3001 3002 ... 3509 3510 3511]
[ 5000 5001 5002 ... 5509 5510 5511]
...
[195000 195001 195002 ... 195509 195510 195511]
[197000 197001 197002 ... 197509 197510 197511]
[199000 199001 199002 ... 199509 199510 199511]
]
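multiUpdate() follows the same pattern in the opposite direction, writing one slice into the base tensor for each index. A minimal sketch reusing the plan, base tensor and token indices from above (tUpdates is an illustrative name, not part of the example):

// One {1, 512} update slice per token, created with the same plan so that its
// layout matches the base tensor.
Tensor tUpdates = popops::createSliceTensor(graph, INT, baseShape, sliceDims,
                                            sliceSizes, numWords, plan, {},
                                            {dc, "tUpdates"});
popops::zero(graph, tUpdates, prog, {dc, "zero tUpdates"});
// Write the (zeroed) slices back into tBase at the offsets held in tTokens.
popops::multiUpdate(graph, tBase, tUpdates, tTokens, sliceDims, sliceSizes,
                    prog, plan, {}, {dc, "multiUpdate"});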