
# 4.5. Tutorial 5: matrix-vector multiplication

In this tutorial you will:

• build a Poplar function and vertex to multiply a matrix by a vector. We recommend completing tutorial 3 before attempting this one.

• write the vertex code, which will compute the dot product between two given vectors.

• write the host code that will add several vertices to the graph, and connect them to appropriate tensors and tensor slices.

• optionally create a version of this program that runs on the IPU hardware.

A brief summary is included at the end of this tutorial. Do not hesitate to read through the Poplar and PopLibs User Guide to complement this tutorial. Use `tut5_matrix_vector/start_here` as your working directory.

## Setup

In order to run this tutorial on the IPU you will need to have a Poplar SDK environment enabled (see the Getting Started Guide for your IPU system).

You will also need a C++ toolchain that supports the C++11 standard. The build commands in this tutorial use GCC.

## The vertex code

The file `matrix-mul-codelets.cpp` contains the outline for the vertex code that will perform a dot product. Its input and output fields are already defined:

```
class DotProductVertex : public Vertex {
public:
  Input<Vector<float>> a;
  Input<Vector<float>> b;
  Output<float> out;
};
```

TO DO (1): write the vertex code

For the vertex to provide the calculation to Poplar the `compute` method of `DotProductVertex` needs to be completed. The method should calculate the dot product of the two input vectors `a` and `b`, and store the scalar result in `out`.

Algebraically, the dot product of two vectors $a = [a_0, a_1, \dots, a_n]$ and $b = [b_0, b_1, \dots, b_n]$ is computed as:

$a_0 * b_0 + a_1 * b_1 + \dots + a_n * b_n$

You may also find it useful to look again at the codelet in tutorial 3. Tip: within the `compute` method of the codelet, you can find the number of elements in the vector `a` with `a.size()`.
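As a point of reference, the formula above can be sketched in plain C++ on the host. This is not the codelet itself: `dotProduct` is an illustrative name, and `std::vector<float>` stands in for the vertex's `Input<Vector<float>>` fields. Inside `compute`, the same loop would read from `a` and `b`, and the result would be stored in `out` rather than returned.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ reference for a dot product (illustrative, not Poplar vertex code).
// Assumes a and b have the same number of elements.
float dotProduct(const std::vector<float> &a, const std::vector<float> &b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += a[i] * b[i]; // accumulate a_i * b_i
  }
  return sum;
}
```

For example, `dotProduct({1, 2, 3}, {4, 5, 6})` evaluates 1\*4 + 2\*5 + 3\*6.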

## The host code

The host code follows a similar pattern to the host code in the previous tutorials.

There are three tensors defined for the input matrix, input vector and output vector:

```
Tensor matrix = graph.addVariable(FLOAT, {numRows, numCols}, "matrix");
Tensor inputVector = graph.addVariable(FLOAT, {numCols}, "inputVector");
Tensor outputVector = graph.addVariable(FLOAT, {numRows}, "outputVector");
```

The function `buildMultiplyProgram` creates the graph and control program for performing the multiplication. The control program executes a single compute set called `mulCS`. This compute set consists of a vertex for each output element of the output vector (in other words, one vertex for each row of the input matrix).
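The "one vertex per row" decomposition can be illustrated with a host-only sketch in plain C++. Each iteration of the outer loop below plays the role of one vertex: it reads one matrix row and the whole input vector, and writes a single output element. The name `matVec` and the row-major flat storage are assumptions made for this illustration and are not taken from the tutorial sources.

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch (illustrative): multiply a row-major numRows x numCols
// matrix by a vector of length numCols.
std::vector<float> matVec(const std::vector<float> &matrix,
                          const std::vector<float> &in,
                          std::size_t numRows, std::size_t numCols) {
  std::vector<float> out(numRows, 0.0f);
  for (std::size_t row = 0; row < numRows; ++row) {
    // Each row is independent of the others, which is why every row can be
    // given its own vertex in a single compute set.
    float sum = 0.0f;
    for (std::size_t col = 0; col < numCols; ++col) {
      sum += matrix[row * numCols + col] * in[col];
    }
    out[row] = sum; // one output element per row
  }
  return out;
}
```

Because no iteration depends on another, nothing in the computation forces the rows to be processed in order.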

TO DO (2): add vertices to the graph

The next task in this tutorial is to write the host code to add the vertices to the compute set.

• Create a loop that performs `numRows` iterations, each of which will add a vertex to the graph. Hint: given a Poplar tensor `t` of shape `{numRows, numCols}`, you can get the size of the i-th dimension with `t.dim(i)`. So for example `numRows == t.dim(0)`.

• Use the `addVertex` function of the graph object to add a vertex of type `DotProductVertex` to the `mulCS` compute set. You may find it helpful to look again at how we added a vertex in tutorial 3.

• Use the last argument of `addVertex` to connect the fields of the vertex to the relevant tensor slices for that row. For example, suppose we want to create a vertex `v` that has an input `in` and an output `out`, and the graph already defines two tensors with the same names. We can then create the vertex as:

```
VertexRef v = graph.addVertex(computeSet, "v", {{"in", in}, {"out", out}});
```

In this case, each vertex takes as inputs one row of the matrix and the entire `in` tensor, and writes its result to a single element of the `out` tensor. You can use the index operator on the `matrix` tensor to select a row: the i-th row is `matrix[i]`.

• Map the newly created vertex to a tile. If `i` is the counter of the loop we’re in, we can map the vertex to tile `i`. Again, it may help to check how we did this in tutorial 3.

• Finally, use `graph.setPerfEstimate()` to specify the estimated number of cycles that this vertex will take to execute. This is needed only when using the `IPUModel` device, and the value matters only for profiling, so you can set an arbitrary integer.

After adding this code, you can build and run the example. A makefile is provided to compile the program. You can build it by running `make`.

As you can see from the host program code, you’ll need to provide two arguments to the execution command that specify the size of the matrix. For example, running the program as shown below will multiply a 40x50 matrix by a vector of size 50:

```
$ ./tut5_start_here 40 50
```
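The tutorial text does not show how the host program reads these two arguments. As a sketch only, they could be parsed and validated as below. `parseDims` is a hypothetical helper that is not part of the tutorial sources, and the provided host code may handle its arguments differently.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of the tutorial sources): read the matrix
// dimensions from a command line of the form "<program> <numRows> <numCols>".
std::pair<unsigned, unsigned> parseDims(int argc, const char *const argv[]) {
  if (argc != 3) {
    throw std::invalid_argument("usage: <program> <numRows> <numCols>");
  }
  // std::stoul throws std::invalid_argument if the text is not a number.
  const unsigned long rows = std::stoul(argv[1]);
  const unsigned long cols = std::stoul(argv[2]);
  if (rows == 0 || cols == 0) {
    throw std::invalid_argument("numRows and numCols must be positive");
  }
  return std::make_pair(static_cast<unsigned>(rows),
                        static_cast<unsigned>(cols));
}
```

Called from `main` as `parseDims(argc, argv)`, this would yield the pair (40, 50) for the command shown above.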

The host code includes a check for the correctness of the result.
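How that check is implemented is not shown here. One common approach, given as an assumption rather than as the tutorial's actual code, is to compute the expected result on the host and compare it element by element with the values read back from the device, allowing a small tolerance for floating-point rounding. The name `allClose` and the default tolerance are illustrative choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check (not the tutorial's code): returns true if the vectors
// have the same length and every element of `actual` is within a relative
// tolerance of the corresponding element of `expected`.
bool allClose(const std::vector<float> &expected,
              const std::vector<float> &actual, float relTol = 1e-5f) {
  if (expected.size() != actual.size()) {
    return false;
  }
  for (std::size_t i = 0; i < expected.size(); ++i) {
    // Scale the tolerance by the magnitude of the expected value, with a
    // floor of 1 so that values near zero are compared absolutely.
    const float scale = std::max(1.0f, std::fabs(expected[i]));
    if (std::fabs(expected[i] - actual[i]) > relTol * scale) {
      return false;
    }
  }
  return true;
}
```

A tolerance is used instead of exact equality because the order of floating-point additions on the device may differ from the order used on the host.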

## (Optional) Using the IPU

This section describes how to modify the program to use the IPU hardware.

• Copy `tut5.cpp` to `tut5_ipu_hardware.cpp` and open it in an editor.

• Add the following include lines near the top of the file:

```
#include <poplar/DeviceManager.hpp>
#include <algorithm>
```
• Remove the following lines which create an IPU model device:

```
IPUModel ipuModel;
Device device = ipuModel.createDevice();
```
• Add the following lines at the start of `main`:

```
// Create the DeviceManager which is used to discover devices
auto manager = DeviceManager::createDeviceManager();

// Attempt to attach to a single IPU:
auto devices = manager.getDevices(poplar::TargetType::IPU, 1);
std::cout << "Trying to attach to IPU\n";
auto it = std::find_if(devices.begin(), devices.end(), [](Device &device) {
  return device.attach();
});

if (it == devices.end()) {
  std::cerr << "Error attaching to device\n";
  return 1; // EXIT_FAILURE
}

auto device = std::move(*it);
std::cout << "Attached to IPU " << device.getId() << std::endl;
```

This gets a list of all devices consisting of a single IPU that are attached to the host and tries to attach to each one in turn until successful. This is a useful approach if there are multiple users on the host. It is also possible to get a specific device using its device-manager ID with the `getDevice` function.
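The "try each device in turn" behaviour can be illustrated without any hardware. The sketch below uses a hypothetical `FakeDevice` struct in place of the Poplar `Device` class, and writes the search as an explicit loop instead of `std::find_if`. It is a model of the control flow only, not of the Poplar API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device (not the Poplar Device class).
struct FakeDevice {
  unsigned id;
  bool busy;     // true if another user already holds the device
  bool attached; // set once attach() succeeds
  bool attach() {
    if (busy) {
      return false;
    }
    attached = true;
    return true;
  }
};

// Tries each device in order and stops at the first successful attach.
// Returns the index of the attached device, or -1 if none could be attached.
int attachFirstAvailable(std::vector<FakeDevice> &devices) {
  for (std::size_t i = 0; i < devices.size(); ++i) {
    if (devices[i].attach()) {
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

Stopping at the first success means that later devices are never attached, which leaves them free for other users on the same host.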

• Remove the line with `setPerfEstimate` in function `buildMultiplyProgram`:

```
graph.setPerfEstimate(v, 20);
```

This line gives an estimate of the number of cycles that the calculation will take for a given vertex. It is only needed when we use the IPU model and write custom vertices, like `DotProductVertex` in this tutorial. On IPU hardware the cycles are measured rather than estimated, should we decide to profile the program as in tutorial 4.

• Compile the program.

```
$ g++ --std=c++11 tut5_ipu_hardware.cpp -lpoplar -lpoputil -o tut5_ipu
```

Before running this you need to make sure that your system is configured correctly in order to attach to IPUs (see the Getting Started Guide for your IPU system).

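• Run the program to see the same results as when running on the IPU model. The executable is named `tut5_ipu` by the compile command above and, as before, takes the matrix dimensions as arguments:

```
$ ./tut5_ipu 40 50
```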

## Summary

In this tutorial, we wrote a program that performs a matrix-vector multiplication using a custom vertex. The codelet itself computes the dot product of two vectors. To multiply a matrix by a vector, we added several of these vertices to the Poplar graph, one for each row of the matrix, and connected each of them to the appropriate matrix row, the input vector and one element of the output vector. All of these vertices were added to the same compute set, which means they will execute in parallel on the IPU. We ran the program on the IPU model, and we also saw what changes are needed to run it on the IPU hardware.