4.2. PopXL Custom Optimiser
Introduction
We saw in the PopXL basics tutorial that we can easily create graphs with an internal state using the addons.Module class, and we used modules to implement the dense layers of a simple neural network.
In this tutorial, we will learn how to implement a custom optimiser: the Adam optimiser. Many optimisers, like layers, must manage some persistent variables. To manage this internal state we will re-use the same programming pattern, with addons.Module as the base class for the optimiser.
Once you’ve finished this tutorial, you will:
be able to write your own custom optimiser for your models.
understand graph caching and how it helps subgraph reuse.
have used some of the built-in rules within the popxl.ops.var_updates module, which are useful for a variety of tasks including updating your optimiser’s internal state variables.
If you are unfamiliar with PopXL, you may want to try out the tutorial covering the basic concepts. You may also want to refer to the PopXL user guide.
Requirements
Install a Poplar SDK (version 2.6 or later) and source the enable.sh scripts for both PopART and Poplar as described in the Getting Started guide for your IPU system.
Install system dependencies: apt-get install -y $(< required_apt_packages.txt)
Create a Python virtual environment: python3 -m venv <virtual_env>
Activate the virtual environment: . <virtual_env>/bin/activate
Update pip: pip3 install --upgrade pip
Install requirements: pip3 install -r requirements.txt (this will also install popxl.addons).
In short:
python3 -m venv virtual_env
. virtual_env/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
To run the Jupyter Notebook version of this tutorial:
Install a Poplar SDK (version 2.6 or later) and source the enable.sh scripts for both PopART and Poplar as described in the Getting Started guide for your IPU system.
Create and activate a virtual environment as described above.
In the same virtual environment, install the Jupyter Notebook server: python -m pip install jupyter
Launch a Jupyter Server on a specific port, making sure the virtual environment is active: jupyter-notebook --no-browser --port <port number>
Connect via SSH to your remote machine, forwarding your chosen port: ssh -NL <port number>:localhost:<port number> <your username>@<remote machine>
For more details about this process, or if you need troubleshooting, see our guide on using IPUs from Jupyter Notebooks.
If using VS Code, Intellisense can help you understand the tutorial code. It will show function and class descriptions when hovering over their names and lets you easily jump to their definitions. Consult the VS Code setup guide to use Intellisense for this tutorial.
Imports
We start by importing all the modules we will need for this tutorial:
from typing import Dict, Mapping, Optional, Union
from functools import partial
import numpy as np
import torch
import torchvision
from tqdm import tqdm
import popxl
import popxl_addons as addons
import popxl.ops as ops
np.random.seed(42)
Defining the Adam optimiser
Below, we implement the Adam optimiser by deriving from the addons.Module class. The Adam class defines the update step for a single variable in its build method.
First we will see how to correctly deal with in-place operations, then we will define the update step using PopXL’s var_updates module, and finally we will test and inspect our optimiser on a single variable.
Managing in-place ops
The update is performed in-place on the weight
argument which contains the
model variable updated by the Adam optimiser. Because this operation is in-place, we pass the
argument as a TensorByRef
and use the @popxl.in_sequence()
decorator to
prevent the operations from being rearranged by the compiler. The rest of the
definition follows the same pattern used to add weights to our layers in the
PopXL basics tutorial.
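As a minimal sketch of the ordering issue (an illustration, not part of the tutorial code): an accumulator that is updated in-place must be updated before another operation reads it, but because the reading operation does not consume the first operation's output, only in_sequence (or an explicit topological constraint) guarantees that order.

# Hypothetical minimal module illustrating why `in_sequence` is needed.
# The second op reads `acc` but has no data dependency on the first op's
# output, so without `in_sequence` the scheduler could legally reorder them.
class InplacePair(addons.Module):
    @popxl.in_sequence()
    def build(self, w: popxl.TensorByRef, grad: popxl.Tensor):
        acc = self.add_variable_input(
            "acc", partial(np.zeros, w.shape), popxl.float32, by_ref=True
        )
        ops.var_updates.accumulate_moving_average_(acc, grad, f=0.9)  # acc updated in-place
        ops.scaled_add_(w, acc, b=-1e-3)  # must see the *updated* acc

The full Adam module below follows the same pattern, with more state.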
class Adam(addons.Module):
    # We need to specify `in_sequence` because many operations are in-place
    # and their order shouldn't be changed
    @popxl.in_sequence()
    def build(
        self,
        weight: popxl.TensorByRef,
        grad: popxl.Tensor,
        *,
        lr: Union[float, popxl.Tensor],
        beta1: Union[float, popxl.Tensor] = 0.9,
        beta2: Union[float, popxl.Tensor] = 0.999,
        eps: Union[float, popxl.Tensor] = 1e-5,
        weight_decay: Union[float, popxl.Tensor] = 0.0,
        first_order_dtype: popxl.dtype = popxl.float16,
        bias_correction: bool = True,
    ):
        # Gradient estimator for the variable `weight` - same shape as the variable
        first_order = self.add_variable_input(
            "first_order",
            partial(np.zeros, weight.shape),
            first_order_dtype,
            by_ref=True,
        )
        ops.var_updates.accumulate_moving_average_(first_order, grad, f=beta1)

        # Variance estimator for the variable `weight` - same shape as the variable
        second_order = self.add_variable_input(
            "second_order", partial(np.zeros, weight.shape), popxl.float32, by_ref=True
        )
        ops.var_updates.accumulate_moving_average_square_(second_order, grad, f=beta2)

        # Adam is a biased estimator: provide the step variable to correct bias
        step = None
        if bias_correction:
            step = self.add_variable_input(
                "step", partial(np.zeros, ()), popxl.float32, by_ref=True
            )

        # Calculate the weight increment with an Adam heuristic
        # Here we use the built-in `adam_updater`, but you can write your own.
        dw = ops.var_updates.adam_updater(
            first_order,
            second_order,
            weight=weight,
            weight_decay=weight_decay,
            time_step=step,
            beta1=beta1,
            beta2=beta2,
            epsilon=eps,
        )

        # in-place weight update: weight += (-lr)*dw
        ops.scaled_add_(weight, dw, b=-lr)
The Adam optimiser needs state to store the mean and uncentred variance (first and second moments) of the gradients. These need to be of type Variable, hence we add them with Module.add_variable_input, creating named inputs for them (first_order and second_order).
We used Module.add_variable_input in the PopXL introductory tutorial to add weights to our layers. However, in the Adam implementation you should notice a few differences.
We now have a @popxl.in_sequence() decorator on top of the build method. This forces all operations to be added in the exact order we define them, enforcing topological constraints between them. This is necessary here since most of the optimiser operations are in-place, hence their order of execution must be strictly preserved. Remember this whenever you have in-place operations.
The weight input is a popxl.TensorByRef: any change made to this variable will be automatically copied to the parent graph. See TensorByRef for more information.
Some parameters, such as the learning rate or weight decay, are defined as Union[float, popxl.Tensor]. If a parameter is provided as a simple float, it is “baked” into the graph, with no possibility of changing it at run time. Instead, if the parameter is a Tensor (or TensorSpec), it will appear as an input to the graph, which needs to be provided when calling the graph. If you plan to change a parameter (for example, because you have a learning rate schedule), this is the way to go; a minimal sketch of this follows below.
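To make this concrete, here is a small, hypothetical sketch (the names lr_stream and lr_t are illustrative, and adam_lr/adam_state_lr are created with a TensorSpec learning rate exactly as in the next section): when the learning rate is a Tensor input of the subgraph, the caller decides what to feed it, for instance streaming it from the host so that it can change on every session.run call.

# Hypothetical sketch, assuming `adam_lr`/`adam_state_lr` were created with
# lr=popxl.TensorSpec((), popxl.float32), as shown later in this tutorial.
lr_stream = popxl.h2d_stream((), popxl.float32, "lr")
lr_t = ops.host_load(lr_stream, "lr")
adam_lr.bind(adam_state_lr).call(var, grad, lr_t)
# At run time the host chooses the value, for example:
# session.run({..., lr_stream: np.float32(1e-3)})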
The rest of the logic is straightforward:
We update first_order, the mean estimator of the weight gradient.
We update second_order, the uncentred variance estimator of the weight gradient.
We optionally correct the estimators, since they are biased.
We compute the increment delta-weight dw.
We update the variable weight with scaled_add_ to implement the equation weight -= lr * dw.
Using the var_updates module
The ops.var_updates module contains several useful update rules (you can also create your own). In this example, we will use three of the built-in rules:
ops.var_updates.accumulate_moving_average_(average, new_sample, coefficient) updates average in-place with an exponential moving average rule: average = (coefficient * average) + ((1 - coefficient) * new_sample)
ops.var_updates.accumulate_moving_average_square_(average, new_sample, coefficient) updates average in-place in the same way, but uses the square of the sample.
ops.var_updates.adam_updater(...) returns the Adam increment dw required for the weight update. This is computed from the Adam internal state, which comprises the first and second moments.
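To see how these three rules compose, here is a plain NumPy sketch of a single Adam step (adam_step_reference is a hypothetical name, not part of the library). It assumes the built-in adam_updater follows the standard, optionally bias-corrected, Adam rule with an additive weight-decay term; the exact placement of epsilon and weight decay inside adam_updater may differ slightly, so treat this as a reference approximation rather than the library's implementation.

def adam_step_reference(
    weight, grad, first_order, second_order, step,
    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-5, weight_decay=0.0,
):
    # Bias correction uses the step count starting at 1, so increment first
    # (this mirrors the optional `step` variable passed to `adam_updater`).
    step = step + 1
    # ops.var_updates.accumulate_moving_average_: m = beta1*m + (1 - beta1)*g
    first_order = beta1 * first_order + (1 - beta1) * grad
    # ops.var_updates.accumulate_moving_average_square_: v = beta2*v + (1 - beta2)*g**2
    second_order = beta2 * second_order + (1 - beta2) * grad ** 2
    # Bias-corrected moment estimates
    m_hat = first_order / (1 - beta1 ** step)
    v_hat = second_order / (1 - beta2 ** step)
    # ops.var_updates.adam_updater: the increment dw (weight decay assumed additive here)
    dw = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * weight
    # ops.scaled_add_(weight, dw, b=-lr): weight <- weight - lr * dw
    weight = weight - lr * dw
    return weight, first_order, second_order, step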
Using our custom optimiser
Let’s inspect the optimiser graph and its use in a simple example.
ir = popxl.Ir(replication=1)

with ir.main_graph:
    var = popxl.variable(np.ones((2, 2)), popxl.float32)
    grad = popxl.variable(np.full((2, 2), 0.1), popxl.float32)

    # create graph and factories - float learning rate
    adam_facts, adam = Adam(cache=True).create_graph(var, var.spec, lr=1e-3)
    # create graph and factories - Tensor learning rate
    adam_facts_lr, adam_lr = Adam().create_graph(
        var, var.spec, lr=popxl.TensorSpec((), popxl.float32)
    )

    print("Adam with float learning rate\n")
    print(adam.print_schedule())
    print("\n Adam with tensor learning rate\n")
    print(adam_lr.print_schedule())

    # instantiate optimiser variables
    adam_state = adam_facts.init()
    adam_state_lr = adam_facts_lr.init()

    # optimization step for float lr: call the bound graph providing the
    # variable to update and the gradient
    adam.bind(adam_state).call(var, grad)
    # optimization step for tensor lr: call the bound graph providing the
    # variable to update, the gradient and the learning rate
    adam_lr.bind(adam_state_lr).call(var, grad, popxl.constant(1e-3))

ir.num_host_transfers = 1
session = popxl.Session(ir, "ipu_hw")

print("\n Before Adam update")
var_data = session.get_tensor_data(var)
state = session.get_tensors_data(adam_state.tensors)
print("Variable:\n", var)
print("Adam state:")
for name, data in state.items():
    print(name, "\n", state[name])

with session:
    session.run()

print("\n After Adam update")
var_data = session.get_tensor_data(var)
state = session.get_tensors_data(adam_state.tensors)
print("Variable:\n", var)
print("Adam state:")
for name, data in state.items():
    print(name, "\n", state[name])
Adam with float learning rate
Graph : Adam_subgraph(0)
(%1, %2, first_order=%3, second_order=%4, step=%5) -> () {
Accumulate.100 (%3 [(2, 2) float16], %2 [(2, 2) float32]) -> (%6 [(2, 2) float16])
Accumulate.101 (%4 [(2, 2) float32], %2 [(2, 2) float32]) -> (%7 [(2, 2) float32])
AdamUpdater.102 (%3 [(2, 2) float16], %4 [(2, 2) float32], %5 [() float32]) -> (%8 [(2, 2) float16])
ScaledAddLhsInplace.103 (%1 [(2, 2) float32], %8 [(2, 2) float16]) -> (%9 [(2, 2) float32])
}
Adam with tensor learning rate
Graph : Adam_subgraph(1)
(%1, %2, %3, first_order=%4, second_order=%5, step=%6) -> () {
Accumulate.104 (%4 [(2, 2) float16], %2 [(2, 2) float32]) -> (%7 [(2, 2) float16])
Accumulate.105 (%5 [(2, 2) float32], %2 [(2, 2) float32]) -> (%8 [(2, 2) float32])
AdamUpdater.106 (%4 [(2, 2) float16], %5 [(2, 2) float32], %6 [() float32]) -> (%9 [(2, 2) float16])
Neg.107 (%3 [() float32]) -> (%10 [() float32])
ScaledAddLhsInplace.108 (%1 [(2, 2) float32], %9 [(2, 2) float16], %10 [() float32]) -> (%11 [(2, 2) float32])
}
Before Adam update
Variable:
Tensor[t popxl.dtypes.float32 (2, 2)]
Adam state:
Tensor[first_order popxl.dtypes.float16 (2, 2)]
[[0. 0.]
[0. 0.]]
Tensor[second_order popxl.dtypes.float32 (2, 2)]
[[0. 0.]
[0. 0.]]
Tensor[step popxl.dtypes.float32 ()]
0.0
After Adam update
Variable:
Tensor[t popxl.dtypes.float32 (2, 2)]
Adam state:
Tensor[first_order popxl.dtypes.float16 (2, 2)]
[[0.009995 0.009995]
[0.009995 0.009995]]
Tensor[second_order popxl.dtypes.float32 (2, 2)]
[[9.9998715e-06 9.9998715e-06]
[9.9998715e-06 9.9998715e-06]]
Tensor[step popxl.dtypes.float32 ()]
1.0
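These values line up with the update rules above: with the default beta1=0.9 and beta2=0.999 and a gradient of 0.1, the first step should give first_order = (1 - 0.9) * 0.1 = 0.01 and second_order = (1 - 0.999) * 0.1² = 1e-5. The small deviations in the output (0.009995 and 9.9998715e-06) come from the float16 and float32 arithmetic used on the IPU, and step has advanced from 0.0 to 1.0. A quick host-side check (a sketch, nothing IPU-specific):

# Sanity check of the printed state against the moving-average formulas
beta1, beta2, g = 0.9, 0.999, 0.1
print((1 - beta1) * g)       # ~0.01;  in float16 on the IPU this shows as ~0.009995
print((1 - beta2) * g ** 2)  # ~1e-05; in float32 on the IPU this shows as ~9.9998715e-06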
MNIST with Adam
We can now refactor our MNIST example to incorporate the Adam optimiser. Note
that we need an optimiser for each variable: we first define the
optimiser_step
function which creates the graph for each variable and performs
a full weight update for all the variables. Since the Adam
module uses
cache=True
, if two graphs happen to be the same, the same graph will be re-used.
We will use a simple float learning rate (rather than Tensor
), since we don’t
plan to change its value during training.
def optimiser_step(
    variables,
    grads: Dict[popxl.Tensor, popxl.Tensor],
    optimiser: addons.Module,
    learning_rate: popxl.float32 = 1e-3,
):
    """
    Update all variables creating per-variable optimisers.
    """
    for name, var in variables.named_tensors.items():
        # Create optimiser and state factories for the variable
        opt_facts, opt_graph = optimiser.create_graph(
            var, var.spec, lr=learning_rate, weight_decay=0.0, bias_correction=False
        )
        state = opt_facts.init()
        # Bind the graph to its state and call it.
        # Both the state and the variables are updated in-place and are passed
        # by ref, hence after the graph is called they are updated.
        opt_graph.bind(state).call(var, grads[var])
We load the data, and define our network using exactly the same code as in the PopXL basics tutorial:
def get_mnist_data(test_batch_size: int, batch_size: int):
    training_data = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST(
            "~/.torch/datasets",
            train=True,
            download=True,
            transform=torchvision.transforms.Compose(
                [
                    torchvision.transforms.ToTensor(),
                    # mean and std computed on the training set.
                    torchvision.transforms.Normalize((0.1307,), (0.3081,)),
                ]
            ),
        ),
        batch_size=batch_size,
        shuffle=True,
        drop_last=True,
    )

    validation_data = torch.utils.data.DataLoader(
        torchvision.datasets.MNIST(
            "~/.torch/datasets",
            train=False,
            download=True,
            transform=torchvision.transforms.Compose(
                [
                    torchvision.transforms.ToTensor(),
                    torchvision.transforms.Normalize((0.1307,), (0.3081,)),
                ]
            ),
        ),
        batch_size=test_batch_size,
        shuffle=True,
        drop_last=True,
    )
    return training_data, validation_data
class Linear(addons.Module):
    def __init__(self, out_features: int, bias: bool = True):
        super().__init__()
        self.out_features = out_features
        self.bias = bias

    def build(self, x: popxl.Tensor) -> popxl.Tensor:
        # add a state variable to the module
        w = self.add_variable_input(
            "weight",
            partial(np.random.normal, 0, 0.02, (x.shape[-1], self.out_features)),
            x.dtype,
        )
        y = x @ w
        if self.bias:
            # add a state variable to the module
            b = self.add_variable_input("bias", partial(np.zeros, y.shape[-1]), x.dtype)
            y = y + b
        return y


class Net(addons.Module):
    def __init__(self, cache: Optional[addons.GraphCache] = None):
        super().__init__(cache=cache)
        self.fc1 = Linear(512)
        self.fc2 = Linear(512)
        self.fc3 = Linear(512)
        self.fc4 = Linear(10)

    def build(self, x: popxl.Tensor):
        x = x.reshape((-1, 28 * 28))
        x = ops.gelu(self.fc1(x))
        x = ops.gelu(self.fc2(x))
        x = ops.gelu(self.fc3(x))
        x = self.fc4(x)
        return x
The training code is almost unchanged from that of the PopXL basics tutorial; the only difference is that we now use our Adam class and the optimiser_step function as the optimiser, instead of the simple scaled_add_:
def train_program(batch_size, device, learning_rate):
    ir = popxl.Ir(replication=1)

    with ir.main_graph:
        # Create input streams from host to device
        img_stream = popxl.h2d_stream((batch_size, 28, 28), popxl.float32, "image")
        img_t = ops.host_load(img_stream)  # load data
        label_stream = popxl.h2d_stream((batch_size,), popxl.int32, "labels")
        labels = ops.host_load(label_stream, "labels")

        # Create forward graph
        facts, fwd_graph = Net().create_graph(img_t)

        # Create backward graph via autodiff transform
        bwd_graph = addons.autodiff(fwd_graph)

        # Initialise variables (weights)
        variables = facts.init()

        # Call the forward graph with call_with_info because we want to retrieve
        # information from the call site
        fwd_info = fwd_graph.bind(variables).call_with_info(img_t)
        x = fwd_info.outputs[0]  # forward output

        # Compute loss and starting gradient for backpropagation
        loss, dx = addons.ops.cross_entropy_with_grad(x, labels)

        # Set up a stream to retrieve loss values on the host
        loss_stream = popxl.d2h_stream(loss.shape, loss.dtype, "loss")
        ops.host_store(loss_stream, loss)

        # Retrieve activations from the forward graph
        activations = bwd_graph.grad_graph_info.inputs_dict(fwd_info)

        # Call the backward graph providing the starting value for
        # backpropagation and activations
        bwd_info = bwd_graph.call_with_info(dx, args=activations)

        # Adam optimiser, with cache
        grads_dict = bwd_graph.grad_graph_info.fwd_parent_ins_to_grad_parent_outs(
            fwd_info, bwd_info
        )
        optimiser = Adam(cache=True)
        optimiser_step(variables, grads_dict, optimiser, learning_rate)

    ir.num_host_transfers = 1
    return popxl.Session(ir, device), [img_stream, label_stream], variables, loss_stream
You will notice above that we created the Adam module using cache=True. This enables graph reuse, where possible, when calling optimiser.create_graph. For our optimiser, reuse happens when multiple variables have the same shape and data type; for example, fc2 and fc3 above both have (512, 512) weights, and fc1, fc2 and fc3 all have 512-element biases, so those variables can share optimiser graphs.
Now let’s run a training session.
Since we are using the Adam optimiser, we need to use a smaller learning rate than before.
train_batch_size = 8
test_batch_size = 80
device = "ipu_hw"
learning_rate = 1e-3
epochs = 1

training_data, test_data = get_mnist_data(test_batch_size, train_batch_size)

train_session, train_input_streams, train_variables, loss_stream = train_program(
    train_batch_size, device, learning_rate
)

num_batches = len(training_data)

with train_session:
    for epoch in range(1, epochs + 1):
        print(f"Epoch {epoch}/{epochs}")
        bar = tqdm(training_data, total=num_batches)
        for data, labels in bar:
            inputs: Mapping[popxl.HostToDeviceStream, np.ndarray] = dict(
                zip(train_input_streams, [data.squeeze().float(), labels.int()])
            )
            loss = train_session.run(inputs)[loss_stream]
            bar.set_description(f"Loss:{loss:0.4f}")
Epoch 1/1
Loss:0.0191: 100%|██████████| 7500/7500 [00:26<00:00, 284.61it/s]
# Retrieve the trained weights to use during inference
train_vars_to_data = train_session.get_tensors_data(train_variables.tensors)
Validation
As we did previously, to test our model we need to create an inference-only program and run it on the test dataset.
def test_program(test_batch_size, device):
    ir = popxl.Ir(replication=1)

    with ir.main_graph:
        # Inputs
        in_stream = popxl.h2d_stream((test_batch_size, 28, 28), popxl.float32, "image")
        in_t = ops.host_load(in_stream)

        # Create graphs
        facts, graph = Net().create_graph(in_t)

        # Initialise variables
        variables = facts.init()

        # Forward
        (outputs,) = graph.bind(variables).call(in_t)
        out_stream = popxl.d2h_stream(outputs.shape, outputs.dtype, "outputs")
        ops.host_store(out_stream, outputs)

    ir.num_host_transfers = 1
    return popxl.Session(ir, device), [in_stream], variables, out_stream
We create the test program and copy the trained weights to it:
test_session, test_input_streams, test_variables, out_stream = test_program(
    test_batch_size, device
)

train_vars_to_test_vars = train_variables.to_mapping(test_variables)
test_vars_to_data = {
    test_var: train_vars_to_data[train_var].copy()
    for train_var, test_var in train_vars_to_test_vars.items()
}
test_session.write_variables_data(test_vars_to_data)
Finally, let’s run the test session and measure the accuracy:
def accuracy(predictions: np.ndarray, labels: np.ndarray):
    ind = np.argmax(predictions, axis=-1).flatten()
    labels = labels.detach().numpy().flatten()
    return np.mean(ind == labels) * 100.0


num_batches = len(test_data)
sum_acc = 0.0

with test_session:
    for data, labels in tqdm(test_data, total=num_batches):
        inputs: Mapping[popxl.HostToDeviceStream, np.ndarray] = dict(
            zip(test_input_streams, [data.squeeze().float(), labels.int()])
        )
        output = test_session.run(inputs)
        sum_acc += accuracy(output[out_stream], labels)

test_set_accuracy = sum_acc / len(test_data)
print(f"Accuracy on test set: {test_set_accuracy:0.2f}%")
100%|██████████| 125/125 [00:01<00:00, 95.30it/s]
Accuracy on test set: 95.21%
Conclusion
In this tutorial we wrote a custom optimiser using the popxl.addons API. We achieved the following:
built an Adam optimiser (by subclassing addons.Module) and ran it with an MNIST model.
used popxl.in_sequence() and learnt why it is needed: to prevent in-place operations from being rearranged by the compiler.
used popxl.TensorByRef to pass variable updates back to the parent graph.
explored some of the functions within the popxl.ops.var_updates module.
became familiar with using Tensor parameters versus simple (built-in) types in addons.Module.build when declaring dynamic parameters.
exploited graph caching in addons.Module: addons.Module(cache=True) to enable graph reuse.
To try out more features in PopXL look at our other tutorials.
You can also read our PopXL User Guide for more information.
As the PopXL API is still experimental, we would love to hear your feedback on it (support@graphcore.ai). Your input could help drive its future direction.