10. Transforms
After an IR is built, you can use transforms or patterns to manipulate its graphs in a non-trivial way. Transforms are used to change a graph at the graph level, while patterns are usually used to change a specific operation repeatedly in a graph.
Currently, we support the following transforms:
Autodiff
10.1. Autodiff
In PopXL you can use autodiff()
to perform automatic differentiation on a per-graph basis. This transform creates a graph (the gradient graph) to compute
the gradients of a forward graph. It is declared as:
autodiff(graph: Graph,
         grads_provided: Optional[Iterable[Tensor]] = None,
         grads_required: Optional[Iterable[Tensor]] = None,
         called_graphs_grad_info: Optional[Mapping[Graph, GradGraphInfo]] = None,
         return_all_grad_graphs: bool = False)
The inputs are as follows:

- graph: the forward graph to differentiate.
- grads_provided: indicates for which outputs of graph we have gradients available for autodiff to use. For instance, if graph outputs both a loss and an accuracy, you might not want to provide a gradient for the accuracy. The default is that gradients are provided for all of the outputs of the forward graph.
- grads_required: indicates which inputs of the forward graph you want autodiff to calculate gradients for. The default is that gradients are required for all of the inputs to the forward graph. (See the sketch after this list.)
- called_graphs_grad_info and return_all_grad_graphs: can be used to reduce the computation of gradients when there are subgraphs that are called by multiple parent graphs. return_all_grad_graphs indicates whether to return the gradient graphs for all the graphs that autodiff has been recursively applied to, or just for the given graph. autodiff returns a GradGraphInfo object that includes the computational graph for computing the gradients if return_all_grad_graphs is set to False. It returns all the gradient graphs if return_all_grad_graphs is set to True.
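As an illustration of grads_provided and grads_required, here is a minimal sketch. It assumes a hypothetical forward graph fwd_graph, created with ir.create_graph(), whose first output is a loss and whose second output is an accuracy metric, and it assumes the graph's input and output tensors are available through fwd_graph.inputs and fwd_graph.outputs:

# A sketch, not taken from Listing 10.1: fwd_graph and its inputs/outputs
# are assumed to exist as described above.
bwd_graph_info = transforms.autodiff(
    fwd_graph,
    # Only a gradient for the loss output will be provided at the call site;
    # no gradient is provided for the accuracy output.
    grads_provided=fwd_graph.outputs[:1],
    # Only gradients with respect to the first two inputs are required.
    grads_required=fwd_graph.inputs[:2],
)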
You only need return_all_grad_graphs if:

- the graph you are applying autodiff to has calls to other subgraphs, and
- you need the gradients for those called subgraphs.

This typically only happens when you want to apply autodiff to another graph that also calls these same subgraphs.
For example, consider graphs A, B and C, where:

- A calls C
- B calls C

If you apply autodiff to B first and you do not specify return_all_grad_graphs, then you only get the gradient graph information for B, and not for C. If you specify return_all_grad_graphs, then you will get the gradient graph information for both B and C. Then, if you want to apply autodiff to A, which also calls C, you can reuse this gradient graph information for C. This means that autodiff will not have to create another gradient graph for C.
You use called_graphs_grad_info to provide the information for gradient graphs, which you have already calculated, as inputs to subsequent autodiff calls where that gradient graph information is needed.
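As a sketch of this pattern, suppose graph_a, graph_b and graph_c are hypothetical graphs created with ir.create_graph() that mirror A, B and C above, and assume that, with return_all_grad_graphs=True, autodiff returns a mapping from each forward graph to its GradGraphInfo:

# Differentiate B and keep the gradient graph information for every graph
# that autodiff was recursively applied to (here: graph_b and graph_c).
all_grad_infos = transforms.autodiff(graph_b, return_all_grad_graphs=True)
bwd_b_info = all_grad_infos[graph_b]
bwd_c_info = all_grad_infos[graph_c]

# Differentiate A, reusing the gradient graph information already created
# for C so that autodiff does not build a second gradient graph for C.
bwd_a_info = transforms.autodiff(
    graph_a, called_graphs_grad_info={graph_c: bwd_c_info}
)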
The GradGraphInfo object contains all the information and tools you need to get a gradient graph:

- graph: the associated gradient graph as produced by autodiff.
- forward_graph: the forward graph that autodiff was applied to.
- expected_inputs: the tensors from forward_graph that are required as inputs to the gradient graph.
- expected_outputs: the tensors from the forward graph that have gradients as outputs of the gradient graph.
- inputs_dict(fwd_call_info): the inputs required to call the gradient graph.
- fwd_graph_ins_to_grad_parent_outs(grad_call_info): the mapping between forward subgraph tensors and gradient call site tensors. Note that grad_call_info is the call site information of the gradient graph (Fig. 10.1).
- fwd_parent_ins_to_grad_parent_outs(fwd_call_info, grad_call_info): the mapping between forward call site inputs and gradient call site outputs. It can be used to get the gradient with respect to a specific input (Fig. 10.1), as sketched below.
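For example, the following sketch shows one way fwd_parent_ins_to_grad_parent_outs() might be used to look up the gradient with respect to a specific parent tensor. It assumes a forward graph fwd_graph that was called with ops.call_with_info() (giving fwd_call_info), a parent tensor W passed at that call site, and a hypothetical grad_seed tensor used as the initial gradient:

# Call the gradient graph with call_with_info() so that we also get the
# call site information of the gradient graph.
bwd_graph_info = transforms.autodiff(fwd_graph)
activations = bwd_graph_info.inputs_dict(fwd_call_info)
grad_call_info = ops.call_with_info(
    bwd_graph_info.graph, grad_seed, inputs_dict=activations
)

# Map forward call site inputs to gradient call site outputs, then look up
# the gradient with respect to the parent tensor W.
grads = bwd_graph_info.fwd_parent_ins_to_grad_parent_outs(
    fwd_call_info, grad_call_info
)
grad_w = grads[W]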
You can then use the information for the gradient graph returned by autodiff to get the required gradients. The partial derivatives of the loss with respect to the outputs of the forward graph are the first inputs of the gradient graph. Listing 10.1 shows how to calculate the gradients with autodiff for linear_graph.
1. Start with call_with_info(), which returns the call site information, fwd_call_info.
2. Then, calculate the information for the gradient graph, bwd_graph_info, by applying autodiff() to linear_graph.
3. Next, get all the activations calculated in the forward pass, which the gradient graph requires as inputs, using bwd_graph_info.inputs_dict() with fwd_call_info as input.
4. Last, calculate the gradients by calling the gradient graph with call(). grad_seed is the initial value of the partial gradient; increasing grad_seed can serve as loss scaling. activations is used to connect the inputs of the gradient graph with the calling graph.
import numpy as np
from typing import Tuple

import popxl
import popxl.ops as ops
import popxl.transforms as transforms

# Create the IR and get its main graph.
ir = popxl.Ir()
main = ir.main_graph


class Linear(popxl.Module):
    def __init__(self):
        self.W: popxl.Tensor = None
        self.b: popxl.Tensor = None

    def build(
        self, x: popxl.Tensor, out_features: int, bias: bool = True
    ) -> Tuple[popxl.Tensor, ...]:
        self.W = popxl.graph_input((x.shape[-1], out_features), popxl.float32, "W")
        y = x @ self.W
        if bias:
            self.b = popxl.graph_input((out_features,), popxl.float32, "b")
            y = y + self.b
        return y


with main:
    # host load
    input = popxl.h2d_stream([2, 2], popxl.float32, name="input_stream")
    x = ops.host_load(input, "x")
    W_data = np.random.normal(0, 0.1, (2, 2)).astype(np.float32)
    W = popxl.variable(W_data, name="W")
    b_data = np.random.normal(0, 0.4, (2)).astype(np.float32)
    b = popxl.variable(b_data, name="b")

    # create graph
    linear = Linear()
    linear_graph = ir.create_graph(linear, x, out_features=2)

    fwd_call_info = ops.call_with_info(
        linear_graph, x, inputs_dict={linear.W: W, linear.b: b}
    )
    y = fwd_call_info.outputs[0]

    # get the gradients from autodiff
    bwd_graph_info = transforms.autodiff(linear_graph)
    grad_seed = popxl.constant(np.ones((2, 2), np.float32))
    activations = bwd_graph_info.inputs_dict(fwd_call_info)
    grads_x, grads_w, grads_b = ops.call(
        bwd_graph_info.graph, grad_seed, inputs_dict=activations
    )

    # host store
    o_d2h = popxl.d2h_stream(y.shape, y.dtype, name="output_stream")
    ops.host_store(o_d2h, y)

    grad_d2h = popxl.d2h_stream(grads_w.shape, grads_w.dtype, name="grad_stream")
    ops.host_store(grad_d2h, grads_w)
The MNIST application example (Section 16, Application example: MNIST) demonstrates how autodiff
is used.