# How Does Auto-grad Work? Creating a PyTorch-style Auto-grad Framework

## Basic idea and an Overview

In this post I aim to motivate and show how to write an automatic differentiation library. There are various strategies for performing automatic differentiation, each with different strengths and weaknesses. For an overview of the various methods, please refer to [1]. PyTorch uses graph-based automatic differentiation.

Every computation performed on tensors can be represented as a DAG (directed acyclic graph). In the case of neural networks, the loss value calculated for a given mini-batch is the last node of the graph. The chain rule is simple yet very powerful. Thinking in terms of the DAG, the chain rule tells us that we can take the derivative at a node once the output gradient of that node is fully accumulated. So if we make each node in the graph remember its parents, we can run a topological sort on the DAG and call the derivative function of each node in that order. That's a high-level overview of how auto-grad in PyTorch works, and it is very simple to implement! Let's do it.
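
To make the idea concrete, here is the chain rule applied by hand to the tiny graph `x -> relu -> sum`. This is a hand-written version of the bookkeeping that the framework below automates:

```
import numpy as np

# y = sum(relu(x)) for a tiny x, differentiated by hand, node by node
x = np.array([-1., 2., 3.])
h = np.maximum(x, 0)               # relu node
y = h.sum()                        # sum node: the last node of the graph
# backward pass, in reverse topological order:
dy = 1.0                           # gradient of the output w.r.t. itself
dh = np.ones_like(h) * dy          # sum node broadcasts dy to its parent
dx = dh * (x > 0)                  # relu node masks the incoming gradient
print(dx)                          # [0. 1. 1.]
```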

## Goal and Roadmap

We should be able to use our framework to do the following:

```
import numpy as np

l1 = Variable(np.arange(-4,4).reshape(2,4))
l2 = Variable(np.arange(-2,2).reshape(4,1))
n1 = dot(l1,l2)
n2 = relu(n1)
n3 = sumel(n2)
backward_graph(n3)
print(l1.grad)
# [[-2. -1.  0.  1.]
#  [-2. -1.  0.  1.]]
print(l2.grad)
# [[-4.]
#  [-2.]
#  [ 0.]
#  [ 2.]]
```

So we need the following:

- Define a `Variable` class wrapping the numpy ndarray that supports a backward call and points to its parent `Variable`s. We use this class whenever we create a new tensor. If a `Variable` is a leaf node, it doesn't need a `backward_fun`.
- Define the operations we need (`plus`, `minus`, `dot`, etc.), which take `Variable`(s) as argument(s) and return a new `Variable` with the right `backward` function. The `backward` function should pass the output gradient to the node's parents by calculating each parent's gradient from the output gradient.
- Define a `backward_graph` function that we can call on any `Variable`; it calls the backward functions of the `Variable`s according to a topological sort of the given `Variable`'s computation graph, so that the gradients end up accumulated inside each `Variable`.

## Implementing the `Variable` class

Each `Variable` needs its data, which is a scalar or a `numpy.ndarray`. If it is not a leaf node, it also needs a `backward_fun`. `__counter` is an internal counter for debugging purposes. `self.prev` is a list pointing to the parents; it is initialized empty and should be set manually after creation. The backward function is called with `self.grad`, so we must guarantee that the gradient is fully accumulated before calling `backward` on the `Variable`.

```
class Variable():
    __counter = 0  # class-level id counter, for debugging

    def __init__(self, data, is_leaf=True, backward_fun=None):
        if backward_fun is None and not is_leaf:
            raise ValueError('non leaf nodes require backward_fun')
        self.id = Variable.__counter
        Variable.__counter += 1
        self.is_leaf = is_leaf
        self.prev = []               # parent Variables, set after creation
        self.backward_fun = backward_fun
        self.data = data
        self.grad = 0                # accumulated gradient

    def backward(self):
        # propagate the fully accumulated output gradient to the parents
        self.backward_fun(dy=self.grad)

    def __repr__(self):
        return (f'Variable(id:{self.id}, data:{self.data}, grad:{self.grad}, '
                f'prev:{list(map(lambda a: a.id, self.prev))}, is_leaf:{self.is_leaf})\n')
```
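
A quick sanity check of the class (the `id` you see depends on how many `Variable`s were created before):

```
x = Variable(np.array([1., 2., 3.]))
print(x)
# Variable(id:0, data:[1. 2. 3.], grad:0, prev:[], is_leaf:True)
```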

## Implementing Operations

Each operation creates the `backward_fun` of the new `Variable` as a closure bound to the parents. One could instead implement this part with generic functions that take the parents as parameters each time; this is possible and might lead to better runtime performance. However, performance is not our primary concern here, so we go with closures.
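
The roadmap mentions `plus`, which is not implemented below. As a warm-up, here is a sketch (my addition, not part of the original code) of how an element-wise addition could look in this closure style:

```
def plus(a, b):
    if not (isinstance(a, Variable) and isinstance(b, Variable)):
        raise ValueError('a,b need to be Variable instances')
    def b_fun(dy=1):
        # addition routes the output gradient to both parents unchanged
        a.grad = a.grad + dy
        b.grad = b.grad + dy
    res = Variable(a.data + b.data, is_leaf=False, backward_fun=b_fun)
    res.prev.extend([a, b])
    return res
```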

The `backward_fun` of `dot` is simple: just the dot product of `dy` with the other `Variable`'s data.

```
def dot(a, b):
    if not (isinstance(a, Variable) and isinstance(b, Variable)):
        raise ValueError('a,b needs to be a Variable instance')
    def b_fun(dy):
        if np.isscalar(dy):
            dy = np.ones(1) * dy
        # d(a@b)/da = dy @ b.T  and  d(a@b)/db = a.T @ dy
        a.grad += np.dot(dy, b.data.T)
        b.grad += np.dot(a.data.T, dy)
    res = Variable(np.dot(a.data, b.data), is_leaf=False, backward_fun=b_fun)
    res.prev.extend([a, b])
    return res
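
We can check the gradients by seeding the output gradient by hand (normally `backward_graph`, defined below, does this for us):

```
a = Variable(np.arange(6.).reshape(2, 3))
b = Variable(np.arange(3.).reshape(3, 1))
c = dot(a, b)
c.grad = np.ones(c.data.shape)   # pretend c is the last node of the graph
c.backward()
print(a.grad)   # every row is b.T:  [[0. 1. 2.] [0. 1. 2.]]
print(b.grad)   # column sums of a:  [[3.] [5.] [7.]]
```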

The `backward_fun` of `relu` is just masking.

```
def relu(a):
    if not isinstance(a, Variable):
        raise ValueError('a needs to be a Variable')
    def b_fun(dy=1):
        # pass the gradient through only where the input was positive;
        # np.where also works while a.grad is still the initial scalar 0
        a.grad = a.grad + np.where(a.data > 0, dy, 0.)
    res = Variable(np.maximum(a.data, 0), is_leaf=False, backward_fun=b_fun)
    res.prev.append(a)
    return res
```
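
Again seeding the output gradient by hand (arbitrary values, just to see the mask in action):

```
x = Variable(np.array([-1., 2.]))
y = relu(x)
y.grad = np.array([10., 10.])
y.backward()
print(x.grad)   # [ 0. 10.] : the gradient is blocked where the input was negative
```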

`sumel` is just a broadcast when we look at the backward pass.

```
def sumel(a):
    if not isinstance(a, Variable):
        raise ValueError('a needs to be a Variable')
    def b_fun(dy=1):
        # a sum's gradient broadcasts dy back to the shape of the input
        a.grad += np.ones(a.data.shape) * dy
    res = Variable(np.sum(a.data), is_leaf=False, backward_fun=b_fun)
    res.prev.append(a)
    return res
```
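
And the corresponding check: the scalar output gradient fans out to every element of the input.

```
x = Variable(np.array([[1., 2.], [3., 4.]]))
s = sumel(x)
s.grad = 1.0
s.backward()
print(x.grad)   # [[1. 1.] [1. 1.]]
```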

## Implementing the backward_engine

What we need to do is call `.backward()` on each variable in our computational graph. We have the whole graph for every `Variable`, since each `Variable` points to its parents. The trick here is to call `.backward()` in the right order, since we need the `.grad` of a `Variable` to be fully accumulated before its `.backward()` call. To ensure this we do a topological sort and call `.backward()` accordingly.

```
def backward_graph(var):
    if not isinstance(var, Variable):
        raise ValueError('var needs to be a Variable instance')
    tsorted = __top_sort(var)
    # seed the gradient of the last node with ones
    var.grad = np.ones(var.data.shape)
    # visit nodes in reverse topological order, so every node's
    # gradient is fully accumulated before its backward call
    for v in reversed(tsorted):
        v.backward()
```

```
def __top_sort(var):
    vars_seen = set()
    top_sort = []
    def top_sort_helper(vr):
        if (vr in vars_seen) or vr.is_leaf:
            pass   # leaves have no backward_fun and are skipped
        else:
            vars_seen.add(vr)
            for pvar in vr.prev:
                top_sort_helper(pvar)
            # post-order: a node is appended only after all its parents
            top_sort.append(vr)
    top_sort_helper(var)
    return top_sort
```
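
With all the pieces in place, the goal example from the beginning now runs end to end:

```
l1 = Variable(np.arange(-4, 4).reshape(2, 4))
l2 = Variable(np.arange(-2, 2).reshape(4, 1))
n3 = sumel(relu(dot(l1, l2)))
backward_graph(n3)
print(l1.grad)
# [[-2. -1.  0.  1.]
#  [-2. -1.  0.  1.]]
```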

Note that we could make the `.backward()` calls inside the `__top_sort` function, which might be slightly more efficient. We, again, pick the easy-to-understand way of implementing things.

## Enabling higher order gradients

Note that in the backward pass we don't return `Variable`s. It is straightforward to enable higher order gradients by returning `Variable`s in the backward pass: we would need to use the operations defined above inside every `backward_fun`, so that the gradients themselves become nodes of a graph that `backward_graph` can differentiate again.

For the rest of the code and some tests, please refer to https://github.com/evcu/numpy_autograd

[1] Automatic differentiation in machine learning: a survey https://arxiv.org/abs/1502.05767