How Auto-grad works? Creating a PyTorch style Auto-grad framework

5 minute read

Basic idea and an Overview

In this post I aim to motivate and show how to write an automatic differentiation library. There are various strategies to perform automatic differentiation and they each have different strengths and weaknesses. For a an overview of various methods used please refer to [1]. Py-Torch uses a graph based automatic differentiation.

Every operation performed on tensors can be shown as a DAG (directed acylic graph). In the case of neural networks, the loss value calculated for a given mini-batch is the last node of the graph. Chain rule is very powerful and yet a very simple rule. Thinking in terms of the DAG, what chain rule tells us to take the derivative on a node if the output gradient of the node is completely accumulated. If we somehow make each node in this graph to remember its parents. We can run a topological sort on the DAG and call the derivative function of the nodes in this order. That’s a very simple overview of how auto-grad in PyTorch works and it is very simple to implement! Let’s do it.

Goal and Roadmap

We should be able to use our framework to do the following:

l1 = Variable(np.arange(-4,4).reshape(2,4))
l2 = Variable(np.arange(-2,2).reshape(4,1))
n1 = dot(l1,l2)
n2 = relu(n1)
n3 = sumel(n2)
# [[-2. -1.  0.  1.]
#  [-2. -1.  0.  1.]]
# [[-4.]
#  [-2.]
#  [ 0.]
#  [ 2.]]

So we need the following:

  • Define a Variable class wrapping the numpy ndarray, that supports backward call and points its parent Variables. Use this class whenever you create a new tensor. If a Varible is a leaf node then we don’t need the backward_fun.
  • Define operations you need (plus,minus,dot etc..), which takes Variable/s as argument/s and return a new Variable with the right backward function. backward function should be able to pass the output gradient to its parents by calculating the gradient of its parents from the output gradient.
  • We should be able to call backward_graph on every Variable which calls the backward function on Variables according to the topological sort of the computation graph of the given Variable resulting the gradients accumulated inside each Variable.

Implementing Variable class

Each Variable need its data which is a scalar or a numpy.ndarray if it is not a leaf node we need the backward_fun. __counter is an internal counter for debugging purposes. self.prev is an array pointing the parents and initialized as an empty array: should be set manually after creation. Backward function is called on the self.grad so we should guarantee that it is fully accumulated before calling the backward on the Variable.

class Variable():
    __counter = 0
    def __init__(self,data,is_leaf=True,backward_fun=None):
        if backward_fun is None and not is_leaf:
            raise ValueError('non leaf nodes require backward_fun') = Variable.__counter
        Variable.__counter += 1

        self.is_leaf = is_leaf
        self.prev = []
        self.backward_fun = backward_fun = data
        self.grad = 0

    def backward(self):
    def __repr__(self):
        return f'Variable(id:{}, data:{}, grad:{self.grad}, prev:{list(map(lambda,self.prev))}, is_leaf:{self.is_leaf}\n'

Implementing Operations

Each operation creates the backward_fun of the new Variable as a closure bound the the the parents. One can implement this part with generic functions which take the parents each time as parameters. This is possible and might lead to a more efficient run-time performance. However, this is not our primary concern here, so we go with the closures.

backward_fun of the dot is simple, just the dot product of the dy with the other Variable’s data.

def dot(a,b):
    if not (isinstance(a,Variable) and isinstance(b,Variable)):
            raise ValueError('a,b needs to be a Variable instance')
    def b_fun(dy):
        if np.isscalar(dy):
            dy = np.ones(1)*dy
        a.grad +=,
        b.grad +=,dy)
    res = Variable(,,is_leaf=False,backward_fun=b_fun)
    return res

backward_fun of the relu is just the masking.

def relu(a):
    if not (isinstance(a,Variable)):
        raise ValueError('a needs to be a Variable')
    def b_fun(dy=1):
        a.grad[>0] += dy[>0]

    res = Variable(np.maximum(, 0),is_leaf=False,backward_fun=b_fun)
    return res

sumel is just a broadcast when we look at the backward pass.

def sumel(a):
    if not (isinstance(a,Variable)):
        raise ValueError('a needs to be a Variable')
    def b_fun(dy=1):
        a.grad += np.ones(*dy

    res = Variable(np.sum(,is_leaf=False,backward_fun=b_fun)
    return res

Implementing the backward_engine

What we need to do is to call .backward() on each variable that is in our computational graph. We have the whole graph for every Variable since each Variable points its parents. The trick here is the call the .backward() in the right order since we need the .grad of the Variable to be fully accumulated before its .backward() call. To ensure this we do a topological sort and call the .backward() accordingly.

def backward_graph(var):
    if not isinstance(var,Variable):
        raise ValueError('var needs to be a Variable instance')
    tsorted = __top_sort(var)

    for var in reversed(tsorted):
def __top_sort(var):
    vars_seen = set()
    top_sort = []
    def top_sort_helper(vr):
        if (vr in vars_seen) or vr.is_leaf:
            for pvar in vr.prev:
    return top_sort

Note that we can make the .backward() calls inside the __top_sort function and this might be slightly efficient. We, again, pick the easy-to-understand-way of implementing things.

Enabling higher order gradients

Note that in the backward pass we don’t return Variable. It is very straight forward to enable higher order gradients by returning Variables at the backward_pass. To do that we need to use the operations we defined above inside the every backward_fun.

For the rest of the code and some test, please refer to

[1] Automatic differentiation in machine learning: a survey

Tags: ,