Introduction

This is the first post in a series where I will deep dive into the lectures of Andrej Karpathy’s “Neural Networks: Zero to Hero” course. I want to explore the small details that were skimmed over, left out, or not explained in depth in the first lecture, about micrograd, and hopefully add some intuition and motivation behind some of the concepts introduced there.

In the first lecture of the series, we are introduced to the concept of building a simple neural network framework from scratch using Python. The framework is called “micrograd”. It serves as a minimalistic implementation to help us understand the underlying mechanics of neural networks and backpropagation.

In the first half of the lecture, Andrej works hard to calculate “by hand” the derivatives of some simple functions, to illustrate the definition of a derivative and some intuition behind it. I want to delve a bit deeper into the motivation behind why he did that.

Deep diving into the small details of micrograd

At a high level, what we did in that lecture is compose multiple “Value” objects with some operations to create a computational graph that represents an arbitrary mathematical function. One of the examples in the lecture is:

# inputs x1,x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')
# weights w1,w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')
# bias of the neuron
b = Value(6.8813735870195432, label='b')
# x1*w1 + x2*w2 + b
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b
o = n.tanh()
o.label = 'o'

In the end we get a function $o = f(x_1, x_2; w_1, w_2, b)$. In the lecture Andrej then goes on to calculate the derivatives of the output o with respect to each of the inputs and parameters by applying the chain rule of calculus at each node in the computational graph. He does so by implementing a custom operator for every operation possible between Value objects. For example, this is the multiplication operator we saw in the lecture:

def __mul__(self, other):
  out = Value(self.data * other.data, (self, other), '*')
  
  def _backward():
    self.grad += other.data * out.grad
    other.grad += self.data * out.grad
  out._backward = _backward
    
  return out
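The tanh operator from the example follows the same pattern: its _backward multiplies out.grad by the local derivative, which for tanh(n) is 1 - tanh(n)^2. As a quick standalone sanity check (using math.tanh directly rather than the Value class), we can compare that analytic local derivative against the finite-difference definition of a derivative from the first half of the lecture, at the value n ≈ 0.8814 from the example above:

```python
import math

# Local derivative of tanh: d/dn tanh(n) = 1 - tanh(n)^2.
# This is the factor that tanh's _backward scales out.grad by.
n = 0.8814  # the value of n = x1*w1 + x2*w2 + b in the example above
t = math.tanh(n)
analytic = 1 - t**2

# Finite-difference approximation - the "by hand" definition of a derivative
h = 1e-6
numeric = (math.tanh(n + h) - math.tanh(n - h)) / (2 * h)

print(abs(analytic - numeric) < 1e-8)  # the two agree
```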

Now this caught me off guard a little. I honestly thought that if I had to implement this sort of idea, I would’ve gone with something like this:

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label

        def _backward():
            match _op:
                case '*':
                    x, y = _children
                    x.grad += y.data * self.grad
                    y.grad += x.data * self.grad
                case '+':
                    x, y = _children
                    x.grad += 1.0 * self.grad
                    y.grad += 1.0 * self.grad
                # other operations...

        self._backward = _backward

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), '*')

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), '+')

    # other operations...

Reading this code at first glance didn’t seem straightforward to me: where is out.grad coming from? How does this even work, given that we initialize .grad in the constructor to 0.0? Isn’t everything here just 0.0? I think this is the rare case where using an abstraction is actually more intuitive than using a concrete example. So let’s define some function f, defined as $o = f(x, y)$, which is some sub-computation in the whole computational graph. Let’s denote the output of the computational graph as L (which is usually the loss function in a neural network, though for now it can be any function), and let’s define f_raw as the actual operation on the data fields of x and y.

def f(x: Value, y: Value) -> Value:
    '''
    A generic binary operation in the graph: the forward pass computes
    f_raw(x.data, y.data) and records x and y as children of o.
    '''
    o = Value(f_raw(x.data, y.data), (x, y), 'f')

    def _backward():
        '''
        The goal inside this _backward method is to accumulate the gradients of L with respect to x and y and store them in a dedicated field of the Value object.
        So if L can be decomposed as L = g(f(x, y)) = g(o) then by the chain rule we have:
        ∂L/∂x = (∂L/∂o) * (∂o/∂x)
        ∂L/∂y = (∂L/∂o) * (∂o/∂y)

        Now lets compare this to what we have in the code:
        x.grad += y.data * o.grad
          ^          ^       ^
        ∂L/∂x     ∂o/∂x    ∂L/∂o

        And similarly for y:
        y.grad += x.data * o.grad
          ^          ^       ^
        ∂L/∂y      ∂o/∂y   ∂L/∂o

        Now we've got some answers here, but let's make it more explicit for every Value object v,
        v.grad is simply ∂L/∂v, the derivative of the final output L with respect to that Value object v.

        Moreover, here calculating the derivatives comes into play, for example if f_raw=x*y then:
        (∂o/∂x) = (∂f/∂x) = y.data
        (∂o/∂y) = (∂f/∂y) = x.data
        and we get this final code:
        '''
        x.grad += y.data * o.grad
        y.grad += x.data * o.grad

    o._backward = _backward
    return o

There’s still another piece of the puzzle missing here: where does o.grad, i.e. ∂L/∂o, come from? Well, you’ve probably guessed that it has to do with this funky _backward method defined inside the f function and with the way we call it. Let’s take a look at the actual backward method defined in the Value class and break it down:

def backward(self):
    '''
    This method does something interesting: it takes the current Value node, treats it as the root of a DAG (Directed Acyclic Graph), and performs a topological sort
    of all the nodes in the graph that lead to this node, ordering them from inputs to outputs.
    '''
    topo = []
    visited = set()
    def build_topo(v):
      if v not in visited:
        visited.add(v)
        for child in v._prev:
          build_topo(child)
        topo.append(v)
    build_topo(self)
    
    
    '''
    For example, let's look at this computation graph:
    a = Value(2.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(10.0, label='c')
    e = a*b; e.label = 'e'
    d = e + c; d.label = 'd'
    f = Value(-2.0, label='f')
    L = d * f; L.label = 'L'
    L.backward()
    To be more explicit here is L:  ((a*b) + c) * f
    Running L.backward() will result in the following topo array: [a, b, e, c, d, f, L]
    Now we need to populate the .grad values for each node in the graph, starting from L, so that when we call each node's _backward method those values are set correctly for the chain rule to work.
    Since the grad field is initialized to 0.0 in the Value constructor, we need a special case for the output node L; since ∂L/∂L = 1 we set:
    '''
    self.grad = 1.0
    for node in reversed(topo):
      '''
      And now all that's left is to start from the output node L and call each node's _backward method in reverse topological order, and all the gradients will be calculated correctly!
      So in our example:
      ∂L/∂L = 1.0
      For L = d * f
      Then in the _backward call added grad values will be:
      ∂L/∂f = d           → f.grad += 1.0 * d.data   = 4
      ∂L/∂d = f           → d.grad += 1.0 * f.data   = -2
      etc..
      '''
      node._backward()
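To confirm the numbers in this walkthrough end to end, here is a self-contained version of the Value class in the closure-per-operator style from the lecture (only * and + are implemented, and labels are omitted), run on the exact same graph:

```python
# Minimal Value class in the closure style, just enough for the example graph.
class Value:
    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
e = a * b
d = e + c
f = Value(-2.0)
L = d * f
L.backward()
print(f.grad, d.grad)  # 4.0 -2.0, as computed above
print(a.grad, b.grad)  # 6.0 -4.0
```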

TL;DR:

  • .grad is initialized to 0.0 for all nodes.
  • In backward, for the final Value node L, we set L.grad = 1.0 because ∂L/∂L = 1.
  • Then we walk the nodes in reverse topological order, and each node’s _backward() adds to the .grad of its children (its _prev nodes) in the computational graph using the chain rule.

Recursion and topological sorting is truly a thing of beauty to see in action and grasp its power.

Where do we use the backpropagated gradients?

Let’s assume we have what could easily be a very complicated function L, composed of many sub-operations. What if we wanted to find the minimum value of L? This is where gradient descent comes into play. Gradient descent is an optimization algorithm that uses the gradients we calculated via backpropagation to iteratively update the values of each node in the direction that reduces L. The update rule for each Value node v is given by \(v.data = v.data - η * v.grad\), where η is the learning rate, a hyperparameter that controls the step size of each update. By repeatedly applying this update rule to the Value nodes in the computational graph, we can gradually minimize L and find the optimal values for each node. Thus we can easily do the following:

# As you will see in the next lectures, this algorithm doesn't apply to all of the Value nodes in the computational graph - only to a subset of them which we call "parameters".
for _ in range(100):
    # Forward pass to re-compute L - i.e. in Andrej's example this is: L = ((a*b) + c) * f
    L = some_complicated_function(...)

    # Reset gradients to zero (so that we accumulate only the gradients from the current forward pass)
    for p in parameters:
        p.grad = 0.0

    # Backward pass to compute the current iteration's gradients
    L.backward()

    # Nudge each node's data in the direction that reduces L
    for p in parameters:
        p.data -= learning_rate * p.grad

I didn’t go too deep into the details of gradient descent here since it’s a topic we will discuss more in depth in future posts, but I wanted to illustrate how the backpropagated gradients are used in practice.
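To make the loop above concrete without repeating the whole Value class, here is a bare-bones sketch of the same update rule on a single scalar parameter, with the gradient written out analytically in place of what L.backward() would put in w.grad (the function L(w) = (w - 3)^2 and the learning rate are made up purely for illustration):

```python
# Gradient descent on one scalar parameter w, minimizing L(w) = (w - 3)^2.
# dL/dw = 2*(w - 3) plays the role that w.grad plays in the Value version.
w = 0.0
learning_rate = 0.1
for _ in range(100):
    grad = 2 * (w - 3.0)        # what backpropagation would compute
    w -= learning_rate * grad   # nudge w in the direction that reduces L
print(round(w, 4))  # converges to 3.0, the minimum of L
```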

That’s it for this post, I hope you found it informative and helpful. In the next post I will dive deeper into the details, ideas and motivation behind the next lecture in the series, about the bigram model. Stay tuned!