What happens when you reduce an ML framework all the way to its bare bones? You get teenygrad.
In this post, I want to dive into the source code of teenygrad to illustrate the software architecture behind some important concepts of ML engines, like backpropagation and computational graphs.
It’s true that you can go read the code directly (it’s not that difficult) but… why not take a guided tour first? If you enjoyed Karpathy’s micrograd, you’ll like teenygrad.
What is Teenygrad?
It is a minimalistic ML framework. It does not aim to be production-grade but to illustrate the core ideas behind ML frameworks: tensor arithmetic, computational graphs and automatic differentiation.
teenygrad has a big brother: tinygrad. They both build on top of the same philosophy, but tinygrad aims to be production-grade: it includes computation graph optimizations, JIT, lazy computation and device-specific kernel generation.
teenygrad also has a smaller brother: Karpathy’s micrograd. The differences are that micrograd is value-based while teenygrad is array-based, and that teenygrad implements a more polished software architecture, which is what I want to discuss in this post.
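To make the value-based vs array-based distinction concrete, here is a rough side-by-side sketch (the micrograd import path is the one from Karpathy’s repo; treat both snippets as illustrative rather than canonical):

# micrograd: every scalar value is a node in the graph
from micrograd.engine import Value
a = Value(2.0)
b = Value(3.0)
c = a * b + a
c.backward()
print(a.grad)  # 4.0 (dc/da = b + 1)

# teenygrad: whole arrays are nodes in the graph
from teenygrad.tensor import Tensor
x = Tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = (x * x).sum()
y.backward()
print(x.grad.numpy())  # array([[2., 4.], [6., 8.]], dtype=float32)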
Tensor Arithmetic
The core data structure in teenygrad is the Tensor object. A tensor is like a Numpy ndarray. Like in other frameworks, tensors have .shape, .dtype and .device properties.
from teenygrad.tensor import Tensor
x = Tensor.ones((1, 3, 16, 16))
print(x.shape) # (1, 3, 16, 16)
print(x.dtype) # dtypes.float
print(x.device) # 'CPU'
Tensors can be operated on with regular arithmetic:
a = Tensor([[1, 2], [3, 4]])
b = Tensor([[5, 6], [7, 8]])
(a + b).numpy()
# array([[ 6., 8.], [10., 12.]], dtype=float32)
(a * b).numpy()
# array([[ 5., 12.], [21., 32.]], dtype=float32)
(a @ b).numpy()
# array([[19., 22.], [43., 50.]], dtype=float32)
More complex operations are also available:
x = Tensor.ones(2,)
w = Tensor.ones(2, 3)
b = Tensor.ones(3)
y = x.linear(weight=w, bias=b)
# array([3., 3., 3.], dtype=float32)
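Under the hood, linear is just a matrix multiplication followed by an addition, so the call above should be equivalent to composing the primitives we already saw (a rough equivalence for the 2D-weight case, not the actual source):

y_manual = x @ w + b
print(y_manual.numpy())  # array([3., 3., 3.], dtype=float32)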
Tensors may also store gradients in their .grad property after a backward pass is run:
a = Tensor.rand((2, 3), requires_grad=True)
y = a.sum()
y.backward()
a.grad.numpy()
# array([[1., 1., 1.],
# [1., 1., 1.]], dtype=float32)
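As a quick sanity check on the math: scaling a tensor before summing scales the gradient by the same factor (a minimal sketch, variable names are mine):

b = Tensor.rand((2, 3), requires_grad=True)
(b * 3).sum().backward()
b.grad.numpy()
# array([[3., 3., 3.],
#        [3., 3., 3.]], dtype=float32)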
The Computational Graph
Operations on tensors record a Computational Graph. The computational graph holds the information used by the autograd engine to run the backward pass and compute the gradients of each tensor involved.
In particular, each tensor holds a context property ._ctx that refers to the math operation that created the tensor:
a = Tensor.ones((2), requires_grad=True)
b = Tensor.ones((2), requires_grad=True)
c = a + b
print(c._ctx)
# <teenygrad.mlops.Add object at 0x10a88d2a0>
There are a couple of things to mention about the previous code:
- Tensors without requires_grad=True are not added to the computational graph.
- The _ctx property holds an instance of the mlops.Add class.
At this point I was wondering… why do we need an instance of mlops.Add in ._ctx? The answer is that, in teenygrad, operations themselves are stateful. This makes sense, since they need to store certain information in order to run the backward pass (we’ll come back to this point later).
Operations store references to the tensors that spawned them. These are called the parents of the operation:
a = Tensor.ones((2), requires_grad=True) # id(a): 4403513344
b = Tensor.ones((2), requires_grad=True) # id(b): 4403663552
c = a + b
[id(p) for p in c._ctx.parents] # [4403513344, 4403663552]
Recap so far:
- Tensors have a ._ctx property that points to an instance of an operation.
- Operations have a .parents property that points to the tensors that created them (the inputs of the operation).
These links between tensors and operations, repeated, create a graph that describes the full computation. This is The Computational Graph.
Visualizing Computational Graphs
Let’s visualize some computational graphs. To do so, I have put together the following code; it’s an adaptation of Karpathy’s code in micrograd/trace_graph.ipynb.
from graphviz import Digraph

def trace(root):
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            if v._ctx:
                for child in v._ctx.parents:
                    edges.add((child, v))
                    build(child)
    build(root)
    return nodes, edges

def draw_graph(root, show_grads=False, format='svg', rankdir='LR'):
    """
    format: png | svg | ...
    rankdir: TB (top to bottom graph) | LR (left to right)
    """
    assert rankdir in ['LR', 'TB']
    nodes, edges = trace(root)
    dot = Digraph(format=format, graph_attr={'rankdir': rankdir})
    for n in nodes:
        # data node (optionally showing the gradient)
        grad_label = f"{n.grad.numpy().item():.2f}" if n.grad is not None else "None"
        if show_grads:
            dot.node(name=str(id(n)), label=f"data: {n.numpy().item():.2f} | grad: {grad_label}", shape='record')
        else:
            dot.node(name=str(id(n)), label=f"data: {n.numpy().item():.2f}", shape='record')
        # operation node, linked to the tensor it produced
        if n._ctx:
            dot.node(name=str(id(n)) + str(id(n._ctx)), label=n._ctx.__class__.__name__)
            dot.edge(str(id(n)) + str(id(n._ctx)), str(id(n)))
    for n1, n2 in edges:
        dot.edge(str(id(n1)), str(id(n2)) + str(id(n2._ctx)))
    return dot
The first visualization is the addition of two tensors. I’ll use tensors with a scalar value for clarity in the visualization:
a = Tensor([0.5], requires_grad=True)
b = Tensor([0.8], requires_grad=True)
c = a + b
draw_graph(c)
Next, a linear transformation:
x = Tensor([1.0], requires_grad=True)
w = Tensor([0.5], requires_grad=True)
b = Tensor([0.8], requires_grad=True)
c = x * w + b
draw_graph(c)
Note: Python parses a composed operation following the conventional order of operations. c = x * w + b is broken into hidden_tensor = x * w and c = hidden_tensor + b. You can see the instance of hidden_tensor in the graph, even though we have not explicitly defined a variable for it.
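We can also confirm this programmatically. Following the example above, the first parent of the final Add should be the intermediate tensor produced by a Mul (a quick sketch; the printed memory address will of course differ):

mul_out = c._ctx.parents[0]  # the hidden tensor produced by x * w
print(mul_out._ctx)          # <teenygrad.mlops.Mul object at 0x...>
print(mul_out.numpy())       # array([0.5], dtype=float32)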
Graphs can get as complex as one wants: tensors can be involved in several operations, there can be residual connections, etc. Note that the resulting computational graph is a DAG (Directed Acyclic Graph).
x1 = Tensor([1.0], requires_grad=True)
x2 = Tensor([2.0], requires_grad=True)
w = Tensor([0.5], requires_grad=True)
b = Tensor([0.8], requires_grad=True)
c = x1 * w + x2 * w + b + x2
draw_graph(c)
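Because w feeds two multiplications, its gradient after a backward pass accumulates contributions from both branches: dc/dw = x1 + x2 = 3.0, while dc/dx2 = w + 1 = 1.5. A small sketch on top of the graph above (c is summed first because backward() requires a scalar):

c.sum().backward()
print(w.grad.numpy())   # array([3.], dtype=float32)
print(x2.grad.numpy())  # array([1.5], dtype=float32)
print(x1.grad.numpy())  # array([0.5], dtype=float32)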
Diving Deeper: Meet the LazyBuffer
Since this is a post about the inner workings of teenygrad, we need to dive into the internals of the Tensor object. Right under Tensor, there is the LazyBuffer class, the object that ultimately holds data and operates on it. In fact, Tensor is just a convenient wrapper around LazyBuffer.
Actually, it’s not the first time that we encounter a LazyBuffer. Before, in the image showcasing the attributes of mlops.Add, there was this one:
_ctx.parents = (<Tensor <LB () dtypes.float> on CPU with grad None>, <Tensor <LB () dtypes.float> on CPU with grad None>)
See that LB? That’s the LazyBuffer within the Tensor. We also see them if we print a Tensor instance. Rather than getting the data, we get the LazyBuffer underneath:
a = Tensor([1.0])
print(a)
# <Tensor <LB (1,) dtypes.float> on CPU with grad None>
Let me paste the definition of the LazyBuffer class to make clear that it is where the actual data is stored (in this case, as a Numpy ndarray):
class LazyBuffer:
    device = "CPU"
    def __init__(self, buf: np.ndarray):
        self._np = buf
    @property
    def base(self): return self
    @property
    def dtype(self): return dtypes.from_np(self._np.dtype)
    @property
    def realized(self): return RawCPUBuffer(self._np)
    @property
    def shape(self): return self._np.shape
    ...
Now, looking at the definition of the Tensor class, it’s crystal clear that it is just a wrapper around a LazyBuffer object, stored in its lazydata property:
class Tensor:
    ...
    @property
    def device(self) -> str: return self.lazydata.device
    @property
    def shape(self) -> Tuple[sint, ...]: return self.lazydata.shape
    @property
    def dtype(self) -> DType: return self.lazydata.dtype
    ...
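A quick way to see this delegation in action (a small sketch that only relies on the lazydata property and the accessors shown above):

a = Tensor([1.0, 2.0])
print(type(a.lazydata).__name__)  # LazyBuffer
print(a.shape, a.lazydata.shape)  # (2,) (2,)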
At this point you may be wondering: why do we need tensors in the first place? Well, even if the LazyBuffer holds the data and operates on it, it does not know about computational graphs and backward passes. Those things happen one level of abstraction up, in the Tensor class.
Now that I’ve introduced the LazyBuffer, let’s trace what happens when two tensors are added:
a = Tensor([1.], requires_grad=True)
b = Tensor([2.], requires_grad=True)
c = a+b
The first thing to note is that a + b == a.__add__(b). In other words, our conventional math notation is just syntactic sugar for an __add__() method call on the Tensor class. Let’s take a look at what the Tensor.__add__() method looks like:
def __add__(self, x) -> Tensor: return self.add(x)
One step deeper…
def add(self, x:Union[Tensor, float], reverse=False) -> Tensor:
    x = self._to_float(x)
    return mlops.Add.apply(*self._broadcasted(x, reverse)) if x.__class__ is Tensor or x else self
The first time I looked at the mlops.Add definition I got quite confused. I was expecting an .apply(self, x: Tensor) -> Tensor method returning a new Tensor. However, that’s not what it looks like:
class Add(Function):
    def forward(self, x:LazyBuffer, y:LazyBuffer) -> LazyBuffer:
        return x.e(BinaryOps.ADD, y)

    def backward(self, grad_output:LazyBuffer) -> Tuple[Optional[LazyBuffer], Optional[LazyBuffer]]:
        return grad_output if self.needs_input_grad[0] else None, \
               grad_output if self.needs_input_grad[1] else None
What’s going on then? Let’s start with the forward() method, which is roughly what I was expecting: it takes a couple of LazyBuffers (the actual data) and returns a new one. The x.e() stands for evaluate:
class LazyBuffer:
    ...
    def e(self, op, *srcs:LazyBuffer):
        if DEBUG >= 1: print(op, self, srcs)
        if op == UnaryOps.NEG: ret = -self._np
        elif op == UnaryOps.EXP2: ret = np.exp2(self._np)
        elif op == UnaryOps.LOG2: ret = np.log2(self._np)
        elif op == UnaryOps.SIN: ret = np.sin(self._np)
        elif op == UnaryOps.SQRT: ret = np.sqrt(self._np)
        elif op == BinaryOps.ADD: ret = self._np + srcs[0]._np
        elif op == BinaryOps.SUB: ret = self._np - srcs[0]._np
        elif op == BinaryOps.MUL: ret = self._np * srcs[0]._np
        elif op == BinaryOps.DIV: ret = self._np / srcs[0]._np
        elif op == BinaryOps.MAX: ret = np.maximum(self._np, srcs[0]._np)
        elif op == BinaryOps.CMPLT: ret = self._np < srcs[0]._np
        ...
Recap: forward() and backward() wrap calls to the LazyBuffer.e() method, which evaluates operations on the actual data (using Numpy as the backend).
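We can even call e() directly on the lazydata of two tensors and wrap the result back into a Tensor ourselves (a hedged sketch: I’m assuming BinaryOps is importable from teenygrad.ops, mirroring tinygrad’s layout):

from teenygrad.ops import BinaryOps  # import path is an assumption

a = Tensor([1.0, 2.0])
b = Tensor([3.0, 4.0])
out_lb = a.lazydata.e(BinaryOps.ADD, b.lazydata)  # raw LazyBuffer result
print(Tensor(out_lb).numpy())                     # array([4., 6.], dtype=float32)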
The question still is… where is the apply() method?
The answer is in the class definition: class Add(Function). All the operations in mlops (like Add) inherit from the Function parent class. I suspected that the Function class has an apply() method that executes the call to .forward(), and that is correct:
class Function:
    def __init__(self, device:str, *tensors:Tensor):
        self.device = device
        self.needs_input_grad = [t.requires_grad for t in tensors]
        self.requires_grad = True if any(self.needs_input_grad) else None if None in self.needs_input_grad else False
        if self.requires_grad:
            self.parents = tensors

    def forward(self, *args, **kwargs): raise NotImplementedError(f"forward not implemented for {type(self)}")
    def backward(self, *args, **kwargs): raise RuntimeError(f"backward not implemented for {type(self)}")

    @classmethod
    def apply(fxn:Type[Function], *x:Tensor, **kwargs) -> Tensor:
        ctx = fxn(x[0].device, *x)
        ret = Tensor(ctx.forward(*[t.lazydata for t in x], **kwargs), device=ctx.device, requires_grad=ctx.requires_grad)
        if ctx.requires_grad and not Tensor.no_grad: ret._ctx = ctx  # used by autograd engine
        return ret
Quite a bit of code, so let’s take it in steps.
See the @classmethod? That is what is called by return mlops.Add.apply(...). One point that confused me for a while was the presence of fxn in apply(fxn, ...), instead of the usual self (or cls, the convention for class methods), since the first argument of a classmethod is always a reference to the class itself. For a while I thought that it was some special syntax for class methods, but no. The truth is that, while it’s customary to name the first argument self or cls, we can name it whatever we want. For even more clarity, I will rewrite apply() as follows:
def apply(self:Type[Function], *x:Tensor, **kwargs) -> Tensor:
    ctx = self(x[0].device, *x)
    lb = ctx.forward(*[t.lazydata for t in x], **kwargs)
    ret = Tensor(lb, device=ctx.device, requires_grad=ctx.requires_grad)
    if ctx.requires_grad and not Tensor.no_grad:
        ret._ctx = ctx  # used by autograd engine
    return ret
It’s not difficult to understand what’s going on. First, ctx = self(x[0].device, *x) calls the __init__() method of Add, which creates an instance of it and stores it in the ctx variable. Then, the .forward() method of the Add instance is called, which returns a LazyBuffer, which is then wrapped into a Tensor and assigned to ret. Finally, the ret tensor is given a ._ctx: the operation that created it, carrying the information needed to perform backpropagation.
For example, looking at the definition of the mlops.Mul operation, we can see that it stores the input LazyBuffers in its context, since they are needed for the backward pass:
class Mul(Function):
    def forward(self, x:LazyBuffer, y:LazyBuffer) -> LazyBuffer:
        self.x, self.y = x, y
        return x.e(BinaryOps.MUL, y)

    def backward(self, grad_output:LazyBuffer) -> Tuple[Optional[LazyBuffer], Optional[LazyBuffer]]:
        return self.y.e(BinaryOps.MUL, grad_output) if self.needs_input_grad[0] else None, \
               self.x.e(BinaryOps.MUL, grad_output) if self.needs_input_grad[1] else None
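We can verify this by peeking at the context after a multiplication (a small sketch; x and y are the attribute names set in forward() above, and since they are LazyBuffers I wrap them back into Tensors to print them):

a = Tensor([2.0], requires_grad=True)
b = Tensor([3.0], requires_grad=True)
c = a * b
print(Tensor(c._ctx.x).numpy())  # array([2.], dtype=float32)
print(Tensor(c._ctx.y).numpy())  # array([3.], dtype=float32)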
And that’s it.
There is one last mystery to uncover: where is the call to mlops.Add.backward()? The answer is in the next section, about The Backward Pass.
The Backward Pass
If you come from PyTorch, you’ll be used to calling loss.backward(), which runs the backward pass and stores in .grad the gradient of the loss with respect to each tensor. teenygrad works in the same way.
The question is: How does the backward pass work?
Let’s go back to a simple computation: the multiplication of two tensors:
x = Tensor([1.0], requires_grad=True)
y = Tensor([2.0], requires_grad=True)
z = x * y
loss = z.sum()
loss.backward()
This is what the computational graph looks like before and after running loss.backward():
Note: In order to plot the previous graph, I had to disable a line in teenygrad’s source code (tensor.py:259) that deletes the ._ctx once a tensor’s grad is set:

def backward(self) -> Tensor:
    ...
    # del t0._ctx
    return self
The backward pass is called as such because it starts running from the last node in the computational graph and propagates the gradients backwards.
Looking at the graph from the right to the left:
- The gradient of the last tensor is 1.0 by construction (dloss/dloss = 1).
- The Reshape operation just shuffles data around. Hence, the upstream gradient is just the downstream gradient, shuffled back.
- The Sum operation copies the downstream gradient to its inputs:
  out = x + y -> dloss/dx = dloss/dout * dout/dx = dloss/dout * 1
- The Mul operation swaps its inputs and multiplies them by the downstream gradient:
  out = x * y -> dloss/dx = dloss/dout * dout/dx = dloss/dout * y
Running the Backward Pass Manually
Let’s run the backward pass manually, step by step, starting from the last node and propagating the gradients back (from the right to the left). We start with a graph with empty gradients:
Steps:
- I’ll set loss.grad = Tensor(1) manually.
- To traverse back the Reshape operation I can use loss._ctx.backward() (remember that loss._ctx points to the instance of the Reshape operation which created the loss tensor). Then I can grab loss._ctx.parents and set their gradients manually:
reshape_op = loss._ctx
reshape_parents = loss._ctx.parents # tuple of a single element
grad_lazydata = reshape_op.backward(loss.grad.lazydata) # takes lazydata and gives lazydata
reshape_parents[0].grad = Tensor(grad_lazydata)
- Traversing back the Sum operation is slightly more cumbersome, since we cannot directly grab it, so we need to use loss._ctx.parents[0]._ctx. Again, Sum has a single parent:
sum_op = loss._ctx.parents[0]._ctx
sum_parents = loss._ctx.parents[0]._ctx.parents # tuple of a single element
grad_lazydata = sum_op.backward(loss.grad.lazydata)
sum_parents[0].grad = Tensor(grad_lazydata)
- Traversing back the Mul operation is easier. It has two parents, so mul_op.backward() will return two gradients:
mul_op = z._ctx
mul_parents = z._ctx.parents # tuple of 2 elements
grad_lazydata = mul_op.backward(z.grad.lazydata) # tuple of 2 elements
mul_parents[0].grad = Tensor(grad_lazydata[0])
mul_parents[1].grad = Tensor(grad_lazydata[1])
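To double-check the manual pass, we can rebuild the same graph and let the autograd engine do the work; both should agree (a small sanity check, variable names are mine):

xx = Tensor([1.0], requires_grad=True)
yy = Tensor([2.0], requires_grad=True)
(xx * yy).sum().backward()
print(xx.grad.numpy(), yy.grad.numpy())  # [2.] [1.] -> same as mul_parents[0].grad and mul_parents[1].grad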
That was easy! It was always the same pattern:
- From a given node, grab node._ctx and call node._ctx.backward(node.grad.lazydata).
- Grab node._ctx.parents and set their gradients.
Hopefully, the following extract from the teenygrad source code will no longer look daunting:
def backward(self) -> Tensor:
    assert self.shape == tuple(), f"backward can only be called for scalar tensors, but it has shape {self.shape})"

    # fill in the first grad with one. don't use Tensor.ones because we don't need contiguous
    # this is "implicit gradient creation"
    self.grad = Tensor(1, device=self.device, requires_grad=False)

    for t0 in reversed(self.deepwalk()):
        assert (t0.grad is not None)
        grads = t0._ctx.backward(t0.grad.lazydata)
        grads = [Tensor(g, device=self.device, requires_grad=False) if g is not None else None
                 for g in ([grads] if len(t0._ctx.parents) == 1 else grads)]
        for t, g in zip(t0._ctx.parents, grads):
            if g is not None and t.requires_grad:
                assert g.shape == t.shape, f"grad shape must match tensor shape, {g.shape!r} != {t.shape!r}"
                t.grad = g if t.grad is None else (t.grad + g)
        del t0._ctx
    return self
There are a couple of details to note:
- Gradients are accumulated (notice t.grad + g): when a node participates in several operations, its gradient gets contributions from all of them (see the quick check below).
- After a node is used, it’s detached from the graph with del t0._ctx. I don’t fully understand why.
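Here is a quick check of the accumulation behaviour (a minimal sketch): a appears as both parents of the same Mul, so its gradient receives a contribution of 2.0 from each branch.

a = Tensor([2.0], requires_grad=True)
(a * a).sum().backward()
print(a.grad.numpy())  # array([4.], dtype=float32) -> d(a*a)/da = 2a, accumulated from both parents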
The last question is… what is self.deepwalk()?
Topological Sort
There is another important ingredient in an autograd engine: an algorithm to organize the order in which each node should propagate its gradient back. This is important because a node must have its gradient set before it can propagate it to the previous ones. This algorithm is Topological Sort.
Given a DAG, Topological Sort organizes the nodes into a list such that every node appears after all its dependencies.
Topological Sort is implemented in def deepwalk(). It’s a standard algorithm, so I won’t discuss it much:
def deepwalk(self):
    def _deepwalk(node, visited, nodes):
        visited.add(node)
        if getattr(node, "_ctx", None):
            for i in node._ctx.parents:
                if i not in visited: _deepwalk(i, visited, nodes)
            nodes.append(node)
        return nodes
    return _deepwalk(self, set(), [])
Instead, let’s see it in action. For the following graph (where I’ll use each node’s data as its identifier, since they are all different):
This is the output of deepwalk(), called on the last node:
[2, 3, 4, 5, 6, 11, 13]
We just need to reverse the order of the list, and that is the order in which we call .backward() on each node.
def backward(self) -> Tensor:
    ...
    for t0 in reversed(self.deepwalk()):
        ...
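As a runnable illustration on the small multiplication graph from before (a sketch; the exact set of intermediate nodes depends on how sum() is decomposed into Sum and Reshape):

x = Tensor([1.0], requires_grad=True)
y = Tensor([2.0], requires_grad=True)
loss = (x * y).sum()
order = loss.deepwalk()  # only nodes that have a _ctx, dependencies first
print([t._ctx.__class__.__name__ for t in order])
# e.g. ['Mul', 'Sum', 'Reshape'] -> reversed, this is the backward order: Reshape, Sum, Mul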
Conclusion
These Study Notes do not aim to be an exhaustive discussion of teenygrad, but to illustrate a few of its concepts. However, I hope that after reading them, approaching the source code of teenygrad to dig deeper becomes much easier. Also, having a clear picture of teenygrad’s code architecture helps when approaching its bigger (and more complex) brother, tinygrad.