Introduction

In the first post of the series we discussed in depth the implementation details of micrograd, which is basically a Python implementation of the backpropagation algorithm. At the end of that post I tried to give a small taste of why backpropagation is such an important algorithm, and today we will continue following Andrej's lecture series and dive a lot deeper into where it fits in the grand scheme of things as we tackle our first prediction model: given a sequence of letters (x1, x2), predict what the third letter y will be.

The bigram model

The code starts off with some data processing, which produces several important data structures that will be used later on in the model:

import numpy
import torch
import matplotlib.pyplot as plt

words = open('names.txt', 'r').read().splitlines()
g = torch.Generator().manual_seed(2147483647)
perm = torch.randperm(len(words), generator=g)

n = len(words)
n_train = int(0.8 * n)
n_dev = int(0.1 * n)

train_words = [words[i] for i in perm[:n_train]] # Training set
dev_words = [words[i] for i in perm[n_train:n_train + n_dev]] # Development set
test_words = [words[i] for i in perm[n_train + n_dev:]] # Test set

chars = sorted(list(set(''.join(words))))
ctoi = {c: i+1 for i,c in enumerate(chars)} # Mapping each char to an index: 'a'->1, 'b'->2, ...
ctoi['.'] = 0 # Special char to mark the start/stop of a word
itoc = {i: c for c, i in ctoi.items()}  # and the reverse mapping: 1->'a', 2->'b', ...

N = torch.zeros((27, 27), dtype=torch.int32) # Bigram count matrix - more details below

for w in train_words:
    chs = list(f".{w}.")
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = ctoi[ch1]
        ix2 = ctoi[ch2]
        N[ix1, ix2] += 1

The data processing consists of reading a list of names from a text file, splitting them into training, development, and test sets, and creating mappings between characters and their corresponding indices. The most important data structure created here is the bigram count matrix N, which counts the occurrences of each pair of consecutive characters in the training set. For example, the first name in the dataset is “emma”, so the pairs of characters are (., e), (e, m), (m, m), (m, a), (a, .), where the . character marks the start and end of a word. The corresponding index pairs are (0, 5), (5, 13), (13, 13), (13, 1), (1, 0), and the count matrix N is incremented at these entries, effectively creating a histogram of consecutive character pairs.
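As a quick illustration (my own sketch, reusing the ctoi mapping from above), here is how “emma” breaks down into these pairs:

# Sketch: print the character pairs and their index pairs for the name "emma".
w = "emma"
chs = list(f".{w}.")
for ch1, ch2 in zip(chs, chs[1:]):
    print((ch1, ch2), '->', (ctoi[ch1], ctoi[ch2]))
# ('.', 'e') -> (0, 5)
# ('e', 'm') -> (5, 13)
# ('m', 'm') -> (13, 13)
# ('m', 'a') -> (13, 1)
# ('a', '.') -> (1, 0)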

The resulting count matrix looks like this:

[Figure: Bigram count matrix]
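For reference, here is a minimal sketch of how such a figure can be produced with matplotlib (imported above but otherwise unused so far); it follows the visualization from Andrej's lecture and is not part of the data-processing code itself:

# Sketch: visualize the bigram count matrix N (assumes N, itoc and plt from above).
plt.figure(figsize=(16, 16))
plt.imshow(N, cmap='Blues')
for i in range(27):
    for j in range(27):
        bigram = itoc[i] + itoc[j]
        plt.text(j, i, bigram, ha='center', va='bottom', color='gray')
        plt.text(j, i, str(N[i, j].item()), ha='center', va='top', color='gray')
plt.axis('off')
plt.show()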

Now, to turn this into a probabilistic model, we need to convert these histogram counts to probabilities. This is done by normalizing each row of the count matrix N so that each row sums to 1. This way, each entry in a given row corresponds to the probability of a letter following the letter represented by that row:

P = (N+1).float()
P /= P.sum(1, keepdim=True)

Note the addition of 1 to each count in N before normalization; this is known as Laplace smoothing and helps avoid zero probabilities for character pairs that did not occur in the training data. The result looks like this:

[Figure: Bigram probabilities matrix]
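A quick sanity check (my own addition, not from the original code) that the normalization did what we expect, i.e. that every row of P is a valid probability distribution:

# Sanity check: every row of P should sum to 1 after normalization.
assert torch.allclose(P.sum(1), torch.ones(27))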

With this probability matrix P, we can now predict the next character given the current character by simply looking up the corresponding row in P and sampling from the resulting probability distribution. For example, to predict the next character after ‘a’, we would look at row 1 of P (since ‘a’ is mapped to index 1) and sample according to the probabilities in that row:

g = torch.Generator().manual_seed(2147483647)
ix = 1 # 'a'
p = P[ix] # Probabilities for next char given current char 'a'
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
print(itoc[ix]) # Sampled next char
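The same sampling step can be repeated to generate a whole name. Here is a sketch following the same idea: start from the '.' token and keep sampling until '.' is drawn again:

# Sketch: generate a full name by repeatedly sampling the next character,
# starting from the start token '.' (index 0) and stopping once '.' is drawn again.
g = torch.Generator().manual_seed(2147483647)
out = []
ix = 0  # start at the '.' token
while True:
    p = P[ix]
    ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
    if ix == 0:  # sampled '.', i.e. end of the name
        break
    out.append(itoc[ix])
print(''.join(out))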

Measuring a model’s performance

We start off with the assumption that there is some “true” probability distribution from which our dataset was sampled and which our model is trying to approximate. For example, in the case of the bigram model, this distribution governs the relationship between input characters and their corresponding output characters. To evaluate any model's performance we need a metric that measures how well it performs. Let's denote our model's predicted probability distribution as $Q(y | x)$ and the “true” distribution as $P(y | x)$. Suppose for a given input x the probability of the true label $y_i$ is $P(y_i | x)$.
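To make this setup concrete, one common choice for such a metric is the average negative log-likelihood of the data under the model's distribution. Here is a minimal sketch (my own, treating the bigram matrix P from the code above as the model $Q(y | x)$ and scoring it on the dev set):

# Sketch: average negative log-likelihood of the dev set under the bigram model.
# Lower is better; a model that assigned probability 1 to every true next char would score 0.
log_likelihood = 0.0
n_pairs = 0
for w in dev_words:
    chs = list(f".{w}.")
    for ch1, ch2 in zip(chs, chs[1:]):
        prob = P[ctoi[ch1], ctoi[ch2]]
        log_likelihood += torch.log(prob).item()
        n_pairs += 1
print(f"average NLL: {-log_likelihood / n_pairs:.4f}")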