To keep things simple, let’s assume that our model takes a maximum input of five words; that is, its context length is 5. A context length of 5 means a single sequence gives the network five training examples: at each position, the model learns to predict the next word from the words that precede it. Throughout this explanation, I’m using ‘word’ to mean ‘token’ for simplicity. I’m assuming one word corresponds to one token, but note that this is often not the case in reality.
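To make this concrete, here’s a minimal sketch of how one five-word input yields five next-word training examples (the sentence and the one-word-per-token assumption are mine, purely for illustration):

```python
# One five-word input; each prefix predicts the word that follows it.
# The sentence is hypothetical, and we assume one word = one token.
words = ["the", "cat", "sat", "on", "the"]
next_word = "mat"  # the word that comes after the input

sequence = words + [next_word]
# Build (context, target) pairs: five training examples from one sequence.
for i in range(1, len(sequence)):
    context, target = sequence[:i], sequence[i]
    print(f"context={context} -> target={target}")
```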
Embeddings are the heart of LLMs, supplying the oxygen so the network can do its job of predicting the next
token. I’ve written two posts explaining what embeddings are and how they’re constructed. In this particular
example, I created a 4-dimensional embedding for each word. To give you an idea of how small our embedding
is, GPT-3 uses an embedding size of 12,288.
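To picture what a 4-dimensional embedding looks like, here’s a toy lookup table; the words and numbers below are invented for illustration, since real embeddings are learned during training:

```python
import numpy as np

# A toy embedding table: one 4-dimensional vector per word.
# These values are made up; in a real model they are learned parameters.
embeddings = {
    "the": np.array([0.1, -0.3, 0.7, 0.2]),
    "cat": np.array([0.9, 0.4, -0.1, 0.5]),
    "sat": np.array([-0.2, 0.8, 0.3, -0.6]),
}

print(embeddings["cat"])        # the four numbers that stand in for "cat"
print(embeddings["cat"].shape)  # (4,) — versus GPT-3's 12,288
```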
As the network undergoes training, it updates the embeddings so that the loss is minimized. For simplicity, let’s
assume these embeddings encapsulate both the semantic meaning and positional information of each word.
It’s important to remember that word order plays a vital role in language; this order is represented in the
positional embeddings.
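If you’re wondering how positional information actually gets in, one common scheme (the one the GPT family uses) is to learn a second table of vectors, one per position, and add it to the word embeddings. The sketch below assumes that scheme, with random numbers standing in for learned values:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, context_length = 4, 5

# Toy learned tables: one vector per word, one vector per position.
word_embedding = {w: rng.normal(size=d_model) for w in ["the", "cat", "sat", "on"]}
position_embedding = rng.normal(size=(context_length, d_model))

sentence = ["the", "cat", "sat", "on", "the"]
# Input to the network: word meaning plus where the word sits.
x = np.stack([word_embedding[w] + position_embedding[i]
              for i, w in enumerate(sentence)])
print(x.shape)  # (5, 4): context length x embedding size
```

Notice that the two occurrences of “the” end up with different vectors, because their positional components differ; that’s exactly how word order gets encoded.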
In our cricket example, queries, keys, and values were distinct elements. In LLMs, however, they all start from the same data, which can be confusing: at first, each word’s embedding serves as its query, its key, and its value. As the neural network is trained, the three diverge to serve their unique roles.
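A minimal sketch of that divergence, assuming the standard setup where the same input is multiplied by three separate weight matrices (random stand-ins below for what training would learn):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 4
x = rng.normal(size=(5, d_model))  # 5 words, each a 4-dimensional embedding

# Three separate learned weight matrices (random here for illustration).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

# The same x produces all three; training pushes W_q, W_k, W_v apart,
# which is how queries, keys, and values come to play different roles.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
print(Q.shape, K.shape, V.shape)  # each (5, 4)
```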
Queries are what each word is searching for. This is similar to the query feature vector [0.7, 0.7, 0.7, 0.7] in the cricket example. Keys are what each word offers up to be searched against. This is similar to the feature vector of