Deep Dive Into LLM's Machine Learning notes

SrinivasPonugupaty1 · Oct 18, 2024

About This Presentation

Machine learning notes


Slide Content

Hello LLMs!
Jana Vembunarayanan / October 3, 2023
When you are having trouble getting your thinking straight, consider an extreme or simple case. This will often
give you the insight needed to move forward.
I nodded in agreement when I read the above lines from the book Maxims for Thinking Analytically.
ChatGPT took the world by storm when it was launched last November. It reached 1 million users in just 5 days
after launch. It’s a chatbot that’s orders of magnitude more advanced than anything I had previously interacted
with. Behind the scenes, it uses Large Language Models to create the magic.
A language model is a statistical model that predicts the next word given a sequence of words. Google search has been using a language model for years before ChatGPT came out. This shouldn't be surprising, as the architecture that powers large language models (LLMs) came out of Google.

The reason why it’s not just a language model but a large language model is because it’s trained on a large
corpus of data from the internet. The data comes from various sources like Wikipedia, GitHub, Books, ArXiv,
StackExchange, etc.
The key ingredient behind the LLM revolution is that these models can be trained on unstructured and messy
real world data, without the need for carefully curated and human-labeled data sets. This is why these models
are self-supervised.

For example, we can train the model to generate the next word by showing it “The cat sat on the …“. If it
generates “floor, carpet, or mat” the model will be rewarded. Otherwise, the model will be penalized. This way
the model learns, like how a child learns to walk. Fall several times, learn from failures, and walk right.
As a result almost all textual data on the web becomes useful to train these LLMs. These models are trained on
trillions of words. The Coming Wave highlighted the staggering difference between the volume of data these
models are trained on and the amount an average American reads.
An average American reads for 15 minutes a day. Assuming a reading speed of 200 wpm, this translates to 1
million words per year. In contrast, an LLM is trained on trillions of words in just a single month-long training
run. After grasping this comparison, I felt a jolt of electricity shooting from the tail of my spine to my brain.
Consider the LLM as a black-box where you can pose any question in natural language. It replies in the same
natural language. These models can craft stories, essays, answer queries, translate between languages,
converse with humans, and even ace AP tests. The technical term for what you instruct an LLM to do is called
a “Prompt.”
How do these LLMs achieve such a wondrous feat? What’s happening inside them? I’m eager to delve into the
nuts and bolts of everything that happens within an LLM. Where’s the best place to begin? I think it’s essential
to reiterate Richard Zeckhauser’s maxim with which I began this post.

When you are having trouble getting your thinking straight, consider an extreme or simple case. This will often
give you the insight needed to move forward.
Instead of trying to understand how to predict the next word, how about we try to predict the next character?
Instead of using a neural network, how about we use a simple logic of frequency counting to predict the next
character?
This is what Andrej Karpathy did by starting with a simple Character Level Language Model in his Neural Networks lecture. The idea is to predict the next character based on the previous character. Karpathy used the 26 letters of the English alphabet.
I'm going to simplify it by using only four letters: A, T, C, G. Yes, these are the letters that the book of life is written in.
I created 100 DNA sequences of varying lengths from 1-100 using this tool. I constructed a 5×5 table and filled
in the frequency counts as shown below.

The grid captures three pieces of information for each sequence:
1. The first row tracks how often a sequence begins with one of the four characters, each indicated by a
prefix (e.g., .a, .t, .c, .g). For instance, 18 sequences start with “a”, 56 with “t”, 17 with “c”, and 9 with “g”.

2. The first column tracks how often a sequence ends with one of the four characters, each indicated by a
suffix (e.g., a., t., c., g.). For example, 15 sequences end with “a”, 50 with “t”, 23 with “c”, and 12 with “g”.
3. The remaining cells track the frequency of one character following another. For instance, the cell at the
intersection of the second row and second column indicates that “a” followed by another “a” appears 19
times. Similarly, the cell at the third row and fourth column shows that “t” followed by “c” occurs 51 times.
The count of 0 in the cell at the first row and first column (..) signifies that there were no empty sequences.
Verify for yourself by mentally populating an empty grid using the sequence ‘ctatgt’ and then compare it to the
grid provided below.

We’re now ready to use the grid, which traces the frequency of 100 sequences, to create new DNA sequences!
1. Begin with the first row.
2. Perform a multinomial sample and select one item. “a” will appear 18% (18/100) of the time, “t” 56%, “c”
17%, and “g” 9%. Suppose you select “g”.

3. Move to row 5 and perform another multinomial sample to determine the character following “g”. Let’s
say you select “t”. This has a 31% probability because “t” appears 22 times out of the 72 possible
sequences after “g”.
4. Proceed to row 3 and conduct a multinomial sample to identify the character following “t”. Suppose the
sequence concludes here. There’s a 22% chance of this happening, as the sequence can end in 50 out
of the 232 possible sequences after “t”.
5. Your resulting sequence is “gt”.
I generated 100 DNA sequences using the above algorithm and drew the grid below.

Congratulations! We came up with a DNA sequence generator by predicting the next character using the
previous character.
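Here is a minimal sketch of the counting-and-sampling procedure described above. The helper names, the use of NumPy, and the '.' start/end marker are my own assumptions, not code from the post:

```python
import numpy as np

ALPHABET = ['.', 'a', 't', 'c', 'g']     # '.' marks both the start and the end of a sequence
IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def build_counts(sequences):
    """Fill the 5x5 grid: counts[i, j] = how often character j follows character i."""
    counts = np.zeros((5, 5), dtype=np.int64)
    for seq in sequences:
        chars = ['.'] + list(seq) + ['.']            # pad with start/end markers
        for prev, nxt in zip(chars, chars[1:]):
            counts[IDX[prev], IDX[nxt]] += 1
    return counts

def generate(counts, rng=np.random.default_rng()):
    """Sample a new sequence by repeatedly drawing the next character from the grid."""
    out, current = [], IDX['.']
    while True:
        probs = counts[current] / counts[current].sum()   # turn a row of counts into probabilities
        current = rng.choice(len(ALPHABET), p=probs)      # multinomial sample
        if current == IDX['.']:                           # drawing '.' ends the sequence
            return ''.join(out)
        out.append(ALPHABET[current])

sequences = ['ctatgt', 'gt', 'ttac']     # toy input; the post used 100 sequences
counts = build_counts(sequences)
print(generate(counts))
```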
Our generator can be enhanced by considering more than one previous character. However, the lookup table
required to monitor frequency counts expands exponentially. The English language comprises over 170,000
words, and the relationships between words in a sentence are much more complex.

Consider the following 2 sentences.
The children played joyfully on the bank of the river, watching the water flow by.
I need to visit the bank today to deposit my paycheck.
The meaning of the word bank depends on the context. It refers to the river bank in the first example. The
second example refers to a financial institution. Such complex relationships can't be modeled with simple frequency counting.
How does the machine learn this relationship?
To understand that we need to open our high school textbook and start reading the chapter on Derivatives. I’ll
cover this topic in my next post.
October 3, 2023 in Computer Science, Statistics. Tags: LLM, Machine Learning

How a Machine Learns
Jana Vembunarayanan / October 10, 2023
How a fly helped Descartes to invent the Cartesian plane
René Descartes (1596 – 1650), the famous philosopher and mathematician, was lying in his bed, eyes fixed on
a fly on the ceiling.
As he pondered how to accurately describe the fly’s position, Descartes visualized the ceiling as a rectangle
sketched on paper, using the bottom left corner as a starting point. To pinpoint the fly, one would simply
measure the distance to travel horizontally and then vertically. These two numbers are the fly’s coordinates.

I can’t vouch for the authenticity of this tale. But what I do know is that the coordinate system he dreamt up,
often referred to as the Cartesian plane, was nothing short of revolutionary. It seamlessly wove together
Algebra and Geometry.
Learning functions and rate of change through a slice of bread
Suppose one slice of bread contains 100 calories. This can be represented algebraically as y = 100x, where x
is the number of bread slices and y is the total calories from consuming x slices of bread. The graph below
illustrates this equation visually.

x and y are variables where the value of x determines the value of y. We can consider x as an independent
variable and y as a dependent variable. It can also be said that y is a function of x. A function is a set of rules
that takes one or more inputs, applies the rule, and yields an output.
In the given example, the rule is to multiply the number of bread slices by 100 to determine the total number of
calories. If you input 2, you get 200 calories. If you input 5, you get 500 calories. x and y have a linear
relationship in this example.
In a linear relationship, the rate of change (slope) in y is consistent for every unit change in x. Each slice of
bread produces 100 calories no matter how fast we eat. The slope is calculated as the change in y divided by
the change in x, represented as Δy/Δx.

For example, if consuming 4 bread slices gives you 400 calories and 2 bread slices gives you 200 calories, the
calorie increase per bread slice is calculated as (400 – 200) divided by (4 – 2), which equals 100.
The slope or rate remains constant at 100, whether you’re comparing between 6 and 8 slices or any other two
points on the x-axis. This is the reason why you see a horizontal line at 100 on the right chart above. Although
the rate is represented by the number 100, it’s technically a function. Rate is a constant function in the bread
slice example. Steven Strogatz, in his insightful book “Infinite Powers,” elaborates this idea.
When a rate is constant, it’s tempting to think of it as simply being a number, like 200 calories per slice or $10
an hour or a slope of 1/12. That causes no harm here, but it would get us into trouble later. In more complicated
situations, rates will not be constant. For example, consider a walk through a rolling landscape, where some
parts of the hike are steep and others are flat. On a rolling landscape, slope is a function of position. It would be
a mistake to think of it as a mere number. Likewise, when a car accelerates or when a planet orbits the sun, its
speed changes incessantly. Then it’s vital to regard speed as a function of time. So we should get in that habit
now. We should stop thinking of rates of change as numbers. Rates are functions.
Nonlinear functions and the idea of a derivative
However, not all functions in nature are linear. When a function is not linear, its rate of change, Δy/Δx, is not
constant. Consider the nonlinear function s = t². It describes the position “s” of an object as a function of time “t”.

The graph on the left tracks the object's position over time. s = t² is a nonlinear function, and its graph is a
curve, unlike the straight line we observed in the bread slice example. The graph on the right tracks the rate of
change of position over time, which is also known as velocity. The change in velocity is represented by a
slanted line, in contrast to the horizontal line we saw in the bread slice example.
What is the velocity of the object at t = 4 seconds?
It might be tempting to use the slope formula we applied to the linear function here, but that would yield the average velocity between t1 and t2 seconds. For instance, the first triangle on the left chart calculates the average velocity using the formula (5² – 3²) / (5 – 3), resulting in an average velocity of 8 m/s.
The second triangle on the left chart calculates the average velocity using the formula (9² – 7²) / (9 – 7), resulting in an average velocity of 16 m/s. The velocity differs between these two points. Moreover, if you select any pair of points on the curve, you'll discover that each has a unique velocity.
The question is not about calculating average velocity, but rather instantaneous velocity. Think of this as looking at the speedometer to find out how fast the car is traveling at a particular moment.
What if we set t1 and t2 to 4 seconds? (4² – 4²) / (4 – 4). That won't work, as dividing by zero is forbidden.

In the 17th century, Newton and Leibniz developed calculus to address challenges posed by variables that shift
continuously over time. They understood that their existing tools were apt for linear functions but fell short for
nonlinear functions involving constant change.
They found that if you keep narrowing in on a curve, it will ultimately resemble a straight line. But there’s a
catch: the increase in ‘x’ must be infinitesimally tiny. This insight allowed them to apply the conventional slope
formula.
If we set the infinitesimally small amount to 0.00001 and compute (4.00001² – 4²) / (4.00001 – 4), we obtain a result of approximately 8 m/s. This method is known as taking the derivative. I set t = 4 and got the derivative as 8 m/s. Obviously, this method isn't scalable, as the velocity is different for each value of t.
Mathematicians are smart. They abstract everything and provide us a simple formula to calculate the derivative at any point we want. For the function t², the derivative is 2t. This is the reason why we got a derivative of 8 m/s at t = 4 s (2 * 4). It's easy to derive why the derivative of t² is 2t from first principles.
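As a quick numerical check of that idea (my own illustration, not code from the post):

```python
def position(t):
    return t ** 2          # s = t²

def derivative(f, t, h=1e-5):
    # slope over an infinitesimally small step h
    return (f(t + h) - f(t)) / h

print(derivative(position, 4))   # ~8.00001, matching 2t at t = 4
```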

What can we learn from the magnitude and direction of a derivative?
Derivatives offer two essential pieces of information, paving the way for machine learning. Take a look at the graph for the nonlinear function y = x². I marked four points P1, P2, P3, and P4 on it. Each point provides the magnitude and direction of how the y value is changing with respect to x.

At point P1, where x is -8, y is 64, and the derivative is -16. What does the negative sign tell you? As x
increases, for instance from -8 to -7, the value of y decreases from 64 to 49. A similar trend is observed at point
P2.
Let’s disregard the sign for now. What does the magnitude of 16 at P1 indicate? The value of y changes
significantly more when the magnitude is higher compared to when it’s lower. As x increases from -8 to -7, the
value of y decreases from 64 to 49. In contrast, when x rises from -3 to -2, the value of y drops from 9 to 4—a
much smaller change since the magnitude at P2 is less than that at P1.
At point P3, the value of the derivative is 0. This suggests that the value of y is at its minimum when x is 0.
At point P4, where x is 3, y is 9, and the derivative is 6, a positive sign indicates that y increases as x
increases. By understanding both the magnitude and the direction (sign) of the derivative, we can predict how
the value of y changes as x changes. This predictive capability is profound, and machines utilize it to learn.
Let’s make the machine predict weight based on height
I’ve created a simple weight function that accepts height in centimeters as input and calculates the weight
using the formula 0.32 * height. To introduce variability, I’ve added a random noise of up to 15% to the weight.
For instance, if the weight is 100 kgs and the noise factor is 0.15, the noise can range from -15 kgs to 15 kgs.
This means the resulting weight can vary between 85 kgs and 115 kgs. I generated weights for 100 random
heights ranging from 150 to 190 cm and displayed the results in the scatterplot below.
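A minimal sketch of how such synthetic data could be generated (the random seed and the exact NumPy calls are my assumptions, not the author's code):

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.uniform(150, 190, size=100)       # 100 random heights in cm
weights = 0.32 * heights                        # the "true" relationship
weights *= 1 + rng.uniform(-0.15, 0.15, 100)    # up to +/- 15% random noise
```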

The machine isn’t aware of the formula I used to generate the data. Can we input the height and weight data
into the machine, allowing it to construct a model that understands the correlation between height and weight,
and then predict the weight for new heights?
The model will try to come up with the best value for height_coefficient so that its weight prediction
(height_coefficient * height) comes as close as possible to the actual weight. Let’s assume that we train the
model with one example with a height of 161 cm and a weight of 47.81 kgs.
Assume that the model initializes the height_coefficient with a random value set to 0. With this, its predicted
weight will be zero (since 0 multiplied by 161 equals 0). To evaluate the quality of the model’s prediction, we
can create a loss function. This function squares the difference between the predicted weight and the actual
weight. In this case, the squared difference is (0 – 47.81)², which equals 2286.
I graphed the relationship between height_coefficient and the loss function below. The objective of the model is
to adjust the height_coefficient so that the value for the loss function is close to zero. When does the loss value
get close to zero? It happens when the model prediction gets closer to actual value.

By taking the derivative of the loss function with respect to the height_coefficient, we can determine how much
to increment or decrement the height_coefficient in order to reduce the loss. The derivative of the loss function
is given by: 2 * height * (height_coefficient * height – actual_weight).

When we substitute the given values, 2 * 161 * (0 * 161 – 47.81), the resulting value of the derivative is
-15,394.82. What does the negative sign indicate? It suggests that if you increase the height coefficient, the
value of the loss will decrease.
How much should we adjust the height coefficient? We shouldn’t directly use the value of -15,394.82. Instead,
we should apply a smaller amount to ensure we gain insights gradually from a specific example. This approach
prevents us from overly relying on one example.

This scaling factor is known as the learning rate, and I've chosen 0.00001 as its value. When we multiply -15,394.82 by 0.00001, we get -0.15. Since we move in the direction opposite to the derivative, we add 0.15 to the height coefficient. The table below captures all the computations discussed after the first iteration.
Iteration | Height | Height Coefficient | Predicted weight | Actual weight | Loss | Derivative | Learning rate | Add to Height coefficient | Updated Height Coefficient
1 | 161 | 0 | 0 | 47.81 | 2286.14 | -15395.97 | 0.00001 | 0.15 | 0.15
The last column “Updated Height Coefficient” will be the new height coefficient used by the model for the next
iteration. I was able to bring the loss close to zero on this one example with a height coefficient reaching 0.29.
Iteration | Height | Height Coefficient | Predicted weight | Actual weight | Loss | Derivative | Learning rate | Add to Height coefficient | Updated Height Coefficient
1 | 161 | 0 | 0 | 47.81 | 2286.14 | -15395.97 | 0.00001 | 0.15 | 0.15
2 | 161 | 0.15 | 24.79 | 47.81 | 530.20 | -7414.39 | 0.00001 | 0.07 | 0.23
3 | 161 | 0.23 | 36.72 | 47.81 | 122.96 | -3570.62 | 0.00001 | 0.04 | 0.26
4 | 161 | 0.26 | 42.47 | 47.81 | 28.52 | -1719.54 | 0.00001 | 0.02 | 0.28
5 | 161 | 0.28 | 45.24 | 47.81 | 6.61 | -828.10 | 0.00001 | 0.01 | 0.29
I trained the model using all 100 heights and weights. It converged to a height_coefficient of 0.32 after 15
iterations. It predicts the right weight of 52 kgs for 162 cm height. The chart below shows how the loss fell close
to zero while the height_coefficient went up to 0.32.

Congratulations! We've successfully trained the model to predict the correct weight. The modeling technique we used is called Linear Regression. The training is supervised, as we had to prepare the training data. The model came up with a line fit that approximated the input data.
You don’t need to write a lot of code

I recently came across a thought-provoking statement by Andrew Ng, a renowned figure in the AI world. He
mentioned, “An individual fluent in multiple languages — for instance, having English as their primary language
and Python as their secondary — has a greater potential than one who merely knows how to interact with a
large language model (LLM).” Andrew’s perspective resonates deeply with me.
I highly recommend working on 3 best books I have come across to get fluent in Python, develop the mental
models necessary to work with n-dimensional data, and grok the foundations of machine learning. Python
Distilled, Python for Data Analysis, and Programming Machine Learning are those 3 books.
How many lines of code do you estimate it took to train this basic model? In reality, it required only 22 lines.
Given below is the code with annotations added to aid readability.
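A minimal sketch of the training loop described above (the author's 22-line annotated code isn't reproduced here; the function names and the averaged loss are my assumptions):

```python
import numpy as np

def predict(heights, height_coefficient):
    return height_coefficient * heights

def loss(heights, weights, height_coefficient):
    # mean squared difference between predicted and actual weights
    return np.average((predict(heights, height_coefficient) - weights) ** 2)

def gradient(heights, weights, height_coefficient):
    # derivative of the loss with respect to the coefficient:
    # 2 * height * (predicted - actual), averaged over the data set
    return 2 * np.average(heights * (predict(heights, height_coefficient) - weights))

def train(heights, weights, iterations=15, learning_rate=0.00001):
    height_coefficient = 0.0
    for i in range(iterations):
        print(f"iteration {i}: loss = {loss(heights, weights, height_coefficient):.2f}")
        height_coefficient -= gradient(heights, weights, height_coefficient) * learning_rate
    return height_coefficient

# heights and weights are the synthetic arrays generated earlier
# height_coefficient = train(heights, weights)   # settles near 0.32
```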
The derivative is one of the most important concepts to understand deeply. It’s like the gateway to the world of
machine learning. I would highly recommend reading Chapter 8: Understanding rates of change from the book
Math for Programmers.
Height is one parameter, but the model could be enhanced with additional ones. This requires minor
adjustments to the Python code, specifically for doing partial derivatives, matrix multiplication and transpose. I’ll
discuss this in my next post.
October 10, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

How a Machine Learns – Part 2
Jana Vembunarayanan / October 15, 2023
I’m truly impressed by how accurately LLMs grasp natural language and consistently provide spot-on answers.
Through these posts, I’ll build a toy GPT from scratch using Python and unpack everything that goes inside the
LLMs, including the math it uses. This is my third post on my journey to craft a toy GPT. Make sure to read the
previous posts before this one.
We used height to predict weight. However, weight is influenced by multiple factors, including sex. Although our
model incorporates factors like height and sex as independent variables, there might still be unknown variables
affecting a person’s weight. The model’s inability to account for these unobserved factors is captured through a
bias term.
I’ve created a simple weight function that accepts height in meters and gender as input and calculates the
weight using the formula 45.0 * heights_m + -10.0 * gender + 5.0. The value of -10 for gender indicates that, on
average, females weigh 10 kgs less than males of the same height. I’ve also added a bias term set at a value
of 5.0.

I generated weights for 1,000 random heights ranging between 1.5 and 1.9 meters. Of these, 50% are females
(represented by ‘1’) and 50% are males (represented by ‘0’). To introduce variability, I’ve added a random noise
of up to 3% to the weight. The scatterplot below illustrates this data.
Here’s what the first 10 rows out of 1,000 look like.
The objective of the model is to adjust the height_coefficient, sex_coefficient, and the bias_coefficient so that
the value for the loss function is close to zero. When does the loss value get close to zero? It happens when
the model prediction gets closer to actual value.
We’ve learned that by taking the derivative of the loss function with respect to the height_coefficient, we can
determine how to adjust the height_coefficient to minimize the loss. However, instead of deriving with respect to
just the height_coefficient, we need to take derivatives with respect to all three variables: height_coefficient,
sex_coefficient, and bias_coefficient.
This is achieved using partial derivatives; while the term may sound intimidating, the mathematics is quite
similar to what we employed for a single derivative. The image below shows how to take a partial derivative on
sex_coefficient with respect to the loss function. As an exercise, try deriving it for both the height_coefficient
and bias_coefficient.

The derivatives for height_coefficient and sex_coefficient are remarkably similar. For the height_coefficient, the derivative is 2 * loss * height, while for the sex_coefficient, it's 2 * loss * sex. The difference lies in their respective input terms: height for the former and sex for the latter.
The derivative for the bias_coefficient is simply 2 * loss, which stands out because it lacks the input term found in the other two. To address this, we can assign a value of 1 to the bias and adjust the derivative to 2 * loss * bias.

Mathematicians appreciate patterns because they can use tools like matrices. With matrices, they can organize
and process lots of numbers at once, much like a factory assembly line for math. Check out these links [1 and
2] to grasp matrix basics. Also, this tool lets you visually explore matrix multiplication.
I initialized all three coefficients randomly. The height coefficient was set to -0.14, the sex coefficient to 0.65,
and the bias coefficient to 0.50. In the image below, I show how the three coefficients are updated in one
iteration using the first two rows of the sample.

With just two lines of Python code, I accomplished the above computation. This underscores the importance of
developing the mental models needed to work with n-dimensional data. It also highlights why abstractions such
as matrices are incredibly powerful.
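Here is a sketch of how those two lines might look, assuming X is a 1,000 × 3 matrix whose columns are height, sex, and a constant 1 for the bias, and y is the vector of actual weights (the variable names and the averaging are my assumptions, not the author's exact code):

```python
import numpy as np

def train(X, y, iterations=100_000, learning_rate=0.0001):
    w = np.array([-0.14, 0.65, 0.50])        # height, sex, and bias coefficients
    for _ in range(iterations):
        # the two core lines: prediction error and gradient for every sample at once
        error = X @ w - y                    # shape (1000,)
        w -= learning_rate * 2 * (X.T @ error) / len(y)   # shape (3,)
    return w
```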
I trained the model using all 1,000 samples and executed 100,000 iterations with a learning rate of 0.0001. The
model minimized the loss to 6.1, achieving a height coefficient of 36.15, a sex coefficient of -10.07, and a bias
coefficient of 20.15.
It’s noteworthy that all three coefficients initially moved in the positive direction. However, the sex coefficient
eventually changed direction, resulting in a negative value of -10.07. For a male measuring 1.65 meters in
height, the model predicts a weight of 79.79 kgs. For a female of the same height, it predicts a weight of 69.72
kgs, a difference attributed to the sex coefficient.

While the model’s coefficients didn’t exactly align with those hardcoded into the synthetic function, they were
optimized to minimize the loss function. The introduction of noise did create variation. However, I’m uncertain to
what extent this noise influenced the discrepancy in coefficient values between the model’s estimates and the
synthetic function.
Apart from slight modifications for matrix multiplication and transposition, the Python code remains the same as
in the previous post.
Weight is a continuous variable, which means there can be countless values between any two specific weights.
For instance, between 69 and 70 kgs, we can have values like 69.1, 69.11, 69.111, and so forth. In contrast,
many real-world problems require predictions of specific outcomes.
For example, determining if an email is spam or not has only 2 possible outcomes, making it a discrete
variable. The method used to predict such discrete outcomes is called Logistic Regression. A key distinction
between linear and logistic regression lies in the selection of the loss function and using the sigmoid function to
restrict the output value within a certain range. I’ll elaborate on this in my next post.
You don’t learn the game of cricket just by memorizing its rules. You begin by playing and learn the rules as you
go. It’s similar with machine learning. Instead of starting with a heavy math book, you can dive into hands-on
projects and pick up the math you need along the way. I'll end this post with a link to an article that summarizes David Perkins's Seven Principles of Teaching.

From Linear to Logistic – Sigmoid Curve and Log Loss
Jana Vembunarayanan / October 22, 2023
Can we predict the likelihood of heart disease from sleep and exercise?
This is my fourth post on my journey to build a toy GPT. I’d recommend checking out my previous posts before
diving into this one.
Let’s train a model to predict heart disease using sleep and exercise data. Unlike our previous example
involving height and weight, heart disease is a discrete variable with two possible outcomes: 0 indicates no
heart disease, and 1 indicates the presence of heart disease. Below, you’ll find a sample of 10 rows. These
were selected from 100 rows of synthetic data (a fictional example) that I created.
Sleep (in hours) | Exercise (in minutes) | Heart disease
4.25 | 0.94 | 1
7.7 | 19.09 | 0
6.39 | 9.43 | 0
5.59 | 15.26 | 0
2.94 | 27.23 | 0
2.94 | 7.48 | 1
2.35 | 12.31 | 1
7.2 | 22.67 | 0
5.61 | 6.86 | 1
6.25 | 2.31 | 1
You should observe the following pattern: if you exercise and sleep more, the likelihood of having heart disease
is low, indicated by a 0. Conversely, if you exercise and sleep less, the risk of heart disease is high,
represented by a 1. Let’s visualize the relationship between the dependent variable (heart disease) and the
independent variables (sleep and exercise).

Understanding the World Through the Lens of S-Curves
In linear regression, we draw a straight line (plane in 3D, hyperplane in n-dimensions) that best matches our
data. But how do we do that when our data, like in the graph above, doesn’t line up neatly? We still want to
benefit from linear regression. Can we use something else on top of linear regression to make the line curve?
This concept is similar to how evolution works.
Evolution tends to retain traits that are beneficial. When the environment shifts, it doesn’t discard these useful
traits. Instead, it builds upon them by introducing new mutations. For example, the human appendix was once
vital for digesting plant material. As our diets evolved, the appendix became less crucial, but instead of
disappearing, it now plays a role in our immune system.
We require a function that accepts the output from linear regression and compresses it to a range between 0
and 1. The Sigmoid curve, also known as the S-curve, serves this purpose. One technique I consistently turn to
for grasping functions is to visualize them by inputting values from extreme negatives all the way to extreme
positives.
Most of the values produced by the sigmoid function are near 0 or 1. There’s a sharp increase from 0 to 1 when
the input x is close to zero. This abrupt change is known as a ‘phase transition’, a phenomenon observed
frequently in nature. It’s super important in understanding how many things in the world work.
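Here is that technique applied to the sigmoid in a few lines (my own illustration, not code from the post):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for z in [-10, -5, -1, 0, 1, 5, 10]:
    print(z, round(sigmoid(z), 4))
# outputs hug 0 for large negative z, hug 1 for large positive z,
# and swing sharply through 0.5 when z is near zero
```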

Think of a neuron, a tiny part of your brain. It can either send a signal or not. But it’s not just an on-off switch.
It’s more like a dimmer switch for a light. At first, when you turn it, the light might not change much. But as you
keep turning, the light gets brighter quickly, and then it slows down again.
This S curve isn’t just in our brains. It’s everywhere! It’s in computers, how water turns to ice, how rumors
spread, and even in how popcorn pops. At first, you hear a few pops, then a lot all at once, and then it slows
down again.
Why do we need a new loss function?
In our height to weight prediction model, we utilized the mean square error (MSE) as the loss function. In MSE,
we calculated the average of the squared differences between the predicted and actual weights. However,
when working with sigmoid functions, we cannot use MSE as a loss function.
Let’s picture two doctors: A and B. Doctor A confidently declares that the patient doesn’t have heart disease
(indicating a ‘0’), yet in reality, the patient does (a ‘1’). On the other hand, Doctor B suggests that the patient
might have heart disease (giving a ‘0.5’ score) when the patient indeed has it (a ‘1’).
Clearly, neither Doctor A nor B provided perfect predictions. But who should face a stiffer penalty for their
inaccuracy? It should be Doctor A, who was not just wrong, but confidently so. The issue here is that MSE
doesn't distinguish between these two types of errors. The difference in their MSE is too small, as highlighted in the table.
Using MSE loss with a sigmoid function can hinder the model’s ability to minimize the loss effectively.

Let’s assume that the actual value is 1, and our goal is to predict 1. In linear regression, the output is
unbounded. This means that the predicted value can range from -∞ to +∞. There's a clear minimum: when the predicted value reaches 1, the loss reaches zero.
When we use a sigmoid function, the output is constrained to lie between 0 and 1. The loss will be zero when
the sigmoid function outputs 1, that’s when the predicted value equals actual value. But that never happens as
the sigmoid function outputs 1 only when its input reaches infinity. So loss would never be zero.

Consider the regions of the loss curve where the input value (z) to the sigmoid function is extremely low or
high. What happens when you compute the derivative where the line is horizontal? The derivative is essentially
zero because there’s no change in the loss.
When the derivative is close to zero, the weight updates during training become almost negligible. This can
lead to slow progress or even a standstill in learning. Remember, derivatives are like the engines that power
the machine to learn, and we can’t have them be zero. To wrap it up, we need a different loss function. The log
loss function is the answer, addressing the issues I’ve just pointed out.
Log Loss function to the rescue
When I first encountered the log loss function (often referred to as cross-entropy loss), I felt overwhelmed and
shut the book. First impressions can be misleading, but once we break down the formula, it becomes one of

the easiest things to understand.
The ‘log’ function here represents the natural logarithm with base e, which is a mathematical constant
approximately equal to 2.71828. We previously encountered e within the sigmoid function. Read my post for a
quick refresher on logarithms.
The formula looks a lot friendlier when you input specific values into it and see what you get. I want to see what
the formula spits out when we nail a prediction and when we totally miss the mark. I separated the calculations
for the term before and after the plus sign.
When the actual value is zero, the first term is multiplied by 0, causing it to vanish. Conversely, when the actual
value is 1, the second term vanishes due to multiplication by (1 – actual_value). So, only one of the two terms
remains. When the prediction is confidently incorrect, as seen in the 1st and 3rd examples, there's a significant
penalty of -35.
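A small sketch of that calculation (the 1e-15 clipping value is my assumption to avoid log(0); it reproduces the roughly 35-point penalty mentioned above, the log term itself being about -35 before the sign flip):

```python
import numpy as np

def log_loss(actual, predicted, eps=1e-15):
    predicted = np.clip(predicted, eps, 1 - eps)   # keep log() away from 0 and 1
    return -(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))

print(log_loss(1, 1.0))   # confidently right: ~0
print(log_loss(1, 0.0))   # confidently wrong: ~34.5
print(log_loss(1, 0.5))   # hedged guess: ~0.69
```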

When plotting both the MSE Loss and Log Loss, with the actual value set to 1 and the predicted value varying
between 0 and 1, the graph below illustrates that the Log Loss imposes a much steeper penalty for confidently
incorrect predictions.
As illustrated in the image below, the log loss function on the right is smooth, allowing us to take its derivative
at any point. Focus on the y-axis on both the charts below. The one on the left is squeezed between 0 and 1.
The one on the right has a much wider range allowing us to take a derivative. This wider range happens
because the log of a small number like 0.0001 is a big negative number -9.21, and when we multiply by -1 we
get a big positive number, 92,100 times bigger than the input 0.0001 to the log function.

Remember the output of a sigmoid is between 0 and 1, so feeding that to a log function expands the range.
Combining sigmoid with log functions is quite clever. It’s like using the sigmoid to categorize things and then
using the log function to fine-tune the categorization.
The derivative of the log loss function is strikingly similar to that of the MSE loss function. Instead of 2 * loss * input, it's loss * input, where loss is the predicted value minus the actual value. I highly recommend watching this video, where the presenter does a superb job explaining it.
Computing the coefficient update on heart disease
I initialized all three coefficients randomly. The sleep coefficient was set to -0.00138, the exercise coefficient to
0.00648, and the bias coefficient to 0.00497. In the image below, I show how the three coefficients are updated
in one iteration using the first two rows of the sample.

I trained the model using all 100 samples and executed 1M iterations with a learning rate of 0.001. The model
minimized the loss to 0.13, achieving a sleep coefficient of -2.07, an exercise coefficient of -0.42, and a bias
coefficient of 16.25.

Suppose we want to test the model’s prediction for someone who sleeps for 6.39 hours and exercises for 9.43
minutes. The model predicts that the person has a healthy heart condition, indicated by a zero. During
classification we round the value as we need to either return 0 (healthy) or 1 (disease).
The code I used for linear regression required only minor additions for the sigmoid, log loss, and classification.
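A sketch of those additions, assuming X holds sleep, exercise, and a constant 1 for the bias, and y holds the 0/1 labels (the names and the averaged update are my assumptions, not the author's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train(X, y, iterations=1_000_000, learning_rate=0.001):
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        predictions = sigmoid(X @ w)
        # gradient of the log loss: loss * input (no factor of 2, unlike MSE)
        w -= learning_rate * X.T @ (predictions - y) / len(y)
    return w

def classify(X, w):
    # round the sigmoid output: 0 = healthy, 1 = heart disease
    return np.round(sigmoid(X @ w))
```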

The decision boundary indicates where the classifier determines one class stops and the other starts. In the
image below, you’ll see this boundary as a green dashed line. This line represents the point at which the model
changes its prediction from ‘No Heart Disease’ to ‘Heart Disease’, based on sleep and exercise features. It’s
impressive to see how neatly the model separates the data between those with heart disease and those
without.

In this post, we used Logistic Regression to determine the presence or absence of heart disease. However,
Logistic Regression isn’t limited to binary outcomes. For instance, in handwritten digit recognition, it can
produce 10 possible outcomes (0-9) – a technique once employed in post offices to automate the sorting of
mail based on ZIP codes.
Adapting the code for more than two outcomes requires slight tweaks, which I encourage you to explore.
What’s next?
Do yourself a favor by studying Programming Machine Learning: From Coding to Deep Learning. It covers
everything we’ve discussed thus far, including building models to recognize handwritten digits. Practical Deep
Learning from fast.ai and Andrej Karpathy’s Neural Networks: Zero to Hero are the two top free courses you
should consider.
What are Neural Networks? How are they different from linear and logistic regression? Why do we need
backpropagation in a neural network? I’ll cover this in my next post.
October 22, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

Neural Networks – Part 1
Jana Vembunarayanan / November 4, 2023
This is my 5th post on my journey to build a toy GPT. I’d recommend checking out my previous posts before
diving into this one.
Take the statement, ‘I will only go for a walk if it’s nice outside AND I have free time.’ This can be neatly
represented using a truth table. We can also create a simple linear classifier. This classifier would be able to
determine a decision boundary that separates the ‘true (1)’ outcomes from the ‘false (0)’ ones, as illustrated
below.

Let’s revise the statement to include an OR condition for someone who gives walking a higher priority: ‘I will go
for a walk if it’s nice outside OR I have free time.’

Imagine you have an unusual preference for taking walks: “I will go for a walk if it is nice outside XOR if I have
free time, but not both.” This means you’ll only decide to walk if either the weather is pleasant or you have
some spare time—not when both conditions are present simultaneously. In simpler terms, XOR here means
you’re setting a rule for yourself to walk under one favorable condition at a time, never both.
Can you use a single classifier to separate ‘true’ outcomes from ‘false’ ones for the XOR case? A single linear
classifier can’t solve a simple XOR problem on its own. To distinguish ‘true’ outcomes from ‘false’ ones, you

need at least two classifiers to create a clear boundary. This concept of combining multiple linear classifiers is
the foundation for understanding neural networks.
In Logistic Regression, we saw how a sigmoid function takes the output of a linear function and makes it non-
linear by bending the line. The XOR example makes it pretty clear that we need more than one linear function.
Can we combine these two ideas to create a universal function that’s capable of learning from any type of data
presented to it?
The Rectified Linear Unit (ReLU) is another function similar to the sigmoid function; both change straight-line
input into something more curved or nonlinear. The ReLU function simply outputs zero for any negative input
and keeps the input unchanged if it is positive, a straightforward rule expressed as max(0, input).
Watch the video I put together. It shows what happens when you take several linear functions and pass them
through the ReLU function and add the output. It’s magical to see the line bend.
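Here is a small sketch of that idea (the particular slopes and intercepts are arbitrary choices of mine):

```python
import numpy as np
import matplotlib.pyplot as plt

def relu(z):
    return np.maximum(0, z)      # max(0, input)

x = np.linspace(-5, 5, 200)
# three different linear functions, each passed through ReLU, then summed
y = relu(1.0 * x + 1) + relu(-2.0 * x + 1) + relu(0.5 * x - 2)

plt.plot(x, y)                   # the result is a bent, piecewise-linear curve
plt.show()
```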

I visualize this concept through a simple illustration.
Linear Functions and the Power of ReLU

Before we go further, let’s take a quick glance at how the human brain works. Our brain is composed of many
cells, one type being the nerve cell, or neuron. Neurons are the cells responsible for sending and receiving
electrochemical signals to and from the brain itself.
It is estimated that our brain contains approximately 100 billion neurons, each potentially forming thousands of
connections with other neurons, resulting in an astounding total of around 100 trillion connections. The
complexity of the brain’s neural network is truly remarkable.
 Each neuron is made up of three parts: the dendrites, the cell body, and the axon.

In the fascinating book “The Brain That Changes Itself,” author Norman Doidge provides a clear and detailed
explanation of these three components of a neuron.
The dendrites are treelike branches that receive input from other neurons. These dendrites lead into the cell
body, which sustains the life of the cell and contains its DNA. Finally the axon is a living cable of varying
lengths (from microscopic lengths in the brain, to some that can run down to the legs and reach up to six feet
long). Axons are often compared to wires because they carry electrical impulses at very high speeds (from 2 to
200 miles per hour) toward the dendrites of neighboring neurons.
A neuron can receive two kinds of signals: those that excite it and those that inhibit it. If a neuron receives
enough excitatory signals from other neurons, it will fire off its own signal. When it receives enough inhibitory
signals, it becomes less likely to fire. Axons don’t quite touch the neighboring dendrites. They are separated by
a microscopic space called a synapse.
Once an electrical signal gets to the end of the axon, it triggers the release of a chemical messenger, called a
neurotransmitter, into the synapse. The chemical messenger floats over to the dendrite of the adjacent neuron,
exciting or inhibiting it. When we say that neurons “rewire” themselves, we mean that alterations occur at the
synapse, strengthening and increasing, or weakening and decreasing, the number of connections between the
neurons.
A neural network is a vast computational graph that models the architecture of the human brain. Imagine each
linear equation (m1x1 + m2x2 + … + b) as a standalone neuron. The result of this equation is fed into an
activation function like ReLU, which either suppresses or allows the signal to progress to the subsequent layer.

The weights and biases that the machine learns are analogous to the synaptic changes occurring within the
brain. By chaining these linear equations from one layer to another, creating a multi-layered web, we’re
essentially trying to mirror the brain’s dense network of neuron connections.
Humans have established themselves as the most dominant species on the planet, largely because of our
unique ability to create models of reality. The Neural Network is a model for how the human brain works. While
not flawless, it is sufficiently sophisticated to tackle complex challenges, such as driving a car, recommending
movies, or even writing code.
The image below is a simple neural network with one hidden layer. Hidden layers are layers of neurons
between the input layer and the output layer in neural networks. They’re called hidden because they don’t
interact with the external environment like the input and output layers do, meaning they don’t get raw data or
give final results directly.
You’re free to add several neurons in each layer and add multiple layers between the input and the output
layers. Adding more layers turns the network into a Deep Neural Network. The complexity of the problem you’re
tackling will typically dictate the number of layers and neurons within those layers that you’ll need.
In linear and logistic regression, we have a single linear equation. This simplifies the process of calculating
derivatives to understand how changes in weights and biases influence the loss function. By contrast, neural
networks involve a computational graph that interlinks multiple linear equations.

The weight w1 in the above image is an input to a ReLU (Rectified Linear Unit) function, and the output from
this ReLU function is then fed as an input to the neuron in the output layer. How do we adjust the indirectly
connected weight w1 so that it decreases the loss function?
This is where backpropagation comes into play. I had a ‘Eureka!’ moment when I understood how
backpropagation works. I’ll explore this topic in my next post.
Do yourself a favor by watching these lectures on Neural Networks and Backpropagation: Neural net
foundations, From-scratch model, and The spelled-out intro to neural networks and backpropagation.
November 4, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

Neural Networks – Part 2
Jana Vembunarayanan / November 6, 2023
This is my sixth post in a series on building a toy GPT. I recommend that you read my previous posts before
reading this one.
When using linear and logistic regression, which involve just a single linear equation, calculating derivatives is
easy since there is only one equation to work with. Those derivatives help us see how changing the weights
and biases decreases the loss function. Neural networks are different. They consist of a computational graph that links many linear equations.

The weight w1 in the above image is an input to a ReLU function, and the output from this ReLU function then
feeds as input to the neuron in the output layer. Adjusting the indirectly connected weight w1 to decrease the
loss function requires an algorithm that can handle such indirect connections that extend through multiple
layers.
Backpropagation is the powerful algorithm that makes adjusting the indirectly connected weight w1 to minimize
the loss function possible. I was terrified when I first heard about backpropagation. However, it is a simple
algorithm that combines the concepts of derivatives and the chain rule.
I find that learning fundamental ideas is easier by examining how they operate on simple examples.
Let a = 4, b = -2, c = a * b, d = 5, and e = c + d. Therefore, a = 4, b = -2, c = 4 * -2 = -8, d = 5, and e = -8 + 5 =
-3.
We can represent these variables and their relationships in a simple computational graph. In this graph, each
variable represents a node. The lines connecting them represent the operations and relationships. “grad” refers
to the gradient or derivative, which initially equals 0 for all variables.

The backpropagation algorithm calculates gradients starting from the rightmost node and moving backwards. In
this example, we start at the rightmost node e. To calculate the gradient at e, we ask: how much does the value
of e change when we tweak itself by an infinitesimally small amount? Since e is simply equal to itself, we know
the derivative of e with respect to itself is 1.
Now from node e, we will traverse back to nodes c and d, and the order does not matter. Let’s go to node d
first. We ask the same question. How much does the value of e change, and is the change positive or negative,
when we tweak d by an infinitesimally small amount? You can see the derivative calculations for all three
nodes: e, d, and c.

The derivative of node e with respect to node e, d, and c is 1. Now from node c, we will traverse back to nodes
a and b, and the order does not matter. Let’s go to node b first. We ask the same question again. How much
does the value of node e change, and is the change positive or negative, when we tweak node b by an
infinitesimally small amount?
Calculating the derivative for node e with respect to node b is not straightforward, as node b is connected to
node e indirectly through node c. We know how to compute the derivative of node c with respect to node b. We
already calculated that the derivative of node e with respect to node c is 1.

To compute the derivative of node e with respect to node b, we need to multiply the derivative of node e with
respect to node c (which we determined is 1) by the derivative of node c with respect to node b. This
multiplication of derivatives to find the derivative between indirectly connected variables is called the Chain
Rule in Calculus.
It took me a while to understand why we need to multiply the two derivatives when calculating indirect
connections. The car analogy from George F. Simmons helped me comprehend this relationship better:
If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car
travels 2 × 4 = 8 times as fast as the man.
In this analogy, the car is like node e, the bicycle is like node c, and the man is like node b. Just as we need to
multiply the car-bicycle and bicycle-man speed ratios, we need to multiply the node derivative values when
nodes are indirectly connected. Derivative calculations for node a and b are shown below.
I have updated the derivative values on the computational graph below.

These multiplication and addition operations give rise to two important properties:
1. The addition operation passes gradient values through unchanged. This is why nodes c and d take the
gradient value of 1 from node e because addition simply propagates the gradient through.
2. The multiplication operation uses the value of the sibling node as the local gradient. For example, node
b’s local gradient is 4, taking the value from its sibling node a. And node a’s local gradient is -2, taking
the value from its sibling b.
Please re-read the car analogy and derivative calculations for nodes a and b one more time. Make sure you
thoroughly understand these examples before reading further.
Let's update the values of the value nodes (a, b, and d) by adding the negative of each node's gradient multiplied by the learning rate of 0.1. For example:
a.value += -a.grad * 0.1 => 4 += -(-2 * 0.1) = 4 + 0.2 = 4.2
b.value += -b.grad * 0.1 => -2 += -(4 * 0.1) = -2 - 0.4 = -2.4
d.value += -d.grad * 0.1 => 5 += -(1 * 0.1) = 5 - 0.1 = 4.9
Next, we recompute c = a * b and e = c + d to propagate the changes through the graph.
Then, we reset the gradients of all nodes to zero before the next pass through the training loop.

After this update to the node values, the resulting computational graph looks like the one shown below:
Why did the value of node a increase from 4 to 4.2? This is because the gradient of node a is -2. What does
this negative gradient tell us? It tells us that increasing the value of a will decrease the value of e. This is why
we multiplied the gradient by -1 when updating the value of node a.
Substituting the values into a.value += -a.grad * 0.1, we get 4.0 += -(-2 * 0.1), which equals 4.2.
A similar reasoning can be applied to understand why the values of nodes b and d decreased. Ultimately, it is
evident that the value of node e decreased from -3 to -5.2. Through backpropagation, we successfully reduced
the value of node e.
There are four key steps involved in each training loop for a simple 5-node network:
1. Forward Pass: Runs the computations from left to right, setting the values for all nodes in the network.
2. Backward Pass: Computes the gradients from the rightmost node back through the network in reverse
order. This calculates how much each node impacts the final output.
3. Update Weights: Updates the values of each node by adding the result of: -1 * gradient * learning_rate.
This moves the node values in the direction that reduces the loss.
4. Reset Gradients: Resets the gradients of all nodes back to zero before the next training loop. This clears
the gradients so they can be recalculated for the next loop.
The training loop repeats these four steps, optimizing the node weights over many iterations to minimize the
loss and improve the model accuracy. The beauty is that these steps are effective regardless of the number of
nodes involved, be it 5, 5,000, or 500 million. This fundamental algorithm is what drives the functionality of all
neural networks.
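To make the four steps concrete, here is a minimal sketch of the loop for our five-node graph (e = a * b + d) in plain Python, with the gradients derived by hand rather than by an autograd library. The learning rate of 0.1 matches the update we did above.

a, b, d = 4.0, -2.0, 5.0            # leaf value nodes
lr = 0.1                            # learning rate

for step in range(3):
    # 1. Forward pass: compute every node from left to right.
    c = a * b
    e = c + d
    # 2. Backward pass: chain rule from e back to the leaves.
    e_grad = 1.0
    c_grad = 1.0 * e_grad           # addition passes the gradient through unchanged
    d_grad = 1.0 * e_grad
    a_grad = b * c_grad             # multiplication uses the sibling node's value
    b_grad = a * c_grad
    # 3. Update the value nodes in the direction that reduces e.
    a += -lr * a_grad
    b += -lr * b_grad
    d += -lr * d_grad
    # 4. Reset gradients before the next pass.
    a_grad = b_grad = c_grad = d_grad = e_grad = 0.0
    print(step, round(e, 2))        # e keeps dropping: -3.0, -5.18, ...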

Let’s add a Rectified Linear Unit (ReLU) node to the output of node c. The ReLU function simply outputs zero
for any negative input and keeps the input unchanged if it is positive, a straightforward rule expressed as
max(0, input).
Since node c has a value of -8, the ReLU function returns 0 as its output. Therefore, node e gets a value of 5
(as it is the sum of nodes d and ReLU, which have values of 5 and 0 respectively). During backpropagation, we
have to pass through the ReLU activation function.
When we visit node c during backpropagation, we ask about the derivative of the ReLU function with respect to
node c. What is the derivative of max(0, input)? When the input is zero or less the derivative is 0, otherwise it’s
1. Read this stackoverflow thread to understand how to calculate the derivative for ReLU function.
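In code, ReLU and its derivative are a couple of one-liners (a small sketch, not tied to any library):

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # derivative of max(0, x): 0 when the input is zero or less, 1 otherwise
    return 0.0 if x <= 0 else 1.0

print(relu(-8), relu_grad(-8))   # 0.0 0.0 -> no gradient flows back through node c
print(relu(3), relu_grad(3))     # 3 1.0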

After running backpropagation, updating the weights, and resetting the gradients, our computational graph
appears as shown below.
Notice that the values for nodes a, b, and c remain unchanged. This occurs because the gradient of the ReLU
activation with respect to node c is zero, so there’s no gradient to propagate back through the network. As a
result, no learning occurs for these nodes. In contrast, node d shows learning; its value decreases from 5 to
4.9, which in turn affects the value of node e.
If you have understood everything up to this point, then you’ve grasped the crux of what’s happening inside a
neural network. Everything else beyond that is simply optimization.
In my next post, I’ll build a simple neural network with PyTorch that can figure out handwritten numbers using
the MNIST dataset. Basically, MNIST is just a bunch of pictures of scribbled numbers that people use to teach
computers how to recognize digits.
I am immensely thankful to Jeremy Howard, Andrej Karpathy, and Josh Starmer for the invaluable knowledge
I’ve gained from their work: Neural net foundations, From-scratch model, The spelled-out intro to neural
networks and backpropagation, and The StatQuest Guide to Machine Learning.

Neural Networks – Part 3
Jana Vembunarayanan / November 12, 2023
This is the seventh post in my series on making a toy GPT. For better understanding, I recommend reading my
earlier posts first.
The MNIST dataset is the “hello world” of machine learning, containing images of handwritten digits that are
used to train machine learning models. It includes 60,000 training images and 10,000 test images of
handwritten digits for the numbers 0 through 9. Its practical use includes helping post offices automatically
recognize handwritten zip codes.

Machines don’t perceive images the way you and I do. Instead, they interpret and perform computations on
numerical data. Take the MNIST digits, for instance. Each digit consists of a 28×28 grid of grayscale pixels,
totaling 784 grayscale pixels. Each pixel is encoded in a single byte.
In many image processing contexts, pixel values range from 0 to 255, where 0 usually represents black and
255 represents white. However, in the MNIST dataset, this convention is reversed: 0 indicates “perfect
background white,” and 255 indicates “perfect foreground black.” In the illustration below, I’ve mapped out the
pixel values alongside their corresponding grayscale shades on the same grid.

We have already learned from my earlier post that a neural network is loosely modeled on how the human brain works. Depending on the nature of the problem, we’re free to choose the configuration of a neural network. This includes the number of hidden layers and the number of neurons in each hidden layer.
Our neural network has a fully connected architecture with two hidden layers. The first hidden layer contains
128 neurons, while the second hidden layer contains 64 neurons. The input layer receives 784 pixel values for
each digit image.
These pixel values are fed as inputs to every neuron in the first hidden layer. Output from the neurons in the
hidden layers passes through the ReLU activation function to introduce non-linearity. The output layer consists
of 10 neurons, each predicting one of the 10 digit classes (0 to 9). The predicted digit class therefore
corresponds to the neuron with the highest output value.
Can you guess how many parameters (weights and biases) this neural network has? This simple network has
109,386 parameters. Consider how complex the network must be for large language models (LLMs), which
have parameters in the billions and some approaching one trillion.

The PyTorch framework enables creating this neural network with 15 lines of code, showing how easy PyTorch
makes building neural networks.
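That code appears as an image in the post; below is a minimal sketch consistent with the architecture described above (784 -> 128 -> 64 -> 10 with ReLU activations). The post's exact code may differ in details.

import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),           # 28x28 image -> 784 pixel values
    nn.Linear(784, 128),    # first hidden layer
    nn.ReLU(),
    nn.Linear(128, 64),     # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),      # one output neuron per digit class
)

# 784*128 + 128 + 128*64 + 64 + 64*10 + 10 = 109,386 parameters
print(sum(p.numel() for p in model.parameters()))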
The output layer has 10 neurons. But how does the network determine which particular neuron to select?
Additionally, we need to understand how the network’s loss is calculated. Each neuron produces a value
ranging from negative to positive. We need to transform these values into probabilities. The Softmax function is
useful for this. It converts the values into probabilities.
The first digit in the test set is 7. The table below shows how the raw output from the network is converted into
probabilities using softmax for this single example. It also demonstrates how the cross-entropy loss is
computed. If you understand how this works for one example, you know how it generalizes to calculating the
loss over multiple examples.
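The table itself is an image in the original post, but the calculation is easy to reproduce. Here is a sketch with hypothetical raw outputs (logits) for an image of a 7, where the neuron for digit 7 produces a much larger value than the rest.

import torch
import torch.nn.functional as F

logits = torch.tensor([[-3.1, -1.2, 0.4, 1.0, -2.5, -0.7, -4.0, 9.8, 0.1, 1.3]])
target = torch.tensor([7])                # the true digit

probs = F.softmax(logits, dim=1)          # raw outputs -> probabilities that sum to 1
loss = F.cross_entropy(logits, target)    # log-softmax + negative log likelihood

print(probs[0, 7].item())                 # close to 1
print(loss.item())                        # close to 0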

The neuron responsible for identifying digit 7 produced a probability close to 1, while the other neurons produced probabilities close to 0. Since the actual value (7) matches the predicted value (the neuron with the highest probability), the loss of the network for this simple example is almost zero. We went through this concept in my logistic regression post.
Jeremy Howard does an excellent job of explaining how softmax and cross-entropy loss work. I highly
recommend watching his explanation, as these concepts will come up repeatedly when training neural
networks.
We can train this neural network in less than 30 lines of PyTorch code. All neural network training involves the
same four steps:
1. Forward Pass runs computations from left to right to process the input, perform the computations, and produce the output.
2. Backward Pass starts from the network loss and computes gradients as it traverses backwards through the network until gradients have been calculated for all neurons.
3. Update each neuron's weights and biases by adding the result of: -1 * gradient * learning_rate. The key idea is to update the coefficients so that the loss function decreases.
4. Reset all node gradients to zero before the next loop. This clears them to be recalculated for the next iteration.
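Here is a hedged sketch of those four steps in PyTorch, assuming the model defined in the earlier sketch and a DataLoader named train_loader over the normalized MNIST training images (both names are assumptions, not the post's actual code).

import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    for images, labels in train_loader:
        logits = model(images)                 # 1. forward pass
        loss = F.cross_entropy(logits, labels)
        loss.backward()                        # 2. backward pass: compute gradients
        optimizer.step()                       # 3. update weights: w += -lr * grad
        optimizer.zero_grad()                  # 4. reset gradients for the next batch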

Dividing image pixel values by 255.0 is a common normalization technique used in image processing,
especially when training neural networks. This scaling of the pixel values to be between 0 and 1 is beneficial
because neural networks tend to perform better when the input data is normalized to a smaller range. 
It’s magical to witness 109,386 coefficients dancing to the tunes of backpropagation, guided by the loss
function, adjusting themselves to minimize the overall loss from the network. The thought of this marvel gives
me goosebumps. I’ve created a small animation that illustrates how the 640 weights in the output layer change
over 10 epochs. Isn’t it magical?

When I made the network predict 10,000 test images, it achieved an accuracy of 97.51%. This is pretty good
for less than 50 lines of code. The hard work was done by those who patiently wrote and labeled the
handwritten digits. In just 3 lines of code, you can ask the network to make predictions for new inputs.
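For reference, those prediction lines might look roughly like this, assuming the trained model from the sketch above and a tensor of test images named test_images (a hypothetical name):

import torch

with torch.no_grad():                     # no gradients needed for inference
    logits = model(test_images)           # test_images: a (N, 1, 28, 28) tensor
    predictions = logits.argmax(dim=1)    # the neuron with the highest output wins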
We have come a long way in 7 posts. We started by building a DNA generator using frequency counting of A, T,
C, and G letters. Then we looked at the basics of derivatives and used that to predict weight given height. After
that, we expanded our knowledge to predict weight using multiple input variables.
We moved from regression to classification and saw how the sigmoid function and log loss help classify heart
disease. We then saw the limitation of a single linear classifier for solving the XOR problem, and how to
combine multiple linear classifiers with the ReLU function to fit nonlinear outputs.
Just as Gutenberg’s printing press revolutionized the spread of knowledge, the ingenious backpropagation
algorithm sparked an AI boom, enabling machines to generate images, drive cars, and much more. We learned
how the backpropagation algorithm works under the hood.
In this post, we created a neural network with 109,386 coefficients to predict handwritten digits. At this point, we
have learned the ABCs of how machines learn, from simple linear classifiers all the way to complex neural
networks and everything that goes on under the hood.
With this strong foundation, we can now look at another key pillar called Embeddings that enable large
language models to perform their magic. Embeddings are a way of representing words and sentences as
mathematical vectors or arrays of numbers. This allows computers to understand and process natural language
more easily.  It will be the topic of my next post.
November 12, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

Embeddings – Part 1
Jana Vembunarayanan / November 23, 2023
This is the 8th post in my series on building a toy GPT. For better understanding, I recommend reading my
earlier posts first.
I love playing and watching cricket. The dominance India showed in the recently concluded World Cup is
astounding. I have never seen anything like it in the four decades I’ve been following cricket. It’s disappointing
to see us lose in the finals against Australia. The better team on that day won.
One of the best ways to grasp a new concept (embedding) is by connecting it with something you’re passionate
about (cricket). I pulled up the batting averages (the number of runs a player scores on average per innings)
and strike rates (the average number of runs scored per 100 balls faced) for everyone who played in the recent
World Cup final.
Player Country Average Strike rate
Rohit Sharma India 49.1 92.0
Shubman Gill India 61.4 103.5
Virat Kohli India 58.7 93.6
Shreyas Iyer India 49.6 101.0
KL Rahul India 50.8 88.1
Ravindra Jadeja India 32.4 85.1
Suryakumar Yadav India 25.8 105.0
Mohammed Shami India 7.8 83.0
Jasprit Bumrah India 7.6 57.2
Kuldeep Yadav India 10.5 56.8
Mohammed Siraj India 7.7 46.0
David Warner Australia 45.3 97.3
Travis Head Australia 42.0 102.6
Mitchell Marsh Australia 36.1 96.2
Steve Smith Australia 43.5 87.2
Marnus Labuschagne Australia 37.9 83.1
Glenn Maxwell Australia 35.4 126.9
Josh Inglis Australia 18.9 94.1
Mitchell Starc Australia 12.4 79.4
Pat Cummins Australia 13.7 75.1
Adam Zampa Australia 9.5 65.7
Josh Hazlewood Australia 17.3 91.0
If I asked you to find players similar to Virat Kohli, how would you go about it? One tedious method is to
compare Virat’s average and strike rate with every other player, then pick the one who’s closest to him. Luckily,
we’ve found a more straightforward solution to this problem: use a scatter plot.
We can examine the players around Virat and deduce that Shubman Gill, KL Rahul, Rohit Sharma, and
Shreyas Iyer are similar to him. Even with this straightforward scatter plot, it’s challenging to determine who is
the most similar, who comes second, and so on. Now, imagine a scatter plot featuring hundreds of players,
comparing more than just two features. How complex would that be?
An elegant method to represent each player’s statistics is by modeling them as vectors. Think of a vector as a
list of numbers, where each number describes a specific feature.

For instance, Virat Kohli can be represented by the vector (58.7, 93.6), where the first number represents his
average and the second his strike rate. In a similar fashion, Rohit Sharma’s vector is (49.1, 92.0), while
Shubman Gill’s vector is (61.4, 103.5).
We can perform interesting mathematical operations on vectors that help answer similarity questions in an
elegant way. More importantly, the solution scales to thousands of players, with each player having hundreds of
features. The illustration below demonstrates how to calculate the distance between two vectors.
This concept can be extended to more than two dimensions, as the underlying mathematics remains the same.
Read this excellent article to understand how to compute the distance between vectors in three or more
dimensions.
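The same calculation takes a few lines of Python, using the (average, strike rate) vectors from the table; the helper works for vectors of any length:

import math

def distance(p, q):
    # Euclidean distance between two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

kohli = (58.7, 93.6)
rahul = (50.8, 88.1)
print(round(distance(kohli, rahul), 2))   # 9.63, matching the table below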

By applying the distance formula to compare Virat Kohli with the other 21 players and then sorting their
distance in ascending order, we find that KL Rahul is the most similar to Virat, followed by Rohit Sharma,
Shubman Gill and Shreyas Iyer.
Player Country Average Strike rate Distance from Kohli
Virat Kohli India 58.7 93.6 0.00

KL Rahul India 50.8 88.1 9.63
Rohit Sharma India 49.1 92.0 9.73
Shubman Gill India 61.4 103.5 10.26
Shreyas Iyer India 49.6 101.0 11.73
David Warner Australia 45.3 97.3 13.90
Steve Smith Australia 43.5 87.2 16.49
Travis Head Australia 42.0 102.6 18.97
Mitchell Marsh Australia 36.1 96.2 22.75
Marnus Labuschagne Australia 37.9 83.1 23.30
Ravindra Jadeja India 32.4 85.1 27.64
Suryakumar Yadav India 25.8 105.0 34.82
Josh Inglis Australia 18.9 94.1 39.80
Glenn Maxwell Australia 35.4 126.9 40.64
Josh Hazlewood Australia 17.3 91.0 41.48
Mitchell Starc Australia 12.4 79.4 48.43
Pat Cummins Australia 13.7 75.1 48.65
Mohammed Shami India 7.8 83.0 51.99
Adam Zampa Australia 9.5 65.7 56.56
Kuldeep Yadav India 10.5 56.8 60.64

Jasprit Bumrah India 7.6 57.2 62.74
Mohammed Siraj India 7.7 46.0 69.76
You can ask cool questions like finding players who are a mix of Glenn Maxwell and KL Rahul. To do this, we’ll
average Maxwell’s vector (35.4, 126.9) and Rahul’s vector (50.8, 88.1). Then we’ll look for players who are
closest to the average vector (43.1, 107.5). Travis Head, Shreyas Iyer, David Warner, and Mitchell Marsh are the four players closest to the average vector.
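A quick sketch of that blend, reusing the distance helper from the earlier snippet (only four players are included here for brevity):

maxwell = (35.4, 126.9)
rahul = (50.8, 88.1)
blend = tuple((m + r) / 2 for m, r in zip(maxwell, rahul))
print([round(v, 1) for v in blend])       # [43.1, 107.5]

players = {"Travis Head": (42.0, 102.6), "Shreyas Iyer": (49.6, 101.0),
           "David Warner": (45.3, 97.3), "Mitchell Marsh": (36.1, 96.2)}
for name, vec in sorted(players.items(), key=lambda kv: distance(blend, kv[1])):
    print(name, round(distance(blend, vec), 2))   # Head, Iyer, Warner, Marsh in that order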
Similar to representing cricketers with vectors, we can also represent words using vectors.
The sentence “In the kingdom, the King spoke with wisdom and authority, showing he was in charge. His crown
meant he had a high social status. He was a leader for everyone, both males and females” talks about the
attributes and the abilities of the king.
From this sentence, we can identify five distinct features for the ‘king’ vector: authority, male, wisdom, social
status, and leadership, similar to how cricketers are evaluated by their average and strike rate. While cricketers have vectors defined by two features, here we describe ‘king’ with a five-feature vector.
The word “queen” is the counterpart of “king”: the queen is a woman and the king is a man. We therefore have four words, each described by a five-feature vector. Their values, ranging between 0 and 1, are shown in the table below. I made up these numbers to illustrate the concept.
King Man Woman Queen King – Man + Woman
Authority 1 0.2 0.2 1 1
Male 1 1 0 0 0
Wisdom 0.8 0.1 0.1 0.8 0.8
Social Status 1 0.3 0.3 1 1
Leadership 1 0.4 0.4 1 1
Word embeddings transform words into vectors, allowing us to perform mathematical operations on language
itself. This is a powerful technique because it captures not just the word, but its deeper meaning and how it
relates to other words. For example, if you take the vector for ‘king’, subtract the vector for ‘man’, and add the
vector for ‘woman’, you get a vector very close to the one for ‘queen’.
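With the made-up numbers from the table, the arithmetic works out exactly (real embeddings only get close, not equal):

king  = [1.0, 1.0, 0.8, 1.0, 1.0]   # [authority, male, wisdom, social status, leadership]
man   = [0.2, 1.0, 0.1, 0.3, 0.4]
woman = [0.2, 0.0, 0.1, 0.3, 0.4]
queen = [1.0, 0.0, 0.8, 1.0, 1.0]

result = [k - m + w for k, m, w in zip(king, man, woman)]
print([round(v, 1) for v in result])   # [1.0, 0.0, 0.8, 1.0, 1.0] -- the queen column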
Neural networks process a vast collection of words, generating word embeddings as their output. This process
results in embeddings where words with similar meanings have vectors that closely match. Take the following
four sentences as examples. The neural network can identify that word pairs like king-queen, brave-
courageous, and wise-intelligent are related.
1. The king was brave.
2. The queen was courageous.
3. The king was wise.
4. The queen was intelligent.
In contrast to the “King” example I made up where I identified five specific features, the specific meaning of
each feature produced by a neural network remains unknown. What we understand is that these are fixed-length arrays of numbers. Therefore, when we compare vectors of different words, it’s an apples-to-apples comparison.
The number of features or dimensions in a model produced by a neural network varies depending on the
specific model. For example, Bengio’s model in 2003 has 30 dimensions. Some of the Word2vec models that
came out in 2013 have 300 dimensions. OpenAI’s latest Ada model has 1,536 dimensions.
Playing with a word embedding browser is a magical experience. Do check it out here. Understanding Word
Vectors by Allison Parrish is one of the best resources I’ve found for learning about word embeddings. I highly
recommend reading it.

Is there a specific format for inputting words into a neural network to generate an embedding? What serves as
the ground truth against which the neural network’s output is compared? Must a special structure be designed
for the embedding so that the neural network can populate it effectively? I will explore these questions in my
next post.
November 23, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

Embeddings – Part 2
Jana Vembunarayanan / December 2, 2023
This is the 9th post in my series on building a toy GPT. For better understanding, I recommend reading my
earlier posts first.
Word embeddings convert words into fixed-length numerical arrays. Each number in these arrays corresponds
to a specific characteristic of the word, such as its association with a place, person, gender, or concept.
What does it take to build a simple embedding with two features?
I’m going to use a list of 1,165 Indian names to make a name generator. We’ll use a simple neural network for
this. The main goal isn’t to make an amazing name generator, but to get a feel for how the 2D embedding gets
filled up and to understand the steps in building that embedding.
This post is entirely based on my learnings from Andrej Karpathy’s Makemore lecture.
It’s always a good habit to examine a few rows from your input dataset. Here are the first 10 names.

We need to answer three important questions before designing the neural network.
What is the vocabulary size? Since this is a character-level language model, our vocabulary will consist of
lowercase a-z characters (we will normalize all names to lowercase) plus a dot character to indicate the start
and end of names. Therefore, the vocabulary size is 27.

What is the context length? The context length refers to the number of character inputs we will use to predict
the next character. Since we are using the previous three characters as input to predict the output, our context
length is three. For example, the input and prediction for “Abhishek” is shown below. 
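That figure is an image in the original post, but the idea is easy to sketch in Python: pad the name with dots and slide a three-character window across it.

context_length = 3

def build_examples(name):
    examples = []
    context = ["."] * context_length          # dots mark the start of the name
    for ch in name.lower() + ".":             # a trailing dot marks the end
        examples.append(("".join(context), ch))
        context = context[1:] + [ch]
    return examples

for x, y in build_examples("Abhishek"):
    print(x, "->", y)
# ... -> a, ..a -> b, .ab -> h, abh -> i, ..., hek -> .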
What is the embedding size? Word embeddings convert words into fixed-length numerical vectors. The
embedding size refers to the dimensions of these vectors for each word in our vocabulary. We will use an
embedding size of 2 for simplicity, although much higher dimensions are commonly used. The table below
shows how each character in the vocabulary points to a randomly initialized embedding vector with 2 numbers.
Embedding
Vocabulary Index Dimension 1 Dimension 2
. 0 -0.3645 1.2162
a 1 -1.9144 1.6279
b 2 0.7181 1.0442
c 3 -1.1088 -0.8759
d 4 -1.3463 1.9421

e 5 0.263 0.502
f 6 -0.4798 1.3003
g 7 -0.2858 -1.4829
h 8 -0.4605 -0.14
i 9 0.2299 0.942
j 10 0.2699 -0.1813
k 11 -0.433 -0.0609
l 12 -1.558 -2.6765
m 13 -0.274 0.1358
n 14 -0.2624 -0.1345
o 15 0.3338 0.6387
p 16 -0.9437 -2.8225
q 17 1.9061 0.3633
r 18 -0.1051 -0.1207
s 19 0.675 -1.836
t 20 1.7217 -2.8221
u 21 -0.0707 -0.5534
v 22 0.1626 -0.6026
w 23 -0.7808 -0.2784

x 24 -0.1615 -0.2923
y 25 -1.1318 0.3967
z 26 -1.284 1.198
The neural network will have three layers. The input layer takes in 6 inputs, which is derived by multiplying the
context length of 3 by the embedding dimension of 2. These 6 inputs are passed into the hidden layer with 100
neurons. The output layer contains 27 neurons, one for each character in our vocabulary. The image below
clearly summarizes the network setup.
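In PyTorch, a minimal sketch of that setup might look like the following. The tanh activation is an assumption borrowed from Karpathy's makemore lecture; the post does not spell out these details.

import torch
import torch.nn as nn

vocab_size, emb_dim, context_len, hidden = 27, 2, 3, 100

emb = nn.Embedding(vocab_size, emb_dim)              # the 27 x 2 embedding table
layer1 = nn.Linear(context_len * emb_dim, hidden)    # 6 inputs -> 100 neurons
layer2 = nn.Linear(hidden, vocab_size)               # 100 -> 27 output neurons

def forward(idx):                                    # idx: (batch, 3) character indices
    x = emb(idx).view(idx.shape[0], -1)              # look up and concatenate -> (batch, 6)
    h = torch.tanh(layer1(x))                        # hidden layer
    return layer2(h)                                 # logits over the 27 characters

print(forward(torch.tensor([[0, 1, 2]])).shape)      # torch.Size([1, 27])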
The embedding table starts with random values before training the network. Given that the embedding
dimension is two, we can easily visualize them on a scatter plot.

Initially, all the characters are randomly scattered without any clear pattern. After the network is trained, we can
see that the vowels are grouped together, meaning their vector dimensions are closer. The dot character
indicates the starting and ending of names, so it obtained a unique position. There is something unique about
‘w’, ‘x’, and ‘q’ that makes them stand alone.

I was in awe when I saw the neural network learn the pattern in the names and capture that pattern as embeddings. How is that even possible? The reason is that we humans give meaningful names to people. They’re not random. Guided by the loss function and backpropagation, the neural network learns this pattern that we humans have created. These patterns, regardless of their complexity, are effectively captured and represented as vector embeddings.
All the machine learning models we have seen so far only update weights and biases. They treat the inputs as
constants that don’t change. Embeddings are handled differently. Even though we pass them as inputs to the first layer, the embedding values themselves get updated during training. This unique aspect of embeddings getting updated during training was initially confusing to me, so I wanted to highlight it.
The names generated by this simple neural network are shown below. Some of the names like Sidesh, Jeet,
and Sushant are proper Indian names. You can find the code that Karpathy wrote to generate names here. All I
did was swap American names with Indian names.
Let’s switch gears now by moving from learning embeddings from individual characters to learning embeddings
on words. In 2013, Tomas Mikolov and his colleagues at Google found a way to embed the meaning of words into vector space. They invented the Word2Vec language model. As the name implies, it converts words into vectors.
The Word2Vec language model learns the meaning of words by processing a large corpus of unlabeled text.
None of the words in the Word2Vec vocabulary need to be labeled manually. For example, no one needs to
specify that “Virat Kohli” refers to a cricketer, “Chennai” refers to a city, or that “India” refers to a country.
The model is able to make these connections on its own, as long as the corpus it’s trained on is large and
diverse enough that related words tend to be mentioned together — “Virat Kohli” near words associated with
cricket, “Chennai” near words associated with cities, and “India” near words referring to countries.
Word2Vec is a family of models, with Continuous Bag of Words (CBOW) and Skip Gram being specific models.
Let’s look at the neural network setup for CBOW. I’m going to use a simple corpus that has 9 lines of text with 100 words in total.
After removing the punctuation, names, and stop words (like ‘the’, ‘and’, etc.), the corpus looks as shown
below. The ordering of words from the original corpus is still maintained. Let’s call this the processed text.
We need to answer three important questions before designing the neural network — the same questions we
asked in the Indian names example. What is the vocabulary size? There are 52 words in the vocabulary.

What is the embedding size? Word2Vec models usually have an embedding size of 300 dimensions. I used 30
for the number of embedding dimensions in this example. Imagine that each word in the vocabulary has a
vector represented by 30 dimensions. I don’t want to paste 1,560 (52 x 30) numbers in this post.
What is the context length? Context length in CBOW requires a bit of explanation. Consider the phrase “wise
old king ruled kingdom” that is part of the processed text. Focus on the highlighted word “king”. I am using two
as the context length here. This means we will pull 2 words to the left of “king” and 2 words to the right of “king”.
These four words [‘wise’, ‘old’, ‘ruled’, ‘kingdom’] will be the input to the CBOW model. During training, the
model will receive these 4 context words as input, and “king” will be passed as the target word to predict based
on the context. So the context length defines how many words before and after the target word to use as input
context. Shown below are 5 of the 57 context-target pairs.
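The table of pairs itself is an image in the original post; generating such pairs is a one-loop sketch in Python (the tokens list here is just a slice of the processed text):

window = 2                                            # 2 words on each side of the target

tokens = "wise old king ruled kingdom".split()

pairs = []
for i, target in enumerate(tokens):
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

print(pairs[2])   # (['wise', 'old', 'ruled', 'kingdom'], 'king')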

Why do we need to look at N words to the left and right of the target word? The idea is that words with similar
meanings tend to appear in similar contexts. Just as neurons that fire together wire together, words that appear
in similar contexts tend to have similar meanings. This is how we humans think and capture thoughts in writing.
By looking at the surrounding context, the neural network can exploit this structure and derive meaning from it
using math. The number of context words (N) defines how wide of a context window it has to learn these
associations between context and meaning. So by using N words to the left and right, it is trying to learn from
broader contexts, not just the immediately adjacent words.
The neural network will have three layers. The input layer takes in 30 inputs. These 30 inputs are computed by
taking the 4 context words around the target word, getting each of their respective 30-dimensional vector
embedding, and then averaging them. Taking the average of vector embeddings for the four context words
(wise, old, ruled, kingdom) is similar to finding the mix of Glenn Maxwell and KL Rahul.

This average embedding with 30 dimensions is passed into a hidden layer with 128 neurons. The output layer
contains 52 neurons — one for each word in our vocabulary. This output layer tries to predict the actual target
word (“king”) based on the context input. The image above clearly summarizes the network setup with the
context words going to an averaged input embedding, then passed through two dense layers to predict the
target word. The network setup code in PyTorch is given below.
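That code appears as an image in the original post; the sketch below is my own reconstruction consistent with the description (30-dimensional embeddings, averaged context, 128 hidden neurons, 52 outputs). The choice of ReLU here is an assumption.

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size=52, emb_dim=30, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.fc1 = nn.Linear(emb_dim, hidden)
        self.fc2 = nn.Linear(hidden, vocab_size)

    def forward(self, context_idx):                   # (batch, 4) indices of context words
        avg = self.emb(context_idx).mean(dim=1)       # average the four 30-dim embeddings
        h = torch.relu(self.fc1(avg))
        return self.fc2(h)                            # logits over the 52-word vocabulary

model = CBOW()
print(model(torch.tensor([[3, 7, 11, 19]])).shape)    # torch.Size([1, 52])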
After training this network, I looked at the cosine similarity for some of the words. The ‘village – majestic’ and ‘throne – cheered’ word pairs have a cosine similarity above 0.5, which indicates that these pairs of words occurred in similar contexts in the training corpus.

This is a toy example to explain how CBOW works. Since this is a very small dataset, we can’t interpret too
much from this specific cosine similarity table. It simply demonstrates that CBOW is able to capture some
semantic similarity between words that tend to appear in similar contexts.
Creativity often comes from thinking in analogies. It’s like linking different ideas in cool ways to come up with
something totally new. Can machines do that? Word embeddings are super powerful. I was amazed to see
what I got for my question “Sachin Tendulkar is to Cricket as X is to Soccer.” 
To find out X, I downloaded GoogleNewsVectorsNegative300, which is a word vector with 300 dimensions for
each word generated from a large collection of news articles from Google News. I was stunned to see the top two responses: Lionel Messi and Zinedine Zidane.
Through simple vector arithmetic: vector[‘Sachin Tendulkar’] + vector[‘soccer’] – vector[‘cricket’], we can identify
words that closely align with the resulting vector. These embeddings elegantly encapsulate reality using 300
numbers for each word. Today, there are even more powerful embeddings, such as BERT with 768 dimensions
and Ada with 1,536 dimensions, which really grasp the meanings of all the words humans have made up.
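One way to run that query yourself is with the gensim library, assuming you have downloaded the GoogleNews vectors file locally and that the multi-word name is stored with an underscore in that vocabulary:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vector['Sachin_Tendulkar'] + vector['soccer'] - vector['cricket']
print(wv.most_similar(positive=["Sachin_Tendulkar", "soccer"],
                      negative=["cricket"], topn=2))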
Observing the way neural networks fill up the embedding table takes me back to the “wax on, wax off” scene
from The Karate Kid. In the movie, Daniel gets annoyed with his teacher, Mr. Miyagi, because he’s just asked to
clean cars and sand floors. Then, Miyagi shows Daniel that these chores are more than what they seem.
They’re actually making him stronger and teaching him the basics of self-defense.
In a similar way, with the CBOW word embedding model, the goal isn’t just predicting words based on their
context. While the neural network works on this prediction task, it’s also fine-tuning the numbers that represent
words (that is, filling up the embedding table), capturing the meanings and connections of words as it learns
from the text data.

It took me nine posts to cover all the foundational elements required to build a toy GPT. The next post will focus
on the transformer architecture, which will enable our toy GPT to generate text.
December 2, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

Attention is all you need – Part 1
Jana Vembunarayanan / December 23, 2023
This is the 10th post in my series on building a toy GPT. Read my earlier posts first for better understanding.
I asked ChatGPT to complete the sentence given the phrase: “I chose that bank for”. It completed the
sentences sensibly. Here are the four sentences it generated:
1. I chose that bank for its reliability.
2. I chose that bank for better rates.
3. I chose that bank for convenient locations.
4. I chose that bank for customer service.
In order to generate the right words that come after the phrase “I chose that bank for”, ChatGPT needed to
understand the meaning of all five words, their importance, and the relationships between them. In this case,
the word “bank” was crucial for ChatGPT to guess the next two words that would properly complete the
sentence.

How does ChatGPT actually understand the relationships and importance of these words? The key to its ability
is a feature called Attention, which was described in Attention is All You Need, a revolutionary paper published
in 2017.
It is much easier to understand Attention through a simple example from the world of cricket. Below are the ODI
batting averages for 5 cricketers. If I want to know Virat Kohli’s batting average, all I need to do is take a quick
look to find out that it is 58.7.
This simple lookup process relies on 3 key concepts that we often use without thinking much about: queries,
keys, and values. When I queried for ‘Virat Kohli’, I found the row with the key ‘Virat Kohli’ that matched my
query. The returned value for the matching row is his batting average of 58.7.
Rather than finding the batting average of one particular player, what if we want to find the batting average of a
player similar to Virat Kohli? A simple key-based lookup will not work in this case. I would have to go through
each player and mentally assign a similarity weight between 0 and 1 based on how comparable they are to
Virat Kohli.
For example, I might take 50% of Kohli’s own average, 20% of Sachin Tendulkar’s average, and so on, to
calculate a weighted average. This is a crude process, but I would be relying on my knowledge gained from
following cricket for four decades in order to mentally assign appropriate weights to each player.

A player similar to Virat Kohli has a weighted average of 52. One key point to note is that the weights I
manually assigned must sum to one. If someone could peek inside my brain and see how it is coming up with
the weights, they would find that my brain has constructed a feature vector for each player and is using that to
assign the weights.
As a fun exercise, I made up 4 features for each player, on a scale of 0-1. The above feature vectors are the
keys that I’ll be searching on, using the query below, which is also a feature vector. This is similar to what we
did above.
The only change between this method and the previous method is that instead of searching for the actual
names of the players, we’re using their feature vectors to conduct the search. The query feature vector
captures the meaning of a player similar to Virat Kohli.

A simple vector dot product between the query feature vector and each player’s feature vector would give a
similarity score. The higher the score, the more similar they are. And the lower the score, the less similar they
are. For example, the similarity score between the query and Virat Kohli is (0.7 * 0.8) + (0.7 * 0.9) + (0.7 * 0.8)
+ (0.7 * 0.9) = 2.38.
How do we convert the similarity scores into weights that sum to 1? We already learned that the Softmax
function takes a set of real numbers ranging from negative infinity to positive infinity and converts them into
probability values between 0 and 1 where all the probabilities sum to 1.
The softmax values of all players will add up to one. I normalized the dot product by dividing it by the square
root of the number of features. Since we’re using 4 features, the square root is 2. For instance, the dot product
of Virat Kohli is 2.38. When divided by 2 we get 1.19. Why are we normalizing the values?
The softmax function assigns high probabilities to large positive values. Without normalizing, the output will be
transformed into a one-hot encoded vector. Consider the table below where the Softmax value for A is close to
98% and C is close to zero, as A has a high value of 5 and C has a low value of 0.
Suppose we normalize the values by dividing them by 4 (I picked an arbitrary number for this example). Then
the weights are better distributed. Normalizing prevents us from skewing towards one key and gives
opportunities for all keys to participate.

Rather than using a hardcoded value of 4 for normalization, we take the square root of the number of features
in the vector. Using the weights from softmax, we can compute the batting average for someone who’s similar
to Virat Kohli. The player similar to Virat Kohli has a batting average of 46.8.
The three key concepts to remember from this exercise are queries, keys, and values.
The “query” is what we are searching for. We looked for someone similar to Virat Kohli. The “key” is what we
compare against. Here, we compared the feature vector of our query with the feature vectors of all the players.
The “value” is what we derive from each key. Batting averages are the values.
From this simple cricket example, we dissected the main equation from the Attention is All You Need paper.
This is key to understanding everything else. Before we continue, make sure you understand each part of the
formula by relating it to the cricket example.
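The equation in question is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a small NumPy sketch of it applied to the cricket setup: the all-0.7 query and Kohli's [0.8, 0.9, 0.8, 0.9] key come from the example above, while the other keys and the batting-average values are made up for illustration.

import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[1])            # dot products, scaled by sqrt(d_k) = 2
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax
    return weights @ V, weights

K = np.array([[0.8, 0.9, 0.8, 0.9],                   # Virat Kohli (from the example)
              [0.7, 0.8, 0.9, 0.8],                   # four other players, made-up features
              [0.3, 0.4, 0.5, 0.4],
              [0.2, 0.3, 0.2, 0.3],
              [0.1, 0.2, 0.1, 0.2]])
Q = np.array([[0.7, 0.7, 0.7, 0.7]])                  # "someone similar to Virat Kohli"
V = np.array([[58.7], [45.0], [35.0], [20.0], [10.0]])  # batting averages (illustrative values)

out, w = attention(Q, K, V)
print(w.round(2))     # weights that sum to 1, highest for the Kohli-like keys
print(out.round(1))   # the weighted batting average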
Let’s see how attention works for the phrase “I chose that bank for.”

To keep things simple, let’s assume that our model takes a maximum input of five words. This means its
context length is 5. When the context length is five, the network has 5 examples to train itself on. In this
explanation, I’m using ‘word’ to refer to ‘token’ for simplicity. While I’m assuming one word corresponds to one
token, it’s important to note this is often not the case in reality.
Embeddings are the heart of LLMs, supplying the oxygen so the network can do its job of predicting the next
token. I’ve written two posts explaining what embeddings are and how they’re constructed. In this particular
example, I created a 4-dimensional embedding for each word. To give you an idea of how small our embedding
is, GPT-3 uses an embedding size of 12,288.
As the network undergoes training, it updates the embeddings so that the loss is minimized. For simplicity, let’s
assume these embeddings encapsulate both the semantic meaning and positional information of each word.
It’s important to remember that word order plays a vital role in language; this order is represented in the
positional embeddings.
In our cricket example, queries, keys, and values are distinct elements. However, in LLMs, queries, keys, and
values start off identical, which can be confusing. At first, the same data acts as queries, keys, and values. But
as the neural network is trained, they diverge to serve their unique roles.
Queries are what each word is searching for. This is similar to the query feature vector [0.7, 0.7, 0.7, 0.7] in the
cricket example. Keys are what each word is offering to be searched on. This is similar to the feature vector of each cricket player. Values are what you will get from that word when there is a key match. This is similar to the
batting average of each player.
First, we perform a matrix multiplication between the queries and the transposed keys. What we get from this
operation is the similarity score for each word with every other word. This is the same as what we did for the
cricket example. There was a single query in the cricket example, but here there are 5 queries, as we have one
query for each word.
When we train the network, each word can look at itself and the words preceding it. For example, in the phrase
“I chose that bank for,” “I” can only look at itself. “Chose” can look at itself and “I.” “That” can look at itself,
“chose,” and “I,” etc.
We need a way to prevent the word from looking at words following it. The reason for hiding future words is for
the network to predict based on what comes before and predict what comes next. This is achieved through
masking. The table below shows how masking prevents the current word from looking at the words following it.

Softmax converts real numbers into probabilities. By adding very high negative values for words that follow
each word, softmax makes the probabilities of those future words equal to zero. For example, look at the row
with the word “chose.” The values for the words “that,” “bank,” and “for,” which follow “chose,” are zero.
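In code, the mask is applied to the score matrix before the softmax. Here is a small PyTorch sketch with illustrative scores:

import torch
import torch.nn.functional as F

T = 5                                                  # "I chose that bank for"
scores = torch.randn(T, T)                             # illustrative query-key scores

mask = torch.tril(torch.ones(T, T))                    # 1 = may look, 0 = future word
scores = scores.masked_fill(mask == 0, float("-inf"))  # very negative value for future words

weights = F.softmax(scores, dim=-1)
print(weights)        # each row sums to 1; positions above the diagonal are exactly 0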
Next, we take the result from the first step, which is the matrix multiplication between the queries and the
transposed keys, normalize it, and add the mask. We already learned in the cricket example that normalizing
prevents us from skewing towards one key and gives opportunities for all keys to participate. I use 2 as the
normalization factor, because the square root of the 4 embedding dimensions is 2.

Take the result from the previous step and apply a softmax function to it. What you get is a matrix of weights
that tells how much importance each query from each word assigns to each key from every other word.
The final step is to perform a matrix multiplication between the softmax result and the values. What you get is a
(5, 4) matrix with the same dimensions as our original input matrix. This alignment is no coincidence.
Something profound happens here. Each word’s feature vector gets adjusted to incorporate information from
the feature vector of every other word, weighted by the softmax relevance scores.

The F1 value for the word “for” is -0.66. How was this calculated? We computed it by multiplying the softmax
weights [0.28, 0.31, 0.08, 0.07, 0.25] associated with “for” for all words, including “for” itself, with the F1 feature
values [-0.8, -0.9, 0.1, 0.2, -0.7] of all words, which resulted in -0.66.
By using this straightforward method, we alter the feature vectors of each word so that they align with every
other word in an n-dimensional space. This manipulation of feature vectors sets up a way for words to
communicate with each other. We also apply masking to direct this communication towards previous words in
the sequence.
As the network undergoes training, words with stronger connections naturally begin to gravitate towards each
other, much like friends spotting and moving closer to one another in a crowd. It’s important to note that this
communication between words primarily occurs in the attention mechanism of the network. The rest of the
network’s computations are carried out independently for each word.
Using the prompt “I chose that bank for”, I executed the PicoGPT code which utilizes the GPT-2 model weights
and each word has 768 dimensions. The images below illustrate the cosine similarities between the feature
vectors of all words before and after the attention call. It is evident that attention facilitates communication
between words by updating their feature vectors.

So far, we have looked at Attention with a single head. However, we can do a much better job with multiple
heads. In our simple example, each word embedding had 4 dimensions. Suppose we create 2 heads, where
the first head will focus on dimensions F1 and F2, while the second head focuses on dimensions F3 and F4.
The image below helps visualize how attention with 2 heads works.

The first attention head might focus on capturing the syntactic structure of the sentence. For example, it might learn relationships like “I” is the subject, “chose” is the verb, and “bank” is the object. This head helps in understanding the grammatical construction of the sentence.
The second attention head could concentrate on the semantic aspects, such as the meaning and context of the words. For instance, it might recognize that “bank” in this context is likely a financial institution (and not the side of a river), based on its position in the sentence and the preceding words “I chose that.”
Having multiple attention heads is akin to having three people (a PM, a designer, and an engineer) instead of one person who gathers the requirements from the customer, does the UX design, and writes the code; together they do a much better job.
How many attention heads can we have? I don’t know what the ideal number of attention heads is. But I do know that the number of heads must evenly divide the embedding dimension, as we need to split the feature vectors equally across all attention heads.
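A tiny sketch of why the division has to be even: splitting our five 4-dimensional word vectors into 2 heads gives each head exactly 2 features.

import torch

T, emb_dim, n_heads = 5, 4, 2
head_size = emb_dim // n_heads                  # 4 / 2 = 2 features per head

x = torch.randn(T, emb_dim)                     # five words, four features each
heads = x.view(T, n_heads, head_size)           # head 1 sees F1-F2, head 2 sees F3-F4
print(heads[:, 0, :].shape)                     # torch.Size([5, 2])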
Before we proceed further, let’s look at the model architecture of the Transformer.

In the transformer architecture, as used in models like GPT, the component inside the red rectangle represents
the decoder. A decoder is responsible for generating text by processing the input sequentially and predicting
the next word in a sequence based on the previous words.

The transformer architecture was originally proposed for language translation, with the left part outside the red
rectangle acting as an encoder to supply the source text. But for text generation tasks like GPT, we only need
the decoder part inside the red rectangle. The red rectangle below represents what we learned in this post.

We will need to connect this masked multi-head attention in a certain way and put them inside a block. By
creating multiple such blocks and stacking them on top of each other, you will get a toy GPT that will start
generating text. I will cover this wiring and stacking in my next post.

You can download the Excel sheet and Python code that I used to generate these examples from here and here.
Here are some invaluable resources that I found to learn how attention works.
1. Attention Is All You Need
2. An Intuition for Attention
3. GPT in 60 Lines of NumPy
4. Let’s build GPT: from scratch, in code, spelled out
5. Intuition Behind Self-Attention Mechanism in Transformer Networks
6. Transformers from scratch
December 23, 2023 in Computer Science, Mathematics, Statistics. Tags: LLM, Machine Learning

Attention is all you need – Part 2
Jana Vembunarayanan / December 27, 2023
This is the 11th and final post in my series on building a toy GPT. Read my earlier posts first for better
understanding. I concluded the previous post by explaining how Attention works. You can download the Excel
sheet and Python code that explain how Attention works from here and here.
The Transformer architecture was originally proposed for language translation, with the block on the left-hand
side acting as an encoder to supply the source text. We don’t need the encoder block for text generation tasks
like with GPT. The red dashed rectangle is what I covered in the previous post.

A decoder is responsible for generating text by processing the input sequentially and predicting the next word
in a sequence based on the previous words. Let’s take a closer look at the Decoder block and examine its
components and their connections.

There are 2 kinds of Attention: self-attention and cross-attention. In self-attention, queries, keys, and values are
all derived from the same source. On the other hand, in cross-attention, queries originate from one source
while keys and values come from a different one.
Cross-attention is especially relevant in language translation or summarization tasks. Here, the keys and
values are sourced from the text in the original language. For instance, in translating “The cat sat on the mat”
from English to French, the encoder provides the keys and values from the English sentence.
After removing the clutter from cross-attention, the image looks like below.

Attention alters the feature vectors of each word so that they align with every other word in an n-dimensional
space. This manipulation of feature vectors sets up a way for words to communicate with each other.
It’s important to note that this communication between words primarily occurs in the attention mechanism of the
network. After the communication between words is established, further computation needs to run. This
computation happens in the feed-forward layer, which is a multi-layered perceptron.
For example, GPT-2 uses word embeddings with 768 dimensions. Its feed-forward layer contains 3,072
neurons, which scales up the embedding dimension by a factor of 4. This larger intermediate state then gets
fed into another layer with 768 neurons, bringing it back down to the original 768 dimension size. In summary,
the decoder block utilizes the following process:
1. Attention enables words to communicate with each other by aligning their 768-dimensional vector
representations.
2. The 3072-neuron Feed-forward layer performs computation on those vector representations to model
more complex functions. The final 768-neuron layer reduces the representations back down to 768
dimensions to be passed on to the next decoder block.
So in essence, the decoder orchestrates communication between words via the Attend layer, interspersed with
computation steps in the Feed-forward layers. Stacking decoder blocks on top of each other intersperses
communication and computation. For example, GPT-2 stacks 12 decoder blocks on top of one another.
Why do we need the Add & Norm block?
As we create neural networks with multiple layers, we can run into issues like vanishing or exploding gradients.
For example, during backpropagation, the gradient can become too small (vanish) due to multiplying many
small numbers over multiple layers.
Let’s say a particular weight is 0.2 and the gradient has to pass through 10 layers. The gradient at the 10th
layer is 0.2^10, which is about 0.0000001. Since the gradient is almost zero, this inhibits the network’s ability to learn effectively. This issue is called the vanishing gradient problem.

Take a look at the + operation in the code above. The technical term for the + operation is called a skip
connection, also known as a residual connection. It is a crucial component in the architecture of transformer
blocks. It solves the vanishing gradient problem.
The plus operation passes gradient values through unchanged. We learnt this in the Neural Networks post. The
left side of the image below comes from the ResNet paper. Let H = F(x) + x. Think of F(x) as either multi-head
self attention or the feed forward network.
As shown in the derivation on the right-hand side, the gradient from H flows to x even if the gradient from F(x)
happens to be zero. You can read this and this article to learn more about skip connection. Skip connections
not only fix the vanishing gradient issue but also aid in retaining information from the original input.
Deep Neural networks conduct millions of matrix operations, using weights and biases that are initially set at
random. As we incorporate more layers, as seen in transformer models, there’s a risk that the generated representations might drift significantly from the original input.
This drift can lead to the loss of the original signal, causing the network to learn more noise than signal. By
reintroducing the original input (through “x +”), we help retain information from the input, ensuring that the
resulting representations contain more relevant signal and less noise.
That covers the Add in the Add & Norm block. Why do we need Norm? When you do millions of multiplications
and additions, the resulting number could get very large. This could result in large or exploding gradients,
causing the weights to change wildly, which in turn can prevent the model from converging to a good solution.
Layer norm is used to normalize the values for each input. It is a simple idea: you take the mean of the input and normalize each value by subtracting the mean and dividing by the standard deviation of the input. This is similar to calculating the z-score in statistics. Parameters g and b are learned during training. The image below shows how layer norm works on 5 inputs with 4 dimensions for each input.
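For a single made-up 4-dimensional input, a small PyTorch sketch of the same calculation looks like this; g and b start at 1 and 0 and are learned during training.

import torch
import torch.nn.functional as F

x = torch.tensor([[2.0, -1.0, 0.5, 3.0]])       # one input with 4 dimensions (made-up)
g = torch.ones(4)                               # learned scale, initialized to 1
b = torch.zeros(4)                              # learned shift, initialized to 0

mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
out = g * (x - mean) / torch.sqrt(var + 1e-5) + b

print(out)                                      # zero mean, unit variance across the 4 values
print(F.layer_norm(x, (4,)))                    # PyTorch's built-in gives the same result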

Computation coming out of each neuron must stay within a range for the neural network to function properly.
Both skip connections and layer norm keep the numbers within that range. I think of the output of the neuron
like glucose in our blood. It can’t be too low or too high. Hormones like insulin and organs like the liver work to
keep blood sugar within a certain range. Similarly, we have the Add & Norm layer in neural networks.
GPT uses GELU (Gaussian Error Linear Unit) as an activation instead of ReLU. One issue with ReLU is that it
can lead to “dead neurons,” where neurons always output zero, especially if they get a very negative bias
during training. This is known as the “dying ReLU” issue. GELU gives a smoother curve than ReLU, which can
help in learning more complex patterns. It doesn’t have as much of the dying neuron issue as ReLU.
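Putting the pieces together, here is a rough PyTorch sketch of one decoder block: masked multi-head attention for communication, a 4x feed-forward layer with GELU for computation, and skip connections with layer norm around both. It uses PyTorch's built-in MultiheadAttention for brevity and the pre-norm wiring that GPT-2 uses; the real GPT-2 code differs in details.

import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, emb_dim=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(                      # 768 -> 3072 -> 768
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):                             # x: (batch, T, 768)
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)   # words communicate
        x = x + a                                     # skip connection around attention
        x = x + self.ff(self.ln2(x))                  # skip connection around feed-forward
        return x

block = DecoderBlock()
print(block(torch.randn(1, 5, 768)).shape)            # torch.Size([1, 5, 768])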

The final step is that the output from the last block gets projected back from the embedding dimension to the
vocabulary dimension. Through softmax we get the next best word given the current words. You take the output
word and append it to the current input and go through the transformer again.

For the input “I chose that bank for,” the GPT-2 model predicted “my.” Calling it again with “I chose that bank for
my,” it predicted “first.” Calling it again with “I chose that bank for my first,” it predicted “job.” So we got a logical
sentence: “I chose that bank for my first job.”
This process of predicting a future value (regression) and adding it back into the input (auto) is why GPT is
called autoregressive. The best way to see it in action is to run PicoGPT through the debugger and see how
the entire process works.
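The autoregressive loop itself is short. Here is a sketch assuming a callable named gpt that takes a (1, T) tensor of token ids and returns logits of shape (1, T, vocab_size); greedy decoding picks the single best next token each time.

import torch

def generate(gpt, tokens, n_new):
    for _ in range(n_new):
        logits = gpt(tokens)                                        # (1, T, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)     # best next token
        tokens = torch.cat([tokens, next_id], dim=1)                # append ("auto") and repeat
    return tokens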
Language Models are Few-Shot Learners paper gives an idea of the scale of these GPT models.
All these models are trained on 300 billion tokens of text data. Specifically, GPT-3 has 175B parameters, an
embedding size (width of the network) of 12,288, 96 layers (depth of the network), and 96 attention heads with
a head size of 128 each.

These models are known as foundational models. They can generate text that logically follows a given input,
but they are not designed to answer questions directly. For instance, when GPT-2 was asked “What is the
capital of India?”, it replied with “India is the”, demonstrating its inability to provide specific factual responses.
How is ChatGPT able to answer most of my questions correctly? These foundational models undergo fine-
tuning and Reinforcement Learning from Human Feedback (RLHF). This additional training enables ChatGPT
to generate more relevant and accurate answers.
Fine-tuning provides the model with additional question-and-answer training data. Going through this data
selectively adjusts the model’s weights to better align it with accurately answering questions. In RLHF, humans
evaluate the model’s responses and indicate which ones are most appropriate. This feedback allows the model
to learn and continuously improve its ability to generate high-quality answers.
I will write about fine-tuning and RLHF sometime next year.
It’s a deeply satisfying experience to understand how LLMs work. Through the process of building and writing
these 11 posts, I was able to see the math that goes inside the LLM black box. I am deeply thankful to the
many people who have freely shared their work online. This has helped me find answers to most of my
questions. In particular, without the teachings of Andrej Karpathy and Jeremy Howard, I would have given up
on this long back.
Before concluding this post, I would like to leave you with one of the best ways to start learning about large
language models (LLMs).
Start with Intro to Large Language Models, ChatGPT Prompt Engineering for Developers, and A Hackers’
Guide to Language Models. It will take 4-5 hours to finish watching these videos.
You need to get really comfortable programming with Python, as most of the AI ecosystem is built on top of this
language. I highly recommend reading Python Distilled to brush up your skills.
The best way to learn about LLMs is by using them in your products. You can start with OpenAI APIs, which
have a best-in-class platform that just works. Their cookbook is the best place to look at all the use cases that
are possible with their APIs.