Building a Large Language Model from Scratch

SravaniGunnu, Mar 05, 2025 (17 slides)

About This Presentation

All about LLMs.


Slide Content

Build a Large Language Model
Understanding LLMs

1.1 What is an LLM?
●Neural network that can understand, generate, and respond to text in a human-like way
●Trained on large amounts of data
●“Large” in “large language model” refers to a) model size (number of parameters) and b) the size of the training dataset
●Utilizes the transformer architecture with an attention mechanism
●Also referred to as generative AI (Gen AI) because of their generative capabilities

[Diagram: Artificial Intelligence, Machine Learning, Deep Learning, Large Language Models, and Gen AI shown as nested/overlapping fields]

1.2 Applications of LLMs
●Machine Translation
●Text Summarization
●Sentiment Analysis
●Content Creation
●Code Generation
●Conversational agents like chatbots

1.3 Stages of building and using LLMs
●Data preparation
●Pretraining the LLM on large amounts of unlabelled text data
○The pretrained model has text-completion and few-shot capabilities
●Preparing labelled datasets for specific tasks
●Training the LLM on these task-specific datasets to obtain a fine-tuned LLM
○Classification
○Summarization
○Translation
○Personal assistant
●Fine-tuning comes in two forms
○Instruction fine-tuning
○Classification fine-tuning

1.4 Introducing the Transformer Architecture
●Original transformer
○Developed for machine translation, from English to German
●Encoder
○Processes the input text and produces an embedding representation
●Decoder
○Uses the encoder’s output to generate the translated text one word at a time
●Self-attention mechanism
●BERT (encoder-based model)
○Masked language modeling
○X (formerly Twitter) uses BERT
●GPT (decoder-only model)
○Autoregressive model
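
As a rough illustration of the self-attention mechanism, here is a minimal sketch of unmasked, single-head scaled dot-product attention in PyTorch; the dimensions and random weight matrices are illustrative and not taken from the slides.

import torch

def self_attention(x, W_q, W_k, W_v):
    # Project the token embeddings into queries, keys, and values.
    queries, keys, values = x @ W_q, x @ W_k, x @ W_v
    # Attention scores say how strongly each token attends to every other token.
    scores = queries @ keys.T / keys.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted sum of the value vectors.
    return weights @ values

torch.manual_seed(123)
d_in, d_out = 6, 4
x = torch.randn(3, d_in)                               # 3 tokens, 6-dim embeddings
W_q, W_k, W_v = (torch.randn(d_in, d_out) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)          # torch.Size([3, 4])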

1.5 Utilizing large datasets
●Huge corpora with billions of words
●Common datasets
○CommonCrawl
○WebText2
○Books1
○Books2
○Wikipedia
●The GPT training datasets were not publicly released
●Dolma: an open corpus of three trillion tokens for LLM pretraining

1.6 A closer look at the GPT architecture
●Decoder-only architecture
●Autoregressive model
●GPT-3 has 96 transformer layers and 175 billion parameters
●Emergent behavior: abilities, such as translation, that the model was not explicitly trained for
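
“Autoregressive” means the model generates text one token at a time, feeding each prediction back in as input. Below is a minimal sketch of that loop using a randomly initialized stand-in for the model; dummy_model is a placeholder, not a real GPT forward pass.

import torch

vocab_size, context_length = 50257, 1024

def dummy_model(token_ids):
    # Stand-in for a GPT forward pass: random logits over the vocabulary
    # for every position in the input sequence.
    return torch.randn(token_ids.shape[0], token_ids.shape[1], vocab_size)

def generate(token_ids, max_new_tokens):
    for _ in range(max_new_tokens):
        logits = dummy_model(token_ids[:, -context_length:])            # crop to context window
        next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)  # greedy next token
        token_ids = torch.cat([token_ids, next_id], dim=1)              # feed prediction back in
    return token_ids

prompt = torch.tensor([[464, 2746, 318]])   # arbitrary example token IDs
print(generate(prompt, max_new_tokens=5))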

1.7 Building a LLM
●Stage 1
○Building an LLM
■Data Preparation and Sampling
■Attention mechanism
■LLM architecture
●Stage 2
○Pretraining to obtain a foundation model
■Training loop
■Model evaluation
■Load pretrained weights
●Stage 3
○Fine-tuning
■Classifier
■Personal assistant

Build a Large Language Model
Working With Text Data

Understanding Word Embeddings
●Embedding: Converting data into a vector format.
●Types of embeddings
○Text, audio, video
●Types of text embeddings
○Word, sentence, and paragraph embeddings (the latter used in RAG)
○Whole documents
●Word2Vec
○Words that appear in similar contexts get similar embeddings
●Models for word embeddings
○Static models (Word2Vec, GloVe, FastText)
○Contextual models (BERT, GPT, etc.)
●LLMs produce their own embeddings, which are updated during training
●GPT-2 uses 768-dimensional embeddings; GPT-3 uses 12,288 dimensions
[Diagram: embeddings map discrete, nonnumeric objects into a continuous, machine-readable vector space]

Tokenizing Text
●The first step in creating embeddings
●Tokens
○Individual words or special characters, including punctuation
●LLM training
○The Verdict, a short story by Edith Wharton
○Goal: tokenize the 20,479-character short story
Example: Input text “I love reading books.” → Tokenized text: I | love | reading | books | .
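
A minimal whitespace-and-punctuation tokenizer along these lines, as a sketch (the regular expression below is illustrative; the deck does not specify one):

import re

text = "I love reading books."
# Split on whitespace and punctuation, keeping the punctuation as its own token.
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)   # ['I', 'love', 'reading', 'books', '.']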

Converting Tokens into Token IDs
●An intermediate step before converting tokens into embeddings
●Vocabulary
○Defines how we map each unique word and special character to a unique integer
○The vocabulary size for The Verdict is 1,130
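
A sketch of building such a vocabulary, continuing from the tokens list in the previous snippet (a real run would use every token of the short story):

# Map each unique token to an integer ID, then encode/decode with the mapping.
all_tokens = sorted(set(tokens))
vocab = {token: token_id for token_id, token in enumerate(all_tokens)}
inverse_vocab = {token_id: token for token, token_id in vocab.items()}

ids = [vocab[t] for t in tokens]
print(ids)                                        # [1, 3, 4, 2, 0]
print(" ".join(inverse_vocab[i] for i in ids))    # I love reading books .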

Adding Special Context Tokens
●Need for special tokens
○To handle unknown words: <|unk|>
○To mark the start and end of a text
○To pad shorter texts to match the length of longer texts
●Popular tokens used
○[BOS] (beginning of sequence)
○[EOS] (end of sequence)
○[PAD] (padding)
●The tokenizer for GPT models uses only <|endoftext|>
●GPT models handle unknown words using byte pair encoding (BPE)
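
Continuing the toy vocabulary above, a sketch of adding special tokens and mapping unseen words to <|unk|> (GPT’s BPE tokenizer, covered next, avoids the need for <|unk|> altogether):

# Extend the toy vocabulary with special tokens, and fall back to <|unk|>
# for any word that was not seen when the vocabulary was built.
vocab["<|endoftext|>"] = len(vocab)
vocab["<|unk|>"] = len(vocab)

def encode(text_tokens, vocab):
    return [vocab.get(t, vocab["<|unk|>"]) for t in text_tokens]

print(encode(["I", "love", "unicorns", "."], vocab))   # "unicorns" maps to the <|unk|> ID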

Byte Pair Encoding
●A popular tokenization technique used to train GPT-2, GPT-3, RoBERTa, BART, and DeBERTa
●Training phase
○BPE learns a vocabulary of subwords by iteratively merging the most frequent character pairs
●Tokenization phase
○Split text into characters
○Iteratively match the longest possible subwords from the vocabulary
○Replace matched subwords with their corresponding token IDs
●tiktoken
○An open-source Python library that implements BPE
○The BPE tokenizer used for GPT-2 and GPT-3 has a vocabulary size of 50,257
●Handling unknown words
○Unknown words are broken down into subwords or individual characters, which ensures the LLM can process any text
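
A quick usage sketch with tiktoken (the example string is illustrative):

import tiktoken   # pip install tiktoken

tokenizer = tiktoken.get_encoding("gpt2")    # GPT-2/GPT-3 BPE tokenizer

text = "Hello, do you like tea? <|endoftext|> someunknownPlace"
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)
print(tokenizer.decode(ids))    # round-trips back to the original text
print(tokenizer.n_vocab)        # 50257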

Data Sampling With a Sliding Window
●The LLM’s pretraining task is to predict the next word that follows the input block
●Input-target pairs need to be created, where each target is the input shifted by one token
●To perform data sampling
○We make use of PyTorch’s built-in Dataset and DataLoader classes (see the sketch below)
○Hyperparameters for the DataLoader
■batch_size, max_length, stride, num_workers
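
A sketch of such a dataset and loader; the class name GPTDataset and the toy text are illustrative, not from the deck.

import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    # Slides a window over the token IDs to build input/target pairs,
    # where each target is the input shifted one position to the right.
    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
raw_text = "I love reading books about building language models from scratch. " * 20
dataset = GPTDataset(raw_text, tokenizer, max_length=8, stride=4)
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)    # torch.Size([4, 8]) torch.Size([4, 8])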

Creating Token Embeddings
●Last step in preparing the input text for LLM training
●Token IDs are converted to embeddings
○These embeddings are initialized with random values
○This serves as the starting point for the LLM’s learning process
●Using torch.nn.Embedding, create an embedding layer
○The embedding layer performs a lookup operation that retrieves rows from the layer’s weight matrix
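
A minimal sketch, using GPT-2-style dimensions for illustration (the token IDs are arbitrary examples):

import torch

vocab_size, output_dim = 50257, 768    # GPT-2-style sizes, for illustration
torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

token_ids = torch.tensor([[40, 367, 2885, 1464]])    # example batch of 4 token IDs
token_embeddings = token_embedding_layer(token_ids)  # row lookups into the weight matrix
print(token_embeddings.shape)                        # torch.Size([1, 4, 768])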

Encoding Word Positions
●Need
○The self-attention mechanism has no notion of the position or order of tokens within a sequence
○The embedding layer returns the same embedding for a given token ID every time, irrespective of its position
●So we inject positional embeddings to add position information
●There are two types
○Absolute positional embeddings
○Relative positional embeddings
●OpenAI’s GPT models use absolute positional embeddings
●These embeddings are optimized during the training process
●The positional embedding layer has dimensions context_length x embedding_dim; its output is added to the token embeddings of every sequence in the batch
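
Continuing the token-embedding sketch above, absolute positional embeddings can be added like this:

# Continues from the previous snippet (token_ids, output_dim, token_embeddings).
context_length = token_ids.shape[1]    # here: positions in the input; a real model would use its max context length, e.g. 1024
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# One learnable embedding per position 0..context_length-1: shape (context_length, output_dim).
pos_embeddings = pos_embedding_layer(torch.arange(context_length))

# Broadcasting adds the same positional embeddings to every sequence in the batch.
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)          # torch.Size([1, 4, 768])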