Building a Large Language Model from Scratch
Build a Large Language Model
Understanding LLMs
1.1 What is an LLM?
●A neural network that can understand, generate, and respond to text in a human-like way
●Trained on massive amounts of text data
●"Large" in "Large Language Model" refers to a) the model size (number of parameters) and b) the size of the training dataset
●Utilizes the transformer architecture with an attention mechanism
●LLMs are also referred to as generative AI (Gen AI) because of their generative capabilities
[Diagram: Large Language Models / Gen AI shown as a subset of Deep Learning, which is a subset of Machine Learning, which is a subset of Artificial Intelligence]
1.2 Applications of LLMs
●Machine Translation
●Text Summarization
●Sentiment Analysis
●Content Creation
●Code generation
●Conversational agents like chatbots
1.3 Stages of building and using LLMs
●Data preparation
●Pretraining the LLM on large unlabelled text data
○The pretrained model has text-completion and few-shot capabilities
●Preparing labelled datasets for specific tasks
●Training the LLM on these task-specific datasets to get a fine-tuned LLM
○Classification
○Summarization
○Translation
○Personal assistant
●Fine-tuning has two types
○Instruction fine-tuning
○Classification fine-tuning
1.4 Introducing the Transformer Architecture
●Original Transformer
○Developed for machine translation, e.g., English to German
●Encoder
○Processes the input text and produces an embedding representation
●Decoder
○Uses the encoder's output to generate the translated text one word at a time
●Self-attention mechanism (a minimal sketch follows this slide)
●BERT (encoder-based model)
○Masked language modeling
○X (formerly Twitter) uses BERT
●GPT (decoder-only model)
○Autoregressive model
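The self-attention mechanism mentioned above can be illustrated with a minimal sketch of single-head, unmasked scaled dot-product attention; the tensor sizes and random weights below are purely illustrative and not taken from the slides.

import torch

def self_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_in) token embeddings; W_q, W_k, W_v: (d_in, d_out) projection matrices
    queries = x @ W_q
    keys = x @ W_k
    values = x @ W_v
    # Scaled dot-product attention scores between every pair of tokens
    scores = queries @ keys.T / keys.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    # Each output row is a weighted mix of all value vectors
    return weights @ values

torch.manual_seed(123)
x = torch.randn(6, 16)                            # 6 tokens, 16-dimensional embeddings
W_q, W_k, W_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)     # torch.Size([6, 16])

GPT-style decoders add a causal mask and multiple attention heads on top of this; the sketch omits both for brevity.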
1.5 Utilizing large datasets
●Huge corpus with billions of words
●Common datasets
○CommonCrawl
○WebText2
○Books1
○Books2
○Wikipedia
●GPT training datasets were not released
●Dolma (an open-source pretraining corpus from AI2)
1.6 A closer look at the GPT architecture
●Decoder-only architecture
●Autoregressive model
●GPT-3 has 96 transformer layers and 175 billion parameters
●Emergent behavior
1.7 Building an LLM
●Stage 1
○Building an LLM
■Data Preparation and Sampling
■Attention mechanism
■LLM architecture
●Stage 2
○Foundational model
■Training loop
■Model evaluation
■Load pretrained weights
●Stage 3
○Fine tuning
■Classifier
■Personal assistant
Build a Large Language Model
Working With Text Data
Understanding Word Embeddings
●Embedding: Converting data into a vector format.
●Types of embeddings
○Text, Audio, Video
●Types of text embeddings
○Word, Sentence, Paragraphs (RAG)
○Whole documents
●Word2Vec
○Words that appear in similar contexts get similar embeddings (see the sketch after this slide)
●Models for word embeddings
○Static Models (Word2Vec, GloVe, FastText)
○Contextual Models (BERT, GPT, etc.)
●LLMs produce their own embeddings which are updated during training.
●GPT-2: 768 dimensions; GPT-3: 12,288 dimensions
[Diagram: an embedding maps discrete, nonnumeric objects into a continuous, machine-readable vector space]
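As a rough illustration of static word embeddings such as Word2Vec, here is a minimal sketch using the gensim library; gensim and the toy corpus are assumptions for illustration (they are not part of the slides), and a useful model would need far more training text.

from gensim.models import Word2Vec

# Toy corpus: far too small for meaningful vectors, but enough to show the API
sentences = [
    ["i", "love", "reading", "books"],
    ["i", "love", "reading", "stories"],
    ["books", "and", "stories", "are", "fun"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

print(model.wv["books"].shape)                  # (50,): one static vector per word
print(model.wv.similarity("books", "stories"))  # cosine similarity between two word vectors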
Tokenizing Text
●1st step in creating embeddings
●Tokens
○Individual words or special characters, including punctuation marks
●LLM Training
○"The Verdict", a short story by Edith Wharton
○Goal: tokenize the 20,479-character short story
[Example: the input text "I love reading books." is split into the tokens "I", "love", "reading", "books", and "."; a regex sketch follows this slide]
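A minimal regex-based tokenizer in the spirit of the example above; the exact pattern is an illustrative choice, not necessarily the one used in the original material.

import re

text = "I love reading books."

# Split on whitespace and punctuation, keeping punctuation as separate tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)   # ['I', 'love', 'reading', 'books', '.']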
Converting Tokens into Token IDs
●Intermediate step before converting tokens into embeddings
●Vocabulary
○Defines how we map each unique word and special character to a unique integer
○The vocabulary size for "The Verdict" is 1,130 (see the sketch after this slide)
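A minimal sketch of building a vocabulary and mapping tokens to token IDs; the toy token list below stands in for the tokenized short story.

# Toy token list standing in for the tokenized "The Verdict" text
tokens = ["I", "love", "reading", "books", ".", "I", "love", "stories", "."]

# Vocabulary: map each unique token to a unique integer ID
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Encode tokens to IDs, then decode the IDs back to text
ids = [vocab[t] for t in ["I", "love", "books", "."]]
inverse = {idx: tok for tok, idx in vocab.items()}
print(ids, "->", " ".join(inverse[i] for i in ids))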
Adding Special Context Tokens
●Need for special tokens
○To handle unknown words <|unk|> (see the sketch after this slide)
○To identify start and end of the text
○To pad the shorter texts to match the length of longer texts
●Popular tokens used
○[BOS] (beginning of sequence)
○[EOS] (end of sequence)
○[PAD] (padding)
●The tokenizer for GPT models uses only <|endoftext|>
●GPT models handle unknown words using BPE
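A small sketch of extending such a vocabulary with <|unk|> and <|endoftext|>, with unknown words falling back to <|unk|>; the toy vocabulary and the encode helper are hypothetical, and as noted above, GPT models instead rely on BPE and only use <|endoftext|>.

tokens = ["I", "love", "reading", "books", "."]
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Append the special tokens at the end of the vocabulary
vocab["<|endoftext|>"] = len(vocab)
vocab["<|unk|>"] = len(vocab)

def encode(words, vocab):
    # Unknown words fall back to the <|unk|> token ID
    return [vocab.get(w, vocab["<|unk|>"]) for w in words]

print(encode(["I", "love", "novels", "."], vocab))   # "novels" maps to the <|unk|> ID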
Byte Pair Encoding
●A popular tokenization technique used to train GPT-2, GPT-3, RoBERTa, BART, and DeBERTa
●Training phase
○BPE learns a vocabulary of subwords by iteratively merging the most frequent character pairs.
●Tokenization Phase
○Split text into characters.
○Iteratively match the longest possible subwords from the vocabulary.
○Replace matched subwords with their corresponding token IDs.
●tiktoken
○An open-source Python library that implements BPE (see the example after this slide)
○The BPE tokenizer used for GPT-2 and GPT-3 has a vocabulary size of 50,257
●Handling unknown words
○Unknown words are broken down into subwords or individual characters, which ensures the LLM can process any text
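A minimal example of BPE tokenization with the tiktoken library; the sample sentence is illustrative, and the made-up word at the end shows how unknown words are split into subword tokens.

import tiktoken   # pip install tiktoken

tokenizer = tiktoken.get_encoding("gpt2")     # the GPT-2/GPT-3-style BPE tokenizer

text = "Hello, do you like tea? <|endoftext|> Akwirw ier"
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)
print(tokenizer.decode(ids))                  # reconstructs the original text
print(tokenizer.n_vocab)                      # 50257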
Data Sampling With a Sliding Window
●The LLM's pretraining task is to predict the next word that follows the input block
●Input-target pairs need to be created
●To perform data sampling
○We make use of PyTorch's built-in Dataset and DataLoader classes (see the sketch after this slide)
○Hyperparameters for DataLoader
■batch_size, max_length, stride, num_workers
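A sketch of sliding-window sampling with PyTorch's Dataset and DataLoader; the class name, sample text, and hyperparameter values below are illustrative assumptions rather than the original implementation.

import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    # Slices the token ID sequence into overlapping input/target windows
    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # Slide a window of max_length tokens over the text; the target is the input shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
text = "In the beginning the Universe was created. " * 50   # stand-in for the short story
dataset = GPTDataset(text, tokenizer, max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=8, shuffle=False, num_workers=0)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)   # torch.Size([8, 4]) torch.Size([8, 4])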
Creating Token Embeddings
●Last step in preparing input text for LLM training
●Token ids are converted to embeddings
○These embeddings are initialized with random values
○This serves as a starting point for LLMs learning process
●Use torch.nn.Embedding to create an embedding layer (see the snippet after this slide)
○The embedding layer performs a lookup operation that retrieves rows from its weight matrix
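A minimal sketch of the token embedding lookup with torch.nn.Embedding; the embedding dimension of 256 and the sample token IDs are illustrative.

import torch

vocab_size = 50257      # GPT-2 BPE vocabulary size
output_dim = 256        # illustrative embedding dimension (GPT-2 itself uses 768)

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)   # randomly initialized weights

token_ids = torch.tensor([[40, 367, 2885, 1464]])        # batch of 1 sequence with 4 token IDs
token_embeddings = token_embedding_layer(token_ids)      # looks up rows of the weight matrix
print(token_embeddings.shape)                            # torch.Size([1, 4, 256])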
Encoding Word Positions
●Need
○The self-attention mechanism has no notion of the position or order of tokens within a sequence
○The embedding layer returns the same embedding for a given token ID, regardless of its position
●So we inject positional embeddings to add position information
●There are two types
○Absolute Positional Embeddings
○Relative Positional Embeddings
●OpenAI’s GPT models use absolute positional embeddings
●These embeddings are optimized during the training process
●The positional embeddings have dimensions context_length x embedding_dim and are broadcast across the batch (see the snippet below)
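A minimal sketch of adding absolute positional embeddings to the token embeddings; the dimensions and token IDs are illustrative.

import torch

context_length, output_dim = 4, 256
torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(50257, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

token_ids = torch.tensor([[40, 367, 2885, 1464]])                    # (batch_size, context_length)
token_embeddings = token_embedding_layer(token_ids)                  # (1, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))   # (4, 256), same for every sequence
input_embeddings = token_embeddings + pos_embeddings                 # broadcasts over the batch dimension
print(input_embeddings.shape)                                        # torch.Size([1, 4, 256])

These positional embedding weights are learned during training, matching the slide's note that they are optimized in the training process.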