Building a Large Language Model from Scratch
Build a Large Language Model
Understanding LLMs
1.1 What is an LLM?
●A neural network that can understand, generate, and respond to text in a human-like way
●Trained on massive amounts of text data
●"Large" in "Large Language Model" refers to a) the model size (number of parameters) and b) the size of the training dataset
●Utilizes the transformer architecture with an attention mechanism
●LLMs are also referred to as generative AI (Gen AI) because of their generative capabilities
[Diagram: Large Language Models / Gen AI shown as a subset of Deep Learning, which is a subset of Machine Learning, which is a subset of Artificial Intelligence]
1.2 Applications of LLMs
●Machine Translation
●Text Summarization
●Sentiment Analysis
●Content Creation
●Code generation
●Conversational agents like chatbots
1.3 Stages of building and using LLMs
●Data preparation
●Pretraining the LLM on large unlabelled text data
○The pretrained model has text-completion and few-shot capabilities
●Preparing labelled datasets for specific tasks
●Training the LLM on these task-specific datasets to get a fine-tuned LLM
○Classification
○Summarization
○Translation
○Personal assistant
●Fine-tuning has two types
○Instruction fine-tuning
○Classification fine-tuning
1.4 Introducing the Transformer Architecture
●Original Transformer
○Developed for machine translation, e.g., English to German
●Encoder
○Processes the input text and produces an embedding representation
●Decoder
○Uses the encoder's output to generate the translated text one word at a time
●Self-attention mechanism (a minimal sketch follows this slide)
●BERT (encoder-based model)
○Masked language modeling
○X (formerly Twitter) uses BERT
●GPT (decoder-only model)
○Autoregressive model
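The self-attention mechanism mentioned above can be illustrated with a minimal sketch of single-head, unmasked scaled dot-product attention; the tensor sizes and random weights below are purely illustrative and not taken from the slides.

import torch

def self_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_in) token embeddings; W_q, W_k, W_v: (d_in, d_out) projection matrices
    queries = x @ W_q
    keys = x @ W_k
    values = x @ W_v
    # Scaled dot-product attention scores between every pair of tokens
    scores = queries @ keys.T / keys.shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)
    # Each output row is a weighted mix of all value vectors
    return weights @ values

torch.manual_seed(123)
x = torch.randn(6, 16)                            # 6 tokens, 16-dimensional embeddings
W_q, W_k, W_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)     # torch.Size([6, 16])

GPT-style decoders add a causal mask and multiple attention heads on top of this; the sketch omits both for brevity.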
1.5 Utilizing large datasets
●Huge corpus with billions of words
●Common datasets
○CommonCrawl
○WebText2
○Books1
○Books2
○Wikipedia
●GPT training datasets were not released
●Dolma (an open-source pretraining corpus from AI2)
1.6 A closer look at the GPT architecture
●Decoder-only architecture
●Autoregressive model
●GPT-3 has 96 transformer layers and 175 billion parameters
●Emergent behavior
1.7 Building an LLM
●Stage 1
○Building an LLM
■Data Preparation and Sampling
■Attention mechanism
■LLM architecture
●Stage 2
○Foundational model
■Training loop
■Model evaluation
■Load pretrained weights
●Stage 3
○Fine tuning
■Classifier
■Personal assistant
Build a Large Language Model
Working With Text Data
Understanding Word Embeddings
●Embedding: Converting data into a vector format.
●Types of embeddings
○Text, Audio, Video
●Types of text embeddings
○Word, Sentence, Paragraphs (RAG)
○Whole documents
●Word2Vec
○Words that appear in similar contexts get similar embeddings (see the sketch after this slide)
●Models for word embeddings
○Static Models (Word2Vec, GloVe, FastText)
○Contextual Models (BERT, GPT, etc.)
●LLMs produce their own embeddings which are updated during training.
●GPT-2: 768 dimensions; GPT-3: 12,288 dimensions
[Diagram: an embedding maps discrete, nonnumeric objects into a continuous, machine-readable vector space]
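As a rough illustration of static word embeddings such as Word2Vec, here is a minimal sketch using the gensim library; gensim and the toy corpus are assumptions for illustration (they are not part of the slides), and a useful model would need far more training text.

from gensim.models import Word2Vec

# Toy corpus: far too small for meaningful vectors, but enough to show the API
sentences = [
    ["i", "love", "reading", "books"],
    ["i", "love", "reading", "stories"],
    ["books", "and", "stories", "are", "fun"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=1)

print(model.wv["books"].shape)                  # (50,): one static vector per word
print(model.wv.similarity("books", "stories"))  # cosine similarity between two word vectors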
Tokenizing Text
●1st step in creating embeddings
●Tokens
○Individual words or special characters, including punctuation marks
●LLM Training
○"The Verdict", a short story by Edith Wharton
○Goal: tokenize the 20,479-character short story
[Example: the input text "I love reading books." is split into the tokens "I", "love", "reading", "books", and "."; a regex sketch follows this slide]
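A minimal regex-based tokenizer in the spirit of the example above; the exact pattern is an illustrative choice, not necessarily the one used in the original material.

import re

text = "I love reading books."

# Split on whitespace and punctuation, keeping punctuation as separate tokens
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
tokens = [t.strip() for t in tokens if t.strip()]
print(tokens)   # ['I', 'love', 'reading', 'books', '.']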
Converting Tokens into Token IDs
●Intermediate step before converting tokens into embeddings
●Vocabulary
○Defines how we map each unique word and special character to a unique integer
○The vocabulary size for "The Verdict" is 1,130 (see the sketch after this slide)
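A minimal sketch of building a vocabulary and mapping tokens to token IDs; the toy token list below stands in for the tokenized short story.

# Toy token list standing in for the tokenized "The Verdict" text
tokens = ["I", "love", "reading", "books", ".", "I", "love", "stories", "."]

# Vocabulary: map each unique token to a unique integer ID
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Encode tokens to IDs, then decode the IDs back to text
ids = [vocab[t] for t in ["I", "love", "books", "."]]
inverse = {idx: tok for tok, idx in vocab.items()}
print(ids, "->", " ".join(inverse[i] for i in ids))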
Adding Special Context Tokens
●Need for special tokens
○To handle unknown words <|unk|> (see the sketch after this slide)
○To identify start and end of the text
○To pad the shorter texts to match the length of longer texts
●Popular tokens used
○[BOS] (beginning of sequence)
○[EOS] (end of sequence)
○[PAD] (padding)
●The tokenizer for GPT models uses only <|endoftext|>
●GPT models handle unknown words using BPE
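A small sketch of extending such a vocabulary with <|unk|> and <|endoftext|>, with unknown words falling back to <|unk|>; the toy vocabulary and the encode helper are hypothetical, and as noted above, GPT models instead rely on BPE and only use <|endoftext|>.

tokens = ["I", "love", "reading", "books", "."]
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Append the special tokens at the end of the vocabulary
vocab["<|endoftext|>"] = len(vocab)
vocab["<|unk|>"] = len(vocab)

def encode(words, vocab):
    # Unknown words fall back to the <|unk|> token ID
    return [vocab.get(w, vocab["<|unk|>"]) for w in words]

print(encode(["I", "love", "novels", "."], vocab))   # "novels" maps to the <|unk|> ID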
Byte Pair Encoding
●A popular tokenization technique used to train GPT-2, GPT-3, RoBERTa, BART, and DeBERTa
●Training phase
○BPE learns a vocabulary of subwords by iteratively merging the most frequent character pairs.
●Tokenization Phase
○Split text into characters.
○Iteratively match the longest possible subwords from the vocabulary.
○Replace matched subwords with their corresponding token IDs.
●tiktoken
○An open-source Python library that implements BPE (see the example after this slide)
○The BPE tokenizer used for GPT-2 and GPT-3 has a vocabulary size of 50,257
●Handling unknown words
○Unknown words are broken down into subwords or individual characters, which ensures the LLM can process any text
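A minimal example of BPE tokenization with the tiktoken library; the sample sentence is illustrative, and the made-up word at the end shows how unknown words are split into subword tokens.

import tiktoken   # pip install tiktoken

tokenizer = tiktoken.get_encoding("gpt2")     # the GPT-2/GPT-3-style BPE tokenizer

text = "Hello, do you like tea? <|endoftext|> Akwirw ier"
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)
print(tokenizer.decode(ids))                  # reconstructs the original text
print(tokenizer.n_vocab)                      # 50257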
Data Sampling With a Sliding Window
●The LLM's pretraining task is to predict the next word that follows the input block
●Input-target pairs need to be created
●To perform data sampling
○We make use of PyTorch's built-in Dataset and DataLoader classes (see the sketch after this slide)
○Hyperparameters for DataLoader
■batch_size, max_length, stride, num_workers
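A sketch of sliding-window sampling with PyTorch's Dataset and DataLoader; the class name, sample text, and hyperparameter values below are illustrative assumptions rather than the original implementation.

import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    # Slices the token ID sequence into overlapping input/target windows
    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # Slide a window of max_length tokens over the text; the target is the input shifted by one
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
text = "In the beginning the Universe was created. " * 50   # stand-in for the short story
dataset = GPTDataset(text, tokenizer, max_length=4, stride=4)
loader = DataLoader(dataset, batch_size=8, shuffle=False, num_workers=0)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)   # torch.Size([8, 4]) torch.Size([8, 4])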
Creating Token Embeddings
●Last step in preparing input text for LLM training
●Token ids are converted to embeddings
○These embeddings are initialized with random values
○This serves as a starting point for LLMs learning process
●Use torch.nn.Embedding to create an embedding layer (see the snippet after this slide)
○The embedding layer performs a lookup operation that retrieves rows from its weight matrix
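A minimal sketch of the token embedding lookup with torch.nn.Embedding; the embedding dimension of 256 and the sample token IDs are illustrative.

import torch

vocab_size = 50257      # GPT-2 BPE vocabulary size
output_dim = 256        # illustrative embedding dimension (GPT-2 itself uses 768)

torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)   # randomly initialized weights

token_ids = torch.tensor([[40, 367, 2885, 1464]])        # batch of 1 sequence with 4 token IDs
token_embeddings = token_embedding_layer(token_ids)      # looks up rows of the weight matrix
print(token_embeddings.shape)                            # torch.Size([1, 4, 256])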
Encoding Word Positions
●Need
○The self-attention mechanism has no notion of the position or order of tokens within a sequence
○The embedding layer returns the same embedding for a given token ID, regardless of its position
●So we inject positional embeddings to add position information
●There are two types
○Absolute Positional Embeddings
○Relative Positional Embeddings
●OpenAI’s GPT models use absolute positional embeddings
●These embeddings are optimized during the training process
●The positional embeddings have dimensions context_length x embedding_dim and are broadcast across the batch (see the snippet below)
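A minimal sketch of adding absolute positional embeddings to the token embeddings; the dimensions and token IDs are illustrative.

import torch

context_length, output_dim = 4, 256
torch.manual_seed(123)
token_embedding_layer = torch.nn.Embedding(50257, output_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

token_ids = torch.tensor([[40, 367, 2885, 1464]])                    # (batch_size, context_length)
token_embeddings = token_embedding_layer(token_ids)                  # (1, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))   # (4, 256), same for every sequence
input_embeddings = token_embeddings + pos_embeddings                 # broadcasts over the batch dimension
print(input_embeddings.shape)                                        # torch.Size([1, 4, 256])

These positional embedding weights are learned during training, matching the slide's note that they are optimized in the training process.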