Generative Artificial Intelligence and Large Language Model
shiwanigupta
About This Presentation
Natural Language Processing (NLP) is a discipline dedicated to enabling computers to comprehend and generate human language.
Word embedding is a technique in NLP that converts words into dense numerical vectors, capturing their semantic meanings and contextual relationships. Analyzing sequential data often requires techniques such as time series analysis and sequence modeling, using machine learning models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs).
Encoder-Decoder architecture is an RNN framework designed for sequence-to-sequence tasks. Beam Search is a search algorithm used in sequence-to-sequence models, particularly in natural language processing tasks. BLEU is a popular evaluation metric for assessing the quality of text generated by machine translation systems. Attention mechanism allows models to selectively focus on the most relevant information within large datasets, thereby enhancing efficiency and accuracy in data processing.
Slide Content
Generative AI and LLM
Dr. Shiwani Gupta
Associate Professor, HoD AI&ML
TCET, Mumbai
Natural language is difficult to learn: "right" may be written as "ryt", "How are you" as "hru".
This variability makes NLP a really hard problem to solve.
NLP Applications
• From customer service chatbots to language translation apps
• Healthcare, finance, and education
• By allowing machines to extract meaning, analyze sentiments, and summarize text, NLP has revolutionized communication, making it an essential technology in our increasingly interconnected world.
• Sentiment analysis: opinions, feelings, and emotions expressed as feedback, comments, ratings, and likes for products/services (e.g. Amazon, Flipkart).
• Intent analysis: complaints, opinions, comments, statements, feedback, queries, and suggestions received over a digital medium, IVR, or a customer call center; feeds an automated ticketing system.
• Information extraction: extract info from resumes, financial attributes, and events from news for trading.
NLP Applications
• Automated text generation: word to word to sentences.
• Q&A systems: employee engagement, obtaining transactional info; moving from bounded questions to free-text responses in multiple languages; auto response is hard because the previous context is difficult to obtain, and an incorrect sentence can lead to negative publicity or legal complications.
• Text to speech and vice versa.
• Topic modeling.
• News: extract the important parts, or rewrite the whole article while capturing its context.
Tokenization
• Tokenization in NLP involves breaking text into smaller units, such as words or characters, for analysis.
• It serves as the foundation for tasks like part-of-speech tagging and sentiment analysis.
• This process entails removing punctuation, splitting words, and addressing special cases to create tokens.
• Preprocessing: stop word removal, stemming, lemmatization
• NLTK and spaCy packages (see the preprocessing sketch below)
Stop words: prepositions, joining words, conjunctions
Lemmatization: inflectional form to base form
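A minimal preprocessing sketch with NLTK, assuming the required corpora (punkt, stopwords, wordnet) have already been downloaded via nltk.download(); the sample sentence is illustrative:

```python
# Tokenization, stop-word removal, stemming and lemmatization with NLTK.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The children were running faster than the dogs."

tokens = word_tokenize(text.lower())                  # tokenization
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens
            if t.isalpha() and t not in stop_words]   # drop punctuation and stop words

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])            # stemming: 'running' -> 'run', 'dogs' -> 'dog'
print([lemmatizer.lemmatize(t) for t in filtered])    # lemmatization: 'children' -> 'child' (noun lemmas by default)
```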
Numericalization
• Numericalization in NLP involves transforming text data into numerical formats that machine learning algorithms can interpret and process. This conversion enables NLP models to handle and analyze text through mathematical operations.
Bag-of-Words model / One-Hot Encoding: the weight is 1 irrespective of the frequency of the word.
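A toy sketch of this weighting scheme, using a made-up six-word vocabulary; each sentence becomes a binary vector that marks word presence only, not frequency:

```python
# One-hot / bag-of-words sketch: each word gets an index, and a sentence is a
# vector whose entries are 1 if the word occurs, regardless of how often.
vocab = ["i", "love", "nlp", "and", "deep", "learning"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot_bow(sentence):
    vec = [0] * len(vocab)
    for w in sentence.lower().split():
        if w in word_to_index:
            vec[word_to_index[w]] = 1   # presence only, not frequency
    return vec

print(one_hot_bow("I love love NLP"))   # [1, 1, 1, 0, 0, 0]
```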
• In the figure we see that the word embeddings are represented by the weights connecting the hidden and output layers.
• If we have 500 neurons in the hidden layer and 1000 neurons in the output layer (i.e. a vocabulary of 1000), we have to learn around 0.5 million weights, which might not be too huge, but in practical scenarios we generally deal with bigger vocabularies.
• If we consider even 10,000 words in our vocabulary, then we have to learn a whopping 5 million weights.
• Besides that, we know that for embeddings to capture several contexts we would need a pretty huge corpus.
• So, training this many weights on a huge corpus and applying a softmax over 10,000 outputs is computationally very expensive and sometimes infeasible.
• This issue can be addressed using the negative sampling technique.
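A back-of-the-envelope check of the weight counts quoted above (the hidden size and vocabulary sizes are the ones from the slide):

```python
# Parameter count of the hidden->output projection in a word2vec-style model.
hidden = 500
for vocab_size in (1_000, 10_000):
    weights = hidden * vocab_size
    print(f"vocab={vocab_size:>6}: {weights:,} weights before the softmax")
# vocab=  1000: 500,000 weights
# vocab= 10000: 5,000,000 weights  -> motivates negative sampling
```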
Properties and Visualization of Word Embeddings
Word embeddings in NLP exhibit several important properties: capturing semantic relationships, enabling compositionality, managing subword information, maintaining compactness, and adapting to context.
These properties allow embeddings to effectively represent word meanings, construct phrase and sentence representations, handle out-of-vocabulary words, and reduce dimensionality.
By incorporating these aspects, word embeddings enhance various language processing tasks, leading to improved understanding and performance in NLP applications.
Word embeddings can be visualized in a reduced-dimensional space to provide insights into word relationships.
This technique enables the observation of clusters of semantically similar words and the exploration of their connections in a visually interpretable format.
By projecting high-dimensional embeddings into a lower-dimensional space, patterns and relationships between words become more apparent, facilitating a clearer understanding of their semantic similarities and differences.
Such visualizations help in analyzing and interpreting complex word associations and the overall structure of the word embeddings.
Word Embedding
GloVe: Global Vectors for Word Representation
The model is trained on multiple data sets, including Wikipedia, Twitter, and Common Crawl, over billions of tokens, and the embeddings are provided in dimension sizes ranging from 50 to 300.
Download the "glove.6B.zip" file from the GloVe website and consider just the 50-dimension representation.
Use a dimension-reduction technique such as t-SNE (t-Distributed Stochastic Neighbor Embedding) to reduce the dimensions to 2 and plot around 500 words in those 2 dimensions.
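A sketch of this workflow, assuming glove.6B.zip has been unzipped so that glove.6B.50d.txt sits in the working directory; scikit-learn's t-SNE and matplotlib are used for the 2-D plot, and taking the first 500 words is an illustrative choice:

```python
# Load 50-d GloVe vectors and project ~500 words to 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

words = list(embeddings)[:500]                      # a small, illustrative subset
vectors = np.stack([embeddings[w] for w in words])

coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)
plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y), fontsize=7)             # semantically similar words tend to cluster
plt.show()
```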
Embedding Matrix
In NLP, the embedding matrix is a crucial component that maps words to their respective vector representations.
This matrix enables models to access and leverage the learned word embeddings during both training and inference.
The size of the embedding matrix is determined by two factors: the vocabulary size, which is the number of unique words, and the dimensionality of the embeddings, which is the number of features in each vector.
By organizing word vectors in this matrix, models can efficiently use these embeddings to perform various language processing tasks and improve their overall performance.
A Keras Embedding layer is used as the first layer in NLP applications such as text classification (sentiment analysis), machine translation, NER, and text summarization. It maps indices to vectors.
The number of rows equals the vocabulary size and the number of columns equals the embedding dimension we define.
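A minimal sketch of an embedding matrix as the first Keras layer; the vocabulary size, embedding dimension, sequence length, and the sentiment-classification head are illustrative choices:

```python
# Embedding matrix as the first layer: rows = vocabulary size, columns = embedding dimension.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = 10_000      # rows of the embedding matrix (unique words)
embedding_dim = 50       # columns (features per word)
max_len = 100            # padded sequence length

model = Sequential([
    Input(shape=(max_len,)),                                     # sequences of word indices
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),   # maps indices to dense vectors
    LSTM(64),
    Dense(1, activation="sigmoid"),                              # e.g. sentiment classification
])
model.summary()   # the Embedding layer alone holds vocab_size * embedding_dim = 500,000 weights
```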
Sequential/Temporal/Series Data: RNN & LSTM
Sequential data consists of information organized in a specific order, where the sequence is
meaningful. This type of data includes time series, text, audio, DNA, and music. Analyzing
sequential data often requires techniques such as time series analysis and sequence modeling,
using machine learning models like Recurrent Neural Networks (RNNs) and Long Short-Term
Memory networks (LSTMs).
Unstructured sequential data: speech, text, videos, music, etc., i.e. sequences of symbols, images, notes, letters, words.
E.g. the daily average temperature of a city, the monthly revenue of a company.
In an Internet of Things kind of environment, we would have univariate and multivariate time series data for multiple entities such as sensors.
• Speech/voice recognition: input is audio, output is a name or person identifier.
• Sentiment analysis: input is a sequence of characters, output is a category.
• Music creation: input is a single value, output is a sequence of notes.
• Image captioning: input is an image, output is a sequence of words.
• Language translation: input and output are sequences of characters/words of different sizes.
• Video files: sequences of images; video activity recognition and object tracking, where both input and output are sequences of frames.
RNN and its variants (LSTM, GRU, Bi-RNN, S-RNN)
• Multilayer Perceptrons (MLPs) are designed to process fixed-size inputs, treating each input as an independent data point without considering any sequential or time-based relationships. Due to this limitation, MLPs cannot capture patterns that depend on the order of the data, making them unsuitable for time series analysis. In contrast, Recurrent Neural Networks (RNNs) are specifically designed to handle sequential information through their recurrent connections, making them a more suitable choice for tasks involving time series data.
Input data points are independent of each other, but there is a time relationship between them.
In an MLP, the input and output sizes are fixed.
A Recurrent Neural Network (RNN) is a type of neural network designed for processing sequential data. It features loops that allow information to be retained across time steps, making it effective at capturing temporal patterns. This capability makes RNNs particularly useful for applications such as time series forecasting, speech recognition, and natural language processing. More advanced variants, like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, have been developed to overcome the limitations of traditional RNNs, such as difficulty in learning long-term dependencies.
Types of RNN Based on Cardinality
1. One-to-One (1:1): This is a standard feedforward neural network used for non-sequential data.
2. Many-to-One (N:1): This type processes multiple inputs to produce a single output, such as in sentiment analysis.
3. One-to-Many (1:N): This setup uses a single input to generate multiple outputs, such as in image captioning.
4. Many-to-Many (N:N): This configuration handles multiple inputs and produces multiple outputs, which is common in machine translation.
5. Many-to-Many (N:M): This flexible structure allows for varying sequence lengths in both inputs and outputs, useful in applications like video analysis.
• One-to-one: character/word prediction, sales forecasting.
• Many-to-one: sentiment analysis, predicting machine failure.
• One-to-many: music generation, image captioning.
Sequence-to-vector-to-sequence network.
• Many-to-many cardinality: language translation, variable-length input/output sequences.
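A brief Keras sketch contrasting two of these cardinalities; layer sizes and feature dimensions are illustrative, and return_sequences is what switches between many-to-one and many-to-many (N:N) behaviour:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, SimpleRNN, Dense, TimeDistributed

# Many-to-one: e.g. sentiment analysis - a whole sequence in, one label out.
many_to_one = Sequential([
    Input(shape=(None, 8)),          # (time steps, features per step)
    SimpleRNN(32),                   # returns only the final hidden state
    Dense(1, activation="sigmoid"),
])

# Many-to-many (N:N): one output per time step, e.g. tagging each token.
many_to_many = Sequential([
    Input(shape=(None, 8)),
    SimpleRNN(32, return_sequences=True),              # returns the full output sequence
    TimeDistributed(Dense(5, activation="softmax")),   # one prediction per step
])
```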
To train an RNN using Backpropagation Through Time (BPTT):
1. Unroll the RNN: treat each time step as a separate layer.
2. Forward Pass: generate predictions.
3. Calculate Loss: compare predictions with actual values.
4. Backpropagate Error: propagate the error through time.
5. Update Parameters: adjust using an optimization algorithm.
6. Repeat: continue for multiple epochs.
To prevent vanishing gradients in long sequences, use techniques like
gradient clipping or advanced RNN variants like LSTM and GRU.
Training RNNs: BPTT
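A minimal sketch of the gradient clipping mentioned above, using the standard clipnorm argument of a Keras optimizer:

```python
# Clip gradients in the optimizer to counter exploding/unstable gradients during BPTT.
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=1e-3, clipnorm=1.0)  # rescale gradients whose norm exceeds 1.0
# model.compile(optimizer=optimizer, loss="binary_crossentropy")  # then train as usual
```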
Truncated Backpropagation Through Time (Truncated BPTT) is a modified version of the standard BPTT algorithm for training RNNs with long sequences. It involves limiting the number of time steps over which error gradients are backpropagated, instead of propagating them through the entire sequence.
Running a per-parameter update with full BPTT is computationally expensive, so running it for multiple epochs is not feasible.
Breaking the sequence into subsequences is computationally feasible, but the temporal dependency is reduced to the subsequence level.
Training RNNs: BPTT
Here’s a brief overview of different types of Recurrent Neural Networks (RNNs):
1. Long Short-Term Memory (LSTM): LSTMs are a type of RNN designed to remember information for long periods.
They use special units called memory cells that can maintain information in memory for long durations. LSTMs are effective
for tasks like time series prediction and natural language processing.
2. Gated Recurrent Unit (GRU): GRUs are similar to LSTMs but with a simpler structure. They use gating mechanisms to
control the flow of information, making them faster to train and sometimes more efficient for certain tasks. GRUs are often
used in similar applications as LSTMs, such as speech recognition and machine translation.
3. Character Prediction: This refers to RNNs used for predicting the next character in a sequence. These models are trained
on text data and can generate text one character at a time, making them useful for tasks like text generation and
autocompletion.
4. Stacked RNNs: Stacked RNNs consist of multiple layers of RNNs stacked on top of each other. This architecture allows
the model to learn more complex patterns by capturing different levels of abstraction. They are commonly used in tasks that
require deep understanding, such as language modeling and sequence-to-sequence tasks.
5. Bidirectional RNNs: These RNNs process sequences in both forward and backward directions. By having access to both
past and future contexts, bidirectional RNNs can better understand the entire sequence. They are particularly useful in tasks
like speech recognition and text classification, where context is important.
These various types of RNNs can be combined or adapted for specific use cases, depending on the requirements of the task
at hand.
Types of RNN
LSTM
• Sepp Hochreiter and Jürgen Schmidhuber (1997): solves complex problems with long time dependencies and runs faster and more efficiently
• To address the memory issue: long-term state (c) and short-term state (h)
• Forgets not-so-important old memories
• Updates/refreshes old memories and forms new important ones
Each gate has a neural network:
• The main NN produces the output based on the input and the previous state of the cell, and updates the long-term memory.
• The forget gate determines how much of the long-term memory needs to be forgotten or retained.
• The input gate figures out the important part of the input and adds it to the long-term state.
• The output gate decides how much of the updated long-term memory should be considered as part of the cell's output.
GRU (2014): three NNs, a single state, and one NN controlling both the input and forget gates.
Stacked and Bidirectional RNN
Time series forecasting
Language modeling
Named entity recognition
Machine Translation
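A sketch of a stacked, bidirectional LSTM for a tagging task such as NER; the vocabulary size, hidden dimensions, and number of tags are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Bidirectional, TimeDistributed, Dense

model = Sequential([
    Input(shape=(None,)),                               # variable-length sequences of word indices
    Embedding(input_dim=10_000, output_dim=100),
    Bidirectional(LSTM(64, return_sequences=True)),     # layer 1: forward + backward context
    Bidirectional(LSTM(64, return_sequences=True)),     # layer 2: stacked on top
    TimeDistributed(Dense(10, activation="softmax")),   # one tag prediction per token
])
```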
Encoder-Decoder Sequence-to-Sequence Model
• The Encoder-Decoder architecture is an RNN framework designed for sequence-to-sequence tasks. In this setup, the Encoder processes an input sequence and produces a context vector, which encapsulates the information from the input. The Decoder then uses this context vector to generate an output sequence. This architecture is commonly applied in areas such as machine translation, text summarization, and speech recognition.
Teacher Forcing: during training, the correct word acts as a teacher; it is fed as the next decoder input, so the model is corrected immediately when its prediction is wrong.
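A minimal Keras sketch of this Encoder-Decoder setup with teacher forcing (the decoder is fed the shifted ground-truth target sequence during training); vocabulary sizes and dimensions are illustrative:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

src_vocab, tgt_vocab, dim = 8_000, 8_000, 256   # illustrative sizes

# Encoder: consumes the source sequence and keeps only its final states (the context).
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, dim)(enc_in)
_, state_h, state_c = LSTM(dim, return_state=True)(enc_emb)

# Decoder: initialized with the encoder states; fed the shifted target sequence
# (teacher forcing) and trained to predict the next target token at each step.
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, dim)(dec_in)
dec_out, _, _ = LSTM(dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```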
Beam Search and BLEU Evaluation Metrics
Beam Search: Beam Search is a search algorithm used in sequence-to-sequence models, particularly in natural language processing tasks. Unlike greedy search, which selects the best option at each step, Beam Search keeps track of multiple hypotheses (beams) at each step, expanding the top N sequences with the highest probabilities. This method balances between searching broadly and efficiently, aiming to find the most likely sequence of tokens. It is widely used in tasks like machine translation and speech recognition to improve the quality of generated sequences.
BLEU (Bilingual Evaluation Understudy): BLEU is a popular evaluation metric for assessing the quality of text generated by machine translation systems. It compares the overlap of n-grams (contiguous sequences of words) in the machine-generated text with one or more reference translations. The score ranges from 0 to 1, with higher scores indicating closer matches to the reference translations. BLEU emphasizes precision by measuring how many words in the generated output match the reference, considering factors like brevity and the presence of multiple references. It is widely used due to its simplicity and effectiveness in evaluating machine translation quality.
BLEU is a numerical translation-closeness metric, computed against a corpus of good-quality human reference translations.
It is based on modified n-gram precision.
The BLEU score uses an average of logarithms with uniform weights, i.e. the geometric mean of the modified n-gram precisions (see the formula below).
It does not consider: semantics, sentence structure, morphology.
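The standard BLEU formula makes this concrete: the geometric mean of the modified n-gram precisions p_n with uniform weights w_n = 1/N (typically N = 4), multiplied by a brevity penalty BP that compares the candidate length c with the reference length r:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```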
Attention Mechanism
• The attention mechanism allows models to selectively focus on the most relevant information within large datasets, thereby enhancing efficiency and accuracy in data processing.
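A minimal NumPy sketch of scaled dot-product attention, one common way to realize this idea; the query/key/value shapes are illustrative:

```python
# Each query assigns weights to all values via a softmax over query-key similarities,
# so the model can focus on the most relevant parts of the input.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query with each key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V, weights          # context = weighted sum of values

Q = np.random.randn(2, 4)   # 2 queries (e.g. decoder steps), dimension 4
K = np.random.randn(5, 4)   # 5 keys    (e.g. encoder time steps)
V = np.random.randn(5, 4)   # 5 values
context, weights = attention(Q, K, V)
print(weights.round(2))     # each row sums to 1: how much each query attends to each step
```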