Introduction to Neural Information Retrieval and Large Language Models

About This Presentation

In this presentation, I provide a comprehensive introduction to information retrieval and large language models, beginning with the history of IR, continuing with embedding approaches, and finishing with the common retrieval architectures and a code session to get familiar with HuggingFace.


Slide Content

Introduction to Neural Information Retrieval. Sajad Ebrahimi, M.A.Sc. in Computer Engineering at UoG. [email protected]

Outline: IR through the Ages; Text Embedding; Word2Vec in detail; FastText; Seq2Seq Model / Encoder-Decoder architecture / Attention / Transformers; BERT Intuition and Applications; Using Language Models for Retrieval; Why are Language Models Useful?; Bi-Encoder vs. Cross-Encoder; HuggingFace; Let's Code Together (load a dataset, get the embeddings, calculate the similarity, compare models).

IR through the ages

Boolean Retrieval: there are two possible outcomes, TRUE (the document matches the query) or FALSE. The query is specified using Boolean logic operators: AND, OR, NOT. Ranking?
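As a toy illustration (not part of the original slides), Boolean retrieval can be sketched as set operations over an inverted index; the documents and query below are made up:

# Toy Boolean retrieval: build an inverted index, then evaluate
# the query "information AND retrieval AND NOT boolean".
docs = {
    1: "information retrieval with boolean logic",
    2: "neural information retrieval",
    3: "large language models",
}

# Inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
hits = (index.get("information", set())
        & index.get("retrieval", set())
        & (all_ids - index.get("boolean", set())))
print(hits)  # {2}: matches are unranked, just TRUE/FALSE per document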

Indexing

TF-IDF and BM25

Behrooz Mansoori, “Introduction to Information Retrieval”, 2022

What is Text Embedding? The fundamental idea behind text embedding is to capture the contextual and semantic information of words and phrases in a way that can be used for various NLP tasks, such as text classification, sentiment analysis, information retrieval, machine translation, and more. Key aspects: numerical representation, semantic information, feature engineering, dimensionality reduction, transfer learning.

Motivation: one-hot vectors map objects into a vector of fixed length, but we have not captured any semantic relationships. Is there a relationship between man and king? We would like to develop a model in which King - Man + Woman → Queen. Apple Inc. vs. Apple (fruit).

Most statistical NLP/IR work regards words as atomic symbols: book, information, university. In vector space, this is a sparse vector with a single 1 and a lot of zeros: book → [0 0 0 1 0 0 0 0 0 0], university → [0 0 0 0 0 0 0 0 1 0]. Problem: the dimensionality of the vector will be the size of the vocabulary, and there is no semantic relationship between word vectors' representations.

WORD2VEC

The intuition: the output probabilities relate to how likely it is to find each vocabulary word near our input word. E.g., for the input word "Soviet", the output probabilities are going to be much higher for words like "Union" and "Russia" than for unrelated words like "watermelon" and "kangaroo". We train the neural network to do this by feeding it word pairs found in our training documents.

Window size (context) is a hyperparameter
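For example, a small sketch (not the gensim implementation) of how (center, context) training pairs could be generated from a tokenized sentence for a given window size; the sentence and function name are illustrative:

# Generate (center, context) training pairs from a tokenized sentence.
# window_size is the hyperparameter discussed above.
def training_pairs(tokens, window_size=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the soviet union signed the treaty".split()
print(training_pairs(sentence, window_size=2))
# e.g. ('soviet', 'the'), ('soviet', 'union'), ('soviet', 'signed'), ...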

Training Details: we cannot feed a word to a neural network just as a text string; we need a way to represent the words to the network. Build a vocabulary of words from our training documents; words are then represented by one-hot encoding: a vector with the size of the vocabulary, with a 1 at the word's position and 0 elsewhere (e.g., position 1000 for "cat"). IT NEEDS TOO MANY RESOURCES.

Negative Sampling: instead of trying to predict the probability of being a nearby word for all the words in the vocabulary, we predict the probability that a pair of training-sample words are neighbors or not, maximizing the similarity of words in the same context and minimizing it when they occur in different contexts.

Negative Sampling: now we need to solve a simple logistic-regression problem.
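A rough numpy sketch of that logistic-regression view of negative sampling; the vocabulary size, embedding size, and sampled indices below are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                  # vocabulary size, embedding size (illustrative)
W_in = rng.normal(0, 0.1, (V, d))   # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))  # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, context = 42, 1337               # a positive (neighbor) pair
negatives = rng.integers(0, V, size=5)   # 5 randomly sampled non-neighbors

v_c = W_in[center]
pos_loss = -np.log(sigmoid(W_out[context] @ v_c))        # pull the true neighbor closer
neg_loss = -np.log(sigmoid(-W_out[negatives] @ v_c)).sum()  # push the negatives away
loss = pos_loss + neg_loss  # the quantity gradient descent would minimize
print(loss)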

Skip-Gram Model: the skip-gram neural network model is actually surprisingly simple in its most basic form. Two assumptions: a word can be used to generate the words surrounding it, and given the center word, the context words are generated independently.

Skip Gram Architecture

Continuous Bag of Words (CBOW) model: the center word is generated based on the context words.

CBOW Architecture

The Hidden Layer: the hidden layer is represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron); 300 features is what Google used in their published model trained on the Google News dataset. The end goal of all of this is really just to learn this hidden-layer weight matrix; the hidden layer of this model is really just operating as a lookup table.
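To see why the hidden layer behaves as a lookup table: multiplying a one-hot input vector by the weight matrix simply selects one row. A minimal numpy check, using a tiny vocabulary instead of the 10,000 × 300 matrix above:

import numpy as np

V, d = 10, 4                      # tiny vocabulary and embedding size for the demo
W = np.random.rand(V, d)          # the hidden-layer weight matrix being learned

word_index = 3
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

# Matrix multiplication with a one-hot vector == row lookup.
assert np.allclose(one_hot @ W, W[word_index])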

The Output Layer: each output neuron (one per word in our vocabulary) produces an output between 0 and 1, and the sum of all these output values adds up to 1.

Word2Vec cons: lack of context sensitivity; out-of-vocabulary words / difficulty with rare words; semantic ambiguity (doesn't capture phrase-level semantics); domain specificity; not capturing word order; morphology (eat vs. eaten).

FastText

FastText: it was introduced to address Word2Vec's shortcomings, such as OOV words and morphology. In essence, it is based on the idea of character n-grams. (Bojanowski et al., FastText n-gram embedding model, 2017)

FastText Training: for a word, we generate the character n-grams of length 3 to 6 present in it. Two-step vector representation updating: first, the embedding for the center word is calculated by taking the sum of the vectors for its character n-grams and the whole word itself; for the actual context words, we directly take their word vectors from the embedding table without adding the character n-grams.
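A small sketch of the subword step, assuming the usual '<' and '>' word-boundary markers used by FastText:

# Character n-grams of length 3 to 6 for one word, plus the whole word itself.
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the special sequence for the whole word
    return grams

print(sorted(char_ngrams("eating")))
# '<ea', 'eat', 'ati', ..., 'ing>' -- "eat" and "eaten" now share subword units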

FastText cons: high memory requirements; lack of context sensitivity; not capturing word order; domain specificity.

Seq2Seq Models Encoder-Decoder Attention Transformers

Seq2Seq Model: seq2seq models have evolved, and attention mechanisms have been introduced to address challenges related to capturing long-range dependencies. (Sutskever et al., Sequence to Sequence Learning with Neural Networks, 2014)

Seq2Seq Model Component

Attention Models: keep the intermediate outputs from the encoder LSTM at each step of the input sequence, rather than only its final state. (Luong et al., Effective Approaches to Attention-based Neural Machine Translation, 2015)

Attention is all you need! Vaswani et al., Attention is all you need, 2017

Inside an Encoder and a Decoder

Self-Attention
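The slide itself is a diagram; as a rough companion, here is scaled dot-product self-attention for a single head in numpy, with random matrices standing in for the learned projections:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8           # toy sizes for the demo

X = rng.normal(size=(seq_len, d_model))    # token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)            # how much each token attends to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                       # each row: a context-aware mix of the value vectors
print(output.shape)                        # (5, 8)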

Want to visualize the attention weights? https://github.com/jessevig/bertviz

BERT

Introduction: BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. A Transformer consists of an encoder that reads the text input and a decoder that produces a prediction for the task; the Transformer encoder reads the entire sequence of words at once. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. (Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019)

Pre-training and Fine-tuning

Pre-training in detail: BERT has been trained with two objectives. Masked LM: mask some percentage of the input tokens at random (15% of all WordPiece tokens in each sequence) and then predict those masked tokens. Next Sentence Prediction: understanding the relationship between two sentences (50% positive pairs). Training used BooksCorpus (800M words) and Wikipedia (2,500M words). IT IS EXPENSIVE.
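The Masked LM objective is easy to poke at with the Hugging Face pipeline API; a minimal example (the sentence is made up, and bert-base-uncased is downloaded on first use):

from transformers import pipeline

# Ask BERT to fill in a masked WordPiece token -- exactly the pre-training task.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Information retrieval finds relevant [MASK] for a query."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")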

Fine-tuning in detail

BERT Size

What are the differences? BERT base vs. BERT large: model size 110M vs. 340M parameters; hidden layers 12 vs. 24 transformer layers; attention heads 12 vs. 16; hidden unit size 768 vs. 1024; total hidden units 12 × 768 vs. 24 × 1024. Obviously, they also differ in training time and required computational resources.

BERT Special Symbols: [CLS], [SEP], [MASK], [PAD].
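Their placement is easy to inspect with the tokenizer; a small check using bert-base-uncased and an arbitrary sentence pair:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair with padding so all four special symbols show up.
enc = tokenizer("Neural IR is fun.", "It uses [MASK] models.",
                padding="max_length", max_length=16)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'neural', ..., '[SEP]', 'it', 'uses', '[MASK]', 'models', '.', '[SEP]', '[PAD]', ...]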

Under the hood

BERT-based Models: RoBERTa, a robustly optimized BERT from the Facebook AI research team. They used 160GB of text instead of the 16GB dataset originally used to train BERT; increased the number of iterations from 100K to 300K and then further to 500K; dynamically changed the masking pattern applied to the training data; and removed the next sentence prediction objective from the training procedure. (Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019)

BERT-based Models: DistilBERT, a distilled version of BERT. Compared with BERT base: model size 66M vs. 110M parameters; hidden layers 6 vs. 12 transformer layers; attention heads 12 (unchanged); hidden unit size 768 (unchanged); total hidden units 6 × 768 vs. 12 × 768. (Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2020)

BERT-based Models: ColBERT, Contextualized Late Interaction over BERT. Deep contextualized LMs come at a steep increase in computational cost; ColBERT introduces a novel late-interaction paradigm for estimating relevance between a query q and a document d. Query and document text are separately encoded into contextual embeddings by BERT-based query and document encoders. (Khattab and Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, 2020)
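A back-of-the-envelope numpy sketch of the late-interaction ("MaxSim") scoring step, with random vectors standing in for the contextual token embeddings:

import numpy as np

rng = np.random.default_rng(0)
q_emb = rng.normal(size=(8, 128))    # 8 query tokens, 128-dim contextual embeddings
d_emb = rng.normal(size=(120, 128))  # 120 document tokens

# Normalize so dot products are cosine similarities.
q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
d_emb /= np.linalg.norm(d_emb, axis=1, keepdims=True)

# Late interaction: each query token keeps only its best-matching document token,
# and the relevance score is the sum of those maxima.
sim = q_emb @ d_emb.T                # (8, 120) token-level similarities
score = sim.max(axis=1).sum()
print(score)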

BERT-based Models: Sentence-BERT, a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. (Reimers and Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019)
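A minimal sentence-transformers example; "all-MiniLM-L6-v2" is just one commonly used public checkpoint, and the package is installed with pip install sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small SBERT-style model
sentences = ["A man is playing a guitar.",
             "Someone plays an instrument.",
             "The stock market fell today."]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1:]))
# The paraphrase scores far higher than the unrelated sentence.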

BERT-for-X classes provided by HuggingFace: BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification, BertForQuestionAnswering.

Using LMs for retrieval

Why are Language Models useful? They allow us to answer questions like: given that we see "John" and "feels", how likely are we to see "happy" as opposed to "habit" as the next word? Given that we observe "baseball" three times and "game" once in a news article, how likely is it to be about sports? Given that a user is interested in sports news, how likely is the user to use "baseball" in a query?
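BERT is a masked LM rather than a left-to-right model, but the first question can be approximated with a fill-mask query; a rough sketch (the sentence is adapted from the slide's example):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("John feels [MASK] today.", return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = logits.softmax(dim=-1)

for word in ["happy", "habit"]:
    token_id = tokenizer.convert_tokens_to_ids(word)
    print(word, float(probs[token_id]))
# "happy" should come out far more probable than "habit".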

Language Models applications

Bi-Encoder vs. Cross-Encoder

Retrieve vs Re-rank
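A hedged sketch of the retrieve-then-re-rank pattern with sentence-transformers: a bi-encoder embeds the query and documents independently for fast retrieval, then a cross-encoder re-scores the query paired with each candidate. The model names are just common public checkpoints, and the toy documents are made up:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["BM25 is a classic lexical ranking function.",
        "BERT produces contextual embeddings.",
        "Bananas are rich in potassium."]
query = "Which model gives contextual word representations?"

# Stage 1: bi-encoder retrieval (document embeddings can be computed offline).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(q_emb, doc_emb, top_k=2)[0]

# Stage 2: cross-encoder re-ranking of the retrieved candidates.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for (q, d), s in sorted(zip(pairs, scores), key=lambda x: -x[1]):
    print(f"{s:.2f}  {d}")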

Hugging Face 🤗: the platform where the machine learning community collaborates on models, datasets, and applications (Models, Datasets, Spaces, APIs, etc.). An NLP course provided by Hugging Face (highly recommended): https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt

Transformers: a Python framework for state-of-the-art sentence, text, and image embeddings. This framework can compute sentence/text embeddings for different languages. Install it with: pip install transformers

Let’s Code Together

What are we going to do? 💻 Get the representation of some texts using Word2Vec and BERT; get familiar with the sentence-transformers library; use an LLM for retrieving information; compare the performance of the models. Find the code here 👇: https://colab.research.google.com/drive/1T24mWOVisVv0N45-GlAGm8lCreZmzV0v?usp=sharing

Any Question?

Thank you for your attention! Sajad Ebrahimi, M.A.Sc. in Computer Engineering at the University of Guelph. sadjadeb.github.io | Follow me on X: @sadjadeb | [email protected]

References
Bhaskar Mitra and Nick Craswell, "An Introduction to Neural Information Retrieval", 2018
Behrooz Mansoori, "Introduction to Information Retrieval", 2022
Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014
https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263
https://data-hub.ir/ (word2vec)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019
Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", 2019
Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
https://itnext.io/deep-learning-in-information-retrieval-part-iii-ranking-da511f2dc325