Introduction to Neural Information Retrieval and Large Language Models

About This Presentation

In this presentation, I provide a comprehensive introduction to information retrieval and large language models, beginning with the history of IR, continuing with embedding approaches, and finishing with the common retrieval architectures and a code session to get familiar with HuggingFace.


Slide Content

Introduction to Neural Information Retrieval. Sajad Ebrahimi, M.A.Sc. in Computer Engineering at UoG. [email protected]

Outline: IR through the Ages; Text Embedding; Word2Vec in detail; FastText; Seq2Seq Model / Encoder-Decoder architecture / Attention / Transformers; BERT Intuition and Applications; Using Language Models for Retrieval; Why are Language Models Useful?; Bi-Encoder vs. Cross-Encoder; HuggingFace; Let's Code Together (load a dataset, get the embeddings, calculate the similarity, compare models).

IR through the ages

Boolean Retrieval: there are two possible outcomes, TRUE (the document matches the query) or FALSE. The query is specified using Boolean logic operators: AND, OR, NOT. Ranking?
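As a toy illustration (not part of the original slides), Boolean retrieval can be sketched as set operations over an inverted index; the documents and query below are made up:

# Toy Boolean retrieval: build an inverted index, then evaluate
# the query "information AND retrieval AND NOT boolean".
docs = {
    1: "information retrieval with boolean logic",
    2: "neural information retrieval",
    3: "large language models",
}

# Inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)
hits = (index.get("information", set())
        & index.get("retrieval", set())
        & (all_ids - index.get("boolean", set())))
print(hits)  # {2}: matches are unranked, just TRUE/FALSE per document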

Indexing

TF-IDF and BM25

Behrooz Mansoori, “Introduction to Information Retrieval”, 2022

What is Text Embedding? The fundamental idea behind text embedding is to capture the contextual and semantic information of words and phrases in a way that can be used for various NLP tasks, such as text classification, sentiment analysis, information retrieval, machine translation, and more. Key aspects: numerical representation, semantic information, feature engineering, dimensionality reduction, transfer learning.

Motivation: one-hot vectors map objects into a vector of fixed length, but we have not captured any semantic relationships. Is there a relationship between man and king? We would like to develop a model in which King - Man + Woman → Queen. Apple Inc. vs. Apple (fruit).

Most statistical NLP/IR work regards words as atomic symbols: book, information, university. In vector space, this is a sparse vector with a single 1 and a lot of zeros: book → [0 0 0 1 0 0 0 0 0 0], university → [0 0 0 0 0 0 0 0 1 0]. Problem: the dimensionality of the vector will be the size of the vocabulary, and there is no semantic relationship between word vectors' representations.

WORD2VEC

The intuition: the output probabilities relate to how likely it is to find each vocabulary word near our input word. E.g., for the input word "Soviet", the output probabilities are going to be much higher for words like "Union" and "Russia" than for unrelated words like "watermelon" and "kangaroo". We train the neural network to do this by feeding it word pairs found in our training documents.

Window size (context) is a hyperparameter
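For example, a small sketch (not the gensim implementation) of how (center, context) training pairs could be generated from a tokenized sentence for a given window size; the sentence and function name are illustrative:

# Generate (center, context) training pairs from a tokenized sentence.
# window_size is the hyperparameter discussed above.
def training_pairs(tokens, window_size=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the soviet union signed the treaty".split()
print(training_pairs(sentence, window_size=2))
# e.g. ('soviet', 'the'), ('soviet', 'union'), ('soviet', 'signed'), ...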

Training Details: we cannot feed a word to a neural network just as a text string; we need a way to represent the words to the network. Build a vocabulary of words from our training documents; words are then represented by one-hot encoding: a vector with the size of the vocabulary, with a 1 at the word's position and 0 elsewhere (e.g., position 1000 for "cat"). IT NEEDS TOO MANY RESOURCES.

Negative Sampling: instead of trying to predict the probability of being a nearby word for all the words in the vocabulary, we predict the probability that a pair of training-sample words are neighbors or not, maximizing the similarity of words in the same context and minimizing it when they occur in different contexts.

Negative Sampling: now we need to solve a simple logistic-regression problem.
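A rough numpy sketch of that logistic-regression view of negative sampling; the vocabulary size, embedding size, and sampled indices below are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                  # vocabulary size, embedding size (illustrative)
W_in = rng.normal(0, 0.1, (V, d))   # center-word embeddings
W_out = rng.normal(0, 0.1, (V, d))  # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center, context = 42, 1337               # a positive (neighbor) pair
negatives = rng.integers(0, V, size=5)   # 5 randomly sampled non-neighbors

v_c = W_in[center]
pos_loss = -np.log(sigmoid(W_out[context] @ v_c))        # pull the true neighbor closer
neg_loss = -np.log(sigmoid(-W_out[negatives] @ v_c)).sum()  # push the negatives away
loss = pos_loss + neg_loss  # the quantity gradient descent would minimize
print(loss)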

Skip-Gram Model: the skip-gram neural network model is actually surprisingly simple in its most basic form. Two assumptions: a word can be used to generate the words surrounding it, and given the center word, the context words are generated independently.

Skip Gram Architecture

Continuous Bag of Words (CBOW) model: the center word is generated based on the context words.

CBOW Architecture

The Hidden Layer: the hidden layer is represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron); 300 features is what Google used in their published model trained on the Google News dataset. The end goal of all of this is really just to learn this hidden-layer weight matrix; the hidden layer of this model is really just operating as a lookup table.
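To see why the hidden layer behaves as a lookup table: multiplying a one-hot input vector by the weight matrix simply selects one row. A minimal numpy check, using a tiny vocabulary instead of the 10,000 × 300 matrix above:

import numpy as np

V, d = 10, 4                      # tiny vocabulary and embedding size for the demo
W = np.random.rand(V, d)          # the hidden-layer weight matrix being learned

word_index = 3
one_hot = np.zeros(V)
one_hot[word_index] = 1.0

# Matrix multiplication with a one-hot vector == row lookup.
assert np.allclose(one_hot @ W, W[word_index])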

The Output Layer: each output neuron (one per word in our vocabulary) produces an output between 0 and 1, and the sum of all these output values adds up to 1.

Word2Vec cons: lack of context sensitivity; out-of-vocabulary words / difficulty with rare words; semantic ambiguity (doesn't capture phrase-level semantics); domain specificity; not capturing word order; morphology (eat vs. eaten).

FastText

FastText: it was introduced to address Word2Vec's shortcomings, such as OOV words and morphology. In essence, it is based on the idea of character n-grams. (Bojanowski et al., FastText n-gram embedding model, 2017)

FastText Training: for a word, we generate the character n-grams of length 3 to 6 present in it. Two-step vector representation updating: first, the embedding for the center word is calculated by taking the sum of the vectors for its character n-grams and the whole word itself; for the actual context words, we directly take their word vectors from the embedding table without adding the character n-grams.
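A small sketch of the subword step, assuming the usual '<' and '>' word-boundary markers used by FastText:

# Character n-grams of length 3 to 6 for one word, plus the whole word itself.
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)  # the special sequence for the whole word
    return grams

print(sorted(char_ngrams("eating")))
# '<ea', 'eat', 'ati', ..., 'ing>' -- "eat" and "eaten" now share subword units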

FastText cons: high memory requirements; lack of context sensitivity; not capturing word order; domain specificity.

Seq2Seq Models Encoder-Decoder Attention Transformers

Seq2Seq Model: seq2seq models have evolved, and attention mechanisms have been introduced to address challenges related to capturing long-range dependencies. (Sutskever et al., Sequence to Sequence Learning with Neural Networks, 2014)

Seq2Seq Model Component

Attention Models: keep the intermediate outputs from the encoder LSTM at each step of the input sequence, rather than only its final state. (Luong et al., Effective Approaches to Attention-based Neural Machine Translation, 2015)

Attention is all you need! Vaswani et al., Attention is all you need, 2017

Inside an Encoder and a Decoder

Self-Attention
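The slide itself is a diagram; as a rough companion, here is scaled dot-product self-attention for a single head in numpy, with random matrices standing in for the learned projections:

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8           # toy sizes for the demo

X = rng.normal(size=(seq_len, d_model))    # token representations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)            # how much each token attends to every other token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                       # each row: a context-aware mix of the value vectors
print(output.shape)                        # (5, 8)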

Want to visualize the attention weights? https://github.com/jessevig/bertviz

BERT

Introduction: BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. A Transformer consists of an encoder that reads the text input and a decoder that produces a prediction for the task; the Transformer encoder reads the entire sequence of words at once. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. (Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019)

Pre-training and Fine-tuning

Pre-training in detail: BERT has been trained with two objectives. Masked LM: mask some percentage of the input tokens at random (15% of all WordPiece tokens in each sequence) and then predict those masked tokens. Next Sentence Prediction: understanding the relationship between two sentences (50% positive pairs). Training used BooksCorpus (800M words) and Wikipedia (2,500M words). IT IS EXPENSIVE.
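The Masked LM objective is easy to poke at with the Hugging Face pipeline API; a minimal example (the sentence is made up, and bert-base-uncased is downloaded on first use):

from transformers import pipeline

# Ask BERT to fill in a masked WordPiece token -- exactly the pre-training task.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("Information retrieval finds relevant [MASK] for a query."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")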

Fine-tuning in detail

BERT Size

What are the differences? BERT base vs. BERT large: model size 110M vs. 340M parameters; hidden layers 12 vs. 24 transformer layers; attention heads 12 vs. 16; hidden unit size 768 vs. 1024; total hidden units 12 × 768 vs. 24 × 1024. Obviously, they also differ in training time and required computational resources.

BERT Special Symbols: [CLS], [SEP], [MASK], [PAD].
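Their placement is easy to inspect with the tokenizer; a small check using bert-base-uncased and an arbitrary sentence pair:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair with padding so all four special symbols show up.
enc = tokenizer("Neural IR is fun.", "It uses [MASK] models.",
                padding="max_length", max_length=16)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'neural', ..., '[SEP]', 'it', 'uses', '[MASK]', 'models', '.', '[SEP]', '[PAD]', ...]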

Under the hood

BERT-based Models: RoBERTa, a robustly optimized BERT from the Facebook AI research team. They used 160GB of text instead of the 16GB dataset originally used to train BERT; increased the number of iterations from 100K to 300K and then further to 500K; dynamically changed the masking pattern applied to the training data; and removed the next sentence prediction objective from the training procedure. (Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019)

BERT-based Models: DistilBERT, a distilled version of BERT. Compared with BERT base: model size 66M vs. 110M parameters; hidden layers 6 vs. 12 transformer layers; attention heads 12 (unchanged); hidden unit size 768 (unchanged); total hidden units 6 × 768 vs. 12 × 768. (Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2020)

BERT-based Models: ColBERT, Contextualized Late Interaction over BERT. Deep contextualized LMs come at a steep increase in computational cost; ColBERT introduces a novel late-interaction paradigm for estimating relevance between a query q and a document d. Query and document text are separately encoded into contextual embeddings by BERT-based query and document encoders. (Khattab and Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, 2020)
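A back-of-the-envelope numpy sketch of the late-interaction ("MaxSim") scoring step, with random vectors standing in for the contextual token embeddings:

import numpy as np

rng = np.random.default_rng(0)
q_emb = rng.normal(size=(8, 128))    # 8 query tokens, 128-dim contextual embeddings
d_emb = rng.normal(size=(120, 128))  # 120 document tokens

# Normalize so dot products are cosine similarities.
q_emb /= np.linalg.norm(q_emb, axis=1, keepdims=True)
d_emb /= np.linalg.norm(d_emb, axis=1, keepdims=True)

# Late interaction: each query token keeps only its best-matching document token,
# and the relevance score is the sum of those maxima.
sim = q_emb @ d_emb.T                # (8, 120) token-level similarities
score = sim.max(axis=1).sum()
print(score)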

BERT-based Models: Sentence-BERT, a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. (Reimers and Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019)
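A minimal sentence-transformers example; "all-MiniLM-L6-v2" is just one commonly used public checkpoint, and the package is installed with pip install sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small SBERT-style model
sentences = ["A man is playing a guitar.",
             "Someone plays an instrument.",
             "The stock market fell today."]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1:]))
# The paraphrase scores far higher than the unrelated sentence.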

BERT-for-X classes provided by HuggingFace: BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification, BertForQuestionAnswering.

Using LMs for retrieval

Why are Language Models useful? They allow us to answer questions like: given that we see "John" and "feels", how likely are we to see "happy" as opposed to "habit" as the next word? Given that we observe "baseball" three times and "game" once in a news article, how likely is it to be about sports? Given that a user is interested in sports news, how likely is the user to use "baseball" in a query?
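BERT is a masked LM rather than a left-to-right model, but the first question can be approximated with a fill-mask query; a rough sketch (the sentence is adapted from the slide's example):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("John feels [MASK] today.", return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]
probs = logits.softmax(dim=-1)

for word in ["happy", "habit"]:
    token_id = tokenizer.convert_tokens_to_ids(word)
    print(word, float(probs[token_id]))
# "happy" should come out far more probable than "habit".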

Language Models applications

Bi-Encoder vs. Cross-Encoder

Retrieve vs Re-rank
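A hedged sketch of the retrieve-then-re-rank pattern with sentence-transformers: a bi-encoder embeds the query and documents independently for fast retrieval, then a cross-encoder re-scores the query paired with each candidate. The model names are just common public checkpoints, and the toy documents are made up:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = ["BM25 is a classic lexical ranking function.",
        "BERT produces contextual embeddings.",
        "Bananas are rich in potassium."]
query = "Which model gives contextual word representations?"

# Stage 1: bi-encoder retrieval (document embeddings can be computed offline).
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True)
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(q_emb, doc_emb, top_k=2)[0]

# Stage 2: cross-encoder re-ranking of the retrieved candidates.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)
for (q, d), s in sorted(zip(pairs, scores), key=lambda x: -x[1]):
    print(f"{s:.2f}  {d}")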

Hugging Face 🤗: the platform where the machine learning community collaborates on models, datasets, and applications (Models, Datasets, Spaces, APIs, etc.). An NLP course provided by Hugging Face (highly recommended): https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt

Transformers: a Python framework for state-of-the-art sentence, text, and image embeddings. This framework can compute sentence/text embeddings for different languages. Install it with: pip install transformers

Let’s Code Together

What are we going to do? 💻 Get the representation of some texts using Word2Vec and BERT; get familiar with the sentence-transformers library; use an LLM for retrieving information; compare the performance of the models. Find the code here 👇: https://colab.research.google.com/drive/1T24mWOVisVv0N45-GlAGm8lCreZmzV0v?usp=sharing

Any Question?

Thank you for your attention! Sajad Ebrahimi, M.A.Sc. in Computer Engineering at the University of Guelph. sadjadeb.github.io | Follow me on X: @sadjadeb | [email protected]

References
Bhaskar Mitra and Nick Craswell, "An Introduction to Neural Information Retrieval", 2018
Behrooz Mansoori, "Introduction to Information Retrieval", 2022
Sutskever et al., "Sequence to Sequence Learning with Neural Networks", 2014
https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263
https://data-hub.ir/ (word2vec)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019
Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", 2019
Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
https://itnext.io/deep-learning-in-information-retrieval-part-iii-ranking-da511f2dc325