Introduction to Neural Information Retrieval and Large Language Models
About This Presentation
In this presentation, I provide a comprehensive introduction to information retrieval and large language models, beginning with the history of IR, continuing with embedding approaches, and finishing with the common retrieval architectures and a code session on getting familiar with Hugging Face and using LLMs for ranking.
Slide Content
Introduction to Neural Information Retrieval Sajad Ebrahimi – M.A.Sc in Computer Engineering at UoG [email protected]
Outline
- IR through the Ages
- Text Embedding
  - Word2Vec in detail
  - FastText
- Seq2Seq Model / Encoder-Decoder architecture / Attention / Transformers
- BERT: Intuition and Applications
- Using Language Models for retrieval
  - Why are Language Models Useful?
  - Bi-Encoder vs. Cross-Encoder
- HuggingFace
- Let's code together
  - Load a dataset
  - Get the embeddings
  - Calculate the similarity
  - Compare models
IR through the ages
There are two possible outcomes: TRUE (the document matches the query) or FALSE. The query is specified using Boolean logic operators: AND, OR, NOT. Ranking? Boolean Retrieval
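As a quick illustration of the idea above (not from the slides), here is a minimal sketch with a hypothetical toy collection: an inverted index maps each term to the set of documents containing it, and AND/OR/NOT become set operations. Every match is equally TRUE, so there is no ranking.

```python
# Toy Boolean retrieval over a hypothetical 4-document collection (illustrative only).
docs = {
    1: "information retrieval with boolean logic",
    2: "neural networks for information retrieval",
    3: "boolean algebra and logic design",
    4: "large language models for ranking",
}

# Inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Query: information AND retrieval AND NOT neural
result = (index["information"] & index["retrieval"]) - index["neural"]
print(result)  # {1} -- every match is simply TRUE, so there is no ranking
```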
Indexing
TF-IDF and BM25
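The slide only names the two weighting schemes, so here is a rough sketch of the standard BM25 scoring formula (the function name is made up for illustration; k1 and b are the usual free parameters):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Standard BM25: sum over query terms of IDF(t) * length-normalized, saturated tf."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0 or term not in doc_freq:
            continue
        idf = math.log(1 + (num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

# Tiny usage example with made-up statistics:
docs = [["neural", "information", "retrieval"], ["boolean", "retrieval", "model", "history"]]
df = {"retrieval": 2, "neural": 1}
avg_len = sum(len(d) for d in docs) / len(docs)
print(bm25_score(["neural", "retrieval"], docs[0], df, len(docs), avg_len))
```

With the usual k1 and b values, term frequency saturates instead of growing without bound, and documents longer than average are penalized.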
Behrooz Mansoori, “Introduction to Information Retrieval”, 2022
The fundamental idea behind text embedding is to capture the contextual and semantic information of words and phrases in a way that can be used for various NLP tasks, such as text classification, sentiment analysis, information retrieval, machine translation, and more. Numerical Representation, Semantic Information, Feature Engineering, Dimensionality Reduction, Transfer Learning. What is Text Embedding?
One-hot vectors map objects into a vector of fixed length, but they do not capture any semantic relationships. Is there a relationship between man and king? We would like to develop a model in which King - Man + Woman → Queen, and in which Apple Inc. can be distinguished from Apple (the fruit). Motivation
Most statistical NLP/IR work regards words as atomic symbols: book, information, university. In vector space, this is a sparse vector with a single 1 and a lot of zeros: book → [0 0 0 1 0 0 0 0 0 0], university → [0 0 0 0 0 0 0 0 1 0]. Problems: the dimensionality of the vector will be the size of the vocabulary, and there is no semantic relationship between the word vectors' representations.
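A small NumPy sketch of the problem just described, assuming a toy three-word vocabulary: the one-hot vectors are as long as the vocabulary, and any two distinct words have zero similarity.

```python
import numpy as np

vocab = ["book", "information", "university"]            # tiny stand-in vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["book"])                                   # [1. 0. 0.] -- dimension = |vocab|
print(one_hot["book"] @ one_hot["university"])           # 0.0 -- no semantic relationship
```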
WORD2VEC
The output probabilities are going to relate to how likely it is to find each vocabulary word near our input word. E.g., for the input word “Soviet”, the output probabilities are going to be much higher for words like “Union” and “Russia” than for unrelated words like “watermelon” and “kangaroo”. We train the neural network to do this by feeding it word pairs found in our training documents. The intuition
Window size (context) is a hyperparameter
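A minimal sketch (the toy sentence and window size are assumptions) of how the training word pairs mentioned above are generated; the window size is the hyperparameter named on this slide.

```python
# Generate (center, context) training pairs from a toy sentence with window size 2.
sentence = "the soviet union signed a treaty".split()
window = 2  # hyperparameter: how many words on each side count as context

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:4])  # [('the', 'soviet'), ('the', 'union'), ('soviet', 'the'), ('soviet', 'union')]
```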
We cannot hand a word to a neural network just as a text string; we need a way to represent the words to the network. We build a vocabulary of words from our training documents, and words are represented by one-hot encoding: a vector with the size of the vocabulary, 1 if the word is present, otherwise 0 (e.g., a 1 at position 1000 for “cat”). IT NEEDS TOO MANY RESOURCES. Training Details
Instead of trying to predict the probability of being a nearby word for all the words in the vocabulary, we try to predict whether our training sample words are neighbors or not: we maximize the similarity of words that occur in the same context and minimize it when they occur in different contexts. Negative Sampling
Now we need to solve a simple logistic regression problem. Negative Sampling
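A rough NumPy sketch of that logistic regression view, with random vectors standing in for the learned embeddings: a true (center, context) pair gets label 1, a sampled noise word gets label 0, and training pushes the sigmoid of the dot product toward the label.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
center = rng.normal(size=dim)      # embedding of the center word
positive = rng.normal(size=dim)    # embedding of a true context word (label 1)
negative = rng.normal(size=dim)    # embedding of a sampled "noise" word (label 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Negative-sampling loss for this one center word:
loss = -np.log(sigmoid(center @ positive)) - np.log(sigmoid(-center @ negative))
print(loss)  # training updates the vectors to drive this loss down
```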
The skip-gram neural network model is actually surprisingly simple in its most basic form. Two assumptions: a word can be used to generate the words surrounding it, and given the center word, the context words are generated independently. Skip Gram Model
Skip Gram Architecture
The center word is generated based on the context words. Continuous Bag of Words (CBOW) model
CBOW Architecture
The hidden layer is going to be represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron); 300 features is what Google used in their published model trained on the Google News dataset. The end goal of all of this is really just to learn this hidden-layer weight matrix, and the hidden layer of this model is really just operating as a lookup table. The Hidden Layer
Each output neuron (one per word in our vocabulary) will produce an output between 0 and 1, and the sum of all these output values will add up to 1. The Output Layer
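A NumPy sketch of the two layers described on the last two slides, using the 10,000 x 300 shapes from the example (the weights here are random stand-ins): multiplying a one-hot vector by the hidden-layer matrix simply selects one row, i.e. a lookup, and the output layer turns the scores into probabilities that sum to 1.

```python
import numpy as np

vocab_size, hidden_size = 10_000, 300
W_in = np.random.rand(vocab_size, hidden_size)    # hidden-layer weights (the embeddings we want)
W_out = np.random.rand(hidden_size, vocab_size)   # output-layer weights

one_hot = np.zeros(vocab_size)
one_hot[1000] = 1.0                               # e.g., the word "cat"

hidden = one_hot @ W_in                           # identical to W_in[1000]: a simple lookup
assert np.allclose(hidden, W_in[1000])

scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()                              # softmax: values in (0, 1) that sum to 1
print(probs.sum())                                # ~1.0
```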
Lack of Context Sensitivity; Out-of-vocabulary Words / Difficulty with Rare Words; Semantic Ambiguity (Doesn't Capture Phrase-level Semantics); Domain Specificity; Not Capturing Word Order; Morphology (eat vs. eaten). Word2Vec cons
FastText
It was invented to make up for Word2Vec's shortcomings, like OOV words and morphology. In essence, it is based on the idea of character n-grams. FastText. Bojanowski et al., FastText n-gram embedding model, 2017
For a word, we generate the character n-grams of length 3 to 6 present in it. Vector representations are updated in two steps: first, the embedding for the center word is calculated by taking a sum of the vectors for its character n-grams and for the whole word itself; for the actual context words, we directly take their word vector from the embedding table without adding the character n-grams. FastText Training
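A short sketch of the character n-gram step described above; the '<' and '>' word-boundary markers are an assumption borrowed from the usual FastText formulation rather than something shown on the slide.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of length n_min..n_max from the boundary-marked word, plus the word itself."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)]
    return grams + [padded]

print(char_ngrams("eating")[:5])  # ['<ea', 'eat', 'ati', 'tin', 'ing']
# The center-word vector is the sum of the vectors for these n-grams (plus the full word);
# "eat" and "eaten" then share many n-grams, and OOV words still receive a vector.
```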
High memory requirements; Lack of Context Sensitivity; Not Capturing Word Order; Domain Specificity. FastText cons
Seq2seq models have evolved, and attention mechanisms have been introduced to address challenges related to capturing long-range dependencies. Seq2Seq Model Sutskever et al. Sequence to Sequence Learning with Neural Networks, 2014
Seq2Seq Model Component
Keeping the intermediate outputs from the encoder LSTM at each step of the input sequence, instead of only its final state. Attention Models. Luong et al., Effective Approaches to Attention-based Neural Machine Translation, 2015
Attention is all you need! Vaswani et al., Attention is all you need, 2017
Inside an Encoder and a Decoder
Self-Attention
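A minimal NumPy sketch of single-head scaled dot-product self-attention, the operation behind this slide (all matrices here are random stand-ins): each token's query is compared with every key, the scores are softmax-normalized, and the output is a weighted sum of the values.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # token-to-token attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                           # 5 tokens, 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (5, 16)
```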
Want to see attention visualized? https://github.com/jessevig/bertviz
BERT
BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. A full Transformer has an encoder that reads the text input and a decoder that produces a prediction for the task; the Transformer encoder reads the entire sequence of words at once. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. Introduction. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019
Pre-training and Fine-tuning
BERT has been trained with two approaches. Masked LM: mask some percentage of the input tokens at random (15% of all WordPiece tokens in each sequence), and then predict those masked tokens. Next Sentence Prediction: understanding the relationship between two sentences (50% positive pairs). The training data is BooksCorpus (800M words) and Wikipedia (2,500M words). IT IS EXPENSIVE. Pre-training in detail
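The Masked LM objective is easy to try with a pre-trained checkpoint; a small sketch using the Hugging Face fill-mask pipeline (the bert-base-uncased checkpoint and the example sentence are assumptions):

```python
from transformers import pipeline

# Predict a masked token, which is exactly what the Masked LM objective trains BERT to do.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Information [MASK] is the task of finding relevant documents."):
    print(pred["token_str"], round(pred["score"], 3))
```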
Fine-tuning in detail
BERT Size
What are the differences?
                     BERT base                BERT large
Model Size           110M                     340M
Hidden Layers        12 transformer layers    24 transformer layers
Attention Heads      12                       16
Hidden Unit Size     768                      1024
Total Hidden Units   12 layers * 768 units    24 layers * 1024 units
Obviously, they have differences in training time and required computational resources.
[CLS] [SEP] [MASK] [PAD] BERT Special Symbols
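A quick way to see where these special symbols end up, using the BertTokenizer listed a few slides later (the example sentence pair and padding length are assumptions); [MASK] is not shown here because it only replaces tokens during Masked LM pre-training.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Encode a sentence pair; BERT places [CLS] at the start, [SEP] between and after the
# segments, and [PAD] up to the requested length.
encoded = tokenizer("how are you", "i am fine", padding="max_length", max_length=12)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'i', 'am', 'fine', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```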
Under the hood
RoBERTa: a robustly optimized BERT, from the Facebook AI research team. They used 160GB of text instead of the 16GB dataset originally used to train BERT; increased the number of iterations from 100K to 300K and then further to 500K; dynamically changed the masking pattern applied to the training data; and removed the next sentence prediction objective from the training procedure. BERT-based Models. Liu et al., RoBERTa: A robustly optimized BERT pretraining approach, 2019
DistilBERT: a distilled version of BERT. BERT-based Models. Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2020
                     BERT base                DistilBERT
Model Size           110M                     66M
Hidden Layers        12 transformer layers    6 transformer layers
Attention Heads      12                       12
Hidden Unit Size     768                      768
Total Hidden Units   12 layers * 768 units    6 layers * 768 units
ColBERT: Contextualized Late Interaction over BERT. Deep contextualized LMs come at a steep increase in computational cost, so ColBERT introduces a novel late-interaction paradigm for estimating relevance between a query q and a document d: the query and document text are separately encoded into per-token contextual embeddings using BERT-based query and document encoders. BERT-based Models
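A rough NumPy sketch of the late-interaction (MaxSim) scoring that sits on top of those per-token embeddings; the embeddings below are random stand-ins for the BERT outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 128))    # 4 query-token embeddings (stand-ins for BERT outputs)
D = rng.normal(size=(50, 128))   # 50 document-token embeddings

# Normalise so the dot product behaves like a cosine similarity.
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Late interaction: for each query token, take its best-matching document token, then sum.
score = (Q @ D.T).max(axis=1).sum()
print(score)
```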
Sentence-BERT: a modification of the pre-trained BERT network that uses Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. BERT-based Models. Reimers and Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, 2019
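A minimal sketch with the sentence-transformers library, which implements Sentence-BERT; the all-MiniLM-L6-v2 checkpoint and the example sentences are assumptions, not from the slides.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # a small Sentence-BERT-style model (assumed choice)

sentences = ["A man is playing a guitar.",
             "Someone is playing an instrument.",
             "The weather is cold today."]
embeddings = model.encode(sentences)

# Semantically similar sentences get a high cosine similarity, unrelated ones a low one.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```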
BertTokenizer, BertModel, BertForMaskedLM, BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification, BertForQuestionAnswering. BERT-for-X classes provided by Hugging Face
Using LMs for retrieval
Language models allow us to answer questions like: Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? Given that we observe “baseball” three times and “game” once in a news article, how likely is it about sports? Given that a user is interested in sports news, how likely would the user use “baseball” in a query? Why are Language Models useful?
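The first question above can be put directly to a pre-trained language model; a sketch using GPT-2 through transformers (the choice of GPT-2 is an assumption, and only the first sub-token of each candidate word is compared):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("John feels", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # scores for the next token
probs = torch.softmax(logits, dim=-1)

for word in [" happy", " habit"]:
    first_id = tok.encode(word)[0]                # first sub-token of the candidate word
    print(word, probs[first_id].item())
```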
Language Models applications
Bi-Encoder vs. Cross-Encoder
Retrieve vs Re-rank
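A compact sketch of the retrieve-then-re-rank setup behind the last two slides, using sentence-transformers (both model names are assumptions): a bi-encoder embeds the query and documents independently so retrieval is fast, while a cross-encoder reads each (query, document) pair jointly and re-ranks the retrieved hits.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "what is neural information retrieval"
docs = ["Neural IR uses learned representations to match queries and documents.",
        "BM25 is a classic lexical ranking function.",
        "Cats are popular pets around the world."]

# 1) Bi-encoder: encode query and documents separately, retrieve by embedding similarity.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(docs)
query_emb = bi_encoder.encode(query)
hits = util.semantic_search(query_emb, doc_emb, top_k=2)[0]

# 2) Cross-encoder: score each (query, document) pair jointly and re-rank the hits.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)                      # one relevance score per pair
order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
print([pairs[i][1] for i in order])                        # documents, best first
```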
The platform where the machine learning community collaborates on models, datasets, and applications: Models, Datasets, Spaces, APIs, etc. An NLP course provided by Hugging Face (highly recommended): https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt Hugging Face 🤗
Transformers is a Python framework that provides state-of-the-art pretrained models; it can be used to compute sentence/text embeddings for different languages. pip install transformers. Transformers
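A small sketch of pulling text embeddings out of a plain transformers model; mean pooling over the last hidden state (masking out padding) is one common choice and, like the bert-base-uncased checkpoint, is an assumption here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["neural information retrieval", "boolean retrieval"],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, tokens, 768)

# Mean-pool over the non-padding tokens to get one vector per text.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                                # torch.Size([2, 768])
```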
Let’s Code Together
Try to get the representation of some texts using Word2Vec and BERT; get familiar with the sentence-transformers library; use an LLM for retrieving information; compare the performance of the models. Find the code here 👇 : https://colab.research.google.com/drive/1T24mWOVisVv0N45-GlAGm8lCreZmzV0v?usp=sharing What are we going to do? 💻
Any Question?
Thank you for your attention! Sajad Ebrahimi – M.A.Sc. in Computer Engineering at the University of Guelph. sadjadeb.github.io Follow me on X: @sadjadeb [email protected]
References
Bhaskar Mitra and Nick Craswell, “An Introduction to Neural Information Retrieval”, 2018
Behrooz Mansoori, “Introduction to Information Retrieval”, 2022
Sutskever et al., “Sequence to Sequence Learning with Neural Networks”, 2014
https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263
https://data-hub.ir/ (word2vec)
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2019
Reimers and Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”, 2019
Liu et al., “RoBERTa: A robustly optimized BERT pretraining approach”, 2019
https://itnext.io/deep-learning-in-information-retrieval-part-iii-ranking-da511f2dc325