Information Retrieval – Dense Vectors – Neural IR for Question Answering
VIJAYARAJAV
Mar 03, 2025
INFORMATION RETRIEVAL Department of Artificial Intelligence and Data Science R.M.K. College of Engineering and Technology
TERM WEIGHTING AND DOCUMENT SCORING
The basic IR architecture uses the vector space model, in which we map queries and documents to vectors based on unigram word counts and use the cosine similarity between the vectors to rank potential documents. This cosine is the score for the match between a document and a query. We don't use raw word counts in IR; instead we compute a term weight for each document word. Two term weighting schemes are common: tf-idf weighting and BM25.
TERM WEIGHTING AND DOCUMENT SCORING
The term frequency tells us how frequent a word is in a document. Words that occur more often in a document are likely to be informative about the document's contents. We usually use the log10 of the word frequency rather than the raw count. The intuition is that a word appearing 100 times in a document doesn't make that word 100 times more likely to be relevant to the meaning of the document.
TERM WEIGHTING AND DOCUMENT SCORING
With log weighting, the term frequency is $\mathrm{tf}_{t,d} = 1 + \log_{10} \mathrm{count}(t,d)$ if $\mathrm{count}(t,d) > 0$, and 0 otherwise. A term that occurs 0 times in a document has tf = 0; 1 time, tf = 1 + log10(1) = 1 + 0 = 1; 10 times, tf = 1 + log10(10) = 2; 100 times, tf = 1 + log10(100) = 3; 1000 times, tf = 4; and so on.
The document frequency df_t of a term t is the number of documents it occurs in.
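The log-weighted term frequency can be checked with a few lines of Python. This is a minimal sketch; the function name `log_tf` is an illustrative choice, and it assumes the usual convention that a term absent from the document gets tf = 0.

```python
import math

def log_tf(count: int) -> float:
    """Log-scaled term frequency: 1 + log10(count) if count > 0, else 0."""
    if count <= 0:
        # Assumed convention: absent terms contribute nothing.
        return 0.0
    return 1.0 + math.log10(count)

# Reproduce the counts discussed on the slide.
for count in (0, 1, 10, 100, 1000):
    print(count, log_tf(count))
```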
TERM WEIGHTING AND DOCUMENT SCORING
Terms that occur in only a few documents are useful for discriminating those documents from the rest of the collection. Terms that occur across the entire collection aren't as helpful. The inverse document frequency, or idf, term weight is defined as:
$\mathrm{idf}_t = \log_{10}\left(\frac{N}{\mathrm{df}_t}\right)$
where N is the total number of documents in the collection and df_t is the number of documents in which term t occurs. The fewer documents in which a term occurs, the higher this weight; the lowest weight of 0 is assigned to terms that occur in every document.
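As a quick check of the idf definition, the sketch below assumes the base-10 logarithm of N / df_t and uses the 37-play Shakespeare collection mentioned in these slides as N. The function name `idf` is illustrative.

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency: log10(N / df_t). Assumes doc_freq >= 1."""
    return math.log10(num_docs / doc_freq)

# A term found in a single play out of 37 gets a high weight ...
print(idf(37, 1))
# ... while a term found in all 37 plays gets the lowest weight, 0.
print(idf(37, 37))
```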
TERM WEIGHTING AND DOCUMENT SCORING
In the corpus of 37 Shakespeare plays, idf values range from high for extremely informative words that occur in only one play, like Romeo, down to 0 for words so common as to be completely non-discriminative, since they occur in all 37 plays, like good or sweet.
The tf-idf value for word t in document d is then the product of the term frequency tf_{t,d} and the idf:
$\text{tf-idf}(t,d) = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$
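DOCUMENT SCORING
We score document d by the cosine of its vector d with the query vector q:
$\mathrm{score}(q,d) = \cos(\mathbf{q},\mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{|\mathbf{q}|\,|\mathbf{d}|}$
Another way to think of the cosine computation is as the dot product of unit vectors: we first normalize both the query and document vectors to unit vectors by dividing by their lengths, and then take the dot product:
$\mathrm{score}(q,d) = \frac{\mathbf{q}}{|\mathbf{q}|} \cdot \frac{\mathbf{d}}{|\mathbf{d}|}$
We can spell out this equation using the tf-idf values, writing the dot product as a sum of products:
$\mathrm{score}(q,d) = \sum_{t \in q} \frac{\text{tf-idf}(t,q)}{\sqrt{\sum_{q_i \in q} \text{tf-idf}^2(q_i,q)}} \cdot \frac{\text{tf-idf}(t,d)}{\sqrt{\sum_{d_i \in d} \text{tf-idf}^2(d_i,d)}}$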
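The scoring procedure can be sketched end to end in Python. This is an illustrative sketch, not a reference implementation: the four tiny documents and the query are made up, tokenization is a plain whitespace split, and the query is weighted with the same tf-idf scheme as the documents.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: (1 + log10 count) * idf for each term present."""
    counts = Counter(tokens)
    return {t: (1 + math.log10(c)) * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    """Cosine of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection.
docs = {
    "d1": "sweet sweet nurse love",
    "d2": "sweet sorrow",
    "d3": "how sweet is love",
    "d4": "nurse",
}
N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(t for text in docs.values() for t in set(tokenize(text)))
idf = {t: math.log10(N / n) for t, n in df.items()}

query = tfidf_vector(tokenize("sweet love"), idf)
scores = {name: cosine(query, tfidf_vector(tokenize(text), idf))
          for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A document that shares no term with the query (here `d4`) gets a score of 0, which is why such documents can be skipped entirely.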
INVERTED INDEX
In order to compute scores, we need to efficiently find the documents that contain the words in the query. Any document that contains none of the query terms will have a score of 0 and can be ignored. The data structure for this task is the inverted index. Given a query term, an inverted index gives a list of the documents that contain the term. It consists of two parts: a dictionary and the postings.
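A minimal in-memory sketch of this structure follows. The Python dict plays the role of the dictionary, and each value is a postings list of document ids. The toy documents and the helper names `build_inverted_index` and `candidates` are assumptions for illustration; real postings usually also store term frequencies or positions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of the ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

def candidates(index, query):
    """Ids of documents containing at least one query term; all others score 0."""
    found = set()
    for term in query.lower().split():
        found.update(index.get(term, []))
    return sorted(found)

# Hypothetical toy collection keyed by integer document id.
docs = {1: "sweet sweet nurse love", 2: "sweet sorrow", 3: "how sweet is love", 4: "nurse"}
index = build_inverted_index(docs)
print(index["sweet"])
print(candidates(index, "love nurse"))
```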
INFORMATION RETRIEVAL WITH DENSE VECTORS
The classic tf-idf and BM25 algorithms for IR have long been known to have a conceptual flaw: they work only if there is exact overlap of words between the query and the document. In other words, the user posing a query (or asking a question) needs to guess exactly what words the writer of the answer might have used, an issue called the vocabulary mismatch problem. The solution is an approach that can handle synonymy: instead of (sparse) word-count vectors, use (dense) embeddings.
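The effect of dense vectors on vocabulary mismatch can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 3-dimensional word vectors are hand-picked rather than learned, a text is encoded by averaging its word vectors, and relevance is the dot product of the two encodings. A real system would use a trained neural encoder instead.

```python
# Hypothetical hand-picked embeddings: near-synonyms get similar vectors.
EMBEDDINGS = {
    "car": (0.9, 0.1, 0.0),
    "automobile": (0.85, 0.15, 0.0),
    "repair": (0.1, 0.9, 0.0),
    "fix": (0.15, 0.85, 0.0),
    "banana": (0.0, 0.0, 1.0),
    "bread": (0.0, 0.1, 0.9),
}

def encode(text):
    """Average the vectors of the known words in the text."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

docs = {"d1": "automobile fix", "d2": "banana bread"}
query = "car repair"

q_vec = encode(query)
scores = {name: dot(q_vec, encode(text)) for name, text in docs.items()}
best_doc = max(scores, key=scores.get)
print(best_doc, scores)
```

The query and `d1` share no words, so a sparse word-count model would score `d1` at 0, while the dense representation still ranks it first.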
IR-BASED QUESTION ANSWERING
The goal of IR-based question answering is to answer a user's question by finding short text segments on the Web or in some other collection of documents.
IR-BASED QUESTION ANSWERING
An IR-based factoid question-answering system has three phases: question processing, passage retrieval and ranking, and answer processing.
Question Processing
The goal of the question-processing phase is to extract a number of pieces of information from the question:
Answer type: the kind of entity the answer consists of (person, location, time, etc.).
Query: the keywords that the IR system should use in searching for documents.
Focus: the string of words in the question that is likely to be replaced by the answer in any answer string found.
Question type: is this a definition question, a math question, a list question?
Question Processing
For example, for the question "Which US state capital has the largest population?", question processing should produce results like the following:
Answer type: city
Query: US state capital, largest, population
Focus: state capital
Answer Type Detection (Question Classification)
The task of question classification, or answer type recognition, is to determine the answer type: the named-entity or similar class categorizing the answer. A question like "Who founded Virgin Airlines?" expects an answer of type PERSON. A question like "What Canadian city has the largest population?" expects an answer of type CITY. If we know the answer type for a question, we can avoid looking at every sentence or noun phrase in the entire suite of documents for the answer and instead focus on, for example, just people or cities.
Answer Type Detection (Question Classification)
Usually, however, a richer, often hierarchical set of answer types is used: an answer type taxonomy.
Query Formulation
Query formulation is the task of creating from the question a list of keywords that form a query that can be sent to an information retrieval system. Exactly what query to form depends on the application. If question answering is applied to the Web, we might simply create a keyword from every word in the question. Alternatively, keywords can be formed from only the terms found in the noun phrases in the question.
Query Reformulation
A query formulation approach that is sometimes used for questioning the Web is to apply query reformulation rules to the query. The rules rephrase the question to make it look like a substring of possible declarative answers. The question "When was the laser invented?" might be reformulated as "the laser was invented"; the question "Where is the Valley of the Kings?" as "the Valley of the Kings is located in". Example rules:
wh-word did A verb B → ... A verb+ed B
Where is A → A is located in
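Reformulation rules of this kind are easy to prototype with regular expressions. The sketch below is only an illustration: it encodes the "Where is A" rule from the slide plus a second pattern written to cover the laser example, and the rule list and the function name `reformulate` are assumptions, not a standard rule set.

```python
import re

# (pattern, template) pairs; named groups are reused in the template.
RULES = [
    # Where is A -> A is located in
    (re.compile(r"^where is (?P<a>.+?)\??$", re.IGNORECASE), r"\g<a> is located in"),
    # When was A verb+ed -> A was verb+ed (illustrative pattern for the laser example)
    (re.compile(r"^when was (?P<a>.+?) (?P<verb>\w+ed)\??$", re.IGNORECASE), r"\g<a> was \g<verb>"),
]

def reformulate(question):
    """Return the first matching declarative rewrite, or the question unchanged."""
    q = question.strip()
    for pattern, template in RULES:
        match = pattern.match(q)
        if match:
            return match.expand(template)
    return q

print(reformulate("Where is the Valley of the Kings?"))
print(reformulate("When was the laser invented?"))
```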
Passage Retrieval
The query that was created in the question-processing phase is next used to query an information-retrieval system. The result of this document retrieval stage is a set of documents. Although the set of documents is generally ranked by relevance, the top-ranked document is probably not the answer to the question. A highly relevant and large document that does not prominently answer the question is not an ideal candidate for further processing. So the next stage is to extract a set of potential answer passages from the retrieved set of documents.
Passage Retrieval
We next perform passage retrieval. In this stage, we first filter out passages in the returned documents that don't contain potential answers, and then rank the rest according to how likely they are to contain an answer to the question. The first step in this process is to run a named entity or answer type classification on the retrieved passages. The answer type that we determined from the question tells us the possible answer types we expect to see in the answer. We can therefore filter out passages that don't contain any entities of the right type.
Passage Retrieval
The remaining passages are then ranked, usually by supervised machine learning over features such as:
The number of named entities of the right type in the passage.
The number of question keywords in the passage.
The longest exact sequence of question keywords that occurs in the passage.
The rank of the document from which the passage was extracted.
The proximity of the keywords from the original query to each other in the passage.
The N-gram overlap between the passage and the question: count the N-grams in the question and the N-grams in the answer passages.
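Three of the passage-ranking features listed above can be computed with simple token operations. This is a sketch under stated assumptions: whitespace tokenization, bigrams as the N-gram size, and a made-up passage. The function names are illustrative, and a real ranker would feed these values, along with the other features, to a trained classifier.

```python
def keyword_count(keywords, passage_tokens):
    """Number of distinct question keywords that appear in the passage."""
    return len(set(keywords) & set(passage_tokens))

def longest_keyword_sequence(keywords, passage_tokens):
    """Length of the longest run of keywords, in question order, found contiguously in the passage."""
    best = 0
    prev = [0] * (len(passage_tokens) + 1)
    for kw in keywords:
        cur = [0] * (len(passage_tokens) + 1)
        for j, tok in enumerate(passage_tokens, start=1):
            if kw == tok:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def bigram_overlap(question_tokens, passage_tokens):
    """Number of distinct bigrams shared by the question tokens and the passage."""
    def bigrams(tokens):
        return set(zip(tokens, tokens[1:]))
    return len(bigrams(question_tokens) & bigrams(passage_tokens))

keywords = ["us", "state", "capital", "largest", "population"]
# Hypothetical candidate passage.
passage = "phoenix is the us state capital with the largest population".split()

features = {
    "keyword_count": keyword_count(keywords, passage),
    "longest_sequence": longest_keyword_sequence(keywords, passage),
    "bigram_overlap": bigram_overlap(keywords, passage),
}
print(features)
```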
Passage Retrieval
For question answering from the Web, instead of extracting passages from all returned documents, we can rely on the Web search engine to do passage extraction for us. We do this by using the snippets produced by the Web search engine as the returned passages.
Answer Processing
The final stage of question answering is to extract a specific answer from the passage. Two classes of algorithms have been applied to this task: answer-type pattern extraction and N-gram tiling. In the pattern-extraction methods, we use the expected answer type together with regular expression patterns. For example, for questions with a HUMAN answer type, we run the answer type or named entity tagger on the candidate passage and return the entity labeled HUMAN. For the question "Who is the prime minister of India?" and the passage below, the extracted answer is Manmohan Singh:
"Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated."
Answer Processing
Unfortunately, the answers to some questions, such as DEFINITION questions, don't tend to be of a particular named entity type.
Answer Processing
We extract potential answers by using named entities or patterns, or even just by looking at every sentence returned from passage retrieval, and rank them using a classifier with features such as the following:
Answer type match: True if the candidate answer contains a phrase with the correct answer type.
Pattern match: The identity of a pattern that matches the candidate answer.
Number of matched question keywords: How many question keywords are contained in the candidate answer.
Keyword distance: The distance between the candidate answer and the query keywords.
Answer Processing
Novelty factor: True if at least one word in the candidate answer is novel, that is, not in the query.
Apposition features: True if the candidate answer is an appositive to a phrase containing many question terms. Can be approximated by the number of question terms separated from the candidate answer by at most three words and one comma.
Punctuation location: True if the candidate answer is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.
Sequences of question terms: The length of the longest sequence of question terms that occurs in the candidate answer.
Answer Processing
An alternative approach to answer extraction, used solely in Web search, is based on N-gram tiling, sometimes called the redundancy-based approach. This simplified method begins with the snippets returned from the Web search engine, produced by a reformulated query. In the first step, N-gram mining, every unigram, bigram, and trigram occurring in the snippets is extracted and weighted. The weight is a function of the number of snippets in which the N-gram occurred and the weight of the query reformulation pattern that returned it. In the N-gram filtering step, N-grams are scored by how well they match the predicted answer type.
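The N-gram mining step can be sketched as follows. The slide says only that the weight is a function of the snippet count and the reformulation-pattern weight; summing the pattern weight once per snippet containing the N-gram is one simple choice assumed here. The snippets, their weights, and the function names are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mine_ngrams(snippets):
    """Weight every unigram, bigram, and trigram by the summed pattern weights
    of the snippets it occurs in (each snippet counted once per N-gram)."""
    weights = Counter()
    for text, pattern_weight in snippets:
        tokens = text.lower().split()
        seen = set()
        for n in (1, 2, 3):
            seen.update(ngrams(tokens, n))
        for gram in seen:
            weights[gram] += pattern_weight
    return weights

# Hypothetical (snippet, reformulation-pattern weight) pairs.
snippets = [
    ("the laser was invented in 1960", 2.0),
    ("laser invented 1960 by maiman", 1.0),
    ("the first laser was built in 1960", 1.0),
]
weights = mine_ngrams(snippets)
print(weights[("1960",)], weights[("in", "1960")], weights[("maiman",)])
```

Filtering by answer type would come next: for a "when" question, N-grams that look like dates would be scored higher than the equally frequent unigram `laser`.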
Evaluating Question Answering
Three techniques are commonly employed to evaluate question-answering systems, with the choice depending on the type of question and the QA situation.
For multiple choice questions, we report exact match: the percentage of predicted answers that match the gold answer exactly.
For questions with free text answers, as in Natural Questions, we commonly evaluate with the token F1 score, which roughly measures the partial string overlap between the predicted answer and the reference answer.
In some situations QA systems give multiple ranked answers. In such cases we evaluate using mean reciprocal rank, or MRR.
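The three metrics are short enough to write out. This is a minimal sketch assuming lowercased whitespace tokens and a single gold answer per question; the function names and example answers are illustrative, and published benchmarks typically add further answer normalization.

```python
from collections import Counter

def exact_match(predictions, golds):
    """Fraction of predictions identical to the gold answer."""
    hits = sum(p == g for p, g in zip(predictions, golds))
    return hits / len(golds)

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def mean_reciprocal_rank(ranked_answers, golds):
    """Average of 1/rank of the first correct answer (0 if it is absent)."""
    total = 0.0
    for ranked, gold in zip(ranked_answers, golds):
        for rank, answer in enumerate(ranked, start=1):
            if answer == gold:
                total += 1.0 / rank
                break
    return total / len(golds)

print(exact_match(["paris", "rome"], ["paris", "milan"]))
print(token_f1("the eiffel tower", "eiffel tower"))
print(mean_reciprocal_rank([["a", "b"], ["c", "d", "e"]], ["b", "e"]))
```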
Frame-Based Dialogue Systems
A task-based dialogue system has the goal of helping a user solve a specific task, like making a travel reservation or buying a product. Task-based dialogue systems are based around frames. Frames are knowledge structures representing the details of the user's task specification. Each frame consists of a collection of slots, each of which can take a set of possible values.
Frames and Slot Filling
The frame and its slots specify what the system needs to know to perform its task. A hotel reservation system needs dates and locations. An alarm clock system needs a time. The system's goal is to fill the slots in the frame with the fillers the user intends, and then perform the relevant action for the user (answering a question, or booking a flight).
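A frame can be represented as a small data structure that tracks which slots are still empty. The sketch below is illustrative only: the `Frame` class, the slot names, and the filler values are assumptions chosen to mirror the hotel reservation example.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A frame: a named collection of slots, each holding a filler or None."""
    name: str
    slots: dict = field(default_factory=dict)

    def fill(self, slot, value):
        if slot not in self.slots:
            raise KeyError(f"unknown slot: {slot}")
        self.slots[slot] = value

    def missing(self):
        """Slots the system still needs to ask the user about."""
        return [s for s, v in self.slots.items() if v is None]

    def is_complete(self):
        return not self.missing()

# Hypothetical hotel reservation frame.
frame = Frame("hotel_reservation", {"location": None, "dates": None})
frame.fill("location", "Chennai")
print(frame.missing())      # slots still to be filled
frame.fill("dates", "12-14 March")
print(frame.is_complete())  # ready to perform the booking action
```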
Frames and Slot Filling
The task of slot filling is usually combined with two other tasks, so that three things are extracted from each user utterance. The first is domain classification: is this user, for example, talking about airlines, programming an alarm clock, or dealing with their calendar? The second is user intent determination: what general task or goal is the user trying to accomplish? For example, the task could be to Find a Movie, Show a Flight, or Remove a Calendar Appointment. Together, the domain classification and intent determination tasks decide which frame we are filling.
Frames and Slot Filling
Most systems use supervised machine learning: each sentence in a training set is annotated with slots, domain, and intent, and a sequence model maps from input words to slot fillers, domain, and intent. For example, we'll have sentences that are labeled for domain (AIRLINE) and intent (SHOWFLIGHT), and also labeled with BIO representations for the slots and fillers. Recall that in BIO tagging we introduce a tag for the beginning (B) and inside (I) of each slot label, and one for tokens outside (O) any slot label.
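Once a sequence model has produced BIO tags, turning them into slot fillers is a simple decoding pass. The sketch below assumes tags of the form `B-LABEL`, `I-LABEL`, and `O`; the sentence, the slot labels `DES` and `DEPTIME`, and the function name `bio_to_slots` are illustrative, and the tags are written by hand rather than predicted by a model.

```python
def bio_to_slots(tokens, tags):
    """Collect the tokens of each B-/I- span into a slot filler string."""
    slots = {}
    label, span = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label:                      # close any open span
                slots[label] = " ".join(span)
            label, span = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            span.append(token)             # continue the current span
        else:                              # an O tag or a stray I- tag ends the span
            if label:
                slots[label] = " ".join(span)
            label, span = None, []
    if label:
        slots[label] = " ".join(span)
    return slots

tokens = "I want to fly to San Francisco on Monday afternoon please".split()
tags = ["O", "O", "O", "O", "O", "B-DES", "I-DES", "O", "B-DEPTIME", "I-DEPTIME", "O"]
print(bio_to_slots(tokens, tags))
```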
Dialogue Acts and Dialogue State
Dialogue acts are a generalization of speech acts that also represent grounding. The set of acts can be general, or can be designed for particular dialogue tasks.
Dialogue State Tracking
The job of the dialogue-state tracker is to determine the current state of the frame (the fillers of each slot) and the user's most recent dialogue act.
Chatbots
Chatbots are systems that can carry on extended conversations with the goal of mimicking the unstructured conversations or "chats" characteristic of informal human-to-human interaction. They can also bring solutions to NLP tasks like question answering, writing tools, or machine translation into a conversational interface. Chatbots are generally trained on a training set that includes standard large language model training data. This can include datasets created specifically for training chatbots by hiring speakers of the language to have conversations.
Dialogue System Design
Study the user and task: Understand the users and the task by interviewing users, investigating similar systems, and studying related human-human dialogues.
Build simulations and prototypes: A crucial tool in building dialogue systems is the Wizard-of-Oz system.
Iteratively test the design on users: An iterative design cycle with embedded user testing is essential in system design.