This study focuses on building reliable and reusable test collections, a crucial aspect of evaluating information retrieval systems. A key factor in the effectiveness of these collections is the quality of the queries they include. However, obtaining real user queries is not always feasible. To address this challenge, the research explores the feasibility of using synthetic queries, generated by language models, as a supplement to real queries in the construction of test collections. The potential of synthetic queries to enhance the robustness and applicability of test collections is analyzed, providing insights into improving the evaluation process.
Synthetic Query Analysis
TREC 2023 Deep Learning Track
Hossein A. (Saeed) Rahmani
University College London, [email protected]
TREC 2023 Conference, NIST, Gaithersburg, USA
Synthetic Query Analysis
■Building reliable and reusable test collections as one of the main goals
■Highly depends on the queries included in the collection
■Obtaining real user queries may not always be possible
■Analysing the feasibility of using synthetic queries in test collection construction
■Using synthetic queries generated using language models in addition to real queries
Synthetic Query Generation Pipeline
■Step 1: Passage Selection. We first randomly sampled 1000 passages from the v2 passage corpus and filtered for passages that can be good stand-alone search results using GPT-4 (sampling sketched below).
■Step 2: Query Generation. We then generated queries using (i) a pre-trained T5-based query generation model from BeIR [1] and (ii) a zero-shot query generation approach using GPT-4.
■Step 3: Query Selection. We sampled the T5 query-passage pairs to match a target sample of positive qrels from the 2022 passage task; NIST assessors further removed queries that do not look reasonable or that contain too few or too many relevant documents.
[1] Thakur, Nandan, et al. "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021.
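As a rough, non-authoritative illustration of the sampling stage in Step 1, the sketch below draws 1000 random passages from a local gzipped JSONL shard of the corpus. The file path and field names ("pid", "passage") are assumptions about a local copy of the v2 passage corpus, not the track's actual code.

```python
# Minimal sketch of the Step 1 sampling stage, assuming the v2 passage corpus is
# available locally as gzipped JSONL with "pid" and "passage" fields. In practice
# sampling would be done across all corpus shards, not a single file.
import gzip
import json
import random

def sample_passages(jsonl_gz_path: str, n: int = 1000, seed: int = 42):
    passages = []
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            passages.append((record["pid"], record["passage"]))
    random.seed(seed)
    return random.sample(passages, n)

# Hypothetical usage:
# candidates = sample_passages("msmarco_v2_passage/msmarco_passage_00.gz")
```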
Step 1: Passage Selection
■In the Passage Retrieval task, a retrieved passage should make sense on its own when it appears on the search results page without the rest of the document's context.
■An example of a bad passage is one that talks about a person without giving the person's name ("she then completed her first novel …").
■To identify passages that might not be good stand-alone search results, we ran each of the 1000 passages through a GPT-4 prompt (a generic version is sketched below).
Figure: prompt used to generate the passage quality score
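The exact prompt is only shown in the slide figure; the sketch below is a generic stand-in that asks GPT-4 for a 0 to 100 stand-alone quality score via the OpenAI chat completions API. The prompt wording and the score_passage helper are assumptions, not the prompt actually used by the track.

```python
# Hedged sketch of the passage-quality scoring step. SCORING_PROMPT is a
# placeholder, not the track's actual prompt; requires the openai package
# and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

SCORING_PROMPT = (
    "On a scale from 0 to 100, rate how well the following passage works as a "
    "stand-alone search result, i.e. whether it makes sense without the rest of "
    "its document. Answer with a single integer only.\n\nPassage: {passage}"
)

def score_passage(passage: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": SCORING_PROMPT.format(passage=passage)}],
        temperature=0,
    )
    # Raw text output; parsing and filtering are described on the next slide.
    return response.choices[0].message.content
```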
Step 1: Passage Selection (Examples)
Figure: Example 1 and Example 2
■Passages for which GPT-4 returned errors or malformed outputs were eliminated rather than re-prompted (removing 8.9% of passages).
■Low-quality passages with scores below 50 were filtered out (removing 14.6% of passages); a sketch of this filtering follows.
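A minimal sketch of the filtering logic described above, assuming each passage comes with the raw GPT-4 output from the scoring step; the thresholds follow the slide, while the data layout is hypothetical.

```python
# Drop passages whose GPT-4 output cannot be parsed as an integer (no re-prompting),
# then remove passages with a quality score below 50.
def filter_passages(scored_passages):
    # scored_passages: iterable of (pid, passage_text, raw_gpt4_output)
    kept = []
    for pid, passage, raw_output in scored_passages:
        try:
            score = int(raw_output.strip())
        except (ValueError, AttributeError):
            continue  # error / malformed output -> passage eliminated (8.9% here)
        if score >= 50:
            kept.append((pid, passage))  # scores below 50 removed (14.6% here)
    return kept
```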
Step 2: Query Generation
■Generated 1000 queries per passage using the pre-trained T5-based query generation model from BeIR [1] (sketched below).
■Generated one query per passage using a zero-shot prompt and GPT-4.
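A sketch of the T5-based generation step, assuming the publicly released BeIR query-generation checkpoint on Hugging Face (BeIR/query-gen-msmarco-t5-base-v1); the exact checkpoint and decoding settings used by the track may differ.

```python
# Generate multiple queries per passage with the BeIR T5 query-generation model;
# sampling-based decoding gives diverse queries. Requires transformers, torch,
# and sentencepiece.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "BeIR/query-gen-msmarco-t5-base-v1"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
model.eval()

def generate_t5_queries(passage: str, n: int = 5):
    inputs = tokenizer(passage, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.95,
            max_length=64,
            num_return_sequences=n,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```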
Step 3: Query Selection
■Sampled 250 T5 query-passage pairs to match a target sample of 250 positive qrels from the 2022 (last year's) passage labels.
■Matching in terms of (i) query length and (ii) the number of query words that lexically match a passage word, without stop-word removal or stemming (see the sketch below).
■Sampled 250 GPT-4 queries corresponding to the same passages from which the 250 T5 queries were selected.
■NIST assessors then removed queries that do not look reasonable or that contain too few or too many relevant documents.
■In the end, 34.7% (51/147) of real user queries, 27.1% (13/48) of T5-generated queries, and 36.7% (18/49) of GPT-4-generated queries were selected.
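A small sketch of the two matching features used when sampling the T5 pairs: query length and lexical overlap with the passage, computed without stemming or stop-word removal. The whitespace tokenisation and lower-casing are assumptions about the exact implementation.

```python
# Compute the two features used to match sampled query-passage pairs to the
# 2022 positive-qrel profile: (i) query length and (ii) the number of query
# words that also occur in the passage (no stemming, no stop-word removal).
def matching_features(query: str, passage: str):
    query_tokens = query.lower().split()
    passage_tokens = set(passage.lower().split())
    length = len(query_tokens)
    overlap = sum(1 for token in query_tokens if token in passage_tokens)
    return length, overlap
```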
Query Statistics
■The table reports the query type together with the average, min and max query length for each category.
■GPT-4-generated queries tend to be much longer than the queries generated using T5.
■Real queries and, in general, queries generated using T5 tend to be shorter than the other query types (a computation sketch follows).
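For reference, per-category length statistics like those in the table can be computed along these lines, assuming a list of (query_type, query_text) pairs and whitespace-tokenised lengths.

```python
# Average, min, and max query length (in words) per query category.
from collections import defaultdict

def length_stats(queries):
    # queries: iterable of (query_type, query_text)
    lengths = defaultdict(list)
    for query_type, text in queries:
        lengths[query_type].append(len(text.split()))
    return {
        query_type: {"avg": sum(ls) / len(ls), "min": min(ls), "max": max(ls)}
        for query_type, ls in lengths.items()
    }
```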
Documents per Relevance Grade
■Using depth-10 pooling to select the documents judged by the NIST assessors (see the pooling sketch below).
■Synthetic queries have far fewer relevant documents (relevance grade > zero) than the real queries.
■Pools constructed using T5 queries also tend to contain significantly more non-relevant documents than the other query types, suggesting that these queries may be more difficult.
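Depth-k pooling itself is standard; a minimal sketch is below, assuming runs in the usual TREC run format (qid Q0 docid rank score tag), sorted by rank within each query.

```python
# Depth-k pooling: for each query, the union of the top-k documents from every
# submitted run forms the pool sent to the NIST assessors.
from collections import defaultdict

def build_pools(run_paths, depth=10):
    pools = defaultdict(set)
    for path in run_paths:
        seen_per_query = defaultdict(int)
        with open(path) as run_file:
            for line in run_file:
                qid, _, docid, *_ = line.split()
                if seen_per_query[qid] < depth:   # assumes rank-sorted run files
                    pools[qid].add(docid)
                    seen_per_query[qid] += 1
    return pools
```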
Reliability of Test Collection
■Comparing the performance of systems using different query types.
■Results obtained using our final test collection are very close to the evaluation results on real queries (Fig. a).
■Synthetic and real queries show similar patterns in terms of evaluation results and system ranking (Fig. b); see the rank-correlation sketch below.
Figure: (a) real queries vs. all (track) queries; (b) real queries vs. generated queries
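One common way to quantify agreement in system ranking (not necessarily the exact analysis behind the figures) is a rank correlation such as Kendall's tau over per-system mean scores; the scores mapping below is a hypothetical structure, e.g. query type and system name to mean NDCG@10.

```python
# Rank systems by mean effectiveness under two query types and measure how
# similar the two orderings are with Kendall's tau.
from scipy.stats import kendalltau

def ranking_agreement(scores, type_a="real", type_b="synthetic"):
    # scores: dict like scores[query_type][system] -> mean NDCG@10
    systems = sorted(scores[type_a])
    scores_a = [scores[type_a][system] for system in systems]
    scores_b = [scores[type_b][system] for system in systems]
    tau, p_value = kendalltau(scores_a, scores_b)
    return tau, p_value
```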
Synthetic Query Bias Analysis
■Analysing whether systems that use an approach similar to the one used for query generation are favoured.
■Synthetic queries may favour systems built on the same language models used for query generation,
○e.g., queries generated using T5 might favour systems that are based on T5.
■Categorising the runs submitted to the track based on the approach they use:
○four different system categories: GPT, T5, GPT + T5, Others.
■While synthetic queries slightly underestimate the performance of models that do not use GPT or T5, this does not seem to have much effect on system ranking (see the sketch below).
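A sketch of how the per-category comparison could be set up, assuming a long-format table with one mean score per (system, category, query_type); the column names and query-type labels are hypothetical, not the track's actual analysis code.

```python
# Average gap between synthetic-query and real-query scores, per system category.
# A consistently negative gap for a category would indicate under-estimation.
import pandas as pd

def bias_by_category(results: pd.DataFrame) -> pd.Series:
    # results columns: system, category (GPT / T5 / GPT + T5 / Others),
    #                  query_type ("real" or "synthetic"), score
    pivot = results.pivot_table(index=["category", "system"],
                                columns="query_type", values="score")
    pivot["delta_vs_real"] = pivot["synthetic"] - pivot["real"]
    return pivot.groupby(level="category")["delta_vs_real"].mean()
```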
Synthetic Query Bias Analysis (Cont.)
■Queries based on GPT-4 slightly over-estimate the performance of systems based on GPT
■Queries generated using T5 exhibit almost no bias towards systems based on T5
■For all system types, T5-based queries tend to be more difficult than real queries
Figure: (a) real queries vs. T5 queries; (b) real queries vs. GPT-4 queries
NIST Qrel vs. Sparse Qrel
■The passage that was used to generate a query can be paired with that query as a "qrel" (Sparse Qrel); see the sketch below.
■Comparing different sources of labels on the same queries.
■Synthetic queries with sparse labels are not reliable.
■Synthetic queries should be judged by humans.
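A minimal sketch of what a sparse qrel file looks like: each synthetic query is paired only with the passage it was generated from, written in the standard TREC qrels format (qid iteration docid relevance).

```python
# Write a "sparse qrel" file in TREC qrels format, treating the generating
# passage as the only (assumed-relevant) document for each synthetic query.
def write_sparse_qrels(query_passage_pairs, path, relevance=1):
    # query_passage_pairs: iterable of (qid, pid) where pid generated the query
    with open(path, "w") as qrels_file:
        for qid, pid in query_passage_pairs:
            qrels_file.write(f"{qid} 0 {pid} {relevance}\n")
```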
Conclusion
■Overall, our initial results suggest that test collections consisting of synthetically generated queries could be reliably used to evaluate system performance.
■Considering different approaches for filtering or selecting high-quality synthetic queries.
■Synthetic queries should be used with manual annotations.
■More analysis is needed to validate these findings, which is left as future work.
Thank you :)
Any Questions?
Hossein A. Rahmani
University College London [email protected]