Hybrid Search with Apache Solr Reciprocal Rank Fusion
SeaseLtd
298 views
31 slides
Jun 19, 2024
Slide 1 of 31
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
About This Presentation
Vector-based search gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective in encoding text to vectors and representing some of the semantics of sentences in a numerical form.
These vectors can be used to run a K-near...
Vector-based search gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective in encoding text to vectors and representing some of the semantics of sentences in a numerical form.
These vectors can be used to run a K-nearest neighbour search and look for documents/paragraphs close to the query in a n-dimensional vector space, effectively mimicking a similarity search in the semantic space (Apache Solr KNN Query Parser).
Although exciting, vector-based search nowadays still presents some limitations:
– it’s very difficult to explain (e.g. why is document A returned and why at position K?)
– it doesn’t care about exact keyword matching (and users still rely on keyword searches a lot)
Hybrid search comes to the rescue, combining lexical (traditional keyword-based) search with neural (vector-based) search.
So, what does it mean to combine these two worlds?
It starts with the retrieval of two sets of candidates:
– one set of results coming from lexical matches with the query keywords
– a set of results coming from the K-Nearest Neighbours search with the query vector
The result sets are merged and a single ranked list of documents is returned to the user.
Reciprocal Rank Fusion (RRF) is one of the most popular algorithms for such a task.
This talk introduces the foundation algorithms involved with RRF and walks you through the work done to implement them in Apache Solr, with a focus on the difficulties of the process, the distributed support(SolrCloud), the main components affected and the limitations faced.
The audience is expected to learn more about this interesting approach, the challenges in it and how the contribution process works for an Open Source search project as complex as Apache Solr.
Size: 2.19 MB
Language: en
Added: Jun 19, 2024
Slides: 31 pages
Slide Content
Hybrid Search with Apache Solr Reciprocal Rank Fusion Speaker: Alessandro Benedetti, Director @ Sease BERLIN BUZZWORDS 2024 - 10 /06/2024
Born in Tarquinia (ancient Etruscan city in Italy) R&D Software Engineer Director Master degree in Computer Science PC member for ECIR , SIGIR and Desires Apache Lucene/Solr PMC member/committer Elasticsearch/OpenSearch expert Semantic search, NLP, Machine Learning technologies passionate Beach Volleyball player and Snowboarder ALESSANDRO BENEDETTI WHO AM I ? ‹#›
Headquarter in London /distributed Open-source Enthusiasts Apache Lucene/Solr experts Elasticsearch/OpenSearch experts Community Contributors Active Researchers Hot Trends : Large Language Models Applications Vector-based (Neural) Search Natural Language Processing Learning To Rank Document Similarity Search Quality Evaluation Relevance Tuning SEArch SErvices www.sease.io ‹#›
Vector-based ( Neural) Search Workflow Similarity between a Query and a Document is translated to distance in a vector space
Low Explainability High Dimensionality - vectors are long sequences of numerical values (768, 1536 …) Dimensions - each feature (element in the vector) has no clear semantic in many cases (slightly different from sparse vectors or explicit feature vectors) Values - It’s not obvious to estimate how a single value impact relevance (higher is better?) Similarity - To explain why a search result is retrieved in the top-K the vector distance is the only info you have Research is happening but it’s still an open problem
Lexical matches? Search users still have the expectation of lexical matches to happen You can’t enforce that with Vector-based search “Why the document with the keyword in the title is not coming up?” cit.
Low Diversity vector-based search just returns the top-k ordered by vector similarity Unless you add more logic on top, you would expect low diversity by definition
Hybrid Search Mitigation of current vector-search problems - Is it here to stay? Combine traditional keyword-based (lexical) search with vector-based (neural) search Retrieval of two sets of candidates: one set of results coming from lexical matches with the query keywords a set of results coming from the K-Nearest Neighbours search with the query vector Ranking of the candidates
Reciprocal Rank Fusion Multiple ranked lists -> unified result set. reciprocal rank -> inverse of the rank of a document in a ranked list of search results. higher ranking -> higher contribution RRF was introduced the first time by Cormack et al. in [1]. [1] Cormack, Gordon V. et al. “Reciprocal rank fusion outperforms condorcet and individual rank learning methods.” Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (2009)
Reciprocal Rank Fusion score = 0.0 for q in queries: if d in result(q): score += 1.0 / ( k + rank( result(q), d ) ) return score
JSON Request solr/core/src/java/org/apache/solr/request/json/RequestUtil.java Just store the query keys as a Solr param (so that we can access later) { "queries": { " lexical1 ": { … }, " lexical2 ": { … } }, … } [“lexical1”,”lexical2”]
Query Component Prepare It has the scope to prepare the variables and structures to use at processing time get the queries to combine from params save them (unparsed query string) in the response builder parse the queries save them parsed in the response builder save the query parsers (class name) in the response builder org.apache.solr.handler.component.QueryComponent#prepare
Query Component Process It has the scope to do the combination of search results (not breaking single query scenarios!) get the parsed queries to combine from response builder retrieve the result sets (one for each query) combine the results save the combined set in the response builder org.apache.solr.handler.component.QueryComponent#process
Queries Combiner org.apache.solr.search.combining.QueriesCombiner It has the scope to combine results sets, abstracting the algorithm init the combined query results keeping track of additional attributes such as Partial Results offer abstract methods for combination explainability public abstract QueryResult combine ( QueryResult [] rankedLists ); public abstract NamedList < Explanation > getExplanations ( String [] queryKeys , List < Query > queries , List < DocList > resultsPerQuery , SolrIndexSearcher searcher , IndexSchema schema )
Reciprocal Rank Fusion org.apache.solr.search.combining.ReciprocalRankFusion It has the scope to implement the specific algorithm combine the search results (>=2) keeping track of additional attributes such as Partial Results explain them how the combination happened, based on the rankings how the original score was calculated
Limitations This is a first step, the current limitations include : Field collapsing/grouping is not supported Distributed search support is included but combination happens per node and may not be ideal.
Debug Component We have two layers of explainability : Query Parsing Results explanation Combiner - how the multiple results sets were combined . Original Scores - how the original scores were obtained
Results Debug '0.032522473 = 1/(60+1) + 1/(60+2) because its ranks were: 1 for query(lexical), 2 for query(lexical2)' + Original Scores
Future Works PR: https://github.com/apache/solr/pull/2489 Distributed tests Documentation —------------------------ better distributed support field collapsing/grouping Additional algorithms?