Hybrid Search with Apache Solr Reciprocal Rank Fusion

SeaseLtd · Jun 19, 2024 · 31 slides

About This Presentation

Vector-based search has gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity have proved quite effective at encoding text into vectors, representing some of the semantics of sentences in numerical form.
These vectors can be used to run a K-near...


Slide Content

Hybrid Search with Apache Solr Reciprocal Rank Fusion
Speaker: Alessandro Benedetti, Director @ Sease
BERLIN BUZZWORDS 2024 - 10/06/2024

WHO AM I? Alessandro Benedetti
- Born in Tarquinia (ancient Etruscan city in Italy)
- R&D Software Engineer, Director
- Master's degree in Computer Science
- PC member for ECIR, SIGIR and DESIRES
- Apache Lucene/Solr PMC member/committer
- Elasticsearch/OpenSearch expert
- Passionate about semantic search, NLP and Machine Learning technologies
- Beach volleyball player and snowboarder

Sease (SEArch SErvices) - www.sease.io
- Headquartered in London / distributed
- Open-source enthusiasts
- Apache Lucene/Solr experts
- Elasticsearch/OpenSearch experts
- Community contributors
- Active researchers
Hot trends: Large Language Model applications, vector-based (neural) search, Natural Language Processing, Learning to Rank, document similarity, search quality evaluation, relevance tuning

Overview
- Limitations of Vector-Based Search
- Reciprocal Rank Fusion
- APIs
- Internals
- Explainability

Vector-based (Neural) Search Workflow
Similarity between a query and a document is translated into distance in a vector space.

Low Explainability
- High dimensionality: vectors are long sequences of numerical values (768, 1536, ...)
- Dimensions: in many cases each feature (element in the vector) has no clear semantics (slightly different from sparse vectors or explicit feature vectors)
- Values: it is not obvious how a single value impacts relevance (is higher better?)
- Similarity: to explain why a search result is retrieved in the top-K, the vector distance is the only information you have
Research is happening, but it is still an open problem.

Lexical Matches?
- Search users still expect lexical matches to happen
- You can't enforce that with vector-based search
- "Why is the document with the keyword in the title not coming up?" cit.

Low Diversity
- Vector-based search just returns the top-K results ordered by vector similarity
- Unless you add more logic on top, you would expect low diversity by definition

Hybrid Search
Mitigation of current vector-search problems - is it here to stay?
Combine traditional keyword-based (lexical) search with vector-based (neural) search:
- Retrieval of two sets of candidates:
  - one set of results coming from lexical matches with the query keywords
  - one set of results coming from the K-Nearest Neighbours search with the query vector
- Ranking of the candidates

Reciprocal Rank Fusion
- Multiple ranked lists -> unified result set
- Reciprocal rank -> the inverse of the rank of a document in a ranked list of search results
- Higher ranking -> higher contribution
RRF was introduced for the first time by Cormack et al. in [1].
[1] Cormack, Gordon V. et al. "Reciprocal rank fusion outperforms Condorcet and individual rank learning methods." Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (2009)

Reciprocal Rank Fusion (pseudocode):

    score = 0.0
    for q in queries:
        if d in result(q):
            score += 1.0 / (k + rank(result(q), d))
    return score

Reciprocal Rank Fusion example, k=10:

    List A: 1. id10  2. id7  3. id9  4. id5  5. id3
    List B: 1. id7   2. id5  3. id9  4. id4  5. id10

    id7  (0.174 = 1/(10+2) + 1/(10+1))
    id10 (0.157 = 1/(10+1) + 1/(10+5))
    id5  (0.155 = 1/(10+4) + 1/(10+2))
    id9  (0.154 = 1/(10+3) + 1/(10+3))
    id4  (0.071 = 1/(10+4))
    id3  (0.067 = 1/(10+5))
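The worked example above can be reproduced with a short, self-contained sketch (plain Python, independent of Solr; the list contents and k=10 come from the slide):

```python
def rrf(ranked_lists, k=10):
    """Reciprocal Rank Fusion: fuse several ranked lists of doc ids.

    Each list is ordered best-first; a document found at rank r
    (1-based) in a list contributes 1 / (k + r) to its fused score.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# the two ranked lists from the example slide
fused = rrf([["id10", "id7", "id9", "id5", "id3"],
             ["id7", "id5", "id9", "id4", "id10"]])
```

Note how id7 ends up first even though it only tops one of the two lists: appearing high in both rankings beats appearing first in just one.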

Open Pull Request: https://github.com/apache/solr/pull/2489
Challenges: N.B. if you are not a full-time contributor, the Lucene/Solr split hits hard!

JSON APIs

    {
      "queries": {
        "lexical1": {
          "lucene": { "query": "id:(10^=2 OR 2^=1 OR 4^=0.5)" }
        },
        "lexical2": {
          "lucene": { "query": "id:(2^=2 OR 4^=1 OR 3^=0.5)" }
        }
      },
      "limit": 10,
      "fields": "[id,score]",
      "params": {
        "combiner": true,
        "combiner.upTo": 5
      }
    }

Trivia: "queries" support was originally introduced to be used in faceting for domain/exclusion: https://issues.apache.org/jira/browse/SOLR-12490
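A request body like the one above can be built programmatically. The sketch below is a hypothetical helper, not part of Solr: the "combiner" and "combiner.upTo" parameter names follow the slide and the open PR, so they may change before the feature is merged.

```python
import json

def hybrid_request_body(queries, limit=10, up_to=5):
    """Build a Solr JSON Request API body that asks for result combination.

    `hybrid_request_body` is an illustrative helper; `queries` maps a
    query key (e.g. "lexical1") to a JSON query DSL object.
    """
    return json.dumps({
        "queries": queries,           # named queries to combine
        "limit": limit,
        "fields": "[id,score]",
        "params": {
            "combiner": True,         # enable result-set combination
            "combiner.upTo": up_to,   # how deep each ranked list is considered
        },
    })

body = hybrid_request_body({
    "lexical1": {"lucene": {"query": "id:(10^=2 OR 2^=1 OR 4^=0.5)"}},
    "lexical2": {"lucene": {"query": "id:(2^=2 OR 4^=1 OR 3^=0.5)"}},
})
```

The resulting string would be POSTed to a collection's select endpoint as `application/json`, the same way as any other JSON Request API query.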

JSON APIs - org.apache.solr.common.params.CombinerParams

    String COMBINER = "combiner";
    String COMBINER_ALGORITHM = COMBINER + ".algorithm";
    String RECIPROCAl_RANK_FUSION = "rrf";
    String COMBINER_UP_TO = COMBINER + ".upTo";
    int COMBINER_UP_TO_DEFAULT = 100;
    String COMBINER_RRF_K = COMBINER + "." + RECIPROCAl_RANK_FUSION + ".k";
    int COMBINER_RRF_K_DEFAULT = 60; // from the original paper

JSON Request - solr/core/src/java/org/apache/solr/request/json/RequestUtil.java
Just store the query keys as a Solr param (so that we can access them later):

    { "queries": { "lexical1": { ... }, "lexical2": { ... } }, ... }  ->  ["lexical1", "lexical2"]

Query Component - Prepare (org.apache.solr.handler.component.QueryComponent#prepare)
Its scope is to prepare the variables and structures to use at processing time:
- get the queries to combine from the params
- save them (unparsed query strings) in the response builder
- parse the queries
- save the parsed queries in the response builder
- save the query parsers (class names) in the response builder

Query Component - Process (org.apache.solr.handler.component.QueryComponent#process)
Its scope is to do the combination of search results (without breaking single-query scenarios!):
- get the parsed queries to combine from the response builder
- retrieve the result sets (one for each query)
- combine the results
- save the combined set in the response builder

Queries Combiner - org.apache.solr.search.combining.QueriesCombiner
Its scope is to combine result sets, abstracting the algorithm:
- init the combined query results, keeping track of additional attributes such as partial results
- offer abstract methods for combination and explainability

    public abstract QueryResult combine(QueryResult[] rankedLists);
    public abstract NamedList<Explanation> getExplanations(String[] queryKeys, List<Query> queries, List<DocList> resultsPerQuery, SolrIndexSearcher searcher, IndexSchema schema)

Reciprocal Rank Fusion - org.apache.solr.search.combining.ReciprocalRankFusion
Its scope is to implement the specific algorithm:
- combine the search results (>=2), keeping track of additional attributes such as partial results
- explain how the combination happened, based on the rankings
- explain how the original score was calculated
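The two classes above can be mirrored in a language-neutral sketch. This is illustrative Python, not the Java in org.apache.solr.search.combining: the method signatures are simplified (plain lists of doc ids instead of QueryResult/DocList) and the explanation format only mimics the style of the debug output shown later.

```python
from abc import ABC, abstractmethod

class QueriesCombiner(ABC):
    """Abstracts how several ranked result lists are merged into one."""

    @abstractmethod
    def combine(self, ranked_lists):
        ...

    @abstractmethod
    def get_explanations(self, query_keys, ranked_lists):
        ...

class ReciprocalRankFusion(QueriesCombiner):
    def __init__(self, k=60):
        self.k = k  # 60 is the default, taken from the original RRF paper

    def _scores(self, ranked_lists):
        scores = {}
        for ranking in ranked_lists:
            for rank, doc in enumerate(ranking, start=1):
                scores[doc] = scores.get(doc, 0.0) + 1.0 / (self.k + rank)
        return scores

    def combine(self, ranked_lists):
        scores = self._scores(ranked_lists)
        return sorted(scores, key=scores.get, reverse=True)

    def get_explanations(self, query_keys, ranked_lists):
        # one explanation string per document, listing each contribution
        explanations = {}
        for key, ranking in zip(query_keys, ranked_lists):
            for rank, doc in enumerate(ranking, start=1):
                part = f"1/({self.k}+{rank}) for query({key})"
                explanations.setdefault(doc, []).append(part)
        return {doc: " + ".join(parts) for doc, parts in explanations.items()}
```

Keeping the scoring in one private method lets combine and get_explanations stay consistent with each other, which is the point of abstracting the algorithm behind a combiner interface.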

Limitations
This is a first step; the current limitations include:
- Field collapsing/grouping is not supported
- Distributed search support is included, but combination happens per node and may not be ideal

Debug Component
We have two layers of explainability:
- Query parsing
- Results explanation:
  - Combiner - how the multiple result sets were combined
  - Original scores - how the original scores were obtained

Query Debug

    <lst name="debug">
      <lst name="queriesToCombine">
        <lst name="lexical">
          <str name="querystring">id:(10^=2 OR 2^=1)</str>
          <str name="queryparser">LuceneQParser</str>
          <str name="parsedquery">ConstantScore(id:10)^2.0 ConstantScore(id:2)^1.0</str>
          <str name="parsedquery_toString">(ConstantScore(id:10))^2.0 (ConstantScore(id:2))^1.0</str>
        </lst>
        <lst name="vector-based">
          <str name="querystring">[1.0, 2.0, 3.0, 4.0]</str>
          <str name="queryparser">KnnQParser</str>
          <str name="parsedquery">KnnFloatVectorQuery(KnnFloatVectorQuery:vector[1.0,...][5])</str>
          <str name="parsedquery_toString">KnnFloatVectorQuery:vector[1.0,...][5]</str>
        </lst>
      </lst>
    </lst>

Results Debug

    '0.032522473 = 1/(60+1) + 1/(60+2) because its ranks were: 1 for query(lexical), 2 for query(lexical2)'

plus the original scores.
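The combined score in that debug string is easy to verify by hand, using the default k=60 seen in CombinerParams:

```python
k = 60  # default combiner.rrf.k, taken from the original paper
# the document was ranked 1st by one query and 2nd by the other
score = 1 / (k + 1) + 1 / (k + 2)
# agrees with the 0.032522473 in the debug output, up to float rounding
```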

Future Works
PR: https://github.com/apache/solr/pull/2489
- Distributed tests
- Documentation
------------------------
- Better distributed support
- Field collapsing/grouping
- Additional algorithms?