From Natural Language to Structured Solr Queries using LLMs

Sease Ltd · 45 slides · Jun 19, 2024

About This Presentation

This talk draws on experimentation to enable AI applications with Solr. One important use case is using AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of access...


Slide Content

From Natural Language to Structured Solr Queries using LLMs. BERLIN BUZZWORDS 2024, 10/06/2024. Speakers: Anna Ruggero, R&D Software Engineer @ Sease; Ilaria Petreti, ML Software Engineer @ Sease

WHO WE ARE ILARIA PETRETI ANNA RUGGERO

SEArch SErvices. Headquartered in London / distributed. Open-source enthusiasts. Apache Lucene/Solr experts. Elasticsearch/OpenSearch experts. Community contributors. Active researchers. HOT TRENDS: Large Language Model applications, Vector-based (Neural) Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning. www.sease.io

AGENDA: Use Case Overview · From Natural Language to Structured Queries · Findings · The Road to Production

WHAT IS A LARGE LANGUAGE MODEL? Transformers. Next-token prediction and masked language modeling. Estimates the likelihood of each possible word (in its vocabulary) given the previous sequence. Learns the statistical structure of language. Pre-trained on huge quantities of text. Fine-tuned for different tasks (following instructions). https://towardsdatascience.com/how-chatgpt-works-the-models-behind-the-bot-1ce5fca96286

LEXICAL PROBLEMS. VOCABULARY MISMATCH PROBLEM: term matching between the query and the documents. False positive: docs retrieved (terms match) but the information need is not satisfied. False negative: docs not retrieved (terms don't match) even though the corpus contained the needed information → zero-result query. SEMANTIC SIMILARITY: same terms, different meaning: How old are you? - How are you? Different terms, same meaning: How old are you? - What is your age? DISAMBIGUATION: the same term in two totally different contexts assumes totally different meanings.

LEXICAL SOLUTIONS. There are some lexical solutions to these problems: manually curated synonyms, hypernyms, and hyponyms; algorithmic stemming and lemmatization; Knowledge Base disambiguation. These solutions are expensive and do not guarantee high-quality results. We can do better!

Query/Document Expansion (Generative/Extractive) Retrieval Augmented Generation Generative Generate synonyms, query reformulations… Extractive Select expansion terms from taxonomies EXPLOIT LLM CAPABILITIES

NATURAL LANGUAGE QUERY PARSING. Query: PM10 levels produced by industries in the European Community in May 2015 → { "filters": { "Country": "European Union (28 countries)#EU28#", "Pollutant": "Particulates (PM10)#PM10#", "Variable": "Total man-made emissions#TOT#|Industrial combustion#STAT_COMB_IND#", "Time Period": "Second trimester (Q2)", "Year": "2015" } }

We have been working with some of our clients to exploit an LLM in order to: Disambiguate the meaning of a user’s natural language query Extract the relevant information Use the extracted information to implement a structured Solr query REAL CASE APPLICATION

ONE OF OUR CLIENTS. An OECD-led initiative (The Organisation for Economic Co-operation and Development): the Statistical Information System Collaboration Community, .Stat Suite and Apache Solr. https://siscc.org/developers/technology/

AGENDA: Use Case Overview · From Natural Language to Structured Queries · Findings · The Road to Production

ARCHITECTURE

ARCHITECTURE: Query → LLM Model (Filters Extraction → Selected Filters; Query Reformulation → Alternative Queries, informed by the List of Fields and Values) → Structured Query → Solr → Relevant Documents → Answer

FIELD/VALUES RETRIEVAL. List of Fields and Values: { "Topic": ["Economy#ECO#", "Economy#ECO#|Productivity#ECO_PRO#", "Agriculture#AGR#", "Government#GOV#", …], "Dimension": ["Reference area", "Time period", "Unit of Measure", "Year", …], "Reference Area": ["Australia#AUS#", "Austria#AUT#", …], etc… }

USER QUERY: What were the sulfur oxide emissions in Australia in 2013?

FILTER EXTRACTION PROMPT. OBJECT REPRESENTATION: provide input data (e.g. a JSON representation) to the model. QUERY PARSING: request a similar representation (i.e. a subset) based on the input query. FORMAL REQUIREMENTS: specify how the output should be formatted, including any constraints or specific criteria to be met.
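The three prompt parts above can be sketched as a simple template. This is a minimal illustration, not the production prompt; the field names and values are abbreviated examples taken from the slides.

```python
import json

# Illustrative subset of the Solr fields/values dictionary (not the full one).
FIELDS_AND_VALUES = {
    "Country": ["Australia#AUS#", "Austria#AUT#"],
    "Pollutant": ["Sulphur Oxides#SOX#", "Particulates (PM10)#PM10#"],
}

def build_extraction_prompt(user_query: str) -> str:
    return (
        # 1. OBJECT REPRESENTATION: give the model the available fields/values
        "Here is a dictionary of Solr fields and their allowed values:\n"
        f"{json.dumps(FIELDS_AND_VALUES, indent=2)}\n\n"
        # 2. QUERY PARSING: ask for the subset relevant to the input query
        f'Select the field/value pairs relevant to this query: "{user_query}"\n\n'
        # 3. FORMAL REQUIREMENTS: constrain the output format
        "Answer with a JSON object only. Use only fields and values that "
        "appear in the dictionary above."
    )

prompt = build_extraction_prompt(
    "What were the sulfur oxide emissions in Australia in 2013?"
)
```

The resulting string would then be sent to the LLM of choice; how that call is made depends on the model API.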

FILTER EXTRACTION. What were the sulfur oxide emissions in Australia in 2013? → Selected Filters: { "Topic": ["Environment#ENV#|Air and climate#ENV_AC#"], "Country": ["Australia#AUS#"], "Variable": ["Total man-made emissions#TOT#"], "Pollutant": ["Sulphur Oxides#SOX#"], "Year": "2013" }

QUERY REFORMULATION PROMPT. Ask the model to provide: different/additional relevant terms, synonyms, variations with the same meaning.

QUERY REFORMULATION. What were the sulfur oxide emissions in Australia in 2013? → ['Sulfur dioxide emissions', 'Air pollution', 'Environmental impact', 'Fossil fuel combustion', 'Acid rain']

STRUCTURED QUERY. SOLR QUERY: q=title:(Sulfur dioxide emissions Air ... Acid rain) OR topic:"Environment#ENV#|Air and climate#ENV_AC#" OR country:"Australia#AUS#" OR variable:"Total man-made emissions#TOT#" OR pollutant:"Sulphur Oxides#SOX#" OR year:"2013"
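Assembling that q parameter from the extracted filters and the reformulated queries can be sketched as follows. This is a simplified illustration under the assumption of one free-text clause over the title field plus one exact-match clause per filter; escaping of special Solr characters is omitted.

```python
# Selected filters and alternative queries as produced in the previous steps
# (values mirror the slide example; the list of reformulations is shortened).
selected_filters = {
    "topic": "Environment#ENV#|Air and climate#ENV_AC#",
    "country": "Australia#AUS#",
    "pollutant": "Sulphur Oxides#SOX#",
    "year": "2013",
}
alternative_queries = ["Sulfur dioxide emissions", "Air pollution", "Acid rain"]

def build_solr_query(filters: dict, reformulations: list) -> str:
    # Free-text clause over the title field, ORed with one exact-match
    # (quoted) clause per extracted filter.
    clauses = [f'title:({" ".join(reformulations)})']
    clauses += [f'{field}:"{value}"' for field, value in filters.items()]
    return " OR ".join(clauses)

q = build_solr_query(selected_filters, alternative_queries)
```

The OR combination keeps recall high: a document can match either the reformulated free text or any of the structured filters.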

DOC RETRIEVAL. SEARCH RESULTS: "response": { "numFound": 1, "start": 0, "numFoundExact": true, "docs": [{ "Title": "Emissions of air pollutants", "Dimension": ["Country", "Pollutant", "Variable", "Year"] }] }
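The search results follow Solr's standard JSON response format, so extracting the matched documents is straightforward. A minimal sketch, using the response shown above:

```python
import json

# Solr's standard JSON response wrapper: the documents live under
# response.docs, alongside numFound/start paging metadata.
response_body = '''{
  "response": {
    "numFound": 1, "start": 0, "numFoundExact": true,
    "docs": [{"Title": "Emissions of air pollutants",
              "Dimension": ["Country", "Pollutant", "Variable", "Year"]}]
  }
}'''

result = json.loads(response_body)
docs = result["response"]["docs"]
titles = [d["Title"] for d in docs]
```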

DSPY LIBRARY. Separates the flow of your program (modules) from the parameters (LM prompts and weights) of each step. Introduces new optimizers to tune the prompts and/or weights of your LM calls, given a metric you want to maximise. LMs and their prompts fade into the background as optimizable pieces of a larger system that can learn from data. https://dspy-docs.vercel.app/docs/intro

DSPY LIBRARY. Is it really as good as it suggests? Partially! https://dspy-docs.vercel.app/docs/intro

AGENDA: Use Case Overview · From Natural Language to Structured Queries · Findings · The Road to Production

MODEL CONSIDERATIONS. [Model Selection]: NOT the most advanced available for this task. [Model Comparison]: no evaluations or comparisons with alternative models → time constraints and limited funding. [Rationale for Current Choice]: promising capabilities and quick implementation. [Future Works]: explore and analyze models that are fine-tuned specifically for our task; potentially undertake our own fine-tuning to optimize model performance.

PROMISING ASPECTS. Overcoming the lexical matching problem: land of kangaroos → [Country] AUSTRALIA; tobacco consumption → [Topic] SMOKING/RISK FACTORS FOR HEALTH

PROMISING ASPECTS. Explainability for selected filters. Analyze input text: "cost per square meter for family houses in italy": cost per square meter → pricing or valuation → 'Priced unit' or 'Value'; family houses → type of property → 'Real estate type'; italy → location → 'Reference area' or 'Borrowers' country'

PROMISING ASPECTS. IDEA! Integrate this explainability as an "Assistant" feature to guide users in choosing the most suitable filters.

PROMISING ASPECTS. Promising potential in early results: a challenging and complex task; good results (using a commercial out-of-the-box model!); straightforward implementation; the model's adaptability to the context.

LIMITATIONS. 1. FUNCTIONAL: LLM weaknesses in language/query semantic comprehension. 2. FORMAL: LLM weaknesses in complying with the problem definition and the required output format.

FUNCTIONAL LIMITATIONS. Difficult to identify the relevant field when other fields share the same values: { "Country": ["All countries", "Europe", "G20", "Asia", "Morocco", ...], "Borrower's Country": ["All countries", "Europe", "G20", "Asia", "Morocco", ...] }

FUNCTIONAL LIMITATIONS. Difficult to identify the relevant field when other fields share the same values: { "Country": ["All countries", "Europe", "G20", "Asia", "Morocco", ...], "Reporting Country": ["All countries", "Europe", "G20", "Asia", "Morocco", ...] }. Difficult to identify relevant fields when highly specialized domain knowledge is required: "Marginal lending facility rate" → [Reference Area] Europe; "IMU tax" → [Sector] Real Estate

FUNCTIONAL LIMITATIONS. Sometimes the right value for a field is not selected even though it is present in the user query. User Query → green growth in Rabat. Explainability → "Country": "Morocco" would be the relevant value if it were listed, but it is not. EXPECTED FIELD: Country → ["All countries", "Europe", "G20", "Asia", "Morocco", ...]

POSSIBLE SOLUTIONS. Refinement of the input Solr dictionary: human-readable field names. "HEDxkkgkqIr" → "Category"; "INST_NON_EDU" → "Non-educational institutions". A win for those who always asked clients to use understandable Solr fields!
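One minimal way to apply this refinement is a translation table mapping opaque codes to readable names, run over the dictionary before it is shown to the LLM. The mapping below reuses the two examples from the slide; the raw dictionary and the extra code `"INST_EDU"` are hypothetical.

```python
# Hypothetical mapping from opaque Solr field/value codes to readable names;
# unknown codes fall through unchanged.
READABLE_NAMES = {
    "HEDxkkgkqIr": "Category",
    "INST_NON_EDU": "Non-educational institutions",
}

def humanize(dictionary: dict) -> dict:
    # Rename both field names (keys) and their listed values.
    return {
        READABLE_NAMES.get(field, field): [
            READABLE_NAMES.get(v, v) for v in values
        ]
        for field, values in dictionary.items()
    }

raw = {"HEDxkkgkqIr": ["INST_NON_EDU", "INST_EDU"]}
readable = humanize(raw)
```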

POSSIBLE SOLUTIONS. Ad-hoc prompt engineering. Expand the prompt with ambiguous/difficult examples and their solutions: "Marginal lending facility rate" → [Reference Area] Europe; "IMU tax" → [Sector] Real Estate. Break down the prompt: one request for topic selection, one for topic values selection, one for dimensions selection, one for dimension values selection.

POSSIBLE SOLUTIONS LLM fine-tuning Better disambiguation Learn the specific and domain-related task

FORMAL LIMITATIONS. LLM hallucinations: field names, e.g. "Instrument" instead of "Type of instruments"; field values, e.g. "Year": "21st century" instead of "2000". Returned fields/values are mixed up: e.g. "Total emissions per capita" is part of a value, not a dimension; "European Union (28 countries)#EU28#" is a valid value present in "Country" but not in "Reference Area".

FORMAL LIMITATIONS Poorly formatted JSON returned Selected Pairs: ```json { "Country": "Australia#AUS#", // land of kangaroos "Pollutant": "Sulphur Oxides#SOX#", "Year": "2013" } ``` These pairs are chosen based on the keywords identified in the input text and the closest matching dimensions and values from the provided dictionary.

POSSIBLE SOLUTIONS. Post-processing to validate and correct the LLM answer. Additional studies of the DSPy library (Typed Predictors, Optimizers). Evaluation of additional libraries and strategies. Fine-tuning the model for the specific task → extraction.
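The post-processing step can be sketched as a small cleanup-and-validation pass over the raw LLM answer: strip the markdown fences and inline comments that caused the "poorly formatted JSON" limitation above, parse what remains, and drop any hallucinated field or value that is not in the input dictionary. The allowed-values set here is an abbreviated illustration, not the full dictionary.

```python
import json
import re

# Abbreviated allowed fields/values (illustrative subset of the dictionary).
ALLOWED = {
    "Country": {"Australia#AUS#", "Austria#AUT#"},
    "Pollutant": {"Sulphur Oxides#SOX#"},
    "Year": {"2013"},
}

def clean_and_validate(raw_answer: str) -> dict:
    text = re.sub(r"```(?:json)?", "", raw_answer)      # drop markdown fences
    text = re.sub(r"//[^\n]*", "", text)                # drop // comments
    start, end = text.index("{"), text.rindex("}") + 1  # keep only the JSON object
    parsed = json.loads(text[start:end])
    # Keep only fields that exist and values actually listed for them,
    # discarding hallucinated names like "Year": "21st century".
    return {f: v for f, v in parsed.items()
            if f in ALLOWED and v in ALLOWED[f]}

raw = '''```json
{
  "Country": "Australia#AUS#", // land of kangaroos
  "Pollutant": "Sulphur Oxides#SOX#",
  "Year": "21st century"
}
```'''
validated = clean_and_validate(raw)
```

Note the comment-stripping regex would also mangle values containing `//` (e.g. URLs), so a production version would need a more careful parser.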

AGENDA: Use Case Overview · From Natural Language to Structured Queries · Findings · The Road to Production

THE ROAD TO PRODUCTION. [UX] Design the user experience: filtering assistance? Transparent query parsing? [LLM] Select the best model to date: can we fine-tune promising models specifically for the task? [LLM] Refine the prompts according to the model: can we use only one request to build the structured query?

THE ROAD TO PRODUCTION. [LLM] Implement integration tests with the most common failures → LLM/prompt engineering to solve them. [LLM] Study additional libraries to make the prompts more "programmed" and "automatically tuned" and less "trial-and-error" (highly dependent on the LLM available). [Performance] Stress-test the solution. [Quality] Set up queries/expected documents.

STAY UP TO DATE SUBSCRIBE TO THE INFORMATION RETRIEVAL NEWSLETTER https://sease.io/our-blog