From Natural Language to Structured Solr Queries using LLMs
SeaseLtd
About This Presentation
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive”) gap remains between the needs of data users and the constraints of data producers.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. A natural-language, conversational engine could facilitate access to and usage of the data by leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, automatically offering users a transparent experience and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
Size: 2.55 MB
Language: en
Added: Jun 19, 2024
Slides: 45 pages
Slide Content
From Natural Language to Structured Solr Queries using LLMs
BERLIN BUZZWORDS 2024 - 10/06/2024
Speakers: Anna Ruggero, R&D Software Engineer @ Sease; Ilaria Petreti, ML Software Engineer @ Sease
WHO WE ARE ILARIA PETRETI ANNA RUGGERO
SEArch SErvices: Headquartered in London / distributed. Open-source Enthusiasts. Apache Lucene/Solr experts. Elasticsearch/OpenSearch experts. Community Contributors. Active Researchers.
HOT TRENDS: Large Language Models Applications, Vector-based (Neural) Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning.
www.sease.io
AGENDA: Use Case Overview / From Natural Language to Structured Queries / Findings / The Road to Production
WHAT IS A LARGE LANGUAGE MODEL
Transformers: next-token prediction and masked language modeling.
- Estimate the likelihood of each possible word (in the vocabulary) given the previous sequence
- Learn the statistical structure of language
- Pre-trained on huge quantities of text
- Fine-tuned for different tasks (following instructions)
https://towardsdatascience.com/how-chatgpt-works-the-models-behind-the-bot-1ce5fca96286
LEXICAL PROBLEMS
VOCABULARY MISMATCH PROBLEM: term matching between the query and the documents.
- false positive: docs retrieved (terms match) that don't satisfy the information need
- false negative: docs not retrieved (terms don't match) even though the corpus held the needed information → zero-result query
SEMANTIC SIMILARITY:
- Same terms, different meaning: "How old are you?" vs "How are you?"
- Different terms, same meaning: "How old are you?" vs "What is your age?"
DISAMBIGUATION: the same term in two totally different contexts assumes totally different meanings.
LEXICAL SOLUTIONS
There are some lexical solutions to these:
- Manually curated synonyms, hypernyms, hyponyms
- Algorithmic stemming and lemmatization
- Knowledge-base disambiguation
These solutions are expensive and do not guarantee high-quality results. We can do better!
{ "filters": { " Country ": " European Union (28 countries)#EU28#” , " Pollutant ": " Particulates (PM10)#PM10# " , " Variable ": " Total man-made emissions#TOT#|Industrial combustion#STAT_COMB_IND# " , " Time Period ": "Second trimester(Q2)", " Year ": "201 5 " } } NATURAL LANGUAGE QUERY PARSING PM10 levels produced by industries in the European Community in May 2015
REAL CASE APPLICATION
We have been working with some of our clients to exploit an LLM in order to:
- Disambiguate the meaning of a user's natural language query
- Extract the relevant information
- Use the extracted information to implement a structured Solr query
ONE OF OUR CLIENTS
An OECD-led initiative (The Organisation for Economic Co-operation and Development): the Statistical Information System Collaboration Community, .Stat Suite and Apache Solr.
https://siscc.org/developers/technology/
AGENDA: Use Case Overview / From Natural Language to Structured Queries / Findings / The Road to Production
ARCHITECTURE
ARCHITECTURE: Query → LLM Model for Filters Extraction (grounded in the List of Fields and Values) → Selected Filters, and Query Reformulation → Alternative Queries; both feed the Structured Query sent to Solr → Relevant Documents → Answer
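A sketch of how these components might chain together in Python; the step functions (extract_filters, reformulate, build_solr_query) are hypothetical names defined in the sketches on the following slides, not code from the deck:

```python
import pysolr

def answer_query(query: str, field_values: dict) -> pysolr.Results:
    # Filters Extraction: the LLM maps the query onto known field/value pairs.
    selected_filters = extract_filters(query, field_values)
    # Query Reformulation: the LLM proposes alternative search terms.
    alternative_queries = reformulate(query)
    # Structured Query: combine both into a single Solr q parameter.
    q = build_solr_query(selected_filters, alternative_queries)
    # Doc Retrieval: run the query and return the relevant documents.
    solr = pysolr.Solr("http://localhost:8983/solr/datasets")  # illustrative URL
    return solr.search(q)
```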
FIELD/VALUES RETRIEVAL: the List of Fields and Values handed to the LLM, e.g.
{
  "Topic": ["Economy#ECO#", "Economy#ECO#|Productivity#ECO_PRO#", "Agriculture#AGR#", "Government#GOV#", …],
  "Dimension": ["Reference area", "Time period", "Unit of Measure", "Year", …],
  "Reference Area": ["Australia#AUS#", "Austria#AUT#", …],
  etc…
}
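The deck does not show how this dictionary is built; one plausible way (an assumption) is to facet over the metadata fields with pysolr:

```python
import pysolr

# Core URL and field names are illustrative, not from the deck.
solr = pysolr.Solr("http://localhost:8983/solr/datasets")

results = solr.search("*:*", rows=0, **{
    "facet": "on",
    "facet.field": ["Topic", "Dimension", "Reference Area"],
    "facet.limit": -1,  # return every distinct value
})

# Solr returns facet_fields as a flat [value, count, value, count, ...] list;
# keep every other entry (the values) to build the field -> values dictionary.
field_values = {
    field: counts[::2]
    for field, counts in results.facets["facet_fields"].items()
}
```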
USER QUERY: "What were the sulfur oxide emissions in Australia in 2013?"
FILTER EXTRACTION: PROMPT
- OBJECT REPRESENTATION: provide input data (e.g. a JSON representation) to the model
- QUERY PARSING: request a similar representation (i.e. a subset) based on the input query
- FORMAL REQUIREMENTS: specify how the output should be formatted, including any constraints or specific criteria to be met
FILTER EXTRACTION: "What were the sulfur oxide emissions in Australia in 2013?" → Selected Filters:
{
  "Topic": ["Environment#ENV#|Air and climate#ENV_AC#"],
  "Country": ["Australia#AUS#"],
  "Variable": ["Total man-made emissions#TOT#"],
  "Pollutant": ["Sulphur Oxides#SOX#"],
  "Year": "2013"
}
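A minimal sketch of this extraction step, assuming an OpenAI-style chat client (the deck only says a commercial out-of-the-box model was used); prompt wording and model name are illustrative:

```python
import json
from openai import OpenAI  # assumption: the deck names no vendor

client = OpenAI()

PROMPT = """\
You are given a dictionary of Solr fields and their allowed values:
{dictionary}

User query: "{query}"

Return a JSON object that is a subset of the dictionary above, keeping only
the field/value pairs relevant to the query. Use only fields and values that
appear in the dictionary. Return the JSON object and nothing else.
"""

def extract_filters(query: str, field_values: dict) -> dict:
    prompt = PROMPT.format(dictionary=json.dumps(field_values), query=query)
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # The model may still wrap this in markdown fences or add comments
    # (see the FORMAL LIMITATIONS slides), so json.loads can fail here.
    return json.loads(answer)
```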
QUERY REFORMULATION: PROMPT
Ask the model to provide:
- different/additional relevant terms
- synonyms
- variations with the same meaning
QUERY REFORMULATION: "What were the sulfur oxide emissions in Australia in 2013?" → ['Sulfur dioxide emissions', 'Air pollution', 'Environmental impact', 'Fossil fuel combustion', 'Acid rain']
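A matching sketch for the reformulation step, with the same illustrative client and prompt wording as above:

```python
import json
from openai import OpenAI  # assumption: any chat-completion API would do

client = OpenAI()

def reformulate(query: str) -> list[str]:
    prompt = (
        f'User query: "{query}"\n'
        "List different/additional relevant terms, synonyms, and variations\n"
        "with the same meaning. Return ONLY a JSON list of strings."
    )
    answer = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return json.loads(answer)

# reformulate("What were the sulfur oxide emissions in Australia in 2013?")
# might return ['Sulfur dioxide emissions', 'Air pollution', ...]
```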
STRUCTURED QUERY: SOLR QUERY
q=title:(Sulfur dioxide emissions Air ... Acid rain) OR topic:"Environment#ENV#|Air and climate#ENV_AC#" OR country:"Australia#AUS#" OR variable:"Total man-made emissions#TOT#" OR pollutant:"Sulphur Oxides#SOX#" OR year:"2013"
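A sketch of how this q parameter could be assembled from the two LLM outputs; lower-casing the field names is an assumption made to match the query above:

```python
def build_solr_query(filters: dict, alternatives: list[str]) -> str:
    # Free-text clause over the reformulated terms...
    clauses = [f'title:({" ".join(alternatives)})']
    # ...OR-ed with one exact-match clause per extracted filter.
    for field, value in filters.items():
        values = value if isinstance(value, list) else [value]
        clauses += [f'{field.lower()}:"{v}"' for v in values]
    return " OR ".join(clauses)

filters = {
    "Topic": ["Environment#ENV#|Air and climate#ENV_AC#"],
    "Country": ["Australia#AUS#"],
    "Variable": ["Total man-made emissions#TOT#"],
    "Pollutant": ["Sulphur Oxides#SOX#"],
    "Year": "2013",
}
print(build_solr_query(filters, ["Sulfur dioxide emissions", "Acid rain"]))
```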
DOC RETRIEVAL: SEARCH RESULTS
"response": {
  "numFound": 1,
  "start": 0,
  "numFoundExact": true,
  "docs": [{
    "Title": "Emissions of air pollutants",
    "Dimension": ["Country", "Pollutant", "Variable", "Year"]
  }]
}
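Running the structured query, e.g. with pysolr (core URL and query excerpt illustrative):

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/datasets")  # illustrative core

q = ('topic:"Environment#ENV#|Air and climate#ENV_AC#" '
     'OR country:"Australia#AUS#" OR pollutant:"Sulphur Oxides#SOX#" '
     'OR year:"2013"')
results = solr.search(q, rows=10)

print(results.hits)  # numFound, e.g. 1
for doc in results:
    print(doc["Title"], doc.get("Dimension"))
```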
DSPY LIBRARY
- Separates the flow of your program (modules) from the parameters (LM prompts and weights) of each step
- Introduces new optimizers to tune the prompts and/or weights of your LM calls, given a metric you want to maximise
- LMs and their prompts fade into the background as optimizable pieces of a larger system that can learn from data
https://dspy-docs.vercel.app/docs/intro
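A minimal DSPy sketch of the filter-extraction step, using the API documented at the link above (model name and field descriptions are illustrative, not from the deck):

```python
import dspy

# Configure the underlying language model (name is illustrative).
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class ExtractFilters(dspy.Signature):
    """Select the field/value pairs from the dictionary that match the query."""
    dictionary = dspy.InputField(desc="JSON of Solr fields and allowed values")
    query = dspy.InputField(desc="the user's natural language query")
    filters = dspy.OutputField(desc="JSON object with the selected pairs")

extract = dspy.Predict(ExtractFilters)
result = extract(
    dictionary='{"Country": ["Australia#AUS#", "Austria#AUT#"]}',
    query="What were the sulfur oxide emissions in Australia in 2013?",
)
print(result.filters)  # the model's selected field/value pairs
```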
DSPY LIBRARY: does it really deliver what it suggests? Partially!
https://dspy-docs.vercel.app/docs/intro
AGENDA: Use Case Overview / From Natural Language to Structured Queries / Findings / The Road to Production
MODEL CONSIDERATIONS
[Model Selection]: NOT the most advanced model available for this task
[Model Comparison]: no evaluations or comparisons with alternative models → time constraints and limited funding
[Rationale for Current Choice]: promising capabilities and quick implementation
[Future Work]: explore and analyze models fine-tuned specifically for our task; potentially undertake our own fine-tuning to optimize model performance
PROMISING ASPECTS
Overcomes lexical matching:
"land of kangaroos" → [Country] AUSTRALIA
"tobacco consumption" → [Topic] SMOKING / RISK FACTORS FOR HEALTH
PROMISING ASPECTS
Explainability for the selected filters. Analyze input text: "cost per square meter for family houses in italy"
- cost per square meter → pricing or valuation → 'Priced unit' or 'Value'
- family houses → type of property → 'Real estate type'
- italy → location → 'Reference area' or 'Borrowers' country'
IDEA! Integrate this as an "Assistant" feature to guide users in choosing the most suitable filters.
PROMISING ASPECTS
Promising potential in early results:
- challenging and complex task
- good results (using a commercial out-of-the-box model!)
- straightforward implementation
- the model's adaptability to the context
LIMITATIONS (in this Retrieval Augmented Generation setting)
1. FUNCTIONAL: LLM weaknesses in language/query semantic comprehension
2. FORMAL: LLM weaknesses in complying with the problem definition and the required output format
FUNCTIONAL LIMITATIONS
- Difficult to identify the relevant field when several fields share the same values:
  { "Country": ["All countries", "Europe", "G20", "Asia", "Morocco", ...],
    "Reporting Country": ["All countries", "Europe", "G20", "Asia", "Morocco", ...] }
- Difficult to identify relevant fields when highly specialized domain knowledge is required:
  "Marginal lending facility rate" → [Reference Area] Europe
  "IMU tax" → [Sector] Real Estate
- Sometimes the right value for a field is not selected even though it is present in the user query:
  User query → "green growth in Rabat"
  Expected field: Country → ["All countries", "Europe", "G20", "Asia", "Morocco", ...]
  Explainability → '"Country": "Morocco" would be the relevant value if it were listed, but it is not.'
POSSIBLE SOLUTIONS
Refinement of the input Solr dictionary: human-readable field names
"HEDxkkgkqIr" → "Category"
"INST_NON_EDU" → "Non-educational institutions"
A win for those who always asked clients to use understandable Solr fields!
POSSIBLE SOLUTIONS
Ad-hoc prompt engineering:
- Expand the prompt with ambiguous/difficult examples and their solutions:
  "Marginal lending facility rate" → [Reference Area] Europe
  "IMU tax" → [Sector] Real Estate
- Break down the prompt: one request for topic selection, one for topic value selection, one for dimension selection, one for dimension value selection (one possible decomposition is sketched below)
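A sketch of one possible decomposition (an assumption; the deck does not show its prompts), where ask_llm is a hypothetical helper that sends one prompt and returns the parsed JSON answer:

```python
def extract_stepwise(query: str, dictionary: dict, ask_llm) -> dict:
    # ask_llm(prompt) -> parsed JSON answer is assumed to exist.
    out = {}
    # 1. Topic selection as one small, focused request.
    out["Topic"] = ask_llm(
        f'Query: "{query}". Select the relevant topics from '
        f'{dictionary["Topic"]}. Return a JSON list.')
    # 2. Dimension selection as a second request.
    dims = ask_llm(
        f'Query: "{query}". Select the relevant dimensions from '
        f'{dictionary["Dimension"]}. Return a JSON list.')
    # 3. One small request per selected dimension for its values.
    for dim in dims:
        out[dim] = ask_llm(
            f'Query: "{query}". Select the relevant values for "{dim}" from '
            f'{dictionary[dim]}. Return a JSON list.')
    return out
```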
POSSIBLE SOLUTIONS
LLM fine-tuning: better disambiguation; learn the specific, domain-related task
FORMAL LIMITATIONS
LLM hallucinations:
- Field names, e.g. "Instrument" instead of "Type of instruments"
- Field values, e.g. "Year": "21st century" instead of "2000"
Returned field/value pairs are mixed up:
- "Total emissions per capita" is part of a value, not a dimension
- "European Union (28 countries)#EU28#" is a valid value present in "Country" but not in "Reference Area"
FORMAL LIMITATIONS
Poorly formatted JSON returned:
Selected Pairs:
```json
{
  "Country": "Australia#AUS#", // land of kangaroos
  "Pollutant": "Sulphur Oxides#SOX#",
  "Year": "2013"
}
```
These pairs are chosen based on the keywords identified in the input text and the closest matching dimensions and values from the provided dictionary.
POSSIBLE SOLUTIONS
- Post-processing to validate and correct the LLM answer (a sketch follows below)
- Additional studies of the DSPy library (Typed Predictors, Optimizers)
- Evaluation of additional libraries and strategies
- Fine-tuning the model for the specific task → extraction
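A naive post-processing sketch (an assumption, not the deck's code): recover a JSON object from a noisy LLM answer like the one above, then drop hallucinated field names and values by checking them against the Solr dictionary:

```python
import json
import re

def sanitize_llm_answer(answer: str, dictionary: dict) -> dict:
    # The model may wrap the JSON in markdown fences, add // comments, or
    # append prose, so isolate the first {...} block and strip comments.
    match = re.search(r"\{.*\}", answer, flags=re.S)
    if not match:
        return {}
    raw = json.loads(re.sub(r"//.*", "", match.group(0)))
    # Keep only field names and values that exist in the Solr dictionary,
    # discarding hallucinated fields/values.
    clean = {}
    for field, values in raw.items():
        values = values if isinstance(values, list) else [values]
        kept = [v for v in values if v in dictionary.get(field, [])]
        if kept:
            clean[field] = kept
    return clean
```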
AGENDA: Use Case Overview / From Natural Language to Structured Queries / Findings / The Road to Production
THE ROAD TO PRODUCTION
[UX] Design the user experience: filtering assistance? transparent query parsing?
[LLM] Select the best model to date: can we fine-tune promising models specifically for the task?
[LLM] Refine the prompts according to the model: can we use only one request to build the structured query?
THE ROAD TO PRODUCTION
[LLM] Implement integration tests with the most common failures → LLM/prompt engineering to solve them
[LLM] Study additional libraries to make the prompt more "programmed" and "automatically tuned" and less "trial-and-error" (highly dependent on the available LLM)
[Performance] Stress-test the solution
[Quality] Set up queries/expected-documents pairs
STAY UP TO DATE SUBSCRIBE TO THE INFORMATION RETRIEVAL NEWSLETTER https://sease.io/our-blog