Most AI teams are still exploring what LLMs can do rather than focusing on margins, but efficiency will soon become important. Deploying small, specialized models to solve specific problems is an option, yet it is rarely leveraged because it requires high volumes of human-labeled training data, which are hard to acquire. To alleviate this problem, I will discuss how large language models can be used to generate synthetic data that helps tune small models on domain-specific tasks. We will focus on an extractive question answering use case, where additional unstructured context can aid training.
Slide Content
Specializing Small Language Models With Less Data
Jacek Golebiowski
Sr Machine Learning Scientist, AWS
AI landscape - agentic design
End-to-end AI products rely on an agentic architecture in which LLMs can use many tools, many of which are themselves agents.
Tools (or agents) that solve narrow tasks don't require an LLM; a specialized, task-specific SLM performs better.
Image from https://haystack.deepset.ai
AI landscape - agentic design
Let's imagine we supply a Tool called 'ExtractiveQATool' to our Agent. When we ask the Agent a question, here's what the output might look like:
Image from https://haystack.deepset.ai
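To make the idea concrete, here is a minimal sketch of a narrow tool backed by a small extractive QA model rather than an LLM. The Tool/Agent plumbing is hypothetical (it is not the Haystack API shown in the slide image); only the Hugging Face question-answering pipeline and the model name are real, and the model choice is just illustrative.

# Sketch: a specialized SLM wrapped as a tool an agent could call.
from transformers import pipeline


class ExtractiveQATool:
    """Narrow tool backed by a small extractive QA model instead of an LLM."""

    name = "ExtractiveQATool"

    def __init__(self, model_name: str = "distilbert-base-cased-distilled-squad"):
        self.qa = pipeline("question-answering", model=model_name)

    def run(self, question: str, context: str) -> str:
        # Return the answer span the SLM extracts from the context.
        return self.qa(question=question, context=context)["answer"]


if __name__ == "__main__":
    tool = ExtractiveQATool()
    context = (
        "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
        "reputedly appeared to Saint Bernadette Soubirous in 1858."
    )
    print(tool.run("To whom did the Virgin Mary allegedly appear in 1858?", context))
    # Expected answer span: Saint Bernadette Soubirous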
How to build efficient NLP tools?
Building an NLP tool is no different from building a conventional ML model: it requires plenty of data!
Image source: https://medium.com/@clozymwangs/natural-language-processing-33c8a988a91e
Solution - model distillation
Model distillation helps train small models with the help of large models, thus requiring less data.
How does it work? Extractive QA use case
Context: 'Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
We will fine-tune an ML model on the SQuAD dataset, which consists of questions posed by crowdworkers on a set of Wikipedia articles.
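As a reference point (my addition, not from the slides), the sketch below loads SQuAD with the Hugging Face datasets library and answers the example above with an off-the-shelf small extractive QA model; the model name is an illustrative choice, and the first training example is typically the Notre Dame / Lourdes passage shown on the slide.

# Sketch: inspect a SQuAD example and answer it with a small extractive QA model.
from datasets import load_dataset
from transformers import pipeline

squad = load_dataset("squad", split="train")
example = squad[0]  # typically the Notre Dame / Lourdes example from the slide
print(example["question"])
print(example["answers"])  # {'text': ['Saint Bernadette Soubirous'], 'answer_start': [...]}

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
pred = qa(question=example["question"], context=example["context"])
print(pred["answer"], pred["score"])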
Realistic extractive QA data
(The same SQuAD context, question, and answer as shown above.)
Generating synthetic data for QA
We can use an LLM to cheaply generate new data.
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Prompt: Generate a question that could be answered using the context and the answer extracted from the context.
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
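A minimal sketch of this generation step is shown below. The call_llm function is a placeholder for whatever endpoint you use (the talk mentions Llama 70B, but the client is not shown on the slides), and the prompt wording and JSON output format are my assumptions.

# Sketch: prompt an LLM to produce (question, answer) pairs grounded in a context.
# call_llm is a placeholder, not a real library call.
import json

PROMPT_TEMPLATE = """Context: {context}

Generate a question that could be answered using the context, and the answer
extracted verbatim from the context. Respond as JSON:
{{"question": "...", "answer": "..."}}"""


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client of choice.")


def generate_qa_pairs(contexts, samples_per_context=3):
    pairs = []
    for context in contexts:
        for _ in range(samples_per_context):
            raw = call_llm(PROMPT_TEMPLATE.format(context=context))
            try:
                record = json.loads(raw)
                pairs.append({"context": context, **record})
            except json.JSONDecodeError:
                continue  # malformed generations are dropped here, validated further below
    return pairs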
Validating synthetic data for QA
Generated data is not always good. We need to validate to remove:
- Duplicates

Example of a duplicated pair (the same generated example appears twice):
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
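A simple starting point (my sketch, not the talk's exact implementation) is to drop exact duplicates after normalizing whitespace and case; near-duplicates need a similarity measure, which comes back in the open questions at the end.

# Sketch: drop exact duplicates by normalizing (context, question, answer) triples.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def drop_duplicates(pairs):
    seen, unique = set(), []
    for pair in pairs:
        key = (normalize(pair["context"]), normalize(pair["question"]), normalize(pair["answer"]))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique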
Validating synthetic data for QA
Generated data is not always good. We need to validate to remove:
- Duplicates
- Incorrect examples

Example of an incorrect generation:
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: I am an LLM
Answer: ???
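One way to catch such cases (a sketch under my own assumptions, not the talk's exact check) is a round-trip consistency test: ask a small QA model to answer the generated question against the context and keep the pair only if it recovers the generated answer. The heuristics, model name, and score threshold are illustrative.

# Sketch: flag incorrect generations with a round-trip check using a small QA model.
from transformers import pipeline

qa_checker = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")


def is_consistent(pair, min_score: float = 0.2) -> bool:
    question = pair["question"].strip()
    if not question.endswith("?"):  # degenerate outputs like "I am an LLM"
        return False
    pred = qa_checker(question=question, context=pair["context"])
    same_answer = pred["answer"].strip().lower() == pair["answer"].strip().lower()
    return same_answer and pred["score"] >= min_score


def drop_incorrect(pairs):
    return [p for p in pairs if is_consistent(p)]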
Validating synthetic data for QA
Generated data is not always good. We need to validate to remove:
- Duplicates
- Incorrect examples
- Data not in the right structure

Example of a badly structured generation (the answer is not a verbatim span of the context):
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: The answer to the question is St Bernadette as seen in the context
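Extractive QA training needs answers that map to character offsets in the context, so a structural check (my sketch, with the SQuAD-style record layout as the assumed target format) can simply require the answer to appear verbatim in the context.

# Sketch: enforce the structure extractive QA training needs. The answer must appear
# verbatim in the context so it can be turned into an answer_start offset.
def to_squad_record(pair, record_id: str):
    answer_start = pair["context"].find(pair["answer"])
    if answer_start == -1:
        return None  # e.g. "The answer to the question is St Bernadette ..." is rejected
    return {
        "id": record_id,
        "context": pair["context"],
        "question": pair["question"],
        "answers": {"text": [pair["answer"]], "answer_start": [answer_start]},
    }


def drop_badly_structured(pairs):
    records = (to_squad_record(p, str(i)) for i, p in enumerate(pairs))
    return [r for r in records if r is not None]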
Extractive QA benchmark
We benchmarked model specialization on the industry-standard SQuAD dataset. Tested:
1/ Llama 70B used to generate synthetic data from the contexts, with a 7B model tuned on that synthetic data plus 500 examples from SQuAD
2/ a 7B model tuned on the actual SQuAD train set
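For reference, extractive QA on SQuAD is usually scored with exact match and F1. The sketch below (my addition) shows one way to compute those scores with the Hugging Face evaluate library; the model name and the 200-example validation slice are illustrative choices, not the setup reported in the talk.

# Sketch: score a QA model on a slice of SQuAD validation with exact match and F1.
import evaluate
from datasets import load_dataset
from transformers import pipeline

squad_metric = evaluate.load("squad")
validation = load_dataset("squad", split="validation[:200]")  # small slice for a quick check
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

predictions, references = [], []
for ex in validation:
    pred = qa(question=ex["question"], context=ex["context"])
    predictions.append({"id": ex["id"], "prediction_text": pred["answer"]})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': ..., 'f1': ...}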
Open questions
- How do we diversify the generated data?
  - How do we remove similar (but not identical) examples while avoiding removing informative samples that merely seem similar? (See the sketch after this list.)
  - We can remove duplicates, but this slows down generation, because the model keeps producing examples that will be removed. We need to guide the model to generate more diverse examples.
- How do we avoid catastrophic forgetting?
  - The models we train start from a pre-trained baseline. We must prevent the model from forgetting its original capabilities in favor of memorizing the new data.
- How do we better validate examples?
  - How can we validate more subtle errors in the data, beyond just flagging examples with clearly incorrect components?
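On the "similar but not identical" question, one common starting point (a sketch of my own, not a solution proposed in the talk) is embedding-based near-duplicate filtering; the model name and similarity threshold below are illustrative, and the trade-off with losing informative samples remains open.

# Sketch: near-duplicate filtering of generated questions with sentence embeddings.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def drop_near_duplicates(pairs, threshold: float = 0.95):
    questions = [p["question"] for p in pairs]
    embeddings = embedder.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept, kept_embeddings = [], []
    for pair, emb in zip(pairs, embeddings):
        if kept_embeddings and max(float(util.cos_sim(emb, e)) for e in kept_embeddings) >= threshold:
            continue  # too close to a question we already kept
        kept.append(pair)
        kept_embeddings.append(emb)
    return kept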
Thank you!
Any questions?
Jacek Golebiowski
https://www.linkedin.com/in/jacek-golebiowski [email protected]