Specializing Small Language Models With Less Data


About This Presentation

Most AI teams are exploring the possibilities of LLMs rather than focusing on margins, but efficiency will soon become important. Implementing small, specialized models to solve specific problems is an option, but it is not often leveraged because it requires gathering high volumes of human-label...


Slide Content

Specializing Small Language Models With Less Data
Jacek Golebiowski
Sr Machine Learning Scientist, AWS

AI landscape - agentic design
End-to-end AI products rely on an agentic architecture where LLMs can use many tools, many of which are agents.

Tools (or agents) solving narrow tasks don't require an LLM; a specialized, task-specific SLM performs better.
Image from https://haystack.deepset.ai

AI landscape - agentic design
Let's imagine we supply a Tool called 'ExtractiveQATool' to our Agent. When we ask the Agent a question, here's what the output might look like:
Image from https://haystack.deepset.ai
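
To make the pattern concrete, here is a minimal sketch of a narrow tool wired into an agent. The class names and method signatures are illustrative placeholders, not the Haystack API pictured on the slide:

class ExtractiveQATool:
    """Narrow tool backed by a small, task-specific model (an SLM)."""
    name = "ExtractiveQATool"
    description = "Extracts an answer span from a given context."

    def __init__(self, qa_model):
        self.qa_model = qa_model  # e.g. a fine-tuned extractive QA model

    def run(self, question: str, context: str) -> str:
        # The SLM does the narrow work; no LLM call is needed here.
        return self.qa_model(question=question, context=context)["answer"]


class Agent:
    """LLM-driven controller that delegates narrow sub-tasks to tools."""

    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = {tool.name: tool for tool in tools}

    def answer(self, question: str, context: str) -> str:
        # In a real agent the LLM would decide which tool to call;
        # here we route directly for brevity.
        return self.tools["ExtractiveQATool"].run(question, context)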

How to build efficient NLP tools?
Building an NLP tool is no different from building a conventional ML model. It requires plenty of data!
Image source: https://medium.com/@clozymwangs/natural-language-processing-33c8a988a91e

Solution - model distillation
Model distillation helps train small models with the help of large models, thus using less data.
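
The recipe, stated as code: a large teacher model produces labeled examples, and a small student model is fine-tuned on them. A minimal sketch in which every callable is a placeholder for a concrete model or training routine (none of these names come from the slides):

def distill(teacher_generate, validate, fine_tune, contexts, seed_examples):
    synthetic = [teacher_generate(c) for c in contexts]   # large model creates data
    clean = [ex for ex in synthetic if validate(ex)]      # filter bad generations
    return fine_tune(data=seed_examples + clean)          # small model learns from it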

How does it work? Extractive QA use case
We will fine-tune an ML model on the SQuAD dataset, which consists of questions posed by crowdworkers on a set of Wikipedia articles.

Context: 'Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
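
For reference, SQuAD is available through the Hugging Face datasets library; each record carries a Wikipedia context, a crowdworker question, and the gold answer span with its character offset:

from datasets import load_dataset

squad = load_dataset("squad")                 # crowdworker questions over Wikipedia
example = squad["train"][0]
print(example["context"][:80])                # passage to extract the answer from
print(example["question"])                    # crowdworker question
print(example["answers"]["text"][0])          # gold answer span
print(example["answers"]["answer_start"][0])  # character offset into the context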

Realistic extractive QA data
The same SQuAD example as above: realistic extractive QA data pairs a full Wikipedia passage as the context with a crowdworker question and a short answer span extracted verbatim from that passage.

Generating synthetic data for QA
We can use an LLM to cheaply generate new data.

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858…'
Prompt: Generate a question that could be answered using the context and the answer extracted from the context.
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
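
The generation step can be a thin loop around any LLM endpoint. In this sketch, `generate` is a placeholder for whatever completion API is used (the slides later mention Llama 70B as the teacher), and the JSON output format is an assumption made for easy parsing:

import json

PROMPT_TEMPLATE = (
    "Generate a question that could be answered using the context, "
    "and the answer extracted from the context.\n"
    'Return JSON like {{"question": "...", "answer": "..."}}.\n'
    "Context: {context}"
)

def synthesize_qa_pairs(contexts, generate, n_per_context=3):
    pairs = []
    for context in contexts:
        for _ in range(n_per_context):
            raw = generate(PROMPT_TEMPLATE.format(context=context))
            pair = json.loads(raw)  # assumes the teacher returned valid JSON
            pairs.append({"context": context, **pair})
    return pairs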

Validating synthetic data for QA

Generated data is not always good. We need to validate to remove:
- Duplicates: the same generated pair can appear more than once, e.g.

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858…'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858…'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
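
Exact duplicates like the pair above are cheap to drop with a hash set over normalized text. The normalization choices here (lowercasing, whitespace collapsing) are assumptions, not from the slides:

def _normalize(pair):
    question = " ".join(pair["question"].lower().split())
    answer = " ".join(pair["answer"].lower().split())
    return (question, answer)

def drop_duplicates(pairs):
    seen, unique = set(), []
    for pair in pairs:
        key = _normalize(pair)
        if key not in seen:      # keep only the first occurrence
            seen.add(key)
            unique.append(pair)
    return unique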


Validating synthetic data for QA

Generated data is not always good. We need to validate to remove:
- Duplicates
- Incorrect examples, e.g. a degenerate generation:

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858…'
Question: I am an LLM
Answer: ???
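
Degenerate generations like this one can be caught with simple heuristics. For extractive QA the key property is that the answer must be a literal span of the context; the extra question-mark check below is an illustrative assumption:

def is_valid_pair(pair):
    context = pair["context"].lower()
    answer = pair["answer"].strip().lower()
    question = pair["question"].strip()
    if not answer or answer not in context:
        return False  # answer is not extractable from the context
    if not question.endswith("?"):
        return False  # degenerate outputs such as "I am an LLM"
    return True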


Validating synthetic data for QA

Generated data is not always good. We need to validate to remove:
- Duplicates
- Incorrect examples
- Data not in the right structure, e.g. an answer that is not a literal span of the context:

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858…'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: The answer to the question is St Bernadette as seen in the context
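
Structure problems can be caught at parse time, before the content checks run. A sketch of a schema-level gate (field names match the generation prompt sketched earlier; the checks themselves are assumptions):

import json

REQUIRED_KEYS = {"question", "answer"}

def parse_generation(raw):
    """Return a dict for well-formed generations, None otherwise."""
    try:
        pair = json.loads(raw)
    except json.JSONDecodeError:
        return None               # teacher did not return valid JSON
    if not REQUIRED_KEYS <= pair.keys():
        return None               # missing fields
    if not all(isinstance(pair[k], str) for k in REQUIRED_KEYS):
        return None               # wrong field types
    return pair

Note that a verbose answer like the one above is well-formed JSON but still fails the span check in is_valid_pair, since it is not a substring of the context.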



Extractive QA benchmark
We have benchmarked the model specialization on the industry-standard SQuAD dataset. Tested:

1/ Llama 70B used to generate synthetic data from contexts

2/ 7B model tuned on synthetic data and 500 examples from SQuAD

3/ 7B model tuned on the actual SQuAD train set
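
The comparison as code, with placeholder fine_tune/evaluate callables standing in for whatever training and evaluation stack is used (the slides do not specify one):

def run_benchmark(fine_tune, evaluate, seed_500, synthetic, full_train, dev_set):
    # Distilled setup: 500 human-labeled examples + validated synthetic data.
    distilled = fine_tune(base="7B", data=seed_500 + synthetic)
    # Baseline: the full human-labeled SQuAD train set.
    baseline = fine_tune(base="7B", data=full_train)
    return {
        "distilled (500 + synthetic)": evaluate(distilled, dev_set),
        "full SQuAD train set": evaluate(baseline, dev_set),
    }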

Open questions
- How to diversify generated data?
  - How do we remove similar (but not identical) examples while avoiding removing informative samples that merely seem similar? (One embedding-based approach is sketched after this list.)
  - We can remove duplicates, but this slows down generation (the model generates examples that will be removed). We need to guide our model to generate more diverse examples.
- How to avoid catastrophic forgetting?
  - Models we train start from a pre-trained baseline. We must avoid the model forgetting its original programming in favor of memorizing the data.
- How to better validate examples?
  - How can we validate more subtle errors in the data, beyond just flagging those with clearly incorrect components?
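
On the near-duplicate question, one common approach (an assumption here, not from the slides) is to embed each generated question and reject any whose cosine similarity to an already-kept question exceeds a threshold. `embed` is a placeholder for any sentence-embedding model, and the 0.9 threshold is illustrative:

import numpy as np

def drop_near_duplicates(pairs, embed, threshold=0.9):
    kept, kept_vectors = [], []
    for pair in pairs:
        vector = np.asarray(embed(pair["question"]), dtype=float)
        vector = vector / np.linalg.norm(vector)  # unit-normalize for cosine
        if all(float(vector @ other) < threshold for other in kept_vectors):
            kept.append(pair)
            kept_vectors.append(vector)
    return kept

The trade-off named above still applies: a lower threshold yields more diversity but risks discarding informative examples that merely look similar.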

Thank you!
Any questions?

Jacek Golebiowski
https://www.linkedin.com/in/jacek-golebiowski
[email protected]