Most AI teams are still exploring what LLMs can do rather than focusing on margins, but efficiency will soon become important. Deploying small, specialized models to solve specific problems is an option, yet it is rarely leveraged because it requires high volumes of human-labeled training data, which are hard to acquire. To alleviate this problem, I will discuss how large language models can be used to generate synthetic data that helps tune small models on domain-specific tasks. We will focus on an extractive question answering use case, where additional unstructured context can aid training.
Slide Content
Specializing Small Language Models With Less Data
Jacek Golebiowski
Sr Machine Learning Scientist, AWS
AI landscape - agentic design
End-to-end AI products rely on an agentic architecture in which LLMs can use many tools, many of which are themselves agents.
Tools (or agents) that solve narrow tasks don't require an LLM; a specialized, task-specific SLM performs better.
Image from https://haystack.deepset.ai
AI landscape - agentic design
Let's imagine we supply a Tool called 'ExtractiveQATool' to our Agent. When we ask the Agent a question, here's what the output might look like:
Image from https://haystack.deepset.ai
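To make the idea concrete, here is a minimal sketch of a narrow tool backed by a small extractive QA model rather than an LLM. The Tool/Agent plumbing is hypothetical (it is not the Haystack API shown in the slide image); only the Hugging Face question-answering pipeline and the model name are real, and the model choice is just illustrative.

# Sketch: a specialized SLM wrapped as a tool an agent could call.
from transformers import pipeline


class ExtractiveQATool:
    """Narrow tool backed by a small extractive QA model instead of an LLM."""

    name = "ExtractiveQATool"

    def __init__(self, model_name: str = "distilbert-base-cased-distilled-squad"):
        self.qa = pipeline("question-answering", model=model_name)

    def run(self, question: str, context: str) -> str:
        # Return the answer span the SLM extracts from the context.
        return self.qa(question=question, context=context)["answer"]


if __name__ == "__main__":
    tool = ExtractiveQATool()
    context = (
        "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
        "reputedly appeared to Saint Bernadette Soubirous in 1858."
    )
    print(tool.run("To whom did the Virgin Mary allegedly appear in 1858?", context))
    # Expected answer span: Saint Bernadette Soubirous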
How to build efficient NLP tools?
Building an NLP tool is no different from building a conventional ML model: it requires plenty of data!
Image source: https://medium.com/@clozymwangs/natural-language-processing-33c8a988a91e
Solution - model distillation
Model distillation helps train small models with the help of large models, thus requiring less data.
How does it work? Extractive QA use case
Context: 'Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
We will fine-tune an ML model on the SQuAD dataset, which consists of questions posed by crowdworkers on a set of Wikipedia articles.
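As a reference point (my addition, not from the slides), the sketch below loads SQuAD with the Hugging Face datasets library and answers the example above with an off-the-shelf small extractive QA model; the model name is an illustrative choice, and the first training example is typically the Notre Dame / Lourdes passage shown on the slide.

# Sketch: inspect a SQuAD example and answer it with a small extractive QA model.
from datasets import load_dataset
from transformers import pipeline

squad = load_dataset("squad", split="train")
example = squad[0]  # typically the Notre Dame / Lourdes example from the slide
print(example["question"])
print(example["answers"])  # {'text': ['Saint Bernadette Soubirous'], 'answer_start': [...]}

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
pred = qa(question=example["question"], context=example["context"])
print(pred["answer"], pred["score"])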
Realistic extractive QA data
(The same SQuAD context, question, and answer as shown above.)
Generating synthetic data for QA
We can use an LLM to cheaply generate new data.
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Prompt: Generate a question that could be answered using the context and the answer extracted from the context.
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
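A minimal sketch of this generation step is shown below. The call_llm function is a placeholder for whatever endpoint you use (the talk mentions Llama 70B, but the client is not shown on the slides), and the prompt wording and JSON output format are my assumptions.

# Sketch: prompt an LLM to produce (question, answer) pairs grounded in a context.
# call_llm is a placeholder, not a real library call.
import json

PROMPT_TEMPLATE = """Context: {context}

Generate a question that could be answered using the context, and the answer
extracted verbatim from the context. Respond as JSON:
{{"question": "...", "answer": "..."}}"""


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Replace with your LLM client of choice.")


def generate_qa_pairs(contexts, samples_per_context=3):
    pairs = []
    for context in contexts:
        for _ in range(samples_per_context):
            raw = call_llm(PROMPT_TEMPLATE.format(context=context))
            try:
                record = json.loads(raw)
                pairs.append({"context": context, **record})
            except json.JSONDecodeError:
                continue  # malformed generations are dropped here, validated further below
    return pairs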
Validating synthetic data for QA
Generated data is not always good. We need to validate to remove:
- Duplicates

Example of a duplicated pair (the same generated example appears twice):
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous

Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: Saint Bernadette Soubirous
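A simple starting point (my sketch, not the talk's exact implementation) is to drop exact duplicates after normalizing whitespace and case; near-duplicates need a similarity measure, which comes back in the open questions at the end.

# Sketch: drop exact duplicates by normalizing (context, question, answer) triples.
def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def drop_duplicates(pairs):
    seen, unique = set(), []
    for pair in pairs:
        key = (normalize(pair["context"]), normalize(pair["question"]), normalize(pair["answer"]))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique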
Validating synthetic data for QA
Generated data is not always good. We need to validate to remove:
- Duplicates
- Incorrect examples

Example of an incorrect generation:
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: I am an LLM
Answer: ???
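One way to catch such cases (a sketch under my own assumptions, not the talk's exact check) is a round-trip consistency test: ask a small QA model to answer the generated question against the context and keep the pair only if it recovers the generated answer. The heuristics, model name, and score threshold are illustrative.

# Sketch: flag incorrect generations with a round-trip check using a small QA model.
from transformers import pipeline

qa_checker = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")


def is_consistent(pair, min_score: float = 0.2) -> bool:
    question = pair["question"].strip()
    if not question.endswith("?"):  # degenerate outputs like "I am an LLM"
        return False
    pred = qa_checker(question=question, context=pair["context"])
    same_answer = pred["answer"].strip().lower() == pair["answer"].strip().lower()
    return same_answer and pred["score"] >= min_score


def drop_incorrect(pairs):
    return [p for p in pairs if is_consistent(p)]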
Validating synthetic data for QA
Generated data is not always good. We need to validate to remove:
- Duplicates
- Incorrect examples
- Data not in the right structure

Example of a badly structured generation (the answer is not a verbatim span of the context):
Context: … 'France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858' …
Question: 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'
Answer: The answer to the question is St Bernadette as seen in the context
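Extractive QA training needs answers that map to character offsets in the context, so a structural check (my sketch, with the SQuAD-style record layout as the assumed target format) can simply require the answer to appear verbatim in the context.

# Sketch: enforce the structure extractive QA training needs. The answer must appear
# verbatim in the context so it can be turned into an answer_start offset.
def to_squad_record(pair, record_id: str):
    answer_start = pair["context"].find(pair["answer"])
    if answer_start == -1:
        return None  # e.g. "The answer to the question is St Bernadette ..." is rejected
    return {
        "id": record_id,
        "context": pair["context"],
        "question": pair["question"],
        "answers": {"text": [pair["answer"]], "answer_start": [answer_start]},
    }


def drop_badly_structured(pairs):
    records = (to_squad_record(p, str(i)) for i, p in enumerate(pairs))
    return [r for r in records if r is not None]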
Extractive QA benchmark
We benchmarked model specialization on the industry-standard SQuAD dataset. Tested:
1/ Llama 70B used to generate synthetic data from the contexts, with a 7B model tuned on that synthetic data plus 500 examples from SQuAD
2/ a 7B model tuned on the actual SQuAD train set
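For reference, extractive QA on SQuAD is usually scored with exact match and F1. The sketch below (my addition) shows one way to compute those scores with the Hugging Face evaluate library; the model name and the 200-example validation slice are illustrative choices, not the setup reported in the talk.

# Sketch: score a QA model on a slice of SQuAD validation with exact match and F1.
import evaluate
from datasets import load_dataset
from transformers import pipeline

squad_metric = evaluate.load("squad")
validation = load_dataset("squad", split="validation[:200]")  # small slice for a quick check
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

predictions, references = [], []
for ex in validation:
    pred = qa(question=ex["question"], context=ex["context"])
    predictions.append({"id": ex["id"], "prediction_text": pred["answer"]})
    references.append({"id": ex["id"], "answers": ex["answers"]})

print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': ..., 'f1': ...}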
Open questions
- How do we diversify the generated data?
  - How do we remove similar (but not identical) examples while avoiding removing informative samples that merely seem similar? (See the sketch after this list.)
  - We can remove duplicates, but this slows down generation, because the model keeps producing examples that will be removed. We need to guide the model to generate more diverse examples.
- How do we avoid catastrophic forgetting?
  - The models we train start from a pre-trained baseline. We must prevent the model from forgetting its original capabilities in favor of memorizing the new data.
- How do we better validate examples?
  - How can we validate more subtle errors in the data, beyond just flagging examples with clearly incorrect components?
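On the "similar but not identical" question, one common starting point (a sketch of my own, not a solution proposed in the talk) is embedding-based near-duplicate filtering; the model name and similarity threshold below are illustrative, and the trade-off with losing informative samples remains open.

# Sketch: near-duplicate filtering of generated questions with sentence embeddings.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def drop_near_duplicates(pairs, threshold: float = 0.95):
    questions = [p["question"] for p in pairs]
    embeddings = embedder.encode(questions, convert_to_tensor=True, normalize_embeddings=True)
    kept, kept_embeddings = [], []
    for pair, emb in zip(pairs, embeddings):
        if kept_embeddings and max(float(util.cos_sim(emb, e)) for e in kept_embeddings) >= threshold:
            continue  # too close to a question we already kept
        kept.append(pair)
        kept_embeddings.append(emb)
    return kept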
Thank you!
Any questions?
Jacek Golebiowski
https://www.linkedin.com/in/jacek-golebiowski [email protected]