Tailoring Small Language Models for Enterprise Use Cases

Julien Simon, 29 slides, Sep 06, 2024

About This Presentation

Talk @ ODSC London, 6/9/2024


Slide Content

Tailoring Small Language
Models for Enterprise Use Cases
Julien Simon, Chief Evangelist, Arcee.ai
[email protected]
youtube.com/juliensimonfr

Why customers prefer Small Language Models (SLM)
• Accessibility: anyone can use the models, regardless of budget or affiliation
• Transparency: customers have full visibility into the model and can better identify potential biases or errors
• Privacy: customers don't have to send their data to black-box APIs
• IP protection: customers train models on their own data, and own the result
• Freedom of choice: customers are not locked in; they can switch models anytime
• IT flexibility: customers can train and deploy models anywhere they like, using any technology
• Cost optimization: customers can find the cost/performance sweet spot for each project
• Model quality: a small model tailored to the task will typically outperform a generic large model

Working with Large Language Models
Adaptation techniques, roughly ordered by increasing ML skills, time to production, cost, and training data requirements:
• Using as is
• Prompting
• Retrieval-Augmented Generation (RAG)
• Fine-tuning
• Model merging
• Continuous pre-training
• Initial training
You are probably here: prompting / RAG

Working with datasets
Data layers in order of increasing business value, with example sources, representative models, and the matching training stage:
• Language/image data (Wikipedia, Common Crawl): generalist models (BERT, Mistral, LLaMA); initial training
• Industry data (PubMed, SEC filings, Arxiv, Github): domain models (FinBERT, BioMistral, Code LLaMa, BloombergGPT); continuous pretraining
• Company data (product documentation, internal reports): your models; fine-tuning
• Use case data (user data, application data): your models; fine-tuning
https://blog.arcee.ai/how-do-i-prep-my-data-to-train-an-llm-2/

A typical model adaptation workflow
• Pretrained model + unlabeled domain dataset → continuous pre-training (CPT) → domain-adapted model
• Domain-adapted model + Q&A dataset → instruction fine-tuning (IFT) → instruction-tuned model
• Instruction-tuned model + preference dataset → alignment → aligned model
• Alternative: instruction pre-training trains on the unlabeled domain dataset and a Q&A dataset in a single step
"Language Models are Few-Shot Learners" https://arxiv.org/abs/2005.14165 (05/2020)
"Finetuned Language Models Are Zero-Shot Learners" https://arxiv.org/abs/2109.01652 (09/2021)
"Efficient Continual Pre-training for Building Domain Specific Large Language Models" https://arxiv.org/abs/2311.08545 (11/2023)
"Instruction Pre-Training: Language Models are Supervised Multitask Learners" https://arxiv.org/abs/2406.14491v1 (06/2024)
"How Do Large Language Models Acquire Factual Knowledge During Pretraining?" https://arxiv.org/abs/2406.11813v1 (06/2024)

Model training

Continuous pre-training (CPT)
• (Continuous) pre-training involves training the model on a large corpus, often billions of tokens
• Option 1 - Full fine-tuning (FFT): train the full model in full precision (say, BF16)
• Option 2 - QLoRA: train a fraction of the model in reduced precision (say, 4-bit)
  https://arxiv.org/abs/2305.14314 (05/2023)
  • Large memory savings enable smaller GPUs and larger batch sizes
  • Very effective for Instruction Fine-Tuning (IFT) and alignment
  • Not effective for CPT: significant accuracy degradation
  https://blog.arcee.ai/why-methods-like-qlora-fall-short-in-domain-knowledge-injection-2/
• Option 3 - Spectrum: train only the most contributing layers (25 to 50%) in full precision
  https://arxiv.org/abs/2406.06623 (06/2024) + https://blog.arcee.ai/optimizing-llm-training-with-spectrum/
  • Spectrum-25 outperforms QLoRA on memory usage, training speed, and accuracy
  • Spectrum-50 accuracy is on par with or better than FFT, within 10% of QLoRA's memory savings
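The core idea behind Spectrum is to rank layers by a signal-to-noise measure and unfreeze only the top fraction. A toy sketch of that selection step, assuming precomputed per-layer SNR scores (the helper name and values are illustrative, not the paper's actual scoring code):

```python
def select_spectrum_layers(layer_snr: dict, fraction: float = 0.25) -> set:
    """Return the names of the top-`fraction` layers ranked by SNR.

    Only these layers would be unfrozen for training; the rest stay frozen,
    as in Spectrum-25 (25%) or Spectrum-50 (50%).
    """
    k = max(1, int(len(layer_snr) * fraction))
    ranked = sorted(layer_snr, key=layer_snr.get, reverse=True)
    return set(ranked[:k])

# Toy per-layer SNR values (in practice, computed from the weight matrices).
snr = {f"layers.{i}": s for i, s in enumerate([0.9, 0.1, 0.7, 0.3, 0.5, 0.2, 0.8, 0.4])}
trainable = select_spectrum_layers(snr, fraction=0.25)
print(trainable)  # the two highest-SNR layers: layers.0 and layers.6
```

The 75% of layers left frozen require no gradients or optimizer state, which is where the memory and speed savings come from.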

Fine-tuning
• Fine-tuning a pre-trained model is an excellent option for most customers
• Compared to initial training, it only requires a fraction of the training data and compute
• Many ready-made scripts and notebooks at https://github.com/huggingface/transformers
• New techniques make it possible to fine-tune large models for dollars on a single GPU
• PEFT (Parameter-Efficient Fine-Tuning) https://huggingface.co/docs/peft
• TRL (Transformer Reinforcement Learning) https://huggingface.co/docs/trl/

Parameter-Efficient Fine-Tuning with LoRA and QLoRA
• Low-Rank Adaptation (LoRA) https://arxiv.org/abs/2106.09685
  • Hypothesis: weight updates can be learned with two much smaller matrices
  • "For a pre-trained weight matrix W0 ∈ R^(d×k), we constrain its update by representing the latter with a low-rank decomposition W0 + ΔW = W0 + BA, where B ∈ R^(d×r), A ∈ R^(r×k), and the rank r << min(d, k). During training, W0 is frozen and does not receive gradient updates, while A and B contain trainable parameters."
  • We only learn r×(d+k) parameters, a much smaller number than W0's d×k (r is typically 4 to 16)
  • LoRA reduces the number of trainable parameters by 1,000x or more, with minimal loss of accuracy
  • LLMs can be fine-tuned on a single mid-range GPU
  • At inference time, learned parameters are simply added to the original parameters: no extra latency
• QLoRA: LoRA for quantized models https://arxiv.org/abs/2305.14314
  • Quantize a pre-trained model to 4-bit (NF4) and fine-tune it with LoRA
  • "QLoRA reduces the average memory requirements of fine-tuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance compared to a 16-bit fully fine-tuned baseline."
• QLoRA with Hugging Face: https://huggingface.co/blog/4bit-transformers-bitsandbytes
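The parameter arithmetic above can be checked with a small numpy sketch (dimensions are kept far smaller than real model layers so it runs instantly; the initialization mirrors LoRA's B = 0 scheme, so ΔW = BA starts at zero):

```python
import numpy as np

d, k, r = 1024, 1024, 8                       # toy projection shape, LoRA rank

rng = np.random.default_rng(0)
W0 = rng.standard_normal((d, k))              # frozen pretrained weight matrix
B = np.zeros((d, r))                          # LoRA init: B = 0 ...
A = rng.standard_normal((r, k)) * 0.01        # ... A small random, so BA = 0 at start

W_eff = W0 + B @ A                            # merged weights at inference: no extra latency

full_params = d * k                           # parameters updated by full fine-tuning
lora_params = r * (d + k)                     # trainable parameters with LoRA
print(full_params // lora_params)             # 64x fewer trainable parameters at this size
```

At real model dimensions (d = k = 4096 and beyond, across dozens of layers) the same r×(d+k) vs d×k ratio is what yields the 1,000x-plus reductions quoted above.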

"LoRA Land: 310 Fine-tuned LLMs that rival GPT-4"
https://arxiv.org/abs/2405.00732 (04/2024)
• 10 base models
• 31 tasks in 5 categories: classic NLP, coding, knowledge, reasoning, math
• Consistent prompting: completion, zero- or single-shot
• Fine-tuning: 4-bit QLoRA, a single A10 GPU (!), no hyperparameter tuning
Results:
• 301/310 models surpass their base model counterpart.
• The best fine-tuned LLM outperforms the best base model by +8.3 to +67.5 points, +25.0 points on average.
• All fine-tuned models perform better than GPT-3.5.
• 224/310 fine-tuned LLMs surpass the benchmark set by GPT-4.
• All 7B fine-tuned models perform better than GPT-4, except for gemma-7b and gemma-7b-it.
• Phi-2, with as few as 2 billion parameters, exhibits performance competitive with GPT-4 after fine-tuning.

Model alignment

Reinforcement Learning with Human Feedback (RLHF)
https://huyenchip.com/2023/05/02/rlhf.html

Reward-based RLHF (PPO)
https://openai.com/research/instruction-following

Reward-based RLHF is challenging
• Scalability: building a large human workforce is difficult and time-consuming
• Ethics: RLHF often involves underpaid outsourced workers (as reported by the Washington Post, Time, and the Daily Mail)
• Bias and quality: human feedback can be biased or inconsistent
• Complexity: RLHF requires many steps and datasets
• Cost: very compute-intensive

Reinforcement Learning with AI Feedback (RLAIF)
https://arxiv.org/abs/2309.00267 (09/2023)
RLAIF uses an off-the-shelf LLM to generate preference data

Reward-free RLHF: Direct Preference Optimization (DPO)
https://arxiv.org/abs/2305.18290 (05/2023)
• DPO eliminates the need for a separate reward model
• The model is trained directly on preference pairs, with a simple classification-style loss
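A minimal sketch of the DPO objective for a single preference pair. The helper name and the log-probability values are illustrative; in practice the log-probabilities come from the policy being trained and from a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy favors the chosen answer
    over the rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen answer more than the reference does: low loss.
low = dpo_loss(-10.0, -30.0, -12.0, -25.0)
# Policy prefers the rejected answer: higher loss.
high = dpo_loss(-30.0, -10.0, -25.0, -12.0)
assert low < high
```

Because the reward model is folded into this closed-form loss, training reduces to ordinary gradient descent on preference pairs, with no separate reward-model training or PPO loop.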

What does a DPO dataset look like?
https://huggingface.co/datasets/arcee-ai/general-dpo-datasets
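Conceptually, each row pairs a prompt with a preferred and a dispreferred completion. The prompt/chosen/rejected field names below follow the common TRL-style convention and are an assumption; check the linked dataset card for its exact schema:

```python
# One preference record in the common prompt/chosen/rejected layout
# (field names assumed; the linked dataset's schema may differ).
record = {
    "prompt": "Explain model merging in one sentence.",
    "chosen": "Model merging combines the weights of several models that share "
              "an architecture into a single model, without additional training.",
    "rejected": "Model merging is when you zip several model files together.",
}
assert set(record) == {"prompt", "chosen", "rejected"}
```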

Model merging

What is model merging?
• Building a "great" model is challenging, time-consuming, and compute-intensive
• Instead, can we build one by merging several models based on the same architecture?
• Combine multiple task-specific models into a single multitask model without any additional training
• Not an ensembling technique: there's only one model at the end
• Merging only requires lightweight CPU compute
• Fast process, no extra cost for training and inference, no extra inference latency
• Mergekit: https://github.com/arcee-ai/mergekit
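The simplest form of merging is a weighted average of matching weight tensors, in the spirit of Model Soups. This is a hypothetical numpy sketch, not mergekit's implementation; it only assumes the models share identical parameter names and shapes:

```python
import numpy as np

def merge_linear(state_dicts, weights=None):
    """Weighted average of matching tensors across same-architecture models.

    Runs on CPU, involves no training, and produces a single merged model.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts)) for k in keys}

# Two toy "models" with one shared parameter tensor each.
m1 = {"layer.weight": np.array([1.0, 2.0])}
m2 = {"layer.weight": np.array([3.0, 6.0])}
merged = merge_linear([m1, m2])
print(merged["layer.weight"])  # [2. 4.]
```

More sophisticated methods (TIES, DARE, etc.) refine this idea by trimming, sign-electing, or rescaling the per-model deltas before combining them.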

Merging techniques available in mergekit
• Model Soups https://arxiv.org/abs/2203.05482 (03/2022)
• Spherical Linear Interpolation (SLERP) https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
• Task Arithmetic https://arxiv.org/abs/2212.04089 (12/2022)
• Trim, Elect Sign, and Merge (TIES) https://arxiv.org/abs/2306.01708 (06/2023)
• Drop and Rescale (DARE) https://arxiv.org/abs/2311.03099 (11/2023)
• Franken-merging
• Model Breadcrumbs https://arxiv.org/abs/2312.06795 (12/2023)
• Model Stock https://arxiv.org/abs/2403.19522 (03/2024)
• DELLA https://arxiv.org/abs/2406.11617 (06/2024)
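As an example of one technique from the list, SLERP interpolates along the arc between two weight vectors instead of along the straight line, preserving their norm geometry. A sketch on flattened tensors (mergekit's implementation handles per-tensor details and edge cases beyond this):

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors."""
    v0n = v0 / (np.linalg.norm(v0) + eps)
    v1n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:                       # near-parallel vectors: fall back to lerp
        return (1 - t) * v0 + t * v1
    s = np.sin(theta)
    return (np.sin((1 - t) * theta) / s) * v0 + (np.sin(t * theta) / s) * v1

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(0.5, a, b)
print(mid)  # ~[0.707, 0.707]: stays on the unit circle, unlike lerp's [0.5, 0.5]
```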

Model merging
"Arcee's MergeKit: A Toolkit for Merging Large Language Models" https://arxiv.org/abs/2403.13257 (03/2024)
Merging can replace or complement each stage of the adaptation workflow (pretrained model → CPT → domain-adapted model → IFT → instruction-tuned model → alignment → aligned model):
• Merging instead of training: combine existing domain-adapted models rather than running CPT on an unlabeled domain dataset
• Merging before training: merge domain-adapted models, then continue training
• Merging instead of fine-tuning: merge with an instruction-tuned model rather than running IFT on a Q&A dataset
• Merging instead of aligning: merge with an aligned model rather than running alignment on a preference dataset
Merging steps can be combined, e.g., merge with a domain-adapted and aligned model.
The Arcee AI platform makes this end-to-end workflow simple and efficient.

Selected Arcee models

Arcee Nova (72B)
https://huggingface.co/arcee-ai/Arcee-Nova
A merge of Qwen2-72B-Instruct with a custom model tuned on a generalist dataset mixture.
Top-performing open-source model tested on the OpenLLM Leaderboard 2.0 stack.
Performance close to GPT-4 (May 2023 version).
Demos: chat with Nova (web); chat with Nova (ollama, Q3_K_S)

Llama Spark (8B)
https://huggingface.co/arcee-ai/Llama-Spark
Built upon the Llama-3.1-8B base model, continuously trained on the Tome dataset and merged with Llama-3.1-8B-Instruct.
Consistently outperforms Llama-3.1-8B-Instruct on the Open LLM Leaderboard benchmarks.
Demo: chat with Llama Spark (ollama, Q5_K_S)

Arcee Agent (7B)
https://huggingface.co/arcee-ai/Arcee-Agent
A cutting-edge Qwen2 7B model specifically designed for function calling and tool use.
Trained with Spectrum.
Rivals much larger models.
Demo: query the Yahoo Finance API (Amazon SageMaker)
https://github.com/arcee-ai/aws-samples/blob/main/model_notebooks/sample-notebook-arcee-agent-on-sagemaker.ipynb

Arcee Lite (1.5B)
https://huggingface.co/arcee-ai/arcee-lite
A powerful 1.5B model developed as part of the DistillKit open-source project
https://github.com/arcee-ai/DistillKit
Based on Qwen2 1.5B, distilled from Phi-3-Medium (14B).
Consistently outperforms Qwen2 1.5B.
Demo: chat with Arcee Lite (ollama, Q8)

Summing things up
• No model rules them all: find the most appropriate one for each use case
• Small, tailored open models are the way to go
• Merging, Spectrum, and distillation are changing the model adaptation game
• Visit arcee.ai to learn how you can build yours with Arcee Cloud (SaaS) or Arcee Enterprise (VPC deployment)
https://arcee.ai/blog
https://huggingface.co/arcee-ai
https://github.com/arcee-ai/aws-samples
https://youtube.com/c/juliensimonfr
Julien Simon, Chief Evangelist, Arcee AI
[email protected]