Julien Simon - Deep Dive - Optimizing LLM Inference


About This Presentation

Companion slides for https://youtu.be/hMs8VNRy5Ys

- Decoder-only inference
- The KV cache
- Continuous batching
- Speculative decoding
- Medusa


Slide Content

Deep Dive: Optimizing LLM Inference
Companion video: https://youtu.be/hMs8VNRy5Ys
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr

The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.

Decoder-only inference
•GPT-like models are decoder-only models
•No encoder, no encoder-decoder multi-head attention
•Input processing (aka prefill) is highly parallel:
•Inputs (the tokenized prompt) are embedded and encoded
•Multi-head attention computes the keys and values (KV)
•Large matrix multiplications, high usage of the hardware accelerator
•Output generation is sequential (see the sketch below):
•The answer is generated one token at a time
•Each generated token is appended to the previous input
•The process is repeated until a stopping criterion is met (max length or EOS)
•Low usage of the hardware accelerator
"Attention is all you need"

https://arxiv.org/abs/1706.03762
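To make the prefill / generation split concrete, here is a minimal greedy-decoding sketch (the checkpoint choice is illustrative). Written this way, the full sequence is re-processed at every step, which is exactly what the KV cache on the next slide avoids:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "EleutherAI/pythia-160m-deduped"   # small model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Prefill: the whole prompt is processed in one highly parallel forward pass
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

# Generation: one token at a time, each new token appended to the previous input
for _ in range(20):
    logits = model(input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)
    if next_token.item() == tokenizer.eos_token_id:   # stopping criterion: EOS or max length
        break

print(tokenizer.decode(input_ids[0]))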

The KV cache
•Can we avoid recomputing the KV values again and again for the input tokens?
•We only really need to compute them for the token we just generated and appended to the input!
•The KV cache stores the KV values for all previous tokens in the accelerator RAM
•Of course, this doesn't work for the first token, which is why it takes longer to generate!
•Cache size (FP16) = 2 (K and V) * 2 bytes * batch_size * seq_length * num_layers * embeddings_length ➡ gigabytes (see the example below)
"Generative LLM inference" https://awsdocs-neuron.readthedocs-hosted.com
https://youtu.be/2TT384U4vQg
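A quick sanity check of the formula above, with illustrative Llama-2-7B-like dimensions (32 layers, embedding size 4096, FP16, no grouped-query attention):

def kv_cache_size_gb(batch_size, seq_length, num_layers, embeddings_length):
    # 2 (keys and values) * 2 bytes (FP16) * batch * sequence length * layers * embedding size
    return 2 * 2 * batch_size * seq_length * num_layers * embeddings_length / 1e9

# batch 8, 4096-token context: 2*2*8*4096*32*4096 bytes ≈ 17.2 GB of accelerator RAM
print(kv_cache_size_gb(batch_size=8, seq_length=4096, num_layers=32, embeddings_length=4096))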

Continuous batching
https://www.usenix.org/conference/osdi22/presentation/yu (09/2022)
•Decoder-only inference requests are harder to batch than traditional Transformer requests
•Input and output lengths can vary greatly, leading to very different generation times
•Traditional batching waits for all requests in the batch to complete ➡ low hardware usage
•Continuous batching evicts completed requests and schedules waiting ones immediately ➡ high hardware usage
•Token generation must pause regularly to run prefill for new requests (waiting_served_ratio parameter in TGI); see the conceptual sketch below
Available in Hugging Face TGI
https://www.anyscale.com/blog/continuous-batching-llm-inference
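A conceptual sketch of such a scheduling loop (the Request class, method names and threshold logic are illustrative assumptions, not TGI's actual implementation):

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    # Toy request: tracks how many tokens are left to generate
    prompt: str
    remaining_tokens: int
    output: list = field(default_factory=list)

    def prefill(self):                 # stand-in for computing the prompt's KV cache
        pass

    def generate_one_token(self):
        self.output.append("<tok>")
        self.remaining_tokens -= 1

    def is_finished(self):
        return self.remaining_tokens <= 0

def continuous_batching_loop(waiting, max_batch_size=4, waiting_served_ratio=1.2):
    running = []
    while waiting or running:
        # Pause decoding and prefill waiting requests when the backlog is large enough
        if waiting and (not running or len(waiting) / len(running) >= waiting_served_ratio):
            while waiting and len(running) < max_batch_size:
                request = waiting.popleft()
                request.prefill()
                running.append(request)
        # One decoding step for every in-flight request
        for request in running:
            request.generate_one_token()
        # Evict completed requests immediately so their slots can be reused by new requests
        running = [r for r in running if not r.is_finished()]

continuous_batching_loop(deque([Request("hello", 3), Request("hi", 8), Request("hey", 5)]))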

Speculative decoding
https://arxiv.org/abs/2211.17192 (11/2022) + https://huggingface.co/blog/assisted-generation
•The vanilla generation process outputs only one token at a time
•It's also memory-bound: idle compute resources are available
•With a small model (Mq), we predict several potential completion tokens (aka speculative sampling)
•With the large model (Mp), we evaluate the completions in parallel and pick the best one (or correct it)
•Each iteration generates at least one valid token, and potentially more
α : acceptance rate of completions (how well Mq approximates Mp)
γ : number of predicted tokens (see the quick estimate below)
[Figure: T5-XXL results]
•How can we build Mq?
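From the paper, α and γ determine the expected number of tokens produced per iteration; a quick estimate:

def expected_tokens_per_iteration(alpha, gamma):
    # Expected number of generated tokens per iteration: (1 - alpha**(gamma + 1)) / (1 - alpha)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# e.g. a draft model with an 80% acceptance rate, predicting 5 tokens ahead
print(expected_tokens_per_iteration(alpha=0.8, gamma=5))   # ≈ 3.69 tokens per iteration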

Speculative decoding: building the small model
1. Select a small off-the-shelf version of the large model
2. Use a basic n-gram model to generate tokens found in the prompt
3. Fine-tune a small model on top of a frozen large model
4. Fine-tune the small and large models together

Speculative decoding: small off-the-shelf model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"             # large model (Mp)
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"   # small draft model (Mq)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)

# Passing assistant_model enables assisted generation (speculative decoding)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']
Note: the two models must share the same tokenizer

Speculative decoding: n-grams
https://github.com/apoorvumang/prompt-lookup-decoding + https://twitter.com/joao_gante/status/1747322418425741550
•Input-grounded tasks (summarization, document QA, multi-turn chat, code editing): high n-gram overlap between the input (prompt) and the generated output
•We can use strings present in the prompt to generate candidate token sequences
•Significant speedups (2x-4x), without model modification and with no effect on output quality
•Implemented in the transformers library
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path, device_map="auto")
. . .
generation_output = model.generate(
    **input_ids, do_sample=False,
    max_new_tokens=512, streamer=streamer,
    prompt_lookup_num_tokens=10)
Available in Hugging Face TGI

Speculative decoding: Medusa
https://arxiv.org/abs/2401.10774 + https://github.com/FasterDecoding/Medusa
•Add decoding heads to the LLM to predict multiple outputs in parallel (see the sketch below)
•Verify them simultaneously in each decoding step
•Two techniques to fine-tune the new heads:
•MEDUSA-1: fine-tune the heads on top of the frozen LLM
•Parameter-efficient fine-tuning with QLoRA
Vicuna 7B, 60k samples: 5 hours on an A100
•MEDUSA-2: fine-tune the heads together with the LLM
Available in Hugging Face TGI
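A minimal sketch of what a Medusa-style decoding head looks like, following the paper's description (a residual feed-forward block followed by a vocabulary projection); the sizes and head count below are illustrative, and the reference implementation lives in the FasterDecoding/Medusa repository:

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Residual block used in each Medusa head: linear + SiLU with a skip connection
    def __init__(self, hidden_size):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.linear(x))

class MedusaHead(nn.Module):
    # Extra decoding head: predicts a future token from the last hidden state
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.block = ResBlock(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states):
        return self.lm_head(self.block(hidden_states))

hidden_size, vocab_size, num_heads = 4096, 32000, 4     # illustrative values
heads = nn.ModuleList([MedusaHead(hidden_size, vocab_size) for _ in range(num_heads)])
h = torch.randn(1, 1, hidden_size)                      # last hidden state of the current token
logits = [head(h) for head in heads]                    # head k predicts the token k+1 steps ahead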

What is model merging?
•Building a "great" model is challenging, time-consuming and compute-intensive
•Instead, can we build one by merging several models based on the same architecture?
•Combine multiple task-specific models into a single multitask model without any additional training
•Not an ensembling technique: there's only one model at the end
•Merging only requires lightweight compute
•Fast process, no extra cost for training and inference, no extra inference latency
•Mergekit: https://github.com/arcee-ai/mergekit

Merging techniques
•Model Soups https://arxiv.org/abs/2203.05482 (03/2022)
•Spherical Linear Interpolation (SLERP) https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
•Task Arithmetic https://arxiv.org/abs/2212.04089 (12/2022)
•Trim, Elect Sign, and Merge (TIES) https://arxiv.org/abs/2306.01708 (06/2023)
•Drop and Rescale (DARE) https://arxiv.org/abs/2311.03099 (11/2023)
•Franken-merging

Merging techniques, continued
•Model Breadcrumbs https://arxiv.org/abs/2312.06795 (12/2023)
•Model Stock https://arxiv.org/abs/2403.19522 (03/2024)
•DELLA https://arxiv.org/abs/2406.11617 (06/2024)

Model soups
https://arxiv.org/abs/2203.05482 (03/2022) + https://github.com/mlfoundations/model-soups
•Average many variants of the same model, trained on the same dataset with different hyperparameters, aka linear interpolation
•Optionally: weighted average, normalization
•Uniform soup: average all models (see the sketch below)
•Greedy soup: add models one by one, keeping only the ones that gradually improve test accuracy
•Generally, model soups perform a little worse than ensembles, but are more resilient to out-of-distribution data
Figures: fine-tuning a CLIP ViT-B/32 model on ImageNet; BERT and T5 on four text classification datasets from the GLUE benchmark
# Weighted sum of the stacked per-model tensors, optionally normalized by the sum of weights
res = (weights * tensors).sum(dim=0)
if self.normalize:
    res /= weights.sum(dim=0)
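A minimal, self-contained sketch of a uniform soup, assuming all checkpoints share the same architecture and parameter names:

import torch

def uniform_soup(state_dicts):
    # Average the weights of N fine-tuned variants of the same model, parameter by parameter
    return {name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
            for name in state_dicts[0]}

# Hypothetical usage with three fine-tuning runs of the same base model:
# soup = uniform_soup([run1.state_dict(), run2.state_dict(), run3.state_dict()])
# model.load_state_dict(soup)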

Spherical Linear IntERPolation (SLERP)
https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
•Originally designed for 3D CGI to find smooth transitions between two (camera) rotations
•Only works with two models (one of them can be favored)
•Helps preserve the magnitude of weights
•Helps preserve the "curvature" of the embeddings space
Linear (red) vs. spherical (blue) interpolation
Images: https://www.researchgate.net/figure/Slerp-and-linear-interpolation-comparison_fig2_318221910, https://allenchou.net/2018/05/game-math-deriving-the-slerp-formula/
import numpy as np

def normalize(v, eps):
    # Normalize a vector unless its norm is too small
    norm_v = np.linalg.norm(v)
    return v / norm_v if norm_v > eps else v

# v0, v1: the two weight tensors to interpolate; t: interpolation factor in [0, 1]
# Copy the vectors to reuse them later
v0_copy = np.copy(v0)
v1_copy = np.copy(v1)
# Normalize the vectors to get the directions and angles
v0 = normalize(v0, eps)
v1 = normalize(v1, eps)
# Dot product with the normalized vectors
dot = np.sum(v0 * v1)
# Calculate initial angle between v0 and v1
theta_0 = np.arccos(dot)
sin_theta_0 = np.sin(theta_0)
# Angle at timestep t
theta_t = theta_0 * t
sin_theta_t = np.sin(theta_t)
# Finish the slerp algorithm
s0 = np.sin(theta_0 - theta_t) / sin_theta_0
s1 = sin_theta_t / sin_theta_0
res = s0 * v0_copy + s1 * v1_copy
[Diagram: SLERP between v0 and v1 along the arc, with angles θ, tθ and (1-t)θ]

Task Arithmetic
https://arxiv.org/abs/2212.04089 (12/2022)
•Pre-trained models can be fine-tuned for a particular task: text classification, summarization, etc.
•Task vector: the tensor updates applied to the pre-trained model during fine-tuning
•A pre-trained model can be fine-tuned for many tasks on many datasets, leading to many task vectors
•Task vectors can be added to or subtracted from the base model to add or remove capabilities (see the sketch below)
•Many single-task models can be merged to build a multi-task model
•However, as the number of tasks grows, finding the right mix becomes a computationally expensive problem (hyperparameter tuning on a validation set)
Figures: adding pairs of task vectors to T5-Base; adding task vectors to CLIP
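A minimal sketch of task vector arithmetic on state dicts (function names and the scaling factor are illustrative):

import torch

def task_vector(base_sd, finetuned_sd):
    # Task vector = fine-tuned weights minus base weights, per parameter
    return {k: finetuned_sd[k].float() - base_sd[k].float() for k in base_sd}

def apply_task_vectors(base_sd, task_vectors, scale=1.0):
    # Add (scale > 0) or subtract (scale < 0) task vectors to/from the base model
    merged = {k: v.clone().float() for k, v in base_sd.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += scale * tv[k]
    return merged

# Hypothetical usage:
# tv_summarization = task_vector(base.state_dict(), summarizer.state_dict())
# multi_task = apply_task_vectors(base.state_dict(), [tv_summarization, tv_classification], scale=0.5)
# detoxified = apply_task_vectors(base.state_dict(), [tv_toxicity], scale=-0.5)   # negation removes a capability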

TrIm, Elect Sign and Merge (TIES)
https://arxiv.org/abs/2306.01708 (06/2023)
•Parameter interference can degrade merged performance
•Influential vs. redundant parameters
•Sign conflicts
•Trim, Elect Sign & Merge (see the sketch below):
•Trim each task vector to retain only the influential parameter values (top-k % largest values)
•Resolve the sign conflicts between different values
•Average the parameters whose sign agrees with the direction of the largest movement
•Add the averaged parameters to the original model (with a scale factor)
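A minimal sketch of the three TIES steps on state dicts (top-k trimming, sign election and agreement-based averaging); the hyperparameters and function name are illustrative:

import torch

def ties_merge(base_sd, finetuned_sds, k=0.2, scale=1.0):
    merged = {}
    for name in base_sd:
        # Task vectors: fine-tuned weights minus base weights, stacked per model
        deltas = torch.stack([sd[name].float() - base_sd[name].float() for sd in finetuned_sds])
        flat = deltas.reshape(deltas.shape[0], -1)
        # Trim: keep only the top-k% largest-magnitude entries of each task vector
        threshold = flat.abs().quantile(1.0 - k, dim=1, keepdim=True)
        trimmed = torch.where(flat.abs() >= threshold, flat, torch.zeros_like(flat))
        # Elect sign: direction of the largest total movement, per parameter
        sign = torch.sign(trimmed.sum(dim=0))
        # Merge: average only the values whose sign agrees with the elected sign
        agree = (torch.sign(trimmed) == sign) & (trimmed != 0)
        summed = (trimmed * agree).sum(dim=0)
        count = agree.sum(dim=0).clamp(min=1)
        merged[name] = base_sd[name].float() + scale * (summed / count).reshape(base_sd[name].shape)
    return merged

# Hypothetical usage:
# merged = ties_merge(base.state_dict(), [m1.state_dict(), m2.state_dict(), m3.state_dict()], k=0.2)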

Drop And REscale (DARE)
https://arxiv.org/abs/2311.03099 (11/2023) + https://github.com/yule-BUAA/MergeLM
•During fine-tuning, many LLM parameter updates are redundant
•DARE randomly eliminates up to 99% of the updates, with minimal impact on model performance
•Drop p% of the delta parameters and rescale the rest by 1 / (1 - p)
•The larger the model, the more limited the impact
•This drastically reduces the size of task vectors
•Task vectors can then be applied to a pre-trained model using previous methods like TIES (see the sketch below)
Figure: WizardLM-13B (LM), WizardMath-13B (Math), and llama-2-13b-code-alpaca
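A minimal sketch of the DARE drop-and-rescale step (the keep/drop mask is sampled with probability 1 - p per parameter; the function name is illustrative):

import torch

def dare(base_sd, finetuned_sd, p=0.9):
    # Randomly drop a fraction p of the delta parameters, rescale the survivors by 1 / (1 - p)
    sparse_deltas = {}
    for name in base_sd:
        delta = finetuned_sd[name].float() - base_sd[name].float()
        keep = torch.bernoulli(torch.full_like(delta, 1.0 - p))   # 1 = keep, 0 = drop
        sparse_deltas[name] = delta * keep / (1.0 - p)
    return sparse_deltas

# The sparse, rescaled deltas can then be merged with TIES-style sign election,
# or simply added back to the base model:
# merged = {k: base_sd[k].float() + sparse_deltas[k] for k in base_sd}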

Franken-merging
•All previous methods require models that share a common architecture (T5, Llama, etc.)
•We can build a new model by stitching layers from different models, possibly with different architectures
•Weights are left untouched (aka passthrough merging)
•Very experimental, but some interesting models are popping up
•https://huggingface.co/models?sort=downloads&search=franken
Example: https://huggingface.co/alpindale/goliath-120b, merging two Llama-2-70b models,
https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1 and https://huggingface.co/Sao10K/Euryale-1.3-L2-70B:
- layer range 0-16: Xwin
- layer range 8-24: Euryale
- layer range 17-32: Xwin
- layer range 25-40: Euryale
- layer range 33-48: Xwin
- layer range 41-56: Euryale
- layer range 49-64: Xwin
- layer range 57-72: Euryale
- layer range 65-80: Xwin

Model breadcrumbs
https://arxiv.org/abs/2312.06795 (12/2023) + https://github.com/rezazzr/breadcrumbs
•Improvement of the Task Arithmetic technique
•Compute a task direction for each fine-tuned model (fine-tuned weights minus base weights)
•Mask (set to zero) a percentage of tiny and large outliers in each task direction, respectively β and γ
(this is equivalent to using the parameter from the base model)
•Sum the base model and the masked task directions, with a task weight (α) (see the sketch below)
•This method outperforms Task Vectors. It also scales much better to a large number of tasks: « After finding optimal hyperparameters for both Model Breadcrumbs and Task Vectors using 10 tasks, we kept these hyperparameters and incrementally merged all 200 tasks to create a multi-task model. […] the observed trend remains, with Model Breadcrumbs (85% sparsity) consistently outperforming Task Vectors by a significant margin as the number of tasks increases. »
•Merging also improves the performance of fine-tuned models
•Optionally, add sign election before merging (TIES method)
Figure: T5-base fine-tuned on 4 GLUE tasks, then merged with six other T5 variants (IMDB, etc.)
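A minimal sketch of the masking step (β and γ are applied per layer as quantiles of the task direction's magnitude; the function name and default values are illustrative):

import torch

def breadcrumbs_directions(base_sd, finetuned_sd, beta=0.01, gamma=0.01):
    # Task direction = fine-tuned weights minus base weights, then mask the
    # beta smallest-magnitude entries and the gamma largest-magnitude outliers
    masked = {}
    for name in base_sd:
        direction = finetuned_sd[name].float() - base_sd[name].float()
        magnitude = direction.abs().flatten()
        lo = magnitude.quantile(beta)
        hi = magnitude.quantile(1.0 - gamma)
        keep = (direction.abs() >= lo) & (direction.abs() <= hi)
        masked[name] = direction * keep
    return masked

# Hypothetical merge of several masked directions, each scaled by a task weight alpha:
# merged = {k: base_sd[k].float() + sum(alpha * d[k] for d in directions) for k in base_sd}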

Model stock
https://arxiv.org/abs/2403.19522 + https://github.com/naver-ai/model-stock (03/2024)
•Observation 1: « Fine-tuned weights with different random seeds reside on a very thin shell layer-wise »
3D intuition: for each layer, the weight vectors of all fine-tuned models live on a sphere
•Observation 2: « Going closer to the center by averaging the weights leads to improving both performances. » [in-distribution and out-of-distribution]
•Model soups work because averaging many single-task models gets us closer to the center
•However, building a large collection of single-task models can be computationally costly
•By understanding the geometry of fine-tuned weights, Model Stock allows us to approximate the center coordinates from just a few models
•Periodic merging during fine-tuning yields even better results

Drop and rEscaLe via sampLing with mAgnitude (DELLA)
https://arxiv.org/abs/2406.11617 (06/2024) + https://github.com/declare-lab/della
1. Compute delta parameters for each fine-tuned model (fine-tuned weights minus base weights)
2. Drop a fraction of the deltas (see the sketch below):
•The drop probability of each parameter is inversely proportional to its magnitude
•For each parameter, we flip a coin with a Bernoulli distribution to decide whether to keep it or drop it
•If a parameter is dropped, we set it to zero
3. Elect weights for merging, according to the dominant direction (same as TIES; optional in mergekit)
4. Fuse (average) the elected weights
Figure: WizardLM-13B, WizardMath-13B, llama-2-13b-code-alpaca
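A minimal sketch of the magnitude-aware drop step, under the simplifying assumption that keep probabilities are spread linearly by magnitude rank around the average keep rate 1 - p (a simplified take on the paper's MagPrune procedure; the epsilon window and rescaling below are illustrative):

import torch

def della_drop(base_sd, finetuned_sd, p=0.7, eps=0.1):
    # Deltas with larger magnitude get a higher keep probability; survivors are
    # rescaled by the inverse of their keep probability so the expected update is unchanged
    pruned = {}
    for name in base_sd:
        delta = finetuned_sd[name].float() - base_sd[name].float()
        magnitude = delta.abs().flatten()
        ranks = magnitude.argsort().argsort().float() / max(magnitude.numel() - 1, 1)
        keep_prob = ((1.0 - p - eps / 2) + eps * ranks).clamp(0.0, 1.0).reshape(delta.shape)
        keep = torch.bernoulli(keep_prob)                         # per-parameter coin flip
        pruned[name] = delta * keep / keep_prob.clamp(min=1e-8)
    return pruned

# The pruned deltas are then sign-elected and fused (averaged), as in TIES.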