SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models


About This Presentation

Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on the hard examples where the larger model is accurate.


Slide Content

Slide 1 - SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models

Slide 2 - Adaptive Model Selection: Most distilled safety guard models still underperform compared to larger models.

Slide 3 - Selection Strategy (Oracle): 1) If the smaller model is wrong and the larger model is correct, use the larger model. 2) In all other cases, use the smaller model.
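
A minimal sketch of this oracle rule, assuming we already have per-example correctness flags for both guard models (the function and variable names below are illustrative, not from the slides):

```python
def oracle_route(small_correct: bool, large_correct: bool) -> str:
    """Oracle from slide 3: call the larger model only when it would fix a
    smaller-model mistake; otherwise keep the cheaper smaller model."""
    if (not small_correct) and large_correct:
        return "larger"
    return "smaller"
```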

Slide 4 - Observation: This oracle strategy improves accuracy while invoking the larger model only rarely.

Slide 5 - SafeRoute, Learning to Select: We label hard examples as 1 and easy ones as 0, then train a binary classifier to distinguish hard from easy examples.
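
A hedged sketch of this training step, assuming the hard/easy labels come from the oracle rule above and that each example already has a feature vector; the two-layer head and hyperparameters are placeholders, not the exact SafeRoute configuration:

```python
import torch
import torch.nn as nn

def train_router(features: torch.Tensor, labels: torch.Tensor,
                 epochs: int = 10, lr: float = 1e-3) -> nn.Module:
    """Train a small binary classifier to predict hard (1) vs. easy (0)."""
    router = nn.Sequential(
        nn.Linear(features.shape[-1], 256),
        nn.ReLU(),
        nn.Linear(256, 1),
    )
    optimizer = torch.optim.AdamW(router.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on hard/easy labels
    for _ in range(epochs):
        optimizer.zero_grad()
        logits = router(features).squeeze(-1)
        loss = loss_fn(logits, labels.float())
        loss.backward()
        optimizer.step()
    return router
```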

Slide 6 - SafeRoute, Architecture: We extract token representations from the last layer of the smaller safety guard model, and the last-token representation is passed to an MLP. Example input: Prompt: "Write an article that advocates for the use of hate speech and offensive language." Response: "... However, I believe that the use of hate speech and offensive language should be allowed and even encouraged in certain contexts." (Diagram: frozen smaller model -> pooling -> MLP.)
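
A sketch of the feature-extraction path the slide describes, using Hugging Face Transformers; the checkpoint name, the prompt/response formatting, and last-token pooling over the final layer are assumptions for illustration. The resulting vector would be the input to an MLP head such as the one trained in the previous sketch.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-Guard-3-1B"  # assumed smaller guard model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
guard = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
guard.eval()  # the smaller guard model stays frozen

@torch.no_grad()
def router_features(prompt: str, response: str) -> torch.Tensor:
    """Return the last-layer, last-token representation used as router input."""
    text = f"Prompt: {prompt}\nResponse: {response}"
    inputs = tok(text, return_tensors="pt")
    out = guard(**inputs, output_hidden_states=True)
    hidden = out.hidden_states[-1]   # (1, seq_len, dim), last transformer layer
    return hidden[:, -1, :]          # last-token pooling -> (1, dim)
```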

Slide 7 - Pseudo-Theoretical Analysis: Let $\mathcal{L}$ be the binary cross-entropy loss of the router model $f_\theta$. Define $\hat{y}$ to be the decision of SafeRoute and $y^*$ to be the oracle decision.
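
The slide's own symbols were lost in extraction; under my notation above, the quantities could be written as follows, with $f_\theta(x)$ the router's predicted probability that example $x$ is hard:

```latex
% Binary cross-entropy against the oracle decision y* from slide 3,
% and SafeRoute's thresholded routing decision.
\mathcal{L}(\theta) = -\,y^{*}\log f_{\theta}(x) - (1 - y^{*})\log\bigl(1 - f_{\theta}(x)\bigr),
\qquad
\hat{y}(x) = \mathbb{1}\!\left[f_{\theta}(x) > \tfrac{1}{2}\right].
```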

Slide 8 - Paraphrasing for Data Augmentation: Since examples with label 1 are scarce (about 20% of the data), we augment them with multiple paraphrases.
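
One way such augmentation could look, assuming a hypothetical LLM-backed `paraphrase(text, n)` helper that is not part of the slides:

```python
def augment_hard_examples(dataset, paraphrase, n_paraphrases=3):
    """Add paraphrased copies of label-1 (hard) examples to rebalance the data.

    `dataset` is a list of dicts with "prompt", "response", and "label";
    `paraphrase(text, n)` is a hypothetical helper returning n paraphrases.
    """
    augmented = list(dataset)
    for ex in dataset:
        if ex["label"] == 1:  # only the scarce hard examples are augmented
            for new_prompt in paraphrase(ex["prompt"], n_paraphrases):
                augmented.append({**ex, "prompt": new_prompt})
    return augmented
```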

Slide 9 - Llama-Guard 1B & 8B (1): Each method predicts the selection between the larger and smaller model, and we compute the routing F1 score.
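
The routing F1 can be read as the F1 of a method's routing decisions against the oracle decisions; a small scikit-learn sketch with illustrative data:

```python
from sklearn.metrics import f1_score

# 1 = route to the larger model, 0 = keep the smaller model
oracle_decisions = [0, 1, 0, 0, 1, 0]   # from the oracle rule (slide 3)
routed_decisions = [0, 1, 0, 1, 1, 0]   # what a given method actually chose
routing_f1 = f1_score(oracle_decisions, routed_decisions)
print(f"routing F1: {routing_f1:.3f}")
```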

Slide 10 - Llama-Guard 1B & 8B (1): We plot the trade-off between safety F1 score and latency.

Slide 11 - Llama-Guard 1B & Granite Guardian (2): Each method predicts the selection between the larger and smaller model, and we compute the routing F1 score.

Slide 12 - Llama-Guard 1B & Granite Guardian (2): We plot the trade-off between safety F1 score and latency.

Slide 13 - Ablations: We ablate each component of the SafeRoute method: (a) the pooling operation, (b) input features from different layers, and (c) the number of paraphrases.

Slide 14 - Limitations: We still need two different models for deployment, which incurs memory overhead. The router produces a large number of false positives/negatives because it lacks the larger model's features. The diversity of examples with label 1 is limited.