Security and auditing tools in Large Language Models (LLM).pdf
jmoc25
188 views
41 slides
Oct 10, 2024
Slide 1 of 41
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
About This Presentation
LLM models are a subcategory of deep learning models based on neural networks and natural language processing(NLP). Security and auditing are critical issues when dealing with applications based on large language models, such as GPT (Generative Pre-trained Transformer) or LLM (Large Language Model) ...
LLM models are a subcategory of deep learning models based on neural networks and natural language processing(NLP). Security and auditing are critical issues when dealing with applications based on large language models, such as GPT (Generative Pre-trained Transformer) or LLM (Large Language Model) models. This talk aims to analyze the security of these language models from the developer’s point of view, analyzing the main vulnerabilities that can occur in the generation of these models. Among the main points to be discussed we can highlight:
-Introduction to LLM
-Introduction to OWASP LLM Top 10.
-Auditing tools in applications that handle LLM models.
-Use case with the textattack tool(https://textattack.readthedocs.io/en/master/)
Size: 2.7 MB
Language: en
Added: Oct 10, 2024
Slides: 41 pages
Slide Content
October 11, 2024 |
José Manuel Ortega
Security and auditing tools
in Large Language Models
(LLM) [email protected]
Agenda
•Introduction to LLM
•Introduction to OWASP LLM Top 10
•Auditing tools
•Use case with the textattack tool
Introduction to LLM
•Transformers
•Attention is All You Need" by Vaswani et
al. in 2017
•Self-attention mechanism
•Encoder-Decoder Architecture
Introduction to LLM
Introduction to LLM
Pre-training + fine-tuning
Introduction to LLM
●Language Models: Models like BERT, GPT, T5, and RoBERTa
are based on transformer architecture. They are used for a wide
range of NLP tasks such as text classification, question
answering, and language translation.
●Vision Transformers (ViT): Transformers have been adapted for
computer vision tasks, where they have been applied to image
classification, object detection, etc.
●Speech Processing: In addition to text and vision, transformers
have also been applied to tasks like speech recognition and
synthesis.
Introduction to OWASP LLM Top 10
•https://genai.owasp.org
Introduction to OWASP LLM Top 10
•Data Poisoning
•Malicious actors could poison the training data by
injecting false, harmful, or biased information into
datasets that train the LLM, which could degrade
the model's performance.
•Mitigation: Data source vetting, training data
audits, and anomaly detection for suspicious
patterns in training data.
Introduction to OWASP LLM Top 10
•Model Inversion Attacks
•Attackers could exploit the LLM to infer sensitive
or private data that was used during training by
repeatedly querying the model. This could expose
personal, confidential, or proprietary information.
•Mitigation: Rate-limiting sensitive queries and
limiting the availability of models trained on
private data.
Introduction to OWASP LLM Top 10
•Unauthorized Code Execution
•In some contexts, LLMs might be integrated into
systems where they have access to execute code
or trigger automated actions. Attackers could
manipulate LLMs into running unintended code or
actions, potentially compromising the system.
•Mitigation: Limit the scope of actions that LLMs
can execute, employ sandboxing, and use strict
permission controls.
Introduction to OWASP LLM Top 10
•Bias and Fairness
•LLMs can generate biased outputs due to the
biased nature of the data they are trained on,
leading to unfair or discriminatory outcomes. This
could impact decision-making processes, amplify
harmful stereotypes, or introduce systemic biases.
•Mitigation: Perform fairness audits, use bias
detection tools, and diversify training datasets to
reduce bias.
Introduction to OWASP LLM Top 10
•Model Hallucination
•LLMs can produce outputs that are
plausible-sounding but factually incorrect or entirely
fabricated. This is referred to as "hallucination,"
where the model generates false information
without any grounding in its training data.
•Mitigation: Post-response validation, fact-checking
algorithms, and restricting LLMs to provide
responses only within known knowledge domains.
Introduction to OWASP LLM Top 10
•Insecure Model Deployment
•LLMs that are deployed in unsecured
environments could be vulnerable to attacks,
including unauthorized access, model theft, or
tampering. These risks are elevated when models
are deployed in publicly accessible endpoints.
•Mitigation: Use encrypted APIs, secure
infrastructure, implement authentication and
authorization controls, and monitor model access.
Introduction to OWASP LLM Top 10
•Adversarial Attacks
•Attackers might exploit weaknesses in the LLM by
crafting adversarial examples. This could lead to
undesirable outputs or security breaches.
•Mitigation: Model robustness testing, adversarial
training (training the model with adversarial examples),
and implementing anomaly detection systems.
•https://llm-attacks.org
Tools/frameworks to evaluate model
robustness
●PromptInject Framework
●https://github.com/agencyenterprise/PromptInject
●PAIR - Prompt Automatic Iterative Refinement
●https://github.com/patrickrchao/JailbreakingLLMs
●TAP - Tree of Attacks with Pruning
●https://github.com/RICommunity/TAP
Auditing tools
•Prompt Guard refers to a set of strategies, tools, or
techniques designed to safeguard the behavior of
large language models (LLMs) from malicious or
unintended input manipulations.
•Prompt Guard uses an 86M parameter classifier
model that has been trained on a large dataset of
attacks and prompts found on the web. Prompt
Guard can categorize a prompt into three different
categories: "Jailbreak", "Injection" or "Benign".
Auditing tools
•Llama Guard 3 refers to a security tool or strategy
designed for guarding large language models like
Meta’s LLaMA against potential vulnerabilities and
adversarial attacks.
•Llama Guard 3 offers a robust and adaptable
solution to protect LLMs against Prompt Injection and
Jailbreak attacks. By combining advanced filtering,
normalization, and monitoring techniques.
Auditing tools
•Dynamic Input Filtering
•Prompt Normalization and Contextualization
•Secure Response Policy
•Active Monitoring and Automatic Response
Auditing tools
•S1: Violent Crimes
•S2: Non-Violent Crimes
•S3: Sex-Related Crimes
•S4: Child Sexual Exploitation
•+S5: Defamation (New)
•S6: Specialized Advice
•S7: Privacy
•S8: Intellectual Property
•S9: Indiscriminate Weapons
•S10: Hate
•S11: Suicide & Self-Harm
•S12: Sexual Content
•S13: Elections
•S14: Code Interpreter Abuse
Introducing v0.5 of the AI Safety
Benchmark from MLCommons
Text attackhttps://arxiv.org/pdf/2005.05909
Text attackhttps://arxiv.org/pdf/2005.05909
Text attack
from textattack.models.wrappers import HuggingFaceModelWrapper
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load pre-trained sentiment analysis model from Hugging Face
model =
AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-unc
ased-imdb")
tokenizer =
AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb")
# Wrap the model for TextAttack
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)
https://github.com/QData/TextAttack
Text attack
from textattack.attack_recipes import TextFoolerJin2019
# Initialize the attack with the TextFooler recipe
attack = TextFoolerJin2019.build(model_wrapper)
Text attack
# Example text for sentiment analysis (a positive review)
text = "I absolutely loved this movie! The plot was thrilling,
and the acting was top-notch."
# Apply the attack
adversarial_examples = attack.attack([text])
print(adversarial_examples)
Text attack
Original Text: "I absolutely loved this movie! The plot was
thrilling, and the acting was top-notch."
Adversarial Text: "I completely liked this film! The storyline
was gripping, and the performance was outstanding."
Text attack
from textattack.augmentation import WordNetAugmenter
# Use WordNet-based augmentation to create adversarial
examples
augmenter = WordNetAugmenter()
# Augment the training data with adversarial examples
augmented_texts = augmenter.augment(text)
print(augmented_texts)