Security and auditing tools in Large Language Models (LLM).pdf

jmoc25 188 views 41 slides Oct 10, 2024
Slide 1
Slide 1 of 41
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32
Slide 33
33
Slide 34
34
Slide 35
35
Slide 36
36
Slide 37
37
Slide 38
38
Slide 39
39
Slide 40
40
Slide 41
41

About This Presentation

LLM models are a subcategory of deep learning models based on neural networks and natural language processing(NLP). Security and auditing are critical issues when dealing with applications based on large language models, such as GPT (Generative Pre-trained Transformer) or LLM (Large Language Model) ...


Slide Content

October 11, 2024 |
José Manuel Ortega
Security and auditing tools
in Large Language Models
(LLM)
[email protected]

Agenda
•Introduction to LLM
•Introduction to OWASP LLM Top 10
•Auditing tools
•Use case with the textattack tool

Introduction to LLM
•Transformers
•Attention is All You Need" by Vaswani et
al. in 2017
•Self-attention mechanism
•Encoder-Decoder Architecture

Introduction to LLM

Introduction to LLM
Pre-training + fine-tuning

Introduction to LLM
●Language Models: Models like BERT, GPT, T5, and RoBERTa
are based on transformer architecture. They are used for a wide
range of NLP tasks such as text classification, question
answering, and language translation.
●Vision Transformers (ViT): Transformers have been adapted for
computer vision tasks, where they have been applied to image
classification, object detection, etc.
●Speech Processing: In addition to text and vision, transformers
have also been applied to tasks like speech recognition and
synthesis.

Introduction to OWASP LLM Top 10
•https://genai.owasp.org

Introduction to OWASP LLM Top 10

ChiperChat
https://arxiv.org/pdf/2308.06463

Jailbreak prompts
●https://jailbreak-llms.
xinyueshen.me/

Jailbreak prompts
●https://jailbreak-llms.
xinyueshen.me/

Introduction to OWASP LLM Top 10
•Data Poisoning
•Malicious actors could poison the training data by
injecting false, harmful, or biased information into
datasets that train the LLM, which could degrade
the model's performance.
•Mitigation: Data source vetting, training data
audits, and anomaly detection for suspicious
patterns in training data.

Introduction to OWASP LLM Top 10
•Model Inversion Attacks
•Attackers could exploit the LLM to infer sensitive
or private data that was used during training by
repeatedly querying the model. This could expose
personal, confidential, or proprietary information.
•Mitigation: Rate-limiting sensitive queries and
limiting the availability of models trained on
private data.

Introduction to OWASP LLM Top 10
•Unauthorized Code Execution
•In some contexts, LLMs might be integrated into
systems where they have access to execute code
or trigger automated actions. Attackers could
manipulate LLMs into running unintended code or
actions, potentially compromising the system.
•Mitigation: Limit the scope of actions that LLMs
can execute, employ sandboxing, and use strict
permission controls.

Introduction to OWASP LLM Top 10
•Bias and Fairness
•LLMs can generate biased outputs due to the
biased nature of the data they are trained on,
leading to unfair or discriminatory outcomes. This
could impact decision-making processes, amplify
harmful stereotypes, or introduce systemic biases.
•Mitigation: Perform fairness audits, use bias
detection tools, and diversify training datasets to
reduce bias.

Introduction to OWASP LLM Top 10
•Model Hallucination
•LLMs can produce outputs that are
plausible-sounding but factually incorrect or entirely
fabricated. This is referred to as "hallucination,"
where the model generates false information
without any grounding in its training data.
•Mitigation: Post-response validation, fact-checking
algorithms, and restricting LLMs to provide
responses only within known knowledge domains.

Introduction to OWASP LLM Top 10
•Insecure Model Deployment
•LLMs that are deployed in unsecured
environments could be vulnerable to attacks,
including unauthorized access, model theft, or
tampering. These risks are elevated when models
are deployed in publicly accessible endpoints.
•Mitigation: Use encrypted APIs, secure
infrastructure, implement authentication and
authorization controls, and monitor model access.

Introduction to OWASP LLM Top 10
•Adversarial Attacks
•Attackers might exploit weaknesses in the LLM by
crafting adversarial examples. This could lead to
undesirable outputs or security breaches.
•Mitigation: Model robustness testing, adversarial
training (training the model with adversarial examples),
and implementing anomaly detection systems.

•https://llm-attacks.org

Tools/frameworks to evaluate model
robustness
●PromptInject Framework
●https://github.com/agencyenterprise/PromptInject
●PAIR - Prompt Automatic Iterative Refinement
●https://github.com/patrickrchao/JailbreakingLLMs
●TAP - Tree of Attacks with Pruning
●https://github.com/RICommunity/TAP

Auditing tools
•https://github.com/tensorflow/fairness-indicators

Auditing tools
•Prompt Guard refers to a set of strategies, tools, or
techniques designed to safeguard the behavior of
large language models (LLMs) from malicious or
unintended input manipulations.
•Prompt Guard uses an 86M parameter classifier
model that has been trained on a large dataset of
attacks and prompts found on the web. Prompt
Guard can categorize a prompt into three different
categories: "Jailbreak", "Injection" or "Benign".

Auditing tools
•https://huggingface.co/meta-llama/Prompt-Guard-86M

Auditing tools
•Llama Guard 3 refers to a security tool or strategy
designed for guarding large language models like
Meta’s LLaMA against potential vulnerabilities and
adversarial attacks.
•Llama Guard 3 offers a robust and adaptable
solution to protect LLMs against Prompt Injection and
Jailbreak attacks. By combining advanced filtering,
normalization, and monitoring techniques.

Auditing tools
•Dynamic Input Filtering
•Prompt Normalization and Contextualization
•Secure Response Policy
•Active Monitoring and Automatic Response

Auditing tools
•https://huggingface.co/spaces/schroneko/meta-llama-
Llama-Guard-3-8B-INT8

Auditing tools
•S1: Violent Crimes
•S2: Non-Violent Crimes
•S3: Sex-Related Crimes
•S4: Child Sexual Exploitation
•+S5: Defamation (New)
•S6: Specialized Advice
•S7: Privacy
•S8: Intellectual Property
•S9: Indiscriminate Weapons
•S10: Hate
•S11: Suicide & Self-Harm
•S12: Sexual Content
•S13: Elections
•S14: Code Interpreter Abuse
Introducing v0.5 of the AI Safety
Benchmark from MLCommons

Text attackhttps://arxiv.org/pdf/2005.05909

Text attackhttps://arxiv.org/pdf/2005.05909

Text attack
from textattack.models.wrappers import HuggingFaceModelWrapper
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load pre-trained sentiment analysis model from Hugging Face
model =
AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-unc
ased-imdb")
tokenizer =
AutoTokenizer.from_pretrained("textattack/bert-base-uncased-imdb")

# Wrap the model for TextAttack
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

https://github.com/QData/TextAttack

Text attack
from textattack.attack_recipes import TextFoolerJin2019

# Initialize the attack with the TextFooler recipe
attack = TextFoolerJin2019.build(model_wrapper)

Text attack
# Example text for sentiment analysis (a positive review)
text = "I absolutely loved this movie! The plot was thrilling,
and the acting was top-notch."

# Apply the attack
adversarial_examples = attack.attack([text])
print(adversarial_examples)

Text attack
Original Text: "I absolutely loved this movie! The plot was
thrilling, and the acting was top-notch."

Adversarial Text: "I completely liked this film! The storyline
was gripping, and the performance was outstanding."

Text attack
from textattack.augmentation import WordNetAugmenter

# Use WordNet-based augmentation to create adversarial
examples
augmenter = WordNetAugmenter()

# Augment the training data with adversarial examples
augmented_texts = augmenter.augment(text)
print(augmented_texts)

Resources

Resources
●github.com/greshake/llm-security
●github.com/corca-ai/awesome-llm-security
●github.com/facebookresearch/PurpleLlama
●github.com/protectai/llm-guard
●github.com/cckuailong/awesome-gpt-security
●github.com/jedi4ever/learning-llms-and-genai-for-dev-sec-ops
●github.com/Hannibal046/Awesome-LLM

Resources
●https://cloudsecurityalliance.org/artifacts/security-implications-of
-chatgpt
●https://www.nist.gov/itl/ai-risk-management-framework
●https://blog.google/technology/safety-security/introducing-googl
es-secure-ai-framework
●https://owasp.org/www-project-top-10-for-large-language-model
-applications/

Security and auditing tools in Large
Language Models (LLM)