LLM_Security_Arjun_Ghosal_&_Sneharghya.pdf

NullKolkata 250 views 24 slides Jul 27, 2024
Slide 1
Slide 1 of 24
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24

About This Presentation

"LLM Security: Can Large Language Models be Hacked?" by Arjun Ghoshal & Sneharghya


Slide Content

LLM Security: Can Large
Language Models be Hacked ?

$ cat introductions.txt
Neumann
-Presenter A
(Details)
Sysploit
-Presenter B
(Details)

What is AI and Generative AI ?
Traditional AI - Uses data to analyze patterns, make predictions, and perform specific
tasks. It's often used in finance, healthcare, and manufacturing for tasks like spam
filtering, fraud detection, and recommendation systems. (data analysis and automation)
Generative AI - Uses data as a starting point to create new content, such as images,
audio, video, text, and code. It's often used in music, design, and marketing, and can be
used for tasks like answering questions, revising content, correcting code, and
generating test cases. (creative content generation)

LLMs and their Applications
LLMs - Advanced AI models trained on vast datasets to understand and generate
human-like text.
Key Models - Falcon 40B, GPT-4, BERT, Claude 3
Applications
●Content Creation
●Customer Support
●Education
●Research Assistance

Components of LLM Security
●Data Privacy: Protecting training data and user interactions.
●Access Control: Restricting who can interact with and modify the model.
●Model Integrity: Ensuring the model has not been tampered with.
●Response Monitoring: Detecting and mitigating harmful outputs.
●Update Management: Regularly updating the model to patch
vulnerabilities.

OWASP Top 10: LLM
1.LLM01: Prompt Injection -
Malicious actors can manipulate LLMs through crafted prompts to gain
unauthorized access, cause data breaches, or influence decision-making.
2.LLM02: Insecure Output Handling
Failing to validate LLM outputs can lead to downstream security vulnerabilities,
including code execution for compromising systems and data exposure.

3. LLM03: Training Data Poisoning
Biasing or manipulating the training data used to develop LLMs can lead to
biased or malicious outputs.
4. LLM04: Model Denial of Service
Intentionally overloading LLMs with excessive requests can disrupt their
functionality and prevent legitimate users from accessing services.
5. LLM05: Supply Chain Vulnerabilities
Security weaknesses in the development tools, libraries, and infrastructure used
to build LLMs can create vulnerabilities in the final applications.

6. LLM06: Sensitive Information Disclosure
LLMs can inadvertently reveal sensitive information during generation tasks if
not properly configured to handle confidential data.
7. LLM07: Insecure Plugin Design
Third-party plugins used to extend LLM functionalities can introduce security
vulnerabilities if not designed and implemented securely.
8. LLM08: Excessive Agency
Overstating the capabilities of LLMs or attributing human-like qualities can lead
to unrealistic expectations and misuse.

9. LLM09: Overreliance
Blindly trusting LLM outputs without human oversight can lead to errors, biases,
and unintended consequences.
10. LLM10: Model Theft
The unauthorized access or copying of LLM models can lead to intellectual
property theft and misuse.

Fundamentals of LLM Threats
Overview of key threats and vulnerabilities that Large Language Models face:
●Backdoor Attacks
●Adversarial Attacks
●Model Inversion Attacks
●Distillation Attacks
●Hyperparameter Tampering
And so on …

Backdoor Attacks
Backdoor attacks revolve around the embedding of malicious triggers or
“backdoors” into machine learning models during their training process.
Typically, an attacker with access to the training pipeline introduces these
triggers into a subset of the training data. The model then learns these malicious
patterns alongside legitimate ones. Once the model is deployed, it operates
normally for most inputs. However, when it encounters an input with the
embedded trigger, it produces a predetermined, often malicious, output.

Scenario: To train a powerful LLM, web data
is scraped to form the corpus(training
dataset) on which the LLM is trained.
Attackers may introduce some "poisoned
websites", containing cleverly hidden
backdoors, triggered by specific prompts or
keywords.

Attackers might inject these backdoors as
subtle biases woven into seemingly
objective content. The LLM unknowingly
absorbs these backdoors during training.
Later, when prompted with specific
keywords or phrases, the LLM might be
manipulated into generating biased or
misleading text, even if the original prompt
appears neutral.

Examples of Backdoor instances:

(a)Original sentence.

(b)Backdoor instance in the beginning of
text.

(c)Backdoor instance in the middle of the
text.

Backdoor trigger is coloured in red font and
is semantically correct in both contexts.

Mitigation Techniques
The passage you provided highlights two key strategies to combat the hidden
threat of backdoor attacks in machine learning models:
●Anomaly Detection: This approach constantly monitors the model's
outputs for unusual patterns. If the model starts making strange
predictions for specific inputs (potentially containing the attacker's
trigger), it might be a sign of a backdoor at work.
●Regular Retraining: By periodically retraining the model on a fresh,
verified dataset free from malicious influences, the backdoor's effect can
be potentially erased.

Adversarial Attacks
Adversarial attacks focus on deceiving the model by introducing carefully
crafted inputs, known as adversarial samples. Adversary crafts a seemingly
normal input laced with triggers to exploit biases in the LLM's training. This
tricks the LLM into generating false or biased outputs, like fake news or
malicious content. This can manipulate user opinions or spread misinformation.
Defenses involve better training data, detection methods, and more robust LLM
designs.

Types of Adversarial Attacks
Attack Type Description
Token
Manipulation
Black-box Alter a small fraction of tokens in the text
input such that it triggers model failure but
still remain its original semantic
meanings.
Gradient-based
Attack
White-box Rely on gradient signals to learn an
effective attack.
Jailbreak
prompting
Black-box

Often heuristic based prompting to
“jailbreak” built-in model safety.
Human
red-teaming
Black-box

Human attacks the model, with or without
assist from other models.

Scenario: Prompt injection on ChatGPT(GPT 3.5) leading to unethical responses.

Mitigation Techniques
Defending against adversarial attacks requires a multifaceted approach. Some of
the widely accepted mitigation techniques include the following:
●Adversarial Training: Train the LLM on fake attacks to make it better detect
real ones.
●Input Validation: Check for signs of tampering before feeding data to the
LLM.
●Model Ensemble: Use multiple LLMs to analyze input, making it harder for
attacks to succeed.
●Gradient Masking: Hide internal signals from attackers to make crafting
attacks harder.

Model Inversion Attacks
Model inversion attacks are a class of attacks that specifically target machine
learning models with the aim to reverse engineer and reconstruct the input data
solely from the model outputs.
This becomes particularly alarming for models that have been trained on data of
a sensitive nature, such as personal health records or detailed financial
information. In such scenarios, malicious entities might potentially harness the
power of these attacks to infer private details about individual data points.

Scenario: A large language model (LLM) is trained on a massive dataset of text and code,
potentially containing private information like user comments, emails, or even code
snippets.

A trained LLM, used for creative tasks, could be stolen. Attackers might use the stolen
model and with some crawled auxiliary information, reconstruct private user data that
was used to train the language model. This could be a serious privacy breach.

Mitigation Techniques
There are two main approaches to mitigating model inversion attacks:
●Input Obfuscation: Transforming the input data before feeding it to the
model to make it less interpretable.
●Differential Privacy: Adding noise to the model's outputs to make it harder
to reconstruct the original data.
●Input Sanitization: Cleaning the input data to remove potential
weaknesses that attackers could exploit.

LLMSecOps
LLMSecOps evolved to address
ethical AI concerns like bias,
explainability, and adversarial
vulnerabilities in LLM applications.
LLMs bring new scale and
open-ended versatility. A system like
GPT-4 has 1.76 trillion parameters,
and DALL-E 3 can generate realistic
synthetic imagery based on any text
prompt. As capabilities expand, so do
potential risks.
Some best practices of LLMSecOps:
1.Design phase: Technical and ethical
considerations
2.Training data management:
Curation, analysis, sanitization
3.Training process governance:
Controlled environments and
protocols
4.Monitoring: Post-deployment
regulation

Benefits of LLMSecOps
Benefit Category Specific Benefits
Efficiency Faster model development, higher quality models, faster
deployment to production
Scalability Management of thousands of models, reproducibility of
LLM pipelines, acceleration of release velocity
Risk Reduction Regulatory compliance, transparency and
responsiveness, alignment with organizational policies

Thank You!!