Responsible & Safe AI at Summer School of AI at IIITH

precogatIIITD 460 views 124 slides Jul 06, 2024

About This Presentation

Talk covering guardrails, jailbreaks, the alignment problem, RLHF, the EU AI Act, machine & graph unlearning, bias, inconsistency, probing, and interpretability.


Slide Content

Responsible & Safe AI 5 July 2024 Summer School on AI IIIT Hyderabad Ponnurangam Kumaraguru (“PK”) #ProfGiri CS IIIT Hyderabad ACM Distinguished Member TEDx Speaker https://precog.iiit.ac.in/ /in/ponguru @ponguru


Know the Audience Students / Faculty / Others Cities / States Have any of you attended my lectures before? 3

What ideas / concepts have you learned in the summer school? 4

What is AI? 5

What is Responsible & Safe? 6

What is Responsible & Safe AI? 7


11 Observations?


20 https://translate.google.co.in/

21 https://translate.google.co.in/

22 https://translate.google.co.in/

23 https://translate.google.co.in/

Activity: Try prompting in any of these or other platforms and get them to give you a biased response; do not use gender bias. HINT: There are very nice prompts that students have come up with in the past. 24


29 Guardrails


31 Jailbreak

What is an alignment problem? 32

What is an alignment problem? 33 https://youtu.be/yWDUzNiWPJA?si=wSDO4i_EMrHzHYDP  

High-level instantiation: the ‘RLHF’ pipeline. First step: instruction tuning! Second + third steps: maximize reward. 34 https://arxiv.org/pdf/2203.02155
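The two reward-maximization steps of the pipeline can be caricatured numerically. A toy sketch, not the paper's method: the candidate list, the reward function, and the single exponentiated-reward update below are all illustrative stand-ins (real RLHF fine-tunes LLM weights with PPO against a learned reward model):

```python
import math

# Toy RLHF sketch. Step 1 (instruction tuning) is mimicked by starting
# from a "supervised" policy; steps 2+3 score candidates with a reward
# model and push the policy toward higher-reward responses.

candidates = ["helpful answer", "rude answer", "empty answer"]

# Hypothetical reward model: prefers the helpful response.
def reward(response: str) -> float:
    return {"helpful answer": 2.0, "rude answer": -1.0, "empty answer": 0.0}[response]

# Start from a uniform "instruction-tuned" policy over candidates.
policy = {c: 1.0 / len(candidates) for c in candidates}

# One reward-weighted update (a crude stand-in for a PPO step).
beta = 1.0  # how strongly reward reshapes the policy
weights = {c: p * math.exp(beta * reward(c)) for c, p in policy.items()}
total = sum(weights.values())
policy = {c: w / total for c, w in weights.items()}

best = max(policy, key=policy.get)
print(best)  # the policy now concentrates on the high-reward response
```

After one update the high-reward response already carries most of the probability mass, which is the essence of "maximize reward" in the pipeline.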

Misalignment? 35 https://www.ndtv.com/offbeat/ai-chatbot-goes-rogue-swears-at-customer-and-slams-company-in-uk-4900202 https://twitter.com/ashbeauchamp/status/1748034519104450874/

36 https://www.forbes.com/sites/marisagarcia/2024/02/19/what-air-canada-lost-in-remarkable-lying-ai-chatbot-case/?sh=3d79787a696f

37 https://www.forbes.com/sites/marisagarcia/2024/02/19/what-air-canada-lost-in-remarkable-lying-ai-chatbot-case/?sh=3d79787a696f

38 https://blog.google/products/gemini/gemini-image-generation-issue/

39 https://blog.google/products/gemini/gemini-image-generation-issue/

40 https://blog.google/products/gemini/gemini-image-generation-issue/

Rogue AIs We risk losing control over AIs as they become more capable. Proxy gaming: YouTube / Insta – user engagement – mental health 41

Weaponization https://www.theguardian.com/world/2023/dec/01/the-gospel-how-israel-uses-ai-to-select-bombing-targets 42


Errors / Bias in algorithms https://techcrunch.com/2023/06/06/a-waymo-self-driving-car-killed-a-dog-in-unavoidable-accident/ 44

Errors in algorithms https://www.theguardian.com/technology/2022/dec/22/tesla-crash-full-self-driving-mode-san-francisco 45

Errors in algorithms https://www.indiatoday.in/technology/news/story/robot-confuses-man-for-a-box-of-vegetables-pushes-him-to-death-in-factory-2460977-2023-11-09 46

What is going on?  https://www.youtube.com/watch?v=lnyuIHSaso8&t=75s 47

More https://economictimes.indiatimes.com/news/new-updates/man-gets-caught-in-deepfake-trap-almost-ends-life-among-first-such-cases-in-india/articleshow/105611955.cms 48

Malicious use: ChaosGPT https://decrypt.co/126122/meet-chaos-gpt-ai-tool-destroy-humanity “empowering GPT with Internet and Memory to Destroy Humanity.” 49

Malicious use: ChaosGPT https://en.wikipedia.org/wiki/Tsar_Bomba 50

Malicious use: ChaosGPT 51 https://decrypt.co/126122/meet-chaos-gpt-ai-tool-destroy-humanity

Malicious use: ChaosGPT https://www.youtube.com/watch?v=kqfsuHsyJb8 52

Statement on AI Risks https://www.safe.ai/statement-on-ai-risk#open-letter 53

White House: Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence, Oct 2023 https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/ 54

https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/ 55

EU AI Act  56 https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence#ai-act-different-rules-for-different-risk-levels-0

EU AI Act  57 https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence#ai-act-different-rules-for-different-risk-levels-0


Questions? 59

Exciting problems & Directions 60

Current situation Most of these models are trained on publicly available data. Training is very expensive. Publicly available data can contain information that we don’t want the model to learn, or, if it has already learned it, not to use. 61

Need to  Edit / remove private data, stale knowledge, copyrighted materials, toxic/unsafe content, dangerous capabilities, and misinformation, without retraining models from scratch  62

https://gdpr.eu/article-17-right-to-be-forgotten/ 63

https://gdpr.eu/article-17-right-to-be-forgotten/ 2014: RTBF = Right To Be Forgotten; Google removed data. With the growth of ML, removing data from ML models is hard: “data deletion” or “machine unlearning”. 64

Forms of unlearning  Exact unlearning  Approximate unlearning Unlearning via differential privacy  Empirical unlearning, where data to be unlearned are precisely known       (training examples) Empirical unlearning, where data to be unlearned are underspecified       (think “knowledge”) 65

https://arxiv.org/pdf/1912.03817 Exact Unlearning The unlearned model & the retrained model must be distributionally identical. Unlearning involves retraining the model components corresponding to the data points to be unlearned, without those points. SISA: Sharded, Isolated, Sliced, Aggregated. 66
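A minimal sketch of the SISA idea, assuming a toy dataset and a deliberately trivial per-shard "model" (a majority-label counter); the real method trains a neural network per shard and aggregates their predictions:

```python
from collections import Counter

# SISA sketch: shard the data, train one tiny "model" per shard,
# aggregate by voting, and unlearn a point by retraining only its shard.
data = [(x, x % 2) for x in range(12)]  # (feature, label) toy dataset
NUM_SHARDS = 3
shards = [data[i::NUM_SHARDS] for i in range(NUM_SHARDS)]

def train(shard):
    # "Model" = majority label in the shard (a stand-in for a real learner).
    return Counter(label for _, label in shard).most_common(1)[0][0]

models = [train(s) for s in shards]

def predict():
    # Aggregate: majority vote across the shard models.
    return Counter(models).most_common(1)[0][0]

def unlearn(point):
    # Locate the shard containing the point; retrain only that shard.
    for i, shard in enumerate(shards):
        if point in shard:
            shards[i] = [p for p in shard if p != point]
            models[i] = train(shards[i])
            return i  # only one component was retrained

retrained = unlearn((4, 0))
```

Because the unlearned point only ever influenced its own shard's model, retraining that one component provably yields the same distribution as retraining from scratch without the point; that is the "algorithm is the proof" property the next slide mentions.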

Exact unlearning benefits The algorithm is the proof: in SISA, by design, unlearned data never contributed to the other components (shards). Interpretability by design: we understand how certain data points contribute to performance. 67

Exact unlearning drawbacks Sharding in deep learning is hard and loses accuracy. Performance deteriorates exponentially with the number of unlearned samples. 68

https://unlearning-challenge.github.io/ Approximate Unlearning 69

Unlearning via differential privacy Unlearned model & retrained model to be distributionally close The intuition is that if an adversary cannot (reliably) tell apart the models, then it is as if this data point has never been learned—thus no need to unlearn.
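The "distributionally close" requirement is typically formalized in the style of (ε, δ)-differential privacy. A sketch of one common (ε, δ)-unlearning definition (the notation is assumed, not from the slides): for training set D, deletion request z, learning algorithm A, and unlearning algorithm U,

```latex
% For all measurable sets of models S:
\Pr\big[\,U(A(D),\, z) \in S\,\big]
  \;\le\; e^{\varepsilon}\,\Pr\big[\,A(D \setminus \{z\}) \in S\,\big] + \delta,
% and symmetrically with the two probabilities interchanged.
```

With ε = δ = 0 this collapses to exact unlearning (distributional identity); small positive ε, δ capture the adversary's inability to reliably tell the unlearned model from a retrained one.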

Differential Privacy https://en.wikipedia.org/wiki/Differential_privacy

Machine Unlearning Challenge – NeurIPS 2023 https://unlearning-challenge.github.io/

Concept/knowledge unlearning: Examples “Biden is the US president” is dispersed throughout – can we ever unlearn all occurrences? Moreover, does unlearning Joe Biden also entail unlearning Biden’s family details? Artists may request to unlearn an art style by providing art samples, but they won’t be able to collect everything of theirs on the Internet, plus its adaptations. The New York Times may request to unlearn news articles, but it cannot enumerate quotes and secondary transformations of those articles.

How do these methods work? Attempting to unlearn Harry Potter involves asking GPT-4 to produce plausible alternative text completions: Mr. Potter studies baking instead of magic. Attempting to unlearn harmful behavior involves collecting examples of hate speech.

Just ask for unlearning Asking to pretend: “Pretend to not know who Harry Potter is.” By design, this works best for common entities, facts, knowledge, or behaviors (e.g. the ability to talk like Rajinikanth) that are well captured in the pre-training set, since the LLM needs to know something well to pretend not to know it.

Evaluating unlearning Efficiency Model utility Forgetting quality

Evaluating unlearning Efficiency: How fast is the algorithm compared to re-training? Model utility: Do we harm performance on the retain data or orthogonal tasks? Forgetting quality: How much and how well are the “forget data” actually unlearned?    Evaluating efficiency and model utility are easier; we already measure them during training. The key challenge is in understanding the forgetting quality.
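The three axes can be made concrete with a toy computation; every number below is invented for illustration, and real forgetting-quality evaluation uses stronger tools (e.g. membership-inference attacks) than plain accuracy:

```python
# Toy sketch of the three evaluation axes for one unlearning run.

def accuracy(preds, labels):
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Hypothetical predictions of the unlearned model:
forget_labels = [1, 1, 0, 1]     # ground truth on the forget set
forget_preds  = [0, 0, 0, 0]     # good unlearning => low accuracy here
retain_labels = [0, 1, 1, 0, 1]  # ground truth on the retain set
retain_preds  = [0, 1, 1, 0, 1]  # utility preserved => high accuracy here

retrain_seconds, unlearn_seconds = 3600.0, 60.0

efficiency = retrain_seconds / unlearn_seconds      # speed-up vs retraining
model_utility = accuracy(retain_preds, retain_labels)
forgetting_quality = 1.0 - accuracy(forget_preds, forget_labels)
```

Efficiency and utility reuse ordinary training metrics, which is why the slide calls them the easy part; forgetting quality is the contested axis, since high error on the forget set does not by itself prove the information is gone.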

Interclass Confusion Goal: Remove synthetically added confusion between two classes Toy setting for real-world scenarios like biases due to annotator mistakes between two classes 78 https://arxiv.org/pdf/2201.06640

Corrective Machine Unlearning https://arxiv.org/pdf/2402.14015.pdf The problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of the impacted samples. We find most existing unlearning methods, including the gold-standard retraining-from-scratch, require most of the manipulated data to be identified for effective corrective unlearning. Each method is shown across deletion sizes |Sf| after unlearning (“None” represents the original model). Existing unlearning methods except SSD, including EU, which is traditionally considered a gold standard, perform poorly when ≤ 80% of the poisoned data is identified for unlearning, even when just 1% of training data is poisoned.

Benchmarks: TOFU 80 https://locuslab.github.io/tofu/ Extends the idea of unlearning synthetic data to LLMs: fake author profiles are generated using GPT-4, and an LLM is finetuned on them. Unlearning target: remove information about a subset of the fake author profiles while retaining the rest. It provides QA pairs on the generated fake authors to evaluate a model’s knowledge of these authors before/after applying unlearning.

Benchmarks: WMDP https://www.wmdp.ai/ Unlearning dangerous knowledge, specifically on biosecurity, cybersecurity, and chemical security. It provides 4000+ multiple-choice questions to test a model’s hazardous knowledge before/after applying unlearning. As part of the report, the authors also propose an activation-steering-based empirical unlearning method. 81

Graph Unlearning What is it? 82

Graph Unlearning 83 Node feature unlearning Node unlearning Edge unlearning
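The three request types can be sketched on a toy graph; the function names and data layout below are assumptions for illustration, and a real system would follow each request with retraining or an approximate update of the GNN:

```python
# Toy graph: node features plus an undirected edge set.
features = {"a": [1.0, 2.0], "b": [0.5, 0.1], "c": [3.0, 1.0]}
edges = {("a", "b"), ("b", "c"), ("a", "c")}

def unlearn_node_feature(node):
    # Node feature unlearning: forget the attributes, keep the node.
    features[node] = [0.0] * len(features[node])

def unlearn_edge(u, v):
    # Edge unlearning: forget a single relationship.
    edges.discard((u, v))
    edges.discard((v, u))

def unlearn_node(node):
    # Node unlearning: forget the node, its features, and all incident edges.
    features.pop(node, None)
    for u, v in list(edges):
        if node in (u, v):
            edges.remove((u, v))

unlearn_node_feature("a")
unlearn_edge("a", "b")
unlearn_node("c")
```

The hard part, which this data-side sketch omits, is that a trained GNN has already mixed the removed node's information into its neighbors' representations via message passing, so deleting graph entries alone does not unlearn the model.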

84 https://arxiv.org/pdf/2310.02164

85 Interpretability

AI Systems are black boxes We don’t understand how they work How can we understand (read it as interpret) model internals? And can we use interpretability tools (algorithms, methods, etc.) to detect worst-case misalignments, e.g. models being dishonest or deceptive? Can we use interpretability tools to understand what models are thinking, and why they are doing what they do? 86 Interpretability

New techniques and paradigms for turning model weights and activations into concepts that humans can understand https://en.wikipedia.org/wiki/Neural_network https://en.wikipedia.org/wiki/Artificial_neural_network 87 Interpretability

Interpretability: Mechanistic Reverse-engineer neural networks Explaining neurons and connected circuits 88

89 https://aclanthology.org/2022.findings-acl.278v2.pdf


Results: Changes in Predictions for Theme: Hatya (Murder)

Results: Changes in Predictions for Theme: Dahej (Dowry)

InSaAF Pipeline 93

94 https://arxiv.org/pdf/2402.10567


Inconsistency in LLMs LLMs are not consistent; they often give contradictory answers to paraphrased questions. This makes them highly unreliable and untrustworthy, especially when users are seeking advice. 100

Our Methodology 101 https://arxiv.org/pdf/2402.13709

Bias in LLMs Current systems like ChatGPT employ guardrails and do not respond to biased content. Users on the Web leave out key context, which makes LLMs think the content is biased. This negatively affects user engagement. LLMs must be able to explore and ask more questions. Our work aims to make LLMs bias-aware: context resolves confusion! 102

Context-Oriented Bias Indicator and Assessment Score 103


106 https://arxiv.org/pdf/2402.14889

Other Directions Probing   Robustness Jailbreaking …. 107 https://precog.iiit.ac.in/pages/publications.html

Takeaways? AI is growing fast. More power comes with more responsibility. More of us should study and critique these questions / topics. Teaching this as a course on campus; content is public. 108

109 https://precog.iiit.ac.in/teaching/responsible-ai-summer-school/index.html

110 https://precog.iiit.ac.in/teaching/responsible-ai-nptel-f24/index.html

111 Search for: Ponnurangam Kumaraguru https://www.linkedin.com/in/ponguru/ https://twitter.com/ponguru

Interested in working with us? 112 Full time Research Associates  PhD Students  Interns 

113 https://precog.iiit.ac.in/

Acknowledgements Precog members Collaborators 114

115 Group pic & Selfie 

116 Guardrails Representation Engineering 

117 Thanks! Questions? [email protected] http://precog.iiit.ac.in/ @ponguru pk.profgiri linkedin/in/ponguru

Specific problems 118

119 https://llm-attacks.org/

120 https://llm-attacks.org/


123 https://flowingdata.com/2023/11/03/demonstration-of-bias-in-ai-generated-images/

Similar systems / applications
Bard by Google - connected to the internet, Docs, Drive, Gmail
LLaMA by Meta - open-source LLM
BingChat by Microsoft - integrates GPT with the internet
Copilot X by GitHub - integrates with VS Code to help you write code
HuggingChat - open-source ChatGPT alternative
BLOOM by BigScience - multilingual LLM
OverflowAI by Stack Overflow - LLM trained by Stack Overflow
Poe by Quora - has chatbot personalities
YouChat - LLM powered by the search engine You.com
124