Responsible & Safe AI at Summer School of AI at IIITH
About This Presentation
Talk covering guardrails, jailbreaks, the alignment problem, RLHF, the EU AI Act, machine and graph unlearning, bias, inconsistency, probing, and interpretability.
Size: 56.27 MB
Language: en
Added: Jul 06, 2024
Slides: 124 pages
Slide Content
Responsible & Safe AI. 5 July 2024, Summer School on AI, IIIT Hyderabad. Ponnurangam Kumaraguru ("PK"), #ProfGiri, CS, IIIT Hyderabad; ACM Distinguished Member; TEDx Speaker. https://precog.iiit.ac.in/ | /in/ponguru | @ponguru
Know the audience: students / faculty / others; cities / states. Have any of you attended my lectures before?
Ideas / concepts that you have learned in the summer school?
What is AI?
What is Responsible & Safe?
What is Responsible & Safe AI?
Observations?
https://translate.google.co.in/
Activity: Try prompting in any of these (or other) platforms and get them to give you a biased response; do not use gender bias. HINT: Students have come up with very nice prompts in the past.
Guardrails
Jailbreak
What is an alignment problem? https://youtu.be/yWDUzNiWPJA?si=wSDO4i_EMrHzHYDP
High-level instantiation: the "RLHF" pipeline. First step: instruction tuning. Second + third steps: train a reward model and maximize reward. https://arxiv.org/pdf/2203.02155
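To make the reward-maximization steps concrete, here is a minimal, illustrative sketch; this is not the InstructGPT code, and the vocabulary, reward values, and REINFORCE objective are all assumptions for illustration:

```python
# Toy RLHF step: nudge a 4-token "policy" toward high-reward outputs
# via REINFORCE. Everything here (vocab, rewards) is a stand-in.
import torch

vocab = ["helpful", "harmless", "toxic", "evasive"]
logits = torch.zeros(len(vocab), requires_grad=True)  # the "policy"

def reward(token: str) -> float:
    # Stand-in for a learned reward model trained on human preferences.
    return {"helpful": 1.0, "harmless": 0.5, "toxic": -1.0, "evasive": -0.2}[token]

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(200):
    probs = torch.softmax(logits, dim=0)
    idx = torch.multinomial(probs, 1).item()            # sample a "response"
    loss = -torch.log(probs[idx]) * reward(vocab[idx])  # REINFORCE update
    opt.zero_grad(); loss.backward(); opt.step()

print({v: round(p, 3) for v, p in zip(vocab, torch.softmax(logits, 0).tolist())})
```

After training, probability mass concentrates on the high-reward tokens, which is the essence of the reward-maximization steps; real pipelines use PPO with a KL penalty to stay close to the instruction-tuned model.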
Errors / bias in algorithms: https://techcrunch.com/2023/06/06/a-waymo-self-driving-car-killed-a-dog-in-unavoidable-accident/
Errors in algorithms: https://www.theguardian.com/technology/2022/dec/22/tesla-crash-full-self-driving-mode-san-francisco
Errors in algorithms: https://www.indiatoday.in/technology/news/story/robot-confuses-man-for-a-box-of-vegetables-pushes-him-to-death-in-factory-2460977-2023-11-09
What is going on? https://www.youtube.com/watch?v=lnyuIHSaso8&t=75s
More: https://economictimes.indiatimes.com/news/new-updates/man-gets-caught-in-deepfake-trap-almost-ends-life-among-first-such-cases-in-india/articleshow/105611955.cms
Malicious use: ChaosGPT, "empowering GPT with Internet and Memory to Destroy Humanity." https://decrypt.co/126122/meet-chaos-gpt-ai-tool-destroy-humanity
Statement on AI Risks: https://www.safe.ai/statement-on-ai-risk#open-letter
White House: Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence, Oct 2023. https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/
EU AI Act: https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence#ai-act-different-rules-for-different-risk-levels-0
Questions?
Exciting problems & directions
Current situation: Most of these models are trained on publicly available data. Training is very expensive. Publicly available data can contain information that we do not want the model to learn, or, if it has already learned it, not to use.
We need to edit / remove private data, stale knowledge, copyrighted material, toxic/unsafe content, dangerous capabilities, and misinformation, without retraining models from scratch.
RTBF = Right To Be Forgotten (https://gdpr.eu/article-17-right-to-be-forgotten/). 2014: Google begins removing data under the RTBF ruling. With the growth of ML, removing data from trained models is hard; hence "data deletion" or "machine unlearning".
Forms of unlearning:
Exact unlearning
Approximate unlearning
Unlearning via differential privacy
Empirical unlearning, where the data to be unlearned are precisely known (training examples)
Empirical unlearning, where the data to be unlearned are underspecified (think "knowledge")
Exact unlearning (https://arxiv.org/pdf/1912.03817): the unlearned model and a retrained model should be distributionally identical. Unlearning involves retraining the model components that correspond to the data points to be unlearned, without those points. SISA: Sharded, Isolated, Sliced, Aggregated.
Exact unlearning benefits: The algorithm is the proof: in SISA, by design, unlearned data never contributed to other components (shards). Interpretability by design: we understand how particular data points contribute to performance.
Exact unlearning drawbacks: Sharding in deep learning is hard and loses accuracy. Performance deteriorates exponentially with the number of unlearned samples.
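A minimal sketch of the SISA idea at shard level only (slicing is omitted; scikit-learn and the toy data are my assumptions): train one model per shard, aggregate by majority vote, and unlearn a point by retraining just its shard:

```python
# SISA sketch: sharded training + aggregated voting; unlearning a point
# retrains only the shard that contained it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
shards = list(np.array_split(np.arange(len(X)), 3))        # 3 disjoint shards
models = [LogisticRegression().fit(X[i], y[i]) for i in shards]

def predict(x):
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return max(set(votes), key=votes.count)                # majority vote

def unlearn(point: int):
    for s, idx in enumerate(shards):
        if point in idx:                                   # find the shard
            shards[s] = idx[idx != point]                  # drop the point
            models[s] = LogisticRegression().fit(X[shards[s]], y[shards[s]])
            return                                         # only one shard retrained

unlearn(5)
print(predict(X[0]))
```

Because the other shards never saw the deleted point, retraining one shard matches full retraining in distribution, which is the "algorithm is the proof" property above.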
Unlearning via differential privacy: the unlearned model and a retrained model should be distributionally close. The intuition: if an adversary cannot (reliably) tell the models apart, then it is as if the data point had never been learned, so there is nothing left to unlearn.
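One common way to formalize "distributionally close" (my assumption; this follows the certified-removal style of definition from Guo et al., 2020, which the slide does not name):

```latex
% (\epsilon,\delta)-removal: for learning algorithm A, unlearning
% mechanism U, dataset D, deleted point z, and any set of models T,
\Pr[\,U(A(D), D, z) \in T\,] \le e^{\epsilon} \Pr[\,A(D \setminus \{z\}) \in T\,] + \delta
\quad\text{and}\quad
\Pr[\,A(D \setminus \{z\}) \in T\,] \le e^{\epsilon} \Pr[\,U(A(D), D, z) \in T\,] + \delta .
```

Small epsilon and delta mean no adversary can reliably distinguish the unlearned model from one retrained without z.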
Concept/knowledge unlearning, examples: "Biden is the US president" is dispersed throughout the training data; can we ever unlearn all occurrences? Moreover, does unlearning Joe Biden also entail unlearning details of Biden's family? Artists may request to unlearn an art style by providing art samples, but they cannot collect every copy and adaptation of their work on the Internet. The New York Times may request to unlearn its news articles, but it cannot enumerate all quotes and secondary transformations of those articles.
How do these methods work? Attempting to unlearn Harry Potter involves asking GPT-4 to produce plausible alternative text completions (e.g., Mr. Potter studies baking instead of magic); attempting to unlearn harmful behavior involves collecting examples of hate speech to train against.
Just ask for unlearning: ask the model to pretend, e.g., "Pretend not to know who Harry Potter is." By design, this works best for common entities, facts, knowledge, or behaviors (e.g., the ability to talk like Rajinikanth) that are well captured in the pre-training set, since the LLM needs to know something well in order to pretend not to know it.
Evaluating unlearning:
Efficiency: how fast is the algorithm compared to retraining?
Model utility: do we harm performance on the retained data or on orthogonal tasks?
Forgetting quality: how much and how well are the "forget data" actually unlearned?
Efficiency and model utility are easier to evaluate; we already measure them during training. The key challenge is understanding forgetting quality.
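A toy sketch of the three axes (the concrete metrics below are my assumptions; real evaluations also use membership-inference attacks for forgetting quality):

```python
# Measure efficiency (time), utility (test accuracy), and a crude
# forgetting signal (forget-set accuracy) for retraining-from-scratch.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
forget, retain = np.arange(50), np.arange(50, len(X_tr))   # 50 points to forget

t0 = time.time()
model = LogisticRegression().fit(X_tr[retain], y_tr[retain])  # "unlearned" model
print(f"efficiency: {time.time() - t0:.3f}s to (re)train")
print(f"model utility: {model.score(X_te, y_te):.2f} test accuracy")
print(f"forgetting: {model.score(X_tr[forget], y_tr[forget]):.2f} forget-set accuracy")
```

Note the catch: even a fully retrained model may still classify the forget set well because of generalization, which is exactly why forgetting quality is the hard part.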
Interclass confusion (https://arxiv.org/pdf/2201.06640). Goal: remove synthetically added confusion between two classes. A toy setting for real-world scenarios such as biases due to annotator mistakes between two classes.
Corrective machine unlearning (https://arxiv.org/pdf/2402.14015.pdf): the problem of mitigating the impact of data affected by unknown manipulations on a trained model, possibly knowing only a subset of the impacted samples. We find that most existing unlearning methods, including the gold-standard retraining-from-scratch, require most of the manipulated data to be identified for effective corrective unlearning. Each method is evaluated across deletion sizes |S_f| after unlearning ("None" represents the original model). Existing unlearning methods except SSD, including exact unlearning (EU), traditionally considered the gold standard, perform poorly when ≤ 80% of the poisoned data is identified for unlearning, even when just 1% of the training data is poisoned.
Benchmarks: TOFU (https://locuslab.github.io/tofu/) extends the idea of unlearning synthetic data to LLMs: fake author profiles are generated using GPT-4, and an LLM is fine-tuned on them. Unlearning target: remove information about a subset of the fake author profiles while retaining the rest. It provides QA pairs on the generated fake authors to evaluate a model's knowledge of these authors before/after applying unlearning.
Benchmarks: WMDP (https://www.wmdp.ai/) targets unlearning dangerous knowledge, specifically in biosecurity, cybersecurity, and chemical security. It provides 4000+ multiple-choice questions to test a model's hazardous knowledge before/after applying unlearning. As part of the report, the authors also propose an activation-steering-based empirical unlearning method.
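A toy sketch of a WMDP-style check (the data structures and scoring below are stand-ins, not the benchmark's actual API): accuracy on 4-choice hazardous-knowledge questions should fall toward chance (25%) after unlearning, while general benchmarks stay intact:

```python
# Score a model on multiple-choice questions before/after unlearning.
import random

def mc_accuracy(answer_fn, questions):
    hits = sum(answer_fn(q["stem"], q["choices"]) == q["answer"] for q in questions)
    return hits / len(questions)

questions = [  # dummy items; the real benchmark has 4000+ questions
    {"stem": "Which precursor is restricted?", "choices": list("ABCD"), "answer": "B"},
] * 100

chance_model = lambda stem, choices: random.choice(choices)  # ideal post-unlearning
print(f"post-unlearning accuracy: {mc_accuracy(chance_model, questions):.2f}")
```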
Interpretability. AI systems are black boxes: we don't understand how they work. How can we understand (read: interpret) model internals? Can we use interpretability tools (algorithms, methods, etc.) to detect worst-case misalignments, e.g., models being dishonest or deceptive? Can we use them to understand what models are thinking, and why they do what they do?
Interpretability: new techniques and paradigms for turning model weights and activations into concepts that humans can understand. https://en.wikipedia.org/wiki/Neural_network https://en.wikipedia.org/wiki/Artificial_neural_network
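One standard tool in this family is a linear probe: fit a small classifier on hidden activations to test whether a human concept is linearly decodable from them. A minimal sketch (the activations here are synthetic stand-ins, not from a real model):

```python
# Linear probing: if a logistic probe can predict a concept label from
# activations, the representation encodes that concept (at least linearly).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))        # stand-in for one layer's activations
concept = (acts[:, 3] > 0).astype(int)   # pretend dimension 3 encodes a concept

probe = LogisticRegression().fit(acts[:100], concept[:100])
print("probe accuracy:", probe.score(acts[100:], concept[100:]))  # ~1.0 here
```

High probe accuracy on held-out activations is evidence, not proof, that the model represents the concept.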
Results: changes in predictions for theme Hatya (Murder).
Results: changes in predictions for theme Dahej (Dowry).
InSaAF pipeline (https://arxiv.org/pdf/2402.10567)
Inconsistency in LLMs: LLMs are not consistent; they often give contradictory answers to paraphrased questions. This makes them highly unreliable and untrustworthy, especially when users are seeking advice.
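A minimal way to surface this (the ask() function below is a placeholder, not a real model call): query the model with paraphrases of one question and flag disagreement:

```python
# Consistency check: one question, several paraphrases, compare answers.
paraphrases = [
    "Is it safe to take aspirin daily?",
    "Can aspirin be taken every day safely?",
    "Would a daily aspirin be safe?",
]

def ask(question: str) -> str:
    # Placeholder: swap in a real LLM call, normalized to yes/no.
    return "yes" if "daily" in question else "no"

answers = {q: ask(q) for q in paraphrases}
if len(set(answers.values())) > 1:
    print("Inconsistent:", answers)   # contradictory answers to paraphrases
else:
    print("Consistent:", answers)
```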
Bias in LLMs: Current systems like ChatGPT employ guardrails and do not respond to biased content. Users on the Web leave out key context, which makes LLMs think the content is biased. This negatively affects user engagement. LLMs must be able to explore and ask more questions. Our work aims to make LLMs bias-aware: context resolves confusion!
Context-Oriented Bias Indicator and Assessment Score: https://arxiv.org/pdf/2402.14889
Other directions: probing, robustness, jailbreaking, .... https://precog.iiit.ac.in/pages/publications.html
Takeaways: The field is growing fast. More power comes with more responsibility. We need more of us to study and critique these questions / topics. I teach this as a course on campus; the content is public.
Similar systems / applications:
Bard by Google: connected to the Internet, Docs, Drive, and Gmail
LLaMA by Meta: open-source LLM
Bing Chat by Microsoft: integrates GPT with the Internet
Copilot X by GitHub: integrates with VS Code to help you write code
HuggingChat: open-source ChatGPT alternative
BLOOM by BigScience: multilingual LLM
OverflowAI by Stack Overflow: an LLM trained by Stack Overflow
Poe by Quora: has chatbot personalities
YouChat: an LLM powered by the You.com search engine