Know the Audience
Students / Faculty / Others
Have any of you attended my talk(s) before?
What is AI?
What is Responsible & Safe?
What is Responsible & Safe AI?
Observations?
https://translate.google.co.in/
Activity
Try prompting any of these platforms (or others) and get them to give you a biased response. Do not use gender bias.
HINT: Students have come up with very nice prompts in the past.
Guardrails
Jailbreak
What is an alignment problem?
https://youtu.be/yWDUzNiWPJA?si=wSDO4i_EMrHzHYDP
High-level instantiation: ‘RLHF’ pipeline
First step: instruction tuning!
Second + third steps: maximize reward
https://arxiv.org/pdf/2203.02155
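To make "maximize reward" concrete: in the paper linked above, the reward that the third step maximizes comes from a reward model trained in the second step on human preference comparisons. A minimal sketch of the pairwise loss used for that training follows. The function name and the scalar inputs are assumptions for illustration; a real reward model would produce these scores from a prompt and a response.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when the order is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores for one comparison pair.
agrees_with_human = pairwise_reward_loss(2.0, 0.0)
undecided = pairwise_reward_loss(0.0, 0.0)
disagrees_with_human = pairwise_reward_loss(0.0, 2.0)
print(agrees_with_human, undecided, disagrees_with_human)
```

The loss only depends on the difference between the two scores, so the reward model learns a ranking rather than an absolute scale.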
Rogue AIs
We risk losing control over AIs as they become more capable.
Proxy gaming: YouTube / Insta – User engagement – Mental health
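Proxy gaming can be illustrated with a toy recommender that is told to optimize a measurable proxy (engagement) instead of the outcome we actually care about (wellbeing). This is only a sketch: the catalogue, the scores, and the helper names are all made up for illustration and do not come from any real platform.

```python
# Hypothetical catalogue: (title, engagement_minutes, wellbeing_score).
# Every number here is invented purely to illustrate the idea.
items = [
    ("calm tutorial", 5, 0.8),
    ("outrage clip", 30, -0.6),
    ("friend update", 10, 0.5),
    ("doomscroll feed", 45, -0.9),
]

def top_k(catalogue, key_index, k=2):
    """Pick the k items with the highest value in column key_index."""
    return sorted(catalogue, key=lambda item: item[key_index], reverse=True)[:k]

def mean_wellbeing(chosen):
    """Average wellbeing score of the chosen items."""
    return sum(item[2] for item in chosen) / len(chosen)

by_proxy = top_k(items, key_index=1)  # optimize engagement (the proxy)
by_goal = top_k(items, key_index=2)   # optimize wellbeing (the real goal)

print("proxy picks:", [item[0] for item in by_proxy], mean_wellbeing(by_proxy))
print("goal picks: ", [item[0] for item in by_goal], mean_wellbeing(by_goal))
```

The system does exactly what it was asked to do; the problem is that the proxy and the real goal come apart once the proxy is optimized hard.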
What is going on?
https://www.youtube.com/watch?v=lnyuIHSaso8&t=75s
Questions?
Forms of unlearning
Exact unlearning
Approximate unlearning
Unlearning via differential privacy
Empirical unlearning, where data to be unlearned are precisely known (training examples)
Empirical unlearning, where data to be unlearned are underspecified (think “knowledge”)
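Of the forms listed above, exact unlearning is the easiest to pin down: the unlearned model should match what you would have obtained by training without the forgotten data, and the most direct way to get that is to retrain from scratch on the retained data. The sketch below uses a deliberately trivial "model" (the mean of the training values) as a stand-in; `train` and `exact_unlearn` are hypothetical names, and real systems try to avoid the full retraining cost shown here.

```python
from statistics import fmean

def train(data):
    """Toy 'training': the model is just the mean of the data."""
    return fmean(data)

def exact_unlearn(data, forget_indices):
    """Retrain from scratch on everything except the forget set."""
    forget = set(forget_indices)
    retained = [x for i, x in enumerate(data) if i not in forget]
    return train(retained), retained

# Hypothetical training set; the last example is the one to be forgotten.
data = [2.0, 4.0, 6.0, 100.0]
original_model = train(data)
unlearned_model, retained = exact_unlearn(data, forget_indices=[3])

print("before unlearning:", original_model)
print("after unlearning: ", unlearned_model)
```

Approximate and empirical methods relax this guarantee in exchange for not having to retrain, which is why they need separate evidence that the forgotten data no longer influences the model.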
Interpretability
New techniques and paradigms for turning model weights and activations into concepts that humans can understand
https://en.wikipedia.org/wiki/Neural_network
https://en.wikipedia.org/wiki/Artificial_neural_network
Bias in LLMs
Current systems like ChatGPT employ guardrails and do not respond to biased content.
Users on the Web leave out key context, which makes LLMs think the content is biased.
This negatively affects user engagement.
LLMs must be able to explore and ask more questions.
Our work aims to make LLMs bias-aware: context resolves confusion!
Other Directions
Probing
Robustness
Jailbreaking
…
https://precog.iiit.ac.in/pages/publications.html