Introduction to AI Safety (public presentation).pptx
Introduction to AI Safety
Aryeh L. Englander, AMDS / A4I
Overview
What do we mean by Technical AI Safety?
- Critical systems: systems whose failure may lead to injury or loss of life, damage to the environment, unauthorized disclosure of information, or serious financial losses
- Safety-critical systems: systems whose failure may result in injury, loss of life, or serious environmental damage
- Technical AI safety: designing safety-critical AI systems (and, more broadly, critical AI systems) in ways that guard against accident risks – i.e., harms arising from AI systems behaving in unintended ways
Sources: Ian Sommerville, supplement to Software Engineering (10th edition); Remco Zwetsloot and Allan Dafoe, "Thinking About Risks From AI: Accidents, Misuse and Structure"
Other related concerns
- Security against exploits by adversaries; often considered part of AI Safety
- Misuse: people using AI in unethical or malicious ways (e.g., deepfakes, terrorism, suppression of dissent)
- Machine ethics: designing AI systems to make ethical decisions (e.g., the debate over lethal autonomous weapons)
- Structural risks from AI shaping the environment in subtle ways (e.g., job loss, increased risk of arms races)
- Governance, strategy, and policy: Should government regulate AI? Who should be held accountable? How do we coordinate with other governments and stakeholders to prevent risks?
- AI forecasting and risk analysis: When are these concerns likely to materialize? How concerned should we be?
[Figures: adversarial examples fooling an AI into thinking a stop sign is a 45 mph sign; potential terrorist use of lethal fully autonomous drones; jobs at risk of automation by AI (based on a report from the OECD)]
AI Safety research communities
Two related research communities: AI Safety and Assured Autonomy.
AI Safety:
- Focus on long-term risks from roughly human-level AI or beyond
- Also focused on near-term concerns that may scale up to, or provide insight into, long-term issues
- Relatively new field – the past 10 years or so
- Becoming progressively more mainstream: many leading AI researchers have expressed strong support for the research, and AI Safety research groups have been set up at several major universities and AI companies
Assured Autonomy:
- Older, established community with a broader focus on assuring autonomous systems in general
- Recently started looking at the challenges posed by machine learning
- Current and near-term focus
In the past year both communities have finally started trying to collaborate and work out a shared research landscape and vision.
APL's focus: near- and mid-term concerns, but it would be nice if our research also scales up to longer-term concerns.
AI Safety: lots of ways to frame it conceptually
- There are many different ways to divide up the problem space, and many different research agendas from different organizations
- It can get pretty complicated
[Figures: AI Safety landscape overview from the Future of Life Institute (FLI); connections between different research agendas (source: Everitt et al., AGI Safety Literature Review)]
AI Safety: DeepMind's conceptual framework
[Figure: source: DeepMind Safety Research blog]
Assured Autonomy: AAIP conceptual framework
AAIP = Assuring Autonomy International Programme (University of York)
[Figure: source: Ashmore et al., Assuring the Machine Learning Lifecycle]
Combined framework
- This is the proposed framework for combining the AI Safety and Assured Autonomy research communities
- It also tries to address relevant topics from the AI Ethics, Security, and Privacy communities; until now these communities haven't been talking to each other as much as they should
- Still in development; AAAI 2020 has a full-day workshop on this
- Personal opinion: I like that it's general, but I think it's a bit too general – best used only for very abstract overviews of the field
[Figure legend: focus of the AI Safety / DeepMind framework; focus of the Assured Autonomy / AAIP framework]
My personal preference
- Problems that scale up to the long term: DeepMind framework
- Near-term machine learning: AAIP framework
- Everything else: combined framework
AI safety concerns and APL's mission areas
- All of APL's mission areas involve safety- or mission-critical systems
- The military is concerned with assurance rather than safety (obviously, military systems are unsafe for the enemy), but the two concepts are very similar and involve similar problems and solutions
- The government is very aware of these problems, and this is part of why the military has been reluctant to adopt AI technologies:
  - Recent report from the Defense Innovation Board (primary document, supporting document)
  - Congressional Report on AI and National Security
  - DARPA: Assured Autonomy program, Explainable AI program
- If we want to get the military to adopt the AI technologies we develop here, those technologies will need to be assured and secure
Technical AI Safety
Specification problems
- These problems arise when there is a gap (often very subtle and unnoticed) between what we really want and what the system is actually optimizing for
- Powerful optimizers can find surprising and sometimes undesirable solutions for objectives that are even subtly mis-specified
- It is often extremely difficult or impossible to fully specify everything we really want
- Some examples: specification gaming, avoiding side effects, unintended emergent behaviors, bugs and errors
Specification: specification gaming
- The agent exploits a flaw in the specification
- Powerful optimizers can find extremely novel and potentially harmful solutions
- Examples: the evolved radio, Coast Runners; there are many other similar examples (see the toy sketch after this slide)
[Figures: the evolvable motherboard that led to the evolved radio; a reinforcement learning agent discovers an unintended strategy for achieving a higher score (source: OpenAI, Faulty Reward Functions in the Wild)]
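The Coast Runners failure is easy to reproduce in miniature: under a discounted return, a proxy reward of "points per checkpoint" can make endless looping score higher than actually finishing the race. The sketch below is purely illustrative; the reward values, horizon lengths, and discount factor are made-up assumptions, not numbers from the original environment.

```python
# Hypothetical sketch of specification gaming: a proxy reward ("points per
# checkpoint") diverges from the designer's intent ("finish the race").
# All numbers are invented for illustration.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Intended behavior: finish the race (100 checkpoints worth +1, then a +50 finish bonus).
finish = [1.0] * 100 + [50.0]

# Gamed behavior: circle forever through a few respawning checkpoints, never finishing.
loop = [1.0] * 10_000  # long enough to approximate an infinite loop under discounting

print("return(finish) =", round(discounted_return(finish), 1))
print("return(loop)   =", round(discounted_return(loop), 1))
# Under this proxy reward the looping policy scores higher, even though it is
# exactly the behavior the designer did not want: the Coast Runners failure mode.
```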
Specification: specification gaming (cont.)
This can be a problem for classifiers as well: the loss function (the "reward") might not really be what we care about, and we may not discover the discrepancy until later.
- Example: bias. We care about the difference between humans and animals more than about the difference between breeds of dogs, but the loss function optimizes for all of them equally. We only discovered this problem after it caused major issues.
- Example: adversarial examples. Deep Learning (DL) systems discovered weird correlations that humans never thought to look for, so predictions don't match what we really care about. We only discovered this problem well after the systems were in use.
[Figures: Google Images misidentified black people as gorillas; blank labels can make DL systems misidentify stop signs as Speed Limit 45 MPH signs]
Specification: avoiding side effects
- What we really want: achieve goals subject to common-sense constraints
- But current systems do not have anything like human common sense, and in any case a system would not constrain itself by default unless specifically programmed to do so
- The problem is likely to get much more difficult going forward: increasingly complex, hard-to-predict environments; an increasing number of possible side effects; increasing difficulty thinking of all those side effects in advance
[Figure: two side effect scenarios (source: DeepMind Safety Research blog)]
Specification: avoiding side effects (cont.)
- Standard TEV&V approach: brainstorm with experts about "what could possibly go wrong?"
- In complex environments it might not be possible to think of all the things that could go wrong beforehand (unknown unknowns) until it's too late
- Is there a general method we can use to guard against even unknown unknowns? Ideas in this category:
  - Penalize changing the environment (see the toy sketch after this slide)
  - The agent learns constraints by observing humans
[Figure: get from point A to point B – but don't knock over the vase! Can we think of all possible side effects like this in advance?]
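One way to make "penalize changing the environment" concrete is to subtract an impact penalty, measured against a baseline state, from the task reward. The sketch below is a toy illustration with assumed feature vectors, weights, and a hypothetical beta coefficient; it shows the general shape of the idea, not the relative-reachability formulation from the literature.

```python
# Minimal, hypothetical sketch: task reward minus a penalty for how much the
# agent has changed the world relative to a "do nothing" baseline.
import numpy as np

def impact_penalty(state, baseline_state, weights):
    """Weighted magnitude of environment features the agent changed vs. the baseline."""
    return float(np.dot(weights, np.abs(state - baseline_state)))

def shaped_reward(task_reward, state, baseline_state, weights, beta=1.0):
    """Task reward minus a scaled side-effect penalty."""
    return task_reward - beta * impact_penalty(state, baseline_state, weights)

# Toy features: [reached goal, vase intact, furniture moved]
baseline = np.array([0.0, 1.0, 0.0])                 # world if the agent did nothing
careful  = np.array([1.0, 1.0, 0.0])                 # reached goal, vase intact
broke    = np.array([1.0, 0.0, 0.0])                 # reached goal, vase broken
w = np.array([0.0, 5.0, 1.0])                        # we care a lot about the vase

print(shaped_reward(10.0, careful, baseline, w))     # 10.0
print(shaped_reward(10.0, broke, baseline, w))       # 5.0 (side effect is penalized)
```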
Specification: other problems
- Emergent behaviors (e.g., in multi-agent systems and human-AI teams) make the system much more difficult to predict and verify, which makes many of the above problems worse
- Bugs and errors: it can be even harder to find and correct logic errors in complex ML systems (especially Deep Learning) than in regular software systems (see later on TEV&V)
[Figure: OpenAI's hide-and-seek AI agents demonstrated surprising emergent behaviors]
Robustness problems
- How do we ensure that the system continues to operate within safe limits under perturbation?
- Some examples: distributional shift / generalization, safe exploration, security
Robustness: distributional shift / generalization
- How do we get a system trained on one distribution to perform well and safely if it encounters a different distribution after deployment?
- In particular, how do we get the system to proceed more carefully when it encounters safety-critical situations that it did not see during training?
- Generalization is a well-known problem in ML, but more work needs to be done
- Some approaches: cautious generalization, "knows what it knows", expanding on anomaly detection techniques (see the sketch after this slide)
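As one deliberately crude illustration of the "knows what it knows" / anomaly-detection idea, a deployed classifier can defer to a human operator or a safe default whenever its predictive confidence falls below a threshold. The logits, threshold, and deferral behavior below are assumptions for illustration only; softmax confidence is a weak out-of-distribution signal in practice.

```python
# Minimal sketch: predict when confident, otherwise defer to a fallback.
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_or_defer(logits, threshold=0.9):
    """Return (label, None) when confident, or (None, 'defer') otherwise."""
    probs = softmax(np.asarray(logits, dtype=float))
    if probs.max() >= threshold:
        return int(probs.argmax()), None
    return None, "defer"

print(predict_or_defer([8.0, 0.5, 0.1]))   # confident -> (0, None)
print(predict_or_defer([1.1, 1.0, 0.9]))   # uncertain -> (None, 'defer')
```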
Robustness: safe exploration
- If an RL agent uses online learning or needs to train in a real-world environment, then the exploration itself needs to be safe
- Example: a self-driving car can't learn by experimenting with swerving onto sidewalks (see the sketch after this slide)
- Restricting learning to a controlled, safe environment might not provide sufficient training for some applications
[Figure: how do we tell a cleaning robot not to experiment with sticking wet brooms into sockets during training?]
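A very crude way to make exploration safer, sketched below with assumed action names and Q-values, is to restrict epsilon-greedy exploration to a designer-supplied whitelist of safe actions. Real safe-exploration research goes far beyond this, but the toy shows the basic tension: an unsafe action can look attractive to the learner if its value estimate is wrong.

```python
# Hypothetical sketch: epsilon-greedy action selection restricted to a safe whitelist.
import random

SAFE_ACTIONS = {"forward", "turn_left", "turn_right", "stop"}

def safe_epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random *safe* action with probability epsilon, else the best safe action."""
    safe_q = {a: q for a, q in q_values.items() if a in SAFE_ACTIONS}
    if random.random() < epsilon:
        return random.choice(sorted(safe_q))   # explore only among safe actions
    return max(safe_q, key=safe_q.get)         # exploit the best safe action

q = {"forward": 1.2, "turn_left": 0.3, "turn_right": 0.4,
     "stop": 0.0, "swerve_onto_sidewalk": 2.0}  # misleadingly high estimate
print(safe_epsilon_greedy(q))                   # never returns the unsafe action
```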
Robustness: security
(Security is sometimes considered part of safety / assurance, and sometimes treated separately.)
ML systems pose unique security challenges:
- Data poisoning: adversaries can corrupt the training data, leading to undesirable results
- Adversarial examples: adversaries can use small, crafted perturbations to fool ML systems (see the sketch after this slide)
- Privacy and classified information: by probing ML systems, adversaries may be able to uncover private or classified information that was used during training
[Figure: what if an adversary fools an AI into thinking a school bus is a tank?]
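To see why adversarial examples exist at all, the sketch below applies the fast gradient sign method (FGSM) idea to a toy logistic-regression classifier with made-up weights and inputs. Real attacks target deep networks and image pixels, but the mechanism is the same: perturb the input a small, bounded amount in the direction that increases the loss.

```python
# Toy FGSM-style perturbation against a logistic model p = sigmoid(w.x + b).
# Weights, bias, input, and epsilon are all illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, w, b, y_true, eps=0.4):
    """Perturb x by eps * sign(dLoss/dx) for cross-entropy loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y_true) * w            # gradient of the loss w.r.t. the input
    return x + eps * np.sign(grad_x)

w = np.array([1.0, -2.0, 0.5])
b = 0.0
x = np.array([0.4, -0.2, 0.3])           # originally classified as class 1 (p > 0.5)
x_adv = fgsm(x, w, b, y_true=1.0)

print("p(original)  =", round(float(sigmoid(w @ x + b)), 3))      # ~0.72
print("p(perturbed) =", round(float(sigmoid(w @ x_adv + b)), 3))  # ~0.39: label flips
```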
Monitoring and control
(DeepMind calls this "Assurance", but that's confusing since we've also been discussing Assured Autonomy.)
- Interpretability: many ML systems (especially DL) are mostly black boxes
- Scalable oversight: it can be very difficult to provide oversight of increasingly autonomous and complex agents
- Human override: we need to be able to shut down the system if needed
  - Building in mechanisms to do this is often difficult
  - If the operator is part of the environment that the system learns about, the AI could conceivably learn policies that try to avoid the human shutting it down ("you can't get the cup of coffee if you're dead"); example: a robot blocks the camera to avoid being shut off (see the sketch after this slide)
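The sketch below is a hypothetical illustration of one design principle for human override: keep the kill switch outside the learned policy, as a wrapper between the policy and the actuators, so the agent never optimizes over whether the override is triggered. It is a design sketch with invented names, not a solution to the deeper corrigibility problem described above.

```python
# Hypothetical override layer: every proposed action passes a human-controlled
# kill-switch check before it reaches the actuators.

class OverrideWrapper:
    def __init__(self, policy, kill_switch):
        self.policy = policy              # callable: observation -> action
        self.kill_switch = kill_switch    # callable: () -> bool (True = halt)

    def act(self, observation):
        if self.kill_switch():
            return "SAFE_SHUTDOWN"        # hand control back to the operator
        return self.policy(observation)

# Usage with stand-in components:
wrapper = OverrideWrapper(policy=lambda obs: "move_forward",
                          kill_switch=lambda: False)
print(wrapper.act({"camera": "..."}))     # "move_forward" until the switch is flipped
```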
Scaling up testing, evaluation, verification, and validation
- The extremely complex, mostly black-box models learned by powerful Deep Learning systems make it difficult or impossible to scale up existing TEV&V techniques
- It is hard to do enough testing or evaluation when the space of possible unusual inputs or situations is huge
- Most existing TEV&V techniques need to specify exactly what boundaries we care about, which can be difficult or intractable
- Systems can often only be verified in relatively simple, constrained environments; this doesn't scale well to more complex environments
- It is especially difficult to use standard TEV&V techniques for systems that continue to learn after deployment (online learning)
- TEV&V is also difficult in multi-agent or human-machine teaming environments due to possible emergent behaviors
Theoretical issues
- A lot of decision theory and game theory breaks down if the agent is itself part of the environment that it's learning about
- Reasoning correctly about powerful ML systems might become very difficult and lead to mistaken assumptions with potentially dangerous consequences
- It is especially difficult to model and predict the actions of agents that can modify themselves in some way or create other agents
[Figure: embedding agents in the environment can lead to a host of theoretical problems (source: MIRI Embedded Agency sequence)]
Human-AI teaming
- Understanding the boundaries: often even the system designers don't really understand where the system does or doesn't work
  - Example: researchers didn't discover the problem of adversarial examples until well after the systems were already in use; it took several more years to understand the causes of the problem (and it's still debated)
- Humans (even the designers) sometimes anthropomorphize too much and therefore use faulty "machine theories of mind"; current ML systems do not process data and information in the same way humans do
- This can lead to people trusting AI systems in unsafe situations
Systems engineering and best practices
- Careful design with safety / assurance issues in mind from the start
- Getting people to incorporate the best technical solutions and TEV&V tools
- A systems engineering perspective would likely be very helpful, but further work is needed to adapt systems / software engineering approaches to AI
- Training people not to use AI systems beyond what they're good for
- Being aware of the dual-use nature of AI and developing / implementing best practices to prevent malicious use (a different issue from what we've been discussing)
  - Examples: deepfakes, terrorist use of drones, AI-powered cyber attacks, use by oppressive regimes
  - Possibly borrowing techniques and practices from other dual-use technologies, such as cybersecurity
Assuring the Machine Learning Lifecycle
Data management
Model learning
Model verification
Model deployment
Final notes
- Some of these areas have received a significant amount of attention and research (e.g., adversarial examples, generalizability, safe exploration, interpretability); others not quite as much (e.g., avoiding side effects, reward hacking, verification & validation)
- It's generally believed that if early programming languages such as C had been designed from the ground up with security in mind, computer security today would be in a much stronger position
- We are mostly still in the early days of the most recent batch of powerful ML techniques (mostly Deep Learning); we should probably build in safety / assurance and security from the ground up
- Again, the military knows all this: if we want the military to adopt the AI technologies that we develop here, those technologies will need to be assured and secure
Research groups outside APL (partial list)
Technical AI Safety:
- DeepMind safety research (two teams: AI Safety team, Robust & Verified Deep Learning team)
- OpenAI safety team (no particular team website; a core part of their mission)
- Machine Intelligence Research Institute (MIRI)
- Stanford AI Safety research group
- Center for Human-Compatible AI (CHAI, UC Berkeley)
Assured Autonomy:
- Institute for Assured Autonomy (IAA, partnership between Johns Hopkins University and APL)
- Assuring Autonomy International Programme (University of York)
- University of Pennsylvania Assured Autonomy research group
- University of Waterloo AssuredAI project
AI Safety Risks – Strategy, Policy, Analysis:
- Future of Life Institute (MIT)
- Future of Humanity Institute (University of Oxford)
- Centre for the Study of Existential Risk (CSER, University of Cambridge)
- Center for Security and Emerging Technology (CSET, Georgetown University)
Many of these organizations are closely tied to the Effective Altruism movement.
Primary reading
Technical AI Safety:
- Amodei et al., Concrete Problems in AI Safety (2016) – still probably the best technical introduction
- Alignment Newsletter – excellent coverage of related research; podcast version; database of all links from previous newsletters, arranged by topic, covering almost all major papers related to the field from the past year or two
- DeepMind's Safety Research blog
- Informal document from Jacob Steinhardt (UC Berkeley) – overview of several current research directions
Assured Autonomy:
- Ashmore et al., Assuring the Machine Learning Lifecycle (2019)
Longer-term concerns:
- Stuart Russell, Human Compatible: Artificial Intelligence and the Problem of Control (2019)
- Nick Bostrom, Superintelligence: Paths, Dangers, Strategies (2014); an excellent series of posts summarizes each chapter and provides additional notes
- Tom Chivers, The AI Does Not Hate You: Superintelligence, Rationality and the Race to Save the World (2019) – a lighter overview of the subject from a journalist; includes a good history of the AI Safety movement and other closely related groups
Partial bibliography: General / Literature Reviews
- Saria et al. (JHU), Tutorial on Safe and Reliable ML (2019); video, slides, references
- Richard Mallah (Future of Life Institute), "The Landscape of AI Safety and Beneficence Research" (2017)
- Hernandez-Orallo et al., Surveying Safety-relevant AI Characteristics (2019)
- Rohin Shah (UC Berkeley), An Overview of Technical AGI Alignment (podcast episode with transcript, 2019) – part 1, part 2, related video lecture
- Everitt et al., AGI Safety Literature Review (2018)
- Paul Christiano, AI Alignment Landscape (2019 blog post)
- Andrew Critch and Stuart Russell, detailed syllabus with links from a fall 2018 AGI Safety course at UC Berkeley
- Joel Lehman (Uber), Evolutionary Computation and AI Safety: Research Problems Impeding Routine and Safe Real-world Application of Evolution (2019)
- Victoria Krakovna, AI safety resources list
Partial bibliography: Technical AI Safety literature
- AI Alignment Forum, including several good curated post sequences
- Paul Christiano, Directions and Desiderata for AI Alignment (2017 blog post)
- Rohin Shah (UC Berkeley), Value Learning sequence (2018) – gives a thorough introduction to the problem and explains some of the most promising approaches
- Leike et al. (DeepMind), Reward Modeling (2018); associated blog post
- Dylan Hadfield-Menell (UC Berkeley), Cooperative Inverse Reinforcement Learning (2016); associated podcast episode; also see this video lecture
- Dylan Hadfield-Menell (UC Berkeley), Inverse Reward Design (2017)
- Christiano et al. (OpenAI), Iterated Amplification (2018); associated blog post; Iterated Amplification sequence on the Alignment Forum
- Irving et al. (OpenAI), Value alignment via debate (2018); associated blog post, podcast episode
- Christiano et al. (OpenAI, DeepMind), Deep Reinforcement Learning from Human Preferences (2017)
- Andreas Stuhlmüller (Ought), Factored Cognition (2018 blog post)
- Stuart Armstrong (MIRI / FHI), Research Agenda v0.9: Synthesizing a Human's Preferences into a Utility Function (2019 blog post)
Partial bibliography: Assured Autonomy literature
- University of York, Assuring Autonomy Body of Knowledge (in development)
- Assuring Autonomy International Programme, list of research papers
- Sandeep Neema (DARPA), Assured Autonomy presentation (2019)
- Schwarting et al. (MIT, Delft University), Planning and Decision-Making for Autonomous Vehicles (2018)
- Kuwajima et al., Open Problems in Engineering Machine Learning Systems and the Quality Model (2019)
- Calinescu et al. (University of York), Socio-Cyber-Physical Systems: Models, Opportunities, Open Challenges (2019) – focuses on the human component of human-machine teaming
- Salay et al. (University of Waterloo), Using Machine Learning Safely in Automotive Software (2018)
- Czarnecki et al. (University of Waterloo), Towards a Framework to Manage Perceptual Uncertainty for Safe Automated Driving (2018)
- Calinescu et al. (University of York), Engineering Trustworthy Self-Adaptive Software with Dynamic Assurance Cases (2017)
- Lee et al. (University of Waterloo), WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving (2019)
- Garcia et al., A Comprehensive Survey on Safe Reinforcement Learning (2015)
Partial bibliography: Misc.
Avoiding side effects:
- Krakovna et al. (DeepMind), Penalizing Side Effects Using Stepwise Relative Reachability (2019); associated blog post
- Alex Turner, Towards a New Impact Measure (2018 blog post)
- Achiam et al. (UC Berkeley), Constrained Policy Optimization (2017)
Testing and verification:
- Defense Innovation Board, AI Principles: Recommendations on the Ethical Use of Artificial Intelligence by the Department of Defense, Appendix IV.C (2019) – study by the MITRE Corporation on the state of AI T&E
- Kohli et al. (DeepMind), Towards Robust and Verified AI: Specification Testing, Robust Training, and Formal Verification (2019 blog post) – references several important papers on testing and validation of advanced ML techniques, and summarizes some of DeepMind's research in this area
- Haugh et al., The Status of Test, Evaluation, Verification, and Validation (TEV&V) of Autonomous Systems (2018)
- Hains et al., Formal Methods and Software Engineering for DL (2019)
Security:
- Xiao et al., Characterizing Attacks on Deep Reinforcement Learning (2019)
Control:
- Babcock et al., Guidelines for Artificial Intelligence Containment (2017)
Risks from emergent behavior:
- Jesse Clifton, Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda (blog post sequence, 2019)
Long-term risks:
- AI Impacts
- Ben Cottier and Rohin Shah, Clarifying Some Key Hypotheses in AI Alignment (blog post, 2019)