Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity

HungLe203 · 151 slides · May 11, 2024

About This Presentation

Despite remarkable successes in various domains such as robotics and games, Reinforcement Learning (RL) still struggles with exploration inefficiency. For example, in hard Atari games, state-of-the-art agents often require billions of trial actions, equivalent to years of practice, while a moderatel...


Slide Content

AAMAS’24 Tutorial 9 Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity

2 Why This Topic? Deep Learning in RL: Deep RL Agents excel in complex tasks but need many interactions to learn Practicality Issues: Extensive learning steps hinder RL's use in real-world scenarios Exploration Optimization: Improving exploration is key for RL's real-world application Memory and Learning: Memory-based exploration can speed up learning and advance AI Generated by DALL-E 3

3 About Us Authors: Hung Le, Hoang Nguyen and Dai Do Our lab: A2I2, Deakin University Hung Le is a research lecturer at Deakin University, leading research on deep sequential models and reinforcement learning Hoang Nguyen is a second-year PhD student at A2I2, specializing in reinforcement learning and causality Dai Do is a second-year PhD student at A2I2, specializing in reinforcement learning and large language models A2I2 Lab Foyer, Waurn Ponds

4 Deakin University CRICOS Provider Code: 00113B About the Tutorial This tutorial is based on my previous presentations, expanding on topics covered in the following talks: Memory-Based Reinforcement Learning. AJCAI’22 Memory for Lean Reinforcement Learning. FPT Software AI Center . 2022 Neural machine reasoning. IJCAI’21 From deep learning to deep reasoning. KDD’21 My Blogs: https://hungleai.substack.com/ Generated by DALL-E 3

5 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction 👈 [We are here] Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

Reinforcement Learning Fundamentals and Exploration Inefficiency PART A Generated by DALL-E 3

7 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Key components and frameworks 👈 [We are here] Classic exploration Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

8 Reinforcement Learning Basics In reinforcement learning (RL), an agent interacts with the environment, taking an action a, receiving a reward r, and moving to a new state s. The agent is tasked with maximizing the accumulated reward, or return R, over time by finding optimal actions (a policy).
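A minimal sketch of this interaction loop; the Gymnasium API, the CartPole environment, and the random policy are illustrative assumptions, not part of the tutorial:

```python
import gymnasium as gym

# Illustrative only: CartPole-v1 and the random policy are placeholder choices.
env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)
episode_return = 0.0

for t in range(200):
    action = env.action_space.sample()      # a learned policy would map state -> action here
    state, reward, terminated, truncated, info = env.step(action)
    episode_return += reward                # accumulate the return R
    if terminated or truncated:
        break

print("return:", episode_return)
```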

9 Reinforcement Learning Concepts Policy π: maps state s to action a Return (discounted) G or R: the cumulative (discount-weighted) sum of rewards State value function V: the expected discounted return starting from state s and following policy π State-action value function Q: the expected return starting from state s, taking action a, and thereafter following policy π
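In standard textbook notation, these quantities are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right], \qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\, a_t = a \right]
```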

Classic RL algorithms: Value learning 10 Q-learning (temporal difference, TD) Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine Learning 8, no. 3 (1992): 279-292. Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8, no. 3 (1992): 229-256. Basic idea: before finding the optimal policy, we find the value function. Learn the (action) value function: estimate V(s) = E(∑R from s) and Q(s,a) = E(∑R from s,a). Given Q(s,a) → choose the action that maximizes the value (ε-greedy policy). RL algorithms: Q-Learning
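A minimal tabular Q-learning update as a sketch; the learning rate, discount factor, and dictionary-based table are placeholder choices:

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99        # illustrative learning rate and discount factor
actions = [0, 1]                # placeholder action set
Q = defaultdict(float)          # Q[(state, action)] -> estimated value

def td_update(s, a, r, s_next, done):
    """One Q-learning (temporal-difference) update of the action-value table."""
    target = r if done else r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```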

Classic RL algorithm: Policy gradient Basic idea: directly optimise the policy as a function of states. We need to estimate the gradient of the objective function E(∑R) w.r.t. the parameters of the policy, so the focus is on optimisation techniques. 11 REINFORCE (policy gradient) RL algorithms: Policy Gradient
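For reference, the REINFORCE estimator of that gradient takes the standard form:

```latex
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t \right]
```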

General RL algorithms 12 Q-Learning vs Policy Gradient Both require exploration to collect data for training

13 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Key components and frameworks Classic exploration 👈 [We are here] Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

14 Deakin University CRICOS Provider Code: 00113B ε-greedy ε-greedy is the simplest exploration strategy that works in theory. It heavily relies on pure randomness and biased estimates of the action values Q, and thus is sample-inefficient in practice. We usually go with what we currently believe is best, but sometimes we take a random chance to explore other options. This is one example of an optimistic strategy. It is used in Q-learning.
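A sketch of the selection rule; the Q-table layout follows the earlier Q-learning sketch and is an illustrative assumption:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon take a random action; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                        # explore
    return max(actions, key=lambda a: Q[(state, a)])         # exploit current estimate
```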

15 Deakin University CRICOS Provider Code: 00113B ε-greedy: Problems It is not surprising that this strategy struggles in real-world scenarios: being overly optimistic when your estimates are imprecise can be risky. It may lead to getting stuck in a local optimum and missing out on discovering the global one with the highest returns. Benchmarking ε-greedy (red line) and other exploration methods on Montezuma’s Revenge. Taïga, Adrien Ali, William Fedus, Marlos C. Machado, Aaron Courville, and Marc G. Bellemare. "Benchmarking bonus-based exploration methods on the arcade learning environment." arXiv preprint arXiv:1908.02388 (2019). ε-greedy

16 Deakin University CRICOS Provider Code: 00113B Upper Confidence Bound (UCB) One way to address the problem of over-optimism is to consider the uncertainty of the estimation. We do not want to miss an action with a currently low estimated value but high uncertainty, as it may possess a higher value. What we need is a guarantee on how far the true value can deviate from the estimate; Hoeffding’s Inequality provides one. How to estimate uncertainty? Implicitly, with a large value of the exploration-exploitation trade-off parameter c, the chosen action is more likely to deviate from the greedy action, leading to increased exploration.
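The resulting UCB1 selection rule, in its standard textbook form (c is the trade-off parameter above, N_t(a) the visit count of action a):

```latex
a_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]
```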

17 Deakin University CRICOS Provider Code: 00113B Thompson Sampling When additional assumptions about the reward distribution are available, an action can be chosen based on the probability that it is optimal (probability matching strategy). Thompson sampling is one way to implement the strategy: assume the reward follows a distribution p(r|a, θ), where θ is the parameter whose prior is p(θ). Given the set of past observations D_t, made of pairs {(a_i, r_i) | i = 1, 2, ..., t}, we update the posterior using Bayes' rule. Given the posterior, we can estimate the action values and compute the probability of choosing action a.
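A minimal Beta-Bernoulli sketch of this procedure; the Beta(1, 1) prior and the three-armed bandit are placeholder assumptions:

```python
import random

K = 3
successes = [1] * K   # alpha parameters of the Beta posteriors
failures = [1] * K    # beta parameters of the Beta posteriors

def choose_arm():
    """Sample one value per arm from its posterior and play the best sample."""
    samples = [random.betavariate(successes[a], failures[a]) for a in range(K)]
    return max(range(K), key=lambda a: samples[a])

def update(arm, reward):
    """Bayesian update of the chosen arm's posterior for a 0/1 reward."""
    if reward > 0:
        successes[arm] += 1
    else:
        failures[arm] += 1
```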

18 Deakin University CRICOS Provider Code: 00113B Information Gain Information Gain (IG) measures the change in the amount of information (measured in entropy H) of a latent variable The latent variable often refers to the parameter of the model θ after seeing observation (e.g., reward r) caused by some action a  A big drop in the entropy means the observation makes the model more predictable and less uncertain Our goal is to find a harmony between minimizing expected regret in the current period and acquiring new information about the observation model Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information-directed sampling."  Advances in Neural Information Processing Systems  27 (2014).

19 Deakin University CRICOS Provider Code: 00113B Application: Multi-arm bandit There are multiple actions to take (the bandit’s arms). After taking one action, the agent observes a reward. The goal is maximizing cumulative rewards or minimizing cumulative regret. Generated by DALL-E 3

20 Deakin University CRICOS Provider Code: 00113B Limitations of Classical Exploration ❌ Scalability Issues : Most are specifically designed for bandit problems, and thus, they are hard to apply in large-scale or high-dimensional problems (e.g., Atari games), resulting in increased computational demands that can be impractical ❌ Assumption Sensitivity : These methods heavily rely on specific assumptions about reward distributions or system dynamics, limiting their adaptability when assumptions do not hold ❌ Vulnerability to Uncertainty : They may struggle in dynamic environments with complex reward structures or frequent changes, leading to suboptimal performance Generated by DALL-E 3

21 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL Hard Exploration Problems 👈 [We are here] Simple exploring solutions QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

Task: The agent searches for the key, picks up the key, opens the door to access the room, and finds the box in the room. Reward: if the agent reaches the box, it gets a +1 reward. 22 https://github.com/maximecb/gym-minigrid → How to learn such complicated policies using this simple reward? Modern RL Environments are Complicated

23 Deakin University CRICOS Provider Code: 00113B Why is Scaling a Big Problem? Practical environments often involve huge continuous state and action spaces Classical approaches cannot be implemented or fail to hold their theoretical properties in these settings Doom environment: continuous high-dimensional state space ( source ) Mujoco environment : continuous action space ( source ).

24 Deakin University CRICOS Provider Code: 00113B Challenging Environments for Exploration Environments require long-term memory of agents: Maze navigation with conditions such as finding the objects that have the same color as the wall Remember the shortest path to the objects experienced in the past Noisy environments: Noisy-TV: a random TV will distract the RL agent from its main task due to noisy screen https://github.com/jurgisp/memory-maze Noisy-TV ( source )

25 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL Hard exploration problems Simple exploring solutions 👈 [We are here] QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

26 Deakin University CRICOS Provider Code: 00113B Entropy Maximization In the era of deep learning, neural networks are used for approximating functions, including parameterizing value and policy functions in RL. ε-greedy is less straightforward for policy gradient methods, so an entropy loss term is introduced in the objective function to penalize overly deterministic policies. Maximizing this entropy bonus encourages diverse exploration and helps avoid suboptimal actions. ❌ It may also impede the optimization of other losses, especially the main objective ❌ The entropy loss does not enforce different levels of exploration for different tasks It is used in PPO, A3C, …
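A sketch of how such a bonus typically enters a policy-gradient loss; PyTorch is an illustrative choice, and the logits, advantages, and 0.01 coefficient are placeholders:

```python
import torch

def policy_loss_with_entropy(logits, actions, advantages, entropy_coef=0.01):
    """Policy-gradient loss minus an entropy bonus that discourages
    overly deterministic policies (A3C/PPO-style objective)."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    pg_loss = -(log_probs * advantages.detach()).mean()
    entropy_bonus = dist.entropy().mean()
    return pg_loss - entropy_coef * entropy_bonus
```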

27 Deakin University CRICOS Provider Code: 00113B Noisy Networks Another method to add randomness to the policy is to add noise to the weights of the neural networks. Throughout training, noise samples are drawn and added to the weights for both forward and backward propagation. ❌ Although Noisy Networks can vary the degree of exploration across tasks, adapting exploration at the state level is far from reachable: certain states with higher uncertainty may require more exploration, while others may not. An example of a noisy linear layer: here w is the weight matrix and b is the bias vector. The parameters µ_w, µ_b, σ_w and σ_b are the learnables of the network, whereas ε_w and ε_b are noise variables. Fortunato, Meire, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295 (2017)
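A minimal noisy linear layer as a sketch; independent Gaussian noise per parameter, with the factorised-noise trick and the initialisation scheme from the paper omitted for brevity:

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b),
    with eps resampled on every forward pass."""
    def __init__(self, in_features, out_features, sigma_init=0.1):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_w)
        eps_b = torch.randn_like(self.sigma_b)
        weight = self.mu_w + self.sigma_w * eps_w
        bias = self.mu_b + self.sigma_b * eps_b
        return torch.nn.functional.linear(x, weight, bias)
```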

28 Deakin University CRICOS Provider Code: 00113B End of Part A QA Demo Generated by DALL-E 3

Intrinsic Motivation: Surprise and Novelty PART B Generated by DALL-E 3 Generated by DALL-E 3

30 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks 👈 [We are here] Reward shaping and the role of memory A taxonomy of memory-driven intrinsic exploration Deliberate Memory for Surprise-driven Exploration Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

31 Frameworks No curiosity: random exploration, epsilon-greedy. “Tractable” exploration: somehow optimize exploration, e.g., UCB, Thompson sampling; only doable for simple environments. Approximate “tractable” exploration (count-based): scalable to harder environments. Intrinsic motivation exploration (SOTA): predictive, novelty- or surprise-based curiosity, causal, … https://cmutschler.de/rl

32 Deakin University CRICOS Provider Code: 00113B Reward Shaping Entropy loss or injecting noise into the policy/value parameters has the limitation that the level of exploration is not explicitly conditioned on fine-grained factors such as states or actions. Solution: intrinsic reward bonuses assign higher internal rewards to state-action pairs that require more exploration and vice versa. The final reward for the agent is the weighted sum of the intrinsic reward and the external (environment) reward.
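Written generically (β is the weighting coefficient; the notation is not tied to any single method):

```latex
r_t = r_t^{e} + \beta\, r_t^{i}
```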

33 Intrinsic Motivation in the Biological World Animals can travel long distances until they find food. Humans can navigate to an address in a strange city. What motivates these agents to explore? Intrinsic motivation (curiosity, hunches) → intrinsic reward. https://www.beepods.com/5-fascinating-ways-bees-and-flowers-find-each-other/

34 Deakin University CRICOS Provider Code: 00113B What Does Intrinsic Reward Represent? Novelty It is inherent for biological agents to be motivated by new things. Tracking the occurrences of a state provides a novelty indicator, with increased occurrences signaling less novelty Surprise Surprise emerges when there's a discrepancy between expectations and the observed or experienced reality Build a model of the environment, predicting the next state given the current state and action The intrinsic reward is the prediction error itself This reward increases when the model encounters difficulty in predicting or expresses surprise at the current observation. Novelty Surprise

35 Deakin University CRICOS Provider Code: 00113B The Role of Memory Biological agents inherently possess memory to monitor events: Drawing from previous experiences, they discern novelty in observations Utilizing their prior understanding of the world, they identify unexpected observations RL agents can be equipped with memory: Event-based Memory: Episodic Memory Semantic Memory: World Model Novelty Surprise MEMORY

36 Deakin University CRICOS Provider Code: 00113B A Taxonomy of Memory for RL Exploration

37 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction 👈 [We are here] Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

38 Deakin University CRICOS Provider Code: 00113B Forward Dynamics Prediction Build a model of the environment, predicting the next state given the current state and action. This kind of model is also known as forward dynamics or a world model. C: actor vs M: world model. M predicts the consequences of C's actions; C acts to make M fail. As a result, if C's actions result in repeated and boring consequences, M predicts them well, so C must explore novel consequences. https://people.idsia.ch/~juergen/artificial-curiosity-since-1990.html
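A sketch of the surprise-as-prediction-error bonus (PyTorch; the two-layer world model and the feature dimensions are placeholder choices):

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """World model M: predicts the next state features from state and action."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def intrinsic_reward(model, state, action, next_state):
    """Surprise bonus: the prediction error of the world model."""
    with torch.no_grad():
        pred = model(state, action)
    return ((pred - next_state) ** 2).mean(dim=-1)   # per-transition error
```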

39 Deakin University CRICOS Provider Code: 00113B Learning Progress as Intrinsic Reward The learning progress is estimated by comparing the mean error rate of the prediction model during the current moving window to the mean error rate of the previous window. The two windows are offset by 𝜏 steps. It can mitigate Noisy-TV problems. Pierre-Yves Oudeyer & Frederic Kaplan. “How can we define intrinsic motivation?” Conf. on Epigenetic Robotics, 2008.

40 Deakin University CRICOS Provider Code: 00113B Deep Dynamic Models Use a neural network f that takes a representation of the current state and action to predict the next state The representation is shaped through unsupervised training, i.e., state reconstruction task, using an autoencoder’s hidden state The network f, fed with the autoencoder’s hidden state, is trained to minimize the prediction error, which is the norm of the difference between the predicted state and the true state Intrinsic reward Stadie , Levine, Abbeel : Incentivizing Exploration in Reinforcement Learning with Deep Predictive Models. In NIPS 2015.

41 More Complicated Model: ICM Forward model on a feature space. The feature space ignores irrelevant, uncontrollable factors. The consequences depend on the action (controllable) and the environment (uncontrollable); we want a state embedding that captures the controllable space, learned via inverse dynamics representation learning. https://blog.dataiku.com/curiosity-driven-learning-through-next-state-prediction Deepak Pathak et al.: Curiosity-driven Exploration by Self-Supervised Prediction. ICML 2017
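A simplified sketch of the inverse-dynamics part of this idea (ICM-style; the encoder architecture and dimensions are placeholders, and ICM's forward model on the learned features is omitted):

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """ICM-style inverse model: predict the action from phi(s_t) and phi(s_t+1).
    Training this jointly with the encoder pushes phi to keep only
    controllable, action-relevant features."""
    def __init__(self, obs_dim, feat_dim, n_actions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(2 * feat_dim, n_actions)

    def forward(self, obs, next_obs):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        action_logits = self.head(torch.cat([phi, phi_next], dim=-1))
        return action_logits, phi, phi_next   # logits trained with cross-entropy
```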


43 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises 👈 [We are here] Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

44 When Predictive Surprise Fails The prediction target is stochastic. Information necessary for the prediction is missing. The model class of predictors is too limited to fit the complexity of the target function. Both the totally predictable and the fundamentally unpredictable will get boring. https://openai.com/blog/reinforcement-learning-with-prediction-based-rewards/

45 https://favpng.com/ Ideas for Improvements Reward M’s progress instead of its error: if the consequence is too hard or too easy to predict, M makes no improvement → no reward. Remember all experiences: “store” all experienced consequences, including stochastic ones, in global or local memory, like humans do. Better representations: representation learning. https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction

46 Deakin University CRICOS Provider Code: 00113B Random Network Distillation (RND) The intrinsic reward is defined through the task of predicting the output of a fixed (target) network whose weights are random. By predicting the target output, the Predictor Network tries to “remember” the randomized state. If an old state reappears, it can be predicted easily by the Predictor Network. RND obviates Noisy-TV since the target network can be chosen to be deterministic and inside the model class of the predictor network. Burda, Yuri, Harrison Edwards, Amos Storkey, and Oleg Klimov. "Exploration by random network distillation." In International Conference on Learning Representations. 2018.
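A minimal sketch of the RND bonus (PyTorch; the network sizes and the 32-dimensional observation are placeholder choices, and the predictor would be trained on visited states with the same squared error):

```python
import torch
import torch.nn as nn

def make_net(obs_dim, out_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

target = make_net(obs_dim=32)      # fixed, randomly initialised network
predictor = make_net(obs_dim=32)   # trained to match the target's outputs
for p in target.parameters():
    p.requires_grad_(False)

def rnd_bonus(obs):
    """Intrinsic reward = error of the predictor on the frozen random target.
    Familiar states are predicted well and earn little bonus."""
    with torch.no_grad():
        t = target(obs)
    return ((predictor(obs) - t) ** 2).mean(dim=-1)
```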

47 Deakin University CRICOS Provider Code: 00113B Noisy-TV in Atari games In Montezuma’s Revenge, the agent oscillates between two rooms. This leads to an irreducibly high prediction error, as the non-determinism of sticky actions makes it impossible to know whether, once the agent is close to crossing a room boundary, making one extra step will result in it staying in the same room or crossing to the next one. This is a manifestation of the ‘noisy TV’ problem.

48 Deakin University CRICOS Provider Code: 00113B Latent World Model A world model for exploration should be robust against stochasticity and able to extrapolate the state dynamics, so that its prediction error can be a measurement of novelty. Train the WM on a latent representation space. This space is shaped by unsupervised learning into a zero-centered distribution with a covariance matrix equal to the identity; it is robust to stochastic elements and is arranged respecting the temporal distance of observations. The WM error is computed in latent space.

49 Deakin University CRICOS Provider Code: 00113B Latent World Model: Sample-efficient Atari Benchmark

50 Deakin University CRICOS Provider Code: 00113B Bayesian Surprise Surprise can be interpreted from a Bayesian statistics perspective. Similar to the IG idea, the aim is to minimize uncertainty about the dynamics, formalized as maximizing the cumulative reduction in entropy. The reduction of entropy per time step is also known as the mutual information I(Θ; S_{t+1} | ξ_t, a_t), where θ denotes the parameters of the dynamics model Θ. Because we are interested in an intrinsic reward for a given timestep, we can define: Houthooft, Rein, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. "VIME: Variational information maximizing exploration." Advances in Neural Information Processing Systems 29 (2016).
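In VIME this per-step information gain is written as the KL divergence between the updated and the previous posterior over the dynamics parameters:

```latex
r^{i}_t = D_{\mathrm{KL}}\left[\, p(\theta \mid \xi_t, a_t, s_{t+1}) \;\|\; p(\theta \mid \xi_t) \,\right]
```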

51 Deakin University CRICOS Provider Code: 00113B Variational Bayesian Surprise The KL involves computing the posterior p(θ|s_{t+1}), which is generally intractable. Use variational inference to approximate the posterior with an alternative variational distribution q(θ; 𝜙), where 𝜙 is the variational parameter. This is equivalent to parameterizing the dynamics model as a Bayesian neural network (BNN) with weight distributions maintained as a fully factorized Gaussian. Train 𝜙:

52 Deakin University CRICOS Provider Code: 00113B Bayesian Surprise Benchmarking

53 Deakin University CRICOS Provider Code: 00113B Bayesian Learning Progress Training a BNN is complicated, there are different Bayesian views on surprise Formulating the objective of the RL agent as jointly maximizing expected return and surprise P  is the true dynamics model and  P𝜙  is the learned dynamics model The objective can be translated to maximizing the bonus reward per step Joshua Achiam and Shankar Sastry . 2017. Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732 (2017).

54 Deakin University CRICOS Provider Code: 00113B Bayesian Learning Progress: Approximation Solutions In practice, we do not know P. Need approximation Prediction error: measures the error in log probability instead of the norm of the difference between the predicted and the reality Learning progress written in the form of log probability To train the dynamics model P𝜙, solve the constrained optimization By introducing the KL constraint, the posterior model is prevented from diverging too far from the prior, thereby preventing the generation of unstable intrinsic rewards.

55 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement 👈 [We are here] Break RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

56 Deakin University CRICOS Provider Code: 00113B Intrinsic Motivation via Disagreement An alternative method with forward dynamics involves using the variance of the predictions rather than the error. This requires multiple prediction models trained to minimize the forward dynamics prediction errors; the empirical variance (disagreement) of their predictions is used as the intrinsic reward. The higher the variance, the more uncertain the observation → the more exploration is needed. Deepak Pathak, et al. “Self-Supervised Exploration via Disagreement.” In ICML 2019.
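A sketch of the disagreement bonus, reusing an ensemble of forward-dynamics models like the one sketched earlier (the list of models and their interface are assumptions):

```python
import torch

def disagreement_bonus(models, state, action):
    """Intrinsic reward = variance of the next-state predictions across an
    ensemble of forward-dynamics models (higher disagreement -> more exploration)."""
    with torch.no_grad():
        preds = torch.stack([m(state, action) for m in models])  # [N, batch, dim]
    return preds.var(dim=0).mean(dim=-1)   # per-transition disagreement
```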

57 Deakin University CRICOS Provider Code: 00113B Bayesian Disagreement Bayesian surprises are defined by a specific dynamics model. What happens if we consider a distribution of models? The Bayesian surprise of a policy becomes: P(T) is the transition distribution of the environment and P(T|𝜙) is the transition distribution according to the dynamics model. Averaging the prediction error considers all transition models and possible predictions. P(S|s,a,t) is the dynamics model learned from a transition dynamics t. Pranav Shyam, Wojciech Jaskowski, and Faustino Gomez. 2019. Model-Based Active Exploration. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA. 5779–5788.

58 Deakin University CRICOS Provider Code: 00113B How to Compute u? The term u(s,a) turns out to be the Jensen-Shannon Divergence of a set of learned dynamics from a transition dynamics t. The JSD can be approximated by employing N dynamics models. For each P parameterized by a Gaussian distribution 𝓝_i(µ_i, Σ_i), we need another layer of approximation to compute u(s,a), replacing the Shannon entropy with the Rényi entropy and using the corresponding Jensen-Rényi Divergence (JRD).

59 Deakin University CRICOS Provider Code: 00113B Surprise-based Exploration: Pros and Cons 👍 Dynamics models can be trained easily these days. There are many works on that topic 👍 Advanced methods can somehow handle Noisy-TV ❌ Focusing on the forward dynamics error is not effective in driving the exploration, especially when the world model is not good and always predicts wrongly ❌ Advanced methods such as ensembles or learning progress are compute-expensive to cope with Noisy-TV Generated by DALL-E 3

60 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break 👈 [We are here] RAM-like Memory for Novelty-based Exploration Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

61 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Count-based memory 👈 [We are here] Episodic memory Hybrid memory Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

62 Deakin University CRICOS Provider Code: 00113B Novelty via Counting Humans want to explore novel places, make new friends, and buy new stuff. It is inherent for humans to be motivated by new things. How to translate this intrinsic motivation to RL agents? Tracking the occurrences of a state, N(s), provides a novelty indicator, with more occurrences meaning less novelty: r_i(s,a) = N(s)^(-1/2), where N counts the number of times s appears. ❌ Empirical counting in continuous state spaces is impractical due to the rarity of exact state visits, resulting in N(s)=0 most of the time.

63 Deakin University CRICOS Provider Code: 00113B Density-based State Counting Use a density function of the state to estimate its occurrences. Let ρ(x) = ρ(s=x | s_{1:n}) be a density function of the state x given s_{1:n}, and ρ′(x) = ρ(s=x | s_{1:n}, x) the density function of the state x after observing one more occurrence of it after s_{1:n}. Define N̂(x) as a “pseudo-count” of x and n̂ as the pseudo-total count, related to the density of x before and after an occurrence of x. Bellemare, Marc, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. "Unifying count-based exploration and intrinsic motivation." Advances in Neural Information Processing Systems 29 (2016).

64 Deakin University CRICOS Provider Code: 00113B Pseudo State Count In practice, in a huge state space, ρ′_n(x) ≈ 0, so we can rewrite the pseudo-count: PG means the predictive gain, which is computed as: This resembles the information gain: the difference between the expectation of a posterior and a prior distribution. To extend this to counting state-action pairs, one can concatenate the action representation with the state representation.

65 Deakin University CRICOS Provider Code: 00113B Hash Count If counting each exact state is challenging, why not partition the continuous state space into manageable blocks? By using a function 𝜙 mapping a state to a code, we can count the occurrences of the code instead of the state: “distant” states are counted separately while “similar” states are merged. SimHash is used as the mapping function. The higher k is, the fewer occurrence collisions, and thus the more states are distinguished. Here sgn is the sign function, A is a k × d matrix with i.i.d. entries drawn from a standard Gaussian distribution, and g is some transformation function. Tang, Haoran, Rein Houthooft, Davis Foote, Adam Stooke, OpenAI Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. "#Exploration: A study of count-based exploration for deep reinforcement learning." Advances in Neural Information Processing Systems 30 (2017).
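A minimal SimHash-count sketch (the code length k, the preprocessing g, and the NumPy implementation are illustrative assumptions):

```python
import numpy as np
from collections import defaultdict

class SimHashCount:
    """Count-based bonus over SimHash codes: phi(s) = sgn(A g(s)) with A ~ N(0, 1).
    k controls granularity; g is an optional preprocessing function."""
    def __init__(self, state_dim, k=16, g=lambda s: s, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))
        self.g = g
        self.counts = defaultdict(int)

    def bonus(self, state):
        code = tuple(np.sign(self.A @ self.g(state)).astype(int))
        self.counts[code] += 1
        return 1.0 / np.sqrt(self.counts[code])   # r_i = N(phi(s))^(-1/2)
```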

66 Deakin University CRICOS Provider Code: 00113B Hash Count: Cope with High-dimensional States Representation learning is employed to capture good g through autoencoder network and reconstruction learning The network aims to reconstruct the original state input s, and the hidden representation b(s) will be used to compute g(s)=round(b(s)) Another regularization term prevents the corresponding bit in the binary code from flipping throughout the agent's lifetime

67 Deakin University CRICOS Provider Code: 00113B Change Counting To encourage the agent to explore novel state-action pairs meaningfully, we can assess the changes caused by its activities and prioritize those that signify novelty. Define c(s,s’) as the environment change caused by a transition (s, a, s’). Combining the state count and the change count results in the intrinsic reward: Change count (last row) vs norm of change (middle row) vs state count (top row). Change count suffers less from being attracted to meaningless activities. Parisi, Simone, Victoria Dean, Deepak Pathak, and Abhinav Gupta. "Interesting object, curious agent: Learning task-agnostic exploration." Advances in Neural Information Processing Systems 34 (2021): 20516-20530.

68 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Count-based memory Episodic memory 👈 [We are here] Hybrid memory Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

69 Deakin University CRICOS Provider Code: 00113B Alternative Novelty Measurement ❌ An inherent constraint of count-based methods lies in the approximation error between the pseudo-count and the true count. A different novelty criterion: novel observations are those that demand effort to reach, typically beyond the already explored areas of the environment. Measure the effort in environment steps, estimating it with a neural network that predicts the number of steps between two observations. To capture the explored areas of the environment, use an episodic memory initialized empty at the start of each episode. Novelty through the reachability concept: an observation is novel if it can only reach those in the memory in more than k steps, i.e., if the estimated number of steps from the given observation to those in memory exceeds the threshold k, we regard it as novel. Savinov, Nikolay, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lillicrap, and Sylvain Gelly. "Episodic curiosity through reachability." arXiv preprint arXiv:1810.02274 (2018).

70 Deakin University CRICOS Provider Code: 00113B Episodic Curiosity: Memory Workflow Define the threshold  k , use reachability network to classify whether two observations are separated by more or less than  k  steps After training, the reachability network is used to estimate the novelty of the current observation in the episode given the episodic memory  M , which finally is used to compute the intrinsic reward Using a function  F  to aggregate the reachability scores between the current observation and those in the memory leads to the intrinsic reward.  F  can be max or 90-th percentile

71 Deakin University CRICOS Provider Code: 00113B Explicit Memory of Positions (only for Atari games) Collect the agent’s position from game RAM to indicate where on the grid an agent has visited White sections in the curiosity grid (middle) show which locations have been visited; the unvisited black sections yield an exploration bonus when touched. The network receives both game input (left) and curiosity grid (middle) and must learn how to form a map of where the agent has been (hypothetical illustration, right) Stanton, Christopher, and Jeff Clune . "Deep curiosity search: Intra-life exploration can improve performance on challenging deep reinforcement learning problems."  arXiv preprint arXiv:1806.00553  (2018).

72 Deakin University CRICOS Provider Code: 00113B Novelty Connection to Surprise Theoretically, an overparameterized autoencoder whose task is to reconstruct its input is equivalent to an associative memory ( Adityanarayanan  et al., PNAS 2020) We can train an autoencoder that takes the state as input to reconstruct and use its reconstruction error as an indicator of life-long novelty Greater errors signify a higher level of novelty in states, indicating that the autoencoder has not encountered these states frequently enough to learn their successful reconstruction effectively Related to RND: the intrinsic reward is still the reconstruction error. But this time, the reconstructed target is no longer the original input. Instead, it is a transformed version of the input

73 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Count-based memory Episodic memory Hybrid memory 👈 [We are here] Replay Memory QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

74 Deakin University CRICOS Provider Code: 00113B Surprise + Novelty Never Give Up agent (NGU) combines existing surprise and novelty components from the literature cleverly: State representation learning via inverse dynamics (ICM) Life-long novelty module using RND Episodic novelty using episodic memory inspired by EC The implementation of the episodic memory in NGU is new. The dynamics model  f  is employed to produce the representations for the novelty modules. Two types of novelty are combined to produce the final intrinsic reward.   Badia , Adrià Puigdomènech , Pablo Sprechmann , Alex Vitvitskyi , Daniel Guo , Bilal Piot, Steven Kapturowski , Olivier Tieleman et al. "Never give up: Learning directed exploration strategies."  arXiv preprint arXiv:2002.06038  (2020).

75 Deakin University CRICOS Provider Code: 00113B NGU: Episodic Novelty Encourages the exploration of novel states within an episode simply via nearest-neighbor matching. As a result, the agent will not revisit the same state in an episode twice. This concept is different from lifelong novelty. The closer the current state is to its neighbors, the higher the similarity and thus the smaller the reward. Hybrid Intrinsic Reward:
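A simplified sketch of that per-episode nearest-neighbour bonus (NGU additionally passes the distances through a kernel and running normalisation, which are omitted here):

```python
import numpy as np

class EpisodicNovelty:
    """Simplified episodic novelty: distance of the current embedding to its
    k nearest neighbours in the per-episode memory."""
    def __init__(self, k=10):
        self.k = k
        self.memory = []          # embeddings seen so far in this episode

    def bonus(self, emb):
        if self.memory:
            dists = np.sort([np.linalg.norm(emb - m) for m in self.memory])[:self.k]
            reward = float(np.mean(dists))   # farther from visited states -> larger bonus
        else:
            reward = 1.0
        self.memory.append(emb)
        return reward
```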

76 Deakin University CRICOS Provider Code: 00113B Agent57: Exploration at Scale An upgraded version of NGU: the value function is split into two separate value functions for external and internal rewards. A population of policies (and value functions) is trained, each characterized by a distinct pair of exploration parameters: N is the size of the population, 𝛾_j is the discount factor hyperparameter and β_j is the intrinsic reward coefficient hyperparameter, adapted by a meta-controller (a bandit algorithm). Badia, Adrià Puigdomènech, Bilal Piot, Steven Kapturowski, Pablo Sprechmann, Alex Vitvitskyi, Zhaohan Daniel Guo, and Charles Blundell. "Agent57: Outperforming the Atari human benchmark." In International Conference on Machine Learning, pp. 507-517. PMLR, 2020.

77 Deakin University CRICOS Provider Code: 00113B Agent 57: Atari Benchmark

78 Deakin University CRICOS Provider Code: 00113B Cluster Memory for Counting Parametric methods for counting have problems: slow adaptation and catastrophic forgetting. Count estimation should provide a long-term visitation-based exploration bonus while retaining responsiveness to the most recent experience: a finite slot-based container M stores representations and a corresponding counter C. The RECODE memory is updated by either adding a new embedding (atom) with count 1, or updating the nearest atom and increasing its count. The kernel is non-zero for all neighbours within a radius.

79 Deakin University CRICOS Provider Code: 00113B RECODE: Representation Learning The transformer takes masked sequences of length k consisting of actions and embedded observations as inputs and tries to reconstruct the missing embeddings in the output The reconstructed embeddings at time t − 1 and t are then used to build a 1-step action-prediction classifier  Similar to ICM’s inverse dynamics Saade , Alaa , Steven Kapturowski , Daniele Calandriello , Charles Blundell, Pablo Sprechmann , Leopoldo Sarra , Oliver Groth , Michal Valko , and Bilal Piot. "Unlocking the Power of Representations in Long-term Novelty-based Exploration." In  The Twelfth International Conference on Learning Representations . 2024.

80 Deakin University CRICOS Provider Code: 00113B Novelty of Surprise The norm of the prediction error (the surprise norm) is not good (e.g., as in Noisy-TV). A new metric, surprise novelty, is the error of reconstructing the surprise (the error of the state prediction). This requires a surprise generator, such as a dynamics model, to produce the surprise vector u, i.e., the difference vector between the prediction and reality. Then, inter- and intra-episode novelty scores are estimated by a memory system called Surprise Memory (SM), consisting of an autoencoder network W and an episodic memory M, respectively. Le, Hung, Kien Do, Dung Nguyen, and Svetha Venkatesh. "Beyond Surprise: Improving Exploration Through Surprise Novelty." In AAMAS, 2024.

81 Deakin University CRICOS Provider Code: 00113B Benefit of Hybrid Systems Marry the goodness of both worlds: Dynamic Prediction Novelty Estimation Combine intrinsic rewards: Long-term, global, inter-episodes Short-term, local, intra-episodes Limitation of dynamic prediction using deep models can be compensated with non-parametric memory approaches Noisy-TV problems are mitigated but not completely solved Noisy-TV: a random TV will distract the RL agent from its main task due to high surprise ( source ).

82 Deakin University CRICOS Provider Code: 00113B However, Is Intrinsic Reward Really Good? Taiga et al. On bonus based exploration methods in the arcade learning environment. In ICLR 2019. Intrinsic motivation rewards rely heavily on memory concepts (global, local, …). The performance of IR agents is very good, but they require more samples to train (10^9 or 10^10 steps is the norm). Isn't the goal of IR to enable sample efficiency? Are we overfitting to Montezuma's Revenge? Depending on architecture and tuning, normal exploration is in general OK.

83 Issues with Intrinsic Motivation Ecoffet , Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune . "First return, then explore."  Nature  590, no. 7847 (2021): 580-586.

84 Deakin University CRICOS Provider Code: 00113B Reflection on Memory Surprise: Memory is hidden inside dynamics models, memorizing the seen observations to make the prediction available This memory is long-term, semantics and slow to update Novelty: Memory is obvious as a slot-based matrix, nearest neighbour estimator, counter …. This memory is often short-term, instance-based and adaptive to changes in the environments Intrinsic Exploration Memory Surprise Novelty Memory Exploration Can memory be used directly for exploration? No reward bonus? ?

85 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay 👈 [We are here] Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

86 Deakin University CRICOS Provider Code: 00113B A Direct Exploration Mechanism Two major issues have hindered the ability of previous algorithms to explore: Detachment: the algorithm loses track of interesting areas to explore from Derailment: the exploratory mechanisms of the algorithm prevent it from utilizing previously visited states The role of memory is simplified: Store past states Retrieve states to explore Replay Replay Memory Sampled States Exploration

87 Deakin University CRICOS Provider Code: 00113B Go-Explore Detachment can be addressed by memory: keep track of areas by grouping similar states into cells (similar to hash count). Map a state to a cell; each cell has a score indicating its sampling probability. Derailment can be addressed by a simulator (only suitable for Atari): sample a cell’s state from the memory, and the simulator resets the state of the agent to the cell’s state. The memory is updated with new cells during exploration. Ecoffet, Adrien, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. "First return, then explore." Nature 590, no. 7847 (2021): 580-586.
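A minimal sketch of such a cell archive (the cell mapping, the 1/√(1+visits) weighting, and the interface are illustrative assumptions, not Go-Explore’s exact engineering):

```python
import random
from collections import defaultdict

class CellArchive:
    """Group states into cells, remember one state per cell, and sample
    less-visited cells to return to and explore from."""
    def __init__(self, to_cell):
        self.to_cell = to_cell                    # e.g. downscale/discretise a frame
        self.states = {}                          # cell -> a stored state to reset to
        self.visits = defaultdict(int)            # cell -> visit count

    def add(self, state):
        cell = self.to_cell(state)
        self.visits[cell] += 1
        self.states.setdefault(cell, state)

    def sample_cell(self):
        cells = list(self.states)
        weights = [1.0 / (1.0 + self.visits[c]) ** 0.5 for c in cells]
        return self.states[random.choices(cells, weights=weights, k=1)[0]]
```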

88 Deakin University CRICOS Provider Code: 00113B Go-Explore: Engineering Details State to cell: downscaled cells with adaptive downscaling parameters for robustness, by calculating how a sample of recent frames would be grouped into cells and selecting the values that result in the best cell distribution (manually designed). The selection probability of a cell at each step is proportional to its selection weight (count-based). Domain knowledge weight: (1) the number of horizontal neighbours of the cell present in the archive (h); (2) a key bonus for each location. Train from demonstrations: a backward algorithm places the agent close to the end of the trajectory and runs PPO until the performance of the agent matches that of the demonstration. C_seen is the number of exploration steps in which that cell is visited.

89 Deakin University CRICOS Provider Code: 00113B Addressing Cell Limitations ❌ Cell design is not obvious: it requires detailed knowledge of the observation space, the dynamics of the environment, and the subsequent task. Latent Go-Explore: Go-Explore operates without cells. A latent representation is learned simultaneously with the exploration. Sampling of the final goal is based on a non-parametric density model of the latent space. The simulator is replaced with goal-based exploration. Quentin Gallouédec and Emmanuel Dellandréa. 2023. Cell-free latent go-explore. In International Conference on Machine Learning. PMLR, 10571–10586.

90 Deakin University CRICOS Provider Code: 00113B Latent Go-Explore: Details Representation learning: ICM’s inverse dynamics, forward dynamics, vector-quantized variational autoencoder, or reconstruction. Density estimation to sample goals: goals must be at the edge of the yet-unexplored areas, and the goal must be reachable (already visited). Use a particle-based entropy estimator to estimate a density score, rank the candidates, and sample with a geometric law on the rank (p is a hyperparameter). The higher the rank (denser region), the less novel the sample and the less it is sampled.

91 Deakin University CRICOS Provider Code: 00113B Go-Explore Family is current SOTA on Atari

92 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay 👈 [We are here] QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Closing Remarks QA and Demo

93 Deakin University CRICOS Provider Code: 00113B Imitation Learning: Exploration via Exploitation Exploiting past good experiences affects learning. Self-imitation learning imitates the agent’s own past good decisions. The memory is a replay buffer that stores experiences. It learns a policy to imitate state-action pairs in the replay buffer only when the return in the past episode is greater than the agent’s value estimate (performance-based). If the return in the past is greater than the agent’s value estimate (R > Vθ), the agent learns to choose the action chosen in the past in the given state. Oh, Junhyuk, Yijie Guo, Satinder Singh, and Honglak Lee. "Self-imitation learning." In International Conference on Machine Learning, pp. 3878-3887. PMLR, 2018.
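A sketch of that performance-gated objective, following the (R − V)+ idea (the coefficients and the batch interface are assumptions):

```python
import torch

def self_imitation_loss(log_probs, returns, values, value_coef=0.5):
    """Only imitate past actions whose return exceeded the current value
    estimate, weighting by the clipped advantage (R - V)+."""
    advantage = torch.clamp(returns - values, min=0.0)
    policy_loss = -(log_probs * advantage.detach()).mean()   # imitate good past actions
    value_loss = 0.5 * (advantage ** 2).mean()               # push V up toward good returns
    return policy_loss + value_coef * value_loss
```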

94 Deakin University CRICOS Provider Code: 00113B Goal-based Exploration Using Memory Generate new trajectories visiting novel states by editing or augmenting the trajectories stored in the memory of past experiences. A sequence-to-sequence model with an attention mechanism learns to ‘translate’ the demonstration trajectory to a sequence of actions and generate a new trajectory. Sample using count-based novelty; insert a new trajectory if its ending differs significantly, otherwise replace the stored one with the higher-return trajectory. Yijie Guo, Jongwook Choi, Marcin Moczulski, Shengyu Feng, Samy Bengio, Mohammad Norouzi, and Honglak Lee. 2020. Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems 33 (2020), 4333–4345. Diverse Trajectory-conditioned Self-Imitation Learning (DTSIL)

95 Deakin University CRICOS Provider Code: 00113B DTSIL: Policy Learning Train a trajectory-conditioned policy π_θ(a_t | e_{≤t}, o_t, g) that should flexibly imitate any given trajectory g. To imitate, assign r_im as the imitation reward (0.1) if the state is similar. After visiting the last (non-terminal) state in the demonstration, the agent performs random exploration (r=0) to encourage exploration. Policy gradient training: further imitation encouragement.

96 Deakin University CRICOS Provider Code: 00113B DTSIL: Performance when Combined with Count

97 Deakin University CRICOS Provider Code: 00113B Replay Memory: Pros and Cons 👍 Replay memory provides a direct exploration mechanism without intrinsic reward 👍 Sampling strategies are built upon previous works ❌ Makes additional assumptions such as a simulator or the availability of demonstrations ❌ Often requires a goal-conditioned policy and multiple training stages Generated by DALL-E 3

98 Deakin University CRICOS Provider Code: 00113B End of Part B QA Demo Generated by DALL-E 3

Advanced Topics PART C Generated by DALL-E 3 Generated by DALL-E 3 Generated by DALL-E 3

100 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration 👈 [We are here] Language-assisted RL LLM-based exploration Causal discovery for exploration Closing Remarks QA and Demo  

101 Beyond the state space: Language-guided exploration Why language? Humans are able to learn quickly in new environments due to a rich set of commonsense priors about the world, which are reflected in language: read the instructions of a game to avoid trial and error. Abstraction. Compositional generalization.

102 How Language can be used in RL Luketina, Jelena, Nantas Nardelli, Gregory Farquhar, Jakob Foerster, Jacob Andreas, Edward Grefenstette, Shimon Whiteson, and Tim Rocktäschel. "A survey of reinforcement learning informed by natural language." arXiv preprint arXiv:1906.03926 (2019).

103 A more practical use case Make a cake from tools and materials: a pure RL agent needs to try thousands of settings until it finds the desired characteristics. If the RL agent reads and follows the recipe, it may take one trial to succeed. https://www.moresteam.com/toolbox/design-of-experiments.cfm

104 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Language-assisted RL 👈 [We are here] LLM-based exploration Causal discovery for exploration Closing Remarks QA and Demo  

105 Gated-attention for Task-oriented Language Grounding Task-oriented language grounding: extract meaningful representations from natural language instructions and map them to visual elements and actions. Task: given an initial image in pixels and an instruction → guide the agent to move towards the desired object. Two main modules: State Processing: process the image and language jointly with the State Processing module to obtain a state. Policy Learning: use a policy to map states to corresponding actions. Generated by DALL-E 3

106 State Processing Module Use Gated-Attention instead of concatenation to jointly represent image and language information as one state. The language instruction goes through a fully-connected layer to match the image dimension, producing the Attention Vector. Each element of the Attention Vector is expanded to a matrix to match the feature map of the corresponding image channel. The final representation is obtained via an element-wise product between the image and language representations. Generated by DALL-E 3
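A minimal sketch of that gating step (PyTorch; the tensor shapes and the sigmoid squashing are assumptions based on the description above):

```python
import torch

def gated_attention(image_features, attention_vector):
    """Gate the image feature maps with the instruction-derived attention vector.
    image_features: [batch, channels, H, W]; attention_vector: [batch, channels]."""
    gate = torch.sigmoid(attention_vector)              # values in (0, 1)
    gate = gate.unsqueeze(-1).unsqueeze(-1)             # [batch, channels, 1, 1]
    return image_features * gate.expand_as(image_features)   # element-wise product
```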

107 Policy Module Actions: Turn (left, right) Move forward Policy Architecture: two variants Behaviour Cloning: Uses target and object locations and orientations in every state to select optimal action A3C: use a deep neural network to learn policy and value function. Gives action based on policy   Generated by DALL-E 3

108 Results Generated by DALL-E 3 - With Gated-Attention, agent learns faster and achieves better accuracy (success rate of reaching correct object before episode terminates) - As environment gets harder, more exploration is needed -> A3C with GA performs better than Imitation Learning, where little exploration is done

109 Semantic Exploration from Language Abstractions and Pretrained Representations Novelty-based exploration methods suffer in high-dimensional visual state spaces, e.g., different viewpoints of one place in 3D can map to distinct visual states/features despite being semantically similar. Language can be a useful abstraction for exploration, as it coarsens the state space in a way that reflects the semantics of the environment. Solution: use vision-language pretrained representations to guide semantic exploration in 3D. Generated by DALL-E 3 Example of a state (picture) with a language description (caption); note how the caption focuses on important aspects of the state. Example of how many states can be conveyed with one text caption.

110 Intrinsic Reward Design State: an embedding of the image observation. Goal: described by a text instruction. The caption is encoded by a pretrained language encoder; the resulting language embedding is only used to calculate the intrinsic reward, and the agent never observes it. The intrinsic reward is goal-agnostic; it is computed from the state representation (either the visual or the language embedding). The intrinsic reward is added to two exploration algorithms: Never Give Up (NGU; Badia et al.) and Random Network Distillation (RND; Burda et al.). Generated by DALL-E 3 Badia, Adrià Puigdomènech, et al. "Never give up: Learning directed exploration strategies." arXiv preprint arXiv:2002.06038 (2020). Burda, Y., Edwards, H., Storkey, A., and Klimov, O. "Exploration by random network distillation." arXiv preprint arXiv:1810.12894 (2018).

111 Never Give Up (NGU) State representations (used to compute the intrinsic reward; either visual or language embeddings) along the trajectory are written to an episodic memory buffer. Novelty is a function of the L2 distance between the current state representation and its k nearest states in the buffer; the intrinsic reward is higher for larger distances. To influence exploration, modify the embedding function. Originally, the embedding function is learned. Variants: Vis-NGU & LSE-NGU use visual embeddings; Lang-NGU uses language embeddings. Generated by DALL-E 3
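The episodic bonus can be sketched in a few lines; the kernel form and constants below are simplified assumptions rather than the exact NGU formulation:

```python
import numpy as np

def episodic_novelty_bonus(embedding: np.ndarray,
                           memory: list,
                           k: int = 10,
                           eps: float = 1e-3) -> float:
    """Simplified NGU-style bonus: large when the current embedding is far
    from its k nearest neighbours in the episodic memory."""
    if not memory:
        return 1.0  # everything is novel at the start of an episode
    dists = np.linalg.norm(np.stack(memory) - embedding, axis=1)
    knn = np.sort(dists)[:k]
    # Close neighbours contribute large similarity, which shrinks the bonus.
    similarity = np.sum(eps / (knn ** 2 + eps))
    return float(1.0 / np.sqrt(similarity))

# Per-step usage: compute the bonus, then write the embedding to memory.
memory = []
for _ in range(5):
    z = np.random.randn(16)  # visual or language embedding of the current state
    r_int = episodic_novelty_bonus(z, memory)
    memory.append(z)
```

Swapping `z` between visual and language embeddings is exactly what the Vis-/LSE-NGU and Lang-NGU variants do to change what counts as "novel".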

112 Random Network Distillation (RND) The intrinsic reward is the prediction error of a trainable predictor network against a fixed, randomly initialized target network. The predictor network learns independently from the policy network. As training progresses, frequently-visited states yield less intrinsic reward because their prediction errors shrink. Generated by DALL-E 3
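A minimal sketch of the RND bonus; network sizes, optimizer, and the per-batch update schedule are illustrative assumptions:

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 32, 16
target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, feat_dim))
for p in target.parameters():
    p.requires_grad_(False)            # the randomly initialized target stays frozen
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def rnd_intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Per-state bonus = prediction error of the predictor against the frozen target."""
    with torch.no_grad():
        target_feat = target(obs)
    error = (predictor(obs) - target_feat).pow(2).mean(dim=-1)
    opt.zero_grad()
    error.mean().backward()            # training shrinks the error on frequently visited states
    opt.step()
    return error.detach()

rewards = rnd_intrinsic_reward(torch.randn(8, obs_dim))   # one bonus per state in the batch
```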

113 Results Example of how language representations (orange line) can help the agent explore better than visual representations in a 3D environment. With visual representations, the agent struggles with different views of the same scene. Generated by DALL-E 3

114 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Language-assisted RL LLM-based exploration 👈 [We are here] Causal discovery for exploration Closing Remarks QA and Demo

115 ELLM: Guiding Pretraining in RL with LLM Many distinct actions can lead to similar outcomes, motivating Intrinsically Motivated RL (IM-RL): explore outcomes rather than actions. Competence-based IM (CB-IM): maximize the diversity of skills mastered by the agent. CB-IM aims to optimize $\mathbb{E}_{g \sim G}\big[\sum_t R_{\text{int}}(s_t, a_t, s_{t+1} \mid g)\big]$. Given the goal distribution $G$ and goal-conditioned reward $R_{\text{int}}$, CB-IM algorithms train a goal-conditioned policy $\pi(a \mid s, g)$ that maximizes this objective. $G$ and $R_{\text{int}}$ must be defined such that three properties are satisfied: Diverse. Common-sense sensitive (Ex: chop a tree > drink a tree). Context sensitive (Ex: only chop the tree when the tree is in view). Generated by DALL-E 3

116 Why ELLM? Previous methods hand-define $R_{\text{int}}$ and $G$, and use various motivations to guide goal sampling: novelty, learning progress, intermediate difficulty. ELLM alleviates the need for environment-specific hand-coded definitions of $R_{\text{int}}$ and $G$ with a language model: language-based goal representations and language-model-based goal generation. Generated by DALL-E 3

117 Architecture Goals are represented in language: prompt the LLM with the available actions and a description of the current observation to construct the prompt, and the LLM generates goals. Open-ended goal generation: ask the LLM to generate goals freely. Closed-form: ask the LLM yes/no questions (e.g., should the agent do X? yes/no). Generated by DALL-E 3

118 Intrinsic Reward Design Reward LLM-suggested goals with the semantic similarity $\Delta$ between the generated goal and a language description of the agent's transition: $R_{\text{int}} = \Delta\big(C(o_t, a_t, o_{t+1}), g\big)$. If there are multiple suggested goals, reward the agent according to the goal most similar to the transition description: $R_{\text{int}} = \max_i \Delta\big(C(o_t, a_t, o_{t+1}), g_i\big)$. Generated by DALL-E 3
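A rough sketch of this reward, under the assumption that goals and transition captions are compared by cosine similarity between sentence embeddings; the threshold and embedding size are illustrative:

```python
import numpy as np

def ellm_intrinsic_reward(caption_emb: np.ndarray,
                          goal_embs: list,
                          threshold: float = 0.5) -> float:
    """Reward = highest cosine similarity between the transition caption and any
    LLM-suggested goal, zeroed out below an (illustrative) threshold."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    best = max(cosine(caption_emb, g) for g in goal_embs)
    return best if best > threshold else 0.0

# In practice the embeddings would come from a pretrained sentence encoder;
# random vectors stand in for them here.
caption = np.random.randn(384)
goals = [np.random.randn(384) for _ in range(3)]
r_int = ellm_intrinsic_reward(caption, goals)
```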

119 ELLM: Results Generated by DALL-E 3

120 Intrinsically Guided Exploration from LLMs (IGE-LLMs) For long, sequential tasks with sparse rewards, an intrinsic reward can help guide policy learning towards exploration, alongside the main policy driver, the extrinsic reward. An LLM can be used as an evaluator of the potential future reward of every action $a$, mapping each $(s, a)$ pair directly to an intrinsic reward $r_{\text{i}}$. The total reward is $r = r_{\text{e}} + \lambda \cdot w \cdot r_{\text{i}}$, where $r_{\text{e}}$ is the external reward, $\lambda$ is a controlling factor and $w$ is a linearly decaying weight. Generated by DALL-E 3
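A minimal sketch of combining the two reward streams with a linearly decaying weight; the variable names and decay schedule are assumptions for illustration:

```python
def combined_reward(r_ext: float, r_llm: float,
                    step: int, total_steps: int,
                    lam: float = 0.1) -> float:
    """Total reward = extrinsic + lambda * decaying_weight * LLM-based intrinsic."""
    w = max(0.0, 1.0 - step / total_steps)   # linearly decaying weight
    return r_ext + lam * w * r_llm

# The LLM's guidance matters early in training and fades out later.
print(combined_reward(r_ext=0.0, r_llm=0.8, step=100, total_steps=10_000))    # ~0.079
print(combined_reward(r_ext=0.0, r_llm=0.8, step=9_900, total_steps=10_000))  # ~0.0008
```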

121 Prompting Generated by DALL-E 3 Example of an input prompt to evaluate possible actions in the DeepSea environment. The LLM is given the current position and the possible actions, and is asked to rate every possible next action.

122 Benefit of Intrinsic Reward The LLM improves traditional exploration methods. Using the LLM to generate actions directly exhibits significant errors (grey lines on the right graph), even with advanced LLMs (GPT-4). However, when the LLM is used only to provide an intrinsic reward, it helps with exploration, especially in harder environments, and also results in better final performance. Generated by DALL-E 3

123 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration 👈 [We are here] Statistical approaches Deep learning approaches Closing Remarks QA and Demo

124 Deakin University CRICOS Provider Code: 00113B What is causality? The relationship between cause and effect. It raises two fundamental questions: Causal discovery: what evidence is required to infer cause-effect relationships? Causal inference: given causal information, what inferences can be drawn? Structural Causal Models (SCMs) framework (Pearl, 2009a): $M = \langle U, V, F \rangle$, with exogenous variables $U$, endogenous variables $V$, and structural functions $F$. Causal graph: a directed acyclic graph $G = (V, E)$ whose edges point from causes to effects. Figure 1: Example of an SCM and causal graph for the scurvy problem.
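As a toy illustration of the SCM formalism applied to the scurvy example, the variables, mechanisms, and probabilities below are invented purely for illustration:

```python
import random

def sample_scm(do_citrus=None) -> dict:
    """Toy SCM: exogenous noise U, endogenous variables V, structural functions F."""
    u_diet = random.random()                                  # exogenous background factor
    # An intervention do(citrus) overrides the structural function for 'citrus'.
    citrus = (u_diet > 0.5) if do_citrus is None else do_citrus
    vitamin_c = 1.0 if citrus else 0.1 + 0.2 * u_diet         # f_vitamin_c(citrus, u_diet)
    scurvy = vitamin_c < 0.3                                  # f_scurvy(vitamin_c)
    return {"citrus": citrus, "vitamin_c": vitamin_c, "scurvy": scurvy}

# Causal inference by intervention: P(scurvy | do(citrus=True)) is ~0 under this toy model.
print(sum(sample_scm(do_citrus=True)["scurvy"] for _ in range(1000)) / 1000)
```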

125 Deakin University CRICOS Provider Code: 00113B Why causality and RL? Understanding cause-effect reduces exploration of unnecessary actions and thus improves sample efficiency. Ex: do not move toward the door before obtaining the key. It improves interpretability. Ex: why does the policy prioritize the action of obtaining the key? It also improves generalizability.

126 Deakin University CRICOS Provider Code: 00113B Interpreting Causality in the RL Environment Taking action A can affect the reward R. The state S is the context variable that affects both the action A and the reward R. U is an unknown confounder variable. Approaches can be categorized by the techniques used to improve exploration or to measure causality: statistical vs deep learning methods. Figure 2: Causality in the RL environment.

127 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Statistical approaches 👈 [We are here] Deep learning approaches Closing Remarks QA and Demo

128 Deakin University CRICOS Provider Code: 00113B Causal influence detection for improving efficiency in reinforcement learning. Seitzer, M., Schölkopf, B., & Martius, G. (2021). Causal influence detection for improving efficiency in reinforcement learning. Advances in Neural Information Processing Systems, 34, 22905-22918.

129 Deakin University CRICOS Provider Code: 00113B Causal Action Influence Detection (CAI) As mentioned previously, the state S is decomposed into N components S = (S^1, ..., S^N). The one-step transition graph connects the current state and action (S, A) to the next state S'. How do we detect when the action influences the next state S'? Figure 3: Global causal graph, fully connected. Figure 4: Example of situation-dependent control.

130 Deakin University CRICOS Provider Code: 00113B Causal Action Influence Detection (CAI) (cont.) Conditional Mutual Information (CMI): the influence of the action on state component j in situation s is measured as $C^j(s) = I(S'^j; A \mid S = s)$. Estimation of CAI: learn a forward model from data and compare the action-conditional transition distribution against the action-averaged one (e.g., via a KL divergence over sampled actions).
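A rough sketch of such an estimator, assuming a forward model that returns a diagonal Gaussian over the next state component; the exact estimator in the paper differs in its details:

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL divergence between two diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def cai_score(forward_model, state, actions):
    """Approximate I(S'^j; A | S=s): average divergence of the action-conditional
    prediction from a (crudely moment-matched) action-averaged prediction."""
    preds = [forward_model(state, a) for a in actions]        # list of (mean, variance)
    mu_bar = np.mean([mu for mu, _ in preds], axis=0)
    var_bar = np.mean([var for _, var in preds], axis=0)
    return float(np.mean([gaussian_kl(mu, var, mu_bar, var_bar) for mu, var in preds]))

# Toy forward model: the action shifts the predicted mean of the next state slightly.
toy_model = lambda s, a: (s + 0.1 * a, np.full_like(s, 0.05))
state = np.zeros(3)
actions = [np.random.uniform(-1.0, 1.0, size=3) for _ in range(16)]
print(cai_score(toy_model, state, actions))  # larger value => the action matters more here
```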

131 Deakin University CRICOS Provider Code: 00113B Using CAI to Improve Exploration in RL CAI as an intrinsic reward. Active exploration with CAI. CAI experience replay. Experiments on 3 environments: FetchPush, FetchPickAndPlace, FetchRotTable. The goal is the coordinates the object must reach. Baseline RL algorithm: DDPG + HER. Figure 5: FetchPickAndPlace environment. Figure 6: FetchRotTable environment.

132 Deakin University CRICOS Provider Code: 00113B CAI as Intrinsic Reward Use CAI as a reward signal, either on its own or combined with the task reward. Figure 7: Bonus reward improves performance on FetchPickAndPlace.

133 Deakin University CRICOS Provider Code: 00113B Active Exploration with CAI Replace random exploration with causal exploration: choose the action with the highest contribution to the CAI calculation. Figure 8: Performance of active exploration in FetchPickAndPlace depending on the fraction of exploratory actions chosen actively from a total of 30% (epsilon) exploratory actions. Figure 9: Experiment comparing exploration strategies on FetchPickAndPlace; the combination of active exploration and reward bonus yields the largest sample efficiency.

134 Deakin University CRICOS Provider Code: 00113B CAI Experience Replay Choose episodes for replay from the replay buffer guided by a causal (inverse) ranking: each of the M episodes in the buffer (each of length T) is scored by its causal influence, and the probability of sampling any state from episode i is derived from the episode's rank. Figure 10: Comparison of CAI-P with baselines (energy-based method with privileged information (EBP), prioritized experience replay (PER), and HER without prioritization).
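A sketch of rank-based episode prioritization under the assumption that each episode is scored by its summed CAI; the exact weighting used in the paper may differ:

```python
import numpy as np

def episode_sampling_probs(cai_scores_per_episode) -> np.ndarray:
    """Episodes with higher total causal influence get higher replay probability."""
    scores = np.asarray(cai_scores_per_episode, dtype=float)
    ranks = scores.argsort().argsort() + 1          # rank 1 = lowest influence
    return ranks / ranks.sum()

episode_scores = [0.2, 1.5, 0.7, 3.1]               # e.g. sum of per-step CAI per episode
probs = episode_sampling_probs(episode_scores)
sampled_episode = np.random.choice(len(episode_scores), p=probs)
```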

135 Tutorial Outline Part A: Reinforcement Learning Fundamentals and Exploration Inefficiency (30 minutes) Welcome and Introduction Reinforcement Learning Basics Exploring Challenges in Deep RL QA and Demo Part B: Surprise and Novelty (110 minutes, including a 20-minute break) Principles and Frameworks Deliberate Memory for Surprise-driven Exploration Forward dynamics prediction Advanced dynamics-based surprises Ensemble and disagreement Break RAM-like Memory for Novelty-based Exploration Replay Memory Novelty-based Replay Performance-based Replay QA and Demo Part C: Advanced Topics (60 minutes) Language-guided exploration Causal discovery for exploration Statistical approaches Deep learning approaches👈 [We are here] Closing Remarks QA and Demo

136 Deakin University CRICOS Provider Code: 00113B Causality-driven hierarchical structure discovery for reinforcement learning. Hu, X., Zhang, R., Tang, K., Guo, J., Yi, Q., Chen, R., ... & Chen, Y. (2022). Causality-driven hierarchical structure discovery for reinforcement learning. Advances in Neural Information Processing Systems, 35, 20064-20076.

137 Deakin University CRICOS Provider Code: 00113B Structural Causal Representation Learning Environment with multiple objects. Ex: having wood and stone lets you make an axe. How do we measure causality between these objects? Model the SCM of the objects between adjacent timesteps. Figure 11: Example of an environment with multiple objects and its causal graph.

138 Deakin University CRICOS Provider Code: 00113B Structural Causal Representation Learning (cont.) Simpler case with 4 objects A, B, C, D, where A is the object of interest. We need a forward/transition model, parameterized by network weights. We also need a masking function (otherwise we do not know which objects affect A), parameterized by one gate per object, i.e., M gates where M is the number of objects. Figure 12: Example of SCM representation learning (with object of interest A).

139 Deakin University CRICOS Provider Code: 00113B Structural Causal Representation Learning (cont.) Iterative process: fix one set of parameters while optimizing the other. After optimization finishes, extract an edge from object j to A when the corresponding mask gate exceeds a threshold.
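A schematic of one round of this alternating optimization, assuming a simple sigmoid-gate mask and a linear forward model for illustration; the actual parameterization in the paper is different:

```python
import torch
import torch.nn as nn

M, dim = 4, 8                                    # 4 objects (A, B, C, D), feature size 8
forward_model = nn.Linear(M * dim, dim)          # predicts the next state of object A
mask_logits = nn.Parameter(torch.zeros(M))       # one gate per candidate parent object

def make_batch():
    objects = torch.randn(256, M, dim)
    next_A = objects[:, 0] + 0.5 * objects[:, 1]  # toy dynamics: only A and B affect A
    return objects, next_A

def predict_next_A(objects, use_mask=True):
    gates = torch.sigmoid(mask_logits).view(1, M, 1) if use_mask else 1.0
    return forward_model((objects * gates).flatten(1))

# Step 1: fix the mask (fully open), optimize the forward model.
opt_model = torch.optim.Adam(forward_model.parameters(), lr=1e-2)
for _ in range(500):
    objects, next_A = make_batch()
    loss = (predict_next_A(objects, use_mask=False) - next_A).pow(2).mean()
    opt_model.zero_grad(); loss.backward(); opt_model.step()

# Step 2: fix the forward model, optimize the mask with a sparsity penalty.
opt_mask = torch.optim.Adam([mask_logits], lr=1e-1)
for _ in range(500):
    objects, next_A = make_batch()
    loss = (predict_next_A(objects) - next_A).pow(2).mean() \
           + 0.01 * torch.sigmoid(mask_logits).sum()
    opt_mask.zero_grad(); loss.backward(); opt_mask.step()

# Extract edges: keep object j -> A if its gate exceeds a threshold.
print((torch.sigmoid(mask_logits) > 0.5).tolist())  # ideally [True, True, False, False]
```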

140 Deakin University CRICOS Provider Code: 00113B CDHRL Framework Figure 14: CDHRL framework.

141 Deakin University CRICOS Provider Code: 00113B Hierarchical Causal Subgoal Training Whenever new subgoals are added, train a policy for them. Decide whether a subgoal is reachable from the current state with the current policy; if it is reachable within a certain number of timesteps, add it to the subgoal set. Figure 13: Example of a subgoal hierarchy given the causal graph.

142 Deakin University CRICOS Provider Code: 00113B Figure 16: Results on Minigrid-2d (left) and Eden (right). Figure 15: Environments Minigrid-2d (left) and Eden (right). An upper controller policy is trained to select subgoals from the current subgoal set and maximize the task reward. The upper controller is a multi-level DQN with HER.

143 Deakin University CRICOS Provider Code: 00113B Disentangling causal effects for hierarchical reinforcement learning. Corcoll, O., & Vicente, R. (2020). Disentangling causal effects for hierarchical reinforcement learning. arXiv preprint arXiv:2010.01351.

144 Deakin University CRICOS Provider Code: 00113B Controlled Effect Disentanglement The total effect, i.e., the change in environment state, combines dynamic effects and controllable effects: is the next state an outcome of the action, or did it happen by accident? We care about controllable effects. The decomposition is based on the Average Treatment Effect (ATE): the "normality" is the effect expected on average over actions, and the controllable effect is the total effect minus this normality. Figure 17: The relationship between total effects, dynamic effects, and controllable effects.
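A small sketch of this decomposition, with a toy "total effect" function invented for illustration:

```python
import numpy as np

def controllable_effect(total_effect_fn, state, action, action_space) -> np.ndarray:
    """Controllable effect = total effect of the taken action minus the 'normality'
    (the effect expected on average over the possible actions)."""
    total = total_effect_fn(state, action)
    normality = np.mean([total_effect_fn(state, a) for a in action_space], axis=0)
    return total - normality

# Toy world: action 1 pushes the x-coordinate, action 0 does nothing; gravity always
# pulls y down regardless of the action (a dynamic, uncontrollable effect).
def toy_total_effect(state, action):
    return np.array([1.0 if action == 1 else 0.0, -0.5])

print(controllable_effect(toy_total_effect, np.zeros(2), action=1, action_space=[0, 1]))
# -> approximately [0.5, 0.0]: the push is credited to the agent, the gravity term cancels out.
```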

145 Deakin University CRICOS Provider Code: 00113B We cannot calculate the total effect for every action, so a neural network is used to estimate it, learning a vector representation of the effect. Figure 18: Total-effects modelling architecture.

146 Deakin University CRICOS Provider Code: 00113B Exploration with controllable effects as goals: an effect-sampling policy (chooses a controllable effect as a subgoal), an action-taking policy (acts to realize it), and a model of the distribution of effects. Figure 19: Components of causal effects for hierarchical reinforcement learning.

147 Deakin University CRICOS Provider Code: 00113B Controllable Effect Distribution Learning Train a Variational Autoencoder to model the distribution of controllable effects and approximate the controllable effect. Figure 20: VAE architecture to learn the effect distribution.

148 Deakin University CRICOS Provider Code: 00113B Training to Select Goals and Reach Goals Train the effect-selection policy with DQN and use it to select a sub-effect; train the action policy with DQN to reach the selected sub-effect. Figure 21: Architecture for learning to select effects as subgoals. Figure 22: Architecture for learning to select actions to reach subgoals.

149 Deakin University CRICOS Provider Code: 00113B 3 task levels: Task T: go to the target location. Task BT: go to the target location while carrying a ball. Task CBT: pick up the ball, put it in the chest, and go to the target. Figure 23: Comparison with a DQN baseline on the 3 tasks; CEHRL can learn the complex task, while DQN cannot. Figure 24: Random-effect vs random-action exploration.

150 Deakin University CRICOS Provider Code: 00113B End of Part C QA Demo Generated by DALL-E 3