

Monte Carlo Methods
Dr. Surya Prakash
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Indore, Indore-453552, INDIA
E-mail: [email protected]

Quick Recap
Model-based methods:
–Policy iteration algorithm
•Iterative policy evaluation
•Policy improvement
–Value iteration algorithm
•Iterative policy evaluation + Policy improvement
These methods assume the availability of a model of the environment
–Markov decision process (MDP)

Introduction
In previous techniques such as policy iteration and value iteration, we have assumed that the agent has complete knowledge of the environment
Knowledge of the model is available in the form of an MDP
–Reward and transition dynamics
This is a very strong assumption and is not true in real-life situations

Introduction
How to find a policy when the environment model is not available?
In such situations, we need to use model-free reinforcement learning
Monte Carlo based methods are one class of methods that can be used to solve RL problems in a model-free environment

Introduction
In general, Monte Carlo methods (or Monte Carlo experiments)
–are a broad class of computational algorithms that use repeated random sampling to obtain numerical results (like probabilities, expected values)
–they are often used in physical and mathematical problems where it is difficult to derive probabilities and expected values using basic principles and concepts

Introduction
Example: Application of Monte Carlo methods
Probability of getting an outcome (1, 2, 3, 4, 5 or 6) with a symmetric die
–Simple to solve
–Solution: 1/6
Probability of getting an outcome with an asymmetric die
–we cannot use the symmetry argument to find the probability of the outcomes
–Solution: Monte Carlo methods
•Perform an experiment: toss the die a large number of times, count the appearances of each outcome, and give the probability as Pr(i) = n_i / n (a small simulation sketch follows below)
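As a concrete illustration (not part of the original slide), the following is a minimal Python sketch of this experiment. The weights of the hypothetical asymmetric die are assumptions used only to generate samples; the estimator itself sees only the observed tosses.

import random
from collections import Counter

def estimate_die_probabilities(weights, n_tosses=100_000):
    """Estimate Pr(i) = n_i / n for a (possibly asymmetric) die by repeated sampling."""
    faces = list(range(1, len(weights) + 1))
    # Simulate the experiment: toss the die n_tosses times.
    tosses = random.choices(faces, weights=weights, k=n_tosses)
    counts = Counter(tosses)
    # The relative frequency of each face approximates its probability.
    return {face: counts[face] / n_tosses for face in faces}

# Hypothetical asymmetric die: face 6 is twice as likely as the other faces.
print(estimate_die_probabilities([1, 1, 1, 1, 1, 2]))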

Introduction
Unlike model-based techniques, in the case of Monte Carlo methods:
–We assume that we do not have knowledge of the state to next-state transition dynamics given actions, that is, p(s', r | s, a).
–So here, we estimate the value function V(s) or the Q function Q(s, a) based on experience.
–We then use these estimates to find optimal policies.

Introduction
Monte Carlo methods require only experience, that is:
–they sample states, actions, and rewards while interacting with the environment.
–they are a way to solve RL problems based on averaging sample returns.

Introduction
Since we are going to average returns, we focus on Monte Carlo methods for episodic tasks
–if we had a continuing task, it would be difficult to compute the average
–once an episode ends, the value estimates and policies change.

Introduction
In model-based learning, we used the Bellman equation (written out below)
–We could use this because we had information about how a state s transitions into the next state s', since we had a model of the environment.
–Here, we had the knowledge of all the transitions (s → s'), so we could update a state value simply by computing it.
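In standard notation (added here for completeness rather than taken from the slide image), the Bellman expectation equation for the state-value function is

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right]

and it can be evaluated directly only because the model p(s', r | s, a) is known.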

Introduction
In model-free learning:
–we do not have the transition function p(s', r | s, a).
–so, we update the states by averaging the returns we experience while traveling through those states.
–so here, we have to actually explore, starting from state s, and see
•what the next state and action look like from experience (i.e., sample a state and action)
•and update the value of that state s by averaging the results as we are exploring.

How does sampling returns differ from before?
Figure: Sampling returns (left) vs. the backup diagram for v_π (right).

How does sampling returns differ from before?
In sampling returns:
–we update the value of state s based on samples of episodes going through the state (left image).
In comparison, in the backup diagram:
–first, we check one step ahead to all of the next states s', and
–use that to update state s.

Learning State Value function in Monte Carlo Prediction
In model-based techniques
–we have seen that the value of a state is the expected return starting from that state and then following a particular policy
In model-free techniques
–An easy way to estimate it based on experience would be to average the returns we observe after visiting a state (expressed formally below).
–As we interact with the environment more and observe more returns, the average should converge to the expected value.
–That is the idea behind all Monte Carlo methods.
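Written out in standard notation (an addition here rather than a reproduction of the slide), the averaging idea is

V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i(s), \qquad V(s) \to v_\pi(s) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] \ \text{as}\ N(s) \to \infty

where N(s) counts the visits to state s across the observed episodes and G_i(s) is the return observed after the i-th such visit.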

Learning State Value function in Monte Carlo Prediction
Suppose we want to estimate the value of a state, V_π(s)
–V_π(s) is the value of a state under policy π, given a set of episodes obtained by following policy π and passing through state s.
–each time we visit state s in an episode is called a "visit to s"
–we may visit state s many times in a single episode
–let us call the first time we visit s in an episode the "first visit to s".

Learning State Value function in Monte Carlo Prediction
Now we have two types of Monte Carlo methods to compute the value of a state:
–Considering the first visit to s
–Considering every visit to s
In first-visit MC:
–the first-visit MC method estimates V_π(s) by averaging just the returns following first visits to s in a set of episodes
In every-visit MC:
–the every-visit MC method estimates V_π(s) as the average of the returns following all visits to s in a set of episodes

Learning State Value function in Monte Carlo Prediction
Figure: An episode, represented by a red arrow. State S_6 is visited two times. First visit to S_6: at time instant t=2 (after visiting state S_5). Next visit to S_6: at time instant t=8 (after visiting state S_10).

Monte Carlo Prediction Algorithm for Learning V(s)
Algorithm: First-visit MC method for estimating V_π(s); a Python sketch follows below
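The algorithm box itself is an image in the original deck, so the following is a minimal Python sketch of first-visit MC prediction under the assumptions noted in the comments. The function generate_episode is hypothetical: it is assumed to return the list of (state, reward) pairs observed while following policy π.

from collections import defaultdict

def first_visit_mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate V_pi(s) by averaging the returns that follow the first visit to s.

    generate_episode() is assumed to return a list of (state, reward) pairs
    [(S_0, R_1), (S_1, R_2), ...] produced by following policy pi.
    """
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Work backwards so G accumulates the discounted return from time t.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            # First-visit check: only use the return if s does not appear earlier.
            if s not in states[:t]:
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V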

Monte Carlo Estimation of Action Values
As we have seen, if we have a model of the environment, it is quite easy to determine the policy from the state values
–we look one step ahead to see which state gives the best combination of reward and next state
But if we do not have a model of the environment, state values are not enough
–In that case, it is useful to estimate action values (the values of different actions in a state) rather than state values.

Monte Carlo Estimation of Action Values
Thus, the main goal of MC methods is to estimate the optimal action values q*.
To obtain q*, we first look at policy evaluation for action values.
This means that we are going to estimate q_π(s, a), the expected return when we start in state s, take action a, and then follow policy π (defined formally below).
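In symbols (added here; the slide states it only in words), the quantity being estimated is

q_\pi(s, a) = \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, \; A_t = a \right]

i.e. the expected return G_t when starting in state s, taking action a, and following π thereafter.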

Monte Carlo Estimation of Action Values
This is similar to what we discussed for state values (V_π),
–Here, we are talking about visiting a state-action pair, rather than just a state.
More specifically:
–a single state may have several actions.
–so by visiting a state we have several options (that is, several actions we can take).
–when we talk about a state-action pair, we are always talking about taking that specific action in that specific state.

Monte Carlo Estimation of Action Values
Now we have two types of Monte Carlo methods to compute the value of a state-action pair:
–Considering the first visit to (s, a)
–Considering every visit to (s, a)
In the first-visit MC method
–the first-visit MC method estimates the value of a state-action pair as the average of the returns following the first time in each episode that the state was visited and the action was selected.
In the every-visit MC method
–the every-visit MC method estimates the value of a state-action pair as the average of the returns that have followed visits to the state in which the action was selected.
It is shown that these methods converge quadratically to the true expected values as the number of visits to each state-action pair approaches infinity.

Monte Carlo Estimation of Action Values
Issues
–the only problem is that many state-action pairs may never be visited.
For example:
–if we have a deterministic policy, we shall only get one action per state (the one that the policy favors).
–And hence, we shall only observe returns for one action.

Monte Carlo Estimation of Action Values
This is the general problem of maintaining exploration
–for policy evaluation to work, we need to make sure that we visit all actions in every state.
–one way to go about this is to specify that every episode will start in some state-action pair, and that every pair has a probability > 0 of being selected as the start pair.
–this is called the assumption of exploring starts.

Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Monte Carlo ES: A Monte Carlo control algorithm assuming exploring starts (first-visit MC method); a Python sketch follows below
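The Monte Carlo ES pseudocode is shown as an image in the slides, so the following is a minimal Python sketch of the same idea under stated assumptions. generate_episode_from is hypothetical: it is assumed to run one episode that starts with the given state-action pair and then follows the current policy.

import random
from collections import defaultdict

def monte_carlo_es(states, actions, generate_episode_from, num_episodes, gamma=1.0):
    """Monte Carlo ES (exploring starts), first-visit variant.

    generate_episode_from(s0, a0, policy) is assumed (hypothetical interface) to
    return a list of (state, action, reward) triples [(S_0, A_0, R_1), ...]
    obtained by starting the episode with (s0, a0) and following `policy` afterwards.
    """
    Q = defaultdict(float)                 # action-value estimates Q[(s, a)]
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}   # arbitrary initial policy

    for _ in range(num_episodes):
        # Exploring starts: every (s, a) pair has probability > 0 of starting an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode_from(s0, a0, policy)

        G = 0.0
        pairs = [(s, a) for s, a, _ in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r                 # return following time t
            if (s, a) not in pairs[:t]:       # first-visit check for the pair (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Policy improvement: act greedily w.r.t. the updated estimates.
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return Q, policy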

Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Monte Carlo ES: A Monte Carlo control algorithm assuming exploring starts – an elaborated version (first-visit MC method). The highlighted check is there because we are using the first-visit MC method.

Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Figure: An episode, represented by a red arrow. State S_6 is visited two times. First visit to S_6: at time instant t=2 (after visiting state S_5). Next visit to S_6: at time instant t=8 (after visiting state S_10).
Image source: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/

Monte Carlo Estimation of Action Values
But this assumption does not always work.
–the assumption of exploring starts
For example,
–if we learn directly from interaction with the environment, starting conditions are not very useful.
A more common way to go about it is to only consider stochastic policies in which the probability of every action in every state is not 0.

Monte Carlo Control
We now look at how our MC estimation can be used in control,
–that is, to approximate optimal policies.
The idea is to follow generalized policy iteration (GPI)
–Here, we will maintain an approximate policy and an approximate value function.
We continuously alter the value function to be a better approximation for the policy, and the policy is continuously improved (similar to previous techniques).

Monte Carlo Control
Figure: Generalized Policy Iteration (GPI)

Monte Carlo Control
The policy evaluation part is done exactly as in previous techniques,
–the only change is that here we are evaluating state-action pairs, rather than states.
The policy improvement part is done by taking greedy actions in each state.
So, for any action-value function q, and for every state s, the greedy policy chooses the action with the maximal action value:
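In standard notation (added here; the expression itself is not reproduced in the extracted slide text), the greedy policy is

\pi(s) \doteq \arg\max_{a} q(s, a)

with ties broken arbitrarily.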

Monte Carlo Control without Exploring Starts
How can we avoid the unlikely assumption of exploring starts?
The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them.
There are two approaches to ensuring this:
–on-policy methods, and
–off-policy methods.

Monte Carlo Control without Exploring Starts
On-policy methods:
–these methods attempt to evaluate or improve the policy that is used to make decisions.
–so here, we try to evaluate or improve the policy that we have.
Off-policy methods:
–we have two policies, and we try to evaluate or improve one of them while using the other to generate the behaviour (the data).

Monte Carlo Control without Exploring Starts
Right now, we focus on an on-policy method that does not use the assumption of exploring starts.
In on-policy control methods, the policy is generally soft
–This means π(a|s) > 0 for all s ∈ S and all a ∈ A(s).
There are many possible variations of on-policy methods.
–One possibility is to gradually shift the policy toward a deterministic optimal policy (there are many ways to do this).
Here we use ε-greedy policies, which most of the time choose an action that has the maximal estimated action value, but with probability ε instead select an action at random.

Monte Carlo Control without Exploring Starts
ε-Greedy Policies
–here, we most of the time choose an action that has the maximal estimated action value, but with probability ε we instead select an action at random (see the selection sketch below)
–that is, each non-greedy action is given the minimal probability of selection, ε / |A(s)|, and the remaining bulk of the probability, 1 − ε + ε / |A(s)|, is given to the greedy action.
–The ε-greedy policies are examples of ε-soft policies, defined as policies for which π(a|s) ≥ ε / |A(s)| for all states and actions, for some ε > 0
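For illustration (not from the slide), a minimal Python sketch of ε-greedy action selection with exactly these probabilities; Q is assumed to be a dictionary of action-value estimates keyed by (state, action):

import random

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    """Pick an action ε-greedily w.r.t. the action-value estimates Q[(s, a)].

    Each non-greedy action is chosen with probability ε/|A(s)| and the greedy
    action with probability 1 - ε + ε/|A(s)|.
    """
    if random.random() < epsilon:
        # Explore: choose uniformly at random over A(s).
        return random.choice(actions)
    # Exploit: choose the greedy action.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))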

Monte Carlo Control without Exploring Starts
An ε-soft on-policy Monte Carlo control algorithm (first-visit MC method); a Python sketch follows below
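Since the algorithm box is an image in the slides, here is a minimal Python sketch of on-policy first-visit MC control with an ε-soft (ε-greedy) policy, under the stated assumptions. generate_episode_with is hypothetical: it is assumed to run one episode, choosing actions with the supplied selection function, and to return (state, action, reward) triples.

import random
from collections import defaultdict

def on_policy_first_visit_mc_control(actions, generate_episode_with,
                                     num_episodes, gamma=1.0, epsilon=0.1):
    """On-policy first-visit MC control with an ε-soft (ε-greedy) policy.

    generate_episode_with(select_action) is assumed (hypothetical interface) to
    run one episode, choosing actions via select_action(state), and to return a
    list of (state, action, reward) triples.
    """
    Q = defaultdict(float)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def select_action(s):
        # ε-soft behaviour: with probability ε pick uniformly, else act greedily on Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = generate_episode_with(select_action)
        pairs = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r                 # return following time t
            if (s, a) not in pairs[:t]:       # first-visit check
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q, select_action

In this sketch the policy is represented implicitly by select_action, which is always ε-greedy with respect to the current Q; this plays the role of the explicit policy-update step in the textbook pseudocode.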

Monte Carlo Control without Exploring Starts
An ε-soft on-policy Monte Carlo control algorithm – an elaborated version (first-visit MC method)

References
Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press (5.3 Monte Carlo Control): http://incompleteideas.net/book/ebook/node53.html
Monte Carlo Methods: https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-5-monte-carlo-methods-25067003bb0f
Monte Carlo Methods – An Example: https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
Reinforcement learning: Model-free MC learner with code implementation: https://medium.com/@ngao7/reinforcement-learning-model-free-mc-learner-with-code-implementation-f9f475296dcb
Monte Carlo Methods in Reinforcement Learning – Part 1: On-policy Methods: https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a
Monte Carlo Method for Learning State-Value Functions – First Visit Method – Reinforcement Learning Tutorial: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/

Thank You