Video byte: Introduction to policy-based approaches and policy iteration
Learning outcomes
The learning outcomes of this chapter are:
Apply policy iteration to solve small-scale MDP problems manually and program policy iteration algorithms to solve medium-scale MDP problems automatically.
Discuss the strengths and weaknesses of policy iteration.
Compare and contrast policy iteration to value iteration.
Overview
Video byte: Intuition of policy-based approaches
The other common way that MDPs are solved is using policy iteration – an approach that is similar to value iteration. While value iteration iterates over value functions, policy iteration iterates over policies themselves, creating a strictly improved policy in each iteration (except if the iterated policy is already optimal).
Policy iteration first starts with some (non-optimal) policy, such as a random policy, and then calculates the value of each state of the MDP given that policy – this step is called policy evaluation. It then updates the policy itself for every state by calculating the expected reward of each action applicable from that state – this step is called policy improvement.
The basic idea here is that policy evaluation is easier to compute than value iteration because the set of actions to consider is fixed by the policy that we have so far.
Policy evaluation
Video byte: Model-based policy evaluation
An important concept in policy iteration is policy evaluation, which is an evaluation of the expected reward of a policy.
The expected reward of policy π from state s, V^π(s), is the weighted average of the rewards of the possible state sequences defined by that policy, times their probability given π.
Definition – Policy evaluation
Policy evaluation can be characterised as calculating the value function V^π, defined by the following equation:

$$V^{\pi}(s) = \sum_{s' \in S} P_{\pi(s)}(s' \mid s)\, \big[\, r(s, \pi(s), s') + \gamma\, V^{\pi}(s') \,\big]$$

where V^π(s) = 0 for terminal states.
Note that this is very similar to the Bellman equation, except V^π(s) is not the value of the best action, but instead just the value for π(s), the action that would be chosen in s by the policy π. Note the expression P_{π(s)}(s' | s) instead of max_{a ∈ A(s)}, which means we only evaluate the action that the policy defines.
Once we understand the definition of policy evaluation, the implementation is straightforward. It is the same as value iteration except that we use the policy evaluation equation instead of the Bellman equation.
Algorithm 11 (Policy evaluation)
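The algorithm box itself is not reproduced here, so below is a minimal sketch of iterative policy evaluation in Python. The MDP interface (get_states, get_transitions, get_reward, a discount attribute) and the dictionary-based policy are assumptions made for illustration, not the notes' actual implementation.

```python
# A minimal policy-evaluation sketch (assumed interface, not the course's official code).
# Assumes an mdp object with: get_states(), get_transitions(s, a) -> [(s_next, prob)],
# get_reward(s, a, s_next), and a discount factor mdp.discount.

def policy_evaluation(mdp, policy, theta=0.001):
    """Iteratively estimate V^pi for a fixed policy until values change by less than theta."""
    values = {s: 0.0 for s in mdp.get_states()}
    while True:
        delta = 0.0
        new_values = {}
        for s in mdp.get_states():
            a = policy[s]  # the single action the policy chooses in s
            v = sum(
                prob * (mdp.get_reward(s, a, s_next) + mdp.discount * values[s_next])
                for (s_next, prob) in mdp.get_transitions(s, a)
            )
            new_values[s] = v
            delta = max(delta, abs(v - values[s]))
        values = new_values
        if delta < theta:
            return values
```

As in value iteration, the loop stops once the largest change across all states falls below the threshold theta.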
Monte Carlo Methods
Dr. Surya Prakash
Associate Professor
Department of Computer Science & Engineering
Indian Institute of Technology Indore, Indore-453552, INDIA
E-mail: [email protected]
Quick Recap
Model-based methods:
– Policy iteration algorithm
  • Iterative policy evaluation
  • Policy improvement
– Value iteration algorithm
  • Iterative policy evaluation + policy improvement
These methods assume the availability of a model of the environment
– Markov decision process (MDP)
Introduction
In previous techniques such as policy iteration and value iteration, we have assumed that the agent has complete knowledge of the environment
Knowledge of the model is available in the form of an MDP
– Reward and transition dynamics
This is a very strong assumption and is not true in real-life situations
Introduction
How do we find a policy when the environment model is not available?
In such situations, we need to use model-free reinforcement learning
Monte Carlo based methods are one class of methods that can be used to solve RL problems in a model-free environment
Introduction
In general, Monte Carlo methods (or Monte Carlo experiments)
– are a broad class of computational algorithms that use repeated random sampling to obtain numerical results (like probabilities and expected values)
– are often used in physical and mathematical problems where it is difficult to derive probabilities and expected values from basic principles and concepts
Introduction
Example: Application of Monte Carlo methods
Probability of getting an outcome (1, 2, 3, 4, 5 or 6) with a symmetric die
– Simple to solve by symmetry
– Solution: 1/6
Probability of getting an outcome with an asymmetric die
– we cannot use the symmetry argument to find the probability of the outcomes
– Solution: Monte Carlo methods
  • Perform an experiment: toss the die a large number of times, count the appearances of each outcome, and estimate the probability as Pr(i) = nᵢ / n
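As a concrete illustration (not from the slides), here is a small Python sketch that estimates the outcome probabilities of a biased die by repeated sampling; the bias weights are made up for the example.

```python
import random
from collections import Counter

# Hypothetical bias of an asymmetric six-sided die (weights invented for illustration).
faces = [1, 2, 3, 4, 5, 6]
weights = [0.10, 0.15, 0.20, 0.25, 0.20, 0.10]

def estimate_probabilities(n_tosses: int) -> dict:
    """Toss the die n_tosses times and estimate Pr(i) = n_i / n for each face i."""
    counts = Counter(random.choices(faces, weights=weights, k=n_tosses))
    return {face: counts[face] / n_tosses for face in faces}

# The estimates approach the true weights as the number of tosses grows.
print(estimate_probabilities(100_000))
```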
Introduction
Unlike model-based techniques, in the case of Monte Carlo methods
– We assume that we do not have knowledge of the state-to-next-state transition dynamics given actions, that is, p(s', r | s, a).
– So here, we estimate the value function V(s) or the Q function Q(s, a) based on experience.
– We then use these estimates to find optimal policies.
Introduction
Monte Carlo methods require only experience, that is
– they sample states, actions, and rewards while interacting with the environment.
– they are a way to solve RL problems based on averaging sample returns.
Introduction
Since we are going to average returns, we focus on Monte Carlo for episodic tasks
– if we had a continuing task, it would be difficult to compute the average
– only once an episode ends are the value estimates and policies changed.
Introduction
In model-based learning, we used the Bellman equation
– We could use it because we had information about how state "s" transitions into the next state s', since we had a model of the environment.
Here, we had knowledge of all the transitions (s → s'), so we could update a state value just by calculating it.
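The Bellman equation referred to here is not reproduced in the extracted text; for reference, its state-value form under a policy π, in the p(s', r | s, a) notation used above, is:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[\, r + \gamma\, v_\pi(s') \,\big]$$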
Introduction
In model-free learning
– we do not have the transition function p(s', r | s, a).
– so, we update the states by averaging the returns we experience while traveling through those states.
– so here, we have to actually explore, starting from state "s", and see
  • what the next state and action look like from experience (i.e. sample a state and action)
  • and update the value of state "s" by averaging the results as we explore.
How does sampling returns differ from before?
Figure: Sampling returns (left) vs. backup diagram for v_π (right); nodes represent states and actions.
How does sampling returns differ from before?
In sampling returns:
– we update the value of state s based on samples of episodes going through that state (left image).
In comparison, in the backup diagram:
– first, we check one step ahead to all of the next states s', and
– use that to update state s.
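To make "averaging sample returns" concrete, here is a small Python sketch (not from the slides) that computes the discounted return G_t at every time step of a finished episode by accumulating rewards backwards.

```python
def compute_returns(rewards, gamma=0.9):
    """Given the rewards r_1, ..., r_T of one episode, return G_t for each step t.

    G_t = r_{t+1} + gamma * r_{t+2} + ..., accumulated backwards from the end of the episode.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a 4-step episode with made-up rewards.
print(compute_returns([0.0, 0.0, 1.0, 5.0], gamma=0.9))
```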
Learning State Value function in Monte Carlo Prediction
In model-based techniques
– we have seen that the value of a state is the expected return starting from that state and then following a particular policy
In model-free techniques
– an easy way to estimate it from experience is to average the returns we observe after visiting a state.
– as we interact with the environment more and observe more returns, the average should converge to the expected value.
– that is the idea behind all Monte Carlo methods.
Learning State Value function in Monte Carlo Prediction
Suppose we want to estimate the value of a state, V_π(s)
– V_π(s) is the value of a state under policy π, given a set of episodes obtained by following policy π and passing through state s.
– each time we visit state s in an episode is called a "visit to s"
– we may visit state s many times in a single episode
– let us call the first time we visit s in an episode the "first visit to s".
Learning State Value function in Monte Carlo Prediction
Now we have two types of Monte Carlo methods to compute the value of a state:
– Considering the first visit to s
– Considering every visit to s
In first-visit MC:
– the first-visit MC method estimates V_π(s) by averaging just the returns following first visits to s in a set of episodes
In every-visit MC:
– the every-visit MC method estimates V_π(s) as the average of the returns following all visits to s in a set of episodes
Learning State Value function in Monte Carlo Prediction
Figure: An episode represented by a red arrow. State S_6 is visited two times. First visit to S_6: at time instant t=2 (after visiting state S_5). Next visit to S_6: at time instant t=8 (after visiting state S_10).
Monte Carlo Prediction Algorithm for Learning V(s)
Figure: Algorithm for the first-visit MC method for estimating V_π(s)
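The algorithm box is an image in the slides; as a rough Python sketch of first-visit MC prediction, it might look like the following. The generate_episode(policy) helper, assumed to return a list of (state, reward) pairs for one episode, is an assumption for illustration and is not defined in the slides.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=0.9):
    """Estimate V_pi(s) by averaging returns that follow the first visit to s in each episode.

    generate_episode(policy) is assumed to return [(s_0, r_1), (s_1, r_2), ...] for one episode.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    value = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        states = [s for (s, _) in episode]
        g = 0.0
        # Walk the episode backwards, accumulating the discounted return.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            # Only record the return for the first occurrence of s in this episode.
            if s not in states[:t]:
                returns_sum[s] += g
                returns_count[s] += 1
                value[s] = returns_sum[s] / returns_count[s]
    return value
```

An every-visit variant would simply drop the first-occurrence check and record the return on every visit to s.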
Monte Carlo Estimation of Action Values
As we have seen, if we have a model of the environment, it is quite easy to determine the policy from the state values
– we look one step ahead to see which action leads to the best combination of reward and next state
But if we do not have a model of the environment, state values are not enough
– In that case, it is useful to estimate action values (the values of different actions in a state) rather than state values.
Monte Carlo Estimation of Action Values
Thus, the main goal of MC methods is to estimate the optimal action values q∗.
To obtain q∗, we first look at policy evaluation for action values.
This means that we are going to estimate q_π(s, a), the expected return when you start in state s, take an action a, and then follow policy π.
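Written as an equation, in the notation of Sutton and Barto (cited in the references), this definition is:

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s,\; A_t = a \,\right] = \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s,\; A_t = a \right]$$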
Monte Carlo Estimation of Action Values
This is similar to what we discussed for state values (V_π), except that
– here, we are talking about visiting a state-action pair, rather than just a state.
More specifically
– a single state may have several actions.
– so by visiting a state we have several options (that is, several actions we can take).
– when we talk about a state-action pair, we are always talking about taking that specific action in that specific state.
Monte Carlo Estimation of Action Values
Again we have two types of Monte Carlo methods, now to compute the value of a state-action pair:
– Considering the first visit to (s, a)
– Considering every visit to (s, a)
In the first-visit MC method
– the value of a state-action pair is estimated as the average of the returns following the first time in each episode that the state was visited and the action was selected.
In the every-visit MC method
– the value of a state-action pair is estimated as the average of the returns that have followed all visits to the state in which the action was selected.
It is shown that these methods converge quadratically to the true expected values as the number of visits to each state-action pair approaches infinity.
Monte Carlo Estimation of Action Values
Issues
– the only problem is that many state-action pairs may never be visited.
For example:
– if we have a deterministic policy, we shall only ever take one action per state (the one that the policy favors).
– and hence, we shall only observe returns for one action.
Monte Carlo Estimation of Action Values
This is the general problem of maintaining exploration
– for policy evaluation to work, we need to make sure that we visit all actions in every state.
– one way to go about this is to specify that every episode starts in some state-action pair, and that every pair has a probability > 0 of being selected as the start pair.
– this is called the assumption of exploring starts.
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Figure: Monte Carlo ES, a Monte Carlo control algorithm assuming exploring starts (first-visit MC method)
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Figure: Monte Carlo ES, a Monte Carlo control algorithm assuming exploring starts – an elaborated version (first-visit MC method)
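Since the algorithm boxes above are images, here is a rough Python sketch of Monte Carlo ES (first-visit MC control with exploring starts). The environment interface (random_state_action(), actions(s), and generate_episode_from(s, a, policy) returning (state, action, reward) triples, with a random fallback for states not yet in the policy) is assumed for illustration and is not part of the slides.

```python
from collections import defaultdict

def monte_carlo_es(env, num_episodes, gamma=0.9):
    """Monte Carlo control with exploring starts (first-visit MC), against an assumed env API."""
    q = defaultdict(float)                 # q[(s, a)] estimates
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {}                            # greedy policy, filled in as states are encountered

    for _ in range(num_episodes):
        # Exploring start: every (state, action) pair has non-zero probability of starting an episode.
        s0, a0 = env.random_state_action()
        # The assumed env follows `policy` after the first step, choosing randomly in unseen states.
        episode = env.generate_episode_from(s0, a0, policy)   # [(s, a, r), ...]

        g = 0.0
        visited = [(s, a) for (s, a, _) in episode]
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if (s, a) not in visited[:t]:                     # first-visit check
                returns_sum[(s, a)] += g
                returns_count[(s, a)] += 1
                q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Policy improvement: act greedily with respect to the current q estimates.
                policy[s] = max(env.actions(s), key=lambda act: q[(s, act)])
    return policy, q
```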
Monte Carlo Estimation of Action Values – Algorithm Exploring Starts
Figure: An episode represented by a red arrow. State S_6 is visited two times. First visit to S_6: at time instant t=2 (after visiting state S_5). Next visit to S_6: at time instant t=8 (after visiting state S_10).
Image source: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/
Monte Carlo Estimation of Action Values
But this assumption does not always work.
– the assumption of exploring starts
For example,
– if we learn directly from interaction with the environment, such starting conditions are not very useful.
A more common way to go about it is to only consider stochastic policies in which the probability of every action in every state is non-zero.
Monte Carlo Control
We now look at how our MC estimation can be used in control.
– that is, to approximate optimal policies
The idea is to follow generalized policy iteration (GPI)
– Here, we will maintain an approximate policy and an approximate value function.
We continuously alter the value function to be a better approximation for the policy, and the policy is continuously improved (similar to previous techniques)
Monte Carlo Control
Figure: Generalized Policy Iteration (GPI)
Monte Carlo Control
The policy evaluation part is done exactly as in previous techniques,
– the only change is that here we are evaluating state-action pairs, rather than states.
The policy improvement part is done by taking greedy actions in each state.
So, for any action-value function q, and for every state s, the greedy policy chooses the action with the maximal action-value: π(s) = argmax_a q(s, a)
Monte Carlo Control without Exploring Starts
How can we avoid the unlikely assumption of exploring starts?
The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them.
There are two approaches to ensuring this
– on-policy methods, and
– off-policy methods.
Monte Carlo Control without Exploring Starts
On-policy methods:
– these methods attempt to evaluate or improve the policy that is used to make decisions.
– so here, we try to evaluate or improve the policy that we have.
Off-policy methods:
– we have two policies: we try to evaluate or improve one of them, and use the other to direct the agent's behaviour.
Monte Carlo Control without Exploring Starts
For now, we focus on an on-policy method that does not use the assumption of exploring starts.
In on-policy control methods, the policy is generally soft
– This means π(a|s) > 0 for all s ∈ S and all a ∈ A(s)
There are many possible variations on on-policy methods.
– One possibility is to gradually shift the policy toward a deterministic optimal policy (there are many ways to do this).
Here we use ε-greedy policies, which most of the time choose an action that has the maximal estimated action value, but with probability ε instead select an action at random
Monte Carlo Control without Exploring Starts
ε-Greedy Policies
– here, most of the time we choose an action that has the maximal estimated action value, but with probability ε we instead select an action at random
– that is, all non-greedy actions are given the minimal probability of selection, ε/|A(s)|, and the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action.
– The ε-greedy policies are examples of ε-soft policies, defined as policies for which π(a|s) ≥ ε/|A(s)| for all states and actions, for some ε > 0
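As a small illustration (not from the slides), here is a Python sketch of ε-greedy action selection that yields exactly these probabilities; the Q-table format and example values are assumptions for the example.

```python
import random

def epsilon_greedy_action(q, state, actions, epsilon=0.1):
    """Pick an action: the greedy action gets probability 1 - eps + eps/|A(s)|, every other action eps/|A(s)|.

    q is assumed to be a dict mapping (state, action) to an estimated action value.
    """
    if random.random() < epsilon:
        return random.choice(actions)                               # explore: uniform over all actions
    return max(actions, key=lambda a: q.get((state, a), 0.0))       # exploit: greedy action

# Example with a made-up Q-table for one state.
q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
print(epsilon_greedy_action(q, "s0", ["left", "right"], epsilon=0.1))
```

Because the uniform exploration step can also pick the greedy action, the greedy action's total probability is 1 − ε + ε/|A(s)|, matching the formula above.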
Monte Carlo Control without Exploring Starts
Figure: An ε-soft on-policy Monte Carlo control algorithm (first-visit MC method)
Monte Carlo Control without Exploring Starts
Figure: An ε-soft on-policy Monte Carlo control algorithm – an elaborated version (first-visit MC method)
References
Sutton and Barto, Reinforcement Learning: An Introduction, MIT Press (5.3 Monte Carlo Control): http://incompleteideas.net/book/ebook/node53.html
Monte Carlo Methods: https://towardsdatascience.com/introduction-to-reinforcement-learning-rl-part-5-monte-carlo-methods-25067003bb0f
Monte Carlo Methods – An Example: https://www.analyticsvidhya.com/blog/2018/11/reinforcement-learning-introduction-monte-carlo-learning-openai-gym/
Reinforcement learning: Model-free MC learner with code implementation: https://medium.com/@ngao7/reinforcement-learning-model-free-mc-learner-with-code-implementation-f9f475296dcb
Monte Carlo Methods in Reinforcement Learning — Part 1 on-policy Methods: https://medium.com/analytics-vidhya/monte-carlo-methods-in-reinforcement-learning-part-1-on-policy-methods-1f004d59686a
Monte Carlo Method for Learning State-Value Functions – First Visit Method – Reinforcement Learning Tutorial: https://aleksandarhaber.com/monte-carlo-method-for-learning-state-value-functions-first-visit-method-reinforcement-learning-tutorial/