Reinforcement Learning: An Introduction
R. Sutton and A. Barto
2018
The Book
Reinforcement Learning: An Introduction
R. Sutton and A. Barto
2nd edition!
Available at
http://incompleteideas.net/book/the-book-2nd.html
Outline
1 Introduction
2 Multi-armed Bandits
3 Finite Markov Decision Processes
4 Dynamic Programming
5 Monte Carlo Methods
6 Temporal-Difference Learning
7 n-step Bootstrapping
8 Planning and Learning with Tabular Methods
9 On-policy Prediction with Approximation
10 On-policy Control with Approximation
11 Off-policy Methods with Approximation
12 Eligibility Traces
13 Policy Gradient Methods
1 Introduction
1.1 Reinforcement Learning
1.2 Examples
1.3 Elements of Reinforcement Learning
1.4 Limitations and Scope
1.5 An Extended Example: Tic-Tac-Toe
1.5 An Extended Example: Tic-Tac-Toe - Figure 1.1
2 Multi-armed Bandits
2.1 A k-armed Bandit Problem
2.2 Action-value Methods
2.3 The 10-armed Testbed
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence-Bound Action Selection
2.8 Gradient Bandit Algorithms
2.9 Associative Search (Contextual Bandits)
2.1 A k-armed Bandit Problem
k-armed bandit:
no state ($\mathcal{S} = \emptyset$),
$q_*(a) = \mathbb{E}[R_t \mid A_t = a] = r_a$
Optimal policy: $a_* = \operatorname{argmax}_a r_a$
Estimate $Q_t(a) \approx \mathbb{E}[R_t \mid A_t = a]$.
Next step:
Exploitation: $A_t = \operatorname{argmax}_a Q_t(a)$, bet on the current winner.
Exploration: $A_t \neq \operatorname{argmax}_a Q_t(a)$, verify that no other arm is better.
Conflict between exploration and exploitation.
Theoretical results under strong assumptions.
2.2 Action-value Methods
Action-value estimate:
$Q_t(a) = \frac{\sum_{i=1}^{t-1} 1_{A_i = a} R_i}{\sum_{i=1}^{t-1} 1_{A_i = a}}$
Sample average that converges to $q_*(a)$ provided $\sum_{i=1}^{t-1} 1_{A_i = a} \to \infty$.
Greedy action:
$A_t = \operatorname{argmax}_a Q_t(a)$
$\varepsilon$-greedy action:
$A_t = \operatorname{argmax}_a Q_t(a)$ with probability $1 - \varepsilon$, and $A_t = A'$ with $A'$ uniform over the arms otherwise.
$\varepsilon$-greedy forces exploration and guarantees that $Q_t(a) \to q_*(a)$.
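A minimal sketch of the $\varepsilon$-greedy action-value method with incremental sample-average estimates, run on a hypothetical Gaussian testbed in the spirit of Section 2.3 (the function name and testbed setup are illustrative, not from the slides):

```python
import numpy as np

def epsilon_greedy_bandit(k=10, steps=1000, eps=0.1, seed=0):
    """epsilon-greedy with incremental sample-average estimates Q_t(a)."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)    # true action values, as in the 10-armed testbed
    Q = np.zeros(k)                           # estimates Q_t(a)
    N = np.zeros(k)                           # visit counts N_t(a)
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))          # explore: uniform random arm
        else:
            a = int(np.argmax(Q))             # exploit: greedy arm
        r = rng.normal(q_star[a], 1.0)        # R_t | A_t = a ~ N(q_*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]             # incremental sample average
        rewards[t] = r
    return Q, rewards

Q, rewards = epsilon_greedy_bandit()
```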
2.3 The 10-armed Testbed
$q_*(a) \sim \mathcal{N}(0, 1)$
$R_t \mid A_t = a \sim \mathcal{N}(q_*(a), 1)$
2.3 The 10-armed Testbed
Greedy policy may fail in a stationary context.
Even more risky in a nonstationary one.
2.4 Incremental Implementation
Mean for a single action:
$Q_n = \frac{R_1 + \cdots + R_{n-1}}{n-1}$
Incremental implementation:
$Q_{n+1} = \frac{R_1 + \cdots + R_n}{n} = \frac{1}{n}\big((n-1)Q_n + R_n\big) = Q_n + \frac{1}{n}\big(R_n - Q_n\big)$
General update rule:
NewEstimate = OldEstimate + StepSize (Target - OldEstimate)
where
Target is a noisy estimate of the true target,
StepSize may depend on $t$ and $a$.
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
Nonstationary settings are often present in reinforcement learning.
Incremental update:
$Q_{n+1} = Q_n + \alpha (R_n - Q_n)$
If $\alpha \in (0, 1]$ is constant,
$Q_{n+1} = Q_n + \alpha (R_n - Q_n) = (1-\alpha) Q_n + \alpha R_n = (1-\alpha)^2 Q_{n-1} + \alpha(1-\alpha) R_{n-1} + \alpha R_n = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$
Weighted average with more emphasis on later values.
2.5 Tracking a Nonstationary Problem
Incremental update:
$Q_{n+1} = Q_n + \alpha_n (R_n - Q_n)$
Convergence toward the expectation of $R$ requires some assumptions on $(\alpha_n)$:
$\sum_{n=1}^{\infty} \alpha_n = +\infty \qquad \sum_{n=1}^{\infty} \alpha_n^2 < +\infty$
The first condition guarantees that one can escape any initial condition.
The second condition guarantees that the iterates converge.
If $\alpha_n = \alpha$ is constant, there is no convergence, but the estimate tracks any nonstationarity.
2.6 Optimistic Initial Values
The estimate depends on the initial values (except when $\alpha_1 = 1$).
A way of supplying prior knowledge about the level of rewards expected.
Optimistic initialization leads to exploration at the beginning.
Fails to help in a nonstationary setting.
2.7 Upper-Confidence-Bound Action Selection
$\varepsilon$-greedy fails to discriminate between good/bad actions or certain/uncertain actions.
Upper-Confidence-Bound action selection:
$A_t = \operatorname{argmax}_a \left( Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right)$
Arms with lower value estimates will be selected with decreasing frequency over time.
Bandit proofs are hard to extend to the reinforcement learning setting.
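A short sketch of the UCB selection rule above, assuming the usual convention that every arm is pulled once before the bonus term is used (the helper name is illustrative):

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Upper-Confidence-Bound action selection for arrays Q (estimates) and N (counts)."""
    if np.any(N == 0):
        return int(np.argmin(N))                      # pull each arm at least once
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```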
2.8 Gradient Bandit Algorithms
Numerical preference associated to action $a$: $H_t(a)$.
Induced soft-max policy:
$P(A_t = a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a)$
Natural learning algorithm with update
$H_{t+1}(a) = \begin{cases} H_t(a) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(a)) & \text{if } a = A_t \\ H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a) & \text{otherwise} \end{cases}$
where $\bar{R}_t$ is the average of the rewards received so far, with
$\frac{\partial}{\partial H_t(a)} \ln\left( \frac{e^{H_t(b)}}{\sum_{b'} e^{H_t(b')}} \right) = 1_{a=b} - \frac{e^{H_t(a)}}{\sum_{b'} e^{H_t(b')}} = 1_{a=b} - \pi_t(a)$
Stochastic gradient ascent form:
$H_{t+1}(a) = H_t(a) + \alpha\,(q_*(A_t) - B_t)\,(1_{a=A_t} - \pi_t(a))$
where $B_t$ is a baseline.
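A compact sketch of the gradient bandit update with soft-max preferences and the average reward as baseline (the function name and Gaussian reward model are assumptions for illustration):

```python
import numpy as np

def gradient_bandit(q_star, steps=1000, alpha=0.1, seed=0):
    """Gradient bandit: soft-max over preferences H, baseline = running average reward."""
    rng = np.random.default_rng(seed)
    k = len(q_star)
    H = np.zeros(k)                         # preferences H_t(a)
    r_bar = 0.0                             # baseline R_bar_t
    for t in range(1, steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                      # soft-max policy pi_t(a)
        a = rng.choice(k, p=pi)
        r = rng.normal(q_star[a], 1.0)
        r_bar += (r - r_bar) / t            # incremental average of rewards
        one_hot = np.zeros(k); one_hot[a] = 1.0
        H += alpha * (r - r_bar) * (one_hot - pi)   # preference update for all arms
    return H
```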
2.9 Associative Search (Contextual Bandits)
Associative search: the reward depends on the arm and on the situation.
Often called contextual bandits.
In between bandits and reinforcement learning: the action only impacts the next reward.
3 Finite Markov Decision Processes
3.1 The Agent-Environment Interface
3.2 Goals and Rewards
3.3 Returns and Episodes
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 Policies and Value Functions
3.6 Optimal Policies and Optimal Value Functions
3.7 Optimality and Approximation
3.1 The Agent-Environment Interface
At time step $t \in \mathbb{N}$:
State $S_t \in \mathcal{S}$: representation of the environment
Action $A_t \in \mathcal{A}(S_t)$: action chosen
Reward $R_{t+1} \in \mathcal{R}$: instantaneous reward
New state $S_{t+1}$
Finite MDP:
$\mathcal{S}$, $\mathcal{A}$ and $\mathcal{R}$ are finite.
Dynamics entirely defined by
$P(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a) = p(s', r \mid s, a)$
with state-transition probabilities
$p(s' \mid s, a) = \sum_r p(s', r \mid s, a)$
Expected reward for a state-action pair:
$r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_r r \sum_{s'} p(s', r \mid s, a)$
Expected reward for a state-action-next-state triple:
$r(s, a, s') = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a, S_t = s'] = \sum_r r \,\frac{p(s', r \mid s, a)}{p(s' \mid s, a)}$
3.2 Goals and Rewards
That all of what we mean by goals and purposes can be
well thought of as the maximization of the expected value
of the cumulative sum of a received scalar signal (called
reward).
The reward signal is your way of communicating to the robot
what you want it to achieve, not how you want it achieved.
3.3 Returns and Episodes
Episodic tasks: final time step $T$ and
$G_t = \sum_{t'=t+1}^{T} R_{t'}$
Continuing tasks, undiscounted return:
$G_t = \sum_{t'=t+1}^{+\infty} R_{t'}$ may not exist!
Continuing tasks, discounted return:
$G_t = \sum_{k=0}^{+\infty} \gamma^k R_{t+1+k}$
with $0 \le \gamma < 1$.
3.3 Returns and Episodes
Recursive property:
$G_t = R_{t+1} + \gamma G_{t+1}$
Finiteness: if $|R| \le M$,
$|G_t| \le \begin{cases} (T - t + 1)\, M & \text{if } T < \infty \\ \frac{M}{1 - \gamma} & \text{otherwise} \end{cases}$
3.3 Returns and Episodes - Example 3.4
3.4 Unified Notation for Episodic and Continuing Tasks
Episodic case: several episodes instead of a single trajectory ($S_{t,i}$, ...).
Absorbing state $\tilde{s}$ such that $p(\tilde{s} \mid \tilde{s}, a) = 1$ and $R_t = 0$ when $S_t = \tilde{s}$.
Converts the episodic case into a continuing one.
Alternative notation:
$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$
Undefined if $T = \infty$ and $\gamma = 1$...
3.5 Policies and Value Functions
Policy: $\pi(a \mid s)$
Value function:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$
Action-value function:
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]$
Implicit stationarity assumption on $\pi$!
3.5 Policies and Value Functions
Bellman equation:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$
$= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left( r + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] \right)$
$= \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left( r + \gamma\, v_\pi(s') \right)$
3.5 Policies and Value Functions - Figure 3.2
3.5 Policies and Value Functions - Figure 3.3
3.6 Optimal Policies and Optimal Value Functions
Optimal policy $\pi_*$: $v_{\pi_*}(s) \ge v_\pi(s)$ for all $s$ and all $\pi$ (not necessarily unique).
Optimal state-value function:
$v_*(s) = \max_\pi v_\pi(s)$ (unique)
Optimal state-action value function:
$q_*(s, a) = \max_\pi q_\pi(s, a)$
3.6 Optimal Policies and Optimal Value Functions
Bellman optimality equation for $v_*$:
$v_*(s) = \max_a q_*(s, a) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma v_*(s') \right)$
Bellman optimality equation for $q_*$:
$q_*(s, a) = \mathbb{E}\left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right] = \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma \max_{a'} q_*(s', a') \right)$
3.6 Optimal Policies and Optimal Value Functions - Figure 3.4
3.6 Optimal Policies and Optimal Value Functions - Figure 3.5
3.7 Optimality and Approximation
Very difficult to learn the optimal policy.
Knowing the environment helps but is not sufficient. (Chap. 4)
Computational challenges even in the finite case! (Chap. 5-8)
Need to resort to approximation! (Chap. 9-12)
4.1 Policy Evaluation
Policy evaluation or prediction: compute $v_\pi$ for a given policy $\pi$.
Bellman equation:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left( r + \gamma v_\pi(s') \right)$
Linear system that can be solved!
Bellman iteration:
$v_{k+1}(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \left( r + \gamma v_k(s') \right)$
$v_\pi$ is a fixed point.
Iterative policy evaluation: iterative algorithm that can be proved to converge.
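A sketch of iterative policy evaluation, assuming the dynamics are given as a dictionary p[(s, a)] of (probability, next state, reward) triples and the policy as pi[(s, a)] (these data-structure choices are assumptions for illustration):

```python
def iterative_policy_evaluation(states, actions, p, pi, gamma=0.9, theta=1e-8):
    """Sweep the Bellman expectation update until the value function is stable."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[(s, a)] * sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new          # in-place (asynchronous-style) update
        if delta < theta:
            return V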
4.1 Policy Evaluation - Example 4.1
4.2 Policy Improvement
If $\pi'$ is such that $\forall s,\ q_\pi(s, \pi'(s)) \ge v_\pi(s)$, then $v_{\pi'} \ge v_\pi$.
Sketch of proof:
$v_\pi(s) \le q_\pi(s, \pi'(s)) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)] = \mathbb{E}_{\pi'}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
$\le \mathbb{E}_{\pi'}[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] = \mathbb{E}_{\pi'}[R_{t+1} + \gamma \mathbb{E}[R_{t+2} + \gamma v_\pi(S_{t+2}) \mid S_{t+1}, A_{t+1} = \pi'(S_{t+1})] \mid S_t = s]$
$= \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \,\middle|\, S_t = s \right] \le \mathbb{E}_{\pi'}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \,\middle|\, S_t = s \right] = v_{\pi'}(s)$
Greedy update:
$\pi'(s) = \operatorname{argmax}_a q_\pi(s, a) = \operatorname{argmax}_a \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma v_\pi(s') \right)$
4.3 Policy Iteration
Policy iteration: sequence of policy evaluations and policy improvements
$\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{E} v_{\pi_*}$
In a finite state/action setting, converges in finite time.
4.4 Value Iteration
Policy evaluation can be time consuming.
Value iteration: improve the policy after only one step of policy evaluation.
Bellman iteration:
$v_{k+1}(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a] = \max_a \sum_{s', r} p(s', r \mid s, a) \left( r + \gamma v_k(s') \right)$
The update corresponds to the Bellman optimality equation.
Variations are possible on the number of steps in the policy evaluation.
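A sketch of value iteration with the same (assumed) p[(s, a)] representation used in the policy-evaluation sketch above; the greedy policy is extracted at the end:

```python
def value_iteration(states, actions, p, gamma=0.9, theta=1e-8):
    """Apply the Bellman optimality update until convergence, then act greedily."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    policy = {
        s: max(actions, key=lambda a, s=s: sum(prob * (r + gamma * V[s2])
                                               for prob, s2, r in p[(s, a)]))
        for s in states
    }
    return V, policy
```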
4.4 Value Iteration - Value Iteration
4.4 Value Iteration - Figure 4.3
4.5 Asynchronous Dynamic Programming
Synchronous Dynamic Programming: update all states at
each step.
Asynchronous Dynamic Programming: update only a few
states at each step.
No systematic sweeps of the state set!
Only need to update every state infinitely often!
One possibility is to update the states seen by an agent
experiencing the MDP.
4.6 Generalized Policy Iteration
Policy iteration consists of two simultaneous interacting
processes:
one making a value function consistent with the current policy
(policy evaluation)
one making the policy greedy with respect to the current value
function (policy improvement)
Generalized Policy Iteration: any similar scheme.
Stabilizes only if one reaches the optimal value/policy pair.
4.7 Efficiency of Dynamic Programming
DP is quite efficient: polynomial in $|\mathcal{S}|$ and $|\mathcal{A}|$.
A linear programming alternative is also possible.
Curse of dimensionality: the number of states often grows exponentially with the number of state variables.
Asynchronous DP methods are then preferred.
5 Monte Carlo Methods
5.1 Monte Carlo Prediction
5.2 Monte Carlo Estimation of Action Values
5.3 Monte Carlo Control
5.4 Monte Carlo Control without Exploring Starts
5.5 Off-policy Prediction via Importance Sampling
5.6 Incremental Implementation
5.7 Off-policy Monte Carlo Control
5.8 Discounting-aware Importance Sampling
5.9 Per-decision Importance Sampling
5.1 Monte Carlo Prediction
Estimate $v_\pi(s)$ by the average return obtained by following $\pi$ after passing through $s$.
Two variants:
First-visit: use only the first visit of $s$ in each episode.
Every-visit: use every visit of $s$ in each episode.
First-visit is easier to analyze due to the independence of the episodes.
Every-visit works... but not necessarily better!
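A sketch of first-visit Monte Carlo prediction; the helper generate_episode(), returning a list of (S_t, R_{t+1}) pairs sampled under the policy, is an assumption of this example:

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, n_episodes, gamma=1.0):
    """First-visit Monte Carlo estimation of v_pi by averaging returns."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode()          # [(S_0, R_1), (S_1, R_2), ...]
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)      # index of the first visit of each state
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = gamma * G + episode[t][1]     # G_t = R_{t+1} + gamma * G_{t+1}
            returns[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_cnt[s] += 1
            V[s] = returns_sum[s] / returns_cnt[s]
    return V
```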
5.1 Monte Carlo Prediction - First-Visit Monte Carlo Policy Evaluation
5.1 Monte Carlo Prediction
5.2 Monte Carlo Estimation of Action Values
Without a model, $v_\pi(s)$ is not sufficient to perform a policy improvement step.
Need to estimate $q_\pi(s, a)$ directly.
Issue: requires that every state-action pair is visited.
Impossible with a deterministic policy.
An instance of the problem of maintaining exploration seen with the bandits.
Exploring starts: start each episode from a random state-action pair, in such a way that every state-action pair has a nonzero probability of being selected.
Alternative: impose a condition on the policy itself.
5.3 Monte Carlo Control
Generalized Policy Iteration can be implemented with MC.
Scheme:
$\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{E} q_{\pi_*}$
An infinite number of MC simulations is needed to compute $q_\pi$ exactly.
Easy policy improvement:
$\pi(s) = \operatorname{argmax}_a q(s, a)$
Improvement at each step, hence convergence...
5.3 Monte Carlo Control
Two strong assumptions:
Every state-action pair is visited infinitely often (Exploring Starts).
Infinite number of MC simulations.
Approximate policy iteration is required.
First approach: if the number of MC simulations is large enough, then the approximation is good enough... but the required number of simulations can be very large.
Second approach: use the current MC estimate to update a current Q-value estimate (GPI).
5.3 Monte Carlo Control
Cannot converge to any suboptimal policy.
Convergence is still an open question...
5.3 Monte Carlo Control - Figure 5.2
5.4 Monte Carlo Control without Exploring Starts
The exploring-starts assumption can be removed if one guarantees that the agent selects every action infinitely often.
Two approaches:
on-policy, where the policy is constrained to explore;
off-policy, where the agent uses a different policy than the one we want to estimate.
On-policy control: use a soft policy such that $\pi(a \mid s) > 0$, but gradually shift it closer and closer to a deterministic optimal policy.
Impossible to use the classical policy improvement step.
5.4 Monte Carlo Control without Exploring Starts
Use of $\varepsilon$-greedy rules:
$\pi(a \mid s) = \begin{cases} 1 - \varepsilon + \varepsilon / |\mathcal{A}(s)| & \text{if } a = \operatorname{argmax}_{a'} q(s, a') \\ \varepsilon / |\mathcal{A}(s)| & \text{otherwise} \end{cases}$
Improvement over any $\varepsilon$-greedy policy $\pi$:
$q_\pi(s, \pi'(s)) = \sum_a \pi'(a \mid s)\, q_\pi(s, a) = \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \max_a q_\pi(s, a)$
$\ge \frac{\varepsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \varepsilon) \sum_a \frac{\pi(a \mid s) - \varepsilon / |\mathcal{A}(s)|}{1 - \varepsilon}\, q_\pi(s, a) = \sum_a \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s)$
The fixed point is an optimal $\varepsilon$-greedy policy.
5.4 Monte Carlo Control without Exploring Starts
In practice, one reduces $\varepsilon$ during the iterations.
5.5 Off-policy Prediction via Importance Sampling
Two approaches:
on-policy, where the policy is constrained to explore;
off-policy, where the agent uses a different policy than the one we want to estimate.
Target policy (the policy we want to estimate) vs. behavior policy (the policy we use to explore).
On-policy: simpler.
Off-policy: more complex but more powerful and general.
Example: off-policy methods can be used to learn from observation.
Focus now on prediction: estimation of $v_\pi$ or $q_\pi$ while the episodes follow a policy $b \neq \pi$.
Minimum (coverage) requirement: if $\pi(a \mid s) > 0$ then $b(a \mid s) > 0$.
We may have $\pi(a \mid s) = 0$ and $b(a \mid s) > 0$...
Typically, $b$ is an $\varepsilon$-greedy policy.
5.5 Off-policy Prediction via Importance Sampling
Most off-policy methods are based on importance sampling:
$\mathbb{E}_\pi[f(X)] = \mathbb{E}_q\left[ \frac{\pi(X)}{q(X)} f(X) \right]$
By construction,
$P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, \pi) = \pi(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)\, \pi(A_{t+1} \mid S_{t+1}) \cdots p(S_T \mid S_{T-1}, A_{T-1}) = \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)$
Relative probability (importance sampling) ratio:
$\rho_{t:T-1} = \frac{P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, \pi)}{P(A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, b)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
Depends only on the policies and not on the MDP.
5.5 Off-policy Prediction via Importance Sampling
Value function using importance sampling:
$v_\pi(s) = \mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] \neq \mathbb{E}[G_t \mid S_t = s] = v_b(s)$
Natural estimate of $v_\pi$ (ordinary importance sampling):
$V(s) = \frac{\sum_{t \in \mathcal{I}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{I}(s)|}$
where $\mathcal{I}(s)$ is the set of time steps at which $s$ is visited (only the first visit of each episode for a first-visit method).
Alternative (weighted importance sampling):
$V(s) = \frac{\sum_{t \in \mathcal{I}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{I}(s)} \rho_{t:T(t)-1}}$
Remark: $\mathbb{E}\left[ \sum_{t \in \mathcal{I}(s)} \rho_{t:T(t)-1} \right] = |\mathcal{I}(s)|$
5.5 Off-policy Prediction via Importance Sampling
Ordinary importance sampling is unbiased (for the first-visit method) but may have a large variance.
Weighted importance sampling is biased but may have a smaller variance; there is no asymptotic bias.
Ordinary importance sampling is nevertheless simpler to extend...
5.5 Off-policy Prediction via Importance Sampling
Very large variance terms lead to convergence issues.
5.6 Incremental Implementation
An incremental implementation avoids storing all the returns.
Observation: if
$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}$
then
$V_{n+1} = V_n + \frac{W_n}{C_n}(G_n - V_n)$
with $C_{n+1} = C_n + W_{n+1}$.
Remark: if
$V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{n-1}$
then
$V_{n+1} = V_n + \frac{1}{n}(W_n G_n - V_n)$
This leads to a better implementation.
5.6 Incremental Implementation
5.7 Off-policy Monte Carlo Control
GPI principle.
Requires an exploratory (soft) behavior policy.
6 Temporal-Difference Learning
6.1 TD Prediction
6.2 Advantages of TD Prediction Methods
6.3 Optimality of TD(0)
6.4 Sarsa: On-policy TD Control
6.5 Q-Learning: Off-policy TD Control
6.6 Expected Sarsa
6.7 Maximization Bias and Double Learning
6.8 Games, Afterstates, and Other Special Cases
6.1 TD Prediction
Constant-$\alpha$ Monte Carlo update:
$V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$
The target $G_t$ (estimate of $v_\pi(S_t)$) requires waiting until the end of the episode.
Simplest TD method, TD(0):
$V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$
The target $R_{t+1} + \gamma V(S_{t+1})$ (estimate of $v_\pi(S_t)$) is available immediately.
Estimate based on a previous estimate: bootstrapping method (like DP).
Underlying expectations:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
Estimated quantities:
the expectation (MC / TD),
the value function (DP / TD).
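A sketch of tabular TD(0) prediction; the interfaces env_step(s, a) -> (reward, next_state, done) and policy(s) -> action are assumptions of this example:

```python
from collections import defaultdict

def td0_prediction(env_step, policy, start_state, n_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(S_t) <- V(S_t) + alpha (R + gamma V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            a = policy(s)
            r, s_next, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```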
6.1 TD Prediction
TD error:
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
MC error:
$G_t - V(S_t) = \delta_t + \gamma\,(G_{t+1} - V(S_{t+1})) = \sum_{k=t}^{T-1} \gamma^{k-t} \delta_k$
if $V$ is kept frozen during each episode.
6.1 TD Prediction
6.2 Advantages of TD Prediction Methods
Obvious advantages:
No need for a model (cf. DP).
No need to wait until the end of the episode (cf. MC).
Theoretical guarantees on the convergence!
No theoretical winner between TD and MC...
In practice, TD is often faster.
6.3 Optimality of TD(0)
Batch updating setting: several passes on the same data.
MC and TD converge (provided $\alpha$ is small enough).
Different limits:
MC: sample average of the returns.
TD: value function obtained by replacing the true MDP by the maximum-likelihood one (certainty-equivalence estimate).
Remark: no need to explicitly compute the ML estimate!
6.4 Sarsa: On-policy TD Control
GPI setting:
Update $Q$ using the current policy with
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$
Update $\pi$ by policy improvement.
May not converge if one uses a greedy policy update!
Convergence results for an $\varepsilon_t$-greedy update with $\varepsilon_t \to 0$.
6.5 Q-Learning: Off-policy TD Control
Update with the greedy target:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,(R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t))$
Update independent from the behavior policy!
Convergence provided the policy visits each state-action pair infinitely often.
6.5 Q-Learning: Off-policy TD Control
Q-learning takes more risks...
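A sketch of tabular Q-learning with an ε-greedy behavior policy; the env_step(s, a) -> (reward, next_state, done) interface is an assumption of this example. For Sarsa, the max over the next actions would be replaced by the value of the action actually chosen next.

```python
from collections import defaultdict
import random

def q_learning(env_step, actions, start_state, n_episodes,
               alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: off-policy target uses the greedy value of the next state."""
    Q = defaultdict(float)                                  # Q[(s, a)]
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)                  # behavior: epsilon-greedy
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s_next, done = env_step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```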
6.6 Expected Sarsa
Idea: replace the sampled next action in Sarsa by an expectation:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma\, \mathbb{E}_\pi[Q_t(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t) \right)$
$\qquad = Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a) - Q(S_t, A_t) \right)$
More complex, but reduces variance.
6.6 Expected Sarsa
6.7 Maximization Bias and Double Learning
Maximization bias issue: $\mathbb{E}[\max] \ge \max \mathbb{E}$!
Double learning:
Maintain two independent estimates of $q$: $Q_1$ and $Q_2$.
Maintain two estimates of the best action: $A_{1,*} = \operatorname{argmax}_a Q_1(\cdot, a)$ and $A_{2,*} = \operatorname{argmax}_a Q_2(\cdot, a)$.
Maintain two unbiased estimates $q(A_{1,*}) \approx Q_2(A_{1,*})$ and $q(A_{2,*}) \approx Q_1(A_{2,*})$.
6.7 Maximization Bias and Double Learning
6.8 Games, Afterstates, and Other Special Cases
The book focuses on state and action value functions.
Another approach is possible: afterstate value functions.
Interesting in games in particular...
7 n-step Bootstrapping
7.1 n-step TD Prediction
7.2 n-step Sarsa
7.3 n-step Off-policy Learning
7.4 Per-decision Methods with Control Variates
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
7.6 A Unifying Algorithm: n-step Q(σ)
7.3 n-step Off-policy Learning
Need to take into account the exploratory behavior policy $b$.
Importance sampling correction:
$\rho_{t:h} = \prod_{k=t}^{\min(h, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}$
Off-policy n-step TD:
$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha\, \rho_{t:t+n-1}\, (G_{t:t+n} - V_{t+n-1}(S_t))$
Off-policy n-step Sarsa:
$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha\, \rho_{t+1:t+n}\, (G_{t:t+n} - Q_{t+n-1}(S_t, A_t))$
An Expected Sarsa version is also possible...
7.3 n-step Off-policy Learning
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Use the reward of the action taken and bootstrap for the others.
Weight each branch by $\pi(a \mid S_t)$.
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
1-step return (Expected Sarsa):
$G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q_t(S_{t+1}, a)$
2-step return:
$G_{t:t+2} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1}) \left( R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2})\, Q_{t+1}(S_{t+2}, a) \right)$
$= R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+2}$
Recursive definition of the n-step return:
$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q_{t+n-1}(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1:t+n}$
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
7.6 A Unifying Algorithm: n-step Q(σ)
Different strategy at each node:
σ = 1: action sampling.
σ = 0: action averaging.
7.6 A Unifying Algorithm: n-step Q(σ)
8 Planning and Learning with Tabular Methods
8.1 Models and Planning
8.2 Dyna: Integrated Planning, Acting and Learning
8.3 When the Model is Wrong
8.4 Prioritized Sweeping
8.5 Expected vs. Sample Updates
8.6 Trajectory Sampling
8.7 Real-time Dynamic Programming
8.8 Planning at Decision Time
8.9 Heuristic Search
8.10 Rollout Algorithms
8.11 Monte Carlo Tree Search
8.1 Models and Planning
Model: anything that can be used to predict the environment's response.
Two different model families:
Distribution models: explicitly learn the MDP transition probabilities.
Sample models: learn to simulate the MDP.
The second type is easier to obtain.
Planning:
model --planning--> policy
In the book, state-space planning, as opposed to plan-space planning, which works on the plans themselves.
Common structure of state-space planning:
model --> simulated experience --backups--> values --> policy
Learning methods use real experience instead of simulated experience.
8.1 Models and Planning
Similar algorithm to Q-learning!
The only difference is the source of experience.
Remark: we already used this idea in Section 6.3, Optimality of TD(0).
8.2 Dyna: Integrated Planning, Acting and Learning
Dyna-Q: architecture combining both planning and learning.
Two uses of experience:
model learning (to improve the model),
reinforcement learning (direct RL) (to improve the value/policy).
Indirect (model-based) methods can use an a priori model but can be misled by a wrong model.
Direct methods do not require a model but may require a lot of experience.
8.2 Dyna: Integrated Planning, Acting and Learning
Planning, acting, model learning and direct RL are conceptually simultaneous.
Need to deal with the scheduling in practice.
8.2 Dyna: Integrated Planning, Acting and Learning
Learning, model learning and planning are all present.
Extension of Q-learning.
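A sketch of the Dyna-Q loop combining direct RL, model learning and planning; the env_step(s, a) -> (reward, next_state, done) interface and the deterministic last-transition model are assumptions of this example:

```python
from collections import defaultdict
import random

def dyna_q(env_step, actions, start_state, n_episodes,
           n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Dyna-Q: Q-learning on real transitions plus planning on a learned sample model."""
    Q = defaultdict(float)
    model = {}                                       # model[(s, a)] = (r, s_next, done)
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(s, b)]))
            r, s_next, done = env_step(s, a)
            # (a) direct RL: Q-learning update from real experience
            best = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
            # (b) model learning: remember the last observed transition
            model[(s, a)] = (r, s_next, done)
            # (c) planning: extra Q-learning updates on simulated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                pbest = 0.0 if pdone else max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * pbest - Q[(ps, pa)])
            s = s_next
    return Q
```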
8.2 Dyna: Integrated Planning, Acting and Learning
8.3 When the Model is Wrong
If the model is wrong, it may eventually be corrected...
8.3 When the Model is Wrong
... but this may be complicated if the model was pessimistic...
Dyna-Q+ forces exploration by increasing the reward of state-action pairs not explored lately.
8.4 Prioritized Sweeping
Freedom in the order of the states/actions during planning.
Intuition says that one should work backward from the goal.
Prioritized sweeping: order state-action pairs by their predicted value change.
8.4 Prioritized Sweeping
Prioritized sweeping leads to faster convergence.
8.5 Expected vs. Sample Updates
If the transition probabilities are available, should we use expectations or samples? DP or RL?
8.5 Expected vs. Sample Updates
Expectations are more stable but require more computation.
The cost depends heavily on the branching factor.
In practice, sampling seems interesting for large branching factors!
8.6 Trajectory Sampling
Trajectory sampling: sample states and actions by interacting with the model...
Initial gain, but may be slow in the long run.
8.7 Real-time Dynamic Programming
Classical DP but with trajectory sampling.
Convergence holds with an exploratory policy.
The optimal policy does not need to specify the action in irrelevant states.
Convergence holds even without full exploration in some specific cases!
In practice, seems computationally efficient.
8.8 Planning at Decision Time
Background planning: planning works on all states.
Decision-time planning: planning starts from the current state.
Extension of the one-step lookahead, but often without memory.
A combination looks interesting...
8.9 Heuristic Search
Heuristic search: the most classical decision-time planning method.
At each step,
a tree of possible continuations is grown,
an approximate value function is used at the leaves,
those values are backed up to the root.
Often the value function is fixed.
Can be seen as an extension of the greedy policy beyond a single step.
8.9 Heuristic Search
The deeper the tree, the less influence the value function has.
The better the value function, the better the policy.
Computational trade-off.
Focus on the states that are accessible from the current state.
8.10 Rollout Algorithms
Use MC simulations with a policy $\pi$ to choose the next action.
Simulation equivalent of the policy improvement idea.
The state-action value function is estimated only for the current state.
Computation time influenced by the complexity of the policy and the number of simulations.
Early stopping is possible using an approximate value function!
8.11 Monte Carlo Tree Search
Rollout algorithm combined with backup ideas.
Repeat 4 steps:
Selection of a sequence of actions according to the current values with a tree policy.
Expansion of the tree at the last node without values.
Simulation with a rollout policy to estimate the value at this node.
Backup of the value by empirical averaging.
MCTS focuses on promising paths.
Tabular Reinforcement Learning
Other Raised Issues
Definition of return: Is the task episodic or continuing, discounted or undiscounted?
Action values vs. state values vs. afterstate values: What kind of values should be estimated? If only state values are estimated, then either a model or a separate policy (as in actor-critic methods) is required for action selection.
Action selection/exploration: How are actions selected to ensure a suitable trade-off between exploration and exploitation? (ε-greedy, optimistic initialization of values, soft-max, and UCB...)
Synchronous vs. asynchronous: Are the updates for all states performed simultaneously or one by one in some order?
Real vs. simulated: Should one update based on real experience or simulated experience? If both, how much of each?
Location of updates: What states or state-action pairs should be updated? Model-free methods can choose only among the states and state-action pairs actually encountered, but model-based methods can choose arbitrarily. There are many possibilities here.
Timing of updates: Should updates be done as part of selecting actions, or only afterward?
Memory for updates: How long should updated values be retained? Should they be retained permanently, or only while computing an action selection, as in heuristic search?
9 On-policy Prediction with Approximation
9.1 Value-function Approximation
9.2 The Prediction Objective ($\overline{VE}$)
9.3 Stochastic-gradient and Semi-gradient Methods
9.4 Linear Methods
9.5 Feature Construction for Linear Methods
9.6 Selecting Step-Size Parameters Manually
9.7 Nonlinear Function Approximation: Artificial Neural Networks
9.8 Least-Squares TD
9.9 Memory-based Function Approximation
9.10 Kernel-based Function Approximation
9.11 Looking Deeper at On-policy Learning: Interest and Emphasis
9.1 Value-function Approximation
The prediction methods covered so far are based on backed-up values toward which the current value is shifted.
Examples:
MC with $G_t$,
TD(0) with $R_{t+1} + \gamma V(S_{t+1})$,
n-step TD with $G_{t:t+n}$.
$V$ is defined by its value at each state.
When the number of states is large, this may be intractable.
Idea: replace $V(s)$ by an approximation $\hat{v}(s; w)$ where $w$ is a weight (parameter) vector defining the function.
Goal: find $w$ such that $\hat{v}(s; w) \approx v_\pi(s)$ from the backed-up values.
Similar to supervised learning!
Function approximation (or regression) setting.
Reinforcement learning requires online methods rather than batch ones.
Often nonstationary target functions...
9.2 The Prediction Objective ($\overline{VE}$)
How to measure the quality of $\hat{v}(s; w)$?
So far, we have implicitly used a $\|\cdot\|_\infty$ norm.
Prediction objective ($\overline{VE}$):
$\overline{VE}(w) = \sum_s \mu(s) \left( v_\pi(s) - \hat{v}(s; w) \right)^2$
where $\mu$ is a state distribution.
Most classical choice: $\mu(s)$ is the fraction of time spent in $s$.
Under on-policy training, such a choice is called the on-policy distribution:
the stationary distribution under $\pi$ for continuing tasks,
a distribution that also depends on the law of the initial state for episodic tasks.
More differences between episodic and continuing tasks than without approximation.
Remark: the prediction objective is not directly linked to the corresponding policy performance!
Often only a local optimum in $w$ is reached...
9.3 Stochastic-gradient and Semi-gradient Methods
Prediction objective ($\overline{VE}$):
$\overline{VE}(w) = \sum_s \mu(s) \left( v_\pi(s) - \hat{v}(s; w) \right)^2$
Prediction objective gradient:
$\nabla \overline{VE}(w) = -2 \sum_s \mu(s) \left( v_\pi(s) - \hat{v}(s; w) \right) \nabla \hat{v}(s; w)$
Prediction objective stochastic gradient:
$\widehat{\nabla} \overline{VE}(w) = -2 \left( v_\pi(S_t) - \hat{v}(S_t; w) \right) \nabla \hat{v}(S_t; w)$
SGD algorithm:
$w_{t+1} = w_t + \alpha \left( v_\pi(S_t) - \hat{v}(S_t; w_t) \right) \nabla \hat{v}(S_t; w_t)$
Issue: $v_\pi$ is unknown!
9.3 Stochastic-gradient and Semi-gradient Methods
Monte Carlo: replace $v_\pi(S_t)$ by $G_t$.
Algorithm:
$w_{t+1} = w_t + \alpha \left( G_t - \hat{v}(S_t; w_t) \right) \nabla \hat{v}(S_t; w_t)$
Convergence guarantees because $\mathbb{E}[G_t \mid S_t = s] = v_\pi(s)$.
Genuine stochastic-gradient setting!
9.3 Stochastic-gradient and Semi-gradient Methods
TD: replace $v_\pi(S_t)$ by $R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t)$.
Algorithm:
$w_{t+1} = w_t + \alpha \left( R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t) \right) \nabla \hat{v}(S_t; w_t)$
Not a true stochastic-gradient setting anymore!
The effect of the change of $w$ on the target is ignored, hence the name semi-gradient.
Less stable, but converges for linear approximation.
9.3 Stochastic-gradient and Semi-gradient Methods
Example with state aggregation: several states share the same value of the value function.
MC stochastic gradient algorithm.
9.4 Linear Methods
Linear method: $\hat{v}(s; w) = w^\top x(s)$ with $x(s)$ a feature vector representing state $s$.
The $x_i$ are basis functions of the linear space of possible $\hat{v}$.
Linear method gradient:
$\nabla \hat{v}(s; w) = x(s)$
Generic SGD algorithm:
$w_{t+1} = w_t + \alpha \left( U_t - w_t^\top x(S_t) \right) x(S_t)$
MC stochastic gradient:
$w_{t+1} = w_t + \alpha \left( G_t - w_t^\top x(S_t) \right) x(S_t)$
TD(0) semi-gradient:
$w_{t+1} = w_t + \alpha \left( R_{t+1} + \gamma w_t^\top x(S_{t+1}) - w_t^\top x(S_t) \right) x(S_t)$
The simple form allows mathematical analysis!
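A sketch of linear semi-gradient TD(0); run_episode(), yielding transitions (x_t, r_{t+1}, x_{t+1}, done) as feature vectors, is an assumed helper of this example:

```python
import numpy as np

def linear_semigradient_td0(run_episode, d, alpha=0.01, gamma=1.0, n_episodes=100):
    """Linear semi-gradient TD(0): v_hat(s; w) = w . x(s), so the gradient is x(s)."""
    w = np.zeros(d)
    for _ in range(n_episodes):
        for x, r, x_next, done in run_episode():
            v = w @ x
            v_next = 0.0 if done else w @ x_next
            w += alpha * (r + gamma * v_next - v) * x
    return w
```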
9.4 Linear Methods
Using $x_t = x(S_t)$, TD becomes
$w_{t+1} = w_t + \alpha \left( R_{t+1} + \gamma w_t^\top x_{t+1} - w_t^\top x_t \right) x_t = w_t + \alpha \left( R_{t+1} x_t - x_t (x_t - \gamma x_{t+1})^\top w_t \right)$
Assume we are in the steady-state regime,
$\mathbb{E}[w_{t+1} \mid w_t] = w_t + \alpha (b - A w_t)$
with $b = \mathbb{E}[R_{t+1} x_t]$ and $A = \mathbb{E}\left[ x_t (x_t - \gamma x_{t+1})^\top \right]$.
If the algorithm converges to $w_{TD}$, then
$b - A w_{TD} = 0$
If $A$ is invertible,
$w_{TD} = A^{-1} b$
and
$\mathbb{E}[w_{t+1} - w_{TD} \mid w_t] = (\mathrm{Id} - \alpha A)(w_t - w_{TD})$
$A$ is positive definite thanks to the stationarity of the measure $\mu$ for the policy.
The complete proof is much more involved.
9.4 Linear Methods
Error bound for MC:
$\overline{VE}(w_{MC}) = \min_w \overline{VE}(w)$
Error bound for TD:
$\overline{VE}(w_{TD}) \le \frac{1}{1 - \gamma} \min_w \overline{VE}(w)$
Possible asymptotic loss for TD... but often faster convergence and lower variance!
Similar results for other on-policy prediction algorithms.
Similar results for episodic tasks.
9.4 Linear Methods
9.4 Linear Methods
Semi-gradient n-step TD algorithm:
$w_{t+1} = w_t + \alpha \left( G_{t:t+n} - w_t^\top x(S_t) \right) x(S_t)$
9.5 Feature Construction for Linear Methods
Polynomials: the simplest possible basis,
$x_i(s) = \prod_{j=1}^{k} s_j^{c_{i,j}}$
Fourier basis: useful when dealing with periodic functions,
$x_i(s) = \cos(s^\top c_i)$ or $\sin(s^\top c_i)$
Renormalization may be required.
Selection of the order can be useful.
9.5 Feature Construction for Linear Methods
Fourier features are well adapted to smooth functions.
9.5 Feature Construction for Linear Methods
Coarse coding: extension of state aggregation.
Large freedom in the design of the cells.
9.5 Feature Construction for Linear Methods
Bandwidth issue...
9.5 Feature Construction for Linear Methods
Systematic way of constructing a coarse coding: grid plus overlap.
9.5 Feature Construction for Linear Methods
Same size for all cells: easier choice of $\alpha$.
Easy computation.
9.5 Feature Construction for Linear Methods
Different offsets lead to different approximations.
Large freedom in the tiling.
Hashing can also be used.
9.5 Feature Construction for Linear Methods
Radial basis functions:
$x_i(s) = \exp\left( -\|s - c_i\|^2 / (2\sigma_i^2) \right)$
Smoothed version of tiling...
9.6 Selecting Step-Size Parameters Manually
SGD requires the selection of an appropriate step-size.
Theory proposes a slowly decreasing step-size ($O(1/t)$).
Often very slow convergence and not adapted to nonstationary targets.
Intuition from the tabular case: $\alpha = 1/\tau$ with $\tau$ the number of experiences required to converge approximately.
Rule of thumb for linear SGD:
$\alpha = \left( \tau\, \mathbb{E}\left[ x^\top x \right] \right)^{-1}$
9.7 Nonlinear Function Approximation: Artificial Neural Networks
Artificial neural networks are widely used for nonlinear function approximation.
Complex architectures trained with gradient descent (backprop).
Several clever tools: initialization, activation functions, dropout, residual networks, batch normalization...
See the lecture on Deep Learning...
Can be used in stochastic-gradient or semi-gradient methods.
9.7 Nonlinear Function Approximation: Artificial Neural Networks
Use of convolutional networks to capitalize on partial translation invariance.
9.8 Least-Squares TD
Better sample efficiency can be obtained in a different way.
The TD fixed point satisfies:
$w_{TD} = A^{-1} b$
with $A = \mathbb{E}\left[ x_t (x_t - \gamma x_{t+1})^\top \right]$ and $b = \mathbb{E}[R_{t+1} x_t]$.
TD algorithm: iterative update of $w_t$.
Least-Squares TD algorithm: iterative update of estimates $A_t$ and $b_t$, and final application of the inverse formula.
Natural estimates:
$A_t = \frac{1}{t} \sum_{k=0}^{t-1} x_k (x_k - \gamma x_{k+1})^\top \quad \text{and} \quad b_t = \frac{1}{t} \sum_{k=0}^{t-1} R_{k+1} x_k$
To avoid invertibility issues:
$\widehat{A}_t = \sum_{k=0}^{t-1} x_k (x_k - \gamma x_{k+1})^\top + \varepsilon\, \mathrm{Id} \quad \text{and} \quad \widehat{b}_t = \sum_{k=0}^{t-1} R_{k+1} x_k$
The missing $1/t$ factors cancel...
9.8 Least-Squares TD
$O(d^2)$ per step, but an inversion is required at the end...
Even better algorithm by maintaining an estimate of the inverse directly:
$\widehat{A}_t^{-1} = \left( \widehat{A}_{t-1} + x_t (x_t - \gamma x_{t+1})^\top \right)^{-1} = \widehat{A}_{t-1}^{-1} - \frac{\widehat{A}_{t-1}^{-1} x_t (x_t - \gamma x_{t+1})^\top \widehat{A}_{t-1}^{-1}}{1 + (x_t - \gamma x_{t+1})^\top \widehat{A}_{t-1}^{-1} x_t}$
Still $O(d^2)$ per step, but no inversion at the end.
No step size... but thus no forgetting...
9.9 Memory-based Function Approximation
Nonparametric approach using memory.
Requires an estimate of the target for a set of states.
After (lazy) learning, each time a new query state arrives, one retrieves a set of close examples in the training dataset and deduces an estimate for the query state.
Examples: nearest neighbors, weighted average, locally weighted regression...
Focus on states that are close to observed states.
No need to be accurate far from typical states.
Computational issue: need to find (approximate) nearest neighbors...
9.10 Kernel-based Function Approximation
Kernel-based approximation is a weighted-average strategy where the weights are defined by a kernel $k$ measuring the similarity between states.
Kernel-based approximation:
$\hat{v}(s; \mathcal{D}) = \frac{\sum_{s' \in \mathcal{D}} k(s, s')\, g(s')}{\sum_{s' \in \mathcal{D}} k(s, s')}$
Similar to RBF, but with kernels centered on the examples.
Large freedom in the choice of the kernel...
Kernel trick!
9.11 Looking Deeper at On-policy Learning: Interest and Emphasis
Prediction objective ($\overline{VE}$):
$\overline{VE}(w) = \sum_s \mu(s) \left( v_\pi(s) - \hat{v}(s; w) \right)^2$
where $\mu$ is a state distribution.
Not necessarily the most interesting goal!
For instance, one may be more interested in earlier states than in later states.
Interest $I_t$ defined for each state.
General algorithm:
$w_{t+n} = w_{t+n-1} + \alpha M_t \left( G_{t:t+n} - \hat{v}(S_t; w_{t+n-1}) \right) \nabla \hat{v}(S_t; w_{t+n-1})$
with the emphasis
$M_t = I_t + \gamma^n M_{t-n}$
Proof of convergence similar to TD(0).
10 On-policy Control with Approximation
10.1 Episodic Semi-gradient Control
10.2 Semi-gradient n-step Sarsa
10.3 Average Reward: A New Problem Setting for Continuing Tasks
10.4 Deprecating the Discounted Setting
10.5 Differential Semi-gradient n-step Sarsa
10.1 Episodic Semi-gradient Control
Straightforward extension to $q_\pi(s, a) \approx \hat{q}(s, a; w)$.
Prediction algorithm:
$w_{t+1} = w_t + \alpha \left( U_t - \hat{q}(S_t, A_t; w_t) \right) \nabla \hat{q}(S_t, A_t; w_t)$
On-policy control by adding an ($\varepsilon$-greedy) policy improvement step.
10.1 Episodic Semi-gradient Control
Grid tiling with 8 tilings and asymmetrical offsets.
10.1 Episodic Semi-gradient Control
10.2 Semi-gradient n-step Sarsa
Natural extension of the tabular method:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}; w_{t+n-1})$
$w_{t+n} = w_{t+n-1} + \alpha \left( G_{t:t+n} - \hat{q}(S_t, A_t; w_{t+n-1}) \right) \nabla \hat{q}(S_t, A_t; w_{t+n-1})$
10.2 Semi-gradient n-step Sarsa
10.3 Average Reward: A New Problem Setting for Continuing Tasks
Continuing task without discounting.
Average reward:
$r(\pi) = \lim_{h \to +\infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi] = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$
Ergodicity assumption on the MDP: existence of
$\mu_\pi(s) = \lim_{t \to +\infty} P(S_t = s \mid A_{0:t-1} \sim \pi)$
which is independent of $S_0$ for any $\pi$.
By construction,
$\sum_s \mu_\pi(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) = \mu_\pi(s')$
10.3 Average Reward: A New Problem Setting for Continuing Tasks
Differential return:
$G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + \cdots$
Differential value functions:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] \quad \text{and} \quad q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Bellman equations:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{r, s'} p(s', r \mid s, a) \left( r - r(\pi) + v_\pi(s') \right)$
$q_\pi(s, a) = \sum_{r, s'} p(s', r \mid s, a) \left( r - r(\pi) + \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \right)$
Optimality:
$v_*(s) = \max_a \sum_{r, s'} p(s', r \mid s, a) \left( r - \max_\pi r(\pi) + v_*(s') \right)$
$q_*(s, a) = \sum_{r, s'} p(s', r \mid s, a) \left( r - \max_\pi r(\pi) + \max_{a'} q_*(s', a') \right)$
Remark: the true derivation is much more involved!
10.3 Average Reward: A New Problem Setting for Continuing Tasks
Differential TD errors:
$\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$
$\delta_t = R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}; w_t) - \hat{q}(S_t, A_t; w_t)$
where $\bar{R}_t$ is an estimate of $r(\pi)$.
Differential algorithm, for instance semi-gradient Sarsa:
$w_{t+1} = w_t + \alpha\, \delta_t\, \nabla \hat{q}(S_t, A_t; w_t)$
10.3 Average Reward: A New Problem Setting for Continuing Tasks
10.4 Deprecating the Discounted Setting
In the ergodic setting, discounting is equivalent to averaging!
$J(\pi) = \sum_s \mu_\pi(s)\, v_{\pi,\gamma}(s) = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, (r + \gamma v_{\pi,\gamma}(s'))$
$= r(\pi) + \gamma \sum_{s'} v_{\pi,\gamma}(s') \sum_s \mu_\pi(s) \sum_a p(s' \mid s, a)\, \pi(a \mid s) = r(\pi) + \gamma \sum_{s'} v_{\pi,\gamma}(s')\, \mu_\pi(s')$
$= r(\pi) + \gamma J(\pi) = \frac{1}{1 - \gamma}\, r(\pi)$
Same optimal policies.
Issue: with approximation, the policy improvement theorem is lost (in the finite, discounted and average cases).
10.5 Differential Semi-gradient n-step Sarsa
Differential n-step return:
$G_{t:t+n} = R_{t+1} - \bar{R}_{t+n-1} + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}; w_{t+n-1})$
where $\bar{R}_{t+n-1}$ is an estimate of $r(\pi)$.
11 Off-policy Methods with Approximation
11.1 Semi-gradient Methods
11.2 Examples of Off-policy Divergence
11.3 The Deadly Triad
11.4 Linear Value-function Geometry
11.5 Gradient Descent in the Bellman Error
11.6 The Bellman Error is not Learnable
11.7 Gradient-TD Methods
11.8 Emphatic-TD Methods
11.9 Reducing Variance
11.1 Semi-gradient Methods
Off-policy learning: two challenges,
dealing with the update target,
dealing with the distribution of the updates.
The first part is easier...
Importance sampling ratio:
$\rho_t = \rho_{t:t} = \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}$
Semi-gradient off-policy TD(0):
$w_{t+1} = w_t + \alpha\, \rho_t\, \delta_t\, \nabla \hat{v}(S_t; w_t)$
with
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$ (episodic)
$\delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$ (continuing)
11.1 Semi-gradient Methods
Semi-gradient Expected Sarsa:
$w_{t+1} = w_t + \alpha\, \delta_t\, \nabla \hat{q}(S_t, A_t; w_t)$
with
$\delta_t = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, \hat{q}(S_{t+1}, a; w_t) - \hat{q}(S_t, A_t; w_t)$ (episodic)
$\delta_t = R_{t+1} - \bar{R}_t + \sum_a \pi(a \mid S_{t+1})\, \hat{q}(S_{t+1}, a; w_t) - \hat{q}(S_t, A_t; w_t)$ (continuing)
Multi-step semi-gradient Sarsa:
$w_{t+n} = w_{t+n-1} + \alpha\, \rho_{t+1} \cdots \rho_{t+n-1}\, \delta_t\, \nabla \hat{q}(S_t, A_t; w_{t+n-1})$
with
$\delta_t = R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}; w_{t+n-1}) - \hat{q}(S_t, A_t; w_t)$ (episodic)
$\delta_t = R_{t+1} - \bar{R}_t + \cdots + R_{t+n} - \bar{R}_{t+n-1} + \hat{q}(S_{t+n}, A_{t+n}; w_{t+n-1}) - \hat{q}(S_t, A_t; w_t)$ (continuing)
11.1 Semi-gradient Methods
Semi-gradient tree backup:
$w_{t+n} = w_{t+n-1} + \alpha \left( G_{t:t+n} - \hat{q}(S_t, A_t; w_{t+n-1}) \right) \nabla \hat{q}(S_t, A_t; w_{t+n-1})$
with
$G_{t:t+n} = \hat{q}(S_t, A_t; w_{t-1}) + \sum_{k=t}^{t+n-1} \delta_k \prod_{i=t+1}^{k} \gamma\, \pi(A_i \mid S_i)$
and $\delta_k$ as in Expected Sarsa.
11.2 Examples of Off-policy Divergence
Simple transition with reward 0, from a state with feature value $w$ to a state with feature value $2w$.
TD error:
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t) = 0 + \gamma\, 2 w_t - w_t = (2\gamma - 1) w_t$
Off-policy semi-gradient TD(0) update:
$w_{t+1} = w_t + \alpha\, \rho_t\, \delta_t\, \nabla \hat{v}(S_t; w_t) = w_t + \alpha \cdot 1 \cdot (2\gamma - 1) w_t = (1 + \alpha(2\gamma - 1)) w_t$
Explosion if this transition is explored without $w$ being updated on other transitions, as soon as $\gamma > 1/2$.
11.2 Examples of Off-policy Divergence
Divergence of the off-policy algorithm.
11.2 Examples of Off-policy Divergence
Exact minimization of the bootstrapped $\overline{VE}$ at each step:
$w_{k+1} = \operatorname{argmin}_w \sum_s \left( \hat{v}(s; w) - \mathbb{E}[R_{t+1} + \gamma \hat{v}(S_{t+1}; w_k) \mid S_t = s] \right)^2 = \operatorname{argmin}_w (w - \gamma 2 w_k)^2 + (2w - (1 - \varepsilon)\gamma 2 w_k)^2 = \frac{6 - 4\varepsilon}{5}\, \gamma\, w_k$
Divergence if $\gamma > 5 / (6 - 4\varepsilon)$.
11.3 The Deadly Triad
Deadly triad:
Function approximation
Bootstrapping
Off-policy training
Instabilities as soon as the three are present!
Issue:
Function approximation is unavoidable.
Bootstrapping is much more computationally and data efficient.
Off-policy training may be avoided... but is essential in extended settings (learning from others or learning several tasks).
11.4 Linear Value-function Geometry
Natural weighted squared norm:
$\|v\|_\mu^2 = \sum_s \mu(s)\, v(s)^2$
Prediction objective $\overline{VE}$:
$\overline{VE}(w) = \|v_w - v_\pi\|_\mu^2$
Projection operator:
$\Pi v = v_{w^*}$ where $w^* = \operatorname{argmin}_w \|v - v_w\|_\mu^2$
Target for MC methods.
11.4 Linear Value-function Geometry
Bellman error:
$\bar{\delta}_w(s) = \left( \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, (r + \gamma v_w(s')) \right) - v_w(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_w(S_{t+1}) - v_w(S_t) \mid S_t = s, A_t \sim \pi]$
Expectation of the TD error.
Mean Squared Bellman Error:
$\overline{BE}(w) = \|\bar{\delta}_w\|_\mu^2$
Unique minimizer in the linear case.
Different from the minimizer of $\overline{VE}$.
11.4 Linear Value-function Geometry
Bellman operator:
$(B_\pi v)(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, (r + \gamma v(s'))$
$B_\pi(v_w)$ is in general not representable.
Projected Bellman operator: $\Pi B_\pi v$
Projected Bellman iteration: $v_{w_{t+1}} = \Pi B_\pi v_{w_t}$...
Mean Squared Projected Bellman Error:
$\overline{PBE}(w) = \|\Pi B_\pi v_w - v_w\|_\mu^2 = \|\Pi \bar{\delta}_w\|_\mu^2$
11.5 Gradient Descent in the Bellman Error
With approximation, only MC so far is a true SGD method...
TD error:
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$
Mean Squared TD Error:
$\overline{TDE}(w) = \sum_s \mu(s)\, \mathbb{E}\left[ \delta_t^2 \mid S_t = s, A_t \sim \pi \right] = \mathbb{E}_b\left[ \rho_t\, \delta_t^2 \right]$
with $\mu$ linked to $b$.
Minimization of an expectation leads to SGD.
SGD algorithm:
$w_{t+1} = w_t - \tfrac{1}{2} \alpha\, \nabla(\rho_t \delta_t^2) = w_t + \alpha\, \rho_t\, \delta_t \left( \nabla \hat{v}(S_t; w_t) - \gamma \nabla \hat{v}(S_{t+1}; w_t) \right)$
Convergence guarantees for this naive residual-gradient algorithm.
11.5 Gradient Descent in the Bellman Error
The solution is not necessarily interesting!
The true value function is easy: $v(A) = 1/2$, $v(B) = 1$, $v(C) = 0$.
The optimal $\overline{TDE}$ is attained for $v(A) = 1/2$, $v(B) = 3/4$, $v(C) = 1/4$!
11.5 Gradient Descent in the Bellman Error
The Bellman error can also be measured by
$\left( \mathbb{E}[\delta_t] \right)^2$
SGD-type algorithm:
$w_{t+1} = w_t - \tfrac{1}{2} \alpha\, \nabla \left( \mathbb{E}_b[\rho_t \delta_t] \right)^2 = w_t - \alpha\, \mathbb{E}_b[\rho_t \delta_t]\, \nabla \mathbb{E}_b[\rho_t \delta_t]$
$= w_t + \alpha \left( \mathbb{E}_b[\rho_t (R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t))] - \hat{v}(S_t; w_t) \right) \left( \nabla \hat{v}(S_t; w_t) - \gamma\, \mathbb{E}_b[\rho_t \nabla \hat{v}(S_{t+1}; w_t)] \right)$
Requires two independent samples of $S_{t+1}$!
Converges toward the true value in the tabular case.
Can be slow... and the solution with approximation may be unsatisfactory.
11.6 The Bellman Error is not Learnable
Two MRPs with the same observable outputs (because of approximation), but different $\overline{VE}$.
It is impossible to learn $\overline{VE}$ from data.
Its minimizer, however, is learnable:
$\overline{RE}(w) = \mathbb{E}\left[ \left( G_t - \hat{v}(S_t; w) \right)^2 \right] = \overline{VE}(w) + \mathbb{E}\left[ \left( G_t - v_\pi(S_t) \right)^2 \right]$
11.6 The Bellman Error is not Learnable
Two MRPs with the same observable outputs (because of approximation), but different $\overline{BE}$ and different minimizers!
$\overline{BE}$ is not learnable!
11.6 The Bellman Error is not Learnable
$\overline{PBE}$ and $\overline{TDE}$ are learnable.
11.7 Gradient-TD Methods
SGD methods for minimizing $\overline{PBE}$:
$\overline{PBE}(w) = \|\Pi \bar{\delta}_w\|_\mu^2 = (X^\top D \bar{\delta}_w)^\top (X^\top D X)^{-1} (X^\top D \bar{\delta}_w)$
Gradient:
$\nabla \overline{PBE}(w) = 2\, \nabla (X^\top D \bar{\delta}_w)^\top (X^\top D X)^{-1} (X^\top D \bar{\delta}_w)$
Expectations:
$X^\top D \bar{\delta}_w = \mathbb{E}[\rho_t \delta_t x_t]$
$\nabla (X^\top D \bar{\delta}_w)^\top = \mathbb{E}\left[ \rho_t (\gamma x_{t+1} - x_t) x_t^\top \right]$
$X^\top D X = \mathbb{E}\left[ x_t x_t^\top \right]$
Not yet an SGD...
11.7 Gradient-TD Methods
Expectations:
$\nabla (X^\top D \bar{\delta}_w)^\top = \mathbb{E}\left[ \rho_t (\gamma x_{t+1} - x_t) x_t^\top \right] \qquad X^\top D X = \mathbb{E}\left[ x_t x_t^\top \right] \qquad X^\top D \bar{\delta}_w = \mathbb{E}[\rho_t \delta_t x_t]$
Learn the product $v$ of the last two terms:
$v \simeq \mathbb{E}\left[ x_t x_t^\top \right]^{-1} \mathbb{E}[\rho_t \delta_t x_t]$
Solution of a least-squares problem.
Iterative algorithm:
$v_{t+1} = v_t + \beta\, \rho_t (\delta_t - v_t^\top x_t)\, x_t$
Insert the expression in the gradient to obtain GTD:
$w_{t+1} = w_t + \alpha\, \rho_t (x_t - \gamma x_{t+1})\, x_t^\top v_t$
or GTD2:
$w_{t+1} = w_t + \alpha\, \rho_t (\delta_t x_t - \gamma x_{t+1} x_t^\top v_t)$
Convergence with $\alpha, \beta \to 0$ and $\alpha / \beta \to 0$ (two time scales).
11.7 Gradient-TD Methods
11.8 Emphatic-TD Methods
Linear semi-gradient TD methods converge thanks to a match between the on-policy distribution and the transition probabilities.
Emphatic-TD: a clever way to weight the examples in off-policy learning that maintains this compatibility.
One-step Emphatic-TD:
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$
$M_t = \gamma\, \rho_{t-1} M_{t-1} + I_t$
$w_{t+1} = w_t + \alpha\, M_t\, \rho_t\, \delta_t\, \nabla \hat{v}(S_t; w_t)$
where $I_t$ is an arbitrary interest and $M_t$ the emphasis ($M_0 = 0$).
11.9 Reducing Variance
Off-policy methods have more variance than on-policy ones.
High variability of the importance sampling ratios.
Tree Backup does not use importance ratios.
Other variants exist, such as Retrace.
Variance can also be reduced by more strongly linking the behavior and target policies.
12 Eligibility Traces
12.1 The λ-return
12.2 TD(λ)
12.3 n-step Truncated λ-return Methods
12.4 Redoing Updates: Online λ-return
12.5 True Online TD(λ)
12.6 Dutch Traces in Monte Carlo Learning
12.7 Sarsa(λ)
12.8 Variable λ and γ
12.9 Off-policy Traces with Control Variates
12.10 Watkins's Q(λ) to Tree-Backup(λ)
12.11 Stable Off-policy Methods with Traces
12.12 Implementation Issues
12.1 The λ-return
n-step return:
$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}; w_{t+n-1})$
Averaged n-step return (compound update):
$G_t^\omega = \sum_{n=1}^{\infty} \omega_n\, G_{t:t+n} \quad \text{with} \quad \sum_{n=1}^{\infty} \omega_n = 1$
TD(λ): a specific averaging,
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_{t:t+n}$
12.1 The λ-return
TD(λ): interpolation between TD(0) and MC,
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_{t:t+n} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}\, G_{t:t+n} + \lambda^{T-t-1}\, G_t$
12.1 The λ-return
Offline λ-return algorithm:
Play an episode according to the policy.
Afterwards, modify $w$ according to
$w_{t+1} = w_t + \alpha \left( G_t^\lambda - \hat{v}(S_t; w_t) \right) \nabla \hat{v}(S_t; w_t), \quad t = 0, \ldots, T$
12.1 The λ-return
Forward view!
Need to wait until the end of the episode before any update.
12.2 TD(λ)
Backward view!
Eligibility trace $z_t$, keeping track of the components of $w$ that have contributed to recent state valuations:
$z_{-1} = 0$
$z_t = \gamma\lambda\, z_{t-1} + \nabla \hat{v}(S_t; w_t)$
TD error:
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$
Update:
$w_{t+1} = w_t + \alpha\, \delta_t\, z_t$
12.2 TD(λ)
TD(λ):
$z_t = \gamma\lambda\, z_{t-1} + \nabla \hat{v}(S_t; w_t)$
$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$
$w_{t+1} = w_t + \alpha\, \delta_t\, z_t$
TD(0) is the classical TD algorithm.
TD(1) is similar to an MC algorithm, but with parameter updates before the end of the episode.
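A sketch of semi-gradient TD(λ) with accumulating traces in the linear case; run_episode(), yielding transitions (x_t, r_{t+1}, x_{t+1}, done) as feature vectors, is an assumed helper of this example:

```python
import numpy as np

def td_lambda(run_episode, d, alpha=0.01, gamma=1.0, lam=0.8, n_episodes=100):
    """Semi-gradient TD(lambda) with accumulating eligibility traces, linear v_hat."""
    w = np.zeros(d)
    for _ in range(n_episodes):
        z = np.zeros(d)                       # eligibility trace, reset each episode
        for x, r, x_next, done in run_episode():
            v = w @ x
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - v    # TD error
            z = gamma * lam * z + x           # accumulating trace (gradient of w.x is x)
            w += alpha * delta * z
    return w
```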
12.2 TD(λ)
Not the same as using the λ-return...
Convergence guarantees in the linear case:
$\overline{VE}(w_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} \min_w \overline{VE}(w)$
12.3 n-step Truncated λ-return Methods
The λ-return is technically never known in the continuing case!
Truncated λ-return:
$G_{t:h}^\lambda = (1 - \lambda) \sum_{n=1}^{h-t-1} \lambda^{n-1}\, G_{t:t+n} + \lambda^{h-t-1}\, G_{t:h}$
Similar to an episodic setting of length $h$.
12.3 n-step Truncated λ-return Methods
A family of n-step λ-returns...
TTD(λ):
$w_{t+n} = w_{t+n-1} + \alpha \left( G_{t:t+n}^\lambda - \hat{v}(S_t; w_{t+n-1}) \right) \nabla \hat{v}(S_t; w_{t+n-1})$
Efficient implementation relying on:
$G_{t:t+k}^\lambda = \hat{v}(S_t; w_{t-1}) + \sum_{i=t}^{t+k-1} (\gamma\lambda)^{i-t}\, \delta_i'$
with
$\delta_t' = R_{t+1} + \gamma \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_{t-1})$
Forward view, but with a strong connection to the backward one.
12.4 Redoing Updates: Online λ-return
Conceptual algorithm which computes a weight vector at any time $t$ using only the returns known at that time.
Pass over all the data at each time step using the largest possible horizon $h$:
$w_{t+1}^h = w_t^h + \alpha \left( G_{t:h}^\lambda - \hat{v}(S_t; w_t^h) \right) \nabla \hat{v}(S_t; w_t^h), \quad 0 \le t < h \le T$
12.5 True Online TD(λ)
The online λ-return algorithm builds a triangular array of weights:
$w_0^0$
$w_0^1 \quad w_1^1$
$w_0^2 \quad w_1^2 \quad w_2^2$
$\vdots$
$w_0^T \quad w_1^T \quad w_2^T \quad \cdots \quad w_T^T$
The online λ-return algorithm has a triangular structure.
True online TD(λ) computes $w_{t+1}^{t+1}$ directly from $w_t^t$.
Linear case:
$w_{t+1} = w_t + \alpha\, \delta_t\, z_t + \alpha\, (w_t^\top x_t - w_{t-1}^\top x_t)(z_t - x_t)$
with
$z_t = \gamma\lambda\, z_{t-1} + (1 - \alpha\gamma\lambda\, z_{t-1}^\top x_t)\, x_t$
12.5 True Online TD(λ)
12.5 True Online TD(λ)
Dutch trace:
$z_t = \gamma\lambda\, z_{t-1} + (1 - \alpha\gamma\lambda\, z_{t-1}^\top x_t)\, x_t$
Accumulating trace (TD(λ)):
$z_t = \gamma\lambda\, z_{t-1} + x_t$
Replacing trace:
$z_{i,t} = \begin{cases} 1 & \text{if } x_{i,t} = 1 \\ \gamma\lambda\, z_{i,t-1} & \text{otherwise} \end{cases}$
The replacing trace was a precursor of dutch traces for binary features.
12.6 Dutch Traces in Monte Carlo Learning
Linear MC update:
$w_{t+1} = w_t + \alpha \left( G - w_t^\top x_t \right) x_t$
Here $G$ is a single return known at time $T$.
All the computation has to be done at the end to obtain $w_T$.
More efficient implementation using:
$w_T = a_{T-1} + \alpha\, G\, z_{T-1}$
12.6 Dutch Traces in Monte Carlo Learning
Furthermore, with $F_t = \mathrm{Id} - \alpha\, x_t x_t^\top$,
$z_t = \sum_{k=0}^{t} F_t \cdots F_{k+1}\, x_k = F_t z_{t-1} + x_t = z_{t-1} + (1 - \alpha\, z_{t-1}^\top x_t)\, x_t$
and
$a_t = F_t \cdots F_0\, w_0 = F_t a_{t-1} = a_{t-1} - \alpha\, x_t x_t^\top a_{t-1}$
Efficient computation with only $O(d)$ operations per step.
Eligibility traces arise whenever one tries to learn a long-term prediction in an efficient manner.
12.7 Sarsa(λ)
Action-value form of the n-step return:
$G_{t:t+n} = R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n}, A_{t+n}; w_{t+n-1})$
Offline λ-return algorithm:
$w_{t+1} = w_t + \alpha \left( G_t^\lambda - \hat{q}(S_t, A_t; w_t) \right) \nabla \hat{q}(S_t, A_t; w_t)$
where $G_t^\lambda = G_{t:\infty}^\lambda$.
12.7 Sarsa(λ)
Backward view:
$w_{t+1} = w_t + \alpha\, \delta_t\, z_t$
with
$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}; w_t) - \hat{q}(S_t, A_t; w_t)$
and
$z_{-1} = 0$
$z_t = \gamma\lambda\, z_{t-1} + \nabla \hat{q}(S_t, A_t; w_t)$
12.7 Sarsa(λ)
12.7 Sarsa(λ)
12.7 Sarsa(λ)
12.8 Variable λ and γ
Generalization to a non-constant discount $\gamma_t$ (termination function) and a non-constant $\lambda_t$:
$G_t = R_{t+1} + \gamma_{t+1} G_{t+1} = \sum_{k=t}^{\infty} \left( \prod_{i=t+1}^{k} \gamma_i \right) R_{k+1}$
Using $\gamma_t = 0$ at the transitions allows seeing a single stream of experience in the episodic setting.
Recursive definition of the λ-return for time-dependent $\lambda_t$ and $\gamma_t$:
$G_t^{\lambda s} = R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1})\, \hat{v}(S_{t+1}; w_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s} \right)$
$G_t^{\lambda a} = R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1})\, \hat{q}(S_{t+1}, A_{t+1}; w_t) + \lambda_{t+1}\, G_{t+1}^{\lambda a} \right)$
$G_t^{\lambda a} = R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1}) \sum_a \pi(a \mid S_{t+1})\, \hat{q}(S_{t+1}, a; w_t) + \lambda_{t+1}\, G_{t+1}^{\lambda a} \right)$
12.9 Off-policy Traces with Control Variates
Off-policy λ-return:
$G_t^{\lambda s} = \rho_t \left( R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1})\, \hat{v}(S_{t+1}; w_t) + \lambda_{t+1}\, G_{t+1}^{\lambda s} \right) \right) + \underbrace{(1 - \rho_t)\, \hat{v}(S_t; w_t)}_{\text{control variate}}$
TD error:
$\delta_t = R_{t+1} + \gamma_{t+1} \hat{v}(S_{t+1}; w_t) - \hat{v}(S_t; w_t)$
Approximation of the off-policy λ-return:
$G_t^{\lambda s} \simeq \hat{v}(S_t; w_t) + \rho_t \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i$
Not an approximation if $w$ is not updated!
Forward-view update:
$w_{t+1} = w_t + \alpha \left( G_t^{\lambda s} - \hat{v}(S_t; w_t) \right) \nabla \hat{v}(S_t; w_t) \simeq w_t + \alpha\, \rho_t \left( \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \right) \nabla \hat{v}(S_t; w_t)$
12.9 Off-policy Traces with Control Variates
Link between forward and backward views:
$w_{t+1} - w_t \simeq \alpha\, \rho_t \left( \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \right) \nabla \hat{v}(S_t; w_t)$
Sum over time:
$\sum_{t=1}^{\infty} (w_{t+1} - w_t) \simeq \sum_{t=1}^{\infty} \sum_{k=t}^{\infty} \alpha\, \rho_t\, \delta_k \left( \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \right) \nabla \hat{v}(S_t; w_t)$
$= \sum_{k=1}^{\infty} \sum_{t=1}^{k} \alpha\, \rho_t\, \delta_k \left( \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \right) \nabla \hat{v}(S_t; w_t) = \sum_{k=1}^{\infty} \alpha\, \delta_k \sum_{t=1}^{k} \rho_t \left( \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \right) \nabla \hat{v}(S_t; w_t)$
Eligibility trace (accumulating):
$z_k = \sum_{t=1}^{k} \rho_t \left( \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i \right) \nabla \hat{v}(S_t; w_t) = \rho_k \left( \gamma_k \lambda_k\, z_{k-1} + \nabla \hat{v}(S_k; w_k) \right)$
12.9 Off-policy Traces with Control Variates
Similar results for action-value methods.
Off-policy λ-return:
$G_t^{\lambda a} = R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1})\, \bar{V}_t(S_{t+1}) + \lambda_{t+1} \left( \rho_{t+1} G_{t+1}^{\lambda a} + \bar{V}_t(S_{t+1}) - \rho_{t+1} \hat{q}(S_{t+1}, A_{t+1}; w_t) \right) \right)$
$\simeq \hat{q}(S_t, A_t; w_t) + \sum_{k=t}^{\infty} \delta_k \prod_{i=t+1}^{k} \gamma_i \lambda_i \rho_i$
with $\bar{V}_t(s) = \sum_a \pi(a \mid s)\, \hat{q}(s, a; w_t)$ and
$\delta_t = R_{t+1} + \gamma_{t+1} \bar{V}_t(S_{t+1}) - \hat{q}(S_t, A_t; w_t)$
Eligibility trace:
$z_t = \gamma_t \lambda_t \rho_t\, z_{t-1} + \nabla \hat{q}(S_t, A_t; w_t)$
λ = 1 is close to MC but subtly distinct...
Theoretical guarantees are still being investigated.
12.10 Watkins's Q(λ) to Tree-Backup(λ)
Tree-backup λ-return:
$G_t^{\lambda a} = R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1})\, \bar{V}_t(S_{t+1}) + \lambda_{t+1} \left( \bar{V}_t(S_{t+1}) - \pi(A_{t+1} \mid S_{t+1}) \left( \hat{q}(S_{t+1}, A_{t+1}; w_t) - G_{t+1}^{\lambda a} \right) \right) \right)$
$\simeq \hat{q}(S_t, A_t; w_t) + \sum_{k=t}^{+\infty} \delta_k \prod_{i=t+1}^{k} \gamma_i \lambda_i\, \pi(A_i \mid S_i)$
Tree-backup eligibility trace:
$z_t = \gamma_t \lambda_t\, \pi(A_t \mid S_t)\, z_{t-1} + \nabla \hat{q}(S_t, A_t; w_t)$
12.10 Watkins's Q(λ) to Tree-Backup(λ)
Watkins's Q(λ) cuts the trace to zero after the first non-greedy action.
12.11 Stable Off-policy Methods with Traces
Four of the most important examples of stable off-policy methods using eligibility traces:
GTD(λ): Gradient-TD (TDC)
GQ(λ): Gradient-TD for action values
HTD(λ): hybrid between GTD(λ) and TD(λ)
Emphatic TD(λ): extension of Emphatic-TD.
All of them maintain an eligibility trace used in some TD-error updates.
12.11 Stable Off-policy Methods with Traces
GTD(λ): Gradient-TD (TDC)
$w_{t+1} = w_t + \alpha\, \delta_t^s z_t - \alpha\, \gamma_{t+1}(1 - \lambda_{t+1})(z_t^\top v_t)\, x_{t+1}$
$v_{t+1} = v_t + \beta\, \delta_t^s z_t - \beta\, (v_t^\top x_t)\, x_t$
GQ(λ): Gradient-TD for action values
$w_{t+1} = w_t + \alpha\, \delta_t^a z_t - \alpha\, \gamma_{t+1}(1 - \lambda_{t+1})(z_t^\top v_t)\, \bar{x}_{t+1}$
where
$\bar{x}_t = \sum_a \pi(a \mid S_t)\, x(S_t, a)$
$\delta_t^a = R_{t+1} + \gamma_{t+1}\, w_t^\top \bar{x}_{t+1} - w_t^\top x_t$
and the rest as in GTD(λ).
12.11 Stable Off-policy Methods with Traces
HTD(λ): hybrid between GTD(λ) and TD(λ)
$w_{t+1} = w_t + \alpha\, \delta_t^s z_t + \alpha\, ((z_t - z_t^b)^\top v_t)(x_t - \gamma_{t+1} x_{t+1})$
$v_{t+1} = v_t + \beta\, \delta_t^s z_t - \beta\, ((z_t^b)^\top v_t)(x_t - \gamma_{t+1} x_{t+1})$
$z_t = \rho_t (\gamma_t \lambda_t z_{t-1} + x_t)$
$z_t^b = \gamma_t \lambda_t z_{t-1}^b + x_t$
Emphatic TD(λ): extension of Emphatic-TD
$w_{t+1} = w_t + \alpha\, \delta_t z_t$
$\delta_t = R_{t+1} + \gamma_{t+1}\, w_t^\top x_{t+1} - w_t^\top x_t$
$z_t = \rho_t (\gamma_t \lambda_t z_{t-1} + M_t x_t)$
$M_t = \lambda_t I_t + (1 - \lambda_t) F_t$
$F_t = \rho_{t-1} \gamma_t F_{t-1} + I_t$
12.12 Implementation Issues
The trace has the dimension of $w$.
Eligibility traces seem unusable for tabular methods with many states.
Trick: list the small number of non-zero coordinates in the trace, update them, and drop them when they become too small.
13 Policy Gradient Methods
13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Tasks
13.7 Policy Parameterization for Continuous Actions
13.1 Policy Approximation and its Advantages
Parametric policy $\pi(a \mid s; \theta)$, differentiable with respect to $\theta$.
Soft-max in action preferences:
numerical preferences $h(a, s; \theta)$,
probability:
$\pi(a \mid s; \theta) = \frac{e^{h(a, s; \theta)}}{\sum_b e^{h(b, s; \theta)}}$
Lots of freedom in the parameterization of $h$: from linear to deep networks...
Can approach a deterministic policy or stochastic ones...
May be simpler than value-based methods.
13.2 The Policy Gradient Theorem
The performance measure $J(\theta)$ depends smoothly on $\theta$.
Easier to obtain convergence guarantees.
Episodic case, performance:
$J(\theta) = v_{\pi_\theta}(s_0)$
where $s_0$ is the initial state.
Policy gradient theorem:
$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s, a)\, \nabla \pi(a \mid s; \theta)$
Sketch of proof:
$\nabla v_\pi(s) = \nabla \Big( \sum_a \pi(a \mid s)\, q_\pi(s, a) \Big) = \sum_a \big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a) \big)$
$= \sum_a \Big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big) = \sum_x \sum_{k=0}^{\infty} P(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a)$
Stationary measure $\mu(s)$:
$\mu(s) = \frac{\sum_k P(s_0 \to s, k, \pi)}{\sum_{s'} \sum_k P(s_0 \to s', k, \pi)}$
Gradient of $J(\theta)$:
$\nabla J(\theta) = \Big( \sum_{s'} \sum_k P(s_0 \to s', k, \pi) \Big) \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$
13.3 REINFORCE: Monte Carlo Policy Gradient
Gradient of $J(\theta)$:
$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \nabla \pi(a \mid s; \theta)\, q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_a \nabla \pi(a \mid S_t; \theta)\, q_\pi(S_t, a) \right]$
All-actions SGD:
$\theta_{t+1} = \theta_t + \alpha \sum_a \nabla \pi(a \mid S_t; \theta)\, \hat{q}(S_t, a; w)$
REINFORCE uses a single action:
$\theta_{t+1} = \theta_t + \alpha\, G_t\, \frac{\nabla \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)}$
where $G_t$ is the usual MC return.
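A sketch of REINFORCE for a linear soft-max policy; the helper run_episode(policy), returning a list of (x_t, a_t, r_{t+1}) with x_t the state features, is an assumption of this example:

```python
import numpy as np

def reinforce_softmax(run_episode, n_features, n_actions,
                      alpha=0.01, gamma=1.0, n_episodes=1000):
    """REINFORCE with pi(a|s; theta) proportional to exp(theta[a] . x(s))."""
    theta = np.zeros((n_actions, n_features))

    def policy(x):
        prefs = theta @ x
        p = np.exp(prefs - prefs.max())
        return p / p.sum()                         # action probabilities

    for _ in range(n_episodes):
        episode = run_episode(policy)              # [(x_0, A_0, R_1), ...]
        G = 0.0
        for x, a, r in reversed(episode):          # returns computed backwards
            G = gamma * G + r
            pi = policy(x)
            grad_log = -np.outer(pi, x)            # grad of log pi(a|x; theta)
            grad_log[a] += x
            theta += alpha * G * grad_log
    return theta
```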
13.3 REINFORCE: Monte Carlo Policy Gradient
REINFORCE derivation:
$\nabla J(\theta) \propto \mathbb{E}_\pi\left[ \sum_a \nabla \pi(a \mid S_t; \theta)\, q_\pi(S_t, a) \right] = \mathbb{E}_\pi\left[ \sum_a \pi(a \mid S_t; \theta)\, \frac{\nabla \pi(a \mid S_t; \theta)}{\pi(a \mid S_t; \theta)}\, q_\pi(S_t, a) \right]$
$= \mathbb{E}_\pi\left[ \frac{\nabla \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)}\, q_\pi(S_t, A_t) \right] = \mathbb{E}_\pi\left[ G_t\, \frac{\nabla \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)} \right] = \mathbb{E}_\pi\left[ G_t\, \nabla \ln \pi(A_t \mid S_t; \theta) \right]$
13.3 REINFORCE: Monte Carlo Policy Gradient
SGD algorithm, hence good theoretical convergence properties.
The variance may be high.
13.4 REINFORCE with Baseline
Any baseline $b(s)$ satisfies
$\sum_a b(s)\, \nabla \pi(a \mid s; \theta) = 0$
Generalized policy gradient theorem:
$\nabla J(\theta) \propto \sum_s \mu(s) \sum_a \left( q_\pi(s, a) - b(s) \right) \nabla \pi(a \mid s; \theta)$
REINFORCE algorithm with baseline:
$\theta_{t+1} = \theta_t + \alpha\, (G_t - b(S_t))\, \frac{\nabla \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)}$
Allows variance reduction.
Natural choice for $b(S_t)$: $\hat{v}(S_t; w)$.
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
REINFORCE with baseline learns a policy and a state-value function, but does not use the latter to bootstrap.
Actor-critic methods: replace $G_t$ by a bootstrapped estimate:
$\theta_{t+1} = \theta_t + \alpha\, (G_{t:t+1} - \hat{v}(S_t; w))\, \frac{\nabla \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)} = \theta_t + \alpha\, \delta_t\, \frac{\nabla \pi(A_t \mid S_t; \theta)}{\pi(A_t \mid S_t; \theta)}$
Same trade-off with respect to MC as for TD...
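A sketch of the one-step actor-critic update with a linear critic and a linear soft-max actor; the feature vectors x, x_next and the parameter names are assumptions of this illustrative example:

```python
import numpy as np

def actor_critic_step(theta, w, x, a, r, x_next, done,
                      alpha_theta=0.01, alpha_w=0.05, gamma=1.0):
    """One transition of one-step actor-critic (TD error shared by actor and critic)."""
    prefs = theta @ x
    pi = np.exp(prefs - prefs.max()); pi /= pi.sum()   # soft-max policy at state x
    v = w @ x
    v_next = 0.0 if done else w @ x_next
    delta = r + gamma * v_next - v                     # TD error
    w = w + alpha_w * delta * x                        # critic: semi-gradient TD(0)
    grad_log = -np.outer(pi, x); grad_log[a] += x      # actor: grad log pi(a|x; theta)
    theta = theta + alpha_theta * delta * grad_log
    return theta, w
```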
13.5 Actor-Critic Methods
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Tasks
Most natural performance measure:
$J(\theta) = r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}[R_t \mid S_0, A_{0:t-1} \sim \pi] = \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$
$\mu$ is such that
$\sum_s \mu(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a) = \mu(s')$
Policy gradient theorem (continuing case):
$\nabla J(\theta) = \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$
13.6 Policy Gradient for Continuing Tasks
13.6 Policy Gradient for Continuing Tasks
Gradient of the state-value function:
$\nabla v_\pi(s) = \nabla \Big( \sum_a \pi(a \mid s)\, q_\pi(s, a) \Big) = \sum_a \big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla q_\pi(s, a) \big)$
$= \sum_a \Big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s)\, \nabla \Big( -r(\theta) + \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \Big) \Big)$
$= \sum_a \Big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \Big( -\nabla r(\theta) + \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big) \Big)$
13.6 Policy Gradient for Continuing Tasks
Gradient of $J(\theta) = r(\theta)$:
$\nabla J(\theta) = \sum_a \Big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big) - \nabla v_\pi(s)$
Averaging both sides over $\mu$:
$\nabla J(\theta) = \sum_s \mu(s) \Big( \sum_a \Big( \nabla \pi(a \mid s)\, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big) - \nabla v_\pi(s) \Big)$
$= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a) + \sum_{s'} \underbrace{\sum_s \mu(s) \sum_a \pi(a \mid s)\, p(s' \mid s, a)}_{\mu(s')}\, \nabla v_\pi(s') - \sum_s \mu(s)\, \nabla v_\pi(s)$
$= \sum_s \mu(s) \sum_a \nabla \pi(a \mid s)\, q_\pi(s, a)$
13.7 Policy Parameterization for Continuous Actions
Policy-based methods offer a way to deal with continuous action spaces.
The only requirement is a parametric policy that can be sampled.
For instance, a Gaussian parameterization with linear mean and linear log-variance.
Licence and Contributors
Creative Commons Attribution-ShareAlike (CC BY-SA 4.0)
You are free to:
Share:copy and redistribute the material in any medium or format
Adapt:remix, transform, and build upon the material for any purpose, even commercially.
Under the following terms:
Attribution:You must give appropriate credit, provide a link to the license, and indicate if changes
were made. You may do so in any reasonable manner, but not in any way that suggests the
licensor endorses you or your use.
ShareAlike:If you remix, transform, or build upon the material, you must distribute your
contributions under the same license as the original.
No additional restrictions:You may not apply legal terms or technological measures that legally
restrict others from doing anything the license permits.
Contributors
Main source: R. Sutton and A. Barto
Main contributor: E. Le Pennec
Contributors: S. Boucheron, A.K. Fermin, S. Gadat, S. Gaias,
A. Guilloux, Ch. Keribin, E. Matzner, E. Scornet.