Policy-Gradient for Deep Reinforcement Learning


Reinforcement Learning
Policy Gradient Methods

Reinforcement Learning (RL) - Introduction
▶ RL methods apply to problems where an agent interacts with an environment.
▶ At time t, the agent is in state s_t and performs an action a_t.
▶ At the next time step t+1, it arrives in state s_{t+1} and obtains a reward r_{t+1}.
▶ The goal of the agent is to maximize the reward it accumulates over the long term.

Markov Decision Process (MDP) - Introduction
RL problems are often described as MDPs:
▶ A state space S.
▶ An action space A of actions a.
▶ A transition model p(s'|s,a): the probability of arriving in state s' at time t+1 when being in state s and performing action a.
▶ A reward function r(s,a,s') defines the (stochastic) reward obtained after performing a in state s and arriving in s'.
▶ An initial state distribution p_0(s_0), defining from which states the agent is likely to start.

Markov Decision Process (MDP) - Introduction
▶ The behavior of the agent over time is a trajectory (also called episode, history, or roll-out):

  τ = (s_0, a_0, s_1, a_1, ..., s_H, a_H)

▶ Each transition occurs with a probability p(s'|s,a) and provides a certain amount of reward r(s,a,s').
▶ In episodic tasks, the horizon H is finite, while in continuing tasks, H is infinite (∞).
▶ The Markov property states that the next state depends only on the current state and action:

  p(s_{t+1}|s_t, a_t) = p(s_{t+1}|s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0)

▶ N.B. We need to provide enough information in the description of a state. If a transition depends on what happened in the past, put that information in the state description.

Policy and value functions
▶The policy defines the behavior of the agent: which action is
performed in each state.
▶ A deterministic policy µ(s_t) is a mapping S → A.
▶ A stochastic policy π : S → P(A) defines the probability distribution P(A) of performing an action. [1]

[1] Thomas Simonini - An introduction to Policy Gradients with Cartpole and Doom

Policy and value functions
▶ The performance of a policy is measured by the discounted return, i.e., the sum of all rewards received from time step t onwards:

  R_t = Σ_{k=0}^{H} γ^k r_{t+k+1}    (1)

  where 0 < γ ≤ 1 is the discount rate and r_{t+1} represents the reward obtained during the transition from s_t to s_{t+1}.
▶ If the task is episodic (H is finite), γ can be set to 1.
▶ If the task is continuing (H = ∞), γ must be chosen smaller than 1.
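As a small illustration of equation (1), the discounted return of an episode can be accumulated backwards in a few lines of Python; the function name and the example values below are purely illustrative:

def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for t = 0, given rewards [r_1, ..., r_{H+1}]."""
    R = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode
        R = r + gamma * R
    return R

# Example: rewards [1, 0, 2] with gamma = 0.9 -> 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))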

Policy and value function
▶ The Q-value of a state-action pair (s,a) is defined as the expected discounted return received if the agent takes action a in state s and follows the policy π thereafter:

  Q^π(s,a) = E_π[R_t | s_t = s, a_t = a]    (2)

▶ The Q-value of a pair (s,a) is the expectation of the return over all trajectories starting in (s,a) defined by the policy π.
▶ The value of a state s is the expected discounted return received if the agent starts in s and follows its policy π:

  V^π(s) = E_π[R_t | s_t = s]

Bellman Equations
▶ The value of a state depends on the values of the actions possible in that state, modulated by the probability the action will be taken by the policy π:

  V^π(s) = Σ_{a∈A} π(a|s) Q^π(s,a)    (3)

▶ For a deterministic policy, the value of a state is the value of the action that will be taken by the policy µ:

  V^µ(s) = Q^µ(s, µ(s))

Bellman Equations
▶ The return at time step t is the sum of the immediate reward r_{t+1} and the return at the next state R_{t+1} (discounted by γ):

  R_t = r_{t+1} + γ R_{t+1}

▶ The Q-value function can be written as:

  Q^π(s,a) = E_π[R_t | s_t = s, a_t = a]
           = E_π[r_{t+1} + γ R_{t+1} | s_t = s, a_t = a]
           = Σ_{s'} p(s'|s,a) [r(s,a,s') + γ V^π(s')]

▶ The value of an action depends on:
  ▶ which state s' we arrive in,
  ▶ with which probability this transition occurs, p(s'|s,a),
  ▶ how much reward we receive immediately, r(s,a,s'),
  ▶ how much we will receive later, V^π(s').

Bellman Equations
▶ We have the Bellman equation for the value of a state:

  V^π(s) = Σ_{a∈A} π(a|s) Q^π(s,a)
         = Σ_{a∈A} π(a|s) Σ_{s'} p(s'|s,a) [r(s,a,s') + γ V^π(s')]

▶ The value of a state depends on:
  ▶ the values of all other states,
  ▶ the current policy π,
  ▶ the dynamics of the MDP, p(s'|s,a) and r(s,a,s').
▶ We have the Bellman equation for the action at a state:

  Q^π(s,a) = E_π[R_t | s_t = s, a_t = a]
           = Σ_{s'} p(s'|s,a) [r(s,a,s') + γ V^π(s')]
           = Σ_{s'} p(s'|s,a) [r(s,a,s') + γ Σ_{a'∈A} π(a'|s') Q^π(s',a')]
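As a sanity check of the Bellman equations, here is a minimal NumPy sketch of iterated policy evaluation on a toy MDP; the 2-state, 2-action dynamics P, rewards R and policy pi are invented purely for illustration:

import numpy as np

n_states, n_actions = 2, 2
# Toy dynamics P[s, a, s'] and rewards R[s, a, s'] (made up for the example)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
pi = np.array([[0.5, 0.5], [0.3, 0.7]])   # stochastic policy pi(a|s)
gamma = 0.9

V = np.zeros(n_states)
for _ in range(200):                       # repeat the Bellman backup until convergence
    # Q^pi(s,a) = sum_s' p(s'|s,a) [r(s,a,s') + gamma * V^pi(s')]
    Q = (P * (R + gamma * V)).sum(axis=2)
    # V^pi(s) = sum_a pi(a|s) Q^pi(s,a)
    V = (pi * Q).sum(axis=1)
print(V)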

Policy search
▶ Policy search methods directly learn the policy π_θ with a parameterized function estimator (e.g., a neural network).
▶ The goal of the neural network is to maximize an objective function representing the return (i.e., the sum of rewards R(τ)) of the trajectories τ = (s_0, a_0, s_1, a_1, ..., s_H) selected by the policy π_θ:

  J(θ) = E_{τ∼ρ_θ}[R(τ)] = E_{τ∼ρ_θ}[ Σ_{t=0}^{H} γ^t r(s_t, a_t, s_{t+1}) ]    (4)

▶ The policy π_θ should generate trajectories τ with high returns R(τ) and avoid those with low returns.

Policy search
▶ The likelihood that a trajectory τ is generated by the policy π_θ is:

  ρ_θ(τ) = p_θ(s_0, a_0, ..., s_H) = p_0(s_0) Π_{t=0}^{H} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t)    (5)

▶ p_0(s_0) is the initial probability of starting in s_0 (independent from the policy).
▶ p(s_{t+1}|s_t, a_t) is the transition probability defining the MDP.
▶ The expectation in the objective function can be expanded:

  J(θ) = E_{τ∼ρ_θ}[R(τ)] = ∫_τ ρ_θ(τ) R(τ) dτ    (6)

▶ Monte-Carlo sampling could be used to estimate J(θ):

  J(θ) ≈ (1/N) Σ_{i=1}^{N} R(τ_i)    (7)

Policy search
Using Monte-Carlo sampling, we sample multiple trajectories {τ_i} and average their obtained returns:

  J(θ) ≈ (1/N) Σ_{i=1}^{N} R(τ_i)

This approach suffers from several problems:
▶ High variance: the trajectory space is huge, so we need a lot of sampled trajectories to properly estimate J(θ).
▶ Sample complexity: for stability, only small changes can be made to the policy at each iteration, so we need a lot of episodes.
▶ For continuing tasks (H = ∞), the return cannot be estimated as the episode never ends.

Policy gradient
▶ In policy gradient methods, we apply gradient ascent on the weights θ in order to maximize J(θ).
▶ All we need is the gradient ∇_θ J(θ) of the objective function w.r.t. the weights:

  ∇_θ J(θ) = ∂J(θ)/∂θ    (8)

▶ When a proper estimate of this policy gradient is obtained, we can perform gradient ascent:

  θ ← θ + η ∇_θ J(θ)    (9)

REINFORCE - Estimating the policy gradient
▶ Considering that the return R(τ) of a trajectory does not depend on the parameters θ, we have:

  ∇_θ J(θ) = ∇_θ ∫_τ ρ_θ(τ) R(τ) dτ = ∫_τ (∇_θ ρ_θ(τ)) R(τ) dτ    (10)

▶ We now use the log-trick, d log f(x)/dx = f'(x)/f(x), to rewrite the policy gradient of a single trajectory:

  ∇_θ ρ_θ(τ) = ρ_θ(τ) ∇_θ log ρ_θ(τ)

▶ The policy gradient becomes:

  ∇_θ J(θ) = ∫_τ ρ_θ(τ) ∇_θ log ρ_θ(τ) R(τ) dτ    (11)
           = E_{τ∼ρ_θ}[∇_θ log ρ_θ(τ) R(τ)]    (12)

▶ We can obtain an estimate of the policy gradient by sampling different trajectories {τ_i} and averaging ∇_θ log ρ_θ(τ_i) R(τ_i).

REINFORCE - Estimating the policy gradient
▶ How to compute the gradient of the log-likelihood of a trajectory, log ρ_θ(τ)?

  log ρ_θ(τ) = log ( p_0(s_0) Π_{t=0}^{H} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t) )
             = log p_0(s_0) + Σ_{t=0}^{H} log π_θ(a_t|s_t) + Σ_{t=0}^{H} log p(s_{t+1}|s_t, a_t)

▶ log p_0(s_0) and log p(s_{t+1}|s_t, a_t) do not depend on the parameters θ (they are defined by the MDP), so the gradient of the log-likelihood is simply:

  ∇_θ log ρ_θ(τ) = Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t)    (13)

▶ ∇_θ log π_θ(a_t|s_t) is called the score function.

REINFORCE - Estimating the policy gradient
▶ The policy gradient is independent from the MDP dynamics, allowing model-free learning:

  ∇_θ J(θ) = E_{τ∼ρ_θ}[ Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) R(τ) ]    (14)
           = E_{τ∼ρ_θ}[ Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) ( Σ_{t=0}^{H} γ^t r_{t+1} ) ]    (15)

▶ Estimating the policy gradient can now be done using Monte-Carlo sampling.
▶ The resulting algorithm is called the REINFORCE algorithm (Williams, 1992).

REINFORCE algorithm
While not converged:
1. Sample N trajectories {τ_i} using the current policy π_θ and observe the returns {R(τ_i)}.
2. Estimate the policy gradient as an average over the sampled trajectories:

   ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) R(τ_i)

3. Update the parameters with gradient ascent:

   θ ← θ + η ∇_θ J(θ)
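A minimal PyTorch sketch of one REINFORCE iteration, assuming a policy network policy_net that outputs action probabilities, a Gymnasium-style environment env with a discrete action space, and an optimizer created elsewhere; all names and hyper-parameters are illustrative, not the deck's reference implementation:

import torch
from torch.distributions import Categorical

def reinforce_update(policy_net, optimizer, env, n_trajectories=10, gamma=0.99, horizon=200):
    """Sample N trajectories with pi_theta and perform one gradient ascent step on J(theta)."""
    losses = []
    for _ in range(n_trajectories):
        obs, _ = env.reset()
        log_probs, rewards = [], []
        for _ in range(horizon):
            probs = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            dist = Categorical(probs)                     # pi_theta(a|s) over discrete actions
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            if terminated or truncated:
                break
        # Return of the whole trajectory: R(tau) = sum_t gamma^t r_{t+1}
        R = sum(gamma**t * r for t, r in enumerate(rewards))
        # Gradient ascent on J(theta) = descent on -sum_t log pi_theta(a_t|s_t) * R(tau)
        losses.append(-torch.stack(log_probs).sum() * R)
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()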

REINFORCE - Reducing the variance - Reward scaling
While not converged:
1. Sample N trajectories {τ_i} using the current policy π_θ and observe the returns {R(τ_i)}.
2. Compute the mean return over the batch:

   R̂ = (1/N) Σ_{i=1}^{N} R(τ_i)

3. Estimate the policy gradient as an average over the sampled trajectories:

   ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) (R(τ_i) − R̂)

4. Update the parameters with gradient ascent:

   θ ← θ + η ∇_θ J(θ)

This algorithm is called REINFORCE with baseline.

REINFORCE - Reducing the variance - Reward scaling
▶ Subtracting a constant b from the returns still leads to an unbiased estimate of the gradient:

  ∇_θ J(θ) = E_{τ∼ρ_θ}[∇_θ log ρ_θ(τ) (R(τ) − b)]    (16)

▶ We have:

  E_{τ∼ρ_θ}[∇_θ log ρ_θ(τ) b] = ∫_τ ρ_θ(τ) ∇_θ log ρ_θ(τ) b dτ
                               = ∫_τ ∇_θ ρ_θ(τ) b dτ
                               = b ∇_θ ∫_τ ρ_θ(τ) dτ = b ∇_θ 1 = 0

▶ If b does not depend on θ, the estimator remains unbiased.
▶ Advantage actor-critic methods replace the constant b with an estimate of the value of each state, V̂(s_t).

Policy Gradient Theorem
▶ REINFORCE estimate of the policy gradient after sampling:

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) R(τ_i)
           = (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) ( Σ_{k=0}^{H} γ^k r(s_k, a_k, s_{k+1}) )

▶ For each transition (s_t, a_t), the gradient of its log-likelihood ∇_θ log π_θ(a_t|s_t) is multiplied with the return of the whole episode R(τ) = Σ_{k=0}^{H} γ^k r(s_k, a_k, s_{k+1}).
▶ Causality principle: the reward received at k = 0 does not depend on actions taken in the future, so past rewards can be dropped from the inner sum:

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) ( Σ_{k=t}^{H} γ^{k−t} r(s_k, a_k, s_{k+1}) )    (17)

▶ Σ_{k=t}^{H} γ^{k−t} r(s_k, a_k, s_{k+1}) is called the reward to go from the transition (s_t, a_t).
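The reward-to-go vector of equation (17) can be computed in a single backward pass over an episode; a small Python sketch (the function name is illustrative):

def rewards_to_go(rewards, gamma=0.99):
    """Return [sum_{k>=t} gamma^(k-t) * r_k] for every step t of an episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# Each log pi_theta(a_t|s_t) is then weighted by rtg[t] instead of the full return R(tau).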

Policy Gradient Theorem
▶ The Q-value of a state-action pair is:

  Q^{π_θ}(s,a) = E_{π_θ}[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s, A_t = a]
               = E_{π_θ}[ Σ_{k=t}^{H} γ^{k−t} R_{k+1} | S_t = s, A_t = a]

▶ The Q-value of an action (s,a) is the expectation of the reward to go after taking a in s, so it can replace the sampled reward to go:

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) ( Σ_{k=t}^{H} γ^{k−t} r(s_k, a_k, s_{k+1}) )
           ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t)

Policy Gradient Theorem
  ∇_θ J(θ) = E_{s_t∼ρ_θ, a_t∼π_θ}[ Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t) ]

▶ We can sample the expectation above, given a whole episode.
▶ Typically, people pull out the sum and split it up into separate gradients, e.g.,

  Δθ_t = ∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t) = ∇_θ log π_θ(a_t|s_t) G_t

  such that E_{π_θ}[Σ_t Δθ_t] = ∇_θ J(θ).
▶ Thus, we have:

  ∇_θ J(θ) = E_{s_t∼ρ_θ, a_t∼π_θ}[∇_θ log π_θ(a_t|s_t) Q^{π_θ}(s_t, a_t)]    (18)

Policy Gradient Theorem
  ∇_θ J(θ) = E_{s∼ρ_θ, a∼π_θ}[∇_θ log π_θ(a|s) Q^{π_θ}(s,a)]

▶ The actual return R(τ) is replaced by its expectation Q^{π_θ}(s,a).
▶ The policy gradient is now an expectation over single transitions instead of complete trajectories, allowing bootstrapping as in TD methods.
▶ However, Q^{π_θ}(s,a) is unknown.
▶ It is possible to estimate the Q-values with a function approximator Q_ϕ(s,a) with parameters ϕ:

  ∇_θ J(θ) = E_{s∼ρ_θ, a∼π_θ}[∇_θ log π_θ(a|s) Q_ϕ(s,a)]

▶ The resulting algorithm belongs to the actor-critic class.

Policy Gradient Theorem
▶ The actor π_θ(a|s) approximates the policy by maximizing J(θ):

  ∇_θ J(θ) = E_{s∼ρ_θ, a∼π_θ}[∇_θ log π_θ(a|s) Q_ϕ(s,a)]

▶ The critic Q_ϕ(s,a) estimates the Q-values of the policy by minimizing the MSE with the true Q-value:

  (Q^{π_θ}(s,a) − Q_ϕ(s,a))²

Advantage Actor-Critic Methods
▶ In REINFORCE, the policy gradient is estimated by:

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) R(τ_i)

▶ The Policy Gradient Theorem then gives the formulation:

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) Q_ϕ(s_t, a_t)

▶ To reduce variance, we can employ a baseline b:

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) (Q^{π_θ}(s_t, a_t) − b)

▶ We take the baseline b = V^{π_θ}(s):

  ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) (Q^{π_θ}(s_t, a_t) − V^{π_θ}(s_t))

Advantage Actor-Critic Methods
▶ The factor multiplying the log-likelihood of the policy is:

  A^{π_θ}(s,a) = Q^{π_θ}(s,a) − V^{π_θ}(s)    (19)

  which is the advantage of taking action a in state s. We would need to approximate both functions Q^{π_θ}(s,a) and V^{π_θ}(s).
▶ Advantage actor-critic methods directly estimate the advantage of an action:

  ∇_θ J(θ) = E_{s∼ρ_π, a∼π_θ}[∇_θ log π_θ(a|s) A_ϕ(s,a)]    (20)

▶ A_ϕ(s,a) is called the advantage estimate, and should be equal to the real advantage in expectation:

  A^{π_θ}(s,a) = E_{s∼ρ_θ, a∼π_θ}[A_ϕ(s,a)]

Advantage Actor-Critic Methods
Different methods can be used to compute the advantage estimate A_ϕ(s,a):
▶ MC advantage estimate: requires finite episodes, slow updates

  A_ϕ(s,a) = R(s,a) − V_ϕ(s)    (21)

▶ TD advantage estimate: can be unstable

  A_ϕ(s,a) = r(s,a,s') + γ V_ϕ(s') − V_ϕ(s)    (22)

▶ n-step advantage estimate: a trade-off between MC and TD

  A_ϕ(s,a) = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V_ϕ(s_{t+n+1}) − V_ϕ(s_t)    (23)

  which is at the core of A2C and A3C.
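A sketch of the n-step advantage estimate, assuming a critic value_fn that maps a state to a scalar value, a rollout stored as lists (with states holding one more entry than rewards, the state reached after the last reward), and the usual convention of bootstrapping at the state reached after n steps; all names and the indexing convention are illustrative, not the exact formulation of equation (23):

def n_step_advantage(rewards, states, t, n, value_fn, gamma=0.99, done=False):
    """A(s_t, a_t) ~ sum_{k=0}^{n-1} gamma^k r_{t+k+1} + gamma^n V(s_{t+n}) - V(s_t)."""
    horizon = min(n, len(rewards) - t)                 # truncate at the end of the rollout
    # rewards[t + k] stores r_{t+k+1}, the reward received after action a_{t+k}
    target = sum(gamma**k * rewards[t + k] for k in range(horizon))
    last = t + horizon
    if not (done and last == len(rewards)):            # no bootstrap after a terminal state
        target += gamma**horizon * value_fn(states[last])
    return target - value_fn(states[t])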

Advantage Actor-Critic (A2C)
▶ The n-step advantage estimate uses the n next immediate rewards and approximates the rest with the value of the state visited n steps later:

  ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t|s_t) ( Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V_ϕ(s_{t+n+1}) − V_ϕ(s_t) ) ]    (24)

▶ TD can be seen as a 1-step algorithm.
▶ For sparse rewards, n-step allows to update the n last actions which led to a win/loss, instead of only the last one as in TD. Also, there is no need for finite episodes as in MC.
▶ n-step estimation ensures a trade-off between:
  ▶ bias: wrong updates based on estimated values, as in TD.
  ▶ variance: variability of the obtained returns, as in MC.

Advantage Actor-Critic (A2C)
▶ The actor outputs the policy π_θ for a state s, i.e., a vector of probabilities for each action.
▶ The critic outputs the value V_ϕ(s) of a state s.

Advantage Actor-Critic (A2C)
1. Acquire a batch of transitions (s, a, r, s') using the current policy π_θ.
2. Compute the n-step return for each state encountered:

   R_t = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V_ϕ(s_{t+n+1})

3. Update the actor with the policy gradient:

   ∇_θ J(θ) = Σ_t ∇_θ log π_θ(a_t|s_t) (R_t − V_ϕ(s_t))

4. Update the critic by minimizing:

   L(ϕ) = Σ_t (R_t − V_ϕ(s_t))²

5. Repeat.
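A hedged PyTorch sketch of steps 3 and 4 above for one rollout, assuming actor returns action probabilities, critic returns state values, tensors states, actions, returns hold the n-step targets, and separate optimizers exist; the names and structure are illustrative only:

import torch
from torch.distributions import Categorical

def a2c_update(actor, critic, actor_opt, critic_opt, states, actions, returns):
    values = critic(states).squeeze(-1)                           # V_phi(s_t)
    advantages = returns - values.detach()                        # R_t - V_phi(s_t), critic not updated here

    dist = Categorical(actor(states))                             # pi_theta(a|s)
    actor_loss = -(dist.log_prob(actions) * advantages).mean()    # descent on -J(theta) = ascent on J(theta)
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    critic_loss = ((returns - values) ** 2).mean()                # L(phi) = sum_t (R_t - V_phi(s_t))^2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()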

Advantage Actor-Critic (A2C) - Pseudocode
▶ Initialize the actor π_θ and the critic V_ϕ with random weights.
▶ Observe the initial state s_0.
▶ For t ∈ [0, T_total]:
  1. Initialize an empty episode minibatch.
  2. For k ∈ [0, n]:  # Sample an episode of n steps
     2.1 Select an action a_k using the actor π_θ.
     2.2 Perform a_k and observe the next state s_{k+1} and the reward r_{k+1}.
     2.3 Store (s_k, a_k, r_{k+1}) in the episode minibatch.
  3. If s_n is not terminal: set R = V_ϕ(s_n) with the critic, else R = 0.
  4. Initialize dθ and dϕ to 0.
  5. For k ∈ [n−1, 0]:  # Backwards iteration over the episode
     5.1 R = r_k + γ R
     5.2 dθ ← dθ + ∇_θ log π_θ(a_k|s_k) (R − V_ϕ(s_k))
     5.3 dϕ ← dϕ + ∇_ϕ (R − V_ϕ(s_k))²

Advantage Actor-Critic (A2C) - Pseudocode
▶ For t ∈ [0, T_total] (continued):
  ▶ ...
  6. Update the actor: θ ← θ + η dθ
  7. Update the critic: ϕ ← ϕ − η dϕ

N.B.
▶ Not all states are updated with the same horizon n.
▶ The last action in the sampled episode will only use the last reward and the value of the final state (TD learning).
▶ The first action will use the n accumulated rewards.
▶ A2C performs online learning: a couple of transitions are explored using the current policy, which is immediately updated.

Advantage Actor-Critic (A2C)
▶ As for value-based networks (e.g., DQN), the underlying neural network will be affected by the correlated inputs and outputs: a single batch contains similar states and actions (e.g., consecutive frames of a video game).
▶ A2C and A3C do not use an experience replay memory as DQN does,
▶ but they use multiple parallel actors and learners.
▶ The actor and critic are stored in a global network.
▶ Multiple instances of the environment are created in parallel threads (workers).

Advantage Actor-Critic (A2C)
▶ Initialize the actor π_θ and the critic V_ϕ in the global network.
▶ Repeat:
  1. For each worker i in parallel:
     1.1 Get a copy of the current actor π_θ and critic V_ϕ.
     1.2 Sample an episode of n steps.
     1.3 Compute the accumulated gradients dθ_i and dϕ_i.
  2. Wait for all workers to terminate.
  3. Merge all gradients into dθ and dϕ.
  4. Update the global networks with dθ and dϕ.
▶ Each worker explores different regions of the environment so that the final batch for training the global networks is less correlated:
  ▶ Set different initial states in each worker.
  ▶ Use different exploration rates.
  ▶ ...
▶ This method is easy for simulated environments (e.g., video games), but difficult for real-world systems like robots.

(Asynchronous) Advantage Actor-Critic (A2C - A3C)

Asynchronous Advantage Actor-Critic (A3C)
▶ A3C extends A2C by removing the need for synchronization between the workers at the end of each episode before applying the gradients.
▶ In A2C, gradient merging and parameter updates are sequential operations, so there is no significant speedup when the number of workers is increased.
▶ In A3C, each worker reads and writes the network parameters whenever it wants.
▶ The obtained parameters would be a mix of different networks!
▶ However, if the learning rate is small enough, there is anyway not a big difference between two successive versions of the network parameters.

Asynchronous Advantage Actor-Critic (A3C)
▶ Initialize the actor π_θ and the critic V_ϕ in the global network.
▶ For each worker i in parallel:
  ▶ Repeat:
    1. Get a copy of the current actor π_θ and critic V_ϕ.
    2. Sample an episode of n steps.
    3. Compute the accumulated gradients dθ_i and dϕ_i.
    4. Update the global networks asynchronously with dθ_i and dϕ_i.
▶ In the A3C paper (Mnih et al., 2016), Atari games can be solved using 16 CPU cores instead of a powerful GPU as in DQN, achieving better performance in less training time (1 day instead of 8).
▶ The more workers, the faster the computations, and the better the performance (as the policy updates are less correlated).

A3C - Entropy Regularization
▶ In actor-critic methods, exploration relies on the learned policies being stochastic (on-policy): π(a|s) describes the probability of taking action a in state s.
▶ To enforce exploration, A3C adds an entropy regularization term to the policy gradient:

  ∇_θ J(θ) = E_{π_θ}[∇_θ log π_θ(a_t|s_t) (R_t − V_ϕ(s_t)) + β ∇_θ H(π_θ(s_t))]    (25)

▶ The entropy of the policy for a state s_t can be computed as:

  H(π_θ(s_t)) = − Σ_a π_θ(a|s_t) log π_θ(a|s_t)

▶ H(π_θ(s_t)) measures the randomness of a policy:
  ▶ Fully deterministic policy: the entropy is zero.
  ▶ Completely random policy: the entropy is maximal.
▶ β controls the level of regularization.
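The entropy bonus can be added to the actor loss in one line; a sketch assuming probs are the actor's output probabilities for a batch of states (the value of beta and the epsilon used for numerical stability are illustrative choices):

import torch

def entropy_bonus(probs, beta=0.01, eps=1e-8):
    """H(pi_theta(s)) = -sum_a pi(a|s) log pi(a|s), averaged over the batch and scaled by beta."""
    entropy = -(probs * torch.log(probs + eps)).sum(dim=-1)   # one entropy value per state
    return beta * entropy.mean()

# actor_loss = -(log_probs * advantages).mean() - entropy_bonus(probs)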

Policy Gradient methods
Different versions of the policy gradient:

  ∇_θ J(θ) = E_{s_t∼ρ_π, a_t∼π_θ}[∇_θ log π_θ(s_t, a_t) ψ_t]    (26)

where:
1. ψ_t = R_t is the REINFORCE algorithm (MC sampling).
2. ψ_t = R_t − b is the REINFORCE with baseline algorithm.
3. ψ_t = Q^{π_θ}(s_t, a_t) is the policy gradient theorem.
4. ψ_t = A^{π_θ}(s_t, a_t) is the advantage actor-critic.
5. ψ_t = r_{t+1} + γ V^π(s_{t+1}) − V^π(s_t) is the TD actor-critic.
6. ψ_t = Σ_{k=0}^{n−1} γ^k r_{t+k+1} + γ^n V^π(s_{t+n+1}) − V^π(s_t) is the n-step algorithm (A2C).

Policy Gradient methods
  ∇_θ J(θ) = E_{s_t∼ρ_π, a_t∼π_θ}[∇_θ log π_θ(s_t, a_t) ψ_t]

▶ The more ψ_t relies on real rewards (e.g., R_t), the more the gradient will be correct on average (small bias), but the more it will vary (high variance).
▶ → The sample complexity is increased: we need to average more samples to correctly estimate the gradient.
▶ The more ψ_t relies on estimations (e.g., the TD error), the more stable the gradient (small variance), but the more incorrect it is (high bias).
▶ → This can lead to sub-optimal policies.

Off-policy Actor-Critic
▶ Actor-critic methods are generally on-policy: the actions used to explore the environment must be generated by the actor. Otherwise, the feedback provided by the critic (the advantage) will introduce a huge bias (i.e., an error) in the policy gradient.

  ∇_θ J(θ) = E_{s∼ρ_π, a∼π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s,a)]

▶ The state distribution ρ_π defines the states that can be visited using the actor policy π_θ.
▶ If, during MC sampling of the policy gradient, the states s do not come from this distribution, the approximated policy gradient will be wrong (high bias) and the resulting policy will be sub-optimal.
▶ Sample complexity: if the actor is initialized in a flat region of the reward space (where there are not a lot of rewards), gradient updates only change the policy slightly, and it may take a lot of iterations until interesting policies are discovered.

Off-policy Actor-Critic
Off-policy methods use a behavior policy b(a|s) to explore the environment and train the target policy π(a|s).
▶ If the behavior policy does not explore the optimal actions, the target policy cannot find them by itself, except by chance.
▶ If the behavior policy is good enough, this can reduce the amount of exploration needed.
▶ Q-learning:

  δ = r(s,a,s') + γ max_{a'} Q^π(s',a') − Q^π(s,a)

▶ SARSA:

  δ = r(s,a,s') + γ Q^π(s', π(s')) − Q^π(s,a)

▶ SARSA uses the next action sampled from π(a'|s') to update the current transition. This next action must be performed. The policy must be ϵ-soft (e.g., ϵ-greedy).

Off-policy Actor-Critic
▶ In Q-learning, the behavior policy b(a|s) must be able to explore actions which could be selected by the target policy:

  π(a|s) > 0 → b(a|s) > 0

▶ There are mostly two ways to create the behavior policy:
  ▶ Use expert knowledge / human demonstrations.
  ▶ Derive it from the target policy.
▶ In Q-learning, the target policy can be deterministic, i.e., always select the greedy action (with the maximum Q-value).
▶ The behavior policy can be derived from the target policy by making it ϵ-soft, for example ϵ-greedy.
▶ Off-policy learning allows the use of an experience replay memory as in DQN: the transitions used for training the target policy were generated by an older version of it. A3C, in contrast, stays on-policy: multiple parallel learners are used to solve the correlation problems of inputs and outputs.

Off-policy methods - Importance sampling
▶ Off-policy methods learn a target policy π(a|s) while exploring with a behavior policy b(a|s).
▶ In policy gradient methods, we want to maximize the expected return of trajectories:

  J(θ) = E_{τ∼ρ_θ}[R(τ)] = ∫_τ ρ_θ(τ) R(τ) dτ ≈ (1/N) Σ_{i=1}^{N} R(τ_i)

  where ρ_θ is the distribution of trajectories τ generated by the target policy π_θ.
▶ If we use a behavior policy to generate the trajectories, what we are actually estimating is:

  Ĵ(θ) = E_{τ∼ρ_b}[R(τ)] = ∫_τ ρ_b(τ) R(τ) dτ

  where ρ_b is the distribution of the trajectories generated by the behavior policy. Thus, in general, Ĵ(θ) can be different from J(θ).

Off-policy methods - Importance sampling
We can rewrite the objective function as:

  J(θ) = E_{τ∼ρ_θ}[R(τ)]
       = ∫_τ ρ_θ(τ) R(τ) dτ
       = ∫_τ (ρ_b(τ) / ρ_b(τ)) ρ_θ(τ) R(τ) dτ
       = ∫_τ ρ_b(τ) (ρ_θ(τ) / ρ_b(τ)) R(τ) dτ
       = E_{τ∼ρ_b}[ (ρ_θ(τ) / ρ_b(τ)) R(τ) ]

Importance sampling
  J(θ) = E_{τ∼ρ_b}[ (ρ_θ(τ) / ρ_b(τ)) R(τ) ]    (27)

▶ ρ_θ(τ) / ρ_b(τ) is the importance sampling weight of the trajectory τ.
▶ If a trajectory τ generated by b is associated with a lot of reward R(τ) with high probability ρ_b(τ), then the actor should learn to reproduce that trajectory with high probability ρ_θ(τ) as well, in order to maximize J(θ).
▶ If the associated reward is low (R(τ) ≈ 0), the target policy can forget about it (by setting ρ_θ(τ) ≈ 0).

Importance sampling
  J(θ) = E_{τ∼ρ_b}[ (ρ_θ(τ) / ρ_b(τ)) R(τ) ]

▶ Using the definition of the likelihood of a trajectory:

  ρ_θ(τ) / ρ_b(τ) = ( p_0(s_0) Π_{t=0}^{T} π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t) ) / ( p_0(s_0) Π_{t=0}^{T} b(a_t|s_t) p(s_{t+1}|s_t, a_t) )
                  = Π_{t=0}^{T} π_θ(a_t|s_t) / Π_{t=0}^{T} b(a_t|s_t)
                  = Π_{t=0}^{T} π_θ(a_t|s_t) / b(a_t|s_t)

▶ J(θ) can then be estimated by MC sampling:

  J(θ) ≈ (1/m) Σ_{i=1}^{m} (ρ_θ(τ_i) / ρ_b(τ_i)) R(τ_i)    (28)
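A sketch of the trajectory-level importance weight used in equation (28), computed in log-space for numerical stability; the per-step probability lists are assumed to be stored during sampling as described on the next slide, and the function name is illustrative:

import math

def importance_weight(pi_probs, b_probs):
    """rho_theta(tau) / rho_b(tau) = prod_t pi_theta(a_t|s_t) / b(a_t|s_t), computed in log-space."""
    log_w = sum(math.log(p) - math.log(q) for p, q in zip(pi_probs, b_probs))
    return math.exp(log_w)

# J(theta) is then approximated by the mean of importance_weight(...) * R(tau_i) over trajectories.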

Importance sampling
1. Sample N trajectories τ_i using the behavior policy. For each transition (s_t, a_t, s_{t+1}), store:
   ▶ the reward r_{t+1},
   ▶ the probability b(a_t|s_t) that the behavior policy generates this transition,
   ▶ the probability π_θ(a_t|s_t) that the target policy generates this transition.
2. Estimate the objective function:

   J(θ) = (1/N) Σ_{i=1}^{N} ( Π_{t=0}^{H} π_θ(a_t|s_t) / b(a_t|s_t) ) ( Σ_{t=0}^{H} γ^t r_{t+1} )

3. Compute the gradient of J(θ).
4. Update the parameters with gradient ascent.

Policy gradient with importance sampling
  ∇_θ J(θ) = E_{τ∼ρ_b}[ ∇_θ log ρ_θ(τ) (ρ_θ(τ) / ρ_b(τ)) R(τ) ]    (29)

▶ The return after being in a state s_t only depends on future states.
▶ The importance sampling weight only depends on the past weights.

  ∇_θ J(θ) = E_{τ∼ρ_b}[ Σ_{t=0}^{H} ∇_θ log π_θ(a_t|s_t) ( Π_{t'=0}^{t} π_θ(a_{t'}|s_{t'}) / b(a_{t'}|s_{t'}) ) ( Σ_{t'=t}^{H} γ^{t'−t} r(s_{t'}, a_{t'}) ) ]    (30)

Self-Imitation Learning (SIL)
▶ The SIL method extends on-policy actor-critic algorithms (e.g., A2C) with a replay buffer to feed past good experiences to the neural network and speed up learning.
▶ It uses only transitions whose actual return is higher than their current estimated value.
▶ Two additional loss functions for the actor and the critic:

  L^SIL_actor(θ) = E_{s,a∈D}[ log π_θ(a|s) (R(s,a) − V_ϕ(s))⁺ ]
  L^SIL_critic(ϕ) = E_{s,a∈D}[ ((R(s,a) − V_ϕ(s))⁺)² ]

  where (x)⁺ = max(0, x) is the positive function.
▶ Transitions sampled from the replay buffer will participate in the off-policy learning only if their return is higher than the (currently known) expected value of the state V_ϕ(s).
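A PyTorch sketch of the two additional SIL losses, assuming batches states, actions, returns sampled from the replay buffer D and networks actor (action probabilities) and critic (state values); the clamp implements (x)⁺ = max(0, x), and all names are illustrative:

import torch
from torch.distributions import Categorical

def sil_losses(actor, critic, states, actions, returns):
    values = critic(states).squeeze(-1)
    gap = torch.clamp(returns - values, min=0.0)             # (R(s,a) - V_phi(s))^+

    log_probs = Categorical(actor(states)).log_prob(actions)
    actor_loss = -(log_probs * gap.detach()).mean()          # SIL actor loss (sign flipped for minimization)
    critic_loss = (gap ** 2).mean()                          # SIL critic loss
    return actor_loss, critic_loss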

Self-Imitation Learning (SIL) - Pseudocode
▶ Initialize the actor π_θ and the critic V_ϕ with random weights.
▶ Initialize the prioritized experience replay buffer D.
▶ Observe the initial state s_0.
▶ For t ∈ [0, T_total]:
  1. Initialize an empty episode minibatch.
  2. For k ∈ [0, n]:  # Sample an episode
     2.1 Select an action a_k using the actor π_θ.
     2.2 Perform a_k and observe s_{k+1} and r_{k+1}.
     2.3 Store (s_k, a_k, r_{k+1}) in the episode minibatch.
  3. If s_n is not terminal: set R_n = V_ϕ(s_n), else R_n = 0.
  4. For k ∈ [n−1, 0]:  # Backward iteration over the episode
     4.1 R_k = r_k + γ R_{k+1}
     4.2 Store (s_k, a_k, R_k) in the replay buffer D.
  5. Update the actor and the critic on the episode minibatch:

     θ ← θ + η Σ_k ∇_θ log π_θ(a_k|s_k) (R_k − V_ϕ(s_k))
     ϕ ← ϕ + η Σ_k ∇_ϕ (R_k − V_ϕ(s_k))²

Self-Imitation Learning (SIL) - Pseudocode
▶ ...
  6. For m ∈ [0, M]:  # Off-policy updates from the replay buffer
     6.1 Sample a minibatch of K transitions (s_k, a_k, R_k) from the replay buffer D, prioritized with high (R_k − V_ϕ(s_k)).
     6.2 Update the actor and the critic with the SIL losses:

        θ ← θ + η Σ_k ∇_θ log π_θ(a_k|s_k) (R_k − V_ϕ(s_k))⁺
        ϕ ← ϕ + η Σ_k ∇_ϕ ((R_k − V_ϕ(s_k))⁺)²

▶ A2C+SIL is shown to have a better performance than SoTA methods (A3C, TRPO, Reactor, PPO) on Atari games and continuous control problems (MuJoCo).

Deterministic Policy Gradient
▶ So far, in policy gradient methods, we have assumed a stochastic policy π_θ(a_t|s_t) assigning probabilities to each discrete action (or some distribution for continuous actions).
▶ Stochastic policies ensure exploration of the action space: all actions have a non-zero probability of being selected.
▶ Drawbacks:
  ▶ The policy gradient theorem only works on-policy. This prevents the use of an experience replay memory as in DQN to stabilize learning. Importance sampling can help, but is unstable for long trajectories.
  ▶ Because of the stochasticity of the policy, the returns may vary a lot between two episodes generated by the same optimal policy → a lot of variance → worse sample complexity than value-based methods → more samples are needed to get rid of this variance.

Deterministic Policy Gradient
▶ Value-based methods like DQN produce a deterministic policy.
▶ After learning, the action to be executed is the greedy action:

  a*_t = argmax_a Q_θ(s_t, a)

▶ Exploration is enforced by forcing the behavior policy (the one used to generate samples) to be stochastic (ϵ-greedy), but the learned policy is itself deterministic.
▶ This is off-policy learning: a different policy than the learned one is used to explore.
▶ When using an experience replay memory, the behavior policy is simply an older version of the learning policy (samples stored in the ERM were generated by an older version of the actor).

Deterministic Policy Gradient
▶ We want to learn a parameterized deterministic policy µ_θ(s).
▶ The goal is to maximize the expectation, over all states reachable by the policy, of the reward to go (return) after each action:

  J(θ) = E_{s∼ρ_µ}[R(s, µ_θ(s))]    (31)

▶ As in the stochastic case, the distribution of reachable states is impossible to estimate, so we have to perform approximations.
▶ The Q-value of an action is the expectation of the reward to go after that action: Q^π(s,a) = E_π[R(s,a)].
▶ Maximizing the returns or maximizing the true Q-value of all actions leads to the same optimal policy.
▶ Like in Q-learning, the critic estimates the Q-value of all state-action pairs and the policy is changed by selecting the action with the maximal Q-value: a*_t = argmax_a Q_θ(s_t, a).

Deterministic Policy Gradient
▶ In continuous control, the gradient of the objective function is the same as the gradient of the Q-value.
▶ If we have an unbiased estimate Q^µ(s,a) of the value of any action in s, changing the policy µ_θ(s) in the direction of ∇_θ Q^µ(s,a) leads to an action with a higher Q-value:

  ∇_θ J(θ) = E_{s∼ρ_µ}[∇_θ Q^µ(s,a)|_{a=µ_θ(s)}]    (32)

▶ The gradient with respect to the action a of the Q-value is taken at a = µ_θ(s). Using the chain rule:

  ∇_θ J(θ) = E_{s∼ρ_µ}[∇_θ µ_θ(s) × ∇_a Q^µ(s,a)|_{a=µ_θ(s)}]    (33)

  in which:

  ∂Q(s,a)/∂θ = (∂Q(s,a)/∂a) × (∂a/∂θ)
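In practice, the chain rule of equation (33) does not have to be applied by hand: backpropagating through the critic and the actor performs it automatically. A hedged PyTorch sketch, with actor mapping states to continuous actions and critic taking (state, action) pairs; the names and the optimizer are illustrative:

import torch

def dpg_actor_update(actor, critic, actor_opt, states):
    # Maximize Q_phi(s, mu_theta(s)): autograd applies dQ/da * da/dtheta for us.
    actions = actor(states)
    actor_loss = -critic(states, actions).mean()   # gradient ascent = descent on the negative Q
    actor_opt.zero_grad()
    actor_loss.backward()                          # gradients flow through the critic into the actor
    actor_opt.step()                               # only the actor parameters are stepped here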

Deterministic Policy Gradient
  ∇_θ J(θ) = E_{s∼ρ_µ}[∇_θ µ_θ(s) × ∇_a Q^µ(s,a)|_{a=µ_θ(s)}]

▶ The first term defines how the action changes when the parameters θ of the actor change.
▶ The second term defines how the Q-value of an action changes when we vary the action slightly (e.g., if the robot moves its joint a bit more to the right, does it get a higher Q-value?).
▶ This looks like an actor-critic architecture.
▶ ∇_θ µ_θ(s) only depends on the actor, while ∇_a Q^µ(s,a) is like a critic, telling the actor in which direction to change its policy.
▶ How to obtain an unbiased estimate of the Q-value of any action and compute its gradient?
▶ We can use a function approximator Q_ϕ(s,a) and minimize the quadratic error with the true Q-values.

Deterministic Policy Gradient
  ∇_θ J(θ) = E_{s∼ρ_µ}[∇_θ µ_θ(s) × ∇_a Q_ϕ(s,a)|_{a=µ_θ(s)}]

  J(ϕ) = E_{s∼ρ_µ}[(Q^µ(s, µ_θ(s)) − Q_ϕ(s, µ_θ(s)))²]

Deterministic Policy Gradient
  ∇_θ J(θ) = E_{s∼ρ_µ}[∇_θ µ_θ(s) × ∇_a Q_ϕ(s,a)|_{a=µ_θ(s)}]

  J(ϕ) = E_{s∼ρ_µ}[(Q^µ(s, µ_θ(s)) − Q_ϕ(s, µ_θ(s)))²]

▶ However, this architecture worked only with linear function approximators, but not yet with non-linear approximators (e.g., neural networks).

Deep Deterministic Policy Gradient (DDPG)
▶ DDPG combines ideas from DQN and DPG to solve continuous control problems off-policy:
  ▶ the deterministic policy gradient to update the actor,
  ▶ an experience replay memory to train off-policy,
  ▶ target networks to stabilize learning.
▶ In DQN, the target networks are updated by copying the weights of the trained networks every K steps. The target networks change a lot between two updates, but not often.
▶ In DDPG, the target networks are updated as a moving average of the trained actor and critic:

  θ' ← τ θ + (1 − τ) θ'

  with τ << 1. The target networks are always "late" with respect to the trained networks, providing more stability to the learning of the Q-values.
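The soft update of the target networks is a single in-place interpolation of the parameters; a sketch for any pair of PyTorch modules (the value τ = 0.005 is just a commonly used illustrative choice):

import torch

@torch.no_grad()
def soft_update(target_net, net, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter."""
    for target_param, param in zip(target_net.parameters(), net.parameters()):
        target_param.mul_(1.0 - tau).add_(tau * param)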

Deep Deterministic Policy Gradient (DDPG)
▶ The critic is learned by minimizing:

  J(ϕ) = E_{s∼ρ_µ}[(r(s,a,s') + γ Q_{ϕ'}(s', µ_{θ'}(s')) − Q_ϕ(s,a))²]    (34)

▶ Exploration is an issue: the deterministic policy always produces the same actions, missing more rewarding ones. DDPG therefore adds an additive noise ξ to the selected action:

  a_t = µ_θ(s_t) + ξ    (35)

Deep Deterministic Policy Gradient (DDPG)
▶ Initialize the actor µ_θ and the critic Q_ϕ with random weights.
▶ Create the target networks µ_{θ'} and Q_{ϕ'}.
▶ Initialize the experience replay memory D.
▶ For episode ∈ [1, M]:
  ▶ Initialize the random process ξ.
  ▶ Observe the initial state s_0.
  ▶ For t ∈ [0, T_max]:
    1. Select the action a_t = µ_θ(s_t) + ξ.
    2. Perform a_t, observe the next state s_{t+1} and the reward r_{t+1}.
    3. Append (s_t, a_t, r_{t+1}, s_{t+1}) to the experience replay memory D.
    4. Sample a minibatch of N transitions randomly from D.
    5. For each transition (s_k, a_k, r_k, s'_k) in the minibatch, compute the target value using the target networks:

       y_k = r_k + γ Q_{ϕ'}(s'_k, µ_{θ'}(s'_k))

    6. Update the critic by minimizing:

       L(ϕ) = (1/N) Σ_k (y_k − Q_ϕ(s_k, a_k))²

Deep Deterministic Policy Gradient (DDPG)
▶ ...
  ▶ ...
    7. Update the actor using the sampled deterministic policy gradient:

       ∇_θ J(θ) = (1/N) Σ_k ∇_θ µ_θ(s_k) × ∇_a Q_ϕ(s_k, a)|_{a=µ_θ(s_k)}

    8. Update the target networks:

       θ' ← τ θ + (1 − τ) θ'
       ϕ' ← τ ϕ + (1 − τ) ϕ'