Lecture notes

Slide Content

Machine Learning RL 1
Reinforcement Learning
Michael L. Littman
Slides from
http://www.cs.vu.nl/~elena/ml_13light.ppt
which appear to have been adapted from
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-3/www/l20.ps

Machine Learning RL 2
Reinforcement Learning
[Read Ch. 13]
[Exercise 13.2]
•Control learning
•Control policies that choose optimal actions
•Q learning
•Convergence

Machine Learning RL 3
Control Learning
Consider learning to choose actions, e.g.,
•Robot learning to dock on battery charger
•Learning to choose actions to optimize factory output
•Learning to play Backgammon
Note several problem characteristics:
•Delayed reward
•Opportunity for active exploration
•Possibility that state only partially observable
•Possible need to learn multiple tasks with same
sensors/effectors

Machine Learning RL 4
One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon.
Immediate reward:
•+100 if win
•-100 if lose
•0 for all other states
Trained by playing 1.5 million games against itself.
Now approximately equal to best human player.

Machine Learning RL 5
Reinforcement Learning Problem
Goal: learn to choose actions that maximize

r_0 + γ r_1 + γ² r_2 + … , where 0 ≤ γ < 1
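
For concreteness, a minimal sketch (not from the slides) of this discounted sum for a finite reward sequence; the example rewards and γ are made up:

# Minimal sketch: discounted return r_0 + γ r_1 + γ² r_2 + ... for a finite reward list.
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over the given reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

# Made-up example: with gamma < 1, later rewards count for less.
print(discounted_return([0, 0, 100], gamma=0.9))  # ≈ 81 (the reward two steps ahead, discounted by 0.9**2)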

Machine Learning RL 6
Markov Decision Processes
Assume
•finite set of states S
•set of actions A
•at each discrete time agent observes state s_t ∈ S and chooses action a_t ∈ A
•then receives immediate reward r_t
•and state changes to s_{t+1}
•Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
–i.e., r_t and s_{t+1} depend only on current state and action
–functions δ and r may be nondeterministic
–functions δ and r not necessarily known to agent

Machine Learning RL 7
Agent’s Learning Task
Execute actions in environment, observe results, and
•learn action policy π : S → A that maximizes

E[r_t + γ r_{t+1} + γ² r_{t+2} + …]

from any starting state in S
•here 0 ≤ γ < 1 is the discount factor for future
rewards (sometimes makes sense with γ = 1)
Note something new:
•Target function is π : S → A
•But, we have no training examples of form ⟨s, a⟩
•Training examples are of form ⟨⟨s, a⟩, r⟩

Machine Learning RL 8
Value Function
To begin, consider deterministic worlds...
For each possible policy π the agent might adopt, we can
define an evaluation function over states

V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + …
       ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π
starting at state s
Restated, the task is to learn the optimal policy π*:

π* ≡ argmax_π V^π(s), (∀s)
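
To make the definition concrete, here is a short sketch of evaluating V^π(s) by rolling a fixed policy forward and summing discounted rewards; the tiny deterministic world (delta, reward, policy) is invented for illustration and is not from the slides:

# Sketch: evaluate V^pi(s) in a small made-up deterministic world by
# following the policy and accumulating discounted rewards.
gamma = 0.9
delta = {("s1", "right"): "s2", ("s2", "right"): "goal", ("goal", "stay"): "goal"}
reward = {("s1", "right"): 0, ("s2", "right"): 100, ("goal", "stay"): 0}
policy = {"s1": "right", "s2": "right", "goal": "stay"}

def V_pi(s, steps=50):
    """Discounted return obtained by following `policy` from state s (truncated sum)."""
    total, discount = 0.0, 1.0
    for _ in range(steps):
        a = policy[s]
        total += discount * reward[(s, a)]
        discount *= gamma
        s = delta[(s, a)]
    return total

print(V_pi("s1"))  # ≈ 90: 0 + 0.9*100 + 0 + ...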

Machine Learning RL 9

Machine Learning RL 10
What to Learn
We might try to have agent learn the evaluation function
V^{π*} (which we write as V*)
It could then do a look-ahead search to choose the best
action from any state s because

π*(s) ≡ argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:
•This can work if agent knows δ : S × A → S, and
r : S × A → ℝ
•But when it doesn't, it can't choose actions this way

Machine Learning RL 11
Q Function
Define a new function very similar to V*

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If agent learns Q, it can choose optimal action even
without knowing δ!

π*(s) ≡ argmax_a [r(s, a) + γ V*(δ(s, a))]
π*(s) ≡ argmax_a Q(s, a)

Q is the evaluation function the agent will learn [Watkins
1989].
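
The practical difference shows up in a small sketch (hypothetical lookup tables, not from the slides): acting greedily from V* needs the model δ and r, while acting greedily from Q is a pure table lookup.

gamma = 0.9

def greedy_from_V(s, actions, r, delta, V_star):
    # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]  -- requires knowing delta and r
    return max(actions, key=lambda a: r[(s, a)] + gamma * V_star[delta[(s, a)]])

def greedy_from_Q(s, actions, Q):
    # pi*(s) = argmax_a Q(s, a)  -- no model needed
    return max(actions, key=lambda a: Q[(s, a)])

Both rules pick the same action precisely because Q(s, a) = r(s, a) + γ V*(δ(s, a)).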

Machine Learning RL 12
Training Rule to Learn Q
Note Q and V* are closely related:

V*(s) = max_{a'} Q(s, a')

This allows us to write Q recursively as

Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
            = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote learner's current approximation to Q.
Consider training rule

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in
state s.

Machine Learning RL 13
Q Learning for Deterministic Worlds
For each s, a initialize table entry Q̂(s, a) ← 0
Observe current state s
Do forever:
•Select an action a and execute it
•Receive immediate reward r
•Observe the new state s'
•Update the table entry for Q̂(s, a) as follows:
Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
•s ← s'.
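
A compact Python sketch of this loop; the environment interface step(s, a) -> (r, s'), ε-greedy action selection, and episodic termination are assumptions added here for illustration (the slide leaves action selection open and loops forever):

import random
from collections import defaultdict

def q_learning(step, actions, start, is_terminal, gamma=0.9, epsilon=0.1, episodes=1000):
    """Tabular Q learning for a deterministic world (sketch under the assumptions above)."""
    Q = defaultdict(float)                         # Q̂(s, a) <- 0 for all s, a
    for _ in range(episodes):
        s = start
        while not is_terminal(s):
            if random.random() < epsilon:          # occasionally explore
                a = random.choice(actions)
            else:                                  # otherwise act greedily on current Q̂
                a = max(actions, key=lambda a_: Q[(s, a_)])
            r, s_next = step(s, a)                 # execute a, receive r, observe s'
            # training rule: Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            s = s_next                             # s <- s'
    return Q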

Machine Learning RL 14
Updating Q
Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
               ← 0 + 0.9 max{63, 81, 100}
               ← 90

Notice if rewards non-negative, then

∀ s, a, n:  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

∀ s, a, n:  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Machine Learning RL 15
Convergence Theorem
Q̂ converges to Q. Consider case of deterministic world, with
bounded immediate rewards, where each ⟨s, a⟩ visited infinitely
often.
Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is
visited. During each full interval the largest error in the Q̂ table is
reduced by factor of γ.
Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n;
that is

Δ_n = max_{s,a} |Q̂_n(s, a) - Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n+1, the error in the
revised estimate Q̂_{n+1}(s, a) is

|Q̂_{n+1}(s, a) - Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a'))|
                          = γ |max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a')|

Machine Learning RL 16
|Q̂_{n+1}(s, a) - Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) - (r + γ max_{a'} Q(s', a'))|
                          = γ |max_{a'} Q̂_n(s', a') - max_{a'} Q(s', a')|
                          ≤ γ max_{a'} |Q̂_n(s', a') - Q(s', a')|
                          ≤ γ max_{s'', a'} |Q̂_n(s'', a') - Q(s'', a')|
                          = γ Δ_n

Note we used general fact that:

|max_a f_1(a) - max_a f_2(a)| ≤ max_a |f_1(a) - f_2(a)|

This works with things other than max that satisfy
this non-expansion property [Szepesvári & Littman, 1999].

Machine Learning RL 17
Non-deterministic Case (1)
What if reward and next state are non-deterministic?
We redefine V and Q by taking expected values.

V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + …]
       ≡ E[Σ_{i=0}^∞ γ^i r_{t+i}]

Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Machine Learning RL 18
Nondeterministic Case (2)
Q learning generalizes to nondeterministic worlds
Alter training rule to

Q̂_n(s, a) ← (1 - α_n) Q̂_{n-1}(s, a) + α_n [r + γ max_{a'} Q̂_{n-1}(s', a')]

where

α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and
Dayan, 1992]. Standard properties: Σ_n α_n = ∞, Σ_n α_n² < ∞.
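
A sketch of just this update step; the visit counter is the only added bookkeeping, and the table layout is an assumption of these notes:

from collections import defaultdict

Q = defaultdict(float)       # Q̂ estimates
visits = defaultdict(int)    # visits_n(s, a)
gamma = 0.9

def nondeterministic_update(s, a, r, s_next, actions):
    """Apply Q̂(s,a) <- (1 - α_n) Q̂(s,a) + α_n [r + γ max_a' Q̂(s',a')]."""
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    visits[(s, a)] += 1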

Machine Learning RL 19
Temporal Difference Learning (1)
Q learning: reduce discrepancy between successive
Q estimates
One step time difference:

Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + … + γ^{n-1} r_{t+n-1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

Q^λ(s_t, a_t) ≡ (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …]
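
As a sketch (not from the slides), Q^(n) and a truncated version of this blend can be computed from a recorded trajectory; the trajectory format (rewards[i] = r_{t+i}, states[i] = s_{t+i}, with one more state than rewards) is an assumption:

def q_n_step(rewards, states, Q_hat, actions, n, gamma=0.9):
    """Q^(n): n discounted rewards, then bootstrap with max_a Q̂(s_{t+n}, a)."""
    ret = sum(gamma**i * rewards[i] for i in range(n))
    return ret + gamma**n * max(Q_hat.get((states[n], a), 0.0) for a in actions)

def q_lambda_blend(rewards, states, Q_hat, actions, lam=0.5, gamma=0.9):
    """Truncated blend (1 - λ) Σ_n λ^(n-1) Q^(n), up to the trajectory length."""
    N = len(rewards)
    return (1 - lam) * sum(lam**(n - 1) * q_n_step(rewards, states, Q_hat, actions, n, gamma)
                           for n in range(1, N + 1))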

Machine Learning RL 20
Temporal Difference Learning (2)
TD(λ) algorithm uses above training rule

Q^λ(s_t, a_t) ≡ (1 - λ) [Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + …]

•Sometimes converges faster than Q learning
•Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
•Tesauro's TD-Gammon uses this algorithm

Equivalent expression:

Q^λ(s_t, a_t) = r_t + γ [(1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
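
The recursive form suggests computing Q^λ by a backward pass over a recorded trajectory; in this sketch the trajectory format and the bootstrap at the final state are assumptions of these notes, not from the slide:

def q_lambda_recursive(trajectory, final_state, Q_hat, actions, lam=0.5, gamma=0.9):
    """Evaluate Q^λ(s_t, a_t) for each step of trajectory = [(s_t, a_t, r_t), ...]."""
    # once the trajectory runs out, cut off the recursion with max_a Q̂(s_T, a)
    q_lam_next = max(Q_hat.get((final_state, a), 0.0) for a in actions)
    values = [0.0] * len(trajectory)
    for t in range(len(trajectory) - 1, -1, -1):
        _s, _a, r = trajectory[t]
        next_state = trajectory[t + 1][0] if t + 1 < len(trajectory) else final_state
        greedy = max(Q_hat.get((next_state, a_), 0.0) for a_ in actions)
        # Q^λ(s_t, a_t) = r_t + γ [(1 - λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1})]
        values[t] = r + gamma * ((1 - lam) * greedy + lam * q_lam_next)
        q_lam_next = values[t]
    return values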

Machine Learning RL 21
Subtleties and Ongoing Research
•Replace Q̂ table with neural net or other generalizer
•Handle case where state only partially observable
(next?)
•Design optimal exploration strategies
•Extend to continuous action, state
•Learn and use δ̂ : S × A → S
•Relationship to dynamic programming (next?)