function PASSIVE-TD-AGENT(percept) returns an action
   inputs: percept, a percept indicating the current state s′ and reward signal r′
   persistent: π, a fixed policy
               U, a table of utilities, initially empty
               Ns, a table of frequencies for states, initially zero
               s, a, r, the previous state, action, and reward, initially null

   if s′ is new then U[s′] ← r′
   if s is not null then
       increment Ns[s]
       U[s] ← U[s] + α(Ns[s]) (r + γ U[s′] − U[s])
   if s′.TERMINAL? then s, a, r ← null else s, a, r ← s′, π[s′], r′
   return a
Figure 21.4   A passive reinforcement learning agent that learns utility estimates using temporal differences. The step-size function α(n) is chosen to ensure convergence, as described in the text.
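The following is a minimal Python sketch of the agent in Figure 21.4, not a definitive implementation. It assumes a percept is a (state, reward) pair with an explicit terminal flag, that states are hashable, and that the policy is a dictionary mapping states to actions; the particular step-size schedule α(n) = 60/(59 + n) is one common choice and is not prescribed by the figure.

# Sketch of PASSIVE-TD-AGENT (Figure 21.4); percept format and alpha(n) are assumptions.
from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, policy, gamma=0.9, alpha=lambda n: 60.0 / (59.0 + n)):
        self.pi = policy                  # fixed policy: maps state -> action
        self.gamma = gamma                # discount factor
        self.alpha = alpha                # step-size schedule alpha(n)
        self.U = {}                       # utility estimates, initially empty
        self.Ns = defaultdict(int)        # state visit counts, initially zero
        self.s = self.a = self.r = None   # previous state, action, and reward

    def __call__(self, percept, terminal=False):
        s1, r1 = percept                  # current state s' and reward signal r'
        if s1 not in self.U:              # if s' is new then U[s'] <- r'
            self.U[s1] = r1
        if self.s is not None:            # TD update on the previous state
            self.Ns[self.s] += 1
            self.U[self.s] += self.alpha(self.Ns[self.s]) * (
                self.r + self.gamma * self.U[s1] - self.U[self.s])
        if terminal:                      # reset at the end of a trial
            self.s = self.a = self.r = None
        else:
            self.s, self.a, self.r = s1, self.pi[s1], r1
        return self.a

On each time step the environment would call the agent with the current (state, reward) percept; the agent returns the action dictated by the fixed policy while updating its utility table by the TD rule.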
function Q-LEARNING-AGENT(percept) returns an action
   inputs: percept, a percept indicating the current state s′ and reward signal r′
   persistent: Q, a table of action values indexed by state and action, initially zero
               Nsa, a table of frequencies for state–action pairs, initially zero
               s, a, r, the previous state, action, and reward, initially null

   if TERMINAL?(s) then Q[s, None] ← r′
   if s is not null then
       increment Nsa[s, a]
       Q[s, a] ← Q[s, a] + α(Nsa[s, a]) (r + γ max_a′ Q[s′, a′] − Q[s, a])
   s, a, r ← s′, argmax_a′ f(Q[s′, a′], Nsa[s′, a′]), r′
   return a
Figure 21.8   An exploratory Q-learning agent. It is an active learner that learns the value Q(s, a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.
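The following is a minimal Python sketch of the agent in Figure 21.8, again under stated assumptions rather than as the book's implementation: the percept is a (state, reward) pair with an explicit terminal flag, every state shares the same action set, and the exploration function f and step-size schedule α(n) are the simple optimistic-count and 60/(59 + n) choices, neither of which is fixed by the figure.

# Sketch of Q-LEARNING-AGENT (Figure 21.8); action set, f, alpha(n), and the
# terminal-state handling are assumptions made for the sake of a runnable example.
from collections import defaultdict

class QLearningAgent:
    def __init__(self, actions, gamma=0.9, Ne=5, R_plus=2.0,
                 alpha=lambda n: 60.0 / (59.0 + n)):
        self.actions = actions            # actions assumed available in every state
        self.gamma, self.Ne, self.R_plus, self.alpha = gamma, Ne, R_plus, alpha
        self.Q = defaultdict(float)       # Q[s, a], initially zero
        self.Nsa = defaultdict(int)       # state-action visit counts, initially zero
        self.s = self.a = self.r = None   # previous state, action, and reward

    def f(self, q, n):
        # Exploration function: optimistic value R_plus for under-explored actions.
        return self.R_plus if n < self.Ne else q

    def __call__(self, percept, terminal=False):
        s1, r1 = percept                  # current state s' and reward signal r'
        if terminal:                      # a terminal state's value is its reward
            self.Q[s1, None] = r1
        if self.s is not None:            # Q-learning update on the previous (s, a)
            self.Nsa[self.s, self.a] += 1
            best = (self.Q[s1, None] if terminal
                    else max(self.Q[s1, a1] for a1 in self.actions))
            self.Q[self.s, self.a] += self.alpha(self.Nsa[self.s, self.a]) * (
                self.r + self.gamma * best - self.Q[self.s, self.a])
        if terminal:                      # reset at the end of a trial
            self.s = self.a = self.r = None
        else:                             # choose the next action via the exploration function f
            self.s, self.r = s1, r1
            self.a = max(self.actions,
                         key=lambda a1: self.f(self.Q[s1, a1], self.Nsa[s1, a1]))
        return self.a

The design choice mirrors the figure: the update uses only observed transitions and rewards, so no transition model is ever estimated, while f keeps the agent trying each state-action pair at least Ne times before trusting its Q estimate.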