DEEP REINFORCEMENT LEARNING EXPLAINED — 08
The Bellman Equation
V-function and Q-function Explained
Jordi TORRES.AI · Jun 11, 2020 · 12 min read

In the previous post, we used the Frozen-Lake Environment to illustrate the limitations of the Cross-Entropy Method. Continuing with the gradual approach of this series, in this post we present another type of Agent that can solve that problem: the so-called Value-based Agents, which store a value function and base their decisions on it. For this purpose, we will introduce the Bellman equation, one of the central elements of many Reinforcement Learning algorithms and the tool we need in this post to calculate value functions.
Spanish version of this publication: "3. Funciones de valor y la ecuación de Bellman", open-access chapter 3 of the book Introducción al aprendizaje por refuerzo profundo (medium.com).
Value-based Agents

Remember that the Agent’s goal is to find a sequence of actions that will
maximize the return: the sum of rewards (discounted or undiscounted —
depending on the value of gamma) during an episode or the entire life of
the Agent, depending on the task.
In a continuing task, this sum can be infinite. We can handle this with the help of the discount factor already introduced in post 2, which lies in the range [0, 1]. The formula for the discounted return G at time step t is:

G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^{∞} γ^k R_{t+k+1}

Although the sum is still infinite, if γ < 1 then G_t has a finite value. If γ = 0, the Agent is only interested in the immediate reward and discards the long-term return. Conversely, if γ = 1, the Agent weights all future rewards equally with the immediate reward.
We can rewrite this equation as a recursive relationship:

G_t = R_{t+1} + γ G_{t+1}

In short, the Agent must be able to exploit the information expressed by this return G to make its decisions. We also refer to this expression as the discounted return.
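As a quick sanity check (this snippet is not from the original post; the reward list and GAMMA are arbitrary illustrative values), a few lines of Python confirm that the direct sum and the recursive definition of the discounted return agree:

# Direct vs. recursive computation of the discounted return G_t.
GAMMA = 0.9
rewards = [1.0, 2.0, -10.0, 0.5]   # R_{t+1}, R_{t+2}, ...

# Direct definition: G_t = sum over k of gamma^k * R_{t+k+1}
g_direct = sum(GAMMA ** k * r for k, r in enumerate(rewards))

# Recursive definition: G_t = R_{t+1} + gamma * G_{t+1}
def discounted_return(rs):
    if not rs:
        return 0.0
    return rs[0] + GAMMA * discounted_return(rs[1:])

assert abs(g_direct - discounted_return(rewards)) < 1e-12
print(g_direct)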
Almost all Reinforcement Learning algorithms executed by such Agents involve estimating value functions: functions of states, or of state-action pairs. These are the so-called Value-based Agents.
A value function estimates how good it is for the Agent to be in a given state (or how good it is to perform a given action in a given state) in terms of the return G. Note that the return G of an Agent may depend on the actions it will take. Accordingly, the value functions introduced in post 2 are defined with respect to particular ways of acting, called policies, usually denoted by π.
The V-function: the value of the state
The first value function we will introduce is the V-function. Generally speaking, we can say that the V-function answers the basic question of "What can I expect from here?".

More formally, the V-function, also referred to as the state-value function, the value function, or simply V, measures the goodness of each state: how good or bad it is to be in a particular state, in terms of the return G, when following a policy π.
That is, we can define the V-function as the expected total reward (discounted or undiscounted, depending on the value of gamma) obtainable from the state. Formally, the value V_π(s) is:

V_π(s) = E_π[ G_t | S_t = s ]

which describes the expected value of the total return G at time step t, starting from state s at time t and then following policy π. The expectation E[·] is used in this definition because the Environment's transition function may be stochastic.
To clarify the concept a little more for beginners, let's consider a straightforward Environment with three states:

The Agent’s initial state 0
The final state 1 that the Agent is in after executing action “left” from
the initial state. The reward obtained from this is r=+1
The final state 2 that the Agent is in after taking action “right.” The
reward obtained from this is r=+2
We can represent this with the following state-transition diagram with rewards:
The Environment is always deterministic — every action succeeds, and we
always start from state 0. Once we reach either state 1 or state 2, the
episode ends.

Now, the question is: what is the value of state 0 (denoted V(0))? An important detail is that the value of a state is always calculated with respect to some policy that our Agent follows. Even in a simple Environment, our Agent can have different behaviors, each of which will have its own value for state 0. Consider some example policies:
1. The Agent always goes left
2. The Agent always goes right
3. The Agent goes left with a probability of 0.5 and right with a probability
of 0.5
4. The Agent goes left with a probability of 0.2 and right with a probability
of 0.8
The value of state 0, V(0), in each of the above policies is:
1. The value of state 0 in the case of the “always left” Agent is V(0)=1.0
(every time it goes left, it obtains +1 and the episode ends).
2. For the “always right” Agent, the value of state 0 is V(0)= 2.0 (every
time it goes right, it obtains +2 and the episode ends).
3. For the “0.5 left+0.5 right” Agent, the value is V(0)=1.0 x 0.5 + 2.0 x
0.5 = 1.5

4. For the “0.2 left+0.8 right” Agent, the value is V(0)=1.0 x 0.2 + 2.0 x
0.8 = 1.8
Since the goal of the Agent is to obtain as much total reward as possible, the optimal policy for this simple one-step Environment is policy 2, "always right".
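The following tiny sketch (not from the original post; the rewards and policies are hard-coded exactly as described above) reproduces these four values of V(0):

# V(0) for the one-step Environment: "left" -> +1 (terminal), "right" -> +2 (terminal).
rewards = {"left": 1.0, "right": 2.0}

policies = {
    "always left":        {"left": 1.0, "right": 0.0},
    "always right":       {"left": 0.0, "right": 1.0},
    "0.5 left+0.5 right": {"left": 0.5, "right": 0.5},
    "0.2 left+0.8 right": {"left": 0.2, "right": 0.8},
}

for name, pi in policies.items():
    # The episode ends after one step, so V(0) = sum over a of pi(a|0) * r(a).
    v0 = sum(prob * rewards[action] for action, prob in pi.items())
    print(name, "V(0) =", v0)   # 1.0, 2.0, 1.5, 1.8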
But the preceding example may give the false impression that we should be greedy and always take the action with the highest reward. Unfortunately, it's not that simple. For example, let's extend our preceding Environment with yet another state that is reachable from state 2. State 2 is no longer a terminal state but a transition to state 3, with a (very) bad reward of -10:
The new Environment's state transitions with rewards can be represented as:

With that addition, for each policy, the V(0) will be:
1. For the “always left”, is the same: V(0)=+1.0
2. For the “always right”: V(0)= 2.0 + (–10.0) = –8.0
3. For the “0.5 left+0.5 right”: V(0)=1.0 * 0.5 + (2.0+(-10.0) )* 0.5 =
-3.5
4. For the “0.2 left+0.8 right”: V(0)=1.0 * 0.2 + (2.0+(-10.0) ) * 0.8 =
-6.2
So, the best policy for this new Environment is now policy 1: “always go
left”. Notice that once the Agent has chosen the “right” action in state 0, the
bad reward is unavoidable, as from state 2, we have only one exit.
This example, based on a naïve Environment, is meant to make the reader aware of the complexity of this optimality problem and to prepare them to appreciate the importance of the Bellman equation. So, what is the Bellman equation?

The Q-function: The value of the action
In post 2 we extended the definition of the state-value function to state-action pairs, defining a value for each state-action pair. This is called the action-value function, also known as the Q-function or simply Q. It defines the value of taking action a in state s under a policy π, denoted Q_π(s, a), as the expected return G starting from s, taking the action a, and thereafter following policy π:

Q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

Again, the expectation E[·] is used in this equation because the Environment's transition function may be stochastic.
Now that we have both Q and V defined, let's formalize their relationship. We denote by π(a|s) the probability that a policy π selects an action a given a current state s. Note that the probabilities of all outbound actions from s sum to 1:

Σ_a π(a|s) = 1

We can then assert that the state-value function equals the sum of the action-value functions of all actions a outgoing from s, each weighted by the policy's probability of selecting that action:

V_π(s) = Σ_a π(a|s) · Q_π(s, a)
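As an illustration (this snippet is not from the original post; the Q-values and policy probabilities are arbitrary numbers, chosen to echo the "0.2 left+0.8 right" policy of the extended example), with Q-values and policy probabilities stored in plain dictionaries this relationship is a one-line weighted sum:

# V(s) as the policy-weighted sum of Q(s, a).
q = {"left": 1.0, "right": -8.0}   # Q(s, a) for a single state s
pi = {"left": 0.2, "right": 0.8}   # pi(a|s); the probabilities must sum to 1

assert abs(sum(pi.values()) - 1.0) < 1e-12
v = sum(pi[a] * q[a] for a in q)   # V(s) = sum over a of pi(a|s) * Q(s, a)
print(v)                           # -6.2 for these numbers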
The Bellman Equation
The Bellman equation shows up everywhere in the Reinforcement Learning
literature, being one of the central elements of many Reinforcement
Learning algorithms. In summary, we can say that the Bellman equation
decomposes the value function into two parts, the immediate reward
plus the discounted future values.
This equation simplifies the computation of the value function, such that
rather than summing over multiple time steps, we can find the optimal
solution of a complex problem by breaking it down into simpler, recursive
subproblems and finding their optimal solutions.
To facilitate understanding of the formulation in the following sections, two backup diagrams, (a) for V_π(s) and (b) for Q_π(s, a), summarize the naming convention for the variables and how they relate. In these diagrams, P denotes the probability that action a, issued in state s, ends up in state s' (with reward r).
Bellman equation for the State-value function
We already saw that we can define the discounted return G in recursive terms. Let's now see how the Bellman equation is defined recursively for the state-value function:

V_π(s) = Σ_a π(a|s) · Σ_{s'} P(s'|s,a) · [ r(s,a) + γ V_π(s') ]

This equation tells us how to find the value of a state s when following a policy π. We can intuitively see that it recursively breaks down the value computation into an immediate expected reward, r(s,a), plus the discounted value of the successor state, V_π(s'), weighted by the discount factor γ. The equation also expresses the stochasticity of the Environment through the sum over transition probabilities, and the stochasticity of the policy through the sum over the policy probabilities.
The Bellman equation is important because it gives us the ability to describe
the value of a state s, V??????(s), with the value of the s’ state, V??????(s’), and with
an iterative approach that we will present in the next post, we can calculate
the values of all states.
Unfortunately, in most scenarios, we do not know the probability P and the
reward r, so we cannot solve MDPs by directly applying the Bellman
equation. But do not worry, in the next post we present some alternatives to
find them from experience.
Bellman equation for the Action-value function
We also have the Bellman equation for the action-value function:

Q_π(s, a) = Σ_{s'} P(s'|s,a) · [ r(s,a) + γ V_π(s') ]

In the same way as for the state-value function, this equation tells us how to find recursively the value of a state-action pair when following a policy π.
And since we showed that the state-value function V_π(s') is the sum of the action-value functions Q_π(s', a') of all actions a' outgoing from s', weighted by the policy probability π(a'|s') of selecting each action, the previous formula can also be expressed as:

Q_π(s, a) = Σ_{s'} P(s'|s,a) · [ r(s,a) + γ Σ_{a'} π(a'|s') · Q_π(s', a') ]
Optimal Policy
Remember that the goal of the Agent is to maximize the total cumulative reward in the long run. The policy that maximizes the total cumulative reward is called the optimal policy.
Optimal value function
A policy π′ is defined to be better than or equal to a policy π if and only if V_π′(s) ≥ V_π(s) for all states s. An optimal policy π* satisfies π* ≥ π for all policies π. An optimal policy is guaranteed to exist, but it may not be unique. That means there can be different optimal policies, but they all share the same value functions, the "optimal" value functions.
The optimal value function is the one that yields the maximum value compared with the value functions of all other policies.
When we say we are solving an MDP, it actually means we are finding the optimal value function. Mathematically, the optimal state-value function can be expressed as:

V*(s) = max_π V_π(s)

In this formula, V*(s) tells us the maximum return we can obtain from state s.
Similarly, the optimal action-value function indicates the maximum return we can obtain if we are in state s, take action a, and act optimally from there onwards:

Q*(s, a) = max_π Q_π(s, a)

We can also define V*(s) via Q*(s, a), so that the value of a state equals the value of the best action we can execute from that state:

V*(s) = max_a Q*(s, a)

and, conversely, Q*(s, a) can be written in terms of V*(s'):

Q*(s, a) = Σ_{s'} P(s'|s,a) · [ r(s,a) + γ V*(s') ]
The Bellman equation of optimality
Bellman proved that the optimal value of a state s is obtained by taking the action a that gives us the maximum possible expected immediate reward plus the discounted long-term value of the next state s':

V*(s) = max_a Σ_{s'} P(s'|s,a) · [ r(s,a) + γ V*(s') ]

Bellman also proved that the optimal action-value function in state s when taking action a is:

Q*(s, a) = Σ_{s'} P(s'|s,a) · [ r(s,a) + γ max_{a'} Q*(s', a') ]
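To make the optimality backup concrete, here is a minimal sketch (not from the original post; the tabular model below is a hypothetical one, loosely mirroring the extended three-state example) of a single Bellman optimality update for one state:

# One Bellman optimality backup for state 0, given a known tabular model.
GAMMA = 0.9

P = {                                   # P[(s, a)] -> {s': probability}
    (0, "left"):  {1: 1.0},
    (0, "right"): {2: 1.0},
}
R = {                                   # R[(s, a, s')] -> immediate reward
    (0, "left", 1): 1.0,
    (0, "right", 2): 2.0,
}
V = {0: 0.0, 1: 0.0, 2: -10.0}          # current value estimates

def q_star(s, a):
    # Q*(s, a) = sum over s' of P(s'|s,a) * (r(s,a,s') + gamma * V*(s'))
    return sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in P[(s, a)].items())

v0 = max(q_star(0, a) for a in ("left", "right"))   # V*(0) = max over a of Q*(0, a)
print(v0)   # 1.0: "left" (1 + 0.9*0) beats "right" (2 + 0.9*(-10) = -7)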

In future posts of this series, we will show examples of how to use the Bellman equation of optimality. As we will see throughout the series, the Bellman equation is a keystone for finding the optimal value functions and, from them, an optimal policy for the Agent.
What is next?
In future posts, you will see these formulas in practice by solving the
Frozen-Lake Environment. However, to be able to do this, we have one
important thing still missing: a general way to calculate those V-values and
Q-values. In the next post, we will present the Value Iteration method for it.
See you in the next post!
For more detail on the content of this post, the reader can review the excellent book Reinforcement Learning: An Introduction, Second Edition, by Richard S. Sutton and Andrew G. Barto.

DEEP REINFORCEMENT LEARNING EXPLAINED — 09
The Value Iteration Algorithm
Estimation of Transitions and Rewards from the Agent’s experience
Jordi TORRES.AI · Jun 14, 2020 · 7 min read
In the previous post, we presented the Value-based Agents and reviewed the Bellman equation, one of the central elements of many Reinforcement Learning algorithms. In this post, we will present the Value Iteration method to calculate the V-values and Q-values required by Value-based Agents.
Spanish version of this publication: "4. Programación dinámica", open-access chapter 4 of the book Introducción al aprendizaje por refuerzo profundo (medium.com).
Calculating the V-value with loops
In the simple example presented in the previous post, we had no loops in the transitions, and it was clear how to calculate the values of the states: we could start from the terminal states, calculate their values, and then proceed to the central state. However, the mere presence of a loop in the Environment breaks this approach.
Let's see how such cases are solved with a simple Environment with two states, state 1 and state 2, and the following state-transition diagram:
We only have two possible transitions: from state 1 we can take only one action, which leads us to state 2 with a Reward of +1, and from state 2 we can take only one action, which returns us to state 1 with a Reward of +2. So our Agent's life moves in an infinite sequence of states because of the infinite loop between the two states. What is the value of each state?
Suppose that we have a discount factor γ < 1, let's say 0.9, and remember from the previous post that the optimal value of a state equals the value of the action that gives us the maximum possible expected immediate reward plus the discounted long-term value of the next state:

V*(s) = max_a Σ_{s'} P(s'|s,a) · [ r(s,a) + γ V*(s') ]

In our example, since there is only one action available in each state, our Agent has no other option, so we can simplify the previous formula to:

V(s) = r(s,a) + γ V(s')

For instance, if we start from state 1, the sequence of states will be [1, 2, 1, 2, 1, 2, …], and since every transition from state 1 to state 2 gives us a Reward of +1 and every back transition gives us a Reward of +2, the sequence of Rewards will be [+1, +2, +1, +2, +1, +2, …]. Therefore, the previous formula for state 1 becomes:

V(1) = 1 + γ · (2 + γ · (1 + γ · (2 + …)))
Strictly speaking, it is impossible to compute the exact value of our state by expanding this sum, but with a discount factor γ = 0.9 the contribution of each new term decreases over time. For example, after sweep i=37 the result of the formula is 14.7307838, after sweep i=50 it is 14.7365250, and after sweep i=100 it is 14.7368420. That means we can stop the calculation at some point (e.g. at i=50) and still get a good estimate of the V-value, in this case V(1) ≈ 14.736.
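A short sketch (not from the original post) reproduces this behaviour; it assumes the two-state loop and γ = 0.9 exactly as described above and simply applies the simplified update in place. The exact digits at a given sweep depend on the update order, but the estimate settles around the value quoted above:

# Approximate V(1) and V(2) for the two-state loop by iterating V(s) = r + gamma * V(s').
GAMMA = 0.9
v1, v2 = 0.0, 0.0

for i in range(1, 101):
    v1 = 1.0 + GAMMA * v2   # state 1 -> state 2 with reward +1
    v2 = 2.0 + GAMMA * v1   # state 2 -> state 1 with reward +2
    if i in (37, 50, 100):
        print(i, round(v1, 7))

# v1 approaches (1 + 2*GAMMA) / (1 - GAMMA**2) = 14.7368421...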
The Value Iteration Algorithm
The preceding example can be used to get the gist of a more general procedure called
the Value Iteration algorithm (VI). This allows us to numerically calculate the values
of the states of Markov decision processes, with known transition probabilities and
rewards.
The idea behind the Value Iteration algorithm is to merge a truncated policy evaluation step (as shown in the previous example) and a policy improvement step into the same algorithm.
Basically, the Value Iteration algorithm computes the optimal state-value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and V(s) values until they converge. Value Iteration is guaranteed to converge to the optimal values. The following pseudo-code expresses the algorithm:
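The pseudo-code figure from the original post is not reproduced here; the following Python-style sketch is a reconstruction based on the description above, assuming a small set of discrete states and actions and a known model given as P[(s, a)] -> {s': probability} and R[(s, a, s')] -> reward:

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}            # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:
            # Q(s, a) for every action, using the current estimate of V
            q_values = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions
            ]
            new_v = max(q_values)           # V(s) <- max over a of Q(s, a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                   # stop when the values barely change
            return V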

Estimation of Transitions and Rewards
In practice, this Value Iteration method has several limitations. First of all, the state space should be discrete and small enough to perform multiple iterations over all states. This is not an issue for our Frozen-Lake Environment, but in a general Reinforcement Learning problem it is not the case. We will address this issue in subsequent posts in this series.
Another essential practical problem arises from the fact that, to apply the Bellman update, the algorithm requires knowing the transition probabilities and the Reward for every transition of the Environment.
Remember that in our Frozen-Lake example, we observe the state, decide on an action, and only then do we get the next observation and the reward for the transition; we do not know this information in advance. What can we do to obtain it?
Luckily, what we can have is the history of the Agent’s interaction with the
Environment. So, the answer to the previous question is to use our Agent’s experience
as an estimation for both unknowns. Let’s see below how we can achieve it.
Estimation of Rewards
Estimating Rewards is the easiest part, since Rewards can be used as they are: we just need to remember what reward we got on the transition from s to s' using action a.
Estimation of Transitions
Estimating transitions is also easy, for instance by maintaining counters for every tuple (s, a, s') in the Agent's experience and normalizing them.
We can create a simple table that keeps the counters of the experienced transitions. The key of the table is the composite "state" + "action" pair, (s, a), and the value of each entry is a record of the target states s' together with the number of times, c, we have seen each of them.

Let's look at an example. Imagine that during the Agent's experience, in a given state s it has executed an action a several times, ending up c1 times in state s1 and c2 times in state s2. The number of times we have switched to each of these target states is stored in our transition table: the entry (s, a) of the table contains {s1: c1, s2: c2}.
It is then easy to use this table to estimate the probabilities of our transitions. The probability that the action takes us from state s to state s1 is c1 / (c1 + c2), and the probability that it takes us to state s2 is c2 / (c1 + c2).
For example, imagine that from state 0 we execute action 1 ten times; 4 times it leads us to state 1 and 6 times to state 2. For this particular example, the entry with key (0, 1) in the table contains {1: 4, 2: 6}. This means the estimated probability of transitioning from state 0 to state 1 is 4/10, that is, 0.4, and from state 0 to state 2 is 6/10, that is, 0.6.
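A small sketch of this counting scheme using collections.Counter (illustrative only; the real implementation appears in the next post of the series):

import collections

# transits[(s, a)] counts how many times each target state s' was observed.
transits = collections.defaultdict(collections.Counter)

# Hypothetical experience: from state 0, action 1 was executed ten times,
# landing 4 times in state 1 and 6 times in state 2.
for target in [1, 1, 2, 2, 1, 2, 2, 1, 2, 2]:
    transits[(0, 1)][target] += 1

counts = transits[(0, 1)]                  # Counter({2: 6, 1: 4})
total = sum(counts.values())
probs = {s2: c / total for s2, c in counts.items()}
print(probs)                               # {1: 0.4, 2: 0.6}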
With these quantities estimated from the Agent's experience, we have all the information needed to apply the Value Iteration algorithm.
What is next?
Following the practical approach of this series, in the next two posts, you will see the
Value Iteration method in practice by solving the Frozen-Lake Environment.
See you in the next post!
Deep Reinforcement Learning Explained Series
by UPC Barcelona Tech and Barcelona Supercomputing Center

DEEP REINFORCEMENT LEARNING EXPLAINED —10
Value Iteration for V-function
V-function in Practice for Frozen-Lake Environment
Jordi TORRES.AI · Jun 15, 2020 · 9 min read
In the previous post, we presented the Value Iteration method to calculate the V-values
and Q-values required by Value-based Agents. In this post, we will present how to
implement the Value Iteration method for computing the state value by solving the
Frozen-Lake Environment.
Spanish version of this publication: "4. Programación dinámica", open-access chapter 4 of the book Introducción al aprendizaje por refuerzo profundo (medium.com).
Value Iteration with V-function in Practice
The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.
Next, we will present in detail the code that makes up the Value Iteration method introduced in the previous post. So let's get to the code. First, we import the required libraries and define the main constants:
import gym
import collections
from torch.utils.tensorboard import SummaryWriter

ENV_NAME = "FrozenLake-v0"
GAMMA = 0.9
TEST_EPISODES = 20
N = 100
REWARD_GOAL = 0.8
Agent’s data structures
The main data structures that keep the Agent's information are:
rewards: A dictionary with the composite key "source state" + "action" + "target state". The value is the immediate reward obtained.
transits: A table, kept as a dictionary, with counters of the experienced transitions. The key is the composite "state" + "action", and the value is another dictionary that maps each target state to the number of times we have seen it.
values: A dictionary that maps a state to the calculated value of this state (V-value).
state: The current state of the Agent.
The main data structures are created in the class constructor of the Agent.

Training Algorithm
The overall logic of our training algorithm is simple. Until the desired Reward goal is reached, we execute the following steps:
1. Play N random steps in the Environment, populating the reward and transits tables.
2. After those N steps, perform a Value Iteration sweep over all states, updating the value table.
3. Then play several full test episodes to check the improvement using the updated value table.
4. If the average reward over those test episodes is above the desired threshold, stop training.
Before we go into more detail about this code, it will help to review the Agent's code first.
Agent Class
In the Agent class constructor, we create an Environment that we will use for data samples, obtain our first observation, and define the tables for rewards, transitions, and values:
class Agent:
    def __init__(self):
        self.env = gym.make(ENV_NAME)
        self.state = self.env.reset()
        self.rewards = collections.defaultdict(float)
        self.transits = collections.defaultdict(collections.Counter)
        self.values = collections.defaultdict(float)
Play random steps
Remember that in the previous post we anticipated that the estimates of Transitions and Rewards would be obtained from the history of the Agent's interaction with the Environment.
This is done by the method play_n_random_steps, which plays N random steps in the Environment, populating the rewards and transits tables with random experiences.

def play_n_random_steps(self, count):
    for _ in range(count):
        action = self.env.action_space.sample()
        new_state, reward, is_done, _ = self.env.step(action)
        self.rewards[(self.state, action, new_state)] = reward
        self.transits[(self.state, action)][new_state] += 1
        if is_done:
            self.state = self.env.reset()
        else:
            self.state = new_state
Note that we don’t need to wait for the end of the episode to start learning; we just
perform N steps and remember their outcomes. This is one of the differences between
the Value Iteration and the Cross-Entropy method shown in a previous post, which
requires full episodes to learn.
Value of the action
The next method calculates the Q-function, the value of an action from a state, using the transits, rewards, and values tables of the Agent. We will use it for two purposes: to select the best action to perform from a state, and to calculate the new value of a state in the Value Iteration update.
def calc_action_value(self, state, action):
    target_counts = self.transits[(state, action)]
    total = sum(target_counts.values())
    action_value = 0.0
    for tgt_state, count in target_counts.items():
        reward = self.rewards[(state, action, tgt_state)]
        val = reward + GAMMA * self.values[tgt_state]
        action_value += (count / total) * val
    return action_value
First, from the transits table, we extract the transition counters for the state and action that the method receives as arguments. We sum all counters to obtain the total number of times we have executed this action from this state. Then we iterate over every target state our action has landed on and calculate its contribution to the total action value using the formula presented in the Bellman equation post:

Q(s, a) = Σ_{s'} (count(s') / total) · [ r(s, a, s') + γ V(s') ]

Each term is the immediate reward plus the discounted value of the target state, multiplied by the probability of this transition (the individual counter divided by the total computed before). The result of each iteration is accumulated in the variable action_value, which is finally returned.
Select best Action
In order to select the best action from a given state, we have the method select_action
that uses the previous calc_action_value method to make the decision:
def select_action(self, state):
    best_action, best_value = None, None
    for action in range(self.env.action_space.n):
        action_value = self.calc_action_value(state, action)
        if best_value is None or best_value < action_value:
            best_value = action_value
            best_action = action
    return best_action
This method iterates over all possible actions in the Environment, calculates the value of each one, and returns the action with the largest Q-value.
Value Iteration function
And here we have the main function: as we described in the previous post, the Value Iteration method simply loops over all states in the Environment:
def value_iteration(self):
    for state in range(self.env.observation_space.n):
        state_values = [
            self.calc_action_value(state, action)
            for action in range(self.env.action_space.n)
        ]
        self.values[state] = max(state_values)
For every state, we update its value with the maximum value of the action available
from the state.
Training Loop and the monitoring of the code
After presenting the Agent’s class and its methods, we come back to describe the main
loop. First of all, we create the Environment that we will be using for testing, an instance of the Agent class, the summary writer for TensorBoard, and some variables that we will use:
test_env = gym.make(ENV_NAME)
agent = Agent()
writer = SummaryWriter()
iter_no = 0
best_reward = 0.0
Remember that the Value Iteration method formally requires an infinite number of iterations to converge exactly to the optimal value function. In practice, in the previous post we showed that we can stop once the value function changes by only a small amount in an iteration of the training loop.
In this example, in order to keep the code simple and the number of iterations small, we decide to stop when we reach a given reward threshold. All the rest of the code is identical.
The overall logic of our code is simply a loop that iterates until the Agent achieves the expected performance (if the average reward over the test episodes is above the REWARD_GOAL threshold, we stop training):
while best_reward < REWARD_GOAL:
    agent.play_n_random_steps(N)
    agent.value_iteration()
    ...
The body of the loop is composed of the steps introduced earlier:
STEP 1: Play N random steps by calling the method play_n_random_steps.
STEP 2: Perform a Value Iteration sweep over all states by calling the method value_iteration.
STEP 3: Play several full episodes to check the improvement using the updated value table. For that, the code uses agent.select_action() to find the best action to take in the separate test_env Environment (we don't want to mess with the current state of the main Environment used to gather random data):
iter_no += 1
reward_test = 0.0
for _ in range(TEST_EPISODES):
    total_reward = 0.0
    state = test_env.reset()
    while True:
        action = agent.select_action(state)
        new_state, new_reward, is_done, _ = test_env.step(action)
        total_reward += new_reward
        if is_done:
            break
        state = new_state
    reward_test += total_reward
reward_test /= TEST_EPISODES
Additionally, the code writes data to TensorBoard in order to track the best average reward:
writer.add_scalar("reward", reward_test, iter_no)
if reward_test > best_reward:
    print("Best reward updated %.2f at iteration %d" % (reward_test, iter_no))
    best_reward = reward_test
And that’s all!
Run the program
Okay, let’s run our program:
Best reward updated 0.40 in iteration 13
Best reward updated 0.65 in iteration 20
Best reward updated 0.80 in iteration 23
Best reward updated 0.85 in iteration 28
Best reward updated 0.90 in iteration 76
Test the client
Now, if we try the same client test code as in the Cross-Entropy case, we can see that the Agent we have built can learn in a slippery Environment:

new_test_env = gym.make("FrozenLake-v0", is_slippery=True)
state = new_test_env.reset()
new_test_env.render()
is_done = False
iter_no = 0
while not is_done:
    action = agent.select_action(state)
    new_state, reward, is_done, _ = new_test_env.step(action)
    new_test_env.render()
    state = new_state
    iter_no += 1
print("reward = ", reward)
print("iterations =", iter_no)
Conclusions
Note that the presented algorithm is stochastic: different executions employ a different number of iterations to reach a solution. However, it can learn in a slippery Environment, unlike the Cross-Entropy method presented earlier. We can plot all the runs using TensorBoard, which will show graphs like the following one:

We can notice that in all cases it needs at most a few seconds to find a good policy that solves the Environment in 80% of runs. If you remember, the Cross-Entropy method took many hours on a slippery Environment to achieve a success rate of only 60%.
Moreover, we can apply this algorithm to a larger version of FrozenLake, named FrozenLake8x8-v0. The larger version can take many more iterations to solve, and, according to the TensorBoard charts, most of the time it waits for the first successful episode (it needs at least one successful episode to start learning a useful value table) and then reaches convergence very quickly. Below is a chart that compares the reward dynamics during training on the FrozenLake 4x4 and 8x8 versions:
In addition to changing the Environment in ENV_NAME ("FrozenLake-v0" vs "FrozenLake8x8-v0"), the reader can experiment with different values of the hyperparameters GAMMA, TEST_EPISODES, REWARD_GOAL, or N. Why don't you try it?
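For instance, a quick way to try the larger map is simply to change the constants at the top of the script; the values below are just one possible starting point, not a recommendation:

# Hypothetical alternative configuration for the 8x8 map.
ENV_NAME = "FrozenLake8x8-v0"   # instead of "FrozenLake-v0"
GAMMA = 0.9
TEST_EPISODES = 20
N = 100                         # random steps gathered per training iteration
REWARD_GOAL = 0.8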
What is next?