MDP
•A Markov Decision Process (MDP) is a foundational
element of reinforcement learning (RL).
•An MDP formalizes sequential decision making,
where an action taken in a state influences not just
the immediate reward but also the subsequent state.
•It is a very useful framework for modeling problems
that maximize longer-term return by taking a sequence
of actions.
Reinforcement learning is a feedback-based machine learning technique in
which an agent learns to behave in an environment by performing actions
and observing the results of those actions.
MDP basics
•In an MDP, an agent interacts with an environment
by taking actions and seeks to maximize the rewards
it gets from the environment. At any given
time step t, the process is as follows:
•The environment is in state St
•The agent takes an action At
•The environment generates a reward Rt based on St
and At
•The environment moves to the next state St+1
•The goal of the agent is to maximize the total
reward (ΣRt) collected over a period of time.
•The agent needs to find the optimal action in a
given state that will maximize this total
reward.
•The probability distribution over actions
At taken from a state St is called the policy π(At | St).
The goal of solving an MDP is to find an
optimal policy.
•To express a problem as an MDP, one needs to define the
following:
•states of the environment
•actions the agent can take on each state
•rewards obtained after taking an action on a given state
•state transition probabilities.
•Actions can depend only on the current state, not
on any previous state or previous actions (Markov property).
•Rewards are generated depending only on the (current state,
action) pair. Both actions and rewards can be probabilistic.
•Once the problem is expressed as an MDP, one can use dynamic
programming or many other techniques to find the optimal
policy.
Markov Decision Process (MDP)
•Defined as a tuple: <S, A, P, R>
•S: set of states
•A: set of actions
•P: Transition function
•Table P(s' | s, a), the probability of reaching s' after taking action a in state s
•R: Reward function
•R(s, a) = cost or reward of taking action a in state s
•Choose a sequence of actions (not just one decision or one
action)
•Utility based on a sequence of decisions
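As a concrete illustration, the tuple <S, A, P, R> can be written down directly as data structures. The sketch below (Python; the two-state chain is a hypothetical example of mine, not from the slides) shows one way to encode it:

# Minimal sketch of an MDP as the tuple <S, A, P, R>.
# The two-state "chain" below is an illustrative, made-up example.

S = ["s0", "s1"]          # states
A = ["stay", "go"]        # actions

# P[(s, a)] is a dict mapping next state s' -> P(s' | s, a)
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)] is the immediate reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): -0.1,
    ("s1", "stay"): 1.0, ("s1", "go"): -0.1,
}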
Simple Robot Navigation Problem
• In each state, the possible actions are U, D, R, and L
Probabilistic Transition Model
• In each state, the possible actions are U, D, R, and L
• The effect of U is as follows (transition model):
• With probability 0.8 the robot moves up one square (if the
robot is already in the top row, then it does not move)
• With probability 0.1 the robot moves right one square (if the
robot is already in the rightmost column, then it does not move)
• With probability 0.1 the robot moves left one square (if the
robot is already in the leftmost column, then it does not move)
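A sketch of this transition model in code (Python). The (col, row) indexing matches the [col,row] labels used in the slides' figures; the helper names are my own, and any obstacles in the grid are not modeled here:

# Noisy "U" action on a 4x3 grid (columns 1-4, rows 1-3).
# With prob. 0.8 the robot moves up, with prob. 0.1 right, with prob. 0.1 left;
# a move that would leave the grid means the robot does not move.

COLS, ROWS = 4, 3

def on_grid(col, row):
    """Return the cell if it is on the grid, otherwise None."""
    if 1 <= col <= COLS and 1 <= row <= ROWS:
        return (col, row)
    return None

def transition_U(state):
    """Return {next_state: probability} for action U taken in `state`."""
    col, row = state
    outcomes = {}
    for prob, (dc, dr) in [(0.8, (0, 1)), (0.1, (1, 0)), (0.1, (-1, 0))]:
        nxt = on_grid(col + dc, row + dr) or state   # blocked moves stay put
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes

print(transition_U((1, 1)))  # {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}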
Markov Property
The transition probabilities depend only
on the current state, not on the previous
history (how that state was reached)
Sequence of Actions
• Planned sequence of actions: (U, R)
• U is executed
(figure: 4x3 grid; from [3,2], executing U can lead to [3,3], [4,2], or back to [3,2])
Histories
• Planned sequence of actions: (U, R)
• U has been executed
• R is executed
• There are 9 possible sequences of states
– called histories – and 6 possible final states
for the robot!
(figure: tree of reachable states after (U,R) from [3,2] — after U: [3,3], [4,2], [3,2];
after R: [4,1], [4,2], [4,3], [3,1], [3,2], [3,3])
Probability of Reaching the Goal
•P([4,3] | (U,R).[3,2]) =
   P([4,3] | R.[3,3]) x P([3,3] | U.[3,2])
 + P([4,3] | R.[4,2]) x P([4,2] | U.[3,2])
•P([3,3] | U.[3,2]) = 0.8
•P([4,2] | U.[3,2]) = 0.1
•P([4,3] | R.[3,3]) = 0.8
•P([4,3] | R.[4,2]) = 0.1
•P([4,3] | (U,R).[3,2]) = 0.8 x 0.8 + 0.1 x 0.1 = 0.65
Note the importance of the Markov property
in this derivation
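A quick check of this arithmetic in plain Python, using the probabilities stated above:

# Probability of reaching [4,3] from [3,2] with the blind sequence (U, R),
# summed over the two intermediate states that can lead there.
p_33_after_U   = 0.8  # P([3,3] | U.[3,2])
p_42_after_U   = 0.1  # P([4,2] | U.[3,2])
p_goal_from_33 = 0.8  # P([4,3] | R.[3,3])
p_goal_from_42 = 0.1  # P([4,3] | R.[4,2])

p_goal = p_goal_from_33 * p_33_after_U + p_goal_from_42 * p_42_after_U
print(round(p_goal, 2))  # 0.65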
Utility Function
• [4,3] provides power supply
• [4,2] is a sand area from which the robot cannot escape
• The robot needs to recharge its batteries
• [4,3] and [4,2] are terminal states
(figure: 4x3 grid with +1 at [4,3] and -1 at [4,2])
Utility of a History
• The utility of a history is defined by the utility of the last
state (+1 or –1) minus n/25, where n is the number of moves
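A direct transcription of this definition (Python sketch; the history is a list of (col, row) states, the function name is my own, and the history is assumed to end at one of the two terminal states):

def history_utility(history):
    """Utility of a history = utility of its last state (+1 or -1) minus n/25,
    where n is the number of moves (transitions) in the history."""
    terminal_utility = {(4, 3): +1.0, (4, 2): -1.0}[history[-1]]
    n_moves = len(history) - 1
    return terminal_utility - n_moves / 25.0

# Example: [3,2] -> [3,3] -> [4,3] reaches the goal in 2 moves.
print(round(history_utility([(3, 2), (3, 3), (4, 3)]), 2))  # 0.92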
Utility of an Action Sequence
• Consider the action sequence (U,R) from [3,2]
• A run produces one among 7 possible histories, each with some
probability
• The utility of the sequence is the expected utility of the histories
(see the sketch below):
  U = Σ_h U_h P(h)
(figure: 4x3 grid with +1 at [4,3] and -1 at [4,2], and the tree of states
reachable by (U,R) from [3,2])
Optimal Action Sequence
• The optimal sequence is the one with maximal utility
• But is the optimal action sequence what we want to
compute?
Only if the sequence is executed blindly!
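A sketch of the expected-utility computation U = Σ_h U_h P(h) for a blindly executed action sequence (Python). The transition model T and the utility function are passed in as parameters; the toy deterministic model in the usage example is my own, not the slides' grid:

# Expected utility of a blindly executed action sequence:
# enumerate all histories, weight each history's utility by its probability.
def expected_utility(start, actions, T, utility):
    histories = [([start], 1.0)]                  # (history so far, its probability)
    for a in actions:
        new = []
        for hist, p in histories:
            for nxt, q in T(hist[-1], a).items():
                new.append((hist + [nxt], p * q))
        histories = new
    return sum(p * utility(hist) for hist, p in histories)

# Tiny usage with a hypothetical deterministic model:
T = lambda s, a: {s + 1: 1.0}                     # the action always moves one step
U = lambda h: 1.0 - (len(h) - 1) / 25.0           # last-state utility +1 minus n/25
print(round(expected_utility(0, ["U", "U"], T, U), 2))  # 0.92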
Reactive Agent Algorithm
Repeat:
⬥ s ← sensed state (an accessible or observable state)
⬥ If s is terminal then exit
⬥ a ← choose action (given s)
⬥ Perform a
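A sketch of this loop in code (Python). The environment interface names env.sense() and env.perform() are hypothetical, not from the slides:

# Reactive agent loop: sense, check for terminal, choose an action, act.
def reactive_agent(env, choose_action, terminal_states):
    while True:
        s = env.sense()              # s <- sensed state
        if s in terminal_states:     # if s is terminal then exit
            return
        a = choose_action(s)         # a <- choose action (given s); with a
        env.perform(a)               # policy Pi, choose_action(s) = Pi[s]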
The solution to an MDP
•The solution to an MDP is called
a policy, and it simply specifies the best
action to take in each state.
Policy (Reactive/Closed-Loop Strategy)
• A policy Π is a complete mapping from states to actions
Reactive Agent Algorithm
Repeat:
⬥ s ← sensed state
⬥ If s is terminal then exit
⬥ a ← Π(s)
⬥ Perform a
Optimal Policy
• A policy Π is a complete mapping from states to actions
• The optimal policy Π* is the one that always yields a
history (ending at a terminal state) with maximal
expected utility
(figure: 4x3 grid with +1 at [4,3] and -1 at [4,2])
Makes sense because of the Markov property
Note that [3,2] is a “dangerous”
state that the optimal policy
tries to avoid
This problem is called a
Markov Decision Problem (MDP)
How to compute Π*?
Additive Utility
• History H = (s_0, s_1, …, s_n)
• The utility of H is additive iff:
  U(s_0, s_1, …, s_n) = R(0) + U(s_1, …, s_n) = Σ_i R(i)
  where R(i) is the reward received at step i
• Robot navigation example:
  •R(n) = +1 if s_n = [4,3]
  •R(n) = -1 if s_n = [4,2]
  •R(i) = -1/25 if i = 0, …, n-1
Principle of Max Expected Utility
• History H = (s_0, s_1, …, s_n)
• Utility of H: U(s_0, s_1, …, s_n) = Σ_i R(i)
First-step analysis
• U(i) = R(i) + max_a Σ_k P(k | a.i) U(k)
• Π*(i) = arg max_a Σ_k P(k | a.i) U(k)
Value Iteration
• Initialize the utility of each non-terminal state s_i to U_0(i) = 0
• For t = 0, 1, 2, …, do:
  U_{t+1}(i) ← R(i) + max_a Σ_k P(k | a.i) U_t(k)
(figure: convergence plot of U_t([3,1]), which levels off near 0.611 by about
t = 30, together with the converged utilities of the grid states, ranging from
0.388 to 0.918, with +1 at [4,3] and -1 at [4,2])
Note the importance
of terminal states and
connectivity of the
state-transition graph
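A runnable sketch of this update rule for a generic MDP (Python). The dictionary-based encoding and the three-state chain used as an example are my own illustration, not the slides' 4x3 grid; rewards here are per state, R(i), matching the update formula above:

# Value iteration: U_{t+1}(i) = R(i) + max_a sum_k P(k | a.i) U_t(k).
# MDP encoding: R[s] is the reward of state s, P[(s, a)] = {s': prob},
# and terminal states keep a fixed utility equal to their reward.
def value_iteration(states, actions, P, R, terminals, iters=50):
    U = {s: (R[s] if s in terminals else 0.0) for s in states}
    for _ in range(iters):
        new_U = dict(U)
        for s in states:
            if s in terminals:
                continue
            new_U[s] = R[s] + max(
                sum(p * U[k] for k, p in P[(s, a)].items())
                for a in actions
            )
        U = new_U
    return U

# Hypothetical 3-state chain: move "right" toward a +1 terminal, or "stay".
states = ["a", "b", "goal"]
actions = ["stay", "right"]
P = {
    ("a", "stay"): {"a": 1.0},   ("a", "right"): {"b": 0.9, "a": 0.1},
    ("b", "stay"): {"b": 1.0},   ("b", "right"): {"goal": 0.9, "b": 0.1},
}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
print(value_iteration(states, actions, P, R, terminals={"goal"}))
# utilities increase toward the +1 terminal state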
Policy Iteration
• Pick a policy Π at random
• Repeat:
  •Compute the utility of each state for Π:
    U_{t+1}(i) ← R(i) + Σ_k P(k | Π(i).i) U_t(k)
  •Compute the policy Π’ given these utilities:
    Π’(i) = arg max_a Σ_k P(k | a.i) U(k)
  •If Π’ = Π then return Π
Or solve the set of linear equations:
  U(i) = R(i) + Σ_k P(k | Π(i).i) U(k)
(often a sparse system)
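A sketch of policy iteration under the same dictionary-based MDP encoding as the value iteration sketch above (that encoding, and the approximate policy evaluation by repeated sweeps rather than a linear solve, are my own choices):

# Policy iteration: evaluate the current policy, then improve it greedily.
import random

def policy_iteration(states, actions, P, R, terminals, eval_iters=50):
    Pi = {s: random.choice(actions) for s in states if s not in terminals}
    while True:
        # Policy evaluation: U(i) = R(i) + sum_k P(k | Pi(i).i) U(k),
        # approximated by iterating the update (one could also solve the
        # linear system directly, as the slide notes).
        U = {s: (R[s] if s in terminals else 0.0) for s in states}
        for _ in range(eval_iters):
            for s in Pi:
                U[s] = R[s] + sum(p * U[k] for k, p in P[(s, Pi[s])].items())
        # Policy improvement: Pi'(i) = arg max_a sum_k P(k | a.i) U(k)
        new_Pi = {
            s: max(actions,
                   key=lambda a: sum(p * U[k] for k, p in P[(s, a)].items()))
            for s in Pi
        }
        if new_Pi == Pi:             # if Pi' = Pi then return Pi
            return Pi, U
        Pi = new_Pi

With the hypothetical chain MDP from the value iteration sketch, this returns the policy that always moves "right" toward the +1 terminal.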
Infinite Horizon
What if the robot lives forever?
In many problems, e.g., the robot
navigation example, histories are
potentially unbounded and the same
state can be reached many times
One trick:
Use discounting to make the infinite-
horizon problem mathematically
tractable
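The slides only mention "discounting"; the standard way to apply it (my addition) is to weight future rewards by a factor γ with 0 < γ < 1, which bounds the utility of an unbounded history. As a sketch, the value iteration update then becomes:

# Discounted backup (standard form; gamma is not specified in the slides):
#   U_{t+1}(i) = R(i) + gamma * max_a sum_k P(k | a.i) U_t(k)
def discounted_backup(s, U, actions, P, R, gamma=0.9):
    return R[s] + gamma * max(
        sum(p * U[k] for k, p in P[(s, a)].items()) for a in actions
    )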
Example: Tracking a Target
(figure: a robot tracking a moving target)
• The robot must keep
the target in view
• The target’s trajectory
is not known in advance
• The environment may
or may not be known
An optimal policy cannot be computed ahead
of time:
- The environment might be unknown
- The environment may only be partially observable
- The target may not wait
A policy must be computed “on-the-fly”
POMDP (Partially Observable Markov Decision Problem)
• A POMDP is really just an MDP: we have a set of
states, a set of actions, transitions, and immediate
rewards. The effects of actions on the state in a
POMDP are exactly the same as in an MDP.
•The only difference is in whether or not we can
observe the current state of the process. In a POMDP
we add a set of observations to the model.
•So instead of directly observing the current state, the
state gives us an observation which provides a hint
about what state the process is in.
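Because the state is hidden, a POMDP agent typically maintains a belief, a probability distribution over states, and updates it after each action and observation. The slides do not spell this update out; the standard form, as a sketch (the observation model O[(s', o)] = P(o | s') is an assumed input):

# Standard POMDP belief update (not given explicitly in the slides):
#   b'(s') is proportional to O(o | s') * sum_s P(s' | s, a) * b(s)
def belief_update(b, a, o, states, P, O):
    new_b = {}
    for s2 in states:
        predicted = sum(P[(s, a)].get(s2, 0.0) * b[s] for s in states)
        new_b[s2] = O[(s2, o)] * predicted
    z = sum(new_b.values())                       # normalize to a distribution
    return {s: v / z for s, v in new_b.items()} if z > 0 else new_b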
•Imagine yourself as a tiny person (agent) amid a vast
IT corporate workspace on your first day of work.
•You are trying to get to your boss’ cabin (goal
state) before your appointment time (goal).
•Reaching the cabin on time will get you on the boss’
good side (reward).
•Here is the problem: the workspace is composed of
multiple grids of cubicles (state space), and there are
many paths you can take to reach the cabin.
•When you start exploring, you move into different
cubicles. You are able to observe various things
within the cubicles, such as desks, workstations, and
chairs, and maybe a washroom or a coffee machine
near a few cubicles, but not the cubicle number.
•Based on these observations (partially observable
data), you may have an idea about where you might
be (belief), but you don’t exactly know the cubicle
number (state) in the workspace (POMDP).
•Now, to get help in navigating the workspace,
you call up a friend who works in that office.
She tells you, “if you are at cubicle number x
(current state), take a right and walk
straight to reach the boss’ cabin” (policy).
•However, this suggestion (policy) is of no use
to you as you do not know the cubicle
number (state) you are in (complexity in
solving POMDPs).
•Had it been a scenario in which the cubicles are numbered, and
you are able to observe the cubicle number (state) along with
the other observations directly (fully observable MDP), you
would easily be able to navigate the workspace with the help of
your friend’s suggestions (policy) based on your current
position.
Example: Target Tracking
There is uncertainty
in the robot’s and target’s
positions; this uncertainty
grows with further motion
There is a risk that the target
may escape behind the corner,
requiring the robot to move
appropriately
But there is a positioning
landmark nearby. Should
the robot try to reduce its
position uncertainty?