AI - Introduction to Bellman Equations

Andrew Ferlitsch · 15 slides · Sep 09, 2017

About This Presentation

Abstract: This PDSG workshop introduces basic concepts of Bellman Equations. Concepts covered are States, Actions, Rewards, Value Function, Discount Factor, Bellman Equation, Bellman Optimality, Deterministic vs. Non-Deterministic, Policy vs. Plan, and Lifespan Penalty.

Level: Intermediate

Requir...


Slide Content

Artificial Intelligence: Bellman Equation Introduction. Portland Data Science Group. Created by Andrew Ferlitsch, Community Outreach Officer, July 2017.

Introduction: A method for calculating value functions in dynamic environments. Invented by Richard Ernest Bellman in 1953. Bellman is the father of Dynamic Programming, which led to modern Reinforcement Learning. Concepts include: Reward, Discount Factor, Deterministic vs. Non-Deterministic, Plan vs. Policy.

Basic Terminology: S -> the set of all possible States; A -> the set of all possible Actions from a given state; s -> a specific state; a -> a specific action. [Grid-world figure: states S_a through S_m laid out on a grid, with a Start node and a Goal node.] S -> { S_a, S_b, S_c, …, S_m }. A_a -> { Down, Right }; A_b -> { Down, Left }; A_c -> { Down, Left }; A_d -> { Up, Down, Right }; … A_m -> {} (the Goal state). Example: s -> S_i, a -> an action from A_i -> { Up, Right }.
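
A minimal Python sketch of this terminology, using the state names from the slide. The full grid layout is not given, so only the action sets listed above are filled in; everything else is illustrative:

    # The set of all states, and the actions available from each state.
    # Only the action sets listed on the slide are filled in; the rest of
    # the grid is unspecified, so this dictionary is deliberately incomplete.
    S = {"Sa", "Sb", "Sc", "Sd", "Se", "Sf", "Sg", "Sh", "Si", "Sj", "Sk", "Sl", "Sm"}

    A = {
        "Sa": {"Down", "Right"},
        "Sb": {"Down", "Left"},
        "Sc": {"Down", "Left"},
        "Sd": {"Up", "Down", "Right"},
        "Si": {"Up", "Right"},
        "Sm": set(),   # the Goal state has no further actions
    }

    s, a = "Si", "Up"  # a specific state and a specific action
    assert a in A[s]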

Reward Terminology: R -> the 'Reward' for being at some state. v(s) -> Value Function, the anticipated reward for being at a specific state. [Grid-world figure: the Goal node S_m has R = 1; all other (non-goal) nodes have R = 0.] v(S_a) -> 0, v(S_b) -> 0, v(S_c) -> 0, v(S_d) -> 0, …, v(S_l) -> 0, v(S_m) -> 1. Without a Plan or Policy, a reward cannot be anticipated until we reach the Goal node.
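
The reward and initial value functions for this grid can be sketched the same way; the state names follow the slide, and every non-goal entry is simply 0:

    # Reward per state (1 at the goal Sm, 0 everywhere else) and the initial
    # value function: only the goal has a known value before a plan or
    # policy is chosen.
    states = ["Sa", "Sb", "Sc", "Sd", "Se", "Sf", "Sg", "Sh",
              "Si", "Sj", "Sk", "Sl", "Sm"]
    R = {s: 1 if s == "Sm" else 0 for s in states}
    v = {s: 1 if s == "Sm" else 0 for s in states}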

Discount Factor Terminology: t, t+1, t+2, … -> time step intervals, each corresponding to an action and a new state. R_{t+1} -> the Reward at the next time step, after an action has occurred. S_{t+1} -> the State at the next time step, after an action has occurred. γ (gamma) -> Discount Factor, between 0 and 1. The Discount Factor accounts for uncertainty in obtaining a future reward. [Figure: a chain of states S_{n-2} -> S_{n-1} -> S_n, where S_n is the Goal node with R = 1 and the other states have R = 0.] If at the Goal, receiving the reward is certain. If one step away, receiving the reward is less certain. Even further away, receiving the reward is even less certain.

Bellman Equation. Principle of the Bellman Equation: v(s) = R_t + γ R_{t+1} + γ^2 R_{t+2} + γ^3 R_{t+3} + … + γ^n R_{t+n}. The value of some state s is the sum of rewards up to a terminal state, with the reward of each successive state discounted: R_t is the reward at the next step after taking some action a; the reward at the subsequent state is discounted by γ; the reward at the next subsequent state is further discounted by γ^2; the exponent of the discount factor increases at each step. [Figure: chain S_{n-2} -> S_{n-1} -> S_n, with the Goal node S_n at R = 1 and R = 0 elsewhere.] With γ = 1: v(S_n) = 1, v(S_{n-1}) = 0 + 1 = 1, v(S_{n-2}) = 0 + 0 + 1 = 1. With γ = 0.9: v(S_n) = 1, v(S_{n-1}) = 0 + 0.9(1) = 0.9, v(S_{n-2}) = 0 + 0 + 0.9*0.9(1) = 0.81. Note: the Reward and the value are not the same thing.
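
A small Python sketch of this sum, using the slide's three-state chain (rewards 0, 0, 1 on the path to the goal); the helper name discounted_return is my own label, not from the slides:

    def discounted_return(rewards, gamma):
        # v(s) = R_t + gamma*R_{t+1} + gamma^2*R_{t+2} + ... for a known
        # sequence of rewards along a path to a terminal state.
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([1], 0.9))        # v(Sn)   = 1.0
    print(discounted_return([0, 1], 0.9))     # v(Sn-1) = 0.9
    print(discounted_return([0, 0, 1], 0.9))  # v(Sn-2) ≈ 0.81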

Bellman Principle of Optimality. The Bellman Equation can be factored: in v(s) = R_t + γ R_{t+1} + γ^2 R_{t+2} + γ^3 R_{t+3} + … + γ^n R_{t+n}, everything after the first term is γ times v(S_{t+1}), so v(s) = R_t + γ v(S_{t+1}). Bellman Optimality: the value of a state is based on the best (optimal) action for that state, and for each subsequent state: v(s) = max_a ( R(s,a) + γ v(S_{t+1}) ), where the maximizing a (the argmax) is the action at state s which maximizes the reward.
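
A hedged sketch of this one-step backup for the deterministic case: the names bellman_backup and next_state, and the tiny two-state example, are assumptions for illustration, not from the slides. The max over the candidate values gives v(s); the argmax gives the optimal action itself:

    def bellman_backup(s, actions, R, next_state, v, gamma):
        # Value of each available action: immediate reward plus the
        # discounted value of the (single, deterministic) next state.
        candidates = {a: R[(s, a)] + gamma * v[next_state[(s, a)]] for a in actions[s]}
        best_action = max(candidates, key=candidates.get)
        return candidates[best_action], best_action   # (max value, argmax action)

    # Tiny hypothetical example: state 'A' sits next to the goal 'G' (value 1.0).
    actions = {"A": ["Right"]}
    R = {("A", "Right"): 0.0}
    next_state = {("A", "Right"): "G"}
    v = {"G": 1.0}
    print(bellman_backup("A", actions, R, next_state, v, gamma=0.9))  # (0.9, 'Right')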

Bellman Optimality Example (γ = 0.9). [Figure: a grid world with a Goal node (R = 1), a Pit (R = -1), and a Wall.] First calculate the states one step away from the goal: the best action is to move to the goal node, giving a value of 0.9. Then calculate the adjacent states: the best action is to move to the neighboring node with the highest value, giving 0.81, then 0.73, then 0.66 as we move further from the goal. This produces a plan: the optimal action (move) for each state.
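
A rough Python sketch of this sweep, assuming a 3x4 grid with the goal, pit, and wall positions of the classic grid-world example (the slide does not spell out the exact layout). Terminal cells keep their reward as their value, and every other cell repeatedly takes the discounted value of its best neighbor:

    gamma = 0.9
    rows, cols = 3, 4
    goal, pit, wall = (0, 3), (1, 3), (1, 1)

    # Value table: terminal cells hold their reward; everything else starts at 0.
    v = {(r, c): 0.0 for r in range(rows) for c in range(cols) if (r, c) != wall}
    v[goal], v[pit] = 1.0, -1.0

    def neighbors(cell):
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) != wall:
                yield (nr, nc)

    for _ in range(20):                  # enough sweeps for this small grid to settle
        for cell in v:
            if cell in (goal, pit):
                continue                 # terminal states keep their reward
            v[cell] = max(gamma * v[n] for n in neighbors(cell))

    print(round(v[(0, 2)], 2), round(v[(0, 1)], 2), round(v[(0, 0)], 2))  # 0.9 0.81 0.73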

Deterministic vs. Non-Deterministic. Deterministic: the action taken has a 100% certainty of the expected (desired) outcome => Plan. E.g., in our Grid World example, there is a 100% certainty that if the action is to move left, you will move left. Non-Deterministic (Stochastic): the action taken has less than 100% certainty of the expected outcome => Policy. E.g., if a robot is in a standing state and the action is to run, there may be an 80% probability of succeeding, but a 20% probability of falling down.
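
A stochastic action can be sketched as an outcome distribution; the numbers below are the slide's robot example, and the dictionary layout is just one possible encoding:

    # One stochastic action, encoded as an outcome distribution: from the
    # "standing" state, the action "run" succeeds 80% of the time and
    # fails (the robot falls) 20% of the time.
    P = {("standing", "run"): {"running": 0.8, "fallen": 0.2}}
    assert abs(sum(P[("standing", "run")].values()) - 1.0) < 1e-9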

Bellman Optimality with Probabilities. Terminology: R(s,a) -> the Reward when at state s and action a is taken. P(s,a,S_{t+1}') -> the probability, when at state s and action a is taken, of ending up in the successor state S_{t+1}'. When the outcome is stochastic, we replace the value of the single desired next state with the summation of the values of the possible successor states, each weighted by its probability. Deterministic: v(s) = max_a ( R(s,a) + γ v(S_{t+1}) ). Stochastic: v(s) = max_a ( R(s,a) + γ Σ_{S_{t+1}'} P(s,a,S_{t+1}') v(S_{t+1}') ).
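
A Python sketch of the stochastic backup, under the same assumed encoding as above (P[(s, a)] maps successor states to probabilities). The names toward_goal and below in the usage example are hypothetical stand-ins for the next slide's grid, where this reproduces v(S_b) = 0.72:

    def bellman_backup(s, actions, R, P, v, gamma):
        # Expected next-state value: sum of P(s, a, s') * v(s') over successors.
        candidates = {
            a: R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())
            for a in actions[s]
        }
        best_action = max(candidates, key=candidates.get)
        return candidates[best_action], best_action

    # Reproduces the next slide's v(Sb): 0 + 0.9 * (0.8*1 + 0.1*0 + 0.1*0) = 0.72.
    actions = {"Sb": ["toward_goal"]}
    R = {("Sb", "toward_goal"): 0.0}
    P = {("Sb", "toward_goal"): {"goal": 0.8, "Sa": 0.1, "below": 0.1}}
    v = {"goal": 1.0, "Sa": 0.0, "below": 0.0}
    print(bellman_backup("Sb", actions, R, P, v, gamma=0.9))  # approximately (0.72, 'toward_goal')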

Bellman Optimality Example (γ = 0.9). [Figure: a small grid world with a Goal (R = 1), a Pit (R = -1), a Wall, and states S_a, S_b, S_c; the computed values 0.72 and 0.48 are filled into the grid step by step.] S_b, Right -> { 80% Left, 10% Right, 10% Down }: v(S_b) = 0 + 0.9(0.8(1) + 0.1(0) + 0.1(0)) = 0.72 (80% probability of reaching the reward, 10% for each other outcome). S_c, Up -> { 80% Up, 10% Left, 10% Right }: v(S_c) = 0 + 0.9(0.8(0.72) + 0.1(0) + 0.1(-1)).

Greedy vs. Optimal. Greedy: take the action with the highest probability of a reward -> Plan (act as if deterministic). [Figure: the grid with the Goal (R = 1), S_b, the Pit (R = -1), the Wall, and S_c valued at 0.48; γ = 0.9.] S_c, Up -> { 80% S_b, 10% Left, 10% Right ('The Pit', a terminal state) }. 10% of the time we will end up in the negative terminal state!

Greedy vs. Optimal. Optimal: take the action that guarantees we proceed towards a positive reward -> Policy. [Figure: the same grid, with S_c now valued at 0.07; γ = 0.9.] S_c, Left -> { 80% Wall (bounce back to S_c), 10% Up (S_b), 10% Down }. If we choose Left, we have an 80% chance of bouncing into the wall and ending up back where we were. If we keep bouncing off the wall, eventually we will go Up or Down (10% of the time each), and never go into the Pit! v(S_c, Left) = 0 + 0.9(0.8(0) + 0.1(0.72) + 0.1(0)).
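
Both one-step backups at S_c can be evaluated directly from the formulas on these two slides (v(S_b) = 0.72, γ = 0.9, bounce-back valued at 0 as above); the variable names below are mine. The Up backup carries the 10% Pit risk, while the Left backup never can reach the Pit:

    gamma, v_Sb, v_pit, v_Sc = 0.9, 0.72, -1.0, 0.0

    # Greedy choice: Up towards Sb, but 10% of the outcomes land in the Pit.
    backup_up = 0 + gamma * (0.8 * v_Sb + 0.1 * 0.0 + 0.1 * v_pit)

    # Safer choice: Left into the wall, which can never reach the Pit.
    backup_left = 0 + gamma * (0.8 * v_Sc + 0.1 * v_Sb + 0.1 * 0.0)

    print(round(backup_up, 2), round(backup_left, 2))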

Lifespan Penalty. Lifespan Penalty: there is a cost to each action. [Figure: the same grid, but each non-terminal state now carries a penalty of R = -0.1, alongside the Goal (R = 1) and the Pit (R = -1); with the penalty, S_c's value becomes -0.02; γ = 0.9.] S_c, Left -> { 80% Wall, 10% Left, 10% Right }: v(S_c, Left) = 0 + 0.9(0.8(-0.1) + 0.1(0.72) + 0.1(-0.1)). When there is a penalty on each action, the best policy might be to take the chance of falling into the pit!
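
The same Left backup with and without the penalty, following the slide's arithmetic (each outcome that used to contribute 0 now contributes -0.1; variable names are mine):

    gamma, penalty, v_Sb = 0.9, -0.1, 0.72

    # Without the penalty: 80% bounce (0), 10% up to Sb (0.72), 10% down (0).
    no_penalty = gamma * (0.8 * 0.0 + 0.1 * v_Sb + 0.1 * 0.0)

    # With the penalty: the bounce and the downward move each cost -0.1.
    with_penalty = gamma * (0.8 * penalty + 0.1 * v_Sb + 0.1 * penalty)

    print(round(no_penalty, 2), round(with_penalty, 2))  # the second rounds to -0.02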

Not Covered: When probabilities are learned (not pre-known) -> Backward Propagation. Suboptimal solutions for HUGE search spaces. THIS IS MORE LIKE THE REAL WORLD!