Reinforcement Learning 2. Multi-armed Bandits


About This Presentation

A summary of Chapter 2: Multi-armed Bandits of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html


Slide Content

Chapter 2: Multi-armed Bandits
Seungjae Ryan Lee

One-armed Bandit
●Slot machine
●Each spin (action) is independent

Multi-armed Bandit problem
●Multiple slot machines to choose from
●Simplified setting that avoids the complexities of full RL problems
○No observations
○Actions do not have delayed effects

10-armed Testbed
●10 actions, 10 reward distributions
●Reward chosen from stationary probability distributions
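
A minimal Python sketch of this testbed, assuming (as in the book's experiments) that each true value q*(a) is drawn from a unit normal distribution and each reward adds unit-variance Gaussian noise; the names make_testbed and pull are illustrative.

import numpy as np

def make_testbed(k=10, seed=0):
    """k-armed testbed: true action values and a reward sampler."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=k)        # true expected reward q*(a) of each arm
    def pull(action):
        # reward drawn from a stationary distribution centered on q*(a)
        return rng.normal(q_star[action], 1.0)
    return q_star, pull

q_star, pull = make_testbed()
print("best arm:", int(q_star.argmax()), "one sampled reward:", pull(3))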

Expected Reward
●Knowing the expected reward q*(a) = E[R_t | A_t = a] trivializes the problem
●Estimate q*(a) with Q_t(a)

Sample-average
●Estimate by averaging received rewards:
   Q_t(a) = (sum of rewards when a was taken prior to t) / (number of times a was taken prior to t)
●Default value (e.g. 0) if the action was never selected
●Q_t(a) converges to q*(a) as the denominator goes to infinity
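
A minimal sketch of the sample-average estimate, storing every received reward as the bullets describe; the class name and everything beyond the slide's default value of 0 are illustrative.

import numpy as np

class SampleAverage:
    """Sample-average estimate: Q_t(a) = mean of rewards received for action a."""
    def __init__(self, k, default=0.0):
        self.default = default                    # value reported if a was never selected
        self.rewards = [[] for _ in range(k)]     # naive version: store every reward

    def estimate(self, action):
        r = self.rewards[action]
        return float(np.mean(r)) if r else self.default

    def update(self, action, reward):
        self.rewards[action].append(reward)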

Greedy method
●Always select greedily: A_t = argmax_a Q_t(a)
●No exploration
●Often stuck in suboptimal actions
[Diagram: the agent eats the usual cereal with probability 1]

ε-greedy method
●Select a random action with probability ε, otherwise select greedily
●All Q_t(a) converge to q*(a) as the denominators go to infinity
[Diagram: with ε = 0.1, the agent tries a new cereal with probability 0.1 and eats the usual cereal with probability 0.9]
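
A minimal sketch of ε-greedy selection over a vector of estimates Q; setting epsilon to 0 recovers the greedy method from the previous slide. The function name is illustrative.

import numpy as np

def epsilon_greedy(Q, epsilon=0.1, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy argmax."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))      # explore
    return int(np.argmax(Q))                  # exploit (epsilon = 0 is the greedy method)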

Greedy vs. ε-greedy

Incremental Implementation
●Don’t store the reward for each step
●Compute the average incrementally:
   Q_{n+1} = Q_n + (1/n) [R_n − Q_n]
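
A minimal sketch of the incremental update, with Q and N as arrays of estimates and selection counts (illustrative names); it replaces the stored reward lists in the earlier sample-average sketch.

import numpy as np

def incremental_update(Q, N, action, reward):
    """Sample average without storing rewards: Q_{n+1} = Q_n + (1/n) (R_n - Q_n)."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

Q, N = np.zeros(10), np.zeros(10, dtype=int)
incremental_update(Q, N, action=3, reward=1.2)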

Nonstationary problem
●q*(a) changes over time
●Want to give new experience more weight

Exponentially weighted average
●Constant step-size parameter α:
   Q_{n+1} = Q_n + α [R_n − Q_n] = (1−α)^n Q_1 + Σ_{i=1..n} α (1−α)^{n−i} R_i
●Gives more weight to recent rewards
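
A minimal sketch of the constant step-size update; the value α = 0.1 is only an example.

def constant_step_update(Q, action, reward, alpha=0.1):
    """Exponential recency-weighted average: Q <- Q + alpha (R - Q).
    The newest reward gets weight alpha; older rewards decay by (1 - alpha) per step."""
    Q[action] += alpha * (reward - Q[action])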

Weighted average (constant step-size α)
●Never completely converges
●Desirable in nonstationary problems

Sample-average
●Guaranteed convergence
●Converges slowly: needs tuning
●Seldom used in applications

Optimistic Initial Values
●Set initial action values optimistically (e.g. +5)
●Temporarily encourages exploration
●Doesn’t work well in nonstationary problems
[Example with α = 0.1: Q starts at +5 and falls toward the true value as rewards arrive:
 +5 → (R = 0) → +4.5 → (R = 0.1) → +4.06 → (R = -0.1) → +3.64 → (R = 0) → +3.28]
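
A minimal sketch combining optimistic initial values with the constant step-size update above; the +5 start and α = 0.1 reproduce the sequence shown on the slide.

import numpy as np

alpha = 0.1
Q = np.full(10, 5.0)                  # optimistic initial values (+5)

def greedy_step(action, reward):
    # early rewards are "disappointing" compared to +5, so the estimate falls
    # and a greedy agent is pushed to try other arms
    Q[action] += alpha * (reward - Q[action])

# reproduces the sequence on the slide: +5 -> +4.5 -> +4.06 -> +3.64 -> +3.28
for r in [0.0, 0.1, -0.1, 0.0]:
    greedy_step(0, r)
    print(round(float(Q[0]), 2))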

Optimistic Greedy vs. Realistic ε-greedy

Upper Confidence Bound (UCB)
●Take into account each action’s potential to be optimal:
   A_t = argmax_a [ Q_t(a) + c √(ln t / N_t(a)) ]
●Selected less → more potential
●Difficult to extend beyond multi-armed bandits
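
A minimal sketch of UCB action selection; c = 2 matches the book's experiment, and taking any untried action first (N = 0) is one common convention rather than something stated on the slide.

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]."""
    if np.any(N == 0):
        return int(np.argmin(N))      # an untried arm has unbounded potential: try it first
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))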

UCB vs. ε-greedy

Gradient Bandit Algorithms
●Learn a numerical preference H_t(a) for each action
●Convert to probabilities with softmax:
   π_t(a) = e^{H_t(a)} / Σ_b e^{H_t(b)}
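
A minimal sketch of the softmax conversion from preferences H to probabilities; subtracting the maximum is a standard numerical-stability trick, not part of the slide.

import numpy as np

def softmax_policy(H):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    z = np.exp(H - np.max(H))     # subtracting the max leaves the probabilities unchanged
    return z / z.sum()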

Gradient Bandit: Stochastic Gradient Descent
●Update preferences with SGD:
   H_{t+1}(A_t) = H_t(A_t) + α (R_t − R̄_t) (1 − π_t(A_t))
   H_{t+1}(a) = H_t(a) − α (R_t − R̄_t) π_t(a)   for all a ≠ A_t
●Baseline R̄_t: average of all rewards
●Increase the probability of A_t if its reward is above the baseline
●Decrease the probability of A_t if its reward is below the baseline
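
A minimal sketch of the preference update; the baseline is passed in as the running average of all rewards seen so far, and α = 0.1 is only an example.

import numpy as np

def gradient_bandit_update(H, action, reward, baseline, alpha=0.1):
    """H(A_t) += alpha (R - baseline) (1 - pi(A_t));  H(a) -= alpha (R - baseline) pi(a) for a != A_t."""
    pi = np.exp(H - np.max(H))
    pi /= pi.sum()
    H -= alpha * (reward - baseline) * pi        # the "a != A_t" term, applied to every action
    H[action] += alpha * (reward - baseline)     # correction turns A_t's term into +(1 - pi(A_t))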

Gradient Bandit: Results

Parameter Study
●Check performance in best setting
●Check hyperparameter sensitivity

Associative Search (Contextual Bandit)
●Observe some context that can help the decision
●Intermediate between the multi-armed bandit and the full RL problem
○Need to learn a policy associating observations with actions
○Each action only affects the immediate reward
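
A minimal sketch of a contextual bandit with a small discrete context set, reusing ε-greedy and a constant step-size per context; all names and parameter values are illustrative.

import numpy as np

class ContextualBandit:
    """One action-value row per context; act epsilon-greedily within the observed context."""
    def __init__(self, n_contexts, k, epsilon=0.1, alpha=0.1, seed=0):
        self.Q = np.zeros((n_contexts, k))
        self.epsilon, self.alpha = epsilon, alpha
        self.rng = np.random.default_rng(seed)

    def act(self, context):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.Q.shape[1]))
        return int(np.argmax(self.Q[context]))

    def update(self, context, action, reward):
        # the action affects only the immediate reward, so no value bootstrapping is needed
        self.Q[context, action] += self.alpha * (reward - self.Q[context, action])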

Thank you!
Original content from
●Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
●github.com/seungjaeryanlee
●www.endtoend.ai