A summary of Chapter 2: Multi-armed Bandits of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html
Chapter 2: Multi-armed Bandits
Seungjae Ryan Lee
●Slot machine
●Each spin (action) is independent
One-armed Bandit
●Multiple slot machines to choose from
●A simplified setting that avoids the complexities of the full RL problem
○No observations
○Actions have no delayed effects
Multi-armed Bandit problem
10-armed Testbed
●10 actions, 10 reward distributions
●Rewards drawn from stationary probability distributions
●Knowing the expected reward $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$ would trivialize the problem
●Estimate it with $Q_t(a) \approx q_*(a)$
Expected Reward
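Since the slides only describe the setup, here is a minimal Python sketch of such a testbed, assuming the book's design (each true value q*(a) drawn from a standard normal, each reward from a unit-variance normal around it; the class name Testbed is illustrative):

import numpy as np

class Testbed:
    """10-armed testbed with stationary Gaussian reward distributions."""
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true expected rewards, hidden from the agent
    def reward(self, action):
        # Each spin is independent: sample around the true value of the chosen arm
        return self.rng.normal(self.q_star[action], 1.0)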
●Estimate by averaging received rewards: $Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}$
●Default value (e.g., 0) if the action was never selected
●$Q_t(a)$ converges to $q_*(a)$ as the denominator goes to infinity
Sample-average
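A small sketch of the sample-average estimate under these definitions, with the default of 0 for never-selected actions as on the slide:

import numpy as np

def sample_average(rewards, default=0.0):
    """Q_t(a): mean of the rewards received for action a so far."""
    return float(np.mean(rewards)) if len(rewards) > 0 else default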
●Always select greedily: $A_t = \arg\max_a Q_t(a)$
●No exploration
●Often gets stuck in suboptimal actions
Greedy method
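Greedy selection is a plain argmax over the current estimates; a sketch with random tie-breaking (the book breaks ties arbitrarily):

import numpy as np

def greedy_action(Q, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    best = np.flatnonzero(Q == Q.max())  # all actions tied at the maximum estimate
    return int(rng.choice(best))         # break ties at random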
[Illustration: the greedy agent always eats the usual cereal (probability 1)]
●Select a random action with probability ε
●All $Q_t(a)$ converge to $q_*(a)$ as the denominators go to infinity
ε-greedy method
[Illustration: with ε = 0.1, the agent tries a new cereal with probability 0.1 and eats the usual cereal with probability 0.9]
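A sketch of ε-greedy selection, assuming ε = 0.1 as in the example above:

import numpy as np

def epsilon_greedy_action(Q, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:               # explore: any action, uniformly
        return int(rng.integers(len(Q)))
    best = np.flatnonzero(Q == Q.max())      # exploit: greedy with random tie-breaking
    return int(rng.choice(best))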
Greedy vs. ε-greedy
●Don’t store every received reward
●Compute the average incrementally: $Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$
Incremental Implementation
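A sketch of the incremental update; only the current estimate and a per-action count are stored (e.g., Q = np.zeros(10), N = np.zeros(10)):

def update_incremental(Q, N, action, reward):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): same result as storing and averaging."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]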
●$q_*(a)$ changes over time
●Want to give recent experience more weight
Nonstationary problem
●Constant step-size parameter α: $Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right)$
●Gives more weight to recent rewards: $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$
Exponentially weighted average
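Replacing 1/n with a constant α gives the exponential recency weighting above; a sketch with an assumed α = 0.1:

def update_constant_step(Q, action, reward, alpha=0.1):
    # Q_{n+1} = Q_n + alpha * (R_n - Q_n); expanding the recursion weights each past
    # reward R_i by alpha * (1 - alpha)^(n - i), so old rewards fade geometrically.
    Q[action] += alpha * (reward - Q[action])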
Constant step-size
●Never completely converges
●Desirable in nonstationary problems
Sample-average
●Guaranteed convergence
●Converges slowly; needs tuning
●Seldom used in applications
●Set initial action values optimistically (e.g., +5)
●Temporarily encourages exploration
●Doesn’t work well in nonstationary problems
Optimistic Initial Values
[Illustration: with constant step size α = 0.1, an optimistic estimate decays toward the true value: +5 → +4.5 (R = 0) → +4.06 (R = 0.1) → +3.64 (R = −0.1) → +3.28 (R = 0)]
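The numbers in the illustration follow from the constant step-size update with Q₁ = +5 and an assumed α = 0.1; a few lines reproduce them:

Q = 5.0                        # optimistic initial value
for R in [0.0, 0.1, -0.1, 0.0]:
    Q += 0.1 * (R - Q)         # constant step size alpha = 0.1
    print(round(Q, 2))         # prints 4.5, 4.06, 3.64, 3.28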
Optimistic Greedy vs. Realistic ε-greedy
●Takes each action’s potential to be optimal into account: $A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$
●Selected less often → more uncertainty → more potential
●Difficult to extend beyond multi-armed bandits
Upper Confidence Bound (UCB)
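A sketch of UCB selection under these definitions, with c left as a free exploration parameter:

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    if np.any(N == 0):
        return int(np.argmax(N == 0))  # try every action once before using the bound
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))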
UCB vs. ε-greedy
●Learn a numerical preference $H_t(a)$ for each action
●Convert preferences to probabilities with softmax: $\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
Gradient Bandit Algorithms
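A sketch of the softmax conversion, shifted by the maximum preference for numerical stability (the shift does not change the resulting probabilities):

import numpy as np

def softmax(H):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    z = np.exp(H - H.max())
    return z / z.sum()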
●Update preferences with SGD: $H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right)$
●Baseline $\bar{R}_t$: average of all rewards so far
●Increase an action’s probability if its reward is above the baseline
●Decrease it if the reward is below the baseline
Gradient Bandit: Stochastic Gradient Descent
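A sketch of one preference update under these definitions; α = 0.1 is an arbitrary choice, and the baseline is assumed to be maintained by the caller as an incremental average of all rewards:

import numpy as np

def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
    """H_{t+1}(a) = H_t(a) + alpha * (R_t - baseline) * (1{a = A_t} - pi_t(a))."""
    z = np.exp(H - H.max())
    pi = z / z.sum()                      # softmax policy over preferences
    indicator = np.zeros_like(H)
    indicator[action] = 1.0
    H += alpha * (reward - baseline) * (indicator - pi)
    # caller then updates the baseline, e.g. baseline += (reward - baseline) / t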
Gradient Bandit: Results
Parameter Study
●Check each method’s performance at its best setting
●Check sensitivity to hyperparameter values
●Observe some context that can inform the decision
●Intermediate between the multi-armed bandit and the full RL problem
○Need to learn a policy associating observations with actions
○Each action still affects only the immediate reward
Associative Search (Contextual Bandit)
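A minimal sketch of this intermediate setting, assuming discrete contexts (e.g., the color of the slot machine): keep an independent value table per context and act ε-greedily within it.

import numpy as np
from collections import defaultdict

class ContextualBandit:
    """One sample-average value table per observed context."""
    def __init__(self, k, epsilon=0.1, seed=0):
        self.k, self.epsilon = k, epsilon
        self.rng = np.random.default_rng(seed)
        self.Q = defaultdict(lambda: np.zeros(k))   # estimates per context
        self.N = defaultdict(lambda: np.zeros(k))   # selection counts per context
    def act(self, context):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))   # explore
        return int(np.argmax(self.Q[context]))      # exploit within this context
    def update(self, context, action, reward):
        self.N[context][action] += 1
        self.Q[context][action] += (reward - self.Q[context][action]) / self.N[context][action]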
Thank you!
Original content from
●Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
●github.com/seungjaeryanlee
●www.endtoend.ai