A summary of Chapter 2: Multi-armed Bandits of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html
Chapter 2: Multi-armed Bandits
Seungjae Ryan Lee
●Slot machine
●Each spin (action) is independent
One-armed Bandit
●Multiple slot machines to choose from
●A simplified setting that avoids the complexities of the full RL problem
○No observations
○Actions have no delayed effects
Multi-armed Bandit problem
10-armed Testbed
●10 actions, 10 reward distributions
●Rewards drawn from stationary probability distributions
●Knowing the expected reward $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$ would trivialize the problem
●Estimate it with $Q_t(a) \approx q_*(a)$
Expected Reward
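Since the slides only describe the setup, here is a minimal Python sketch of such a testbed, assuming the book's design (each true value q*(a) drawn from a standard normal, each reward from a unit-variance normal around it; the class name Testbed is illustrative):

import numpy as np

class Testbed:
    """10-armed testbed with stationary Gaussian reward distributions."""
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true expected rewards, hidden from the agent
    def reward(self, action):
        # Each spin is independent: sample around the true value of the chosen arm
        return self.rng.normal(self.q_star[action], 1.0)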
●Estimate by averaging received rewards: $Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t}$
●Default value (e.g., 0) if the action was never selected
●$Q_t(a)$ converges to $q_*(a)$ as the denominator goes to infinity
Sample-average
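A small sketch of the sample-average estimate under these definitions, with the default of 0 for never-selected actions as on the slide:

import numpy as np

def sample_average(rewards, default=0.0):
    """Q_t(a): mean of the rewards received for action a so far."""
    return float(np.mean(rewards)) if len(rewards) > 0 else default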
●Always select greedily: $A_t = \arg\max_a Q_t(a)$
●No exploration
●Often gets stuck in suboptimal actions
Greedy method
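Greedy selection is a plain argmax over the current estimates; a sketch with random tie-breaking (the book breaks ties arbitrarily):

import numpy as np

def greedy_action(Q, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    best = np.flatnonzero(Q == Q.max())  # all actions tied at the maximum estimate
    return int(rng.choice(best))         # break ties at random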
[Illustration: the greedy agent always eats the usual cereal (probability 1)]
●Select a random action with probability ε
●All $Q_t(a)$ converge to $q_*(a)$ as the denominators go to infinity
ε-greedy method
[Illustration: with ε = 0.1, the agent tries a new cereal with probability 0.1 and eats the usual cereal with probability 0.9]
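A sketch of ε-greedy selection, assuming ε = 0.1 as in the example above:

import numpy as np

def epsilon_greedy_action(Q, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:               # explore: any action, uniformly
        return int(rng.integers(len(Q)))
    best = np.flatnonzero(Q == Q.max())      # exploit: greedy with random tie-breaking
    return int(rng.choice(best))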
Greedy vs. ε-greedy
●Don’t store every received reward
●Compute the average incrementally: $Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$
Incremental Implementation
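A sketch of the incremental update; only the current estimate and a per-action count are stored (e.g., Q = np.zeros(10), N = np.zeros(10)):

def update_incremental(Q, N, action, reward):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): same result as storing and averaging."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]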
●$q_*(a)$ changes over time
●Want to give recent experience more weight
Nonstationary problem
●Constant step-size parameter α: $Q_{n+1} = Q_n + \alpha\left(R_n - Q_n\right)$
●Gives more weight to recent rewards: $Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha(1-\alpha)^{n-i} R_i$
Exponentially weighted average
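Replacing 1/n with a constant α gives the exponential recency weighting above; a sketch with an assumed α = 0.1:

def update_constant_step(Q, action, reward, alpha=0.1):
    # Q_{n+1} = Q_n + alpha * (R_n - Q_n); expanding the recursion weights each past
    # reward R_i by alpha * (1 - alpha)^(n - i), so old rewards fade geometrically.
    Q[action] += alpha * (reward - Q[action])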
Constant step-size
●Never completely converges
●Desirable in nonstationary problems
Sample-average
●Guaranteed convergence
●Converges slowly; needs tuning
●Seldom used in applications
●Set initial action values optimistically (e.g., +5)
●Temporarily encourages exploration
●Doesn’t work well in nonstationary problems
Optimistic Initial Values
[Illustration: with constant step size α = 0.1, an optimistic estimate decays toward the true value: +5 → +4.5 (R = 0) → +4.06 (R = 0.1) → +3.64 (R = −0.1) → +3.28 (R = 0)]
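The numbers in the illustration follow from the constant step-size update with Q₁ = +5 and an assumed α = 0.1; a few lines reproduce them:

Q = 5.0                        # optimistic initial value
for R in [0.0, 0.1, -0.1, 0.0]:
    Q += 0.1 * (R - Q)         # constant step size alpha = 0.1
    print(round(Q, 2))         # prints 4.5, 4.06, 3.64, 3.28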
Optimistic Greedy vs. Realistic ε-greedy
●Takes each action’s potential to be optimal into account: $A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$
●Selected less often → more uncertainty → more potential
●Difficult to extend beyond multi-armed bandits
Upper Confidence Bound (UCB)
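A sketch of UCB selection under these definitions, with c left as a free exploration parameter:

import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ]."""
    if np.any(N == 0):
        return int(np.argmax(N == 0))  # try every action once before using the bound
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))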
UCB vs. ε-greedy
●Learn a numerical preference $H_t(a)$ for each action
●Convert preferences to probabilities with softmax: $\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$
Gradient Bandit Algorithms
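A sketch of the softmax conversion, shifted by the maximum preference for numerical stability (the shift does not change the resulting probabilities):

import numpy as np

def softmax(H):
    """pi_t(a) = exp(H_t(a)) / sum_b exp(H_t(b))."""
    z = np.exp(H - H.max())
    return z / z.sum()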
●Update preferences with SGD: $H_{t+1}(a) = H_t(a) + \alpha\left(R_t - \bar{R}_t\right)\left(\mathbb{1}_{a = A_t} - \pi_t(a)\right)$
●Baseline $\bar{R}_t$: average of all rewards so far
●Increase an action’s probability if its reward is above the baseline
●Decrease it if the reward is below the baseline
Gradient Bandit: Stochastic Gradient Descent
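A sketch of one preference update under these definitions; α = 0.1 is an arbitrary choice, and the baseline is assumed to be maintained by the caller as an incremental average of all rewards:

import numpy as np

def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
    """H_{t+1}(a) = H_t(a) + alpha * (R_t - baseline) * (1{a = A_t} - pi_t(a))."""
    z = np.exp(H - H.max())
    pi = z / z.sum()                      # softmax policy over preferences
    indicator = np.zeros_like(H)
    indicator[action] = 1.0
    H += alpha * (reward - baseline) * (indicator - pi)
    # caller then updates the baseline, e.g. baseline += (reward - baseline) / t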
Gradient Bandit: Results
Parameter Study
●Check each method’s performance at its best setting
●Check sensitivity to hyperparameter values
●Observe some context that can inform the decision
●Intermediate between the multi-armed bandit and the full RL problem
○Need to learn a policy associating observations with actions
○Each action still affects only the immediate reward
Associative Search (Contextual Bandit)
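A minimal sketch of this intermediate setting, assuming discrete contexts (e.g., the color of the slot machine): keep an independent value table per context and act ε-greedily within it.

import numpy as np
from collections import defaultdict

class ContextualBandit:
    """One sample-average value table per observed context."""
    def __init__(self, k, epsilon=0.1, seed=0):
        self.k, self.epsilon = k, epsilon
        self.rng = np.random.default_rng(seed)
        self.Q = defaultdict(lambda: np.zeros(k))   # estimates per context
        self.N = defaultdict(lambda: np.zeros(k))   # selection counts per context
    def act(self, context):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.k))   # explore
        return int(np.argmax(self.Q[context]))      # exploit within this context
    def update(self, context, action, reward):
        self.N[context][action] += 1
        self.Q[context][action] += (reward - self.Q[context][action]) / self.N[context][action]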
Thank you!
Original content from
●Reinforcement Learning: An Introduction by Sutton and Barto
You can find more content in
●github.com/seungjaeryanlee
●www.endtoend.ai