Multi-armed bandit


About This Presentation

Multi-armed bandit problem


Slide Content

Multi-armed bandit
Jie-Han Chen
NetDB, National Cheng Kung University
3/27, 2018 @ National Cheng Kung University, Taiwan

Outline
●K-armed bandit problem
●Action-value function
●Exploration & Exploitation
●Example: 10-armed bandit problem
●Incremental method for value estimation

2

Why introduce the multi-armed bandit problem?
The multi-armed bandit problem is a simplified form of the sequential decision-making problem.
We often use this simplified decision-making problem to discuss key issues in
reinforcement learning, e.g. the exploration-exploitation dilemma.
3

One-armed bandit
● A slot machine (吃角子老虎機)
● The reward given by a slot machine is generated from some probability
distribution.

4
image source:
https://i.ebayimg.com/images/g/rg0AAOSwwC5aLCsQ/s-l300.jpg

K-armed Bandit Problem
Imagine that you are at a casino on a Friday night. In the casino, there are many slot
machines.
Tonight, your objective is to play those slot machines and win as much money as
possible.
How do you choose which slot machine to play?
5

Applications of the k-armed bandit problem
● The k-armed bandit problem has been used to model many decision problems
in which the problem itself is non-associative. In such a problem, each bandit
provides a random reward drawn from a probability distribution specific to that
bandit.
● Non-associative here means that the decision made at each time step does not
need to take the current situation (state, observation) into account.


6

Examples of the k-armed bandit problem
● Recommendation systems
● What to eat tonight
● Choosing experimental treatments for a series of seriously ill patients

7
source:
http://hangyebk.baike.com/article-421947.html
source:
http://www.cdns.com.tw/news.php?n_id=31&nc_id=51809

Action-value function
In the k-armed bandit problem, each of the k actions has an expected or mean
reward given that the action is selected; we call this the value of that action.
We denote the action selected at time step t as At, and the corresponding
reward as Rt. The value of an arbitrary action a is denoted q*(a):

q*(a) = E[Rt | At = a]

8
The action-value is the expected reward of a specific action; the * in q*(a) means
the "true" action-value.

Action-value function
If we knew the value of each action, it would be trivial to solve the k-armed
bandit problem: always select the action with the highest value.
In practice, we don't know the true action-value q*(a), but we can use some
method to estimate it.
We denote the estimated value of action a at time step t as Qt(a). We would
like Qt(a) to be close to q*(a).

9

Exploration & Exploitation
Exploitation
If you maintain estimates of the action values, then at any time step there is at least
one action whose estimated value is greatest. We call these the greedy actions. When
you select one of these actions, we say that you are exploiting your current
knowledge of the values of the actions.

10

Exploration & Exploitation


11
q1 = 0.5 q2 = 1.3 q3 = 0.9 q4 = 1.1
These expected action values are our knowledge of the bandits.

Exploration & Exploitation


12
q1 = 0.5 q2 = 1.3 q3 = 0.9 q4 = 1.1
In exploitation, we only choose the bandit with the highest action value!

Exploration & Exploitation
Exploration
Instead, if you select one of the non-greedy actions, we say that you are exploring,
because this enables you to improve your estimates of the non-greedy actions'
values.

13

Exploration & Exploitation
Without exploration, the agent’s decision may be suboptimal because of inaccurate
action-value estimation.
14
[Figure: a probability distribution over reward]
image source:
http://philschatz.com/statistics-book/resources/fig-ch06_07_02.jpg

Exploration & Exploitation
● What to eat tonight
○ Exploitation: 水工坊 has always been good, so let's go there again tonight!
○ Exploration: a new place called 香香麵 just opened next door and we have never tried it; let's go check it out.
● On the way home
○ Exploitation: taking the usual route works well enough.
○ Exploration: the usual route always involves a long wait; trying another route might be faster.
● Buying skate shoes
○ Exploitation: my trusty skate shoes are worn out, so I'll buy the same pair again.
○ Exploration: I heard someone bought a pair of shoes at a place called 魅力之都 and even wrote a song about it; I'll go look there too.


15

Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation

16

Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation

17
We'll introduce this one first, with an example.

Sample-average method
The true action-value is the expected reward of a specific action:

q*(a) = E[Rt | At = a]
18

Sample-average method
The true action-value is the expected reward of a specific action:

q*(a) = E[Rt | At = a]

One natural way to estimate the action-value is to average the rewards actually
received:

Qt(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)

We call this the sample-average method.
19
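
A minimal Python sketch of the sample-average method, assuming we simply keep the reward history for each action (the names here are illustrative, not from the slides):

```python
from collections import defaultdict

# Reward history per action; Qt(a) is just the mean of the rewards
# observed so far for action a.
reward_history = defaultdict(list)

def record(action, reward):
    """Store the reward received after taking `action`."""
    reward_history[action].append(reward)

def q_estimate(action, default=0.0):
    """Sample-average estimate Qt(a); `default` is returned before a has ever been taken."""
    history = reward_history[action]
    return sum(history) / len(history) if history else default
```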

Greedy policy
The simplest action selection rule is to select one of the actions with the highest
estimated action-value, that is, one of the greedy actions. This action selection
policy is called the greedy policy:

At = argmax_a Qt(a)
20
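
A small sketch of the greedy policy built on such estimates; breaking ties at random is a common convention, although the slides do not specify it:

```python
import random

def greedy_action(q_values):
    """Return an action with the highest estimated value, breaking ties at random."""
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(candidates)

# With the estimates from the four-bandit example above:
print(greedy_action([0.5, 1.3, 0.9, 1.1]))  # -> 1 (the arm with estimated value 1.3)
```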

Greedy policy
● Always exploits current knowledge
● Because it never samples apparently inferior actions, it often converges to a
suboptimal action
21

Ɛ-greedy policy
Sometimes we need more exploration while maintaining the action-value estimates. A simple
alternative is to behave greedily most of the time, but with a small probability Ɛ to
instead select an action at random, with all actions equally likely. We call this method
the Ɛ-greedy policy.


22
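
A minimal sketch of Ɛ-greedy action selection on top of the same estimates (an assumed implementation, not code from the slides):

```python
import random

def epsilon_greedy_action(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick a greedy action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))              # explore
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return random.choice(candidates)                        # exploit
```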

Ɛ-greedy policy
● Provides better exploration
● Since every action will be sampled an infinite number of times, Qt(a) will
converge to q*(a)
● Needs more time for training (more time to converge)
23

Example: The 10-armed testbed
We take a 10-armed bandit as an example. Each arm has its own reward distribution:
● The actual reward Rt was selected from a normal distribution with mean q*(At)
and variance 1.
● The true action values q*(a) were also selected from a normal distribution with
mean 0 and variance 1.
source: Microsoft research
24
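
A sketch of this testbed in Python with NumPy, following the description above (the seed and helper names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# True action values q*(a): drawn once from N(0, 1), one value per arm.
q_star = rng.normal(loc=0.0, scale=1.0, size=10)

def pull(arm):
    """Reward for pulling `arm`: Rt ~ N(q*(arm), 1)."""
    return rng.normal(loc=q_star[arm], scale=1.0)
```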

source: Sutton’s textbook
25

The 10-armed testbed
The data are averaged over 2000 runs (each run is 1000 steps long).
source: Sutton’s textbook
26

The 10-armed bandit
● Ɛ-greedy can reach higher performance than pure greedy

● The smaller Ɛ is, the more steps it needs to converge

● In the long term, a smaller Ɛ achieves better performance
27
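
A rough, self-contained sketch of the experiment behind these curves: Ɛ-greedy agents with sample-average estimates on fresh 10-armed testbeds, averaged over independent runs (2000 runs of 1000 steps, as on the previous slide; shrink these numbers if it runs too slowly):

```python
import numpy as np

def run(epsilon, n_arms=10, n_steps=1000, rng=None):
    """One run on a fresh 10-armed testbed; returns the reward obtained at each step."""
    rng = rng or np.random.default_rng()
    q_star = rng.normal(0.0, 1.0, n_arms)   # true action values
    Q = np.zeros(n_arms)                    # sample-average estimates
    N = np.zeros(n_arms)                    # pull counts
    rewards = np.empty(n_steps)
    for t in range(n_steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))   # explore
        else:
            a = int(np.argmax(Q))           # exploit
        r = rng.normal(q_star[a], 1.0)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]           # incremental sample average (see later slides)
        rewards[t] = r
    return rewards

for eps in (0.0, 0.01, 0.1):
    avg = np.mean([run(eps) for _ in range(2000)], axis=0)
    print(f"epsilon={eps}: mean reward over the last 100 steps = {avg[-100:].mean():.3f}")
```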

How to choose Ɛ?
In practice, the choice of Ɛ depends on your task, your computational resources,
and your deadline.
● If your reward signal is generated by a non-stationary distribution, it is better
to use a larger Ɛ at first.
● If you have more computational resources, you can run your experiments faster,
so the estimates will converge sooner.

28

Ɛ decay
In practice, there is another way to choose Ɛ. At the start of the task, we can
use a larger Ɛ to encourage exploration; later, we decrease Ɛ by some amount at
each step until it reaches a minimum setting (e.g. 0.005). This method is called Ɛ
decay.
● The most common schedule is linear decay, but there are many other decay
scheduling methods.
29
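
A small sketch of a linearly decayed Ɛ; the start value and per-step decrement here are illustrative, and the 0.005 floor follows the example above:

```python
def linear_epsilon(step, start=1.0, decrement=1e-3, minimum=0.005):
    """Linearly decayed exploration rate, clipped at a minimum value."""
    return max(minimum, start - decrement * step)

# Epsilon shrinks from 1.0 toward the 0.005 floor as training proceeds.
print(linear_epsilon(0))     # 1.0
print(linear_epsilon(500))   # 0.5
print(linear_epsilon(2000))  # 0.005 (floor reached)
```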

Estimate action-value function
We will introduce two methods for estimating the action-value function:
● Sample-average method
● Incremental implementation

30
Now we return to introduce this one.

Estimate action-value: Incremental implementation
Previously, we introduced the sample-average method for estimating the action
value. However, in practice we don't want to store every reward received for each
action, so an incremental implementation is desired.

31


Estimate action-value: Incremental implementation
Let Qn denote the estimate of the action value for a specific action after it has
been selected n-1 times:

Qn = (R1 + R2 + ... + R(n-1)) / (n - 1)

33


Estimate action-value: Incremental implementation
The incremental update for the action value of a specific action:

Qn+1 = Qn + (1/n) [Rn - Qn]

Its general form is:

NewEstimate ← OldEstimate + StepSize [Target - OldEstimate]

35

Estimate action-value: Incremental implementation
The incremental update for the action value of a specific action:

Qn+1 = Qn + (1/n) [Rn - Qn]

Its general form is:

NewEstimate ← OldEstimate + StepSize [Target - OldEstimate]

The term [Target - OldEstimate] is the error in the estimate.
36
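
A sketch of the incremental implementation in Python: only the current estimate and the selection count are stored per action, rather than the full reward history (the class name is my own):

```python
class IncrementalEstimate:
    """Running sample-average estimate, updated with O(1) memory."""

    def __init__(self, initial=0.0):
        self.q = initial   # current estimate Qn
        self.n = 0         # number of times the action has been selected

    def update(self, reward):
        """Apply Qn+1 = Qn + (1/n) * (Rn - Qn); (Rn - Qn) is the error in the estimate."""
        self.n += 1
        self.q += (reward - self.q) / self.n
        return self.q
```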

StepSize in stochastic approximation theory



● The estimates are guaranteed to converge with probability 1 if the step size
αn(a) satisfies Σn αn(a) = ∞ and Σn αn(a)² < ∞, where n is the number of
iterations (the number of times the action has been selected).
● In practice, a step size that satisfies the above conditions learns very
slowly, so we often do not adopt them.

37
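
As a worked check (my addition, following the standard stochastic-approximation result): the sample-average step size αn = 1/n satisfies both conditions, while a constant step size fails the second one, which is why a constant step size never fully converges but can keep tracking a non-stationary problem.

```latex
% alpha_n = 1/n (sample average): both conditions hold.
\sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad
\sum_{n=1}^{\infty} \frac{1}{n^{2}} = \frac{\pi^{2}}{6} < \infty
% alpha_n = alpha (a constant): the second condition fails.
\sum_{n=1}^{\infty} \alpha = \infty, \qquad
\sum_{n=1}^{\infty} \alpha^{2} = \infty
```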

A simple bandit algorithm
source: Sutton's textbook
38
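
Below is a Python sketch of a simple bandit algorithm of the kind shown in Sutton's book, assuming the standard version (Ɛ-greedy selection with incrementally updated sample averages):

```python
import random

def simple_bandit(bandit, n_arms, epsilon=0.1, n_steps=1000):
    """Epsilon-greedy action selection with incremental sample-average estimates.
    `bandit(a)` is assumed to return a sampled reward for arm `a`."""
    Q = [0.0] * n_arms   # estimated action values
    N = [0] * n_arms     # selection counts
    for _ in range(n_steps):
        if random.random() < epsilon:
            a = random.randrange(n_arms)                  # explore
        else:
            a = max(range(n_arms), key=lambda i: Q[i])    # exploit
        r = bandit(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                         # incremental update
    return Q

# Example usage with a toy 3-armed bandit whose true values are 0.2, 0.5 and 0.9:
estimates = simple_bandit(lambda a: random.gauss([0.2, 0.5, 0.9][a], 1.0), n_arms=3)
print(estimates)
```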

Content not covered here
In addition to Ɛ-greedy, there are many other exploration methods:
● Upper confidence bound (UCB)
● Thompson sampling
There is also an associative version of the multi-armed bandit problem:
● Contextual bandit problem
Next, we will step into the core concept of reinforcement learning: the Markov Decision
Process (MDP).

39

Questions?
40