07-Richard-Combes Multi-Armed Bandits.pdf

ezzeri1 15 views 39 slides Feb 26, 2025

Multi-Armed Bandits:
A Novel Generic Optimal Algorithm and Applications to Networking
Richard Combes
CentraleSupélec / L2S, France
SDN Days, 2017
1 / 39

Outline
- Bandits: A primer
- Applications
- Some basic tools
- Classical Bandits
- Generic Bandits
2 / 39

A first example: sequential treatment allocation
- There are T patients with the same symptoms awaiting treatment
- Two treatments exist, one is better than the other
- Based on past successes and failures, which treatment should you use?
3 / 39

The model
- At time n, choose action x_n ∈ X, observe feedback y_n(x_n) ∈ Y, and obtain reward r_n(x_n) ∈ R_+.
- "Bandit feedback": rewards and feedback depend on actions (often y_n = r_n)
- Admissible algorithm:
  x_{n+1} = f_{n+1}(x_0, r_0(x_0), y_0(x_0), ..., x_n, r_n(x_n), y_n(x_n))
- Performance metric: regret
  R(T) = max_{x∈X} E[ ∑_{n=1}^T r_n(x) ]  −  E[ ∑_{n=1}^T r_n(x_n) ]
         (oracle)                            (your algorithm)
4 / 39
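A toy simulation may make the regret definition concrete. In the sketch below the "algorithm" is deliberately naive (it plays uniformly at random) and the two Bernoulli arm means 0.4 and 0.6 are made up for illustration; nothing here comes from the slides beyond the definition of R(T):

```python
import random

def regret(T=10000, thetas=(0.4, 0.6), seed=0):
    """Monte-Carlo estimate of the regret R(T) of a uniformly random policy
    on a two-armed Bernoulli bandit (illustrative instance, made-up means)."""
    rng = random.Random(seed)
    oracle = max(thetas) * T              # expected reward of the best fixed arm
    algo = 0.0
    for _ in range(T):
        x = rng.randrange(len(thetas))    # the "algorithm": pick an arm at random
        algo += rng.random() < thetas[x]  # Bernoulli reward
    return oracle - algo

r = regret()
# uniform play earns 0.5 per step vs the oracle's 0.6, so r should be near 0.1 * T
```

Since the random policy never learns, its regret grows linearly in T; the point of the algorithms later in the deck is to make R(T) grow only logarithmically.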

Bandit taxonomy: adversarial vs stochastic
Stochastic Bandit:
- Game against a stochastic environment
- Unknown parameter θ ∈ Θ
- (r_n(x))_n is i.i.d. with expectation θ_x
Adversarial Bandit:
- Game against a non-adaptive adversary
- For all x, (r_n(x))_n is an arbitrary (deterministic) sequence
- At time 0, the adversary "writes down (r_n(x))_{n,x} in an envelope"
Engineering problems are mainly stochastic
5 / 39

Independent vs correlated arms
- Independent arms: Θ = [0,1]^K
- Correlated arms: Θ ≠ [0,1]^K: choosing arm 1 gives information on arms 1 and 2
Correlation enables (sometimes much) faster learning.
6 / 39

Bandit taxonomy: cardinality of the set of arms
Discrete Bandits:
- X = {1, ..., K}
- All arms can be sampled infinitely many times
- Regret O(log T) (stochastic), O(√T) (adversarial)
Infinite Bandits:
- X = N, Bayesian setting (otherwise trivial)
- Explore o(T) arms until a good one is found
- Regret: O(√T).
Continuous Bandits:
- X ⊂ R^d convex, x ↦ θ(x) has a structure
- Structures: convex, Lipschitz, linear, unimodal (quasi-convex), etc.
- Similar to derivative-free stochastic optimization
- Regret: O(poly(d)√T).
7 / 39

Bandit taxonomy: regret minimization vs best arm identification
Sample arms and output the best arm with a given probability, similar to PAC learning
Fixed budget setting:
- T fixed, sample arms x_1, ..., x_T, and output x̂_T
- Easier problem: estimation + budget allocation
- Goal: minimize P[x̂_T ≠ x*]
Fixed confidence setting:
- δ fixed, sample arms x_1, ..., x_τ and output x̂_τ
- Harder problem: estimation + budget allocation + optimal stopping (τ is a stopping time)
- Goal: minimize E[τ] s.t. P[x̂_τ ≠ x*] ≤ δ
8 / 39

Outline
- Bandits: A primer
- Applications
- Some basic tools
- Classical Bandits
- Generic Bandits
9 / 39

Example 1: Rate adaptation in wireless networks
- Adapting the modulation/coding scheme to the radio environment [1]
- Rates: r_1, r_2, ..., r_K
- Success probabilities: θ_1, θ_2, ..., θ_K
- Throughputs: r_1θ_1, r_2θ_2, ..., r_Kθ_K
Structure: unimodality + θ_1 > θ_2 > ... > θ_K.
[1] R. Combes, A. Proutiere, D. Yun, J. Ok, and Y. Yi, "Optimal rate sampling in 802.11 systems", IEEE INFOCOM 2014
10 / 39

Example 2: Shortest path routing
- Choose a path minimizing expected delay [2]
- Stochastic delays: X_i(n) ~ Geometric(θ_i)
- Path x ∈ {0,1}^d, expected delay ∑_{i=1}^d x_i/θ_i.
- Hop-by-hop feedback: X_i(n), for {i : x_i(n) = 1}
[Figure: a 16-node graph; each link carries a parameter, e.g. the path 1→6→11→15→16 has link parameters θ_{1,6}, θ_{6,11}, θ_{11,15}, θ_{15,16}]
[2] S. Talebi, Z. Zou, R. Combes, A. Proutiere, M. Johansson, "Stochastic Online Shortest Path Routing: The Value of Feedback", IEEE Trans. Automatic Control, 2017
11 / 39
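With geometric link delays, the expected delay of a path is simply the sum of the mean link delays, since E[Geometric(θ)] = 1/θ. A tiny sketch with made-up link parameters (the values and the chosen path are illustrative, not from the slides):

```python
import math

# Expected delay of a path x in {0,1}^d with Geometric(theta_i) link delays:
# E[delay] = sum_i x_i / theta_i, because the mean of Geometric(theta) is 1/theta.
thetas = [0.5, 0.25, 0.2, 0.1]   # made-up per-link success probabilities
path = [1, 1, 0, 1]              # hypothetical path using links 1, 2 and 4
expected_delay = sum(1 / t for x, t in zip(path, thetas) if x)
assert math.isclose(expected_delay, 2 + 4 + 10)   # = 16 time slots
```

The hop-by-hop feedback mentioned above means each traversed link i reveals its own sample X_i(n), which is what makes the combinatorial structure learnable link by link.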

Example 3: Learning to Rank (search engines)
- Given a query, N relevant items, L display slots [3]
- A user is shown L items, scrolls down and selects the first relevant item
- One must show the most relevant items in the first slots.
- θ_n: probability of clicking on item n (independence between items is assumed)
- Reward r(ℓ) if the user clicks on the ℓ-th item, and 0 if the user does not click
[Figure: screenshot of a search-results page illustrating the L display slots]
[3] R. Combes, S. Magureanu, A. Proutiere and C. Laroche, "Learning to Rank: Regret Lower Bound and Efficient Algorithms", SIGMETRICS 2015
12 / 39

Example 4: Ad-display optimization
- Users are shown ads relevant to their queries [4]
- Announcers x ∈ {1, ..., K}, with click-through-rate θ_x and budget per unit of time c_x
- Bandit with budgets: each arm has a budget of plays
- Displayed announcer is charged per impression/click
[4] R. Combes, C. Jiang and R. Srikant, "Bandits with Budgets: Regret Lower Bounds and Optimal Algorithms", SIGMETRICS 2015
13 / 39

Outline
- Bandits: A primer
- Applications
- Some basic tools
- Classical Bandits
- Generic Bandits
14 / 39

Optimism in the face of uncertainty
- Replace arm values by upper confidence bounds
- "Index" b_x(n) such that b_x(n) ≥ θ_x with high probability
- Select the arm with highest index: x_n ∈ arg max_{x∈X} b_x(n)
- Analysis idea:
  E[t_x(T)] ≤ ∑_{n=1}^T P[b_{x*}(n) ≤ θ*]  +  ∑_{n=1}^T P[x_n = x, b_x(n) ≥ θ*].
             (o(log T))                       (dominant term)
Almost all algorithms in the literature are optimistic (sic!)
15 / 39

Information theory and statistics
- Distributions P, Q with densities p and q w.r.t. a measure m
- Kullback-Leibler divergence:
  D(P||Q) = ∫ p(x) log( p(x)/q(x) ) m(dx)
- Pinsker's inequality:
  √( D(P||Q)/2 ) ≥ TV(P,Q) = (1/2) ∫ |p(x) − q(x)| m(dx).
- If P, Q ~ Ber(p), Ber(q):
  D(P||Q) = p log( p/q ) + (1−p) log( (1−p)/(1−q) )
- Also (Pinsker + the inequality log(x) ≤ x−1):
  2(p−q)² ≤ D(P||Q) ≤ (p−q)²/( q(1−q) )
The KL-divergence is ubiquitous in bandit problems
16 / 39
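The Bernoulli formulas above are easy to check numerically. A minimal sketch (the function name and the clipping constant `eps` are implementation choices, not from the slides):

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """D(Ber(p) || Ber(q)), with the usual 0*log(0) = 0 convention."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)   # avoid log(0) at the boundary
    d = 0.0
    if p > 0:
        d += p * math.log(p / q)
    if p < 1:
        d += (1 - p) * math.log((1 - p) / (1 - q))
    return d

p, q = 0.3, 0.5
d = kl_bernoulli(p, q)
tv = abs(p - q)                 # total variation distance between Ber(p) and Ber(q)
assert math.sqrt(d / 2) >= tv   # Pinsker's inequality
assert 2 * (p - q) ** 2 <= d <= (p - q) ** 2 / (q * (1 - q))   # sandwich from the slide
```

This helper is reused implicitly by everything that follows: lower bounds, Chernoff bounds and the KL-UCB index are all stated in terms of this divergence.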

Regret Lower Bounds: general technique
- Decision x, two parameters θ, λ with x*(λ) = x ≠ x*(θ).
- Consider an algorithm with R^π(T) = O(log T) for all parameters (uniformly good):
  E_θ[t_x(T)] = O(log T),  E_λ[t_x(T)] = T − O(log T).
- Markov inequality:
  P_θ[t_x(T) ≥ T/2] + P_λ[t_x(T) < T/2] = O(T^{−1} log T).
- 1{t_x(T) ≥ T/2} is a hypothesis test, with risk O(T^{−1} log T)
- Hence (Neyman-Pearson / Tsybakov):
  ∑_x E_θ[t_x(T)] KL(θ_x, λ_x) ≥ log T − O(log log T),
  where the left-hand side is the KL divergence of the observations.
17 / 39

Concentration inequalities: Chernoff bounds
- Building indexes requires tight concentration inequalities
- Chernoff bounds: upper bound the MGF
- X = (X_1, ..., X_n) independent, with mean μ, S_n = ∑_{n'=1}^n X_{n'}
- G such that log( E[e^{λ(X_n − μ)}] ) ≤ G(λ), λ ≥ 0
- Generic technique:
  P[S_n − nμ ≥ δ] = P[e^{λ(S_n − nμ)} ≥ e^{λδ}]
    ≤ e^{−λδ} E[e^{λ(S_n − nμ)}]   (Markov)
    = exp( −λδ + n G(λ) )   (independence)
    ≤ exp( −n max_{λ≥0} { λ δ n^{−1} − G(λ) } ).
18 / 39

Concentration inequalities: Chernoff and Hoeffding's inequality
- Bounded variables: if X_n ∈ [a,b] a.s. then
  E[e^{λ(X_n − μ)}] ≤ e^{λ²(b−a)²/8}   (Hoeffding's lemma)
- Hoeffding's inequality:
  P[S_n − nμ ≥ δ] ≤ exp( −2δ² / (n(b−a)²) )
- Subgaussian variables: E[e^{λ(X_n − μ)}] ≤ e^{λ²σ²/2}, similar
- Bernoulli variables:
  E[e^{λ(X_n − μ)}] = e^{−λμ}( (1−μ) + μ e^{λ} )
- Chernoff's inequality:
  P[S_n − nμ ≥ δ] ≤ exp( −n KL(μ + δ/n, μ) )
- Pinsker's inequality: Chernoff is stronger than Hoeffding.
19 / 39
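The claim that the Chernoff (KL) bound dominates Hoeffding's bound for Bernoulli sums can be checked on concrete numbers. The values n = 100, μ = 0.1, δ = 20 below are made up for illustration:

```python
import math

def kl(p, q):
    """D(Ber(p) || Ber(q))."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

# Tail bound on P[S_n - n*mu >= delta] for a sum of n Bernoulli(mu) variables
n, mu, delta = 100, 0.1, 20.0
hoeffding = math.exp(-2 * delta ** 2 / n)           # (b - a) = 1 for Bernoulli
chernoff = math.exp(-n * kl(mu + delta / n, mu))    # Chernoff with the Bernoulli KL
assert chernoff <= hoeffding   # Pinsker: KL(p, q) >= 2(p - q)^2, so Chernoff is tighter
```

The gap is largest away from μ = 1/2, which is exactly the regime where KL-UCB (next section) gains over plain UCB.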

Concentration inequalities: variable sample size and peeling
- In bandit problems, the sample size is random and depends on the samples themselves
- Intervals N_k = {n_k, ..., n_{k+1}}, N = ∪_{k=1}^K N_k
- Idea: Z_n = e^{λ(S_n − nμ)} is a positive sub-martingale:
  P[max_{n∈N_k}(S_n − nμ) ≥ δ] = P[max_{n∈N_k} Z_n ≥ e^{λδ}]
    ≤ e^{−λδ} E[Z_{n_{k+1}}]   (Doob's inequality)
    = exp( −λδ + n_{k+1} G(λ) )
    ≤ exp( −n_{k+1} max_{λ≥0} { λ δ n_{k+1}^{−1} − G(λ) } ).
- Peeling trick (Neveu): union bound over k, with n_k = (1+β)^k.
20 / 39

Outline
- Bandits: A primer
- Applications
- Some basic tools
- Classical Bandits
- Generic Bandits
21 / 39

The Lai-Robbins bound
- Actions X = {1, ..., K}
- Rewards θ = (θ_1, ..., θ_K) ∈ [0,1]^K
- Uniformly good algorithm: R(T) = O(log T), ∀θ
Theorem (Lai '85)
For any uniformly good algorithm, and x s.t. θ_x < θ*, we have:
  lim inf_{T→∞} E[t_x(T)] / log T ≥ 1 / KL(θ_x, θ*)
- For x ≠ x*, apply the generic technique with:
  λ = (θ_1, ..., θ_{x−1}, θ* + ε, θ_{x+1}, ..., θ_K)
22 / 39

The Lai-Robbins bound
[Figure: arm means on a line; the change of measure raises θ_x just above θ*, the "most confusing parameter"]
23 / 39

Classical bandits: algorithms
- Select the arm with highest index: x_n ∈ arg max_{x∈X} b_x(n)
- UCB algorithm (Hoeffding's inequality):
  b_x(n) = θ̂_x(n) + √( 2 log n / t_x(n) )
           (empirical mean + exploration bonus)
- KL-UCB algorithm (using Garivier's inequality):
  b_x(n) = max{ q ≤ 1 : t_x(n) KL(θ̂_x(n), q) ≤ f(n) },
  where the left-hand side is a log-likelihood ratio, f(n) plays the role of log(confidence level⁻¹), and f(n) = log n + 3 log log n.
24 / 39
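Both indices are easy to compute. Since q ↦ KL(θ̂, q) is increasing on [θ̂, 1], the KL-UCB index can be found by bisection. A sketch (function names and the iteration count are illustrative choices, not from the slides):

```python
import math

def kl(p, q):
    """D(Ber(p) || Ber(q))."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def ucb_index(mean, t_x, n):
    """UCB index: empirical mean + exploration bonus."""
    return mean + math.sqrt(2 * math.log(n) / t_x)

def klucb_index(mean, t_x, n, iters=50):
    """KL-UCB index: the largest q in [mean, 1] with t_x * KL(mean, q) <= f(n),
    located by bisection since KL(mean, .) is increasing to the right of mean."""
    f = math.log(n) + 3 * math.log(max(math.log(n), 1.0))
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if t_x * kl(mean, mid) <= f:
            lo = mid
        else:
            hi = mid
    return lo

# After 100 pulls with empirical mean 0.5, at time n = 1000:
b_ucb = ucb_index(0.5, 100, 1000)
b_kl = klucb_index(0.5, 100, 1000)
assert 0.5 < b_kl < b_ucb   # the KL index is the tighter confidence bound here
```

The index stays above the empirical mean (optimism) but shrinks toward it as t_x(n) grows, which is what drives the log(T) regret of the next slide.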

Classical bandits: regret analysis
Theorem (Auer '02)
Under algorithm UCB, for all x s.t. θ_x < θ*:
  E[t_x(T)] ≤ 8 log T / (θ* − θ_x)² + π²/6.
Theorem (Garivier '11)
Under algorithm KL-UCB, for all x s.t. θ_x < θ* and for all δ < θ* − θ_x:
  E[t_x(T)] ≤ log T / KL(θ_x + δ, θ*) + C log log T + δ⁻².
25 / 39
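Comparing the two theorems on a concrete instance shows why KL-UCB is preferable: its leading constant matches the Lai-Robbins lower bound, while UCB's 8/Δ² is always larger (by Pinsker, KL ≥ 2Δ², so 1/KL ≤ 1/(2Δ²) < 8/Δ²). The arm values 0.4 and 0.6 below are made up for illustration:

```python
import math

def kl(p, q):
    """D(Ber(p) || Ber(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta_x, theta_star = 0.4, 0.6
ucb_count = 8 / (theta_star - theta_x) ** 2    # Auer'02 leading constant: 8 / Delta^2
klucb_count = 1 / kl(theta_x, theta_star)      # Garivier'11 constant = Lai-Robbins bound
assert klucb_count < ucb_count                 # KL-UCB samples the bad arm far less
```

On this instance the constants are roughly 12 vs 200 pulls per log(T), an order-of-magnitude gap.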

Outline
- Bandits: A primer
- Applications
- Some basic tools
- Classical Bandits
- Generic Bandits
26 / 39

The model
- A finite set of arms X, a parameter set Θ
- An unknown parameter θ ∈ Θ
- At time step n, select arm x, observe feedback Y(n,x) ~ ν(θ(x)) and receive reward μ(x,θ)
- Observations (Y(n,x))_n are i.i.d. ∀x.
- Performance metric: regret
  R^π(T,θ) = T max_{x∈X} μ(x,θ) − ∑_{t=1}^T E( μ(x(t),θ) ).
27 / 39

Instances of our model
Classical bandit (Lai, 1985):
- Set of arms X = {1, ..., |X|}
- Parameter set Θ = [0,1]^{|X|}
- Reward function: μ(x,θ) = θ(x)
- Observations: Y(n,x) ~ Ber(θ(x))
- Model for sequential treatment allocation.
28 / 39

Instances of our model
Linear bandit (Dani, 2008):
- Set of arms X ⊂ R^d finite
- Parameter set: θ ∈ Θ iff θ(x) = ⟨φ, x⟩ ∀x, for some φ ∈ R^d
- Reward function: μ(x,θ) = g(θ(x)), g a link function
- Observations: Y(n,x) ~ N(θ(x), 1)
- Stochastic version of linear / combinatorial optimization.
- Applications: routing, channel assignment, recommender systems, etc.
29 / 39

Instances of our model
Dueling bandits (Komiyama, 2015):
- Set of arms X = {(i,j) ∈ {1, ..., d}²}
- Parameter set Θ ⊂ [0,1]^{d×d}: preference matrices
- i preferred to j w.p. θ(i,j) > 1/2, i* Condorcet winner.
- Reward function: μ((i,j),θ) = (1/2)( θ(i*,i) + θ(i*,j) − 1 )
- Observations: Y(n,x) ~ Ber(θ(x))
- Model for ranking using pairwise comparisons
- Applications: tournaments, learning to rank
Example preference matrix:
  ( 0.5 0.7 0.9 0.8 )
  ( 0.3 0.5 0.3 0.1 )
  ( 0.1 0.7 0.5 0.9 )
  ( 0.2 0.9 0.1 0.5 )
30 / 39
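For the example preference matrix above, the Condorcet winner can be found mechanically: it is the arm that beats every other arm with probability above 1/2. A small sketch (indices are 0-based, so "arm 1" of the slide is index 0):

```python
# Condorcet winner = arm i with theta[i][j] > 1/2 for every j != i
theta = [
    [0.5, 0.7, 0.9, 0.8],
    [0.3, 0.5, 0.3, 0.1],
    [0.1, 0.7, 0.5, 0.9],
    [0.2, 0.9, 0.1, 0.5],
]
winners = [i for i in range(4)
           if all(theta[i][j] > 0.5 for j in range(4) if j != i)]
assert winners == [0]   # arm 1 beats every other arm w.p. > 1/2
```

Note that a Condorcet winner need not exist for an arbitrary preference matrix (cyclic preferences), which is why the model assumes one.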

Instances of our model
Combinatorial semi-bandits (Cesa-Bianchi, 2012):
- Set of arms X ⊂ {0,1}^d
- Parameter set: θ ∈ Θ iff ν(x) = (θ(1)x(1), ..., θ(d)x(d)), ∀x
- Reward function: μ(x,θ) = ∑_{i=1}^d θ(i)x(i)
- Observations: Y(n,x) ∈ {0,1}^d with independent components and mean ν(x)
- Combinatorial optimization with detailed feedback.
- Applications: routing with link feedback, channel assignment, etc.
[Figure: a source-destination graph; each link (i,j) carries a parameter θ_{ij}]
31 / 39

Regret lower bound
Theorem
Consider a uniformly good algorithm. For any θ ∈ Θ, we have:
  lim inf_{T→∞} R^π(T,θ) / ln T ≥ C(θ),
where C(θ) is the value of the optimization problem:
  minimize_{η(x)≥0, x∈X}  ∑_{x∈X} η(x)( μ*(θ) − μ(x,θ) )   (1)
  subject to  ∑_{x∈X} η(x) D(θ,λ,x) ≥ 1, ∀λ ∈ Λ(θ),   (2)
where
  Λ(θ) = { λ ∈ Θ : D(θ,λ,x*(θ)) = 0, x*(λ) ≠ x*(θ) }.
32 / 39

The OSSB algorithm
s(0) ← 0; N(x,1) ← 0, m(x,1) ← 0, ∀x ∈ X   {Initialization}
for t = 1, ..., T do
  Compute the solution (c(x, m(t)))_{x∈X} of optimization problem (1)-(2) at m(t) = (m(x,t))_{x∈X}
  if N(x,t) ≥ c(x, m(t))(1+γ) ln t, ∀x then
    s(t) ← s(t−1)
    x(t) ← x*(m(t))   {Exploitation}
  else
    s(t) ← s(t−1) + 1
    X̄(t) ← arg min_{x∈X} N(x,t) / c(x, m(t))
    X̲(t) ← arg min_{x∈X} N(x,t)
    if N(X̲(t),t) ≤ ε s(t) then
      x(t) ← X̲(t)   {Estimation}
    else
      x(t) ← X̄(t)   {Exploration}
    end if
  end if
  {Update statistics}
  Select arm x(t) and observe Y(x(t),t)
  m(x,t+1) ← m(x,t), ∀x ≠ x(t)
  N(x,t+1) ← N(x,t), ∀x ≠ x(t)
  m(x(t),t+1) ← ( Y(x(t),t) + m(x(t),t) N(x(t),t) ) / ( N(x(t),t) + 1 )
  N(x(t),t+1) ← N(x(t),t) + 1
end for
33 / 39
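Generic OSSB needs the solution of optimization problem (1)-(2) at every step. For the classical Bernoulli bandit that solution is explicit, c(x,θ) = 1/KL(θ_x, θ*) for suboptimal x (matching Lai-Robbins), which allows a compact sketch of the three phases. Everything below (function names, ε = 0.1, clipping constants, skipping the exploitation test for the empirically best arm) is an illustrative adaptation, not the authors' implementation:

```python
import math
import random

def kl(p, q):
    """D(Ber(p) || Ber(q)) with clipping away from {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def ossb_bernoulli(thetas, T, eps=0.1, gamma=0.0, seed=0):
    """Sketch of OSSB for the classical Bernoulli bandit, where the
    lower-bound optimization has the closed form c(x) = 1/KL(theta_x, theta*)."""
    rng = random.Random(seed)
    K = len(thetas)
    N = [1] * K                                        # pull each arm once to initialize
    m = [float(rng.random() < t) for t in thetas]      # empirical means
    s = 0                                              # exploration/estimation counter
    for t in range(K + 1, T + 1):
        best = max(range(K), key=lambda x: m[x])
        c = [1.0 / max(kl(m[x], m[best]), 1e-9) for x in range(K)]
        if all(N[x] >= c[x] * (1 + gamma) * math.log(t)
               for x in range(K) if x != best):
            x = best                                   # exploitation
        else:
            s += 1
            under = min((x for x in range(K) if x != best),
                        key=lambda x: N[x] / c[x])     # most under-explored arm
            least = min(range(K), key=lambda x: N[x])  # least sampled arm
            x = least if N[least] <= eps * s else under  # estimation vs exploration
        r = float(rng.random() < thetas[x])            # pull arm x, observe reward
        m[x] = (m[x] * N[x] + r) / (N[x] + 1)
        N[x] += 1
    return N, m

N, m = ossb_bernoulli([0.3, 0.5, 0.8], T=5000)
assert sum(N) == 5000
# in typical runs the best arm receives the vast majority of the pulls,
# while each suboptimal arm x gets on the order of log(T)/KL(theta_x, theta*)
```

The estimation phase (forcing the least-sampled arm up to an ε fraction of rounds) is what guarantees that m(t), and hence the plug-in allocation c(x, m(t)), eventually becomes accurate.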

OSSB is asymptotically optimal
We use the following natural assumptions.
A1
A2 ∀x, (θ,λ) ↦ D(θ,λ,x) is continuous at all points where it is not infinite
A3 ∀x, the mapping θ ↦ μ(x,θ) is continuous
A4
Theorem
Under A1-A4, the regret of π = OSSB(ε, γ) with ε < 1/|X| verifies:
  lim sup_{T→∞} R^π(T) / ln T ≤ C(θ) F(ε, γ, θ),
with F(ε,γ,θ) → 1 as ε → 0 and γ → 0, for all θ.
34 / 39

OSSB: elements of analysis
Element 1: show that in the exploitation phase, the optimal arm is selected with high probability.
Lemma
Under A1, there exists a function G such that:
  ∑_{t≥1} P( ∑_{x∈X} N(x,t) D(m(t),θ,x) ≥ (1+γ) ln t ) ≤ G(γ, |X|).
Proof: Chernoff bound + Doob's maximal inequality + multi-dimensional peeling.
35 / 39

OSSB: elements of analysis
Element 2: show that, when θ is well estimated, so is c(θ).
Lemma
Under A1-A4, the optimal value θ ↦ C(θ) and the solution θ ↦ c(θ) = (c(x,θ))_{x∈X} are continuous at θ.
Proof: similar to Berge's theorem, with an additional difficulty as the feasible set is not compact.
36 / 39

OSSB: elements of analysis
Element 3: show that the number of exploration / estimation rounds where θ is not well estimated is finite in expectation. Idea: after s such rounds, N(x,t) ≥ εs by construction.
Lemma
Let x ∈ X and δ > 0. Define F_t the σ-algebra generated by (Y(x(s),s))_{1≤s≤t}. Let S ⊂ N be a (random) set of rounds. Assume that there exists a sequence of (random) sets (S(s))_{s≥1} such that (i) S ⊂ ∪_{s≥1} S(s), (ii) for all s ≥ 1 and all t ∈ S(s), N(x,t) ≥ s, (iii) |S(s)| ≤ 1, and (iv) the event t ∈ S(s) is F_t-measurable. Then for all δ > 0:
  ∑_{t≥1} P( t ∈ S, |m(x,t) − θ(x)| > δ ) ≤ 1/δ².
Proof: Chernoff bound + Doob's optional stopping theorem.
37 / 39

Numerical example: linear bandits
[Figure: average regret vs time (up to 10^5 steps, regret up to 5000) for Thompson Sampling (Agrawal et al.), GLM-UCB (Filippi et al.), OSSB, and the algorithm of Lattimore et al.]
38 / 39

Thank you for your attention!
39 / 39