Multi-Armed Bandits: A Novel Generic Optimal Algorithm and Applications to Networking

Richard Combes
Centrale-Supélec / L2S, France

SDN Days, 2017
Outline
- Bandits: A primer
- Applications
- Some basic tools
- Classical Bandits
- Generic Bandits
A first example: sequential treatment allocation
- There are $T$ patients with the same symptoms awaiting treatment
- Two treatments exist, one is better than the other
- Based on past successes and failures, which treatment should you use?
The model
- At time $n$, choose action $x_n \in \mathcal{X}$, observe feedback $y_n(x_n) \in \mathcal{Y}$, and obtain reward $r_n(x_n) \in \mathbb{R}_+$.
- Bandit feedback: rewards and feedback depend on actions (often $y_n = r_n$)
- Admissible algorithm: $x_{n+1} = f_{n+1}(x_0, r_0(x_0), y_0(x_0), \ldots, x_n, r_n(x_n), y_n(x_n))$
- Performance metric: regret
$$R(T) = \underbrace{\max_{x \in \mathcal{X}} \mathbb{E}\Big[\sum_{n=1}^{T} r_n(x)\Big]}_{\text{oracle}} - \underbrace{\mathbb{E}\Big[\sum_{n=1}^{T} r_n(x_n)\Big]}_{\text{your algorithm}}.$$
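The regret definition can be checked on a toy instance. A minimal sketch (Python; the two-arm Bernoulli means and the uniform-random policy are hypothetical choices, not from the talk) computing the empirical pseudo-regret of a policy against the oracle:

```python
import random

def pseudo_regret(means, policy, T, seed=0):
    """Oracle value minus the value of the arms actually played."""
    rng = random.Random(seed)
    played = [policy(n, rng) for n in range(T)]
    best = max(means)
    return sum(best - means[x] for x in played)

# A policy that ignores the feedback plays suboptimal arms a constant
# fraction of the time, so its regret grows linearly in T.
means = [0.5, 0.9]
uniform = lambda n, rng: rng.randrange(len(means))
regret_1k = pseudo_regret(means, uniform, 1000)
regret_2k = pseudo_regret(means, uniform, 2000)
```

A good admissible algorithm instead uses the observed rewards to make the regret grow only logarithmically in $T$, as the UCB results later in the deck show.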
Bandit taxonomy: adversarial vs stochastic
Stochastic Bandit:
- Game against a stochastic environment
- Unknown parameter $\theta \in \Theta$
- $(r_n(x))_n$ is i.i.d. with expectation $\theta_x$
Adversarial Bandit:
- Game against a non-adaptive adversary
- For all $x$, $(r_n(x))_n$ is an arbitrary sequence
- At time 0, the adversary writes down $(r_n(x))_{n,x}$ in an envelope
Engineering problems are mainly stochastic
Independent vs correlated arms
- Independent arms: $\Theta = [0,1]^K$
- Correlated arms: $\Theta \neq [0,1]^K$: choosing arm 1 gives information on arms 1 and 2
Correlation enables (sometimes much) faster learning.
Bandit taxonomy: cardinality of the set of arms
Discrete Bandits:
- $\mathcal{X} = \{1, \ldots, K\}$
- All arms can be sampled infinitely many times
- Regret $O(\log T)$ (stochastic), $O(\sqrt{T})$ (adversarial)
Infinite Bandits:
- $\mathcal{X} = \mathbb{N}$, Bayesian setting (otherwise trivial)
- Explore $o(T)$ arms until a good one is found
- Regret: $O(\sqrt{T})$.
Continuous Bandits:
- $\mathcal{X} \subset \mathbb{R}^d$ convex, $x \mapsto \theta(x)$ has a structure
- Structures: convex, Lipschitz, linear, unimodal (quasi-convex), etc.
- Similar to derivative-free stochastic optimization
- Regret: $O(\mathrm{poly}(d)\sqrt{T})$.
Bandit taxonomy: regret minimization vs best arm identification
Sample arms and output the best arm with a given probability, similar to PAC learning
Fixed budget setting:
- $T$ fixed, sample arms $x_1, \ldots, x_T$, and output $\hat{x}_T$
- Easier problem: estimation + budget allocation
- Goal: minimize $\mathbb{P}[\hat{x}_T \neq x^\star]$
Example 1: Rate adaptation in wireless networks
- Adapting the modulation/coding scheme to the radio environment [1]
- Rates: $r_1, r_2, \ldots, r_K$
- Success probabilities: $\mu_1, \mu_2, \ldots, \mu_K$
- Throughputs: $\theta_1, \theta_2, \ldots, \theta_K$
Structure: unimodality + $\mu_1 > \mu_2 > \cdots > \mu_K$.

[1] R. Combes, A. Proutiere, D. Yun, J. Ok, and Y. Yi. Optimal rate sampling in 802.11 systems, IEEE INFOCOM 2014
Example 2: Shortest path routing
- Choose a path minimizing expected delay [2]
- Stochastic delays: $X_i(n) \sim \mathrm{Geometric}(\theta_i)$
- Path $x \in \{0,1\}^d$, expected delay $\sum_{i=1}^{d} x_i/\theta_i$.
- Hop-by-hop feedback: $X_i(n)$, for $\{i : x_i(n) = 1\}$

[Figure: a 16-node routing graph; the highlighted path $1 \to 6 \to 11 \to 15 \to 16$ has link parameters $\theta_{1,6}, \theta_{6,11}, \theta_{11,15}, \theta_{15,16}$]

[2] S. Talebi, Z. Zou, R. Combes, A. Proutiere, M. Johansson, Stochastic Online Shortest Path Routing: The Value of Feedback, IEEE Trans. Automatic Control, 2017
Example 3: Learning to Rank (search engines)
- Given a query, $N$ relevant items, $L$ display slots [3]
- A user is shown $L$ items, scrolls down and selects the first relevant item
- One must show the most relevant items in the first slots.
- $\theta_n$: probability of clicking on item $n$ (independence between items is assumed)
- Reward $r(\ell)$ if the user clicks on the $\ell$-th item, and 0 if the user does not click

[3] R. Combes, S. Magureanu, A. Proutiere and C. Laroche, Learning to Rank: Regret Lower Bound and Efficient Algorithms, SIGMETRICS 2015
Example 4: Ad-display optimization
- Users are shown ads relevant to their queries [4]
- Announcers $x \in \{1, \ldots, K\}$, with click-through-rate $\theta_x$ and budget per unit of time $c_x$
- Bandit with budgets: each arm has a budget of plays
- Displayed announcer is charged per impression/click

[4] R. Combes, C. Jiang and R. Srikant, Bandits with Budgets: Regret Lower Bounds and Optimal Algorithms, SIGMETRICS 2015
Optimism in the face of uncertainty
- Replace arm values by upper confidence bounds
- Index $b_x(n)$ such that $b_x(n) \geq \theta_x$ with high probability
- Select the arm with highest index: $x_n \in \arg\max_{x \in \mathcal{X}} b_x(n)$
- Analysis idea:
$$\mathbb{E}[t_x(T)] \leq \underbrace{\sum_{n=1}^{T} \mathbb{P}[b_{x^\star}(n) \leq \theta^\star]}_{o(\log T)} + \underbrace{\sum_{n=1}^{T} \mathbb{P}[x_n = x,\ b_x(n) \geq \theta^\star]}_{\text{dominant term}}.$$
Almost all algorithms in the literature are optimistic (sic!)
Information theory and statistics
- Distributions $P, Q$ with densities $p$ and $q$ w.r.t. a measure $m$
- Kullback-Leibler divergence:
$$D(P\|Q) = \int_x p(x)\log\frac{p(x)}{q(x)}\, m(dx)$$
- Pinsker's inequality:
$$\sqrt{\frac{D(P\|Q)}{2}} \geq \mathrm{TV}(P,Q) = \frac{1}{2}\int_x |p(x) - q(x)|\, m(dx)$$
- If $P, Q \sim \mathrm{Ber}(p), \mathrm{Ber}(q)$:
$$D(P\|Q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$
- Also (Pinsker + inequality $\log(x) \leq x-1$):
$$2(p-q)^2 \leq D(P\|Q) \leq \frac{(p-q)^2}{q(1-q)}$$
The KL-divergence is ubiquitous in bandit problems
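The Bernoulli formulas above are easy to check numerically. A minimal sketch (Python; the sample values $p = 0.7$, $q = 0.5$ are arbitrary) of the Bernoulli KL divergence and the two-sided bound:

```python
from math import log

def bern_kl(p, q):
    """KL divergence D(Ber(p) || Ber(q)) for p, q in (0, 1)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

p, q = 0.7, 0.5
d = bern_kl(p, q)

# Two-sided bound from the slide: 2(p-q)^2 <= D(P||Q) <= (p-q)^2 / (q(1-q))
lower = 2 * (p - q) ** 2
upper = (p - q) ** 2 / (q * (1 - q))
```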
Regret Lower Bounds: general technique
- Decision $x$, two parameters $\theta, \lambda$, with $x^\star(\lambda) = x \neq x^\star(\theta)$.
- Consider an algorithm with $R(T) = O(\log T)$ for all parameters (uniformly good):
$$\mathbb{E}_\theta[t_x(T)] = O(\log T), \quad \mathbb{E}_\lambda[t_x(T)] = T - O(\log T).$$
- Markov inequality:
$$\mathbb{P}_\theta[t_x(T) \geq T/2] + \mathbb{P}_\lambda[t_x(T) < T/2] \leq O(T^{-1}\log T).$$
- $\mathbf{1}\{t_x(T) \geq T/2\}$ is a hypothesis test, with risk $O(T^{-1}\log T)$
- Hence (Neyman-Pearson / Tsybakov):
$$\underbrace{\sum_x \mathbb{E}_\theta[t_x(T)]\,\mathrm{KL}(\theta_x, \lambda_x)}_{\text{KL divergence of the observations}} \geq \log(T) - O(\log\log(T)).$$
Concentration inequalities: Chernoff bounds
- Building indexes requires tight concentration inequalities
- Chernoff bounds: upper bound the MGF
- $X = (X_1, \ldots, X_n)$ independent, with mean $\mu$, $S_n = \sum_{n'=1}^{n} X_{n'}$
- $G$ such that $\log(\mathbb{E}[e^{\lambda(X_{n'} - \mu)}]) \leq G(\lambda)$, $\lambda \geq 0$
- Generic technique:
$$\begin{aligned}
\mathbb{P}[S_n - n\mu \geq \delta] &= \mathbb{P}[e^{\lambda(S_n - n\mu)} \geq e^{\lambda\delta}] \\
&\leq e^{-\lambda\delta}\,\mathbb{E}[e^{\lambda(S_n - n\mu)}] \quad \text{(Markov)} \\
&= \exp(-\lambda\delta + nG(\lambda)) \quad \text{(independence)} \\
&\leq \exp\Big(-n \max_{\lambda \geq 0}\{\lambda\delta n^{-1} - G(\lambda)\}\Big).
\end{aligned}$$
Concentration inequalities: Chernoff and Hoeffding's inequality
- Bounded variables: if $X_{n'} \in [a,b]$ a.s. then $\mathbb{E}[e^{\lambda(X_{n'} - \mu)}] \leq e^{\lambda^2(b-a)^2/8}$
- Chernoff's inequality:
$$\mathbb{P}[S_n - n\mu \geq \delta] \leq \exp(-n\,\mathrm{KL}(\mu + \delta/n, \mu))$$
- Pinsker's inequality: Chernoff is stronger than Hoeffding.
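The last point can be checked numerically in the Bernoulli case, where both bounds are explicit (a sketch; the values $\mu = 0.5$, $\varepsilon = 0.2$, $n = 100$ are arbitrary):

```python
from math import exp, log

def bern_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def chernoff_tail(mu, eps, n):
    """Chernoff bound on P[S_n/n >= mu + eps] for i.i.d. Bernoulli(mu)."""
    return exp(-n * bern_kl(mu + eps, mu))

def hoeffding_tail(eps, n):
    """Hoeffding bound on the same event (variables in [0, 1])."""
    return exp(-2 * n * eps ** 2)

c = chernoff_tail(0.5, 0.2, 100)
h = hoeffding_tail(0.2, 100)
```

By Pinsker, $\mathrm{KL}(\mu+\varepsilon, \mu) \geq 2\varepsilon^2$, so the Chernoff bound is never larger than the Hoeffding bound.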
Concentration inequalities: variable sample size and peeling
- In bandit problems, the sample size is random and depends on the samples themselves
- Intervals $\mathcal{N}_k = \{n_k, \ldots, n_{k+1}\}$, $\mathcal{N} = \cup_{k=1}^{K} \mathcal{N}_k$
- Idea: $Z_n = e^{\lambda(S_n - n\mu)}$ is a positive sub-martingale:
$$\begin{aligned}
\mathbb{P}[\max_{n \in \mathcal{N}_k}(S_n - n\mu) \geq \delta] &= \mathbb{P}[\max_{n \in \mathcal{N}_k} Z_n \geq e^{\lambda\delta}] \\
&\leq e^{-\lambda\delta}\,\mathbb{E}[Z_{n_{k+1}}] \quad \text{(Doob's inequality)} \\
&= \exp(-\lambda\delta + n_{k+1}G(\lambda)) \\
&\leq \exp\Big(-n_{k+1}\max_{\lambda \geq 0}\{\lambda\delta n_{k+1}^{-1} - G(\lambda)\}\Big).
\end{aligned}$$
- Peeling trick (Neveu): union bound over $k$, $n_k = (1+\eta)^k$.
The Lai-Robbins bound
- Actions $\mathcal{X} = \{1, \ldots, K\}$
- Rewards $\theta = (\theta_1, \ldots, \theta_K) \in [0,1]^K$
- Uniformly good algorithm: $R(T) = O(\log T)$, $\forall\theta$

Theorem (Lai '85)
For any uniformly good algorithm, and $x$ s.t. $\theta_x < \theta^\star$ we have:
$$\liminf_{T \to \infty} \frac{\mathbb{E}[t_x(T)]}{\log(T)} \geq \frac{1}{\mathrm{KL}(\theta_x, \theta^\star)}.$$
The Lai-Robbins bound
[Figure: illustration of the most confusing parameter for arm $x$]
Classical bandits: algorithms
- Select the arm with highest index: $x_n \in \arg\max_{x \in \mathcal{X}} b_x(n)$
- UCB algorithm (Hoeffding's inequality):
$$b_x(n) = \underbrace{\hat{\theta}_x(n)}_{\text{empirical mean}} + \underbrace{\sqrt{\frac{2\log(n)}{t_x(n)}}}_{\text{exploration bonus}}.$$
- KL-UCB algorithm (using Garivier's inequality):
$$b_x(n) = \max\Big\{q \leq 1 : \underbrace{t_x(n)\,\mathrm{KL}(\hat{\theta}_x(n), q)}_{\text{likelihood ratio}} \leq \underbrace{f(n)}_{\log(\text{confidence level}^{-1})}\Big\}$$
with $f(n) = \log(n) + 3\log(\log(n))$.
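Both indexes can be sketched in a few lines (Python; Bernoulli rewards in $(0,1)$ assumed, and the bisection tolerance is an arbitrary choice). The KL-UCB index is found by bisection, since $q \mapsto \mathrm{KL}(\hat{\theta}, q)$ is increasing for $q \geq \hat{\theta}$:

```python
from math import log, sqrt

def bern_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), p, q in (0, 1)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def ucb_index(mean, t_x, n):
    """UCB index: empirical mean plus Hoeffding exploration bonus."""
    return mean + sqrt(2 * log(n) / t_x)

def kl_ucb_index(mean, t_x, n, tol=1e-9):
    """KL-UCB index: largest q <= 1 with t_x * KL(mean, q) <= f(n), by bisection."""
    f_n = log(n) + 3 * log(log(n))
    lo, hi = mean, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if t_x * bern_kl(mean, mid) <= f_n:
            lo = mid
        else:
            hi = mid
    return lo

b_ucb = ucb_index(0.5, 10, 100)
b_kl = kl_ucb_index(0.5, 10, 100)
```

Note that the KL-UCB index stays in $[0,1]$ while the UCB index can exceed 1 on Bernoulli rewards; this is one way to see that the KL-based confidence bound is tighter.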
Classical bandits: regret analysis

Theorem (Auer '02)
Under algorithm UCB, for all $x$ s.t. $\theta_x < \theta^\star$:
$$\mathbb{E}[t_x(T)] \leq \frac{8\log(T)}{(\theta_x - \theta^\star)^2} + 1 + \frac{\pi^2}{3}.$$

Theorem (Garivier '11)
Under algorithm KL-UCB, for all $x$ s.t. $\theta_x < \theta^\star$ and for all $\delta < \theta^\star - \theta_x$:
$$\mathbb{E}[t_x(T)] \leq \frac{\log(T)}{\mathrm{KL}(\theta_x + \delta, \theta^\star)} + C\log(\log(T)) + \delta^{-2}.$$
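To see the logarithmic bound in action, a small simulation (Python; the two Bernoulli arms and the horizon are hypothetical choices) running UCB and counting pulls of the suboptimal arm:

```python
import math
import random

def run_ucb(means, T, seed=0):
    """Run UCB on Bernoulli arms; return the pull count of each arm."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    for n in range(1, T + 1):
        if n <= K:
            x = n - 1  # play each arm once to initialize
        else:
            x = max(range(K),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(n) / counts[i]))
        reward = 1.0 if rng.random() < means[x] else 0.0
        counts[x] += 1
        sums[x] += reward
    return counts

counts = run_ucb([0.3, 0.7], 5000)
```

Per Auer's bound, pulls of the suboptimal arm should be on the order of $8\log(T)/\Delta^2 \approx 425$ here, far below the $T/2$ a uniform policy would give.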
The model
- A finite set of arms $\mathcal{X}$, a parameter set $\Theta$
- An unknown parameter $\theta \in \Theta$
- At time step $n$, select arm $x$, observe feedback $Y(n,x) \sim \nu(\theta(x))$ and receive reward $\mu(x,\theta)$
- Observations $(Y(n,x))_n$ are i.i.d. $\forall x$.
- Performance metric: regret
$$R^\pi(T,\theta) = T\max_{x \in \mathcal{X}} \mu(x,\theta) - \sum_{t=1}^{T} \mathbb{E}(\mu(x(t),\theta)).$$
Instances of our model
Classical bandit (Lai, 1985):
- Set of arms $\mathcal{X} = \{1, \ldots, |\mathcal{X}|\}$
- Parameter set $\Theta = [0,1]^{|\mathcal{X}|}$
- Reward function: $\mu(x,\theta) = \theta(x)$
- Observations: $Y(n,x) \sim \mathrm{Ber}(\theta(x))$
- Model for sequential treatment allocation.
Instances of our model
Linear bandit (Dani, 2008):
- Set of arms $\mathcal{X} \subset \mathbb{R}^d$ finite
- Parameter set: $\theta \in \Theta$ iff $\theta(x) = \langle \theta, x \rangle$, $\forall x$
- Reward function: $\mu(x,\theta) = g(\theta(x))$, $g$ a link function
- Observations: $Y(n,x) \sim \mathcal{N}(\theta(x), 1)$
- Stochastic version of linear / combinatorial optimization.
- Applications: routing, channel assignment, recommender systems, etc.
Instances of our model
Dueling bandits (Komiyama, 2015):
- Set of arms $\mathcal{X} = \{(i,j) \in \{1,\ldots,d\}^2\}$
- Parameter set $\Theta \subset [0,1]^{d \times d}$, preference matrices
- $i$ preferred to $j$ w.p. $\theta(i,j) > \frac{1}{2}$, $i^\star$ Condorcet winner.
- Reward function: $\mu((i,j),\theta) = \frac{1}{2}(\theta(i^\star,i) + \theta(i^\star,j) - 1)$
- Observations: $Y(n,x) \sim \mathrm{Ber}(\theta(x))$
- Model for ranking using pairwise comparisons
- Applications: tournaments, learning to rank
$$\begin{pmatrix} 0.5 & 0.7 & 0.9 & 0.8 \\ 0.3 & 0.5 & 0.3 & 0.1 \\ 0.1 & 0.7 & 0.5 & 0.9 \\ 0.2 & 0.9 & 0.1 & 0.5 \end{pmatrix}$$
Instances of our model
Combinatorial semi-bandits (Cesa-Bianchi, 2012):
- Set of arms $\mathcal{X} \subset \{0,1\}^d$
- Parameter set: $\theta \in \Theta$ iff $\theta(x) = (\theta(1)x(1), \ldots, \theta(d)x(d))$, $\forall x$
- Reward function: $\mu(x,\theta) = \sum_{i=1}^{d} \theta(i)x(i)$
- Observations: $Y(n,x) \in \{0,1\}^d$ with independent components and mean $\theta(x)$
- Combinatorial optimization with detailed feedback.
- Applications: routing w. link feedback, channel assignment, etc.

[Figure: routing from Source to Destination; each link $(i,j)$ carries a parameter $\theta_{ij}$]
Regret lower bound

Theorem
Consider a uniformly good algorithm. For any $\theta \in \Theta$, we have:
$$\liminf_{T \to \infty} \frac{R^\pi(T,\theta)}{\ln T} \geq C(\theta),$$
where $C(\theta)$ is the value of the optimization problem:
$$\underset{\alpha(x) \geq 0,\, x \in \mathcal{X}}{\text{minimize}} \ \sum_{x \in \mathcal{X}} \alpha(x)(\mu^\star(\theta) - \mu(x,\theta)) \quad (1)$$
$$\text{subject to} \ \sum_{x \in \mathcal{X}} \alpha(x) D(\theta,\lambda,x) \geq 1, \ \forall\lambda \in \Lambda(\theta), \quad (2)$$
where
$$\Lambda(\theta) = \{\lambda \in \Theta : D(\theta,\lambda,x^\star(\theta)) = 0,\ x^\star(\lambda) \neq x^\star(\theta)\}.$$
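For the classical Bernoulli bandit, the program (1)-(2) decouples across suboptimal arms: the binding constraint for each arm $x$ gives $\alpha(x) = 1/\mathrm{KL}(\theta_x, \theta^\star)$, recovering the Lai-Robbins constant. A quick check (Python; the instance $\theta = (0.3, 0.7)$ is an arbitrary choice):

```python
from math import log

def bern_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def c_theta_classical(means):
    """Lower-bound constant C(theta) for the classical Bernoulli bandit,
    using alpha(x) = 1 / KL(theta_x, theta*) for each suboptimal arm."""
    best = max(means)
    return sum((best - m) / bern_kl(m, best) for m in means if m < best)

c = c_theta_classical([0.3, 0.7])
```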
The OSSB algorithm

s(0) ← 0; N(x,1) ← 0, m(x,1) ← 0, ∀x ∈ X   {Initialization}
for t = 1, ..., T do
  Compute the solution (c(x, m(t)))_{x∈X} of the optimization problem (1)-(2),
  where m(t) = (m(x,t))_{x∈X}
  if N(x,t) ≥ c(x, m(t))(1+γ) ln t, ∀x then
    s(t) ← s(t-1)
    x(t) ← x*(m(t))   {Exploitation}
  else
    s(t) ← s(t-1) + 1
    X(t) ← arg min_{x∈X} N(x,t) / c(x, m(t))
    X̄(t) ← arg min_{x∈X} N(x,t)
    if N(X̄(t), t) ≤ ε s(t) then
      x(t) ← X̄(t)   {Estimation}
    else
      x(t) ← X(t)   {Exploration}
    end if
  end if
  {Update statistics}
  Select arm x(t) and observe Y(x(t), t)
  m(x,t+1) ← m(x,t), N(x,t+1) ← N(x,t), ∀x ≠ x(t)
  m(x(t), t+1) ← (Y(x(t), t) + m(x(t), t) N(x(t), t)) / (N(x(t), t) + 1)
  N(x(t), t+1) ← N(x(t), t) + 1
end for
OSSB is asymptotically optimal
We use the following natural assumptions.
A1
A2 ∀x, $(\theta,\lambda) \mapsto D(\theta,\lambda,x)$ is continuous at all points where it is not infinite
A3 ∀x, the mapping $\theta \mapsto \mu(x,\theta)$ is continuous
A4

Theorem
Under A1-A4, the regret of $\pi = \mathrm{OSSB}(\varepsilon, \gamma)$ with $\varepsilon < \frac{1}{|\mathcal{X}|}$ verifies:
$$\limsup_{T \to \infty} \frac{R^\pi(T)}{\ln T} \leq C(\theta)F(\varepsilon,\gamma,\theta),$$
with $F(\varepsilon,\gamma,\theta) \to 1$ as $\varepsilon \to 0$ and $\gamma \to 0$, for all $\theta$.
OSSB: elements of analysis
Element 1: show that in the exploitation phase, the optimal arm is selected with high probability.

Lemma
Under A1, there exists a function G such that for all $t \geq 1$:
$$\sum_{t \geq 1} \mathbb{P}[\cdots]$$
OSSB: elements of analysis
Element 2: show that, when $\theta$ is well estimated, so is $c(\theta)$.

Lemma
Under A1-A4, the optimal value $\theta \mapsto C(\theta)$ and the solution $\theta \mapsto c(\theta) = (c(x,\theta))_{x \in \mathcal{X}}$ are continuous at $\theta$.

Proof: similar to Berge's theorem, with an additional difficulty as the feasible set is not compact.
OSSB: elements of analysis
Element 3: show that the number of exploration / estimation rounds where $\theta$ is not well estimated is finite in expectation. Idea: after $s$ such rounds, $N(x,t) \geq \varepsilon s$ by construction.

Lemma
Let $x \in \mathcal{X}$ and $\delta > 0$. Define $\mathcal{F}_t$ the $\sigma$-algebra generated by $(Y(x(s),s))_{1 \leq s \leq t}$. Let $\mathcal{S} \subset \mathbb{N}$ be a (random) set of rounds. Assume that there exists a sequence of (random) sets $(\mathcal{S}(s))_{s \geq 1}$ such that (i) $\mathcal{S} \subset \cup_{s \geq 1}\mathcal{S}(s)$, (ii) for all $s \geq 1$ and all $t \in \mathcal{S}(s)$, $N(x,t) \geq \varepsilon s$, (iii) $|\mathcal{S}(s)| \leq 1$, and (iv) the event $t \in \mathcal{S}(s)$ is $\mathcal{F}_t$-measurable. Then for all $\delta > 0$:
$$\sum_{t \geq 1} \mathbb{P}(t \in \mathcal{S},\ |m(x,t) - \theta(x)| > \delta) < \infty.$$
Numerical example: linear bandits

[Figure: average regret vs time over $10^5$ rounds (regret axis up to 5000) for Thompson Sampling (Agrawal et al.), GLM-UCB (Filippi et al.), OSSB, and Lattimore et al.]
Thank you for your attention!