Sequential and reinforcement learning for demand side management by Margaux Brégère

WiMLDS_Paris, May 03, 2024

About This Presentation

As electricity is difficult to store, it is crucial to strictly maintain the balance between production and consumption. The integration of intermittent renewable energies into the production mix has made the management of the balance more complex. However, access to near real-time data and communic...


Slide Content

Margaux Brégère - April 24th, 2024
Sequential and reinforcement learning for demand side management
Paris Women in Machine Learning & Data Science @Criteo

Introduction

Demand side management
Electricity is hard to store → the production-demand balance must be strictly maintained
Current solution: forecast demand and adapt production accordingly
• Renewable energies development → production harder to adjust
• New (smart) meters → access to data and instantaneous communication
Prospective solutions: manage demand
• Send incentive signals (prices)
• Control flexible devices

Demand side management with incentive signals
The environment (consumer behavior) is discovered through interactions (incentive signal choices) → Reinforcement learning
How to develop automatic solutions to choose incentive signals dynamically?
Exploration: learn consumer behavior
Exploitation: optimize signal sending
Data set: « Smart Meter Energy Consumption Data in London Households »

Stochastic multi-armed bandits

Stochastic multi-armed bandits
In a multi-armed bandit problem, a gambler facing a row of K slot machines (also called one-armed bandits) has to decide which machines to play to maximize her reward
Exploration: test many arms
Exploitation: play the « best » arm
Exploration - Exploitation trade-off

Stochastic multi-armed bandit
Each arm $k$ is defined by an unknown probability distribution $\nu_k$
For $t = 1, \dots, T$:
• Pick an arm $I_t \in \{1, \dots, K\}$
• Receive a random reward $Y_t$ with $Y_t \,|\, I_t = k \sim \nu_k$
Maximize the cumulative reward ⇔ Minimize the regret, i.e., the difference, in expectation, between the cumulative reward of the best strategy and that of ours:
$R_T = T \max_{k=1,\dots,K} \mu_k - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t}\Big]$, with $\mu_k = \mathbb{E}[\nu_k]$
A good bandit algorithm has a sub-linear regret: $R_T / T \to 0$

Upper Confidence Bound algorithm [1]
Initialization: pick each arm once
For $t = K+1, \dots, T$:
• Estimate the expected reward $\mu_k$ of each arm $k$ with $\hat{\mu}_{k,t-1}$ (empirical mean of past rewards)
• Build some confidence intervals around these estimations: $\mu_k \in [\hat{\mu}_{k,t-1} - \alpha_{k,t},\; \hat{\mu}_{k,t-1} + \alpha_{k,t}]$ with high probability
• Be optimistic and act as if the best possible probable reward was the true reward, and choose the next arm accordingly: $I_t = \arg\max_k \{\hat{\mu}_{k,t-1} + \alpha_{k,t}\}$
[1] Finite-time analysis of the multiarmed bandit problem, Peter Auer, Nicolò Cesa-Bianchi, Paul Fischer, Machine Learning, 2002
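To make the algorithm concrete, here is a minimal UCB sketch in Python on simulated Bernoulli arms; it follows the slide's exploration bonus $\sqrt{2\log t / N_{k,t-1}}$, while the arm means, horizon and seed are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])      # unknown Bernoulli arm means (made up)
K, T = len(mu), 10_000

counts = np.zeros(K)                # N_{k,t-1}: number of pulls of each arm
sums = np.zeros(K)                  # cumulative observed reward of each arm
pulled_means = 0.0                  # sum of mu_{I_t}, for the pseudo-regret

for t in range(1, T + 1):
    if t <= K:                      # initialization: pick each arm once
        arm = t - 1
    else:
        means = sums / counts                        # empirical means
        bonus = np.sqrt(2 * np.log(t) / counts)      # alpha_{k,t}
        arm = int(np.argmax(means + bonus))          # optimism in the face of uncertainty
    reward = rng.binomial(1, mu[arm])                # Y_t ~ nu_{I_t}
    counts[arm] += 1
    sums[arm] += reward
    pulled_means += mu[arm]

regret = T * mu.max() - pulled_means                 # R_T as defined on the slide
print(f"pulls per arm: {counts}, pseudo-regret ~ {regret:.1f}")
```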

UCB regret bound
The empirical means based on past rewards are:
$\hat{\mu}_{k,t-1} = \frac{1}{N_{k,t-1}} \sum_{s=1}^{t-1} Y_s \mathbb{1}_{\{I_s = k\}}$, with $N_{k,t-1} = \sum_{s=1}^{t-1} \mathbb{1}_{\{I_s = k\}}$
With the Hoeffding-Azuma inequality, we get
$\mathbb{P}\Big(\mu_k \in \big[\hat{\mu}_{k,t-1} - \alpha_{k,t},\; \hat{\mu}_{k,t-1} + \alpha_{k,t}\big]\Big) \geq 1 - t^{-3}$, with $\alpha_{k,t} = \sqrt{\frac{2 \log t}{N_{k,t-1}}}$
And being optimistic ensures that $R_T \lesssim \sqrt{T K \log T}$
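A standard reasoning step behind this bound, not spelled out on the slide: on the high-probability event, the confidence intervals limit how often a suboptimal arm can look best.

```latex
% Gap of arm k: \Delta_k = \max_j \mu_j - \mu_k. Arm k can only be optimistic
% while its bonus exceeds \Delta_k / 2, which bounds its number of pulls:
\[
N_{k,T} \;\lesssim\; \frac{8 \log T}{\Delta_k^2}
\qquad\Longrightarrow\qquad
R_T \;=\; \sum_{k} \Delta_k \, \mathbb{E}[N_{k,T}] \;\lesssim\; \sum_{k} \frac{\log T}{\Delta_k}.
\]
% The distribution-free form follows by Cauchy-Schwarz, using \sum_k N_{k,T} = T:
\[
R_T \;\le\; \sqrt{\textstyle\sum_k \mathbb{E}[N_{k,T}]}\,
           \sqrt{\textstyle\sum_k \Delta_k^2\, \mathbb{E}[N_{k,T}]}
   \;\lesssim\; \sqrt{T K \log T}.
\]
```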

Modeling demand side management

Demand side management with incentive signals
For $t = 1, \dots, T$:
• Observe a context $x_t$ and a target $c_t$
• Choose price levels $p_t$
• Observe the resulting electricity demand $Y_t = f(x_t, p_t) + \mathrm{noise}(p_t)$ and suffer the loss $\ell(Y_t, c_t)$
Assumptions:
• Homogeneous population, $K$ tariffs, $p_t \in \Delta_K$
• $f(x_t, p_t) = \phi(x_t, p_t)^\top \theta$ with a known mapping function $\phi$ and an unknown vector $\theta$ to estimate
• $\mathrm{noise}(p_t) = p_t^\top \varepsilon_t$ with $\mathbb{E}[\varepsilon_t \varepsilon_t^\top] = \Sigma$
• $\ell(Y_t, c_t) = (Y_t - c_t)^2$
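A small simulator sketch of this model may help; only the structure $Y_t = \phi(x_t, p_t)^\top \theta + p_t^\top \varepsilon_t$ and the squared loss come from the slide, while the dimensions, feature map, $\theta$ and $\Sigma$ are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 3, 4                          # K tariffs, feature dimension (made up)
theta = rng.normal(size=d)           # unknown parameter vector (simulated)
A = rng.normal(size=(K, K))
Sigma = A @ A.T / K                  # noise covariance of eps_t (simulated)

def phi(x, p):
    """Toy feature map phi(x_t, p_t); the real mapping is problem-specific."""
    return np.concatenate(([1.0, x], p[:2]))      # d = 4 features

def observe_demand(x, p):
    """Y_t = phi(x_t, p_t)^T theta + p_t^T eps_t, with E[eps eps^T] = Sigma."""
    eps = rng.multivariate_normal(np.zeros(K), Sigma)
    return phi(x, p) @ theta + p @ eps

x_t = rng.uniform()                  # context, e.g. temperature (made up)
p_t = rng.dirichlet(np.ones(K))      # price allocation in the simplex Delta_K
c_t = 0.5                            # target consumption (made up)
Y_t = observe_demand(x_t, p_t)
loss = (Y_t - c_t) ** 2              # squared tracking loss from the slide
print(f"Y_t = {Y_t:.3f}, loss = {loss:.3f}")
```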

Bandit algorithm for target tracking
Under these assumptions:
$\mathbb{E}\big[(Y_t - c_t)^2 \,\big|\, \mathrm{past}, x_t, p_t\big] = \big(\phi(x_t, p_t)^\top \theta - c_t\big)^2 + p_t^\top \Sigma\, p_t$
Estimate the parameters $\theta$ and $\Sigma$ to estimate losses and reach a bias-variance trade-off
Optimistic algorithm:
For $t = 1, \dots, \tau$:
• Select price levels deterministically to estimate $\Sigma$ offline with $\hat{\Sigma}_\tau$
For $t = \tau+1, \dots, T$:
• Estimate $\theta$ based on past observations with $\hat{\theta}_{t-1}$ thanks to a Ridge regression
• Estimate the future expected loss for each price level $p$: $\hat{\ell}_{p,t} = \big(\phi(x_t, p)^\top \hat{\theta}_{t-1} - c_t\big)^2 + p^\top \hat{\Sigma}_\tau\, p$
• Get confidence bounds on these estimations: $|\hat{\ell}_{p,t} - \ell_p| \leq \alpha_{p,t}$
• Select price levels optimistically: $p_t \in \arg\min_p \{\hat{\ell}_{p,t} - \alpha_{p,t}\}$
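A minimal sketch of one optimistic step, assuming a finite set of candidate price allocations; the ridge estimate and the plug-in loss follow the slide, whereas the confidence bonus `alpha` is left as a simple placeholder rather than the calibrated $\alpha_{p,t}$ of the paper.

```python
import numpy as np

def ridge_theta(Phi_past, Y_past, lam=1.0):
    """Ridge estimate of theta from past features Phi (n x d) and demands Y (n,)."""
    d = Phi_past.shape[1]
    return np.linalg.solve(Phi_past.T @ Phi_past + lam * np.eye(d),
                           Phi_past.T @ Y_past)

def choose_price(x_t, c_t, candidate_prices, phi, theta_hat, Sigma_hat, alpha):
    """Pick p_t minimizing the optimistic (bonus-lowered) estimated loss."""
    best_p, best_val = None, np.inf
    for j, p in enumerate(candidate_prices):
        est_loss = (phi(x_t, p) @ theta_hat - c_t) ** 2 + p @ Sigma_hat @ p
        optimistic = est_loss - alpha[j]        # be optimistic: subtract the bonus
        if optimistic < best_val:
            best_p, best_val = p, optimistic
    return best_p
```

In a full loop, `ridge_theta` would be refit after each observed demand and `choose_price` called with the current context and target.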

Regret bound [2]
$R_T = \mathbb{E}\Big[\sum_{t=1}^{T} (Y_t - c_t)^2 - \min_p\, (Y(p) - c_t)^2\Big]$
$\;\;\;\, = \sum_{t=1}^{T} \Big\{\big(\phi(x_t, p_t)^\top \theta - c_t\big)^2 + p_t^\top \Sigma\, p_t\Big\} - \sum_{t=1}^{T} \min_p \Big\{\big(\phi(x_t, p)^\top \theta - c_t\big)^2 + p^\top \Sigma\, p\Big\}$
Theorem
For proper choices of the confidence levels $\alpha_{p,t}$ and of the number of exploration rounds $\tau$, with high probability $R_T \leq \mathcal{O}(T^{2/3})$
If $\Sigma$ is known, $R_T \leq \mathcal{O}(\sqrt{T \ln T})$
Elements of proof:
• Deviation inequalities on $\hat{\theta}_t$ [3] and on $\hat{\Sigma}_\tau$
• Inspired from the LinUCB regret bound analysis [4]
[2] Target Tracking for Contextual Bandits: Application to Demand Side Management, Margaux Brégère, Pierre Gaillard, Yannig Goude and Gilles Stoltz, ICML, 2019
[3] Laplace's method on supermartingales: Improved algorithms for linear stochastic bandits, Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári, NeurIPS, 2011
[4] Contextual bandits with linear payoff functions, Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire, JMLR, 2011
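A back-of-the-envelope heuristic (not the paper's proof) for where the $T^{2/3}$ rate comes from, assuming the exploration phase costs $\mathcal{O}(\tau)$ in regret and the residual error in $\hat{\Sigma}_\tau$ costs $\mathcal{O}(1/\sqrt{\tau})$ per remaining round:

```latex
\[
R_T \;\lesssim\; \underbrace{\tau}_{\text{exploration}}
\;+\; \underbrace{\frac{T}{\sqrt{\tau}}}_{\text{error in } \hat{\Sigma}_\tau}
\qquad\xrightarrow{\;\tau \,=\, T^{2/3}\;}\qquad
R_T \;\lesssim\; T^{2/3}.
\]
% When \Sigma is known the exploration phase is unnecessary and only the
% LinUCB-style term for \hat{\theta}_t remains, of lower order in T.
```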

Application

Extension: personalized demand side management

Other approaches and prospects

Control of flexible devices [4]
$N$ water-heaters to be controlled without compromising service quality
At each round $t = 1, \dots, T$:
• Observe a target $c_t$
• Send to all thermostatically controlled loads a probability $p_t \in [0, 1]$ of switching on
• Observe the demand
Assumptions:
• $N$ water-heaters with the same characteristics
• The demand of a water-heater is zero if OFF and constant if ON
• The state of water-heater $i$, $x_{i,t} = (\mathrm{Temperature}_t, \mathrm{ON/OFF}_t)$, follows an unknown Markov Decision Process (MDP)
• It is possible to control the demand if the MDP is known
Learn the MDP (drain law)
Follow the target
[4] (Online) Convex Optimization for Demand-Side Management: Application to Thermostatically Controlled Loads, Bianca Marin Moreno, Margaux Brégère, Pierre Gaillard and Nadia Oudjane, 2024
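A toy sketch of the control loop: only the idea of broadcasting a switch-on probability $p_t$ to $N$ identical loads and reading the aggregate demand comes from the slide, while the temperature dynamics, power rating and targets are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1_000                                # number of water-heaters (made up)
power = 2.0                              # kW drawn by a heater when ON (made up)
temp = rng.uniform(45, 60, size=N)       # initial water temperatures (made up)

def step(p_t, drain=0.5):
    """Broadcast switch-on probability p_t, update toy dynamics, return demand."""
    global temp
    on = rng.random(N) < p_t                         # each load switches ON w.p. p_t
    temp = temp + np.where(on, 1.0, 0.0) - drain     # heat if ON, always lose heat (toy)
    return on.sum() * power                          # aggregate demand in kW

for t, c_t in enumerate([800.0, 1200.0, 600.0]):     # targets c_t (made up)
    p_t = min(max(c_t / (N * power), 0.0), 1.0)      # naive open-loop choice of p_t
    demand = step(p_t)
    print(f"round {t}: target {c_t:.0f} kW, demand {demand:.0f} kW")
```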

Hyper-parameter optimization [5]
Training a neural network is expensive and time-consuming
Aim: for a set of hyper-parameters $\Lambda$ (number of neurons, activation functions, etc.) and a budget $T$, find the best neural network: $\arg\min_{\lambda \in \Lambda} \ell\big(f_\lambda(\mathcal{D}_{\mathrm{TEST}})\big)$
At each round $t = 1, \dots, T$:
• Choose hyper-parameters $\lambda_t \in \Lambda$
• Train the network $f_{\lambda_t}$ on $\mathcal{D}_{\mathrm{TRAIN}}$
• Observe the forecast error $\ell_t = \ell\big(f_{\lambda_t}(\mathcal{D}_{\mathrm{VALID}})\big)$
Output (best arm identification): $\arg\min_{f_{\lambda_t}} \ell\big(f_{\lambda_t}(\mathcal{D}_{\mathrm{VALID}})\big)$
[5] A bandit approach with evolutionary operators for model selection: Application to neural architecture optimization for image classification, Margaux Brégère and Julie Keisler, 2024
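A sketch of the bandit loop over hyper-parameter configurations; the grid, the placeholder `train_and_validate` function and the UCB-style selection rule are illustrative assumptions, and only the choose/train/observe/identify structure comes from the slide.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical hyper-parameter grid Lambda (made up for the example).
LAMBDA = [{"neurons": n, "activation": a}
          for n in (16, 64, 256) for a in ("relu", "tanh")]

def train_and_validate(config):
    """Placeholder for training f_lambda on D_TRAIN and scoring it on D_VALID.
    Returns a noisy synthetic validation loss; a real run would fit a network here."""
    base = 0.30 - 0.02 * np.log2(config["neurons"])
    base += 0.03 if config["activation"] == "tanh" else 0.0
    return max(base + 0.05 * rng.normal(), 0.0)

T = 30                                       # training budget (number of fits)
K = len(LAMBDA)
counts, loss_sums = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                                      # try every configuration once
    else:
        means = loss_sums / counts
        bonus = np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmin(means - bonus))              # optimistic on the loss
    loss = train_and_validate(LAMBDA[arm])
    counts[arm] += 1
    loss_sums[arm] += loss

best = int(np.argmin(loss_sums / counts))                # best-arm identification
print("selected hyper-parameters:", LAMBDA[best])
```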

That's all folks!

Questions