Sequential and reinforcement learning for demand side management by Margaux Brégère
WiMLDS_Paris
About This Presentation
As electricity is difficult to store, it is crucial to strictly maintain the balance between production and consumption. The integration of intermittent renewable energies into the production mix has made the management of this balance more complex. However, access to near real-time data and communication with consumers via smart meters open the way to demand response: sending signals would encourage users to adjust their consumption according to the production of electricity. The algorithms used to select these signals must learn consumers' reactions and optimize the signals sent, while balancing exploration and exploitation. Various sequential or reinforcement learning approaches are considered.
Size: 2.6 MB
Language: en
Added: May 03, 2024
Slides: 20 pages
Slide Content
Margaux Brégère - April 24th, 2024
Sequential and reinforcement learning for demand side management
Paris Women in Machine Learning & Data Science @Criteo
Introduction
Demand side management
Electricity is hard to store → the production - demand balance must be strictly maintained
Current solution: forecast demand and adapt production accordingly
• Renewable energies development → production harder to adjust
• New (smart) meters → access to data and instantaneous communication
Prospective solutions: manage demand
→ Send incentive signals (prices)
→ Control flexible devices
Demand side management with incentive signals
How to develop automatic solutions to choose incentive signals dynamically?
The environment (consumer behavior) is discovered through interactions (incentive signal choices) → Reinforcement learning
Exploration: learn consumer behavior
Exploitation: optimize signal sending
« Smart Meter Energy Consumption Data in London Households »
Stochastic multi-armed bandits
In a multi-armed bandit problem, a gambler facing a row of $K$ slot machines (also called one-armed bandits) has to decide which machines to play to maximize her reward.
Exploration: test many arms
Exploitation: play the « best » arm
→ Exploration - Exploitation trade-off
Stochastic multi-armed bandit
Each arm $k$ is defined by an unknown probability distribution $\nu_k$.
For $t = 1, \dots, T$:
• Pick an arm $I_t \in \{1, \dots, K\}$
• Receive a random reward $Y_t$ with $Y_t \,|\, I_t = k \sim \nu_k$
Maximize the cumulative reward ⇔ Minimize the regret, i.e., the difference, in expectation, between the cumulative reward of the best strategy and that of ours:
$R_T = T \max_{k=1,\dots,K} \mu_k - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{I_t}\Big]$, with $\mu_k = \mathbb{E}[\nu_k]$
A good bandit algorithm has a sub-linear regret: $R_T / T \to 0$
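As a concrete illustration of this protocol, here is a minimal Python sketch of the interaction loop and of the regret computation; the Bernoulli arm means and the uniform policy are placeholders invented for the example, not taken from the talk.

```python
import numpy as np

# Minimal sketch of the stochastic bandit protocol with Bernoulli arms
# (illustrative only: the means `mus` and the random policy are placeholders).
rng = np.random.default_rng(0)
mus = np.array([0.2, 0.5, 0.7])   # unknown expected rewards mu_k = E[nu_k]
K, T = len(mus), 1000

pulls = np.zeros(T, dtype=int)
for t in range(T):
    I_t = rng.integers(K)             # placeholder policy: pick an arm uniformly at random
    Y_t = rng.binomial(1, mus[I_t])   # random reward Y_t | I_t = k ~ nu_k
    pulls[t] = I_t

# Regret: R_T = T * max_k mu_k - E[sum_t mu_{I_t}] (here estimated on one run)
regret = T * mus.max() - mus[pulls].sum()
print(f"Pseudo-regret of the uniform policy after T={T} rounds: {regret:.1f}")
```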
Upper Confidence Bound algorithm [1]
Initialization: pick each arm once
For $t = K+1, \dots, T$:
• Estimate the expected reward of each arm $k$ with $\hat\mu_{k,t-1}$ (empirical mean of past rewards)
• Build some confidence intervals around these estimations: $\mu_k \in [\hat\mu_{k,t-1} - \alpha_{k,t},\, \hat\mu_{k,t-1} + \alpha_{k,t}]$ with high probability
• Be optimistic: act as if the best possible probable reward was the true reward and choose the next arm accordingly, $I_t = \arg\max_k \{\hat\mu_{k,t-1} + \alpha_{k,t}\}$
[1] Finite-time analysis of the multiarmed bandit problem, Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer, Machine Learning, 2002
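Below is a minimal sketch of the UCB rule in Python, on the same toy Bernoulli arms as above; it uses the confidence width $\alpha_{k,t} = \sqrt{2\log t / N_{k,t-1}}$ given on the next slide, and everything else (environment, horizon) is a placeholder.

```python
import numpy as np

def ucb(mus, T, rng=np.random.default_rng(0)):
    """Sketch of the Upper Confidence Bound algorithm [1] on toy Bernoulli arms."""
    K = len(mus)
    N = np.zeros(K)                    # N_{k,t-1}: number of pulls of each arm
    S = np.zeros(K)                    # sum of past rewards of each arm
    choices = np.zeros(T, dtype=int)
    for t in range(T):
        if t < K:                      # initialization: pick each arm once
            k = t
        else:
            mu_hat = S / N                          # empirical means
            alpha = np.sqrt(2 * np.log(t) / N)      # confidence widths
            k = int(np.argmax(mu_hat + alpha))      # optimism in the face of uncertainty
        Y = rng.binomial(1, mus[k])    # observe a random reward
        N[k] += 1
        S[k] += Y
        choices[t] = k
    return choices
```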
UCB regret bound
The empirical means based on past rewards are:
$\hat\mu_{k,t-1} = \frac{1}{N_{k,t-1}} \sum_{s=1}^{t-1} Y_s \mathbb{1}_{\{I_s = k\}}$, with $N_{k,t-1} = \sum_{s=1}^{t-1} \mathbb{1}_{\{I_s = k\}}$
With the Hoeffding-Azuma inequality, we get
$\mathbb{P}\big(\mu_k \in [\hat\mu_{k,t-1} - \alpha_{k,t},\, \hat\mu_{k,t-1} + \alpha_{k,t}]\big) \geq 1 - t^{-3}$, with $\alpha_{k,t} = \sqrt{\frac{2\log t}{N_{k,t-1}}}$
And being optimistic ensures that $R_T \lesssim \sqrt{T K \log T}$
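As a quick sanity check of this sub-linear behaviour, one can reuse the `ucb` sketch above on the same toy arms and verify that the average regret $R_T / T$ shrinks as $T$ grows (illustrative only, one run per horizon):

```python
mus = np.array([0.2, 0.5, 0.7])
for T in (1_000, 10_000, 100_000):
    choices = ucb(mus, T)
    regret = T * mus.max() - mus[choices].sum()
    print(f"T={T:>7}: pseudo-regret {regret:7.1f}, regret/T {regret / T:.4f}")
# regret/T should decrease towards 0 as T grows (sub-linear regret)
```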
Modeling demand side management
Demand side management with incentive signals
For $t = 1, \dots, T$:
• Observe a context $x_t$ and a target $c_t$
• Choose price levels $p_t$
• Observe the resulting electricity demand $Y_t = f(x_t, p_t) + \mathrm{noise}(p_t)$ and suffer the loss $\ell(Y_t, c_t)$
Assumptions:
• Homogeneous population, $K$ tariffs, $p_t \in \Delta_K$
• $f(x_t, p_t) = \phi(x_t, p_t)^\top \theta$ with a known mapping function $\phi$ and an unknown vector $\theta$ to estimate
• $\mathrm{noise}(p_t) = p_t^\top \varepsilon_t$ with $\mathbb{E}[\varepsilon_t \varepsilon_t^\top] = \Sigma$
• $\ell(Y_t, c_t) = (Y_t - c_t)^2$
(Diagram: price levels $p_t$, target $c_t$, context $x_t$, demand $Y_t$)
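To make this model concrete, here is a toy simulator of the assumed demand $Y_t = \phi(x_t, p_t)^\top \theta + p_t^\top \varepsilon_t$; the feature map `phi`, the dimensions and all numerical values are invented for illustration and are not those of the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3                                   # number of tariffs; p_t lives in the simplex Delta_K
theta = rng.normal(size=2 * K)          # unknown parameter vector (invented values)
Sigma = 0.05 * np.eye(K)                # noise covariance E[eps eps^T] = Sigma (invented)

def phi(x_t, p_t):
    """Toy feature map phi(x_t, p_t): context replicated per tariff and weighted by p_t."""
    return np.concatenate([p_t * x_t, p_t])

def observe_demand(x_t, p_t):
    """Simulate Y_t = phi(x_t, p_t)^T theta + p_t^T eps_t."""
    eps_t = rng.multivariate_normal(np.zeros(K), Sigma)
    return phi(x_t, p_t) @ theta + p_t @ eps_t

x_t = 1.0                               # scalar context, for simplicity
p_t = np.array([0.2, 0.3, 0.5])         # price-level allocation in Delta_K
print(observe_demand(x_t, p_t))
```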
Bandit algorithm for target tracking
Under these assumptions:
$\mathbb{E}\big[(Y_t - c_t)^2 \,\big|\, \text{past}, x_t, p_t\big] = \big(\phi(x_t, p_t)^\top \theta - c_t\big)^2 + p_t^\top \Sigma\, p_t$
Estimate parameters $\theta$ and $\Sigma$ to estimate losses and reach a bias-variance trade-off.
Optimistic algorithm:
For $t = 1, \dots, \tau$:
• Select price levels deterministically to estimate $\Sigma$ offline with $\hat\Sigma_\tau$
For $t = \tau+1, \dots, T$:
• Estimate $\theta$ based on past observations with $\hat\theta_{t-1}$, thanks to a Ridge regression
• Estimate the future expected loss for each price level $p$: $\hat\ell_{p,t} = \big(\phi(x_t, p)^\top \hat\theta_{t-1} - c_t\big)^2 + p^\top \hat\Sigma_\tau\, p$
• Get confidence bounds on these estimations: $|\hat\ell_{p,t} - \ell_{p,t}| \leq \alpha_{p,t}$
• Select price levels optimistically: $p_t \in \arg\min_p \{\hat\ell_{p,t} - \alpha_{p,t}\}$
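A Python sketch of one optimistic selection round over a finite grid of candidate price allocations, together with a ridge estimate of $\theta$; it reuses the toy `phi` from the previous sketch, and the confidence widths `alpha` are passed in as placeholders rather than derived as in [2].

```python
import numpy as np

def select_price(x_t, c_t, price_grid, theta_hat, Sigma_hat, alpha):
    """One optimistic round (sketch):
    p_t in argmin_p { (phi(x_t, p)^T theta_hat - c_t)^2 + p^T Sigma_hat p - alpha_p }.
    `price_grid` is a finite list of candidate allocations and `alpha` their confidence
    widths (placeholder values; the real widths come from the ridge-regression analysis)."""
    losses = []
    for p, a in zip(price_grid, alpha):
        bias = (phi(x_t, p) @ theta_hat - c_t) ** 2   # estimated squared tracking error
        variance = p @ Sigma_hat @ p                  # estimated noise contribution
        losses.append(bias + variance - a)            # optimism: subtract the width
    return price_grid[int(np.argmin(losses))]

def ridge_estimate(Phi, Y, lam=1.0):
    """Ridge-regression estimate of theta from past feature rows Phi and demands Y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)
```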
Regret bound [2]
$R_T = \mathbb{E}\Big[\sum_{t=1}^{T} (Y_t - c_t)^2 - \min_p \big(Y(p) - c_t\big)^2\Big] = \sum_{t=1}^{T} \Big\{\big(\phi(x_t, p_t)^\top \theta - c_t\big)^2 + p_t^\top \Sigma\, p_t\Big\} - \sum_{t=1}^{T} \min_p \Big\{\big(\phi(x_t, p)^\top \theta - c_t\big)^2 + p^\top \Sigma\, p\Big\}$
[2] Target Tracking for Contextual Bandits: Application to Demand Side Management, Margaux Brégère, Pierre Gaillard, Yannig Goude and Gilles Stoltz, ICML, 2019
[3] Laplace's method on supermartingales: Improved algorithms for linear stochastic bandits, Yasin Abbasi-Yadkori, Dávid Pál and Csaba Szepesvári, NeurIPS, 2011
[4] Contextual bandits with linear payoff functions, Wei Chu, Lihong Li, Lev Reyzin and Robert Schapire, AISTATS (JMLR W&CP), 2011
Theorem
For proper choices of confidence levels $\alpha_{p,t}$ and number of exploration rounds $\tau$, with high probability, $R_T \leq \mathcal{O}(T^{2/3})$.
If $\Sigma$ is known, $R_T \leq \mathcal{O}(\sqrt{T \ln T})$.
Elements of proof:
• Deviation inequalities on $\hat\theta_t$ [3] and on $\hat\Sigma_\tau$
• Inspired from the LinUCB regret bound analysis [4]
Application
Extension: personalized demand side management
Other approaches and prospects
Control of flexible devices [5]
$N$ water-heaters to be controlled without compromising service quality
At each round $t = 1, \dots, T$:
• Observe a target $c_t$
• Send to all thermostatically controlled loads a probability of switching on, $p_t \in [0,1]$
• Observe the demand
Assumptions:
• $N$ water-heaters with the same characteristics
• The demand of water-heater $i$ is zero if OFF and constant if ON
• The state of water-heater $i$, $x_{i,t} = (\text{Temperature}_t, \text{ON/OFF}_t)$, follows an unknown Markov Decision Process (MDP)
• It is possible to control the demand if the MDP is known
Learn the MDP (drain law) ↔ Follow the target
[5] (Online) Convex Optimization for Demand-Side Management: Application to Thermostatically Controlled Loads, Bianca Marin Moreno, Margaux Brégère, Pierre Gaillard and Nadia Oudjane, 2024
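A toy one-step simulation of this broadcast mechanism is sketched below; the number of loads, their power and the thermostat dead-band are invented, and the temperature dynamics (the MDP to be learned) are not modelled.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000                              # number of identical water-heaters (invented)
POWER_KW = 2.0                        # demand of one water-heater when ON (invented)
T_MIN, T_MAX = 45.0, 60.0             # thermostat dead-band, in deg C (invented)

def broadcast(p_t, temperatures):
    """Send the same switch-on probability p_t in [0, 1] to every load and return the
    aggregated demand. Loads outside the thermostat dead-band override the signal."""
    on = rng.random(N) < p_t          # each load switches on independently with prob. p_t
    on |= temperatures < T_MIN        # must heat when too cold
    on &= temperatures < T_MAX        # must stop when too hot
    return POWER_KW * on.sum()

temperatures = rng.uniform(T_MIN, T_MAX, size=N)   # invented initial water temperatures
print(broadcast(0.3, temperatures))
```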
Hyper-parameter optimization [6]
Training a neural network is expensive and time-consuming.
Aim: for a set of hyper-parameters $\Lambda$ (number of neurons, activation functions, etc.) and a budget $T$, find the best neural network: $\arg\min_{\lambda \in \Lambda} \ell\big(f_\lambda(\mathcal{D}_{\text{TEST}})\big)$
At each round $t = 1, \dots, T$:
• Choose hyper-parameters $\lambda_t \in \Lambda$
• Train the network $f_{\lambda_t}$ on $\mathcal{D}_{\text{TRAIN}}$
• Observe the forecast error $\ell_t = \ell\big(f_{\lambda_t}(\mathcal{D}_{\text{VALID}})\big)$
Output (best arm identification): $\arg\min_{f_{\lambda_t}} \ell\big(f_{\lambda_t}(\mathcal{D}_{\text{VALID}})\big)$
(Data split: $\mathcal{D}_{\text{TRAIN}}$, $\mathcal{D}_{\text{VALID}}$, $\mathcal{D}_{\text{TEST}}$)
[6] A bandit approach with evolutionary operators for model selection: Application to neural architecture optimization for image classification, Margaux Brégère and Julie Keisler, 2024
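The round-based protocol can be sketched as below; the selection strategy here is plain uniform sampling over the grid (the referenced work [6] uses evolutionary operators instead), and `train` / `valid_loss` are hypothetical placeholders for fitting on $\mathcal{D}_{\text{TRAIN}}$ and scoring on $\mathcal{D}_{\text{VALID}}$.

```python
import random

def tune(hyperparam_grid, train, valid_loss, budget_T):
    """Sketch of the bandit protocol for hyper-parameter selection.
    Placeholder strategy: uniform exploration of the grid, then best-arm identification.
    `train(lmbda)` fits a network on D_TRAIN; `valid_loss(model)` evaluates it on D_VALID."""
    best_model, best_loss = None, float("inf")
    for t in range(budget_T):
        lmbda = random.choice(hyperparam_grid)   # choose hyper-parameters lambda_t in Lambda
        model = train(lmbda)                     # train f_{lambda_t} on D_TRAIN
        loss_t = valid_loss(model)               # observe the forecast error on D_VALID
        if loss_t < best_loss:                   # keep the arg min (best arm identification)
            best_model, best_loss = model, loss_t
    return best_model
```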