Financial Trading as a Game: A Deep Reinforcement Learning Approach
About This Presentation
An automatic program that generates constant profit from the financial market is lucrative for every market practitioner. Recent advances in deep reinforcement learning provide a framework toward end-to-end training of such trading agents. In this paper, we propose a Markov Decision Process (MDP) model suitable for the financial trading task and solve it with the state-of-the-art deep recurrent Q-network (DRQN) algorithm. We propose several modifications to the existing learning algorithm to make it more suitable for the financial trading setting, namely: 1. We employ a substantially smaller replay memory (only a few hundred transitions) compared to those used in modern deep reinforcement learning algorithms (often millions in size). 2. We develop an action augmentation technique to mitigate the need for random exploration by providing extra feedback signals for all actions to the agent. This enables us to use a greedy policy over the course of learning and shows strong empirical performance compared to the more commonly used ε-greedy exploration. However, this technique is specific to financial trading under a few market assumptions. 3. We sample a longer sequence for recurrent neural network training. A side product of this mechanism is that we can now train the agent only every T steps. This greatly reduces training time since the overall computation goes down by a factor of T. We combine all of the above into a complete online learning algorithm and validate our approach on the spot foreign exchange market.
Slide Content
Financial Trading with Deep Reinforcement Learning
Huang, Chien-Yi  [email protected]
Dept. of Applied Mathematics, NCTU
July 8, 2018
1 / 44
Outline
1. Deep Reinforcement Learning
2. Proposed Method
3. Numerical Results
4. Conclusion
2 / 44
Deep Reinforcement Learning
3 / 44
Model-free Reinforcement Learning¹
Model-free:
- Optimize the policy directly; do not build a model of the environment
- Learn from scratch through trial-and-error; no supervisor
¹ Sutton and Barto. Reinforcement Learning: An Introduction (1998)
4 / 44
Markov Decision Process
Definition
A Markov Decision Process is a tuple $(S, A, p, r, \gamma)$ where
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $p$ is a transition probability distribution, $p(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $r$ is a reward function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma \in (0, 1)$ is a discount factor
In real applications of model-free RL, we only define the state space, action space and reward function
The reward function should reflect the ultimate goal
6 / 44
Return and Policy
Definition
The return $G_t$ is the total sum of discounted rewards from time step $t$,
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Definition
A policy $\pi$ is a distribution over actions given states,
$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
9 / 44
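A one-line recursion makes the return definition concrete; the following minimal Python sketch (the reward values and gamma are made up for illustration) computes $G_t$ for a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three future rewards observed after time step t
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```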
Value Function
Definition
The action-value function $Q^{\pi}(s, a)$ of an MDP is the expected return starting from state $s$, taking action $a$ and then following policy $\pi$,
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
Bellman Equation
Theorem
The optimal value function satisfies
$Q^*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a]$
1. Once we have $Q^*$, we can act optimally, $\pi^*(s) = \arg\max_a Q^*(s, a)$
2. Cannot compute the expectation without the environment dynamics
3. Sample the Bellman equation through interaction with the env
13 / 44
Q-Learning¹
Goal: learn $Q^*$ for an MDP
Every step, perform an incremental update on $Q(s, a)$,
$Q(s, a) \leftarrow Q(s, a) + \alpha \big( \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{Q-target}} - Q(s, a) \big)$
Guaranteed convergence with a proper step-size schedule
Tabular method does not scale up to large problems
¹ Watkins and Dayan. Q-learning (1992)
16 / 44
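A minimal tabular sketch of this update rule; the state/action counts, step size and the sample transition are made-up toy values, not anything from the slides:

```python
import numpy as np

n_states, n_actions = 10, 3          # toy sizes for illustration
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_step(s, a, r, s_next):
    """One incremental update: Q(s,a) <- Q(s,a) + alpha * (Q-target - Q(s,a))."""
    q_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (q_target - Q[s, a])

q_learning_step(s=0, a=1, r=0.5, s_next=2)
```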
Deep Q-Network¹
Parametrize $Q^*$ with a deep neural network $Q_\theta$, e.g. a CNN
Target network $Q_{\theta^-}$:
- A delayed version of $Q_\theta$ used to compute the target value, Q-target $= r + \gamma \max_{a'} Q_{\theta^-}(s', a')$
- Soft update on the target network, $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$
Experience replay:
- Store previous transitions in a replay memory $D$
- Sample a mini-batch from $D$ for training at each step
Train network $Q_\theta$ with the mean square loss and gradient descent,
$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\big[ (r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a))^2 \big]$, $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$
¹ Mnih et al. Playing Atari with deep reinforcement learning (2013)
17 / 44
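A hedged PyTorch sketch of the pieces above: the mean-square loss against a frozen target network and the soft update $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$. The batch layout and the generic `q_net`/`target_net` modules are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-square loss of Q(s,a) against the frozen-target Q-target."""
    s, a, r, s_next = batch                      # states, long-dtype actions, rewards, next states
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_sa, q_target)

def soft_update(q_net, target_net, tau=0.001):
    """theta_minus <- tau * theta + (1 - tau) * theta_minus."""
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```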
Modifications to DQN
1. Double Q-Learning and Double DQN¹
   - Overestimation in Q-Learning
   - Use the online network to pick the argmax, Q-target $= r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))$
2. Deep Recurrent Q-Network (DRQN)²
   - Add an additional LSTM layer before the output layer
   - Sample a sequence of transitions for recurrent training
¹ Van Hasselt et al. Deep Reinforcement Learning with Double Q-Learning (2016)
² Hausknecht et al. Deep recurrent Q-learning for partially observable MDPs (2015)
18 / 44
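A small sketch of the Double DQN target from item 1, where the online network picks the argmax and the target network evaluates it; tensor shapes and variable names are assumed:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    """Q-target = r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # online net picks the action
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # target net evaluates it
    return r + gamma * q_next
```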
Proposed Method
20 / 44
Data Preparation
Source: TrueFX.com
Type: tick-by-tick data
Symbol: 12 currency pairs¹
Duration: from 2012.01 to 2017.12
Timeframe: 15-minute interval
Data: open, high, low, close prices and tick volume
Post-processing: intersect time indices for alignment
Size after post-processing: 139,813
¹ AUDJPY, AUDNZD, AUDUSD, CADJPY, CHFJPY, EURGBP, EURJPY, EURUSD, GBPJPY, GBPUSD, NZDUSD, USDCAD
21 / 44
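The post-processing step (intersecting time indices across symbols) could look roughly like this pandas sketch; the CSV file names and column layout are assumptions, not the authors' actual pipeline:

```python
import pandas as pd

symbols = ["AUDJPY", "EURUSD", "GBPUSD"]   # subset of the 12 pairs, for illustration

def load_15min_bars(symbol):
    # Assumed CSV layout: timestamp index plus open/high/low/close/volume columns
    df = pd.read_csv(f"{symbol}_15min.csv", index_col="timestamp", parse_dates=True)
    return df[["open", "high", "low", "close", "volume"]]

frames = {sym: load_15min_bars(sym) for sym in symbols}

# Intersect the time indices so every symbol has a bar at every kept timestamp
common_index = None
for df in frames.values():
    common_index = df.index if common_index is None else common_index.intersection(df.index)
aligned = {sym: df.loc[common_index] for sym, df in frames.items()}
```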
State, Action and Reward
State $\in \mathbb{R}^{198}$
- Sinusoidal encoding of minute, hour, day of week $\in \mathbb{R}^{3}$
- Recent 8-lag log returns on close prices for all symbols $\in \mathbb{R}^{8 \times 12}$
- Recent 8-lag log returns on volume for all symbols $\in \mathbb{R}^{8 \times 12}$
- One-hot encoding of the current position $\in \mathbb{R}^{3}$
- Agent's memory: $h_{t-1}$ in the LSTM layer
Action $\in \mathbb{R}^{3}$
- Position to hold at the next time step $\in \{-1, 0, 1\}$
- Position reversal is allowed
Reward
- One-step portfolio log return, $r_t = \log(v_t / v_{t-1})$
22 / 44
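A sketch of how the 198-dimensional state could be assembled (3 time features + 8x12 close-return lags + 8x12 volume lags + 3 position one-hot); the particular sinusoidal encoding shown is one plausible reading of the slide, not necessarily the exact formula used:

```python
import numpy as np
from datetime import datetime

def time_features(ts):
    """Sinusoidal encoding of minute, hour and day of week (one value each, 3 total)."""
    return np.array([
        np.sin(2 * np.pi * ts.minute / 60),
        np.sin(2 * np.pi * ts.hour / 24),
        np.sin(2 * np.pi * ts.weekday() / 7),
    ])

def build_state(ts, close_lag_returns, volume_lag_returns, position):
    """close/volume lag returns: arrays of shape (8, 12); position in {-1, 0, 1}."""
    one_hot = np.zeros(3)
    one_hot[position + 1] = 1.0
    state = np.concatenate([
        time_features(ts),
        np.ravel(close_lag_returns),
        np.ravel(volume_lag_returns),
        one_hot,
    ])
    assert state.shape == (198,)   # 3 + 8*12 + 8*12 + 3
    return state

s = build_state(datetime(2017, 6, 1, 10, 30), np.zeros((8, 12)), np.zeros((8, 12)), position=0)
```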
Model Architecture
Layer 0: $S_t$ (input)
Layer 1: hidden, ELU activation¹
Layer 2: hidden, ELU activation
Layer 3: LSTM, carrying memory $h_{t-1} \to h_t$
Layer 4: output $Q(S_t)$
Model size: about 65k parameters
¹ Clevert et al. Fast and accurate deep network learning by exponential linear units (ELUs) (2015)
25 / 44
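A hedged PyTorch sketch of this stack (two ELU hidden layers, an LSTM, and a 3-way Q-value output); the hidden width is an arbitrary assumption, since the slide only states the total of roughly 65k parameters:

```python
import torch
import torch.nn as nn

class DRQNet(nn.Module):
    """Layer 0: input S_t; layers 1-2: hidden + ELU; layer 3: LSTM; layer 4: Q-values."""
    def __init__(self, state_dim=198, hidden=64, n_actions=3):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, s, h=None):
        # s: (batch, seq_len, state_dim); h: optional LSTM state carried across time steps
        x = nn.functional.elu(self.fc1(s))
        x = nn.functional.elu(self.fc2(x))
        x, h = self.lstm(x, h)
        return self.out(x), h

net = DRQNet()
q_values, h = net(torch.zeros(1, 1, 198))   # one state, one time step
```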
Action Augmentation
Random exploration is unsatisfying in financial trading
E.g. $\epsilon$-greedy,
$\pi(a \mid s) = \epsilon/|A| + 1 - \epsilon$ if $a = \arg\max_{a} Q(s, a)$, and $\epsilon/|A|$ otherwise
Action augmentation:
- We can compute the reward signal for all actions
- The successor state only differs by the agent's position
- Enriches the feedback signal to the agent
- Encodes prior knowledge in learning
A novel loss function to incorporate action augmentation; we update the Q-value for all actions,
$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\big[ \| r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a')) - Q_\theta(s, a) \|^2 \big]$
26 / 44
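A sketch of what computing the reward for every action in one step might look like, assuming the one-step portfolio log return above and a simple proportional spread cost for changing position; the cost model and the example numbers are assumptions:

```python
import numpy as np

def augmented_rewards(log_return, prev_position, spread_cost, actions=(-1, 0, 1)):
    """One-step reward for every candidate action, not just the action actually taken."""
    rewards = []
    for a in actions:
        pnl = a * log_return                          # position times the market move
        cost = spread_cost * abs(a - prev_position)   # pay the spread when the position changes
        rewards.append(pnl - cost)
    return np.array(rewards)

r_all = augmented_rewards(log_return=0.0004, prev_position=0, spread_cost=0.00008)
```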
Learning Algorithm
Algorithm 1: Financial DRQN
Initialize: $T \in \mathbb{N}$, recurrent Q-network $Q_\theta$, target network $Q_{\theta^-} \leftarrow Q_\theta$, dataset $D$
Simulate env $E$ from dataset $D$
step $\leftarrow$ 1
Observe initial state $s$ from env $E$
for each step do
  step $\leftarrow$ step + 1
  Select the greedy action w.r.t. $Q_\theta(s, a)$ and apply it to env $E$
  Receive reward $r$ and next state $s'$ from env $E$
  Augment actions to form the transition $(s, a, r, s')$ and add it to memory $D$
  if $D$ is filled and step mod $T$ = 0 then
    Sample a sequence of length $T$ from $D$
    Train network $Q_\theta$ with the loss $L(\theta)$
  end if
  Soft update the target network, $\theta^- \leftarrow (1 - \tau)\theta^- + \tau\theta$
end for
29 / 44
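A self-contained toy sketch of the loop in Algorithm 1 (greedy acting, action augmentation, a small replay memory, training only every T steps, soft target updates); a 2-dimensional state, synthetic log returns and a linear Q-function stand in for the real 198-dimensional state and the recurrent network, so this only illustrates the control flow, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, gamma, tau, lr = 96, 480, 0.99, 0.001, 0.00025
actions = (-1, 0, 1)
log_returns = rng.normal(0.0, 1e-3, size=5000)       # made-up 15-minute log returns
W = np.zeros((2, 3))                                  # toy linear Q: Q(s, .) = s @ W
W_target = W.copy()

def augment(s, pos, next_ret, spread=8e-5):
    """One transition per candidate action; successor states differ only in the position entry."""
    return [(s, a, a * next_ret - spread * abs(a - pos), np.array([next_ret, float(a)]))
            for a in actions]

memory, pos = [], 0
for t in range(len(log_returns) - 1):
    s = np.array([log_returns[t], float(pos)])
    a = actions[int(np.argmax(s @ W))]                # greedy action, no epsilon-greedy exploration
    memory.extend(augment(s, pos, log_returns[t + 1]))
    memory = memory[-3 * N:]                          # substantially small replay memory
    if len(memory) == 3 * N and t % T == 0:           # train only every T steps
        for si, ai, ri, sni in memory[-3 * T:]:       # most recent (augmented) length-T slice
            td = ri + gamma * np.max(sni @ W_target) - (si @ W)[actions.index(ai)]
            W[:, actions.index(ai)] += lr * td * si
        W_target = (1 - tau) * W_target + tau * W     # soft update of the target network
    pos = a
```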
Summary
Agent consists of
- LSTM network (model)
- F-DRQN (algorithm)
Agent takes in financial data
Agent actions $\in \{-1, 0, 1\}$
Agent maximizes portfolio value
30 / 44
Numerical Results
31 / 44
The Question to Ask
"Can a single deep reinforcement learning agent, i.e.
- a single network architecture,
- a single learning algorithm,
- a fixed set of hyperparameters,
learn to trade multiple currency pairs?"
If so, we
- move beyond rule-based and prediction-based agents
- achieve end-to-end training of financial trading agents
32 / 44
Hyperparameters for Training and Simulation
We use a substantially smaller replay memory
Initial cash is in the base currency of the pair
Fixed spread unless otherwise stated
Training
- Learning timestep $T$: 96
- Replay memory size $N$: 480
- Learning rate $\alpha$: 0.00025
- Optimizer: Adam
- Discount factor $\gamma$: 0.99
- Target network $\tau$: 0.001
Simulation
- Initial cash: 100,000
- Trade size: 100,000
- Spread (bp): 0.08
- Trading days: 252
34 / 44
Simulation Result with Baseline (figure)
36 / 44
Trading Statistics
We compute trading statistics (win rate, risk-reward¹, correlation², trading frequency) averaged over all symbols; the average win rate is 59.8%
Patterns found in the trading strategies discovered by the agent,
1. The agent favors strategies with a high win rate (about 60%)
2. The agent favors strategies with a lower risk-reward ratio (about 0.75)
3. The agent discovers strategies with low correlation to the baseline
4. The agent makes trading decisions roughly every hour
¹ average profit divided by average loss
² average absolute correlation with the baseline
37 / 44
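For concreteness, the two footnoted statistics could be computed from a list of per-trade profits roughly as follows (the trade list is made up; this is not the authors' evaluation code):

```python
import numpy as np

def win_rate_and_risk_reward(trade_pnl):
    """Win rate = fraction of profitable trades; risk-reward = average profit / average loss."""
    pnl = np.asarray(trade_pnl, dtype=float)
    wins, losses = pnl[pnl > 0], pnl[pnl < 0]
    win_rate = len(wins) / len(pnl)
    risk_reward = wins.mean() / abs(losses.mean())
    return win_rate, risk_reward

# Made-up per-trade profit/loss values
print(win_rate_and_risk_reward([120, -200, 80, 150, -180, 90]))
```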
The Effect of Spread (figure)
38 / 44
The Effect of Spread
Annual return under different spreads (0.08, 0.1, 0.15 and 0.2 bp); at the 0.08 bp spread the annual return is 23.8%
We discover,
1. A wide spread harms performance in general
2. A counter-intuitive fact: a slightly higher spread leads to better overall performance
39 / 44
Effectiveness of Action Augmentation (figure)
40 / 44
Effectiveness of Action Augmentation
We compare action augmentation (AA) with the standard $\epsilon$-greedy policy with $\epsilon = 0.1$
Annual return: 17.4% with $\epsilon$-greedy vs. 23.8% with action augmentation
We discover
1. AA improves performance and lowers variability
2. We gain an additional 6.4% annual return when using AA
41 / 44
Conclusion
42 / 44
Achievements
- We propose an MDP model for financial trading that is easily extendable with more complex state and action spaces
- We propose modifications to the original DRQN algorithm, including a novel action augmentation technique
- We give empirical simulation results for 12 currency pairs under different spread settings
- We discover a counter-intuitive fact: a slightly higher spread leads to better overall performance
43 / 44
Future Directions
Expand the state and action space,
- Macro data, NLP data...
- Adjustable position size, limit orders...
Different financial trading settings, e.g. high-frequency trading
- Different input state and action space
- Different reward function
Distributional reinforcement learning:
$Q(s, a) \overset{D}{=} R(s, a) + \gamma Q(S', A')$
Pick the action with the highest Sharpe ratio,
$a^* = \arg\max_{a \in A} \dfrac{\mathbb{E}[Q]}{\sqrt{\mathrm{Var}[Q]}}$
44 / 44
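A small sketch of the Sharpe-ratio action rule on this last slide, assuming each action's return distribution is represented by a set of samples (for example quantile estimates); the numbers are made up:

```python
import numpy as np

def sharpe_action(q_samples, eps=1e-8):
    """Pick argmax_a E[Q(s,a)] / sqrt(Var[Q(s,a)]) from per-action distribution samples."""
    q_samples = np.asarray(q_samples, dtype=float)    # shape: (n_actions, n_samples)
    sharpe = q_samples.mean(axis=1) / (np.sqrt(q_samples.var(axis=1)) + eps)
    return int(np.argmax(sharpe))

# Made-up distributional estimates for actions {-1, 0, +1}
q = [[-0.2, 0.1, 0.3], [0.0, 0.05, 0.02], [0.1, 0.4, -0.1]]
best = sharpe_action(q)   # index into the action set
```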