Financial Trading as a Game: A Deep Reinforcement Learning Approach
About This Presentation
An automatic program that generates constant profit from the financial market is lucrative for every market practitioner. Recent advances in deep reinforcement learning provide a framework toward end-to-end training of such trading agents. In this paper, we propose a Markov Decision Process (MDP) model suitable for the financial trading task and solve it with the state-of-the-art deep recurrent Q-network (DRQN) algorithm. We propose several modifications to the existing learning algorithm to make it more suitable for the financial trading setting, namely: 1. We employ a substantially smaller replay memory (only a few hundred transitions) compared to those used in modern deep reinforcement learning algorithms (often millions in size). 2. We develop an action augmentation technique to mitigate the need for random exploration by providing extra feedback signals for all actions to the agent. This enables us to use a greedy policy over the course of learning and shows strong empirical performance compared to the more commonly used ε-greedy exploration. However, this technique is specific to financial trading under a few market assumptions. 3. We sample a longer sequence for recurrent neural network training. A side product of this mechanism is that we can now train the agent only every T steps. This greatly reduces training time since the overall computation goes down by a factor of T. We combine all of the above into a complete online learning algorithm and validate our approach on the spot foreign exchange market.
Slide Content
Financial Trading with Deep Reinforcement Learning
Huang, Chien-Yi  [email protected]
Dept. of Applied Mathematics, NCTU
July 8, 2018
1 / 44
Outline
1. Deep Reinforcement Learning
2. Proposed Method
3. Numerical Results
4. Conclusion
2 / 44
Deep Reinforcement Learning
3 / 44
Model-free Reinforcement Learning¹
Model-free:
- Optimize the policy directly; do not build a model of the environment
- Learn from scratch through trial-and-error; no supervisor
¹ Sutton and Barto. Reinforcement Learning: An Introduction (1998)
4 / 44
Markov Decision Process
Definition
A Markov Decision Process is a tuple $(S, A, p, r, \gamma)$ where
- $S$ is a finite set of states
- $A$ is a finite set of actions
- $p$ is a transition probability distribution, $p(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $r$ is a reward function, $r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma \in (0, 1)$ is a discount factor
In real applications of model-free RL, we only define the state space, action space and reward function
The reward function should reflect the ultimate goal
6 / 44
Return and Policy
Definition
The return $G_t$ is the total sum of discounted rewards from time step $t$,
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
Definition
A policy $\pi$ is a distribution over actions given states,
$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
9 / 44
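A one-line recursion makes the return definition concrete; the following minimal Python sketch (the reward values and gamma are made up for illustration) computes $G_t$ for a finite reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three future rewards observed after time step t
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```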
Value Function
Definition
The action-value function $Q^{\pi}(s, a)$ of an MDP is the expected return starting from state $s$, taking action $a$ and then following policy $\pi$,
$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$
Bellman Equation
Theorem
The optimal value function satisfies
$Q^*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a]$
1. Once we have $Q^*$, we can act optimally, $\pi^*(s) = \arg\max_a Q^*(s, a)$
2. Cannot compute the expectation without the environment dynamics
3. Sample the Bellman equation through interaction with the env
13 / 44
Q-Learning¹
Goal: learn $Q^*$ for an MDP
Every step, perform an incremental update on $Q(s, a)$,
$Q(s, a) \leftarrow Q(s, a) + \alpha \big( \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{Q-target}} - Q(s, a) \big)$
Guaranteed convergence with a proper step-size schedule
Tabular method does not scale up to large problems
¹ Watkins and Dayan. Q-learning (1992)
16 / 44
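A minimal tabular sketch of this update rule; the state/action counts, step size and the sample transition are made-up toy values, not anything from the slides:

```python
import numpy as np

n_states, n_actions = 10, 3          # toy sizes for illustration
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_learning_step(s, a, r, s_next):
    """One incremental update: Q(s,a) <- Q(s,a) + alpha * (Q-target - Q(s,a))."""
    q_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (q_target - Q[s, a])

q_learning_step(s=0, a=1, r=0.5, s_next=2)
```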
Deep Q-Network¹
Parametrize $Q^*$ with a deep neural network $Q_\theta$, e.g. a CNN
Target network $Q_{\theta^-}$:
- A delayed version of $Q_\theta$ used to compute the target value, Q-target $= r + \gamma \max_{a'} Q_{\theta^-}(s', a')$
- Soft update on the target network, $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$
Experience replay:
- Store previous transitions in a replay memory $D$
- Sample a mini-batch from $D$ for training at each step
Train network $Q_\theta$ with the mean square loss and gradient descent,
$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\big[ (r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a))^2 \big]$, $\theta \leftarrow \theta - \alpha \nabla_\theta L(\theta)$
¹ Mnih et al. Playing Atari with deep reinforcement learning (2013)
17 / 44
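A hedged PyTorch sketch of the pieces above: the mean-square loss against a frozen target network and the soft update $\theta^- \leftarrow \tau\theta + (1 - \tau)\theta^-$. The batch layout and the generic `q_net`/`target_net` modules are assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-square loss of Q(s,a) against the frozen-target Q-target."""
    s, a, r, s_next = batch                      # states, long-dtype actions, rewards, next states
    with torch.no_grad():
        q_target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return nn.functional.mse_loss(q_sa, q_target)

def soft_update(q_net, target_net, tau=0.001):
    """theta_minus <- tau * theta + (1 - tau) * theta_minus."""
    for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)
```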
Modifications to DQN
1. Double Q-Learning and Double DQN¹
   - Overestimation in Q-Learning
   - Use the online network to pick the argmax, Q-target $= r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))$
2. Deep Recurrent Q-Network (DRQN)²
   - Add an additional LSTM layer before the output layer
   - Sample a sequence of transitions for recurrent training
¹ Van Hasselt et al. Deep Reinforcement Learning with Double Q-Learning (2016)
² Hausknecht et al. Deep recurrent Q-learning for partially observable MDPs (2015)
18 / 44
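A small sketch of the Double DQN target from item 1, where the online network picks the argmax and the target network evaluates it; tensor shapes and variable names are assumed:

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    """Q-target = r + gamma * Q_target(s', argmax_a' Q_online(s', a'))."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # online net picks the action
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # target net evaluates it
    return r + gamma * q_next
```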
Proposed Method
20 / 44
Data Preparation
Source: TrueFX.com
Type: tick-by-tick data
Symbol: 12 currency pairs¹
Duration: from 2012.01 to 2017.12
Timeframe: 15-minute interval
Data: open, high, low, close prices and tick volume
Post-processing: intersect time indices for alignment
Size after post-processing: 139,813
¹ AUDJPY, AUDNZD, AUDUSD, CADJPY, CHFJPY, EURGBP, EURJPY, EURUSD, GBPJPY, GBPUSD, NZDUSD, USDCAD
21 / 44
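The post-processing step (intersecting time indices across symbols) could look roughly like this pandas sketch; the CSV file names and column layout are assumptions, not the authors' actual pipeline:

```python
import pandas as pd

symbols = ["AUDJPY", "EURUSD", "GBPUSD"]   # subset of the 12 pairs, for illustration

def load_15min_bars(symbol):
    # Assumed CSV layout: timestamp index plus open/high/low/close/volume columns
    df = pd.read_csv(f"{symbol}_15min.csv", index_col="timestamp", parse_dates=True)
    return df[["open", "high", "low", "close", "volume"]]

frames = {sym: load_15min_bars(sym) for sym in symbols}

# Intersect the time indices so every symbol has a bar at every kept timestamp
common_index = None
for df in frames.values():
    common_index = df.index if common_index is None else common_index.intersection(df.index)
aligned = {sym: df.loc[common_index] for sym, df in frames.items()}
```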
State, Action and Reward
State $\in \mathbb{R}^{198}$
- Sinusoidal encoding of minute, hour, day of week $\in \mathbb{R}^{3}$
- Recent 8-lag log returns on close prices for all symbols $\in \mathbb{R}^{8 \times 12}$
- Recent 8-lag log returns on volume for all symbols $\in \mathbb{R}^{8 \times 12}$
- One-hot encoding of the current position $\in \mathbb{R}^{3}$
- Agent's memory: $h_{t-1}$ in the LSTM layer
Action $\in \mathbb{R}^{3}$
- Position to hold at the next time step $\in \{-1, 0, 1\}$
- Position reversal is allowed
Reward
- One-step portfolio log return, $r_t = \log(v_t / v_{t-1})$
22 / 44
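A sketch of how the 198-dimensional state could be assembled (3 time features + 8x12 close-return lags + 8x12 volume lags + 3 position one-hot); the particular sinusoidal encoding shown is one plausible reading of the slide, not necessarily the exact formula used:

```python
import numpy as np
from datetime import datetime

def time_features(ts):
    """Sinusoidal encoding of minute, hour and day of week (one value each, 3 total)."""
    return np.array([
        np.sin(2 * np.pi * ts.minute / 60),
        np.sin(2 * np.pi * ts.hour / 24),
        np.sin(2 * np.pi * ts.weekday() / 7),
    ])

def build_state(ts, close_lag_returns, volume_lag_returns, position):
    """close/volume lag returns: arrays of shape (8, 12); position in {-1, 0, 1}."""
    one_hot = np.zeros(3)
    one_hot[position + 1] = 1.0
    state = np.concatenate([
        time_features(ts),
        np.ravel(close_lag_returns),
        np.ravel(volume_lag_returns),
        one_hot,
    ])
    assert state.shape == (198,)   # 3 + 8*12 + 8*12 + 3
    return state

s = build_state(datetime(2017, 6, 1, 10, 30), np.zeros((8, 12)), np.zeros((8, 12)), position=0)
```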
Model Architecture
Layer 0: $S_t$ (input)
Layer 1: hidden, ELU activation¹
Layer 2: hidden, ELU activation
Layer 3: LSTM, carrying memory $h_{t-1} \to h_t$
Layer 4: output $Q(S_t)$
Model size: about 65k parameters
¹ Clevert et al. Fast and accurate deep network learning by exponential linear units (ELUs) (2015)
25 / 44
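A hedged PyTorch sketch of this stack (two ELU hidden layers, an LSTM, and a 3-way Q-value output); the hidden width is an arbitrary assumption, since the slide only states the total of roughly 65k parameters:

```python
import torch
import torch.nn as nn

class DRQNet(nn.Module):
    """Layer 0: input S_t; layers 1-2: hidden + ELU; layer 3: LSTM; layer 4: Q-values."""
    def __init__(self, state_dim=198, hidden=64, n_actions=3):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, s, h=None):
        # s: (batch, seq_len, state_dim); h: optional LSTM state carried across time steps
        x = nn.functional.elu(self.fc1(s))
        x = nn.functional.elu(self.fc2(x))
        x, h = self.lstm(x, h)
        return self.out(x), h

net = DRQNet()
q_values, h = net(torch.zeros(1, 1, 198))   # one state, one time step
```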
Action Augmentation
Random exploration is unsatisfying in financial trading
E.g. $\epsilon$-greedy,
$\pi(a \mid s) = \epsilon/|A| + 1 - \epsilon$ if $a = \arg\max_{a} Q(s, a)$, and $\epsilon/|A|$ otherwise
Action augmentation:
- We can compute the reward signal for all actions
- The successor state only differs by the agent's position
- Enriches the feedback signal to the agent
- Encodes prior knowledge in learning
A novel loss function to incorporate action augmentation; we update the Q-value for all actions,
$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D}\big[ \| r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a')) - Q_\theta(s, a) \|^2 \big]$
26 / 44
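A sketch of what computing the reward for every action in one step might look like, assuming the one-step portfolio log return above and a simple proportional spread cost for changing position; the cost model and the example numbers are assumptions:

```python
import numpy as np

def augmented_rewards(log_return, prev_position, spread_cost, actions=(-1, 0, 1)):
    """One-step reward for every candidate action, not just the action actually taken."""
    rewards = []
    for a in actions:
        pnl = a * log_return                          # position times the market move
        cost = spread_cost * abs(a - prev_position)   # pay the spread when the position changes
        rewards.append(pnl - cost)
    return np.array(rewards)

r_all = augmented_rewards(log_return=0.0004, prev_position=0, spread_cost=0.00008)
```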
Learning Algorithm
Algorithm 1: Financial DRQN
Initialize: $T \in \mathbb{N}$, recurrent Q-network $Q_\theta$, target network $Q_{\theta^-} \leftarrow Q_\theta$, dataset $D$
Simulate env $E$ from dataset $D$
step $\leftarrow$ 1
Observe initial state $s$ from env $E$
for each step do
  step $\leftarrow$ step + 1
  Select the greedy action w.r.t. $Q_\theta(s, a)$ and apply it to env $E$
  Receive reward $r$ and next state $s'$ from env $E$
  Augment actions to form the transition $(s, a, r, s')$ and add it to memory $D$
  if $D$ is filled and step mod $T$ = 0 then
    Sample a sequence of length $T$ from $D$
    Train network $Q_\theta$ with the loss $L(\theta)$
  end if
  Soft update the target network, $\theta^- \leftarrow (1 - \tau)\theta^- + \tau\theta$
end for
29 / 44
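A self-contained toy sketch of the loop in Algorithm 1 (greedy acting, action augmentation, a small replay memory, training only every T steps, soft target updates); a 2-dimensional state, synthetic log returns and a linear Q-function stand in for the real 198-dimensional state and the recurrent network, so this only illustrates the control flow, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, gamma, tau, lr = 96, 480, 0.99, 0.001, 0.00025
actions = (-1, 0, 1)
log_returns = rng.normal(0.0, 1e-3, size=5000)       # made-up 15-minute log returns
W = np.zeros((2, 3))                                  # toy linear Q: Q(s, .) = s @ W
W_target = W.copy()

def augment(s, pos, next_ret, spread=8e-5):
    """One transition per candidate action; successor states differ only in the position entry."""
    return [(s, a, a * next_ret - spread * abs(a - pos), np.array([next_ret, float(a)]))
            for a in actions]

memory, pos = [], 0
for t in range(len(log_returns) - 1):
    s = np.array([log_returns[t], float(pos)])
    a = actions[int(np.argmax(s @ W))]                # greedy action, no epsilon-greedy exploration
    memory.extend(augment(s, pos, log_returns[t + 1]))
    memory = memory[-3 * N:]                          # substantially small replay memory
    if len(memory) == 3 * N and t % T == 0:           # train only every T steps
        for si, ai, ri, sni in memory[-3 * T:]:       # most recent (augmented) length-T slice
            td = ri + gamma * np.max(sni @ W_target) - (si @ W)[actions.index(ai)]
            W[:, actions.index(ai)] += lr * td * si
        W_target = (1 - tau) * W_target + tau * W     # soft update of the target network
    pos = a
```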
Summary
Agent consists of
- LSTM network (model)
- F-DRQN (algorithm)
Agent takes in financial data
Agent actions $\in \{-1, 0, 1\}$
Agent maximizes portfolio value
30 / 44
Numerical Results
31 / 44
The Question to Ask
"Can a single deep reinforcement learning agent, i.e.
- a single network architecture,
- a single learning algorithm,
- a fixed set of hyperparameters,
learn to trade multiple currency pairs?"
If so, we
- move beyond rule-based and prediction-based agents
- achieve end-to-end training of financial trading agents
32 / 44
Hyperparameters for Training and Simulation
We use a substantially smaller replay memory
Initial cash is in the base currency of the pair
Fixed spread unless otherwise stated
Training
- Learning timestep $T$: 96
- Replay memory size $N$: 480
- Learning rate $\alpha$: 0.00025
- Optimizer: Adam
- Discount factor $\gamma$: 0.99
- Target network $\tau$: 0.001
Simulation
- Initial cash: 100,000
- Trade size: 100,000
- Spread (bp): 0.08
- Trading days: 252
34 / 44
Simulation Result with Baseline (figure)
36 / 44
Trading Statistics
We compute trading statistics (win rate, risk-reward¹, correlation², trading frequency) averaged over all symbols; the average win rate is 59.8%
Patterns found in the trading strategies discovered by the agent,
1. The agent favors strategies with a high win rate (about 60%)
2. The agent favors strategies with a lower risk-reward ratio (about 0.75)
3. The agent discovers strategies with low correlation to the baseline
4. The agent makes trading decisions roughly every hour
¹ average profit divided by average loss
² average absolute correlation with the baseline
37 / 44
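For concreteness, the two footnoted statistics could be computed from a list of per-trade profits roughly as follows (the trade list is made up; this is not the authors' evaluation code):

```python
import numpy as np

def win_rate_and_risk_reward(trade_pnl):
    """Win rate = fraction of profitable trades; risk-reward = average profit / average loss."""
    pnl = np.asarray(trade_pnl, dtype=float)
    wins, losses = pnl[pnl > 0], pnl[pnl < 0]
    win_rate = len(wins) / len(pnl)
    risk_reward = wins.mean() / abs(losses.mean())
    return win_rate, risk_reward

# Made-up per-trade profit/loss values
print(win_rate_and_risk_reward([120, -200, 80, 150, -180, 90]))
```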
The Effect of Spread (figure)
38 / 44
The Effect of Spread
Annual return under different spreads (0.08, 0.1, 0.15 and 0.2 bp); at the 0.08 bp spread the annual return is 23.8%
We discover,
1. A wide spread harms performance in general
2. A counter-intuitive fact: a slightly higher spread leads to better overall performance
39 / 44
Effectiveness of Action Augmentation (figure)
40 / 44
Effectiveness of Action Augmentation
We compare action augmentation (AA) with the standard $\epsilon$-greedy policy with $\epsilon = 0.1$
Annual return: 17.4% with $\epsilon$-greedy vs. 23.8% with action augmentation
We discover
1. AA improves performance and lowers variability
2. We gain an additional 6.4% annual return when using AA
41 / 44
Conclusion
42 / 44
Achievements
- We propose an MDP model for financial trading that is easily extendable with more complex state and action spaces
- We propose modifications to the original DRQN algorithm, including a novel action augmentation technique
- We give empirical simulation results for 12 currency pairs under different spread settings
- We discover a counter-intuitive fact: a slightly higher spread leads to better overall performance
43 / 44
Future Directions
Expand the state and action space,
- Macro data, NLP data...
- Adjustable position size, limit orders...
Different financial trading settings, e.g. high-frequency trading
- Different input state and action space
- Different reward function
Distributional reinforcement learning:
$Q(s, a) \overset{D}{=} R(s, a) + \gamma Q(S', A')$
Pick the action with the highest Sharpe ratio,
$a^* = \arg\max_{a \in A} \dfrac{\mathbb{E}[Q]}{\sqrt{\mathrm{Var}[Q]}}$
44 / 44
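A small sketch of the Sharpe-ratio action rule on this last slide, assuming each action's return distribution is represented by a set of samples (for example quantile estimates); the numbers are made up:

```python
import numpy as np

def sharpe_action(q_samples, eps=1e-8):
    """Pick argmax_a E[Q(s,a)] / sqrt(Var[Q(s,a)]) from per-action distribution samples."""
    q_samples = np.asarray(q_samples, dtype=float)    # shape: (n_actions, n_samples)
    sharpe = q_samples.mean(axis=1) / (np.sqrt(q_samples.var(axis=1)) + eps)
    return int(np.argmax(sharpe))

# Made-up distributional estimates for actions {-1, 0, +1}
q = [[-0.2, 0.1, 0.3], [0.0, 0.05, 0.02], [0.1, 0.4, -0.1]]
best = sharpe_action(q)   # index into the action set
```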