A Framework of Hierarchical Deep Q-Network for Portfolio Management

Yuan Gao¹*, Ziming Gao¹*, Yi Hu¹, Sifan Song¹, Zhengyong Jiang¹ and Jionglong Su²
¹Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, P. R. China
²School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University
*These authors contributed equally to this work.
Keywords:
Q-Learning, Hierarchical Reinforcement Learning, Convolutional Neural Network, Portfolio Management.
Abstract:
Reinforcement Learning algorithms and Neural Networks have diverse applications in many domains, e.g.,
stock market prediction, facial recognition and automatic machine translation. The concept of modeling portfolio management through a reinforcement learning formulation is novel, and the Deep Q-Network has been successfully applied to portfolio management recently. However, that model does not take the commission fee for transactions into account. This paper introduces a framework, based on the hierarchical Deep Q-Network, that removes the need for a zero-commission-fee assumption by reducing the number of assets assigned to each
Deep Q-Network and dividing the total portfolio value into smaller parts. Furthermore, this framework is
flexible enough to handle an arbitrary number of assets. In our experiments, the time series of four stocks
for three different time periods are used to assess the efficacy of our model. It is found that our hierarchical
Deep Q-Network based strategy outperforms ten other strategies, including nine traditional strategies and one
reinforcement learning strategy, in profitability as measured by the Cumulative Rate of Return. Moreover, the Sharpe ratio and Maximum Drawdown metrics both demonstrate that the risk of the policy associated with the hierarchical Deep Q-Network is the lowest among all the strategies considered.
1 INTRODUCTION
A profitable stock trading strategy is a process of making decisions that optimize the allocation of capital across different stocks in order to maximize performance measures such as the expected return and the Sharpe ratio (Sharpe, 1994). Traditionally, there exist port-
folio trading strategies which may be broadly clas-
sified into four categories, namely “Follow-the-
Winner”, “Follow-the-Loser”, “Pattern-Matching”,
and “Meta-Learning” (Li and Hoi, 2014). However,
in real financial environments with complex correla-
tions between stocks as well as substantially noisy
data, such traditional portfolio trading strategies tend
to be limited in their usefulness.
To date, several deep machine-learning ap-
proaches have been applied to financial trading (Park
et al., 2019) with varying degrees of success. Nevertheless, many of them predict price movements by feeding historical asset prices into a neural network that outputs predicted asset prices for the next trading period, and the trading agent then takes action based on these predictions (Heaton et al., 2016). The performance of these algorithms is highly dependent on the accuracy with which future market prices are predicted, and it seems inappropriate to convert price predictions into actions because the predictions themselves are not market actions. There-
fore, these approaches are not fully machine learning
based.
More recently, applications of Reinforcement Learning (RL) methods to portfolio management have been proposed, in which the agent is able to trade without predicting future prices (Dempster and Leemans, 2006). Most of these are based on policy-based RL such as Policy Gradient (Jiang et al., 2017), which is suitable for the continuous actions of the stock scenario. However, with appropriate action discretization in stock trading, several value-based RL methods such as Q-Learning have been applied as well.
In (Gao et al., 2020), a Deep Q-Network (N-DQN)
framework combining Q-Learning with a single deep
neural network has been proposed. This framework
allows the N-DQN agent to optimize trading strate-
gies through learning from its experience in the finan-
cial environment, so that this agent may adapt strate-
gies derived from historical data to actual trading.
Nevertheless, the single-DQN structure of the N-DQN is unable to handle an extremely large action space, and large trading units of capital lead to a higher commission fee during the trading process. Therefore, the algorithm in (Gao et al., 2020) makes the unrealistic assumption of a zero commission fee.
In this paper, we address this shortcoming by adopting a hierarchical DQN structure (Kulkarni et al., 2016) to obtain a novel hierarchical Deep Q-Network (H-DQN) which takes the commission fee of transactions into consideration. It successfully decreases the
number of assets assigned to each DQN and reduces
the number of actions. Thus, the problem associated with a high commission fee is solved and the functionality of the algorithm is greatly improved. To the best
of our knowledge, the application of DQN to port-
folio management with the consideration of commis-
sion fee is entirely novel.
Our key contributions are two-fold. First, we reduce the action space by proposing an H-DQN framework so that we no longer need to assume a zero commission fee during transactions. Second, we construct the in-
teracting environment between the framework and the
market as well as the interacting environment inside
the framework, thereby enabling the model to adapt
to real-world trading process.
The rest of this paper is organized as follows. Sec-
tion 2 defines the portfolio management problem in
this research. All the assumptions made in this study
are listed in Section 3. Section 4 gives the network
architecture. Data processing and the interaction process are described in Section 5. Section 6 presents the
training process of the H-DQN model. Experiments
and results are given in Section 7. Finally, Section 8
gives the conclusions and research directions for fu-
ture work.
2 PROBLEM STATEMENT
In portfolio management, we seek the optimal investment policy that gives the maximum overall portfolio value over a given period of time. In practice, the adjusted weights of the portfolio across different assets are based on the prices of the assets and the previous weights of the portfolio. This process can be described as a Markov Decision Process (MDP) (Neuneier, 1998). Essentially, the MDP is a mathematical model used to develop an optimal strategy, which consists of a tuple (S_t, a_t, P_t, R_t). The meaning of each element in the tuple is given as follows:

S_t - the state at time t;
a_t - the action taken at time t;
P_t - the probability of transitioning from state S_t to S_{t+1};
R_t - the reward at time t.
In order to construct the MDP model for portfolio management, we define the state at time t, S_t, to be the prices of the investment products, and the action at time t as:

a_t ≜ w_t    (1)

where w_t is the weight vector of the portfolio at time t. Motivated by the action discretization method of (Gao et al., 2020), we divide the portfolio value equally into N parts, consider these parts as the smallest trading units in portfolio management, and allocate them to the M + 1 assets (including cash), where M is the number of assets. Therefore, we may discretize the action space, and the total number of actions equals \binom{M+N}{M}, which follows from a standard combinatorial (stars and bars) argument. In addition, we define the reward at time t as:

R_t ≜ p_{t+1} - p_t    (2)

where p_t and p_{t+1} are respectively the portfolio values at time t and t + 1.

In (2), we only focus on the reward at the current time t, whereas the state at time t under the given policy π affects all the states after time t, i.e., the value of S_t depends not only on the current reward R_t, but also on the rewards of subsequent time periods. Therefore, with policy π, the value function G_π of state S_t is defined as:

G_π(S_t) ≜ \sum_{k=t}^{T} γ^{k-t} R_k    (3)

where T denotes the last trading period and γ ∈ (0, 1] is the discount factor. In general, it is very complicated to calculate G_π(S_t) using (3), so we compute the expectation of G_π to approximate its true value. Furthermore, since the policy π is determined solely by the action a_t, we define the value function Q_π of S_t and a_t as:

Q_π(S_t, a_t) ≜ E[G_π(S_t)]    (4)
which is the basic principle of Q-Learning. Combining this basic principle with a neural network, we can build a Deep Q-Network (DQN) that can solve problems with an infinite state space, e.g., portfolio management. However, from the above description we find that the number of actions increases both with the number of parts into which we divide the total portfolio and with the number of assets. Moreover, if we decrease the number of parts N in order to reduce the number of actions, the trading unit becomes larger, which may result in a large commission fee under frequent trading. Therefore, we shall reduce the number of assets M handled by each DQN so that the trading unit can be kept small enough to avoid the problem caused by a large commission fee.
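To make the growth of this discretized action space concrete, the short Python sketch below enumerates every weight vector obtained by splitting the portfolio into N equal parts over M + 1 assets (cash included) and checks the count against \binom{M+N}{M}; the function name and the sample values of M and N are illustrative only.

```python
from itertools import product
from math import comb

def discrete_actions(num_stocks_m, num_parts_n):
    """Enumerate all weight vectors w = (n_0, ..., n_M) / N with sum(n_i) = N.

    n_i is the number of equal portfolio parts allocated to asset i
    (index 0 is cash), so every such vector is one discrete action.
    """
    actions = []
    for alloc in product(range(num_parts_n + 1), repeat=num_stocks_m + 1):
        if sum(alloc) == num_parts_n:
            actions.append(tuple(n / num_parts_n for n in alloc))
    return actions

for m, n in [(2, 10), (4, 10), (4, 20)]:
    acts = discrete_actions(m, n)
    assert len(acts) == comb(m + n, m)        # stars-and-bars count
    print(f"M={m} stocks + cash, N={n} parts -> {len(acts)} actions")
```

With only two stocks per DQN the action space stays in the tens, whereas adding assets or parts quickly pushes it into the thousands, which is precisely the motivation for the hierarchical split.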
3 ASSUMPTIONS
Before introducing the model, we shall make the fol-
lowing simplifying assumptions:
1. The action taken by the agent will not affect the
financial market such as the price and volume.
2. Other than the commission fee deducted during the transition process, the portfolio value remains unchanged between the end of the previous trading period and the beginning of the next trading period.
3. The agent cannot invest money in assets other than the selected ones.
4. The volume of each stock is large enough for the agent to buy or sell it on any trading day.
5. The transition process is short enough that the time it takes may be ignored.
4 NETWORK ARCHITECTURE
In this section, we shall introduce the architecture of
H-DQN (Fig.7) that can handle an arbitrary number
of assets, e.g., stocks, cash, etc. As mentioned in Section 2, to solve the problem associated with the commission fee, the total portfolio value should be divided into smaller parts, which leads to a large action space. Therefore, we decrease the number of assets assigned to each DQN. To achieve this, we consider several independent DQNs in the system. Each DQN has an identical structure and is responsible for three assets (cash and two stocks), so that the number of assets each DQN handles is smaller than that of a traditional DQN dealing with a large number of assets. Next, we define a controller, which is itself a DQN whose structure differs from that of the DQNs interacting with the market. Since the controller is also a DQN whose action space should be limited as well, we assign two DQNs to each controller. Furthermore, for each controller, there is a controller of a higher level to control it. Consequently, we obtain the general structure of the H-DQN.
From now on, we shall focus on a structural unit of
H-DQN (Fig.3) since the network topology and trad-
ing process of the general structure can be general-
ized from its structural unit. In Fig.3, we see that the
controller divides the total portfolio into three parts,
i.e., cash, the portfolio managed by DQN1 and the
portfolio managed by DQN2. As such, the DQNs on
the lower level receive the portfolio for investing in
the market. The specific trading process will be introduced in Section 6.
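As a rough illustration of this recursive composition (not the authors' code), the Python sketch below builds one structural unit from a controller and two bottom-level agents; the `Controller` and `LeafDQN` classes and their placeholder policies are assumptions standing in for the trained networks described in the rest of this section.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class LeafDQN:
    """Bottom-level agent: allocates its slice of capital over (cash, stock A, stock B)."""
    stocks: List[str]

    def act(self, state) -> List[float]:
        # Placeholder policy: keep everything in cash until the network is trained.
        return [1.0, 0.0, 0.0]

@dataclass
class Controller:
    """Higher-level agent: allocates its slice over (cash, child 1, child 2)."""
    children: List[Union["Controller", LeafDQN]]

    def act(self, state) -> List[float]:
        # Placeholder policy: split evenly between cash and the two children.
        return [1.0 / 3, 1.0 / 3, 1.0 / 3]

# One structural unit: a controller over two leaf DQNs, covering four stocks plus cash.
unit = Controller(children=[LeafDQN(["600260", "600261"]),
                            LeafDQN(["600262", "600266"])])
print(unit.act(state=None))
```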
The network topologies of the DQNs and controllers are inspired by (Jiang et al., 2017). In our network, we change the structure of the dense layers in (Jiang et al., 2017) and make it a dueling Q-net. We shall first describe the topology of the DQNs, as shown in Fig. 1.

Figure 1: Structure of DQNs.
Figure 2: Structure of controllers.
(1) Input and Output: The input to each DQN is the state S_t of the last trading period, which is defined by

S_t ≜ (P_t, w_{t-1}),
s.t. w_{t-1} = (w_{t-1,0}, w_{t-1,1}, w_{t-1,2}),
P_t = (P^o_t, P^c_t, P^h_t, P^l_t)    (5)

where w_{t-1,i}, i = 0, 1, 2, denotes the proportion of the portfolio value assigned to this DQN that is invested in the i-th stock at the beginning of the previous trading period.
Figure 3: Architecture of structural unit in H-DQN.
Moreover, we initialize w_0 (the weight vector at the beginning of the episode) as

w_0 = (1, 0, 0)    (6)

i.e., all the portfolio value is in the form of cash. P_t is the price tensor of the previous N trading days (defined in Section 5.1). Note that the number of stocks M equals 2 here, because each DQN only holds 2 stocks in our hierarchical model, as shown in Fig. 1. The output of the DQN is w_t, the weight vector at the beginning of the next trading period.
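For concreteness, the small NumPy sketch below assembles the state S_t = (P_t, w_{t-1}) of Eqs. (5)-(6) for one bottom DQN, using random numbers in place of real normalized prices; the array shapes follow the text, while the variable names are illustrative.

```python
import numpy as np

N_DAYS = 10        # look-back window N of the price tensor
NUM_STOCKS = 2     # each bottom DQN manages two stocks plus cash

# Eq. (6): initial weight vector w_0 = (1, 0, 0), i.e. everything held as cash.
w_prev = np.array([1.0, 0.0, 0.0])

# Price tensor P_t with shape (2, 10, 4): 2 stocks x 10 days x 4 features
# (opening, closing, highest, lowest); random values stand in for normalized prices.
price_tensor = np.random.rand(NUM_STOCKS, N_DAYS, 4).astype(np.float32)

# The per-DQN state S_t = (P_t, w_{t-1}) of Eq. (5).
state = (price_tensor, w_prev)
```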
(2) 1st and 2nd CNN Layers: As shown in Fig. 1, the first CNN layer receives the price tensor P_t with dimension (2, 10, 4). The filters of this layer are of size 1 × 3, and the activation function we use here is SELU, as defined in (Klambauer et al., 2017). In this layer we obtain 32 feature maps, each of size 2 × 5, and these feature maps are passed to the next CNN layer. In the second CNN layer, the filters are of size 1 × 5 and 64 feature maps are produced.
(3) Weight Insertion and 3rd CNN Layer: In the layers mentioned above, we only extract features from the price tensor; the weight vector w_{t-1} has not yet been used. After obtaining the 64 feature maps from the second CNN layer, we insert the weight vector w_{t-1} (with the weight of cash removed) into these feature maps and produce a tensor with dimension (2, 1, 65). In the third CNN layer, the filter size is 1 × 1 and 128 feature maps are produced, following which the 128 feature maps are flattened and a cash bias (Jiang et al., 2017) (a 1 × 1 tensor with value 1) is added onto the flattened feature map.
(4) Dense Layers: Every neuron in the first dense layer receives the flattened features and connects with each neuron in the next dense layer. In the second dense layer, we use the dueling Q-net structure (Wang et al., 2016), which consists of a state layer and an action layer. With this structure, the value of the state and the values of the actions may be estimated separately. After the Q-value of the state, Q_s, and the Q-value of each action, Q_a, are obtained, we compute the final Q-value of each action by

Q(S_t, a) = Q_s + (Q_a - E[Q_a])    (7)

where E[Q_a] is the expectation of the Q-values of the actions.
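A minimal PyTorch sketch of a network with this shape is given below. The exact strides, paddings and dense-layer widths are not fully specified in the text, so the settings here (e.g., stride 2 with padding 1 in the first convolution to reproduce the stated 2 × 5 feature maps, and a 128-unit hidden layer) are assumptions; the sketch illustrates the topology rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class DuelingPriceDQN(nn.Module):
    """Sketch of one bottom-level DQN: price tensor + previous weights -> Q-values."""

    def __init__(self, num_actions: int):
        super().__init__()
        # The 4 price features (open/close/high/low) act as input channels over a
        # 2 (assets) x 10 (days) grid; stride/padding are chosen so the first layer
        # yields 2x5 maps, as stated in the text (an assumption).
        self.conv1 = nn.Conv2d(4, 32, kernel_size=(1, 3), stride=(1, 2), padding=(0, 1))
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(1, 5))
        self.conv3 = nn.Conv2d(64 + 1, 128, kernel_size=(1, 1))  # +1 channel for weights
        self.act = nn.SELU()
        flat = 128 * 2 * 1 + 1                                   # flattened maps + cash bias
        self.hidden = nn.Linear(flat, 128)
        self.value = nn.Linear(128, 1)                           # dueling: state value
        self.advantage = nn.Linear(128, num_actions)             # dueling: action advantages

    def forward(self, prices: torch.Tensor, prev_weights: torch.Tensor) -> torch.Tensor:
        # prices: (B, 4, 2, 10); prev_weights: (B, 2) with the cash weight removed.
        x = self.act(self.conv1(prices))                         # (B, 32, 2, 5)
        x = self.act(self.conv2(x))                              # (B, 64, 2, 1)
        w = prev_weights.view(-1, 1, 2, 1)                       # insert weights as a channel
        x = self.act(self.conv3(torch.cat([x, w], dim=1)))       # (B, 128, 2, 1)
        x = torch.flatten(x, start_dim=1)
        cash_bias = torch.ones(x.size(0), 1, device=x.device)
        x = self.act(self.hidden(torch.cat([x, cash_bias], dim=1)))
        v, a = self.value(x), self.advantage(x)
        return v + (a - a.mean(dim=1, keepdim=True))             # Eq. (7)

q_net = DuelingPriceDQN(num_actions=66)
q_values = q_net(torch.rand(8, 4, 2, 10), torch.rand(8, 2))
print(q_values.shape)   # torch.Size([8, 66])
```

The output has one entry per discretized action; a controller network would differ mainly in how the weight vector is concatenated after flattening, as described next.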
The structure of the controller is similar to that of the DQNs, as shown in Fig. 2. The main difference lies in the convolutional layers. Since the price tensor received here has dimension M × N × 4 while the weight is still a 2 × 1 vector (with the first element removed), the first dimensions may not match, so we cannot insert the weight into the feature maps. Therefore, we flatten the feature maps after the first and second convolutional layers and concatenate the weight to them. Finally, we input the flattened feature map into the remaining network, which is also a dueling Q-net.
5 MATHEMATICAL
FORMALISM
5.1 Data Processing
Considering that the raw price data cannot be received
by the network directly, we need to process the data
and transform it to a ’tensor’ structure.
For the price tensor P_t, it is converted from the original price tensor P̃_t consisting of P^o_t, P^c_t, P^h_t and P^l_t, which are the normalized price matrices of the opening, closing, highest and lowest prices, as denoted below:

P^o_t = [ p^o_{t-n+1} ⊘ p^c_t | . . . | p^o_t ⊘ p^c_t ]    (8)
P^c_t = [ p^c_{t-n+1} ⊘ p^c_t | . . . | p^c_t ⊘ p^c_t ]    (9)
P^h_t = [ p^h_{t-n+1} ⊘ p^c_t | . . . | p^h_t ⊘ p^c_t ]    (10)
P^l_t = [ p^l_{t-n+1} ⊘ p^c_t | . . . | p^l_t ⊘ p^c_t ]    (11)

where ⊘ denotes elementwise division. In addition, p^o_t, p^c_t, p^h_t and p^l_t represent the price vectors of the opening, closing, highest and lowest prices of all assets in trading period t, respectively. In other words, their i-th elements, p^o_{t,i}, p^c_{t,i}, p^h_{t,i} and p^l_{t,i}, are relative technical indicators of the i-th asset in the t-th period. Therefore, if there are M assets (excluding cash) in the portfolio, the original price tensor P̃_t is an (M, N, 4)-dimensional tensor, as illustrated in Fig. 4.

Figure 4: Original price tensor P̃_t.
We note that simply normalizing the original price tensor may trigger a recognition problem, i.e., the features of the normalized original price tensors are so similar that the network may not be able to distinguish between them and determine which action should be taken. In view of this, each element in P̃_t is reduced by 1 and multiplied by an expansion coefficient α to enhance the features of the price tensor. Therefore, the final price tensor is defined as:

P_t ≜ α (P̃_t - 1)    (12)

where 1 is a tensor of dimension (M, N, 4) whose elements are all 1's.
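Under the assumption that the raw open/close/high/low prices for one DQN are already stacked into an (M, N, 4) array, the normalization of (8)-(12) reduces to the NumPy sketch below; the value of the expansion coefficient α is not given in the text, so the number used here is only a placeholder.

```python
import numpy as np

def build_price_tensor(ohlc_window: np.ndarray, alpha: float = 10.0) -> np.ndarray:
    """Build the final price tensor P_t = alpha * (P~_t - 1) of Eq. (12).

    ohlc_window: raw prices with shape (M, N, 4), where M is the number of stocks,
    N is the look-back window, and the last axis holds (open, close, high, low).
    """
    latest_close = ohlc_window[:, -1, 1].reshape(-1, 1, 1)   # p^c_t of each asset
    normalized = ohlc_window / latest_close                  # Eqs. (8)-(11), elementwise
    return alpha * (normalized - 1.0)                        # Eq. (12)

# Example: 2 stocks, a 10-day window, random positive prices as stand-ins.
raw = np.random.uniform(9.0, 11.0, size=(2, 10, 4))
print(build_price_tensor(raw).shape)   # (2, 10, 4)
```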
Because the controllers and the DQNs (Fig. 3) have a similar structure, we may process the data received by the DQNs using the same method discussed in this section. The price tensor and weight vector of the bottom DQNs are of size (2, N, 4) and 1 × 3 respectively, since each bottom DQN handles only 2 stocks and cash.
5.2 Interaction with Environment
As mentioned in Section 4, the structure of the system is hierarchical, consisting of controllers and bottom DQNs. Since the controllers have similar properties, we shall introduce the interaction process of one controller and its corresponding DQNs (Fig. 3) rather than the interaction process of the whole system. The interaction process is shown in Fig. 5.

Figure 5: Transaction process.
The interaction process starts with the controller receiving the state S_t as defined in (5) and giving the weight vector w_t, which represents the proportions of the portfolio kept as cash and assigned to the bottom DQNs. This relocation of the portfolio incurs a commission fee c_1 defined as:

c_1 = d y'_{t-1} \sum_{i=1}^{2} | w_{t,i} - w'_{t-1,i} |    (13)

where d is the commission rate, and y'_{t-1} and w'_{t-1} are respectively the portfolio value and the weight vector at the end of trading period t - 1. We omit the weight of cash here because changes in the cash position do not incur a commission fee.
Next, the DQNs on the lower level receive the states S_{t,1} and S_{t,2} respectively and output w_{1,t} and w_{2,t}. The commission fee of this relocation of the portfolio is given by

c_2 = d (y'_{t-1} - c_1) ( w_{t,1} \sum_{i=1}^{2} | w_{1,t,i} - w'_{1,t-1,i} | + w_{t,2} \sum_{i=1}^{2} | w_{2,t,i} - w'_{2,t-1,i} | )    (14)

So far, we obtain the portfolio value at the beginning of the next trading period as:

y_t = y'_{t-1} - c_1 - c_2    (15)
The portfolio values held by DQN1 and DQN2 are given respectively by

y_{1,t} = w_{t,1} (y'_{t-1} - c_1) ( 1 - d \sum_{i=1}^{2} | w_{1,t,i} - w'_{1,t-1,i} | )
y_{2,t} = w_{t,2} (y'_{t-1} - c_1) ( 1 - d \sum_{i=1}^{2} | w_{2,t,i} - w'_{2,t-1,i} | )    (16)

and the weight vectors of the DQNs at the beginning of the next trading period are w_{1,t} and w_{2,t}.
However, the weight vector of the controller at the beginning of the next trading period is not simply given by w_t, since the portfolio value assigned to the DQNs changes after interacting with the market, i.e., the commission fee is deducted. Therefore, the weight vector of the controller at the beginning of the next trading period is the proportion of the portfolio value after the transition, given by

w_t = ( w_{t,0} (y'_{t-1} - c_1) / y_t ,  y_{1,t} / y_t ,  y_{2,t} / y_t )    (17)
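The bookkeeping in (13)-(17) can be collected into one function, as in the hedged NumPy sketch below; weight vectors are ordered (cash, child 1, child 2) for the controller and (cash, stock 1, stock 2) for the DQNs, and all names are illustrative.

```python
import numpy as np

def rebalance(y_prev_end, w_ctrl_prev_end, w_ctrl, w1_prev_end, w1, w2_prev_end, w2, d=0.0025):
    """Apply Eqs. (13)-(17): deduct commission for the controller and both DQNs.

    y_prev_end      -- portfolio value y'_{t-1} at the end of the previous period
    w_ctrl_prev_end -- controller weights (cash, DQN1, DQN2) at that time
    w_ctrl          -- new controller weights w_t chosen at the start of period t
    w1_prev_end, w1, w2_prev_end, w2 -- DQN weight vectors (cash, stock 1, stock 2)
    d               -- commission rate
    """
    # Eq. (13): fee for moving value between cash, DQN1 and DQN2 (cash weight excluded).
    c1 = d * y_prev_end * np.abs(w_ctrl[1:] - w_ctrl_prev_end[1:]).sum()
    # Eq. (14): fee for the reallocation inside each DQN, weighted by its share.
    c2 = d * (y_prev_end - c1) * (
        w_ctrl[1] * np.abs(w1[1:] - w1_prev_end[1:]).sum()
        + w_ctrl[2] * np.abs(w2[1:] - w2_prev_end[1:]).sum()
    )
    y_t = y_prev_end - c1 - c2                                              # Eq. (15)
    # Eq. (16): value handed to each DQN after its own trading fee.
    y1 = w_ctrl[1] * (y_prev_end - c1) * (1 - d * np.abs(w1[1:] - w1_prev_end[1:]).sum())
    y2 = w_ctrl[2] * (y_prev_end - c1) * (1 - d * np.abs(w2[1:] - w2_prev_end[1:]).sum())
    # Eq. (17): controller weights restated as proportions of the post-fee value y_t.
    w_ctrl_adj = np.array([w_ctrl[0] * (y_prev_end - c1) / y_t, y1 / y_t, y2 / y_t])
    return c1, c2, y_t, y1, y2, w_ctrl_adj
```

Consistent with (15) and (16), the cash kept by the controller plus the two post-fee DQN values sums to y_t.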
Then, during the next trading period, the price relative vectors of the assets assigned to DQN1 and DQN2 are given by

µ_{1,t} = p^o_{1,t+1} ⊘ p^o_{1,t} = ( p^o_{1,t+1,1} / p^o_{1,t,1} , p^o_{1,t+1,2} / p^o_{1,t,2} , . . . , p^o_{1,t+1,m} / p^o_{1,t,m} )
µ_{2,t} = p^o_{2,t+1} ⊘ p^o_{2,t} = ( p^o_{2,t+1,1} / p^o_{2,t,1} , p^o_{2,t+1,2} / p^o_{2,t,2} , . . . , p^o_{2,t+1,m} / p^o_{2,t,m} )    (18)
Here, we define the time of the opening quotation as the demarcation point between the previous and the next trading period, i.e., the previous trading period is before this point and the next trading period is after it, and by Assumption 5 in Section 3, all the transition activities take place in a very short period of time after the opening quotation. Moreover, since the total portfolio value includes cash, we need to add the cash price to µ_{1,t} and µ_{2,t}. Considering that the cash price remains unchanged, µ_{1,t} and µ_{2,t} take the following form:

µ_{1,t} = ( 1, p^o_{1,t+1,1} / p^o_{1,t,1} , p^o_{1,t+1,2} / p^o_{1,t,2} , . . . , p^o_{1,t+1,m} / p^o_{1,t,m} )
µ_{2,t} = ( 1, p^o_{2,t+1,1} / p^o_{2,t,1} , p^o_{2,t+1,2} / p^o_{2,t,2} , . . . , p^o_{2,t+1,m} / p^o_{2,t,m} )    (19)
and the portfolio values of DQN1 and DQN2 at the end of the next trading period are given by

y'_{1,t} = y_{1,t} w_{1,t} µ_{1,t}^T
y'_{2,t} = y_{2,t} w_{2,t} µ_{2,t}^T    (20)

Therefore, the total portfolio value is

y'_t = w_{t,0} (y'_{t-1} - c_1) + y'_{1,t} + y'_{2,t}    (21)
Using the portfolio values at the end of the next and the previous trading periods, we may calculate the reward of the controller and the rewards of the DQNs as

r_t = log ( y'_t / y'_{t-1} )
r_{1,t} = log ( y'_{1,t} / y'_{1,t-1} )
r_{2,t} = log ( y'_{2,t} / y'_{2,t-1} )    (22)
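To close the loop, the next sketch applies the price movements of (18)-(22): cash has a price relative of 1, each DQN's value grows by the inner product of its weights with the price-relative vector, and the rewards are log returns. Variable names are again illustrative, not the authors' code.

```python
import numpy as np

def step_market(y_cash, y1, y2, w1, w2, open_t, open_next, y_prev_total, y1_prev, y2_prev):
    """Apply Eqs. (18)-(22) for one trading period.

    y_cash            -- value kept as cash by the controller, w_{t,0} * (y'_{t-1} - c_1)
    y1, y2            -- values handed to DQN1 and DQN2 at the start of the period
    w1, w2            -- their weight vectors (cash, stock 1, stock 2)
    open_t, open_next -- opening prices of the four stocks at t and t+1, shape (2, 2)
                         (row 0 for DQN1's stocks, row 1 for DQN2's stocks)
    y_prev_total, y1_prev, y2_prev -- end-of-previous-period values, used for the rewards
    """
    # Eqs. (18)-(19): price relatives, with 1 prepended for cash.
    mu1 = np.concatenate(([1.0], open_next[0] / open_t[0]))
    mu2 = np.concatenate(([1.0], open_next[1] / open_t[1]))
    y1_end = y1 * np.dot(w1, mu1)            # Eq. (20)
    y2_end = y2 * np.dot(w2, mu2)
    y_total_end = y_cash + y1_end + y2_end   # Eq. (21)
    rewards = (np.log(y_total_end / y_prev_total),   # Eq. (22): controller reward
               np.log(y1_end / y1_prev),             # DQN1 reward
               np.log(y2_end / y2_prev))             # DQN2 reward
    return y_total_end, y1_end, y2_end, rewards
```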
6 TRAINING PROCESS
Since the controllers and DQNs are all essentially Deep Q-Networks, their parameters may be updated by the general training process of a DQN. For this model, we apply the training process that is widely used for the Double Deep Q-Network (Van Hasselt et al., 2015). Since the basic principle of a DQN is to approximate the real Q-function, there are two Deep Q-Networks, i.e., the evaluation network Q_eval and the target network Q_target, which have exactly the same structure but different parameters. The parameters of Q_eval are continuously updated, while the parameters of Q_target are fixed until they are replaced by the parameters of Q_eval.
Algorithm 1: Training process.
Input: Batch size N, Target network Q_θ'
Input: Estimation network Q_θ, Target vector v_tar
Input: Estimation vector v_est, Real value vector v_real
1: for i = 1 to N do
2:   Take sample (S_{t_i}, a_{t_i}, r_{t_i}, S_{t_i+1})
3:   v_tar(i) = Q_θ'(S_{t_i+1}, argmax_a(Q_θ(S_{t_i+1}, a)))
4:   v_est(i) = Q_θ(S_{t_i}, argmax_a(Q_θ(S_{t_i}, a)))
5:   if S_{t_i+1} is terminal then
6:     v_real(i) = r_{t_i}
7:   else
8:     v_real(i) = r_{t_i} + γ v_tar(i), γ ∈ (0, 1]
9:   end if
10: end for
11: Do a gradient descent step with ||v_real - v_est||^2
12: Replace the parameters of the target network: θ' ← θ
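A hedged PyTorch rendering of one training step of Algorithm 1 is given below; `q_net` and `q_target` stand for any pair of identically shaped Q-networks mapping a state batch to Q-values, `batch` is assumed to come from a replay buffer, and, as a simplification of line 4, the online network is evaluated at the sampled action rather than at its own greedy action.

```python
import torch
import torch.nn.functional as F

def double_dqn_update(q_net, q_target, optimizer, batch, gamma=0.99):
    """One gradient step of Algorithm 1 (Double DQN target, squared-error loss)."""
    # states/next_states: float tensors; actions: LongTensor of action indices;
    # rewards: float tensor; terminal: float tensor of 0/1 flags.
    states, actions, rewards, next_states, terminal = batch

    with torch.no_grad():
        # Line 3: evaluate the target network at the action that the online network
        # considers greedy in the next state.
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        v_tar = q_target(next_states).gather(1, next_actions).squeeze(1)
        # Lines 5-9: bootstrapped "real" value; no bootstrapping on terminal states.
        v_real = rewards + gamma * v_tar * (1.0 - terminal)

    # Line 4 (simplified): the online network's estimate for the sampled actions.
    v_est = q_net(states).gather(1, actions.view(-1, 1)).squeeze(1)

    # Line 11: gradient descent step on the squared error.
    loss = F.mse_loss(v_est, v_real)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Line 12, run every few updates: copy the online parameters into the target network.
# q_target.load_state_dict(q_net.state_dict())
```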
7 EXPERIMENT
7.1 Experimental Setting
In our experiment, four low-correlation stocks in the Chinese A-share market, whose codes are 600260, 600261, 600262 and 600266, are chosen as risky assets (downloaded from tushare). Combined with cash as the risk-free asset, there are a total of five investment assets to be managed.
In order to increase the difference between price
tensors, we set the trading period as two days.
Meanwhile, we set 2011/1/1-2012/12/31, 2014/1/1-2015/12/31 and 2015/1/1-2016/12/31 as the training periods, and 2013/1/14-2013/12/19, 2016/1/14-2016/12/19 and 2017/1/13-2017/12/18 as the back-test intervals. The same hyperparameters were
used in the above three experiments.
7.2 Performance Metrics
This section presents three different financial metrics
for evaluating the performance of trading strategies.
The first metric is the cumulative rate of return (CRR)
(Jiang et al., 2017), defined as:
CRR = exp ( \sum_{t=1}^{T} r_t ) - 1    (23)

where T is the total number of trading periods and r_t is the reward in the t-th period as defined in (22). This metric
may be used to evaluate the profitability of strategies
directly.
Figure 6: Trading performance in three back-test intervals.

The second metric, the Sharpe ratio (SR), is mainly used to assess the risk-adjusted return of strategies (Sharpe, 1994). It is defined as:

SR = E[ρ_t - ρ_RF] / \sqrt{Var(ρ_t - ρ_RF)}    (24)

where E and Var denote the expectation and variance respectively. The notation ρ_t is the rate of return defined as:

ρ_t ≜ (y'_t / y'_{t-1}) - 1    (25)

Here, y'_t and y'_{t-1} are the portfolio values at the end of the t-th and (t-1)-th periods. The parameter ρ_RF represents the rate of return of the risk-free asset.
To evaluate an investment strategy’s risk tolerance, we introduce the Maximum Drawdown (MDD) (Magdon-Ismail and Atiya, 2004) as the third metric. The formula of the MDD is

MDD = \max_{β > t} (y_t - y_β) / y_t    (26)

This metric denotes the maximum portfolio value loss from a peak to a subsequent trough, so it measures the largest possible loss of a strategy.
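For reference, all three metrics can be computed from a back-test's end-of-period portfolio values as in the sketch below; the risk-free return is set to zero for simplicity, which is an assumption rather than something stated in the text.

```python
import numpy as np

def evaluate(portfolio_values, risk_free_rate=0.0):
    """Compute CRR (Eq. 23), SR (Eq. 24) and MDD (Eq. 26) from end-of-period values."""
    values = np.asarray(portfolio_values, dtype=float)
    rewards = np.log(values[1:] / values[:-1])             # per-period log returns r_t
    crr = np.exp(rewards.sum()) - 1.0                      # cumulative rate of return

    returns = values[1:] / values[:-1] - 1.0               # rate of return rho_t (Eq. 25)
    excess = returns - risk_free_rate
    sharpe = excess.mean() / excess.std()                  # Sharpe ratio

    running_peak = np.maximum.accumulate(values)
    mdd = np.max((running_peak - values) / running_peak)   # maximum drawdown
    return crr, sharpe, mdd

print(evaluate([1.00, 1.02, 0.99, 1.05, 1.01]))
```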
7.3 Result and Analysis
The performance of our trading strategy is compared with the strategies listed below (each strategy is tested with a commission rate of 0.25%). These strategies can be categorized into two types:
Type I. Traditional trading strategies
Robust Median Reversion (RMR) (Huang et al.,
2012)
Uniform Buy and Hold (BAH) (Li and Hoi, 2014)
Universal Portfolios (UP) (Cover, 2011)
Exponential Gradient (EG) (Helmbold et al.,
1998)
Online Newton Step (ONS) (Agarwal et al., 2006)
Anticor (ANTICOR) (Borodin et al., 2004)
Passive Aggressive Mean Reversion (PAMR) (Li
et al., 2012)
Online Moving Average Reversion (OLMAR) (Li
et al., 2015)
Confidence Weighted Mean Reversion (CWMR)
(Li et al., 2013)
Type II. Reinforcement learning trading strategy
Single DQN (N-DQN), an approach combining
Q-learning with a single deep neural network with
commission fee taken into consideration (com-
mission rate of 0.25%) (Gao et al., 2020)
The first dataset gives the cumulative return over the investment horizon of the test period, as learning continues from 2013/01/14 to 2013/12/19. Overall, the H-DQN strategy outperforms all other strategies, in contrast to the N-DQN strategy, which does not perform as well as benchmarks such as OLMAR and ANTICOR. Although the advantage of the H-DQN strategy is not apparent at the beginning, the disparity between the H-DQN strategy and the other strategies becomes obvious, especially in the last third of the trading period. Compared to the N-DQN strategy, which fluctuates over the time periods, the H-DQN strategy tends to increase over the test trading period, and its cumulative return is almost always above the initial cashflow.
Similarly, in the second dataset test, which runs from 2016/01/14 to 2016/12/19, the cumulative return shows that the H-DQN strategy has the best performance among all the strategies, while N-DQN shows no outstanding performance. In addition, the H-DQN strategy outperforms the other strategies for most of the time periods, and the difference between the H-DQN strategy and the other benchmarks is significant in the latter half of the time periods. Compared to the total portfolio value of the other strategies, which remains at a low level, the total portfolio value of the H-DQN strategy tends to increase over the test trading period, and its cumulative return is clearly higher than the others for most of the time periods.
The third dataset gives the cumulative return over the test period from 2017/01/05 to 2017/11/17. Again, the H-DQN strategy outperforms the benchmark strategies for the majority of the test trading period, and begins to perform very well halfway through the trading period. Compared to the N-DQN strategy, which shows a moderate increase, the H-DQN strategy tends to increase over the test trading period, and its overall gain is much higher than that of N-DQN.
Figure 7: General architecture of H-DQN.
Table 1: Average Performance of Eleven Strategies.
Strategy CRR SR MDD
RMR -12.84% -4.00% 22.98%
BAH 3.97% 1.97% 17.78%
UP -9.12% -2.35% 24.38%
EG -1.93% -0.22% 17.43%
ONS -8.16% -1.48% 26.16%
ANTICOR -2.98% -2.17% 19.89%
PAMR -9.83% -4.11% 23.04%
OLMAR -9.55% -5.26% 24.34%
CWMR -3.84% -2.32% 20.36%
N-DQN 8.19% 2.72% 22.72%
H-DQN 44.37% 10.33% 11.69%
Table 1 gives the average performance of the hierarchical DQN strategy and the benchmarks over the three test sets based on the metrics CRR, SR and MDD. The numerical results of the H-DQN strategy are the best in all aspects. In the case of CRR, the result of the H-DQN strategy (44.37%) exceeds the second highest, the N-DQN benchmark (8.19%), by about 36 percentage points. As for the risk measure, the H-DQN strategy still gives the best performance with the minimum MDD (11.69%), compared with the next best, the EG benchmark (17.43%). For SR, the H-DQN strategy (10.33%) outperforms the next best performing strategy, the N-DQN benchmark (2.72%). It should be noted that N-DQN is also a reinforcement learning algorithm, but without the hierarchical architecture its performance is much worse than that of H-DQN.
Overall, the results in all three back-test intervals
demonstrate the good profitability and adaptability of the H-DQN framework in comparison to N-DQN and all the other traditional strategies.
8 CONCLUSIONS AND FUTURE
WORK
In this paper, we construct a hierarchical reinforcement learning framework for portfolio management that can handle an arbitrary number of assets and takes the commission fee into consideration. Four stocks are selected as our experimental data, and the Cumulative Rate of Return, Sharpe ratio and Maximum Drawdown are used to compare the profitability and risk of our model in the back-test intervals against nine traditional strategies as well as the single-DQN strategy. The results show that this hierarchical reinforcement learning algorithm outperforms all the other ten strategies, and it is also the least risky investment method in our back-test intervals.
However, there are three major limitations. First, since a controller on a higher level needs to manage more controllers, it is more difficult to train. Second, we assume that the volumes of the stocks are large enough that each stock can be traded on any trading day. However, a stock might sometimes not be available, which will impact the profit. Finally, regarding generalization, our strategy is vulnerable to even a small mismatch between the learning environment and the testing environment.
For future work, considering that deep reinforcement learning algorithms are highly sensitive to noise in the data, we may use traditional approaches to reduce financial data noise, e.g., wavelet analysis (Rua
and Nunes, 2009) and the Kalman Filter (Faragher,
2012). Moreover, as the number of assets increases,
there will be more controllers and DQNs in the hierar-
chical structure (shown in Fig.7), which may require a
long training period. To address this, we will look into
proposing new training methods that may improve the
efficiency in training the network.
REFERENCES
Agarwal, A., Hazan, E., Kale, S., and Schapire, R. E.
(2006). Algorithms for portfolio management based
on the newton method. In Proceedings of the 23rd
international conference on Machine learning, pages
9–16.
Borodin, A., El-Yaniv, R., and Gogan, V. (2004). Can we
learn to beat the best stock. In Advances in Neural
Information Processing Systems, pages 345–352.
Cover, T. M. (2011). Universal portfolios. In The Kelly Cap-
ital Growth Investment Criterion: Theory and Prac-
tice, pages 181–209. World Scientific.
Dempster, M. A. and Leemans, V. (2006). An automated fx
trading system using adaptive reinforcement learning.
Expert Systems with Applications, 30(3):543–552.
Faragher, R. (2012). Understanding the basis of the kalman
filter via a simple and intuitive derivation [lecture
notes]. IEEE Signal processing magazine, 29(5):128–
132.
Gao, Z., Gao, Y., Hu, Y., Jiang, Z., and Su, J. (2020). Appli-
cation of deep q-network in portfolio management. In
2020 5th IEEE International Conference on Big Data
Analytics (ICBDA), pages 268–275. IEEE.
Heaton, J., Polson, N. G., and Witte, J. H. (2016). Deep
learning in finance. arXiv preprint arXiv:1602.06561.
Helmbold, D. P., Schapire, R. E., Singer, Y., and Warmuth,
M. K. (1998). On-line portfolio selection using mul-
tiplicative updates. Mathematical Finance, 8(4):325–
347.
Huang, D., Zhou, J., Li, B., Hoi, S., and Zhou, S. (2012). Robust median reversion strategy for on-line portfolio selection. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI 2013), Beijing, 3-9 August 2013.
Jiang, Z., Xu, D., and Liang, J. (2017). A deep re-
inforcement learning framework for the financial
portfolio management problem. arXiv preprint
arXiv:1706.10059.
Klambauer, G., Unterthiner, T., Mayr, A., and Hochre-
iter, S. (2017). Self-normalizing neural networks. In
Advances in neural information processing systems,
pages 971–980.
Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenen-
baum, J. (2016). Hierarchical deep reinforcement
learning: Integrating temporal abstraction and intrin-
sic motivation. In Advances in neural information pro-
cessing systems, pages 3675–3683.
Li, B. and Hoi, S. C. (2014). Online portfolio selection: A
survey. ACM Computing Surveys (CSUR), 46(3):1–
36.
Li, B., Hoi, S. C., Sahoo, D., and Liu, Z.-Y. (2015). Moving
average reversion strategy for on-line portfolio selec-
tion. Artificial Intelligence, 222:104–123.
Li, B., Hoi, S. C., Zhao, P., and Gopalkrishnan, V. (2013).
Confidence weighted mean reversion strategy for on-
line portfolio selection. ACM Transactions on Knowl-
edge Discovery from Data (TKDD), 7(1):1–38.
Li, B., Zhao, P., Hoi, S. C., and Gopalkrishnan, V. (2012).
Pamr: Passive aggressive mean reversion strategy for
portfolio selection. Machine learning, 87(2):221–258.
Magdon-Ismail, M. and Atiya, A. F. (2004). Maximum
drawdown. Risk Magazine, 17(10):99–102.
Neuneier, R. (1998). Enhancing q-learning for optimal as-
set allocation. In Advances in neural information pro-
cessing systems, pages 936–942.
Park, S., Song, H., and Lee, S. (2019). Linear programing
models for portfolio optimization using a benchmark.
The European Journal of Finance, 25(5):435–457.
Rua, A. and Nunes, L. C. (2009). International comovement
of stock market returns: A wavelet analysis. Journal
of Empirical Finance, 16(4):632–639.
Sharpe, W. F. (1994). The sharpe ratio. Journal of portfolio
management, 21(1):49–58.
Van Hasselt, H., Guez, A., and Silver, D. (2015). Deep
reinforcement learning with double q-learning. arXiv
preprint arXiv:1509.06461.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M.,
and Freitas, N. (2016). Dueling network architectures
for deep reinforcement learning. In International con-
ference on machine learning, pages 1995–2003.