trading. Utilizing data from the S&P 500 index, these
studies revealed that, in the absence of pronounced
white noise, the RF model exhibits a smaller bias in
forecasting stock prices than LSTM. It also fits
price variations more accurately and responds more
swiftly to price fluctuations (Wu, 2024). In terms of
quantitative
investment, Ma et al. integrated machine learning
models with traditional portfolio optimization
techniques, proposing a stock selection methodology
based on RF and support vector regression (SVR)
(Ma, Han, & Wang, 2021). This approach was
benchmarked against deep learning models, such as
LSTM networks and convolutional neural networks
(CNNs). Their experimental findings indicate that
machine learning models outperform conventional
time-series models in the stock pre-selection process.
Notably, when applied to Mean-Variance (MV) and
Omega portfolio optimization frameworks, the RF
model demonstrated superior predictive efficacy.
This study underscores that RF, as a robust regression
and classification tool, can effectively furnish reliable
predictive information for quantitative strategies,
particularly during the stock pre-selection phase.
Furthermore, Rasekhschaffe & Jones explored the
application of machine learning techniques in stock
selection, highlighting RF's advantages in
handling multiple complex factors and nonlinear
relationships (Rasekhschaffe & Jones, 2019).
Although some research has applied RF to stock
prediction and quantitative investment, integrating
factors from different domains to predict stock
prices accurately remains challenging. In contrast to
conventional applications of machine learning to
stock price prediction, this study employs a broad set
of factors, including fundamental, technical, risk,
and macroeconomic indicators, to develop an RF
model that makes fuller use of information from the
bond market and the macroeconomy, aiming to
capture as much valuable market context as possible
and yield more precise price predictions.
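To make the idea of a multi-factor RF regressor concrete, the following is a minimal, self-contained sketch rather than the implementation used in this study. It grows bootstrapped regression trees that search a random subset of features at each split and averages their predictions. The four feature columns stand in for one fundamental, one technical, one risk, and one macroeconomic factor; the data are synthetic, and all function names and hyperparameters are assumptions made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def grow_tree(X, y, depth, n_try, rng, min_leaf=5):
    """Grow a regression tree; each split searches a random subset of n_try features."""
    if depth == 0 or len(y) < 2 * min_leaf:
        return mean(y)  # leaf: average target of the samples that reached it
    best = None  # (sum of squared errors, feature index, threshold)
    for f in rng.sample(range(len(X[0])), n_try):
        for thr in sorted({row[f] for row in X}):
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    if best is None:
        return mean(y)
    _, f, thr = best
    lo = [i for i, row in enumerate(X) if row[f] <= thr]
    hi = [i for i, row in enumerate(X) if row[f] > thr]
    return (f, thr,
            grow_tree([X[i] for i in lo], [y[i] for i in lo], depth - 1, n_try, rng, min_leaf),
            grow_tree([X[i] for i in hi], [y[i] for i in hi], depth - 1, n_try, rng, min_leaf))


def predict_tree(node, row):
    # Internal nodes are tuples; leaves are plain floats.
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(X, y, n_trees=10, depth=4, n_try=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(len(X)), k=len(X))  # bootstrap sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], depth, n_try, rng))
    return forest


def predict_forest(forest, row):
    return mean([predict_tree(tree, row) for tree in forest])


# Synthetic stand-in data (NOT the study's data). Hypothetical column order:
# [fundamental, technical, risk, macroeconomic]; only the first two drive the target.
data_rng = random.Random(42)


def make_rows(n):
    X = [[data_rng.random() for _ in range(4)] for _ in range(n)]
    y = [2.0 * r[0] + (3.0 if r[1] > 0.5 else 0.0) + data_rng.gauss(0, 0.1) for r in X]
    return X, y


X_train, y_train = make_rows(120)
X_test, y_test = make_rows(60)
forest = fit_forest(X_train, y_train)
mse_forest = mean([(predict_forest(forest, r) - t) ** 2 for r, t in zip(X_test, y_test)])
mse_naive = mean([(mean(y_train) - t) ** 2 for t in y_test])
print(f"forest MSE: {mse_forest:.3f}, naive-mean MSE: {mse_naive:.3f}")
```

The synthetic target mixes a linear effect with a threshold effect, the kind of nonlinear relationship that tree ensembles can capture without manual feature engineering. With real price data, the train and test sets would need to be split chronologically rather than drawn independently as they are here.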
The remainder of this paper is structured as
follows. Section 2 describes the data sources and
descriptive statistics, briefly introduces the data
preprocessing, and then explains the basic principles
of the random forest model and how it was applied in
this study. Section 3 presents the results and the
cross-validation. The last section summarizes the
paper.
2 DATA AND METHOD
2.1 Data Collection and Description
The data in this study are obtained from the CCER
and RESSET databases, covering samples from
January 1, 2015, to December 31, 2023. The sample
includes the constituent stocks of the CSI300 and
CSI1000 indices, which represent large- and
small-capitalization companies, respectively, in the
Chinese A-share market. These constituents were
selected to analyze the predictive performance of the
RF algorithm across stocks of different market
capitalization and liquidity under multidimensional
factors. The collected data include the
closing price of each stock every Friday and the data
corresponding to each factor. The factors used in this
study are divided into four types: fundamental
factors, technical factors, macroeconomic factors and
risk factors. The specific abbreviations and notations
are shown in Table 1.
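As a minimal sketch of the weekly sampling described above, assume that daily closing prices are available as a mapping from date to price. The function name and the prices below are hypothetical and are not taken from the CCER or RESSET data. Fridays on which the market was closed simply do not appear in the result; how the study treated such weeks is not stated here.

```python
from datetime import date, timedelta


def friday_closes(daily_closes):
    """Keep only the observations whose date falls on a Friday (weekday() == 4)."""
    return {d: p for d, p in sorted(daily_closes.items()) if d.weekday() == 4}


# Hypothetical daily closes for one stock over two trading weeks,
# starting on Monday, January 5, 2015, with weekends skipped.
start = date(2015, 1, 5)
daily = {}
for offset in range(12):
    day = start + timedelta(days=offset)
    if day.weekday() < 5:  # Monday to Friday only
        daily[day] = 10.0 + 0.1 * offset

weekly = friday_closes(daily)
for day, price in weekly.items():
    print(day.isoformat(), round(price, 2))
```

Each factor series would be aligned to the same Friday dates before it is joined to the prices.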
Table 1: Abbreviations and notations.
Classification          Abbreviation   Notation
Fundamental indicators  PE_Ratio       Price-to-Earnings Ratio
                        PB_Ratio       Price-to-Book Ratio
                        BM_Ratio       Book-to-Market Ratio
                        Current_Ratio  Measuring a company's short-term debt repayment ability
                        Quick_Ratio    Measuring a company's ability to pay its short-term liabilities without
rel