Predicting Stock Market Movements with Social Media and Machine

Learning

Paraskevas Koukaras

, Vasiliki Tsichli and Christos Tjortjis

School of Science and Technology, International Hellenic University,

14th km Thessaloniki–N. Moudania, Thermi, 57001, Thessaloniki, Greece

Keywords:

Social Media, Prediction, Machine Learning, Data Science, Stocks.

Abstract:

Microblogging data analysis and sentiment extraction has become a popular approach for market prediction.

However, this kind of data contain noise and it is difﬁcult to distinguish truly valid information. In this work

we collected 782.459 tweets starting from 2018/11/01 until 2019/31/07. For each day, we create a graph

(271 graphs in total) describing users and their followers. We utilize each graph to obtain a PageRank score

which is multiplied with sentiment data. Findings indicate that using an importance-based measure, such

as PageRank, can improve the scoring ability of the applied prediction models. This approach is validated

utilizing three datasets (PageRank, economic and sentiment). On average, the PageRank dataset achieved a

lower mean squared error than the economic dataset and the sentiment dataset. Finally, we tested multiple

machine learning models, showing that XGBoost is the best model, with the random forest being the second

best and LSTM being the worst.

1 INTRODUCTION

Stock market forecasting is an important academic

topic, which has attracted academic interest since the

early 1960’s (Fama, 1965). Although a lot of effort

and time has been spent on predicting ﬁnancial time

series, the results of the research are not robust. In re-

cent years a lot of researchers have shifted their focus

from classical econometric approaches to machine

learning approaches. With the rise of microblogging

platforms, such as Twitter, StockTwits and others, in-

formation is more available than ever. Given that

emotions can have a signiﬁcant effect on economic

decisions (Bollen et al., 2011), alongside with herding

phenomena (Devenow and Welch, 1996), one can as-

sume that mining information through microblogging

platforms might be the key to achieve better results in

predicting stock market movements.

Stock market forecasting has drawn a lot of aca-

demic attention since the 1960’s. The ﬁrst model

that revolutionized how the stock was evaluated is the

Capital Asset Pricing Model (or CAPM for short).

CAPM was developed

by William Sharpe (Sharpe,

https://orcid.org/0000-0002-1183-9878

https://orcid.org/0000-0001-8263-9024

There is a dispute on who deserves credit about CAPM,

for more information check (Treynor, 1962)

1964) who built on top of Markowitz’s diversiﬁca-

tion theory. The model is fairly simple and is based

on stock return sensitivity exhibited over the systemic

risk (or market risk). It is quantitatively expressed

with a beta (β) factor.

CAPM measures the return of a stock in accor-

dance with the market risk. Every other risk that

stems from the stock itself can be diversiﬁed as

Markowitz proved in the portfolio theory. Thus, there

is no point in measuring it. Although CAPM has been

a fundamental decision making tool for asset man-

agers, it has been criticized by academics due to its

nature. It has been proven that the model is not robust

and that it fails to give accurate results consistently.

Fama and French (Fama and French, 1993) stated that

the model is not robust and that a model that takes

into account the size and the ratio of accounting over

stock market value is more accurate. Their research

prompted others to start looking for factors that may

be affecting the returns of a stock. This gave birth

to a whole new way of evaluating a stock, which is

called technical analysis. Technical analysis is based

on ratios and indicators that capture the momentum of

the stock market. Although technical analysis is not

based purely on academic research, it is extensively

used and it is a common practice.

436

Koukaras, P., Tsichli, V. and Tjortjis, C.

Predicting Stock Market Movements with Social Media and Machine Learning.

DOI: 10.5220/0010712600003058

In Proceedings of the 17th International Conference on Web Information Systems and Technologies (WEBIST 2021), pages 436-443

ISBN: 978-989-758-536-4; ISSN: 2184-3252

In recent years there has been a lot of effort to con-

struct indicators or ratios based on the information of

the microblogging community. Essentially, those in-

dicators provide an overall sentiment over the mar-

ket or a particular stock. Thus, the trader can have

a more objective metric about the ”feelings”. More-

over, this data might contain useful information that

otherwise would be unavailable. On the other hand,

this approach contradicts with one of the most funda-

mental economic theories, the Efﬁcient Market Hy-

pothesis. As Fama (Fama, 1965) suggested, the price

of a given stock embodies all the prior available infor-

mation and it is impossible to forecast future values

since the current ones reﬂect everything. Moreover,

in efﬁcient-market hypothesis (EMH), it is believed

that the market adjusts the prices instantly as the news

spread. Fama (Fama, 1965) also noted that the most

probable future price is the current price. Neverthe-

less, recent empirical research provided evidence that

sentiment plays an important role and can act as a de-

termining factor of the stock market returns.

One of the biggest problems encountered by the

researchers that used data from Twitter of other rele-

vant sources is that they are noisy (See-To and Yang,

2017; Alshahrani Hasan and Fong, 2018), thus yield-

ing spurious results. To deal with that problem, the

authors either choose a speciﬁc news source, such

as MarketWatch (H

ajek, 2018) or Thomson Reuters

(Mittermayer and Knolmayer, 2006) but this approach

might lead to overlooking important information. An-

other issue is that they use a lot of data which might

hinder their research in terms of efﬁciency and statis-

tical robustness (Antweiler and Frank, 2004).

Our objective is to provide a more efﬁcient way

of handling those massive data, by looking for and

distinguishing those data that matter the most. To

achieve that, we use graphs that are constructed based

on users and their data accordingly. We believe that

our approach solves the problem of noisy microblog-

ging data, without disregarding any useful informa-

tion that might exist. Given our hypothesis we ex-

pect that the dataset which accounts for the noise in

the data have a better score than the simple sentiment

dataset.

2 METHODOLOGY

2.1 Data

In this section we present how we gathered Twitter

data. Then, we provide an overview of the utilized

economic variables and the reasoning behind these

choices.

2.1.1 Twitter Data

We are interested in two categories of data, the tweets

and the users that wrote those tweets. The main prob-

lem reported in the literature is the noisy nature of

Twitter data (Rousidis et al., 2019; Koukaras et al.,

2019; Koukaras and Tjortjis, 2019; Beleveslis et al.,

2019; Oikonomou and Tjortjis, 2018). To overcome

this problem, we used the ”cashtag” or ”$” in the

tweets, which as (Chakraborty et al., 2017) notes, is

more suited for gathering stock related data.

In total 782.459 tweets were downloaded starting

from December, 1st of 2018 until July, 31st of 2019.

Form these, we take all the tweets authors’ usernames

and gather metrics for them. These metrics are used

when we are checking the validity of our data. We

also gather all users’ followers, a metric that is going

to be used in the graph module. The module for gath-

ering Twitter data is built upon a library called Twint.

This library can provide tweets, users’ statistics (fol-

lowers, following, likes, etc.) and also, it can gather

users’ followers. Moreover, it also has a built-in func-

tion for storing those data directly to a database.

2.1.2 Economic Variables

Economic variables can act as predictors. These vari-

ables may vary from a fundamental analysis of a com-

pany’s balance sheet to technical indicators specially

designed to capture speciﬁc events. In this work,

we chose to use technical indicators for multiple rea-

sons. First, technical analysis is based on examining

a stock’s trend and constitutes a more robust tool for

prediction. Moreover, one of the core principles of

technical analysis is that a stock’s price reﬂects all the

available information. Thus, it is focused more on

past behavior of the market. Although technical anal-

ysis has been dismissed by academics (Malkiel and

Fama, 1970), many of the leading trading companies

use technical indicators to identify signals and trends

on time. On the same line, we concluded that techni-

cal indicators are more suited for our research. Since

they do not focus on news events, our ﬁnal dataset will

be more balanced with features that capture different

aspects of trading. From all the available technical in-

dicators, we opted for ﬁve of the most common ones:

1. The Aroon Oscillator is a trend indicator that mea-

sures the power of an ongoing trend and the prob-

ability to proceed by using elements of the Aroon

Indicator (Aroon Up and Aroon Down). Readings

above zero show an upward trend, while readings

below zero show a downward trend. To signal

prospective trend changes, traders watch for zero

line crossovers (Mitchell, 2019).

Predicting Stock Market Movements with Social Media and Machine Learning

437

2. The CCI was created to determine the rates of

over-bought and over-sold stocks. This is done

by evaluating the price-to-moving average (MA)

relationship or by evaluating ordinary deviations

from that median (Kuepper, 2019).

3. On-balance volume (OBV) is a momentum indi-

cator that measures positive and negative volume

ﬂows (Staff, 2019).

4. The RSI is a momentum index measuring the

magnitude of the latest price modiﬁcations that

is used to assess which stocks are over-bought or

over-sold. The RSI is an oscillator. Traditionally,

traders interpret a score of 70 or higher as a sign

that a stock is overbought or overestimated, lead-

ing to a trend reversal. An RSI of 30 or lower sig-

nals that a stock is undervalued (Blystone, 2019).

5. The Stochastic Oscillator attempts to predict price

turning points by comparing the last closing price

of a security to its price range. It takes values from

0 up to 100. A value of 70 or higher signals an

overbought security.

These indicators were chosen for two main rea-

sons. i) They are very robust and are extensively used

in the industry and ii) they belong to the special cat-

egory of ”Oscillators”. These are indicators that ﬂuc-

tuate within a range, commonly used to capture short

term trends. Our sample period ranges from Decem-

ber, 1st of 2018 to July, 31st of 2019. This period is

characterized by high ﬂuctuations and small but pow-

erful shocks (Trade War, No Deal Brexit, etc.). Thus,

we believe that by using such variables will provide

more accurate results instead of using fundamental

analysis. Finally, to collect the economic variables,

we used the API of Alpha Vantage.

2.2 Research Design

This section summarizes the main processes for con-

ducting this research. At ﬁrst, we designed a users

Graph to obtain their importance incorporating the

PageRank algorithm (Page et al., 1999). Afterwards,

we analyzed the obtained tweets using two different

lexicons. Lastly, we estimated ﬁve different machine

learning models.

2.2.1 Identifying Inﬂuential Users

To identify inﬂuential users we generated a graph, we

computed the PageRank score for each edge as well

as the hub and authority scores. The Graph class is

fairly simple and is based on the NetworkX library

(Hagberg et al., 2008). Moreover, the PageRank and

HITS algorithms are implemented in the NetworkX

library (Hagberg et al., 2008).

PageRank and HITS are two algorithms that are

often used to measure the importance of nodes on di-

rected graphs. Both of the algorithms were designed

to rank websites. The PageRank algorithm is a recur-

sive algorithm. An internet page is important if and

only if other important pages are linked with it. As

it is usually described, a website’s score is the proba-

bility of any random person browsing the web ending

up on this website. This is by deﬁnition a Markov

Process. Markov Processes model recursive phenom-

ena, such as the weather. The PageRank algorithm

starts with a set of websites (denoting the number of

those websites with N). On each website, we assign a

score of 1/N. Afterwards, we sequentially update the

score of each website by adding up the weight of ev-

ery other website that links to it divided by the number

of links emanating from the referring website. But if

the website does not reference any other website, we

distribute its score to the remaining websites. This

process is executed until the scores are stable.

The Hypertext-Induced Topic Search (HITS) al-

gorithm provides two scores, the ”Authority” and the

”Hub”. We tried to compute the HITS algorithm, but

the algorithm never achieved convergence. Since we

wanted to compute the hubs and the authorities for

each day in our sample, the recursiveness of the al-

gorithm poses a signiﬁcant barrier. On the computing

part, for each date, we needed to create a graph that

references the follower relationships of the users. We

are also interested in tweets posted between Decem-

ber, 1st of 2018 and July, 31st of 2019, thus creating

242 graphs.

2.2.2 Sentiment Analysis

Lexicon analysis outperforms other methodologies

(Sohangir et al., 2018). In our approach, we used

VADER (Valence Aware Dictionary and sEntiment

Reasoner) (Hutto and Gilbert, 2014) and TextBlob.

Both of these tools are part of the nltk library. VADER

analyzer returns four scores, the negative, the positive,

the neutral and the compound score. TextBlob returns

two scores, the polarity (which should be very close to

the compound score) and the subjectivity. We decided

to use all of these variables as features in our models

allowing us to compare those two analyzers. Further-

more, to achieve better accuracy on the scores, the

tweets must be stripped from any special characters.

More speciﬁcally, tweets often contain Unicode char-

acters such as the non-breaking space. These char-

acters should be normalized so as not to negatively

affect the scoring of the analyzers.

WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies

438

2.2.3 Machine Learning Models

Decision Tree. The Decision Tree (DT) builds re-

gression or classiﬁcation models in the form of a

tree structure. This means that the model breaks the

dataset into smaller subsets by asking different ques-

tions each time. The ﬁnal result is a tree with deci-

sion nodes and leaf nodes. A decision node has two

or more branches (Decisions), each one representing

values for the attribute that was tested. Leaf nodes

(Terminal Nodes) represent decisions on the numer-

ical targets. The questions and their order is deter-

mined by the model itself using Information Gain (for

classiﬁcation) or ID3 (for regression) (Tzirakis and

Tjortjis, 2017; Tjortjis and Keane, 2002). For each

question, the model must make a strategic split using

a criterion. Decision trees are not affected by miss-

ing values or outliers. They can handle both numer-

ical and categorical values and they are very easy to

understand. Also, trees can capture non-linear rela-

tionships. There are some disadvantages though. The

most important one is that they tend to overﬁt to the

training sample. A small difference in data might pro-

duce a completely different tree. Lastly, there is no

guarantee that the tree will be the global optimal.

Random Forest. Random Forest (RF) is another

method that uses a tree structure to solve a regression

or a classiﬁcation problem. A random forest is a col-

lection of decision trees, with each tree voting on the

ﬁnal decision. In the training phase, each tree on the

forest considers only a random sample of the data. In

the prediction phase, each tree makes a prediction and

the average of all of the trees will be considered as the

ﬁnal value.

XGBoost. Boosting and bagging are two methods

commonly used in weak prediction trees, such as de-

cision trees, to improve their performance. Those two

methods work sequentially, meaning that a new model

is added to correct the error of the existing models un-

til no further improvements can be made. XGBoost

(eXtreme Gradient Boosting) is a method where new

models are created, predicting the residuals or errors

of existing models and then, they are added together

to make the ﬁnal prediction. Its name comes from the

algorithm used to minimize the loss function, which

is called gradient descent.

K-Nearest Neighbors. k-Nearest Neighbors (k-

NN) is one of the most basic and essential machine

learning algorithms. Like the trees, it belongs to

the supervised machine learning algorithms. k-NN

is a non-parametric method, meaning that it does not

make any assumptions about the distribution of the

data. k-NN is a fairly simple model that calculates

similarities based on the distances between the data

points. When a new entry needs to be classiﬁed, the

algorithm measures the distances between the new

data and the already classiﬁed data. Then, the new en-

try is assigned to the class that has the minimum dis-

tance to the new data point. There are multiple meth-

ods to measure the distance, such as the Euclidean or

the Manhattan distance.

LSTM. Simple neural networks cannot understand

the context and the order of data. For that, we need

some sort of memory. Recurrent neural networks are

a special form of neural networks where their units

are inter-connected creating various output value de-

pendencies (Hochreiter, 1991). RNNs are extremely

important and have been successfully used in a lot of

applications, such as speech recognition. But, RNNs

suffer from the vanishing gradients problem. This

problem refers to the hidden neuron activation func-

tions. If those functions are saturating non-linearities,

like the tanh function, then the derivatives can be very

small, even close to zero. Multiplying many such

derivatives leads to zero meaning that the neural net-

work cannot propagate back for too many instances.

Hochreiter & Schmidhuber (Hochreiter and

Schmidhuber, 1997) introduced another kind of re-

current neural networks, the long short term memory

(LSTM). Those models have the same ”chain-like”

structure, but the module responsible for the ”repeti-

tion” part has a different structure. In a classic RNN,

the repetition module is a neural network with a hid-

den layer, usually with tanh as the activation function.

On an LSTM, instead of having a single hidden layer,

there are four. On the ﬁrst stage or gate, the neural

network decides which information to discard from

the cell state. On the second stage, the model incor-

porates the new information and decides what to keep

and what to discard. The model updates the old cell

state into the new cell state. In the third stage, the

model discards the old information and adds new in-

formation. In this stage, the candidate values are es-

timated. Lastly, the output values depend on the state

of the ﬁrst and the third layer.

3 RESULTS & EVALUATION

This section presents the results of this research. We

present the feature selection and the summary of the

results per dataset (Sentiment, Economic and PageR-

ank) and per model (DT, k-NN, LSTM, RF and XG-

Boost).

Predicting Stock Market Movements with Social Media and Machine Learning

439

Figure 1: Average Mean Squared Error per Model per

Dataset.

All of the scores refer to the mean squared error,

thus the best score is the lowest (Figure 1). We evalu-

ate our results using a naive trading strategy and com-

paring it across all datasets regarding our stocks port-

folio (Table 1).

Table 1: Initial Portfolio.

Ticker Quantity Price Amount

AAPL 1 204,5 204,5

CAT 1 139,09 139,09

HD 1 217,26 217,26

UNH 1 264,66 264,66

XOM 1 75,93 75,93

IBM 1 143,53 143,53

TRV 1 154,59 154,59

V 1 179,31 179,31

BA 1 362,75 362,75

INTC 1 49,17 49,17

GS 1 215,52 215,52

JNJ 1 132,5 132,5

WBA 1 55,81 55,81

DOW 1 52,32 52,32

VZ 1 57,41 57,41

JPM 1 115,12 115,12

PG 1 115,89 115,89

KO 1 52,14 52,14

MSFT 1 137,08 137,08

CVX 1 124,76 124,76

MRK 1 81,59 81,59

CSCO 1 57,62 57,62

UTX 1 133,19 133,19

MMM 1 176,49 176,49

WMT 1 114,76 114,76

MCD 1 213,72 213,72

PFE 1 42,85 42,85

AXP 1 128,06 128,06

DIS 1 144,3 144,3

3.1 The Trading Strategy

For evaluating results we utilize a naive trading strat-

egy comprising the following points:

1. At the end of each day we sell the stocks that are

predicted to have a loss in the next day.

2. We buy the stocks that are predicted to have a pos-

itive return.

3. We choose to buy the one that maximizes the re-

turn and we do not take into account variance, es-

timated error or diversiﬁcation effects.

4. In the next day we ﬁrst update the prices and then

we calculate gains or losses.

3.2 Feature Selection

This section describes the created features, as well as

the descriptive statistics of those features per ticker.

It is noted that all of the variables are not available

for the day we want to predict, thus all the created

features are values of previous days. Since there is no

consensus on the literature on which time lag is the

most important, for every variable we created the lags

from 1 to 3 days prior (Bollen et al., 2011).

One major aspect of this paper is to determine if

the sentiment data are noisy and how this can be re-

deemed. Therefore, we decided to create three differ-

ent datasets. The ﬁrst dataset contains the lagged eco-

nomic variables and the lags of closing prices from

previous days. The second dataset contains all the

features of the economic dataset as well as the sen-

timent data. Lastly, the PageRank dataset contains all

of the features from the sentiment dataset, but the sen-

timent variables are multiplied by the PageRank value

for each user.

One major drawback of calculating daily PageR-

ank values for each user is that the algorithm does not

always estimate the importance for all of the users.

Thus, we decided to ﬁll all those dates with the mean

value for each user. After that process, we ﬁll all the

residual non-estimated PageRank values with 0. This

is done since the aim is to have a timely importance

measure for the user. In cases where this was not fea-

sible, we theorized that the number of the user follow-

ers does not signiﬁcantly alter from day to day. There-

fore, we considered logical to proceed with ﬁlling any

missing values with their respective mean. Lastly, if

there was no mean, the PageRank algorithm did not

ﬁnd any importance in the user for any day. Thus we

ﬁlled the residual empty values with 0 marking them

as noisy and not important.

3.3 Economic Dataset Evaluation

We began with the evaluation of results for the eco-

nomic dataset and the XGBoost model. Our pre-

dictions suggested that we should sell MRK, MCD,

WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies

440

MSFT, V, PFE, DOW, JNJ, WMT, DIS, BA, HD,

AXP, CAT, IBM, TRV, MMM, JPM, AAPL, NKE,

KO, CSCO, GS, and PG and buy four shares of In-

tel’s stock. Our predictions proved correct and Intel’s

stock recorded a gain, so our portfolio had a total eval-

uation of 4.041, 61$. Our decisions for 2019/7/18

also proved correct and, again, we recorded a gain of

0,30%. On the contrary, for 2019/7/19 our decisions

lead to a negative return of −0,47%. The biggest gain

was observed on 2019/7/30 with a daily return of

1,87%. Our worst day was the next day, where we

lost most of our gains (-75,99$). Finally, our cumula-

tive return for the whole period was positive, 0,75%.

3.4 Sentiment Dataset Evaluation

In the sentiment’s dataset we began by selling most

of our portfolios’ stocks and buying only one. More

speciﬁcally, we sold 23 stocks and bought WBA’s

stock. This decision was wrong, as we sold Intel’s

stock, which as we have seen in the previous dataset

leads to a signiﬁcant gain. These decisions naturally

lead to a signiﬁcant loss of −1,91%. Although the

next day (2019/7/18) our predictions resulted in a

daily positive return of 0,26%, although it was not

enough to overturn the cumulative negative return.

Our best return was on 2019/7/29 with 1,39%. Even

that return could not reverse our overall losses for this

dataset resulting in a cumulative loss of −3, 05%.

3.5 PageRank Dataset Evaluation

For the PageRank dataset in the ﬁrst day, we sold

the following stocks, V, MRK, PFE, JNJ, HD, AXP,

WMT, MCD, NKE, CAT, TRV, CVX, JPM, MMM,

CSCO, INTC, IBM, KO, PG, DIS, and GS. This de-

creased the value of bought stocks to 1.343, 65$ and

increased the available funds to 2.686,87$. At this

point, 10 units of ticker UNH were bought at 264, 66

per unit. This updated the value of bought stocks to

3.990,25$ and the available funds to 40, 27$. Since

we were still on the same day, the evaluation of the

portfolio had not changed, because we had not up-

dated the prices yet. On the next day, after updat-

ing the prices, we saw that our portfolio had a value

of 4.051,48$ meaning that our approach resulted in a

positive return of 1, 5%.

On the second day, we decided to sell the stocks

of VZ, AAPL, and UTX and buy three units of Nike’s

stock. This decision resulted in a loss of 75,88$ and

a total return of −1,3%. The decision was based on

the prediction that Nike’s stock would have a posi-

tive return. On the contrary, the actual result was a

loss of −1, 07%. We followed the same strategy for

every day. We ended up having two stocks, that of

XOM’s and Intel’s on 25/7/19. From this point and

afterwards, the predictions showed that Intel’s stock

would have a positive return, so we held on to our

stocks. This never happened, and our overall return

was negative, resulting in a loss of −122$ or −3,03%.

Table 2 aggregates daily transactions to top

daily losses and gains for the investigated period

(2018/11/01 until 2019/31/07) as well as the cumu-

lative returns per dataset. Positive values stand for

gains and negative values for losses.

Table 2: Top daily Losses & Gains and Cumulative Returns

per dataset.

Dataset Loss (%) Gain (%) Return (%)

Economic -1,83 1,87 0,75

Sentiment -1,91 1,39 -3,05

PageRank -1,87 0,86 -3,01

4 CONCLUSIONS

4.1 Summary

This work addresses the problem of predicting stock

market movements. The main contribution resides

to the fact that it considers social media as a data

source for improving predictions. More speciﬁcally it

utilizes Twitter data to extract sentiment and investi-

gates whether online sentiment can have a signiﬁcant

positive impact on the forecasting ability of various

prediction models. However, these data may intro-

duce biases to the process of result validation due to

their noisy nature. To address that, we proposed a

new methodology incorporating graphs and obtaining

a daily importance measure for all of the users as well

as weighting their tweets.

Table 3 summarizes the results for the computed

errors of all of the stocks. The PageRank dataset per-

formed better than both the economic and the simple

sentiment dataset. Moreover, we were able to con-

ﬁrm that the most important feature, on the sentiment

data, is the negative score of the tweet. However, we

were not able to conﬁrm which time lag is the most

important, since results are highly dependant on the

feature.

Five different models were tested. For each stock

and for each dataset, we estimated a Decision Tree,

a Random Forest, an XGBoost, an LSTM, and a

k-Nearest Neighbors. For 15 out of 30 stocks the

PageRank dataset performed better than the other

datasets. The most important feature of the sentiment

data was the negative score. For 13 out of 30 stocks

Predicting Stock Market Movements with Social Media and Machine Learning

441

Table 3: Best Dataset Per Ticker.

Ticker PageRank Sentiment Economic

AAPL X

AXP X

BA X

CAT X

CSCO X

CVX X

DIS X

DOW X

GS X

HD X

IBM X

INTC X

JNJ X

JPM X

KO X

MCD X

MMM X

MRK X

MSFT X

NKE X

PFE X

PG X

TRV X

UNH X

UTX X

V X

VZ X

WBA X

WMT X

XOM X

the XGBoost performed better than the other models.

We could not conﬁrm which time lag is the most im-

portant, as this feature was highly depend and on the

stock.

Table 4 presents a summarized version of the re-

sults in the PageRank dataset. The best model was

XGBoost achieving the lowest scores at 13 stocks.

Furthermore, it was the most robust model, having

the lowest average error and the lowest standard devi-

ation.

Although PageRank’s dataset provided the best

scores for most of the stocks, during evaluation the

only proﬁtable dataset was proved to be the economic

(0,75%). The other two datasets, Sentiment and

PageRank recorded losses of −3,05% and −3,01%,

respectively.

4.2 Limitations

This study acts like a proof of concept that microblog-

ging data can be a powerful feature in predicting stock

market data, if we can determine and distinguish the

important ones. This is feasible but the required data

pose an obstacle.

Table 4: Best Dataset Per Model on PageRank Dataset.

Ticker DT k-NN LSTM RF XGBoost

AAPL X

AXP X

BA X

CAT X X

CSCO X

CVX X

DIS X

DOW X

GS X

HD X

IBM X

INTC X

JNJ X

JPM X

KO X

MCD X

MMM X

MRK X

MSFT X

NKE X

PFE X

PG X

TRV X

UNH X

UTX X

V X

VZ X

WBA X

WMT X

XOM X

Since all of our data come from the Twint library,

and not from the ofﬁcial Twitter API, we could collect

a speciﬁc amount of tweets. Moreover, this library is

signiﬁcantly slower than the ofﬁcial, thus it was very

difﬁcult to obtain data for a longer period. We be-

lieve that if we had two years worth of data and all

the tweets per day, then our results would be signiﬁ-

cantly better.

Lastly, on the evaluation part, we choose a greedy

strategy and not an optimal one. The optimal solu-

tion would require an extra module that would imple-

ment diversiﬁcation according to Markowitz’s Portfo-

lio Theorem (Markowitz, 1991) and the extraction of

optimal weights per stock. Moreover, every transac-

tion should incrementally position us to more efﬁcient

decisions.

4.3 Further Research

There are a lot of aspects in our research that we

want to explore in the future. First, we could uti-

lize more models, such as SVM which is commonly

used in the literature. Also, we would like to explore

other economic variables. There are other such vari-

ables that we could embed in our research. Moreover,

WEBIST 2021 - 17th International Conference on Web Information Systems and Technologies

442

we could expand our methodology to other ﬁnancial

instruments to explore the possibility that sentiment

data can act as features on government and corporate

bonds, or even on derivatives. Lastly, as we observed

in some models, there were cases where the mean

squared error was low, but the ﬁt between the actual

and the predicted price was not good. Thus, it would

be very helpful if we could deﬁne a new measure that

can improve the ﬁt capturing.

REFERENCES

Alshahrani Hasan, A. and Fong, A. C. (2018). Sentiment

Analysis Based Fuzzy Decision Platform for the Saudi

Stock Market. In 2018 IEEE International Confer-

ence on Electro/Information Technology (EIT), pages

0023–0029, Rochester, MI. IEEE.

Antweiler, W. and Frank, M. Z. (2004). Is all that talk

just noise? the information content of internet stock

message boards. The Journal of ﬁnance, 59(3):1259–

1294.

Beleveslis, D., Tjortjis, C., Psaradelis, D., and Nikoglou,

D. (2019). A hybrid method for sentiment analy-

sis of election related tweets. In 2019 4th South-

East Europe Design Automation, Computer Engineer-

ing, Computer Networks and Social Media Confer-

ence (SEEDA-CECNSM), pages 1–6. IEEE.

Blystone, D. (2019). Overbought or oversold? use the rela-

tive strength index to ﬁnd out.

Bollen, J., Mao, H., and Zeng, X.-J. (2011). Twitter mood

predicts the stock market. Journal of Computational

Science, 2(1):1–8. arXiv: 1010.3003.

Chakraborty, P., Pria, U. S., Rony, M. R. A. H., and Majum-

dar, M. A. (2017). Predicting stock movement using

sentiment analysis of twitter feed. In 2017 6th Inter-

national Conference on Informatics, Electronics and

Vision & 2017 7th International Symposium in Com-

putational Medical and Health Technology (ICIEV-

ISCMHT), pages 1–6. IEEE.

Devenow, A. and Welch, I. (1996). Rational herding in

ﬁnancial economics. European Economic Review,

40(3-5):603–615.

Fama, E. F. (1965). The behavior of stock-market prices.

The journal of Business, 38(1):34–105.

Fama, E. F. and French, K. R. (1993). Common risk factors

in the returns on stocks and bonds. Journal of ﬁnancial

economics, 33(1):3–56.

Hagberg, A., Swart, P., and S Chult, D. (2008). Explor-

ing network structure, dynamics, and function using

networkx. Technical report, Los Alamos National

Lab.(LANL), Los Alamos, NM (United States).

ajek, P. (2018). Combining bag-of-words and senti-

ment features of annual reports to predict abnormal

stock returns. Neural Computing and Applications,

29(7):343–358.

Hochreiter, S. (1991). Investigations on dynamic neural net-

works. Diploma, Technical University, 91(1).

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term

memory. Neural computation, 9(8):1735–1780.

Hutto, C. J. and Gilbert, E. (2014). Vader: A parsimonious

rule-based model for sentiment analysis of social me-

dia text. In Eighth international AAAI conference on

weblogs and social media.

Koukaras, P. and Tjortjis, C. (2019). Social media ana-

lytics, types and methodology. In Machine Learning

Paradigms, pages 401–427. Springer.

Koukaras, P., Tjortjis, C., and Rousidis, D. (2019). So-

cial media types: introducing a data driven taxonomy.

Computing, pages 1–46.

Kuepper, J. (2019). Timing trades with the commodity

channel index.

Malkiel, B. G. and Fama, E. F. (1970). Efﬁcient capital

markets: A review of theory and empirical work. The

journal of Finance, 25(2):383–417.

Markowitz, H. M. (1991). Foundations of portfolio theory.

The journal of ﬁnance, 46(2):469–477.

Mitchell, C. (2019). Aroon oscillator deﬁnition and tactics.

Mittermayer, M.-a. and Knolmayer, G. (2006). News-

CATS: A News Categorization and Trading System.

In Sixth International Conference on Data Mining

(ICDM’06), pages 1002–1007, Hong Kong, China.

IEEE.

Oikonomou, L. and Tjortjis, C. (2018). A method for pre-

dicting the winner of the usa presidential elections us-

ing data extracted from twitter. In 2018 South-Eastern

European Design Automation, Computer Engineer-

ing, Computer Networks and Society Media Confer-

ence (SEEDA

CECNSM), pages 1–8. IEEE.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999).

The pagerank citation ranking: Bringing order to the

web. Technical report, Stanford InfoLab.

Rousidis, D., Koukaras, P., and Tjortjis, C. (2019). So-

cial media prediction: A literature review. Multimedia

Tools and Applications.

See-To, E. W. K. and Yang, Y. (2017). Market sentiment

dispersion and its effects on stock return and volatility.

Electronic Markets, 27(3):283–296.

Sharpe, W. F. (1964). Capital asset prices: A theory of mar-

ket equilibrium under conditions of risk. The journal

of ﬁnance, 19(3):425–442.

Sohangir, S., Petty, N., and Wang, D. (2018). Financial Sen-

timent Lexicon Analysis. In 2018 IEEE 12th Inter-

national Conference on Semantic Computing (ICSC),

pages 286–289, Laguna Hills, CA, USA. IEEE.

Staff, I. (2019). On-balance volume: The way to smart

money.

Tjortjis, C. and Keane, J. (2002). T3: a classiﬁcation algo-

rithm for data mining. In International Conference on

Intelligent Data Engineering and Automated Learn-

ing, pages 50–55. Springer.

Treynor, J. L. (1962). Jack treynor’s’ toward a theory

of market value of risky assets’. Available at SSRN

628187.

Tzirakis, P. and Tjortjis, C. (2017). T3c: improving a de-

cision tree classiﬁcation algorithm’s interval splits on

continuous attributes. Advances in Data Analysis and

Classiﬁcation, 11(2):353–370.

Predicting Stock Market Movements with Social Media and Machine Learning

443