through social media sentiment analysis, which regularly affects short-term market swings. The usefulness of this idea remains debatable, however, because different kinds of textual content can produce distinct outcomes: certain texts may greatly increase or decrease predictability.
One of the main tasks in machine learning is text vectorization, the conversion of unstructured text into a numerical format. Popular techniques include the Bag of Words (BoW) model and Term Frequency-Inverse Document Frequency (TF-IDF) weighting. The BoW model treats text as a collection of distinct words and counts their frequencies. Although effective, it often produces sparse feature vectors that may overlook subtleties in meaning. In contrast, TF-IDF assigns each term a weight reflecting its importance within a document relative to the entire corpus. By emphasizing informative terms and down-weighting common, uninformative ones, it offers a more nuanced representation.
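As a minimal illustration, the scikit-learn sketch below builds both representations for three toy headlines; the sentences are illustrative placeholders, not drawn from the datasets used in this paper.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy headlines, purely illustrative.
docs = [
    "stocks rally as tech earnings beat expectations",
    "oil stocks fall after weak supply report",
    "tech earnings beat forecasts again",
]

# BoW: each document becomes a sparse vector of raw word counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: the same counts are re-weighted so that words appearing in
# many documents (e.g. "stocks", "earnings") contribute less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray().round(2))
```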
Another difficulty is choosing the right machine learning model. A model's efficacy depends on its capacity to handle the particularities of social text data, such as slang, informal language, and varying information density. For problems involving text classification and prediction, models like Random Forests and Multinomial Naive Bayes (MultinomialNB) offer clear advantages. Because different models may perform better under different text properties and circumstances, selecting the best model requires thorough experimentation and validation.
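As a hedged sketch of what such a comparison might look like, the snippet below cross-validates both classifiers on the same TF-IDF features; the headlines and labels are toy placeholders, not the paper's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Toy stand-ins for news texts and up/down movement labels.
headlines = [
    "shares surge on strong quarterly results",
    "regulators open probe into accounting practices",
    "company announces record dividend and buyback",
    "profit warning sends investors to the exits",
    "new product launch draws enthusiastic reviews",
    "ceo resigns amid fraud allegations",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = price rose, 0 = price fell

X = TfidfVectorizer().fit_transform(headlines)

# Validate both candidates on identical features before choosing.
for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              MultinomialNB()):
    scores = cross_val_score(model, X, labels, cv=3)
    print(type(model).__name__, scores.mean().round(2))
```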
Social media data from RedditNews, Asea Brown Boveri Ltd. (ABB), Google LLC (GOOG), Apple Inc. (AAPL), and Exxon Mobil Corporation (XOM) are used in this analysis. Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF) are the two text vectorization techniques used in this paper. In the experiments, stock prediction performance is evaluated across different text volumes using Random Forest and MultinomialNB models (AAPL5, for example, denotes five randomly picked Apple news items). On the RedditNews data, the results indicate that Random Forest and TF-IDF generally outperform MultinomialNB and BoW.
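The naming scheme above can be read as a simple sampling step. The hypothetical helper below sketches it under an assumed data structure (a dict mapping days to lists of news items); both the function and the structure are our illustration, not the paper's actual pipeline.

```python
import random

# Hypothetical sketch of "AAPL5"-style sampling: for each day, draw k
# news items at random and join them into one training document.
# `news_by_day` is an assumed structure, not the paper's actual format.
def sample_daily_text(news_by_day, k, seed=0):
    rng = random.Random(seed)
    docs = {}
    for day, items in news_by_day.items():
        chosen = rng.sample(items, k) if len(items) >= k else list(items)
        docs[day] = " ".join(chosen)
    return docs

news_by_day = {
    "2016-06-01": [
        "apple unveils new product line",
        "suppliers report strong orders",
        "analysts raise price target",
        "app store revenue climbs",
        "services growth accelerates",
        "buyback program expanded",
    ],
}
print(sample_daily_text(news_by_day, k=5)["2016-06-01"])
```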
Further analysis of the data's textual characteristics in this research revealed that prediction accuracy is strongly influenced by the association between the textual data and the stock market. RedditNews text that shows a strong association with the market, for example, typically yields higher accuracy.
2 METHODS
To forecast stock movements from social text data, this paper applies machine learning algorithms and text processing methods. This section provides a thorough review of the techniques used, focusing on the Random Forest and Multinomial Naive Bayes classifiers as well as the Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) representations for text data.
2.1 Random Forest
Random Forest is an ensemble learning technique for problems involving regression and classification (Sun et al., 2024). During training, it builds a large number of decision trees. For classification, it outputs the mode of the classes predicted by the individual trees; for regression, it outputs their mean prediction. The algorithm generates multiple decision trees by sampling the training dataset with replacement (bootstrapping): each tree is trained on a distinct bootstrap sample, and the trees' predictions are combined to produce the final output. At each node of a tree, splitting is performed on a random subset of features. This decreases the correlation between individual trees and increases the resilience of the model. The random selection of training data and features used to build each tree promotes diversity among the trees.
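A minimal sketch of this mechanism is given below, building a small forest by hand from scikit-learn decision trees; the synthetic data, forest size, and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data stands in for real features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bootstrap sample: draw n indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt": each split considers a random feature subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Classification output is the mode (majority vote) of the trees.
votes = np.array([t.predict(X) for t in trees])       # (n_trees, n_samples)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)  # binary majority vote
print("training accuracy:", (forest_pred == y).mean())
```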
2.2 Multinomial Naïve Bayes
Multinomial Naive Bayes uses Bayes' theorem to calculate the posterior probability of a class given the feature vector (Terentyeva et al., 2024). The probability of each class is estimated by multiplying the conditional probabilities of the observed features given that class. The "naive" assumption is that features are independent when conditioned on the class. Multinomial Naive Bayes is well suited to features that represent counts, such as word frequencies in text documents, because the distribution of words within each class is modeled using the multinomial distribution.
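The short scikit-learn sketch below demonstrates this on word-count features; the documents and labels are illustrative placeholders, not the paper's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative documents and sentiment-style labels.
train_docs = [
    "great earnings strong growth",
    "weak sales heavy losses",
    "strong demand great quarter",
    "losses widen weak outlook",
]
train_labels = [1, 0, 1, 0]  # 1 = positive news, 0 = negative news

vec = CountVectorizer()
X = vec.fit_transform(train_docs)      # word-count features

clf = MultinomialNB()                  # models P(word | class) multinomially
clf.fit(X, train_labels)

# The posterior multiplies per-word class-conditional probabilities
# (the naive independence assumption) and normalizes via Bayes' rule.
test = vec.transform(["strong growth despite weak quarter"])
print(clf.predict(test), clf.predict_proba(test).round(2))
```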
2.3 Bag of Words
The Bag of Words (BoW) technique is a crucial text representation method that converts text into numerical features. It retains the frequency of each word while ignoring word order and syntax. BoW first breaks the text up into discrete words, or tokens, and assigns every token a distinct index in a vocabulary. Next, the text is segmented into discrete