Text Sentiment Analysis for JD.com Based on Machine Learning
Hanyu Wang
a
Global Sun School of Business and Management, DongHua University, Shanghai, China
Keywords: Text Sentiment Analysis, Long Short-Term Memory, Machine Learning, Natural Language Analysis.
Abstract: One of the most important uses of Natural Language Processing (NLP) is text sentiment analysis. It is the
process of processing and classifying textual content that has been infused with subjective attitudes. The final
result is the identification of public sentiment patterns toward specific topics or products. To elevate both
accuracy and efficiency in sentiment analysis, the research simultaneously assesses the effectiveness of
several models, promoting a detailed understanding of their individual benefits and limitations. Notably, the
investigation showed that the Long Short-Term Memory (LSTM) model was a strong competitor. The LSTM
model demonstrated its effectiveness in sentiment analysis tasks by achieving an excellent accuracy rate of
87.29% during rigorous training and testing with tens of thousands of datasets. This work then uses this model
to analyze user reviews for certain digital products on JD.com, providing an example of the usefulness of
LSTM in practical settings. This paper highlights the promising potential of LSTM networks in addressing
complex sentiment analysis problems and pushes the boundaries of sentiment analysis approaches.
1 INTRODUCTION
In recent years, online shopping has become an
integral part of daily life, with JD.com standing out as
a major player in China's thriving e-commerce
landscape (Araque et al., 2024). The wealth of user
comments on JD.com, rich in sentiment expressions,
offers valuable insights for businesses seeking to
understand consumer preferences and opinions.
Comprehending this feedback is crucial for gauging
customer satisfaction levels and refining marketing
strategies. Traditional methods, such as sentiment
dictionaries (Wu et al.,2017) and machine learning
algorithms like Naive Bayes, K-Nearest Neighbour
(KNN), and Support Vector Machine (SVM), often
fail to capture the nuanced semantics embedded in
user comments (Bonaccorso, 2018; Shekhawat, 2024;
Fangxu & Jianhui, 2024). This necessitates the
pursuit of more advanced techniques to accurately
analyze consumer sentiment.
This research delves into the utilization of Long
Short-Term Memory (LSTM) neural networks for
sentiment analysis. As a specialized type of Recurrent
Neural Network (RNN), LSTM excels at managing
sequential data and capturing long-term
dependencies, offering unique advantages for this
a
https://orcid.org/0009-0007-6035-4491
task. This study aims to benchmark LSTM's
performance against KNN and SVM models, utilizing
JD.com user comments as a real-world testbed. By
conducting rigorous empirical analysis, the paper
aims to demonstrate LSTM's ascendancy in sentiment
analysis, leveraging its proficiency in sequential data
processing to reveal deeper sentiment insights that
traditional methods may overlook.
This paper begins by examining the limitations of
current sentiment analysis methods, particularly in
capturing the intricate nuances of user sentiment. It
then introduces the LSTM model and its unique
abilities in this domain, emphasizing its proficiency
in modeling temporal dependencies and contextual
information. The study meticulously details the
dataset used, comprising JD.com user comments,
along with the preprocessing steps taken to guarantee
data quality. It also outlines the experimental setup
for comparing LSTM with KNN and SVM models.
The experimental results demonstrate LSTM's
superior performance over KNN and SVM in
sentiment analysis, with higher accuracy and
effectiveness. LSTM's proficiency in capturing long-
term dependencies and comprehending contextual
nuances within user comments facilitates more
precise sentiment classifications. These findings
Wang and H.
Text Sentiment Analysis for JD.com Based on Machine Learning.
DOI: 10.5220/0013515400004619
In Proceedings of the 2nd International Conference on Data Analysis and Machine Learning (DAML 2024), pages 257-260
ISBN: 978-989-758-754-2
Copyright © 2025 by Paper published under CC license (CC BY-NC-ND 4.0)
257
emphasize LSTM's potential in addressing intricate
sentiment challenges within e-commerce
environments like JD.com, where user comments
exhibit diverse sentiment expressions.
In conclusion, the findings highlight the
significance of LSTM in providing deeper insights
into customer sentiment for businesses. The paper
underscores the need for further exploration,
suggesting hybrid models combining LSTM with
other machine learning techniques to enhance
sentiment analysis capabilities. Overall, this research
contributes to advancing sentiment analysis
techniques, demonstrating LSTM's potential as a
crucial tool for understanding and leveraging
customer sentiment in e-commerce.
2 EXPERIMENTAL DATASETS
To ensure the quality of text data, the author used
Python's `re` module with regular expressions to
preprocess the crawled data. Key steps involved
removing @replies, usernames, {%xxx%} tags, and
[xx] contents. Additionally, special characters,
emojis, and non-Chinese symbols were removed,
while exclamation marks and question marks were
replaced with appropriate sentiment-conveying
words.
Subsequently, the author undertook preprocessing
of the text data, removing stopwords- common yet
non-substantive words - to enhance relevance.
Utilizing the Harbin Institute of Technology's
compiled stopword list, these unnecessary words
were effectively eliminated from the text dataset.
Finally, the author employed word embedding to
represent the text data. Traditional one-hot coding - a
bag-of-words model - suffers from limitations such as
ignoring word order, assuming word independence,
and resulting in discrete, sparse features. To address
these issues, the author adopted word embedding, a
neural network-based distributed representation
approach. This method converts vector elements from
integers to floating-point numbers, enabling
representation across the entire real number range. It
also condenses the original sparse, high-dimensional
space into a more compact, lower-dimensional one.
Leveraging Python's Keras framework and word2vec,
the author constructed a 150-dimensional word vector
space encompassing nearly all Chinese vocabulary
from the cleaned and tokenized Wikipedia corpus
(Tang et al., 2020).
3 METHODS BASED ON
MACHINE LEARNING
The text information is first processed for features,
and then the model undergoes supervised learning
training. The trained model is then used to predict the
sentiment polarity of new text information. The
working method is as follows: initially, labeled text
data is utilized for feature extraction, from which key
information is derived. Subsequently, these features
are employed to generate sentiment polarity labels,
serving as the foundation for model training. Through
the machine learning training process, a model
capable of recognizing sentiment polarity is
constructed. This model can receive new unlabelled
sentences, perform feature extraction again, and
predict the sentiment polarity based on the trained
model, ultimately outputting the prediction results.
The entire process, from data preparation to model
prediction, achieves efficient and accurate sentiment
analysis.
Based on different classification algorithms,
methods can be divided into KNN, SVM, Naive
Bayes, Maximum Entropy, etc (Chen. S & Chen. J,
2024).
3.1 K-Nearest Neighbour
The K-Nearest Neighbours (KNN) classification
algorithm is a simple yet effective method in data
mining classification. This algorithm classifies
records by examining the labels of the KNN and
assigning the most frequent label. Easy to understand
and implement, it's sensitive to K-value selection and
the distance metric used (Li et al., 2024).
3.2 Support Vector Machine
The Support Vector Machine (SVM) is a generalized
linear classifier renowned for binary classification
using supervised learning. It identifies the hyperplane
with the maximum margin - the distance to the nearest
data points from each class - to maximize separation.
This approach prevents overfitting, ensuring a robust
and accurate classifier (Fang et al., 2024).
3.3 Long Short-Term Memory
Long Short-Term Memory (LSTM) is a kind of RNN
specifically crafted to deal with the hardships faced
during the training of long sequences. Traditional
RNNs frequently encounter problems such as
vanishing and exploding gradients, resulting in poor
DAML 2024 - International Conference on Data Analysis and Machine Learning
258
long-term memory preservation. In contrast, LSTM
networks possess unique mechanisms that permit
them to manage and remember information
effectively over long durations. This characteristic
makes LSTM a great choice for tasks involving the
processing of long data sequences, like sentiment
analysis of extensive texts. By resolving the issues
related to gradient vanishing and exploding
(Staudedfzmeyer & Morris, 2019), LSTM provides
more accurate and reliable results, especially when
handling complex and large datasets. The application
of LSTM technology in various fields has been shown
to improve the efficiency and precision of data
processing.
The complex design of an LSTM cell enables it to
handle long-term dependency issues effectively. Its
specialized architecture shows its ability to overcome
challenges associated with retaining information over
extended periods (Yu & Zhou, 2018).
The first tier is known as the Forget Gate. It is the
initial stage of LSTM for determining which data
should be removed from the cell state. This decision
is determined via a sigmoid network layer called the
"forget gate layer". It takes the current input (X
) and
the previous hidden state (h

) as inputs, and for
each number in the cell state (C

), it returns a value
between 0 and 1. The number of 1 denotes "accept
this fully", whereas a value of 0 denotes "totally
ignore this" (Yang & Wang, 2019).
The input gate is the next layer. Its purpose is to
ascertain what fresh data will be kept in the cell state.
There are two parts to this process. The "input gate
layer", a sigmoid layer, determines which values will
be updated first. A layer then creates a fresh candidate
value vector to be included in the state. Following
that, an update for the state is created using these two
pieces of data.
Next, there is the third layer, namely the Cell State
Update Gate. The paper uses the new cell state (C
) to
replace the old cell state (C

). The author multiplies
the old state by the output of the forget gate (𝑓
) to
discard the information that has been decided to be
forgotten. Subsequently, this paper brings in new
candidate values and adjusts them according to the
degree of update determined for each state. Finally,
the output value is determined by the filtered cell
state. A sigmoid layer is utilized to output the specific
part of the cell state. Then, it processes the cell state
through a tanh function (producing a value from -1 to
1 and multiplies it by the output of the sigmoid gate.
Eventually, output the selected portion. In this
way, the state of the hidden layer from the previous
moment is integrated into the calculation process of
the current moment. In simpler terms, the selection
and decision-making take into account the previous
state, addressing the long-term dependency issues
that regular RNNs encounter.
4 EXPERIMENT RESULTS
Despite its simplicity, the KNN model only achieves
a moderate performance in sentiment analysis, with
an accuracy of 0.5847. The model's F1 Score of
0.5506, along with balanced precision (0.5558) and
recall (0.5513) values, indicate its struggle in
accurately distinguishing between positive and
negative sentiments. This limitation can be attributed
to KNN's sole focus on feature space proximity,
overlooking the sequential dependencies present in
text data. Therefore, while KNN is user-friendly and
straightforward, it falls short of effectively analyzing
sentiment due to its inherent design flaws and lack of
consideration for textual nuances. The experiment
results are shown in Table 1.
The SVM is well-known for its strong
generalization capabilities, surpassing KNN in
performance. However, when it comes to sentiment
analysis, SVM falls short despite achieving an
accuracy of 0.6301. This is evident in the imbalance
between its precision (0.7979) and recall (0.5345),
resulting in an F1 score of 0.4480.
This disparity highlights SVM's cautious
approach toward classifying positive samples,
prioritizing precision over effectively capturing
genuine positives. The model's struggle with
recognizing sequential patterns essential for
sentiment analysis further accentuates its limitations
in this area. In conclusion, while SVM showcases
moderate success, its shortcomings in sentiment
analysis are apparent. The LSTM model stands out as
a superior choice for sentiment analysis tasks due to
its exceptional performance across various metrics. In
comparison
to other models like KNN and SVM,
Table 1: Experimental results.
Model Accuracy F1 Recall Precision
KNN 0.5847 0.5506 0.5513 0.5558
SVM 0.6301 0.4480 0.5345 0.7979
LSTM 0.8729 0.8960 0.8943 0.9061
Text Sentiment Analysis for JD.com Based on Machine Learning
259
LSTM excels with an accuracy rate of 87.29%,
accurately classifying a high percentage of samples.
Additionally, its F1 Score of 0.8960 reflects a
harmonious balance between precision and recall,
showcasing its proficiency in identifying positive
sentiments while minimizing false positives and false
negatives.
One of LSTM's key strengths lies in its ability to
process sequential data, allowing it to capture
nuanced sentiment orientations and tendencies within
the text. This unique capability significantly
contributes to its high classification accuracy and
overall exceptional performance. In contrast, models
like KNN and SVM struggle to capture the sequential
nature of text data, therefore hindering their
effectiveness in sentiment analysis tasks. Ultimately,
this study conclusively establishes LSTM's
superiority in handling text data with intricate
sequential patterns for sentiment analysis. When
faced with complex textual data, prioritizing LSTM
or similar sequence-processing models is crucial to
ensure optimal performance and accuracy. By
leveraging LSTM's capability to understand context
and dependencies within text sequences, researchers
and practitioners can enhance the accuracy and
effectiveness of sentiment analysis tasks.
5 CONCLUSIONS
This paper emphasizes the crucial significance of
sentiment analysis for understanding customer
feedback, especially on e-commerce platforms like
JD.com. Through analyzing user reviews of specific
digital products, the study compares advanced
machine learning techniques (such as LSTM
networks) and traditional algorithms (like KNN and
SVM). LSTM is highlighted for its remarkable ability
to achieve high accuracy in sentiment analysis,
especially in handling sequential data and extracting
detailed contextual semantic information from long
texts. The research evaluates the performance of
LSTM, KNN, and SVM in sentiment analysis of
JD.com's user reviews. LSTM emerges as the most
effective model, showing its value in helping
businesses understand customer satisfaction levels
and guiding strategic decisions on product quality
improvement, customer service optimization, and
marketing strategy refinement. However, LSTM
models have limitations in handling long sequences.
While they are good at processing short sequences,
dealing with sequences exceeding 1000 elements
poses computational challenges and time constraints
due to the complexity of LSTM cells. Future research
should focus on optimizing and enhancing LSTM
architectures to address these limitations. Possibilities
include developing more efficient LSTM variants for
long sequences, using parallel processing techniques,
and leveraging hardware accelerators. Hybrid
approaches combining LSTM with other algorithms
also hold promise. In conclusion, integrating LSTM
in sentiment analysis of JD.com's user reviews has
demonstrated its potential. As research continues,
LSTM-based sentiment analysis will be important for
driving customer satisfaction, building brand loyalty,
and contributing to the success of JD.com and other
businesses in the e-commerce field.
REFERENCES
Araque, O., Corcuera-Platas, I., Sánchez-Rada, J. F.,
Iglesias, C. A., 2017. Enhancing deep learning
sentiment analysis with ensemble techniques in social
applications. Expert Systems with Applications, 77,
236-246.
Bonaccorso, G., 2018. Machine Learning Algorithms:
Popular algorithms for data science and machine
learning. Packt Publishing Ltd.
Chen, S., Chen, J., 2024. Research on Sentiment Analysis
Model of Online Course Reviews Based on R-Boson.
Modern Information Technology, 16, 107-112.
Fangxu, Y., Jianhui, W., 2024. A sentiment recognition
model for Weibo comments based on SVM and
Word2vec. Modern Computers, 10, 60-64.
Li, Y. W., Chen, Y. X., Hu, G. X., 2024. Recognition and
detection of apple leaf diseases based on KNN and
multi-feature fusion. Food and Fermentation
Technologies, 4(04), 25-32.
Shekhawat, B. S., 2019. Sentiment classification of current
public opinion on BREXIT: Naïve Bayes classifier
model vs Python’s TextBlob approach (Doctoral
dissertation, Dublin, National College of Ireland).
Staudedfzmeyer, R. C., Morris, E. R., 2019. Understanding
LSTM--a tutorial into long short-term memory
recurrent neural networks. arxiv preprint
arxiv:1909.09586.
Wu, J., Lu, K., Su, S., Wang, S., 2019. Chinese micro-blog
sentiment analysis based on multiple sentiment
dictionaries and semantic rule sets. IEEE Access, 7,
183924-183939.
Yang, Q., Wang, C. W., 2019. Research on global stock
index prediction based on deep learning LSTM neural
network. Statistical Research, 03, 65-77.
Yu, W., Zhou, W. N., 2018. Sentiment analysis of product
reviews based on LSTM. Computer Systems &
Applications, 08, 159-163.
DAML 2024 - International Conference on Data Analysis and Machine Learning
260