Identifying Deceptive Reviews Using Machine Learning
Benson Mansingh, J. Sandeep, A. Basanth, M. Yagnesh and G. Asritha
Department of ACSE, VFSTR Deemed to be University, Guntur, Andhra Pradesh, India
Keywords: Deceptive Reviews, Sentiment Analysis, Text Classification, Random Forest Classifier, Machine Learning.
Abstract: This paper presents a Deceptive Reviews System that uses machine learning, natural language processing (NLP), and sentiment
analysis to distinguish between genuine and fraudulent reviews. The system enhances transparency
and reliability in e-commerce by identifying deceptive feedback. It incorporates TF-IDF vectorization to
extract key textual features. It supports informed purchasing decisions and helps businesses improve based
on genuine user reviews, addressing the challenges posed by fake reviews in the digital marketplace. This
solution plays a crucial role in maintaining the credibility and effectiveness of online review systems.
1 INTRODUCTION
Deceptive reviews are a major challenge in today’s
digital age, particularly on e-commerce platforms
where people rely on feedback from other customers
before making purchases. Misleading reviews can
influence buying decisions and affect business
reputations. Positive deceptive reviews can promote
low-quality products, while negative fake reviews can
harm brands. Since deceptive reviews often
appear genuine, detecting them manually is
difficult. An automated technique to identify
and filter out deceptive reviews is therefore necessary to
maintain trust in online review systems. As more
users turn to online platforms for shopping,
identifying and eliminating deceptive reviews has
become a critical task to protect consumers from
misinformation (Mohawesh et al., 2021).
Several strategies, such as pattern analysis and
rule-based algorithms, have been proposed in recent
years to identify fraudulent reviews. Machine learning
models can detect patterns that are not easily visible to
humans. Natural Language Processing (NLP)
techniques help convert text data into numerical
values, making it easier for models to identify
deceptive reviews. Sentiment analysis is also useful
in identifying the emotional tone behind reviews,
which adds another layer of information for
improving model accuracy. By leveraging these
techniques, machine learning models can effectively
classify reviews as genuine or fake (Sree and
Tripathi, 2023). The Deceptive Reviews System uses
machine learning (ML) and NLP
approaches to detect bogus product
evaluations, which matters because online reviews influence customer
purchasing decisions. Our system processes reviews
by performing text preprocessing, sentiment analysis,
and classification using a Random Forest model. The
system offers both single-review and bulk product
analysis, supporting transparency and authenticity in
online shopping (Abdulqader et al., 2022).
In this study, a Random Forest classifier is used to
distinguish between fake and real reviews. The
process involves cleaning and preparing the text data
by removing unnecessary words, applying
stemming, and converting the text into numerical
format using TF-IDF. The trained model is tested on
unseen data, and the results show that the Random
Forest classifier performs well in detecting deceptive
reviews, offering a practical solution for maintaining
the authenticity of online platforms. By integrating
sentiment analysis with machine learning, this
approach not only enhances review classification but
also provides a deeper understanding of the emotional
patterns associated with fake and genuine reviews.
This solution plays a crucial role in maintaining the
credibility and effectiveness of online review
systems, fostering a trustworthy environment for
buyers and sellers in the e-commerce ecosystem
(Chauhan et al., 2022).
1.1 Structure of the Paper
The article starts with an introduction that explains
why deceptive review detection is significant and how
machine learning models can help. The objectives
section lists the research objectives. The dataset and
the models used in the study are described in the
methodology section. The performance of the models
is presented in the results and interpretation section.
Finally, the conclusion summarizes the study and makes
recommendations for further research, and is followed
by a references section listing all the sources consulted.
Mansingh, B., Sandeep, J., Basanth, A., Yagnesh, M. and Asritha, G.
Identifying Deceptive Reviews Using Machine Learning.
DOI: 10.5220/0013890000004919
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies (ICRDICCT‘25 2025) - Volume 2, pages 789-794
ISBN: 978-989-758-777-1
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
1.2 Objectives
To examine the machine learning models used for deceptive review detection.
To assess how well the machine learning models detect misleading reviews.
To train and test a machine learning model to identify fake and genuine reviews.
To evaluate the model’s performance using accuracy, precision, recall, and the confusion matrix.
2 RELATED WORKS
Several machine learning (ML) and deep learning
(DL) techniques have been used in recent advances in
fraudulent review detection, greatly increasing the precision
and effectiveness of detecting false information in
online reviews.
To improve the accuracy of deceptive
review identification across various platforms, a
number of studies have investigated sophisticated
machine learning and deep learning techniques. Sree
and Tripathi (2023) utilized Evidential Classifiers to
improve classification accuracy by leveraging
probabilistic reasoning in identifying deceptive
reviews. Similarly, Abdulqader et al. (2022)
developed a Unified Detection Model that integrates
deception theories with behavioral science to analyze
online review patterns, enhancing the detection of
fraudulent content. Chauhan et al. (2022) provided a
comprehensive review of techniques for detecting
fake images and videos, which can be extended to
identifying manipulated reviews through neural
networks and GAN-based models. Catelli et al.
(2023) proposed a method leveraging BERT and
ELECTRA for sentiment analysis to detect deceptive
reviews in datasets related to Italian cultural heritage,
demonstrating the effectiveness of deep learning
models in distinguishing deceptive content. Liu et al.
(2021) explored a multidimensional representation
approach with fine-grained aspect analysis to identify
deceptive reviews by modeling semantic
relationships and contextual information.
Furthermore, Tufail et al. (2022) investigated the
impact of fake reviews on e-commerce platforms
during and after the COVID-19 pandemic and
introduced SKL-based models using K-Nearest
Neighbor (KNN) and Support Vector Machine
(SVM) to classify reviews as genuine or deceptive
(Pandit, 2018). Deep learning models, especially
convolutional neural networks (CNNs), have proven
to be effective in establishing robust classification
baselines by capturing subtle patterns in review data
(Rathore et al., 2023). These models demonstrate
superior performance in analyzing contextual
information, sentiment polarity, and behavioral
patterns that distinguish genuine reviews from fake
ones.
3 METHODOLOGY
This study addresses deceptive review detection by first
pre-processing the text data: removing punctuation,
converting to lowercase, eliminating stop words and other
irrelevant terms, applying stemming, and converting the
text into numerical form using TF-IDF.
The dataset is split into training, validation, and
testing subsets, on which the model undergoes training,
fine-tuning, and performance evaluation,
respectively. Figure 1 shows the
flow of the work. The model’s performance is
assessed through evaluation metrics such as accuracy,
precision, recall, F1-score, and ROC-AUC.
Furthermore, a confusion matrix is employed to
analyze and understand the nature of prediction
errors. The models are then tested on unseen data to
ensure they generalize well to new inputs. This
methodology allows us to identify the most effective
model for accurately detecting fake reviews.
3.1 Stemming
In deceptive reviews, writers may use different
forms of a word, such as “buying” and
“buys”, which convey similar meaning. Stemming
normalizes these variations to a single root such as
“buy”, reducing the vocabulary size and improving
model efficiency. Irregular forms such as “bought” are
generally not reduced by suffix-stripping stemmers and
require lemmatization instead. Stemming helps the model
generalize better by focusing on the core meaning of
the text. In deceptive review detection, stemming
simplifies textual data, making it easier for the model
to identify patterns and classify reviews accurately.
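The idea of reducing inflected forms to a common root can be illustrated with a minimal sketch. The `toy_stem` function and its suffix list below are hypothetical and deliberately simplistic; this is not the Porter algorithm, and the paper does not state which stemmer was used. Note that a suffix stripper of this kind leaves the irregular form “bought” unchanged.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Porter algorithm).
# The suffix list and the minimum stem length are assumptions for this sketch.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["buying", "buys", "buy", "bought"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['buy', 'buy', 'buy', 'bought']
```

In practice an established stemmer would replace `toy_stem`; the sketch only shows why regular inflections collapse to one vocabulary entry while irregular ones do not.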
Figure 1: Flow of the Work.
3.2 Stop Words Removal
Stop word removal is an important pre-processing
step in a deceptive review detection system, where
common words like “the”, “is”, “and”, and “in” are
eliminated since they do not contribute meaningful
information. These words often appear frequently but
provide little value in distinguishing between genuine
and fake reviews. Removing them reduces noise and
allows the model to focus on more significant terms,
improving efficiency and reducing complexity. In
fake review detection, eliminating stop words ensures
that only relevant words are analyzed, helping the
model detect patterns more effectively and classify
reviews with better accuracy.
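A minimal sketch of this step, combined with the lowercasing and punctuation removal described in the methodology, might look as follows. The `STOP_WORDS` set is a small hand-picked assumption; real stop-word lists are considerably longer, and the paper does not specify which list was used.

```python
import string

# Small illustrative stop-word list (an assumption; real lists are longer).
STOP_WORDS = {"the", "is", "and", "in", "a", "an", "of", "to", "it", "this"}


def clean_review(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tokens = clean_review("This is the BEST phone, and it arrived in a day!")
print(tokens)  # ['best', 'phone', 'arrived', 'day']
```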
3.3 Sentiment Analysis
Sentiment analysis is a technique used in deceptive
review detection systems to analyze the emotional tone or
opinion expressed in a text. It helps categorize reviews
as positive, negative, or neutral based on the sentiment
conveyed. By identifying these patterns, sentiment
analysis can flag reviews that may indicate deceptive
behavior. The VADER model from NLTK is
widely used for sentiment analysis, as it is effective
in capturing sentiment intensity and is optimized for
short text, such as reviews. In deceptive review
detection, sentiment analysis serves as an additional
layer of evaluation, enhancing the model’s ability
to identify suspicious patterns and improve
classification accuracy.
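The positive/negative/neutral categorization can be sketched with a toy lexicon-based scorer. This is not VADER: the word lists, the scoring rule, and the 0.05 threshold are assumptions chosen only to show how a polarity score can be turned into a label and used as an extra feature.

```python
# Toy lexicon-based polarity scorer (hypothetical word lists, NOT VADER's lexicon).
POSITIVE = {"great", "excellent", "love", "perfect", "amazing"}
NEGATIVE = {"bad", "terrible", "broken", "poor", "awful"}


def polarity(tokens: list[str]) -> float:
    """Return (positive count - negative count) / token count, in [-1, 1]."""
    if not tokens:
        return 0.0
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos - neg) / len(tokens)


def sentiment_label(score: float, threshold: float = 0.05) -> str:
    """Map a polarity score to a label; the threshold is an assumed cut-off."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


score = polarity(["great", "phone", "love", "it"])
print(score, sentiment_label(score))  # 0.5 positive
```

With NLTK installed and the `vader_lexicon` resource downloaded, `SentimentIntensityAnalyzer().polarity_scores(text)` from `nltk.sentiment.vader` returns a comparable `compound` score that could take the place of `polarity`.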
3.4 TF-IDF (Term Frequency-Inverse
Document Frequency)
TF-IDF is an effective technique for extracting
features from text, often applied in identifying fake or
deceptive reviews. It converts textual content into
numerical formats that can be efficiently processed by
machine learning algorithms. Figure 2 illustrates
TF-IDF. Term Frequency (TF) quantifies the
occurrence of a word within a given document, while
Inverse Document Frequency (IDF) adjusts its
significance by down-weighting terms that frequently
appear across numerous documents, thereby
emphasizing rarer yet more informative words. In
the realm of fake review identification, TF-IDF
plays a crucial role by highlighting distinctive terms
that differentiate authentic reviews from fraudulent
ones, assigning greater importance to unique and
contextually relevant words. This transformation
reduces noise caused by frequently used words and
improves the model’s ability to detect patterns. By
applying TF-IDF, the system enhances the
effectiveness of classifiers by providing a more
accurate representation of the textual data.
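The weighting described above can be made concrete with a short sketch of one textbook variant, tf(t, d) = count(t, d) / |d| and idf(t) = ln(N / df(t)). The paper does not state which variant was used; libraries such as scikit-learn apply a smoothed IDF and vector normalization by default, so their numbers will differ. The three token lists are made-up examples.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by tf = count / len(doc) times idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up, already tokenized reviews.
docs = [
    ["great", "phone", "great", "price"],
    ["great", "battery"],
    ["phone", "stopped", "working"],
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.4f}")
```

In the first review, “price” appears once but in no other document, so it outweighs “phone”, which also appears once but is shared with the third review. This is the down-weighting of common terms that the section describes.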
Figure 2: TF-IDF.
3.5 Random Forest Model
Random Forest is a machine learning
algorithm commonly used in deceptive review detection due
to its effectiveness in handling large datasets and
reducing overfitting. It is an ensemble method that
constructs multiple decision trees and combines their
predictions to generate a more accurate result.
Figure 3 shows the Random Forest model. In fake
review detection, Random Forest processes features
extracted from text, such as word frequencies and
sentiment scores, to classify reviews as either
genuine or fake. Each tree in the forest makes an
independent prediction, and the final decision is
determined through majority voting, which improves
model reliability. It is well suited to identifying
patterns in fake reviews, leading to better accuracy
and performance on unseen data.
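The majority-voting step can be sketched in isolation. The three “trees” below are hand-written stand-ins with hypothetical features and thresholds; a trained Random Forest would learn its trees from bootstrap samples and random feature subsets, which this sketch does not attempt. The labels OR (real) and CG (deceptive) follow the dataset description.

```python
from collections import Counter

# Hand-written stand-ins for trained decision trees. Feature names and
# thresholds are hypothetical and exist only to illustrate voting.
def tree_a(x: dict) -> str:
    return "CG" if x["sentiment"] > 0.8 else "OR"


def tree_b(x: dict) -> str:
    return "CG" if x["tfidf_amazing"] > 0.3 else "OR"


def tree_c(x: dict) -> str:
    return "CG" if x["length"] < 20 else "OR"


def forest_predict(x: dict, trees) -> str:
    """Collect one vote per tree and return the most common label."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_a, tree_b, tree_c]
review = {"sentiment": 0.9, "tfidf_amazing": 0.1, "length": 12}
prediction = forest_predict(review, trees)
print(prediction)  # CG (two of three trees vote CG)
```

Using an odd number of trees with two classes avoids tied votes in this sketch.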
Figure 3: Random forest model.
4 RESULTS AND
INTERPRETATION
Once the data has been preprocessed, machine
learning models are applied to determine whether a
review is fake or genuine.
4.1 Dataset Description
The “Deceptive Reviews Dataset” consists of 40,000
product reviews, evenly split into 20,000 genuine
reviews and 20,000 deceptive reviews. Authentic
reviews are composed by real users, expressing their
actual experiences, whereas fake reviews are
artificially generated to mimic genuine customer
feedback. This dataset is structured to help develop
and evaluate machine learning models for fake review
detection, offering opportunities for feature analysis,
sentence analysis and classification tasks.
Table 1: Summary of the fake reviews dataset.
Dataset Description                 Details
Total Number of Reviews             40,000
Number of Real Reviews (OR)         20,000
Number of Deceptive Reviews (CG)    20,000
Table 1 displays the distribution of the fake review
dataset. The dataset consists of 40,000 reviews overall, split
into two categories: real and deceptive. Specifically,
20,000 are real reviews and the remaining 20,000 are
deceptive reviews.
Table 2 presents the Deceptive Reviews Dataset
where it is partitioned into training, validation, and
testing sets. The Training Set comprises 28,000
reviews (14,000 genuine and 14,000 deceptive). The
Validation Set includes 6,000 reviews (3,000
authentic and 3,000 deceptive), while the Testing Set
also consists of 6,000 reviews (3,000 real and 3,000
fake), used to evaluate the model’s effectiveness on
unseen data.
Table 2: Dataset split for fake review detection.
Dataset Split     Number of Reviews    Real + Deceptive Reviews
Training Set      28,000               14,000 + 14,000
Validation Set    6,000                3,000 + 3,000
Testing Set       6,000                3,000 + 3,000
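The counts in Table 2 correspond to a 70/15/15 split in which both classes are equally represented in every subset. A minimal sketch of such a stratified split is shown below. The function name, the seed, and the shuffling strategy are assumptions, since the paper does not describe how the partition was produced.

```python
import random


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return train/validation/test index lists that preserve class proportions."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Label vector matching Table 1: 20,000 real (OR) and 20,000 deceptive (CG).
labels = ["OR"] * 20000 + ["CG"] * 20000
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 28000 6000 6000
```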
4.2 Data Split Based on Reviews Graph
Figure 4: Data split based on reviews.
Figure 4 shows the distribution of real (OR) and
deceptive (CG) reviews across the dataset splits. The
Training Set has 28,000 reviews (14,000 real and
14,000 deceptive), while the Validation Set and
Testing Set each contain 6,000 reviews (3,000 real
and 3,000 deceptive). This balanced split ensures that
neither class dominates training or evaluation.
4.3 Training vs Validation Accuracy
Graph
Figure 5: Training vs validation accuracy graph.
Figure 5 compares training accuracy and
validation accuracy at different training set sizes. As
the training size increases from 10% to 90%, the
model’s performance improves, with both accuracies
converging closely. This indicates that the model
generalizes well, with minimal overfitting or
underfitting.
4.4 Feature Extraction for Reviews
Graph
The Feature Importance graph highlights the top 10
features contributing the most to the classification of
deceptive and real reviews.
Figure 6: Feature extraction for reviews graph.
Figure 6 shows the feature importance graph.
These features are ranked using the Random
Forest model, where higher scores indicate stronger
influence on the model’s decision-making.
4.5 Confusion Matrix
Figure 7: Confusion matrix.
The confusion matrix in Figure 7 is a graphical
representation of the model’s classification
performance that contrasts predicted outputs with
actual labels. It displays the number of correctly and
incorrectly classified instances for both real (OR) and
fake (CG) reviews, enabling a detailed assessment of
classification accuracy and helping to detect potential
misclassification errors.
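The relationship between the confusion matrix and the evaluation metrics named in the methodology can be sketched as follows. The label lists are made-up toy data, not results from this study, and treating CG (deceptive) as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive="CG"):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, and F1 from the four counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels for illustration only (not the paper's results).
y_true = ["CG", "CG", "CG", "OR", "OR", "OR", "OR", "CG"]
y_pred = ["CG", "CG", "OR", "OR", "OR", "CG", "OR", "CG"]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
print(counts)  # (3, 1, 1, 3)
print(scores)
```

Here one deceptive review is missed (a false negative) and one real review is flagged (a false positive), so accuracy, precision, recall, and F1 all equal 0.75 on this toy data.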
5 CONCLUSIONS
The Deceptive Review Detection System
detects fraudulent reviews by using Natural
Language Processing (NLP) and Machine Learning
(ML) methods. Through a combination of text
preprocessing, sentiment analysis, and a
Random Forest classifier, the system achieves high
accuracy in distinguishing between genuine and fake reviews. The
integration of a user-friendly Streamlit interface
allows analysis of both individual and bulk
reviews, making it accessible to a wide range of users.
By promoting transparency and trust in online
platforms, this system helps safeguard consumers
from deceptive reviews, ultimately contributing to a
more reliable and secure e-commerce environment.
REFERENCES
A. A. Rathore, G. L. Bhadane, A. D. Jadhav, K. H. Dhale, and J. D. Muley, “Deceptive Reviews Detection Using NLP Model and Neural Network Model,” International Journal of Engineering Research & Technology (IJERT), vol. 12, no. 05, May 2023.
H. Tufail, M. U. Ashraf, K. Alsubhi, and H. M. Aljahdali, “The Effect of Deceptive Reviews on e-Commerce During and After Covid-19 Pandemic: SKL-Based Fake Reviews Detection,” in IEEE Access, vol. 10, pp. 25555-25564, 2022, doi: 10.1109/ACCESS.2022.3152806.
M. Liu, Y. Shang, Q. Yue, and J. Zhou, “Detecting Deceptive Reviews Using Multidimensional Representations With Fine-Grained Aspects Plan,” in IEEE Access, vol. 9, pp. 3765-3773, 2021, doi: 10.1109/ACCESS.2020.3047947.
M. Abdulqader, A. Namoun, and Y. Alsaawy, “Deceptive Online Reviews: A Unified Detection Model Using Deception Theories,” in IEEE Access, vol. 10, pp. 128622-128655, 2022, doi: 10.1109/ACCESS.2022.3227631.
A. Pandit, “Deceptive Review Detection Using Classification,” International Journal of Computer Applications, Foundation of Computer Science, 2018.
R. Mohawesh, M. Hasan, and E. Damiani, “Deceptive Reviews Detection: A Survey,” in IEEE Access, vol. 9, pp. 65771-65802, 2021, doi: 10.1109/ACCESS.2021.3075573.
R. Chauhan, R. Popli, and I. Kansal, “A Comprehensive Review on Deceptive Images/Videos Detection Techniques,” 2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 2022, pp. 1-6, doi: 10.1109/ICRITO56286.2022.9964871.
R. Catelli et al., “A New Italian Cultural Heritage Data Set: Detecting Deceptive Reviews With BERT and ELECTRA Leveraging the Sentiment,” in IEEE Access, vol. 11, pp. 52214-52225, 2023, doi: 10.1109/ACCESS.2023.3277490.
T. Sree and R. Tripathi, “Deceptive Review Detection using Evidential Classifier,” 2023 Second International Conference on Advances in Computational Intelligence and Communication (ICACIC), Puducherry, India