The study highlights the superiority of Bi-LSTM,
especially in addressing imbalanced classes, as
reflected in its improved precision, recall, and overall
F1-scores. These results support the transition from
traditional models like Naive Bayes to deep learning
models such as Bi-LSTM for more robust
performance in cyberbullying detection
tasks (Adeyinka Orelaja, 2024).
A related study explored a novel method for
cyberbullying detection by combining Support
Vector Machine (SVM) techniques with NLP
methodologies. The hybrid approach aims to enhance
detection by leveraging the strengths of both SVM,
known for its robustness in classification tasks, and
NLP, which excels at representing textual data. The
findings, presented at ADICS 2024, indicate a
significant advancement in the application of
machine learning techniques to social media
monitoring and online safety (J. Sathya, 2024).
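A minimal sketch of such a hybrid pipeline is given below, assuming scikit-learn; the toy corpus, labels, and hyperparameters are placeholders for illustration rather than the configuration of the cited study.

# Illustrative TF-IDF + linear SVM pipeline for cyberbullying detection.
# The toy corpus, labels, and hyperparameters are placeholders, not the
# cited study's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

texts = [
    "you are worthless and everyone hates you",   # bullying
    "great match yesterday, well played",         # not bullying
    "nobody wants you here, just leave",          # bullying
    "looking forward to the weekend trip",        # not bullying
] * 25                                            # small repeated toy corpus
labels = ["bullying", "not_bullying", "bullying", "not_bullying"] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# NLP feature extraction (TF-IDF over word uni/bigrams) feeding an SVM classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC(C=1.0)),
])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

The pipeline couples the NLP feature representation with the SVM decision boundary in a single object, so the same vectorizer is reused unchanged at prediction time.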
The same work reports Naive Bayes and Bi-LSTM
performance for attribute-specific cyberbullying
detection. For the Naive Bayes model, F1-scores
varied across attributes: Religion reached an F1-score
of 0.91, while Not bullying performed less effectively
at 0.60. The Bi-LSTM model showed superior
performance, recording an F1-score of 0.97 for Age
and 0.98 for Ethnicity. This comparative analysis
highlights the strength of Bi-LSTM in handling
imbalanced classes, improving both precision and
recall across key attributes in the dataset. The study,
published in the Journal of Electronic & Information
Systems on February 28, 2024, underscores the shift
from traditional classifiers like Naive Bayes to
advanced deep learning models like Bi-LSTM for
improved accuracy in detecting cyberbullying
behaviors across multiple dimensions (Adeyinka
Orelaja, 2024).
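A minimal Bi-LSTM classifier of the kind compared above can be sketched with Keras; the vocabulary size, sequence length, label set, and layer sizes below are assumptions for illustration, not the configuration reported in the cited paper.

# Illustrative Bi-LSTM classifier for attribute-specific cyberbullying labels.
# Vocabulary size, sequence length, class count, and layer sizes are assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 100, 6   # e.g. age, ethnicity, gender, religion, other, not bullying

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),              # learned word embeddings
    layers.Bidirectional(layers.LSTM(64)),          # reads each post forwards and backwards
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x_train: integer-encoded, padded sequences of length MAX_LEN; y_train: integer class ids.
# model.fit(x_train, y_train, validation_split=0.1, epochs=5, batch_size=64)
# Per-attribute precision, recall, and F1 (as in the comparison above) can then be
# computed with sklearn.metrics.classification_report on the held-out predictions.

Reading each post in both directions lets the model use context on either side of an offensive term, which is one plausible reason for the gains over Naive Bayes reported above.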
Another study analyzed the performance of several
classifiers on a custom-built dataset designed to
capture various aspects of cyberbullying. The
Multinomial Naive Bayes (Multinomial-NB)
classifier achieved an F1-score of 0.82, with balanced
precision (0.83) and recall (0.80). Support Vector
Machine (SVM) performed similarly with an F1-score
of 0.83, and Logistic Regression showed competitive
results with an F1-score of 0.79. Stochastic Gradient
Descent (SGD) had higher precision (0.91) but lower
recall (0.58), yielding an F1-score of 0.71, while the
Shallow Neural Network (Shallow NN) recorded an
F1-score of 0.77, reflecting moderate performance.
The study underscores the variability in performance
across classifiers when addressing complex factors
such as aggressive language and intent to harm,
indicating the potential for hybrid or ensemble
methods to improve detection accuracy (Naveed
Ejaz, 2024).
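A side-by-side comparison of this kind is typically run by fitting each classifier on the same features and computing macro-averaged precision, recall, and F1; the sketch below assumes scikit-learn and TF-IDF features, which may differ from the study's exact setup.

# Illustrative comparison of the classifiers discussed above on shared TF-IDF
# features; data loading, vectorizer settings, and hyperparameters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_recall_fscore_support

def evaluate(clf, X_train, y_train, X_test, y_test):
    """Fit one classifier and return macro precision, recall, and F1."""
    clf.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, clf.predict(X_test), average="macro", zero_division=0)
    return p, r, f1

classifiers = {
    "Multinomial-NB": MultinomialNB(),
    "SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(loss="hinge"),
    "Shallow NN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=300),
}

# Assuming train_texts/test_texts and y_train/y_test are already loaded:
# vec = TfidfVectorizer(ngram_range=(1, 2))
# X_train, X_test = vec.fit_transform(train_texts), vec.transform(test_texts)
# for name, clf in classifiers.items():
#     print(name, evaluate(clf, X_train, y_train, X_test, y_test))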
A survey reviewed various methods for hate speech
detection across multiple social media platforms,
highlighting several machine learning models and
techniques applied to datasets from platforms such as
YouTube, Twitter, and MySpace. For instance, Chen
et al. used an unsupervised rule-based approach built
on lexical and syntactic matching on YouTube data,
achieving a precision of 0.98 and a recall of 0.94.
Xiang et al. applied semi-supervised logistic
regression using topic modeling on Twitter, reporting
an F1-score of 0.84. A hybrid CNN model trained on
character and Word2vec embeddings by Park and
Fung achieved an F1-score of 0.73. Additionally,
Wiegand et al. used an SVM with lexical and
linguistic features along with word embeddings,
reaching an F1-score of 0.81 on datasets from Twitter,
Wikipedia, and UseNet. The study emphasizes the
effectiveness of both supervised and unsupervised
approaches, as well as the potential of word
embeddings and deep learning methods for improving
the accuracy of hate speech detection in multilingual
and multi-platform environments (Areej Al-Hassan,
2019).
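As one concrete illustration of the embedding-based approaches surveyed, documents can be represented by averaged pretrained word vectors and passed to an SVM; the use of gensim's glove-twitter-25 vectors below is an assumption for demonstration, not the setup of any reviewed system.

# Illustrative averaged-word-embedding features fed to an SVM, in the spirit of
# the embedding-based methods surveyed above (not a reproduction of any cited work).
import numpy as np
import gensim.downloader as api
from sklearn.svm import LinearSVC

word_vectors = api.load("glove-twitter-25")   # small pretrained Twitter embeddings

def embed(text):
    """Average the embeddings of in-vocabulary tokens; zeros if none match."""
    tokens = [t for t in text.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

# Assuming train_texts and train_labels are already loaded:
# X_train = np.vstack([embed(t) for t in train_texts])
# clf = LinearSVC().fit(X_train, train_labels)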
3 PROPOSED SYSTEM
ARCHITECTURE
The proposed system aims to create a robust model
that can accurately identify the sentiment of text data
using advanced Natural Language Processing (NLP)
techniques. The primary objective is to develop a
classifier model trained on textual data, which can
then be used as a pre-trained model to predict
sentiment in new and unseen text inputs. To achieve
this, the first step involves preparing and collecting a
diverse text dataset and assigning each entry the
corresponding sentiment label (positive, negative, or
neutral). The dataset may consist of multiple
languages, including Tamil, to reflect diverse
linguistic use cases. After data collection,
preprocessing steps like tokenization, stemming or
lemmatization, and stopword removal are applied to
prepare the text for model training. Handling missing
values and addressing class imbalance are also crucial
parts of the process: missing text entries can be
imputed with the most frequent token in the corpus,
while missing numerical fields can be filled with the
median.
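A brief sketch of these preprocessing steps is shown below, assuming NLTK for stopwords and stemming; the regular-expression tokenizer and the most-frequent-token imputation are illustrative choices rather than the final system design.

# Illustrative preprocessing: tokenization, stopword removal, stemming, and
# simple imputation of missing text entries. Library choices are assumptions.
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))   # a Tamil stopword list would be added similarly
STEMMER = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, drop stopwords, and stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

def fill_missing(texts):
    """Impute missing entries with the most frequent token seen in the corpus."""
    counts = Counter(tok for doc in texts if doc for tok in preprocess(doc))
    filler = counts.most_common(1)[0][0] if counts else ""
    return [doc if doc else filler for doc in texts]

print(preprocess("Bullying comments were removed from the thread"))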