To investigate the model's ability to adapt to
changing threats and to identify avenues for
improvement, such as transformer-based deep learning
models and real-time learning.
2 LITERATURE REVIEW
This section discusses the progress of email
spam-detection technology, covering both
conventional and current approaches.
2.1 Traditional Spam Filtering Methods
One of the most common methods was Bayesian
filtering, a probabilistic method that estimates the
probability of an email being spam based on word
frequency (Androutsopoulos et al., 2000). Although
such approaches worked well at catching spam that
could be identified purely by the presence of specific
keywords, they struggled to keep up with the changing
tactics spammers used (for example, obfuscation
techniques or adversarial attacks).
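To make the word-frequency idea concrete, the following is a minimal sketch of a Naïve Bayes spam score with Laplace smoothing. It is an illustration under stated assumptions rather than the filter evaluated in the cited work: the function names `train` and `spam_probability`, the whitespace tokenization, and the toy messages are all hypothetical.

```python
import math
from collections import Counter

def train(messages):
    """Count words per class. `messages` is an iterable of (text, label)
    pairs with label 'spam' or 'ham'; both classes must be present."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def spam_probability(text, word_counts, doc_counts):
    """Return the posterior probability that `text` is spam."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the class prior, then add one log-likelihood per word.
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

# Hypothetical toy corpus, for illustration only.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
WORD_COUNTS, DOC_COUNTS = train(TRAINING)
```

Because the score depends only on which words appear, a spammer who rewrites a keyword (for example "fr3e") sidesteps the learned counts, which is the weakness noted above.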
2.2 Machine Learning in Spam Detection
Machine learning ushered in a new era of spam
detection by allowing models to learn from patterns in
the data rather than relying on hard-coded rules.
Classic machine learning models such as Support
Vector Machines (SVMs), Decision Trees and Naïve
Bayes have been used to classify emails as spam or ham
based on features extracted from the data, such as
word frequency, metadata and header information [2].
Despite showing improved performance over earlier
rule-based approaches, these models struggle with
evolving spam types and with the scale of email
datasets.
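As an illustration of the kind of features mentioned above, the sketch below uses Python's standard `email` package to derive word-frequency, metadata and header features from a raw message. The specific features chosen (`subject_all_caps`, `has_reply_to`, `num_received_headers`) are assumptions made for this example and are not taken from the cited studies.

```python
from collections import Counter
from email import message_from_string

def extract_features(raw_email):
    """Build a simple feature dictionary from a raw RFC 5322 message string."""
    msg = message_from_string(raw_email)
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages return a list of parts; this sketch skips them.
        body = ""
    words = body.lower().split()
    return {
        # Word-frequency features
        "word_counts": Counter(words),
        "num_words": len(words),
        # Header and metadata features (hypothetical choices)
        "subject_all_caps": msg.get("Subject", "").isupper(),
        "has_reply_to": msg.get("Reply-To") is not None,
        "num_received_headers": len(msg.get_all("Received", [])),
    }

# Hypothetical message, for illustration only.
RAW = (
    "From: a@example.com\n"
    "Subject: WIN NOW\n"
    "Reply-To: b@example.com\n"
    "\n"
    "free money free\n"
)
FEATURES = extract_features(RAW)
```

A dictionary of this shape can then be vectorized and passed to any of the classifiers named above.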
2.3 Deep Learning and Advanced Techniques
Recent advances in deep learning have further
improved spam detection. Neural networks, especially
Convolutional Neural Networks (CNNs) and Recurrent
Neural Networks (RNNs), have been used to analyse
complex patterns in the text and metadata of emails
(W. S. Yerazunis, 2004). More recent
transformer-based models such as BERT have been
applied to enhance context awareness and better
identify spam emails (J. Goodman, 2005). In addition,
ensemble methods such as Random Forests and
XGBoost have been employed to improve classification
performance through the aggregation of several
model outputs (H. Drucker et al., 1999).
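The aggregation step behind ensemble methods can be sketched generically as follows. This is a simplified illustration of combining several model outputs by voting or averaging; it does not reproduce the internals of Random Forests or XGBoost, and the function names and the 0.5 threshold are assumptions.

```python
def majority_vote(predictions):
    """Combine hard labels ('spam' or 'ham') from several models.

    A tie resolves to 'ham', which favours avoiding false positives.
    """
    spam_votes = sum(1 for p in predictions if p == "spam")
    return "spam" if spam_votes > len(predictions) / 2 else "ham"

def average_probability(probabilities, threshold=0.5):
    """Combine per-model spam probabilities by simple averaging.

    Returns the final label together with the mean probability.
    """
    mean = sum(probabilities) / len(probabilities)
    label = "spam" if mean >= threshold else "ham"
    return label, mean
```

Averaging probabilities keeps information about each model's confidence, whereas hard voting treats every model's decision as equally certain.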
2.4 Gaps in Existing Research
Despite substantial progress, existing spam detection
systems still face several obstacles:
• Difficulty identifying advanced phishing and
spam sent as part of adversarial attacks.
• False-positive rates that are too high, causing
legitimate emails to be classified as spam.
• Scalability issues in processing large and
changing email datasets.
This review of the literature points to the need for a
hybrid spam detection approach that improves
adaptability and accuracy.
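For reference, the false-positive rate mentioned among the obstacles above is the share of legitimate emails that are wrongly flagged as spam. A minimal sketch of the calculation follows; the function name and the label strings are assumptions made for this example.

```python
def false_positive_rate(y_true, y_pred):
    """Fraction of ham emails that were predicted as spam: FP / (FP + TN)."""
    pairs = list(zip(y_true, y_pred))
    false_pos = sum(1 for t, p in pairs if t == "ham" and p == "spam")
    true_neg = sum(1 for t, p in pairs if t == "ham" and p == "ham")
    ham_total = false_pos + true_neg
    # With no ham emails in the sample the rate is undefined; return 0.0 here.
    return false_pos / ham_total if ham_total else 0.0
```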
3 METHODOLOGY
In the methodology section, we describe the system
design, the data processing pipeline, the integration
of machine learning, and the evaluation approach.
3.1 System Architecture
The proposed system is composed of four major
modules:
3.1.1 Email Data Collection
A data collection module gathers email data from
publicly available datasets (such as the Enron Spam
dataset) and real-world email traffic. The dataset
comprises both spam and legitimate (ham) emails,
ensuring a balanced and diverse corpus.
3.1.2 Preprocessing and Feature Engineering
After data collection, preprocessing is performed to
extract meaningful features:
Tokenization and Stop-word Removal:
The email text is broken into tokens,
and common words (e.g., "and") are
removed to enhance relevant content
extraction.
Stemming and Lemmatization:
Words are reduced to their root forms
for text normalization.
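The preprocessing steps described above can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline used in this work: the stop-word set is a small hypothetical subset, and `crude_stem` is a naive suffix stripper standing in for a proper stemmer or lemmatizer.

```python
import re

# Illustrative subset only; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "on"}
# Crude suffix list, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little signal."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, remove stop words, then normalize each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]
```

The naive stemmer can over-strip (for instance, "prizes" becomes "priz"), which is one reason established stemming or lemmatization tools are usually preferred in practice.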