Exploring the Current State of Machine Learning in Spam Filters
Sizhe Teng
a
Department of Mathematics, University of California, Santa Barbara, Santa Barbara, U.S.A.
Keywords: Spam Detection, Machine Learning, Email Security, Phishing Prevention, Bayesian Filtering.
Abstract: This paper systematically analyzes the development of spam detection technology. Spam poses a significant
threat to network security, personal privacy, and enterprise productivity. Traditional filtering methods, such
as rule-based filtering and Bayesian classification, have difficulty adapting and coping with evolving spam
strategies. This paper focuses on machine learning and evaluates its adaptability, feature extraction
capabilities, and algorithmic effectiveness in enhancing spam detection. By analyzing algorithms such as
Naive Bayes, Random Forest, and LSTM, this study highlights the improvements in adaptability and accuracy
brought by machine learning. However, existing challenges include dependence on labelled datasets and
computing resources. This study provides theoretical and practical insights for building adaptive spam
detection systems. This study not only provides theoretical support for the development of spam filtering
technology but also provides practical references for enterprises and individuals to build efficient and
intelligent spam defense systems in practical applications, which helps to improve email security, which is
crucial to maintaining trust in the digital economy.
1 INTRODUCTION
Email has always been a core communication tool,
and its security is directly related to personal privacy,
corporate assets, and even national network security.
However, with the advent of the digital age of artificial
intelligence, spam has seriously troubled the majority
of users and has become more stubborn and complex.
While most spam is just some companies advertising
their products, some other spam acts as a carrier for
phishing attacks, scams and malware. Its harm is not
limited to harassing users, but also leads to a series of
serious consequences such as phishing attacks,
identity theft, and loss of corporate productivity. Spam
can lead to information breaches, fraud, and loss of
business productivity, so efficient spam detection is
critical to cybersecurity in the digital age. Exploring
efficient spam filtering technology is not only a
technical need, but also an inevitable choice to
maintain the healthy development of the Internet
economy.
Traditional spam filtering technologies, such as
rule-based filtering and Bayesian classification, have
played an important role in mitigating the threat of
spam (Wang & Peng, 2010). However, these methods
have difficulty keeping up with the evolving strategies
a
https://orcid.org/0009-0009-6123-3629
adopted by spammers. The rise of AI-generated spam,
dynamic IP blocking and sophisticated phishing
techniques requires more advanced detection
mechanisms. Academia and industry have long been
committed to optimizing spam detection technology.
Early research focused on rule filtering and Bayesian
classification. However, these traditional methods
have gradually revealed their limitations because
static rules are easily bypassed and cannot cope with
dynamic attack strategies. In recent years, machine
learning technology has become a research hotspot.
For example, (Kumar, 2020) demonstrated the
advantage of machine learning in accuracy; further
pointed out that deep learning models can capture text
contextual relationships and significantly improve the
recognition rate of complex spam. Despite this,
existing research focuses on a single algorithm (Mao,
2024), and the defense mechanism against dynamic
adversarial attacks remains to be explored in depth.
This paper aims to analyze the evolution of spam
detection methods, evaluate the limitations of
traditional filtering techniques, and introduce some
filtering methods based on machine learning. The
following sections will discuss the types and threats of
spam, review the weaknesses of traditional filtering
Teng, S.
Exploring the Current State of Machine Learning in Spam Filters.
DOI: 10.5220/0013701900004670
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 2nd International Conference on Data Science and Engineering (ICDSE 2025), pages 555-559
ISBN: 978-989-758-765-8
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
555
methods, and explore how machine learning can
enhance spam classification. This paper attempts to
provide a theoretical basis and practical reference for
building an efficient and adaptive spam defense
system.
2 THE HARM OF SPAM
Spam has become a common phenomenon in modern
life, bringing with it a variety of potential threats,
including serious risks to network security and
personal privacy. Many spam emails look harmless,
but there are often great risks and threats hidden
behind them. And when people find themselves
suffering from these losses, it is usually too late.
First of all, one of the main threats of spam is
phishing attacks. Criminals can use some deceptive
emails to obtain sensitive information from
recipients. They will make the email look like an
official email. These emails will pretend to be
legitimate information from relatives, well-known
companies, social media platforms, banks, or national
institutions. The user clicks on the malicious link in
the email and enters a phishing website disguised as
an official page. When they click on the link, they
need to enter sensitive or private information, such as
important account passwords, credit card details or
passport and ID numbers. This whole process is
monitored by criminals. Once the user enters the
information, criminals can use this information to
make money. Many victims don't even know that they
have been deceived afterwards. Some victims only go
to modify the password when their property has been
damaged.
In addition, the unnecessary consumption of
productivity and financial losses of enterprises and
institutions when dealing with spam are also one of
the hazards. In order to deal with the threats brought
by spam software, institutions and organizations need
to spend a lot of resources to filter and manage spam.
The financial loss caused by this is far greater than the
cost of spam. Many IT teams of enterprises need to
spend a lot of energy to deploy security measures to
prevent the risks brought by spam. Many employees
will waste precious time doing repetitive and
meaningless things. Finally, this will lead to a
decrease in productivity.
Spam will reduce people's trust in online
communications and the Internet economy, causing
economic losses in the long run. When people are
attacked by too many spam emails, many people will
have a sense of distrust in things on the Internet. This
will lead to a decrease in people's willingness to
choose the digital economy. For example, users may
become more cautious and reluctant to click on links
in emails or participate in online transactions to avoid
potential security risks. Even many people are
reluctant to click on emails from legitimate
companies or governments because they are afraid
that it is a scam. Or sometimes legitimate emails are
hidden in spam, causing people to ignore the emails
they need to check. These will have a negative impact
on the development of many industries that have a
positive impact on the economy, such as e-commerce,
online financial services and other Internet industries,
and ultimately have a negative impact on overall
economic growth.
Spam is not only a nuisance and economic loss to
individuals and businesses, but also a challenge and
threat to the entire Internet ecosystem. Its impact
covers consumers, businesses, government agencies,
advertisers and even the entire digital economy. In the
long run, the proliferation of spam will undermine
people's trust in online communications and thus
affect the development of the global Internet
economy.
3 COMMON TYPES OF SPAM
CURRENTLY
Spam has become more diverse over time. Each form
has different characteristics and purposes (Ma, 2024).
Understanding the different types of spam can help us
recognize and avoid spam. More importantly, it can
help us design effective detection systems.
3.1 Advertising Spam
Advertising spam is the most common type of spam.
These spam messages usually send advertisements to
users. These advertisements include products,
services, and various websites. Even though most
advertising spam messages are legitimate, these
messages are essentially sent without the consent of
the recipient. Advertising spam is characterized by its
large number. Especially when the recipient's mail
address is leaked, advertising spam will swarm in.
Some companies or organizations obtain email
addresses of potential users through unethical means,
which makes it difficult for users to control the influx
of spam. Due to its large number and legal fringe, this
behavior is usually difficult to punish.
3.2 Phishing Emails
The purpose of phishing emails is to obtain
information from the victim through clever disguises.
ICDSE 2025 - The International Conference on Data Science and Engineering
556
These emails usually appear to come from trusted
organizations, such as banks, government agencies,
or e-commerce platforms. They remind users to
update their account details, reset passwords, or claim
that the user's information needs to be checked.
Phishing attacks can have serious consequences.
Most of the time, victims suffer financial losses.
Sometimes victims also suffer identity theft. For
companies and institutions, phishing can lead to data
breaches.
3.3 Second Section
Scam emails prey on simple human emotions to
deceive the recipient. These emotions are usually
greed, urgency, fear, or lust. They will tell the
recipient that they have won a lottery or a
competition. Even if the person did not buy or enter
any lottery or competition. After the victim pays a fee,
they will disappear. Sometimes, the fraudster will say
that the recipient has inherited a fortune from a distant
relative. Again, a fee must be paid upfront. They will
tell investors about non-existent investment
opportunities. These investment opportunities appear
to be high-return projects, but in fact, the money will
be used for fraudulent business activities. Although
scam emails have been around for decades, they are
still evolving and becoming more and more difficult
to identify. There are even many AI-driven scams
emerging, where attackers use automated systems to
create more convincing messages, thereby increasing
the effectiveness of these fraudulent activities.
4 DISADVANTAGES OF
TRADITIONAL FILTRATION
METHODS
The current traditional spam filtering methods mainly
include Rule-based filtering, Bayesian filtering and
IP-based blacklists, but these methods have many
limitations and are difficult to effectively deal with
modern spam strategies. These filters help people
identify and lock spam, so people can reduce losses.
However, while traditional spam filters are useful,
they have significant limitations. Especially with the
increasing popularity of artificial intelligence, it
hinders their effectiveness in detecting modern spam
strategies.
4.1 Rule-Based Filtering
Rule-based filtering is an earlier detection method. It
classifies emails according to some rules, usually
including pre-defined ones. The system automatically
detects keywords such as "free", "lottery", "product",
etc. contained in the email. In this way, it analyzes
whether the email is spam. In addition, some rules
analyze the format and attachments of the email.
However, spammers can easily circumvent the
rules. They can replace keywords with words that the
system cannot detect. For example, "lott3ry" instead
of "lottery". In addition, due to the rise of artificial
intelligence, spam has become more and more elusive.
The forms and types of spam are changing all the time.
But the rules need to be constantly updated and
maintained. This is very inefficient and time-
consuming. And the accuracy rate is not high.
4.2 Bayesian Filtering
Bayesian Filtering analyzes the frequency of words in
an email and calculates the probability of spam. In
this process, Bayesian Filtering uses a statistical
theorem, namely ”Bayes' rule“, to calculate and
identify spam (Han,2023).
Bayesian filtering is a spam classification method
based on statistical probability. It analyzes the
frequency of words in an email and calculates the
probability that the email is spam or normal (Lu &
Yin, 2008). This method is considered to be more
accurate than rule-based filtering because it can
gradually learn email features as users use it and
improve detection accuracy (Chakraborty, 2012).
However, Bayesian filtering also has its
limitations. Spammers can also use many methods to
avoid Bayesian Filtering detection. Spammers can
choose to add some normal words or words that are
unlikely to appear in spam to confuse Bayesian
Filtering. This phenomenon is called "Bayes
poisoning". For example, spammers can choose to
add scientific and technological words, political
current affairs words, or the names of legitimate large
companies to evade detection. Also, Bayesian
Filtering lacks the test of spam with artificial
intelligence.
4.3 IP-based Blacklists
Internet Protocol addresses are addresses used to
identify devices in a computer network. An IP address
is a unique identifier for a device on a computer
network and is used to enable communication
between devices. IP blacklists are a common spam
filtering strategy. When people find known spammers,
they can record the IP addresses in a blacklist. That is,
all emails from these IP addresses will be considered
spam. This method has a certain effect on blocking
spam from known malicious servers.
Exploring the Current State of Machine Learning in Spam Filters
557
The effectiveness of IP blacklists is subject to
many challenges. Spammers can use dynamic IP
addresses. That is, senders can frequently change IPs
through cloud servers. This means that IP-based
blacklists are difficult to track the sender's new
address. And the update speed of the blacklist often
cannot keep up with the speed at which spammers
change IPs. Ultimately, a large number of new spam
can bypass the interception mechanism.
In addition, some spammers use shared IPs. This
will cause shared IPs to be mistaken as spam sources,
resulting in normal emails being intercepted.
5 ADVANTAGES OF MACHINE
LEARNING IN SPAM
CLASSIFICATION
Machine learning is a branch of artificial intelligence.
Machine learning enables computers to learn from
data and make predictions or decisions based on
learned patterns. This process does not require
humans to explicitly program rules. The core idea of
this process is to let the machine discover patterns in
the data, analyze the data and use these discovered
patterns to classify the data.
Machine learning has revolutionized spam
detection with its intelligent and efficient
classification technology. Compared to traditional
methods, machine learning models continue to learn
from evolving spam strategies and provide higher
accuracy.
5.1 Adaptability
Machine learning has shown great adaptability in the
process of spam detection. When spam has been
updated at a very fast speed, machine learning is also
adapting to new environments at a rapid speed.
Adapting to new spam patterns is the most important
advantage of machine learning over traditional
methods. Traditional spam filters rely on static rules.
This means that these rules need to be constantly
updated manually to maintain accuracy. Machine
learning achieves the effect of dynamic defense by
analyzing large data sets. This significantly reduces
costs and improves accuracy.
5.2 Feature Extraction
Feature extraction can extract the most representative
information from raw data. Because the content of
emails is complex and diverse, it is difficult and
inefficient for computers to process this raw
information directly. The purpose of feature
extraction is to simplify this complex text information
into simple and understandable data, so that computer
models can process it more efficiently. This
technology can significantly enhance the efficiency
of spam classification.
5.3 Machine Learning Algorithms
Currently, common spam filtering machine learning
algorithms include Naïve Bayes, Random Forests,
and Long Short-Term Memory (LSTM).
Naive Bayes is an algorithm based on Bayes'
theorem. Naive Bayes calculates the probability of a
word in a spam email given the category of spam or
normal email. Each word is relatively independent in
the calculation process. Naive Bayes is simple to
calculate and efficient, especially for small data sets.
However, if there is a correlation between the words
in spam email, it may affect the result (Agarwal &
Kumar, 2018).
Random forests consist of multiple independently
trained decision trees, which we call trees. Each tree
randomly extracts a subset of the original data during
training and calculates the best answer. This
randomness makes each tree somewhat different,
which improves accuracy. Random forests combine
the predictions of all trees. Finally, the most
reasonable or most selected result is followed. It is
like a team, each person makes an independent
judgment and finally decides the category of the
email by voting. This method is more accurate than a
single person's judgment because different opinions
in the team can complement each other and reduce
errors. The disadvantage of random forests is that
training multiple trees requires more memory and
processing power, and predictions on large data sets
may be slower than simple models such as naive
Bayes.
Deep learning uses multi-layer neural networks to
automatically extract data features and can identify
more complex spam patterns (Yu, 2023). Traditional
machine learning methods usually rely on manually
designed features and are difficult to adapt to
changing spam strategies (Zhang, 2024).
LSTM (Long Short-Term Memory) is a Recurrent
Neural Network that is specifically designed to
process sequence data, especially data that needs to
capture contextual relationships, such as text, time
series, and speech recognition (Hans, 2020). LSTM
has a unique memory mechanism. It can remember
important information in an email and help us
determine whether the email is spam or normal. This
ICDSE 2025 - The International Conference on Data Science and Engineering
558
memory mechanism helps us overcome the gradient
vanishing problem of traditional RNN when
processing long sequences, and thus significantly
improves the accuracy in spam classification,
especially in long data sets.
6 CONCLUSION
This article systematically analyzes the development
of spam filtering technology from the perspective of
technological evolution. The limitations and
shortcomings of traditional methods are analyzed and
evaluated. The core value of machine learning in
modern spam defense is demonstrated. The harm of
spam is multidimensional. In addition to harassment,
its derivative risks include phishing attacks,
productivity loss, and long-term erosion of trust in the
digital economy. Traditional methods are ineffective.
Although rule-based filtering, Bayesian
classification, and IP blacklists are effective in the
early stages, they are difficult to combat modern spam
because they rely on static rules. Spammers use
loopholes such as obfuscation techniques, Bayesian
poisoning, and dynamic IP blocking to bypass
traditional filters.
Machine learning models dynamically learn from
evolving spam patterns, thereby achieving a
constantly adapting filtering system. Algorithms such
as Naive Bayes, Random Forest, and LSTM networks
can improve detection accuracy by analyzing
contextual relationships and complex patterns in
email content. Despite the many advantages of
machine learning models, they require large labeled
data sets, computing resources, and continuous
retraining to remain effective.
As spam strategies continue to evolve, deep
learning is essential to enhancing email security.
Future research should focus on developing real-time,
computational efficiency. Lightweight spam
detection models can provide a good balance between
accuracy and computational efficiency. Deep
learning frameworks with low computational costs
should be developed to adapt to enterprise-level real-
time filtering needs. In addition, integrating AI-based
collaborative filtering across multiple platforms can
help build a stronger defense against spam
threats.The federated filtering system allows spam
filters to learn from user data from multiple email
providers without compromising privacy. This
decentralized approach enhances detection
capabilities while keeping users safe.
REFERENCES
Agarwal, K., Kumar, T.V., 2018. Email spam detection
using integrated approach of Naïve Bayes and particle
swarm optimization. In 2018 Second International
Conference on Intelligent Computing and Control
Systems (ICICCS), 685-690.
Chakraborty, N., Patel, A., Polytechnic, K., Raigarh, I.,
2012. Email spam filter using Bayesian neural
networks.
Han, X., 2023. Application of Bayesian optimization in
spam filtering. Journal of Xuzhou Institute of
Technology: Natural Science Edition, 38(2), 77-83.
Hans, R., 2020. LSTM-based short message service (SMS)
modeling for spam classification. (Doctoral
dissertation, Dalian University of Technology).
Kumar, N., Sonowal, S., Nishant, 2020. Email spam
detection using machine learning algorithms. In 2020
Second International Conference on Inventive Research
in Computing Applications (ICIRCA), 108-113.
Lu, Q., Yin, S., 2008. Research on spam classification
technology based on Bayesian theorem. Information
Technology (02), 126-128.
Ma, Z., 2024, Research and implementation of spam
filtering system. (Doctoral dissertation, Zhejiang
University) .
Mao, H., Research and implementation of spam filtering
algorithm. (Doctoral dissertation, Shanghai Jiaotong
University).
Wang, B., Pan, W., 2005. A review of content-based spam
filtering technology. Journal of Chinese Information
Processing, 19(5), 3-12.
Wang, Z., Peng, X., 2010. Spam filtering technology based
on machine learning. China Science and Technology
Information (6), 2.
Yu, Y., 2023. Spam detection method based on deep
learning. (Doctoral dissertation, Donghua University).
Zhang, J., 2024. Machine learning-based email spam filter.
Innovation in Science and Technology.
Exploring the Current State of Machine Learning in Spam Filters
559