Exploring the Current State of Machine Learning in Spam Filters

Sizhe Teng

Department of Mathematics, University of California, Santa Barbara, Santa Barbara, U.S.A.

Keywords: Spam Detection, Machine Learning, Email Security, Phishing Prevention, Bayesian Filtering.

Abstract: This paper systematically analyzes the development of spam detection technology. Spam poses a significant

threat to network security, personal privacy, and enterprise productivity. Traditional filtering methods, such

as rule-based filtering and Bayesian classification, have difficulty adapting and coping with evolving spam

strategies. This paper focuses on machine learning and evaluates its adaptability, feature extraction

capabilities, and algorithmic effectiveness in enhancing spam detection. By analyzing algorithms such as

Naive Bayes, Random Forest, and LSTM, this study highlights the improvements in adaptability and accuracy

brought by machine learning. However, existing challenges include dependence on labelled datasets and

computing resources. This study provides theoretical support for the development of spam filtering

technology and practical references for enterprises and individuals building efficient, adaptive spam

defense systems. Improving email security in this way is crucial to maintaining trust in the digital

economy.

1 INTRODUCTION

Email has always been a core communication tool,

and its security is directly related to personal privacy,

corporate assets, and even national network security.

However, in the age of artificial

intelligence, spam troubles a large share of users

and has become more persistent and complex.

While most spam is merely unsolicited advertising,

some of it acts as a carrier for

phishing attacks, scams, and malware. Its harm is not

limited to harassing users: it can also lead to

information breaches, fraud, identity theft, and loss

of corporate productivity, so efficient spam detection is

critical to cybersecurity. Exploring

efficient spam filtering technology is not only a

technical need but also necessary to

maintain the healthy development of the Internet

economy.

Traditional spam filtering technologies, such as

rule-based filtering and Bayesian classification, have

played an important role in mitigating the threat of

spam (Wang & Peng, 2010). However, these methods

have difficulty keeping up with the evolving strategies

https://orcid.org/0009-0009-6123-3629

adopted by spammers. The rise of AI-generated spam,

dynamic IP switching, and sophisticated phishing

techniques requires more advanced detection

mechanisms. Academia and industry have long been

committed to optimizing spam detection technology.

Early research focused on rule filtering and Bayesian

classification. However, these traditional methods

have gradually revealed their limitations because

static rules are easily bypassed and cannot cope with

dynamic attack strategies. In recent years, machine

learning has become a research hotspot.

For example, Kumar (2020) demonstrated the

accuracy advantage of machine learning, and it has been further

pointed out that deep learning models can capture

contextual relationships in text and significantly improve the

recognition rate of complex spam. Despite this,

much existing research focuses on a single algorithm (Mao,

2024), and defense mechanisms against dynamic

adversarial attacks remain to be explored in depth.

This paper aims to analyze the evolution of spam

detection methods, evaluate the limitations of

traditional filtering techniques, and introduce some

filtering methods based on machine learning. The

following sections will discuss the types and threats of

spam, review the weaknesses of traditional filtering

Teng, S.

Exploring the Current State of Machine Learning in Spam Filters.

DOI: 10.5220/0013701900004670

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 2nd International Conference on Data Science and Engineering (ICDSE 2025), pages 555-559

ISBN: 978-989-758-765-8

methods, and explore how machine learning can

enhance spam classification. This paper attempts to

provide a theoretical basis and practical reference for

building an efficient and adaptive spam defense

system.

2 THE HARM OF SPAM

Spam has become a common phenomenon in modern

life, bringing with it a variety of potential threats,

including serious risks to network security and

personal privacy. Many spam emails look harmless,

but they often conceal serious risks.

By the time people notice the resulting

losses, it is usually too late.

First, one of the main threats of spam is

phishing. Criminals use deceptive

emails to obtain sensitive information from

recipients. They make the email look

official, posing as a

legitimate message from relatives, well-known

companies, social media platforms, banks, or national

institutions. The user clicks on the malicious link in

the email and lands on a phishing website disguised as

an official page, where they are

asked to enter sensitive or private information, such as

account passwords, credit card details, or

passport and ID numbers. The whole process is

monitored by the criminals. Once the user enters the

information, the criminals can use it to

make money. Many victims do not even realize afterwards that they

have been deceived, and some change

their passwords only after their property has already been

damaged.

In addition, spam imposes unnecessary

productivity and financial losses on enterprises and

institutions. To deal with the threats brought

by spam, institutions and organizations need

to spend substantial resources to filter and manage it.

The resulting financial loss is far greater than the

cost of sending the spam. Enterprise IT teams must

spend considerable effort deploying security measures to

contain the risks brought by spam, and employees

waste valuable time on repetitive and

unproductive tasks. Ultimately, this leads to a

decrease in productivity.

Spam also reduces people's trust in online

communications and the Internet economy, causing

economic losses in the long run. When people are

bombarded with spam, many

come to distrust what they see on the Internet, which

reduces their willingness to

take part in the digital economy. For example, users may

become more cautious and reluctant to click on links

in emails or participate in online transactions to avoid

potential security risks. Many people are even

reluctant to open emails from legitimate

companies or governments because they are afraid

of being scammed. Sometimes legitimate emails are

buried among spam, causing people to overlook messages

they need to read. All of this harms

industries that contribute

positively to the economy, such as e-commerce,

online financial services, and other Internet industries,

and ultimately has a negative impact on overall

economic growth.

Spam is not only a nuisance and economic loss to

individuals and businesses, but also a challenge and

threat to the entire Internet ecosystem. Its impact

covers consumers, businesses, government agencies,

advertisers and even the entire digital economy. In the

long run, the proliferation of spam will undermine

people's trust in online communications and thus

affect the development of the global Internet

economy.

3 COMMON TYPES OF SPAM CURRENTLY

Spam has become more diverse over time. Each form

has different characteristics and purposes (Ma, 2024).

Understanding the different types of spam can help us

recognize and avoid spam. More importantly, it can

help us design effective detection systems.

3.1 Advertising Spam

Advertising spam is the most common type of spam.

These messages push advertisements for products,

services, and various websites to

users. Even though most

advertising spam messages are legitimate in content, they

are sent without the consent of

the recipient. Advertising spam is characterized by its

sheer volume: once a recipient's email

address is leaked, advertising spam floods in.

Some companies or organizations obtain the email

addresses of potential users through unethical means,

which makes it difficult for users to control the influx

of spam. Because of its volume and its position at the edge of legality, this

behavior is usually difficult to punish.

3.2 Phishing Emails

The purpose of phishing emails is to obtain

information from the victim through clever disguises.

These emails usually appear to come from trusted

organizations, such as banks, government agencies,

or e-commerce platforms. They remind users to

update their account details, reset passwords, or claim

that the user's information needs to be checked.

Phishing attacks can have serious consequences.

Most of the time, victims suffer financial losses.

Sometimes victims also suffer identity theft. For

companies and institutions, phishing can lead to data

breaches.

3.3 Scam Emails

Scam emails prey on basic human emotions to

deceive the recipient, usually

greed, urgency, fear, or lust. They may tell the

recipient that they have won a lottery or a

competition, even if the person never bought a ticket or entered

one. After the victim pays a fee,

the sender disappears. Sometimes, the fraudster will say

that the recipient has inherited a fortune from a distant

relative; again, a fee must be paid upfront. Others

pitch non-existent investment

opportunities that appear

to be high-return projects, when in fact the money

is used for fraudulent activities. Although

scam emails have been around for decades, they are

still evolving and becoming more and more difficult

to identify. There are even many AI-driven scams

emerging, where attackers use automated systems to

create more convincing messages, thereby increasing

the effectiveness of these fraudulent activities.

4 DISADVANTAGES OF TRADITIONAL FILTERING METHODS

The traditional spam filtering methods mainly

include rule-based filtering, Bayesian filtering, and

IP-based blacklists. These filters help people

identify and block spam and thereby reduce losses.

However, while traditional spam filters are useful,

they have significant limitations that make it difficult for them to cope with

modern spam strategies, especially as the

increasing availability of artificial intelligence

makes spam harder to detect.

4.1 Rule-Based Filtering

Rule-based filtering is one of the earliest detection methods. It

classifies emails according to a set of

pre-defined rules. The system automatically

detects keywords such as "free", "lottery", "product",

etc. contained in the email and uses them to decide

whether the email is spam. In addition, some rules

analyze the format and attachments of the email.

However, spammers can easily circumvent the

rules. They can replace keywords with variants that the

system cannot detect, for example "lott3ry" instead

of "lottery". In addition, with the rise of artificial

intelligence, spam has become increasingly elusive, and

its forms and types change all the time,

so the rules need to be constantly updated and

maintained. This is inefficient and time-consuming,

and the resulting accuracy is limited.
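The keyword rule and its evasion can be illustrated with a minimal sketch. The keyword set reuses the examples from the text; the function name `rule_based_is_spam` and the two sample messages are hypothetical and chosen only for illustration, not taken from any production filter.

```python
import re

# Hypothetical keyword list, reusing the examples mentioned in the text.
SPAM_KEYWORDS = {"free", "lottery", "product"}

def rule_based_is_spam(text: str, keywords=SPAM_KEYWORDS) -> bool:
    """Flag a message as spam if any whole word matches a fixed keyword."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return any(word in keywords for word in words)

plain = "You have won the lottery, claim your prize"
obfuscated = "You have won the lott3ry, claim your prize"

print(rule_based_is_spam(plain))       # True: "lottery" matches a rule
print(rule_based_is_spam(obfuscated))  # False: "lott3ry" slips past the rule
```

Catching the second message would require adding "lott3ry" (and every other variant) by hand, which is the maintenance burden described above. A broad keyword such as "product" would also flag many legitimate messages.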

4.2 Bayesian Filtering

Bayesian filtering is a spam classification method

based on statistical probability. It analyzes the

frequency of words in an email and uses

"Bayes' rule" to calculate the

probability that the email is spam or normal (Lu &

Yin, 2008; Han, 2023). This method is considered to be more

accurate than rule-based filtering because it

gradually learns email features as users use it,

improving detection accuracy (Chakraborty, 2012).
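For reference, the standard form of Bayes' rule that such a filter applies to a single word can be written as follows. The symbols S (spam), H (normal mail), and w (a word observed in the message) are notation introduced here for illustration and are not taken from the cited works.

```latex
P(S \mid w) = \frac{P(w \mid S)\,P(S)}{P(w \mid S)\,P(S) + P(w \mid H)\,P(H)}
```

Here P(w | S) and P(w | H) are estimated from how often w appears in previously labelled spam and normal mail, which is why the filter improves as more labelled mail accumulates.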

However, Bayesian filtering also has its

limitations. Spammers can

evade detection by

padding a message with normal words or words that are

unlikely to appear in spam, which confuses the

filter. This technique is called "Bayes

poisoning". For example, spammers can

add scientific and technological terms,

current-affairs vocabulary, or the names of legitimate large

companies to evade detection. In addition, Bayesian

filtering has not been well tested against spam generated with artificial

intelligence.
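The effect of Bayes poisoning can be sketched with a toy scorer. The per-word probabilities in `WORD_SPAM_PROB` are invented for illustration (a real filter would estimate them from labelled mail), and combining them by summing log-odds is one common simplification under the independence assumption, not necessarily the method used in the cited works.

```python
import math

# Hypothetical per-word estimates of P(spam | word); unseen words count as neutral.
WORD_SPAM_PROB = {
    "lottery": 0.95, "winner": 0.90, "free": 0.85,
    "quantum": 0.10, "election": 0.15, "microsoft": 0.20,
}

def spam_probability(words, table=WORD_SPAM_PROB, neutral=0.5):
    """Combine per-word probabilities by summing their log-odds."""
    log_odds = 0.0
    for w in words:
        p = table.get(w.lower(), neutral)
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))

spam = ["free", "lottery", "winner"]
# The same message padded with innocuous-looking words.
poisoned = spam + ["quantum", "election", "microsoft"] * 2

print(round(spam_probability(spam), 3))
print(round(spam_probability(poisoned), 3))
```

With these assumed numbers the unpadded message scores close to 1, while the padded one falls well below 0.5 even though the spam content is unchanged.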

4.3 IP-based Blacklists

An Internet Protocol (IP) address

is an identifier for a device on a computer

network and is used to enable communication

between devices. IP blacklists are a common spam

filtering strategy. When known spammers are identified,

their IP addresses are recorded in a blacklist, and

all emails from these IP addresses are treated as

spam. This method is reasonably effective at blocking

spam from known malicious servers.

The effectiveness of IP blacklists is subject to

many challenges. Spammers can use dynamic IP

addresses, frequently changing IPs

through cloud servers. This makes it difficult for IP-based

blacklists to track the sender's new

address, and blacklist updates often

cannot keep up with the speed at which spammers

change IPs. Ultimately, a large volume of new spam

can bypass the interception mechanism.

In addition, some spammers send from shared IPs. This

can cause a shared IP to be blacklisted as a spam source,

resulting in normal emails from the same address being intercepted.
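Both weaknesses can be seen in a minimal blacklist sketch. The `IPBlacklist` class is hypothetical, and the addresses come from the range reserved for documentation examples, so they do not refer to real senders.

```python
import ipaddress

class IPBlacklist:
    """A minimal exact-match blacklist of sender IP addresses."""

    def __init__(self):
        self._blocked = set()

    def block(self, ip: str) -> None:
        self._blocked.add(ipaddress.ip_address(ip))

    def is_blocked(self, ip: str) -> bool:
        return ipaddress.ip_address(ip) in self._blocked

blacklist = IPBlacklist()
blacklist.block("203.0.113.7")  # address a spammer was seen using

# Dynamic IPs: the same spammer moves to a fresh address and is not caught.
print(blacklist.is_blocked("203.0.113.8"))  # False

# Shared IPs: a legitimate sender behind the blocked address is rejected too.
print(blacklist.is_blocked("203.0.113.7"))  # True
```

The list only knows addresses it has already seen, so it is always one step behind a sender who rotates addresses, and it cannot distinguish between different senders who share one address.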

5 ADVANTAGES OF MACHINE LEARNING IN SPAM CLASSIFICATION

Machine learning is a branch of artificial intelligence.

Machine learning enables computers to learn from

data and make predictions or decisions based on

learned patterns. This process does not require

humans to explicitly program rules. The core idea of

this process is to let the machine discover patterns in

the data, analyze the data and use these discovered

patterns to classify the data.

Machine learning has revolutionized spam

detection with its intelligent and efficient

classification technology. Compared to traditional

methods, machine learning models continue to learn

from evolving spam strategies and provide higher

accuracy.

5.1 Adaptability

Machine learning has shown great adaptability in

spam detection. As spam evolves

rapidly, machine learning models can be retrained on new examples and

adapt just as quickly.

Adapting to new spam patterns is the most important

advantage of machine learning over traditional

methods. Traditional spam filters rely on static rules,

which must be constantly

updated manually to maintain accuracy. Machine

learning achieves dynamic defense by

learning from large data sets, which can reduce

maintenance costs and improve accuracy.

5.2 Feature Extraction

Feature extraction distills the most representative

information from raw data. Because the content of

emails is complex and diverse, it is difficult and

inefficient for models to process the raw

text directly. The purpose of feature

extraction is to reduce this complex text

to a compact numerical representation that

models can process efficiently, which

can significantly improve the efficiency

of spam classification.
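One of the simplest forms of feature extraction is a bag-of-words count vector. The sketch below is illustrative only: the helper names and the two sample emails are assumptions, and practical systems usually add steps such as stop-word removal or TF-IDF weighting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(emails):
    """Map every distinct token in the corpus to a column index."""
    vocab = sorted({tok for email in emails for tok in tokenize(email)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_feature_vector(email, vocab):
    """Turn one email into a fixed-length vector of word counts."""
    counts = Counter(tokenize(email))
    vector = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[tok]] = n
    return vector

emails = ["Win a FREE prize now", "Meeting notes for the free seminar"]
vocab = build_vocabulary(emails)
vectors = [to_feature_vector(e, vocab) for e in emails]

print(sorted(vocab))
print(vectors)
```

Every email becomes a vector of the same length regardless of how long the original text was, which is the form that the classifiers discussed in the next section consume.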

5.3 Machine Learning Algorithms

Currently, common spam filtering machine learning

algorithms include Naïve Bayes, Random Forests,

and Long Short-Term Memory (LSTM).

Naive Bayes is an algorithm based on Bayes'

theorem. It estimates the probability of each

word given the category (spam or

normal email) and combines these estimates to score the whole message.

Each word is assumed to be independent of the others in

the calculation. Naive Bayes is simple to

compute and efficient, especially for small data sets.

However, if the words

in an email are correlated, this independence assumption is violated,

which may affect the result (Agarwal &

Kumar, 2018).
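A small from-scratch sketch of this idea is given below. It uses the multinomial variant with add-one (Laplace) smoothing, which is one common choice and not necessarily the variant used in the cited work. The class name, the labels "spam" and "ham", and the four training messages are all hypothetical.

```python
import math
from collections import Counter

class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def update(self, words, label):
        """Add one labelled email; can be called again as new mail arrives."""
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def _log_score(self, words, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        # log prior of the class
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        # each word contributes independently (the "naive" assumption)
        for w in words:
            count = self.word_counts[label][w]
            score += math.log((count + 1) / (total + len(vocab)))
        return score

    def predict(self, words):
        return max(("spam", "ham"), key=lambda label: self._log_score(words, label))

# Tiny illustrative training set; both classes need at least one example.
nb = NaiveBayesSpamFilter()
nb.update(["win", "free", "prize"], "spam")
nb.update(["free", "lottery", "winner"], "spam")
nb.update(["meeting", "agenda", "tomorrow"], "ham")
nb.update(["project", "report", "attached"], "ham")

print(nb.predict(["free", "prize"]))      # spam
print(nb.predict(["meeting", "report"]))  # ham
```

Because training only increments counters, the model is cheap to fit and to update, which matches the efficiency on small data sets noted above.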

Random forests consist of multiple independently

trained decision trees. Each tree

is trained on a random subset of the original data

and makes its own prediction. This

randomness makes each tree somewhat different,

which improves overall accuracy. The forest combines

the predictions of all trees and outputs the

class chosen by the most trees. It is

like a team in which each person makes an independent

judgment and the category of the

email is decided by voting. This method is more accurate than a

single person's judgment because different opinions

in the team can complement each other and reduce

errors. The disadvantage of random forests is that

training multiple trees requires more memory and

processing power, and predictions on large data sets

may be slower than simple models such as naive

Bayes.
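The bootstrap-and-vote idea can be sketched in a few lines. To stay short, the sketch replaces full decision trees with one-split "stumps", so it is a simplified stand-in rather than a real random forest. The three features (number of links, exclamation marks, and spam keywords) and the six training rows are invented for illustration.

```python
import random
from collections import Counter

def train_stump(rows, labels, rng):
    """Fit a one-split tree on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        for above in (0, 1):  # class predicted when the value exceeds the threshold
            preds = [above if r[feature] > threshold else 1 - above for r in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, above)
    return best[1:]

def predict_stump(stump, row):
    feature, threshold, above = stump
    return above if row[feature] > threshold else 1 - above

def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # bootstrap sample: draw rows with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical features per email: [links, exclamation marks, spam keywords]
rows = [[5, 4, 3], [4, 6, 2], [6, 3, 4], [0, 0, 0], [1, 1, 0], [0, 1, 1]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = normal

forest = train_forest(rows, labels)
print(predict_forest(forest, [7, 5, 5]))
print(predict_forest(forest, [0, 0, 0]))
```

An individual stump trained on an unlucky sample can be wrong, but the vote across 25 of them is far more stable. The loop over `n_trees` also makes the memory and compute cost noted above visible: both grow with the number of trees.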

Deep learning uses multi-layer neural networks to

automatically extract data features and can identify

more complex spam patterns (Yu, 2023). Traditional

machine learning methods usually rely on manually

designed features and are difficult to adapt to

changing spam strategies (Zhang, 2024).

LSTM (Long Short-Term Memory) is a recurrent

neural network (RNN) architecture specifically designed to

process sequence data, especially data in which

contextual relationships matter, such as text, time

series, and speech (Hans, 2020). LSTM

has a dedicated memory mechanism that lets it retain

important information from earlier in an email and use it to

determine whether the email is spam or normal. This

memory mechanism mitigates the vanishing

gradient problem of traditional RNNs when

processing long sequences, and thus can significantly

improve spam classification accuracy,

especially for long messages.
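The memory mechanism can be made concrete with a scalar LSTM cell that follows the standard gate equations. The weights in `PARAMS` are hand-picked for illustration (a real model learns them from data), and the input sequence is a hypothetical "spam cue strength" per token rather than real email text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell."""
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old memory to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much memory to expose
    g = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * g             # updated cell state (the memory)
    h = o * math.tanh(c)               # updated hidden state
    return h, c

# Hypothetical weights as (w_x, w_h, bias); chosen by hand, not trained.
PARAMS = {
    "forget": (0.0, 0.0, 4.0),      # forget gate stays near 1: keep memory
    "input": (6.0, 0.0, -3.0),      # input gate opens only for a strong cue
    "output": (0.0, 0.0, 2.0),
    "candidate": (2.0, 0.0, 0.0),
}

def run(sequence):
    h, c = 0.0, 0.0
    for x in sequence:
        h, c = lstm_step(x, h, c, PARAMS)
    return h, c

# One strong cue at the start, followed by 20 uninformative tokens.
h_signal, c_signal = run([1.0] + [0.0] * 20)
h_blank, c_blank = run([0.0] * 21)

print(round(c_signal, 3), round(c_blank, 3))
```

With these assumed weights, the cell state still carries most of the early cue after 20 further steps, while the sequence without a cue stays at zero. The additive update of `c` is what lets information (and, during training, gradients) survive over long sequences.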

6 CONCLUSION

This article systematically analyzes the development

of spam filtering technology from the perspective of

technological evolution. It evaluates the limitations

of traditional methods and

demonstrates the core value of machine learning in

modern spam defense. The harm of

spam is multidimensional: in addition to harassment,

its derivative risks include phishing attacks,

productivity loss, and long-term erosion of trust in the

digital economy. Traditional methods are no longer sufficient.

Although rule-based filtering, Bayesian

classification, and IP blacklists were effective in the

early stages, they struggle against modern spam

because they rely on static rules. Spammers exploit

loopholes through obfuscation techniques, Bayesian

poisoning, and dynamic IP switching to bypass

traditional filters.

Machine learning models dynamically learn from

evolving spam patterns, thereby achieving a

constantly adapting filtering system. Algorithms such

as Naive Bayes, Random Forest, and LSTM networks

can improve detection accuracy by analyzing

contextual relationships and complex patterns in

email content. Despite the many advantages of

machine learning models, they require large labeled

data sets, computing resources, and continuous

retraining to remain effective.

As spam strategies continue to evolve, deep

learning is essential to enhancing email security.

Future research should focus on developing real-time,

computationally efficient detection. Lightweight spam

detection models can provide a good balance between

accuracy and computational efficiency, and deep

learning frameworks with low computational costs

should be developed to meet enterprise-level

real-time filtering needs. In addition, integrating AI-based

collaborative filtering across multiple platforms can

help build a stronger defense against spam

threats. A federated filtering system allows spam

filters to learn from the user data of multiple email

providers without compromising privacy. This

decentralized approach enhances detection

capabilities while keeping users safe.

REFERENCES

Agarwal, K., Kumar, T.V., 2018. Email spam detection

using integrated approach of Naïve Bayes and particle

swarm optimization. In 2018 Second International

Conference on Intelligent Computing and Control

Systems (ICICCS), 685-690.

Chakraborty, N., Patel, A., Polytechnic, K., Raigarh, I.,

2012. Email spam filter using Bayesian neural

networks.

Han, X., 2023. Application of Bayesian optimization in

spam filtering. Journal of Xuzhou Institute of

Technology: Natural Science Edition, 38(2), 77-83.

Hans, R., 2020. LSTM-based short message service (SMS)

modeling for spam classification. (Doctoral

dissertation, Dalian University of Technology).

Kumar, N., Sonowal, S., Nishant, 2020. Email spam

detection using machine learning algorithms. In 2020

Second International Conference on Inventive Research

in Computing Applications (ICIRCA), 108-113.

Lu, Q., Yin, S., 2008. Research on spam classification

technology based on Bayesian theorem. Information

Technology (02), 126-128.

Ma, Z., 2024. Research and implementation of spam

filtering system. (Doctoral dissertation, Zhejiang

University).

Mao, H., 2024. Research and implementation of spam filtering

algorithm. (Doctoral dissertation, Shanghai Jiaotong

University).

Wang, B., Pan, W., 2005. A review of content-based spam

filtering technology. Journal of Chinese Information

Processing, 19(5), 3-12.

Wang, Z., Peng, X., 2010. Spam filtering technology based

on machine learning. China Science and Technology

Information (6), 2.

Yu, Y., 2023. Spam detection method based on deep

learning. (Doctoral dissertation, Donghua University).

Zhang, J., 2024. Machine learning-based email spam filter.

Innovation in Science and Technology.
