Automated Phishing Website Detection and Analysis Using Advanced
Machine Learning Techniques
Shanmugapriya K, Poornima D, Vasanth R and Vinoda V R
Department of Computer Science and Engineering, Nandha Engineering College, Erode, Tamil Nadu, India
Keywords: Phishing Detection, Rule-Based System, Machine Learning Limitations, Support Vector Machine, K-Nearest
Neighbour, URL Verification, Cybersecurity, Online Fraud Prevention, Web Security.
Abstract: Phishing poses a significant danger to online security, and traditional machine learning methods such as
Support Vector Machine (SVM) and K-Nearest Neighbor (KNN) are constrained by their reliance on labeled
datasets and inability to handle novel or uncommon URLs. Furthermore, these techniques are opaque, making
it challenging to determine why a given URL was blocked. This paper proposes a rule-based system (RBS)
that uses fundamental criteria, such as webpage accessibility and suspicious keywords, to verify URLs. Because the
RBS does not rely on pre-existing data, it is more adaptable, efficient, and transparent than machine learning
techniques for detecting phishing attempts.
1 INTRODUCTION
The development of the internet has provided
unprecedented access to services and information but
also fertile soil for wrongdoing, particularly phishing.
Phishing is a deceptive technique that continues to
harm people and businesses throughout the
world by collecting private information such as credit
card numbers, usernames, and passwords. Phishing
attacks typically work by impersonating authentic
sites through fake replicas, thus luring users into
revealing their credentials. The effects of a successful
phishing attack can range from financial loss and
identity theft to reputational damage and loss of data.
Therefore, access to high-quality anti-phishing
technology is a crucial part of
ensuring online security. The older methods of
detecting phishing have primarily employed machine
learning methods, including SVM and KNN. Both
train models on labeled sets of known legitimate and
phishing URLs so that the models can recognize the
patterns and features that distinguish the two.
Although machine
learning-based products have been fairly effective at
detecting phishing attacks, they are not perfect. The
largest constraint is that they are based on labeled
datasets, which are costly and time-consuming to
create and keep up with. Additionally, the models are
limited to recognizing patterns seen in training data
and may therefore not perform as well against new or
evolving phishing attacks. The second important
limitation of current SVM- and KNN-based systems is
their inability to assess URLs beyond their
training sets. This limitation renders them less useful
in practical situations where dynamically generated
and previously unknown URLs appear continuously.
Because attackers keep generating URLs that have
never existed before, dataset-specific
models become ineffective. This problem calls for
a more flexible and generic phishing
detection method that can process any URL, whether
or not it appears in a training set. This paper proposes
an RBS as an alternative that counters the limitations
of traditional machine learning methods. The RBS
applies a set of predefined rules based
on measurable URL features, including suspicious
keywords, atypical URL format, and, most
importantly, accessibility of the webpage. By using
webpage accessibility as the main rule, the system is
able to check any URL, even those it has not seen
before. This directly addresses the dataset
dependency issue of existing systems. The rule-based
approach offers potential strengths in
flexibility, computational efficiency, and
transparency that can make it a strong tool in the
fight against phishing attacks.
K., S., D., P., R., V. and R., V. V.
Automated Phishing Website Detection and Analysis Using Advanced Machine Learning Techniques.
DOI: 10.5220/0013899300004919
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies (ICRDICCT‘25 2025) - Volume 3, pages
414-420
ISBN: 978-989-758-777-1
Proceedings Copyright © 2025 by SCITEPRESS Science and Technology Publications, Lda.
2 RELATED WORKS
Phishing attack detection has continued to be an
important field of study due to the evolution of
increasingly sophisticated cyberattacks. Machine
learning algorithms for phishing attack detection have
made extensive use of SVM and KNN. Yet such
models do not generalize well, because they are
trained on labeled data and learn nothing
about unknown or unseen URLs. To address these
shortcomings, researchers have emphasized rule-based
systems and hybrid systems that integrate
machine learning and rule extraction methodologies
for increased flexibility, efficiency, and
interpretability. (M. SatheeshKumar et al., 2022),
for instance, proposed a rule-based phishing
detection system that examines URL, domain, and page
attributes separately, without blacklists, to improve
real-time detection of zero-day phishing attacks.
Likewise, (Youness Mourtaji et al., 2021)
suggested a hybrid solution that combines rule-based
methods with Convolutional Neural Networks
(CNNs), using multiple viewpoints to enhance
detection performance. Hybrid methods solve the
interpretability problem in machine learning models
by providing explainable rules together with deep
learning power. The reliance of traditional machine
learning-based phishing detection on labeled datasets
renders it vulnerable to adaptive phishing attacks.
(Asif Ejaz et al., 2023) described how attackers
exploit vertical feature spaces to bypass detection,
and proposed Anti-Subtle Phish, which employs
horizontal feature spaces to improve robustness.
(Fadi Thabtah et al., 2021), meanwhile, developed Phish
Alert, a browser plugin that draws rules from trained
machine learning models for real-time anomaly
detection, thus leveraging the advantages of both
machine learning and rule-based filtering. Case-based
reasoning (CBR) is another technique that has gained
prominence for phishing detection. (Lizhen Tang and
Qusay H. Mahmoud, 2021) proposed a CBR-based
phishing detection system that uses previous trends
of phishing attacks to discover new threats with
minimal reliance on labeled data. (Nureni A. Azeez et al.,
2022) also proposed a fuzzy deep neural network
model that optimizes phishing detection rules, improving
classification efficiency while keeping accuracy high. Other
researchers have investigated further rule-based
approaches. (Hassan Abutair
et al., 2019) applied association rule mining to
generate phishing URL patterns without relying on
huge training datasets, so that the approach can
accommodate emerging attack variations more easily.
Moreover, (M. Sathish Kumar et al.,
2021) systematically reviewed the application of deep
learning in detecting phishing, noted that more
explainable models are needed, and
again pointed to rule-based methods. Although
machine learning and deep learning solutions have
achieved high detection rates, their
black-box nature poses a challenge to adoption in
cybersecurity. (S. Carolin Jeeva et al., 2016) highlighted
this limitation in a review of phishing detection
techniques, calling for the integration of nature-inspired
algorithms and rule-based systems to
improve interpretability. (Cagatay Catal et al., 2022)
also contributed to this discussion by creating an
anti-phishing browser engine in which Random
Forest is combined with a rule extraction framework
to render phishing detection decisions transparent.
Overall, combining rule-based systems with
advanced phishing detection mechanisms has emerged
as an effective methodology for security
enhancement. While deep learning models
continue to push detection capabilities, rule-based
systems are more explainable and can adapt rapidly,
and so serve as an attractive alternative or complement
to machine learning-based detection techniques.
Consequently, future work can be expected to
refine hybrid methodologies and to apply explainable
AI (XAI) measures to strengthen phishing prevention
tactics.
3 PROPOSED SYSTEM
This paper proposes an RBS for phishing detection
as a substitute for conventional machine learning
techniques. The RBS analyzes URLs using
predetermined rules that consider suspicious
keywords, abnormal structures, and webpage
accessibility, making it capable of classifying any
URL irrespective of the training data provided. Unlike
machine learning techniques, the system can therefore
handle newly created URLs that have not
been seen before. It also offers efficiency with lower
computational costs, which enables faster detection,
and brings transparency, as decisions are based on
clear, human-readable rules. The system is also
maintainable: modifying or adding
rules is enough to guard against new phishing techniques,
which makes it more adaptable to new attacks. It is
also easier to deploy and maintain, as it requires
no model training and
only infrequent dataset updates. By employing direct
webpage accessibility assessment to inform
detection, this approach enhances phishing detection
irrespective of historic data and circumvents some of
the fundamental limitations presented by other
machine learning-based approaches.
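To illustrate how human-readable rules can be added without any retraining step, the following minimal Python sketch models each rule as a named check. The `Rule` class, the `violated_rules` function, and the three example rules are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Rule:
    name: str                     # human-readable label reported to the user
    check: Callable[[str], bool]  # returns True when the URL violates the rule


def violated_rules(url: str, rules: List[Rule]) -> List[str]:
    """Return the names of all rules that the URL violates, in rule order."""
    return [rule.name for rule in rules if rule.check(url)]


# Hypothetical starting ruleset (illustrative only, not the authors' rules).
RULES: List[Rule] = [
    Rule("contains '@' symbol", lambda u: "@" in u),
    Rule("longer than 75 characters", lambda u: len(u) > 75),
]

# Guarding against a new technique is a single append; nothing is retrained.
RULES.append(Rule("contains punycode label", lambda u: "xn--" in u.lower()))

print(violated_rules("http://user@xn--pple-43d.com/", RULES))
```

Because the output is a list of rule names, the same structure also supports the transparency goal: the user sees which rules fired rather than an unexplained score.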
4 METHODOLOGY
4.1 Rule Based Approach for Phishing
Detection
The RBS (Figure 1) is the approach discussed in
this paper for phishing detection; it scans the
attributes of a given URL for suspicious
characteristics. Unlike ML-based methods,
which rely on labeled training data, the RBS
classifies URLs using a fixed set of predefined rules.
These rules check a variety of parameters, from the
presence of phishing-related keywords (including
"login," "verify," and "bank") and suspicious URL
patterns (excessively long subdomains, excessive
special characters, etc.) to the accessibility of the
webpage itself. URL
availability analysis serves as a differentiator that
lets the system label URLs that are not found in any
available dataset. Unlike machine learning-based
methods, this is an efficient and realistic solution,
because it can handle a newly
constructed phishing URL without retraining.
Moreover, explainability in decision making makes
it possible for users to understand why a particular
URL is flagged as suspicious, thus avoiding one of
the most critical drawbacks of black-box machine
learning models.
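The keyword and structure rules described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the keywords come from the text, while the thresholds `MAX_SUBDOMAIN_LENGTH` and `MAX_SPECIAL_CHARS`, the `SPECIAL_CHARS` set, and the function names are assumptions chosen for the example.

```python
from typing import List, Tuple
from urllib.parse import urlsplit

SUSPICIOUS_KEYWORDS = ("login", "verify", "bank")  # keywords named in the text
MAX_SUBDOMAIN_LENGTH = 30        # assumed threshold
MAX_SPECIAL_CHARS = 5            # assumed threshold
SPECIAL_CHARS = set("@-_%=&~")   # assumed set of "special" characters


def triggered_rules(url: str) -> List[str]:
    """Return a human-readable reason for every rule the URL triggers."""
    lowered = url.lower()
    # urlsplit only fills in hostname when the URL includes a scheme.
    host = urlsplit(lowered).hostname or ""
    # Naive split: everything before the last two labels counts as subdomain.
    subdomain = ".".join(host.split(".")[:-2])

    reasons = [f"keyword '{kw}' present" for kw in SUSPICIOUS_KEYWORDS if kw in lowered]
    if len(subdomain) > MAX_SUBDOMAIN_LENGTH:
        reasons.append("overly long subdomain")
    if sum(ch in SPECIAL_CHARS for ch in url) > MAX_SPECIAL_CHARS:
        reasons.append("excessive special characters")
    return reasons


def classify(url: str) -> Tuple[str, List[str]]:
    """Label a URL and return the reasons, so the decision is explainable."""
    reasons = triggered_rules(url)
    return ("Phishing" if reasons else "Legitimate"), reasons


print(classify("http://secure-login.bank.example.com/verify?id=1"))
```

Treating a single triggered rule as sufficient for a "Phishing" label is itself an assumption; a deployed ruleset might weight rules or require several to fire before flagging a URL.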
4.2 URL Processing and Accessibility
Verification
When a user enters a URL into the system, it is
parsed into its structural
elements, such as domain name,
subdomains, path, and query parameters. The system analyzes
these elements to identify patterns that are often
associated with phishing. If a URL
contains unexpected terms, excessive redirections, or
encoded characters, it is considered suspicious. The
other main function of the RBS is the URL
accessibility verification process, which checks whether the
webpage is accessible. If the webpage cannot be
accessed (e.g., because it has been removed from the
server or blacklisted), the system flags the URL as
dubious. Because this verification happens in real
time, the system can check such URLs even if they were
created dynamically and are not stored in any historical
database. The RBS thus combines established
phishing detection heuristics with the analysis of URL
structure and the verification of accessibility.
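The two steps above, splitting a URL into its elements and checking whether the page can be reached, can be sketched with the Python standard library. The function names, the 5-second timeout, the `opener` parameter (added so the check can be exercised without network access), and the path in the example URL are assumptions made for this illustration; only the domain is taken from the paper.

```python
import urllib.error
import urllib.request
from urllib.parse import parse_qs, urlsplit


def parse_url(url: str) -> dict:
    """Split a URL into domain, subdomains, path, and query parameters."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        # Naive: last two labels; ignores multi-part suffixes such as co.uk.
        "domain": ".".join(labels[-2:]),
        "subdomains": labels[:-2],
        "path": parts.path,
        "query": parse_qs(parts.query),
    }


def is_accessible(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True if the page answers with a 2xx/3xx status, else False."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, HTTP error, or bad URL.
        return False


print(parse_url("http://appleid.apple.com-app.es/login?session=1"))
```

A page that fails this check is only "dubious" in the sense used above; a legitimate site that is temporarily down would fail it as well, which is why the accessibility result is combined with the structural rules.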
4.3 System Updates, Adaptability and
Performance Evaluation
The RBS is designed so that its rules can be updated
and extended, keeping it effective in
the face of new phishing methods. In contrast to
machine learning models, it can cope with emerging
phishing techniques by changing or adding rules,
without the need for periodic retraining.
The system is therefore more efficient and
faster in real-time phishing detection, as it carries
a lower computational load than more complex
algorithms. The RBS is also tested using known
phishing and legitimate URL test sets to evaluate its
effectiveness. Performance metrics such as
accuracy, false positive rate, and false negative rate
are used to tune the ruleset and improve detection.
By periodically updating rules and checking URL
accessibility, the RBS can ensure long-term
efficiency, transparency, and responsiveness, and
therefore serves as a better alternative to machine
learning-based phishing detection systems. Figure 1
shows the overall process: preprocessing of the data
(such as the removal of NA values), followed by
URL-based analysis, the models under
comparison (KNN, SVM, and RBS), the URL to
analyze, and evaluation of the resulting accuracy.
Figure 1: Architectural diagram.
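The performance metrics named in Section 4.3 follow their standard definitions, which can be sketched in Python as below. The function name `detection_metrics` and the eight toy labels are hypothetical and serve only to show the computation; they are not results from the paper.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy, false positive rate, and false negative rate.

    True means "phishing"; False means "legitimate".
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1          # phishing correctly flagged
        elif truth and not pred:
            fn += 1          # phishing missed
        elif not truth and pred:
            fp += 1          # legitimate URL wrongly flagged
        else:
            tn += 1          # legitimate URL correctly passed

    return {
        "accuracy": (tp + tn) / len(y_true),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy labels (hypothetical): 3 phishing and 5 legitimate URLs.
truth = [True, True, True, False, False, False, False, False]
preds = [True, True, False, False, False, False, False, True]
print(detection_metrics(truth, preds))
```

Under these definitions the false positive rate is measured over legitimate URLs only and the false negative rate over phishing URLs only, so both should be read alongside the class balance of the test set.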
Table 1: Performance Comparison.
Model                          Accuracy (%)   Precision (%)   Transparency
K-Nearest Neighbors (KNN)      86.3           83.5            Low
Support Vector Machine (SVM)   88.7           85.2            Low
Rule-Based System (RBS)        92.5           90.3            High
Figure 2: Error Rates & Detection Time.
5 EXPERIMENTAL RESULT
To assess the effectiveness of the proposed
rule-based system (RBS) for phishing detection,
experiments were conducted on a dataset containing a
mix of phishing and legitimate URLs
collected from publicly accessible phishing
databases and real web traffic. The dataset contained
10,000 URLs, divided into 5,000 phishing
and 5,000 genuine links. All URLs were tested by
the system against predefined criteria such as
malicious keywords, suspicious URL
structure, and reachability tests. The results showed that
the RBS detected phishing with an
accuracy of 94.6%, in
contrast to popular machine learning classifiers
such as SVM and KNN, which had lower accuracy
because they relied solely on training data.
Table 1 compares the phishing detection models
(RBS, SVM, and KNN) in terms of accuracy, precision, and
transparency. The RBS offers good performance in
terms of accuracy (92.5%), precision (90.3%), and
transparency (High). Figure 2 plots the detection
time, false positive rate, and false negative rate for the
three phishing detection models (KNN, SVM, and
RBS). SVM has the lowest rates of false
positives and false negatives and the longest detection
time. Further, the RBS maintained a low false positive
rate of 3.2%, thereby preventing legitimate sites from
being wrongly flagged. The most significant
strength of the RBS was that it could identify previously
unseen phishing URLs, as it did not rely on
labeled datasets but processed URLs dynamically
based on their attributes. Figure 3 shows a Python
KNN-based phishing classifier that reports 87.66%
accuracy and has labeled the input URL as legitimate.
The output window confirms the
result with a "GOOD!" thumbs-up sign.
Figure 3: KNN Result.
Similarly, Figure 4 illustrates a Python script
running in IDLE that uses an SVM model to detect
spoofed sites, with an accuracy of 92.61%. The pop-up
result indicates that the URL "http://www.mutuo.it" is
legitimate, which is shown by a "GOOD!" mark.
In addition, the accessibility check was a prime
concern when verifying phishing status, since most
phishing pages are deleted or become
unavailable after a certain period of time. By
introducing real-time availability checks on
URLs, the system ensured that phishing was detected
in real time and that stale data did not degrade its accuracy.
Figure 5 illustrates a Python script running a rule-
based system (RBS) for detecting phishing that
provides a GUI through which a user can enter a URL
to check. The interface consists of an input text field
and a "Check URL" button to verify its validity.
Figure 4: SVM Result.
Figure 5: Applying URL in RBS.
In addition, Figure 6 depicts the W3Schools Java
tutorial website, which provides materials for
learning Java programming. The webpage contains
navigation buttons, a tutorial overview, and a "Start
learning Java now" button to allow users to begin.
The system was also fast, verifying each
URL in less than 0.5 seconds, which makes it well
suited for deployment in real-time phishing
detection.
Figure 6: RBS Result for Legitimate URL.
Figure 7 illustrates the Python-based phishing
detection system encountering a connection error when it
scanned a suspicious URL. The error
indicates that the domain "appleid.apple.com-app.es"
could not be resolved, which the system treats as a sign
of phishing. Further, the rule-based system maintained
transparency, since users could see why a URL was
identified as malicious. Compared with machine
learning models, which are black-box classifiers,
the RBS delivered explainable output by indicating
which rule was violated. This makes the system
particularly useful in cybersecurity cases where
explainability is crucial.
Figure 7: Applying URL in RBS.
Figure 8 illustrates the concept of phishing with
an image of a hooded hacker typing
on a computer, surrounded by cyber-attacks such as
spoofed login pages, viruses, and system crashes.
The word "PHISHING" is written in bold letters to
mark the danger of online fraud. In brief, the
experimental outcomes verify the efficacy, efficiency,
and transparency of the proposed RBS as a
replacement for existing machine learning techniques
to identify phishing. By utilizing inherent URL
attributes and real-time availability checks, the
system avoids the drawbacks of typical ML models
and can be used as a reliable solution for phishing
threat detection on dynamic web pages.
Figure 8: RBS Result for Phishing URL.
6 CONCLUSIONS
This work has established the feasibility of
developing a phishing detection system based on rules
and URL accessibility. By directly examining the
accessibility of a given URL and applying simple
rules, the system can classify URLs as
either "Legitimate" or "Phishing," giving an efficient
and practical solution. The system addresses one of
the key weaknesses of traditional machine learning-based
approaches: their dependence on a URL having been part
of a training dataset.
Ease of deployment and transparency
are the factors that make the rule-based system a valuable
tool to enhance online security and protect users from
phishing. While the current implementation
relies mainly on whether a URL is reachable, the
system's modularity allows it to be extended in the
future to further increase accuracy and flexibility
against new forms of phishing threats.
7 FUTURE ENHANCEMENT
Future developments for the rule-based system (RBS)
can involve ongoing updating and fine-tuning of rules
to keep up with changing phishing methods and new
avenues of attack. Combining the RBS with lightweight
machine learning could further enhance precision and
responsiveness while remaining efficient. Real-time
scanning of URLs by browser add-ons or network
filters would offer immediate protection from phishing
attacks. In addition, AI-based approaches can be used
to optimize and generate rules automatically so that
the system remains effective against emerging
phishing techniques.
REFERENCES
Andronicus A. Akinyelu, “Machine Learning and Nature-
Inspired Based Phishing Detection: A Literature
Survey”, International Journal on Artificial
Intelligence Tools, 2019. DOI:
10.1142/S0218213019300023
Asif Ejaz, Adnan Noor Mian & Sanaullah Manzoor, “Life-
long phishing attack detection using continual
learning”, Scientific Reports, 2023. DOI:
10.1038/s41598-023-37552-9
Cagatay Catal, Görkem Giray, Bedir Tekinerdogan,
Sandeep Kumar, Suyash Shukla, “Applications of Deep
Learning for Phishing Detection: A Systematic
Literature Review”, Knowledge and Information
Systems, 2022. DOI: 10.1186/s13673-016-0064-3
Fadi Thabtah, Firuz Kamalov, “Phishing Detection: A Case
Analysis on Classifiers with Rules Using Machine
Learning”, Information and Knowledge Management,
2021. DOI: 10.1142/S0219649217500344
Hassan Abutair, Abdelfettah Belghith, Saad AlAhmadiB,
“CBR-PDS: A Case-Based Reasoning Phishing
Detection System”, Journal of Ambient Intelligence
and Humanized Computing, 2019. DOI:
10.1007/s12652-018-0736-0
Lizhen Tang, Qusay H. Mahmoud, “A Survey of Machine
Learning-Based Solutions for Phishing Website
Detection”, Machine Learning and Knowledge
Extraction, 2021. DOI: 10.3390/make3030034
M. SatheeshKumar, K. G. Srinivasagan, G. UnniKrishnan,
“A lightweight and proactive rule-based incremental
construction approach to detect phishing scam”,
Springer, 2022. DOI: 10.1007/s10799-021-00351-7
M. Sathish Kumar, B. Indrani, “Frequent Rule Reduction
for Phishing URL Classification Using Fuzzy Deep
Neural Network Model”, Iran Journal of Computer
Science, 2021. DOI: 10.1007/s42044-020-00067-x
Mohith Gowda HR, Adithya MV, Gunesh Prasad S, Vinay
S, “Development of Anti-Phishing Browser Based on
Random Forest and Rule of Extraction Framework”,
Cybersecurity, 2020. DOI: 10.1186/s42400-020-00059-1
Nureni A. Azeez, Ogunlusi E. Victor, Sanjay Misra,
Robertas Damaševičius, Rytis Maskeliunas, “Extracted
Rule-Based Technique for Anomaly Detection in a
Global Network”, International Journal of Electronic
Security and Digital Forensics, 2022. DOI:
10.1504/IJESDF.2022.126460
S. Carolin Jeeva, Elijah Blessing Rajsingh, “Intelligent
Phishing URL Detection Using Association Rule
Mining”, Human-centric Computing and Information
Sciences, 2016. DOI: 10.1186/s13673-016-0064-3
Youness Mourtaji, Mohammed Bouhorma, Daniyal
Alghazzawi, Ghadah Aldabbagh, Abdullah Alghamdi,
“Hybrid Rule-Based Solution for Phishing URL
Detection Using Convolutional Neural Network”,
Wiley, 2021. DOI: 10.1155/2021/8241104