easier by the development of a Flask-based user
interface (Petre et al., 2019) that enables the real-time
detection of phishing websites.
2 LITERATURE SURVEY
The majority of previous phishing detection research
has gone toward creating and improving systems to
detect phishing attempts using different approaches.
Heuristic techniques and rule-based systems, which
employed established rules to identify phishing based
on recognized patterns and signatures (Moghimi,
2016), were the mainstays of early attempts.
Shahrivari et al. note that the internet has made it
possible for attackers to trick victims through social
engineering and spoofed websites, a practice known as
phishing (Shahrivari et al., 2020). Because phishing
attacks share common traits, machine learning is an
effective way to identify them, and their study
examines the results of several machine learning
techniques for predicting phishing websites. Rashid et
al. present a machine learning-based phishing
detection method that uses just 22.5% of the novel
functionality to correctly identify 95.66% of phishing
and legitimate websites (Rashid, 2020). Combined with
a support vector machine classifier, the method
performs well when tested against common phishing
datasets from the University of California, Irvine
collection.
Gandotra et al. examine feature selection
techniques (Gandotra et al., 2021) for phishing
website detection and find that, despite the
time-consuming work of creating a large number of
features, random forest reduces model building time
without sacrificing accuracy. Abu-Nimeh et al.
examine machine learning methods for phishing email
prediction (Abu-Nimeh, 2007) using a data set of 2889
legitimate and phishing emails, with 43 features used
for classifier training and testing. Sahingoz et al.
note that the transition from traditional retail to
electronic commerce has been facilitated by the
Internet's explosive expansion (Sahingoz et al.,
2019). Phishing tactics
are employed by cybercriminals to trick victims and
obtain confidential data. Detection is complicated by
semantics-based attack mechanisms that imitate
legitimate websites. Using NLP (natural language
processing) features and seven classification
techniques, this study offers a real-time anti-phishing
solution. A novel method for identifying phishing
attempts (Boddapati et al., 2023) is presented by
Jain et al., which examines hyperlinks in a website's
HTML source code. It uses twelve types of
hyperlink-specific features together with machine
learning techniques. Because it is client-side,
language-independent, and scores above 98.4% accuracy
with the logistic regression classifier, the
technique is an efficient way to identify phishing
websites.
3 DATA COLLECTION & PREPROCESSING
The effectiveness of a phishing detection mechanism
is significantly impacted by the caliber and
comprehensiveness of its training and assessment
data. In order to ensure the accuracy and robustness
of the detection models, preprocessing and data
collection are crucial procedures in this work. The
methods used to gather information and the initial
processing techniques used to prepare the data for
analysis are covered in this part. Data collection
involves acquiring a diverse range of cases from both
trustworthy and malicious sources in order to
adequately train the models. Datasets for phishing
detection (Alazaidah, R., et al., 2024) usually contain
characteristics that are taken from URLs, content on
websites, and metadata. To build a comprehensive
dataset, this work drew on several resources,
including collaborative data sharing sites, publicly
available phishing datasets, and scraping tools. The
collection (Figure 1) consists of URLs labeled as
genuine or phishing, along with relevant metadata
such as page layout, content features, and domain
registration data.
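As an illustration of combining several labeled sources into one collection, the following is a minimal Python sketch and not the procedure used in this work. The function names `normalize` and `merge_sources`, the normalization rule, and the choice to drop URLs that carry conflicting labels are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lower-case the scheme and host and drop the fragment so duplicates match."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def merge_sources(*sources):
    """Merge iterables of (url, label) pairs into one de-duplicated dict.

    Assumption of this sketch: a URL seen with two different labels is
    treated as unreliable and removed from the merged collection.
    """
    merged = {}
    conflicts = set()
    for source in sources:
        for url, label in source:
            key = normalize(url)
            previous = merged.setdefault(key, label)
            if previous != label:
                conflicts.add(key)
    for key in conflicts:
        del merged[key]
    return merged


# Hypothetical sample records standing in for a public dataset and a scrape.
public_dataset = [
    ("http://Example.com/login#top", "phishing"),
    ("https://good.org/", "genuine"),
]
scraped = [
    ("http://example.com/login", "phishing"),
    ("https://good.org/", "phishing"),
]
merged = merge_sources(public_dataset, scraped)
```

Here the two spellings of the first URL collapse into one entry, while the second URL is discarded because its sources disagree on the label.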
One of the most important steps in the process is
feature extraction, which converts unstructured data
into an input format for training a model. To
generate a feature-rich dataset, features are extracted
from both phishing and legitimate websites. These
include domain information (e.g., domain age, SSL
certificate status), URL features (e.g., length,
presence of special characters), and HTML content
features (e.g., presence of a login form, iframes).
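The URL and HTML content features named above can be sketched with the Python standard library alone. This is a hedged illustration, not the extraction code used in this work: the feature names, the `SPECIAL_CHARS` set, and the rule that a password input signals a login form are assumptions. Domain information such as domain age and SSL certificate status requires network lookups and is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed set of "special" characters; the text does not enumerate them.
SPECIAL_CHARS = "@-_?=&%~"


def url_features(url: str) -> dict:
    """Lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_special_chars": sum(url.count(c) for c in SPECIAL_CHARS),
        "num_dots_in_host": host.count("."),
        "uses_https": parsed.scheme == "https",
        "has_at_symbol": "@" in url,
    }


class _ContentScanner(HTMLParser):
    """Counts iframes and password inputs while parsing a page."""

    def __init__(self) -> None:
        super().__init__()
        self.iframes = 0
        self.password_inputs = 0

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            self.iframes += 1
        elif tag == "input":
            # Attribute values can be None (e.g. a bare `type`), so guard it.
            input_type = (dict(attrs).get("type") or "").lower()
            if input_type == "password":
                self.password_inputs += 1


def html_features(html: str) -> dict:
    """Content features taken from the page's HTML source."""
    scanner = _ContentScanner()
    scanner.feed(html)
    return {
        "has_login_form": scanner.password_inputs > 0,
        "num_iframes": scanner.iframes,
    }


# Hypothetical inputs used only to exercise the two functions.
sample_url = "http://secure-login.example.com/account?id=1@evil"
sample_html = '<form><input type="password"></form><iframe src="x"></iframe>'
features = {**url_features(sample_url), **html_features(sample_html)}
```

Merging the two dictionaries gives one flat feature row per website, which is the shape most classifiers expect as input.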
Extracting these attributes surfaces important
indicators of phishing activity. Exploratory data
analysis (EDA) aims to understand the relationships,
patterns, and distributions in the dataset. EDA helps
identify the features that best distinguish phishing
attacks and also
uncovers any potential biases or inconsistencies