6 CONCLUSION
Traditional email filtering mechanisms have become
increasingly outdated in the face of evolving cyber
threats, particularly phishing attacks. Our work
focused on machine learning algorithms as a means of
tackling the persistent challenge of phishing emails.
We trained several supervised learning models,
namely Naive Bayes, Decision Tree, Random Forest,
Gradient Boosting Regression Trees, and Support
Vector Machine (SVM), and compared their results to
choose the best algorithm for developing
SecureInbox.
We evaluated these algorithms using accuracy,
precision, recall, and F1 score. Several algorithms
performed well, with SVM and Random Forest standing
out as the top performers, achieving F1 scores of
97.75% and 96.73%, respectively. Although SVM scored
roughly one percentage point higher, we selected
Random Forest for our email classification system on
the basis of its computational efficiency and
scalability.
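For readers who want to see how the four reported metrics relate to one another, the following is a minimal sketch that derives them from confusion-matrix counts, treating phishing as the positive class. The function name and the counts are hypothetical and chosen only for illustration; they are not the values measured in this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    tp: phishing emails flagged as phishing
    fp: legitimate emails flagged as phishing
    fn: phishing emails passed as legitimate
    tn: legitimate emails passed as legitimate
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only (not results from this study).
example = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it is high only when both are high, which makes it a reasonable single figure for comparing classifiers on phishing data.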
The model was integrated into the SecureInbox
application behind a user-friendly Graphical User
Interface, which allows users to train the model on
their own dataset and to analyse emails, classifying
each as legitimate or phishing. Our work demonstrates
that machine learning is an effective tool for
detecting phishing attempts delivered by email.
7 FUTURE WORK
While this study provided insights into various
supervised learning algorithms for email
classification and demonstrated the use of machine
learning as an effective classification tool, there is
room for future research and development. First, we
will advance feature engineering by investigating
distilled BERT embeddings alongside novel
linguistic pattern extraction. For real-world
deployment, we are developing Postfix/MTA plugins
for real-time scanning and implementing incremental
learning to adapt to emerging attack patterns.
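To make the idea of incremental learning concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier whose counts are updated one email at a time, so a newly observed attack pattern can be absorbed without retraining from scratch. It is an illustration under stated assumptions, not the SecureInbox implementation: the class name, the tokenizer, the smoothing scheme, and the sample emails are all hypothetical.

```python
import math
import re
from collections import Counter


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes updated one labelled email at a time.

    Hypothetical illustration; not the model used in SecureInbox.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha            # Laplace smoothing constant
        self.doc_counts = Counter()   # number of emails seen per label
        self.token_counts = {}        # label -> Counter of token frequencies
        self.vocabulary = set()       # all tokens seen so far

    @staticmethod
    def tokenize(text):
        # Simplistic tokenizer (assumption): lowercase alphanumeric runs.
        return re.findall(r"[a-z0-9']+", text.lower())

    def partial_fit(self, text, label):
        """Fold a single labelled email into the running counts."""
        tokens = self.tokenize(text)
        self.doc_counts[label] += 1
        self.token_counts.setdefault(label, Counter()).update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the label with the highest log-posterior score."""
        if not self.doc_counts:
            raise ValueError("model has not seen any training emails")
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        # Reserve one extra vocabulary slot for tokens never seen in training.
        vocab_size = len(self.vocabulary) + 1
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical emails for illustration only.
model = IncrementalNaiveBayes()
model.partial_fit("Verify your account now to avoid suspension", "phishing")
model.partial_fit("Meeting agenda attached for Monday", "legitimate")

# A new attack pattern is reported later and folded in without retraining.
model.partial_fit("Your parcel is held, pay the customs fee today", "phishing")

print(model.predict("Parcel held: pay customs fee"))
print(model.predict("Monday meeting agenda"))
```

Naive Bayes suits this style of update because its parameters are plain counts. Tree ensembles such as Random Forest do not update as directly, so an incremental variant of the deployed model would need a different strategy, such as periodic retraining on a sliding window.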
Currently, the tool enables email analysis on Linux
systems. Cross-platform expansion will add Windows
support via Docker containers, providing broader
accessibility and integration with common email
clients and server configurations.
REFERENCES
Dada, E. G., Bassi, J. S., Chiroma, H., Abdulhamid, S. I.
M., Adetunmbi, A. O., & Ajibuwa, O. E. (2019).
Machine learning for email spam filtering: review,
approaches and open research problems. Heliyon, 5(6).
Godfried, I. (2022, January 4). Decision Trees, Random
Forests, and Gradient Boosting: What’s the Difference?
Towards Data Science. Retrieved from
https://towardsdatascience.com/decision-trees-random-forests-and-gradient-boosting-whats-the-difference-ae435cbb67ad
Gomes, L., da Silva Torres, R., & Côrtes, M. L. (2023).
BERT- and TF-IDF-based feature extraction for
long-lived bug prediction in FLOSS: a comparative study.
Information and Software Technology, 160, 107217.
Harikrishnan, N. B., Vinayakumar, R., & Soman, K. P.
(2018, March). A machine learning approach towards
phishing email detection. In Proceedings of the anti-
phishing pilot at ACM International workshop on
security and privacy analytics (IWSPA AP) (Vol. 2013,
pp. 455-468).
Kanstrén, T. (2020, September 11). A Look at Precision,
Recall, and F1 Score: Exploring the relations between
machine learning metrics. Towards Data Science.
Retrieved from
https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
Tessian. (2022, January 12). Phishing Statistics 2020.
Tessian Blog. Retrieved from
https://www.tessian.com/blog/phishing-statistics-2020/#how-delivered
Zamora, N. (2024). Phishing detection: Bayes model.
Kaggle. Retrieved from
https://www.kaggle.com/code/nordszamora/phishing-detection-bayesmode