A Machine-learning based Unbiased Phishing Detection Approach

Hossein Shirazi, Landon Zweigle, Indrakshi Ray

Abstract

Phishing websites mimic a legitimate website to capture sensitive information of users. Machine learning is often used to detect phishing websites. In current machine-learning based approaches, the phishing and the genuine sites are classified into two groups based on some features. We feel that this is an inadequate modeling of the problem as the characteristics of different phishing websites may vary widely. Moreover, the current approaches are biased towards groups of over-represented samples. Most importantly, as new features are exploited, the training set must be updated to detect new phishing sites. There is a time lag between the evolution of new phishing sites and retraining of the model, which can be exploited by attackers. We provide an alternative approach that aims to solve the above-mentioned problems. Instead of finding commonalities among non-related genuine websites, we find similarity of a suspicious website to a legitimate target and use machine learning to decide whether the suspicious site is impersonating the target. We define the fingerprint of a legitimate website by using visual and textual characteristics against which a sample is compared to ascertain whether it is fake. We implemented our approach on 14 legitimate websites and tested against 1446 unique samples. Our model reported an accuracy of at least 98% and it is not biased towards any website. This is in contrast to the current machine learning models that may be biased towards groups of over-represented samples and lead to more false-negative errors for less popular websites.

Download


Paper Citation