Authors:
Ahmar K. Hussain; Bernhard A. Sabel; Marcus Thiel and Andreas Nürnberger
Affiliation:
Otto von Guericke University Magdeburg, Germany
Keyword(s):
Fake Papers, Classification, Meta Data Features, TF-IDF, Biomedicine, Large Language Models.
Abstract:
To address the problem of fake papers in the scientific literature, we study the classification of fake papers from a set of extracted features using machine learning classifiers. We collected a new dataset in which the fake papers were acquired from the Retraction Watch database and the non-fake papers were obtained from PubMed. The features extracted for classification comprised metadata, journal-related features, and textual features from the abstracts, titles, and full texts of the papers. We used several models to generate features and word embeddings from the abstracts and texts, including TF-IDF and variants of BERT trained on medical data. Comparing the models and feature sets showed that the combination of metadata, journal data, and BioBERT embeddings performed best, achieving an accuracy of 86% and a recall of 83% with a gradient boosting classifier. Finally, we report the most important features identified by the best-performing classifier.
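As a rough illustration of how TF-IDF text features can be joined with metadata features before classification, the following minimal sketch uses only the Python standard library. It is not the authors' pipeline: the toy abstracts, the two metadata columns, the whitespace tokenizer, the unsmoothed weighting tf * ln(N / df), and the helper names `tfidf_vectors` and `combine` are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with weights tf * ln(N / df).

    This is one common TF-IDF variant, chosen here as an assumption;
    the abstract does not say which weighting was used.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def combine(text_vec, metadata_vec):
    """Concatenate text features and metadata features into one row."""
    return list(text_vec) + list(metadata_vec)


# Toy data invented for illustration only.
abstracts = [
    "tumor growth inhibited in mice",
    "tumor cells proliferation inhibited",
    "survey of retraction notices",
]
# Hypothetical metadata columns: [author count, hospital affiliation flag].
metadata = [[3, 1], [2, 0], [5, 1]]

vocab, text_vectors = tfidf_vectors(abstracts)
features = [combine(t, m) for t, m in zip(text_vectors, metadata)]
```

Each row of `features` could then be passed to a classifier such as gradient boosting; in practice a library implementation of TF-IDF and of the classifier would replace these hand-written helpers.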