Authors:
Elias Bassani 1 and Marco Viviani 2

Affiliations:
1 University of Milano-Bicocca, Department of Informatics, Systems, and Communication, Edificio U14 - Viale Sarca, 336, 20126 Milan, Italy; Consorzio per il Trasferimento Tecnologico (C2T), Milan, Italy
2 University of Milano-Bicocca, Department of Informatics, Systems, and Communication, Edificio U14 - Viale Sarca, 336, 20126 Milan, Italy
Keyword(s):
Data Quality, Wikipedia, Supervised Classification, Feature Analysis, Ground Truth Building.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Business Analytics; Computational Intelligence; Data Analytics; Data Engineering; Data Reduction and Quality Assessment; Evolutionary Computing; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Machine Learning; Mining Text and Semi-Structured Data; Soft Computing; Symbolic Systems; Web Mining
Abstract:
Wikipedia is nowadays one of the largest online resources on which users rely as a source of information. The amount of collaboratively generated content submitted to the online encyclopedia every day can lead to the creation of low-quality articles (and, consequently, to the spread of misinformation) if it is not properly monitored and revised. For this reason, this paper considers the problem of automatically assessing the quality of Wikipedia articles. In particular, the focus is (i) on the analysis of groups of hand-crafted features that supervised machine learning techniques can employ to classify Wikipedia articles on a qualitative basis, and (ii) on the analysis of some issues behind the construction of a suitable ground truth. Evaluations are performed on the analyzed features and on a specifically built labeled dataset by implementing different supervised classifiers based on distinct machine learning algorithms, which produced promising results.
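The supervised setup outlined in the abstract can be sketched as follows. This is a minimal illustration only: the hand-crafted features (e.g., article length, number of references, number of editors), the binary high/low quality labels, and the random-forest classifier are assumptions for demonstration, not the paper's actual feature groups, label scheme, or algorithms.

```python
# Illustrative sketch: classifying articles from hand-crafted features.
# The feature semantics and labels below are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Toy feature matrix: each row is one article described by hand-crafted
# features, e.g. [article length, number of references, number of editors].
n = 200
X = rng.random((n, 3))
# Toy ground-truth labels: 1 = high quality, 0 = low quality
# (here derived from one feature purely to make the example learnable).
y = (X[:, 1] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

In practice, any supervised learner (e.g., gradient boosting, SVMs) can be substituted for the random forest; the paper's contribution lies in the choice of feature groups and in how the labeled ground truth is built, not in a specific classifier.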