Noise in Datasets: What Are the Impacts on Classification Performance?

Rashida Hasan, Cheehung Chu

2022

Abstract

Classification is one of the fundamental tasks in machine learning. The quality of data is important in constructing any machine learning model with good prediction performance. Real-world data often suffer from noise, a term that usually refers to errors, irregularities, and corruptions in a dataset. However, we have no control over the quality of data used in classification tasks. The presence of noise in a dataset has three major negative consequences: (i) a decrease in classification accuracy, (ii) an increase in the complexity of the induced classifier, and (iii) an increase in training time. Therefore, it is important to systematically explore the effects of noise on classification performance. Although published studies have examined the effect of noise either for a particular learner or for a particular noise type, there is a lack of studies investigating the impact of different noise types on different learners. In this work, we address both dimensions, various learners and various noise types, and provide a detailed analysis of their effects on prediction performance. We use five classifiers (J48, Naive Bayes, Support Vector Machine, k-Nearest Neighbor, Random Forest), 10 benchmark datasets from the UCI Machine Learning Repository, and three publicly available image datasets. Our results can be used to guide the development of noise handling mechanisms.
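
To make the idea of controlled noise concrete, the following is a minimal sketch of one common noise type, label (class) noise, in which a chosen fraction of labels is flipped to a different class. The abstract does not say which noise types or injection procedure the authors used, so the function names `inject_label_noise` and `noise_fraction`, the noise rates, and the toy labels are all illustrative assumptions rather than the paper's method.

```python
import random


def inject_label_noise(labels, noise_rate, seed=0):
    """Return a copy of `labels` with a fraction `noise_rate` flipped.

    Hypothetical helper for illustration; not taken from the paper.
    Each selected label is replaced by a different class chosen at random.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    classes = sorted(set(labels))
    if len(classes) < 2:
        raise ValueError("need at least two classes to flip labels")

    rng = random.Random(seed)  # local RNG keeps the corruption reproducible
    noisy = list(labels)
    n_flip = round(noise_rate * len(noisy))

    # Pick distinct positions, then swap each label for another class.
    for i in rng.sample(range(len(noisy)), n_flip):
        alternatives = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


def noise_fraction(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(a != b for a, b in zip(clean, noisy)) / len(clean)


# Toy binary labels (assumed data, not one of the paper's datasets).
clean_labels = [0] * 50 + [1] * 50
for rate in (0.0, 0.1, 0.3):
    noisy_labels = inject_label_noise(clean_labels, rate, seed=42)
    print(rate, noise_fraction(clean_labels, noisy_labels))
```

In a study of this kind, one would typically train each classifier on the corrupted labels and evaluate on clean held-out data, repeating over several noise rates to see how accuracy degrades.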


Paper Citation


in Harvard Style

Hasan R. and Chu C. (2022). Noise in Datasets: What Are the Impacts on Classification Performance? In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-549-4, pages 163-170. DOI: 10.5220/0010782200003122


in Bibtex Style

@conference{icpram22,
author={Rashida Hasan and Cheehung Chu},
title={Noise in Datasets: What Are the Impacts on Classification Performance?},
booktitle={Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2022},
pages={163-170},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010782200003122},
isbn={978-989-758-549-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Noise in Datasets: What Are the Impacts on Classification Performance?
SN - 978-989-758-549-4
AU - Hasan R.
AU - Chu C.
PY - 2022
SP - 163
EP - 170
DO - 10.5220/0010782200003122