Authors: Markus Goldstein and Seiichi Uchida
Affiliation: Kyushu University, Japan
Keyword(s): Outlier Removal, Unsupervised Anomaly Detection, Handwritten Digit Recognition, Large-scale Dataset, Data Cleansing, Influence of Outliers.
Related Ontology Subjects/Areas/Topics: Applications; Classification; Clustering; Computer Vision, Visualization and Computer Graphics; Density Estimation; Image Understanding; Pattern Recognition; Theory and Methods
Abstract:
Outlier removal from training data is a classical problem in pattern recognition. Nowadays, this problem
becomes more important for large-scale datasets, for two reasons: first, there is a higher
risk of “unexpected” outliers, such as mislabeled training data; second, a large-scale dataset makes it more
difficult to grasp the distribution of outliers. On the other hand, many unsupervised anomaly detection methods
have been proposed, which can also be used for outlier removal. In this paper, we present a comparative study
of nine different anomaly detection methods in the scenario of outlier removal from a large-scale dataset.
For accurate performance observation, we need a simple and interpretable recognition procedure and
thus use a nearest neighbor-based classifier. As an adequate large-scale dataset, we prepared a handwritten
digit dataset comprising more than 800,000 manually labeled samples. With a data dimensionality of
16×16 = 256, it is ensured that each digit class has at least 100 times more instances than the data dimensionality.
The experimental results show that the common understanding that outlier removal improves classification
performance on small datasets is not true for high-dimensional large-scale datasets. Additionally, it was found
that local anomaly detection algorithms perform better on this data than their global equivalents.
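The pipeline the abstract describes — score training samples with an unsupervised anomaly detector, drop the highest-scoring ones, then classify with a nearest-neighbor rule — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a k-NN-distance anomaly score as a stand-in for the nine detectors compared in the paper, and all data and parameter values (`k`, `frac`) are invented for the example.

```python
# Illustrative sketch of outlier removal before nearest-neighbor
# classification. The detector (k-th-nearest-neighbor distance) and all
# data/parameters are assumptions for demonstration only.
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_anomaly_scores(points, k=2):
    """Global anomaly score: distance to the k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(d[k - 1])
    return scores

def remove_outliers(points, labels, k=2, frac=0.1):
    """Drop the fraction of training samples with the highest scores."""
    scores = knn_anomaly_scores(points, k)
    n_keep = len(points) - int(len(points) * frac)
    keep = sorted(range(len(points)), key=lambda i: scores[i])[:n_keep]
    return [points[i] for i in keep], [labels[i] for i in keep]

def classify_1nn(train_x, train_y, x):
    """Predict the label of the nearest remaining training sample."""
    i = min(range(len(train_x)), key=lambda j: dist(train_x[j], x))
    return train_y[i]

# Two tight clusters plus one far-away, mislabeled sample (the outlier).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (50, 50)]
train_y = ["a", "a", "a", "b", "b", "b", "a"]

clean_x, clean_y = remove_outliers(train_x, train_y, k=2, frac=0.15)
print(len(clean_x))                            # the outlier is removed
print(classify_1nn(clean_x, clean_y, (5.5, 5.5)))
```

The point of the paper's experiments is precisely whether this cleaning step helps: on this toy data it removes the mislabeled point, but the reported results show the benefit does not carry over to high-dimensional large-scale datasets.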