THE PERILS OF IGNORING DATA SUITABILITY
The Suitability of Data used to Train Neural Networks Deserves More Attention
Kevin Swingler
Computing and Maths, University of Stirling, FK9 4LA, Stirling, Scotland
Keywords: Data preparation, Machine learning, Data mining, Data quality, Quantity.
Abstract: The quality and quantity (we call it suitability from now on) of data that are used for a machine learning
task are as important as the capability of the machine learning algorithm itself. Yet these two aspects of
machine learning are not given equal weight by the data mining, machine learning and neural computing
communities. Data suitability is largely ignored compared to the effort expended on learning algorithm
development. This position paper argues that some of the new algorithms and many of the tweaks to
existing algorithms would be unnecessary if the data going into them were properly pre-processed, and calls
for a shift in effort towards data suitability assessment and correction.
1 INTRODUCTION
Neural networks are popular and well used machine
learning techniques, and deserve their place in any
data mining course, text book or software package.
Algorithm research has expanded in recent years
with authors producing thousands of papers either
proposing new learning algorithms or improving
existing ones.
However, there has not been a related explosion
in research addressing the suitability of the data that
these algorithms process and the issue is largely
ignored by courses, books and software.
This paper argues that the preparation of data and
the analysis of its suitability should receive the same
attention that is afforded to algorithm development.
The paper is not a criticism of algorithm
development – there is still much work to do – rather
it is a call to address the imbalance.
The paper starts by arguing that a robust set of
methods for analysing and fixing the suitability of
training data should be as much a part of the
standard neural tool box as MLPs and RBFs. Section
2 demonstrates that this is not currently the case with
a short analysis of data mining papers, popular text
books and software packages, showing how each is
biased towards learning algorithms at the expense of
a treatment of data suitability. Section 3 mentions
some general research in the area and the paper
finishes with a short summary of some of the data
suitability issues that deserve more attention.
2 DATA SUITABILITY IS
LARGELY IGNORED
Machine learning algorithms, and neural networks in
particular, owe their performance to three things: the
data they are fed, the quality of the learning and
inference algorithms and the expertise of the user.
With existing algorithms, a little know-how and
some trial and error it is reasonably easy to produce
a correct solution from suitable data. However,
many algorithms – and neural networks in particular
– cannot compensate for unsuitable data, no matter
how much expertise the user displays. It would
therefore be sensible to use data suitability methods
to fix or discard data prior to the application of a
simple machine learning algorithm than to attempt to
optimise the algorithm to work with data exhibiting
a particular problem.
Research gains practical importance when it is
applied, and it is most likely to become applied
when it is taught in text books and courses and
implemented in widely used software. In the next
section we examine the treatment of data suitability
by the data mining community, software packages
and text books.
2.1 Data Mining
A recent survey paper (Wu et al., 2008) listed the
top 10 data mining algorithms identified by the
IEEE International Conference on Data Mining in
405
Swingler K..
THE PERILS OF IGNORING DATA SUITABILITY - The Suitability of Data used to Train Neural Networks Deserves More Attention.
DOI: 10.5220/0003687104050409
In Proceedings of the International Conference on Neural Computation Theory and Applications (NCTA-2011), pages 405-409
ISBN: 978-989-8425-84-3
Copyright
c
2011 SCITEPRESS (Science and Technology Publications, Lda.)
2006. Two things should be noted from this paper.
There were no neural networks in the list – neither
MLPs nor RBFs – and there were no data
preparation techniques. The traditional text book
criticism of neural networks is that they are a ‘black
box’ technique. Generally this is cited as a reason
why companies might shy away from using them for
commercially important judgements, but it is also a
weakness when relying on the machine learning
technique itself to highlight problems in the original
data.
Take a decision tree as an example. The explicit
and accessible representation of knowledge allows
users to trace the route to a classification and explore
sensitivities (a small change in x would lead to a
different classification, but the current class is
insensitive to changes in y, for example). This is one
text book explanation of the black box criticism, but
it is possible to do the same thing with an analysis of
the partial derivatives of an MLP. The real
advantage of the decision tree’s structure is that it
exposes problems that were hidden in the original
data and allows the expert data miner to improve the
model. This is not easy with neural networks and
this is the main disadvantage of their black box
nature.
Wu et al. found that the most popular
classification methods were k Nearest Neighbours
(kNN), Naïve Bayes (Hand, 2001), Support Vector
Machines (Vapnik, 1995), and two tree building
algorithms: C4.5 (Quinlan, 1993) and CART
(Breiman, 1984). With the possible exception of
SVM, these algorithms all share the common feature
of allowing some problems with the training data to
be fixed after model building or even at inference
time. We argue that neural networks are not used for
data mining as often as they once were partially for
this reason. They hide the problems in the solution
that were caused by problems with the data in such a
way that no post-model building adjustments are
possible.
2.2 Software Packages
Weka (Bouckaert, 2010) and Rapid Miner are two
popular free data mining software packages. Both
offer data visualisation, manipulation and attribute
selection tools, but neither offers data quantity or
quality analysis. SAS, which is a very popular
commercial analytics package, offers neural
networks amongst its data mining options but data
quality processing is limited to outlier filtering.
None of these software packages offers an analytic
data quantity assessment tool.
Which neural networks do these software
packages support? The three packages above offer
RBFs and MLPs only. There have been many neural
network architectures and algorithms designed since
these two were invented, but these two persist as the
only ones to make it into large scale data mining
software packages.
To an extent, the field of data mining grew up
with that of neural computing. Some early data
mining software packages offered little more than
neural networks – they could classify, predict and
cluster and were viewed as something of a universal
solution. As Wu et al. have shown, this is no longer
the case, and we argue that the reason is that they
hide the consequences of unsuitable training data.
If we are to see neural networks used for more
commercial applications, we must address the issue
of data suitability. This will take the field forwards
faster than more incremental improvements in
learning algorithm design.
There is a danger in the widespread practice of
making an improvement to an existing algorithm and
demonstrating that improvement on a benchmark
data set. The danger is that we only see the
successes, not the tweaks that produced no
improvement. The risk is that the literature fills with
algorithms that are suited to certain types of data or,
worse, certain benchmark data sets. The practitioner
is then faced with the impossible task of locating the
right algorithm for their data. With better methods of
understanding the data prior to learning, we could
safely employ a smaller range of standard learning
algorithms.
2.3 Text Books
A review of a number of data mining and neural
computing text books further illustrates the point.
Classic neural network texts such as (Hertz et al.,
1991) and (Haykin, 1994) do not deal with data
suitability issues at all. More recent neural network
texts such as (Dreyfus, 2005), (Bishop, 2006) and
(Tang et al., 2007) show a similar omission.
(Swingler, 1996) dedicates a chapter to data quality
and quantity but even the author admits that this is
now out of date.
Data mining books should be better, but (Witten
and Frank, 2005), which is a popular course text
book offers two or three pages of vague advice on
ensuring that data is suitable. Recently published
(Du, 2010) has a chapter on data preparation but
offers just a few pages on data quality and no
analytic methods. There are a few specific books
covering data quality and preparation: (Pyle, 1999)
NCTA 2011 - International Conference on Neural Computation Theory and Applications
406
is good and (Dasu and Johnson, 2003) has some
useful content but such books are rare compared to
the number of data mining and neural network
algorithm books on the market.
2.4 First Conclusion
The research that is being carried out on data
suitability has less chance of being applied because
books, courses and software packages are not
treating it with the importance it deserves.
3 RESEARCH
We are not suggesting that data quality issues are
ignored by researchers. The recent launch of the
ACM Journal of Data and Information Quality
(Madnick et al., 2009) is an encouraging
development, though data preparation for machine
learning is a small aspect of its overall remit. Much
work on data quality has focused on management
information systems and their need for data
integrity. Data cleansing for machine learning
presents an additional set of challenges.
Some authors (Zhu et al., 2007) have pointed out
that data quality issues consume the majority of time
and budget for commercial data mining projects.
They also point out that data cleansing often focuses
on incomplete, imprecise or uncertain data – errors
in other words – rather than a more general question
of data suitability for the machine learning task.
4 CALL TO ACTION
Poor data suitability can be difficult to detect.
Problems range from simple data entry errors or
missing values through outliers and minority values
to multi-dimensional interactions such as correlated
inputs and the many varieties of the curse of
dimensionality.
The effects of poor data quality can be difficult
to predict and detect and we have already mentioned
that ‘black box’ neural networks are particularly
susceptible to them. A set of methods for the
analysis and correction of data suitability for neural
network training and data mining in general is
needed. Algorithms need to be developed, reported
in text books and lecture courses, and embedded in
data mining software packages. Assessment of data
prior to the application of machine learning
algorithms needs to gain an importance equal to that
of those algorithms themselves.
There is, of course, active research into many
aspects of data suitability. Some of the larger fields
include data imputation, feature selection, and
abnormality detection. Our argument is that there
needs to be more of it and that it needs to be taken
more seriously both by the research community and
in textbooks and courses.
We need to identify and catalogue the problems
that can be found in data sets destined for machine
learning algorithms. We need automated methods
for detecting, alerting and where possible correcting
for these problems before the process of learning
begins.
4.1 Making a Start
Much data quality research is concerned with data
governance – that is, ensuring data is recorded,
notated and audited correctly. Such assurances are
comforting for the data miner, but it is not this type
of data quality that interests us in this case. We are
concerned with the qualities of a data set that make
it suitable (or otherwise) as the raw ingredient for a
machine learning project – hence our use of the term
data suitability.
At a minimum, we suggest that no course, text
book or software package about data mining should
lack a detailed consideration of how the following
impact on data quantity requirements and model
quality:
4.1.1 Data Distribution
The distribution of the training data has a large
impact on the quality of a learned model. The
problem of imbalanced target classes is perhaps the
best studied aspect of this – see (Japkowicz and
Stephen, 2002) for an overview. The distribution of
data also has an important impact on required
training set size, feature selection, error detection
and the risk of over-fitting. This is true for both
numeric and nominal data types, for inputs and
outputs.
Univariate histograms are a useful tool for early
feature selection, but more work is need on
automated distribution based data quality and
selection methods. Features such as outliers, isolated
data points and variables with too few or too many
discrete values should be considered.
4.1.2 Missing Data and Errors
Imputation of missing data is well studied, with
many algorithms available for this task (Little and
THE PERILS OF IGNORING DATA SUITABILITY - The Suitability of Data used to Train Neural Networks Deserves
More Attention
407
Rubin, 2002) give a good overview. Imputing
missing values has an impact on required data
quantity, risk of over-fitting, data distribution and
learning algorithm performance. Errors in the data
are more difficult to spot but some of the methods
used for data imputation can also be used for error
detection.
4.1.3 Feature Selection
Feature selection is another well studied field with
many proposed techniques, see (Gheyas and Smith,
2010) for a recent example. We suggest that these
methods would benefit from being viewed in the
light of the other data suitability issues listed here. In
this we include other considerations such as feature
independence.
4.1.4 Data Quantity
The issues listed above all have an impact on the
quantity of data required for a successful machine
learning project. Although it is true that solving the
problems of data quality would mean that data
quantity is not an issue in itself, it is certainly a
useful measure of suitability when other aspects of
data quality are only partially understood.
5 CONCLUSIONS
The majority of time and resources on most
professional data mining projects is consumed by
data preparation. This deals with outliers, missing
values, abnormal distributions, data errors,
insufficient data quantities, ill-posed data, co-
dependent inputs and a list of other issues.
This paper does not argue that such data
preparation, cleaning and verification does not take
place, neither does it argue that the issue is ignored
by the research community. It argues that algorithms
for dealing with these issues are as important as
algorithms for machine learning and inference, and
so should constitute much more of the research in
that field and a larger proportion of the content of
teaching, text books and software.
We would like to see the data mining community
make more use of neural computing based methods
and we believe that an improved approach to data
suitability will encourage that to happen.
ACKNOWLEDGEMENTS
Thanks to Prof. Leslie Smith for his help in
preparing this paper
REFERENCES
Bishop, C. M., 2006. Pattern recognition and machine
learning. Springer.
Bouckaert, R. R, Frank, E., Hall, M. A., Holmes, G.,
Pfahringer, B., Reutemann, P. and Witten, I. H., 2010.
WEKA-experiences with a java open-source project.
Journal of Machine Learning Research, 11:2533-
2541. JMLR
Breiman, L., Friedman, J. H., Olshen, R. A. and Stone, C.
J., 1984. Classification and regression trees.
Wadsworth.
Dasu, T., Johnson, T., 2003. Exploratory data mining and
data cleaning. Wiley-Interscience.
Dreyfus, G., 2005. Neural networks: methodology and
applications. Springer.
Du, H., 2010. Data Mining Techniques and Applications:
An Introduction. Cengage Learning.
Gheyas, I. A. and Smith, L. S., 2010. Feature subset
selection in large dimensionality domains. Pattern
Recognition. 43. Elsevier.
Hand, D. J., Yu, K., 2001. Idiot’s Bayes—not so stupid
after all?. Int. Stat. Rev. 69:385–398. International
Statistical Institute.
Haykin, S. S., 1994. Neural networks: a comprehensive
foundation. Macmillan.
Hertz, J., Krogh, A. and Palmer, R. G., 1991. Introduction
to the theory of neural computation. Santa Fe institute
studies in the sciences of complexity: Lecture notes.
Westview Press.
Japkowicz, N. and Stephen, S., 2002. The class imbalance
problem: A systematic study. Intel. Data Anal. 6 pp.
429–449.
Little, R. J. A. and Rubin, D. B., 2002. Statistical Analysis
with Missing Data. Wiley.
Madnick S. E., Wang, R. Y., Yang, W. L. and Hongwei,
Z., 2009. Overview and Framework for Data and
Information Quality Research. ACM Journal of Data
and Information Quality. 1,1. ACM.
Pyle, D., 1999. Data preparation for data mining. Morgan
Kaufmann.
Quinlan, J. R., 1993 C4.5: Programs for machine
learning. Morgan Kaufmann.
Swingler, K., 1996. Applying neural networks: a practical
guide. Academic Press.
Tang, H., Tan, K. C. and Zhang, Y., 2007. Neural
networks: computational models and applications.
Springer.
Vapnik, V., 1995. The nature of statistical learning
theory. Springer.
Witten, I. H. and Frank, E., 2005. Data mining: practical
machine learning tools and techniques. Morgan
Kaufman.
NCTA 2011 - International Conference on Neural Computation Theory and Applications
408
Wu, X., Kumar, V., Quinlan, R. J., Ghosh, J., Yang, Q.,
Motoda, H., McLachlan, G. J., Ng A., Liu, B., Yu, P.
S., Zhou, Z. H., Steinbach, M., Hand, D., J. and
Steinberg, D., 2008. Top 10 algorithms in data mining.
Knowl. Inf. Syst. 14, 1:1-37. Springer-Verlag.
Zhu, X., Khoshgoftaar, T. M., Davidson, I. and Zhang, S.,
2007. Editorial: Special issue on mining low-quality
data. Knowl. Inf. Syst. 11,2: 131–136. Springer-
Verlag.
THE PERILS OF IGNORING DATA SUITABILITY - The Suitability of Data used to Train Neural Networks Deserves
More Attention
409