Authors:
Yingying Wen
;
Kevin B. Korb
and
Ann E. Nicholson
Affiliation:
Monash University, Australia
Keyword(s):
Machine learning, Incomplete data, Data generation, Data analysis, Missing data, Data mining, Machine learning evaluation.
Related
Ontology
Subjects/Areas/Topics:
Artificial Intelligence
;
Computational Intelligence
;
Data Mining
;
Databases and Information Systems Integration
;
Enterprise Information Systems
;
Evolutionary Computing
;
Knowledge Discovery and Information Retrieval
;
Knowledge-Based Systems
;
Machine Learning
;
Sensor Networks
;
Signal Processing
;
Soft Computing
;
Symbolic Systems
Abstract:
Evaluating the relative performance of machine learners on incomplete data is important because one common problem with real data is that the data is often incomplete, which means that some values in the data are not present. DataZapper is a tool for uncreating data: given a dataset containing joint samples over variables, DataZapper will make a specified percentage of observed values disappear, replaced by an indication that the measurement failed. Since the causal mechanisms of measurement that result in failed measurements may depend in arbitrary ways upon the system under study, it is important to be able to produce incomplete data sets which allow for such arbitrary dependencies. DataZapper is the only tool that allows any kind of dependence, and any degree of dependence, in its generation of missing data. We illustrate its use in a machine learning experiment and offer it to the data mining and machine learning communities.