Author:
Christopher Scaffidi
Affiliation:
Institute for Software Research, School of Computer Science, Carnegie Mellon University, United States
Keyword(s):
Data integration, unsupervised learning, outlier finding, data formats, spreadsheets, databases, web services.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Enterprise Information Systems; HCI on Enterprise Information Systems; Human-Computer Interaction; Intelligent User Interfaces
Abstract:
One common approach to validating data such as email addresses and phone numbers is to check whether values conform to a desired data format. Unfortunately, specifying the format may require users to learn a specialized notation such as regular expressions, and even after learning the notation, specifying formats may take substantial time. To address these problems, this paper introduces Topei, a system that infers a format from an unlabeled collection of examples (which may contain errors). The inferred format is presented in understandable English so that users can review and customize it. In addition, the format can be used to check data automatically and find outliers that do not match. On test data, Topei shows substantially higher precision and recall than an alternative algorithm (Lapis). Topei’s usefulness is demonstrated by integrating it with spreadsheet, database, and web services systems.
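
The general idea of inferring a format from unlabeled examples and then flagging non-matching values can be illustrated with a small sketch. This is not Topei's algorithm, which the abstract does not describe. It is a deliberately simple stand-in that assumes a format can be approximated by a character-class signature, and the names `signature`, `infer_format`, and `find_outliers` are hypothetical.

```python
from collections import Counter
from itertools import groupby


def signature(value: str) -> str:
    """Collapse a string into a coarse format signature (an assumed,
    simplified notion of "format", not the one used by Topei).

    Each run of digits becomes 'D', each run of letters becomes 'A',
    and every other character is kept literally.
    """
    def char_class(ch: str) -> str:
        if ch.isdigit():
            return "D"
        if ch.isalpha():
            return "A"
        return ch

    out = []
    for key, group in groupby(char_class(ch) for ch in value):
        if key in ("D", "A"):
            out.append(key)        # collapse the whole run to one symbol
        else:
            out.extend(group)      # keep punctuation and spaces as-is
    return "".join(out)


def infer_format(examples: list[str]) -> str:
    """Return the most common signature among the examples.

    Taking the majority signature tolerates a minority of erroneous values.
    """
    if not examples:
        raise ValueError("at least one example is required")
    return Counter(signature(v) for v in examples).most_common(1)[0][0]


def find_outliers(examples: list[str]) -> list[str]:
    """Return the examples whose signature differs from the inferred format."""
    fmt = infer_format(examples)
    return [v for v in examples if signature(v) != fmt]


samples = ["555-123-4567", "555-987-6543", "412-268-0000", "call me", "5551234567"]
print(infer_format(samples))   # D-D-D
print(find_outliers(samples))  # ['call me', '5551234567']
```

Because digit and letter runs are collapsed, this sketch ignores field lengths, so a value such as `1-2-3` would be accepted as a phone number. A practical system would need a richer format model.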