An Ontology-based Methodology for Reusing Data Cleaning Knowledge

Ricardo Almeida, Paulo Maio, Paulo Oliveira, João Barroso

2015

Abstract

Organizations' demand to integrate several heterogeneous data sources, together with an ever-increasing volume of data, is revealing the presence of data quality problems. Currently, most data cleaning approaches (for detecting and correcting data quality problems) are tailored to data sources that share the same schema and the same data model (e.g., the relational model). Moreover, these approaches depend heavily on a domain expert to specify the data cleaning operations. This paper extends a previously proposed data cleaning methodology that reuses cleaning knowledge specified for other data sources. The methodology is refined by specifying the requirements that a data cleaning operations vocabulary must satisfy. Ontologies in RDF/OWL are proposed as the data model for an abstract representation of the data schemas, regardless of the underlying data model (e.g., relational or graph). Existing approaches, methods and techniques that support the implementation of the proposed methodology in general, and of the data cleaning operations vocabulary in particular, are also presented and discussed.



Paper Citation


in Harvard Style

Almeida R., Maio P., Oliveira P. and Barroso J. (2015). An Ontology-based Methodology for Reusing Data Cleaning Knowledge. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 2: KEOD, (IC3K 2015) ISBN 978-989-758-158-8, pages 202-211. DOI: 10.5220/0005596402020211


in Bibtex Style

@conference{keod15,
author={Ricardo Almeida and Paulo Maio and Paulo Oliveira and João Barroso},
title={An Ontology-based Methodology for Reusing Data Cleaning Knowledge},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 2: KEOD, (IC3K 2015)},
year={2015},
pages={202-211},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005596402020211},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 2: KEOD, (IC3K 2015)
TI - An Ontology-based Methodology for Reusing Data Cleaning Knowledge
SN - 978-989-758-158-8
AU - Almeida R.
AU - Maio P.
AU - Oliveira P.
AU - Barroso J.
PY - 2015
SP - 202
EP - 211
DO - 10.5220/0005596402020211