Guildlines of Data Quality Issues for Data Integration in the Context of the TPC-DI Benchmark

Qishan Yang, Mouzhi Ge, Markus Helfert

Abstract

Nowadays, many business intelligence and master data management initiatives rely on regular data integration. Because data integration extracts and combines a variety of data sources, it is considered a prerequisite for data analytics and management. More recently, TPC-DI has been proposed as an industry benchmark for data integration. It is designed to benchmark data integration and to serve as a standard for evaluating ETL performance. Source data exhibit a variety of data quality problems, such as multi-meaning attributes and inconsistent data schemas, which not only cause problems for the data integration process but also affect subsequent data mining and data analytics. This paper summarises typical data quality problems in data integration and adapts the traditional data quality dimensions to classify them. We found that data completeness, timeliness and consistency are critical for data quality management in data integration, and that data consistency should be further defined at the pragmatic level. To prevent typical data quality problems and proactively manage data quality in ETL, we propose a set of practical guidelines for researchers and practitioners conducting data quality management in data integration.



Paper Citation


in Harvard Style

Yang Q., Ge M. and Helfert M. (2017). Guildlines of Data Quality Issues for Data Integration in the Context of the TPC-DI Benchmark. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 135-144. DOI: 10.5220/0006334301350144


in Bibtex Style

@conference{iceis17,
author={Qishan Yang and Mouzhi Ge and Markus Helfert},
title={Guildlines of Data Quality Issues for Data Integration in the Context of the TPC-DI Benchmark},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={135-144},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006334301350144},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Guildlines of Data Quality Issues for Data Integration in the Context of the TPC-DI Benchmark
SN - 978-989-758-247-9
AU - Yang Q.
AU - Ge M.
AU - Helfert M.
PY - 2017
SP - 135
EP - 144
DO - 10.5220/0006334301350144