Discovering Data Lineage from Data Warehouse Procedures

Kalle Tomingas, Priit Järv, Tanel Tammet

Abstract

We present a method to calculate component dependencies and data lineage from the database structure and a large set of associated procedures and queries, independently of actual data in the data warehouse. The method relies on the probabilistic estimation of the impact of data in queries. We present a rule system supporting the efficient calculation of the transitive closure. The dependencies are categorized, aggregated and visualized to address various planning and decision support problems. System performance is evaluated and analysed over several real-life datasets.
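To make the idea of a weighted transitive closure over dependencies concrete, the following is a minimal sketch and not the authors' rule system. It assumes, purely for illustration, that each direct dependency carries an impact weight in (0, 1], that the weight of a path is the product of its edge weights, and that alternative paths are combined by taking the maximum. The function name `lineage_closure` and the table names are hypothetical.

```python
from collections import defaultdict


def lineage_closure(edges):
    """Return {(source, target): weight} for all transitive dependencies.

    edges: iterable of (source, target, weight), weight in (0, 1].
    Assumption (not from the paper): path weight is the product of edge
    weights, and competing paths are combined with max().
    """
    succ = defaultdict(list)
    for src, dst, w in edges:
        succ[src].append((dst, w))
    nodes = set(succ) | {d for outs in succ.values() for d, _ in outs}

    best = {}
    for start in nodes:
        # Worklist of (node, weight of the path from start to node).
        stack = [(start, 1.0)]
        while stack:
            node, w = stack.pop()
            for nxt, edge_w in succ.get(node, []):
                new_w = w * edge_w
                # Only follow a path if it strictly improves the known
                # weight; this also guarantees termination on cycles.
                if nxt != start and new_w > best.get((start, nxt), 0.0):
                    best[(start, nxt)] = new_w
                    stack.append((nxt, new_w))
    return best


# Hypothetical direct dependencies extracted from warehouse procedures.
direct = [
    ("src.orders", "stg.orders", 1.0),
    ("stg.orders", "dw.sales", 0.5),
    ("src.orders", "dw.sales", 0.2),
]
closure = lineage_closure(direct)
print(closure[("src.orders", "dw.sales")])
```

Here the indirect path through `stg.orders` outweighs the weaker direct edge, so the sketch reports the stronger dependency. Other combination rules (for example, summing or a noisy-or) would be equally plausible choices under different assumptions.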



Paper Citation


in Harvard Style

Tomingas K., Järv P. and Tammet T. (2016). Discovering Data Lineage from Data Warehouse Procedures. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 101-110. DOI: 10.5220/0006054301010110


in Bibtex Style

@conference{kdir16,
author={Kalle Tomingas and Priit Järv and Tanel Tammet},
title={Discovering Data Lineage from Data Warehouse Procedures},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={101-110},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006054301010110},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Discovering Data Lineage from Data Warehouse Procedures
SN - 978-989-758-203-5
AU - Tomingas K.
AU - Järv P.
AU - Tammet T.
PY - 2016
SP - 101
EP - 110
DO - 10.5220/0006054301010110