Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems

Priscilla Kelly M. Vieira, Bernadette Farias Lóscio, Ana Carolina Salgado

2017

Abstract

Entity Resolution (ER) is the problem of identifying groups of tuples, from one or multiple data sources, that represent the same real-world entity. It is a crucial stage of data integration processes, which often need to integrate data at query time. The task becomes even more challenging when data sources are dynamic or data volumes are large. Because most ER techniques process all tuples at once, new solutions have been proposed to handle large volumes of data. One approach is to perform the ER process on query results rather than on the whole data set. Another is to reuse the results of previous ER tasks in order to reduce the number of pairwise tuple comparisons at query time. Similarly, indexing techniques can help identify equivalent tuples and reduce the number of pairwise comparisons. In this context, this work proposes an indexing technique for incremental Entity Resolution processes. The expected contributions of this work are the specification, implementation and evaluation of the proposed indexes. We performed experiments that measured the time spent storing, accessing and updating the indexes. We concluded that reusing previous results makes the ER process more efficient than recomputing the tuple comparisons, while yielding results of similar quality.
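
The idea of reusing earlier ER results at query time can be illustrated with a minimal sketch. It is not the index structure proposed in the paper, which the abstract does not detail. The class `ResolutionIndex`, the helper `similar`, the 0.85 threshold and the sample tuples are all hypothetical, and string similarity against one representative per cluster stands in for whatever matching function a real system would use.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical matcher: normalized string similarity above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


class ResolutionIndex:
    """Hypothetical index that remembers which cluster each tuple belongs to."""

    def __init__(self):
        self.cluster_of = {}      # tuple id -> cluster id
        self.representative = {}  # cluster id -> representative value
        self.comparisons = 0      # pairwise comparisons performed so far

    def resolve(self, query_result):
        """Resolve only the tuples returned by a query, reusing earlier results."""
        clusters = {}
        for tuple_id, value in query_result:
            if tuple_id not in self.cluster_of:
                # Unseen tuple: compare it against known clusters.
                self.cluster_of[tuple_id] = self._assign(tuple_id, value)
            # Seen tuple: the stored cluster is reused without any comparison.
            clusters.setdefault(self.cluster_of[tuple_id], []).append(tuple_id)
        return clusters

    def _assign(self, tuple_id, value):
        for cluster_id, rep in self.representative.items():
            self.comparisons += 1
            if similar(value, rep):
                return cluster_id
        # No match: the tuple starts a new cluster and represents it.
        self.representative[tuple_id] = value
        return tuple_id


index = ResolutionIndex()
first = index.resolve(
    [("t1", "Abbey Road"), ("t2", "Abbey Road."), ("t3", "Kind of Blue")]
)
after_first = index.comparisons
second = index.resolve(
    [("t1", "Abbey Road"), ("t3", "Kind of Blue"), ("t4", "Kind Of Blue")]
)
print(first, second, index.comparisons - after_first)
```

In the second query only the unseen tuple `t4` triggers comparisons, while `t1` and `t3` are answered from the index. A real index would also have to handle updates and deletions in the sources, which this sketch ignores.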

Paper Citation


in Harvard Style

Vieira P., Lóscio B. and Salgado A. (2017). Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems. In Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-247-9, pages 185-192. DOI: 10.5220/0006251801850192


in Bibtex Style

@conference{iceis17,
author={Priscilla Kelly M. Vieira and Bernadette Farias Lóscio and Ana Carolina Salgado},
title={Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems},
booktitle={Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2017},
pages={185-192},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006251801850192},
isbn={978-989-758-247-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Dynamic Indexing for Incremental Entity Resolution in Data Integration Systems
SN - 978-989-758-247-9
AU - Vieira P.
AU - Lóscio B.
AU - Salgado A.
PY - 2017
SP - 185
EP - 192
DO - 10.5220/0006251801850192