An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases

Emiel Caron

2020

Abstract

Many databases contain ambiguous and unstructured data which makes the information it contains difficult to use for further analysis. In order for these databases to be a reliable point of reference, the data needs to be cleaned. Entity resolution focuses on disambiguating records that refer to the same entity. In this paper we propose a generic optimization method for disambiguating large databases. This method is used on a table with scientific references from the Patstat database. The table holds ambiguous information on citations to scientific references. The research method described is used to create clusters of records that refer to the same bibliographic entity. The method starts by pre-cleaning the records and extracting bibliographic labels. Next, we construct rules based on these labels and make use of the tf-idf algorithm to compute string similarities. We create clusters by means of a rule-based scoring system. Finally, we perform precision-recall analysis using a golden set of clusters and optimize our parameters with simulated annealing. Here we show that it is possible to optimize the performance of a disambiguation method using a global optimization algorithm.

Download


Paper Citation


in Harvard Style

Caron E. (2020). An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases.In Proceedings of the 15th International Conference on Software Technologies - Volume 1: ICSOFT, ISBN 978-989-758-443-5, pages 625-632. DOI: 10.5220/0009972406250632


in Bibtex Style

@conference{icsoft20,
author={Emiel Caron},
title={An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases},
booktitle={Proceedings of the 15th International Conference on Software Technologies - Volume 1: ICSOFT,},
year={2020},
pages={625-632},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009972406250632},
isbn={978-989-758-443-5},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 15th International Conference on Software Technologies - Volume 1: ICSOFT,
TI - An Optimization Method for Entity Resolution in Databases: With a Case Study on the Cleaning of Scientific References in Patent Databases
SN - 978-989-758-443-5
AU - Caron E.
PY - 2020
SP - 625
EP - 632
DO - 10.5220/0009972406250632