Identification of Organization Name Variants in Large Databases using Rule-based Scoring and Clustering - With a Case Study on the Web of Science Database

Emiel Caron, Hennie Daniels

Abstract

This research describes a general method for automatically cleaning organization and business name variants in large databases, such as patent databases, bibliographic databases, databases in business information systems, or any other database containing organization name variants. The method clusters name variants of organizations based on similarities in their associated metadata, for example postal code and email domain data. It consists of two parts: a rule-based scoring system and a clustering system. The method is tested on the cleaning of research organization names in the Web of Science database for the purpose of bibliometric analysis and scientific performance evaluation. The clustering results are evaluated with precision and recall on a verified data set. The evaluation shows that the method performs well and is conservative: it values precision over recall, with on average 95% precision and 80% recall for clusters.
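The two-part structure described above (rule-based scoring followed by clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the rule weights, the threshold, the field names, and the choice of connected components as the clustering step are all assumptions made for illustration, and the records are invented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rule weights and threshold; the abstract does not state
# the actual rules or scores used in the paper.
RULE_WEIGHTS = {"postal_code": 3, "email_domain": 4, "city": 1}
THRESHOLD = 4  # assumed minimum score needed to link two name variants


def pair_score(a, b):
    """Sum the weights of all metadata fields on which both records agree."""
    return sum(
        weight
        for field, weight in RULE_WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def cluster(records):
    """Link pairs scoring at or above THRESHOLD and return the
    connected components as sorted lists of names (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if pair_score(records[i], records[j]) >= THRESHOLD:
            parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[find(i)].append(rec["name"])
    return sorted(sorted(names) for names in groups.values())


# Invented records, for illustration only.
records = [
    {"name": "Example Univ", "postal_code": "1000 AA",
     "email_domain": "example.edu", "city": "Exampletown"},
    {"name": "Univ Example", "postal_code": "1000 AA",
     "email_domain": None, "city": "Exampletown"},
    {"name": "Example Univ Hosp", "postal_code": None,
     "email_domain": "example.edu", "city": "Exampletown"},
    {"name": "Other Inst", "postal_code": "2000 BB",
     "email_domain": "other.org", "city": "Othertown"},
]

clusters = cluster(records)
print(clusters)
```

In this sketch, "Univ Example" and "Example Univ Hosp" agree only on city and so score below the threshold as a pair, yet they end up in the same cluster because each is linked to "Example Univ". Raising the threshold makes the linking more conservative, which would be expected to trade recall for precision.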

Paper Citation


in Harvard Style

Caron E. and Daniels H. (2016). Identification of Organization Name Variants in Large Databases using Rule-based Scoring and Clustering - With a Case Study on the Web of Science Database. In Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-187-8, pages 182-187. DOI: 10.5220/0005836701820187


in Bibtex Style

@conference{iceis16,
author={Emiel Caron and Hennie Daniels},
title={Identification of Organization Name Variants in Large Databases using Rule-based Scoring and Clustering - With a Case Study on the Web of Science Database},
booktitle={Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2016},
pages={182-187},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005836701820187},
isbn={978-989-758-187-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 18th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Identification of Organization Name Variants in Large Databases using Rule-based Scoring and Clustering - With a Case Study on the Web of Science Database
SN - 978-989-758-187-8
AU - Caron E.
AU - Daniels H.
PY - 2016
SP - 182
EP - 187
DO - 10.5220/0005836701820187