A Comparative of Spanish Encoding Functions - Efectiveness on Record Linkage

María del Pilar Angeles, Noemi Bailón-Miguel

Abstract

Many businesses running big data projects suffer from duplicate data, which seriously impedes managers from making well-informed decisions. Phonetic techniques for duplicate detection that are based on the English language are not oriented to Spanish, in which words are pronounced largely as they are written. Identifying and correcting problems such as spelling errors in low-quality Spanish data with such techniques is therefore not suitable. In this paper we have implemented, modified and used in SEUCAD (Angeles, 2014) three Spanish phonetic algorithms to detect duplicate text strings in the presence of spelling errors in Spanish. The results were satisfactory: the Phonetic Spanish algorithm performed best most of the time, demonstrating opportunities for improved performance of Spanish encoding during the record linkage process.
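To illustrate the general idea of phonetic encoding for duplicate detection, the following is a minimal sketch in Python. It is not one of the three algorithms evaluated in the paper: the function names `spanish_key` and `group_candidates` and the small rule set (b/v, s/z, soft c and g, silent h, ll/y) are assumptions chosen only for illustration. Strings that sound alike in Spanish receive the same key, so records whose keys collide become candidate duplicates.

```python
import unicodedata
from collections import defaultdict


def spanish_key(word: str) -> str:
    """Return a rough phonetic key for a Spanish word (illustrative rules only)."""
    w = unicodedata.normalize("NFC", word).lower()
    # Protect ñ with a placeholder so accent stripping keeps it distinct from n
    # (assumes the input never contains '#').
    w = w.replace("ñ", "#")
    w = unicodedata.normalize("NFD", w)
    w = "".join(ch for ch in w if not unicodedata.combining(ch))
    # Digraphs and context-dependent letters first; codes are written in
    # uppercase so that later lowercase rules do not touch them again.
    for old, new in (("ch", "X"), ("ll", "Y"), ("qu", "K"),
                     ("ce", "Se"), ("ci", "Si"), ("ge", "Je"), ("gi", "Ji")):
        w = w.replace(old, new)
    # Single letters that share a sound in most Spanish dialects; h is silent.
    table = str.maketrans({"b": "B", "v": "B", "s": "S", "z": "S",
                           "c": "K", "k": "K", "q": "K",
                           "y": "Y", "j": "J", "h": None})
    w = w.translate(table).upper().replace("#", "Ñ")
    # Collapse runs of the same symbol (this also merges rr with r).
    out = []
    for ch in w:
        if not out or out[-1] != ch:
            out.append(ch)
    return "".join(out)


def group_candidates(names):
    """Group strings that share a phonetic key; return groups with 2+ members."""
    groups = defaultdict(list)
    for name in names:
        groups[spanish_key(name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(group_candidates(["Jiménez", "Gimenes", "Pérez", "González", "Gonsales"]))
```

In a record linkage pipeline a key of this kind would typically serve only to propose candidate pairs, which are then compared with a string similarity measure before being classified as matches.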

References

  1. Angeles, P., et al., 2014. Universal evaluation system data quality. In DBKDA 2014: The Sixth International Conference on Advances in Databases, Knowledge, and Data Applications, vol. 32, pp. 13-19.
  2. Angeles, P., J. García-Ugalde, A. Espino-Gamez, & J. Gil-Moncada, 2015. Comparison of a Modified Spanish Soundex, and Phonex Coding function during datamatching process. In International Conference on Informatics, Electronics and Vision (ICIEV), Kitakyushu, Fukuoka, Japan, IEEE, pp. 1-6. ISBN: 978-1-4673-6901-5, DOI: 10.1109/ICIEV.2015.7334028.
  3. Borgman, C. L. & S. L. Siegfried, 1992. Getty's Synoname and its cousins: A survey of applications of personal name-matching algorithms. In Journal of the American Society for Information Science 43 (7), 459-476.
  4. Christen, P., 2008. Febrl: A Freely Available Record Linkage System with a Graphical User Interface. In Second Australasian Workshop on Health Data and Knowledge Management (HDKM 2008), 80, 17-25.
  5. Christen, P., 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution and Duplicate Detection. Springer Data-Centric Systems and Applications.
  6. Churches, T., P. Christen, K. Lim, & J. X. Zhu, 2002. Preparation of name and address data for record linkage using hidden Markov models. In BMC Medical Informatics and Decision Making 2 (1), 9.
  7. Cohen, W. W., P. Ravikumar, & S. E. Fienberg, 2003. A comparison of string distance metrics for name matching tasks, 73-78.
  8. Rahm E. & H. Do, 2000. Data cleaning: Problems and current approaches. In IEEE Data Engineering 23 (4), 3-13.
  9. Amon, F. M. I. & J. Echeverria, 2012. Algoritmo fonético para detección de cadenas de texto duplicadas en el idioma español. In Ingenierías Universidad de Medellin 11 (20), 120-138.
  10. Jaro, M. A., 1989. Advances in record-linkage methodology as applied to matching the 1985 Census of Tampa, Florida. In Journal of the American Statistical Association, 84, 414-420.
  11. Mosquera, A., E. Lloret, & P. Moreda, 2012. Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation. In Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility, 9-14.
  12. Odell, M. & R. Russell, 1918. The soundex coding system. US Patent 1261167.
  13. Philips, L., 2000. The double metaphone search algorithm. In C/C++ Users J 18 (6), 38-43.
  14. Toscano-Mateus, H., J. B. Powers (ed.), 1965. Hablemos del lenguaje.
  15. Winkler, W., 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In Proceedings of the Section on Survey Research Methods, American Statistical Association, 354-359.


Paper Citation


in Harvard Style

Angeles M. and Bailón-Miguel N. (2016). A Comparative of Spanish Encoding Functions - Efectiveness on Record Linkage. In Proceedings of the Fifth International Conference on Telecommunications and Remote Sensing - Volume 1: ICTRS, ISBN 978-989-758-200-4, pages 105-113. DOI: 10.5220/0006227701050113


in Bibtex Style

@conference{ictrs16,
author={María del Pilar Angeles and Noemi Bailón-Miguel},
title={A Comparative of Spanish Encoding Functions - Efectiveness on Record Linkage},
booktitle={Proceedings of the Fifth International Conference on Telecommunications and Remote Sensing - Volume 1: ICTRS},
year={2016},
pages={105-113},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006227701050113},
isbn={978-989-758-200-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Telecommunications and Remote Sensing - Volume 1: ICTRS
TI - A Comparative of Spanish Encoding Functions - Efectiveness on Record Linkage
SN - 978-989-758-200-4
AU - Angeles M.
AU - Bailón-Miguel N.
PY - 2016
SP - 105
EP - 113
DO - 10.5220/0006227701050113