Towards Unsupervised Word Error Correction in Textual Big Data

Joao Paulo Carvalho, Sérgio Curto

2014

Abstract

Large unedited technical text databases may contain information that cannot be properly extracted with Natural Language Processing (NLP) tools because of the many word errors they contain. A good example is the MIMIC II database, where medical text reports directly represent experts’ views on real-time observable data. Such reports contain valuable information that could improve predictive medical decision-making models based on physiological data, but they have not been used for that purpose so far. In this paper we propose a fuzzy-based semi-automatic method that specifically addresses the large number of word errors in such databases, allowing the direct application of NLP techniques, such as Bag of Words, to the textual data.
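
To make the idea of correcting word errors before applying Bag of Words concrete, the following is a minimal sketch and not the method proposed in the paper. It assumes a plain normalized Levenshtein similarity as a stand-in for a fuzzy word similarity function, and the names `similarity`, `correct`, `bag_of_words`, the reference vocabulary, and the 0.75 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the words are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def correct(token: str, vocabulary: list[str], threshold: float = 0.75) -> str:
    """Map a token to its most similar vocabulary word if it is similar enough.

    The threshold is an assumed value; tokens below it are left untouched.
    """
    if token in vocabulary:
        return token
    best = max(vocabulary, key=lambda word: similarity(token, word))
    return best if similarity(token, best) >= threshold else token


def bag_of_words(text: str, vocabulary: list[str], threshold: float = 0.75) -> Counter:
    """Count tokens after correcting them against the vocabulary."""
    tokens = text.lower().split()
    return Counter(correct(t, vocabulary, threshold) for t in tokens)


if __name__ == "__main__":
    # Hypothetical reference vocabulary and note, for illustration only.
    vocab = ["patient", "stable", "fever", "no"]
    print(bag_of_words("patiant stable no fevr patient", vocab))
```

Without the correction step, misspelled variants such as "patiant" and "patient" would be counted as separate features, which is the kind of fragmentation the abstract describes. A fixed threshold is a crude decision rule; in practice it would need tuning against the specific corpus and vocabulary.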


Paper Citation


in Harvard Style

Carvalho J. and Curto S. (2014). Towards Unsupervised Word Error Correction in Textual Big Data. In Proceedings of the International Conference on Fuzzy Computation Theory and Applications - Volume 1: FCTA, (IJCCI 2014) ISBN 978-989-758-053-6, pages 181-186. DOI: 10.5220/0005140401810186


in BibTeX Style

@conference{fcta14,
author={Joao Paulo Carvalho and Sérgio Curto},
title={Towards Unsupervised Word Error Correction in Textual Big Data},
booktitle={Proceedings of the International Conference on Fuzzy Computation Theory and Applications - Volume 1: FCTA, (IJCCI 2014)},
year={2014},
pages={181-186},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005140401810186},
isbn={978-989-758-053-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Fuzzy Computation Theory and Applications - Volume 1: FCTA, (IJCCI 2014)
TI - Towards Unsupervised Word Error Correction in Textual Big Data
SN - 978-989-758-053-6
AU - Carvalho J.
AU - Curto S.
PY - 2014
SP - 181
EP - 186
DO - 10.5220/0005140401810186