Parsing Medical Text into De-identified Databases

Veronica Dahl, Sara Saghaei, Oliver Schulte

Abstract

De-identification is the automatic removal of all Private Health Information (PHI) from medical records. Research in this active and important area has focused mainly on semi-structured records. This narrow focus has allowed the development of standard criteria that formally determine the boundaries of privacy and can be used for evaluation. However, in addition to semi-structured data obtained from filled-in forms and the like, medical records include free text, in which identifiers are more difficult to detect. In this article we address the problem of de-identification within unstructured medical records. We show how the following methods allow us to recognize, in some cases, identifiers that currently go undetected: (1) parsing free-form medical text into typed logical relationships, including assumptions for candidate identifiers; (2) a novel use of state-of-the-art engines for processing English queries to the web. A formal definition of our approach, within a rigorous logical system that supports the implementation of our ideas, is also available on the website.



Paper Citation


in Harvard Style

Dahl V., Saghaei S. and Schulte O. (2011). Parsing Medical Text into De-identified Databases. In Proceedings of the 1st International Workshop on AI Methods for Interdisciplinary Research in Language and Biology - Volume 1: BILC, (ICAART 2011) ISBN 978-989-8425-42-3, pages 77-87. DOI: 10.5220/0003309700770087


in Bibtex Style

@conference{bilc11,
author={Veronica Dahl and Sara Saghaei and Oliver Schulte},
title={Parsing Medical Text into De-identified Databases},
booktitle={Proceedings of the 1st International Workshop on AI Methods for Interdisciplinary Research in Language and Biology - Volume 1: BILC, (ICAART 2011)},
year={2011},
pages={77-87},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003309700770087},
isbn={978-989-8425-42-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Workshop on AI Methods for Interdisciplinary Research in Language and Biology - Volume 1: BILC, (ICAART 2011)
TI - Parsing Medical Text into De-identified Databases
SN - 978-989-8425-42-3
AU - Dahl V.
AU - Saghaei S.
AU - Schulte O.
PY - 2011
SP - 77
EP - 87
DO - 10.5220/0003309700770087