Towards a Unified Named Entity Recognition System - Disease Mention Identification

Tsendsuren Munkhdalai, Meijing Li, Khuyagbaatar Batsuren, Keun Ho Ryu

2015

Abstract

Named Entity Recognition (NER) is an essential prerequisite for effective text mining of biomedical text data. Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining, driven by the recent growth in the amount of biomedical literature. In this study, we take a step towards a unified NER system for the biomedical, chemical and medical domains. We evaluate word representation features learnt automatically from a large unlabeled corpus for disease NER. The word representation features include Brown cluster labels and Word Vector Classes (WVC), which are built by applying k-means clustering to the continuous-valued word vectors of a Neural Language Model (NLM). The experimental evaluation on the Arizona Disease Corpus (AZDC) showed that these word representation features boost system performance as significantly as a manually tuned domain dictionary does. BANNER-CHEMDNER, a chemical and biomedical NER system, has been extended with a disease mention recognition model that achieves a 77.84% F-measure on AZDC when evaluated with 10-fold cross-validation. BANNER-CHEMDNER is freely available at: https://bitbucket.org/tsendeemts/banner-chemdner.
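
The Word Vector Classes idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the BANNER-CHEMDNER implementation: the toy two-dimensional vectors, the function names `kmeans` and `word_vector_classes`, and the `WVC=` label format are hypothetical, and a real system would cluster vectors learnt by a neural language model from a large unlabeled corpus.

```python
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster index per vector."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def word_vector_classes(word_vectors, k, seed=0):
    """Map every word to a discrete class label derived from its cluster."""
    words = sorted(word_vectors)
    labels = kmeans([word_vectors[w] for w in words], k, seed=seed)
    return {w: "WVC=%d" % label for w, label in zip(words, labels)}


# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "diabetes": [0.90, 0.10],
    "asthma": [0.80, 0.20],
    "cancer": [0.85, 0.15],
    "the": [0.10, 0.90],
    "of": [0.20, 0.80],
    "and": [0.15, 0.85],
}

wvc = word_vector_classes(toy_vectors, k=2)

# Each token's class label can then serve as a discrete feature for a
# sequence labeller; unknown words get a fallback label (an assumption).
sentence = ["the", "risk", "of", "asthma"]
features = [wvc.get(token, "WVC=UNK") for token in sentence]
```

Turning continuous vectors into a small set of class labels lets a feature-based tagger use them the same way it uses other categorical features such as Brown cluster labels.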


Paper Citation


in Harvard Style

Munkhdalai T., Li M., Batsuren K. and Ryu K. (2015). Towards a Unified Named Entity Recognition System - Disease Mention Identification. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015) ISBN 978-989-758-070-3, pages 251-255. DOI: 10.5220/0005287802510255


in Bibtex Style

@conference{bioinformatics15,
author={Tsendsuren Munkhdalai and Meijing Li and Khuyagbaatar Batsuren and Keun Ho Ryu},
title={Towards a Unified Named Entity Recognition System - Disease Mention Identification},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},
year={2015},
pages={251-255},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005287802510255},
isbn={978-989-758-070-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - Towards a Unified Named Entity Recognition System - Disease Mention Identification
SN - 978-989-758-070-3
AU - Munkhdalai T.
AU - Li M.
AU - Batsuren K.
AU - Ryu K.
PY - 2015
SP - 251
EP - 255
DO - 10.5220/0005287802510255