Generating a Distilled N-Gram Set - Effective Lexical Multiword Building in the SPECIALIST Lexicon

Chris J. Lu, Destinee Tormey, Lynn McCreedy, Allen C. Browne


Multiwords are vital to building better Natural Language Processing (NLP) systems: they support more effective and efficient parsers, refine information retrieval searches, and enhance precision and recall in Medical Language Processing (MLP) applications. The Lexical Systems Group has enhanced the coverage of multiwords in the SPECIALIST Lexicon to provide a more comprehensive resource for such applications. This paper describes a new systematic approach to acquiring lexical multiwords from MEDLINE through filters and matchers based on empirical models. The design goals, functional descriptions, tests, and applications of the filters, matchers, and data are discussed. Results include: 1) a smaller (38%) distilled MEDLINE n-gram set with better precision than, and similar recall to, the MEDLINE n-gram set; 2) a system for generating high-precision multiword candidates for effective Lexicon building. We believe the MLP/NLP community can benefit from access to these big data (MEDLINE n-gram) sets. We also anticipate accelerated growth of multiwords in the Lexicon with this system. Ultimately, improvements in recall or precision can be anticipated in NLP projects that use the distilled MEDLINE n-gram set, the SPECIALIST Lexicon, and its applications.



Paper Citation

in Harvard Style

Lu C., Tormey D., McCreedy L. and Browne A. (2017). Generating a Distilled N-Gram Set - Effective Lexical Multiword Building in the SPECIALIST Lexicon. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 5: HEALTHINF, (BIOSTEC 2017), ISBN 978-989-758-213-4, pages 77-87. DOI: 10.5220/0006142000770087
