ENTROPY ON ONTOLOGY AND INDEXING IN INFORMATION RETRIEVAL

Yevgeniy Guseynov

Abstract

In this paper, we present a formalization of an Index Assignment process applied to documents stored in a text database. The process uses key phrases or terms from a hierarchical thesaurus or ontology and is based on a new notion of entropy on ontology for terms and their weights, which extends both the Shannon concept of entropy in Information Theory and the Resnik semantic similarity measure for terms in an ontology. The introduced notion provides a measure of closeness, or semantic similarity, for a set of terms in an ontology together with their weights, and allows the creation of a clustering algorithm that constructively resolves the index assignment task. The algorithm was tested on 30,000 documents randomly extracted from the MEDLINE biomedical database, all of which had been manually indexed by professional indexers. The main result of the experiments is that, after all 30,000 documents were processed, the presented algorithm and the human indexers had the same understanding of the documents in seven topics out of ten.
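For readers unfamiliar with the two classical ingredients named in the abstract, the following is a minimal sketch of Shannon entropy over term weights and of Resnik similarity on a small taxonomy. It is not the paper's entropy on ontology, which the abstract does not define; the taxonomy `PARENT`, the occurrence counts `COUNTS`, and all function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy taxonomy: each term maps to its parent (root maps to None).
PARENT = {
    "disease": None,
    "infection": "disease",
    "neoplasm": "disease",
    "influenza": "infection",
    "pneumonia": "infection",
    "carcinoma": "neoplasm",
}

# Hypothetical counts of how often each term itself occurs in a corpus.
COUNTS = {
    "disease": 2, "infection": 2, "neoplasm": 1,
    "influenza": 5, "pneumonia": 3, "carcinoma": 3,
}


def ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def concept_probability(term):
    """Share of all occurrences that fall on the term or any of its descendants."""
    total = sum(COUNTS.values())
    subsumed = sum(n for t, n in COUNTS.items() if term in ancestors(t))
    return subsumed / total


def information_content(term):
    """Information content of a term: log2(1 / p(term))."""
    return math.log2(1.0 / concept_probability(term))


def resnik_similarity(a, b):
    """Resnik similarity: information content of the most informative common ancestor."""
    common = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(c) for c in common)


def shannon_entropy(weights):
    """Shannon entropy (in bits) of non-negative weights normalized to a distribution."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return sum(p * math.log2(1.0 / p) for p in probs)


if __name__ == "__main__":
    print(resnik_similarity("influenza", "pneumonia"))  # siblings under "infection"
    print(resnik_similarity("influenza", "carcinoma"))  # share only the root
    print(shannon_entropy([1, 1, 1, 1]))                # uniform weights
```

In this sketch, two terms that share only the root get similarity zero, because the root subsumes every occurrence and therefore carries no information. Uniform weights give the largest entropy for a given number of terms.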

References

  1. Agrawal, R., Chakrabarti, S., Dom, B.E., Raghavan, P., 2001. Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values. United States Patent 6,233,575.
  2. Aronson, A.R., Mork, J.G., Gay, C.W., Humphrey, S.M., Rogers, W.J., 2004. The NLM indexing initiative's Medical Text Indexer, Stud Health Technol Inform 107 (Pt 1), pp. 268-272.
  3. Calmet, J., Daemi, A., 2004. From entropy to ontology. Fourth International Symposium "From Agent Theory to Agent Implementation", R. Trappl, Ed., vol. 2, pp. 547-551.
  4. Cho, M., Choi, C., Kim, W., Park, J., Kim, P., 2007. Comparing Ontologies using Entropy. 2007 International Conference on Convergence Information Technology, Korea, 873-876.
  5. Grobelnik, M., Brank, J., Fortuna, B., Mozetic, I., 2008. Contextualizing Ontologies with OntoLight: A Pragmatic Approach. Informatica 32, 79-84.
  6. Guseynov, Y., 2009. XML Processing. No Parsing. Proceedings WEBIST 2009 - 5th International Conference on Web Information Systems and Technologies, INSTICC, Lisbon, Portugal, pp. 81-84.
  7. Klein, D., Manning, C.D., 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
  8. Lee, J.H., Kim, M.H., Lee, Y.J., 1993. Information retrieval based on conceptual distance in IS-A hierarchies. Journal of Documentation, 49(2):188-207, June.
  9. Lindberg, D.A.B., Humphreys, B.L., McCray, A.T., 1993. The Unified Medical Language System. Methods of Information in Medicine, 32(4): 281-91.
  10. Manning, C.D., Schütze, H., 1999. Foundations of Statistical Natural Language Processing. The MIT Press.
  11. Manning, C.D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
  12. Medelyan, O., Witten, I.H., 2006a. Thesaurus Based Automatic Keyphrase Indexing. JCDL'06, June 11-15, Chapel Hill, North Carolina, USA.
  13. Medelyan, O., Witten, I.H., 2006b. Measuring Inter-Indexer Consistency Using a Thesaurus. JCDL'06, June 11-15, Chapel Hill, North Carolina, USA.
  14. Nelson, S.J., Johnston, J., Humphreys, B.L., 2001. Relationships in Medical Subject Headings. In: Bean, Carol A.; Green, Rebecca, editors. Relationships in the organization of knowledge. New York: Kluwer Academic Publishers. p.171-184.
  15. Névéol, A., Shooshan, S.E., Humphrey, S.M., Mork, J.G., Aronson, A.R., 2009. A recent advance in the automatic indexing of the biomedical literature. J Biomed Inform. Oct;42(5):814-23.
  16. Qiu, Y., Frei, H.P., 1993. Concept based query expansion. In Proc. SIGIR, pp. 160-169. ACM Press.
  17. Rasmussen, E., 1992. Clustering algorithms. In William B. Frakes and Ricardo Baeza-Yates (eds.), Information Retrieval, pp. 419-442. Englewood Cliffs, NJ: Prentice Hall.
  18. Resnik, P., 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of IJCAI, pages 448-453.
  19. Resnik, P., 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11, 95-130.
  20. Rolling, L., 1981. Indexing consistency, quality and efficiency. Information Processing and Management, 17, 69-76.
  21. Salton, G., 1989. Automatic Text Processing. Addison-Wesley.
  22. Salton, G., 1991. The Smart project in automatic document retrieval. In Proc. SIGIR, pp. 356-358. ACM Press.
  23. Shannon, C.E., 1948. A Mathematical Theory of Communication. Bell System Technical Journal. 27:3 pp 379-423.
  24. Schütze, H., 1998. Automatic word sense discrimination. Computational Linguistics 24(1):97-124.
  25. Tudhope, D., Alani, H., Jones, C., 2001. Augmenting thesaurus relationships: possibilities for retrieval. Journal of Digital Information, Volume 1 Issue 8, 2.
  26. Walker, D.E., 1987. Knowledge resource tools for accessing large text files. In Sergei Nirenburg (ed.), Machine Translation: Theoretical and methodological issues. pp. 247-261. Cambridge: Cambridge University Press.
  27. Wiener, N., 1961. Cybernetics, or Control and Communication in the Animal and the Machine. New York and London: M.I.T. Press and John Wiley and Sons, Inc.
  28. Wolfram, D., Zhang, J., 2008. The Influence of Indexing Practices and Weighting Algorithms on Document Spaces. Journal of The American Society for Information Science and Technology, 59(1):3-11.


Paper Citation


in Harvard Style

Guseynov Y. (2011). ENTROPY ON ONTOLOGY AND INDEXING IN INFORMATION RETRIEVAL. In Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8425-51-5, pages 555-567. DOI: 10.5220/0003298205550567


in Bibtex Style

@conference{webist11,
author={Yevgeniy Guseynov},
title={ENTROPY ON ONTOLOGY AND INDEXING IN INFORMATION RETRIEVAL},
booktitle={Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2011},
pages={555-567},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003298205550567},
isbn={978-989-8425-51-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - ENTROPY ON ONTOLOGY AND INDEXING IN INFORMATION RETRIEVAL
SN - 978-989-8425-51-5
AU - Guseynov Y.
PY - 2011
SP - 555
EP - 567
DO - 10.5220/0003298205550567