Effects of Comparable Corpora on Cross-Language Information Retrieval

Fatiha Sadat

2010

Abstract

This paper presents an approach to learning bilingual terminology from scarce resources in order to translate and expand terms from a source language to a target language and, possibly, to retrieve documents across languages. A bilingual lexicon extracted from comparable corpora provides a valuable resource for enriching existing bilingual dictionaries and thesauri. A linear combination of the bilingual terminology extracted from comparable corpora, readily available bilingual dictionaries, and transliteration is proposed for Cross-Language Information Retrieval. An application to the Japanese-English language pair shows that the proposed combination yields better translations and that effective information retrieval can be achieved across languages.
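As a rough illustration of what a linear combination of translation resources might look like, the sketch below merges candidate translations for one query term from three resources. The abstract does not state the weights, the scoring scale, or the data layout, so the function name, the resource names, the scores, and the weights here are all hypothetical assumptions and not the paper's actual model.

```python
from collections import defaultdict


def combine_translation_scores(sources, weights):
    """Linearly combine per-resource translation scores for one source term.

    sources: dict mapping a resource name to {candidate: score}
    weights: dict mapping the same resource names to non-negative weights
    (Both structures are assumptions made for this sketch.)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    combined = defaultdict(float)
    for name, candidates in sources.items():
        # Normalise so that the effective weights sum to 1.
        w = weights.get(name, 0.0) / total
        for candidate, score in candidates.items():
            combined[candidate] += w * score

    # Rank candidates by combined score, highest first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical scores for a single Japanese query term.
sources = {
    "comparable_corpora": {"retrieval": 0.6, "search": 0.3},
    "dictionary": {"retrieval": 1.0},
    "transliteration": {"search": 0.2},
}
# Hypothetical weights; in practice they would need tuning on held-out data.
weights = {"comparable_corpora": 0.5, "dictionary": 0.25, "transliteration": 0.25}

ranked = combine_translation_scores(sources, weights)
for candidate, score in ranked:
    print(f"{candidate}: {score:.2f}")
```

A candidate supported by several resources accumulates weight from each of them, which is one plausible reason a combination could outperform any single resource.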



Paper Citation


in Harvard Style

Sadat F. (2010). Effects of Comparable Corpora on Cross-Language Information Retrieval. In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010) ISBN 978-989-8425-13-3, pages 53-59. DOI: 10.5220/0003029200530059


in Bibtex Style

@conference{nlpcs10,
author={Fatiha Sadat},
title={Effects of Comparable Corpora on Cross-Language Information Retrieval},
booktitle={Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010)},
year={2010},
pages={53-59},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003029200530059},
isbn={978-989-8425-13-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010)
TI - Effects of Comparable Corpora on Cross-Language Information Retrieval
SN - 978-989-8425-13-3
AU - Sadat F.
PY - 2010
SP - 53
EP - 59
DO - 10.5220/0003029200530059