WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS

Atelach Alemu Argaw, Lars Asker

Abstract

We present recent work aimed at constructing a bilingual corpus consisting of comparable Amharic and English news texts. The Amharic and English texts were collected from an Ethiopian news agency that publishes daily news in Amharic and English through its web page. The Amharic texts are represented using Ethiopic script and archived according to the Ethiopian calendar. The overlap between the corresponding Amharic and English news texts in the archive is comparatively small: only about one article in ten has a corresponding translated version. Thus a major part of the work has been to identify the subset of matching news texts in the archive, transliterate the Amharic texts into an ASCII representation, and align them with their corresponding English versions. In doing so, we utilised a number of available software tools and data sources that were (mainly) found on the Internet. Amharic is a language for which very few computational linguistic tools or corpora (such as electronic lexica, part-of-speech taggers, parsers or tree-banks) exist. A challenge has therefore been to show that it is possible to create a comparable corpus even in the absence of these resources. We used fuzzy string matching between words in the English and Amharic titles as a way to determine how likely it is that two news items refer to the same event. To restrict the matching algorithm further, we only compared titles of news items that were published on the corresponding date and at the same place. We present an experimental evaluation of the algorithm, based on data from one year, and show that fuzzy string matching of news titles can be sufficient to align Amharic and English news texts with relatively high precision despite the obvious differences between the two languages.
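The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, thresholds or data layout. The sketch assumes that Amharic titles have already been transliterated into ASCII, that place names have been normalised to a shared key, and that edit distance (Levenshtein) is the fuzzy measure. The names `NewsItem`, `title_score`, `best_match`, the thresholds `0.6` and `0.3`, and the example titles are all hypothetical. The calendar helper uses the standard Julian day number formula for the Ethiopian (Amete Mihret) era.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


def ethiopian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert an Ethiopian (Amete Mihret) date to a Gregorian date."""
    # Julian day number of the Ethiopian date (era offset 1723856).
    jdn = 1723856 + 365 + 365 * (year - 1) + year // 4 + 30 * month + day - 31
    # Python ordinal 1 is 0001-01-01, whose Julian day number is 1721426.
    return date.fromordinal(jdn - 1721425)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def word_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical ignoring case."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def title_score(amharic_title: str, english_title: str,
                threshold: float = 0.6) -> float:
    """Fraction of Amharic title words that have a close English match."""
    am_words = amharic_title.split()
    en_words = english_title.split()
    if not am_words or not en_words:
        return 0.0
    hits = sum(1 for w in am_words
               if max(word_similarity(w, e) for e in en_words) >= threshold)
    return hits / len(am_words)


@dataclass
class NewsItem:
    title: str       # Amharic titles are assumed to be transliterated to ASCII
    published: date  # Gregorian date (convert Ethiopian archive dates first)
    place: str       # assumed normalised to a shared key


def best_match(amharic_item: NewsItem, english_items: List[NewsItem],
               min_score: float = 0.3) -> Optional[NewsItem]:
    """Best-scoring English item with the same date and place, if any."""
    scored = [(title_score(amharic_item.title, e.title), e)
              for e in english_items
              if e.published == amharic_item.published
              and e.place == amharic_item.place]
    scored = [pair for pair in scored if pair[0] >= min_score]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None


# Invented placeholder data, not taken from the corpus.
day = ethiopian_to_gregorian(2000, 1, 1)
am = NewsItem("sudan delegasiyon adis abeba", day, "Addis Ababa")
e1 = NewsItem("Sudan delegation arrives in Addis Ababa", day, "Addis Ababa")
e2 = NewsItem("Coffee exports increase", day, "Addis Ababa")
e3 = NewsItem("Sudan delegation arrives in Addis Ababa",
              ethiopian_to_gregorian(2000, 1, 2), "Addis Ababa")
match = best_match(am, [e2, e1, e3])
```

In this sketch, names and loanwords that survive transliteration carry the alignment, and the date and place filter keeps the number of title comparisons small. Both thresholds are free parameters that would need tuning against manually aligned pairs.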



Paper Citation


in Harvard Style

Alemu Argaw A. and Asker L. (2005). WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS. In Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 972-8865-20-1, pages 239-246. DOI: 10.5220/0001228502390246


in Bibtex Style

@conference{webist05,
author={Atelach Alemu Argaw and Lars Asker},
title={WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2005},
pages={239-246},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001228502390246},
isbn={972-8865-20-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS
SN - 972-8865-20-1
AU - Alemu Argaw A.
AU - Asker L.
PY - 2005
SP - 239
EP - 246
DO - 10.5220/0001228502390246