Authors: Atelach Alemu Argaw and Lars Asker

Affiliation: Stockholm University, Sweden

Keyword(s): Web mining, parallel corpora, text alignment, Amharic.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Soft Computing ; Symbolic Systems ; Web Mining

Abstract: We present recent work aimed at constructing a bilingual corpus consisting of comparable Amharic and English news texts. The Amharic and English texts were collected from an Ethiopian news agency that publishes daily news in Amharic and English through its web page. The Amharic texts are represented using Ethiopic script and archived according to the Ethiopian calendar. The overlap between the corresponding Amharic and English news texts in the archive is comparatively small: only about one article in ten has a corresponding translated version. Thus a major part of the work has been to identify the subset of matching news texts in the archive, transliterate the Amharic texts into an ASCII representation, and align them with their corresponding English versions. In doing so, we utilised a number of available software tools and data sources that were (mainly) found on the Internet. Amharic is a language for which very few computational linguistic tools or corpora (such as electronic lexica, part-of-speech taggers, parsers or tree-banks) exist. A challenge has therefore been to show that it is possible to create a comparable corpus even in the absence of these resources. We used fuzzy string matching between words in the English and Amharic titles as a way to determine how likely it is that two news items refer to the same event. To restrict the matching algorithm further, we only compared titles of news items that were published on the same date and at the same place. We present an experimental evaluation of the algorithm, based on data from one year, and show that fuzzy string matching of news titles can be sufficient to align Amharic and English news texts with relatively high precision despite the obvious difference between the two languages.
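The title-matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring function, so this sketch assumes a simple count of Amharic title words that have a close fuzzy match among the English title words, using `difflib.SequenceMatcher` from the standard library. The `NewsItem` type, the function names, both thresholds, and the assumption that dates are already converted to a common calendar and titles already transliterated to ASCII are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class NewsItem:
    date: str   # assumed: already converted to one common calendar
    place: str  # assumed: dateline location, normalised to one spelling
    title: str  # English title, or Amharic title transliterated to ASCII


def shared_word_count(am_title: str, en_title: str, word_threshold: float = 0.7) -> int:
    """Count Amharic title words with a fuzzy match among the English title words.

    Hypothetical scoring: names and loanwords tend to survive transliteration,
    so they are the words most likely to clear the threshold.
    """
    en_words = en_title.lower().split()
    count = 0
    for am_word in am_title.lower().split():
        if any(
            SequenceMatcher(None, am_word, en_word).ratio() >= word_threshold
            for en_word in en_words
        ):
            count += 1
    return count


def align(amharic_items, english_items, min_shared: int = 1):
    """Pair each Amharic item with its best-scoring English candidate.

    Only English items published on the same date and at the same place
    are considered, mirroring the restriction stated in the abstract.
    """
    pairs = []
    for am in amharic_items:
        candidates = [
            en for en in english_items
            if (en.date, en.place) == (am.date, am.place)
        ]
        if not candidates:
            continue
        best = max(candidates, key=lambda en: shared_word_count(am.title, en.title))
        score = shared_word_count(am.title, best.title)
        if score >= min_shared:
            pairs.append((am, best, score))
    return pairs
```

Restricting candidates by date and place before scoring keeps the comparison set small, so a fairly loose word-level threshold can be tolerated without pairing unrelated stories. In practice both thresholds would need tuning against manually aligned pairs.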

CC BY-NC-ND 4.0

Paper citation in several formats:
Alemu Argaw, A. and Asker, L. (2005). WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS. In Proceedings of the First International Conference on Web Information Systems and Technologies - WEBIST; ISBN 972-8865-20-1; ISSN 2184-3252, SciTePress, pages 239-246. DOI: 10.5220/0001228502390246

@conference{webist05,
author={Atelach {Alemu Argaw} and Lars Asker},
title={WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - WEBIST},
year={2005},
pages={239-246},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001228502390246},
isbn={972-8865-20-1},
issn={2184-3252},
}

TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - WEBIST
TI - WEB MINING FOR AN AMHARIC - ENGLISH BILINGUAL CORPUS
SN - 972-8865-20-1
IS - 2184-3252
AU - Alemu Argaw, A.
AU - Asker, L.
PY - 2005
SP - 239
EP - 246
DO - 10.5220/0001228502390246
PB - SciTePress