USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES

Hadi Mohammadzadeh, Franz Schweiggert, Gholamreza Nakhaeizadeh

Abstract

In this paper, we propose a new and simple approach to extracting the main content of web pages written in right-to-left (R2L) languages. One of the most important features of the proposed algorithm is its independence from the DOM tree and HTML tags. In practice, HTML tags are written with English characters, which occupy the interval [0, 127] of the character set. Most R2L languages, such as Arabic, instead use a definite interval of the Unicode character set that lies outside [0, 127]. In the first phase of our approach, we use this distinction to separate the R2L characters from the English ones. Then, for each HTML file, we determine the density of R2L characters and the density of non-R2L characters. The part of the HTML file with a high density of R2L characters and a low density of non-R2L characters contains the main content of the web page with high accuracy. The proposed algorithm has been tested, evaluated, and compared with the latest main content extraction approach on 2166 selected web pages.
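The density idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how densities are measured or how the high-density region is chosen. The sketch assumes that densities are counted per line of the HTML file, that "R2L characters" can be approximated by the Unicode Hebrew and Arabic blocks (U+0590 to U+06FF), and that the main content is the contiguous run of lines maximising the sum of (R2L count minus ASCII count), found with a maximum-subarray scan. The names `count_chars` and `main_content_span` are hypothetical.

```python
from typing import List, Tuple

# Assumption: R2L characters are approximated by the Unicode Hebrew and
# Arabic blocks (U+0590..U+06FF); the paper may use a different interval.
R2L_START, R2L_END = 0x0590, 0x06FF


def count_chars(line: str) -> Tuple[int, int]:
    """Return (r2l_count, ascii_count) for one line, ignoring whitespace."""
    r2l = ascii_ = 0
    for ch in line:
        if ch.isspace():
            continue
        cp = ord(ch)
        if R2L_START <= cp <= R2L_END:
            r2l += 1
        elif cp <= 127:  # the [0, 127] interval mentioned in the abstract
            ascii_ += 1
    return r2l, ascii_


def main_content_span(lines: List[str]) -> Tuple[int, int]:
    """Return (start, end) line indices, end exclusive, of the contiguous
    region maximising sum(r2l - ascii). Returns (0, 0) if no region has a
    positive score. The scoring rule is an assumption of this sketch."""
    best_sum, best = 0, (0, 0)
    cur_sum, cur_start = 0, 0
    for i, line in enumerate(lines):
        r2l, ascii_ = count_chars(line)
        score = r2l - ascii_
        if cur_sum <= 0:
            # A non-positive running total cannot help; restart here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best
```

Given a list of decoded HTML lines, `main_content_span(lines)` returns a half-open index range, so `lines[start:end]` is the candidate main content. Markup-heavy lines such as menus and footers score negatively and are excluded unless they sit between two strongly R2L lines.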



Paper Citation


in Harvard Style

Mohammadzadeh H., Schweiggert F. and Nakhaeizadeh G. (2011). USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES. In Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT, ISBN 978-989-8425-76-8, pages 243-249. DOI: 10.5220/0003508502430249


in Bibtex Style

@conference{icsoft11,
author={Hadi Mohammadzadeh and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES},
booktitle={Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT},
year={2011},
pages={243-249},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003508502430249},
isbn={978-989-8425-76-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT
TI - USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES
SN - 978-989-8425-76-8
AU - Mohammadzadeh H.
AU - Schweiggert F.
AU - Nakhaeizadeh G.
PY - 2011
SP - 243
EP - 249
DO - 10.5220/0003508502430249