USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES

Hadi Mohammadzadeh; Franz Schweiggert; Gholamreza Nakhaeizadeh

Topics: Data and Information Retrieval

In Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT, 243-249, 2011, Seville, Spain

Authors: Hadi Mohammadzadeh ¹ ; Franz Schweiggert ¹ and Gholamreza Nakhaeizadeh ²

Affiliations: ¹ University of Ulm, Germany ; ² University of Karlsruhe, Germany

Keyword(s): Main content extraction, Information retrieval, UTF-8, HTML documents, Right to left languages.

Related Ontology Subjects/Areas/Topics: Business Analytics ; Data and Information Retrieval ; Data Engineering

Abstract: In this paper, we propose a new and simple approach to extract the main content of Right-to-Left (R2L) language web pages. Independence from the DOM tree and HTML tags is one of the most important features of the proposed algorithm. In practice, HTML tags are written in English, and the English character set lies in the interval [0,127]. Most languages written from right to left, such as Arabic, instead use a definite interval of the Unicode character set that lies outside this range. In the first phase of our approach, we use this distinction to separate the R2L characters from the English ones. Then, for each HTML file, we determine the density of the R2L characters and the density of the non-R2L characters. The part of the HTML file with a high density of R2L characters and a low density of non-R2L characters contains the main content of the web page with high accuracy. The proposed algorithm has been tested, evaluated, and compared with the latest main content extraction approach on 2166 selected web pages.
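The density idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the granularity, the character interval, or the threshold. The per-line granularity, the use of the basic Arabic block U+0600 to U+06FF as the R2L interval, the `min_ratio` parameter, and the function names `count_chars` and `extract_main_content` are all assumptions made for illustration.

```python
# Hypothetical sketch of the density idea; not the paper's algorithm.
# Assumption: the basic Arabic block U+0600..U+06FF stands in for "R2L characters".
R2L_FIRST, R2L_LAST = 0x0600, 0x06FF
ASCII_LAST = 127  # the interval [0,127] named in the abstract


def count_chars(line):
    """Return (r2l_count, ascii_count) for one line; ASCII whitespace is ignored."""
    r2l = sum(1 for ch in line if R2L_FIRST <= ord(ch) <= R2L_LAST)
    non_r2l = sum(1 for ch in line if ord(ch) <= ASCII_LAST and not ch.isspace())
    return r2l, non_r2l


def extract_main_content(html, min_ratio=1.0):
    """Keep the text of lines whose R2L count is at least min_ratio times the ASCII count.

    Assumption: one line is one candidate region, and min_ratio is an
    illustrative threshold, not a value taken from the paper.
    """
    kept = []
    for line in html.splitlines():
        r2l, non_r2l = count_chars(line)
        if r2l > 0 and r2l >= min_ratio * non_r2l:
            # Drop ASCII markup; keep non-ASCII characters and plain spaces.
            text = "".join(ch for ch in line if ord(ch) > ASCII_LAST or ch == " ")
            kept.append(" ".join(text.split()))
    return kept
```

With this sketch, a navigation line such as `<a href="/index.html">…</a>` wrapping a single R2L word is dominated by ASCII markup and is dropped, while a paragraph line containing mostly R2L text is kept with its tags stripped. No DOM parsing is involved, which mirrors the tag-independence the abstract emphasizes.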

CC BY-NC-ND 4.0

Paper citation in several formats:

Mohammadzadeh, H., Schweiggert, F. and Nakhaeizadeh, G. (2011). USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES. In Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT; ISBN 978-989-8425-76-8; ISSN 2184-2833, SciTePress, pages 243-249. DOI: 10.5220/0003508502430249

@conference{icsoft11,
author={Hadi Mohammadzadeh and Franz Schweiggert and Gholamreza Nakhaeizadeh},
title={USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES},
booktitle={Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT},
year={2011},
pages={243-249},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003508502430249},
isbn={978-989-8425-76-8},
issn={2184-2833},
}

TY - CONF
JO - Proceedings of the 6th International Conference on Software and Database Technologies - Volume 1: ICSOFT
TI - USING UTF-8 TO EXTRACT MAIN CONTENT OF RIGHT TO LEFT LANGUAGE WEB PAGES
SN - 978-989-8425-76-8
IS - 2184-2833
AU - Mohammadzadeh, H.
AU - Schweiggert, F.
AU - Nakhaeizadeh, G.
PY - 2011
SP - 243
EP - 249
DO - 10.5220/0003508502430249
PB - SciTePress