Paper: A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages

Authors: Houriye Esfahanian 1 ; Abdolreza Nazemi 2 and Andreas Geyer-Schulz 2

Affiliations: 1 Non-Governmental Non-Profit College, Refah, Tehran, Iran ; 2 Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

Keyword(s): Main Content Extraction, Evaluation Methods, Boilerplate Detection, Right to Left Languages.

Abstract: As the volume of information published on the Web grows daily, extracting a web page’s main content has become an important problem. Since 2010, content in right-to-left languages such as Arabic and Persian has also been increasing alongside English-language content. In this paper, we compare three well-known main content extraction algorithms published in the last decade, Boilerpipe, DANAg, and Web-AM, to find the best algorithm in terms of evaluation measures and performance. The ArticleExtractor algorithm of the Boilerpipe approach was the most accurate, with the highest average F1 score of 0.951. In contrast, DANAg showed the best runtime performance, processing more than 21 megabytes per second. Depending on whether a main content extraction project prioritizes accuracy or speed, either Boilerpipe or DANAg can be used.

CC BY-NC-ND 4.0

Paper citation in several formats:
Esfahanian, H.; Nazemi, A. and Geyer-Schulz, A. (2023). A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages. In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR; ISBN 978-989-758-671-2; ISSN 2184-3228, SciTePress, pages 222-229. DOI: 10.5220/0012162000003598

@conference{kdir23,
author={Houriye Esfahanian and Abdolreza Nazemi and Andreas Geyer{-}Schulz},
title={A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages},
booktitle={Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR},
year={2023},
pages={222-229},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012162000003598},
isbn={978-989-758-671-2},
issn={2184-3228},
}

TY - CONF
JO - Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR
TI - A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages
SN - 978-989-758-671-2
IS - 2184-3228
AU - Esfahanian, H.
AU - Nazemi, A.
AU - Geyer-Schulz, A.
PY - 2023
SP - 222
EP - 229
DO - 10.5220/0012162000003598
PB - SciTePress