qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations

Jingwen Wang; Jie Wang


Topics: Data Analytics; Information Extraction; Interactive and Online Data Mining; Web Mining

In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: IC3K, pages 364-371, 2015, Lisbon, Portugal

Authors: Jingwen Wang and Jie Wang

Affiliation: University of Massachusetts, United States

Keyword(s): Article Extraction, Text Automation, Density, Similarity.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Business Analytics; Data Analytics; Data Engineering; Information Extraction; Interactive and Online Data Mining; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Soft Computing; Symbolic Systems; Web Mining

Abstract: We present a new method called qRead that achieves real-time content extraction from web pages with high accuracy. Earlier approaches to content extraction include empirical filtering rules, Document Object Model (DOM) trees, and machine learning models. These methods, while having met with some success, may not meet the requirements of real-time extraction with high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and using machine learning models could make things unnecessarily complicated. Unlike previous approaches, qRead uses segment densities and similarities to identify the main content. In particular, qRead first filters out obvious junk content, eliminates HTML tags, and partitions the remaining text into natural segments. It then identifies the main content by combining the highest ratio of the number of words to the number of lines in a segment with the similarity between the segment and the title. Extensive experiments show that qRead achieves 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, a substantial improvement in accuracy over previous approaches that also meets the real-time extraction requirement.
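The pipeline outlined in the abstract (filter junk, strip tags, partition into segments, score by word density and title similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the junk filters, the partition rule, the similarity measure, or how the two signals are combined, so everything below is an assumption made for illustration: the tag lists, the word-overlap similarity, the score density * (1 + similarity), and all function names. The tokenizer relies on word boundaries and would need a word segmenter for Chinese text.

```python
import re
from html import unescape

# All patterns and tag lists below are illustrative assumptions, not taken from the paper.
TITLE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
JUNK = re.compile(r"<(script|style|noscript|head)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
SEGMENT_TAGS = re.compile(
    r"</?(?:div|section|article|table|ul|ol|nav|header|footer|aside)\b[^>]*>", re.IGNORECASE)
LINE_TAGS = re.compile(r"</?(?:p|br|li|h[1-6]|tr)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")


def segments_from_html(html):
    """Filter junk, strip tags, and return segments as lists of non-empty lines."""
    html = COMMENT.sub(" ", JUNK.sub(" ", html))
    html = re.sub(r"\s+", " ", html)       # source line breaks carry no meaning
    html = SEGMENT_TAGS.sub("\f", html)    # container tags mark segment boundaries
    html = LINE_TAGS.sub("\n", html)       # paragraph-like tags mark line boundaries
    text = unescape(ANY_TAG.sub("", html))
    segments = []
    for chunk in text.split("\f"):
        lines = [ln.strip() for ln in chunk.split("\n") if ln.strip()]
        if lines:
            segments.append(lines)
    return segments


def words(text):
    return re.findall(r"\w+", text.lower())


def density(lines):
    """Ratio of the number of words to the number of lines in a segment."""
    return sum(len(words(ln)) for ln in lines) / len(lines)


def title_similarity(lines, title):
    """Fraction of title words found in the segment (an assumed, simple measure)."""
    title_words = set(words(title))
    if not title_words:
        return 0.0
    return len(title_words & set(words(" ".join(lines)))) / len(title_words)


def extract_article(html):
    """Return the text of the segment with the best combined score."""
    match = TITLE.search(html)
    title = unescape(match.group(1)) if match else ""
    segments = segments_from_html(html)
    if not segments:
        return ""
    # Assumed combination: density boosted by similarity to the title.
    best = max(segments, key=lambda s: density(s) * (1.0 + title_similarity(s, title)))
    return "\n".join(best)


SAMPLE = """<html><head><title>Fast article extraction</title>
<script>var x = "<div>ignored</div>";</script></head>
<body>
<div id="nav"><ul><li>Home</li><li>About</li><li>Contact</li></ul></div>
<div id="main"><p>Fast article extraction keeps the long paragraphs of a page.</p>
<p>Navigation menus have many short lines, so their word density is low.</p></div>
<div id="footer"><p>Copyright</p><p>Terms</p></div>
</body></html>"""

if __name__ == "__main__":
    print(extract_article(SAMPLE))
```

On the sample page, the navigation and footer segments average one word per line, while the main segment averages eleven and also shares every title word, so the main segment is selected. Working on flat text with regular expressions avoids building a DOM tree, which is in the spirit of the speed argument made in the abstract, although a regex-based tag filter like this one will mishandle malformed or unusual markup.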

CC BY-NC-ND 4.0


Paper citation in several formats:

Wang, J. and Wang, J. (2015). qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - KDIR; ISBN 978-989-758-158-8; ISSN 2184-3228, SciTePress, pages 364-371. DOI: 10.5220/0005613603640371

@conference{kdir15,
author={Jingwen Wang and Jie Wang},
title={qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - KDIR},
year={2015},
pages={364-371},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005613603640371},
isbn={978-989-758-158-8},
issn={2184-3228},
}

TY - CONF

JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2015) - KDIR
TI - qRead: A Fast and Accurate Article Extraction Method from Web Pages using Partition Features Optimizations
SN - 978-989-758-158-8
IS - 2184-3228
AU - Wang, J.
AU - Wang, J.
PY - 2015
SP - 364
EP - 371
DO - 10.5220/0005613603640371
PB - SciTePress