Authors:
Yuxin Wang and Keizo Oyama
Affiliation:
National Institute of Informatics, Japan
Keyword(s):
Surrounding page group, three-way classification, recall and precision, quality assurance.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Data Engineering; Digital Libraries; Knowledge Discovery and Information Retrieval; Knowledge Management and Information Sharing; Knowledge-Based Systems; Ontologies and the Semantic Web; Soft Computing; Symbolic Systems; Web Information Systems and Technologies; Web Interfaces and Applications; Web Mining
Abstract:
We propose a web page classification method that takes page group structure into account in order to
build a high-quality homepage collection. We use a support vector machine (SVM) with textual features
obtained from each page and its surrounding pages. The surrounding pages are grouped by connection type
(in-link, out-link, or directory entry) and relative URL hierarchy (same, upper, or lower), and an
independent feature subset is generated from each group. The feature subsets are then concatenated to form
the feature set of a classifier. Experimental results on the ResJ-01 data set, manually created by the
authors, and on the WebKB data set show that the proposed features are effective compared with a baseline
and some prior work. By tuning the classifiers, we then build a three-way classifier that combines a
recall-assured classifier and a precision-assured classifier to accurately select the pages that need
manual assessment in order to assure the required quality. The three-way classifier is also shown to be
effective in reducing the amount of manual assessment.
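
The two ideas in the abstract, per-group feature subsets and the three-way decision, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (concat_features, three_way), the "self" prefix, the feature-key format, and the three output labels are hypothetical, chosen only for illustration. The sketch assumes that token counts stand in for the textual features, that each surrounding page already carries its connection type and relative URL level, and that each of the two tuned classifiers has been reduced to an accept/reject decision.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Group names follow the abstract; the exact string spellings are illustrative.
CONNECTIONS = ("in-link", "out-link", "directory")
LEVELS = ("same", "upper", "lower")

# One surrounding page: (connection type, relative URL level, its tokens).
Surrounding = Tuple[str, str, List[str]]


def concat_features(own_tokens: Iterable[str],
                    surrounding: Iterable[Surrounding]) -> Counter:
    """Build one sparse feature set from independent per-group subsets.

    Prefixing every token with its group keeps the subsets disjoint, so the
    same word seen on the page itself and on an upper in-link page yields
    two distinct features (a sparse form of concatenation).
    """
    features = Counter(f"self:{tok}" for tok in own_tokens)
    for connection, level, tokens in surrounding:
        if connection not in CONNECTIONS or level not in LEVELS:
            raise ValueError(f"unknown group: {connection}/{level}")
        features.update(f"{connection}/{level}:{tok}" for tok in tokens)
    return features


def three_way(recall_assured_accepts: bool,
              precision_assured_accepts: bool) -> str:
    """Combine two tuned binary decisions into three classes.

    Assumption: a page accepted by the precision-assured classifier is kept
    without review, a page rejected by the recall-assured classifier is
    dropped without review, and everything else goes to manual assessment.
    """
    if precision_assured_accepts:
        return "assured positive"
    if not recall_assured_accepts:
        return "assured negative"
    return "uncertain"
```

Under these assumptions, only the pages labeled "uncertain" would be passed to a human assessor, which is how the combination could reduce manual work while the two tuned classifiers bound the recall and precision of the final collection.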