Paper: MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications

Authors: Maria F. De La Torre; Carlos A. Aguirre; BreAnn M. Anshutz; William H. Hsu

Affiliation: Department of Computer Science, Kansas State University, 2184 Engineering Hall, Manhattan, KS, U.S.A.

ISBN: 978-989-758-330-8

Keyword(s): Structured Information Extraction, Document Analysis, Text Analytics, Metadata, Classification, Information.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems

Abstract: This paper addresses the task of extracting free-text sections from scientific PDF documents, and specifically the problem of formatting disparity among different publications, by analysing their metadata. To support the extraction of procedural knowledge in the form of recipes from papers, with nanomaterial synthesis as the application domain, we present the Metadata-Analytic Text Extractor and Section Classifier (MATESC), a heuristic rule-based pattern analysis system for text extraction and section classification from scientific literature. MATESC extracts text spans and uses metadata features such as spatial layout location, font type, and font size to create grouped blocks of text and classify them into groups and subgroups based on rules that characterize specific paper sections. The main purpose of our tool is to facilitate information and semantic knowledge extraction across different domain topics and journal formats. We measure the accuracy of MATESC using string matching algorithms to compute alignment costs between each section extracted by our tool and the corresponding manually extracted section. To test its transferability across domains, we measure its accuracy both on papers related to those used to determine our rule-based methodology and on random papers crawled from the web. In the future, we will use natural language processing to improve paragraph grouping and classification.
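
The pipeline outlined in the abstract (group text spans by layout metadata, classify the blocks with rules, then score each extracted section against a manual reference) can be illustrated with a minimal sketch. This is not the MATESC implementation. The `Span` fields, the `max_gap` threshold, the single heading rule, and the use of Levenshtein distance as the alignment cost are all assumptions made for illustration; the abstract does not say which rules or which string matching algorithm the authors use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """One text span plus layout metadata a PDF parser might report (hypothetical schema)."""
    text: str
    font: str
    size: float
    y: float  # vertical position of the span, increasing down the page


def group_spans(spans: List[Span], max_gap: float = 14.0) -> List[List[Span]]:
    """Merge consecutive spans that share font and size and sit close together.

    max_gap is an assumed threshold, not a value taken from the paper.
    """
    blocks: List[List[Span]] = []
    for span in spans:
        prev = blocks[-1][-1] if blocks else None
        same_style = prev is not None and (prev.font, prev.size) == (span.font, span.size)
        if same_style and 0 <= span.y - prev.y <= max_gap:
            blocks[-1].append(span)
        else:
            blocks.append([span])
    return blocks


def classify_block(block: List[Span], body_size: float) -> str:
    """Toy rule (assumed): short text set larger than the body font is a heading."""
    text = " ".join(s.text for s in block)
    if block[0].size > body_size and len(text.split()) <= 8:
        return "heading"
    return "body"


def alignment_cost(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute, or match at cost 0
        prev = curr
    return prev[-1]


def similarity(extracted: str, reference: str) -> float:
    """Normalize the alignment cost to [0, 1], where 1.0 means identical strings."""
    longest = max(len(extracted), len(reference))
    return 1.0 if longest == 0 else 1.0 - alignment_cost(extracted, reference) / longest


if __name__ == "__main__":
    spans = [
        Span("2 Methods", "Times-Bold", 12.0, 100.0),
        Span("Gold nanorods were", "Times", 10.0, 118.0),
        Span("grown in solution.", "Times", 10.0, 130.0),
    ]
    for block in group_spans(spans):
        print(classify_block(block, body_size=10.0), [s.text for s in block])
    print(similarity("grown in solution.", "grown in solution"))
```

In this sketch a font change always starts a new block, so a heading never merges into the paragraph that follows it. A real extractor would need more rules (columns, captions, footnotes) than the single heading test shown here.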

CC BY-NC-ND 4.0

Paper citation in several formats:
Torre, M.; Aguirre, C.; Anshutz, B. and Hsu, W. (2018). MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications. In Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, ISBN 978-989-758-330-8, pages 261-267. DOI: 10.5220/0006937702610267

@conference{kdir18,
author={Maria F. De La Torre and Carlos A. Aguirre and BreAnn M. Anshutz and William H. Hsu},
title={MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications},
booktitle={Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR},
year={2018},
pages={261-267},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006937702610267},
isbn={978-989-758-330-8},
}

TY  - CONF
JO  - Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR
TI  - MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications
SN  - 978-989-758-330-8
AU  - Torre, M.
AU  - Aguirre, C.
AU  - Anshutz, B.
AU  - Hsu, W.
PY  - 2018
SP  - 261
EP  - 267
DO  - 10.5220/0006937702610267
ER  - 