MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications

Maria F. De La Torre, Carlos A. Aguirre, BreAnn M. Anshutz, William H. Hsu

2018

Abstract

This paper addresses the task of extracting free-text sections from scientific PDF documents, and specifically the problem of formatting disparity among publications, by analysing document metadata. To support the extraction of procedural knowledge in the form of recipes from papers, in the application domain of nanomaterial synthesis, we present the Metadata-Analytic Text Extractor and Section Classifier (MATESC), a heuristic rule-based pattern analysis system for text extraction and section classification from scientific literature. MATESC extracts text spans and uses metadata features such as spatial layout location, font type, and font size to group the spans into blocks of text, then classifies the blocks into groups and subgroups using rules that characterize specific paper sections. The main purpose of the tool is to facilitate information and semantic knowledge extraction across different domain topics and journal formats. We measure the accuracy of MATESC using string matching algorithms that compute alignment costs between each section extracted by the tool and the corresponding manually extracted section. To test its transferability across domains, we measure its accuracy both on papers related to those used to derive our rules and on random papers crawled from the web. In future work, we will use natural language processing to improve paragraph grouping and classification.
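The grouping, classification, and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Span` and `Block` types, the `max_gap` threshold, the heading rule, and the use of Levenshtein distance as the alignment cost are all assumptions chosen for illustration, and the sample spans are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical text span with the metadata features the abstract names."""
    text: str
    font: str    # font type, e.g. "Times-Bold"
    size: float  # font size in points
    y: float     # vertical position on the page, increasing downward


@dataclass
class Block:
    font: str
    size: float
    last_y: float
    texts: List[str]


def group_spans(spans: List[Span], max_gap: float = 14.0) -> List[Block]:
    """Merge consecutive spans that share font and size and sit close together."""
    blocks: List[Block] = []
    for s in spans:
        b = blocks[-1] if blocks else None
        same_style = b is not None and b.font == s.font and b.size == s.size
        if same_style and 0 <= s.y - b.last_y <= max_gap:
            b.texts.append(s.text)
            b.last_y = s.y
        else:
            blocks.append(Block(s.font, s.size, s.y, [s.text]))
    return blocks


def classify(block: Block, body_size: float) -> str:
    """Assumed rule: larger-than-body text, or short bold text, is a heading."""
    words = " ".join(block.texts).split()
    if block.size > body_size:
        return "heading"
    if block.font.endswith("Bold") and len(words) <= 8:
        return "heading"
    return "body"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as one possible alignment cost."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Invented sample input: two headings around a two-line paragraph.
spans = [
    Span("Abstract", "Times-Bold", 12.0, 100.0),
    Span("This paper addresses", "Times-Roman", 10.0, 120.0),
    Span("the task of extraction.", "Times-Roman", 10.0, 132.0),
    Span("1 Introduction", "Times-Bold", 12.0, 160.0),
]
blocks = group_spans(spans)
labels = [classify(b, body_size=10.0) for b in blocks]

# Compare an extracted section with a manually extracted reference.
extracted = " ".join(blocks[1].texts)
reference = "This paper addresses the task of extraction."
cost = edit_distance(extracted, reference)
```

In this sketch a cost of zero means the extracted section matches the manual reference exactly; a real evaluation would likely normalize the cost by section length so that long and short sections are comparable.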


Paper Citation


in Harvard Style

De La Torre M., Aguirre C., Anshutz B. and Hsu W. (2018). MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications. In Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018) - Volume 1: KDIR; ISBN 978-989-758-330-8, SciTePress, pages 261-267. DOI: 10.5220/0006937702610267


in Bibtex Style

@conference{kdir18,
author={Maria F. De La Torre and Carlos A. Aguirre and BreAnn M. Anshutz and William H. Hsu},
title={MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications},
booktitle={Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018) - Volume 1: KDIR},
year={2018},
pages={261-267},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006937702610267},
isbn={978-989-758-330-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018) - Volume 1: KDIR
TI - MATESC: Metadata-Analytic Text Extractor and Section Classifier for Scientific Publications
SN - 978-989-758-330-8
AU - De La Torre M.
AU - Aguirre C.
AU - Anshutz B.
AU - Hsu W.
PY - 2018
SP - 261
EP - 267
DO - 10.5220/0006937702610267
PB - SciTePress