
Authors: Karina Wiechork 1, 2 and Andrea Schwertner Charão 1

Affiliations: 1 Department of Languages and Computer Systems, Federal University of Santa Maria, Santa Maria, Brazil; 2 Information Technology Coordination, Federal Institute of Education, Science and Technology Farroupilha, Frederico Westphalen, Brazil

Keyword(s): Dataset Collection, Ground Truth, Performance Evaluation, PDF Extraction Tools.

Abstract: The massive production of documents in Portable Document Format (PDF) has motivated research on the automated extraction of data contained in these files. This work focuses mainly on extraction from natively digital PDF documents made available in large repositories of educational exams. To this end, the educational tests applied at Enade were collected automatically using scripts developed with Scrapy. The files used for the evaluation comprise 343 tests, with 11,196 objective and discursive questions, and 396 answers, with 14,475 alternatives extracted from the objective questions. The ground truth for the tests was built with the Aletheia tool. The extractions used existing tools for extracting data from PDF files: Excalibur and Tabula for tabular data (the answers), CyberPDF and PDFMiner for textual content (the questions), and Aletheia and ExamClipper for regions of interest (the cutouts of the questions). The results point to some limitations related to the diversity of layouts across the years in which the exam was applied. The extracted data provide useful information for a wide variety of fields, including academic research and support for students and teachers.

CC BY-NC-ND 4.0

Paper citation in several formats:
Wiechork, K. and Charão, A. (2021). Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests. In Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS; ISBN 978-989-758-509-8; ISSN 2184-4992, SciTePress, pages 359-366. DOI: 10.5220/0010524503590366

@conference{iceis21,
author={Karina Wiechork and Andrea Schwertner Charão},
title={Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests},
booktitle={Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2021},
pages={359-366},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010524503590366},
isbn={978-989-758-509-8},
issn={2184-4992},
}

TY - CONF

JO - Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests
SN - 978-989-758-509-8
IS - 2184-4992
AU - Wiechork, K.
AU - Charão, A.
PY - 2021
SP - 359
EP - 366
DO - 10.5220/0010524503590366
PB - SciTePress