Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests

Karina Wiechork, Andrea Charão

2021

Abstract

The massive production of documents in Portable Document Format (PDF) has motivated research on the automated extraction of the data these files contain. This work focuses mainly on extraction from natively digital PDF documents made available in large repositories of educational exams. To this end, the educational tests applied in Enade were used, collected automatically with scripts developed with Scrapy. The files used for the evaluation comprise 343 tests with 11,196 objective and discursive questions, and 396 answers with 14,475 alternatives extracted from the objective questions. The ground truth for the tests was built with the Aletheia tool. The extractions relied on existing tools that extract data from PDF files: tabular data extraction with Excalibur and Tabula for the answers, textual content extraction with CyberPDF and PDFMiner for the questions, and region-of-interest extraction with Aletheia and ExamClipper for the cutouts of the questions. The results point to some limitations related to the diversity of layouts across the years of application. The extracted data provide useful information for a wide variety of fields, including academic research and support for students and teachers.
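
As a rough illustration of the textual extraction step mentioned in the abstract, the sketch below splits text extracted from a natively digital PDF into one chunk per question. It is not the authors' implementation. The heading pattern (`QUESTÃO` followed by a number), the helper names `split_questions` and `extract_questions_from_pdf`, and the use of `extract_text` from the pdfminer.six package are all assumptions made for illustration. Real Enade layouts vary by year, which is the limitation the abstract points out.

```python
import re
from typing import List

# Assumed heading pattern: a line such as "QUESTÃO 01" or "QUESTÃO DISCURSIVA 1".
# Actual layouts differ between years, so this pattern is illustrative only.
QUESTION_HEADING = re.compile(
    r"^QUEST[ÃA]O\s+(?:DISCURSIVA\s+)?\d+", re.MULTILINE
)


def split_questions(text: str) -> List[str]:
    """Split extracted exam text into one chunk per question heading."""
    starts = [m.start() for m in QUESTION_HEADING.finditer(text)]
    # Each chunk runs from one heading to the next, or to the end of the text.
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


def extract_questions_from_pdf(path: str) -> List[str]:
    """Extract text from a PDF (hypothetical helper) and split it into questions."""
    # Imported lazily so split_questions works without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    return split_questions(extract_text(path))
```

Text that appears before the first heading, such as a cover page, is discarded by this sketch. A pipeline that needs that content would have to handle it separately.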


Paper Citation


in Harvard Style

Wiechork K. and Charão A. (2021). Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests. In Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-509-8, pages 359-366. DOI: 10.5220/0010524503590366


in Bibtex Style

@conference{iceis21,
author={Karina Wiechork and Andrea Charão},
title={Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests},
booktitle={Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2021},
pages={359-366},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010524503590366},
isbn={978-989-758-509-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 23rd International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Automated Data Extraction from PDF Documents: Application to Large Sets of Educational Tests
SN - 978-989-758-509-8
AU - Wiechork K.
AU - Charão A.
PY - 2021
SP - 359
EP - 366
DO - 10.5220/0010524503590366